Moroccan Data-Driven Spelling Normalization Using Character Neural Embedding

Ridouane Tachicart and Karim Bouzoubaa

Mohammadia School of Engineers, Mohammed V University in Rabat, Avenue Ibn Sina B.P 765, Agdal, Rabat 10090, Morocco





With the growth of web use in Morocco, the Internet has become an important source of information. On social media in particular, Moroccans communicate in several languages, leaving behind unstructured user-generated text that presents many opportunities for Natural Language Processing. Among the languages found in this data, Moroccan Arabic stands out with substantial content and distinctive features. In this paper, we investigate online text generated by Moroccan users on social media, with an emphasis on Moroccan Arabic. For this purpose, we follow several steps, using tools such as a language identification system, to conduct an in-depth study of this data. The most interesting findings are the prevalence of code-switching, multi-script writing and short texts in Moroccan user-generated text. Moreover, we use the investigated data to build a new Moroccan language resource: a lexicon of Moroccan word orthographic variants, built following an unsupervised approach based on character neural embeddings. This lexicon can be useful for several NLP tasks such as spelling normalization.

Keywords: Moroccan Arabic; lexicon; NLP; word embedding; neural networks; normalization.

1.   Introduction

Social media are tools that facilitate creating virtual communities and sharing information interactively through electronic communication. Typically, basic social media services are free and available via web-based technologies. The creation of user-generated content is the most valuable feature [1] of social media services and is considered their lifeblood. It consists mainly of sharing text, images or videos, with the possibility of adding related comments. User-generated content is the main driver of the rise of "social media analytics" [2], a new activity that consists of collecting data from social media and analyzing it in order to make, for instance, business decisions.

Over the past few years, social media has seen widespread use in Morocco and has become one of the major means of communication and content production in virtual communities. As an illustration, 71% of Moroccan Internet users are active on social media[*]. Users today prefer these media to other web alternatives since they can easily and instantly interact with others [3]. They express themselves using Moroccan Arabic (MA), Modern Standard Arabic (MSA) and, alternatively, European languages such as French. As a result, user-generated text (UGT) constitutes a new opportunity for understanding the Moroccan Arabic used on social media platforms.

Contrary to many other countries where Twitter is the most used platform, Facebook and YouTube are the most popular social media platforms in Morocco. On the one hand, 59% of social media page views are Facebook pages (with more than 15 million active users in Morocco), making Facebook the most popular social media service in the country. On the other hand, YouTube ranks second with 35% of social media page views according to the latest statistics[†],[‡]. Hence, with such a large number of users, a huge amount of user-generated text is produced continuously through social media in Morocco. However, there is currently no clear picture of its content and structure, and this situation constitutes one of the main challenges for its processing.

Processing MA user-generated text is an interesting area given that it reflects the language spoken by Moroccan people in their everyday life. However, it poses several challenges due to the lack of MA NLP tools. Hence, before considering this data as a resource or processing it with NLP tools, it is useful to study user-generated text by analyzing its content. Such an analysis helps to build a clear idea of how MA user-generated text is written, to identify its features and to better understand its rules. To the best of our knowledge, no related work has yet been conducted.

With the aim of easing the later processing of MA UGT, our project is split into two phases. The first, previously achieved [4], seeks to collect a substantial amount of UGT and then investigate its general features, namely: identifying and analyzing the scripts used, identifying the languages used and evaluating the length of the UGT (number of words used). As an extension of that work, the second phase leverages a portion of the investigated UGT to train a neural model. Combined with some rule-based refinements, this model enables the building of a lexicon of Moroccan word orthographic variants. This resource can facilitate the building of a Moroccan spelling normalization system. Given the lack of MA resources, we follow an unsupervised approach that does not require annotated data. To reach the first goal, we start by collecting Moroccan UGT and then perform tasks such as cleaning, filtering and language identification, before analyzing the pre-processed data. To reach the second goal, we first generate a vocabulary of Moroccan words and then use it, together with the collected data, in a deep learning process.

The remainder of this paper is organized as follows: Section 2 gives an overview of Moroccan Arabic. Section 3 presents related works in the field of processing Arabic dialect (AD) user-generated text. In Section 4, we present the steps followed to prepare the Moroccan language resources, which involve building a Moroccan reference vocabulary and the UGT corpus. In Section 5, we detail the deep learning process performed on the collected data to generate the Moroccan orthographic variant lexicon. Section 6 provides a discussion of the results; finally, we conclude the paper in Section 7 with some observations.

2.   Moroccan Arabic Overview

The Moroccan constitution recognizes MSA and Tamazight as the two official Moroccan languages. However, Moroccan Arabic, which is considered a variant of MSA, is the most used language in Morocco according to the official census performed in 2014[§]: 90% of Moroccan people use MA.

Historically, MA arose from the interaction between Arabic and Tamazight during the period of the spread of Islam, and remained a mixture of these languages until the beginning of the 20th century. After the establishment of the French and Spanish protectorates, the MA vocabulary integrated several words from these languages, as shown in Table 1. For example, the word كاسكيطة (cap) originates from the French word casquette. Nevertheless, MA remains strongly influenced by Arabic, especially at the lexical level: a previous study [5] showed that 81% of the Moroccan vocabulary originates from Arabic.

Table 1.     MA words origins.

3.   Related Works

Over the last few years, processing Arabic language increasingly gained attention. Furthermore, while many works have been proposed to deal with standard and normalized Arabic text, we noticed that the NLP community started recently to deal with user-generated text due to the popularity of social media and the Arabic digital expansion. In this section, we highlight the most relevant works dealing with Arabic dialectal UGT.

Sadat et al. [6] presented a framework for detecting Arabic dialects in social media using probabilistic models. To train their system, they collected data from blogs and forums whose users are located in different Arabic countries. They considered 18 classes representing 18 Arabic dialects and manually annotated each sentence of the collected corpus with its corresponding dialect. The system was trained using character n-gram Markov language models and Naïve Bayes classifiers. Evaluation results showed that the Naïve Bayes classifier outperforms the Markov language model classifier with an accuracy of 98%. Moreover, comparing n-gram orders, the Naïve Bayes classifier based on a character bi-gram model performs better than the uni-gram and tri-gram models.

Voss et al. [7] built a classifier that distinguishes Moroccan Arabic tweets written in Latin script from French and English tweets, using an unsupervised learning approach and focusing on the token level. The classifier was trained on annotated data consisting of 40K tweets collected from Twitter using several keywords. Evaluation on a test set of 800 tweets showed that it works significantly better on English and French than on Moroccan, with accuracies of 95.5%, 95% and 76% respectively.

Albogamy and Ramsay [8] built a light stemmer for Arabic tweets. They avoided using lexicons given that the Arabic language used in Twitter may include dialectal and new words. Their approach relies on defining all possible affixes and then writing rules in order to compose a given word. Using a sample of 390 Arabic tweets, evaluation results gave an accuracy of 74%.

In order to improve the quality of Arabic UGT machine translation, Afli et al. [9] proposed to integrate an Automatic Error Correction module as a pre-processing step prior to the translation phase. To train and test the proposed module, the authors used a portion of the QALB corpus [10] containing 1.3M words, including dialectal words. The training and test data were manually annotated and corrected with respect to MSA rules. UGT sentences and their MSA corrections were then aligned at the sentence level and tokenized using the MADA morphological analyzer [11]. The authors compared two systems: the first trained on data without tokenization, the second trained using MADA tokenization. Evaluation results on test data containing 66K words showed that the second system reaches an accuracy of 68.68%, while the first reaches 63.18%. Hence, the authors concluded that including tokenized words in the training data is crucial for increasing error detection.

Abidi and Smaili [12] performed an empirical study of the Algerian dialect used on YouTube. They started by collecting a corpus of 17M words from YouTube comments. The corpus contains different languages, and a substantial amount of the text is written in Latin script (LS) (about 47%). They also noticed that 82% of the collected sentences include code-switching. Furthermore, the authors reported that grammar was not respected in either Arabic or Latin script; for example, a word may be written in several different ways. For this reason, they built a lexicon that contains, for each word, its correlated words, in order to deal with the problem of spelling inconsistency.

The works surveyed above show that UGT is the subject of several Arabic dialect research efforts in different NLP fields such as empirical studies, resource building, text mining and machine translation. To the best of our knowledge, Moroccan UGT has not yet been studied in Arabic NLP. Our study takes a first step towards understanding, analyzing and extracting useful knowledge from Moroccan UGT.

4.   Resources Preparation

As stated in Section 1, the second goal of the present work is to build a corpus containing the largest possible number of Moroccan word orthographic variants (OVs), using deep learning techniques. This resource can be useful for training a Moroccan spelling normalization system. To achieve this goal, two elements must be prepared before the experiments. The first contains Moroccan words written in a standard, unique spelling, while the second contains a substantial amount of Moroccan user-generated text without normalization. The following sections describe the tasks performed to prepare these resources.

4.1.   Moroccan Reference Vocabulary Generation

Moroccan Arabic has no writing standard, which is why Moroccan users apply different rules when writing their text. This situation poses several problems for processing Moroccan UGT. To remedy this, it is necessary to define a set of orthographic rules and apply them to the resources that NLP systems use. One of the important available Moroccan resources is the Moroccan Dialect Electronic Dictionary (MDED) [13]. It is normalized following unique writing rules and is considered the most comprehensive electronic lexicon for MA, including almost 12,000 entries. In addition, one major MDED feature is the annotation of its entries with useful metadata such as POS, origin and root, as shown in Table 2.

Table 2.     Sample of MDED lexicon entries

In order to generate a lexicon of Moroccan words, which we henceforth consider as a Moroccan Reference Vocabulary (MRV) for building future NLP systems, we use MDED in addition to a list of morphemes (affixes and clitics). Our vocabulary generator (Figure 1) is rule-based, following concatenative morphology [14], and is implemented as an algorithm that combines MDED, the morpheme list and two tables containing the rules and constraints governing the concatenation of these morphemes.

The first table (the compatibility table) is built to avoid generating impossible words such as الخدمتنا (the our work). For this purpose, we prepared and linguistically checked morpheme concatenation rules that indicate which morphemes can be concatenated with which others. In the second table (the orthographic adjustments table), we store the orthographic adjustments that must be performed to correct a generated word. These adjustments mainly concern morpheme boundaries in the newly generated word. For example, even when allowed by the compatibility table, concatenating the noun lemma المناسبة (event) with the plural morpheme ات produces the misspelled word المناسبةات instead of the correct form المناسبات. To avoid such cases, our vocabulary generator detects the morpheme boundaries that cause misspellings using the orthographic adjustments table and produces the correct word form.
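As an illustration of the generation procedure, the following minimal sketch combines lemmas with compatible affixes and applies a boundary adjustment. The table contents, affix classes and function names are hypothetical stand-ins for the MDED-derived resources, not the authors' actual implementation:

```python
# Hypothetical compatibility table: which affix classes a POS accepts.
COMPATIBILITY = {
    "noun": ["def_article", "plural"],
    "verb": ["subject_prefix"],
}

def adjust(stem: str, suffix: str) -> str:
    """Orthographic adjustment at the morpheme boundary, e.g. a noun
    ending in taa marbuta (ة) drops it before the plural suffix ات."""
    if suffix == "ات" and stem.endswith("ة"):
        stem = stem[:-1]
    return stem + suffix

def generate(lexicon, affixes):
    """Concatenate each lemma with every compatible affix."""
    vocabulary = set()
    for lemma, pos in lexicon:
        vocabulary.add(lemma)
        for affix_class in COMPATIBILITY.get(pos, []):
            for affix in affixes.get(affix_class, []):
                vocabulary.add(adjust(lemma, affix))
    return vocabulary

vocab = generate([("المناسبة", "noun")], {"plural": ["ات"]})
# المناسبة + ات yields المناسبات after the boundary adjustment
```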

To determine the degree to which the MRV covers Moroccan Arabic on the Internet, we used a test corpus drawn from the collected UGT, containing 1,000 MA sentences with 10,564 words. Each word was manually normalized following the same MRV orthographic rules. The manual effort required to prepare such a corpus prevented us from increasing its size. The goal of the evaluation is to check whether each UGT word in the test corpus is orthographically recognized by the generated MA vocabulary. The obtained results show that 84% of the words are recognized. With such results, the MRV can be considered representative of Moroccan Arabic at the orthographic level.
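The coverage evaluation reduces to a membership count over the test corpus. A minimal sketch (the word lists are illustrative):

```python
def coverage(test_words, vocabulary):
    """Share of UGT test words recognized by the reference vocabulary."""
    recognized = sum(1 for w in test_words if w in vocabulary)
    return recognized / len(test_words)

# Illustrative call: two of three words are in the vocabulary.
rate = coverage(["واش", "مزيان", "xyz"], {"واش", "مزيان"})
```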

Fig. 1. MA vocabulary generation

As shown in Table 3, this task outputs 4,590,000 words representing all possible MA vocabulary. We then eliminate items having the same spelling, such as شافكم, which can refer either to a noun (your boss) or to a verb (he saw you). The final MRV contains 3,976,805 unique words. This elimination speeds up processing and does not affect the comparison of the UGT with the MA vocabulary, since we do not consider grammatical features in this work.

Table 3.   MA Vocabulary.

                              # words
Generated MA vocabulary       4,590,000
Unique words (final MRV)      3,976,805

4.2.   Moroccan User-Generated Text Corpus Building

In this section, we describe the steps followed to build an MA corpus, collected from web sources and generated by Moroccan users. As stated above, we limited the scope of UGT sources to Facebook and YouTube due to their popularity in Morocco, as shown by their extensive use compared to other social media websites[††] (Figure 2): 59% of Moroccan social media traffic comes from Facebook, 34% from YouTube and 7% from other social media (Twitter, Google+, etc.).

4.2.1.   Collecting data

On social media websites, users can generate text using either the post or the comment feature. The post feature allows a user to express an idea, while the comment feature lets many users react to a given post with their own comments. Comments are therefore widely available and tend to be more natural, spontaneous and linguistically rich than posts. For this reason, we decided to focus on collecting comments in order to build an MA corpus that reflects the Moroccan language. We addressed the most popular Moroccan pages by harvesting users' comments related to a set of Facebook posts and YouTube videos using two different tools. First, after identifying the most popular Moroccan Facebook pages (using the Facebook Audience Insights feature, Table 4), we collected their related comments. Since the top 20 popular pages in Facebook Audience Insights cover the majority of topics (Facebook categories), we limited comment scraping to these pages for the period of January 2018. Second, to ensure that the collected YouTube texts mainly concern topics posted by Moroccans (for the same period), we used the YouTube Data API and collected Moroccan video comments by filtering with a list of MA keywords.
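The YouTube side of the collection can be sketched against the public YouTube Data API v3 `commentThreads` endpoint. This is a hedged illustration, not the authors' harvesting code; the API key, video id and page limit are placeholders:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"

def extract_comments(response):
    """Pull top-level comment texts out of a commentThreads API response."""
    return [item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
            for item in response.get("items", [])]

def fetch_comments(video_id, api_key, max_pages=5):
    """Page through the top-level comments of one video."""
    comments, token = [], None
    for _ in range(max_pages):
        params = {"part": "snippet", "videoId": video_id,
                  "maxResults": 100, "key": api_key}
        if token:
            params["pageToken"] = token
        with urlopen(f"{API_URL}?{urlencode(params)}") as resp:
            data = json.load(resp)
        comments += extract_comments(data)
        token = data.get("nextPageToken")
        if not token:
            break
    return comments
```

Candidate videos would first be located with the `search` endpoint and the MA keyword list, then their ids fed to `fetch_comments`.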



Fig. 2. Social media use in Morocco

Table 4. Facebook Audience Insights (Page Likes feature): most popular Moroccan pages by active people.

- هنا البطولة
- أنا الخبر
- أنا حامض إذن أنا دوزيم
- RADIOMARS – راديو مارس
- إحاطة – Ihata
- VidYou Maroc
- Andaluspress – شبكة أندلس الإخبارية
- Anfas Press
- محبي الشيخ رضوان بن عبد السلام
- Merveilleuse Bijouterie
- Simo Sedraty
- Al Aoula

In total, the raw text before cleaning and normalization comprised 748,433 comments, of which Facebook comments represent almost 89%. Note that a massive amount of data could be harvested using the above steps by extending the keyword list, the period or the number of pages. Large amounts of data are suitable for several NLP applications, especially machine-learning-based applications, statistical machine translation, etc. However, with respect to our objectives, the collected UGT is sufficient for this kind of study.

4.2.2.   Data cleaning

The challenge we faced after the previous step is noisy data. Indeed, the collected data is a mixture of several languages. MA and MSA are written either in Arabic script or in Romanized letters (Arabizi) [15], while European languages are written only in Roman script. Arabizi is a social media script in which users type letters alongside numbers (such as 3, 7 and 9) to represent Arabic letters that have no equivalent in Roman script; for instance, the letter ق is represented in Arabizi by the number 9. In addition, some comments contain only special characters, emoticons, dates, etc. For this reason, a dedicated process was necessary to convert this data into a useful corpus that can be analyzed and used for NLP purposes. We started by removing punctuation, special characters (*, $, !, etc.) and emoji from both Arabic-script and Arabizi texts. We also removed numbers and diacritics from Arabic-script texts. After this cleaning, the data was reduced from 748,433 to 642,502 comments, given that some comments contained no words. Finally, redundant comments were deleted and only 580,751 unique sentences were retained.
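The cleaning pass described above can be sketched as follows. This is an illustrative re-implementation, assuming that digit removal applies only to Arabic-script text (Arabizi uses digits such as 3, 7 and 9 as letters):

```python
import re
import unicodedata

DIACRITICS = re.compile(r"[\u064B-\u0652]")        # fathatan .. sukun
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")        # Western + Arabic-Indic digits
PUNCT = re.compile(r"[*$!\"#%&'()+,\-./:;<=>?@\[\]^_`{|}~؟،؛]")

def is_emoji(ch):
    # Emoji and pictographs fall in the "Symbol, other" Unicode category.
    return unicodedata.category(ch) == "So"

def clean_comment(text, arabic_script=True):
    """Strip punctuation, special characters and emoji; for Arabic-script
    text also strip digits and diacritics."""
    text = PUNCT.sub(" ", text)
    text = "".join(ch for ch in text if not is_emoji(ch))
    if arabic_script:
        text = DIACRITICS.sub("", text)
        text = DIGITS.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def dedupe(comments):
    """Drop empty and duplicate comments, keeping first occurrences."""
    seen, kept = set(), []
    for c in comments:
        if c and c not in seen:
            seen.add(c)
            kept.append(c)
    return kept
```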

4.2.3.   Classification

Given that the resulting corpus contains several languages written in both Arabic script and Arabizi, we separated sentences written in Arabic script from those written in Arabizi in order to study their content separately. For the Arabizi content, we applied two language identification systems sequentially to determine the language of each sentence. To detect MA comments, we used the Language Identification (LID) system built in [16], extended to Arabizi. We then used the stand-alone language identification tool langid.py [17] to distinguish between French, English and Spanish in the remaining comments. A sample of the results is illustrated in Table 5, where sentences are grouped by language.
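The LID cascade itself is not public. As a self-contained illustration of how character n-gram profiles separate such languages, here is a toy classifier in that spirit (the training samples are illustrative snippets, and this is not the system of [16] or langid.py):

```python
from collections import Counter

def profile(text, n=2):
    """Character n-gram frequency profile of a text."""
    text = f" {text} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

class NgramLID:
    """Toy character n-gram language identifier (illustrative only)."""
    def __init__(self, n=2):
        self.n = n
        self.profiles = {}

    def train(self, lang, samples):
        prof = Counter()
        for s in samples:
            prof += profile(s, self.n)
        self.profiles[lang] = prof

    def classify(self, text):
        probe = profile(text, self.n)
        def overlap(lang):
            prof = self.profiles[lang]
            return sum(min(c, prof[g]) for g, c in probe.items())
        return max(self.profiles, key=overlap)

lid = NgramLID()
lid.train("fr", ["c'est pas vrai", "je suis dangereuse"])
lid.train("ma", ["allah iwaf9ak", "5alih i3abr 3la ch3or dyalo"])
```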

Table 5.   Arabizi sentence classification (English equivalent in parentheses).

- Allah iwaf9ak aiman falmasira dyalak al faniya ohna kantsanaw al jadid dyalak (God help you in your career. We will be waiting for your update)
- 5alih i3abr 3la ch3or dyalo (Let him express his sentiment)
- "9amar" I really Miss you ("9amar" I really miss you)
- non c'est pas vrai (No, that is not true)
- 3andakom je suis dangereuse (Pay attention, I'm dangerous)

For the Arabic-script content, we conducted a manual task with a group of 30 annotators whose goal was to label each comment as MA, MSA or mixed. The annotators are MA/MSA native speakers, and each was asked to read 5,500 comments and classify them accurately. Manual annotation was preferred over an automatic process because, on the one hand, human annotations are almost error-free and, on the other hand, the annotated dataset can be reused in future Moroccan dialect NLP tools.

In Table 6, the results of language and code-switching identification show that MA is heavily used, with more than 74% (33.93 + 40.86) of the collected UGT. Moreover, Arabic-script and Arabizi content represent almost 57% and 43% respectively. We also notice that the use of foreign languages such as French, English and Spanish is very low and does not exceed 2.1% in total.


Table 6.     Language distribution of the UGT


Table 7 presents a set of features characterizing the content of this corpus. The shortest comment contains three words, while the longest contains 901 words. One important observation is that the majority of the collected comments (85% of the UGT) are short, not exceeding 13 words each. Comments with more than 13 words represent only 15% of the collected UGT.

Table 7.   General features of the dialectal Moroccan UGT (Arabic and Latin scripts).

Shortest comment (# words): 3
Largest comment (# words): 901
Weighted average (# words)
Comments with fewer than 13 words: 85%
Comments with more than 100 words
Comments with between 13 and 100 words
Average word length (# characters)

5.   Moroccan orthographic variants lexicon generation

In general, the relation between two words can be described at several levels. At the semantic level, for example, they can be synonyms or antonyms. At the orthographic level, a word can be written in different spelling forms, and we classify a word we consider misspelled as an orthographic variant (OV) of a correctly spelled word. This situation is common when processing user-generated text. The Moroccan UGT is therefore suitable for extracting orthographic variants of each item of the Moroccan reference vocabulary (which we consider correctly spelled).

In this section, we describe our approach to automatically build a lexicon of Moroccan word orthographic variants using these resources. Such a lexicon can be very useful for training spelling normalization systems, whose use is crucial before processing Moroccan user-generated text. In our approach, we consider only the text written in Arabic script. We follow several steps: first, we train a model on the Arabic-script UGT using shallow neural networks; then, we extract for each MRV word a list of orthographically similar candidate words; finally, we apply a filter to refine this list.

5.1.   Word Embedding

Fig 3. Word embedding model

Since word orthographic variants are produced by users, it is necessary to use a UGT corpus large enough to contain these spelling forms. Moreover, instead of rule-based approaches that rely on pattern recognition, probabilistic systems seem to be the most efficient solution here, especially machine learning and deep learning models. In practice, these models perform best when provided with numerical data as input rather than raw words. This is why word embeddings (WE), a popular way to represent text as vectors, are pervasive in most real-world application scenarios and are currently considered one of the most interesting NLP trends. WE provide a low-dimensional vector representation of words given a corpus. A key benefit of word embeddings is that they capture relations between words at different levels. There are three well-known word embedding methods (Figure 3): Latent Semantic Analysis (LSA) [18], Word2Vec [19] and Global Vectors for Word Representation (GloVe) [20]. In this work, we follow the Word2Vec architecture since it provides the best word vector representation for our purpose.

Fig 4. Word embedding architectures

Word2Vec is a shallow neural network architecture consisting of an input layer, a projection layer and an output layer, trained to predict the words near a given word w(t). It provides two model architectures to produce a distributed representation of words: skip-gram and CBOW (Continuous Bag Of Words), as illustrated in Figure 4. The skip-gram model uses the current word to predict the surrounding window of context words, whereas the CBOW model predicts the target word from a window of surrounding context words. Given enough text data, CBOW is well adapted to producing different spelling forms of a given word.
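The difference between the two architectures is easiest to see in the training pairs they consume. A sketch of how CBOW pairs are built from a tokenized sentence (the MA phrase is illustrative):

```python
def cbow_pairs(tokens, window=2):
    """Build (context words, target word) training pairs as CBOW sees them:
    the context predicts the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

pairs = cbow_pairs(["had", "el", "khedma", "zwina"], window=2)
# e.g. the target "khedma" is predicted from ["had", "el", "zwina"]
```

Skip-gram would simply emit the reversed pairs, one (target, context word) pair per context word.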

5.2.   Experimental setup

5.2.1.   The FastText Model

There are several open-source libraries implementing Word2Vec. In our approach, we use the open-source library FastText [21], which learns representations for the character n-grams forming words, making the model very sensitive to orthography. Besides semantic similarities, FastText can also capture orthographic and morphological similarities between words in a large corpus. One of the key features of FastText word representations is the ability to produce vectors for misspelled words or concatenations of words, which is useful for building a list of Moroccan word orthographic variants.

5.2.2.   Data and Model training

To train our model, we use the Moroccan UGT described in Section 4, which is written in Arabic letters and contains 2.1M words. We enhanced this data with texts collected from Moroccan websites and blogs[‡‡] to reach 3.6M words. As was done for the UGT, we used our Moroccan Language Identification (LID) system [16] to ensure that the additional texts are written in Moroccan Arabic. Training is performed by minimizing the following loss function (Lo):

Lo = - Σ_{i=1..T} Σ_{c ∈ C_i} log p(w_c | w_i)

·     T=3.6M is the UGT corpus size;

·     Wi is a given word in the UGT training corpus;

·     C_i is the context of Wi, consisting of a window of +/- 2 (two) words around it. This context is used to calculate the probability that another word occurs within the window, defined as:

p(w_c | w_i) = exp(s(w_i, w_c)) / Σ_{j=1..|V|} exp(s(w_i, w_j))

Given that FastText operates on the internal morphology of words by including character-level sub-word information (sub n-grams), the optimization of Lo relies on a scoring function over these internal structures, computed as the dot product between the n-gram vectors of a UGT word and its context word vector, as follows:

s(w, c) = Σ_{g ∈ G_w} Vg · Uc

V represents the Moroccan UGT vocabulary;

G_w represents all n-grams of a word w, in addition to w itself;

Vg is the vector representation of each n-gram g;

Uc is the output vector associated with the context word c.


As an example, if we consider w = طوموبيل and n-grams of size 3, then G_w = {طوم, ومو, موب, وبي, بيل, طوموبيل}. The main idea of our method is to represent each UGT word (Wv) as an n-dimensional vector obtained by summing the vectors of its sub n-grams. Similarly, the vector of an orthographic variant (Worth) is obtained by summing the vectors of its sub n-grams. Hence, with high likelihood, the vector representation of Wv lies in the neighborhood of the vector representation of its orthographic variant Worth.
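The sub n-gram decomposition in the example above can be sketched directly (following the paper's plain n-grams; FastText proper additionally pads each word with '<' and '>' boundary markers):

```python
def sub_ngrams(word, n=3):
    """Character n-grams of a word, plus the whole word itself."""
    grams = {word[i:i + n] for i in range(len(word) - n + 1)}
    return grams | {word}

g = sub_ngrams("طوموبيل", 3)
# g == {"طوم", "ومو", "موب", "وبي", "بيل", "طوموبيل"}
```

A word's vector is then the sum of the vectors of these pieces, which is why two spelling variants sharing most sub n-grams end up close in the embedding space.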

Following the FastText guidelines for automatic hyper-parameter optimization, we obtained the best hyper-parameters for our dataset using FastText's auto-tune feature[§§]. We therefore set the following main options in the training process:

·        Window size=2: represents the size of the word context;

·        Number of epochs=5: controls the total number of iterations that the algorithm performs the training over the entire data;

·        Embedding size=300: dimension of the embedding space;

·        Batch size=201: number of examples on which the neural network is trained in each step.

Once training is finished, two model files are generated. The first is a binary file ('.bin') containing the model parameters in addition to the optimized hyper-parameters. The second is a text file containing 1.27M unique Moroccan UGT words with their vector representations in a 300-dimensional space. These vectors have definite orientations, making it possible to explicitly define their relationships with each other and hence explore the orthographic similarities between Moroccan UGT words.
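With the official `fasttext` Python package, the training run with the hyper-parameters listed above could look like the sketch below. The corpus path and the sub n-gram range (minn/maxn) are assumptions of this sketch, not values reported here:

```python
def train_ugt_model(corpus_path="ugt_ma.txt"):
    """Train a CBOW FastText model on the Arabic-script UGT corpus.
    Requires the `fasttext` package; the corpus path is a placeholder."""
    import fasttext
    model = fasttext.train_unsupervised(
        corpus_path,
        model="cbow",    # CBOW architecture (Section 5.1)
        dim=300,         # embedding size
        ws=2,            # context window of +/- 2 words
        epoch=5,         # passes over the corpus
        minn=3, maxn=6,  # character sub n-gram range (assumed)
    )
    model.save_model("ugt_ma.bin")  # the '.bin' file discussed above
    return model
```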

5.2.3.   Extracting orthographic variants

We use the FastText4J library[***] to handle the '.bin' file and extract orthographic variants of the MRV words. This library implements FastText in Java and allows us to build a list of the nearest neighbor words of each MRV word. In total, 53.14% of the MRV entries (setα) have at least one orthographic variant in the UGT corpus. This means that the process (without any refinement) failed to extract OVs for the remaining 46.86% of the MRV. In Figure 5, we exhibit the distribution of MRV words according to their number of corresponding OVs before (setα) and after refinement (setβ). As an example, the node corresponding to "number of OVs = 2" indicates that 2.57% of the MRV words have exactly two orthographic variants before refinement. In Figure 6, we display the detailed values of the OV distribution. We notice that the majority of setα have only one orthographic variant, whereas the number of words having more than one OV (from 2 to 20) is low. Figure 7 shows a sample of the orthographic variants obtained for the MRV word الطوموبيلة (car), with their corresponding orthographic similarity rates.
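The same neighbor-extraction step can be sketched with the official fasttext Python bindings in place of FastText4J (a hedged illustration; model loading is left to the caller):

```python
def candidate_ovs(model, mrv_words, k=20):
    """Nearest UGT neighbours of each MRV word as candidate OVs.
    `model.get_nearest_neighbors` returns (score, word) pairs."""
    candidates = {}
    for word in mrv_words:
        neighbours = model.get_nearest_neighbors(word, k=k)
        candidates[word] = [(w, score) for score, w in neighbours]
    return candidates
```

A model previously saved as 'ugt_ma.bin' would be loaded with `fasttext.load_model("ugt_ma.bin")` and passed in as `model`.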

Fig 5. Distribution of the MRV words orthographic variants before and after refinement


Fig 6. MRV words orthographic variants distribution


Fig 7. Inference of the FastText model in order to extract orthographic variants of الطوموبيلة

By examining a sample of the output lists, we notice that the items most similar to the MRV word are indeed related orthographic variants, while some other items are semantically close to the MRV word but orthographically distant. The latter should be eliminated. As an illustration, the words فيلوطو (in the car) and طاكسي (taxi) in Figure 7 are not orthographic variants of الطوموبيلة. In addition, some affixed words cannot be considered OVs since they are already included in the Moroccan vocabulary.

For this reason, we reduce the candidate OV list by applying a set of filters. Although this risks eliminating some real OVs, such refinement is necessary to ensure the reliability of the resource. First, we eliminate real-word OV candidates, i.e. candidates that have a meaning and can be found in the MRV. The remaining candidate list then contains only non-word OVs, which have no meaning and are thus not included in the MRV.

The second filter is based on the Levenshtein distance [22]. To this end, we calculate the distance of each pair composed of an MRV word and one of its candidate OVs (resulting from the previous filter). We keep only candidate OVs whose Levenshtein distance to the MRV word is 2 or less.
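A minimal dynamic-programming implementation of the Levenshtein distance, applied as described above (candidates farther than 2 edits from the MRV word are dropped); the example words are illustrative transliterations:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def filter_by_distance(mrv_word, candidates, max_dist=2):
    """Keep only candidates within `max_dist` edits of the MRV word."""
    return [c for c in candidates if levenshtein(mrv_word, c) <= max_dist]
```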

Finally, we refine the remaining candidates list by considering word sub n-grams. In fact, a given MRV word (WMRV) and a real orthographic variant of it (WOV) are expected to share most of their sub n-grams. Based on this idea, we compute the n-gram similarity score Sngram [23], defined as Sngram = α/β where:

·        α denotes the number of unique sub n-grams shared by WMRV and WOV.

·        β is the total number of unique sub n-grams occurring in WMRV and WOV.

Sngram is calculated over character bi-grams, and OV candidates where Sngram < 0.5 are eliminated. After applying the above rules, the proportion of MRV words with at least one OV candidate drops from 53.14% (setα) to 30.13% (setβ), as illustrated in Figure 5.
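The bi-gram filter can be sketched as an overlap measure on the two words' unique character bi-grams. One note of caution: we read β as the size of the union of the two bi-gram sets (a Jaccard-style score), which is one plausible interpretation of the definition; the words used are illustrative transliterations.

```python
def bigrams(word):
    """Set of unique character bi-grams of `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def ngram_similarity(w_mrv, w_ov):
    """S_ngram = shared unique bi-grams / total unique bi-grams of both words
    (union reading of the denominator; an interpretation, see lead-in)."""
    shared = bigrams(w_mrv) & bigrams(w_ov)
    total = bigrams(w_mrv) | bigrams(w_ov)
    return len(shared) / len(total) if total else 0.0

def keep_candidate(w_mrv, w_ov, threshold=0.5):
    # Candidates whose similarity falls below the threshold are eliminated.
    return ngram_similarity(w_mrv, w_ov) >= threshold
```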

5.2.4.   Lexicon Evaluation

Given the lack of Moroccan resources that could support the evaluation process, we decided to manually examine a randomly chosen sample of the OVs lexicon containing about 1,200 MRV words, which represents almost 1% of the MRV words having at least one OV (setβ). In fact, the large size of the data to be evaluated prevented us from building a larger test set. Nevertheless, the selected test set should be sufficient given that setβ represents only about 30% of the MRV. The evaluation consists in manually checking the final OVs list of each MRV word in this sample; the results show an error rate of 8.65%.

6.   Discussion

The language used in Moroccan UGT is unstructured and informal. This makes its analysis challenging. The results of the performed experiments show that Moroccan UGT displays several features that should be considered before engaging in any related NLP processing task.

·        The majority of the Moroccan UGT is written in Arabic script (57%), and considering only MA, Moroccan users prefer to express their ideas in Arabic script rather than Arabizi.

·        Several languages are used, including MA, MSA, French, Spanish and English. Regardless of the script used, MA content represents 73% of the collected UGT and is hence the most used language. Regarding Arabic script, code-switching accounts for a considerable amount of the content. In fact, users usually include in their MA sentences some MSA words that do not belong to the MA lexicon.

·        Moroccan UGT sentences are usually short, with Moroccan comments averaging 13 words. The main cause of this phenomenon is the character limit imposed by Facebook and YouTube. This feature leads to processing difficulties when one needs to consider the word context in some NLP applications such as language identification systems. In contrast, accurate results are obtained when processing text with short context in other NLP applications such as sentiment analysis.

·        Regarding the UGT spelling, we argue that not all MRV items have orthographic variants. In general, however, the Moroccan orthographic variation lexicon shows that users do not follow any writing rules and write their MA words freely, since this dialect has no writing standard. For this reason, it is necessary to normalize the Moroccan UGT before processing it. The OVs lexicon can be useful for this purpose.

·        Finally, even though the UGT corpus is of medium size (3.1M words), the FastText model allowed us to find OVs for about 30% of the MRV words. Since the obtained OVs concern only uni-gram words, it would be useful to extend this resource by considering sequences of two words, especially for items including stop-words such as في البلاصة.

This work investigates the MA UGT with respect to four features: script, language, text context and text orthography. The obtained results meet the two main goals defined in Section 1 and lead us to a better understanding of how to process MA text. Indeed, processing the MA UGT requires some NLP tools in order to obtain accurate results. The problem of language diversity can be addressed using the MA LID system that is currently available[†††], the orthographic variation of Moroccan words can be tackled using a spelling correction system, and an MA morphological analyzer that takes the text context into consideration can mitigate the shortness of the UGT context.

7.   Conclusion

In this paper, we analyzed Moroccan User-Generated Text through a corpus collected from the most used social media websites in Morocco. The collected MA UGT was cleaned, classified and then analyzed. The results of this analysis show that Moroccan people use both Arabizi and Arabic script to write their comments, with a slight preference for Arabic script (57%). They also show that MA is heavily used in social media websites (74%). In addition, the UGT is usually written with a short context. Finally, besides generating a new Moroccan vocabulary, the collected data has been exploited to create a lexicon of orthographic variants of Moroccan words. Such a resource is very useful for building a spelling normalization system. As future work, we plan to investigate a larger amount of Moroccan UGT in order to extend and improve the orthographic variants lexicon.


8.   References

   1.   M. Itani, Sentiment analysis and resources for informal Arabic text on social media, Sheffield Hallam University, Sheffield, 2018.

   2.   X. Hu and H. Liu, Text analytics in social media, in Mining Text Data, Springer US, 2012, pp. 385-414.

   3.   X. Liu, M. Wang and B. Huet, “Event analysis in social multimedia: a survey,” Frontiers of Computer Science, vol. 10, no. 3, pp. 433 – 446, 2016.

   4.   R. Tachicart and K. Bouzoubaa, An Empirical Analysis of Moroccan Dialectal User-Generated Text, in Proc. 11th International Conference Computational Collective Intelligence (ICCCI’19), Hendaye, France, 2019.

   5.   R. Tachicart, K. Bouzoubaa and H. Jaafar, Lexical differences and similarities between Moroccan dialect and Arabic, in Proc. 4th IEEE International Colloquium on Information Science and Technology (CIST’16), Tangier, 2016.

   6.   F. Sadat, F. Kazemi and A. Farzindar, Automatic Identification of Arabic Language Varieties and Dialects in Social Media, in Proc. Second Workshop on Natural Language Processing for Social Media (SocialNLP’14), Queensland, Australia, 2014.

   7.   C. Voss, S. Tratz, J. Laoudi and D. Briesch, Finding Romanized Arabic Dialect in Code-Mixed Tweets, in Proc. Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 2014.

   8.   F. Albogamy and A. Ramsay, Unsupervised Stemmer for Arabic Tweets, in Proc. Second Workshop on Noisy User-generated Text (WNUT’16), Osaka, Japan, 2016.

   9.   H. Afli, W. Aransa, P. Lohar and A. Way, From Arabic User-Generated Content to Machine Translation: Integrating Automatic Error Correction, in Proc. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING’16), Konya, Turkey, 2016.

10.   W. Zaghouani, B. Mohit, N. Habash, O. Obeid, N. Tomeh, A. Rozovskaya, N. Farra, S. Alkuhlani and K. Oflazer, Large scale Arabic error annotation: Guidelines and framework, in Proc. Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 2014.

11.   N. Habash, O. Rambow and R. Roth, Mada+tokan: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization, in Proc. Second International Conference on Arabic Language Resources and Tools (MEDAR’09), Cairo, Egypt, 2009.

12.   K. Abidi and K. Smaïli, An empirical study of the Algerian dialect of social networks, in Proc. International Conference on Natural Language, Signal and Speech Processing (ICNLSP’17), Casablanca, Morocco, 2017.

13.   R. Tachicart, K. Bouzoubaa and H. Jaafar, Building a Moroccan dialect electronic Dictionary (MDED), in Proc. 5th International Conference on Arabic Language Processing (CITALA’14), Oujda, Morocco, 2014.

14.   N. Y. Habash, Introduction to Arabic Natural Language Processing, Morgan & Claypool Publishers, 2010, p. 187.

15.   A. Bies, Z. Song, M. Maamouri, S. Grimes, H. Lee, J. Wright, S. Strassel, N. Habash, R. Eskander and O. Rambow, Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus, in Proc. First Workshop on Computational Approaches to Code Switching (EMNLP ’14), Doha, Qatar, 2014.

16.   R. Tachicart, K. Bouzoubaa, S. L. Aouragh and H. Jaafar, Automatic Identification of Moroccan Colloquial Arabic, Arabic Language Processing: From Theory to Practice, Springer International Publishing, Cham, vol. 782, pp. 201-214, 2018.

17.   M. Lui and T. Baldwin, An Off-the-shelf Language Identification Tool, in Proc. 50th Annual Meeting of the Association for Computational Linguistics (ACL’12), Jeju, Republic of Korea, 2012.

18.   S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.

19.   T. Mikolov, K. Chen, G. Corrado and J. Dean, Efficient Estimation of Word Representations in Vector Space, CoRR, 2013.

20.   J. Pennington, R. Socher and C. Manning, Glove: Global Vectors for Word Representation, in Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14), Doha, Qatar, 2014.

21.   A. Joulin, E. Grave, P. Bojanowski and T. Mikolov, Bag of Tricks for Efficient Text Classification, in Proc. 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL’17), Valencia, Spain, 2017.

22.   V. I. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Physics Doklady, vol. 10, p. 707, 1966.

23.   U. Pfeifer, T. Poersch and N. Fuhr, Retrieval effectiveness of proper name search methods, Information Processing & Management, vol. 32, no. 6, pp. 667-679, 1996.



[*] According to Hootsuite



[**] This tag is not a part of the dictionary.