3. Moroccan Arabic
3.1. General Overview
The Moroccan constitution recognizes two official languages: Arabic and Tamazight. Both have spoken (informal) and written forms and are used in official venues as well as in informal situations. The spoken form of Arabic in Morocco is the Moroccan Arabic dialect, which is considered the mother tongue of Moroccans alongside the spoken forms of Tamazight such as Tarifit, Tashelhit and Tamazight (Ennaji, 2005). However, according to the latest available figures, most Moroccans (91%) can speak Moroccan Arabic, while only 27% can use at least one of the spoken forms of Tamazight. Moroccan Arabic is thus the primary dialect in Morocco, used mainly in informal settings such as everyday conversation and the exchange of information.
Recently, with the advent of the Internet and new technologies in Morocco, information sources have multiplied and spread rapidly. Because Moroccan Arabic is the primary language of communication between Moroccans, the dialect has become dominant in web sources of various forms, such as written text, audio and video. This opens opportunities to better understand the Moroccan community in different contexts by analyzing the text its members produce on the web every day and extracting useful information from it. Within this scope, NLP techniques can be applied to a wide variety of tasks such as sentiment analysis, topic identification, user behavior prediction and event detection, to name a few.
According to several linguists such as Ouadghiri (Ouadghiri, 2013), Moroccan Arabic diverges from MSA at the lexical and phonological levels according to three factors, as follows:
· The time period: Moroccan Arabic has evolved from its interaction with the Tamazight language in the 7th century through to the influence of French and Spanish in the 20th century (during the protectorate period, from 1912 to 1956).
· The speech context: spoken Moroccan Arabic differs according to the context of the speech. For example, in TV programs and educational settings, spoken Moroccan Arabic is heavily influenced by MSA, and speakers may alternate between Moroccan Arabic and MSA. In other situations, such as everyday communication between people and within the family, spoken Moroccan Arabic can include French words.
This situation poses several problems for the identification and processing of Moroccan Arabic (MA). Since we are not linguists, we do not attempt to define standards for Moroccan Arabic. However, because digital content expressed in MA presents several business opportunities, we limit the scope of our research to the MA used on the Internet.
3.2. Moroccan Arabic morphology
In this section, we provide an overview of the Moroccan Arabic morphology on which we relied to generate MORV. Our findings are based on the works of Moroccan linguistic researchers (Medlaoui Mennabhi, 2019) (Ouadghiri, 2013) (Chafik, 1999). According to these studies, MA morphology is inspired by Arabic morphology, with limited exceptions that come from the influence of the Tamazight language (Chtatou, 1997). In the light of these works and our understanding as native Moroccan speakers, we identify the main MA word categories and their corresponding morphological attributes. For each category, we also present the rules that can apply when a lemma is concatenated with affixes and clitics.
As with the Arabic language, Moroccan Arabic (MA) has a rich morphology. In general, the MA vocabulary is composed of words that can be classified into three categories: noun, verb and particle. A word can be decomposed into morphemes as described in Fig. 1, where affixes and clitics are used to form new words from a lemma/stem without changing the POS.
In this paper we define lemma, stem, word, and other morphemes as follows:
• Prefixes: attach before the lemma/stem and state the inflection;
• Suffixes: attach after the lemma/stem and state the inflection;
• Affixes: the set of prefixes and suffixes;
• Proclitics: attach before the lemma/stem and state a syntactic role;
• Enclitics: attach after the lemma/stem and state a syntactic role;
• Lemma: the uninflected base form of a word, without affixes or clitics. For verbs, it is the perfective, third-person singular form. For nouns and adjectives, it is the singular indefinite form;
• Stem: the combination of a lemma with affixes;
• Word: either a lemma, a stem, or the combination of a stem/lemma with clitics (the fully inflected form).
Following these definitions, we can decompose, for example, the Moroccan Arabic (MA) word وماكانخدموهاش /And we don’t process it/ (wmakankhdmohach) into several morphemes, as illustrated in Fig. 1.
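The definitions above can be sketched as a small data structure. The following Python sketch is illustrative only: the class name `Segmentation` and its field names are hypothetical, and the segmentation of وماكانخدموهاش shown here (conjunction و, negation ما…ش, present marker كا, first-person prefix ن, plural suffix و, object pronoun ها) is an assumed reading of the example, not necessarily the exact decomposition drawn in Fig. 1 or produced by MORV.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segmentation:
    """Hypothetical container for the morphemes of one MA word."""
    lemma: str
    proclitics: List[str] = field(default_factory=list)  # before the stem, syntactic role
    prefixes: List[str] = field(default_factory=list)    # before the lemma, inflection
    suffixes: List[str] = field(default_factory=list)    # after the lemma, inflection
    enclitics: List[str] = field(default_factory=list)   # after the stem, syntactic role

    def stem(self) -> str:
        """Stem = lemma combined with its affixes."""
        return "".join(self.prefixes) + self.lemma + "".join(self.suffixes)

    def word(self) -> str:
        """Word = stem (or bare lemma) combined with its clitics."""
        return "".join(self.proclitics) + self.stem() + "".join(self.enclitics)


# Assumed segmentation of the example word (wmakankhdmohach).
example = Segmentation(
    lemma="خدم",
    proclitics=["و", "ما", "كا"],
    prefixes=["ن"],
    suffixes=["و"],
    enclitics=["ها", "ش"],
)

print(example.stem())  # lemma plus affixes
print(example.word())  # fully inflected form
```

Keeping affixes and clitics in separate lists mirrors the distinction made in the definitions: the stem is rebuilt from the lemma and affixes alone, and the clitics are only added when the full word is assembled.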
[Table fragment; surviving entries: "Present tense in the 1st person"; "I will not write"; "It has been written"]
To facilitate generating MA verbs, we categorize them according to their common conjugation rules. One key criterion is the presence of weak letters. Weak verbs (as in Arabic) are also present in the MA lexicon, given that 81% of the MA lexicon is borrowed from Arabic according to a previous study (Tachicart, et al., 2016). A weak verb has one or two weak letters in its root. The letters that make an MSA verb weak are Waw (و), Alif (ا) and Yae (ي). Given the MA spelling standards adopted in this work, only Waw (و) and Alif (ا) are considered weak. In this context, if we consider the orthographic transformations that occur when morphemes are concatenated, MA verbs can be categorized into five sets according to the number of letters and the presence of weak letters, as illustrated in table 3.
The first set does not undergo any change in the lemma during the concatenation process. It includes verbs that have no weak letters, such as كتب /to write/ (ktb) and زرب /to hurry up/ (zrb), in addition to weak verbs with more than three letters, such as تخاصم /to argue/ (tkhasm) and هاجر /to emigrate/ (hajr). The second set is composed of three-letter verbs with Alif in the middle, such as شاف /to see/ (chaf) and قال /to say/ (qal). During concatenation, for example in the present tense, the Alif is transformed into either و Waw or ي Yae, as illustrated in table 3. Additionally, the Alif is deleted in the past tense, as in شفت /I saw/ (chft). The third set is composed of verbs that have Alif as the last letter. In some cases, such as concatenation with present and imperative affixes, this letter is transformed into ي Yae. The fourth set contains verbs composed of two letters with a Chedda (ّ) on the last one, such as شمّ /to sniff/ (chmm) and سدّ /to close/ (sdd). In some cases, such as concatenation with past affixes, Yae (ي) is added to the lemma. The fifth set gathers a few irregular verbs, such as خدا /to take/ (khda) and كلا /to eat/ (kla), where the Alif may move to the first position, as in the third-person present form of كلا: ياكل /he eats/ (yakl). It should be noted that, contrary to MSA, the lemmas of the Moroccan Arabic lexicon do not include verbs with Yae or Waw in the final position; consequently, only Alif can occupy that position.
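The five-set categorization described above can be sketched as a simple rule cascade. This is a minimal Python sketch under stated assumptions, not MORV's implementation: the function names `verb_set` and `past_first_person_set2` are hypothetical, the irregular list contains only the two verbs named in the text, three-letter verbs with a medial Waw are not handled because the text does not describe them, and the lemma مشا is an illustrative set-3 example that does not come from the text.

```python
ALIF = "\u0627"    # the letter Alif
SHADDA = "\u0651"  # the Chedda diacritic

# Only the two irregular lemmas named in the text; the full set-5 list is not given.
IRREGULAR = {"خدا", "كلا"}


def verb_set(lemma: str) -> int:
    """Return the set number (1 to 5) of an MA verb lemma."""
    letters = [ch for ch in lemma if ch != SHADDA]  # the Chedda is not a letter
    if not letters:
        raise ValueError("empty lemma")
    if lemma in IRREGULAR:
        return 5  # irregular verbs
    if lemma.endswith(SHADDA) and len(letters) == 2:
        return 4  # two letters with a Chedda on the last one
    if letters[-1] == ALIF:
        return 3  # Alif in the final position
    if len(letters) == 3 and letters[1] == ALIF:
        return 2  # three letters with Alif in the middle
    return 1      # strong verbs, and longer verbs not ending in Alif


def past_first_person_set2(lemma: str) -> str:
    """Set 2 only: drop the medial Alif and append ت, as in the text's example chaf -> chft."""
    if verb_set(lemma) != 2:
        raise ValueError("this rule is only sketched for set 2")
    return lemma[0] + lemma[2] + "ت"


# "مشا" is an illustrative lemma, not taken from the text.
for lemma in ["كتب", "تخاصم", "شاف", "مشا", "شم" + SHADDA, "كلا"]:
    print(lemma, verb_set(lemma))

print(past_first_person_set2("شاف"))
```

The order of the checks matters: the irregular list is consulted first, and the final-Alif test precedes the medial-Alif test so that a verb ending in Alif is never assigned to set 2.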
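Table 3: MA verb sets (English glosses of the example forms)
Set 1: strong verbs composed of 3 letters, or verbs composed of more than 3 letters in which Alif is not in the final position. Glosses: They don’t write; They don’t argue
Set 2: verbs composed of three letters with Alif in the second position. Glosses: He sees us; He leans in
Set 3: verbs with Alif in the final position. Gloss: He parks it
Set 4: verbs composed of two letters with chedda (ّ) in the final position. Gloss: We closed it
Set 5: irregular verbs. Gloss: They don’t eat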
The majority of the MA lexicon (67%) is composed of nouns according to previous work (Tachicart, et al., 2014). In this work, we apply the Arabic standards for noun categorization used in (Jaafar, et al., 2015) to MA nouns, and we decompose MA nouns into several types in order to prepare the rules needed for the generation process. We consider the noun types listed in table 4, where each type is compatible with a specific morpheme set, and we illustrate each noun category with an example. For example, pronouns are compatible only with negation clitics and the conjunction و. Adverbs are not compatible with the definite clitic ال, whereas common nouns and adjectives are compatible with almost all nominal morphemes.
Particles are words to which the distinguishing marks of nouns and verbs cannot apply (Namly, et al., 2016). Contrary to nouns and verbs, particles cannot be inflected. However, they can be concatenated with some morphemes. We consider five types of Moroccan particles, as listed in table 5: interjections, prepositions, conjunctions, exceptions and interrogations. Each category can be concatenated with some morphemes, as shown in table 6.
[Table fragment; surviving entries: example gloss "and under it"; legend: M1 = morphemes placed before the lemma, M2 = morphemes placed after the lemma]
3.6. Morphological attributes definition
At the morphological level, most Moroccan Arabic rules are inherited from Arabic, since Moroccan Arabic is a variety of the Arabic language. In this context, we standardize MORV morphological information according to the ALECSO standards for Arabic morphological analyzers.
In tables 7 and 8 below, we detail MORV morphological information. MORV considers the same morphological categories as Arabic (noun, verb and particle), where each category can be assigned the following attributes:
[Table fragment; surviving attribute values: "Indefinite but limited" (two entries); gender: masculine; feminine; common]
· Gender: verbs can be separated into three classes: feminine, masculine and common. Gender does not apply to particles.
· Number: refers to the quantity of countable nouns or to the number of the verb's subject.
· Tense: the time described by a verb, which can be the past, the present, the future or the imperative.
· Person: refers to the participant in the event expressed by a verb. It can be assigned one of three values: first, second or third.
· Voice: describes the relationship between the subject and the verb in a given sentence. There are two verb voices: active and passive.
· State: a noun is indefinite when it is unspecific. Adding the prefix ال makes the word definite.
· Form: negation can be applied to words in the affirmative form by using the affixes ما and ش, which transforms the word form to negative.
[Table fragment; surviving attribute values: tense: present; future; past; imperative. Person: 1; 2; 3. State: definite; indefinite; not applicable]
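The attributes defined in this section can be sketched as a typed record. The Python sketch below is a hypothetical illustration, not the MORV schema: the class and function names are assumptions, the values of the number attribute are left open because they are not listed here, and `negate` and `make_definite` perform plain concatenation that ignores any orthographic transformation. The noun كتاب is an illustrative example that does not come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gender(Enum):
    MASCULINE = "masculine"
    FEMININE = "feminine"
    COMMON = "common"


class Tense(Enum):
    PRESENT = "present"
    FUTURE = "future"
    PAST = "past"
    IMPERATIVE = "imperative"


class Voice(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


class State(Enum):
    DEFINITE = "definite"
    INDEFINITE = "indefinite"
    NOT_APPLICABLE = "not applicable"


class Form(Enum):
    AFFIRMATIVE = "affirmative"
    NEGATIVE = "negative"


@dataclass(frozen=True)
class Attributes:
    """Hypothetical record of the morphological attributes of one word."""
    category: str                     # "noun", "verb" or "particle"
    gender: Optional[Gender] = None
    number: Optional[str] = None      # values are not listed in this section
    tense: Optional[Tense] = None
    person: Optional[int] = None      # 1, 2 or 3
    voice: Optional[Voice] = None
    state: State = State.NOT_APPLICABLE
    form: Form = Form.AFFIRMATIVE

    def __post_init__(self) -> None:
        if self.category == "particle" and self.gender is not None:
            raise ValueError("gender does not apply to particles")
        if self.person is not None and self.person not in (1, 2, 3):
            raise ValueError("person must be 1, 2 or 3")


def negate(word: str) -> str:
    """Affirmative -> negative form by wrapping the word with the two negation morphemes."""
    return "ما" + word + "ش"


def make_definite(noun: str) -> str:
    """Indefinite -> definite state by prepending the definite morpheme."""
    return "ال" + noun


verb = Attributes(category="verb", tense=Tense.PRESENT, person=1, form=Form.NEGATIVE)
print(verb.form.value)
print(negate("كانخدموها"))
print(make_definite("كتاب"))  # illustrative noun, not taken from the text
```

Using enumerations rather than free strings keeps each attribute restricted to the values listed for it, while `Optional` fields cover attributes that do not apply to a given category.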
 General Moroccan census performed in 2014 (http://rgph2014.hcp.ma)
 Region of Morocco located in the south
 Situated in west-central Morocco
Strong verbs do not have weak letters
Regular verbs (sets 1 to 4) are conjugated according to rules that the large majority of verbs in the language follow, whereas irregular verbs (set 5) are conjugated according to different rules.