
Moroccan Arabic Vocabulary Generation Using a Rule-Based Approach (3. Moroccan Arabic )

3. Moroccan Arabic

3.1. General Overview

The Moroccan constitution[1] recognizes two official languages: Arabic and Tamazight. Both have spoken (informal) and written forms and are used in official venues as well as in informal situations. The spoken form of Arabic in Morocco is the Moroccan Arabic dialect, which is considered the mother tongue of Moroccans alongside the spoken forms of Tamazight, such as Tarifit, Tashelhit and Tamazight (Ennaji, 2005). However, according to the latest available figures[2], most Moroccans (91%) can speak Moroccan Arabic, while only 27% can use at least one of the spoken forms of Tamazight. Moroccan Arabic is thus the primary dialect in Morocco, used mainly in informal settings such as everyday communication and the exchange of information.

Recently, with the advent of the Internet and new technologies in Morocco, the number and variety of information sources have grown rapidly. As Moroccan Arabic is the primary language of communication between Moroccans, this dialect has become dominant in web sources, where it appears in various forms such as written text, audio and video. Consequently, there are many opportunities to better understand the Moroccan community in different contexts by analyzing and extracting useful information from the text its members produce every day on the web. Within this scope, NLP techniques can be applied to a wide variety of tasks such as sentiment analysis, topic identification, user behavior prediction and event detection, to name a few.

According to several linguists such as Ouadghiri (Ouadghiri, 2013), Moroccan Arabic diverges from MSA at the lexical and phonological levels according to three factors, as follows:

·        The time period: Moroccan Arabic has evolved from its interaction with the Tamazight language in the 7th century up to the 20th century, when it came under the influence of French and Spanish (during the protectorate period from 1912 to 1956).

·        The geographic area: the MA spoken in the east of Morocco differs from the MA spoken in the Moroccan Sahara[3] and Doukkala[4].

·        The speech context: spoken Moroccan Arabic differs according to the context of the speech. For example, in TV programs and educational settings, spoken Moroccan Arabic is heavily influenced by MSA, and speakers may alternate between MA and MSA. In other situations, such as everyday communication between people and within families, spoken Moroccan Arabic can include French words.

This situation poses several problems for MA identification and processing. Since we are not linguists, we do not attempt to determine or define standards for Moroccan Arabic. However, because digital content expressed in MA presents several business opportunities, we limit the scope of our research to the MA used on the Internet.

3.2. Moroccan Arabic morphology

In this section, we provide an overview of the Moroccan Arabic morphology on which we relied to generate MORV. We based our findings on the works of Moroccan linguists (Medlaoui Mennabhi, 2019) (Ouadghiri, 2013) (Chafik, 1999). According to these studies, MA morphology is derived from Arabic morphology, with limited exceptions that come from the influence of the Tamazight language (Chtatou, 1997). Thus, in the light of these works and our understanding as native Moroccan speakers, we identify the main MA word categories and their corresponding morphological attributes. For each category, we also present the various rules that can apply when a lemma is concatenated with affixes and clitics.

As with the Arabic language, Moroccan Arabic (MA) has a rich morphology. In general, the MA vocabulary is composed of words that can be classified into three categories: Noun, Verb and Particle. A word can be decomposed into morphemes as described in Fig. 1, where affixes and clitics are used to make new words from a lemma/stem without changing the POS.

In this paper we define lemma, stem, word, and other morphemes as follows:

•         Prefixes: attach before the lemma/stem and state the inflection;

•         Suffixes: attach after the lemma/stem and state the inflection;

•         Affixes: the set of prefixes and suffixes;

•         Proclitics: attach before the lemma/stem and state a syntactic role;

•         Enclitics: attach after the lemma/stem and state a syntactic role;

•         Lemma: the uninflected base form of a word, without affixes and clitics. For verbs, it is the perfective, 3rd person singular form. For nouns and adjectives, the lemma is the singular indefinite form.

•         Stem: the combination of a lemma with affixes.

•         Word: either a lemma, a stem, or the combination of a stem/lemma with clitics (fully inflected form).

Following these definitions, we can decompose, for example, the Moroccan Arabic (MA) word وماكانخدموهاش /And we don’t process it/ (wmakankhdmohach) into several morphemes as illustrated in Fig. 1.

 

 

 

Fig. 1. MA word decomposition.
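As a rough illustration of the decomposition, the sketch below rebuilds the example word from an ordered list of morphemes. The segmentation and the role labels are our own reading of the definitions above and are assumptions, not a transcription of Fig. 1; only the full word comes from the text.

```python
from typing import List, Tuple

# (role, morpheme) pairs in surface order.
# ASSUMPTION: both the cut points and the role labels are illustrative;
# they are not taken from Fig. 1.
SEGMENTS: List[Tuple[str, str]] = [
    ("proclitic (conjunction)", "و"),
    ("negation, first part", "ما"),
    ("prefix (present tense)", "كان"),
    ("lemma", "خدم"),
    ("suffix (plural)", "و"),
    ("enclitic (object pronoun)", "ها"),
    ("negation, second part", "ش"),
]


def assemble(segments: List[Tuple[str, str]]) -> str:
    """Concatenate the morphemes in order to obtain the surface word."""
    return "".join(morpheme for _, morpheme in segments)


if __name__ == "__main__":
    for role, morpheme in SEGMENTS:
        print(f"{morpheme}\t{role}")
    print(assemble(SEGMENTS))
```

Keeping the role next to each morpheme makes it easy to check that affixes and clitics wrap the lemma without altering it.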

3.3.  Verbs

Given that MA is a variant of the Arabic language, it borrows not only the Arabic lexicon but also Arabic grammar rules. Most of these rules are the same for Moroccan Arabic, and in some cases they are altered to fit MA phonology. Accordingly, although Arabic conjugation rules apply to MA verbs, some MA verb conjugations differ slightly from Arabic. For instance, given the Arabic verb conjugated in the first person of the present tense أكتُبُ /I write/, the Arabic prefix أ is replaced by the MA prefix كان and the final Damma diacritic (ُ) is replaced by the Soukoun diacritic (ْ) to obtain the MA verb كانكتبْ (kanktb). To form negative statements, MA follows a pattern similar to French by placing the verb lemma between the prefix ما and the suffix ش. Passive verbs are obtained by adding the prefix ت to a given verb. Table 2 illustrates some verb conjugation cases for the present tense, the negation form and the passive voice.

Table 2. MA verb conjugation cases

| Conjugation case | MA | Arabic | Meaning |
| --- | --- | --- | --- |
| Present tense in the 1st person | كانكتب (kanktb) | أَكْتُبُ | I write |
| Negation form | مانكتبش (manktbch) | لَنْ أَكْتُبَ | I will not write |
| Passive voice | تكتبات (tktbat) | كُتِبَتْ | It has been written |
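The three cases in Table 2 can be sketched as plain string concatenation. This is a minimal sketch that only covers verbs whose lemma does not change during concatenation; the function names are hypothetical, and the inner form نكتب and the suffix ات are read off the Table 2 examples rather than stated as rules in the text.

```python
def present_first_person(lemma: str) -> str:
    """Present tense, 1st person: MA prefix كان followed by the lemma."""
    return "كان" + lemma


def negate(verb_form: str) -> str:
    """Negation: place the verb between the prefix ما and the suffix ش."""
    return "ما" + verb_form + "ش"


def passive(lemma: str, suffix: str = "") -> str:
    """Passive voice: prefix ت, optionally followed by a conjugation suffix."""
    return "ت" + lemma + suffix


if __name__ == "__main__":
    print(present_first_person("كتب"))  # first row of Table 2
    # ASSUMPTION: the form passed to negate() already carries the
    # 1st person marker ن, as in the Table 2 example.
    print(negate("نكتب"))
    # ASSUMPTION: ات is the suffix seen in the Table 2 passive example.
    print(passive("كتب", "ات"))
```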

To facilitate the generation of MA verbs, we categorize them according to their common conjugation rules. One key criterion is the presence of weak letters. Weak verbs (as in Arabic) are also present in the MA lexicon, given that 81% of the MA lexicon is borrowed from Arabic according to a previous study (Tachicart, et al., 2016). A weak verb has one or two weak letters in its root. The letters that make an MSA verb weak are Waw (و), Alif (ا) and Yae (ى). However, given the MA standard spelling adopted in this work, only Waw (و) and Alif (ا) are considered weak. In this context, if we consider the orthographic transformations that occur when morphemes are concatenated, MA verbs can be categorized into five sets according to the number of letters and the presence of weak letters, as illustrated in Table 3.

The first set does not undergo any change in the lemma during the concatenation process. It includes verbs that have no weak letters, such as كتب /to write/ (ktb) and زرب /to hurry up/ (zrb), in addition to weak verbs with more than three letters, such as تخاصم /to argue/ (tkhasm) and هاجر /to emigrate/ (hajr). The second set is composed of three-letter verbs with Alif in the middle, such as شاف /to see/ (chaf) and قال /to say/ (qal). During concatenation, for example in the present tense, the Alif is transformed either to Waw (و) or to Yae (ي), as illustrated in Table 3. Additionally, the Alif is deleted in the past tense, as in شفت /I saw/ (chft). The third set is composed of verbs that have Alif as the last letter. This letter is transformed in some cases to Yae (ي), for example when concatenated with present and imperative affixes. The fourth set contains verbs composed of two letters with Chedda (ّ) on the last one, such as شمّ /to sniff/ (chmm) and سدّ /to close/ (sdd). In some cases, such as concatenation with past affixes, Yae (ي) is added to the lemma. The fifth set gathers a few irregular verbs, such as خدا /to take/ (khda) and كلا /to eat/ (kla), where the Alif may move to the first position. For example, conjugating كلا in the present tense with the third person gives ياكل /he eats/ (yakl). It should be noted that, contrary to MSA, Moroccan Arabic lemmas do not include verbs that end in Yae or Waw; consequently, only Alif can occupy that position.
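The five sets can be told apart from the spelling of the lemma alone. The following sketch shows one possible way to do so; the function name, the order of the checks and the two-item irregular list are assumptions made for illustration, and the lemma is assumed to carry no diacritics other than Chedda.

```python
ALIF = "\u0627"    # ا
SHADDA = "\u0651"  # Chedda diacritic

# ASSUMPTION: only the two irregular verbs quoted in the text are listed;
# a real lexicon would need the complete list.
IRREGULAR_LEMMAS = {"خدا", "كلا"}


def classify_verb(lemma: str) -> int:
    """Return the set number (1 to 5) of an MA verb lemma."""
    if lemma in IRREGULAR_LEMMAS:
        return 5  # checked first: these lemmas also end in Alif
    letters = lemma.replace(SHADDA, "")  # base letters without Chedda
    if lemma.endswith(SHADDA) and len(letters) == 2:
        return 4  # two letters with Chedda on the last one
    if len(letters) == 3 and letters[1] == ALIF:
        return 2  # three letters with Alif in the middle
    if letters.endswith(ALIF):
        return 3  # Alif is the last letter
    return 1      # strong verbs and longer verbs not ending in Alif


if __name__ == "__main__":
    for verb in ["كتب", "تخاصم", "شاف", "مشا", "سد" + SHADDA, "كلا"]:
        print(verb, classify_verb(verb))
```

The irregular check comes first because خدا and كلا would otherwise fall into the third set.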

Table 3: MA verbs categories according to concatenation variations

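| Set | Verb features | Example | Transformation | Meaning |
| --- | --- | --- | --- | --- |
| 1 | Strong verbs[5] composed of 3 letters | كتب (ktb) | ماكاي+كتب+وش | They don’t write |
| 1 | Composed of more than 3 letters given that Alif is not the end position | تخاصم (tkhasm) | ماكاي+تخاصم+وش | They don’t argue |
| 2 | Composed of three letters where Alif is the second position | شاف (chaf) | كاي+شوف+ونا | He sees us |
| 2 | Composed of three letters where Alif is the second position | مال (mal) | كاي+ميل+ها | he leans in |
| 3 | The Alif letter is the end position | سطاسيونا (stasiona) | كاي+سطاسوني+ها | He parks it |
| 3 | The Alif letter is the end position | مشا (mcha) | كاي+مشي | He goes |
| 4 | Composed of two letters with chedda (ّ) at the end position | سدّ (sdd) | سدّي+ناها | We closed it |
| 4 | Composed of two letters with chedda (ّ) at the end position | شمّ (chmm) | شمّي+ت | I sniffed |
| 5 | Irregular verbs[6] | خدا (khda) | كي+اخد | He takes |
| 5 | Irregular verbs[6] | كلا (kla) | ماي+اكل+وش | They don’t eat |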
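The lemma changes shown in the Transformation column of Table 3 can be sketched as one small function per set. This is an illustrative sketch, not the generation procedure itself: the function names are hypothetical, the choice between Waw and Yae for the second set is lexical and must be supplied by the caller, and each rule only applies with the affixes mentioned in the text (for example, the Yae of the fourth set appears with past affixes).

```python
from typing import Optional

ALIF, WAW, YAE = "\u0627", "\u0648", "\u064A"  # ا, و, ي


def altered_stem(lemma: str, verb_set: int, middle: Optional[str] = None) -> str:
    """Return the altered lemma as it appears in Table 3 for the given set."""
    if verb_set == 1:
        return lemma  # no change
    if verb_set == 2:
        # The middle Alif becomes Waw or Yae; which one depends on the verb.
        if middle not in (WAW, YAE):
            raise ValueError("set 2 needs middle=WAW or middle=YAE")
        return lemma[0] + middle + lemma[2:]
    if verb_set == 3:
        return lemma[:-1] + YAE  # final Alif becomes Yae
    if verb_set == 4:
        return lemma + YAE       # Yae is added after the Chedda
    if verb_set == 5:
        return ALIF + lemma[:-1]  # final Alif moves to the first position
    raise ValueError("verb_set must be between 1 and 5")


def concat(*morphemes: str) -> str:
    """Join morphemes written with '+' in Table 3 into a surface word."""
    return "".join(morphemes)


if __name__ == "__main__":
    print(concat("كاي", altered_stem("شاف", 2, WAW), "ونا"))
    print(concat("كاي", altered_stem("مشا", 3)))
    print(concat("ماي", altered_stem("كلا", 5), "وش"))
```

Separating the lemma alteration from the concatenation mirrors the table, where the affixes are listed on either side of the altered lemma.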

 

3.4. Nouns

The majority of the MA lexicon (67%) is composed of nouns according to previous work (Tachicart, et al., 2014). In this work, we adapt the Arabic noun categorization standards used in (Jaafar, et al., 2015) to MA nouns, and we decompose MA nouns into several types in order to prepare the rules needed for the generation process. We consider the noun types listed in Table 4, where each type is compatible with a specific set of morphemes, and we illustrate each noun category with an example. For example, pronouns are compatible only with negation clitics and the conjunction و. Additionally, adverbs are not compatible with the definite clitic ال, whereas common nouns and adjectives are compatible with almost all nominal morphemes.

Table 4: MA nouns categories

| Category | Example | English equivalent |
| --- | --- | --- |
| Common | سكات | Silence |
| Adverb | تقريبا | Approximately |
| Adjective | فقير | Poor |
| Pronoun | نتوما | You |
| Proper | المغرب | Morocco |
| Number | جوج | two |
| Broken plural | بيبان | doors |

3.5. Particles

Particles are words to which noun and verb markers cannot apply (Namly, et al., 2016). Contrary to nouns and verbs, particles cannot be inflected. However, they can be concatenated with some morphemes. We consider five types of Moroccan particles, as listed in Table 5: interjections, prepositions, conjunctions, exceptions and interrogations. Each category can be concatenated with some morphemes, as shown in Table 6.

Table 5: MA particles categories

| Category | Example | English equivalent |
| --- | --- | --- |
| Interjection | اوّاه | Oh! |
| Preposition | تحت | under |
| Conjunction | و | and |
| Exception | غير | except |
| Interrogation | شنو | what |

Table 6: MA particles compatibility

| Category | M1 | M2 | Example | English equivalent |
| --- | --- | --- | --- | --- |
| Interjection | Not compatible | Not compatible | وا | Oh ! |
| Preposition | Partially compatible | Partially compatible | وتحتها | and under it |
| Conjunction | Not compatible | Not compatible | و | and |
| Exception | Partially compatible | Partially compatible | وغيرها | and others |
| Interrogation | Partially compatible | Not compatible | واشنو | and what |

M1: Morphemes taking place before lemma

M2: Morphemes taking place after lemma
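The compatibility levels of Table 6 lend themselves to a simple lookup that a generator could consult before attaching morphemes to a particle. The sketch below is one possible encoding; the names `Compat`, `PARTICLE_COMPAT` and `can_attach` are hypothetical, and since the table does not say which morphemes a partially compatible category accepts, the function only answers whether any attachment is possible.

```python
from enum import Enum


class Compat(Enum):
    NOT = "not compatible"
    PARTIAL = "partially compatible"


# (M1, M2) levels per particle category, transcribed from Table 6.
PARTICLE_COMPAT = {
    "interjection": (Compat.NOT, Compat.NOT),
    "preposition": (Compat.PARTIAL, Compat.PARTIAL),
    "conjunction": (Compat.NOT, Compat.NOT),
    "exception": (Compat.PARTIAL, Compat.PARTIAL),
    "interrogation": (Compat.PARTIAL, Compat.NOT),
}


def can_attach(category: str, position: str) -> bool:
    """Tell whether some morpheme may attach at M1 (before) or M2 (after)."""
    if position not in ("M1", "M2"):
        raise ValueError("position must be 'M1' or 'M2'")
    before, after = PARTICLE_COMPAT[category]
    level = before if position == "M1" else after
    return level is Compat.PARTIAL


if __name__ == "__main__":
    for name in PARTICLE_COMPAT:
        print(name, can_attach(name, "M1"), can_attach(name, "M2"))
```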

3.6. Morphological attributes definition

At the morphological level, most Moroccan Arabic rules are inherited from Arabic, since Moroccan Arabic is a variety of the Arabic language. In this context, we standardize MORV morphological information according to the ALECSO[7] standards for Arabic morphological analyzers.

In Tables 7 and 8 below, we detail MORV morphological information. MORV considers the same morphological categories as Arabic (noun, verb and particle), where each category can be assigned the following attributes:

Table 7: MORV morphological information – Common features

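| Morphological category | Associated attributes | Possible values |
| --- | --- | --- |
| Verbs and Nouns | root | Indefinite but limited |
| Verbs and Nouns | lemma | Indefinite but limited |
| Verbs and Nouns | gender | masculine; feminine; common |
| Verbs and Nouns | number | singular; plural |
| Verbs and Nouns | form | affirmative; negative |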

·        Gender: verbs and nouns can be separated into three classes: feminine, masculine and common. Gender does not apply to particles.

·        Number: refers to the quantity of countable nouns or to the number of the verb's subject.

·        Tense: the time described by a verb, which can be the past, the present, the future or the imperative.

·        Person: refers to the participant in the event expressed by a verb. It can be assigned one of three values: the first, the second or the third.

·        Voice: in a given sentence, it describes the relationship between the subject and the verb. There are two verb voices: the active and the passive.

·        Transitivity: a verb that accepts one or more objects is transitive.

·        State: a noun is indefinite when it is unspecific. Adding the prefix ال changes the state of the word to definite.

·        Form: negation can be applied to words in the affirmative form by using the affixes ما and ش, which changes the form of the word to negative.

Table 8: MORV morphological information – Specific features

| Morphological category | Associated attributes | Possible values |
| --- | --- | --- |
| Verbs | tense | present; future; past; imperative |
| Verbs | person | 1; 2; 3 |
| Verbs | voice | active; passive |
| Verbs | transitivity | yes; no |
| Nouns | state | definite; indefinite; not applicable |
| Particles | negation | 1; 0 |
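To make the attribute inventory of Tables 7 and 8 concrete, the sketch below models a verb entry as a record whose fields are checked against the listed values. It is a hypothetical structure, not the MORV storage format: the class name `VerbEntry`, the `word` field and the values chosen in the example (in particular `gender` and `transitivity`) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Allowed values per attribute, transcribed from Tables 7 and 8.
ALLOWED = {
    "gender": {"masculine", "feminine", "common"},
    "number": {"singular", "plural"},
    "form": {"affirmative", "negative"},
    "tense": {"present", "future", "past", "imperative"},
    "person": {1, 2, 3},
    "voice": {"active", "passive"},
    "transitivity": {"yes", "no"},
}


@dataclass(frozen=True)
class VerbEntry:
    word: str   # surface form (hypothetical field, not listed in the tables)
    root: str   # open-ended value ("indefinite but limited")
    lemma: str  # open-ended value ("indefinite but limited")
    gender: str
    number: str
    form: str
    tense: str
    person: int
    voice: str
    transitivity: str

    def __post_init__(self) -> None:
        # Reject any attribute value that is not listed in the tables.
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed, key=str)}")


if __name__ == "__main__":
    # ASSUMPTION: illustrative attribute values for the verb of Table 2.
    entry = VerbEntry(
        word="كانكتب", root="كتب", lemma="كتب",
        gender="common", number="singular", form="affirmative",
        tense="present", person=1, voice="active", transitivity="yes",
    )
    print(entry)
```

Noun and particle entries could follow the same pattern with the `state` and `negation` attributes of Table 8.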

 



[1] https://www.maroc.ma/en/content/constitution

[2] General Moroccan census performed in 2014 (http://rgph2014.hcp.ma)

[3] Region of Morocco located in the south

[4] Situated in west-central Morocco

[5] Strong verbs do not have weak letters

[6] Regular verbs (sets 1 to 4) are conjugated according to rules that the large majority of verbs in the language follow, while irregular verbs (set 5) are conjugated according to different rules.
