1. Methodology
Following a rule-based approach, generating MA words involves applying a set of linguistic and orthographic rules that define how morphemes can be concatenated. These rules vary from one word category to another (verbs, nouns, particles). As stated in Section 1, concatenative morphology and templatic morphology are the two ways to generate an MA word. In this section, we describe the design and implementation of our generator, MORG.
Design
As illustrated in Fig. 2, four key components enable MORG:
– Lexicon of labelled lemmas, named MDED;
– Labelled morphemes table containing both affixes and clitics;
– Rules and constraints governing the spelling and concatenation of morphemes and lemmas;
– Decision algorithm that combines the previous items.
Fig. 2. MA word decomposition.
The design of the MORG system (which enables the generation of the Moroccan vocabulary ‘MORV’) is flexible enough to be extended easily. The resources and the generation rules are expressed in a generic format and stored in separate tables, which allows efficient management of both resources and rules as well as better maintainability and scalability. Consequently, other languages with concatenative morphology can be accommodated, especially Arabic dialects that share many features with Moroccan Arabic.
Lexicon
As an MA lexicon, we used the Moroccan Dialect Electronic Dictionary (MDED) built in previous work (Tachicart et al., 2014). To the best of our knowledge, it is the most comprehensive electronic lexicon for MA, and it is periodically updated. It contains almost 12,000 MA entries written in Arabic letters and translated to MSA. One major feature of MDED is that its entries are annotated with useful metadata such as POS, origin and root, as illustrated in Table 9. For instance, the MA noun ماكلة /food/ originates from MSA and has the root كلا.
Table 9: Sample of MDED lexicon
MA | MSA | POS | root | origin | English translation |
ماكلة | طعام | Noun | كلا | MSA | food |
شحال | كم | particle | شحال | MSA | How much |
سطاسيونا | ركن | Verb | سطاسيونا | French | To park |
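To make the structure of an MDED entry concrete, the rows of Table 9 can be represented as simple records. The following is a minimal sketch only: the class name `LexiconEntry`, the field names and the helper `lemmas_by_pos` are illustrative assumptions, not part of MDED, and the POS values are lower-cased here for uniform matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    """One lexicon row; fields mirror the columns of Table 9 (names assumed)."""
    ma: str       # Moroccan Arabic lemma
    msa: str      # MSA translation
    pos: str      # part of speech, lower-cased in this sketch
    root: str
    origin: str
    english: str


# The three sample rows of Table 9.
LEXICON = [
    LexiconEntry("ماكلة", "طعام", "noun", "كلا", "MSA", "food"),
    LexiconEntry("شحال", "كم", "particle", "شحال", "MSA", "how much"),
    LexiconEntry("سطاسيونا", "ركن", "verb", "سطاسيونا", "French", "to park"),
]


def lemmas_by_pos(lexicon, pos):
    """Return the MA lemmas whose POS tag equals `pos`."""
    return [entry.ma for entry in lexicon if entry.pos == pos]


verb_lemmas = lemmas_by_pos(LEXICON, "verb")
```

Selecting lemmas by POS in this way corresponds to the fact that generation rules differ by word category.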
MA morphemes table
Besides the lexicon, the morphemes table is a central resource for our morphological generator. As far as we know, no existing work gathers MA morphemes and describes their features. We therefore manually compiled and linguistically checked 402 MA affixes and clitics. The table comprises 24 atomic affixes, 43 atomic clitics and 335 compound morphemes. Its main advantage is its rich morphological information, such as POS, negation and person, as illustrated in Table 10. For example, the morpheme وكان is composed of the prefix كان and the clitic و. It is compatible with verbs in the present tense, first person, plural, and all genders; negation does not apply to this morpheme. Table 11 gives the distribution of these morphemes according to POS.
Table 10: Sample of the MA affixes and clitics
morpheme | value | composition | pos | tense | pers | neg | num | gen |
clitic | و | atomic | verb | all | all | all | all | all |
prefix | وكان | و+كان | verb | present | 1 | 0 | p | all |
prefix | بال | ب+ال | noun | – | – | 0 | all | all |
suffix | ين | atomic | noun | – | – | 0 | p | all |
proclitic | وماب | و+ما+ب | noun | – | – | 1 | all | all |
suffix | ات | atomic | verb | all | 3 | 0 | s | f |
enclitic | كش | ك+ش | verb | all | all | 1 | s | all |
Table 11: Distribution of MA affixes and clitics according to POS
Type | M1 | M2 | Total | Percentage |
nominal | 23 | 67 | 90 | 20,81% |
verbal | 167 | 145 | 312 | 79,19% |
Total | 200 | 202 | 402 | 100% |
Concatenation rules
After preparing the lexical resources that describe the different MA morphemes, it is necessary to define the rules and constraints governing their concatenation into new MA words, and then to build and implement the decision algorithm. Morphological and orthographic rules are stored in three separate tables, so adding or updating rules is easy and does not affect MORG's overall performance. The first table gathers the morphological attributes that result from morpheme concatenation. The second indicates which morphemes can be concatenated with which others, and in which order. The third specifies the orthographic adjustments required to give the generated word a correct spelling.
Morphological attributes table
The value of each morphological attribute of a generated word (person, gender, number, etc.) depends on the morphological information of each morpheme composing the word. Table 12 shows the effect of combining morphemes on the value of the word's morphological attributes. For example, the third row indicates that combining, inside a verb, a prefix in the present tense with a suffix that accepts all tenses should produce a verb in the present tense.
Table 12: Morphological attributes table
Morphological attribute | M1 | Lemma | M2 | Resulted attribute |
gender | feminine | verb | – | feminine |
person | all | verb | all | all |
tense | present | verb | all | present |
number | all | verb | – | singular |
person | all | noun | all | all |
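The tense example above (a present-tense prefix combined with a suffix that accepts all tenses yields the present tense) suggests that "all" behaves like a wildcard. The following sketch encodes only that reading. The function `resolve`, the treatment of "–" as "not specified" and the use of `None` for conflicting values are assumptions; rows of Table 12 that follow other conventions, such as the number row, are not captured.

```python
from typing import Optional

WILDCARD = "all"  # the morpheme accepts any value of the attribute
ABSENT = "–"      # the morpheme does not specify the attribute


def resolve(m1: str, m2: str) -> Optional[str]:
    """Combine one attribute of M1 and M2; None signals conflicting values."""
    if m1 in (WILDCARD, ABSENT):
        # M1 is unconstrained, so M2 decides unless it is unspecified too.
        return m2 if m2 != ABSENT else m1
    if m2 in (WILDCARD, ABSENT) or m1 == m2:
        # M1 carries a specific value that M2 does not contradict.
        return m1
    return None


tense = resolve("present", "all")   # the tense example from the text
gender = resolve("feminine", "–")   # gender row of Table 12
```

Returning `None` for conflicting specific values is one possible way to flag a combination that should not be generated at all.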
Compatibility table
In order to avoid generating impossible words such as والشربنا /and the we drink/, we built a compatibility table. This hand-written table determines, for each lemma category, which morpheme preceding the lemma (proclitic or prefix) can co-occur in a word with which morpheme following the lemma (enclitic or suffix). To build it, we started by assuming that all morphemes are compatible with each other and generated the corresponding list. We then manually checked the list and excluded the morpheme combinations that produce incorrect words. For example, even though the prefix ال and the suffix كم are each compatible with nouns, they cannot appear together in the same word: concatenating them with the lemma طيارة /plane/ produces the incorrect word الطيارتكم /the your plane/. This combination is therefore excluded from the compatibility table. Table 13 presents a sample of the compatibility table.
Table 13: Compatibility table
Lemma | M1 | M2 | Example in a word | English equivalent |
verb | ماكان | ش | ماكانخدموش | We don’t work |
verb | غي | وهم | غيعرضوهم | They will invite them |
noun | ال | ات | البيكالات | bicycles |
noun | ب | تنا | بطوموبيلتنا | with our car |
particle | وما | كش | ماعندكش | You don’t have |
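The construction procedure described above (generate every M1/M2 pair, then remove the pairs rejected by manual review) can be sketched as follows. The inventories are a hypothetical miniature taken from morphemes mentioned in this section, and the only exclusion listed is the (ال, كم) pair discussed in the text; the remaining pairs have not been linguistically verified here.

```python
from itertools import product

# Hypothetical miniature inventories of nominal morphemes.
NOUN_M1 = ["ال", "ب"]
NOUN_M2 = ["ات", "تنا", "كم"]

# Pairs rejected by manual review; (ال, كم) is the example given in the text.
EXCLUDED = {("noun", "ال", "كم")}


def build_compatibility(pos, m1_list, m2_list, excluded):
    """Start from all M1/M2 pairs, then drop the manually excluded ones."""
    return {
        (pos, m1, m2)
        for m1, m2 in product(m1_list, m2_list)
        if (pos, m1, m2) not in excluded
    }


COMPATIBILITY = build_compatibility("noun", NOUN_M1, NOUN_M2, EXCLUDED)


def is_compatible(pos, m1, m2):
    """True if the pair may surround a lemma of category `pos`."""
    return (pos, m1, m2) in COMPATIBILITY
```

Storing the exclusions separately from the inventories keeps the manual decisions auditable when new morphemes are added.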
Orthographic adjustments table
The constraints governing the concatenation of lemmas with other morphemes are held in a separate table. Because some morpheme boundaries are affected during concatenation, orthographic adjustments are needed to correct the generated word. As illustrated in Table 14, the newly generated word (an intermediate representation) تايتتمشّا /He walks/ arises from concatenating the prefix تايت with the lemma تمشّا. However, it contains a doubled ت letter, so one ت must be deleted to produce the correct form تايتمشّا. The same table illustrates other orthographic adjustments that may occur after combining morphemes.
Table 14: Orthographic adjustments examples
Concatenation | Intermediate representation | Corrected form | English |
تايت+تمشّا | تايتتمشّا | تايتمشّا | He walks |
مشا+ات | مشاات | مشات | She leaves |
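Both rows of Table 14 remove a letter that is doubled across a morpheme boundary. The sketch below implements just that one adjustment as a rewrite rule. It is deliberately over-general and inferred only from these two examples, so it should not be read as the actual MORG rule set; the function name `adjust` and the use of `^` as the boundary mark inside the input string are assumptions made for the illustration.

```python
import re

BOUNDARY = "^"  # assumed mark separating morphemes in the intermediate form

# A letter, the boundary mark, then the same letter again.
_DOUBLED = re.compile(r"(.)" + re.escape(BOUNDARY) + r"\1")


def adjust(intermediate: str) -> str:
    """Collapse a letter doubled across a boundary, then drop the marks."""
    collapsed = _DOUBLED.sub(r"\1", intermediate)
    return collapsed.replace(BOUNDARY, "")


walks = adjust("تايت^تمشّا")   # first row of Table 14
leaves = adjust("مشا^ات")      # second row of Table 14
```

A production rule table would need to state which letters and which morphemes each adjustment applies to, since identical letters at a boundary are not always merged.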
Implementation
The rules that generate MORV are implemented using Finite State Transducers (FSTs). These machines have been used in various NLP applications, such as generation, analysis and speech recognition, as discussed in (Karttunen, 2000) and (Mohri, 1996). An FST is an enhanced finite-state automaton (FSA): while an FSA can only accept or reject a string, an FST also produces an output string as it reads its input, thereby defining a relation between the two. Our FSTs consist of a finite number of states (listed in Table 15 and illustrated in Fig. 3) linked by transitions, each labelled with an input/output pair. In Fig. 3, morpheme boundaries are marked with (^) and word boundaries with (#). As an example, we consider the generation of words from verbal lemmas and all types of morphemes (placed before and after the lemma). Reading a morpheme placed before the lemma (M1) leads to state q1, and so on until the final state q4, which produces the generated word with its corresponding morphological attributes.
Table 15: FST states and transitions
states | M1 | M1+lemma | M1+lemma+M2 | Intermediate Representation +tags |
q0 | q1 | – | – | – |
q1 | – | q2 | – | – |
q2 | – | – | q3 | – |
q3 | – | – | – | q4 |
Fig. 3. FST transitions (handling verbs)
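The state sequence of Table 15 can be sketched as a small deterministic transducer. This is an illustration under assumptions rather than the actual MORG network: the input is simplified to a list of (class, string) pairs, the class labels `M1`, `LEMMA`, `M2` and `TAGS` and the tag string `+verb` are hypothetical, and the output simply places `^` between morphemes and `#` at the word boundaries. The example segmentation (M1 ماكات, lemma سطاسيونا, M2 ش) is taken from Table 17.

```python
# Transition function following Table 15: (state, input class) -> next state.
TRANSITIONS = {
    ("q0", "M1"): "q1",
    ("q1", "LEMMA"): "q2",
    ("q2", "M2"): "q3",
    ("q3", "TAGS"): "q4",
}
FINAL_STATE = "q4"


def transduce(tokens):
    """Read (class, string) pairs and emit the intermediate representation.

    Returns None when the input is rejected, that is, when a transition is
    missing or the final state is not reached.
    """
    state, parts, tags = "q0", [], ""
    for kind, value in tokens:
        state = TRANSITIONS.get((state, kind))
        if state is None:
            return None
        if kind == "TAGS":
            tags = value
        else:
            parts.append(value)
    if state != FINAL_STATE:
        return None
    return "#" + "^".join(parts) + "#" + tags


EXAMPLE = [("M1", "ماكات"), ("LEMMA", "سطاسيونا"), ("M2", "ش"), ("TAGS", "+verb")]
intermediate = transduce(EXAMPLE)
```

Inputs that present the morphemes out of order, or that stop before the tags are read, are rejected because no transition or no final state is reached.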
In order to build MORV, a cascaded finite-state network is created by defining two levels of morphology. The highest level hosts the lexical string, which combines the morphemes and the lemma with their corresponding tags. As illustrated in Fig. 4, which shows the creation of the word مامشاتش /she didn’t leave/, these elements are chained together with boundary markers (+) and serve as input to the first set of transducers (FST1). FST1 maps each lexical string to an intermediate representation, which may require orthographic adjustments to meet Moroccan Arabic spelling constraints. The intermediate form is then the input to the second set of transducers (FST2), which maps it to the orthographically correct surface form.
Table 16: MORV general insights
Morphological category | Lexicon entries | Lexicon percentage | MORV generated forms | MORV percentage |
Verbs | 3130 | 26% | 2.021.152 | 43,15% |
Nouns | 8598 | 73% | 2.655.460 | 56,68% |
Particles | 118 | 1% | 8.154 | 0,17% |
Total | 12.000 | 100% | 4.684.766 | 100% |
Fig. 4. FST1 handling a set of Moroccan verbs
The last task consists of compiling MORV, that is, building a vocabulary of Moroccan words annotated with their morphological attributes. To this end, we use the finite-state network described above to create three separate lexical databases: the first gathers nouns (including irregular forms such as broken plurals), the second verbs, and the third particles. MORV is compiled separately for each lemma category because both the concatenation rules and the orthographic adjustments differ from one category to another. Table 16 gives an overview of the MORV content, while Table 17 shows a sample of the forms generated for the verb سطاسيونا /to park/. An extended sample of MORV containing further entries can be found at the SAFAR website[1].
Table 17: Sample of MORV forms generated for the verb سطاسيونا
word | lemma | transitivity | root | M2 | M1 | form | tense | number | gender | person |
ماكاتسطاسيونيش (You don’t park) | سطاسيونا | yes | سطسين | ش | ماكات | 1 | 2 | s | m | 2 |
ماكاتسطاسيونيش (You don’t park) | سطاسيونا | yes | سطسين | ش | ماكات | 1 | 2 | s | f | 3 |
وماغتّسطاسيوناش (And it will be not parked) | سطاسيونا | yes | سطسين | ش | وماغتّ | 1 | 3 | s | m | 2 |