
Moroccan Arabic Vocabulary Generation Using a Rule-Based Approach (2. Related work)

2. Related Work

In contrast to Arabic in general, only a few works deal with the morphological generation of Arabic dialect vocabularies, and, to the best of our knowledge, no work currently addresses the morphological generation of MA vocabularies. The literature on Arabic morphological vocabularies exhibits various approaches, which fall mainly into manual annotation, as in (Al-Shargi, et al., 2016) and (Maamouri, et al., 2006), and automatic approaches. In the following, we summarize related work on the automatic generation of both MSA and dialectal morphological vocabularies.

Among the earliest efforts to build Arabic morphological generators was the work of Beesley (Beesley, 1996) & (Beesley, 2001), which used Xerox’s finite-state transducer[1]. To implement the generator, the author compiled a lexical database of 4930 roots and 400 patterns, together with a set of morphotactic and alternation rules that govern the combination of stems with clitics. Running the system over the lexical database yields 72M fully inflected forms.
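To make the root-and-pattern idea concrete, the following minimal Python sketch interdigitates consonantal roots with vocalic patterns. It is only an illustration under our own assumptions: the digit-slot pattern notation, the transliterated toy roots and patterns, and the function name `interdigitate` are hypothetical and are not taken from Beesley's system, which relies on finite-state transducers and also applies morphotactic and alternation rules that are omitted here.

```python
def interdigitate(root: str, pattern: str) -> str:
    """Fill the digit slots of a pattern with the radicals of a root.

    Assumption (ours): a digit k in the pattern stands for the k-th
    root consonant; every other character is copied unchanged.
    """
    out = []
    for ch in pattern:
        if ch.isdigit():
            out.append(root[int(ch) - 1])  # slot -> k-th radical
        else:
            out.append(ch)                 # vowel or fixed affix material
    return "".join(out)


# Toy, transliterated data chosen for illustration only.
ROOTS = ["ktb", "drs"]
PATTERNS = ["1a2a3", "1aa2i3", "ma12uu3"]

# Cross product of roots and patterns, with no validity filtering.
stems = {(r, p): interdigitate(r, p) for r in ROOTS for p in PATTERNS}
```

In this sketch, `stems[("ktb", "1aa2i3")]` is `"kaatib"`. A real generator would additionally restrict which patterns each root may take and would attach affixes and clitics to the resulting stems.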

Cavalli-Sforza et al. (Cavalli-Sforza, et al., 2000), whose work is also reviewed in (Soudi, et al., 2007), presented an approach for generating Arabic verbs using MORPHE (Leavitt, 1994), a tool for modeling morphology based on discrimination trees and regular expressions. The system follows concatenative morphology and is driven by a morphological form hierarchy that governs not only the relationship between root and pattern forms but also the transformational rules attached to the leaf nodes of the hierarchy.

Habash (Habash, 2004) presented Aragen, a lexeme-based Arabic morphological generator that follows concatenative morphology. Aragen uses Buckwalter’s database (BAMA) (Buckwalter, 2002), which includes a set of tables representing morphotactic and orthographic rules. This database contains a lexicon of annotated morphemes (lemmas, affixes and clitics) and a morpheme compatibility table that indicates which morphemes can be concatenated with which others. To evaluate Aragen, the author used a sample of 1M words from the UN Arabic-English corpus (Jinxi, 2002) and found that it reaches a coverage of 76%.
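The role of a compatibility table in concatenative generation can be sketched as follows. This is a toy illustration, not Aragen's or BAMA's actual data or code: the morphemes are transliterated, and the category labels (`Pref-0`, `PV`, `IVSuff-uuna`, ...) and table names are invented for the example. A prefix, stem and suffix are concatenated only if every pair of their categories is listed as compatible.

```python
from itertools import product

# Hypothetical morpheme lexicons: surface form -> category label.
PREFIXES = {"": "Pref-0", "ya": "IVPref-ya"}
STEMS = {"katab": "PV", "ktub": "IV"}
SUFFIXES = {"": "Suff-0", "at": "PVSuff-at", "uuna": "IVSuff-uuna"}

# Hypothetical compatibility tables: allowed category pairs.
PREFIX_STEM = {("Pref-0", "PV"), ("IVPref-ya", "IV")}
STEM_SUFFIX = {("PV", "Suff-0"), ("PV", "PVSuff-at"),
               ("IV", "Suff-0"), ("IV", "IVSuff-uuna")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-0", "PVSuff-at"),
                 ("IVPref-ya", "Suff-0"), ("IVPref-ya", "IVSuff-uuna")}


def generate() -> list[str]:
    """Concatenate every prefix/stem/suffix triple whose categories
    are pairwise compatible; all other triples are discarded."""
    forms = []
    for (p, pc), (s, sc), (x, xc) in product(
        PREFIXES.items(), STEMS.items(), SUFFIXES.items()
    ):
        if ((pc, sc) in PREFIX_STEM
                and (sc, xc) in STEM_SUFFIX
                and (pc, xc) in PREFIX_SUFFIX):
            forms.append(p + s + x)
    return forms


forms = generate()
```

Of the 12 possible triples in this toy lexicon, only four pass the compatibility checks (`katab`, `katabat`, `yaktub`, `yaktubuuna`); ill-formed strings such as `yakatab` are never produced. Orthographic adjustment rules, which the real database also encodes, are left out of the sketch.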

In (Habash, et al., 2005), (Habash, et al., 2006) and (Habash, et al., 2007), the authors built MAGEAD, a morphological generator and analyzer of Modern Standard Arabic (MSA) and Levantine (LEV) verbs using FSTs. MAGEAD follows templatic morphology, and its analysis relies on lexemes and features. The authors define a lexeme as a triple containing a root, a meaning index and a morphological behavior class (MBC). In later work (Altantawy, et al., 2010), MAGEAD was extended to cover MSA nouns and adjectives. MAGEAD is the first tool for Arabic dialects that includes roots and patterns in its implementation. It has also been helpful for corpus annotation in several works, such as (Diab, et al., 2010).

Shaalan et al. (Shaalan, et al., 2007) built a rule-based Arabic morphological generator to facilitate automatic translation. Using the logic programming language Prolog, the authors implemented the generator by encoding transformational rules that govern the concatenation of affixes with Arabic lemmas.

Attia et al. (Attia, et al., 2011) & (Attia, et al., 2014) developed AraComLex, an MSA morphological processing toolkit based on finite-state transducers. AraComLex follows concatenative morphology and takes the lemma as the base form. The authors used a lexical database containing more than 30k lemmas to generate about 9M surface forms. In another work, Shaalan et al. (Shaalan, et al., 2012) created an open-source resource of Arabic words based on the AraComLex transducer, with the main goal of facilitating the construction of an Arabic spelling checker. The authors used the Microsoft spell-checker (included in Microsoft Office 2010) to validate a set of 9M words out of the 13M words generated by AraComLex.

Neme (Amid Neme, 2013) built a vocabulary of 2.5M Arabic verbs from 15.4K verbs, following templatic morphology and using finite-state transducers (FSTs). To evaluate the generated vocabulary, the author used 10K verbs extracted from the NEMLAR corpus (Attia, et al., 2005). The accuracy rate is not reported.

Doumi et al. (Doumi, et al., 2016) built a lexical resource containing 11M inflected verbal forms. They followed concatenative morphology and extracted verb lemmas from a representative MSA corpus rather than from a lexicon, in order to avoid obsolete words that have no place in current usage. They then used FSTs to generate MSA verbs following MSA concatenation rules with orthographic adjustments. The evaluation showed that the generated resource covers more than 70% of MSA verbs.

Khalifa et al. (Khalifa, et al., 2017) introduced CALIMAGLF, a morphological analyzer and generator for Emirati (EMR) Arabic verbs. This work uses two resources that provide explicit linguistic knowledge. The first is a database of root-abstracted paradigms that map from features to root-abstracted stems, prefixes, and suffixes; the second is a lexicon specifying verbal entries in terms of roots and paradigm IDs. Merging these two resources into one model provides all possible analyses for more than 2600 EMR verbs. An evaluation of CALIMAGLF on 620 verbs gives an accuracy of 81%.

Taji et al. (Taji, et al., 2018) presented CALIMAStar, a multi-system that includes an MSA morphological generator. This generator follows concatenative morphology and relies on an extended Buckwalter database, which contains tables of stems, clitics, and compatibility rules used to avoid generating incorrect words. Taking only compatible morphemes into consideration, the generator expects a lemma and a POS category as input and generates all possible forms.

Torjmen and Haddar (Torjmen, et al., 2019) automatically built an annotated Tunisian vocabulary containing 150 460 words using finite-state transducers. They started by building a lexicon of 1 452 annotated lemmas and implemented a set of local morphological grammars in the NooJ linguistic platform (Silberztein, 2005), following concatenative morphology. The local grammars are then converted to transducers that govern the concatenation of Tunisian morphemes with these lemmas. To test this vocabulary, they collected 18 134 words from social media and found a coverage of 58.5%.
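Several of the works above report a coverage figure. As a hedged illustration of how such a figure can be computed, the sketch below measures the share of test words found in a generated vocabulary. The reviewed papers do not all state whether they count tokens or distinct types, so the `by_type` flag, the function name `coverage` and the toy data are our own assumptions.

```python
from typing import Iterable


def coverage(vocabulary: Iterable[str],
             test_words: Iterable[str],
             by_type: bool = False) -> float:
    """Return the fraction of test words present in the vocabulary.

    by_type=False counts every occurrence (token coverage);
    by_type=True counts each distinct word once (type coverage).
    """
    vocab = set(vocabulary)
    words = list(test_words)
    if by_type:
        words = list(set(words))  # keep one copy of each distinct word
    if not words:
        raise ValueError("test_words must not be empty")
    hits = sum(1 for w in words if w in vocab)
    return hits / len(words)


# Toy data for illustration only.
generated = {"katab", "katabat", "yaktub"}
sample = ["katab", "yaktub", "darasa", "katab"]

token_cov = coverage(generated, sample)               # 3 of 4 tokens
type_cov = coverage(generated, sample, by_type=True)  # 2 of 3 types
```

Under this definition, a coverage of 58.5% on 18 134 test words would correspond to roughly 10 600 words found in the generated vocabulary.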

Table 1 summarizes the works that target Arabic word generation. From this summary, we first notice that no work targets Moroccan Arabic. Additionally, the majority of these works deal with MSA and implement finite-state transducers to generate the corresponding vocabularies. Moreover, the coverage reported in these works, obtained by comparing the generated vocabulary with an MSA corpus, remains unsatisfactory. In the following sections, we present Moroccan Arabic morphology and then the approach we followed to generate the corresponding vocabulary (MORV).

Table 1. Arabic morphological generators

| Work | Generator | Language | Morphology | Implementation | Size | Coverage / accuracy |
|---|---|---|---|---|---|---|
| (Beesley, 1996) & (Beesley, 2001) | Xerox | MSA | T+C | FST | 72M | |
| (Cavalli-Sforza, et al., 2000) | MORPHE | MSA | T+C | | | |
| (Habash, 2004) | Aragen | MSA | C | | | 76% |
| (Habash, et al., 2005), (Habash & Rambow, 2006) and (Habash & Rambow, 2007) | MAGEAD | MSA & LEV | T+C | FST | | |
| (Shaalan, et al., 2007) | | MSA | C | Prolog | | |
| (Attia, et al., 2011) & (Shaalan, et al., 2012) & (Attia, et al., 2014) | AraComLex | MSA | C | FST | 9M | |
| (Amid Neme, 2013) | | MSA | C | FST | 2.5M | |
| (Doumi, et al., 2016) | | MSA | T+C | FST | 11M | 70% |
| (Khalifa, et al., 2017) | CALIMAGLF | Gulf dialects | T+C | | | 81% |
| (Taji, et al., 2018) | CALIMAStar | MSA | C | | | |
| (Torjmen, et al., 2019) | NooJ | Tunisian | C | FST | 150K | 59% |

C: concatenative morphology; T: templatic morphology



[1] https://web.stanford.edu/~laurik/fsmbook/home.html