Moroccan Arabic Vocabulary Generation Using a Rule-Based Approach (1. Introduction) - tachicart

1. Introduction

Morphology is the field of linguistics that studies the internal structure of words. The main purpose of morphology is to analyze the word structure and to describe the meaningful units called morphemes (atomic linguistic units that carry meaning) for a given word. From a practical point of view, the word structure can be expressed using different morphological attributes such as tense, person, number, gender, etc.

One important application of morphology is vocabulary generation: the word-formation process that produces the inflected forms of a word from labelled morphemes. For Arabic, which comprises Modern Standard Arabic (MSA) and a set of dialects, there are two main approaches to the morphological generation of vocabularies (Habash, 2010): concatenative morphology and templatic morphology. To create a word, concatenative morphology combines a lemma or stem with morphemes such as affixes and clitics, whereas templatic morphology interleaves roots with patterns.
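The difference between the two approaches can be illustrated with a minimal Python sketch. It is not taken from the paper: the helper names `apply_pattern` and `concatenate`, the digit-based slot notation, and the simplified Latin transliteration of the MSA root k-t-b are all assumptions made for illustration.

```python
def apply_pattern(root, pattern):
    """Templatic step: interleave root consonants into a pattern.

    Digits 1-3 in the pattern mark the slots for the first, second and
    third root consonant (a notation assumed for this sketch).
    """
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in pattern)


def concatenate(stem, prefixes=(), suffixes=()):
    """Concatenative step: attach affixes and clitics around a stem."""
    return "".join(prefixes) + stem + "".join(suffixes)


root = ("k", "t", "b")  # root associated with writing

# Templatic morphology: one root, several patterns.
katab = apply_pattern(root, "1a2a3")      # perfective verb stem
kaatib = apply_pattern(root, "1aa2i3")    # active participle ("writer")
maktuub = apply_pattern(root, "ma12uu3")  # passive participle ("written")

# Concatenative morphology: stem plus a proclitic, a subject suffix
# and an object enclitic ("and I wrote it").
with_clitics = concatenate(katab, prefixes=("wa",), suffixes=("tu", "hu"))

print(katab, kaatib, maktuub, with_clitics)
```

In a real system the orthographic adjustments that occur at morpheme boundaries would also have to be handled; the sketch ignores them.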

Lexical resources are crucial to most NLP applications, such as morphological analyzers. However, the Moroccan Arabic dialect (MA) is considered a resource-scarce language, since few MA resources and NLP tools are available. In fact, there are currently neither morphological analyzers nor morphological generation systems for MA, so the analysis of MA texts is restricted to manual work. Moreover, various NLP applications rely on extracting the morphological information encoded in a word, and some of them, such as statistical machine translation, suffer from data sparsity. Resources that provide morphologically annotated MA words can alleviate these problems. Hence, building a new resource that describes the morphology of MA words would facilitate the development of NLP applications such as morphological analyzers and machine translation systems.

Previous approaches to vocabulary generation for standard languages followed either statistical techniques (Faruqui, et al., 2016) & (Dusek, et al., 2013) or rule-based techniques (Bauer, et al., 2015) & (Viks, 2000) & (Jisha, et al., 2011). The first relies on training taggers on large annotated corpora using common machine learning algorithms such as Support Vector Machines (SVM) (Vapnik, 1995) or LSTM (Hochreiter, et al., 1997). Saving time and increasing accuracy are the main advantages of this approach. Unfortunately, no such corpora are currently available to train MA taggers. The second is rule-based: it uses lexicons of morphemes together with decision algorithms implemented, for example, as finite-state transducers (FSTs). These algorithms govern the concatenation of the different morphemes and output new words together with their morphological analysis. Such an approach is appealing since it meets the linguistic requirements. Furthermore, given the current lack of annotated MA corpora, it appears to be the most suitable way to generate an MA morphological vocabulary.
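The rule-based idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system presented in this paper: it uses plain Python tables instead of a finite-state transducer, and the toy transliterated entries, feature names and the single co-occurrence rule are hypothetical.

```python
from itertools import product

# Toy lexicon and morpheme tables (hypothetical entries for illustration).
LEMMAS = [
    {"form": "ktab", "pos": "noun"},  # "book"
    {"form": "ktb", "pos": "verb"},   # "write"
]
PREFIXES = [
    {"form": "", "pos": "any", "feats": {}},
    {"form": "l", "pos": "noun", "feats": {"definite": "yes"}},
]
SUFFIXES = [
    {"form": "", "pos": "any", "feats": {}},
    {"form": "i", "pos": "noun", "feats": {"possessor": "1sg"}},
    {"form": "t", "pos": "verb", "feats": {"aspect": "perfective", "person": "1sg"}},
]


def attaches(morpheme, lemma):
    """A morpheme attaches only to lemmas of a compatible category."""
    return morpheme["pos"] in ("any", lemma["pos"])


def allowed(prefix, suffix):
    """Linguistic rule: the definite article excludes a possessive suffix."""
    return not ("definite" in prefix["feats"] and "possessor" in suffix["feats"])


def generate(lemmas, prefixes, suffixes):
    """Yield (word, analysis) pairs for every licensed combination."""
    for lemma, prefix, suffix in product(lemmas, prefixes, suffixes):
        if attaches(prefix, lemma) and attaches(suffix, lemma) and allowed(prefix, suffix):
            analysis = {"lemma": lemma["form"], "pos": lemma["pos"],
                        **prefix["feats"], **suffix["feats"]}
            yield prefix["form"] + lemma["form"] + suffix["form"], analysis


vocabulary = dict(generate(LEMMAS, PREFIXES, SUFFIXES))
for word, analysis in vocabulary.items():
    print(word, analysis)
```

Keeping the rules in functions that are separate from the data tables means a rule can be changed without touching the lexicon, which is the kind of flexibility a rule-based design is meant to offer.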

The main contribution of this paper is to present and evaluate our MORphological Vocabulary (MORV), which is produced by a MORphological Generator (MORG) that relies on a rule-based approach. The motivation for this work is to establish a Moroccan Arabic morphological analyzer that will help solve various NLP tasks.

In our method, we used a lexicon of MA lemmas and a table of annotated MA morphemes (affixes and clitics) as a dataset. In addition, we stored linguistic and orthographic rules in separate tables, so that an appropriate algorithm can govern the concatenation of the different morphemes. The MORV evaluation assesses the generated output with respect to two aspects. The first (quantitative evaluation) aims at ensuring that MORV entries (generated words only) sufficiently cover the Moroccan Arabic dialect. The second assesses the accuracy of the morphological information that MORV provides, using common evaluation metrics such as Precision, Recall, and F-measure. From this perspective, the main advantages of MORV are its good coverage of the Moroccan Arabic vocabulary, its flexibility in managing rules, and the ease with which it can be extended.
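For readers unfamiliar with the metrics named above, the following Python sketch computes them in their standard set-based form. It is an assumption that analyses are compared as exact (word, tag) pairs; the function name `precision_recall_f1` and the toy gold and predicted sets are hypothetical and do not come from the MORV evaluation.

```python
def precision_recall_f1(predicted, gold):
    """Compute Precision, Recall and F-measure over two sets of analyses."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: (word, part-of-speech) pairs, hypothetical.
gold = {("ktbt", "verb"), ("ktab", "noun"), ("lktab", "noun"), ("ktabi", "noun")}
predicted = {("ktbt", "verb"), ("ktab", "noun"), ("lktab", "verb")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of the three predicted analyses are correct and two of the four gold analyses are recovered, so Precision is 2/3, Recall is 1/2 and the F-measure is 4/7.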

In section 2, we discuss related work on vocabulary generation. In section 3, we present morphological information about the different MA categories. In section 4, we highlight two linguistic approaches to Moroccan Arabic word generation. In section 5, we discuss the main objectives of building MORV. Then, we present the adopted approach and its implementation. In section 6, we present the results of the MORV evaluation and discuss its features. Finally, in section 7, we conclude this paper with some perspectives.
