5. Evaluation and discussion
5.1. Quantitative Evaluation (coverage)
To evaluate the coverage of Moroccan Arabic (MA) by the generated vocabulary, we check whether MA words used in real text orthographically exist in MORV. This evaluation therefore does not concern the annotations associated with the MORV-generated words. For this purpose, we use a test corpus extracted from the MA user-generated text (UGT) introduced in previous work (Tachicart et al., 2019). Our test corpus is manually normalized following the same MORV orthographic rules and includes 1000 MA sentences containing 10,564 words; the manual effort required to prepare such a corpus prevented us from increasing its size. The goal of the evaluation is to ensure that each UGT word in the test corpus is orthographically recognized using MORV. The obtained results show that 84% of the words are recognized, which means that only 16% of the test corpus words are missed. This score is encouraging when MORV is compared to MSA vocabularies such as BAMA and AraComLex, whose language coverage does not exceed 87%. In Table 18 below, we detail the quantitative evaluation by providing the coverage as well as the out-of-vocabulary (OOV) rates according to the POS.
Table 18. Coverage and OOV rates according to the POS (number of words).
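Assuming MORV can be modelled as a set of orthographically normalized surface forms, the coverage check described above can be sketched as follows (the forms and tokens below are toy stand-ins, not actual MORV entries):

```python
def coverage(morv_forms, corpus_tokens):
    """Fraction of corpus tokens orthographically recognized in MORV."""
    recognized = sum(1 for tok in corpus_tokens if tok in morv_forms)
    return recognized / len(corpus_tokens)

# Toy stand-ins (transliterated) for the vocabulary and the test corpus.
morv_forms = {"dar", "mdina", "bzzaf"}
tokens = ["dar", "bzzaf", "xyz", "mdina"]  # "xyz" is an OOV token

print(f"coverage = {coverage(morv_forms, tokens):.0%}")  # → coverage = 75%
```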
By examining the OOV word list, we notice that (named entities aside) most of the words belong to the MSA vocabulary or fully follow MSA morphology, due to the use of code-switching in MA texts. For example, the words تنسجم /fit into/, الأقمصة /shirts/, and الاستهجان /boos/ are MSA words that appear in the evaluated MA sentences. Besides, some items in the OOV list can be considered MA words, such as شيباهم (he shipped the items). This is explained by the fact that the MDED lexicon does not include the corresponding lemma, which could be considered a new word for Moroccan Arabic.
5.2. Qualitative Evaluation (performance rates)
Building new NLP resources such as MORV is very important for various NLP tasks. The importance of such resources depends not only on their size and coverage but also on the credibility of the information they provide. In this perspective, we assess MORV performance using the standard evaluation metrics: precision, recall, accuracy and F-measure. These metrics are calculated using the standard parameters True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN), as described in Table 19.
- True Positive (TP): correct annotation (corresponding to the input word) found in MORV
- True Negative (TN): incorrect annotation (corresponding to the input word) not found in MORV
- False Positive (FP): incorrect annotation (corresponding to the input word) found in MORV
- False Negative (FN): correct annotation (corresponding to the input word) not found in MORV
Based on these parameters, for each input word, we define the precision as the number of correct annotations found in MORV over the total number of annotations that MORV associates with the input word. It is calculated by the following formula:

Precision = TP / (TP + FP)
Moreover, the recall is defined as the number of correct annotations found in MORV over the number of correct annotations that should be found. It is calculated by the following formula:

Recall = TP / (TP + FN)
Also, accuracy expresses the proportion of correct annotations (positive and negative) among all annotations. It is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Finally, the F-measure is defined as the harmonic mean of precision and recall:

F-measure = 2 × (Precision × Recall) / (Precision + Recall)
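The four metrics can be computed together from the raw counts. The sketch below follows the formulas above; the counts passed in are purely illustrative, not the actual evaluation figures:

```python
def metrics(tp, tn, fp, fn):
    """Precision, recall, accuracy and F-measure from raw TP/TN/FP/FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f_measure

# Illustrative counts only.
p, r, a, f = metrics(tp=90, tn=5, fp=5, fn=20)
print(f"P={p:.2%}  R={r:.2%}  A={a:.2%}  F={f:.2%}")
```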
Before engaging in the qualitative evaluation, it is necessary to prepare a manually annotated test corpus. To this end, we carefully selected 25 sentences containing 304 MA words from the test corpus used in the quantitative evaluation and asked a linguistic expert to provide, for each word, all possible annotations without taking the context into consideration, using the same morphological attributes provided by MORV. Then, we extracted the MORV annotations associated with each word in this new test corpus and compared them against those in the test corpus across all morphological attributes. Table 20 below summarizes the evaluation process.
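A minimal sketch of this comparison step, assuming the gold (expert) and MORV annotations for each word are represented as sets of hypothetical attribute strings: annotations present in both sets are TP, MORV-only annotations are FP, and gold-only annotations are FN.

```python
def compare(gold, morv):
    """Accumulate TP/FP/FN by comparing gold and MORV annotation sets per word."""
    tp = fp = fn = 0
    for word, gold_anns in gold.items():
        morv_anns = morv.get(word, set())
        tp += len(gold_anns & morv_anns)  # annotations found in both
        fp += len(morv_anns - gold_anns)  # MORV annotations the expert rejects
        fn += len(gold_anns - morv_anns)  # expected annotations MORV misses
    return tp, fp, fn

# Toy annotation sets with made-up attribute labels.
gold = {"w1": {"noun.sg", "verb.3ms"}, "w2": {"adj"}}
morv = {"w1": {"noun.sg"}, "w2": {"adj", "noun"}}
print(compare(gold, morv))  # → (2, 1, 1)
```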
Given the obtained results, we first observe that the precision rate is 94.88%, which reflects the correctness of the annotations associated with MORV words. This corresponds to a low false-positive rate; the few false positives include annotations that are correct at the morphological level but incorrect at the semantic level, such as the word كانسكنوكم /we live in you/. The high precision is not surprising, since the generation process follows a rule-based approach in which the morphological rules are checked by linguistic experts. Besides, we obtained a recall of 81.42%, which means that not all relevant results are returned, but this can still be considered a good score. The reason some relevant results are missed is that, contrary to the precision calculation, we considered the OOV words in the recall calculation, which increased the number of expected annotations that should be found for the test corpus. In light of these rates, it is clear that including new lemmas in the MDED lexicon and rerunning the generation process would decrease the MORV OOV rate, and is thus a key factor towards maximizing the recall. Regarding accuracy and F-measure, we obtained 77.99% and 87.64% respectively. In fact, the costs of false positives and false negatives are very different when evaluating MORV annotations; moreover, the data are not symmetric, even though the false-positive and false-negative counts are almost the same. For this reason, it is useful to look at both precision and recall as metrics for the MORV evaluation.
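As a quick consistency check, the reported F-measure can be recomputed from the reported precision and recall using the harmonic-mean formula above:

```python
# Reported precision and recall from the qualitative evaluation.
p, r = 0.9488, 0.8142

# Harmonic mean of precision and recall.
f = 2 * p * r / (p + r)

print(f"F-measure = {f:.2%}")  # → F-measure = 87.64%, matching the reported value
```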
To compare MORV with other existing works, we consider only Arabic dialect resources. We believe that MORV has the largest size and the best precision, as reviewed in Table 1 and reported in the quantitative and qualitative evaluations. For example, while the Tunisian vocabulary contains 150k forms generated from 1452 lemmas, MORV exploits 12,000 lemmas to generate 4.68M forms. Additionally, the accuracy of MORV outperforms the claimed accuracy of the other Arabic dialect vocabularies.
It should also be noted that all the experiments involved in building and evaluating MORV were performed on a workstation with the following characteristics: CPU: Intel i7 @ 2.7 GHz, RAM: 32 GB, Operating System: Windows 10, 64-bit.
A strategic requirement for research and development in the NLP field is the creation of high-quality language resources, given that the performance of NLP tools usually relies on the quality of these resources. Therefore, we believe that extending MORV or creating new resources will pave the way towards addressing Moroccan NLP tasks. To this end, it would be useful to follow templatic morphology, which involves the creation of MA roots, patterns and the rules governing the combination of roots and patterns to create new MA words. This would allow comparing the results of the concatenative morphology already covered in this work with those of templatic morphology.
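As a rough illustration of templatic generation (using a transliterated toy root and a made-up pattern notation, not actual MORV data), a pattern can be seen as a template whose C1/C2/C3 slots are filled by the consonants of a triliteral root:

```python
def apply_pattern(root, pattern):
    """Interdigitate a triliteral root into a pattern with C1/C2/C3 slots."""
    c1, c2, c3 = root
    return pattern.replace("C1", c1).replace("C2", c2).replace("C3", c3)

# Toy example: root k-t-b with a perfective-like pattern.
print(apply_pattern(("k", "t", "b"), "C1C2eC3"))  # → kteb (MA: 'he wrote')
```

A real templatic generator would additionally need the rules stating which patterns are licensed for which roots, which is precisely the resource creation suggested above.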