
A description and demonstration of SAFAR framework (3. SAFAR)

3.       SAFAR

3.1         General principles

In most cases, the development of Arabic NLP (ANLP) applications requires the use of several tools at once, each dealing with a certain level of language. For example, one approach to developing an automatic translator consists of using a parser together with a morphological analyzer. Generally, these tools are heterogeneous and raise many software engineering (SE) problems such as interoperability, reusability and portability. Therefore, the ANLP application developer must very often face the integration of different technologies, harder system maintenance, a larger code base and a tedious search for the most appropriate tools. Moreover, researchers usually need not only tools but also Language Resources (LRs). This complicates the situation even more, because researchers are confronted with the same problems again when using and processing resources.

To overcome the above-mentioned SE issues and to suit the needs of the ANLP community in terms of processing Arabic effectively and providing reusable LRs, we developed SAFAR as a software architecture for Arabic[1].

As will be detailed in the benchmarking section, SAFAR outperforms existing architectures because it has been built according to the following constraints and principles:

·  Integrate not only tools and programs but also LRs

·  Structure the architecture to integrate the three types of Arabic, namely MSA, Classical and dialects

·  Respect the Arabic language features in the structure of the architecture

·  Develop tools or LRs when available ones are not satisfactory

·  Provide the architecture to be exploited not only by computer scientists but also by linguists

·  Involve in our team computer scientists, statisticians and linguists

 

SAFAR is a Java-based framework dedicated to Arabic Natural Language Processing. It brings together all layers of ANLP: LRs, pre-processing, morphology, syntax, semantics and applications. In general, our philosophy is not to develop all the NLP layers and modules ourselves, but to integrate existing ones consistently. Consequently, our approach consists of specifying each module of our architecture as an API and, where available, providing implementations of these APIs based on tools that have proved to be efficient. However, when modules and LRs are not available, we develop them from scratch inside SAFAR.

3.2         SAFAR architecture

As shown in Figure 1, SAFAR has several layers. Each one is developed as a set of reusable Java APIs that provide services directly usable by other layers, in accordance with the relationships modeled by the arrows in the figure. SAFAR layers consist of models and interfaces that help standardize the various aspects shared by tools belonging to the same family.

Figure 1: SAFAR framework general architecture.

 

·  Morphology: deals with four types of tools namely, stemmers, lemmatizers, morphological analyzers, and generators;

·  Syntax: contains implementations of Arabic syntactic parsers;

·  Semantics: designed to implement tools dealing with semantics;

·  Utilities: includes a set of technical services and pre-processing tools as well as machine and deep learning utilities;

·  Resources: provides services for maintaining, consulting and managing Arabic language resources such as corpora, dictionaries and ontologies;

·  Application: contains high-level applications such as sentiment analysis or Question/Answering systems;

·  Client applications: interacts with all other layers to serve clients via web applications, web services, etc.

3.3         The content

As previously explained, SAFAR is split into three main packages: MSA, Classical and Dialects. Since dialects are numerous and largely exceed the number of Arab countries, we have so far integrated only the Moroccan dialect. In addition, each package is split into basic services (organized by the three best-known language levels: morphology, syntax and semantics), applications and LRs (split into corpora and lexicons).

3.3.1        MSA

This package is the most populated one. Indeed, for almost two decades the research community spent all their efforts in developing components (tools and resources) for this type of Arabic. The work within SAFAR is either to integrate one or multiple of these components or develop them from scratch.

Once developed/integrated, all SAFAR components become standardized according to several Java interfaces. These interfaces provide a variety of inputs and outputs.

 

Users have several options when calling methods, specifying the parameters appropriate to their needs. They can also easily create customizable pipelines where the output of one component is the input of another (Jaafar and Bouzoubaa, 2015). Moreover, other developers can add a new implementation to any family of components simply by implementing the appropriate interface. All these aspects of SAFAR help solve SE issues, especially interoperability, reuse and flexibility of exploitation.
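The pipeline idea can be illustrated with a minimal Java sketch. It does not use the SAFAR API: the three stages below (`STRIP_TATWEEL`, `COLLAPSE_SPACES`, `TOKENIZE`) and the class `PipelineSketch` are hypothetical stand-ins, modeled as plain `java.util.function.Function` objects, that only show how the output of one component can feed the next.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of component chaining; none of these names are SAFAR classes.
class PipelineSketch {

    // Stage 1: remove the Arabic tatweel (kashida) character U+0640.
    static final Function<String, String> STRIP_TATWEEL =
            text -> text.replace("\u0640", "");

    // Stage 2: trim the text and collapse runs of whitespace into one space.
    static final Function<String, String> COLLAPSE_SPACES =
            text -> text.trim().replaceAll("\\s+", " ");

    // Stage 3: split the cleaned text into tokens on single spaces.
    static final Function<String, List<String>> TOKENIZE =
            text -> List.of(text.split(" "));

    // The output of each stage becomes the input of the next one.
    static Function<String, List<String>> buildPipeline() {
        return STRIP_TATWEEL.andThen(COLLAPSE_SPACES).andThen(TOKENIZE);
    }

    public static void main(String[] args) {
        // U+0643 U+0640 U+062A U+0628, several spaces, U+0642 U+0644 U+0645
        String input = "\u0643\u0640\u062A\u0628   \u0642\u0644\u0645";
        System.out.println(buildPipeline().apply(input));
    }
}
```

Because every stage shares the same functional shape, replacing one stage with another implementation leaves the rest of the chain unchanged, which is the property the interfaces described above are meant to provide.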

 

 

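SAFAR Layer | Package level | Processing level | Implementation name | Per | Vr
------------|---------------|------------------|---------------------|-----|---
App | key_words_extractor | | SAFAR key_words_extractor | 3 | 3
App | stopwords_analyzer | | SAFAR stopwords_analyzer | 3 | 3
App | moajam_moaassir | | SAFAR moajam_moaassir | 2 | 1
App | moajam_tafaoli | | SAFAR moajam_tafaoli | 2 | 1
App | Light summarization | | SAFAR light_summarization | 2 | 2
App | morphosyntactic_processor | | SAFAR morphosyntactic_processor | 2 | 1
App | stem_counter | | SAFAR stem_counter | 2 | 1
Basic | Semantics | CG generator | Amine Semantic Analyzer | 3 | 2
Basic | Syntax | Parsers | ATKS Parser | 2 | 2
Basic | Syntax | Parsers | Farasa | 2 | 3
Basic | Syntax | Parsers | Stanford | 2 | 1
Basic | Syntax | POS tagger | SAFAR POS tagger | 3 | 3
Basic | Morphology | Analyzers | ATKS Analyzer | 2 | 2
Basic | Morphology | Analyzers | Alkhalil | 2 | 1
Basic | Morphology | Analyzers | Alkhalil 2 | 2 | 2
Basic | Morphology | Analyzers | BAMA (Aramorph) | 2 | 1
Basic | Morphology | Analyzers | MADAMIRA | 2 | 1
Basic | Morphology | Lemmatizer | Farasa | 2 | 3
Basic | Morphology | Lemmatizer | SAFAR Lemmatizer | 3 | 3
Basic | Morphology | Stemmers | ISRI | 2 | 2
Basic | Morphology | Stemmers | Khoja | 2 | 1
Basic | Morphology | Stemmers | Light10 | 2 | 1
Basic | Morphology | Stemmers | Motaz stemmer | 2 | 2
Basic | Morphology | Stemmers | Tashaphyne Root Extractor | 2 | 2
Basic | Morphology | Stemmers | Tashaphyne stemmer | 2 | 2
Basic | Morphology | Stemmers | SAFAR stemmer | 2 | 2
Basic | Morphology | Generators | | |
Util | StopWords | StopWords | SAFAR StopWords remover | 3 | 3
Util | Benchmark | Morphology | SAFAR Analyzers benchmark | 2 | 2
Util | Benchmark | Morphology | SAFAR Stemmers benchmark | 2 | 2
Util | Benchmark | Syntax | SAFAR Parsers benchmark | 2 | 2
Util | Normalization + Diacritics | | SAFAR Normalizer | 3 | 1
Util | Splitting last V | | SAFARS sentence Splitter | 2 | 1
Util | Tokenization | | SAFAR Tokenizer | 2 | 1
Util | Pattern detector | | SAFAR Pattern detector | 2 |
Util | Transliteration | | SAFAR Transliterator | 2 | 1
Util | Transliteration | | ATKS Transliterator | 2 | 2
Util | Machine & Deep Learning | Hidden markov model | Hidden markov model | 3 | 3
Util | Machine & Deep Learning | Language model | Language model | 2 | 3
Util | Machine & Deep Learning | Levenshtein | Levenshtein | 2 | 3
Util | Machine & Deep Learning | weka | weka | 1 | 3
Util | Machine & Deep Learning | fastText | fastText | 1 | 3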

Table 1: MSA tools implemented in SAFAR

 

Table 1 shows all integrated tools for MSA. Many existing tools have been integrated within SAFAR, such as the “BAMA” morphological analyzer, the “Stanford Parser” and the “Light10” stemmer. These tools have been widely used by the ANLP community, and it is advantageous to use them within a homogeneous and flexible framework. Other tools, such as “SAFAR stemmer” and “SAFAR POS tagger”, have been developed from scratch. We developed them because: 1) available tools return incorrect results, 2) there are no similar tools within the community, or 3) existing tools cannot be reused in several contexts and environments. To support this, we carried out three benchmarks of similar tools at the stemming (Jaafar and Bouzoubaa, 2016), morphological analysis (Jaafar and Bouzoubaa, 2014) and parsing (Jaafar and Bouzoubaa, 2017) layers. All tools whose names start with “SAFAR” in the table have been developed from scratch by our research team, while all others have been integrated.

It should be noted that some processing levels have more tools than others. For example, the morphological level has many more tools than the syntactic one. This reflects the scarcity of available tools at the syntactic and semantic levels. Therefore, we strongly encourage researchers to provide tools at these levels, especially at the semantic level.

The “Per” column indicates how many researchers were involved in the development or integration of the corresponding tool. The “Vr” column indicates the SAFAR version in which the tool first appeared.

On the other hand, Table 2 shows all integrated resources for MSA. The LR building process is based on the structure of the Arabic language. Arabic is a highly structured language with a concatenative morphology: the lemma concatenates with affixes to produce the stem, which in turn concatenates with clitics to yield the word. According to its features, a lemma is either a verb, a noun or a particle. From this, we identify the basic components that take part in the composition of Arabic words: lemmas (particles, verbs and nouns), stems and clitics.

 

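SAFAR Layer | Package level | Processing level | Implementation name | Size (entries for lexicons, words for corpora) | Per | Vr
------------|---------------|------------------|---------------------|------------------------------------------------|-----|---
Resources | Lexicon | Alphabet | SAFAR Alphabet | 42 | 3 | 1
Resources | Lexicon | Clitics | SAFAR Clitics | 167 | 3 | 1
Resources | Lexicon | Particles | SAFAR Particles | 413 | 5 | 1
Resources | Lexicon | Contemporary | Contemporary dictionary | 32,300 | 2 | 2
Resources | Lexicon | Interactive | Interactive dictionary | 61,101 | 2 | 2
Resources | Lexicon | CALEM | SAFAR Stems Lemmas | 7,133,106 | 3 | 3
Resources | Lexicon | Arabic WordNet | SAFAR Arabic WordNet | 56,164 | 3 | 2
Resources | Corpus | NAFIS | SAFAR Stemming gold standard | 172 | 4 | 3
Resources | Corpus | Morphology evaluation | SAFAR morphological analyzers evaluation | 100 | 3 | 2
Resources | Corpus | Stemming evaluation | Quranic stemming evaluation corpus | 1000 | 3 | 2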

Table 2: MSA resources implemented in SAFAR

 

Thus, SAFAR follows the above Arabic language structure and contains the three basic lexicons: alphabet (Loukili and Bouzoubaa 2011, Namly et al. 2016), clitics (Namly et al. 2015) and particles. We also make use of existing, well-known dictionaries (Contemporary and Interactive). It is worth mentioning that SAFAR currently contains one of the most comprehensive lexicons, with more than 7 million stems and their corresponding lemmas (Namly et al. 2019).

 

 

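SAFAR Layer | Package level | Processing level | Implementation name | Size (entries for lexicons, words for corpora) | Per | Vr
------------|---------------|------------------|---------------------|------------------------------------------------|-----|---
Resources | Lexicon | Mded | SAFAR Mded | 12,000 | 2 | 3
Resources | Lexicon | Moroccan_vocabulary | SAFAR MRV | 4,500,000 | 2 | 3
Resources | Lexicon | Orthographic_variants | SAFAR OV | 2,385,000 | 2 | 3
Resources | Corpus | LID | SAFAR Language Identification | 519,000 | 2 | 3
Util | LID system | SAFAR Language Identification | SAFAR Language_Identification | | 2 | 3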

Table 3: Moroccan dialect resources and tools implemented in SAFAR

 

One important feature, very rarely found in other frameworks, is that programmers can use the LRs not only as raw data shipped with SAFAR but also through the API. Finally, because of the importance of ontologies, we enriched and integrated the existing Arabic WordNet (AWN) (Abouenour et al. 2013). The enriched AWN is approved as the official version by the Global WordNet Association[2].

3.3.2        Moroccan Dialect

Besides being interested in processing Arabic language, we take into consideration the informal variety of Arabic spoken in Morocco (the Moroccan Dialect). Therefore, several resources and tools, illustrated in Table 3, have been developed targeting that variety since no previous works addressed the Moroccan dialect.

Regarding resources, a Moroccan dialect electronic Dictionary (MDED) (Tachicart et al. 2014) has been developed, containing almost 12,000 entries with useful annotations. Another lexicon is the Moroccan reference vocabulary (MRV) (Tachicart et al. 2019), which compiles 4.5M possible Moroccan words following a normalization guideline. A corpus for language identification tasks is also available with SAFAR. It is composed of 57k comments collected from social media and then manually classified into three categories: MSA, MD, and code-switched. In addition, a lexicon of orthographic variants covering almost 54% of the MRV has been generated using neural models. It can be useful for several dialectal NLP tasks such as spelling normalization. Table 3 shows all integrated resources for the Moroccan dialect.

Concerning tools, a language identification system (Tachicart et al. 2018) has been developed and integrated within SAFAR in order to distinguish between MD and MSA (Table 3).

3.3.3        Classical Arabic

We have not yet focused our efforts on developing specific tools or resources for this variety of Arabic. However, it is important to note that Classical Arabic and MSA share many rules. Consequently, many SAFAR tools and resources, such as all morphological tools and the Al wassit dictionary, can be used to process Classical Arabic text, while others need a specific implementation. For instance, it is common today to divide Arabic texts into sentences using the simple “.” symbol, whereas ancient writers did not use such symbols, which leads to long paragraphs containing multiple sentences without any punctuation. This feature therefore needs particular attention when developing the “sentence splitter” utility for Classical Arabic.

3.4         Use and exploitation

As previously mentioned, SAFAR tools and integrated resources can be exploited either as an API or from client applications. Moreover, to allow full interoperability, resources are integrated using international standards.

3.4.1        API

For each level of processing (preprocessing, stemming, morphology, syntax and semantics), we standardize all aspects shared by tools of the same type through APIs and models, so that they become homogeneous and flexible to use. This ensures standardization inside SAFAR. For example, we propose an API for the normalization process, described in Figure 2. All processing levels within SAFAR follow the same philosophy as the one described here for normalization.

This API is based on an interface that every developer must implement for a normalizer to be valid within SAFAR. That is to say, users can either use the normalizers available within SAFAR or develop their own by implementing the interface shown in Figure 2. A new normalizer is called and executed in the same way as any other normalizer within SAFAR. The same holds for all other Arabic processing levels within SAFAR.

As shown in Figure 2, SAFAR overloads the normalize() method to allow normalizing Arabic texts in different ways. While the default normalization method specifies only the text to process, the overloaded methods take additional parameters. Moreover, there are two types of methods: methods that take the input text either as a String or as a File, and methods that return results either as a String or as an XML File.

Executing a normalizer within SAFAR can be as simple as calling “normalizer.normalize(text)”. If the normalization needs to be customized, the overloaded methods can be called instead.

 

«interface» : INormalizer

+ normalize (String text, File outputFile): void
+ normalize (String text, File outputFile, String normalizationForm, String otherCharsToDelete): void
+ normalize (File inputFile, String inputEncoding, File outputFile): void
+ normalize (File inputFile, String inputEncoding, File outputFile, String normalizationForm, String otherCharsToDelete): void
+ normalize (String text): String
+ normalize (String text, String normalizationForm, String otherCharsToDelete): String
+ normalize (File inputFile, String inputEncoding): String
+ normalize (File inputFile, String inputEncoding, String normalizationForm, String otherCharsToDelete): String

Figure 2: Normalizers interface (API)
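To make the "implement the interface" idea concrete, the following Java sketch declares only the two String-based overloads from Figure 2 and plugs in a toy implementation. Everything beyond those two signatures is an assumption made for illustration: the class `DiacriticsStrippingNormalizer` is hypothetical and not part of SAFAR, the set of characters it deletes by default is our own choice, and interpreting `normalizationForm` as a Unicode form name ("NFC", "NFD", "NFKC", "NFKD") is a guess about the parameter's meaning.

```java
import java.text.Normalizer;

class NormalizerSketch {

    // Subset of the interface in Figure 2: String-in, String-out overloads only.
    interface INormalizer {
        String normalize(String text);
        String normalize(String text, String normalizationForm, String otherCharsToDelete);
    }

    // Hypothetical implementation; it does not reproduce SAFAR's actual behavior.
    static class DiacriticsStrippingNormalizer implements INormalizer {
        // Assumed default: Arabic diacritics U+064B..U+0652 plus tatweel U+0640.
        private static final String DEFAULT_CHARS_TO_DELETE =
                "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0640";

        @Override
        public String normalize(String text) {
            // The simple overload delegates to the customizable one.
            return normalize(text, "NFC", "");
        }

        @Override
        public String normalize(String text, String normalizationForm,
                                String otherCharsToDelete) {
            // Assumption: normalizationForm names a java.text.Normalizer.Form constant.
            String unicodeForm =
                    Normalizer.normalize(text, Normalizer.Form.valueOf(normalizationForm));
            String toDelete = DEFAULT_CHARS_TO_DELETE + otherCharsToDelete;
            StringBuilder result = new StringBuilder();
            for (int i = 0; i < unicodeForm.length(); i++) {
                char c = unicodeForm.charAt(i);
                if (toDelete.indexOf(c) < 0) {
                    result.append(c); // keep every character not marked for deletion
                }
            }
            return result.toString();
        }
    }

    public static void main(String[] args) {
        INormalizer normalizer = new DiacriticsStrippingNormalizer();
        // Fully diacritized word: kaf, fatha, ta, fatha, ba, fatha.
        String text = "\u0643\u064E\u062A\u064E\u0628\u064E";
        System.out.println(normalizer.normalize(text));
        System.out.println(normalizer.normalize(text + "!", "NFC", "!"));
    }
}
```

Calling code depends only on `INormalizer`, so a different implementation can be substituted without touching the caller.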

It is worth mentioning that when developing the SAFAR API, we fully respect “Checkstyle[3]” and “FindBugs[4]”, two development tools that help in adhering to coding standards. While “Checkstyle” checks the style of the code, “FindBugs” checks for potential bugs. These tools helped us follow a general coding standard for the whole framework.

3.4.2        Web application

For non-developers such as linguists, the SAFAR framework can be used through an online application[5] in which all SAFAR levels are available as online processing services. Results can be either displayed on the same page or downloaded as XML files.

 Figure 3: Alkhalil morphological analysis within SAFAR web

As an example, Figure 3 shows the online morphological analysis for the word “يأكلان” (they eat). After selecting the morphological analyzer to use via the drop-down menu (Alkhalil in this case) and clicking on “Analyze & display” button, the output is displayed in a table format.

A simple change in the selected morphological analyzer will lead to new results. To get results as XML, users should click on “Analyze & Save as XML” button instead. The online XML results have the same structure as the API results.

3.4.3        Web services

By using Web services, the SAFAR framework can publish its functions or messages to the rest of the world so that all developers, using either Java or any other programming language, can remotely exploit the SAFAR API. This also helps resolve interoperability issues by giving different applications a way to link their data. Our Web services use XML to code and decode data, and SOAP[6] to transport it. Results can be retrieved either in XML or in JSON format.

For example, to execute the Alkhalil2 analyzer as a web service, users specify the call as “{server}/{text}/alkhalil2”, where {server} refers to the URL at which the web service is active (in our case http://arabic.emi.ac.ma:8084) and {text} refers to the text that should be processed by Alkhalil2. This call returns results in XML format as shown in Figure 4.

Figure 4: XML Results of executing Alkhalil 2 as web service

If users prefer to get results in JSON format, they only need to change the call to “{server}/{text}/alkhalil2.json”. All these ways of using SAFAR, via the API, the web application or the web services, make SAFAR usable by a wide variety of users: linguists can use it without writing a single line of code, and developers are not obliged to use Java as their programming language.
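The call pattern above can be sketched in Java as follows. Only the URL template, the `.json` suffix and the server address come from the text; the rest is assumed for illustration: that the service answers a plain HTTP GET, that {text} should be percent-encoded as UTF-8 when placed in the path, and the helper names `buildCallUri` and `fetch`, which are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class SafarWebServiceSketch {

    // Server address quoted in the text.
    static final String SERVER = "http://arabic.emi.ac.ma:8084";

    // Builds "{server}/{text}/alkhalil2" or "{server}/{text}/alkhalil2.json".
    static URI buildCallUri(String server, String text, boolean json) {
        // URLEncoder produces form encoding, where a space becomes '+';
        // in a URL path the space is written "%20" instead.
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create(server + "/" + encoded + "/alkhalil2" + (json ? ".json" : ""));
    }

    // Assumption: the service responds to an HTTP GET on the built URI.
    static String fetch(URI uri) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String word = "\u064A\u0623\u0643\u0644\u0627\u0646"; // the word used in Figure 3
        URI xmlCall = buildCallUri(SERVER, word, false);
        URI jsonCall = buildCallUri(SERVER, word, true);
        System.out.println(xmlCall);
        System.out.println(jsonCall);
        // The network call runs only on request, since the server may be unreachable.
        if (args.length > 0 && args[0].equals("--fetch")) {
            System.out.println(fetch(jsonCall));
        }
    }
}
```

Since the request is an ordinary URL, the same call can be issued from any language with an HTTP client.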

3.4.4        Standards

Concerning compliance with international standards, and in order to facilitate the use of SAFAR components in different contexts, we adopt interoperability guidelines for all of them. Indeed, the inputs and outputs of SAFAR tools, as well as the LRs, are formatted using the XML representation standard. In addition to this representation standard, we use structuring standards: the Arab League Educational, Cultural and Scientific Organization[7] (ALECSO) recommendations for the design of Arabic morphological analyzers, the Lexical Markup Framework (LMF) (ISO 24613:2008) for lexicons, and the Text Encoding Initiative (TEI) (Lou Burnard et al. 2008) for corpora.


[1] http://arabic.emi.ac.ma/safar/

[2] http://globalwordnet.org/resources/arabic-wordnet/

[5] http://arabic.emi.ac.ma:8080/SafarWeb_V2

[6] https://www.w3.org/TR/soap/

[7] http://www.alecso.org/site/
