A description and demonstration of the SAFAR framework (1. Introduction)

1. Introduction

NLP infrastructures, also referred to as NLP architectures, are an efficient means of standardizing work, optimizing effort, fostering collaboration, and accelerating development in the field of NLP.

Over the last decade, the NLP research community has witnessed an extensive release of such infrastructures. Some have become very well known, while others existed only for a very short time. Some are multilingual while others are not, some target multiple domains while others do not, and so on. Listing them all would be tedious, and it is difficult today to survey all their features, advantages, and disadvantages in a single benchmarking dashboard.

However, only a few of them are dedicated to a single language. For instance, the “ITU Turkish Natural Language Processing Pipeline” (G. Eryiğit, 2014) is a platform dedicated to the Turkish language. It provides a few tools, most of which are available only via a web interface. AraNLP (Althobaiti et al. 2014) is another example, this time for Arabic; it provides some Arabic tools and outputs all results as simple text files.

On the other hand, the literature refers to existing infrastructures using three different names interchangeably: “toolkit”, “platform”, and “framework”. From a Software Engineering (SE) perspective, these terms have different meanings. It is therefore necessary to define them before presenting, categorizing, and benchmarking NLP infrastructures. Briefly[1], a toolkit is a set of tools within a single box used for a particular purpose. A platform consists of several interoperable tools with a homogeneous structure, but it does not provide any API for extending its components. A framework is a layered structure developed to serve as a support and guide for building NLP programs and tools.
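The distinction between the three terms can be made concrete with a minimal sketch. It is only an illustration of the definitions above, not the API of SAFAR or of any existing infrastructure: every name in it (toolkit_tokenize, Document, Platform, Component, Framework, and so on) is hypothetical, and whitespace splitting stands in for real NLP processing.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


# Toolkit (hypothetical): independent tools gathered in one box.
# Each tool has its own input and output; nothing ties them together.
def toolkit_tokenize(text: str) -> list[str]:
    return text.split()


def toolkit_count_words(text: str) -> int:
    return len(text.split())


# Shared, homogeneous structure used by the platform and framework sketches.
@dataclass
class Document:
    text: str
    tokens: list[str] = field(default_factory=list)
    annotations: dict[str, object] = field(default_factory=dict)


# Platform (hypothetical): interoperable tools working on one structure,
# but the set of components is fixed and there is no API to extend it.
class Platform:
    def run(self, text: str) -> Document:
        doc = Document(text)
        doc.tokens = text.split()
        doc.annotations["token_count"] = len(doc.tokens)
        return doc


# Framework (hypothetical): a layered structure that exposes an interface
# which new components implement, so the pipeline can be extended.
class Component(ABC):
    @abstractmethod
    def process(self, doc: Document) -> Document:
        ...


class WhitespaceTokenizer(Component):
    def process(self, doc: Document) -> Document:
        doc.tokens = doc.text.split()
        return doc


class TokenCounter(Component):
    def process(self, doc: Document) -> Document:
        doc.annotations["token_count"] = len(doc.tokens)
        return doc


class Framework:
    def __init__(self) -> None:
        self._components: list[Component] = []

    def register(self, component: Component) -> Framework:
        # Extension point: any object implementing Component can be added.
        self._components.append(component)
        return self

    def run(self, text: str) -> Document:
        doc = Document(text)
        for component in self._components:
            doc = component.process(doc)
        return doc
```

In this sketch, adding a new processing step to the framework only requires writing another Component subclass and registering it, whereas the platform would have to be modified internally and the toolkit offers no shared structure at all.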

In this paper, we focus on Arabic language infrastructures. We show that the “Software Architecture for ARabic” (SAFAR) framework is one of the most interesting frameworks to consider when developing any Arabic NLP component.

This paper is not a thorough presentation of SAFAR. The interested reader can find details on many aspects of the framework in (Jaafar et al. 2018, Jaafar and Bouzoubaa 2018, Namly et al. 2016). Herein, we aim to present a global view of the framework and also to discuss several abstract issues concerning its place within the larger picture of NLP infrastructures. Among other things, we discuss (i) how it helps standardization, (ii) why maintenance of the system is important, (iii) how it participates in collaborative work among available Arabic NLP architectures, and (iv) how software engineering principles are respected.

The rest of this article is organized as follows. Section 2 explains the most important features of the Arabic language. Section 3 presents SAFAR in terms of principles, architecture, content, and exploitation. Section 4 is dedicated to benchmarking the most important Arabic NLP infrastructures. Finally, the last section concludes the paper.