Stemming is not new: it was introduced in 1968 by Julie Beth Lovins, a computational linguist who created the first such algorithm, known today as the Lovins stemming algorithm. Her work significantly influenced later algorithms such as the Porter stemmer, which is now a common stemming algorithm for English words. These algorithms are specific to the English language and will not work for French, Greek or Russian.
To support several natural languages, it is necessary to have several algorithms. The Snowball stemming algorithms project provides such support through a specific string processing language, a compiler and a set of algorithms for various natural languages. The Snowball compiler has been adapted to generate Ada code (See Snowball Ada on GitHub).
The Ada Stemmer Library integrates stemming algorithms for: English, Danish, Dutch, French, German, Greek, Italian, Serbian, Spanish, Swedish, Russian. The Snowball compiler provides several other algorithms but they are not integrated yet: their integration is left as an exercise to the reader.
Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. A Snowball script describes a set of rules which are applied and checked on an input word, or some portion of it, in order to eliminate or replace some terms. A stemmer will usually transform a plural into a singular form, reduce the multiple forms of a verb, find the noun from an adverb and so on. Romance, Germanic and Scandinavian languages share some common rules, but each language needs its own Snowball algorithm. The Snowball project provides a detailed list of stemming algorithms for various natural languages. This list is available on: https://snowballstem.org/algorithms/
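To give a feel for what such rules look like, here is a minimal sketch in plain Ada. It is not a Snowball script and not one of the real algorithms: the three suffix rules, `Toy_Stem` and `Ends_With` are made up for illustration, and a real English stemmer has many more rules and conditions.

```ada
with Ada.Text_IO;

procedure Toy_Stemmer is

   --  Return True when Word ends with Suffix.
   function Ends_With (Word, Suffix : String) return Boolean is
   begin
      return Word'Length >= Suffix'Length
        and then Word (Word'Last - Suffix'Length + 1 .. Word'Last) = Suffix;
   end Ends_With;

   --  Hypothetical toy rules (NOT a real Snowball algorithm):
   --    "sses" -> "ss", "ies" -> "y", final "s" removed unless "ss".
   function Toy_Stem (Word : String) return String is
   begin
      if Ends_With (Word, "sses") then
         return Word (Word'First .. Word'Last - 2);
      elsif Ends_With (Word, "ies") then
         return Word (Word'First .. Word'Last - 3) & "y";
      elsif Word'Length > 2
        and then Ends_With (Word, "s")
        and then not Ends_With (Word, "ss")
      then
         return Word (Word'First .. Word'Last - 1);
      else
         return Word;
      end if;
   end Toy_Stem;

begin
   Ada.Text_IO.Put_Line (Toy_Stem ("ponies"));   --  pony
   Ada.Text_IO.Put_Line (Toy_Stem ("classes"));  --  class
   Ada.Text_IO.Put_Line (Toy_Stem ("cats"));     --  cat
   Ada.Text_IO.Put_Line (Toy_Stem ("glass"));    --  glass (unchanged)
end Toy_Stemmer;
```

Even this toy version shows why each language needs its own rule set: the suffixes and the order in which they are tested are specific to English.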
The Ada Stemmer Library supports only UTF-8 strings, which simplifies both the implementation and the API. The library only uses the Ada String type to handle strings.
To use the library, you should run the following commands:
git clone https://github.com/stcarrez/ada-stemmer.git
cd ada-stemmer
make build install
This will fetch, compile and install the library. You can then add the following line in your GNAT project file:
with "stemmer";
Each stemmer algorithm works on a single word at a time. The Ada Stemmer Library does not split words: you have to give it one word at a time, and it returns either the word itself or its stem. The Stemmer.Factory package is the multi-language entry point; with it, a stemmer is created for each call. The following simple code:
with Stemmer.Factory; use Stemmer.Factory;
with Ada.Text_IO; use Ada.Text_IO;
...
   Put_Line (Stem (L_FRENCH, "chienne"));
will print the string:
chien
When multiple words must be stemmed, it may be better to declare an instance of the stemmer and reuse it to stem several words. The Stem_Word procedure can be called with each word, and it returns a boolean that indicates whether the word was stemmed or not. The result is obtained by calling the Get_Result function. For example:
with Stemmer.English;
with Ada.Text_IO; use Ada.Text_IO;
..
   Ctx     : Stemmer.English.Context_Type;
   Stemmed : Boolean;
..
   Ctx.Stem_Word ("zealously", Stemmed);
   if Stemmed then
      Put_Line (Ctx.Get_Result);
   end if;
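Since the library does not split words, the caller has to tokenize the text before stemming. The following is a minimal sketch of one way to do it, using Find_Token from the standard Ada.Strings.Fixed package (the form with a From parameter requires Ada 2012) and a single reused English stemmer instance. The procedure name, the sample sentence and the separator set are assumptions made for illustration, and printing the original word when it was not stemmed is a choice of this sketch.

```ada
with Ada.Text_IO;
with Ada.Strings.Fixed;
with Ada.Strings.Maps;
with Stemmer.English;

procedure Stem_Sentence is
   --  Characters treated as word separators (an assumption for this sketch).
   Separators : constant Ada.Strings.Maps.Character_Set
     := Ada.Strings.Maps.To_Set (" ,.;:!?");
   Sentence   : constant String := "the dogs were running zealously";

   Ctx     : Stemmer.English.Context_Type;  --  one instance reused for every word
   Stemmed : Boolean;
   From    : Positive := Sentence'First;
   First   : Positive;
   Last    : Natural;
begin
   while From <= Sentence'Last loop
      --  Find the next run of characters that are not separators.
      Ada.Strings.Fixed.Find_Token
        (Sentence, Separators, From, Ada.Strings.Outside, First, Last);
      exit when Last = 0;  --  no more words

      Ctx.Stem_Word (Sentence (First .. Last), Stemmed);
      if Stemmed then
         Ada.Text_IO.Put_Line (Ctx.Get_Result);
      else
         --  Not stemmed: fall back to the word as written.
         Ada.Text_IO.Put_Line (Sentence (First .. Last));
      end if;
      From := Last + 1;
   end loop;
end Stem_Sentence;
```

Real text usually needs more care than this (case folding, apostrophes, non-ASCII punctuation in UTF-8), so the separator set should be treated as a starting point only.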
Integrating a new Stemming algorithm
Integrating a new stemming algorithm is quite easy but requires installing the Snowball Ada compiler:
git clone --branch ada-support https://github.com/stcarrez/snowball
cd snowball
make
The Snowball compiler needs the path of the stemming algorithm, the target programming language, the name of the Ada child package that will contain the generated algorithm and the target path. For example, to generate the Lithuanian stemmer, the following command can be used:
./snowball algorithms/lithuanian.sbl -ada -P Lithuanian -o stemmer-lithuanian
You will then get two files: stemmer-lithuanian.ads and stemmer-lithuanian.adb. After integrating the generated files in your project, you can access the generated stemmer with:
with Stemmer.Lithuanian;
..
   Ctx : Stemmer.Lithuanian.Context_Type;
Thanks to the Snowball compiler and its algorithms, and with version 1.0 of the Ada Stemmer Library available on GitHub, it is now possible to start doing some natural language analysis in Ada!