A Morphological Analyzer Using Hash Tables in Main Memory (MAHT) and a Lexical Knowledge Base

Carreras-Riudavets, F.J.; Rodríguez-del-Pino, J.C.; Hernández-Figueroa, Z.; Rodríguez-Rodríguez, G.
Lecture Notes in Computer Science. Computational Linguistics and Intelligent Text Processing (LNCS 7181-1). ISSN: 0302-9743
March 2012.

This paper presents MAHT, a morphological analyzer for the Spanish language. The system is based mainly on storing words together with their morphological information, which yields a lexical knowledge base of almost five million words. This knowledge base covers practically the full range of morphological cases in Spanish. Prefixes and enclitic pronouns, however, are handled by rules, because the words that can contain these elements are very numerous and some of them are neologisms. MAHT reaches an average processing speed of over 100,000 words per second, which is possible because it uses hash tables in main memory. MAHT has been designed to isolate the data from the algorithms that analyze words, including irregular forms. This separation is very important for an irregular and highly inflectional language such as Spanish, because it simplifies both the insertion of new words and software maintenance. The system is useful as a preprocessing step before a POS tagger.
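The general idea described in the abstract (direct lookup of stored word forms in an in-memory hash table, with rules used only for prefixes and enclitic pronouns) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the lexicon entries, the tags, the affix lists, and the function names `strip_acute`, `split_enclitics`, and `analyze` are all hypothetical, and a Python `dict` stands in for the hash table.

```python
import unicodedata

# Toy lexical knowledge base: word form -> list of (lemma, tag) analyses.
# Entries, tags, and affix lists are illustrative assumptions only.
LEXICON = {
    "canto": [("cantar", "VERB"), ("canto", "NOUN")],
    "cantar": [("cantar", "VERB")],
    "da": [("dar", "VERB")],
    "hacer": [("hacer", "VERB")],
}
PREFIXES = ("anti", "re", "super")
ENCLITICS = ("me", "te", "se", "lo", "la", "le", "nos", "los", "las", "les")


def strip_acute(word):
    """Remove acute accents (U+0301) while keeping n-tilde and diaeresis."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def split_enclitics(word, max_clitics=2):
    """Return (stem, clitics) candidates with 1..max_clitics pronouns removed."""
    results = []

    def recurse(stem, clitics):
        if clitics:
            results.append((stem, clitics))
        if len(clitics) < max_clitics:
            for clitic in ENCLITICS:
                if stem.endswith(clitic) and len(stem) > len(clitic):
                    recurse(stem[: -len(clitic)], [clitic] + clitics)

    recurse(word, [])
    return results


def analyze(word):
    """Look the word up directly; fall back to prefix and enclitic rules."""
    word = unicodedata.normalize("NFC", word.lower())

    def entries(form, prefix=None, enclitics=()):
        return [
            {"lemma": lemma, "tag": tag, "prefix": prefix,
             "enclitics": list(enclitics)}
            for lemma, tag in LEXICON.get(form, [])
        ]

    # 1. Direct hash-table lookup of the stored form.
    found = entries(word)
    if found:
        return found

    # 2. Rule: strip a known prefix and look up the remainder.
    for prefix in PREFIXES:
        if word.startswith(prefix):
            found += entries(word[len(prefix):], prefix=prefix)

    # 3. Rule: strip enclitic pronouns; only verbs may carry them.
    for stem, clitics in split_enclitics(word):
        found += [a for a in entries(strip_acute(stem), enclitics=clitics)
                  if a["tag"] == "VERB"]
    return found
```

In this sketch the data (`LEXICON`, `PREFIXES`, `ENCLITICS`) is kept apart from the lookup logic, so adding a word means adding a dictionary entry rather than changing code. For example, `analyze("dámelo")` strips `lo` and `me`, removes the accent from the remaining stem, and finds `da` in the table.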

