MARv4

Acronym

MARv stands for Morphosyntactic Ambiguity Resolver.

Introduction

MARv4's architecture comprises two submodules: a module of linguistically oriented disambiguation rules and a probabilistic disambiguation module. The rule-based module is no longer used in the STRING chain because that function is now performed by the RuDriCo module.

MARv4 is based on Hidden Markov Models (HMMs) and uses the Viterbi algorithm. The language model combines second-order (trigram) models, which encode contextual information concerning entities, with unigrams, which encode lexical information.
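The combination of trigram context and lexical probabilities can be illustrated with a minimal sketch of Viterbi decoding over a second-order HMM. This is not MARv4's actual code: the function name viterbi_trigram, the tag names, the "<s>" start symbol and all probability values are assumptions chosen for illustration.

```python
import math

NEG_INF = float("-inf")

def viterbi_trigram(words, tags, trans, emit, start="<s>"):
    """Most probable tag sequence under a second-order (trigram) HMM.

    trans[(t1, t2, t3)] holds log P(t3 | t1, t2) and emit[(word, tag)]
    holds log P(word | tag). Missing entries are treated as log(0).
    """
    if not words:
        return []
    # best[(u, v)]: best log score of a partial path ending in tags u, v
    best = {(start, start): 0.0}
    backptrs = []
    for word in words:
        new_best, back = {}, {}
        for (u, v), score in best.items():
            for t in tags:
                s = (score
                     + trans.get((u, v, t), NEG_INF)
                     + emit.get((word, t), NEG_INF))
                if s > new_best.get((v, t), NEG_INF):
                    new_best[(v, t)] = s
                    back[(v, t)] = u  # tag two positions before t
        if not new_best:
            raise ValueError(f"no tag sequence can produce {word!r}")
        best = new_best
        backptrs.append(back)
    # Backtrace from the best final pair of tags
    u, v = max(best, key=best.get)
    path = [v]
    for i in range(len(words) - 1, 0, -1):
        path.append(u)
        u, v = backptrs[i][(u, v)], u
    path.reverse()
    return path

# Hypothetical toy model: "casa" is lexically more likely a verb,
# but the trigram context after a determiner favours the noun reading.
TAGS = ["DET", "N", "V"]
TRANS = {
    ("<s>", "<s>", "DET"): math.log(0.8),
    ("<s>", "<s>", "V"): math.log(0.2),
    ("<s>", "DET", "N"): math.log(0.9),
    ("<s>", "DET", "V"): math.log(0.1),
}
EMIT = {
    ("a", "DET"): math.log(0.5),
    ("casa", "N"): math.log(0.01),
    ("casa", "V"): math.log(0.02),
}
result = viterbi_trigram(["a", "casa"], TAGS, TRANS, EMIT)
```

In this toy model the path DET N scores 0.8 x 0.5 x 0.9 x 0.01, which is higher than DET V at 0.8 x 0.5 x 0.1 x 0.02, so the contextual model overrides the lexical preference.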

Module evolution

MARv1 -> MARv2

An analysis of the performance of the MARv module showed that some changes were necessary, namely to improve processing time, reduce memory use, and reduce the error rate. These three parameters were analysed for the reading of the dictionaries, for the standalone version of the module, and for the client-server version. The reading of the dictionaries was analysed independently from the rest of the application, in order to assess the impact of several changes that affect only this component of the module. This functionality is particularly relevant because the reading of the dictionaries is one of the main differences between the two existing versions: in the client-server version it is carried out only once, while in the standalone version it is performed every time a text has to be disambiguated.

Each implemented change was analysed, together with its impact on the system's performance, according to the three parameters mentioned above. Each comparison is based on the results of the previous test. The processing time results are the average of 30 runs of the application. The different changes were compared using the time indicated by the Real parameter, since this corresponds to the time the user actually waits for the disambiguation to complete. Memory use was quantified with the Valgrind tool (Nicholas Nethercote, 2003), considering the memory allocation information reported by this tool. Only the memory results at the end of the reading of the dictionaries by MARv were considered, since the goal was to assess the impact of the whole set of changes on memory use.
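The timing methodology (wall-clock time averaged over 30 runs) can be sketched as follows. This is only an illustration under assumptions: the helper average_wall_time is hypothetical, and the task it times stands in for one run of the application, which the original measurements obtained from the Real time reported for each run.

```python
import time

def average_wall_time(task, runs=30):
    """Average wall-clock seconds over `runs` calls of `task`."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()  # wall-clock timer, like "real" time
        task()
        total += time.perf_counter() - start
    return total / runs

# Hypothetical stand-in for one disambiguation run
average_seconds = average_wall_time(lambda: sum(range(10_000)), runs=30)
```

Wall-clock time is used rather than CPU time because it is what the user experiences, including time spent on I/O such as reading the dictionaries.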

Changes introduced

Suspending lexical probabilities: A brief analysis of the Viterbi algorithm implementation revealed an extension concerning lexical probabilities. With this extension, the algorithm used the lexical probabilities of the classes of the latest words, together with the probability of the current word and that of the classes associated with it. That is, since the system uses a 3-gram model, instead of considering the probability of a given word appearing in a given grammatical class, the extension considered the sequence of the last three words of the sentence under analysis. Comparing the performance of the algorithm with and without this extension showed that the error rate was lower in the original version, that is, without the extension (5.91%, against 6.83% with the extension).
Changing data entry structure
Converting probabilities into logarithms
Converting the dictionaries from text to numeric format
Changing from Map-like structures to Multidimensional vector-type
Disambiguation considering the sentence beginnings
Increasing the window of analysis to more than 7 elements (the change of the probabilistic model to logarithms allowed the entire sentence to be processed, instead of just 7 elements)
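The source does not say why logarithms made whole-sentence processing possible. One plausible reason, assumed here, is floating-point underflow: multiplying many small probabilities eventually yields zero, while summing their logarithms stays finite. The sketch below uses hypothetical per-token probabilities to show the effect.

```python
import math

def product_score(probs):
    """Score a path by multiplying raw probabilities."""
    score = 1.0
    for p in probs:
        score *= p
    return score

def log_score(probs):
    """Score the same path by summing log-probabilities."""
    return sum(math.log(p) for p in probs)

# Hypothetical values: 80 tokens, each step with probability 1e-5
probs = [1e-5] * 80
raw = product_score(probs)     # underflows: 1e-400 is not representable
logged = log_score(probs)      # stays finite, roughly 80 * log(1e-5)
```

With raw products, every candidate path over a long window would collapse to the same score of zero and could no longer be ranked. In log space the ranking is preserved, because the logarithm is monotonic.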

MARv2 -> MARv3

Comparison of the initial and final states of the tool, using the original DTD: significant improvement in processing time (reading of the dictionaries: 95.46%, client-server version: 89.22%, standalone version: 89.17%), in memory use up to the end of the reading of the dictionaries (98.98%), and in error rate (a gain of 23.72%, yielding a global error rate of 5.21%).
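The error rate gain appears to be a relative reduction. Assuming the initial error rate was the 6.83% reported earlier for the version with the lexical probability extension (the source does not state the baseline explicitly), the arithmetic can be checked with this small sketch, where relative_reduction is a hypothetical helper.

```python
def relative_reduction(before, after):
    """Relative reduction, in percent, from `before` to `after`."""
    return (before - after) / before * 100.0

# Assumed baseline of 6.83% and the reported final error rate of 5.21%
gain = round(relative_reduction(6.83, 5.21), 2)
```

Under that assumption the computed gain matches the reported 23.72%.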

MARv3 -> MARv4

MARv4 performs disambiguation of verbal lemmas, e.g. word="foi", lemma="ser" (was) or "ir" (went). It also resolves two cases of ambiguity in personal pronouns: the ambiguity between reflexive, dative and accusative pronouns (e.g. "me", "te", "nos", "vos"), and the ambiguity between nominative (subject) and oblique (prepositional object) pronouns, e.g. "ele fez" (he did) vs. "Eu gosto de_ele" (I like him). It uses maximum entropy models to perform the disambiguation.
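As a rough illustration of how a maximum entropy model can choose between the lemmas of "foi", the sketch below computes conditional probabilities from weighted features. It is not MARv4's model: the function maxent_probs, the feature names and the weights are all hypothetical.

```python
import math

def maxent_probs(features, weights, labels):
    """P(label | features) under a conditional maximum entropy model.

    weights[(feature, label)] is the weight of a feature for a label;
    missing pairs contribute 0.
    """
    scores = {
        y: sum(weights.get((f, y), 0.0) for f in features)
        for y in labels
    }
    top = max(scores.values())  # subtract the maximum for stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# Hypothetical weights: a following preposition suggests "ir" (went),
# a following adjective suggests "ser" (was).
WEIGHTS = {
    ("next_tag=PREP", "ir"): 1.5,
    ("next_tag=ADJ", "ser"): 2.0,
}
probs_foi = maxent_probs({"next_tag=PREP"}, WEIGHTS, ["ser", "ir"])
best_lemma = max(probs_foi, key=probs_foi.get)
```

In a trained model the weights would be estimated from annotated data; here they are set by hand only to show how contextual features shift the probability between the two lemmas.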

Publications

[1] Ricardo Ribeiro, Anotação Morfossintáctica Desambiguada do Português, MSc thesis, Instituto Superior Técnico, Lisbon, Portugal, March 2003

[2] Ricardo Ribeiro, Luís C. Oliveira, Isabel Trancoso, Using Morphossyntactic Information in TTS Systems: Comparing Strategies for European Portuguese, In PROPOR'2003 - 6th Workshop on Computational Processing of the Portuguese Language, Springer-Verlag, Heidelberg, series Lecture Notes in Artificial Intelligence, pages 143-150, Faro, Portugal, June 2003

[3] David Rodrigues, Uma evolução do sistema ShRep. Optimização, interface gráfica e integração de mais duas ferramentas, MSc thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa, Portugal, November 2007