RuDriCo2 and MARv4

<div style="float:right;">__TOC__</div>

=== RuDriCo2 ===

==== Acronym ====
'''''RuDriCo''''' stands for '''''Ru'''''le '''''Dri'''''ven '''''Co'''''nverter

==== Brief Description ====
[[RuDriCo2]]'s main goal is to adjust the results produced by the [[LexMan]] morphological analyzer to the specific needs of each parser. To achieve this, it modifies the segmentation produced by LexMan. For example, it may contract expressions provided by the morphological analyzer, such as ''ex-'' and ''aluno'', into a single segment, ''ex-aluno''; or it may do the opposite and expand expressions such as ''nas'' into two segments, ''em'' and ''as'', depending on what the parser needs. Altering the segmentation is also useful for tasks such as the recognition of numbers and dates. The segmentation is modified through declarative rules based on the concept of pattern matching. [[RuDriCo2]] can also be used to solve (or introduce) morphosyntactic ambiguities. When [[RuDriCo2]] is executed in the processing chain, it performs all of these tasks.
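The general idea of such pattern-matching contraction and expansion rules can be sketched in a few lines of Python. This is an illustration only; RuDriCo2's actual rules are written in the declarative formalism shown further below:
<pre>
# Toy illustration of contraction/expansion over a token list; RuDriCo2's
# real rules are declarative patterns, not Python (see examples below).
CONTRACT = {("ex-", "aluno"): "ex-aluno"}      # two segments -> one
EXPAND = {"nas": ["em", "as"]}                 # one segment -> two

def apply_rules(tokens):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in CONTRACT:                   # contraction rule matches
            out.append(CONTRACT[pair])
            i += 2
        elif tokens[i] in EXPAND:              # expansion rule matches
            out.extend(EXPAND[tokens[i]])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

print(apply_rules(["ex-", "aluno", "nas"]))    # ['ex-aluno', 'em', 'as']
</pre>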

The input of [[RuDriCo2]] is a set of rules and the text to process. The input text is in XML format and consists of a set of sentences, where each sentence has one or more segments. The segments represent words, each consisting of a surface form (''word'') and one or more annotations (''class''). An annotation is composed of a lemma (''root'') and a set of attribute-value pairs. The attribute-value pairs represent the properties of each annotation, e.g. the category of a word.
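As an illustration, one such segment might be serialized along the following lines. The names ''word'', ''class'', and ''root'' are the ones mentioned above, but the exact layout (the surface as an attribute, the wrapping ''sentence'' element) is an assumption for this sketch, not necessarily the chain's actual schema:
<pre>
<sentence>
  <word surface="partido">
    <class root="partido" CAT="adj" NUM="s" GEN="m" DEG="nor"/>
    <class root="partir" CAT="ver" MOD="par" NUM="s" GEN="m"/>
  </word>
</sentence>
</pre>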

In the following example, the word ''partido'' is represented as an ambiguous segment containing one surface form and three annotations:
  <tt style="color:red">[surface='partido',lemma='partido',CAT='adj',NUM='s',GEN=m',DEG='nor']</tt>
<tt style="color:red">    [lemma='partido,CAT='adj',SCT='com',NUM='s',GEN=m',DEG='nor']</tt>
<tt style="color:red">    [lemma='partir,CAT='ver',MOD='par',NUM='s',GEN=m']</tt>
 
[[RuDriCo2]]  has two types of rules: disambiguation and segmentation rules. ''Disambiguation'' rules allow the system to choose the correct category of a word by considering the surrounding context. ''Segmentation'' rules change the segmentation and can be divided into contraction and expansion rules. ''Contraction'' rules convert two or more segments into a single one. ''Expansion'' rules transform a segment into at least two segments.
 
An example of an expansion rule is to transform the segment ''Na'' into two segments ''Em'' and ''a''. An example of a contraction rule is to turn segments ''Coreia'', ''do'' and ''Sul'' into a single segment ''Coreia do Sul''.
 
Example of a disambiguation rule for the form ''a'', which can be an article (art), a pronoun (pro), or a preposition (pre); it selects the POS article when this form is preceded by a preposition:
<tt style="color:red">  |[CAT='pre']!|</tt>
<tt style="color:red">  [surface='a',CAT='art'][CAT=~'art']</tt>
<tt style="color:red"> :=</tt>
<tt style="color:red">  [CAT='art']+.</tt>
 
Example of a contraction (join) rule that joins the sequence of tokens ''África do Sul'' into a single token, which is then given the features of POS (noun), subcategory (proper noun), gender, and number:
<tt style="color:red">0>[surface='África'],</tt>
<tt style="color:red">  [surface='do'],</tt>
<tt style="color:red">  [surface='Sul']</tt>
<tt style="color:red"> :></tt>
<tt style="color:red">  [surface=@@+,lemma='África do Sul',CAT='nou',SCT='prp',GEN='f',NUM='s'].</tt>
 
Example of an expansion rule that resolves the contracted form ''ao'' (to_the.masc.sg), splitting it into the preposition ''a'' (to) and the definite article ''o'' (the.masc.sg):
<tt style="color:red">0>[surface='ao',CAT='pre']</tt>
<tt style="color:red"> :<</tt>
<tt style="color:red">  [surface='a',lemma='a',CAT='pre'],</tt>
<tt style="color:red">  [surface='o',lemma='o',CAT='art',SCT='def',NUM='s',GEN='m'].</tt>




==== Module evolution ====
'''''RuDriCo1''''' is an evolution of [http://www.inesc-id.pt/pt/indicadores/Ficheiros/2365.pdf PAsMo], which, in turn, is an evolution of MPS (Module Post-SMorph, [[media:MPS1999.txt|bibtex]]).

In 2009, '''''RuDriCo1''''' was substantially slower than the remaining modules of the chain. [2] describes the changes made to the system to improve its performance, namely by using the concept of layers and by reducing the number of variables contained in the rules. It also describes changes to the rule syntax, such as the addition of new operators and contexts, which made the rules more expressive.

The new version, named [[RuDriCo2]], is significantly (10 times) faster than the previous version, uses a more expressive rule language (allowing negation, disjunction, and the use of regular expressions both in the lemma and in the surface form), and brings the rule syntax closer to that of the XIP parser. It also validates the input data, producing error messages and warnings for potential problems. [[RuDriCo2]] is thus a significant improvement over the original module.


==== Demo ====
[[RuDriCo2]] can be tested [http://string.l2f.inesc-id.pt/demo/tokenizer.pl here]

==== User's Manual ====
Though [[RuDriCo2]] is not freely available, the user's manual will be available [[here]] as soon as possible.
 
 
==== Publications ====
'''[1]''' Cláudio Diniz, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/5451.pdf Um Conversor baseado em regras de transformação declarativas], MSc thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa, Portugal, October 2010 ([[media:Diniz2010b.txt|bibtex]])
 
'''[2]''' Cláudio Diniz, Nuno Mamede, João D. Pereira, [http://inforum.org.pt/INForum2010/papers/gestao-e-tratamento-de-informacao/Paper085.pdf RuDriCo2 - a faster disambiguator and segmentation modifier], in II Simpósio de Informática (INForum 2010), Universidade do Minho, pages 573-584, September 2010 ([[media:Diniz2010a.txt|bibtex]])

=== MARv4 ===

==== Acronym ====
'''''MARv''''' stands for '''M'''orphosyntactic '''A'''mbiguity '''R'''esol'''v'''er

==== Introduction ====
[[MARv2]]'s architecture comprises two submodules: a linguistically-oriented disambiguation rule module and a probabilistic disambiguation module. The rule-based module is no longer used in the STRING chain, because that function is now implemented by the [[RuDriCo2|RuDriCo]] module.

[[MARv2]] is based on [http://en.wikipedia.org/wiki/Hidden_Markov_Model Hidden Markov Models], using the [http://en.wikipedia.org/wiki/Viterbi_algorithm Viterbi algorithm]. The language model is based on second-order (trigram) models, which codify contextual information concerning entities, and on unigrams, which codify lexical information.
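In a standard trigram HMM formulation of this kind (shown here only as an illustration of the general technique, not as MARv's exact model), the decoder searches for the class sequence that maximizes the product of the contextual and lexical models:

<math>\hat{c}_1^n = \underset{c_1 \ldots c_n}{\arg\max} \prod_{i=1}^{n} P(c_i \mid c_{i-2}, c_{i-1})\, P(w_i \mid c_i)</math>

where <math>w_i</math> is the i-th word of the sentence and <math>c_i</math> its morphosyntactic class.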


==== Module evolution ====

===== MARv1 -> MARv2 =====
An analysis of the performance of the MARv module showed that several changes were necessary, namely improving the processing time, reducing memory use, and reducing the error rate. These three parameters were analysed for the reading of the dictionaries, for the standalone version of the module, and for the client-server version. The reading of the dictionaries was analysed independently from the rest of the application, in order to assess the impact of the changes exclusive to this component. This analysis is particularly relevant since the reading of the dictionaries is one of the main differences between the two existing versions: in the client-server version it is carried out only once, while in the standalone version it is performed every time a text has to be disambiguated.

Each implemented change was analysed for its impact on the system's performance, according to the three parameters mentioned above, and each comparison is based on the results of the previous test. The processing-time results are averages over 30 runs of the application. The comparison between the different changes considered the time indicated by the Real parameter, since this corresponds to the time the user effectively waits for the disambiguation. Memory use was quantified with the Valgrind tool (Nicholas Nethercote, 2003), considering the memory-allocation information reported by that tool. Only the memory results at the end of the reading of the dictionaries were considered, since the goal was to assess the impact of the whole set of changes on memory use.

====== Changes introduced ======
*Suspending lexical probabilities: a brief analysis of the Viterbi algorithm implementation revealed an extension concerning lexical probabilities. The algorithm used the lexical probabilities of the classes of the latest words, together with the probability of the current word and of the classes associated with it. In this extension, since the system uses a 3-gram model, instead of considering the probability of a given word appearing in a given grammatical class, the sequence of the last three words of the sentence under analysis is considered. After analysing the performance of the algorithm with and without this extension, the error rate proved to be smaller (from 6.83% to 5.91%) in the original version, that is, without the extension.
*Changing the data entry structure
*Converting probabilities into logarithms
*Converting the dictionaries from text to numeric format
*Changing from map-like structures to multidimensional vectors
*Disambiguation considering sentence beginnings
*Increasing the analysis window to more than 7 elements (the change of the probabilistic model to logarithms allowed the processing of the entire sentence, instead of just 7 elements; see the sketch below)
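The items about logarithms and the analysis window can be made concrete with a short sketch. The following minimal log-space trigram Viterbi decoder is a Python illustration of the general technique, with toy smoothing and toy probability tables, not MARv's actual code; working with sums of logarithms instead of products of probabilities avoids numeric underflow, so arbitrarily long sentences can be decoded:
<pre>
import math

def logp(table, key, floor=1e-9):
    # Log-probability with a small floor for unseen events (toy smoothing).
    return math.log(table.get(key, floor))

def viterbi_trigram(words, classes, trans, lex):
    # trans[(c2, c1, c)] ~ P(c | c2, c1): contextual (trigram) model.
    # lex[(w, c)]        ~ P(w | c):      lexical (unigram) model.
    B = "BOS"                  # boundary padding for the left context
    best = {(B, B): 0.0}       # best log-score per (previous, current) state
    back = []                  # one backpointer table per word
    for w in words:
        nbest, nback = {}, {}
        for (c2, c1), score in best.items():
            for c in classes:
                s = score + logp(trans, (c2, c1, c)) + logp(lex, (w, c))
                if s > nbest.get((c1, c), -math.inf):
                    nbest[(c1, c)] = s
                    nback[(c1, c)] = (c2, c1)
        best = nbest
        back.append(nback)
    # Follow the backpointers from the best final state.
    state = max(best, key=best.get)
    seq = [state[1]]
    for nback in reversed(back[1:]):
        state = nback[state]
        seq.append(state[1])
    return list(reversed(seq))

# Tiny run: choose classes for "a partido" with invented probabilities.
classes = ["art", "pre", "nou", "adj", "ver"]
trans = {("BOS", "BOS", "art"): 0.5, ("BOS", "art", "nou"): 0.8}
lex = {("a", "art"): 0.6, ("a", "pre"): 0.4,
       ("partido", "nou"): 0.5, ("partido", "adj"): 0.3, ("partido", "ver"): 0.2}
print(viterbi_trigram(["a", "partido"], classes, trans, lex))  # ['art', 'nou']
</pre>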

===== MARv2 -> MARv3 =====
Comparing the initial and final states of the tool, using the original DTD, there was a significant improvement in processing time (dictionaries' reading: 95.46%; client-server version: 89.22%; standalone version: 89.17%), in memory use until the end of the reading of the dictionaries (98.98%), and in error rate (a gain of 23.72%, yielding a global error rate of 5.21%).

===== MARv3 -> MARv4 =====
MARv4 performs disambiguation of verbal lemmas, e.g. word="foi", lemma="ser" (was) or lemma="ir" (went). It also solves two cases of ambiguity in personal pronouns: the ambiguity between reflexive, dative, and accusative pronouns (e.g. "me", "te", "nos", "vos"), and the ambiguity between nominative (subject) and oblique (prepositional object) pronouns, e.g. "ele fez" (he did) vs. "eu gosto de_ele" (I like him). It uses maximum entropy models to perform the disambiguation, as sketched below.
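As a rough illustration of how a maximum entropy (log-linear) model makes such a choice, the sketch below scores the two candidate lemmas of ''foi'' against context features; the feature names and weights are invented for the sketch, not MARv4's actual model:
<pre>
import math

# Invented (feature, label) weights for the sketch.
WEIGHTS = {
    ("next_is_location", "ir"): 2.0,    # "foi a Lisboa" -> went
    ("next_is_adjective", "ser"): 1.8,  # "foi bonito"   -> was
    ("bias", "ser"): 0.3,
}

def maxent_choose(features, labels):
    # P(label | features) is proportional to exp(sum of active weights).
    scores = {l: sum(WEIGHTS.get((f, l), 0.0) for f in features) for l in labels}
    z = sum(math.exp(s) for s in scores.values())
    probs = {l: math.exp(s) / z for l, s in scores.items()}
    return max(probs, key=probs.get), probs

# "Ele foi a Lisboa": the location context favours the lemma "ir".
print(maxent_choose({"bias", "next_is_location"}, ["ser", "ir"]))
</pre>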

==== Publications ====
'''[1]''' Ricardo Ribeiro, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/1424.pdf Anotação Morfossintáctica Desambiguada do Português], MSc thesis, Instituto Superior Técnico, Lisbon, Portugal, March 2003 ([[media:Ribeiro2003.txt|bibtex]])

'''[2]''' Ricardo Ribeiro, Luís C. Oliveira, Isabel Trancoso, Using Morphossyntactic Information in TTS Systems: Comparing Strategies for European Portuguese, in PROPOR'2003 - 6th Workshop on Computational Processing of the Portuguese Language, Springer-Verlag, Heidelberg, series Lecture Notes in Artificial Intelligence, pages 143-150, Faro, Portugal, June 2003

'''[3]''' David Rodrigues, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/3318.pdf Uma evolução do sistema ShRep. Optimização, interface gráfica e integração de mais duas ferramentas], MSc thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisbon, Portugal, November 2007 ([[media:Rodrigues2007.txt|bibtex]])