Eugenio: Created page with "
TOC
==== Acronym ==== '''''RuDriCo''''' stands for '''''Ru'''''le '''''Dri'''''ven '''''Co'''''nverter ==== Brief Description ==== RuDriCo2's main goal is to provide for an adjustment of the results produced by the LexMan morphological analyzer to the specific needs of each parser. In order to achieve this, it modifies the segmentation that is done by the former. For example, it might contract expressions provided by the morp..."

2024-01-10T13:33:50Z


<div style="float:right;">__TOC__</div>
==== Acronym ====
'''''RuDriCo''''' stands for '''''Ru'''''le '''''Dri'''''ven '''''Co'''''nverter

==== Brief Description ====
[[RuDriCo2]]'s main goal is to adapt the results produced by the [[LexMan]] morphological analyzer to the specific needs of each parser. To achieve this, it modifies the segmentation produced by the analyzer. For example, it can contract expressions provided by the morphological analyzer, such as ''ex-'' and ''aluno'', into one segment, ''ex-aluno''; or it can do the opposite and expand expressions such as ''nas'' into two segments, ''em'' and ''as''. Which of the two is applied depends on what the parser needs. Altering the segmentation is also useful for tasks such as the recognition of numbers and dates. Segmentation changes are expressed as declarative rules based on pattern matching. [[RuDriCo2]] can also be used to solve (or introduce) morphosyntactic ambiguities. When [[RuDriCo2]] runs in the processing chain, it performs all of these tasks.

The input of [[RuDriCo2]] is a set of rules and the text to process. The input text is in XML format and consists of a set of sentences, where each sentence has one or more segments. Each segment represents a word and is made up of a surface (''word'') and one or more annotations (''class''). An annotation is composed of a lemma (''root'') and a set of attribute-value pairs. The attribute-value pairs represent the properties of each annotation, e.g. the category of a word.

In this example, the word "''partido''" is represented as an ambiguous segment containing one surface and three annotations.
<tt style="color:red">[surface='partido',lemma='partido',CAT='adj',NUM='s',GEN='m',DEG='nor']</tt>
<tt style="color:red"> [lemma='partido',CAT='adj',SCT='com',NUM='s',GEN='m',DEG='nor']</tt>
<tt style="color:red"> [lemma='partir',CAT='ver',MOD='par',NUM='s',GEN='m']</tt>
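The segment structure described above can be sketched in Python as follows. This is only an illustration of the data model (one surface, one or more annotations, each with a lemma and attribute-value pairs); the class and field names are hypothetical and do not reflect RuDriCo2's actual implementation or its XML schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    # Hypothetical name: one reading of a word (lemma plus attribute-value pairs).
    lemma: str
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Segment:
    # Hypothetical name: a surface form with one or more candidate annotations.
    surface: str
    annotations: List[Annotation]

    def is_ambiguous(self) -> bool:
        # A segment is ambiguous when it carries more than one annotation.
        return len(self.annotations) > 1


# The "partido" example from the text, with its three annotations.
partido = Segment(
    surface="partido",
    annotations=[
        Annotation("partido", {"CAT": "adj", "NUM": "s", "GEN": "m", "DEG": "nor"}),
        Annotation("partido", {"CAT": "adj", "SCT": "com", "NUM": "s", "GEN": "m", "DEG": "nor"}),
        Annotation("partir", {"CAT": "ver", "MOD": "par", "NUM": "s", "GEN": "m"}),
    ],
)

lemmas = [a.lemma for a in partido.annotations]
```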

[[RuDriCo2]] has two types of rules: disambiguation and segmentation rules. ''Disambiguation'' rules allow the system to choose the correct category of a word by considering the surrounding context. ''Segmentation'' rules change the segmentation and can be divided into contraction and expansion rules. ''Contraction'' rules convert two or more segments into a single one. ''Expansion'' rules transform a segment into at least two segments.

An example of an expansion rule is to transform the segment ''Na'' into two segments ''Em'' and ''a''. An example of a contraction rule is to turn segments ''Coreia'', ''do'' and ''Sul'' into a single segment ''Coreia do Sul''.

Example of a disambiguation rule for the form ''a'', which can be an article (art), a pronoun (pro) or a preposition (pre). The rule selects the article reading when this form is preceded by a preposition:
<tt style="color:red"> |[CAT='pre']!|</tt>
<tt style="color:red"> [surface='a',CAT='art'][CAT=~'art']</tt>
<tt style="color:red"> :=</tt>
<tt style="color:red"> [CAT='art']+.</tt>
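The effect described for this rule can be approximated in Python as below. This is a minimal sketch of the behaviour stated in the prose (keep only the article reading of ''a'' after a preposition), not of the rule engine itself: the function name, the dictionary layout and the sample lemmas are assumptions, and the exact semantics of the rule operators (such as <tt>!</tt>) are not modelled.

```python
def select_article_after_preposition(sentence):
    """Keep only the 'art' annotations of an ambiguous 'a' that follows a preposition.

    `sentence` is a list of segments; each segment is a dict with a 'surface'
    string and a list of 'annotations' (dicts of attribute-value pairs).
    This layout is an assumption made for illustration.
    """
    result = []
    for i, seg in enumerate(sentence):
        cats = [a.get("CAT") for a in seg["annotations"]]
        # Left context: the previous segment has a preposition reading.
        prev_is_prep = i > 0 and any(
            a.get("CAT") == "pre" for a in sentence[i - 1]["annotations"]
        )
        # The segment must have an article reading and at least one other reading.
        ambiguous_art = "art" in cats and any(c != "art" for c in cats)
        if seg["surface"] == "a" and prev_is_prep and ambiguous_art:
            seg = {
                "surface": seg["surface"],
                "annotations": [a for a in seg["annotations"] if a.get("CAT") == "art"],
            }
        result.append(seg)
    return result


# Illustrative input: a preposition followed by the three-way ambiguous form 'a'.
sentence = [
    {"surface": "para", "annotations": [{"lemma": "para", "CAT": "pre"}]},
    {"surface": "a", "annotations": [
        {"lemma": "a", "CAT": "art"},
        {"lemma": "a", "CAT": "pro"},
        {"lemma": "a", "CAT": "pre"},
    ]},
]
disambiguated = select_article_after_preposition(sentence)
```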

Example of a contraction (join) rule that joins the sequence of tokens ''África do Sul'', producing a single token, which is then given the features of POS (noun), subcategory (proper noun), gender and number:
<tt style="color:red">0>[surface='África'],</tt>
<tt style="color:red"> [surface='do'],</tt>
<tt style="color:red"> [surface='Sul']</tt>
<tt style="color:red"> :></tt>
<tt style="color:red"> [surface=@@+,lemma='África do Sul',CAT='nou',SCT='prp',GEN='f',NUM='s'].</tt>
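A contraction of this kind can be sketched in Python as follows. The sketch assumes that the new surface is the matched surfaces joined with spaces, which is how the result ''África do Sul'' is described in the text; the function name and segment layout are hypothetical, and the leading <tt>0></tt> of the rule is not modelled.

```python
def contract(sentence, surfaces, annotation):
    """Replace each run of segments whose surfaces equal `surfaces` by one segment.

    `sentence` is a list of dicts with 'surface' and 'annotations' keys
    (an assumed layout); `annotation` is the single annotation given to
    the joined segment.
    """
    n = len(surfaces)
    out = []
    i = 0
    while i < len(sentence):
        window = [seg["surface"] for seg in sentence[i:i + n]]
        if window == surfaces:
            # Join the matched surfaces into one segment and skip past them.
            out.append({"surface": " ".join(window), "annotations": [dict(annotation)]})
            i += n
        else:
            out.append(sentence[i])
            i += 1
    return out


tokens = [
    {"surface": "África", "annotations": []},
    {"surface": "do", "annotations": []},
    {"surface": "Sul", "annotations": []},
]
joined = contract(
    tokens,
    ["África", "do", "Sul"],
    {"lemma": "África do Sul", "CAT": "nou", "SCT": "prp", "GEN": "f", "NUM": "s"},
)
```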

Example of an expansion rule that resolves the contracted form ''ao'' (to_the.masc.sg), splitting it into the preposition ''a'' (to) and the definite article ''o'' (the.masc.sg):
<tt style="color:red">0>[surface='ao',CAT='pre']</tt>
<tt style="color:red"> :<</tt>
<tt style="color:red"> [surface='a',lemma='a',CAT='pre'],</tt>
<tt style="color:red"> [surface='o',lemma='o',CAT='art',SCT='def',NUM='s',GEN='m'].</tt>
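The expansion can be sketched in Python in the same spirit. The function name and data layout are assumptions made for illustration; the sketch only reproduces the described effect (one segment replaced by two) and does not model the <tt>0></tt> prefix or any other part of the rule language.

```python
def expand(sentence, surface, cat, replacements):
    """Replace every segment with the given surface and category by new segments.

    `sentence` is a list of dicts with 'surface' and 'annotations' keys
    (an assumed layout); `replacements` is a list of (surface, annotation)
    pairs describing the segments to emit instead.
    """
    out = []
    for seg in sentence:
        matches = seg["surface"] == surface and any(
            a.get("CAT") == cat for a in seg["annotations"]
        )
        if matches:
            for new_surface, annotation in replacements:
                out.append({"surface": new_surface, "annotations": [dict(annotation)]})
        else:
            out.append(seg)
    return out


tokens = [{"surface": "ao", "annotations": [{"lemma": "ao", "CAT": "pre"}]}]
expanded = expand(
    tokens,
    "ao",
    "pre",
    [
        ("a", {"lemma": "a", "CAT": "pre"}),
        ("o", {"lemma": "o", "CAT": "art", "SCT": "def", "NUM": "s", "GEN": "m"}),
    ],
)
```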

==== Module evolution ====
'''''RuDriCo1''''' is an evolution of [http://www.inesc-id.pt/pt/indicadores/Ficheiros/2365.pdf PAsMo], which is, in turn, an evolution of MPS (Module Post-SMorph, [[media:MPS1999.txt|bibtex]]).

In 2009, '''''RuDriCo1''''' was substantially slower than the remaining modules of the chain. [2] describes the changes made to the system to improve its performance by using the concept of layers and by reducing the number of variables contained in the rules. It also describes the changes to the rule syntax, such as the addition of new operators and contexts, which make the rules more expressive.
The new version, named [[RuDriCo2]], is 10 times faster than the previous version, uses a more expressive language (allowing negation, disjunction, and regular expressions in both the lemma and the surface form) and moves closer to the XIP parser syntax. It also validates the input data, issuing error messages and warnings for potential problems. [[RuDriCo2]] is a significant improvement over the original module.

==== Demo ====
[[RuDriCo2]] can be tested [http://string.l2f.inesc-id.pt/demo/tokenizer.pl here].

==== User's Manual ====
Though [[RuDriCo2]] is not freely available, the user's manual will be available [[here]] as soon as possible.

==== Publications ====
'''[1]''' Cláudio Diniz, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/5451.pdf Um Conversor baseado em regras de transformação declarativas], MSc thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa, Portugal, October 2010 ([[media:Diniz2010b.txt|bibtex]])

'''[2]''' Cláudio Diniz, Nuno Mamede, João D. Pereira, [http://inforum.org.pt/INForum2010/papers/gestao-e-tratamento-de-informacao/Paper085.pdf RuDriCo2 - a faster disambiguator and segmentation modifier], in II Simpósio de Informática (INForum 2010), Universidade do Minho, pages 573-584, September 2010 ([[media:Diniz2010a.txt|bibtex]])
