RuDriCo2

From String
Revision as of 17:07, 6 March 2012 by Njm (Module evolution)
Acronym

RuDriCo stands for Rule Driven Converter


Brief Description

RuDriCo2's main goal is to adjust the results produced by a morphological analyzer to the specific needs of each parser. To achieve this, it modifies the segmentation produced by the analyzer. For example, it can contract expressions provided by the morphological analyzer, such as ex- and aluno, into one segment: ex-aluno; or it can do the opposite and expand expressions such as nas into two segments: em and as. Which of the two is applied depends on what the parser needs. Altering the segmentation is also useful for tasks such as the recognition of numbers and dates. Segmentation changes are specified through declarative rules, which are based on the concept of pattern matching. RuDriCo2 can also be used to solve (or introduce) morphosyntactic ambiguities. All of these tasks are performed when RuDriCo2 is executed in the processing chain.

The input of RuDriCo2 is a set of rules and the text to process. The input text is in XML format and consists of a set of sentences, where each sentence has one or more segments. Each segment represents a word and consists of a surface (the word form) and one or more annotations (classes). An annotation is composed of a lemma (root) and a set of attribute-value pairs. The attribute-value pairs represent the properties of each annotation, e.g. the category of a word.

In this example, the word partido is represented as an ambiguous segment containing one surface and three annotations.
[surface='partido', ]
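
The input structure described above (segments with one surface and one or more annotations, each annotation holding a lemma and attribute-value pairs) can be sketched in Python as follows. This is only an illustrative model, not RuDriCo2's actual XML format or internal representation: the class names `Annotation` and `Segment` and the method `is_ambiguous` are hypothetical, and the sample segment uses the form a with the three categories (art, pro, pre) mentioned on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One reading of a word: a lemma plus attribute-value pairs."""
    lemma: str
    attrs: dict = field(default_factory=dict)  # e.g. {"CAT": "art"}


@dataclass
class Segment:
    """A word: one surface form and one or more annotations."""
    surface: str
    annotations: list

    def is_ambiguous(self) -> bool:
        # A segment is ambiguous when it carries more than one annotation.
        return len(self.annotations) > 1


# Hypothetical ambiguous segment for the form "a" (article, pronoun or preposition).
a = Segment("a", [
    Annotation("a", {"CAT": "art"}),
    Annotation("a", {"CAT": "pro"}),
    Annotation("a", {"CAT": "pre"}),
])

# A sentence is simply a sequence of segments.
sentence = [a]
```

In this model, disambiguation amounts to shrinking `annotations`, while segmentation rules replace segments in the sentence list.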


RuDriCo2 has two types of rules: disambiguation rules and segmentation rules. The former allow the system to choose the correct category of a word by considering the surrounding context. Segmentation rules change the segmentation and can be divided into contraction and expansion rules. Contraction rules convert two or more segments into a single one. Expansion rules transform a segment into at least two segments. An example of an expansion rule is to transform the segment Na into two segments, Em and a. An example of a contraction rule is to turn the segments Coreia, do and Sul into a single segment, Coreia do Sul.

Example of a disambiguation rule (disambiguates the form «a», which can be an article (art), a pronoun (pro) or a preposition (pre), when it is preceded by the preposition de, classifying it as an article (art)):

0> |[CAT='pre']!|
   [surface='a',CAT='art'][CAT=~'art']
  :=
   [CAT='art']+.
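
The effect described for this rule (selecting the article reading of a when a preposition precedes it) can be approximated in Python as below. This is a rough sketch of the general idea of context-based annotation selection, not an implementation of the RuDriCo2 rule language: it does not interpret the rule's operators, it ignores the segment that follows a in the rule, and the dictionary layout and the function name `disambiguate_a` are assumptions made for illustration.

```python
def disambiguate_a(segments):
    """Keep only the 'art' annotations of "a" when the previous segment
    has a preposition ('pre') annotation. Segments are plain dicts here."""
    for prev, seg in zip(segments, segments[1:]):
        prev_is_pre = any(a["CAT"] == "pre" for a in prev["annotations"])
        art = [a for a in seg["annotations"] if a["CAT"] == "art"]
        if seg["surface"] == "a" and prev_is_pre and art:
            seg["annotations"] = art  # discard the other readings
    return segments


# Hypothetical input: "de" followed by the three-way ambiguous "a".
segments = [
    {"surface": "de", "annotations": [{"lemma": "de", "CAT": "pre"}]},
    {"surface": "a", "annotations": [
        {"lemma": "a", "CAT": "art"},
        {"lemma": "a", "CAT": "pro"},
        {"lemma": "a", "CAT": "pre"},
    ]},
]
result = disambiguate_a(segments)
```

After the call, the segment a keeps a single annotation, the article one, and the segment de is unchanged.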

Example of a join rule (joins ....):

0> [surface='África'],
   [surface='do'],
   [surface='Sul']
  :>
   [surface=@@+,lemma='África do Sul',CAT='nou',SCT='prp',GEN='f',NUM='s'].
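
What this rule does to a sentence can be illustrated with a small Python sketch of contraction: a run of consecutive segments whose surfaces match a pattern is replaced by one segment. The function name `contract` and the dictionary layout are hypothetical, and joining the surfaces with a space is an assumption about how the new surface is built, not a statement about the rule language.

```python
def contract(segments, surfaces, annotation):
    """Replace each run of segments whose surfaces equal `surfaces`
    with a single segment carrying `annotation`."""
    out, i, n = [], 0, len(surfaces)
    while i < len(segments):
        window = [s["surface"] for s in segments[i:i + n]]
        if window == surfaces:
            out.append({"surface": " ".join(window),
                        "annotations": [dict(annotation)]})
            i += n  # skip the segments that were joined
        else:
            out.append(segments[i])
            i += 1
    return out


# Hypothetical input sentence; annotations are left empty for brevity.
words = ["visitou", "África", "do", "Sul"]
segments = [{"surface": w, "annotations": []} for w in words]
joined = contract(
    segments,
    ["África", "do", "Sul"],
    {"lemma": "África do Sul", "CAT": "nou", "SCT": "prp", "GEN": "f", "NUM": "s"},
)
```

Here `joined` has two segments, visitou and África do Sul, and the second one carries the single annotation given in the rule.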

Example of an expansion rule (................):

0> [surface='ao',CAT='pre']
    :<
     [surface='a',lemma='a',CAT='pre'],
     [surface='o',lemma='o',CAT='art',SCT='def',NUM='s',GEN='m'].
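
The opposite operation, expansion, can be sketched in the same style: a segment that matches a surface and a category is replaced by several new segments. As before, this is a minimal illustration under assumed data structures, not RuDriCo2 code; the function name `expand` and the dictionary layout are hypothetical.

```python
def expand(segments, surface, cat, replacements):
    """Replace every segment with the given surface and a `cat` annotation
    by the segments described in `replacements` (surface, annotation pairs)."""
    out = []
    for seg in segments:
        matches = seg["surface"] == surface and any(
            a.get("CAT") == cat for a in seg["annotations"])
        if matches:
            for new_surface, annotation in replacements:
                out.append({"surface": new_surface,
                            "annotations": [dict(annotation)]})
        else:
            out.append(seg)
    return out


# Hypothetical input: "ao" analysed as a preposition, followed by another word.
segments = [
    {"surface": "ao", "annotations": [{"lemma": "ao", "CAT": "pre"}]},
    {"surface": "aluno", "annotations": []},
]
split = expand(segments, "ao", "pre", [
    ("a", {"lemma": "a", "CAT": "pre"}),
    ("o", {"lemma": "o", "CAT": "art", "SCT": "def", "NUM": "s", "GEN": "m"}),
])
```

The resulting list has three segments, a, o and aluno, with the first two carrying the annotations given in the rule.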


Module evolution

RuDriCo2 is an evolution of the RuDriCo system, which is itself an evolution of the PAsMo system, which in turn is an evolution of the MPS system. RuDriCo2 has two functions: modifying the segmentation and selecting or removing annotations from ambiguous segments.


RuDriCo1 was substantially slower than the remaining modules of the chain. RuDriCo2 is a rule-based morphological disambiguator that can also change the segmentation (joining or splitting tokens). [2] describes the changes made to the system to improve its performance, namely the use of layers and a reduction in the number of variables contained in the rules. It also describes the changes in rule syntax, such as the addition of new operators and contexts, which make the rules more expressive. The new version, named RuDriCo2, is significantly (10 times) faster than the previous version, uses a more expressive language (allowing negation, disjunction, and regular expressions in both the lemma and the surface form), and brings the rule syntax closer to that of the XIP parser. It also validates the input data and issues error messages and warnings for potential problems. Overall, RuDriCo2 is a significant improvement over the original module.

User's Manual

Although RuDriCo2 is not freely available, the user's manual is here.


Publications

[1] Cláudio Diniz, Um Conversor baseado em regras de transformação declarativas, MSc thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa, Portugal, October 2010

[2] Cláudio Diniz, Nuno Mamede, João D. Pereira, RuDriCo2 - a faster disambiguator and segmentation modifier, in II Simpósio de Informática (INForum 2010), Universidade do Minho, pages 573-584, September 2010