RuDriCo2

Acronym

RuDriCo stands for Rule Driven Converter


Brief Description

RuDriCo2's main goal is to adjust the results produced by the LexMan morphological analyzer to the specific needs of each parser. To achieve this, it modifies the segmentation produced by the analyzer. For example, it can contract expressions provided by the morphological analyzer, such as ex- and aluno, into one segment: ex-aluno; or it can do the opposite and expand expressions such as nas into two segments: em and as. Which of the two is applied depends on what the parser needs. Altering the segmentation is also useful for tasks such as the recognition of numbers and dates. Segmentation changes are specified through declarative rules based on pattern matching. RuDriCo2 can also be used to resolve (or introduce) morphosyntactic ambiguities. All of these tasks are carried out when RuDriCo2 runs in the processing chain.

The input of RuDriCo2 is a set of rules and the text to process. The input text is in XML format and consists of a set of sentences, where each sentence has one or more segments. Each segment represents a word and is made up of a surface (the word form) and one or more annotations (classes). An annotation is composed of a lemma (root) and a set of attribute-value pairs. The attribute-value pairs represent the properties of each annotation, e.g. the category of a word.

In the following example, the word "partido" is represented as an ambiguous segment containing one surface and three annotations.

[surface='partido',lemma='partido',CAT='adj',NUM='s',GEN='m',DEG='nor']
     [lemma='partido',CAT='adj',SCT='com',NUM='s',GEN='m',DEG='nor']
     [lemma='partir',CAT='ver',MOD='par',NUM='s',GEN='m']
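
The segment structure described above can be pictured with a small Python sketch. This is only an illustration of the data model, not RuDriCo2's actual implementation or its XML format; the class names `Annotation` and `Segment` and the method `is_ambiguous` are hypothetical, and the attribute values are copied from the "partido" example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """One reading of a word: a lemma plus attribute-value pairs."""
    lemma: str
    attrs: Dict[str, str] = field(default_factory=dict)


@dataclass
class Segment:
    """A word: one surface form and one or more annotations."""
    surface: str
    annotations: List[Annotation]

    def is_ambiguous(self) -> bool:
        # A segment is ambiguous when it carries more than one annotation.
        return len(self.annotations) > 1


# The "partido" segment from the example above (hypothetical encoding).
partido = Segment(
    surface="partido",
    annotations=[
        Annotation("partido", {"CAT": "adj", "NUM": "s", "GEN": "m", "DEG": "nor"}),
        Annotation("partido", {"CAT": "adj", "SCT": "com", "NUM": "s",
                               "GEN": "m", "DEG": "nor"}),
        Annotation("partir", {"CAT": "ver", "MOD": "par", "NUM": "s", "GEN": "m"}),
    ],
)

# Collect the category of every reading of the segment.
categories = [a.attrs["CAT"] for a in partido.annotations]
```

Here `partido.is_ambiguous()` is true because the segment has three annotations; a disambiguation rule would reduce that list.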

RuDriCo2 has two types of rules: disambiguation and segmentation rules. Disambiguation rules allow the system to choose the correct category of a word by considering the surrounding context. Segmentation rules change the segmentation and can be divided into contraction and expansion rules. Contraction rules convert two or more segments into a single one. Expansion rules transform a segment into at least two segments.

For example, an expansion rule can transform the segment Na into the two segments Em and a, and a contraction rule can turn the segments Coreia, do and Sul into the single segment Coreia do Sul.

Example of a disambiguation rule for the form a, which can be an article (art), a pronoun (pro) or a preposition (pre). The rule selects the article reading when this form is preceded by a preposition:

  |[CAT='pre']!|
  [surface='a',CAT='art'][CAT=~'art']
 :=
  [CAT='art']+.
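
The effect of this rule, as described in the text, can be approximated in Python. The sketch below is not the RuDriCo2 rule engine and does not claim to reproduce the exact semantics of its operators; it assumes a hypothetical representation in which a sentence is a list of dictionaries with a `surface` and a list of `annotations`, and the function name `disambiguate_a` and the lemma values are illustrative.

```python
def disambiguate_a(sentence):
    """Keep only the article reading of 'a' when a preposition precedes it.

    `sentence` is a list of segments of the (assumed) form
    {"surface": str, "annotations": [{"lemma": ..., "CAT": ...}, ...]}.
    The input is not modified; a new list is returned.
    """
    result = []
    for i, seg in enumerate(sentence):
        prev = sentence[i - 1] if i > 0 else None
        # Left context: the previous segment has a preposition reading.
        preceded_by_prep = prev is not None and any(
            a.get("CAT") == "pre" for a in prev["annotations"]
        )
        readings = seg["annotations"]
        has_art = any(a.get("CAT") == "art" for a in readings)
        has_other = any(a.get("CAT") != "art" for a in readings)
        if seg["surface"] == "a" and preceded_by_prep and has_art and has_other:
            # Select the article reading and discard the others.
            seg = {
                "surface": seg["surface"],
                "annotations": [a for a in readings if a.get("CAT") == "art"],
            }
        result.append(seg)
    return result


# Illustrative input: "para a casa", with 'a' three-way ambiguous.
sentence = [
    {"surface": "para", "annotations": [{"lemma": "para", "CAT": "pre"}]},
    {"surface": "a", "annotations": [
        {"lemma": "a", "CAT": "art"},
        {"lemma": "a", "CAT": "pro"},
        {"lemma": "a", "CAT": "pre"},
    ]},
    {"surface": "casa", "annotations": [{"lemma": "casa", "CAT": "nou"}]},
]
resolved = disambiguate_a(sentence)
```

After the call, the segment for a in `resolved` keeps only its article annotation, while the same segment without a preceding preposition would be left ambiguous.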

Example of a contraction (join) rule that joins the sequence of tokens África do Sul into a single token, which is then given the features POS (noun), subcategory (proper noun), gender and number:

0>[surface='África'],
  [surface='do'],
  [surface='Sul']
 :>
  [surface=@@+,lemma='África do Sul',CAT='nou',SCT='prp',GEN='f',NUM='s'].

Example of an expansion rule that resolves the contracted form ao (to_the.masc.sg), splitting it into the preposition a (to) and the definite article o (the.masc.sg):

0>[surface='ao',CAT='pre']
 :<
  [surface='a',lemma='a',CAT='pre'],
  [surface='o',lemma='o',CAT='art',SCT='def',NUM='s',GEN='m'].
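
The two kinds of segmentation rule shown above (the África do Sul contraction and the ao expansion) can be sketched in Python as plain list transformations. This is a simplified illustration under stated assumptions, not RuDriCo2's matching algorithm: segments are hypothetical dictionaries, the helpers `seg`, `contract` and `expand` are invented for this example, annotations are omitted where they do not matter, and joining surfaces with a single space is an assumption about how the contracted surface is built.

```python
def seg(surface, **attrs):
    """Build an illustrative segment; annotations are optional in this sketch."""
    return {"surface": surface, "annotations": [attrs] if attrs else []}


def contract(segments, surfaces, annotation):
    """Replace each run of segments whose surfaces equal `surfaces` by one segment."""
    out, i, n = [], 0, len(surfaces)
    while i < len(segments):
        window = [s["surface"] for s in segments[i:i + n]]
        if window == surfaces:
            # Assumption: the new surface is the matched surfaces joined by spaces.
            out.append({"surface": " ".join(surfaces),
                        "annotations": [dict(annotation)]})
            i += n
        else:
            out.append(segments[i])
            i += 1
    return out


def expand(segments, surface, cat, replacements):
    """Replace each segment with this surface and a `cat` reading by `replacements`."""
    out = []
    for s in segments:
        matches = s["surface"] == surface and any(
            a.get("CAT") == cat for a in s["annotations"]
        )
        if matches:
            out.extend(
                {"surface": r["surface"], "annotations": [dict(r["annotation"])]}
                for r in replacements
            )
        else:
            out.append(s)
    return out


# Contraction: África + do + Sul -> África do Sul (features from the rule above).
joined = contract(
    [seg("África"), seg("do"), seg("Sul")],
    ["África", "do", "Sul"],
    {"lemma": "África do Sul", "CAT": "nou", "SCT": "prp", "GEN": "f", "NUM": "s"},
)

# Expansion: ao -> a + o (features from the rule above).
split = expand(
    [seg("Fui"), seg("ao", lemma="ao", CAT="pre"), seg("mercado")],
    "ao",
    "pre",
    [
        {"surface": "a", "annotation": {"lemma": "a", "CAT": "pre"}},
        {"surface": "o", "annotation": {"lemma": "o", "CAT": "art", "SCT": "def",
                                        "NUM": "s", "GEN": "m"}},
    ],
)
```

In this sketch `joined` holds a single segment for África do Sul, and `split` holds four segments with ao replaced by a and o.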


Module evolution

RuDriCo1 is an evolution of PAsMo, which is in turn an evolution of MPS (Module Post-SMorph, bibtex).


In 2009, RuDriCo1 was substantially slower than the remaining modules of the chain. [2] describes the changes made to the system to improve its performance, namely the use of layers and a reduction in the number of variables contained in the rules. It also describes the changes to the rule syntax, such as the addition of new operators and contexts, which make the rules more expressive. The new version, named RuDriCo2, is significantly (10 times) faster than the previous version, uses a more expressive language (allowing negation, disjunction, and regular expressions in both the lemma and the surface form), and moves closer to the XIP parser syntax. It also validates the input data, issuing error messages and warnings for potential problems. RuDriCo2 is a significant improvement over the original module.


Demo

RuDriCo2 can be tested here


User's Manual

Though RuDriCo2 is not freely available, the user's manual will be available here as soon as possible.


Publications

[1] Cláudio Diniz, Um Conversor baseado em regras de transformação declarativas, MSc thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa, Portugal, October 2010 (bibtex)

[2] Cláudio Diniz, Nuno Mamede, João D. Pereira, RuDriCo2 - a faster disambiguator and segmentation modifier, in II Simpósio de Informática (INForum 2010), Universidade do Minho, pages 573-584, September 2010 (bibtex)