Difference between revisions of "RuDriCo2"

 
(16 intermediate revisions by 2 users not shown)
Line 1: Line 1:
===== Acronim =====
+
<div style="float:right;">__TOC__</div>
 +
==== Acronym ====
 
'''''RuDriCo''''' stands for '''''Ru'''''le '''''Dri'''''ven '''''Co'''''nverter
 
  
===== Brief Description =====
 
'''''RuDriCo2''''''s main goal is to provide for an adjustment of the results produced by a morphological analyzer to the specific needs of each parser. In order to achieve this, it modifies the segmentation that is done by the former. For example, it might contract expressions provided by the morphological analyzer, such as ''ex-'' and ''aluno'', into one segment: ''ex-aluno''; or it can perform the opposite and expand expressions such as ''nas'' into two segments: ''em'' and ''as''. This will depend on what the parser might need. Altering the segmentation is also useful for performing tasks such as recognition of numbers and dates. The ability to modify the segmentation is achieved through declarative rules, which are based on the concept of pattern matching. RuDriCo can also be used to solve (or introduce) morphosyntactic ambiguities. By the time RuDriCo is executed along the processing chain, it performs all of the mentioned tasks.
 
  
The input of '''''RuDriCo2'''''' is a set of rules and the text to process. Input text is in XML format and consists in a set of sentences where each sentence has one or more segments. The segments represent words that are constituted by a surface (\texttt{word}) and one or more annotations (\texttt{class}). An annotation is composed by a lemma (\texttt{root}) and a set of attribute-value pairs. The attribute-value pairs represent the properties of each annotation, e.g. the category of a word.  
+
==== Brief Description ====
 +
[[RuDriCo2]]'s main goal is to provide for an adjustment of the results produced by the [[LexMan]] morphological analyzer to the specific needs of each parser. In order to achieve this, it modifies the segmentation that is done by the former. For example, it might contract expressions provided by the morphological analyzer, such as ''ex-'' and ''aluno'', into one segment: ''ex-aluno''; or it can perform the opposite and expand expressions such as ''nas'' into two segments: ''em'' and ''as''. This will depend on what the parser might need. Altering the segmentation is also useful for performing tasks such as recognition of numbers and dates. The ability to modify the segmentation is achieved through declarative rules, which are based on the concept of pattern matching.  [[RuDriCo2]]  can also be used to solve (or introduce) morphosyntactic ambiguities. By the time  [[RuDriCo2]]  is executed along the processing chain, it performs all of the mentioned tasks.
  
In this example, the word \textit{partido} is represented as an ambiguous segment containing one surface and three annotations.
+
The input of [[RuDriCo2]] is a set of rules and the text to process. Input text is in XML format and consists in a set of sentences where each sentence has one or more segments. The segments represent words that are constituted by a surface (''word'') and one or more annotations (''class''). An annotation is composed by a lemma (''root'') and a set of attribute-value pairs. The attribute-value pairs represent the properties of each annotation, e.g. the category of a word.
[surface='partido', ]
 
  
 +
In this example, the word "''partido''" is represented as an ambiguous segment containing one surface and three annotations.
 +
<tt style="color:red">[surface='partido',lemma='partido',CAT='adj',NUM='s',GEN='m',DEG='nor']</tt>
 +
<tt style="color:red">    [lemma='partido',CAT='adj',SCT='com',NUM='s',GEN='m',DEG='nor']</tt>
 +
<tt style="color:red">    [lemma='partir',CAT='ver',MOD='par',NUM='s',GEN='m']</tt>
  
 +
[[RuDriCo2]]  has two types of rules: disambiguation and segmentation rules. ''Disambiguation'' rules allow the system to choose the correct category of a word by considering the surrounding context. ''Segmentation'' rules change the segmentation and can be divided into contraction and expansion rules. ''Contraction'' rules convert two or more segments into a single one. ''Expansion'' rules transform a segment into at least two segments.
  
'''''RuDriCo2'''''' has two types of rules: disambiguation and segmentation rules. The former ones allow the system to choose the correct category of a word by considering the surrounding context. Segmentation rules change the segmentation and can be divided into contraction and expansion rules. Contraction rules convert two or more segments into a single one. Expansion rules transform a segment into at least two segments. An example of an expansion rule is to transform the segment ''Na'' into two segments ''Em'' and ''a''. An example of a contraction rule is to turn segments ''Coreia'', ''do'' and ''Sul'' into a single segment ''Coreia do Sul''.
+
An example of an expansion rule is to transform the segment ''Na'' into two segments ''Em'' and ''a''. An example of a contraction rule is to turn segments ''Coreia'', ''do'' and ''Sul'' into a single segment ''Coreia do Sul''.
  
Example of a disambiguation rule (disambiguates the form «a», which can be an article (art), a pronoun (pro) or a preposition (pre), when it is preceded by the preposition de, classifying it as an article (art)):
+
Example of a disambiguation rule that disambiguates the form ''a'' which can be an article (art), a pronoun (pro) or a preposition (pre), selecting the POS article when this form is preceded by a preposition:
  <nowiki>0> |[CAT='pre']!|
+
  <tt style="color:red"> |[CAT='pre']!|</tt>
  [surface='a',CAT='art'][CAT=~'art']
+
<tt style="color:red">  [surface='a',CAT='art'][CAT=~'art']</tt>
  :=
+
<tt style="color:red"> :=</tt>
  [CAT='art']+.</nowiki>
+
<tt style="color:red">  [CAT='art']+.</tt>
  
Example of a join rule (joins ....):
+
Example of a join rule that joins the sequence of tokens ''África do Sul'', producing a single token, which is then given the features of POS (noun), subcategory (proper noun), gender and number :
  <nowiki>0> [surface='África'],
+
  <tt style="color:red">0>[surface='África'],</tt>
  [surface='do'],
+
<tt style="color:red">  [surface='do'],</tt>
  [surface='Sul']
+
<tt style="color:red">  [surface='Sul']</tt>
  :>
+
<tt style="color:red"> :></tt>
  [surface=@@+,lemma='África do Sul',CAT='nou',SCT='prp',GEN='f',NUM='s'].</nowiki>
+
<tt style="color:red">  [surface=@@+,lemma='África do Sul',CAT='nou',SCT='prp',GEN='f',NUM='s'].</tt>
  
Example of an expansion rule (................):
+
Example of an expansion rule that resolves the contracted form ''ao'' (to_the.masc.sg), splitting it into the preposition ''a'' (to) and the definite article ''o'' (the.masc.sg):
  <nowiki>0> [surface='ao',CAT='pre']
+
  <tt style="color:red">0>[surface='ao',CAT='pre']</tt>
    :<
+
<tt style="color:red"> :<</tt>
    [surface='a',lemma='a',CAT='pre'],
+
<tt style="color:red">  [surface='a',lemma='a',CAT='pre'],</tt>
    [surface='o',lemma='o',CAT='art',SCT='def',NUM='s',GEN='m'].</nowiki>
+
<tt style="color:red">  [surface='o',lemma='o',CAT='art',SCT='def',NUM='s',GEN='m'].</tt>
  
===== Module evolution =====
 
'''''Rudrico1''''' was substantially slower than the remaining modules of the chain. '''''Rudrico2''''' is a rule-based morphological disambiguator with the possibility to change segmentation (join or split tokens).
 
[2] describes the changes made to the system to improve its performance by using the concept of layers and also by reducing the number of variables contained in the rules. It also describes the changes in rule syntax, such as the addition of new operators and contexts, which makes the rules more expressive.
 
The new version, named '''''RuDriCo2''''', is significantly (10 times) faster that the previous version, uses a more expressive language (allowing negation and disjunction, the use of regular expressions both in the lemma and in the surface form) and constitutes an approach to the XIP parser syntax. It also validates the input data, features error messages and warnings for potential problems.  '''''Rudrico2''''' is a significant improvement over the original module.
 
  
===== User's Manual =====
+
==== Module evolution ====
Although '''''Rudrico2''''' being not freely available the user's manual is [[here]]
+
'''''RuDriCo1''''' is an evolution of [http://www.inesc-id.pt/pt/indicadores/Ficheiros/2365.pdf PAsMo], which is, in turn, an evolution of MPS (Module Post-SMorph, [[media:MPS1999.txt|bibtex]]).
  
===== Publications =====
+
 
 +
In 2009, '''''RuDriCo1''''' was substantially slower than the remaining modules of the chain. [2] describes the changes made to the system to improve its performance by using the concept of layers and also by reducing the number of variables contained in the rules. It also describes the changes in the rule syntax, such as the addition of new operators and contexts, which make the rules more expressive.
 +
The new version, named  [[RuDriCo2]], is significantly (10 times) faster than the previous version, uses a more expressive language (allowing negation, disjunction, and regular expressions in both the lemma and the surface form) and moves closer to the XIP parser syntax. It also validates the input data and issues error messages and warnings for potential problems. [[RuDriCo2]] is a significant improvement over the original module.
 +
 
 +
 
 +
==== Demo ====
 +
[[RuDriCo2]] can be tested [http://string.l2f.inesc-id.pt/demo/tokenizer.pl here]
 +
 
 +
 
 +
==== User's Manual ====
 +
Though [[RuDriCo2]] is not freely available, the user's manual will be available [[here]] as soon as possible.
 +
 
 +
 
 +
==== Publications ====
 
'''[1]''' Cláudio Diniz, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/5451.pdf Um Conversor baseado em regras de transformação declarativas], MSc thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa, Portugal, October 2010 ([[media:Diniz2010b.txt|bibtex]])
 
  
 
'''[2]''' Cláudio Diniz, Nuno Mamede, João D. Pereira, [http://inforum.org.pt/INForum2010/papers/gestao-e-tratamento-de-informacao/Paper085.pdf RuDriCo2 - a faster disambiguator and segmentation modifier], in II Simpósio de Informática (INForum 2010), Universidade do Minho, pages 573-584, September 2010 ([[media:Diniz2010a.txt|bibtex]])
 

Latest revision as of 02:35, 10 March 2012

Acronym

RuDriCo stands for Rule Driven Converter


Brief Description

RuDriCo2's main goal is to adjust the results produced by the LexMan morphological analyzer to the specific needs of each parser. To achieve this, it modifies the segmentation produced by the analyzer. For example, it can contract expressions provided by the morphological analyzer, such as ex- and aluno, into one segment, ex-aluno, or do the opposite and expand expressions such as nas into two segments, em and as, depending on what the parser needs. Altering the segmentation is also useful for tasks such as the recognition of numbers and dates. Segmentation changes are specified through declarative rules based on pattern matching. RuDriCo2 can also be used to solve (or introduce) morphosyntactic ambiguities. When RuDriCo2 runs in the processing chain, it performs all of these tasks.

The input of RuDriCo2 is a set of rules and the text to process. The input text is in XML format and consists of a set of sentences, where each sentence has one or more segments. The segments represent words and are made up of a surface (word) and one or more annotations (class). An annotation is composed of a lemma (root) and a set of attribute-value pairs. The attribute-value pairs represent the properties of each annotation, e.g. the category of a word.

In the following example, the word "partido" is represented as an ambiguous segment containing one surface and three annotations.

[surface='partido',lemma='partido',CAT='adj',NUM='s',GEN='m',DEG='nor']
     [lemma='partido',CAT='adj',SCT='com',NUM='s',GEN='m',DEG='nor']
     [lemma='partir',CAT='ver',MOD='par',NUM='s',GEN='m']
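
The segment structure described above (one surface, one or more annotations, each with a lemma and attribute-value pairs) can be pictured with a small data model. This is only an illustrative sketch in Python: the class names `Segment` and `Annotation` and the `is_ambiguous` helper are hypothetical and are not part of RuDriCo2 or of its XML format.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One reading of a word: a lemma plus attribute-value pairs."""
    lemma: str
    features: dict[str, str] = field(default_factory=dict)


@dataclass
class Segment:
    """A surface form together with its candidate annotations."""
    surface: str
    annotations: list[Annotation]

    def is_ambiguous(self) -> bool:
        # A segment is ambiguous when it carries more than one annotation.
        return len(self.annotations) > 1


# The "partido" segment from the example: one surface, three annotations.
partido = Segment(
    surface="partido",
    annotations=[
        Annotation("partido", {"CAT": "adj", "NUM": "s", "GEN": "m", "DEG": "nor"}),
        Annotation("partido", {"CAT": "adj", "SCT": "com", "NUM": "s", "GEN": "m", "DEG": "nor"}),
        Annotation("partir", {"CAT": "ver", "MOD": "par", "NUM": "s", "GEN": "m"}),
    ],
)
```

Keeping the attribute-value pairs in a plain dictionary mirrors the open-ended set of properties (CAT, SCT, NUM, GEN, ...) that the rules test and assign.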

RuDriCo2 has two types of rules: disambiguation and segmentation rules. Disambiguation rules allow the system to choose the correct category of a word by considering the surrounding context. Segmentation rules change the segmentation and can be divided into contraction and expansion rules. Contraction rules convert two or more segments into a single one. Expansion rules transform a segment into at least two segments.

An example of an expansion rule is to transform the segment Na into two segments Em and a. An example of a contraction rule is to turn segments Coreia, do and Sul into a single segment Coreia do Sul.

Example of a disambiguation rule for the form a, which can be an article (art), a pronoun (pro) or a preposition (pre). The rule selects the article reading when the form is preceded by a preposition:

  |[CAT='pre']!|
  [surface='a',CAT='art'][CAT=~'art']
 :=
  [CAT='art']+.
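
The effect of this rule, as described in the prose, can be approximated in Python. The sketch below is not RuDriCo2's implementation and does not interpret the rule operators themselves; the function name `disambiguate_a` and the list-of-pairs representation are hypothetical, and it assumes that "preceded by a preposition" means the previous segment has at least one reading with CAT='pre'.

```python
def disambiguate_a(sentence):
    """Keep only the article readings of 'a' after a preposition.

    `sentence` is a list of (surface, readings) pairs, where `readings`
    is a list of attribute dictionaries such as {"CAT": "art"}.
    """
    result = []
    for i, (surface, readings) in enumerate(sentence):
        prev_readings = sentence[i - 1][1] if i > 0 else []
        # Assumption: one preposition reading on the left is enough context.
        after_prep = any(r.get("CAT") == "pre" for r in prev_readings)
        articles = [r for r in readings if r.get("CAT") == "art"]
        others = [r for r in readings if r.get("CAT") != "art"]
        # Only act when 'a' is really ambiguous between art and something else.
        if surface == "a" and after_prep and articles and others:
            readings = articles
        result.append((surface, readings))
    return result


sentence = [
    ("de", [{"CAT": "pre"}]),
    ("a", [{"CAT": "art"}, {"CAT": "pro"}, {"CAT": "pre"}]),
]
disambiguated = disambiguate_a(sentence)  # 'a' keeps only its article reading
```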

Example of a join rule that joins the sequence of tokens África do Sul, producing a single token, which is then given the features of POS (noun), subcategory (proper noun), gender and number:

0>[surface='África'],
  [surface='do'],
  [surface='Sul']
 :>
  [surface=@@+,lemma='África do Sul',CAT='nou',SCT='prp',GEN='f',NUM='s'].
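
A contraction of this kind can be sketched in Python as a scan for a fixed sequence of surfaces. This is an illustration under stated assumptions, not the actual matching engine: the function `join_tokens` and the dictionary layout are hypothetical, and the surface of the new segment is assumed to be the matched surfaces joined with single spaces.

```python
def join_tokens(segments, pattern, annotation):
    """Replace each run of segments whose surfaces equal `pattern` with one segment."""
    out = []
    i = 0
    n = len(pattern)
    while i < len(segments):
        window = [s["surface"] for s in segments[i:i + n]]
        if window == pattern:
            # Assumption: the joined surface is the matched surfaces separated by spaces.
            out.append({"surface": " ".join(window), "annotations": [dict(annotation)]})
            i += n
        else:
            out.append(segments[i])
            i += 1
    return out


tokens = [{"surface": w, "annotations": []} for w in ["visitou", "África", "do", "Sul"]]
joined = join_tokens(
    tokens,
    ["África", "do", "Sul"],
    {"lemma": "África do Sul", "CAT": "nou", "SCT": "prp", "GEN": "f", "NUM": "s"},
)
```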

Example of an expansion rule that resolves the contracted form ao (to_the.masc.sg), splitting it into the preposition a (to) and the definite article o (the.masc.sg):

0>[surface='ao',CAT='pre']
 :<
  [surface='a',lemma='a',CAT='pre'],
  [surface='o',lemma='o',CAT='art',SCT='def',NUM='s',GEN='m'].
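
The opposite operation, expansion, can be sketched the same way. Again this is only an illustration: `expand_token` and the dictionary layout are hypothetical names chosen for this example, and the match condition (same surface, at least one annotation with the given CAT) is an assumption about how the left-hand side is read.

```python
def expand_token(segments, surface, cat, replacements):
    """Replace each matching segment with the sequence of replacement segments."""
    out = []
    for seg in segments:
        matches = seg["surface"] == surface and any(
            a.get("CAT") == cat for a in seg["annotations"]
        )
        if matches:
            # Copy the replacements so the output does not share state with the rule.
            out.extend(
                {"surface": r["surface"], "annotations": [dict(r["annotation"])]}
                for r in replacements
            )
        else:
            out.append(seg)
    return out


tokens = [
    {"surface": "ao", "annotations": [{"CAT": "pre"}]},
    {"surface": "mar", "annotations": [{"CAT": "nou"}]},
]
expanded = expand_token(
    tokens,
    "ao",
    "pre",
    [
        {"surface": "a", "annotation": {"lemma": "a", "CAT": "pre"}},
        {"surface": "o", "annotation": {"lemma": "o", "CAT": "art", "SCT": "def", "NUM": "s", "GEN": "m"}},
    ],
)
```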


Module evolution

RuDriCo1 is an evolution of PAsMo, which is, in turn, an evolution of MPS (Module Post-SMorph, bibtex).


In 2009, RuDriCo1 was substantially slower than the remaining modules of the chain. [2] describes the changes made to the system to improve its performance by using the concept of layers and also by reducing the number of variables contained in the rules. It also describes the changes in the rule syntax, such as the addition of new operators and contexts, which make the rules more expressive. The new version, named RuDriCo2, is significantly (10 times) faster than the previous version, uses a more expressive language (allowing negation, disjunction, and regular expressions in both the lemma and the surface form) and moves closer to the XIP parser syntax. It also validates the input data and issues error messages and warnings for potential problems. RuDriCo2 is a significant improvement over the original module.


Demo

RuDriCo2 can be tested here: http://string.l2f.inesc-id.pt/demo/tokenizer.pl


User's Manual

Though RuDriCo2 is not freely available, the user's manual will be available here as soon as possible.


Publications

[1] Cláudio Diniz, Um Conversor baseado em regras de transformação declarativas, MSc thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa, Portugal, October 2010 (bibtex)

[2] Cláudio Diniz, Nuno Mamede, João D. Pereira, RuDriCo2 - a faster disambiguator and segmentation modifier, in II Simpósio de Informática (INForum 2010), Universidade do Minho, pages 573-584, September 2010 (bibtex)