Architecture


The processing chain of L2F consists of several modules, which are represented in the next figure:

[Figure: the modules of the L2F processing chain]

Tokenizer

The first module is responsible for segmentation: it divides the text into tokens. Besides this, the module is also responsible for the early identification of certain types of entities (a minimal sketch follows the list):

  • email addresses;
  • ordinal numbers (e.g. 42ª);
  • numbers with . and , (e.g. 12.345,67);
  • IP and HTTP addresses;
  • integers (e.g. 12345);
  • several abbreviations with . (e.g. a.c., V.Exa.);
  • numbers written in full, such as duzentos e trinta e cinco (two hundred and thirty-five);
  • sequences of question and exclamation marks, as well as ellipses (e.g. ???, !!!, ?!?!, ...);
  • punctuation marks (e.g. !, ?, ., ,, :, ;, (, ), [, ], -);
  • symbols (e.g. «, », #, $, %, &, +, *, <, >, =, @);
  • Roman numerals (e.g. LI, MMM, XIV);
  • words, such as alface (lettuce) and fim-de-semana (weekend).
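A hedged sketch of this kind of prioritized, pattern-based tokenization in Python; the patterns, their ordering, and the abbreviation subset are illustrative assumptions, not the actual L2F tokenizer rules:

  import re

  # Ordered patterns: earlier entries win, so specific entities are
  # recognized before generic words and punctuation (illustrative only).
  TOKEN_PATTERNS = [
      ("EMAIL",    r"[\w.+-]+@[\w-]+\.[\w.]+"),
      ("URL",      r"https?://\S+"),
      ("ORDINAL",  r"\d+[ªº]"),                       # e.g. 42ª
      ("NUMBER",   r"\d{1,3}(?:\.\d{3})*(?:,\d+)?"),  # e.g. 12.345,67
      ("ABBREV",   r"(?:Dr|Sr|a\.c|V\.Exa)\."),       # illustrative subset
      ("ELLIPSIS", r"\.\.\.|[?!][?!]+"),              # e.g. ..., ?!?!
      ("ROMAN",    r"\b[IVXLCDM]+\b"),                # e.g. XIV
      ("WORD",     r"\w+(?:-\w+)*"),                  # e.g. fim-de-semana
      ("PUNCT",    r"[!?.,:;()\[\]«»#$%&+*<>=@-]"),
  ]
  MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_PATTERNS))

  def tokenize(text):
      """Yield (token_type, token_text) pairs, skipping whitespace."""
      for m in MASTER.finditer(text):
          yield m.lastgroup, m.group()

  print(list(tokenize("O Dr. Silva pagou 12.345,67 no fim-de-semana!!!")))

Ordering matters in such a scheme: the ordinal pattern must precede the generic number pattern, for instance, or 42ª would be split into 42 and ª.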


POS tagger

LexMan performs the morphosyntactic labeling: the tokens produced by the segmentation module are tagged with POS (part-of-speech) labels, such as noun, verb, adjective, or adverb, among others. There are thirteen categories, and the information is encoded in ten fields:

  • category (CAT),
  • subcategory (SCT),
  • mood (MOD),
  • tense (TEN),
  • person (PER),
  • number (NUM),
  • gender (GEN),
  • degree (DEG),
  • case (CAS), and
  • formation (FOR).

No category uses all ten fields.
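As a concrete, hedged illustration of the tag layout, the sketch below represents one such ten-field tag in Python; the field names follow the list above, while the example values are assumptions:

  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class PosTag:
      """One ten-field tag; fields a category does not use stay None."""
      cat: str                    # category (CAT), e.g. "noun"
      sct: Optional[str] = None   # subcategory (SCT)
      mod: Optional[str] = None   # mood (MOD)
      ten: Optional[str] = None   # tense (TEN)
      per: Optional[str] = None   # person (PER)
      num: Optional[str] = None   # number (NUM)
      gen: Optional[str] = None   # gender (GEN)
      deg: Optional[str] = None   # degree (DEG)
      cas: Optional[str] = None   # case (CAS)
      for_: Optional[str] = None  # formation (FOR); "for" is a keyword

  # e.g. a common noun uses only some of the ten fields
  alface = PosTag(cat="noun", sct="common", num="singular", gen="feminine")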


Sentence Splitter

The final step of the pre-processing stage divides the text into sentences. To build a sentence, the system matches sequences that end with ., !, or ?. There are, however, some exceptions, in which these characters do not end a sentence (a sketch follows the list):

  • All registered abbreviations (e.g. Dr.)
  • Sequences of pairs of capitalized letters and dots (e.g. N.A.S.A.)
  • An ellipsis followed by one of the symbols », ), ], }, or by a lower-case letter.
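A hedged sketch of such a splitter, assuming already-tokenized input; the abbreviation set is an illustrative subset and the logic is a simplification of the rules above:

  import re

  ABBREVIATIONS = {"Dr.", "Sr.", "Exa.", "a.c."}  # illustrative subset
  ACRONYM = re.compile(r"(?:[A-Z]\.){2,}$")       # e.g. N.A.S.A.
  CONTINUERS = set("»)]}")                        # symbols that continue a sentence

  def split_sentences(tokens):
      """Group a list of token strings into sentences (lists of tokens)."""
      sentences, current = [], []
      for i, tok in enumerate(tokens):
          current.append(tok)
          if not tok.endswith((".", "!", "?")):
              continue
          # exceptions: registered abbreviations and acronym sequences
          if tok in ABBREVIATIONS or ACRONYM.search(tok):
              continue
          # exception: ellipsis followed by a closing symbol or lowercase
          nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
          if tok.endswith("...") and (nxt in CONTINUERS or nxt[:1].islower()):
              continue
          sentences.append(current)
          current = []
      if current:
          sentences.append(current)
      return sentences

  print(split_sentences(["O", "Dr.", "Silva", "chegou", ".", "Saiu", "!"]))
  # [['O', 'Dr.', 'Silva', 'chegou', '.'], ['Saiu', '!']]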


Disambiguation

The next stage of the processing chain is disambiguation, which comprises two steps:

  • Rule-driven morphosyntactic disambiguation, performed by RuDriCo2;
  • Statistical disambiguation, performed by MARv3.

RuDriCo2's main goal is to adapt the results produced by a morphological analyzer to the specific needs of each parser. To achieve this, it modifies the segmentation produced by the analyzer. For example, it might contract expressions provided by the morphological analyzer, such as ex- and aluno, into one segment, ex-aluno; or it can do the opposite and expand expressions such as nas into two segments, em and as. Which rewriting applies depends on what the parser needs.

Altering the segmentation is also useful for tasks such as the recognition of numbers and dates. The segmentation is modified through declarative rules based on pattern matching, as sketched below. RuDriCo2 can also be used to resolve (or introduce) morphosyntactic ambiguities. When it runs in the processing chain, RuDriCo2 performs all of these tasks.
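A hedged sketch of this kind of declarative segment rewriting; the rule format below is invented for illustration and is not RuDriCo2's actual rule syntax:

  # Rule tables: invented format, not RuDriCo2 syntax.
  CONTRACT = {("ex-", "aluno"): ["ex-aluno"]}  # join two segments into one
  EXPAND = {"nas": ["em", "as"]}               # split one segment into two

  def rewrite(segments):
      """Apply contraction and expansion rules left to right."""
      out, i = [], 0
      while i < len(segments):
          if tuple(segments[i:i + 2]) in CONTRACT:
              out.extend(CONTRACT[tuple(segments[i:i + 2])])
              i += 2
          elif segments[i] in EXPAND:
              out.extend(EXPAND[segments[i]])
              i += 1
          else:
              out.append(segments[i])
              i += 1
      return out

  print(rewrite(["ex-", "aluno", "estuda", "nas", "escolas"]))
  # ['ex-aluno', 'estuda', 'em', 'as', 'escolas']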

MARv3's main goal is to analyze the labels attributed to each token in the previous step of the processing chain and to choose the most likely label for each one. To achieve this, it employs a Hidden Markov Model (HMM). To properly define an HMM, one first needs to introduce the Markov chain, sometimes called the observed Markov model. A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through. Because it cannot represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences, that is, when we need to compute a probability for a sequence of events that can be observed in the world. In many cases, however, events are not directly observable.

In this case in particular, POS tags are not observable: what we see are words, or tokens, and we need to infer the correct tags from the word sequence. We therefore say that the tags are hidden, because they are not observed. An HMM allows us to talk about both observed events (like the words we see in the input) and hidden events (like POS tags).

Several algorithms exist for decoding, that is, for finding the hidden tag sequence that best explains a given observation sequence. MARv3 uses the Viterbi algorithm.
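A compact sketch of Viterbi decoding over an HMM; the toy states and probabilities below are invented for illustration and are unrelated to MARv3's trained models:

  def viterbi(obs, states, start_p, trans_p, emit_p):
      """Return the most likely hidden state sequence for obs."""
      # V[t][s]: probability of the best path ending in state s at step t
      V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
      back = [{}]
      for t in range(1, len(obs)):
          V.append({})
          back.append({})
          for s in states:
              prob, prev = max(
                  (V[t - 1][r] * trans_p[r][s] * emit_p[s].get(obs[t], 0.0), r)
                  for r in states)
              V[t][s], back[t][s] = prob, prev
      best = max(V[-1], key=V[-1].get)        # best final state
      path = [best]
      for t in range(len(obs) - 1, 0, -1):    # follow back-pointers
          path.append(back[t][path[-1]])
      return list(reversed(path))

  # toy example with invented probabilities (not MARv3's trained model)
  states = ["DET", "NOUN"]
  start = {"DET": 0.7, "NOUN": 0.3}
  trans = {"DET": {"DET": 0.1, "NOUN": 0.9}, "NOUN": {"DET": 0.4, "NOUN": 0.6}}
  emit = {"DET": {"a": 0.8}, "NOUN": {"a": 0.1, "alface": 0.5}}
  print(viterbi(["a", "alface"], states, start, trans, emit))  # ['DET', 'NOUN']

At each step the algorithm keeps only the best path into each state, which is what makes decoding linear in the sentence length rather than exponential in it.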


Syntactic analysis

XIP performs the syntactic analysis. This analyzer allows the introduction of lexical, syntactic, and semantic information, as well as the application of local grammars and morphosyntactic disambiguation rules, and the computation of chunks and dependencies. XIP is composed of several modules:

  • Lexicons — add information to the tokens. XIP includes a pre-existing lexicon, which can be enriched by adding lexical entries or changing existing ones.
  • Local Grammars — XIP enables the writing of rules that take the left and right contexts into account. These rules define entities formed by more than one lexical unit, grouping the elements together into a single entity.
  • Chunking Module — chunking rules perform a syntactic analysis of the text: for each sentence, a sequence of categories is built and grouped into structures (chunks).
  • Dependency Module — dependencies are syntactic relationships between different chunks; they provide deeper and richer knowledge of a text. The node sequences previously identified by the chunking rules are used by the dependency rules to compute the relationships between them, as illustrated below.
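A hedged illustration of the kind of structures the chunking and dependency modules produce; the representation and the example analysis are assumptions, not XIP's actual output format:

  from dataclasses import dataclass, field
  from typing import List, Tuple

  @dataclass
  class Chunk:
      label: str             # e.g. "NP", "VP"
      tokens: List[str]

  @dataclass
  class Analysis:
      chunks: List[Chunk]
      # dependencies: (relation, head token, dependent token)
      deps: List[Tuple[str, str, str]] = field(default_factory=list)

  # "A alface cresceu" (the lettuce grew): an NP and a VP chunk,
  # linked by a SUBJ dependency computed over the chunk nodes
  analysis = Analysis(
      chunks=[Chunk("NP", ["A", "alface"]), Chunk("VP", ["cresceu"])],
      deps=[("SUBJ", "cresceu", "alface")])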


Post-syntactic analysis

For the moment, two modules have been developed:

  • Anaphora resolution, and
  • Time expressions normalization.


XML (eXtensible Markup Language) is used to exchange data between the different modules of the processing chain.
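As a hedged illustration, the sketch below serializes one tagged token to XML with Python's standard library; the element and attribute names are assumptions, not L2F's actual schema:

  import xml.etree.ElementTree as ET

  # One tagged token as it might be passed between modules; the element
  # and attribute names here are assumptions, not L2F's actual schema.
  token = ET.Element("token", text="alface")
  ET.SubElement(token, "tag", CAT="noun", NUM="singular", GEN="feminine")
  print(ET.tostring(token, encoding="unicode"))
  # <token text="alface"><tag CAT="noun" NUM="singular" GEN="feminine" /></token>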