The processing chain of L2F consists of several modules, which are represented in the next figure:

[[file:Architecture.jpg|600px]]

====== Tokenizer ======
 
The first module is responsible for segmentation: it divides the text into tokens. Besides this, the module is also responsible for the early identification of certain types of entities, namely: email addresses, ordinal numbers (e.g. '''3º''', '''42ª'''), numbers with '''.''' and ''',''' (e.g. '''12.345,67'''), IP and HTTP addresses, integers (e.g. '''12345'''), several abbreviations with '''.''' (e.g. '''a.c.''', '''V.Exa.'''), numbers written in full, such as '''duzentos e trinta e cinco''' (two hundred and thirty-five), sequences of interrogation and exclamation marks, as well as ellipses (e.g. '''???''', '''!!!''', '''?!?!''', '''...'''), punctuation marks (e.g. '''!''', '''?''', '''.''', ''',''', ''':''', ''';''', '''(''', ''')''', '''[''', ''']''', '''-'''), symbols (e.g. '''«''', '''»''', '''#''', '''$''', '''%''', '''&''', '''+''', '''*''', '''<''', '''>''', '''=''', '''@'''), Roman numerals (e.g. '''LI''', '''MMM''', '''XIV''') and also words, such as '''alface''' (lettuce) and '''fim-de-semana''' (weekend).
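
As an illustration of this kind of early entity identification, here is a minimal sketch in Python. The patterns below cover only a few of the token types listed above and are illustrative assumptions, not the tokenizer's actual rules:

<pre>
import re

# A minimal, illustrative subset of the token types described above.
# These patterns are assumptions for demonstration, not the real rules.
TOKEN_PATTERNS = [
    ("NUMBER_FULL", re.compile(r"\d{1,3}(?:\.\d{3})*,\d+")),   # e.g. 12.345,67
    ("ORDINAL",     re.compile(r"\d+[ºª]")),                   # e.g. 3º, 42ª
    ("INTEGER",     re.compile(r"\d+")),                       # e.g. 12345
    ("ELLIPSIS",    re.compile(r"[?!]{2,}|\.\.\.")),           # e.g. ???, !!!, ...
    ("ROMAN",       re.compile(r"\b[IVXLCDM]+\b")),            # e.g. LI, XIV
    ("WORD",        re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")),  # e.g. fim-de-semana
]

def tokenize(text):
    """Greedy left-to-right matching against the pattern list."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for label, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, i)
            if m:
                tokens.append((label, m.group()))
                i = m.end()
                break
        else:
            tokens.append(("SYMBOL", text[i]))  # fall back to single character
            i += 1
    return tokens

print(tokenize("O 3º prémio custou 12.345,67..."))
</pre>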
  
  
====== POS Tagger ======

The morphosyntactic labeling is done by [[LexMan]]: the tokens output by the segmentation module are tagged with POS (part-of-speech) labels, such as ''noun'', ''verb'', ''adjective'', or ''adverb'', among others. There are thirteen categories and the information is encoded in ten fields:

* category (CAT),
* subcategory (SCT),
* mood (MOD),
* tense (TEN),
* person (PER),
* number (NUM),
* gender (GEN),
* degree (DEG),
* case (CAS), and
* formation (FOR).

No category uses all ten fields.
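
A minimal sketch of how such a ten-field tag could be represented; the field values shown are hypothetical examples, not LexMan's actual tag set:

<pre>
from dataclasses import dataclass
from typing import Optional

@dataclass
class MorphoTag:
    """Ten-field morphosyntactic tag; unused fields stay None,
    since no category uses all ten fields."""
    cat: str                      # category, e.g. "verb"
    sct: Optional[str] = None     # subcategory
    mod: Optional[str] = None     # mood
    ten: Optional[str] = None     # tense
    per: Optional[str] = None     # person
    num: Optional[str] = None     # number
    gen: Optional[str] = None     # gender
    deg: Optional[str] = None     # degree
    cas: Optional[str] = None     # case
    for_: Optional[str] = None    # formation ("for" is a Python keyword)

# Hypothetical tag for a verb form: indicative, present, 3rd person singular.
tag = MorphoTag(cat="verb", mod="indicative", ten="present", per="3", num="singular")
</pre>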
  
====== Sentence Splitter ======

The final step of the pre-processing stage is dividing the text into sentences. In order to build a sentence, the system matches sequences that end with '''.''', '''!''' or '''?'''. There are, however, some exceptions to this rule (a minimal sketch of this logic follows the list):

*All registered abbreviations (e.g. '''Dr.''')
*Sequences of pairs of capitalized letters and dots (e.g. '''N.A.S.A.''')
*An ellipsis followed by a lower case letter or by any of the following symbols: '''»''', ''')''', ''']''', '''}'''
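
A minimal sketch of this exception handling, assuming a small illustrative abbreviation list; the real system's registered abbreviations and matching rules are more extensive:

<pre>
import re

# Illustrative subset; the real registered-abbreviation list is larger.
ABBREVIATIONS = {"Dr.", "Sr.", "Prof."}

ACRONYM = re.compile(r"(?:[A-Z]\.)+$")        # e.g. N.A.S.A.
NO_BREAK_AFTER_ELLIPSIS = re.compile(r"^[»\)\]\}]|^[a-z]")

def is_sentence_end(words, i):
    """Decide whether the token at position i really ends a sentence."""
    word = words[i]
    if not word.endswith((".", "!", "?")):
        return False
    if word in ABBREVIATIONS:                 # e.g. "Dr." does not end a sentence
        return False
    if ACRONYM.search(word):                  # e.g. "N.A.S.A." does not either
        return False
    nxt = words[i + 1] if i + 1 < len(words) else ""
    if word.endswith("...") and NO_BREAK_AFTER_ELLIPSIS.match(nxt):
        return False                          # ellipsis followed by », ), ], } or lower case
    return True

def split_sentences(text):
    words = text.split()
    sentences, current = [], []
    for i, w in enumerate(words):
        current.append(w)
        if is_sentence_end(words, i):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("O Dr. Silva chegou. Ele trabalhou na N.A.S.A. durante anos!"))
</pre>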
====== POS Rule Disambiguator ======
 
The morphosyntactic disambiguation module, [[RuDriCo2]], performs corrections to the output of the morphosyntactic labeling module.
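
A minimal, hypothetical example of the kind of contextual rule such a module could apply; this is not RuDriCo2's actual rule syntax, just an illustration of rule-based disambiguation:

<pre>
def disambiguate(tagged):
    """tagged: list of (word, [candidate POS tags]) pairs.
    Example contextual rule: after a determiner, prefer the noun
    reading of a noun/verb-ambiguous word. Purely illustrative."""
    result = []
    prev_tag = None
    for word, candidates in tagged:
        if prev_tag == "determiner" and "noun" in candidates:
            tag = "noun"
        else:
            tag = candidates[0]  # fall back to the first candidate
        result.append((word, tag))
        prev_tag = tag
    return result

# "conta" can be a noun ("account/bill") or a verb ("counts").
print(disambiguate([("a", ["determiner"]), ("conta", ["verb", "noun"])]))
</pre>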
  
====== POS Statistical Disambiguator ======

The statistical morphosyntactic disambiguation is performed by [[MARv3]].
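
As an illustration of statistical tag disambiguation in general, here is a simple bigram-based choice; whether MARv3 uses this particular model is not stated here, and the counts are invented:

<pre>
# Hypothetical bigram counts (previous tag, candidate tag) -> frequency.
# Real models are trained on annotated corpora; these numbers are made up.
BIGRAM_COUNTS = {
    ("determiner", "noun"): 950,
    ("determiner", "verb"): 50,
    ("noun", "verb"): 600,
    ("noun", "noun"): 400,
}

def pick_tag(prev_tag, candidates):
    """Choose the candidate tag most frequently seen after prev_tag."""
    return max(candidates, key=lambda t: BIGRAM_COUNTS.get((prev_tag, t), 0))

print(pick_tag("determiner", ["verb", "noun"]))  # -> "noun"
</pre>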
  
====== XIP ======
 
[[XIP]] performs the syntactic analysis. This analyzer allows the introduction of lexical, syntactic, and semantic information, as well as the application of local grammars and morphosyntactic disambiguation rules, and the computation of chunks and dependencies. XIP is composed of different modules:

* Lexicons - allow information to be added to the different tokens. XIP comes with a pre-existing lexicon, which can be enriched by adding lexical entries or changing existing ones.

* Local Grammars - XIP enables the writing of rules that take the left and right contexts into account. These rules are intended to define entities formed by more than one lexical unit, grouping the elements together into a single entity.

* Chunking Module - chunking rules perform a syntactic analysis of the text: for each sentence, a sequence of categories is built and grouped into structures (chunks), as illustrated by the sketch after this list.

* Dependency Module - dependencies are syntactic relationships between different chunks, which provide a deeper and richer knowledge of the text. The node sequences previously identified by the chunking rules are used by the dependency rules to compute the relationships between them.
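
As an illustration of the chunking idea, here is a toy noun-phrase grouper; XIP's actual rule formalism is different and much richer:

<pre>
def chunk_nps(tagged):
    """Group determiner + adjective* + noun sequences into NP chunks.
    tagged: list of (word, POS) pairs. Toy rule for illustration only."""
    chunks, i = [], 0
    while i < len(tagged):
        word, pos = tagged[i]
        if pos == "determiner":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "adjective":
                j += 1
            if j < len(tagged) and tagged[j][1] == "noun":
                chunks.append(("NP", [w for w, _ in tagged[i:j + 1]]))
                i = j + 1
                continue
        chunks.append((pos, [word]))
        i += 1
    return chunks

print(chunk_nps([("o", "determiner"), ("pequeno", "adjective"),
                 ("gato", "noun"), ("dorme", "verb")]))
# -> [('NP', ['o', 'pequeno', 'gato']), ('verb', ['dorme'])]
</pre>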
  
 
XML (eXtensible Markup Language) is used to exchange data between the different modules of the processing chain.
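
For illustration, a sketch of how a token could be serialized to such inter-module XML; the element and attribute names are invented, as the actual schema is not described here:

<pre>
import xml.etree.ElementTree as ET

# Hypothetical token element; the real schema used between the
# modules of the chain is not described on this page.
token = ET.Element("token", surface="alface", cat="noun", num="singular", gen="feminine")
print(ET.tostring(token, encoding="unicode"))
# -> <token surface="alface" cat="noun" num="singular" gen="feminine" />
</pre>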
