Architecture

<div style="float:right;">__TOC__</div>
'''STRING''' is a '''St'''atistical and '''R'''ule-Based '''N'''atural Lan'''g'''uage Processing Chain for Portuguese, developed at <span class="plainlinks">[http://www.l2f.inesc-id.pt/wiki/ L2F]</span>. It consists of several modules, which are represented in the following figure:
  
[[file:Architecture2.jpg|800px]]
  
==== Tokenizer ====
The first module is responsible for text segmentation: it divides the text into tokens. Besides this, the module is also responsible for identifying, at the earliest possible stage, certain special types of tokens, namely: email addresses, ordinal numbers (''e.g.'' '''3º''' (3rd_masc.sg), '''42ª''' (42nd_fem.sg)), numbers with dot '''.''' and/or comma ''',''' (''e.g.'' '''12.345,67'''), IP and HTTP addresses, integers (''e.g.'' '''12345'''), several abbreviations with dot '''.''' (''e.g.'' '''a.C.''' (before Christ), '''V.Exa.''' (Your Excellency)), numbers written in full (''e.g.'' '''duzentos e trinta e cinco''' (two hundred and thirty-five)), sequences of interrogation and exclamation marks, as well as ellipsis (''e.g.'' '''???''', '''!!!''', '''?!?!''', '''...'''), punctuation marks (''e.g.'' '''!''', '''?''', '''.''', ''',''', ''':''', ''';''', '''(''', ''')''', '''[''', ''']''', '''-'''), symbols (''e.g.'' '''«''', '''»''', '''#''', '''$''', '''%''', '''&''', '''+''', '''*''', '''<''', '''>''', '''=''', '''@'''), and Roman numerals (''e.g.'' '''LI''', '''MMM''', '''XIV''').
Naturally, besides these special textual elements, the tokenizer identifies ''ordinary'' simple words, such as '''alface''' (lettuce). It also tokenizes as a single element sequences of words connected by hyphens, most of them compound words, such as '''fim-de-semana''' (weekend).
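
The fragment below gives a minimal, purely illustrative sketch of this kind of rule-based tokenization in Python (it is not the actual STRING tokenizer): patterns for the special token types are tried before the patterns for ordinary and hyphenated words.

<syntaxhighlight lang="python">
import re

# Illustrative patterns only; the real tokenizer covers many more cases.
TOKEN_PATTERNS = [
    ("EMAIL",   r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    ("URL",     r"https?://\S+"),
    ("ORDINAL", r"\d+[ºª]"),                       # 3º, 42ª
    ("NUMBER",  r"\d{1,3}(?:\.\d{3})*(?:,\d+)?"),  # 12.345,67 or 12345
    ("ABBREV",  r"a\.C\.|V\.Exa\."),               # sample of the abbreviation list
    ("SEQ",     r"\.\.\.|[?!]{2,}"),               # ???, !!!, ?!?!, ...
    ("WORD",    r"\w+(?:-\w+)+|\w+"),              # fim-de-semana, alface
    ("PUNCT",   r"[!?.,:;()\[\]«»#$%&+*<>=@-]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS))

def tokenize(text):
    """Return (token, type) pairs; special token types take priority over plain words."""
    return [(m.group(), m.lastgroup) for m in MASTER.finditer(text)]

print(tokenize("O fim-de-semana custou 12.345,67 ... certo?!"))
</syntaxhighlight>
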
  
==== POS tagger ====
The [[LexMan]] POS tagger performs the morphosyntactic labeling using [http://en.wikipedia.org/wiki/Finite_state_transducer finite state transducers]. It operates on the output of the segmentation module. At this stage, tokens are tagged by [[LexMan]] with [http://en.wikipedia.org/wiki/Part_of_speech POS (part-of-speech)] labels, such as ''noun'', ''verb'', ''adjective'', or ''adverb'', among others.
  
There are twelve categories:
* noun,
* verb,
* adjective,
* pronoun,
* article,
* adverb,
* preposition,
* conjunction,
* numeral,
* interjection,
* punctuation, and
* symbol.
  
The information is encoded in eleven fields:
* category (CAT),
* subcategory (SCT),
* mood (MOD),
* tense (TEN),
* person (PER),
* number (NUM),
* gender (GEN),
* degree (DEG),
* case (CAS),
* syntactic (SYN), and
* semantic (SEM).
  
No category uses all of these fields. For the most part, the categories correspond to traditional grammar POS. Explicit criteria have been defined for problematic POS classification cases (the noun-adjective distinction, pronouns, adverbs, prepositions, conjunctions, and interjections).
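
As an illustration only (a sketch, not LexMan's actual data structures), a tag carrying these eleven fields can be pictured as a simple record in which most fields stay unspecified for any given category:

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import Optional

@dataclass
class MorphTag:
    """Illustrative feature bundle; the field names follow the list above."""
    cat: str                   # category, e.g. "noun", "verb"
    sct: Optional[str] = None  # subcategory
    mod: Optional[str] = None  # mood
    ten: Optional[str] = None  # tense
    per: Optional[str] = None  # person
    num: Optional[str] = None  # number
    gen: Optional[str] = None  # gender
    deg: Optional[str] = None  # degree
    cas: Optional[str] = None  # case
    syn: Optional[str] = None  # syntactic
    sem: Optional[str] = None  # semantic

# A verb form fills mood/tense/person/number; a noun would fill gender/number instead.
tag = MorphTag(cat="verb", mod="indicative", ten="present", per="3", num="sg")
print(tag)
</syntaxhighlight>
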
  
==== Sentence Splitter ====
The final step of the pre-processing stage is the division of the text into sentences. In order to build a sentence, the system matches sequences that end with '''.''', '''!''', or '''?'''. There are, however, some exceptions to this rule (a simplified sketch is given after the list):
* All registered abbreviations (e.g. '''Dr.''')
* Sequences of capitalized letters and dots (e.g. '''N.A.S.A.''')
* An ellipsis ('''...''') followed by any of the following symbols, or by any lower-case letter: '''»''', ''')''', ''']''', '''}'''.
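
The following Python fragment is a simplified sketch of such splitting rules (illustrative only, not the actual module; the abbreviation list is a made-up sample):

<syntaxhighlight lang="python">
import re

ABBREVIATIONS = {"Dr.", "Sr.", "Profa."}  # made-up sample of the registered abbreviation list
ACRONYM = re.compile(r"(?:[A-Z]\.)+")     # N.A.S.A. and similar dotted acronyms

def split_sentences(tokens):
    """Group a flat token list into sentences ending at ., ! or ?, with the exceptions above."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok not in {".", "!", "?", "..."}:
            continue
        prev = tokens[i - 1] if i > 0 else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # registered abbreviations and dotted acronyms do not end a sentence
        # (here "." is assumed to be a separate token following "Dr", "N.A.S.A", etc.)
        if tok == "." and (prev + "." in ABBREVIATIONS or ACRONYM.fullmatch(prev + ".")):
            continue
        # an ellipsis followed by », ), ], } or a lower-case letter does not end a sentence
        if tok == "..." and re.match(r"[»)\]}a-z]", nxt):
            continue
        sentences.append(current)
        current = []
    if current:
        sentences.append(current)
    return sentences

print(split_sentences(["O", "Dr", ".", "Silva", "chegou", "...", "»", "disse", "alguém", "."]))
</syntaxhighlight>
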
==== Disambiguation ====
The next stage of the processing chain is the disambiguation process, which comprises two steps:
* Rule-driven morphosyntactic disambiguation, performed by [[RuDriCo2]];
* Statistical disambiguation, performed by [[MARv4]].

[[RuDriCo2]]'s main goal is to adjust the results produced by the morphological analyzer to the specific needs of each parser. In order to achieve this, it modifies the segmentation produced by the previous module. For example, it might contract expressions provided by the morphological analyzer, such as ''ex-'' and ''aluno'', into one segment, ''ex-aluno'' (former student); or it can perform the opposite operation and expand expressions such as ''nas'' into two segments, ''em'' and ''as'' (in + as/Art.def.fem.pl). This depends on what the parser needs.

Altering the segmentation is also useful for tasks such as the recognition of numbers and dates. The ability to modify the segmentation is provided through declarative rules based on the concept of pattern matching. [[RuDriCo2]] can also be used to solve (or introduce) morphosyntactic ambiguities. When [[RuDriCo2]] is executed in the processing chain, it performs all of the above-mentioned tasks.
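
Purely as an illustration of segment expansion (RuDriCo2 itself uses its own declarative, pattern-matching rule language, which is not reproduced here), a table-driven version of the ''nas'' → ''em'' + ''as'' example could look like this:

<syntaxhighlight lang="python">
# Hypothetical, table-driven sketch of segment expansion; this is not RuDriCo2's rule syntax.
EXPANSIONS = {
    "nas": [("em", "preposition"), ("as", "article.def.fem.pl")],
    "no":  [("em", "preposition"), ("o", "article.def.masc.sg")],
}

def adjust_segmentation(tokens):
    """Expand contracted forms into the separate segments a parser may expect."""
    out = []
    for tok in tokens:
        out.extend(EXPANSIONS.get(tok.lower(), [(tok, None)]))
    return out

print(adjust_segmentation(["Ele", "estuda", "nas", "escolas"]))
</syntaxhighlight>
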
[[MARv4]]'s main goal is to analyze the labels that were attributed to each token in the previous step of the processing chain, and then choose the most likely label for each one. In order to achieve this, it employs a statistical model known as the [http://en.wikipedia.org/wiki/Hidden_Markov_Model Hidden Markov Model] (HMM). To properly define an HMM, one first needs to introduce the Markov chain, sometimes called the observed Markov model. A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through. Because it cannot represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences, that is, when we need to compute a probability for a sequence of events that can be observed in the world. However, in many cases events may not be directly observable.

In this particular case, POS tags are not observable: what we see are words, or ''tokens'', and we need to infer the correct tags from the word sequence. We therefore say that the tags are ''hidden'', because they are not observed. Hence, an HMM allows us to talk about both observed events (like the words that we see in the input) and hidden events (like POS tags).

There are several algorithms for computing the most likely tag sequence given an observation sequence. [[MARv4|MARv]] uses the [http://en.wikipedia.org/wiki/Viterbi_algorithm Viterbi algorithm].
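
The sketch below shows the standard Viterbi recursion over a toy HMM with made-up probabilities; it only illustrates the algorithm and is not MARv's implementation:

<syntaxhighlight lang="python">
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state (tag) sequence for the observations (log-space)."""
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s].get(observations[0], 1e-12)), [s])
          for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            score, path = max(
                (V[-1][p][0] + math.log(trans_p[p][s]) + math.log(emit_p[s].get(obs, 1e-12)),
                 V[-1][p][1] + [s])
                for p in states
            )
            row[s] = (score, path)
        V.append(row)
    return max(V[-1].values())[1]

# Toy example with made-up probabilities: tagging "a casa" (ART = article, N = noun).
states  = ["ART", "N"]
start_p = {"ART": 0.5, "N": 0.5}
trans_p = {"ART": {"ART": 0.1, "N": 0.9}, "N": {"ART": 0.6, "N": 0.4}}
emit_p  = {"ART": {"a": 0.9, "casa": 0.1}, "N": {"a": 0.1, "casa": 0.9}}
print(viterbi(["a", "casa"], states, start_p, trans_p, emit_p))  # ['ART', 'N']
</syntaxhighlight>
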
==== Syntactic analysis ====
[[XIP]] ('''X'''erox '''I'''ncremental '''P'''arser) is the module responsible for the syntactic analysis. This analyzer also allows for the introduction of lexical, syntactic, and semantic information into the output of the previous modules. XIP then performs the syntactic analysis of the text through the sequential application of local grammars, morphosyntactic disambiguation rules, and, finally, the calculation of chunks and dependencies. [[XIP]] itself is composed of different modules:
* Lexicons — [[XIP]] lexicons make it possible to add information to the different tokens. In [[XIP]], there is a pre-existing lexicon, which can be enriched by adding lexical entries or changing the existing ones.
* Local Grammars — [[XIP]] enables the writing of pattern-matching rules that consider both the left and right contexts of a given pattern. These rules are intended to define entities formed by more than one lexical unit, grouping these elements together into a single entity.
* Chunking Module — [[XIP]] chunking rules perform a (so-called) ''shallow parsing'', or basic syntactic analysis, of the text. For each phrase type (''e.g.'' '''NP''', '''PP''', '''VP''', etc.), a sequence of categories is grouped into elementary syntactic structures (''chunks'').
* Dependency Module — In [[XIP]], dependencies are syntactic relationships between different chunks, chunk heads, or elements inside chunks. Dependencies allow a deeper and richer knowledge about the text's information and content. Major dependencies correspond to the so-called ''deep parsing'' syntactic functions, such as '''SUBJECT''', '''DIRECT OBJECT''', etc. Other dependencies are just auxiliary relations, used to calculate the ''deeper'' syntactic dependencies. For example, the '''CLINK''' dependency links each argument of a coordination to the coordinative conjunction they depend on.  

The sequence of nodes previously identified by the chunking rules is used by the dependency rules to calculate the relationships between them.
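
To make the distinction between chunks and dependencies concrete, the fragment below sketches, in a deliberately simplified form that does not reflect XIP's actual output format, a chunked sentence and two dependencies over the chunk heads:

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Chunk:
    label: str         # e.g. "NP", "VP", "PP"
    tokens: List[str]  # the words grouped into this elementary structure
    head: str          # the head token of the chunk

# "O gato comeu o peixe" (the cat ate the fish), chunked with illustrative values.
chunks = [
    Chunk("NP", ["O", "gato"], head="gato"),
    Chunk("VP", ["comeu"], head="comeu"),
    Chunk("NP", ["o", "peixe"], head="peixe"),
]

# Dependencies relate chunk heads; the names mirror the syntactic functions mentioned above.
dependencies: List[Tuple[str, str, str]] = [
    ("SUBJECT", "comeu", "gato"),
    ("DIRECT-OBJECT", "comeu", "peixe"),
]

for name, governor, dependent in dependencies:
    print(f"{name}({governor}, {dependent})")
</syntaxhighlight>
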
==== Post-syntactic analysis ====
For the moment, the following post-syntactic modules have been developed:
* Anaphora resolution,
* Slot filling,
* Time expressions normalization, and
* Event ordering.

==== Communication ====
Between the different modules of the processing chain, '''''STRING''''' uses [http://en.wikipedia.org/wiki/XML XML (eXtensible Markup Language)].
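
As a purely illustrative example of this kind of XML interchange (the element and attribute names below are hypothetical and do not reflect STRING's actual schema), one module could emit a sentence that the next module then parses:

<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET

# Hypothetical fragment handed from one module to the next; the markup is illustrative only.
fragment = """
<sentence id="1">
  <token surface="O"     cat="article" gen="masc" num="sg"/>
  <token surface="gato"  cat="noun"    gen="masc" num="sg"/>
  <token surface="dorme" cat="verb"    per="3"    num="sg"/>
</sentence>
"""

root = ET.fromstring(fragment)
for token in root.iter("token"):
    print(token.get("surface"), token.get("cat"))
</syntaxhighlight>
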
==== Demo ====
'''''STRING''''' and its individual modules can be tested <span class="plainlinks">[http://string.l2f.inesc-id.pt/demo here]</span>.
