Difference between revisions of "XIP"

Revision as of 00:40, 6 March 2012

Acronym

XIP stands for Xerox Incremental Parser.


Introduction

XIP is a Xerox parser able to perform several tasks, namely:

  • calculation of chunks and dependencies;
  • adding lexical, syntactic and semantic information;
  • applying morphosyntactic disambiguation rules;
  • applying local grammars.

The fundamental data representation unit in XIP is the node. A node has a category, feature-value pairs and sibling nodes. For example, the node below represents the noun Pedro, and its features express its properties: Pedro is a proper noun that denotes a human, more precisely an individual male (feature masc); the node also has features that describe its number (singular, sg) and the fact that it is spelled with an upper-case initial letter (feature maj):

Pedro: noun[human, individual, proper, firstname, people, sg, masc, maj]

Every node category and every feature must be declared in declaration files. Furthermore, features must be declared with their domain of possible values. They are an extremely important part of XIP, as they describe the properties of nodes. Features, by themselves, do not exist; they are always associated with a value, hence the so-called feature-value pair.

Moreover, features can be instantiated (operator =), tested (operator :), or deleted (operator =~) within all types of rules. Instantiation sets a value on a feature and deletion removes it, whereas testing checks whether a feature holds a specific value:

Type          Example          Explanation
Instantiated  [gender = fem]   The value fem is set to the feature gender
Tested        [gender:fem]     Does the feature gender have the value fem?
Tested        [gender:~]       The feature gender should not be instantiated on the node
Tested        [gender:~fem]    The feature gender should not have the value fem
Deleted       [acc =~]         The feature acc is cleared of all values on the node


Chunking rules

Chunking is the process by which sequences of categories are grouped into structures; this process is achieved through chunking rules. There are two types of chunking rules:

  • immediate dominance and linear precedence rules (ID/LP rules);
  • sequence rules.

The first important aspect about chunking rules is that each one must be defined in a specific layer. This layer is represented by an integer number, ranging from 1 to 300. Below is an example of how to define two rules in two different layers:

1> NP = (art;?[dem]), ?[indef1]. // layer 1
2> NP = (art;?[dem]), ?[poss].   // layer 2

Layers are processed sequentially from the first one to the last. Each layer can contain only one type of chunking rule.

ID/LP rules are significantly different from sequence rules. While ID rules describe unordered sets of nodes and LP rules work with ID rules to establish some order between the categories, sequence rules describe an ordered sequence of nodes. The syntax of an ID rule is:

layer> node-name -> list-of-lexical-nodes.

Consider the following example of an ID rule:

1> NP -> det, noun, adj.

Assuming that det, noun and adj are categories that have already been declared, this rule is interpreted as follows: whenever there is a sequence of a determiner, a noun and an adjective, regardless of the order in which they appear, create a Noun Phrase (NP) node. This rule therefore matches more expressions than desired, e.g. o carro preto (the car black), o preto carro (the black car), preto carro o (black car the) and carro preto o (car black the). This is where LP rules come in. They are associated with ID rules, and they can either apply to a particular layer or be treated as a general constraint throughout the XIP grammar. They have the following syntax:

layer> [set-of-features] < [set-of-features].

Consider the following example:

1> [det:+] < [noun:+].
1> [noun:+] < [adj:+].

Thus, by stating that a determiner must precede a noun in layer 1, and that a noun must precede an adjective, also in layer 1, the system sets constraints on this layer only, which means that expressions such as o preto carro (the black car) are no longer allowed, whereas o carro preto (the car black) still is. (Naturally, these are just examples of ID/LP rules. The actual grammatical rules governing the relative position of adjectives and nouns are much more complex.)

It is also possible to use parentheses to express optional categories, and an asterisk to indicate that zero or more instances of a category are accepted. The following rule states that the determiner is optional and that as many adjectives as possible are accepted:

1> NP -> (det), noun, adj*.

Taking into account both LP rules established above, the following expressions are accepted: carro (car), carro preto (car black), o carro preto (the car black), o carro preto bonito (the car black beautiful).
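
The interplay between the ID rule and the two LP rules can be sketched in Python. This is a simplified model under assumptions: a candidate chunk is a list of category names, and the names ID_RULE, LP_RULES and is_np are hypothetical, chosen only for this illustration.

```python
from collections import Counter

# ID rule "1> NP -> (det), noun, adj*." as (minimum, maximum) counts per category;
# None means "no upper bound".
ID_RULE = {"det": (0, 1), "noun": (1, 1), "adj": (0, None)}

# LP rules "[det:+] < [noun:+]." and "[noun:+] < [adj:+]." as (before, after) pairs.
LP_RULES = [("det", "noun"), ("noun", "adj")]

def satisfies_id(categories):
    """Check the unordered content of the chunk against the ID rule."""
    counts = Counter(categories)
    if any(cat not in ID_RULE for cat in counts):
        return False
    for cat, (low, high) in ID_RULE.items():
        n = counts[cat]
        if n < low or (high is not None and n > high):
            return False
    return True

def satisfies_lp(categories):
    """Check that every 'before' category precedes every 'after' category."""
    for before, after in LP_RULES:
        b = [i for i, c in enumerate(categories) if c == before]
        a = [i for i, c in enumerate(categories) if c == after]
        if b and a and max(b) > min(a):
            return False
    return True

def is_np(categories):
    return satisfies_id(categories) and satisfies_lp(categories)

print(is_np(["det", "noun", "adj", "adj"]))  # o carro preto bonito
print(is_np(["det", "adj", "noun"]))         # o preto carro
```

Keeping the two checks separate mirrors the division of labour in the text: the ID rule only says which categories may appear, and the LP rules only say in which order.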

Finally, it is worth mentioning that these rules can be further constrained with contexts. For example:

1> NP -> |det, ?*| noun, adj |?*, verb|.

This rule states that a determiner must appear to the left of the set of categories, and that a verb must appear to the right. By applying this rule to a sentence such as o carro preto andou na estrada (the black car went on the road), we obtain the following chunk:

NP[o carro preto].

Hence, although they help to constrain a rule even further, contexts are not saved inside a node.

The other kind of chunking rules, sequence rules, are conceptually different because they describe an ordered sequence of nodes, yet their syntax is almost identical to that of ID/LP rules. There are, however, some differences and additions:

  • sequence rules do not use the -> operator. Instead, they use the = operator, which matches the shortest possible sequence. In order to match the longest possible sequence, the @= operator is used;
  • there is an operator for applying negation (~) and another for applying disjunction (;);
  • unlike ID/LP rules, the question mark (?) can be used to represent any category on the right side of a rule;
  • sequence rules can use variables.

The following sequence rule matches expressions like alguns rapazes/uns rapazes (some boys), nenhum rapaz (no boy), muitos rapazes (many boys) or cinco rapazes (five boys):

1> NP @= ?[indef2];?[q3];num, (AP;adj;pastpart), noun.
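
The difference between the = operator (shortest match) and the @= operator (longest match) can be illustrated with a Python sketch. It rests on assumptions: the pattern noun, adj* is a hypothetical one chosen because it contains a repetition, the function name match_prefix is invented for the example, and Python's lazy and greedy regular-expression quantifiers only stand in for XIP's own matching strategy.

```python
import re

def match_prefix(categories, longest):
    """Match the hypothetical pattern `noun, adj*` at the start of a category sequence.

    longest=False imitates the = operator (shortest match);
    longest=True imitates the @= operator (longest match).
    """
    # Encode the sequence as text, one category per space-terminated token.
    text = "".join(cat + " " for cat in categories)
    quantifier = "*" if longest else "*?"  # greedy versus lazy repetition
    pattern = re.compile(r"noun (?:adj )" + quantifier)
    m = pattern.match(text)
    return m.group(0).split() if m else []

sequence = ["noun", "adj", "adj", "verb"]
print(match_prefix(sequence, longest=False))
print(match_prefix(sequence, longest=True))
```

With the shortest-match setting only the noun is grouped, while the longest-match setting also absorbs both adjectives.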

Finally, consider again the example O Pedro foi ao Japão. (Pedro went to Japan). At this stage, after the pre-processing and disambiguation, and also after applying the chunking rules, the system presents the following output tree:

                     TOP
           +----------+----------+
           |          |          |
          NP         VF         PP
       +-------+      +    +----+-------+
       |       |      |    |    |       |
      ART    NOUN   VERB PREP  ART    NOUN
       +       +      +    +    +       +
       |       |      |    |    |       |
       O     Pedro   foi   a    o    Japão


Dependency rules

Being able to extract dependencies between nodes is very important because it can provide us with a richer, deeper understanding of the texts. Dependency rules take the sequences of constituent nodes identified by the chunking rules and identify relationships between them. This section presents a brief overview of their syntax, operators, and some examples.

A dependency rule presents the following syntax:

|pattern| if <condition> <dependency_terms>.

In order to understand what the pattern is, it is first essential to understand what a Tree Regular Expression (TRE) is. A TRE is a special type of regular expression that is used in XIP to establish connections between distant nodes. In particular, TREs explore the inner structure of subnodes through the use of the brace characters ({}). The following example states that an NP node's inner structure must be examined in order to see if it is made of a determiner and a noun:

NP{det, noun}.

TREs support the use of several operators, namely:

  • the semicolon (;) operator is used to indicate disjunction;
  • the asterisk (*) operator is used to indicate zero or more;
  • the question mark (?) operator is used to indicate any;
  • the circumflex (^) operator is used to explore subnodes for a category.

Hence, and returning to the dependency rules, the pattern contains a TRE that describes the structural properties of parts of the input tree. The condition is any Boolean expression supported by XIP (with the appropriate syntax), and the dependency_terms are the consequent of the rule.

The first dependency rules to be executed are the ones that establish the relationships between the nodes, as seen in the next example:

|NP#1{?*, #2[last]}|
  HEAD(#2, #1)

This rule identifies HEAD relations (see below), for example a bela rapariga (the beautiful girl) ⇒ HEAD(rapariga, a bela rapariga).
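
The effect of the HEAD rule can be sketched in Python. This is a minimal illustration under assumptions: the Node class and the function head_dependencies are hypothetical, and the sketch only handles NP chunks, as the rule above does.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    category: str
    word: str = ""
    children: list = field(default_factory=list)

    def text(self):
        """Surface string covered by this node."""
        if not self.children:
            return self.word
        return " ".join(child.text() for child in self.children)

def head_dependencies(node):
    """Emit HEAD(last child, chunk) for every NP, imitating |NP#1{?*, #2[last]}|."""
    deps = []
    if node.category == "NP" and node.children:
        deps.append(("HEAD", node.children[-1].text(), node.text()))
    for child in node.children:
        deps.extend(head_dependencies(child))
    return deps

np_chunk = Node("NP", children=[
    Node("art", "a"),
    Node("adj", "bela"),
    Node("noun", "rapariga"),
])
print(head_dependencies(np_chunk))
```

The last child plays the role of variable #2 and the NP node itself the role of #1, so the tuple produced corresponds to HEAD(#2, #1).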

As already stated, the main goal of the dependency rules is to establish relationships between the nodes. Coming back to our usual example, the following output is the current result of applying these rules to the sentence O Pedro foi ao Japão. (Pedro went to Japan):

MAIN(foi)
HEAD(Pedro,O Pedro)
HEAD(Japão,a o Japão)
HEAD(foi,foi)
DETD(Pedro,O)
DETD(Japão,o)
PREPD(Japão,a)
VDOMAIN(foi,foi)
MOD_POST(foi,Japão)
SUBJ_PRE(foi,Pedro)
NE_INDIVIDUAL_PEOPLE(Pedro)
NE_LOCAL_COUNTRY_ADMIN_AREA(Japão)

The last two indicate that two NEs have been captured and classified in this sentence: Pedro has been captured and classified as HUMAN INDIVIDUAL PERSON and Japão (Japan) has been captured and classified as LOCATION CREATED COUNTRY. The tags NE_INDIVIDUAL_PEOPLE and NE_LOCAL_COUNTRY_ADMIN_AREA are merely used to see that the NEs have been classified. The final XML tags are created afterwards, as the final step of the whole process.

The other dependencies listed above cover a wide range of binary relationships such as:

  • the relation between the nucleus of some chunk and the chunk itself (HEAD);
  • the relation between a nominal head and a determiner (DETD);
  • the relation between the head of a Prepositional Phrase (PP) and the preposition (PREPD);
  • among many others.

To see a complete list and a detailed description of all dependency relationships as of July 2009, please refer to [Portuguese Grammar manual].

Now consider the following example of another kind of dependency rule (aimed at classifying NEs):

 #1{?*, num[quant,sports_results]}
  if (~NE[quant,sports_results](#1))
    NE[quant=+,sports_results=+](#1)

This rule uses a variable, represented by #1, which is assigned to the top node, because it is placed before the first brace ({). This variable could have been placed inside the braces structure, assigned (for example) to the node num. This rule states that if a node is made of any category followed by a number (with two features that determine whether it is a sports result), and if this node has not yet been classified as a NE with these features, then one wants to add them to the top node in order to classify it as AMOUNT SPORTS_RESULT. Please notice that it is the top node that is classified, because the variable is assigned to it; if it had been placed next to the node num, for example, then only this subnode would have been classified.

Notice also the usage of the negation operator (~) inside the conditional statement. XIP's syntax for these conditional statements also allows the operators & for conjunction and | for disjunction. Parentheses are also used to group statements and establish a clearer precedence, as in most programming languages.
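
The logic of the sports-result rule (pattern, negated condition, consequent) can be sketched in Python. The data layout is an assumption made for the example: nodes are dictionaries, existing dependencies are stored in a set of tuples, and the function name classify_sports_result is hypothetical.

```python
def classify_sports_result(node, dependencies):
    """Imitate: #1{?*, num[quant,sports_results]}
                if (~NE[quant,sports_results](#1))
                  NE[quant=+,sports_results=+](#1)

    node: dict with 'id' and 'children' (dicts with 'cat' and a 'features' set).
    dependencies: set of (name, frozenset_of_features, node_id) tuples.
    Returns True when a new NE dependency was added.
    """
    children = node["children"]
    if not children:
        return False
    wanted = frozenset({"quant", "sports_results"})
    last = children[-1]
    # Pattern: any categories followed by a num node bearing both features.
    pattern_ok = last["cat"] == "num" and wanted <= last["features"]
    # Condition: the top node (#1) has not been classified yet.
    already = ("NE", wanted, node["id"]) in dependencies
    if pattern_ok and not already:
        dependencies.add(("NE", wanted, node["id"]))
        return True
    return False

top = {"id": 1, "children": [
    {"cat": "noun", "features": set()},
    {"cat": "num", "features": {"quant", "sports_results"}},
]}
deps = set()
print(classify_sports_result(top, deps))  # first application
print(classify_sports_result(top, deps))  # blocked by the negated condition
```

The dependency is attached to the identifier of the top node, which reflects the point made above: the node that carries the variable is the one that gets classified.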


Lexicons

XIP allows the definition of custom lexicons (lexicon files), which add new features that are not stored in the standard lexicon. Having a rich vocabulary in the system can be very beneficial for improving its recall.

In XIP, a lexicon file begins by simply stating Vocabulary:, which tells the XIP engine that the file contains a custom lexicon. Only afterwards come the actual additions to the vocabulary.

The lexical rules attempt to provide a more precise interpretation of the tokens associated with a node. They have the following syntax (the parts of the rule contained in parentheses are optional):

lemma(: POS([features])) (+)= (POS)[features].

Some examples of lexical rules follow:

$US                = noun[meas=+, curr=+].
eleitor:     noun += [human=+].
google            += verb[intransitive=+].

The first two examples show how to add new features to existing words (in this case, they are both nouns). In the first case, the features meas (measure) and curr (currency) are added to $US; in the second case, the human feature is added to eleitor (elector). In the third case, however, google is given the additional reading of verb.


Local grammars

Local grammars are text files that contain chunking rules; each file may contain both ID/LP and sequence rules. Essentially, we use different local grammar files to capture desirable sequences of nodes and to attribute features to them. We employ a division based on different categories of NEs. For example, whereas the file LGLocation is aimed at capturing sequences of nodes related to the LOCATION category, the file LGPeople captures sequences of nodes related to the INDIVIDUAL type (HUMAN category).

After the pre-processing and disambiguation stages, XIP receives its input sentences and tries to match them against the rules in the local grammar files. These files are run sequentially, in an order predefined in a configuration file.

As an example, consider the following sequence rule belonging to the local grammar responsible for dealing with LOCATION NEs:

1> noun[location=+, admin_area=+] = ?[lemma:novo,maj];?[lemma:nova,maj], noun[location,maj].

This rule is responsible for matching expressions such as Novo México (New Mexico), Nova Zelândia (New Zealand) or Nova Escócia (Nova Scotia), and then it creates a noun node with two feature-value pairs (location and admin_area). Notice how the ? and ; operators were used in order to capture either Novo or Nova.
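
The behaviour of this sequence rule can be approximated by the following Python sketch. The token layout (dictionaries with form, lemma, cat and features) and the function name match_new_location are assumptions made for the illustration; they do not reflect how XIP stores tokens internally.

```python
def match_new_location(tokens, start=0):
    """Imitate: noun[location=+, admin_area=+] =
                ?[lemma:novo,maj];?[lemma:nova,maj], noun[location,maj].

    tokens: list of dicts with 'form', 'lemma', 'cat' and a 'features' set.
    Returns the new noun node, or None when the rule does not match at `start`.
    """
    if start + 1 >= len(tokens):
        return None
    first, second = tokens[start], tokens[start + 1]
    # Disjunction: any category whose lemma is novo or nova, capitalized.
    first_ok = first["lemma"] in ("novo", "nova") and "maj" in first["features"]
    # A capitalized noun already marked as a location.
    second_ok = second["cat"] == "noun" and {"location", "maj"} <= second["features"]
    if not (first_ok and second_ok):
        return None
    return {
        "cat": "noun",
        "form": first["form"] + " " + second["form"],
        "features": {"location", "admin_area"},
    }

tokens = [
    {"form": "Nova", "lemma": "nova", "cat": "adj", "features": {"maj"}},
    {"form": "Zelândia", "lemma": "Zelândia", "cat": "noun", "features": {"location", "maj"}},
]
print(match_new_location(tokens))
```

The returned node carries only the two features that the rule instantiates; any feature percolation performed by a real grammar is left out of the sketch.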


Disambiguation rules

To conclude this section, it is also important to state that XIP allows the definition of disambiguation rules. The general syntax for a disambiguation rule is:

layer> readings_filter = |left_context| selected_readings |right_context|.

Like chunking rules, disambiguation rules also employ the concept of layer and contexts. The left side of a disambiguation rule contains the readings_filter. This filter specifies a subset of categories and features that can be associated with a word. The list can be constrained with features and the filter applies when it matches a subset of the complete ambiguity class of a word. Finally, the selected_readings portion of a disambiguation rule gives the selected interpretation(s) of the word.

There are four main operators used in disambiguation rules:

  • the <> operator: it is used to define specific features associated with a category;
  • the [] operator: it is used to refer to the complete set of features for a category;
  • the % operator: it restricts the interpretation of a word to one solution;
  • the <* operator: when used, one specifies that each reading must bear the features listed immediately after.

Consider the example below:

1> ?<maj:+,start:~> = ?<proper:+> .

This rule states that capitalized words (other than at the beginning of a sentence) must be proper names.
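
The effect of this disambiguation rule can be pictured with a short Python sketch. Several simplifications are assumed: the features maj and start are treated as word-level features, each reading is a dictionary with its own feature set, and a word with no proper reading keeps all its readings. The function name disambiguate is hypothetical.

```python
def disambiguate(readings, word_features):
    """Imitate: 1> ?<maj:+,start:~> = ?<proper:+> .

    readings: list of dicts with 'cat' and a 'features' set.
    word_features: set of features of the word itself (e.g. 'maj', 'start').
    """
    # Filter: the word is capitalized and is not at the start of the sentence.
    applies = "maj" in word_features and "start" not in word_features
    if not applies:
        return readings
    proper = [r for r in readings if "proper" in r["features"]]
    # Assumption: when no proper reading exists, the rule leaves the word untouched.
    return proper or readings

readings = [
    {"cat": "noun", "features": {"proper"}},
    {"cat": "adj", "features": set()},
]
print(disambiguate(readings, {"maj"}))           # mid-sentence capitalized word
print(disambiguate(readings, {"maj", "start"}))  # sentence-initial word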