XIP

Acronym

XIP stands for XEROX Incremental Parser.

Introduction

XIP is a XEROX parser, based on finite-state technology and able to perform several tasks, namely:

adding lexical, syntactic and semantic information;
applying local grammars;
applying morphosyntactic disambiguation rules;
calculating chunks and dependencies.

The fundamental data representation unit in XIP is the node. A node has a category, feature-value pairs and sibling nodes. For example, the node below represents the noun Pedro and carries several features that express its properties. Here the features mean the following: Pedro is a noun that denotes a human, an individual who is male (feature masc); the node also has features for its number (singular, sg) and for the fact that it is spelled with an upper-case initial letter (feature maj):

Pedro: noun[human, individual, proper, firstname, people, sg, masc, maj]

Every node category and every feature must be declared in declaration files. Furthermore, features must be declared with their domain of possible values. They are an extremely important part of XIP, as they describe the properties of nodes. Features, by themselves, do not exist; they are always associated with a value, hence the so-called feature-value pair.

Moreover, features can be instantiated (operator =), tested (operator :), or deleted (operator =~) within all types of rules. Instantiation sets a value on a feature and deletion removes it, whereas testing checks whether a feature has a specific value:

Type	Example	Explanation
Instantiated	`[gender=fem]`	The value fem is set to the feature gender
Tested	`[gender:fem]`	Does the feature gender have the value fem ?
	`[gender:~]`	The feature gender should not be instantiated on the node
	`[gender:~fem]`	The feature gender should not have the value fem
Deleted	`[acc=~]`	The feature acc is cleared of all values on the node
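The three kinds of operation in the table can be mimicked with a small Python sketch. This is only an illustration of the semantics described above, not XIP code: the Node class and its method names are hypothetical, and a feature is modelled as a dictionary key that is simply absent when it has no value.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a XIP node: a category plus feature-value pairs."""
    category: str
    features: dict = field(default_factory=dict)

    def instantiate(self, feature, value):
        # [gender=fem]: set the value on the feature
        self.features[feature] = value

    def test(self, feature, value):
        # [gender:fem]: does the feature have this value?
        return self.features.get(feature) == value

    def test_absent(self, feature):
        # [gender:~]: the feature must not be instantiated on the node
        return feature not in self.features

    def test_not(self, feature, value):
        # [gender:~fem]: the feature must not have this value
        return self.features.get(feature) != value

    def delete(self, feature):
        # [acc=~]: clear the feature of all values
        self.features.pop(feature, None)


node = Node("noun", {"acc": "+"})
node.instantiate("gender", "fem")
node.delete("acc")
print(node.test("gender", "fem"), node.test_absent("acc"))
```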

Chunking rules

Chunking is the process by which sequences of categories are grouped into structures; this process is achieved through chunking rules. There are two types of chunking rules:

immediate dependency and linear precedence rules (ID/LP rules);
sequence rules.

In the following, we present some toy examples to illustrate the syntax of the chunking rules.

The first important aspect about chunking rules is that each one must be defined in a specific layer. This layer is represented by an integer number, ranging from 1 to 300. Below is an example of how to define two rules in two different layers:

1> NP = (art;?[dem]), ?[indef1]. // layer 1
2> NP = (art;?[dem]), ?[poss].   // layer 2

Layers are processed sequentially from the first one to the last. Each layer can contain only one type of chunking rule.

ID/LP rules are significantly different from sequence rules. While ID rules describe unordered sets of nodes and LP rules work with ID rules to establish some order between the categories, sequence rules describe an ordered sequence of nodes. The syntax of an ID rule is:

layer> node-name -> list-of-lexical-nodes.

Consider the following example of an ID rule:

1> NP -> det, noun, adj.

Assuming that det, noun and adj are categories that have already been declared, this rule is interpreted as follows: whenever a determiner, a noun and an adjective occur together, regardless of the order in which they appear, create a Noun Phrase (NP) node. This rule therefore accepts more expressions than desired, e.g. o carro preto (the car black), o preto carro (the black car), preto carro o (black car the) and carro preto o (car black the). This is where LP rules come into play. They are associated with ID rules and can either apply to a particular layer or act as a general constraint throughout the XIP grammar. They have the following syntax:

layer> [set-of-features] < [set-of-features].

Consider the following example:

1> [det:+] < [noun:+].
1> [noun:+] < [adj:+].

Thus, by stating that a determiner must precede a noun only in layer 1, and that a noun must precede an adjective also only in layer 1, the system is now setting constraints in this layer, which means that expressions such as o preto carro (the black car) will no longer be allowed. However, o carro preto (the car black) will. The examples above are just meant as an illustration of chunking rules. The actual grammatical rules governing the relative position of adjectives and nouns are much more complex.
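The interaction between the ID rule and the two LP rules can be sketched in Python. This is a toy model under stated assumptions, not XIP's actual matching algorithm: a candidate chunk is represented as a plain list of category names, and the function names are hypothetical.

```python
from collections import Counter

ID_RHS = ["det", "noun", "adj"]                 # 1> NP -> det, noun, adj.
LP_RULES = [("det", "noun"), ("noun", "adj")]   # 1> [det:+] < [noun:+].  1> [noun:+] < [adj:+].


def matches_id(categories):
    """ID part: the same categories as the rule, in any order."""
    return Counter(categories) == Counter(ID_RHS)


def satisfies_lp(categories):
    """LP part: no 'before' category may occur after an 'after' category."""
    for before, after in LP_RULES:
        for i, cat in enumerate(categories):
            if cat == after and before in categories[i + 1:]:
                return False
    return True


def accepts(categories):
    return matches_id(categories) and satisfies_lp(categories)


print(accepts(["det", "noun", "adj"]))   # o carro preto
print(accepts(["det", "adj", "noun"]))   # o preto carro
```

With only the ID rule, both orders would match; the LP rules are what reject the second one.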

It is also possible to use parentheses to express optional categories, and an asterisk to indicate that zero or more instances of a category are accepted. The following rule states that the determiner is optional and that zero or more adjectives are accepted, to form an NP chunk:

1> NP -> (det), noun, adj*.

Taking into account both LP rules established above, the following expressions are accepted: carro (car), carro preto (car black), o carro preto (the car black), o carro preto bonito (the car black beautiful).
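As a rough illustration, optionality and repetition can be modelled as lower and upper bounds on how often each category may occur, checked together with the LP rules. The sketch below is hypothetical Python, not XIP; the names ID_BOUNDS and accepts are assumptions made for the example.

```python
import math
from collections import Counter

# (min, max) occurrences per category for: 1> NP -> (det), noun, adj*.
ID_BOUNDS = {"det": (0, 1), "noun": (1, 1), "adj": (0, math.inf)}
LP_RULES = [("det", "noun"), ("noun", "adj")]


def accepts(categories):
    counts = Counter(categories)
    # Reject categories that the ID rule does not mention at all.
    if any(cat not in ID_BOUNDS for cat in counts):
        return False
    # Optional categories may be missing; starred ones may repeat.
    for cat, (low, high) in ID_BOUNDS.items():
        if not low <= counts[cat] <= high:
            return False
    # LP rules: every 'before' category must precede every 'after' category.
    for before, after in LP_RULES:
        last_before = max((i for i, c in enumerate(categories) if c == before), default=-1)
        first_after = min((i for i, c in enumerate(categories) if c == after), default=len(categories))
        if last_before > first_after:
            return False
    return True


for phrase in (["noun"], ["noun", "adj"], ["det", "noun", "adj", "adj"], ["adj", "noun"]):
    print(phrase, accepts(phrase))
```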

Finally, it is worth mentioning that these rules can be further constrained with right and/or left contexts. For example:

1> NP -> | conj | noun, adj | verb |.

This rule states that a conjunction must appear to the left of the set of categories and a verb to the right. Applying this rule to a sentence such as E carros pretos há muitos na estrada (and black cars, there are many on the road), we obtain the following chunk:

NP[carros pretos].

Hence, although they help constrain a rule even further, contexts are not stored inside the resulting node.
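A minimal Python sketch of this behaviour follows. It is an approximation under several assumptions: the categories assigned to the words of the example sentence are guesses made for illustration, the function name find_np is hypothetical, and, for brevity, the noun and adjective are matched in a fixed order even though an ID rule is unordered.

```python
def find_np(categories):
    """Return (start, end) spans of 'noun adj' found between a conj and a verb."""
    spans = []
    for i in range(1, len(categories) - 2):
        if (categories[i - 1] == "conj"          # left context
                and categories[i] == "noun"
                and categories[i + 1] == "adj"
                and categories[i + 2] == "verb"):  # right context
            # The contexts are checked but stay outside the chunk.
            spans.append((i, i + 2))
    return spans


words = ["E", "carros", "pretos", "há", "muitos", "na", "estrada"]
cats = ["conj", "noun", "adj", "verb", "pron", "prep", "noun"]  # assumed tagging

for start, end in find_np(cats):
    print("NP[" + " ".join(words[start:end]) + "]")
```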

The other kind of chunking rules, sequence rules, though conceptually different because they describe an ordered sequence of nodes, are almost identical to the ID/LP rules in terms of their syntax. There are, however, some differences and additions:

sequence rules do not use the -> operator. Instead, they use the = operator, which matches the shortest possible sequence. In order to match the longest possible sequence, the @= operator is used instead;
there is an operator for applying negation (~) and another for applying disjunction (;);
unlike ID/LP rules, the question mark (?) can be used to represent any category on the right side of a rule;
sequence rules can use variables.
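The difference between the = and @= operators is loosely analogous to lazy versus greedy quantifiers in ordinary regular expressions. The Python sketch below illustrates that analogy only; it encodes a sequence of categories as a space-separated string and says nothing about how XIP implements matching. The pattern used (a noun followed by any number of adjectives) is a hypothetical example.

```python
import re

cats = "det noun adj adj verb"

# Shortest match, in the spirit of the = operator: a lazy quantifier.
shortest = re.search(r"noun( adj)*?", cats).group(0)

# Longest match, in the spirit of the @= operator: a greedy quantifier.
longest = re.search(r"noun( adj)*", cats).group(0)

print(shortest)
print(longest)
```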

The following sequence rule matches expressions like alguns rapazes/uns rapazes (some boys), nenhum rapaz (no boy), muitos rapazes (many boys) or cinco rapazes (five boys); [indef2] and [q3] are features of lexical items:

1> NP @= ?[indef2];?[q3];num, (AP;adj;pastpart), noun.

Finally, consider the example O Pedro foi ao Japão. (Pedro went to Japan). At this stage, after the pre-processing and disambiguation, and also after applying the chunking rules, the system presents the following chunking output tree:

                     TOP
           +----------+----------+
           |          |          |
          NP         VF         PP
       +-------+      +    +----+-------+
       |       |      |    |    |       |
      ART    NOUN   VERB PREP  ART    NOUN
       +       +      +    +    +       +
       |       |      |    |    |       |
       O     Pedro   foi   a    o    Japão

Dependency rules

Being able to extract dependencies between nodes is very important because it can provide us with a richer, deeper understanding of the texts. Dependency rules take the sequences of constituent nodes identified by the chunking rules and identify relationships between them. This section presents a brief overview of their syntax, operators, and some examples.

A dependency rule presents the following syntax:

|pattern| if <condition> <dependency_terms>.

In order to understand what the pattern is, it is first essential to understand what a Tree Regular Expression (TRE) is. A TRE is a special type of regular expression that is used in XIP to establish connections between distant nodes. In particular, TREs explore the inner structure of subnodes through the use of braces ({}). The following example states that an NP node's inner structure must be examined in order to see if it is made of a determiner and a noun:

NP{det, noun}.

TREs support the use of several operators, namely:

the semicolon (;) operator is used to indicate disjunction;
the asterisk (*) operator is used to indicate zero or more;
the question mark (?) operator is used to indicate any;
the circumflex (^) operator is used to explore subnodes for a category.

Hence, and returning to the dependency rules, the pattern contains a TRE that describes the structural properties of parts of the input tree. The condition is any Boolean expression supported by XIP (with the appropriate syntax), and the dependency_terms are the consequent of the rule.

The first dependency rules to be executed are the ones that establish the relationships between the nodes, as seen in the next example:

|NP#1{?*, #2[last]}|
  HEAD(#2, #1)

This rule identifies HEAD relations (see below) in noun phrases. For example, in the NP a bela rapariga (the beautiful girl) the rule extracts a HEAD dependency between the head noun rapariga (girl) and the whole noun phrase — HEAD(rapariga, a bela rapariga).
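The effect of this rule can be imitated on a small tree in Python. The sketch is a simplified model: the Node class and the head_dependencies function are hypothetical, and the [last] test is approximated by taking the last child of each NP node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: leaves carry a word, inner nodes carry children."""
    category: str
    word: str = ""
    children: list = field(default_factory=list)

    def text(self):
        if not self.children:
            return self.word
        return " ".join(child.text() for child in self.children)


def head_dependencies(node):
    """Collect HEAD(last child, whole NP) for every NP in the tree."""
    deps = []
    if node.category == "NP" and node.children:
        last = node.children[-1]          # plays the role of #2[last]
        deps.append(("HEAD", last.text(), node.text()))
    for child in node.children:
        deps.extend(head_dependencies(child))
    return deps


phrase = Node("NP", children=[Node("art", "a"), Node("adj", "bela"), Node("noun", "rapariga")])
print(head_dependencies(phrase))
```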

As already stated, the main goal of the dependency rules is to establish relationships between the nodes. Coming back to our usual example, the following output is the current result of applying these rules to the sentence O Pedro foi ao Japão (Pedro went to Japan):

MAIN(foi)
HEAD(Pedro,O Pedro)
HEAD(Japão,a o Japão)
HEAD(foi,foi)
DETD(Pedro,O)
DETD(Japão,o)
PREPD(Japão,a)
VDOMAIN(foi,foi)
MOD_POST(foi,Japão)
SUBJ_PRE(foi,Pedro)
NE_INDIVIDUAL_PEOPLE(Pedro)
NE_LOCAL_COUNTRY_ADMIN_AREA(Japão)

The last two indicate that two NEs have been captured and classified in this sentence: Pedro has been identified as HUMAN INDIVIDUAL PERSON and Japão (Japan) as LOCATION CREATED COUNTRY. The tags NE_INDIVIDUAL_PEOPLE and NE_LOCAL_COUNTRY_ADMIN_AREA are merely used to see that the NEs have been classified. The final XML tags are created afterwards, as the final step of the whole process.

The other dependencies listed above cover a wide range of binary relationships such as:

the relation between the nucleus of some chunk and the chunk itself (HEAD);
the relation between a nominal head and a determiner (DETD);
the relation between the head of a Prepositional Phrase (PP) and the preposition that introduces it (PREPD);
among many others.

To see a complete list and a detailed description of all dependency relationships as of July 2009, please refer to [Portuguese Grammar manual].

Now consider the following example of another kind of dependency rule, aimed at classifying NEs:

|#1{?*, num[quant,sports_results]}|
  if (~NE[quant,sports_results](#1))
    NE[quant=+,sports_results=+](#1)

This rule uses a variable, represented by #1, which is assigned to the top node, because it is placed before the first brace ({). This variable could have been placed inside the braces structure, assigned (for example) to the node num. This rule states that if a node is made of any category followed by a number (with two features that determine whether it is a sports result), and if this node has not yet been classified as a NE with these features, then one wants to add them to the top node in order to classify it as AMOUNT SPORTS_RESULT. Please notice that it is the top node that is classified, because the variable is assigned to it; if it had been placed next to the node num, for example, then only this subnode would have been classified.

Notice also the usage of the negation operator (~) inside the conditional statement. XIP's syntax for these conditional statements also allows the operators & for conjunction and | for disjunction. Parentheses are also used to group statements and establish a clearer precedence, as in most programming languages.
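The logic of the sports-result rule (pattern, negated condition, consequent) can be paraphrased in Python. This is a sketch under stated assumptions: nodes are plain dictionaries, the key ne_features is a hypothetical place to record the NE classification, and the pattern ?*, num[...] is read as "the last subnode is a num with both features".

```python
WANTED = {"quant", "sports_results"}


def classify_sports_result(node):
    """Add the NE features to the top node unless it already carries them."""
    children = node["children"]
    # Pattern: any subnodes, then a num bearing both features.
    ends_with_result = (
        bool(children)
        and children[-1]["category"] == "num"
        and WANTED <= children[-1]["features"]
    )
    # Condition: ~NE[quant,sports_results](#1)
    already_classified = WANTED <= node["ne_features"]
    if ends_with_result and not already_classified:
        # Consequent: NE[quant=+,sports_results=+](#1)
        node["ne_features"] |= WANTED
        return True
    return False


top = {
    "ne_features": set(),
    "children": [
        {"category": "noun", "features": set()},
        {"category": "num", "features": {"quant", "sports_results"}},
    ],
}
print(classify_sports_result(top), classify_sports_result(top))
```

The second call returns False because the negated condition no longer holds once the top node has been classified.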

Lexicons

XIP allows the definition of custom lexicons (lexicon files), which add new features that are not stored in the standard lexicon. Having a rich vocabulary in the system can be very beneficial for improving its recall.

In XIP, a lexicon file begins by simply stating Vocabulary:, which tells the XIP engine that the file contains a custom lexicon. Only afterwards come the actual additions to the vocabulary.

The lexical rules attempt to provide a more precise interpretation of the tokens associated with a node. They have the following syntax (the parts of the rule contained in parentheses are optional):

lemma(: POS([features])) (+)= (POS)[features].

Some examples of lexical rules follow:

$US                = noun[meas=+, curr=+].
eleitor:     noun += [human=+].
google            += verb[intransitive=+].

The first two examples show how to add new features to existing words. In the first case, the features meas (measure) and curr (currency) are added to $US, which is POS-tagged as a noun; in the second case, the human feature is added to the noun eleitor (voter). In the third case, the word google, irrespective of its former POS, is given the additional reading of verb.

Local grammars

Local grammars are text files that contain chunking rules and each file may contain ID/LP and sequence rules. Essentially, different local grammar files are used to capture specific sequences of nodes and to attribute features to them. For practical reasons, a division based on different categories of NEs is employed. For example, whereas the file LGLocation is aimed at capturing sequences of nodes related to the LOCATION category, the file LGPeople will capture sequences of nodes related to the INDIVIDUAL type (HUMAN category).

After the pre-processing and disambiguation stages, XIP receives their output and tries to match it against the rules in the local grammars. The local grammars are run sequentially, in an order predefined in a configuration file.

For example, consider the following sequence rule belonging to the local grammar responsible for dealing with LOCATION NEs:

1> noun[location=+, admin_area=+] = ?[lemma:novo,maj]; noun[location,maj].

This rule is responsible for matching expressions such as Novo México (New Mexico), Nova Zelândia (New Zealand) or Nova Escócia (Nova Scotia), and then it creates a noun node with two feature-value pairs (location and admin_area). Notice how the ? and ; operators were used in order to capture either Novo or Nova.
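The intended effect on expressions such as Nova Zelândia can be sketched in Python. The sketch assumes one particular reading of the rule, namely a capitalized form of the lemma novo followed by a capitalized location noun; tokens are plain dictionaries and the function name merge_new_places is hypothetical.

```python
def merge_new_places(tokens):
    """Merge 'Novo/Nova' + capitalized location noun into a single noun node."""
    merged, i = [], 0
    while i < len(tokens):
        cur = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if (
            nxt is not None
            and cur["lemma"] == "novo" and "maj" in cur["features"]
            and nxt["category"] == "noun" and {"location", "maj"} <= nxt["features"]
        ):
            merged.append({
                "form": cur["form"] + " " + nxt["form"],
                "category": "noun",
                "features": {"location", "admin_area"},   # location=+, admin_area=+
            })
            i += 2
        else:
            merged.append(cur)
            i += 1
    return merged


tokens = [
    {"form": "Nova", "lemma": "novo", "category": "adj", "features": {"maj"}},
    {"form": "Zelândia", "lemma": "Zelândia", "category": "noun", "features": {"location", "maj"}},
]
result = merge_new_places(tokens)
print(result[0]["form"], sorted(result[0]["features"]))
```

Testing the lemma rather than the surface form is what lets a single rule cover both Novo and Nova.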

Disambiguation rules

XIP also allows the definition of disambiguation rules. The general syntax for a disambiguation rule is:

layer> readings_filter = |left_context| selected_readings |right_context|.

Like chunking rules, disambiguation rules also employ the concept of layer and context. The left side of a disambiguation rule contains the readings_filter. This filter specifies a subset of categories and features that can be associated with a word. The list can be constrained with features and the filter applies when it matches a subset of the complete ambiguity class of a word. Finally, the selected_readings portion of a disambiguation rule gives the selected interpretation(s) of the word.

There are four main operators used in disambiguation rules:

the <> operator is used to define specific features associated with a category;
the [] operator is used to refer to the complete set of features for a category;
the % operator restricts the interpretation of a word to one solution;
the <* operator is used to specify that each reading must bear the features listed immediately after.

Consider the example below:

1> ?<maj:+,start:~> = ?<proper:+> .

This rule states that a word with an upper-case initial letter (feature maj) that does not occur at the beginning of a sentence must be read as a proper name.
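A rough Python paraphrase of this rule is given below. It is a sketch built on assumptions: a word is a dictionary with a set of features and a list of candidate readings, the function name keep_proper_readings is hypothetical, and the rule is assumed to leave the word untouched when none of its readings is a proper name.

```python
def keep_proper_readings(word):
    """Keep only proper-name readings for capitalized, non-initial words."""
    capitalized = "maj" in word["features"]            # maj:+
    sentence_initial = "start" in word["features"]     # start:~ must hold
    proper = [r for r in word["readings"] if "proper" in r["features"]]
    if capitalized and not sentence_initial and proper:
        return proper
    return word["readings"]


word = {
    "form": "Pedro",
    "features": {"maj"},
    "readings": [
        {"category": "noun", "features": {"proper"}},
        {"category": "noun", "features": {"common"}},
    ],
}
print(keep_proper_readings(word))
```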