
Welcome to the HLT's wiki about

STRING — A Hybrid Statistical and Rule-Based Natural Language Processing Chain for Portuguese.

STRING has a modular structure and performs all basic text processing tasks, namely:

  • tokenization and text segmentation,
  • part-of-speech tagging,
  • morphosyntactic disambiguation,
  • shallow parsing (chunking), and
  • deep parsing (dependency extraction).

STRING is organized as a chain of modules. The first module receives the input text and tokenizes it, defining the segments that compose the text. LexMan, a morphological tagger, takes the result of this segmentation as input and assigns every possible part-of-speech (POS) tag to each segment. The next module groups the segments into sentences. RuDriCo2, a rule-based morphological disambiguator, is applied next; it also changes the segmentation of its input, for example by joining segments into compound words. MARv4, a stochastic morphological disambiguator, receives the output of RuDriCo2 and selects the best POS tag for each segment. Finally, XIP performs the syntactic analysis.
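The order of the stages described above can be illustrated with a small toy sketch in Python. It is not STRING's code and does not call LexMan, RuDriCo2, MARv4 or XIP: every function name, the tiny lexicon, the single rule, and the "first tag wins" choice are assumptions made only to show how each stage consumes the output of the previous one. Segment joining and the syntactic analysis step are left out.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Segment:
    text: str
    tags: List[str] = field(default_factory=list)  # candidate POS tags

# Hypothetical toy lexicon; tags are listed from most to least likely.
LEXICON = {
    "a": ["DET", "PREP", "PRON"],
    "casa": ["NOUN", "VERB"],
    "é": ["VERB"],
    "grande": ["ADJ"],
    ".": ["PUNCT"],
}

def tokenize(text: str) -> List[Segment]:
    """Stage 1: split the text into word and punctuation segments."""
    return [Segment(t) for t in re.findall(r"\w+|[^\w\s]", text)]

def tag_all(segments: List[Segment]) -> List[Segment]:
    """Stage 2 (stand-in for a morphological tagger): attach every candidate tag."""
    for seg in segments:
        seg.tags = list(LEXICON.get(seg.text.lower(), ["UNK"]))
    return segments

def split_sentences(segments: List[Segment]) -> List[List[Segment]]:
    """Stage 3: group segments into sentences at final punctuation."""
    sentences, current = [], []
    for seg in segments:
        current.append(seg)
        if seg.text in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

def rule_disambiguate(sentences: List[List[Segment]]) -> List[List[Segment]]:
    """Stage 4 (stand-in for a rule-based disambiguator).

    Single toy rule: after a possible determiner, a segment that can be
    a noun loses its VERB reading.
    """
    for sentence in sentences:
        for prev, seg in zip(sentence, sentence[1:]):
            if "DET" in prev.tags and "NOUN" in seg.tags:
                seg.tags = [t for t in seg.tags if t != "VERB"]
    return sentences

def pick_best(sentences: List[List[Segment]]) -> List[List[Segment]]:
    """Stage 5 (stand-in for a stochastic disambiguator): keep the first
    remaining tag, which the toy lexicon assumes is the most likely one."""
    for sentence in sentences:
        for seg in sentence:
            seg.tags = seg.tags[:1]
    return sentences

def run_chain(text: str) -> List[List[Tuple[str, str]]]:
    """Run the stages in order and return (segment, tag) pairs per sentence."""
    sentences = pick_best(rule_disambiguate(split_sentences(tag_all(tokenize(text)))))
    return [[(seg.text, seg.tags[0]) for seg in sentence] for sentence in sentences]

if __name__ == "__main__":
    for sentence in run_chain("A casa é grande."):
        print(sentence)
```

In this sketch the rule-based stage only removes candidate tags and the final stage makes the choice among those that remain, which mirrors the division of labour between the two disambiguators described above.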

STRING also performs:

  • Named Entity Recognition,
  • Information Retrieval,
  • Anaphora Resolution, and
  • other NLP tasks.

Though the initial modules of the STRING chain can be traced back as far as 2001 (see publications), the current architecture can be dated to 2006, with the integration of the XIP parser into the NLP chain and the development of the corresponding Portuguese grammar.

A web interface makes STRING available to the community and the general public.