Revision as of 07:47, 14 June 2013 by Njm (Team members)


Nuno Mamede (Computer Science Coordination)
Nuno J. Mamede received his undergraduate, MSc and PhD degrees in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, in 1981, 1985 and 1992, respectively. He started as a lecturer in 1982 and has held the position of Associate Professor at Instituto Superior Técnico since 2006, where he has taught Digital Systems, Object Oriented Programming, Programming Languages, Knowledge Representation and Natural Language Processing. He has been a proud member of the Spoken Language Systems Lab (L²F) since its creation. His current research interests are Natural Language Text Processing and Computer Aided Language Learning. He is the PI of the REAP.PT project, and he is also involved in the OOBIAN project.
Jorge Baptista (Linguistic Coordination)
Jorge Baptista graduated in Languages and Literatures-Portuguese Studies and obtained an MA in Portuguese Linguistics from Faculdade de Letras da Univ. Lisboa with a thesis on nominal compounding and electronic dictionaries. His PhD thesis on predicative nouns and nominalization of adjectival constructions was presented to the Univ. Algarve, where he started working in 1992. He has been an invited researcher at the Spoken Language Systems Lab (L²F) since 2005. His current research interests in NLP concern Information Extraction (Named Entity Recognition and Relation Extraction), Parsing, and Computer Assisted Language Learning; in Linguistics proper, he is working on the Lexicon-Syntax description of verbal, adverbial and idiomatic constructions. He is also involved in the REAP.PT and OOBIAN projects.

XEROX/XRCE Liaison and Collaborator

Caroline Hagège (XEROX Research Center Europe)
Caroline Hagège has been a research engineer since July 2001 and works in the Parsing and Semantics group of XEROX Research Center Europe, Grenoble, France, mainly on robust and deep parsing and on the bridge between syntax and semantics. She holds a PhD (Doctorat) in Computational Linguistics from the University Blaise Pascal, Clermont-Ferrand, France, which was done at the GRIL laboratory (Université Blaise Pascal). Before joining XRCE, she was a researcher at the L2F laboratory (INESC-Id, Lisboa, Portugal), where she was involved in Portuguese robust language processing (morphology and shallow parsing). She was key to the initial development of the Portuguese grammar for XIP, and has worked as liaison with XEROX/XRCE ever since. She has collaborated intensively in the development of the Named Entity Recognition module of STRING, particularly in the time expressions grammar, having joined forces with us in the proposal for, and successful participation in, the TIMEX evaluation track of the Second HAREM (2008), where STRING was assessed as the best NER-TIMEX system.

Team members

Eduardo Castanho (2013-present)
Implement a repository of morphological entities and develop an interface to manipulate them without requiring knowledge about the structure of the repository.
[not yet]
Rui Santos (2012-present)
A hybrid rule-based and statistical module for automatic Semantic Role Labeling, integrated in STRING.
[not yet]
João Marques (2012-present)
Improvement of the Anaphora Resolution (AR) post-processing module, initially developed by Nuno Nobre. This module deals with pronominal anaphora.
[not yet]
Viviana Cabrita (2012-present)
Detecting and ordering events.
[not yet]
Alexandre Vicente (2011-2013)
This work merged the tokenization module and the morphological analysis into a single module, LexMan, using transducers. With this change, it became possible to transfer the context-independent morphosyntactic joining rules (for compound identification), previously implemented in the chain’s morphosyntactic disambiguator, RuDriCo, to the LexMan module. The information used in the generation of the dictionary transducer can now also be complemented by derivational information, making it possible to recognise prefix-derived words, particularly neologisms.
[MSc Dissertation]
Tiago Travanca (2011-2013)
This work addresses the problem of Verb Sense Disambiguation (VSD) in European Portuguese. Verb Sense Disambiguation is a sub-problem of Word Sense Disambiguation (WSD), which tries to identify the sense in which a polysemous word is used in a given sentence. A sense inventory for each word (or lemma) must therefore be used. For the VSD problem, this sense inventory consisted of a lexicon-syntactic classification of the most frequent verbs in European Portuguese (ViPEr). Two approaches to VSD were considered. The first, rule-based approach makes use of the lexical, syntactic and semantic descriptions of the verb senses present in ViPEr to determine the meaning of a verb. The second approach uses machine learning with a set of features commonly used in the WSD problem to determine the correct meaning of the target verb. Both approaches were tested in several scenarios to determine the impact of different features and different combinations of methods. The baseline accuracy of 84%, obtained by assigning the most frequent sense to each verb lemma, was both surprisingly high and hard to surpass. Still, both approaches provided some improvement over this value. The best combination of the two techniques and the baseline yielded an accuracy of 87.2%, a gain of 3.2 percentage points over the baseline.
[MSc Dissertation]
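The most-frequent-sense baseline mentioned in the entry above can be illustrated with a minimal Python sketch. This is only an illustration of the general technique, not the code used in the dissertation: the function names, the lemmas and the sense labels (`sense_1`, `sense_a`, ...) are hypothetical and do not come from ViPEr.

```python
from collections import Counter, defaultdict


def train_mfs(examples):
    """Learn the most frequent sense for each verb lemma.

    `examples` is an iterable of (lemma, sense) pairs from an annotated corpus.
    """
    counts = defaultdict(Counter)
    for lemma, sense in examples:
        counts[lemma][sense] += 1
    # Keep only the single most frequent sense per lemma.
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def accuracy(model, examples):
    """Fraction of (lemma, sense) pairs whose sense matches the model's guess."""
    examples = list(examples)
    correct = sum(1 for lemma, sense in examples if model.get(lemma) == sense)
    return correct / len(examples)


# Hypothetical toy data: the sense labels are placeholders, not ViPEr classes.
train = [
    ("abandonar", "sense_1"), ("abandonar", "sense_1"), ("abandonar", "sense_2"),
    ("contar", "sense_a"), ("contar", "sense_b"), ("contar", "sense_b"),
]
held_out = [
    ("abandonar", "sense_1"), ("abandonar", "sense_2"),
    ("contar", "sense_a"), ("contar", "sense_b"),
]

mfs_model = train_mfs(train)
baseline_accuracy = accuracy(mfs_model, held_out)
```

Because the baseline ignores context entirely, any rule-based or machine-learning method is only useful to the extent that it beats this number on the same held-out data.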
Filipe Carapinha (2011-2013)
Development of a slot-filling (SL) module, to be integrated in the STRING system. The slot-filling task is an Information Retrieval challenge that consists of aggregating all information associated with a given named entity (NE) in a predefined template of relations and attributes. For now, this module will deal with the PERSON and ORGANIZATION NE types, already implemented in STRING (Oliveira 2010). An existing XIP-implemented Relation Extraction module (Santos 2010) will be used to map those relations onto the corresponding slots. Since both these modules rely on accurate Anaphora Resolution (AR), the existing AR module (Nobre 2011) will be integrated in STRING in order to improve the quantity and quality of the extracted information.
[MSc Dissertation]
Cláudio Diniz (2009-present)
Implementation of RuDriCo2, a morphological rule-based disambiguator that can change the segmentation of the text. It improves on the former version of this module by revising its rule syntax and optimizing the system’s main algorithm. Implementation of LexMan, a Lexical Analyzer based on finite-state transducers, which is able to associate with the text tokens all the morpho-syntactic information relevant for further processing. It uses a rich and highly granular tag set, adapted from the PAROLE project, featuring 12 part-of-speech categories and 11 fields. LexMan replaced a previous module of the NLP chain, Palavroso. LexMan is used to generate and validate all the inflected forms associated with lexical lemmas, along with the corresponding morpho-syntactic information. To this end, the conversion of previous lexical resources was necessary. LexMan has much improved the performance of the STRING chain and it also provides an efficient, fast and flexible way of maintaining and updating the lexicons.
Implementation of the new MARv4, a statistical part-of-speech tagger whose function is to choose the most likely POS tag for each word, using HMMs. The language model used by MARv4 is trained on a 250K Portuguese corpus originally produced under project PAROLE. To train MARv4 and fine-tune the post-tagger rule-based disambiguation module of STRING, the training corpus underwent an extensive and systematic revision. To this end, scripts have been produced to ensure consistency, fast access and corpus maintenance. Development of the demo web interface for STRING.
[MSc Dissertation, 2, 3, 4, 5]
Vera Cabarrão (2010-present)
Currently doing her MA thesis at Universidade de Lisboa - Faculdade de Letras. She has been on the team since 2010. Corpus annotation for Named Entity Recognition (NER), Relation Extraction, Anaphora Resolution, and Time Expressions. Writing rules in XIP for the identification of Natural Events (“tsunamis”, “earthquakes”) and Organized Events (political, scientific, artistic, and other) as NE. Writing dependency rules in XIP for Relation Extraction, namely Lifetime (e.g., relations regarding the events Birth, Death, Education), Business (e.g., Job, Foundation, Owner), and Location relations. Analysis of Portuguese newspapers to test the correct identification and annotation of NE by XIP. Improvement of the XIP lexicons. Contribution to the improvement of the "Classification directives for named entities in Portuguese texts" and the "Classification directives for Relations between NE".
Munshi Asadullah (2010-2012)
This work proposes a heuristic-based modeling of data from two different parsers: the Constraint Grammar (CG) based parser PALAVRAS and the Phrase Structure Grammar (PSG) based Finite-State Parser (FSP) used as the parsing backbone of the STRING Natural Language Processing (NLP) chain for Portuguese. Different models using the output of the two parsers are produced and put together in a linear combination to maximize performance. A processing framework is also proposed and its development presented. A dependency annotation tool was also developed within the scope of the research. The models' performance was satisfactory if not extraordinary, although the primary objective was to present the modeling possibilities rather than the absolute performance.
[MSc Dissertation]
Lucas Vieira (2010-2012)
Extending the coverage of the syntactic-semantic classification of the Portuguese Adv-mente. Automatic extraction of {V-Adv} collocation pairs for Machine Translation applications.
[MSc Dissertation]
Andreia Maurício (2009-2011)
A module for TIMEX (time expressions) processing. TIMEX is part of the Named Entity Recognition (NER) task. This new TIMEX module aims to identify, classify and normalize temporal expressions contained in a Portuguese written text. The TIMEX classification guidelines adopted for the participation in the Second HAREM Joint Evaluation Campaign were extended and adapted to identify more complex types of TIMEX. The TIMEX processing module was developed, evaluated and integrated in STRING.
[MSc Dissertation]
Diogo Oliveira (2009-2011)
Improvement of the Named Entity Recognition (NER) module of STRING, especially for the HUMAN, LOCATION and AMOUNT categories, with reference to the performance attained during the Second HAREM Joint Evaluation Campaign (2008). A new set of delimitation and classification directives has been proposed to replace those used in the Second HAREM. Several improvements were introduced in the NLP chain, especially in the XIP syntactic parser, which is responsible for named entity extraction. Finally, the system performance has been evaluated, and a general trend of improvement has been confirmed.
[MSc Dissertation]
Ricardo Portela (2008-2011)
Identification of multiword expressions (MWE) in Portuguese. MWE are sequences of words whose meaning cannot be calculated from the composition of the literal meanings of their individual words, so that together they acquire a figurative/idiomatic/non-compositional meaning. Several collocation-based statistical methods were used to improve MWE extraction. STRING was used to test linguistically motivated criteria against the syntactic dependencies extracted from the entire CETEMPúblico corpus and the semantic features associated with its lexicon. Procedures for the processing of large-scale corpora using the L2F GRID parallel computing, as well as scheduling and parallel programming software, were implemented and the results were evaluated from different perspectives.
[MSc Dissertation, 2]
Daniel Santos (2009-2010)
Implementation of an Information Retrieval (IR) rule-based module for Relation Extraction (RE), specifically built to maximize the information retrieved. A set of directives for relation identification and annotation was defined, inspired by the work already developed for Portuguese and English. At this stage, the module covers FAMILY relations (spouse, parent, sibling, etc.), LIFETIME relations (date-of-birth or death), BUSINESS relations (employee, client, owner, etc.) and LOCATION (people and organization) relations. An evaluation corpus was selected and annotated by a linguist, in order to perform a more independent evaluation, thus allowing a better analysis of the results.
[MSc Dissertation]
Nuno Nobre (2009-2011)
Implementation of an Anaphora Resolution (AR) post-processing module that operates on the output of the XIP parser. This module deals with pronominal anaphora, that is, the coreference (or identity) relation between a pronoun (the anaphor) and a previous mention of the same entity in discourse (the antecedent). At this stage, third-person personal and possessive pronouns, as well as relative and demonstrative pronouns (in headless NPs), are resolved. During the system development a manual annotation tool was created, which allows text to be enriched with anaphoric information quickly. The system was evaluated on a corpus that had been manually annotated by a linguist using that annotation tool, and it achieved an F-measure of 33.5%.
[MSc Dissertation]
Fernando Gomes (2008-2009)
Validation over a corpus of lexical-syntactic matrices, i.e. formal descriptions of the linguistic properties associated with lexical items, is a difficult and time-consuming task, but essential if such information is to be used in several NLP tasks. The validation is based on a statistical comparison between results obtained from a large corpus using STRING and the information contained in the matrices. This information consists of morphological, distributional and transformational properties of lexical items, and each of them must be verified individually through distinct processes. The statistical comparison is done with the aid of GRID computing, as well as scheduling and parallel programming software. Finally, an evaluation of the work has been performed to check the findings.
[MSc Dissertation]
Gaia Fernandes (2009-2011)
A systematic syntactic-semantic classification of the Portuguese Adv-mente and experiments on Word-Sense Disambiguation using machine learning.
[MSc Dissertation]
Simone Pereira (2008-2010)
Annotation of a Brazilian Portuguese corpus for Zero-Anaphora and the implementation of dependency rules for its resolution in XIP.
[1, 2, 3]
Marcos Zampieri (2008-2010)
A systematic review of a semantic classification of the 5,000 most common Portuguese nouns for experiments on Word-Sense Disambiguation using machine learning.

Currently doing his PhD at the University of Köln, Germany.
[MSc Dissertation, 2]

David Rodrigues (2006-2007)
Improvement of the performance of MARv, a statistical part-of-speech tagger whose function is to choose the most likely POS tag for each word using the Viterbi algorithm, namely its processing time and memory management, and the reduction of its error rate when disambiguating. The new implementation, MARv2, increased precision by 23.72% and is significantly (9 times) faster than the previous version. Furthermore, it does not discard rejected tags and uses the same DTD as RuDriCo2.
[MSc Dissertation]
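As a rough illustration of the Viterbi algorithm mentioned in the entry above, the following Python sketch picks the most likely tag sequence under a first-order HMM. It is not MARv's implementation: the tag set, the three-word example and every probability below are invented for illustration, and the flooring of zero probabilities is a simplification rather than a proper smoothing method.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely tag sequence for `words` under a first-order HMM.

    Probabilities missing from the tables are floored to `floor` (an assumed
    simplification) so that the log-space computation never hits log(0).
    """
    if not words:
        return []

    def lp(p):
        return math.log(max(p, floor))

    # best[i][t] = (log-probability of the best path ending in t, previous tag)
    best = [{t: (lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], 0.0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + lp(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit_p[t].get(words[i], 0.0)), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Hypothetical toy model: all numbers are made up for the example.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT = {
    "DET": {"a": 0.9},
    "NOUN": {"casa": 0.7},
    "VERB": {"casa": 0.3, "caiu": 0.7},
}

tagged = viterbi(["a", "casa", "caiu"], TAGS, START, TRANS, EMIT)
```

In this toy model "casa" is ambiguous between NOUN and VERB; the transition from DET makes the NOUN reading win, which is the kind of contextual decision a statistical tagger is meant to make.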
João Loureiro (2006-2007)
A Named Entity Recognition (NER) module for the categories Work-of-Art (Obra), Value (Valor), Family relations (Relações de Parentesco) and Time (Tempo), for Portuguese. A first attempt to normalize time expressions such as dates ("24 de Novembro de 2005") and other productive phrases ("no próximo dia", next day). Time normalization is about converting time expressions’ values to a standard format allowing this information to be shared between different systems.
[MSc Dissertation, 2]
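The idea of time normalization described in the entry above can be sketched in Python for the simplest case, a full date such as "24 de Novembro de 2005". This is only a sketch under stated assumptions: the module itself was rule-based within STRING, whereas this example uses a plain regular expression, and ISO 8601 (`YYYY-MM-DD`) is assumed here as the "standard format", which the source does not specify. The names `MONTHS`, `DATE_RE` and `normalize_dates` are hypothetical.

```python
import re
from datetime import date

# Portuguese month names mapped to month numbers.
MONTHS = {
    "janeiro": 1, "fevereiro": 2, "março": 3, "abril": 4,
    "maio": 5, "junho": 6, "julho": 7, "agosto": 8,
    "setembro": 9, "outubro": 10, "novembro": 11, "dezembro": 12,
}

# Matches expressions of the form "<day> de <month name> de <year>".
DATE_RE = re.compile(r"\b(\d{1,2}) de (\w+) de (\d{4})\b")


def normalize_dates(text):
    """Return (original expression, ISO 8601 date) pairs found in `text`."""
    results = []
    for match in DATE_RE.finditer(text):
        day, month_name, year = match.groups()
        month = MONTHS.get(month_name.lower())
        if month is None:
            continue  # not a month name, e.g. "3 de copas de 2005"
        try:
            iso = date(int(year), month, int(day)).isoformat()
        except ValueError:
            continue  # impossible calendar date, e.g. 31 de Fevereiro
        results.append((match.group(0), iso))
    return results


found = normalize_dates("Nasceu a 24 de Novembro de 2005 em Lisboa.")
```

Relative expressions such as "no próximo dia" cannot be handled this way, since they need a reference date (for instance the document's publication date) to be resolved.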
Luís Romão (2006-2007)
Development of the Named Entity Recognition (NER) module, focusing on the categories LOCATION, ORGANIZATION, PEOPLE and EVENTs using STRING. In a rule-based approach to this NLP task, NE are identified based solely on the information in the lexicons and manually-built rules, either contextual or based on the entity’s structure. The system was evaluated according to the criteria defined by the First HAREM, a NER joint evaluation campaign for Portuguese. Results were in general above average when compared to other participant systems, obtaining the best results in the identification of ORGANIZATIONS and the best global results in several of the classification evaluation scenarios.
[MSc Dissertation, 2]
Telmo Machado (2006-2007)
Development of an information extraction system in a specific domain, the cooking domain. The system is composed of three modules: pre-processing, recipe processing and output transformation. The main objectives are to identify ingredients and their associated quantities, and to identify the different tasks needed to prepare each recipe, as well as the utensils needed and the ingredients used in each of these tasks. After identifying the ingredients and the tasks of each recipe, these are introduced into a database. This database is supported by an ontology, which contains a description of the concepts used in the culinary domain.
[MSc Dissertation]
Joana Paulo (2006-2007)
Development of the RuDriCo system, which is an evolution of PAsMo (post-morphologic analyzer) [Faiza 99]. RuDriCo adapts the output of the morphological analyzer to the specific needs of each parser. The modifications the system produces include: (i) segmentation changes; (ii) changes to the information added to the words tagged by the morphologic analyzer; (iii) changes in the output format of the morphologic analyzer, so that it matches the format required by the parser. All these modifications are expressed by way of declarative transformation rules, based on the concept of pattern matching.
[Graduation Thesis]
Ricardo Ribeiro (2006-2008)
Contributed to the implementation of the first version of the chain. Developed the MARv module, still used by STRING. Provided specific support for the development of the next MARv versions and general support throughout the development of the chain. Was part of the team that developed the corpus used for training and testing MARv in the LE-PAROLE project.
[MSc Dissertation, 2]