STRING relies on large, comprehensive, highly granular lexical resources. Much emphasis is placed on building them, in the conviction that the lexicon is key to many NLP tasks and applications. This page, still under construction, briefly describes the main resources already available and used by STRING.
LexMan uses a dictionary of lemmas containing, for the major part-of-speech categories, the following entries:
Verb lemmas: 12,995
Noun and adjective lemmas: 38,180
Adverb lemmas: 7,250
Compound words: 35,201
Long neglected in dictionaries and grammars (and in NLP in general), adverbs have received special attention in the LexMan dictionaries. Portuguese adverbs ending in -mente (e.g. curiosamente 'curiously') constitute a large, morphologically homogeneous, but syntactically and semantically very diverse lexical set. When two such adverbs are coordinated, the first loses the adverbial suffix and takes the shape of the base adjective in the feminine-singular form. This raises the issue of its part-of-speech (POS) classification (adverb or adjective?) and, above all, of its adequate parsing, since it may be incorrectly analyzed as a modifier of a preceding noun. However, POS tagging cannot be adequately performed prior to some minimal syntactic analysis. The size of the lexicon involved (more than 7,000 adverbs) and the scarcity of instances, even in large corpora, make it ineffective to leave the resolution of this ambiguity between adjective and reduced adverbial form to the POS tagger alone.
Baptista et al. (2012) propose an integrated solution to this non-trivial empirical problem, in which a rule-based disambiguation module and a statistical POS tagger combine to produce more accurate tagging and better parsing results. The system was evaluated on a large corpus.
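As a rough illustration of the kind of contextual rule such a module might apply, the sketch below tags a candidate reduced form as an adverb only when it is immediately followed by a coordinating conjunction and a -mente form. This is a simplified assumption made for illustration, not the rule set of Baptista et al. (2012); the names COORDINATORS, REDUCED_FORMS and disambiguate, as well as the toy lexicon, are hypothetical.

```python
# Illustrative rule only; NOT the actual rules of Baptista et al. (2012).
COORDINATORS = {"e", "ou", "mas"}  # assumed set of coordinating conjunctions

# Hypothetical toy lexicon: reduced form -> full -mente adverb
REDUCED_FORMS = {
    "pura": "puramente",
    "direta": "diretamente",
    "total": "totalmente",
}


def disambiguate(tokens):
    """Return one (token, tag) pair per token.

    A candidate reduced form is tagged 'Adv' only when it is directly
    followed by a coordinator and a form ending in -mente; otherwise it
    keeps the 'Adj' reading. Any form ending in -mente is naively tagged
    'Adv' (a real system would consult the lexicon instead). All other
    tokens receive the placeholder tag '?'.
    """
    tagged = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low.endswith("mente"):
            tagged.append((tok, "Adv"))
        elif low in REDUCED_FORMS:
            in_coordination = (
                i + 2 < len(tokens)
                and tokens[i + 1].lower() in COORDINATORS
                and tokens[i + 2].lower().endswith("mente")
            )
            tagged.append((tok, "Adv" if in_coordination else "Adj"))
        else:
            tagged.append((tok, "?"))
    return tagged


print(disambiguate("respondeu pura e simplesmente".split()))
print(disambiguate("água pura e fresca".split()))
```

In the first call pura is read as a reduced adverb; in the second it stays an adjective because no -mente form follows the conjunction.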
For the processing of coordinated -mente adverbs, an existing lexicon of 3,800 entries from previous versions of the system was systematically completed by adding all Adv-mente entries found in an orthographic vocabulary (Casteleiro 2009). These correspond to 3,614 entries. Then, all valid forms ending in -mente found in the European Portuguese corpus were manually perused and the adverbs selected. Duplicates from the first list were removed, yielding 3,636 new entries. For each entry, the feminine-singular form of the base adjective was automatically generated, and the list was then manually revised to correct errors and to insert the orthographic variants resulting from the new, unified Portuguese orthography. The final list consists of 7,250 -mente adverbs. For example, the entry for abstratamente ‘abstractly’ is associated with the orthographic variant abstractamente and with the reduced forms abstrata and abstracta ‘abstract fs’. Each reduced form is then given the feature ‘r’ (for ‘reduced’). When analyzing a sentence where abstracta appears, at the morphological stage where LexMan operates, the system produces the following tags (format adapted for clarity): abstracta: abstratamente Adv r; abstrata Adj fs. In this way, only forms with attested -mente adverbial counterparts are validated.
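The entry structure described above can be sketched as a small lookup table. The code below is a minimal illustration under stated assumptions, not LexMan's actual data format: the function names reduced_form and build_index and the layout of ENTRIES are hypothetical, and the reduced form is obtained by naively stripping the suffix, which is one reason a manual revision step is needed (e.g. rapidamente versus the accented adjective rápida).

```python
from collections import defaultdict

SUFFIX = "mente"


def reduced_form(adverb):
    """Strip the -mente suffix, leaving the feminine-singular base form.

    Naive on purpose: it does not restore accents that are dropped in the
    adverb (e.g. 'rapidamente' vs. 'rápida').
    """
    if not adverb.endswith(SUFFIX) or len(adverb) <= len(SUFFIX):
        raise ValueError(f"not a -mente adverb: {adverb!r}")
    return adverb[: -len(SUFFIX)]


def build_index(entries):
    """Map each reduced form to its (lemma, POS, feature) readings.

    `entries` maps a lemma adverb to its orthographic variants
    (hypothetical layout, chosen for this sketch).
    """
    index = defaultdict(list)
    for lemma, variants in entries.items():
        adj_lemma = reduced_form(lemma)
        for adverb in [lemma, *variants]:
            form = reduced_form(adverb)
            index[form].append((lemma, "Adv", "r"))      # reduced adverb reading
            index[form].append((adj_lemma, "Adj", "fs"))  # adjective reading
    return dict(index)


ENTRIES = {"abstratamente": ["abstractamente"]}
INDEX = build_index(ENTRIES)

print(INDEX["abstracta"])
# [('abstratamente', 'Adv', 'r'), ('abstrata', 'Adj', 'fs')]
print(INDEX.get("bonita"))  # None: no attested -mente counterpart in ENTRIES
```

Because the index is built only from attested adverbs, an adjective without a -mente counterpart in the lexicon never receives the reduced-adverb reading.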
It has been previously noted that compound adverbs (or collocational combinations) such as única e exclusivamente ‘uniquely and exclusively’ and única e simplesmente ‘uniquely and simply’ occur quite often in the corpus. Other forms were added to the lexicon, e.g. pura e simplesmente ‘purely and simply’, dire(c)ta ou indire(c)tamente ‘directly or indirectly’, explícita ou implicitamente ‘explicitly or implicitly’ and total ou parcialmente ‘totally or partially’.
Naturally, the closed set of simple adverbs has also been systematically collected from several sources. However, rare or archaic forms (e.g. acá 'here') are kept separately from the main lexical resources in use.
Finally, an extensive listing of about 2,000 compound, often idiomatic adverbs (Palma 2009) has also been added.
Semantic and syntactic information has also been added to most of the adverbs in the XIP lexicon.
Compounding is one of the most productive lexical mechanisms for creating new entries and designating new concepts and objects, especially in scientific and technological domains. The lexical coverage of the STRING lexicon of compound words can already be considered quite satisfactory, especially for non-technical texts.
Portela (2011) and Portela et al. (2011) explore the resources and modules of STRING to develop new ways of acquiring compound word candidates from large corpora, combining machine-learning and pattern-matching techniques. Results are very encouraging, and new lists of over 2,000 compounds are being integrated into the existing lexicon.
Casteleiro, J.M.: Vocabulário Ortográfico da Língua Portuguesa. Porto Editora, Lisboa (2009).
Baptista, J.; Vieira, L.; Diniz, C.; Mamede, N.: Coordination of -mente ending Adverbs in Portuguese: an Integrated Solution. In: 10th International Conference on Computational Processing of Portuguese (PROPOR 2012), Coimbra, Portugal, April 2012. LNCS/LNAI, vol. [?], pp. ?-?. Springer, Berlin, Heidelberg (accepted for publication).
Palma, C.: Estudo Contrastivo Português-Espanhol de Expressões Fixas Adverbiais (MA Thesis). U. Algarve, Faro (2009).
Lemma assignment rules in XIP: 48,348
For many NLP tasks, and especially for any task that requires fine-grained semantic distinctions between ambiguous lexical forms, identifying the meaning of a verb (and of the surrounding elements as well) is made easier by knowing the syntactic and semantic constraints the verb imposes on the lexical filling of its argument positions. In particular, the number of arguments, their structural and distributional type, the prepositions the verb selects to introduce its essential complements, and the main changes in shape that its structures can undergo can all be used to improve parsing strategies, word sense disambiguation, question-answering systems and computer-assisted language learning systems, among other applications. Above all, an inventory of basic word senses and their corresponding structures is necessary, and this is the aim of the current project.
Verbs of European Portuguese (ViPEr) is a lexical resource that describes various syntactic and semantic properties of European Portuguese verbs. It is dedicated to full (distributional or lexical) verbs, i.e., verbs whose meaning allows for an intensional definition of their construction and of the distributional (semantic) constraints on their argument positions (subject and complements). A total of 3,485 verb senses have been described so far, covering verbs with a frequency of 10 or higher in the CETEMPúblico corpus. The description of the remaining verbs is still ongoing.
Number of possible classes    Number of verbs
1                             1978
2                             462
3                             125
4                             31
5                             7
6                             7
7                             1
Total                         2611
Table 1. ViPEr’s verb distribution (per number of possible classes)
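The figures in Table 1 can be cross-checked against the sense count given in the text. The short sketch below assumes that each (verb, class) pair counts as one verb sense, which is an interpretation rather than something the page states; the variable names are hypothetical.

```python
# Table 1: number of possible classes -> number of verbs
VERBS_BY_CLASS_COUNT = {1: 1978, 2: 462, 3: 125, 4: 31, 5: 7, 6: 7, 7: 1}

# Total number of distinct verbs (the "Total" row of the table).
total_verbs = sum(VERBS_BY_CLASS_COUNT.values())

# Total number of senses, assuming one sense per (verb, class) pair.
total_senses = sum(n * verbs for n, verbs in VERBS_BY_CLASS_COUNT.items())

print(total_verbs)   # 2611
print(total_senses)  # 3485
```

Under that assumption, the 2,611 verbs of Table 1 account for exactly the 3,485 verb senses reported as described so far.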
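Class    Example
36DT     O bandido apontou uma faca ao polícia. 'The bandit pointed a knife at the policeman.'
38LD     O Pedro apontou os números premiados num recibo do multibanco. 'Pedro wrote down the winning numbers on an ATM receipt.'
39       O Pedro apontou o João como sério candidato. 'Pedro pointed to João as a serious candidate.'
9        O Pedro apontou à Joana quais os defeitos que devia corrigir. 'Pedro pointed out to Joana which flaws she should correct.'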
Table 2. ViPEr’s entries for the different meanings of 'apontar' (to point)
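One way to picture the entries of Table 2 is as a list of records keyed by lemma and class. The sketch below is only an illustration: the VerbSense type, its field names and the classes_of helper are hypothetical and do not reflect ViPEr's actual storage format; the class labels and example sentences are those of Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbSense:
    lemma: str        # verb lemma
    viper_class: str  # ViPEr class label, as listed in Table 2
    example: str      # illustrative sentence from Table 2


APONTAR = [
    VerbSense("apontar", "36DT", "O bandido apontou uma faca ao polícia."),
    VerbSense("apontar", "38LD",
              "O Pedro apontou os números premiados num recibo do multibanco."),
    VerbSense("apontar", "39", "O Pedro apontou o João como sério candidato."),
    VerbSense("apontar", "9",
              "O Pedro apontou à Joana quais os defeitos que devia corrigir."),
]


def classes_of(lemma, senses):
    """Return the class labels recorded for a lemma, in listing order."""
    return [s.viper_class for s in senses if s.lemma == lemma]


print(classes_of("apontar", APONTAR))       # ['36DT', '38LD', '39', '9']
print(len(classes_of("apontar", APONTAR)))  # 4
```

On this reading, apontar has four possible classes, one per row of Table 2.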