Dictionaries

Revision as of 14:24, 10 January 2024 by Eugenio

Description

STRING relies on large, comprehensive, highly granular lexical resources. Much effort goes into building them, under the conviction that the lexicon is key to many NLP tasks and applications. This page, constantly under construction, briefly describes the main resources already available and used by STRING.

LexMan Dictionary

LexMan uses a dictionary of lemmas that contains the following numbers of entries for the major part-of-speech categories:

Verb lemmas: 12,995

Noun and adjective lemmas: 38,180

Adverb lemmas: 7,250

Compound words: 35,201


Adverbs

Long neglected in dictionaries and grammars (and in NLP in general), adverbs have received special attention in the LexMan dictionaries. Portuguese adverbs ending in -mente (e.g. curiosamente 'curiously') constitute a large, morphologically homogeneous, but syntactically and semantically very diverse lexical set. When two such adverbs are coordinated, the first one loses the adverbial suffix and takes the shape of the base adjective in the feminine singular form. This raises the issue of its part-of-speech (POS) classification (adverb or adjective?) and, above all, of its adequate parsing, since it may be incorrectly analyzed as a modifier of a preceding noun. However, POS tagging cannot be adequately performed prior to some minimal syntactic analysis. The size of the lexicon involved (more than 7,000 adverbs) and the scarcity of instances, even in large corpora, make it ineffective to leave the resolution of this adjective/reduced adverbial form ambiguity to the POS tagger alone.

Baptista et al. (2012) propose an integrated solution to this non-trivial empirical problem, in which a rule-based disambiguation module and a statistical POS tagger combine to produce more accurate tagging and better parsing results. The system was evaluated on a large-sized corpus.

For the processing of coordinated -mente ending adverbs, an existing lexicon of 3,800 entries from previous versions of the system was systematically completed by adding all Adv-mente entries found in an orthographic vocabulary (Casteleiro 2009). These correspond to 3,614 entries. Then, all valid -mente ending forms found in the European Portuguese corpus were manually perused and the adverbs selected. Duplicates from the first list were removed, yielding 3,636 new entries. For each entry, the feminine singular form of the base adjective was automatically generated, and the list was then manually revised for errors and for the insertion of orthographic variants resulting from the new, unified Portuguese orthography. The final list consists of 7,250 -mente ending adverbs. For example, the entry for abstratamente ‘abstractly’ is associated with the orthographic variant abstractamente, and with the reduced forms abstrata and abstracta ‘abstract fs’. The reduced form is given the feature ‘r’ (for ‘reduced’). When analyzing a sentence where abstracta appears, at the morphological stage where LexMan operates, the system produces the following tags (format adapted for clarity): abstracta: abstratamente Adv r; abstrata Adj fs. In this way, only forms with attested -mente adverbial counterparts are validated.
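The lookup described above can be illustrated with a minimal Python sketch. It is not LexMan's implementation: the dictionary layout, the names `ADVERBS`, `ADJECTIVES`, `analyze` and `format_readings`, and the adjective-only toy entry vermelha are assumptions made for illustration. Only the abstratamente entry and the output format come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    lemma: str
    pos: str       # "Adv" or "Adj"
    features: str  # "r" = reduced adverbial form, "fs" = feminine singular


# Toy adverb lexicon (hypothetical layout): each -mente adverb lists its
# orthographic variants and its reduced (feminine singular) forms.
ADVERBS = {
    "abstratamente": {
        "variants": ["abstractamente"],
        "reduced": ["abstrata", "abstracta"],
    },
}

# Toy adjective lexicon: surface form -> lemma shown in the tag.
# "vermelha" is assumed here to have no -mente counterpart in the lexicon.
ADJECTIVES = {
    "abstrata": "abstrata",
    "abstracta": "abstrata",
    "vermelha": "vermelha",
}


def build_reduced_index(adverbs):
    """Map every reduced form to the adverb lemmas it may stand for."""
    index = {}
    for lemma, entry in adverbs.items():
        for form in entry["reduced"]:
            index.setdefault(form, []).append(lemma)
    return index


REDUCED_INDEX = build_reduced_index(ADVERBS)


def analyze(token):
    """Return all readings of a token; the Adv/Adj ambiguity is kept for later stages."""
    readings = [Reading(lemma, "Adv", "r") for lemma in REDUCED_INDEX.get(token, [])]
    if token in ADJECTIVES:
        readings.append(Reading(ADJECTIVES[token], "Adj", "fs"))
    return readings


def format_readings(token, readings):
    """Render readings in the adapted tag format used in the text."""
    tags = "; ".join(f"{r.lemma} {r.pos} {r.features}" for r in readings)
    return f"{token}: {tags}"


if __name__ == "__main__":
    print(format_readings("abstracta", analyze("abstracta")))
    # abstracta: abstratamente Adv r; abstrata Adj fs
```

Because the reduced reading is generated only from adverb entries, a feminine singular adjective without a listed -mente counterpart (vermelha in this toy lexicon) receives the adjective reading alone.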

It has been previously noted that compound adverbs (or collocational combinations), such as única e exclusivamente ‘uniquely and exclusively’ and única e simplesmente ‘uniquely and simply’, occur quite often in the corpus. Other forms were added to the lexicon, e.g. pura e simplesmente ‘purely and simply’, dire(c)ta ou indire(c)tamente ‘directly or indirectly’, explícita ou implicitamente ‘explicitly or implicitly’ and total ou parcialmente ‘totally or partially’.

Naturally, the closed set of simple adverbs has also been systematically collected from several sources. However, rare or old forms (e.g. acá 'here') are kept separate from the main lexical resources in use.

Finally, an extensive listing of about 2,000 compound, often idiomatic adverbs (Palma 2009) has also been added.

Semantic and syntactic information has also been added to most of the adverbs in the XIP lexicon.

Compound words

Compounding is one of the most productive lexical mechanisms for creating new entries and designating new concepts and objects, especially in scientific and technological domains. The lexical coverage of the STRING lexicon of compound words can already be considered quite satisfactory, especially for non-technical texts.

Portela (2011) and Portela et al. (2011) explore the resources and modules of STRING to develop new ways of acquiring compound-word candidates from large corpora, combining machine-learning and pattern-matching techniques. Results are very encouraging, and new lists of over 2,000 compounds are being integrated into the existing lexicon.


References

Casteleiro, J.M.: Vocabulário Ortográfico da Língua Portuguesa. Porto Editora, Lisboa (2009).

Baptista, Jorge. Verba dicendi: a structure looking for verbs. In: Nakamura, Takuya; Laporte, Éric; Dister, Anne; Fairon, Cédrick (eds.). Les Tables. La grammaire du français par le menu. Mélanges en hommage à Christian Leclère. Cahiers du CENTAL 6 : 11-20. Louvain-la-Neuve: CENTAL/Presses Universitaires de Louvain (2010).

Baptista, J.; Vieira, L.; Diniz, C.; Mamede, N.: Coordination of -mente ending Adverbs in Portuguese: an Integrated Solution. In: Caseli, H. et al. (eds.), 10th International Conference on Computational Processing of Portuguese (PROPOR 2012), Coimbra, Portugal, April 2012. LNCS/LNAI 7243, pages 24-34. Springer, Berlin, Heidelberg (2012).

Palma, C.: Estudo Contrastivo Português-Espanhol de Expressões Fixas Adverbiais (MA Thesis). Faro: U. Algarve (2009).

Lexicons (XIP)

Number of lemmas in XIP lexicons : 48,348

  • AdjectivsPredicativ : Predicative adjectives resulting from nominalizations, and provided with syntactic and semantic features (Baptista 2005);
  • Adverb : Adverbs with syntactic and semantic features: simple, non-derived adverbs, -mente ending adverbs (Baptista et al. 2012) and compound adverbs (Palma 2009); features include: comparative, negation, proximity/deictic, quantification, time (frequency, date, duration), view-point, focus, manner, manner-subject-oriented, conjunctive, disjunctive-style, disjunctive-subject-oriented, disjunctive-modal, disjunctive-evaluative, disjunctive-habitual;
  • Brands : List of common brands (used in NER);
  • Conjunction : Simple and compound conjunctions with syntactic and semantic features; main classification distinguishes subordinate and coordinate conjunctions; classification includes: additive, adversative, aspectual, causal, comparative, concessive, conditional, consecutive, final, negation, preterition, proportional, temporal, topic; syntactic features also indicate the mood of the subordinate clause: subjunctive, indicative, bare infinitive or inflected infinitive;
  • Culture : List of words that can introduce a named entity, typically a building or a monument (used in NER);
  • Currency : List of monetary units, their symbols, abbreviations and other information (used in NER);
  • Group : Large-sized gazetteer of music groups and bands (used in NER) (Oliveira 2011);
  • Habitation : List of words that can introduce a named entity, typically an address (used in NER);
  • Human : Auxiliary vocabulary for Human context (used in NER);
  • Location : Auxiliary vocabulary for Place context (used in NER);
  • Measure : List of stock exchange indices, measure units, determinative nouns (used in NER and in Parsing);
  • Nationality : List of nationality adjectives-nouns and related vocabulary (used in NER and in Relation Extraction);
  • lexNounSem : List of the 5,200 most common nouns, with semantic features;
  • NounPredicativ : Predicative nouns with support verb ser de provided with syntactic and semantic features (Baptista 2005);
  • Number : List of number words (used in NER and in Parsing);
  • Organization : Large-sized gazetteer of organizations, classified according to domain (e.g. sporting clubs); auxiliary vocabulary for Human-Collective context (used in NER) (Oliveira 2011);
  • People : Auxiliary vocabulary (titles, office, etc.) for Human context (used in NER) (Oliveira 2011);
  • Preposition : Simple and compound prepositions with syntactic and semantic features (similar to those of conjunctions); other features help define the distributional nature of the NP they introduce: abstract, beneficiary, comitative, hydronym, human, instrumental, locative, manner, negation and point-of-view;
  • Profession : Large-sized gazetteer of profession and affiliation nouns; auxiliary vocabulary for Human context (used in NER) (Oliveira 2011);
  • ProperNoun : Large-sized gazetteer of proper names (used in NER);
  • Relatives : Auxiliary vocabulary (kinship) for Human context used in NER (Oliveira 2011) and in Relation Extraction (Santos 2010);
  • Religion : Auxiliary vocabulary (titles, office, etc.) for Human context (used in NER) (Oliveira 2011);
  • Sports : List of common sports (used in NER);
  • Time : Time expressions and auxiliary vocabulary associated with the notion of time (used in NER) (Maurício 2011);
  • TimeFestive : List of festive dates and other special dates of the calendar, including awareness days (used in NER) (Maurício 2011);
  • Verb : (see ViPEr);
  • VerbAgression : List of vocabulary associated with the notion of aggression;
  • VerbAuxiliar : Auxiliary verbs; features provide the main classification (temporal, aspectual and modal), the preposition and the form of the main verb (infinitive, past participle or gerund; see Baptista et al. 2010);
  • VerbControl : Verbs with special features for zero-anaphora resolution (Pereira et al. 2010);
  • VerbDative : Verbs selecting an essential indirect (dative) complement (used for Parsing in the calculus of the CINDIR (indirect object) dependency; soon to be replaced by information on ViPEr);
  • VerbDicendi : Verbs that can introduce direct speech (see Baptista 2010);
  • VerbHumAct : List of vocabulary associated with the notion of action and that typically select a Human subject (used for Metonymy in NER; Oliveira 2011);
  • VerbIntransit : List of intransitive verbs that do not accept a direct object (used for Parsing in the calculus of the CDIR (direct object) dependency; soon to be replaced by information on ViPEr).

Lexicon-Grammar of Portuguese Verbs (ViPEr)

Identifying the meaning of a verb (and of the surrounding elements) is easier when one knows the syntactic and semantic constraints that the verb imposes on the lexical filling of its argument positions and on its syntactic construction. This is true for many NLP tasks, but especially for any task that requires fine-grained semantic distinctions between ambiguous lexical forms. In particular, the number of arguments, their structural and distributional type, the prepositions the verb selects to introduce its essential complements, and the main shape changes that its structures can undergo can all be put to use to improve parsing strategies, word sense disambiguation, question-answering systems and computer-assisted language learning systems, among other applications. An inventory of basic word senses and their corresponding structures is therefore necessary, and building it is the aim of the current project.

The Lexicon-Grammar of Portuguese Verbs (ViPEr) is a lexical resource that describes syntactic and semantic information about Portuguese verbs. For the most part, ViPEr focuses on European Portuguese, but some Brazilian constructions, particularly with locative verbs, have also been identified. This resource is dedicated to full (distributional or lexical) verbs, i.e., verbs whose meaning allows for an intensional definition of their respective construction and of the distributional (semantic) constraints on their argument positions (subject and complements).

Around 6,800 verb senses (corresponding to 5,200 verb lemmas) have been described so far (those with frequency 5 or higher in the CETEMPúblico corpus) and classified into 72 formal classes based on syntactic-semantic criteria. In addition, approximately 245 auxiliary constructions (copula verbs, tense/aspect/mood auxiliaries, support verbs and operator verbs) have been identified for later description. The description of the remaining full verbs is still ongoing.

Table 1. ViPEr’s verb distribution (per number of possible classes)

Number of Possible Classes    Number of Verbs
1                             4,249
2                             564
3                             193
4                             91
5                             29
6                             21
7                             6
8                             2
Total:                        5,200

Table 2. ViPEr’s entries for the different meanings of 'apontar' (to point)

Class    Example
36DT     O bandido apontou uma faca ao polícia
38LD     O Pedro apontou os números premiados num recibo do multibanco
39T      O Pedro apontou o João como sério candidato
9C       O Pedro apontou à Joana quais os defeitos que devia corrigir

Example of annotated text (codes in curly brackets indicate the verb class; the code to the left of # corresponds to the word sense of that instance):

"A Europa deve{dever(VMOD)} cumprir{cumprir(32R#05H,35R)}
os acordos com a maior celeridade possível. Espero{esperar(06#35R)} que a Europa
esteja{estar(VSUP#)} a a altura de as circunstâncias" , afirmou{afirmar(09I#31H)} .
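The annotation format above can be read mechanically. The following Python sketch is only an illustration under stated assumptions: the regular expression, the `VerbAnnotation` type and the function name are hypothetical, and reading the codes to the right of # as the remaining candidate classes is an interpretation of the example, not something documented here.

```python
import re
from typing import NamedTuple


class VerbAnnotation(NamedTuple):
    form: str           # inflected form found in the text
    lemma: str          # verb lemma given inside the curly brackets
    selected: str       # code to the left of '#' (or the only code, if no '#')
    alternatives: list  # codes to the right of '#', possibly empty


# Assumed shape: form{lemma(SELECTED#ALT1,ALT2)}; the '#...' part is optional.
ANNOTATION = re.compile(r"(\w+)\{(\w+)\(([^()#]*)(?:#([^()]*))?\)\}")


def parse_annotations(text):
    """Extract every annotated verb from a ViPEr-style annotated string."""
    annotations = []
    for form, lemma, selected, rest in ANNOTATION.findall(text):
        alternatives = [code for code in rest.split(",") if code]
        annotations.append(VerbAnnotation(form, lemma, selected, alternatives))
    return annotations


if __name__ == "__main__":
    sample = (
        "A Europa deve{dever(VMOD)} cumprir{cumprir(32R#05H,35R)} os acordos. "
        "Espero{esperar(06#35R)} que a Europa esteja{estar(VSUP#)} a a altura"
    )
    for annotation in parse_annotations(sample):
        print(annotation)
```

Applied to the sample, cumprir is returned with the selected class 32R and the alternatives 05H and 35R, while dever and estar carry only the auxiliary codes VMOD and VSUP.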

Last update: 2016-06-07

Lexicon-Grammar of Predicative Nouns

Ever since the late 1980s and early 1990s, a systematic survey has been undertaken by several scholars towards the comprehensive description of the lexical, morphologic, syntactic, semantic and transformational properties of predicative nouns and their support verb constructions in Portuguese, under the Lexicon-Grammar framework (M. Gross 1981, 1996). Usually organized around their elementary support verbs, one can cite the constructions with estar Prep 'be Prep' (Ranchhod 1988, 1990), ser de 'be of' (Baptista 2000, 2005), fazer 'do/make' (Chacoto 2005), dar 'give' (Vaza 1989, Baptista 1997), for European Portuguese; and, for Brazilian Portuguese, fazer 'do/make' (Barros 2013), ter 'have' (Santos 2015) and dar 'give' (Rassi 2015), among others.

A general framework for integrating support verb constructions into the rule-based Portuguese grammar developed for the STRING system (Baptista et al. 2014, Rassi et al. 2014), using the XIP parser (Ait-Mokhtar et al. 2002), has been proposed and implemented (Mamede and Baptista 2015, TechRep). A corpus annotated for support verb constructions with dar in Brazilian Portuguese (Rassi et al. 2015a and 2015b) has been produced and used to evaluate the approach implemented in STRING, producing satisfactory results (77-82% F-measure).

This approach is based on the concept of event (here taken in the sense of semantic predicate) and event arguments (Cabarrão et al. 2012, TechRep), which are then associated with their corresponding semantic roles (Talhadas et al. 2013, Talhadas 2014). Basically, the basic parse produced by the grammar is used to extract a SUPPORT-VERB dependency, holding between the support verb and the predicative noun it construes. Then, at the event extraction phase, the predicative noun is used as the center of the predicate instead of the verb, and its arguments are associated with it, along with the corresponding semantic roles.

In the example below, the parsing output of STRING is shown for the sentence O Pedro deu um abraço ao João 'Pedro gave a hug to João'.

echo "O Pedro deu um abraço ao João." | xip/string.sh -t -tr -f -indent

                                    TOP
           +----------+---------+----------------+------------+
           |          |         |                |            |
          NP         VF        NP               PP          PUNCT
       +-------+      +    +--------+      +----+-------+    +-
       |       |      |    |        |      |    |       |    |
      ART    NOUN   VERB  ART     NOUN   PREP  ART    NOUN   .
       +      +-     +-    +       +-     +-    +      +-
       |      |      |     |       |      |     |      |
       O    Pedro   deu   um    abraço   a     o    João

MAIN(deu)
QUANTD(abraço,um)
DETD(Pedro,O)
DETD(João,o)
VDOMAIN(deu,deu)
MOD_POST_DAT(deu,João)
SUBJ_PRE(deu,Pedro)
CDIR_POST(deu,abraço)
NE_PEOPLE_INDIVIDUAL(Pedro)
NE_PEOPLE_INDIVIDUAL(João)
SUPPORT-VERB_STANDARD(dar,abraço)
EVENT_LEX(abraço,outro)
EVENT_OTHER(abraço)
EVENT_AGENT-GENERIC(abraço,Pedro)
EVENT_PATIENT(abraço,João)
0>TOP{NP{O Pedro} VF{deu} NP{um abraço} PP{a o João} .}
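Dependency lines such as those above are easy to post-process. The sketch below is a hypothetical Python helper, not part of STRING or XIP: it assumes that each dependency is printed as NAME_FEATURE(arg1,arg2), with features appended to the name after underscores, and that two-argument EVENT dependencies other than EVENT_LEX assign a semantic role to an argument of the predicative noun.

```python
import re

# Assumed line shape: NAME[_FEATURE...](arg1,arg2,...)
DEPENDENCY = re.compile(r"^([A-Z-]+)((?:_[A-Z-]+)*)\((.*)\)$")


def parse_dependencies(output):
    """Return (name, features, arguments) tuples for every dependency line."""
    dependencies = []
    for line in output.splitlines():
        match = DEPENDENCY.match(line.strip())
        if match is None:
            continue  # tree drawing, chunk line or blank line
        name, features, arguments = match.groups()
        dependencies.append((
            name,
            tuple(f for f in features.split("_") if f),
            tuple(a.strip() for a in arguments.split(",")),
        ))
    return dependencies


def event_frames(dependencies):
    """Group role assignments by predicate: {predicate: {role: argument}}."""
    frames = {}
    for name, features, arguments in dependencies:
        is_role = name == "EVENT" and len(features) == 1 and features[0] != "LEX"
        if is_role and len(arguments) == 2:
            predicate, filler = arguments
            frames.setdefault(predicate, {})[features[0]] = filler
    return frames


SAMPLE = """\
MAIN(deu)
SUBJ_PRE(deu,Pedro)
CDIR_POST(deu,abraço)
MOD_POST_DAT(deu,João)
SUPPORT-VERB_STANDARD(dar,abraço)
EVENT_LEX(abraço,outro)
EVENT_OTHER(abraço)
EVENT_AGENT-GENERIC(abraço,Pedro)
EVENT_PATIENT(abraço,João)
0>TOP{NP{O Pedro} VF{deu} NP{um abraço} PP{a o João} .}
"""

if __name__ == "__main__":
    print(event_frames(parse_dependencies(SAMPLE)))
```

With this reading, the event is centered on the predicative noun abraço rather than on the verb, with Pedro as AGENT-GENERIC and João as PATIENT.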

In this framework, the so-called 'vertical' equivalence relations (Ranchhod 1990, M. Gross 1998) between the different support verb constructions that the same predicative noun (and lexicon-grammar entry) can feature are treated in a compact way. For example, the same (or a very similar) event representation is used for the support verb variation shown in examples like:

O Pedro (está com=tem) sede 'Pedro is thirsty; lit: be with/have thirst'.

This implies collecting and pooling together the linguistic descriptions dispersed through several authors, something that is currently being done.

The pervasive operation of Conversion (G. Gross 1989, Baptista 1997) has also been taken into account. This transformation establishes a paraphrastic equivalence between two distinct support-verb constructions of the same predicative noun: a standard construction, with an agentive subject, and an equivalent converse sentence, where the agent is moved to a complement position while the patient or object becomes the subject. The semantic roles of the arguments of the predicative noun are kept, e.g.

O Pedro deu um abraço ao João = O João recebeu um abraço do Pedro 'Pedro gave a hug to João' , 'João received a hug from Pedro'

In this case, the semantic representation of the event and its arguments is the same for both the standard and the converse constructions, but the support-verb dependency gets a different feature (standard/converse):

SUPPORT-VERB_CONVERSE(recebeu,abraço)
EVENT_LEX(abraço,outro)
EVENT_OTHER(abraço)
EVENT_AGENT-GENERIC(abraço,Pedro)
EVENT_PATIENT(abraço,João)
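The equivalence between the two constructions can be checked directly on the dependency lines. The following Python sketch is an illustration only: the `roles` helper is hypothetical, and it assumes that two-argument EVENT dependencies other than EVENT_LEX encode role assignments. The two lists copy the dependencies shown in this section.

```python
def roles(dependency_lines):
    """Collect (role, predicate, argument) triples from EVENT_<ROLE>(pred,arg) lines."""
    triples = set()
    for line in dependency_lines:
        head, _, rest = line.partition("(")
        arguments = rest.rstrip(")").split(",")
        if head.startswith("EVENT_") and head != "EVENT_LEX" and len(arguments) == 2:
            triples.add((head[len("EVENT_"):], arguments[0], arguments[1]))
    return triples


STANDARD = [
    "SUPPORT-VERB_STANDARD(dar,abraço)",
    "EVENT_LEX(abraço,outro)",
    "EVENT_OTHER(abraço)",
    "EVENT_AGENT-GENERIC(abraço,Pedro)",
    "EVENT_PATIENT(abraço,João)",
]

CONVERSE = [
    "SUPPORT-VERB_CONVERSE(recebeu,abraço)",
    "EVENT_LEX(abraço,outro)",
    "EVENT_OTHER(abraço)",
    "EVENT_AGENT-GENERIC(abraço,Pedro)",
    "EVENT_PATIENT(abraço,João)",
]

if __name__ == "__main__":
    # Same role assignments, different support-verb dependency.
    print(roles(STANDARD) == roles(CONVERSE))  # True
    print(STANDARD[0], "vs", CONVERSE[0])
```

Only the first line of each list differs, which is where the standard/converse distinction is recorded.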

Finally, M. Gross's (1981) concepts of causative and linking operator verbs have been adopted and explicitly described in the Lexicon-Grammar. In the grammar, they are represented by VOP-CAUSE or VOP-LINK dependencies, holding between the verb and the predicative noun. With a causative operator verb, the underlying support verb construction is partly recovered in the parse: the nominal predicate is represented with its arguments and the semantic roles they have in that construction, while the subject of the causative operator verb is assigned a CAUSE semantic role:

Isso fez/Vopc # O Pedro tem sede = Isso fez sede ao Pedro (lit.) 'That did' # 'Pedro has thirst', 'That did thirst to Pedro'

VOP-CAUSE(fez,sede)
EVENT_LEX(sede,outro)
EVENT_OTHER(sede)
EVENT_EXPERIENCER(sede,Pedro)
EVENT_CAUSE(sede,Isso)

In the case of linking operator verb constructions, the operator verb dependency is extracted but this does not yield any new argument to the predicative noun, since its semantic role is the same as in the underlying support verb construction:

O Pedro tem/Vopl # A Ana está sob o poder mental do Pedro = O Pedro tem a Ana sob o seu poder mental (lit.) 'Pedro has' # 'Ana is under the mental power of Pedro', 'Pedro has Ana under his mental power'

VOP-LINK(tem,poder mental)
EVENT_LEX(poder mental,outro)
EVENT_OTHER(poder mental)
EVENT_AGENT-GENERIC(poder mental,Pedro)
EVENT_PATIENT(poder mental,Ana)

This framework can also be extended for operator verb constructions acting over other types of predicates (verbs and adjectives, for instance).

(Under construction; Last update: 2016-06-09)

Lexicon-Grammar of Verbal Idioms

Verbal idioms or frozen sentences (M. Gross 1982, 1989, 1996) constitute a significant part of the lexicon-grammar of a language. For European Portuguese (Baptista et al. 2004, 2005), we undertook the systematic description of the corresponding constructions, and recently proposed a general framework for integrating these expressions into the Portuguese grammar of the STRING system, namely into the rule-based parsing module XIP.

Up to now, the main (and most productive) syntactic classes of frozen sentences have been described in detail. The current contents of the lexicon-grammar of verbal idioms are shown in the table below.


Table 1. Lexicon-Grammar of European Portuguese Verbal idioms.

Class Structure                       Count  Example
C1    N0 V C1                         496    O Pedro bateu a bota 'Pedro beat the boot (= died)'
CDN   N0 V (C de N)1 != N0 V C1 a N2  44     O Pedro aprendeu o bê-á-bá da gramática 'Pedro learnt the abc of grammar (= learn the essentials of a subject matter)'
CAN   N0 V (C de N)1 = N0 V C1 a N2   176    O Pedro cortou as asas (de=a) o João 'Pedro cut the wings of/to João (= prevent someone from doing something)'
CNP2  N0 V N1 Prep2 C2                174    O Pedro não tirava a Ana da cabeça 'Pedro did not get Ana out of [his] head (= keep on thinking)'
C1PN  N0 V C1 Prep2 N2                235    A Rita cravou as unhas na fortuna do João 'Rita dug [her] nails into João's fortune'
C1P2  N0 V C1 Prep C2                 289    O Pedro deu o dito pelo não dito 'Pedro gave the said for the unsaid (= take back what has been said)'
CPPN  N0 V C1 Prep C2 Prep C3         27     O Pedro deitou fora o bebé com a água do banho 'Pedro threw away the baby with the bath water'
CPP   N0 V Prep C1 Prep C2            201    O Pedro deu com o nariz na porta 'Pedro gave with [his] nose on the door (= go somewhere in vain)'
CP1   N0 V Prep C1                    201    O Pedro passou pelas brasas 'Pedro passed through the embers (= sleep)'
CPN   N0 V Prep (C de N)1             96     O Pedro não chega aos calcanhares do João 'Pedro does not reach the heels of João (= is not a match)'

For other classes of frozen sentences, the description has only recently started. Here are some examples: Essa rua vai dar àquela praça 'This street leads to that square' (CV: N0 V (Prep) Vinf w); Vai onde foi o pai da Rosa! (profanity) (C0E: C0E C0 V w !, mostly exclamative or imperative); A sorte sorriu ao Pedro 'The luck smiled to Pedro' (C0x: C0 V w); O Pedro foi-se embora 'Pedro went away' (N0 V ADV w).

The parsing strategy to capture these multiword predicates consists in extracting a new dependency, FIXED, with the verb and any other frozen constituents as its arguments. For example, the sentence O Pedro comprou gato por lebre 'Pedro bought cat for hare' (= be fooled) is parsed as follows:

echo "O Pedro comprou gato por lebre." | xip/string.sh -t -tr -f

                               TOP
           +------------+-------+---------+---------+
           |            |       |         |         |
          NP           VF      NP        PP       PUNCT
       +-------+        +       +      +------+    +-
       |       |        |       |      |      |    |
      ART    NOUN     VERB    NOUN   PREP   NOUN   .
       +      +-       +-       +     +-     +-
       |      |        |        |     |      |
       O    Pedro   comprou   gato   por   lebre

MAIN(comprou)
DETD(Pedro,O)
VDOMAIN(comprou,comprou)
MOD_POST(comprou,lebre)
SUBJ_PRE(comprou,Pedro)
CDIR_POST(comprou,gato)
NE_PEOPLE_INDIVIDUAL(Pedro)
FIXED(comprou,gato,lebre)
0>TOP{NP{O Pedro} VF{comprou} NP{gato} PP{por lebre} .}
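The idea of a FIXED dependency grouping the verb with its frozen constituents can be illustrated with a toy matcher in Python. This is a sketch under strong assumptions and not how the XIP grammar works: the idiom lexicon format, the (form, lemma, POS) token triples and both function names are hypothetical, and the matching is a plain in-order lemma search with none of the syntactic constraints that a rule-based parser would apply.

```python
# Toy idiom lexicon (hypothetical format): verb lemma -> frozen constituent lemmas.
IDIOMS = {
    "comprar": [("gato", "lebre")],
}


def find_in_order(tokens, start, lemmas):
    """Return the positions of `lemmas` found in order from `start`, or None."""
    positions = []
    position = start
    for lemma in lemmas:
        while position < len(tokens) and tokens[position][1] != lemma:
            position += 1
        if position == len(tokens):
            return None
        positions.append(position)
        position += 1
    return positions


def fixed_dependencies(tokens):
    """Build FIXED(verb,frozen1,...) strings for idioms whose parts follow the verb."""
    dependencies = []
    for index, (form, lemma, pos) in enumerate(tokens):
        if pos != "VERB":
            continue
        for frozen in IDIOMS.get(lemma, []):
            positions = find_in_order(tokens, index + 1, frozen)
            if positions is not None:
                forms = [form] + [tokens[p][0] for p in positions]
                dependencies.append("FIXED(" + ",".join(forms) + ")")
    return dependencies


SENTENCE = [
    ("O", "o", "ART"), ("Pedro", "Pedro", "NOUN"), ("comprou", "comprar", "VERB"),
    ("gato", "gato", "NOUN"), ("por", "por", "PREP"), ("lebre", "lebre", "NOUN"),
    (".", ".", "PUNCT"),
]

if __name__ == "__main__":
    print(fixed_dependencies(SENTENCE))  # ['FIXED(comprou,gato,lebre)']
```

A matcher this simple would overgenerate on free combinations of the same words, which is one reason to anchor the extraction on the dependencies already computed by the parser.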

(Under construction, last update: 2016-06-09)

Uses

(Under construction)