Revision as of 13:01, 31 May 2016 by Jbaptis

Zero Anaphora Corpus (ZAC)


ZAC (Zero Anaphora Corpus) is a corpus of Brazilian Portuguese texts built to support the development of an Anaphora Resolution system, which is part of the STRING system. The corpus targets the resolution of so-called zero anaphora, that is, an anaphoric relation in which the anaphoric expression (or anaphor) has been zeroed.

In the following, we briefly present the main linguistic aspects involved in the process of annotating zero anaphora.

The Zero Anaphora Corpus (ZAC) consists of a set of full and partial texts retrieved from the web or digitised from books. It encompasses several genres and text types, namely journalistic and literary texts by contemporary native-speaking Brazilian Portuguese authors, totalling 35,212 words.

Table 1. Breakdown of the contents of the ZAC corpus per text type (words/percentage).
Text type                   Words       %
Special Report             15,791   44.88
News                        1,769    5.02
Chronicle                   8,385   23.81
Fiction (short stories)     3,227    9.17
Fiction (novel)             6,040   17.15
Total                      35,212

The corpus was jointly annotated by two linguists, who revised and discussed each other’s work, so that every annotation made by one of them was checked by the other. A set of guidelines (Pereira et al. 2009) was produced to help keep the annotation process consistent. Please refer to Baptista et al. (2016) and Pereira (2010) for further details.

The annotation of zero anaphora consisted essentially of inserting a tag of the form ‘[0=<x]’ in the empty slot of the zeroed constituent. The tag links the zero anaphor to its immediate antecedent (x) and indicates whether that antecedent appears before the anaphor, marked ‘<’ (anaphora proper), or after it, marked ‘>’ (cataphora). Inter-sentential anaphora is marked with double arrows, ‘<<’ and ‘>>’, irrespective of the number of intervening sentences.

So far, only zeroed subjects have been annotated. The annotation also covers the special cases of zero indefinite subjects, [0=indef], and impersonal subjects, [0=impers]. First-person-plural and third-person-plural indefinite subjects, which are formally indistinguishable from zeroed constituents, are marked with the special notations [0=1p] and [0=3p].
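
The tag notation described above can be read mechanically. The following Python sketch is only an illustration, not part of the STRING system: it assumes that every tag is written exactly as described (an opening ‘[0=’, then either an arrow followed by the antecedent or one of the four special labels, then ‘]’) and that the antecedent is any text up to the closing bracket. The names TAG_RE, ZeroTag and parse_zero_tags are hypothetical, and the sample sentence is made up rather than taken from the corpus.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed tag syntax, inferred from the description in the text:
#   [0=<x]  [0=>x]  [0=<<x]  [0=>>x]   (x = antecedent)
#   [0=indef]  [0=impers]  [0=1p]  [0=3p]
TAG_RE = re.compile(
    r"\[0=(?:(?P<arrow><<|>>|<|>)(?P<antecedent>[^\]]+)"
    r"|(?P<special>indef|impers|1p|3p))\]"
)


class ZeroTag(NamedTuple):
    kind: str                          # "zero", "indef", "impers", "1p" or "3p"
    direction: Optional[str]           # "anaphora" or "cataphora"; None for special tags
    inter_sentential: Optional[bool]   # True for the double arrows; None for special tags
    antecedent: Optional[str]          # antecedent text; None for special tags


def parse_zero_tags(text: str) -> List[ZeroTag]:
    """Return every zero-anaphora tag found in `text`, in order of appearance."""
    tags = []
    for match in TAG_RE.finditer(text):
        special = match.group("special")
        if special:
            tags.append(ZeroTag(special, None, None, None))
            continue
        arrow = match.group("arrow")
        direction = "anaphora" if arrow[0] == "<" else "cataphora"
        tags.append(
            ZeroTag("zero", direction, len(arrow) == 2,
                    match.group("antecedent").strip())
        )
    return tags


if __name__ == "__main__":
    # Made-up example sentence, not taken from the corpus.
    sample = "O Pedro chegou cedo e [0=<Pedro] saiu logo. [0=impers] Choveu muito."
    for tag in parse_zero_tags(sample):
        print(tag)
```

The single regular expression tries the double arrows before the single ones, so that ‘<<’ is not misread as ‘<’ followed by an antecedent starting with ‘<’.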

Table 2 shows the breakdown of the types of zero anaphor annotated in the corpus.

Table 2. Breakdown of zero anaphora types in the ZAC corpus
Text type                  zero   indef   impers    1p    3p   Total
Special Report              371      81       42    41     3     538
News                         40       8        4     0     0      52
Chronicle                   286      41       17    43     8     395
Fiction (short stories)     110       4       11     5    16     146
Fiction (novel)             281       7       26    19    25     358
Total                     1,088     141      100   108    52   1,489
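
Counts of the kind reported in Table 2 could be obtained by tallying the tags in an annotated text by category. The sketch below is a hedged illustration only: it assumes the tag syntax described earlier, groups all arrow-marked tags under the ‘zero’ column, and uses hypothetical names (TAG_RE, COLUMNS, count_zero_anaphora). It is not the procedure actually used to build the table.

```python
import re
from collections import Counter

# Column labels as they appear in Table 2 (without the Total column).
COLUMNS = ("zero", "indef", "impers", "1p", "3p")

# Assumed syntax: arrow-marked tags ([0=<x], [0=>>x], ...) count as "zero";
# the four special labels are captured in group 1.
TAG_RE = re.compile(r"\[0=(?:(?:<<|>>|<|>)[^\]]+|(indef|impers|1p|3p))\]")


def count_zero_anaphora(text: str) -> Counter:
    """Tally the tags in `text` by Table 2 column, including empty columns."""
    counts = Counter({column: 0 for column in COLUMNS})
    for match in TAG_RE.finditer(text):
        counts[match.group(1) or "zero"] += 1
    return counts


if __name__ == "__main__":
    # Made-up example text, not taken from the corpus.
    sample = (
        "A Ana saiu. [0=<<Ana] Voltou tarde. [0=impers] Choveu. "
        "[0=3p] Dizem que [0=<Ana] gostou."
    )
    counts = count_zero_anaphora(sample)
    for column in COLUMNS:
        print(column, counts[column])
    print("Total", sum(counts.values()))
```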

The Zero Anaphora Corpus can be downloaded below:

(File:ZAC Corpus v2.txt)