Corpora

Latest revision as of 15:39, 25 May 2017

Zero Anaphora Corpus (ZAC)

The Zero Anaphora Corpus (ZAC) is a corpus of Brazilian Portuguese texts built to support the development of an anaphora resolution system, which is part of the STRING system. The corpus targets the resolution of so-called zero anaphora, that is, an anaphoric relation in which the anaphoric expression (or anaphor) has been zeroed.

In the following, we briefly present the main linguistic aspects involved in the process of annotating zero anaphora.

The Zero Anaphora Corpus (ZAC) consists of a set of full and partial texts retrieved from the web or digitised from books. It covers several genres and text types, namely journalistic and literary texts by contemporary authors who are native speakers of Brazilian Portuguese, and totals 35,212 words.

Table 1. Breakdown of the contents of the ZAC corpus per text type (words/percentage).

Text type                 Words      %
Special Report           15,791  44.88
News                      1,769   5.02
Chronicle                 8,385  23.81
Fiction (short stories)   3,227   9.17
Fiction (novel)           6,040  17.15
Total                    35,212

The corpus was jointly annotated by two linguists, who revised and discussed each other’s work, so that every annotation made by one of them was checked by the other. A set of guidelines (Pereira and Baptista 2009) was produced to keep the annotation process consistent. Please refer to Baptista et al. (2016) and Pereira (2009, 2010) for further details.

The annotation of zero anaphora consisted, basically, of inserting a tag of the form ‘[0=<x]’ in the empty slot of the zeroed constituent. The tag links the zero anaphor to its immediate antecedent (x) and records whether the antecedent appears before the anaphor, marked ‘<’ (anaphora proper), or after it, marked ‘>’ (cataphora). Inter-sentential anaphora is marked with double arrows, ‘<<’ and ‘>>’, irrespective of the number of intervening sentences.

O mundo científico ficou ainda mais complexo [...] quando os pesquisadores 
passaram a se dedicar a [0=<pesquisadores] entender a função de cada um 
dos genes e, o supremo desafio, [0=<pesquisadores] explicar as razões pelas 
quais eles às vezes exercem suas funções e outras [0=<eles] parecem hibernar 
preguiçosamente nos cromossomas [...].

Romain Rolland descreve a primeira experiência com a amizade do seu herói adolescente. 
[0=<<Romain Rolland] Já conhecera muitas pessoas nos curtos anos de sua vida. 
Mas o que [0=<<Romain Rolland] experimentava naquele momento era diferente
de tudo o que já [0=<<Romain Rolland] sentira antes. 
O encontro acontecera de repente, mas [0=?] era como se [0=3p] já tivessem 
sido amigos a vida inteira. 

A cor dos olhos, a tendência para [0=indef] engordar [...] são características 
definidas [...] pelas bases químicas dos genes.

"[0=impers] Há uma perigosa tendência a [0=indef] fazer correlações entre 
etnia, crime e predisposição genética", alerta Pamela Sankar [...]

 As descobertas são impressionantes. [0=1p] Conseguimos informações 
preciosas sobre os genes, [...]. Mas ainda não [0=1p] temos conhecimento 
científico suficiente para [0=1p] saber o que fazer com todas essas informações".

[0=3p] Começaram a empurrar o veículo de volta para casa, [0=3p] bufando 
e [0=3p] praguejando.

So far, only zeroed subjects have been annotated. The annotation considers the special cases of zero indefinite subjects [0=indef] and impersonal subjects [0=impers]. In the case of first-person-plural and third-person-plural indefinite subjects, which are formally indistinguishable from zeroed constituents, the special notations [0=1p] and [0=3p] were used.
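The tag notation described above lends itself to simple pattern matching. The following is a minimal sketch, not part of the ZAC distribution, of how the tags might be read from annotated text. The names `ZeroTag`, `TAG_RE` and `parse_tags` are hypothetical, and treating ‘?’ as one more special label is an assumption based only on its appearance in one of the examples.

```python
import re
from typing import List, NamedTuple, Optional

# Special labels named in the text; "?" is included as an assumption because
# it occurs in one example but is not defined in the description.
SPECIAL_LABELS = {"indef", "impers", "1p", "3p", "?"}

# "[0=" + optional arrow ("<<", ">>", "<", ">") + label + "]".
# Ellipsis markers such as "[...]" do not match because they lack "0=".
TAG_RE = re.compile(r"\[0=(<<|>>|<|>)?([^\]]+)\]")


class ZeroTag(NamedTuple):
    kind: str                  # "zero" for linked anaphors, else the special label
    direction: Optional[str]   # "anaphora", "cataphora" or None
    scope: Optional[str]       # "intra", "inter" or None
    antecedent: Optional[str]  # antecedent text for linked anaphors


def parse_tags(text: str) -> List[ZeroTag]:
    """Return every zero-anaphor tag found in text, in order of appearance."""
    tags = []
    for match in TAG_RE.finditer(text):
        arrow, label = match.group(1), match.group(2).strip()
        if arrow is None:
            # No arrow: a special label, or something this sketch does not know.
            kind = label if label in SPECIAL_LABELS else "unknown"
            tags.append(ZeroTag(kind, None, None, None))
        else:
            direction = "anaphora" if arrow[0] == "<" else "cataphora"
            scope = "inter" if len(arrow) == 2 else "intra"
            tags.append(ZeroTag("zero", direction, scope, label))
    return tags
```

Calling `parse_tags` on a line such as `a [0=<pesquisadores] entender` yields a single linked tag with direction "anaphora", scope "intra" and antecedent "pesquisadores". Counting the returned tags by `kind`, `direction` and `scope` gives the sort of breakdown reported in the tables.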

Tables 2 and 3 show the breakdown of the types of anaphors annotated in the corpus.

Table 2. Breakdown of zero anaphora types in the ZAC corpus

Text type                 zero  indef  impers   1p   3p  Total
Special Report             371     81      42   41    3    538
News                        40      8       4    0    0     52
Chronicle                  286     41      17   43    8    395
Fiction (short stories)    110      4      11    5   16    146
Fiction (novel)            281      7      26   19   25    358
Total                    1,088    141     100  108   52  1,489

Table 3. Distribution of anaphora/cataphora and intra-/inter-sentential anaphora types in the ZAC corpus

Text type                  <   <<    >   >>
Special Report           275   74   20    0
News                      34    2    4    0
Chronicle                156  115    5    2
Fiction (short stories)   44   65    4    0
Fiction (novel)          171   99    8    0
Total                    680  355   41    2

Anaphora proper (‘<’ and ‘<<’) accounts for 1,035 instances (96%), against 43 cases of cataphora (‘>’ and ‘>>’, 4%). Intra-sentential links (‘<’ and ‘>’) account for 721 instances (66.9%), while the 357 inter-sentential cases (‘<<’ and ‘>>’) represent 33.1%.
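The figures in this paragraph can be reproduced from the Total row of Table 3. The short sketch below is only an arithmetic check. The variable names are illustrative, and the percentages are taken over the 1,078 instances that carry a direction marker, which is the base implied by the counts above.

```python
# Totals per marker, copied from the Total row of Table 3.
totals = {"<": 680, "<<": 355, ">": 41, ">>": 2}

# Instances that carry a direction marker (the base for the percentages).
linked = sum(totals.values())

anaphora = totals["<"] + totals["<<"]   # antecedent precedes the anaphor
cataphora = totals[">"] + totals[">>"]  # antecedent follows the anaphor
intra = totals["<"] + totals[">"]       # single arrows: same sentence
inter = totals["<<"] + totals[">>"]     # double arrows: across sentences


def pct(count: int) -> float:
    """Share of the linked instances, as a percentage rounded to one decimal."""
    return round(100 * count / linked, 1)


for name, count in [("anaphora", anaphora), ("cataphora", cataphora),
                    ("intra-sentential", intra), ("inter-sentential", inter)]:
    print(f"{name}: {count} ({pct(count)}%)")
```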


References


Baptista, J., Pereira, J., Mamede, N.: ZAC: Zero Anaphora Corpus - A Corpus for Zero Anaphora Resolution in Portuguese. In: Proceedings of Workshop on Corpora and Tools for Processing Corpora, PROPOR 2016, July 13, 2016, Tomar, Portugal (2016)

Pereira, S.: Linguistics Parameters for Zero Anaphora Resolution. Master’s thesis, Univ. Algarve/Univ. Wolverhampton, Faro and Wolverhampton (2010)

Pereira, S., Baptista, J.: Zero anaphora corpus annotation guidelines. Technical report, L2F - Spoken Language Laboratory, INESC-ID Lisboa, Lisboa (2009)

Pereira, S.: ZAC.PB: An annotated corpus for zero anaphora resolution in Portuguese. In: Student Research Workshop in conjunction with RANLP-09, Borovets, Bulgaria (2009) 53–59


The Zero Anaphora Corpus can be downloaded below:

(ZAC_Corpus_v2.txt)

Vocative Corpus

The Vocative Corpus can be downloaded below:

(vocative.txt)