Difference between revisions of "Corpora"

Revision as of 12:07, 31 May 2016

Zero Anaphora Corpus (ZAC)

The Zero Anaphora Corpus (ZAC) is a corpus of Brazilian Portuguese texts built to support the development of an anaphora resolution system that is part of the STRING system. The corpus targets the resolution of so-called zero anaphora, that is, an anaphoric relation in which the anaphoric expression (the anaphor) has been zeroed.

In the following, we briefly present the main linguistic aspects involved in the process of annotating zero anaphora.

The Zero Anaphora Corpus (ZAC) consists of a set of full and partial texts retrieved from the web or digitised from books. It covers several genres and text types, namely journalistic and literary texts by contemporary authors who are native speakers of Brazilian Portuguese, and totals 35,212 words.

Table 1. Breakdown of the contents of the ZAC corpus per text type (words/percentage).
Text type                  Words      %
Special Report            15,791  44.88
News                       1,769   5.02
Chronicle                  8,385  23.81
Fiction (short stories)    3,227   9.17
Fiction (novel)            6,040  17.15
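
The breakdown in Table 1 can be recomputed from the word counts alone. The following is a minimal sketch, assuming the counts are entered by hand as a Python dictionary; the names `word_counts` and `breakdown` are illustrative and not part of STRING or the ZAC distribution. Because the percentages are recomputed and rounded to two decimals here, they may differ in the last decimal place from the figures listed in the table.

```python
# Word counts per text type, copied from Table 1.
word_counts = {
    "Special Report": 15791,
    "News": 1769,
    "Chronicle": 8385,
    "Fiction (short stories)": 3227,
    "Fiction (novel)": 6040,
}


def breakdown(counts):
    """Return (total, {text_type: percentage}) for a dict of word counts."""
    total = sum(counts.values())
    # Each share is the text type's word count as a percentage of the total,
    # rounded to two decimal places.
    shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
    return total, shares


total, shares = breakdown(word_counts)
print(f"Total words: {total:,}")
for name, pct in shares.items():
    print(f"{name:<25} {word_counts[name]:>7,} {pct:>6.2f}")
```

The same function could be applied to any other partition of the corpus, for example journalistic versus literary texts, by passing a different dictionary of counts.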

[THIS PAGE IS UNDER CONSTRUCTION]

The Zero Anaphora Corpus

(File:ZAC Corpus v2.txt)