<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://string.hlt.inesc-id.pt/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Jorge.Baptista</id>
	<title>String - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://string.hlt.inesc-id.pt/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Jorge.Baptista"/>
	<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/wiki/Special:Contributions/Jorge.Baptista"/>
	<updated>2026-04-06T00:29:03Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=181</id>
		<title>Corpora</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=181"/>
		<updated>2024-01-30T15:04:08Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Zero Anaphora Corpus (ZAC) ===&lt;br /&gt;
&amp;lt;div style=&amp;quot;float:right;&amp;quot;&amp;gt;__TOC__&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
ZAC - Zero Anaphora Corpus is a corpus of Brazilian Portuguese texts built to support the construction of an Anaphora Resolution system, which is part of the STRING system. The ZAC corpus is aimed at the resolution of so-called zero anaphora, that is, an anaphora relation in which the anaphoric expression (or anaphor) has been zeroed.&lt;br /&gt;
&lt;br /&gt;
In the following, we briefly present the main linguistic aspects involved in the process of annotating zero anaphora.&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus (ZAC) consists of a set of full and partial texts retrieved from the web or digitized from books, encompassing several genres and text types, namely journalistic and literary texts by contemporary native speakers of Brazilian Portuguese, totaling 35,212 words.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 1. Breakdown of the contents of the ZAC corpus per text type (words/percentage).&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Words&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | %&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 15,791&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 44.88&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,769&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 5.02&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8,385&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 23.81&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3,227&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 9.17&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 6,040&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 17.15&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 35,212&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; |&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The corpus was jointly annotated by two linguists, who revised and discussed each other’s work, so that every annotation encoded by one of them was checked by the other. A set of guidelines (Pereira and Baptista 2009) was produced to help keep the annotation process consistent. Please refer to Baptista et al. (2016) and Pereira (2009, 2010) for further details.&lt;br /&gt;
&lt;br /&gt;
The annotation of zero anaphora consisted, basically, of inserting a tag for the zero anaphor with the form ‘[0=&amp;lt;x]’ in the empty slot of the zeroed constituent, linking it to its immediate antecedent (x) and indicating whether the antecedent appears before the anaphor, marked ‘&amp;lt;’ (anaphora proper), or after it, marked ‘&amp;gt;’ (cataphora). Inter-sentential anaphora is marked with double arrows ‘&amp;lt;&amp;lt;’ and ‘&amp;gt;&amp;gt;’, irrespective of the number of intervening sentences.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;pre style=&amp;quot;color:blue&amp;quot;&amp;gt;&lt;br /&gt;
O mundo científico ficou ainda mais complexo [...] quando os pesquisadores&lt;br /&gt;
passaram a se dedicar a [0=&amp;lt;pesquisadores] entender a função de cada um&lt;br /&gt;
dos genes e, o supremo desafio, [0=&amp;lt;pesquisadores] explicar as razões pelas&lt;br /&gt;
quais eles às vezes exercem suas funções e outras [0=&amp;lt;eles] parecem hibernar&lt;br /&gt;
preguiçosamente nos cromossomas [...].&lt;br /&gt;
&lt;br /&gt;
Romain Rolland descreve a primeira experiência com a amizade do seu herói adolescente.&lt;br /&gt;
[0=&amp;lt;&amp;lt;Romain Rolland] Já conhecera muitas pessoas nos curtos anos de sua vida.&lt;br /&gt;
Mas o que [0=&amp;lt;&amp;lt;Romain Rolland] experimentava naquele momento era diferente&lt;br /&gt;
de tudo o que já [0=&amp;lt;&amp;lt;Romain Rolland] sentira antes.&lt;br /&gt;
O encontro acontecera de repente, mas [0=?] era como se [0=3p] já tivessem&lt;br /&gt;
sido amigos a vida inteira.&lt;br /&gt;
&lt;br /&gt;
A cor dos olhos, a tendência para [0=indef] engordar [...] são características&lt;br /&gt;
definidas [...] pelas bases químicas dos genes.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;[0=impers] Há uma perigosa tendência a [0=indef] fazer correlações entre&lt;br /&gt;
etnia, crime e predisposição genética&amp;quot;, alerta Pamela Sankar [...]&lt;br /&gt;
&lt;br /&gt;
 As descobertas são impressionantes. [0=1p] Conseguimos informações&lt;br /&gt;
preciosas sobre os genes, [...]. Mas ainda não [0=1p] temos conhecimento&lt;br /&gt;
científico suficiente para [0=1p] saber o que fazer com todas essas informações&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
[0=3p] Começaram a empurrar o veículo de volta para casa, [0=3p] bufando&lt;br /&gt;
e [0=3p] praguejando.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So far, only zeroed subjects have been annotated. The annotation considers the special cases of zero indefinite subjects [0=indef] and impersonal subjects [0=impers]. In the case of first-person-plural and third-person-plural indefinite subjects, which are formally indistinguishable from zeroed constituents, the special notations [0=1p] and [0=3p] were used.&lt;br /&gt;
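&lt;br /&gt;
The tag inventory described above can be tallied mechanically. The following is a minimal Python sketch, not part of STRING or of the ZAC distribution: the names TAG, classify and count_tags are hypothetical, and it assumes that the corpus file uses exactly the bracketed tag forms shown in the examples.&lt;br /&gt;
&lt;br /&gt;
```python
import re
from collections import Counter

# "[0=" + optional arrows + antecedent or special label + "]".
# The arrows are matched as "any leading non-word characters", so the
# pattern never has to spell out the arrow characters themselves.
# Ellipses such as "[...]" do not start with "[0=" and are ignored.
TAG = re.compile(r"\[0=([^\]\w?]*)([^\]]+)\]")


def classify(arrows, label):
    """Map one tag to a (kind, scope) pair (hypothetical helper)."""
    arrows = arrows.strip()
    if not arrows:
        # No arrow: special labels such as indef, impers, 1p, 3p or "?".
        return (label, None)
    # A leading ">" means the antecedent follows the anaphor (cataphora);
    # the only other arrow in the scheme marks anaphora proper.
    kind = "cataphora" if arrows[0] == ">" else "anaphora"
    # Double arrows mark an antecedent located in another sentence.
    scope = "inter" if len(arrows) == 2 else "intra"
    return (kind, scope)


def count_tags(text):
    """Count the zero-anaphor tags found in an annotated text."""
    return Counter(classify(a, x.strip()) for a, x in TAG.findall(text))


# Example sentence taken from the annotated excerpt above.
sample = ("[0=3p] Começaram a empurrar o veículo de volta para casa, "
          "[0=3p] bufando e [0=3p] praguejando.")
print(count_tags(sample))
```
&lt;br /&gt;
Treating any tag without an arrow as a special label keeps [0=indef], [0=impers], [0=1p], [0=3p] and the unresolved [0=?] apart from true anaphora and cataphora links.&lt;br /&gt;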
&lt;br /&gt;
Tables 2 and 3 show the breakdown of the types of anaphors annotated in the corpus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 2. Breakdown of zero anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | zero&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | indef&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | impers&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 1p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 3p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 371&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 81&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 42&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 538&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 40&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 286&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 17&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 43&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 395&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 110&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 11&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 16&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 146&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 281&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 7&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 26&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 19&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 25&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 358&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,088&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 141&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 100&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 108&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,489&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 3. Distribution of anaphora/cataphora and intra-/inter-sentential anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 275&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 74&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 20&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 34&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 156&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 115&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 44&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 65&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 171&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 99&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 680&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 355&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Anaphora proper accounts for 1,035 instances (96%), against 43 cases of cataphora (4%). The intra-sentential type accounts for 721 instances (66.9%), while the 357 inter-sentential cases represent 33.1%.&lt;br /&gt;
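&lt;br /&gt;
These percentages follow directly from the column totals of Table 3. A minimal Python sketch of the arithmetic, with hypothetical variable names and the figures copied from the table:&lt;br /&gt;
&lt;br /&gt;
```python
# Column totals copied from the "Total" row of Table 3.
totals = {
    "intra_anaphora": 680,   # antecedent before the anaphor, same sentence
    "inter_anaphora": 355,   # antecedent before the anaphor, other sentence
    "intra_cataphora": 41,   # antecedent after the anaphor, same sentence
    "inter_cataphora": 2,    # antecedent after the anaphor, other sentence
}


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


grand_total = sum(totals.values())
anaphora = totals["intra_anaphora"] + totals["inter_anaphora"]
cataphora = totals["intra_cataphora"] + totals["inter_cataphora"]
intra = totals["intra_anaphora"] + totals["intra_cataphora"]
inter = totals["inter_anaphora"] + totals["inter_cataphora"]

print("total:", grand_total)
print("anaphora:", anaphora, share(anaphora, grand_total))
print("cataphora:", cataphora, share(cataphora, grand_total))
print("intra-sentential:", intra, share(intra, grand_total))
print("inter-sentential:", inter, share(inter, grand_total))
```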
&lt;br /&gt;
&lt;br /&gt;
References&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Baptista, J., Pereira, J., Mamede, N.: [[media:Baptista-et-al_2016_ZAC-corpus.pdf|ZAC: Zero Anaphora Corpus - A Corpus for Zero Anaphora Resolution in Portuguese]]. In: Proceedings of the Workshop on Corpora and Tools for Processing Corpora, PROPOR 2016, July 13, 2016, Tomar, Portugal (2016)&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: Linguistics Parameters for Zero Anaphora Resolution. Master’s thesis, Univ. Algarve/Univ. Wolverhampton, Faro and Wolverhampton (2010)&lt;br /&gt;
&lt;br /&gt;
Pereira, S., Baptista, J.: Zero anaphora corpus annotation guidelines. Technical&lt;br /&gt;
report, L2F - Spoken Language Laboratory, INESC-ID Lisboa, Lisboa (2009)&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: ZAC.PB: An annotated corpus for zero anaphora resolution in Portuguese. In: Student Research Workshop in conjunction with RANLP-09, Borovets,&lt;br /&gt;
Bulgaria (2009) 53–59&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:ZAC_Corpus_v2.txt|ZAC_Corpus_v2.txt]])&lt;br /&gt;
&lt;br /&gt;
Zero Anaphora Corpus (ZAC) © 2009 by Simone Pereira, Jorge Baptista and Nuno Mamede is licensed under Attribution-NonCommercial-ShareAlike 4.0 International&lt;br /&gt;
&lt;br /&gt;
=== Vocative Corpus  ===&lt;br /&gt;
&lt;br /&gt;
This corpus contains a set of simple sentences illustrating several types of vocative patterns in European Portuguese; it was used to develop this aspect of the rule-based grammar of STRING. The corpus contains both positive and negative examples of textual patterns related to vocatives. For example:&lt;br /&gt;
&lt;br /&gt;
Meu caro amigo, faça isto.&lt;br /&gt;
&lt;br /&gt;
The corpus consists of the output of STRING, where each sentence is provided with its chunking structure and the VOCATIVE dependency. As described in Baptista &amp;amp; Mamede (2017), this dependency is made to operate on an instantiated node 'outro', considering that vocatives operate on the entire sentence.&lt;br /&gt;
&lt;br /&gt;
VOCATIVE(Caro amigo , faça isto . outro outro,amigo)&lt;br /&gt;
36&amp;gt;TOP{NP{Caro amigo} , VF{faça} NP{isto} .}&lt;br /&gt;
&lt;br /&gt;
The construction of the testing examples is also described in Baptista &amp;amp; Mamede (2017). This paper describes the most salient linguistic aspects of vocative constructions in Portuguese, with special reference to its European variety. Next, the paper presents the strategy followed for implementing this linguistic knowledge in the computational grammar of Portuguese, developed for the natural language processing chain STRING and using the XIP rule-based parser. Very precise and detailed linguistic descriptions can be implemented in this way.&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Jorge Baptista and Nuno Mamede. Vocatives in Portuguese: Identification and Processing. In 6th Symposium on Languages, Applications and Technologies (SLATE 2017). Open Access Series in Informatics (OASIcs), Volume 56, pp. 22:1-22:14, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)&lt;br /&gt;
https://doi.org/10.4230/OASIcs.SLATE.2017.22&lt;br /&gt;
&lt;br /&gt;
The Vocative Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:Vocative.txt|vocative.txt]])&lt;br /&gt;
&lt;br /&gt;
Vocative Corpus © 2017 by Jorge Baptista and Nuno Mamede is licensed under Attribution-NonCommercial-ShareAlike 4.0 International&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=180</id>
		<title>Corpora</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=180"/>
		<updated>2024-01-30T12:01:50Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: /* Zero Anaphora Corpus (ZAC) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Zero Anaphora Corpus (ZAC) ===&lt;br /&gt;
&amp;lt;div style=&amp;quot;float:right;&amp;quot;&amp;gt;__TOC__&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
ZAC - Zero Anaphora Corpus is a corpus of Brazilian Portuguese texts built to support the construction of an Anaphora Resolution system, which is part of the STRING system. The ZAC corpus is aimed at the resolution of so-called zero anaphora, that is, an anaphora relation in which the anaphoric expression (or anaphor) has been zeroed.&lt;br /&gt;
&lt;br /&gt;
In the following, we briefly present the main linguistic aspects involved in the process of annotating zero anaphora.&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus (ZAC) consists of a set of full and partial texts retrieved from the web or digitized from books, encompassing several genres and text types, namely journalistic and literary texts by contemporary native speakers of Brazilian Portuguese, totaling 35,212 words.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 1. Breakdown of the contents of the ZAC corpus per text type (words/percentage).&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Words&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | %&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 15,791&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 44.88&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,769&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 5.02&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8,385&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 23.81&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3,227&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 9.17&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 6,040&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 17.15&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 35,212&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; |&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The corpus was jointly annotated by two linguists, who revised and discussed each other’s work, so that every annotation encoded by one of them was checked by the other. A set of guidelines (Pereira and Baptista 2009) was produced to help keep the annotation process consistent. Please refer to Baptista et al. (2016) and Pereira (2009, 2010) for further details.&lt;br /&gt;
&lt;br /&gt;
The annotation of zero anaphora consisted, basically, of inserting a tag for the zero anaphor with the form ‘[0=&amp;lt;x]’ in the empty slot of the zeroed constituent, linking it to its immediate antecedent (x) and indicating whether the antecedent appears before the anaphor, marked ‘&amp;lt;’ (anaphora proper), or after it, marked ‘&amp;gt;’ (cataphora). Inter-sentential anaphora is marked with double arrows ‘&amp;lt;&amp;lt;’ and ‘&amp;gt;&amp;gt;’, irrespective of the number of intervening sentences.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;pre style=&amp;quot;color:blue&amp;quot;&amp;gt;&lt;br /&gt;
O mundo científico ficou ainda mais complexo [...] quando os pesquisadores&lt;br /&gt;
passaram a se dedicar a [0=&amp;lt;pesquisadores] entender a função de cada um&lt;br /&gt;
dos genes e, o supremo desafio, [0=&amp;lt;pesquisadores] explicar as razões pelas&lt;br /&gt;
quais eles às vezes exercem suas funções e outras [0=&amp;lt;eles] parecem hibernar&lt;br /&gt;
preguiçosamente nos cromossomas [...].&lt;br /&gt;
&lt;br /&gt;
Romain Rolland descreve a primeira experiência com a amizade do seu herói adolescente.&lt;br /&gt;
[0=&amp;lt;&amp;lt;Romain Rolland] Já conhecera muitas pessoas nos curtos anos de sua vida.&lt;br /&gt;
Mas o que [0=&amp;lt;&amp;lt;Romain Rolland] experimentava naquele momento era diferente&lt;br /&gt;
de tudo o que já [0=&amp;lt;&amp;lt;Romain Rolland] sentira antes.&lt;br /&gt;
O encontro acontecera de repente, mas [0=?] era como se [0=3p] já tivessem&lt;br /&gt;
sido amigos a vida inteira.&lt;br /&gt;
&lt;br /&gt;
A cor dos olhos, a tendência para [0=indef] engordar [...] são características&lt;br /&gt;
definidas [...] pelas bases químicas dos genes.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;[0=impers] Há uma perigosa tendência a [0=indef] fazer correlações entre&lt;br /&gt;
etnia, crime e predisposição genética&amp;quot;, alerta Pamela Sankar [...]&lt;br /&gt;
&lt;br /&gt;
 As descobertas são impressionantes. [0=1p] Conseguimos informações&lt;br /&gt;
preciosas sobre os genes, [...]. Mas ainda não [0=1p] temos conhecimento&lt;br /&gt;
científico suficiente para [0=1p] saber o que fazer com todas essas informações&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
[0=3p] Começaram a empurrar o veículo de volta para casa, [0=3p] bufando&lt;br /&gt;
e [0=3p] praguejando.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So far, only zeroed subjects have been annotated. The annotation considers the special cases of zero indefinite subjects [0=indef] and impersonal subjects [0=impers]. In the case of first-person-plural and third-person-plural indefinite subjects, which are formally indistinguishable from zeroed constituents, the special notations [0=1p] and [0=3p] were used.&lt;br /&gt;
&lt;br /&gt;
Tables 2 and 3 show the breakdown of the types of anaphors annotated in the corpus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 2. Breakdown of zero anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | zero&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | indef&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | impers&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 1p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 3p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 371&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 81&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 42&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 538&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 40&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 286&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 17&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 43&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 395&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 110&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 11&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 16&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 146&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 281&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 7&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 26&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 19&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 25&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 358&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,088&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 141&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 100&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 108&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,489&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 3. Distribution of anaphora/cataphora and intra-/inter-sentential anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 275&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 74&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 20&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 34&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 156&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 115&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 44&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 65&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 171&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 99&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 680&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 355&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Anaphora proper accounts for 1,035 instances (96%), against 43 cases of cataphora (4%). Intra-sentential relations account for 721 instances (66.9%), while the 357 inter-sentential cases represent 33.1%.&lt;br /&gt;
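&lt;br /&gt;
The shares quoted above can be recomputed from the Total row of Table 3. The following is only a minimal sketch: the dictionary keys and the function names are illustrative assumptions, not labels used in the corpus, and the counts are copied from the table.&lt;br /&gt;

```python
# Column totals copied from the "Total" row of Table 3.
# The dictionary keys are illustrative names, not labels used by the corpus.
TABLE3_TOTALS = {
    "anaphora_intra": 680,   # single arrow, antecedent before the anaphor
    "anaphora_inter": 355,   # double arrow, antecedent before the anaphor
    "cataphora_intra": 41,   # single arrow, antecedent after the anaphor
    "cataphora_inter": 2,    # double arrow, antecedent after the anaphor
}


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def summarize(totals):
    """Group the four Table 3 columns by direction and by sentence scope."""
    grand_total = sum(totals.values())
    anaphora = totals["anaphora_intra"] + totals["anaphora_inter"]
    cataphora = totals["cataphora_intra"] + totals["cataphora_inter"]
    intra = totals["anaphora_intra"] + totals["cataphora_intra"]
    inter = totals["anaphora_inter"] + totals["cataphora_inter"]
    return {
        "total": grand_total,
        "anaphora": (anaphora, share(anaphora, grand_total)),
        "cataphora": (cataphora, share(cataphora, grand_total)),
        "intra": (intra, share(intra, grand_total)),
        "inter": (inter, share(inter, grand_total)),
    }


if __name__ == "__main__":
    for name, value in summarize(TABLE3_TOTALS).items():
        print(name, value)
```

Each entry of the returned dictionary pairs a raw count with its share of the 1,078 relations counted in Table 3, so the figures in the sentence above can be checked against the table directly.&lt;br /&gt;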
&lt;br /&gt;
&lt;br /&gt;
References&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Baptista, J., Pereira, J., Mamede, N. ZAC: ([[media:https://string.hlt.inesc-id.pt/wiki/File:Baptista-et-al_2016_ZAC-corpus.pdf|Zero Anaphora Corpus - A Corpus for Zero Anaphora Resolution in Portuguese]]), in Proceedings of Workshop on Corpora and Tools for Processing Corpora, PROPOR 2016, July 13, 2016, Tomar, Portugal. &lt;br /&gt;
&lt;br /&gt;
Pereira, S.: Linguistics Parameters for Zero Anaphora Resolution. Master’s thesis, Univ. Algarve/Univ. Wolverhampton, Faro and Wolverhampton (2010)&lt;br /&gt;
&lt;br /&gt;
Pereira, S., Baptista, J.: Zero anaphora corpus annotation guidelines. Technical report, L2F - Spoken Language Laboratory, INESC-ID Lisboa, Lisboa (2009)&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: ZAC.PB: An annotated corpus for zero anaphora resolution in Portuguese. In: Student Research Workshop in conjunction with RANLP-09, Borovets, Bulgaria (2009) 53–59&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:ZAC_Corpus_v2.txt|ZAC_Corpus_v2.txt]])&lt;br /&gt;
&lt;br /&gt;
Zero Anaphora Corpus (ZAC) © 2009 by Simone Pereira, Jorge Baptista and Nuno Mamede is licensed under Attribution-NonCommercial-ShareAlike 4.0 International&lt;br /&gt;
&lt;br /&gt;
=== Vocative Corpus  ===&lt;br /&gt;
&lt;br /&gt;
This corpus contains a set of simple sentences illustrating several types of vocative patterns in European Portuguese; it was used to develop this aspect of the rule-based grammar of STRING. The corpus contains both positive and negative examples of textual patterns related to vocatives. For example:&lt;br /&gt;
&lt;br /&gt;
Meu caro amigo, faça isto.&lt;br /&gt;
&lt;br /&gt;
The corpus consists of the output of STRING, where each sentence is provided with its chunking structure and the VOCATIVE dependency. As described in Baptista &amp;amp; Mamede (2017), this dependency is made to operate on an instantiated node 'outro', considering that vocatives operate on the entire sentence.&lt;br /&gt;
&lt;br /&gt;
VOCATIVE(Caro amigo , faça isto . outro outro,amigo)&lt;br /&gt;
36&amp;gt;TOP{NP{Caro amigo} , VF{faça} NP{isto} .}&lt;br /&gt;
&lt;br /&gt;
The construction of the test examples is also described in Baptista &amp;amp; Mamede (2017). That paper describes the most salient linguistic aspects of vocative constructions in Portuguese, with special reference to its European variety. It then presents the strategy followed for implementing this linguistic knowledge in the computational grammar of Portuguese, developed for the natural language processing chain STRING and using the XIP rule-based parser. Very precise and detailed linguistic descriptions can be implemented in this way.&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Jorge Baptista and Nuno Mamede. Vocatives in Portuguese: Identification and Processing. In 6th Symposium on Languages, Applications and Technologies (SLATE 2017). Open Access Series in Informatics (OASIcs), Volume 56, pp. 22:1-22:14, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)&lt;br /&gt;
https://doi.org/10.4230/OASIcs.SLATE.2017.22&lt;br /&gt;
&lt;br /&gt;
The Vocative Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:Vocative.txt|vocative.txt]])&lt;br /&gt;
&lt;br /&gt;
Vocative Corpus © 2017 by Jorge Baptista and Nuno Mamede is licensed under Attribution-NonCommercial-ShareAlike 4.0 International&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=179</id>
		<title>Corpora</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=179"/>
		<updated>2024-01-30T09:27:16Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: /* Vocative Corpus */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Zero Anaphora Corpus (ZAC) ===&lt;br /&gt;
&amp;lt;div style=&amp;quot;float:right;&amp;quot;&amp;gt;__TOC__&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
ZAC - Zero Anaphora Corpus is a corpus of Brazilian Portuguese texts built to support the construction of an Anaphora Resolution system, which is part of the STRING system. The ZAC corpus is aimed at the resolution of so-called zero anaphora, that is, an anaphora relation where the anaphoric expression (or anaphor) has been zeroed.&lt;br /&gt;
&lt;br /&gt;
In the following, we briefly present the main linguistic aspects involved in the process of annotating zero anaphora.&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus (ZAC) consists of a set of full and partial texts retrieved from the web, or digitized from books, encompassing several genres and text types, namely journalistic and literary texts by contemporary Brazilian Portuguese native-speaking authors, totaling 35,212 words.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 1. Breakdown of the contents of the ZAC corpus per text type (words/percentage).&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Words&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | %&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 15,791&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 44.88&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,769&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 5.02&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8,385&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 23.81&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3,227&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 9.17&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 6,040&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 17.15&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 35,212&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; |&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The corpus was jointly annotated by two linguists, who revised and discussed each other’s work, so that every annotation encoded by one of them was checked by the other. A set of guidelines (Pereira and Baptista 2009) was produced to help keep the annotation process consistent. Please refer to Baptista et al. (2016) and Pereira (2009, 2010) for further details.&lt;br /&gt;
&lt;br /&gt;
The annotation of zero anaphora basically consisted of inserting a tag of the form ‘[0=&amp;lt;x]’ in the empty slot of the zeroed constituent, linking it to its immediate antecedent (x), and marking whether that antecedent appears before the anaphor, with ‘&amp;lt;’ (anaphora proper), or after it, with ‘&amp;gt;’ (cataphora). Inter-sentential anaphora is marked with double arrows, ‘&amp;lt;&amp;lt;’ and ‘&amp;gt;&amp;gt;’, irrespective of the number of intervening sentences.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;pre style=&amp;quot;color:blue&amp;quot;&amp;gt;&lt;br /&gt;
O mundo científico ficou ainda mais complexo [...] quando os pesquisadores&lt;br /&gt;
passaram a se dedicar a [0=&amp;lt;pesquisadores] entender a função de cada um&lt;br /&gt;
dos genes e, o supremo desafio, [0=&amp;lt;pesquisadores] explicar as razões pelas&lt;br /&gt;
quais eles às vezes exercem suas funções e outras [0=&amp;lt;eles] parecem hibernar&lt;br /&gt;
preguiçosamente nos cromossomas [...].&lt;br /&gt;
&lt;br /&gt;
Romain Rolland descreve a primeira experiência com a amizade do seu herói adolescente.&lt;br /&gt;
[0=&amp;lt;&amp;lt;Romain Rolland] Já conhecera muitas pessoas nos curtos anos de sua vida.&lt;br /&gt;
Mas o que [0=&amp;lt;&amp;lt;Romain Rolland] experimentava naquele momento era diferente&lt;br /&gt;
de tudo o que já [0=&amp;lt;&amp;lt;Romain Rolland] sentira antes.&lt;br /&gt;
O encontro acontecera de repente, mas [0=?] era como se [0=3p] já tivessem&lt;br /&gt;
sido amigos a vida inteira.&lt;br /&gt;
&lt;br /&gt;
A cor dos olhos, a tendência para [0=indef] engordar [...] são características&lt;br /&gt;
definidas [...] pelas bases químicas dos genes.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;[0=impers] Há uma perigosa tendência a [0=indef] fazer correlações entre&lt;br /&gt;
etnia, crime e predisposição genética&amp;quot;, alerta Pamela Sankar [...]&lt;br /&gt;
&lt;br /&gt;
 As descobertas são impressionantes. [0=1p] Conseguimos informações&lt;br /&gt;
preciosas sobre os genes, [...]. Mas ainda não [0=1p] temos conhecimento&lt;br /&gt;
científico suficiente para [0=1p] saber o que fazer com todas essas informações&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
[0=3p] Começaram a empurrar o veículo de volta para casa, [0=3p] bufando&lt;br /&gt;
e [0=3p] praguejando.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So far, only zeroed subjects have been annotated. The annotation considers the special cases of zero indefinite subjects [0=indef] and impersonal subjects [0=impers]. In the case of first-person-plural and third-person-plural indefinite subjects, which are formally indistinguishable from zeroed constituents, the special notations [0=1p] and [0=3p] were used.&lt;br /&gt;
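&lt;br /&gt;
The tag format described above lends itself to automatic extraction. The following is only a minimal sketch, not a tool distributed with the corpus: the function name, the returned field names and the regular expression are all assumptions made for illustration. Tags that carry no arrow are returned as plain labels, without checking them against the notations listed above.&lt;br /&gt;

```python
import re

# The arrow characters are built with chr() only because this page is stored
# as escaped markup: chr(60) is the less-than sign, chr(62) the greater-than sign.
BEFORE = chr(60)  # antecedent precedes the zero anaphor (anaphora proper)
AFTER = chr(62)   # antecedent follows the zero anaphor (cataphora)

# Matches one annotation tag and captures everything between "[0=" and "]".
TAG = re.compile(r"\[0=([^\]]+)\]")


def parse_zero_tags(text):
    """Return one dict per zero-anaphor tag found in text (hypothetical helper)."""
    parsed = []
    for match in TAG.finditer(text):
        body = match.group(1)
        # Whatever remains after the leading arrows is the antecedent or label.
        target = body.lstrip(BEFORE + AFTER)
        arrows = body[: len(body) - len(target)]
        if not arrows:
            # No arrow: a label such as indef, impers, 1p or 3p.
            parsed.append({"kind": "label", "value": target})
            continue
        parsed.append({
            "kind": "anaphora" if arrows[0] == BEFORE else "cataphora",
            "scope": "inter-sentential" if len(arrows) == 2 else "intra-sentential",
            "value": target,
        })
    return parsed
```

Applied to the annotated text of the corpus, a function of this kind yields the antecedent of each zeroed subject together with its direction and scope, which is the information tallied per text type in Tables 2 and 3.&lt;br /&gt;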
&lt;br /&gt;
Tables 2 and 3 show the breakdown of the type of anaphors annotated in the corpus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 2. Breakdown of zero anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | zero&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | indef&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | impers&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 1p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 3p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 371&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 81&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 42&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 538&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 40&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 286&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 17&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 43&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 395&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 110&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 11&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 16&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 146&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 281&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 7&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 26&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 19&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 25&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 358&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,088&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 141&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 100&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 108&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,489&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 3. Distribution of anaphora/cataphora and intra-/inter-sentential anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 275&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 74&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 20&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 34&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 156&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 115&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 44&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 65&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 171&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 99&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 680&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 355&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Anaphora proper accounts for 1,035 instances (96%), against 43 cases of cataphora (4%). Intra-sentential relations account for 721 instances (66.9%), while the 357 inter-sentential cases represent 33.1%.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
References&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Baptista, J., Pereira, J., Mamede, N. ZAC: ([[media:https://string.hlt.inesc-id.pt/wiki/File:Baptista-et-al_2016_ZAC-corpus.pdf|Zero Anaphora Corpus - A Corpus for Zero Anaphora Resolution in Portuguese]]), in Proceedings of Workshop on Corpora and Tools for Processing Corpora, PROPOR 2016, July 13, 2016, Tomar, Portugal. &lt;br /&gt;
&lt;br /&gt;
Pereira, S.: Linguistics Parameters for Zero Anaphora Resolution. Master’s thesis, Univ. Algarve/Univ. Wolverhampton, Faro and Wolverhampton (2010)&lt;br /&gt;
&lt;br /&gt;
Pereira, S., Baptista, J.: Zero anaphora corpus annotation guidelines. Technical report, L2F - Spoken Language Laboratory, INESC-ID Lisboa, Lisboa (2009)&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: ZAC.PB: An annotated corpus for zero anaphora resolution in Portuguese. In: Student Research Workshop in conjunction with RANLP-09, Borovets, Bulgaria (2009) 53–59&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:ZAC_Corpus_v2.txt|ZAC_Corpus_v2.txt]])&lt;br /&gt;
&lt;br /&gt;
Zero Anaphora Corpus (ZAC) © 2009 by Simone Pereira, Jorge Baptista and Nuno Mamede is licensed under Attribution-NonCommercial-ShareAlike 4.0 International &lt;br /&gt;
&lt;br /&gt;
=== Vocative Corpus  ===&lt;br /&gt;
&lt;br /&gt;
This corpus contains a set of simple sentences illustrating several types of vocative patterns in European Portuguese; it was used to develop this aspect of the rule-based grammar of STRING. The corpus contains both positive and negative examples of textual patterns related to vocatives. For example:&lt;br /&gt;
&lt;br /&gt;
Meu caro amigo, faça isto.&lt;br /&gt;
&lt;br /&gt;
The corpus consists of the output of STRING, where each sentence is provided with its chunking structure and the VOCATIVE dependency. As described in Baptista &amp;amp; Mamede (2017), this dependency is made to operate on an instantiated node 'outro', considering that vocatives operate on the entire sentence.&lt;br /&gt;
&lt;br /&gt;
VOCATIVE(Caro amigo , faça isto . outro outro,amigo)&lt;br /&gt;
36&amp;gt;TOP{NP{Caro amigo} , VF{faça} NP{isto} .}&lt;br /&gt;
&lt;br /&gt;
The construction of the test examples is also described in Baptista &amp;amp; Mamede (2017). That paper describes the most salient linguistic aspects of vocative constructions in Portuguese, with special reference to its European variety. It then presents the strategy followed for implementing this linguistic knowledge in the computational grammar of Portuguese, developed for the natural language processing chain STRING and using the XIP rule-based parser. Very precise and detailed linguistic descriptions can be implemented in this way.&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Jorge Baptista and Nuno Mamede. Vocatives in Portuguese: Identification and Processing. In 6th Symposium on Languages, Applications and Technologies (SLATE 2017). Open Access Series in Informatics (OASIcs), Volume 56, pp. 22:1-22:14, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)&lt;br /&gt;
https://doi.org/10.4230/OASIcs.SLATE.2017.22&lt;br /&gt;
&lt;br /&gt;
The Vocative Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:Vocative.txt|vocative.txt]])&lt;br /&gt;
&lt;br /&gt;
Vocative Corpus © 2017 by Jorge Baptista and Nuno Mamede is licensed under Attribution-NonCommercial-ShareAlike 4.0 International&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=178</id>
		<title>Corpora</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=178"/>
		<updated>2024-01-30T08:48:37Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Zero Anaphora Corpus (ZAC) ===&lt;br /&gt;
&amp;lt;div style=&amp;quot;float:right;&amp;quot;&amp;gt;__TOC__&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
ZAC - Zero Anaphora Corpus is a corpus of Brazilian Portuguese texts built to support the construction of an Anaphora Resolution system, which is part of the STRING system. The ZAC corpus is aimed at the resolution of so-called zero anaphora, that is, an anaphora relation where the anaphoric expression (or anaphor) has been zeroed.&lt;br /&gt;
&lt;br /&gt;
In the following, we briefly present the main linguistic aspects involved in the process of annotating zero anaphora.&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus (ZAC) consists of a set of full and partial texts retrieved from the web or digitized from books, encompassing several genres and text types, namely journalistic and literary texts by contemporary authors who are native speakers of Brazilian Portuguese, totaling 35,212 words.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 1. Breakdown of the contents of the ZAC corpus per text type (words/percentage).&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Words&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | %&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 15,791&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 44.88&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,769&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 5.02&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8,385&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 23.81&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3,227&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 9.17&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 6,040&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 17.15&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 35,212&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; |&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The corpus was jointly annotated by two linguists, who revised and discussed each other’s work, so that every annotation encoded by one of them was always checked by the other. A set of guidelines (Pereira and Baptista 2009) was produced to help ensure consistency in the annotation process. Please refer to Baptista et al. (2016) and Pereira (2009, 2010) for further details.&lt;br /&gt;
&lt;br /&gt;
The annotation of zero anaphora consisted essentially of inserting a tag for the zero anaphor, with the form ‘[0=&amp;lt;x]’, in the empty slot of the zeroed constituent, linking it to its immediate antecedent (x) and indicating whether the antecedent appears before the anaphor (‘&amp;lt;’, anaphora proper) or after it (‘&amp;gt;’, cataphora). Inter-sentential anaphora is marked with double arrows ‘&amp;lt;&amp;lt;’ and ‘&amp;gt;&amp;gt;’, irrespective of the number of intervening sentences.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;pre style=&amp;quot;color:blue&amp;quot;&amp;gt;&lt;br /&gt;
O mundo científico ficou ainda mais complexo [...] quando os pesquisadores&lt;br /&gt;
passaram a se dedicar a [0=&amp;lt;pesquisadores] entender a função de cada um&lt;br /&gt;
dos genes e, o supremo desafio, [0=&amp;lt;pesquisadores] explicar as razões pelas&lt;br /&gt;
quais eles às vezes exercem suas funções e outras [0=&amp;lt;eles] parecem hibernar&lt;br /&gt;
preguiçosamente nos cromossomas [...].&lt;br /&gt;
&lt;br /&gt;
Romain Rolland descreve a primeira experiência com a amizade do seu herói adolescente.&lt;br /&gt;
[0=&amp;lt;&amp;lt;Romain Rolland] Já conhecera muitas pessoas nos curtos anos de sua vida.&lt;br /&gt;
Mas o que [0=&amp;lt;&amp;lt;Romain Rolland] experimentava naquele momento era diferente&lt;br /&gt;
de tudo o que já [0=&amp;lt;&amp;lt;Romain Rolland] sentira antes.&lt;br /&gt;
O encontro acontecera de repente, mas [0=?] era como se [0=3p] já tivessem&lt;br /&gt;
sido amigos a vida inteira.&lt;br /&gt;
&lt;br /&gt;
A cor dos olhos, a tendência para [0=indef] engordar [...] são características&lt;br /&gt;
definidas [...] pelas bases químicas dos genes.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;[0=impers] Há uma perigosa tendência a [0=indef] fazer correlações entre&lt;br /&gt;
etnia, crime e predisposição genética&amp;quot;, alerta Pamela Sankar [...]&lt;br /&gt;
&lt;br /&gt;
 As descobertas são impressionantes. [0=1p] Conseguimos informações&lt;br /&gt;
preciosas sobre os genes, [...]. Mas ainda não [0=1p] temos conhecimento&lt;br /&gt;
científico suficiente para [0=1p] saber o que fazer com todas essas informações&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
[0=3p] Começaram a empurrar o veículo de volta para casa, [0=3p] bufando&lt;br /&gt;
e [0=3p] praguejando.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So far, only zeroed subjects have been annotated. The annotation considers the special cases of zero indefinite subjects [0=indef] and impersonal subjects [0=impers]. In the case of first-person-plural and third-person-plural indefinite subjects, which are formally indistinguishable from zeroed constituents, the special notations [0=1p] and [0=3p] were used.&lt;br /&gt;
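&lt;br /&gt;
As an illustration only, the tags described above could be read out of an annotated text with a few lines of Python. This is a minimal sketch and not part of the STRING system: the names TAG, SPECIAL, ARROWS, classify and count_tags are hypothetical, and the treatment of any tag body that is neither an arrow form nor one of the four special notations (such as the question-mark tag in the sample above) as "other" is an assumption.&lt;br /&gt;

```python
import re
from collections import Counter

# The arrow characters are written as escapes so that the snippet can be
# embedded in markup without further escaping: LT is the "before" arrow
# (anaphora proper), GT the "after" arrow (cataphora).
LT, GT = "\x3c", "\x3e"

# Matches any tag of the form [0=...] and captures its body.
# Ellipses such as [...] are not matched because they lack the "0=" prefix.
TAG = re.compile(r"\[0=([^\]]*)\]")

# Special notations listed in the text (no antecedent is linked).
SPECIAL = {"indef", "impers", "1p", "3p"}

# Double arrows are tested first so that they are not mistaken for single ones.
ARROWS = [
    (LT * 2, "inter-sentential anaphora"),
    (GT * 2, "inter-sentential cataphora"),
    (LT, "intra-sentential anaphora"),
    (GT, "intra-sentential cataphora"),
]


def classify(body):
    """Return (kind, antecedent) for one tag body; hypothetical helper."""
    if body in SPECIAL:
        return body, None
    for arrow, kind in ARROWS:
        if body.startswith(arrow):
            return kind, body[len(arrow):]
    # Assumption: anything else (for instance the [0=?] tag) is left unclassified.
    return "other", body


def count_tags(text):
    """Count tag kinds in an annotated text."""
    return Counter(classify(m.group(1))[0] for m in TAG.finditer(text))


sample = ("quando os pesquisadores passaram a se dedicar a [0=" + LT
          + "pesquisadores] entender a função de cada um dos genes [...]")
print(count_tags(sample))
```

Run over the whole corpus file, a counter of this kind would give per-type figures comparable to those reported in the tables, provided the file follows the notation exactly as described.&lt;br /&gt;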
&lt;br /&gt;
Tables 2 and 3 show the breakdown of the types of anaphors annotated in the corpus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 2. Breakdown of zero anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | zero&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | indef&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | impers&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 1p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 3p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 371&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 81&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 42&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 538&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 40&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 286&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 17&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 43&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 395&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 110&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 11&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 16&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 146&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 281&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 7&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 26&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 19&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 25&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 358&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,088&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 141&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 100&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 108&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,489&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 3. Distribution of anaphora/cataphora and intra-/inter-sentential anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 275&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 74&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 20&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 34&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 156&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 115&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 44&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 65&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 171&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 99&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 680&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 355&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Anaphora proper accounts for 1,035 instances (96%), against 43 cases of cataphora (4%). Intra-sentential links account for 721 instances (66.9%), while the 357 inter-sentential cases represent 33.1%.&lt;br /&gt;
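&lt;br /&gt;
These figures can be re-derived from the Total row of Table 3. The short Python sketch below only restates that arithmetic; the dictionary keys and the helper name share are illustrative choices, not names used in the corpus or in STRING.&lt;br /&gt;

```python
# Counts copied from the Total row of Table 3 (single and double arrows).
totals = {
    "intra_anaphora": 680,   # single "before" arrow
    "inter_anaphora": 355,   # double "before" arrow
    "intra_cataphora": 41,   # single "after" arrow
    "inter_cataphora": 2,    # double "after" arrow
}


def share(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


all_links = sum(totals.values())
anaphora = totals["intra_anaphora"] + totals["inter_anaphora"]
cataphora = totals["intra_cataphora"] + totals["inter_cataphora"]
intra = totals["intra_anaphora"] + totals["intra_cataphora"]
inter = totals["inter_anaphora"] + totals["inter_cataphora"]

summary = {
    "anaphora": (anaphora, share(anaphora, all_links)),
    "cataphora": (cataphora, share(cataphora, all_links)),
    "intra-sentential": (intra, share(intra, all_links)),
    "inter-sentential": (inter, share(inter, all_links)),
}
print(all_links, summary)
```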
&lt;br /&gt;
&lt;br /&gt;
References&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Baptista, J., Pereira, J., Mamede, N. ZAC: ([[media:https://string.hlt.inesc-id.pt/wiki/File:Baptista-et-al_2016_ZAC-corpus.pdf|Zero Anaphora Corpus - A Corpus for Zero Anaphora Resolution in Portuguese]]), in Proceedings of Workshop on Corpora and Tools for Processing Corpora, PROPOR 2016, July 13, 2016, Tomar, Portugal. &lt;br /&gt;
&lt;br /&gt;
Pereira, S.: Linguistics Parameters for Zero Anaphora Resolution. Master’s thesis, Univ. Algarve/Univ. Wolverhampton, Faro and Wolverhampton (2010)&lt;br /&gt;
&lt;br /&gt;
Pereira, S., Baptista, J.: Zero anaphora corpus annotation guidelines. Technical&lt;br /&gt;
report, L2F - Spoken Language Laboratory, INESC-ID Lisboa, Lisboa (2009)&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: ZAC.PB: An annotated corpus for zero anaphora resolution in Portuguese. In: Student Research Workshop in conjunction with RANLP-09, Borovets,&lt;br /&gt;
Bulgaria (2009) 53–59&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:ZAC_Corpus_v2.txt|ZAC_Corpus_v2.txt]])&lt;br /&gt;
&lt;br /&gt;
Zero Anaphora Corpus (ZAC) © 2009 by Simone Pereira, Jorge Baptista and Nuno Mamede is licensed under Attribution-NonCommercial-ShareAlike 4.0 International &lt;br /&gt;
&lt;br /&gt;
=== Vocative Corpus  ===&lt;br /&gt;
&lt;br /&gt;
The Vocative Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:Vocative.txt|vocative.txt]])&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=177</id>
		<title>Corpora</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=177"/>
		<updated>2024-01-30T08:47:09Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Zero Anaphora Corpus (ZAC) ===&lt;br /&gt;
&amp;lt;div style=&amp;quot;float:right;&amp;quot;&amp;gt;__TOC__&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
ZAC (Zero Anaphora Corpus) is a corpus of Brazilian Portuguese texts built to support the development of an Anaphora Resolution system, which is part of the STRING system. The corpus targets the resolution of so-called zero anaphora, that is, an anaphoric relation in which the anaphoric expression (or anaphor) has been zeroed.&lt;br /&gt;
&lt;br /&gt;
In the following, we briefly present the main linguistic aspects involved in the process of annotating zero anaphora.&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus (ZAC) consists of a set of full and partial texts retrieved from the web or digitized from books, encompassing several genres and text types, namely journalistic and literary texts by contemporary authors who are native speakers of Brazilian Portuguese, totaling 35,212 words.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 1. Breakdown of the contents of the ZAC corpus per text type (words/percentage).&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Words&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | %&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 15,791&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 44.88&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,769&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 5.02&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8,385&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 23.81&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3,227&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 9.17&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 6,040&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 17.15&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 35,212&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; |&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The corpus was jointly annotated by two linguists, who revised and discussed each other’s work, so that every annotation encoded by one of them was always checked by the other. A set of guidelines (Pereira and Baptista 2009) was produced to help ensure consistency in the annotation process. Please refer to Baptista et al. (2016) and Pereira (2009, 2010) for further details.&lt;br /&gt;
&lt;br /&gt;
The annotation of zero anaphora consisted essentially of inserting a tag for the zero anaphor, with the form ‘[0=&amp;lt;x]’, in the empty slot of the zeroed constituent, linking it to its immediate antecedent (x) and indicating whether the antecedent appears before the anaphor (‘&amp;lt;’, anaphora proper) or after it (‘&amp;gt;’, cataphora). Inter-sentential anaphora is marked with double arrows ‘&amp;lt;&amp;lt;’ and ‘&amp;gt;&amp;gt;’, irrespective of the number of intervening sentences.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;pre style=&amp;quot;color:blue&amp;quot;&amp;gt;&lt;br /&gt;
O mundo científico ficou ainda mais complexo [...] quando os pesquisadores&lt;br /&gt;
passaram a se dedicar a [0=&amp;lt;pesquisadores] entender a função de cada um&lt;br /&gt;
dos genes e, o supremo desafio, [0=&amp;lt;pesquisadores] explicar as razões pelas&lt;br /&gt;
quais eles às vezes exercem suas funções e outras [0=&amp;lt;eles] parecem hibernar&lt;br /&gt;
preguiçosamente nos cromossomas [...].&lt;br /&gt;
&lt;br /&gt;
Romain Rolland descreve a primeira experiência com a amizade do seu herói adolescente.&lt;br /&gt;
[0=&amp;lt;&amp;lt;Romain Rolland] Já conhecera muitas pessoas nos curtos anos de sua vida.&lt;br /&gt;
Mas o que [0=&amp;lt;&amp;lt;Romain Rolland] experimentava naquele momento era diferente&lt;br /&gt;
de tudo o que já [0=&amp;lt;&amp;lt;Romain Rolland] sentira antes.&lt;br /&gt;
O encontro acontecera de repente, mas [0=?] era como se [0=3p] já tivessem&lt;br /&gt;
sido amigos a vida inteira.&lt;br /&gt;
&lt;br /&gt;
A cor dos olhos, a tendência para [0=indef] engordar [...] são características&lt;br /&gt;
definidas [...] pelas bases químicas dos genes.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;[0=impers] Há uma perigosa tendência a [0=indef] fazer correlações entre&lt;br /&gt;
etnia, crime e predisposição genética&amp;quot;, alerta Pamela Sankar [...]&lt;br /&gt;
&lt;br /&gt;
 As descobertas são impressionantes. [0=1p] Conseguimos informações&lt;br /&gt;
preciosas sobre os genes, [...]. Mas ainda não [0=1p] temos conhecimento&lt;br /&gt;
científico suficiente para [0=1p] saber o que fazer com todas essas informações&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
[0=3p] Começaram a empurrar o veículo de volta para casa, [0=3p] bufando&lt;br /&gt;
e [0=3p] praguejando.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So far, only zeroed subjects have been annotated. The annotation considers the special cases of zero indefinite subjects [0=indef] and impersonal subjects [0=impers]. In the case of first-person-plural and third-person-plural indefinite subjects, which are formally indistinguishable from zeroed constituents, the special notations [0=1p] and [0=3p] were used.&lt;br /&gt;
&lt;br /&gt;
Tables 2 and 3 show the breakdown of the types of anaphors annotated in the corpus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 2. Breakdown of zero anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | zero&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | indef&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | impers&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 1p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 3p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 371&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 81&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 42&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 538&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 40&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 286&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 17&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 43&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 395&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 110&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 11&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 16&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 146&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 281&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 7&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 26&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 19&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 25&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 358&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,088&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 141&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 100&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 108&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,489&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 3. Distribution of anaphora/cataphora and intra-/inter-sentential anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 275&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 74&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 20&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 34&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 156&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 115&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 44&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 65&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 171&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 99&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 680&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 355&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Anaphora proper accounts for 1,035 instances (96%), against 43 cases of cataphora (4%). Intra-sentential links make up 721 instances (66.9%), while the 357 inter-sentential cases represent 33.1%.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
References&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Baptista, J., Pereira, J., Mamede, N. ZAC: Zero Anaphora Corpus - A Corpus for Zero Anaphora Resolution in Portuguese, in Proceedings of Workshop on Corpora and Tools for Processing Corpora, PROPOR 2016, July 13, 2016, Tomar, Portugal. ([[media:https://string.hlt.inesc-id.pt/wiki/File:Baptista-et-al_2016_ZAC-corpus.pdf]])&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: Linguistics Parameters for Zero Anaphora Resolution. Master’s thesis, Univ. Algarve/Univ. Wolverhampton, Faro and Wolverhampton (2010)&lt;br /&gt;
&lt;br /&gt;
Pereira, S., Baptista, J.: Zero anaphora corpus annotation guidelines. Technical report, L2F - Spoken Language Laboratory, INESC-ID Lisboa, Lisboa (2009)&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: ZAC.PB: An annotated corpus for zero anaphora resolution in Portuguese. In: Student Research Workshop in conjunction with RANLP-09, Borovets, Bulgaria (2009) 53–59&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:ZAC_Corpus_v2.txt|ZAC_Corpus_v2.txt]])&lt;br /&gt;
&lt;br /&gt;
Zero Anaphora Corpus (ZAC) © 2009 by Simone Pereira, Jorge Baptista and Nuno Mamede is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).&lt;br /&gt;
&lt;br /&gt;
=== Vocative Corpus  ===&lt;br /&gt;
&lt;br /&gt;
The Vocative Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:Vocative.txt|vocative.txt]])&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=176</id>
		<title>Corpora</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=176"/>
		<updated>2024-01-30T08:43:09Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Zero Anaphora Corpus (ZAC) ===&lt;br /&gt;
&amp;lt;div style=&amp;quot;float:right;&amp;quot;&amp;gt;__TOC__&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus (ZAC) is a corpus of Brazilian Portuguese texts built to support the construction of an anaphora resolution system, which is part of the STRING system. The corpus targets the resolution of so-called zero anaphora, that is, an anaphoric relation in which the anaphoric expression (or anaphor) has been zeroed.&lt;br /&gt;
&lt;br /&gt;
In the following, we briefly present the main linguistic aspects involved in the process of annotating zero anaphora.&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus (ZAC) consists of a set of full and partial texts retrieved from the web or digitized from books, encompassing several genres and text types, namely journalistic and literary texts by contemporary native-speaking Brazilian Portuguese authors, totaling 35,212 words.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 1. Breakdown of the contents of the ZAC corpus per text type (words/percentage).&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Words&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | %&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 15,791&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 44.88&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,769&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 5.02&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8,385&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 23.81&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3,227&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 9.17&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 6,040&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 17.15&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 35,212&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; |&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The corpus was jointly annotated by two linguists, who revised and discussed each other’s work, so that every annotation encoded by one of them was checked by the other. A set of guidelines (Pereira and Baptista 2009) was produced to keep the annotation process consistent. Please refer to Baptista et al. (2016) and Pereira (2009, 2010) for further details.&lt;br /&gt;
&lt;br /&gt;
The annotation of zero anaphora basically consisted of inserting a tag for the zero anaphor, of the form ‘[0=&amp;lt;x]’, in the empty slot of the zeroed constituent, linking it to its immediate antecedent (x) and indicating whether the antecedent appears before the anaphor, ‘&amp;lt;’ (anaphora proper), or after it, ‘&amp;gt;’ (cataphora). Inter-sentential anaphora is marked with double arrows, ‘&amp;lt;&amp;lt;’ and ‘&amp;gt;&amp;gt;’, irrespective of the number of intervening sentences.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;pre style=&amp;quot;color:blue&amp;quot;&amp;gt;&lt;br /&gt;
O mundo científico ficou ainda mais complexo [...] quando os pesquisadores&lt;br /&gt;
passaram a se dedicar a [0=&amp;lt;pesquisadores] entender a função de cada um&lt;br /&gt;
dos genes e, o supremo desafio, [0=&amp;lt;pesquisadores] explicar as razões pelas&lt;br /&gt;
quais eles às vezes exercem suas funções e outras [0=&amp;lt;eles] parecem hibernar&lt;br /&gt;
preguiçosamente nos cromossomas [...].&lt;br /&gt;
&lt;br /&gt;
Romain Rolland descreve a primeira experiência com a amizade do seu herói adolescente.&lt;br /&gt;
[0=&amp;lt;&amp;lt;Romain Rolland] Já conhecera muitas pessoas nos curtos anos de sua vida.&lt;br /&gt;
Mas o que [0=&amp;lt;&amp;lt;Romain Rolland] experimentava naquele momento era diferente&lt;br /&gt;
 de tudo o que já [0=&amp;lt;&amp;lt;Romain Rolland] sentira antes.&lt;br /&gt;
O encontro acontecera de repente, mas [0=?] era como se [0=3p] já tivessem&lt;br /&gt;
sido amigos a vida inteira.&lt;br /&gt;
&lt;br /&gt;
A cor dos olhos, a tendência para [0=indef] engordar [...] são características&lt;br /&gt;
definidas [...] pelas bases químicas dos genes.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;[0=impers] Há uma perigosa tendência a [0=indef] fazer correlações entre&lt;br /&gt;
etnia, crime e predisposição genética&amp;quot;, alerta Pamela Sankar [...]&lt;br /&gt;
&lt;br /&gt;
 As descobertas são impressionantes. [0=1p] Conseguimos informações&lt;br /&gt;
preciosas sobre os genes, [...]. Mas ainda não [0=1p] temos conhecimento&lt;br /&gt;
científico suficiente para [0=1p] saber o que fazer com todas essas informações&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
[0=3p] Começaram a empurrar o veículo de volta para casa, [0=3p] bufando&lt;br /&gt;
e [0=3p] praguejando.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So far, only zeroed subjects have been annotated. The annotation considers the special cases of zero indefinite subjects [0=indef] and impersonal subjects [0=impers]. In the case of first-person-plural and third-person-plural indefinite subjects, which are formally indistinguishable from zeroed constituents, the special notations [0=1p] and [0=3p] were used.&lt;br /&gt;
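&lt;br /&gt;
As an illustration only, the tag scheme described above could be read back from annotated text with a short Python sketch like the one below. The function name, the label strings and the returned tuple layout are assumptions made for this example and are not part of the corpus or of STRING. Any tag body that does not start with an arrow (indef, impers, 1p, 3p, or the unexplained ‘?’ seen in the examples) is passed through as a special label.&lt;br /&gt;

```python
import re

# The arrow characters are spelled by Unicode name so that this snippet
# contains no literal angle brackets (they would need escaping in markup).
LT = "\N{LESS-THAN SIGN}"
GT = "\N{GREATER-THAN SIGN}"

# Arrow prefix -> (direction, scope). The label strings are hypothetical.
LINKS = {
    LT: ("anaphora", "intra-sentential"),
    LT * 2: ("anaphora", "inter-sentential"),
    GT: ("cataphora", "intra-sentential"),
    GT * 2: ("cataphora", "inter-sentential"),
}

# Matches "[0=...]" tags; elision marks such as "[...]" do not match.
TAG = re.compile(r"\[0=([^\]]+)\]")


def parse_tags(text):
    """Return one (kind, scope, antecedent) tuple per zero-anaphor tag."""
    tags = []
    for body in TAG.findall(text):
        antecedent = body.lstrip(LT + GT)
        arrow = body[: len(body) - len(antecedent)]
        if arrow in LINKS:
            direction, scope = LINKS[arrow]
            tags.append((direction, scope, antecedent))
        else:
            # No recognised arrow: keep the body as a special label.
            tags.append((body, None, None))
    return tags


sample = (
    f"quando os pesquisadores passaram a se dedicar a [0={LT}pesquisadores] "
    f"entender [...]. [0={LT}{LT}Romain Rolland] Já conhecera muitas pessoas. "
    "[0=impers] Há uma perigosa tendência a [0=indef] fazer correlações."
)

for tag in parse_tags(sample):
    print(tag)
```

Counting the tuples returned for a whole file, grouped by their first two fields, would give figures comparable to those reported in the tables, assuming the file follows the notation exactly.&lt;br /&gt;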
&lt;br /&gt;
Tables 2 and 3 show the breakdown of the types of anaphors annotated in the corpus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 2. Breakdown of zero anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | zero&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | indef&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | impers&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 1p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 3p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 371&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 81&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 42&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 538&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 40&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 286&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 17&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 43&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 395&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 110&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 11&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 16&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 146&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 281&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 7&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 26&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 19&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 25&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 358&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,088&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 141&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 100&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 108&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,489&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 3. Distribution of anaphora/cataphora and intra-/inter-sentential anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 275&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 74&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 20&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 34&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 156&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 115&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 44&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 65&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 171&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 99&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 680&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 355&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Anaphora proper accounts for 1,035 instances (96%), against 43 cases of cataphora (4%). Intra-sentential links make up 721 instances (66.9%), while the 357 inter-sentential cases represent 33.1%.&lt;br /&gt;
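&lt;br /&gt;
The figures in this paragraph follow from the Total row of Table 3. The sketch below, given only as a check of the arithmetic, recomputes them in Python; the dictionary keys and the function name are hypothetical labels chosen for this example. The denominator is the sum of the four Table 3 totals (1,078).&lt;br /&gt;

```python
# Counts copied from the Total row of Table 3. The key names are
# hypothetical labels for the four arrow columns, in table order.
TABLE3_TOTALS = {
    "intra_anaphora": 680,   # single arrow, antecedent before the anaphor
    "inter_anaphora": 355,   # double arrow, antecedent before the anaphor
    "intra_cataphora": 41,   # single arrow, antecedent after the anaphor
    "inter_cataphora": 2,    # double arrow, antecedent after the anaphor
}


def summarize(totals):
    """Return {group: (count, percentage rounded to one decimal)}."""
    n = sum(totals.values())
    groups = {
        "anaphora": totals["intra_anaphora"] + totals["inter_anaphora"],
        "cataphora": totals["intra_cataphora"] + totals["inter_cataphora"],
        "intra": totals["intra_anaphora"] + totals["intra_cataphora"],
        "inter": totals["inter_anaphora"] + totals["inter_cataphora"],
    }
    return {name: (count, round(100 * count / n, 1))
            for name, count in groups.items()}


for name, (count, pct) in summarize(TABLE3_TOTALS).items():
    print(f"{name}: {count} ({pct}%)")
```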
&lt;br /&gt;
&lt;br /&gt;
References&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Baptista, J., Pereira, J., Mamede, N. ZAC: Zero Anaphora Corpus - A Corpus for Zero Anaphora Resolution in Portuguese, in Proceedings of Workshop on Corpora and Tools for Processing Corpora, PROPOR 2016, July 13, 2016, Tomar, Portugal. ([[media:https://string.hlt.inesc-id.pt/ (incomplete)]])&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: Linguistics Parameters for Zero Anaphora Resolution. Master’s thesis, Univ. Algarve/Univ. Wolverhampton, Faro and Wolverhampton (2010)&lt;br /&gt;
&lt;br /&gt;
Pereira, S., Baptista, J.: Zero anaphora corpus annotation guidelines. Technical report, L2F - Spoken Language Laboratory, INESC-ID Lisboa, Lisboa (2009)&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: ZAC.PB: An annotated corpus for zero anaphora resolution in Portuguese. In: Student Research Workshop in conjunction with RANLP-09, Borovets, Bulgaria (2009) 53–59&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:ZAC_Corpus_v2.txt|ZAC_Corpus_v2.txt]])&lt;br /&gt;
&lt;br /&gt;
Zero Anaphora Corpus (ZAC) © 2009 by Simone Pereira, Jorge Baptista and Nuno Mamede is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).&lt;br /&gt;
&lt;br /&gt;
=== Vocative Corpus  ===&lt;br /&gt;
&lt;br /&gt;
The Vocative Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:Vocative.txt|vocative.txt]])&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=175</id>
		<title>Corpora</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Corpora&amp;diff=175"/>
		<updated>2024-01-30T08:41:14Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Zero Anaphora Corpus (ZAC) ===&lt;br /&gt;
&amp;lt;div style=&amp;quot;float:right;&amp;quot;&amp;gt;__TOC__&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus (ZAC) is a corpus of Brazilian Portuguese texts built to support the construction of an anaphora resolution system, which is part of the STRING system. The corpus targets the resolution of so-called zero anaphora, that is, an anaphoric relation in which the anaphoric expression (or anaphor) has been zeroed.&lt;br /&gt;
&lt;br /&gt;
In the following, we briefly present the main linguistic aspects involved in the process of annotating zero anaphora.&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus (ZAC) consists of a set of full and partial texts retrieved from the web or digitized from books, encompassing several genres and text types, namely journalistic and literary texts by contemporary native-speaking Brazilian Portuguese authors, totaling 35,212 words.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 1. Breakdown of the contents of the ZAC corpus per text type (words/percentage).&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | Words&lt;br /&gt;
! style=&amp;quot;border-style: solid; border-width: 1px&amp;quot; | %&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 15,791&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 44.88&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,769&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 5.02&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8,385&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 23.81&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3,227&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 9.17&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 6,040&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 17.15&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
| style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 35,212&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; |&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The corpus was jointly annotated by two linguists, who revised and discussed each other’s work, so that every annotation encoded by one of them was checked by the other. A set of guidelines (Pereira and Baptista 2009) was produced to keep the annotation process consistent. Please refer to Baptista et al. (2016) and Pereira (2009, 2010) for further details.&lt;br /&gt;
&lt;br /&gt;
The annotation of zero anaphora basically consisted of inserting a tag for the zero anaphor, of the form ‘[0=&amp;lt;x]’, in the empty slot of the zeroed constituent, linking it to its immediate antecedent (x) and indicating whether the antecedent appears before the anaphor, ‘&amp;lt;’ (anaphora proper), or after it, ‘&amp;gt;’ (cataphora). Inter-sentential anaphora is marked with double arrows, ‘&amp;lt;&amp;lt;’ and ‘&amp;gt;&amp;gt;’, irrespective of the number of intervening sentences.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;pre style=&amp;quot;color:blue&amp;quot;&amp;gt;&lt;br /&gt;
O mundo científico ficou ainda mais complexo [...] quando os pesquisadores&lt;br /&gt;
passaram a se dedicar a [0=&amp;lt;pesquisadores] entender a função de cada um&lt;br /&gt;
dos genes e, o supremo desafio, [0=&amp;lt;pesquisadores] explicar as razões pelas&lt;br /&gt;
quais eles às vezes exercem suas funções e outras [0=&amp;lt;eles] parecem hibernar&lt;br /&gt;
preguiçosamente nos cromossomas [...].&lt;br /&gt;
&lt;br /&gt;
Romain Rolland descreve a primeira experiência com a amizade do seu herói adolescente.&lt;br /&gt;
[0=&amp;lt;&amp;lt;Romain Rolland] Já conhecera muitas pessoas nos curtos anos de sua vida.&lt;br /&gt;
Mas o que [0=&amp;lt;&amp;lt;Romain Rolland] experimentava naquele momento era diferente&lt;br /&gt;
 de tudo o que já [0=&amp;lt;&amp;lt;Romain Rolland] sentira antes.&lt;br /&gt;
O encontro acontecera de repente, mas [0=?] era como se [0=3p] já tivessem&lt;br /&gt;
sido amigos a vida inteira.&lt;br /&gt;
&lt;br /&gt;
A cor dos olhos, a tendência para [0=indef] engordar [...] são características&lt;br /&gt;
definidas [...] pelas bases químicas dos genes.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;[0=impers] Há uma perigosa tendência a [0=indef] fazer correlações entre&lt;br /&gt;
etnia, crime e predisposição genética&amp;quot;, alerta Pamela Sankar [...]&lt;br /&gt;
&lt;br /&gt;
 As descobertas são impressionantes. [0=1p] Conseguimos informações&lt;br /&gt;
preciosas sobre os genes, [...]. Mas ainda não [0=1p] temos conhecimento&lt;br /&gt;
científico suficiente para [0=1p] saber o que fazer com todas essas informações&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
[0=3p] Começaram a empurrar o veículo de volta para casa, [0=3p] bufando&lt;br /&gt;
e [0=3p] praguejando.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So far, only zeroed subjects have been annotated. The annotation considers the special cases of zero indefinite subjects [0=indef] and impersonal subjects [0=impers]. In the case of first-person-plural and third-person-plural indefinite subjects, which are formally indistinguishable from zeroed constituents, the special notations [0=1p] and [0=3p] were used.&lt;br /&gt;
&lt;br /&gt;
Tables 2 and 3 show the breakdown of the types of anaphors annotated in the corpus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 2. Breakdown of zero anaphora types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | zero&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | indef&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | impers&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 1p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | 3p&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 371&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 81&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 42&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 3&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 538&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 40&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 286&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 17&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 43&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 395&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 110&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 11&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 16&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 146&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 281&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 7&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 26&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 19&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 25&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 358&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,088&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 141&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 100&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 108&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 52&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 1,489&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
{|  style=&amp;quot;border-collapse: collapse; text-align: left;&amp;quot; cellpadding=&amp;quot;5&amp;quot; class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Table 3. Distribution of anaphora/cataphora and intra-/inter-sentential types in the ZAC corpus&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Text type&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;lt;&amp;lt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&lt;br /&gt;
! style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | &amp;gt;&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Special Report&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 275&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 74&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 20&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | News&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 34&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Chronicle&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 156&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 115&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 5&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (short stories)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 44&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 65&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 4&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Fiction (novel)&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 171&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 99&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 8&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 0&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;text-align: left; border-style: solid; border-width: 1px&amp;quot; | Total&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 680&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 355&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 41&lt;br /&gt;
! style=&amp;quot;text-align: right; border-style: solid; border-width: 1px&amp;quot; | 2&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The anaphora type corresponds to 1,035 instances (96%) against 43 cases of cataphora (4%). The intra-sentential type constitutes 721 instances (66.9%) while the 357 inter-sentential cases represent 33.1%.&lt;br /&gt;
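&lt;br /&gt;
The percentages above can be re-derived from the "Total" row of Table 3. The following is a minimal sketch, not part of the corpus tooling: the key names are hypothetical labels, and the mapping of the four columns to anaphora/cataphora and intra-/inter-sentential categories is an assumption inferred from the totals quoted in the text.&lt;br /&gt;
&lt;br /&gt;
```python
# Column totals copied from the "Total" row of Table 3.
# The key names are hypothetical labels chosen for this sketch; the
# mapping of columns to categories is inferred from the totals quoted
# in the text and is an assumption.
TABLE3_TOTALS = {
    "anaphora_intra": 680,
    "anaphora_inter": 355,
    "cataphora_intra": 41,
    "cataphora_inter": 2,
}


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def summarize(totals):
    """Aggregate the four columns by direction and by sentence scope."""
    whole = sum(totals.values())
    groups = {
        "anaphora": totals["anaphora_intra"] + totals["anaphora_inter"],
        "cataphora": totals["cataphora_intra"] + totals["cataphora_inter"],
        "intra": totals["anaphora_intra"] + totals["cataphora_intra"],
        "inter": totals["anaphora_inter"] + totals["cataphora_inter"],
    }
    return {name: (count, share(count, whole)) for name, count in groups.items()}


if __name__ == "__main__":
    for name, (count, percent) in summarize(TABLE3_TOTALS).items():
        print(f"{name}: {count} ({percent}%)")
```
&lt;br /&gt;
The denominator in this sketch is 1,078, the sum of the four column totals of Table 3.&lt;br /&gt;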
&lt;br /&gt;
&lt;br /&gt;
References&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Baptista, J., Pereira, J., Mamede, N. ZAC: Zero Anaphora Corpus - A Corpus for Zero Anaphora Resolution in Portuguese, in Proceedings of Workshop on Corpora and Tools for Processing Corpora, PROPOR 2016, July 13, 2016, Tomar, Portugal. ([[media:]])&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: Linguistics Parameters for Zero Anaphora Resolution. Master’s thesis, Univ. Algarve/Univ. Wolverhampton, Faro and Wolverhampton (2010)&lt;br /&gt;
&lt;br /&gt;
Pereira, S., Baptista, J.: Zero anaphora corpus annotation guidelines. Technical report, L2F - Spoken Language Laboratory, INESC-ID Lisboa, Lisboa (2009)&lt;br /&gt;
&lt;br /&gt;
Pereira, S.: ZAC.PB: An annotated corpus for zero anaphora resolution in Portuguese. In: Student Research Workshop in conjunction with RANLP-09, Borovets, Bulgaria (2009) 53–59&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The Zero Anaphora Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:ZAC_Corpus_v2.txt|ZAC_Corpus_v2.txt]])&lt;br /&gt;
&lt;br /&gt;
Zero Anaphora Corpus (ZAC) © 2009 by Simone Pereira, Jorge Baptista and Nuno Mamede is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.&lt;br /&gt;
&lt;br /&gt;
=== Vocative Corpus  ===&lt;br /&gt;
&lt;br /&gt;
The Vocative Corpus can be downloaded below:&lt;br /&gt;
&lt;br /&gt;
([[media:Vocative.txt|vocative.txt]])&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=File:Baptista-et-al_2016_ZAC-corpus.pdf&amp;diff=174</id>
		<title>File:Baptista-et-al 2016 ZAC-corpus.pdf</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=File:Baptista-et-al_2016_ZAC-corpus.pdf&amp;diff=174"/>
		<updated>2024-01-30T08:37:36Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: This paper describes a corpus of Brazilian Portuguese texts built in view of the construction of an Anaphora Resolution system, which is part of a fully-fledged Natural Language Processing system (STRING). The ZAC corpus is aimed at the resolution of the so-called zero-anaphora, that is, an anaphora relation where the anaphoric expression (or anaphor) has been zeroed. The paper briefly discusses the linguistic issues in the process of zero anaphora resolution and describes the annotation proce...&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Summary ==&lt;br /&gt;
This paper describes a corpus of Brazilian Portuguese texts built in view of the construction of an Anaphora Resolution system, which is part of a fully-fledged Natural Language Processing system (STRING). The ZAC corpus is aimed at the resolution of the so-called zero-anaphora, that is, an anaphora relation where the anaphoric expression (or anaphor) has been zeroed. The paper briefly discusses the linguistic issues in the process of zero anaphora resolution and describes the annotation process in detail, as well as the main aspects of the anaphoric relations thus annotated.&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Compound_Adverbs&amp;diff=153</id>
		<title>Compound Adverbs</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Compound_Adverbs&amp;diff=153"/>
		<updated>2024-01-11T09:13:55Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Testing List of Examples of the 300 most frequent compound (multi-word) adverbs in BP and EP''' (*)&lt;br /&gt;
&lt;br /&gt;
This [[media:PortugueseCompoundAdverbs.pdf|document]] presents a [[media:PortugueseCompoundAdverbs.xlsx|list of the 300 most frequent compound (multi-word) adverbs]] that are common to both the Brazilian (BP) and European (EP) varieties of the Portuguese language. The frequency of these adverbs was first determined from the extant lexicon-grammar of 3,500 compound adverbs [4], considering their occurrence in two corpora: the ''CETEMPúblico'' corpus [6] and the ''Corpus Brasileiro'' [7]. The goal was to map the distribution of compound adverbs in corpora from each variety, as described in [4] (in preparation). Then, these most frequent expressions were queried in the ''Portuguese TenTen 2020'' corpus [3], using the Sketch Engine platform [1]. Furthermore, using the ''Good Dictionary Examples'' (GDEX) extraction tool [2], a selection of examples was collated and carefully edited to shorten each sentence as much as possible without changing the overall meaning or the relevant syntactic dependencies involving the adverb. The example sentences were then translated into English using ''ChatGPT'' (version 3.5) [5] and manually revised. Soon, we intend to provide, alongside these examples, the target word that the adverb modifies within each sentence (or, possibly, the entire sentence). Focus adverbs will also be flagged.&lt;br /&gt;
&lt;br /&gt;
To cite this work, please use:&lt;br /&gt;
&lt;br /&gt;
Müller, Izabela, Nuno Mamede, and Jorge Baptista. Hurdles in Parsing Multi-word Adverbs: Examples from Portuguese, ''Proceedings of the 16th International Conference on Computational Processing of Portuguese'' (PROPOR 2024), Universidade de Santiago de Compostela, Galiza, Spain, March 12–15, 2024 (to appear).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&amp;lt;code style=&amp;quot;font-family: Consolas, monospace;&amp;quot;&amp;gt;@inproceedings{Muller-et-al-2024-Hurdles, author = {M\&amp;quot;uller, Izabela and Mamede, Nuno and Baptista, Jorge}, title = {{Hurdles in Parsing Multi-word Adverbs: Examples from Portuguese}}, booktitle = {Proceedings of the 16$^{th}$ International Conference on Computational Processing of Portuguese (PROPOR 2024)}, address = {Universidade de Santiago de Compostela, Galiza, Spain, March 12--15, 2024}, year = {2024}}&amp;lt;/code&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[media:PortugueseCompoundAdverbs.pdf|Document]]&lt;br /&gt;
&lt;br /&gt;
[[media:PortugueseCompoundAdverbs.xlsx|Spreadsheet]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;(*) Research for this paper has been partially supported by national funds from Fundação para a Ciência e a Tecnologia, under project reference DOI: 10.54499/UIDB/50021/2020. Izabela Müller has also received support from the University of Algarve, through the Language Sciences PhD program. This work is disseminated under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. [https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en]&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''References'''&lt;br /&gt;
&lt;br /&gt;
[1] Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell. The sketch engine. ''Proceedings of the 11th EURALEX International Congress'', pages 105–116, 2004.&lt;br /&gt;
&lt;br /&gt;
[2] Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, and Pavel Rychlý. GDEX: Automatically finding good dictionary examples in a corpus. In ''Proceedings of the 13th EURALEX International Congress'', volume 1, pages 425–432. Universitat Pompeu Fabra Barcelona, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Adam Kilgarriff, Miloš Jakubíček, Jan Pomikálek, Tony Berber Sardinha, and Pen Whitelock. PtTenTen: A Corpus for Portuguese Lexicography. ''Working with Portuguese Corpora'', pages 111–30, 2014.&lt;br /&gt;
&lt;br /&gt;
[4] Izabela Müller, Jorge Baptista, and Nuno Mamede. Differentiating Brazilian and European Portuguese Multiword Adverbs. Paper presented to the ''39th National Meeting of the Portuguese Linguistics Association'' (APL), Covilhã, Portugal, October, 2023.&lt;br /&gt;
&lt;br /&gt;
[5] OpenAI. ChatGPT-3.5: Language Models are Few-Shot Learners. [https://openai.com/blog/chatgpt-3-5/], 2023. Accessed: [05/01/2024].&lt;br /&gt;
&lt;br /&gt;
[6] Paulo Alexandre Rocha and Diana Santos. CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa. In Maria das Graças Volpe Nunes (ed.) ''V Encontro para o processamento computacional da língua portuguesa escrita e falada'' (PROPOR 2000)(Atibaia SP 19-22 de Novembro de 2000), São Paulo, Brasil: ICMC/USP, 2000.&lt;br /&gt;
&lt;br /&gt;
[7] Tony Berber Sardinha. Corpus Brasileiro. Informática, 708:0–1, 2010.&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Compound_Adverbs&amp;diff=152</id>
		<title>Compound Adverbs</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Compound_Adverbs&amp;diff=152"/>
		<updated>2024-01-11T08:57:52Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Testing List of Examples of the 300 most frequent compound (multi-word) adverbs in BP and EP''' (*)&lt;br /&gt;
&lt;br /&gt;
This [[media:PortugueseCompoundAdverbs.pdf|document]] presents a [[media:PortugueseCompoundAdverbs.xlsx|list of the 300 most frequent compound (multi-word) adverbs]] that are common to both the Brazilian (BP) and European (EP) varieties of the Portuguese language. The frequency of these adverbs was first determined from the extant lexicon-grammar of 3,500 compound adverbs [4], considering their occurrence in two corpora: the ''CETEMPúblico'' corpus [6] and the ''Corpus Brasileiro'' [7]. The goal was to map the distribution of compound adverbs in corpora from each variety, as described in [4] (in preparation). Then, these most frequent expressions were queried in the ''Portuguese TenTen 2020'' corpus [3], using the Sketch Engine platform [1]. Furthermore, using the ''Good Dictionary Examples'' (GDEX) extraction tool [2], a selection of examples was collated and carefully edited to shorten each sentence as much as possible without changing the overall meaning or the relevant syntactic dependencies involving the adverb. The example sentences were then translated into English using ''ChatGPT'' (version 3.5) [5] and manually revised. Soon, we intend to provide, alongside these examples, the target word that the adverb modifies within each sentence (or, possibly, the entire sentence). Focus adverbs will also be flagged.&lt;br /&gt;
&lt;br /&gt;
To cite this work, please use:&lt;br /&gt;
&lt;br /&gt;
Müller, Izabela, Nuno Mamede, and Jorge Baptista. Hurdles in Parsing Multi-word Adverbs: Examples from Portuguese, ''Proceedings of the 16th International Conference on Computational Processing of Portuguese'' (PROPOR 2024), Universidade de Santiago de Compostela, Galiza, Spain, March 12–15, 2024 (to appear).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code style=&amp;quot;font-family: Consolas, monospace;&amp;quot;&amp;gt;@inproceedings{Muller-et-al-2024-Hurdles, author = {M\&amp;quot;uller, Izabela and Mamede, Nuno and Baptista, Jorge}, title = {{Hurdles in Parsing Multi-word Adverbs: Examples from Portuguese}}, booktitle = {Proceedings of the 16$^{th}$ International Conference on Computational Processing of Portuguese (PROPOR 2024)}, address = {Universidade de Santiago de Compostela, Galiza, Spain, March 12--15, 2024}, year = {2024}}&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[media:PortugueseCompoundAdverbs.pdf|Document]]&lt;br /&gt;
&lt;br /&gt;
[[media:PortugueseCompoundAdverbs.xlsx|Spreadsheet]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;(*) Research for this paper has been partially supported by national funds from Fundação para a Ciência e a Tecnologia, under project reference DOI: 10.54499/UIDB/50021/2020. Izabela Müller has also received support from the University of Algarve, through the Language Sciences PhD program. This work is disseminated under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. [https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en]&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''References'''&lt;br /&gt;
&lt;br /&gt;
[1] Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell. The sketch engine. ''Proceedings of the 11th EURALEX International Congress'', pages 105–116, 2004.&lt;br /&gt;
&lt;br /&gt;
[2] Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, and Pavel Rychlý. GDEX: Automatically finding good dictionary examples in a corpus. In ''Proceedings of the 13th EURALEX International Congress'', volume 1, pages 425–432. Universitat Pompeu Fabra Barcelona, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Adam Kilgarriff, Miloš Jakubíček, Jan Pomikálek, Tony Berber Sardinha, and Pen Whitelock. PtTenTen: A Corpus for Portuguese Lexicography. ''Working with Portuguese Corpora'', pages 111–30, 2014.&lt;br /&gt;
&lt;br /&gt;
[4] Izabela Müller, Jorge Baptista, and Nuno Mamede. Differentiating Brazilian and European Portuguese Multiword Adverbs. Paper presented to the ''39th National Meeting of the Portuguese Linguistics Association'' (APL), Covilhã, Portugal, October 2023.&lt;br /&gt;
&lt;br /&gt;
[5] OpenAI. ChatGPT-3.5: Language Models are Few-Shot Learners. [https://openai.com/blog/chatgpt-3-5/], 2023. Accessed: [05/01/2024].&lt;br /&gt;
&lt;br /&gt;
[6] Paulo Alexandre Rocha and Diana Santos. CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa. In Maria das Graças Volpe Nunes (ed.) ''V Encontro para o processamento computacional da língua portuguesa escrita e falada'' (PROPOR 2000)(Atibaia SP 19-22 de Novembro de 2000), São Paulo, Brasil: ICMC/USP, 2000.&lt;br /&gt;
&lt;br /&gt;
[7] Tony Berber Sardinha. Corpus Brasileiro. Informática, 708:0–1, 2010.&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Compound_Adverbs&amp;diff=151</id>
		<title>Compound Adverbs</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Compound_Adverbs&amp;diff=151"/>
		<updated>2024-01-11T07:50:49Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Testing List of Examples of the 300 most frequent compound (multi-word) adverbs in BP and EP (*)&lt;br /&gt;
&lt;br /&gt;
This document presents a list of the 300 most frequent compound (multi-word) adverbs that are common to both the Brazilian (BP) and European (EP) varieties of the Portuguese language. The frequency of these adverbs was first determined from the extant lexicon-grammar of 3,500 compound adverbs [4], considering their occurrence in two corpora: the CETEMPúblico corpus [6] and the Corpus Brasileiro [7]. The goal was to map the distribution of compound adverbs in corpora from each variety, as described in [4] (in preparation). Then, these most frequent expressions were queried in the Portuguese TenTen 2020 corpus [3], using the Sketch Engine platform [1]. Furthermore, using the Good Dictionary Examples (GDEX) [2], a selection of examples was collated and carefully edited to shorten each sentence as much as possible without changing the overall meaning or the relevant syntactic dependencies involving the adverb. The example sentences were then translated into English using ChatGPT (version 3.5) [5] and manually revised.&lt;br /&gt;
&lt;br /&gt;
In the near future, we intend to provide, alongside these examples, the target word that the adverb modifies within each sentence (or, possibly, the entire sentence). Furthermore, focus adverbs will be flagged.&lt;br /&gt;
&lt;br /&gt;
To cite this work, please use:&lt;br /&gt;
&lt;br /&gt;
Müller, Izabela, Nuno Mamede, and Jorge Baptista. Hurdles in Parsing Multi-word Adverbs: Examples from Portuguese, Proceedings of the 16th International Conference on Computational Processing of Portuguese (PROPOR 2024), Universidade de Santiago de Compostela, Galiza, Spain, March 12–15, 2024 (to appear).&lt;br /&gt;
&lt;br /&gt;
@inproceedings{Muller-et-al-2024-Hurdles, author = {M\&amp;quot;uller, Izabela and Mamede, Nuno and Baptista, Jorge}, title = {{Hurdles in Parsing Multi-word Adverbs: Examples from Portuguese}}, booktitle = {Proceedings of the 16$^{th}$ International Conference on Computational Processing of Portuguese (PROPOR 2024)}, address = {Universidade de Santiago de Compostela, Galiza, Spain, March 12--15, 2024}, year = {2024}}&lt;br /&gt;
&lt;br /&gt;
[[media:PortugueseCompoundAdverbs.pdf|Document]]&lt;br /&gt;
&lt;br /&gt;
[[media:PortugueseCompoundAdverbs.xlsx|Spreadsheet]]&lt;br /&gt;
&lt;br /&gt;
(*) Research for this paper has been partially supported by national funds from Fundação para a Ciência e a Tecnologia, under project reference DOI: 10.54499/UIDB/50021/2020. Izabela Müller has also received support from the University of Algarve, through the Language Sciences PhD program.&lt;br /&gt;
&lt;br /&gt;
This work is disseminated under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. See https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en&lt;br /&gt;
&lt;br /&gt;
References&lt;br /&gt;
&lt;br /&gt;
[1] Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell. The sketch engine. Proceedings of the 11th EURALEX International Congress, pages 105–116, 2004.&lt;br /&gt;
&lt;br /&gt;
[2] Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, and Pavel Rychlý. GDEX: Automatically finding good dictionary examples in a corpus. In Proceedings of the XIII EURALEX international congress, volume 1, pages 425–432. Universitat Pompeu Fabra Barcelona, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Adam Kilgarriff, Miloš Jakubíček, Jan Pomikálek, Tony Berber Sardinha, and Pen Whitelock. PtTenTen: A Corpus for Portuguese Lexicography. Working with Portuguese Corpora, pages 111–30, 2014.&lt;br /&gt;
&lt;br /&gt;
[4] Izabela Müller, Jorge Baptista, and Nuno Mamede. Differentiating Brazilian and European Portuguese Multiword Adverbs. Paper presented to the 39th National Meeting of the Portuguese Linguistics Association (APL), Covilhã, Portugal, October 2023.&lt;br /&gt;
&lt;br /&gt;
[5] OpenAI. ChatGPT-3.5: Language Models are Few-Shot Learners. https://openai.com/blog/chatgpt-3-5/, 2023. Accessed: [05/01/2024].&lt;br /&gt;
&lt;br /&gt;
[6] Paulo Alexandre Rocha and Diana Santos. CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa. In Maria das Graças Volpe Nunes (ed.) V Encontro para o processamento computacional da língua portuguesa escrita e falada (PROPOR 2000) (Atibaia SP 19-22 de Novembro de 2000), São Paulo: ICMC/USP, 2000.&lt;br /&gt;
&lt;br /&gt;
[7] Tony Berber Sardinha. Corpus Brasileiro. Informática, 708:0–1, 2010.&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
	<entry>
		<id>https://string.hlt.inesc-id.pt/w/index.php?title=Compound_Adverbs&amp;diff=150</id>
		<title>Compound Adverbs</title>
		<link rel="alternate" type="text/html" href="https://string.hlt.inesc-id.pt/w/index.php?title=Compound_Adverbs&amp;diff=150"/>
		<updated>2024-01-11T07:49:01Z</updated>

		<summary type="html">&lt;p&gt;Jorge.Baptista: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Testing List of Examples of the 300 most frequent compound (multi-word) adverbs in BP and EP (*)&lt;br /&gt;
&lt;br /&gt;
This document presents a list of the 300 most frequent compound (multi-word) adverbs that are common to both the Brazilian (BP) and European (EP) varieties of the Portuguese language. The frequency of these adverbs was first determined from the extant lexicon-grammar of 3,500 compound adverbs [4], considering their occurrence in two corpora: the CETEMPúblico corpus [6] and the Corpus Brasileiro [7]. The goal was to map the distribution of compound adverbs in corpora from each variety, as described in [4] (in preparation). Then, these most frequent expressions were queried in the Portuguese TenTen 2020 corpus [3], using the Sketch Engine platform [1]. Furthermore, using the Good Dictionary Examples (GDEX) [2], a selection of examples was collated and carefully edited to shorten each sentence as much as possible without changing the overall meaning or the relevant syntactic dependencies involving the adverb. The example sentences were then translated into English using ChatGPT (version 3.5) [5] and manually revised.&lt;br /&gt;
&lt;br /&gt;
In the near future, we intend to provide, alongside these examples, the target word that the adverb modifies within each sentence (or, possibly, the entire sentence). Furthermore, focus adverbs will be flagged.&lt;br /&gt;
&lt;br /&gt;
To cite this work, please use:&lt;br /&gt;
&lt;br /&gt;
Müller, Izabela, Nuno Mamede, and Jorge Baptista. Hurdles in Parsing Multi-word Adverbs: Examples from Portuguese, Proceedings of the 16th International Conference on Computational Processing of Portuguese (PROPOR 2024), Universidade de Santiago de Compostela, Galiza, Spain, March 12–15, 2024 (to appear).&lt;br /&gt;
&lt;br /&gt;
@inproceedings{Muller-et-al-2024-Hurdles, author = {M\&amp;quot;uller, Izabela and Mamede, Nuno and Baptista, Jorge}, title = {{Hurdles in Parsing Multi-word Adverbs: Examples from Portuguese}}, booktitle = {Proceedings of the 16$^{th}$ International Conference on Computational Processing of Portuguese (PROPOR 2024)}, address = {Universidade de Santiago de Compostela, Galiza, Spain, March 12--15, 2024}, year = {2024}}&lt;br /&gt;
&lt;br /&gt;
[[media:PortugueseCompoundAdverbs.pdf|Document]]&lt;br /&gt;
&lt;br /&gt;
[[media:PortugueseCompoundAdverbs.xlsx|Spreadsheet]]&lt;br /&gt;
&lt;br /&gt;
(*) Research for this paper has been partially supported by national funds from Fundação para a Ciência e a Tecnologia, under project reference DOI: 10.54499/UIDB/50021/2020. Izabela Müller has also received support from the University of Algarve, through the Language Sciences PhD program.&lt;br /&gt;
&lt;br /&gt;
This work is disseminated under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. See https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en&lt;br /&gt;
&lt;br /&gt;
References&lt;br /&gt;
&lt;br /&gt;
[1] Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell. The sketch engine. Proceedings of the 11th EURALEX International Congress, pages 105–116, 2004.&lt;br /&gt;
&lt;br /&gt;
[2] Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, and Pavel Rychlý. GDEX: Automatically finding good dictionary examples in a corpus. In Proceedings of the XIII EURALEX international congress, volume 1, pages 425–432. Universitat Pompeu Fabra Barcelona, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Adam Kilgarriff, Miloš Jakubíček, Jan Pomikálek, Tony Berber Sardinha, and Pen Whitelock. PtTenTen: A Corpus for Portuguese Lexicography. Working with Portuguese Corpora, pages 111–30, 2014.&lt;br /&gt;
&lt;br /&gt;
[4] Izabela Müller, Jorge Baptista, and Nuno Mamede. Differentiating Brazilian and European Portuguese Multiword Adverbs. Paper presented to the 39th National Meeting of the Portuguese Linguistics Association (APL), Covilhã, Portugal, October 2023.&lt;br /&gt;
&lt;br /&gt;
[5] OpenAI. ChatGPT-3.5: Language Models are Few-Shot Learners. https://openai.com/blog/chatgpt-3-5/, 2023. Accessed: [05/01/2024].&lt;br /&gt;
&lt;br /&gt;
[6] Paulo Alexandre Rocha and Diana Santos. CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa. In Maria das Graças Volpe Nunes (ed.) V Encontro para o processamento computacional da língua portuguesa escrita e falada (PROPOR 2000) (Atibaia SP 19-22 de Novembro de 2000), São Paulo: ICMC/USP, 2000.&lt;br /&gt;
&lt;br /&gt;
[7] Tony Berber Sardinha. Corpus Brasileiro. Informática, 708:0–1, 2010.&lt;/div&gt;</summary>
		<author><name>Jorge.Baptista</name></author>
	</entry>
</feed>