Linguateca's Infrastructure for Portuguese and How It Allows the Detailed Study of Language Varieties

In this paper I briefly present Linguateca, an infrastructure project for Portuguese which is ten years old, and show how it offers several possibilities for studying grammatical and semantic differences between varieties of the language. After a short history of Portuguese corpus linguistics, presenting the main projects, I sketch some ideas for parallel corpora as started in CorTrad (Tagnin et al., 2009). I will use three different kinds of examples: those related to known differences between variants, in both grammar and lexis; those related to diachronic differences, in that respect describing in detail Silva's (2008, in press) model of Quantitative Lexicology and Variational Linguistics in CONDIVport; and those that are in a way corpus-driven and for which novel functionalities of AC/DC have been devised, namely the comparison of two search expressions and the pattern database.

The paper will be structured as follows:
1. a short introduction to Linguateca, for an international audience instead of a Portuguese-speaking one (Santos, 2009)
2. some history of Portuguese corpora and corpus linguistics, surveying the main projects internationally that provide similar data or systems
3. a presentation in a nutshell of three ongoing Linguateca projects that deal with syntactic analysis of running text (AC/DC, Floresta and CorTrad)
4. some examples of variety studies with the above projects
   a. how to go about studying known differences
   b. a model of convergence and divergence of national varieties
   c. new functionalities for corpus-based discovery

References
Susana Afonso, Eckhard Bick, Renato Haber & Diana Santos. 2002. "Floresta sintá(c)tica: a treebank for Portuguese".
Eckhard Bick. 2004. "Looking at the Floresta Sintá(c)tica with a CorpusEye: A user-friendly cross-language search interface".

Diana Santos

After the discussion, and given that several projects had been started (catalogue, publications catalogue, and some corpus services), one more year was granted, which prepared the ground for what later became Linguateca.
Linguateca was conceived as a three-axed initiative to foster R&D in the computational processing of the Portuguese language, with relevant work on (i) information dissemination, (ii) resource creation, and (iii) organization of evaluation initiatives.
[2] Portuguese text corpora
Assuming that an international audience is probably generally unaware of what has been done in Portuguese corpus processing, I will attempt here a short presentation of the field, with special emphasis on what is offered by Linguateca.

A brief history
As far as I know, corpus compilation for Portuguese started during the 1960s with the Português Fundamental (Bacelar do Nascimento et al., 1984, 1987), a project shaped after and inspired by the Français Fondamental (Gougenheim et al., 1964). Strict criteria for documenting authentic usage in oral contexts all over the country were used, and a significant number of documents of spoken Portuguese (from 1971 to 1974) were recorded, transcribed and analysed at the Centro de Linguística da Universidade de Lisboa; see Bacelar do Nascimento (2001). The work of this team has continued ever since with the compilation, i.a., of the large Corpus de Referência do Português Contemporâneo, CRPC (Bacelar do Nascimento, 2000), as can also be appreciated in the recent papers on the comparison of African varieties of Portuguese (Bacelar do Nascimento et al., 2008a,b).
Several degrees of latitude and longitude further, the NURC project (Callou, 1999) was taking place in Brazil, aiming to describe the oral and educated language in five major Brazilian cities (Recife, Salvador, Rio de Janeiro, São Paulo and Porto Alegre), being thus a five-headed project. Started in 1970, it produced different oral corpora and different research lines, as can be better appreciated in the overview by Varejão (2009). In NURC-RJ, comparative oral corpora of the 1970s and 1990s were deployed, and it is currently connected with the project Para uma História do Português do Brasil, PHPB, including also written materials since the sixteenth century. In Recife, the project was extended to address conversation analysis, while in Porto Alegre it merged with the VARSUL project (Menon et al., 2009).
Outside the Portuguese-speaking countries, Brigham Young University (US) researchers were interested in electronically available Portuguese material, having created the Borba-Ramsey corpus [6], a subset of which was later included in the European Corpus Initiative (Thomson et al., 1994) and has since 1999 also been browsable through AC/DC. We can also mention Portext (Maciel, 1997) in France, the English-Norwegian Parallel Corpus in Norway (Oksefjell, 1999) and the VISL project (Bick, 1997) in Denmark as early providers of Portuguese texts searchable on the web. Castilho et al. (1995) mention John Uriagereka from Maryland as having proposed a joint database for Portuguese and Galician as early as 1991. From the same source we also learn that in 1993 there was already a corpus project in Mozambique, led by Perpétua Gonçalves.
For further information and historical overviews on Portuguese corpora, of which the pointers presented are just a small part, since many other corpora have come to light during the last decade, see Bacelar do Nascimento et al.
What I would like to stress here, before introducing the AC/DC project in the next section, is that when it started back in 1998, there were no services on the web that allowed a linguist or an engineer to query a Portuguese corpus. Also, the few corpora available for download had very different formatting, encoding, and conceptual organization, so that their content was hard to compare and required a lot of processing to be used simultaneously, as explained in Santos (1999b) as the initial motivation for AC/DC.

[2.2] The AC/DC cluster
As devised in 1998-1999, AC/DC had as its main purpose to make a large number of corpus resources available on the web with a unified and simple interface that allowed people to interact with corpora without requiring physical access to institutions or software installation (at that time, there was no such thing for Portuguese). Later on we also considered it Linguateca's task to create resources that were lacking, such as a large newspaper text corpus, CETEMPúblico, which was also included in the AC/DC service.

[6] Named after the corpus compilers, Francisco Borba and Myriam Ramsey.

OSLa volume X(Y), 20ZZ
As a service to the (Portuguese-language processing) community, every corpus owner or developer could make use of AC/DC to serve their corpora, and we have in fact tried to contact everyone and make the offer explicit, for modern Portuguese. In some cases, however, the offer was turned down (or simply ignored), for reasons that ranged from copyright problems to the desire of particular groups to develop their own solutions. We note, however, that no requirement of exclusivity was ever made by Linguateca: on the contrary, our own corpora, notably CETEMPúblico (Santos & Rocha, 2001), were also distributed by the Linguistic Data Consortium (LDC) and by Mark Davies for some time. Likewise, one of the most used corpora of Portuguese, the NILC corpus, was made accessible through AC/DC, although many other solutions to make it available were created by NILC as well (Aluísio et al., 2004).
Other related (resource) services provided by Linguateca were then developed as, in a way, an outgrowth of the basic AC/DC services, and I refer to this extended set as the AC/DC cluster. It includes the Floresta Sintáctica treebank (Afonso et al., 2002; Freitas et al., 2008), the first treebank for Portuguese; COMPARA (Frankenberg-Garcia & Santos, 2003), a large manually revised Portuguese-English fiction parallel corpus; and CorTrad (Tagnin et al., 2009), a parallel (multi-version and multi-genre) corpus. These other resources have further tools, parts, and interfaces, which will not be dealt with here, and were created in cooperation with other researchers and projects.

The initial and obvious data gathering
In order to be able to compare and study varieties and variation, one has to have materials that represent them. So, the first and obvious requirement is to have plenty of material, so that one can take a language bath and immerse oneself in language use in different countries, times and social classes.

Finally, just for the fiction material, Table 3 presents the distribution per decade in the last two centuries. Since three of the sources concern parallel corpora, let me clarify that only the material in Portuguese is counted (and the dates for the translation concern the publication of the translation, not of the original). For more details see the corresponding project pages. Note also that literary text of which the exact sources are not known (such as those included in some multi-genre corpora in AC/DC) is not included.
In addition to the textual material, special sentence separation and tokenizer modules for Portuguese were developed in AC/DC, and all data were parsed by PALAVRAS (Bick, 2000), offering lemma, part of speech, morphological information (such as tense form, gender, number, pronoun case, and diminutive, augmentative and superlative degree) and syntactical function (in a ver- Eckhard Bick, including also some discourse-related features such as topic and focus and some semantic information). As discussed in Inácio & Santos (2006), some of the material in the AC/DC cluster has been manually revised, both in its text and in its annotation, but most of it has not (after all, AC/DC encompasses more than 280 million words, or ca. 16 million different sentences).
In addition to having developed our own AC/DC format as a transduction of PALAVRAS output format, we have also started to add semantic information in some domains, using a simple lexicon-driven approach followed by human rule writing for correction and improvement of both precision and recall, as described in Silva & Santos (2009); Santos & Mota (2010).
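The two-stage approach described above (a lexicon lookup followed by hand-written correction rules) can be pictured with the following minimal Python sketch. It is not the AC/DC implementation: the mini-lexicon, the two rules and the names `COLOUR_LEXICON`, `drop_fruit_reading`, `add_cor_de` and `annotate` are all invented for illustration. One rule stands for a precision fix (removing a wrong tag) and the other for a recall fix (adding a missing one).

```python
# Hypothetical mini-lexicon: lemma -> semantic domain (not the real AC/DC lexicon).
COLOUR_LEXICON = {"azul": "colour", "verde": "colour", "laranja": "colour"}


def drop_fruit_reading(tokens, i):
    """Precision rule (invented): 'laranja' in 'sumo de laranja' is the fruit."""
    if (tokens[i]["lemma"] == "laranja" and i >= 2
            and tokens[i - 2]["lemma"] == "sumo"
            and tokens[i - 1]["lemma"] == "de"):
        return None
    return tokens[i]["sem"]


def add_cor_de(tokens, i):
    """Recall rule (invented): X in 'cor de X' names a colour even if unlisted."""
    if i >= 2 and tokens[i - 2]["lemma"] == "cor" and tokens[i - 1]["lemma"] == "de":
        return "colour"
    return tokens[i]["sem"]


RULES = [drop_fruit_reading, add_cor_de]


def annotate(lemmas):
    """Lexicon lookup first, then every correction rule in order."""
    tokens = [{"lemma": lemma, "sem": COLOUR_LEXICON.get(lemma)} for lemma in lemmas]
    for i in range(len(tokens)):
        for rule in RULES:
            tokens[i]["sem"] = rule(tokens, i)
    return [tok["sem"] for tok in tokens]


print(annotate(["sumo", "de", "laranja"]))  # [None, None, None]
print(annotate(["cor", "de", "vinho"]))     # [None, None, 'colour']
```

Keeping the rules as separate functions means each human correction can be added or withdrawn without touching the lexicon, which is one plausible way to iterate on precision and recall independently.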
The distribution of the colour domain can be appreciated in Figure 1, where the density of both colour tokens and colour types is shown. As far as I know, this is the largest semantically annotated corpus with human revision currently available (although the colour annotation of the largest corpora has not yet been fully revised).

[Figure 1: plot of colour density (axis from 0 to 100) for each corpus.] Colour density is defined as 10,000 times the ratio of colour words (tokens or types) to all words (tokens or types) in the corpus.
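As a worked illustration of the colour density measure (10,000 times the ratio of colour words to all words, computed over tokens or over types), here is a small Python sketch. The function names and the toy sentence are assumptions made for the example; they are not part of AC/DC.

```python
def colour_density(colour_count, total_count):
    """10,000 times the ratio of colour words to all words."""
    if total_count == 0:
        raise ValueError("empty corpus")
    return 10_000 * colour_count / total_count


def densities(tokens, colour_words):
    """Return (token density, type density) for a list of word forms."""
    token_density = colour_density(
        sum(1 for tok in tokens if tok in colour_words), len(tokens))
    types = set(tokens)  # each distinct form counted once
    type_density = colour_density(
        sum(1 for typ in types if typ in colour_words), len(types))
    return token_density, type_density


# Invented toy corpus: 7 tokens, 5 types, one colour type occurring twice.
toy = ["o", "céu", "azul", "e", "o", "mar", "azul"]
token_d, type_d = densities(toy, {"azul"})
print(type_d)  # 2000.0
```

In this toy case the token density is 10,000 × 2/7 and the type density is 10,000 × 1/5, which shows why the two measures can diverge for the same corpus.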
[3.2] Support for formal variational linguistics
In addition to providing an electronic bookshelf, or a web distribution window, to any group or project that is willing to have us make their corpus or resource available, the AC/DC project may also develop specific facilities for the resources it (re)distributes, if they display new features.
This happened with the CONDIVport corpus: we started by simply making it available through the web as a regular AC/DC member, but soon we understood the interest in providing support for more complex models of (on-line) linguistic research. Given that CONDIVport was compiled to study the convergence and divergence of national varieties of Portuguese, under the framework initially developed by the Quantitative Lexicology and Variational Linguistics group in Leuven [8], it had, in addition to three specific themes (soccer, fashion and health), texts from three different time periods, from Brazil and from Portugal. In addition, as an integral part of the methodology, a list of terms in the first two of these themes had also been compiled.

[8] See http://wwwling.arts.kuleuven.ac.be/qlvl/
For foundations and critical discussions of the methodology, I refer the reader to Geeraerts et al. (1999); Geeraerts & Grondelaers (1999); Speelman et al. (2003); Soares da Silva (2010). Here, I will only provide concrete examples of how the process goes. First, one gathers a set of formal onomasiological profiles for key concepts in a given area. Let us take clothing as an example: key concepts may be blusa (roughly 'blouse') or saia (roughly 'skirt'). Their onomasiological profile is a set of lexical items which the linguist classifies (in context) as belonging to this class. So, as an example, the casaco F (female overcoat) profile has been found to be: blazer, blêizer, casaco, casaquinho, casaquinha, manteau, mantô, paletó, paletot (Soares da Silva, 2008a, page 66).
[...]ual classifications, search for the classes and the specific contexts of occurrence, and even provide feedback or corrections if needed. A similar point has been made in Santos & Oksefjell (1999) in what concerns parallel corpora.
This allows for both a wider dissemination of the original research, and a better quality control through communication with one's peers.Both aims are included in Linguateca's mission for the computational processing (and study) of the Portuguese language.
It is thus currently possible to ask, in addition to the occurrence or distribution of the forms included in the profiles, for an entire profile, or for the profile distributions themselves: that is, how many instances of the members of the casaco profile appear by date/decade or by variety.
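What asking for a profile distribution amounts to can be sketched in a few lines of Python. The profile members are those listed for casaco F in the text; everything else (the record layout with `form`, `variety` and `decade` fields, the function names, and the toy hits) is invented for illustration and does not reflect how AC/DC stores or serves its data.

```python
from collections import Counter

# Members of the casaco F profile as listed in the text.
CASACO_F = {"blazer", "blêizer", "casaco", "casaquinho", "casaquinha",
            "manteau", "mantô", "paletó", "paletot"}


def profile_distribution(hits, profile, key):
    """Count all hits belonging to the profile, grouped by `key`."""
    return Counter(hit[key] for hit in hits if hit["form"] in profile)


def member_distribution(hits, profile, key):
    """Count each profile member separately within each group."""
    return Counter((hit[key], hit["form"]) for hit in hits if hit["form"] in profile)


# Invented toy hits, only to show the shape of the input.
HITS = [
    {"form": "casaco", "variety": "PT", "decade": 1950},
    {"form": "paletó", "variety": "BR", "decade": 1950},
    {"form": "blazer", "variety": "BR", "decade": 2000},
    {"form": "saia", "variety": "PT", "decade": 2000},  # not in the profile
]

print(profile_distribution(HITS, CASACO_F, "variety"))
print(member_distribution(HITS, CASACO_F, "decade"))
```

The first function answers "how often does the concept appear per variety or decade", the second "which member of the profile was chosen", which is the kind of information a variational comparison of profiles builds on.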
We have also used the initial profiles compiled in CONDIVport as a seed for compiling larger sets of fashion-related lexical items, thus colouring the different corpora also with clothing information.

Comparison of two search expressions, inspired by the CorpusEye search system (Bick, 2004), to compare explicitly two distributions;
Reuse of a pattern database, inspired by the search system of Davies & Ferreira (2006-) and based on the capabilities of the underlying CWB system (Schulze, 1996; Evert, 2009).

These will pave the way for yet further developments in the AC/DC cluster, some of which can be mentioned here as natural extensions, namely (i) the possibility to include (tailored) synonym search as an option, following e.g. Christ (1998); and (ii) search by subject matter through concept nets.

Illustration sentences
Although the wealth of real, in-context examples is generally accepted as one of the basic advantages of corpora, as opposed to laboriously crafted ones (by a lexicographer or textbook author), it is not easy to come up automatically with good examples from a corpus, as pointed out by Kilgarriff et al. (2008).

[10] Whether the use of semantic domains and ontology-based classifications is also useful for variation analysis is something that will have to be ascertained empirically.
[11] See http://www.linguateca.pt/PAPEL/

Even harder did we find the task of illustrating, or validating, semantic relations between words in context, as we wished to do for PAPEL, whose relations between words (and not word senses) had been produced automatically and were thus in need of human validation (Gonçalo Oliveira et al., 2009, 2010).
We have thus developed an AC/DC-based service to help us achieve two related purposes: (i) find out the best patterns to validate and/or discover semantic relations in text, and (ii) develop clearer insights into the semantic fabric of Portuguese, while at the same time improving a public-domain semantic resource. As is common practice in Linguateca, we offer this as a service to the community [12], which means that everyone can use it to develop or evaluate their own resources.

Comparison of two phenomena
Although one could already perform a comparison by doing two (or more) searches in AC/DC in a row and then comparing the results, this capability provides an easier way by aligning the results on two sides of the same screen.
Since we have been doing similar things within DISPARA for a long time now (cf. the quantitative wrap-up function in Santos (2002)), it seemed appropriate to offer this in a monolingual corpus context as well.
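The idea of aligning the results of two searches can be illustrated with a short Python sketch. It assumes each search has already been reduced to a distribution (a mapping from a category such as variety to a count); the function names and the numbers are invented, and the actual AC/DC screen layout is not reproduced here.

```python
def compare(dist_a, dist_b):
    """Align two distributions on the union of their keys; missing keys count as 0."""
    keys = sorted(set(dist_a) | set(dist_b))
    return [(key, dist_a.get(key, 0), dist_b.get(key, 0)) for key in keys]


def side_by_side(rows, label_a, label_b):
    """Render the aligned rows as a plain-text table, one search per column."""
    lines = [f"{'':<10}{label_a:>12}{label_b:>12}"]
    for key, count_a, count_b in rows:
        lines.append(f"{key:<10}{count_a:>12}{count_b:>12}")
    return "\n".join(lines)


# Invented counts, only to show the alignment.
rows = compare({"BR": 3, "PT": 1}, {"PT": 2})
print(side_by_side(rows, "search A", "search B"))
```

Filling absent categories with 0 rather than dropping them keeps the two columns comparable row by row, which is the point of showing both searches on the same screen.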

Reuse of a pattern database
Again, this is not new, in the sense that other services offered by Linguateca, namely Águia (Santos, 2003), already made use of a set of patterns to query complex treebank structures in the Floresta project. This feature, however, had never been integrated in the main service interface, which relied mainly on direct email answers to users asking us how to produce complex queries. Now we have created an option for loading previous queries/commands into the search space, which, although possibly slowing down the corpus system, will also provide higher expressivity. It remains to be seen how much of this will in fact be reused by power users of the AC/DC services.
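A pattern database of this kind can be pictured as a store of named, parameterised queries that are expanded before being sent to the corpus system. The Python sketch below is only an assumption about how such a store might look: the class `PatternDatabase` and the pattern name are hypothetical, and the query strings merely follow the bracketed token notation of CQP, the query language of CWB, with attribute names (`lemma`, `pos`) that may differ from those actually used in AC/DC.

```python
from string import Template


class PatternDatabase:
    """Hypothetical store of reusable, parameterised search patterns."""

    def __init__(self):
        self._patterns = {}

    def save(self, name, query):
        """Register a query; `$name` placeholders mark its parameters."""
        self._patterns[name] = Template(query)

    def load(self, name, **params):
        """Return the stored query with its placeholders filled in."""
        return self._patterns[name].substitute(**params)

    def names(self):
        """List the available pattern names in alphabetical order."""
        return sorted(self._patterns)


db = PatternDatabase()
db.save("noun_plus_adjective", '[lemma="$noun"] [pos="ADJ"]')
print(db.load("noun_plus_adjective", noun="casaco"))
# [lemma="casaco"] [pos="ADJ"]
```

`string.Template` is used instead of `str.format` because CQP-style queries can themselves contain braces (for repetition), which would clash with format fields.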

[3] Studying variation and language varieties with the AC/DC cluster
I start with a presentation of the available material, then present the browsing of CONDIVport, which was compiled for variational analysis, and finally present new functionalities for corpus-based discovery that are currently under test in the AC/DC project.

Figure 1: The semantic field of colour in AC/DC as of July 2010: circles describe types, triangles tokens. Colour density is defined as 10,000 times the ratio of colour words (tokens or types) to all words (tokens or types) in the corpus.

the AC/DC interface
Several capabilities newly added to the AC/DC interface deserve mention here:
Human validation of corpus illustration sentences for semantic relation evaluation (the VARRA service, developed in connection with yet another subproject in Linguateca, PAPEL [11], whose goal was to create a free lexical ontology for Portuguese based on an existing general dictionary);

Acknowledgements
Linguateca has throughout the years been jointly funded by the Portuguese Government, the European Union (FEDER and FSE), under contract ref. POSC/339/1.3/C/NAC, UMIC and FCCN. I would like to thank the remaining members of the AC/DC team, Paulo Rocha, Luís Costa, Rosário Silva and Cristina Mota, for the joint work, the corpus owners for letting us grant access to their corpora on the web, and all users who have requested features or suggested improvements. Eckhard Bick's long-standing collaboration with his PALAVRAS parser has been the single most important factor for AC/DC's success among its users. Thanks also to Tony Berber Sardinha and Violeta Quental for relevant information concerning the history of Brazilian corpora, to Fernanda Bacelar do Nascimento for relevant references, to Augusto Soares da Silva for the introduction to quantitative lexicology methodology, to Cristina Mota for pertinent comments on a draft version and, last but not least, to the VARRA team (Cláudia Freitas, Hugo Gonçalo Oliveira and Violeta Quental).

[12] See http://www.linguateca.pt/acesso/varra.php

References
Afonso, Susana, Eckhard Bick, Renato Haber & Diana Santos. 2002. Floresta Sintá(c)tica: a treebank for Portuguese. In Manuel Gonzalez Rodrigues & Carmen Paz Suarez Araujo (eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), 1698

Table 2 presents the material in terms of language variety.