corpus-driven glossaries in translator training courses

Corpus Linguistics has proved to be a valuable resource for extracting candidate terms and phraseological units from specialized corpora (Bowker & Pearson 2002). It is in fact a relatively new approach, since most glossaries are generally based on similar, previously existing material. Although there are many glossaries on the market, few have been compiled to meet the needs of translators, whose main task in technical translation is to produce a natural and fluent text, whether in their native language or in a foreign language. For this reason, a glossary that consists simply of a list of terms and their equivalents will not be satisfactory for the translator. As text producers, translators need to know how a word is used, that is, which words it combines with (Firth 1957; Sinclair 1991). Moreover, technical language includes multi-word terms as well as even longer phraseological units. Glossary compilation was addressed in the Specialization Course in Translation at the University of São Paulo as a methodology for improving students' specialized knowledge. After some experience, it was found that the approach matched what Shreve (2006) called "deliberate practice", a methodology that contributes to the development of students' research and translation skills, leading to the acquisition of specialized knowledge and techniques (Maia 1997, 2002; Tagnin 2002) which learners will be able to draw on in any area in which they may come to work. This article describes how this was carried out on several occasions, that is, by means of a corpus-based approach, and illustrates the steps followed with examples from several projects.

[1] introduction
Corpus Linguistics, an empirical approach to language studies (McEnery & Hardie 2011; McEnery & Wilson 1997), has proved to be a valuable tool for the extraction of candidates for technical terms and phraseological units (Bowker & Pearson 2002). We understand terms as words or multiword units characteristic of specialized contexts. So, for instance, cup is a word that belongs to the general vocabulary of the English language. However, in a culinary context, cup is considered a term as it refers to a measurement, not to the utensil proper. In the same vein, we consider longer phraseological units, even without a term, as terminological units when they are typical of a certain domain, for example, roll out the pastry on a lightly floured working surface.
Even though corpus methodology has been used in various academic studies (Teixeira 2008; Perrotti-Garcia & Rebechi 2007; Tagnin & Bevilacqua 2013), many current glossaries, mostly commercial ones, are still based on existing ones, either editing previous editions or adding to them. In contrast, corpus-driven terminology derives all its data from a specialized corpus compiled for that specific purpose.
Although there are many glossaries available on the market, few meet the needs of technical translators (Teixeira 2008), who are expected to produce a natural and fluent text, either in their mother tongue or in a foreign language, depending on the direction they are working in. For this reason, a simple list of terms and their equivalents will not suffice. A text producer needs to know how a word is used, that is, the words it combines with (Firth 1957; Sinclair 1991). In addition, technical language may have multi-word terms and even longer phraseological units which may also enjoy the status of terms and as such should feature as stand-alone entries in reference works. This would be the case, for instance, of freshly ground black pepper in a glossary of culinary terms.
The corpus-driven compilation of glossaries was one of the main foci of Technical Translation, one of the disciplines of the two-year Translation diploma courses 1 at the University of São Paulo. Students were required to participate in projects aimed at building specialized corpora and extracting relevant terminology. To that end they were introduced to the methodology and tools used by Corpus Linguistics. Students thus mastered corpus-related skills, such as defining criteria to build a reliable corpus, investigating a corpus with specific computational programs, designing criteria to select examples to include in a glossary entry, developing techniques to identify equivalent terms in two different languages and, finally, building appropriate glossary entries. This methodology generally produced good work, some of which has already been published (Perrotti-Garcia & Rebechi 2007; Teixeira & Tagnin 2008; Tagnin 2013).
From the perspective of translator training, this "deliberate practice" (Shreve 2006), that is, a well-defined motivating task with an adequate level of difficulty so as to promote students' improvement, and with appropriate feedback from the teacher, certainly contributed to the development of research and translation skills, leading to specialized knowledge which students would be able to put to use in any area they might come to work in.
This paper reports on the decisions made regarding what to teach in a translator training course and describes how Corpus Linguistics can be used for terminological work.
[2] corpora in the translation classroom
Corpora have been used in translator training courses for over two decades (Maia 1997, 2002; Tagnin 2002). In Brazil this use was introduced as a methodology for the compilation of technical glossaries in the Specialization Course in Translation at the University of São Paulo in 2001. During a course on Technical Translation students were divided into thematic groups and instructed to build an English-Portuguese comparable corpus in a specialized area, that is, a corpus with original texts in both languages. They were then to extract the technical terms, identify equivalents and collect examples in both languages. Glossaries resulting from this activity were made available at the course's site 2 under "Trabalhos de alunos" - "Glossário" (Student works - Glossary). In 2005, students were asked to build a bilingual glossary along the lines of a series of technical glossaries brought out by a local publisher. Each group could choose one field of study, and the best works would be submitted to the publisher for possible publication. In 2008, as part of a similar course 3, it was suggested that the whole class engage in one collective project for the construction of a Photography glossary. This project is discussed in detail in Section [4].
[2.1] What to teach: translators' needs
Before deciding on the format of the glossaries to be produced, it was deemed necessary to determine the translator's terminological needs (Teixeira 2008; Fromm 2008). When one reflects on this, what immediately comes to mind is that a translator needs equivalents, which is actually only partially true. As González-Jover & Sierra (2004) have already pointed out, terminology materials should help translators make decisions that are part of their daily practice. And their daily practice involves much more than just finding an equivalent.
A survey carried out by Fromm (2008) among professional translators on the features of the bilingual dictionaries they use most showed (see Table 1) that the dictionaries translators find most valuable, apart from the ones that present "all of the above", are the ones that provide a translation as well as examples. It is this preference that served as the basis for the template of our entries.
[2.2] What kind of glossary?
Given that translators are, above all, text producers and that their goal in technical translation is to produce a natural text, they need, in addition to equivalents, examples that contextualize a certain term found in the source text as well as information about its textual and linguistic patterns. In other words, they need to know the term's collocations and phraseologies (Tagnin 2002). For terms which do not have equivalents in the target language, translators would need other translation possibilities or even suggestions for adaptation. On such occasions, cultural information may help them to choose adequate substitutions.
Let us illustrate this with an example taken from the area of Cooking. If a translator needs to translate 1 large onion, finely chopped into Portuguese, he/she would find it useful to have a glossary which would specify that the Portuguese cognate for finely (finamente) does not usually occur in this context. Rather, the most natural translation for finely into Portuguese would be the adverb bem (= well), which renders bem picada (*well chopped). Another option would be the diminutive picadinha, with or without the adverb bem. Thus, the glossary would specify that the best translation options are 1 cebola grande, bem picada or 1 cebola grande (bem) picadinha.
In the case of finely grated Parmesan cheese, the glossary should provide the information that the usual translation is simply queijo parmesão (= parmesan cheese), since in Brazil this kind of cheese is customarily finely grated. Thus, the texture is only specified when the cheese should be coarsely grated, which would be ralado grosso in Portuguese. The cultural gap becomes even more evident when the translator encounters the term buttermilk. Although the Portuguese language has a corresponding term, leitelho, it is not used, mainly because this product does not exist in our country. Thus, the glossary could add an explanatory note or even suggest that buttermilk can be replaced by "a mixture of equal parts of milk and plain yogurt" (Teixeira & Tagnin 2008).
However, much of the material available on the market does not meet these needs and is often limited to a mere list of monolexical terms and their equivalents in the target language, without providing examples or other linguistic information that can help the translator to make adequate decisions and create a text in which naturalness (Sinclair 1984) prevails. Thus, as mentioned before, it is necessary to create a model for a glossary that meets the needs of the translator. In this sense, as Krieger & Finatto (2004) have suggested, translators can be instrumental in creating new methodologies for the production of reliable terminological sources of information.
OSLa volume 7(1), 2015
In this paper we claim that a methodology relying on the premises of Corpus Linguistics can provide this much-needed "reliable terminological source of information" for translators.
[3] corpus linguistics
Corpus Linguistics is an empirical approach based on the observation of a large number of texts. These texts, always authentic, constitute a corpus, which can be investigated by means of specific computational programs that produce, among other data, concordance lines (see Figure 1). Concordance lines show the search word with its surrounding co-text, and allow investigators to identify recurrent patterns, terms and phraseological units. Concordance lines can also be sorted alphabetically by the words to the right or to the left of the search word, which makes identifying recurrent patterns even easier by grouping them together. The first example (Figure 1) is a selection of concordance lines for the Portuguese word imagem (= image), taken from the Photography corpus.
The concordance lines in Figure 1 show the recurrence of three collocations: imagem ampliada, imagem captada and imagem capturada, which might indicate that they are candidate terms. Besides, one notices that imagem capturada occurs five times while imagem captada, which has the same meaning, only occurs three times. This seems to indicate that the first one is probably more common and thus a more natural choice. It is important to point out that Corpus Linguistics looks at language as a probabilistic system, that is, it observes which patterns have a higher probability of occurring to the detriment of those that just feature a grammatical possibility of occurrence (Kennedy 1998). Therefore, if a technical translator seeks to produce a natural-sounding text he/she should use the terms that are more likely to occur in the specialized area he/she is working in.
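The concordance view described above, with lines sorted by the words to the right of the search word, can be sketched in a few lines of Python. This is only an illustration of the idea, not the program used in the course; the function names and the three-sentence sample are invented for the example.

```python
import re

def tokenize(text):
    # Lowercased word tokens; \w also matches accented Portuguese letters.
    return re.findall(r"\w+", text.lower())

def concordance(tokens, node, width=4):
    """Return (left, node, right) tuples for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_by_right(lines):
    # Sort alphabetically by the words to the right of the search word.
    return sorted(lines, key=lambda line: line[2])

# Invented sample sentences, not taken from the Photography corpus.
sample = ("A imagem capturada pelo sensor é gravada no cartão. "
          "A imagem ampliada perde nitidez. "
          "Cada imagem capturada pode ser editada.")

for left, node, right in sort_by_right(concordance(tokenize(sample), "imagem", width=3)):
    print(f"{' '.join(left):>25}  {node}  {' '.join(right)}")
```

Sorting on the right-hand context places the two occurrences of imagem capturada next to each other, which is what makes recurrent patterns easy to spot.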
Recurring patterns in the English counterpart of the Photography corpus can be seen in Figure 2. These concordance lines show mainly verbal collocations such as capture an image, copy an image, delete an image, display an image and edit an image. Another method to extract terminological units is by using a list of n-grams (Guinovart & Simões 2009; Maia et al. 2008). These lists show all combinations of two words (bigrams), three words (trigrams) or even longer combinations, depending on how the researcher adjusts the settings of the program being used. Again, however, these lists need to be examined by the researcher in order to decide which combinations are, in fact, terminological units.
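An n-gram list of the kind just described can be approximated as follows. The sketch assumes a very simple tokenizer and an invented sample text; the window also runs across sentence boundaries, which a dedicated tool may handle differently.

```python
from collections import Counter
import re

def ngrams(tokens, n):
    # Slide a window of n tokens over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, n):
    tokens = re.findall(r"\w+", text.lower())
    return Counter(ngrams(tokens, n))

# Invented sample text for illustration only.
text = ("Capture an image, then edit an image. "
        "To delete an image, select it and press delete.")

bigrams = ngram_counts(text, 2)
trigrams = ngram_counts(text, 3)
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

The counts only rank candidates by frequency. Deciding which combinations are terminological units remains the researcher's task.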
Corpus Linguistics can be used in two ways to compile glossaries: as a methodology or as an approach. In the first case, we refer to it as corpus-based Terminology; in the second, as corpus-driven Terminology. It is the latter that was used in our courses.
[3.1] Corpus-based Terminology
A terminological reference source is said to be corpus-based when texts are selected because they offer a variety of defining contexts, which will be used to build the definitions for its entries. Besides, work is usually based on a pre-selected list of nouns (and only more recently of verbs) derived from an ontology, which shows the structure of the area being addressed and all of its subareas. This allows the terminologist to decide which areas to address in the glossary to be built. Once the list has been compiled, definitions and examples are extracted from the corpus built for that purpose. Basically, only pre-established terms and phraseological units which contain these terms will make up the entries of such a reference work. In short, the corpus is seen as a repository of definitions and examples (Teixeira 2008).

[3.2] Corpus-driven Terminology
In contrast, Corpus Linguistics is used as an approach when all entries that will make up the glossary are extracted directly from the corpus. In other words, only terms present in the texts that make up the corpus will be included in the glossary. Also, corpora are composed of the texts most commonly written or referred to by specialists, such as articles published in journals, textbooks, manuals, articles in newspapers, etc. The type of texts to be collected will depend on the area being addressed but they are expected to feature the actual and updated terminology used in that area. Whether these texts have defining contexts or not is not relevant.

[Table 2: wordlist with columns N, Word, Freq; table body not recovered]

stella esther ortweiler tagnin
In corpus-driven terminology, the first step is to extract a list of all the words in the corpus with their frequencies (Table 2).
It is interesting to notice that most highly frequent words are grammatical words; the first content word, camera, only appears in position seven, which gives an indication of the field the corpus covers.
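A frequency wordlist of this kind, with rank, word and frequency columns, can be sketched as below. The sample sentence is invented and the tokenization is deliberately naive, so the output is not comparable to the figures in Table 2.

```python
from collections import Counter
import re

def wordlist(text):
    """Return (rank, word, frequency) rows, most frequent first."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    # Sort by descending frequency, then alphabetically so that ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

# Invented sample sentence for illustration only.
sample = "The camera and the lens: the camera records the image."

for rank, word, freq in wordlist(sample)[:3]:
    print(rank, word, freq)
```

Even in this tiny sample a grammatical word heads the list and the first content word follows it, mirroring the observation made about the Photography corpus.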
In order to establish which of these words are typical of the area being addressed, a wordlist is usually compared to another wordlist extracted from a corpus of general language, usually three to five times larger than the study corpus, which is known as a "reference corpus" or "comparison corpus." This comparison yields a list of keywords (see Table 3), which are the words that show a statistically significant frequency in the specialized corpus in relation to the reference corpus. In other words, these lexical items are relatively more frequent in the study corpus than in the reference corpus. For this reason, they are regarded as potential "candidate terms."
This list, entirely extracted from the corpus, will be used as the starting point for the selection of candidate terms. Each of these candidates is examined in its context in order to identify possible collocations and longer phraseological units. This is done by running concordance lines for the search word and then looking for recurrent patterns, which can be seen in Figure 3 for the word camera. The lines from Figure 3 show various collocations and phraseological units such as CCD camera, DCS camera, DCS Camera Manager, digital camera, (Kodak) Ni-MH rechargeable digital camera battery, Kodak Li-Ion rechargeable digital camera battery, (Kodak) EasyShare camera dock and film camera. In a corpus-driven terminological reference source each one of these recurrent combinations will be listed along with relevant examples extracted from the concordance lines.
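The comparison between a study corpus and a reference corpus can be illustrated with a keyness calculation. The text does not state which statistic was used, so the sketch below assumes Dunning's log-likelihood, a measure commonly used for this purpose; the threshold of 3.84 (the 5% critical value for one degree of freedom) and the toy frequency counts are likewise assumptions made for the example.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a study corpus of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study, reference, threshold=3.84):
    """Return (word, score) pairs for words relatively more frequent in `study`."""
    c, d = sum(study.values()), sum(reference.values())
    scored = []
    for word, a in study.items():
        b = reference.get(word, 0)
        score = log_likelihood(a, b, c, d)
        # Keep positive keywords only: higher relative frequency in the study corpus.
        if score >= threshold and a / c > b / d:
            scored.append((word, score))
    return sorted(scored, key=lambda ws: -ws[1])

# Toy counts, invented for illustration.
study = Counter({"the": 60, "camera": 25, "lens": 15})
reference = Counter({"the": 600, "camera": 2, "house": 398})

for word, score in keywords(study, reference):
    print(f"{word}\t{score:.1f}")
```

In the toy data the has the same relative frequency in both corpora and is therefore not a keyword, whereas camera and lens stand out as candidate terms.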
[4] the photography glossary project
The above sequence of activities was followed on various occasions during Technical Translation courses at the University of São Paulo. The most recent ones took place in 2005 and 2008, as mentioned before. For the sake of illustration, we will concentrate on the 2008 project on Photography, but will resort to other areas from the 2005 project when they provide better examples to illustrate the procedures being discussed.
[4.1] Class procedures
The first step was to establish the subareas that would be addressed in the project. Examining instructional material on Photography, we determined the following six topics to be covered: history of photography 4, light, cameras, studio, storage and digital photography. The class was accordingly divided into six groups, each of which was to build a comparable bilingual English-Portuguese corpus in the area assigned to them. They also had to select a one-page text from their English corpus to be translated into Portuguese by the whole class. Each group would be responsible for discussing their translation with the whole class. Besides, preliminary results for the glossary were also to be presented so that procedures and doubts could be discussed. The stages of the project are described below.

Instruction in Corpus Linguistics
As most of the class had no previous knowledge of Corpus Linguistics, they were introduced to its basic notions in a series of three lectures, with special emphasis on the stages of building a specialized corpus and using linguistic software to investigate it, in this case WordSmith Tools version 5 (Scott 1996), with its suite of tools: WordList, KeyWords and Concord.

Building a corpus
Students were required to build a bilingual comparable corpus with approximately 100,000 words in each language according to the following steps: (i) search for texts on the Internet so as to avoid having to scan them (although most texts were indeed retrieved from the Internet, some groups had to resort to written material and hence scan it); (ii) clean the texts, eliminating figures, tables, charts, illustrations and any other non-linguistic material which the researcher believes will not contribute relevant material 5; (iii) save texts in .txt format; (iv) include a header with metatextual information such as: title of the text, place of publication, date of publication, subarea etc.
The final composition of the five subcorpora compiled by each group is presented in Table 4.
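Steps (ii) to (iv) above can be sketched as a small script. The cleaning rule, the angle-bracket header layout, the file name and the metadata values are all hypothetical choices made for this example; the course materials may have used different conventions.

```python
import re
import tempfile
from pathlib import Path

def clean(text):
    # Drop lines that contain no letters (e.g. rows of figures from a table)
    # and collapse runs of whitespace. Real criteria are the researcher's call.
    kept = [re.sub(r"\s+", " ", line).strip()
            for line in text.splitlines()
            if re.search(r"[^\W\d_]", line)]
    return "\n".join(kept)

def with_header(body, **meta):
    # Hypothetical header layout: one <key: value> line per metadata field.
    header = "\n".join(f"<{key}: {value}>" for key, value in meta.items())
    return header + "\n\n" + body

def count_words(body):
    return len(re.findall(r"\w+", body))

# Invented raw text for illustration only.
raw = "Digital  cameras store images\n12   34   56\non a memory card."
body = clean(raw)
doc = with_header(body, title="Sample text", place="(unknown)",
                  date="2008", subarea="storage")

# Save as plain text (.txt) in UTF-8; a temporary folder keeps the example tidy.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "sample_en_001.txt"
    path.write_text(doc, encoding="utf-8")
    print(path.name, count_words(body), "words")
```

Counting the words of the body only, without the header, gives a running total that can be checked against the target of approximately 100,000 words per language.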

Extracting terms (Wordlist and Keywords)
Once the corpora were built, students generated WordLists for each of their corpora and then compared these lists with similar lists for general language corpora.
This comparison yielded words (keywords) that occurred at a statistically significant higher frequency in the study corpus (see Table 3). These words were considered candidate terms as they were peculiar to the study corpus. In order to confirm whether they were actually terms or not, students ran concordance lines for each of the words to examine their context of occurrence (see Figure 3).
[4] This group was discontinued during the course.
[5] It is true that some tables exhibit terminological material, though not in context. It is up to the researcher, in those cases, to include the tables or not in the corpus.

Extracting patterns
Let us remember that recurrent patterns in concordance lines may be candidate terms. Figure 4 shows some of these patterns for the word photographs.
The Figure 4 concordance lines allow us to identify nominal collocations such as albumen photographs, colo[u]r photographs, digital photographs and family photographs, as well as verbal collocations like clean photographs, display photographs and even longer phraseological units like water-damaged photographs.
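Counting the words that occur immediately to the left and right of a search word is one simple way of surfacing collocations such as those listed above. The following is a minimal sketch with an invented sample text; the function name and the one-word span are assumptions for the example.

```python
from collections import Counter
import re

def collocates(tokens, node, span=1):
    """Count words within `span` positions to the left and right of `node`."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left.update(tokens[max(0, i - span):i])
            right.update(tokens[i + 1:i + 1 + span])
    return left, right

# Invented sample text for illustration only.
text = ("Store family photographs in albums. Digital photographs need backups. "
        "Never display photographs in sunlight. Old family photographs fade.")
# The hyphen is kept inside tokens so that forms like water-damaged stay whole.
tokens = re.findall(r"[\w-]+", text.lower())

left, right = collocates(tokens, "photographs")
print(left.most_common())
print(right.most_common())
```

Left-hand collocates tend to bring out nominal and verbal collocations (family photographs, display photographs), while right-hand collocates show what typically follows the term.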

Extracting relevant context (examples)
Once all relevant terms and phraseologies had been identified, examples were retrieved from the concordance lines to be inserted in the entries.If the concordance line did not show the full context, a double click on it led to the full source text.Part of it is shown below for concordance line 25 in Figure 5.

Identifying equivalents
One way to identify possible equivalents is to compare the lists of keywords in both languages. Figure 6 illustrates this procedure for an English-Portuguese Cooking glossary (Teixeira & Tagnin 2008).
Once a pair is identified, concordance lines should be generated to check whether the selected equivalents occur in similar contexts. When there is no such prima facie (literal) equivalent, the search can be pursued by the word's collocates or context (Tagnin 2007). For example, if we wish to find the equivalent for finely, the most frequent adverb in a Cooking corpus, we will realize that it is not finamente, the Portuguese cognate for finely, because this adverb displays a very [...]
A Cooking glossary built along the same lines was produced by a former translation student and co-authored by me (Teixeira & Tagnin 2008). Although not part of either the 2005 or the 2008 project, it is an offspring of a glossary on Cooking spices and condiments compiled in the 2001 course. After finishing the Translation course, Teixeira pursued her master's degree with a thesis on the translation of cooking recipes (Teixeira 2004) and her PhD with a dissertation on a proposal for a Cooking dictionary aimed at a translator's textual production (Teixeira 2008) 6.
The results of the Photography project, unsurprisingly, were a bit uneven.One group excelled and one presented very poor material.The work of the other groups was good but needed some improvement.As the aim was to submit high quality material to a publisher and only one glossary met this requirement, after grades had been assigned, the instructor called a meeting of those who would be interested in pursuing the project on their own time and making all necessary adjustments for the work to be suitable for submission to the publisher.A group of five students 7 decided to embrace the project and the final material was submitted in early 2009.As it is the publisher's policy to have all technical glossaries revised by a professional in the area, the material was examined by a professional photographer who returned it with a few comments and suggestions.These were worked on by the group and the Vocabulário para fotografia was eventually published in 2013 (Tagnin 2013).
[6] an interesting outcome
A couple of years ago I participated in a round table on the teaching of translation. One of my colleagues, Fabio Alves, from the Federal University of Minas Gerais, presented the concept of "deliberate practice." It goes something like this: for students to acquire translation competence, their training should aim at developing specific skills that will contribute to their optimal learning and expert performance in a certain field (Ericsson & Charness 1997). This requires certain conditions to be met, among which the most mentioned one is "subjects' motivation to attend to the task and exert effort to improve their performance" (Ericsson et al. 1993, p. 367). This is developed by Shreve (2006, p. 29), who states that for deliberate practice to occur, the following requirements must be met:
[6] Both works were done under the author's supervision.
[7] Angelica Royo, Eliana C. R. Antonopoulos, Helena Akemi Misumi, Moira Martins de Andrade and Veridiana Rocha Schwenck.
[7] final remarks
This article was intended to demonstrate how a corpus-driven methodology can produce glossaries that meet the translator's needs and how this practice can enhance students' translation competence.
The methodology described showed that building corpus-driven glossaries can be an adequate practice to enhance students' performance towards achieving translation competence, for two reasons. First, Corpus Linguistics has proved to be an effective approach to building technical glossaries that meet the translator's needs. Second, as was later discovered, the methodology was considered an adequate practice in helping students to achieve specialized knowledge and master translation techniques which they will be able to put to use in any technical area they may come to work in (Alves & Tagnin 2010).
figure 2: Selection of concordance lines for image, sorted by 1st and 2nd word on the left.

table 3: KeyWord List - First 20 keywords in the Photography corpus.

table 4: Final composition of the Photography corpus.

Do not display photographs in direct sunlight or under bright lights, and keep them away from heat vents and damp locations.
As mentioned above, this procedure was carried out on two occasions, 2005 and 2008. Of the glossaries produced by the 2005 class, one on Chemistry was published in 2007 (Perrotti-Garcia & Rebechi 2007).