the nordic dialect corpus – a joint research infrastructure

This paper describes the Nordic Dialect Corpus as of June 2010. The corpus (see Johannessen et al. 2009) is steadily growing, and new features are constantly being added; the corpus work has funding for another two years. The corpus combines a number of useful features that together make it a unique and very advanced resource for researchers in many fields of language studies. It is web-based and features full audio-visual representation linked to transcriptions and translations.

janne bondi johannessen and of results handling.
[2] why the nordic dialect corpus was developed
The Nordic Dialect Corpus was developed after a need for research material was voiced by members of the NORMS (Nordic Centre of Excellence in Micro-comparative Syntax) and the ScanDiaSyn (Scandinavian Dialect Syntax) networks.
The overarching goal for these researchers is to study the dialects of the North-Germanic languages, i.e., the Nordic languages spoken in the Nordic countries, as dialects of the same language. The languages are closely related to each other, and three of them are mutually intelligible (Norwegian, Swedish and Danish), as are two others (Faroese and Icelandic). All of them have some mutual intelligibility with each other if we consider written forms.
Studying the dialects only within the confines of each national language was therefore considered misguided from a theoretical and principled point of view. Second, doing research across dialects over such a big area, covering six countries (Denmark, Faroe Islands, Finland, Iceland, Norway, and Sweden), would be almost impossible if each researcher had to get hold of relevant data on their own.
Third, the research in NORMS and ScanDiaSyn focuses on syntax, for which data of many different kinds were necessary. Questionnaires for specific phenomena were needed (but will not be discussed in this paper), and recordings of spontaneous speech as it is used in ordinary conversations were very important. The latter need is satisfied by the Nordic Dialect Corpus.
[3] description of the corpus
[3.1] Linguistic contents and numbers
The corpus contains dialect data from the national languages Danish, Faroese, Icelandic, Norwegian, and Swedish. It is steadily growing, since new recordings are still being made or planned, while other recordings are in various stages of completion. At the moment, it contains speech data from approximately 525 informants with 1.8 million words, unevenly spread between the five countries. Eventually, this will rise to around 600 informants. The numbers for the corpus as of today are given in Table 1.
Due to differences in the financing of the data collection in the different countries, the data are less uniform than one might ideally have wanted. (Some recordings and transcriptions were done for this corpus, while others already existed, such as most of the Swedish ones, which were generously given to us by the earlier project Swedia 2000.) Some dialects, such as those from Norway, the Swedish dialect of Övdalian and the Danish dialect of Western Jutlandic, have two kinds of recordings per informant: one semi-formal interview (informant and project assistant), and one informal conversation between two informants. Some dialects have recordings of both young and old informants, while others are only represented by old ones. Some dialects are represented by both old and new recordings, where the old ones are generally around fifty years old. Some dialects have been recorded by audio only, while others have been recorded by both audio and video. All the dialects have recordings of informants belonging to both genders. Most importantly, however, all the recordings represent spontaneous speech.
[3.2] Annotation: transcription and tagging
All the dialect data have been transcribed according to at least one transcription standard, and this work has for the most part been done in the individual countries: each dialect has been transcribed in the standard official orthography of that country. (For Norwegian, which has two standard orthographies, Bokmål was chosen since there exist important computational tools for this variant.) In addition, all the Norwegian dialects and some Swedish ones have also been transcribed phonetically (see footnote 2). For the Norwegian dialects and the Övdalian Swedish ones that have two transcriptions, the phonetic transcription was in each case done first and was then converted to an orthographic transcription via a semi-automatic dialect transliterator developed for the project. The fact that there are two transcriptions for dialects that are very different from the standard national orthography makes it possible to search with both transcriptions in the corpus, and to present search results in both, as illustrated for the Swedish dialect of Övdalian in Figure 1. This figure also shows the translation by Google, which is provided as a service in the corpus results presentation.
The Text Laboratory at the University of Oslo is responsible for the further technical development, including tagging. The whole corpus will be grammatically tagged with POS and selected morpho-syntactic features, language by language. So far, the Norwegian data have been tagged, while the transcribed texts from the other languages are in the process of being tagged. Tagging speech data is different from tagging written data. Speech contains disfluencies, interruptions and repetitions, and there are rarely clear clause boundaries (Allwood, Nivre and Ahlsén 1989, Johannessen and Jørgensen 2006). This is usually reflected in the transcription of speech, which generally does not contain clause boundaries or sentential markers such as full stops and exclamation marks (Jørgensen 2008, Rosén 2008). Any tagger developed for written language will therefore be difficult to use directly for spoken language. (Nivre and Grönqvist 2001 did this, however, on material different from ours.)
The Norwegian speech tagger was developed for the NoTa Corpus (Norwegian speech corpus, Oslo part). Søfteland and Nøklestad (2008) describe how the corpus was first tagged with the Oslo-Bergen tagger for written Norwegian (Hagen et al. 2000), and how a TreeTagger (Schmid 1994) was then trained on the resulting, manually corrected file. The TreeTagger reached an accuracy of 96.9%. This tagger has then been used unchanged for the dialect corpus, under the assumption that the speech as represented in the dialects and in Oslo is sufficiently similar once it is all transcribed according to the same transcription standard. The Swedish tagger has been trained in the same way. A written language TnT tagger developed by Sofie Johansson Kokkinakis (2003) was applied to the Swedish dialect transcriptions (their standard orthographic version). After the output had been manually corrected and used for retraining, a spoken language Swedish statistical HunPos tagger was developed at the Text Laboratory (see footnote 3). For Faroese, we have used a Faroese constraint grammar tagger developed for written language (Trosterud 2009), and manually corrected the results (see footnote 4).
[2] The Norwegian phonetic transcription follows that of Papazian and Helleland (2005). The transcription of the Övdalian dialect follows the Övdalian orthography standardised in 2005 by the Råðdjärum (The Övdalian Language Council).
[3] The manual corrections of the Swedish tagger were done by Piotr Garbacz, and the tagger was developed by André Lynum, both at the Text Laboratory, UiO.
[4] The manual corrections of the Faroese tagger were done by Remco Knooihuizen for the Text Laboratory, UiO.
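An accuracy figure such as the one reported for the speech tagger is conventionally computed per token, by comparing the automatic tags with a manually corrected version of the same text. The following is a minimal sketch of that calculation, assuming token-level scoring; the function name and the tag labels are illustrative and are not taken from the project.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of tokens whose automatic tag equals the manually corrected tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Toy tag sequences with made-up tag labels (not the corpus tag set).
gold_tags = ["pron", "verb", "adv", "noun", "conj"]
auto_tags = ["pron", "verb", "adj", "noun", "conj"]

accuracy = tagging_accuracy(gold_tags, auto_tags)
print(f"{accuracy:.1%}")  # prints 80.0%
```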
The corpus uses an advanced search interface and results handling system, Glossa (Nygaard 2007, Johannessen et al. 2008). The system allows for a large variety of search combinations, making it possible to do very advanced and complex searches even though the interface is very simple, with pull-down menus and boxes that expand only when prompted by the user. The underlying corpus search system is the Corpus Workbench (Christ 1994, Evert 2005); simple corpus queries are translated to regular expressions before querying, something that is invisible to the user. Several of the features in the search interface and the results display follow suggestions by participants in ScanDiaSyn and NORMS.
Searching for lemmas and parts of words: For those parts of the corpus that are tagged and lemmatised, it is possible to search for the lemma only. This way we get all inflected forms of one lexeme. This feature is very useful when there is suppletion in the stem of the word. For example, a search for the Norwegian lemma gås ('goose') will give the results gås, gåsa, gjess, gjessene (various combinations of number and definiteness).
The same box where the user can write a full search word or a lemma can also be used to write part of a search word. This way the user can, for example, search for a particular suffix. In Figure 2, the user has searched for the suffix -ig, which can be found in Norwegian, Swedish, and Danish. Notice that since nothing else was specified, this search would query the whole corpus, all the languages. In Table 2 we can see some of the many hits for the frequent adjectival suffixes -ig and -lig in the mainland Nordic languages, and a couple of occurrences of words containing the same sequence of letters in the insular Nordic languages (not representing these suffixes, however).
Searching for more than one word: In order to specify a search for more than one word, the user clicks on the plus sign in the first box, which gives one more box, with the possibility of specifying a number of words in between (Figure 3). The illustration shows a search for a word ending in -ig separated by at most three words from a conjunction to the right.
Searching for part of speech: The tagged part of the corpus can also be queried directly by part-of-speech tags. This is exemplified in Figure 3, where the second word is specified to be a conjunction. The user can choose whether a search word is specified by a word form (or part of one), a part of speech, or both. The pull-down menus in Figure 2 exemplify many of the search options that are available for a word.
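To make the translation from menu choices to corpus queries more concrete, here is a minimal sketch of how word, lemma, part-of-speech and distance specifications could be turned into a query string in the token-based syntax of the Corpus Workbench query processor. It is not Glossa's actual code: the class and function names are hypothetical, and the attribute names `word`, `lemma` and `pos` as well as the tag value `conj` are assumptions about how the corpus is encoded.

```python
from dataclasses import dataclass
from typing import List, Optional

# Characters that need a backslash inside a quoted regular expression.
_REGEX_META = set('\\.^$*+?()[]{}|"')


def escape(text: str) -> str:
    """Backslash-escape regex metacharacters so user input is matched literally."""
    return "".join("\\" + ch if ch in _REGEX_META else ch for ch in text)


@dataclass
class TokenSpec:
    """One search box: any combination of word form, lemma and part of speech."""
    form: Optional[str] = None
    lemma: Optional[str] = None
    pos: Optional[str] = None
    part: str = "whole"   # "whole", "start", "end" or "middle" of the word form
    max_gap: int = 0      # maximum number of words between this box and the previous one


def token_to_cqp(spec: TokenSpec) -> str:
    """Render one search box as a bracketed token expression."""
    conditions = []
    if spec.form is not None:
        pattern = escape(spec.form)
        if spec.part in ("end", "middle"):
            pattern = ".*" + pattern
        if spec.part in ("start", "middle"):
            pattern = pattern + ".*"
        conditions.append(f'word="{pattern}"')
    if spec.lemma is not None:
        conditions.append(f'lemma="{escape(spec.lemma)}"')
    if spec.pos is not None:
        conditions.append(f'pos="{escape(spec.pos)}"')
    return "[" + " & ".join(conditions) + "]"


def build_query(specs: List[TokenSpec]) -> str:
    """Join the boxes, inserting an 'any token' gap where a distance is allowed."""
    parts = []
    for index, spec in enumerate(specs):
        if index > 0 and spec.max_gap > 0:
            parts.append(f"[]{{0,{spec.max_gap}}}")
        parts.append(token_to_cqp(spec))
    return " ".join(parts)


# A word ending in -ig, at most three words, then a conjunction (cf. Figure 3).
two_word_query = build_query([TokenSpec(form="ig", part="end"),
                              TokenSpec(pos="conj", max_gap=3)])
print(two_word_query)  # [word=".*ig"] []{0,3} [pos="conj"]

# A lemma search that would retrieve all inflected forms of one lexeme.
print(build_query([TokenSpec(lemma="gås")]))  # [lemma="gås"]
```

Keeping each search box as a small record and rendering it only at the end mirrors the idea that the user never sees the regular expressions.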
Phonetic querying: The user can choose to query the corpus by giving a phonetically specified string. This works only for the dialects that have two transcriptions (cf. section [3.2]). This is useful, for example, when we want to query person-number inflection on verbs. Here, tagging will not help, since each tagger is trained on the standard orthographic version of the texts, and person-number inflection is only a dialect feature. Searching for this feature in Övdalian, we can simply write, for example, the 1pl suffix as it is (Figure 4).
figure 4: Searching in phonetic mode
This will give results that would have been impossible to get from the orthographic transcriptions. We refer to Figure 1, where the dialectal bellum ('can' 1pl) is represented by the standard kan ('can').
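The benefit of keeping two aligned transcriptions can be sketched in a few lines. The example below assumes that each token carries both a dialect-near form and a standard orthographic form, and searches one tier while returning both. The data structure and function are hypothetical; only the pair bellum/kan comes from the text, the other tokens are placeholders, and the suffix string "um" is chosen simply because bellum ends in it.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    phonetic: str       # dialect-near transcription
    orthographic: str   # standard orthography


def search_suffix(tokens: List[Token], suffix: str,
                  tier: str = "phonetic") -> List[Tuple[int, Token]]:
    """Return (position, token) for every token whose chosen tier ends in suffix."""
    return [(i, tok) for i, tok in enumerate(tokens)
            if getattr(tok, tier).endswith(suffix)]


# Placeholder utterance; PHON_*/ORTH_* stand in for real transcribed words.
utterance = [Token("PHON_A", "ORTH_A"),
             Token("bellum", "kan"),
             Token("PHON_B", "ORTH_B")]

phonetic_hits = search_suffix(utterance, "um", tier="phonetic")
orthographic_hits = search_suffix(utterance, "um", tier="orthographic")
print(phonetic_hits)       # the inflected form is found in the phonetic tier
print(orthographic_hits)   # nothing is found in the orthographic tier
```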
Informant-based querying: There are a number of ways to query the corpus in addition to the linguistics-based ones that we have seen above. All the details that are known about each informant are also searchable in the search interface. Thus, it is possible to specify as search criteria: age, sex, recording year, place of residence, country, region and area. In Figure 5, we show how we can choose individual places from the complete list, in order to query only the informants from these places, which here all lie in the area of Älvdalen in Sweden.
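Delimiting the corpus by informant metadata amounts to filtering a table of informants before the linguistic query is run. The sketch below illustrates this under simple assumptions: the record fields follow the criteria listed above, but the class, the function, the informant codes and all the sample values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Informant:
    code: str
    sex: str
    age: int
    country: str
    place: str
    recording_year: int


def select_informants(informants: Iterable[Informant], *,
                      countries: Optional[Set[str]] = None,
                      places: Optional[Set[str]] = None,
                      sex: Optional[str] = None,
                      min_age: Optional[int] = None,
                      max_age: Optional[int] = None) -> List[Informant]:
    """Keep informants that satisfy every criterion that was actually given."""
    selected = []
    for inf in informants:
        if countries is not None and inf.country not in countries:
            continue
        if places is not None and inf.place not in places:
            continue
        if sex is not None and inf.sex != sex:
            continue
        if min_age is not None and inf.age < min_age:
            continue
        if max_age is not None and inf.age > max_age:
            continue
        selected.append(inf)
    return selected


# Invented sample records; only the place names occur in the paper.
informants = [
    Informant("inf_01", "F", 72, "Sweden", "Älvdalen", 2008),
    Informant("inf_02", "M", 25, "Sweden", "Älvdalen", 2008),
    Informant("inf_03", "M", 70, "Norway", "Skreia", 2009),
    Informant("inf_04", "M", 68, "Norway", "Stange", 2009),
]

alvdalen = select_informants(informants, places={"Älvdalen"})
print([inf.code for inf in alvdalen])  # ['inf_01', 'inf_02']
```

The resulting list of informant codes could then be used to restrict which transcripts a query is run against.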

Multimedia display:
The corpus includes transcribed speech from five countries and spans four decades. Some of the speech was recorded using a tape recorder and later an mp3 recorder, and some was recorded with a video camera. Each search result is accompanied by a clickable symbol that plays the audio and video of that particular speech sequence. This is illustrated in Figure 6 below.
Action menu: On the results page there is an Action menu with a selection of choices for further displaying of results and results handling (the latter of which will be presented in section [3.6]). The functionalities that follow in this subsection are choices in this menu (Figure 8). The count results can be shown in a number of ways, such as histograms and pie charts.
figure 10: The same information as in Figure 9.

Sort:
The results are by default sorted according to the geographical residence of the informants. However, they can be displayed in many other ways as well. The most useful ones are perhaps those that sort the matches by the next word to the right or left.
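Sorting concordance lines by the neighbouring word is a small operation once each hit is stored with its left and right context. The following is a minimal sketch, assuming a simple keyword-in-context record; the names are hypothetical and the three concordance lines are invented toy examples around the word bil, not corpus data.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    left: List[str]    # words before the match, in text order
    match: str
    right: List[str]   # words after the match, in text order


def sort_by_neighbour(hits: List[Hit], side: str = "right") -> List[Hit]:
    """Sort hits alphabetically by the word immediately right or left of the match."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")

    def neighbour(hit: Hit) -> str:
        context = hit.right if side == "right" else hit.left
        if not context:
            return ""  # hits without context on that side sort first
        word = context[0] if side == "right" else context[-1]
        return word.lower()

    return sorted(hits, key=neighbour)


# Invented toy concordance lines.
hits = [
    Hit(["en", "ny"], "bil", ["som"]),
    Hit(["min"], "bil", ["er"]),
    Hit(["hans", "gamle"], "bil", ["og"]),
]

print([h.right[0] for h in sort_by_neighbour(hits, "right")])  # ['er', 'og', 'som']
print([h.left[-1] for h in sort_by_neighbour(hits, "left")])   # ['gamle', 'min', 'ny']
```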

Collocations:
The results can be shown as collocations according to many different statistical measures, such as the Dice coefficient and the log-likelihood ratio, with a choice between neighbouring bigrams and trigrams. The example in Figure 11 illustrates the collocations for the word bil 'car', used in the three mainland Nordic countries. The value of this choice is clearly illustrated in the example in Figure 11; the frequencies of the collocations are the same independently of language.
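For readers unfamiliar with the two association measures named above, the sketch below shows their standard textbook definitions for a bigram. It is an illustration only and makes no claim about how the corpus interface computes them; the function names and the frequencies in the example are made up.

```python
import math


def dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Dice coefficient: 2 * f(x,y) / (f(x) + f(y))."""
    return 2 * f_xy / (f_x + f_y)


def log_likelihood(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Log-likelihood ratio G2 = 2 * sum(O * ln(O / E)) over the 2x2 contingency table.

    f_xy: bigram frequency; f_x: frequency of x as first word;
    f_y: frequency of y as second word; n: total number of bigrams.
    """
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    expected = [f_x * f_y / n,
                f_x * (n - f_y) / n,
                (n - f_x) * f_y / n,
                (n - f_x) * (n - f_y) / n]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Made-up frequencies: the pair occurs 10 times, x 20 times, y 30 times, among 1000 bigrams.
print(round(dice(10, 20, 30), 2))            # 0.4
print(log_likelihood(10, 20, 30, 1000) > 0)  # True: the words co-occur more than chance predicts
```

A value of zero for the log-likelihood ratio means that the observed counts equal those expected if the two words were independent.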

Maps:
Recently an option of displaying the search results on maps (using Google Maps technology) has been added. Since one search can cover a variety of results, for example when one orthographic word covers many different phonetic varieties, an additional option has been added in which each variety can be selected independently. In the map in Figure 12 the different phonetic varieties of the negation are displayed in the right-hand column, giving the user the option to select one or more and have them shown independently on the map. The orthographic variety is displayed by a neutral dot covering all pronunciations.

[3.5] Displaying information on informants
There are two ways of finding information on the informants.
Via results page: Each concordance line has an information symbol at its far left. Clicking on this symbol reveals information on the informant in question: informant code, sex, age group, country, place, number of words and recording year; recently we have also included a map of his/her home place, see Figure 13.
Via search page: There is a button called "Show Texts", which shows information on which informants are included in a particular query. For example, if the user wants to query the corpus on Swedish data only, (s)he can press this button and immediately see how many informants are represented in the selection, how many words each informant has uttered, etc. This information can also be sorted by category to present, for example, the number of words in descending order. This way, we can see how different the informants are in this respect. For example, one old man from Skreia, Norway, utters 1,300 words during his session, while another old man, from nearby Stange, utters more than 6,400 words.

[3.6] Further processing of results
figure 12: A map showing all the places that have hits (all the dots) for the orthographic forms of the negation 'not'. The column on the right can be specified for a phonetic variant. Here the phonetic form ikkje has been chosen. It should be noted that parts of North Norway have not yet been included in the corpus.
Deleting or choosing some results: In a corpus search it is often the case that the user gets more results than intended. Sometimes the search expression just was not good enough, which can best be corrected by a new and more precise search. However, sometimes it is impossible to formulate better search criteria, whether it is because there is too much homonymy in the corpus, or because it just is not annotated for all imaginable research features. Let us use a simple example: We want to find all and only the occurrences of the 3sgF pronoun ('she') used as a determiner, followed by any word, and then a noun. This search will give a lot of unwanted hits that we want to remove. We can then choose the Delete option from the Action menu and get Figure 14. Notice in the figure that by having chosen the Delete option, the results come with a little box on the left-hand side. In this box we tick the examples that we want to remove. If we suspected that there would only be a few examples that were appropriate for our research, we could instead have used the Choose option, which functions in the same way, but where ticking a box would mean to keep that result and delete the unticked ones.
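The Delete and Choose options are complementary: given the same set of ticked boxes, one keeps exactly what the other removes. This can be sketched as follows; the function names and the placeholder result strings are hypothetical and only illustrate the logic described above.

```python
from typing import List, Set


def delete_ticked(results: List[str], ticked: Set[int]) -> List[str]:
    """Delete option: remove the ticked results and keep the rest."""
    return [r for i, r in enumerate(results) if i not in ticked]


def choose_ticked(results: List[str], ticked: Set[int]) -> List[str]:
    """Choose option: keep only the ticked results."""
    return [r for i, r in enumerate(results) if i in ticked]


# Placeholder concordance lines.
results = ["hit 0", "hit 1", "hit 2", "hit 3"]
ticked = {1, 3}

print(delete_ticked(results, ticked))  # ['hit 0', 'hit 2']
print(choose_ticked(results, ticked))  # ['hit 1', 'hit 3']
```

Which option is more convenient depends only on whether the wanted or the unwanted hits are in the minority, since the user then has fewer boxes to tick.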
Annotating results: The individual researcher often needs to further annotate the results, for example according to pronunciation of certain sounds or words, or specific syntactic patterns. In Figure 15
Saving and downloading results: All results can be saved and/or downloaded, whether we choose the raw results or those that we have further processed by deletion, choice or annotation. By saving we get the opportunity to look at the results later, with exactly the same possibilities for further processing and displaying of results in the corpus interface. Downloaded results, on the other hand, are not available in the corpus system in this way, but can be imported elsewhere as, for instance, tab-separated text.

[4] comparison with other dialect corpora
There are some other dialect resources on the web, but there are to our knowledge few or no available web-based dialect multimedia corpora for other languages. One interesting resource is Sounds familiar? Accents and Dialects of the UK. It contains information on British dialects, and recordings of the dialects with transcripts, all presented via a web map. However, it is pedagogical, and not aimed at researchers. For example, there is no search option in the transcripts and no grammatical annotation.
The Scottish Corpus of Text and Speech contains 4 million words, 20% of which is spoken texts, provided with orthographic transcription, synchronised with the audio or video. It is not grammatically annotated and is not representative. How-
The British National Corpus contains 10 million words of spoken English, which have been categorised into 28 different dialects. However, its own search interface distribution states that this categorisation is unreliable. Further, as a dialect corpus, the BNC has limited value, since it is not accompanied by audio, and the speech is transcribed orthographically.
The DynaSand web-based dialect database consists of information on various syntactic features and their distribution geographically in the Netherlands and Belgium. It contains recorded material from the project's questionnaire sessions, but the conversations contain to a large extent read sentences and meta-linguistic discussions, and less spontaneous speech.
The Spoken Dutch Corpus is transcribed orthographically, some of it also phonetically, and it is morphologically tagged. It contains spoken standard Dutch, not dialect data, and is not available via a web interface.
The Corpus of French Phonology (La phonologie du français contemporain: usages, variétés et structure, PFC) is a web-based corpus of spoken French from across the Francophone world. It is searchable both phonologically and with respect to informant characteristics, and has transcriptions linked to sound.
There might be web-based dialect corpora for other languages, but information about these is hard to find, and they do not seem to be available on the web.
One such corpus under development is the Corpus of Estonian Dialects. Another is the Spoken Japanese Dialect Corpus (GSR-JD), available on DVD. Finally we should mention a small dialect corpus of Norwegian (Talesøk). It contains audio and transcriptions, and is available on the web.
There are some general web-based speech corpora that do not focus on dialect classification. For an overview of some Northern European ones, and their state of the art with respect to topics like technical solutions and audio-visual availability, we refer to Johannessen et al. (2007).
Finally, we would like to mention that Paul Thompson at the University of Reading posted on the Corpora List on November 30 2008, asking for information on corpus projects in which the developers have linked digital audio and/or video files to the transcripts, to allow access to the precise segment(s) of the audiovisual files that relate to a part of the transcript. In his summary of 15 responses there was only one dialect corpus: our own Nordic Dialect Corpus.
[5] conclusion
We have presented the first version of the Nordic Dialect Corpus. It contains nearly 1.8 million words of Nordic dialects in the form of spontaneous, unscripted conversations. Most of them have been collected recently, but we have also included some old speech data. The Nordic Dialect Corpus has an advanced interface for searching and results handling. It is already a great resource for dialect researchers and linguists interested in the Nordic languages. The next version of the corpus will contain more dialect data. Part-of-speech taggers adapted for speech will be developed for all the languages, and all present and future texts will be tagged.
[6] acknowledgements
In addition to participants in the ScanDiaSyn and NORMS networks, I would like to thank three anonymous NODALIDA-09 reviewers for valuable comments on an earlier version of this paper. I would also like to thank various funding bodies for funding the technical part of this project: the University of Oslo (both the Humanities Faculty and the central Research Department), The Norwegian Research Council, NordForsk and NOS-HS. In addition, the national research councils in Norway, Sweden, Denmark and Iceland have contributed to the projects NorDiaSyn, SweDiaSyn, DanDiaSyn and IceDiaSyn, as has the project Swedia 2000, which has contributed a lot of Swedish recordings, and Norsk Ordbok 2014, which has contributed many months' worth of recordings for the old Målførearkiv material that is included in the corpus.
Finally, I would like to emphasise that developing a corpus of the kind that I have described here is team work. I want to mention the cooperation of Øystein
Corpus is the result of close collaboration between the partners in the research networks Scandinavian Dialect Syntax and Nordic Centre of Excellence in Microcomparative Syntax. The researchers in the network have contributed to everything from decisions to actual work ranging from methodology to recordings, transcription, and annotation. Some of the corpus (in particular, recordings of informants) has been financed by the national research councils in the individual countries, while the technical development has been financed by the University of Oslo and the Norwegian Research Council, plus the Nordic research funds NOS-HS and NordForsk.
figure 1: Two transcriptions for Övdalian and a Google translation.

figure 3: Searching for two words

figure 5: Delimiting the corpus by choosing some places from the full list

figure 6: The multimedia results window

figure 7: A window shows all information for each word that is moused over

figure 8: Action menu in results window

figure 13: Information that appears in the search results window

figure 14: Results window with Delete option

table 1: Corpus contents by June 2010

table 2: Some results from the -ig search