Abstract
The Dictionary of Old Norse Prose (ONP) is a long-term lexicographic project to semantically analyse an important corpus of medieval writings. The traditional audience for such projects consists of researchers and students of the Old Norse language, who often have some knowledge of a modern Scandinavian language. The texts covered by the dictionary include many of general interest, such as the sagas of Icelanders, and texts of interest to fields such as history of religion, archaeology and comparative literature. This paper addresses the problem of how to make such a dictionary inclusive of a much larger audience, particularly those who do not read a modern Scandinavian language but wish to access the rich resources that ONP provides. We also assess the accuracy of one of the features implemented: user-requested machine translations using Google Translate.
ONP is producing entries with both English and Danish as target languages, but since the dictionary went digital-only in 2004 the focus has been on Danish. Currently around 20% of the lexicon has definitions in English and around 50% has definitions only in Danish; the remainder is unedited at this stage. Specialist users who can read Scandinavian languages such as Norwegian and Swedish tend to be comfortable reading Danish, so they can potentially understand most of ONP’s definitions. However, according to analytics data, fewer than 20% of ONP’s digital user base work in these languages. This suggests that the Danish-only definitions, which make up the larger part of the edited dictionary, are effectively unavailable to the great majority of ONP’s current and potential users.
The new web application (onp.ku.dk) developed for the project includes features designed to make the dictionary more inclusive of users who do not know enough Danish to use the entries that have only Danish definitions. The techniques described here use external resources and fall broadly into two categories: supplementation (via querying of external resources such as dictionaries) and automatic translation (via Google Translate).
ONP’s database includes detailed information about the occurrence of words in other dictionaries and glossaries. This information can be used to link words in ONP accurately to out-of-copyright digitised dictionaries, supplementing ONP where an entry is not yet edited or where it lacks an English definition. The most important of these dictionaries is Fritzner’s Old Norse dictionary (1886-96), which has been processed so that in almost all cases a user can access the precise homograph for any given word in ONP. The definitions are in Norwegian, but we discuss later how these can be processed further. There are also reasonable-quality digitisations of the dictionaries by Cleasby and Vigfússon (1874) and Zoëga (1926). These dictionaries have English as the target language and have been imported into a database that can be linked to ONP, giving users an English interpretation of almost all words in ONP, albeit a nineteenth-century interpretation.
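The supplementation step can be pictured as a lookup keyed on a homograph identifier rather than on a lemma string. The following is a minimal sketch and not ONP’s actual implementation: the identifiers, the `LINKS` table, the `Gloss` type and the placeholder glosses are all hypothetical and stand in for the project’s database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gloss:
    source: str    # dictionary the definition comes from
    language: str  # ISO 639-1 code of the definition language
    text: str      # the definition itself


# Hypothetical link table: homograph id -> definitions in external dictionaries.
# Keys are ids, not lemma strings, so two homographs never share an entry.
LINKS: dict[int, list[Gloss]] = {
    101: [
        Gloss("Fritzner", "no", "placeholder Norwegian definition"),
        Gloss("Cleasby-Vigfusson", "en", "placeholder English definition"),
    ],
    102: [Gloss("Fritzner", "no", "another placeholder definition")],
}


def best_gloss(homograph_id: int, preferred: str = "en") -> Optional[Gloss]:
    """Return a gloss in the preferred language if linked, else the first gloss."""
    glosses = LINKS.get(homograph_id, [])
    for gloss in glosses:
        if gloss.language == preferred:
            return gloss
    return glosses[0] if glosses else None
```

With this shape, an entry that has an English-language source returns it directly, while an entry linked only to Fritzner falls back to the Norwegian text, which could then be passed on to machine translation.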
ONP also provides links to concordances in other corpora (Menota and the Skaldic Project) for each homograph, that is, it links to individual distinct entries rather than to lemma strings. The Skaldic Project includes translations of every word in its context and a link to ONP’s wordlist: at this stage there are 126,000 translated words representing 10,000 headwords in ONP. It can therefore provide contextual translations for headwords in ONP.
These external resources provide information for almost all words in ONP, including the large proportion that have not yet been edited, meaning that users can get information about the semantics and grammar of the lexicon while ONP continues its work.
The second general technique used by the web application is to integrate automatic translation via the Google Translate API. Where a definition exists only in Danish, the text of the definition in the web application becomes interactive: when a user clicks or taps on the text, the application replaces it with a translation retrieved via the API (the original text is preserved in a ‘tooltip’). The same applies to the Norwegian text in Fritzner’s dictionary.
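The click-to-translate behaviour can be sketched as follows. This is an illustration under stated assumptions, not the application’s code: `Definition`, `OnDemandTranslator` and `on_click` are hypothetical names, and the translator is passed in as a plain function so that the actual API call (not shown) can be swapped for a stand-in.

```python
from dataclasses import dataclass, field
from typing import Callable

# A translator takes (text, source_language) and returns English text.
# In a real deployment this would wrap a machine-translation API call.
Translator = Callable[[str, str], str]


@dataclass
class Definition:
    text: str          # text currently displayed
    language: str      # language of the displayed text
    tooltip: str = ""  # holds the original text once translated


@dataclass
class OnDemandTranslator:
    translate: Translator
    cache: dict[tuple[str, str], str] = field(default_factory=dict)
    log: list[tuple[str, str, str]] = field(default_factory=list)

    def on_click(self, definition: Definition) -> Definition:
        """Swap a non-English definition for its translation, keeping the original."""
        if definition.language == "en":
            return definition  # already English, nothing to translate
        key = (definition.language, definition.text)
        if key not in self.cache:
            self.cache[key] = self.translate(definition.text, definition.language)
            # Record each distinct request once so it can be reviewed later.
            self.log.append((definition.language, definition.text, self.cache[key]))
        return Definition(text=self.cache[key], language="en", tooltip=definition.text)
```

Caching by language and text means a repeated click does not trigger a second request, and the log ends up holding one row per distinct piece of text, which is the unit used for the accuracy assessment.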
Because specific definitions are machine translated on demand, the user-requested translations can be logged and assessed for accuracy. The body of scholarly literature assessing the accuracy of Google Translate as a translation engine is relatively small. For short phrases and terminology, and with Western European languages as the target, an accuracy of at least 74% was reported as early as 2014 (Patil et al. 2014). The service’s performance is continually improving.
The automatic translations of definitions requested by external users from Danish (ONP) and Norwegian (Fritzner) were logged from 15 to 28 January 2020 (this data will be updated). These comprised 268 distinct pieces of Danish text (avg. 34 characters) and 90 distinct pieces of Norwegian text (avg. 40 characters), requested by approximately 60 external users. We manually categorised each translation on a four-point scale: 1. Accurate (including repetitive translations of synonyms), 2. Accurate but unidiomatic, 3. Partially inaccurate (but with inaccurate material easily identifiable in the context), and 4. Inaccurate (misleading or incorrect). Accuracy is assessed in the context of the original headword in Old Norse.
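As a quick check on the size of the sample, the reported figures can be pooled. The sketch below uses only the numbers given above; the variable names are illustrative, and because the per-language averages are already rounded the pooled length is approximate.

```python
# Figures reported in the text for the logging period.
samples = {
    "da": {"distinct_texts": 268, "avg_chars": 34},  # ONP definitions
    "no": {"distinct_texts": 90, "avg_chars": 40},   # Fritzner definitions
}

total_texts = sum(s["distinct_texts"] for s in samples.values())

# Pooled mean length, weighting each language by its number of texts.
pooled_avg = (
    sum(s["distinct_texts"] * s["avg_chars"] for s in samples.values())
    / total_texts
)

# Share of the sample that comes from ONP's Danish definitions.
danish_share = samples["da"]["distinct_texts"] / total_texts

print(total_texts)             # 358
print(round(pooled_avg, 1))    # 35.5
print(round(danish_share, 2))  # 0.75
```

The sample is therefore 358 short texts, about three quarters of them Danish, with a typical length of roughly 35 characters.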
Preliminary data indicate that around 90% of the machine translations fall into the first three categories. That is to say, in spite of repetitions and some errors, a user can interpret the generated English definition in the context of the web entry to get a sense of the semantics of the particular usage, without great risk of misunderstanding the particular sense of the word.
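The proportion quoted above is the share of ratings in categories 1 to 3. A minimal sketch of that calculation follows; the `Rating` enum and `usable_share` function are hypothetical names, and the ten ratings in the example are invented for illustration and are not the study’s data.

```python
from collections import Counter
from enum import IntEnum


class Rating(IntEnum):
    ACCURATE = 1
    UNIDIOMATIC = 2
    PARTIALLY_INACCURATE = 3
    INACCURATE = 4


def usable_share(ratings: list[Rating]) -> float:
    """Fraction of translations rated 1-3, i.e. interpretable in context."""
    if not ratings:
        raise ValueError("no ratings to summarise")
    counts = Counter(ratings)
    usable = sum(n for rating, n in counts.items()
                 if rating <= Rating.PARTIALLY_INACCURATE)
    return usable / len(ratings)


# Invented ratings for ten translations, for illustration only.
example = (
    [Rating.ACCURATE] * 5
    + [Rating.UNIDIOMATIC] * 2
    + [Rating.PARTIALLY_INACCURATE]
    + [Rating.INACCURATE] * 2
)
print(usable_share(example))  # 0.8
```

Keeping the four categories separate in the log, and collapsing them only at reporting time, leaves room to report a stricter figure (categories 1 and 2 only) from the same data.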
The service is therefore accurate enough to be very useful for users who do not have a grasp of Danish, but not nearly accurate enough to be used on its own for research purposes. These features provide information for a non-Danish-speaking audience in the majority of instances where no English language information is otherwise available, with only a small minority of instances creating the potential for error.
Original language | English |
---|---|
Publication date | 2021 |
Publication status | Published - 2021 |
Event | Euralex XIX: Lexicography for inclusion - Virtual, Greece Duration: 7 Sep 2021 → 10 Dec 2021 Conference number: 19 https://euralex2020.gr/ |
Conference
Conference | Euralex XIX |
---|---|
Number | 19 |
Location | Virtual |
Country/Territory | Greece |
Period | 07/09/2021 → 10/12/2021 |
Internet address | https://euralex2020.gr/ |