
Automatically Creating a Large Number of New Bilingual Dictionaries

Khang Nhut Lam    Feras Al Tarouti    Jugal Kalita
Computer Science Department
University of Colorado, USA
{klam2, faltarou, jkalita}@uccs.edu
Abstract

This paper proposes approaches to automatically create a large number of new bilingual dictionaries for low-resource languages, especially resource-poor and endangered languages, from a single input bilingual dictionary. Our algorithms produce translations of words in a source language to many target languages using available Wordnets and a machine translator (MT). Since our approaches rely on just one input dictionary, available Wordnets and an MT, they are applicable to any bilingual dictionary as long as one of the two languages is English or has a Wordnet linked to the Princeton Wordnet. Starting with 5 available bilingual dictionaries, we create 48 new bilingual dictionaries. Of these, 30 pairs of languages are not supported by the popular MTs Google (http://translate.google.com/) and Bing (http://www.bing.com/translator).

Introduction

Bilingual dictionaries play a major role in applications such as machine translation, information retrieval, cross-lingual document processing, automatic word sense disambiguation, computing similarities among documents and increasing translation accuracy (?). Bilingual dictionaries are also useful to general readers, who may need help in translating documents in a given language to their native language or to a language with which they are familiar. Such dictionaries may also be important from an intelligence perspective, especially when they deal with smaller languages from sensitive areas of the world. Creating new bilingual dictionaries is also a purely intellectual and scholarly endeavor important to the humanities and other scholars.

The powerful online MTs developed by Google and Bing provide pairwise translations for 80 and 50 languages, respectively, and also translate single words and phrases. In spite of so much information for some “privileged” language pairs, there are many languages for which we are lucky to find a single bilingual dictionary online or in print. For example, we can find an online Karbi-English dictionary and an English-Vietnamese dictionary, but we cannot find a Karbi-Vietnamese dictionary. (The ISO 639-3 codes of Arabic, Assamese, Dimasa, English, Karbi and Vietnamese are arb, asm, dis, eng, ajz and vie, respectively.)

The question we address in this paper is the following: Given a language, especially a resource-poor language, with only one available dictionary translating from that language to a resource-rich language, can we construct several good dictionaries translating from the original language to many other languages using publicly available resources such as bilingual dictionaries, MTs and Wordnets? We call a dictionary good if each entry in it is of high quality and it has the largest number of entries possible. We must note that these two objectives conflict: frequently, if an algorithm produces a large number of entries, there is a high probability that the entries are of low quality. Restating our goal: with only one input dictionary translating from a source language to a language with an available Wordnet linked to the Princeton Wordnet (PWN) (?), we create a number of good bilingual dictionaries from that source language to all other languages supported by an MT, with different levels of accuracy and sophistication.

Our contribution in this work is our reliance on the existence of just one bilingual dictionary between a low-resource language and a resource-rich language, viz., eng. This strict constraint on the number of input bilingual dictionaries can be met by even many endangered languages. We consciously decided not to depend on additional bilingual dictionaries or external corpora because such languages usually do not have such resources. The simplicity of our algorithms, along with their low resource requirements, is our main strength.

Related work

Let A, B and C be three distinct human languages. Given two input dictionaries Dict(A,B) consisting of entries (a_i, b_k) and Dict(B,C) containing entries (b_k, c_j), a naïve method to create a new bilingual dictionary Dict(A,C) may use B as a pivot: If a word a_i is translated into a word b_k, and the word b_k is translated into a word c_j, this straightforward approach concludes that the word c_j is a translation of the word a_i, and adds the entry (a_i, c_j) into the dictionary Dict(A,C). However, if the word b_k has more than one sense, being a polysemous word, this method might make wrong conclusions. For example, if the word b_k has two distinct senses which are translated into words c_{j1} and c_{j2}, the straightforward method will conclude that the word a_i is translated into both the word c_{j1} and the word c_{j2}, which may be incorrect. This problem is called the ambiguous word sense problem. After obtaining an initial bilingual dictionary, past researchers have used several approaches to mitigate the effect of the ambiguity problem. All the methods used for word sense disambiguation use Wordnet distance between source and target words in some way, in addition to looking at dictionary entries in forward and backward directions and computing the amount of overlap or match to obtain disambiguation scores (?), (?), (?), (?), (?), (?), (?) and (?). The formulas and names used for the disambiguation scores differ from author to author. Researchers have also merged information from sources such as parallel corpora or comparable corpora (?), (?) and a Wordnet (?). Some researchers have also extracted bilingual dictionaries from parallel corpora or comparable corpora using statistical methods (?), (?), (?), (?), (?) and (?).
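For concreteness, the following toy Python sketch (with invented placeholder words and dictionaries, not data from any real resource) shows the naïve pivot method and how a polysemous pivot word such as “bank” yields a spurious entry:

# Naive pivot merge: Dict(A,B) + Dict(B,C) -> Dict(A,C).
dict_ab = {"a1": ["bank"]}                    # word a1 in A -> translations in B
dict_bc = {"bank": ["c_river", "c_money"]}    # "bank" is polysemous in B

dict_ac = {}
for a_word, b_words in dict_ab.items():
    for b_word in b_words:
        for c_word in dict_bc.get(b_word, []):
            # Every (a_i, c_j) pair is added, including ones from wrong senses.
            dict_ac.setdefault(a_word, []).append(c_word)

print(dict_ac)  # {'a1': ['c_river', 'c_money']} -- one of these is spurious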

The primary similarity among these methods is that they work with languages that already possess several lexical resources, or they take advantage of related languages (that have some lexical resources) by using such languages as intermediaries. The accuracies of bilingual dictionaries created from available dictionaries and Wordnets are usually high. However, it is expensive to create such original lexical resources, and they do not always exist for many languages. For example, such resources do not exist for most major languages of India, some spoken by hundreds of millions. The same holds for many other widely spoken languages from around the world. In addition, published approaches can only generate one or just a few new bilingual dictionaries.

In this paper, we propose methods for creating a significant number of bilingual dictionaries from a single available bilingual dictionary, which translates a source language to a resource-rich language with an available Wordnet. We use publicly available Wordnets in several resource-rich languages and a publicly available MT as well.

Bilingual dictionary

An entry in a dictionary, called a LexicalEntry, is a 2-tuple <LexicalUnit, Definition>. A LexicalUnit is a word or phrase being defined, the so-called definiendum (?). A list of entries sorted by the LexicalUnit is called a lexicon or a dictionary. Given a LexicalUnit, the Definition associated with it usually contains its class and pronunciation, its meaning, and possibly additional information. The meaning associated with it can have several Senses. A Sense is a discrete representation of a single aspect of the meaning of a word. Entries in the dictionaries we create are of the form <LexicalUnit, Sense_1>, <LexicalUnit, Sense_2>, ….
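To make this structure concrete, here is a minimal Python sketch of such an entry (the field and variable names are ours, for illustration only; the example senses are from Figure 1 below):

from dataclasses import dataclass

@dataclass
class LexicalEntry:
    lexical_unit: str   # the word or phrase being defined (the definiendum)
    sense: str          # one discrete sense of its meaning

# A dictionary is then a list of such 2-tuples, sorted by LexicalUnit:
entries = [
    LexicalEntry("diha-juguti", "suggestion"),
    LexicalEntry("diha-juguti", "advice"),
]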

Proposed approaches

This section describes approaches to create new bilingual dictionaries Dict(S,D), each of which translates a word in language S to a word or multiword expression in a destination language D. Our starting point is just one existing bilingual dictionary Dict(S,R), where S is the source language and R is an “intermediate helper” language. We require that the language R has an available Wordnet linked to the PWN. We do not think this is a big imposition since the PWN and other Wordnets are freely available for research purposes.

Direct translation approach (DT)

We first develop a direct translation method, which we call the DT approach (see Algorithm 1). The DT approach uses transitivity to create new bilingual dictionaries from existing dictionaries and an MT. An existing dictionary Dict(S,R) contains alphabetically sorted LexicalUnits in a source language S, each with one or more Senses in the language R. We call such a sense Sense_R. To create a new bilingual dictionary Dict(S,D), we simply take every pair <LexicalUnit, Sense_R> in Dict(S,R) and translate Sense_R to D to generate the translation candidate set candidateSet (lines 2-4). When there is no translation of Sense_R in D, we skip that pair <LexicalUnit, Sense_R>. Each candidate in candidateSet becomes a Sense_D in language D of that LexicalUnit. We add the new tuple <LexicalUnit, Sense_D> to Dict(S,D) (lines 5-7).

Algorithm 1 DT algorithm

Input: Dict(S, R)
Output: Dict(S, D)

1:  Dict(S, D) := ∅
2:  for all LexicalEntry ∈ Dict(S, R) do
3:     for all Sense_R ∈ LexicalEntry do
4:        candidateSet = translate(Sense_R, D)
5:        for all candidate ∈ candidateSet do
6:           Sense_D = candidate
7:           add tuple <LexicalUnit, Sense_D> to Dict(S, D)
8:        end for
9:     end for
10:  end for

An example of generating an entry for Dict(asm,vie) using the DT approach from an input Dict(asm,eng) is presented in Figure 1.

Figure 1: An example of the DT approach generating a new dictionary Dict(asm,vie). In Dict(asm,eng), the word “diha-juguti” in asm has two translations in eng, “suggestion” and “advice”, which are translated to vie as “đề nghị” and “tư vấn”, respectively, using the Bing Translator. Therefore, in the new Dict(asm,vie), the word “diha-juguti” has two translations in vie: “đề nghị” and “tư vấn”.

Using publicly available Wordnets as intermediate resources (IW)

To handle ambiguities in the dictionaries created, we propose the IW approach, illustrated in Figure 2 and Algorithm 2.

Figure 2: The IW approach for creating a new bilingual dictionary

For each Sense_R in every given LexicalEntry from Dict(S,R), we find all Offset-POSs in the Wordnet of the language R to which Sense_R belongs (Algorithm 2, lines 2-5). (A synset is a set of cognitive synonyms; an Offset-POS is the offset of a synset with a particular POS from the beginning of its data file. Words in a synset share the same sense.) Then, we find a candidate set of translations from the Offset-POSs and the destination language D using Algorithm 3. For each Offset-POS among those extracted, we obtain each word belonging to that Offset-POS from different Wordnets (Algorithm 3, lines 2-3) and translate it to D using an MT to generate translation candidates (Algorithm 3, line 4). We add the translation candidates to the candidateSet (Algorithm 3, line 6). Each candidate in the candidateSet has 2 attributes: a translation of the word into the target language D, called candidate.word, and the occurrence count or rank value of candidate.word, called candidate.rank. A candidate with a greater rank value is more likely to be a correct translation; candidates with the same rank are treated alike. Then, we sort all candidates in the candidateSet in descending order of rank (Algorithm 2, line 7) and add them to the new dictionary Dict(S,D) (Algorithm 2, lines 8-10). We can vary which Wordnets, and how many, are used during experiments, producing different results.

Algorithm 2 IW algorithm

Input: Dict(S,R)
Output: Dict(S, D)

1:  Dict(S, D) := ∅
2:  for all LexicalEntry ∈ Dict(S, R) do
3:     for all Sense_R ∈ LexicalEntry do
4:        candidateSet := ∅
5:        Find all Offset-POSs of synsets containing Sense_R from the R Wordnet
6:        candidateSet = FindCandidateSet(Offset-POSs, D)
7:        sort all candidates in descending order based on their rank values
8:        for all candidate ∈ candidateSet do
9:           Sense_D = candidate.word
10:           add tuple <LexicalUnit, Sense_D> to Dict(S, D)
11:        end for
12:     end for
13:  end for
Algorithm 3 FindCandidateSet(Offset-POSs, D)

Input: Offset-POSs, D
Output: candidateSet

1:  candidateSet := ∅
2:  for all Offset-POS ∈ Offset-POSs do
3:     for all word in the Offset-POS extracted from the PWN and other available Wordnets linked to the PWN do
4:        candidate.word = translate(word, D)
5:        candidate.rank++
6:        candidateSet += candidate
7:     end for
8:  end for
9:  return candidateSet

Figure 3 shows an example of creating entries for Dict(asm,arb) from Dict(asm,eng) using the IW approach.

Figure 3: Example of generating new LexicalEntrys for Dict(asm,arb) from Dict(asm,eng) using the IW approach. The word “diha-juguti” in asm has two senses, as in Figure 2: “suggestion” and “advice”. This example only shows the IW approach finding the translation of “diha-juguti” with the sense “suggestion”. We find all Offset-POSs in the PWN containing “suggestion”. Then, we extract the words belonging to all these Offset-POSs from the PWN, FinnWordnet, WOLF and the Japanese Wordnet. Next, we translate the extracted words to arb and rank them based on their occurrence counts. According to the ranks, the best translation of “diha-juguti” in asm, the one with the greatest rank value, is the word “almshwrh” in arb.
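A minimal Python sketch of Algorithms 2 and 3, assuming NLTK with the PWN and Open Multilingual Wordnet data installed (nltk.download('wordnet'); nltk.download('omw-1.4')), a hypothetical translate(text, target_lang) MT wrapper, and OMW language codes 'fin', 'fra' and 'jpn' standing in for FWN, WWN and JWN:

from collections import Counter
from nltk.corpus import wordnet as wn

def find_candidate_set(sense_r, pos, target_lang, translate,
                       omw_langs=("eng", "fin", "fra", "jpn")):
    """Rank translation candidates for one Sense_R by how often they arise
    across the words of its synsets, drawn from several linked Wordnets."""
    ranks = Counter()
    for synset in wn.synsets(sense_r, pos=pos):    # all Offset-POSs containing Sense_R
        for lang in omw_langs:
            for word in synset.lemma_names(lang):  # words in this synset, per Wordnet
                candidate = translate(word, target_lang)  # assumed MT wrapper
                if candidate:
                    ranks[candidate] += 1          # occurrence count = rank value
    return ranks.most_common()                     # sorted in descending rank order

Keeping only the first n pairs of the returned list corresponds to the Top n dictionaries evaluated below.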

Experimental results

Data sets used

Our approach is general, but to demonstrate the effectiveness and usefulness of our algorithms, we have carefully selected a few languages for experimentation. These include widely spoken languages with limited computational resources, such as arb and vie; a language spoken by tens of millions in a specific region of India, viz., asm, with almost no resources; and a couple of languages on UNESCO’s list of endangered languages, viz., dis and ajz, both from northeast India, again with almost no resources at all.

We work with 5 existing bilingual dictionaries that translate a given language to a resource-rich language, which happens to be eng in our experiments: Dict(arb,eng) and Dict(vie,eng) supported by Panlex (http://panlex.org/); Dict(ajz,eng) and Dict(dis,eng) supported by Xobdo (http://www.xobdo.org/); and one Dict(asm,eng) created by integrating the two Dict(asm,eng) dictionaries provided by Panlex and Xobdo. The numbers of entries in Dict(ajz,eng), Dict(arb,eng), Dict(asm,eng), Dict(dis,eng) and Dict(vie,eng) are 4682, 53194, 76634, 6628 and 231665, respectively. Many LexicalEntrys in some of our input dictionaries have no POS information. For example, 100% and 6.63% of the LexicalEntrys in Dict(arb,eng) and Dict(vie,eng), respectively, do not have POS.

To solve the problem of ambiguities, we use the PWN and Wordnets in several other languages linked to the PWN, provided by the Open Multilingual Wordnet project (?): FinnWordnet (?) (FWN), WOLF (?) (WWN) and the Japanese Wordnet (?) (JWN). We choose these Wordnets because they are already aligned with the PWN, cover a large number of synsets for the roughly 5,000 most frequently used word senses, and are available online (http://compling.hss.ntu.edu.sg/omw/) for free. Depending on which Wordnets are used and the number of intermediate Wordnets, the sizes and qualities of the new dictionaries created change. The Microsoft Translator Java API (https://datamarket.azure.com/dataset/bing/microsofttranslator) is used as another main resource. The Microsoft Translator supports translations for 50 languages.

In our experiments, we create dictionaries from any of {ajz, arb, asm, dis, vie} to any non-eng language supported by the Microsoft Translator, e.g., arb, Chinese (cht), German (deu), Hmong Daw (mww), Indonesian (ind), Korean (kor), Malay (zlm), Thai (tha), Spanish (spa) and vie, as shown in Figure 4.

Figure 4: New bilingual dictionaries created

Results and human evaluation

Ideally, evaluation should be performed by volunteers who are fluent in both the source and target languages. However, we could not recruit any individuals who are experts in two appropriate languages. (This is not surprising, considering that the languages we focus on are disparate, belong to different families, have provenance spread around the world, and are frequently resource-poor and even endangered. For example, it is almost impossible to find an individual fluent in both the endangered language Karbi and Vietnamese.) Hence, every dictionary is evaluated by 2 people sitting together, one using the target language as mother tongue, and the other the source language. Each volunteer pair was requested to evaluate using a 5-point scale – 5: excellent, 4: good, 3: average, 2: fair and 1: bad. For selecting the samples to evaluate, we follow the concept of simple random sampling (?). According to general rules of thumb, we can be confident of the normal approximation whenever the sample size is at least 30 (?, page 310). To achieve reliable judgment, we randomly picked 100 translations from every dictionary created.

To study the effect of the available resources used to create our new dictionaries, we first evaluate the input dictionaries. The average scores of LexicalEntrys in the input dictionaries Dict(arb,eng), Dict(asm,eng) and Dict(vie,eng) are 3.58, 4.65 and 3.77, respectively. This essentially means that Dict(asm,eng) is of almost excellent quality, while the other two are of reasonably good quality. These are the best dictionaries we could find. The average score and the number of LexicalEntrys in the dictionaries we create using the DT approach are presented in Table 1. We believe that if we find “better” input dictionaries, our results will be commensurately better.

Dict. Score Entries Dict. Score Entries
arb-deu 4.29 1,323 arb-spa 3.61 1,709
arb-vie 3.66 2,048 asm-arb 4.18 47,416
asm-spa 4.81 20,678 asm-vie 4.57 42,743
vie-arb 2.67 85,173 vie-spa 3.55 35,004
Table 1: The average score and the number of LexicalEntrys in the dictionaries created using the DT approach.

The average scores and the numbers of LexicalEntrys in the dictionaries created by the IW approach are presented in Table 2 and Table 3, respectively. In these tables, Top n means dictionaries created by picking only the translations with the top n highest ranks for each word; A: dictionaries created using the PWN only; B: using the PWN and FWN; C: using the PWN, FWN and JWN; D: using the PWN, FWN, JWN and WWN. The method using all 4 Wordnets produces dictionaries with the highest scores and the highest numbers of LexicalEntrys as well.

Dict. Wordnets used
we create A B C D
arb-vie Top 1 3.42 3.65 3.33 3.71
Top 3 3.33 3.58 3.76 3.61
Top 5 2.99 3.04 3.08 3.31
asm-arb Top 1 4.51 3.83 4.69 4.67
Top 3 4.03 3.75 3.80 4.10
Top 5 3.78 3.85 3.42 4.00
asm-vie Top 1 4.43 4.31 3.86 4.43
Top 3 3.93 3.59 3.33 3.94
Top 5 3.74 3.34 3.4 2.91
vie-arb Top 1 3.11 2.94 2.78 3.11
Top 3 2.47 2.72 2.61 3.02
Top 5 2.54 2.37 2.60 2.73
Table 2: The average score of LexicalEntrys in the dictionaries we create using the IW approach.
Dict. Wordnets used
we create A B C D
arb-vie Top 1 1,786 2,132 2,169 2,200
Top 3 3,434 4,611 4,908 5,110
Top 5 4,123 5,926 6,529 6,853
asm-arb Top 1 27,039 27,336 27,449 27,468
Top 3 70,940 76,695 78,979 79,585
Top 5 104,732 118,261 125,087 126,779
asm-vie Top 1 25,824 26,898 27,064 27,129
Top 3 64,636 73,652 76,496 77,341
Top 5 92,863 111,977 120,090 122,028
vie-arb Top 1 63,792 65,606 66,040 65,862
Top 3 152,725 177,666 183,098 185,221
Top 5 210,220 261,392 278,117 282,398
Table 3: The number of LexicalEntrys in the dictionaries we create using the IW approach.

The number of LexicalEntrys and the accuracies of the newly created dictionaries definitely depend on the sizes and qualities of the input dictionaries. Therefore, if the sizes and the accuracies of the dictionaries we create are comparable to those of the input dictionaries, we conclude that the new dictionaries are acceptable. Using four Wordnets as intermediate resources to create new bilingual dictionaries increases not only the accuracies but also the number of LexicalEntrys in the dictionaries created. We also evaluate several bilingual dictionaries we create for a few of the language pairs. Table 4 presents the number of LexicalEntrys and the average score of some of the bilingual dictionaries generated using the four Wordnets.

Dict. Top 1 Top 3
we create Score Entries Score Entries
arb-deu 4.27 1,717 4.21 3,859
arb-spa 4.54 2,111 4.27 4,673
asm-spa 4.65 26,224 4.40 72,846
vie-spa 3.42 61,477 3.38 159,567
Table 4: The average score of entries and the number of LexicalEntrys in some other bilingual dictionaries constructed using 4 Wordnets: PWN, FWN, JWN and WWN.

The dictionaries created from arb to other languages have low accuracies because our algorithms rely on the POS of the LexicalUnits to find the Offset-POSs, and the input Dict(arb,eng) has no POS information. We were unable to access a better Dict(arb,eng) for free. For LexicalEntrys without POS, our algorithms choose the best POS of the eng word. For instance, the word “book” has two POSs, viz., “verb” and “noun”, of which “noun” is more common. Hence, all translations of the word “book” in Dict(arb,eng) get the same POS, “noun”. As a result, all LexicalEntrys translating to the word “book” are treated as nouns, leading to many wrong translations.
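A minimal sketch of this fallback; the paper does not specify how the “best” POS is determined, so here we approximate it (our own assumption) by the POS with the most PWN synsets:

from collections import Counter
from nltk.corpus import wordnet as wn

def best_pos(eng_word):
    """Pick a default POS for an eng word, approximated here by the POS with
    the most PWN synsets. E.g., 'book' has more noun synsets than verb
    synsets, so every entry translating to 'book' would be treated as a noun."""
    counts = Counter(s.pos() for s in wn.synsets(eng_word))
    return counts.most_common(1)[0][0] if counts else None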

Based on these experiments, we conclude that using the four public Wordnets, viz., the PWN, FWN, JWN and WWN, as intermediate resources, we are able to create good bilingual dictionaries, considering the dual objective of high quality and a large number of entries. In other words, the IW approach with the four intermediate Wordnets is our best approach. We note that if we include only the translations with the highest ranks, the resulting dictionaries have accuracies even better than the input dictionaries used. We are in the process of finding volunteers to evaluate the dictionaries translating from ajz and dis to other languages. Table 5 presents the numbers of entries in some of the dictionaries we created using the best approach, without human evaluation.

Dict. Entries Dict. Entries
ajz-arb 4,345 ajz-cht 3,577
ajz-deu 3,856 ajz-mww 4,314
ajz-ind 4,086 ajz-kor 4,312
ajz-zlm 4,312 ajz-spa 3,923
ajz-tha 4,265 ajz-vie 4,344
asm-cht 67,544 asm-deu 71,789
asm-mww 79,381 asm-ind 71,512
asm-kor 79,926 asm-zlm 80,101
asm-tha 78,317 dis-arb 7,651
dis-cht 6,120 dis-deu 6,744
dis-mww 7,552 dis-ind 6,762
dis-kor 7,539 dis-zlm 7,606
dis-spa 6,817 dis-tha 7,348
dis-vie 7,652
Table 5: The numbers of LexicalEntrys in some other dictionaries we created using the best approach. ajz and dis are endangered languages.

Comparing with existing approaches

It is difficult to compare approaches because the languages involved in different papers are different, the number and quality of input resources vary, and the evaluation methods are not standard. However, for the sake of completeness, we make an attempt at comparing our results with (?). The precision of the best dictionary created by (?) is 79.15%. Although our score is not a percentage, the average score of all dictionaries we created using 4 Wordnets and keeping the top-3 ranked LexicalEntrys is 3.87/5.00, with the highest score being 4.10/5.00, which means the entries are very good on average. If we look at the greatest ranks only (Top 1), the highest score is 4.69/5.00, which is almost excellent. We believe that we can apply these algorithms to create dictionaries from any source language that has a bilingual dictionary to eng.

To handle ambiguities, the existing methods need at least two intermediate dictionaries translating from the source language to intermediate languages. For example, to create Dict(asm,arb), (?) and (?) need at least two dictionaries, such as Dict(asm,eng) and Dict(asm,French). For asm, the second dictionary simply does not exist, to the best of our knowledge. The IW approach requires only one input dictionary. This is a strength of our method in the context of resource-poor languages.

Comparing with Google Translator

Our purpose in creating dictionaries is to use them for machine learning and machine translation. Therefore, we evaluate the dictionaries we create against a well-known high-quality MT: the Google Translator. We do not compare our work against the Microsoft Translator because we use it as an input resource.

We randomly pick 300 LexicalEntrys from each of our created dictionaries for language pairs supported by the Google Translator. Then, we compute the matching percentages between translations in our dictionaries and translations from the Google Translator. For example, in our created dictionary Dict(vie,spa), “người quảng cáo” in vie translates to “anunciante” in spa, which is the same as the translation given by the Google Translator. As a result, we mark the LexicalEntry (“người quảng cáo”, “anunciante”) as “matching”. The matching percentages between the Google Translator and our dictionaries Dict(arb,spa), Dict(arb,vie), Dict(arb,deu), Dict(vie,deu) and Dict(vie,spa) are 55.56%, 39.16%, 58.17%, 25.71% and 35.71%, respectively.
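A minimal sketch of this matching computation, assuming our dictionary is a list of (LexicalUnit, Sense) pairs and google_translate(text, target_lang) is a hypothetical wrapper around the Google Translator:

import random

def matching_percentage(dictionary, target_lang, google_translate, n=300):
    """Sample n LexicalEntrys and report the percentage whose translation
    matches the Google Translator's output exactly."""
    sample = random.sample(dictionary, min(n, len(dictionary)))
    matches = sum(
        1 for lexical_unit, sense in sample
        if google_translate(lexical_unit, target_lang) == sense
    )
    return 100.0 * matches / len(sample)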

A LexicalEntry marked “unmatched” does not necessarily mean our translation is incorrect. Table 6 presents some LexicalEntrys which are correct but are marked as “unmatched”.

Table 6: Some LexicalEntrys in dictionaries we created are correct but do not match with translations from the Google Translator. According to our evaluators, the translations from the Google Translator of the first four source words are bad.

Crowdsourcing for evaluation

To achieve better evaluation, we intend to use crowdsourcing. We are in the process of creating a Website where all the dictionaries we create will be available, along with a user-friendly interface to give feedback on individual entries. Our goal will be to use this feedback to improve the quality of the dictionaries.

Conclusion

We present two approaches to create a large number of good bilingual dictionaries from only one input dictionary, publicly available Wordnets and a machine translator. In particular, we created 48 new bilingual dictionaries from 5 input bilingual dictionaries. We note that 30 of the dictionaries we created cover language pairs not yet supported by any machine translator. We believe that our research will significantly increase the number of resources for languages which do not have many existing resources or are not supported by machine translators. This includes languages such as ajz, asm and dis, and tens of similar languages. We use Wordnets as intermediate resources to create new bilingual dictionaries because these Wordnets are available online for unfettered use and they contain information that can be used to remove ambiguities.

Acknowledgments

We would like to thank the volunteers who evaluated the dictionaries we created: Dubari Borah, Francisco Torres Reyes, Conner Clark and Tri Si Doan. We also thank all our friends in the Microsoft, Xobdo and Panlex projects who provided us with dictionaries.

References

  • [Ahn and Frampton 2006] Ahn, K., and Frampton, M. 2006. Automatic generation of translation dictionaries using intermediary languages. Cross-Language knowledge induction workshop of the EACL 06. Trento, Italy 41–44.
  • [Bond and Foster 2013] Bond, F., and Foster, R. 2013. Linking and extending an open multilingual Wordnet. In 51st Annual Meeting of the Association for Computational Linguistics (ACL). Bulgaria.
  • [Bond and Ogura 2008] Bond, F., and Ogura, K. 2008. Combining linguistic resources to create a machine-tractable Japanese-Malay dictionary. Language Resources and Evaluation 42:127–136.
  • [Bond et al. 2001] Bond, F.; Sulong, R. B.; Yamazaki, T.; and Ogura, K. 2001. Design and construction of a machine-tractable Japanese-Malay dictionary. In MT Summit-2001, Santiago de Compostela 53–58.
  • [Bouamor, Semmar, and Zweigenbaum 2013] Bouamor, D.; Semmar, N.; and Zweigenbaum, P. 2013. Using wordnet and semantic similarity for bilingual terminology mining from comparable corpora. In ACL, Sofia, Bulgaria.
  • [Brown 1997] Brown, R. D. 1997. Automated dictionary extraction for Knowledge-free example-based translation. In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation 111–118.
  • [Fellbaum 1998] Fellbaum, C. 1998. Wordnet: An electronic lexical database. Cambridge, MA: MIT Press.
  • [Gollins and Sanderson 2001] Gollins, T., and Sanderson, M. 2001. Improving cross language information retrieval with triangulated translation. SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, New York 90–95.
  • [Haghighi et al. 2008] Haghighi, A.; Liang, P.; Berg-Kirkpatrick, T.; and Klein, D. 2008. Learning bilingual lexicons from monolingual corpora. ACL 771–779.
  • [Hays and Winkler 1971] Hays, W. L., and Winkler, R. L. 1971. Statistics: Probability, inference and decision. Holt, Rinehart and Winston, Inc., New York, USA.
  • [Heja 2010] Heja, E. 2010. Dictionary building based on parallel corpora and word alignment. In: Dykstra, A. and Schoonheim, T., (eds): Proceedings of the XIV. EURALEX International Congress 341–352.
  • [Isahara et al. 2008] Isahara, H.; Bond, F.; Uchimoto, K.; Utiyama, M.; and Kanzaki., K. 2008. Development of Japanese Wordnet. In LREC, Marrakech.
  • [Istvan and Shoichi 2009] Istvan, V., and Shoichi, Y. 2009. Bilingual dictionary generation for low-resourced language pairs. EMNLP ’09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 2:862–870.
  • [Knight and Luk 1994] Knight, K., and Luk, S. K. 1994. Building a large-scale knowledge base for machine translation. Proceedings of AAAI 94:773–778.
  • [Lam and Kalita 2013] Lam, K. N., and Kalita, J. 2013. Creating reverse bilingual dictionaries. Proceedings of NAACL-HLT, Atlanta 524–528.
  • [Landau 1984] Landau, S. 1984. Dictionaries. Cambridge Univ Press.
  • [Linden and Carlson 2010] Linden, K., and Carlson, L. 2010. FinnWordnet - WordNet på finska via översättning. LexicoNordica, Nordic Journal of Lexicography 17:119–140.
  • [Ljubesic and Fiser 2011] Ljubesic, N., and Fiser, D. 2011. Bootstrapping bilingual lexicons from comparable corpora for closely related languages. In Proceedings of Text, Speech and Dialogue - 14th International Conference, TSD 2011, Pilsen, Czech Republic 6836:91–98.
  • [Mausam et al. 2010] Mausam; Soderland, S.; Etzioni, O.; Weld, D. S.; Reiter, K.; Skinner, M.; Sammer, M.; and Bilmes, J. 2010. Panlingual lexical translation via probabilistic inference. Artificial Intelligence 174:619–637.
  • [Nakov and Ng 2009] Nakov, P., and Ng, H. T. 2009. Improved statistical machine translation for resource-poor languages using related resource-rich languages. EMNLP ’09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA 3:1358–1367.
  • [Nerima and Wehrli 2008] Nerima, L., and Wehrli, E. 2008. Generating bilingual dictionaries by transitivity. Language Resources and Evaluation - LREC 2584–2587.
  • [Otero and Campos 2010] Otero, P. G., and Campos, J. R. 2010. Automatic generation of bilingual dictionaries using intermediate languages and comparable corpora. CICLing’10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing 473–483.
  • [Ross 2010] Ross, S. M. 2010. Introductory statistics. Academic Press.
  • [Sagot and Fiser 2008] Sagot, B., and Fiser, D. 2008. Building a free French WordNet from multilingual resources. In Proceedings of Ontolex 2008, Marrakech, Morocco.
  • [Shaw et al. 2013] Shaw, R.; Datta, A.; VanderMeer, D.; and Dutta, K. 2013. Building a scalable database-driven reverse dictionary. IEEE Transactions on Knowledge and Data Engineering 25(3).
  • [Tanaka and Umemura 1994] Tanaka, K., and Umemura, K. 1994. Construction of bilingual dictionary intermediated by a third language. COLING ’94 Proceedings of the 15th conference on Computational linguistics. Stroudsburg, PA, USA 1:297–303.