Creating Reverse Bilingual Dictionaries

Khang Nhut Lam
Department of Computer Science
University of Colorado
Colorado Springs, USA
[email protected]

Jugal Kalita
Department of Computer Science
University of Colorado
Colorado Springs, USA
[email protected]

Abstract

Bilingual dictionaries are expensive resources and not many are available when one of the languages is resource-poor. In this paper, we propose algorithms for creation of new reverse bilingual dictionaries from existing bilingual dictionaries in which English is one of the two languages. Our algorithms exploit the similarity between word-concept pairs using the English Wordnet to produce reverse dictionary entries. Since our algorithms rely on available bilingual dictionaries, they are applicable to any bilingual dictionary as long as one of the two languages has Wordnet type lexical ontology.

1 Introduction

The Ethnologue organization (http://www.ethnologue.com/) lists 6,809 distinct languages in the world, most of which are resource-poor. Most existing online bilingual dictionaries are between two resource-rich languages (e.g., English, Spanish, French or German) or between a resource-rich language and a resource-poor language. There are languages for which we are lucky to find a single bilingual dictionary online. For example, the University of Chicago hosts bilingual dictionaries for 29 South Asian languages (http://dsal.uchicago.edu/dictionaries/list.html), but many of these languages have only one bilingual dictionary online.

Existing algorithms for creating new bilingual dictionaries use intermediate languages or intermediate dictionaries to find chains of words with the same meaning. For example, [Gollins and Sanderson, 2001] use lexical triangulation to translate in parallel across multiple intermediate languages and fuse the results. They query several existing dictionaries and then merge the results to maximize accuracy, using four pivot languages: German, Spanish, Dutch and Italian. Another existing approach for creating bilingual dictionaries uses probabilistic inference [Mausam et al., 2010]: dictionaries are organized in a graph topology, and translations are inferred with random walks and probabilistic graph sampling. [Shaw et al., 2011] propose a set of algorithms to create a reverse dictionary in the context of a single language by using converse mapping. In particular, given an English-English dictionary, they attempt to find the original words or terms given a synonymous word or phrase describing the meaning of a word.

The goal of this research is to study the feasibility of creating a reverse dictionary by using only one existing dictionary and the Wordnet lexical ontology. For example, given a Karbi-English dictionary, we will construct an ENG-AJZ dictionary. (Karbi is an endangered language spoken by 492,000 people in Northeast India according to 2007 Ethnologue data; its ISO 639-3 code is AJZ, and the ISO 639-3 code for English is ENG.) The remainder of this paper is organized as follows. In Section 2, we discuss the nature of bilingual dictionaries. Section 3 describes the algorithms we propose to create new bilingual dictionaries from existing dictionaries. Results of our experiments are presented in Section 4. Section 5 concludes the paper.

2 Existing Online Bilingual Dictionaries

Powerful online translators developed by Google and Bing provide pairwise translations (including for individual words) for 65 and 40 languages, respectively. Wiktionary, a dictionary created by volunteers, supports over 170 languages. We find a large number of bilingual dictionaries at PanLex (http://panlex.org/), including an English-Hindi dictionary (ENG-HIN; HIN is the ISO 639-3 code for Hindi) and a Vietnamese-English dictionary (VIE-ENG; VIE is the ISO 639-3 code for Vietnamese). The University of Chicago has a number of bilingual dictionaries for South Asian languages. Xobdo (http://www.xobdo.org/) has a number of dictionaries focused on Northeast India.

We classify the many freely available dictionaries into three main kinds.

• Word to word dictionaries: These are dictionaries that translate one word in one language to one word or a phrase in another language. An example is an ENG-HIN dictionary at PanLex.

• Definition dictionaries: One word in one language has one or more meanings in the second language. It also may have pronunciation, parts of speech, synonyms and examples. An example is the VIE-ENG dictionary, also at PanLex.

• One language dictionaries: A dictionary of this kind is found at dictionary.com.

We have examined several hundred online dictionaries and found that they occur in many different formats. Extracting information from these dictionaries is arduous. We have experimented with five existing bilingual dictionaries: VIE-ENG, ENG-HIN, and a dictionary supported by Xobdo with 4 languages: Assamese (ASM), ENG, AJZ, and Dimasa (DIS). Assamese is an Indo-European language spoken by about 30 million people, but it is resource-poor; Dimasa is another endangered language from Northeast India, spoken by about 115,000 people. ASM and DIS are the ISO 639-3 codes of these two languages. We consider the Xobdo dictionary to be a collection of 3 bilingual dictionaries: ASM-ENG, AJZ-ENG, and DIS-ENG. We choose these languages since one of our goals is to work with resource-poor languages to enhance the quantity and quality of resources available.

3 Proposed Solution Approach

A dictionary entry, called LexicalEntry, is a 2-tuple <LexicalUnit, Definition>. A LexicalUnit is a word or a phrase being defined, also called definiendum [Landau, 1984]. A list of entries sorted by the LexicalUnit is called a lexicon or a dictionary. Given a LexicalUnit, the Definition associated with it usually contains its class and pronunciation, its meaning, and possibly additional information. The meaning associated with it can have several Senses. A Sense is a discrete representation of a single aspect of the meaning of a word. Thus, a dictionary entry is of the form <LexicalUnit, ${Sense}_{1}$, ${Sense}_{2}$, $\cdots$>.
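As a rough illustration of this structure, a LexicalEntry can be modeled as a small record. The Python names below (`LexicalEntry`, `lexical_unit`, `senses`) are assumptions made for this sketch and do not correspond to any particular dictionary file format; the two sample entries reuse Hindi words discussed later in the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexicalEntry:
    """One dictionary entry: a LexicalUnit together with its Senses."""
    lexical_unit: str                                 # word or phrase being defined
    senses: List[str] = field(default_factory=list)   # one string per Sense


# A lexicon (dictionary) is a list of entries sorted by LexicalUnit.
lexicon = sorted(
    [
        LexicalEntry("likhavat", ["script"]),
        LexicalEntry("hasta-lipi", ["handwriting"]),
    ],
    key=lambda entry: entry.lexical_unit,
)
first_unit = lexicon[0].lexical_unit
```

Pronunciation, word class and other parts of the Definition are left out here because the algorithms in this section only use the LexicalUnit and its Senses.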

In this section, we propose a series of algorithms, each of which automatically creates a reverse dictionary, or $ReverseDictionary$, from a dictionary that translates a word in language $L_{1}$ to a word or phrase in language $L_{2}$. We require that at least one of these two languages has a Wordnet-type lexical ontology [Miller, 1995]. Our algorithms create reverse dictionaries at various levels of accuracy and sophistication.

3.1 Direct Reversal (DR)

The existing dictionary has alphabetically sorted LexicalUnits in $L_{1}$ and each of them has one or more Senses in $L_{2}$ . To create ReverseDictionary, we simply take every pair < $LexicalUnit,Sense$ > in SourceDictionary and swap the positions of the two.

Algorithm 1 DR Algorithm

ReverseDictionary := $\phi$
for all ${LexicalEntry}_{i} \in$ SourceDictionary do
    for all ${Sense}_{j} \in {LexicalEntry}_{i}$ do
        Add tuple < ${Sense}_{j}$, ${LexicalEntry}_{i}.LexicalUnit$ > to ReverseDictionary
    end for
end for

This is a baseline algorithm against which we can measure the improvements of the later algorithms. If the sense definitions in the input dictionary are mostly single words, with an occasional simple phrase, even such a simple algorithm gives fairly good results. Senses that are long or complex phrases are skipped. The approach is easy to implement and produces a high-accuracy $ReverseDictionary$. However, the number of entries in the created dictionaries is limited, because this algorithm only swaps the positions of the LexicalUnit and the Sense of each entry in the $SourceDictionary$ and has no means of finding additional words with the same meanings.
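The direct reversal step can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the toy `source_dictionary`, the function name `direct_reversal`, and the `max_words` cutoff used to decide what counts as a "long or complex phrase" are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy SourceDictionary: LexicalUnit -> list of Senses.
source_dictionary: Dict[str, List[str]] = {
    "hasta-lipi": ["handwriting"],
    "likhavat": ["script"],
}


def direct_reversal(
    source: Dict[str, List[str]], max_words: int = 2
) -> List[Tuple[str, str]]:
    """Swap every <LexicalUnit, Sense> pair, skipping long phrases."""
    reverse: List[Tuple[str, str]] = []
    for lexical_unit, senses in source.items():
        for sense in senses:
            # max_words is an assumed cutoff for "long or complex phrases".
            if len(sense.split()) > max_words:
                continue
            reverse.append((sense, lexical_unit))
    # Sort so the result is ordered by the new LexicalUnit.
    return sorted(reverse)


reverse_dictionary = direct_reversal(source_dictionary)
```

With the toy input, `reverse_dictionary` holds one reversed pair per original sense, which shows why the size of the output is bounded by the number of senses in the source.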

3.2 Direct Reversal with Distance (DRwD)

To increase the number of entries in the output dictionary, we compute the distance between words in the Wordnet hierarchy. For example, the words "hasta-lipi" and "likhavat" in HIN have the meanings "handwriting" and "script", respectively. The distance between "handwriting" and "script" in Wordnet hierarchy is 0.0, so that "handwriting" and "script" likely have the same meaning. Thus, each of "hasta-lipi" and "likhavat" should have both meanings "handwriting" and "script". This approach helps us find additional words having the same meanings and possibly increase the number of lexical entries in the reverse dictionaries.

To create a ReverseDictionary, for every $LexicalEntry_{i}$ in the existing dictionary, we find all $LexicalEntry_{j},i\neq j$ whose distance to $LexicalEntry_{i}$ is equal to or smaller than a threshold $\alpha$. As a result, we have new pairs of entries < $LexicalEntry_{i}.LexicalUnit$ , $LexicalEntry_{j}.Sense$ >; we then swap the positions in these two-tuples and add them to the ReverseDictionary. The value of $\alpha$ affects both the number of entries and the quality of the created dictionaries: the greater the value of $\alpha$, the larger the number of lexical entries, but the lower the accuracy of the ReverseDictionary.

The distance between two LexicalEntrys is the distance between their two $LexicalUnit$s if the LexicalUnits occur in the Wordnet ontology; otherwise, it is the distance between their two $Sense$s. The distance between two phrases is the average of the distances between every pair of words taken from the two phrases [Wu and Palmer, 1994]. A distance of 1.00 means there is no similarity between the words or phrases; if they have the same meaning, the distance is 0.00.
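The phrase-level averaging can be illustrated as follows. This is a sketch under stated assumptions: the `WORD_DISTANCE` table holds made-up values standing in for a Wordnet-based word distance, unknown word pairs default to 1.0, and each word pair is formed by taking one word from each phrase.

```python
from itertools import product
from typing import Dict, FrozenSet

# Hypothetical word-level distances in [0, 1]. In practice these would come
# from a Wordnet-based measure; the values here are made up for illustration.
WORD_DISTANCE: Dict[FrozenSet[str], float] = {
    frozenset({"handwriting", "script"}): 0.0,
    frozenset({"bow", "kneel"}): 0.2,
    frozenset({"low", "kneel"}): 0.8,
}


def word_distance(a: str, b: str) -> float:
    """Return 0.0 for identical words, the table value if known, else 1.0."""
    if a == b:
        return 0.0
    return WORD_DISTANCE.get(frozenset({a, b}), 1.0)


def phrase_distance(phrase_a: str, phrase_b: str) -> float:
    """Average the word-level distance over every cross-phrase word pair."""
    pairs = list(product(phrase_a.split(), phrase_b.split()))
    if not pairs:
        return 1.0  # assumed convention: an empty phrase matches nothing
    return sum(word_distance(a, b) for a, b in pairs) / len(pairs)
```

For single words the function reduces to the word-level distance, so `phrase_distance("handwriting", "script")` is 0.0 under the toy table, while a two-word phrase compared with one word averages two table values.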

We find that a $ReverseDictionary$ created using the value 0.0 for $\alpha$ has the highest accuracy. This approach significantly increases the number of entries in the ReverseDictionary. However, there is an issue in this approach. For instance, the word "tuhbi" in DIS means "crowded", "compact", "dense", or "packed". Because the distance between the English words "slow" and "dense" in Wordnet is 0.0, this algorithm concludes that "slow" has the meaning "tuhbi" also, which is wrong.

Algorithm 2 DRwD Algorithm

ReverseDictionary := $\phi$
for all ${LexicalEntry}_{i} \in$ SourceDictionary do
    for all ${Sense}_{j} \in {LexicalEntry}_{i}$ do
        for all ${LexicalEntry}_{u} \in$ SourceDictionary do
            for all ${Sense}_{v} \in {LexicalEntry}_{u}$ do
                if distance(< ${LexicalEntry}_{i}.LexicalUnit$, ${Sense}_{j}$ >, < ${LexicalEntry}_{u}.LexicalUnit$, ${Sense}_{v}$ >) $\leqslant \alpha$ then
                    Add tuple < ${Sense}_{j}$, ${LexicalEntry}_{u}.LexicalUnit$ > to ReverseDictionary
                end if
            end for
        end for
    end for
end for
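A compact Python sketch of the DRwD loop is given here. It assumes the LexicalUnits are not in Wordnet, so the distance is computed between the English Senses; `toy_distance` is a made-up stand-in for a Wordnet-based distance, and the function name `drwd` is introduced only for this example.

```python
from typing import Callable, Dict, List, Set, Tuple


def toy_distance(a: str, b: str) -> float:
    """Made-up distance: 0.0 for equal words or one known pair, else 1.0."""
    if a == b or {a, b} == {"handwriting", "script"}:
        return 0.0
    return 1.0


def drwd(
    source: Dict[str, List[str]],
    distance: Callable[[str, str], float],
    alpha: float = 0.0,
) -> List[Tuple[str, str]]:
    """Pair each Sense with every LexicalUnit that has a Sense within alpha."""
    reverse: Set[Tuple[str, str]] = set()
    for senses_i in source.values():
        for sense_j in senses_i:
            for unit_u, senses_u in source.items():
                for sense_v in senses_u:
                    if distance(sense_j, sense_v) <= alpha:
                        reverse.add((sense_j, unit_u))
    return sorted(reverse)


source_dictionary = {"hasta-lipi": ["handwriting"], "likhavat": ["script"]}
reverse_dictionary = drwd(source_dictionary, toy_distance, alpha=0.0)
```

Because every sense is at distance 0.0 from itself, the output always contains the direct-reversal pairs; the extra pairs come from other entries whose senses fall within the threshold. The nested loops compare all pairs of senses, so the cost grows quadratically with the number of senses in the source dictionary.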

3.3 Direct Reversal with Similarity (DRwS)

The DRwD approach simply computes the distance between two senses, but does not look at the meanings of the senses in any depth. The DRwS approach represents a concept in terms of its Wordnet synset (a synset is a set of cognitive synonyms), synonyms, hyponyms and hypernyms. This approach is like the DRwD approach, but instead of computing the distance between the lexical entries in each pair, we calculate their similarity, called simValue. If the simValue of a pair < $LexicalEntry_{i}$ , $LexicalEntry_{j}$ >, $i\neq j$, is equal to or larger than $\beta$, we conclude that $LexicalEntry_{i}$ has the same meaning as $LexicalEntry_{j}$.

To calculate simValue between two phrases, we obtain the ExpansionSet for every word in each phrase from the WordNet database. An ExpansionSet of a phrase is a union of synset, and/or synonym, and/or hyponym, and/or hypernym of every word in it. We compare the similarity between the ExpansionSets. The value of $\beta$ and the kinds of ExpansionSets are changed to create different ReverseDictionarys. Based on experiments, we find that the best value of $\beta$ is 0.9, and the best ExpansionSet is the union of synset, synonyms, hyponyms, and hypernyms. The algorithm for computing the simValue of entries is shown in Algorithm 3.
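Building an ExpansionSet can be sketched as a union over per-word lookups. The relation tables below are made-up stand-ins for Wordnet queries, and including each word itself in its own expansion (as a member of its synset) is an assumption of this sketch.

```python
from typing import Dict, List, Set

# Hypothetical relation tables standing in for Wordnet lookups.
SYNONYMS: Dict[str, Set[str]] = {"dense": {"compact", "crowded"}}
HYPERNYMS: Dict[str, Set[str]] = {"dense": {"thick"}}
HYPONYMS: Dict[str, Set[str]] = {}


def expansion_set(phrase: str, relations: List[Dict[str, Set[str]]]) -> Set[str]:
    """Union the chosen relations over every word in the phrase."""
    expanded: Set[str] = set()
    for word in phrase.split():
        expanded.add(word)  # assumed: a word belongs to its own synset
        for table in relations:
            expanded |= table.get(word, set())
    return expanded


synonyms_only = expansion_set("dense", [SYNONYMS])
full_expansion = expansion_set("dense", [SYNONYMS, HYPONYMS, HYPERNYMS])
```

Passing a different list of relation tables corresponds to changing the kind of ExpansionSet, which is one of the two settings varied in the experiments.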

Algorithm 3 simValue($LexicalEntry_{i}$, $LexicalEntry_{j}$)

simWords := $\phi$
if $LexicalEntry_{i}.LexicalUnit$ and $LexicalEntry_{j}.LexicalUnit$ have a Wordnet lexical ontology then
    for all (${LexicalUnit}_{u} \in LexicalEntry_{i}$) & (${LexicalUnit}_{v} \in LexicalEntry_{j}$) do
        Find ExpansionSet of every LexicalEntry based on LexicalUnit
    end for
else
    for all (${Sense}_{u} \in LexicalEntry_{i}$) & (${Sense}_{v} \in LexicalEntry_{j}$) do
        Find ExpansionSet of every LexicalEntry based on Sense
    end for
end if
simWords $\leftarrow$ ExpansionSet($LexicalEntry_{i}$) $\cap$ ExpansionSet($LexicalEntry_{j}$)
n $\leftarrow$ ExpansionSet($LexicalEntry_{i}$).length
m $\leftarrow$ ExpansionSet($LexicalEntry_{j}$).length
simValue $\leftarrow$ min{ $\frac{simWords.length}{n}$, $\frac{simWords.length}{m}$ }
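The final steps of Algorithm 3 reduce to a set intersection and a minimum of two ratios. The sketch below takes the two ExpansionSets as ready-made Python sets; the sample sets are made up, and returning 0.0 when either set is empty is an assumed convention to avoid division by zero.

```python
from typing import Set


def sim_value(expansion_i: Set[str], expansion_j: Set[str]) -> float:
    """min(|shared| / n, |shared| / m) for two ExpansionSets of size n and m."""
    if not expansion_i or not expansion_j:
        return 0.0  # assumed convention for empty ExpansionSets
    sim_words = expansion_i & expansion_j
    n = len(expansion_i)
    m = len(expansion_j)
    return min(len(sim_words) / n, len(sim_words) / m)


# Made-up ExpansionSets: three shared words, sizes 4 and 5.
set_i = {"dense", "compact", "crowded", "thick"}
set_j = {"dense", "compact", "crowded", "packed", "full"}
value = sim_value(set_i, set_j)  # min(3/4, 3/5)
```

Taking the minimum means that a small set fully contained in a much larger one still scores low, so two entries are judged to share a meaning only when most of both expansions overlap. With a threshold of 0.9, the sample pair above would not be merged.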

4 Experimental results

The goals of our study are to create high-precision reverse dictionaries and to increase the number of lexical entries in the created dictionaries. Evaluations were performed by volunteers who are fluent in both the source and destination languages. To achieve reliable judgment, we use the same set of 100 ENG words that are not stop words, randomly chosen from a list of the most common words (http://www.world-english.org/english500.htm). We randomly pick 50 words from each created $ReverseDictionary$ for evaluation. Each volunteer was asked to rate entries on a 5-point scale: 5 (excellent), 4 (good), 3 (average), 2 (fair), and 1 (bad). The average scores of entries in the ReverseDictionarys are presented in Figure 1. The DRwS dictionaries are the best in each case. The percentage of agreement between raters is around 70% in all cases.

The dictionaries we work with frequently have several meanings for a word. Some of these meanings are unusual, rare or very infrequently used. The DR algorithm creates entries for the rare or unusual meanings by direct reversal. We noticed that our evaluators do not like such entries in the reversed dictionaries and mark them low. This results in lower average scores for the DR algorithm compared to the average scores for the DRwS algorithm. The DRwS algorithm seems to have removed a number of such unusual or rare meanings (and entries similar to the rare meanings, recursively), improving the average score.

Our proposed approaches do not work well for dictionaries containing an abundance of complex phrases. The original dictionaries, except the VIE-ENG dictionary, do not contain many long phrases or complex words. In Vietnamese, most words we find in the dictionary can be considered compound words composed of simpler words put together. However, the component words are separated by spaces. For example, "bái thần giáo" means "idolatry". The component words are "bái" meaning "bow low"; "thần" meaning "deity"; and "giáo" meaning "lance", "spear", "to teach", or "to educate". The presence of a large number of compound words written in this manner causes problems with the ENG-VIE dictionary. If we look closely at Figure 1, all language pairs except ENG-VIE show substantial improvement in score when we compare the DR algorithm with the DRwS algorithm.

Figure 1: Average entry score in $ReverseDictionary$

The DRwD approach significantly increases the number of entries, but the accuracy of the created dictionaries is much lower. The DRwS approach using a union of synset, synonyms, hyponyms, and hypernyms of words, and $\beta\geq 0.9$ produces the best reverse dictionaries for each language pair. The DRwS approach increases the number of entries in the created dictionaries compared to the DR algorithm as shown in Figure 2.

We also create the entire reverse dictionary for the AJZ-ENG dictionary. The total numbers of entries in the ENG-AJZ dictionaries created by using the DR algorithm and the DRwS algorithm are 4677 and 5941, respectively. We then pick 100 random words from the ENG-AJZ dictionary created by using the DRwS algorithm for evaluation. The average score of the entries in this created dictionary is 4.07. Some of the reverse bilingual dictionaries can be downloaded at http://cs.uccs.edu/ linclab/creatingBilingualLexicalResource.html.

5 Conclusion

We proposed approaches to create a reverse dictionary from an existing bilingual dictionary using Wordnet. We showed that a high-precision reverse dictionary can be created without using any other intermediate dictionaries or languages. Using the Wordnet hierarchy increases the number of entries in the created dictionaries. We performed experiments with several resource-poor languages, including two that are on UNESCO's list of endangered languages.

Acknowledgements

We would like to thank the volunteers evaluating the dictionaries we create: Morningkeey Phangcho, Dharamsing Teron, Navanath Saharia, Arnab Phonglosa, Abhijit Bendale, and Lalit Prithviraj Jain. We also thank all friends in the Xobdo project who provided us with the ASM-ENG-DIS-AJZ dictionaries.

References

[Mausam et al., 2010] Mausam, S. Soderland, O. Etzioni, D.S. Weld, K. Reiter, M. Skinner, M. Sammer, and J. Bilmes. 2010. Panlingual lexical translation via probabilistic inference, Artificial Intelligence, 174:619–637.
[Shaw et al., 2011] R. Shaw, A. Datta, D. VanderMeer, and K. Datta. 2011. Building a scalable database-driven reverse dictionary, IEEE Transactions on Knowledge and Data Engineering, volume 99.
[Gollins and Sanderson, 2001] T. Gollins and M. Sanderson. 2001. Improving cross language information retrieval with triangulated translation, SIGIR '01 Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, New York, 90–95.
[Landau, 1984] S.I. Landau. 1984. Dictionaries, Cambridge Univ Press.
[Miller, 1995] G.A. Miller. 1995. Wordnet: a lexical database for English, Communications of the ACM, 38(11):39–41.
[Wu and Palmer, 1994] Z. Wu and M. Palmer. 1994. Verb semantics and lexical selection, In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, Stroudsburg, 133–138.