
Multilingual Neural Machine Translation:
Can Linguistic Hierarchies Help?

Fahimeh Saleh  Wray Buntine  Gholamreza Haffari  Lan Du
[email protected]
Monash University
  Corresponding author
Abstract

Multilingual Neural Machine Translation (MNMT) trains a single NMT model that supports translation between multiple languages, rather than training separate models for different languages. Learning a single model can enhance the low-resource translation by leveraging data from multiple languages. However, the performance of an MNMT model is highly dependent on the type of languages used in training, as transferring knowledge from a diverse set of languages degrades the translation performance due to negative transfer. In this paper, we propose a Hierarchical Knowledge Distillation (HKD) approach for MNMT which capitalises on language groups generated according to typological features and phylogeny of languages to overcome the issue of negative transfer. HKD generates a set of multilingual teacher-assistant models via a selective knowledge distillation mechanism based on the language groups, and then distills the ultimate multilingual model from those assistants in an adaptive way. Experimental results derived from the TED dataset with 53 languages demonstrate the effectiveness of our approach in avoiding the negative transfer effect in MNMT, leading to an improved translation performance (about 1 BLEU score on average) compared to strong baselines.

1 Introduction

The surge over the past few decades in the number of languages used in electronic texts for international communication has prompted Machine Translation (MT) systems to shift towards multilingualism. However, most successful MT applications, i.e., Neural Machine Translation (NMT) systems, usually rely on supervised deep learning, which is notoriously data-hungry Koehn and Knowles (2017). Despite decades of research, high-quality annotated MT resources are only available for a subset of the world’s thousands of languages Paolillo and Das (2006). Hence, data scarcity is one of the significant challenges that come along with language diversity and multilingualism in MT. One of the most widely researched approaches to tackle this problem is unsupervised learning, which takes advantage of available unlabeled data in multiple languages Lample et al. (2017); Arivazhagan et al. (2019); Snyder et al. (2010); Xu et al. (2019). However, unsupervised approaches have relatively lower performance compared to their supervised counterparts Dabre et al. (2020). A supervised alternative is Multilingual NMT (MNMT), which shares training data across languages; nevertheless, the performance of supervised MNMT models is highly dependent on the types of languages used to train the model Tan et al. (2019a). If languages are from very distant language families, they can lead to negative transfer Torrey and Shavlik (2010); Rosenstein (2005), causing lower translation quality compared to the individual bilingual counterparts.

To address this problem, some improvements have been achieved recently with solutions that employ some form of supervision to guide MNMT using linguistic typology Oncevay et al. (2020); Chowdhury et al. (2020); Kudugunta et al. (2019); Bjerva et al. (2019). Linguistic typology provides this supervision by characterising the world’s languages according to their functional and structural properties O’Horan et al. (2016). Taking advantage of this property, which captures both language similarity and language diversity, our approach combines two strategies for training an MNMT model: (a) creating a universal, language-independent MNMT model Johnson et al. (2017); and (b) systematically designing the possible variations of language-dependent MNMT models based on the language relations Maimaiti et al. (2019).

Our approach to preventing negative transfer in MNMT is to group languages that behave similarly into separate language clusters. Then, we perform Knowledge Distillation (KD) Hinton et al. (2015) by selectively distilling the knowledge of the bilingual teacher models within the same language cluster into a multilingual teacher-assistant model. The intermediate teacher-assistant models are representative of their own language clusters. We further adaptively distill knowledge from the multilingual teacher-assistant models to the ultimate multilingual student. In summary, our main contributions are as follows:

  • We use cluster-based teachers in a hierarchical knowledge distillation approach to prevent negative transfer in MNMT. Different from the previous cluster-based approaches in multilingual settings Oncevay et al. (2020); Tan et al. (2019a), our approach makes use of all the clusters with a universal MNMT model while retaining the language relatedness structure in a hierarchy.

  • We distill the ultimate MNMT model from multilingual teacher-assistant models, each of which represents one language family and usually performs better than the individual bilingual models from the same language family. Thus, the cluster-based teacher-assistant models can lead to a better knowledge distillation compared to a diverse set of bilingual teacher models as used in multilingual KD Tan et al. (2019b).

  • We explore a mixture of linguistic features by utilizing different clustering approaches to obtain the cluster-based teacher-assistants. As the language groups created by different language feature vectors can contribute differently to translation, we adaptively distill knowledge from the teacher-assistant models to the ultimate student to reduce the knowledge gap of the student.

  • We perform extensive experiments on 53 languages, showing the effectiveness of our approach in avoiding negative transfer in MNMT, leading to an improved translation performance (about 1 BLEU score on average) compared to strong baselines. We also conduct comprehensive ablation studies and analysis, demonstrating the impact of language clustering in MNMT for different language families and in different resource-size scenarios.

2 Related Work

The majority of works on MNMT mainly focus on different architectural choices varying in the degree of parameter sharing in the multilingual setting. For example, the works based on the idea of minimal parameter sharing share either the encoder, the decoder, or the attention module Firat et al. (2017); Lu et al. (2018), and those with complete parameter sharing tend to share entire models Johnson et al. (2017); Ha et al. (2016). In general, these techniques implicitly assume that a set of languages is pre-given, without considering the positive or negative effect of language transfer between the languages shared in one model. Hence, they can usually achieve results comparable to individual models (trained with individual language pairs) only when the languages are less diverse or the number of languages is small. When several diverse language pairs are involved in training an MNMT system, negative transfer Torrey and Shavlik (2010); Rosenstein (2005) usually happens between more distant languages, resulting in degraded translation accuracy in the multilingual setting. To address this problem, Tan et al. (2019a) suggested a clustering approach using either prior knowledge of language families or language embeddings. They obtained the language embedding by retrieving the representation of a language tag which is added to the input of an encoder in a universal MNMT model. Later, Oncevay et al. (2020) introduced another clustering technique using a multi-view language representation. They fused language embeddings learned in an MNMT model with syntactic features of a linguistic knowledge base Dryer and Haspelmath (2013). Tan et al. (2019b) proposed a knowledge distillation approach which transfers knowledge from bilingual teachers to a multilingual student when the accuracy of the teachers is higher than that of the student. Their approach eliminates the accuracy gap between the bilingual and multilingual NMT models. However, we argue that distilling knowledge from a diverse set of parent models into a student model can be sub-optimal, as the parents may compete instead of collaborating with each other, resulting in negative transfer.

3 Hierarchical Knowledge Distillation

We address the problem of data scarcity and negative transfer in MNMT with a Hierarchical Knowledge Distillation (HKD) approach.

Figure 1: HKD approach: In the first phase of knowledge distillation, aka “Selective KD”, knowledge is transferred from the bilingual teacher models in each cluster (orange circles) to the multilingual teacher-assistant models (green circles). For example, $T_{l_1}$, $T_{l_2}$, $T_{l_4}$, and $T_{l_6}$ belong to one cluster and are distilled into the teacher-assistant model $T_{c_1}$. In the second KD phase, aka “Adaptive KD”, knowledge is transferred adaptively from the ensemble of related intermediate teacher-assistant models to the ultimate student (red circle).

The hierarchy in HKD is constructed in such a way that the node structure captures the similarity structure and the relatedness of the languages. Specifically, in the inverse pyramidal structure shown in Figure 1, the root node corresponds to the ultimate MNMT model that we aim to train, the leaf nodes correspond to the individual bilingual NMT models, and the non-terminal nodes represent the language clusters. Our hypothesis is that, by leveraging the common characteristics of languages in the same group, formed using clustering algorithms based on the typological properties of languages O’Horan et al. (2016), the HKD method can train a high-quality MNMT model by distilling knowledge from related languages rather than diverse ones.

Our HKD approach consists of two knowledge distillation mechanisms, providing two levels of supervision for training the ultimate MNMT model (illustrated in Figure 1): (i) selective distillation of knowledge from the individual bilingual teachers to the multilingual intermediate teacher-assistants, each of which corresponds to one language group; and (ii) adaptive distillation of knowledge from all related cluster-wise teacher-assistants to the super-multilingual ultimate student model, performed in each mini-batch of training per language pair. Note that we do not use adaptive KD in both distillation phases, as adaptive KD requires the predictions of all the relevant experts; this is particularly impractical in the first phase, where there is a huge set of diverse teachers. Hence, in the first distillation phase, we generate the cluster-wise teacher-assistants using selective KD as the prerequisite for the adaptive KD phase. The two phases are sketched below, and the main steps of HKD are elaborated as follows:
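To make the overall pipeline concrete, the following is a minimal Python sketch of the two-phase procedure. It is an illustration only: train_selective_kd and train_adaptive_kd are hypothetical wrappers around the selective and adaptive KD training loops, and bilingual_teachers, corpora, and all_clusters are assumed to be prepared beforehand.

def hkd(all_clusters, bilingual_teachers, corpora):
    # Phase 1 (selective KD): one multilingual teacher-assistant per cluster,
    # distilled from the bilingual teachers of that cluster's languages.
    assistants = []
    for cluster in all_clusters:
        teachers = {l: bilingual_teachers[l] for l in cluster}
        data = {l: corpora[l] for l in cluster}
        assistants.append(train_selective_kd(teachers, data))  # hypothetical wrapper
    # Phase 2 (adaptive KD): distill all effective, cluster-wise assistants
    # into a single many-to-one student with per-batch weights alpha.
    student = train_adaptive_kd(assistants, all_clusters, corpora)  # hypothetical wrapper
    return student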

Clustering: Clustering can be conducted using different types of language vectors, such as: (i) sparse language vectors from typological knowledge bases (KBs), (ii) dense language embedding vectors learned from multilingual NLP tasks, and (iii) the combination of KB and task-learned language vectors. The implicit causal relationships between languages are usually learned from translation tasks, while the genetic, geographical, and structural similarities between languages are extracted from typological KBs Bjerva et al. (2019). Thus, the language groups created by different language vectors can contribute differently to translation, and it is not quite clear which types of language features are more helpful in MNMT systems Oncevay et al. (2020). For example, “Greek” can be clustered with “Arabic” and “Hebrew” based on the mix of KB and task-learned language vectors, while it can be clustered with “Macedonian” and “Bulgarian” based on NMT-learned language vectors. Therefore, we cluster the languages based on all types of language representations and propose to explore a mixture of linguistic features by utilizing all clusters in training the ultimate MNMT student. Given a training dataset consisting of $L$ languages and $K$ clustering approaches, where the $k$-th clustering approach creates $n_{k}$ clusters, we are interested in training a many-to-one MNMT model (the ultimate student) by hierarchically distilling knowledge from all $M$ clusters to the ultimate student, where $M:=\sum_{k=1}^{K}n_{k}$.
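As an illustration of this clustering step, the sketch below groups languages with k-means over each representation type and takes the union of all resulting clusters. The toy language vectors, the language subset, and the cluster counts are assumptions for the sake of a runnable example, not the paper's exact settings.

import numpy as np
from sklearn.cluster import KMeans

def cluster_languages(lang_vectors, n_clusters):
    """Group languages given one type of language representation."""
    langs = sorted(lang_vectors)
    X = np.stack([lang_vectors[l] for l in langs])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    clusters = [[] for _ in range(n_clusters)]
    for lang, label in zip(langs, labels):
        clusters[label].append(lang)
    return clusters

# Toy language vectors purely for illustration (real ones would come from
# WALS features, NMT-learned language embeddings, and their fused view).
rng = np.random.default_rng(0)
langs = ["el", "ar", "he", "mk", "bg", "gl", "pt", "es"]
kb_vectors = {l: rng.random(8) for l in langs}
nmt_vectors = {l: rng.random(16) for l in langs}
fused_vectors = {l: rng.random(12) for l in langs}

# One clustering per representation type; the union of the resulting clusters
# gives the M = sum_k n_k groups used to train the teacher-assistants.
all_clusters = []
for vectors, n_k in [(kb_vectors, 2), (nmt_vectors, 3), (fused_vectors, 3)]:
    all_clusters.extend(cluster_languages(vectors, n_k))
M = len(all_clusters)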

Multilingual selective knowledge distillation: Assume we have a language cluster that consists of $L^{\prime}$ languages, indexed by $l\in\{1,2,\dots,L^{\prime}\}$. Given a collection of pretrained individual teacher models $\{\theta^{l}\}_{l=1}^{L^{\prime}}$, each handling one language pair in $\{\mathcal{D}^{l}\}_{l=1}^{L^{\prime}}$, and inspired by Tan et al. (2019b), we use the following knowledge distillation objective for each language $l$ in the cluster.

\mathcal{L}_{KD}^{selective}(\mathcal{D}^{l},\theta^{c},\theta^{l}):=-\sum_{{\bm{x}},{\bm{y}}\in\mathcal{D}^{l}}\sum_{t=1}^{|{\bm{y}}|}\sum_{v\in V}Q(v|{\bm{y}}_{<t},{\bm{x}},\theta^{l})\log P(v|{\bm{y}}_{<t},{\bm{x}},\theta^{c})   (1)

where $\theta^{c}$ is the teacher-assistant model, $V$ is the output vocabulary, $P(\cdot\mid\cdot)$ is the conditional probability of the teacher-assistant model, and $Q(\cdot\mid\cdot)$ denotes the output distribution of the bilingual teacher model. According to Eq. (1), knowledge distillation regularises the predictive probabilities generated by a cluster-wise multilingual model towards those generated by each individual bilingual model. Together with the translation loss ($\mathcal{L}_{NLL}$), we have the following selective KD loss for generating the intermediate teacher-assistant model:

\mathcal{L}_{ALL}^{selective}(\mathcal{D}^{l},\theta^{c},\theta^{l}):=(1-\lambda)\,\mathcal{L}_{NLL}(\mathcal{D}^{l},\theta^{c})+\lambda\,\mathcal{L}_{KD}^{selective}(\mathcal{D}^{l},\theta^{c},\theta^{l})

where $\lambda$ is a tuning parameter that balances the contribution of the two losses. Instead of using all language pairs in Eq. (1), we use a deterministic but dynamic approach that excludes a language pair from the distillation term if the multilingual student surpasses the corresponding individual model during training, which makes the training selective.
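For concreteness, here is a minimal PyTorch-style sketch of the objective in Eq. (1) combined with the translation loss as above. Tensor shapes and the function name are illustrative, the selection logic that switches the distillation term on and off is omitted, and λ = 0.6 follows the setting reported in Section A.2.

import torch
import torch.nn.functional as F

def selective_kd_loss(student_logits, teacher_logits, tgt, pad_id, lam=0.6):
    """(1 - lam) * NLL + lam * word-level KD towards the bilingual teacher.

    student_logits / teacher_logits: (batch, tgt_len, |V|); tgt: (batch, tgt_len).
    """
    log_p = F.log_softmax(student_logits, dim=-1)       # student log P(v | y_<t, x)
    q = F.softmax(teacher_logits, dim=-1).detach()      # teacher distribution Q(v | y_<t, x)
    mask = tgt.ne(pad_id).float()                       # ignore padding positions
    kd = (-(q * log_p).sum(-1) * mask).sum() / mask.sum()              # Eq. (1), token-averaged
    nll = F.nll_loss(log_p.transpose(1, 2), tgt, ignore_index=pad_id)  # translation loss
    return (1.0 - lam) * nll + lam * kd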

Input: Training corpora: $\{\mathcal{D}^{l}\}_{l=1}^{L}$, where $\mathcal{D}^{l}:=\{({\bm{x}}_{1}^{l},{\bm{y}}_{1}),\dots,({\bm{x}}_{n}^{l},{\bm{y}}_{n})\}$;
List of languages: $L$;
List of language clusters: $\{C^{m}\}_{m=1}^{M}$;
Cluster-based MNMT models: $\{\theta^{c}\}_{c=1}^{M}$;
Total training epochs: $N$;
Batch size: $J$;
Output: Ultimate multilingual student model: $\theta_{s}$;
Randomly initialize the multilingual model $\theta_{s}$, the accumulated gradient $g=0$, and the distillation flag $f^{l}=\mathrm{True}$ for $l\in L$;
$n=0$;
while $n<N$ do
    $g=0$;
    for $l\in L$ do
        // find the effective clusters containing similar languages
        $C_{sim}=[\,]$;
        for $c\in\{C\}_{1}^{M}$ do
            if $l\in c$ then
                $C_{sim}.\mathrm{append}(c)$;
        $D^{l}=\mathrm{random\_permute}(\mathcal{D}^{l})$;
        ${\bm{b}}_{1}^{l},\dots,{\bm{b}}_{J}^{l}=\mathrm{create\_minibatches}(\mathcal{D}^{l})$;   // where $b^{l}=(x^{l},y)$
        $j=1$;
        while $j\leq J$ do
            // compute contribution weights
            for $c\in C_{sim}$ do
                $\Delta_{c}=-\mathrm{ppl}(\theta^{c}({\bm{b}}_{j}^{l}))$;
            ${\bm{\alpha}}=\mathrm{softmax}(\Delta_{1},\dots,\Delta_{C_{sim}})$;
            // compute the gradient of the loss $\mathcal{L}_{ALL}^{adapt.}$
            ${\bm{g}}=\nabla_{\theta_{s}}\mathcal{L}_{ALL}^{adapt.}({\bm{b}}_{j}^{l},\theta_{s},\{\theta^{c}\}_{1}^{C_{sim}},{\bm{\alpha}})$;
            // update the parameters
            $\theta_{s}=\mathrm{update\_param}(\theta_{s},{\bm{g}})$;
            $j=j+1$;
    $n=n+1$;
Algorithm 1: Multilingual Adaptive KD

This selective distillation process (the training algorithm for selective knowledge distillation is summarized in Alg. 2 in Section A.1 of the Appendix and is similar to the one used in Tan et al. (2019b)) is applied to all clusters obtained from the different clustering approaches. It is noteworthy that (i) the selective knowledge distillation generates a teacher-assistant model for each cluster $c\in\{1,2,\dots,M\}$; and (ii) each language can belong to multiple clusters due to the use of different language representations, so there can be more than one effective teacher-assistant model for any given language pair (illustrated in Figure 2). Hence, for each language pair, we have a set of effective clusters indexed by $c\in\{1,2,\dots,C_{sim}\}$.
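A small sketch of how the effective teacher-assistants for a language pair can be collected; all_clusters and assistants are hypothetical containers produced by the clustering and selective-KD steps above.

def effective_assistants(lang, all_clusters, assistants):
    """Return the teacher-assistants whose cluster contains `lang` (the C_sim set)."""
    return [assistants[c] for c, cluster in enumerate(all_clusters) if lang in cluster]

# A language clustered differently by the KB-based and NMT-learned views
# ends up with one effective teacher-assistant per clustering type.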

Figure 2: Effective teachers for each language after clustering. $C$ refers to the clustering type and $T$ refers to the teacher. For language a, we have two effective teachers: $T_{C_{1}}^{1}$ and $T_{C_{2}}^{1}$.

Multilingual adaptive knowledge distillation: Given a collection of effective teacher-assistant models $\{\theta^{c}\}_{c=1}^{C_{sim}}$, where $C_{sim}$ is the number of effective clusters per language, we devise the following KD objective for each language pair,

\mathcal{L}_{KD}^{adaptive}(\mathcal{D}^{l},\theta_{s},\{\theta^{c}\}_{1}^{C_{sim}},{\bm{\alpha}}):=-\sum_{c=1}^{C_{sim}}\sum_{{\bm{x}},{\bm{y}}\in\mathcal{D}^{l}}{\bm{\alpha}}_{c}\sum_{t=1}^{|{\bm{y}}|}\sum_{v\in V}Q(v|{\bm{y}}_{<t},{\bm{x}},\theta^{c})\log P(v|{\bm{y}}_{<t},{\bm{x}},\theta_{s})   (3)

where ${\bm{\alpha}}$ dynamically weighs the contribution of the teacher-assistants/clusters. ${\bm{\alpha}}$ is computed via an attention mechanism based on the rewards (negative perplexity) attained by the teachers on the data, where these values are passed through a softmax transformation to turn them into a distribution Saleh et al. (2020).
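The sketch below shows one way such weights can be computed per mini-batch: each effective teacher-assistant is scored by the negative perplexity it attains on the batch, and the scores are normalised with a softmax. avg_token_nll is a hypothetical helper returning a model's mean token-level negative log-likelihood on the batch.

import torch

def contribution_weights(assistants, batch, avg_token_nll):
    """alpha over the C_sim effective teacher-assistants for one mini-batch."""
    rewards = []
    with torch.no_grad():
        for ta in assistants:
            ppl = torch.exp(avg_token_nll(ta, batch))  # perplexity of this assistant
            rewards.append(-ppl)                       # reward = negative perplexity
    return torch.softmax(torch.stack(rewards), dim=0)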

cluster 1 cluster 2 cluster 3 cluster 4 cluster 5 cluster 6 cluster 7 cluster 8 cluster 9 cluster 10
Japanese Korean Mongolian Burmese Malay Thai Chinese Indonesian Vietnamese Marathi Tamli Bengali Georgian Kurdish Persian Kazakh Basque Hindi Urdu Greek Arabic Hebrew Turkish Azerbaijani Finnish Hungarian Armenian Slovak Polish Russian Macedonian Lithuanian Belarusian Ukrainian Croatian Bosnian Slovenian Serbian Czech Estonian Albanian Galician Italian Portuguese Bulgarian Romanian Spanish French Danish Swedish Dutch German Bokmal
Table 1: Clustering type (1) – SVCCA-53 Oncevay et al. (2020), based on a multi-view representation using both the syntactic features of WALS and language vectors learned by an MNMT model trained on TED-53.
cluster 1 cluster 2 cluster 3 cluster 4 cluster 5 cluster 6 cluster 7 cluster 8 cluster 9 cluster 10
Korean Bengali Marathi Hindi Urdu Basque Arabic Hebrew Armenian Persian Kurdish Hungarian Azerbaijani Mongolian Turkish Japanese Georgian Tamli Kazakh Burmese Bosnian Albanian Polish Slovak Croatian Macedonian Belarusian Estonian Russian Ukrainian Slovenian Lithuanian Finnish Czech Serbian Thai Malay Indonesian Vietnamese Chinese Romanian Spanish Italian Galician Bulgarian Swedish German Dutch Portuguese French Danish Greek Bokmal
Table 2: Clustering type (2) – SVCCA-23 Oncevay et al. (2020), based on a multi-view representation using both the syntactic features of WALS and language vectors learned by an MNMT model trained on WIT-23.
cluster 1 cluster 2 cluster 3 cluster 4 cluster 5 cluster 6 cluster 7 cluster 8 cluster 9 cluster 10 cluster 11
Estonian Finnish Hindi Burmese Armenian Georgian Basque Bengali Kurdish Bosnian Belarusian Azerbaijani Mongolian Marathi Urdu Tamli Kazakh Malay Galician French Italian Spanish Portuguese Bokmal Danish Swedish German Dutch Chinese Japanese Korean Hungarian Turkish Slovak Polish Russian Ukrainian Lithuanian Slovenian Croatian Serbian Czech Persian Indonesian Hebrew Vietnamese Thai Arabic Romanian Albanian Macedonian Bulgarian Greek
Table 3: Clustering type (3) – Based on NMT-learned representation using a set of 53 factored language embeddings Oncevay et al. (2020); Tan et al. (2019a).
cluster 1 cluster 2 cluster 3
Bengali Chinese Korean Kazakh Azerbaijani Mongolian Burmese Japanese Marathi Urdu Hindi Turkish Tamli Thai Vietnamese Indonesian Malay Armenian Georgian Macedonian Bosnian Slovenian Hungarian Basque Serbian Dutch Albanian Greek German Galician Romanian Portuguese Croatian Bokmal Spanish French Belarusian Lithuanian Danish Ukrainian Hebrew Polish Czech Russian Finnish Bulgarian Kurdish Italian Slovak Estonian Arabic Persian Swedish
Table 4: Clustering type (4) – Based on KB representation using syntax features of WALS Oncevay et al. (2020).

IE/Balto-Slavic: be, bs, sl, mk, lt, sk, cs, uk, hr, sr, bg, pl, ru
IE/Italic: gl, pt, ro, fr, es, it
IE/Indo-Iranian: bn, ur, ku, hi, fa, mr
IE/Germanic: nb, da, sv, de, nl
Turkic: kk, az, tr
Uralic: et, fi, hu
Afroasiatic: he, ar
Sino-Tibetan: my, zh
Austronesian: ms, id
Koreanic: ko
Japonic: ja
Austroasiatic: vi
IE/Hellenic: el
Kra-Dai: th
IE/Albanian: sq
IE/Armenian: hy
Kartvelian: ka
Mongolic: mn
Dravidian: ta
Isolate (Basque): eu
Table 5: Language families Eberhard et al. (2019). IE refers to Indo-European.

This adaptive distillation of knowledge allows the student model to get the best of the teacher-assistants (which are representative of different linguistic features) based on their effectiveness in reducing the knowledge gap of the student. The total loss function then becomes a weighted combination of the losses coming from the ensemble of teachers and the data,

\mathcal{L}_{ALL}^{adaptive}(\mathcal{D}^{l},\theta_{s},\{\theta^{c}\}_{1}^{C_{sim}},{\bm{\alpha}}):=\lambda_{1}\,\mathcal{L}_{NLL}(\mathcal{D}^{l},\theta_{s})+\lambda_{2}\,\mathcal{L}_{KD}^{adapt.}(\mathcal{D}^{l},\theta_{s},\{\theta^{c}\}_{1}^{C_{sim}},{\bm{\alpha}})

The training process is summarized in Alg. 1.
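Complementing Alg. 1, here is a PyTorch-style sketch of the adaptive objective in Eq. (3) together with the total loss above; it mirrors the selective loss sketch but weights each effective teacher-assistant c by α_c. Tensor shapes and function names are illustrative, and λ1 = λ2 = 0.5 correspond to the initial values reported in Section A.2.

import torch
import torch.nn.functional as F

def adaptive_kd_loss(student_logits, assistant_logits, alpha, tgt, pad_id,
                     lam1=0.5, lam2=0.5):
    """lam1 * NLL + lam2 * sum_c alpha_c * KD(assistant_c -> student)."""
    log_p = F.log_softmax(student_logits, dim=-1)
    mask = tgt.ne(pad_id).float()
    kd = 0.0
    for a_c, logits_c in zip(alpha, assistant_logits):   # one term per effective cluster
        q_c = F.softmax(logits_c, dim=-1).detach()
        kd = kd + a_c * ((-(q_c * log_p).sum(-1) * mask).sum() / mask.sum())
    nll = F.nll_loss(log_p.transpose(1, 2), tgt, ignore_index=pad_id)
    return lam1 * nll + lam2 * kd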

4 Experiment

In this section we explain our experiment settings as well as our experimental findings.

Data: We conducted extensive experiments on a parallel corpus (53 languages\rightarrowEnglish) from TED talks transcripts (https://github.com/neulab/word-embeddings-for-nmt) created and tokenized by Qi et al. (2018). In this corpus, 26% of the language pairs have at most 10k sentences (extremely low-resource), and 33% have fewer than 20k sentences (low-resource). All sentences were segmented with BPE Sennrich et al. (2016) using a BPE model with 32k merge operations learned jointly on all languages. We kept the output vocabulary of the teacher and student models the same to make knowledge distillation feasible. Details about the size of the training data, language codes, and preprocessing steps are described in Section A.2.
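The paper uses the BPE implementation of Sennrich et al. (2016) with 32k merge operations learned jointly over all languages. Purely as an illustration of building such a shared subword vocabulary, the sketch below uses sentencepiece in BPE mode as a stand-in; the file names and the vocab_size-as-proxy-for-merges choice are assumptions, not the paper's actual toolkit or settings.

import sentencepiece as spm

# Train one shared BPE model on the concatenation of all training text so that
# teachers and students share the same output vocabulary.
spm.SentencePieceTrainer.train(
    input="all_languages.train.txt",   # assumed concatenated training text
    model_prefix="shared_bpe32k",
    vocab_size=32000,                  # stand-in for 32k merge operations
    model_type="bpe",
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="shared_bpe32k.model")
print(sp.encode("negative transfer in multilingual NMT", out_type=str))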

Clustering: We clustered all the languages based on the three different types of representations discussed in Section 3 in order to take advantage of a mixture of linguistic features while training the ultimate student. Following Oncevay et al. (2020), we adopted their multi-view language representation approach, which uses Singular Vector Canonical Correlation Analysis (SVCCA) Raghu et al. (2017) to fuse the one-hot encoded KB representation obtained from the syntactic features of WALS Dryer and Haspelmath (2013) and a dense NMT-learned view obtained from MNMT Tan et al. (2019a). Specifically, SVCCA-53 uses the 53 languages of the TED dataset to build the language representations and generates 10 clusters, where the languages within each cluster usually share phylogenetic or geographical features. SVCCA-23 instead uses the 23 languages of WIT-23 Cettolo et al. (2012) to compute the shared space. We also generated language clusters based on either the KB representation alone, using the syntactic features of WALS Dryer and Haspelmath (2013), or the NMT-learned representation alone. Tables 1-4 show the generated language clusters.
Training Configuration: All models were trained with the Transformer architecture Vaswani et al. (2017). The individual baseline models were trained with a model hidden size of 256, a feed-forward hidden size of 1024, and 2 layers. All multilingual models were trained with a model hidden size of 512, a feed-forward hidden size of 1024, and 6 layers. We used the selective multilingual knowledge distillation approach for training our cluster-based MNMT models, as selective multilingual KD performs better than universal MNMT without KD Tan et al. (2019b). We did not carry out intensive parameter tuning to search for the best settings of each model, for the sake of simplicity; the purpose of our experiments is rather to demonstrate the benefit of considering different types of language clusters through a unified hierarchical knowledge distillation. Details about the training configuration are provided in Section A.2.

4.1 Experimental Results

The translation results of (53 languages \rightarrow English) for all approaches are summarised in Table 6. The language pairs are sorted based on the size of the training data in ascending order. The translation quality is evaluated and reported based on the BLEU score Papineni et al. (2002).

Resource src size #sent. (k) Baseline Multilingual Selective KD Multi. HKD
Individ. Universal Multi. All langs Clus. type1 Clus. type2 Clus. type3 Clus. type4
Extremely low resource
kk 3.3 3.42 5.05 4.66 7.00 3.51 3.10 8.13 6.61
be 4.5 5.13 12.51 12.36 12.78 10.81 8.46 15.18 14.88
bn 4.6 5.06 12.50 12.58 9.13 10.11 12.16 10.13 12.81
eu 5.1 4.4 13.12 12.00 9.08 9.70 8.14 11.03 12.90
ms 5.2 3.78 13.88 14.61 12.93 12.94 7.63 12.98 14.11
bs 5.6 7.92 14.82 15.46 18.05 16.89 9.03 19.02 17.51
az 5.9 5.79 10.32 9.91 9.59 9.17 8.64 9.23 10.40
ur 5.9 8.98 12.76 16.50 13.35 13.17 12.02 13.39 15.95
ta 6.2 4.57 5.86 6.19 4.02 5.76 3.96 3.49 5.63
mn 7.6 3.54 6.11 5.82 5.75 6.60 5.39 6.20 6.81
mr 9.8 6.92 10.53 10.72 8.70 9.00 8.39 9.04 10.98
gl 10.0 13.5 22.04 22.44 25.53 26.81 26.93 25.90 27.11
ku 10.3 6.3 10.32 12.12 9.43 6.98 9.22 12.93 13.03
et 10.7 8.24 12.21 13.19 13.47 13.21 10.44 13.94 14.10
Low resource
ka 13.1 8.64 8.18 8.66 9.28 10.85 9.14 8.88 11.15
nb 15.8 26.36 28.49 29.08 33.55 34.31 28.79 30.87 33.89
hi 18.7 10.66 16.03 17.93 13.27 12.09 12.16 12.80 16.11
sl 19.8 11.45 15.12 15.39 16.48 16.54 17.75 18.31 18.43
Enough resource
hy 21.3 11.14 14.07 15.12 13.76 12.77 10.81 17.17 16.72
my 21.4 4.91 10.70 11.11 9.65 6.35 8.48 9.54 8.81
fi 24.2 8.16 11.69 12.23 11.36 12.57 10.59 12.76 12.90
mk 25.3 18.32 20.63 21.09 21.48 20.06 24.65 23.8 25.05
lt 41.9 14.78 15.44 16.76 16.98 16.9 17.96 18.24 18.11
sq 44.4 22.62 24.44 25.22 24.74 23.22 26.89 26.42 26.93
da 44.9 31.85 30.39 30.61 35.02 39.76 30.58 32.04 36.00
sv 56.6 27.2 27.18 26.84 31.36 34.52 26.81 28.65 33.14
sk 61.4 19.36 22.04 22.58 22.49 21.18 24.08 23.77 24.33
id 87.4 20.51 20.89 20.69 21.11 21.13 22.56 21.12 22.76
th 96.9 20.46 21.34 21.72 22.94 22.94 23.09 22.87 23.30
cs 103.0 20.13 22.01 22.07 21.72 22.49 23.12 22.86 23.62
uk 108.4 21.32 22.11 23.07 23.06 22.91 23.58 23.66 24.09
hr 122.0 25.89 26.51 27.17 27.56 25.62 28.66 28.34 28.91
el 134.3 26.82 26.07 28.51 30.05 31.35 29.13 29.66 30.10
sr 136.8 26.94 25.43 25.88 27.48 25.75 27.69 27.12 27.97
hu 147.1 18.46 17.61 18.55 19.08 18.41 20.16 19.82 20.10
fa 150.8 23.60 21.7 21.29 21.31 22.44 23.51 22.24 23.19
de 167.8 15.23 14.83 16.69 16.88 17.79 15.44 16.67 18.04
ja 168.2 10.11 8.61 8.93 10.14 10.14 10.17 8.69 10.30
vi 171.9 18.97 19.19 20.58 21.60 21.60 21.30 20.33 21.82
bg 174.4 28.85 27.66 29.14 31.67 32.18 29.86 30.48 32.33
pl 176.1 17.23 18.62 19.45 19.45 18.37 19.93 20.26 20.71
ro 180.4 25.21 25.97 26.53 28.03 28.43 28.90 27.35 29.00
tr 182.3 17.72 10.2 10.01 18.66 18.19 19.85 18.27 16.91
nl 183.7 27.65 26.91 26.82 28.05 29.07 29.17 28.03 29.58
zh 184.8 20.44 22.10 22.71 22.19 22.11 23.91 22.84 23.85
es 195.9 30.17 29.55 30.00 29.06 33.45 31.46 31.82 32.76
it 204.4 26.84 25.13 27.99 30.57 30.85 30.36 29.45 30.93
ko 205.4 15.98 15.71 16.41 15.17 15.18 17.45 16.00 17.70
ru 208.4 19.76 19.83 20.86 20.85 20.80 21.25 21.49 21.77
he 211.7 29.35 28.03 28.27 32.82 32.18 31.32 30.02 32.05
fr 212.0 30.08 30.55 30.28 32.25 32.19 31.14 31.34 32.65
ar 213.8 25.36 23.89 24.48 28.53 28.03 27.46 25.85 28.94
pt 236.4 30.99 31.12 30.85 33.36 33.84 33.25 32.56 33.90
Avg. - - 18.50 18.64 19.24 19.84 19.87 19.35 20.05 21.16
Table 6: BLEU scores of the translation tasks for 53 Languages\rightarrowEnglish. The bold numbers show the best scores and the underlined ones show the 2nd best results. "All langs" refers to all 53 languages trained with multilingual selective KD.

4.1.1 Studies of cluster-based MNMT models

In this section, we discuss the cluster-based MNMTs’ results through the following observations.

Low-resource vs high-resource languages: All cluster-based MNMT and baseline approaches are ranked in Table 7 based on the number of times they obtained the first or second-best score in different resource-size scenarios. Based on this result, and also on the results presented in Table 6, the massive MNMT models trained with all languages (second column under baseline and first column under selective KD in Tables 6 and 7) outperform the cluster-based MNMT models (columns 2-5 under selective KD in Tables 6 and 7) in extremely low-resource scenarios (e.g., bn-en, ta-en, eu-en). This result shows that, for under-resourced languages, having more data, whether from related or distant languages, has the largest impact on training a better MNMT model. Furthermore, clustering type (4) is dominant among the clustering approaches in under-resourced situations. This is explainable by the size of the clusters in clustering type (4): the translations of extremely low-resource languages are significantly improved when they are placed in the third cluster of clustering type (4), which contains 35 languages (shown in Table 4). However, for languages with enough resources, the multilingual baselines with all languages underperform the other cluster-based MNMT models.

Resource-size size (# sent.) Baseline Multilingual Selective KD
Individ. Multi. All langs Clus. type1 Clus. type2 Clus. type3 Clus. type4
Extremely low resource <= 10k 0% 21.43% 28.57% 14.28% 7.14% 3.58% 25.00%
Low resource > 10k and <= 20k 0% 12.50% 12.50% 25.00% 25.00% 12.50% 12.50%
Enough resource > 20k 1.39% 1.39% 2.78% 20.83% 25.00% 27.78% 20.83%
Table 7: The translation ranking ablation study for all approaches excluding the HKD approach, based on the percentage of times they got the 1st or 2nd best results. The percentages in each row sum to 100%.
Resource-size size (# sent.) Baseline Multilingual Selective KD HKD
Individ. Multi. All langs Clus. type1 Clus. type2 Clus. type3 Clus. type4
Extremely low resource <= 10k 0% 13.79% 17.24% 6.90% 3.45% 3.45% 17.24% 37.93%
Low resource > 10k and < 20k 0% 0% 12.50% 0% 25.00% 0% 12.50% 50.00%
Enough resource >20k 1.43% 1.43% 1.43% 5.71% 12.86% 24.28% 8.57% 44.29%
Table 8: The ranking of all approaches based on the percentage of times they got the 1st or 2nd best results.
Model Contrib. Langs BLEU Contrib. Langs BLEU
Clus. Rand.1 gl, nb, uk, hr, se, ja 16.53 el, id, be 28.01
Clus. Rand.2 gl, ta, mk, be, id, sq, pt, fr, ur, az, ku, bs, fa 20.27 el, cs, lt, id, sk, th, it, hy, ms, hu, mk, my, bn 27.78
\hdashline Clus. Rand.3 gl, az, ja, nb, kk 13.61 el, sq, th 27.97
Clus. Rand.4 gl, zh, pt, fa, ar, kk, sr, bg, nl, cs, th, ko, vi, hu, mk, fi, ru, mn, de, sl, el, ka, pl, et, ta, fr, ur, ro, sv, mr, be, bs, uk, sq, az 22.20 el, zh, pt, fa, ar, kk, sr, bg, nl, cs,th, ko, vi, hu, mk, fi, ru, mn, de, sl, gl, ka, pl, et, ta, fr, ur, ro, sv, mr, be, bs, uk, sq, az 28.82
Avg. - 18.15 - 28.14
Table 9: Ablation study using random languages in all clustering types for (gl\rightarrowen) and (el\rightarrowen). The results for the actual clustering are provided in Table 6. The average results of the actual clustering for (gl\rightarrowen) and (el\rightarrowen) are 26.29 and 30.04 respectively.

Related vs isolated languages: A group of languages that originated from a common ancestor is known as a language family, and a language that does not have any established relationship with other languages is called a language isolate. The language families Eberhard et al. (2019) are shown in Table 5. According to the results shown in Table 6, the clustering approaches usually behave similarly and show less diversity when clustering languages belonging to the IE/Germanic, IE/Italic, Afroasiatic, and Austronesian families. In comparison, there is more diversity, and less consensus, when clustering languages belonging to the IE/Balto-Slavic, IE/Indo-Iranian, Turkic, Uralic, and Sino-Tibetan families. Moreover, the isolated languages (shown in the last 11 rows of Table 5) generally behave similarly and show less variance in BLEU scores across the different clustering approaches. This observation shows that cluster-based MNMT models (regardless of the clustering type) do not significantly improve the translation of isolated languages unless the isolated languages have extremely low resources and are clustered in a huge cluster (e.g., eu-en, hy-en). The results of cluster-based MNMT are presented by language family in Section A.3.

Random clustering vs actual clustering: We conducted an ablation study using clusters with randomly chosen languages for two translation tasks (el\rightarrowen and gl\rightarrowen). We kept the number of languages per cluster the same as in the actual clustering to make a fair comparison. According to the results presented in Table 9, for both translation tasks the random clusters underperform the actual clusters in all clustering types. Notably, however, the difference in average BLEU score between the random and actual clusters for gl\rightarrowen is considerably larger than for el\rightarrowen (gl\rightarrowen, \Delta = -8.14 vs el\rightarrowen, \Delta = -1.9). This observation is in line with the previous observation that Greek (el) is an isolated language categorised in the IE/Hellenic family, and clustering approaches have less impact on this language due to its lower similarity to most languages. In comparison, Galician (gl) is highly similar to the languages in the IE/Italic family, and clustering improves the translation of gl\rightarrowen remarkably. We compare the translation results of the actual cluster-based and random cluster-based multilingual approaches in Section A.3.

4.1.2 Studies of HKD

According to the results in Table 6, the HKD approach outperforms the massive and cluster-based MNMT models on average by 1.11 BLEU. We discuss this further through the following observations:

Ranking based on data size: We ranked all approaches, including HKD, based on the number of times they obtained the first or second-best score (shown in Table 8). According to this result, the HKD approach has the best rank in all three situations, i.e., extremely low resource, low resource, and enough resource. This observation shows that the HKD approach is robust across data-size situations, leveraging the best of both multilingual NMT and language-relatedness guidance in a systematic HKD setting.

Clustering consistency impact on HKD: Based on the results in Table 6, the HKD approach underperforms other multilingual approaches when the clusters are inconsistent, causing a high variance in the teacher-assistants’ results. For example, for kk\rightarrowen and bs\rightarrowen, the variance of the BLEU scores of the teacher-assistant models is 6.92 and 20.81 respectively, and HKD fails to achieve a good result. This observation shows that, although in the second phase of the HKD approach the cluster-based teacher-assistants adaptively contribute to training the ultimate student, a weak teacher-assistant can still deteriorate the collaborative teaching process. One possible solution is excluding the worst teacher-assistant in such heterogeneous situations.

4.1.3 Comparison with other approaches

To highlight the pros and cons of the related baselines (with and without KD), we draw a comparison in Table 10. Accordingly, our HKD approach can be compared with the other approaches along the following properties:

Multilingual translation: Our approach works in a multilingual setting by sharing resources between high-resource and low-resource languages. This property not only improves the regularisation of the model by avoiding over-fitting to the limited data of the low-resource languages, but also decreases the deployment footprint by carrying out the whole training in a single model instead of having individual models per language Dabre et al. (2020).

Optimal transfer: In the HKD approach, we achieve optimal transfer by transferring knowledge from all possible languages related to a student in the hierarchical structure, which leads to the best average BLEU score (21.16) compared to the other baselines (shown in Table 6). In universal multilingual NMT without KD, the language transfer stream is maximized, as all languages share their knowledge in a single model during training; however, the transfer is not optimal due to the lack of any condition on the relatedness of the languages contributing to the multilingual training. The related experiments are shown in the second column under the baseline experiments in Table 6; the average BLEU score of this approach is 18.46. In multilingual selective KD Tan et al. (2019b), knowledge is distilled from a single selected teacher handling the same language pair when training the student multilingually. So, although there is a condition on language relatedness, knowledge transfer is not maximized, as similar languages from the same language family are ignored in the distillation process. The related results are shown in the first column under multilingual selective KD in Table 6; this approach obtains an average BLEU score of 19.24 in our experiments. Adaptive KD Saleh et al. (2020) is a bilingual approach and also uses a random set of teachers, which does not necessarily contain all the languages related to the student and hence does not lead to optimal transfer. We did not perform any experiment on adaptive KD Saleh et al. (2020) since it is a bilingual approach.

Adaptive KD vs Selective KD vs HKD: All KD-based approaches in our comparison reduce negative transfer in different ways. In multilingual selective KD Tan et al. (2019b), the risk of negative transfer is reduced by distilling knowledge from the selected teacher per language in a multilingual setting. In bilingual adaptive KD Saleh et al. (2020), the contribution weights of the different teachers vary based on their effectiveness in improving the student, which prevents negative transfer in the bilingual setting. In HKD, the hierarchical grouping based on language similarity provides a systematic guide to prevent negative transfer as much as possible. This property leads HKD to obtain the best results for 32 out of the total 53 language pairs in our multilingual experiments.

Individual NMT Uniform MNMT Selective KD MNMT Adaptive KD NMT HKD MNMT
Multilingual
\hdashline Maximum transfer
\hdashline KD from multiple languages
\hdashline Reduced risk of negative transfer
Table 10: Comparing different properties of HKD with: transformer-based individual and multilingual NMT Vaswani et al. (2017), multilingual selective KD Tan et al. (2019b), and adaptive KD Saleh et al. (2020).

5 Conclusion

We presented a Hierarchical Knowledge Distillation (HKD) approach to mitigate the negative transfer effect in MNMT when a diverse set of languages is used in training. In the first phase of the distillation process, we grouped the languages that behave similarly and generated expert teacher-assistants for each group of languages. As we clustered languages based on four different language representations capturing different linguistic features, we then adaptively distilled knowledge from all related teacher-assistant models to the ultimate student in each mini-batch of training per language. Experimental results on 53 languages to English show our approach’s effectiveness in reducing negative transfer in MNMT. Our proposed approach generalizes to one-to-many translation with the same setting as the many-to-one task; the clustering then needs to be done on the target languages. For many-to-many tasks, hierarchical KD can be effective if the clustering is applied to both source and target languages. Our intended use, however, is many-to-one or one-to-many.

As a future direction, it would be interesting to study an end-to-end HKD approach by adding a backward HKD pass to complement the forward HKD pass described in this paper.

Acknowledgements

This work was supported by the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE) (www.massive.org.au).

References

Appendix:

Selective Knowledge Distillation

This first stage of knowledge distillation follows the algorithm proposed by Tan et al. (2019b) and is summarized in Algorithm 2. The process is applied to all clusters obtained from the different clustering approaches, $\{C^{m}:=\{l_{1},\dots,l_{|L^{\prime}|}\}\}_{m=1}^{M}$, where $M$ refers to the number of clusters and $L^{\prime}$ refers to the number of languages in each cluster.

Experiment Settings

Data.

We conducted the experiments on a parallel corpus of 53 languages to English from TED talks transcripts (https://github.com/neulab/word-embeddings-for-nmt), created and tokenized by Qi et al. (2018). Details about the size of the training data and the language codes, based on the ISO 639-1 standard (http://www.loc.gov/standards/iso639-2/php/English_list.php), are listed in Table 6 and visualised in Figure 3. We concatenated all data with Portuguese-related languages on the source side (pt\rightarrowen, pt-br\rightarrowen), and likewise all data with French-related languages on the source side (fr\rightarrowen, fr-ca\rightarrowen). We removed any sentences in the training data that overlap with any of the test sets. For multilingual training, as a standard practice Wang et al. (2020), we up-sampled the data of the low-resource language pairs so that all language pairs have roughly the same size, adjusting the distribution of the training data.
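As a minimal sketch of one simple up-sampling scheme consistent with the description above (repeat each corpus and truncate it to the size of the largest one); the exact recipe of Wang et al. (2020) may differ, so treat this as an assumption-laden illustration.

import math

def upsample_to_balance(corpora):
    """corpora: dict mapping language code -> list of (src, tgt) sentence pairs."""
    target = max(len(pairs) for pairs in corpora.values())
    balanced = {}
    for lang, pairs in corpora.items():
        reps = math.ceil(target / len(pairs))      # how many times to repeat the corpus
        balanced[lang] = (pairs * reps)[:target]   # repeat, then truncate to the target size
    return balanced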

Input: Training corpora: $\{\mathcal{D}^{l}\}_{l=1}^{L}$, where $\mathcal{D}^{l}:=\{({\bm{x}}_{1}^{l},{\bm{y}}_{1}),\dots,({\bm{x}}_{n}^{l},{\bm{y}}_{n})\}$;
List of all languages: $L$;
Individual models: $\{\theta^{l}\}_{l=1}^{L^{\prime}}$;
List of language pairs per cluster: $L^{\prime}$;
Total training epochs: $N$;
Distillation check step: $\mathcal{N}_{check}$;
Threshold of distillation accuracy: $\mathcal{T}$
Output: $\theta^{c}$: multilingual model for each cluster;
Randomly initialize the multilingual model $\theta^{c}$, the accumulated gradient $g=0$, and the distillation flag $f^{l}=\mathrm{True}$ for $l\in L^{\prime}$;
$n=0$;
while $n<N$ do
    $g=0$;
    for $l\in L^{\prime}$ do
        $D^{l}=\mathrm{random\_permute}(\mathcal{D}^{l})$;
        ${\bm{b}}_{1}^{l},\dots,{\bm{b}}_{J}^{l}=\mathrm{create\_minibatches}(\mathcal{D}^{l})$, where $b^{l}=(x^{l},y)$;
        $j=1$;
        while $j\leq J$ do
            if $f^{l}==\mathrm{True}$ then
                // compute and accumulate the gradient on loss $\mathcal{L}_{ALL}^{selective}$
                ${\bm{g}}=\nabla_{\theta^{c}}\mathcal{L}_{ALL}^{selective}({\bm{b}}_{j}^{l},\theta^{c},\theta^{l})$;
                // update the parameters using the ADAM optimiser
                $\theta^{c}=\mathrm{update\_param}(\theta^{c},{\bm{g}})$;
            else
                // compute and accumulate the gradient on loss $\mathcal{L}_{NLL}$
                ${\bm{g}}=\nabla_{\theta^{c}}\mathcal{L}_{NLL}({\bm{b}}_{j}^{l},\theta^{c})$;
                // update the parameters using the ADAM optimiser
                $\theta^{c}=\mathrm{update\_param}(\theta^{c},{\bm{g}})$;
            $j=j+1$;
    if $n\,\%\,\mathcal{N}_{check}==0$ then
        for $l\in L^{\prime}$ do
            if $\mathrm{Accuracy}(\theta^{c})<\mathrm{Accuracy}(\theta^{l})+\mathcal{T}$ then
                $f^{l}=\mathrm{True}$
            else
                $f^{l}=\mathrm{False}$
    $n=n+1$;
Algorithm 2: Multilingual Selective Knowledge Distillation Tan et al. (2019b)
TED-53 Languages
Language name Kazakh Belarusian Bengali Basque Malay Bosnian
Code kk be bn eu ms bs
train-size (#sent(k)) 3.3 4.5 4.6 5.1 5.2 5.6
Language name Azerbaijani Urdu Tamil Mongolian Marathi Galician
Code az ur ta mn mr gl
train-size (#sent(k)) 5.9 5.9 6.2 7.6 9.8 10
Language name Kurdish Estonian Georgian Bokmal Hindi Slovenian
Code ku et ka nb hi sl
train-size (#sent(k)) 10.3 10.7 13.1 15.8 18.7 19.8
Language name Armenian Burmese Finnish Macedonian Lithuanian Albanian
Code hy my fi mk lt sq
train-size (#sent(k)) 21.3 21.4 24.2 25.3 41.9 44.4
Language name Danish Swedish Slovak Indonesian Thai Czech
Code da sv sk id th cs
train-size (#sent(k)) 44.9 56.6 61.4 87.4 96.9 103
Language name Ukrainian Croatian Greek Serbian Hungarian Persian
Code uk hr el sr hu fa
train-size (#sent(k)) 108.4 122 134.3 136.8 147.1 150.8
Language name German Japanese Vietnamese Bulgarian Polish Romanian
Code de ja vi bg pl ro
train-size (#sent(k)) 167.8 168.2 171.9 174.4 176.1 180.4
Language name Turkish Dutch Chinese Spanish Italian Korean
Code tr nl zh es it ko
train-size (#sent(k)) 182.3 183.7 184.8 195.9 204.4 205.4
Language name Russian Hebrew French Arabic Portuguese
Code ru he fr ar pt
train-size (#sent(k)) 208.4 211.7 212 213.8 236.4
Table 11: Bilingual resources of 53 Languages \rightarrow English from the TED dataset. Language names, language codes based on the ISO 639-1 standard (http://www.loc.gov/standards/iso639-2/php/English_list.php), and training sizes (number of sentences) of the bilingual resources are shown in this table.
Figure 3: Training size (based on the number of sentences) for TED-53 bilingual resources (Language\rightarrowEnglish)
Training configuration.

All models were trained with the Transformer architecture Vaswani et al. (2017), implemented in the Fairseq framework Ott et al. (2019). The individual models were trained with a model hidden size of 256, a feed-forward hidden size of 1024, and 2 layers. All multilingual models, whether cluster-based or universal MNMT models, with or without knowledge distillation, were trained with a model hidden size of 512, a feed-forward hidden size of 1024, and 6 layers. We used the Adam optimizer Kingma and Ba (2015) and an inverse square root schedule with warmup (maximum LR 0.0005). We applied dropout and label smoothing with a rate of 0.3 and 0.1 for the bilingual and multilingual models respectively.

For the first phase of distillation, i.e., the multilingual selective KD, the distillation coefficient $\lambda$ is set to 0.6. In the second phase of distillation, i.e., the multilingual adaptive KD, we set $\lambda_{1}=0.5$, and $\lambda_{2}$ starts from 0.5 and is increased to 3 using the annealing function of Bowman et al. (2016); Saleh et al. (2020). We train our final multilingual student with mixed-precision floats on up to 8 V100 GPUs for a maximum of 100 epochs ($\approx 3$ days), with at most 8192 tokens per batch and early stopping after 20 validation steps based on the BLEU score. The translation quality is evaluated and reported based on the BLEU score Papineni et al. (2002) (SacreBLEU signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.none+version.1.3.1).
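As an illustration of the $\lambda_{2}$ schedule (0.5 annealed towards 3), the sketch below uses a logistic ramp; the exact annealing function follows Bowman et al. (2016), so the functional form and the rate/midpoint values here are assumptions made only for this example.

import math

def lambda2_schedule(step, start=0.5, end=3.0, rate=1e-3, midpoint=5000):
    """Logistic ramp from `start` to `end` over training steps (illustrative values)."""
    frac = 1.0 / (1.0 + math.exp(-rate * (step - midpoint)))  # ramps from ~0 to ~1
    return start + (end - start) * frac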

Model Contributed Langs BLEU Contributed Langs BLEU
Individual gl 13.50 el 26.82
Multi. (uniform) All langs. 22.04 All langs. 26.07
\hdashline Multi. (SKD) All langs. 22.44 All langs. 28.51
Clus. type1 gl, bg, ro, es, it, pt 25.53 el, ar, he 30.05
\hdashline Clus. type2 gl, bg, sv, da, nb, de, nl, ro, el, es, it, fr, pt 26.81 el, bg, sv, da, nb, de, nl, es, it, gl, fr, pt, ro 31.35
\hdashline Clus. type3 gl, fr, it, es, pt 26.93 el, mk, bg 29.13
\hdashline Clus. type4 gl, fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, it, sv, da, nb, ru, be, pl, bg, sl, pt sr, uk, hr, mk, sk, ro, sq, el, es 25.90 el, fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, it, sv, da, nb, ru, be, pl, bg, sl, pt, sr, uk, hr, mk, sk, ro, sq, es, gl 29.66
\hdashline Avg. - 26.29 - 30.04
Clus. Rand.1 gl, nb, uk, hr, se, ja 16.53 el, id, be 28.01
\hdashline Clus. Rand.2 gl, ta, mk, be, id, sq, pt, fr, ur, az, ku, bs, fa 20.27 el, cs, lt, id, sk, th, it, hy, ms, hu, mk, my, bn 27.78
\hdashline Clus. Rand.3 gl, az, ja, nb, kk 13.61 el, sq, th 27.97
\hdashline Clus. Rand.4 gl, zh, pt, fa, ar, kk, sr, bg, nl, cs, th, ko, vi, hu, mk, fi, ru, mn, de, sl, el, ka, pl, et, ta, fr, ur, ro, sv, mr, be, bs, uk, sq, az 22.20 el, zh, pt, fa, ar, kk, sr, bg, nl, cs,th, ko, vi, hu, mk, fi, ru, mn, de, sl, gl, ka, pl, et, ta, fr, ur, ro, sv, mr, be, bs, uk, sq, az 28.82
\hdashline Avg. - 18.15 (\Delta = -8.14) - 28.14 (\Delta = -1.9)
Table 12: Ablation study on using random clusters. Comparison of the (gl\rightarrowen) and (el\rightarrowen) translation tasks between individual, massive multilingual, and clustering-based multilingual (for actual and random clusters) baselines.

Analysis

Random clustering vs Actual clustering.

We studied the behaviour of random clusters compared to the actual clusters in Section 4.1.1. For further details on this comparison, refer to Table 12.

Language Family.

A group of languages that originated from a common ancestor is known as a language family, and a language that does not have any established relationship with other languages is called a language isolate. We already discussed the ablation study’s results regarding language families in comparison with isolated languages in Section 4.1.1 of the paper. For better visualisation of the results by language family, we sorted all language pairs based on their language family relations in Tables 13, 14, 15, and 16.

family lang ja, ko, mn, my ms, th, vi, zh, id mr, ta, bn, ka ku, fa, kk, eu, hi, ur el, ar, he tr, az, fi, hu, hy uk, pl, ru, mk, lt, be, sk cs, et, sq, hr, bs, sl, sr bg, ro, es, gl, it, pt fr, da, sv, nl, de, nb ko, bn, mr, hi, ur eu, ar, he hy, fa, ku hu, tr, az, ja, mn ka, ta kk, my mk, sq, pl, sk, hr, bs, be, et ru, uk, sl, fi, cs, lt zh, th, id, vi, ms bg, sv, da, nb, de, nl, el, es, it, gl, fr, pt, ro et, fi hi, my, hy, ka eu, az, kk, mn, ur, mr, bn, ta, ku, bs, be, ms gl, fr, it, es, pt nb, da, sv de, nl zh, ja, ko, hu, tr lt, sl, hr, sr, cs, sk, pl, ru, uk fa, id, ar, he, th, vi ro, sq mk, bg, el fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, sv, da, nb, ru, be, pl, bg, sl, sr, uk, hr, mk, sk, ro, sq, el, es, gl, pt, it vi, id, ms, th kk, az, tr, lt, ur, bn, mr, zh, my, ko, ja, mn, ta
Clustering Type 1 Clustering Type 2 Clustering Type 3 Clustering Type 4
c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c11c_{11} c1c_{1} c2c_{2} c3c_{3}
IE/Balto-Slavic be - - - - - - 12.7 - - - - - - - - - 10.8 - - - - - 8.4 - - - - - - - - 15.1 - -
bs - - - - - - - 18.0 - - - - - - - - 16.8 - - - - - 9.0 - - - - - - - - 19.0 - -
sl - - - - - - - 16.4 - - - - - - - - - 16.5 - - - - - - - - - 17.7 - - - 18.3 - -
mk - - - - - - 21.4 - - - - - - - - - 20.0 - - - - - - - - - - - - - 24.6 23.8 - -
lt - - - - - - 16.9 - - - - - - - - - - 16.9 - - - - - - - - - 17.9 - - - - - 18.2
sk - - - - - - 22.4 - - - - - - - - - - 21.1 - - - - - - - - - 24.0 - - - 23.7 - -
cs - - - - - - - 21.7 - - - - - - - - - 22.4 - - - - - - - - - 23.1 - - - 22.8 - -
uk - - - - - - 23.0 - - - - - - - - - - 22.9 - - - - - - - - - 23.5 - - - 23.6 - -
hr - - - - - - - 27.5 - - - - - - - - 25.6 - - - - - - - - - - 28.6 - - - 28.3 - -
sr - - - - - - - 27.4 - - - - - - - - 25.7 - - - - - - - - - - 27.6 - - - 27.1 - -
bg - - - - - - - - 31.6 - - - - - - - - - - 32.1 - - - - - - - - - - 29.8 30.4 - -
pl - - - - - - 19.4 - - - - - - - - - 18.3 - - - - - - - - - - 19.9 - - - 20.2 - -
ru - - - - - - 20.8 - - - - - - - - - - 20.8 - - - - - - - - - 21.2 - - - 21.4 - -
IE/Indo-Iranian bn - - 9.1 - - - - - - - 10.1 - - - - - - - - - - - 12.1 - - - - - - - - - - 10.1
ur - - - 13.3 - - - - - - 13.1 - - - - - - - - - - - 12.0 - - - - - - - - - - 13.3
ku - - - 9.4 - - - - - - 6.9 - - - - - - - - - - - 9.2 - - - - - - - - - - 12.9
hi - - - 13.2 - - - - - - 12.0 - - - - - - - - - - 12.1 - - - - - - - - - 12.8 - -
fa - - - 21.3 - - - - - - - - 22.4 - - - - - - - - - - - - - - - 23.5 - - 22.2 - -
mr - - 8.7 - - - - - - - 9.0 - - - - - - - - - - - 8.3 - - - - - - - - - - 9.0
IE/Italic gl - - - - - - - - 25.5 - - - - - - - - - - 26.8 - - - 26.9 - - - - - - - 25.9 - -
pt - - - - - - - - 33.3 - - - - - - - - - - 33.8 - - - 33.2 - - - - - - - 32.5 - -
ro - - - - - - - - 28.0 - - - - - - - - - - 28.4 - - - - - - - - - 28.9 - 27.3 - -
fr - - - - - - - - - 32.2 - - - - - - - - - 32.1 - - - 31.1 - - - - - - - 31.3 - -
es - - - - - - - - 29.0 - - - - - - - - - - 33.4 - - - 31.4 - - - - - - - 31.8 - -
it - - - - - - - - 30.5 - - - - - - - - - - 30.8 - - - 30.3 - - - - - - - 29.4 - -
Table 13: The results of clustering-based multilingual NMT with knowledge distillation for languages belong to IE/Balto-Slavic, IE/Indo-Iranina, and IE/Italic.
family lang ja, ko, mn, my ms, th, vi, zh, id mr, ta, bn, ka ku, fa, kk, eu, hi, ur el, ar, he tr, az, fi, hu, hy uk, pl, ru, mk, lt, be, sk cs, et, sq, hr, bs, sl, sr bg, ro, es, gl, it, pt fr, da, sv, nl, de, nb ko, bn, mr, hi, ur eu, ar, he hy, fa, ku hu, tr, az, ja, mn ka, ta kk, my mk, sq, pl, sk, hr, bs, be, et ru, uk, sl, fi, cs, lt zh, th, id, vi, ms bg, sv, da, nb, de, nl, el, es, it, gl, fr, pt, ro et, fi hi, my, hy, ka eu, az, kk, mn, ur, mr, bn, ta, ku, bs, be, ms gl, fr, it, es, pt nb, da, sv de, nl zh, ja, ko, hu, tr lt, sl, hr, sr, cs, sk, pl, ru, uk fa, id, ar, he, th, vi ro, sq mk, bg, el fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, sv, da, nb, ru, be, pl, bg, sl, sr, uk, hr, mk, sk, ro, sq, el, es, gl, pt, it vi, id, ms, th kk, az, tr, lt, ur, bn, mr, zh, my, ko, ja, mn, ta
Clustering Type 1 Clustering Type 2 Clustering Type 3 Clustering Type 4
c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c11c_{11} c1c_{1} c2c_{2} c3c_{3}
IE/Germanic nb - - - - - - - - - 33.5 - - - - - - - - - 34.3 - - - - 28.7 - - - - - - 30.8 - -
da - - - - - - - - - 35.0 - - - - - - - - - 39.7 - - - - 30.5 - - - - - - 32.0 - -
sv - - - - - - - - - 31.3 - - - - - - - - - 34.5 - - - - 26.8 - - - - - - 28.6 - -
de - - - - - - - - - 16.8 - - - - - - - - - 17.7 - - - - - 15.4 - - - - - 16.6 - -
nl - - - - - - - - - 28.0 - - - - - - - - - 29.0 - - - - - 29.1 - - - - - 28.0 - -
Turkic kk - - - 7.0 - - - - - - - - - - - 3.5 - - - - - - 3.1 - - - - - - - - - - 8.1
az - - - - - 9.5 - - - - - - - 9.1 - - - - - - - - 8.6 - - - - - - - - - - 9.2
tr - - - - - 18.6 - - - - - - - 18.1 - - - - - - - - - - - - 19.8 - - - - - - 18.2
Uralic et - - - - - - - 13.4 - - - - - - - - 13.2 - - - 10.4 - - - - - - - - - - 13.9 - -
fi - - - - - 11.3 - - - - - - - - - - - 12.5 - - 10.5 - - - - - - - - - - 12.7 - -
hu - - - - - 19.0 - - - - - - - 18.4 - - - - - - - - - - - - 20.1 - - - - 19.8 - -
Afroasiatic
he - - - - 32.8 - - - - - - - - - - - 32.1 - - - - 31.3 - - - - - - - - - 30.0 - -
ar - - - - 28.5 - - - - - - 28.0 - - - - - - - - - - - - - - - - 27.4 - - 25.8 - -
Sino-Tibetan
my 9.6 - - - - - - - - - - - - - - 6.3 - - - - - 8.4 - - - - - - - - - 9.5 - -
zh - 22.1 - - - - - - - - - - - - - - - - 22.1 - - - - - - - 23.9 - - - - - - 22.8
Austronesian
ms - 12.9 - - - - - - - - - - - - - - - - 12.9 - - - 7.6 - - - - - - - - - 12.9 -
id - 21.1 - - - - - - - - - - - - - - - - 21.1 - - - - - - - - - 22.5 - - - 21.1 -
Table 14: The results of clustering-based multilingual NMT with knowledge distillation for languages belong to IE/Germanic, Turkic, Uralic, Afroasiatic, Sino-Tibetan, and Austronesian.
family lang ja, ko, mn, my ms, th, vi, zh, id mr, ta, bn, ka ku, fa, kk, eu, hi, ur el, ar, he tr, az, fi, hu, hy uk, pl, ru, mk, lt, be, sk cs, et, sq, hr, bs, sl, sr bg, ro, es, gl, it, pt fr, da, sv, nl, de, nb ko, bn, mr, hi, ur eu, ar, he hy, fa, ku hu, tr, az, ja, mn ka, ta kk, my mk, sq, pl, sk, hr, bs, be, et ru, uk, sl, fi, cs, lt zh, th, id, vi, ms bg, sv, da, nb, de, nl, el, es, it, gl, fr, pt, ro et, fi hi, my, hy, ka eu, az, kk, mn, ur, mr, bn, ta, ku, bs, be, ms gl, fr, it, es, pt nb, da, sv de, nl zh, ja, ko, hu, tr lt, sl, hr, sr, cs, sk, pl, ru, uk fa, id, ar, he, th, vi ro, sq mk, bg, el fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, sv, da, nb, ru, be, pl, bg, sl, sr, uk, hr, mk, sk, ro, sq, el, es, gl, pt, it vi, id, ms, th kk, az, tr, lt, ur, bn, mr, zh, my, ko, ja, mn, ta
Clustering Type 1 Clustering Type 2 Clustering Type 3 Clustering Type 4
c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c11c_{11} c1c_{1} c2c_{2} c3c_{3}
Koreanic
ko 15.1 - - - - - - - - - 15.1 - - - - - - - - - - - - - - - 17.4 - - - - - - 16.0
Japonic
ja 10.1 - - - - - - - - - - - - 10.1 - - - - - - - - - - - - 10.1 - - - - - - 8.6
Austroasiatic
vi - 21.6 - - - - - - - - - - - - - - - - 21.6 - - - - - - - - - 21.3 - - - 20.3 -
IE/Hellenic
el - - - - 30.0 - - - - - - - - - - - - - - 31.3 - - - - - - - - - - 29.1 29.6 - -
Kra-Dai
th - 22.9 - - - - - - - - - - - - - - - - 22.9 - - - - - - - - - 23.09 - - - 22.8 -
IE/Albanian
sq - - - - - - - 24.7 - - - - - - - - 23.2 - - - - - - - - - - - - 26.8 - 26.4 - -
Table 15: The results of clustering-based multilingual NMT with knowledge distillation for languages belong to Koreanic, Japonic, Austroasiatic, IE/Hellenic, Kra-Dai, and IE/Albanian families.
family lang ja, ko, mn, my ms, th, vi, zh, id mr, ta, bn, ka ku, fa, kk, eu, hi, ur el, ar, he tr, az, fi, hu, hy uk, pl, ru, mk, lt, be, sk cs, et, sq, hr, bs, sl, sr bg, ro, es, gl, it, pt fr, da, sv, nl, de, nb ko, bn, mr, hi, ur eu, ar, he hy, fa, ku hu, tr, az, ja, mn ka, ta kk, my mk, sq, pl, sk, hr, bs, be, et ru, uk, sl, fi, cs, lt zh, th, id, vi, ms bg, sv, da, nb, de, nl, el, es, it, gl, fr, pt, ro et, fi hi, my, hy, ka eu, az, kk, mn, ur, mr, bn, ta, ku, bs, be, ms gl, fr, it, es, pt nb, da, sv de, nl zh, ja, ko, hu, tr lt, sl, hr, sr, cs, sk, pl, ru, uk fa, id, ar, he, th, vi ro, sq mk, bg, el fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, sv, da, nb, ru, be, pl, bg, sl, sr, uk, hr, mk, sk, ro, sq, el, es, gl, pt, it vi, id, ms, th kk, az, tr, lt, ur, bn, mr, zh, my, ko, ja, mn, ta
Clustering Type 1 Clustering Type 2 Clustering Type 3 Clustering Type 4
c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c11c_{11} c1c_{1} c2c_{2} c3c_{3}
IE/Armenian
hy - - - - - 13.7 - - - - - - 12.7 - - - - - - - - 10.8 - - - - - - - - - 17.1 - -
Kartvelian
ka - - 9.2 - - - - - - - - - - - 10.8 - - - - - - 9.1 - - - - - - - - - 8.8 - -
Mongolic
mn 5.7 - - - - - - - - - - - - 6.6 - - - - - - - - 5.3 - - - - - - - - - - 6.2
Dravidian
ta - - 4.0 - - - - - - - - - - - 5.7 - - - - - - - 3.9 - - - - - - - - - - 3.4
Isolate
eu - - - 9.0 - - - - - - - 9.7 - - - - - - - - - - 8.1 - - - - - - - - 11.0 - -
Table 16: The results of clustering-based multilingual NMT with knowledge distillation for languages belong to IE/Armenian, Kartvelian, Mongolic, Dravidian, and Isolate families.