
Multilingual Neural Machine Translation:
Can Linguistic Hierarchies Help?

Fahimeh Saleh  Wray Buntine  Gholamreza Haffari  Lan Du
[email protected]
Monash University
  Corresponding author
Abstract

Multilingual Neural Machine Translation (MNMT) trains a single NMT model that supports translation between multiple languages, rather than training separate models for different languages. Learning a single model can enhance the low-resource translation by leveraging data from multiple languages. However, the performance of an MNMT model is highly dependent on the type of languages used in training, as transferring knowledge from a diverse set of languages degrades the translation performance due to negative transfer. In this paper, we propose a Hierarchical Knowledge Distillation (HKD) approach for MNMT which capitalises on language groups generated according to typological features and phylogeny of languages to overcome the issue of negative transfer. HKD generates a set of multilingual teacher-assistant models via a selective knowledge distillation mechanism based on the language groups, and then distills the ultimate multilingual model from those assistants in an adaptive way. Experimental results derived from the TED dataset with 53 languages demonstrate the effectiveness of our approach in avoiding the negative transfer effect in MNMT, leading to an improved translation performance (about 1 BLEU score on average) compared to strong baselines.

1 Introduction

The surge over the past few decades in the number of languages used in electronic texts for international communication has prompted Machine Translation (MT) systems to shift towards multilingualism. However, most successful MT applications, i.e., Neural Machine Translation (NMT) systems, usually rely on supervised deep learning, which is notoriously data-hungry Koehn and Knowles (2017). Despite decades of research, high-quality annotated MT resources are only available for a subset of the world’s thousands of languages Paolillo and Das (2006). Hence, data scarcity is one of the significant challenges that come along with language diversity and multilingualism in MT. One of the most widely researched approaches to tackle this problem is unsupervised learning, which takes advantage of available unlabeled data in multiple languages Lample et al. (2017); Arivazhagan et al. (2019); Snyder et al. (2010); Xu et al. (2019). However, unsupervised approaches have relatively lower performance compared to their supervised counterparts Dabre et al. (2020). A supervised alternative is Multilingual NMT (MNMT), which shares training data across languages; nevertheless, the performance of supervised MNMT models is highly dependent on the types of languages used to train the model Tan et al. (2019a). If languages are from very distant language families, they can lead to negative transfer Torrey and Shavlik (2010); Rosenstein (2005), causing lower translation quality compared to the individual bilingual counterparts.

To address this problem, some improvements have been achieved recently with solutions that employ some form of supervision to guide MNMT using linguistic typology Oncevay et al. (2020); Chowdhury et al. (2020); Kudugunta et al. (2019); Bjerva et al. (2019). Linguistic typology provides this supervision by characterising the world’s languages according to their functional and structural properties O’Horan et al. (2016). Taking advantage of this property, which captures both language similarity and language diversity, our approach combines two strategies for training an MNMT model: (a) creating a universal, language-independent MNMT model Johnson et al. (2017); and (b) systematically designing the possible variations of language-dependent MNMT models based on the language relations Maimaiti et al. (2019).

Our approach to preventing negative transfer in MNMT is to group languages that behave similarly into separate language clusters. Then, we perform Knowledge Distillation (KD) Hinton et al. (2015) by selectively distilling the knowledge of the bilingual teacher models within the same language cluster into a multilingual teacher-assistant model. The intermediate teacher-assistant models are representative of their own language clusters. We further adaptively distill knowledge from the multilingual teacher-assistant models to the ultimate multilingual student. In summary, our main contributions are as follows:

  • We use cluster-based teachers in a hierarchical knowledge distillation approach to prevent negative transfer in MNMT. Different from the previous cluster-based approaches in multilingual settings Oncevay et al. (2020); Tan et al. (2019a), our approach makes use of all the clusters with a universal MNMT model while retaining the language relatedness structure in a hierarchy.

  • We distill the ultimate MNMT model from multilingual teacher-assistant models, each of which represents one language family and usually performs better than the individual bilingual models from the same language family. Thus, the cluster-based teacher-assistant models can lead to a better knowledge distillation compared to a diverse set of bilingual teacher models as used in multilingual KD Tan et al. (2019b).

  • We explore a mixture of linguistic features by utilizing different clustering approaches to obtain the cluster-based teacher-assistants. As the language groups created by different language feature vectors can contribute differently to translation, we adaptively distill knowledge from the teacher-assistant models to the ultimate student to reduce the knowledge gap of the student.

  • We perform extensive experiments on 53 languages, showing the effectiveness of our approach in avoiding negative transfer in MNMT, leading to an improved translation performance (about 1 BLEU score on average) compared to strong baselines. We also conduct comprehensive ablation studies and analysis, demonstrating the impact of language clustering in MNMT for different language families and in different resource-size scenarios.

2 Related Work

The majority of works on MNMT mainly focus on different architectural choices varying in the degree of parameter sharing in the multilingual setting. For example, the works based on the idea of minimal parameter sharing share either the encoder, the decoder, or the attention module Firat et al. (2017); Lu et al. (2018), and those with complete parameter sharing tend to share entire models Johnson et al. (2017); Ha et al. (2016). In general, these techniques implicitly assume that a set of languages is pre-given, without considering the positive or negative effect of language transfer between the languages shared in one model. Hence, they can usually achieve results comparable to individual models (trained with individual language pairs) only when the languages are less diverse or the number of languages is small. When several diverse language pairs are involved in training an MNMT system, negative transfer Torrey and Shavlik (2010); Rosenstein (2005) usually happens between more distant languages, resulting in degraded translation accuracy in the multilingual setting. To address this problem, Tan et al. (2019a) suggested a clustering approach using either prior knowledge of language families or language embeddings. They obtained the language embedding by retrieving the representation of a language tag which is added to the input of an encoder in a universal MNMT model. Later, Oncevay et al. (2020) introduced another clustering technique using a multi-view language representation. They fused language embeddings learned in an MNMT model with syntactic features of a linguistic knowledge base Dryer and Haspelmath (2013). Tan et al. (2019b) proposed a knowledge distillation approach which transfers knowledge from bilingual teachers to a multilingual student when the accuracy of the teachers is higher than that of the student. Their approach eliminates the accuracy gap between the bilingual and multilingual NMT models. However, we argue that distilling knowledge from a diverse set of parent models into a student model can be sub-optimal, as the parents may compete instead of collaborating with each other, resulting in negative transfer.

3 Hierarchical Knowledge Distillation

We address the problem of data scarcity and negative transfer in MNMT with a Hierarchical Knowledge Distillation (HKD) approach.

Figure 1: HKD approach: In the first phase of knowledge distillation, aka “Selective KD”, knowledge is transferred from the bilingual teacher models in each cluster (orange circles) to the multilingual teacher-assistant models (green circles). For example, $T_{l_1}$, $T_{l_2}$, $T_{l_4}$, and $T_{l_6}$ belong to one cluster and are distilled into the teacher-assistant model $T_{c_1}$. In the second KD phase, aka “Adaptive KD”, knowledge is transferred adaptively from the ensemble of related intermediate teacher-assistant models to the ultimate student (red circle).

The hierarchy in HKD is constructed in such a way that the node structure captures the similarity structure and the relatedness of the languages. Specifically, in the inverse pyramidal structure shown in Figure 1, the root node corresponds to the ultimate MNMT model that we aim to train, the leaf nodes correspond to the individual bilingual NMT models, and the non-terminal nodes represent the language clusters. Our hypothesis is that, by leveraging the common characteristics of languages in the same group, formed using clustering algorithms based on the typological properties of languages O’Horan et al. (2016), the HKD method can train a high-quality MNMT model by distilling knowledge from related languages rather than diverse ones.

Our HKD approach consists of two knowledge distillation mechanisms, providing two levels of supervision for training the ultimate MNMT model (illustrated in Figure 1): (i) selective distillation of knowledge from the individual bilingual teachers to the multilingual intermediate teacher-assistants, each of which corresponds to one language group; and (ii) adaptive distillation of knowledge from all related cluster-wise teacher-assistants to the super-multilingual ultimate student model, performed in each mini-batch of training per language pair. Note that we do not use adaptive KD in both distillation phases, as adaptive KD requires the predictions of all the relevant experts; this is particularly impractical in the first phase, where there is a huge set of diverse teachers. Hence, in the first distillation phase, we generate the cluster-wise teacher-assistants using selective KD as the prerequisite for the adaptive KD phase. The two phases are sketched below, and the main steps of HKD are elaborated as follows:
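To make the overall pipeline concrete, the following is a minimal Python sketch of the two-phase procedure. It is an illustration only: train_selective_kd and train_adaptive_kd are hypothetical wrappers around the selective and adaptive KD training loops, and bilingual_teachers, corpora, and all_clusters are assumed to be prepared beforehand.

def hkd(all_clusters, bilingual_teachers, corpora):
    # Phase 1 (selective KD): one multilingual teacher-assistant per cluster,
    # distilled from the bilingual teachers of that cluster's languages.
    assistants = []
    for cluster in all_clusters:
        teachers = {l: bilingual_teachers[l] for l in cluster}
        data = {l: corpora[l] for l in cluster}
        assistants.append(train_selective_kd(teachers, data))  # hypothetical wrapper
    # Phase 2 (adaptive KD): distill all effective, cluster-wise assistants
    # into a single many-to-one student with per-batch weights alpha.
    student = train_adaptive_kd(assistants, all_clusters, corpora)  # hypothetical wrapper
    return student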

Clustering: Clustering can be conducted using different types of language vectors, such as: (i) sparse language vectors from typological knowledge bases (KBs), (ii) dense language embedding vectors learned from multilingual NLP tasks, and (iii) the combination of KB and task-learned language vectors. The implicit causal relationships between languages are usually learned from translation tasks, while the genetic, geographical, and structural similarities between languages are extracted from typological KBs Bjerva et al. (2019). Thus, the language groups created by different language vectors can contribute differently to translation, and it is not quite clear which types of language features are more helpful in MNMT systems Oncevay et al. (2020). For example, “Greek” can be clustered with “Arabic” and “Hebrew” based on the mix of KB and task-learned language vectors, while it can be clustered with “Macedonian” and “Bulgarian” based on NMT-learned language vectors. Therefore, we cluster the languages based on all types of language representations and propose to explore a mixture of linguistic features by utilizing all clusters in training the ultimate MNMT student. Given a training dataset consisting of $L$ languages and $K$ clustering approaches, where the $k$-th clustering approach creates $n_{k}$ clusters, we are interested in training a many-to-one MNMT model (the ultimate student) by hierarchically distilling knowledge from all $M$ clusters to the ultimate student, where $M:=\sum_{k=1}^{K}n_{k}$.
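As an illustration of this clustering step, the sketch below groups languages with k-means over each representation type and takes the union of all resulting clusters. The toy language vectors, the language subset, and the cluster counts are assumptions for the sake of a runnable example, not the paper's exact settings.

import numpy as np
from sklearn.cluster import KMeans

def cluster_languages(lang_vectors, n_clusters):
    """Group languages given one type of language representation."""
    langs = sorted(lang_vectors)
    X = np.stack([lang_vectors[l] for l in langs])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    clusters = [[] for _ in range(n_clusters)]
    for lang, label in zip(langs, labels):
        clusters[label].append(lang)
    return clusters

# Toy language vectors purely for illustration (real ones would come from
# WALS features, NMT-learned language embeddings, and their fused view).
rng = np.random.default_rng(0)
langs = ["el", "ar", "he", "mk", "bg", "gl", "pt", "es"]
kb_vectors = {l: rng.random(8) for l in langs}
nmt_vectors = {l: rng.random(16) for l in langs}
fused_vectors = {l: rng.random(12) for l in langs}

# One clustering per representation type; the union of the resulting clusters
# gives the M = sum_k n_k groups used to train the teacher-assistants.
all_clusters = []
for vectors, n_k in [(kb_vectors, 2), (nmt_vectors, 3), (fused_vectors, 3)]:
    all_clusters.extend(cluster_languages(vectors, n_k))
M = len(all_clusters)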

Multilingual selective knowledge distillation: Assume we have a language cluster that consists of $L^{\prime}$ languages, indexed by $l\in\{1,2,\dots,L^{\prime}\}$. Given a collection of pretrained individual teacher models $\{\theta^{l}\}_{l=1}^{L^{\prime}}$, each handling one language pair in $\{\mathcal{D}^{l}\}_{l=1}^{L^{\prime}}$, and inspired by Tan et al. (2019b), we use the following knowledge distillation objective for each language $l$ in the cluster.

\mathcal{L}_{KD}^{selective}(\mathcal{D}^{l},\theta^{c},\theta^{l}):=-\sum_{{\bm{x}},{\bm{y}}\in\mathcal{D}^{l}}\sum_{t=1}^{|{\bm{y}}|}\sum_{v\in V}Q(v|{\bm{y}}_{<t},{\bm{x}},\theta^{l})\log P(v|{\bm{y}}_{<t},{\bm{x}},\theta^{c})   (1)

where $\theta^{c}$ is the teacher-assistant model, $V$ is the output vocabulary, $P(\cdot\mid\cdot)$ is the conditional probability of the teacher-assistant model, and $Q(\cdot\mid\cdot)$ denotes the output distribution of the bilingual teacher model. According to Eq. (1), knowledge distillation regularises the predictive probabilities generated by a cluster-wise multilingual model towards those generated by each individual bilingual model. Together with the translation loss ($\mathcal{L}_{NLL}$), we have the following selective KD loss for generating the intermediate teacher-assistant model:

\mathcal{L}_{ALL}^{selective}(\mathcal{D}^{l},\theta^{c},\theta^{l}):=(1-\lambda)\,\mathcal{L}_{NLL}(\mathcal{D}^{l},\theta^{c})+\lambda\,\mathcal{L}_{KD}^{selective}(\mathcal{D}^{l},\theta^{c},\theta^{l})

where $\lambda$ is a tuning parameter that balances the contribution of the two losses. Instead of using all language pairs in Eq. (1), we use a deterministic but dynamic approach that excludes a language pair from the distillation term if the multilingual student surpasses the corresponding individual model during training, which makes the training selective.
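For concreteness, here is a minimal PyTorch-style sketch of the objective in Eq. (1) combined with the translation loss as above. Tensor shapes and the function name are illustrative, the selection logic that switches the distillation term on and off is omitted, and λ = 0.6 follows the setting reported in Section A.2.

import torch
import torch.nn.functional as F

def selective_kd_loss(student_logits, teacher_logits, tgt, pad_id, lam=0.6):
    """(1 - lam) * NLL + lam * word-level KD towards the bilingual teacher.

    student_logits / teacher_logits: (batch, tgt_len, |V|); tgt: (batch, tgt_len).
    """
    log_p = F.log_softmax(student_logits, dim=-1)       # student log P(v | y_<t, x)
    q = F.softmax(teacher_logits, dim=-1).detach()      # teacher distribution Q(v | y_<t, x)
    mask = tgt.ne(pad_id).float()                       # ignore padding positions
    kd = (-(q * log_p).sum(-1) * mask).sum() / mask.sum()              # Eq. (1), token-averaged
    nll = F.nll_loss(log_p.transpose(1, 2), tgt, ignore_index=pad_id)  # translation loss
    return (1.0 - lam) * nll + lam * kd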

Input: Training corpora: $\{\mathcal{D}^{l}\}_{l=1}^{L}$, where $\mathcal{D}^{l}:=\{({\bm{x}}_{1}^{l},{\bm{y}}_{1}),\dots,({\bm{x}}_{n}^{l},{\bm{y}}_{n})\}$;
List of languages: $L$;
List of language clusters: $\{C^{m}\}_{m=1}^{M}$;
Cluster-based MNMT models: $\{\theta^{c}\}_{c=1}^{M}$;
Total training epochs: $N$;
Batch size: $J$;
Output: Ultimate multilingual student model: $\theta_{s}$;
Randomly initialize the multilingual model $\theta_{s}$, the accumulated gradient $g=0$, and the distillation flag $f^{l}=\mathrm{True}$ for $l\in L$;
$n=0$;
while $n<N$ do
    $g=0$;
    for $l\in L$ do
        // find the effective clusters containing similar languages
        $C_{sim}=[\,]$;
        for $c\in\{C\}_{1}^{M}$ do
            if $l\in c$ then
                $C_{sim}.\mathrm{append}(c)$;
        $D^{l}=\mathrm{random\_permute}(\mathcal{D}^{l})$;
        ${\bm{b}}_{1}^{l},\dots,{\bm{b}}_{J}^{l}=\mathrm{create\_minibatches}(\mathcal{D}^{l})$;   // where $b^{l}=(x^{l},y)$
        $j=1$;
        while $j\leq J$ do
            // compute contribution weights
            for $c\in C_{sim}$ do
                $\Delta_{c}=-\mathrm{ppl}(\theta^{c}({\bm{b}}_{j}^{l}))$;
            ${\bm{\alpha}}=\mathrm{softmax}(\Delta_{1},\dots,\Delta_{C_{sim}})$;
            // compute the gradient of the loss $\mathcal{L}_{ALL}^{adapt.}$
            ${\bm{g}}=\nabla_{\theta_{s}}\mathcal{L}_{ALL}^{adapt.}({\bm{b}}_{j}^{l},\theta_{s},\{\theta^{c}\}_{1}^{C_{sim}},{\bm{\alpha}})$;
            // update the parameters
            $\theta_{s}=\mathrm{update\_param}(\theta_{s},{\bm{g}})$;
            $j=j+1$;
    $n=n+1$;
Algorithm 1: Multilingual Adaptive KD

This selective distillation process (the training algorithm for selective knowledge distillation is summarized in Alg. 2 in Section A.1 of the Appendix and is similar to the one used in Tan et al. (2019b)) is applied to all clusters obtained from the different clustering approaches. It is noteworthy that (i) the selective knowledge distillation generates a teacher-assistant model for each cluster $c\in\{1,2,\dots,M\}$; and (ii) each language can belong to multiple clusters due to the use of different language representations, so there can be more than one effective teacher-assistant model for any given language pair (illustrated in Figure 2). Hence, for each language pair, we have a set of effective clusters indexed by $c\in\{1,2,\dots,C_{sim}\}$.
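A small sketch of how the effective teacher-assistants for a language pair can be collected; all_clusters and assistants are hypothetical containers produced by the clustering and selective-KD steps above.

def effective_assistants(lang, all_clusters, assistants):
    """Return the teacher-assistants whose cluster contains `lang` (the C_sim set)."""
    return [assistants[c] for c, cluster in enumerate(all_clusters) if lang in cluster]

# A language clustered differently by the KB-based and NMT-learned views
# ends up with one effective teacher-assistant per clustering type.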

Figure 2: Effective teachers for each language after clustering. $C$ refers to the clustering type and $T$ refers to the teacher. For language a, we have two effective teachers: $T_{C_{1}}^{1}$ and $T_{C_{2}}^{1}$.

Multilingual adaptive knowledge distillation: Given a collection of effective teacher-assistant models $\{\theta^{c}\}_{c=1}^{C_{sim}}$, where $C_{sim}$ is the number of effective clusters per language, we devise the following KD objective for each language pair,

\mathcal{L}_{KD}^{adaptive}(\mathcal{D}^{l},\theta_{s},\{\theta^{c}\}_{1}^{C_{sim}},{\bm{\alpha}}):=-\sum_{c=1}^{C_{sim}}\sum_{{\bm{x}},{\bm{y}}\in\mathcal{D}^{l}}{\bm{\alpha}}_{c}\sum_{t=1}^{|{\bm{y}}|}\sum_{v\in V}Q(v|{\bm{y}}_{<t},{\bm{x}},\theta^{c})\log P(v|{\bm{y}}_{<t},{\bm{x}},\theta_{s})   (3)

where ${\bm{\alpha}}$ dynamically weighs the contribution of the teacher-assistants/clusters. ${\bm{\alpha}}$ is computed via an attention mechanism based on the rewards (negative perplexity) attained by the teachers on the data, where these values are passed through a softmax transformation to turn them into a distribution Saleh et al. (2020).
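The sketch below shows one way such weights can be computed per mini-batch: each effective teacher-assistant is scored by the negative perplexity it attains on the batch, and the scores are normalised with a softmax. avg_token_nll is a hypothetical helper returning a model's mean token-level negative log-likelihood on the batch.

import torch

def contribution_weights(assistants, batch, avg_token_nll):
    """alpha over the C_sim effective teacher-assistants for one mini-batch."""
    rewards = []
    with torch.no_grad():
        for ta in assistants:
            ppl = torch.exp(avg_token_nll(ta, batch))  # perplexity of this assistant
            rewards.append(-ppl)                       # reward = negative perplexity
    return torch.softmax(torch.stack(rewards), dim=0)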

cluster 1 cluster 2 cluster 3 cluster 4 cluster 5 cluster 6 cluster 7 cluster 8 cluster 9 cluster 10
Japanese Korean Mongolian Burmese Malay Thai Chinese Indonesian Vietnamese Marathi Tamli Bengali Georgian Kurdish Persian Kazakh Basque Hindi Urdu Greek Arabic Hebrew Turkish Azerbaijani Finnish Hungarian Armenian Slovak Polish Russian Macedonian Lithuanian Belarusian Ukrainian Croatian Bosnian Slovenian Serbian Czech Estonian Albanian Galician Italian Portuguese Bulgarian Romanian Spanish French Danish Swedish Dutch German Bokmal
Table 1: Clustering type (1) – SVCCA-53 Oncevay et al. (2020), based on a multi-view representation using both the syntactic features of WALS and language vectors learned by an MNMT model trained on TED-53.
cluster 1 cluster 2 cluster 3 cluster 4 cluster 5 cluster 6 cluster 7 cluster 8 cluster 9 cluster 10
Korean Bengali Marathi Hindi Urdu Basque Arabic Hebrew Armenian Persian Kurdish Hungarian Azerbaijani Mongolian Turkish Japanese Georgian Tamli Kazakh Burmese Bosnian Albanian Polish Slovak Croatian Macedonian Belarusian Estonian Russian Ukrainian Slovenian Lithuanian Finnish Czech Serbian Thai Malay Indonesian Vietnamese Chinese Romanian Spanish Italian Galician Bulgarian Swedish German Dutch Portuguese French Danish Greek Bokmal
Table 2: Clustering type (2) – SVCCA-23 Oncevay et al. (2020), based on a multi-view representation using both the syntactic features of WALS and language vectors learned by an MNMT model trained on WIT-23.
cluster 1 cluster 2 cluster 3 cluster 4 cluster 5 cluster 6 cluster 7 cluster 8 cluster 9 cluster 10 cluster 11
Estonian Finnish Hindi Burmese Armenian Georgian Basque Bengali Kurdish Bosnian Belarusian Azerbaijani Mongolian Marathi Urdu Tamli Kazakh Malay Galician French Italian Spanish Portuguese Bokmal Danish Swedish German Dutch Chinese Japanese Korean Hungarian Turkish Slovak Polish Russian Ukrainian Lithuanian Slovenian Croatian Serbian Czech Persian Indonesian Hebrew Vietnamese Thai Arabic Romanian Albanian Macedonian Bulgarian Greek
Table 3: Clustering type (3) – Based on NMT-learned representation using a set of 53 factored language embeddings Oncevay et al. (2020); Tan et al. (2019a).
cluster 1 cluster 2 cluster 3
Bengali Chinese Korean Kazakh Azerbaijani Mongolian Burmese Japanese Marathi Urdu Hindi Turkish Tamli Thai Vietnamese Indonesian Malay Armenian Georgian Macedonian Bosnian Slovenian Hungarian Basque Serbian Dutch Albanian Greek German Galician Romanian Portuguese Croatian Bokmal Spanish French Belarusian Lithuanian Danish Ukrainian Hebrew Polish Czech Russian Finnish Bulgarian Kurdish Italian Slovak Estonian Arabic Persian Swedish
Table 4: Clustering type (4) – Based on KB representation using syntax features of WALS Oncevay et al. (2020).

IE/Balto-Slavic: be, bs, sl, mk, lt, sk, cs, uk, hr, sr, bg, pl, ru
IE/Italic: gl, pt, ro, fr, es, it
IE/Indo-Iranian: bn, ur, ku, hi, fa, mr
IE/Germanic: nb, da, sv, de, nl
Turkic: kk, az, tr
Uralic: et, fi, hu
Afroasiatic: he, ar
Sino-Tibetan: my, zh
Austronesian: ms, id
Koreanic: ko
Japonic: ja
Austroasiatic: vi
IE/Hellenic: el
Kra-Dai: th
IE/Albanian: sq
IE/Armenian: hy
Kartvelian: ka
Mongolic: mn
Dravidian: ta
Isolate (Basque): eu
Table 5: Language families Eberhard et al. (2019). IE refers to Indo-European.

This adaptive distillation of knowledge allows the student model to get the best of the teacher-assistants (which are representative of different linguistic features) based on their effectiveness in reducing the knowledge gap of the student. The total loss function then becomes a weighted combination of the losses coming from the ensemble of teachers and the data,

\mathcal{L}_{ALL}^{adaptive}(\mathcal{D}^{l},\theta_{s},\{\theta^{c}\}_{1}^{C_{sim}},{\bm{\alpha}}):=\lambda_{1}\,\mathcal{L}_{NLL}(\mathcal{D}^{l},\theta_{s})+\lambda_{2}\,\mathcal{L}_{KD}^{adapt.}(\mathcal{D}^{l},\theta_{s},\{\theta^{c}\}_{1}^{C_{sim}},{\bm{\alpha}})

The training process is summarized in Alg. 1.
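Complementing Alg. 1, here is a PyTorch-style sketch of the adaptive objective in Eq. (3) together with the total loss above; it mirrors the selective loss sketch but weights each effective teacher-assistant c by α_c. Tensor shapes and function names are illustrative, and λ1 = λ2 = 0.5 correspond to the initial values reported in Section A.2.

import torch
import torch.nn.functional as F

def adaptive_kd_loss(student_logits, assistant_logits, alpha, tgt, pad_id,
                     lam1=0.5, lam2=0.5):
    """lam1 * NLL + lam2 * sum_c alpha_c * KD(assistant_c -> student)."""
    log_p = F.log_softmax(student_logits, dim=-1)
    mask = tgt.ne(pad_id).float()
    kd = 0.0
    for a_c, logits_c in zip(alpha, assistant_logits):   # one term per effective cluster
        q_c = F.softmax(logits_c, dim=-1).detach()
        kd = kd + a_c * ((-(q_c * log_p).sum(-1) * mask).sum() / mask.sum())
    nll = F.nll_loss(log_p.transpose(1, 2), tgt, ignore_index=pad_id)
    return lam1 * nll + lam2 * kd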

4 Experiment

In this section we explain our experiment settings as well as our experimental findings.

Data: We conducted extensive experiments on a parallel corpus (53 languages\rightarrowEnglish) from TED talks transcripts (https://github.com/neulab/word-embeddings-for-nmt) created and tokenized by Qi et al. (2018). In this corpus, 26% of the language pairs have at most 10k sentences (extremely low-resource), and 33% have fewer than 20k sentences (low-resource). All sentences were segmented with BPE Sennrich et al. (2016) using a BPE model with 32k merge operations learned jointly on all languages. We kept the output vocabulary of the teacher and student models the same to make knowledge distillation feasible. Details about the size of the training data, language codes, and preprocessing steps are described in Section A.2.
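The paper uses the BPE implementation of Sennrich et al. (2016) with 32k merge operations learned jointly over all languages. Purely as an illustration of building such a shared subword vocabulary, the sketch below uses sentencepiece in BPE mode as a stand-in; the file names and the vocab_size-as-proxy-for-merges choice are assumptions, not the paper's actual toolkit or settings.

import sentencepiece as spm

# Train one shared BPE model on the concatenation of all training text so that
# teachers and students share the same output vocabulary.
spm.SentencePieceTrainer.train(
    input="all_languages.train.txt",   # assumed concatenated training text
    model_prefix="shared_bpe32k",
    vocab_size=32000,                  # stand-in for 32k merge operations
    model_type="bpe",
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="shared_bpe32k.model")
print(sp.encode("negative transfer in multilingual NMT", out_type=str))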

Clustering: We clustered all the languages based on the three different types of representations discussed in Section 3 in order to take advantage of a mixture of linguistic features while training the ultimate student. Following Oncevay et al. (2020), we adopted their multi-view language representation approach, which uses Singular Vector Canonical Correlation Analysis (SVCCA) Raghu et al. (2017) to fuse the one-hot encoded KB representation obtained from the syntactic features of WALS Dryer and Haspelmath (2013) and a dense NMT-learned view obtained from MNMT Tan et al. (2019a). Specifically, SVCCA-53 uses the 53 languages of the TED dataset to build the language representations and generates 10 clusters, where the languages within each cluster usually share phylogenetic or geographical features. SVCCA-23 instead uses the 23 languages of WIT-23 Cettolo et al. (2012) to compute the shared space. We also generated language clusters based on either the KB representation alone, using the syntactic features of WALS Dryer and Haspelmath (2013), or the NMT-learned representation alone. Tables 1-4 show the generated language clusters.
Training Configuration: All models were trained with the Transformer architecture Vaswani et al. (2017). The individual baseline models were trained with a model hidden size of 256, a feed-forward hidden size of 1024, and 2 layers. All multilingual models were trained with a model hidden size of 512, a feed-forward hidden size of 1024, and 6 layers. We used the selective multilingual knowledge distillation approach for training our cluster-based MNMT models, as selective multilingual KD performs better than universal MNMT without KD Tan et al. (2019b). We did not carry out intensive parameter tuning to search for the best settings of each model, for the sake of simplicity; the purpose of our experiments is rather to demonstrate the benefit of considering different types of language clusters through a unified hierarchical knowledge distillation. Details about the training configuration are provided in Section A.2.

4.1 Experimental Results

The translation results of (53 languages \rightarrow English) for all approaches are summarised in Table 6. The language pairs are sorted based on the size of the training data in ascending order. The translation quality is evaluated and reported based on the BLEU score Papineni et al. (2002).

Resource src size #sent. (k) Baseline Multilingual Selective KD Multi. HKD
Individ. Universal Multi. All langs Clus. type1 Clus. type2 Clus. type3 Clus. type4
Extremely low resource
kk 3.3 3.42 5.05 4.66 7.00 3.51 3.10 8.13 6.61
be 4.5 5.13 12.51 12.36 12.78 10.81 8.46 15.18 14.88
bn 4.6 5.06 12.50 12.58 9.13 10.11 12.16 10.13 12.81
eu 5.1 4.4 13.12 12.00 9.08 9.70 8.14 11.03 12.90
ms 5.2 3.78 13.88 14.61 12.93 12.94 7.63 12.98 14.11
bs 5.6 7.92 14.82 15.46 18.05 16.89 9.03 19.02 17.51
az 5.9 5.79 10.32 9.91 9.59 9.17 8.64 9.23 10.40
ur 5.9 8.98 12.76 16.50 13.35 13.17 12.02 13.39 15.95
ta 6.2 4.57 5.86 6.19 4.02 5.76 3.96 3.49 5.63
mn 7.6 3.54 6.11 5.82 5.75 6.60 5.39 6.20 6.81
mr 9.8 6.92 10.53 10.72 8.70 9.00 8.39 9.04 10.98
gl 10.0 13.5 22.04 22.44 25.53 26.81 26.93 25.90 27.11
ku 10.3 6.3 10.32 12.12 9.43 6.98 9.22 12.93 13.03
et 10.7 8.24 12.21 13.19 13.47 13.21 10.44 13.94 14.10
Low resource
ka 13.1 8.64 8.18 8.66 9.28 10.85 9.14 8.88 11.15
nb 15.8 26.36 28.49 29.08 33.55 34.31 28.79 30.87 33.89
hi 18.7 10.66 16.03 17.93 13.27 12.09 12.16 12.80 16.11
sl 19.8 11.45 15.12 15.39 16.48 16.54 17.75 18.31 18.43
Enough resource
hy 21.3 11.14 14.07 15.12 13.76 12.77 10.81 17.17 16.72
my 21.4 4.91 10.70 11.11 9.65 6.35 8.48 9.54 8.81
fi 24.2 8.16 11.69 12.23 11.36 12.57 10.59 12.76 12.90
mk 25.3 18.32 20.63 21.09 21.48 20.06 24.65 23.8 25.05
lt 41.9 14.78 15.44 16.76 16.98 16.9 17.96 18.24 18.11
sq 44.4 22.62 24.44 25.22 24.74 23.22 26.89 26.42 26.93
da 44.9 31.85 30.39 30.61 35.02 39.76 30.58 32.04 36.00
sv 56.6 27.2 27.18 26.84 31.36 34.52 26.81 28.65 33.14
sk 61.4 19.36 22.04 22.58 22.49 21.18 24.08 23.77 24.33
id 87.4 20.51 20.89 20.69 21.11 21.13 22.56 21.12 22.76
th 96.9 20.46 21.34 21.72 22.94 22.94 23.09 22.87 23.30
cs 103.0 20.13 22.01 22.07 21.72 22.49 23.12 22.86 23.62
uk 108.4 21.32 22.11 23.07 23.06 22.91 23.58 23.66 24.09
hr 122.0 25.89 26.51 27.17 27.56 25.62 28.66 28.34 28.91
el 134.3 26.82 26.07 28.51 30.05 31.35 29.13 29.66 30.10
sr 136.8 26.94 25.43 25.88 27.48 25.75 27.69 27.12 27.97
hu 147.1 18.46 17.61 18.55 19.08 18.41 20.16 19.82 20.10
fa 150.8 23.60 21.7 21.29 21.31 22.44 23.51 22.24 23.19
de 167.8 15.23 14.83 16.69 16.88 17.79 15.44 16.67 18.04
ja 168.2 10.11 8.61 8.93 10.14 10.14 10.17 8.69 10.30
vi 171.9 18.97 19.19 20.58 21.60 21.60 21.30 20.33 21.82
bg 174.4 28.85 27.66 29.14 31.67 32.18 29.86 30.48 32.33
pl 176.1 17.23 18.62 19.45 19.45 18.37 19.93 20.26 20.71
ro 180.4 25.21 25.97 26.53 28.03 28.43 28.90 27.35 29.00
tr 182.3 17.72 10.2 10.01 18.66 18.19 19.85 18.27 16.91
nl 183.7 27.65 26.91 26.82 28.05 29.07 29.17 28.03 29.58
zh 184.8 20.44 22.10 22.71 22.19 22.11 23.91 22.84 23.85
es 195.9 30.17 29.55 30.00 29.06 33.45 31.46 31.82 32.76
it 204.4 26.84 25.13 27.99 30.57 30.85 30.36 29.45 30.93
ko 205.4 15.98 15.71 16.41 15.17 15.18 17.45 16.00 17.70
ru 208.4 19.76 19.83 20.86 20.85 20.80 21.25 21.49 21.77
he 211.7 29.35 28.03 28.27 32.82 32.18 31.32 30.02 32.05
fr 212.0 30.08 30.55 30.28 32.25 32.19 31.14 31.34 32.65
ar 213.8 25.36 23.89 24.48 28.53 28.03 27.46 25.85 28.94
pt 236.4 30.99 31.12 30.85 33.36 33.84 33.25 32.56 33.90
Avg. - - 18.50 18.64 19.24 19.84 19.87 19.35 20.05 21.16
Table 6: BLEU scores of the translation tasks for 53 Languages\rightarrowEnglish. The bold numbers show the best scores and the underlined ones show the 2nd best results. "All langs" refers to all 53 languages trained with multilingual selective KD.

4.1.1 Studies of cluster-based MNMT models

In this section, we discuss the cluster-based MNMTs’ results through the following observations.

Low-resource vs high-resource languages: All cluster-based MNMT and baseline approaches are ranked in Table 7 based on the number of times they obtained the first or second-best score in different resource-size scenarios. Based on this result, and also on the results presented in Table 6, the massive MNMT models trained with all languages (second column under baseline and first column under selective KD in Tables 6 and 7) outperform the cluster-based MNMT models (columns 2-5 under selective KD in Tables 6 and 7) in extremely low-resource scenarios (e.g., bn-en, ta-en, eu-en). This result shows that, for under-resourced languages, having more data, whether from related or distant languages, has the largest impact on training a better MNMT model. Furthermore, clustering type (4) is dominant among the clustering approaches in under-resourced situations. This is explainable by the size of the clusters in clustering type (4): the translations of extremely low-resource languages are significantly improved when they are placed in the third cluster of clustering type (4), which contains 35 languages (shown in Table 4). However, for languages with enough resources, the multilingual baselines with all languages underperform the other cluster-based MNMT models.

Resource-size size (# sent.) Baseline Multilingual Selective KD
Individ. Multi. All langs Clus. type1 Clus. type2 Clus. type3 Clus. type4
Extremely low resource <= 10k 0% 21.43% 28.57% 14.28% 7.14% 3.58% 25.00%
Low resource > 10k and <= 20k 0% 12.50% 12.50% 25.00% 25.00% 12.50% 12.50%
Enough resource > 20k 1.39% 1.39% 2.78% 20.83% 25.00% 27.78% 20.83%
Table 7: The translation ranking ablation study for all approaches excluding the HKD approach, based on the percentage of times they got the 1st or 2nd best results. The percentages in each row sum to 100%.
Resource-size size (# sent.) Baseline Multilingual Selective KD HKD
Individ. Multi. All langs Clus. type1 Clus. type2 Clus. type3 Clus. type4
Extremely low resource <= 10k 0% 13.79% 17.24% 6.90% 3.45% 3.45% 17.24% 37.93%
Low resource > 10k and < 20k 0% 0% 12.50% 0% 25.00% 0% 12.50% 50.00%
Enough resource >20k 1.43% 1.43% 1.43% 5.71% 12.86% 24.28% 8.57% 44.29%
Table 8: The ranking of all approaches based on the percentage of times they got the 1st or 2nd best results.
Model Contrib. Langs BLEU Contrib. Langs BLEU
Clus. Rand.1 gl, nb, uk, hr, se, ja 16.53 el, id, be 28.01
Clus. Rand.2 gl, ta, mk, be, id, sq, pt, fr, ur, az, ku, bs, fa 20.27 el, cs, lt, id, sk, th, it, hy, ms, hu, mk, my, bn 27.78
\hdashline Clus. Rand.3 gl, az, ja, nb, kk 13.61 el, sq, th 27.97
Clus. Rand.4 gl, zh, pt, fa, ar, kk, sr, bg, nl, cs, th, ko, vi, hu, mk, fi, ru, mn, de, sl, el, ka, pl, et, ta, fr, ur, ro, sv, mr, be, bs, uk, sq, az 22.20 el, zh, pt, fa, ar, kk, sr, bg, nl, cs,th, ko, vi, hu, mk, fi, ru, mn, de, sl, gl, ka, pl, et, ta, fr, ur, ro, sv, mr, be, bs, uk, sq, az 28.82
Avg. - 18.15 - 28.14
Table 9: Ablation study using random languages in all clustering types for (gl\rightarrowen) and (el\rightarrowen). The results for the actual clustering are provided in Table 6. The average results of the actual clustering for (gl\rightarrowen) and (el\rightarrowen) are 26.29 and 30.04 respectively.

Related vs isolated languages: A group of languages that originated from a common ancestor is known as a language family, and a language that does not have any established relationship with other languages is called a language isolate. The language families Eberhard et al. (2019) are shown in Table 5. According to the results shown in Table 6, the clustering approaches usually behave similarly and show less diversity when clustering languages belonging to the IE/Germanic, IE/Italic, Afroasiatic, and Austronesian families. In comparison, there is more diversity, and less consensus, when clustering languages belonging to the IE/Balto-Slavic, IE/Indo-Iranian, Turkic, Uralic, and Sino-Tibetan families. Moreover, the isolated languages (shown in the last 11 rows of Table 5) generally behave similarly and show less variance in BLEU scores across the different clustering approaches. This observation shows that cluster-based MNMT models (regardless of the clustering type) do not significantly improve the translation of isolated languages unless the isolated languages have extremely low resources and are clustered in a huge cluster (e.g., eu-en, hy-en). The results of cluster-based MNMT are presented by language family in Section A.3.

Random clustering vs actual clustering: We conducted an ablation study using clusters with randomly chosen languages for two translation tasks (el\rightarrowen and gl\rightarrowen). We kept the number of languages per cluster the same as in the actual clustering to make a fair comparison. According to the results presented in Table 9, for both translation tasks the random clusters underperform the actual clusters in all clustering types. Notably, however, the difference in average BLEU score between the random and actual clusters for gl\rightarrowen is considerably larger than for el\rightarrowen (gl\rightarrowen, \Delta = -8.14 vs el\rightarrowen, \Delta = -1.9). This observation is in line with the previous observation that Greek (el) is an isolated language categorised in the IE/Hellenic family, and clustering approaches have less impact on this language due to its lower similarity to most languages. In comparison, Galician (gl) is highly similar to the languages in the IE/Italic family, and clustering improves the translation of gl\rightarrowen remarkably. We compare the translation results of the actual cluster-based and random cluster-based multilingual approaches in Section A.3.

4.1.2 Studies of HKD

According to the results in Table 6, the HKD approach outperforms the massive and cluster-based MNMT models on average by 1.11 BLEU. We discuss this further through the following observations:

Ranking based on data size: We ranked all approaches, including HKD, based on the number of times they obtained the first or second-best score (shown in Table 8). According to this result, the HKD approach has the best rank in all three situations, i.e., extremely low resource, low resource, and enough resource. This observation shows that the HKD approach is robust across data-size situations, leveraging the best of both multilingual NMT and language-relatedness guidance in a systematic HKD setting.

Clustering consistency impact on HKD: Based on the results in Table 6, the HKD approach underperforms other multilingual approaches when the clusters are inconsistent, causing a high variance in the teacher-assistants’ results. For example, for kk\rightarrowen and bs\rightarrowen, the variance of the BLEU scores of the teacher-assistant models is 6.92 and 20.81 respectively, and HKD fails to achieve a good result. This observation shows that, although in the second phase of the HKD approach the cluster-based teacher-assistants adaptively contribute to training the ultimate student, a weak teacher-assistant can still deteriorate the collaborative teaching process. One possible solution is excluding the worst teacher-assistant in such heterogeneous situations.

4.1.3 Comparison with other approaches

To highlight the pros and cons of the related baselines (with and without KD), we draw a comparison in Table 10. Accordingly, our HKD approach can be compared with the other approaches along the following properties:

Multilingual translation: Our approach works in a multilingual setting by sharing resources between high-resource and low-resource languages. This property not only improves the regularisation of the model by avoiding over-fitting to the limited data of the low-resource languages, but also decreases the deployment footprint by carrying out the whole training in a single model instead of having individual models per language Dabre et al. (2020).

Optimal transfer: In the HKD approach, we achieve optimal transfer by transferring knowledge from all possible languages related to a student in the hierarchical structure, which leads to the best average BLEU score (21.16) compared to the other baselines (shown in Table 6). In universal multilingual NMT without KD, the language transfer stream is maximized, as all languages share their knowledge in a single model during training; however, the transfer is not optimal due to the lack of any condition on the relatedness of the languages contributing to the multilingual training. The related experiments are shown in the second column under the baseline experiments in Table 6; the average BLEU score of this approach is 18.46. In multilingual selective KD Tan et al. (2019b), knowledge is distilled from a single selected teacher handling the same language pair when training the student multilingually. So, although there is a condition on language relatedness, knowledge transfer is not maximized, as similar languages from the same language family are ignored in the distillation process. The related results are shown in the first column under multilingual selective KD in Table 6; this approach obtains an average BLEU score of 19.24 in our experiments. Adaptive KD Saleh et al. (2020) is a bilingual approach and also uses a random set of teachers, which does not necessarily contain all the languages related to the student and hence does not lead to optimal transfer. We did not perform any experiment on adaptive KD Saleh et al. (2020) since it is a bilingual approach.

Adaptive KD vs Selective KD vs HKD: All KD-based approaches in our comparison reduce negative transfer in different ways. In multilingual selective KD Tan et al. (2019b), the risk of negative transfer is reduced by distilling knowledge from the selected teacher per language in a multilingual setting. In bilingual adaptive KD Saleh et al. (2020), the contribution weights of the different teachers vary based on their effectiveness in improving the student, which prevents negative transfer in the bilingual setting. In HKD, the hierarchical grouping based on language similarity provides a systematic guide to prevent negative transfer as much as possible. This property leads HKD to obtain the best results for 32 out of the total 53 language pairs in our multilingual experiments.

Individual NMT Uniform MNMT Selective KD MNMT Adaptive KD NMT HKD MNMT
Multilingual
\hdashline Maximum transfer
\hdashline KD from multiple languages
\hdashline Reduced risk of negative transfer
Table 10: Comparing different properties of HKD with: transformer-based individual and multilingual NMT Vaswani et al. (2017), multilingual selective KD Tan et al. (2019b), and adaptive KD Saleh et al. (2020).

5 Conclusion

We presented a Hierarchical Knowledge Distillation (HKD) approach to mitigate the negative transfer effect in MNMT when a diverse set of languages is used in training. In the first phase of the distillation process, we grouped the languages that behave similarly and generated expert teacher-assistants for each group of languages. As we clustered languages based on four different language representations capturing different linguistic features, we then adaptively distilled knowledge from all related teacher-assistant models to the ultimate student in each mini-batch of training per language. Experimental results on 53 languages to English show our approach’s effectiveness in reducing negative transfer in MNMT. Our proposed approach generalizes to one-to-many translation with the same setting as the many-to-one task; the clustering then needs to be done on the target languages. For many-to-many tasks, hierarchical KD can be effective if the clustering is applied to both source and target languages. Our intended use, however, is many-to-one or one-to-many.

As a future direction, it would be interesting to study an end-to-end HKD approach by adding a backward HKD pass to complement the forward HKD pass described in this paper.

Acknowledgements

This work was supported by the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE) (www.massive.org.au).

References

Appendix:

Selective Knowledge Distillation

This first stage of knowledge distillation follows the algorithm proposed by Tan et al. (2019b) and is summarized in Algorithm 2. The process is applied to all clusters obtained from the different clustering approaches, $\{C^{m}:=\{l_{1},\dots,l_{|L^{\prime}|}\}\}_{m=1}^{M}$, where $M$ refers to the number of clusters and $L^{\prime}$ refers to the number of languages in each cluster.

Experiment Settings

Data.

We conducted the experiments on a parallel corpus of 53 languages to English from TED talks transcripts (https://github.com/neulab/word-embeddings-for-nmt), created and tokenized by Qi et al. (2018). Details about the size of the training data and the language codes, based on the ISO 639-1 standard (http://www.loc.gov/standards/iso639-2/php/English_list.php), are listed in Table 6 and visualised in Figure 3. We concatenated all data with Portuguese-related languages on the source side (pt\rightarrowen, pt-br\rightarrowen), and likewise all data with French-related languages on the source side (fr\rightarrowen, fr-ca\rightarrowen). We removed any sentences in the training data that overlap with any of the test sets. For multilingual training, as a standard practice Wang et al. (2020), we up-sampled the data of the low-resource language pairs so that all language pairs have roughly the same size, adjusting the distribution of the training data.
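As a minimal sketch of one simple up-sampling scheme consistent with the description above (repeat each corpus and truncate it to the size of the largest one); the exact recipe of Wang et al. (2020) may differ, so treat this as an assumption-laden illustration.

import math

def upsample_to_balance(corpora):
    """corpora: dict mapping language code -> list of (src, tgt) sentence pairs."""
    target = max(len(pairs) for pairs in corpora.values())
    balanced = {}
    for lang, pairs in corpora.items():
        reps = math.ceil(target / len(pairs))      # how many times to repeat the corpus
        balanced[lang] = (pairs * reps)[:target]   # repeat, then truncate to the target size
    return balanced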

Input: Training corpora: $\{\mathcal{D}^{l}\}_{l=1}^{L}$, where $\mathcal{D}^{l}:=\{({\bm{x}}_{1}^{l},{\bm{y}}_{1}),\dots,({\bm{x}}_{n}^{l},{\bm{y}}_{n})\}$;
List of all languages: $L$;
Individual models: $\{\theta^{l}\}_{l=1}^{L^{\prime}}$;
List of language pairs per cluster: $L^{\prime}$;
Total training epochs: $N$;
Distillation check step: $\mathcal{N}_{check}$;
Threshold of distillation accuracy: $\mathcal{T}$
Output: $\theta^{c}$: multilingual model for each cluster;
Randomly initialize the multilingual model $\theta^{c}$, the accumulated gradient $g=0$, and the distillation flag $f^{l}=\mathrm{True}$ for $l\in L^{\prime}$;
$n=0$;
while $n<N$ do
    $g=0$;
    for $l\in L^{\prime}$ do
        $D^{l}=\mathrm{random\_permute}(\mathcal{D}^{l})$;
        ${\bm{b}}_{1}^{l},\dots,{\bm{b}}_{J}^{l}=\mathrm{create\_minibatches}(\mathcal{D}^{l})$, where $b^{l}=(x^{l},y)$;
        $j=1$;
        while $j\leq J$ do
            if $f^{l}==\mathrm{True}$ then
                // compute and accumulate the gradient on loss $\mathcal{L}_{ALL}^{selective}$
                ${\bm{g}}=\nabla_{\theta^{c}}\mathcal{L}_{ALL}^{selective}({\bm{b}}_{j}^{l},\theta^{c},\theta^{l})$;
                // update the parameters using the ADAM optimiser
                $\theta^{c}=\mathrm{update\_param}(\theta^{c},{\bm{g}})$;
            else
                // compute and accumulate the gradient on loss $\mathcal{L}_{NLL}$
                ${\bm{g}}=\nabla_{\theta^{c}}\mathcal{L}_{NLL}({\bm{b}}_{j}^{l},\theta^{c})$;
                // update the parameters using the ADAM optimiser
                $\theta^{c}=\mathrm{update\_param}(\theta^{c},{\bm{g}})$;
            $j=j+1$;
    if $n\,\%\,\mathcal{N}_{check}==0$ then
        for $l\in L^{\prime}$ do
            if $\mathrm{Accuracy}(\theta^{c})<\mathrm{Accuracy}(\theta^{l})+\mathcal{T}$ then
                $f^{l}=\mathrm{True}$
            else
                $f^{l}=\mathrm{False}$
    $n=n+1$;
Algorithm 2: Multilingual Selective Knowledge Distillation Tan et al. (2019b)
TED-53 Languages
Language name Kazakh Belarusian Bengali Basque Malay Bosnian
Code kk be bn eu ms bs
train-size (#sent(k)) 3.3 4.5 4.6 5.1 5.2 5.6
Language name Azerbaijani Urdu Tamil Mongolian Marathi Galician
Code az ur ta mn mr gl
train-size (#sent(k)) 5.9 5.9 6.2 7.6 9.8 10
Language name Kurdish Estonian Georgian Bokmal Hindi Slovenian
Code ku et ka nb hi sl
train-size (#sent(k)) 10.3 10.7 13.1 15.8 18.7 19.8
Language name Armenian Burmese Finnish Macedonian Lithuanian Albanian
Code hy my fi mk lt sq
train-size (#sent(k)) 21.3 21.4 24.2 25.3 41.9 44.4
Language name Danish Swedish Slovak Indonesian Thai Czech
Code da sv sk id th cs
train-size (#sent(k)) 44.9 56.6 61.4 87.4 96.9 103
Language name Ukrainian Croatian Greek Serbian Hungarian Persian
Code uk hr el sr hu fa
train-size (#sent(k)) 108.4 122 134.3 136.8 147.1 150.8
Language name German Japanese Vietnamese Bulgarian Polish Romanian
Code de ja vi bg pl ro
train-size (#sent(k)) 167.8 168.2 171.9 174.4 176.1 180.4
Language name Turkish Dutch Chinese Spanish Italian Korean
Code tr nl zh es it ko
train-size (#sent(k)) 182.3 183.7 184.8 195.9 204.4 205.4
Language name Russian Hebrew French Arabic Portuguese
Code ru he fr ar pt
train-size (#sent(k)) 208.4 211.7 212 213.8 236.4
Table 11: Bilingual resources of 53 Languages \rightarrow English from the TED dataset. Language names, language codes based on the ISO 639-1 standard (http://www.loc.gov/standards/iso639-2/php/English_list.php), and training sizes (number of sentences) of the bilingual resources are shown in this table.
Figure 3: Training size (based on the number of sentences) for TED-53 bilingual resources (Language\rightarrowEnglish)
Training configuration.

All models were trained with the Transformer architecture Vaswani et al. (2017), implemented in the Fairseq framework Ott et al. (2019). The individual models were trained with a model hidden size of 256, a feed-forward hidden size of 1024, and 2 layers. All multilingual models, whether cluster-based or universal MNMT models, with or without knowledge distillation, were trained with a model hidden size of 512, a feed-forward hidden size of 1024, and 6 layers. We used the Adam optimizer Kingma and Ba (2015) and an inverse square root schedule with warmup (maximum LR 0.0005). We applied dropout and label smoothing with a rate of 0.3 and 0.1 for the bilingual and multilingual models respectively.

For the first phase of distillation, i.e., the multilingual selective KD, the distillation coefficient $\lambda$ is set to 0.6. In the second phase of distillation, i.e., the multilingual adaptive KD, we set $\lambda_{1}=0.5$, and $\lambda_{2}$ starts from 0.5 and is increased to 3 using the annealing function of Bowman et al. (2016); Saleh et al. (2020). We train our final multilingual student with mixed-precision floats on up to 8 V100 GPUs for a maximum of 100 epochs ($\approx 3$ days), with at most 8192 tokens per batch and early stopping after 20 validation steps based on the BLEU score. The translation quality is evaluated and reported based on the BLEU score Papineni et al. (2002) (SacreBLEU signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.none+version.1.3.1).
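As an illustration of the $\lambda_{2}$ schedule (0.5 annealed towards 3), the sketch below uses a logistic ramp; the exact annealing function follows Bowman et al. (2016), so the functional form and the rate/midpoint values here are assumptions made only for this example.

import math

def lambda2_schedule(step, start=0.5, end=3.0, rate=1e-3, midpoint=5000):
    """Logistic ramp from `start` to `end` over training steps (illustrative values)."""
    frac = 1.0 / (1.0 + math.exp(-rate * (step - midpoint)))  # ramps from ~0 to ~1
    return start + (end - start) * frac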

Model Contributed Langs BLEU Contributed Langs BLEU
Individual gl 13.50 el 26.82
Multi. (uniform) All langs. 22.04 All langs. 26.07
\hdashline Multi. (SKD) All langs. 22.44 All langs. 28.51
Clus. type1 gl, bg, ro, es, it, pt 25.53 el, ar, he 30.05
\hdashline Clus. type2 gl, bg, sv, da, nb, de, nl, ro, el, es, it, fr, pt 26.81 el, bg, sv, da, nb, de, nl, es, it, gl, fr, pt, ro 31.35
\hdashline Clus. type3 gl, fr, it, es, pt 26.93 el, mk, bg 29.13
\hdashline Clus. type4 gl, fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, it, sv, da, nb, ru, be, pl, bg, sl, pt sr, uk, hr, mk, sk, ro, sq, el, es 25.90 el, fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, it, sv, da, nb, ru, be, pl, bg, sl, pt, sr, uk, hr, mk, sk, ro, sq, es, gl 29.66
\hdashline Avg. - 26.29 - 30.04
Clus. Rand.1 gl, nb, uk, hr, se, ja 16.53 el, id, be 28.01
\hdashline Clus. Rand.2 gl, ta, mk, be, id, sq, pt, fr, ur, az, ku, bs, fa 20.27 el, cs, lt, id, sk, th, it, hy, ms, hu, mk, my, bn 27.78
\hdashline Clus. Rand.3 gl, az, ja, nb, kk 13.61 el, sq, th 27.97
\hdashline Clus. Rand.4 gl, zh, pt, fa, ar, kk, sr, bg, nl, cs, th, ko, vi, hu, mk, fi, ru, mn, de, sl, el, ka, pl, et, ta, fr, ur, ro, sv, mr, be, bs, uk, sq, az 22.20 el, zh, pt, fa, ar, kk, sr, bg, nl, cs,th, ko, vi, hu, mk, fi, ru, mn, de, sl, gl, ka, pl, et, ta, fr, ur, ro, sv, mr, be, bs, uk, sq, az 28.82
\hdashline Avg. - 18.15 (\Delta = -8.14) - 28.14 (\Delta = -1.9)
Table 12: Ablation study on using random clusters. Comparison of the (gl\rightarrowen) and (el\rightarrowen) translation tasks between individual, massive multilingual, and clustering-based multilingual (for actual and random clusters) baselines.

Analysis

Random clustering vs Actual clustering.

We studied the behaviour of random clusters compared to the actual clusters in Section 4.1.1. For further details on this comparison, refer to Table 12.

Language Family.

A group of languages that originated from a common ancestor is known as a language family, and a language that does not have any established relationship with other languages is called a language isolate. We already discussed the ablation study’s results regarding language families in comparison with isolated languages in Section 4.1.1 of the paper. For better visualisation of the results by language family, we sorted all language pairs based on their language family relations in Tables 13, 14, 15, and 16.

family lang ja, ko, mn, my ms, th, vi, zh, id mr, ta, bn, ka ku, fa, kk, eu, hi, ur el, ar, he tr, az, fi, hu, hy uk, pl, ru, mk, lt, be, sk cs, et, sq, hr, bs, sl, sr bg, ro, es, gl, it, pt fr, da, sv, nl, de, nb ko, bn, mr, hi, ur eu, ar, he hy, fa, ku hu, tr, az, ja, mn ka, ta kk, my mk, sq, pl, sk, hr, bs, be, et ru, uk, sl, fi, cs, lt zh, th, id, vi, ms bg, sv, da, nb, de, nl, el, es, it, gl, fr, pt, ro et, fi hi, my, hy, ka eu, az, kk, mn, ur, mr, bn, ta, ku, bs, be, ms gl, fr, it, es, pt nb, da, sv de, nl zh, ja, ko, hu, tr lt, sl, hr, sr, cs, sk, pl, ru, uk fa, id, ar, he, th, vi ro, sq mk, bg, el fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, sv, da, nb, ru, be, pl, bg, sl, sr, uk, hr, mk, sk, ro, sq, el, es, gl, pt, it vi, id, ms, th kk, az, tr, lt, ur, bn, mr, zh, my, ko, ja, mn, ta
Clustering Type 1 Clustering Type 2 Clustering Type 3 Clustering Type 4
c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c11c_{11} c1c_{1} c2c_{2} c3c_{3}
IE/Balto-Slavic be - - - - - - 12.7 - - - - - - - - - 10.8 - - - - - 8.4 - - - - - - - - 15.1 - -
bs - - - - - - - 18.0 - - - - - - - - 16.8 - - - - - 9.0 - - - - - - - - 19.0 - -
sl - - - - - - - 16.4 - - - - - - - - - 16.5 - - - - - - - - - 17.7 - - - 18.3 - -
mk - - - - - - 21.4 - - - - - - - - - 20.0 - - - - - - - - - - - - - 24.6 23.8 - -
lt - - - - - - 16.9 - - - - - - - - - - 16.9 - - - - - - - - - 17.9 - - - - - 18.2
sk - - - - - - 22.4 - - - - - - - - - - 21.1 - - - - - - - - - 24.0 - - - 23.7 - -
cs - - - - - - - 21.7 - - - - - - - - - 22.4 - - - - - - - - - 23.1 - - - 22.8 - -
uk - - - - - - 23.0 - - - - - - - - - - 22.9 - - - - - - - - - 23.5 - - - 23.6 - -
hr - - - - - - - 27.5 - - - - - - - - 25.6 - - - - - - - - - - 28.6 - - - 28.3 - -
sr - - - - - - - 27.4 - - - - - - - - 25.7 - - - - - - - - - - 27.6 - - - 27.1 - -
bg - - - - - - - - 31.6 - - - - - - - - - - 32.1 - - - - - - - - - - 29.8 30.4 - -
pl - - - - - - 19.4 - - - - - - - - - 18.3 - - - - - - - - - - 19.9 - - - 20.2 - -
ru - - - - - - 20.8 - - - - - - - - - - 20.8 - - - - - - - - - 21.2 - - - 21.4 - -
IE/Indo-Iranian bn - - 9.1 - - - - - - - 10.1 - - - - - - - - - - - 12.1 - - - - - - - - - - 10.1
ur - - - 13.3 - - - - - - 13.1 - - - - - - - - - - - 12.0 - - - - - - - - - - 13.3
ku - - - 9.4 - - - - - - 6.9 - - - - - - - - - - - 9.2 - - - - - - - - - - 12.9
hi - - - 13.2 - - - - - - 12.0 - - - - - - - - - - 12.1 - - - - - - - - - 12.8 - -
fa - - - 21.3 - - - - - - - - 22.4 - - - - - - - - - - - - - - - 23.5 - - 22.2 - -
mr - - 8.7 - - - - - - - 9.0 - - - - - - - - - - - 8.3 - - - - - - - - - - 9.0
IE/Italic gl - - - - - - - - 25.5 - - - - - - - - - - 26.8 - - - 26.9 - - - - - - - 25.9 - -
pt - - - - - - - - 33.3 - - - - - - - - - - 33.8 - - - 33.2 - - - - - - - 32.5 - -
ro - - - - - - - - 28.0 - - - - - - - - - - 28.4 - - - - - - - - - 28.9 - 27.3 - -
fr - - - - - - - - - 32.2 - - - - - - - - - 32.1 - - - 31.1 - - - - - - - 31.3 - -
es - - - - - - - - 29.0 - - - - - - - - - - 33.4 - - - 31.4 - - - - - - - 31.8 - -
it - - - - - - - - 30.5 - - - - - - - - - - 30.8 - - - 30.3 - - - - - - - 29.4 - -
Table 13: The results of clustering-based multilingual NMT with knowledge distillation for languages belong to IE/Balto-Slavic, IE/Indo-Iranina, and IE/Italic.
family lang ja, ko, mn, my ms, th, vi, zh, id mr, ta, bn, ka ku, fa, kk, eu, hi, ur el, ar, he tr, az, fi, hu, hy uk, pl, ru, mk, lt, be, sk cs, et, sq, hr, bs, sl, sr bg, ro, es, gl, it, pt fr, da, sv, nl, de, nb ko, bn, mr, hi, ur eu, ar, he hy, fa, ku hu, tr, az, ja, mn ka, ta kk, my mk, sq, pl, sk, hr, bs, be, et ru, uk, sl, fi, cs, lt zh, th, id, vi, ms bg, sv, da, nb, de, nl, el, es, it, gl, fr, pt, ro et, fi hi, my, hy, ka eu, az, kk, mn, ur, mr, bn, ta, ku, bs, be, ms gl, fr, it, es, pt nb, da, sv de, nl zh, ja, ko, hu, tr lt, sl, hr, sr, cs, sk, pl, ru, uk fa, id, ar, he, th, vi ro, sq mk, bg, el fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, sv, da, nb, ru, be, pl, bg, sl, sr, uk, hr, mk, sk, ro, sq, el, es, gl, pt, it vi, id, ms, th kk, az, tr, lt, ur, bn, mr, zh, my, ko, ja, mn, ta
Clustering Type 1 Clustering Type 2 Clustering Type 3 Clustering Type 4
c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c11c_{11} c1c_{1} c2c_{2} c3c_{3}
IE/Germanic nb - - - - - - - - - 33.5 - - - - - - - - - 34.3 - - - - 28.7 - - - - - - 30.8 - -
da - - - - - - - - - 35.0 - - - - - - - - - 39.7 - - - - 30.5 - - - - - - 32.0 - -
sv - - - - - - - - - 31.3 - - - - - - - - - 34.5 - - - - 26.8 - - - - - - 28.6 - -
de - - - - - - - - - 16.8 - - - - - - - - - 17.7 - - - - - 15.4 - - - - - 16.6 - -
nl - - - - - - - - - 28.0 - - - - - - - - - 29.0 - - - - - 29.1 - - - - - 28.0 - -
Turkic kk - - - 7.0 - - - - - - - - - - - 3.5 - - - - - - 3.1 - - - - - - - - - - 8.1
az - - - - - 9.5 - - - - - - - 9.1 - - - - - - - - 8.6 - - - - - - - - - - 9.2
tr - - - - - 18.6 - - - - - - - 18.1 - - - - - - - - - - - - 19.8 - - - - - - 18.2
Uralic et - - - - - - - 13.4 - - - - - - - - 13.2 - - - 10.4 - - - - - - - - - - 13.9 - -
fi - - - - - 11.3 - - - - - - - - - - - 12.5 - - 10.5 - - - - - - - - - - 12.7 - -
hu - - - - - 19.0 - - - - - - - 18.4 - - - - - - - - - - - - 20.1 - - - - 19.8 - -
Afroasiatic
he - - - - 32.8 - - - - - - - - - - - 32.1 - - - - 31.3 - - - - - - - - - 30.0 - -
ar - - - - 28.5 - - - - - - 28.0 - - - - - - - - - - - - - - - - 27.4 - - 25.8 - -
Sino-Tibetan
my 9.6 - - - - - - - - - - - - - - 6.3 - - - - - 8.4 - - - - - - - - - 9.5 - -
zh - 22.1 - - - - - - - - - - - - - - - - 22.1 - - - - - - - 23.9 - - - - - - 22.8
Austronesian
ms - 12.9 - - - - - - - - - - - - - - - - 12.9 - - - 7.6 - - - - - - - - - 12.9 -
id - 21.1 - - - - - - - - - - - - - - - - 21.1 - - - - - - - - - 22.5 - - - 21.1 -
Table 14: The results of clustering-based multilingual NMT with knowledge distillation for languages belong to IE/Germanic, Turkic, Uralic, Afroasiatic, Sino-Tibetan, and Austronesian.
family lang ja, ko, mn, my ms, th, vi, zh, id mr, ta, bn, ka ku, fa, kk, eu, hi, ur el, ar, he tr, az, fi, hu, hy uk, pl, ru, mk, lt, be, sk cs, et, sq, hr, bs, sl, sr bg, ro, es, gl, it, pt fr, da, sv, nl, de, nb ko, bn, mr, hi, ur eu, ar, he hy, fa, ku hu, tr, az, ja, mn ka, ta kk, my mk, sq, pl, sk, hr, bs, be, et ru, uk, sl, fi, cs, lt zh, th, id, vi, ms bg, sv, da, nb, de, nl, el, es, it, gl, fr, pt, ro et, fi hi, my, hy, ka eu, az, kk, mn, ur, mr, bn, ta, ku, bs, be, ms gl, fr, it, es, pt nb, da, sv de, nl zh, ja, ko, hu, tr lt, sl, hr, sr, cs, sk, pl, ru, uk fa, id, ar, he, th, vi ro, sq mk, bg, el fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, sv, da, nb, ru, be, pl, bg, sl, sr, uk, hr, mk, sk, ro, sq, el, es, gl, pt, it vi, id, ms, th kk, az, tr, lt, ur, bn, mr, zh, my, ko, ja, mn, ta
Clustering Type 1 Clustering Type 2 Clustering Type 3 Clustering Type 4
c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c11c_{11} c1c_{1} c2c_{2} c3c_{3}
Koreanic
ko 15.1 - - - - - - - - - 15.1 - - - - - - - - - - - - - - - 17.4 - - - - - - 16.0
Japonic
ja 10.1 - - - - - - - - - - - - 10.1 - - - - - - - - - - - - 10.1 - - - - - - 8.6
Austroasiatic
vi - 21.6 - - - - - - - - - - - - - - - - 21.6 - - - - - - - - - 21.3 - - - 20.3 -
IE/Hellenic
el - - - - 30.0 - - - - - - - - - - - - - - 31.3 - - - - - - - - - - 29.1 29.6 - -
Kra-Dai
th - 22.9 - - - - - - - - - - - - - - - - 22.9 - - - - - - - - - 23.09 - - - 22.8 -
IE/Albanian
sq - - - - - - - 24.7 - - - - - - - - 23.2 - - - - - - - - - - - - 26.8 - 26.4 - -
Table 15: The results of clustering-based multilingual NMT with knowledge distillation for languages belong to Koreanic, Japonic, Austroasiatic, IE/Hellenic, Kra-Dai, and IE/Albanian families.
family lang ja, ko, mn, my ms, th, vi, zh, id mr, ta, bn, ka ku, fa, kk, eu, hi, ur el, ar, he tr, az, fi, hu, hy uk, pl, ru, mk, lt, be, sk cs, et, sq, hr, bs, sl, sr bg, ro, es, gl, it, pt fr, da, sv, nl, de, nb ko, bn, mr, hi, ur eu, ar, he hy, fa, ku hu, tr, az, ja, mn ka, ta kk, my mk, sq, pl, sk, hr, bs, be, et ru, uk, sl, fi, cs, lt zh, th, id, vi, ms bg, sv, da, nb, de, nl, el, es, it, gl, fr, pt, ro et, fi hi, my, hy, ka eu, az, kk, mn, ur, mr, bn, ta, ku, bs, be, ms gl, fr, it, es, pt nb, da, sv de, nl zh, ja, ko, hu, tr lt, sl, hr, sr, cs, sk, pl, ru, uk fa, id, ar, he, th, vi ro, sq mk, bg, el fa, ku, eu, hu, et, fi, hy, ka, ar, he, fr, cs, lt, de, nl, sv, da, nb, ru, be, pl, bg, sl, sr, uk, hr, mk, sk, ro, sq, el, es, gl, pt, it vi, id, ms, th kk, az, tr, lt, ur, bn, mr, zh, my, ko, ja, mn, ta
Clustering Type 1 Clustering Type 2 Clustering Type 3 Clustering Type 4
c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c1c_{1} c2c_{2} c3c_{3} c4c_{4} c5c_{5} c6c_{6} c7c_{7} c8c_{8} c9c_{9} c10c_{10} c11c_{11} c1c_{1} c2c_{2} c3c_{3}
IE/Armenian
hy - - - - - 13.7 - - - - - - 12.7 - - - - - - - - 10.8 - - - - - - - - - 17.1 - -
Kartvelian
ka - - 9.2 - - - - - - - - - - - 10.8 - - - - - - 9.1 - - - - - - - - - 8.8 - -
Mongolic
mn 5.7 - - - - - - - - - - - - 6.6 - - - - - - - - 5.3 - - - - - - - - - - 6.2
Dravidian
ta - - 4.0 - - - - - - - - - - - 5.7 - - - - - - - 3.9 - - - - - - - - - - 3.4
Isolate
eu - - - 9.0 - - - - - - - 9.7 - - - - - - - - - - 8.1 - - - - - - - - 11.0 - -
Table 16: The results of clustering-based multilingual NMT with knowledge distillation for languages belong to IE/Armenian, Kartvelian, Mongolic, Dravidian, and Isolate families.