Improving Robustness of Retrieval Augmented Translation via
Shuffling of Suggestions
Abstract
Several recent studies have reported dramatic performance improvements in neural machine translation (NMT) by augmenting translation at inference time with fuzzy-matches retrieved from a translation memory (TM). However, these studies all operate under the assumption that the TMs available at test time are highly relevant to the test set. We demonstrate that for existing retrieval augmented translation methods, using a TM with a domain mismatch to the test set can result in substantially worse performance compared to not using a TM at all. We propose a simple method that exposes fuzzy-match NMT systems to less relevant suggestions during training and show that it results in a system that is much more tolerant to inference with domain-mismatched TMs, recovering a substantial share of the lost BLEU. Moreover, the model remains competitive with the baseline when fed suggestions from relevant TMs.
1 Introduction
Retrieval Augmented Translation (RAT) refers to a paradigm which combines a translation model Vaswani et al. (2017) with an external retriever module Li et al. (2022). The retrieval module (e.g. a BM25 ranker Robertson and Zaragoza (2009) or a neural retriever Cai et al. (2021); Sachan et al. (2021)) takes each source sentence as input and retrieves the top-k most similar target translations from a Translation Memory (TM) Farajian et al. (2017); Gu et al. (2017); Bulte and Tezcan (2019). The translation module then encodes the input along with the top-k fuzzy-matches, either by appending the suggestions to the input Bulte and Tezcan (2019); Xu et al. (2020) or by using separate encoders for input and suggestions He et al. (2021); Cai et al. (2021). The decoder then learns to copy tokens from the suggestions or carry over their stylistics while generating the translation. In this work, we focus only on the translation module of this paradigm.
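To make the input-concatenation variant concrete, the sketch below builds an augmented source sequence by appending retrieved target-side fuzzy-matches to the input. The separator token and the ordering of suggestions are illustrative assumptions, not the exact formatting used in the cited works.

```python
def build_rat_input(source: str, fuzzy_matches: list[str], sep_token: str = "<sep>") -> str:
    """Concatenate a source sentence with its retrieved target-side fuzzy-matches.

    The <sep> token and the simple left-to-right ordering are assumptions for
    illustration; actual systems define their own special tokens and ordering.
    """
    return f" {sep_token} ".join([source] + fuzzy_matches)


# Example (hypothetical suggestions):
# build_rat_input("Open the file menu.",
#                 ["Öffnen Sie das Menü Datei.", "Öffnen Sie die Datei."])
# -> "Open the file menu. <sep> Öffnen Sie das Menü Datei. <sep> Öffnen Sie die Datei."
```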
In the existing literature, inference with RAT models has typically assumed that TMs are domain-matched, i.e., the test set is from the same domain as the translation memory. Many works (e.g. Bulte and Tezcan (2019), Xu et al. (2020) and Cai et al. (2021)) have reported dramatic performance improvements in such a setting. However, it is not clear how the models perform when there is a domain mismatch between the TM and the test set.
In this work, we focus on the setting where the assumption that the TM is domain-matched with the test set does not hold. We explore the conditions where models are provided suggestions from a TM that is not from the same domain as the test set. We show that RAT models suffer a performance drop when fed suggestions from less relevant TMs.
This finding is especially important from a usability standpoint. A translator will often pick the best-fitting available TM for a translation job when an ideal TM does not exist or has not been created yet, e.g., an IT-domain TM is picked for a patent translation job because the domains are close and there is no available TM for patents. This can lead to issues such as ambiguous terminology or the same term carrying different meanings across contexts Jalili Sabet et al. (2016); Barbu et al. (2016). A RAT model leveraging such a (mismatched) TM ends up producing worse-quality translations than a standard MT system. Therefore, it is desirable that RAT models not only improve translation with suggestions from relevant TMs, but are also more robust to suggestions from less relevant TMs.
To this end, we propose an enhancement to the training of RAT models with a simple shuffling method to mitigate this problem. Instead of always using the most relevant fuzzy-matches in training, our method randomly samples from a larger list (e.g. randomly sampling 3 sentences from a larger top-N list of matches). Our hypothesis is that if we systematically provide only the most similar suggestions during training, the model will over-rely on the suggestions and simply copy the tokens in them. By shuffling the retrieved results, we ensure that suggestions are less similar to the input and train the system to be more robust to less relevant suggestions at test time. Our experiment results show that the model trained with shuffling of suggestions outperforms the standard RAT model by 2.39 BLEU (En-De) and 5.81 BLEU (En-Fr) on average when suggestions come from less relevant TMs, while dropping only 0.15 BLEU on average when suggestions come from relevant TMs.
To the best of our knowledge, this is the first work to consider the robustness of RAT methods, which we believe is critical for acceptance by human translators.
2 Related Work
RAT is a form of domain adaptation, which is often achieved via continued training in NMT Freitag and Al-Onaizan (2016); Luong and Manning (2015). However, RAT differs from standard domain adaptation techniques like continued training in that it is online: the model is not adapted during training and instead domain adaptation occurs at inference time. This makes RAT better suited for some real-world applications, e.g. a single server with a single model loaded in memory can serve hundreds or thousands of users with custom translations, adapted to their unique TMs. Other works have considered online adaptation outside the context of RAT, including Vilar (2018), who proposes Learning Hidden Unit Contributions Swietojanski et al. (2016) as a compact way to store many adaptations of the same general-domain model.
Previous works in retrieval augmented translation have mainly explored aspects of filtering fuzzy-matches by applying similarity thresholds Xia et al. (2019); Xu et al. (2020), leveraging word alignment information Zhang et al. (2018); Xu et al. (2020); He et al. (2021), or re-ranking with additional scores (e.g. word overlap) Gu et al. (2018); Zhang et al. (2018). Our approach does not make use of any filtering and as such does not require any ad hoc optimization. Our work is also related to the use of k-nearest-neighbor search for NMT Khandelwal et al. (2021); Zheng et al. (2021), but it is less expensive and does not require storage and search over a large data store of context representations and corresponding target tokens Meng et al. (2021).
Our work also relates to work in offline adaptation which has addressed catastrophic forgetting of general-domain knowledge during domain adaptation Thompson et al. (2019) via ensembling in-domain and out-of-domain models Freitag and Al-Onaizan (2016), mixing of in-domain and out-of-domain data during adaptation Chu et al. (2017), multi-objective and multi-output learning Dakwale and Monz (2017), elastic weight consolidation Kirkpatrick et al. (2017); Thompson et al. (2019); Saunders et al. (2019), or combinations of these techniques Hasler et al. (2021).
3 RAT with Shuffling
3.1 Retrieval Module
We use Okapi BM25 Robertson and Zaragoza (2009), a classical retrieval algorithm that performs search by computing lexical matches of the query with all sentences in the evidence, to obtain top-ranked sentences for each input. To enable fast retrieval, we leverage the implementation provided by the Elasticsearch library (https://github.com/elastic/elasticsearch-py). Specifically, we build an index over the source sentences of a TM. For every input, we collect the top-N most similar source-side sentences and then use their corresponding target-side sentences as fuzzy-matches.
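A minimal sketch of this retrieval step is shown below, assuming an elasticsearch-py 8.x client and a locally running Elasticsearch instance (whose default text similarity is BM25); the index and field names are illustrative.

```python
from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch instance; its default text similarity is BM25.
es = Elasticsearch("http://localhost:9200")


def index_tm(tm_pairs, index_name="tm"):
    """Index the source side of a TM; target sides are stored for later lookup."""
    for i, (src, tgt) in enumerate(tm_pairs):
        es.index(index=index_name, id=i, document={"src": src, "tgt": tgt})


def retrieve_fuzzy_matches(source_sentence, k=3, index_name="tm"):
    """Return the target sides of the k TM entries whose source best matches the input."""
    resp = es.search(index=index_name, query={"match": {"src": source_sentence}}, size=k)
    return [hit["_source"]["tgt"] for hit in resp["hits"]["hits"]]
```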
3.2 Shuffling suggestions
We propose to relax the use of the top-k most relevant suggestions during training by instead training the RAT model with fuzzy-matches randomly sampled from a larger list. In our experiments, we sample from the top-N matches, where N is chosen based on our preliminary experiments.
By shuffling the retrieved fuzzy-matches, we ensure that suggestions are less similar to the target reference. With that, we expect the model to learn to be more selective in using the suggestions for translation and thus to be more robust to less relevant suggestions at test time. Indeed, training models on noisy data has been shown to improve a model's robustness to irrelevant data Belinkov and Bisk (2018).
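A minimal sketch of the sampling step is given below; the pool size and the use of Python's random module are illustrative assumptions (the actual size of the larger top-N list was chosen in our preliminary experiments).

```python
import random


def sample_suggestions(ranked_targets, k=3, pool_size=10, seed=None):
    """Randomly sample k fuzzy-matches from the top-`pool_size` retrieved targets.

    `pool_size` is a placeholder for the larger top-N list described above.
    At inference time the standard top-k matches are used instead of sampling.
    """
    rng = random.Random(seed)
    pool = ranked_targets[:pool_size]
    if len(pool) <= k:
        return pool
    return rng.sample(pool, k)
```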
4 Data, Models & Experiments
4.1 Data
We conduct experiments in two language directions: En-De with five domain-specialized TMs (Medical, Law, IT, Religion, and Subtitles) and En-Fr with seven domain-specialized TMs (News, Medical, Bank, Law, IT, TED, and Religion). The En-De data is taken from Aharoni and Goldberg (2020), a re-split version of the multi-domain data set of Koehn and Knowles (2017), while the En-Fr data is taken from the multi-domain data set of Pham et al. (2021). We concatenate the parallel training data from all domains for training, and do the same for the development sets.
4.2 Models and Training
We experiment with a common RAT model used in Bulte and Tezcan (2019) and Xu et al. (2020). The model encodes the input along with the top-k fuzzy-matches by appending the suggestions to the input. We follow the training recipe of Vaswani et al. (2017). We report translation quality with BLEU scores computed via Sacrebleu Post (2018); details of all training and evaluation hyper-parameters are given in Appendix A. We use compare-mt Neubig et al. (2019) for significance testing with bootstrap = 1000 and prob_thresh = 0.05.
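As an illustration of the scoring step, the snippet below computes corpus-level BLEU with sacrebleu's Python API; the file names are placeholders, and the exact signature used in our experiments is given in Appendix A.

```python
import sacrebleu

# Placeholder file names: detokenized hypotheses and references, one sentence per line.
hypotheses = [line.strip() for line in open("hyps.detok.txt", encoding="utf-8")]
references = [line.strip() for line in open("refs.detok.txt", encoding="utf-8")]

# corpus_bleu expects a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```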
4.3 Inference with TMs
We run inference with RAT models using top-k suggestions from both relevant and less relevant TMs. Specifically, for every input sentence in a test set from one domain, less relevant suggestions are retrieved from all TMs except the TM belonging to the same domain as the test set. Taking the En-De IT-domain test set as an example, less relevant suggestions are collected from the Medical, Law, Religion, and Subtitles TMs, but not from the IT TM. A sketch of this setup is given below.
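The sketch below shows one way to assemble such a less relevant suggestion pool, assuming one scoring searcher per domain TM (e.g. the BM25 retrieval sketched in Section 3.1, returning score-sentence pairs). The pooling and re-ranking strategy is an assumption for illustration, not necessarily our exact implementation.

```python
def retrieve_less_relevant(source_sentence, test_domain, tm_searchers, k=3):
    """Collect suggestions from all TMs except the test-set domain, keeping the k best.

    `tm_searchers` maps a domain name to a function returning (score, target_sentence)
    pairs for the given input; this helper and its interface are hypothetical.
    """
    scored = []
    for domain, search in tm_searchers.items():
        if domain == test_domain:
            continue  # exclude the same-domain TM to simulate a domain mismatch
        scored.extend(search(source_sentence, k))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tgt for _, tgt in scored[:k]]
```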
Table 1: BLEU averaged across domains and top-k values, with suggestions from relevant and less relevant TMs. Deltas are relative to the corresponding RAT row.

En-De

| Model | TM | AVER |
| --- | --- | --- |
| Transformer | - | 37.61 |
| RAT | Less relevant | 32.89 |
| RAT + Shuf. | Less relevant | 35.28 / +2.39 |
| RAT | Relevant | 41.34 |
| RAT + Shuf. | Relevant | 41.19 / -0.15 |

En-Fr

| Model | TM | AVER |
| --- | --- | --- |
| Transformer | - | 55.86 |
| RAT | Less relevant | 44.39 |
| RAT + Shuf. | Less relevant | 49.20 / +5.81 |
| RAT | Relevant | 56.59 |
| RAT + Shuf. | Relevant | 56.44 / -0.15 |
Table 2: BLEU with suggestions from less relevant TMs, for k = 2, 3, 4. * marks statistically significant differences (Section 4.2).

En-De

| Model | Top-k | IT | LAW | REL | MED | SUBT | AVER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 39.22 | 53.06 | 23.03 | 48.48 | 24.24 | 37.61 |
| RAT | k=2 | 41.16 | 44.41 | 16.48 | 41.95 | 23.84 | 33.57 |
| RAT + Shuf. | k=2 | 40.81 | 49.37* | 21.01* | 46.85* | 23.92 | 36.39 |
| RAT | k=3 | 41.03 | 43.61 | 15.11 | 40.35 | 23.35 | 32.69 |
| RAT + Shuf. | k=3 | 40.33 | 46.52* | 18.37* | 44.66* | 24.18 | 34.81 |
| RAT | k=4 | 40.91 | 42.98 | 14.39 | 39.09 | 23.56 | 32.19 |
| RAT + Shuf. | k=4 | 42.0 | 46.4* | 18.23* | 43.65* | 24.02 | 34.86 |

En-Fr

| Model | Top-k | LAW | MED | IT | NEWS | BANK | REL | TED | AVER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 65.58 | 57.92 | 48.89 | 34.43 | 57.32 | 86.87 | 40.03 | 55.86 |
| RAT | k=2 | 56.89 | 51.36 | 42.18 | 33.15 | 47.42 | 49.62 | 39.09 | 45.67 |
| RAT + Shuf. | k=2 | 60.20* | 52.66* | 45.55* | 34.12* | 50.36* | 81.76* | 39.85 | 52.07 |
| RAT | k=3 | 56.48 | 50.31 | 40.41 | 33.25 | 45.32 | 43.61 | 38.31 | 43.96 |
| RAT + Shuf. | k=3 | 57.81* | 51.89* | 42.78* | 33.51 | 48.08* | 73.11* | 38.75 | 49.42 |
| RAT | k=4 | 54.93 | 49.98 | 39.77 | 33.05 | 45.03 | 39.26 | 37.75 | 42.82 |
| RAT + Shuf. | k=4 | 56.33* | 50.92* | 42.07* | 32.94 | 46.58 | 60.31* | 38.48 | 46.80 |
Table 3: BLEU with suggestions from relevant TMs, for k = 2, 3, 4. * marks statistically significant differences (Section 4.2).

En-De

| Model | Top-k | IT | LAW | REL | MED | SUBT | AVER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 39.22 | 53.06 | 23.03 | 48.48 | 24.24 | 37.61 |
| RAT | k=2 | 41.52 | 57.86 | 30.67 | 53.52 | 23.83 | 41.48 |
| RAT + Shuf. | k=2 | 40.78 | 57.82 | 30.87 | 53.34 | 24.07 | 41.38 |
| RAT | k=3 | 41.43 | 57.67 | 31.24 | 53.05* | 23.66 | 41.41 |
| RAT + Shuf. | k=3 | 40.90 | 56.98 | 30.65 | 51.36 | 24.24 | 40.83 |
| RAT | k=4 | 41.35 | 57.54 | 31.7 | 52.43 | 23.53 | 41.31 |
| RAT + Shuf. | k=4 | 42.24 | 58.90* | 33.02* | 53.7 | 23.94 | 42.36 |

En-Fr

| Model | Top-k | LAW | MED | IT | NEWS | BANK | REL | TED | AVER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 65.58 | 57.92 | 48.89 | 34.43 | 57.32 | 86.87 | 40.03 | 55.86 |
| RAT | k=2 | 70.25* | 60.20* | 50.28 | 33.95 | 58.65* | 83.99 | 40.32 | 56.81 |
| RAT + Shuf. | k=2 | 69.45 | 59.42 | 49.41 | 34.38 | 57.46 | 84.15 | 40.63 | 56.41 |
| RAT | k=3 | 70.70 | 61.02 | 52.57 | 33.61 | 60.67 | 84.84 | 39.47 | 57.55 |
| RAT + Shuf. | k=3 | 70.03 | 60.41 | 51.69 | 33.64 | 60.41 | 85.15 | 39.66 | 57.28 |
| RAT | k=4 | 70.07 | 60.23* | 50.28 | 33.20 | 57.5 | 84.65 | 39.37 | 56.47 |
| RAT + Shuf. | k=4 | 69.73 | 59.03 | 49.4 | 33.32 | 56.63 | 85.29 | 39.78 | 56.17 |
5 Results
5.1 Main results
We experiment with a comprehensive range of top-k values (k = 1, 2, 3, 4, 5) and compare models when fed with suggestions from both relevant and less relevant TMs. The summary results across top-k values and across domains are collected in Table 1 for both En-De and En-Fr. When the standard RAT models are fed with less relevant suggestions, we see a drop of 12.5% (37.61 → 32.89) for En-De and 20.5% (55.86 → 44.39) for En-Fr against the standard Transformer-based MT system. This shows how brittle RAT models are when the assumption that TMs are of the same domain as the test set breaks. Our approach with shuffling closes the gap by 50%, from 4.72 (37.61 → 32.89) to 2.39 (37.61 → 35.28), for En-De, and by 54.7% for En-Fr, when RAT models are fed with less relevant suggestions, while dropping only 0.15 BLEU when the suggestions are from relevant TMs. This is an acceptable trade-off from the point of view of a user, given the significant gains observed in the less relevant TM setting.
Details for a subset of top-k values (k = 2, 3, 4) are given in Table 2 (with less relevant TMs) and Table 3 (with relevant TMs); we report only a subset here due to space constraints, and comprehensive results are shown in Appendix B. The improvements with shuffling are consistent for all values of k when models are fed less relevant TMs, and clearly support that shuffling suggestions during training significantly improves the robustness of RAT models. In Table 3, where the models are fed with relevant TMs, we observe a drop in translation quality in some cases (e.g. the MED domain in En-De with k=3), but the average performance of the models across domains remains competitive.
Table 4: Average percentage of suggestion tokens that appear in the model output for each En-De test set (k = 3). * marks statistically significant differences (Section 4.2).

| Model | IT | LAW | REL | MED | SUBT |
| --- | --- | --- | --- | --- | --- |
| Less Relevant TM | | | | | |
| RAT | 24.82 | 17.81 | 19.25 | 14.73 | 20.69 |
| + Shuf. | 23.51* | 17.04* | 17.74* | 13.42* | 20.05* |
| Relevant TM | | | | | |
| RAT | 27.01 | 40.75 | 36.37 | 39.45 | 20.67 |
| + Shuf. | 25.63* | 39.05* | 34.33* | 37.46* | 19.99* |
5.2 Usage of Suggestions
To understand the direct impact of suggestions on translation, we look at the percentage of tokens in the suggestions that appear in the output produced by the model. The average percentage of overlapping tokens for each test set is shown in Table 4, with k = 3 and En-De as a case study.
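A minimal sketch of this overlap statistic is given below; whitespace tokenization and type-level matching are simplifying assumptions made for illustration.

```python
def suggestion_overlap(suggestions, hypothesis):
    """Percentage of suggestion tokens that also appear in the model output."""
    suggestion_tokens = " ".join(suggestions).split()  # whitespace tokenization (assumption)
    output_types = set(hypothesis.split())
    if not suggestion_tokens:
        return 0.0
    hits = sum(1 for tok in suggestion_tokens if tok in output_types)
    return 100.0 * hits / len(suggestion_tokens)
```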
Our results show that RAT models trained with the shuffling method make consistently and significantly less use of suggestions from less relevant TMs. This clearly supports our postulation that RAT models trained with shuffling are more selective in using suggestions for translation, which explains the robustness of the model when run with less relevant suggestions at inference time.
As the model is more selective, it also tends to make less use of suggestions from relevant TMs, as shown in Table 4. This explains the drop in translation accuracy in some cases (e.g. the MED domain in En-De with k=3) when fed with relevant suggestions, in exchange for robustness.
6 Conclusion
We show that standard RAT systems are not robust to domain mismatch at test time. We propose a training procedure that shuffles fuzzy-match suggestions to mitigate the problem. Our experiment results show that the model trained with shuffling significantly outperforms the one trained without shuffling, across multiple domains in two language pairs, when suggestions come from less relevant TMs at inference time. We also show that RAT with shuffling remains competitive with models trained without shuffling across all domains when fed with relevant suggestions. Further analysis shows that the model trained with shuffling is more selective in using suggestions for translation, which explains the increased robustness.
References
- Aharoni and Goldberg (2020) Roee Aharoni and Yoav Goldberg. 2020. Unsupervised domain clusters in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
- Barbu et al. (2016) Eduard Barbu, Carla Parra Escartín, Luisa Bentivogli, Matteo Negri, Marco Turchi, Constantin Orasan, and Marcello Federico. 2016. The first automatic translation memory cleaning shared task. Machine Translation, 30(3–4):145–166.
- Belinkov and Bisk (2018) Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
- Bulte and Tezcan (2019) Bram Bulte and Arda Tezcan. 2019. Neural fuzzy repair: Integrating fuzzy matches into neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1800–1809, Florence, Italy. Association for Computational Linguistics.
- Cai et al. (2021) Deng Cai, Yan Wang, Huayang Li, Wai Lam, and Lemao Liu. 2021. Neural machine translation with monolingual translation memory. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7307–7318, Online. Association for Computational Linguistics.
- Chu et al. (2017) Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385–391, Vancouver, Canada. Association for Computational Linguistics.
- Dakwale and Monz (2017) Praveen Dakwale and Christof Monz. 2017. Finetuning for neural machine translation with limited degradation across in-and out-of-domain data. Proceedings of the XVI Machine Translation Summit, 117.
- Farajian et al. (2017) M. Amin Farajian, Marco Turchi, Matteo Negri, and Marcello Federico. 2017. Multi-domain neural machine translation through unsupervised adaptation. In Proceedings of the Second Conference on Machine Translation, pages 127–137, Copenhagen, Denmark. Association for Computational Linguistics.
- Freitag and Al-Onaizan (2016) Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06897.
- Gu et al. (2017) Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O. K. Li. 2017. Search engine guided non-parametric neural machine translation. CoRR, abs/1705.07267.
- Gu et al. (2018) Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O. K. Li. 2018. Search engine guided neural machine translation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5133–5140. AAAI Press.
- Hasler et al. (2021) Eva Hasler, Tobias Domhan, Jonay Trenous, Ke Tran, Bill Byrne, and Felix Hieber. 2021. Improving the quality trade-off for neural machine translation multi-domain adaptation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8470–8477, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- He et al. (2021) Qiuxiang He, Guoping Huang, Qu Cui, Li Li, and Lemao Liu. 2021. Fast and accurate neural machine translation with translation memory. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3170–3180, Online. Association for Computational Linguistics.
- Jalili Sabet et al. (2016) Masoud Jalili Sabet, Matteo Negri, Marco Turchi, José G. C. de Souza, and Marcello Federico. 2016. TMop: a tool for unsupervised translation memory cleaning. In Proceedings of ACL-2016 System Demonstrations, pages 49–54, Berlin, Germany. Association for Computational Linguistics.
- Khandelwal et al. (2021) Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. In International Conference on Learning Representations.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
- Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 66–71. Association for Computational Linguistics.
- Li et al. (2022) Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. 2022. A survey on retrieval-augmented text generation.
- Luong and Manning (2015) Minh-Thang Luong and Christopher Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 76–79, Da Nang, Vietnam.
- Meng et al. (2021) Yuxian Meng, Xiaoya Li, Xiayu Zheng, Fei Wu, Xiaofei Sun, Tianwei Zhang, and Jiwei Li. 2021. Fast nearest neighbor machine translation. CoRR, abs/2105.14528.
- Neubig et al. (2019) Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, and Xinyi Wang. 2019. compare-mt: A tool for holistic comparison of language generation systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 35–41, Minneapolis, Minnesota. Association for Computational Linguistics.
- Pham et al. (2021) MinhQuang Pham, Josep Maria Crego, and François Yvon. 2021. Revisiting Multi-Domain Machine Translation. Transactions of the Association for Computational Linguistics, 9:17–35.
- Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
- Robertson and Zaragoza (2009) Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389.
- Sachan et al. (2021) Devendra Sachan, Mostofa Patwary, Mohammad Shoeybi, Neel Kant, Wei Ping, William L. Hamilton, and Bryan Catanzaro. 2021. End-to-end training of neural retrievers for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6648–6662, Online. Association for Computational Linguistics.
- Saunders et al. (2019) Danielle Saunders, Felix Stahlberg, Adrià de Gispert, and Bill Byrne. 2019. Domain adaptive inference for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 222–228, Florence, Italy. Association for Computational Linguistics.
- Swietojanski et al. (2016) Pawel Swietojanski, Jinyu Li, and Steve Renals. 2016. Learning hidden unit contributions for unsupervised acoustic model adaptation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(8):1450–1463.
- Thompson et al. (2019) Brian Thompson, Jeremy Gwinnup, Huda Khayrallah, Kevin Duh, and Philipp Koehn. 2019. Overcoming catastrophic forgetting during domain adaptation of neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2062–2068, Minneapolis, Minnesota. Association for Computational Linguistics.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Vilar (2018) David Vilar. 2018. Learning hidden unit contribution for adapting neural machine translation models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 500–505, New Orleans, Louisiana. Association for Computational Linguistics.
- Xia et al. (2019) Mengzhou Xia, Guoping Huang, Lemao Liu, and Shuming Shi. 2019. Graph based translation memory for neural machine translation. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):7297–7304.
- Xu et al. (2020) Jitao Xu, Josep Crego, and Jean Senellart. 2020. Boosting neural machine translation with similar translations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1580–1590, Online. Association for Computational Linguistics.
- Zhang et al. (2018) Jingyi Zhang, Masao Utiyama, Eiichro Sumita, Graham Neubig, and Satoshi Nakamura. 2018. Guiding neural machine translation with retrieved translation pieces. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1325–1335, New Orleans, Louisiana. Association for Computational Linguistics.
- Zheng et al. (2021) Xin Zheng, Zhirui Zhang, Junliang Guo, Shujian Huang, Boxing Chen, Weihua Luo, and Jiajun Chen. 2021. Adaptive nearest neighbor machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 368–374, Online. Association for Computational Linguistics.
Appendix
A Model Parameters
We employed the Transformer big architecture with 6 encoder and 6 decoder layers, with a fixed hidden size and a capped maximum input length. We used a joint source-target subword vocabulary built with the SentencePiece algorithm Kudo and Richardson (2018).
We use the Adam optimizer Kingma and Ba (2015), increase the learning rate linearly for an initial number of warm-up steps and decrease it thereafter, and use fixed batch sizes of source and target tokens. Checkpoints are saved at regular intervals during training. We apply dropout and label smoothing, and train each model for a fixed maximum number of iterations.
For evaluation we use Sacrebleu with the signature nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0.
B Full results with k = 1 to 5
Details for the full set of top-k values (k = 1, 2, 3, 4, 5) are in Table 5 (with less relevant TMs) and Table 6 (with relevant TMs).
Table 5: Full results with suggestions from less relevant TMs, for k = 1 to 5. * marks statistically significant differences (Section 4.2).

En-De

| Model | Top-k | IT | LAW | REL | MED | SUBT | AVER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 39.22 | 53.06 | 23.03 | 48.48 | 24.24 | 37.61 |
| RAT | k=1 | 40.82 | 45.61 | 17.98 | 42.65 | 24.08 | 34.22 |
| RAT + Shuf. | k=1 | 39.56 | 50.89* | 21.6* | 47.8* | 24.33 | 36.84 |
| RAT | k=2 | 41.16 | 44.41 | 16.48 | 41.95 | 23.84 | 33.57 |
| RAT + Shuf. | k=2 | 40.81 | 49.37* | 21.01* | 46.85* | 23.92 | 36.39 |
| RAT | k=3 | 41.03 | 43.61 | 15.11 | 40.35 | 23.35 | 32.69 |
| RAT + Shuf. | k=3 | 40.33 | 46.52* | 18.37* | 44.66* | 24.18 | 34.81 |
| RAT | k=4 | 40.91 | 42.98 | 14.39 | 39.09 | 23.56 | 32.19 |
| RAT + Shuf. | k=4 | 42.0 | 46.4* | 18.23* | 43.65* | 24.02 | 34.86 |
| RAT | k=5 | 40.83 | 41.93 | 13.8 | 38.55 | 23.67 | 31.76 |
| RAT + Shuf. | k=5 | 41.04 | 44.41* | 16.25* | 41.57* | 24.12 | 33.48 |

En-Fr

| Model | Top-k | LAW | MED | IT | NEWS | BANK | REL | TED | AVER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 65.58 | 57.92 | 48.89 | 34.43 | 57.32 | 86.87 | 40.03 | 55.86 |
| RAT | k=1 | 58.3 | 51.54 | 44.99 | 33.51 | 48.49 | 59.62 | 39.07 | 47.93 |
| RAT + Shuf. | k=1 | 60.97* | 53.48* | 45.95 | 33.76 | 51.2 | 82.37* | 39.73 | 52.49 |
| RAT | k=2 | 56.89 | 51.36 | 42.18 | 33.15 | 47.42 | 49.62 | 39.09 | 45.67 |
| RAT + Shuf. | k=2 | 60.20* | 52.66* | 45.55* | 34.12* | 50.36* | 81.76* | 39.85 | 52.07 |
| RAT | k=3 | 56.48 | 50.31 | 40.41 | 33.25 | 45.32 | 43.61 | 38.31 | 43.96 |
| RAT + Shuf. | k=3 | 57.81* | 51.89* | 42.78* | 33.51 | 48.08* | 73.11* | 38.75 | 49.42 |
| RAT | k=4 | 54.93 | 49.98 | 39.77 | 33.05 | 45.03 | 39.26 | 37.75 | 42.82 |
| RAT + Shuf. | k=4 | 56.33* | 50.92* | 42.07* | 32.94 | 46.58 | 60.31* | 38.48 | 46.80 |
| RAT | k=5 | 54.36 | 49.91 | 39 | 32.2 | 44.38 | 33.88 | 37.23 | 41.57 |
| RAT + Shuf. | k=5 | 55.43* | 50.41 | 40.51* | 32.81 | 45.26 | 54.02* | 37.91 | 45.19 |
Table 6: Full results with suggestions from relevant TMs, for k = 1 to 5. * marks statistically significant differences (Section 4.2).

En-De

| Model | Top-k | IT | LAW | REL | MED | SUBT | AVER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 39.22 | 53.06 | 23.03 | 48.48 | 24.24 | 37.61 |
| RAT | k=1 | 41.24* | 57.66* | 29.11* | 53.73* | 24.04 | 41.16 |
| RAT + Shuf. | k=1 | 39.55 | 55.84 | 25.92 | 50.8 | 24.30 | 39.28 |
| RAT | k=2 | 41.52 | 57.86 | 30.67 | 53.52 | 23.83 | 41.48 |
| RAT + Shuf. | k=2 | 40.78 | 57.82 | 30.87 | 53.34 | 24.07 | 41.38 |
| RAT | k=3 | 41.43 | 57.67 | 31.24 | 53.05* | 23.66 | 41.41 |
| RAT + Shuf. | k=3 | 40.90 | 56.98 | 30.65 | 51.36 | 24.24 | 40.83 |
| RAT | k=4 | 41.35 | 57.54 | 31.7 | 52.43 | 23.53 | 41.31 |
| RAT + Shuf. | k=4 | 42.24 | 58.90* | 33.02* | 53.7 | 23.94 | 42.36 |
| RAT | k=5 | 40.58 | 57.2 | 32.1 | 53.21 | 23.69 | 41.36 |
| RAT + Shuf. | k=5 | 41.43 | 58.32* | 33.07 | 53.89 | 23.78 | 42.10 |

En-Fr

| Model | Top-k | LAW | MED | IT | NEWS | BANK | REL | TED | AVER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 65.58 | 57.92 | 48.89 | 34.43 | 57.32 | 86.87 | 40.03 | 55.86 |
| RAT | k=1 | 70.39* | 59.45 | 51.77* | 34.07 | 57.42 | 81.15 | 39.93 | 56.31 |
| RAT + Shuf. | k=1 | 69.2 | 58.83 | 49.75 | 34.26 | 57.29 | 84.92* | 40.16 | 56.34 |
| RAT | k=2 | 70.25* | 60.20* | 50.28 | 33.95 | 58.65* | 83.99 | 40.32 | 56.81 |
| RAT + Shuf. | k=2 | 69.45 | 59.42 | 49.41 | 34.38 | 57.46 | 84.15 | 40.63 | 56.41 |
| RAT | k=3 | 70.70 | 61.02 | 52.57 | 33.61 | 60.67 | 84.84 | 39.47 | 57.55 |
| RAT + Shuf. | k=3 | 70.03 | 60.41 | 51.69 | 33.64 | 60.41 | 85.15 | 39.66 | 57.28 |
| RAT | k=4 | 70.07 | 60.23* | 50.28 | 33.20 | 57.5 | 84.65 | 39.37 | 56.47 |
| RAT + Shuf. | k=4 | 69.73 | 59.03 | 49.4 | 33.32 | 56.63 | 85.29 | 39.78 | 56.17 |
| RAT | k=5 | 68.99 | 59.39 | 49.61 | 32.85 | 56.92 | 83.88 | 39.14 | 55.83 |
| RAT + Shuf. | k=5 | 69.55 | 59.51 | 49.60 | 33.37 | 56.45 | 84.13 | 39.44 | 56.01 |