
Improving Word Sense Disambiguation
in Neural Machine Translation with Salient Document Context

Elijah Rippeth    Marine Carpuat
Department of Computer Science
University of Maryland
{erip,marine}@cs.umd.edu
Kevin Duh    Matt Post
HLTCOE
Johns Hopkins University
{kevinduh,post}@cs.jhu.edu
Abstract

Lexical ambiguity is a challenging and pervasive problem in machine translation (MT). We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural MT. Our approach requires no sense annotation and no change to standard model architectures. Since actual document context is not available for the vast majority of MT training data, we collect related sentences for each input to construct pseudo-documents. Salient words from pseudo-documents are then encoded as a prefix to each source sentence to condition the generation of the translation. To evaluate, we release doc-MuCoW, a challenge set for translation disambiguation based on the English-German MuCoW Raganato et al. (2020) augmented with document IDs. Extensive experiments show that our method translates ambiguous source words better than strong sentence-level baselines and comparable document-level baselines while reducing training costs.

1 Introduction

Lexical ambiguity is a challenging problem for machine translation (MT) systems—when an ambiguous source word lacks sufficient context, a translation system may fail to translate it accurately, as seen in Table 1. Exacerbating the situation, neural MT systems are nearly always trained, evaluated, and deployed at the sentence level by translating individual sentences independently, ignoring context when available. As a result, sentence-level systems typically struggle with word sense disambiguation (WSD) Rios Gonzales et al. (2017) where extra-sentential context could help Bawden et al. (2018).

Document-level MT has attempted to address this issue by incorporating additional sentence context. While interest in document-level MT has increased recently (Rysová et al., 2019; Farajian et al., 2020; Akhbardeh et al., 2021, i.a.), its adoption has been slow in part due to architectural complexity, computational expense, and lack of high-quality document-level data Sun et al. (2022).

English src | I went to the bank after lunch.
Context 1 | finance money vault checking deposit
German ref 1 | Ich ging zur Bank nach dem Mittagessen.
Context 2 | nature river water forest shore
German ref 2 | Ich ging zum Ufer nach dem Mittagessen.

Table 1: Different contexts give rise to different German translations of the ambiguous English word bank: a financial institution or a shore.

To address these limitations, we produce context by prefixing the source sentence with a document "summary": a bag of salient words which contextualizes the sentence within a noisily compiled pseudo-document. We hypothesize that this approach can improve the accuracy of ambiguous word translation at the sentence level with no additional parameters.

To this end, we contribute the following:

  1. (C1) A framework for incorporating salient context into training, which reduces context to relevant keywords while maintaining a standard sentence-level architecture for MT.

  2. (C2) An updated test set, English-German doc-MuCoW, a document-ID-augmented subset of MuCoW (Raganato et al., 2020): a challenging test set for measuring the WSD performance of MT systems.

  3. (C3) A set of results and accompanying analysis showing that saliency-based models uniformly improve over sentence-level and comparable document-level baselines in translation disambiguation at a reduced training cost.

2 Method

Pseudo-Documents

Because MT is primarily framed as a sentence-level task, the vast majority of corpora do not preserve document information, hindering document-level approaches to MT (Li et al., 2020; Sun et al., 2022). We address this problem by relaxing the requirement for actual document context. Instead, we define "pseudo-documents": bags of related sentences constructed in a potentially fuzzy manner. The related sentences could be drawn from the same actual document, from webpages with similar URLs, or from the same topic-model cluster, for example. While this is a noisy process that does not provide the structure and ordering of actual documents, the resulting pseudo-documents are still likely to provide most of the context information relevant to WSD, such as topical keywords.

Saliency-based MT models

We introduce a framework for incorporating salient document context into training through keyword prefixing. We define a sentence $s_i := [w_1, \dots, w_n] \in \mathbb{S} \subset \mathbb{W}^n$ as a sequence of $n$ words and a pseudo-document $d_j := \{s_1, \dots, s_m\} \in \mathbb{D} \subset \mathbb{S}^m$ as a set of $m > 1$ sentences. Given a pseudo-document $d \in \mathbb{D}$ and a saliency function $\phi_n : \mathbb{D} \rightarrow \mathbb{W}^n$ which maps a pseudo-document to a set of $n$ words, we define the $k$ most salient words given $d$ as $\hat{\mathbf{w}}^k = \phi_k(d)$.¹

¹ In descending order by saliency-function-specific weight.

For a given source sentence $x \in d_j$ and the salient words given $x$'s parent pseudo-document, $\hat{\mathbf{w}}^k = \phi_k(d_j)$, we generate a translation according to

$$y^{*} = \operatorname*{arg\,max}_{y} \; p(y \mid x, \hat{\mathbf{w}}^{k})$$

allowing the model to condition on a small amount of extra-sentential information from $x$'s "global context". To do this, we treat $\hat{\mathbf{w}}^k$ as a prefix of a modified model input, given by

$$x' = \hat{\mathbf{w}}^{k} \oplus \left[\langle\textsc{SEP}\rangle\right] \oplus x$$

where $\oplus$ denotes sequence concatenation and $x'$ is the new effective source sentence. This process is applied to both training and test samples. The target side of the training data remains unchanged. A benefit of this simple source-side manipulation is that by treating $x'$ as the source sentence, we maintain an identical model architecture, loss function, and decoding process to traditional sentence-level neural MT systems.
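To make the transformation concrete, here is a minimal sketch of the prefixing step in Python; the function name, separator string, and saliency-function interface are our own illustrative choices, since the paper specifies only the transformation itself:

```python
from typing import Callable, List

# Hypothetical surface form for the separator; the paper specifies a
# special <SEP> token in the vocabulary but not its exact spelling.
SEP = "<SEP>"

def prefix_source(
    source: str,
    pseudo_document: List[str],
    saliency_fn: Callable[[List[str], int], List[str]],
    k: int = 5,
) -> str:
    """Build the effective source x' = w_hat^k (+) [<SEP>] (+) x."""
    # k most salient words from the parent pseudo-document, most salient first
    salient_words = saliency_fn(pseudo_document, k)
    return " ".join(salient_words + [SEP, source])
```

At training time, the same transformation is applied to every source sentence while the target side is left untouched.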

An example of these salient words and their impact can be seen in Table 1, in which a given source sentence $x$ prefixed with varying salient context $\hat{\mathbf{w}}^5$ yields different, valid translations with varying realizations of the ambiguous source word.

Saliency functions

We consider two saliency functions for extracting representative words from each pseudo-document.

First, we use the standard term frequency-inverse document frequency (tfidf) score (Spärck Jones, 1972), which views as salient those words that are frequent within a document but occur in relatively few documents of the corpus $D$. Formally, for a document $d$,

$$\phi^{\mathrm{tfidf}}_{n}(d) = \operatorname*{arg\,max}_{\hat{\mathbf{w}}^{n} \subset d} \ \{\operatorname{tfidf}_{D}(t, d) : t \in \hat{\mathbf{w}}^{n}\}$$

with

$$\operatorname{tfidf}_{D}(t, d) = \operatorname{tf}(t, d) \cdot \operatorname{idf}_{D}(t)$$
$$\operatorname{tf}(t, d) = \frac{1}{|d|} \sum_{t' \in d} \mathbb{1}[t = t']$$
$$\operatorname{idf}_{D}(t) = \log\left(\frac{|D| + 1}{\operatorname{df}_{D}(t) + 1}\right) + 1$$
$$\operatorname{df}_{D}(t) = \sum_{d' \in D} \mathbb{1}[t \in d']$$
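A small sketch of $\phi^{\mathrm{tfidf}}_n$ following the definitions above, assuming pre-tokenized documents; the function names and the fallback idf weight for unseen terms are our own choices:

```python
import math
from collections import Counter
from typing import Dict, List

def learn_idf(corpus: List[List[str]]) -> Dict[str, float]:
    """idf_D(t) = log((|D| + 1) / (df_D(t) + 1)) + 1, learned once over corpus D."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # df counts documents containing t, not occurrences
    n_docs = len(corpus)
    return {t: math.log((n_docs + 1) / (c + 1)) + 1 for t, c in df.items()}

def tfidf_saliency(doc: List[str], idf: Dict[str, float], n: int) -> List[str]:
    """phi^tfidf_n(d): the n highest-tfidf words in d, most salient first."""
    counts = Counter(doc)
    # Unseen terms fall back to weight 1.0 (an assumption, not from the paper).
    scores = {t: (c / len(doc)) * idf.get(t, 1.0) for t, c in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```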

Second, we use YAKE! (Campos et al., 2020), a strong off-the-shelf unsupervised keyword extractor. Unlike tfidf, YAKE! does not treat documents as bags-of-words, and uses features such as candidate position within a document, candidate term frequency, and the spread of candidates across sentences. YAKE! was found to be likelier to return highly-ranked human-generated keywords than contemporary state-of-the-art unsupervised statistical, unsupervised graph-based, and supervised baselines across a number of diverse corpora and languages (Campos et al., 2020). By fixing hyperparameters, we can treat YAKE! as a black-box keyword extractor which returns the $n$ highest-scoring words in a document with the following saliency function:

$$\phi^{\mathrm{yake}}_{n}(d) = \operatorname{YAKE}_{\mathrm{top}k = n}(d)$$
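As an illustration, the following sketch wraps the open-source yake package as a black-box saliency function; the keyword arguments mirror the library's documented API and the unigram and deduplication settings described in Section 4.2, though exact argument names and return-tuple order may vary across versions:

```python
import yake  # pip install yake

def yake_saliency(document: str, n: int) -> list:
    """phi^yake_n(d): the n best-scoring YAKE! keywords (lower score = more salient)."""
    extractor = yake.KeywordExtractor(
        lan="en",
        n=1,               # unigram candidates only, so we select key *words*
        dedupFunc="seqm",  # "sequence matcher" deduplication (Section 4.2)
        dedupLim=0.9,      # deduplication ratio threshold
        top=n,
    )
    # Recent versions return (keyword, score) pairs.
    return [kw for kw, _score in extractor.extract_keywords(document)]
```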

3 doc-MuCoW: A Large Test Set for WSD with Document Context

Subcorpus | # ambiguous types | # sentences | # doc IDs
Europarl v7 (Koehn, 2005) | 199 | 8912 | 871
GlobalVoices v2017q3 (Tiedemann, 2012) | 160 | 1395 | 456
JW300 (Agić and Vulić, 2019) | 204 | 6505 | 1944
MultiUN v1 (Eisele and Chen, 2010) | 14 | 25 | 11
TED 2013 v1.1 (Cettolo et al., 2013) | 194 | 3708 | 667
Combined | 206 | 20545 | 3949

Table 2: The composition of the doc-MuCoW test set shows that it has broad document coverage and a wide range of ambiguous word types. Due to its large size, even small improvements are statistically meaningful.

Standard MT test sets are designed to measure translation quality and do not afford a direct measure of translation disambiguation accuracy. In past work, targeted WSD evaluation has used challenge sets Rios Gonzales et al. (2017), where systems are evaluated by scoring correct and incorrect translations of a source sentence constructed to feature lexical ambiguity. While enabling controlled comparisons, this approach is limited to evaluating the scores of MT systems on artificially constructed data, rather than their generation ability on naturally occurring text. The MuCoW dataset bridges that gap by automatically augmenting standard MT test sets from diverse domains with sense annotation Raganato et al. (2020). Each example in MuCoW contains an ambiguous source lemma, the ambiguous token’s sense cluster ID, the corpus from which the bitext was sourced, the bitext, and the ambiguous source token. MuCoW derives its sense inventory from BabelNet Navigli and Ponzetto (2010): BabelNet-derived embeddings Mancini et al. (2017) are used to form sense clusters for translations of ambiguous words occurring in MT test sets.

A key attribute missing from each example for the purposes of this work is the document ID. Without document IDs we cannot create document context with which to test our method. To this end, we reconstruct document IDs for examples drawn from available sources² by searching the raw text files of the original corpora for sentences in MuCoW.³ The resulting test set, doc-MuCoW, contains 20.5k examples from MuCoW augmented with document IDs; details can be found in Table 2.

² Tatoeba, the Books Corpus, CommonCrawl, and the EU Bookshop Corpus are either sentence-level in nature or unavailable in a form permitting doc ID assignment.
³ To address differences in tokenization, we remove all non-alphanumeric characters from the needle and the haystack and check for substring inclusion.

WSD evaluation

Along with the challenge sets, Raganato et al. (2020) prescribe an evaluation scheme to measure the corpus-level precision (P), recall (R), and F1 of target senses in system outputs. To measure this, system outputs are lemmatized using an off-the-shelf parsing model and sense inventories are queried relative to the source sentence's ambiguous lemma. Formally, given a system output $s$ as a set of lemmas, a set of acceptable "positive" lemmatized realizations $p$ of the source sentence's ambiguous token, and a set of unacceptable "negative" lemmatized realizations $n$, the sense accuracy labels are computed as follows:

$$\operatorname{C}(s, p, n) = \begin{cases} \textsc{pos} & s \cap p \neq \emptyset \land s \cap n = \emptyset \\ \textsc{neg} & s \cap n \neq \emptyset \\ \textsc{unk} & \text{otherwise} \end{cases}$$

with

$$P = \frac{\#\textsc{pos}}{\#\textsc{pos} + \#\textsc{neg}} \quad \text{and} \quad R = \frac{\#\textsc{pos}}{\#\textsc{pos} + \#\textsc{neg} + \#\textsc{unk}}$$

While R is a slightly unusual definition of recall, due to the lack of distinction between true and false negatives afforded by $\operatorname{C}$, we use it for consistency.
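A compact sketch of this labeling rule and the resulting P and R, with set-of-lemma inputs; this is our own rendering of the scheme, not the released evaluation pipeline:

```python
from typing import List, Set, Tuple

def sense_label(s: Set[str], p: Set[str], n: Set[str]) -> str:
    """C(s, p, n): label one system output given positive/negative lemma sets."""
    if s & p and not s & n:
        return "pos"
    if s & n:
        return "neg"
    return "unk"

def precision_recall(labels: List[str]) -> Tuple[float, float]:
    # Assumes at least one pos or neg label exists in the corpus.
    pos, neg, unk = (labels.count(x) for x in ("pos", "neg", "unk"))
    return pos / (pos + neg), pos / (pos + neg + unk)
```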

4 Experimental Setup

4.1 Training Data

Sun et al. (2022) compile a comprehensive list of English-German document-level datasets, which shows that only on the order of 2M sentences (across approximately 125k documents) of document-level data exist. A subset of these (e.g., Europarl) is contained within doc-MuCoW, making it unavailable for training. Additionally, the nature of documents in several of these datasets (e.g., OpenSubtitles, Europarl) would make the documents too large and thus impractical for providing salient context for translation disambiguation.

Instead, we construct pseudo-documents for training data using ParaCrawl (Bañón et al., 2020). To do this, we process the English-German TMX representation of filtered ParaCrawl,⁴ which provides a set of source and target URLs, treating canonicalized versions of these URL sets as proxies for document IDs. These are pseudo-documents, rather than real documents, since the canonicalization process is noisy and the resulting documents are bags-of-sentences. We filter long documents, keeping only documents which contain between 2 and 10 sentences. This results in 9,412,822 pseudo-documents containing a total of 47,418,165 high-quality bitext pairs.

⁴ We retain pairs of sentences which have cosine similarity greater than 0.85 based on LaBSE representations (Feng et al., 2022).
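A sketch of this grouping step, under the assumption that canonicalization amounts to lowercasing and dropping the scheme, query string, and trailing slashes (the paper does not pin down the exact normalization):

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit

def canonicalize(url: str) -> str:
    # Drop scheme, query string, fragment, and trailing slashes (an assumption).
    parts = urlsplit(url.lower())
    return parts.netloc + parts.path.rstrip("/")

def build_pseudo_documents(
    records: Iterable[Tuple[str, str, str]],  # (source URL, source, target)
    min_sents: int = 2,
    max_sents: int = 10,
) -> Dict[str, List[Tuple[str, str]]]:
    docs = defaultdict(list)
    for url, src, tgt in records:
        docs[canonicalize(url)].append((src, tgt))
    # Keep only pseudo-documents within the target length range (2-10 sentences).
    return {k: v for k, v in docs.items() if min_sents <= len(v) <= max_sents}
```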

4.2 Systems

Baselines

As baselines we consider two systems: context-agnostic sent and context-aware 2sent. sent is trained on the sentences from all pseudo-documents in the training set. 2sent is trained on the same sentences, but each source sentence $x \in d_j$ is prefixed with a randomly selected sentence from $x$'s parent pseudo-document, omitting $x$ itself; i.e., $\hat{x} \in d_j \setminus \{x\}$. We train a joint unigram LM segmenter (Kudo, 2018) using SentencePiece (Kudo and Richardson, 2018) by sampling 10M sentences from the training data with 0.995 character coverage, a vocabulary size of 32k, and a user-defined $\langle\textsc{SEP}\rangle$ token which is only used in context-aware models.
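For reference, a sketch of this segmenter training via the sentencepiece Python API with the hyperparameters above; the input path is a placeholder and the <SEP> surface form is illustrative:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.sample.txt",        # hypothetical file of sampled training lines
    model_prefix="joint_unigram",
    model_type="unigram",
    vocab_size=32000,
    character_coverage=0.995,
    user_defined_symbols=["<SEP>"],  # reserved separator for context-aware models
    input_sentence_size=10_000_000,  # sample 10M sentences
    shuffle_input_sentence=True,
)
```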

We train systems $\texttt{X}_N$ using the same architecture and training data as sent, but with context constructed by passing pseudo-documents to saliency function $\phi^X_N$ as described in Section 2, with $N$ chosen to be 5 and 10. This results in four systems: tfidf5, tfidf10, yake5, and yake10.

Each system follows an identical architecture: a 60.5M-parameter 6+6 transformer (Vaswani et al., 2017) trained with the fairseq toolkit (Ott et al., 2019) to optimize label-smoothed cross-entropy using Adam. We split the training data into four shards with an equal number of examples and train for 30 epochs⁵ with a maximum of 32,768 tokens per batch on a single A6000 GPU.

⁵ An epoch is defined as a pass over a shard.

Saliency-based models

For salient context, pseudo-documents are lowercased and tokenized with Moses. A single tfidf statistic was learned from a sample of 10M pseudo-documents in the training data and used for all tfidf systems. yake systems were presented with documents with identical preprocessing to the tfidf-based systems. For all yake systems we use unigram features to ensure we only select key words (cf. phrases) and use the "sequence matcher" keyword deduplication algorithm with a Levenshtein ratio of 0.9.

System | P | R | F1 | BLEU | COMET | Length | Training time (ks)
Baseline models
sent | 0.7850 | 0.6008 | 0.6807 | 21.8 | 0.782 | 24.3 | 97.8 (×1.00)
2sent | 0.7830 | 0.5976 | 0.6779 | 21.9 | 0.783 | 50.5 | 135.4 (×1.38)
Saliency-based models
tfidf5 | 0.7816 | 0.5954 | 0.6759 | 21.9 | 0.784 | 36.0 | 113.3 (×1.16)
tfidf10 | 0.7871 | 0.6013 | 0.6817 | 22.0 | 0.783 | 44.6 | 127.5 (×1.30)
yake5 | 0.7878 | 0.5973 | 0.6795 | 22.0 | 0.785 | 34.3 | 110.7 (×1.13)
yake10 | 0.7885 | 0.6058 | 0.6852 | 21.9 | 0.783 | 41.7 | 122.8 (×1.26)

Table 3: P, R, and F1 are WSD metrics; BLEU and COMET measure translation quality; Length (mean subwords) and training time measure efficiency. Significant improvements over sent at the 95% CI as determined by a one-sided t-test with 50 trials, each with 750 samples (with replacement), and by paired bootstrap resampling with 5k resamples. Saliency-based systems perform comparably to or better than sent and 2sent in WSD metrics and translation quality, frequently with significant improvements, at a fraction of the training time. Because of the size of doc-MuCoW (20.5k sentences), large changes are unnecessary for statistical significance.

4.3 Evaluation

Our evaluation targets three main goals: ambiguous word translation accuracy, general MT quality, and training time.

To address the first goal, we use the MuCoW evaluation pipeline described in Section 3 and record R, P, and F1. We use spaCy v3.6.1 and de_core_news_lg v3.6.0 (Montani et al., 2023) to lemmatize system outputs.

To address the second goal, we measure BLEU (Papineni et al., 2002) using SacreBLEU⁶ (Post, 2018) and COMET⁷ (Rei et al., 2020). We expect neither of these to vary dramatically and include them to ensure no regressions in translation quality. We expect that the change in translation of individual words would cause BLEU to improve only slightly. The expectations for COMET are less clear: if additional context is required, COMET scores may vary widely since the semantics of the translations could change dramatically; otherwise, context-aware models may perform as well as context-agnostic ones, leading to small COMET score differences.

⁶ nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1
⁷ Unbabel/wmt22-comet-da
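A sketch of this scoring setup using the sacrebleu and unbabel-comet Python packages; the wrapper function and batch size are our own choices:

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

def quality_scores(sources, hypotheses, references):
    # Default corpus_bleu settings match the reported signature (13a tokenizer,
    # exponential smoothing, mixed case).
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    data = [
        {"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)
    ]
    comet = comet_model.predict(data, batch_size=32).system_score
    return bleu, comet
```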

To address the third goal, we also record the average sequence length in number of subwords and training time in thousands of seconds, computing the increase in training time relative to sent. As attention costs grow quadratically in sequence length, maintaining a tradeoff between task performance and training time is critical for practical purposes and is a benefit of using few salient words.

5 Main Results

WSD metrics

As reported in Table 3, most saliency-based systems significantly improve ambiguous word translation relative to the sentence-only baseline sent in some respect. Comparatively, the document-level system 2sent actually performs worse than sent. Systems trained with fewer salient words (5) translate ambiguous words worse than systems trained with 10 salient words, regardless of the saliency function used. The saliency-based model with a more sophisticated saliency function, yake, yields better WSD results than tfidf.

Translation quality

As expected, extra-sentential context preserves or slightly improves the translation quality of the sentence-level model, based on BLEU and COMET scores. Saliency-based models and 2sent achieve comparable translation quality overall. The saliency-based tfidf5 and yake5 models stand out by significantly improving BLEU and COMET over the sent baseline.

While 2sent has significantly different BLEU from sent, inspection of the sentence-level BLEU scores suggests that context helps the system match the reference style, even if it does not produce the correct sense.

Efficiency

The benefits of extra-sentential context come at a much smaller training cost when using saliency-based models than document-level models. Because 2sent's inputs are the longest of all models in this work, its training time is considerably longer. Saliency-based models train 8-25% faster than 2sent with identical architectures, hyperparameters, and hardware while performing comparably or better in other aspects of our evaluation.

Overall, our results show that, in the absence of actual document context, pseudo-documents provide useful context to translate ambiguous words better. Further, summarizing pseudo-documents using a small number of salient words is an effective strategy to improve translation of ambiguous words, while reducing training time costs compared to document-based systems. Even with these advances, there remains room for improvement in WSD metrics on doc-MuCoW, suggesting that WSD is not a solved problem in high-resource MT.

6 Analysis

Next, we conduct an extensive analysis to better understand these results, starting by manually inspecting outputs, before breaking down quantitative results across several data and modeling dimensions.

6.1 Manual Inspection

We manually inspect 50 random examples where WSD results differ across models.

Input (yake5): european state commission committee year ⟨SEP⟩ Thank you, Mr President, for that address.
Output: Danke, Herr Präsident, für diese Ansprache.
Reference: Vielen Dank für Ihre Ansprache, Herr Präsident.

Figure 1: Here the context appropriately contextualizes the ambiguous word, nudging the model toward the "speech" sense of address.

When saliency-based methods improve over the sent baseline, the extracted salient words often provide topical information needed to select the appropriate sense, as seen in Figure 1. When saliency-based methods are worse than the sent baseline, the salient context appears to be useful for summarizing the document, but is not useful for disambiguating the sentence being translated. This suggests that correctly selecting context is important and that the lack of local context may hurt translation disambiguation in these cases. Additionally, we find that in very rare cases, noise from the automatic curation of MuCoW penalizes correct generations by assigning synonyms different sense cluster IDs. We present full examples in Figures 4 and 5 in the Appendix.

Among cases where 2sent made an incorrect sense prediction, the input often contained sufficient support for the correct decision, but 2sent failed to incorporate it, as seen in Figure 2.

Input (2sent): To put this into context, more people will die in the European Union from the effects of a strain of TB that is drug-resistant than will be affected by avian flu. ⟨SEP⟩ Why not pick an easy target, one that does not move about?
Output: Warum nicht ein einfaches Ziel wählen, das sich nicht bewegt<unk>
Reference: Warum sucht man sich eigentlich kein leichteres Opfer, eines, das sich nicht bewegt?

Figure 2: Here the context appropriately contextualizes the ambiguous word, but this cue is not strong enough for the model to translate it appropriately: the system's realization is the "goal" sense of target rather than "victim". We note that the inclusion of <unk> in the output is an unfortunate effect of randomness in the subword model training.

6.2 Quantitative Analysis

Impact of sense frequency

We examine the performance of each model as a function of the relative frequency of each sense of an ambiguous token in the training set, binning examples into frequency bins of size 20%, and present the F1 scores as a function of frequency in Table 4. Low-frequency senses are naturally harder to disambiguate. Most systems improve uniformly and exhibit the largest average gains in low-frequency senses. While a saliency-based model is always among the strongest in a bin, 2sent is never better than all of the saliency-based models.

bin | sent | Δ tfidf5 | Δ tfidf10 | Δ yake5 | Δ yake10 | Δ 2sent
20-40% | 0.532 | -0.006 | +0.017* | -0.001 | +0.005 | -0.008
40-60% | 0.710 | 0.000* | -0.017 | -0.025 | -0.003 | 0.000*
60-80% | 0.813 | -0.021 | -0.011* | -0.017 | -0.018 | -0.015
80-100% | 0.847 | +0.001 | +0.004 | -0.003 | +0.007* | +0.003

Table 4: F1 of various systems as a function of relative sense frequency (best in bin marked with *). All systems struggle with low-frequency senses, but context-aware systems with 10 salient words typically improve in this range.

Impact of sentence length

Figure 3: $\#\textsc{pos}_{\texttt{sys}} - \#\textsc{pos}_{\texttt{sent}}$ for each context-aware system sys as a function of subword-segmented source sentence length. Shorter sentences benefit more from context under saliency-based models with more context, with diminishing returns as more intra-sentential context becomes available.

We expect that the importance of extra-sentential context diminishes with the length of $x$: as more intra-sentential context is made available, less extra-sentential context is required. As seen in Figure 3, shorter sequences benefit more from context, and this benefit tapers off as sequences get longer. A notable difference is that on short sentences, models with 5 salient words are much more volatile, which is consistent with their generally degraded performance relative to sent.

Confidence as a function of context

To examine the reliance on provided context, we employ conditional cross-mutual information (CXMI) (Fernandes et al., 2021), which measures the relative improvement in confidence of generating the exact reference between context-aware models and context-agnostic models over a corpus, as given by

$$\operatorname{CXMI}(\hat{\mathbf{W}}^{k} \rightarrow Y \mid X) \approx -\frac{1}{N} \sum_{i=1}^{N} \log \frac{p(y_{i} \mid x_{i})}{p(y_{i} \mid x_{i}, \hat{\mathbf{w}}^{k}_{i})}$$

where positive CXMI indicates information gain afforded by the context over the context-agnostic model. We present results in Table 5.
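Given per-example reference log-probabilities extracted from the two models, the estimate reduces to a difference of means, as in this sketch (array names are our own):

```python
import numpy as np

def cxmi(logp_agnostic: np.ndarray, logp_aware: np.ndarray) -> float:
    """CXMI ≈ -1/N * sum_i log( p(y_i|x_i) / p(y_i|x_i, w_i) ).

    Arguments are per-example reference log-probabilities under the
    context-agnostic and context-aware models; positive values indicate
    the context made the model more confident in the reference.
    """
    return float(np.mean(logp_aware - logp_agnostic))
```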

System | CXMI
tfidf5 | 0.002
tfidf10 | -0.012
yake5 | 0.003
yake10 | -0.022
2sent | -0.030

Table 5: CXMI scores for each system show that less salient context improves model confidence, though most context-aware models are less confident than sent.

All models exhibit similar confidence to sent, and confidence improves with less context. Models trained with 5 salient words rely on their context more than both their 10-word variants and 2sent. These findings agree with previous findings that, when using source context, confidence in the translation decreases with the amount of context (Bawden et al., 2018; Fernandes et al., 2021), and they provide some information-theoretic support for our results. This analysis suggests that, in terms of confidence, less context is more.

Effects of context shuffling in training

To measure the sensitivity to the context ordering imposed by our method (i.e., Footnote 1), we train shuffled variants of each model, $\texttt{X}^{\text{shuf}}_N$, in which the examples are prefixed with the same bag of salient words but shuffled randomly per-example prior to subword segmentation. We hypothesize that this shuffling may reduce the importance of keyword order and may improve context usage, with a similar intuition to CoWord dropout (Fernandes et al., 2021).
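The shuffling itself is a one-line transformation applied per example; a sketch, assuming the salient words are available as a list before segmentation:

```python
import random

def shuffle_context(salient_words: list, rng: random.Random) -> list:
    """Per-example shuffle of the salient-word prefix, applied before
    subword segmentation; the bag of words itself is unchanged."""
    shuffled = list(salient_words)
    rng.shuffle(shuffled)
    return shuffled
```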

Using these new models, we re-run evaluation and CXMI analysis. We find shuffling marginally improves the Translation Quality metrics on average, but has an interesting impact on the WSD metrics: on average, models with 5 salient words improve in both P and R (and subsequently F1) while models with 10 salient words degrade marginally. Additionally, we find interesting trends in terms of CXMI: confidence drops on average for systems trained with shuffled context, with a slight improvement in (relatively low baseline) confidence for the yake10 pair of models. We refer to the full results in Tables 6 and 7 in the Appendix.

Prefixing sent with salient context

We ask whether pseudo-document context is strictly required at training time: if sent can benefit from this context at test time with some deterministic post-processing, this suggests that the training method may not be worth the effort. We take the trained sent system and craft test inputs by prefixing examples with salient context from each saliency function, as with saliency-based systems, to determine if training with salient context is necessary. For a given saliency function $\phi^X_N$, we label the outputs of this scheme as $\texttt{sent-X}_N$. We split the outputs on $\langle\textsc{SEP}\rangle$ to mirror the expected sentence-level evaluation. As expected, there are steep decreases in baseline performance in WSD metrics and translation quality, suggesting that training with salient context is necessary. By examining sentence-level BLEU scores and manually inspecting output of low-scoring translations, we find that the large drop in translation quality can be attributed to hallucinated output; we suspect that these hallucinations are due to a mismatch between training and test conditions. We refer to Table 8 for full results.

Overall, saliency-based models improve translation disambiguation in more sense-frequency ranges than 2sent. Saliency-based models' translation disambiguation improvements over sent are highest for short sequences, and their efficacy diminishes as sentences get longer, likely due to additional intra-sentential context. All context-aware models are about as confident as sent at scoring references in doc-MuCoW, though confidence decreases as contexts grow longer. On average, WSD performance improves over sent's when context shuffling is introduced in training. Finally, we demonstrate that sent cannot effectively use saliency-based context at test time, illustrating the importance of saliency-based training.

7 Related Work

Translation Disambiguation

Translating ambiguous words has been recognized as a central problem since the first formulations of MT Weaver (1952) and remains an issue in state-of-the-art neural models (Emelin et al., 2020; Campolungo et al., 2022a).

Much work has focused on designing explicit translation disambiguation modules for MT, such as context-dependent translation lexicons for statistical MT (Chan et al., 2007; Carpuat and Wu, 2007, i.a.). Even in neural MT architectures, where encoders already induce context-dependent representations of source tokens, explicitly incorporating source sentence context improves translation (Liu et al., 2018; Popescu-Belis, 2019). We take a simpler approach by augmenting neural MT models with salient pseudo-document context without change to standard encoder-decoder architectures.

Translation disambiguation has also been addressed by exploiting sense inventories and WSD models to explicitly disambiguate source words for translation (Rios Gonzales et al., 2017; Nguyen et al., 2018; Pu et al., 2018; Campolungo et al., 2022b). This direction can directly benefit from advances in WSD (Bevilacqua et al., 2021), but increases the cost of training and testing MT models in practice, due to more complex architectures and the reliance on hand-crafted resources. By contrast, our work augments MT models with knowledge that can be easily acquired automatically at scale.

Document-level MT

Much attention has also been paid to designing MT models that translate sentences in their document context (Maruf et al., 2021). This has the potential of improving many aspects of translation beyond lexical disambiguation, including pronoun translation (Hardmeier, 2015), lexical consistency (Carpuat and Simard, 2012), and more broadly discourse cohesion and coherence (Jiang et al., 2022). A wealth of models have been proposed to address this task in neural MT, primarily by conditioning decoding on representations of preceding sentences (Tiedemann and Scherrer, 2017; Bawden et al., 2018), using cache-based mechanisms Tu et al. (2018); Kuang and Xiong (2018), and using full document context with hierarchical attention Miculicich et al. (2018); Tan et al. (2019). In this work, we show that extra-sentential context benefits MT even when it takes the form of noisy bags of sentences rather than actual documents, and that encoding complete sentences is not necessary for WSD.

Several avenues have been explored to compactly encode salient document context instead of directly encoding context sentences. Zhang et al. (2016) and Wang et al. (2021) use representations derived from topic models to improve MT lexical choice. Encoding document provenance (Chiang et al., 2011) or document IDs (Macé and Servan, 2019) as a prefix to the input sentence can improve translation consistency, at the cost of introducing artificial tokens in the MT vocabulary. Here, we show that efficient general-purpose keyword extraction methods benefit WSD for MT; exploring the impact of other representations would be an interesting direction for future work.

8 Conclusion

In this work we present a simple and scalable framework for training models with salient document context to improve WSD performance in sentence-level MT. To test this, we release English-German doc-MuCoW, a challenging test set for WSD performance formed by augmenting a subset of MuCoW examples with document IDs. We test our method using salient context derived from pseudo-documents formed from related sentences in ParaCrawl and show that even with noisily compiled pseudo-documents, performance surpasses that of context-agnostic sentence-level models and of comparable document-level MT models with far more context and higher training costs. Quantitative analysis shows that shorter sentences benefit more from additional context and suggests that more context may be necessary for effective context utilization in WSD.

The models trained in this work use salient context constructed from pseudo-documents, in light of the relative paucity of parallel document-level contextual data. The paradigm could, however, easily be applied to proper document-level data, were such data available. This work focused on finding salient topical context to address lexical ambiguity, but the approach could plausibly be applied to other types of context to address other discourse phenomena. Finally, our work shows that pseudo-documents appear useful in training, and their construction could be achieved in other ways.

Limitations

Single, high-resource setting

We demonstrate the strengths of our approach on English-German, which is a high-resource language pair. The lack of additional and diverse language pairs is a limitation. An open question is how performance changes with amount of available resources and whether additional context may improve any aspects of translation for languages with fewer resources.

Automatic curation

doc-MuCoW is constructed from an automatically curated dataset, MuCoW. While inspection suggests that the resulting dataset is only minimally impacted by noise from the automation process, there are some artifacts that impact the downstream evaluation. We believe these artifacts are insignificant and impact all methods equally, and thus that they should not affect conclusions drawn from this work. Furthermore, the evaluation necessarily requires the system output to be lemmatized. This introduces two limitations: first, this methodology may not be applicable to languages without such tooling (e.g., lower-resourced languages); second, lemmatization may introduce downstream artifacts in evaluation. As before, we believe that all systems are equally impacted by noisy lemmatizers and that this should not affect conclusions.

Saliency functions

While we investigate two saliency functions in this work, others could plausibly be used. We believe this to be interesting future work and present the two saliency functions in this work as a way of introducing a general framework and not as an exhaustive set.

Sense-enhanced models

While there has been a flurry of work examining sense-enhanced MT, we found the resource and computational requirements to be too high to label and train models at the scale investigated in this work. We also believe that the sentence-level nature of these methods only serves to improve cases where intra-sentential context is enough to resolve the translation ambiguity while our approach provides extra-sentential context.

Training Data

In this work, we only use filtered ParaCrawl. While this is convenient for testing our hypothesis about the utility of pseudo-documents and to ensure no train-test overlap, it omits a large part of high-quality bitext from available training. We ensure our baseline sent model is reasonable by evaluating on WMT14 English-German and achieving 26.1 BLEU, which we believe is a competitive single-system score.

The 2sent model

Because our contextualized models are effectively being trained and evaluated at the sentence-level, sampling a random sentence as context for 2sent is a necessary choice. “True” document models require document order which is not available for the training data and, while we could plausibly train a document-level model on the small amount of curated data, it would not be a controlled experiment when compared to our models trained on ParaCrawl.

References

  • Agić and Vulić (2019) Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210, Florence, Italy. Association for Computational Linguistics.
  • Akhbardeh et al. (2021) Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.
  • Bañón et al. (2020) Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4555–4567, Online. Association for Computational Linguistics.
  • Bawden et al. (2018) Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313, New Orleans, Louisiana. Association for Computational Linguistics.
  • Bevilacqua et al. (2021) Michele Bevilacqua, Tommaso Pasini, Alessandro Raganato, and Roberto Navigli. 2021. Recent Trends in Word Sense Disambiguation: A Survey. In Twenty-Ninth International Joint Conference on Artificial Intelligence, volume 5, pages 4330–4338.
  • Campolungo et al. (2022a) Niccolò Campolungo, Federico Martelli, Francesco Saina, and Roberto Navigli. 2022a. DiBiMT: A novel benchmark for measuring Word Sense Disambiguation biases in Machine Translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4331–4352, Dublin, Ireland. Association for Computational Linguistics.
  • Campolungo et al. (2022b) Niccolò Campolungo, Tommaso Pasini, Denis Emelin, and Roberto Navigli. 2022b. Reducing disambiguation biases in NMT by leveraging explicit word sense information. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4824–4838, Seattle, United States. Association for Computational Linguistics.
  • Campos et al. (2020) Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Jorge, Célia Nunes, and Adam Jatowt. 2020. Yake! keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289.
  • Carpuat and Simard (2012) Marine Carpuat and Michel Simard. 2012. The trouble with SMT consistency. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 442–449, Montréal, Canada. Association for Computational Linguistics.
  • Carpuat and Wu (2007) Marine Carpuat and Dekai Wu. 2007. Improving statistical machine translation using word sense disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 61–72, Prague, Czech Republic. Association for Computational Linguistics.
  • Cettolo et al. (2013) Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2013. Report on the 10th IWSLT evaluation campaign. In Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign, Heidelberg, Germany.
  • Chan et al. (2007) Yee Seng Chan, Hwee Tou Ng, and David Chiang. 2007. Word sense disambiguation improves statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 33–40, Prague, Czech Republic. Association for Computational Linguistics.
  • Chiang et al. (2011) David Chiang, Steve DeNeefe, and Michael Pust. 2011. Two easy improvements to lexical weighting. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 455–460, Portland, Oregon, USA. Association for Computational Linguistics.
  • Eisele and Chen (2010) Andreas Eisele and Yu Chen. 2010. MultiUN: A multilingual corpus from united nation documents. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta. European Language Resources Association (ELRA).
  • Emelin et al. (2020) Denis Emelin, Ivan Titov, and Rico Sennrich. 2020. Detecting word sense disambiguation biases in machine translation for model-agnostic adversarial attacks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7635–7653, Online. Association for Computational Linguistics.
  • Farajian et al. (2020) M. Amin Farajian, António V. Lopes, André F. T. Martins, Sameen Maruf, and Gholamreza Haffari. 2020. Findings of the WMT 2020 shared task on chat translation. In Proceedings of the Fifth Conference on Machine Translation, pages 65–75, Online. Association for Computational Linguistics.
  • Feng et al. (2022) Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.
  • Fernandes et al. (2021) Patrick Fernandes, Kayo Yin, Graham Neubig, and André F. T. Martins. 2021. Measuring and increasing context usage in context-aware machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6467–6478, Online. Association for Computational Linguistics.
  • Hardmeier (2015) Christian Hardmeier. 2015. A Document-Level SMT System with Integrated Pronoun Prediction. In Proceedings of the Second Workshop on Discourse in Machine Translation, pages 72–77, Lisbon, Portugal. Association for Computational Linguistics.
  • Jiang et al. (2022) Yuchen Jiang, Tianyu Liu, Shuming Ma, Dongdong Zhang, Jian Yang, Haoyang Huang, Rico Sennrich, Ryan Cotterell, Mrinmaya Sachan, and Ming Zhou. 2022. BlonDe: An automatic evaluation metric for document-level machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1550–1565, Seattle, United States. Association for Computational Linguistics.
  • Koehn (2005) Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.
  • Kuang and Xiong (2018) Shaohui Kuang and Deyi Xiong. 2018. Fusing recency into neural machine translation with an inter-sentence gate model. In Proceedings of the 27th International Conference on Computational Linguistics, pages 607–617, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Kudo (2018) Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.
  • Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
  • Li et al. (2020) Bei Li, Hui Liu, Ziyang Wang, Yufan Jiang, Tong Xiao, Jingbo Zhu, Tongran Liu, and Changliang Li. 2020. Does multi-encoder help? a case study on context-aware neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3512–3518, Online. Association for Computational Linguistics.
  • Liu et al. (2018) Frederick Liu, Han Lu, and Graham Neubig. 2018. Handling homographs in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1336–1345, New Orleans, Louisiana. Association for Computational Linguistics.
  • Macé and Servan (2019) Valentin Macé and Christophe Servan. 2019. Using whole document context in neural machine translation. In Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong. Association for Computational Linguistics.
  • Mancini et al. (2017) Massimiliano Mancini, Jose Camacho-Collados, Ignacio Iacobacci, and Roberto Navigli. 2017. Embedding words and senses together via joint knowledge-enhanced training. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 100–111, Vancouver, Canada. Association for Computational Linguistics.
  • Maruf et al. (2021) Sameen Maruf, Fahimeh Saleh, and Gholamreza Haffari. 2021. A Survey on Document-level Neural Machine Translation: Methods and Evaluation. ACM Computing Surveys, 54(2):45:1–45:36.
  • Miculicich et al. (2018) Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954, Brussels, Belgium. Association for Computational Linguistics.
  • Montani et al. (2023) Ines Montani, Matthew Honnibal, Adriane Boyd, Sofie Van Landeghem, Henning Peters, Paul O’Leary McCann, jim geovedi, Jim O’Regan, Maxim Samsonov, Daniël de Kok, György Orosz, Marcus Blättermann, Duygu Altinok, Madeesh Kannan, Raphael Mitsch, Søren Lind Kristiansen, Edward, Lj Miranda, Peter Baumgartner, Raphaël Bournhonesque, Richard Hudson, Explosion Bot, Roman, Leander Fiedler, Ryn Daniels, kadarakos, Wannaphong Phatthiyaphaibun, and Schero1994. 2023. explosion/spaCy: v3.6.1: Support for Pydantic v2, find-function CLI and more.
  • Navigli and Ponzetto (2010) Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225, Uppsala, Sweden. Association for Computational Linguistics.
  • Nguyen et al. (2018) Quang-Phuoc Nguyen, Anh-Dung Vo, Joon-Choul Shin, and Cheol-Young Ock. 2018. Effect of Word Sense Disambiguation on Neural Machine Translation: A Case Study in Korean. IEEE Access, 6:38512–38523.
  • Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Popescu-Belis (2019) Andrei Popescu-Belis. 2019. Context in Neural Machine Translation: A Review of Models and Evaluations.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  • Pu et al. (2018) Xiao Pu, Nikolaos Pappas, James Henderson, and Andrei Popescu-Belis. 2018. Integrating weakly supervised word sense disambiguation into neural machine translation. Transactions of the Association for Computational Linguistics, 6:635–649.
  • Raganato et al. (2020) Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. 2020. An evaluation benchmark for testing the word sense disambiguation capabilities of machine translation systems. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3668–3675, Marseille, France. European Language Resources Association.
  • Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
  • Rios Gonzales et al. (2017) Annette Rios Gonzales, Laura Mascarell, and Rico Sennrich. 2017. Improving word sense disambiguation in neural machine translation with sense embeddings. In Proceedings of the Second Conference on Machine Translation, pages 11–19, Copenhagen, Denmark. Association for Computational Linguistics.
  • Rysová et al. (2019) Kateřina Rysová, Magdaléna Rysová, Tomáš Musil, Lucie Poláková, and Ondřej Bojar. 2019. A test suite and manual evaluation of document-level NMT at WMT19. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 455–463, Florence, Italy. Association for Computational Linguistics.
  • Spärck Jones (1972) Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21.
  • Sun et al. (2022) Zewei Sun, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Shujian Huang, Jiajun Chen, and Lei Li. 2022. Rethinking document-level neural machine translation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3537–3548, Dublin, Ireland. Association for Computational Linguistics.
  • Tan et al. (2019) Xin Tan, Longyin Zhang, Deyi Xiong, and Guodong Zhou. 2019. Hierarchical modeling of global context for document-level neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1576–1585, Hong Kong, China. Association for Computational Linguistics.
  • Tiedemann (2012) Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
  • Tiedemann and Scherrer (2017) Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 82–92, Copenhagen, Denmark. Association for Computational Linguistics.
  • Tu et al. (2018) Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. 2018. Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics, 6:407–420.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Wang et al. (2021) Weixuan Wang, Wei Peng, Meng Zhang, and Qun Liu. 2021. Neural machine translation with heterogeneous topic knowledge embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3197–3202, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Weaver (1952) Warren Weaver. 1952. Translation. In Proceedings of the Conference on Mechanical Translation, Massachusetts Institute of Technology.
  • Zhang et al. (2016) Jian Zhang, Liangyou Li, Andy Way, and Qun Liu. 2016. Topic-informed neural machine translation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1807–1817, Osaka, Japan. The COLING 2016 Organizing Committee.
Appendix

Input (yake5): attitude khanamov concerned correct focuses ⟨SEP⟩ However, I consider such attitude as not quite correct.
Output: Allerdings halte ich eine solche Einstellung für nicht ganz richtig.
Reference: Ich erachte eine solche Haltung jedoch als nicht ganz richtig.

Figure 4: We see the context provides no information to contextualize the ambiguous word, so the model may not receive additional information to disambiguate.
Input (tfidf5): good creator woman hearts stream ⟨SEP⟩ What good times I had with my girl friends washing clothes in the stream!
Output: Welche guten Zeiten hatte ich mit meinen Freundinnen, die Kleidung im Strom waschen<unk>
Reference: Wie kurzweilig war es, wenn wir jungen Mädchen im Bach die Wäsche wuschen!

Figure 5: Some annotations in MuCoW contain noise in which synonyms are assigned different sense cluster IDs, which penalizes models. Here, the model produces Strom ("stream") instead of Bach ("brook"), which are nearly synonymous modulo extremely narrow semantic differences. We note that the inclusion of <unk> in the output is an unfortunate effect of randomness in the subword model training.
System | P | R | F1 | BLEU | COMET | Length | Training time (ks)
Baseline models
sent | 0.7850 | 0.6008 | 0.6807 | 21.8 | 0.782 | 24.3 | 97.8 (×1.00)
2sent | 0.7830 | 0.5976 | 0.6779 | 21.9 | 0.783 | 50.5 | 135.4 (×1.38)
Saliency-based models
tfidf5 | 0.7816 | 0.5954 | 0.6759 | 21.9 | 0.784 | 36.0 | 113.3 (×1.16)
tfidf10 | 0.7871 | 0.6013 | 0.6817 | 22.0 | 0.783 | 44.6 | 127.5 (×1.30)
yake5 | 0.788 | 0.5973 | 0.6795 | 22.0 | 0.785 | 34.3 | 110.7 (×1.13)
yake10 | 0.7885 | 0.6058 | 0.6852 | 21.9 | 0.783 | 41.7 | 122.8 (×1.26)
Shuffled saliency-based models
tfidf5^shuf | 0.788 | 0.6033 | 0.6833 | 22.0 | 0.784 | 36.0 | 115.2 (×1.18)
tfidf10^shuf | 0.7857 | 0.6002 | 0.6805 | 22.0 | 0.783 | 44.6 | 127.0 (×1.30)
yake5^shuf | 0.7904 | 0.6042 | 0.6849 | 22.0 | 0.785 | 34.3 | 112.3 (×1.15)
yake10^shuf | 0.7860 | 0.5984 | 0.6795 | 22.0 | 0.784 | 41.7 | 122.8 (×1.26)

Table 6: P, R, and F1 are WSD metrics; BLEU and COMET measure translation quality; Length (mean subwords) and training time measure efficiency. Significant improvements over sent at the 95% CI as determined by a one-sided t-test with 50 trials, each with 750 samples (with replacement), and by paired bootstrap resampling with 5k resamples. Shuffled saliency-based systems perform comparably to or better than sent and 2sent in WSD metrics and translation quality. Because of the size of doc-MuCoW (20.5k sentences), large changes are unnecessary for statistical significance.
Saliency function $\phi_N^X$ | CXMI, $\texttt{X}_N$ | CXMI, $\texttt{X}_N^{\text{shuf}}$
$\phi_5^{\mathrm{tfidf}}$ | 0.002 | 0.002
$\phi_{10}^{\mathrm{tfidf}}$ | -0.012 | -0.015
$\phi_5^{\mathrm{yake}}$ | 0.003 | 0.000
$\phi_{10}^{\mathrm{yake}}$ | -0.022 | -0.021

Table 7: CXMI scores for each system show that introducing shuffling in training hurts confidence for almost all models while marginally improving confidence for $\texttt{yake}_{10}^{\text{shuf}}$.
System | P | R | F1 | BLEU | COMET
Baseline model
sent | 0.7850 | 0.6008 | 0.6807 | 21.8 | 0.782
Prefixed sent outputs
sent-tfidf5 | 0.7755 | 0.5971 | 0.6747 | 18.4 | 0.621
sent-tfidf10 | 0.7651 | 0.5973 | 0.6709 | 15.5 | 0.513
sent-yake5 | 0.7768 | 0.5988 | 0.6763 | 18.7 | 0.641
sent-yake10 | 0.7672 | 0.5995 | 0.6731 | 15.9 | 0.535

Table 8: Performance of various prefixed outputs of sent; P, R, and F1 are WSD metrics, and BLEU and COMET measure translation quality. Significant differences from sent at the 95% CI as determined by a one-sided t-test with 50 trials, each with 750 samples (with replacement), and by paired bootstrap resampling with 5k resamples. Prefixed sent models improve neither in WSD metrics nor in translation quality, illustrating the importance of training with salient context.