
Faster Minimum Bayes Risk Decoding with Confidence-based Pruning

Julius Cheng, Andreas Vlachos
Department of Computer Science and Technology
University of Cambridge
{jncc3,av308}@cam.ac.uk

Abstract

Minimum Bayes risk (MBR) decoding outputs the hypothesis with the highest expected utility over the model distribution for some utility function. It has been shown to improve accuracy over beam search in conditional language generation problems and especially neural machine translation, in both human and automatic evaluations. However, the standard sampling-based algorithm for MBR is substantially more computationally expensive than beam search, requiring a large number of samples as well as a quadratic number of calls to the utility function, limiting its applicability. We describe an algorithm for MBR which gradually grows the number of samples used to estimate the utility while pruning hypotheses that are unlikely to have the highest utility according to confidence estimates obtained with bootstrap sampling. Our method requires fewer samples and drastically reduces the number of calls to the utility function compared to standard MBR while being statistically indistinguishable in terms of accuracy. We demonstrate the effectiveness of our approach in experiments on three language pairs, using chrF++ and COMET as utility/evaluation metrics.

1 Introduction

Minimum Bayes risk (MBR) decoding (Bickel and Doksum, 1977; Goel and Byrne, 2000) has recently gained renewed attention as a decision rule for conditional sequence generation tasks, especially neural machine translation (NMT). In MBR, the sequence with the highest expected utility with respect to the model distribution is chosen as the output, where the utility is usually some measure of text similarity. This contrasts with the more commonly used maximum a posteriori (MAP) decision rule, which returns the sequence with the highest probability under the model. MAP is generally intractable, and beam search is typically used to find an approximation. MBR is likewise intractable, and Eikema and Aziz (2020) propose a sampling-based approximation algorithm.

MBR has been shown to outperform MAP beam search in both automatic and qualitative evaluation in a diverse range of tasks (Suzgun et al., 2023), including NMT (Freitag et al., 2022a) and code generation (Shi et al., 2022). MBR also generalizes other previously proposed decoding methods and explains their success (Bertsch et al., 2023).

The accuracy improvement from MBR comes at a heavy cost: the number of samples used can reach thousands (Freitag et al., 2023), and the number of calls to the utility function required is quadratic in the number of samples. Often, the utility function itself is a deep neural model, rendering MBR prohibitively expensive for many use cases.

In this work, we address the computational efficiency of MBR with an iterative pruning algorithm where low-performing hypotheses are removed while the number of samples used to estimate utilities grows. Hypotheses are pruned based on their estimated probability of being the true best hypothesis under the MBR objective, thus avoiding making expensive fine-grained utility estimates for hypotheses which are unlikely to be the final prediction.

In NMT experiments on three language pairs using chrF++ (Popović, 2015) and COMET (Rei et al., 2020) as MBR utility and evaluation metrics, we show that our method maintains the same level of accuracy as standard MBR while reducing the number of utility calls by a factor of at least 7 for chrF++ and 15 for COMET. Our algorithm can also use fewer samples to reach a prediction by terminating early, unlike standard MBR.

2 Minimum Bayes risk decoding

Conditional sequence generation problems such as neural machine translation (NMT) model the probability of the next token $y_t$ given a source sequence $x$ and prefix $y_{<t}$ with a neural network $p_\theta$. This model can be used to assign probabilities to full sequences $p_\theta(y|x)$ via the chain rule.

At test time, a decision rule is employed to select a single “best” sequence. The most common decision rule is to return the highest-probability sequence $y^{MAP} = \operatorname*{arg\,max}_{y} p_\theta(y|x)$. The exact solution is generally intractable, and beam search is used to find an approximation.

In contrast, minimum Bayes risk decoding (MBR) (Goel and Byrne, 2000) outputs:

y^{MBR} = \operatorname*{arg\,max}_{y} \mathbb{E}_{\bar{y}\sim p_\theta(\cdot|x)}[u(y,\bar{y})]
        = \operatorname*{arg\,max}_{y} U(y, p_\theta(\cdot|x)),   (1)

for some utility function $u$, a measure of similarity between sequences, and $U(y,\mathcal{Y}) = \mathbb{E}_{\hat{y}\sim\mathcal{Y}}[u(y,\hat{y})]$, where $\mathcal{Y}$ is either a probability distribution or an array of samples. We call $U$ the expected utility function. Eikema and Aziz (2020) propose a sampling method for neural language models where hypotheses $\mathcal{H}$ and pseudo-references $\mathcal{R}$ are generated with unbiased sampling, and $y^{MBR}$ is estimated as:

y^{MBR} \approx \operatorname*{arg\,max}_{y\in\mathcal{H}} U(y,\mathcal{R}).   (2)

This method, which we refer to as "standard" MBR, requires $|\mathcal{H}| + |\mathcal{R}|$ samples (assuming $\mathcal{H}\neq\mathcal{R}$) and $|\mathcal{H}||\mathcal{R}|$ calls to $u$. The latter is the main computational bottleneck which we address in this work. Recent works on MBR focus on identifying accurate and efficient generation methods (Eikema and Aziz, 2022; Freitag et al., 2023; Yan et al., 2022), and on pruning $\mathcal{H}$ to a smaller size with a faster method prior to running standard MBR (Eikema and Aziz, 2022; Fernandes et al., 2022).
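To make the quadratic cost concrete, the following minimal Python sketch implements standard sampling-based MBR as in Equation 2; the utility argument stands in for any pairwise metric such as chrF++ or COMET, and the function names are illustrative rather than taken from our released code.

from typing import Callable, Sequence

def standard_mbr(
    hypotheses: Sequence[str],
    pseudo_refs: Sequence[str],
    utility: Callable[[str, str], float],
) -> str:
    # Pick the hypothesis with the highest average utility over the pseudo-references.
    # This makes |H| * |R| utility calls, the bottleneck addressed in this work.
    best_hyp, best_score = None, float("-inf")
    for y in hypotheses:
        # U(y, R): Monte Carlo estimate of the expected utility of y.
        score = sum(utility(y, y_bar) for y_bar in pseudo_refs) / len(pseudo_refs)
        if score > best_score:
            best_hyp, best_score = y, score
    return best_hyp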

Freitag et al. (2022a) and Eikema and Aziz (2022) show that MBR is more effective relative to MAP when the utility metric has high segment-level accuracy as measured by Freitag et al. (2021). We therefore use COMET (Rei et al., 2020), one of the best available metrics for NMT, and chrF++ (Popović, 2015) as a simpler and faster yet reasonably good lexical metric. We heed the call of Freitag et al. (2022b) and do not use BLEU, which has been superseded by newer metrics as both an evaluation and a utility metric (Freitag et al., 2022a).

3 Confidence-based hypothesis pruning

Sampling-based MBR returns the hypothesis from a candidate set with the highest utility measured over pseudo-references sampled from the model. Speedups can be achieved if low-performing hypotheses are removed from consideration based on coarse utility estimates obtained from a subset of the pseudo-references. In other words, we can save time by not computing precise utility estimates for hypotheses which are unlikely to be chosen in the end.

We propose an iterative algorithm for MBR where the hypothesis set is gradually shrunk while the pseudo-reference list grows. The procedure is shown in Algorithm 1. We start with an initial hypothesis set $\mathcal{H}_1$, and at each time step $t$, a pruning function uses a pseudo-reference list $\mathcal{R}_t$ of size $r_t$ to select $\mathcal{H}_{t+1}\subseteq\mathcal{H}_t$. Once the maximum time step is reached or the current hypothesis set contains a single element, the algorithm terminates and returns the highest-utility hypothesis under all available pseudo-references. The size of $\mathcal{R}_t$ grows according to a pre-defined "schedule" $r_1,\ldots,r_T$.

Algorithm 1 Pruning MBR
Input: source sentence $x$; sample size schedule $r = r_1,\ldots,r_T$; expected utility function $U$; model parameters $\theta$; pruning function $\operatorname{prune}$; hypothesis generation function $\operatorname{gen}(x,\theta)$.
Output: an MBR prediction.
  $\mathcal{R}_0 \leftarrow [\ ]$
  $t \leftarrow 1$
  $\mathcal{H}_1 \leftarrow \operatorname{gen}(x,\theta)$
  while $t \leq T$ and $|\mathcal{H}_t| > 1$ do
      $\mathcal{R}_t \leftarrow \mathcal{R}_{t-1}$
      while $|\mathcal{R}_t| < r_t$ do
          append $\hat{y} \sim p_\theta(\cdot|x)$ to $\mathcal{R}_t$
      end while
      $\mathcal{H}_{t+1} \leftarrow \operatorname{prune}(\mathcal{H}_t, \mathcal{R}_t)$
      $t \leftarrow t + 1$
  end while
  return $\operatorname*{arg\,max}_{y\in\mathcal{H}_t} U(y, \mathcal{R}_{t-1})$
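The Python sketch below mirrors Algorithm 1. It assumes a sample_pseudo_ref function that draws one sample from $p_\theta(\cdot|x)$, an expected_utility function computing $U(y,\mathcal{R})$, and a prune function such as the confidence-based one of Algorithm 2 (a sketch of which follows that algorithm); these names are placeholders for illustration, not part of our released code.

from typing import Callable, List, Sequence

def pruning_mbr(
    hypotheses: List[str],
    schedule: Sequence[int],                 # r_1, ..., r_T, e.g. doubling up to 256
    sample_pseudo_ref: Callable[[], str],    # draws y_hat ~ p_theta(.|x)
    expected_utility: Callable[[str, Sequence[str]], float],  # U(y, R)
    prune: Callable[[List[str], List[str]], List[str]],
) -> str:
    refs: List[str] = []
    for r_t in schedule:
        if len(hypotheses) <= 1:
            break
        # Grow the pseudo-reference list to the size prescribed by the schedule.
        while len(refs) < r_t:
            refs.append(sample_pseudo_ref())
        # Remove hypotheses that are unlikely to be the true MBR winner.
        hypotheses = prune(hypotheses, refs)
    # Return the best remaining hypothesis under all pseudo-references drawn so far.
    return max(hypotheses, key=lambda y: expected_utility(y, refs))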

The goal of the pruning function is to exclude as many hypotheses as possible to reduce the number of utility calls made in the future, without excluding the true top-ranking MBR hypothesis $\operatorname*{arg\,max}_{y\in\mathcal{H}_t} U(y, p_\theta(\cdot|x))$, the true "winner".

We propose to prune hypotheses in $\mathcal{H}_t$ with low probability of being the true winner and to estimate this probability using nonparametric bootstrap resampling (Efron, 1979; Koehn, 2004): given an initial collection of i.i.d. samples $\mathcal{S}$ from an unknown distribution $\mathcal{X}$, our beliefs about the true value of any statistic $T(\mathcal{X})$ are represented by the distribution $p(T(\hat{\mathcal{S}}))$, where $\hat{\mathcal{S}} \sim \operatorname{boot}(\mathcal{S})$ and $\operatorname{boot}(\mathcal{S})$ returns a with-replacement resample of $\mathcal{S}$ of size $|\mathcal{S}|$.

In our case, we want to estimate the probability that $y$ is the true winner in $\mathcal{H}_t$:

p\big(\bigwedge_{\bar{y}\in\mathcal{H}_t} U(y, p_\theta(\cdot|x)) \geq U(\bar{y}, p_\theta(\cdot|x))\big),   (3)

which we estimate as the chance that $y$ wins in a bootstrap resample. Let $\hat{\mathcal{R}}_t \sim \operatorname{boot}(\mathcal{R}_t)$. Then the bootstrap estimator is:

\mathbb{E}_{\hat{\mathcal{R}}_t \sim \operatorname{boot}(\mathcal{R}_t)}\big[1\big(\bigwedge_{\bar{y}\in\mathcal{H}_t} U(y,\hat{\mathcal{R}}_t) \geq U(\bar{y},\hat{\mathcal{R}}_t)\big)\big].   (4)

This estimator can have high variance because the probability of $y$ winning in a bootstrap resample is very small when $\mathcal{H}_t$ is large, so instead we use the probability that $y$ outranks a particular $\bar{\bar{y}}\in\mathcal{H}_t$:

\mathbb{E}_{\hat{\mathcal{R}}_t \sim \operatorname{boot}(\mathcal{R}_t)}\big[1\big(U(y,\hat{\mathcal{R}}_t) \geq U(\bar{\bar{y}},\hat{\mathcal{R}}_t)\big)\big].   (5)

This statistic is invariant to the size of $\mathcal{H}_t$ because it only considers the utility estimates of $y$ and $\bar{\bar{y}}$. It is an upper bound of Equation 4 because the probability of $y$ winning against all $\bar{y}\in\mathcal{H}_t$ cannot be higher than the probability of it winning against a particular $\bar{\bar{y}}\in\mathcal{H}_t$. $\bar{\bar{y}}$ can be any element of $\mathcal{H}_t$, but we set it to the winner under $\mathcal{R}_t$, i.e. $\bar{\bar{y}} = \operatorname*{arg\,max}_{\bar{y}\in\mathcal{H}_t} U(\bar{y},\mathcal{R}_t)$, to achieve a tighter upper bound.

We propose to prune $y$ if its estimated probability of beating $\bar{\bar{y}}$ is less than $1-\alpha$, where $\alpha$ is a confidence threshold with $0 \leq \alpha \leq 1$. In summary, this procedure prunes some but not necessarily all $y\in\mathcal{H}_t$ which are estimated to have less than a $1-\alpha$ chance of being the true winner. Algorithm 2 shows the procedure in detail.

Note that bootstrapped utility estimates do not require additional utility function calls because they reuse utility values already computed over $\mathcal{H}_t, \mathcal{R}_t$. Also note that the bootstrap estimator is always biased because $\mathcal{R}_t$ is never equal to $p_\theta(\cdot|x)$, and the bias is worse when $\mathcal{R}_t$ is small. Nonetheless, we show empirically that bootstrapping is effective in our pruning algorithm for modest sizes of $\mathcal{R}_t$.

Algorithm 2 Confidence-based pruning function
Input: hypothesis set $\mathcal{H}$; pseudo-reference list $\mathcal{R}$; expected utility function $U$; confidence threshold $\alpha$; number of bootstrap samples $n$.
Output: a subset of $\mathcal{H}$.
  $\mathcal{H}_{new} \leftarrow \{\}$
  $\bar{\bar{y}} \leftarrow \operatorname*{arg\,max}_{\bar{y}\in\mathcal{H}} U(\bar{y},\mathcal{R})$
  for $i \leftarrow 1$ to $n$ do
      $\hat{\mathcal{R}}^i \leftarrow \operatorname{boot}(\mathcal{R})$
  end for
  for $y \in \mathcal{H}$ do
      $w \leftarrow \frac{1}{n}\sum_{i=1}^{n} 1\big(U(y,\hat{\mathcal{R}}^i) \geq U(\bar{\bar{y}},\hat{\mathcal{R}}^i)\big)$
      if $w > 1-\alpha$ then
          $\mathcal{H}_{new} \leftarrow \mathcal{H}_{new} \cup \{y\}$
      end if
  end for
  return $\mathcal{H}_{new}$
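A minimal NumPy sketch of the confidence-based pruning function, assuming the pairwise utilities between every hypothesis and pseudo-reference have already been computed and stored in a matrix (as noted above, the bootstrap reuses these cached values); variable names and the matrix layout are our own illustration.

import numpy as np

def prune_alpha(utility_matrix, alpha=0.99, n_boot=500, rng=None):
    # utility_matrix: array of shape (|H|, |R_t|) holding u(y, y_hat) for each pair.
    # Returns the indices of the hypotheses that survive pruning.
    rng = rng or np.random.default_rng()
    num_hyps, num_refs = utility_matrix.shape
    # y_bar_bar: the current winner under the full pseudo-reference list R_t.
    leader = int(np.argmax(utility_matrix.mean(axis=1)))
    # Draw n_boot with-replacement resamples of the pseudo-reference indices.
    boot_idx = rng.integers(0, num_refs, size=(n_boot, num_refs))
    # boot_utils[i, h] = U(y_h, R_hat^i), the expected utility under resample i.
    boot_utils = utility_matrix[:, boot_idx].mean(axis=2).T
    # w[h] = fraction of resamples in which hypothesis h ties or beats the leader.
    w = (boot_utils >= boot_utils[:, [leader]]).mean(axis=0)
    return np.flatnonzero(w > 1.0 - alpha)

Within Algorithm 1, this function would be applied to the rows of the cached utility matrix corresponding to the surviving hypotheses and the pseudo-references drawn so far.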

Another benefit of our pruning method compared to standard MBR is that it can terminate early if $\mathcal{H}_t$ has only one remaining hypothesis, reducing the total number of pseudo-references needed.

As a baseline pruning function for comparison, we rank each $y\in\mathcal{H}_t$ by $U(y,\mathcal{R}_t)$ and prune the bottom-$\beta$ proportion. At $\beta=0$, no pruning occurs and standard MBR decoding is recovered. We refer to this baseline as $\operatorname{prune}_\beta$ and our confidence-based method as $\operatorname{prune}_\alpha$.
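For completeness, a sketch of the $\operatorname{prune}_\beta$ baseline under the same cached-utility-matrix assumption as above; how the bottom-$\beta$ proportion is rounded is our own choice here.

import numpy as np

def prune_beta(utility_matrix, beta):
    # Keep the top (1 - beta) proportion of hypotheses by U(y, R_t); beta = 0 keeps all.
    num_hyps = utility_matrix.shape[0]
    n_keep = max(1, int(np.ceil((1.0 - beta) * num_hyps)))
    order = np.argsort(-utility_matrix.mean(axis=1))  # best hypotheses first
    return order[:n_keep]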

4 Experiments

We perform our experiments on NMT models which we train on the German-English (de-en), English-Estonian (en-et), and Turkish-English (tr-en) news datasets from WMT18. We use the data preprocessing steps provided by the WMT18 organizers, except that we exclude Paracrawl from the de-en dataset following Eikema and Aziz (2022). The final training sets have 5.8 million, 1.9 million, and 207 thousand sentence pairs respectively. All models are transformers of the base model size from Vaswani et al. (2017) and are trained without label smoothing until convergence.

For all language pairs and validation/test datasets, we generate $\mathcal{H}_1$ with beam top-$k$ with $k=256$ (different sequences may have the same detokenization result, so we have fewer than 256 hypotheses on average). We generate 1024 pseudo-references $\mathcal{R}^\ast$ with epsilon sampling (Hewitt et al., 2022) with $\epsilon=0.02$, following Freitag et al. (2023). In order to run multiple random trials efficiently, we simulate sampling pseudo-references by sampling from $\mathcal{R}^\ast$ without replacement. The experiments in Sections 4.1 and 4.2 show averaged results from 10 trials. We use chrF++ and COMET as utility functions and always match the evaluation metric to the utility. chrF++ is computed using SacreBLEU (Post, 2018) with default settings. COMET is computed with the COMET-22 model from Rei et al. (2022). We use 500 bootstrap samples for $\operatorname{prune}_\alpha$.

For the pseudo-reference sample size schedule $r_1,\ldots,r_T$, the choice of $r_1$ is pivotal in the speed-accuracy trade-off; $|\mathcal{H}_1||\mathcal{R}_1|$ is a lower bound on the number of utility calls needed, but the bootstrap estimate is more biased when the sample size is small. In a preliminary experiment on the validation set, we measure the "false pruning rate", the rate at which the estimated true winner $\operatorname*{arg\,max}_{\bar{y}\in\mathcal{H}} U(\bar{y},\mathcal{R}^\ast)$ is pruned under different choices of $\alpha$ and $|\mathcal{R}|$. Based on the results shown in Figure 1, we set $r_1$ to 8 for COMET and 16 for chrF++ for all experiments. $r_2,\ldots,r_T$ are set by doubling at each time step until reaching 256 (e.g. 8, 16, 32, 64, 128, 256 for COMET).

More experimental details and figures for language pairs not shown here are in the Appendix. Our code is publicly available at https://github.com/juliusc/pruning_mbr.

Figure 1: False pruning rates for different choices of $\alpha$ and $|\mathcal{R}|$ measured on the validation set.

4.1 Speed-accuracy trade-off

$\operatorname{prune}_\alpha$ and $\operatorname{prune}_\beta$ allow control over the speed-accuracy trade-off with a single parameter. We observe this trade-off over the validation set by comparing the number of utility function calls made against various measures of accuracy. A high evaluation score is the underlying goal, but we find that the score changes very little across settings, so we also evaluate pruning quality in terms of how well final predictions match those under standard MBR with $\mathcal{R}^\ast$. We use exact accuracy, whether the prediction $y$ equals $\operatorname*{arg\,max}_{\bar{y}\in\mathcal{H}} U(\bar{y},\mathcal{R}^\ast)$, and reciprocal rank (RR), equal to $(\sum_{\bar{y}\in\mathcal{H}} 1(U(\bar{y},\mathcal{R}^\ast) \geq U(y,\mathcal{R}^\ast)))^{-1}$, a soft accuracy measure adapted from the mean reciprocal rank used in search (Craswell, 2009). Figure 2 shows that $\operatorname{prune}_\alpha$ generally outperforms $\operatorname{prune}_\beta$ on all metrics.
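For reference, exact accuracy and reciprocal rank can be computed from the full-sample utility estimates as in this small sketch, where full_utils is assumed to hold $U(\bar{y},\mathcal{R}^\ast)$ for every $\bar{y}\in\mathcal{H}$ and pred_idx indexes the prediction being scored:

import numpy as np

def exact_accuracy_and_rr(full_utils, pred_idx):
    # full_utils: array of shape (|H|,) with U(y, R*) for each hypothesis.
    best_idx = int(np.argmax(full_utils))
    accuracy = float(pred_idx == best_idx)
    # Rank of the prediction: number of hypotheses scoring at least as high as it.
    rank = int(np.sum(full_utils >= full_utils[pred_idx]))
    return accuracy, 1.0 / rank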

Figure 2: Speed-accuracy trade-off curves of pruning functions for $\alpha\in\{0.8, 0.9, 0.95, 0.98, 0.99\}$ and $\beta\in\{0.05,\ldots,0.95\}$ on the de-en validation set. The x-axes are truncated for better visual comparison.

4.2 Test results

Metric: chrF++
                          de-en                          en-et                          tr-en
                 β=0     α=0.99   α=0.9       β=0     α=0.99   α=0.9       β=0     α=0.99   α=0.9
Score            62.18   62.19    62.11       46.47   46.48    46.44       42.63   42.63    42.64
Accuracy         0.749   0.736    0.648       0.741   0.728    0.639       0.761   0.751    0.660
RR               0.850   0.840    0.769       0.843   0.833    0.762       0.857   0.850    0.779
# Pseudo-refs    256.0   234.9    185.2       256.0   235.3    188.5       256.0   238.7    185.7
# Utility calls  65082   9542     5402        63898   9644     5386        65356   8761     5275

Metric: COMET
                          de-en                          en-et                          tr-en
                 β=0     α=0.99   α=0.9       β=0     α=0.99   α=0.9       β=0     α=0.99   α=0.9
Score            78.37   78.37    78.41       82.75   82.74    82.74       73.70   73.68    73.65
Accuracy         0.872   0.844    0.742       0.924   0.898    0.830       0.897   0.878    0.791
RR               0.930   0.911    0.838       0.959   0.943    0.898       0.944   0.932    0.873
# Pseudo-refs    256.0   200.3    120.4       256.0   151.4    82.6        256.0   177.8    100.8
# Utility calls  65082   3831     2533        63898   2813     2250        65356   3244     2394

Table 1: Statistics for MBR decoding on the test set for all language pair and metric settings. β=0 indicates standard MBR. All values are averaged across 10 random trials.

We evaluate our methods on the test set with $\alpha=0.99$ and $\alpha=0.9$ and compare them to standard MBR on the accuracy metrics described in Section 4.1 as well as the number of utility calls and pseudo-references used. Table 1 shows that across all language pairs and metrics, our method achieves similar evaluation scores as standard MBR while using much less computation, with the most dramatic case being en-et with COMET at $\alpha=0.9$, which uses 3.5% as many utility calls and 32% as many pseudo-references as the baseline.

When comparing $\alpha=0.99$ with $\alpha=0.9$, we see that while exact accuracy and RR differ, the evaluation score differs very little if at all, suggesting that high-ranking hypotheses are often about as good as one another, and that finely discriminating between them has diminishing returns.

4.3 Human evaluation

We confirm that our method is indistinguishable from standard MBR in human evaluation. On the de-en test set, for each instance, we sample 256 pseudo-references without replacement from $\mathcal{R}^\ast$ and use this pool to decode with both standard MBR and $\operatorname{prune}_\alpha$ with $\alpha=0.99$. 85% of the predictions are the same, and for the rest, we asked bilingual readers of German and English to state which prediction they preferred. We obtained 125 ratings. The standard MBR prediction won in 48 cases, lost in 42, and tied in 35. This difference does not reach significance under the test of Koehn (2004), so we conclude that $\operatorname{prune}_\alpha$ with $\alpha=0.99$ is not significantly different from standard MBR on de-en.

4.4 Run times

To measure realistic run times, we implement a practical version of our algorithm and compare it against standard MBR decoding and beam search. We run our algorithm with COMET as the utility function and $\alpha=0.99$. With COMET, sentence embeddings can be cached, which greatly speeds up utility function calls. Some of the theoretical gains seen in Section 4.2 are diminished in practice because our iterative algorithm divides computations over smaller batches. Our best implementation samples all 256 pseudo-references at once but generates pseudo-reference sentence embeddings as needed. We run each algorithm on a set of 100 randomly sampled test instances. Further implementation details are given in Appendix A.2. Table 2 provides a detailed breakdown of run times. As expected, our method achieves the greatest time savings on utility computation. However, it is still substantially slower than beam search.

Run time (seconds)
                              Prune MBR    Std. MBR    Beam
Generate $\mathcal{H}$          36.56        36.56     26.79
Generate $\mathcal{R}$          59.92        59.92
Embed sentences                100.50       108.33
Compute utilities               12.96       193.01
Bootstrap pruning                0.63
Total                          210.57       397.82     26.79

Table 2: Summary of run times for decoding methods on 100 sentences from the de-en test set for pruning MBR ($\alpha=0.99$), standard MBR, and beam search with beam size 10, averaged over 3 trials.

5 Conclusion

We propose an iterative pruning algorithm for MBR along with a pruning criterion based on confidence estimates derived from bootstrap resampling. In experiments across diverse language pairs and metrics, we show that our method consistently outperforms our proposed baseline and achieves significant computational savings over standard sampling-based MBR without sacrificing accuracy. Our method is a drop-in substitute for standard MBR that requires no knowledge about the model pθp_{\theta}, how 1\mathcal{H}_{1} is generated, or the utility function.

Limitations

Even with our pruning algorithm, MBR is many times more costly to run than beam search.

An important hyperparameter in our method is the sample size schedule. We show why it is important to carefully choose the size of the first sample, but not how the remaining schedule should be set, opting to simply double the size at each step. We leave this issue to future work.

Methods such as MBR and reranking that directly optimize a metric may exploit noise in the metric to improve the score without actually improving quality (Fernandes et al., 2022). In these settings, automatic evaluation is less trustworthy and should ideally be combined with human evaluation. However, human evaluation is difficult and expensive to obtain.

Acknowledgements

Julius Cheng is supported by a scholarship from Huawei. The authors would like to thank the bilingual readers who helped with the human evaluation and the anonymous reviewers for their helpful suggestions.

References

Appendix A Additional experimental details

A.1 All experiments

Preliminary experiments showed no significant difference between 500 and 1000 bootstrap samples when running pruneα\operatorname*{prune}_{\alpha}, so we use 500 for all experiments.

For efficiency, we use average sentence-level chrF++ instead of corpus-level chrF++ for corpus-level evaluations. This allows us to pre-compute the sentence-level chrF++ for each hypothesis and obtain the corpus-level score of a set of predictions by simple averaging.
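As an illustration of this shortcut, and assuming SacreBLEU's Python API (the exact calls below are our assumption rather than a description of our released code), per-hypothesis sentence-level chrF++ can be cached and averaged:

from sacrebleu.metrics import CHRF

# chrF++ corresponds to chrF with word n-grams up to order 2 (word_order=2).
chrf_pp = CHRF(word_order=2)

def sentence_chrf_pp(hypothesis, reference):
    return chrf_pp.sentence_score(hypothesis, [reference]).score

def corpus_score(predictions, references):
    # Average of cached sentence-level scores, used in place of corpus-level chrF++.
    scores = [sentence_chrf_pp(h, r) for h, r in zip(predictions, references)]
    return sum(scores) / len(scores)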

All experiments are implemented on top of our fork of Fairseq (Ott et al., 2019).

A.2 Run times

This section contains additional details for the experiment in Section 4.4.

For both the standard and pruning MBR algorithms, we deduplicate and cache computations whenever possible. For each unique pseudo-reference, its sentence embedding and utility scores against each $y\in\mathcal{H}_t$ are only computed once.

For simplicity, all decoding methods are run on one sequence at a time. Batching across sequences would likely affect the relative performance characteristics of each method.

All experiments are conducted on the same machine with one Nvidia Quadro RTX 8000 GPU.

Appendix B False pruning rates for en-et, tr-en

Figure 3 shows the false pruning rates for en-et and tr-en.

Figure 3: False pruning rates for different choices of $\alpha$ and $|\mathcal{R}|$ measured on the validation set.

Appendix C Hypotheses remaining per time step

Figure 4 shows the distribution of the number of remaining hypotheses after each time step when running our method on the de-en validation set, following the experimental setup of Section 4.1 where $|\mathcal{H}_1|=256$. This is provided to further illustrate the pruning process.

Figure 4: Number of remaining hypotheses after each time step while running the pruning MBR algorithm for various choices of $\alpha$ on the de-en validation set. The x-axis is the number of pseudo-references at a time step, and the y-axis is the number of hypotheses remaining after pruning. Colored bars show the mean, and error bars show the interquartile range.

Appendix D Speed-accuracy trade-off for en-et, tr-en

Figure 5: Speed-accuracy trade-off curves of pruning functions for $\alpha\in\{0.8, 0.9, 0.95, 0.98, 0.99\}$ and $\beta\in\{0.05,\ldots,0.95\}$ on the en-et validation set.
Figure 6: Speed-accuracy trade-off curves of pruning functions for $\alpha\in\{0.8, 0.9, 0.95, 0.98, 0.99\}$ and $\beta\in\{0.05,\ldots,0.95\}$ on the tr-en validation set.