Extending Word-Level Quality Estimation for Post-Editing Assistance
Abstract
We define a novel concept called extended word alignment in order to improve post-editing assistance efficiency. Based on extended word alignment, we further propose a novel task called refined word-level QE that outputs refined tags and word-level correspondences. Compared with original word-level QE, the new task directly points out editing operations, thus improving efficiency. To extract extended word alignment, we adopt a supervised method based on mBERT. To solve refined word-level QE, we first predict original QE tags by training regression models for sequence tagging based on mBERT and XLM-R, and then refine the original word tags with extended word alignment. In addition, we extract source-gap correspondences and thereby obtain gap tags. Experiments on two language pairs show the feasibility of our method and suggest directions for further improvement.
1 Introduction
Post-editing refers to the process of editing a rough machine-translated sentence (referred to as MT) into a correct one. Compared with conventional statistical machine translation (Koehn et al., 2003), neural machine translation (Cho et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017) can generate translations with high accuracy. However, Yamada (2019) suggested that there is no significant difference in cognitive load when post-editing an MT, even one of high quality. Therefore, effective post-editing assistance is strongly needed.
Traditional post-editing assistance methods leave room for improvement. A typical method is word-level QE (Specia et al., 2020), which predicts tags in the form of OK or BAD. However, such a dualistic judgement is not efficient enough because the meaning of BAD is ambiguous.


Word alignment has also proved helpful for post-editing assistance. Schwartz et al. (2015) demonstrated that displaying word alignment yields a statistically significant improvement in post-editing quality. However, unlike QE tags, word alignment cannot tell where translation errors are. Besides, it is non-trivial to extract word alignment between the source sentence and the MT. Schwartz et al. (2015) used a built-in function of Moses (Koehn et al., 2007), a decoder for statistical machine translation that is no longer suitable for neural models.
In this paper, we propose a novel concept called extended word alignment, which covers incorrect word translations and null alignment between a source sentence and its MT. We adopt a supervised method based on pre-trained language models to extract it. Based on extended word alignment, we further propose a novel task called refined word-level QE, which outputs refined tags including REP, INS, and DEL along with word-level correspondences. By referring to this information, post-editors can immediately see which operations (replacement, insertion, or deletion on the MT) to perform. We therefore believe that refined word-level QE can significantly improve post-editing assistance efficiency. Methodologically, we first predict original word tags by training regression models for sequence tagging based on architectures such as multilingual BERT (mBERT; Devlin et al., 2019) and XLM-RoBERTa (XLM-R; Conneau et al., 2020). Then, we refine the original word tags by incorporating extended word alignment in a rule-based manner. In addition, we adopt a method similar to the one for extended word alignment to extract source-gap correspondences and then determine gap tags.
We conduct experiments on En-De and En-Zh datasets. Results show that our method significantly outperforms the baseline. For En-De, our best model outperforms the baseline by 12.9% and 6.0% in mean F1 scores for source and MT word refined tags, respectively. For En-Zh, the margins reach 48.9% and 16.9%. Furthermore, we discuss the effectiveness and limitations of our method with specific cases.
2 Related Work
Word Alignment Extraction. Methods based on statistical models (Brown et al., 1993; Och and Ney, 2003; Dyer et al., 2013) were long the dominant approach to word alignment extraction. In recent years, neural methods have developed rapidly. Garg et al. (2019) obtained word alignment from the attention inside a transformer (Vaswani et al., 2017), but their method performs only on par with statistical tools like GIZA++ (Och and Ney, 2003). Dou and Neubig (2021) utilized multilingual BERT to extract contextual embeddings of all words and aligned them under an optimal transport constraint (Kusner et al., 2015). Nagata et al. (2020) used a pre-trained language model in a supervised manner and achieved a significant improvement over previous studies with only around 300 parallel sentence pairs for fine-tuning. In our work, we adapt their approach from ordinary word alignment to extended word alignment. Details are introduced in Section 4.1.
Word-Level QE. One conventional architecture for word-level QE is the LSTM-based predictor-estimator (Kim and Lee, 2016; Zhang and Weiss, 2016; Kim et al., 2017). More recent work (Wang et al., 2020) adopted newer architectures such as the transformer (Vaswani et al., 2017).
Among more modern methods, a typical example is QE BERT (Kim et al., 2019), an mBERT-based classification model with explicit gap tokens in the input sequence; however, we find that regression models with an adjustable threshold consistently outperform classification models and that explicit gap tokens harm final performance. A more recent study (Lee, 2020) adopted XLM-R rather than mBERT, but did not explain its strategy for determining a threshold.
All of the methods above require large-scale third-party parallel data for pre-training. In contrast, our method introduced in Section 4.2 achieves acceptable performance at a small cost.
Post-Editing User Interface. Nayek et al. (2015) depicted an interface where words that need editing are displayed in different colors. Schwartz et al. (2015) emphasized the importance of displaying word alignment. Neither interface indicates the correctness of the translations of MT words. Compared with them, the interface we envisage provides information about translation quality (correctness) as well as suggestions for specific post-editing operations.
3 Refined Word-Level QE for Post-Editing Assistance
3.1 Original Word-Level QE
According to Specia et al. (2020), word-level QE, shown in Figure 1(a), is a task that takes a source sentence and its machine-translated counterpart (MT) as input. It then outputs tags for source words, MT words, and gaps between MT words (MT gaps). (For convenience, source tags and MT word tags are collectively called word tags; MT word tags and MT gap tags are collectively called MT tags.) All of these tags are expressed as either OK or BAD, where BAD indicates potential translation errors that post-editors should correct. We refer to such a task as original word-level QE.
Original word-level QE is not efficient enough for post-editing assistance because BAD is ambiguous. For example, in Figure 1(a), the tag of “white” indicates a replacement of the mistranslation “黑” (black), whereas the tag of “dogs” indicates an insertion into the gap between “猫” and “吗”. It is impossible to distinguish between these indications unless one attends to both entire sentences, which defeats the purpose of post-editing assistance.
3.2 Extended Word Alignment
We formally define a novel concept called extended word alignment between the source sentence and the MT. Ordinary word alignment indicates word-to-word relations between a pair of semantically equivalent sentences in two languages, where in theory any word can be aligned with a semantically equivalent word on the other side. In contrast, extended word alignment takes translation errors in the MT into account. Specifically, a source word may be aligned with its mistranslation (a wrong word choice), and a word may be aligned with nothing, i.e., null-aligned.
3.3 Refined Word-Level QE
Extended word alignment can disambiguate BAD tags, overcoming the disadvantage of original word-level QE. When a BAD-tagged source word is aligned with a BAD-tagged MT word, it is clear that a replacement is needed. Likewise, a null-aligned BAD-tagged source word indicates an insertion, and a null-aligned BAD-tagged MT word indicates a deletion.
To make our idea more user-friendly, we formally propose a novel task called refined word-level QE by incorporating extended word alignment into original word-level QE. Besides extended word alignment, the following refined tags are included as objectives.
- REP is assigned to a source word and its mistranslation (wrong word choice) in MT, indicating a replacement.
- INS is assigned to a source word and the gap where its translation should be inserted, indicating an insertion.
- DEL is assigned to a redundant MT word, indicating a deletion.
In addition, we include correspondences between INS-tagged source words and MT gaps to express the insertion points. Those source-gap correspondences along with extended word alignment are collectively referred to as word-level correspondences.
4 Methodology
(Besides the method below, we also tried a unified model based on architectures like XLM-R to directly predict refined tags (OK/REP/INS/DEL) and word-level correspondences. However, due to the lack of training data and the complexity of the problem, the direct approach did not work well. Therefore, we adopted the present multi-phase approach.)
4.1 Extended Word Alignment Extraction

Extracting extended word alignment is non-trivial. Traditional unsupervised statistical tools (Och and Ney, 2003; Dyer et al., 2013) do not work well because they expect a semantically equivalent sentence pair as input. After trying several neural methods (Garg et al., 2019; Dou and Neubig, 2021), we empirically adopt the supervised method proposed by Nagata et al. (2020).
Specifically, extended word alignment extraction is cast as a cross-lingual span prediction problem, similar to the paradigm that applies BERT (Devlin et al., 2019) to SQuAD v2.0 (Rajpurkar et al., 2018), with mBERT as the basic architecture. Given a source sentence $X = x_1, \dots, x_n$ in which one word $x_i$ is enclosed by a special mark token, together with the MT $Y = y_1, \dots, y_m$, mBERT is trained to identify the span $y_j, \dots, y_k$ that is aligned with the marked source word $x_i$. Cross-entropy loss is adopted during training.
Because word alignment is symmetric, the same procedure is applied in the opposite direction. During testing, following Nagata et al. (2020), we recognize word pairs whose mean probability over the two directions is greater than 0.4 as valid word alignments. The model is illustrated in Figure 2.
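As a concrete illustration, below is a minimal sketch of this span-prediction scoring with symmetrization, assuming Huggingface's question-answering head on mBERT. The mark token, the best-span decoding shortcut, and the helper names are our assumptions, not the exact implementation of Nagata et al. (2020).

```python
# Hedged sketch: span-prediction-based alignment scoring in the spirit of
# Nagata et al. (2020). The QA head here is randomly initialized and would
# need fine-tuning on (pseudo) alignment data before use.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-multilingual-cased")

MARK = "¶"  # assumed special mark token enclosing the queried source word

def span_probability(src_tokens, i, mt_sentence):
    """Probability of the best MT span aligned to source word src_tokens[i]."""
    # Mark the i-th source word, then encode "marked source [SEP] MT".
    marked = src_tokens[:i] + [MARK, src_tokens[i], MARK] + src_tokens[i + 1:]
    enc = tokenizer(" ".join(marked), mt_sentence,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    start = torch.softmax(out.start_logits, dim=-1)
    end = torch.softmax(out.end_logits, dim=-1)
    # Simplification: best start prob * best end prob as the span probability,
    # without restricting positions to the MT segment.
    return (start.max() * end.max()).item()

def is_valid_alignment(p_src2mt, p_mt2src, threshold=0.4):
    # Symmetrization: keep a pair if the mean of the two directional
    # probabilities exceeds 0.4, following Nagata et al. (2020).
    return (p_src2mt + p_mt2src) / 2 > threshold
```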
Nagata et al. (2020) demonstrated that the mBERT-based method significantly outperforms statistical methods in ordinary word alignment extraction. According to them, extracting the alignment of each word independently is the key to outperforming other methods: traditional methods model word alignment as a joint distribution, so one incorrect alignment can trigger further incorrect alignments like dominoes. Our experiments show that their method consistently works for extended word alignment as well.
4.2 Original Word Tag Prediction

For original tags, we conduct sequence tagging with multilingual pre-trained language models, namely mBERT and XLM-R; Figure 3 illustrates the model. The input sequence is organized in the format “[CLS] source sentence [SEP] MT [SEP]” without any mark tokens. Two linear layers, each followed by a sigmoid function, transform the output vectors into scalar probabilities of being BAD, one for source tokens and one for MT tokens. Formally, for a source sentence $X = x_1, \dots, x_n$ and an MT $Y = y_1, \dots, y_m$, the total loss is the mean binary cross-entropy over all word tags:
$$\mathcal{L} = \frac{1}{n+m} \Big( \sum_{i=1}^{n} \mathrm{BCE}(\hat{p}_{x_i}, t_{x_i}) + \sum_{j=1}^{m} \mathrm{BCE}(\hat{p}_{y_j}, t_{y_j}) \Big),$$
where $\hat{p}$ is the predicted probability of BAD and $t$ is the reference tag.
We have also implemented our models with classification top-layers (i.e., a binary classification linear layer with softmax), but we find that regression models are consistently better, since a flexible threshold can offset the bias caused by the imbalance of reference tags.
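The sketch below shows the shape of such a regression tagger, assuming separate per-token heads for the source and MT sides; the class name and mask convention are illustrative, not the authors' exact implementation.

```python
# Hedged sketch of the regression sequence tagger: a shared multilingual
# encoder with per-token linear + sigmoid heads producing P(BAD).
import torch
import torch.nn as nn
from transformers import AutoModel

class RegressionTagger(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.src_head = nn.Linear(hidden, 1)  # BAD probability, source tokens
        self.mt_head = nn.Linear(hidden, 1)   # BAD probability, MT tokens

    def forward(self, input_ids, attention_mask, is_mt_token):
        # is_mt_token: bool mask marking MT-side positions in the packed
        # "[CLS] source [SEP] MT [SEP]" input.
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        logits = torch.where(is_mt_token.unsqueeze(-1),
                             self.mt_head(h), self.src_head(h)).squeeze(-1)
        return torch.sigmoid(logits)  # P(BAD) per token

# Training would use mean binary cross-entropy over word positions, e.g.:
# loss = nn.functional.binary_cross_entropy(probs[word_mask], gold[word_mask])
```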
4.3 Word Tag Refinement and Gap Tag Prediction

We use extended word alignment to refine the original word tags. Following the rules described in Section 3.3, we refine word tags as shown in Figure 4. In practice, some BAD-tagged words turn out to be aligned with OK-tagged words; in that case, we change the OK tag into BAD, encouraging the generation of more REP tags.
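A minimal sketch of this rule-based refinement, under assumed list/set data structures, might look as follows.

```python
# Hedged sketch of word tag refinement following Section 3.3.
def refine_tags(src_tags, mt_tags, alignment):
    """src_tags/mt_tags: lists of 'OK'/'BAD'; alignment: set of (i, j) pairs
    from the extended word alignment (source index i, MT index j)."""
    refined_src, refined_mt = list(src_tags), list(mt_tags)
    aligned_src = {i for i, _ in alignment}
    aligned_mt = {j for _, j in alignment}
    for i, j in alignment:
        if src_tags[i] == "BAD" or mt_tags[j] == "BAD":
            # An aligned pair with at least one BAD side becomes a replacement;
            # an OK tag on the other side is overridden, as described above.
            refined_src[i], refined_mt[j] = "REP", "REP"
    for i, tag in enumerate(src_tags):
        if tag == "BAD" and i not in aligned_src:
            refined_src[i] = "INS"  # null-aligned BAD source word: insertion
    for j, tag in enumerate(mt_tags):
        if tag == "BAD" and j not in aligned_mt:
            refined_mt[j] = "DEL"   # null-aligned BAD MT word: deletion
    return refined_src, refined_mt
```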

For gap tags, we adopt a method similar to the one described in Section 4.1. Specifically, we model source-gap correspondences as alignment between source words and MT gaps: we train a model that aligns an INS-tagged source word with the two-word span in the MT that surrounds the corresponding gap. Figure 5 illustrates this idea. During testing, when a valid source-gap correspondence is confirmed, we tag the MT gap as INS. (As for the source words involved, we do not change their tags and instead trust the refinement based on extended word alignment, because we believe extended word alignment is easier to model.)
It would be natural to treat this as a downstream task, determining gap tags from the INS-tagged source words predicted in the previous step. However, in experiments, we noticed that the absolute accuracy of INS-tagged source words is not high. To avoid being influenced by earlier wrong predictions, we conduct this task independently instead.
5 Experiment
| | En-De Source MCC | En-De MT MCC | En-Zh Source MCC | En-Zh MT MCC |
|---|---|---|---|---|
| OpenKiwi | 0.266 | 0.358 | 0.248 | 0.520 |
| mBERT-cls | 0.314 | 0.419 | 0.309 | 0.555 |
| mBERT | 0.340 | 0.457 | 0.357 | 0.570 |
| XLM-R-cls | 0.326 | 0.446 | 0.330 | 0.579 |
| XLM-R | 0.345 | 0.453 | 0.354 | 0.592 |
| WMT20 Top | 0.523 (Wang et al., 2020) | 0.597 (Lee, 2020) | 0.336 (Rubino, 2020) | 0.610 (Hu et al., 2020) |

Table 1: MCC of original word tags on the WMT20 En-De and En-Zh test sets.

5.1 Data and Experimental Setups
We make full use of the En-De and En-Zh datasets from the WMT20 shared task on original word-level QE (http://www.statmt.org/wmt20/quality-estimation-task.html).
There are 7,000, 1,000, and 1,000 sentence pairs with tag annotations for the training, development, and test sets, respectively.
Since the original datasets do not contain refined objectives, we additionally annotate the original development sets with all the objectives for refined word-level QE.
Those annotated 1,000 pairs are further divided into 200 pairs for evaluation and 800 pairs for fine-tuning.
| Extended Word Align. | Source-Gap Corr. | En-De (F1/P/R) | En-Zh (F1/P/R) |
|---|---|---|---|
| FastAlign | mBERT | 0.828/0.812/0.844 | 0.739/0.773/0.709 |
| AWESoME | mBERT | 0.891/0.915/0.868 | 0.814/0.871/0.764 |
| mBERT | mBERT | 0.895/0.917/0.875 | 0.836/0.888/0.790 |
| ft-mBERT | ft-mBERT | 0.916/0.913/0.918 | 0.888/0.887/0.889 |

Table 2: F1/precision/recall of word-level correspondences.
All experiments are conducted with modified scripts from transformers v3.3.1 (https://github.com/huggingface/transformers) on an NVIDIA TITAN RTX (24GB) with CUDA 10.1.
For pre-trained models, we use bert-base-multilingual-cased for mBERT and xlm-roberta-large for XLM-R from Huggingface.
To train the model for original tags described in Section 4.2, we use the 7,000-pair training set provided by WMT20. The 800 manually annotated pairs, with their refined tags collapsed back into original tags, are used for further training. The learning rate is set to 3e-5 for mBERT and 1e-5 for XLM-R, and both models are trained for 5 epochs. All other configurations remain at their defaults.
To train the models for extended word alignment extraction described in Section 4.1, we utilize AWESoME (Dou and Neubig, 2021) to generate pseudo alignment data from the 7,000-pair WMT20 training set. We also use the extra 800 sentence pairs of annotated alignment data for fine-tuning. Models are pre-trained for 2 epochs and fine-tuned for 5 epochs with a learning rate of 3e-5. Most configurations remain at their defaults, but max_seq_length and max_ans_length are set to 160 and 15, following Nagata et al. (2020).
To train the model for source-gap correspondence extraction described in Section 4.3, we similarly start from the 7,000-pair WMT20 training set and generate pseudo data by randomly dropping target words in the PE (the post-edited sentence provided in the WMT20 dataset, regarded as the correct translation). We then link the gaps where words were dropped to their source counterparts according to the source-PE alignment extracted by AWESoME. In addition, 800 sentence pairs of gold source-gap correspondences are used for fine-tuning. All model configurations and training settings are identical to those of the extended word alignment model.
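The following is a sketch of this pseudo-data generation under assumed data structures; the drop rate and the gap-indexing convention (gap k precedes token k) are our assumptions.

```python
# Hedged sketch: build pseudo source-gap training data by dropping PE words
# and linking the resulting gaps to the dropped words' source counterparts.
import random

def make_pseudo_gap_data(src_tokens, pe_tokens, src_pe_alignment, drop_rate=0.15):
    """src_pe_alignment: set of (i, j) pairs linking source word i to PE word j.
    Returns the corrupted PE and (source index, gap index) correspondences."""
    dropped = {j for j in range(len(pe_tokens)) if random.random() < drop_rate}
    corrupted, old2new = [], {}
    for j, tok in enumerate(pe_tokens):
        if j not in dropped:
            old2new[j] = len(corrupted)
            corrupted.append(tok)
    # Gap k sits immediately before corrupted[k]; a dropped word leaves a gap
    # at the position of the next surviving word (or at the sentence end).
    gap_links = []
    for i, j in src_pe_alignment:
        if j in dropped:
            nxt = next((old2new[t] for t in range(j + 1, len(pe_tokens))
                        if t not in dropped), len(corrupted))
            gap_links.append((i, nxt))  # source word i aligns to gap nxt
    return corrupted, gap_links
```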
5.2 Experimental Results
5.2.1 Evaluation of Original Tags
We first compare our performance with that of other WMT20 participants. We therefore use the identical test sets for evaluation and train our models here only on data from the original WMT20 training set. Following WMT20 (Specia et al., 2020), we adopt the Matthews correlation coefficient (MCC) as the metric. From the perspective of competition, we make every effort to boost performance; thus we set all gap tags to OK rather than predicting them, as we find this strategy yields the best MCC. The results are shown in Table 1.
In general, pre-trained language models consistently outperform the baseline, an LSTM-based predictor-estimator implemented with OpenKiwi. For En-De, our best source and MT MCC would have ranked sixth on the WMT20 leaderboard. For En-Zh, our best source and MT MCC would have ranked first and second, respectively.
It is also noteworthy that the regression models consistently outperform the classification models (marked with the suffix “-cls”). For the regression models, we search for an optimized threshold that maximizes the sum of source and MT MCC on the development set and apply it to the test set to determine tags. To rule out artifacts of a single optimized threshold, we also draw ROC curves and report AUC in Figure 6; the results demonstrate that our regression models based on mBERT and XLM-R statistically significantly outperform the baseline.
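A sketch of this threshold search might look as follows; the grid resolution is an assumption, and matthews_corrcoef is sklearn's implementation of MCC.

```python
# Hedged sketch: grid search for the decision threshold that maximizes the
# summed source + MT MCC on the development set.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def search_threshold(src_probs, src_gold, mt_probs, mt_gold):
    """probs: P(BAD) per word; gold: 1 for BAD, 0 for OK."""
    best_t, best_score = 0.5, -2.0
    for t in np.linspace(0.01, 0.99, 99):
        score = (matthews_corrcoef(src_gold, src_probs > t)
                 + matthews_corrcoef(mt_gold, mt_probs > t))
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```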
For En-De, Wang et al. (2020) and Lee (2020) both used large-scale third-party data: Wang et al. (2020) used parallel data from the WMT20 news translation task to pre-train a predictor, and Lee (2020) generated 11 million pairs of pseudo QE data from 23 million sentence pairs. Besides the top two, the third-ranked system (Rubino, 2020) was also pre-trained on 5 million sentence pairs yet obtained only 0.357 and 0.485, respectively. We therefore believe that we achieve acceptable performance at very small cost.
5.2.2 Evaluation of Word-Level Correspondences
We evaluate extended word alignment and source-gap correspondences jointly as word-level correspondences; the results are shown in Table 2. The two baselines (“FastAlign” and “AWESoME”) cannot predict source-gap correspondences since they are designed for ordinary word alignment, so for a fair comparison we combine their extended word alignment with the source-gap correspondences predicted by “mBERT”. All predictions are evaluated by F1 score as well as precision and recall.
Neural methods significantly outperform the statistical “FastAlign”. The gaps of 0.4% for En-De and 2.2% for En-Zh between “AWESoME” and “mBERT” are not significant, but they may imply that pre-trained language models like mBERT can filter the noise in pseudo data and produce high-quality word-level correspondences. Additionally, the better performance of the fine-tuned mBERT indicates that the upper bound could be higher if more annotated data were available.
5.2.3 Evaluation of Refined Tags
En-De:

| Extended Word Align. | Original Tags | Source: mean (OK/REP/INS) | MT: mean (OK/REP/DEL/INS) |
|---|---|---|---|
| FastAlign | OpenKiwi | 0.626 (0.696/0.492/0.174) | 0.767 (0.847/0.477/0.124/0.156) |
| AWESoME | OpenKiwi | 0.708 (0.781/0.549/0.373) | 0.807 (0.879/0.548/0.395/0.156) |
| mBERT | mBERT | 0.739 (0.825/0.540/0.421) | 0.820 (0.895/0.544/0.389/0.156) |
| mBERT | XLM-R | 0.709 (0.781/0.548/0.410) | 0.809 (0.879/0.522/0.415/0.156) |
| ft-mBERT | rt-mBERT | 0.755 (0.850/0.538/0.400) | 0.827 (0.904/0.535/0.347/0.175) |
| ft-mBERT | rt-XLM-R | 0.685 (0.748/0.544/0.431) | 0.805 (0.871/0.538/0.580/0.175) |

En-Zh:

| Extended Word Align. | Original Tags | Source: mean (OK/REP/INS) | MT: mean (OK/REP/DEL/INS) |
|---|---|---|---|
| FastAlign | OpenKiwi | 0.360 (0.379/0.280/0.071) | 0.728 (0.781/0.276/0.173/0.042) |
| AWESoME | OpenKiwi | 0.371 (0.391/0.285/0.066) | 0.733 (0.786/0.280/0.202/0.042) |
| mBERT | mBERT | 0.836 (0.914/0.446/0.020) | 0.891 (0.947/0.441/0.316/0.042) |
| mBERT | XLM-R | 0.843 (0.929/0.410/0.018) | 0.895 (0.955/0.402/0.275/0.042) |
| ft-mBERT | rt-mBERT | 0.848 (0.929/0.447/0.034) | 0.897 (0.954/0.441/0.284/0.042) |
| ft-mBERT | rt-XLM-R | 0.849 (0.928/0.451/0.028) | 0.897 (0.955/0.446/0.289/0.042) |

Table 3: Weighted mean F1 of refined tags, with per-tag F1 in parentheses.
As introduced, we combine the prediction of extended word alignment and original word tags to get refined word tags. (While predicting the original tags, we do not directly use the optimized threshold determined in Section 5.2.1, since the test set here originates from the original development set. Instead, we take the original WMT20 test set for development purposes and re-search an optimized threshold on it.) Moreover, we deduce gap tags from source-gap correspondences. The origin of the source-gap correspondences is kept consistent with Table 2 according to the extended word alignment. For the baseline, combinations of FastAlign/AWESoME and OpenKiwi are adopted. As the metric, we use the F1 score of each tag type along with a weighted mean of those F1 scores, taking the proportion of each tag in the reference as the weight. The results are shown in Table 3.
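For reference, this weighted mean corresponds to sklearn's “weighted” F1 averaging; a sketch, assuming flat lists of gold and predicted tags, is:

```python
# Hedged sketch of the evaluation metric: per-tag F1 plus a mean weighted by
# each tag's frequency in the reference (sklearn's "weighted" average).
from sklearn.metrics import f1_score

def refined_tag_score(gold, pred, labels=("OK", "REP", "INS", "DEL")):
    per_tag = f1_score(gold, pred, labels=list(labels), average=None,
                       zero_division=0)
    weighted = f1_score(gold, pred, labels=list(labels), average="weighted",
                        zero_division=0)
    return weighted, dict(zip(labels, per_tag))
```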
Our best model outperforms the baseline by 12.9% and 6.0% in mean F1 scores on source and MT refined tags, respectively, in the En-De experiments. In the En-Zh experiments, the mean F1 scores improve by 48.9% and 16.9%.
We also notice that although the fine-tuned mBERT extracts extended word alignment with good accuracy, the absolute values of refined tag accuracy remain unsatisfactory (especially for INS and DEL). We discuss this in the next section.
6 Discussion on Specific Cases




6.1 Discussion on Refined Word Tags
In Figure 7(a), our system largely succeeds in detecting errors caused by incorrect punctuation: it correctly suggests replacements for the second comma and the half-width period, and as for the first comma, the translation remains natural and acceptable if we delete it following the system's suggestion. Moreover, our system successfully detects the mistranslations of “passes” and “touchdowns”. In the MT, these football terms are translated as “通行证” (a pass to enter somewhere) and “摔倒” (falling down), respectively. Notably, these two mistranslations are not revised in the post-edited corpus provided by WMT21, which implies that our system performs surprisingly well, detecting mistranslations that even human annotators missed.
In Figure 7(b), our system again detects the incorrect use of half-width punctuation. However, compared with the reference, “abdominal aneurysm” is mistranslated, and our model fails to detect it because both words are tagged OK during original tag prediction. A premature OK prediction prevents a word from being refined into REP/INS/DEL later.
We believe an inappropriate threshold is the main cause of this issue. The predicted probabilities of “腹部” and “动脉瘤” are 0.103 and 0.134, respectively, but the optimized threshold is 0.88, as it was searched to maximize the MCC over the whole set. Meanwhile, the probabilities of all other OK-tagged MT words are below 0.01. Consequently, a threshold between 0.01 and 0.10 for this particular sentence pair would have produced the perfect result. In the future, we plan to investigate methods that determine a fine-grained optimized threshold for each sentence pair.
6.2 Discussion on Gap Tags
Figure 7(c) shows a typical En-De case that our model handles well. In German, it is more natural to express actions that took place in the past in the perfect tense rather than the simple past; in this case, the English verb “drafted” should be rendered as “haben … ausgewählt”. Our model correctly suggests a correspondence between “drafted” and the MT gap in front of the period. Since many cases require a similar modification, inserting a particular word (such as a particle or a clause-final infinitive) before the period in the MT, it is easy for our model to learn such patterns, which probably explains the relatively good accuracy of INS in the En-De experiments.
In contrast, Figure 7(d) is an En-Zh example showing that our model tends to align many source words with the gap right before or after their translation in the MT, even when the translation is correct and needs no extra insertion. The word “dissected” is unnecessarily aligned with the gap around its translation “解剖”, and two person names are also unnecessarily aligned with gaps; as a result, four gaps are incorrectly tagged as INS. Examining the annotated dataset, we noticed that many Chinese words in the MT are slightly modified by adding prefixes or suffixes during post-editing. For example, “成年 海龟” (adult sea turtle) is modified to “成年的海龟” (adding “的” as an adjectival suffix), and “演讲” (the speech) is modified to “这一演讲” (emphasizing “this” speech). Such modifications are generally unnecessary given the flexibility of Chinese grammar, but their presence may mislead the model into unnecessarily aligning a word with the gap around its translation, as in Figure 7(d). To address this issue, we plan to tighten the annotation rules to exclude such inessential modifications from the En-Zh training data in the future.
7 Conclusion and Future Work
To improve post-editing assistance efficiency, we define a novel concept called extended word alignment. By incorporating extended word alignment into original word-level QE, we formally propose a novel task called refined word-level QE. To solve the task, we first adopt a supervised method to extract extended word alignment and predict original tags with pre-trained language models via sequence tagging; we then refine the word tags with the extended word alignment. Additionally, we extract source-gap correspondences and determine gap tags. We conduct experiments and discuss specific cases.
In the future, we would like to improve our work from the following perspectives. First, we want to develop methods that determine fine-grained thresholds, as elaborated in Section 6. Moreover, we plan to conduct a human evaluation to demonstrate the superiority of refined word-level QE in terms of post-editing assistance efficiency.
References
- Brown et al. (1993) Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
- Cho et al. (2014) K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. EMNLP, pages 1724–1734.
- Conneau et al. (2020) A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proc. 58th ACL, pages 8440–8451.
- Devlin et al. (2019) J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. 17th NAACL-HLT, pages 4171–4186.
- Dou and Neubig (2021) Z. Dou and G. Neubig. 2021. Word alignment by fine-tuning embeddings on parallel corpora. In Proc. 16th EACL, pages 2112–2128.
- Dyer et al. (2013) C. Dyer, V. Chahuneau, and N. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proc. 11th NAACL-HLT, pages 644–648.
- Garg et al. (2019) S. Garg, S. Peitz, U. Nallasamy, and M. Paulik. 2019. Jointly learning to align and translate with transformer models. In Proc. EMNLP-IJCNLP, pages 4453–4462.
- Herbig et al. (2020) N. Herbig, T. Düwel, S. Pal, K. Meladaki, M. Monshizadeh, A. Krüger, and J. van Genabith. 2020. MMPE: A multi-modal interface for post-editing machine translation. In Proc. 58th ACL, pages 327–334.
- Hu et al. (2020) C. Hu, H. Liu, K. Feng, C. Xu, N. Xu, Z. Zhou, S. Yan, Y. Luo, C. Wang, X. Meng, T. Xiao, and J. Zhu. 2020. The NiuTrans system for the WMT20 quality estimation shared task. In Proc. 5th WMT, pages 1018–1023.
- Jamara et al. (2021) R. A. Jamara, N. Herbig, A. Krüger, and J. van Genabith. 2021. Mid-air hand gestures for post-editing of machine translation. In Proc. 59th ACL, pages 6763–6773.
- Kim and Lee (2016) H. Kim and J. Lee. 2016. Recurrent neural network based translation quality estimation. In Proc. 1st WMT, pages 787–792.
- Kim et al. (2017) H. Kim, J. Lee, and S. Na. 2017. Predictor-estimator using multilevel task learning with stack propagation for neural quality estimation. In Proc. 2nd WMT, pages 562–568.
- Kim et al. (2019) H. Kim, J. Lim, H. Kim, and S. Na. 2019. QE BERT: Bilingual BERT using multi-task learning for neural quality estimation. In Proc. 4th WMT, pages 85–89.
- Koehn et al. (2007) P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. 45th ACL, pages 177–180.
- Koehn et al. (2003) P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proc. HLT-NAACL, pages 127–133.
- Kusner et al. (2015) M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. 2015. From word embeddings to document distances. In Proc. 32nd ICML, pages 957–966.
- Lee (2020) D. Lee. 2020. Two-phase cross-lingual language model fine-tuning for machine translation quality estimation. In Proc. 5th WMT, pages 1024–1028.
- Nagata et al. (2020) M. Nagata, K. Chousa, and M. Nishino. 2020. A supervised word alignment method based on cross-language span prediction using multilingual BERT. In Proc. EMNLP, pages 555–565.
- Nayek et al. (2015) T. Nayek, S. K. Naskar, S. Pal, M. Zampieri, M. Vela, and J. van Genabith. 2015. CATaLog: New approaches to TM and post editing interfaces. In Proc. Workshop NLP4TM, pages 36–42.
- Och and Ney (2003) F. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
- Rajpurkar et al. (2018) P. Rajpurkar, R. Jia, and P. Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proc. 56th ACL, pages 784–789.
- Rubino (2020) R. Rubino. 2020. NICT Kyoto submission for the WMT'20 quality estimation task: Intermediate training for domain and task adaptation. In Proc. 5th WMT, pages 1042–1048.
- Schwartz et al. (2015) L. Schwartz, I. Lacruz, and T. Bystrova. 2015. Effects of word alignment visualization on post-editing quality & speed. In Proc. MT Summit XV.
- Specia et al. (2020) L. Specia, F. Blain, M. Fomicheva, E. Fonseca, V. Chaudhary, F. Guzmán, and A. Martins. 2020. Findings of the WMT 2020 shared task on quality estimation. In Proc. 5th WMT, pages 741–762.
- Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. Le. 2014. Sequence to sequence learning with neural networks. In Proc. 27th NIPS, pages 3104–3112.
- Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Proc. 31st NIPS, pages 5998–6008.
- Wang et al. (2020) M. Wang, H. Yang, H. Shang, D. Wei, J. Guo, L. Lei, Y. Qin, S. Tao, S. Sun, Y. Chen, and L. Li. 2020. HW-TSC’s participation at WMT 2020 automatic post editing shared task. In Proc. 5th WMT, pages 1054–1059.
- Yamada (2019) M. Yamada. 2019. The impact of google neural machine translation on post-editing by student translators. The Journal of Specialised Translation, 31:87–106.
- Zhang and Weiss (2016) Y. Zhang and D. Weiss. 2016. Stack-propagation: Improved representation learning for syntax. In Proc. 54th ACL, pages 1557–1566.