
How should human translation coexist with NMT? Efficient tool for building high quality parallel corpus

Chanjun Park1, Seolhwa Lee2, Hyeonseok Moon1, Sugyeong Eo1, Jaehyung Seo1, Heuiseok Lim1†

1 Korea University, {bcj1210, glee889, djtnrud, seojae777, limhseok}@korea.ac.kr
2 University of Copenhagen, [email protected]
Abstract

This paper proposes a tool for efficiently constructing high-quality parallel corpora while minimizing human labor, and makes this tool publicly available. The proposed construction process is based on neural machine translation (NMT), allowing it not only to coexist with human translation but also to improve the efficiency of that translation by combining data quality control with human translation in a data-centric approach.

1 Introduction

Building a high-quality parallel corpus, in which each target sentence precisely translates its source sentence and vice versa, is an important issue throughout the field of machine translation. Unfortunately, obtaining a high-quality parallel corpus is difficult for many reasons, including copyright acquisition, the difficulty of finding proper alignments, and the high monetary and temporal cost of building the corpus [6]. Human translation is fundamentally the most trusted approach for improving data quality, and it can produce a high-quality parallel corpus [5, 10]. However, even this approach is limited, as manually constructing an entire corpus requires a tremendous amount of money and time. To alleviate this limitation, we present a novel tool for constructing high-quality parallel corpora using only a simple monolingual corpus.

In detail, we divided the generic process of constructing a high-quality parallel corpus into two components. First, a data-advancing automation approach is employed to advance the source-language data (i.e., the initial monolingual corpus) using corpus filtering [4] and Grammar Error Correction (GEC) [12]. Subsequently, the quality of the translated target is improved both by the advanced monolingual corpus and by Automatic Post Editing (APE) [1]. Second, we combine a predicting automation approach (i.e., prediction of data quality) with human translation to minimize the human labor required to complete the task. The predicting automation approach employs Quality Estimation [3] to predict the sentence-level quality of the parallel corpus, and the resulting labels are used to measure the human labor cost.
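The two components above can be sketched as a single pipeline. Every function here is a hypothetical stand-in (the paper's actual models are described in Section 2); the sketch only shows how filtering, GEC, NMT, APE, and QE compose:

```python
def build_parallel_corpus(mono_corpus, filter_fn, gec_fn,
                          translate_fn, ape_fn, qe_fn):
    """Sketch of the two-component process: advance the source, translate it,
    post-edit the target, and score each pair with quality estimation."""
    # Component 1: data advancing (corpus filtering + GEC on the mono corpus).
    advanced = [gec_fn(s) for s in mono_corpus if filter_fn(s)]
    # Translate the advanced source and correct the target with APE.
    pairs = [(src, ape_fn(src, translate_fn(src))) for src in advanced]
    # Component 2: predict the quality of each pair for labor costing.
    return [(src, tgt, qe_fn(src, tgt)) for src, tgt in pairs]
```

Any concrete filtering, GEC, NMT, APE, or QE system can be plugged in for the corresponding callable.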

Thus, the limitations associated with generating a parallel corpus can be alleviated if the computer automatically determines the quality of the corpus according to a specified threshold. That is, human labor is unnecessary when the threshold is exceeded; if the threshold is not exceeded, the corpus requires refinement, i.e., verification and post-processing conducted by humans. Overall, our approach contributes to both the human translation market and the automated machine translation field by improving efficiency and minimizing cost.
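The threshold decision above amounts to a one-line predicate. The default threshold value here is illustrative, not from the paper:

```python
def needs_human_refinement(quality_score: float, threshold: float = 0.8) -> bool:
    """Return True when the estimated quality falls below the threshold,
    i.e., the sentence pair must be verified and post-processed by a human."""
    return quality_score < threshold
```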

Figure 1: Overall process of building the high-quality parallel corpus based on our proposed tool.

2 Data Construction Process and Tool

Process

The data construction process, which builds a high-quality parallel corpus solely from a monolingual corpus, is illustrated in Figure 1.

In Stage 1, corpus filtering [4] and grammar error correction [12] are conducted to ensure the quality of the monolingual corpus; quality improvements on the source data are thus made automatically. In Stage 2, the refined monolingual corpus is translated (i.e., into the target) by the NMT model. Any well-performing NMT model can be utilized, whether an in-house model or a commercial translation system. Through this process, the primary pseudo-parallel corpus is constructed. In Stage 3, the Automatic Post Editing (APE) system corrects errors that exist in the primary pseudo-parallel corpus [2]. This phase enhances the quality of the parallel corpus, particularly the translation results on the target side.

Stage 4 predicts the machine translation quality of the parallel corpus through the Quality Estimation (QE) model [11]. Automatic labeling of sentence quality (i.e., level) is performed based on Pearson's correlation, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE), which are sentence-level performance evaluation measures. The average value of the three metrics is selected as the final quality value. Stage 5 determines the quality level of the given inspection target using the score obtained from Stage 4. The continuous score is quantized into three levels (High, Middle, and Low), and each sentence pair is classified accordingly. We implemented a heuristic logic-based decision criterion that grouped sentences into those with scores over 20% (High), under 20% (Low), and between these values (Middle). The High level is regarded as a high-quality parallel corpus and can be used immediately, without modification. Sentence pairs in the Middle and Low levels are assigned to a human translation supervisor to enhance their quality. The price of the editing labor can be estimated from the quality level determined by the QE model. Therefore, the user decides whether to use each sentence as corpus data or to have the translation supervised by a human agent at an agreed-upon price.
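Stages 4 and 5 can be sketched as below. The averaging of the three metrics follows the text; the exact cutoff values and the assumption that each metric is normalized to [0, 1] with higher meaning better are illustrative, not specified by the paper:

```python
def quality_level(pearson: float, mae: float, rmse: float,
                  high_cut: float = 0.8, low_cut: float = 0.2) -> str:
    """Average three sentence-level QE metrics into one score (assumed
    normalized to [0, 1], higher is better) and quantize it into the
    three levels used by the tool. Cutoff values are assumptions."""
    score = (pearson + mae + rmse) / 3
    if score >= high_cut:
        return "High"    # usable immediately, without modification
    if score <= low_cut:
        return "Low"     # needs intensive, in-depth human editing
    return "Middle"      # needs moderate human supervision
```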

The advantage of this process is that it enables quantitative estimation of the data quality before human supervision, thereby reducing the cost of human translation supervision, as the easier sentences are already translated by the machine translation system. For High-level sentences, only minor supervision at most is required, whereas Low-level sentences require relatively intensive and in-depth editing. Overall, this strategy can shorten the time required for editing and improve the efficiency of the supervision work.
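Given the per-level classification, the editing cost can be totaled up front. The per-sentence rates here are hypothetical; the paper specifies only that cost scales with the QE level, not actual prices:

```python
# Hypothetical per-sentence editing rates by quality level (e.g., in USD).
EDIT_RATES = {"High": 0.0, "Middle": 0.05, "Low": 0.15}

def estimate_editing_cost(levels):
    """Estimate the total human-editing cost for a list of quality levels."""
    return sum(EDIT_RATES[level] for level in levels)
```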

Tool

We implemented and distributed this tool as a web application. The web server was developed with Flask. The corpus cleaning was implemented based on the filtering process of Park et al. [7] and the GEC model of Park et al. [8]. The Google Translate API was used as the NMT model. We reimplemented the APE system based on the model of Yang et al. [13] and the QE system released by TransQuest [9] in the form of a REST API and combined them with the tool. We released this tool publicly at http://nlplab.iptime.org:9090/.
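A minimal sketch of how such a Flask server could expose the pipeline over REST; the endpoint name, request schema, and the stubbed model call are all hypothetical stand-ins, not the tool's actual interface:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_pipeline(sentence: str) -> dict:
    # Stand-in for the real filtering/GEC/NMT/APE/QE chain behind REST APIs.
    return {"source": sentence, "target": sentence, "level": "High"}

@app.route("/build", methods=["POST"])
def build():
    """Accept a JSON list of source sentences and return scored pairs."""
    sentences = request.get_json().get("sentences", [])
    return jsonify([run_pipeline(s) for s in sentences])

if __name__ == "__main__":
    app.run(port=9090)
```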

3 Conclusion

This paper proposed a data construction method that can work alongside the human translation market of machine translation. For future work, we plan to enhance the tool's performance by improving the performance of each modular stage.

Acknowledgment

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation); by an IITP grant funded by the Korea government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques); and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1A6A1A03045425). Thanks to Seungjun Lee for helping us build the tool. Heuiseok Lim is the corresponding author.

References

  • Chatterjee et al. [2019] Rajen Chatterjee, Christian Federmann, Matteo Negri, and Marco Turchi. 2019. Findings of the WMT 2019 shared task on automatic post-editing. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 11–28.
  • do Carmo et al. [2021] Félix do Carmo, Dimitar Shterionov, Joss Moorkens, Joachim Wagner, Murhaf Hossari, Eric Paquin, Dag Schmidtke, Declan Groves, and Andy Way. 2021. A review of the state-of-the-art in automatic post-editing. Machine Translation, 35(2):101–143.
  • Fonseca et al. [2019] Erick Fonseca, Lisa Yankovskaya, André FT Martins, Mark Fishel, and Christian Federmann. 2019. Findings of the WMT 2019 shared tasks on quality estimation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 1–10.
  • Herold et al. [2021] Christian Herold, Jan Rosendahl, Joris Vanvinckenroye, and Hermann Ney. 2021. Data filtering using cross-lingual word embeddings. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–172.
  • Hutchins [2001] John Hutchins. 2001. Machine translation and human translation: in competition or in complementation. International Journal of Translation, 13(1-2):5–20.
  • Koehn et al. [2020] Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky, Naman Goyal, Peng-Jen Chen, and Francisco Guzmán. 2020. Findings of the WMT 2020 shared task on parallel corpus filtering and alignment. In Proceedings of the Fifth Conference on Machine Translation, pages 726–742.
  • Park et al. [2021] Chanjun Park, Jaehyung Seo, Seolhwa Lee, Chanhee Lee, Hyeonseok Moon, Sugyeong Eo, and Heuiseok Lim. 2021. BTS: Back TranScription for speech-to-text post-processor using text-to-speech-to-text. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), pages 106–116, Online. Association for Computational Linguistics.
  • Park et al. [2020] Chanjun Park, Yeongwook Yang, Chanhee Lee, and Heuiseok Lim. 2020. Comparison of the evaluation metrics for neural grammatical error correction with overcorrection. IEEE Access, 8:106264–106272.
  • Ranasinghe et al. [2020] Tharindu Ranasinghe, Constantin Orasan, and Ruslan Mitkov. 2020. TransQuest: Translation quality estimation with cross-lingual transformers. arXiv preprint arXiv:2011.01536.
  • Rojo [2018] Jorge Leiva Rojo. 2018. Aspects of human translation: the current situation and an emerging trend. Hermeneus: Revista de la Facultad de Traducción e Interpretación de Soria, (20):257–294.
  • Wang et al. [2020a] Minghan Wang, Hao Yang, Hengchao Shang, Daimeng Wei, Jiaxin Guo, Lizhi Lei, Ying Qin, Shimin Tao, Shiliang Sun, Yimeng Chen, et al. 2020a. HW-TSC's participation at WMT 2020 quality estimation shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 1056–1061.
  • Wang et al. [2020b] Yu Wang, Yuelin Wang, Jie Liu, and Zhuo Liu. 2020b. A comprehensive survey of grammar error correction. arXiv preprint arXiv:2005.06600.
  • Yang et al. [2020] Hao Yang, Minghan Wang, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Zongyao Li, Lizhi Lei, Ying Qin, Shimin Tao, Shiliang Sun, et al. 2020. HW-TSC's participation at WMT 2020 automatic post editing shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 797–802.