Unsupervised Parallel Corpus Mining on Web Data
Abstract
With a large amount of parallel data, neural machine translation systems are able to deliver human-level performance for sentence-level translation. However, it is costly to have humans label a large amount of parallel data. In contrast, a large amount of parallel text created by humans already exists on the Internet. The major difficulty in utilizing it is filtering it out of the noisy web environment. Current parallel data mining methods all require labeled parallel data as the training source. In this paper, we present a pipeline to mine parallel corpora from the Internet in an unsupervised manner. On the widely used WMT’14 English-French and WMT’16 English-German benchmarks, the machine translator trained with the data extracted by our pipeline performs very close to the supervised results. On the WMT’16 English-Romanian and Romanian-English benchmarks, our system produces new state-of-the-art results, 39.81 and 38.95 BLEU, even when compared with supervised approaches.
1 Introduction
As one of the most successful applications in natural language processing Vaswani et al. (2017); Ott et al. (2018), modern neural machine translation systems are able to match human-level performance given a large amount of labeled parallel data. Despite this success, it remains extremely challenging to construct a large parallel corpus for a new language pair, given the non-trivial skill requirements and annotation cost.
On the other hand, there exists a large quantity of unaligned sentences expressing the same or very similar meanings in different languages. For any language pair, if we can correctly extract and pair such sentences with similar meanings in the corresponding languages, they can be used directly as a crawled parallel corpus to train a machine translation system. In fact, this idea has been taken up by the parallel corpus mining community, which has produced various pseudo parallel corpora between European languages (https://paracrawl.eu/index.php/news/item/9-paracrawl-works) and hence improved performance Sánchez-Cartagena et al. (2018); Azpeitia et al. (2018); Artetxe and Schwenk (2018). Despite this success, methods along this line still require a significant amount of labeled parallel data to train a sentence aligner, which is then used to filter the abundant unaligned text. This requirement restricts the practical application of these parallel corpus mining methods.
In the meantime, unsupervised machine translation techniques have developed rapidly. They offer a way to generate pseudo parallel data with an unsupervised machine translator and to use that data to train the parallel data miner.
Based on this intuition, we propose an unsupervised web parallel corpus mining pipeline that combines unsupervised machine translation with web parallel corpus mining. It automatically collects and extracts high-quality parallel data from the Internet without requiring any labeled data. The proposed pipeline reduces the cost of collecting parallel data for arbitrary language pairs. In our experiments, we show that a machine translation system trained with the parallel data crawled by our system achieves similar or even superior performance compared to fully supervised systems on the WMT benchmarks.
Our proposed pipeline can be separated into three phases: (1) Train an unsupervised machine translation model and use it to generate a pseudo parallel corpus P. (2) Construct a dictionary from the pseudo parallel data P; the crawler then collects a raw parallel corpus D_raw from the Internet with the help of this dictionary. (3) Use the pseudo parallel data P to train a classifier that decides whether a pair of sentences is a parallel sample of the given language pair, and use this classifier to filter D_raw into the final parallel corpus D_filter. Finally, we treat D_filter as supervised data to train the machine translation system. The details of the pipeline are described in Section 2 and the experiment results are included in Section 3.
2 The Proposed Pipeline
In this section, we introduce the details of the proposed unsupervised web parallel corpus mining pipeline. In the following parts, the targeted language pair is denoted as (L1, L2).
Train an Unsupervised Machine Translation System:
In the first step, we follow the XLM paper Lample and Conneau (2019) to train an unsupervised machine translator, denoted as M. The training process initializes the encoder and decoder with the pretrained XLM model, then minimizes an objective function that combines the denoising encoder-decoder loss and the back-translation loss. Next, given the monolingual data X_L2 of language L2, we generate a pseudo parallel corpus P by translating each sentence into L1 with M and pairing it with its translation.
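Generating P only requires monolingual L2 text and the trained translator M. A minimal sketch of this step is given below; `translate_batch` is a placeholder for M's inference call, whose exact interface depends on the XLM codebase and is therefore an assumption.

```python
# Sketch of phase (1): pair monolingual L2 sentences with their translations into L1.
# `translate_batch` stands in for the unsupervised XLM translator's inference call.
from typing import Callable, Iterable, List, Tuple

def build_pseudo_parallel(
    mono_sentences: Iterable[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 64,
) -> List[Tuple[str, str]]:
    """Return P as a list of (pseudo L1 translation, original L2 sentence) pairs."""
    pairs: List[Tuple[str, str]] = []
    batch: List[str] = []
    for sent in mono_sentences:
        batch.append(sent.strip())
        if len(batch) == batch_size:
            pairs.extend(zip(translate_batch(batch), batch))
            batch = []
    if batch:
        pairs.extend(zip(translate_batch(batch), batch))
    return pairs
```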
Obtain a Dictionary:
To run the mining crawler, we need a dictionary for the language pair (L1, L2) as the seed. Here, we run a statistical machine translation model Koehn and Hoang (2007) on the pseudo parallel corpus P to generate this dictionary.
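The paper extracts the dictionary with a full statistical MT system; as a simplified, self-contained stand-in, the sketch below estimates word translation probabilities on P with IBM-Model-1-style EM and keeps the confident entries. The probability threshold and whitespace tokenization are assumptions made only for illustration.

```python
# Simplified IBM Model 1 EM on the pseudo parallel corpus P, used here only to
# illustrate how a seed dictionary can be derived without any labeled data.
from collections import defaultdict

def train_ibm1(pairs, iterations=5):
    """pairs: iterable of (source_sentence, target_sentence) strings."""
    tokenized = [(s.lower().split(), t.lower().split()) for s, t in pairs]
    t_prob = defaultdict(lambda: 1e-3)  # near-uniform initialization
    for _ in range(iterations):
        counts, totals = defaultdict(float), defaultdict(float)
        for src, tgt in tokenized:
            for tw in tgt:
                norm = sum(t_prob[(sw, tw)] for sw in src)
                for sw in src:
                    frac = t_prob[(sw, tw)] / norm
                    counts[(sw, tw)] += frac
                    totals[sw] += frac
        for (sw, tw), c in counts.items():
            t_prob[(sw, tw)] = c / totals[sw]
    return t_prob

def extract_dictionary(t_prob, threshold=0.3):
    """Keep one target word per source word when its probability is high enough."""
    best = defaultdict(lambda: ("", 0.0))
    for (sw, tw), p in t_prob.items():
        if p > best[sw][1]:
            best[sw] = (tw, p)
    return {sw: tw for sw, (tw, p) in best.items() if p >= threshold}
```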
Crawl the Parallel Data:
To crawl parallel data from the Internet, we use the Bitextor package (https://github.com/bitextor/bitextor) Esplá-Gomis and Forcada (2009) as our crawler. Given a website URL, the crawler downloads all HTML pages from its domain. The package then performs two-stage processing, document alignment followed by sentence alignment, to generate aligned sentence pairs.
In the document alignment step, the algorithm takes the URL and HTML structure information of the pages as input to align website pages. For example, pages with the URLs “xx.com/abc/en” and “xx.com/abc/de” would receive a high probability of being aligned.
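A toy illustration of the URL heuristic is sketched below: two pages whose URLs differ only in a language-code segment become candidates for alignment. The real Bitextor document aligner also exploits HTML structure and content features, which are omitted here, and the language-code set is an assumption for the example.

```python
# Toy URL heuristic: pages whose URLs match after masking language-code segments
# are candidates for document alignment.
LANG_CODES = {"en", "fr", "de", "ro"}

def url_signature(url: str) -> str:
    """Replace path segments that are bare language codes with a placeholder."""
    return "/".join("<lang>" if seg in LANG_CODES else seg
                    for seg in url.lower().split("/"))

def urls_aligned(url_a: str, url_b: str) -> bool:
    return url_signature(url_a) == url_signature(url_b)

print(urls_aligned("xx.com/abc/en", "xx.com/abc/de"))  # True
```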
After aligning documents, the algorithm uses the Hunalign package Varga et al. (2007) to align the sentences in the paired documents. It takes the dictionary generated in the previous step and linguistic information about the sentences as input, and produces the aligned sentence pairs.
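For intuition, a simplified version of the dictionary-based similarity that sentence aligners such as Hunalign rely on is sketched below: the fraction of source words whose dictionary translation appears in the candidate target sentence. Hunalign itself also uses sentence-length statistics and dynamic programming over whole documents, which are omitted here.

```python
# Simplified dictionary-based similarity for a candidate sentence pair: the share
# of source words whose dictionary translation shows up in the target sentence.
def dict_similarity(src_sentence: str, tgt_sentence: str, dictionary: dict) -> float:
    src_tok = src_sentence.lower().split()
    tgt_tok = set(tgt_sentence.lower().split())
    if not src_tok:
        return 0.0
    hits = sum(1 for w in src_tok if dictionary.get(w) in tgt_tok)
    return hits / len(src_tok)
```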
The Bitextor package allows users to integrate a machine learning system into the document and sentence alignment process, which can improve precision. In principle, we could inject the machine translator trained in the first step here. In practice, however, we found that the neural machine translator would become the speed bottleneck of the crawling pipeline, so we did not use this feature of Bitextor.
Filter the Crawled Data:
The first step of filtration follows the heuristic rules described in Artetxe and Schwenk (2018). It includes three rules: (1) remove all duplicate sample pairs; (2) remove any pair containing a sentence whose length is smaller than 4; (3) remove any pair whose overlap ratio is larger than 50%. After applying these rules, nearly 80% of the crawled parallel data is removed. At this point, we denote the resulting parallel corpus as D_raw.
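A sketch of the three heuristic rules is given below. Whitespace tokenization and the exact definition of the overlap ratio (here, overlap relative to the shorter side) are assumptions; the paper does not spell these details out.

```python
# Heuristic filtering sketch: drop duplicates, pairs with a very short side, and
# near-copy pairs whose token overlap exceeds 50%.
def heuristic_filter(pairs, min_len=4, max_overlap=0.5):
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:                                    # rule (1): duplicates
            continue
        seen.add(key)
        src_tok, tgt_tok = key[0].split(), key[1].split()
        if min(len(src_tok), len(tgt_tok)) < min_len:      # rule (2): too short
            continue
        src_set, tgt_set = set(src_tok), set(tgt_tok)
        overlap = len(src_set & tgt_set) / min(len(src_set), len(tgt_set))
        if overlap > max_overlap:                          # rule (3): near-copies
            continue
        kept.append(key)
    return kept
```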
In the previous parts of the proposed pipeline, we only use a learned dictionary to mine the parallel corpus, which limits the precision of the crawler. At the same time, in order to keep most of the high-quality parallel data, we set a low alignment threshold to guarantee a high recall rate.
Next, we perform a post-processing step to filter the high-quality parallel sentences out of the noisy data D_raw. We use the pseudo parallel data P generated in the first step to train a classifier that differentiates parallel from non-parallel sentence pairs. We treat the pairs in P as positive samples and randomly generate negative samples by sampling unpaired sentences from P (a small sketch of this negative sampling is given after the list below). Here, we train two machine learning classifiers:
- Random Forest: We use the Bicleaner tool Sánchez-Cartagena et al. (2018) (https://github.com/bitextor/bicleaner) to train a random forest classifier. This classifier performs fast inference on CPU, so it can be integrated into the crawler step to save disk space for the intermediate results.
- Finetuned XLM: We finetune an XLM model as another classifier; finetuned XLM is a state-of-the-art method for text classification. Due to its inference cost, we apply this classifier after collecting the results from the crawler step.
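Both classifiers are trained on the same labeled set derived from P alone. The sketch below shows one simple way to build it: pseudo parallel pairs serve as positives, and negatives are created by pairing each source sentence with a randomly drawn, non-matching target. The 1:1 positive-to-negative ratio is an assumption, not a detail specified in the paper.

```python
# Building classifier training data from P alone: pseudo parallel pairs are
# positives; negatives pair each source sentence with a random non-matching target.
import random

def make_classifier_data(pseudo_pairs, seed=0):
    rng = random.Random(seed)
    targets = [tgt for _, tgt in pseudo_pairs]
    examples = []
    for src, tgt in pseudo_pairs:
        examples.append((src, tgt, 1))        # positive: aligned pair
        wrong = rng.choice(targets)
        for _ in range(10):                   # retry a few times to avoid the true target
            if wrong != tgt:
                break
            wrong = rng.choice(targets)
        examples.append((src, wrong, 0))      # negative: mismatched pair
    rng.shuffle(examples)
    return examples
```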
After this two-step filtration, we obtain a high-quality parallel dataset D_filter. We can use it with any supervised machine translation algorithm to train the final machine translator.
3 Experiment
3.1 Experiment settings
In this section, we test the proposed pipeline on three language pairs: English-French, English-German, and English-Romanian. In the first step, to generate the pseudo parallel corpus P, we follow the training script in the XLM repository (https://github.com/facebookresearch/XLM) to train the unsupervised machine translator. Next, we sample 1M sentences from the NewsCrawl datasets (https://www.statmt.org/wmt16/translation-task.html) of French, German, and Romanian, and translate them into English with the unsupervised machine translator to obtain P.
For the URL domains fed into the crawler, we follow the ones used in the ParaCrawl project Esplà-Gomis et al. (2019); their statistics are included in Table 1.
For the finetuned XLM model in the filtration step, we use the pretrained 6-layer XLMs, the same ones as in the first step, as the initial parameters, then finetune them on P for 10 epochs. The hyperparameter setting is the same as the XNLI finetuning script in the XLM repository.
Language Pair | En-Fr | En-De | En-Ro |
---|---|---|---|
# url domains | 62.5K | 84.5K | 12.8K |
3.2 The Results of the Crawling Pipeline
In Table 2, we summarize the results of the unsupervised web parallel data mining pipeline. First, we observe that the size of the crawled data is on a similar scale to the supervised data in the WMT benchmarks. Here, WMT for En-Fr indicates the WMT2014 training set, while WMT for En-De and En-Ro indicates the WMT2016 training sets. Second, the result of the filtration process, comparing the sizes of D_raw and D_filter, indicates that 40%-50% of the crawled data is not high-quality parallel data.
In the following parts, we evaluate the quality of this parallel corpus by using it to train neural machine translation systems and comparing their performance with supervised and unsupervised machine translation benchmark results.
parallel set | D_raw | D_filter | WMT |
---|---|---|---|
En-Fr | 21.2M | 12.0M | 35.7M |
En-De | 22.6M | 10.6M | 3.96M |
En-Ro | 1.23M | 724K | 399K |
3.3 Evaluation with Supervised Machine Translation Benchmarks
First, we evaluate the parallel corpus on the supervised machine translation benchmarks. We follow the experiment setting in the Scaling NMT paper Ott et al. (2018), including the model architecture and the choice of hyper-parameters, and report the BLEU scores for the En-Fr and En-De directions on the WMT2014 test sets.
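The paper does not specify which BLEU implementation it uses; one common choice, sketched below, is sacrebleu applied to a system's detokenized outputs and the references. The file paths are placeholders.

```python
# Scoring sketch: compute corpus BLEU for a system's outputs against references.
import sacrebleu

def corpus_bleu_from_files(hyp_path: str, ref_path: str) -> float:
    with open(hyp_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]
    # sacrebleu expects a list of reference streams; there is a single reference here.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```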
The evaluation results are included in Table 3. WMT indicates that the model was trained with the WMT training set; bt means back-translation augmentation. From the results, we observe that the machine translation system trained with D_filter achieves performance similar to systems trained with millions of human-labeled parallel samples. The performance gap is smaller than 1 BLEU point. This indicates that the quality of D_filter is comparable to the current largest-scale public parallel datasets, while the proposed web data mining pipeline does not require any labeled parallel samples or a dictionary as the seed.
Data | En-Fr | En-De |
---|---|---|
WMT Ott et al. (2018) | 43.2 | 29.3 |
WMT+bt Edunov et al. (2018) | 45.6 | 35.0 |
Crawled Data | 42.79 | 28.66 |
Model | En-Fr | Fr-En | En-De | De-En | En-Ro | Ro-En |
---|---|---|---|---|---|---|
XLM Lample and Conneau (2019) | 33.4 | 33.3 | 27.0 | 34.3 | 33.3 | 31.8 |
MASS Song et al. (2019) | 37.5 | 34.9 | 28.3 | 35.2 | 35.2 | 33.1 |
mBART Liu et al. (2020) | - | - | 29.8 | 34.0 | 35.0 | 30.5 |
Crawled Data+XLM | 38.81 | 38.00 | 32.92 | 41.46 | 39.96 | 38.95 |
Crawled Data+Mass | 39.61 | 38.65 | 32.85 | 40.76 | 39.81 | 38.91 |
3.4 Evaluation with Unsupervised Machine Translation Benchmarks
Next, we evaluate our corpus in the benchmark setting of the unsupervised machine translation problem. As in that problem definition, our pipeline trains a machine translation system without requiring any labeled parallel samples. The model architecture and choice of hyper-parameters are the same as in the XLM Lample and Conneau (2019) and MASS Song et al. (2019) papers. The machine translation systems are trained with D_filter together with back-translation augmented data generated in an online manner. The (En, Fr) results are BLEU scores on the WMT2014 test set, and the (En, De) and (En, Ro) results are BLEU scores on the WMT2016 test sets.
The experiment results are included in Table 4. Compared with both baselines, the model trained with data from the proposed pipeline improves by a large margin in all directions, on average 4.55 BLEU points over the best baseline. In the low-resource case, Ro-En, our result of 38.95 BLEU sets a new state of the art, even compared with the best performance obtained with the WMT supervised data, which is 38.5 BLEU.
Supervision Data | En-Fr | En-De |
---|---|---|
D_filter | 42.79 | 28.66 |
D_raw | 42.24 | 28.02 |
D_raw \ D_filter (discarded) | 19.71 | 24.91 |
3.5 Ablation Study on the Post-Filtration
To better understand the importance of the crawler and filtration components, we perform an ablation study that removes the parallel data classifier in the filtration process from the proposed pipeline. We train three models, respectively with the filtered parallel data D_filter, the raw parallel data D_raw, and the low-quality data D_raw \ D_filter, i.e., the samples discarded by the classifier. The experiment setting is the same as in the supervised machine translation study in Section 3.3. The experiment results are presented in Table 5. Surprisingly, when trained with the raw parallel data, the model achieves performance similar to the filtered version, with a difference smaller than 1 BLEU point. On the other hand, the model trained with the low-quality parallel data performs significantly worse. This indicates that the filtration process can differentiate the quality of parallel samples, but leaving this noise in the neural machine translation training data does not harm the final performance much.
4 Related Work
The most relevant work to this paper is the ParaCrawl project Esplà-Gomis et al. (2019). It develops the Bitextor crawler and the Bicleaner classifier to mine parallel data from the Internet. However, both components need human-labeled parallel data: the crawler needs a labeled dictionary and the classifier needs 100K parallel sentences as the seed. In contrast, the proposed pipeline does not require any human-labeled data.
There is also a line of research on improving the accuracy of parallel corpus extractors by proposing novel objective functions and network architectures Azpeitia et al. (2018); Bouamor and Sajjad (2018); Artetxe and Schwenk (2018). Although these methods require supervised data to provide the training signal, the idea of this paper, generating a supervision parallel corpus in an unsupervised manner, can still be used to integrate them into our pipeline.
5 Conclusion
In this paper, we propose an unsupervised web parallel data mining pipeline that does not require any labeled parallel data. The experiment results demonstrate that machine translation systems trained with the crawled corpus are able to match the performance of systems trained with the WMT supervised data in both rich- and low-resource language cases. Due to the unsupervised nature of the proposed pipeline, it can be applied to build translation systems for any language pair that lacks a parallel corpus.
References
- Artetxe and Schwenk (2018) Mikel Artetxe and Holger Schwenk. 2018. Margin-based parallel corpus mining with multilingual sentence embeddings. arXiv preprint arXiv:1811.01136.
- Azpeitia et al. (2018) Andoni Azpeitia, Thierry Etchegoyhen, and Eva Martínez Garcia. 2018. Extracting parallel sentences from comparable corpora with STACC variants. In Proceedings of the 11th Workshop on Building and Using Comparable Corpora, pages 48–52.
- Bouamor and Sajjad (2018) Houda Bouamor and Hassan Sajjad. 2018. H2@BUCC18: Parallel sentence extraction from comparable corpora using multilingual sentence embeddings. In Proc. Workshop on Building and Using Comparable Corpora.
- Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.
- Esplá-Gomis and Forcada (2009) Miquel Esplá-Gomis and Mikel L Forcada. 2009. Bitextor, a free/open-source software to harvest translation memories from multilingual websites. Proceedings of MT Summit XII, Ottawa, Canada. Association for Machine Translation in the Americas.
- Esplà-Gomis et al. (2019) Miquel Esplà-Gomis, Mikel L Forcada, Gema Ramírez-Sánchez, and Hieu Hoang. 2019. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks, pages 118–119.
- Koehn and Hoang (2007) Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pages 868–876.
- Lample and Conneau (2019) Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
- Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.
- Ott et al. (2018) Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. arXiv preprint arXiv:1806.00187.
- Sánchez-Cartagena et al. (2018) Víctor M Sánchez-Cartagena, Marta Bañón, Sergio Ortiz Rojas, and Gema Ramírez-Sánchez. 2018. Prompsit’s submission to WMT 2018 parallel corpus filtering shared task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 955–962.
- Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
- Varga et al. (2007) Dániel Varga, Péter Halácsy, András Kornai, Viktor Nagy, László Németh, and Viktor Trón. 2007. Parallel corpora for medium density languages. Amsterdam Studies In The Theory And History Of Linguistic Science Series 4, 292:247.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.