
CDA: a Cost Efficient Content-based Multilingual Web Document Aligner

Thuy Vu
Amazon Alexa AI
Manhattan Beach, CA, USA
[email protected]
Alessandro Moschitti
Amazon Alexa AI
Manhattan Beach, CA, USA
[email protected]
Abstract

We introduce a Content-based Document Alignment approach (CDA), an efficient method to align multilingual web documents based on content, for creating parallel training data for machine translation (MT) systems operating at the industrial level. CDA works in two steps: (i) projecting documents of a web domain to a shared multilingual space; then (ii) aligning them based on the similarity of their representations in such space. We leverage lexical translation models to build vector representations using TF×IDF. CDA achieves performance comparable with state-of-the-art systems in the WMT-16 Bilingual Document Alignment Shared Task benchmark while operating in a multilingual space. In addition, we created two web-scale datasets to examine the robustness of CDA in an industrial setting involving up to 28 languages and millions of documents. The experiments show that CDA is robust, cost-effective, and significantly superior in (i) processing large and noisy web data and (ii) scaling to new and low-resourced languages.

1 Introduction

Online machine translation (MT) services require industrial-scale training data, i.e., significantly large amounts of high-quality parallel sentences, to build accurate models. Exploiting the web for multilingual content has become a usual strategy for collecting large-scale parallel sentences for MT Uszkoreit et al. (2010); Smith et al. (2013); Buck and Koehn (2016a). Structural Translation Recognition for Acquiring Natural Data (STRAND) Resnik and Smith (2003) is a standard pipeline to extract parallel data from the web, consisting of three steps: (i) bilingual document alignment for an input set of documents, (ii) sentence alignment for each aligned document pair, and (iii) sentence filtering for non-translation or boilerplate cleaning. The first step, identifying bilingual documents, is technically challenging and is made more complicated by the presence of large and noisy documents from web data.

In the WMT-16 Bilingual Document Alignment Shared Task (WMT16-BDAST), two standard approaches for identifying parallel pages ("page" and "document" interchangeably refer to the content of a web page) were studied: 1) the URL matching heuristic Smith et al. (2013) as a baseline and 2) content similarity as a solution to maximize the performance in identifying parallel documents. The benchmark on the English-French document alignment task shows that the best top-1 recall (R@1) for the two approaches is 59.8% and 89.1%, respectively, as evaluated on the test set Buck and Koehn (2016a, b). The results, albeit obtained in an English-French setting, indicate that leveraging document content can lead to a significant increase of up to 30 percentage points in recall and contributes approximately 50% novel bilingual document pairs.

The URL matching heuristic approach, named URL, identifies parallel pages using language identifiers, typically from ISO 639, annotated in the web addresses. Pages, or web documents, in different languages from a domain are aligned if their URLs match after their language identifiers are removed Smith et al. (2013). This strategy can identify a candidate at low cost by comparing two URLs without significant pre-processing. For example, the following URLs are a match: xyz.ca/index.htm and xyz.ca/fr/index.htm, after removing "fr/" from the second URL. On the contrary, cost is the major issue when comparing content, as it often requires language-specific processing and sophisticated modeling for cross-language normalization and alignment. The problem becomes even more challenging when dealing with web data and with low-resourced languages.
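For illustration, the matching logic reduces to stripping known language identifiers from URL path segments and grouping URLs whose normalized forms coincide. The following is a minimal sketch under simplified assumptions; the identifier list and normalization rules of the actual WMT16-BDAST language stripper are far more extensive.

```python
from collections import defaultdict

# Toy identifier list; the WMT16-BDAST language stripper uses a much larger,
# manually curated set of patterns.
LANG_IDS = {"en", "fr", "de", "es", "en-us", "fr-fr"}

def normalize_url(url: str) -> str:
    """Drop path segments that look like language identifiers."""
    parts = url.lower().split("/")
    return "/".join(p for p in parts if p not in LANG_IDS)

def url_align(urls_by_lang: dict) -> list:
    """Group URLs whose normalized forms coincide and pair those coming
    from different languages."""
    buckets = defaultdict(list)
    for lang, urls in urls_by_lang.items():
        for u in urls:
            buckets[normalize_url(u)].append((lang, u))
    return [cands for cands in buckets.values()
            if len({lang for lang, _ in cands}) > 1]

# The paper's example: xyz.ca/index.htm matches xyz.ca/fr/index.htm
print(url_align({"en": ["xyz.ca/index.htm"], "fr": ["xyz.ca/fr/index.htm"]}))
```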

We optimize the cost for the latter approach to enable its application at scale. Specifically, we design CDA to project multilingual documents to a shared multilingual space for direct similarity measurement. Therefore, we can run the STRAND pipeline for multiple languages at once to keep the pre-processing cost monotonic with respect to the number of languages and documents. In particular, we design CDA with two key objectives: i) minimal data processing cost and ii) fast scaling to new languages. The latter is also crucial to the language expansion in online MT services. Our contribution is three-fold:

  • We present an optimized, efficient, and scalable framework that can perform multilingual document alignment at state-of-the-art performance in one go.

  • To facilitate the development and evaluation, we created two web-scale datasets, which are much larger and have many more languages than what is currently publicly available.

  • We study the contribution of CDA in multiple applications involving parallel sentence extraction for MT and mutual complement between URL and CDA.

We tested CDA with multiple large-scale datasets of web documents. We also studied the applications of CDA within an industrial setting, including (i) extracting more and better parallel sentences from an extremely large-scale dataset, (ii) producing better MT models, and (iii) improving yield, measured by the amount of extracted parallel content, for both URL and CDA in the STRAND pipeline. The experimental results show that, despite its minimality, CDA (i) is on par with the top systems from WMT16-BDAST, which use expensive bilingual resources, and (ii) can double the amount of parallel data extracted by previous URL-based approaches. Most importantly, we show that CDA provides robust performance when dealing with millions of documents and processing up to 28 languages, including low-resourced languages.

In the remainder of this paper, we summarize the previous work regarding document alignment in Section 2. We then describe the proposed system and our experiments in sections 3 and 4, respectively. Finally, we derive the conclusions of the paper in Section 5.

2 Related Work

Aligning multilingual documents is a key processing step in most multilingual text processing pipelines, including cross-lingual information retrieval Steinberger et al. (2002); Pouliquen et al. (2004); Vulic and Moens (2015); Jiang et al. (2020) and parallel data extraction. In the context of creating parallel training data for MT, the problem has been studied in the literature for comparable corpora Munteanu et al. (2004); Vu et al. (2009); Pal et al. (2014) and web-structured parallel data extraction Resnik (1999); Uszkoreit et al. (2010); Buck and Koehn (2016a). We focus on the latter in this paper.

The WMT-16 Bilingual Document Alignment Shared Task is a recent shared task focusing on identifying bilingual documents from crawled websites Buck and Koehn (2016b). The top 3 systems are YODA Dara and Lin (2016), NOVALINCS Gomes and Pereira Lopes (2016), and UEDIN1 COSINE Buck and Koehn (2016b). The first two require costly features, such as (i) n-gram comparison after translating all non-English text into English Dara and Lin (2016), and (ii) the phrase table of a statistical MT (SMT) system as a dictionary Gomes and Pereira Lopes (2016). UEDIN1 COSINE Buck and Koehn (2016b), on the other hand, only uses TF×IDF-weighted cosine similarity. Interestingly, this method performs surprisingly well even without French-to-English translations, dropping just 3.4% in recall, from 93.7% to 90.3%. Though this finding may be due to the overlap between the English and French lexicons, it suggests that TF×IDF with proper normalization is useful for comparing document representations from sub-domains. Our proposed method exploits this aspect.

Given the advent of deep neural network modeling, document embeddings are among the main interests in general NLP applications Le and Mikolov (2014); Cer et al. (2018); El-Kishky and Guzmán (2020). This line of research, however, is not directly applicable to our problem setting: the cost of running neural inference at web scale is prohibitively high, e.g., processing a dataset of several billion pages from CommonCrawl (commoncrawl.org) is not feasible. As a reference point, the Cloud Translation (cloud.google.com/translate/pricing) cost to translate a webpage of 20,000 characters is $0.4 as of Jan 2021.

3 Proposed System

We describe our method to identify parallel pages from a web domain. Specifically, pages in different languages are projected to a shared space, where their similarity can be measured.

Problem Definition

Let $D=\{d_1, d_2, \dots, d_n\}$ be the $n$ pages from a domain, where each page $d_i$ is described by its content $c_i$ in language $L_i$. The problem is to identify all pairs $(d_i, d_j)$ such that the languages differ, $L_i \neq L_j$, and $c_i$ and $c_j$ are translational equivalents.

Multilingual Space

Let $\mathcal{L}=\{L_1, L_2, \dots\}$ be the set of languages found in $D$ and let $D_i$ be the set of documents in $L_i$. Thus, $D=\bigcup_{L_i\in\mathcal{L}} D_i$, where each $D_i$ is associated with a lexicon $V_i$. Without loss of generality, we project documents from two languages, $L_\alpha$ and $L_\beta$, into a common space as follows. We first define the alignment between two lexicons $V_\alpha$ and $V_\beta$ as:

$$\mathcal{A} = \big\{(a,b) : a \in V_\alpha,\ b \in V_\beta,\ P_{\alpha\rightarrow\beta}(a,b) + P_{\beta\rightarrow\alpha}(b,a) \geq P_{\alpha\rightarrow\beta}(a,w) + P_{\beta\rightarrow\alpha}(w,a),\ \forall w \in V_\beta\big\},$$

where $P_{\alpha\rightarrow\beta}$ and $P_{\beta\rightarrow\alpha}$ are lexical translation models from $L_\alpha$ to $L_\beta$ and vice versa (it can easily be shown that the proposed alignment is symmetric, i.e., the condition $P_{\alpha\rightarrow\beta}(a,b)+P_{\beta\rightarrow\alpha}(b,a) \geq P_{\alpha\rightarrow\beta}(w,b)+P_{\beta\rightarrow\alpha}(b,w),\ \forall w \in V_\alpha$ also holds). It should be noted that $\mathcal{A}$ defines a common space, $\mathbb{R}^{|\mathcal{A}|}$, whose dimensions are all word pairs. However, we can simplify the approach by mapping all languages into the space of a pivot language, i.e., $\alpha$. Thus, we define $\Pi_\alpha : D_\beta \longrightarrow \mathbb{R}^{|V_\alpha|}$, which maps documents $d_\beta \in D_\beta$ into the same space as $D_\alpha$:
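To make the definition concrete, the following is a minimal sketch assuming the two lexical translation models are given as nested dictionaries of probabilities (this data layout is our assumption; in practice the scores come from IBM-1 tables or embedding similarities, see Section 4.2.2):

```python
def build_alignment(V_alpha, V_beta, P_ab, P_ba):
    """Lexicon alignment A: for each pivot word a, keep the pair (a, b) whose
    symmetric score P_ab(a, b) + P_ba(b, a) is maximal over V_beta.
    Ties are broken arbitrarily here, while the definition keeps all maximizers."""
    A = set()
    for a in V_alpha:
        best_b, best_score = None, 0.0
        for b in V_beta:
            score = P_ab.get(a, {}).get(b, 0.0) + P_ba.get(b, {}).get(a, 0.0)
            if score > best_score:
                best_b, best_score = b, score
        if best_b is not None:
            A.add((a, best_b))
    return A
```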

$$\Pi_\alpha(d_\beta) = \Pi_\alpha\left(w_1, w_2, \dots, w_{|d_\beta|}\right) = \vec{x} = \left(x_1, x_2, \dots, x_{|V_\alpha|}\right) \qquad (1)$$

We define a lexical mapping for document $d_\beta$ as $M(d_\beta) = d_{\beta\rightarrow\alpha} = \{w'_1, \dots, w'_{|d_{\beta\rightarrow\alpha}|} \mid (w_i, w'_j) \in \mathcal{A}\}$, which maps $d_\beta$ into language $\alpha$. Similarly, we denote the mapping of a document collection $D_\beta$ as $D_{\beta\rightarrow\alpha} = \{M(d_\beta) : \forall d_\beta \in D_\beta\}$. Finally, we compute the TF×IDF representation of $d_\beta$ as follows, $\forall w_i \in V_\alpha$:

  • $\text{TF}(w_i)$ = number of occurrences of $w_i$ in $d_{\beta\rightarrow\alpha}$; and

  • $\text{IDF}(w_i) = \log\left(1 + \frac{|D_{\beta\rightarrow\alpha}|}{1 + \#(w_i, D_{\beta\rightarrow\alpha})}\right)$, where $\#(w, D)$ returns the number of documents in $D$ containing $w$.

We compute $x_i$ in Eq. 1 as $\text{TF}(w_i) \times \text{IDF}(w_i)$.
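A minimal sketch of this projection, assuming the alignment built in the previous sketch, tokenized documents, and an ordered pivot vocabulary (the helper names are ours, for illustration only):

```python
import math
from collections import Counter

def lexical_mapping(A):
    """Derive the beta->alpha word mapping M from the alignment A (pairs (a, b));
    by the symmetry noted above, each b is paired with its best pivot word a."""
    return {b: a for a, b in A}

def map_document(doc_tokens, beta_to_alpha):
    """M(d_beta): replace each token with its aligned pivot-language word, if any."""
    return [beta_to_alpha[w] for w in doc_tokens if w in beta_to_alpha]

def tfidf_vectors(mapped_docs, V_alpha):
    """TF x IDF vectors over the ordered pivot vocabulary V_alpha, as in Eq. 1."""
    n_docs = len(mapped_docs)
    df = Counter()
    for doc in mapped_docs:
        df.update(set(doc))                       # document frequency per pivot word
    idf = {w: math.log(1 + n_docs / (1 + df[w])) for w in V_alpha}
    vectors = []
    for doc in mapped_docs:
        tf = Counter(doc)
        vectors.append([tf[w] * idf[w] for w in V_alpha])
    return vectors
```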

Aligning Multilingual Documents

Two documents are considered a good pair if their representations are similar, according to a similarity threshold $t$. We compute the similarity between $d_i$ and $d_j$ as the dot product $v_i \cdot v_j \in [0, 1]$ (the vector representations are $\ell_2$-normalized). In practice, we use English to build the multilingual space, as it is the dominant language on the Internet and in most multilingual websites.
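As a minimal sketch of the scoring step, assuming the TF×IDF vectors from the previous sketch (the threshold value and the per-document argmax selection are illustrative assumptions, not the exact selection rule of the paper):

```python
import numpy as np

def align_documents(en_vecs, xx_vecs, t=0.5):
    """Score all English x non-English pairs with the dot product of
    l2-normalized TFxIDF vectors and keep, for each English document, the best
    non-English document above the threshold t."""
    E = np.asarray(en_vecs, dtype=float)
    X = np.asarray(xx_vecs, dtype=float)
    E /= np.linalg.norm(E, axis=1, keepdims=True) + 1e-12
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    sims = E @ X.T                  # values in [0, 1] since TFxIDF is non-negative
    best = sims.argmax(axis=1)
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= t]
```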

4 Experiments

We examine the efficacy of CDA in this section. First, we describe (i) the pipeline setup for the experiments and (ii) our effort in creating suitable benchmark data and selecting relevant resources, in Sections 4.1 and 4.2, respectively. We then address the following performance aspects of CDA:

  1. The performance in multilingual document alignment.

  2. The impact of CDA, compared to URL, in an end-to-end STRAND pipeline.

  3. The by-product applications of CDA in identifying (i) novel language identifiers beyond ISO 639 for URL and (ii) web-domains containing multilingual data that are not detectable using language identifiers.

  4. The cost required to enable support for a new language.

4.1 Pipeline Setup

Figure 1 depicts the STRAND pipeline for our experiments.

Figure 1: A STRAND pipeline that uses URL to align documents, with CDA as a drop-in replacement.
  • The input consists of web documents from multiple domains. The output is a set of parallel sentence pairs extracted by the pipeline. Each document has a web address and a raw HTML source.

  • The pre-processing step groups input documents by domain to create data for the document alignment step using URL and CDA. For CDA, additionally, it extracts the text content from the HTML structure, using the following tags: title, h1..h6, label, blockquote, dd, dt, p, pre, q, div; this effectively keeps boilerplate out of the similarity calculation (see the sketch after this list). We use Python’s langid package to identify the language of a page.

  • Document alignment is performed by either URL or CDA. For URL, we use a set of language identifiers similar to that of BDAST’s baseline (https://github.com/christianbuck/wmt16-document-alignment-task/blob/master/languagestripper.py).

  • For each aligned document pair, the sentence alignment step aligns text segments, called sentence pair candidates, of the aligned pages based on the DOM structure Smith et al. (2013).

  • Finally, the sentence filtering step removes low-quality pairs Xu and Koehn (2017); Sánchez-Cartagena et al. (2018) and duplicates. The filter we used in this experiment has an F1 score of approximately 90% for each language pair.
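The pre-processing step can be sketched as follows. This is a simplified illustration: langid is the package named above, whereas BeautifulSoup is only an assumed choice of HTML parser (the paper does not specify one), and deduplication of nested tags, encoding handling, and scale are omitted.

```python
import langid
from bs4 import BeautifulSoup

CONTENT_TAGS = ["title", "h1", "h2", "h3", "h4", "h5", "h6", "label",
                "blockquote", "dd", "dt", "p", "pre", "q", "div"]

def extract_text(html: str) -> str:
    """Keep only text under the content-bearing tags listed above, which drops
    most boilerplate. Note: nested content tags (e.g., p inside div) may be
    extracted twice; deduplication is omitted in this sketch."""
    soup = BeautifulSoup(html, "html.parser")
    chunks = [tag.get_text(separator=" ", strip=True)
              for tag in soup.find_all(CONTENT_TAGS)]
    return " ".join(c for c in chunks if c)

def detect_language(text: str) -> str:
    """Language of a page via the langid package."""
    lang, _score = langid.classify(text)
    return lang
```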

4.2 Dataset and Resource

We describe the datasets and resources used in the experiments.

4.2.1 Dataset

We collect and create the following datasets to study CDA performance in (i) matching parallel content, (ii) handling large datasets, and (iii) extending its use to new languages.

WMT-16 Shared Task

First, we use the benchmark dataset provided for the WMT-16 Shared Task on Bilingual Document Alignment. We evaluate and compare CDA with other English–French document alignment methods on the BDAST training set. The dataset consists of 348,858 English and 225,043 French documents from 49 web-domains. Each document has a web address and clean content. In addition, the French documents are also provided translated into English by a standard SMT model; these translations allow studying the potential upper-bound performance when full translations are available. An alignment candidate consists of one English and one French document from the same domain; thus, there are more than 4.2e9 possible alignments between the documents. The gold data consists of 1,624 pairs provided by WMT16-BDAST. In this set, the number of labeled alignments per domain ranges from 4 (e.g., www.eohu.ca) to 236 (e.g., tsb.gc.ca). The pairs generated by a system are first filtered by the 1-1 rule: each document may participate in at most one alignment. A system is evaluated by the recall achieved on these 1,624 pairs.
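A minimal sketch of the 1-1 filtering and the recall computation, assuming candidates come with similarity scores (the greedy selection order is our assumption):

```python
def one_to_one_filter(scored_pairs):
    """Enforce the 1-1 rule: greedily keep the highest-scoring candidates so that
    each document participates in at most one alignment."""
    kept, used = [], set()
    for en, fr, score in sorted(scored_pairs, key=lambda p: -p[2]):
        if en not in used and fr not in used:
            kept.append((en, fr))
            used.update((en, fr))
    return kept

def recall_at_1(kept_pairs, gold_pairs):
    """Fraction of the gold pairs (1,624 for WMT16-BDAST) recovered after filtering."""
    kept = set(kept_pairs)
    return sum(1 for g in gold_pairs if g in kept) / len(gold_pairs)
```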

WMT-16 Deep Crawl

The previous benchmark has two limitations. First, the size of the dataset is relatively small compared to a typical web-scale setting: a multilingual domain can have thousands to millions of pages, e.g., nato.int and microsoft.com have roughly 3e5 and 38e6 pages, respectively, indexed by Google (estimated from the number of results returned by Google for the query “site:...” on a specific website). Second, the choice of English–French is not representative of the ultimate goal: finding more and better parallel data to enable MT for low-resourced languages. English–French has been the most studied pair in MT, and the two lexicons also overlap substantially Lewis (2009).

We address these limitations by creating a larger dataset of more than 14MM pages using the same set of 49 domains. Specifically, we used these domains and URLs as seeds and recursively downloaded all reachable pages from those seeds, without following links to external domains. This exercise resulted in a dataset consisting of 8.7MM English pages and 5.5MM pages in 28 other languages. The languages covered (including English) are: Arabic, Bulgarian, Chinese Simplified, Chinese Traditional, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Slovak, Spanish, Swedish, Thai, and Turkish.

CommonCrawl Sextet

The previous datasets share the same domains, which are heavily biased toward French content (see Table 3). We leverage a monthly crawl from CommonCrawl, specifically CC-MAIN-2017-17 (s3://commoncrawl/crawl-data/CC-MAIN-2017-17), to create a dataset with a better language distribution to validate CDA. We select pages from the crawl in Chinese, Czech, Italian, Japanese, Russian, and Turkish, keeping only the following top-level domains: .cn, .tw, .cz, .it, .jp, .ru, .tr, .edu, .gov, and .org. The process results in a dataset of 600+ domains with 17.6MM and 4.1MM pages in English and the six selected languages, respectively. Table 1 summarizes the datasets considered in our experiments.

Table 1: Summary of Benchmark Datasets for the Document Alignment Task. Each dataset is described by the number of web-domains (#Dom.), English documents (#EN-docs), non-English documents (#XX-docs), and languages (#$\mathcal{L}$).
Dataset #Dom. #EN-docs #XX-docs #$\mathcal{L}$
WMT-16 Shared Task 49 348,858 225,043 En+Fr
WMT-16 Deep Crawl 49 8.7MM 5.5MM En+28
CommonCrawl Sextet 662 17.6MM 4.1MM En+6

4.2.2 Resource

Lexical translation dictionaries are the main resource our proposed method requires to support a new language pair. Our experiments used lexical translation dictionaries created by methods introduced for traditional SMT Brown et al. (1993) and for neural-based MT Lample et al. (2018).

In particular, we use IBM-1 models for popular languages that have sufficient parallel data from general domains Koehn (2005); we build these IBM-1 models with the GIZA++ toolkit Och and Ney (2003). Collecting such parallel data for low-resourced languages is generally challenging; we instead leverage the advances in multilingual embeddings from the MUSE project (github.com/facebookresearch/MUSE) Lample et al. (2018) and derive the translation probability between two words from their normalized embedding similarity score.
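A minimal sketch of turning aligned cross-lingual embeddings into a lexical translation table; the softmax over the top-k nearest target words is an illustrative normalization choice, not necessarily the one used by the authors:

```python
import numpy as np

def translation_probs(src_vecs, tgt_vecs, src_words, tgt_words, top_k=10):
    """Turn aligned cross-lingual embeddings (e.g., from MUSE) into a lexical
    translation table P(src -> tgt) by normalizing cosine similarities."""
    S = src_vecs / (np.linalg.norm(src_vecs, axis=1, keepdims=True) + 1e-12)
    T = tgt_vecs / (np.linalg.norm(tgt_vecs, axis=1, keepdims=True) + 1e-12)
    sims = S @ T.T                                 # cosine similarity matrix
    table = {}
    for i, row in enumerate(sims):
        idx = np.argsort(-row)[:top_k]             # top-k nearest target words
        weights = np.exp(row[idx])
        weights /= weights.sum()                   # softmax normalization
        table[src_words[i]] = {tgt_words[j]: float(w) for j, w in zip(idx, weights)}
    return table
```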

4.3 Bilingual Document Alignment Results

We evaluate the performance of CDA on the WMT-16 Shared Task benchmark. We conduct experiments in both settings, using the original text and using full translations; the latter setting allows us to quantify the possible benefit of the expensive full-document-translation step. In addition, the construction of $V_{L_i}$ is crucial to the distinctiveness of the representations; therefore, we examine the impact of the vocabulary size of $V_{L_i}$ on the alignment result. Specifically, we construct $V_{L_i}$ by selecting the top frequent tokens after removing stop-words and the first $k$ most frequent tokens; we empirically set $k$ to 100. We experiment with three different sizes for $V_{L_i}$: 2,000, 10,000, and 20,000. Finally, we compare the results of CDA with the baseline URL and the top-3 systems of the WMT-16 Bilingual Document Alignment Shared Task. The evaluation metric is the percentage of the 1,624 golden pairs found in the top-1 alignment for each English document. Table 2 shows the results.

Table 2: Comparison of the baseline URL, the top-3 performing systems from WMT16-BDAST, and CDA under different vocabulary-size settings, using either original text (original documents in English and French) or translated text (documents in English and the English translations of the French documents).
baseline: URL  67.9
Content-based system | align w. orig. text | align w. trans. text
YODA Dara and Lin (2016) | n/a | 93.71
NOVALINCS Gomes and Pereira Lopes (2016) | 90.50 | n/a
UEDIN1 COSINE Buck and Koehn (2016b) | 90.25† | 93.65†
CDA: |V|=2,000 | 87.17 | 90.95
CDA: |V|=10,000 | 90.40 | 91.02
CDA: |V|=20,000 | 90.33 | 89.90
[†] average performance on the valid and test splits

For alignments using the original text, the results indicate that CDA achieves performance comparable to the state-of-the-art methods from BDAST, which shows the efficacy of the proposed alignment method. The results also show that the vocabulary size, i.e., the size of the vector representations, impacts performance: $|V|=10,000$ yields the best result among the three settings. Moreover, even though using full translations is better, the performance gains are negligible with respect to the processing cost required to build the MT models and translate all the data into an anchor language. Since CDA does not exploit bi-gram features, its performance is slightly lower, by up to 3%, compared to the state of the art. In short, the results suggest an optimal configuration for CDA with a vocabulary size of 10,000.
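For reference, the vocabulary construction described at the beginning of this section can be sketched as follows (tokenization and the stop-word list are unspecified in the paper and are assumptions here):

```python
from collections import Counter

def build_vocab(tokenized_docs, stop_words, k=100, size=10_000):
    """V_{L_i}: drop stop-words, skip the k most frequent remaining tokens,
    then keep the next `size` most frequent tokens (k=100 and size=10,000
    follow the settings reported in this section)."""
    counts = Counter(tok for doc in tokenized_docs
                     for tok in doc if tok not in stop_words)
    ranked = [w for w, _ in counts.most_common()]
    return ranked[k:k + size]
```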

4.4 Multilingual Document Alignment Results

It was shown in WMT16-BDAST that content-based methods can add 60% more English–French document pairs compared to URL. This section aims to verify this in a multilingual setting, particularly when operating with a significantly higher number of languages and domains, using WMT-16 Deep Crawl and CommonCrawl Sextet, respectively.

On WMT-16 Deep Crawl

Table 3 shows the number of parallel documents and sentences extracted by the end-to-end STRAND pipeline described in Section 4.1, after filtering and duplicate removal. In this more extensive and more realistic setting, CDA contributes an extra 53%, 75%, and 195% of clean parallel sentences on top of URL for French alone, for all languages, and for all languages excluding French, respectively. The result also suggests that CDA is effective and can significantly increase the number of parallel sentences extracted for low-resourced languages. Finally, it indicates that our proposed method is robust in a multilingual setting.

Table 3: Comparison of yields (number of clean document pairs and sentence pairs) produced via URL and CDA on WMT-16 Deep Crawl. Column CDA\URL reports the number of novel sentence pairs exclusively extracted via CDA.
Language  # Document Pairs (URL, CDA)  # Sentence Pairs (URL, CDA, CDA\URL)
Arabic       3,266       2,896        36,262        39,590        24,065
Bulgarian       1,184       1,070          9,292          1,748          1,359
Chinese-S       2,805       2,160        30,519        27,666        13,289
Chinese-T          316          102          2,584          2,055             374
Croatian          704       3,119             889        56,300        55,854
Czech            29          241               77          7,264          7,248
Danish          137       2,932             693        39,488        38,996
German       5,525       8,863        83,663      170,932      113,851
Dutch          599       2,407          8,228        79,293        79,146
Farsi       1,316       1,404        14,697        13,875          6,122
Finnish          170       1,313             355        12,403        12,229
French   115,671   143,972   2,653K   3,568K   1,411K
Hebrew          209          140          7,742          5,295               83
Hungarian       1,253       1,382        10,494          6,158          4,448
Indonesian          368          551             625          1,204             900
Italian       6,644       7,310        57,977        94,098        55,802
Japanese          823       1,475          6,593        14,720        11,138
Korean          913          136        13,365          2,229             224
Malay       1,040       1,904          8,467        13,088          7,213
Norwegian            67       1,875             196        35,362        35,273
Polish          557          934          9,685        21,528        17,255
Portuguese       1,545       6,200        12,294      104,561        96,850
Russian       3,984       2,475        36,565        55,010        40,722
Slovak          170          850             211          2,157          2,106
Spanish       8,334     21,765      114,874      317,430      252,523
Swedish            83       2,394          2,773        49,420        49,238
Thai            82            10             830             259               40
Turkish       1,057       2,041          9,598        18,412        10,789
All   159,343   222,283   3,134K   4,761K   2,349K
All\French     43,672     78,311      481,333   1,193K      937,683
On CommonCrawl Sextet

Table 4 shows the number of English parallel tokens extracted by the pipeline using URL and CDA in the document alignment step. The results show gains similar to the previous experiment, except for Czech, where CDA extracts 7× more parallel tokens. Our post-hoc analysis discovered non-standard language identifiers missing from the URL processing, e.g., ces or cesky.

To confirm this finding, we randomly selected 1,320 English–Turkish document pairs identified by CDA for human verification, since we do not have annotated data. The outcome indicates that the accuracy of these document pairs is 91.5%. Information on these datasets is available at github.com/alexa/wqa_dataset.

Table 4: Parallel English tokens extracted by CDA and URL on CommonCrawl Sextet
Lang. #Dom. #EN-docs #XX-docs |CDA|/|URL|
Turkish 37 1,434,923 71,034 1.77
Czech 69 2,333,914 831,072 7.07
Japanese 84 2,097,664 757,872 1.90
Russian 125 2,918,594 1,163,258 1.09
Italian 239 5,770,684 1,112,868 1.05
Chinese 108 3,061,782 207,211 0.99
All 662 17,617,561 4,143,315 1.24

4.5 Industrial Benchmarks

We conducted multiple internal experiments to examine the performance of CDA against URL in an industrial setting. Specifically, we focus on three application aspects of CDA: (i) robustness, (ii) identifying non-standard language identifiers for URL, and (iii) identifying multilingual web-domains. For business confidentiality reasons, we do not name the specific languages considered in this study, and we omit some details of the experimental setting that are not critical to illustrating our findings.

4.5.1 Robustness Benchmark

We ran the STRAND pipeline end-to-end to extract parallel sentence pairs from document pairs identified by URL and by CDA replacing URL. We employ a crawl dataset larger than a typical monthly crawl archive from CommonCrawl; the dataset is also densely multilingual. We target six mid-tier languages that are not among the top-10 high-resourced languages. The experiment shows that CDA yields an additional 27% of English parallel tokens over the selected languages.

Automatic Evaluation

We first study the quality of the extracted parallel data, especially the additional 27% produced by CDA, using automatic MT evaluation. Specifically, for each language, we compare translation models trained on two equal-sized sets of parallel sentence pairs sampled from the pairs extracted exclusively by URL and by CDA, i.e., after removing the common pairs extracted by both methods. We train vanilla seq2seq models using Sockeye (https://github.com/awslabs/sockeye). We report the MT performance in BLEU scores on our MT evaluation data in Figure 2. The results indicate that the models trained on the novel sentence pairs extracted by CDA consistently perform better.

Figure 2: Automatic evaluation of MT models trained by the same amount of pairs sampled from pairs extracted exclusively by either URL or CDA. The primary y-axis (left) indicates the BLEU score, while the secondary y-axis (right) indicates the number of parallel pairs.
Human Evaluation

We had linguists manually verify the extracted parallel sentence pairs produced by the pipeline using either URL or CDA. Specifically, we randomly sampled 500 sentence pairs extracted by each pipeline for Language A and Language E (we do not remove the common pairs in this evaluation). These languages were selected because of their low performance in the automatic evaluation in Figure 2. Table 5 shows the results in terms of precision and recall. In general, we find that the quality of the pairs produced by the two methods is comparable. The result also confirms the robustness of our proposed CDA under stress evaluations.

Table 5: Human verification for parallel sentence pairs extracted from URL and CDA in precision (P) and recall (R).
URL CDA
Language A P=96, R=87 P=98, R=89
Language E P=94, R=81 P=92, R=90

4.5.2 Identifying Non-standard Language Identifiers for URL

Even though the URL method is fast, it requires language identifiers that are usually collected manually. This task is challenging because the language identifiers used in different web-domains typically do not follow any standard. For example, the language patterns for Czech may include czech, cze, ces, cz, cs, and cesky; the result in Table 4 also suggests this limitation of URL for Czech. In this exercise, we examine URL pairs matched by the CDA method to extract relevant language identifiers. Specifically, we focus on URL pairs that differ by exactly one token and extract the differing tokens as candidates (see the sketch below). For example, en and vi_vn are extracted as candidates for (www.visitsingapore.com/en/, www.visitsingapore.com/vi_vn/). We curated the candidates, identified additional novel language identifiers, and fed them to the URL method. The exercise significantly increased the yields for three low-resourced languages, by 453%, 295%, and 266%, respectively. For example, we found additional identifiers for Chinese: chs, chn, c, zho, zht, cht, webcn, sc, tc, chinese_gb, and chinese_big5, besides the popular zh, chi, and zho.
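A minimal sketch of the candidate extraction, assuming URL tokens are path segments (the tokenization and the equal-length, single-substitution restriction are simplifying assumptions):

```python
def tokenize_url(url: str):
    """Split a URL into path tokens (a simplified tokenization)."""
    return [t for t in url.lower().split("/") if t]

def identifier_candidates(cda_url_pairs):
    """From CDA-aligned URL pairs whose token sequences differ in exactly one
    position, extract the differing tokens as candidate language identifiers."""
    candidates = set()
    for u1, u2 in cda_url_pairs:
        t1, t2 = tokenize_url(u1), tokenize_url(u2)
        if len(t1) != len(t2):
            continue
        diffs = [(a, b) for a, b in zip(t1, t2) if a != b]
        if len(diffs) == 1:
            candidates.add(diffs[0])
    return candidates

# The paper's example yields the candidate pair ('en', 'vi_vn')
print(identifier_candidates([("www.visitsingapore.com/en/",
                              "www.visitsingapore.com/vi_vn/")]))
```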

4.5.3 Identifying Multilingual Web-domains

We study the application of URL and CDA in identifying densely multilingual web-domains. Specifically, we compare the yield of parallel content, in the total number of extracted parallel English tokens, from two different datasets processed by the same pipeline. The datasets differ in whether their web-domains are identified as multilingual by URL or CDA.

On a sufficiently large dataset, we first ran URL and selected the web-domains having at least 100 candidate pairs. Subsequently, we ran CDA on the remainder of the dataset, i.e., the domains not selected by URL, and selected those with at least 100 candidate pairs. We randomly selected 10,000 domains from each group to create two datasets, namely “Domains by URL” and “Domains by CDA,” respectively. We applied the same pipeline, using both methods for aligning documents, on each dataset. We then computed the yield of parallel English tokens extracted in each setting. Table 6 shows the results.

Table 6: Number of parallel English tokens (in MM) from two deep crawls seeded by URL and CDA.
Domains by URL Domains by CDA
Parallel Token Count 23.3MM 69.5MM
Host Count 10,000 10,000
Dataset Size (TB) 6 3.1

These results indicate that CDA can identify densely multilingual web-domains effectively.

In particular, given the same number of web-domains, the dataset identified by CDA produces almost 3× more parallel data while being only half the size of the dataset identified by URL. This suggests that the yield of parallel content, per unit of crawled data, from web-domains identified by CDA is 6× higher than from those identified by URL. The finding is essential for optimizing the parallel extraction pipeline and for identifying denser multilingual web content.

4.6 Cost Analysis

As presented, we anticipate two cost types when extending CDA to support a new language: (i) building a lexical translation model and (ii) processing more documents. The former is a one-time cost, while the latter is dataset dependent.

Specifically, we showed in Section 4.2.2 that a lexical translation model can be built either with the statistical IBM-1 method using parallel data or with an unsupervised neural-based method Lample et al. (2018); we observed comparable CDA performance with models built by either method. Given the rapid advances in deep neural language models, it is increasingly feasible to obtain such resources for low-resourced languages. This suggests that we will be able to leverage recent advances in neural-based NLP to continuously extend CDA to many more languages.

Regarding execution time, the primary bottleneck is typically the scoring of all possible alignments between English and non-English documents. Even though this scoring step is quadratic, the workload is embarrassingly parallel. With proper engineering optimization, we empirically found that the run-time of CDA can be brought to within 2.5× that of URL for 20 low-resourced languages on a sufficiently large dataset. This optimized cost is crucial for enabling a spectrum of multilingual applications, including cross-lingual information retrieval and MT services for low-resourced languages.
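One possible way to exploit this parallelism is to shard the English side of the similarity matrix across workers; the sketch below illustrates the idea and is an assumption about the engineering, not the authors' actual implementation.

```python
import numpy as np
from multiprocessing import Pool

def _score_block(args):
    # Dot products of one block of English vectors against all non-English vectors.
    en_block, xx_all = args
    return en_block @ xx_all.T

def score_all(en_vecs, xx_vecs, n_workers=8, block_size=4096):
    """Embarrassingly parallel scoring of all English x non-English pairs.
    Rows are assumed to be l2-normalized TFxIDF vectors (Section 3). Shipping
    xx_vecs to every worker is wasteful; a production system would share or
    shard it, which is omitted here for brevity."""
    blocks = [(en_vecs[i:i + block_size], xx_vecs)
              for i in range(0, len(en_vecs), block_size)]
    with Pool(n_workers) as pool:
        parts = pool.map(_score_block, blocks)
    return np.vstack(parts)
```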

5 Conclusion

We presented CDA, our content-based document alignment method for web data, which projects multilingual documents into a common space for similarity calculation. We also described our effort to collect and create benchmark datasets to study different performance aspects of the proposed method. The results show that CDA is efficient when projecting multilingual documents in one go for as many as 28 languages. Moreover, we described several benchmarks of CDA under industrial settings.

The results show that our proposed method is robust when processing huge datasets and useful in identifying non-standard language identifiers and multilingual web-domains. Finally, and most importantly, the only significant resource required by CDA is a lexical translation dictionary, which can be easily built thanks to recent advances in learning multilingual embeddings.

CDA has many potential future applications. For example, the URLs paired by CDA can be used to improve the coverage of URL-based methods (e.g., the Czech case in Table 4) and to study the web structure of multilingual content. Moreover, the robustness and extensibility of CDA make it applicable to other multilingual processing systems, including cross-lingual search and retrieval.

Acknowledgements

We thank Chris Bissell, Jeremiah Hankins, Kwang Hyun Jang, Daniel Marcu, Masud Moshtaghi, Akash Patel, Chris de Vries, Renxia Wang, William Wong, and members of the Alexa AI Search team in Manhattan Beach, CA for their helpful feedback on earlier drafts of the manuscript. We also thank the anonymous reviewers for their thoughtful suggestions which led to an improved manuscript.

References

  • Brown et al. (1993) Peter F. Brown, Stephen Della-Pietra, Vincent Della-Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation. Computational Linguistics.
  • Buck and Koehn (2016a) Christian Buck and Philipp Koehn. 2016a. Findings of the WMT 2016 Bilingual Document Alignment Shared Task. In WMT 2016, pages 554–563, Berlin, Germany.
  • Buck and Koehn (2016b) Christian Buck and Philipp Koehn. 2016b. Quick and Reliable Document Alignment via TF/IDF-weighted Cosine Distance. In WMT 2016, pages 672–678, Berlin, Germany.
  • Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder for English. In EMNLP: System Demonstrations, pages 169–174, Brussels, Belgium.
  • Dara and Lin (2016) Aswarth Abhilash Dara and Yiu-Chang Lin. 2016. YODA system for WMT16 shared task: Bilingual document alignment. In WMT 2016, Berlin, Germany.
  • El-Kishky and Guzmán (2020) Ahmed El-Kishky and Francisco Guzmán. 2020. Massively multilingual document alignment with cross-lingual sentence-mover’s distance. In AACL-IJCNLP 2020, pages 616–625, Suzhou, China.
  • Gomes and Pereira Lopes (2016) Luís Gomes and Gabriel Pereira Lopes. 2016. First steps towards coverage-based document alignment. In WMT 2016, pages 697–702, Berlin, Germany.
  • Jiang et al. (2020) Zhuolin Jiang, Amro El-Jaroudi, William Hartmann, Damianos Karakos, and Lingjun Zhao. 2020. Cross-lingual information retrieval with BERT. In Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020), pages 26–31, Marseille, France. European Language Resources Association.
  • Koehn (2005) Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In MT Summit 2005, pages 79–86, Phuket, Thailand.
  • Lample et al. (2018) Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In ICLR 2018.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. Volume 32 of Proceedings of Machine Learning Research, pages 1188–1196, Beijing, China. PMLR.
  • Lewis (2009) M. Paul Lewis, editor. 2009. Ethnologue: Languages of the World, sixteenth edition. SIL International, Dallas, TX, USA.
  • Munteanu et al. (2004) Dragos Stefan Munteanu, Alexander Fraser, and Daniel Marcu. 2004. Improved machine translation performance via parallel sentence extraction from comparable corpora. In HLT-NAACL 2004, pages 265–272, Boston, Massachusetts, USA.
  • Och and Ney (2003) Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
  • Pal et al. (2014) Santanu Pal, Partha Pakray, and Sudip Kumar Naskar. 2014. Automatic building and using parallel resources for SMT from comparable corpora. In HyTra.
  • Pouliquen et al. (2004) Bruno Pouliquen, Ralf Steinberger, Camelia Ignat, Emilia Käsper, and Irina Temnikova. 2004. Multilingual and cross-lingual news topic tracking. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 959–965, Geneva, Switzerland. COLING.
  • Resnik (1999) Philip Resnik. 1999. Mining the web for bilingual text. In ACL 1999, pages 527–534, College Park, Maryland, USA.
  • Resnik and Smith (2003) Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3):349–380.
  • Sánchez-Cartagena et al. (2018) Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz-Rojas, and Gema Ramírez-Sánchez. 2018. Prompsit’s submission to the WMT 2018 Parallel Corpus Filtering Shared Task. In WMT 2018, Brussels, Belgium.
  • Smith et al. (2013) Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel text from the common crawl. In ACL 2013, pages 1374–1383, Sofia, Bulgaria.
  • Steinberger et al. (2002) Ralf Steinberger, Bruno Pouliquen, and Johan Hagman. 2002. Cross-lingual document similarity calculation using the multilingual thesaurus EUROVOC. In Computational Linguistics and Intelligent Text Processing, Third International Conference, CICLing’2002, number 2276 in Lecture Notes in Computer Science, LNCS, pages 415–424. Springer-Verlag.
  • Uszkoreit et al. (2010) Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In COLING 2010, China.
  • Vu et al. (2009) Thuy Vu, AiTi Aw, and Min Zhang. 2009. Feature-based method for document alignment in comparable news corpora. In EACL 2009, pages 843–851, Athens, Greece.
  • Vulic and Moens (2015) Ivan Vulic and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. Pages 363–372. ACM, New York, NY.
  • Xu and Koehn (2017) Hainan Xu and Philipp Koehn. 2017. Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora. In EMNLP 2017, Denmark.