
Department of Computer Science and Electrical Engineering,
Handong Global University, Pohang, South Korea
Email: [email protected], [email protected]

End-to-End Training of Back-Translation Framework with Categorical Reparameterization Trick

DongNyeong Heo (ORCID 0000-0002-8765-6744), Heeyoul Choi (ORCID 0000-0002-0855-8725)
Abstract

Back-translation (BT) is an effective semi-supervised learning framework in neural machine translation (NMT). A pre-trained NMT model translates monolingual sentences and makes synthetic bilingual sentence pairs for the training of the other NMT model, and vice versa. Understanding the two NMT models as inference and generation models, respectively, previous works applied the training method of the variational auto-encoder (VAE), a mainstream framework for generative models. However, the discrete property of translated sentences prevents gradient information from flowing between the two NMT models. In this paper, we propose the categorical reparameterization trick (CRT), which makes NMT models generate differentiable sentences so that the VAE’s training framework can work in an end-to-end fashion. Our BT experiment conducted on a WMT benchmark dataset demonstrates the superiority of our proposed CRT compared to the Gumbel-softmax trick, a popular reparameterization method for categorical variables. Moreover, our experiments conducted on multiple WMT benchmark datasets demonstrate that our proposed end-to-end training framework is effective in terms of BLEU scores, not only compared to its counterpart baseline, which is not trained in an end-to-end fashion, but also compared to other previous BT works. The code is available at https://github.com/Nunpuking/End-To-End-Backtranslation.

Keywords:
Deep Learning · Natural Language Processing · Back-Translation · Variational Auto-Encoder · Reparameterization Trick

1 Introduction

Supervised learning algorithms for the neural machine translation (NMT) task have shown outstanding performance along with the successes of deep learning [1, 39]. These algorithms perform well when a large bilingual corpus is available. However, only a few language pairs have large bilingual corpora, while most other pairs do not. In addition, even when a language pair has a large bilingual corpus, it should be updated on a regular basis because language is not static over time: new words appear and existing words may disappear, corresponding to changes in culture, society, and generations. Therefore, supervised learning algorithms for the NMT task constantly suffer from a data-hungry situation and the expensive collection of bilingual corpora.

Unlike a bilingual corpus, a monolingual corpus is easy to collect. Therefore, semi-supervised learning algorithms that use additional monolingual corpora in various ways have been suggested [14, 41, 8, 9, 37]. Alongside these methods, back-translation (BT) methods have been proposed, showing significant performance improvements over the supervised learning algorithms [35, 18, 20, 10, 42, 40, 15].

The central idea of the BT method was first proposed by [35]. A target-to-source NMT (TS-NMT) model, pre-trained with only a bilingual corpus, translates target-language monolingual sentences into source-language sentences. The translated source-language sentences are then paired with the corresponding target-language monolingual input sentences to form synthetic pairs. By adding these synthetic pairs to the original bilingual corpus, the size of the total training corpus increases, similar to data augmentation [12]. This enlarged corpus is then used to train the source-to-target NMT (ST-NMT) model. The same process with source-language monolingual sentences can be applied to train the TS-NMT model.

The theoretical background of the BT method has been developed based on the auto-encoding framework [7, 42, 40], such as the variational auto-encoder (VAE) [26]. Considering the translated sentence as the inferred latent variable of the corresponding monolingual input sentence, BT can be understood as a reconstruction process. For example, the TS-NMT model infers a ‘latent sentence’ in the source-language domain given a target-language monolingual sentence as input, and then the ST-NMT model reconstructs the input target-language monolingual sentence from the latent sentence. This is the target-to-source-to-target (TST) process. In this process, the TS-NMT model approximates the posterior of the latent sentence as an ‘inference model’, and the ST-NMT model estimates the likelihood of the monolingual sentence as a ‘generation model’. Likewise, the source-to-target-to-source (STS) process is conducted in the opposite order, with the roles swapped, using the source-language monolingual corpus.

However, training the NMT models in the VAE framework is challenging for several reasons. First, the distribution of each word in the latent sentence is a discrete categorical distribution, and the non-differentiable latent sentence makes backpropagation into the inference model impossible. Second, the latent sentence should be a realistic sentence in that language domain, not an arbitrary one, to guarantee translation quality. Note that in conventional VAE models, the latent space is modeled as an isotropic Gaussian distribution without any other regularization on the space, though several works regularize the space to disentangle its dimensions [19, 24, 16]. If this issue is not addressed, the inference model can be trained to mistranslate the monolingual input sentence, focusing only on trivial reconstruction. Because of these challenges, previous works trained only the generation model or used the expectation-maximization (EM) algorithm [7, 42]. However, such optimization methods are not effective because the inference model cannot directly update its parameters along with the generation model with respect to the final objective function.

Figure 1: Examples of the English-to-German-to-English monolingual process of (a) previous BT and (b) BT with our proposed CRT. Given the English sentence ‘I need backprop to learn.’, they infer the German sentence ‘Ich brauche Backprop, um zu lernen.’ (only the first three words are illustrated for brevity). The solid black lines are forward propagations, and the thin red lines are backward propagations. While the sampling operation blocks the gradient flow in the previous works (a), the gradient flow detours to the inference model (English-to-German) in (b).

In this paper, we propose a new algorithm that handles the above challenges so that the two NMT models in the BT framework can be trained by end-to-end backpropagation, as in VAE. To overcome the non-differentiability issue, we propose the categorical reparameterization trick (CRT). By adding a conjugate part to the estimated categorical distribution, the CRT outputs a differentiable sentence, bypassing the non-differentiable sampling process. Therefore, end-to-end backpropagation finally updates the inference model to directly minimize the final objective function. The comparison of previous BT works and BT with our CRT is illustrated in Fig. 1. Additionally, we study the advantages of the CRT compared to the Gumbel-softmax trick (GST) [23], a popular reparameterization method for categorical distributions. In addition, in order to regularize the latent sentence to be realistic, we use the output distribution of a pre-trained language model [3, 29] as the prior distribution of the latent sentence. Finally, we propose a regularization technique that controls the amount of stochasticity in the inference of the latent sentence during training.

Our experiments conducted on multiple NMT benchmark datasets, such as WMT14 English-German and WMT18 English-Turkish [5], demonstrate the advantages of our proposed approaches compared to the baseline BT method, which is not trained using end-to-end backpropagation. Moreover, we compare our proposed approach with previous works that used comparable training settings (e.g., model size and dataset size). The results show considerable performance gains of our approach over previous BT methods.

2 Related Works

2.1 Back-translation (BT) for NMT

2.1.1 Original BT

Using a monolingual corpus for the NMT task often improves translation performance because it provides additional information for language understanding. As a practical method, the BT method was proposed [35, 22, 10, 11, 13, 6], and its main idea is to augment the bilingual corpus with synthetic pairs. To train the ST-NMT model, a TS-NMT model that was pre-trained with only the bilingual corpus translates target-language monolingual sentences into source-language synthetic sentences. The synthetic source-language sentences and the corresponding target-language monolingual input sentences then form synthetic pairs, which are used as additional bilingual sentence pairs.

2.1.2 Iterative BT

Motivated by the above approach, the iterative back-translation (IBT) method, which symmetrically conducts the original BT method multiple times, was proposed [20]. In other words, given pre-trained ST-NMT and TS-NMT models, they translate and make synthetic pairs from the source-language and target-language monolingual corpora, respectively. Each synthetic pair made by one NMT model is used to train the other NMT model. This process operates iteratively, so the quality of the synthetic pairs can be improved by the improved NMT models. Repeating this process often significantly outperforms the original BT method. The IBT method can be divided into three variants according to the update timing of the synthetic sentences: offline IBT, online IBT, and semi-online IBT. Offline IBT updates all synthetic sentences after the models converge within a single BT process. In contrast, online IBT updates a mini-batch of synthetic sentences at every iteration. Semi-online IBT follows the same strategy as online IBT, but it sometimes loads previous synthetic sentences from memory with a pre-defined probability [17], as sketched below. In this paper, we follow the semi-online IBT.
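As a rough sketch of this decision rule only (the helper names and the cache structure below are our assumptions, not an implementation from any of the cited works):

```python
import random

def get_synthetic_source(y_mono, ts_nmt, cache, p_reuse=0.5):
    """Semi-online IBT: with probability p_reuse, reuse the cached synthetic
    sentence; otherwise re-translate the monolingual input with the current
    TS-NMT model and refresh the cache."""
    if y_mono in cache and random.random() < p_reuse:
        return cache[y_mono]              # load a previous synthetic sentence
    x_syn = ts_nmt.translate(y_mono)      # fresh back-translation (assumed API)
    cache[y_mono] = x_syn                 # store for possible reuse later
    return x_syn
```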

2.1.3 Probabilistic Framework of IBT

Following the IBT method, a probabilistic framework was proposed that considers the translated synthetic sentence as an inferred latent variable, namely an aligned representation of the monolingual input sentence [7, 42]. As described in Section 1, this BT framework is naturally connected to the VAE, which is based on the auto-encoding framework. For the reader’s convenience, we summarize the important terminology again. In the source-to-target-to-source (STS) process, the ST-NMT model plays the role of the inference model (approximated posterior estimator), and it infers a latent sentence in the target-language domain given a source-language monolingual sentence. The TS-NMT model plays the role of the generation model (likelihood estimator), and it generates the input source-language monolingual sentence. In the target-to-source-to-target (TST) process, the ST-NMT and TS-NMT play the opposite roles.

However, it is still challenging to train the NMT models in the BT framework because of the non-differentiable latent sentences in the inference stage. Previous works proposed the expectation-maximization algorithm [7, 42] or backpropagation that ignores the update of the inference model [40]. In these cases, the inference model loses the learning signal that would have been propagated from the generation model. Therefore, it might settle in a worse local optimum because of the inefficient optimization. In this paper, we propose a new trick that reparameterizes the inferred latent sentence so that end-to-end backpropagation becomes feasible, as in the conventional VAE.

2.2 Binary Reparameterization Trick

Learning discrete representations in neural networks has several advantages. First, they are well suited to representing discrete variables such as characters or words in natural language. Second, they can be implemented efficiently at the hardware level: lower memory cost and faster matrix multiplication than continuous representations are attractive properties [21, 33].

However, because of the non-differentiable property of discrete representations, a gradient cannot be backpropagated through them. To overcome this challenge, the straight-through estimator (STE) was proposed [4], which treats the discretizing operation (e.g., sampling or argmax) as the identity function during the backward pass, passing the gradient straight through. Although the STE imposes a bias on the lower layer’s gradient estimation, it has been empirically demonstrated to be an effective gradient estimator in the training of binary neural networks [21, 33]. The STE can be implemented with automatic differentiation tools such as PyTorch [32, 29]. With a smart reparameterization technique, it outputs a differentiable binary representation as follows.

s ∼ Bern(p),  s ∈ {0, 1},  (1)
c = s(1 − p) + (1 − s)(−p),  (2)
z = p + detach(c),  (3)

where p is the normalized probability of the binary variable, Bern(p) is the Bernoulli distribution with parameter p, and s is its random variable. detach(c) is the operation that detaches its input, c, from the gradient computation graph in the backward propagation stage. As a result, the final output, z, is a binary variable, z ∈ {0, 1}, and the gradient can flow through the first term, p. As a variant of the STE, it is easy to implement and works with other network architectures [34].
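As an illustration, a minimal PyTorch sketch of Eqs. (1)–(3), where .detach() plays the role of the detach operation (the function and tensor names are ours):

```python
import torch

def binary_reparameterize(p: torch.Tensor) -> torch.Tensor:
    """Eqs. (1)-(3): the output z is numerically equal to the Bernoulli
    sample s, but the gradient flows to the lower layers through p."""
    s = torch.bernoulli(p)            # Eq. (1): non-differentiable sample
    c = s * (1 - p) + (1 - s) * (-p)  # Eq. (2): conjugate part, so p + c == s
    return p + c.detach()             # Eq. (3): binary output, differentiable w.r.t. p

logits = torch.randn(4, requires_grad=True)
z = binary_reparameterize(torch.sigmoid(logits))  # z ∈ {0, 1}
z.sum().backward()                                # gradients reach `logits`
```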

3 Proposed Method: E2E BT

In this section, we propose a new method, E2E BT, for end-to-end training of the BT framework like VAE. In Section 3.1, we propose a new reparameterization trick to handle the non-differentiable property of the latent sentence. In Section 3.2, we derive objective functions for training in the BT framework with our proposed reparameterization trick. Also, we propose to use the language model’s output distribution as an appropriate prior distribution of the latent sentence. Finally, in Section 3.3, we propose a regularization technique that anneals the stochasticity of the latent sentence inference process during training.

3.1 Categorical Reparameterization Trick (CRT)

The distribution of a sentence is based on a sequence of categorical distributions over words, and a sentence sampled from this distribution is non-differentiable. To make backpropagation feasible through the sentence, we propose the CRT, a reparameterization trick for the categorical distribution, inspired by the binary reparameterization trick in Section 2.2. Because we handle the categorical distribution (also called the Multinoulli distribution), the Bernoulli distribution in Eq. (1) is replaced by the Multinoulli distribution with the probability vector 𝒑 computed by the inference model. Then, a one-hot vector for one word, 𝒔 ∈ {0, 1}^{|V|} s.t. Σ_{i=1}^{|V|} 𝒔_i = 1, is sampled from the distribution, where V is the vocabulary set. Instead of stochastic sampling from the distribution, we can also select 𝒔 arbitrarily, which allows the CRT to output a one-hot vector for a class that does not have the highest probability. The CRT process is formulated as follows:

𝒔 ∼ Mult(𝒑), or 𝒔 is given,  (4)
𝒄 = 𝒔 ⊙ (𝟏 − λ𝒑) + (𝟏 − 𝒔) ⊙ (−λ𝒑),  (5)
𝒛 = λ𝒑 + detach(𝒄),  (6)

where ⊙ is the element-wise multiplication and Mult(𝒑) is the Multinoulli distribution given the normalized probability vector 𝒑. Based on 𝒑, we compute the non-differentiable conjugate part, 𝒄, which is determined by the sample 𝒔. Finally, the trick outputs a one-hot encoded vector 𝒛 which consists of the 𝒑 and 𝒄 parts. 𝒄 is detached from the computation graph, so that backpropagation can flow the gradient into the lower layers through 𝒑. We multiply 𝒑 by the scalar λ in Eqs. (5) and (6) to control the amount of gradient that flows through 𝒑 while ensuring that the output 𝒛 remains a one-hot vector. Fig. 1 illustrates the whole process of this trick in the BT framework.

Given the Multinoulli distribution, Mult(𝒑), the output process of the one-hot vector 𝒔 is implemented by a sampling operator (e.g., categorical sampling or argmax) and a one-hot encoding function that maps an integer value to a |V|-dimensional one-hot vector.
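A minimal PyTorch sketch of Eqs. (4)–(6) under these conventions (the function and argument names are ours); the index argument implements the ‘𝒔 is given’ option of Eq. (4):

```python
import torch
import torch.nn.functional as F

def crt(probs: torch.Tensor, lam: float = 0.005,
        index: torch.Tensor = None, stochastic: bool = False) -> torch.Tensor:
    """Categorical reparameterization trick, Eqs. (4)-(6).
    probs: (..., |V|) normalized word probabilities from the inference model.
    index: optional pre-selected word indices ('s is given' in Eq. (4))."""
    if index is None:
        if stochastic:                                   # categorical sampling
            flat = probs.reshape(-1, probs.size(-1))
            index = torch.multinomial(flat, 1).reshape(probs.shape[:-1])
        else:                                            # greedy selection
            index = probs.argmax(dim=-1)
    s = F.one_hot(index, num_classes=probs.size(-1)).to(probs.dtype)  # Eq. (4)
    c = s * (1 - lam * probs) + (1 - s) * (-lam * probs)              # Eq. (5)
    return lam * probs + c.detach()                                   # Eq. (6): exact one-hot output
```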

3.1.1 Comparison with the Gumbel-Softmax Trick (GST)

Like our proposed CRT, the GST is a reparameterization trick for the categorical distribution [23]. The GST process for the i-th class is formulated as follows:

g_i ∼ Gumbel(0, 1),  (7)
z_i = exp((log p_i + g_i)/τ) / Σ_{j=1}^{|V|} exp((log p_j + g_j)/τ),  (8)

where the scalar τ is the softmax temperature, which influences the sharpness of the output distribution in Eq. (8). Typically, a low value of τ is used, which encourages the distribution to approximate a one-hot vector by pushing the probability of the maximum value towards one while pushing the probabilities of the other values towards zero. In addition, the straight-through Gumbel-softmax trick (ST-GST) computes a strict one-hot vector while preserving the gradient flow by using the trick below:

z_i^st = 𝟏(argmax_j z_j = i) − detach(z_i) + z_i.  (9)
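For reference, a minimal PyTorch sketch of the ST-GST in Eqs. (7)–(9), assuming the input is log 𝒑 (the function name is ours); PyTorch’s built-in torch.nn.functional.gumbel_softmax(logits, tau, hard=True) provides an equivalent operation:

```python
import torch

def st_gumbel_softmax(log_probs: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Straight-through Gumbel-softmax, Eqs. (7)-(9); log_probs = log p."""
    u = torch.rand_like(log_probs)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)     # Eq. (7): Gumbel(0, 1) noise
    z = torch.softmax((log_probs + g) / tau, dim=-1)  # Eq. (8): relaxed one-hot
    hard = torch.zeros_like(z).scatter_(-1, z.argmax(dim=-1, keepdim=True), 1.0)
    return hard - z.detach() + z                      # Eq. (9): straight-through output
```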

Compared to the GST, we believe that the CRT has three benefits in the E2E BT training framework.

  • (1) Fast Computation: the CRT has less computation cost than the GST (and ST-GST).

  • (2) Controllable Gradient: the CRT can control the amount of backward propagated gradient, while the GST cannot and the ST-GST needs modification.

  • (3) Flexible Output: the CRT is able to determine its one-hot vector output regardless of the distribution, while the GST (and ST-GST) cannot.

We explain each of these benefits in detail below.

First, the CRT is computationally cheaper because the GST applies the softmax function twice while the CRT applies it only once. The two softmax operations of the GST are as follows: one to compute the normalized word probability vector so that its scale matches the Gumbel distribution’s samples, and the other for the final output [23]. The softmax function is expensive, especially when the number of classes is large, which is usually the case in natural language processing tasks. In our experiments, we measured the time spent on the reparameterization processes of the GST, in the PyTorch library’s implementation [29], and of our CRT. The vocabulary size, sentence length, and mini-batch size were set to 30000, 50, and 60, respectively. As a result, the GST spent 3.35 seconds while our CRT spent 0.98 seconds, which is more than three times faster.

Also, our CRT can simply control the amount of backpropagated gradient. During E2E BT training, backpropagation might lead the inference model to learn undesirable degenerate solutions, such as copying the input for the easiest reconstruction. Therefore, the scale of the gradient needs to be adjusted to find a desirable solution. The CRT can adjust the backpropagated gradient by simply multiplying the word probability by the scalar λ in the reparameterization process, Eqs. (5) and (6). Then, λ plays the role of an additional learning rate only for the inference model. We can set two coefficients, λ_x and λ_y, for the two languages, respectively, if they need different controls. In the GST, this trick is hard to implement, and the ST-GST needs a modification that multiplies the last two terms by λ.

Lastly, the CRT can reparameterize any word, while the GST and ST-GST reparameterize only the word that has the maximum probability. This is a crucial benefit, especially when the BT framework is implemented with the Transformer architecture [39], which is the standard model in various natural language processing tasks these days. The inference process of the BT framework is free-running: the Transformer re-estimates every word’s probability at each step until the translation finishes. At every step, the Transformer estimates the distributions of all words, including the previously output words. Thus, the distribution of a previous word may change due to the stochasticity of the model, such as dropout [38]. That is, the maximally probable word in the final estimation might not be the same as the word sampled previously (see Fig. 2). This situation can be even more frequent when stochastic sampling is used as the word sampling method. When there is a discrepancy between the previously sampled words and the final words, the gradient of the inference model is disturbed in the wrong direction: the generation model computes and backpropagates the gradient based on the final output words, while the inference model receives the gradient assuming that it was computed based on different words.

Figure 2: Example of the BT process with different reparameterization tricks, our CRT or the GST, where the third word’s distribution at the third step (green) changes at the next step (pink). Our CRT can output the previous word, w_c, while the GST only outputs the maximally probable word, w_a, as the final output.

Fig. 2 gives an example with the vocabulary set {w_a, w_b, w_c, w_d}. At the third step of the free-running inference, the word w_c is sampled from the third distribution (marked in green). However, at the next and final step, the newly computed third distribution (marked in pink) differs from the previous one. The GST then samples a different word that has the maximum probability (w_a in the figure) as the final output of the latent sentence, which is not the original conditional word, w_c, of the final estimation. The generation model then computes and backpropagates the gradient through the last word, w_d, and it enforces the inference model to update its probability of w_d given the final latent sentence, (BOS, w_a, w_b, w_a), as conditional words. However, the inference model will update its parameters assuming that the conditional words were (BOS, w_a, w_b, w_c). Contrary to the GST and ST-GST, our CRT avoids this problem by simply selecting w_c (‘𝒔 is given’ in Eq. (4)) instead of the maximum word.

We analyze the feasibility of the second and third benefits in the experiment section to understand how these properties are crucial in E2E BT training.

3.2 End-to-End Training in the BT Framework

In this section, we describe the objective functions for our E2E BT training. The final objective function can be decomposed into two terms corresponding to two types of processes: the bilingual and monolingual processes. The first objective term, for the bilingual process, is the cross-entropy as follows:

𝒥ST(θ)\displaystyle\mathcal{J}_{ST}(\theta) =(xb,yb)Dblogpθ(yb|xb),\displaystyle=-\sum_{(x^{b},y^{b})\in D^{b}}\log{p_{\theta}(y^{b}|x^{b})}, (10)

where x^b and y^b are a source-language sentence and its paired target-language sentence in the bilingual corpus, D^b, respectively, and θ is the parameter of the ST-NMT model. Likewise, the objective function for the TS-NMT model, 𝒥_TS(ϕ), is computed in the same way by switching the source and target sentences and replacing θ with the TS-NMT model’s parameter, ϕ.

On the other hand, the objective function for the monolingual process is defined by the negative log likelihood of the monolingual sentence as follows:

𝒥_TST = − Σ_{y^m ∈ D^y} log p(y^m),  (11)

where y^m is a target-language monolingual sentence in its monolingual corpus, D^y. The formulations are provided only for the TST process, but the STS process is simply symmetric to it.

As usual, log p(y^m) in Eq. (11) can be marginalized over latent sentences as follows:

log p(y^m) = log Σ_{x̂} p(y^m | x̂) p(x̂),

where x̂ is the inferred latent sentence, i.e., the aligned representation of y^m in the source-language domain, and p(x̂) is a prior distribution of x̂. By introducing an approximated posterior distribution, q(x̂ | y^m), from which it is easy to sample x̂ given y^m, we can derive the evidence lower bound objective (ELBO) of the marginal probability, log p(y^m), based on Jensen’s inequality as follows:

log p(y^m) ≥ Σ_{x̂} q(x̂ | y^m) log [ p(y^m | x̂) p(x̂) / q(x̂ | y^m) ]
          = 𝔼_{x̂ ∼ q(x̂|y^m)}[ log p(y^m | x̂) ] − D_KL[ q(x̂ | y^m) ∥ p(x̂) ],

where D_KL is the Kullback-Leibler (KL) divergence. In this TST process of the BT framework, the inference model, q(x̂ | y^m), is modeled by the TS-NMT model with parameter ϕ, and the generation model, p(y^m | x̂), is modeled by the ST-NMT model with parameter θ.

As mentioned in Section 1, the latent sentence, x̂, should be a valid sentence in the source-language domain. Therefore, we use a language model’s output distribution [3, 29] as the prior distribution in the KL term, p(x̂), instead of the isotropic Gaussian used in the conventional VAE. In practice, the ELBO of the TST monolingual process is formulated as follows:

𝒥^ELBO_TST(ϕ, θ) = Σ_{y^m ∈ D^y} 𝔼_{x̂ ∼ q_ϕ(x̂|y^m)}[ log p_θ(y^m | x̂) ] − α_x D_KL[ q_ϕ(x̂ | y^m) ∥ p_{ψ_x}(x̂) ],  (12)

where p_{ψ_x}(x̂) is the pre-trained, fixed source language model, and α_x is a hyperparameter that controls the effect of the KL term in the final objective function. Importantly, to make backpropagation feasible through the latent sentence, we apply the CRT to the latent sentence, x̂. To feed the latent sentence, which comprises a sequence of |V|-dimensional one-hot vectors, into the ST-NMT model, we adapted the conventional embedding layer: whereas this layer usually accepts an integer index and returns the corresponding vector from the embedding lookup table, our version outputs a weighted sum of embedding vectors using the latent sentence vector’s (one-hot) elements as the weights.
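A minimal sketch of such an adapted embedding layer (the class name is ours): multiplying a one-hot input by the embedding table recovers exactly the selected embedding while keeping the path differentiable.

```python
import torch
import torch.nn as nn

class OneHotEmbedding(nn.Module):
    """Embedding layer that accepts |V|-dimensional one-hot vectors (the CRT
    output) instead of integer indices, keeping the latent sentence differentiable."""
    def __init__(self, vocab_size: int, emb_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, emb_dim) * 0.02)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, seq_len, |V|); weighted sum over the vocabulary axis.
        # When z is exactly one-hot, this equals the usual table lookup.
        return z @ self.weight
```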

Finally, the total objective functions for θ\theta and ϕ\phi are as follows:

𝒥_T(θ, ϕ) = 𝒥_ST(θ) − 𝒥^ELBO_TST(ϕ, θ),  (13)
𝒥_S(ϕ, θ) = 𝒥_TS(ϕ) − 𝒥^ELBO_STS(θ, ϕ).  (14)

The parameters of both translation models are updated by the two objective functions.
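To make the combination concrete, below is a schematic (not our exact implementation) of how the terms of Eqs. (10), (12), and (13) could be assembled for one TST mini-batch; every model interface (log_prob, word_probs, log_prob_from_onehot, log_probs) is a hypothetical placeholder, and crt refers to the sketch in Section 3.1:

```python
import torch.nn.functional as F

def tst_objective(st_nmt, ts_nmt, lm_x, batch, alpha_x=0.0005, lam=0.005):
    """Schematic of J_T in Eq. (13): bilingual cross-entropy (Eq. (10)) plus
    the negative TST ELBO (Eq. (12)). All model interfaces are assumptions."""
    # Bilingual term, Eq. (10): ST-NMT negative log-likelihood on paired data.
    ce = -st_nmt.log_prob(batch["x_b"], batch["y_b"]).mean()

    # Monolingual term, Eq. (12): infer a differentiable latent sentence ...
    probs_x = ts_nmt.word_probs(batch["y_m"])   # q_phi(x_hat | y^m), shape (B, T, |V|)
    z_x = crt(probs_x, lam=lam)                 # one-hot, gradient-carrying (Sec. 3.1)
    # ... reconstruct y^m from it, and pull q_phi toward the fixed LM prior.
    recon = -st_nmt.log_prob_from_onehot(z_x, batch["y_m"]).mean()
    kl = F.kl_div(lm_x.log_probs(z_x), probs_x, reduction="batchmean")
    return ce + recon + alpha_x * kl            # minimized w.r.t. both theta and phi
```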

3.3 Regularization Technique: Annealing Stochasticity (AS)

In this section, we propose an additional regularization technique related to the sampling method of the latent sentence, which further improves the approach proposed in Section 3.2. As argued in [10], using stochastic sampling in the inference gives more chances to find a better solution than the greedy method. However, adding more noise to the inference from the beginning of training can cause the generation model to learn an undesirable translation mapping, especially when the bilingual corpus is small. Considering this trade-off, we anneal the ratio of stochastic sampling during the training process, as in the scheduled sampling approach in NMT [2]. For example, in the semi-online IBT framework, we newly infer a latent sentence for a monolingual sentence whenever we do not load the previous latent sentence from memory. In that case, we infer the latent sentence only with the greedy method at the beginning and slowly increase the ratio of stochastic sampling as training goes on (see the sketch below).
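A minimal sketch of this schedule, assuming the linear 300K-iteration ramp described in Section 4 (the function names are ours):

```python
import random

def stochastic_ratio(step: int, total_steps: int = 300_000, max_ratio: float = 1.0) -> float:
    """Linearly anneal the stochastic-sampling ratio from 0 to max_ratio
    (max_ratio is 1.0 for 'En-De' and 0.5 for 'En-Tr' in Section 4)."""
    return min(step / total_steps, 1.0) * max_ratio

def use_stochastic_sampling(step: int) -> bool:
    # Greedy selection early in training, increasingly stochastic sampling later.
    return random.random() < stochastic_ratio(step)
```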

4 Experiments and Results

Our experiments include internal ablation studies and comparisons with previous BT works on two benchmark datasets, WMT18 English-German (En-De) and WMT18 English-Turkish (En-Tr) [5], serving as large and small datasets, respectively. The bilingual corpora of the ‘En-De’ and ‘En-Tr’ experiments consist of 5.2M and 0.2M sentence pairs, respectively. We collected 5.0M monolingual sentences for both languages of the ‘En-De’ experiments and 4.7M for the ‘En-Tr’ experiments. The monolingual corpora were randomly selected from the NewsCrawl datasets provided by the WMT18 translation task [28]. For dataset pre-processing, including tokenization, byte-pair encoding [36], and vocabulary construction, we followed the same procedure as [10], based on the open-source fairseq toolkit (https://github.com/facebookresearch/fairseq/tree/main/examples/backtranslation). We selected the 32K and 10K most frequent subwords to build the vocabulary for each dataset, respectively. For validation, we used Newstest13 (3K pairs) and Newsdev16 (1K pairs) for ‘En-De’ and ‘En-Tr’, respectively, from WMT18. Lastly, for testing, we used Newstest14∼18 and Newstest16∼18 for ‘En-De’ and ‘En-Tr’, respectively.

For the basic model architecture of the NMT models, we implemented the Transformer model with the base configuration of the original paper [39]: the number of layers in the encoder and decoder is 6, and the dimensionality of the hidden states and word embeddings is 512. We refer readers to the original paper for more specific configurations. As mentioned in Section 3.2, we used the output distributions of language models as the prior distributions of the latent sentences for each language. We adopted two Transformer-based language models [29], English and German, which were pre-trained with the bilingual and monolingual corpora of the ‘En-De’ experiment. Likewise, two other Transformer-based language models, English and Turkish, were pre-trained with the bilingual and monolingual corpora of the ‘En-Tr’ experiment. We note that we did not use any additional data for the pre-training of the language models.

For optimization, we used the training strategy of the fairseq toolkit [27]. The optimizer was Adam [25] with a learning rate of 0.001, and the inverse square root scheduler was used for the learning rate schedule. In each iteration, we built mini-batches of 8K and 4K tokens for the ‘En-De’ and ‘En-Tr’ internal ablation studies (Section 4.1), respectively. For the comparisons with previous BT works (Section 4.2), we built mini-batches of 32K tokens for the ‘En-De’ experiments. We pre-trained the ST-NMT and TS-NMT models with only the bilingual corpus using this optimization setting, and we used the same setting for the pre-training of the language models.

For our main E2E BT training, we followed the semi-online IBT framework. Each mini-batch contains bilingual and monolingual tokens with 1:1 and 1:4 ratios for ‘En-De’ and ‘En-Tr’, respectively, within the total token budgets mentioned above. We set both λ_x and λ_y to 0.005 for both the ‘En-De’ and ‘En-Tr’ experiments. We set both α_x and α_y to 0.0005 for the ‘En-De’ experiment and to 0.0001 for the ‘En-Tr’ experiment. For the annealing schedule of the AS regularization, we linearly increased the ratio of stochastic sampling from 0.0 to 1.0 over 300K iterations for the E2E BT training on the ‘En-De’ dataset. Similarly, we increased the ratio of stochastic sampling from 0.0 to 0.5 over 300K iterations for ‘En-Tr’.

4.1 Internal Ablation Study

In this section, we present the experimental results of our internal ablation study to understand the advantages of our proposed E2E BT training and AS regularization on top of the bilingual NMT model and the basic semi-online IBT (not E2E training) with the greedy selection method for latent sentence inference. In addition to the ablation study of the ‘En-Tr’ experiment, we also compare our proposed CRT and the ST-GST as different reparameterization methods in E2E BT training. To evaluate translation quality, we used case-sensitive SacreBLEU [31] with the ‘13a’ tokenizer.

Table 1: Ablation study of ‘En-Tr’ experiments in BLEU score. We used the beam search with a width of 5. In each evaluation, the left and right numbers of ‘/’ mean En-to-Tr and Tr-to-En results.
Model Newstest16 Newstest17 Newstest18 Average
Transformer 11.54/15.62 12.11/15.66 10.58/16.57 11.41/15.95
BasicBT 16.26/22.67 18.69/22.10 15.37/23.15 16.77/22.64
BasicBT+AS 16.50/22.48 18.74/21.86 15.58/23.35 16.94/22.56
E2E BT+AS 16.68/23.44 19.29/22.28 15.76/24.12 17.24/23.28

Table 1 presents the BLEU scores of each model on each test set. ‘Transformer’ indicates the NMT model trained with only the bilingual corpus. ‘BasicBT’ is the basic semi-online IBT method that does not train the two NMT models (ST-NMT and TS-NMT) in an end-to-end fashion. On top of ‘BasicBT’, we applied AS first and then E2E BT, because the original VAE uses stochastic sampling to infer the latent variable during training, whereas ‘BasicBT’ is based on greedy selection. Therefore, before applying E2E BT, we applied AS to provide an appropriate amount of stochasticity during inference at each iteration. Our proposed AS regularization shows a slight improvement in En-to-Tr translation. Interestingly, applying E2E BT training to the ‘BasicBT+AS’ model noticeably increases BLEU scores in general.

Table 2: Comparisons of reparameterization methods in BLEU score. We used the beam search with a width of 5. In each evaluation, the left and right numbers of ‘/’ mean En-to-Tr and Tr-to-En results. ‘E2E BT w/ CRT’ is the same model as ‘E2E BT+AS’ in Table 1.
Model Newstest16 Newstest17 Newstest18 Average
E2E BT w/ CRT 16.68/23.44 19.29/22.28 15.76/24.12 17.24/23.28
E2E BT w/ ST-GST 2.51/3.61 2.54/3.02 2.42/3.99 2.49/3.54
E2E BT w/ ST-GST+λ 15.66/22.63 17.84/21.28 14.92/23.73 16.15/22.55
E2E BT w/ ST-GST+λ+FO 16.48/23.26 19.03/22.14 15.83/24.24 17.11/23.21

Table 2 shows the comparison between several reparameterization methods. First, we find that the ST-GST reparameterization method, ‘E2E BT w/ ST-GST’, is inappropriate for this E2E BT training, leading to significantly low BLEU scores. We interpret the absence of the last two advantageous properties of the CRT, namely ‘(2) Controllable Gradient’ and ‘(3) Flexible Output (FO)’ discussed in Section 3.1.1, as significant factors contributing to the poor performance. To check the effect of these two properties, we modified the ST-GST to acquire them. First, as in the CRT, we multiplied the last two terms of Eq. (9) by λ, giving ‘E2E BT w/ ST-GST+λ’, whose λ is set to 0.0005 as in ‘E2E BT w/ CRT’. We find that the gradient-controllability property is the most crucial factor in E2E BT training, because it can prevent the model from learning the degenerate solution (see Section 3.1.1). However, the performance of ‘E2E BT w/ ST-GST+λ’ is still lower than that of ‘E2E BT w/ CRT’. Therefore, we further modified the ST-GST to ensure the FO property by adding, in place of g_i in Eq. (8), a value that ensures the arbitrarily selected word has the highest value regardless of the output probability. We model this value as follows:

b_i = log( max{p_1, …, p_|V|} / min{p_1, …, p_|V|} ) + ε,  if i is the index to reparameterize,
b_i = 0,  otherwise,  (15)

where ε is a small scalar value. Finally, when we applied this final modified version of the ST-GST to E2E BT, ‘E2E BT w/ ST-GST+λ+FO’, it achieved performance similar to ‘E2E BT w/ CRT’. We believe that the current ST-GST needs such modifications to work in E2E BT training, while our proposed CRT is faster and more appropriate for E2E BT training.
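A small sketch of this bias term (Eq. (15)); the function name and tensor layout are our assumptions:

```python
import torch

def fo_bias(probs: torch.Tensor, index: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Eq. (15): a bias added in place of the Gumbel noise g_i so that the
    chosen index wins the argmax in Eq. (8) regardless of the probabilities."""
    gap = torch.log(probs.max(dim=-1).values / probs.min(dim=-1).values) + eps
    b = torch.zeros_like(probs)
    b.scatter_(-1, index.unsqueeze(-1), gap.unsqueeze(-1))
    return b   # used as softmax((log(probs) + b) / tau) in Eq. (8)
```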

Table 3: Ablation study of ‘En-De’ experiments in BLEU score. We used the beam search with a width of 5. In each evaluation, the left and right numbers of ‘/’ mean En-to-De and De-to-En results.
Model Newstest14 Newstest15 Newstest16 Newstest17 Newstest18 Average
Transformer 24.74/29.65 28.24/30.76 32.64/36.06 26.45/31.47 39.32/38.38 30.28/33.26
BasicBT 27.87/30.01 29.80/31.99 34.42/39.44 28.33/33.11 41.26/41.08 32.37/35.13
BasicBT+AS 28.27/32.21 30.26/33.48 34.58/40.88 28.48/34.98 41.71/42.47 32.66/36.80
E2E BT+AS 28.26/32.63 30.54/33.98 34.70/41.07 28.64/34.89 42.15/42.51 32.86/37.02

Table 3 presents the ablation study of the ‘En-De’ experiments. Broadly, our proposed AS regularization and E2E BT training show similar tendencies of performance gains as in the ‘En-Tr’ experiments. However, in this experiment, AS regularization shows a more noteworthy performance gain in De-to-En than in En-to-De. Based on these experiments, we argue that our proposed approaches are advantageous across several benchmark datasets.

4.2 Comparisons with Previous BT Works

In this section, we compare our approach with other recent BT works that are based on the base Transformer architecture [43, 13, 40, 30]. However, there are many differences in evaluation, such as different test sets and different BLEU score metrics. For fair comparisons, we report the average BLEU scores of each previous work and of our method on the previous work’s test sets with the same BLEU metric. We also report the size of the monolingual corpus each work used for BT training (note that we used 5M monolingual sentences for the ‘En-De’ experiments).

Table 4: Comparisons with previous BT works that conducted ‘En-De’ experiments based on the base Transformer architecture. In each evaluation, the left and right numbers of ‘/’ mean the En-to-De and De-to-En averaged BLEU scores. Based on the open-source code, we found that the BLEU metric used by (Z. Zheng et al. 2019) was case-insensitive SacreBLEU with the ‘intl’ tokenizer. We could not find the specific BLEU metric of (H. Pham et al. 2021), so we used tokenBLEU for the comparison.
Previous Work   Newstest   BLEU Metric   # of MonoData   Avg. BLEU (Prev. Work)   Avg. BLEU (Our Work)
(Z. Zheng et al. 2019)[43] 14 SacreBLEU 5M 30.30/33.80 30.37/34.21
(M. Graça et al. 2019)[13] 17 SacreBLEU 4M 28.60/ - 29.17/35.58
(W. Xu et al. 2020)[40] 16\sim18 SacreBLEU 5M 30.00/33.05 35.54/39.79
(H. Pham et al. 2021)[30] 14 unknown 220M 30.39/ - 30.77/33.97

Table 4 shows the comparison between each previous work’s average BLEU score and that of our proposed E2E BT model with AS regularization, evaluated with the previous work’s BLEU score metric and test sets. Our work consistently outperforms the previous works. In particular, the comparison with (H. Pham et al. 2021) indicates a striking efficiency of our proposed approach when we consider the amount of monolingual data they used.

5 Conclusion

In this paper, we proposed a categorical reparameterization trick that makes translated sentences differentiable. Based on this trick, backpropagation became feasible through the sentences, and we used it to train both translation models in an end-to-end fashion for back-translation. To train the models together in back-translation, we developed an evidence lower bound objective, which trains the translation models in a semi-supervised fashion. In addition, we proposed a regularization technique that is practically advantageous in back-translation training. Finally, our experimental results demonstrated that our proposed method learns better translation models, outperforming the baselines.

5.0.1 Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea, funded by the Ministry of Education (NRF-2022R1A2C1012633).

References

  • [1] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
  • [2] Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information processing systems 28 (2015)
  • [3] Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. Journal of machine learning research 3(Feb), 1137–1155 (2003)
  • [4] Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013)
  • [5] Bojar, O., Federmann, C., Fishel, M., Graham, Y., Haddow, B., Koehn, P., Monz, C.: Findings of the 2018 conference on machine translation (wmt18). In: Proceedings of the Third Conference on Machine Translation. vol. 2, pp. 272–307 (2018)
  • [6] Caswell, I., Chelba, C., Grangier, D.: Tagged back-translation. arXiv preprint arXiv:1906.06442 (2019)
  • [7] Cotterell, R., Kreutzer, J.: Explaining and generalizing back-translation through wake-sleep. arXiv preprint arXiv:1806.04402 (2018)
  • [8] Currey, A., Miceli-Barone, A.V., Heafield, K.: Copied monolingual data improves low-resource neural machine translation. In: Proceedings of the Second Conference on Machine Translation. pp. 148–156 (2017)
  • [9] Domhan, T., Hieber, F.: Using target-side monolingual data for neural machine translation through multi-task learning. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 1500–1505 (2017)
  • [10] Edunov, S., Ott, M., Auli, M., Grangier, D.: Understanding back-translation at scale. arXiv preprint arXiv:1808.09381 (2018)
  • [11] Fadaee, M., Monz, C.: Back-translation sampling by targeting difficult words in neural machine translation. arXiv preprint arXiv:1808.09006 (2018)
  • [12] Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016), http://www.deeplearningbook.org
  • [13] Graça, M., Kim, Y., Schamper, J., Khadivi, S., Ney, H.: Generalizing back-translation in neural machine translation. arXiv preprint arXiv:1906.07286 (2019)
  • [14] Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H.C., Bougares, F., Schwenk, H., Bengio, Y.: On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535 (2015)
  • [15] Guo, Y., Zhu, H., Lin, Z., Chen, B., Lou, J.G., Zhang, D.: Revisiting iterative back-translation from the perspective of compositional generalization. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 7601–7609 (2021)
  • [16] Hahn, S., Choi, H.: Disentangling latent factors of variational auto-encoder with whitening. In: International Conference on Artificial Neural Networks. pp. 590–603. Springer (2019)
  • [17] Han, J.M., Babuschkin, I., Edwards, H., Neelakantan, A., Xu, T., Polu, S., Ray, A., Shyam, P., Ramesh, A., Radford, A., et al.: Unsupervised neural machine translation with generative language models only. arXiv preprint arXiv:2110.05448 (2021)
  • [18] He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T.Y., Ma, W.Y.: Dual learning for machine translation. Advances in neural information processing systems 29, 820–828 (2016)
  • [19] Higgins, I., Matthey, L., Pal, A., Burgess, C.P., Glorot, X., Botvinick, M.M., Mohamed, S., Lerchner, A.: Beta-vae: Learning basic visual concepts with a constrained variational framework. In: ICLR (2017)
  • [20] Hoang, V.C.D., Koehn, P., Haffari, G., Cohn, T.: Iterative back-translation for neural machine translation. In: Proceedings of the 2nd Workshop on Neural Machine Translation and Generation. pp. 18–24 (2018)
  • [21] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. Advances in neural information processing systems 29 (2016)
  • [22] Imamura, K., Fujita, A., Sumita, E.: Enhancement of encoder and attention using target monolingual corpora in neural machine translation. In: Proceedings of the 2nd Workshop on Neural Machine Translation and Generation. pp. 55–63 (2018)
  • [23] Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax. In: International Conference on Learning Representations (ICLR 2017). OpenReview.net (2017)
  • [24] Kim, H., Mnih, A.: Disentangling by factorising. In: International Conference on Machine Learning. pp. 2649–2658. PMLR (2018)
  • [25] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [26] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
  • [27] Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., Auli, M.: fairseq: A fast, extensible toolkit for sequence modeling. In: Proceedings of NAACL-HLT 2019: Demonstrations (2019)
  • [28] Ott, M., Edunov, S., Grangier, D., Auli, M.: Scaling neural machine translation (2018)
  • [29] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019)
  • [30] Pham, H., Wang, X., Yang, Y., Neubig, G.: Meta back-translation. arXiv preprint arXiv:2102.07847 (2021)
  • [31] Post, M.: A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation: Research Papers. pp. 186–191. Association for Computational Linguistics, Belgium, Brussels (Oct 2018), https://www.aclweb.org/anthology/W18-6319
  • [32] Raiko, T., Berglund, M., Alain, G., Dinh, L.: Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989 (2014)
  • [33] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classification using binary convolutional neural networks. In: European conference on computer vision. pp. 525–542. Springer (2016)
  • [34] Rim, D.N., Jang, I., Choi, H.: Deep neural networks and end-to-end learning for audio compression. arXiv preprint arXiv:2105.11681 (2021)
  • [35] Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709 (2015)
  • [36] Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015)
  • [37] Skorokhodov, I., Rykachevskiy, A., Emelyanenko, D., Slotin, S., Ponkratov, A.: Semi-supervised neural machine translation with language models. In: Proceedings of the AMTA 2018 workshop on technologies for MT of low resource languages (LoResMT 2018). pp. 37–44 (2018)
  • [38] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1), 1929–1958 (2014)
  • [39] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
  • [40] Xu, W., Niu, X., Carpuat, M.: Dual reconstruction: a unifying objective for semi-supervised neural machine translation. arXiv preprint arXiv:2010.03412 (2020)
  • [41] Zhang, J., Zong, C.: Exploiting source-side monolingual data in neural machine translation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pp. 1535–1545 (2016)
  • [42] Zhang, Z., Liu, S., Li, M., Zhou, M., Chen, E.: Joint training for neural machine translation models with monolingual data. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
  • [43] Zheng, Z., Zhou, H., Huang, S., Li, L., Dai, X.Y., Chen, J.: Mirror-generative neural machine translation. In: International Conference on Learning Representations (2019)