
Sparse Regression for Machine Translation

Ergun Biçici
Department of Electrical and Computer Engineering
Koç University
34450 Sariyer, Istanbul, Turkey
[email protected]
Abstract

We use transductive regression techniques to learn mappings between source and target features of given parallel corpora and use these mappings to generate machine translation outputs. We show the effectiveness of $L_1$ regularized regression (lasso) compared with $L_2$ regularized regression for learning mappings between sparsely observed feature sets. Proper selection of training instances plays an important role in learning correct feature mappings within limited computational resources and at expected accuracy levels; we introduce the dice instance selection method, which improves the source and target coverage of the training set. We show that $L_1$ regularized regression performs better than $L_2$ regularized regression both in regression measurements and in translation experiments using graph decoding. We present encouraging results when translating from German to English and from Spanish to English. We also demonstrate results when the phrase table of a phrase-based decoder is replaced with the mappings we find with the regression model.

1 Introduction

Regression can be used to find mappings between the source and target feature sets derived from given parallel corpora. Transductive learning uses a subset of the training examples that are closely related to the test set, without using the model induced by the full training set. In the context of statistical machine translation, translations are performed at the sentence level, which enables us to select a few training instances for each test instance to guide the translation process. This also gives us a computational advantage considering the high dimensionality of the problem.

The goal in transductive regression based machine translation (RegMT) is both to reduce the computational burden of the regression approach by reducing the dimensionality of the training set and the feature set, and to improve the translation quality by using transduction. In an idealized feature mapping matrix where features are word sequences, we would like to observe only a few target features for each source feature, close to a permutation matrix with one nonzero item per column. $L_1$ regularization helps us achieve solutions close to permutation matrices by increasing sparsity.

We compare $L_1$ regularized regression with other regression techniques and show that it achieves better performance in estimating target features and in generating translations during graph decoding. We present encouraging results on German to English and Spanish to English translation tasks. We also replace the phrase table of a phrase-based decoder, Moses [Koehn et al., 2007], with the feature mappings we find with the RegMT model.

Related Work: Regression techniques can be used to model the relationship between strings [Cortes et al., 2007]. Wang et al. [Wang et al., 2007] apply a string-to-string mapping approach to machine translation by using ordinary least squares regression and $n$-gram string kernels on a small dataset. Later they use $L_2$ regularized least squares regression [Wang and Shawe-Taylor, 2008]. Although the translation quality they achieve is not better than that of Moses [Koehn et al., 2007], which is accepted to be the state of the art, they show the feasibility of the approach. Serrano et al. [Serrano et al., 2009] use kernel regression to find translation mappings from source to target feature vectors and experiment with translating hotel front desk requests. Locally weighted regression solves a separate weighted least squares problem for each instance, weighted by a kernel similarity function [Hastie et al., 2009].

Outline: Section 2 gives an overview of regression based machine translation, which is used to find the mappings between the source and target features of the training set. We present $L_1$ regularized transductive regression for alignment learning and the instance selection method used. In Section 3, we compare the performance of $L_1$ versus $L_2$ regularized regression in effectively estimating the target features. In Section 4, we present graph decoding results on German-English and Spanish-English translation tasks. Section 5 compares the results obtained with Moses under different learning settings. The last section presents our contributions.

2 Machine Translation Using Regression

Let $X$ and $Y$ correspond to the sets of tokens that can be used in the source and target strings. Then $m$ training instances are represented as $(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_m, \mathbf{y}_m) \in X^* \times Y^*$, where $(\mathbf{x}_i, \mathbf{y}_i)$ corresponds to a pair of source and target language token sequences for $1 \leq i \leq m$. Our goal is to find a mapping $f: X^* \rightarrow Y^*$ that can convert a source sentence to a target sentence sharing the same meaning in the target language (Figure 1).

Figure 1: String-to-string mapping. Source strings in $X^*$ are mapped into the feature space $F_X$ by $\Phi_X$ and target strings in $Y^*$ into $F_Y$ by $\Phi_Y$; the regression function $h: F_X \rightarrow F_Y$ together with the pre-image mapping $\Phi_Y^{-1}$ gives the translation function $f: X^* \rightarrow Y^*$.

We define feature mappers $\Phi_X: X^* \rightarrow F_X = \mathbb{R}^{N_X}$ and $\Phi_Y: Y^* \rightarrow F_Y = \mathbb{R}^{N_Y}$ that map each string sequence to a point in a high dimensional real vector space. Let $\mathbf{M}_X \in \mathbb{R}^{N_X \times m}$ and $\mathbf{M}_Y \in \mathbb{R}^{N_Y \times m}$ such that $\mathbf{M}_X = [\Phi_X(\mathbf{x}_1), \ldots, \Phi_X(\mathbf{x}_m)]$ and $\mathbf{M}_Y = [\Phi_Y(\mathbf{y}_1), \ldots, \Phi_Y(\mathbf{y}_m)]$. The ridge regression solution using $L_2$ regularization is found by minimizing the following cost:

$$\mathbf{W}_{L_2} = \mathop{\mathrm{arg\,min}}_{\mathbf{W} \in \mathbb{R}^{N_Y \times N_X}} \|\mathbf{M}_Y - \mathbf{W}\mathbf{M}_X\|_F^2 + \lambda \|\mathbf{W}\|_F^2. \qquad (1)$$

Two main challenges of the regression based machine translation (RegMT) approach are learning the regression function, $h: F_X \rightarrow F_Y$, and solving the pre-image problem, which, given the features of the estimated target string sequence, $h(\Phi_X(\mathbf{x})) = \Phi_Y(\hat{\mathbf{y}})$, attempts to find $\mathbf{y} \in Y^*$ such that $\mathbf{y} = \mathop{\mathrm{arg\,min}}_{\mathbf{y} \in Y^*} \|h(\Phi_X(\mathbf{x})) - \Phi_Y(\mathbf{y})\|^2$.
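Equation 1 has the closed-form solution $\mathbf{W}_{L_2} = \mathbf{M}_Y \mathbf{M}_X^\top (\mathbf{M}_X \mathbf{M}_X^\top + \lambda \mathbf{I})^{-1}$, obtained by setting the gradient of the cost to zero. A minimal sketch with dense NumPy matrices (the actual feature matrices are sparse; the function name and toy dimensions are ours):

```python
import numpy as np

def ridge_mapping(MX, MY, lam=1.0):
    """Closed-form solution of the ridge cost in Equation 1:
    setting the gradient to zero gives W (MX MX^T + lam I) = MY MX^T,
    with MX of shape (N_X, m) and MY of shape (N_Y, m)."""
    NX = MX.shape[0]
    A = MX @ MX.T + lam * np.eye(NX)      # (N_X, N_X); invertible for lam > 0
    return MY @ MX.T @ np.linalg.inv(A)   # W has shape (N_Y, N_X)
```

For $\lambda > 0$ the matrix $\mathbf{M}_X \mathbf{M}_X^\top + \lambda \mathbf{I}$ is always invertible, which is why the $L_2$ solution is straightforward while the $L_1$ solution below requires iterative methods.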

2.1 $L_1$ Regularized Regression

String kernels lead to sparse feature representations, and $L_1$ regularized regression is effective for finding the mappings between sparsely observed features. We would like to observe only a few nonzero target coefficients corresponding to a source feature in the coefficient matrix. $L_1$ regularization helps us achieve solutions close to permutation matrices by increasing sparsity [Bishop, 2006].

$\mathbf{W}_{L_2}$ is not a sparse solution and most of its coefficients remain nonzero. We are interested in penalizing the coefficients better: zeroing out the irrelevant ones leads to sparsification and a solution closer to a permutation matrix. The $L_1$ norm behaves both as a feature selection technique and as a method for reducing coefficient values.

$$\mathbf{W}_{L_1} = \mathop{\mathrm{arg\,min}}_{\mathbf{W} \in \mathbb{R}^{N_Y \times N_X}} \|\mathbf{M}_Y - \mathbf{W}\mathbf{M}_X\|_F^2 + \lambda \|\mathbf{W}\|_1. \qquad (2)$$

Equation 2 presents the lasso [Tibshirani, 1996] solution, where the regularization term is now the $L_1$ matrix norm defined as $\|\mathbf{W}\|_1 = \sum_{i,j} |W_{i,j}|$. $\mathbf{W}_{L_2}$ can be found by taking the derivative, but since the $L_1$ regularization cost is not differentiable, $\mathbf{W}_{L_1}$ is found by optimization or approximation techniques. We use forward stagewise regression (FSR) [Hastie et al., 2006], which approximates the lasso.
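A minimal sketch of forward stagewise regression for a single response vector (the matrix problem of Equation 2 decomposes into one such problem per target feature); the step size and iteration count here are illustrative, not the tuned values:

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=2000):
    """Forward stagewise regression (Hastie et al., 2006): repeatedly
    nudge the coefficient most correlated with the current residual by
    a small step eps, approximating the lasso regularization path."""
    w = np.zeros(X.shape[1])
    r = y.astype(float).copy()            # current residual
    for _ in range(n_steps):
        corr = X.T @ r                    # correlations with the residual
        j = int(np.argmax(np.abs(corr)))  # most correlated predictor
        delta = eps * np.sign(corr[j])
        w[j] += delta
        r -= delta * X[:, j]              # keep the residual in sync
    return w
```

The number of steps plays the role of the regularization strength: stopping early yields sparser coefficient vectors, which is how the iteration number is tuned in Section 3.1.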

2.2 Instance Selection

Proper selection of training instances plays an important role in learning feature mappings accurately with limited computational resources. Coverage of the features is important: if the correct features are not in the training matrices, we will not be able to translate them. Coverage is measured by the percentage of target features of the test set found in the training set. For each test sentence, we pick a limited number of training instances designed to improve the coverage of correct features and build a regression model. We use a technique that we call dice, which optimizes source language bigram coverage such that the difficulty of aligning source and target features is minimized. We define Dice's coefficient score as:

$$dice(x, y) = \frac{2C(x, y)}{C(x)C(y)}, \qquad (3)$$

where $C(x, y)$ is the number of times $x$ and $y$ co-occur and $C(x)$ is the count of observing $x$ in the selected training set. Given a test source sentence, $S$, we can estimate the goodness of a target sentence, $T$, by the sum of the alignment scores:

$$\phi_{dice}(S, T) = \frac{1}{|T| \log |S|} \sum_{x \in X(S)} \sum_{j=1}^{|T|} \sum_{y \in Y(x)} dice(y, T_j), \qquad (4)$$

where $X(S)$ stores the features of $S$ and $Y(x)$ lists the tokens in feature $x$. We assume that tokens are separated by the space character. The difficulty of word aligning a pair of training sentences, $(S, T)$, can be approximated by $|S|^{|T|}$, as each of the $|T|$ target tokens has $|S|$ source alternatives to map to. We use a normalization factor proportional to $|T| \log |S|$, which corresponds to the $\log$ of this value.
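A small sketch of the dice scores with unigram source features (so $Y(x) = \{x\}$) and sentence-level co-occurrence counts over a toy parallel corpus; the function names and corpus are ours:

```python
import math
from collections import Counter

def build_counts(parallel):
    """Sentence-level counts C(x), C(y), C(x, y) from a toy parallel
    corpus given as (source_tokens, target_tokens) pairs."""
    cs, ct, cst = Counter(), Counter(), Counter()
    for s, t in parallel:
        cs.update(set(s))
        ct.update(set(t))
        cst.update((x, y) for x in set(s) for y in set(t))
    return cs, ct, cst

def dice(x, y, cs, ct, cst):
    """Equation 3 (note the product C(x)C(y) in the denominator)."""
    if cs[x] == 0 or ct[y] == 0:
        return 0.0
    return 2.0 * cst[(x, y)] / (cs[x] * ct[y])

def phi_dice(S, T, cs, ct, cst):
    """Equation 4 restricted to unigram features, so Y(x) = {x}.
    Requires |S| > 1 for the log normalization."""
    total = sum(dice(x, tj, cs, ct, cst) for x in S for tj in T)
    return total / (len(T) * math.log(len(S)))
```

A well-aligned sentence pair scores higher than a mismatched one, which is what makes $\phi_{dice}$ usable for ranking candidate training instances.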

3 Regression Experiments

We experiment with different regression techniques and compare their performance with $L_2$ regularized regression and lasso. We show that lasso is able to identify features better than the other techniques we compared while making less error.

We use the German-English (de-en) parallel corpus of about 1.6 million sentences from WMT'10 [Callison-Burch et al., 2010] to select training instances. For development and test sentences, we select randomly among the sentences whose length is in the range $[10, 20]$ and whose target language bigram coverage is in the range $[0.6, 1]$. We select 20 random instances from each target coverage decimal fraction (i.e., 20 from $[0.6, 0.7]$, 20 from $[0.7, 0.8]$, etc.) to obtain sets of 100 sentences. We create in-domain dev, dev2, and test sets following this procedure, making sure that test set source sentences do not have exact matches in the training set. We use the $n$-spectrum weighted word kernel [Taylor and Cristianini, 2004] as feature mapper, which considers all word sequences up to order $n$:

$$k(\mathbf{x}, \mathbf{x}') = \sum_{p=1}^{n} \sum_{i=1}^{|\mathbf{x}|-p+1} \sum_{j=1}^{|\mathbf{x}'|-p+1} p \; I(\mathbf{x}[i:i+p-1] = \mathbf{x}'[j:j+p-1]) \qquad (5)$$

where $\mathbf{x}[i:j]$ denotes the substring of $\mathbf{x}$ with the words in the range $[i, j]$, $I(\cdot)$ is the indicator function, and $p$ is the number of words in the feature. Features used are 1-grams, 2-grams, or 1&2-grams.
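A direct sketch of Equation 5 over tokenized sentences (lists of words); the function name is ours:

```python
def spectrum_kernel(x, y, n=2):
    """n-spectrum weighted word kernel (Equation 5): every pair of
    matching word p-grams, for p = 1..n, contributes its length p."""
    total = 0
    for p in range(1, n + 1):
        for i in range(len(x) - p + 1):
            for j in range(len(y) - p + 1):
                if x[i:i + p] == y[j:j + p]:
                    total += p
    return total
```

For example, "the black cat" against "the black dog" with $n = 2$ scores 4: two matching unigrams (weight 1 each) plus the matching bigram "the black" (weight 2).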

3.1 Tuning for Target $F_1$

We perform parameter optimization for the machine learning models we use to estimate the target vector. Model parameters such as the regularization coefficient $\lambda$ and the iteration number for FSR are optimized on dev using the $F_1$ measure over the 0/1-class predictions obtained after thresholding $\Phi_Y(\hat{\mathbf{y}})$. Let TP, TN, FP, and FN be the true positive, true negative, false positive, and false negative rates; the measures are defined as:

$$\mathrm{prec} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{BER} = \frac{1}{2}\left(\frac{\mathrm{FP}}{\mathrm{TN} + \mathrm{FP}} + \frac{\mathrm{FN}}{\mathrm{TP} + \mathrm{FN}}\right) \qquad (6)$$
$$\mathrm{rec} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad F_1 = \frac{2 \times \mathrm{prec} \times \mathrm{rec}}{\mathrm{prec} + \mathrm{rec}} \qquad (7)$$

where BER is the balanced error rate, prec is precision, and rec is recall. These measures evaluate the effectiveness of the learning models in identifying the features of the target sentence with minimal error, which in turn increases the performance of the decoder and its translation quality.
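The measures in Equations 6 and 7, computed from parallel 0/1 feature vectors; a small sketch (the vector encoding is our assumption):

```python
def binary_metrics(pred, gold):
    """prec, rec, F1 (Equation 7) and BER (Equation 6) from parallel
    0/1 feature vectors."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    tn = sum(1 for p, g in zip(pred, gold) if not p and not g)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    ber = (fp / (tn + fp) + fn / (tp + fn)) / 2 if (tn + fp) and (tp + fn) else 0.0
    return prec, rec, f1, ber
```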

The thresholds used to map real feature values to 0/1-class predictions are also optimized using the $F_1$ measure on dev. We compare the performance of $L_2$ regularized ridge regression with $L_1$ regularized lasso and with $L_1$ regression ($L_1$-reg), where no regularization term is involved. We also compare the results we obtain with support vector regression (SVR) using an RBF kernel and with iterative thresholding ($iter$-$\varepsilon$) for $L_1$ minimization [Maleh, 2009].
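A sketch of the per-threshold $F_1$ search on dev (a grid over the observed scores; helper names are ours):

```python
def f1_score(pred, gold):
    """F1 of 0/1 predictions against 0/1 gold features (Equation 7)."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def best_threshold(scores, gold):
    """Grid over the observed real-valued scores and return the
    threshold whose 0/1 predictions maximize F1 on the dev vectors."""
    best_t, best_f = 0.0, -1.0
    for t in sorted(set(scores)):
        f = f1_score([s >= t for s in scores], gold)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```

Only thresholds equal to observed scores need to be tried, since $F_1$ can only change where a prediction flips.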

           BER    Prec   Rec    F1
1-grams
L2         0.18   0.47   0.65   0.55
lasso      0.15   0.60   0.71   0.65
L1-reg     0.28   0.62   0.45   0.52
SVR        0.20   0.54   0.61   0.57
iter-ε     0.17   0.40   0.68   0.50
- - - - - - - - - - - - - - - - - -
Moses      0.23   0.68   0.54   0.60
2-grams
L2         0.38   0.44   0.25   0.32
lasso      0.32   0.51   0.37   0.43
L1-reg     0.41   0.37   0.19   0.25
SVR        0.39   0.43   0.22   0.29
iter-ε     0.41   0.60   0.18   0.27
- - - - - - - - - - - - - - - - - -
Moses      0.38   0.37   0.24   0.28
1&2-grams
L2         0.27   0.52   0.45   0.49
lasso      0.24   0.57   0.53   0.55
- - - - - - - - - - - - - - - - - -
Moses      0.30   0.58   0.40   0.47
Table 1: Comparison of techniques on dev2 optimized for the $F_1$ value, with 100 instances used for each test sentence using dice selection. Average dimensions are $N_X \times N_Y \approx 676.62 \times 698.63$ for 1-grams, $1678.63 \times 1803.85$ for 2-grams, and $2582.54 \times 2674.02$ for 1&2-grams. The threshold used for Moses is 0.2666, optimized on the dev set.

The coverage, measured by the percentage of test bigrams found in the training set, is denoted $scov$ and $tcov$ for source and target respectively. In dev2, when using 100 training instances per test sentence, we measure $(scov, tcov)$ to be $(1.0, 0.96)$, $(0.94, 0.74)$, and $(0.97, 0.85)$ for 1-grams, 2-grams, and 1&2-grams respectively. Table 1 presents the results on dev2, listing BER, prec, rec, and $F_1$ when using 1-grams, 2-grams, or both. The lower performance with bigrams is likely due to their lower counts in the training set, which causes a bigger problem for $L_1$-reg and SVR.

We compare the performance with Moses's phrase table obtained using all of the parallel corpus with maximum phrase length set to 2. We obtain Moses target vectors from the target phrase table entries for each source test sentence feature and optimize the threshold for achieving the maximum $F_1$ value on dev; this threshold is found to be 0.2666. Moses achieves $(1.0, 0.98)$, $(0.90, 0.66)$, and $(0.94, 0.92)$ coverage values for 1-grams, 2-grams, and 1&2-grams. These coverage values are higher for 1-grams and 1&2-grams, showing that Moses retains only likely entries in the phrase table. The results on the dev2 set are given in Table 1, separated by dashed lines.

We observe that lasso achieves better performance than the other techniques in terms of BER, recall, and $F_1$ values. $iter$-$\varepsilon$ achieves higher precision for 2-grams. lasso is also better than the Moses results in all measures.

3.2 Target Feature Estimation as Classification

We can also interpret the feature mapping problem as a classification problem and estimate whether a feature exists in the target (class 1) or not (class 0). We use logistic regression (logreg) and support vector classification (SVC) to determine the target features for each feature of the source test sentence. When the number of 1's is much smaller than the number of 0's in the training set, as we observe in our sparse learning setting, a classifier may tend to choose class 0 over class 1. In such cases, we need to introduce some bias towards choosing 1's; the thresholding approach helps increase recall while maintaining a high precision value to optimize the $F_1$ measure. The results on dev2 are given in Table 2. Our interpretation of feature mapping as a classification problem does not give better results than regression.

1&2-grams  BER    Prec   Rec    F1
logreg     0.33   0.58   0.34   0.43
SVC        0.30   0.47   0.41   0.44
Table 2: Interpreting feature estimation as classification on dev2.

3.3 Regression Results

The regression results on test when using 250 training instances per test sentence are given in Table 3. $(scov, tcov)$ is found to be $(1.0, 0.96)$, $(0.94, 0.75)$, and $(0.96, 0.86)$ for 1-grams, 2-grams, and 1&2-grams respectively. We observe that FSR's $F_1$ gains over $L_2$ increase as we use 250 training instances per test sentence, and its BER is also lower.

         BER    Prec   Rec    F1
1-grams
L2       0.18   0.45   0.65   0.53
lasso    0.14   0.61   0.73   0.66
2-grams
L2       0.39   0.40   0.23   0.29
lasso    0.32   0.52   0.37   0.43
1&2-grams
L2       0.28   0.47   0.44   0.45
lasso    0.23   0.58   0.54   0.56
Table 3: Results on test with 250 instances used for each test sentence using dice selection. Dimensions are $N_X \times N_Y \approx 671.33 \times 691.90$ for 1-grams, $1712.34 \times 1836.18$ for 2-grams, and $2657.03 \times 2744.96$ for 1&2-grams.

The results show that lasso achieves better performance on all feature sets and on all measures when compared with other regression techniques in estimating the target feature for a given source feature.

4 Graph Decoding Experiments

We demonstrate machine translation results using graph decoding on the German-English test set as well as in a constrained translation domain from Spanish to English (es-en) using the categorized EuTrans corpus [Serrano et al., 2009]. The corpus provides a more restricted translation environment for decoding and contains 9000 training, 1000 development, and 3000 test sentences.

We perform graph-based decoding by first generating a De Bruijn graph from the estimated $\hat{\mathbf{y}}$ [Cortes et al., 2007] and then finding Eulerian paths with maximum path weight. We use four features when scoring paths: (1) the estimation weight from regression, (2) the language model score, (3) a brevity penalty computed as $e^{\alpha(l_R - |s|/|path|)}$, where $l_R$ is the length ratio from the parallel corpus and $|path|$ is the length of the current path, and (4) a future cost as in Moses [Koehn et al., 2007]. Feature weights are tuned using MERT [Och, 2003] on the de-en dev set.
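A toy sketch of the graph construction, with a greedy search standing in for the maximum-weight Eulerian path search (the paper scores full paths with the four features above; the function names and weights here are illustrative):

```python
from collections import defaultdict

def build_graph(bigram_weights):
    """Nodes are words; each estimated bigram (u, v) with weight w
    becomes a directed edge, as in De Bruijn graph decoding."""
    graph = defaultdict(list)
    for (u, v), w in bigram_weights.items():
        graph[u].append((v, w))
    return graph

def greedy_decode(graph, start, max_len=20):
    """Greedy stand-in for the maximum-weight path search: follow the
    heaviest unused outgoing edge until no unused edge remains."""
    path, used, node = [start], set(), start
    while len(path) < max_len:
        edges = [(v, w) for v, w in graph[node] if (node, v) not in used]
        if not edges:
            break
        v, _ = max(edges, key=lambda e: e[1])
        used.add((node, v))
        path.append(v)
        node = v
    return path
```

Each edge corresponds to one estimated target bigram, so a path through the graph spells out a candidate translation whose bigrams are all supported by the regression estimate.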

For de-en, we built a Moses model using default settings with maximum sentence length set to 80 and a 5-gram language model, where about 1.6 million sentences were used for training and 1000 random development sentences, including dev, were used for tuning. We obtained a 0.3422 BLEU score on test. Regression results for increasing training data can be seen in Figure 2, where 2-grams are used for decoding. We see a large BLEU gain of lasso over $L_2$ in our transductive learning setting, although the performance is lower than Moses.

Figure 2: de-en translation results for increasing $m$ using 2-grams.

For es-en, Moses achieves 0.9340 BLEU on the full test set. Regression results for increasing training data can be seen in Figure 3, where 1&2-grams are used for decoding. We see that lasso performs better than $L_2$ when we use a smaller number of training instances, but it performs worse as the training set size increases. The red line corresponds to the Moses baseline. These results are comparable to previous work [Serrano et al., 2009].

Figure 3: es-en translation results for increasing $m$ using 1&2-grams.

We demonstrate that sparse $L_1$ regularized regression performs better than $L_2$ regularized regression. Graph-based decoding can provide an alternative to the state-of-the-art phrase-based decoding system Moses in translation domains with a small vocabulary and training set size.

5 Decoding Experiments with Moses

Moses uses additional reordering features, lexical weights, and phrase translation probabilities in both translation directions. Therefore we also experiment with decoding using Moses.

We interpret the coefficient matrix $\mathbf{W}$ as the phrase table and perform experiments with Moses using the new phrase table we obtain. These experiments show whether we can effectively use $\mathbf{W}$ as the phrase table in a phrase-based decoder. In our transductive learning framework, a $\mathbf{W}$ matrix is created for each test sentence and replaces the phrase table; we modify Moses so that it reloads the phrase table before translating each test sentence. Moses uses a default translation table limit (--ttable-limit) of 20 entries per source phrase, which limits the number of translation options considered for each source phrase.

We obtain direct and indirect translation probabilities for the target phrase entries for each source phrase. For a given source test sentence $S$, the scores are obtained as follows, where $p(t_f|s_f)$ is the direct phrase translation probability, corresponding to $p(e|f)$ in Moses, and $p(s_f|t_f)$ is the inverse phrase translation probability, corresponding to $p(f|e)$ in Moses:

$$p(t_f|s_f) = \frac{\mathbf{W}[t_f, s_f]}{\sum_{t_f' \in \mathcal{F}(F_Y)} \mathbf{W}[t_f', s_f]}, \qquad (8)$$
$$p(s_f|t_f) = \frac{\mathbf{W}[t_f, s_f]}{\sum_{s_f' \in \mathcal{F}(\Phi_X(S))} \mathbf{W}[t_f, s_f']} \qquad (9)$$

where $\mathcal{F}(F_Y)$ is the set of target features found in the training set, $\mathcal{F}(\Phi_X(S))$ returns the source features of the source test sentence $S$, and $\mathcal{F}(\cdot)$ returns the features used in a given domain. $s_f$ is chosen from the feature set of the source test sentence and $t_f$ from the features where $\mathbf{W}[t_f, s_f] > 0$. Summing over the source features found in the source test sentence helps us discriminate among target feature alternatives better than summing over all source features found in the training set. We also find that by summing over the positive entries for features selected from $\mathcal{F}(F_Y)$, rather than selecting from the top $N$ aligned target features, we increase the precision of the phrase translation scores by increasing the denominator for frequently observed phrases.
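A sketch of Equations 8 and 9 with $\mathbf{W}$ as a dense NumPy array indexed as (target feature, source feature); the column subset passed in plays the role of $\mathcal{F}(\Phi_X(S))$, and the function name is ours:

```python
import numpy as np

def phrase_probabilities(W, source_cols):
    """Direct (Eq. 8) and inverse (Eq. 9) phrase translation
    probabilities from the coefficient matrix W. Eq. 8 normalizes each
    column over all target features; Eq. 9 normalizes each row over
    only the source features of the test sentence (source_cols)."""
    Wp = np.maximum(W, 0.0)                     # keep only positive entries
    col_sum = Wp.sum(axis=0, keepdims=True)     # denominator of Eq. 8
    p_t_s = np.divide(Wp, col_sum, out=np.zeros_like(Wp), where=col_sum > 0)
    row_sum = Wp[:, source_cols].sum(axis=1, keepdims=True)  # denom. of Eq. 9
    p_s_t = np.divide(Wp, row_sum, out=np.zeros_like(Wp), where=row_sum > 0)
    return p_t_s, p_s_t
```

Restricting the row sum to the test sentence's source features makes the inverse probabilities a proper distribution over exactly those phrases the decoder can actually use.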

We retain the phrase penalty to obtain 3 scores for each phrase table entry: $p(s_f|t_f)$, $p(t_f|s_f)$, and 2.718, where the last score is used for the phrase penalty. To compare the performance with Moses, we use the option -score-options '--NoLex' during training, which removes the scores coming from lexical weights in the phrase table entries, leaving 3 scores similar to those used in the RegMT phrase table. Instead of adding extra lexical weights to the RegMT table, we remove them from Moses to better measure the difference between the phrase tables.

5.1 Results

In our experiments, we measure the effect of the phrase length (--max-phrase-length), whether lexical weights are used, the number of instances used for training, and the development set selection on test set performance. We use the dev set, constructed similarly to the test set, for tuning the weights. We perform an individual Moses translation experiment for each test sentence to compare its performance with replacing the phrase table with $\mathbf{W}$.

For the de-en system, we use the Moses baseline described in Section 4, which obtains a 0.3422 BLEU score on test.

Individual translations: The training set is composed of only the set of instances selected for each test sentence. Individual translation results obtained with Moses are given in Table 4. Individual SMT training and translation can be preferable due to smaller computational costs and high parallelizability. As we translate a single sentence with each SMT system, tuning the weights becomes important, and the variance of the learned weights can become high in the individual setting. As we increase the training set size, the performance gets closer to that of the Moses system using all of the training corpus.

BLEU        100      250      500
≤2-grams    0.3148   0.3442   0.3709
≤3-grams    0.3429   0.3534   0.4031
Table 4: Individual Moses results with training instances selected individually for each source test sentence.

Transductive RegMT $\mathbf{W}$ as the phrase table: The results obtained when the coefficient matrix found by RegMT is used as the phrase table for Moses are given in Figure 4, which shows the learning curve for increasing training set size. Due to computational limitations, we use the weights obtained for $L_2$ to decode with the lasso phrase table and skip tuning for lasso.

Figure 4: Moses de-en translation results with the RegMT $\mathbf{W}$ used as the phrase table.

We obtain lower performance with $L_2$ than with the individual translations. lasso selects few possible translations for a given source feature; this decreased vocabulary and smaller set of hypotheses may result in lower performance, although better estimations are obtained, as we observed in the previous section. The increased precision pays off when creating the translation from the bigrams found in the estimation. We observe similar learning curves with graph decoding and with decoding using Moses. The RegMT model may need a larger training set to achieve better performance when the mappings are used as the phrase table; it is good at estimating the target features but has difficulty in correctly finding the target sentence when Moses is used as the decoder.

6 Contributions

We use transductive regression techniques to learn mappings between source and target features of given parallel corpora and use these mappings to generate machine translation outputs. The results show the effectiveness of $L_1$ regularization versus the $L_2$ regularization used in ridge regression. We introduce the dice instance selection method for proper selection of training instances, which plays an important role in learning correct feature mappings. We show that $L_1$ regularized regression performs better than $L_2$ regularized regression both in regression measurements and in translation experiments using graph decoding. We present encouraging results when translating from German to English and from Spanish to English.

We also demonstrate results when the phrase table of a phrase-based decoder is replaced with the mappings we find with the regression model. We observe that the RegMT model achieves lower performance than a Moses system built individually for each test sentence. The RegMT model may need a larger training set to achieve better performance when the mappings are used as the phrase table.

The $F_1$ measure gives us a metric that correlates with BLEU without performing the translation step.

References

  • [Bishop, 2006] Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
  • [Callison-Burch et al., 2010] Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, and Omar Zaidan, editors. 2010. Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR. Association for Computational Linguistics, Uppsala, Sweden, July.
  • [Cortes et al., 2007] Corinna Cortes, Mehryar Mohri, and Jason Weston. 2007. A general regression framework for learning string-to-string mappings. In Gokhan H. Bakir, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskar, and S. V. N. Vishwanathan, editors, Predicting Structured Data, pages 143–168. The MIT Press, September.
  • [Hastie et al., 2006] Trevor Hastie, Jonathan Taylor, Robert Tibshirani, and Guenther Walther. 2006. Forward stagewise regression and the monotone lasso. Electronic Journal of Statistics, 1.
  • [Hastie et al., 2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, 2nd edition.
  • [Koehn et al., 2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of the Assoc. for Computational Linguistics, pages 177–180, Prague, Czech Republic, June.
  • [Maleh, 2009] Ray Maleh. 2009. Efficient Sparse Approximation Methods for Medical Imaging. Ph.D. thesis, The University of Michigan.
  • [Och, 2003] Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167.
  • [Serrano et al., 2009] Nicolas Serrano, Jesus Andres-Ferrer, and Francisco Casacuberta. 2009. On a kernel regression approach to machine translation. In Iberian Conference on Pattern Recognition and Image Analysis, pages 394–401.
  • [Taylor and Cristianini, 2004] J. Shawe Taylor and N. Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
  • [Tibshirani, 1996] Robert J. Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.
  • [Wang and Shawe-Taylor, 2008] Zhuoran Wang and John Shawe-Taylor. 2008. Kernel regression framework for machine translation: UCL system description for WMT 2008 shared translation task. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 155–158, Columbus, Ohio, June. Association for Computational Linguistics.
  • [Wang et al., 2007] Zhuoran Wang, John Shawe-Taylor, and Sandor Szedmak. 2007. Kernel regression based machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 185–188, Rochester, New York, April. Association for Computational Linguistics.