Sparse Regression for Machine Translation
Abstract
We use transductive regression techniques to learn mappings between source and target features of given parallel corpora and use these mappings to generate machine translation outputs. We show the effectiveness of L1 regularized regression (lasso) versus L2 regularized regression for learning the mappings between sparsely observed feature sets. Proper selection of training instances plays an important role in learning correct feature mappings within limited computational resources and at expected accuracy levels. We introduce the dice instance selection method, which improves the source and target coverage of the training set. We show that L1 regularized regression performs better than L2 regularized regression both in regression measurements and in translation experiments using graph decoding. We present encouraging results when translating from German to English and from Spanish to English. We also demonstrate results when the phrase table of a phrase-based decoder is replaced with the mappings we find with the regression model.
1 Introduction
Regression can be used to find mappings between the source and target feature sets derived from given parallel corpora. Transduction learning uses a subset of the training examples that are closely related to the test set without using the model induced by the full training set. In the context of statistical machine translation, translations are performed at the sentence level and this enables us to select a few training instances for each test instance to guide the translation process. This also gives us a computational advantage when considering the high dimensionality of the problem.
The goal in transductive regression based machine translation (RegMT) is both to reduce the computational burden of the regression approach by reducing the dimensionality of the training set and the feature set, and to improve translation quality by using transduction. In an idealized feature mapping matrix where features are word sequences, we would like to observe few target features for each source feature, close to a permutation matrix with one nonzero item per column. L1 regularization helps us achieve solutions close to permutation matrices by increasing sparsity.
We compare L1 regularized regression with other regression techniques and show that it achieves better performance in estimating target features and in generating translations during graph decoding. We present encouraging results on German-to-English and Spanish-to-English translation tasks. We also replace the phrase table of a phrase-based decoder, Moses [Koehn et al., 2007], with the feature mappings we find with the RegMT model.
Related Work: Regression techniques can be used to model the relationship between strings [Cortes et al., 2007]. Wang et al. [Wang et al., 2007] apply a string-to-string mapping approach to machine translation by using ordinary least squares regression and n-gram string kernels on a small dataset. Later they use L2 regularized least squares regression [Wang and Shawe-Taylor, 2008]. Although the translation quality they achieve is not better than that of Moses [Koehn et al., 2007], which is accepted to be the state of the art, they show the feasibility of the approach. Serrano et al. [Serrano et al., 2009] use kernel regression to find translation mappings from source to target feature vectors and experiment with translating hotel front desk requests. Locally weighted regression solves a separate weighted least squares problem for each instance [Hastie et al., 2009], weighted by a kernel similarity function.
Outline: Section 2 gives an overview of regression based machine translation, which is used to find the mappings between the source and target features of the training set. We present L1 regularized transductive regression for alignment learning and the instance selection method used. In section 3, we compare the performance of L1 versus L2 regularized regression in effectively estimating the target features. In section 4, we present the graph decoding results on German-English and Spanish-English translation tasks. Section 5 compares the results obtained with Moses under different learning settings. The last section presents our contributions.
2 Machine Translation Using Regression
Let $X$ and $Y$ correspond to the sets of tokens that can be used in the source and target strings. Training instances are then represented as $(x_1, y_1), \ldots, (x_m, y_m) \in X^* \times Y^*$, where $(x_i, y_i)$ corresponds to a pair of source and target language token sequences for $1 \leq i \leq m$. Our goal is to find a mapping $f : X^* \to Y^*$ that can convert a source sentence to a target sentence sharing the same meaning in the target language (Figure 1).
We define feature mappers $\Phi : X^* \to \mathbb{R}^{N_X}$ and $\Psi : Y^* \to \mathbb{R}^{N_Y}$ that map each string sequence to a point in a high dimensional real vector space. Let $M_X \in \mathbb{R}^{N_X \times m}$ and $M_Y \in \mathbb{R}^{N_Y \times m}$ such that $M_X = [\Phi(x_1), \ldots, \Phi(x_m)]$ and $M_Y = [\Psi(y_1), \ldots, \Psi(y_m)]$. The ridge regression solution using L2 regularization is found by minimizing the following cost:

$$W_{L_2} = \arg\min_{W \in \mathbb{R}^{N_Y \times N_X}} \| M_Y - W M_X \|_F^2 + \lambda \| W \|_F^2 \qquad (1)$$
Two main challenges of the regression based machine translation (RegMT) approach are learning the regression function, $h(x) = W \Phi(x)$, and solving the pre-image problem, which, given the features of the estimated target string sequence, $\Psi(\hat{y}) = W \Phi(x)$, attempts to find $y$: $f(x) = \arg\min_{y \in Y^*} \| \Psi(y) - W \Phi(x) \|^2$.
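Equation 1 has the closed-form solution $W_{L_2} = M_Y M_X^\top (M_X M_X^\top + \lambda I)^{-1}$. The following is a minimal numpy sketch of that solution; the matrix shapes, data, and regularization weight are our own toy choices for illustration, not the experimental setup of the paper:

```python
import numpy as np

def ridge_mapping(MX, MY, lam=0.1):
    """Closed-form L2 (ridge) solution W = MY MX^T (MX MX^T + lam I)^{-1}.

    MX: (NX, m) source feature matrix, MY: (NY, m) target feature matrix.
    Minimizes ||MY - W MX||_F^2 + lam ||W||_F^2 over W of shape (NY, NX).
    """
    NX = MX.shape[0]
    return MY @ MX.T @ np.linalg.inv(MX @ MX.T + lam * np.eye(NX))

# Toy sanity check: if MY = A MX exactly, then with lam -> 0 the
# solution W recovers the generating matrix A.
rng = np.random.default_rng(0)
MX = rng.standard_normal((4, 20))
A = rng.standard_normal((3, 4))
MY = A @ MX
W = ridge_mapping(MX, MY, lam=1e-9)
```

As the text notes, this $W_{L_2}$ is dense; in the sparse feature setting most of its entries remain nonzero.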
2.1 L1 Regularized Regression
String kernels lead to sparse feature representations, and L1 regularized regression is effective in finding the mappings between sparsely observed features. We would like to observe only a few nonzero target coefficients corresponding to a source feature in the coefficient matrix. L1 regularization helps us achieve solutions close to permutation matrices by increasing sparsity [Bishop, 2006].
$W_{L_2}$ is not a sparse solution, and most of the coefficients remain nonzero. We are interested in penalizing the coefficients better: zeroing out the irrelevant ones leads to a sparser solution that is closer to a permutation matrix. The L1 norm behaves both as a feature selection technique and as a method for reducing coefficient values.
$$W_{L_1} = \arg\min_{W \in \mathbb{R}^{N_Y \times N_X}} \| M_Y - W M_X \|_F^2 + \lambda \| W \|_1 \qquad (2)$$

Equation 2 presents the lasso [Tibshirani, 1996] solution, where the regularization term is now the matrix L1 norm defined as $\| W \|_1 = \sum_{i,j} | W_{i,j} |$. $W_{L_2}$ can be found analytically by setting the derivative of the cost to zero, but since the L1 regularization cost is not differentiable, $W_{L_1}$ is found by optimization or approximation techniques. We use forward stagewise regression (FSR) [Hastie et al., 2006], which approximates the lasso solution for L1 regularized regression.
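As a concrete illustration of how FSR sparsifies, here is a minimal plain-numpy sketch of incremental forward stagewise regression for a single target vector; the toy data, step size, and iteration count are our own choices, not the paper's settings:

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_iter=2000):
    """Incremental forward stagewise regression (FSR), which approximates
    the lasso path [Hastie et al., 2006].

    X: (n, p) design matrix, y: (n,) target. At each step, nudge the
    coefficient most correlated with the current residual by eps.
    """
    n, p = X.shape
    w = np.zeros(p)
    r = y.copy()
    for _ in range(n_iter):
        corr = X.T @ r                      # correlation with residual
        j = np.argmax(np.abs(corr))
        if abs(corr[j]) < 1e-12:
            break
        delta = eps * np.sign(corr[j])
        w[j] += delta
        r -= delta * X[:, j]                # update residual incrementally
    return w

# Sparse ground truth: only 2 of 10 coefficients are nonzero.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
w_true = np.zeros(10)
w_true[0], w_true[3] = 2.0, -1.5
y = X @ w_true + 0.01 * rng.standard_normal(100)
w = forward_stagewise(X, y)
```

The recovered `w` concentrates its mass on the two relevant coordinates while the irrelevant ones stay near zero, which is the sparsity behavior the text attributes to L1 regularization.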
2.2 Instance Selection
Proper selection of training instances plays an important role in learning feature mappings accurately with limited computational resources. Coverage of the features is important: if the correct features are not present in the training matrices, we will not be able to translate them. Coverage is measured by the percentage of target features of the test set found in the training set. For each test sentence, we pick a limited number of training instances designed to improve the coverage of correct features to build a regression model. We use a technique that we call dice, which optimizes source language bigram coverage such that the difficulty of aligning source and target features is minimized. We define Dice's coefficient score as:
$$\mathrm{dice}(x, y) = \frac{2\, C(x, y)}{C(x) + C(y)} \qquad (3)$$

where $C(x, y)$ is the number of times $x$ and $y$ co-occur and $C(x)$ is the count of observing $x$ in the selected training set. Given a test source sentence, $S$, we can estimate the goodness of a target sentence, $T$, by the sum of the alignment scores:
$$\mathrm{score}(S, T) = \frac{1}{|T| \log |S|} \sum_{y \in F(T)} \sum_{t \in \mathrm{tok}(y)} \sum_{x \in S} \mathrm{dice}(x, t) \qquad (4)$$

where $F(T)$ stores the features of $T$ and $\mathrm{tok}(y)$ lists the tokens in feature $y$. We assume that tokens are separated by the space character. The difficulty of word aligning a pair of training sentences, $(S, T)$, can be approximated by $|S|^{|T|}$, as for each token we have $|S|$ alternatives to map to. We use a normalization factor proportional to $|T| \log |S|$, which corresponds to the log of this value.
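The scoring above can be sketched in a few lines of Python. This is a simplification under our own assumptions: we sum Dice scores over token pairs directly rather than over n-gram features, and the toy corpus is invented for illustration:

```python
from collections import Counter
from itertools import product
from math import log

def dice_tables(pairs):
    """Collect sentence-level co-occurrence counts from
    (source tokens, target tokens) training pairs."""
    cx, cy, cxy = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        for x in set(src):
            cx[x] += 1
        for y in set(tgt):
            cy[y] += 1
        for x, y in product(set(src), set(tgt)):
            cxy[x, y] += 1
    return cx, cy, cxy

def dice(x, y, cx, cy, cxy):
    """Dice's coefficient: 2 C(x, y) / (C(x) + C(y))."""
    d = cx[x] + cy[y]
    return 2.0 * cxy[x, y] / d if d else 0.0

def pair_score(src, tgt, cx, cy, cxy):
    """Alignment goodness of a pair, normalized by the log-difficulty
    |T| log |S| of aligning it."""
    total = sum(dice(x, y, cx, cy, cxy) for x in src for y in tgt)
    return total / (len(tgt) * log(len(src)) + 1e-9)

# Toy parallel corpus (illustrative only).
pairs = [(["das", "haus"], ["the", "house"]),
         (["das", "auto"], ["the", "car"])]
cx, cy, cxy = dice_tables(pairs)
```

Instance selection then amounts to ranking candidate training pairs by `pair_score` for each test sentence and keeping the top ones.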
3 Regression Experiments
We experiment with different regression techniques and compare their performance with that of L2 regularized regression and lasso. We show that lasso identifies features better than the other techniques we compared while making fewer errors.
We use the German-English (de-en) parallel corpus of size about million sentences from the WMT'10 translation task [Callison-Burch et al., 2010] to select training instances. For development and test sentences, we select randomly among the sentences whose length and target language bigram coverage fall within fixed ranges. We select random instances from each target coverage decimal fraction (i.e., 20 from each) to obtain the sets of sentences. We create in-domain dev, dev2, and test sets following this procedure, making sure that test set source sentences do not have exact matches in the training set. We use the blended $b$-spectrum weighted word kernel [Taylor and Cristianini, 2004] as the feature mapper, which considers all word sequences up to order $b$:
$$k(x, x') = \sum_{u : |u| \leq b} |u| \left( \sum_i I(x[i : i + |u| - 1] = u) \right) \left( \sum_j I(x'[j : j + |u| - 1] = u) \right) \qquad (5)$$

where $x[i:j]$ denotes the substring of $x$ with the words in the range $[i, j]$, $I$ is the indicator function, and $|u|$ is the number of words in feature $u$. The features used are 1-grams, 2-grams, or 1&2-grams.
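A sketch of such a feature mapper, under our own simplifying assumptions (whitespace tokenization, the $|u|$ weight applied once in the kernel; not the exact kernel code used in the experiments):

```python
from collections import Counter

def spectrum_features(sentence, order=2):
    """Map a sentence to its word n-gram counts for all n up to `order`."""
    tokens = sentence.split()
    feats = Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            feats[tuple(tokens[i:i + n])] += 1
    return feats

def spectrum_kernel(s1, s2, order=2):
    """Blended spectrum kernel: dot product in n-gram space, with each
    shared feature u weighted by its length |u|."""
    f1 = spectrum_features(s1, order)
    f2 = spectrum_features(s2, order)
    return sum(len(u) * c * f2[u] for u, c in f1.items() if u in f2)
```

With `order=1` this yields the 1-gram features, with `order=2` the 1&2-gram features used throughout the experiments.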
3.1 Tuning for Target
We perform parameter optimization for the machine learning models we use to estimate the target vector. The model parameters, such as the regularization weight $\lambda$ and the iteration number for FSR, are optimized on dev using the F1 measure over the 0/1-class predictions obtained after thresholding $\hat{y} = W \Phi(x)$. Let TP be the true positive, TN the true negative, FP the false positive, and FN the false negative counts; the measures are defined as:
$$\mathrm{BER} = \frac{1}{2} \left( \frac{\mathrm{FP}}{\mathrm{TN} + \mathrm{FP}} + \frac{\mathrm{FN}}{\mathrm{TP} + \mathrm{FN}} \right) \qquad (6)$$

$$\mathrm{prec} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad \mathrm{rec} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad F_1 = \frac{2 \cdot \mathrm{prec} \cdot \mathrm{rec}}{\mathrm{prec} + \mathrm{rec}} \qquad (7)$$
where BER is the balanced error rate, prec is precision, and rec is recall. These evaluation measures quantify how effectively the learning models identify the features of the target sentence while making minimal error, which in turn increases the performance of the decoder and its translation quality.
The thresholds used to map real feature values to 0/1-class predictions are also optimized using the F1 measure on dev. We compare the performance of L2 regularized ridge regression with L1 regularized lasso and with unregularized regression (no-reg), where no regularization term is involved. We also compare the results with support vector regression (SVR) using the RBF kernel and with iterative thresholding (IT) for L1 minimization [Maleh, 2009].
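The evaluation and threshold tuning described by Equations 6-7 can be sketched as follows; the threshold grid and toy vectors are our own illustration:

```python
def ber_f1(y_true, y_pred):
    """BER, precision, recall, F1 from 0/1 vectors."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    eps = 1e-9  # guard against empty classes
    ber = 0.5 * (fp / (tn + fp + eps) + fn / (tp + fn + eps))
    prec = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    f1 = 2 * prec * rec / (prec + rec + eps)
    return ber, prec, rec, f1

def tune_threshold(y_true, scores, grid):
    """Pick the threshold that maximizes F1 on a development set."""
    return max(grid, key=lambda th: ber_f1(y_true, [s >= th for s in scores])[3])
```

Real-valued regression estimates are thresholded with the tuned value to obtain the 0/1 feature predictions reported in the tables.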
| | BER | Prec | Rec | F1 |
|---|---|---|---|---|
| 1-grams | | | | |
| L2 (ridge) | 0.18 | 0.47 | 0.65 | 0.55 |
| lasso | 0.15 | 0.60 | 0.71 | 0.65 |
| no-reg | 0.28 | 0.62 | 0.45 | 0.52 |
| SVR | 0.20 | 0.54 | 0.61 | 0.57 |
| IT | 0.17 | 0.40 | 0.68 | 0.50 |
| Moses | 0.23 | 0.68 | 0.54 | 0.60 |
| 2-grams | | | | |
| L2 (ridge) | 0.38 | 0.44 | 0.25 | 0.32 |
| lasso | 0.32 | 0.51 | 0.37 | 0.43 |
| no-reg | 0.41 | 0.37 | 0.19 | 0.25 |
| SVR | 0.39 | 0.43 | 0.22 | 0.29 |
| IT | 0.41 | 0.60 | 0.18 | 0.27 |
| Moses | 0.38 | 0.37 | 0.24 | 0.28 |
| 1&2-grams | | | | |
| L2 (ridge) | 0.27 | 0.52 | 0.45 | 0.49 |
| lasso | 0.24 | 0.57 | 0.53 | 0.55 |
| Moses | 0.30 | 0.58 | 0.40 | 0.47 |
The coverage is measured by the percentage of test bigrams found in the training set, for both the source and target sides. We measure the coverage in dev2 when using a fixed number of training instances per test sentence for 1-grams, 2-grams, and 1&2-grams. Table 1 presents the results on dev2, listing BER, prec, rec, and F1 when using 1-grams, 2-grams, or both. The lower performance with bigrams is likely due to their lower counts in the training set. This causes a bigger problem for no-reg and SVR.
We compare the performance with Moses's phrase table, obtained using all of the parallel corpus with a maximum phrase length limit. We obtain Moses target vectors from the target phrase table entries for each source test sentence feature and optimize the threshold for achieving the maximum F1 value on dev. Moses achieves higher coverage values for 2-grams and 1&2-grams, as Moses retains only the likely entries in the phrase table. The results on the dev2 set are given in Table 1.
We observe that lasso achieves better performance than the other techniques in terms of BER, recall, and F1 values. IT achieves the highest precision for 2-grams. lasso also achieves better results than Moses.
3.2 Target Feature Estimation as Classification
We can also interpret the feature mapping problem as a classification problem and estimate whether a feature exists in the target (class 1) or not (class 0). We use logistic regression (logreg) and support vector classification (SVC) to determine the target features for each feature of the source test sentence. When the number of 1's is much smaller than the number of 0's in the training set, as we observe in our sparse learning setting, a classifier may tend to choose class 0 over class 1. In such cases, we need to introduce some bias towards choosing 1's. The thresholding approach is helpful in increasing the recall while maintaining a high precision value to optimize the F1 measure. The results on dev2 are given in Table 2. Our interpretation of feature mapping as a classification problem does not give better results than regression.
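One way to bias a classifier towards the rare positive class is to weight positive examples more heavily in the loss. The sketch below is a plain-numpy weighted logistic regression illustrating that idea; the `pos_weight` mechanism and toy data are our own illustration, not the paper's logreg/SVC setup:

```python
import numpy as np

def logreg_fit(X, y, pos_weight=1.0, lr=0.1, n_iter=500):
    """Weighted logistic regression by gradient descent. pos_weight > 1
    biases the classifier towards the rare positive (feature-present)
    class, analogous to lowering the decision threshold."""
    w = np.zeros(X.shape[1])
    b = 0.0
    sw = np.where(y == 1, pos_weight, 1.0)   # per-sample weights
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = sw * (p - y)                     # weighted logistic gradient
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

# Imbalanced toy data: the positive class is rare.
X = np.array([[1.0], [1.0], [0.0], [0.0], [0.0], [0.0]])
y = np.array([1, 0, 0, 0, 0, 0])
w, b = logreg_fit(X, y, pos_weight=3.0)
```

With `pos_weight=3.0` the single positive example outvotes the negative one sharing its feature, so the feature is predicted present; with uniform weights the classifier would default towards class 0.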
| 1&2-grams | BER | Prec | Rec | F1 |
|---|---|---|---|---|
| logreg | 0.33 | 0.58 | 0.34 | 0.43 |
| SVC | 0.30 | 0.47 | 0.41 | 0.44 |
3.3 Regression Results
The regression results on test when using training instances per test sentence are given in Table 3 for 1-grams, 2-grams, and 1&2-grams. We observe that FSR's gains over L2 increase as we use more training instances per test sentence, and BER is lower for both.
| | BER | Prec | Rec | F1 |
|---|---|---|---|---|
| 1-grams | | | | |
| L2 (ridge) | 0.18 | 0.45 | 0.65 | 0.53 |
| lasso | 0.14 | 0.61 | 0.73 | 0.66 |
| 2-grams | | | | |
| L2 (ridge) | 0.39 | 0.40 | 0.23 | 0.29 |
| lasso | 0.32 | 0.52 | 0.37 | 0.43 |
| 1&2-grams | | | | |
| L2 (ridge) | 0.28 | 0.47 | 0.44 | 0.45 |
| lasso | 0.23 | 0.58 | 0.54 | 0.56 |
The results show that lasso achieves better performance on all feature sets and on all measures compared with the other regression techniques in estimating the target features for a given source feature.
4 Graph Decoding Experiments
We demonstrate machine translation results using graph decoding on the German-English test set as well as in a constrained translation domain from Spanish to English (es-en) using the categorized EuTrans corpus [Serrano et al., 2009]. The corpus provides a more restricted translation environment for decoding and contains 9000 training, 1000 development, and 3000 test sentences.
We perform graph-based decoding by first generating a De Bruijn graph from the estimated $\Psi(\hat{y})$ [Cortes et al., 2007] and then finding Eulerian paths with maximum path weight. We use four features when scoring paths: (1) the estimation weight from regression, (2) the language model score, (3) a brevity penalty of the form $e^{1 - r/c}$, with $r$ representing the expected target length derived from the length ratio of the parallel corpus and $c$ representing the length of the current path, and (4) the future cost as in Moses [Koehn et al., 2007]. The weights are tuned using MERT [Och, 2003] on the de-en dev set.
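A toy sketch of this decoding step, under our own simplifications: unigram nodes with bigram edges, exhaustive search instead of an efficient Eulerian path algorithm, only the estimation weight as a score, and invented weights:

```python
from itertools import permutations

def debruijn_decode(bigrams):
    """Decode by treating each estimated bigram as a weighted edge
    between word nodes, and searching for the path that uses every
    edge once (an Eulerian path) with maximum total weight.

    bigrams: dict mapping (u, v) word pairs to estimation weights.
    Exhaustive search over edge orderings -- toy inputs only.
    """
    edges = list(bigrams)
    best, best_w = None, float("-inf")
    for order in permutations(edges):
        # Consecutive edges must chain: (a, b) followed by (b, c).
        if all(order[i][1] == order[i + 1][0] for i in range(len(order) - 1)):
            w = sum(bigrams[e] for e in order)
            if w > best_w:
                best, best_w = order, w
    if best is None:
        return None
    return [best[0][0]] + [e[1] for e in best]

# Toy estimation: bigram weights for a four-word translation.
bigrams = {("we", "like"): 0.9, ("like", "the"): 0.8, ("the", "house"): 0.7}
```

A practical implementation would use Hierholzer's algorithm or beam search over the graph rather than enumerating edge permutations.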
For de-en, we built a Moses model using default settings and an n-gram language model, where about million sentences were used for training and random development sentences including dev were used for tuning, and measured the BLEU score on test. Regression results for increasing training data can be seen in Figure 2, where 2-grams are used for decoding. We see a large BLEU gain of lasso over L2 in our transductive learning setting, although the performance is lower than that of Moses.
For es-en, we compare against Moses BLEU performance on the full test set. Regression results for increasing training data can be seen in Figure 3, where 1&2-grams are used for decoding. We see that lasso performs better than L2 when we use a smaller number of training instances, but it performs worse as the training set size increases. The red line corresponds to the Moses baseline. These results are comparable to previous work [Serrano et al., 2009].
We demonstrate that sparse L1 regularized regression performs better than L2 regularized regression. Graph-based decoding can provide an alternative to the state-of-the-art phrase-based decoder Moses in translation domains with a small vocabulary and training set size.
5 Decoding Experiments with Moses
Moses uses some extra features: reordering features, lexical weights, and phrase translation probabilities in both translation directions. Therefore we also experiment with decoding using Moses.
We interpret the coefficient matrix W as the phrase table and perform experiments with Moses using the new phrase table we obtain. These experiments show whether we can effectively use W as the phrase table in a phrase-based decoder. In our transductive learning framework, a W matrix is created for each test sentence, and this replaces the phrase table. We modify Moses so that it reloads the phrase table before translating each test sentence. Moses uses a default translation table limit (--ttable-limit) on the number of entries per source phrase, which limits the number of translation options considered for each source phrase.
We obtain direct and inverse translation probabilities for the target phrase entries for each source phrase. We obtain the scores for a given source test sentence S as follows, where $p(t \mid s)$ is the direct phrase translation probability and $p(s \mid t)$ is the inverse phrase translation probability, corresponding to the direct and inverse phrase translation probability features in Moses:
$$p(t \mid s) = \frac{W_{t,s}}{\sum_{t' \in F_Y} W_{t',s}} \qquad (8)$$

$$p(s \mid t) = \frac{W_{t,s}}{\sum_{s' \in F(S)} W_{t,s'}} \qquad (9)$$
where $F_Y$ is the set of target features found in the training set and $F(S)$ returns the source features of the source test sentence $S$. Source features $s$ are chosen from the feature set of the source test sentence, and target features $t$ from those with $W_{t,s} > 0$. This choice of summing over the source features found in the source test sentence helps us discriminate among target feature alternatives better than summing over all source features found in the training set. We also find that by summing over the positive entries for features selected from $F(S)$, rather than selecting from the top aligned target features, we increase the precision of the phrase translation scores by increasing the denominator for frequently observed phrases.
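A small numpy sketch of this normalization; the toy matrix, the index-list representation of $F(S)$, and the clipping of negative entries are our own illustrative assumptions:

```python
import numpy as np

def phrase_probs(W, src_feats):
    """Turn a coefficient matrix W (target features x source features)
    into direct and inverse phrase translation scores.

    Direct:  p(t|s) = W[t, s] / sum over all target features t' of W[t', s].
    Inverse: p(s|t) = W[t, s] / sum over the test sentence's source
             feature columns `src_feats` of W[t, s'].
    """
    Wp = np.clip(W, 0.0, None)             # keep only positive alignments
    col = Wp.sum(axis=0, keepdims=True)    # normalize over target features
    row = Wp[:, src_feats].sum(axis=1, keepdims=True)
    p_direct = np.divide(Wp, col, out=np.zeros_like(Wp), where=col > 0)
    p_inverse = np.divide(Wp, row, out=np.zeros_like(Wp), where=row > 0)
    return p_direct, p_inverse

# Toy W: 2 target features x 2 source features.
W = np.array([[2.0, 0.0],
              [1.0, 1.0]])
p_direct, p_inverse = phrase_probs(W, [0, 1])
```

Each nonzero $(t, s)$ cell then becomes one phrase table entry carrying these two scores.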
We retain the phrase penalty, obtaining three scores for each phrase table entry: $p(t \mid s)$, $p(s \mid t)$, and the phrase penalty. To compare the performance with Moses, we use the option -score-options '--NoLex' during training, which removes the scores coming from the lexical weights in the phrase table entries, leaving scores similar to those used in the RegMT phrase table. Instead of adding extra lexical weights to the RegMT phrase table, we remove them from Moses to measure the difference between the phrase tables better.
5.1 Results
In our experiments, we measure the effect on test set performance of the phrase length (--max-phrase-length), of whether lexical weights are used, of the number of instances used for training, and of the development set selection.
We use the dev set for tuning the weights, which is constructed similarly to the test set. We perform an individual Moses translation experiment for each test sentence to compare the performance with replacing the phrase table with W.
For the de-en system, we use the Moses baseline model described in section 4.
Individual translations: The training set is composed of only the set of instances selected for each test sentence. Individual translation results obtained with Moses are given in Table 4. Individual SMT training and translation can be preferable due to smaller computational costs and high parallelizability. As we translate a single sentence with each SMT system, tuning the weights becomes important, and the variance of the learned weights can become high in the individual setting. As we increase the training set size, we observe that the performance gets closer to that of the Moses system using all of the training corpus.
| BLEU | 100 | 250 | 500 |
|---|---|---|---|
| 2-grams | 0.3148 | 0.3442 | 0.3709 |
| 3-grams | 0.3429 | 0.3534 | 0.4031 |
Transductive RegMT W as the phrase table: Figure 4 gives the learning curve for increasing training set size when the coefficient matrix obtained by RegMT is used as the phrase table for Moses. Due to computational limitations, we use the weights obtained for L2 to decode with the lasso phrase table and skip tuning for lasso.
We obtain lower performance with W than with the individual translations obtained above. lasso selects few possible translations for a given source feature; this decrease in vocabulary and in the number of alternatives tested may result in lower performance, although better estimations are obtained, as we observed in the previous section. The increased precision pays off when creating the translation from the bigrams found in the estimation. We observe similar learning curves both with graph decoding and with decoding using Moses. The RegMT model may need a larger training set to achieve better performance when the mappings are used as the phrase table. The RegMT model is good at estimating the target features but has difficulty in correctly finding the target sentence when Moses is used as the decoder.
6 Contributions
We use transductive regression techniques to learn mappings between the source and target features of given parallel corpora and use these mappings to generate machine translation outputs. The results show the effectiveness of using L1 regularization versus the L2 regularization used in ridge regression. We introduce the dice instance selection method for proper selection of training instances, which plays an important role in learning correct feature mappings. We show that L1 regularized regression performs better than L2 regularized regression both in regression measurements and in translation experiments using graph decoding. We present encouraging results when translating from German to English and from Spanish to English.
We also demonstrate results when the phrase table of a phrase-based decoder is replaced with the mappings we find with the regression model. We observe that RegMT model achieves lower performance than Moses system built individually for each test sentence. RegMT model may need a larger training set size for achieving better performance when the mappings are used as the phrase table.
The F1 measure gives us a metric that correlates with BLEU without performing the translation step.
References
- [Bishop, 2006] Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
- [Callison-Burch et al., 2010] Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, and Omar Zaidan, editors. 2010. Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR. Association for Computational Linguistics, Uppsala, Sweden, July.
- [Cortes et al., 2007] Corinna Cortes, Mehryar Mohri, and Jason Weston. 2007. A general regression framework for learning string-to-string mappings. In Gokhan H. Bakir, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskar, and S. V. N. Vishwanathan, editors, Predicting Structured Data, pages 143–168. The MIT Press, September.
- [Hastie et al., 2006] Trevor Hastie, Jonathan Taylor, Robert Tibshirani, and Guenther Walther. 2006. Forward stagewise regression and the monotone lasso. Electronic Journal of Statistics, 1.
- [Hastie et al., 2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, 2nd edition.
- [Koehn et al., 2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of the Assoc. for Computational Linguistics, pages 177–180, Prague, Czech Republic, June.
- [Maleh, 2009] Ray Maleh. 2009. Efficient Sparse Approximation Methods for Medical Imaging. Ph.D. thesis, The University of Michigan.
- [Och, 2003] Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167.
- [Serrano et al., 2009] Nicolas Serrano, Jesus Andres-Ferrer, and Francisco Casacuberta. 2009. On a kernel regression approach to machine translation. In Iberian Conference on Pattern Recognition and Image Analysis, pages 394–401.
- [Taylor and Cristianini, 2004] J. Shawe Taylor and N. Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
- [Tibshirani, 1996] Robert J. Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.
- [Wang and Shawe-Taylor, 2008] Zhuoran Wang and John Shawe-Taylor. 2008. Kernel regression framework for machine translation: UCL system description for WMT 2008 shared translation task. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 155–158, Columbus, Ohio, June. Association for Computational Linguistics.
- [Wang et al., 2007] Zhuoran Wang, John Shawe-Taylor, and Sandor Szedmak. 2007. Kernel regression based machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 185–188, Rochester, New York, April. Association for Computational Linguistics.