Machine Translation Pre-training for Data-to-Text Generation - A Case Study in Czech
Abstract
While there is a large body of research studying deep learning methods for text generation from structured data, almost all of it focuses purely on English. In this paper, we study the effectiveness of machine translation based pre-training for data-to-text generation in non-English languages. Since the structured data is generally expressed in English, text generation into other languages involves elements of translation, transliteration and copying - elements already encoded in neural machine translation systems. Moreover, since data-to-text corpora are typically small, this task can benefit greatly from pre-training. Based on our experiments on Czech, a morphologically complex language, we find that pre-training lets us train end-to-end models with significantly improved performance, as judged by automatic metrics and human evaluation. We also show that this approach enjoys several desirable properties, including improved performance in low data scenarios and robustness to unseen slot values.
1 Introduction
Data-to-Text refers to the process of generating accurate and fluent natural language text from structured data such as tables, lists, graphs etc. (Gatt and Krahmer, 2018). It has several applications, including generating weather and sports summaries and response generation in task-oriented dialogue systems. For example, consider Figure 1, in the context of a restaurant booking system. The system must take a meaning representation (MR) as input - in this case represented in the form of a dialogue act (inform) and a list of key value pairs related to the restaurant - and generate fluent text that is firmly grounded in the MR.

Data-to-text can broadly be classified into two categories with respect to the nature of the output text: lexicalized and delexicalized. Figure 2 provides an example of both. In the lexicalized setting, models are trained to produce the full natural text. We refer to these as lexicalized models. In the delexicalized setting, the slot-values are replaced with placeholders. Models are trained to produce output text with these placeholders. We refer to these as delexicalized models. The placeholders are filled in via a separate lexicalization step. For English, this is achieved by simply copying slot values from the structured data into the corresponding placeholders.
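To make the two settings concrete, the following is a minimal Python sketch of delexicalization and of copy-based lexicalization. The placeholder format (`X-name`, `X-food`) and the function names are assumptions made for illustration; they are not taken from any particular system.

```python
def delexicalize(text, slots):
    """Replace each slot value found in the text with a placeholder such as X-name."""
    # Replace longer values first so a short value never clobbers part of a longer one.
    for key, value in sorted(slots.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(value, f"X-{key}")
    return text


def lexicalize(template, slots):
    """Fill placeholders by copying slot values verbatim (the approach used for English)."""
    # Replace longer keys first so X-price_range is never mistaken for X-price.
    for key, value in sorted(slots.items(), key=lambda kv: -len(kv[0])):
        template = template.replace(f"X-{key}", value)
    return template


slots = {"name": "Green Spirit", "food": "American"}
text = "Green Spirit serves American food."
template = delexicalize(text, slots)
print(template)                     # X-name serves X-food food.
print(lexicalize(template, slots))  # Green Spirit serves American food.
```

Verbatim copying of this kind is sufficient only when the slot value never has to change its surface form.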
However, Dušek and Jurčíček (2019) recently highlighted the deficiencies of delexicalization and copy based methods in the presence of linguistic phenomena such as morphological inflection. For instance, in Figure 1, when generating in Czech, the restaurant name "Pivo & Basilico" from the MR must be correctly inflected to "Pivu & Basilicu" to ensure fluency. Simple copying would fail (nouns in Czech may have up to 14 different forms, depending on the context). Moreover, in several languages, these complexities are compounded by the fact that inflecting a noun in a certain way requires changes to be made in the surrounding words (since modifying adjectives need to exhibit agreement). This makes the lexicalization step complex, requiring extensive linguistic knowledge. Consequently, end-to-end systems that directly generate fully lexicalized text without depending on any external linguistic knowledge present an attractive alternative. However, their performance in terms of semantic accuracy tends to lag far behind their delexicalized counterparts, especially in the presence of slot values not seen during training (Shimorina and Gardent, 2018).
In this work, we focus on generating text in non-English languages and show that it is possible to significantly reduce this accuracy gap by pre-training fully lexicalized models on an NMT task. For an example motivating the use of NMT, consider Figure 1 once again. In order to generate semantically correct and natural sounding text in Czech (Marathi), a data-to-text model would need to learn the following skills:
- Translate the slot value "dinner" to the target language
- Copy the phone number correctly
- Inflect the restaurant name
In the case of Marathi, which has a different script, there is the additional challenge of transliterating the restaurant name as well.
It is unreasonable to expect neural data-to-text models to learn all these skills, especially since the size of most NLG datasets is quite small. (While NLG is a broad term, in this paper we use NLG and data-to-text interchangeably.) However, modern neural machine translation systems are already fairly adept at translating, transliterating, copying, inflecting etc. Consequently, we hypothesise that the parameters of an NMT model will act as a very strong prior for an NLG model.

2 Related Work
Earlier work on NLG mainly studied rule-based pipelined methods, but recent works favor end-to-end neural approaches. Wen et al. (2015) proposed the Semantically Controlled LSTM and were one of the first to show the success of neural networks for this problem, with applications to task oriented dialogue. Since then, some works have focused on alternative architectures - Liu et al. (2018) generate text by conditioning language models on tables, while Puduppully et al. (2019) propose to explicitly model entities present in the structured data. The findings of the E2E challenge (Dušek et al., 2018) show that standard seq2seq models with attention also perform well.
With the advent of ELMo, BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019), the unsupervised pre-training + fine-tuning paradigm has been shown to be remarkably effective, leading to improvements in NLP tasks like classification, question answering and spoken language understanding (Siddhant et al., 2019a). Results for generation tasks like summarization are also positive, albeit less dramatic. Song et al. (2019) propose the MASS technique and obtain state-of-the-art results for summarization and unsupervised machine translation. Freitag and Roy (2018) show that denoising autoencoders can be leveraged for unsupervised language generation from structured data. Budzianowski and Vulić (2019) cast data-to-text as text-to-text generation and show that fine-tuning GPT language models can lead to performance competitive with architectures developed specifically for data-to-text. Chen et al. (2019) use language models to improve performance in the low resource scenario.
While the above works focus on unsupervised pre-training, Siddhant et al. (2019b) and Schuster et al. (2018) examine transfer learning via neural machine translation for NLU tasks like spoken language understanding and named entity recognition in the cross-lingual setting. They find that the results are mixed and for several NLU tasks, unsupervised pre-training actually outperforms its NMT counterpart. Our work shows that for NLG, machine translation substantially outperforms unsupervised pre-training objectives.
Recently, Chi et al. (2019) found multilingual pre-training techniques to be effective for cross-lingual language generation tasks like summarization and question generation. They focus on text-to-text NLG such as question generation and text summarization, where both the input and output are in the same language. In contrast, our work studies generation in the data-to-text setting, where the input is structured data as opposed to free form text and the output can be in any language.
The WNGT 2019 shared task provides a data-to-document dataset for German. However, it is a small dataset that has been obtained by translating the English RotoWire dataset (Wiseman et al., 2017). Since the English dataset was automatically created by crawling and aligning sports score boxes and summaries, large parts of the text in the RotoWire dataset are not grounded in the data. Hayashi et al. (2019) find that techniques such as multilingual training, back-translation etc. can help improve data-to-text performance in data scarce scenarios. Our focus is on NMT based transfer learning (we use pre-training and transfer learning interchangeably), and it can be combined with all of the above techniques.
3 Model Architecture
We use the transformer (Vaswani et al., 2017) based encoder-decoder architecture by casting data-to-text as a seq2seq problem, where the structured data is flattened into a plain string consisting of a series of intents and slot key-value pairs. We make this choice for three reasons. First, while more exotic architectures have been suggested in prior work, the findings of Dušek et al. (2018) show that simple seq2seq models are competitive alternatives, while being simpler to implement. Second, the transformer architecture is state-of-the-art for NMT. Third, keeping the pre-train and fine-tune architectures the same allows us to easily transfer knowledge between the two steps by parameter initialization.
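As an illustration of the flattening step, the sketch below turns a dialogue act and its slot key-value pairs into a single string. The concrete surface format (`intent(key=value, ...)`) and the function name `flatten_mr` are assumptions; the text only states that the structured data is fed as a plain string.

```python
def flatten_mr(intent, slots):
    """Flatten a dialogue act and its slot key-value pairs into one plain string."""
    pairs = ", ".join(f"{key}={value}" for key, value in slots)
    return f"{intent}({pairs})"


mr = flatten_mr("inform", [("name", "Green Spirit"), ("good_for_meal", "dinner")])
print(mr)  # inform(name=Green Spirit, good_for_meal=dinner)
```

Because the result is ordinary text, the same subword tokenizer and encoder can process it exactly as they would a source sentence in translation.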
4 Pre-train + Fine-tune
Our modeling approach is simple. We first use a parallel corpus to train a transformer based neural machine translation model that translates English text into the target language (Czech for our experiments). Next, we fine-tune this NMT model using a data-to-text corpus for a small number of steps. All the model parameters are updated in the fine-tuning process.
5 Models and Baselines
Machine Translation pre-training: This is our proposed approach (nmt), where we first train an NMT model and fine-tune it for the NLG task. We also experiment with fine-tuning a bidirectional machine translation model (binmt), where the NMT model is trained to translate both from English to the target language and vice-versa. This translation model is trained on the concatenation of English-Czech and Czech-English parallel data.
Training from scratch: A baseline where all the parameters are learned from scratch, without any kind of transfer learning. This is a 1-layer Transformer model. Larger models trained from scratch did not improve performance.
Unsupervised pre-training baseline: Monolingual data is generally far easier to obtain than bilingual data, which makes unsupervised pre-training techniques more attractive. Interestingly, Wu and Dredze (2019) and Pires et al. (2019) find that training multilingual BERT models on a combination of languages can lead to surprisingly effective cross-lingual performance on NLU tasks, without using any parallel data. Of the myriad unsupervised techniques, we choose MASS (Song et al., 2019) for our baseline since it has been shown to outperform other alternatives like BERT, left-to-right language models and denoising autoencoders for language generation tasks. We first train an unsupervised English-Czech MASS model and then fine-tune it for the NLG task. We denote this approach as mass. From a transfer learning perspective, MASS is a state-of-the-art baseline.
TGen is a freely available open-source NLG system based on seq2seq + attention and was used as a strong baseline in the E2E challenge. Dušek and Jurčíček (2019) create a pipelined system consisting of a TGen based model that outputs delexicalized text, a classifier that ranks the beam search hypotheses and a language model which does the lexicalization by picking the exact surface form. We denote this combined system, consisting of all 3 components, as tgen-sota. It is also currently the state-of-the-art for this dataset. Note that unlike tgen-sota, all our proposed models are trained to directly generate lexicalized outputs, which is a much harder task.
6 Experimental Setup
Slot Type | Example Values |
---|---|
name | Kočár z Vídně, Green Spirit |
area | Hradčany, Žatecká |
address | Kaprova 38, Žatecká 30 |
phone | 250625609, 219289692 |
good_for_meal | lunch, dinner, breakfast |
near | Powder Tower |
food | German, American |
price_range | cheap, expensive |
count | 10, 21 |
price | between 180 and 730 Kč |
postcode | 12100, 11700 |
kids_allowed | Yes, No |
part | Train | Dev | Test |
---|---|---|---|
Unique MRs | 144 | 51 | 53 |
Corpus size | 3,569 | 781 | 842 |
6.1 Datasets
Pre-training: We use the Czech-English parallel corpus provided by the WMT 2019 shared task. The dataset comprises 57 million translation pairs, automatically mined from the web, and covers a variety of domains (news, subtitles etc.). In order to facilitate a fair comparison, we use this corpus for our unsupervised pre-training baselines as well. This effectively results in 114 million monolingual sentences, equally split between English and Czech.
NLG: We use the recently released Czech Restaurant dataset, consisting of roughly 3500 examples for training. Further data related statistics can be found in Table 2. The delexicalized MRs in the test set never appear in the training set. As a result, models must learn to generalize to MRs with unseen slot and intent combinations. Table 1 lists all the slots that appear in the dataset, along with examples.
6.2 Training details
For NMT and MASS, we train transformer models with 93M parameters (6 layers, 8 heads, 512 hidden dimensions). They are trained on a TPU for 1 million steps with the Adam optimizer and a learning rate schedule of (1.0, 4K). The shorthand form (1.0, 4K) corresponds to a learning rate of 1.0, with 4000 warm-up steps for the schedule, which is decayed with the inverse square root of the number of training steps after warm-up. The effective batch size is 1024.
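One plausible reading of the (1.0, 4K) schedule is sketched below. The linear shape of the warm-up and the absence of any additional scaling factor are assumptions: the text specifies only the base rate, the number of warm-up steps and the inverse square root decay, and actual toolkit implementations may differ in these details.

```python
import math


def learning_rate(step, base_lr=1.0, warmup_steps=4000):
    """Linear warm-up to base_lr (assumed), then inverse-square-root decay."""
    step = max(step, 1)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * math.sqrt(warmup_steps / step)


for step in (2000, 4000, 16000):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at the end of warm-up and has halved by the time the step count reaches four times the warm-up length.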
For NLG, all our models are trained synchronously on 8 P100 GPUs for 10K steps with a batch size of 32 per GPU. We do not perform any hyperparameter tuning. Decoding is performed using beam search, with a beam width of 8.
6.3 Data pre-processing
Our vocabulary consists of a sentencepiece model with 32,000 tokens (Kudo and Richardson, 2018) shared between English and Czech. It is computed on English and Czech sentences from the pre-training corpus. In order to facilitate a fair comparison, we maintain the same vocabulary across all the transformer based models and baselines. Relying on sentencepieces also ensures that out-of-vocabulary tokens will not be encountered. No special rules or pre-processing are applied to tokenize the structured data - we simply feed it as a plain string. The input sequence is prepended with a task specific token - [TRANSLATE] for translation, [GENERATE] for NLG. Following Aharoni et al. (2019), we prepend a second token to specify the desired output language - <2en> for English and <2cs> for Czech. The <2en> token is required for the bidirectional NMT model.
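The input construction described above can be sketched as follows. The token strings come from the text; the order of the two tokens, the whitespace separator and the helper name `build_input` are assumptions made for illustration.

```python
TASK_TOKENS = {"translate": "[TRANSLATE]", "generate": "[GENERATE]"}
LANG_TOKENS = {"en": "<2en>", "cs": "<2cs>"}


def build_input(source, task, target_lang):
    """Prefix a source string with a task token and a target-language token."""
    return f"{TASK_TOKENS[task]} {LANG_TOKENS[target_lang]} {source}"


nlg_input = build_input("inform(name=Green Spirit, food=American)", "generate", "cs")
nmt_input = build_input("The restaurant is open for dinner.", "translate", "cs")
print(nlg_input)  # [GENERATE] <2cs> inform(name=Green Spirit, food=American)
print(nmt_input)  # [TRANSLATE] <2cs> The restaurant is open for dinner.
```

Sharing one input convention between translation and NLG is what allows the pre-trained parameters to be reused without any architectural change.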
model | BLEU ‡ | SER | NIST | METEOR | ROUGE-L | CIDEr | BLEU § |
---|---|---|---|---|---|---|---|
tgen-sota † | 20.6 | 2.75 | 4.77 | 23.32 | 42.95 | 2.18 | 21.96 |
scratch | 11.19 | 63.18 | 3.06 | 15.79 | 28.27 | 0.84 | 11.66 |
mass | 16.61 | 24.82 | 4.22 | 21.16 | 38.94 | 1.75 | 17.72 |
nmt | 24.41 | 2.38 | 5.19 | 25.46 | 46.85 | 2.55 | 25.84 |
binmt | 24.87 | 1.9 | 5.24 | 25.81 | 47.07 | 2.60 | 26.35 |
6.4 Metrics
We use BLEU (Papineni et al., 2002), computed by sacrebleu (Post, 2018), as one of our automatic metrics. We compute a Slot Error Rate (SER) metric to gauge how well the generated text reflects the structured data. We calculate how many of the slot values in the structured data have been mentioned in the generated text. An example is marked as correct only if all the slot-values in the structured data are present in the output. SER can be reliably computed only for delexicalizable slots; as a result, the kids_allowed slot is ignored. We refer the reader to the supplementary material for the exact SER algorithm. We also use the suite of word-overlap-based automatic metrics from the E2E NLG Challenge (https://github.com/tuetschek/e2e-metrics), supporting NIST (Doddington, 2002), ROUGE-L (Lin, 2004), METEOR (Lavie and Agarwal, 2007), CIDEr (Vedantam et al., 2015) and BLEU. Note that this BLEU is computed differently from sacrebleu.
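The example-level criterion for SER can be sketched as below. This is not the exact algorithm from the supplementary material: the substring matching, the list of acceptable surface forms per slot value and the function name `slot_error_rate` are all assumptions made for illustration.

```python
def slot_error_rate(examples, ignored_slots=("kids_allowed",)):
    """Percentage of examples whose output misses at least one slot value.

    Each example is (slots, output), where slots maps a slot name to a list of
    acceptable surface forms for its value (assumption: inflected variants are
    supplied by the caller).
    """
    errors = 0
    for slots, output in examples:
        lowered = output.lower()
        for name, forms in slots.items():
            if name in ignored_slots:
                continue
            if not any(form.lower() in lowered for form in forms):
                errors += 1
                break  # one missing value makes the whole example incorrect
    return 100.0 * errors / len(examples)


examples = [
    ({"name": ["Green Spirit"], "phone": ["250625609"]}, "Green Spirit: 250625609"),
    ({"name": ["Green Spirit"], "phone": ["250625609"]}, "Green Spirit: 219289692"),
]
print(slot_error_rate(examples))  # 50.0
```

In a morphologically rich language the list of acceptable forms matters, since an inflected slot value will not match its base form verbatim.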
7 Results and Discussion
7.1 Main Results
We report results in Table 3. Recall that these models are trained to generate fully lexicalized output.
The scratch baseline performs quite poorly. While unsupervised transfer learning (mass) performs better, pre-training via machine translation gives the best results by a large margin. nmt brings down the SER to just 2.38, a reduction of more than 20 points relative to mass, while improving the BLEU score by about 8 points. Similar trends are observed in the other metrics as well. binmt slightly outperforms nmt and leads to further gains across all metrics. These results give credence to our hypothesis that machine translation can be a strong pre-training objective for data-to-text generation in non-English languages.
Compared to the pipelined tgen-sota system, both nmt and binmt compare favorably, showing improvements on all metrics, including a 4 point improvement in BLEU. In section 7.6, we discuss this result in detail, along with a comparison of the two approaches.
7.2 Human Evaluation
We also conduct human evaluations on a set of 200 examples randomly sampled from the test set. Concretely, we measure three metrics - accuracy, fluency and pairwise preference.
Accuracy: Human raters are shown the gold text and the predicted text and are instructed to mark the generated text as inaccurate if any information contradicts the gold text. This effectively catches errors due to hallucinations, incorrect grounding etc. Each example is rated by 3 raters, and we consider an example to be correct if at least two raters say so.
Fluency: We show the predicted text to raters and ask them how natural and fluent the text sounds on a 1-5 scale, with 5 being the highest score. Again, each example is rated by 3 raters. We average the scores across all the ratings to get the fluency score.
We conduct accuracy and fluency evaluations for our best model (binmt) and the best lexicalized baseline, mass. Results are reported in Table 5. In terms of fluency, we note that mass produces quite fluent text, likely due to its strong language model. It would seem that unsupervised learning on unlabeled data is enough to generate fluent text, echoing findings of past work (Radford et al., 2019). binmt performs slightly better with a score of 4.83. However, when it comes to accuracy, our model gets a high score of 97.5, surpassing mass by 7.5 points. We take this as evidence that transfer learning from machine translation helps produce text that is not only fluent, but much better grounded in the structured data.
Pairwise Preference: We do a side by side evaluation of the predictions from binmt with the gold text written by humans. We show both texts to the raters and ask them which one they prefer on a 7 point Likert scale. Each example is rated by 3 raters, with the final rating obtained via majority vote.
In 40% of the cases, our model produces output that is as good as human written text, while in another 30% there is no majority. Strikingly, in 21% of the cases, the raters actually preferred the model’s output over the human written gold text. The human text is preferred in only 9% of the cases. These results strongly point to the applicability of this approach to real-world NLG systems.
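The three aggregation rules used in this section can be sketched as follows. The function names and the string labels are assumptions; the rules themselves (at least two of three raters for accuracy, a mean over all ratings for fluency, and a majority vote with a "no majority" outcome for pairwise preference) follow the descriptions above.

```python
from collections import Counter
from statistics import mean


def accuracy_label(votes):
    """An example counts as accurate if at least two of three raters say so."""
    return sum(votes) >= 2


def fluency_score(ratings):
    """Average the 1-5 fluency ratings over all raters and all examples."""
    return mean([r for example in ratings for r in example])


def pairwise_label(ratings):
    """Majority vote over three 7-point ratings; 'no majority' on a three-way split."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= 2 else "no majority"


print(accuracy_label([True, True, False]))                     # True
print(fluency_score([[5, 5, 4], [5, 4, 4]]))                   # 4.5
print(pairwise_label(["better", "about the same", "better"]))  # better
print(pairwise_label(["better", "about the same", "worse"]))   # no majority
```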
rating | percentage |
---|---|
much better | 0.5% |
better | 12.5% |
slightly better | 8% |
about the same | 40% |
slightly worse | 2.5% |
worse | 6.5% |
much worse | 0% |
no majority | 30% |
model | accuracy | fluency |
---|---|---|
binmt | 97.5 | 4.83 |
mass | 90 | 4.77 |
7.3 Low-resource machine translation
Our previous experiments use NMT models trained on a fairly large corpus. However, for many languages, the amount of available parallel data can be small. Unfortunately, we do not know of any public data-to-text datasets for actual low resource languages. Therefore, to study the impact of the size of the bitext corpus, we run experiments in a simulated low-resource setting. We train bidirectional machine translation models on 10% (5.7 million examples, medium resource, denoted as binmt-5m) and 1% (570K examples, low resource, denoted as binmt-500k) of the data and use them for fine-tuning on the NLG task.
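A simulated low-resource corpus of this kind can be produced by subsampling the parallel data, as in the sketch below. Uniform random sampling with a fixed seed is an assumption; the text states only the fractions used, and the toy corpus here merely stands in for the real bitext.

```python
import random


def subsample(pairs, fraction, seed=0):
    """Return a random subset containing `fraction` of the sentence pairs."""
    rng = random.Random(seed)
    k = round(len(pairs) * fraction)
    return rng.sample(pairs, k)


# Toy stand-in for a parallel corpus of (English, Czech) sentence pairs.
corpus = [(f"en sentence {i}", f"cs sentence {i}") for i in range(1000)]
medium = subsample(corpus, 0.10)
low = subsample(corpus, 0.01)
print(len(medium), len(low))  # 100 10
```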
First, to get an idea of how the corpus size affects translation performance, we compute BLEU scores of each model on the WMT 2019 English-Czech validation set. The medium resource model appears to be as good as the high resource model, but the low resource model is considerably weaker.
Next, we fine-tune each of these models on the data-to-text task. From the results in Table 7, we see that while the high resource model performs the best, the medium resource model is not far behind in terms of BLEU. Both the high and medium resource models have a comparable SER. Even the low resource model, pre-trained on just 1% of the translation corpus, is significantly better than mass, which has been pre-trained on almost 1.6 billion tokens. The results indicate that machine translation based transfer learning can be successfully applied even when the size of the parallel corpus is small, and thus holds promise for low-resource languages.
Model | BLEU |
---|---|
binmt-50m | 20.95 |
binmt-5m | 20.46 |
binmt-500k | 15.86 |
Pre-train | Model | BLEU | SER |
---|---|---|---|
1.6B | binmt-50m | 24.87 | 1.9 |
160M | binmt-5m | 22.17 | 1.43 |
16M | binmt-500k | 21.27 | 12.47 |
1.6B | mass | 16.61 | 24.82 |
7.4 Low resource NLG
In this section we study the effects of transfer learning when the size of the fine-tuning corpus is small. We create two random subsets from the NLG training data of size 100 and 1000. Results are reported in Table 8. We find that once again, NMT offers substantial gains over MASS. When fine-tuning on 1000 examples, pre-training with NMT is substantially better (a 20 point reduction in SER, +3 on BLEU) than fine-tuning MASS with the full dataset. Remarkably, with just 100 examples, our model outperforms training from scratch on the entire training set by over 3 BLEU, while reducing SER by over 30 points. These results lead us to believe that NMT pre-training can lead to substantial cost savings with respect to training data annotation.
Training Size | Model | BLEU | SER |
---|---|---|---|
100 | scratch | 2.83 | 78.5 |
100 | mass | 4.42 | 78.74 |
100 | binmt | 14.62 | 31.82 |
1000 | scratch | 6.93 | 70.19 |
1000 | mass | 9.07 | 66.15 |
1000 | binmt | 19.89 | 4.51 |
Full | scratch | 11.19 | 63.18 |
Full | mass | 16.61 | 24.82 |
Full | binmt | 24.87 | 1.9 |
7.5 Out-of-Vocabulary Slot Values
For real world systems, generalizing to out-of-vocabulary (OOV) slot-values is essential. However, since NLG datasets are small, this is a major failure mode for models producing lexicalized outputs. The model simply does not see enough unique slot values during training.
Generalization to OOV slot-values is hard to measure on this test set, since most of the slot-values already appear in the training set. Therefore, we design a new test set of meaning representations which exclusively contain OOV slot-values. This set contains 100 meaning representations (MRs) with a total of 200 slots and 155 unique OOV slot values. We use Google Search and Wikipedia to sample new values for slots like name, area, food etc. The exact dataset creation procedure is described in the supplementary materials.
Since we do not have gold text for these MRs, we manually rate the predictions of binmt and mass on this new test set and compute slot specific error rates. For each of the 200 slots, we mark it as incorrect if the corresponding slot value is missing from the prediction.

slot | unique | total | errors | accuracy |
---|---|---|---|---|
name | 39 | 70 | 3 | 95.7 |
area | 30 | 30 | 2 | 93.3 |
address | 10 | 10 | 1 | 90.0 |
food | 10 | 10 | 4 | 60.0 |
phone | 10 | 10 | 0 | 100.0 |
count | 20 | 20 | 6 | 70.0 |
post code | 10 | 10 | 0 | 100.0 |
price | 10 | 10 | 0 | 100.0 |
near | 16 | 30 | 1 | 96.7 |
total | 155 | 200 | 17 | 91.5 |
Unsupervised pre-training through MASS completely fails to generalize to OOV values - none of the slots are realized in the predictions. Looking at the output, we noticed that mass has a strong tendency to hallucinate or simply output slot values seen during training. The poor performance further reinforces the practical popularity of delexicalized models and highlights the need for challenging OOV test sets. binmt, on the other hand, remains robust - 91.5% of the slots from the MR are realized in the predictions. We show some examples in Figure 3, along with slot specific scores in Table 9. The results confirm that NMT can greatly improve the robustness to unseen values. Further reducing the performance gap between seen and unseen values is an important area of future work.
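The slot specific accuracies for binmt follow directly from the error counts. The short check below recomputes them from the totals and errors in the slot-level results table; the variable and function names are illustrative only.

```python
# (slot, total, errors) copied from the slot-specific results table for binmt.
rows = [
    ("name", 70, 3), ("area", 30, 2), ("address", 10, 1), ("food", 10, 4),
    ("phone", 10, 0), ("count", 20, 6), ("post code", 10, 0),
    ("price", 10, 0), ("near", 30, 1),
]


def accuracy(total, errors):
    """Percentage of slots realized, rounded to one decimal place."""
    return round(100.0 * (total - errors) / total, 1)


per_slot = {slot: accuracy(total, errors) for slot, total, errors in rows}
total_slots = sum(total for _, total, _ in rows)
total_errors = sum(errors for _, _, errors in rows)
overall = accuracy(total_slots, total_errors)
print(per_slot["name"], overall)  # 95.7 91.5
```

The errors are concentrated in the food and count slots, which together account for 10 of the 17 errors.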
7.6 Comparison with pipelined approaches
We showed in section 7.1 that our lexicalized NMT based models compare favorably to the current best model for this dataset - the tgen-sota system (Dušek and Jurčíček, 2019). Note that our aim in this work is not to beat state-of-the-art, but to gauge the effectiveness of machine translation as a pre-training strategy for NLG. Nevertheless, a comparison with tgen-sota offers some interesting insights. We first describe tgen-sota in detail. It consists of the following pipelined components:
Delexicalized Generator: A TGen model trained to generate delexicalized output, either as a sequence of words, or as a sequence of interleaved lemmas and morphological tags.
Classifier Reranker: An LSTM based NLU classifier that ranks top-k beam search hypotheses. The classifier is trained to predict the MR from the text and can be used to select outputs that are most faithful to the meaning representation.
LM Lexicalizer: Due to heavy inflections, inserting slot-values into the placeholders verbatim leads to ungrammatical texts. To remedy this, the authors train a language model, which picks the most probable surface form for every slot-value.
The delexicalized generator and reranker ensure that the generated text is firmly grounded in the MR. And while the language model does a good job of selecting the correct surface form, the technique relies on the existence of an exhaustive list of surface forms associated with each slot-value.
Such a list can be hard to obtain and maintain. The problem is exacerbated for open-domain slots like movies, people and restaurant names, which can take a large number of values and are constantly expanding. In addition, since several slot values in the structured data are in English, some form of bilingual knowledge (e.g. dictionaries) is also necessary. The linguistic expertise and resources required to create such a database, or even to train alternative models that inflect text based on morphological tags, may not be easily available (especially for low resource languages). Finally, pipelined methods require training, tuning and maintaining separate models for each component.
In stark contrast, our approach is completely end-to-end, consisting of a single model which directly produces fully lexicalized outputs without relying on any linguistic resources. The only dependence is on availability of parallel data and as we show in section 7.3, we can learn accurate models even in low resource NMT settings. Recent advances in bitext mining have resulted in sizeable bitext corpora for many low resource language pairs (Schwenk et al., 2019b, a), bolstering the potential use of this approach. Finally, as we show in the next section, NMT pre-training can also be used to develop improved delexicalized models and subsequently be incorporated into the pipelined approach.
7.7 Delexicalized NLG
In this set up, the model must produce delexicalized output. This is achieved by replacing certain slots in the output by placeholders, similar to the example in Figure 2. The model produces output with these placeholders, which are subsequently filled in as a separate step. The advantages of training delexicalized models include robustness to out of vocabulary values for slots involving entities. Such two step methods are common in practice. Producing delexicalized text is arguably a simpler problem, since the model needs to just output placeholders instead of fully lexicalized text. Every slot except the binary slot kids_allowed is delexicalized.
We compare NMT with MASS and a strong TGen based delexicalized model proposed by Dušek and Jurčíček (2019). From the results in Table 10, we see that all the models exhibit low SER, as expected. Our model outperforms the best baseline by 5 BLEU points, pointing to the applicability of our approach even in the case of delexicalized NLG. We leave the combination of this model with a lexicalizer component for future work.
model | BLEU | SER |
---|---|---|
baseline-mass | 23.48 | 1.07 |
binmt | 30.87 | 0.95 |
tgen-delex | 25.34 | 1.22 |
We compute sacrebleu on outputs provided to us by the authors (Dušek and Jurčíček, 2019).
8 Conclusion and Future Work
In this work we investigated neural machine translation based transfer learning for data-to-text generation in non-English languages. Using Czech as a target language, we showed that such an approach is effective and surpasses the performance of unsupervised transfer learning. It enables us to learn simple, fully lexicalized end-to-end models that perform on par with a sophisticated, linguistically informed pipelined system. Experimental results suggest several desirable properties including improved sample efficiency, robustness to unseen values and potential applications to low resource languages. At the same time, the approach can also be leveraged to improve performance of delexicalized models.
Studying pre-training on a wide variety of languages, especially those with different scripts, is a direct line of future work. Since this is mainly hindered by a lack of datasets, we hope to develop data-to-text corpora for other languages, including ones that are truly low-resource.
Acknowledgments
We would like to thank Markus Freitag for insightful discussions and Ondřej Dušek for providing the tgen-sota model outputs.
References
- Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
- Aharoni et al. (2019) Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. arXiv preprint arXiv:1903.00089.
- Budzianowski and Vulić (2019) Paweł Budzianowski and Ivan Vulić. 2019. Hello, it’s gpt-2–how can i help you? towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.
- Chen et al. (2019) Zhiyu Chen, Harini Eavani, Yinyin Liu, and William Yang Wang. 2019. Few-shot nlg with pre-trained language model. arXiv preprint arXiv:1904.09521.
- Chi et al. (2019) Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2019. Cross-lingual natural language generation via pre-training. arXiv preprint arXiv:1909.10481.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Doddington (2002) George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second international conference on Human Language Technology Research, pages 138–145. Morgan Kaufmann Publishers Inc.
- Dušek and Jurčíček (2019) Ondřej Dušek and Filip Jurčíček. 2019. Neural generation for czech: Data and baselines. arXiv preprint arXiv:1910.05298.
- Dušek et al. (2018) Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2018. Findings of the e2e nlg challenge. arXiv preprint arXiv:1810.01170.
- Freitag and Roy (2018) Markus Freitag and Scott Roy. 2018. Unsupervised natural language generation with denoising autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3922–3929.
- Gatt and Krahmer (2018) Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.
- Hayashi et al. (2019) Hiroaki Hayashi, Yusuke Oda, Alexandra Birch, Ioannis Konstas, Andrew Finch, Minh-Thang Luong, Graham Neubig, and Katsuhito Sudoh. 2019. Findings of the Third Workshop on Neural Generation and Translation. arXiv preprint arXiv:1910.13299.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
- Lavie and Agarwal (2007) Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231. Association for Computational Linguistics.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
- Liu et al. (2018) Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018. Table-to-text generation by structure-aware seq2seq learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
- Pires et al. (2019) Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502.
- Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
- Puduppully et al. (2019) Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6908–6915.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
- Schuster et al. (2018) Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2018. Cross-lingual transfer learning for multilingual task oriented dialog. arXiv preprint arXiv:1810.13327.
- Schwenk et al. (2019a) Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019a. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.
- Schwenk et al. (2019b) Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. 2019b. CCMatrix: Mining billions of high-quality parallel sentences on the web. arXiv preprint arXiv:1911.04944.
- Shen et al. (2019) Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al. 2019. Lingvo: A modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295.
- Shimorina and Gardent (2018) Anastasia Shimorina and Claire Gardent. 2018. Handling rare items in data-to-text generation. In Proceedings of the 11th International Conference on Natural Language Generation, pages 360–370.
- Siddhant et al. (2019a) Aditya Siddhant, Anuj Goyal, and Angeliki Metallinou. 2019a. Unsupervised transfer learning for spoken language understanding in intelligent agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4959–4966.
- Siddhant et al. (2019b) Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Arivazhagan, Jason Riesa, Ankur Bapna, Orhan Firat, and Karthik Raman. 2019b. Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. arXiv preprint arXiv:1909.00437.
- Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
- Wen et al. (2015) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.
- Wiseman et al. (2017) Sam Wiseman, Stuart M Shieber, and Alexander M Rush. 2017. Challenges in data-to-document generation. arXiv preprint arXiv:1707.08052.
- Wu and Dredze (2019) Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. arXiv preprint arXiv:1904.09077.