
Neural Language Generation: Formulation, Methods, and Evaluation

Cristina Gârbacea1, Qiaozhu Mei1,2
1Department of EECS, University of Michigan, Ann Arbor, MI, USA
2School of Information, University of Michigan, Ann Arbor, MI, USA
{garbacea, qmei}@umich.edu
Abstract

Recent advances in neural network-based generative modeling have reignited hopes of having computer systems capable of seamlessly conversing with humans and understanding natural language. Neural architectures have been employed to generate text excerpts with varying degrees of success, in a multitude of contexts and tasks that fulfil various user needs. Notably, high-capacity deep learning models trained on large-scale datasets demonstrate unparalleled abilities to learn patterns in the data even in the absence of explicit supervision signals, opening up a plethora of new possibilities for producing realistic and coherent texts. While the field of natural language generation is evolving rapidly, there are still many open challenges to address. In this survey we formally define and categorize the problem of natural language generation. We review particular application tasks that are instantiations of these general formulations, in which generating natural language is of practical importance. Next we include a comprehensive outline of methods and neural architectures employed for generating diverse texts. Nevertheless, there is no standard way to assess the quality of text produced by these generative models, which constitutes a serious bottleneck towards the progress of the field. To this end, we also review current approaches to evaluating natural language generation systems. We hope this survey will provide an informative overview of formulations, methods, and assessments of neural natural language generation.

1 Introduction

Recent successes in deep generative modeling and representation learning have led to significant advances in natural language generation (NLG), motivated by an increasing need to understand and derive meaning from language. The research field of text generation is fundamental in natural language processing and aims to produce realistic and plausible textual content that is indistinguishable from human-written text Turing (1950). Broadly speaking, the goal of predicting a syntactically and semantically correct sequence of consecutive words given some context is achieved in two steps: first estimating a distribution over sentences from a given corpus, and then sampling novel and realistic-looking sentences from the learnt distribution. Ideally, the generated sentences preserve the semantic and syntactic properties of real-world sentences, and are different from the training examples used to estimate the model Zhang et al. (2017b). Language generation is an inherently complex task, which requires considerable linguistic and domain knowledge at multiple levels, including syntax, semantics, morphology, phonology, pragmatics, etc. Moreover, texts are generated to fulfill a communicative goal Reiter (2019), such as to provide support in decision making, summarize content, translate between languages, converse with humans, make specific texts more accessible, as well as to entertain users or encourage them to change their behaviour. Therefore generated texts should be tailored to their specific audience in terms of appropriateness of content and terminology used Paris (2015), as well as for fairness and transparency reasons Mayfield et al. (2019). For a long time, natural language generation models were rule-based or relied on training shallow models on sparse, high-dimensional features. With the recent resurgence of neural networks, neural network-based models for text generation trained on dense vector representations have established new state-of-the-art performance and reignited hopes of having machines able to understand language and seamlessly converse with humans. Indeed, generating meaningful and coherent texts is pivotal to many natural language processing tasks. Nevertheless, designing neural networks that can generate coherent text and model long-term dependencies has long been a challenge for natural language generation due to the discrete nature of text data. Beyond that, the ability of neural network models to understand language and ground textual concepts beyond picking up on shallow patterns in the data still remains limited. Finally, evaluation of generative models for natural language is an equally active and challenging research area of significant importance in driving forward the progress of the field.

In this work we formally define the problem of neural text generation in particular contexts and present the diverse practical applications of text generation in Section 2. In Section 3 we include a comprehensive overview of deep learning methodologies and neural model architectures employed in the literature for neural network-based natural language generation. We review methods for the evaluation of the generated texts in Section 4. Finally, in Section 5 we conclude with insights and future perspectives regarding neural text generation and evaluation. Given the rapid evolution of this field of research, we hope the current survey will serve as a thorough overview of present-day neural network-based natural language generation and evaluation for anyone interested in learning about these topics, and will provide the reader with up-to-date information on the latest research advances. Compared to the survey of Gatt and Krahmer (2018), our overview is a more comprehensive and updated coverage of neural network methods and evaluation centered around the novel problem definitions and task formulations.

2 Problem Definitions

In what follows we formally define the natural language generation problem according to context, conditions and constraints considered when producing new text. We divide text generation into the following three categories: i) generic or free-text generation presented in Section 2.1, ii) conditional text generation introduced in Section 2.2, and iii) constrained text generation outlined in Section 2.3. For each category we define the text generation problem according to the assumptions made, and clarify differences between these categories. In addition, in Section 2.4 we provide examples of application areas where language generation presents rich practical opportunities.

2.1 Generic / Free-Text Generation

The problem of generic text generation aims to produce realistic text without placing any external user-defined constraints on the model output. Nevertheless, it does consider the intrinsic history of past words generated by the model as context. We formally define the problem of free-text generation.

Given a discrete sequence of text tokens x=(x_{1},x_{2},\ldots,x_{n}) as input, where each x_{i} is drawn from a fixed set of symbols, the goal of language modeling is to learn the unconditional probability distribution p(x) of the sequence x. This distribution can be factorized using the chain rule of probability Bengio et al. (2003) into a product of conditional probabilities:

p(x)=\prod_{i=1}^{n}p(x_{i}|x_{<i}) (1)

When p(x) is modeled by a neural network with parameters \theta, the neural network is trained to minimize the negative log-likelihood over a collection of samples D=\{x^{1},\ldots,x^{|D|}\}:

\mathcal{L}(D)=-\sum_{k=1}^{|D|}\sum_{i}\log p_{\theta}(x_{i}^{k}|x_{<i}^{k}) (2)
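
As a concrete illustration of Eq. (2), the negative log-likelihood reduces to a standard cross-entropy over next-token predictions. The following is a minimal sketch, assuming a hypothetical PyTorch model that maps token ids of shape (batch, length) to per-position vocabulary logits; it is not tied to any particular architecture discussed in this survey.

import torch
import torch.nn.functional as F

def lm_negative_log_likelihood(model, token_ids):
    """Negative log-likelihood of Eq. (2) for one batch of sequences.

    token_ids: LongTensor of shape (batch, length).
    model:     assumed to return logits of shape (batch, length-1, vocab)
               when fed the first length-1 tokens (teacher forcing).
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                                   # (B, T-1, V)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           reduction="sum")                  # sum over positions and samples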

Large scale models for generic text generation show promising abilities to produce coherent texts. Nevertheless, the problem of free-text generation is challenging as it places a lot of burden on the generative model to capture complex semantic and structural features underlying the data distribution. This can often result in incoherent and largely randomized generated text. In addition, the generated content is uncontrollable with respect to particular attributes and modes of data being generated.

2.2 Conditional Text Generation

Conditional text generation is useful when generating textual content whose attributes can be controlled or adjusted so as to enable the manipulation of the generated content. By conditioning the generative model on additional information, it becomes possible to direct the data generation process and control which modes of the data are generated. In the literature conditional text generation is sometimes referred to as context-dependent text generation. Since the word context may carry different semantics for different readers, we clarify that in this survey the definition of conditional text generation considers as context only attributes external to the model, and not model-intrinsic attributes such as the history of past generated words, which is already included in the formulation of the generic text generation problem in Section 2.1.

Conditional language models are used to learn the distribution p(x|c) of the data x conditioned on a specific attribute code c. Similar to the formulation of generic text generation, the distribution can still be decomposed using the chain rule of probability as follows:

p(x|c)=\prod_{i=1}^{n}p(x_{i}|x_{<i},c) (3)

The conditional language model parameterized by a neural network can be trained with a negative log-likelihood loss function which takes into account the control code c:

\mathcal{L}(D)=-\sum_{k=1}^{|D|}\sum_{i}\log p_{\theta}(x_{i}^{k}|x_{<i}^{k},c^{k}) (4)
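
One common way to realize Eq. (4) in practice is to prepend the attribute code c to the sequence as a special token, so that every subsequent prediction is conditioned on it. The sketch below makes this assumption explicit; the model and the id tensors are illustrative placeholders rather than a reference implementation.

import torch
import torch.nn.functional as F

def conditional_nll(model, token_ids, control_ids):
    """Loss of Eq. (4), with the control code c prepended to each sequence.

    token_ids:   (batch, length) token ids of the text x.
    control_ids: (batch,) ids of the attribute codes c.
    model:       assumed to map (batch, T) inputs to (batch, T, vocab) logits.
    """
    x = torch.cat([control_ids.unsqueeze(1), token_ids], dim=1)   # [c, x_1, ..., x_T]
    inputs, targets = x[:, :-1], x[:, 1:]
    logits = model(inputs)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="sum")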

As specified above, conditional models for text generation add a contextual variable or condition into the probabilistic model, transforming it into a conditional probability model. Nevertheless, conditional models do not place any hard constraints on the generated output. Conditioning the generated text on specific low-level attributes is approached in two ways in the literature: either via i) conditional training Fan et al. (2018a), which conditions the model on additional control features at training time, modeling P(y|x,z) as a function of the output y given the input x and discrete control variable z, or via ii) weighted decoding Ghazvininejad et al. (2017), which adds control features to the decoding scoring function at test time only. Examples of attributes used for conditioning the generated text are the source sentence in machine translation, the conversational history in dialogue systems, the input document in text summarization and text simplification, the input question in question answering systems, or contextual information such as product, time and location in review generation.
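
To contrast the two strategies, weighted decoding leaves the trained model untouched and only rescores candidate tokens at decoding time. The sketch below is purely illustrative and assumes hypothetical feature functions (e.g., topic-relevance scores) with hand-tuned weights.

def weighted_decoding_score(token, log_prob, feature_fns, weights):
    """Score of a candidate next token under weighted decoding: the model
    log-probability plus a weighted sum of control features, applied at
    test time only (the underlying model is unchanged)."""
    return log_prob + sum(w * f(token) for f, w in zip(feature_fns, weights))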

2.3 Constrained Text Generation

The problem of constrained text generation focuses on generating coherent and logical texts that cover a specific set of concepts (such as pre-defined nouns, verbs, entities, phrases or sentence fragments) desired to be present in the output, and/or abide by user-defined rules which reflect the particular interests of the system user. Lexically constrained text generation Hokamp and Liu (2017) places explicit constraints on independent attribute controls and combines these with differentiable approximation to produce discrete text samples. In the literature the distinction between conditional, controlled and constrained text generation is not clearly defined, and these terms are often used interchangeably. In fact, the first work that proposed generating constrained text actually refers to the task as controlled generation Hu et al. (2017). In what follows we formally define the problem of constrained text generation.

Let us consider we are (optionally) given an unordered or ordered set of n concepts x=\{c_{1},c_{2},\ldots,c_{n}\}\in\mathcal{X}, where \mathcal{X} denotes the space of all concepts, each c_{i}\in C, where C represents the concept vocabulary and c_{i} denotes a noun or a verb. In addition, let us assume we are also (optionally) given a set of m rules y=\{y_{1},y_{2},\ldots,y_{m}\}\in\mathcal{Y}, with y_{i}\in\mathcal{R}, where \mathcal{R} denotes the space of all rules, and each y_{i} is a text generation constraint expressed in logical form. We formulate constrained text generation as learning the structured predictive function f:\mathcal{X}\cup\mathcal{Y}\rightarrow\mathcal{Z}, \mathcal{X}\cup\mathcal{Y}\neq\emptyset, which maps a set of concepts and/or constraint rules to a generated sentence. Therefore, constrained text generation methods impose constraints on the generated sentences and produce output in the form of a grammatical sentence z\in\mathcal{Z} which contains all concepts present in x and/or meets all constraints specified in y. The matching function f manipulates the probability distribution and indicates to what extent the constraints are satisfied. In the literature, constrained text generation methods are categorized into:

  • Soft-constrained text generation (priming): requires the generated sentences to be semantically related to the given constraints, without strictly enforcing the presence of those constraints (e.g., topic words) in the generated content. The matching function f is in this case a soft measure of semantic similarity. Typically, a corpus of (keyword, text) pairs is first constructed, followed by training a conditional text generation model to capture their co-occurrence and generate text which contains the constrained keywords. Nevertheless, this approach does not guarantee that all desired keywords will be preserved during generation; some of them may get lost and will not be found in the generated output, in particular when there are constraints on simultaneously including multiple keywords.

  • Hard-constrained text generation: refers to the mandatory inclusion of certain keywords in the output sentences. The matching function f is in this case a binary indicator, which rules out the possibility of generating infeasible sentences that do not meet the given constraints. Therefore, by placing hard constraints on the generated output, all lexical constraints must be present in the generated output. Unlike soft-constrained models which are straightforward to design, the problem of hard-constrained text generation requires the design of complex dedicated neural network architectures (see the sketch of both matching functions after this list).
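
The two matching functions can be caricatured in a few lines. The sketch below is purely illustrative: the hard variant is a binary indicator over required keywords, while the soft variant scores semantic relatedness, approximated here by cosine similarity between assumed embedding vectors.

import numpy as np

def hard_constraint_satisfied(generated_tokens, required_keywords):
    """Binary matching function f: True only if every required keyword
    occurs in the generated output (hard-constrained generation)."""
    output = set(generated_tokens)
    return all(keyword in output for keyword in required_keywords)

def soft_constraint_score(text_embedding, keyword_embeddings):
    """Soft matching function f: mean cosine similarity between the
    generated text and the constraint keywords in an embedding space."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return float(np.mean([cosine(text_embedding, k) for k in keyword_embeddings]))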

Constrained text generation is useful in many scenarios, such as incorporating in-domain terminology in machine translation Post and Vilar (2018), avoiding generic and meaningless responses in dialogue systems Mou et al. (2016), incorporating ground-truth text fragments (such as semantic attributes, object annotations) in image caption generation Anderson et al. (2017). Typical attributes used to generate constrained natural language are the tense and the length of the summaries in text summarization Fan et al. (2018a), the sentiment of the generated content in review generation Mueller et al. (2017), language complexity in text simplification or the style in text style transfer applications. In addition, constrained text generation is used to overcome limitations of neural text generation models for dialogue such as genericness and repetitiveness of responses See et al. (2019), Serban et al. (2016).

Nevertheless, generating text under specific lexical constraints is challenging Zhang et al. (2020). While for humans it is straightforward to generate sentences that cover a given set of concepts or abide by pre-defined rules by making use of their commonsense reasoning ability, such generative commonsense reasoning under constraints is far from trivial for machine learning models Lin et al. (2019).

2.4 Natural Language Generation Tasks

In what follows we present natural language generation tasks which are instances of generic, conditional and constrained text generation. All these applications demonstrate the practical value of generating coherent and meaningful texts, and that advances in natural language generation are of immediate applicability and practical importance in many downstream tasks.

2.4.1 Neural Machine Translation

The field of machine translation focuses on the automatic translation of textual content from one language into another language. The field has undergone major changes in recent years, with end-to-end learning approaches for automated translation based on neural networks replacing conventional phrase-based statistical methods Bahdanau et al. (2014), Wu et al. (2016a). In contrast to statistical models which consist of several sub-components trained and tuned separately, neural machine translation models build and train a single, large neural network end-to-end by feeding it as input textual content in the source language and retrieving its corresponding translation in the target language. Neural machine translation is a typical example of conditional text generation, where the condition encapsulated by the conditional attribute code c is represented by the input sentence in the source language and the goal task is to generate its corresponding translation in the target language. In addition, neural machine translation is also an instance of constrained text generation given that it imposes the constraint to generate text in the target language. Additional constraints can be placed on the inclusion in the target sentence of named entities already present in the source sentence. In what follows we formally define the problem of neural machine translation.

We denote with V_{s} the vocabulary of the source language and with V_{t} the vocabulary of the target language, with |V_{t}|\approx|V_{s}| and V_{t}\cap V_{s}=\emptyset. Let us also denote with V_{s}^{*} and V_{t}^{*} all possible sentences under V_{s}, respectively V_{t}. Given a source sentence X=(x_{1},x_{2},\ldots,x_{l}), X\in V_{s}^{*}, x_{i}\in V_{s}, where x_{i} is the i^{th} word in X, \forall i=1,\ldots,l, the goal is to generate the distribution over the possible output sentences Y=(y_{1},y_{2},\ldots,y_{l'}), Y\in V_{t}^{*}, y_{j}\in V_{t}, where y_{j} is the j^{th} word in Y, \forall j=1,\ldots,l', by factoring Y into a chain of conditional probabilities with left-to-right causal structure using a neural network with parameters \theta:

p(Y|X;\theta)=\prod_{t=1}^{l'+1}p(y_{t}|y_{0:t-1},x_{1:l};\theta) (5)

Special sentence delimiters y_{0} (<S>) and y_{l'+1} (<E>) are commonly added to the vocabulary to mark the beginning and end of the target sentence Y. Typically in machine translation the source and target vocabularies consist of the most frequent words used in a language (e.g., the top 15,000 words), while the remaining words are replaced with a special <UNK> token. Every source sentence X is usually mapped to exactly one target sentence Y, and there is no sharing of words between the source sentence X and the target sentence Y.
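
At inference time the factorization in Eq. (5) is typically unrolled left to right, starting from the <S> delimiter and stopping at <E>. The greedy decoder below is a minimal sketch under the assumption of a hypothetical model(src_ids, tgt_prefix) call returning next-token logits; production systems usually rely on beam search instead.

import torch

def greedy_decode(model, src_ids, bos_id, eos_id, max_len=100):
    """Greedy left-to-right decoding of p(Y|X) as factorized in Eq. (5)."""
    tgt = [bos_id]                                     # y_0 = <S>
    for _ in range(max_len):
        logits = model(src_ids, torch.tensor([tgt]))   # next-token logits, shape (1, V)
        next_id = int(logits.argmax(dim=-1)[0])
        tgt.append(next_id)
        if next_id == eos_id:                          # y_{l'+1} = <E>
            break
    return tgt[1:]                                     # drop <S>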

Although neural network based approaches to machine translation have resulted in superior performance compared to statistical models, they are computationally expensive both in training and in translation inference time. The output of machine translation models is evaluated by asking human annotators to rate the generated translations on various dimensions of textual quality, or by comparisons with human-written reference texts using automated evaluation metrics.

2.4.2 Text Summarization

Text summarization is designed to facilitate a quick grasp of the essence of an input document by producing a condensed summary of its content. This can be achieved in two ways, either by means of extractive summarization or through abstractive/generative summarization. While extractive summarization Nallapati et al. (2017) methods produce summaries by copy-pasting the relevant portions from the input document, abstractive summarization Rush et al. (2015), Nallapati et al. (2016), See et al. (2017) algorithms can generate novel content that is not present in the input document. Hybrid approaches combining extractive summarization techniques with neural abstractive summary generation serve to identify salient information in a document and generate distilled Wikipedia articles Liu et al. (2018b). Characteristics of a good summary include brevity, fluency, non-redundancy, coverage and logical entailment of the most salient pieces of information from the input document(s).

Text summarization is a conditional text generation task where the condition is represented by the given document(s) to be summarized. Additional control codes are used in remainder summarization, offering flexibility to define which parts of the document(s) are of interest, e.g., remaining paragraphs the user has not read yet, or in source-specific summarization to condition summaries on the source type of input documents, e.g., newspapers, books or news articles. Besides being a conditional text generation task, text summarization is also a typical example of constrained text generation where the constraint is that the length of the resulting summary is strictly less than the length of the original document. Unlike machine translation where output length varies depending on the source content, in text summarization the length of the output is fixed and pre-determined. Controlling the length of the generated summary allows to digest information at different levels of granularity and define the level of detail desired, accounting for particular user needs and time budgets; e.g., a document can be summarized into a headline, a single sentence or a multi-sentence paragraph. In addition, explicit constraints can be placed on specific concepts desired for inclusion in the summary. Most frequently, named entities are used as constraints in text summarization to ensure the generated summary is specifically focused on topics and events describing them. In addition, in the particular case of extractive summarization, there is the additional constraint that sentences need to be picked explicitly from the original document. In what follows we formally define the task of text summarization.

We consider the input consisting of a sequence of M words x=(x_{1},x_{2},\ldots,x_{M}), x_{i}\in\mathcal{V}_{\mathcal{X}}, i=1,\ldots,M, where \mathcal{V}_{\mathcal{X}} is a fixed vocabulary of size |\mathcal{V}_{\mathcal{X}}|. Each word x_{i} is represented as an indicator vector x_{i}\in\{0,1\}^{\mathcal{V}_{\mathcal{X}}}, sentences are represented as sequences of indicators, and \mathcal{X} denotes the set of all possible inputs. A summarization model takes x as input and yields a shorter version of it in the form of output sequence y=(y_{1},y_{2},\ldots,y_{N}), with N<M and y_{j}\in\{0,1\}^{\mathcal{V}_{\mathcal{Y}}}, \forall j=1,\ldots,N.

Abstractive / Generative Summarization. We define \mathcal{Y}\subset(\{0,1\}^{\mathcal{V}_{\mathcal{Y}}},\ldots,\{0,1\}^{\mathcal{V}_{\mathcal{Y}}}) as the set of all possible generated summaries of length N, with y\in\mathcal{Y}. The summarization system is abstractive if it tries to find the optimal sequence y^{*}\in\mathcal{Y} under the scoring function s:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}, which can be expressed as:

y^{*}=\arg\max_{y\in\mathcal{Y}}s(x,y) (6)

Extractive Summarization. As opposed to abstractive approaches which generate novel sentences, extractive approaches transfer parts from the input document x to the output y:

y^{*}=\underset{m\in\{1,\ldots,M\}^{N}}{\arg\max}\,s(x,x_{[m_{1},\ldots,m_{N}]}) (7)
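
As a toy illustration of Eq. (7), the extractive argmax can be made explicit by scoring every candidate set of sentence indices. This brute-force search is only feasible for very small documents; practical extractive systems instead rely on learned sentence scorers or sequential selection. The scoring function s is an assumed callable here.

from itertools import combinations

def best_extractive_summary(sentences, score_fn, num_sentences):
    """Brute-force search over Eq. (7): choose indices m_1 < ... < m_N whose
    extracted sentences maximize the scoring function s(x, x_[m])."""
    best_indices, best_score = None, float("-inf")
    for indices in combinations(range(len(sentences)), num_sentences):
        candidate = [sentences[i] for i in indices]
        score = score_fn(sentences, candidate)
        if score > best_score:
            best_indices, best_score = indices, score
    return best_indices, best_score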

Abstractive summarization is notably more challenging than extractive summarization, and allows to incorporate real-world knowledge, paraphrasing and generalization, all crucial components of high-quality summaries See et al. (2017). In addition, abstractive summarization does not impose any hard constraints on the system output other than shorter length and gives the system a lot of freedom to generate suitable content, which in turn results in the system’s ability to fit a wide range of training data. Approaches for neural abstractive summarization build upon advances in machine translation. Attention mechanisms Bahdanau et al. (2014) and pointer networks Vinyals et al. (2015b) are used to focus on specific parts of the source document and copy input entities. Nevertheless, a common limitation of current abstractive summarization models is their tendency to copy long passages from the source document as opposed to generating novel content. Consequently, the word overlap between the source document and the generated abstractive summary is generally high See et al. (2017), Kryściński et al. (2018).

Very related in nature to the task of text summarization is the problem of text compression, which takes as input a text document and aims to produce a short summary of it by deleting the least critical information, while retaining the most important ideas and preserving sentence fluency. Sentence compression is referred to in the literature as a “scaled-down version of the text summarization problem” Knight and Marcu (2002), and is of practical importance in many downstream applications including text summarization, subtitle generation and displaying text on small screens Mallinson et al. (2018). Similar to text summarization, text compression is both a conditional text generation and a constrained text generation task. The condition is represented by the input document for which the text compression system needs to output a condensed version. The task is also constrained text generation given that the system needs to produce a compressed version of the input that is strictly shorter in length. In addition, further constraints can be specified when the text compression output is desired to be entity-centric.

Denoting with C_{i}=\{c_{i1},c_{i2},\ldots,c_{il}\} the set of possible compression spans and with y_{i,c} a binary variable which equals 1 if the c^{th} token of the i^{th} sentence \hat{s_{i}} in document D is deleted, we are interested in modeling the probability p(y_{i,c}|D,\hat{s_{i}}). Following the same definitions from Section 2.4.2, we can formally define the optimal compressed text sequence under scoring function s as:

y^{*}=\underset{m\in\{1,\ldots,M\}^{N},\,m_{i-1}<m_{i}}{\arg\max}\,s(x,x_{[m_{1},\ldots,m_{N}]}) (8)

Shorter paraphrases are generated through end-to-end neural networks in Filippova et al. (2015). Unsupervised models based on denoising autoencoders without the need for paired corpora are used in Fevry and Phang (2018). The level of compression is computed as the character length ratio between the source sentence and the target sentence Martin et al. (2019b). Sentence compressions are identified by constituency parsing and scored by neural models for text summarization Xu and Durrett (2019). Factors directly impacting the generated summary complexity are the compression rate, the summarization technique and the nature of the summarized corpus Vodolazova and Lloret (2019).

Current datasets, models and evaluation metrics for text summarization are considered not robust enough Kryscinski et al. (2019). Shortcomings include uncurated automatically collected datasets, models that overfit to biases in the data and produce outputs with little diversity, as well as non-informative evaluation metrics weakly correlated with human judgements.

2.4.3 Text Simplification

Text simplification is designed to reduce the lexical and syntactic complexity of text, while preserving the main idea and approximating the original meaning. The goal of text simplification systems is to make highly specialized textual content accessible to readers who lack adequate literacy skills, such as children, people with low education, people who have reading disorders or dyslexia, and non-native speakers of the language. In the literature text simplification has been addressed at multiple levels: i) lexical simplification Devlin (1999) is concerned with replacing complex words or phrases with simpler alternatives; ii) syntactic simplification Siddharthan (2006) alters the syntactic structure of the sentence; iii) semantic simplification Kandula et al. (2010), sometimes also known as explanation generation, paraphrases portions of the text into simpler and clearer variants. More recently, end-to-end models for text simplification attempt to address all these steps at once.

Text simplification is an instance of conditional text generation given we are conditioning on the input text to produce a simpler and more readable version of a complex document, as well as an instance of constrained text generation since there are constraints on generating simplified text that is shorter in length compared to the source document and with higher readability level. To this end, it is mandatory to use words of lower complexity from a much simpler target vocabulary than the source vocabulary. We formally introduce the text simplification task below.

Let us denote with V_{s} the vocabulary of the source language and with V_{t} the vocabulary of the target language, with |V_{t}|\ll|V_{s}| and V_{t}\subseteq V_{s}. Let us also denote with V_{s}^{*} and V_{t}^{*} all possible sentences under V_{s}, respectively V_{t}. Given source sentence X=(x_{1},x_{2},\ldots,x_{l}), X\in V_{s}^{*}, x_{i}\in V_{s}, where x_{i} is the i^{th} word in X, \forall i=1,\ldots,l, the goal is to produce the simplified sentence Y=(y_{1},y_{2},\ldots,y_{l'}), Y\in V_{t}^{*}, y_{j}\in V_{t}, where y_{j} is the j^{th} word in Y, \forall j=1,\ldots,l', by modeling the conditional probability p(Y|X). In the context of neural text simplification, a neural network with parameters \theta is used to maximize the probability p(Y|X;\theta).

Next we highlight differences between machine translation and text simplification. Unlike machine translation where the output sentence Y does not share any common terms with the input sentence X, in text simplification some or all of the words in Y might remain identical to the words in X in cases when the terms in X are already simple. In addition, unlike machine translation where the mapping between the source sentence and the target sentence is usually one-to-one, in text simplification the relation between the source sentence and the target sentence can be one-to-many or many-to-one, as simplification involves splitting and merging operations Surya et al. (2018). Furthermore, infrequent words in the vocabulary cannot simply be dropped and replaced with an unknown token as is typically done in machine translation, but need to be simplified appropriately corresponding to their level of complexity Wang et al. (2016a). Lexical simplification and content reduction are simultaneously approached with neural machine translation models in Nisioi et al. (2017), Sulem et al. (2018c). Nevertheless, text simplification presents particular challenges compared to machine translation. First, simplifications need to be adapted to particular user needs, and ideally personalized to the educational background of the target audience Bingel (2018), Mayfield et al. (2019). Second, text simplification has the potential to bridge the communication gap between specialists and laypersons in many scenarios. For example, in the medical domain it can help improve the understandability of clinical records Shardlow and Nawaz (2019), address disabilities and inequity in educational environments Mayfield et al. (2019), and assist with providing accessible and timely information to the affected population in crisis management Temnikova (2012).

2.4.4 Text Style Transfer

Style transfer is a newly emerging task designed to preserve the information content of a source sentence while delivering it so as to meet desired presentation constraints. To this end, it is important to disentangle the content itself from the style in which it is presented and be able to manipulate the style so as to easily change it from one attribute into another attribute of different or opposite polarity. This is often achieved without the need for parallel data for source and target styles, but accounting for the constraint that the transferred sentences should match in style example sentences from the target style. To this end, text style transfer is an instance of constrained text generation. In addition, it is also a typical scenario of conditional text generation where we are conditioning on the given source text. Style transfer has been originally used in computer vision applications for image-to-image translation Gatys et al. (2016), Liu and Tuzel (2016), Zhu et al. (2017), and more recently has been used in natural language processing applications for machine translation, sentiment modification to change the sentiment of a sentence from positive to negative and vice versa, word substitution decipherment and word order recovery Hu et al. (2017).

The problem of style transfer in language generation can be formally defined as follows. Given two datasets X_{1}=\{x_{1}^{(1)},x_{1}^{(2)},\ldots,x_{1}^{(n)}\} and X_{2}=\{x_{2}^{(1)},x_{2}^{(2)},\ldots,x_{2}^{(n)}\} with the same content distribution but different unknown styles y_{1} and y_{2}, where the samples in dataset X_{1} are drawn from the distribution p(x_{1}|y_{1}) and the samples in dataset X_{2} are drawn from the distribution p(x_{2}|y_{2}), the goal is to estimate the style transfer functions between them, p(x_{1}|x_{2};y_{1},y_{2}) and p(x_{2}|x_{1};y_{1},y_{2}). According to the formulation of the problem we can only observe the marginal distributions p(x_{1}|y_{1}) and p(x_{2}|y_{2}), and the goal is to recover the joint distribution p(x_{1},x_{2}|y_{1},y_{2}), which can be expressed as follows assuming the existence of a latent content variable z generated from distribution p(z):

p(x_{1},x_{2}|y_{1},y_{2})=\int_{z}p(z)p(x_{1}|y_{1},z)p(x_{2}|y_{2},z)\,dz (9)

Given that x_{1} and x_{2} are independent of each other given z, the conditional distribution corresponding to the style transfer function is defined as:

\begin{split}p(x_{1}|x_{2};y_{1},y_{2})&=\int_{z}p(x_{1},z|x_{2};y_{1},y_{2})\,dz\\ &=\int_{z}p(x_{1}|y_{1},z)\,p(z|x_{2},y_{2})\,dz\\ &=\mathbb{E}_{z\sim p(z|x_{2},y_{2})}[p(x_{1}|y_{1},z)]\end{split} (10)

Models proposed in the literature for style transfer rely on encoder-decoder models. Given encoder E:X\times Y\rightarrow Z with parameters \theta_{E}, which infers the content z and style y for a given sentence x, and generator G:Y\times Z\rightarrow X with parameters \theta_{G}, which given content z and style y generates sentence x, the reconstruction loss can be defined as follows:

\begin{split}\mathcal{L}_{\text{rec}}=&\,\mathbb{E}_{x_{1}\sim X_{1}}[-\log p_{G}(x_{1}|y_{1},E(x_{1},y_{1}))]\,+\\ &\,\mathbb{E}_{x_{2}\sim X_{2}}[-\log p_{G}(x_{2}|y_{2},E(x_{2},y_{2}))]\end{split} (11)
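
A minimal sketch of the reconstruction term in Eq. (11) is given below. It assumes hypothetical encoder and generator modules, where the generator exposes a per-sequence log-likelihood; the adversarial or cross-alignment losses used by actual style transfer systems are omitted.

def reconstruction_loss(encoder, generator, x1, x2, y1, y2):
    """L_rec of Eq. (11): each sentence must be reconstructable from its
    own inferred content code z and its style label y."""
    z1 = encoder(x1, y1)                        # E(x1, y1)
    z2 = encoder(x2, y2)                        # E(x2, y2)
    nll1 = -generator.log_prob(x1, y1, z1)      # -log p_G(x1 | y1, z1)
    nll2 = -generator.log_prob(x2, y2, z2)      # -log p_G(x2 | y2, z2)
    return (nll1 + nll2).mean()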

Latent VAE representations are manipulated to generate textual output with specific attributes, e.g., rendering contemporary text in Shakespearean style or improving the positivity of the sentiment of a sentence Mueller et al. (2017). Style-independent content representations are learnt via disentangled latent representations for generating sentences with controllable style attributes Shen et al. (2017), Hu et al. (2017). Language models are employed as style discriminators to learn disentangled representations for unsupervised text style transfer tasks such as sentiment modification Yang et al. (2018d).

2.4.5 Dialogue Systems

A dialogue system, also known as a conversational agent, is a computer system designed to converse with humans using natural language. To be able to carry a meaningful conversation with a human user, the system needs to first understand the message of the user, represent it internally, decide how to respond to it and issue the target response using natural language surface utterances Chen et al. (2017a). Dialogue generation is an instance of conditional text generation where the system response is conditioned on the previous user utterance and frequently on the overall conversational context. Dialogue generation can also be an instance of constrained text generation when the conversation is carried on a topic which explicitly involves entities such as locations, persons, institutions, etc. From an application point of view, dialogue systems can be categorized into Keselj (2009):

  • task-oriented dialogue agents: are designed to have short conversations with a human user to help them complete a particular task. For example, dialogue agents embedded into digital assistants and home controllers assist with finding products, booking accommodations, providing travel directions, making restaurant reservations and placing phone calls on behalf of their users. Therefore, task-oriented dialogue generation is an instance of both conditional and constrained text generation.

  • non-task oriented dialogue agents or chat-bots: are designed for carrying extended conversations with their users on a wide range of open domains. They are set up to mimic human-to-human interaction and unstructured human dialogues in an entertaining way. Therefore, non-task oriented dialogue is an instance of conditional text generation.

We formally define the task of dialogue generation. Generative dialogue models take as input a dialogue context c and generate the next response x. The training data consists of a set of samples of the form \{c^{n},x^{n},d^{n}\}\sim p_{\text{source}}(c,x,d), where d denotes the source domain. At testing time, the model is given the dialogue context c and the target domain, and must generate the correct response x. The goal of a generative dialogue model is to learn the function \mathcal{F}:C\times D\rightarrow X which performs well on unseen examples from the target domain after seeing the training examples on the source domain. The source domain and the target domain can be identical; when they differ the problem is defined as zero-shot dialogue generation Zhao and Eskenazi (2018). The dialogue generation problem can be summarized as:

\begin{split}\text{Training data}&:\{c^{n},x^{n},d^{n}\}\sim p_{\text{source}}(c,x,d)\\ \text{Testing data}&:\{c,x,d\}\sim p_{\text{target}}(c,x,d)\\ \text{Goal}&:\mathcal{F}:C\times D\rightarrow X\end{split} (12)

A common limitation of neural networks for dialogue generation is that they tend to generate safe, universally relevant responses that carry little meaning Serban et al. (2016), Li et al. (2016a), Mou et al. (2016); for example, universal replies such as “I don’t know” or “something”, which frequently occur in the training set, are likely to have high estimated probabilities at decoding time. Additional factors that impact the conversational flow in generative models of dialogue are identified as repetitions and contradictions of previous statements, failing to balance specificity with genericness of the output, and not taking turns in asking questions See et al. (2019). Furthermore, it is desirable for generated dialogues to incorporate explicit personality traits Zheng et al. (2019) and control the sentiment Kong et al. (2019a) of the generated response to resemble human-to-human conversations.

2.4.6 Question Answering

Question answering systems are designed to find and integrate information from various sources to provide responses to user questions Fu and Feng (2018). While traditionally candidate answers consist of words, phrases or sentence snippets retrieved and ranked appropriately from knowledge bases and textual documents Kratzwald et al. (2019), answer generation aims to produce more natural answers by using neural models to generate the answer sentence. Question answering can be considered as both a conditional text generation and constrained text generation task. A question answering system needs to be conditioned on the question that was asked, while simultaneously ensuring that concepts needed to answer the question are found in the generated output.

A question answering system can be formally defined as follows. Given a context paragraph C=\{c_{1},c_{2},\ldots,c_{n}\} consisting of n words from word vocabulary \mathcal{V} and the query Q=\{q_{1},q_{2},\ldots,q_{m}\} of m words in length, the goal of a question answering system is to either: i) output a span S=\{c_{i},c_{i+1},\ldots,c_{i+j}\}, \forall i=1,\ldots,n and \forall j=0,\ldots,n-i, from the original context paragraph C, or ii) generate a sequence of words A=\{a_{1},a_{2},\ldots,a_{l}\}, a_{k}\in\mathcal{V}, \forall k=1,\ldots,l, as the output answer. Below we differentiate between multiple types of question answering tasks:

  • Factoid Question Answering: given a description of an entity (person, place or item) formulated as a query and a text document, the task is to identify the entity referenced in the given piece of text. This is an instance of both conditional and constrained text generation, given conditioning on the input question and constraining the generation task to be entity-centric. Factoid question answering methods combine word and phrase-level representations across sentences to reason about entities Iyyer et al. (2014), Yin et al. (2015).

  • Reasoning-based Question Answering: given a collection of documents and a query, the task is to reason, gather, and synthesize disjoint pieces of information spread within documents and across multiple documents to generate an answer De Cao et al. (2019). The task involves multi-step reasoning and understanding of implicit relations for which humans typically rely on their background commonsense knowledge Bauer et al. (2018). The task is conditional given that the system generates an answer conditioned on the input question, and may be constrained when the information across documents is focused on entities or specific concepts that need to be incorporated in the generated answer.

  • Visual Question Answering: given an image and a natural language question about the image, the goal is to provide an accurate natural language answer to the question posed about the image Antol et al. (2015). By its nature the task is conditional, and it can be constrained when specific objects or entities in the image need to be included in the generated answer.

Question answering systems that meet various information needs are proposed in the literature, e.g., for answering mathematical questions Schubotz et al. (2018), medical information needs Wiese et al. (2017), Bhandwaldar and Zadrozny (2018), quiz bowl questions Iyyer et al. (2014), and cross-lingual and multi-lingual questions Loginova et al. (2018). In practical applications of question answering, users are typically not only interested in learning the exact answer word, but also in how this is related to other important background information and to previously asked questions and answers Fu and Feng (2018).

2.4.7 Image / Video Captioning

Image captioning is designed to generate captions in the form of textual descriptions for an image. This involves the recognition of the important objects present in the image, as well as object properties and interactions between objects to be able to generate syntactically and semantically correct natural language sentences Hossain et al. (2019). In the literature the image captioning task has been framed from either a natural language generation perspective Kulkarni et al. (2013), Chen et al. (2017b) where each system produces a novel sentence, or from a ranking perspective where existing captions are ranked and the top one is selected Hodosh et al. (2013). Image/ video captioning is a conditional text generation task where the caption is conditioned on the input image or video. In addition, it can be a constrained text generation task when specific concepts describing the input need to be present in the generated output.

Formally, the task of image/video captioning takes as input an image or video I and generates a sequence of words y=(y_{1},y_{2},\ldots,y_{N}), y\in V^{*} and y_{i}\in V, \forall i=1,\ldots,N, where V denotes the vocabulary of output words and includes special tokens to mark the beginning <S> and end <E> of a sentence, as well as the unknown token <UNK> used for all words not present in the vocabulary V, and V^{*} denotes all possible sentences over V. Given a training set \mathcal{D}=\{(I,y^{*})\} containing m pairs of the form (I_{j},y_{j}^{*}), \forall j=1,\ldots,m, consisting of input image I_{j} and its corresponding ground-truth caption y_{j}^{*}=(y_{j_{1}}^{*},y_{j_{2}}^{*},\ldots,y_{j_{M}}^{*}), y_{j}^{*}\in V^{*} and y_{j_{k}}^{*}\in V, \forall k=1,\ldots,M, we want to maximize the probabilistic model p(y|I;\theta) with respect to model parameters \theta.

2.4.8 Narrative Generation / Story Telling

Neural narrative generation aims to produce coherent stories automatically and is regarded as an important step towards computational creativity Gervás (2009). Unlike machine translation which produces a complete transduction of an input sentence which fully defines the target semantics, story telling is a long-form open-ended text generation task which simultaneously addresses two separate challenges: the selection of appropriate content (“what to say”) and the surface realization of the generation (“how to say it”) Wiseman et al. (2017). In addition, the most difficult aspect of neural story generation is producing a coherent and fluent story which is much longer than the short input specified by the user as the story title. To this end, many neural story generation models assume the existence of a high-level plot (commonly specified as a one-sentence outline) which serves the role of a bridge between titles and stories Chen et al. (2019a), Fan et al. (2018b), Xu et al. (2018b), Drissi et al. (2018), Yao et al. (2019). Therefore, narrative generation is a constrained text generation task since explicit constraints are placed on which concepts to include in the narrative so as to steer the generation in particular topic directions. In addition, another constraint is that the output length needs to be strictly greater than the input length. We formally define the task of narrative generation below.

Assuming as input to the neural story generation system the title x=x_{1},x_{2},\ldots,x_{I} consisting of I words, the goal is to produce a comprehensible and logical story y=y_{1},y_{2},\ldots,y_{J} of J words in length. Assuming the existence of a one-sentence outline z=z_{1},z_{2},\ldots,z_{K} that contains K words for the entire story, the latent variable model for neural story generation can be formally expressed as:

P(y|x;\theta,\gamma)=\sum_{z}P(z|x;\theta)P(y|x,z;\gamma) (13)

where P(z|x;\theta) defines a planning model parameterized by \theta and P(y|x,z;\gamma) defines a generation model parameterized by \gamma.

The planning model P(z|x;\theta) receives as input the title x of the narrative and generates the narrative outline given the title:

P(z|x;\theta)=\prod_{k=1}^{K}P(z_{k}|x,z_{<k};\theta) (14)

where z_{<k}=z_{1},z_{2},\ldots,z_{k-1} denotes a partial outline. The generation model is used to produce a narrative given a title and an outline:

P(y|x,z;\gamma)=\prod_{j=1}^{J}P(y_{j}|x,z,y_{<j};\gamma) (15)

where y_{<j} denotes a partially generated story.
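
Putting Eqs. (14) and (15) together, plan-and-write style generation first decodes an outline from the title and then decodes the story conditioned on both. The sketch below assumes hypothetical planning and generation models exposing a next_token_logits method; decoding is greedy purely for brevity.

def generate_story(plan_model, story_model, title_ids, bos_id, eos_id,
                   max_outline_len=30, max_story_len=400):
    """Two-stage narrative generation: outline z from P(z|x), then story y from P(y|x,z)."""
    outline = [bos_id]
    for _ in range(max_outline_len):                    # Eq. (14)
        next_id = int(plan_model.next_token_logits(title_ids, outline).argmax())
        outline.append(next_id)
        if next_id == eos_id:
            break
    story = [bos_id]
    for _ in range(max_story_len):                      # Eq. (15)
        next_id = int(story_model.next_token_logits(title_ids, outline, story).argmax())
        story.append(next_id)
        if next_id == eos_id:
            break
    return outline[1:], story[1:]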

Hierarchical models for story generation break down the generation process into multiple steps: first modelling the action sequence, then the story narrative, and finally entities such as story characters Fan et al. (2019). While existing models can generate stories with good local coherence, generating long stories is challenging. Difficulties in coalescing individual phrases into coherent plots and in maintaining character consistency throughout the story lead to a rapid decrease in coherence as the output length increases van Stegeren and Theune (2019). Neural narrative generation combining story-writing with human collaboration in an interactive way improves both story quality and human engagement Goldfarb-Tarrant et al. (2019).

2.4.9 Poetry Generation

Automatic poetry generation is an important step towards computational creativity. In the poem generation literature, the generator operates in an interactive context where the user initially supplies the model with a set of keywords representing the concepts which outline the main writing intents, as well as their ordering. The user is also in charge of selecting a particular format for the generated poem. For example, common formats are the quatrain, consisting of 4 lines, or the regulated verse, in which the poem is made up of 8 lines. The process is interactive and the author can keep modifying terms to reflect their writing intent. Poetry generation is a constrained text generation problem since user-defined concepts need to be included in the generated poem. At the same time, it can also be a conditional text generation problem given explicit conditioning on the stylistic features of the poem. We define the poetry generation task below.

Given as input a set of keywords that summarize an author's writing intent, K=\{k_{1},k_{2},\ldots,k_{|K|}\}, where each k_{i}\in V, i=1,\ldots,|K|, is a keyword term from vocabulary V, the goal is to generate a poem \mathcal{P}=\{w|w\in\Omega\} where each term w is selected from the candidate term set \Omega=\{w|w\in\{K\cup\{V-K\}\}\}, with K\subseteq\mathcal{P} and \mathcal{P}\subseteq\Omega, to fit the user-specified constraints of the poetry format. The generative model computes the probability of line S_{i+1}=w_{1},w_{2},\ldots,w_{m} given all previously generated poem lines S_{1:i}, i\geq 1 (or alternatively, only the previously generated line or lexical n-grams) as follows:

P(S_{i+1}|S_{1:i})=\prod_{j=1}^{m-1}P(w_{j+1}|w_{1:j},S_{1:i}) (16)

Poetry composition is formulated as a constrained optimization problem in a generative summarization framework in iPOET Yan et al. (2013). Candidate terms from a large human-written poem corpus are retrieved to match the user intent, and then clustered to fit the poetry format, tone, rhythm, etc. Each cluster generates one line of the poem in a multi-pass generative summarization framework by conducting iterative term substitutions so that the generated poem matches the initial user constraints and poetic preference, and the relevance and coherence of the output is maximized. Generative models that jointly perform content selection and surface realization are proposed in Zhang and Lapata (2014). Generated poems are revised and polished through multiple style configurations in Ghazvininejad et al. (2017). Neural poetry generation models based on maximum likelihood estimation (MLE) only learn the most common patterns in the poetry corpus and generate outputs with little diversity Zhang et al. (2017a). In addition, these MLE based models suffer from loss-evaluation mismatch Wiseman and Rush (2016) manifested through incompatibility at evaluation time between the word-level loss function optimized by MLE and humans focusing on whole sequences of poem lines and assessing fine-grained criteria of the generated text such as fluency, coherence, meaningfulness and overall quality. These human evaluation criteria are modeled and incorporated into the reward function of a mutual reinforcement learning framework for poem generation Yi et al. (2018). For a detailed overview of poetry generation we point the reader to Oliveira (2017).

2.4.10 Review Generation

Product reviews allow users to express opinions for different aspects of products or services received, and are popular on many online review websites such as Amazon, Yelp, Ebay, etc. These online reviews encompass a wide variety of writing styles and polarity strengths. The task of review generation is similar in nature to sentiment analysis and a lot of past work has focused on identifying and extracting subjective content in review data Liu (2015), Zhao et al. (2016). Automatically generating reviews given contextual information focused on product attributes, ratings, sentiment, time and location is a meaningful conditional text generation task. Common product attributes used in the literature are the user ID, the product ID, the product rating or the user sentiment for the generated review Dong et al. (2017), Tang et al. (2016). The task can also be constrained text generation when topical and syntactic characteristics of natural languages are explicitly specified as constraints to incorporate in the generation process. We formally define the review generation task below.

Given as input a set of product attributes a=(a_{1},a_{2},\ldots,a_{|a|}) of fixed length |a|, the goal is to generate a product review r=(y_{1},y_{2},\ldots,y_{|r|}) of variable length |r| by maximizing the conditional probability p(r|a):

p(r|a)=\prod_{t=1}^{|r|}p(y_{t}|y_{<t},a) (17)

where y_{<t}=(y_{1},y_{2},\ldots,y_{t-1}). The training data consists of pairs (a,r) of attributes a with their corresponding reviews r, and the model learns to maximize the likelihood of the generated reviews given the input attributes for the training data \mathcal{D}. The optimization problem can therefore be expressed as:

\max\sum_{(a,r)\in\mathcal{D}}\log p(r|a) (18)

Generating long, well-structured and informative reviews requires considerable effort when written by human users and is a similarly challenging task to do automatically Li et al. (2019a).

2.4.11 Miscellaneous tasks related to natural language generation

Handwriting synthesis aims to automatically generate data that resembles natural handwriting and is a key component in the development of intelligent systems that can provide personalized experiences to humans Zong and Zhu (2014). The task of handwritten text generation is very much analogous to sequence generation. Given as input a user-defined sequence of words x=(x_{1},x_{2},\ldots,x_{T}), which can be either typed into the computer system or fed as an input image I to capture the user's writing style, the goal of handwriting generation is to train a neural network model which can produce a cursive handwritten version of the input text to display in the form of an output image O Graves (2013). Handwriting generation is a conditional generation task when the system is conditioning on the input text. In addition, it is also a constrained text generation task since it is constrained to generating text in the user's own writing style. While advances in deep learning have given computers the ability to see and recognize printed text from input images, generating cursive handwriting is a considerably more challenging problem Alonso et al. (2019). Character boundaries are not always well-defined, which makes it hard to segment handwritten text into individual pieces or characters. In addition, handwriting evaluation is ambiguous and not well defined given the multitude of existing human handwriting style profiles Mohammed et al. (2018).

Other related tasks where natural language generation plays an important role are generating questions, arguments, counter-arguments and opinions, news headlines and digests, reports, financial statements, stock market reports, sports reports, slides and entire presentations, error corrections, generating creative and entertaining texts, composing music, lyrics and tweets, data-to-text generation, paraphrasing, speech synthesis, generating proteins as sequences of amino acids, generating code in a programming language of choice, etc. All these tasks illustrate the widespread importance of having robust models for natural language generation.

3 Models

Neural networks are used in a wide range of supervised and unsupervised machine learning tasks due to their ability to learn hierarchical representations from raw underlying features in the data and to model complex high-dimensional distributions. Numerous neural network architectures have been proposed for natural language generation in a wide variety of contexts and applications. In what follows we briefly discuss the main categories of generative models in the literature, and then present specific models for neural language generation.

Deep generative models have received a lot of attention recently due to their ability to model complex high-dimensional distributions. These models combine uncertainty estimates provided by probabilistic models with the flexibility and scalability of deep neural networks to learn in an unsupervised way the distribution from which data is drawn. Generative probabilistic models are useful for two reasons: i) they can perform density estimation and inference of latent variables, and ii) they can sample efficiently from the probability density represented by the input data and generate novel content. Deep generative models can be classified into either explicit or implicit density probabilistic models. On the one hand, explicit density models provide an explicit parametric specification of the data distribution and have tractable likelihood functions. On the other hand, implicit density models do not specify the underlying distribution of the data, but instead define a stochastic process which, after training, allows one to simulate the data distribution by drawing samples from it. Since the data distribution is not explicitly specified, implicit generative models do not have a tractable likelihood function. Both explicit and implicit models have been used in the literature to generate textual content in a variety of settings. Among these, we enumerate explicit density models with tractable density such as autoregressive models Bahdanau et al. (2014), Vaswani et al. (2017), explicit density models with approximate density like the Variational Autoencoder Kingma and Welling (2013), and implicit direct density generative models such as Generative Adversarial Networks Goodfellow et al. (2014).

Autoregressive (Fully-observed) generative models model the observed data directly without introducing dependencies on any new unobserved local variables. Assuming all items in a sequence $x=(x_{1},x_{2},\ldots,x_{N})$ are fully observed, the probability distribution $p(x)$ of the data is modeled in an auto-regressive fashion using the chain rule of probability:

p(x_{1},x_{2},\ldots,x_{N})=\prod_{i=1}^{N}p(x_{i}|x_{1},x_{2},\ldots,x_{i-1}) (19)

Training autoregressive models is done by maximizing the data likelihood, allowing these models to be evaluated quickly and exactly. Sampling from autoregressive models is exact, but it is expensive since samples need to be generated in sequential order. Extracting representations from fully observed models is challenging, but this is currently an active research topic.
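
The following toy numpy example illustrates Equation 19 and the two properties discussed above, namely exact likelihood evaluation and sequential (ancestral) sampling; for brevity the conditional is approximated by a first-order table, whereas a neural autoregressive model would condition on the full prefix.

import numpy as np

rng = np.random.default_rng(0)
V = 5                                       # toy vocabulary size
P = rng.random((V, V))
P /= P.sum(axis=1, keepdims=True)           # toy conditional p(x_i | x_{i-1})
p0 = np.full(V, 1.0 / V)                    # p(x_1)

def log_likelihood(x):
    # chain rule: log p(x_1) + sum_i log p(x_i | x_{<i})
    ll = np.log(p0[x[0]])
    for prev, cur in zip(x[:-1], x[1:]):
        ll += np.log(P[prev, cur])
    return ll

def sample(length):
    # sampling is exact but inherently sequential, one token at a time
    x = [rng.choice(V, p=p0)]
    for _ in range(length - 1):
        x.append(rng.choice(V, p=P[x[-1]]))
    return x

seq = sample(6)
print(seq, log_likelihood(seq))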

Latent variable generative models explain hidden causes by introducing an unobserved random variable $z$ for every observed data point. The data likelihood $p(x)$ is computed as follows:

p(x)=\int p_{\theta}(x|z)p(z)dz=\mathbb{E}_{p(z)}[p_{\theta}(x|z)] (20)

Latent models present the advantage that sampling is exact and cheap, while extracting latent features from these models is straightforward. They are evaluated using a lower bound on the log-likelihood.

Implicit density models (among which the most famous are GANs) introduce, in addition to the generative model, a second discriminative model able to distinguish model-generated samples from real samples. While sampling from these models is cheap, it is inexact. The evaluation of these models is difficult or even impossible to carry out, and extracting latent representations from them is very challenging. We summarize in Table 1 characteristics of the three categories of generative models discussed above.

Table 1: Comparison of generative model frameworks.
Model type | Evaluation | Sampling
Fully-observed | Exact and cheap | Exact and expensive
Latent models | Lower bound | Exact and cheap
Implicit models | Hard or impossible | Inexact and cheap

In what follows we review models for neural language generation from the most general to the most specific according to the problem definition categorization presented in Section 2; for each model architecture we first list models for generic text generation, then introduce models for conditional text generation, and finally outline models used for constrained text generation. We begin with recurrent neural network models for text generation in Section 3.1, then present sequence-to-sequence models in Section 3.2, generative adversarial networks (GANs) in Section 3.4, variational autoencoders (VAEs) in Section 3.5 and pre-trained models for text generation in Section 3.8. We also provide a comprehensive overview of text generation tasks associated with each model.

3.1 Recurrent Architectures

3.1.1 Recurrent Models for Generic / Free-Text Generation

Recurrent Neural Networks (RNNs) Rumelhart et al. (1986), Mikolov et al. (2010) are able to model long-term dependencies in sequential data and have shown promising results in a variety of natural language processing tasks, from language modeling Mikolov (2012) to speech recognition Graves et al. (2013) and machine translation Kalchbrenner and Blunsom (2013). An important property of RNNs is their ability to learn to map an input sequence of variable length into a fixed-dimensional vector representation.

At each timestep, the RNN receives an input, updates its hidden state, and makes a prediction. Given an input sequence $x=(x_{1},x_{2},\ldots,x_{T})$, a standard RNN computes the hidden vector sequence $h=(h_{1},h_{2},\ldots,h_{T})$ and the output vector sequence $y=(y_{1},y_{2},\ldots,y_{T})$, where each datapoint $x_{t},h_{t},y_{t}$, $\forall t\in\{1,\ldots,T\}$, is a real-valued vector, in the following way:

\begin{split}h_{t}&=\mathcal{H}(W_{xh}x_{t}+W_{hh}h_{t-1}+b_{h})\\ y_{t}&=W_{hy}h_{t}+b_{y}\end{split} (21)

In Equation 21 the $W$ terms denote weight matrices, in particular $W_{xh}$ is the input-hidden weight matrix and $W_{hh}$ is the hidden-hidden weight matrix. The $b$ terms denote bias vectors, where $b_{h}$ is the hidden bias vector and $b_{y}$ is the output bias vector. $\mathcal{H}$ is the function that computes the hidden layer representation. Gradients in an RNN are computed via backpropagation through time Rumelhart et al. (1986), Werbos (1989). By definition, RNNs are inherently deep in time considering that the hidden state at each timestep is computed as a function of all previous timesteps. While in theory RNNs can make use of information in arbitrarily long sequences, in practice they fail to consider context beyond the few previous timesteps due to vanishing and exploding gradients Bengio et al. (1994), which prevent gradient descent from learning long-range temporal structure in a standard RNN. Moreover, RNN-based models contain millions of parameters and have traditionally been very difficult to train, limiting their widespread use Sutskever et al. (2011). Improvements in network architectures, optimization techniques and parallel computation have resulted in recurrent models learning better at large scale Lipton et al. (2015).

Long Short Term Memory (LSTM) Hochreiter and Schmidhuber (1997) networks are introduced to overcome the limitations posed by vanishing gradients in RNNs and allow gradient descent to learn long-term temporal structure. The LSTM architecture largely resembles the standard RNN architecture with one hidden layer, and each hidden layer node is modified to include a memory cell with a self-connected recurrent edge of fixed weight which stores information over long time periods. A memory cell $c_{t}$ consists of a node with an internal hidden state $h_{t}$ and a series of gates, namely an input gate $i_{t}$ which controls how much each LSTM unit is updated, a forget gate $f_{t}$ which controls the extent to which the previous memory cell is forgotten, and an output gate $o_{t}$ which controls the exposure of the internal memory state. The LSTM transition equations at timestep $t$ are:

\begin{split}i_{t}&=\sigma(W^{(i)}x_{t}+U^{(i)}h_{t-1}+b^{(i)})\\ f_{t}&=\sigma(W^{(f)}x_{t}+U^{(f)}h_{t-1}+b^{(f)})\\ o_{t}&=\sigma(W^{(o)}x_{t}+U^{(o)}h_{t-1}+b^{(o)})\\ u_{t}&=\tanh(W^{(u)}x_{t}+U^{(u)}h_{t-1}+b^{(u)})\\ c_{t}&=i_{t}\odot u_{t}+f_{t}\odot c_{t-1}\\ h_{t}&=o_{t}\odot\tanh(c_{t})\end{split} (22)

In Equation 22, $x_{t}$ is the input at the current timestep $t$, $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes elementwise multiplication. $U$ and $W$ are learned weight matrices. LSTMs can represent information over multiple time steps by adjusting the values of the gating variables for each vector element, therefore allowing the gradient to pass without vanishing or exploding. In both RNNs and LSTMs the data is modeled via a fully-observed directed graphical model, where the distribution over a discrete output sequence $y=(y_{1},y_{2},\dots,y_{T})$ is decomposed into an ordered product of conditional distributions over tokens:

P(y_{1},y_{2},\dots,y_{T})=P(y_{1})\prod_{t=2}^{T}P(y_{t}|y_{1},\dots,y_{t-1}) (23)
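
As an illustration, the LSTM transition in Equation 22 can be sketched in a few lines of numpy; the weights below are random placeholders rather than trained parameters.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    W, U, b = params["W"], params["U"], params["b"]          # one set per gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])     # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])     # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])     # output gate
    u = np.tanh(W["u"] @ x_t + U["u"] @ h_prev + b["u"])     # candidate update
    c = i * u + f * c_prev                                   # new memory cell
    h = o * np.tanh(c)                                       # new hidden state
    return h, c

d_in, d_h = 4, 3
rng = np.random.default_rng(1)
params = {
    "W": {g: rng.standard_normal((d_h, d_in)) for g in "ifou"},
    "U": {g: rng.standard_normal((d_h, d_h)) for g in "ifou"},
    "b": {g: np.zeros(d_h) for g in "ifou"},
}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):   # run the cell over a toy input sequence
    h, c = lstm_step(x_t, h, c, params)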

Similar to LSTMs, Gated Recurrent Units (GRUs) Cho et al. (2014) learn semantically and syntactically meaningful representations of natural language and have gating units to modulate the flow of information. Unlike LSTMs, GRU units do not have a separate memory cell and present a simpler design with fewer gates. The activation $h_{t}^{j}$ at timestep $t$ linearly interpolates between the activation at the previous timestep $h_{t-1}^{j}$ and the candidate activation $\widetilde{h}_{t}^{j}$. The update gate $z_{t}^{j}$ decides how much the current unit updates its content, while the reset gate $r_{t}^{j}$ allows it to forget the previously computed state. The GRU update equations at each timestep $t$ are:

\begin{split}h_{t}^{j}&=(1-z_{t}^{j})h_{t-1}^{j}+z_{t}^{j}\widetilde{h}_{t}^{j}\\ z_{t}^{j}&=\sigma(W_{z}x_{t}+U_{z}h_{t-1})^{j}\\ \widetilde{h}_{t}^{j}&=\tanh(Wx_{t}+U(r_{t}\odot h_{t-1}))^{j}\\ r_{t}^{j}&=\sigma(W_{r}x_{t}+U_{r}h_{t-1})^{j}\end{split} (24)

Models with recurrent connections are trained with teacher forcing Williams and Zipser (1989), a strategy emerging from the maximum likelihood criterion designed to keep the recurrent model predictions close to the ground-truth sequence. At each training step the model-generated token $\hat{y}_{t}$ is replaced with its ground-truth equivalent token $y_{t}$, while at inference time each token is generated by the model itself (i.e. sampled from its conditional distribution over the sequence given the previously generated samples). The discrepancy between the training and inference stages leads to exposure bias, causing errors in the model predictions that accumulate and amplify quickly over the generated sequence Lamb et al. (2016). As a remedy, Scheduled Sampling Bengio et al. (2015) mixes inputs from the ground-truth sequence with inputs generated by the model at training time, gradually adjusting the training process from fully guided (i.e. using the true previous token) to less guided (i.e. using mostly the generated token) based on curriculum learning Bengio et al. (2009). While the model-generated distribution can still diverge from the ground-truth distribution as the model generates several consecutive tokens, possible solutions are: i) make the self-generated sequences short, and ii) anneal the probability of using self-generated vs. ground-truth samples to 0, according to some schedule. Still, models trained with scheduled sampling are shown to memorize the distribution of symbols conditioned on their position in the sequence instead of the actual prefix of preceding symbols Huszár (2015).
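
The following sketch illustrates the scheduled sampling idea on a single sequence, using a small GRU-cell decoder purely as an illustrative stand-in: with probability eps the ground-truth previous token is fed to the decoder (teacher forcing), otherwise the model's own sample is used, and eps is annealed over training.

import torch
import torch.nn as nn
from torch.distributions import Categorical

vocab, d = 50, 32
embed = nn.Embedding(vocab, d)
cell = nn.GRUCell(d, d)
out = nn.Linear(d, vocab)

def scheduled_sampling_loss(gold, eps):
    # gold: (T,) tensor of token ids; eps: probability of feeding the gold token
    h = torch.zeros(1, d)
    prev = gold[0:1]
    nll = torch.zeros(())
    for t in range(1, len(gold)):
        h = cell(embed(prev), h)                   # one decoder step
        dist = Categorical(logits=out(h))          # p(y_t | y_<t)
        nll = nll - dist.log_prob(gold[t:t+1]).squeeze()
        if torch.rand(()) < eps:
            prev = gold[t:t+1]                     # feed the ground-truth token
        else:
            prev = dist.sample()                   # feed the model-generated token
    return nll

gold = torch.randint(0, vocab, (12,))
for step in range(3):
    eps = max(0.1, 1.0 - step / 10)                # simple decay schedule for eps
    scheduled_sampling_loss(gold, eps).backward()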

Many extensions of vanilla RNN and LSTM architectures are proposed in the literature aiming to improve generalization and sample quality Yu et al. (2019). Bidirectional RNNs Schuster and Paliwal (1997), Berglund et al. (2015) augment unidirectional recurrent models by introducing a second hidden layer with connections flowing in opposite temporal order to exploit both past and future information in a sequence. Multiplicative RNNs Sutskever et al. (2011) allow flexible input-dependent transitions, although such complex transition functions are difficult to train in practice. Gated feedback RNNs and LSTMs Chung et al. (2014) rely on gated-feedback connections to enable the flow of control signals from the upper to lower recurrent layers in the network. Similarly, depth-gated LSTMs Yao et al. (2015) introduce dependencies between lower and upper recurrent units by using a depth gate which connects memory cells of adjacent layers. Stacked LSTMs stack multiple layers at each time-step to increase the capacity of the network, while nested LSTMs Moniz and Krueger (2018) selectively access LSTM memory cells with inner memory. Convolutional LSTMs Sainath et al. (2015), Xingjian et al. (2015) are designed for jointly modeling spatio-temporal sequences. Tree-structured LSTMs Zhu et al. (2015), Tai et al. (2015) extend the LSTM structure beyond a linear chain to tree-structured network topologies, and are useful for semantic similarity and sentiment classification tasks. Multiplicative LSTMs Krause et al. (2016) combine vanilla LSTM networks of fixed weights with multiplicative RNNs to allow for flexible input-dependent weight matrices in the network architecture. Multiplicative Integration RNNs Wu et al. (2016b) achieve better performance than vanilla RNNs by using the Hadamard product in the additive computational building block of RNNs. Mogrifier LSTMs Melis et al. (2019) capture interactions between inputs and their context by mutually gating the current input and the previous output of the network. For a comprehensive review of RNN and LSTM-based network architectures we point the reader to Yu et al. (2019).

3.1.2 Recurrent Models for Conditional Text Generation

A recurrent free-text generation model becomes a conditional recurrent text generation model when the distribution over training sentences is conditioned on another modality. For example in machine translation the distribution is conditioned on another language, in image caption generation the condition is the input image, in video description generation we condition on the input video, while in speech recognition we condition on the input speech.

Content and stylistic properties (such as sentiment, topic, style and length) of generated movie reviews are controlled in a conditional LSTM language model by conditioning on context vectors that reflect the presence of these properties Ficler and Goldberg (2017). Affective dialogue responses are generated by conditioning on affect categories in an LSTM language model Ghosh et al. (2017). An RNN-based language model equipped with dynamic memory outperforms more complex memory-based models for dialogue generation Mei et al. (2017). Participant roles and conversational topics are represented as context vectors and incorporated into an LSTM-based response generation model Luan et al. (2016).

3.1.3 Recurrent Models for Constrained Text Generation

Metropolis-Hastings sampling Miao et al. (2019) is proposed for both soft and hard constrained sentence generation from models based on recurrent neural networks. The method is based on Markov Chain Monte Carlo (MCMC) sampling and performs local operations such as insertion, deletion and replacement in the sentence space for any randomly selected word in the sentence.

Hard constraints on the generation of scientific paper titles are imposed by the use of a forward-backward recurrent language model which generates both previous and future words in a sentence conditioned on a given topic word Mou et al. (2015). While the topic word can occur at any arbitrary position in the sentence, the approach can only generate sentences constrained on precisely one keyword. Multiple constraints are incorporated in sentences generated by a backward-forward LSTM language model by lexically substituting constrained tokens with their closest matching neighbour in the embedding space Latif et al. (2020). Guiding the conversation towards a designated topic while integrating specific vocabulary words is achieved by combining discourse-level rules with neural next-keyword prediction Tang et al. (2019). A recurrent network based sequence classifier is used for extractive summarization in Nallapati et al. (2017). Poetry generation which obeys hard rhythmic, rhyme and topic constraints is proposed in Ghazvininejad et al. (2016).

3.2 Sequence-to-Sequence Architectures

Although the recurrent models presented in Section 3.1 achieve good performance whenever large labeled training sets are available, they can only be applied to problems whose inputs and targets are encoded with vectors of fixed dimensionality. Sequences represent a challenge for such models since they require the dimensionality of inputs and outputs to be known and fixed beforehand. In practice, there are many problems in which the sequence length is not known a-priori and it is necessary to map variable-length sequences into fixed-dimensional vector representations. To this end, models that can map sequences to sequences are proposed. These models make minimal assumptions on the sequence structure and learn to map an input sequence into a vector of fixed dimensionality and then map that vector back into an output sequence, therefore learning to decode the target sequence from the encoded vector representation of the source sequence. We present these models in detail below.

3.3 Sequence-to-sequence models that condition on the input text

Sequence-to-sequence (seq2seq) models Kalchbrenner and Blunsom (2013), Sutskever et al. (2014), Cho et al. (2014) are conditional language models which can deal with variable-length inputs and outputs. Also known as encoder-decoder models, they have been very successful in machine translation Luong et al. (2015b), text summarization Nallapati et al. (2016), dialogue systems Vinyals and Le (2015), and image and video captioning Vinyals et al. (2015d). Seq2seq models consist of two paired recurrent neural networks: the first network (encoder) summarizes a variable-length source sequence of symbols $x=(x_{1},x_{2},\ldots,x_{T})$ into a rich fixed-length vector representation $v$, while the second network (decoder) uses the vector representation $v$ as its initial hidden state and deciphers it into another variable-length target sequence of symbols $y=(y_{1},y_{2},\ldots,y_{T^{\prime}})$ by computing the conditional probability of each word $y_{t}$ in the target sequence given the previous words $y_{1},y_{2},\ldots,y_{t-1}$ and the input $x$, where the length $T$ of the input may differ from the length $T^{\prime}$ of the output. The conditional probability $p(y_{1},y_{2},\ldots,y_{T^{\prime}}|x_{1},x_{2},\ldots,x_{T})$ of the output sequence $y$ given the input sequence $x$ can be formally expressed as:

p(y_{1},y_{2},\ldots,y_{T^{\prime}}|x_{1},x_{2},\ldots,x_{T})=\prod_{t=1}^{T^{\prime}}p(y_{t}|v,y_{1},y_{2},\ldots,y_{t-1}) (25)

Typical architectural choices for the encoder and decoder are RNN and LSTM neural networks. In addition, deep convolutional networks are proposed to model long-range dependencies in lengthy documents Dauphin et al. (2017), Gehring et al. (2017). Given a training set consisting of $N$ paired input and output sentences, a sequence-to-sequence model is trained using maximum likelihood to maximize the conditional log-likelihood of the correct target sentence $y_{n}$ given the source sentence $x_{n}$ for every pair $(x_{n},y_{n})$ and model parameters $\theta$:

\max_{\theta}\frac{1}{N}\sum_{n=1}^{N}\log p_{\theta}(y_{n}|x_{n}) (26)
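
A minimal encoder-decoder sketch of Equations 25-26 is given below; the single-layer GRU encoder and decoder and the chosen dimensions are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)

    def log_prob(self, x, y):
        # x: (batch, T) source ids, y: (batch, T') target ids
        _, v = self.encoder(self.src_emb(x))           # fixed-length summary v of x
        hidden, _ = self.decoder(self.tgt_emb(y[:, :-1]), v)
        logits = self.out(hidden)                      # p(y_t | v, y_<t), Eq. (25)
        return -F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                y[:, 1:].reshape(-1), reduction="sum")

model = Seq2Seq(src_vocab=100, tgt_vocab=120)
x = torch.randint(0, 100, (8, 10))
y = torch.randint(0, 120, (8, 9))
loss = -model.log_prob(x, y) / x.size(0)               # Eq. (26), averaged over pairs
loss.backward()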

After training, the sequence-to-sequence model can be used in two ways: i) to produce a probability score $p_{\theta}(y|x)$ for a given pair $(x,y)$ of input $x$ and output $y$ sequences, or ii) to generate the target sequence $y$ corresponding to the input sequence $x$. In the latter case, decoding methods are used to generate the most likely output sequence. In what follows we present decoding strategies for neural sequence-to-sequence generative models.

Decoding

Generating the most likely output sequence from a trained model involves running an exhaustive search over all possible output sequences, scoring them based on their likelihood and selecting the most likely sequence $\hat{y}$ such that:

\hat{y}=\arg\max_{y}P(y|x)=\arg\max_{y}\prod_{t=1}^{N}P(y_{t}|y_{<t},x) (27)

In Equation 27, the model outputs a probability distribution over the next token at each timestep $t$ given the input $x$ and the previously predicted tokens $y_{<t}$. The probability distribution over the next word in the target vocabulary $w_{i}\in V$ is commonly modeled using a softmax function:

P(y_{t}=w_{i}|y_{<t},x)=\frac{\exp(z_{t,i})}{\sum_{j=1}^{V}\exp(z_{t,j})}, (28)

where $z_{t}=f(y_{<t},x)$ represents the output of the encoder-decoder model given the input sequence $x$ and the sequence of tokens predicted so far $y_{<t}$. While Equation 27 theoretically yields the optimal output sequence $\hat{y}$, in practice it is intractable to run an exhaustive search to find $\hat{y}$ precisely. The exact decoding problem is exponential in the length of the source sentence $x$, and factors such as the branching factor, the number of timesteps and the large vocabulary size impede yielding precisely the most probable sequence $\hat{y}$ from the trained model. Alternative decoding strategies based on heuristic search are used to find reasonable approximations of the optimal output sequence. As opposed to exact decoding, these sampling techniques are incomplete decoding strategies which exclude tokens from consideration at each step, and generate a random sequence according to the learnt probability distribution. The decoding strategy of choice bears a huge impact on the quality of the generated machine text, even when the same neural language model is used for generation Holtzman et al. (2019). In what follows we present decoding methods commonly used in the NLG literature, noting that the best decoding strategy for text generation from a trained language model is still largely an unresolved problem.

Argmax/ Greedy search

The argmax sampler is the simplest approach to decoding a likely sequence. At each timestep it greedily selects the most likely (argmax) token over the softmax output distribution and feeds it as input to the next timestep until the end-of-sentence token is reached, as follows:

\hat{y}_{t}=\arg\max_{y_{t}}P(y_{t}|y_{<t},x) (29)

Greedy decoding preserves a single hypothesis at each timestep, however selecting the best individual token output per timestep does not necessarily result in the best overall output hypothesis – there may well exist a better path which includes a less likely token (not the argmax) at some point in the decoding process; the method will also miss a high-probability token hiding after a low-probability one. In addition, other limitations of greedy decoding include the generation of repetitive and short output sequences which lack in diversity even for large well-trained models Holtzman et al. (2019), Radford et al. (2019), and the impossibility of generating multiple samples during decoding. This makes it a suboptimal decoding strategy Chen et al. (2018b).
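
A greedy decoder can be sketched as follows; the toy next_token_probs function is a stand-in for a trained model's conditional distribution over the next token.

import numpy as np

V, EOS = 6, 0
rng = np.random.default_rng(3)
T = rng.random((V, V))
T /= T.sum(axis=1, keepdims=True)            # toy conditional over the next token

def next_token_probs(prefix):
    return T[prefix[-1]]                     # placeholder for p(y_t | y_<t, x)

def greedy_decode(bos=1, max_len=20):
    y = [bos]
    while len(y) < max_len:
        y.append(int(np.argmax(next_token_probs(y))))   # keep a single hypothesis
        if y[-1] == EOS:
            break
    return y

print(greedy_decode())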

Random/ Stochastic/ Temperature search/ Ancestral Sampling

Stochastic sampling introduces randomness in decoding and arbitrarily samples each output token from the model's distribution at each timestep. A temperature parameter $T>0$ is commonly used to control how flat ($T\rightarrow\infty$) or greedy ($T\rightarrow 0$) the multinomial distribution over the next token is Ackley et al. (1985). Values of $T>1$ cause increasingly more random outputs, while $T\approx 0$ resembles greedy sampling:

P(y_{t}=w_{i}|y_{<t},x)=\frac{\exp(z_{t,i}/T)}{\sum_{j=1}^{V}\exp(z_{t,j}/T)} (30)

It is important to note that completely random sampling can negatively impact sequence generation, as it introduces unlikely words and mistakes not encountered at training time Fan et al. (2018b).
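
Temperature-controlled sampling (Equation 30) amounts to dividing the logits by T before applying the softmax, as in the following sketch.

import numpy as np

def sample_with_temperature(logits, T=1.0, rng=np.random.default_rng()):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                             # numerical stability
    p = np.exp(z) / np.exp(z).sum()          # softmax with temperature
    return int(rng.choice(len(p), p=p))

logits = np.array([2.0, 1.0, 0.2, -1.0])
for T in (0.1, 1.0, 10.0):                   # near-greedy, standard, near-uniform
    print(T, [sample_with_temperature(logits, T) for _ in range(5)])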

Beam search

Beam search is an approximate graph-based inference algorithm which explores the hypothesis space in a greedy left-to-right (breadth-first) manner over a limited portion of the overall search space. Each hypothesis is expanded iteratively one token at a time and only the $k$-best hypotheses are eventually kept, where $k$ denotes the beam width or beam size. Unlike greedy decoding, which maintains a single hypothesis at a time and can miss out on a highly probable token when it is preceded by a low-probability one, beam search explores multiple sequences in parallel and mitigates the aforementioned problems by maintaining a beam (set of $k$ hypotheses) of potential sequences constructed word-by-word. More specifically, the decoding process begins with the start-of-sentence token and at every step of decoding the new $k$-best tokens $w_{i}$ are selected according to the probability distribution $P(y_{t}=w_{i}|y_{<t},x)$. Each partial hypothesis is expanded with a new token and its cumulative log-probability is updated accordingly to capture the model's preferences. The process repeats until the end-of-sentence token is produced, at which time the hypothesis is complete. Finally, all complete hypotheses are scored in descending order of their likelihood and only the $k$-best scoring hypotheses are preserved. Decoding new sequences from the trained model is equivalent to finding the sequence $y^{*}$ that is most probable under the model distribution:

y^{*}=\arg\max_{y}p(y|x)=\arg\min_{y}-\log p(y|x) (31)

Beam search was effective in early work on neural machine translation Sutskever et al. (2014) and has become the standard algorithm for sampling sufficiently likely sequences from probabilistic encoder-decoder models in many language generation tasks. Nevertheless, beam search is sensitive to output length and best results are obtained when the length of the target sentence is predicted before decoding Murray and Chiang (2018), Yang et al. (2018b). Beam search decoding is also slow as it introduces a substantial computational overhead Cho (2016), and the candidate sequences it produces are short, dull, generic, and include common phrases and repetitive text from the training set Shao et al. (2016), Vijayakumar et al. (2016), Sordoni et al. (2015), Li et al. (2016a), Wolf et al. (2019). While the use of maximum likelihood as a training objective leads to high quality models for many language understanding tasks, maximization-based decoding results in neural text degeneration Holtzman et al. (2019). In addition, the likelihood objective assigns too much probability to repetitive and frequent words, focusing only on producing the next word and not on optimizing sequence generation Welleck et al. (2019b). Consequently, the generated outputs produced by beam search lack in diversity Li et al. (2016b), Li et al. (2016a), Gimpel et al. (2013) and are largely variations of the same high-likelihood beam with minor differences in punctuation and morphology Li and Jurafsky (2016). Increasing the beam size degrades performance and negatively affects sequence generation quality Koehn and Knowles (2017), Yang et al. (2018b). As an alternative to maximum likelihood training, unlikelihood training is proposed to force the model to assign lower probability scores to unlikely generations Welleck et al. (2019a).
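
The following compact sketch illustrates the basic beam search procedure over a toy next-token distribution; practical implementations additionally use length normalization and batched computation.

import numpy as np

V, EOS, BOS = 6, 0, 1
rng = np.random.default_rng(4)
T = rng.random((V, V))
T /= T.sum(axis=1, keepdims=True)

def next_token_probs(prefix):
    return T[prefix[-1]]                      # stand-in for p(y_t | y_<t, x)

def beam_search(k=3, max_len=10):
    beams = [([BOS], 0.0)]                    # (hypothesis, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for hyp, score in beams:
            probs = next_token_probs(hyp)
            for w in np.argsort(probs)[-k:]:  # expand each beam with its k best tokens
                candidates.append((hyp + [int(w)], score + np.log(probs[w])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp, score in candidates:         # keep the k best partial hypotheses
            (finished if hyp[-1] == EOS else beams).append((hyp, score))
            if len(beams) == k:
                break
        if not beams:
            break
    finished.extend(beams)                    # fall back to unfinished hypotheses
    return max(finished, key=lambda c: c[1])

print(beam_search())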

Extensions of beam search are proposed focusing on improving output diversity, and can be applied either during the decoding process or post-decoding. We present diversity-promoting methods used during the decoding process first. N-gram blocking Paulus et al. (2017), Klein et al. (2017) discards a hypothesis if any n-gram within it occurs more than once; this strategy is used to block previously generated n-grams from subsequent generation. Iterative beam search Kulikov et al. (2019) runs multiple iterations of beam search on disjunct areas of the search space, ensuring there is no overlap between the current search space and areas explored by previous iterations. Diverse beam search Vijayakumar et al. (2016) augments beam search with a diversity-promoting term which ensures a candidate hypothesis is sufficiently different from other partial hypotheses according to standard diversity functions such as n-gram diversity, neural embedding diversity and Hamming distance. Cluster-based beam search Tam (2020) clusters semantically similar partial hypotheses using k-means clustering, followed by hypothesis pruning to keep only the top candidate hypotheses from each cluster. Noisy parallel decoding Cho (2016) can be combined with any decoding strategy and works by injecting noise (randomly sampled from a normal distribution) into the hidden state of the decoder at each timestep, followed by running multiple approximate decoding processes in parallel. Top-g capping beam search Li et al. (2016b), Li and Jurafsky (2016) incentivizes diversity by grouping candidate hypotheses according to their parent, and selects the top-g candidates from each group. Post-decoding diversity-promoting methods are also proposed. The simplest strategy to increase the diversity of outputs is to cluster decoded sentences and remove highly similar candidates Kriz et al. (2019). Along the same line, it is possible to over-sample generated sequences from the model and filter them down to retain a smaller number of outputs Ippolito et al. (2018); random sampling is the recommended way to over-sample candidates.

Lexically constrained decoding with grid beam search is used to enforce lexical constraints (words or phrases) and incorporate additional knowledge in the generated output Hokamp and Liu (2017). The search space is hard constrained to produce only candidates which contain one or more pre-specified sub-sequences. Similarly, constrained beam search Anderson et al. (2017) is used to force the inclusion of specific words in the generated sequences by adopting a finite-state machine approach which recognizes valid constrained and complete outputs. Dynamic beam allocation Post and Vilar (2018) improves the time efficiency of constrained decoding by grouping beam candidates according to how many constraints they meet. Vectorized dynamic beam allocation achieves even faster decoding by organizing into a trie the constraints that have not yet been generated Hu et al. (2019).

In addition, decoding methods that optimize for output with high probability produce incoherent, repetitive and generic output sequences. Beam search and greedy decoded texts fail to reproduce the distribution of words in human generated texts Holtzman et al. (2019). As a remedy, decoding strategies that truncate the neural probability distribution at different thresholds establish a trustworthy prediction zone from which tokens can be sampled according to their relative probabilities. Next we present thresholding-based decoding strategies.

Top-k sampling

The top-k random sampling scheme Fan et al. (2018b) restricts sampling to the k most likely terms in the distribution and introduces randomness in the decoding process. At each timestep the next word is randomly sampled from the k (typically k=10) most likely candidate tokens, according to the model's probability distribution over the vocabulary given the previously selected words. This decoding scheme is found more effective than beam search Radford et al. (2019), Holtzman et al. (2018).
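
A sketch of top-k sampling over a single next-token distribution is given below: the distribution is truncated to its k most likely tokens, renormalized, and sampled from.

import numpy as np

def top_k_sample(probs, k=10, rng=np.random.default_rng()):
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[-k:]              # indices of the k most likely tokens
    p = probs[top] / probs[top].sum()         # renormalize over the top-k set
    return int(rng.choice(top, p=p))

probs = np.array([0.4, 0.25, 0.15, 0.1, 0.05, 0.03, 0.02])
print(top_k_sample(probs, k=3))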

Nucleus (Top-p) sampling

Nucleus sampling Holtzman et al. (2019) suppresses the unreliable tail of the probability distribution consisting of tokens with relatively low probability, and samples tokens from the remaining top-p portion or nucleus, which concentrates the highest probability mass. This approach allows the model to generate tokens from the vast majority of the probability mass and prevents sampling low-probability tokens.
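
Nucleus sampling can be sketched as follows: tokens are sorted by probability, the smallest set whose cumulative probability exceeds p is retained as the nucleus, and the next token is sampled from this renormalized set.

import numpy as np

def nucleus_sample(probs, p=0.9, rng=np.random.default_rng()):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                     # tokens by decreasing probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1    # size of the nucleus
    nucleus = order[:cutoff]
    q = probs[nucleus] / probs[nucleus].sum()           # renormalized nucleus distribution
    return int(rng.choice(nucleus, p=q))

probs = np.array([0.35, 0.3, 0.15, 0.1, 0.05, 0.03, 0.02])
print(nucleus_sample(probs, p=0.8))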

Penalized sampling

Penalized sampling Keskar et al. (2019) is used at inference time to encourage output diversity. By discounting the probability of the already generated tokens in a sequence, the model is discouraged from generating them again.
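
A sketch in the spirit of this repetition penalty is given below: the logits of tokens that already occur in the generated sequence are discounted by a factor theta > 1 before the softmax (the exact formulation of Keskar et al. (2019) may differ in its details).

import numpy as np

def penalized_probs(logits, generated, theta=1.2):
    z = np.asarray(logits, dtype=float).copy()
    for tok in set(generated):
        # discount already generated tokens; negative logits are scaled up instead
        z[tok] = z[tok] / theta if z[tok] > 0 else z[tok] * theta
    z -= z.max()
    return np.exp(z) / np.exp(z).sum()

logits = np.array([3.0, 2.5, 1.0, -0.5])
print(penalized_probs(logits, generated=[0, 0, 1]))   # tokens 0 and 1 are penalized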

Sequence Generation

In what follows we present the dominant approaches used in neural sequence generation.

  1.

    Monotonic / Autoregressive Sequence Generation: In natural language generation sequences are typically generated iteratively following a left-to-right generation order in which new tokens are added successively to the end of an unfinished sequence. Monotonic neural text generation decomposes the sequence prediction problem into a series of next-token predictions, i.e. given the input sequence $x$ and assuming the availability of the ground-truth previous tokens $(y_{1}^{*},y_{2}^{*},\ldots,y_{t-1}^{*})$, predict the next token $y_{t}$:

    P(Y|x)=\prod_{t=1}^{T}p(y_{t}|y_{1},y_{2},\ldots,y_{t-1},x) (32)

    The learning process maximizes the log-probability of a correct output sequence given the input $x$, teaching the model to predict the correct next token $p(y_{t}^{*}|y_{1}^{*},y_{2}^{*},\ldots,y_{t-1}^{*},x)$ by using a cross-entropy loss applied at each decoding step:

    \max\log p(Y^{*}|x)=\sum_{t=1}^{T}\log p(y_{t}^{*}|y_{1}^{*},y_{2}^{*},\ldots,y_{t-1}^{*},x) (33)

    With recent advances in text representation learning, monotonic generation of sequences has become the standard approach in neural text generation. Autoregressive models are easy to train and achieve robust performance on large datasets. To speed up their training, models that leverage parallelism at training time replace recurrent layers in the decoder with masked convolution layers Kalchbrenner et al. (2016), Gehring et al. (2017) or self-attention Vaswani et al. (2017). Nevertheless, at inference time autoregressive decoding with beam search can be slow as the individual steps of the decoder must run sequentially Gu et al. (2017).

  2.

    Parallel Sequence Generation: Recent work in text generation is challenging the assumption that text needs to be generated sequentially Gu et al. (2017). Indeed, the simplistic procedure of sequential token generation does not reflect how humans write text Guu et al. (2018) and is limiting content diversity Mehri and Sigal (2018). In contrast to standard autoregressive models that predict each word conditioned on all previous words and naturally model the length of a target sequence, non-autoregressive models enable parallel generation of output tokens and incorporate target sequence length prediction at inference time Lee et al. (2018). Sequence generation in parallel speeds up inference by leveraging parallel computation, and captures dependencies between tokens by iteratively refining a sequence Lee et al. (2018). Parallel decoding models include iterative refinement Lee et al. (2018), noisy parallel decoding Gu et al. (2017), masked language models Ghazvininejad et al. (2019), Ghazvininejad et al. (2020), insertion-based methods Stern et al. (2019), Chan et al. (2019), Gu et al. (2019a), edit-based methods Gu et al. (2019b), Ruis et al. (2020), normalizing flow models Ma et al. (2019b) and connectionist temporal classification Libovickỳ and Helcl (2018).

    Non-autoregressive generation models approach the performance of autoregressive models and have been successfully applied in machine translation Gu et al. (2017), Guo et al. (2019), Saharia et al. (2020) and speech synthesis Oord et al. (2018). Nevertheless, they make the limiting assumption that output tokens are conditionally independent given the input, which leads to the presence of redundant tokens in the non-autoregressive generated sequences. In addition, unlike their autoregressive counterparts which stop generation by emitting the end-of-sentence token, non-autoregressive models need to explicitly incorporate output length prediction as a preliminary generation step.

  3.

    Non-Monotonic Sequence Generation: As opposed to monotonic sequence generation, flexible sequence generation produces an output sentence without following a strict pre-defined left-to-right order. To this end, non-monotonic generation approaches decompose a ground-truth sequence $Y$ into a multi-set of items to be generated $\mathcal{Y}$ and a set of ordering constraints $\mathcal{C}$ Welleck et al. (2018). Naturally, the order in which sequences are generated impacts performance Vinyals et al. (2015a). Generating sequences in arbitrary orders by simultaneously predicting a word and the position in which it should be inserted during each decoding step presents comparable performance to conventional left-to-right generation Gu et al. (2019a). A hierarchical approach to decoding is proposed by deliberation networks Xia et al. (2017), where a first-pass decoder generates a raw sequence which is then further polished and refined by a second decoder. Review networks Yang et al. (2016b) further edit the encoder hidden states before generating the output sentence.

Different layers in sequence-to-sequence models exhibit different functionality and learn different representations. While lower layers of the encoder learn to represent word structure, higher layers of the encoder capture semantics and word meaning Belinkov et al. (2017). This is consistent with findings on representations learnt by CNNs on image data Zeiler and Fergus (2014).

Sequence-to-sequence models can also be trained in a multitask learning setting in which either the encoder, the decoder, or both encoder and decoder are shared between multiple tasks Luong et al. (2015a), Dong et al. (2015).

Attention

The attention mechanism Bahdanau et al. (2014), Luong et al. (2015b) is proposed to enhance seq2seq models with a random access memory which allows them to handle long input sequences and focus on salient pieces of input information. Attention dynamically attends to different parts of the input while generating each target-side word. In order to estimate the relevance of input tokens, a distribution of attention weights is computed over all input tokens and higher values are assigned to those tokens considered relevant. The attention mechanism is a crucial component of many seq2seq models used in image captioning Xu et al. (2015), machine translation Jean et al. (2015), Luong and Manning , constituency parsing Vinyals et al. (2015c), visual object tracking Mnih et al. (2014), and abstractive summarization Rush et al. (2015), Nallapati et al. (2016). Besides performance gains, attention is also commonly used as a tool for interpreting the behaviour of neural architectures since it allows one to dynamically highlight relevant features of the raw input data Hermann et al. (2015) or higher-level neural representations Galassi et al. (2019).

In the context of encoder-decoder models for neural machine translation Bahdanau et al. (2014), attention is designed to learn alignments between the decoding states and the encoded memories. Therefore, attention makes all encoder hidden states available to the decoder at decoding time (i.e. soft attention), as opposed to regular seq2seq models where the decoder can only access the last encoder hidden state. The attention mechanism computes alignment weights for all input positions, and decides how much information to retrieve from the input by learning where to focus. The benefit of using attention is that the encoder no longer needs to encode all source-side information into a fixed-length vector, while the decoder can selectively retrieve information spread throughout the entire input sequence. Empirical evidence shows that the attention model is more efficient than the plain encoder-decoder approach since its dynamic alignment mechanism requires fewer parameters and training instances Jean et al. (2015).

In its basic formulation, the attention function maps a sequence of $K$ vectors or keys $k_{i}$ with dimensionality $d_{k}$, corresponding to input features (either word- or character-level embeddings), to a distribution $a$ of weights $a_{i}$, $|a|=d_{k}$, for the input query $q$. If $q$ is defined (e.g., machine translation, question answering), input elements which are relevant to $q$ will be selected; if $q$ is undefined (e.g., document classification), inherently relevant input elements are selected. The compatibility function $f$ is used to measure how well the query matches the keys, yielding a vector $e$ of energy scores Zhao and Zhang (2018) with dimensionality $d_{k}$, where each element $e_{i}$ represents the relevance of key $k_{i}$ to query $q$ under $f$; please see Table 2.

e=f(q,K):\text{energy scores} (34)
Table 2: Attention compatibility functions.
Name | Alignment function | Reference
Similarity / Content-based | f(q,K)=\cos(q,k) | Graves et al. (2014)
Additive / Concat | f(q,K)=v_{a}^{T}\tanh(W[K;q]) | Bahdanau et al. (2014), Luong et al. (2015b)
General / Bilinear | f(q,K)=q^{T}WK | Luong et al. (2015b)
Dot-Product | f(q,K)=q^{T}K | Luong et al. (2015b)
Scaled Dot-Product | f(q,K)=\frac{q^{T}K}{\sqrt{d_{k}}} | Vaswani et al. (2017)
Location-based | f(q,K)=\text{softmax}(Wq) | Luong et al. (2015b)

Next, the energy scores $e$ are transformed into a vector $a$ of attention weights $a_{i}$ with dimensionality $d_{k}$ by applying the distribution function $g$ (the softmax function is a common choice). While the attention weights $a_{i}$ still represent the relevance of each element $k_{i}$ to the query, new representations of the keys $k_{i}$ are computed under the form of a sequence $V$ of $d_{k}$ value vectors $v_{i}$ Cui et al. (2017). There is a one-to-one mapping between the elements of $V$ and $K$, and the two sequences are different representations of the same data. The attention weights $a_{i}$ are then applied to the value vectors $v_{i}$ to obtain the vector $Z$ of attention-weighted representations of $V$. Finally, all elements $z_{i}$ of $Z$ are aggregated to obtain a compact representation of the input in the form of the context vector $c$:

\begin{split}a&=g(e):\text{attention weights}\\ z_{i}&=a_{i}v_{i}:\text{weighted representations}\\ c&=\sum_{i=1}^{d_{k}}z_{i}:\text{context vector}\end{split} (35)
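
The generic computation in Equations 34-35 can be sketched as follows, using dot-product compatibility and a softmax distribution function as illustrative choices.

import numpy as np

def attention(query, keys, values):
    # query: (d,); keys, values: (d_k, d), one key/value vector per input position
    e = keys @ query                                    # energy scores, Eq. (34)
    a = np.exp(e - e.max())
    a /= a.sum()                                        # attention weights a = g(e)
    z = a[:, None] * values                             # weighted representations z_i
    return z.sum(axis=0), a                             # context vector c, Eq. (35)

rng = np.random.default_rng(7)
keys = rng.standard_normal((5, 8))                      # 5 input positions, d = 8
values = keys.copy()                                    # K and V represent the same data
query = rng.standard_normal(8)
context, weights = attention(query, keys, values)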

In the literature attention mechanisms are categorized based on whether attention is placed on all or just a few source positions. In what follows we review the main categories of attention models.

Soft vs. Hard Attention

The distinction between soft and hard attention is proposed in image caption generation Xu et al. (2015), based on whether the attention model has access to the entire image or just an image patch. Deterministic Soft Attention places the attention weights “softly” over all patches in the source image. The model is differentiable and can be trained via standard back-propagation. Nevertheless, it can be expensive to compute when the source input is large. Stochastic Hard Attention selects only one patch of the image to attend to at a time. While at inference time hard attention is less expensive to compute compared to soft attention, it is non-differentiable. To this end, hard attention is trained either by maximizing an approximate variational lower bound, or via the REINFORCE algorithm Williams (1992).

Global vs. Local Attention

The distinction between global and local attention is proposed in the context of machine translation Luong et al. (2015b). Global Attention attends to all source words for each target word, i.e. all hidden states of the encoder are used to calculate the context vector $c_{t}$ as the weighted average over all source states according to the attention values $a_{t}$. Global attention is the same as the deterministic soft attention proposed in Xu et al. (2015) and resembles the attention mechanism in Bahdanau et al. (2014) with minor architectural differences. Nevertheless, since global attention simultaneously attends to all words on the source side for each target word, it is computationally expensive and impractical in scenarios where the source sentence is long. Local Attention considers for each target word only a subset of source words from a small context window at a time. Local attention combines the soft and hard attention mechanisms proposed in Xu et al. (2015) – it eliminates the extensive computational needs of soft attention and adds differentiability to hard attention. For the current target word at time $t$, the model identifies a single aligned source position $p_{t}$ by either assuming source-to-target monotonic alignments or by predicting it. The context window of words $[p_{t}-D,p_{t}+D]$, $D\in\mathbb{N}^{*}$, centered at $p_{t}$ is then used to compute the context vector $c_{t}$.

Self-Attention / Intra-Attention

Aiming to discover lexical relations between tokens in an input sequence, memory and attention are combined within a sequence encoder to create an attention-based memory addressing mechanism which can generate contextual representations of input tokens Cheng et al. (2016). The intra-attention mechanism can either be used for single sentences to compute a sentence representation which relates different positions in the sequence, or integrated with encoder-decoder architectures to identify unidirected (and presumably latent) relations between input tokens which mimic the human memory span. All intermediate relations captured by self-attention are soft and differentiable.

Self-attention was first applied in the context of machine reading Cheng et al. (2016), where a LSTM architecture is enhanced with a memory network Weston et al. (2014) to extract and represent meaning from natural language text. Self-attention is a general mechanism that can be applied to a wide variety of network architectures and tasks; it can be used as a stand-alone layer and is especially effective when used in later layers Ramachandran et al. (2019). Self-attention has become increasingly popular in recent years in a variety of tasks including reading comprehension, abstractive summarization, question answering, textual entailment, learning task-independent sentence representations, and is an integral component of many state-of-the-art neural network models Radford et al. (2019), Devlin et al. (2018).

Multi-head Attention

Nevertheless, a single attention layer, especially when computed as a simple weighted average, cannot model complex functions. As opposed to self-attention which performs a single attention function at a time, multi-head attention consists of several attention layers (or “heads”) running in parallel and focusing simultaneously on different parts of the input. These attention heads jointly attend to information from different representation subspaces at different positions. Models consisting entirely of multi-headed attention have led to considerable progress on a diverse range of language processing tasks, and in many cases have successfully replaced the more complex recurrence or convolutional neural mechanisms.
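
A compact sketch of multi-head attention with scaled dot-product compatibility is given below; the final output projection is omitted for brevity, and the random matrices stand in for learned parameters.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise compatibility
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                    # row-wise softmax
    return A @ V

def multi_head_attention(X, head_params):
    # each head projects X into its own lower-dimensional subspace
    heads = [scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in head_params]
    return np.concatenate(heads, axis=-1)                 # concatenation of all heads

rng = np.random.default_rng(8)
T_len, d, n_heads = 6, 16, 4
d_head = d // n_heads
X = rng.standard_normal((T_len, d))                       # T_len token representations
head_params = [tuple(rng.standard_normal((d, d_head)) for _ in range(3))
               for _ in range(n_heads)]
out = multi_head_attention(X, head_params)                # shape (T_len, d)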

The Transformer Vaswani et al. (2017) is the first transduction model based exclusively on attention mechanisms; it is highly parallelizable and can handle long-term dependencies while entirely omitting recurrent and convolutional layers. The model consists of a stacked encoder-decoder architecture with self-attention Cheng et al. (2016) and point-wise, fully connected layers in both the encoder and decoder. The encoder comprises a stack of six identical layers, where each encoder layer contains two sub-layers: i) a multi-head self-attention mechanism, followed by ii) a position-wise fully connected feed-forward network. Similarly, the decoder is also composed of a stack of six layers, where each decoder layer contains three sub-layers: i) a multi-head self-attention mechanism, ii) a multi-head attention over the output of the encoder stack, and iii) a position-wise fully connected feed-forward network. In addition, to prevent incorporating any future information at decoding time and keep the model auto-regressive, causal constraints are placed on the self-attention blocks of the decoder. Finally, in the absence of any position-aware recurrence or convolutional mechanisms, sequence ordering information is provided to the model via relative or absolute positional encodings injected into the input embeddings at the bottom of the encoder and decoder stacks.

The Transformer is the first entirely attention-based model applied to machine translation. Improvements in the memory and computational efficiency of the model are proposed in numerous follow-up works. Notably, the ongoing trend is to scale the Transformer architecture to ever larger models trained on ever bigger datasets, with superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. We present these models in more detail in Section 3.8.

3.3.1 Sequence-to-sequence models that handle additional conditions

Sequence-to-sequence Sutskever et al. (2014) models can be conditioned on specific attributes at training time so as to control their output at inference time. In the literature the main ways in which generative encoder-decoder models are conditioned are categorized Sennrich et al. (2016) as follows: i) adding special tokens at the beginning Johnson et al. (2017) or end Sennrich et al. (2016) of the source text, ii) incorporating additional conditions into the decoder hidden states, therefore bypassing attention Logeswaran et al. (2018), and iii) connecting the conditions directly to the decoder output layer. In addition, when categorical attributes are used in conjunction with end-to-end neural text classification models, incorporating these attributes in the attention mechanism is the least effective method Amplayo (2019).

Attribute-conditioned review generation uses an attention-enhanced attribute-to-sequence model to generate reviews conditioned on specific product attributes Dong et al. (2017), Tang et al. (2016). Natural language descriptions of database events are generated via encoder-decoder models for concept-to-text generation Mei et al. (2016) and table-to-text generation Liu et al. (2018c). Generating questions from long documents is achieved by combining a sequence-to-sequence model with multi-stage attention to represent the broad document context Tuan et al. (2019). A hierarchical sequence-to-sequence architecture is proposed for generating dialogue responses by first sampling a continuous variable in the latent space and then generating the response conditioned on that latent variable Serban et al. (2017). A dual attention sequence-to-sequence framework which conditions on both the source text and factual information is used in abstractive summarization for encouraging faithfulness of the generated content to the source document Cao et al. (2018).

3.3.2 Sequence-to-sequence models that handle constraints

Hierarchical story generation systems Fan et al. (2018b) use a two-step approach to text generation: first generate a prompt describing the topic of the story, and then conditioned on the given prompt generate the content of the story. To this end, fusion mechanisms Sriram et al. (2018) used on top of sequence-to-sequence models encourage conditioning on the story outline and are found useful at building dependencies between the given inputs and the generated outputs. Entity-focused story generation is proposed in Clark et al. (2018a).

Concise summaries of a specific desired length are obtained by controlling the output sequence length for neural encoder-decoder models through either learning or decoding-based methods Kikuchi et al. (2016). The most salient sentences in a document are identified using extractive summarization and then paraphrased using an encoder-decoder based sentence abstractor model Nikolov and Hahnloser (2020). Nevertheless, in abstractive summarization conventional sequence-to-sequence models often suffer from repetition and semantic irrelevance Lin et al. (2018), and the attention-based encoder outputs are noisy when there is no obvious alignment relationship between the source text and the target summary Zhou et al. (2017). To alleviate the problem, global encoding of the source context is used to filter encoder outputs and refine representations learnt at each timestep based on the global context Lin et al. (2018). An encoder-decoder framework for extractive summarization which incorporates both a sentence encoder and a document encoder to capture context surrounding a sentence, as well as a document decoder to predict sentence labels for inclusion in the summary based on representations learned by the document encoder, is proposed in Zhang et al. (2018b). A hierarchical document encoder is combined with an attention-based extractor for selecting sentences and words in extractive summarization Cheng and Lapata (2016). Other approaches to hierarchical abstractive multi-document summarization combine maximal marginal relevance, which serves to extract sentences from the original document based on their relevance and redundancy, with the pointer-generator network See et al. (2017) used to alternate between copying words from the source documents and outputting other vocabulary words Fabbri et al. (2019). In text simplification, appending special tokens to input sentences at training time is found to improve performance Nishihara et al. (2019). Similarly, sequence-to-sequence models parameterized on specific attributes of the target simplification, such as length, paraphrasing, lexical complexity and syntactic complexity, are proposed in Martin et al. (2019b). An attentional encoder-decoder framework with side constraints for politeness is used to control for the level of courtesy in machine translation Sennrich et al. (2016). Textual attributes of sentences such as sentiment, tense, voice, mood and negation are modified by incorporating conditioning information into neural encoder-decoder models Logeswaran et al. (2018). Generating emotional responses in neural conversational systems is achieved by feeding the emotion category embedding to a sequence-to-sequence decoder Zhou et al. (2018), Asghar et al. (2018). Furthermore, informative and on-topic dialogue responses are generated via sentence control functions Ke et al. (2018). Many encoder-decoder models treat the entire dialogue history as one single sequence Serban et al. (2016), while others treat each conversational turn as a separate sequence Vinyals and Le (2015), Shang et al. (2015), Dušek and Jurcicek (2016). A topic-aware sequence-to-sequence model is used to generate on-topic conversational responses Xing et al. (2016). Topic words are extracted using Latent Dirichlet Allocation and the decoder produces each word conditioned on both the input message and the topics through a joint attention mechanism. Abstractive summaries are produced relying on the pointer-generator network See et al. (2017), a hybrid model which combines pointer networks Vinyals et al. (2015b), used to accurately reproduce source-side information, with a sequence-to-sequence attentional model used to generate new words.

Synthesizing sentences containing specific keywords is done in a sequence-to-sequence backward and forward Mou et al. (2016) model by first generating the sentence fragment to the left of the given keyword (in reverse order and conditioned on that keyword), then encoding the sentence fragment generated so far, followed by decoding the sentence fragment to the right of the keyword conditioned on the already generated first part. An encoder-decoder framework for factoid question answering Yin et al. (2016) is able to query and generate answers containing terms retrieved from a knowledge base. Constrained modification of factual Wikipedia sentences according to given claims is performed via a two-encoder sequence-to-sequence model with copy attention Shah et al. (2020). Style transfer between scientific paper titles and newspaper titles is approached using multiple decoders for each style, or by passing encoded representations combined with style embeddings to a single decoder Fu et al. (2018). Generating memorable headlines constrained on stylistic attributes such as humour, romance and clickbait is performed in a sequence-to-sequence Transformer-based model for stylistic headline generation Jin et al. (2020). A denoising autoencoding approach which replaces the learning of disentangled latent representations with back-translation is adopted for style transfer Lample et al. (2018). Neural machine translation models with attention are used for style transfer via back-translation in the absence of parallel aligned data Zhang et al. (2018d). Neural paraphrase generation is performed through a syntactically constrained encoder-decoder model Iyyer et al. (2018) or via a sequence-to-sequence model with pivoting over multiple sentences from multiple languages Mallinson et al. (2017). Neural poetry translation with a sequence-to-sequence model is proposed in Ghazvininejad et al. (2018).

3.4 GAN Architectures

3.4.1 GAN Models for Generic / Free-Text Generation

Generative Adversarial Networks (GANs) Goodfellow et al. (2014) train generative models through an adversarial process which consists of two competing models trained simultaneously: a generative model $G$ that captures the data distribution and whose objective is to generate fake data which is indistinguishable from real data, and a discriminative model $D$ which estimates the probability that a sample came from the training data rather than from the generator $G$. The generator $G(z, \theta_G)$ learns the distribution $p_g$ over real data $x$ by mapping the prior noise distribution $p_z(z)$ to the data space, while the discriminator $D(x, \theta_D)$ outputs a single scalar value representing the probability that a sample came from the training data instead of $p_g$. The optimization objective for the two-player mini-max game between $G$ and $D$ can be formally expressed as:

\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z)))]    (36)

The GAN framework relies on an implicit probability model, as opposed to incorporating an explicit formulation of the probability density. Unlike autoregressive modeling, in which exposure bias is a common issue, GANs avoid this issue by sampling synthetic examples at training time from the generator and providing them as input to the discriminator for comparison with real sentences (i.e. sentence-level comparison instead of word-level comparison). The latent representations extracted from real data are distributed according to the specified prior $p_z(z)$ (e.g. Gaussian or uniform). Gradients are backpropagated from the discriminator $D$ through the generated samples to the generator $G$; note that this is possible only when the generated samples are differentiable w.r.t. the generator parameters $\theta_G$. When the discriminator is trained to optimality before each generator parameter update, the GAN adversarial game is equivalent to minimizing the Jensen-Shannon divergence between the real data distribution $p_x(\cdot)$ and the synthetic data distribution $p(G(z))$, $z \sim p_z(\cdot)$ Arjovsky and Bottou (2017). Nevertheless, in such cases gradients can vanish as the discriminator saturates, learns to reject all samples and gives meaningless gradients to the generator. In practice, the second term in Equation (36) is replaced with $\mathbb{E}_{z\sim p_z(z)}[\log D(G(z))]$, which helps circumvent the vanishing gradient problem to a certain extent Goodfellow et al. (2014).
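
To make the objective in Equation (36) and the non-saturating generator loss concrete, the sketch below shows one training step of a vanilla GAN on continuous data. It is a minimal illustration assuming PyTorch; the multilayer perceptrons, dimensions and learning rates are arbitrary choices, and the continuous setting is deliberate (the discrete text case is discussed next).

```python
# Minimal sketch of one GAN training step with the non-saturating generator
# loss log D(G(z)). All module sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

DATA_DIM, NOISE_DIM, BATCH = 64, 16, 32

G = nn.Sequential(nn.Linear(NOISE_DIM, 128), nn.ReLU(), nn.Linear(128, DATA_DIM))
D = nn.Sequential(nn.Linear(DATA_DIM, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(BATCH, DATA_DIM)          # stand-in for a batch of real data
z = torch.randn(BATCH, NOISE_DIM)            # z ~ p_z(z)

# Discriminator update: maximize log D(x) + log(1 - D(G(z)))
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(BATCH, 1)) + bce(D(fake), torch.zeros(BATCH, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator update: non-saturating objective, maximize log D(G(z))
g_loss = bce(D(G(z)), torch.ones(BATCH, 1))  # label fakes as "real"
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Note that the generator update only works here because gradients flow through the continuous samples G(z); this is exactly what breaks down for discrete word tokens.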

GAN-based models can be either continuous or discrete, depending on whether the model learns the probability distribution over an entire sequence of tokens or over each individual token $p(x_t|x_{<t})$ at a time. While in the former case it is possible to directly backpropagate from the discriminator into the generator, in the latter case the generator output is non-differentiable and backpropagation becomes challenging since gradients cannot be passed through the discrete output words of the generator. Therefore, the difficulty of applying GANs to discrete data is caused by the discontinuity which prohibits the update of the generator parameters via standard back-propagation. The non-differentiability of discrete word tokens results in difficult generator optimization. Nevertheless, discrete representations are more interpretable and more computationally efficient than their continuous counterparts Jang et al. (2017). Given that the composition of the generator and discriminator needs to be fully differentiable, existing GAN-based solutions for dealing with discrete data such as text can be categorized as follows: i) reinforcement learning based methods, ii) latent space solutions, and iii) continuous approximations of discrete sampling. We present these below.

Reinforcement Learning (RL) methods

Discrete GAN models for text generation which employ RL to train the non-differentiable generator represent the current state as the tokens generated so far, while the current action is the next token to generate. A discriminator is used to evaluate the current state and provide rewards to the generator to guide its learning. RL-based approaches for training GANs on discrete sequences perform policy gradient updates via REINFORCE Williams (1992) to bypass the generator differentiation problem. Nevertheless, RL training presents its own challenges, such as the large action space, reward sparsity, the credit assignment problem and large variance in gradient estimation Maddison et al. (2016), Zhang et al. (2017b). Indeed, RL algorithms applied to dynamic environments with sparse rewards are very unstable, and the credit assignment problem through discrete computation makes it difficult to pass gradient information to the generator Che et al. (2017). SeqGAN Yu et al. (2017) employs Monte Carlo policy gradient to overcome the differentiation difficulty for discrete data generation, RankGAN Lin et al. (2017) trains the discriminator in a learning-to-rank setting and evaluates the quality of the generated samples through their relative ranking scores, while LeakGAN Guo et al. (2018) overcomes reward sparsity in a hierarchical RL setting by allowing the discriminator to leak its own high-level features to the generator. Monte Carlo (MC) rollouts are commonly used to provide ample feedback to the generator at every timestep and circumvent the credit assignment problem. StepGAN Tuan and Lee (2019) proposes a more computationally efficient approach in which the discriminator issues rewards without computing the entire search tree. Pre-training the generator with a negative log-likelihood objective is also commonly used to reduce the large action space and avoid reward sparsity. Nevertheless, RL-based methods with pre-training tend to be computationally expensive and less efficient than solutions based on latent or continuous approximations Haidar and Rezagholizadeh (2019). In addition, practical limitations of GANs include mode collapse Metz et al. (2016), which occurs when the generator produces the same output for many different latent codes, and vanishing gradients Arjovsky and Bottou (2017), which occur when the discriminator is close to its local optimum and the generator’s contribution to the learning signal is insignificant; the GAN objective in Equation (36) thus becomes a weak learning signal.
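
As a rough illustration of the RL-based recipe shared by these models, the sketch below samples a sequence token by token, treats a discriminator score as a terminal reward, and applies a REINFORCE update. It assumes PyTorch; the discriminator is a random placeholder (discriminator_reward), and the sizes, <bos> id and mean baseline are illustrative choices rather than details taken from any of the cited papers.

```python
# Minimal sketch of a SeqGAN-style REINFORCE update for a discrete generator.
import torch
import torch.nn as nn

VOCAB, EMB, HID, MAX_LEN, BATCH = 5000, 32, 64, 20, 16

embed = nn.Embedding(VOCAB, EMB)
rnn = nn.GRUCell(EMB, HID)
to_logits = nn.Linear(HID, VOCAB)
params = list(embed.parameters()) + list(rnn.parameters()) + list(to_logits.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def discriminator_reward(tokens):
    # Placeholder for D: a per-sequence score in [0, 1]. In SeqGAN, Monte
    # Carlo rollouts would additionally provide rewards at every timestep.
    return torch.rand(tokens.size(0))

h = torch.zeros(BATCH, HID)
tok = torch.zeros(BATCH, dtype=torch.long)           # assumed <bos> id = 0
log_probs, tokens = [], []
for _ in range(MAX_LEN):
    h = rnn(embed(tok), h)
    dist = torch.distributions.Categorical(logits=to_logits(h))
    tok = dist.sample()                              # discrete, non-differentiable
    log_probs.append(dist.log_prob(tok))
    tokens.append(tok)

reward = discriminator_reward(torch.stack(tokens, dim=1))
baseline = reward.mean()                             # crude variance reduction
loss = -(torch.stack(log_probs, dim=1).sum(dim=1) * (reward - baseline)).mean()
opt.zero_grad(); loss.backward(); opt.step()
```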

Latent space solutions

These methods extract latent representations of the discrete input data by means of autoencoding and apply smooth transformations to learn the data manifold. Adversarially regularized autoencoders Zhao et al. (2018a) map discrete inputs to an adversarially regularized continuous latent space. TextKD-GAN Haidar and Rezagholizadeh (2019) uses knowledge distillation on sentence representations learnt by an autoencoder to train the generator to produce similar continuous representations.

Continuous approximations

Generating sequences of discrete elements by sampling from a multinomial distribution over discrete objects is not differentiable with respect to the distribution parameters, and this is the main reason why GANs cannot be directly applied to text generation. Nevertheless, the multinomial distribution can be approximated with the Gumbel-softmax distribution Jang et al. (2017), a continuous distribution over the simplex that can approximate samples from a categorical distribution, which is differentiable and makes it possible to backpropagate through samples Kusner and Hernández-Lobato (2016). In addition, the non-differentiable argmax operator is approximated at learning time with the soft-argmax Zhang et al. (2016c) operator, a continuous differentiable function Zhang et al. (2017b). Other gradient estimators for training neural networks with discrete units include straight-through estimators Bengio et al. (2013), Raiko et al. (2014) and Concrete relaxations Maddison et al. (2016).
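
The sketch below illustrates the Gumbel-softmax relaxation in isolation: a soft, differentiable stand-in for sampling a vocabulary token. It is a minimal example assuming PyTorch, with an arbitrary batch size, vocabulary size and temperature.

```python
# Minimal sketch of the Gumbel-softmax relaxation: a soft sample from the
# simplex replaces the hard, non-differentiable argmax over vocabulary logits.
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0, eps=1e-20):
    # Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + eps) + eps)
    # Soft one-hot vector; approaches a hard sample as tau -> 0
    return F.softmax((logits + g) / tau, dim=-1)

logits = torch.randn(4, 5000, requires_grad=True)     # batch of vocabulary logits
soft_tokens = gumbel_softmax_sample(logits, tau=0.5)   # (4, 5000), differentiable
# soft_tokens can be fed to the discriminator as a weighted mixture of word
# embeddings, e.g. soft_tokens @ embedding_matrix, keeping the graph intact.
soft_tokens.sum().backward()                           # gradients reach `logits`
```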

Many GAN models have been proposed to address the task of text generation. Pre-training Yu et al. (2017), Li et al. (2017a), Yang et al. (2018c) or joint training Lamb et al. (2016), Che et al. (2017) of the generator and discriminator with a supervised maximum-likelihood loss is commonly employed before the start of adversarial training, aiming to reduce training instability and guide the generator towards promising improvement directions. Without such pre-training, a generator that acts randomly produces samples that are easily recognized as fake by the discriminator, and consequently a low reward is assigned to any generator action, resulting in an ineffective training procedure. TextGAN Zhang et al. (2017b) leverages a GAN framework (consisting of an LSTM generator and a CNN discriminator) which forces the empirical distributions of real and synthetic sentences to have matched moments in latent-feature space. Instead of optimizing for the GAN objective, MaliGAN Che et al. (2017) uses a normalized maximum likelihood target optimized via importance sampling. RelGAN Nie et al. (2018) uses a relational memory based generator to model long distance dependencies, along with embedded representations in the discriminator to provide a more informative signal to the generator.

Nevertheless, there are also models using purely adversarial training techniques. ScratchGAN d’Autume et al. (2019) attempts to train language GANs from scratch, i.e. without maximum likelihood pre-training; the model generates realistic looking samples by heavily relying on engineering tricks such as large batch sizes for variance reduction, dense rewards provided by a recurrent discriminator at each step for each generated token, and discriminator regularization. A curriculum learning strategy is proposed to generate sequences of increasing length, also starting from scratch Press et al. (2017). Wasserstein GANs with gradient penalty (WGAN-GP) Gulrajani et al. (2017) train a character-level language model in which the discriminator distinguishes between one-hot representations of real text and the probabilistic (softmax) output vectors from the generator; for more stable training the norm of the gradient from the discriminator is penalized. Boundary-seeking GANs Hjelm et al. (2018) compute importance weights for the generated samples from the discriminator’s estimates and use them as policy gradients for the generator.

To overcome reward sparsity, self-adversarial learning Zhou et al. (2020) provides dense rewards to the generator by comparing text quality between pairs of generated samples (unlike standard GANs, which compare fake and real texts). The generator is rewarded by the pairwise comparative discriminator whenever the current generated sentence is better than previously generated samples, similar to a self-play / self-improvement scenario. In FM-GAN Chen et al. (2018a), the distance between the latent feature distributions of real and synthetic sentences is minimized by the generator when synthesizing realistic text, and maximized by the discriminator to delineate the dissimilarity of the feature distributions.

In spite of these attempts at overcoming the limitations of GANs for language generation, adversarial learning hurts performance Semeniuta et al. (2018), Garbacea et al. (2019). GAN-based models are frequently unstable during training (even less stable than regular language models), extremely sensitive to random initialization and the choice of hyperparameters Salimans et al. (2016), and the error signal provided by the discriminator can be insufficient to train the generator to produce fluent language Yang et al. (2018d). In addition, training GANs using gradient-based methods is inherently difficult due to training instability, and gradient-based optimization frequently fails to converge Salimans et al. (2016), Gulrajani et al. (2017). In turn, samples produced by GANs are at best comparable to, or even worse in quality than, samples produced by properly tuned conventional language models; the latter are frequently reported in the literature to yield better results than many GAN-based systems Semeniuta et al. (2018). The extent to which GANs generalize from the training data as opposed to memorizing training examples is still an open question Nagarajan et al. . Therefore, the benefits of using GANs for language generation are rather unclear, and GAN-based models seem not to benefit much from the maximum likelihood pre-training approach combined with small amounts of adversarial fine-tuning. This in turn suggests that the best performing GAN models tend to stay close to the maximum-likelihood training solution Caccia et al. (2018). Nevertheless, more recent results indicate that relying on pure adversarial training and avoiding the maximum likelihood pre-training step in the GAN training procedure achieves comparable results to maximum likelihood models for unsupervised unconditional word-level text generation d’Autume et al. (2019).

3.4.2 GAN Models for Conditional Text Generation

Conditional GANs Mirza and Osindero (2014) are constructed by feeding the data we wish to condition on, $y$ (e.g., class labels or auxiliary data from other modalities), to both the generator and the discriminator as an additional input layer. In the generator $G$ the prior input noise $p_z(z)$ and $y$ are combined in a hidden joint representation, while in the discriminator $D$ the data $x$ and the conditioning information $y$ are specified as different inputs to a discriminative function. The conditional GAN objective function is formulated as:

\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x|y)] + \mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z|y)))]    (37)
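
A minimal sketch of this conditioning mechanism, assuming PyTorch and a toy label-conditioned setting with arbitrary sizes, is shown below; it only illustrates how $y$ enters both networks and omits the training loop, which is otherwise identical to the unconditional case.

```python
# Minimal sketch of conditional GAN inputs: the condition y (here a class
# label embedding) is concatenated to the inputs of both G and D.
import torch
import torch.nn as nn

NOISE_DIM, DATA_DIM, N_CLASSES, EMB = 16, 64, 10, 8

label_emb = nn.Embedding(N_CLASSES, EMB)
G = nn.Sequential(nn.Linear(NOISE_DIM + EMB, 128), nn.ReLU(), nn.Linear(128, DATA_DIM))
D = nn.Sequential(nn.Linear(DATA_DIM + EMB, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

z = torch.randn(32, NOISE_DIM)
y = torch.randint(0, N_CLASSES, (32,))
e = label_emb(y)

fake = G(torch.cat([z, e], dim=1))          # generator sees (z, y)
score = D(torch.cat([fake, e], dim=1))      # discriminator sees (x, y)
```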

Diverse text generation is encouraged by having a language-model based discriminator reward the generator based on the novelty of the text produced Xu et al. (2018a). Unlike classifier-based discriminators, which once classification accuracy saturates can no longer distinguish between relative degrees of novelty, the cross-entropy of the language model does not saturate and discriminates between repetitive text and novel, fluent text. Generic and uninformative responses are a common problem in dialogue systems. A variational mutual information objective is employed to encourage natural conversations with diverse and unpredictable responses Zhang et al. (2018c). Adversarial training for open-domain dialogue generation in a reinforcement learning setting is proposed to generate the next response given the dialogue utterance history Li et al. (2017a). To alleviate the credit assignment problem, rewards for each action (word) selection step in partially decoded sequences are assigned by either using Monte Carlo search, or by training a discriminator to provide a reward to a partial utterance; nevertheless, computing such a reward is time-consuming.

MaskGAN Fedus et al. (2018) adopts a fill-in-the-blank approach to text generation and masks contiguous blocks of words in a sentence; an actor-critic conditional GAN fills in missing text conditioned on the surrounding context. Conditional GANs are used to generate image Dai et al. (2017) and video Yang et al. (2018a) captions. Autoregressive and adversarial models are combined for neural outline generation Subramanian et al. (2018). Conditional GANs for neural machine translation with a sentence-level BLEU reinforced objective are proposed in Yang et al. (2018c). Generating poems from images is accomplished by extracting coupled visual-poetic embeddings and feeding them to a recurrent neural network for poem generation in an adversarial training framework with multiple discriminators via policy gradient Liu et al. (2018a).

3.4.3 GAN Models for Constrained Text Generation

BFGAN Liu et al. (2019a) is the first GAN-based model proposed for lexically constrained sentence generation. The model architecture employs two generators, namely a forward generator and a backward generator, as well as a discriminator that guides their joint training and learns to distinguish human-written sentences from machine-generated lexically constrained sentences. The model is used to generate user reviews for Amazon products and conversational responses with lexical constraints. GAN-based stylistic headline generation is proposed in Shu et al. (2018).

3.5 VAE Architectures

3.5.1 VAE Models for Generic / Free-Text Generation

The variational autoencoder (VAE) Kingma and Welling (2013), Rezende et al. (2014), Doersch (2016), Kingma and Welling (2019) is a generative model which integrates stochastic latent variables into the conventional auto-encoder architecture. VAE-based generative models aim to produce realistic samples by feeding noise vectors through the decoder. Given observed variable x, the VAE framework assumes that x is generated from latent variable z and models their joint probability as follows:

p_{\theta}(x, z) = p_{\theta}(x|z)\, p_{\theta}(z)    (38)

The model is parameterized by $\theta$, and $p_\theta(z)$ represents the prior, which is typically chosen to be a simple Gaussian distribution. The VAE learns the conditional probability distribution $p_\theta(x|z)$ which models the generation procedure of the observed data x given latent variable z. However, the true posterior over latent variables is intractable, and VAEs derive an analytic approximation in the form of a recognition model $q_\phi(z|x)$ which estimates the latent variable z for a particular observation x. The probability distributions $p$ and $q$ are parameterized by neural network parameters $\theta$ and $\phi$ (the variational parameters), and are learnt by maximizing the variational lower bound on the marginal log likelihood of the data:

\log p_{\theta}(x) \geq \mathbb{E}_{z\sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - \text{KL}(q_{\phi}(z|x)\,||\,p(z))    (39)

The KL term ensures that the distribution estimated by the recognition model $q(z|x)$ does not diverge from the prior probability distribution $p(z)$ imposed over the latent variables. The reparameterization trick is used to train the model with backpropagation and optimize the parameters with gradient descent. According to the reparameterization trick, the Gaussian latent variable z is reparameterized as a deterministic function that is differentiable w.r.t. $\phi$, expressed as $z = g_\phi(\epsilon, x)$ with mean $\mu$ and variance $\sigma^2$, where $\epsilon \sim \mathcal{N}(0,1)$ is an independent Gaussian noise variable. To this end, instead of sampling z from $q_\phi(z|x)$, z is obtained as $z = \mu_\phi(x) + \sigma_\phi(x) \circ \epsilon$, allowing gradients to backpropagate through $\phi$. VAEs are considered a regularized version of the standard autoencoder, where the latent variable z captures the variations $\epsilon$ in the observed variable x.
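
The sketch below puts the reparameterization trick and the negative of the lower bound in Equation (39) into code. It is a minimal illustration assuming PyTorch, with placeholder feed-forward encoder and decoder networks over arbitrary dimensions rather than the recurrent text models discussed below.

```python
# Minimal sketch of the reparameterization trick and the negative ELBO for a
# Gaussian encoder; encoder/decoder are placeholder MLPs, not text models.
import torch
import torch.nn as nn
import torch.nn.functional as F

X_DIM, Z_DIM, H = 2000, 32, 256

enc = nn.Sequential(nn.Linear(X_DIM, H), nn.ReLU())
to_mu, to_logvar = nn.Linear(H, Z_DIM), nn.Linear(H, Z_DIM)
dec = nn.Sequential(nn.Linear(Z_DIM, H), nn.ReLU(), nn.Linear(H, X_DIM))

x = torch.rand(8, X_DIM)                     # stand-in for an input batch

h = enc(x)
mu, logvar = to_mu(h), to_logvar(h)
eps = torch.randn_like(mu)                   # eps ~ N(0, I)
z = mu + torch.exp(0.5 * logvar) * eps       # z = mu + sigma * eps (reparameterized)

recon = dec(z)
recon_loss = F.binary_cross_entropy_with_logits(recon, x, reduction="sum")
# KL( q(z|x) || N(0, I) ) in closed form for diagonal Gaussians
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon_loss + kl                       # negative ELBO, to be minimized
```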

While VAEs achieve strong performance in continuous domains, e.g. image modeling Bachman (2016), Gulrajani et al. (2016), using VAEs on discrete text sequences is more challenging due to optimization issues. Parameterizing conditional likelihoods with powerful function approximators such as neural networks makes posterior inference intractable and introduces points of non-differentiability which complicate backpropagation Kim et al. (2018). In particular, the collapse of the KL divergence term in the latent loss to zero leads to the model behaving like a regular language model and completely ignoring the latent representations Bowman et al. (2015), Pelsmaeker and Aziz (2019). This posterior collapse issue occurs in particular when learning VAEs with an auto-regressive decoder, and in such cases the model generates repetitive and uninteresting samples Semeniuta et al. (2017) and behaves like a regular language model Pelsmaeker and Aziz (2019). In addition, the assumption that the variational posterior is Gaussian introduces an approximation gap with respect to the true posterior Cremer et al. (2018). Solutions proposed in the literature aim to force the decoder to incorporate the information from the latent vectors by imposing structured sparsity on the latents Yeung et al. (2016), batch normalization and deterministic warm-up to gradually turn on the KL-term Sønderby et al. (2016), as well as input dropout Bowman et al. (2015) and adding auxiliary reconstruction terms computed from the activations of the last decoder layer Semeniuta et al. (2017). Nevertheless, training deep latent variable models for discrete structures is still an open research problem Zhao et al. (2018a).
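
Two of these mitigations, the KL warm-up schedule and input (word) dropout, are simple enough to sketch. The snippet below is an illustrative example assuming PyTorch; the warm-up length, dropout rate and unknown-token id are arbitrary, and the resulting weight plugs into the VAE loss sketched earlier.

```python
# Minimal sketch of KL annealing ("warm-up") and decoder-input word dropout.
import torch

def kl_weight(step, warmup_steps=10000):
    # Linearly increase the KL weight from 0 to 1 over the warm-up period
    return min(1.0, step / warmup_steps)

def word_dropout(tokens, unk_id, rate=0.3):
    # Randomly replace decoder input tokens with <unk>, forcing the decoder
    # to rely on the latent code rather than on autoregressive context alone
    mask = torch.rand_like(tokens, dtype=torch.float) < rate
    return torch.where(mask, torch.full_like(tokens, unk_id), tokens)

# loss = recon_loss + kl_weight(step) * kl   (plugged into the VAE sketch above)
```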

An RNN-based variational autoencoder generative model is used to generate natural language sentences from a continuous latent space Bowman et al. (2015). To this end, distributed latent representations encode the full content of sentences and make it possible to explicitly incorporate and vary textual attributes such as style, topic, and high-level syntactic features. Nevertheless, the authors report the negative result that VAEs with LSTM decoders perform worse than LSTM language models; this is attributed to the LSTM decoders ignoring the conditioning information from the encoder. In follow-up work, VAEs outperform language models when the decoder architecture is replaced with a dilated CNN, demonstrating the trade-off between the effectiveness of encoding information and the contextual capacity of the decoder Yang et al. (2017). For generating longer texts, a VAE framework based on a convolutional encoder and a decoder which combines deconvolutional and RNN layers is used in Semeniuta et al. (2017). The variational RNN Chung et al. (2015) incorporates random latent variables into the hidden state of a recurrent neural network and is designed to model variability in highly structured sequential data such as speech.

Finally, it is important that the posterior distribution over latent variables appropriately covers the latent space Bowman et al. (2015). When mapping sentences to latent representations there are many regions in the latent space which do not necessarily map or decode to realistic-looking sentences, therefore it is not enough to only cover a small region of the latent space corresponding to a manifold embedding Zhang et al. (2017b). VAEs for text generation are also difficult to train when combined with powerful autoregressive decoders – “posterior collapse” causes the model to rely entirely on the decoder and ignore latent variables. Latent space expanded VAE disperses sentences into the latent space based on their similarity to avoid mode collapse Song et al. (2019b).

3.5.2 VAE Models for Conditional Text Generation

In the standard VAE model it is difficult to control textual features directly since the latent code is assumed to be Gaussian distributed; this makes it impossible to distinguish which part of the code controls the structure and which part controls the semantics Li et al. (2019c).

A document-level language model based on the VAE architecture is introduced in Miao et al. (2016) for the answer selection problem. The model represents texts as bags of words and extracts a continuous semantic latent variable for each document, which is then passed to a decoder that generates either generic or conditional sentence reconstructions. Auto-encoding sentence compression (both extractive and abstractive) is modeled in the VAE framework by first drawing a compact summary sentence from a latent background language model, and then drawing the observed sentence conditioned on the latent summary Miao and Blunsom (2016). Variational neural machine translation Zhang et al. (2016a) incorporates a continuous latent variable to learn the conditional distribution of a target sentence given a source sentence and learn the underlying semantics of sentence pairs. In follow-up work, the variational recurrent neural machine translation model introduces a sequence of continuous random latent variables $z = \{z_1, z_2, \ldots, z_N\}$ to capture the underlying semantics of sentence pairs and model the high variability in structured data Su et al. (2018). Conditional VAEs that condition on observed images are proposed for image caption generation Pu et al. (2016). A conditional VAE is introduced for the task of poetry generation Liu et al. (2019c), conditioning on aesthetical aspects of the generated poem such as the use of metaphor and personification.

3.5.3 VAE Models for Constrained Text Generation

Semi-supervised VAEs that operate on both continuous and discrete latent variables are used for labeled sequence transduction – given an input sequence and a set of labels, the model changes the input sequence to reflect the attributes of the given labels Zhou and Neubig (2017). VAEs are enhanced with attribute discriminators that help the model learn disentangled latent representations of semantic structures Hu et al. (2017). In addition, this also enhances interpretability in the latent space, where each attribute focuses solely on one aspect of the generated samples; the authors control for the sentiment and tense of the generated sentences. Implicit latent features in VAEs are extracted following a sample-based approach which aligns the posterior to the prior distribution Fang et al. (2019). The topic guided variational autoencoder Wang et al. (2019c) is used for text generation on a specific topic of interest. Unlike the VAE, which specifies a simple Gaussian prior for the latent code, the model specifies the prior as a Gaussian mixture model parameterized by a neural topic module. Style transfer for tasks such as sentiment modification, word substitution and word ordering is achieved using a VAE model that separates content from the stylistic properties of text Shen et al. (2017). To this end, the VAE encoder takes as input a sentence and its original style indicator and maps it to a style-independent content representation; this representation is then passed to a style-dependent decoder for generation. Learning disentangled representations for style transfer is also proposed in Balasubramanian et al. (2020), John et al. (2019). Paraphrase generation is performed through a VAE module with two latent variables designed to capture semantics and syntax Chen et al. (2019b), Bao et al. (2019).

3.6 Memory-based Architectures

Although RNNs and LSTMs are trained to predict the next token in a sequence, their memory is small and used mainly to store information about the local context, which does not allow these models to accurately recall facts from the past. Indeed, recurrent neural network memory degrades with time Khandelwal et al. (2018). Parametric neural networks implicitly encapsulate memory in their weights, nevertheless this hurts their ability to generalize across complex linguistic tasks Nematzadeh et al. . Attempts to capture non-local dependencies in language models aim to enhance their ability to adapt to a changing environment and dynamically update the word probabilities based on the long-term context. Neural language models are improved with external storage units by means of introducing an external memory component in the form of a soft attention mechanism Bahdanau et al. (2014), Luong et al. (2015b), Daniluk et al. (2017) which allows them to focus on specific parts of the input, an explicit memory block which implicitly captures dependencies for word prediction Tran et al. (2016), or a cache model Grave et al. (2016) which can be added on top of a pre-trained language model. Shared memory models are reported to further improve attention based neural models Munkhdalai and Yu (2017).

Integrated LSTM networks are proposed to alleviate the practical engineering requirements of LSTMs by relying on external memory units to enhance the memory capacity of neural networks. Neural Turing Machines Graves et al. (2014) extend the memory resources of RNNs by coupling them with an addressable external memory bank that can be read from and written to (i.e. random access memory with read and write operations). C-LSTMs Zhou et al. (2015) combine CNN and LSTM networks to learn high-level sentence representations that capture both local features of phrases and global, temporal sentence semantics. In the context of question answering, the use of a long-term memory acting similar to a dynamic knowledge base which can be read from and written to is proposed in memory networks Weston et al. (2014). Nevertheless, the discrete model is difficult to train via backpropagation and requires supervision at each layer of the network. The memory network architecture is further extended to operate without supervision in a continuous space Sukhbaatar et al. (2015). Single-layer LSTM networks enhanced with an unbounded differentiable memory yield comparable performance to deep RNNs in sentence transduction tasks such as machine translation Grefenstette et al. (2015). Memory based architectures incorporating stacked layers of memories for storing and accessing intermediate representations in sequence-to-sequence learning are proposed in Meng et al. (2015). Dynamic memory networks Kumar et al. (2016) are used to generate relevant answers in question answering by means of episodic memories reasoned over in a hierarchical recurrent sequence model.
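
The core addressing mechanism shared by these end-to-end memory architectures is a soft attention read over memory slots. The sketch below illustrates a single memory hop in the spirit of Sukhbaatar et al. (2015); it assumes PyTorch and uses random placeholder embeddings with arbitrary dimensions instead of learned input and output memory encoders.

```python
# Minimal sketch of a single memory "read": the query attends over memory
# slots with a softmax and the retrieved content is added to the query.
import torch
import torch.nn.functional as F

D, N_SLOTS = 64, 50

query = torch.randn(1, D)             # encoded question u
memory_in = torch.randn(N_SLOTS, D)   # input (addressing) embeddings m_i
memory_out = torch.randn(N_SLOTS, D)  # output (content) embeddings c_i

p = F.softmax(query @ memory_in.t(), dim=-1)   # attention over memory slots
o = p @ memory_out                             # weighted sum of slot contents
next_query = query + o                         # fed to the next hop / answer layer
```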

Memory architectures for recurrent neural network language models are compared in Yogatama et al. (2018). Stack-based memory access, which dynamically stores and retrieves contextual information with a stack, is shown to outperform both sequential access, which fails at capturing long-term dependencies, and random memory access, in which the learner needs to infer dependencies from the data in the absence of any structural biases. Instead of having a monolithic model fit all training examples, a few-shot meta-learning scenario in which multiple task-specific models cover groups of similar examples is proposed in Huang et al. (2018).

While the on-going trend in language modeling is to learn contextual representations from ever larger datasets, alternative methods which are sample efficient and leverage smaller amounts of data represent the next research frontier for deep learning models. kNN-LM Khandelwal et al. (2019) is a general framework which augments any pre-trained language model by linearly interpolating its next word distribution with a k-nearest neighbors search. The approach helps memorize long-tail patterns (e.g., factual knowledge and rare n-grams) explicitly by drawing nearest neighbours from any text collection in the pre-trained embedding space, rather than modeling these rare patterns implicitly in the model parameters.
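
The interpolation itself is straightforward, as the sketch below shows. It assumes PyTorch, a brute-force nearest neighbour search over a random toy datastore, and an arbitrary interpolation weight; the actual framework uses a large datastore of context vectors with approximate nearest neighbour retrieval.

```python
# Minimal sketch of the kNN-LM mixture: the language model's next-word
# distribution is interpolated with a distribution induced by the nearest
# neighbours of the current context representation.
import torch
import torch.nn.functional as F

VOCAB, D, K, LAMBDA = 5000, 512, 8, 0.25

# Toy datastore of (context vector, next-token id) pairs
keys = torch.randn(10000, D)
values = torch.randint(0, VOCAB, (10000,))

def knn_lm_distribution(context_vec, lm_logits):
    dists = torch.cdist(context_vec.unsqueeze(0), keys).squeeze(0)   # L2 distances
    knn_d, knn_i = dists.topk(K, largest=False)
    # Neighbours vote for their stored next token, weighted by softmax(-distance)
    weights = F.softmax(-knn_d, dim=-1)
    p_knn = torch.zeros(VOCAB).scatter_add_(0, values[knn_i], weights)
    p_lm = F.softmax(lm_logits, dim=-1)
    return LAMBDA * p_knn + (1 - LAMBDA) * p_lm

p_next = knn_lm_distribution(torch.randn(D), torch.randn(VOCAB))
```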

An additional memory component is used to store external simplification rules from a paraphrase database in neural text simplification in combination with the multi-layer and multi-head attention Transformer architecture Zhao et al. (2018b); the additional memory is used to recognize the context and output of each simplification rule. Neural semantic encoders Munkhdalai and Yu (2017) augment neural network models with an evolving memory of the input sequence for natural language understanding tasks including natural language inference, question answering, sentence classification, sentiment analysis and machine translation. Relational memory Santoro et al. (2018) adds interactions between memory units via attention and is designed to enhance reasoning abilities of neural networks across sequential information. An external factual memory component is incorporated into a neural pre-trained language model for question answering Verga et al. (2020). Finally, memory networks are used to generate scientific articles with constraints on entities and human-written paper titles Wang et al. (2019b).

3.7 Reinforcement Learning (RL) Architectures

3.7.1 RL Models for Generic / Free-Text Generation

Reinforcement learning is used in the context of natural language generation to directly optimize non-differentiable reward functions and evaluation metrics. To this end, policy gradient methods such as REINFORCE Williams (1992) are used to alleviate current issues in training generative models for text generation, namely exposure bias and loss functions which do not operate at the sequence level. In the RL framework the generative model is seen as an agent whose parameters define a policy and which interacts with an external environment by taking actions, receives a reward once it reaches the end of a sequence, and updates its internal state accordingly. While any user-defined reward function can be employed for training, the most frequently optimized metrics with RL are BLEU for machine translation and image captioning Ranzato et al. (2015), Wu et al. (2016a), and ROUGE for text summarization Ranzato et al. (2015), Paulus et al. (2017). Nevertheless, policy gradient algorithms present large variance and generally struggle in settings with large action spaces such as natural language generation. In addition, improvements in the optimized metrics are not always reflected in human evaluations Wu et al. (2016a). In the context of machine translation in particular, reinforcement learning methods do not optimize the expected reward and take a very long time to converge Choshen et al. (2019).
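
The sketch below shows the sequence-level policy gradient loss in this setting, with a sampled hypothesis scored by a non-differentiable metric and, as one common variance-reduction choice, the reward of a greedy decode subtracted as a baseline. It assumes PyTorch; the metric function, the dummy log-probabilities and all names are illustrative placeholders rather than a specific system from the cited papers.

```python
# Minimal sketch of a sequence-level policy gradient loss for a
# non-differentiable reward such as BLEU or ROUGE.
import torch

def sequence_metric(hypothesis, reference):
    # Placeholder for a sentence-level BLEU / ROUGE implementation
    return torch.rand(())

def policy_gradient_loss(sample_log_probs, sampled_seq, greedy_seq, reference):
    # sample_log_probs: (T,) log-probabilities of the sampled tokens,
    # which in training would come from the decoder with gradients attached
    reward = sequence_metric(sampled_seq, reference)
    baseline = sequence_metric(greedy_seq, reference)   # greedy-decode baseline
    return -(reward - baseline) * sample_log_probs.sum()

loss = policy_gradient_loss(-torch.rand(12),            # dummy per-token log-probs
                            ["a", "sampled", "sequence"],
                            ["a", "greedy", "sequence"],
                            ["the", "reference"])
```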

3.7.2 RL Models for Conditional Text Generation

Deep reinforcement learning is used to model the future reward in neural conversational systems and reward responses that are informative, coherent and simple Li et al. (2016c). Algorithms such as DQN Zhao and Eskenazi (2016), Li et al. (2017b), Peng et al. (2018b), Cuayáhuitl et al. (2016), policy-gradient Liu et al. (2017a) and actor-critic Peng et al. (2018a), Liu and Lane (2017) have been widely used for single-domain or multi-domain dialogue generation for movie-ticket bookings and restaurant search.

Inverse reinforcement learning produces more dense reward signals and generates texts with higher diversity Shi et al. (2018). Inverse reinforcement learning has been applied in paraphrase generation Li et al. (2018) and in open-domain dialogue systems Li et al. (2019d) for modeling the reward function. Reward functions are learnt from human preferences and further optimized with reinforcement learning for tuning language models Ziegler et al. (2019) and text summarization Böhm et al. (2019). Conditional RNNs are trained via REINFORCE to directly optimize for test time evaluation metrics such as BLEU for machine translation and image captioning, and ROUGE for text summarization tasks Ranzato et al. (2015).

Sequence-to-sequence models can be further improved with reinforcement learning training to alleviate exposure bias and improve generalization Keneshloo et al. (2019). Neural machine translation is framed as a stochastic reinforcement learning policy with translation adequacy rewards Kong et al. (2019b). Generating polite dialogue responses is encouraged by rewarding polite utterances with a positive reward and discouraging rude ones with a negative reward Niu and Bansal (2018). Internal rewards such as ease of answering, semantic coherence and emotional intelligence, as well as external rewards based on human feedback, are incorporated through reinforcement learning in an encoder-decoder framework for dialogue response generation focused on movie and restaurant reviews Srinivasan et al. (2019). Dialogue responses are generated by conditioning on discrete attributes such as sentiment, emotion, speaker id, speaker personality and user features when framing dialogue attribute selection as a reinforcement learning problem Sankar and Ravi (2019). Abstractive headline generation is performed in a reinforcement learning setting which maximizes a sensationalism score as the reward for the reinforcement learner Xu et al. (2019).

Hierarchical models decompose the learning problem into a sequence of sub-problems and are a natural fit for language given its hierarchical structure. These models first decompose natural language into a sequence of utterances, and then decompose each utterance into a sequence of words. Hierarchical reinforcement learning is employed in open-domain dialogue generation to optimize for human-centered metrics of conversation quality and prevent the generation of inappropriate, biased or offensive language Saleh et al. (2019). Hierarchical sequence-to-sequence dialogue models are employed to learn reward functions from human interactions Jaques et al. (2019). Task-oriented dialogue systems Peng et al. (2017), Budzianowski et al. (2017), Tang et al. (2018), Zhang et al. (2018a) learn hierarchical dialogue policies by dividing a complex goal-oriented task into a set of simpler subgoals with distinct reward functions; these methods have been applied to various dialogue tasks such as travel planning or task-oriented visual dialogue.

3.7.3 RL Models for Constrained Text Generation

Non-monotonic constrained text generation is framed as part of an imitation learning framework (learning a generation policy that mimics the actions of an oracle generation policy) in which a token is first generated in an arbitrary position in the sentence, and the model recursively generates a binary tree of words to its left and right Welleck et al. (2019a).

Extractive and abstractive sentences are mixed in a hierarchical reinforcement learning framework for text summarization in which a copy-or-rewrite mechanism allows switching between copying a sentence and rewriting a sentence Xiao et al. . Policy gradient methods that optimize non-differentiable evaluation metrics (e.g., ROUGE) are used for extractive summarization in a contextual bandit setting Dong et al. (2018) or in a sentence ranking setting Narayan et al. (2018), as well as for abstractive summarization in a hierarchical setting Chen and Bansal (2018). Saliency and logical entailment rewards for abstractive summarization are simultaneously optimized by means of reinforce-based policy gradient Pasunuru and Bansal (2018). A hybrid learning objective which combines standard supervised word prediction using maximum likelihood with a reinforcement learning policy gradient is used for abstractive summarization Paulus et al. (2018), Celikyilmaz et al. (2018). To improve the level of abstraction in summary generation, a ROUGE based reward is combined with a novelty metric which counts the fraction of unique n-grams in the summary that are novel in the policy gradient optimization objective Kryściński et al. (2018). A cycled reinforcement learning approach with unpaired data is proposed for the task of sentiment-to-sentiment translation to generate emotionally charged sentences Xu et al. (2018c).

3.8 Transfer Learning for NLG

3.8.1 Transfer Learning Models for Generic / Free-Text Generation

Recent advances in natural language generation rely on pre-training a large generative language model on a large corpus of unsupervised data, followed by fine-tuning the model for specific applications. The goal of pre-training is to provide models with general purpose knowledge that can be leveraged in many downstream tasks Raffel et al. (2019). Indeed, large-scale language models pre-trained on huge unlabeled datasets have shown unparalleled text generation capabilities, substantially outperforming training on supervised datasets from scratch, and have considerably advanced the state-of-the-art on many natural language processing problems. These models leverage the Transformer architecture Vaswani et al. (2017) pre-trained on large amounts of text and optimize for different unsupervised language modeling objectives, showing that transferring many self-attention blocks can often replace task-specific architectures Devlin et al. (2018), Radford et al. (2019). High-capacity language models pre-trained on large datasets can be an alternative to traditional knowledge bases extracted from text Petroni et al. (2019). These models can acquire commonsense reasoning capabilities about previously unseen events Sap et al. (2019), infer relations between entities Jiang et al. (2019), Soares et al. (2019), Rosset et al. (2020), answer factoid questions and commonsense queries Trinh and Le (2018), as well as open-domain questions without access to any external context or knowledge source Roberts et al. (2020). Nevertheless, while these models leverage massive amounts of data and excel at capturing statistical patterns in the datasets, they are sample inefficient and fail to generalize as quickly and robustly as humans Linzen (2020).

3.8.2 Transfer Learning Models for Conditional Text Generation

Early work Ramachandran et al. (2017) shows that pretraining improves the generalization of sequence-to-sequence models. Using unsupervised learning to initialize the weights of both the encoder and decoder with the pretrained weights of language models outperforms purely supervised learning baselines for machine translation and abstractive summarization. Moreover, language modeling pre-training is also helpful for difficult text generation tasks such as chit-chat dialog and dialog based question answering systems Dinan et al. (2018), Wolf et al. (2019).

Representation models for language successfully adopt a masked language modeling approach similar to denoising auto-encoding Vincent et al. (2008), in which the identities of a subset of input tokens are masked and a neural network is trained to recover the original input. BERT Devlin et al. (2018) is a multi-layer bidirectional Transformer Vaswani et al. (2017) encoder used for learning deep contextualized token representations from unlabeled text; the model fuses left and right context to predict the masked words. Nevertheless, the bidirectional nature of BERT does not allow the model to be used as is for text generation purposes Wang and Cho (2019). BERT is used for sequence generation for text summarization as part of a pre-trained encoder-decoder framework which relies on a BERT-based encoder and a Transformer-based decoder Zhang et al. (2019a). A similar BERT-based encoding approach is adopted for both extractive and abstractive text summarization in Liu and Lapata (2019b). In parallel, masked sequence-to-sequence pre-training Song et al. (2019a) proposes a BERT-inspired pre-training objective in an encoder-decoder framework in which the decoder is trained to reconstruct an encoded sentence with randomly masked fragments; the model is applied to generative tasks such as neural machine translation, text summarization and conversational response generation. UniLM Dong et al. (2019) extends BERT for sequence generation by combining unidirectional, bidirectional and sequence-to-sequence unsupervised language modeling objectives. Building upon the success of natural language models, a wide range of models are proposed for jointly modeling vision and language tasks, including VisualBERT Li et al. (2019b), ViLBERT Lu et al. (2019) and VideoBERT Sun et al. (2019a) for image and video captioning, visual question answering and visual commonsense reasoning.
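
The masked language modeling objective itself is easy to sketch: a fraction of tokens is replaced with a mask id and only those positions contribute to the loss. The snippet below is a simplified illustration assuming PyTorch; the token ids, masking rate and placeholder logits are arbitrary, and it omits BERT's additional refinement of sometimes keeping or randomly replacing the selected tokens.

```python
# Minimal sketch of masked language modeling: mask a fraction of tokens and
# compute the loss only at the masked positions.
import torch
import torch.nn.functional as F

VOCAB, MASK_ID, PAD_ID, MASK_RATE = 30000, 103, 0, 0.15

def mask_tokens(input_ids):
    labels = input_ids.clone()
    mask = (torch.rand_like(input_ids, dtype=torch.float) < MASK_RATE) & (input_ids != PAD_ID)
    masked = input_ids.clone()
    masked[mask] = MASK_ID
    labels[~mask] = -100            # unmasked positions are ignored by the loss
    return masked, labels

ids = torch.randint(1000, 2000, (4, 32))          # dummy batch of token ids
masked_ids, labels = mask_tokens(ids)
logits = torch.randn(4, 32, VOCAB)                # stand-in for encoder output
loss = F.cross_entropy(logits.view(-1, VOCAB), labels.view(-1), ignore_index=-100)
```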

OpenAI-GPT Radford et al. (2018), Radford et al. (2019), Brown et al. (2020) autoregressive models learn universal representations from massive unlabeled datasets useful for a wide range of language tasks such as text summarization, machine translation, question answering and reading comprehension. These models build upon the left-to-right Transformer Vaswani et al. (2017) to predict a text sequence word-by-word initially in a semi-supervised fashion Radford et al. (2018), by combining unsupervised generative pre-training on a large unlabeled text corpus with supervised discriminative fine-tuning for quick adaptation to a particular task. Later extensions Radford et al. (2019), Brown et al. (2020) are completely unsupervised and demonstrate the ability to adapt to few-shot and zero-shot settings even without fine-tuning in a multitude of text generation scenarios, including machine translation, text summarization, question answering and news story generation.

Many other extensions demonstrate that Transformers can be used for generative tasks. Transformer Memory Networks Dinan et al. (2018) combine the Transformer architecture with memory networks Sukhbaatar et al. (2015) in the context of dialogue agents that store encyclopedic knowledge in large memory systems and carry engaging open-domain conversations. Transformer-XL Dai et al. (2019) captures longer term dependencies by adding recurrence into the deep self-attention network. A BERT-initialized Transformer model is proposed for text simplification Jiang et al. (2020). BART Lewis et al. (2019) relies on the Transformer architecture to train a sequence-to-sequence denoising autoencoder for tasks such as abstractive dialogue generation, question answering, machine translation and text summarization. Turing-NLG Microsoft (2020) is a Transformer based generative language model useful in text summarization and question answering. More efficient versions which reduce the memory and time requirements for long-term structure generation are proposed by Sparse Transformers Child et al. (2019), Reformer Kitaev et al. (2019), Universal Transformers Dehghani et al. (2018), Compressed Transformer Rae et al. (2019), Evolved Transformer So et al. (2019), Megatron-LM Shoeybi et al. (2019), Big Bird Zaheer et al. (2020), Rae and Razavi (2020). Wikipedia articles are generated using a decoder-only sequence transduction model by conditioning on the article title Liu et al. (2018b). Furthermore, Transformer language models are found to outperform sequence-to-sequence models for neural document summarization Subramanian et al. (2019). A conditional Transformer is used to control for attributes of the generated text such as style and content Keskar et al. (2019). Similarly, a pre-trained Transformer is combined with attribute classifiers to control attributes of the generated language Dathathri et al. (2019). Lexical and syntactic constraints are added to the Transformer architecture to control the type and level of text simplification Mallinson and Lapata (2019). A hierarchical Transformer encoder is used for multi-document summarization Liu and Lapata (2019a). MARGE Lewis et al. (2020a) proposes a self-supervised alternative to the masked language modeling objective by reconstructing the target text conditioned on retrieved related documents, and is used for machine translation, text summarization, question answering and paraphrasing. Any natural language processing task can be formulated as a “text-to-text” generation problem, feeding text as input and producing new text as output, in T5 Raffel et al. (2019). In addition, adversarial pre-training on top of Transformer-based language models by applying perturbations in the embedding space improves robustness and generalization Liu et al. (2020), Wang et al. (2019a).

Flexible sequence generation in arbitrary orders with dynamic length changes and refinement through insertion and deletion operations is introduced in the Levenshtein Transformer Gu et al. (2019b). Similarly, generating sequences in the absence of a predefined generation order and through iterative refinement in multiple passes is proposed in Emelianenko et al. (2019), Ford et al. (2018). Furthermore, neural network-based pre-trained language models can act like universal and general-purpose decoders for generative tasks Raffel et al. (2019) and can be steered to recover arbitrary sentences Subramani et al. (2019). The main components of the Transformer’s attention and the evolution of the representations learnt across layers are analyzed in Tsai et al. (2019), Voita et al. (2019), Kaplan et al. (2020), Talmor et al. (2019), Yogatama et al. (2019).

3.8.3 Transfer Learning Models for Constrained Text Generation

Recent progress in Transformer-based language models has led to generative models that learn powerful distributions and produce high quality samples. While these large scale language models display promising text generation capabilities, it is desirable to allow the user to control different aspects of the generated text and include user-defined key phrases in the generated output.

Soft-constrained text generation by integrating external knowledge into a neural conversational model is achieved by encoding the conversation history and relevant external text excerpts, and passing them both to a Transformer-based response generator Qin et al. (2019b). Counterfactual story generation makes minimal revisions to an existing story constrained on a given intervening counterfactual event. The OpenAI-GPT2 Radford et al. (2019) pre-trained model is used to re-write a story through counterfactual reasoning and make the narrative consistent with the imposed constraints Qin et al. (2019a). OpenAI-GPT2 Radford et al. (2019) is also used for abstractive summarization in a reinforcement learning setting which trains the summarization agent to maximize coverage and fluency constrained on a given length Laban et al. (2020). Hard-constrained text generation under specified lexical constraints is performed by using a masked language modeling objective Devlin et al. (2018) which recursively inserts new tokens between existing ones until a sentence is completed Zhang et al. (2020). Sentence generation is carried out in a hierarchical fashion, by first generating high-level words (nouns, verbs, adjectives), using them as pivoting points for iteratively inserting finer granularity details, and finally adding the least informative words (pronouns and prepositions).

3.9 Discussion - Neural NLG

Advances in the field of deep learning have reignited the hopes of having machine models capable of generating realistic and coherent natural language. The field of natural language generation has undergone major changes in recent years, and is currently witnessing impressive developments and an increased surge in interest. The availability of large and diverse datasets, combined with powerful neural models and compute-intensive infrastructure, has led to high-capacity neural models achieving widespread success in a multitude of language generation tasks, from machine translation and text summarization to dialogue generation and creative applications such as story and poetry generation.

In this section we have presented the latest developments in natural language generation and introduced the models employed for generating texts that fulfill various user goals in a multitude of problem scenarios. Significant performance gains are reported by using larger and larger models, trained on datasets larger than ever before. Consequently, it is not clear where the ceiling is when combining pre-training with fine-tuning approaches Radford et al. (2019). Nevertheless, conducting comparisons between the various neural approaches to natural language generation is challenging, since it is not always possible to reproduce the reported results, the datasets used are not always publicly available, and many key hyperparameters and the training data size have a significant impact on the final performance Liu et al. (2019b).

Despite the reported recent success, natural language generation remains a difficult problem to model and there is still a large gap to achieving human performance Turing (1950), Linzen (2020). Generating long and coherent pieces of text that capture long-term dependencies in the data is particularly challenging. Longer generated texts are frequently incoherent, contain grammatical errors, lack diversity, and include redundant, short and safe phrases as well as contradictory arguments. Unsurprisingly, powerful neural models tend to memorize the training data and often fail to generalize or to demonstrate that they learn meaningful representations that capture more than just shallow patterns in the data. Moreover, these systems are brittle and sensitive to slight changes in the data distribution and task specification Radford et al. (2019). The lack of generalization is also directly tied to their inability to perform natural language understanding and inference, both important hallmarks of intelligence. Natural language understanding requires mastery of linguistic structure and the ability to ground it in the world, and meaning cannot be learnt solely by relying on huge training datasets Bender and Koller . In addition, robust evaluation metrics which can accurately measure the “goodness” of the generated language are imperative for quantifying research progress, comparing natural language generation models and pushing forward the state-of-the-art.

While current generative models display promising free-form text generation abilities with rather little conditioning beyond the input context on the generated output, it is desirable to produce output conditioned or constrained on particular text attributes for the generation of meaningful texts in specific contexts. To this end, modeling and manipulating the stylistic properties of the generated text also reflects how humans communicate with a specific intent or goal in mind. Conditional and constrained text generation are important research directions for better human-AI interaction which allow users to control the content and style of the generated text. Furthermore, approaches that incorporate external information in the generation process enhance language representations with structured knowledge facts for more general and effective language understanding Sun et al. (2019b), Lewis et al. (2020b), Rosset et al. (2020).

The current trend is to train bigger models on ever larger datasets, however large models are not necessarily more robust to adversarial examples Jia and Liang (2017) and their behaviour is unpredictable and inconsistent when the test set distribution differs from the training data distribution McCoy et al. (2019). Instead, training these models on diverse datasets has the potential to improve out-of-distribution robustness Hendrycks et al. (2020). To this end, careful consideration is required when selecting the training data to ensure fair and unbiased language generation, and responsible research and innovation Brundage (2016). For example, relying on uncurated movie scripts or dialog datasets collected online for training models often leads to malicious, aggressive, biased or offensive responses Blodgett et al. (2020). Abstractive summarization models tend to generate untruthful information and fake facts when fusing parts of the source document Cao et al. (2018). Societal biases such as race, gender and age are often encoded in the word embeddings used by the generative models Romanov et al. (2019). To prevent such problems, an increased focus on the fairness, accountability, and transparency of generative systems is imperative.

Neural generative models, especially large-scale pre-trained models, encode commonsense knowledge and factual and relational information in their latent parameters Petroni et al. (2019), Roberts et al. (2020). However, inspecting and interpreting this information is not straightforward Lei (2017). Adding interpretability to neural models can increase users’ acceptance of the models and trust in their ability to make informed decisions Reiter (2019). To this end, natural language generation can help with providing human-interpretable explanations for neural generative or discriminative models Forrest et al. (2018).

Finally, it is important to consider text generation for low resource languages or tasks for which large datasets are not readily available Tilk and Alumäe (2017). There is a huge gap between neural generative models’ ability to generalize quickly and robustly in low-resource settings and humans’ ability to learn language from limited exposure to data Linzen (2020). Performing text generation in few-shot or zero-shot settings is an important step towards having general systems which can perform many tasks Radford et al. (2019), eventually without the need to manually create and annotate a training dataset for each particular task Brown et al. (2020). Having competent general-purpose systems which can perform many tasks and which can easily generalize to new domains, instead of relying on specialized narrow expert systems, is a longstanding dream of artificial intelligence. Recent progress shows that designing task-specific architectures can be replaced with large models able to simultaneously accomplish a multitude of natural language generation tasks in diverse domains Gururangan et al. (2020). Moreover, the OpenAI GPT2 Radford et al. (2019) model shows that architectures designed specifically for text generation can be straightforwardly used for image generation too Chen et al. (2020).

We hope to see progress in the future on the research directions outlined for developing robust and fair natural language generation systems.

4 Evaluation Methods

While many natural language generation models have been proposed in the literature, a critical question is what objective metrics to use for their evaluation and for meaningful comparison with other models. Choosing the appropriate model is important for obtaining good performance in a specific application, nevertheless the choice of the evaluation metric is equally important for measuring progress and drawing the right conclusions. While we are witnessing considerable progress in the field of natural language generation, evaluation of the generated text remains largely an unsolved problem. Currently, there is no consensus on how NLG systems should be evaluated van der Lee et al. (2019), Gkatzia and Mahamood (2015), and the lack of meaningful quantitative evaluation methods to accurately assess the quality of trained models is detrimental to the progress of the field. In the absence of well-established evaluation measures, natural language evaluations are carried out in a rather ad-hoc manner with a lot of variability across the proposed models and tasks, resulting in misleading performance measures. Subjective evaluations based on visual inspection of the generated samples are often carried out, making it difficult to quantify and judge precisely the quality of a generative model Hashimoto et al. (2019). In addition, the evaluation of generative models is a notoriously difficult problem Borji (2019).

The two main approaches to performance evaluation are based on either intrinsic or extrinsic criteria. While intrinsic criteria relate to a system’s objective, extrinsic criteria focus on its function and role in relation to the purpose it was designed for Galliers and Jones (1993). In what follows we summarize these approaches to natural language evaluation, starting with intrinsic evaluation in Section 4.1, continuing with extrinsic evaluation in Section 4.2, and finally summarizing the main takeaways in Section 4.3.

4.1 Intrinsic Evaluation

Intrinsic measures of evaluation assess properties of the system or system components in terms of the output produced, and can be further categorized into user-like measures (human-based, subjective assessments of quality) or output quality measures (corpora-based, carried out automatically) Belz and Hastie (2014). We provide a detailed overview of intrinsic evaluation metrics below.

4.1.1 Human Evaluation

Human evaluation of generative models is a straightforward surrogate of the Turing test in which human judges are asked to assess whether machine-generated samples can be distinguished from real data. Human evaluations measure either holistic properties of the generated text, such as overall quality, or are conducted at a finer-grained level to measure particular attributes such as fluency, relevance Dathathri et al. (2019), adequacy, correctness, informativeness, naturalness, meaning preservation, simplicity, grammaticality, and degree of realism Novikova et al. (2017). Human evaluations are commonly regarded as the gold standard for generative models, however there is a high degree of variation in the way they are conducted van der Lee et al. (2019). Furthermore, these evaluations are expensive to carry out, and it is impossible to thoroughly assess a generative model across the entire quality-diversity spectrum through human evaluations alone. Typically only a few samples generated by the model are presented to human raters, which allows measuring precision and sample quality, but not recall and diversity. In addition, human raters cannot identify models which simply plagiarize the training set, so human evaluations may yield unrealistically optimistic scores Semeniuta et al. (2018). Human crowdsourcing evaluations are proposed to assess generative realism in HYPE Zhou et al. (2019). Best practices for carrying out human evaluations are summarized in van der Lee et al. (2019). With the latest advances in natural language generation, it is frequently reported in the literature that human evaluators have difficulty identifying machine-generated sentences in the domain of short stories Donahue et al. (2020) or online product reviews Garbacea et al. (2019).

4.1.2 Automated Evaluation Metrics

Automatic evaluation is a quicker and cheaper alternative to human evaluation. Nevertheless, the use of automated evaluation metrics is dependent upon their correlation with human judgements of quality. There is a wide variety of factors that influence the correlation of automatic evaluation metrics with human judgements Fomicheva and Specia (2019), including the domain, the type of human evaluation employed and its reliability, the type of machine learning system assessed, the language pair (for machine translation), or the correlation metric used, which can be unstable and highly sensitive to outliers Mathur et al. (2020). From a machine learning perspective, automated evaluation can be divided into learnable and non-learnable evaluation metrics. While non-learnable evaluation metrics rely on heuristics or manually defined equations to measure the quality of the generated sentences, learnable metrics train machine learning models to imitate human judgements.

N-gram based metrics

Metrics measuring word overlaps were originally developed in the machine translation community to estimate surface similarity between the translated texts and a set of ground-truth human-written references in the target language, and are currently adopted for the evaluation of the generated text in a multitude of tasks. These metrics assume the existence of a human-written set of references which is often not available Xu et al. (2016), and make strong assumptions regarding its correctness and completeness Novikova et al. (2017). BLEU Papineni et al. (2002), SentBLEU Lin and Och (2004b), ΔBLEU Galley et al. (2015), NIST Doddington (2002), ROUGE Lin and Och (2004a), METEOR Banerjee and Lavie (2005), Lavie and Denkowski (2009), Denkowski and Lavie (2014), Guo and Hu (2019), SERA Cohan and Goharian (2016), LEPOR Han et al. (2012), CIDEr Vedantam et al. (2015), SPICE Anderson et al. (2016), SPIDER Liu et al. (2017b), SARI Xu et al. (2016), RIBES Isozaki et al. (2010), MPEDA Zhang et al. (2016b) are commonly used to assess sample quality based on comparisons with a set of human-written references in many natural language generation tasks such as dialogue systems Song et al. (2016), Tian et al. (2017), machine translation, text summarization, text simplification, and image captioning. Metrics measuring n-gram overlap at the character level are also proposed in CHRF Popović (2015), CHRF++ Popović (2017). As demonstrated by numerous studies, n-gram matching is not adequate for the evaluation of unsupervised language generation models as it fails to capture semantic variation. Indeed, BLEU scores are insufficient in evaluating text generative systems Reiter (2020), lack interpretability, do not detect deterioration in sample quality van der Lee et al. (2019) and overall are not representative of the quality of a model Semeniuta et al. (2018). Multiple studies also show that n-gram based metrics correlate poorly with human judgements at the instance-level Novikova et al. (2017), Stent et al. (2005), Specia et al. (2010), Wu et al. (2016a), Liu et al. (2016), fail to account for semantic similarity Chaganty et al. (2018), do not capture diversity Liu et al. (2016), cannot distinguish between outputs of medium and good quality Novikova et al. (2017), do not reflect genuine quality improvements in the model output Mathur et al. (2020) or nuanced quality distinctions Fomicheva and Specia (2019), and generally are not a good way to perform evaluation even when good quality references are available as false conclusions can be drawn Štajner et al. (2015). For task-specific applications, it is reported that word-overlap metrics are more effective for question answering and machine translation, while for dialogue generation and text summarization they present little to no correlation with human judgements Liu et al. (2016), Novikova et al. (2017), Kryscinski et al. (2019). Interestingly, in machine translation these metrics do better on average at evaluating high quality samples as opposed to low quality samples Fomicheva and Specia (2019), given that it is difficult to draw meaningful conclusions regarding output quality when there are very few candidate-reference matches. In addition, BLEU is not suitable for the evaluation of text simplification Sulem et al. (2018a) and document generation Wiseman et al. (2017), and cannot judge the rhythm, meter, creativity, syntactic and semantic coherence in poetry generation Ghazvininejad et al. (2017).
Moreover, optimizing discrete metrics such as ROUGE in a reinforcement learning setting does not necessarily guarantee an increase in quality, readability and relevance of the generated output Liu et al. (2016), Paulus et al. (2018). Furthermore, BLEU, ROUGE and METEOR are inversely correlated with diversity Sultan et al. (2020).
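To make the word-overlap idea concrete, below is a minimal pure-Python sketch of corpus-level BLEU with modified (clipped) n-gram precision and a brevity penalty; it assumes a single tokenized reference per hypothesis and uniform n-gram weights, and the function names are illustrative rather than taken from any cited toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty."""
    hyp_len = ref_len = 0
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # total hypothesis n-grams per order
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = Counter(ngrams(hyp, n))
            ref_counts = Counter(ngrams(ref, n))
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A sentence-level variant in the spirit of SentBLEU applies the same computation to a single sentence pair, typically with smoothing of the higher-order precisions.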

Nevertheless, there are also studies which report high correlations of these metrics with human judgements Sulem et al. (2018b), Snover et al. (2006), Anderson et al. (2016), both at the system level Reiter and Belz (2009), Specia et al. (2010), Ma et al. (2019a) and at the sentence level Fomicheva and Specia (2019), as well as on worse quality samples Novikova et al. (2017). Automatic evaluation metrics are more reliable at evaluating the output of neural machine translation models and less reliable at evaluating conventional statistical translation models, mainly due to differences in translation errors Fomicheva and Specia (2019). BLEU and METEOR correlate the most with human judgments of grammaticality and meaning preservation, whereas text simplicity is best evaluated by basic length-based metrics Martin et al. (2019a). ROUGE and its variants are found to agree with manual evaluations of text summarization Owczarzak et al. (2012), Rankel et al. (2013). In addition, FKBLEU Xu et al. (2016), iBLEU Sun and Zhou (2012) capture the adequacy and diversity of the generated paraphrase sentence. Metrics assessing inexact matches are also proposed, e.g., TINE Rios et al. (2011).

Estimating the quality of generated text does not require a set of human-written references when it is cast as a prediction task based on features learnt from the training data Specia et al. (2010). Reference-less automatic evaluation is also proposed in SAMSA Sulem et al. (2018b), which uses semantic parsing on the source side to assess simplification quality. Alternatively, evaluation without references is carried out by computing the similarity between the generated output and the source documents in text summarization Louis and Nenkova (2013), Steinberger and Ježek (2012).

Grammar-based metrics

The use of grammar-based evaluation metrics has been studied in machine translation Giménez and Màrquez (2008b), grammatical error correction Napoles et al. (2016), and proposed for the evaluation of generated texts in Novikova et al. (2017). The authors use the number of misspellings and the Stanford parser score as a crude proxy for the grammaticality of a sentence, in combination with standard readability metrics. Compared to word-overlap metrics, grammar-based metrics do not require a corpus of human-written references, however they fail to establish how relevant the output is to the input.

Perplexity

Perplexity Jelinek et al. (1977) is commonly used to evaluate and compare language models, and measures the average number of words the model is uncertain about when making a prediction. Nevertheless, perplexity is a model-dependent metric, and “how likely a sentence is generated by a given model” is not directly comparable across different models. Perplexity-based evaluation metrics have been proposed to measure the fluency and diversity of the generated samples. The Forward Perplexity score Kim et al. (2017) is computed by training a language model on real samples and measuring its perplexity on generated samples, while the Reverse Perplexity score trains the language model on synthetic samples and evaluates it on real samples. The Forward Perplexity score captures the precision of the generative model, however it is biased in cases when the model repetitively generates only a few highly likely sentences that yield high scores. The Reverse Perplexity score is dependent upon the quality of the data sample which serves as a proxy for the true data distribution, and on the capacity of the language model.
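Concretely, perplexity is the exponentiated average negative log-likelihood a model assigns to held-out tokens; a minimal sketch follows (the per-token log-probabilities are assumed to be produced by whatever language model is under evaluation):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of a held-out text."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 1/4 to every token is, on average,
# "hesitating" between 4 equally likely words: perplexity = 4.
print(perplexity([math.log(0.25)] * 100))  # -> 4.0
```

Forward and Reverse Perplexity reuse this same computation, differing only in which corpus (real or generated) the language model is trained on and which corpus it is evaluated on.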

Nevertheless, perplexity is shown to be an inadequate measure of quality Theis et al. (2016), Fedus et al. (2018). Likelihoods do not necessarily correspond well to sample quality: models with high likelihood can generate low-quality samples, and conversely samples of good quality can have low likelihood. Moreover, infinite perplexity can still be obtained from a perfect model even when its ability to generate test sentences is removed Hashimoto et al. (2019). Finally, perplexity cannot detect mode collapse in GANs, and comparing GAN models based on perplexity puts them at a disadvantage relative to other models since they do not optimize for this objective.

Distance-based metrics

Levenshtein distance Levenshtein (1966), also known as word edit distance or word error rate Nießen et al. (2000), quantifies the minimum amount of editing (in terms of insertions, deletions and substitutions) a human would have to perform to convert a hypothesis sentence into its closest reference sentence. TER Snover et al. (2006) normalizes the number of edit operations by the average number of words in the reference, while TER-Plus Snover et al. (2009) relaxes the exact word match assumption by also counting candidate words that share a stem, are synonyms or paraphrases of the reference words. ITER Panja and Naskar (2018) includes stem matching, optimizable edit costs and improved normalization. PER Tillmann et al. (1997) computes position-independent word error rate at the word level. CDER Leusch et al. (2006) combines edit distance with block reorderings. CharacTER Wang et al. (2016b) and EED Stanchev et al. (2019) extend edit distance to the character level. Jensen-Shannon divergence compares the underlying probability distributions of n-grams in system summaries and source documents Lin et al. (2006), Louis and Nenkova (2013).
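A minimal dynamic-programming sketch of word-level edit distance and the derived word error rate is shown below (exact token matches only; the stem/synonym/paraphrase relaxations of TER-Plus and the block shifts of TER and CDER are beyond this sketch):

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between a tokenized hypothesis and
    reference (insertions, deletions and substitutions, each with cost 1)."""
    m, n = len(hyp), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n]

def word_error_rate(hyp, ref):
    """Edit distance normalized by the reference length."""
    return edit_distance(hyp, ref) / max(len(ref), 1)
```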

Inspired by distance-based metrics such as the Inception Score (IS) Salimans et al. (2016) and the Fréchet Inception Distance (FID) Heusel et al. (2017), widely used to measure the similarity between real and generated samples in computer vision, the Fréchet InferSent Distance (FISD) Semeniuta et al. (2018) is the equivalent of FID for text evaluation purposes. The FID metric is designed to capture both the quality and diversity of the generated samples by measuring the distance in the embedding space between distributions of features extracted from real and generated samples; nevertheless, it does not differentiate between the fidelity and diversity aspects of the generated output Naeem et al. (2020). Kernel Inception Distance (KID) Bińkowski et al. (2018) is used to measure convergence in GANs through an unbiased estimator independent of sample size. Word Mover’s Distance (WMD) Kusner et al. (2015) treats documents as bags of embeddings and measures the semantic distance between two texts by computing the amount of flow traveling between embedded words in the two documents after aligning semantically similar words. Cosine similarity in the embedding space is used to measure distances between source and target sentences in neural style transfer and quantify the content preservation rate Fu et al. (2018). RUBER Tao et al. (2018) is used for the evaluation of dialogue systems and measures embedding-space cosine similarity between a generated response and its query in conversational tasks.
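For illustration, a sketch of the Fréchet distance computation underlying FID and FISD is given below, assuming fixed-size sentence embeddings for the real and generated samples have already been extracted elsewhere (e.g., with InferSent in the FISD case); a Gaussian is fitted to each embedding set and the two are compared:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_emb, gen_emb):
    """Frechet distance between Gaussians fitted to two sets of embeddings,
    each given as an (n_samples, dim) array."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```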

Discriminative Evaluation

Learnt discriminative models are analogous to learning the discriminator in GANs Goodfellow et al. (2014). Based on the two-sample tests Lehmann and Romano (2006) in statistics which summarize differences between two samples into a real-valued test statistic, the goal is to estimate whether two samples $S_{p} \sim P^{n}$ and $S_{q} \sim Q^{m}$ are drawn from the same data distribution. If $P = Q$, the test accuracy of a binary classifier trained on data samples drawn from the two distributions would remain near-chance level, while if $P \neq Q$ the classifier reveals distributional differences between $S_{p}$ and $S_{q}$. To this end, a classification model is trained with human-written (real) and machine-generated (fake) data samples and its classification accuracy on the test set is used to estimate the quality of the generated samples Bowman et al. (2015), Kannan and Vinyals (2017), Li et al. (2017a), Hodosh and Hockenmaier (2016), Lopez-Paz and Oquab (2016), Im et al. (2018), Ravuri and Vinyals (2019). Nevertheless, this approach requires (re-)training a classifier whenever a new generative model is considered and might be biased in cases when the real and fake distributions differ in just one dimension, yielding high overall accuracy but nonetheless assigning lower quality to a superior model Sajjadi et al. (2018).
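A minimal sketch of such a classifier two-sample test is given below, here with illustrative TF-IDF features and logistic regression (scikit-learn is assumed available; the concrete features and classifier vary across the cited studies):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def discriminative_score(real_texts, generated_texts):
    """Train a real-vs-generated classifier; held-out accuracy near 0.5
    suggests the generated distribution is hard to tell apart from the real
    one, while accuracy near 1.0 indicates easily detectable samples."""
    texts = list(real_texts) + list(generated_texts)
    labels = [1] * len(real_texts) + [0] * len(generated_texts)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0, stratify=labels)
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(X_train), y_train)
    preds = clf.predict(vectorizer.transform(X_test))
    return accuracy_score(y_test, preds)
```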

Class-conditional GAN architectures are compared by means of evaluation metrics that measure the difference between the learned (generated) and the target (real) distributions. GAN-train and GAN-test Shmelkov et al. (2018) train a classification network on synthetic/real samples generated by a GAN model and evaluate its classification performance on a test set consisting of real-world/generated examples. GAN-train is analogous to recall, while GAN-test is similar to precision. A similar approach is proposed in Ravuri and Vinyals (2019), where the Classification Accuracy Score (CAS) measures the performance of a classifier trained on synthetic data at inferring the class labels of real data samples. The metric helps reveal limitations and deficiencies of the generative model. Classification accuracy is also used to measure transfer strength in neural style transfer Shen et al. (2017), Fu et al. (2018), Zhou et al. (2018). LEIC Cui et al. (2018) is used in image captioning to predict if a caption is human-written or machine-generated. Furthermore, classification models are also built to distinguish human reference translations from machine translations Corston-Oliver et al. (2001), Kulesza and Shieber (2004), Gamon et al. (2005).

Precision, Recall and F1 score

Precision, recall and the F1 score are used to measure the distance of the generated samples to the real data manifold Lucic et al. (2018). When precision is high, the generated samples are close to the data manifold, and when recall is high, the generator outputs samples that cover the manifold well. Metrics that aggregate precision and recall, such as $F_{\beta}$, a generalization of the $F_{1}$ score, are used to quantify the relative importance of precision and recall Sajjadi et al. (2018). Nevertheless, the data manifold of non-synthetic data is unknown and therefore impossible to compute in practice.
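For reference, a minimal sketch of the $F_{\beta}$ aggregation, where a larger $\beta$ weighs recall (coverage/diversity) more heavily and a smaller $\beta$ weighs precision (sample quality) more heavily:

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

print(f_beta(0.8, 0.4, beta=1.0))  # ~0.533 (balanced F1)
print(f_beta(0.8, 0.4, beta=2.0))  # ~0.444 (recall-weighted)
```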

Readability Metrics

Flesch-Kincaid Grade Level Flesch (1948), Kincaid et al. (1975) and Flesch Reading Ease Flesch (1979) are used to account for simplicity and measure the reading difficulty of a piece of text. Both metrics are computed as linear combinations of the number of words per sentence and number of syllables per word with different weighting factors. Nevertheless, even though these metrics are frequently used to measure readability, they should not be used on their own but in combination with metrics able to capture the grammaticality and meaning preservation of the generated output Wubben et al. (2012).
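A minimal sketch of both formulas is shown below, with a crude regex-based vowel-group heuristic standing in for a proper syllable counter (real implementations typically rely on pronunciation dictionaries):

```python
import re

def count_syllables(word):
    """Rough syllable count: number of contiguous vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_metrics(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(len(sentences), 1)   # words per sentence
    spw = syllables / max(len(words), 1)        # syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level
```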

Diversity Metrics

There are many tasks in which it is desirable to generate a set of diverse outputs, such as in story generation to provide multiple continuations for a story prompt Clark et al. (2018b), in image captioning to capture different perspectives about an image Krause et al. (2017), in text reranking algorithms to select the best candidate responses and improve user personalization in open-ended dialogue generation and machine translation Li et al. (2015), and in question generation to produce more accurate answers Sultan et al. (2020). In the literature, diversity of the generated text is regarded from multiple perspectives: on the one hand, diversity is a measure of how different generated sentences are from each other in terms of word choice, topic and meaning Vijayakumar et al. (2016), Gimpel et al. (2013), Ippolito et al. (2018); on the other hand, it accounts for the level of sentence interestingness or unlikeliness Hashimoto et al. (2019).

Perplexity on a reference set, n-gram diversity Li et al. (2016a) and Self-BLEU Zhu et al. (2018) are commonly used measures of the diversity of the generated samples. In addition, Backward-BLEU Shi et al. (2018) evaluates test data using the generated samples as the reference; the higher the score, the more diverse the generator output. Lexical diversity Bache et al. (2013) calculates the ratio of unique tokens to the total number of generated tokens. Similarly, Distinct-k or Dist-k Li et al. (2016a) measures the total number of unique k-grams normalized by the total number of generated k-gram tokens to avoid favoring long sentences. Nevertheless, the Dist-k metric ignores the fact that infrequent k-grams contribute more to diversity than frequent ones and assigns the same weight to all k-grams that appear at least once. Entropy-based metrics such as Ent-k Zhang et al. (2018c) are proposed to reflect the frequency differences of k-grams and to analyze the information content of the generated responses in dialogue systems Serban et al. (2017), Mou et al. (2016).
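A minimal sketch of Dist-k and of an entropy-based alternative in the spirit of Ent-k, computed over tokenized generated sentences, follows:

```python
import math
from collections import Counter

def _kgrams(tokens, k):
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def distinct_k(sentences, k=2):
    """Dist-k: number of unique k-grams divided by total generated k-grams."""
    counts = Counter(g for tokens in sentences for g in _kgrams(tokens, k))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

def entropy_k(sentences, k=2):
    """Entropy of the k-gram distribution, which (unlike Dist-k) gives
    more weight to infrequent k-grams than to frequent ones."""
    counts = Counter(g for tokens in sentences for g in _kgrams(tokens, k))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```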

Learnt Evaluation Metrics based on Continuous Representations

Unlike traditional evaluation metrics based on heuristics, learnable metrics train machine learning models on human annotated datasets to learn a scoring function that reproduces human judgements. Traditional machine learning models can incorporate human-specified attributes and handcrafted features, while neural network based approaches work in an end-to-end fashion. In what follows we provide an overview of machine learning based evaluation metrics.

  • Fully-learnt metrics leverage existing datasets of human ratings to learn automated evaluation metrics that fit the human data distribution. In addition, these metrics can be tuned to measure specific properties of the generated texts, such as fluency, style, grammaticality, fidelity, etc.

    MTeRater and MTeRater-Plus Parton et al. (2011) learn a ranking model for scoring machine translation candidates. A similar ranking approach to evaluating machine translation outputs is adopted in Avramidis et al. (2011). Machine translation evaluation is approached as a regression task based on linguistic features extracted from the source sentence and its translation Specia et al. (2010). BEER Stanojević and Sima’an (2014) trains a linear regression model by combining sub-word features (character n-grams) with global word order features (skip bigrams). Linear regression based on human judgements is used to learn a model for scoring system summaries in Peyrard et al. (2017). RUSE Shimanaka et al. (2018) combines three universal sentence embeddings in a multi-layer perceptron regressor model. ESIM Chen et al. (2017c), Mathur et al. (2019) feeds the encoded representations of the candidate and the reference sentence into a feedforward regressor. BLEURT Sellam et al. (2020) does quality evaluation by incorporating lexical and semantic pre-training signals and fine-tuning BERT Devlin et al. (2018) on human ratings datasets for similarity score prediction. MAUDE Sinha et al. (2020) is proposed for the evaluation of online dialogue conversations and works by leveraging sentence representations from the BERT pre-trained language model to train text encoders which can distinguish between valid dialogue responses and generated negative examples.

    Models trained on human judgements are used to predict human scores for dialogue responses. ROSE Conroy and Dang (2008) is a linear combination of ROUGE Lin and Och (2004a) based metrics designed to maximize correlation with human responsiveness. A voting-based regression model is proposed to score summaries in Hirao et al. (2007). Regression-based models are also used as sentence-level metrics of machine translation quality Quirk (2004), Albrecht and Hwa (2007b), Albrecht and Hwa (2007a), Giménez and Màrquez (2008a), Specia et al. (2009). ADEM Lowe et al. (2017) learns to mimic human judgements in dialogue systems by training a hierarchical RNN encoder to capture the similarity between the dialogue context, the generated model response and human-written reference responses. PARADISE Walker et al. (1997) is one of the first learnt evaluation metrics for the evaluation of task-based dialogue systems.

  • Hybrid metrics combine learnt elements with human-defined logical rules, for example, contextual embeddings with token alignment rules. These metrics are robust to training/testing data distribution drifts and can work even when limited training data is available. ROUGE Lin and Och (2004a) is enhanced with word embeddings in ROUGE-WE Ng and Abrecht (2015) to capture semantic similarities between words beyond surface lexicographic matches. Human judgements are elicited to extract sets of words with similar meanings for summary evaluation with the Pyramid scoring scheme Harnly et al. (2005), and later extended to fully automated evaluation Yang et al. (2016a). YiSi Lo (2019) and MEANT Lo and Wu (2011) measure translation quality by matching semantic frames. BERTscore Zhang et al. (2019b) evaluates generated text against gold standard references using soft-string similarity matches (i.e. cosine similarity) computed on pre-trained contextualized BERT Devlin et al. (2018) token embeddings (a minimal sketch of this soft matching is given right after this list). MoverScore Zhao et al. (2019) combines contextualized representations of system and reference texts with semantic measures of distance computed using Word Mover’s Distance Kusner et al. (2015). Furthermore, Word Mover’s Distance is extended to evaluate multi-sentence texts in Clark et al. (2019). Transformer-based language models Kané et al. (2019) such as RoBERTa Liu et al. (2019b) are fine-tuned to predict sentence similarity, logical entailment and robustness to grammatical errors for text evaluation purposes. Human and statistical evaluation are combined in HUSE Hashimoto et al. (2019), an evaluation framework which estimates the optimal error rate of predicting whether a piece of text is human-written or machine-generated. Similarly, automatic metrics are combined with human evaluation to infer an unbiased estimator based on control variates which averages differences between human judgments and automatic metrics rather than averaging the human judgments alone Chaganty et al. (2018). However, a limitation of such learned evaluation metrics is that they do not generalize well across different systems Chaganty et al. (2018).
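To make the soft-matching idea behind BERTScore concrete, the following sketch performs greedy cosine matching over pre-computed contextual token embeddings; obtaining the embeddings from BERT, as well as the idf weighting and baseline rescaling of the original metric, are outside the scope of this sketch:

```python
import numpy as np

def greedy_cosine_f1(cand_emb, ref_emb):
    """BERTScore-style soft matching between a candidate and a reference,
    given contextual token embeddings as (n_tokens, dim) arrays.
    Precision: each candidate token matched to its most similar reference
    token; recall: the reverse; F1: their harmonic mean."""
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                  # pairwise cosine similarities
    precision = sim.max(axis=1).mean()  # best match for each candidate token
    recall = sim.max(axis=0).mean()     # best match for each reference token
    return 2 * precision * recall / (precision + recall)
```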

4.2 Extrinsic Evaluation

Extrinsic evaluation measures the effectiveness of the generated texts on downstream natural language processing tasks or directly on end users. Consequently, extrinsic evaluation is considered the most meaningful type of evaluation in NLG and is generally more useful than intrinsic evaluation Reiter and Belz (2009); however, extrinsic evaluations are less frequently carried out in the literature as they are cost and time intensive and require careful design.

Extrinsic evaluation methods can be categorized into system-purpose success and user-type success metrics Belz and Hastie (2014). System-purpose success metrics quantify the performance of the generated texts on downstream tasks such as information retrieval Fujii et al. (2009), information extraction Parton et al. (2009), question answering and reading comprehension Jones et al. (2007). User-type success metrics measure the impact of the system on real users, i.e. the extent to which it helps them achieve the task it was designed for. Extrinsic evaluations are commonly used in evaluating the performance of task-oriented dialogue agents designed to carry short conversations with human users and assist them in accomplishing a particular goal Deriu et al. (2020).

User performance on a specific task is a direct indicator of text quality Young (1999), Mani et al. (1999), Di Eugenio et al. (2002), Carenini and Moore (2006), Hastie et al. (2016). NLG texts are shown to assist humans in decision-making under uncertainty Gkatzia et al. (2016). Nevertheless, task-based evaluations can be expensive and time-consuming to carry out, and the results obtained depend on the goodwill of the participants in the study. Moreover, it is hard to generalize results to new tasks, especially if there is little to no correlation between them. Finally, not every piece of text has a clear function, therefore in some cases a relevant task may not be readily available.

4.3 Discussion - NLG Evaluation

In this section we have introduced a wide diversity of metrics for the evaluation of generated language. As the field of natural language generation is advancing at a fast pace, evaluation becomes critical for measuring progress and conducting fair comparisons between generative models. While many automated evaluation metrics are well established for judging specific natural language tasks, such as BLEU for machine translation, ROUGE and METEOR for text summarization, SARI for text simplification, and CIDEr and SPICE for image captioning, there is no universal metric that fits all natural language generation tasks and captures all desirable properties of language. To this end, it is necessary to rely on multiple metrics that reflect different textual attributes such as grammaticality, fluency, coherence, readability, diversity, etc. when conducting language evaluations. However, small changes in the scores reported by these automatic evaluation metrics are not reliable enough to draw definite conclusions Mathur et al. (2020). Human evaluations remain the gold standard in natural language generation, and automated evaluation metrics can be used as a proxy for human judgements only when there is reasonable correlation with human decisions. Ideally, automated evaluations are carried out simultaneously with human annotation studies, and not as a replacement of human evaluations.

While progress has been made recently on proposing new evaluation metrics to assess the output of natural language generation systems, more robust evaluation procedures are needed Novikova et al. (2017). Moving beyond traditional evaluation metrics that only account for shallow surface form comparisons between the generated texts and gold-standard reference texts, emerging directions in evaluating natural language generation output focus on conducting semantic comparisons to achieve better correlation with human judgments Zhao et al. (2019). Evaluation metrics based on word and sentence-level embeddings trained from large-scale data, which capture semantic variations, show promise for scalable, cheap, fast and reliable automated evaluation Ma et al. (2019a). Robust evaluation metrics should also incorporate context Tian et al. (2017), and account for diversity of content and the presence of rare words, which are found to be more indicative of sentence similarity than common words Zhang et al. (2019b). Evaluating long texts poses special challenges in terms of assessing long-term inter-sentence or inter-paragraph coherence, correctness, fluency, style, semantics, diversity and creativity, and it is desirable to have new metrics tailored to the evaluation of long texts that account for these criteria. In addition, reference-less evaluation of the generated output is an important research direction for tasks such as machine translation, text simplification or dialogue generation when no gold-standard reference data is available Novikova et al. (2017), Shimanaka et al. (2018). The reference-less quality estimation approach relies on neural networks to predict a quality score for the generated output by comparing it to the source meaning representation only, therefore presenting the benefit of fewer resources invested in collecting expensive human-written annotations. Moreover, meaningful extrinsic evaluation metrics that measure the contribution of the generated language to task success in a variety of scenarios represent an important future research direction.

Finally, metrics that evaluate the interpretability of neural network models, explain the decisions made (especially relevant for metrics based on large pre-trained models), and measure the fairness of generated texts are needed to ensure unbiased, responsible and ethical usage of natural language generation technology for societal benefit, while combating any of its potential malicious deployments Ippolito et al. (2020), Kreps et al. (2020). Interpretable explanations can also help determine how much factual knowledge is encoded within the latent parameters of the model, typically inaccessible to inspection and interpretation, and to what extent this information is memorized from the training corpora Petroni et al. (2019), Verga et al. (2020).

In parallel with our work, evaluation methods for text generation are also reviewed in Celikyilmaz et al. (2020), offering a complementary perspective on approaches to natural language evaluation.

5 Conclusion

In the present work we have formally defined the problem of natural language generation in particular contexts and in a variety of natural language processing tasks. In addition, we have presented diverse generative models based on neural networks employed for natural language generation, including recurrent neural networks, sequence-to-sequence models, VAEs, GANs, and memory and transfer learning architectures, for which we summarized the latest advances focused on language generation. Moreover, we have included a comprehensive overview of methods for evaluating the quality of the generated texts. A lot of progress has been made in recent years in both natural language generation and its evaluation. Nevertheless, there are still many open challenges to address, including improving generalization to produce novel outputs beyond just memorizing training set examples, generating long-term coherent and diverse texts conditioned or constrained on particular attributes and stylistic properties, learning from few examples in low-resource settings, ensuring fair, ethical and socially responsible uses of the generated text, and improving the accountability, explainability and transparency of natural language generative systems.

Evaluation of the generated output is crucial for improving the performance of generative models of natural language, nevertheless it largely remains an open challenge. Human evaluations represent the gold standard for assessing the quality of machine-generated texts, and automated evaluation metrics should be used with caution, only when they present reasonable correlation with human judgements, and as a complement to human annotations rather than a replacement. Since no automated metric captures all desirable properties of generated text, ideally multiple automated metrics are used simultaneously to capture fine-grained textual attributes such as fluency, readability, coherence, correctness, diversity, etc. Promising directions for developing new evaluation metrics focus on training neural models to perform reference-less semantic evaluations in the embedding space by comparing the generated output with the source input, as opposed to collecting expensive human-written ground-truth annotations for every task. We also hope to see more focus on task-specific extrinsic evaluation metrics, as well as evaluation metrics which ensure the generated texts are fair, unbiased and do not encode societal stereotypes.

In this survey we have summarized the most recent developments in neural language generation in terms of problem formulation, methods and evaluation. We hope it serves as a useful resource for anyone interested in learning and advancing this fascinating field of research.

Acknowledgments

This work was in part supported by the National Science Foundation under grant number 1633370.

References

  • Ackley et al. (1985) David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. 1985. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169.
  • Albrecht and Hwa (2007a) Joshua Albrecht and Rebecca Hwa. 2007a. A re-examination of machine learning approaches for sentence-level mt evaluation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 880–887.
  • Albrecht and Hwa (2007b) Joshua Albrecht and Rebecca Hwa. 2007b. Regression for sentence-level mt evaluation with pseudo references. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 296–303.
  • Alonso et al. (2019) Eloi Alonso, Bastien Moysset, and Ronaldo Messina. 2019. Adversarial generation of handwritten text images conditioned on sequences. arXiv preprint arXiv:1903.00277.
  • Amplayo (2019) Reinald Kim Amplayo. 2019. Rethinking attribute representation and injection for sentiment classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5606–5617.
  • Anderson et al. (2016) Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.
  • Anderson et al. (2017) Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2017. Guided open vocabulary image captioning with constrained beam search. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936–945.
  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
  • Arjovsky and Bottou (2017) Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.
  • Asghar et al. (2018) Nabiha Asghar, Pascal Poupart, Jesse Hoey, Xin Jiang, and Lili Mou. 2018. Affective neural response generation. In European Conference on Information Retrieval, pages 154–166. Springer.
  • Avramidis et al. (2011) Eleftherios Avramidis, Maja Popović, David Vilar, and Aljoscha Burchardt. 2011. Evaluate with confidence estimation: Machine ranking of translation outputs using grammatical features. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 65–70.
  • Bache et al. (2013) Kevin Bache, David Newman, and Padhraic Smyth. 2013. Text-based measures of document diversity. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 23–31.
  • Bachman (2016) Philip Bachman. 2016. An architecture for deep, hierarchical generative models. In Advances in Neural Information Processing Systems, pages 4826–4834.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Balasubramanian et al. (2020) Vikash Balasubramanian, Ivan Kobyzev, Hareesh Bahuleyan, Ilya Shapiro, and Olga Vechtomova. 2020. Polarized-vae: Proximity based disentangled representation learning for text generation. arXiv preprint arXiv:2004.10809.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
  • Bao et al. (2019) Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xinyu Dai, and Jiajun Chen. 2019. Generating sentences from disentangled syntactic and semantic spaces. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6008–6019.
  • Bauer et al. (2018) Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4220–4230.
  • Belinkov et al. (2017) Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872.
  • Belz and Hastie (2014) Anja Belz and Helen Hastie. 2014. Towards comparative evaluation and shared tasks for nlg in interactive systems. In Natural Language Generation in Interactive Systems, pages 302–350. Cambridge University Press.
  • Bender and Koller (2020) Emily M Bender and Alexander Koller. 2020. Climbing towards nlu: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.
  • Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155.
  • Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  • Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM.
  • Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166.
  • Berglund et al. (2015) Mathias Berglund, Tapani Raiko, Mikko Honkala, Leo Kärkkäinen, Akos Vetek, and Juha T Karhunen. 2015. Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pages 856–864.
  • Bhandwaldar and Zadrozny (2018) Abhishek Bhandwaldar and Wlodek Zadrozny. 2018. Uncc qa: biomedical question answering system. In Proceedings of the 6th BioASQ Workshop A challenge on large-scale biomedical semantic indexing and question answering, pages 66–71.
  • Bingel (2018) Joachim Bingel. 2018. Personalized and Adaptive Text Simplification. Ph.D. thesis, Department of Computer Science, Faculty of Science, University of Copenhagen.
  • Bińkowski et al. (2018) Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. 2018. Demystifying mmd gans. arXiv preprint arXiv:1801.01401.
  • Blodgett et al. (2020) Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in nlp. arXiv preprint arXiv:2005.14050.
  • Böhm et al. (2019) Florian Böhm, Yang Gao, Christian M Meyer, Ori Shapira, Ido Dagan, and Iryna Gurevych. 2019. Better rewards yield better summaries: Learning to summarise without references. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3101–3111.
  • Borji (2019) Ali Borji. 2019. Pros and cons of gan evaluation measures. Computer Vision and Image Understanding, 179:41–65.
  • Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
  • Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  • Brundage (2016) Miles Brundage. 2016. Artificial intelligence and responsible innovation. In Fundamental Issues of Artificial Intelligence, pages 543–554. Springer.
  • Budzianowski et al. (2017) Paweł Budzianowski, Stefan Ultes, Pei-Hao Su, Nikola Mrkšić, Tsung-Hsien Wen, Iñigo Casanueva, Lina M Rojas Barahona, and Milica Gasic. 2017. Sub-domain modelling for dialogue management with hierarchical reinforcement learning. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 86–92.
  • Caccia et al. (2018) Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language gans falling short. arXiv preprint arXiv:1811.02549.
  • Cao et al. (2018) Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Carenini and Moore (2006) Giuseppe Carenini and Johanna D Moore. 2006. Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11):925–952.
  • Celikyilmaz et al. (2018) Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675.
  • Celikyilmaz et al. (2020) Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799.
  • Chaganty et al. (2018) Arun Tejasvi Chaganty, Stephen Mussman, and Percy Liang. 2018. The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202.
  • Chan et al. (2019) William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. 2019. Kermit: Generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604.
  • Che et al. (2017) Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. 2017. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983.
  • Chen et al. (2019a) Gang Chen, Yang Liu, Huanbo Luan, Meng Zhang, Qun Liu, and Maosong Sun. 2019a. Learning to predict explainable plots for neural story generation. arXiv preprint arXiv:1912.02395.
  • Chen et al. (2017a) Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017a. A survey on dialogue systems: Recent advances and new frontiers. Acm Sigkdd Explorations Newsletter, 19(2):25–35.
  • Chen et al. (2018a) Liqun Chen, Shuyang Dai, Chenyang Tao, Haichao Zhang, Zhe Gan, Dinghan Shen, Yizhe Zhang, Guoyin Wang, Ruiyi Zhang, and Lawrence Carin. 2018a. Adversarial text generation via feature-mover’s distance. In Advances in Neural Information Processing Systems, pages 4666–4677.
  • Chen et al. (2017b) Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. 2017b. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5659–5667.
  • Chen et al. (2020) Mark Chen, Alec Radford, Rewon Child, Jeff Wu, and Heewoo Jun. 2020. Generative pretraining from pixels. In International Conference on Machine Learning.
  • Chen et al. (2019b) Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. 2019b. Controllable paraphrase generation with a syntactic exemplar. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5972–5984.
  • Chen et al. (2017c) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017c. Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668.
  • Chen and Bansal (2018) Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–686.
  • Chen et al. (2018b) Yun Chen, Victor OK Li, Kyunghyun Cho, and Samuel Bowman. 2018b. A stable and effective learning strategy for trainable greedy decoding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 380–390.
  • Cheng et al. (2016) Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 551–561.
  • Cheng and Lapata (2016) Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–494.
  • Child et al. (2019) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
  • Cho (2016) Kyunghyun Cho. 2016. Noisy parallel approximate decoding for conditional recurrent language model. arXiv preprint arXiv:1605.03835.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Choshen et al. (2019) Leshem Choshen, Lior Fox, Zohar Aizenbud, and Omri Abend. 2019. On the weaknesses of reinforcement learning for neural machine translation. arXiv preprint arXiv:1907.01752.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
  • Chung et al. (2015) Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988.
  • Clark et al. (2019) Elizabeth Clark, Asli Celikyilmaz, and Noah A Smith. 2019. Sentence mover’s similarity: Automatic evaluation for multi-sentence texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2748–2760.
  • Clark et al. (2018a) Elizabeth Clark, Yangfeng Ji, and Noah A Smith. 2018a. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2250–2260.
  • Clark et al. (2018b) Elizabeth Clark, Anne Spencer Ross, Chenhao Tan, Yangfeng Ji, and Noah A Smith. 2018b. Creative writing with a machine in the loop: Case studies on slogans and stories. In 23rd International Conference on Intelligent User Interfaces, pages 329–340.
  • Cohan and Goharian (2016) Arman Cohan and Nazli Goharian. 2016. Revisiting summarization evaluation for scientific articles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 806–813.
  • Conroy and Dang (2008) John Conroy and Hoa Trang Dang. 2008. Mind the gap: Dangers of divorcing evaluations of summary content from linguistic quality. In Proceedings of the 22nd international conference on computational linguistics (Coling 2008), pages 145–152.
  • Corston-Oliver et al. (2001) Simon Corston-Oliver, Michael Gamon, and Chris Brockett. 2001. A machine learning approach to the automatic evaluation of machine translation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 148–155.
  • Cremer et al. (2018) Chris Cremer, Xuechen Li, and David Duvenaud. 2018. Inference suboptimality in variational autoencoders. In International Conference on Machine Learning, pages 1078–1086.
  • Cuayáhuitl et al. (2016) Heriberto Cuayáhuitl, Seunghak Yu, Ashley Williamson, and Jacob Carse. 2016. Deep reinforcement learning for multi-domain dialogue systems. arXiv preprint arXiv:1611.08675.
  • Cui et al. (2017) Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 593–602.
  • Cui et al. (2018) Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5804–5812.
  • Dai et al. (2017) Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. 2017. Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision, pages 2970–2979.
  • Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
  • Daniluk et al. (2017) Michał Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. 2017. Frustratingly short attention spans in neural language modeling. arXiv preprint arXiv:1702.04521.
  • Dathathri et al. (2019) Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: a simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.
  • Dauphin et al. (2017) Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 933–941. JMLR. org.
  • d’Autume et al. (2019) Cyprien de Masson d’Autume, Mihaela Rosca, Jack Rae, and Shakir Mohamed. 2019. Training language gans from scratch. arXiv preprint arXiv:1905.09922.
  • De Cao et al. (2019) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of NAACL-HLT, pages 2306–2317.
  • Dehghani et al. (2018) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2018. Universal transformers. arXiv preprint arXiv:1807.03819.
  • Denkowski and Lavie (2014) Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380.
  • Deriu et al. (2020) Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2020. Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, pages 1–56.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Devlin (1999) Siobhan Lucy Devlin. 1999. Simplifying natural language for aphasic readers. Ph.D. thesis, University of Sunderland.
  • Di Eugenio et al. (2002) Barbara Di Eugenio, Michael Glass, and Michael Trolio. 2002. The diag experiments: Natural language generation for intelligent tutoring systems. In Proceedings of the International Natural Language Generation Conference, pages 120–127.
  • Dinan et al. (2018) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.
  • Doddington (2002) George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second international conference on Human Language Technology Research, pages 138–145.
  • Doersch (2016) Carl Doersch. 2016. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
  • Donahue et al. (2020) Chris Donahue, Mina Lee, and Percy Liang. 2020. Enabling language models to fill in the blanks. arXiv preprint arXiv:2005.05339.
  • Dong et al. (2015) Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732.
  • Dong et al. (2017) Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou, and Ke Xu. 2017. Learning to generate product reviews from attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, pages 623–632.
  • Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.
  • Dong et al. (2018) Yue Dong, Yikang Shen, Eric Crawford, Herke van Hoof, and Jackie Chi Kit Cheung. 2018. Banditsum: Extractive summarization as a contextual bandit. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3739–3748.
  • Drissi et al. (2018) Mehdi Drissi, Olivia Watkins, and Jugal Kalita. 2018. Hierarchical text generation using an outline. In 15th International Conference on Natural Language Processing, page 180.
  • Dušek and Jurcicek (2016) Ondřej Dušek and Filip Jurcicek. 2016. A context-aware natural language generator for dialogue systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 185–190.
  • Emelianenko et al. (2019) Dmitrii Emelianenko, Elena Voita, and Pavel Serdyukov. 2019. Sequence modeling with unconstrained generation order. arXiv preprint arXiv:1911.00176.
  • Fabbri et al. (2019) Alexander Richard Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084.
  • Fan et al. (2018a) Angela Fan, David Grangier, and Michael Auli. 2018a. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54.
  • Fan et al. (2018b) Angela Fan, Mike Lewis, and Yann Dauphin. 2018b. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.
  • Fan et al. (2019) Angela Fan, Mike Lewis, and Yann Dauphin. 2019. Strategies for structuring story generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2650–2660.
  • Fang et al. (2019) Le Fang, Chunyuan Li, Jianfeng Gao, Wen Dong, and Changyou Chen. 2019. Implicit deep latent variable models for text generation. arXiv preprint arXiv:1908.11527.
  • Fedus et al. (2018) William Fedus, Ian Goodfellow, and Andrew M Dai. 2018. Maskgan: Better text generation via filling in the _. In International Conference on Learning Representations.
  • Fevry and Phang (2018) Thibault Fevry and Jason Phang. 2018. Unsupervised sentence compression using denoising auto-encoders. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 413–422.
  • Ficler and Goldberg (2017) Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104.
  • Filippova et al. (2015) Katja Filippova, Enrique Alfonseca, Carlos A Colmenares, Łukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with lstms. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 360–368.
  • Flesch (1979) Rudolf Franz Flesch. 1979. How to write plain English: A book for lawyers and consumers. Harpercollins.
  • Flesch (1948) Rudolph Flesch. 1948. A new readability yardstick. Journal of applied psychology, 32(3):221.
  • Fomicheva and Specia (2019) Marina Fomicheva and Lucia Specia. 2019. Taking mt evaluation metrics to extremes: Beyond correlation with human judgments. Computational Linguistics, 45(3):515–558.
  • Ford et al. (2018) Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, and George Dahl. 2018. The importance of generation order in language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2942–2946.
  • Forrest et al. (2018) James Forrest, Somayajulu Sripada, Wei Pang, and George Coghill. 2018. Towards making nlg a voice for interpretable machine learning. In Proceedings of The 11th International Natural Language Generation Conference. Association for Computational Linguistics (ACL).
  • Fu and Feng (2018) Yao Fu and Yansong Feng. 2018. Natural answer generation with heterogeneous memory. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 185–195.
  • Fu et al. (2018) Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Fujii et al. (2009) Atsushi Fujii, Masao Utiyama, Mikio Yamamoto, and Takehito Utsuro. 2009. Evaluating effects of machine translation accuracy on cross-lingual patent retrieval. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 674–675.
  • Galassi et al. (2019) Andrea Galassi, Marco Lippi, and Paolo Torroni. 2019. Attention, please! a critical review of neural attention models in natural language processing. arXiv preprint arXiv:1902.02181.
  • Galley et al. (2015) Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltableu: A discriminative metric for generation tasks with intrinsically diverse targets. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 445–450.
  • Galliers and Jones (1993) Julia Rose Galliers and K Sparck Jones. 1993. Evaluating natural language processing systems.
  • Gamon et al. (2005) Michael Gamon, Anthony Aue, and Martine Smets. 2005. Sentence-level mt evaluation without reference translations: Beyond language modeling. In Proceedings of EAMT, pages 103–111.
  • Garbacea et al. (2019) Cristina Garbacea, Samuel Carton, Shiyan Yan, and Qiaozhu Mei. 2019. Judge the judges: A large-scale evaluation study of neural language models for online review generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3959–3972.
  • Gatt and Krahmer (2018) Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.
  • Gatys et al. (2016) Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423.
  • Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1243–1252. JMLR.org.
  • Gervás (2009) Pablo Gervás. 2009. Computational approaches to storytelling and creativity. AI Magazine, 30(3):49–49.
  • Ghazvininejad et al. (2018) Marjan Ghazvininejad, Yejin Choi, and Kevin Knight. 2018. Neural poetry translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 67–71.
  • Ghazvininejad et al. (2019) Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6114–6123.
  • Ghazvininejad et al. (2020) Marjan Ghazvininejad, Omer Levy, and Luke Zettlemoyer. 2020. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785.
  • Ghazvininejad et al. (2016) Marjan Ghazvininejad, Xing Shi, Yejin Choi, and Kevin Knight. 2016. Generating topical poetry. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1183–1191.
  • Ghazvininejad et al. (2017) Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. 2017. Hafez: an interactive poetry generation system. In Proceedings of ACL 2017, System Demonstrations, pages 43–48.
  • Ghosh et al. (2017) Sayan Ghosh, Mathieu Chollet, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer. 2017. Affect-lm: A neural language model for customizable affective text generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 634–642.
  • Giménez and Màrquez (2008a) Jesús Giménez and Lluís Màrquez. 2008a. Heterogeneous automatic mt evaluation through non-parametric metric combinations. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I.
  • Giménez and Màrquez (2008b) Jesús Giménez and Lluís Màrquez. 2008b. A smorgasbord of features for automatic mt evaluation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 195–198.
  • Gimpel et al. (2013) Kevin Gimpel, Dhruv Batra, Chris Dyer, and Gregory Shakhnarovich. 2013. A systematic exploration of diversity in machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1100–1111.
  • Gkatzia et al. (2016) Dimitra Gkatzia, Oliver Lemon, and Verena Rieser. 2016. Natural language generation enhances human decision-making with uncertain information. In 54th Annual Meeting of the Association for Computational Linguistics 2016, pages 264–268. Association for Computational Linguistics.
  • Gkatzia and Mahamood (2015) Dimitra Gkatzia and Saad Mahamood. 2015. A snapshot of nlg evaluation practices 2005-2014. In Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), pages 57–60.
  • Goldfarb-Tarrant et al. (2019) Seraphina Goldfarb-Tarrant, Haining Feng, and Nanyun Peng. 2019. Plan, write, and revise: an interactive system for open-domain story generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 89–97.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
  • Grave et al. (2016) Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426.
  • Graves (2013) Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
  • Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE.
  • Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.
  • Grefenstette et al. (2015) Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. Learning to transduce with unbounded memory. In Advances in neural information processing systems, pages 1828–1836.
  • Gu et al. (2017) Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
  • Gu et al. (2019a) Jiatao Gu, Qi Liu, and Kyunghyun Cho. 2019a. Insertion-based decoding with automatically inferred generation order. arXiv preprint arXiv:1902.01370.
  • Gu et al. (2019b) Jiatao Gu, Changhan Wang, and Jake Zhao. 2019b. Levenshtein transformer. arXiv preprint arXiv:1905.11006.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777.
  • Gulrajani et al. (2016) Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. 2016. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013.
  • Guo et al. (2018) Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Guo et al. (2019) Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3723–3730.
  • Guo and Hu (2019) Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506.
  • Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of ACL.
  • Guu et al. (2018) Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating sentences by editing prototypes. Transactions of the Association of Computational Linguistics, 6:437–450.
  • Haidar and Rezagholizadeh (2019) Md Akmal Haidar and Mehdi Rezagholizadeh. 2019. Textkd-gan: Text generation using knowledge distillation and generative adversarial networks. In Canadian Conference on Artificial Intelligence, pages 107–118. Springer.
  • Han et al. (2012) Aaron LF Han, Derek F Wong, and Lidia S Chao. 2012. Lepor: A robust evaluation metric for machine translation with augmented factors. In Proceedings of COLING 2012: Posters, pages 441–450.
  • Harnly et al. (2005) Aaron Harnly, Ani Nenkova, Rebecca Passonneau, and Owen Rambow. 2005. Automation of summary evaluation by the pyramid method. In Recent Advances in Natural Language Processing (RANLP), pages 226–232.
  • Hashimoto et al. (2019) Tatsunori B Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792.
  • Hastie et al. (2016) Helen Hastie, Heriberto Cuayáhuitl, Nina Dethlefs, Simon Keizer, and Xingkun Liu. 2016. Evaluation of nlg in an end-to-end spoken dialogue system-is it worth it? In 7th International Workshop on Spoken Dialogue Systems 2016.
  • Hendrycks et al. (2020) Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. 2020. Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100.
  • Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637.
  • Hirao et al. (2007) Tsutomu Hirao, Manabu Okumura, Norihito Yasuda, and Hideki Isozaki. 2007. Supervised automatic evaluation for summarization with voted regression model. Information Processing & Management, 43(6):1521–1535.
  • Hjelm et al. (2018) R Devon Hjelm, Athul Paul Jacob, Tong Che, Adam Trischler, Kyunghyun Cho, and Yoshua Bengio. 2018. Boundary-seeking generative adversarial networks. In 6th International Conference on Learning Representations, ICLR 2018.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Hodosh and Hockenmaier (2016) Micah Hodosh and Julia Hockenmaier. 2016. Focused evaluation for image description with binary forced-choice tasks. In Proceedings of the 5th Workshop on Vision and Language, pages 19–28.
  • Hodosh et al. (2013) Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899.
  • Hokamp and Liu (2017) Chris Hokamp and Qun Liu. 2017. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546.
  • Holtzman et al. (2018) Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649.
  • Holtzman et al. (2019) Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  • Hossain et al. (2019) MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6):1–36.
  • Hu et al. (2019) J Edward Hu, Huda Khayrallah, Ryan Culkin, Patrick Xia, Tongfei Chen, Matt Post, and Benjamin Van Durme. 2019. Improved lexically constrained decoding for translation and monolingual rewriting. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 839–850.
  • Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1587–1596. JMLR.org.
  • Huang et al. (2018) Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural language to structured query generation via meta-learning. arXiv preprint arXiv:1803.02400.
  • Huszár (2015) Ferenc Huszár. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101.
  • Im et al. (2018) Daniel Jiwoong Im, He Ma, Graham Taylor, and Kristin Branson. 2018. Quantitatively evaluating gans with divergences proposed for training. arXiv preprint arXiv:1803.01045.
  • Ippolito et al. (2020) Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020. Automatic detection of generated text is easiest when humans are fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1808–1822.
  • Ippolito et al. (2018) Daphne Ippolito, Reno Kriz, Joao Sedoc, Maria Kustikova, Chris Callison-Burch, Eleni Miltsakaki, Marianna Apidianaki, John Hewitt, et al. 2018. Comparison of diverse decoding methods from conditional language models. In Proceedings of the 57th Conference of the Association for Computational Linguistics. Association for Computational Linguistics.
  • Isozaki et al. (2010) Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944–952.
  • Iyyer et al. (2014) Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 633–644.
  • Iyyer et al. (2018) Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR 2017). OpenReview.net.
  • Jaques et al. (2019) Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. 2019. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.
  • Jean et al. (2015) Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL-IJCNLP 2015, pages 1–10. Association for Computational Linguistics (ACL).
  • Jelinek et al. (1977) Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. 1977. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63.
  • Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031.
  • Jiang et al. (2020) Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, and Wei Xu. 2020. Neural crf model for sentence alignment in text simplification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Jiang et al. (2019) Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2019. How can we know what language models know? arXiv preprint arXiv:1911.12543.
  • Jin et al. (2020) Di Jin, Zhijing Jin, Joey Tianyi Zhou, Lisa Orii, and Peter Szolovits. 2020. Hooks in the headline: Learning to generate headlines with controlled styles. arXiv preprint arXiv:2004.01980.
  • John et al. (2019) Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. 2019. Disentangled representation learning for non-parallel text style transfer. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 424–434.
  • Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
  • Jones et al. (2007) Douglas Jones, Martha Herzog, Hussny Ibrahim, Arvind Jairam, Wade Shen, Edward Gibson, and Michael Emonts. 2007. Ilr-based mt comprehension test with multi-level questions. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 77–80.
  • Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.
  • Kalchbrenner et al. (2016) Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
  • Kandula et al. (2010) Sasikiran Kandula, Dorothy Curtis, and Qing Zeng-Treitler. 2010. A semantic and syntactic text simplification tool for health content. In AMIA annual symposium proceedings, volume 2010, page 366. American Medical Informatics Association.
  • Kané et al. (2019) Hassan Kané, Yusuf Kocyigit, Pelkins Ajanoh, Ali Abdalla, and Mohamed Coulibali. 2019. Towards neural language evaluators. arXiv preprint arXiv:1909.09268.
  • Kannan and Vinyals (2017) Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  • Ke et al. (2018) Pei Ke, Jian Guan, Minlie Huang, and Xiaoyan Zhu. 2018. Generating informative responses with controlled sentence function. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1499–1508.
  • Keneshloo et al. (2019) Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K Reddy. 2019. Deep reinforcement learning for sequence-to-sequence models. IEEE Transactions on Neural Networks and Learning Systems.
  • Keselj (2009) Vlado Keselj. 2009. Review of Speech and Language Processing by Daniel Jurafsky and James H. Martin (Stanford University and University of Colorado at Boulder). Pearson Prentice Hall, 2009, xxxi+988 pp; hardbound, ISBN 978-0-13-187321-6.
  • Keskar et al. (2019) Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
  • Khandelwal et al. (2018) Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. arXiv preprint arXiv:1805.04623.
  • Khandelwal et al. (2019) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2019. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172.
  • Kikuchi et al. (2016) Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In EMNLP.
  • Kim et al. (2018) Yoon Kim, Sam Wiseman, and Alexander M Rush. 2018. A tutorial on deep latent variable models of natural language. arXiv preprint arXiv:1812.06834.
  • Kim et al. (2017) Yoon Kim, Kelly Zhang, Alexander M Rush, Yann LeCun, et al. 2017. Adversarially regularized autoencoders for generating discrete structures. arXiv preprint arXiv:1706.04223, 2:12.
  • Kincaid et al. (1975) J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Kingma and Welling (2019) Diederik P Kingma and Max Welling. 2019. An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691.
  • Kitaev et al. (2019) Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2019. Reformer: The efficient transformer. In International Conference on Learning Representations.
  • Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. 2017. Opennmt: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72.
  • Knight and Marcu (2002) Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107.
  • Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39.
  • Kong et al. (2019a) Xiang Kong, Bohan Li, Graham Neubig, Eduard Hovy, and Yiming Yang. 2019a. An adversarial approach to high-quality, sentiment-controlled neural dialogue generation. arXiv preprint arXiv:1901.07129.
  • Kong et al. (2019b) Xiang Kong, Zhaopeng Tu, Shuming Shi, Eduard Hovy, and Tong Zhang. 2019b. Neural machine translation with adequacy-oriented learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6618–6625.
  • Kratzwald et al. (2019) Bernhard Kratzwald, Anna Eigenmann, and Stefan Feuerriegel. 2019. Rankqa: Neural question answering with answer re-ranking. arXiv preprint arXiv:1906.03008.
  • Krause et al. (2016) Ben Krause, Liang Lu, Iain Murray, and Steve Renals. 2016. Multiplicative lstm for sequence modelling. arXiv preprint arXiv:1609.07959.
  • Krause et al. (2017) Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. 2017. A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 317–325.
  • Kreps et al. (2020) Sarah E Kreps, Miles McCain, and Miles Brundage. 2020. All the news that’s fit to fabricate: Ai-generated text as a tool of media misinformation. Available at SSRN 3525002.
  • Kriz et al. (2019) Reno Kriz, Joao Sedoc, Marianna Apidianaki, Carolina Zheng, Gaurav Kumar, Eleni Miltsakaki, and Chris Callison-Burch. 2019. Complexity-weighted loss and diverse reranking for sentence simplification. In Proceedings of NAACL-HLT, pages 3137–3147.
  • Kryscinski et al. (2019) Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551.
  • Kryściński et al. (2018) Wojciech Kryściński, Romain Paulus, Caiming Xiong, and Richard Socher. 2018. Improving abstraction in text summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1808–1817.
  • Kulesza and Shieber (2004) Alex Kulesza and Stuart Shieber. 2004. A learning approach to improving sentence-level mt evaluation. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation. European Association for Machine Translation.
  • Kulikov et al. (2019) Ilia Kulikov, Alexander Miller, Kyunghyun Cho, and Jason Weston. 2019. Importance of search and evaluation strategies in neural dialogue modeling. In Proceedings of the 12th International Conference on Natural Language Generation, pages 76–87.
  • Kulkarni et al. (2013) Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903.
  • Kumar et al. (2016) Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In International conference on machine learning, pages 1378–1387.
  • Kusner et al. (2015) Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International conference on machine learning, pages 957–966.
  • Kusner and Hernández-Lobato (2016) Matt J Kusner and José Miguel Hernández-Lobato. 2016. Gans for sequences of discrete elements with the gumbel-softmax distribution. arXiv preprint arXiv:1611.04051.
  • Laban et al. (2020) Philippe Laban, Andrew Hsi, John Canny, and Marti A Hearst. 2020. The summary loop: Learning to write abstractive summaries without examples. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, volume 1.
  • Lamb et al. (2016) Alex M Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609.
  • Lample et al. (2018) Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc’Aurelio Ranzato, and Y-Lan Boureau. 2018. Multiple-attribute text rewriting. In International Conference on Learning Representations.
  • Latif et al. (2020) Seemab Latif, Sarmad Bashir, Mir Muntasar Ali Agha, and Rabia Latif. 2020. Backward-forward sequence generative network for multiple lexical constraints. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 39–50. Springer.
  • Lavie and Denkowski (2009) Alon Lavie and Michael J Denkowski. 2009. The meteor metric for automatic evaluation of machine translation. Machine translation, 23(2-3):105–115.
  • van der Lee et al. (2019) Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368.
  • Lee et al. (2018) Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182.
  • Lehmann and Romano (2006) Erich L Lehmann and Joseph P Romano. 2006. Testing statistical hypotheses. Springer Science & Business Media.
  • Lei (2017) Tao Lei. 2017. Interpretable Neural Models for Natural Language Processing. Ph.D. thesis, Massachusetts Institute of Technology.
  • Leusch et al. (2006) Gregor Leusch, Nicola Ueffing, and Hermann Ney. 2006. Cder: Efficient mt evaluation using block movements. In 11th Conference of the European Chapter of the Association for Computational Linguistics.
  • Levenshtein (1966) Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710.
  • Lewis et al. (2020a) Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, and Luke Zettlemoyer. 2020a. Pre-training via paraphrasing. arXiv preprint arXiv:2006.15020.
  • Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Lewis et al. (2020b) Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020b. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv preprint arXiv:2005.11401.
  • Li et al. (2015) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
  • Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.
  • Li and Jurafsky (2016) Jiwei Li and Dan Jurafsky. 2016. Mutual information and diverse decoding improve neural machine translation. arXiv preprint arXiv:1601.00372.
  • Li et al. (2016b) Jiwei Li, Will Monroe, and Dan Jurafsky. 2016b. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.
  • Li et al. (2016c) Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016c. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.
  • Li et al. (2017a) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017a. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.
  • Li et al. (2019a) Junyi Li, Wayne Xin Zhao, Ji-Rong Wen, and Yang Song. 2019a. Generating long and informative reviews with aspect-aware coarse-to-fine decoding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1969–1979.
  • Li et al. (2019b) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  • Li et al. (2019c) Miao Li, Beihong Jin, et al. 2019c. A topic augmented text generation model: Joint learning of semantics and structural features. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5093–5102.
  • Li et al. (2017b) Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017b. End-to-end task-completion neural dialogue systems. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 733–743.
  • Li et al. (2018) Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3865–3878.
  • Li et al. (2019d) Ziming Li, Julia Kiseleva, and Maarten de Rijke. 2019d. Dialogue generation: From imitation learning to inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6722–6729.
  • Libovickỳ and Helcl (2018) Jindřich Libovickỳ and Jindřich Helcl. 2018. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3016–3021.
  • Lin et al. (2019) Bill Yuchen Lin, Ming Shen, Yu Xing, Pei Zhou, and Xiang Ren. 2019. Commongen: A constrained text generation dataset towards generative commonsense reasoning. arXiv preprint arXiv:1911.03705.
  • Lin et al. (2006) Chin-Yew Lin, Guihong Cao, Jianfeng Gao, and Jian-Yun Nie. 2006. An information-theoretic approach to automatic evaluation of summaries. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 463–470.
  • Lin and Och (2004a) Chin-Yew Lin and Franz Josef Och. 2004a. Looking for a few good metrics: Rouge and its evaluation. In NTCIR Workshop.
  • Lin and Och (2004b) Chin-Yew Lin and Franz Josef Och. 2004b. Orange: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 501–507.
  • Lin et al. (2018) Junyang Lin, Xu Sun, Shuming Ma, and Qi Su. 2018. Global encoding for abstractive summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 163–169.
  • Lin et al. (2017) Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pages 3155–3165.
  • Linzen (2020) Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? arXiv preprint arXiv:2005.00955.
  • Lipton et al. (2015) Zachary C Lipton, John Berkowitz, and Charles Elkan. 2015. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019.
  • Liu et al. (2018a) Bei Liu, Jianlong Fu, Makoto P Kato, and Masatoshi Yoshikawa. 2018a. Beyond narrative description: Generating poetry from images by multi-adversarial training. In Proceedings of the 26th ACM international conference on Multimedia, pages 783–791.
  • Liu (2015) Bing Liu. 2015. Sentiment analysis: Mining opinions, sentiments, and emotions. Cambridge University Press.
  • Liu and Lane (2017) Bing Liu and Ian Lane. 2017. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 482–489. IEEE.
  • Liu et al. (2017a) Bing Liu, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, and Larry Heck. 2017a. End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. arXiv preprint arXiv:1711.10712.
  • Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
  • Liu et al. (2019a) Dayiheng Liu, Jie Fu, Qian Qu, and Jiancheng Lv. 2019a. Bfgan: Backward and forward generative adversarial networks for lexically constrained sentence generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):2350–2361.
  • Liu and Tuzel (2016) Ming-Yu Liu and Oncel Tuzel. 2016. Coupled generative adversarial networks. In Advances in neural information processing systems, pages 469–477.
  • Liu et al. (2018b) Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018b. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198.
  • Liu et al. (2017b) Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2017b. Improved image captioning via policy gradient optimization of spider. In Proceedings of the IEEE international conference on computer vision, pages 873–881.
  • Liu et al. (2018c) Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018c. Table-to-text generation by structure-aware seq2seq learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Liu et al. (2020) Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. 2020. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994.
  • Liu and Lapata (2019a) Yang Liu and Mirella Lapata. 2019a. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081.
  • Liu and Lapata (2019b) Yang Liu and Mirella Lapata. 2019b. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3721–3731.
  • Liu et al. (2019b) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Liu et al. (2019c) Zhiqiang Liu, Zuohui Fu, Jie Cao, Gerard de Melo, Yik-Cheung Tam, Cheng Niu, and Jie Zhou. 2019c. Rhetorically controlled encoder-decoder for modern chinese poetry generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1992–2001.
  • Lo (2019) Chi-kiu Lo. 2019. Yisi-a unified semantic mt quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513.
  • Lo and Wu (2011) Chi-kiu Lo and Dekai Wu. 2011. Meant: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 220–229.
  • Logeswaran et al. (2018) Lajanugen Logeswaran, Honglak Lee, and Samy Bengio. 2018. Content preserving text generation with attribute controls. In Advances in Neural Information Processing Systems, pages 5103–5113.
  • Loginova et al. (2018) Ekaterina Loginova, Stalin Varanasi, and Günter Neumann. 2018. Towards multilingual neural question answering. In European Conference on Advances in Databases and Information Systems, pages 274–285. Springer.
  • Lopez-Paz and Oquab (2016) David Lopez-Paz and Maxime Oquab. 2016. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545.
  • Louis and Nenkova (2013) Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267–300.
  • Lowe et al. (2017) Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1116–1126.
  • Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.
  • Luan et al. (2016) Yi Luan, Yangfeng Ji, and Mari Ostendorf. 2016. Lstm based conversation models. arXiv preprint arXiv:1603.09457.
  • Lucic et al. (2018) Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. 2018. Are gans created equal? a large-scale study. In Advances in neural information processing systems, pages 700–709.
  • Luong et al. (2015a) Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015a. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
  • Luong and Manning (2015) Minh-Thang Luong and Christopher D Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation.
  • Luong et al. (2015b) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015b. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  • Ma et al. (2019a) Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019a. Results of the wmt19 metrics shared task: Segment-level and strong mt systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90.
  • Ma et al. (2019b) Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. 2019b. Flowseq: Non-autoregressive conditional sequence generation with generative flow. arXiv preprint arXiv:1909.02480.
  • Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
  • Mallinson and Lapata (2019) Jonathan Mallinson and Mirella Lapata. 2019. Controllable sentence simplification: Employing syntactic and lexical constraints. arXiv preprint arXiv:1910.04387.
  • Mallinson et al. (2017) Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2017. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 881–893.
  • Mallinson et al. (2018) Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2018. Sentence compression for arbitrary languages via multilingual pivoting. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2453–2464.
  • Mani et al. (1999) Inderjeet Mani, David House, Gary Klein, Lynette Hirschman, Therese Firmin, and Beth M Sundheim. 1999. The tipster summac text summarization evaluation. In Ninth Conference of the European Chapter of the Association for Computational Linguistics.
  • Martin et al. (2019a) Louis Martin, Samuel Humeau, Pierre-Emmanuel Mazaré, Antoine Bordes, Éric Villemonte de La Clergerie, and Benoît Sagot. 2019a. Reference-less quality estimation of text simplification systems. arXiv preprint arXiv:1901.10746.
  • Martin et al. (2019b) Louis Martin, Benoît Sagot, Éric de la Clergerie, and Antoine Bordes. 2019b. Controllable sentence simplification. arXiv preprint arXiv:1910.02677.
  • Mathur et al. (2020) Nitika Mathur, Tim Baldwin, and Trevor Cohn. 2020. Tangled up in bleu: Reevaluating the evaluation of automatic machine translation evaluation metrics. arXiv preprint arXiv:2006.06264.
  • Mathur et al. (2019) Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2799–2808.
  • Mayfield et al. (2019) Elijah Mayfield, Michael Madaio, Shrimai Prabhumoye, David Gerritsen, Brittany McLaughlin, Ezekiel Dixon-Román, and Alan W Black. 2019. Equity beyond bias in language technologies for education. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 444–460.
  • McCoy et al. (2019) R Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007.
  • Mehri and Sigal (2018) Shikib Mehri and Leonid Sigal. 2018. Middle-out decoding. In Advances in Neural Information Processing Systems, pages 5518–5529.
  • Mei et al. (2017) Hongyuan Mei, Mohit Bansal, and Matthew R Walter. 2017. Coherent dialogue with attention-based language models. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Mei et al. (2016) Hongyuan Mei, Mohit Bansal, and Matthew R Walter. 2016. What to talk about and how? selective generation using lstms with coarse-to-fine alignment. In Proceedings of NAACL-HLT, pages 720–730.
  • Melis et al. (2019) Gábor Melis, Tomáš Kočiskỳ, and Phil Blunsom. 2019. Mogrifier lstm. arXiv preprint arXiv:1909.01792.
  • Meng et al. (2015) Fandong Meng, Zhengdong Lu, Zhaopeng Tu, Hang Li, and Qun Liu. 2015. A deep memory-based architecture for sequence-to-sequence learning. arXiv preprint arXiv:1506.06442.
  • Metz et al. (2016) Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. 2016. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163.
  • Miao et al. (2019) Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. Cgmh: Constrained sentence generation by metropolis-hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6834–6842.
  • Miao and Blunsom (2016) Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 319–328.
  • Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In International conference on machine learning, pages 1727–1736.
  • Microsoft (2020) Microsoft. 2020. Turing-NLG: A 17-billion-parameter language model by Microsoft. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/.
  • Mikolov (2012) Tomáš Mikolov. 2012. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April, 80.
  • Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association.
  • Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • Mnih et al. (2014) Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. 2014. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212.
  • Mohammed et al. (2018) Omar Mohammed, Gerard Bailly, and Damien Pellier. 2018. Handwriting styles: benchmarks and evaluation metrics. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 159–166. IEEE.
  • Moniz and Krueger (2018) Joel Ruben Antony Moniz and David Krueger. 2018. Nested lstms. arXiv preprint arXiv:1801.10308.
  • Mou et al. (2016) Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. arXiv preprint arXiv:1607.00970.
  • Mou et al. (2015) Lili Mou, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. 2015. Backward and forward language modeling for constrained sentence generation. arXiv preprint arXiv:1512.06612.
  • Mueller et al. (2017) Jonas Mueller, David Gifford, and Tommi Jaakkola. 2017. Sequence to better sequence: continuous revision of combinatorial structures. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2536–2544. JMLR.org.
  • Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. 2017. Neural semantic encoders. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, page 397.
  • Murray and Chiang (2018) Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 212–223.
  • Naeem et al. (2020) Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. 2020. Reliable fidelity and diversity metrics for generative models. arXiv preprint arXiv:2002.09797.
  • Nagarajan et al. Vaishnavh Nagarajan, Colin Raffel, and Ian J Goodfellow. Theoretical insights into memorization in gans.
  • Nallapati et al. (2017) Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.
  • Napoles et al. (2016) Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. 2016. There’s no comparison: Reference-less evaluation metrics in grammatical error correction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2109–2115.
  • Narayan et al. (2018) Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In NAACL-HLT.
  • Nematzadeh et al. Aida Nematzadeh, Sebastian Ruder, and Dani Yogatama. On memory in human and artificial language processing systems.
  • Ng and Abrecht (2015) Jun-Ping Ng and Viktoria Abrecht. 2015. Better summarization evaluation with word embeddings for rouge. arXiv preprint arXiv:1508.06034.
  • Nie et al. (2018) Weili Nie, Nina Narodytska, and Ankit Patel. 2018. Relgan: Relational generative adversarial networks for text generation.
  • Nießen et al. (2000) Sonja Nießen, Franz Josef Och, Gregor Leusch, Hermann Ney, et al. 2000. An evaluation tool for machine translation: Fast evaluation for mt research. In LREC.
  • Nikolov and Hahnloser (2020) Nikola I Nikolov and Richard Hahnloser. 2020. Abstractive document summarization without parallel data. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 6638–6644.
  • Nishihara et al. (2019) Daiki Nishihara, Tomoyuki Kajiwara, and Yuki Arase. 2019. Controllable text simplification with lexical constraint loss. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 260–266.
  • Nisioi et al. (2017) Sergiu Nisioi, Sanja Štajner, Simone Paolo Ponzetto, and Liviu P Dinu. 2017. Exploring neural text simplification models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 85–91.
  • Niu and Bansal (2018) Tong Niu and Mohit Bansal. 2018. Polite dialogue generation without parallel data. Transactions of the Association for Computational Linguistics, 6:373–389.
  • Novikova et al. (2017) Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for nlg. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252.
  • Oliveira (2017) Hugo Gonçalo Oliveira. 2017. A survey on intelligent poetry generation: Languages, features, techniques, reutilisation and evaluation. In Proceedings of the 10th International Conference on Natural Language Generation, pages 11–20.
  • Oord et al. (2018) Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al. 2018. Parallel wavenet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pages 3918–3926.
  • Owczarzak et al. (2012) Karolina Owczarzak, John Conroy, Hoa Trang Dang, and Ani Nenkova. 2012. An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of workshop on evaluation metrics and system comparison for automatic summarization, pages 1–9.
  • Panja and Naskar (2018) Joybrata Panja and Sudip Kumar Naskar. 2018. Iter: Improving translation edit rate through optimizable edit costs. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 746–750.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Paris (2015) Cecile Paris. 2015. User modelling in text generation. Bloomsbury Publishing.
  • Parton et al. (2009) Kristen Parton, Kathleen R McKeown, Bob Coyne, Mona T Diab, Ralph Grishman, Dilek Hakkani-Tür, Mary Harper, Heng Ji, Wei Yun Ma, Adam Meyers, et al. 2009. Who, what, when, where, why?: comparing multiple approaches to the cross-lingual 5w task. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pages 423–431. Association for Computational Linguistics.
  • Parton et al. (2011) Kristen Parton, Joel Tetreault, Nitin Madnani, and Martin Chodorow. 2011. E-rating machine translation.
  • Pasunuru and Bansal (2018) Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with saliency and entailment. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 646–653.
  • Paulus et al. (2017) Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
  • Paulus et al. (2018) Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.
  • Pelsmaeker and Aziz (2019) Tom Pelsmaeker and Wilker Aziz. 2019. Effective estimation of deep generative language models. arXiv preprint arXiv:1904.08194.
  • Peng et al. (2018a) Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Yun-Nung Chen, and Kam-Fai Wong. 2018a. Adversarial advantage actor-critic model for task-completion dialogue policy learning. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6149–6153. IEEE.
  • Peng et al. (2018b) Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Kam-Fai Wong. 2018b. Deep dyna-q: Integrating planning for task-completion dialogue policy learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2182–2192.
  • Peng et al. (2017) Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2231–2240.
  • Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473.
  • Peyrard et al. (2017) Maxime Peyrard, Teresa Botschen, and Iryna Gurevych. 2017. Learning to score system summaries for better content selection evaluation. In Proceedings of the Workshop on New Frontiers in Summarization, pages 74–84.
  • Popović (2015) Maja Popović. 2015. chrf: character n-gram f-score for automatic mt evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395.
  • Popović (2017) Maja Popović. 2017. chrf++: words helping character n-grams. In Proceedings of the second conference on machine translation, pages 612–618.
  • Post and Vilar (2018) Matt Post and David Vilar. 2018. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1314–1324.
  • Press et al. (2017) Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. 2017. Language generation with recurrent generative adversarial networks without pre-training. arXiv preprint arXiv:1706.01399.
  • Pu et al. (2016) Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. 2016. Variational autoencoder for deep learning of images, labels and captions. In Advances in neural information processing systems, pages 2352–2360.
  • Qin et al. (2019a) Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019a. Counterfactual story reasoning and generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5046–5056.
  • Qin et al. (2019b) Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jianfeng Gao. 2019b. Conversing by reading: Contentful neural conversation with on-demand machine reading. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5427–5436.
  • Quirk (2004) Christopher Quirk. 2004. Training a sentence-level machine translation confidence measure. In LREC. Citeseer.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1:8.
  • Rae et al. (2019) Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. 2019. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.
  • Rae and Razavi (2020) Jack W Rae and Ali Razavi. 2020. Do transformers need deep long-range memory? arXiv preprint arXiv:2007.03356.
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Raiko et al. (2014) Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. 2014. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989.
  • Ramachandran et al. (2017) Prajit Ramachandran, Peter J Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391.
  • Ramachandran et al. (2019) Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. 2019. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909.
  • Rankel et al. (2013) Peter A Rankel, John Conroy, Hoa Trang Dang, and Ani Nenkova. 2013. A decade of automatic content evaluation of news summaries: Reassessing the state of the art. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 131–136.
  • Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
  • Ravuri and Vinyals (2019) Suman Ravuri and Oriol Vinyals. 2019. Classification accuracy score for conditional generative models. In Advances in Neural Information Processing Systems, pages 12247–12258.
  • Reiter (2019) Ehud Reiter. 2019. Natural language generation challenges for explainable ai. arXiv preprint arXiv:1911.08794.
  • Reiter (2020) Ehud Reiter. 2020. Why do we still use 18-year old BLEU. https://ehudreiter.com/2020/03/02/why-use-18-year-old-bleu.
  • Reiter and Belz (2009) Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286.
  • Rios et al. (2011) Miguel Rios, Wilker Aziz, and Lucia Specia. 2011. Tine: A metric to assess mt adequacy. In Proceedings of the sixth workshop on statistical machine translation, pages 116–122. Association for Computational Linguistics.
  • Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910.
  • Romanov et al. (2019) Alexey Romanov, Maria De-Arteaga, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, Anna Rumshisky, and Adam Kalai. 2019. What’s in a name? reducing bias in bios without access to protected attributes. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4187–4195.
  • Rosset et al. (2020) Corby Rosset, Chenyan Xiong, Minh Phan, Xia Song, Paul Bennett, and Saurabh Tiwary. 2020. Knowledge-aware language model pretraining. arXiv preprint arXiv:2007.00655.
  • Ruis et al. (2020) Laura Ruis, Mitchell Stern, Julia Proskurnia, and William Chan. 2020. Insertion-deletion transformer. arXiv preprint arXiv:2001.05540.
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. nature, 323(6088):533–536.
  • Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.
  • Saharia et al. (2020) Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. 2020. Non-autoregressive machine translation with latent alignments. arXiv preprint arXiv:2004.07437.
  • Sainath et al. (2015) Tara N Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak. 2015. Convolutional, long short-term memory, fully connected deep neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4580–4584. IEEE.
  • Sajjadi et al. (2018) Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. 2018. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pages 5228–5237.
  • Saleh et al. (2019) Abdelrhman Saleh, Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, and Rosalind Picard. 2019. Hierarchical reinforcement learning for open-domain dialog. arXiv preprint arXiv:1909.07547.
  • Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242.
  • Sankar and Ravi (2019) Chinnadhurai Sankar and Sujith Ravi. 2019. Deep reinforcement learning for modeling chit-chat dialog with discrete attributes. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 1–10.
  • Santoro et al. (2018) Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. 2018. Relational recurrent neural networks. In Advances in neural information processing systems, pages 7299–7310.
  • Sap et al. (2019) Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. 2019. Atomic: An atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3027–3035.
  • Schubotz et al. (2018) Moritz Schubotz, Philipp Scharpf, Kaushal Dudhat, Yash Nagar, Felix Hamborg, and Bela Gipp. 2018. Introducing mathqa: a math-aware question answering system. Information Discovery and Delivery.
  • Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681.
  • See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.
  • See et al. (2019) Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? how controllable attributes affect human judgments. In Proceedings of NAACL-HLT, pages 1702–1723.
  • Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur P Parikh. 2020. Bleurt: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696.
  • Semeniuta et al. (2017) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A hybrid convolutional variational autoencoder for text generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 627–637.
  • Semeniuta et al. (2018) Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. 2018. On accurate evaluation of gans for language generation. arXiv preprint arXiv:1806.04936.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40.
  • Serban et al. (2016) Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Shah et al. (2020) Darsh J Shah, Tal Schuster, and Regina Barzilay. 2020. Automatic fact-guided sentence modification. In AAAI, pages 8791–8798.
  • Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1577–1586.
  • Shao et al. (2016) Louis Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. 2016. Generating long and diverse responses with neural conversation models.
  • Shardlow and Nawaz (2019) Matthew Shardlow and Raheel Nawaz. 2019. Neural text simplification of clinical letters with a domain specific phrase table. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 380–389.
  • Shen et al. (2017) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in neural information processing systems, pages 6830–6841.
  • Shi et al. (2018) Zhan Shi, Xinchi Chen, Xipeng Qiu, and Xuanjing Huang. 2018. Toward diverse text generation with inverse reinforcement learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4361–4367. AAAI Press.
  • Shimanaka et al. (2018) Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. Ruse: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751–758.
  • Shmelkov et al. (2018) Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. 2018. How good is my gan? In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229.
  • Shoeybi et al. (2019) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053.
  • Shu et al. (2018) Kai Shu, Suhang Wang, Thai Le, Dongwon Lee, and Huan Liu. 2018. Deep headline generation for clickbait detection. In 2018 IEEE International Conference on Data Mining (ICDM), pages 467–476. IEEE.
  • Siddharthan (2006) Advaith Siddharthan. 2006. Syntactic simplification and text cohesion. Research on Language and Computation, 4(1):77–109.
  • Sinha et al. (2020) Koustuv Sinha, Prasanna Parthasarathi, Jasmine Wang, Ryan Lowe, William L Hamilton, and Joelle Pineau. 2020. Learning an unreferenced metric for online dialogue evaluation. arXiv preprint arXiv:2005.00583.
  • Snover et al. (2006) Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of association for machine translation in the Americas, volume 200.
  • Snover et al. (2009) Matthew Snover, Nitin Madnani, Bonnie J Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or hter?: exploring different human judgments with a tunable mt metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 259–268. Association for Computational Linguistics.
  • So et al. (2019) David So, Quoc Le, and Chen Liang. 2019. The evolved transformer. In International Conference on Machine Learning, pages 5877–5886.
  • Soares et al. (2019) Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905.
  • Sønderby et al. (2016) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746.
  • Song et al. (2019a) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019a. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
  • Song et al. (2019b) Tianbao Song, Jingbo Sun, Bo Chen, Weiming Peng, and Jihua Song. 2019b. Latent space expanded variational autoencoder for sentence generation. IEEE Access, 7:144618–144627.
  • Song et al. (2016) Yiping Song, Rui Yan, Xiang Li, Dongyan Zhao, and Ming Zhang. 2016. Two are better than one: An ensemble of retrieval-and generation-based dialog systems. arXiv preprint arXiv:1610.07149.
  • Sordoni et al. (2015) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 196–205.
  • Specia et al. (2010) Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine translation, 24(1):39–50.
  • Specia et al. (2009) Lucia Specia, Marco Turchi, Nicola Cancedda, Marc Dymetman, and Nello Cristianini. 2009. Estimating the sentence-level quality of machine translation systems. In 13th Conference of the European Association for Machine Translation, pages 28–37.
  • Srinivasan et al. (2019) Vidhushini Srinivasan, Sashank Santhanam, and Samira Shaikh. 2019. Natural language generation using reinforcement learning with external rewards. arXiv preprint arXiv:1911.11404.
  • Sriram et al. (2018) Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. 2018. Cold fusion: Training seq2seq models together with language models. Proc. Interspeech 2018, pages 387–391.
  • Štajner et al. (2015) Sanja Štajner, Hannah Béchara, and Horacio Saggion. 2015. A deeper exploration of the standard pb-smt approach to text simplification and its evaluation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 823–828.
  • Stanchev et al. (2019) Peter Stanchev, Weiyue Wang, and Hermann Ney. 2019. Eed: Extended edit distance measure for machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 514–520.
  • Stanojević and Sima’an (2014) Miloš Stanojević and Khalil Sima’an. 2014. Beer: Better evaluation as ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 414–419.
  • van Stegeren and Theune (2019) Judith van Stegeren and Mariët Theune. 2019. Narrative generation in the wild: Methods from nanogenmo. In Proceedings of the Second Workshop on Storytelling, pages 65–74.
  • Steinberger and Ježek (2012) Josef Steinberger and Karel Ježek. 2012. Evaluation measures for text summarization. Computing and Informatics, 28(2):251–275.
  • Stent et al. (2005) Amanda Stent, Matthew Marge, and Mohit Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In international conference on intelligent text processing and computational linguistics, pages 341–351. Springer.
  • Stern et al. (2019) Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. 2019. Insertion transformer: Flexible sequence generation via insertion operations. In International Conference on Machine Learning, pages 5976–5985.
  • Su et al. (2018) Jinsong Su, Shan Wu, Deyi Xiong, Yaojie Lu, Xianpei Han, and Biao Zhang. 2018. Variational recurrent neural machine translation. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Subramani et al. (2019) Nishant Subramani, Samuel Bowman, and Kyunghyun Cho. 2019. Can unconditional language models recover arbitrary sentences? In Advances in Neural Information Processing Systems, pages 15232–15242.
  • Subramanian et al. (2019) Sandeep Subramanian, Raymond Li, Jonathan Pilault, and Christopher Pal. 2019. On extractive and abstractive neural document summarization with transformer language models. arXiv preprint arXiv:1909.03186.
  • Subramanian et al. (2018) Sandeep Subramanian, Sai Rajeswar Mudumba, Alessandro Sordoni, Adam Trischler, Aaron C Courville, and Chris Pal. 2018. Towards text generation with adversarially learned neural outlines. In Advances in Neural Information Processing Systems, pages 7551–7563.
  • Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448.
  • Sulem et al. (2018a) Elior Sulem, Omri Abend, and Ari Rappoport. 2018a. Bleu is not suitable for the evaluation of text simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 738–744.
  • Sulem et al. (2018b) Elior Sulem, Omri Abend, and Ari Rappoport. 2018b. Semantic structural evaluation for text simplification. In Proceedings of NAACL-HLT, pages 685–696.
  • Sulem et al. (2018c) Elior Sulem, Omri Abend, and Ari Rappoport. 2018c. Simple and effective text simplification using semantic and neural methods. arXiv preprint arXiv:1810.05104.
  • Sultan et al. (2020) Md Arafat Sultan, Shubham Chandel, Ramón Fernandez Astudillo, and Vittorio Castelli. 2020. On the importance of diversity in question generation for qa. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5651–5656.
  • Sun et al. (2019a) Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019a. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 7464–7473.
  • Sun and Zhou (2012) Hong Sun and Ming Zhou. 2012. Joint learning of a dual smt system for paraphrase generation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 38–42. Association for Computational Linguistics.
  • Sun et al. (2019b) Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019b. Ernie 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412.
  • Surya et al. (2018) Sai Surya, Abhijit Mishra, Anirban Laha, Parag Jain, and Karthik Sankaranarayanan. 2018. Unsupervised neural text simplification. arXiv preprint arXiv:1810.07931.
  • Sutskever et al. (2011) Ilya Sutskever, James Martens, and Geoffrey E Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 1017–1024.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
  • Talmor et al. (2019) Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2019. olmpics–on what language model pre-training captures. arXiv preprint arXiv:1912.13283.
  • Tam (2020) Yik-Cheung Tam. 2020. Cluster-based beam search for pointer-generator chatbot grounded by knowledge. Computer Speech & Language, page 101094.
  • Tang et al. (2018) Da Tang, Xiujun Li, Jianfeng Gao, Chong Wang, Lihong Li, and Tony Jebara. 2018. Subgoal discovery for hierarchical dialogue policy learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2298–2309.
  • Tang et al. (2016) Jian Tang, Yifan Yang, Sam Carton, Ming Zhang, and Qiaozhu Mei. 2016. Context-aware natural language generation with recurrent neural networks. arXiv preprint arXiv:1611.09900.
  • Tang et al. (2019) Jianheng Tang, Tiancheng Zhao, Chenyan Xiong, Xiaodan Liang, Eric Xing, and Zhiting Hu. 2019. Target-guided open-domain conversation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5624–5634.
  • Tao et al. (2018) Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. Ruber: An unsupervised method for automatic evaluation of open-domain dialog systems. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Temnikova (2012) Irina Temnikova. 2012. Text complexity and text simplification.
  • Theis et al. (2016) Lucas Theis, Aäron van den Oord, and Matthias Bethge. 2016. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pages 1–10.
  • Tian et al. (2017) Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, and Dongyan Zhao. 2017. How to make context more useful? an empirical study on context-aware neural conversational models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 231–236.
  • Tilk and Alumäe (2017) Ottokar Tilk and Tanel Alumäe. 2017. Low-resource neural headline generation. In Proceedings of the Workshop on New Frontiers in Summarization, pages 20–26.
  • Tillmann et al. (1997) Christoph Tillmann, Stephan Vogel, Hermann Ney, Arkaitz Zubiaga, and Hassan Sawaf. 1997. Accelerated dp based search for statistical translation. In Fifth European Conference on Speech Communication and Technology.
  • Tran et al. (2016) Ke Tran, Arianna Bisazza, and Christof Monz. 2016. Recurrent memory networks for language modeling. arXiv preprint arXiv:1601.01272.
  • Trinh and Le (2018) Trieu H Trinh and Quoc V Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
  • Tsai et al. (2019) Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. arXiv preprint arXiv:1908.11775.
  • Tuan et al. (2019) Luu Anh Tuan, Darsh J Shah, and Regina Barzilay. 2019. Capturing greater context for question generation. arXiv preprint arXiv:1910.10274.
  • Tuan and Lee (2019) Yi-Lin Tuan and Hung-Yi Lee. 2019. Improving conditional sequence generative adversarial networks by stepwise evaluation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(4):788–798.
  • Turing (1950) Alan M Turing. 1950. Computing machinery and intelligence. Mind, 59(236):433–460.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
  • Verga et al. (2020) Pat Verga, Haitian Sun, Livio Baldini Soares, and William W Cohen. 2020. Facts as experts: Adaptable and interpretable neural memory over symbolic knowledge. arXiv preprint arXiv:2007.00849.
  • Vijayakumar et al. (2016) Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
  • Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103.
  • Vinyals et al. (2015a) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015a. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391.
  • Vinyals et al. (2015b) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015b. Pointer networks. In Advances in neural information processing systems, pages 2692–2700.
  • Vinyals et al. (2015c) Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015c. Grammar as a foreign language. In Advances in neural information processing systems, pages 2773–2781.
  • Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
  • Vinyals et al. (2015d) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015d. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164.
  • Vodolazova and Lloret (2019) Tatiana Vodolazova and Elena Lloret. 2019. Towards adaptive text summarization: How does compression rate affect summary readability of l2 texts? In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 1265–1274.
  • Voita et al. (2019) Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380.
  • Walker et al. (1997) Marilyn A Walker, Diane J Litman, Candace A Kamm, and Alicia Abella. 1997. Paradise: A framework for evaluating spoken dialogue agents. arXiv preprint cmp-lg/9704004.
  • Wang and Cho (2019) Alex Wang and Kyunghyun Cho. 2019. Bert has a mouth, and it must speak: Bert as a markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36.
  • Wang et al. (2019a) Dilin Wang, Chengyue Gong, and Qiang Liu. 2019a. Improving neural language modeling via adversarial training. In International Conference on Machine Learning, pages 6555–6565.
  • Wang et al. (2019b) Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal, and Yi Luan. 2019b. Paperrobot: Incremental draft generation of scientific ideas. arXiv preprint arXiv:1905.07870.
  • Wang et al. (2016a) Tong Wang, Ping Chen, John Rochford, and Jipeng Qiang. 2016a. Text simplification using neural machine translation. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Wang et al. (2016b) Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016b. Character: Translation edit rate on character level. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 505–510.
  • Wang et al. (2019c) Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019c. Topic-guided variational auto-encoder for text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 166–177.
  • Welleck et al. (2019a) Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. 2019a. Non-monotonic sequential text generation. arXiv preprint arXiv:1902.02192.
  • Welleck et al. (2019b) Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019b. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.
  • Welleck et al. (2018) Sean Welleck, Zixin Yao, Yu Gai, Jialin Mao, Zheng Zhang, and Kyunghyun Cho. 2018. Loss functions for multiset prediction. In Advances in Neural Information Processing Systems, pages 5783–5792.
  • Werbos (1989) P Werbos. 1989. Backpropagation through time: What it is and how to do it. Proceedings of the IEEE.
  • Weston et al. (2014) Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.
  • Wiese et al. (2017) Georg Wiese, Dirk Weissenborn, and Mariana Neves. 2017. Neural domain adaptation for biomedical question answering. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 281–289.
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
  • Williams and Zipser (1989) Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280.
  • Wiseman and Rush (2016) Sam Wiseman and Alexander M Rush. 2016. Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960.
  • Wiseman et al. (2017) Sam Wiseman, Stuart M Shieber, and Alexander M Rush. 2017. Challenges in data-to-document generation. arXiv preprint arXiv:1707.08052.
  • Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. Transfertransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
  • Wu et al. (2016a) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016a. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Wu et al. (2016b) Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Russ R Salakhutdinov. 2016b. On multiplicative integration with recurrent neural networks. In Advances in neural information processing systems, pages 2856–2864.
  • Wubben et al. (2012) Sander Wubben, Antal Van Den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 1015–1024. Association for Computational Linguistics.
  • Xia et al. (2017) Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, pages 1784–1794.
  • Xiao et al. Liqiang Xiao, Lu Wang, Hao He, and Yaohui Jin. Copy or rewrite: Hybrid summarization with hierarchical reinforcement learning.
  • Xing et al. (2016) Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2016. Topic augmented neural response generation with a joint attention mechanism. arXiv preprint arXiv:1606.08340, 2(2).
  • Xingjian et al. (2015) SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810.
  • Xu and Durrett (2019) Jiacheng Xu and Greg Durrett. 2019. Neural extractive text summarization with syntactic compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3283–3294.
  • Xu et al. (2018a) Jingjing Xu, Xuancheng Ren, Junyang Lin, and Xu Sun. 2018a. Dp-gan: Diversity-promoting generative adversarial network for generating informative and diversified text. arXiv preprint arXiv:1802.01345.
  • Xu et al. (2018b) Jingjing Xu, Xuancheng Ren, Yi Zhang, Qi Zeng, Xiaoyan Cai, and Xu Sun. 2018b. A skeleton-based model for promoting coherence among sentences in narrative story generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4306–4315.
  • Xu et al. (2018c) Jingjing Xu, Xu Sun, Qi Zeng, Xiaodong Zhang, Xuancheng Ren, Houfeng Wang, and Wenjie Li. 2018c. Unpaired sentiment-to-sentiment translation: A cycled reinforcement learning approach. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 979–988.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057.
  • Xu et al. (2019) Peng Xu, Chien-Sheng Wu, Andrea Madotto, and Pascale Fung. 2019. Clickbait? sensational headline generation with auto-tuned reinforcement learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3056–3066.
  • Xu et al. (2016) Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.
  • Yan et al. (2013) Rui Yan, Han Jiang, Mirella Lapata, Shou-De Lin, Xueqiang Lv, and Xiaoming Li. 2013. I, poet: automatic chinese poetry composition through a generative summarization framework under constrained optimization. In Twenty-Third International Joint Conference on Artificial Intelligence.
  • Yang et al. (2016a) Qian Yang, Rebecca J Passonneau, and Gerard De Melo. 2016a. Peak: Pyramid evaluation via automated knowledge extraction. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Yang et al. (2018a) Yang Yang, Jie Zhou, Jiangbo Ai, Yi Bin, Alan Hanjalic, Heng Tao Shen, and Yanli Ji. 2018a. Video captioning by adversarial lstm. IEEE Transactions on Image Processing, 27(11):5600–5611.
  • Yang et al. (2018b) Yilin Yang, Liang Huang, and Mingbo Ma. 2018b. Breaking the beam search curse: A study of (re-) scoring methods and stopping criteria for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3054–3059.
  • Yang et al. (2018c) Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018c. Improving neural machine translation with conditional sequence generative adversarial nets. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1346–1355.
  • Yang et al. (2016b) Zhilin Yang, Ye Yuan, Yuexin Wu, William W Cohen, and Russ R Salakhutdinov. 2016b. Review networks for caption generation. In Advances in neural information processing systems, pages 2361–2369.
  • Yang et al. (2018d) Zichao Yang, Zhiting Hu, Chris Dyer, Eric P Xing, and Taylor Berg-Kirkpatrick. 2018d. Unsupervised text style transfer using language models as discriminators. In Advances in Neural Information Processing Systems, pages 7287–7298.
  • Yang et al. (2017) Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3881–3890. JMLR. org.
  • Yao et al. (2015) Kaisheng Yao, Trevor Cohn, Katerina Vylomova, Kevin Duh, and Chris Dyer. 2015. Depth-gated lstm. arXiv preprint arXiv:1508.03790.
  • Yao et al. (2019) Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7378–7385.
  • Yeung et al. (2016) Serena Yeung, Anitha Kannan, Yann Dauphin, and Li Fei-Fei. 2016. Epitomic variational autoencoders.
  • Yi et al. (2018) Xiaoyuan Yi, Maosong Sun, Ruoyu Li, and Wenhao Li. 2018. Automatic poetry generation with mutual reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3143–3153.
  • Yin et al. (2015) Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2015. Neural generative question answering. arXiv preprint arXiv:1512.01337.
  • Yin et al. (2016) Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2016. Neural generative question answering. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2972–2978.
  • Yogatama et al. (2019) Dani Yogatama, Cyprien de Masson d’Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.
  • Yogatama et al. (2018) Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. 2018. Memory architectures in recurrent neural network language models.
  • Young (1999) R Michael Young. 1999. Using grice’s maxim of quantity to select the content of plan descriptions. Artificial Intelligence, 115(2):215–256.
  • Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858.
  • Yu et al. (2019) Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. 2019. A review of recurrent neural networks: Lstm cells and network architectures. Neural computation, 31(7):1235–1270.
  • Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.
  • Zeiler and Fergus (2014) Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer.
  • Zhang et al. (2016a) Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. 2016a. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 521–530.
  • Zhang et al. (2019a) Haoyu Zhang, Jingjing Cai, Jianjun Xu, and Ji Wang. 2019a. Pretraining-based natural language generation for text summarization. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 789–797.
  • Zhang et al. (2018a) Jiaping Zhang, Tiancheng Zhao, and Zhou Yu. 2018a. Multimodal hierarchical reinforcement learning policy for task-oriented visual dialog. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 140–150.
  • Zhang et al. (2017a) Jiyuan Zhang, Yang Feng, Dong Wang, Yang Wang, Andrew Abel, Shiyue Zhang, and Andi Zhang. 2017a. Flexible and creative chinese poetry generation using neural memory. arXiv preprint arXiv:1705.03773.
  • Zhang et al. (2016b) Lilin Zhang, Zhen Weng, Wenyan Xiao, Jianyi Wan, Zhiming Chen, Yiming Tan, Maoxi Li, and Mingwen Wang. 2016b. Extract domain-specific paraphrase from monolingual corpus for automatic evaluation of machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 511–517.
  • Zhang et al. (2019b) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019b. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
  • Zhang and Lapata (2014) Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670–680.
  • Zhang et al. (2018b) Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. 2018b. Neural latent extractive document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 779–784.
  • Zhang et al. (2018c) Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. 2018c. Generating informative and diverse conversational responses via adversarial information maximization. In Advances in Neural Information Processing Systems, pages 1810–1820.
  • Zhang et al. (2016c) Yizhe Zhang, Zhe Gan, and Lawrence Carin. 2016c. Generating text via adversarial training. In NIPS workshop on Adversarial Training, volume 21.
  • Zhang et al. (2017b) Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. 2017b. Adversarial feature matching for text generation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4006–4015. JMLR. org.
  • Zhang et al. (2020) Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan. 2020. Pointer: Constrained text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558.
  • Zhang et al. (2018d) Zhirui Zhang, Shuo Ren, Shujie Liu, Jianyong Wang, Peng Chen, Mu Li, Ming Zhou, and Enhong Chen. 2018d. Style transfer as unsupervised machine translation. arXiv preprint arXiv:1808.07894.
  • Zhao et al. (2018a) Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M Rush, and Yann LeCun. 2018a. Adversarially regularized autoencoders. In 35th International Conference on Machine Learning, ICML 2018, pages 9405–9420. International Machine Learning Society (IMLS).
  • Zhao et al. (2016) Jun Zhao, Kang Liu, and Liheng Xu. 2016. Sentiment analysis: mining opinions, sentiments, and emotions.
  • Zhao et al. (2018b) Sanqiang Zhao, Rui Meng, Daqing He, Andi Saptono, and Bambang Parmanto. 2018b. Integrating transformer and paraphrase rules for sentence simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3164–3173.
  • Zhao and Zhang (2018) Shenjian Zhao and Zhihua Zhang. 2018. Attention-via-attention neural machine translation. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Zhao and Eskenazi (2016) Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 1–10.
  • Zhao and Eskenazi (2018) Tiancheng Zhao and Maxine Eskenazi. 2018. Zero-shot dialog generation with cross-domain latent actions. arXiv preprint arXiv:1805.04803.
  • Zhao et al. (2019) Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578.
  • Zheng et al. (2019) Yinhe Zheng, Guanyi Chen, Minlie Huang, Song Liu, and Xuan Zhu. 2019. Personalized dialogue generation with diversified traits. arXiv preprint arXiv:1901.09672.
  • Zhou and Neubig (2017) Chunting Zhou and Graham Neubig. 2017. Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 310–320.
  • Zhou et al. (2015) Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. 2015. A c-lstm neural network for text classification. arXiv preprint arXiv:1511.08630.
  • Zhou et al. (2018) Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Zhou et al. (2017) Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1095–1104.
  • Zhou et al. (2019) Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li F Fei-Fei, and Michael Bernstein. 2019. Hype: A benchmark for human eye perceptual evaluation of generative models. In Advances in Neural Information Processing Systems, pages 3444–3456.
  • Zhou et al. (2020) Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, and Ming Zhou. 2020. Self-adversarial learning with comparative discrimination for text generation. arXiv preprint arXiv:2001.11691.
  • Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232.
  • Zhu et al. (2015) Xiaodan Zhu, Parinaz Sobihani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In International Conference on Machine Learning, pages 1604–1612.
  • Zhu et al. (2018) Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1097–1100. ACM.
  • Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
  • Zong and Zhu (2014) Alfred Zong and Yuke Zhu. 2014. Strokebank: Automating personalized chinese handwriting generation. In Twenty-Sixth IAAI Conference.