
Diverse Text Generation via Variational Encoder-Decoder Models with Gaussian Process Priors

Wanyu Du1 , Jianqiao Zhao2 , Liwei Wang2 , Yangfeng Ji1
1Department of Computer Science, University of Virginia
2Department of Computer Science and Engineering, The Chinese University of Hong Kong
{wd5jq,yangfeng}@virginia.edu
{jqzhao,lwwang}@cse.cuhk.edu.hk
Abstract

Generating high quality texts with high diversity is important for many NLG applications, but current methods mostly focus on building deterministic models to generate higher quality texts and do not provide many options for promoting diversity. In this work, we present a novel latent structured variable model that generates high quality texts by enriching contextual representation learning of encoder-decoder models. Specifically, we introduce a stochastic function to map deterministic encoder hidden states into random context variables. The proposed stochastic function is sampled from a Gaussian process prior in order to (1) provide an infinite number of joint Gaussian distributions over the random context variables (diversity-promoting) and (2) explicitly model the dependency between context variables (accurate encoding). To address the learning challenge of Gaussian processes, we propose an efficient variational inference approach to approximate the posterior distribution of the random context variables. We evaluate our method on two typical text generation tasks: paraphrase generation and text style transfer. Experimental results on benchmark datasets demonstrate that our method improves generation quality and diversity compared with other baselines. Code and data are available at https://github.com/wyu-du/GP-VAE.

1 Introduction

Generating high quality texts with high diversity is an important requirement for many text generation applications, such as paraphrase generation (Prakash et al., 2016; Li et al., 2018), style transfer (Jhamtani et al., 2017; Rao and Tetreault, 2018), dialog generation (Sordoni et al., 2015; Serban et al., 2016), etc. The encoder-decoder framework (Cho et al., 2014; Sutskever et al., 2014) is widely adopted (Bahdanau et al., 2014; Luong et al., 2015; Gu et al., 2016; See et al., 2017; Raffel et al., 2019) to generate high-quality texts, where an encoder is applied to learn contextual information from source texts and a decoder is used to generate texts for target tasks. To improve the diversity of generated texts, prior works propose to introduce variations into either the encoder (Bahuleyan et al., 2018; Deng et al., 2018; Wang and Wan, 2019; Cho et al., 2019; Qian and Cheung, 2019; Wu et al., 2020; Duan et al., 2020; Sun et al., 2021) or the decoder (Vijayakumar et al., 2016; Holtzman et al., 2019; He et al., 2018; Shen et al., 2019). However, it is difficult to incorporate meaningful variations into encoder-decoder models without hurting the quality of generated texts.

(a) With a normal Gaussian prior
(b) With a Gaussian process prior
Figure 1: A simple illustration of a variational encoder-decoder model with (a) a normal Gaussian prior and (b) a Gaussian process prior. With a normal Gaussian prior, each hidden state 𝒉i\bm{h}_{i} is mapped into a random vector 𝒛i\bm{z}_{i} independently, while a Gaussian process prior imposes dependency constraints among {𝒛i}\{\bm{z}_{i}\}. Double circles on {𝒉i}\{\bm{h}_{i}\} indicate that they are deterministic variables.

Promoting diversity at the encoder side mainly focuses on modelling the probabilistic distribution of contextual representations. Some prior works (Deng et al., 2018; Bahuleyan et al., 2018) propose to model attention alignments between encoder and decoder hidden states as latent variables, and generate diverse texts by sampling from the latent attention variables. Other existing works (Wang and Wan, 2019; Liu and Liu, 2019; Shinoda et al., 2021; Sun et al., 2021) directly apply conditional variational autoencoders to model encoder hidden states as latent variables, and generate high-diversity texts by sampling from the latent context variables. However, when modelling the latent variables, they treat each latent variable as independent of the others, which inevitably causes the loss of some contextual information during learning, as shown in Figure 1(a).

Other works turn toward designing diversity-promoting decoding strategies at the decoder side, such as diverse beam search (Vijayakumar et al., 2016), top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2019). For these decoding strategies, however, there is often a trade-off between quality and diversity, and the generation models have to sacrifice quality for higher diversity. Another line of work suggests learning a mixture of expert encoders (Cho et al., 2019) or decoders (He et al., 2018; Shen et al., 2019), and generating diverse texts by sampling from different encoders or decoders. While different expert encoders or decoders can introduce some diversity, the model capacity is limited by the pre-defined set of experts.

In this work, we propose a novel approach that introduces context-aware variations into the encoder in order to generate high-quality and high-diversity texts. For an encoder-decoder model, we introduce a stochastic function to map deterministic encoder hidden states {𝒉i}\{\bm{h}_{i}\} into a set of random context variables {𝒛i}\{\bm{z}_{i}\}. The advantage of this stochastic function is that it explicitly models the dependency between context variables, as shown in Figure 1(b), which helps preserve more semantic information from source texts. During generation, the decoder generates diverse outputs conditioned on different sampled context variables. In other words, by learning a stochastic function on top of one deterministic encoder, the proposed approach offers many versions of random context variables for a decoder to generate diverse texts.

To learn the stochastic function over hidden states, we propose a Gaussian process prior (Rasmussen and Williams, 2006, GP) to model the joint distribution of all encoder hidden states. The major differences between GP priors and the priors used in previous works (Bahuleyan et al., 2018; Deng et al., 2018; Wang and Wan, 2019; Cho et al., 2019; Wu et al., 2020; Duan et al., 2020; Shinoda et al., 2021; Sun et al., 2021) are two-fold: (1) GP priors explicitly model the dependency between latent variables of varying sizes, as illustrated in Figure 1(b), while previous works treat latent variables as independent of each other, as shown in Figure 1(a); (2) GP priors provide an infinite number of joint Gaussian distributions of latent variables, as shown in Figure 2(b), while previous works have to pre-define a fixed set of Gaussian distributions (e.g. a standard normal distribution, or a mixture of Gaussian distributions) with the risk of experiencing the posterior collapse problem (Bowman et al., 2016; Kim et al., 2018; Dieng et al., 2019). Besides, the proposed random function only introduces variations into the encoder, and is orthogonal to diversity-promoting decoding strategies at the decoder side. Users can freely adopt different decoding strategies to further encourage diverse generation outputs.

The major contributions of this work are three-fold:

  1. We propose a novel method to introduce context-aware variations into encoder-decoder models, which helps the model learn rich contextual representations and also promotes diversity in generation.

  2. We propose an efficient variational inference method to approximate the joint distribution of fully-connected random context variables.

  3. We test the proposed method in both LSTM-based (See et al., 2017) and Transformer-based (Raffel et al., 2019) encoder-decoder models on paraphrase generation and style transfer tasks. Empirical results show that, on one hand, the proposed method generates higher quality texts than deterministic encoder-decoder models and conditional variational autoencoders; on the other hand, it also supports diverse generation by conditioning on different sets of sampled random context variables.

(a) Variational encoder-decoder model with a normal Gaussian prior. Note that we simplify the latent variable 𝒛i\bm{z}_{i} from a vector to a scalar in order to plot out the Gaussian distribution for better illustration.
(b) Variational encoder-decoder model with a GP prior. Note that we simplify the latent variable 𝒛i\bm{z}_{i} from a vector to a scalar in order to plot out the joint Gaussian distribution for better illustration.
Figure 2: A simple illustration for comparison between our GP priors and the priors in conditional variational autoencoders Sohn et al. (2015) under the variational encoder-decoder framework.

2 Model Description

This section describes our novel latent structured variable model, which learns rich context representations by transforming the hidden states from a deterministic encoder into random hidden states via stochastic functions.

2.1 Encoding with Stochastic Functions

Let 𝒙1:N={𝒙i}i=1N\bm{x}_{1:N}=\{\bm{x}_{i}\}_{i=1}^{N} be the source sentence of length NN, and 𝒚1:T={𝒚t}t=1T\bm{y}_{1:T}=\{\bm{y}_{t}\}_{t=1}^{T} be the target sentence of length TT. In encoder-decoder models, an encoder is used to obtain deterministic context representations of the source sentence, i.e. the encoder hidden states: 𝒉1:N=fenc(𝒙1:N)\bm{h}_{1:N}=f_{enc}(\bm{x}_{1:N}), where fenc()f_{enc}(\cdot) is a nonlinear transition function implemented by LSTM (Sutskever et al., 2014) or Transformer (Vaswani et al., 2017).

To introduce context-aware variations into the encoder, we propose to learn a stochastic function that maps the deterministic hidden states to random variables. Specifically, after computing the hidden states 𝒉1:N\bm{h}_{1:N} from the transition function fenc()f_{enc}(\cdot), the proposed method employs a stochastic mapping function g()g(\cdot) to model the deterministic context representations as a series of random context variables:

p(𝒛1:N𝒉1:N)=g(𝒉1:N)+ϵp(\bm{z}_{1:N}\mid\bm{h}_{1:N})=g(\bm{h}_{1:N})+\bm{\epsilon} (1)

where ϵ𝒩(𝟎,σ2𝐈)\bm{\epsilon}\sim\mathcal{N}(\bm{0},\sigma^{2}\mathbf{I}) is Gaussian noise. Then, the decoder can generate diverse texts conditioned on different sets of context variables sampled from p(𝒛1:N𝒉1:N)p(\bm{z}_{1:N}\mid\bm{h}_{1:N}), as shown in Figure 2(b). Considering that natural language texts are always context-dependent, we expect the random context variables 𝒛1:N\bm{z}_{1:N} to encode the context dependency to some extent. In other words, the distribution of 𝒛i\bm{z}_{i} will not only depend on 𝒉i\bm{h}_{i}, but also on the other {𝒛j}ji\{\bm{z}_{j}\}_{j\not=i}, as shown in Figure 1(b).

Under this framework, variational encoder-decoder models (Bahuleyan et al., 2018; Deng et al., 2018; Wang and Wan, 2019) can be viewed as a special case, as illustrated in Figure 2(a), where the random context variables {𝒛i}i=1N\{\bm{z}_{i}\}_{i=1}^{N} are independent of each other. In this work, we consider this special case as generation with normal priors. An empirical comparison between normal priors and GP priors is given in section 4.

2.2 Gaussian Process Priors for Stochastic Functions

The learning of stochastic function g(𝒉)g(\bm{h}) is the key for the proposed method to be successful. Intuitively, we design g(𝒉)g(\bm{h}) to satisfy two constraints simultaneously: (1) it can introduce some variation to the deterministic encoder hidden states; (2) it should preserve the contextual information in the deterministic encoder hidden states to be a faithful representation.

In this work, we propose to learn g(𝒉)g(\bm{h}) with a functional prior defined by Gaussian processes. As shown in Figure 3(a) in Appendix A, we can sample very different functions g(𝒉)g(\bm{h}) from the same GP prior, which ensures randomness when sampling 𝒛1:N\bm{z}_{1:N}. (Please refer to Appendix A and Rasmussen and Williams (2006) for a detailed introduction to Gaussian processes.) We define the stochastic function g(𝒉)g(\bm{h}) following a GP prior:

g(𝒉)𝒢𝒫(m(𝒉),k(𝒉,𝒉))g(\bm{h})\sim\mathcal{GP}(m(\bm{h}),k(\bm{h},\bm{h}^{\prime})) (2)

with the mean function m(𝒉)m(\bm{h}) and covariance function k(𝒉,𝒉)k(\bm{h},\bm{h}^{\prime}) as

m(𝒉)=𝒉k(𝒉,𝒉)=v2exp{𝒉𝒉222r2}\begin{split}m(\bm{h})&=\bm{h}\\ k(\bm{h},\bm{h}^{\prime})&=v^{2}\exp\{-\frac{\|\bm{h}-\bm{h}^{\prime}\|_{2}^{2}}{2r^{2}}\}\end{split} (3)

where 𝒉\bm{h} indicates the current observed encoder hidden state, and 𝒉\bm{h}^{\prime} indicates the other contextual encoder hidden states; vv controls the average distance between a sampled function g(𝒉)g(\bm{h}) and the mean function m(𝒉)m(\bm{h}), and rr controls the covariance between random variables: increasing rr makes 𝒛\bm{z} and 𝒛\bm{z}^{\prime} more correlated. In this work, vv and rr are chosen based on the text generation performance on the development sets. By setting m(𝒉)=𝒉m(\bm{h})=\bm{h}, we actually define a semi-parametric GP prior (Murphy, 2012, Sec. 15.2.6) instead of a fully non-parametric prior, since 𝒉\bm{h} as a hidden state is computed from the deterministic encoder with learnable parameters. The intuition behind this definition is that, although we want to introduce some variations, the expectation of the sampled random states 𝒛\bm{z} should still be 𝒉\bm{h}.
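To make this concrete, the following NumPy sketch shows one way to draw a set of random context variables 𝒛1:N from the semi-parametric GP prior in Equations 2 and 3. It is an illustrative sketch rather than the released code: the function names, the treatment of each hidden dimension as an independent draw from the same joint Gaussian, and the default values of v, r, and the noise level are our own assumptions.

```python
import numpy as np

def rbf_kernel(H, v=1.0, r=1.0):
    """Squared-exponential kernel k(h, h') from Eq. 3 over N hidden states.

    H: array of shape (N, d) holding the deterministic encoder hidden states.
    Returns the (N, N) covariance matrix of the GP prior.
    """
    sq_dists = np.sum((H[:, None, :] - H[None, :, :]) ** 2, axis=-1)
    return v ** 2 * np.exp(-sq_dists / (2.0 * r ** 2))

def sample_context_variables(H, v=1.0, r=1.0, noise_std=0.1, n_samples=1):
    """Draw z_{1:N} ~ g(h_{1:N}) + eps with g ~ GP(m(h) = h, k).

    Assumption for this sketch: each hidden dimension is sampled independently
    from the same joint Gaussian N(H[:, j], K + noise_std^2 I).
    """
    N, d = H.shape
    K = rbf_kernel(H, v=v, r=r) + (noise_std ** 2) * np.eye(N)
    L = np.linalg.cholesky(K)                  # K = L L^T
    eps = np.random.randn(n_samples, N, d)     # standard normal draws
    # mean function m(h) = h, so each sample is H plus correlated noise L @ eps
    return H[None, :, :] + np.einsum('nm,smd->snd', L, eps)

# Usage: H = np.random.randn(12, 512); Z = sample_context_variables(H, n_samples=3)
```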

The main advantage of applying GP priors is that we can sample an infinite number of random functions g(𝒉)g(\bm{h}), thus obtaining infinitely many sets of random context variables 𝒛1:N\bm{z}_{1:N}, as illustrated in Figure 2(b). In contrast, standard variational encoder-decoder models can only learn a fixed set of CC joint distributions p(𝒛1:N|𝒉1:N)p(\bm{z}_{1:N}|\bm{h}_{1:N}), where 1C1\leq C\ll\infty. (When C=1C=1, this corresponds to a conventional variational autoencoder (Bowman et al., 2016); when C=5C=5, it corresponds to a variational autoencoder with a mixture-of-Gaussians prior with five components; when CC\to\infty, it corresponds to a variational autoencoder with a GP prior.)

2.3 Generation with Random Context Variables

In this section, we demonstrate how to incorporate 𝒛1:N\bm{z}_{1:N} into two typical encoder-decoder models for text generation: an LSTM-based encoder-decoder model See et al. (2017) and a Transformer-based encoder-decoder model (Raffel et al., 2019). The performance of these two variational encoder-decoder models will be evaluated in section 4.

Given the deterministic encoder hidden states 𝒉1:N\bm{h}_{1:N}, we first sample a function g(𝒉)g(\bm{h}) from the GP prior in Equation 2; then sample a set of random context variables 𝒛1:N\bm{z}_{1:N} from g(𝒉)g(\bm{h}); and finally generate an output sentence 𝒚1:T\bm{y}_{1:T} based on the sampled 𝒛1:N\bm{z}_{1:N}. The generative story with random context variables 𝒛1:N\bm{z}_{1:N} is detailed in algorithm 1.

For an LSTM-based encoder-decoder model See et al. (2017), we apply the attention mechanism (Bahdanau et al., 2014) over the random context variables {𝒛i}i=1N\{\bm{z}_{i}\}_{i=1}^{N} to construct 𝒄t\bm{c}_{t} for the decoder. At each decoding time step tt, the decoder computes the attention vector 𝒄t\bm{c}_{t} and decoder hidden state 𝒔t\bm{s}_{t} as follows:

αti\displaystyle\alpha_{ti} =\displaystyle= exp(a(𝒔t1,𝒛i))j=1Nexp(a(𝒔t1,𝒛j))\displaystyle\frac{\exp{(a(\bm{s}_{t-1},\bm{z}_{i}))}}{\sum_{j=1}^{N}\exp{(a(\bm{s}_{t-1},\bm{z}_{j}))}} (4)
𝒄t\displaystyle\bm{c}_{t} =\displaystyle= i=1Nαti𝒛i\displaystyle\sum_{i=1}^{N}\alpha_{ti}\cdot\bm{z}_{i} (5)
𝒔t\displaystyle\bm{s}_{t} =\displaystyle= fdec(𝒔t1,𝒚t1,𝒄t)\displaystyle f_{dec}(\bm{s}_{t-1},\bm{y}_{t-1},\bm{c}_{t}) (6)

where a(𝒔t1,𝒛i)=𝒗atanh(𝐖a𝒔t1+𝐔a𝒛i)a(\bm{s}_{t-1},\bm{z}_{i})=\bm{v}_{a}^{\top}\text{tanh}(\mathbf{W}_{a}\bm{s}_{t-1}+\mathbf{U}_{a}\bm{z}_{i}), 𝐖a\mathbf{W}_{a}, 𝐔a\mathbf{U}_{a} and 𝒗a\bm{v}_{a}^{\top} are parameter matrices. Finally, the decoder outputs a word distribution based on the context representations and previous decoded words at each decoding time step tt:

p(𝒚t𝒚t1,𝒛1:N)=softmax(𝐖b𝒔t)p(\bm{y}_{t}\mid\bm{y}_{t-1},\bm{z}_{1:N})=\text{softmax}(\mathbf{W}_{b}\cdot\bm{s}_{t}) (7)

where 𝐖b\mathbf{W}_{b} is a parameter matrix.
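A minimal PyTorch sketch of the additive attention over the random context variables (Equations 4 and 5) is given below. The module name, tensor shapes, and dimension arguments are illustrative assumptions, not the authors' implementation; the resulting context vector would be fed, together with the previous word, into the LSTM decoder step of Equation 6.

```python
import torch
import torch.nn as nn

class AttentionOverZ(nn.Module):
    """Additive attention over random context variables z_{1:N} (Eqs. 4-5)."""

    def __init__(self, dec_dim, z_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U_a = nn.Linear(z_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, z):
        # s_prev: (batch, dec_dim) previous decoder state s_{t-1}
        # z:      (batch, N, z_dim) sampled random context variables
        scores = self.v_a(torch.tanh(self.W_a(s_prev).unsqueeze(1) + self.U_a(z)))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)   # (batch, N), Eq. 4
        c_t = torch.bmm(alpha.unsqueeze(1), z).squeeze(1)   # (batch, z_dim), Eq. 5
        return c_t, alpha
```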

1:  Input: A source sentence 𝒙1:N\bm{x}_{1:N}
2:  Output: A generated sentence 𝒚1:T\bm{y}_{1:T}
3:  // Encode context
4:  Initialize 𝒉0𝟎\bm{h}_{0}\leftarrow\bm{0}
5:  for i=1,,Ni=1,\dots,N do
6:     Compute 𝒉i=fenc(𝒉i1,𝒙i)\bm{h}_{i}=f_{enc}(\bm{h}_{i-1},\bm{x}_{i})
7:  end for
8:  // Sample random context variables
9:  Draw g(𝒉)𝒢𝒫(m(𝒉),k(𝒉,𝒉))g(\bm{h})\sim\mathcal{GP}(m(\bm{h}),k(\bm{h},\bm{h}^{\prime}))
10:  Draw 𝒛1:Ng(𝒉1:N)+ϵ\bm{z}_{1:N}\sim g(\bm{h}_{1:N})+\bm{\epsilon}
11:  // Generate a new sentence
12:  Initialize 𝒔0𝟎\bm{s}_{0}\leftarrow\bm{0}
13:  for t=1,,Tt=1,\dots,T do
14:     Compute 𝒔t=fdec(𝒔t1,𝒚t1,𝒛1:N)\bm{s}_{t}=f_{dec}(\bm{s}_{t-1},\bm{y}_{t-1},\bm{z}_{1:N})
15:     Draw 𝒚tsoftmax(𝐖𝒔t)\bm{y}_{t}\sim\text{softmax}(\mathbf{W}\cdot\bm{s}_{t})
16:  end for
Algorithm 1 The generative story with a stochastic function g()g(\cdot) sampled from the GP prior

For a Transformer-based encoder-decoder model (Raffel et al., 2019), we take the output of the last layer in the encoder as 𝒉1:N\bm{h}_{1:N}, and feed them into g(𝒉)g(\bm{h}) to get random context variables 𝒛1:N\bm{z}_{1:N}. For the decoder, the inputs 𝐊\mathbf{K} and 𝐕\mathbf{V} are the combination of 𝒉1:N\bm{h}_{1:N} and 𝒛1:N\bm{z}_{1:N}:

𝐊l=𝐕l\displaystyle\mathbf{K}^{l}=\mathbf{V}^{l} =\displaystyle= 𝐖z[𝒛1:N;𝒉1:N]\displaystyle\mathbf{W}_{z}[\bm{z}_{1:N};\bm{h}_{1:N}] (8)
𝐀\displaystyle\mathbf{A} =\displaystyle= MultiHead(𝐒l1,𝐊l,𝐕l)\displaystyle\text{MultiHead}(\mathbf{S}^{l-1},\mathbf{K}^{l},\mathbf{V}^{l}) (9)
𝐁\displaystyle\mathbf{B} =\displaystyle= LayerNorm(𝐀+𝐒l1)\displaystyle\text{LayerNorm}(\mathbf{A}+\mathbf{S}^{l-1}) (10)
𝐒l\displaystyle\mathbf{S}^{l} =\displaystyle= LayerNorm(FFN(𝐁)+𝐁)\displaystyle\text{LayerNorm}(\text{FFN}(\mathbf{B})+\mathbf{B}) (11)

where 𝐒l={𝒔t}t=1T\mathbf{S}^{l}=\{\bm{s}_{t}\}_{t=1}^{T} is the last layer of decoder hidden states, and MultiHead()\text{MultiHead}(\cdot), LayerNorm()\text{LayerNorm}(\cdot) and FFN()\text{FFN}(\cdot) follow the standard implementation in (Vaswani et al., 2017). The word distribution p(𝒚t𝒚t1,𝒛1:N)p(\bm{y}_{t}\mid\bm{y}_{t-1},\bm{z}_{1:N}) at each decoding time step tt is computed the same way as in Equation 7.
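The sketch below illustrates how Equations 8-11 combine 𝒉1:N and 𝒛1:N into shared keys and values for decoder cross-attention. It uses PyTorch's generic multi-head attention rather than the actual T5 layers, and the module and parameter names are our own assumptions.

```python
import torch
import torch.nn as nn

class LatentCrossAttentionBlock(nn.Module):
    """One decoder cross-attention block with K = V = W_z [z_{1:N}; h_{1:N}] (Eqs. 8-11)."""

    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.W_z = nn.Linear(2 * d_model, d_model)   # projects the concatenation [z; h]
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, S_prev, h, z):
        # S_prev: (batch, T, d_model) previous decoder layer states S^{l-1}
        # h, z:   (batch, N, d_model) deterministic states and sampled context variables
        kv = self.W_z(torch.cat([z, h], dim=-1))   # Eq. 8: K^l = V^l
        A, _ = self.attn(S_prev, kv, kv)           # Eq. 9: multi-head attention
        B = self.norm1(A + S_prev)                 # Eq. 10: residual + layer norm
        return self.norm2(self.ffn(B) + B)         # Eq. 11: S^l
```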

3 Efficient Variational Inference

With the observed deterministic hidden states 𝒉1:N\bm{h}_{1:N}, we estimate the GP posterior to make the prediction of context variables 𝒛1:N\bm{z}_{1:N} more accurate. Although the posterior estimation of Gaussian processes can be written in a closed form theoretically, the challenge in this work comes from learning with other parts of the model, such as the deterministic encoder producing 𝒉1:N\bm{h}_{1:N} and the decoder generating 𝒚1:T\bm{y}_{1:T}. To simplify the inference procedure, we will focus on inferring the samples of the GP posterior regarding the hidden states only as p(g𝒉1:N)p(g\mid\bm{h}_{1:N}), which essentially is a Gaussian distribution with non-isotropic covariance. In this work, we apply variational inference to approximate the GP posterior p(g𝒉1:N)p(g\mid\bm{h}_{1:N}) and learn other model parameters jointly with maximum likelihood estimation.

For notation simplicity, we let 𝒉=fenc(𝒙1:N)\bm{h}=f_{enc}(\bm{x}_{1:N}), 𝒛={𝒛i}i=1N\bm{z}=\{\bm{z}_{i}\}_{i=1}^{N}, 𝒚={𝒚i}i=1T\bm{y}=\{\bm{y}_{i}\}_{i=1}^{T} in this section. With a sampled random function g(𝒉)g(\bm{h}) from the GP prior as described in line 9 of algorithm 1, we will get the joint prior distribution p(𝒛𝒉)p(\bm{z}\mid\bm{h}) according to Equation 1. Then we approximate the true posterior p(𝒛𝒉,𝒚)p(\bm{z}\mid\bm{h},\bm{y}) with the variational posterior qϕ(𝒛𝒉,𝒚)q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y}) by maximizing the evidence lower bound of the marginal log-likelihood (ELBo):

logp(𝒚𝒉)𝔼qϕ[logp(𝒚𝒛)]KL[qϕ(𝒛𝒉,𝒚)p(𝒛𝒉)]\begin{split}\log p(\bm{y}\mid\bm{h})\geq&\mathbb{E}_{q_{\bm{\phi}}}[\log p(\bm{y}\mid\bm{z})]\\ &-\text{KL}[q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\|p(\bm{z}\mid\bm{h})]\end{split} (12)

where ϕ\bm{\phi} denotes the variational parameters. The derivation of Equation 12 is presented in Appendix B.

During generation, we propose a two-step approximation to simplify qϕ(𝒛𝒉,𝒚)q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y}). First, to maintain the generative property when using the variational distribution, we approximate the variational distribution as qϕ(𝒛𝒉,𝒚)qϕ(𝒛𝒉)q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\approx q_{\bm{\phi}}(\bm{z}\mid\bm{h}). In this case, the random context vector 𝒛\bm{z} only depends on 𝒉\bm{h} during inference. Second, we apply the mean-field amortized variational approximation (Kingma and Welling, 2014) to approximate the parameters of qϕ(𝒛𝒉)q_{\bm{\phi}}(\bm{z}\mid\bm{h}):

qϕ(𝒛𝒉)=i=1Nqϕ(𝒛i𝒉i)=i=1N𝒩(fμ(𝒉i),fσ2(𝒉i))\begin{split}q_{\bm{\phi}}(\bm{z}\mid\bm{h})&=\prod_{i=1}^{N}q_{\bm{\phi}}(\bm{z}_{i}\mid\bm{h}_{i})\\ &=\prod_{i=1}^{N}\mathcal{N}(f_{\mu}(\bm{h}_{i}),f_{\sigma^{2}}(\bm{h}_{i}))\end{split} (13)

where fμ()f_{\mu}(\cdot) and fσ2()f_{\sigma^{2}}(\cdot) are the mean and covariance in the amortized variational inference network. In this work, we use two simple feed-forward neural networks fμf_{\mu} and fσ2f_{\sigma^{2}}. The implementation details are included in Appendix C.
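As an illustration of the training objective, the sketch below computes the KL term of Equation 12 between the factorized variational posterior of Equation 13 and the GP prior, assuming (as in the earlier sketch) that each hidden dimension is an independent N-dimensional Gaussian sharing the kernel matrix K. This is a simplified sketch of the closed-form Gaussian KL, not the authors' implementation.

```python
import torch

def gp_kl_divergence(mu_q, var_q, h, K, noise_var=1e-2):
    """KL[ q(z|h) || p(z|h) ] for the ELBo in Eq. 12 (illustrative sketch).

    mu_q, var_q: (N, d) mean and diagonal variance of the variational posterior (Eq. 13)
    h:           (N, d) deterministic encoder hidden states (prior mean, m(h) = h)
    K:           (N, N) kernel matrix from Eq. 3
    """
    N, d = mu_q.shape
    S1 = K + noise_var * torch.eye(N, device=K.device)   # prior covariance over positions
    S1_inv = torch.linalg.inv(S1)
    logdet_S1 = torch.logdet(S1)
    kl = 0.0
    for j in range(d):                                   # one N-dim Gaussian per hidden dim
        S0_diag = var_q[:, j]
        diff = h[:, j] - mu_q[:, j]
        trace_term = (torch.diagonal(S1_inv) * S0_diag).sum()
        quad_term = diff @ S1_inv @ diff
        logdet_S0 = torch.log(S0_diag).sum()
        kl = kl + 0.5 * (trace_term + quad_term - N + logdet_S1 - logdet_S0)
    return kl

# Per-example training loss (negative ELBo, Eq. 12):
# loss = reconstruction_nll + gp_kl_divergence(mu_q, var_q, h, K)
```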

Paraphrase Generation: Twitter URL Paraphrasing Corpus
| Original Sentence | Target Paraphrase |
| Amazon only needs a minute of human labor to ship your next package. | Amazon ships your packages in one minute. |
| Amazon only needs a minute of human labor to ship your next package. | Amazon only needs a minute of labor to ship your next package. |

Text Style Transfer: GYAFC Corpus
| Informal Sentence | Formal Sentence |
| I’d say it is punk though. | However, I do believe it to be punk. |
| Gotta see both sides of the story. | You have to consider both sides of the story. |
Table 1: Some example source-target pairs from the Twitter URL Paraphrasing Corpus Lan et al. (2017) and the GYAFC Corpus Rao and Tetreault (2018).

4 Experiments

We evaluate our method on two text generation tasks that require rich contextual representations: paraphrase generation (subsection 4.4) and text style transfer (subsection 4.5). We provide some data examples for the two tasks in Table 1. We compare our method with previous works on diverse text generation in terms of quality and diversity. Empirical results show that our method is able to: (1) adapt to different encoder-decoder architectures, such as the pointer-generator network (See et al., 2017, PG) and the text-to-text transfer Transformer (Raffel et al., 2019, T5); and (2) generate higher quality texts than deterministic encoder-decoder models (See et al., 2017; Raffel et al., 2019) while also enabling diverse generation by conditioning on random context variables.

4.1 Evaluation Methods

Text Quality.

For quality evaluation, we use two commonly used automatic metrics in text generation: METEOR Banerjee and Lavie (2005) and BLEU with up to bi-grams Papineni et al. (2002), which tell us how well the generated outputs match the reference sentences.

Text Diversity.

For diversity evaluation, we aim at examining how well different latent context variables 𝒛1:N\bm{z}_{1:N} from qϕ(𝒛𝒉,𝒚)q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y}) can make the decoder generate diverse outputs. We use self-BLEU with up to bi-grams (Zhu et al., 2018, self-BLEU) to measure the mutual bi-gram overlap among the set of outputs per source sentence; a lower self-BLEU indicates less bi-gram overlap between generated outputs. In addition, we use diverse 4-gram (Deshpande et al., 2019, Div-4) to measure the ratio of distinct 4-grams in the set of outputs per source sentence; a higher Div-4 shows more unique 4-grams among generated outputs. Finally, we use uniqueness (Deshpande et al., 2019, Uni.) to measure the ratio of unique generated sentences in the set of outputs per source sentence; a higher uniqueness suggests that different context variables 𝒛1:N\bm{z}_{1:N} lead to very different output sentences.
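For reference, a simplified Python sketch of these sentence-set diversity statistics is shown below. The pairwise bigram-overlap function is only a rough stand-in for the self-BLEU of Zhu et al. (2018), and all function names are our own.

```python
from itertools import combinations

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def div_4(outputs):
    """Ratio of distinct 4-grams among the set of outputs for one source sentence."""
    all_4grams = [g for sent in outputs for g in ngrams(sent.split(), 4)]
    return len(set(all_4grams)) / max(len(all_4grams), 1)

def uniqueness(outputs):
    """Ratio of unique generated sentences among the outputs for one source sentence."""
    return len(set(outputs)) / len(outputs)

def pairwise_bigram_overlap(outputs):
    """A rough stand-in for self-BLEU: average bigram overlap between output pairs."""
    overlaps = []
    for a, b in combinations(outputs, 2):
        ga, gb = set(ngrams(a.split(), 2)), set(ngrams(b.split(), 2))
        overlaps.append(len(ga & gb) / max(min(len(ga), len(gb)), 1))
    return sum(overlaps) / max(len(overlaps), 1)
```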

4.2 Competitive Baselines

We compare our method with competitive deterministic encoder-decoder models and variational encoder-decoder models as follows.

PG: See et al. (2017) propose the pointer-generator network, which is a strong deterministic LSTM-based encoder-decoder baseline. We refer to their model as PG.

T5: Raffel et al. (2019) propose the text-to-text transfer Transformer, which is a strong deterministic Transformer-based encoder-decoder baseline. We refer to their model as T5.

Variation Attention: Deng et al. (2018) model the deterministic attention vectors as latent alignment variables to promote diverse text generation. We refer to their model as Variation Attention.

Multi-Selectors: Cho et al. (2019) use a mixture of experts to sample different binary masks on the source texts for diverse content generation. We refer to their model as Multi-Selectors.

T-CVAE: Wang and Wan (2019) model deterministic encoder hidden states as latent context variables with a Transformer-based conditional variational autoencoder. We refer to their model as T-CVAE.

PG/T5 + Normal prior: PG or T5 with a normal prior p(𝒛)=𝒩(𝟎,𝐈)p(\bm{z})=\mathcal{N}(\bm{0},\mathbf{I}), which follows the conventional variational autoencoders Kingma and Welling (2014); Bowman et al. (2016).

PG/T5 + GP prior: PG or T5 with our GP prior defined in Equation 2.

Methods | Twitter URL (BLEU↑ / METEOR↑) | GYAFC (E&M) (BLEU↑ / METEOR↑) | GYAFC (F&R) (BLEU↑ / METEOR↑)
Seq2Seq baselines
PG | 0.291 / 0.471 | 0.683 / 0.817 | 0.717 / 0.845
T5 | 0.264 / 0.453 | 0.683 / 0.819 | 0.726 / 0.847
Related works
Multi-Selectors | 0.290 / 0.492 | 0.606 / 0.779 | 0.618 / 0.783
Variation Attention | 0.294 / 0.512 | 0.632 / 0.804 | 0.675 / 0.833
T-CVAE | 0.339 / 0.494 | 0.481 / 0.686 | 0.537 / 0.730
Our works
PG + Normal prior | 0.041 / 0.127 | 0.145 / 0.354 | 0.191 / 0.452
T5 + Normal prior | 0.269 / 0.461 | 0.675 / 0.815 | 0.722 / 0.846
PG + GP prior | 0.307 / 0.483 | 0.681 / 0.828 | 0.734 / 0.849
T5 + GP prior | 0.281 / 0.474 | 0.688 / 0.815 | 0.739 / 0.847
Table 2: Model performance on text quality. To get the best performance of variational encoder-decoder models, we directly take the mean of qϕ(𝒛𝒉,𝒚)q_{\phi}(\bm{z}\mid\bm{h},\bm{y}) as the sampled context variables to generate the output texts. Results of Multi-Selectors (Cho et al., 2019), Variation Attention (Deng et al., 2018) and T-CVAE Wang and Wan (2019) are collected based on the source code provided in the original papers.

4.3 Generation Setups

For the decoding strategy, we use beam search with a beam size of 10. Note that our method is orthogonal to all diversity-promoting decoding strategies, such as top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019). We choose beam search in order to make a fair comparison with other works that promote diversity at the encoder side.

For quality generation, we directly take the mean of qϕ(𝒛|𝒉,𝒚)q_{\bm{\phi}}(\bm{z}|\bm{h},\bm{y}), and generate one 𝒚1:T\bm{y}_{1:T} based on the sampled context variables 𝒛1:N\bm{z}_{1:N}, since we want to examine how well the posterior network can encode contextual information and make the decoder generate high-quality texts.

For diverse generation, we sample different 𝒛1:N\bm{z}_{1:N} (instead of directly taking the mean) from qϕ(𝒛|𝒉,𝒚)q_{\bm{\phi}}(\bm{z}|\bm{h},\bm{y}), and generate different 𝒚1:T\bm{y}_{1:T} based on the sampled context variables 𝒛1:N\bm{z}_{1:N}, since we want to examine how well different latent context variables 𝒛1:N\bm{z}_{1:N} from qϕ(𝒛𝒉,𝒚)q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y}) can make the decoder generate diverse outputs. For experiment setups, we sample 10 different 𝒛1:N\bm{z}_{1:N} and generate 10 different 𝒚1:T\bm{y}_{1:T} correspondingly. We compute the diversity scores following the prior work (Deshpande et al., 2019). To compute the self-BLEU and Div-4, we randomly sample 5 different 𝒚1:T\bm{y}_{1:T} out of the 10 generated 𝒚1:T\bm{y}_{1:T}. To compute the Uni., we compute the unique number of sentences among the 10 generated 𝒚1:T\bm{y}_{1:T}.

In preliminary experiments, we found that sampling from the original variational distribution tends to make the decoder generate the same sentences. We hypothesize that qϕ(𝒛|𝒉,𝒚)q_{\bm{\phi}}(\bm{z}|\bm{h},\bm{y}) is a high-dimensional multivariate Gaussian, and sampling from a high-dimensional distribution is a fundamentally challenging problem. Therefore, we apply a simple heuristic to alleviate the sampling issue: we scale up the covariance matrix of the variational distribution by a numeric scalar. We find this simple heuristic helps the decoder generate more diverse sentences. For PG + Normal prior and PG + GP prior, we set the scalar to 25 for the paraphrase generation task and to 10 for the style transfer task. For T5 + Normal prior and T5 + GP prior, we set the scalar to 7 for the paraphrase generation task and to 4 for the style transfer task.
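A minimal sketch of this covariance-scaling heuristic is shown below, assuming a diagonal variational posterior; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def sample_scaled_posterior(mu, var, scale, n_samples=10):
    """Sample context variables from q(z|h, y) with its covariance scaled up.

    mu, var: (N, d) mean and diagonal variance of the variational posterior
    scale:   numeric scalar applied to the covariance (e.g. 25 for PG on paraphrasing)
    """
    std = (scale * var).sqrt()
    eps = torch.randn(n_samples, *mu.shape)
    return mu.unsqueeze(0) + std.unsqueeze(0) * eps   # (n_samples, N, d)

# Each of the n_samples draws is fed to the decoder (beam search, beam size 10)
# to produce one output sentence, giving n_samples candidate generations.
```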

4.4 Paraphrase Generation

We first evaluate the model’s capability of generating paraphrases using the Twitter URL paraphrasing dataset Lan et al. (2017). In this task, we aim at comparing the quality of generated texts between our method and other competitive baselines. We include the experimental setups in Appendix C.

Dataset.

The Twitter URL paraphrasing dataset Lan et al. (2017) contains both positive and negative examples of paraphrases. We filter out all negative examples from the 2,869,657 candidate pairs collected over one year, and divide the remaining paraphrase pairs into 110K training pairs, 3K testing pairs, and 1K validation pairs.

Methods | GYAFC (E&M): avg-BLEU↑ / self-BLEU↓ / Div-4↑ / Uni.↑ | GYAFC (F&R): avg-BLEU↑ / self-BLEU↓ / Div-4↑ / Uni.↑
T-CVAE | 0.481 / 0.986 / 0.221 / 0.130 | 0.537 / 0.990 / 0.220 / 0.126
T5 + Normal prior | 0.419 / 0.522 / 0.524 / 0.791 | 0.347 / 0.415 / 0.484 / 0.845
T5 + GP prior | 0.329 / 0.395 / 0.727 / 0.898 | 0.252 / 0.295 / 0.748 / 0.910
Table 3: Model performance on text diversity. Avg-BLEU measures the average quality of generated sentences compared with ground-truth references. Self-BLEU measures token-level repetitiveness; a lower self-BLEU indicates higher token-level diversity. Div-4 measures the ratio of unique 4-grams; a higher Div-4 indicates higher token-level diversity. Uni. measures the ratio of unique generated sentences; a higher Uni. indicates higher sentence-level diversity. We include the sampling configuration details in Appendix C.

Result Analysis.

As shown in Table 2, for the quality of generated texts, our method is able to preserve the semantic information from source texts well. For LSTM-based models, PG + GP prior generates better quality texts than both its deterministic baseline PG and the other variational baselines, e.g. Multi-Selectors and Variation Attention. Note that PG + Normal prior experiences the posterior collapse problem (Bowman et al., 2016; Kim et al., 2018; Dieng et al., 2019) during training, which causes the context variables to preserve little semantic information from the source text and the model to generate random tokens during inference. For Transformer-based models, T-CVAE generates better quality texts than T5, T5 + Normal prior and T5 + GP prior. However, T-CVAE lowercases all input and output tokens while the other models keep both lowercase and capital tokens; this text preprocessing step may give T-CVAE an unfairly favorable quality score. Note that the posterior collapse problem does not happen in T5 + Normal prior, and T5 + GP prior still outperforms T5 + Normal prior, which shows the advantage of GP priors in introducing context-aware variations.

4.5 Text Style Transfer

We evaluate our model’s capability of generating stylistic texts using the Grammarly’s Yahoo Answers Formality Corpus (GYAFC) Rao and Tetreault (2018). In this task, we first compare the quality of generated texts between our method and other competitive baselines, then we test the diversity of generated texts between our GP prior and conditional variational autoencoders. We include the experimental setups in Appendix C.

Dataset.

The GYAFC dataset covers two sub-domains: Entertainment & Music (E&M), which has 52,593 training pairs, 2,877 validation pairs, 1,416 testing pairs; and Family & Relationships (F&R), which has 51,967 training pairs, 2,788 validation pairs, 1,332 testing pairs.

Informal Sentence: Your age… Dude that one is old.
Formal References: ["Your age. That one is old."; "You are quite old."; "Wow, that one is very old."; "How old are you?"]
T-CVAE | T5 + Normal Prior | T5 + GP Prior
you are older . | Your age is old. | Your age, that one is old.
you are older . | Your age, that one is old. | Your age, that one is old.
you are older . | You’re your age. No, that one is old. | Your age, and that one is old.
you are older . | You are your age. Due, that one is old. | You are your age, and that one is old.
you are older . | Your age, that one is old……… | Your age, you are not the one who is old.
you are older . | Your age and i…….. | You’re a fool, that one is old.
you are older . | Your age is arbitrary to you……… | Your age doesn’t matter, that one is old.
you are older . | You are a very, jo, you are a very, jo | Your age is due to the fact is very old.
you are older . | You are a ant / a ante / a ante / a ante / | Regardless of your age, he is a young person.
you are older . | Your count count count count count count | I am not sure your age, but that one is old.
Table 4: Sample outputs conditioned on different 𝒛\bm{z} sampled from qϕ(𝒛|𝒙)q_{\phi}(\bm{z}|\bm{x}) on the GYAFC (E&M) test set.

Result Analysis.

For the quality of generated texts, the GP prior makes the model more robust in generating accurate texts. As shown in Table 2, for LSTM-based models, PG + GP prior generates the most accurate texts compared with PG, Multi-Selectors and Variation Attention. Note that PG + Normal prior also experiences the posterior collapse problem on the GYAFC datasets, resulting in very low quality scores on the test set. For Transformer-based models, T5 + GP prior achieves better performance than T5, T5 + Normal prior and T-CVAE, which shows the superiority of GP priors in encoding contextual information.

For the diversity of generated texts, imposing context-aware variations on encoder hidden states is beneficial for generating diverse outputs. As demonstrated in Table 3, for Transformer-based models, T5 + GP prior gives the best diversity performance at both the token level and the sentence level compared with T5 + Normal prior and T-CVAE. The model performance on the style transfer task verifies the capability of our GP prior in promoting generation diversity. Table 4 shows some diverse generation outputs of Transformer-based variational encoder-decoder models. However, we also notice that increasing diversity inevitably causes degradation in quality, because 𝒛1:N\bm{z}_{1:N} are i.i.d. sampled from a high-dimensional multivariate Gaussian qϕ(𝒛𝒉,𝒚)q_{\phi}(\bm{z}\mid\bm{h},\bm{y}). As discussed in previous work (Vono et al., 2022), multivariate sampling in high-dimensional settings can become computationally demanding.

Computation Complexity Analysis.

Our GP priors require more computation during training, where the major cost comes from calculating the full covariance matrix of the context variables under the GP prior. However, during inference, we approximate the GP posterior with a variational posterior qϕ(𝒛𝒉,𝒚)q_{\phi}(\bm{z}\mid\bm{h},\bm{y}) and conduct i.i.d. sampling, which avoids costly multivariate sampling and has the same computational complexity as other conditional variational autoencoder baselines at testing time.

5 Related Works

Diverse text generation.

Related works on diverse text generation mainly focus on changing decoding strategies at the decoder side or introducing randomness at the encoder side. At the decoder side, recent works apply various decoding algorithms to promote diversity, such as diverse beam search Vijayakumar et al. (2016), top-k sampling Fan et al. (2018) and nucleus sampling Holtzman et al. (2019). Our model is orthogonal to these diverse decoding algorithms since we focus on the encoder side. Another group of works He et al. (2018); Shen et al. (2019) propose to use a mixture of decoders to generate multiple outputs, where the context encodings are shared across multiple decoders. At the encoder side, Cho et al. (2019) propose to leverage a mixture of selectors to identify key contents from the source text, where each selector samples a sequence of binary latent variables as a hard attention mask on every source token. Xu et al. (2018) train different pattern embeddings, and generate diverse paraphrases conditioned on different pattern embeddings.

Conditional variational autoencoders.

Variational encoder-decoder models Deng et al. (2018); Bahuleyan et al. (2018); Wang and Wan (2019); Sun et al. (2021) are related to our method. Deng et al. (2018) formulate the attention vector as latent alignment variables, and use the latent variables as hard attention for the decoder to select which source words to focus on during generation. Wang and Wan (2019) present a conditional variational autoencoder based on Transformer, and learn a latent variable for generating diverse texts for the story completion task. Sun et al. (2021) propose a self-separated conditional variational autoencoder that introduces group information to regularize the latent variables, which alleviates the posterior collapse problem and improves the model performance in the dialogue generation task.

6 Conclusion

In this work, we investigate the problem of generating high quality texts for variational encoder-decoder models. We propose a novel stochastic function to introduce context-aware variations into encoder hidden states, which provides the decoder with more diverse contextual representations. To learn this stochastic function, we propose a GP prior to model the dependency between random context variables, and apply an efficient amortized variational inference method to approximate the GP posterior. Experimental results demonstrate that our method can learn a better contextual representation that leads to higher generation quality compared with deterministic encoder-decoder models and conditional variational autoencoders.

References

  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint, arXiv:1409.0473.
  • Bahuleyan et al. (2018) Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. 2018. Variational attention for sequence-to-sequence models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1672–1682.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
  • Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics.
  • Cho et al. (2019) Jaemin Cho, Minjoon Seo, and Hannaneh Hajishirzi. 2019. Mixture content selection for diverse sequence generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3121–3131, Hong Kong, China. Association for Computational Linguistics.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
  • Deng et al. (2018) Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. 2018. Latent alignment and variational attention. In Advances in Neural Information Processing Systems, pages 9712–9724.
  • Deshpande et al. (2019) Aditya Deshpande, Jyoti Aneja, Liwei Wang, Alexander G Schwing, and David Forsyth. 2019. Fast, diverse and accurate image captioning guided by part-of-speech. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10695–10704.
  • Dieng et al. (2019) Adji B. Dieng, Yoon Kim, Alexander M. Rush, and David M. Blei. 2019. Avoiding latent variable collapse with generative skip models. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2397–2405. PMLR.
  • Duan et al. (2020) Yu Duan, Canwen Xu, Jiaxin Pei, Jialong Han, and Chenliang Li. 2020. Pre-train and plug-in: Flexible conditional text generation with variational auto-encoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 253–262, Online. Association for Computational Linguistics.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7).
  • Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.
  • Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.
  • He et al. (2018) Xuanli He, Gholamreza Haffari, and Mohammad Norouzi. 2018. Sequence to sequence mixture model for diverse machine translation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 583–592.
  • Holtzman et al. (2019) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  • Jhamtani et al. (2017) Harsh Jhamtani, Varun Gangal, Eduard Hovy, and Eric Nyberg. 2017. Shakespearizing modern language using copy-enriched sequence to sequence models. In Proceedings of the Workshop on Stylistic Variation, pages 10–19.
  • Kim et al. (2018) Yoon Kim, Sam Wiseman, Andrew Miller, David Sontag, and Alexander Rush. 2018. Semi-amortized variational autoencoders. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2678–2687. PMLR.
  • Kingma and Welling (2014) Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In Proceedings of the International Conference on Representation Learning.
  • Lan et al. (2017) Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1224–1234.
  • Li et al. (2018) Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3865–3878, Brussels, Belgium. Association for Computational Linguistics.
  • Liu and Liu (2019) Danyang Liu and Gongshen Liu. 2019. A transformer-based variational autoencoder for sentence generation. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE.
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
  • Murphy (2012) Kevin P Murphy. 2012. Machine learning: a probabilistic perspective. MIT press.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  • Prakash et al. (2016) Aaditya Prakash, Sadid A Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016. Neural paraphrase generation with stacked residual lstm networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2923–2934.
  • Qian and Cheung (2019) Dong Qian and William K. Cheung. 2019. Enhancing variational autoencoders with mutual information neural estimation for text generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4047–4057, Hong Kong, China. Association for Computational Linguistics.
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.
  • Rao and Tetreault (2018) Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140, New Orleans, Louisiana. Association for Computational Linguistics.
  • Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher K.I. Williams. 2006. Gaussian Processes for Machine Learning. The MIT Press.
  • See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
  • Serban et al. (2016) Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Shen et al. (2019) Tianxiao Shen, Myle Ott, Michael Auli, and Marc’Aurelio Ranzato. 2019. Mixture models for diverse machine translation: Tricks of the trade. arXiv preprint arXiv:1902.07816.
  • Shinoda et al. (2021) Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa. 2021. Improving the robustness of QA models to challenge sets with variational question-answer pair generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 197–214, Online. Association for Computational Linguistics.
  • Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, volume 28, pages 3483–3491. Curran Associates, Inc.
  • Sordoni et al. (2015) Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 553–562.
  • Sun et al. (2021) Bin Sun, Shaoxiong Feng, Yiwei Li, Jiamou Liu, and Kan Li. 2021. Generating relevant and coherent dialogue responses using self-separated conditional variational AutoEncoders. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5624–5637, Online. Association for Computational Linguistics.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Vijayakumar et al. (2016) Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
  • Vono et al. (2022) Maxime Vono, Nicolas Dobigeon, and Pierre Chainais. 2022. High-dimensional gaussian sampling: a review and a unifying approach based on a stochastic proximal point algorithm. SIAM Review, 64(1):3–56.
  • Wang and Wan (2019) Tianming Wang and Xiaojun Wan. 2019. T-cvae: Transformer-based conditioned variational autoencoder for story completion. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5233–5239. International Joint Conferences on Artificial Intelligence Organization.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  • Wu et al. (2020) Chen Wu, Prince Zizhuang Wang, and William Yang Wang. 2020. On the encoder-decoder incompatibility in variational text modeling and beyond. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3449–3464, Online. Association for Computational Linguistics.
  • Xu et al. (2018) Qiongkai Xu, Juyan Zhang, Lizhen Qu, Lexing Xie, and Richard Nock. 2018. D-page: Diverse paraphrase generation. arXiv preprint arXiv:1808.04364.
  • Zhu et al. (2018) Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1097–1100.

Appendix A Gaussian Processes as Function Priors

(a) Sample gg from the Gaussian process prior.
(b) Sample gg from the Gaussian process posterior.
Figure 3: Samples of potential functions gg from (a) a Gaussian process prior 𝒢𝒫(𝟎,𝐊)\mathcal{GP}(\bm{0},\mathbf{K}) which uses a squared exponential kernel k(h,h)=exp((hh)22)k(h,h^{\prime})=\exp(-\frac{(h-h^{\prime})^{2}}{2}), and (b) a Gaussian process posterior p(g𝒟)p(g\mid\mathcal{D}) conditioned on the training data 𝒟\mathcal{D}.

In this work, we are interested in learning a non-linear mapping function from encoder hidden states to latent context variables. A Gaussian process has the nice property that it can represent complex non-linear functions and also allows uncertainty to account for noisy data observations. Considering a set of observed training data 𝒟={(𝒉i,𝒛i)}i=1N\mathcal{D}=\{(\bm{h}_{i},\bm{z}_{i})\}_{i=1}^{N}, a Gaussian process defines a probability distribution over possible functions p(g)p(g). Given a Gaussian process prior 𝒢𝒫(𝟎,𝐊)\mathcal{GP}(\bm{0},\mathbf{K}) on the function g(𝒉)g(\bm{h}), we have:

𝒛i\displaystyle\bm{z}_{i} =\displaystyle= g(𝒉i)+ϵi\displaystyle g(\bm{h}_{i})+\epsilon_{i} (14)
g(𝒉i)\displaystyle g(\bm{h}_{i}) \displaystyle\sim 𝒢𝒫(𝟎,𝐊)\displaystyle\mathcal{GP}(\bm{0},\mathbf{K}) (15)
ϵi\displaystyle\epsilon_{i} \displaystyle\sim 𝒩(𝟎,σ2𝐈)\displaystyle\mathcal{N}(\bm{0},\sigma^{2}\mathbf{I}) (16)

Note that ϵi\epsilon_{i} is the noise of the observed data point 𝒛i\bm{z}_{i}, which is assumed to be independent, identically distributed Gaussian noise with variance σ2{\sigma}^{2}. 𝐊\mathbf{K} is the covariance matrix, which is constructed using a squared exponential covariance function k(𝒉,𝒉)=exp(𝒉𝒉22)k(\bm{h},\bm{h}^{\prime})=\exp(-\frac{\left\|\bm{h}-\bm{h}^{\prime}\right\|^{2}}{2}). Now, we can sample different mapping functions g(𝒉)g(\bm{h}) from this Gaussian process prior 𝒢𝒫(𝟎,𝐊)\mathcal{GP}(\bm{0},\mathbf{K}). Figure 3(a) illustrates some possible mapping functions g1g_{1}, g2g_{2} and g3g_{3}.

In a Gaussian process, each training and testing data point is treated as a random variable that follows a Gaussian distribution. Therefore, we can apply Bayesian inference to predict a testing data point 𝒛\bm{z}_{*} conditioned on the observed training data points 𝒟\mathcal{D}. For concise notation, we let 𝒉1:N={𝒉i}i=1N\bm{h}_{1:N}=\{\bm{h}_{i}\}_{i=1}^{N}, 𝒛1:N={𝒛i}i=1N\bm{z}_{1:N}=\{\bm{z}_{i}\}_{i=1}^{N}, and 𝐊1=[k(𝒉1:N,𝒉1:N)+σ2𝐈]1\mathbf{K}^{-1}=[k(\bm{h}_{1:N},\bm{h}_{1:N})+{\sigma}^{2}\mathbf{I}]^{-1}. The probability distribution of the testing data point 𝒛\bm{z}_{*} can be computed by:

p(𝒛𝒉,𝒟)=p(𝒛𝒉,𝒈,𝒟)p(𝒈𝒟)𝑑gp(\bm{z}_{*}\mid\bm{h}_{*},\mathcal{D})=\int p(\bm{z}_{*}\mid\bm{h}_{*},\bm{g},\mathcal{D})p(\bm{g}\mid\mathcal{D})dg (17)

where

p(𝒛𝒉,𝒟)𝒩(𝝁,𝐊)𝝁=k(𝒉,𝒉1:N)𝐊1𝒛𝐊=k(𝒉,𝒉)k(𝒉,𝒉1:N)𝐊1k(𝒉1:N,𝒉)\begin{split}&p(\bm{z}_{*}\mid\bm{h}_{*},\mathcal{D})\sim\mathcal{N}(\bm{\mu}_{*},\mathbf{K}_{*})\\ &\bm{\mu}_{*}=k(\bm{h}_{*},\bm{h}_{1:N})\mathbf{K}^{-1}\bm{z}\\ &\mathbf{K}_{*}=k(\bm{h}_{*},\bm{h}_{*})-k(\bm{h}_{*},\bm{h}_{1:N})\mathbf{K}^{-1}k(\bm{h}_{1:N},\bm{h}_{*})\end{split} (18)

Intuitively, training data points 𝒟\mathcal{D} constrain the set of functions gg to pass through them since the covariance becomes smaller when we have training data, as shown in Figure 3(b).

Under our variational encoder-decoder framework, 𝒉1:N\bm{h}_{1:N} are encoder hidden states and 𝒛1:N\bm{z}_{1:N} are latent context variables. Since the Gaussian process induces a distribution over the mapping function g(𝒉)g(\bm{h}), theoretically we could sample an infinite number of mapping functions, where each function gives us a different set of latent context representations 𝒛1:N\bm{z}_{1:N}. In this way, we obtain diverse context representations in encoder-decoder models.
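For completeness, a small NumPy sketch of the posterior prediction in Equations 17 and 18 is given below, using the same unit-lengthscale squared exponential kernel as Figure 3; the variable names and the noise level are assumptions for illustration.

```python
import numpy as np

def rbf(A, B):
    """Squared-exponential kernel k(h, h') used in Figure 3 (unit lengthscale)."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / 2.0)

def gp_posterior(H_train, Z_train, H_test, noise_var=0.1):
    """Posterior mean and covariance of z_* given training pairs (Eq. 18)."""
    K = rbf(H_train, H_train) + noise_var * np.eye(len(H_train))   # k(h_{1:N}, h_{1:N}) + sigma^2 I
    K_star = rbf(H_test, H_train)                                   # k(h_*, h_{1:N})
    K_inv = np.linalg.inv(K)
    mu_star = K_star @ K_inv @ Z_train
    cov_star = rbf(H_test, H_test) - K_star @ K_inv @ K_star.T
    return mu_star, cov_star
```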

Appendix B Derivations of ELBo

We follow conditional variational autoencoders Sohn et al. (2015) and assume that for given observation 𝒉\bm{h}, 𝒛\bm{z} is drawn from the prior distribution p(𝒛𝒉)p(\bm{z}\mid\bm{h}), and the output 𝒚\bm{y} is generated from the distribution p(𝒚𝒉,𝒛)p(\bm{y}\mid\bm{h},\bm{z}). We learn the variational posterior by minimizing KL(qϕ(𝒛𝒉,𝒚)p(𝒛𝒉,𝒚))\text{KL}(q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\|p(\bm{z}\mid\bm{h},\bm{y})), which is equivalent to maximizing the evidence lower bound of the marginal log-likelihood (ELBo):

\begin{split}&\text{KL}(q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\|p(\bm{z}\mid\bm{h},\bm{y}))\\&=\int q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\log\frac{q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})}{p(\bm{z}\mid\bm{h},\bm{y})}d\bm{z}\\&=\int q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\log\frac{q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})p(\bm{y}\mid\bm{h})}{p(\bm{y}\mid\bm{h},\bm{z})p(\bm{z}\mid\bm{h})}d\bm{z}\\&=\log p(\bm{y}\mid\bm{h})\\&\quad+\int q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\log\frac{q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})}{p(\bm{y}\mid\bm{h},\bm{z})p(\bm{z}\mid\bm{h})}d\bm{z}\end{split} (19)

Since KL(qϕ(𝒛𝒉,𝒚)p(𝒛𝒉,𝒚))0\text{KL}(q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\|p(\bm{z}\mid\bm{h},\bm{y}))\geq 0, we have:

\begin{split}&\log p(\bm{y}\mid\bm{h})\\&\geq-\int q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\log\frac{q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})}{p(\bm{y}\mid\bm{h},\bm{z})p(\bm{z}\mid\bm{h})}d\bm{z}\\&=\mathbb{E}_{q_{\bm{\phi}}}[\log p(\bm{y}\mid\bm{z},\bm{h})+\log p(\bm{z}\mid\bm{h})\\&~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\log q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})]\\&=\mathbb{E}_{q_{\bm{\phi}}}[\log p(\bm{y}\mid\bm{z},\bm{h})]-\mathbb{E}_{q_{\bm{\phi}}}\Big[\log\frac{q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})}{p(\bm{z}\mid\bm{h})}\Big]\\&=\mathbb{E}_{q_{\bm{\phi}}}[\log p(\bm{y}\mid\bm{z},\bm{h})]\\&~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\text{KL}[q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\|p(\bm{z}\mid\bm{h})]\end{split} (20)

where ϕ\bm{\phi} are parameters for the variational inference networks.

Appendix C Experiment Setup Details

Model Configurations.

PG is an LSTM-based encoder-decoder model with a copying mechanism. The encoder is a 1-layer Bi-LSTM, and the decoder is a 1-layer uni-directional LSTM. We set the word embedding size to 300 and the hidden dimension of both the encoder and decoder to 512. The encoder and decoder share the same vocabulary list and word embeddings, and the vocabulary size is 20,000. For the posterior network, both the mean and covariance networks are single feed-forward neural networks, and we set the dimension of the latent variable to 256.

For T5, we use the T5-base implementation from Huggingface (Wolf et al., 2020; https://huggingface.co/transformers/model_doc/t5.html) with its default model configuration. We load the pre-trained weights of T5-base and fine-tune them on our target task datasets. For the posterior network, both the mean and covariance networks are single feed-forward neural networks, and we set the dimension of the latent variable to 512.

Training Configurations.

For the training of PG and T5, we do not apply KL annealing, and the coefficient of the KL divergence is always 1. We use the Adam optimizer Duchi et al. (2011) with a learning rate of 0.0001, and adopt early stopping if the validation loss does not decrease after 10 epochs. For the hyper-parameters {v,r}\{v,r\} of the kernel function in Equation 3, we try a range of values with v[0.01,100]v\in[0.01,100] and r[0.0001,10]r\in[0.0001,10], and perform grid-search cross-validation on the validation set to select the best model. All experiments are independently conducted on a GPU server (RTX 2090 Ti) with a 40-core CPU and 256GB of memory.