
Diverse Text Generation via Variational Encoder-Decoder Models with Gaussian Process Priors

Wanyu Du1 , Jianqiao Zhao2 , Liwei Wang2 , Yangfeng Ji1
1Department of Computer Science, University of Virginia
2Department of Computer Science and Engineering, The Chinese University of Hong Kong
{wd5jq,yangfeng}@virginia.edu
{jqzhao,lwwang}@cse.cuhk.edu.hk
Abstract

Generating high quality texts with high diversity is important for many NLG applications, but current methods mostly focus on building deterministic models to generate higher quality texts and do not provide many options for promoting diversity. In this work, we present a novel latent structured variable model that generates high quality texts by enriching contextual representation learning of encoder-decoder models. Specifically, we introduce a stochastic function to map deterministic encoder hidden states into random context variables. The proposed stochastic function is sampled from a Gaussian process prior in order to (1) provide an infinite number of joint Gaussian distributions over the random context variables (diversity-promoting) and (2) explicitly model the dependency between context variables (accurate encoding). To address the learning challenge of Gaussian processes, we propose an efficient variational inference approach to approximate the posterior distribution of the random context variables. We evaluate our method on two typical text generation tasks: paraphrase generation and text style transfer. Experimental results on benchmark datasets demonstrate that our method improves generation quality and diversity compared with other baselines. Code and data are available at https://github.com/wyu-du/GP-VAE.

1 Introduction

Generating high quality texts with high diversity is an important requirement for many text generation applications, such as paraphrase generation (Prakash et al., 2016; Li et al., 2018), style transfer (Jhamtani et al., 2017; Rao and Tetreault, 2018), dialog generation (Sordoni et al., 2015; Serban et al., 2016), etc. The encoder-decoder framework (Cho et al., 2014; Sutskever et al., 2014) is widely adopted (Bahdanau et al., 2014; Luong et al., 2015; Gu et al., 2016; See et al., 2017; Raffel et al., 2019) to generate high-quality texts, where an encoder is applied to learn contextual information from source texts and a decoder is used to generate texts for target tasks. To improve the diversity of generated texts, prior works propose to introduce variations into either the encoder (Bahuleyan et al., 2018; Deng et al., 2018; Wang and Wan, 2019; Cho et al., 2019; Qian and Cheung, 2019; Wu et al., 2020; Duan et al., 2020; Sun et al., 2021) or the decoder (Vijayakumar et al., 2016; Holtzman et al., 2019; He et al., 2018; Shen et al., 2019). However, it is difficult to incorporate meaningful variations into encoder-decoder models without hurting the quality of generated texts.

(a) With a normal Gaussian prior
(b) With a Gaussian process prior
Figure 1: A simple illustration of a variational encoder-decoder model with (a) a normal Gaussian prior and (b) a Gaussian process prior. With a normal Gaussian prior, each hidden state 𝒉i\bm{h}_{i} is mapped into a random vector 𝒛i\bm{z}_{i} independently, while a Gaussian process prior imposes dependency constraints among {𝒛i}\{\bm{z}_{i}\}. Double circles on {𝒉i}\{\bm{h}_{i}\} indicate that they are deterministic variables.

Promoting diversity at the encoder side mainly focuses on modelling the probabilistic distribution of contextual representations. Some prior works (Deng et al., 2018; Bahuleyan et al., 2018) propose to model attention alignments between encoder and decoder hidden states as latent variables, and generate diverse texts by sampling from the latent attention variables. Other existing works (Wang and Wan, 2019; Liu and Liu, 2019; Shinoda et al., 2021; Sun et al., 2021) directly apply conditional variational autoencoders to model encoder hidden states as latent variables, and generate high-diversity texts by sampling from the latent context variables. However, when modelling the latent variables, they treat each latent variable as independent of the others, which inevitably causes the loss of some contextual information during learning, as shown in Figure 1(a).

Other works turn toward designing diversity-promoting decoding strategies at the decoder side, such as diverse beam search (Vijayakumar et al., 2016), top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2019). For these decoding strategies, however, there is often a trade-off between quality and diversity, and the generation models have to sacrifice quality for higher diversity. Another line of work suggests learning a mixture of expert encoders (Cho et al., 2019) or decoders (He et al., 2018; Shen et al., 2019), and generating diverse texts by sampling from different encoders or decoders. While different expert encoders or decoders can introduce some diversity, the model capacity is limited by the pre-defined set of experts.

In this work, we propose a novel approach that introduces context-aware variations into the encoder in order to generate high-quality and high-diversity texts. For an encoder-decoder model, we introduce a stochastic function to map deterministic encoder hidden states {𝒉i}\{\bm{h}_{i}\} into a set of random context variables {𝒛i}\{\bm{z}_{i}\}. The advantage of this stochastic function is that it explicitly models the dependency between context variables, as shown in Figure 1(b), which helps preserve more semantic information from source texts. During generation, the decoder generates diverse outputs conditioned on different sampled context variables. In other words, by learning a stochastic function on top of one deterministic encoder, the proposed approach offers many versions of random context variables for a decoder to generate diverse texts.

To learn the stochastic function over hidden states, we propose a Gaussian process prior (Rasmussen and Williams, 2006, GP) to model the joint distribution of all encoder hidden states. The major differences between GP priors and the priors used in previous works (Bahuleyan et al., 2018; Deng et al., 2018; Wang and Wan, 2019; Cho et al., 2019; Wu et al., 2020; Duan et al., 2020; Shinoda et al., 2021; Sun et al., 2021) are two-fold: (1) GP priors explicitly model the dependency between latent variables of varying sizes, as illustrated in Figure 1(b), while previous works treat latent variables as independent of each other, as shown in Figure 1(a); (2) GP priors provide an infinite number of joint Gaussian distributions of latent variables, as shown in Figure 2(b), while previous works have to pre-define a fixed set of Gaussian distributions (e.g. a standard normal distribution, or a mixture of Gaussian distributions) with the risk of experiencing the posterior collapse problem (Bowman et al., 2016; Kim et al., 2018; Dieng et al., 2019). Besides, the proposed random function only introduces variations into the encoder, and is orthogonal to diversity-promoting decoding strategies at the decoder side. Users can freely adopt different decoding strategies to further encourage diverse generation outputs.

The major contributions of this work are three-fold:

  1. We propose a novel method to introduce context-aware variations into encoder-decoder models, which helps the model learn rich contextual representations and also promotes diversity in generation.

  2. We propose an efficient variational inference method to approximate the joint distribution of fully-connected random context variables.

  3. We test the proposed method in both LSTM-based (See et al., 2017) and Transformer-based (Raffel et al., 2019) encoder-decoder models on paraphrase generation and style transfer tasks. Empirical results show that, on one hand, the proposed method generates higher quality texts than deterministic encoder-decoder models and conditional variational autoencoders; on the other hand, it also supports diverse generation by conditioning on different sets of sampled random context variables.

(a) Variational encoder-decoder model with a normal Gaussian prior. Note that we simplify the latent variable 𝒛i\bm{z}_{i} from a vector to a scalar in order to plot out the Gaussian distribution for better illustration.
(b) Variational encoder-decoder model with a GP prior. Note that we simplify the latent variable 𝒛i\bm{z}_{i} from a vector to a scalar in order to plot out the joint Gaussian distribution for better illustration.
Figure 2: A simple illustration for comparison between our GP priors and the priors in conditional variational autoencoders Sohn et al. (2015) under the variational encoder-decoder framework.

2 Model Description

This section describes our novel latent structured variable model, which learns rich context representations by transforming the hidden states from a deterministic encoder into random hidden states via stochastic functions.

2.1 Encoding with Stochastic Functions

Let 𝒙1:N={𝒙i}i=1N\bm{x}_{1:N}=\{\bm{x}_{i}\}_{i=1}^{N} be the source sentence of length NN, and 𝒚1:T={𝒚t}t=1T\bm{y}_{1:T}=\{\bm{y}_{t}\}_{t=1}^{T} be the target sentence of length TT. In encoder-decoder models, an encoder is used to obtain deterministic context representations of the source sentence, i.e. the encoder hidden states: 𝒉1:N=fenc(𝒙1:N)\bm{h}_{1:N}=f_{enc}(\bm{x}_{1:N}), where fenc()f_{enc}(\cdot) is a nonlinear transition function implemented by LSTM (Sutskever et al., 2014) or Transformer (Vaswani et al., 2017).

To introduce context-aware variations into the encoder, we propose to learn a stochastic function that maps the deterministic hidden states to random variables. Specifically, after computing the hidden states 𝒉1:N\bm{h}_{1:N} from the transition function fenc()f_{enc}(\cdot), the proposed method employs a stochastic mapping function g()g(\cdot) to model the deterministic context representations as a series of random context variables:

p(𝒛1:N𝒉1:N)=g(𝒉1:N)+ϵp(\bm{z}_{1:N}\mid\bm{h}_{1:N})=g(\bm{h}_{1:N})+\bm{\epsilon} (1)

where ϵ𝒩(𝟎,σ2𝐈)\bm{\epsilon}\sim\mathcal{N}(\bm{0},\sigma^{2}\mathbf{I}) is Gaussian noise. Then, the decoder can generate diverse texts conditioned on different sets of context variables sampled from p(𝒛1:N𝒉1:N)p(\bm{z}_{1:N}\mid\bm{h}_{1:N}), as shown in Figure 2(b). Considering that natural language texts are always context-dependent, we expect the random context variables 𝒛1:N\bm{z}_{1:N} to encode the context dependency to some extent. In other words, the distribution of 𝒛i\bm{z}_{i} will not only depend on 𝒉i\bm{h}_{i}, but also on the other {𝒛j}ji\{\bm{z}_{j}\}_{j\not=i}, as shown in Figure 1(b).

Under this framework, variational encoder-decoder models (Bahuleyan et al., 2018; Deng et al., 2018; Wang and Wan, 2019) can be viewed as a special case, as illustrated in Figure 2(a), where the random context variables {𝒛i}i=1N\{\bm{z}_{i}\}_{i=1}^{N} are independent of each other. In this work, we consider this special case as generation with normal priors. An empirical comparison between normal priors and GP priors is given in section 4.

2.2 Gaussian Process Priors for Stochastic Functions

The learning of stochastic function g(𝒉)g(\bm{h}) is the key for the proposed method to be successful. Intuitively, we design g(𝒉)g(\bm{h}) to satisfy two constraints simultaneously: (1) it can introduce some variation to the deterministic encoder hidden states; (2) it should preserve the contextual information in the deterministic encoder hidden states to be a faithful representation.

In this work, we propose to learn g(𝒉)g(\bm{h}) with a functional prior defined by Gaussian processes. As shown in Figure 3(a) in Appendix A, we can sample very different functions g(𝒉)g(\bm{h}) from the same GP prior, which ensures randomness when sampling 𝒛1:N\bm{z}_{1:N}. (Please refer to Appendix A and Rasmussen and Williams (2006) for a detailed introduction to Gaussian processes.) We define the stochastic function g(𝒉)g(\bm{h}) following a GP prior:

g(𝒉)𝒢𝒫(m(𝒉),k(𝒉,𝒉))g(\bm{h})\sim\mathcal{GP}(m(\bm{h}),k(\bm{h},\bm{h}^{\prime})) (2)

with the mean function m(𝒉)m(\bm{h}) and covariance function k(𝒉,𝒉)k(\bm{h},\bm{h}^{\prime}) as

m(𝒉)=𝒉k(𝒉,𝒉)=v2exp{𝒉𝒉222r2}\begin{split}m(\bm{h})&=\bm{h}\\ k(\bm{h},\bm{h}^{\prime})&=v^{2}\exp\{-\frac{\|\bm{h}-\bm{h}^{\prime}\|_{2}^{2}}{2r^{2}}\}\end{split} (3)

where 𝒉\bm{h} indicates the current observed encoder hidden state, and 𝒉\bm{h}^{\prime} indicates the other contextual encoder hidden states; vv controls the average distance between a sampled function g(𝒉)g(\bm{h}) and the mean function m(𝒉)m(\bm{h}), and rr controls the covariance between random variables: increasing rr makes 𝒛\bm{z} and 𝒛\bm{z}^{\prime} more correlated. In this work, vv and rr are chosen based on the text generation performance on the development sets. By setting m(𝒉)=𝒉m(\bm{h})=\bm{h}, we actually define a semi-parametric GP prior (Murphy, 2012, Sec. 15.2.6) instead of a fully non-parametric prior, since 𝒉\bm{h} as a hidden state is computed from the deterministic encoder with learnable parameters. The intuition behind this definition is that, although we want to introduce some variations, the expectation of the sampled random states 𝒛\bm{z} should still be 𝒉\bm{h}.
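To make this concrete, the following NumPy sketch shows one way to draw a set of random context variables 𝒛1:N from the semi-parametric GP prior in Equations 2 and 3. It is an illustrative sketch rather than the released code: the function names, the treatment of each hidden dimension as an independent draw from the same joint Gaussian, and the default values of v, r, and the noise level are our own assumptions.

```python
import numpy as np

def rbf_kernel(H, v=1.0, r=1.0):
    """Squared-exponential kernel k(h, h') from Eq. 3 over N hidden states.

    H: array of shape (N, d) holding the deterministic encoder hidden states.
    Returns the (N, N) covariance matrix of the GP prior.
    """
    sq_dists = np.sum((H[:, None, :] - H[None, :, :]) ** 2, axis=-1)
    return v ** 2 * np.exp(-sq_dists / (2.0 * r ** 2))

def sample_context_variables(H, v=1.0, r=1.0, noise_std=0.1, n_samples=1):
    """Draw z_{1:N} ~ g(h_{1:N}) + eps with g ~ GP(m(h) = h, k).

    Assumption for this sketch: each hidden dimension is sampled independently
    from the same joint Gaussian N(H[:, j], K + noise_std^2 I).
    """
    N, d = H.shape
    K = rbf_kernel(H, v=v, r=r) + (noise_std ** 2) * np.eye(N)
    L = np.linalg.cholesky(K)                  # K = L L^T
    eps = np.random.randn(n_samples, N, d)     # standard normal draws
    # mean function m(h) = h, so each sample is H plus correlated noise L @ eps
    return H[None, :, :] + np.einsum('nm,smd->snd', L, eps)

# Usage: H = np.random.randn(12, 512); Z = sample_context_variables(H, n_samples=3)
```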

The main advantage of applying GP priors is that we can sample an infinite number of random functions g(𝒉)g(\bm{h}), thus obtaining infinitely many sets of random context variables 𝒛1:N\bm{z}_{1:N}, as illustrated in Figure 2(b). In contrast, standard variational encoder-decoder models can only learn a fixed set of CC joint distributions p(𝒛1:N|𝒉1:N)p(\bm{z}_{1:N}|\bm{h}_{1:N}), where 1C1\leq C\ll\infty. (When C=1C=1, this corresponds to a conventional variational autoencoder (Bowman et al., 2016); when C=5C=5, it corresponds to a variational autoencoder with a mixture-of-Gaussians prior with five components; when CC\to\infty, it corresponds to a variational autoencoder with a GP prior.)

2.3 Generation with Random Context Variables

In this section, we demonstrate how to incorporate 𝒛1:N\bm{z}_{1:N} into two typical encoder-decoder models for text generation: an LSTM-based encoder-decoder model See et al. (2017) and a Transformer-based encoder-decoder model (Raffel et al., 2019). The performance of these two variational encoder-decoder models will be evaluated in section 4.

Given the deterministic encoder hidden states 𝒉1:N\bm{h}_{1:N}, we first sample a function g(𝒉)g(\bm{h}) from the GP prior in Equation 2; then sample a set of random context variables 𝒛1:N\bm{z}_{1:N} from g(𝒉)g(\bm{h}); and finally generate an output sentence 𝒚1:T\bm{y}_{1:T} based on the sampled 𝒛1:N\bm{z}_{1:N}. The generative story with random context variables 𝒛1:N\bm{z}_{1:N} is detailed in algorithm 1.

For an LSTM-based encoder-decoder model See et al. (2017), we apply the attention mechanism (Bahdanau et al., 2014) over the random context variables {𝒛i}i=1N\{\bm{z}_{i}\}_{i=1}^{N} to construct 𝒄t\bm{c}_{t} for the decoder. At each decoding time step tt, the decoder computes the attention vector 𝒄t\bm{c}_{t} and decoder hidden state 𝒔t\bm{s}_{t} as follows:

αti\displaystyle\alpha_{ti} =\displaystyle= exp(a(𝒔t1,𝒛i))j=1Nexp(a(𝒔t1,𝒛j))\displaystyle\frac{\exp{(a(\bm{s}_{t-1},\bm{z}_{i}))}}{\sum_{j=1}^{N}\exp{(a(\bm{s}_{t-1},\bm{z}_{j}))}} (4)
𝒄t\displaystyle\bm{c}_{t} =\displaystyle= i=1Nαti𝒛i\displaystyle\sum_{i=1}^{N}\alpha_{ti}\cdot\bm{z}_{i} (5)
𝒔t\displaystyle\bm{s}_{t} =\displaystyle= fdec(𝒔t1,𝒚t1,𝒄t)\displaystyle f_{dec}(\bm{s}_{t-1},\bm{y}_{t-1},\bm{c}_{t}) (6)

where a(𝒔t1,𝒛i)=𝒗atanh(𝐖a𝒔t1+𝐔a𝒛i)a(\bm{s}_{t-1},\bm{z}_{i})=\bm{v}_{a}^{\top}\text{tanh}(\mathbf{W}_{a}\bm{s}_{t-1}+\mathbf{U}_{a}\bm{z}_{i}), 𝐖a\mathbf{W}_{a}, 𝐔a\mathbf{U}_{a} and 𝒗a\bm{v}_{a}^{\top} are parameter matrices. Finally, the decoder outputs a word distribution based on the context representations and previous decoded words at each decoding time step tt:

p(𝒚t𝒚t1,𝒛1:N)=softmax(𝐖b𝒔t)p(\bm{y}_{t}\mid\bm{y}_{t-1},\bm{z}_{1:N})=\text{softmax}(\mathbf{W}_{b}\cdot\bm{s}_{t}) (7)

where 𝐖b\mathbf{W}_{b} is a parameter matrix.
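A minimal PyTorch sketch of the additive attention over the random context variables (Equations 4 and 5) is given below. The module name, tensor shapes, and dimension arguments are illustrative assumptions, not the authors' implementation; the resulting context vector would be fed, together with the previous word, into the LSTM decoder step of Equation 6.

```python
import torch
import torch.nn as nn

class AttentionOverZ(nn.Module):
    """Additive attention over random context variables z_{1:N} (Eqs. 4-5)."""

    def __init__(self, dec_dim, z_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U_a = nn.Linear(z_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, z):
        # s_prev: (batch, dec_dim) previous decoder state s_{t-1}
        # z:      (batch, N, z_dim) sampled random context variables
        scores = self.v_a(torch.tanh(self.W_a(s_prev).unsqueeze(1) + self.U_a(z)))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)   # (batch, N), Eq. 4
        c_t = torch.bmm(alpha.unsqueeze(1), z).squeeze(1)   # (batch, z_dim), Eq. 5
        return c_t, alpha
```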

1:  Input: A source sentence 𝒙1:N\bm{x}_{1:N}
2:  Output: A generated sentence 𝒚1:T\bm{y}_{1:T}
3:  // Encode context
4:  Initialize 𝒉0𝟎\bm{h}_{0}\leftarrow\bm{0}
5:  for i=1,,Ni=1,\dots,N do
6:     Compute 𝒉i=fenc(𝒉i1,𝒙i)\bm{h}_{i}=f_{enc}(\bm{h}_{i-1},\bm{x}_{i})
7:  end for
8:  // Sample random context variables
9:  Draw g(𝒉)𝒢𝒫(m(𝒉),k(𝒉,𝒉))g(\bm{h})\sim\mathcal{GP}(m(\bm{h}),k(\bm{h},\bm{h}^{\prime}))
10:  Draw 𝒛1:Ng(𝒉1:N)+ϵ\bm{z}_{1:N}\sim g(\bm{h}_{1:N})+\bm{\epsilon}
11:  // Generate a new sentence
12:  Initialize 𝒔0𝟎\bm{s}_{0}\leftarrow\bm{0}
13:  for t=1,,Tt=1,\dots,T do
14:     Compute 𝒔t=fdec(𝒔t1,𝒚t1,𝒛1:N)\bm{s}_{t}=f_{dec}(\bm{s}_{t-1},\bm{y}_{t-1},\bm{z}_{1:N})
15:     Draw 𝒚tsoftmax(𝐖𝒔t)\bm{y}_{t}\sim\text{softmax}(\mathbf{W}\cdot\bm{s}_{t})
16:  end for
Algorithm 1 The generative story with a stochastic function g()g(\cdot) sampled from the GP prior

For a Transformer-based encoder-decoder model (Raffel et al., 2019), we take the output of the last layer in the encoder as 𝒉1:N\bm{h}_{1:N}, and feed them into g(𝒉)g(\bm{h}) to get random context variables 𝒛1:N\bm{z}_{1:N}. For the decoder, the inputs 𝐊\mathbf{K} and 𝐕\mathbf{V} are the combination of 𝒉1:N\bm{h}_{1:N} and 𝒛1:N\bm{z}_{1:N}:

𝐊l=𝐕l\displaystyle\mathbf{K}^{l}=\mathbf{V}^{l} =\displaystyle= 𝐖z[𝒛1:N;𝒉1:N]\displaystyle\mathbf{W}_{z}[\bm{z}_{1:N};\bm{h}_{1:N}] (8)
𝐀\displaystyle\mathbf{A} =\displaystyle= MultiHead(𝐒l1,𝐊l,𝐕l)\displaystyle\text{MultiHead}(\mathbf{S}^{l-1},\mathbf{K}^{l},\mathbf{V}^{l}) (9)
𝐁\displaystyle\mathbf{B} =\displaystyle= LayerNorm(𝐀+𝐒l1)\displaystyle\text{LayerNorm}(\mathbf{A}+\mathbf{S}^{l-1}) (10)
𝐒l\displaystyle\mathbf{S}^{l} =\displaystyle= LayerNorm(FFN(𝐁)+𝐁)\displaystyle\text{LayerNorm}(\text{FFN}(\mathbf{B})+\mathbf{B}) (11)

where 𝐒l={𝒔t}t=1T\mathbf{S}^{l}=\{\bm{s}_{t}\}_{t=1}^{T} is the last layer of decoder hidden states, and MultiHead()\text{MultiHead}(\cdot), LayerNorm()\text{LayerNorm}(\cdot) and FFN()\text{FFN}(\cdot) follow the standard implementation in (Vaswani et al., 2017). The word distribution p(𝒚t𝒚t1,𝒛1:N)p(\bm{y}_{t}\mid\bm{y}_{t-1},\bm{z}_{1:N}) at each decoding time step tt is computed the same way as in Equation 7.
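The sketch below illustrates how Equations 8-11 combine 𝒉1:N and 𝒛1:N into shared keys and values for decoder cross-attention. It uses PyTorch's generic multi-head attention rather than the actual T5 layers, and the module and parameter names are our own assumptions.

```python
import torch
import torch.nn as nn

class LatentCrossAttentionBlock(nn.Module):
    """One decoder cross-attention block with K = V = W_z [z_{1:N}; h_{1:N}] (Eqs. 8-11)."""

    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.W_z = nn.Linear(2 * d_model, d_model)   # projects the concatenation [z; h]
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, S_prev, h, z):
        # S_prev: (batch, T, d_model) previous decoder layer states S^{l-1}
        # h, z:   (batch, N, d_model) deterministic states and sampled context variables
        kv = self.W_z(torch.cat([z, h], dim=-1))   # Eq. 8: K^l = V^l
        A, _ = self.attn(S_prev, kv, kv)           # Eq. 9: multi-head attention
        B = self.norm1(A + S_prev)                 # Eq. 10: residual + layer norm
        return self.norm2(self.ffn(B) + B)         # Eq. 11: S^l
```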

3 Efficient Variational Inference

With the observed deterministic hidden states 𝒉1:N\bm{h}_{1:N}, we estimate the GP posterior to make the prediction of context variables 𝒛1:N\bm{z}_{1:N} more accurate. Although the posterior estimation of Gaussian processes can be written in a closed form theoretically, the challenge in this work comes from learning with other parts of the model, such as the deterministic encoder producing 𝒉1:N\bm{h}_{1:N} and the decoder generating 𝒚1:T\bm{y}_{1:T}. To simplify the inference procedure, we will focus on inferring the samples of the GP posterior regarding the hidden states only as p(g𝒉1:N)p(g\mid\bm{h}_{1:N}), which essentially is a Gaussian distribution with non-isotropic covariance. In this work, we apply variational inference to approximate the GP posterior p(g𝒉1:N)p(g\mid\bm{h}_{1:N}) and learn other model parameters jointly with maximum likelihood estimation.

For notation simplicity, we let 𝒉=fenc(𝒙1:N)\bm{h}=f_{enc}(\bm{x}_{1:N}), 𝒛={𝒛i}i=1N\bm{z}=\{\bm{z}_{i}\}_{i=1}^{N}, 𝒚={𝒚i}i=1T\bm{y}=\{\bm{y}_{i}\}_{i=1}^{T} in this section. With a sampled random function g(𝒉)g(\bm{h}) from the GP prior as described in line 9 of algorithm 1, we will get the joint prior distribution p(𝒛𝒉)p(\bm{z}\mid\bm{h}) according to Equation 1. Then we approximate the true posterior p(𝒛𝒉,𝒚)p(\bm{z}\mid\bm{h},\bm{y}) with the variational posterior qϕ(𝒛𝒉,𝒚)q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y}) by maximizing the evidence lower bound of the marginal log-likelihood (ELBo):

logp(𝒚𝒉)𝔼qϕ[logp(𝒚𝒛)]KL[qϕ(𝒛𝒉,𝒚)p(𝒛𝒉)]\begin{split}\log p(\bm{y}\mid\bm{h})\geq&\mathbb{E}_{q_{\bm{\phi}}}[\log p(\bm{y}\mid\bm{z})]\\ &-\text{KL}[q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\|p(\bm{z}\mid\bm{h})]\end{split} (12)

where ϕ\bm{\phi} denotes the variational parameters. The derivation of Equation 12 is presented in Appendix B.

During generation, we propose a two-step approximation to simplify qϕ(𝒛𝒉,𝒚)q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y}). First, to maintain the generative property when using the variational distribution, we approximate the variational distribution as qϕ(𝒛𝒉,𝒚)qϕ(𝒛𝒉)q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\approx q_{\bm{\phi}}(\bm{z}\mid\bm{h}). In this case, the random context vector 𝒛\bm{z} only depends on 𝒉\bm{h} during inference. Second, we apply the mean-field amortized variational approximation (Kingma and Welling, 2014) to approximate the parameters of qϕ(𝒛𝒉)q_{\bm{\phi}}(\bm{z}\mid\bm{h}):

qϕ(𝒛𝒉)=i=1Nqϕ(𝒛i𝒉i)=i=1N𝒩(fμ(𝒉i),fσ2(𝒉i))\begin{split}q_{\bm{\phi}}(\bm{z}\mid\bm{h})&=\prod_{i=1}^{N}q_{\bm{\phi}}(\bm{z}_{i}\mid\bm{h}_{i})\\ &=\prod_{i=1}^{N}\mathcal{N}(f_{\mu}(\bm{h}_{i}),f_{\sigma^{2}}(\bm{h}_{i}))\end{split} (13)

where fμ()f_{\mu}(\cdot) and fσ2()f_{\sigma^{2}}(\cdot) are the mean and covariance in the amortized variational inference network. In this work, we use two simple feed-forward neural networks fμf_{\mu} and fσ2f_{\sigma^{2}}. The implementation details are included in Appendix C.
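As an illustration of the training objective, the sketch below computes the KL term of Equation 12 between the factorized variational posterior of Equation 13 and the GP prior, assuming (as in the earlier sketch) that each hidden dimension is an independent N-dimensional Gaussian sharing the kernel matrix K. This is a simplified sketch of the closed-form Gaussian KL, not the authors' implementation.

```python
import torch

def gp_kl_divergence(mu_q, var_q, h, K, noise_var=1e-2):
    """KL[ q(z|h) || p(z|h) ] for the ELBo in Eq. 12 (illustrative sketch).

    mu_q, var_q: (N, d) mean and diagonal variance of the variational posterior (Eq. 13)
    h:           (N, d) deterministic encoder hidden states (prior mean, m(h) = h)
    K:           (N, N) kernel matrix from Eq. 3
    """
    N, d = mu_q.shape
    S1 = K + noise_var * torch.eye(N, device=K.device)   # prior covariance over positions
    S1_inv = torch.linalg.inv(S1)
    logdet_S1 = torch.logdet(S1)
    kl = 0.0
    for j in range(d):                                   # one N-dim Gaussian per hidden dim
        S0_diag = var_q[:, j]
        diff = h[:, j] - mu_q[:, j]
        trace_term = (torch.diagonal(S1_inv) * S0_diag).sum()
        quad_term = diff @ S1_inv @ diff
        logdet_S0 = torch.log(S0_diag).sum()
        kl = kl + 0.5 * (trace_term + quad_term - N + logdet_S1 - logdet_S0)
    return kl

# Per-example training loss (negative ELBo, Eq. 12):
# loss = reconstruction_nll + gp_kl_divergence(mu_q, var_q, h, K)
```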

Paraphrase Generation: Twitter URL Paraphrasing Corpus
| Original Sentence | Target Paraphrase |
| Amazon only needs a minute of human labor to ship your next package. | Amazon ships your packages in one minute. |
| Amazon only needs a minute of human labor to ship your next package. | Amazon only needs a minute of labor to ship your next package. |

Text Style Transfer: GYAFC Corpus
| Informal Sentence | Formal Sentence |
| I’d say it is punk though. | However, I do believe it to be punk. |
| Gotta see both sides of the story. | You have to consider both sides of the story. |
Table 1: Some example source-target pairs from the Twitter URL Paraphrasing Corpus Lan et al. (2017) and the GYAFC Corpus Rao and Tetreault (2018).

4 Experiments

We evaluate our method on two text generation tasks that require rich contextual representations: paraphrase generation (subsection 4.4) and text style transfer (subsection 4.5). We provide some data examples for the two tasks in Table 1. We compare our method with previous works on diverse text generation in terms of quality and diversity. Empirical results show that our method is able to: (1) adapt to different encoder-decoder architectures, such as the pointer-generator network (See et al., 2017, PG) and the text-to-text transfer Transformer (Raffel et al., 2019, T5); and (2) generate higher quality texts than deterministic encoder-decoder models (See et al., 2017; Raffel et al., 2019) while also enabling diverse generation by conditioning on random context variables.

4.1 Evaluation Methods

Text Quality.

For quality evaluation, we use two commonly used automatic metrics in text generation: METEOR Banerjee and Lavie (2005) and BLEU with up to bi-grams Papineni et al. (2002), which tell us how well the generated outputs match the reference sentences.

Text Diversity.

For diversity evaluation, we aim at examining how well different latent context variables 𝒛1:N\bm{z}_{1:N} from qϕ(𝒛𝒉,𝒚)q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y}) can make the decoder generate diverse outputs. We use self-BLEU with up to bi-grams (Zhu et al., 2018, self-BLEU) to measure the mutual bi-gram overlap among the set of outputs per source sentence; a lower self-BLEU indicates less bi-gram overlap between generated outputs. In addition, we use diverse 4-gram (Deshpande et al., 2019, Div-4) to measure the ratio of distinct 4-grams in the set of outputs per source sentence; a higher Div-4 shows more unique 4-grams among generated outputs. Finally, we use uniqueness (Deshpande et al., 2019, Uni.) to measure the ratio of unique generated sentences in the set of outputs per source sentence; a higher uniqueness suggests that different context variables 𝒛1:N\bm{z}_{1:N} lead to very different output sentences.
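For reference, a simplified Python sketch of these sentence-set diversity statistics is shown below. The pairwise bigram-overlap function is only a rough stand-in for the self-BLEU of Zhu et al. (2018), and all function names are our own.

```python
from itertools import combinations

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def div_4(outputs):
    """Ratio of distinct 4-grams among the set of outputs for one source sentence."""
    all_4grams = [g for sent in outputs for g in ngrams(sent.split(), 4)]
    return len(set(all_4grams)) / max(len(all_4grams), 1)

def uniqueness(outputs):
    """Ratio of unique generated sentences among the outputs for one source sentence."""
    return len(set(outputs)) / len(outputs)

def pairwise_bigram_overlap(outputs):
    """A rough stand-in for self-BLEU: average bigram overlap between output pairs."""
    overlaps = []
    for a, b in combinations(outputs, 2):
        ga, gb = set(ngrams(a.split(), 2)), set(ngrams(b.split(), 2))
        overlaps.append(len(ga & gb) / max(min(len(ga), len(gb)), 1))
    return sum(overlaps) / max(len(overlaps), 1)
```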

4.2 Competitive Baselines

We compare our method with competitive deterministic encoder-decoder models and variational encoder-decoder models as follows.

PG: See et al. (2017) propose the pointer-generator network, which is a strong deterministic LSTM-based encoder-decoder baseline. We refer to their model as PG.

T5: Raffel et al. (2019) propose the text-to-text transfer Transformer, which is a strong deterministic Transformer-based encoder-decoder baseline. We refer to their model as T5.

Variation Attention: Deng et al. (2018) model the deterministic attention vectors as latent alignment variables to promote diverse text generation. We refer to their model as Variation Attention.

Multi-Selectors: Cho et al. (2019) use a mixture of experts to sample different binary masks on the source texts for diverse content generation. We refer to their model as Multi-Selectors.

T-CVAE: Wang and Wan (2019) model deterministic encoder hidden states as latent context variables with a Transformer-based conditional variational autoencoder. We refer to their model as T-CVAE.

PG/T5 + Normal prior: PG or T5 with a normal prior p(𝒛)=𝒩(𝟎,𝐈)p(\bm{z})=\mathcal{N}(\bm{0},\mathbf{I}), which follows the conventional variational autoencoders Kingma and Welling (2014); Bowman et al. (2016).

PG/T5 + GP prior: PG or T5 with our GP prior defined in Equation 2.

Methods | Twitter URL (BLEU↑ / METEOR↑) | GYAFC (E&M) (BLEU↑ / METEOR↑) | GYAFC (F&R) (BLEU↑ / METEOR↑)
Seq2Seq baselines
PG | 0.291 / 0.471 | 0.683 / 0.817 | 0.717 / 0.845
T5 | 0.264 / 0.453 | 0.683 / 0.819 | 0.726 / 0.847
Related works
Multi-Selectors | 0.290 / 0.492 | 0.606 / 0.779 | 0.618 / 0.783
Variation Attention | 0.294 / 0.512 | 0.632 / 0.804 | 0.675 / 0.833
T-CVAE | 0.339 / 0.494 | 0.481 / 0.686 | 0.537 / 0.730
Our works
PG + Normal prior | 0.041 / 0.127 | 0.145 / 0.354 | 0.191 / 0.452
T5 + Normal prior | 0.269 / 0.461 | 0.675 / 0.815 | 0.722 / 0.846
PG + GP prior | 0.307 / 0.483 | 0.681 / 0.828 | 0.734 / 0.849
T5 + GP prior | 0.281 / 0.474 | 0.688 / 0.815 | 0.739 / 0.847
Table 2: Model performance on text quality. To get the best performance of variational encoder-decoder models, we directly take the mean of qϕ(𝒛𝒉,𝒚)q_{\phi}(\bm{z}\mid\bm{h},\bm{y}) as the sampled context variables to generate the output texts. Results of Multi-Selectors (Cho et al., 2019), Variation Attention (Deng et al., 2018) and T-CVAE Wang and Wan (2019) are collected based on the source code provided in the original papers.

4.3 Generation Setups

For the decoding strategy, we use beam search with a beam size of 10. Note that our method is orthogonal to all diversity-promoting decoding strategies, such as top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019). We choose beam search in order to make a fair comparison with other works that promote diversity at the encoder side.

For quality generation, we directly take the mean of qϕ(𝒛|𝒉,𝒚)q_{\bm{\phi}}(\bm{z}|\bm{h},\bm{y}), and generate one 𝒚1:T\bm{y}_{1:T} based on the sampled context variables 𝒛1:N\bm{z}_{1:N}, since we want to examine how well the posterior network can encode contextual information and make the decoder generate high-quality texts.

For diverse generation, we sample different 𝒛1:N\bm{z}_{1:N} (instead of directly taking the mean) from qϕ(𝒛|𝒉,𝒚)q_{\bm{\phi}}(\bm{z}|\bm{h},\bm{y}), and generate different 𝒚1:T\bm{y}_{1:T} based on the sampled context variables 𝒛1:N\bm{z}_{1:N}, since we want to examine how well different latent context variables 𝒛1:N\bm{z}_{1:N} from qϕ(𝒛𝒉,𝒚)q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y}) can make the decoder generate diverse outputs. For experiment setups, we sample 10 different 𝒛1:N\bm{z}_{1:N} and generate 10 different 𝒚1:T\bm{y}_{1:T} correspondingly. We compute the diversity scores following the prior work (Deshpande et al., 2019). To compute the self-BLEU and Div-4, we randomly sample 5 different 𝒚1:T\bm{y}_{1:T} out of the 10 generated 𝒚1:T\bm{y}_{1:T}. To compute the Uni., we compute the unique number of sentences among the 10 generated 𝒚1:T\bm{y}_{1:T}.

In preliminary experiments, we found that sampling from the original variational distribution tends to make the decoder generate the same sentences. We hypothesize that qϕ(𝒛|𝒉,𝒚)q_{\bm{\phi}}(\bm{z}|\bm{h},\bm{y}) is a high-dimensional multivariate Gaussian, and sampling from a high-dimensional distribution is a fundamentally challenging problem. Therefore, we apply a simple heuristic to alleviate the sampling issue: we scale up the covariance matrix of the variational distribution by a numeric scalar. We find this simple heuristic helps the decoder generate more diverse sentences. For PG + Normal prior and PG + GP prior, we set the scalar to 25 for the paraphrase generation task and to 10 for the style transfer task. For T5 + Normal prior and T5 + GP prior, we set the scalar to 7 for the paraphrase generation task and to 4 for the style transfer task.
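A minimal sketch of this covariance-scaling heuristic is shown below, assuming a diagonal variational posterior; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def sample_scaled_posterior(mu, var, scale, n_samples=10):
    """Sample context variables from q(z|h, y) with its covariance scaled up.

    mu, var: (N, d) mean and diagonal variance of the variational posterior
    scale:   numeric scalar applied to the covariance (e.g. 25 for PG on paraphrasing)
    """
    std = (scale * var).sqrt()
    eps = torch.randn(n_samples, *mu.shape)
    return mu.unsqueeze(0) + std.unsqueeze(0) * eps   # (n_samples, N, d)

# Each of the n_samples draws is fed to the decoder (beam search, beam size 10)
# to produce one output sentence, giving n_samples candidate generations.
```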

4.4 Paraphrase Generation

We first evaluate the model’s capability of generating paraphrases using the Twitter URL paraphrasing dataset Lan et al. (2017). In this task, we aim at comparing the quality of generated texts between our method and other competitive baselines. We include the experimental setups in Appendix C.

Dataset.

The Twitter URL paraphrasing dataset Lan et al. (2017) contains both positive and negative examples of paraphrases. We filter out all negative examples from the 2,869,657 candidate pairs collected over one year, and divide the remaining paraphrase pairs into 110K training pairs, 3K testing pairs, and 1K validation pairs.

Methods | GYAFC (E&M): avg-BLEU↑ / self-BLEU↓ / Div-4↑ / Uni.↑ | GYAFC (F&R): avg-BLEU↑ / self-BLEU↓ / Div-4↑ / Uni.↑
T-CVAE | 0.481 / 0.986 / 0.221 / 0.130 | 0.537 / 0.990 / 0.220 / 0.126
T5 + Normal prior | 0.419 / 0.522 / 0.524 / 0.791 | 0.347 / 0.415 / 0.484 / 0.845
T5 + GP prior | 0.329 / 0.395 / 0.727 / 0.898 | 0.252 / 0.295 / 0.748 / 0.910
Table 3: Model performance on text diversity. Avg-BLEU measures the average quality of generated sentences compared with ground-truth references. Self-BLEU measures token-level repetitiveness; a lower self-BLEU indicates higher token-level diversity. Div-4 measures the ratio of unique 4-grams; a higher Div-4 indicates higher token-level diversity. Uni. measures the ratio of unique generated sentences; a higher Uni. indicates higher sentence-level diversity. We include the sampling configuration details in Appendix C.

Result Analysis.

As shown in Table 2, for the quality of generated texts, our method is able to preserve the semantic information from source texts well. For LSTM-based models, PG + GP prior generates better quality texts than both its deterministic baseline PG and the other variational baselines, e.g. Multi-Selectors and Variation Attention. Note that PG + Normal prior experiences the posterior collapse problem (Bowman et al., 2016; Kim et al., 2018; Dieng et al., 2019) during training, which causes the context variables to preserve little semantic information from the source text and the model to generate random tokens during inference. For Transformer-based models, T-CVAE generates better quality texts than T5, T5 + Normal prior and T5 + GP prior. However, T-CVAE lowercases all input and output tokens while the other models keep both lowercase and capital tokens; this text preprocessing step may give T-CVAE an unfairly favorable quality score. Note that the posterior collapse problem does not happen in T5 + Normal prior, and T5 + GP prior still outperforms T5 + Normal prior, which shows the advantage of GP priors in introducing context-aware variations.

4.5 Text Style Transfer

We evaluate our model’s capability of generating stylistic texts using the Grammarly’s Yahoo Answers Formality Corpus (GYAFC) Rao and Tetreault (2018). In this task, we first compare the quality of generated texts between our method and other competitive baselines, then we test the diversity of generated texts between our GP prior and conditional variational autoencoders. We include the experimental setups in Appendix C.

Dataset.

The GYAFC dataset covers two sub-domains: Entertainment & Music (E&M), which has 52,593 training pairs, 2,877 validation pairs, 1,416 testing pairs; and Family & Relationships (F&R), which has 51,967 training pairs, 2,788 validation pairs, 1,332 testing pairs.

Informal Sentence: Your age… Dude that one is old.
Formal References: ["Your age. That one is old."; "You are quite old."; "Wow, that one is very old."; "How old are you?"]
T-CVAE | T5 + Normal Prior | T5 + GP Prior
you are older . | Your age is old. | Your age, that one is old.
you are older . | Your age, that one is old. | Your age, that one is old.
you are older . | You’re your age. No, that one is old. | Your age, and that one is old.
you are older . | You are your age. Due, that one is old. | You are your age, and that one is old.
you are older . | Your age, that one is old……… | Your age, you are not the one who is old.
you are older . | Your age and i…….. | You’re a fool, that one is old.
you are older . | Your age is arbitrary to you……… | Your age doesn’t matter, that one is old.
you are older . | You are a very, jo, you are a very, jo | Your age is due to the fact is very old.
you are older . | You are a ant / a ante / a ante / a ante / | Regardless of your age, he is a young person.
you are older . | Your count count count count count count | I am not sure your age, but that one is old.
Table 4: Sample outputs conditioned on different 𝒛\bm{z} sampled from qϕ(𝒛|𝒙)q_{\phi}(\bm{z}|\bm{x}) on the GYAFC (E&M) test set.

Result Analysis.

For the quality of generated texts, the GP prior makes the model more robust in generating accurate texts. As shown in Table 2, for LSTM-based models, PG + GP prior generates the most accurate texts compared with PG, Multi-Selectors and Variation Attention. Note that PG + Normal prior also experiences the posterior collapse problem on the GYAFC datasets, resulting in very low quality scores on the test set. For Transformer-based models, T5 + GP prior achieves better performance than T5, T5 + Normal prior and T-CVAE, which shows the superiority of GP priors in encoding contextual information.

For the diversity of generated texts, imposing context-aware variations on encoder hidden states is beneficial for generating diverse outputs. As demonstrated in Table 3, for Transformer-based models, T5 + GP prior gives the best diversity performance at both the token level and the sentence level compared with T5 + Normal prior and T-CVAE. The model performance on the style transfer task verifies the capability of our GP prior in promoting generation diversity. Table 4 shows some diverse generation outputs of Transformer-based variational encoder-decoder models. However, we also notice that increasing diversity inevitably causes degradation in quality, because 𝒛1:N\bm{z}_{1:N} are i.i.d. sampled from a high-dimensional multivariate Gaussian qϕ(𝒛𝒉,𝒚)q_{\phi}(\bm{z}\mid\bm{h},\bm{y}). As discussed in previous work (Vono et al., 2022), multivariate sampling in high-dimensional settings can become computationally demanding.

Computation Complexity Analysis.

Our GP priors require more computation during training, where the major cost comes from calculating the full covariance matrix of the context variables under the GP prior. However, during inference, we approximate the GP posterior with a variational posterior qϕ(𝒛𝒉,𝒚)q_{\phi}(\bm{z}\mid\bm{h},\bm{y}) and conduct i.i.d. sampling, which avoids costly multivariate sampling and has the same computational complexity as other conditional variational autoencoder baselines at testing time.

5 Related Works

Diverse text generation.

Related works on diverse text generation mainly focus on changing decoding strategies at the decoder side or introducing randomness at the encoder side. At the decoder side, recent works apply various decoding algorithms to promote diversity, such as diverse beam search Vijayakumar et al. (2016), top-k sampling Fan et al. (2018) and nucleus sampling Holtzman et al. (2019). Our model is orthogonal to these diverse decoding algorithms since we focus on the encoder side. Another group of works He et al. (2018); Shen et al. (2019) propose to use a mixture of decoders to generate multiple outputs, where the context encodings are shared across multiple decoders. At the encoder side, Cho et al. (2019) propose to leverage a mixture of selectors to identify key contents from the source text, where each selector samples a sequence of binary latent variables as a hard attention mask on every source token. Xu et al. (2018) train different pattern embeddings, and generate diverse paraphrases conditioned on different pattern embeddings.

Conditional variational autoencoders.

Variational encoder-decoder models Deng et al. (2018); Bahuleyan et al. (2018); Wang and Wan (2019); Sun et al. (2021) are related to our method. Deng et al. (2018) formulate the attention vector as latent alignment variables, and use the latent variables as hard attention for the decoder to select which source words to focus on during generation. Wang and Wan (2019) present a conditional variational autoencoder based on Transformer, and learn a latent variable for generating diverse texts for the story completion task. Sun et al. (2021) propose a self-separated conditional variational autoencoder that introduces group information to regularize the latent variables, which alleviates the posterior collapse problem and improves the model performance in the dialogue generation task.

6 Conclusion

In this work, we investigate the problem of generating high quality texts for variational encoder-decoder models. We propose a novel stochastic function to introduce context-aware variations into encoder hidden states, which provides the decoder with more diverse contextual representations. To learn this stochastic function, we propose a GP prior to model the dependency between random context variables, and apply an efficient amortized variational inference method to approximate the GP posterior. Experimental results demonstrate that our method can learn a better contextual representation that leads to higher generation quality compared with deterministic encoder-decoder models and conditional variational autoencoders.

References

  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint, arXiv:1409.0473.
  • Bahuleyan et al. (2018) Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. 2018. Variational attention for sequence-to-sequence models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1672–1682.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
  • Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics.
  • Cho et al. (2019) Jaemin Cho, Minjoon Seo, and Hannaneh Hajishirzi. 2019. Mixture content selection for diverse sequence generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3121–3131, Hong Kong, China. Association for Computational Linguistics.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
  • Deng et al. (2018) Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. 2018. Latent alignment and variational attention. In Advances in Neural Information Processing Systems, pages 9712–9724.
  • Deshpande et al. (2019) Aditya Deshpande, Jyoti Aneja, Liwei Wang, Alexander G Schwing, and David Forsyth. 2019. Fast, diverse and accurate image captioning guided by part-of-speech. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10695–10704.
  • Dieng et al. (2019) Adji B. Dieng, Yoon Kim, Alexander M. Rush, and David M. Blei. 2019. Avoiding latent variable collapse with generative skip models. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2397–2405. PMLR.
  • Duan et al. (2020) Yu Duan, Canwen Xu, Jiaxin Pei, Jialong Han, and Chenliang Li. 2020. Pre-train and plug-in: Flexible conditional text generation with variational auto-encoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 253–262, Online. Association for Computational Linguistics.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7).
  • Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.
  • Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.
  • He et al. (2018) Xuanli He, Gholamreza Haffari, and Mohammad Norouzi. 2018. Sequence to sequence mixture model for diverse machine translation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 583–592.
  • Holtzman et al. (2019) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  • Jhamtani et al. (2017) Harsh Jhamtani, Varun Gangal, Eduard Hovy, and Eric Nyberg. 2017. Shakespearizing modern language using copy-enriched sequence to sequence models. In Proceedings of the Workshop on Stylistic Variation, pages 10–19.
  • Kim et al. (2018) Yoon Kim, Sam Wiseman, Andrew Miller, David Sontag, and Alexander Rush. 2018. Semi-amortized variational autoencoders. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2678–2687. PMLR.
  • Kingma and Welling (2014) Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In Proceedings of the International Conference on Representation Learning.
  • Lan et al. (2017) Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1224–1234.
  • Li et al. (2018) Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3865–3878, Brussels, Belgium. Association for Computational Linguistics.
  • Liu and Liu (2019) Danyang Liu and Gongshen Liu. 2019. A transformer-based variational autoencoder for sentence generation. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE.
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
  • Murphy (2012) Kevin P Murphy. 2012. Machine learning: a probabilistic perspective. MIT press.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  • Prakash et al. (2016) Aaditya Prakash, Sadid A Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016. Neural paraphrase generation with stacked residual lstm networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2923–2934.
  • Qian and Cheung (2019) Dong Qian and William K. Cheung. 2019. Enhancing variational autoencoders with mutual information neural estimation for text generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4047–4057, Hong Kong, China. Association for Computational Linguistics.
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.
  • Rao and Tetreault (2018) Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140, New Orleans, Louisiana. Association for Computational Linguistics.
  • Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher K.I. Williams. 2006. Gaussian Processes for Machine Learning. The MIT Press.
  • See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
  • Serban et al. (2016) Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Shen et al. (2019) Tianxiao Shen, Myle Ott, Michael Auli, and Marc’Aurelio Ranzato. 2019. Mixture models for diverse machine translation: Tricks of the trade. arXiv preprint arXiv:1902.07816.
  • Shinoda et al. (2021) Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa. 2021. Improving the robustness of QA models to challenge sets with variational question-answer pair generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 197–214, Online. Association for Computational Linguistics.
  • Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, volume 28, pages 3483–3491. Curran Associates, Inc.
  • Sordoni et al. (2015) Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 553–562.
  • Sun et al. (2021) Bin Sun, Shaoxiong Feng, Yiwei Li, Jiamou Liu, and Kan Li. 2021. Generating relevant and coherent dialogue responses using self-separated conditional variational AutoEncoders. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5624–5637, Online. Association for Computational Linguistics.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Vijayakumar et al. (2016) Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
  • Vono et al. (2022) Maxime Vono, Nicolas Dobigeon, and Pierre Chainais. 2022. High-dimensional gaussian sampling: a review and a unifying approach based on a stochastic proximal point algorithm. SIAM Review, 64(1):3–56.
  • Wang and Wan (2019) Tianming Wang and Xiaojun Wan. 2019. T-cvae: Transformer-based conditioned variational autoencoder for story completion. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5233–5239. International Joint Conferences on Artificial Intelligence Organization.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  • Wu et al. (2020) Chen Wu, Prince Zizhuang Wang, and William Yang Wang. 2020. On the encoder-decoder incompatibility in variational text modeling and beyond. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3449–3464, Online. Association for Computational Linguistics.
  • Xu et al. (2018) Qiongkai Xu, Juyan Zhang, Lizhen Qu, Lexing Xie, and Richard Nock. 2018. D-page: Diverse paraphrase generation. arXiv preprint arXiv:1808.04364.
  • Zhu et al. (2018) Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1097–1100.

Appendix A Gaussian Processes as Function Priors

(a) Sample gg from the Gaussian process prior.
(b) Sample gg from the Gaussian process posterior.
Figure 3: Samples of potential functions gg from (a) a Gaussian process prior 𝒢𝒫(𝟎,𝐊)\mathcal{GP}(\bm{0},\mathbf{K}) which uses a squared exponential kernel k(h,h)=exp((hh)22)k(h,h^{\prime})=\exp(-\frac{(h-h^{\prime})^{2}}{2}), and (b) a Gaussian process posterior p(g𝒟)p(g\mid\mathcal{D}) conditioned on the training data 𝒟\mathcal{D}.

In this work, we are interested in learning a non-linear mapping function from encoder hidden states to latent context variables. A Gaussian process has the nice property that it can represent complex non-linear functions and also allows uncertainty to account for noisy data observations. Considering a set of observed training data 𝒟={(𝒉i,𝒛i)}i=1N\mathcal{D}=\{(\bm{h}_{i},\bm{z}_{i})\}_{i=1}^{N}, a Gaussian process defines a probability distribution over possible functions p(g)p(g). Given a Gaussian process prior 𝒢𝒫(𝟎,𝐊)\mathcal{GP}(\bm{0},\mathbf{K}) on the function g(𝒉)g(\bm{h}), we have:

𝒛i\displaystyle\bm{z}_{i} =\displaystyle= g(𝒉i)+ϵi\displaystyle g(\bm{h}_{i})+\epsilon_{i} (14)
g(𝒉i)\displaystyle g(\bm{h}_{i}) \displaystyle\sim 𝒢𝒫(𝟎,𝐊)\displaystyle\mathcal{GP}(\bm{0},\mathbf{K}) (15)
ϵi\displaystyle\epsilon_{i} \displaystyle\sim 𝒩(𝟎,σ2𝐈)\displaystyle\mathcal{N}(\bm{0},\sigma^{2}\mathbf{I}) (16)

Note that ϵi\epsilon_{i} is the noise of the observed data point 𝒛i\bm{z}_{i}, which is assumed to be independent, identically distributed Gaussian noise with variance σ2{\sigma}^{2}. 𝐊\mathbf{K} is the covariance matrix, which is constructed using a squared exponential covariance function k(𝒉,𝒉)=exp(𝒉𝒉22)k(\bm{h},\bm{h}^{\prime})=\exp(-\frac{\left\|\bm{h}-\bm{h}^{\prime}\right\|^{2}}{2}). Now, we can sample different mapping functions g(𝒉)g(\bm{h}) from this Gaussian process prior 𝒢𝒫(𝟎,𝐊)\mathcal{GP}(\bm{0},\mathbf{K}). Figure 3(a) illustrates some possible mapping functions g1g_{1}, g2g_{2} and g3g_{3}.

In a Gaussian process, each training and testing data point is treated as a random variable that follows a Gaussian distribution. Therefore, we can apply Bayesian inference to predict a testing data point 𝒛\bm{z}_{*} conditioned on the observed training data points 𝒟\mathcal{D}. For concise notation, we let 𝒉1:N={𝒉i}i=1N\bm{h}_{1:N}=\{\bm{h}_{i}\}_{i=1}^{N}, 𝒛1:N={𝒛i}i=1N\bm{z}_{1:N}=\{\bm{z}_{i}\}_{i=1}^{N}, and 𝐊1=[k(𝒉1:N,𝒉1:N)+σ2𝐈]1\mathbf{K}^{-1}=[k(\bm{h}_{1:N},\bm{h}_{1:N})+{\sigma}^{2}\mathbf{I}]^{-1}. The probability distribution of the testing data point 𝒛\bm{z}_{*} can be computed by:

p(𝒛𝒉,𝒟)=p(𝒛𝒉,𝒈,𝒟)p(𝒈𝒟)𝑑gp(\bm{z}_{*}\mid\bm{h}_{*},\mathcal{D})=\int p(\bm{z}_{*}\mid\bm{h}_{*},\bm{g},\mathcal{D})p(\bm{g}\mid\mathcal{D})dg (17)

where

p(𝒛𝒉,𝒟)𝒩(𝝁,𝐊)𝝁=k(𝒉,𝒉1:N)𝐊1𝒛𝐊=k(𝒉,𝒉)k(𝒉,𝒉1:N)𝐊1k(𝒉1:N,𝒉)\begin{split}&p(\bm{z}_{*}\mid\bm{h}_{*},\mathcal{D})\sim\mathcal{N}(\bm{\mu}_{*},\mathbf{K}_{*})\\ &\bm{\mu}_{*}=k(\bm{h}_{*},\bm{h}_{1:N})\mathbf{K}^{-1}\bm{z}\\ &\mathbf{K}_{*}=k(\bm{h}_{*},\bm{h}_{*})-k(\bm{h}_{*},\bm{h}_{1:N})\mathbf{K}^{-1}k(\bm{h}_{1:N},\bm{h}_{*})\end{split} (18)

Intuitively, training data points 𝒟\mathcal{D} constrain the set of functions gg to pass through them since the covariance becomes smaller when we have training data, as shown in Figure 3(b).

Under our variational encoder-decoder framework, 𝒉1:N\bm{h}_{1:N} are encoder hidden states and 𝒛1:N\bm{z}_{1:N} are latent context variables. Since the Gaussian process induces a distribution over the mapping function g(𝒉)g(\bm{h}), theoretically we could sample an infinite number of mapping functions, where each function gives us a different set of latent context representations 𝒛1:N\bm{z}_{1:N}. In this way, we obtain diverse context representations in encoder-decoder models.
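For completeness, a small NumPy sketch of the posterior prediction in Equations 17 and 18 is given below, using the same unit-lengthscale squared exponential kernel as Figure 3; the variable names and the noise level are assumptions for illustration.

```python
import numpy as np

def rbf(A, B):
    """Squared-exponential kernel k(h, h') used in Figure 3 (unit lengthscale)."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / 2.0)

def gp_posterior(H_train, Z_train, H_test, noise_var=0.1):
    """Posterior mean and covariance of z_* given training pairs (Eq. 18)."""
    K = rbf(H_train, H_train) + noise_var * np.eye(len(H_train))   # k(h_{1:N}, h_{1:N}) + sigma^2 I
    K_star = rbf(H_test, H_train)                                   # k(h_*, h_{1:N})
    K_inv = np.linalg.inv(K)
    mu_star = K_star @ K_inv @ Z_train
    cov_star = rbf(H_test, H_test) - K_star @ K_inv @ K_star.T
    return mu_star, cov_star
```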

Appendix B Derivations of ELBo

We follow conditional variational autoencoders Sohn et al. (2015) and assume that for given observation 𝒉\bm{h}, 𝒛\bm{z} is drawn from the prior distribution p(𝒛𝒉)p(\bm{z}\mid\bm{h}), and the output 𝒚\bm{y} is generated from the distribution p(𝒚𝒉,𝒛)p(\bm{y}\mid\bm{h},\bm{z}). We learn the variational posterior by minimizing KL(qϕ(𝒛𝒉,𝒚)p(𝒛𝒉,𝒚))\text{KL}(q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\|p(\bm{z}\mid\bm{h},\bm{y})), which is equivalent to maximizing the evidence lower bound of the marginal log-likelihood (ELBo):

\begin{split}&\text{KL}(q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\|p(\bm{z}\mid\bm{h},\bm{y}))\\&=\int q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\log\frac{q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})}{p(\bm{z}\mid\bm{h},\bm{y})}d\bm{z}\\&=\int q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\log\frac{q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})p(\bm{y}\mid\bm{h})}{p(\bm{y}\mid\bm{h},\bm{z})p(\bm{z}\mid\bm{h})}d\bm{z}\\&=\log p(\bm{y}\mid\bm{h})\\&\quad+\int q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\log\frac{q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})}{p(\bm{y}\mid\bm{h},\bm{z})p(\bm{z}\mid\bm{h})}d\bm{z}\end{split} (19)

Since KL(qϕ(𝒛𝒉,𝒚)p(𝒛𝒉,𝒚))0\text{KL}(q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\|p(\bm{z}\mid\bm{h},\bm{y}))\geq 0, we have:

\begin{split}&\log p(\bm{y}\mid\bm{h})\\&\geq-\int q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\log\frac{q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})}{p(\bm{y}\mid\bm{h},\bm{z})p(\bm{z}\mid\bm{h})}d\bm{z}\\&=\mathbb{E}_{q_{\bm{\phi}}}[\log p(\bm{y}\mid\bm{z},\bm{h})+\log p(\bm{z}\mid\bm{h})\\&~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\log q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})]\\&=\mathbb{E}_{q_{\bm{\phi}}}[\log p(\bm{y}\mid\bm{z},\bm{h})]-\mathbb{E}_{q_{\bm{\phi}}}\Big[\log\frac{q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})}{p(\bm{z}\mid\bm{h})}\Big]\\&=\mathbb{E}_{q_{\bm{\phi}}}[\log p(\bm{y}\mid\bm{z},\bm{h})]\\&~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\text{KL}[q_{\bm{\phi}}(\bm{z}\mid\bm{h},\bm{y})\|p(\bm{z}\mid\bm{h})]\end{split} (20)

where ϕ\bm{\phi} are parameters for the variational inference networks.

Appendix C Experiment Setup Details

Model Configurations.

PG is an LSTM-based encoder-decoder model with a copying mechanism. The encoder is a 1-layer Bi-LSTM, and the decoder is a 1-layer uni-directional LSTM. We set the word embedding size to 300 and the hidden dimension of both the encoder and decoder to 512. The encoder and decoder share the same vocabulary list and word embeddings, and the vocabulary size is 20,000. For the posterior network, both the mean and covariance networks are single feed-forward neural networks, and we set the dimension of the latent variable to 256.

For T5, we use the T5-base implementation from Huggingface (Wolf et al., 2020; https://huggingface.co/transformers/model_doc/t5.html) with its default model configuration. We load the pre-trained weights of T5-base and fine-tune them on our target task datasets. For the posterior network, both the mean and covariance networks are single feed-forward neural networks, and we set the dimension of the latent variable to 512.

Training Configurations.

For the training of PG and T5, we do not apply KL annealing, and the coefficient of the KL divergence is always 1. We use the Adam optimizer Duchi et al. (2011) with a learning rate of 0.0001, and adopt early stopping if the validation loss does not decrease after 10 epochs. For the hyper-parameters {v,r}\{v,r\} of the kernel function in Equation 3, we try a range of values with v[0.01,100]v\in[0.01,100] and r[0.0001,10]r\in[0.0001,10], and perform grid-search cross-validation on the validation set to select the best model. All experiments are independently conducted on a GPU server (RTX 2090 Ti) with a 40-core CPU and 256GB of memory.