
Discrete Variational Attention Models for Language Generation

Xianghong Fang1*    Haoli Bai1*    Zenglin Xu2    Michael Lyu1    Irwin King1
*Equal contribution in random order.
1The Chinese University of Hong Kong
2Harbin Institute of Technology, Shenzhen
[email protected], {hlbai,lyu,king}@cse.cuhk.edu.hk, [email protected]
Abstract

Variational autoencoders have been widely applied to natural language generation; however, they face two long-standing problems: information underrepresentation and posterior collapse. The former arises from the fact that only the last hidden state of the encoder is transformed into the latent space, which is insufficient to summarize the data. The latter results from the imbalanced scale between the reconstruction loss and the KL divergence in the objective function. To tackle these issues, we propose the discrete variational attention model, which places a categorical distribution over the attention mechanism owing to the discrete nature of language. Our approach is combined with an auto-regressive prior to capture the sequential dependency from observations, which enhances the latent space for language generation. Moreover, thanks to the property of discreteness, the training of our proposed approach does not suffer from posterior collapse. Furthermore, we carefully analyze the superiority of the discrete latent space over the continuous space with the common Gaussian distribution. Extensive experiments on language generation demonstrate the advantages of our proposed approach over state-of-the-art counterparts.

1 Introduction

As a representative class of deep generative models, variational autoencoders (VAEs) Kingma and Welling (2013) have been widely applied to natural language generation Wang and Wang (2019); Fu et al. (2019); Li et al. (2019). Given input text x, a VAE learns the variational posterior q(z|x) through the encoder, and reconstructs the output from latent variables z via the decoder p(x|z). To generate diverse sentences, the decoder p(x|z) relies heavily on samples drawn from the prior p(z), which controls the contents, topics and semantics of generation. However, despite the successful applications of VAEs to language generation, they suffer from two long-standing challenges, i.e., information underrepresentation and posterior collapse.

First of all, in most variational language generation methods Fu et al. (2019); He et al. (2019); Wang and Wang (2019); Li et al. (2019), the latent space is derived from only the last hidden state of the encoder, and is therefore insufficient to summarize the input. We refer to this challenge as information underrepresentation. Intuitively, given an observed sentence, the corresponding sequence of hidden states should be semantically correlated and representative during the phase of language generation. Thus a potential solution is to enhance the representation power via the attention mechanism Bahdanau et al. (2014), which can build the correlation between the hidden states of the encoder and decoder. However, little effort has been devoted to utilizing attention in variational language generation.

The next challenge is posterior collapse, a long-standing phenomenon that troubles the training of VAEs, especially in the case of language generation. When the posterior fails to encode any knowledge from the observations, the decoder receives no signal at each time step during generation. Many approaches have been proposed to alleviate this issue, for instance, annealing the KL divergence term Bowman et al. (2015b); Kingma et al. (2017); Fu et al. (2019), revising the model Yang et al. (2017); Semeniuta et al. (2017); Xu and Durrett (2018), and modifying the training procedure He et al. (2019); Li et al. (2019). Despite the effectiveness of these methods, the trade-off between the reconstruction loss and the KL divergence is inevitable, and the phenomenon can still occur when the two terms are not properly scaled.

Aiming to address the above challenges, we propose the Discrete Variational Attention Model (DVAM) with categorical distributions over the attention mechanism. As shown in Figure 1, the proposed DVAM adopts two different RNNs as the encoder network and the decoder network, respectively. In order to better exploit the prior context information for the language generation phase, we introduce the latent stochastic variable z to build the semantic connection between the encoder hidden states and the decoder hidden states via attention Bahdanau et al. (2014). Considering that text inputs x are more naturally modeled as a sequence of discrete symbols rather than continuous ones, we explore the potential of a quantized latent space in the attention mechanism. The advantages of quantized representations have also been empirically justified by the recently proposed vector quantized variational autoencoder (VQVAE) van den Oord et al. (2017); Roy et al. (2018) in learning sequentially correlated priors for image generation. We further show that the quantized representation prevents DVAM from being trapped in posterior collapse (when the KL divergence D_KL(q(z|x) || p(z)) → 0): since the variational posterior q(z|x) is a discrete distribution, the KL divergence is not differentiable with respect to its parameters and hence not involved during model training. This also allows us to learn the variational posterior and the prior separately. We first train DVAM until convergence, where the posterior successfully encodes the sequential dependency from observations. Then we deploy an informative context prior, such as a separate auto-regressive prior van den Oord et al. (2016b), to learn the sequential dependency from the well-trained posterior, after which we can sample diverse and representative latent sequences from the prior for sentence generation. Furthermore, we provide a detailed analysis of the advantages of the quantized latent space over the continuous latent space. Finally, we evaluate the proposed model on several benchmark datasets for language modelling, and experimental results demonstrate the superiority of DVAM over its counterparts in language generation.

Our contributions can thus be summarized as follows:

  1. We propose the discrete variational attention model with an auto-regressive prior to capture the sequential dependency in the latent space, such that the issues of information underrepresentation and posterior collapse can be effectively tackled.

  2. We carefully analyze the reasons why the discrete latent space with a categorical distribution in our model is preferable to the continuous space with the commonly used Gaussian distribution for language generation.

  3. Experimental results on benchmark datasets demonstrate the advantages of our DVAM against state-of-the-art baselines in language generation.

2 Background

2.1 Variational Autoencoders for Language Generation

Variational Autoencoders (VAEs) Kingma and Welling (2013) are a well-known class of generative models. Given observations x, we seek to infer latent variables z from which new observations x̂ can be generated. To achieve this, we need to maximize the marginal log likelihood log p_θ(x), which is usually intractable due to the complex posterior p(z|x). Consequently, an approximate posterior q_ϕ(z|x) (i.e., the encoder) is introduced, and the evidence lower bound (ELBO) of the marginal likelihood is maximized as follows:

\log p_{\theta}(x)\geq\underbrace{\mathbb{E}_{z\sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)]}_{\textrm{reconstruction loss}}-\underbrace{D_{KL}(q_{\phi}(z|x)\|p(z))}_{\textrm{KL divergence}}, \qquad (1)

where p_θ(x|z) represents the likelihood function conditioned on latent codes z, also known as the decoder; θ and ϕ are the corresponding parameters.

In language generation, VAEs are used to learn the mapping from the latent space of z to the observations x. Based on this mapping, new sentences can be effectively generated. VAEs are usually equipped with an RNN encoder and decoder, where the input sentences x are given to the encoder and the latent variables z are derived from the last hidden state h^e_T via the reparameterization trick Kingma and Welling (2013), z = μ(h^e_T) + σ(h^e_T)ϵ with ϵ ∼ N(0, I). To generate new sentences, we sample latent variables z from the prior, which are forwarded to the decoder to generate sequences x̂ by

p(\hat{x}_{1:T}|z)=\prod_{t=2}^{T}p(\hat{x}_{t}|\hat{x}_{t-1},z)\cdot p(\hat{x}_{1}|z). \qquad (2)
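For concreteness, the following is a minimal PyTorch sketch of this standard setup; the class and layer names (e.g., SentenceVAE, mu_layer) and the choice of conditioning the decoder through its initial hidden state are illustrative assumptions rather than the paper's released code.

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    """Minimal VAE skeleton: last encoder state -> Gaussian latent -> RNN decoder."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, latent_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.mu_layer = nn.Linear(hidden_dim, latent_dim)      # mu(h^e_T)
        self.logvar_layer = nn.Linear(hidden_dim, latent_dim)  # log sigma^2(h^e_T)
        self.latent_to_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        emb = self.embed(x)                               # (B, T, E)
        _, (h_T, _) = self.encoder_rnn(emb)               # only the last hidden state is used
        mu, logvar = self.mu_layer(h_T[-1]), self.logvar_layer(h_T[-1])
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        h0 = self.latent_to_hidden(z).unsqueeze(0)                # condition the decoder on z
        dec_out, _ = self.decoder_rnn(emb, (h0, torch.zeros_like(h0)))
        logits = self.out(dec_out)                        # p(x_t | x_{<t}, z) at each time step
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return logits, kl
```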

2.2 Diminishing Effect of Latent Variables

Modeling RNN-based language models with VAEs is prone to the diminishing effect of latent variables, i.e., the generated sequences x̂ are only loosely dependent on z, and thus the learned latent space plays no role in generation. The diminishing effect largely comes from two aspects:

Information Underrepresentation.

When sentences x are fed into the encoder, the corresponding latent variables z are obtained through the transformation of only the last hidden state h^e_T of the encoder. The resulting latent space, however, is usually insufficient to summarize the observations x. The rich semantics of the whole sequence are thus lost in z, which is known as information underrepresentation. During generation, such latent variables z are forwarded to the decoder, and they cannot effectively guide the decoder to generate sentences with high correlation and quality. Some recent attempts Bahuleyan et al. (2018); Deng et al. (2018) borrow ideas from the attention mechanism Bahdanau et al. (2014) and introduce variational distributions over the context vector; however, they use an uninformative prior that may fail to sample sequentially correlated latent variables for sentence generation.

Posterior Collapse.

Posterior collapse usually arises when D_KL(q_ϕ(z|x) || p(z)) in Equation (1) diminishes to zero, where the local optimum gives q_ϕ(z|x) = p(z). When posterior collapse happens, we can verify that x is independent of z, since p(x)p(z) = p(x)q_ϕ(z|x) = p(x) · p(x,z)/p(x) = p(x,z). Therefore, the encoder learns a data-agnostic posterior (i.e., the standard Normal distribution) without any information from the observations x, while the decoder learns to generate by itself without actually relying on the latent variable z.

Posterior collapse is hard to avoid because the ELBO contains both the reconstruction loss E_{z∼q_ϕ(z|x)}[log p_θ(x|z)] and the KL divergence D_KL(q_ϕ(z|x) || p(z)). When the scales of the two terms are not properly balanced, the KL divergence can easily be over-optimized. Moreover, a powerful auto-regressive decoder can learn to generate sentences x̂ by itself without actually relying on z.

3 Methods

Figure 1: The overall architecture of the proposed DVAM. Given observations x, the encoder first maps x to hidden states h^e_{1:T}, which are then quantized based on the Euclidean distance to the code book. The quantized hidden states e_{z_{1:T}} are then forwarded to the attention module to align the decoder. To generate new sentences from DVAM, we deploy an auto-regressive prior to draw sequentially correlated samples.

In order to tackle the diminishing effect of latent variables in variational language models, we first present the Discrete Variational Attention Model (DVAM), and then introduce meaningful priors for language generation.

3.1 Discrete Variational Attention Models

As shown in Figure 1, the proposed DVAM adopts two different RNNs as the encoder network and the decoder network, respectively. In order to build the connection between the encoder hidden states (denoted by h^e_{1:T}) and the decoder hidden states (denoted by h^d_{1:T}), we further involve an attention mechanism similar to that of sequence-to-sequence (seq2seq) models Bahdanau et al. (2014).

A proper variational posterior q_ϕ(z_{1:T}|x) plays an important role in effectively capturing the semantic information from the observations x. A simple idea would be to choose the widely used Gaussian distribution, which, however, easily leads to posterior collapse (as will be discussed in Section 3.3). As representations of language are discrete in nature, we instead quantize the latent space with a code book {e_k}_{k=1}^K, where K is the code book size. The combination of code book vectors thereby represents the sequential dependency of the observed sentences x. The advantages of quantized representations have also been verified by the recently proposed vector quantized variational autoencoder van den Oord et al. (2017); Razavi et al. (2019) for image generation. We set the variational posterior q_ϕ(z_{1:T}|x) to be a categorical distribution over the indices of {e_k}_{k=1}^K as follows:

q_{\phi}(z_{t}=k|x)=\begin{cases}1&\text{for }k=\arg\min_{j}\|h^{e}_{t}-e_{j}\|_{2}\\ 0&\text{otherwise}\end{cases}. \qquad (3)
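A minimal sketch of this quantization step in PyTorch is given below; the tensor shapes and function name are illustrative assumptions, not the released implementation.

```python
import torch

def quantize(h_e, codebook):
    """Nearest-neighbor quantization as in Equation (3).

    h_e:      (B, T, D) continuous encoder hidden states h^e_{1:T}
    codebook: (K, D) code book vectors {e_k}
    Returns the code indices z_{1:T} and the quantized states e_{z_{1:T}}.
    """
    # Euclidean distance between every hidden state and every code vector.
    dist = torch.cdist(h_e, codebook.unsqueeze(0).expand(h_e.size(0), -1, -1))  # (B, T, K)
    z = dist.argmin(dim=-1)        # deterministic categorical posterior (one-hot)
    e_z = codebook[z]              # (B, T, D) quantized hidden states
    return z, e_z
```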

Given the code book index z_t, the encoder hidden state h^e_t is therefore quantized to e_{z_t}, and the attention scores can be computed as usual by

\alpha_{ti}=\frac{\exp(\tilde{\alpha}_{ti})}{\sum_{j=1}^{T}\exp(\tilde{\alpha}_{tj})}, \qquad (4)

where \tilde{\alpha}_{t}=v^{\top}\tanh(W_{e}e_{z_{1:T}}+W_{d}h^{d}_{t-1}+b) is the score before the softmax normalization.

We then compute the context vector c_{t}=\sum_{i=1}^{T}\alpha_{ti}e_{z_{i}} as an extra input to the decoder, and reformulate the generation process as

p(\hat{x}_{1:T}|z_{1:T})=\prod_{t=2}^{T}p(\hat{x}_{t}|\hat{x}_{t-1},z_{1:T})\,p(\hat{x}_{1}|z_{1:T}). \qquad (5)

Therefore, at each time step the decoder receives supervision from the context vector c_t, which is a weighted sum of the code book vectors {e_k}_{k=1}^K over the whole sequence. Consequently, the variational posterior q_ϕ(z_{1:T}|x) encodes the sequential dependency of the observations, which addresses the issue of information underrepresentation.
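The attention of Equation (4) and the context vector c_t can be sketched as follows; the additive-attention parameterization mirrors the formulas above, and module names such as W_e, W_d and v are used purely for illustration.

```python
import torch
import torch.nn as nn

class QuantizedAttention(nn.Module):
    """Additive attention over quantized encoder states e_{z_{1:T}} (Eq. 4), producing
    the context vector c_t = sum_i alpha_{ti} e_{z_i} that conditions the decoder."""
    def __init__(self, code_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.W_e = nn.Linear(code_dim, attn_dim, bias=False)
        self.W_d = nn.Linear(dec_dim, attn_dim, bias=True)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, e_z, h_dec_prev):
        # e_z: (B, T, code_dim) quantized states; h_dec_prev: (B, dec_dim) = h^d_{t-1}
        scores = self.v(torch.tanh(self.W_e(e_z) + self.W_d(h_dec_prev).unsqueeze(1)))  # (B, T, 1)
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)       # attention weights alpha_{t,1:T}
        c_t = torch.bmm(alpha.unsqueeze(1), e_z).squeeze(1)     # context vector (B, code_dim)
        return c_t, alpha
```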

To allow new sentence generation, our model can also draw sequentially dependent samples from the prior p(z_{1:T}), which will be discussed in the next section.

3.2 Model Training

Thanks to the nice property of discreteness, the optimization of DVAM does not suffer from posterior collapse. To see this, note that the ELBO in Equation (1) includes both the reconstruction loss and the KL divergence, whereas the KL divergence of DVAM can be written as

D_{KL}(q_{\phi}(z_{1:T}|x)\,\|\,p(z_{1:T}))
= \sum_{t=1}^{T}\Big\{-H(q_{\phi}(z_{t}))-\sum_{k=1}^{K}q_{\phi}(z_{t}=k|x)\log p(z_{t}=k)\Big\}
= -0-\sum_{t=1}^{T}\big\{1\cdot\log p(z_{t}=j_{t})+0\cdot\log p(z_{t}\neq j_{t})\big\}, \qquad (6)

where the third line is obtained from q_ϕ(z_t = j_t) = 1, q_ϕ(z_t ≠ j_t) = 0, and the fact that H(q_ϕ(z_t)) = −1·log 1 − 0·log 0 = 0. Consequently, D_KL(q_ϕ(z|x) || p(z)) is not differentiable w.r.t. the variational parameters ϕ. Optimizing the KL divergence therefore plays no role in optimizing the reconstruction loss, and we do not even need to know the form of the prior p(z_{1:T}) in advance.

On the other hand, since the latent variables z_{1:T} are obtained based on the Euclidean distance to the code book, we encourage the hidden states to stay close to {e_k}_{k=1}^K. Therefore, the training objective of DVAM can be formulated as

\min_{\theta,\phi}-\mathbb{E}_{z_{1:T}\sim q_{\phi}}\log p_{\theta}(x|z_{1:T})+\beta\sum_{t=1}^{T}\|h_{t}^{e}-\mathrm{sg}(e)\|_{F}^{2}, \qquad (7)

where β is the regularization coefficient and sg(·) stands for the stop-gradient operation. Since the quantization is non-differentiable, we adopt the widely used straight-through estimator (STE) Bengio et al. (2013) to copy gradients from e_{z_t} to h^e_t, as shown in Figure 1. As for the code book {e_k}_{k=1}^K, since Equation (7) is non-differentiable with respect to it, we first apply a K-means-style assignment to average the hidden states h^e_{1:T} that are closest to each code vector, and then take an exponential moving average over the code book to stabilize the mini-batch updates.
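Below is a minimal training sketch, assuming the straight-through estimator and the EMA code-book update follow the usual VQ-VAE recipe; function names, hyper-parameter values and the omitted decoder call are illustrative.

```python
import torch
import torch.nn.functional as F

def straight_through(h_e, e_z):
    """Straight-through estimator: the forward pass uses the quantized e_z, while
    gradients are copied back to the continuous hidden states h_e."""
    return h_e + (e_z - h_e).detach()

def dvam_loss(logits, targets, h_e, e_z, beta):
    """Training objective of Equation (7): reconstruction loss plus the
    beta-weighted commitment term ||h^e_t - sg(e_{z_t})||^2."""
    rec = F.cross_entropy(logits.transpose(1, 2), targets)   # average -log p_theta(x|z)
    commit = F.mse_loss(h_e, e_z.detach())                   # stop-gradient on the code book
    return rec + beta * commit

@torch.no_grad()
def ema_codebook_update(codebook, ema_count, ema_sum, h_e, z, decay=0.99, eps=1e-5):
    """EMA update of the code book: average the hidden states assigned to each code."""
    K, D = codebook.shape
    h_flat, z_flat = h_e.reshape(-1, D), z.reshape(-1)
    assign = F.one_hot(z_flat, K).type_as(h_flat)            # (N, K) hard assignments
    ema_count.mul_(decay).add_(assign.sum(0), alpha=1 - decay)
    ema_sum.mul_(decay).add_(assign.t() @ h_flat, alpha=1 - decay)
    n = ema_count.sum()
    counts = (ema_count + eps) / (n + K * eps) * n           # smooth rarely used codes
    codebook.copy_(ema_sum / counts.unsqueeze(1))
```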

Auto-regressive Prior for Language Generation

Although the optimization of Equation (7) does not rely on the choice of the prior, the form of the prior does affect the process of language generation. Recall that in the training phase, latent variables z_{1:T} ∼ q_ϕ(z_{1:T}|x) are sampled conditioned on the input x, and the posterior therefore learns to encode the sequential dependency in the latent space. Nevertheless, to generate new sentences after training, we first sample z_{1:T} unconditionally from the prior and then pass it to the decoder for generation, where the sequential dependency can hardly be guaranteed. As a result, the decoder cannot receive structured supervision from the latent space for valid generation.

To solve this problem, we seek an auto-regressive prior p_ψ(z_{1:T}) = p_ψ(z_1)∏_{t=2}^{T} p_ψ(z_t|z_{1:t-1}), parameterized by ψ, with enough capacity to capture the underlying sequential structure in the posterior q_ϕ(z_{1:T}|x). Towards that end, we adopt a PixelCNN van den Oord et al. (2016a) to learn the prior. Unlike PixelCNN models on images, we use a 16-layer residual 1-dimensional convolutional network to sweep over the latent sequence. In order to learn the sequential dependency in the posterior q_ϕ(z_{1:T}|x), we first train DVAM using Equation (7) until convergence, and then minimize the KL divergence ∑_t D_KL(q_ϕ(z_t|x) || p_ψ(z_t|z_{1:t-1})) w.r.t. ψ, which reduces to a cross-entropy loss according to Equation (6).
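As a rough sketch of this stage, the snippet below trains a causal 1-D convolutional prior over the discrete codes with a cross-entropy loss; the depth, kernel size and absence of residual connections are simplifications of the 16-layer residual network described above, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvPrior(nn.Module):
    """Auto-regressive prior p_psi(z_t | z_{<t}) over code indices, modeled with
    left-padded (causal) 1-D convolutions over the latent sequence."""
    def __init__(self, num_codes, embed_dim=64, channels=128, layers=4, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(num_codes, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim if i == 0 else channels, channels, kernel)
             for i in range(layers)])
        self.kernel = kernel
        self.out = nn.Conv1d(channels, num_codes, 1)

    def forward(self, z):
        # z: (B, T) code indices; shift right so position t only depends on z_{<t}.
        z_in = F.pad(z, (1, 0), value=0)[:, :-1]
        h = self.embed(z_in).transpose(1, 2)                    # (B, E, T)
        for conv in self.convs:
            h = F.relu(conv(F.pad(h, (self.kernel - 1, 0))))    # causal left padding
        return self.out(h)                                      # (B, K, T) logits over codes

# Fitting the prior to the frozen DVAM posterior codes reduces, for the
# deterministic posterior of Equation (3), to a cross-entropy loss:
#   loss = F.cross_entropy(prior(z_codes), z_codes)
```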

3.3 Discussion on the Distribution of z

As mentioned earlier in Section 3.1, a simple idea is to directly place Gaussian distributions over the attention, i.e., z_t = μ_t(h^e_t) + σ_t(h^e_t)ϵ with ϵ ∼ N(0, I), leading to the Gaussian Variational Attention Model (GVAM).

To analyze the defects of GVAM, we first recall the KL divergence between two Gaussian distributions, which can be written as follows:

\sum_{t=1}^{T}D_{KL}(q_{\phi}(z_{t}|x)\,\|\,p_{\psi}(z_{t}|z_{1:t-1}))
= \sum_{t=1}^{T}\sum_{d=1}^{D}\frac{1}{2}\Big(\log\frac{\hat{\sigma}_{td}^{2}}{\sigma_{td}^{2}}-1+\frac{\sigma_{td}^{2}+(\hat{\mu}_{td}-\mu_{td})^{2}}{\hat{\sigma}_{td}^{2}}\Big), \qquad (8)

where D is the latent dimension of z_t, and {μ_td, σ_td} and {μ̂_td, σ̂_td} denote the parameters of the posterior and prior distributions, respectively. Unlike the objective of DVAM in Equation (6), Equation (8) is differentiable w.r.t. the variational parameters ϕ, and therefore must be included in the training objective. Since we also need an auto-regressive prior during generation, such differentiability poses a challenge for GVAM. Unlike DVAM, which learns the posterior q_ϕ(z_{1:T}|x) in advance to teach the prior, GVAM needs to optimize the posterior and prior jointly, i.e.,

\min_{\phi,\theta,\psi}-\mathbb{E}_{z_{1:T}\sim q_{\phi}}\log p_{\theta}(x|z_{1:T})+D_{KL}(q_{\phi}(z_{1:T}|x)\|p_{\psi}(z_{1:T})). \qquad (9)

Such optimization can easily be troubled by posterior collapse, for two reasons. First, the scale of the KL divergence increases linearly with the length of the sequence, and a large KL term tends to dominate the minimization of the reconstruction loss; training is therefore unstable across observations x of different lengths. Second, and more seriously, both ϕ and ψ are used to minimize the KL divergence. Whenever q_ϕ(z_{1:T}|x) collapses to p_ψ(z_{1:T}) before q_ϕ(z_{1:T}|x) learns any sequential dependency from the observations, both q_ϕ(z_{1:T}|x) and p_ψ(z_{1:T}) are trapped in a local optimum that cannot provide structural supervision to the decoder during either training or generation.
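For reference, a short sketch of the Gaussian KL term in Equation (8) is given below; unlike the discrete case, this quantity is differentiable with respect to both the posterior parameters (μ, σ) and the prior parameters (μ̂, σ̂), which is exactly what couples the two during training. The function name and input shapes are illustrative.

```python
import torch

def gaussian_kl(mu, logvar, mu_hat, logvar_hat):
    """KL(N(mu, sigma^2) || N(mu_hat, sigma_hat^2)) summed over latent dimensions
    and time steps, as in Equation (8). All inputs have shape (B, T, D)."""
    var, var_hat = logvar.exp(), logvar_hat.exp()
    kl = 0.5 * (logvar_hat - logvar - 1.0 + (var + (mu_hat - mu).pow(2)) / var_hat)
    return kl.sum(dim=(1, 2))   # grows linearly with the sequence length T
```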

4 Experiments

In this section, we verify the advantages of the proposed DVAM for language generation. We first perform language modelling on three benchmark datasets, which measures the model capacity of different approaches. Then we investigate the training of the auto-regressive prior and evaluate the sentences generated by these approaches. Finally, we conduct a set of ablation studies to shed more light on DVAM. Code implemented in PyTorch will be released on GitHub.

4.1 Experimental Setup

We use three benchmark datasets of language modelling for verification: Yahoo Answers Xu and Durrett (2018), Penn Treebank (PTB) Marcus et al. (1993), and a down-sampled version of SNLI Bowman et al. (2015a). A summary of dataset statistics is shown in Table 1.

Table 1: Dataset statistics.
Datasets Train Size Val Size Test Size Avg Len
Yahoo 100,000 10,000 10,000 78.7
PTB 42,068 3,370 3,761 23.1
SNLI 100,000 10,000 10,000 9.7
Table 2: Results of language modelling on Yahoo, PTB and SNLI Datasets.
Method Yahoo PTB SNLI
Rec↓ PPL↓ KL Rec↓ PPL↓ KL Rec↓ PPL↓ KL
LSTM-LM - 60.75 - - 100.47 - - 21.44 -
VAE 329.08 61.60 0.00 102.31 106.59 0.00 33.36 22.22 0.007
+anneal 328.62 60.34 0.00 102.18 105.87 0.00 32.33 20.19 1.05
+cyclic 328.78 61.35 0.06 102.07 105.36 0.00 31.71 19.07 1.65
+aggressive 323.08 57.12 4.94 100.81 96.46 1.85 32.91 21.31 0.47
+FBP 326.58 59.68 8.08 98.06 89.91 5.91 31.91 19.41 1.99
+pretraining+FBP 316.17 52.39 16.17 94.88 76.21 7.47 30.62 17.22 2.83
GVAM 350.14 79.28 0.00 102.20 105.94 0.00 30.90 17.68 0.38
DVAM (K=128) 303.65 44.36 1.88 79.94 38.38 2.21 16.08 4.46 2.33
DVAM (K=512) 259.68 25.83 2.60 64.79 19.22 3.13 11.06 2.82 2.58

We compare the proposed DVAM against a number of baselines, including the classical LSTM-LM, the vanilla VAE Kingma and Welling (2013), as well as its advanced variants, e.g., annealing VAE Bowman et al. (2015b), cyclic annealing VAE (https://github.com/haofuml/cyclical_annealing) Fu et al. (2019), lagging VAE (https://github.com/jxhe/vae-lagging-encoder) He et al. (2019), Free Bits (FB) Kingma et al. (2017), and pretraining+FBP VAE (https://github.com/bohanli/vae-pretraining-encoder) Li et al. (2019). We also compare to our closest counterpart, the Gaussian Variational Attention Model (GVAM), so as to directly verify the advantages of discreteness in the variational attention model.

We evaluate the performance of language generation models using three metrics: the reconstruction loss (Rec), calculated as -E_{z∼q_ϕ(z|x)}[log p_θ(x|z)], which measures the ability to recover data from the latent space (the lower the better); the perplexity (PPL), which measures the capacity of language modelling (the lower the better); and the KL divergence (KL) between the posterior q(z|x) and the prior p(z), which indicates whether posterior collapse occurs.
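Assuming the standard convention that perplexity is the exponential of the per-token negative log likelihood, and computed here from the reconstruction loss as stated in Section 4.1.1, the metric can be summarized by the following sketch:

```python
import math

def perplexity(total_nll, total_tokens):
    """PPL = exp(total reconstruction NLL / number of predicted tokens)."""
    return math.exp(total_nll / total_tokens)

# Usage (illustrative): accumulate the summed negative log-likelihood and the
# token count over the test set, then report perplexity(total_nll, total_tokens).
```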

4.1.1 Implementation

For the baselines, we keep the same hyper-parameter settings as pretraining+FBP VAE Li et al. (2019), e.g., the latent dimension of z, the word embedding size, and the hidden size of the LSTM. Since our latent variables are discrete, we do not use importance-weighted samples to approximate the reconstruction loss as in lagging VAE He et al. (2019) and pretraining+FBP VAE Li et al. (2019). Also, for all methods, we compute PPL using the reconstruction loss instead of the ELBO. We therefore reproduce their results based on the released code, which are slightly different from those in the original papers. For GVAM and DVAM, we keep the shared hyper-parameters of the baselines unchanged. By default we set the code book size K to 512. We first warm up the training for 30 epochs, and then gradually increase β in Equation (7) from 0.1 to β_max = 5.0, in a similar spirit to annealing VAE. For all experiments, we use the SGD optimizer with learning rate 1.0, and decay it up to five times whenever the loss on the validation set does not decrease for 2 epochs. For the auto-regressive prior, we use the same architecture and training settings for both GVAM and DVAM.
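A minimal sketch of the β schedule described above (warm-up followed by a gradual increase toward β_max) might look like the following; the linear ramp, its length, and the value of β during warm-up are assumptions about the shape of the schedule, not the exact released configuration.

```python
def beta_schedule(epoch, warmup_epochs=30, ramp_epochs=30, beta_start=0.1, beta_max=5.0):
    """Keep beta at beta_start during warm-up, then ramp linearly up to beta_max."""
    if epoch < warmup_epochs:
        return beta_start
    progress = min(1.0, (epoch - warmup_epochs) / ramp_epochs)
    return beta_start + progress * (beta_max - beta_start)
```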

4.2 Results

4.2.1 Language Modelling

We first perform language modelling over the test corpora of the benchmark datasets, which measures the capacity of different approaches. Generally, a model is more expressive when it achieves lower Rec and PPL on the observations. We average the KL divergence of GVAM and DVAM along the sequence to make them comparable to the rest of the baselines. Note that unlike the other approaches, the KL divergence of DVAM does not affect language modelling or the model capacity, but is only related to the training of the auto-regressive prior.

The results of language modelling on Yahoo, PTB and SNLI are listed in Table 2. Compared to the baselines without variational attention, GVAM does not show advantages on the Yahoo and PTB datasets despite adopting variational attention. The KL divergence of GVAM is as tiny as that of VAE, especially on Yahoo and PTB, where the average sequence length is long. This indicates posterior collapse, where the model fails to learn the sequential dependency in the observations. On the other hand, our DVAM achieves significantly better results on all three datasets, especially with a large code book size K. This success verifies that the variational posterior learns to adapt itself to the sequential dependency, which significantly improves the expressiveness for language modelling.

Table 3: Sampled Sentences on Yahoo Dataset.
Method Samples
pretraining+   i hate wandering, i just wan na know when the skies in the sky and the winds. [/s]
FBP   where is it that morning when snow on thanksgiving ? what’s the next weekend ? dress it!!!! my
VAE mother was the teen mom and i love her and she just is going to be my show. [/s]
  are they allowed to join (francisco) in _UNK. giants in the first place.? check out other answers.
do you miss the economy and not taking risks in the merchant form, what would you tell? go to
the yahoo home page and ask what restaurants follow this one. [/s]
GVAM   didn’t i still worry, he loves porn and feels awful??? [/s]
  if i aint divorced b4 the prom, and i wont worry, worry, i realy worry, and nobody feels awful,
and i realy _UNK sometime, i wont worry, and eventually. [/s]
  what is the worst and worst moment and the worst and worst, and a stranger and hugs, and nobody,
and i deserve it, and nobody feels awful and i wont worry and worry. [/s]
DVAM   i need to start a modeling company ! any suggestions on what is a reliable topic? [/s]
(K=512)   does anyone agree, there is a global warming of the earth? in general. there are several billion things,
including the earth, solar system. [/s]
  is anyone willing to donate plasma if you are allergic to cancer or anything else? probably you can.
i’ve never done any thing but it is only that dangerous to kill bacteria. i have heard that it doesn’t have
any effect on your immune system. [/s]

4.2.2 Training Dynamics

To further demonstrate the advantages of DVAM over GVAM, we now investigate their training dynamics. We plot the curves of Rec and KL on the validation set of PTB in Figure 2. We can see that the KL of GVAM tends to oscillate at the beginning and then diminishes quickly, whereas Rec does not decrease sufficiently. For DVAM, since Rec is not affected by the KL, Rec is minimized sufficiently. Meanwhile, the KL of DVAM converges quickly without oscillation, as only the parameters of the prior are updated to learn the sequential structure in the posterior.

Figure 2: The loss curves during the first 20 epochs on PTB.
Figure 3: Ablation studies of DVAM: (a) code book size K; (b) maximum regularizer β_max; (c) latent dimension of e_{z_t}.

4.2.3 Sampled Sentences

Finally, we compare the sentences generated by pretraining+FBP VAE, GVAM, and our DVAM (K=512). Due to limited space, we randomly sample 3 generated sentences of different lengths from the models trained on the Yahoo dataset, as shown in Table 3. We find that while pretraining+FBP VAE produces readable sentences, their semantic consistency is poor. For GVAM, we can hardly find readable sentences even among short sequences, which is probably due to poorly correlated samples from the prior when both the posterior and prior are trapped in a local optimum. In contrast, our DVAM produces relatively good sentences with consistent semantics, even when the sequences are long. This suggests that the samples from the prior indeed capture sequential dependency, which benefits generation.

4.3 Ablation Studies

By default, all ablation studies are conducted on the Yahoo dataset with the default parameter settings.

4.3.1 Code Book Size K

We begin with the effect of the code book size K on the reconstruction loss and the KL divergence for language modelling. We vary K ∈ {128, 256, 512, 1024}, and the results are shown in Figure 3(a). It can be observed that as K increases, the Rec loss decreases while the KL increases, both monotonically. The results are also consistent with Table 2 when increasing K from 128 to 512. These phenomena are intuitive, since a larger K improves model capacity but poses more challenges for training the auto-regressive prior. Consequently, one should choose the code book size carefully, such that the prior can approximate the posterior well, and yet the posterior is representative enough for the sequential dependency.

4.3.2 Maximum Regularizer β_max

Next we tune the maximum regularizer β_max, which controls the distance of the continuous hidden states h^e_{1:T} to the code book {e_k}_{k=1}^K. Recall that a small β_max only loosely restricts the continuous states h^e_{1:T} to the code book, making the quantization process hard to converge. On the other hand, if β_max is too large, h^e_{1:T} can easily get stuck in a local minimum during training. Therefore, it is necessary to find a proper trade-off between the two situations. We vary β_max ∈ {0.1, 0.2, 0.5, 1, 5, 10, 20}, and the results are shown in Figure 3(b). We find that when β_max = 5, the algorithm achieves the lowest Rec, while smaller or larger β_max both lead to higher Rec.

4.3.3 Dimension of Code Book Vectors

Finally, we vary the latent dimension of {e_k}_{k=1}^K in {8, 16, 32, 64, 128, 256}, and the results are shown in Figure 3(c). We find that the performance of language modelling is relatively robust to the choice of latent dimension. Intuitively, in a continuous space the dimension of the latent variables is closely related to the model capacity. However, in the discrete case, the capacity of the model is largely determined by the code book size K instead of the latent dimension, which is also verified in Table 2 and Figure 3(a).

5 Conclusion

In this paper, we propose the discrete variational attention model, a new algorithm for natural language generation. Our proposed approach addresses the issues of information underrepresentation and posterior collapse. Moreover, we carefully analyze the advantages of discreteness over continuity in variational attention models. Extensive experimental results on benchmark language modelling datasets demonstrate the superiority of our proposed approach. As a future direction, our approach can be applied to more applications of natural language processing, such as text summarization, dialogue systems and poetry generation.

References

  • Bahdanau et al. [2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, 2014.
  • Bahuleyan et al. [2018] Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. Variational attention for sequence-to-sequence models. In COLING, 2018.
  • Bengio et al. [2013] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv, 2013.
  • Bowman et al. [2015a] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. EMNLP, 2015.
  • Bowman et al. [2015b] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. In CoNLL, 2015.
  • Deng et al. [2018] Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. Latent alignment and variational attention. In NIPS, 2018.
  • Fu et al. [2019] Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Çelikyilmaz, and Lawrence Carin. Cyclical annealing schedule: A simple approach to mitigating kl vanishing. In NAACL-HLT, 2019.
  • He et al. [2019] Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. ArXiv, 2019.
  • Kingma and Welling [2013] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, 2013.
  • Kingma et al. [2017] Diederik P. Kingma, Tim Salimans, and Max Welling. Improved variational inference with inverse autoregressive flow. ArXiv, 2017.
  • Li et al. [2019] Bohan Li, Junxian He, Graham Neubig, Taylor Berg-Kirkpatrick, and Yiming Yang. A surprisingly effective fix for deep latent variable modeling of text. In EMNLP/IJCNLP, 2019.
  • Marcus et al. [1993] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 1993.
  • Razavi et al. [2019] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. ArXiv, 2019.
  • Roy et al. [2018] Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and experiments on vector quantized autoencoders. ArXiv, 2018.
  • Semeniuta et al. [2017] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A hybrid convolutional variational autoencoder for text generation. In EMNLP, 2017.
  • van den Oord et al. [2016a] Aäron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with pixelcnn decoders. In NIPS, 2016.
  • van den Oord et al. [2016b] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
  • van den Oord et al. [2017] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NIPS, 2017.
  • Wang and Wang [2019] Prince Zizhuang Wang and William Yang Wang. Neural gaussian copula for variational autoencoder. In EMNLP/IJCNLP, 2019.
  • Xu and Durrett [2018] Jiacheng Xu and Greg Durrett. Spherical latent spaces for stable variational autoencoders. In EMNLP, 2018.
  • Yang et al. [2017] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In ICML, 2017.