
SparseGAN: Sparse Generative Adversarial Network for Text Generation

Liping Yuan1    Jiehang Zeng1    Xiaoqing Zheng1    Jun He2
1Fudan University
   2the Administrative Center of Shanghai R&D Public Service Platforms
{lpyuan19, jhzeng18, zhengxq}@fudan.edu.cn, [email protected]
Abstract

It is still a challenging task to learn a neural text generation model under the framework of generative adversarial networks (GANs), since the entire training process is not differentiable. Existing training strategies suffer either from unreliable gradient estimations or from imprecise sentence representations. Inspired by the principle of sparse coding, we propose SparseGAN, which generates semantic-interpretable but sparse sentence representations as inputs to the discriminator. The key idea is to treat the embedding matrix as an over-complete dictionary and to approximate the output feature representation of the generator at each time step by a linear combination of very few selected word embeddings. With such semantic-rich representations, we not only reduce unnecessary noise for efficient adversarial training, but also make the entire training process fully differentiable. Experiments on multiple text generation datasets yield performance improvements, especially in sequence-level metrics such as BLEU.

1 Introduction

Text generation is an important task in natural language processing. Recurrent neural networks (RNNs) have been empirically proven to be quite successful for text generation due to their capability to capture long-range dependencies. By far the most popular strategy to train RNNs is maximum likelihood estimation (MLE), which maximizes the probability of the next word in a sequence given the current (recurrent) state and the previous ground truth word (also known as teacher forcing). At inference time, the ground truth previous words are unknown and are replaced by words predicted by the model itself. Models trained with the teacher forcing strategy usually suffer from the discrepancy between training and inference, called exposure bias Ranzato et al. (2015), which yields errors because the model is only exposed to the distribution of the training data, instead of its own predictions, at inference time.

Most recently, generative adversarial networks (GANs) Goodfellow et al. (2014) have been used to deal with the exposure bias of RNNs Yu et al. (2017); Che et al. (2017); Lin et al. (2017). In a typical GAN-based text generation framework, a generator produces sentences given random inputs, and a discriminator is trained to distinguish natural sentences from the generated ones. The generator and discriminator play a two-player game, and this competition drives both to improve their desired performance.

Even though GAN-based approaches have been shown to be promising for the text generation task Ke et al. (2019); Weili Nie and Patel (2019), it is still challenging to train a GAN-based text generation model due to the discrete nature of text. The output of the generator has to be sampled to generate discrete texts, which results in a non-differentiable training process because the gradients cannot back-propagate from the discriminator to the generator. Reinforcement learning (RL) techniques were introduced to handle this non-differentiability Yu et al. (2017); Che et al. (2017); Fedus et al. (2018), but they still suffer from high-variance gradient estimates, which are hard to alleviate Li et al. (2019).

An alternative way to deal with the non-differentiability is to use a continuous function to replace the sampling. After the generator estimates a multinomial distribution over the words of a given vocabulary, a differentiable sample, for example Gumbel-Softmax Jang et al. (2016), that can be smoothly annealed into a categorical distribution is used to replace the non-differentiable sample from the categorical distribution. However, as the support of the multinomial distribution is the whole vocabulary, words with close-to-zero probabilities are all taken into consideration. Such an approximation becomes imprecise since these unnecessary words account for a large majority of the vocabulary, which is well known as the long-tailed distribution. Although this can be mitigated by a temperature that controls the “steepness” of the distribution, the problem cannot be completely solved because many unwanted words with nonzero probabilities are still involved, which makes the training inefficient.
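For concreteness, the following minimal PyTorch sketch (not part of SparseGAN; all sizes and names are illustrative assumptions) shows why such a relaxation remains noisy: the soft sample spreads nonzero weight over the entire vocabulary, so the approximated word representation mixes in many long-tail words.

import torch
import torch.nn.functional as F

V, d = 5000, 300                          # hypothetical vocabulary and embedding sizes
logits = torch.randn(1, V)                # generator scores over the vocabulary
embeddings = torch.randn(V, d)            # word embedding matrix

# Differentiable "soft sample": a distribution over the whole vocabulary.
y_soft = F.gumbel_softmax(logits, tau=0.5, hard=False)   # shape (1, V)

# The relaxed word representation mixes every embedding, including the long tail
# of near-zero-probability words, which is the source of the imprecision noted above.
soft_embedding = y_soft @ embeddings      # shape (1, d)
print((y_soft > 1e-4).sum().item(), "words receive non-negligible weight")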

To address the above problem, we propose SparseGAN, which generates low-noise but semantic-interpretable sparse representations (i.e., convex combinations of very few word embeddings) to replace the non-differentiable sample. With such semantic-rich representations, we not only reduce unnecessary noise for efficient adversarial training, but also make the entire training process fully differentiable. Sparse representation has been proven to be powerful for compressing high-dimensional signals Huang and Aviyente (2007). It searches for the most compact representation of a signal in terms of a linear combination of a few signals from an overcomplete dictionary.

In SparseGAN, we take the entire word embedding matrix as an overcomplete dictionary and form the sparse representations as convex combinations of just a few word embeddings. These sparse representations are concatenated and fed into a CNN-based discriminator. We also show that such sparse representations can be produced by a matching pursuit algorithm Mallat and Zhang (1993). Generally speaking, no matter which neural network architectures are used in NLP, the semantic feature representations at each layer are derived from the input (word) embeddings. Our approach encourages the generator and the discriminator in the GAN-based framework to share the same input feature space spanned by the word embeddings, which can be viewed as a regularization that facilitates network training and yields better performance.

2 Related work

GAN-based Text Generation There are mainly two methods to train GAN-based text generation models under the non-differentiability caused by the discrete nature of data. One is to use RL algorithms; the other is to introduce a continuous function to approximate the discrete data in a latent space.

RL-based GANs usually treat the generator as an agent, where the states are the words generated so far and the actions are the next words to be generated. Specifically, SeqGAN Yu et al. (2017) models text generation as a sequential decision making process and trains the generator with the policy gradient algorithm. MaliGAN Che et al. (2017) trains the GAN with a maximum likelihood objective to reduce the gradient variance. RankGAN Lin et al. (2017) introduces a margin-based ranking classifier as the discriminator instead of the original binary classifier. LeakGAN Guo et al. (2018) allows the discriminator to leak its own high-level features to the generator to counter the sparse signal from the discriminator. MaskGAN Fedus et al. (2018) introduces an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context to improve sample quality. However, RL-based models usually suffer from a large variance of gradient estimation and are difficult to converge.

An alternative is to approximate the discrete data in a continuous latent space to deal with the non-differentiability. WGAN Gulrajani et al. (2017) feeds the multinomial distribution produced by the generator directly to the discriminator to avoid the sampling operations. GSGAN Jang et al. (2016) applies the Gumbel-Softmax trick to re-parameterize a discrete distribution, which provides a differentiable way to sample from discrete random variables. RelGAN Weili Nie and Patel (2019), TextGAN Zhang et al. (2017) and GAN-AEL Xu et al. (2017) use a weighted sum over the embedding matrix to yield an approximate representation of the generated word sequences, where each weight is the probability of the corresponding word in the multinomial distribution. These models confine the inputs of the discriminators to the feature space spanned by the word embeddings. Since the embedding matrix is shared by the generated sentences and real sentences, it is easier for the discriminator to converge. However, these methods suffer from the long-tail problem due to the large vocabulary size, resulting in an imprecise approximation of the discrete data. Another type of GANs works directly in a latent space derived from the generator or the encoder of an auto-encoder. GAN2vec Budhkar et al. (2019) generates real-valued word2vec-like vectors instead of discrete multinomial distributions during training. ARAE Junbo et al. (2017) combines an auto-encoder with GANs for text generation, where the intermediate representations of the auto-encoder are directly used for adversarial training. Since the latent spaces of generated sentences and real ones are usually different, it can be difficult to minimize the distance between them.

Figure 1: Architecture. The sentence feature representations produced at each step by the generator and the auto-encoder are transformed into their sparse representations by the sparse encoder. These sparse representations are then summarized and fed into the discriminator to determine whether the sentences are natural or generated ones. The gradients derived from the discriminator’s predictions back-propagate to all previous states in an end-to-end manner. With the sparse representations, only a few words are involved in each parameter update, which restricts unnecessary noise and facilitates training.

Sparse Representation The notion of sparse representation was proposed by Mallat and Zhang (1993). The core idea of sparse representation is to approximate a signal by a linear combination of basis elements selected from a prespecified dictionary. To extract appropriate basis elements, various optimization algorithms have been applied, such as greedy algorithms and convex relaxation. Examples of greedy algorithms include Matching Pursuit (MP) Mallat and Zhang (1993), Orthogonal Matching Pursuit (OMP) Tropp and Gilbert (2007), and Compressive Sampling Matching Pursuit (CoSaMP) Needell and Tropp (2009). Convex relaxation is another family of algorithms for the sparse signal representation problem, including Basis Pursuit (BP) Chen et al. (2011), Gradient Projection for Sparse Reconstruction (GPSR) Figueiredo et al. (2007), and Gradient Descent with Sparsification (GraDeS) Garg and Khandekar (2009). Sparse representation has achieved great success in computer vision, such as face recognition Wright et al. (2008) and object tracking He et al. (2016), but has drawn relatively little attention in NLP. To the best of our knowledge, SparseGAN is among the first to incorporate the idea of sparse representation into GAN-based text generation.

3 Model

We here describe the proposed SparseGAN for discrete text generation. As shown in Figure 1, SparseGAN consists of four components: a generator $G_{\theta}$ to generate sentences, an auto-encoder to extract the latent representations of real sentences, a sparse encoder for rendering sparse representations, and a discriminator $D_{\phi}$ to distinguish real sentences from the generated ones, where $\theta$ and $\phi$ are model parameters.

3.1 LSTM-based Generator

During adversarial training, the generator takes a random variable $z$ as input, and outputs the latent representation $H_{g}\in\mathbb{R}^{T\times d}$ of the generated sentence using a multi-layer Long Short-Term Memory (LSTM) decoder Schmidhuber and Hochreiter (1997):

$H_{g}=G_{\theta}(z)$ (1)

where $T$ denotes the sentence length and $d$ the dimensionality of hidden states.

Specifically, the random variable $z$ has a standard normal distribution, $z\sim\mathcal{N}(0,1)$, and is taken as the initial value of the LSTM decoder’s hidden state. Then, at each time step $t$, the LSTM decoder outputs the hidden state $h_{t}\in\mathbb{R}^{d}$ given the previous state $h_{t-1}\in\mathbb{R}^{d}$ and the previous word $v_{t-1}\in\mathbb{R}^{d}$ predicted by the model:

$h_{t}=\mathcal{H}(h_{t-1},v_{t-1})$ (2)

where $\mathcal{H}(\cdot,\cdot)$ is the standard forward process of an LSTM decoder. Once the whole sequence is generated, the sentence representation $H_{g}$ is derived as the concatenation of all hidden states:

$H_{g}=[h_{1},h_{2},\dots,h_{T}]$ (3)

where $[\cdot,\cdot]$ denotes the concatenation operation of multiple vectors.
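A minimal PyTorch sketch of the generator described by Equations 1-3 is given below; the greedy choice of the previous word, the zero start token, and the layer sizes are assumptions made for illustration rather than details taken from the paper.

import torch
import torch.nn as nn

class LSTMGenerator(nn.Module):
    """Sketch of the LSTM generator (Eqs. 1-3); implementation details are assumed."""
    def __init__(self, vocab_size, d=300, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d)
        self.lstm = nn.LSTM(d, d, num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(d, vocab_size)

    def forward(self, z, T):
        # z: (batch, d) noise vector used to initialize the hidden state.
        batch = z.size(0)
        h0 = z.unsqueeze(0).repeat(self.lstm.num_layers, 1, 1)
        state = (h0, torch.zeros_like(h0))
        v_prev = z.new_zeros(batch, 1, self.embedding.embedding_dim)  # start "word"
        hidden_states = []
        for _ in range(T):
            out, state = self.lstm(v_prev, state)       # h_t = H(h_{t-1}, v_{t-1})
            hidden_states.append(out)
            next_word = self.proj(out).argmax(dim=-1)   # greedy prediction of v_t
            v_prev = self.embedding(next_word)
        return torch.cat(hidden_states, dim=1)          # H_g = [h_1, ..., h_T], (batch, T, d)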

3.2 Denoising Auto-encoder

The purpose of introducing a pretrained denoising auto-encoder (DAE) Vincent et al. (2008) into the GAN-based text generation model is to force the generator to mimic the reconstructed latent representations $H_{r}\in\mathbb{R}^{T\times d}$ of real sentences instead of the conventional embedding representations Haidar et al. (2019). The DAE consists of two parts: a multi-layer bi-LSTM encoder that encodes the input real sentence $r$ into an intermediate representation, and a multi-layer LSTM decoder that outputs the reconstructed hidden state $h^{\prime}_{t}\in\mathbb{R}^{d}$ at each time step. Similar to the generator, these hidden states $h^{\prime}_{t}$ are concatenated to form the latent representation $H_{r}\in\mathbb{R}^{T\times d}$ of the real sentence $r$.
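The following PyTorch sketch outlines such a DAE; how the encoder output conditions the decoder and the exact layer sizes are assumptions made for illustration.

import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    """Sketch of the DAE producing H_r; conditioning scheme and sizes are assumed."""
    def __init__(self, vocab_size, d=300, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d)
        self.encoder = nn.LSTM(d, d // 2, num_layers=num_layers,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(d, d, num_layers=num_layers, batch_first=True)

    def forward(self, noisy_ids, clean_ids):
        # Encode the corrupted sentence, then decode conditioned on the encoding.
        enc_out, _ = self.encoder(self.embedding(noisy_ids))
        h0 = enc_out[:, -1, :].unsqueeze(0).repeat(self.decoder.num_layers, 1, 1)
        dec_in = self.embedding(clean_ids)          # teacher forcing during pretraining
        H_r, _ = self.decoder(dec_in, (h0, torch.zeros_like(h0)))
        return H_r                                  # reconstructed hidden states (batch, T, d)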

3.3 Sparse Encoder

The role of the sparse encoder is to provide sparse versions of the sentence representations, including the generated sentence’s representation $H_{g}$ output by the generator and the real sentence’s representation $H_{r}$ output by the DAE:

$S_{g}=\mathcal{F}_{sparse}(H_{g}), \quad S_{r}=\mathcal{F}_{sparse}(H_{r})$ (4)

where $S_{g},S_{r}\in\mathbb{R}^{T\times d}$, and $\mathcal{F}_{sparse}(\cdot)$ denotes the sparse representation learning algorithm (see Section 4).

3.4 CNN-based Discriminator

A commonly used discriminator for text generation is a convolutional neural network (CNN) classifier, which employs a convolutional layer with multiple filters of different sizes to capture relations of various word lengths, followed by a fully-connected layer. The CNN-based discriminator takes the sparse representation $S\in\mathbb{R}^{T\times d}$ output by the sparse encoder as input, and outputs a score to determine whether the sentences are natural or generated ones. Formally, the scoring function is defined as follows:

$D_{\phi}(S)=Wf(S*\omega)+b$ (5)

where $*$ denotes the convolution operator, $f(\cdot)$ denotes a non-linear function, and $W,b,\omega$ are model parameters.
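A hedged sketch of such a discriminator is shown below with a single filter size; the pooling choice and layer sizes are assumptions rather than the authors' exact architecture.

import torch
import torch.nn as nn

class CNNDiscriminator(nn.Module):
    """Sketch of Eq. 5: a 1-D convolution over sparse representations, max-pooled
    over time and scored by a linear layer; pooling and sizes are assumptions."""
    def __init__(self, d=300, num_filters=300, filter_size=5):
        super().__init__()
        self.conv = nn.Conv1d(d, num_filters, kernel_size=filter_size)
        self.linear = nn.Linear(num_filters, 1)

    def forward(self, S):
        # S: (batch, T, d) sparse sentence representation from the sparse encoder.
        x = torch.relu(self.conv(S.transpose(1, 2)))   # f(S * w): (batch, filters, T')
        x = x.max(dim=2).values                        # pool over time
        return self.linear(x)                          # W f(S * w) + b, one score per sentence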

3.5 Loss Function

Inspired by Wasserstein GAN (WGAN) Gulrajani et al. (2017), the game between the generator $G_{\theta}$ and the discriminator $D_{\phi}$ is the minimax objective:

$L=\mathbb{E}_{z\sim\mathbb{P}_{g}}[D_{\phi}(S_{g})]-\mathbb{E}_{r\sim\mathbb{P}_{r}}[D_{\phi}(S_{r})]+\lambda\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}[(||\nabla_{S_{\hat{x}}}D_{\phi}(S_{\hat{x}})||_{2}-1)^{2}]$ (6)

where $\mathbb{P}_{r}$ is the data distribution, $\mathbb{P}_{g}$ is the distribution of the generator’s input, and $S_{g},S_{r}$ are defined in Equation 4. The gradient penalty term Gulrajani et al. (2017) in the objective function enforces the discriminator to be a 1-Lipschitz function, where $\mathbb{P}_{\hat{x}}$ is the distribution obtained by sampling uniformly along straight lines between pairs of points sampled from $\mathbb{P}_{r}$ and $\mathbb{P}_{g}$, while $S_{\hat{x}}$ is the sparse representation of $\hat{x}$ output by the sparse encoder. The importance of this gradient penalty term is controlled by a hyperparameter $\lambda$.
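A minimal sketch of the discriminator-side objective in Equation 6, including the gradient penalty, is given below; the interpolation and batching details follow the usual WGAN-GP recipe and are assumptions rather than the authors' exact implementation. It can be used with any critic such as the CNNDiscriminator sketched above.

import torch

def discriminator_loss(D, S_g, S_r, lam=10.0):
    """Sketch of Eq. 6 with the WGAN-GP gradient penalty; details are assumed."""
    loss = D(S_g).mean() - D(S_r).mean()

    # Gradient penalty on interpolates between real and generated representations.
    eps = torch.rand(S_r.size(0), 1, 1, device=S_r.device)
    S_hat = (eps * S_r + (1.0 - eps) * S_g).requires_grad_(True)
    grads = torch.autograd.grad(D(S_hat).sum(), S_hat, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return loss + lam * penalty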

4 Sparse Representation Learning

The sparse encoder aims at finding a sparse equivalent of the sentence representation $H\in\mathbb{R}^{T\times d}$. As described before, $H$ is the concatenation of all hidden states, implying that the sparse representation can be computed independently for each state. In this section, we denote by $h_{t}$ the $t$-th state of $H$ for simplicity.

4.1 Problem Definition

Sparse representation learning searches for the most compact representation of a vector via a linear combination of elements in an overcomplete dictionary Mallat and Zhang (1993). Given an overcomplete dictionary $D\in\mathbb{R}^{N\times d}$ with elements in its rows, and a target vector $y\in\mathbb{R}^{d}$, the problem of sparse representation is to find the sparsest coefficient vector $x^{*}\in\mathbb{R}^{N}$ of the linear combination satisfying $y=D^{T}x^{*}$:

$x^{*}=\operatorname*{argmin}_{x}{||x||_{0}}\quad s.t.\quad y=D^{T}x$ (7)

where $||x||_{0}$ is the $\ell_{0}$-norm of $x$, namely the number of non-zero coordinates of $x$. However, the equality constraint is too strict to be satisfied, and it can be relaxed by minimizing the Euclidean distance between $y$ and $D^{T}x$. The original problem is then translated into the following one:

$x^{*}=\operatorname*{argmin}_{x}{\lambda||x||_{0}+\frac{1}{2}||y-D^{T}x||_{2}^{2}}$ (8)

The objective is to reduce the reconstruction error while using as few elements as possible. Once the problem is solved, $D^{T}x^{*}$ can be used as the final sparse representation of $y$.

Inspired by the sparse representation principle, the sparse encoder takes the vocabulary embedding matrix $E\in\mathbb{R}^{N\times d}$ as the overcomplete dictionary and approximates $h_{t}$ as a sparse linear combination of word embeddings in $E$, which can be derived as:

$c^{*}=\operatorname*{argmin}_{c}{\lambda||c||_{0}+\frac{1}{2}||h_{t}-E^{T}c||_{2}^{2}}$ (9)

where $c$ is the coefficient vector of the linear combination. The embedding matrix $E\in\mathbb{R}^{N\times d}$ can serve as the overcomplete dictionary because, in text generation tasks, the embedding matrix is always overcomplete with tens of thousands of words, so the condition $N\gg d$ is satisfied in most cases.

As shown in Figure 2, the constructed sparse representation confines the inputs of the discriminator to the feature space spanned by the word embeddings. Since the generator and the DAE share the same embedding matrix, it is easier for the discriminator to minimize the distance between the distributions of real sentences and generated ones. To solve the above optimization problem, we apply the Matching Pursuit (MP) algorithm Mallat and Zhang (1993).

Figure 2: The sparse encoder. It transforms $h_{t}$ from the hidden space into the embedding space in an iterative way. The final sparse representations take the form of linear combinations of just a few word embeddings (indicated by the yellow words).

4.2 Matching Pursuit Algorithm

The MP algorithm calculates the sparse representation $s_{t}\in\mathbb{R}^{d}$ of $h_{t}$ in an iterative way. As illustrated in Algorithm 1, a residual vector $r_{t}\in\mathbb{R}^{d}$ records the remaining portion of $h_{t}$ that has not yet been expressed. At iteration $l$, the current residue $r_{t}$ is used to search for the nearest word embedding $e_{l}$ (where $l$ indexes the $l$-th iteration) in the embedding matrix $E$ by comparing the inner products between $r_{t}$ and all word embeddings in the embedding matrix:

$e_{l}=\operatorname*{argmax}_{e\in E}\langle r_{t},e\rangle$ (10)

where $\langle\cdot,\cdot\rangle$ is the inner product of two vectors. The concatenation of $e_{l}$ and the previously selected embeddings forms the basis matrix $M\in\mathbb{R}^{k\times d}$, and a linear combination over the row vectors of $M$ is used to approximate $h_{t}$. The coefficient vector $c\in\mathbb{R}^{k}$ of the linear combination is determined by solving the least squares problem:

$c^{*}=\operatorname*{argmin}_{c}{||h_{t}-M^{T}c||_{2}}=M^{+}h_{t}=(MM^{T})^{-1}Mh_{t}$ (11)

where $M^{+}\in\mathbb{R}^{k\times d}$ is the pseudo-inverse of $M^{T}$, i.e., $M^{+}=(MM^{T})^{-1}M$. After $c^{*}$ is calculated, the sparse representation $s_{t}$ of $h_{t}$ can be defined as:

$s_{t}=M^{T}c^{*}$ (12)

where $s_{t}$ is the closest approximation to $h_{t}$ obtained so far. The residual vector $r_{t}$ between $h_{t}$ and $s_{t}$ is then defined as:

$r_{t}=h_{t}-s_{t}=h_{t}-M^{T}c^{*}$ (13)

The process described above is repeated for $L$ iterations, where $L$ is a hyperparameter that controls how well $h_{t}$ is approximated. After $L$ iterations, the final sparse representation $s_{t}\in\mathbb{R}^{d}$ is given by Equation 12. For the other hidden states $h_{1},h_{2},\dots,h_{T}\in H$, the same calculation is performed to obtain their corresponding sparse representations $s_{1},s_{2},\dots,s_{T}\in\mathbb{R}^{d}$. These sparse representations are then concatenated to form the final output $S=[s_{1},s_{2},\dots,s_{T}]\in\mathbb{R}^{T\times d}$ of the sparse encoder, which is fed into the CNN-based discriminator to determine the score of the input sentence.

The sparse representation learning algorithm is differentiable. The gradient of $s_{t}$ can be passed to $c^{*}$ through Equation 12 and then to $h_{t}$ through Equation 11. As a result, SparseGAN is trainable end-to-end with standard back-propagation.

Algorithm 1 Sparse representation learning in SparseGAN
0:  Input: the vector $h_{t}$; the overcomplete dictionary $E$; the maximum number of iterations $L$;
0:  Output: the sparse representation $s_{t}$ of $h_{t}$;
1:  initialize $r_{t}=h_{t}$; $M=\varnothing$; $l=0$;
2:  repeat
3:     $l=l+1$;
4:     find $e_{l}\in E$ with the maximum inner product $\langle e_{l},r_{t}\rangle$;
5:     $M=[e_{1},e_{2},\dots,e_{l}]$;
6:     solve the least squares problem $c^{*}=\operatorname*{argmin}_{c}{||h_{t}-M^{T}c||_{2}}$;
7:     compute the approximation of $h_{t}$ by $s_{t}=M^{T}c^{*}$;
8:     update the residue $r_{t}=h_{t}-s_{t}$;
9:  until $l\geq L$
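A batched PyTorch-style sketch of Algorithm 1 is given below. It solves the least squares problem with the normal equations of Equation 11 so that gradients flow back to $h_{t}$ and to the selected embeddings; this is an illustrative reading of the algorithm, not the authors' code, and the batching details are assumptions.

import torch

def sparse_encode(h, E, L=10):
    """Matching pursuit sketch (Algorithm 1) for a batch of hidden states.
    h: (B, d) hidden states; E: (N, d) embedding matrix; L: number of iterations."""
    residual = h                                     # r_t <- h_t
    selected = []                                    # embeddings chosen so far
    for _ in range(L):
        # Eq. 10: pick the embedding with the largest inner product with the residual.
        idx = (residual @ E.t()).argmax(dim=-1)      # (B,)
        selected.append(E[idx])                      # gradients reach only these rows of E
        M = torch.stack(selected, dim=1)             # basis matrix, (B, k, d)
        # Eq. 11: c* = (M M^T)^{-1} M h_t via the normal equations.
        G = M @ M.transpose(1, 2)                    # (B, k, k)
        c = torch.linalg.solve(G, M @ h.unsqueeze(-1))
        s = (M.transpose(1, 2) @ c).squeeze(-1)      # Eq. 12: s_t = M^T c*
        residual = h - s                             # Eq. 13: update the residue
    return s                                         # differentiable w.r.t. h and the chosen embeddings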

5 Experiments

5.1 Dataset

We conduct experiments on two text generation datasets, COCO Image Caption Chen et al. (2015) and EMNLP2017 WMT News Guo et al. (2018), to demonstrate the effectiveness of SparseGAN. The COCO Image Caption dataset is preprocessed basically following Zhu et al. (2018); it contains 4,682 distinct words, 10,000 sentences as the training set and another 10,000 sentences as the test set, where all sentences are 37 words or fewer in length. The EMNLP2017 WMT News dataset contains 5,712 distinct words with a maximum sentence length of 51. Its training and test sets consist of 278,586 and 10,000 sentences, respectively.

Table 1: The results of BLEU scores on COCO Image Caption Dataset. BL and SBL denote BLEU and Self-BLEU.
Model BL2 BL3 BL4 BL5 SBL2 SBL3 SBL4 SBL5
MLE 0.731 0.497 0.305 0.189 0.916 0.769 0.583 0.408
SeqGAN Yu et al. (2017) 0.745 0.498 0.294 0.180 0.950 0.840 0.670 0.490
MaliGAN Che et al. (2017) 0.673 0.432 0.257 0.159 0.918 0.781 0.606 0.437
RankGAN Lin et al. (2017) 0.743 0.467 0.264 0.156 0.960 0.883 0.763 0.619
LeakGAN Guo et al. (2018) 0.746 0.528 0.355 0.230 0.966 0.913 0.849 0.780
TextGAN Zhang et al. (2017) 0.593 0.463 0.277 0.207 0.942 0.932 0.805 0.746
LATEXTGAN Haidar et al. (2019) 0.787 0.496 0.286 0.150 0.988 0.950 0.847 0.612
TopKGAN-S 0.786 0.588 0.399 0.260 0.949 0.868 0.751 0.613
TopKGAN-D 0.783 0.578 0.390 0.255 0.947 0.865 0.746 0.607
SparseGAN 0.845 0.666 0.474 0.323 0.954 0.951 0.896 0.816
Table 2: The results of BLEU scores on EMNLP2017 WMT News Dataset. BL and SBL denote BLEU and Self-BLEU.
Model BL2 BL3 BL4 BL5 SBL2 SBL3 SBL4 SBL5
MLE 0.749 0.453 0.223 0.113 0.840 0.555 0.301 0.163
SeqGAN Yu et al. (2017) 0.668 0.388 0.192 0.101 0.883 0.645 0.233 0.400
MaliGAN Che et al. (2017) 0.727 0.395 0.160 0.077 0.873 0.643 0.404 0.226
LeakGAN Guo et al. (2018) 0.733 0.421 0.185 0.092 0.887 0.659 0.399 0.216
TextGAN Zhang et al. (2017) 0.245 0.204 0.165 0.108 0.999 0.999 0.999 0.999
RelGAN Weili Nie and Patel (2019) 0.887 0.725 0.495 0.287 0.969 0.901 0.797 0.671
TopKGAN-S 0.883 0.703 0.468 0.275 0.961 0.881 0.758 0.615
TopKGAN-D 0.904 0.743 0.525 0.332 0.966 0.899 0.801 0.689
SparseGAN 0.921 0.825 0.643 0.437 0.992 0.982 0.961 0.930

5.2 Experiment Settings

The generator is a two-layer LSTM with 300 hidden units, and the discriminator is a multi-layer 1-D convolutional neural network with 300 feature maps and a filter size of 5. The denoising auto-encoder (DAE) is a two-layer LSTM with 300 hidden cells for both the encoder and the decoder. For training the DAE, we preprocess the input data following Freitag and Roy (2018), where 50% of the words are randomly removed and all words are shuffled while keeping together the word pairs that occur in the original sentence. A variational auto-encoder (VAE) Kingma and Welling (2013) is used to initialize the generator, trained with KL cost annealing and word dropout during decoding following Bowman et al. (2015). Inspired by WGAN-GP Gulrajani et al. (2017), the hyperparameter $\lambda$ of the gradient penalty term in Equation 6 is set to 10, and 5 gradient descent steps on the discriminator are performed for every step on the generator. All models are optimized by Adam with $\beta_{1}=0.9$, $\beta_{2}=0.999$ and $\epsilon=10^{-8}$. The learning rate is set to $10^{-3}$ for pretraining and $10^{-4}$ for adversarial training. The 300-dimensional GloVe word embeddings released by Pennington et al. (2014) are used to initialize the word embedding matrix. The batch size is set to 64, the maximum sequence length to 40, the maximum number of iterations for adversarial training to 20,000, and the number of iterations $L$ for sparse representation learning to 10.
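For illustration, a hedged sketch of the DAE input corruption described above is given below; the exact recipe of Freitag and Roy (2018), in particular how word pairs are kept together, may differ from this reading.

import random

def corrupt(tokens, drop_rate=0.5):
    """Sketch of the DAE input noising: drop words at random, then shuffle the
    remainder in adjacent pairs so that local word order is partly preserved."""
    kept = [w for w in tokens if random.random() > drop_rate]
    pairs = [kept[i:i + 2] for i in range(0, len(kept), 2)]
    random.shuffle(pairs)
    return [w for pair in pairs for w in pair]

print(corrupt("a man riding a motorcycle down a road".split()))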

Table 3: Example sentences generated by SparseGAN trained on COCO Image Caption dataset (shown in the top) and EMNLP2017 WMT News dataset (shown in the bottom) respectively.
a motorcycle is parked on a concrete road .
the picture of a kitchen with stainless and white appliances.
a man riding a motorcycle down a road with a person on the back.
people are preparing food in a kitchen with a pot.
two teddy bears on a sidewalk next to a baby giraffe.
a table with various cakes and a cup of sausage on it.
an old kitchen with a black and white checkered floor.
a motorcycle is parked on the side of a road with a crowd of people.
a kitchen with hardwood cabinets and white appliances.
a small bathroom with a white toilet and a latticed mirror.
i think that’s the most important thing that’s happening, there is a lot of ideas in the white house of the next time.
the queen’s story is aimed on making a positive increase in the uk’s population in scotland.
the government’s executive ministry said: “ it was just a very positive problem in my relationship and i am pleased to be able to make sure it would be.
“ i think it’s going to be investigated, but it doesn’t matter , if she can have a child , ” he says.
the queen government is asking to comment on a review of the brexit referendum, and asked whether this was not a big question.
the government also said that’s president obama had to do that negotiations and we did not consider the possibility of parliament to be successful, it’s not a good team.
“ the first message, to say that trump will be a bitter path to the white house, ” kaine said.
“ it’s hard to get a good team, and we don’t want to get the best players in the country, ” he said.
it’s important that i’m working at the best time in the world , there’s diversity of people who are talented guys, ” he said.
there are a lot of people who are going to go on the work , especially on the day, ” pence said.

5.3 Evaluation Metrics

We use the two metrics below to evaluate and compare the models.

BLEU Papineni et al. (2002) This metric measures the quality of generated sentences. To calculate BLEU-N scores, we generate 10,000 sentences as candidate texts and use the entire test set as reference texts. The higher the BLEU score, the higher the quality of the generated sentences.

Self-BLEU Zhu et al. (2018) This metric measures the diversity of generated sentences. Using one generated sentence as the candidate text and the others as reference texts, the BLEU score is calculated for every generated sentence, and the average over 10,000 generated sentences is defined as the Self-BLEU. The higher the Self-BLEU score, the less diverse the generated sentences.
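A small NLTK-based sketch of the Self-BLEU computation is shown below; Texygen's exact tokenization and smoothing settings may differ, so this is only an approximation of the reported metric.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(sentences, n=4):
    """Each generated sentence is scored against all the others; the average is Self-BLEU-n."""
    weights = tuple(1.0 / n for _ in range(n))
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(sentences):
        refs = sentences[:i] + sentences[i + 1:]
        scores.append(sentence_bleu(refs, hyp, weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

generated = [s.split() for s in ["a man riding a motorcycle down a road",
                                 "a kitchen with white appliances"]]
print(self_bleu(generated, n=2))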

5.4 Compared Models

We compare SparseGAN with several recent representative models. For RL-based text generation models, we compare with SeqGAN Yu et al. (2017), MaliGAN Che et al. (2017), RankGAN Lin et al. (2017) and LeakGAN Guo et al. (2018). We also compare with TextGAN Zhang et al. (2017) and LATEXTGAN Haidar et al. (2019). TextGAN adopts the weighted sum over the embedding matrix as the continuous approximation, while LATEXTGAN uses the multinomial distribution produced by the generator for adversarial training.

The top-$k$ method, which has not previously been applied to GAN-based text generation, approximates the sampling operation by a linear combination of the word embeddings with the $k$ highest probabilities, using the re-normalized probabilities as the combination coefficients. We denote this model as TopKGAN-S (static TopKGAN). TopKGAN-D (dynamic TopKGAN), a variant of TopKGAN-S, chooses words dynamically by comparing the logit of each word with a threshold $\delta$, where the logit of each word is defined as the inner product between the hidden state $h_{t}$ and the word embedding $e_{k}$. For TopKGAN-S, the number of chosen words $k$ is set to 10, while for TopKGAN-D, the threshold $\delta$ is set to 0. Both variants of TopKGAN are implemented with the same settings as SparseGAN; a sketch of both is given below.
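The following hedged PyTorch sketch illustrates the two top-$k$ baselines; the thresholding and re-normalization details are assumptions based on the description above.

import torch
import torch.nn.functional as F

def topk_embedding(h, E, k=10, delta=None):
    """Soft word representations for TopKGAN-S (fixed k) and TopKGAN-D (threshold delta).
    h: (B, d) hidden states; E: (N, d) embedding matrix."""
    logits = h @ E.t()                                  # (B, N) inner products
    if delta is None:
        # TopKGAN-S: keep the k most probable words and re-normalize their weights.
        vals, idx = logits.topk(k, dim=-1)              # (B, k)
        probs = F.softmax(vals, dim=-1)
        return torch.einsum('bk,bkd->bd', probs, E[idx])
    # TopKGAN-D: keep every word whose logit exceeds the threshold delta.
    masked = logits.masked_fill(logits <= delta, float('-inf'))
    probs = F.softmax(masked, dim=-1)                   # zero weight on excluded words
    return probs @ E                                    # (B, d)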

5.5 Results

The BLEU and Self-BLEU scores on the COCO Image Caption dataset and the EMNLP2017 WMT News dataset are shown in Table 1 and Table 2, respectively. The proposed SparseGAN achieves the highest BLEU scores on both datasets, which indicates that the sentences generated by SparseGAN are more natural and fluent. The MLE-based model has the lowest Self-BLEU scores. TopKGAN-S and TopKGAN-D perform similarly on both the COCO Image Caption dataset and the EMNLP2017 WMT News dataset. On the COCO Image Caption dataset, these two models behave better than several competitive models, such as RankGAN, LeakGAN, TextGAN and LATEXTGAN, in terms of both BLEU scores and Self-BLEU scores.

The high BLEU scores of SparseGAN may benefit from the way it treats the embedding matrix. Since SparseGAN chooses word embeddings via the residual vector $r_{t}$, which is initialized as $h_{t}$, the word with the highest probability is chosen at the first iteration; this word is usually a common word in the vocabulary. After several iterations, when $h_{t}$ has been well approximated by the sparse representation, uncommon words tend to be chosen. Both common and uncommon words are adjusted in SparseGAN, so the embedding matrix receives sufficient training. In contrast, RL-based models only choose one word to adjust at each time step; TopKGAN only chooses the words with high probabilities, which are usually common words; and continuous approximation methods adjust all words but contain much noise in their approximation, resulting in imprecise gradient values.

The low Self-BLEU scores of the MLE-based model reflect that the sentences generated by MLE training are more diverse than those of all GAN-based models. This implies that GAN-based models tend to suffer from mode collapse and generate safe but similar sentences. However, the sentences generated by the MLE-based model are less natural than those of GAN-based models, especially on the EMNLP2017 WMT dataset, whose sentences are longer than those of the other dataset.

Table 3 shows sentences generated by SparseGAN trained on the COCO Image Caption and EMNLP2017 WMT News datasets, respectively. These examples illustrate that SparseGAN is capable of generating meaningful and natural sentences with a coherent structure.

6 Conclusion

Generative adversarial networks have been used for text generation in order to alleviate the discrepancy between training and inference (exposure bias). However, simply applying GANs to the generation task leads to a non-differentiable training process that hinders the gradients from back-propagating to the generator from the discriminator. We proposed a fully differentiable training solution that feeds the discriminator with semantic-interpretable but low-noise sparse sentence representations. The proposed solution encourages the generator and the discriminator to share the same input semantic feature space formed by the word embeddings, a regularization that facilitates network training and improves performance. Experiments on multiple text generation datasets showed that the proposed model and training algorithm achieve the best or comparable performance, especially in terms of BLEU and Self-BLEU scores, reflecting an enhanced ability to recover the probability of the whole sequence and to improve the diversity of the generated sentences.

Acknowledgements

This work was supported by Shanghai Municipal Science and Technology Project (No. 21511102800).

References

  • Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
  • Budhkar et al. (2019) Akshay Budhkar, Krishnapriya Vishnubhotla, Safwan Hossain, and Frank Rudzicz. Generative adversarial networks for text using word2vec intermediaries. arXiv preprint arXiv:1904.02293, 2019.
  • Che et al. (2017) Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.
  • Chen et al. (2011) Yi Chen, Nasser M Nasrabadi, and Trac D Tran. Hyperspectral image classification using dictionary-based sparse representation. IEEE transactions on geoscience and remote sensing, 49(10):3973–3985, 2011.
  • Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • Fedus et al. (2018) William Fedus, Ian Goodfellow, and Andrew M Dai. Maskgan: better text generation via filling in the_. arXiv preprint arXiv:1801.07736, 2018.
  • Figueiredo et al. (2007) Mário AT Figueiredo, Robert D Nowak, and Stephen J Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of selected topics in signal processing, 1(4):586–597, 2007.
  • Freitag and Roy (2018) Markus Freitag and Scott Roy. Unsupervised natural language generation with denoising autoencoders. arXiv preprint arXiv:1804.07899, 2018.
  • Garg and Khandekar (2009) Rahul Garg and Rohit Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML, volume 9, pages 337–344, 2009.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In NIPS, pages 5767–5777, 2017.
  • Guo et al. (2018) Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Haidar et al. (2019) Md Haidar, Mehdi Rezagholizadeh, Alan Do-Omri, Ahmad Rashid, et al. Latent code and text-based generative adversarial networks for soft-text generation. arXiv preprint arXiv:1904.07293, 2019.
  • He et al. (2016) Zhenyu He, Shuangyan Yi, Yiu-Ming Cheung, Xinge You, and Yuan Yan Tang. Robust object tracking via key patch sparse representation. IEEE transactions on cybernetics, 47(2):354–364, 2016.
  • Huang and Aviyente (2007) Ke Huang and Selin Aviyente. Sparse representation for signal classification. In NIPS, pages 609–616, 2007.
  • Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Junbo et al. (2017) Zhao Junbo, Y Kim, K Zhang, AM Rush, and Y LeCun. Adversarially regularized autoencoders for generating discrete structures. ArXiv e-prints, 2017.
  • Ke et al. (2019) Pei Ke, Fei Huang, Minlie Huang, and Xiaoyan Zhu. Araml: A stable adversarial training framework for text generation. arXiv preprint arXiv:1908.07195, 2019.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Li et al. (2019) Zhongliang Li, Tian Xia, Xingyu Lou, Kaihe Xu, Shaojun Wang, and Jing Xiao. Adversarial discrete sequence generation without explicit neuralnetworks as discriminators. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3089–3098, 2019.
  • Lin et al. (2017) Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. Adversarial ranking for language generation. In NIPS, pages 3155–3165, 2017.
  • Mallat and Zhang (1993) Stéphane G Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on signal processing, 41(12):3397–3415, 1993.
  • Needell and Tropp (2009) Deanna Needell and Joel A Tropp. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and computational harmonic analysis, 26(3):301–321, 2009.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318. Association for Computational Linguistics, 2002.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
  • Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
  • Schmidhuber and Hochreiter (1997) Jürgen Schmidhuber and Sepp Hochreiter. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Tropp and Gilbert (2007) Joel A Tropp and Anna C Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on information theory, 53(12):4655–4666, 2007.
  • Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103. ACM, 2008.
  • Weili Nie and Patel (2019) Weili Nie, Nina Narodytska, and Ankit Patel. Relgan: Relational generative adversarial networks for text generation. In ICLR, 2019.
  • Wright et al. (2008) John Wright, Allen Y Yang, Arvind Ganesh, S Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE transactions on pattern analysis and machine intelligence, 31(2):210–227, 2008.
  • Xu et al. (2017) Zhen Xu, Bingquan Liu, Baoxun Wang, SUN Chengjie, Xiaolong Wang, Zhuoran Wang, and Chao Qi. Neural response generation via gan with an approximate embedding layer. In EMNLP, pages 617–626, 2017.
  • Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
  • Zhang et al. (2017) Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In ICML, pages 4006–4015. JMLR. org, 2017.
  • Zhu et al. (2018) Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1097–1100. ACM, 2018.