
Learning Interpretable and Discrete Representations
with Adversarial Training for Unsupervised Text Classification

Yau-Shian Wang Hung-Yi Lee Yun-Nung Chen
National Taiwan University, Taipei, Taiwan
[email protected][email protected][email protected]
Abstract

Learning continuous representations from unlabeled textual data has been increasingly studied for benefiting semi-supervised learning. Although discrete representations are relatively easier to interpret, learning them for unlabeled textual data has not been widely explored due to the difficulty of training. This work proposes TIGAN, which learns to encode texts into two disentangled representations, a discrete code and a continuous noise, where the discrete code represents interpretable topics and the noise controls the variance within the topics. The discrete code learned by TIGAN can be used for unsupervised text classification. Compared to other unsupervised baselines, the proposed TIGAN achieves superior performance on six different corpora, and its performance is on par with a recently proposed weakly-supervised text classification method. The extracted topical words for representing latent topics show that TIGAN learns coherent and highly interpretable topics.

1 Introduction

In natural language processing (NLP), learning meaningful representations from large amounts of unlabeled texts is a core problem for unsupervised and semi-supervised language understanding. While learning continuous text representations has been widely studied Kiros et al. (2015); Arora et al. (2017); Logeswaran and Lee (2018); Pagliardini et al. (2018); Peters et al. (2018); Devlin et al. (2019), learning discrete representations has been explored by fewer works Miao et al. (2017); Zhao et al. (2018). Nevertheless, learning discrete representations is still important, as discrete representations are easier to interpret Zhao et al. (2018) and can benefit unsupervised learning van den Oord et al. (2017).

Auto-encoders and their variants are widely used to learn latent representations, but how to learn meaningful discrete latent representations remains unsolved. The reason is that using discrete variables as latent representations in auto-encoders hinders the gradient from backpropagating from the decoder to the encoder. To use non-differentiable discrete variables as latent representations in an auto-encoder, special methods such as Gumbel-Softmax Jang et al. (2016) or vector quantization van den Oord et al. (2017) are applied to enable training. Another direction for learning useful discrete representations is to maximize the mutual information between data and discrete variables, as in InfoGAN Chen et al. (2016) or IMSAT Hu et al. (2017).

In this work, we propose Textual InfoGAN (TIGAN), which is mainly built upon InfoGAN but with several useful extensions that make it suitable for textual data. In this model, texts are generated from two disentangled representations, a discrete code c and a continuous noise z, where each dimension of c represents a topic (or a category) and z controls the variance within a topic. For example, if the model learns that one dimension of c represents a topic about “sport”, different z represent different sport types, such as basketball or baseball, or different teams. Given a new text, the model can efficiently infer the discrete topic code c. As we alternately optimize the InfoGAN objective and the auto-encoder objective, the model can be considered an integration of the InfoGAN and auto-encoder based approaches.

To evaluate the interpretability of discovered discrete topics, we evaluate our model on unsupervised text classification. Better performance on unsupervised text classification implies that the discovered topics directly match the human-annotated categories, and thus humans can intuitively understand what they represent, such as the category of news or the type of questions. Compared to other possible unsupervised text classification methods, such as unsupervised sentence representation methods or topic models, TIGAN achieves superior performance. Also, the performance of our unsupervised method is on par with a recent weakly-supervised method  (Haj-Yahia et al., 2019), which required keywords of each category annotated by human experts.

To interpret discovered topics, we extract a few topical words from each topic to represent the topic. The extracted topical words are evaluated by quantitative and qualitative analysis. Compared to baseline methods, TIGAN is capable of extracting more coherent topical words. Also, it is possible to generate various texts conditioned on topics discovered by TIGAN in an unsupervised manner, which gives us a way to interpret the discovered topics further. The generated texts demonstrate that TIGAN successfully learns disentangled representations.

2 Related Work

Here we review approaches to unsupervised discrete representation learning.

Auto-encoder

Our work is closely related to other unsupervised discrete representation learning methods. VQ-VAE van den Oord et al. (2017) encodes the data into a discrete one-hot code by selecting the index whose embedding is closest to the data representation, and utilizes vector quantization (VQ) to approximate the gradient from the decoder to the encoder. DI-VAE Zhao et al. (2018) uses Batch Prior Regularization (BPR) to approximate the KL-divergence between discrete variables, and learns discrete representations with Gumbel-Softmax by minimizing the KL-divergence between the discrete posterior and a discrete prior.

In this work, because we set the dimension of the discrete code to a small number, we have to model the variance within each code; otherwise, the text cannot be successfully generated. Therefore, we turn to other discrete representation learning models.

Maximizing Mutual Information

To learn useful latent representations, IMSAT Hu et al. (2017) maximizes the mutual information between data and the encoded discrete representation, and also proposes an objective function that makes the discrete representation invariant to data augmentation. On the other hand, InfoGAN has shown impressive performance in learning disentangled representations of images in an unsupervised manner Chen et al. (2016). The original GAN generates images from a continuous noise z, but individual dimensions of the noise do not correspond to disentangled features of the generated images. To learn semantically meaningful representations, InfoGAN maximizes the mutual information between the input code c and the generated output G(z,c). However, maximizing the mutual information is intractable because it requires access to P(c\mid G(z,c)). Based on variational information maximization, Chen et al. (2016) used an auxiliary function Q to approximate P(c\mid G(z,c)); Q can be a neural network and is jointly optimized with G.

3 TIGAN

Figure 1: Illustration of the proposed model. (A) In the upper part (InfoGAN training), the bag-of-words generator G takes a discrete topic code c and a continuous noise vector z as input and generates a bag-of-words. The discriminator D distinguishes whether its input bag-of-words comes from G or from human-written text. The topic classifier Q predicts the latent topic c from its input bag-of-words. In the lower part, an auto-encoder is learned from the bag-of-words of human-written text: the noise predictor E, which predicts the noise z from the input bag-of-words, and the topic classifier Q together form the encoder, while G is the decoder. (B) To interpret what is learned by the model, an LSTM text generator G^{\prime} is trained to generate text based on a bag-of-words.

TIGAN considers that text is generated from a discrete code c and a continuous noise z. Each dimension of the discrete code c represents a topic (or a category), and the continuous noise z controls the variance within the topic. As sequential textual data is too complex for discovering meaningful topics, we focus on learning topics from bag-of-words textual data. In addition, we propose several extensions of InfoGAN, including: (1) categorical loss clipping, (2) combining InfoGAN with an auto-encoder, (3) using WGAN-gp Gulrajani et al. (2017), and (4) using pretrained word embeddings to regularize the discovered topics. These extensions greatly improve performance.

3.1 Model

Figure 1 (A) illustrates our model, which consists of a generator G, a discriminator D, a topic classifier Q, and a noise predictor E (a minimal sketch of these components is given after the following list):

  • Generator G
    It takes a discrete topic code c and a continuous noise z as input and learns to generate bag-of-words that are indistinguishable from the bag-of-words of real texts while capturing the topical information of the input c. The output of the generator, G(c,z), is a vocabulary-sized vector. The output layer of G is a sigmoid function, so the value of each dimension is between 0 and 1. G(c,z) can be interpreted as a bag-of-words vector, where each dimension indicates whether a word appears in the text. G(c,z) is fed directly to the discriminator, topic classifier, and noise predictor without sampling.

  • Discriminator D
    It takes a bag-of-words vector as input and outputs a scalar that distinguishes whether the input is generated by the generator G or comes from human-written texts. A bag-of-words vector x from human-written texts should be assigned a higher score, i.e., D(x) should be larger, while the bag-of-words vector G(c,z) generated by G should be assigned a lower score, i.e., D(G(c,z)) should be smaller. D is trained to minimize the following loss function \mathcal{L}_{D}:

    \mathcal{L}_{D}=-\operatorname{\mathbb{E}}_{x\sim P_{data}}[\log(D(x))]-\operatorname{\mathbb{E}}_{z\sim P_{z},c\sim P_{c}}[\log(1-D(G(z,c)))], (1)

    where P_{data} is the real data distribution, P_{z} is the noise distribution, and P_{c} is the topic code distribution. The distributions P_{c} and P_{z} are determined by the developers based on prior knowledge about the corpus. In the following experiments, P_{z} is a normal distribution, while P_{c} is a uniform distribution that generates one-hot vectors in which each dimension has an equal probability of being set to 1. In general, we find that using a normal distribution as P_{z} is sufficient to model most texts. We use a uniform distribution as P_{c} because our model is evaluated on single-label datasets, in which most texts are written about one main topic. For a multi-label classification dataset, it is feasible to sample a topic code c containing several topics according to the distribution of human-annotated categories.

  • Topic Classifier Q
    It is a categorical topic code classifier, which predicts the categorical topic distribution from an input bag-of-words. The classifier Q plays the role of inferring topics from bag-of-words in this work. It is trained to predict the topic code c from the generated bag-of-words G(c,z). The categorical loss \mathcal{L}_{Q} for training Q is shown below,

    \mathcal{L}_{Q}=\operatorname{\mathbb{E}}_{z\sim P_{z},c\sim P_{c},c^{\prime}=Q(G(z,c))}[\mathcal{L}_{ce}(c,c^{\prime})], (2)

    where the function \mathcal{L}_{ce} evaluates the cross entropy between c and the prediction c^{\prime}=Q(G(z,c)), and Q learns to minimize this cross entropy. Cross entropy is used here because c is a one-hot vector; other difference measures are also possible.

  • Noise predictor E
    It predicts the continuous noise z from the bag-of-words x of human-written texts. E is trained only with the auto-encoder objective.
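
As a concrete reference, the following is a minimal PyTorch sketch of the four components, roughly following the dimensions reported in Appendix A (3000-word vocabulary, 1000-unit generator layers, 500-unit discriminator and noise predictor). The class names, the ReLU activations, and the simple linear form of Q shown here are assumptions for illustration, not the exact implementation; the full Q with pretrained embeddings appears in Section 3.6.

import torch
import torch.nn as nn

VOCAB, N_TOPICS, NOISE_DIM = 3000, 4, 200   # e.g. agnews: 3000-word vocabulary, 4 categories

class Generator(nn.Module):
    """G: (topic code c, noise z) -> vocabulary-sized bag-of-words vector in (0, 1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_TOPICS + NOISE_DIM, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
            nn.Linear(1000, VOCAB), nn.Sigmoid(),
        )
    def forward(self, c, z):
        return self.net(torch.cat([c, z], dim=-1))

class Discriminator(nn.Module):
    """D: bag-of-words vector -> realness score (no output activation, for WGAN-gp)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(VOCAB, 500), nn.ReLU(),
                                 nn.Linear(500, 500), nn.ReLU(),
                                 nn.Linear(500, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

class TopicClassifier(nn.Module):
    """Q (simplest 'linear Q' variant): bag-of-words -> topic logits.
    The full model instead feeds SIF-weighted pretrained embeddings (Section 3.6)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(VOCAB, N_TOPICS)
    def forward(self, x):
        return self.net(x)

class NoisePredictor(nn.Module):
    """E: bag-of-words -> continuous noise z (one hidden layer of 500, as in Appendix A)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(VOCAB, 500), nn.ReLU(), nn.Linear(500, NOISE_DIM))
    def forward(self, x):
        return self.net(x)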

3.2 Training

We apply (3) to train our generator G, discriminator D, and topic classifier Q.

\min_{G,Q}\max_{D}-\mathcal{L}_{D}+\lambda\mathcal{L}_{Q}, (3)

where \lambda is a hyper-parameter. As we find that the generator generates unreasonable bag-of-words to minimize \mathcal{L}_{Q} when \lambda is too large, we set \lambda to 0.1. The discriminator D learns to minimize \mathcal{L}_{D}, Q learns to minimize \mathcal{L}_{Q}, and G learns to maximize \mathcal{L}_{D} while minimizing \mathcal{L}_{Q}. However, it is difficult to apply InfoGAN to textual data directly, so we propose several extensions and tips in the following sections.
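
The following is a rough sketch of how the losses in (1)–(3) can be evaluated for one batch, using the log-loss form of (1); Section 3.5 replaces this with the WGAN-gp loss, and Section 3.3 replaces \mathcal{L}_{Q} with its clipped variant. The helper names and the 1e-8 stabilizers are assumptions for illustration.

import torch
import torch.nn.functional as F

def sample_codes(batch_size, n_topics=4, noise_dim=200):
    """Draw c ~ P_c (uniform one-hot) and z ~ P_z (standard normal)."""
    idx = torch.randint(0, n_topics, (batch_size,))
    return F.one_hot(idx, n_topics).float(), torch.randn(batch_size, noise_dim), idx

def infogan_step(G, D, Q, x_real, lam=0.1):
    """Evaluate the losses of objective (3), using the log-loss form of Eq. (1)."""
    c, z, idx = sample_codes(x_real.size(0))
    x_fake = G(c, z)
    # Eq. (1): D assigns high scores to real bag-of-words and low scores to G(z, c).
    loss_d = -torch.log(torch.sigmoid(D(x_real)) + 1e-8).mean() \
             - torch.log(1 - torch.sigmoid(D(x_fake)) + 1e-8).mean()
    # Eq. (2): cross entropy between the sampled topic c and Q's prediction on G(z, c).
    loss_q = F.cross_entropy(Q(x_fake), idx)
    # G maximizes L_D (fools D) while minimizing lam * L_Q; Q minimizes L_Q.
    loss_g = torch.log(1 - torch.sigmoid(D(x_fake)) + 1e-8).mean() + lam * loss_q
    return loss_d, loss_q, loss_g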

3.3 Categorical Loss Clipping

During training, there is a severe mode collapse issue within the same topic: given the same topic code c, the generator G ignores the continuous noise z and always outputs the same bag-of-words. The reason is that the optimal solution for the generator to maximize the mutual information (or minimize (2)) between the discrete c and the generated discrete bag-of-words is to always predict the same bag-of-words given the same c. To tackle this issue, we clip the categorical loss in (2) to a lower bound \alpha as below.

\mathcal{L}_{Q}^{\prime}=\operatorname{\mathbb{E}}_{z\sim P_{z},c\sim P_{c},c^{\prime}=Q(G(z,c))}[\max(\mathcal{L}_{ce}(c,c^{\prime}),\alpha)]. (4)

With (4), the model stops making the categorical loss smaller once it is small enough, which is controlled by the hyper-parameter \alpha. As setting \alpha too large hinders Q from predicting the correct topic code c, we set it slightly greater than zero, to 0.15. It is worth mentioning that applying batch normalization Ioffe and Szegedy (2015) to G also greatly alleviates the mode collapse problem.
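
In code, the clipping in (4) is a one-line change to the categorical loss; a possible rendering (per-example clamping is our reading of the expectation in (4)):

import torch
import torch.nn.functional as F

def clipped_categorical_loss(Q, x_fake, idx, alpha=0.15):
    """Eq. (4): the per-example cross entropy is floored at alpha, so once Q predicts
    the topic well enough no further gradient pushes G toward collapsing within a topic."""
    ce = F.cross_entropy(Q(x_fake), idx, reduction='none')
    return torch.clamp(ce, min=alpha).mean()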

3.4 Combining InfoGAN with auto-encoder

As mentioned in prior works Larsen et al. (2016); Huang et al. (2018), it is useful to train a GAN and an auto-encoder alternately. Therefore, we include auto-encoder training in the optimization procedure, as shown in the lower part of Figure 1 (A). This is another way to prevent the mode collapse mentioned in the previous subsection.

We use the bag-of-words x from real text to train an auto-encoder. In auto-encoder training, the noise predictor E and the topic classifier Q are jointly regarded as an encoder, which encodes the bag-of-words into a continuous code E(x) and a discrete code Q(x), respectively. The generator G serves as a decoder, which reconstructs the original input x from the continuous code E(x) and the discrete code Q(x). Therefore, G, Q and E jointly learn to minimize the reconstruction loss,

\min_{G,Q,E}\operatorname{\mathbb{E}}_{x\sim P_{data},x^{\prime}=G(Q(x),E(x))}[\mathcal{L}_{bce}(x,x^{\prime})], (5)

where the binary cross entropy \mathcal{L}_{bce} is used as the reconstruction loss function. We train (3) and (5) alternately.
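
A rough sketch of the alternating schedule, combining the InfoGAN update of (3) with the reconstruction objective (5); infogan_step refers to the sketch in Section 3.2, the optimizer settings follow Appendix A, and the exact update order and the softmax over Q's output in the auto-encoder path are assumptions.

import torch
import torch.nn.functional as F

def autoencoder_loss(G, Q, E, x_real):
    """Eq. (5): reconstruct the real bag-of-words x from the codes (Q(x), E(x))."""
    c_hat = torch.softmax(Q(x_real), dim=-1)   # Q's output is not forced to be one-hot here
    x_rec = G(c_hat, E(x_real))
    return F.binary_cross_entropy(x_rec, x_real)

def train(G, D, Q, E, loader, infogan_step, epochs=50, lr=5e-4):
    """Alternate an InfoGAN update (objective (3)) with an auto-encoder update (Eq. (5))."""
    betas = (0.5, 0.999)                        # Adam settings from Appendix A
    opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=betas)
    opt_gq = torch.optim.Adam(list(G.parameters()) + list(Q.parameters()), lr=lr, betas=betas)
    opt_ae = torch.optim.Adam(list(G.parameters()) + list(Q.parameters())
                              + list(E.parameters()), lr=lr, betas=betas)
    for _ in range(epochs):
        for x_real in loader:
            loss_d, _, _ = infogan_step(G, D, Q, x_real)        # update D
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            _, _, loss_g = infogan_step(G, D, Q, x_real)        # fresh pass: update G and Q
            opt_gq.zero_grad(); loss_g.backward(); opt_gq.step()
            loss_ae = autoencoder_loss(G, Q, E, x_real)         # update G, Q and E on Eq. (5)
            opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()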

3.5 Using WGAN-gp

As mentioned in Arjovsky et al. (2017), it is challenging to generate discrete data with GANs. Because bag-of-words vectors from real texts are also discrete, training fails when the original loss function for D in (1) is used. Hence, we apply the WGAN loss Arjovsky et al. (2017) to train G and D and rewrite (1) as

\mathcal{L}_{D}^{\prime}=-\operatorname{\mathbb{E}}_{x\sim P_{data}}[D(x)]+\operatorname{\mathbb{E}}_{z\sim P_{z},c\sim P_{c}}[D(G(z,c))]. (6)

Here, to constrain D to be a Lipschitz function, we apply the gradient penalty Gulrajani et al. (2017) to D.
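
A minimal sketch of the resulting discriminator objective, i.e., (6) plus the gradient penalty of Gulrajani et al. (2017); the penalty weight of 10 is the default from that paper, not a value reported here.

import torch

def wgan_d_loss(D, x_real, x_fake, gp_weight=10.0):
    """Eq. (6) plus a gradient penalty keeping D approximately 1-Lipschitz.
    x_fake should be G(z, c).detach(); gp_weight=10 follows Gulrajani et al. (2017)."""
    loss = -D(x_real).mean() + D(x_fake).mean()
    # Gradient penalty on random interpolations between real and generated bag-of-words.
    eps = torch.rand(x_real.size(0), 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    gp = ((grads.norm(2, dim=1) - 1) ** 2).mean()
    return loss + gp_weight * gp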

3.6 Using pretrained word embeddings

Prior works Nguyen et al. (2015) suggested that using pretrained word embeddings such as word2vec Mikolov et al. (2013) helps models discover more coherent topics. To encourage the model to learn more consistent topics, we use pretrained word embeddings in Q to regularize it toward more coherent topics. When Q takes a bag-of-words as input, it first computes a document vector as the weighted average of the pretrained word embeddings of the bag-of-words, and then feeds the document vector to a linear network to predict the discrete topic code c. We use smooth inverse frequency (SIF) Arora et al. (2017) to compute the weighted average of the word embeddings: the weight of word w is a/(a+p(w)), where a is a learnable non-negative parameter and p(w) is the frequency of word w. The word embeddings are pretrained by fastText Joulin et al. (2016).
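
A sketch of this SIF-weighted topic classifier; parameterizing the non-negative a through a softplus and keeping the pretrained embeddings frozen are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SIFTopicClassifier(nn.Module):
    """Topic classifier Q: SIF-weighted average of pretrained word embeddings -> linear layer.
    Whether the pretrained embeddings are fine-tuned is not specified; they are frozen here."""
    def __init__(self, pretrained_emb, word_freq, n_topics):
        super().__init__()
        # pretrained_emb: (V, 100) fastText vectors; word_freq: (V,) unigram frequencies p(w).
        self.register_buffer('emb', torch.as_tensor(pretrained_emb, dtype=torch.float))
        self.register_buffer('freq', torch.as_tensor(word_freq, dtype=torch.float))
        self.raw_a = nn.Parameter(torch.zeros(1))   # a = softplus(raw_a) keeps a non-negative
        self.linear = nn.Linear(self.emb.size(1), n_topics)

    def forward(self, bow):                          # bow: (B, V) bag-of-words vectors
        a = F.softplus(self.raw_a)
        w = bow * (a / (a + self.freq))              # SIF weight a / (a + p(w)) for present words
        doc = (w @ self.emb) / w.sum(dim=1, keepdim=True).clamp_min(1e-8)  # weighted average
        return self.linear(doc)                      # topic logits; softmax gives the distribution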

4 Text Generation from Discovered Topics

To interpret what the model learns, we generate text conditioned on a discovered discrete topic c, as shown in Figure 1 (B). Given a discrete topic c, we sample many different noises z to generate different bag-of-words G(c,z). After obtaining bag-of-words from G, we use a text generator G^{\prime} to generate sequential text from the bag-of-words. Because each sequential text can be transformed into its corresponding bag-of-words, we can obtain numerous (bag-of-words, text) pairs to train the text generator G^{\prime}. However, compared to human-written text, the generated G(c,z) may be noisy and unreasonable. To make G^{\prime} robust to G(c,z), during training we add noise such as random word removal or shuffling to the input bag-of-words. The network architecture of G^{\prime} is an LSTM whose hidden state is initialized by the output of a feedforward neural network that takes the bag-of-words as input.
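
A minimal sketch of the bag-of-words noising used when training G^{\prime}; the removal probability is an assumption, and the shuffling variant is not shown.

import torch

def noise_bow(bow, drop_prob=0.1):
    """Randomly remove some of the present words from a bag-of-words vector so that the
    text generator G' becomes robust to the noisy bag-of-words produced by G."""
    keep = (torch.rand_like(bow) >= drop_prob).float()
    return bow * keep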

5 Experiments

5.1 Unsupervised Classification

Methods 20News Yahoo! DBpedia Stackoverflow Agnews News-Cat
word embed avg+k-means 28.63 38.91 69.04 22.44 73.83 37.78
sent2vec+k-means 28.06 51.24 60.98 36.78 83.82 40.64
LDA + k-means 28.97 22.58 61.13 43.57 49.36 21.84
NVDM 24.63 33.21 46.22 26.33 62.36 30.12
ProdLDA 30.02 39.65 68.19 25.13 72.78 38.62
LDA-few topics 29.62 29.44 68.62 35.82 71.07 26.24
KE Haj-Yahia et al. (2019) 37.8 53.9 - - 73.8 -
TIGAN 34.12 52.25 85.37 47.01 84.13 49.32
TIGAN w/o loss clipping 30.12 45.92 83.32 40.99 83.66 47.42
TIGAN w/o auto-encoder 29.14 43.07 78.76 31.60 81.62 45.12
TIGAN w/o word embed 36.89 42.14 71.26 46.14 69.78 47.32
TIGAN w/ linear Q 41.01 42.91 83.73 64.46 72.65 48.14
Table 1: Unsupervised classification accuracy.

Datasets

We evaluate our model on six datasets: (1) 20NewsGroups, (2) Yahoo! answers, (3) DBpedia ontology classification, (4) stackoverflow title classification Xu et al. (2015), (5) agnews, and (6) News-Category-Dataset (https://www.kaggle.com/rmisra/news-category-dataset).

20NewsGroups is a document classification dataset composed of 20 different classes of documents. Yahoo! answers is a question type classification dataset with 10 types of question-answer pairs constructed by Zhang and LeCun (2015). The DBpedia ontology classification dataset, also constructed by Zhang and LeCun (2015), contains 14 ontology classes selected from DBpedia 2014. The stackoverflow title classification dataset is constructed by Xu et al. (2015) with 20 different title categories. Agnews is a news classification dataset with 4 categories constructed by Zhang and LeCun (2015); we combine news titles and descriptions as our training texts. News-Category-Dataset contains around 200k news headlines from 2012 to 2018 obtained from HuffPost; we select the 11 most frequent classes and roughly balance the number of samples in each class.

Experimental Settings

For all datasets, we use the same model architecture and the same optimizer without tuning on each dataset; the details of the model architecture can be found in Appendix A. For each dataset, we choose the 3000 most frequent words as the vocabulary after stop word and punctuation removal and lowercase conversion. When training TIGAN, we sample a one-hot topic code c from a uniform distribution and set the number of latent topics (i.e., the dimension of c) to the number of human-annotated categories in each dataset. For example, as there are four news categories in agnews, we set the topic number to four. The continuous noise z is sampled from a normal distribution, and the dimension of z is set to 200 in all experiments. Our model is not sensitive to the dimension of z; setting it to a smaller value such as 50 or 100 yields similar results.

We use the topic classifier Q to predict the latent topic probability distribution of each sample and assign each sample to the latent topic with the maximum probability. The samples assigned to a latent topic cluster then vote with their human-annotated labels for the label assigned to the whole cluster. After assigning each latent topic to a human-annotated category, we report the classification accuracy as a measure of the quality of the captured latent topics.
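
This evaluation protocol amounts to a majority-vote mapping from latent topics to human-annotated labels; a small sketch (pred_topic is the argmax of Q's output, and the helper name is ours):

import numpy as np

def cluster_accuracy(pred_topic, gold_label, n_topics):
    """Assign each latent topic the majority gold label of its members, then score accuracy."""
    pred_topic, gold_label = np.asarray(pred_topic), np.asarray(gold_label)
    mapping = {}
    for k in range(n_topics):
        members = gold_label[pred_topic == k]
        mapping[k] = np.bincount(members).argmax() if len(members) else -1
    mapped = np.array([mapping[k] for k in pred_topic])
    return float((mapped == gold_label).mean())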

Baselines

We compare our model to several possible methods for unsupervised text classification, including topic models, unsupervised sentence representations, and the keyword enrichment (KE) method.

The most straightforward approach to unsupervised text classification is based on unsupervised sentence representations: we first encode texts into vectors with an unsupervised text representation learning method and then run k-means on the text vectors. In Table 1, “word embed avg+k-means” averages pretrained word embeddings as text vectors; these word embeddings are the same as those used in TIGAN. In “sent2vec+k-means”, the vectors are extracted by the state-of-the-art unsupervised sentence representation method of Pagliardini et al. (2018) with the latent dimension set to 300. In “LDA + k-means”, we set the topic number of LDA Blei et al. (2003) to 50 and then run k-means clustering on the learned features.
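
For reference, the “word embed avg+k-means” baseline reduces to clustering averaged embedding vectors; a sketch with scikit-learn (the k-means settings are assumptions), whose cluster assignments are scored with the same majority-vote accuracy as above:

import numpy as np
from sklearn.cluster import KMeans

def word_avg_kmeans(doc_vectors, n_classes, seed=0):
    """'word embed avg + k-means': cluster averaged word-embedding document vectors.
    doc_vectors: (n_docs, d) array of averaged pretrained word embeddings."""
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=seed)
    return km.fit_predict(doc_vectors)   # cluster ids, scored with cluster_accuracy above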

Other baselines are topic models, which represent a text as a mixture of latent topics. Because these latent topics may coincide with the human-annotated categories, topic models are possible methods for unsupervised text classification. The evaluation setting of topic models is the same as for TIGAN: we set the topic number to the number of human-annotated categories and assign each latent topic to its most probable human-annotated category. For example, in Table 1 “LDA-few topics”, unlike the setting of “LDA + k-means”, the topic number is the same as the class number. Two strong variational auto-encoder based neural topic models, NVDM Miao et al. (2016) and ProdLDA Srivastava and Sutton (2017), are also chosen as baselines.

Keyword enrichment (KE) Haj-Yahia et al. (2019) is a recent weakly-supervised text classification method that initially asks human experts to label several keywords for each category and then gradually enriches the keyword dictionary. Documents similar to the keywords of a category are classified into that category. This method is not strictly comparable to ours because it requires human effort and its performance hinges on good keywords labeled by experts, while our model does not require any keywords from humans. It also requires WordNet Miller et al. (1990) to retrieve synonyms of the keywords.

Methods 20NewsGroups Yahoo! Answers DBpedia Gigaword
LDA 42.68 36.34 51.06 35.61
NVDM 42.36 47.02 49.95 37.22
ProdLDA 43.44 52.01 55.26 43.17
TIGAN w/ word embed 43.01 55.92 62.16 46.12
TIGAN w/o word embed 45.22 45.64 54.23 41.79
Table 2: Topic coherence scores. Higher is better.
Dataset Topic ID Topical Words
DBpedia 1 pianist, composer, singer, cyrillic, songwriter, romania, poet, painter, jazz, actress, musician, tributary
2 skyscraper, building, courthouse, tower, historic, plaza, brick, floors, twostory, mansion, hotel, buildings, register, tallest, palace
3 midfielder, footballer, goalkeeper, football, championship, league, striker, soccer, defender, goals, matches, cup, hockey, medals
English Gigaword 1 inflation, index, futures, benchmark, prices, currencies, output, outlook, unemployment, lowest, stocks, opec, mortgage
2 sars, environment, pollution, tourism, virus, disease, flights, water, scientific, airports, quality, agricultural, animal, flu, alert
3 polls, votes, elections, democrats, electoral, election, conservative, re-election, democrat, candidates, republicans, liberal, presidential
Table 3: Topical words generated from the discovered latent topic code c. In DBpedia, Topic 1 is about music, Topic 2 is about buildings, and Topic 3 is about sports. In English Gigaword, Topic 1 is about business, Topic 2 is about disease, and Topic 3 is about politics.

Results and Discussion.

Compared to methods that cluster continuous text vectors, such as “word embed avg+k-means” and “sent2vec+k-means” in Table 1, TIGAN improves the performance. The reason is that TIGAN maximizes the mutual information between the generated data and a discrete one-hot distribution, which encourages the classifier Q to learn more salient features for separating texts into different classes. In “LDA+k-means”, the result is even worse than “LDA-few topics”, which means it is not feasible to recover the human-annotated classes by directly clustering the features learned by LDA.

As shown in Table 1, TIGAN outperforms the two variational neural topic models and LDA on unsupervised classification. Both variational neural topic models and statistical topic models assume that each document is produced from a mixture of topics; to model the variance of documents, they have to split a topic such as “sport” into many subtopics such as baseball or basketball. Because each document in our unsupervised text classification setting should belong to a single category, this assumption harms the performance. In contrast, because TIGAN controls the variance within a topic with a noise vector, it can directly assume a document is generated from a single main topic, which allows it to discover topics identical to the human-annotated categories. This performance gap is understandable because topic models are not originally designed for the unsupervised text classification setting.

The performance of TIGAN is on par with keyword enrichment (KE) in Table 1, even without keywords annotated by human experts (we use the weighted recall reported in their paper because our accuracy metric is equivalent to weighted recall). This result suggests that our model is able to automatically categorize texts without any help from humans.

5.2 Ablation Study

Training

We conduct an ablation study to show that all mechanisms described in Section 3 are useful. As shown in Table 1, categorical loss clipping improves the performance; during training, we find that clipping this term makes training more stable because it diverges easily otherwise. Combining InfoGAN with the auto-encoder also improves the performance. A possible reason is that it encourages the generator to generate realistic data, which alleviates the difficulty of discrete data generation. Although we do not restrict the topic code c to be one-hot in auto-encoder training, InfoGAN training pushes the codes that Q assigns to texts toward one-hot vectors.

Model architecture of Topic Classifier

In Table 1, we find that the model architecture of the topic classifier Q greatly influences the unsupervised classification performance. Both “w/o word embed” and “w/ linear Q” are settings without pretrained word embeddings. Using pretrained embeddings helps the topic classifier discover more interpretable topics on most, but not all, datasets; a similar phenomenon can be found in Table 2. As we pretrain the word embeddings on each dataset separately, for datasets that do not contain enough training samples, such as stackoverflow or 20NewsGroups, it is difficult to learn representative word embeddings, so the performance on those datasets is not improved by pretrained word embeddings. Additionally, we find that using pretrained word embeddings makes Q insensitive to the random weight initialization, so Q discovers more consistent topics across training runs.

In “w/o word embed”, the model architecture of Q is the same as in TIGAN (i.e., a neural network with randomly initialized word embeddings), while in “w/ linear Q”, the architecture of Q is simply a linear network without any word embeddings. Comparing “w/o word embed” with “w/ linear Q”, the simpler architecture yields better results: with a more complex Q, the model arbitrarily maps complex but meaningless distributions to one-hot codes, which makes the learned topics less interpretable.

Dataset Topic ID Generated Text
DBpedia 1 james nelson is an american musician singer songwriter and actor he was a line of his work with musical career as metal albums in 1989
2 the UNK is a skyscraper located in downtown washington dc district it was completed june 2009 and tallest building currently
3 UNK born 3 january 1977 is a finnish football player who plays for west coast club dynamo he has won bronze medals at 2007
English Gigaword 1 oil prices fell in asia friday as traders fear of the world demand for biggest drop summer months figures
2 the world health organisation has placed on bird flu in nigeria ’s most populous countries virus , saying they are expected to appear
3 the ruling party won overwhelming majority of seats in opposition ’s battle for sweeping elections
Table 4: Texts generated from topics with topic IDs corresponding to Table 3. The texts in the same dataset are generated from different c but the same continuous noise z.
Generated Text
UNK is a historic public square located in south bronx built in 1987
UNK is a french southwest hotel originally located in the of greece south carolina it was conceived by josef
UNK in historic district is a methodist church located in south african cape was added to national register of places 200
Table 5: The generated texts of DBpedia Topic ID 2 in Table 3 with different noise vectors z.

5.3 Topic Coherence

In this section, we use the method in Appendix B to extract topical words to represent latent topics and evaluate the quality of the topical words. As topic models are also good at extracting topical words to represent topics, they are chosen as our baselines.

Experimental setup

In this section, the topic number of all models is the same as the number of human-annotated classes. To show that our model works not only on classification datasets, we also evaluate it on an unannotated dataset, English Gigaword Rush et al. (2015), a news summarization dataset. The topic number on English Gigaword is set to 10; a discussion of how to choose the topic code distribution is given in Appendix D.

Quantitative analysis.

We use the C_{v} metric Roder et al. (2015) to evaluate the coherence of the extracted topical words Chang et al. (2009) (we use the CoherenceModel in the gensim module to compute coherence scores). The coherence scores are computed using English Wikipedia, with 5.6 million articles, as an external corpus. The higher coherence scores in Table 2 show that TIGAN extracts reasonable and coherent words to represent a topic.
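
A sketch of how such C_{v} scores can be computed with gensim's CoherenceModel, as mentioned above; building the dictionary directly from the tokenized reference corpus is an assumption.

from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

def cv_coherence(topics, reference_texts):
    """topics: list of topical-word lists (e.g. from Appendix B);
    reference_texts: tokenized external corpus (the paper uses English Wikipedia)."""
    dictionary = Dictionary(reference_texts)
    cm = CoherenceModel(topics=topics, texts=reference_texts,
                        dictionary=dictionary, coherence='c_v')
    return cm.get_coherence()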

Qualitative analysis.

To further analyze the quality of the discovered topics, the topical words of the latent topics are listed in Table 3. Additionally, we use the method described in Section 4 to generate corresponding texts for the discovered topics in Table 4. The captured latent topics are clearly semantically different based on the generated topical words, and the generated texts are topically related to the associated topics. More extracted topical words and generated texts are provided in Appendix C.

5.4 Disentangled Representations

The essential assumption of our work is that c and z learn disentangled representations, where c encodes the main topic information and z controls the variance within the topics. To validate this assumption, we generate texts from the same continuous noise z but different topic codes c in Table 4. There is almost no overlap of words between the texts generated from the same continuous noise z but different c, which means z encodes information disentangled from the topic code c.

We also analyze whether z can control the variance within a topic by modeling subtopic information. In Table 5 and Table 6 in the Appendix, we generate texts from the same topic code c but different noises z. In Table 5, the topic of c is clearly about buildings, yet quite different sentences are generated with different z. The diversity of the generated texts within the same topic shows that z is capable of modeling the variance within a topic. Moreover, generating numerous texts from the same topic gives us a further way to interpret the discovered topics. Together, these two experiments demonstrate that our model learns disentangled representations.

6 Conclusion

This paper proposes a novel unsupervised framework, TIGAN, which learns discrete text representations to interpret textual data. By learning a discrete topic code disentangled from a continuous vector controlling subtopic information, TIGAN shows superior performance on unsupervised text classification, which gives humans a good way to understand textual data. The extracted topical words and the texts generated from the discovered topics further showcase the effectiveness of our model in learning explainable and coherent topics.

References

  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein gan. arXiv preprint arXiv:1701.07875.
  • Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research.
  • Chang et al. (2009) Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems 22.
  • Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028.
  • Haj-Yahia et al. (2019) Zied Haj-Yahia, Adrien Sieg, and Léa A. Deleris. 2019. Towards unsupervised text classification leveraging experts and word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  • Hu et al. (2017) Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. 2017. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning.
  • Huang et al. (2018) Huaibo Huang, zhihang li, Ran He, Zhenan Sun, and Tieniu Tan. 2018. Introvae: Introspective variational autoencoders for photographic image synthesis. In Advances in Neural Information Processing Systems 31.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
  • Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28.
  • Larsen et al. (2016) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. 2016. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of The 33rd International Conference on Machine Learning.
  • Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.
  • Miao et al. (2017) Yishu Miao, Edward Grefenstette, and Phil Blunsom. 2017. Discovering discrete latent topics with neural variational inference. In Proceedings of the 34th International Conference on Machine Learning.
  • Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of the 34th International Conference on Machine Learning.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26.
  • Miller et al. (1990) George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. 1990. Wordnet: An on-line lexical database. International Journal of Lexicography.
  • Nguyen et al. (2015) Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics.
  • van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems 30.
  • Pagliardini et al. (2018) Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
  • Roder et al. (2015) Michael Roder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the space of topic coherence measures. WSDM.
  • Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
  • Srivastava and Sutton (2017) Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. In International Conference on Learning Representations.
  • Xu et al. (2015) Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. 2015. Short text clustering via convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.
  • Zhang and LeCun (2015) Xiang Zhang and Yann LeCun. 2015. Text understanding from scratch. arXiv preprint arXiv:1502.01710.
  • Zhao et al. (2018) Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. 2018. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Appendix A Model Architecture and Hyper-parameters

In all datasets, we use the same model architecture and the same optimizer without tuning the model on each dataset. The generator G is a fully-connected neural network with 3 hidden layers, each with a hidden dimension of 1000; setting the hidden dimension to 500 or 1500 only slightly influences the performance. We apply batch normalization Ioffe and Szegedy (2015) at each layer. The discriminator D is a fully-connected neural network with 2 hidden layers, each with a hidden dimension of 500. As the discriminator has to be well trained to guide the gradient of the generator, we make the discriminator easier to train by giving it a simpler architecture than the generator. If G and D are too deep, they become difficult to train; on the other hand, if they are too shallow, the performance of TIGAN drops.

The topic classifier Q consists of a word embedding matrix and a linear network that takes the weighted average of the word embeddings as input and predicts the latent topics. We set the dimension of the pretrained word embeddings to 100. As we find that the architecture of the topic classifier Q greatly influences the performance of our model, it is discussed in Section 5.2. The noise predictor E is a fully-connected neural network with one hidden layer of dimension 500. G, D, E and Q are all optimized by the Adam optimizer with learning rate 0.0005, \beta_{1}=0.5 and \beta_{2}=0.999.

Appendix B Retrieving Topical Words from TIGAN

The topical words of our model can be retrieved from the topic classifier Q as follows. Q consists of a word embedding matrix and a linear layer. Let W_{V\times N} be the word embedding matrix, where each row is an N-dimensional word vector and V is the vocabulary size. Let M_{K\times N} be the weight matrix of the linear layer that predicts the topics from the sentence embedding (the weighted average of the word embeddings), where K is the topic number. Let C_{K\times V}=M\cdot W^{T}, where the value of C_{k,v} represents the importance of the v-th word to the k-th topic. The top few words with the highest values within row k are selected as the topical words of the k-th topic.
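
A small sketch of this procedure (the function name and top_k value are ours; top_k is set to 20, roughly the number of words listed per topic in Table 7):

import numpy as np

def topical_words(W, M, id2word, top_k=20):
    """Appendix B: C = M · W^T gives a (K, V) word-topic importance matrix; the top-k
    entries of row k are the topical words of the k-th topic.
    W: (V, N) word embedding matrix of Q; M: (K, N) weights of Q's linear layer."""
    C = M @ W.T
    return [[id2word[v] for v in np.argsort(-C[k])[:top_k]] for k in range(C.shape[0])]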

Appendix C Qualitative Analysis of Topical Words

We provide the learned topical words of all latent topics of Yahoo! Answers from our model in Table 7, from ProdLDA in Table 8, from NVDM in Table 9, and from LDA in Table 10. As shown in the tables, the topical words in each latent topic of TIGAN are closely semantically related to each other. For example, all the words in the first latent topic are obviously related to science, the words in the second topic are related to the operation of computers, and the words in the third topic are related to politics.

Most of the topical words from ProdLDA are related to each other, with some exceptions; for example, some words in the second-to-last latent topic, such as “cousin”, “do I”, or “bc”, are not reasonable. In NVDM, we find more unreasonable latent topics: in the last latent topic, “diet” and “movie” should not be clustered into the same topic, and in the fourth-to-last latent topic, “immigrants” and “cup” are not related to each other. The quality of the extracted topical words from LDA is worse than that of the above three neural models; most of the topical words in the same topic are not semantically related to each other.

Appendix D Discussion of Topic Code Distribution

For labeled datasets, whether single-label or multi-label, we suggest sampling the topic code c according to the distribution of the human-labeled categories. For unlabeled datasets, we suggest setting the topic number of c to 10 to 20 and tuning downward from the larger number. With a greater topic number, TIGAN is still able to learn explainable topics, as this setting merely splits a single topic into several subtopics; in contrast, setting the topic number too small forces several unrelated topics to combine into a single topic and thus harms the performance. It is worth mentioning that sampling one-hot c from a uniform distribution works for most datasets, because it encourages the topic classifier to learn salient and highly interpretable features in order to maximize the mutual information between the discrete one-hot c and the generated discrete bag-of-words.

Generated Text
british airports leaders plan to ban on one of thousands the city health virus officials said friday
country health officials from world organisation has warned turkey ’s swine flu in britain pakistan saying it was unsafe bringing the bird
the talk of a new disease on tiny team of has formed in tanzania which based on ebola virus
sars epidemic has been recently detected in the hong kong and disease in france public health organisation is for
health experts said it was in southwest china to contain the outbreak of disease
national organizations have been established in beijing to sars for safety alert and common disease resources
the swine flu virus has been muted by health organisation response to tackle disease
thailand ’s sichuan province has given to foreign investors in airlines travel disease
indonesia has developed a discovery of health system wild birds at home from the flu
the impact of poor health organisation has been UNK swine flu season in for population
the governments gathered in thailand ’s health officials of world ministry earlier on friday
indonesia has been hospitalized with pneumonia the health organisation said
doctors from romania and countries are to explore ways of recent disease
the death of # percent bird flu in latest report region asia , according to a study released by u.s. health organization
agricultural experts have set to double the european commission meet with a new study on and human bird flu in animal was unveiled
Table 6: The generated texts of English Gigaword Topic ID 2 in Table 3 with different noise vectors. It is clear that all the generated sentences are about disease.
Topical Words
measure, equations, cm, density, equation, units, formula, solar, atoms, inches, cycle, calculate, volume, radius, physics, triangle, molecules, height, equal, gravity
deleted, files, folder, users, delete, messenger, spyware, scan, installed, settings, upload, downloaded, ip, disk, router, screen, firewall, software, message, xp
terrorists, democrats, troops, republicans, terrorism, terrorist, leaders, senate, iraq, politicians, liberals, democracy, saddam, congress, democratic, elections, elected, political, majority, war
biology, engineering, courses, resources, management, materials, algebra, technology, design, accounting, analysis, studies, communication, learning, education, programming, skills, teaching, study, marketing
nervous, abuse, depressed, drunk, crying, depression, orgasm, jealous, sexually, jail, hang, drinking, emotional, chat, busy, sleeping, herself, hanging, custody, diagnosed
sore, infection, skin, stomach, urine, milk, vitamin, pills, symptoms, muscle, therapy, breast, severe, treatment, muscles, fluid, bleeding, acne, pain, bone
spirit, holy, heaven, rock, satan, worship, ghost, evolution, spiritual, bible, devil, jesus, bands, mary, lyrics, adam, gods, soul, singer, christian
chat, pics, porn, orgasm, females, nasty, bored, penis, sensitive, avatar, attracted, emotional, ugly, naked, sexy, addicted, vagina, lesbian, jokes, dirty
champions, playoffs, championship, league, teams, cup, nba, team, finals, nfl, hockey, fifa, football, soccer, kobe, wrestling, basketball, brazil, matches, cricket
payments, estate, mortgage, property, loan, insurance, payment, fees, loans, funds, debt, companies, investment, agency, financial, employees, filed, employment, owner, taxes
Table 7: Topical words generated from TIGAN in Yahoo! Answers.
Topical Words
teaching, learning, subject, teach, learn, skills, studying, topic, study, project, prepare, courses, students, schools, english, guide, essay, language, spelling, education
tooth, teeth, knee, dentist, exercise, loose, dr, healthy, coffee, gym, counter, workout, stomach, pill, medication, severe, acne, bleeding, eating, diet
fees, employees, employee, employer, cash, agent, earn, selling, homes, banks, hire, mortgage, payment, invest, offered, funds, budget, estate, bills, agency
divide, motion, factor, formula, direction, scale, compound, length, equations, circle, zero, frequency, steel, wave, triangle, elements, electricity, velocity, solid, table
detroit, olympics, nfl, league, winning, wins, teams, 2002, playoffs, match, championship, johnson, coach, bowl, dallas, team, goal, soccer, baseball, finals
liberals, congress, conservative, voting, 911, saddam, liberal, soldiers, majority, elected, bush, politicians, vote, elections, violence, terrorism, freedom, vietnam, troops, democracy
band, bob, bands, lyrics, singing, artist, singer, scene, sang, 80s, movie, episode, sings, sing, pink, rap, dancing, movies, song, songs
bible, believe, gods, christianity, christ, worship, faith, spiritual, jesus, spirit, belief, christian, god, heaven, jewish, christians, eve, mary, lord, belive
dated, talked, weve, eachother, cousin, friendship, ex, gf, break, cheating, cheated, hang, dating, together, broke, boyfriends, bc, hanging, doi, talk
opened, icon, automatically, task, edition, comp, blocked, sharing, instant, deleted, press, blank, dell, keyboard, connection, bar, beta, update, mac, dsl
Table 8: Topical words generated from ProdLDA in Yahoo! Answers.
Topical Words
earn, points, money, countries, questions, energy, learn, invest, score, market, nuclear, iran, dollars, business, technology, government, study, stock, economy, world
school, schools, high, university, college, colleges, students, basketball, sport, la, de, girls, grade, courses, graduate, teacher, football, les, teachers, boys
english, lyrics, song, search, google, spanish, translate, site, language, christian, sites, myspace, sings, bible, love, translation, websites, songs, web, books
god, jesus, christians, christian, religion, truth, bush, religious, beliefs, faith, church, christ, believe, evil, heaven, feelings, bible, kill, gods, illegal
download, computer, software, email, myspace, laptop, install, phone, pc, card, cd, program, online, password, files, video, address, send, connect, downloaded
relationship, sex, math, grade, age, degree, feelings, shy, boyfriend, teacher, homework, advice, healthy, pregnant, study, dating, bf, classes, yrs, weight
win, hate, watch, immigrants, idol, americans, mexicans, hes, gonna, watching, iraq, games, tired, mexico, jobs, thinks, illegals, cup, republicans, democrats
foods, food, health, diet, products, skin, protein, fat, muscle, eat, eating, exercise, treatment, cure, disease, healthy, muscles, sugar, bacteria, weight
county, tax, state, federal, income, insurance, loan, taxes, visa, citizen, pay, financial, states, department, lawyer, child, california, legal, loans, court
christmas, season, diet, dvd, burn, buy, songs, movie, night, water, song, day, band, summer, drink, gift, eat, ebay, credit, price
Table 9: Topical words generated from NVDM in Yahoo! Answers.
Topical Words
looking, friend, girl, yahoo, song, new, look, good, trying, know, im, information, music, online, business, need, write, check, send, change
like, know, want, really, think, people, help, say, question, tell, answer, need, good, make, guy, feel, things, thing, bad, ask
school, job, work, home, state, high, good, college, number, age, small, book, run, law, area, want, best, parents, doctor, vs
time, best, long, person, years, money, man, life, sex, guys, like, women, email, way, great, little, lot, men, away, good
day, real, read, bush, anybody, eat, explain, 12, power, pain, mind, food, new, red, period, rid, weeks, air, hours, ex
water, times, end, english, card, point, states, difference, credit, page, gets, comes, different, 10, blood, light, time, used, order, green
need, help, free, site, computer, want, website, know, heard, try, weight, using, web, best, whats, download, windows, like, company, search
people, god, called, believe, country, think, pay, days, ago, means, war, child, come, world, probably, president, info, wondering, usa, test
use, old, better, internet, told, buy, type, stop, used, black, white, ones, turn, easy, language, date, science, fast, laptop, skin
love, going, think, got, world, year, talk, yes, team, body, win, hes, come, american, movie, game, play, cup, ive, boyfriend
Table 10: Topical words generated from LDA in Yahoo! Answers.