Learning Interpretable and Discrete Representations
with Adversarial Training for Unsupervised Text Classification
Abstract
Learning continuous representations from unlabeled textual data has been increasingly studied for benefiting semi-supervised learning. Although discrete representations are easier to interpret, learning discrete representations from unlabeled textual data has not been widely explored due to the difficulty of training. This work proposes TIGAN, which learns to encode texts into two disentangled representations, a discrete code and a continuous noise, where the discrete code represents interpretable topics and the noise controls the variance within the topics. The discrete code learned by TIGAN can be used for unsupervised text classification. Compared to other unsupervised baselines, the proposed TIGAN achieves superior performance on six different corpora, and its performance is on par with a recently proposed weakly-supervised text classification method. The extracted topical words for representing latent topics show that TIGAN learns coherent and highly interpretable topics.
1 Introduction
In natural language processing (NLP), learning meaningful representations from large amounts of unlabeled texts is a core problem for unsupervised and semi-supervised language understanding. While learning continuous text representations has been widely studied Kiros et al. (2015); Arora et al. (2017); Logeswaran and Lee (2018); Pagliardini et al. (2018); Peters et al. (2018); Devlin et al. (2019), learning discrete representations has been explored by fewer works Miao et al. (2017); Zhao et al. (2018). Nevertheless, learning discrete representations is still important, as discrete representations are easier to interpret Zhao et al. (2018) and can benefit unsupervised learning van den Oord et al. (2017).
Auto-encoders and their variants are widely used to learn latent representations, but how to learn meaningful discrete latent representations remains unsolved. The reason is that using discrete variables as latent representations in auto-encoders hinders the gradient from backpropagating from the decoder to the encoder. To take non-differentiable discrete variables as latent representations in an auto-encoder, some special methods, such as Gumbel-Softmax Jang et al. (2016) or vector quantization van den Oord et al. (2017), are applied to enable training. Another direction of learning useful discrete representations is to maximize the mutual information between data and discrete variables such as infoGAN Chen et al. (2016) or IMSAT Hu et al. (2017).
In this work, we propose Textual InfoGAN (TIGAN), which is mainly built upon InfoGAN but with several useful extensions that make it suitable for textual data. In this model, texts are generated from two disentangled representations, a discrete code c and a continuous noise z, where each dimension of c represents a topic (or a category), and z controls the variance within a topic. For example, if the model learns that one dimension of c represents a topic about “sport”, different values of z represent different sport types, such as basketball or baseball, or different teams. Given a new text, the model can efficiently infer the discrete topic code c. As we alternately optimize the InfoGAN objective and the auto-encoder objective, the model can be considered an integration of the InfoGAN- and auto-encoder-based approaches.
To evaluate the interpretability of discovered discrete topics, we evaluate our model on unsupervised text classification. Better performance on unsupervised text classification implies that the discovered topics directly match the human-annotated categories, and thus humans can intuitively understand what they represent, such as the category of news or the type of questions. Compared to other possible unsupervised text classification methods, such as unsupervised sentence representation methods or topic models, TIGAN achieves superior performance. Also, the performance of our unsupervised method is on par with a recent weakly-supervised method (Haj-Yahia et al., 2019), which required keywords of each category annotated by human experts.
To interpret discovered topics, we extract a few topical words from each topic to represent the topic. The extracted topical words are evaluated by quantitative and qualitative analysis. Compared to baseline methods, TIGAN is capable of extracting more coherent topical words. Also, it is possible to generate various texts conditioned on topics discovered by TIGAN in an unsupervised manner, which gives us a way to interpret the discovered topics further. The generated texts demonstrate that TIGAN successfully learns disentangled representations.
2 Related Work
Here we review approaches to unsupervised discrete representation learning.
Auto-encoder
Our work is closely related to other unsupervised discrete representation learning methods. VQ-VAE van den Oord et al. (2017) encodes the data into a discrete one-hot code by selecting the index of the embedding closest to the data representation, and utilizes vector quantization (VQ) to approximate the gradient from the decoder to the encoder. DI-VAE Zhao et al. (2018) uses Batch Prior Regularization (BPR) to approximate the KL-divergence between discrete variables, and learns discrete representations with Gumbel-Softmax by minimizing the KL-divergence between the discrete posterior and a discrete prior.
In this work, because we set the dimension of the discrete code to a small number, we have to model the variance within each code; otherwise, texts cannot be successfully generated. Therefore, we turn to other discrete representation learning models.
Maximizing Mutual Information
To learn useful latent representations, IMSAT Hu et al. (2017) maximizes the mutual information between data and the encoded discrete representation. They also propose an objective function that makes the discrete representation invariant to data augmentation. On the other hand, InfoGAN has shown impressive performance in learning disentangled representations of images in an unsupervised manner Chen et al. (2016). The original GAN generates images from a continuous noise z, but no individual dimension of the noise corresponds to a disentangled feature of the generated images. To learn semantically meaningful representations, InfoGAN maximizes the mutual information between the input code c and the generated output G(c, z). However, maximizing the mutual information directly is intractable because it requires access to the posterior P(c | x). Based on variational information maximization, Chen et al. (2016) used an auxiliary distribution Q(c | x) to approximate P(c | x). The auxiliary distribution can be a neural network jointly optimized with G.
3 TIGAN

TIGAN assumes that text is generated from a discrete code c and a continuous noise z. Each dimension of the discrete code c represents a topic (or a category), and the continuous noise z controls the variance within the topic. As sequential textual data is too complex for discovering meaningful topics, we focus on learning topics from bag-of-words textual data. In addition, we propose several extensions of InfoGAN: (1) categorical loss clipping, (2) combining InfoGAN with an auto-encoder, (3) using WGAN-gp Gulrajani et al. (2017), and (4) using pretrained word embeddings to regularize the discovered topics. These extensions greatly improve performance.
3.1 Model
Figure 1 (A) illustrates our model, which consists of a generator G, a discriminator D, a topic classifier Q, and a noise predictor (a code sketch of these components is given after the list):
• Generator G: It takes a discrete topic code c and a continuous noise z as input and learns to generate bag-of-words that are indistinguishable from the bag-of-words of real texts, while the generated output captures the topical information of the input c. The output of the generator, G(c, z), is a vocabulary-sized vector. The output layer of G is a sigmoid function, so the value of each dimension is between 0 and 1. G(c, z) can be interpreted as a bag-of-words vector, where each dimension indicates whether a word appears in the text. G(c, z) is fed directly to the discriminator, topic classifier, and noise predictor without sampling.
• Discriminator D: It takes a bag-of-words vector as input and outputs a scalar that distinguishes whether the input is generated by the generator or comes from human-written texts. The bag-of-words vector of a human-written text should be assigned a higher score, i.e., D(x) should be large, while a bag-of-words vector generated by G should be assigned a lower score, i.e., D(G(c, z)) should be small. D is trained to minimize the following loss:

$L_D = -\mathbb{E}_{x \sim P_{data}}[\log D(x)] - \mathbb{E}_{c \sim P_c,\, z \sim P_z}[\log(1 - D(G(c, z)))]$  (1)

where P_data is the real data distribution, P_z is the noise distribution, and P_c is the topic code distribution. The distributions P_z and P_c are determined by the developers based on prior knowledge about the corpus. In the following experiments, P_z is a normal distribution, while P_c is a uniform distribution over one-hot vectors, in which each dimension has an equal probability of being set to 1. In general, we find that a normal distribution for P_z is sufficient to model most texts. We use a uniform distribution for P_c because our model is evaluated on single-label datasets in which most texts are written according to one main topic. For a multi-label classification dataset, it is feasible to sample a topic code containing several topics according to the distribution of human-annotated categories.
• Topic Classifier Q: It is a categorical topic code classifier, which predicts the categorical topic distribution from an input bag-of-words and plays the role of inferring topics from bag-of-words in this work. It is trained to predict the topic code c from the generated bag-of-words G(c, z). The categorical loss for training Q is

$L_c = \mathbb{E}_{c \sim P_c,\, z \sim P_z}\big[H\big(c,\, Q(G(c, z))\big)\big]$  (2)

where H evaluates the cross entropy between c and the prediction Q(G(c, z)), and Q learns to minimize this cross entropy. Cross entropy is used because c is a one-hot vector; other difference measures are also possible.
• Noise Predictor: It predicts the continuous noise z from the bag-of-words of human-written texts, and it is trained only with the auto-encoder objective.
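To make the architecture concrete, the following is a minimal sketch of the four components as feed-forward networks, assuming a PyTorch implementation (the paper does not name a framework); layer sizes follow Appendix A, and the simplified Q here stands in for the embedding-based classifier of Section 3.6.

```python
import torch
import torch.nn as nn

V, K, Z_DIM = 3000, 10, 200  # vocabulary size, topic number, noise dimension (Section 5.1)

def mlp(sizes, out_act=None, batch_norm=False):
    """Stack of Linear layers with ReLU (and optional BatchNorm) between hidden layers."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            if batch_norm:
                layers.append(nn.BatchNorm1d(sizes[i + 1]))
            layers.append(nn.ReLU())
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

class Generator(nn.Module):
    """G(c, z): maps a one-hot topic code and a noise vector to a soft bag-of-words in [0, 1]^V."""
    def __init__(self):
        super().__init__()
        self.net = mlp([K + Z_DIM, 1000, 1000, 1000, V], out_act=nn.Sigmoid(), batch_norm=True)

    def forward(self, c, z):
        return self.net(torch.cat([c, z], dim=-1))

G = Generator()
# D: scores a bag-of-words vector; no output activation since the WGAN-gp loss (Section 3.5) is used.
D = mlp([V, 500, 500, 1])
# Q: predicts topic logits from a bag-of-words (simplified; the full Q uses pretrained embeddings).
Q = mlp([V, 500, K])
# Noise predictor: recovers z from a real bag-of-words; trained only with the auto-encoder (Section 3.4).
noise_predictor = mlp([V, 500, Z_DIM])
```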
3.2 Training
We apply (3) to train our generator G, discriminator D, and topic classifier Q:
$\min_{G, Q} \max_{D} \; -L_D(D, G) + \lambda L_c(G, Q)$  (3)
where λ is a hyper-parameter. As we find that when λ is too large the generator generates unreasonable bag-of-words simply to minimize L_c, we set λ to a small value. The discriminator D learns to minimize L_D, Q learns to minimize L_c, and G learns to maximize L_D while minimizing L_c. However, it is difficult to apply InfoGAN to textual data directly, so we propose several extensions and tips in the following sections.
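As a rough illustration of (2) and (3) in code, the sketch below samples c and z from the priors of Section 3.1 and combines the losses, already using the WGAN form introduced in Section 3.5; the value of λ and the PyTorch-style names are our assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def sample_priors(batch, topics=10, noise_dim=200, device="cpu"):
    """c ~ uniform over one-hot vectors, z ~ standard normal (Section 3.1)."""
    idx = torch.randint(topics, (batch,), device=device)
    c = F.one_hot(idx, topics).float()
    z = torch.randn(batch, noise_dim, device=device)
    return c, z, idx

def tigan_losses(G, D, Q, real_bow, lam=0.1):
    """Losses for one step of (3); lam is an illustrative stand-in for the paper's small λ."""
    c, z, idx = sample_priors(real_bow.size(0), device=real_bow.device)
    fake_bow = G(c, z)

    # L_D in the WGAN form of (6): real bag-of-words scored high, generated scored low.
    loss_d = -D(real_bow).mean() + D(fake_bow.detach()).mean()

    # L_c, eq. (2): Q should recover the topic code c from the generated bag-of-words.
    loss_c = F.cross_entropy(Q(fake_bow), idx)

    # Generator: fool the critic while keeping c recoverable (maximize L_D, minimize L_c).
    loss_g = -D(fake_bow).mean() + lam * loss_c
    return loss_d, loss_g, loss_c
```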
3.3 Categorical Loss Clipping
During training, there is a severe mode collapse issue within the same topic: given the same topic code c, the generator ignores the continuous noise z and always outputs the same bag-of-words. The reason is that the optimal solution for the generator to maximize the mutual information between the discrete code c and the generated bag-of-words (i.e., to minimize (2)) is to always predict the same bag-of-words given the same c. To tackle this issue, we clip the categorical loss in (2) to a lower bound as below.
$L_c' = \max(L_c, \epsilon)$  (4)
With (4), the model stops making the categorical loss smaller once it is small enough, which is controlled by the hyper-parameter ε. As setting ε too large hinders Q from predicting the correct topic code c, we set it to a value slightly greater than zero. It is worth mentioning that applying batch normalization Ioffe and Szegedy (2015) to G also greatly alleviates the mode collapse problem.
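In code, (4) amounts to clamping the categorical loss from below; the ε value below is a stand-in, since the paper only states that it is slightly greater than zero.

```python
import torch

def clipped_categorical_loss(loss_c, eps=0.05):
    # Stop decreasing the categorical loss once it reaches eps (eq. 4); eps is illustrative.
    return torch.clamp(loss_c, min=eps)

# Used in the generator/Q objective, e.g.:
# loss_g = -D(fake_bow).mean() + lam * clipped_categorical_loss(loss_c)
```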
3.4 Combining infoGAN with auto-encoder
As mentioned in prior works Larsen et al. (2016); Huang et al. (2018), it is useful to train a GAN and an auto-encoder alternately. Therefore, we include auto-encoder training in the optimization procedure, as shown in the right part of Figure 1 (A). This is another way to prevent the mode collapse mentioned in the last subsection.
We use bag-of-words from real texts to train an auto-encoder. In auto-encoder training, the noise predictor and the topic classifier Q are jointly regarded as an encoder, which encodes a bag-of-words into a continuous code and a discrete code, respectively. The generator G serves as the decoder, which reconstructs the original input from the continuous code and the discrete code. Therefore, G, Q, and the noise predictor jointly learn to minimize the reconstruction loss
$L_{rec} = \mathbb{E}_{x \sim P_{data}}\big[\,\mathrm{BCE}\big(x,\; G(\hat{c}, \hat{z})\big)\,\big]$  (5)
where $\hat{c}$ and $\hat{z}$ are the discrete and continuous codes encoded from the real bag-of-words x by Q and the noise predictor, and binary cross entropy (BCE) is used as the reconstruction loss function. We train (3) and (5) alternately.
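A minimal sketch of this alternating auto-encoder step (5) follows; feeding Q's softmax output directly to G reflects the statement in Section 5.2 that c is not restricted to be one-hot here, though the exact relaxation used is our assumption.

```python
import torch.nn.functional as F

def autoencoder_step(G, Q, noise_predictor, real_bow):
    """Reconstruction loss (5): Q and the noise predictor encode, G decodes."""
    c_soft = F.softmax(Q(real_bow), dim=-1)         # discrete code, not forced to be one-hot here
    z_hat = noise_predictor(real_bow)               # continuous code
    recon = G(c_soft, z_hat)                        # generator acts as the decoder
    return F.binary_cross_entropy(recon, real_bow)  # BCE reconstruction loss
```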
3.5 Using WGAN-gp
As mentioned in Arjovsky et al. (2017), it is challenging to generate discrete data with a GAN. Because bag-of-words vectors from real texts are also discrete, training fails when the original loss function for D in (1) is used. We therefore apply the WGAN loss Arjovsky et al. (2017) to train D and G, rewriting (1) as
$L_D = -\mathbb{E}_{x \sim P_{data}}[D(x)] + \mathbb{E}_{c \sim P_c,\, z \sim P_z}[D(G(c, z))]$  (6)
Here, to constrain D to be a Lipschitz function, we apply the gradient penalty Gulrajani et al. (2017) to D.
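The gradient penalty itself is the standard WGAN-gp term; a sketch is given below (the penalty weight of 10 follows Gulrajani et al. (2017), and PyTorch is assumed).

```python
import torch

def gradient_penalty(D, real_bow, fake_bow, gp_weight=10.0):
    """Penalize the critic's gradient norm on interpolates between real and generated bag-of-words."""
    alpha = torch.rand(real_bow.size(0), 1, device=real_bow.device)
    interp = (alpha * real_bow + (1 - alpha) * fake_bow).requires_grad_(True)
    scores = D(interp)
    grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    return gp_weight * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# loss_d_total = loss_d + gradient_penalty(D, real_bow, fake_bow.detach())
```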
3.6 Using pretrained word embeddings
Prior work Nguyen et al. (2015) suggests that using pretrained word embeddings such as word2vec Mikolov et al. (2013) helps models discover more coherent topics. To encourage the model to learn more consistent topics, we use pretrained word embeddings in Q to regularize it toward more coherent topics. When Q takes a bag-of-words as input, it first computes a document vector as the weighted average of the pretrained word embeddings of the bag-of-words, and then feeds the document vector to a linear network to predict the discrete topic code c. We use smooth inverse frequency (SIF) Arora et al. (2017) to compute the weighted average of the word embeddings: the weight of a word w is a / (a + p(w)), where a is a learnable non-negative parameter and p(w) is the word frequency. The word embeddings are pretrained with fastText Joulin et al. (2016).
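A sketch of Q with SIF-weighted pretrained embeddings is shown below; enforcing the positivity of a through an exponential and keeping the pretrained vectors fixed are our assumptions, as the paper specifies neither.

```python
import torch
import torch.nn as nn

class SIFTopicClassifier(nn.Module):
    """Q: SIF-weighted average of pretrained word embeddings followed by a linear topic predictor."""
    def __init__(self, pretrained_emb, word_freq, n_topics):
        super().__init__()
        self.emb = nn.Parameter(pretrained_emb, requires_grad=False)  # (V, d) fastText vectors
        self.register_buffer("freq", word_freq)                       # (V,) word frequencies p(w)
        self.log_a = nn.Parameter(torch.zeros(1))                     # a = exp(log_a) stays positive
        self.linear = nn.Linear(pretrained_emb.size(1), n_topics)

    def forward(self, bow):                         # bow: (B, V) bag-of-words
        a = self.log_a.exp()
        sif = a / (a + self.freq)                   # SIF weight a / (a + p(w))
        weighted = bow * sif                        # down-weight frequent words
        doc = weighted @ self.emb / (weighted.sum(dim=1, keepdim=True) + 1e-8)
        return self.linear(doc)                     # topic logits
```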
4 Text Generation from Discovered Topics
To interpret what the model learns, we generate text conditioned on a discovered discrete topic c, as shown in Figure 1 (B). Given a discrete topic c, we sample many different noises z to generate different bag-of-words G(c, z). After obtaining bag-of-words from G, we use a text generator to produce sequential text from the bag-of-words. Because we can transform each sequential text into its corresponding bag-of-words, we can obtain numerous (bag-of-words, text) pairs to train the text generator. However, compared to the bag-of-words of human-written texts, the generated G(c, z) may be noisy and unreasonable. To make the text generator robust to such noise, during training we add noise such as random word removal or shuffling to the input bag-of-words. The text generator is an LSTM whose hidden state is initialized by the output of a feedforward neural network that takes the bag-of-words as input.
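The bag-of-words corruption can be as simple as random word removal; the sketch below covers only removal with an illustrative drop probability (the shuffling variant mentioned above is omitted).

```python
import torch

def corrupt_bow(bow, drop_prob=0.1):
    """Randomly zero out some of the present words so the text generator
    learns to cope with noisy generated bag-of-words (drop_prob is illustrative)."""
    keep = (torch.rand_like(bow) > drop_prob).float()
    return bow * keep
```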
5 Experiments
5.1 Unsupervised Classification
Table 1: Unsupervised text classification accuracy (%).
Methods | 20News | Yahoo! | DBpedia | Stackoverflow | Agnews | News-Cat
word embed avg+k-means | 28.63 | 38.91 | 69.04 | 22.44 | 73.83 | 37.78 |
sent2vec+k-means | 28.06 | 51.24 | 60.98 | 36.78 | 83.82 | 40.64 |
LDA + k-means | 28.97 | 22.58 | 61.13 | 43.57 | 49.36 | 21.84 |
NVDM | 24.63 | 33.21 | 46.22 | 26.33 | 62.36 | 30.12 |
ProdLDA | 30.02 | 39.65 | 68.19 | 25.13 | 72.78 | 38.62 |
LDA-few topics | 29.62 | 29.44 | 68.62 | 35.82 | 71.07 | 26.24 |
KE Haj-Yahia et al. (2019) | 37.8 | 53.9 | - | - | 73.8 | - |
TIGAN | 34.12 | 52.25 | 85.37 | 47.01 | 84.13 | 49.32 |
TIGAN w/o loss clipping | 30.12 | 45.92 | 83.32 | 40.99 | 83.66 | 47.42 |
TIGAN w/o auto-encoder | 29.14 | 43.07 | 78.76 | 31.60 | 81.62 | 45.12 |
TIGAN w/o word embed | 36.89 | 42.14 | 71.26 | 46.14 | 69.78 | 47.32 |
TIGAN w/ linear Q | 41.01 | 42.91 | 83.73 | 64.46 | 72.65 | 48.14
Datasets
We evaluate our model on six datasets: (1) 20NewsGroups, (2) Yahoo! Answers, (3) DBpedia ontology classification, (4) Stackoverflow title classification Xu et al. (2015), (5) agnews, and (6) News-Category-Dataset (https://www.kaggle.com/rmisra/news-category-dataset).
20NewsGroups is a document classification dataset composed of 20 different classes of documents. Yahoo! Answers is a question type classification dataset with 10 types of question-answer pairs constructed by Zhang and LeCun (2015). The DBpedia ontology classification dataset is constructed by Zhang and LeCun (2015), with 14 ontology classes selected from DBpedia 2014. Stackoverflow title classification is constructed by Xu et al. (2015) with 20 different title categories. The agnews dataset is a news classification dataset with 4 categories constructed by Zhang and LeCun (2015); we combine news titles and descriptions as our training texts. News-Category-Dataset contains around 200k news headlines from 2012 to 2018 obtained from HuffPost; we select the 11 most frequent classes and roughly balance the number of examples per class.
Experimental Settings
For all datasets, we use the same model architecture and the same optimizer without tuning on each dataset; details of the model architecture can be found in Appendix A. For each dataset, we choose the most frequent 3000 words as the vocabulary after stop word and punctuation removal and lowercase conversion. When training TIGAN, we sample a one-hot topic code c from a uniform distribution and set the number of latent topics (i.e., the dimension of c) to the number of human-annotated categories in each dataset. For example, as there are four news categories in agnews, we set the topic number to four. The continuous noise z is sampled from a normal distribution, and the dimension of z is set to 200 in all experiments. Our model is not sensitive to the dimension of z; setting it to a smaller value yields similar results.
We use the topic classifier Q to predict the latent topic probability distribution of each sample and assign each sample to the latent topic with the maximum probability. The samples assigned to a latent topic cluster use their human-annotated labels to vote for the label of the whole cluster. After assigning each latent topic to a human-annotated category, we report classification accuracy as a measure of the quality of the captured latent topics.
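This evaluation protocol amounts to majority-vote cluster labeling; a small sketch is given below (function and variable names are ours).

```python
import numpy as np

def cluster_vote_accuracy(pred_topic, gold_label):
    """Assign each latent topic the majority human-annotated label of its members, then score accuracy."""
    pred_topic, gold_label = np.asarray(pred_topic), np.asarray(gold_label)
    mapping = {}
    for t in np.unique(pred_topic):
        members = gold_label[pred_topic == t]
        mapping[t] = np.bincount(members).argmax()      # majority vote within the cluster
    mapped = np.array([mapping[t] for t in pred_topic])
    return float((mapped == gold_label).mean())
```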
Baselines
We compare our model to several possible methods for unsupervised text classification, including topic models, unsupervised sentence representations, and the keyword enrichment (KE) method.
The most straightforward approach to unsupervised text classification is based on unsupervised sentence representations: we first encode texts into vectors using an unsupervised text representation learning method and then run k-means on the text vectors. In Table 1, “word embed avg+k-means” averages the pretrained word embeddings as text vectors; the word embeddings are the same as those used in TIGAN. In “sent2vec+k-means”, the vectors are extracted by the state-of-the-art unsupervised sentence representation method of Pagliardini et al. (2018) with the latent dimension set to 300. In “LDA + k-means”, we set the topic number of LDA Blei et al. (2003) to 50 and then run k-means clustering on the learned features.
Other baselines are topic models, which represent a text as a mixture of latent topics. Because these latent topics may coincide with human-annotated categories, topic models are possible methods for unsupervised text classification. The evaluation setting of topic models is the same as for TIGAN: we set the topic number to the number of human-annotated categories and assign each latent topic to its most probable human-annotated category. For example, in Table 1 “LDA-few topics”, different from the setting of “LDA + k-means”, the topic number equals the class number. We also choose two strong variational auto-encoder based neural topic models, NVDM Miao et al. (2016) and ProdLDA Srivastava and Sutton (2017), as baselines.
Keyword enrichment (KE) Haj-Yahia et al. (2019) is a recent weakly-supervised text classification method that first asks human experts to label several keywords for each category and then gradually enriches the keyword dictionary; documents similar to the keywords of a category are classified into that category. This method is not strictly comparable to ours because it requires human involvement and its performance hinges on good keywords labeled by experts, whereas our model does not require any keywords from humans. It also requires WordNet Miller et al. (1990) to retrieve synonyms of keywords.
Table 2: Topic coherence scores of the topical words extracted by each model.
Methods | 20NewsGroups | Yahoo! Answers | DBpedia | Gigaword
LDA | 42.68 | 36.34 | 51.06 | 35.61 |
NVDM | 42.36 | 47.02 | 49.95 | 37.22 |
ProdLDA | 43.44 | 52.01 | 55.26 | 43.17 |
TIGAN w/ word embed | 43.01 | 55.92 | 62.16 | 46.12 |
TIGAN w/o word embed | 45.22 | 45.64 | 54.23 | 41.79 |
Table 3: Topical words of selected latent topics discovered by TIGAN.
Dataset | Topic ID | Topical Words
DBpedia | 1 | pianist, composer, singer, cyrillic, songwriter, romania, poet, painter, jazz, actress, musician, tributary |
2 | skyscraper, building, courthouse, tower, historic, plaza, brick, floors, twostory, mansion, hotel, buildings, register, tallest, palace | |
3 | midfielder, footballer, goalkeeper, football, championship, league, striker, soccer, defender, goals, matches, cup, hockey, medals | |
English Gigaword | 1 | inflation, index, futures, benchmark, prices, currencies, output, outlook, unemployment, lowest, stocks, opec, mortgage |
2 | sars, environment, pollution, tourism, virus, disease, flights, water, scientific, airports, quality, agricultural, animal, flu, alert | |
3 | polls, votes, elections, democrats, electoral, election, conservative, re-election, democrat, candidates, republicans, liberal, presidential |
Results and Discussion.
Compared to methods that cluster continuous text vectors, such as “word embed avg+k-means” and “sent2vec+k-means” in Table 1, TIGAN improves the performance. The reason is that TIGAN maximizes the mutual information between the generated data and a discrete one-hot distribution, which encourages the topic classifier Q to learn more salient features for separating texts into different classes. The result of “LDA+k-means” is even worse than “LDA-few topics”, which means it is not feasible to obtain human-annotated classes by directly clustering the features learned by LDA.
As shown in Table 1, TIGAN outperforms the two variational neural topic models and LDA on unsupervised classification. Both variational neural topic models and statistical topic models assume that each document is produced from a mixture of topics; to model the variance of documents, they have to split a topic such as “sport” into many subtopics such as baseball or basketball. Because each document should belong to a single category in our unsupervised text classification setting, this assumption harms the performance. In contrast, because TIGAN controls the variance within a topic with a noise vector, it can directly assume that a document is generated from a single main topic, which allows it to discover topics that match the human-annotated categories. This performance gap is understandable because topic models were not originally designed for the unsupervised text classification setting.
The performance of TIGAN is on par with keyword enrichment (KE) in Table 1 (we use the weighted recall reported in their paper, because the accuracy we compute is equivalent to weighted recall), even without keywords annotated by human experts. This result suggests that our model is able to automatically categorize texts without any help from humans.
5.2 Ablation Study
Training
We conduct an ablation study to show that all the mechanisms described in Section 3 are useful. As shown in Table 1, categorical loss clipping improves the performance; during training, we find that clipping this term makes the training more stable because it easily diverges otherwise. Combining InfoGAN with an auto-encoder also improves the performance, possibly because it encourages the generator to generate realistic data, which alleviates the difficulty of discrete data generation. Although we do not restrict the topic code c to be one-hot in auto-encoder training, InfoGAN training forces the codes that Q assigns to texts to be close to one-hot.
Model architecture of Topic Classifier
In Table 1, we find that the model architecture of the topic classifier Q greatly influences the unsupervised classification performance. Both “w/o word embed” and “w/ linear Q” are settings without pretrained word embeddings. We observe that using pretrained embeddings helps the topic classifier discover more interpretable topics on most, but not all, datasets; a similar phenomenon can be found in Table 2. As we pretrain word embeddings on each dataset separately, it is difficult to learn representative word embeddings for datasets without enough training samples, such as Stackoverflow or 20NewsGroups, so the performance on those datasets does not improve with pretrained word embeddings. Additionally, we find that using pretrained word embeddings makes Q insensitive to the random initialization of its weights, and thus it discovers more consistent topics across training runs.
In “w/o word embed”, the model architecture of Q is the same as in TIGAN (i.e., a neural network with randomly initialized word embeddings), while in “w/ linear Q”, the model architecture of Q is simply a linear network without any word embeddings. Comparing “w/o word embed” with “w/ linear Q”, the simpler architecture yields better results. With a more complex Q, the model arbitrarily maps complex but not meaningful distributions to the one-hot code, which makes the learned topics less interpretable.
Table 4: Texts generated from different discovered topic codes c (with the same continuous noise z).
Dataset | Topic ID | Generated Text
DBpedia | 1 | james nelson is an american musician singer songwriter and actor he was a line of his work with musical career as metal albums in 1989 |
2 | the UNK is a skyscraper located in downtown washington dc district it was completed june 2009 and tallest building currently | |
3 | UNK born 3 january 1977 is a finnish football player who plays for west coast club dynamo he has won bronze medals at 2007 | |
English Gigaword | 1 | oil prices fell in asia friday as traders fear of the world demand for biggest drop summer months figures |
2 | the world health organisation has placed on bird flu in nigeria ’s most populous countries virus , saying they are expected to appear | |
3 | the ruling party won overwhelming majority of seats in opposition ’s battle for sweeping elections |
Table 5: Texts generated from the same topic code c (a topic about buildings) but different noises z.
Generated Text
UNK is a historic public square located in south bronx built in 1987 |
UNK is a french southwest hotel originally located in the of greece south carolina it was conceived by josef |
UNK in historic district is a methodist church located in south african cape was added to national register of places 200 |
5.3 Topic Coherence
In this section, we use the method in Appendix B to extract topical words that represent the latent topics and evaluate their quality. As topic models are also good at extracting topical words to represent topics, they are chosen as our baselines.
Experimental setup
In this section, the topic number of all models is set to the number of human-annotated classes. To show that our model works not only on classification datasets, we also evaluate it on an unannotated dataset, English Gigaword Rush et al. (2015), a news summarization corpus. The topic number on English Gigaword is set to 10; how to decide the topic code distribution is discussed in Appendix D.
Quantitative analysis.
We use the coherence metric of Roder et al. (2015) to evaluate the coherence of the extracted topical words Chang et al. (2009) (we use the CoherenceModel in the gensim module to compute coherence scores). The coherence scores are computed using English Wikipedia, an external corpus of 5.6 million articles. The higher coherence scores in Table 2 show that TIGAN extracts reasonable and coherent words to represent a topic.
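For reference, coherence can be computed roughly as follows; the choice of the "c_v" measure is our assumption, since the text only states that gensim's CoherenceModel and a Wikipedia reference corpus are used.

```python
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

def topic_coherence(topics, tokenized_docs, measure="c_v"):
    """topics: list of topical-word lists; tokenized_docs: tokenized external corpus (e.g. Wikipedia)."""
    dictionary = Dictionary(tokenized_docs)
    cm = CoherenceModel(topics=topics, texts=tokenized_docs,
                        dictionary=dictionary, coherence=measure)
    return cm.get_coherence()
```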
Qualitative analysis.
To further analyze the quality of the discovered topics, the topical words of several latent topics are listed in Table 3. Additionally, we use the method of Section 4 to generate texts corresponding to the discovered topics in Table 4. The captured latent topics are clearly semantically distinct based on the extracted topical words, and the generated texts are topically related to the associated topics. More extracted topical words and generated texts are provided in Appendix C.
5.4 Disentangled Representations
The essential assumption of our work is that c and z learn disentangled representations, where c encodes the main topic information and z controls the variance within the topic. To validate this assumption, we generate texts from the same continuous noise z but different topic codes c in Table 4. There is almost no overlap of words between texts generated from the same z but different c, which means z encodes information disentangled from the topic code c.
We also analyze whether z can control the variance within a topic by modeling subtopic information. In Table 5 and Table 6 in the appendix, we generate texts from the same topic code c but different noises z. In Table 5 the topic of c is clearly about buildings, yet quite different sentences are generated with different z. The diversity of the generated texts within the same topic shows that z is capable of modeling the variance within a topic. Also, generating numerous texts for the same topic gives us a further way to interpret the discovered topics. Together, these two experiments demonstrate that our model learns disentangled representations.
6 Conclusion
This paper proposes a novel unsupervised framework, TIGAN, for learning discrete text representations to interpret textual data. By learning a discrete topic code disentangled from a continuous vector that controls subtopic information, TIGAN shows superior performance on unsupervised text classification, giving humans a good way to understand textual data. The extracted topical words and the texts generated from the discovered topics further showcase the effectiveness of our model in learning explainable and coherent topics.
References
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.
- Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.
- Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research.
- Chang et al. (2009) Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems 22.
- Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.
- Haj-Yahia et al. (2019) Zied Haj-Yahia, Adrien Sieg, and Léa A. Deleris. 2019. Towards unsupervised text classification leveraging experts and word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- Hu et al. (2017) Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. 2017. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning.
- Huang et al. (2018) Huaibo Huang, zhihang li, Ran He, Zhenan Sun, and Tieniu Tan. 2018. Introvae: Introspective variational autoencoders for photographic image synthesis. In Advances in Neural Information Processing Systems 31.
- Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
- Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
- Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28.
- Larsen et al. (2016) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. 2016. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of The 33rd International Conference on Machine Learning.
- Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.
- Miao et al. (2017) Yishu Miao, Edward Grefenstette, and Phil Blunsom. 2017. Discovering discrete latent topics with neural variational inference. In Proceedings of the 34th International Conference on Machine Learning.
- Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of the 33rd International Conference on Machine Learning.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26.
- Miller et al. (1990) George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. 1990. Wordnet: An on-line lexical database. International Journal of Lexicography.
- Nguyen et al. (2015) Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics.
- van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems 30.
- Pagliardini et al. (2018) Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
- Roder et al. (2015) Michael Roder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the space of topic coherence measures. WSDM.
- Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
- Srivastava and Sutton (2017) Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. In International Conference on Learning Representations.
- Xu et al. (2015) Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. 2015. Short text clustering via convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.
- Zhang and LeCun (2015) Xiang Zhang and Yann LeCun. 2015. Text understanding from scratch. arXiv preprint arXiv:1502.01710.
- Zhao et al. (2018) Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. 2018. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Appendix A Model Architecture and Hyper-parameters
For all datasets, we use the same model architecture and the same optimizer without tuning the model on each dataset. The generator is a fully-connected neural network with 3 hidden layers, each of dimension 1000; setting the hidden dimension to 500 or 1500 only slightly influences the performance. We apply batch normalization Ioffe and Szegedy (2015) at each layer. The discriminator is a fully-connected neural network with 2 hidden layers, each of dimension 500. As the discriminator has to be well trained to guide the gradient of the generator, we make the discriminator easier to train by giving it a simpler architecture than the generator. If the networks of G and D are too deep, they become difficult to train; on the other hand, if they are too shallow, the performance of TIGAN drops.
The topic classifier Q consists of a word embedding matrix and a linear network that takes the weighted average of the word embeddings as input and predicts the latent topics. We set the dimension of the pretrained word embeddings to 100. As we find that the architecture of the topic classifier greatly influences the performance of our model, we discuss the architecture of Q in Section 5.2. The noise predictor is a fully-connected neural network with one hidden layer of dimension 500. All networks are optimized with the Adam optimizer.
Appendix B Retrieving Topical Words from TIGAN
The topical words of our model can be retrieved from the topic classifier Q as follows. Q consists of a word embedding matrix and a linear layer. Let W be the word embedding matrix, where each row is a d-dimensional word vector and V is the vocabulary size. Let U be the weight matrix of the linear layer that predicts the topics from the sentence embedding (the weighted average of word embeddings), where K is the topic number. Let M = U W^T, where the value of M_{kv} represents the importance of the v-th word to the k-th topic. The top few words with the highest values within row k are selected as the topical words of the k-th topic.
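A sketch of this retrieval step (matrix and function names follow the notation above; top_n is illustrative):

```python
import numpy as np

def topical_words(W, U, id2word, top_n=15):
    """W: (V, d) word embedding matrix of Q; U: (K, d) linear-layer weights.
    Row k of M = U W^T scores the importance of every word for topic k."""
    M = U @ W.T                                                    # (K, V)
    return [[id2word[v] for v in np.argsort(-M[k])[:top_n]] for k in range(M.shape[0])]
```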
Appendix C Qualitative Analysis of Topical Words
We provide the learned topical words of all latent topics of Yahoo! Answers from our model in Table 7, ProdLDA in Table 8, NVDM in Table 9, and LDA in Table 10. As shown in the tables, the topical words in each latent topic of TIGAN are closely semantically related to each other. For example, all the words in the first latent topic are clearly related to science, the words in the second topic are related to the operation of computers, and the words in the third topic are related to politics.
Most of the topical words from ProdLDA are related to each other, with some exceptions; for example, some words in the second-to-last latent topic, such as “cousin”, “doi”, or “bc”, are not reasonable. In NVDM, we find more unreasonable latent topics: for example, in the last latent topic, “diet” and “movie” should not be clustered into the same topic, and in the fourth-to-last latent topic, “immigrants” and “cup” are not related to each other. The quality of the topical words extracted by LDA is worse than that of the above three neural models; most of the topical words within the same topic are not semantically related to each other.
Appendix D Discussion of Topic Code Distribution
For labeled datasets, whether single-label or multi-label, we suggest sampling the topic code c according to the distribution of the human-labeled categories. For unlabeled datasets, we suggest starting from a relatively large topic number for c and tuning it downward. With a larger topic number, TIGAN is still able to learn explainable topics, as this setting simply splits a single topic into several subtopics; in contrast, setting the topic number too small forces several unrelated topics to merge into a single topic and thus harms the performance. It is worth mentioning that sampling one-hot c from a uniform distribution works for most datasets, because it encourages the topic classifier Q to learn salient and highly interpretable features in order to maximize the mutual information between the discrete one-hot code c and the generated bag-of-words.
Table 6: Additional texts generated from the same topic code c but different noises z (English Gigaword).
Generated Text
british airports leaders plan to ban on one of thousands the city health virus officials said friday |
country health officials from world organisation has warned turkey ’s swine flu in britain pakistan saying it was unsafe bringing the bird |
the talk of a new disease on tiny team of has formed in tanzania which based on ebola virus |
sars epidemic has been recently detected in the hong kong and disease in france public health organisation is for |
health experts said it was in southwest china to contain the outbreak of disease |
national organizations have been established in beijing to sars for safety alert and common disease resources |
the swine flu virus has been muted by health organisation response to tackle disease |
thailand ’s sichuan province has given to foreign investors in airlines travel disease |
indonesia has developed a discovery of health system wild birds at home from the flu |
the impact of poor health organisation has been UNK swine flu season in for population |
the governments gathered in thailand ’s health officials of world ministry earlier on friday |
indonesia has been hospitalized with pneumonia the health organisation said |
doctors from romania and countries are to explore ways of recent disease |
the death of # percent bird flu in latest report region asia , according to a study released by u.s. health organization |
agricultural experts have set to double the european commission meet with a new study on and human bird flu in animal was unveiled |
Table 7: Topical words of all latent topics learned by TIGAN on Yahoo! Answers.
Topical Words
measure, equations, cm, density, equation, units, formula, solar, atoms, inches, cycle, calculate, volume, radius, physics, triangle, molecules, height, equal, gravity |
deleted, files, folder, users, delete, messenger, spyware, scan, installed, settings, upload, downloaded, ip, disk, router, screen, firewall, software, message, xp |
terrorists, democrats, troops, republicans, terrorism, terrorist, leaders, senate, iraq, politicians, liberals, democracy, saddam, congress, democratic, elections, elected, political, majority, war |
biology, engineering, courses, resources, management, materials, algebra, technology, design, accounting, analysis, studies, communication, learning, education, programming, skills, teaching, study, marketing |
nervous, abuse, depressed, drunk, crying, depression, orgasm, jealous, sexually, jail, hang, drinking, emotional, chat, busy, sleeping, herself, hanging, custody, diagnosed |
sore, infection, skin, stomach, urine, milk, vitamin, pills, symptoms, muscle, therapy, breast, severe, treatment, muscles, fluid, bleeding, acne, pain, bone
spirit, holy, heaven, rock, satan, worship, ghost, evolution, spiritual, bible, devil, jesus, bands, mary, lyrics, adam, gods, soul, singer, christian
chat, pics, porn, orgasm, females, nasty, bored, penis, sensitive, avatar, attracted, emotional, ugly, naked, sexy, addicted, vagina, lesbian, jokes, dirty |
champions, playoffs, championship, league, teams, cup, nba, team, finals, nfl, hockey, fifa, football, soccer, kobe, wrestling, basketball, brazil, matches, cricket |
payments, estate, mortgage, property, loan, insurance, payment, fees, loans, funds, debt, companies, investment, agency, financial, employees, filed, employment, owner, taxes |
Table 8: Topical words of all latent topics learned by ProdLDA on Yahoo! Answers.
Topical Words
teaching ,learning ,subject ,teach ,learn ,skills ,studying ,topic ,study ,project ,prepare ,courses ,students ,schools ,english ,guide ,essay ,language ,spelling ,education |
tooth ,teeth ,knee ,dentist ,exercise ,loose ,dr ,healthy ,coffee ,gym ,counter ,workout ,stomach ,pill ,medication ,severe ,acne ,bleeding ,eating ,diet |
fees ,employees ,employee ,employer ,cash ,agent ,earn ,selling ,homes ,banks ,hire ,mortgage ,payment ,invest ,offered ,funds ,budget ,estate ,bills ,agency |
divide ,motion ,factor ,formula ,direction ,scale ,compound ,length ,equations ,circle ,zero ,frequency ,steel ,wave ,triangle ,elements ,electricity ,velocity ,solid ,table |
detroit ,olympics ,nfl ,league ,winning ,wins ,teams ,2002 ,playoffs ,match ,championship ,johnson ,coach ,bowl ,dallas ,team ,goal ,soccer ,baseball ,finals |
liberals ,congress ,conservative ,voting ,911 ,saddam ,liberal ,soldiers ,majority ,elected ,bush ,politicians ,vote ,elections ,violence ,terrorism ,freedom ,vietnam ,troops ,democracy |
band ,bob ,bands ,lyrics ,singing ,artist ,singer ,scene ,sang ,80s ,movie ,episode ,sings ,sing ,pink ,rap ,dancing ,movies ,song ,songs |
bible ,believe ,gods ,christianity ,christ ,worship ,faith ,spiritual ,jesus ,spirit ,belief ,christian ,god ,heaven ,jewish ,christians ,eve ,mary ,lord ,belive |
dated ,talked ,weve ,eachother ,cousin ,friendship ,ex ,gf ,break ,cheating ,cheated ,hang ,dating ,together ,broke ,boyfriends ,bc ,hanging ,doi ,talk |
opened ,icon ,automatically ,task ,edition ,comp ,blocked ,sharing ,instant ,deleted ,press ,blank ,dell ,keyboard ,connection ,bar ,beta ,update ,mac ,dsl |
Table 9: Topical words of all latent topics learned by NVDM on Yahoo! Answers.
Topical Words
earn ,points ,money ,countries ,questions ,energy ,learn ,invest ,score ,market ,nuclear ,iran ,dollars ,business ,technology ,government ,study ,stock ,economy ,world |
school ,schools ,high ,university ,college ,colleges ,students ,basketball ,sport ,la ,de ,girls ,grade ,courses ,graduate ,teacher ,football ,les ,teachers ,boys |
english ,lyrics ,song ,search ,google ,spanish ,translate ,site ,language ,christian ,sites ,myspace ,sings ,bible ,love ,translation ,websites ,songs ,web ,books |
god ,jesus ,christians ,christian ,religion ,truth ,bush ,religious ,beliefs ,faith ,church ,christ ,believe ,evil ,heaven ,feelings ,bible ,kill ,gods ,illegal |
download ,computer ,software ,email ,myspace ,laptop ,install ,phone ,pc ,card ,cd ,program ,online ,password ,files ,video ,address ,send ,connect ,downloaded |
relationship ,sex ,math ,grade ,age ,degree ,feelings ,shy ,boyfriend ,teacher ,homework ,advice ,healthy ,pregnant ,study ,dating ,bf ,classes ,yrs ,weight |
win ,hate ,watch ,immigrants ,idol ,americans ,mexicans ,hes ,gonna ,watching ,iraq ,games ,tired ,mexico ,jobs ,thinks ,illegals ,cup ,republicans ,democrats |
foods ,food ,health ,diet ,products ,skin ,protein ,fat ,muscle ,eat ,eating ,exercise ,treatment ,cure ,disease ,healthy ,muscles ,sugar ,bacteria ,weight |
county ,tax ,state ,federal ,income ,insurance ,loan ,taxes ,visa ,citizen ,pay ,financial ,states ,department ,lawyer ,child ,california ,legal ,loans ,court |
christmas ,season ,diet ,dvd ,burn ,buy ,songs ,movie ,night ,water ,song ,day ,band ,summer ,drink ,gift ,eat ,ebay ,credit ,price |
Table 10: Topical words of all latent topics learned by LDA on Yahoo! Answers.
Topical Words
looking, friend, girl, yahoo, song, new, look, good, trying, know, im, information, music, online, business, need, write, check, send, change |
like, know, want, really, think, people, help, say, question, tell, answer, need, good, make, guy, feel, things, thing, bad, ask |
school, job, work, home, state, high, good, college, number, age, small, book, run, law, area, want, best, parents, doctor, vs |
time, best, long, person, years, money, man, life, sex, guys, like, women, email, way, great, little, lot, men, away, good |
day, real, read, bush, anybody, eat, explain, 12, power, pain, mind, food, new, red, period, rid, weeks, air, hours, ex |
water, times, end, english, card, point, states, difference, credit, page, gets, comes, different, 10, blood, light, time, used, order, green |
need, help, free, site, computer, want, website, know, heard, try, weight, using, web, best, whats, download, windows, like, company, search |
people, god, called, believe, country, think, pay, days, ago, means, war, child, come, world, probably, president, info, wondering, usa, test |
use, old, better, internet, told, buy, type, stop, used, black, white, ones, turn, easy, language, date, science, fast, laptop, skin |
love, going, think, got, world, year, talk, yes, team, body, win, hes, come, american, movie, game, play, cup, ive, boyfriend |