
Learning Interpretable and Discrete Representations
with Adversarial Training for Unsupervised Text Classification

Yau-Shian Wang Hung-Yi Lee Yun-Nung Chen
National Taiwan University, Taipei, Taiwan
[email protected][email protected][email protected]
Abstract

Learning continuous representations from unlabeled textual data has been increasingly studied for benefiting semi-supervised learning. Although discrete representations are relatively easier to interpret, learning them for unlabeled textual data has not been widely explored due to the difficulty of training. This work proposes TIGAN, which learns to encode texts into two disentangled representations, a discrete code and a continuous noise, where the discrete code represents interpretable topics and the noise controls the variance within the topics. The discrete code learned by TIGAN can be used for unsupervised text classification. Compared to other unsupervised baselines, the proposed TIGAN achieves superior performance on six different corpora, and its performance is on par with a recently proposed weakly-supervised text classification method. The extracted topical words for representing latent topics show that TIGAN learns coherent and highly interpretable topics.

1 Introduction

In natural language processing (NLP), learning meaningful representations from large amounts of unlabeled texts is a core problem for unsupervised and semi-supervised language understanding. While learning continuous text representations has been widely studied Kiros et al. (2015); Arora et al. (2017); Logeswaran and Lee (2018); Pagliardini et al. (2018); Peters et al. (2018); Devlin et al. (2019), learning discrete representations has been explored by fewer works Miao et al. (2017); Zhao et al. (2018). Nevertheless, learning discrete representations is still important, as discrete representations are easier to interpret Zhao et al. (2018) and can benefit unsupervised learning van den Oord et al. (2017).

Auto-encoders and their variants are widely used to learn latent representations, but how to learn meaningful discrete latent representations remains unsolved. The reason is that using discrete variables as latent representations in auto-encoders hinders the gradient from backpropagating from the decoder to the encoder. To use non-differentiable discrete variables as latent representations in an auto-encoder, special methods such as Gumbel-Softmax Jang et al. (2016) or vector quantization van den Oord et al. (2017) are applied to enable training. Another direction for learning useful discrete representations is to maximize the mutual information between data and discrete variables, as in InfoGAN Chen et al. (2016) or IMSAT Hu et al. (2017).

In this work, we propose Textual InfoGAN (TIGAN), which is mainly built upon InfoGAN but with several useful extensions that make it suitable for textual data. In this model, texts are generated from two disentangled representations, a discrete code c and a continuous noise z, where each dimension of c represents a topic (or a category) and z controls the variance within a topic. For example, if the model learns that one dimension of c represents a topic about “sport”, different z represent different sport types, such as basketball or baseball, or different teams. Given a new text, the model can efficiently infer the discrete topic code c. As we alternately optimize the InfoGAN objective and the auto-encoder objective, the model can be considered an integration of the InfoGAN and auto-encoder based approaches.

To evaluate the interpretability of discovered discrete topics, we evaluate our model on unsupervised text classification. Better performance on unsupervised text classification implies that the discovered topics directly match the human-annotated categories, and thus humans can intuitively understand what they represent, such as the category of news or the type of questions. Compared to other possible unsupervised text classification methods, such as unsupervised sentence representation methods or topic models, TIGAN achieves superior performance. Also, the performance of our unsupervised method is on par with a recent weakly-supervised method  (Haj-Yahia et al., 2019), which required keywords of each category annotated by human experts.

To interpret discovered topics, we extract a few topical words from each topic to represent the topic. The extracted topical words are evaluated by quantitative and qualitative analysis. Compared to baseline methods, TIGAN is capable of extracting more coherent topical words. Also, it is possible to generate various texts conditioned on topics discovered by TIGAN in an unsupervised manner, which gives us a way to interpret the discovered topics further. The generated texts demonstrate that TIGAN successfully learns disentangled representations.

2 Related Work

Here we review approaches to unsupervised discrete representation learning.

Auto-encoder

Our work is closely related to other unsupervised discrete representation learning methods. VQ-VAE van den Oord et al. (2017) encodes the data into a discrete one-hot code by selecting the index whose embedding is closest to the data representation, and utilizes vector quantization (VQ) to approximate the gradient from the decoder to the encoder. DI-VAE Zhao et al. (2018) uses Batch Prior Regularization (BPR) to approximate the KL-divergence between discrete variables, and learns discrete representations with Gumbel-Softmax by minimizing the KL-divergence between the discrete posterior and a discrete prior.

In this work, because we set the dimension of the discrete code to a small number, we have to model the variance within each code; otherwise, the text cannot be successfully generated. Therefore, we turn to other discrete representation learning models.

Maximizing Mutual Information

To learn useful latent representations, IMSAT Hu et al. (2017) maximizes the mutual information between data and the encoded discrete representation, and also proposes an objective function that makes the discrete representation invariant to data augmentation. On the other hand, InfoGAN has shown impressive performance in learning disentangled representations of images in an unsupervised manner Chen et al. (2016). The original GAN generates images from a continuous noise z, but individual dimensions of the noise do not correspond to disentangled features of the generated images. To learn semantically meaningful representations, InfoGAN maximizes the mutual information between the input code c and the generated output G(z,c). However, maximizing the mutual information is intractable because it requires access to P(c\mid G(z,c)). Based on variational information maximization, Chen et al. (2016) used an auxiliary function Q to approximate P(c\mid G(z,c)); Q can be a neural network and is jointly optimized with G.

3 TIGAN

Figure 1: Illustration of the proposed model. (A) In the upper part (InfoGAN training), the bag-of-words generator G takes a discrete topic code c and a continuous noise vector z as input and generates a bag-of-words. The discriminator D distinguishes whether its input bag-of-words comes from G or from human-written text. The topic classifier Q predicts the latent topic c from its input bag-of-words. In the lower part, an auto-encoder is learned from the bag-of-words of human-written text: the noise predictor E, which predicts the noise z from the input bag-of-words, and the topic classifier Q together form the encoder, while G is the decoder. (B) To interpret what is learned by the model, an LSTM text generator G^{\prime} is trained to generate text based on a bag-of-words.

TIGAN considers that text is generated from a discrete code c and a continuous noise z. Each dimension of the discrete code c represents a topic (or a category), and the continuous noise z controls the variance within the topic. As sequential textual data is too complex for discovering meaningful topics, we focus on learning topics from bag-of-words textual data. In addition, we propose several extensions of InfoGAN, including: (1) categorical loss clipping, (2) combining InfoGAN with an auto-encoder, (3) using WGAN-gp Gulrajani et al. (2017), and (4) using pretrained word embeddings to regularize the discovered topics. These extensions greatly improve performance.

3.1 Model

Figure 1 (A) illustrates our model, which consists of a generator G, a discriminator D, a topic classifier Q, and a noise predictor E (a minimal sketch of these components is given after the following list):

  • Generator G
    It takes a discrete topic code c and a continuous noise z as input and learns to generate bag-of-words that are indistinguishable from the bag-of-words of real texts while capturing the topical information of the input c. The output of the generator, G(c,z), is a vocabulary-sized vector. The output layer of G is a sigmoid function, so the value of each dimension is between 0 and 1. G(c,z) can be interpreted as a bag-of-words vector, where each dimension indicates whether a word appears in the text. G(c,z) is fed directly to the discriminator, topic classifier, and noise predictor without sampling.

  • Discriminator D
    It takes a bag-of-words vector as input and outputs a scalar that distinguishes whether the input is generated by the generator G or comes from human-written texts. A bag-of-words vector x from human-written texts should be assigned a higher score, i.e., D(x) should be larger, while the bag-of-words vector G(c,z) generated by G should be assigned a lower score, i.e., D(G(c,z)) should be smaller. D is trained to minimize the following loss function \mathcal{L}_{D}:

    \mathcal{L}_{D}=-\operatorname{\mathbb{E}}_{x\sim P_{data}}[\log(D(x))]-\operatorname{\mathbb{E}}_{z\sim P_{z},c\sim P_{c}}[\log(1-D(G(z,c)))], (1)

    where P_{data} is the real data distribution, P_{z} is the noise distribution, and P_{c} is the topic code distribution. The distributions P_{c} and P_{z} are determined by the developers based on prior knowledge about the corpus. In the following experiments, P_{z} is a normal distribution, while P_{c} is a uniform distribution that generates one-hot vectors in which each dimension has an equal probability of being set to 1. In general, we find that using a normal distribution as P_{z} is sufficient to model most texts. We use a uniform distribution as P_{c} because our model is evaluated on single-label datasets, in which most texts are written about one main topic. For a multi-label classification dataset, it is feasible to sample a topic code c containing several topics according to the distribution of human-annotated categories.

  • Topic Classifier Q
    It is a categorical topic code classifier, which predicts the categorical topic distribution from an input bag-of-words. The classifier Q plays the role of inferring topics from bag-of-words in this work. It is trained to predict the topic code c from the generated bag-of-words G(c,z). The categorical loss \mathcal{L}_{Q} for training Q is shown below,

    \mathcal{L}_{Q}=\operatorname{\mathbb{E}}_{z\sim P_{z},c\sim P_{c},c^{\prime}=Q(G(z,c))}[\mathcal{L}_{ce}(c,c^{\prime})], (2)

    where the function \mathcal{L}_{ce} evaluates the cross entropy between c and the prediction c^{\prime}=Q(G(z,c)), and Q learns to minimize this cross entropy. Cross entropy is used here because c is a one-hot vector; other difference measures are also possible.

  • Noise predictor E
    It predicts the continuous noise z from the bag-of-words x of human-written texts. E is trained only with the auto-encoder objective.
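
As a concrete reference, the following is a minimal PyTorch sketch of the four components, roughly following the dimensions reported in Appendix A (3000-word vocabulary, 1000-unit generator layers, 500-unit discriminator and noise predictor). The class names, the ReLU activations, and the simple linear form of Q shown here are assumptions for illustration, not the exact implementation; the full Q with pretrained embeddings appears in Section 3.6.

import torch
import torch.nn as nn

VOCAB, N_TOPICS, NOISE_DIM = 3000, 4, 200   # e.g. agnews: 3000-word vocabulary, 4 categories

class Generator(nn.Module):
    """G: (topic code c, noise z) -> vocabulary-sized bag-of-words vector in (0, 1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_TOPICS + NOISE_DIM, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
            nn.Linear(1000, VOCAB), nn.Sigmoid(),
        )
    def forward(self, c, z):
        return self.net(torch.cat([c, z], dim=-1))

class Discriminator(nn.Module):
    """D: bag-of-words vector -> realness score (no output activation, for WGAN-gp)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(VOCAB, 500), nn.ReLU(),
                                 nn.Linear(500, 500), nn.ReLU(),
                                 nn.Linear(500, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

class TopicClassifier(nn.Module):
    """Q (simplest 'linear Q' variant): bag-of-words -> topic logits.
    The full model instead feeds SIF-weighted pretrained embeddings (Section 3.6)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(VOCAB, N_TOPICS)
    def forward(self, x):
        return self.net(x)

class NoisePredictor(nn.Module):
    """E: bag-of-words -> continuous noise z (one hidden layer of 500, as in Appendix A)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(VOCAB, 500), nn.ReLU(), nn.Linear(500, NOISE_DIM))
    def forward(self, x):
        return self.net(x)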

3.2 Training

We apply (3) to train our generator G, discriminator D, and topic classifier Q.

\min_{G,Q}\max_{D}-\mathcal{L}_{D}+\lambda\mathcal{L}_{Q}, (3)

where \lambda is a hyper-parameter. As we find that the generator generates unreasonable bag-of-words to minimize \mathcal{L}_{Q} when \lambda is too large, we set \lambda to 0.1. The discriminator D learns to minimize \mathcal{L}_{D}, Q learns to minimize \mathcal{L}_{Q}, and G learns to maximize \mathcal{L}_{D} while minimizing \mathcal{L}_{Q}. However, it is difficult to apply InfoGAN to textual data directly, so we propose several extensions and tips in the following sections.
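
The following is a rough sketch of how the losses in (1)–(3) can be evaluated for one batch, using the log-loss form of (1); Section 3.5 replaces this with the WGAN-gp loss, and Section 3.3 replaces \mathcal{L}_{Q} with its clipped variant. The helper names and the 1e-8 stabilizers are assumptions for illustration.

import torch
import torch.nn.functional as F

def sample_codes(batch_size, n_topics=4, noise_dim=200):
    """Draw c ~ P_c (uniform one-hot) and z ~ P_z (standard normal)."""
    idx = torch.randint(0, n_topics, (batch_size,))
    return F.one_hot(idx, n_topics).float(), torch.randn(batch_size, noise_dim), idx

def infogan_step(G, D, Q, x_real, lam=0.1):
    """Evaluate the losses of objective (3), using the log-loss form of Eq. (1)."""
    c, z, idx = sample_codes(x_real.size(0))
    x_fake = G(c, z)
    # Eq. (1): D assigns high scores to real bag-of-words and low scores to G(z, c).
    loss_d = -torch.log(torch.sigmoid(D(x_real)) + 1e-8).mean() \
             - torch.log(1 - torch.sigmoid(D(x_fake)) + 1e-8).mean()
    # Eq. (2): cross entropy between the sampled topic c and Q's prediction on G(z, c).
    loss_q = F.cross_entropy(Q(x_fake), idx)
    # G maximizes L_D (fools D) while minimizing lam * L_Q; Q minimizes L_Q.
    loss_g = torch.log(1 - torch.sigmoid(D(x_fake)) + 1e-8).mean() + lam * loss_q
    return loss_d, loss_q, loss_g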

3.3 Categorical Loss Clipping

During training, there is a severe mode collapse issue within the same topic: given the same topic code c, the generator G ignores the continuous noise z and always outputs the same bag-of-words. The reason is that the optimal solution for the generator to maximize the mutual information (or minimize (2)) between the discrete c and the generated discrete bag-of-words is to always predict the same bag-of-words given the same c. To tackle this issue, we clip the categorical loss in (2) to a lower bound \alpha as below.

\mathcal{L}_{Q}^{\prime}=\operatorname{\mathbb{E}}_{z\sim P_{z},c\sim P_{c},c^{\prime}=Q(G(z,c))}[\max(\mathcal{L}_{ce}(c,c^{\prime}),\alpha)]. (4)

With (4), the model stops making the categorical loss smaller once it is small enough, which is controlled by the hyper-parameter \alpha. As setting \alpha too large hinders Q from predicting the correct topic code c, we set it slightly greater than zero, to 0.15. It is worth mentioning that applying batch normalization Ioffe and Szegedy (2015) to G also greatly alleviates the mode collapse problem.
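
In code, the clipping in (4) is a one-line change to the categorical loss; a possible rendering (per-example clamping is our reading of the expectation in (4)):

import torch
import torch.nn.functional as F

def clipped_categorical_loss(Q, x_fake, idx, alpha=0.15):
    """Eq. (4): the per-example cross entropy is floored at alpha, so once Q predicts
    the topic well enough no further gradient pushes G toward collapsing within a topic."""
    ce = F.cross_entropy(Q(x_fake), idx, reduction='none')
    return torch.clamp(ce, min=alpha).mean()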

3.4 Combining InfoGAN with auto-encoder

As mentioned in prior works Larsen et al. (2016); Huang et al. (2018), it is useful to train a GAN and an auto-encoder alternately. Therefore, we include auto-encoder training in the optimization procedure, as shown in the lower part of Figure 1 (A). This is another way to prevent the mode collapse mentioned in the previous subsection.

We use the bag-of-words x from real text to train an auto-encoder. In auto-encoder training, the noise predictor E and the topic classifier Q are jointly regarded as an encoder, which encodes the bag-of-words into a continuous code E(x) and a discrete code Q(x), respectively. The generator G serves as a decoder, which reconstructs the original input x from the continuous code E(x) and the discrete code Q(x). Therefore, G, Q and E jointly learn to minimize the reconstruction loss,

\min_{G,Q,E}\operatorname{\mathbb{E}}_{x\sim P_{data},x^{\prime}=G(Q(x),E(x))}[\mathcal{L}_{bce}(x,x^{\prime})], (5)

where the binary cross entropy \mathcal{L}_{bce} is used as the reconstruction loss function. We train (3) and (5) alternately.
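
A rough sketch of the alternating schedule, combining the InfoGAN update of (3) with the reconstruction objective (5); infogan_step refers to the sketch in Section 3.2, the optimizer settings follow Appendix A, and the exact update order and the softmax over Q's output in the auto-encoder path are assumptions.

import torch
import torch.nn.functional as F

def autoencoder_loss(G, Q, E, x_real):
    """Eq. (5): reconstruct the real bag-of-words x from the codes (Q(x), E(x))."""
    c_hat = torch.softmax(Q(x_real), dim=-1)   # Q's output is not forced to be one-hot here
    x_rec = G(c_hat, E(x_real))
    return F.binary_cross_entropy(x_rec, x_real)

def train(G, D, Q, E, loader, infogan_step, epochs=50, lr=5e-4):
    """Alternate an InfoGAN update (objective (3)) with an auto-encoder update (Eq. (5))."""
    betas = (0.5, 0.999)                        # Adam settings from Appendix A
    opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=betas)
    opt_gq = torch.optim.Adam(list(G.parameters()) + list(Q.parameters()), lr=lr, betas=betas)
    opt_ae = torch.optim.Adam(list(G.parameters()) + list(Q.parameters())
                              + list(E.parameters()), lr=lr, betas=betas)
    for _ in range(epochs):
        for x_real in loader:
            loss_d, _, _ = infogan_step(G, D, Q, x_real)        # update D
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            _, _, loss_g = infogan_step(G, D, Q, x_real)        # fresh pass: update G and Q
            opt_gq.zero_grad(); loss_g.backward(); opt_gq.step()
            loss_ae = autoencoder_loss(G, Q, E, x_real)         # update G, Q and E on Eq. (5)
            opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()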

3.5 Using WGAN-gp

As mentioned in Arjovsky et al. (2017), it is challenging to generate discrete data with GANs. Because bag-of-words vectors from real texts are also discrete, training fails when the original loss function for D in (1) is used. Hence, we apply the WGAN loss Arjovsky et al. (2017) to train G and D and rewrite (1) as

\mathcal{L}_{D}^{\prime}=-\operatorname{\mathbb{E}}_{x\sim P_{data}}[D(x)]+\operatorname{\mathbb{E}}_{z\sim P_{z},c\sim P_{c}}[D(G(z,c))]. (6)

Here, to constrain D to be a Lipschitz function, we apply the gradient penalty Gulrajani et al. (2017) to D.
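
A minimal sketch of the resulting discriminator objective, i.e., (6) plus the gradient penalty of Gulrajani et al. (2017); the penalty weight of 10 is the default from that paper, not a value reported here.

import torch

def wgan_d_loss(D, x_real, x_fake, gp_weight=10.0):
    """Eq. (6) plus a gradient penalty keeping D approximately 1-Lipschitz.
    x_fake should be G(z, c).detach(); gp_weight=10 follows Gulrajani et al. (2017)."""
    loss = -D(x_real).mean() + D(x_fake).mean()
    # Gradient penalty on random interpolations between real and generated bag-of-words.
    eps = torch.rand(x_real.size(0), 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    gp = ((grads.norm(2, dim=1) - 1) ** 2).mean()
    return loss + gp_weight * gp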

3.6 Using pretrained word embeddings

Prior works Nguyen et al. (2015) suggested that using pretrained word embeddings such as word2vec Mikolov et al. (2013) helps models discover more coherent topics. To encourage the model to learn more consistent topics, we use pretrained word embeddings in Q to regularize it toward more coherent topics. When Q takes a bag-of-words as input, it first computes a document vector as the weighted average of the pretrained word embeddings of the bag-of-words, and then feeds the document vector to a linear network to predict the discrete topic code c. We use smooth inverse frequency (SIF) Arora et al. (2017) to compute the weighted average of the word embeddings: the weight of word w is a/(a+p(w)), where a is a learnable non-negative parameter and p(w) is the frequency of word w. The word embeddings are pretrained by fastText Joulin et al. (2016).
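
A sketch of this SIF-weighted topic classifier; parameterizing the non-negative a through a softplus and keeping the pretrained embeddings frozen are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SIFTopicClassifier(nn.Module):
    """Topic classifier Q: SIF-weighted average of pretrained word embeddings -> linear layer.
    Whether the pretrained embeddings are fine-tuned is not specified; they are frozen here."""
    def __init__(self, pretrained_emb, word_freq, n_topics):
        super().__init__()
        # pretrained_emb: (V, 100) fastText vectors; word_freq: (V,) unigram frequencies p(w).
        self.register_buffer('emb', torch.as_tensor(pretrained_emb, dtype=torch.float))
        self.register_buffer('freq', torch.as_tensor(word_freq, dtype=torch.float))
        self.raw_a = nn.Parameter(torch.zeros(1))   # a = softplus(raw_a) keeps a non-negative
        self.linear = nn.Linear(self.emb.size(1), n_topics)

    def forward(self, bow):                          # bow: (B, V) bag-of-words vectors
        a = F.softplus(self.raw_a)
        w = bow * (a / (a + self.freq))              # SIF weight a / (a + p(w)) for present words
        doc = (w @ self.emb) / w.sum(dim=1, keepdim=True).clamp_min(1e-8)  # weighted average
        return self.linear(doc)                      # topic logits; softmax gives the distribution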

4 Text Generation from Discovered Topics

To interpret what the model learns, we generate text conditioned on a discovered discrete topic c, as shown in Figure 1 (B). Given a discrete topic c, we sample many different noises z to generate different bag-of-words G(c,z). After obtaining bag-of-words from G, we use a text generator G^{\prime} to generate sequential text from the bag-of-words. Because each sequential text can be transformed into its corresponding bag-of-words, we can obtain numerous (bag-of-words, text) pairs to train the text generator G^{\prime}. However, compared to human-written text, the generated G(c,z) may be noisy and unreasonable. To make G^{\prime} robust to G(c,z), during training we add noise such as random word removal or shuffling to the input bag-of-words. The network architecture of G^{\prime} is an LSTM whose hidden state is initialized by the output of a feedforward neural network that takes the bag-of-words as input.
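
A minimal sketch of the bag-of-words noising used when training G^{\prime}; the removal probability is an assumption, and the shuffling variant is not shown.

import torch

def noise_bow(bow, drop_prob=0.1):
    """Randomly remove some of the present words from a bag-of-words vector so that the
    text generator G' becomes robust to the noisy bag-of-words produced by G."""
    keep = (torch.rand_like(bow) >= drop_prob).float()
    return bow * keep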

5 Experiments

5.1 Unsupervised Classification

Methods 20News Yahoo! DBpedia Stackoverflow Agnews News-Cat
word embed avg+k-means 28.63 38.91 69.04 22.44 73.83 37.78
sent2vec+k-means 28.06 51.24 60.98 36.78 83.82 40.64
LDA + k-means 28.97 22.58 61.13 43.57 49.36 21.84
NVDM 24.63 33.21 46.22 26.33 62.36 30.12
ProdLDA 30.02 39.65 68.19 25.13 72.78 38.62
LDA-few topics 29.62 29.44 68.62 35.82 71.07 26.24
KE Haj-Yahia et al. (2019) 37.8 53.9 - - 73.8 -
TIGAN 34.12 52.25 85.37 47.01 84.13 49.32
TIGAN w/o loss clipping 30.12 45.92 83.32 40.99 83.66 47.42
TIGAN w/o auto-encoder 29.14 43.07 78.76 31.60 81.62 45.12
TIGAN w/o word embed 36.89 42.14 71.26 46.14 69.78 47.32
TIGAN w/ linear Q 41.01 42.91 83.73 64.46 72.65 48.14
Table 1: Unsupervised classification accuracy.

Datasets

We evaluate our model on six datasets: (1) 20NewsGroups, (2) Yahoo! answers, (3) DBpedia ontology classification, (4) stackoverflow title classification Xu et al. (2015), (5) agnews, and (6) News-Category-Dataset (https://www.kaggle.com/rmisra/news-category-dataset).

20NewsGroups is a document classification dataset composed of 20 different classes of documents. Yahoo! answers is a question type classification dataset with 10 types of question-answer pairs constructed by Zhang and LeCun (2015). The DBpedia ontology classification dataset, also constructed by Zhang and LeCun (2015), contains 14 ontology classes selected from DBpedia 2014. The stackoverflow title classification dataset is constructed by Xu et al. (2015) with 20 different title categories. Agnews is a news classification dataset with 4 categories constructed by Zhang and LeCun (2015); we combine news titles and descriptions as our training texts. News-Category-Dataset contains around 200k news headlines from 2012 to 2018 obtained from HuffPost; we select the 11 most frequent classes and roughly balance the number of samples in each class.

Experimental Settings

For all datasets, we use the same model architecture and the same optimizer without tuning on each dataset; the details of the model architecture can be found in Appendix A. For each dataset, we choose the 3000 most frequent words as the vocabulary after stop word and punctuation removal and lowercase conversion. When training TIGAN, we sample a one-hot topic code c from a uniform distribution and set the number of latent topics (i.e., the dimension of c) to the number of human-annotated categories in each dataset. For example, as there are four news categories in agnews, we set the topic number to four. The continuous noise z is sampled from a normal distribution, and the dimension of z is set to 200 in all experiments. Our model is not sensitive to the dimension of z; setting it to a smaller value such as 50 or 100 yields similar results.

We use the topic classifier Q to predict the latent topic probability distribution of each sample and assign each sample to the latent topic with the maximum probability. The samples assigned to a latent topic cluster then vote with their human-annotated labels for the label assigned to the whole cluster. After assigning each latent topic to a human-annotated category, we report the classification accuracy as a measure of the quality of the captured latent topics.
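
This evaluation protocol amounts to a majority-vote mapping from latent topics to human-annotated labels; a small sketch (pred_topic is the argmax of Q's output, and the helper name is ours):

import numpy as np

def cluster_accuracy(pred_topic, gold_label, n_topics):
    """Assign each latent topic the majority gold label of its members, then score accuracy."""
    pred_topic, gold_label = np.asarray(pred_topic), np.asarray(gold_label)
    mapping = {}
    for k in range(n_topics):
        members = gold_label[pred_topic == k]
        mapping[k] = np.bincount(members).argmax() if len(members) else -1
    mapped = np.array([mapping[k] for k in pred_topic])
    return float((mapped == gold_label).mean())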

Baselines

We compare our model to several possible methods for unsupervised text classification, including topic models, unsupervised sentence representations, and the keyword enrichment (KE) method.

The most straightforward approach to unsupervised text classification is based on unsupervised sentence representations: we first encode texts into vectors with an unsupervised text representation learning method and then run k-means on the text vectors. In Table 1, “word embed avg+k-means” averages pretrained word embeddings as text vectors; these word embeddings are the same as those used in TIGAN. In “sent2vec+k-means”, the vectors are extracted by the state-of-the-art unsupervised sentence representation method of Pagliardini et al. (2018) with the latent dimension set to 300. In “LDA + k-means”, we set the topic number of LDA Blei et al. (2003) to 50 and then run k-means clustering on the learned features.
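
For reference, the “word embed avg+k-means” baseline reduces to clustering averaged embedding vectors; a sketch with scikit-learn (the k-means settings are assumptions), whose cluster assignments are scored with the same majority-vote accuracy as above:

import numpy as np
from sklearn.cluster import KMeans

def word_avg_kmeans(doc_vectors, n_classes, seed=0):
    """'word embed avg + k-means': cluster averaged word-embedding document vectors.
    doc_vectors: (n_docs, d) array of averaged pretrained word embeddings."""
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=seed)
    return km.fit_predict(doc_vectors)   # cluster ids, scored with cluster_accuracy above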

Other baselines are topic models, which represent a text as a mixture of latent topics. Because these latent topics may coincide with the human-annotated categories, topic models are possible methods for unsupervised text classification. The evaluation setting of topic models is the same as for TIGAN: we set the topic number to the number of human-annotated categories and assign each latent topic to its most probable human-annotated category. For example, in Table 1 “LDA-few topics”, unlike the setting of “LDA + k-means”, the topic number is the same as the class number. Two strong variational auto-encoder based neural topic models, NVDM Miao et al. (2016) and ProdLDA Srivastava and Sutton (2017), are also chosen as baselines.

Keyword enrichment (KE) Haj-Yahia et al. (2019) is a recent weakly-supervised text classification method that initially asks human experts to label several keywords for each category and then gradually enriches the keyword dictionary. Documents similar to the keywords of a category are classified into that category. This method is not strictly comparable to ours because it requires human effort and its performance hinges on good keywords labeled by experts, while our model does not require any keywords from humans. It also requires WordNet Miller et al. (1990) to retrieve synonyms of the keywords.

Methods 20NewsGroups Yahoo! Answers DBpedia Gigaword
LDA 42.68 36.34 51.06 35.61
NVDM 42.36 47.02 49.95 37.22
ProdLDA 43.44 52.01 55.26 43.17
TIGAN w/ word embed 43.01 55.92 62.16 46.12
TIGAN w/o word embed 45.22 45.64 54.23 41.79
Table 2: Topic coherence scores. Higher is better.
Dataset Topic ID Topical Words
DBpedia 1 pianist, composer, singer, cyrillic, songwriter, romania, poet, painter, jazz, actress, musician, tributary
2 skyscraper, building, courthouse, tower, historic, plaza, brick, floors, twostory, mansion, hotel, buildings, register, tallest, palace
3 midfielder, footballer, goalkeeper, football, championship, league, striker, soccer, defender, goals, matches, cup, hockey, medals
English Gigaword 1 inflation, index, futures, benchmark, prices, currencies, output, outlook, unemployment, lowest, stocks, opec, mortgage
2 sars, environment, pollution, tourism, virus, disease, flights, water, scientific, airports, quality, agricultural, animal, flu, alert
3 polls, votes, elections, democrats, electoral, election, conservative, re-election, democrat, candidates, republicans, liberal, presidential
Table 3: Topical words generated from the discovered latent topic code c. In DBpedia, Topic 1 is about music, Topic 2 is about buildings, and Topic 3 is about sports. In English Gigaword, Topic 1 is about business, Topic 2 is about disease, and Topic 3 is about politics.

Results and Discussion.

Compared to methods that cluster continuous text vectors, such as “word embed avg+k-means” and “sent2vec+k-means” in Table 1, TIGAN improves the performance. The reason is that TIGAN maximizes the mutual information between the generated data and a discrete one-hot distribution, which encourages the classifier Q to learn more salient features for separating texts into different classes. In “LDA+k-means”, the result is even worse than “LDA-few topics”, which means it is not feasible to recover the human-annotated classes by directly clustering the features learned by LDA.

As shown in Table 1, TIGAN outperforms the two variational neural topic models and LDA on unsupervised classification. Both variational neural topic models and statistical topic models assume that each document is produced from a mixture of topics; to model the variance of documents, they have to split a topic such as “sport” into many subtopics such as baseball or basketball. Because each document in our unsupervised text classification setting should belong to a single category, this assumption harms the performance. In contrast, because TIGAN controls the variance within a topic with a noise vector, it can directly assume a document is generated from a single main topic, which allows it to discover topics identical to the human-annotated categories. This performance gap is understandable because topic models are not originally designed for the unsupervised text classification setting.

The performance of TIGAN is on par with keyword enrichment (KE) in Table 1, even without keywords annotated by human experts (we use the weighted recall reported in their paper because our accuracy metric is equivalent to weighted recall). This result suggests that our model is able to automatically categorize texts without any help from humans.

5.2 Ablation Study

Training

We conduct an ablation study to show that all mechanisms described in Section 3 are useful. As shown in Table 1, categorical loss clipping improves the performance; during training, we find that clipping this term makes training more stable because it diverges easily otherwise. Combining InfoGAN with the auto-encoder also improves the performance. A possible reason is that it encourages the generator to generate realistic data, which alleviates the difficulty of discrete data generation. Although we do not restrict the topic code c to be one-hot in auto-encoder training, InfoGAN training pushes the codes that Q assigns to texts toward one-hot vectors.

Model architecture of Topic Classifier

In Table 1, we find that the model architecture of the topic classifier Q greatly influences the unsupervised classification performance. Both “w/o word embed” and “w/ linear Q” are settings without pretrained word embeddings. Using pretrained embeddings helps the topic classifier discover more interpretable topics on most, but not all, datasets; a similar phenomenon can be found in Table 2. As we pretrain the word embeddings on each dataset separately, for datasets that do not contain enough training samples, such as stackoverflow or 20NewsGroups, it is difficult to learn representative word embeddings, so the performance on those datasets is not improved by pretrained word embeddings. Additionally, we find that using pretrained word embeddings makes Q insensitive to the random weight initialization, so Q discovers more consistent topics across training runs.

In “w/o word embed”, the model architecture of Q is the same as in TIGAN (i.e., a neural network with randomly initialized word embeddings), while in “w/ linear Q”, the architecture of Q is simply a linear network without any word embeddings. Comparing “w/o word embed” with “w/ linear Q”, the simpler architecture yields better results: with a more complex Q, the model arbitrarily maps complex but meaningless distributions to one-hot codes, which makes the learned topics less interpretable.

Dataset Topic ID Generated Text
DBpedia 1 james nelson is an american musician singer songwriter and actor he was a line of his work with musical career as metal albums in 1989
2 the UNK is a skyscraper located in downtown washington dc district it was completed june 2009 and tallest building currently
3 UNK born 3 january 1977 is a finnish football player who plays for west coast club dynamo he has won bronze medals at 2007
English Gigaword 1 oil prices fell in asia friday as traders fear of the world demand for biggest drop summer months figures
2 the world health organisation has placed on bird flu in nigeria ’s most populous countries virus , saying they are expected to appear
3 the ruling party won overwhelming majority of seats in opposition ’s battle for sweeping elections
Table 4: Texts generated from topics with topic IDs corresponding to Table 3. The texts in the same dataset are generated from different c but the same continuous noise z.
Generated Text
UNK is a historic public square located in south bronx built in 1987
UNK is a french southwest hotel originally located in the of greece south carolina it was conceived by josef
UNK in historic district is a methodist church located in south african cape was added to national register of places 200
Table 5: The generated texts of DBpedia Topic ID 2 in Table 3 with different noise vectors z.

5.3 Topic Coherence

In this section, we use the method in Appendix B to extract topical words to represent latent topics and evaluate the quality of the topical words. As topic models are also good at extracting topical words to represent topics, they are chosen as our baselines.

Experimental setup

In this section, the topic number of all models is the same as the number of human-annotated classes. To show that our model works not only on classification datasets, we also evaluate it on an unannotated dataset, English Gigaword Rush et al. (2015), a news summarization dataset. The topic number on English Gigaword is set to 10; a discussion of how to choose the topic code distribution is given in Appendix D.

Quantitative analysis.

We use the C_{v} metric Roder et al. (2015) to evaluate the coherence of the extracted topical words Chang et al. (2009) (we use the CoherenceModel in the gensim module to compute coherence scores). The coherence scores are computed using English Wikipedia, with 5.6 million articles, as an external corpus. The higher coherence scores in Table 2 show that TIGAN extracts reasonable and coherent words to represent a topic.
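
A sketch of how such C_{v} scores can be computed with gensim's CoherenceModel, as mentioned above; building the dictionary directly from the tokenized reference corpus is an assumption.

from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

def cv_coherence(topics, reference_texts):
    """topics: list of topical-word lists (e.g. from Appendix B);
    reference_texts: tokenized external corpus (the paper uses English Wikipedia)."""
    dictionary = Dictionary(reference_texts)
    cm = CoherenceModel(topics=topics, texts=reference_texts,
                        dictionary=dictionary, coherence='c_v')
    return cm.get_coherence()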

Qualitative analysis.

To further analyze the quality of the discovered topics, the topical words of the latent topics are listed in Table 3. Additionally, we use the method described in Section 4 to generate corresponding texts for the discovered topics in Table 4. The captured latent topics are clearly semantically different based on the generated topical words, and the generated texts are topically related to the associated topics. More extracted topical words and generated texts are provided in Appendix C.

5.4 Disentangled Representations

The essential assumption of our work is that c and z learn disentangled representations, where c encodes the main topic information and z controls the variance within the topics. To validate this assumption, we generate texts from the same continuous noise z but different topic codes c in Table 4. There is almost no overlap of words between the texts generated from the same continuous noise z but different c, which means z encodes information disentangled from the topic code c.

We also analyze whether z can control the variance within a topic by modeling subtopic information. In Table 5 and Table 6 in the Appendix, we generate texts from the same topic code c but different noises z. In Table 5, the topic of c is clearly about buildings, yet quite different sentences are generated with different z. The diversity of the generated texts within the same topic shows that z is capable of modeling the variance within a topic. Moreover, generating numerous texts from the same topic gives us a further way to interpret the discovered topics. Together, these two experiments demonstrate that our model learns disentangled representations.

6 Conclusion

This paper proposes a novel unsupervised framework, TIGAN, which learns discrete text representations to interpret textual data. By learning a discrete topic code disentangled from a continuous vector controlling subtopic information, TIGAN shows superior performance on unsupervised text classification, which gives humans a good way to understand textual data. The extracted topical words and the texts generated from the discovered topics further showcase the effectiveness of our model in learning explainable and coherent topics.

References

  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein gan. arXiv preprint arXiv:1701.07875.
  • Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research.
  • Chang et al. (2009) Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems 22.
  • Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028.
  • Haj-Yahia et al. (2019) Zied Haj-Yahia, Adrien Sieg, and Léa A. Deleris. 2019. Towards unsupervised text classification leveraging experts and word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  • Hu et al. (2017) Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. 2017. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning.
  • Huang et al. (2018) Huaibo Huang, zhihang li, Ran He, Zhenan Sun, and Tieniu Tan. 2018. Introvae: Introspective variational autoencoders for photographic image synthesis. In Advances in Neural Information Processing Systems 31.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
  • Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28.
  • Larsen et al. (2016) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. 2016. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of The 33rd International Conference on Machine Learning.
  • Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.
  • Miao et al. (2017) Yishu Miao, Edward Grefenstette, and Phil Blunsom. 2017. Discovering discrete latent topics with neural variational inference. In Proceedings of the 34th International Conference on Machine Learning.
  • Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of the 34th International Conference on Machine Learning.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26.
  • Miller et al. (1990) George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. 1990. Wordnet: An on-line lexical database. International Journal of Lexicography.
  • Nguyen et al. (2015) Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics.
  • van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems 30.
  • Pagliardini et al. (2018) Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
  • Roder et al. (2015) Michael Roder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the space of topic coherence measures. WSDM.
  • Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
  • Srivastava and Sutton (2017) Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. In International Conference on Learning Representations.
  • Xu et al. (2015) Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. 2015. Short text clustering via convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.
  • Zhang and LeCun (2015) Xiang Zhang and Yann LeCun. 2015. Text understanding from scratch. arXiv preprint arXiv:1502.01710.
  • Zhao et al. (2018) Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. 2018. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Appendix A Model Architecture and Hyper-parameters

In all datasets, we use the same model architecture and the same optimizer without tuning the model on each dataset. The generator G is a fully-connected neural network with 3 hidden layers, each with a hidden dimension of 1000; setting the hidden dimension to 500 or 1500 only slightly influences the performance. We apply batch normalization Ioffe and Szegedy (2015) at each layer. The discriminator D is a fully-connected neural network with 2 hidden layers, each with a hidden dimension of 500. As the discriminator has to be well trained to guide the gradient of the generator, we make the discriminator easier to train by giving it a simpler architecture than the generator. If G and D are too deep, they become difficult to train; on the other hand, if they are too shallow, the performance of TIGAN drops.

The topic classifier Q consists of a word embedding matrix and a linear network that takes the weighted average of the word embeddings as input and predicts the latent topics. We set the dimension of the pretrained word embeddings to 100. As we find that the architecture of the topic classifier Q greatly influences the performance of our model, it is discussed in Section 5.2. The noise predictor E is a fully-connected neural network with one hidden layer of dimension 500. G, D, E and Q are all optimized by the Adam optimizer with learning rate 0.0005, \beta_{1}=0.5 and \beta_{2}=0.999.

Appendix B Retrieving Topical Words from TIGAN

The topical words of our model can be retrieved from the topic classifier Q as follows. Q consists of a word embedding matrix and a linear layer. Let W_{V\times N} be the word embedding matrix, where each row is an N-dimensional word vector and V is the vocabulary size. Let M_{K\times N} be the weight matrix of the linear layer that predicts the topics from the sentence embedding (the weighted average of the word embeddings), where K is the topic number. Let C_{K\times V}=M\cdot W^{T}, where the value of C_{k,v} represents the importance of the v-th word to the k-th topic. The top few words with the highest values within row k are selected as the topical words of the k-th topic.
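
A small sketch of this procedure (the function name and top_k value are ours; top_k is set to 20, roughly the number of words listed per topic in Table 7):

import numpy as np

def topical_words(W, M, id2word, top_k=20):
    """Appendix B: C = M · W^T gives a (K, V) word-topic importance matrix; the top-k
    entries of row k are the topical words of the k-th topic.
    W: (V, N) word embedding matrix of Q; M: (K, N) weights of Q's linear layer."""
    C = M @ W.T
    return [[id2word[v] for v in np.argsort(-C[k])[:top_k]] for k in range(C.shape[0])]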

Appendix C Qualitative Analysis of Topical Words

We provide the learned topical words of all latent topics of Yahoo! Answers from our model in Table 7, from ProdLDA in Table 8, from NVDM in Table 9, and from LDA in Table 10. As shown in the tables, the topical words in each latent topic of TIGAN are closely semantically related to each other. For example, all the words in the first latent topic are obviously related to science, the words in the second topic are related to the operation of computers, and the words in the third topic are related to politics.

Most of the topical words from ProdLDA are related to each other, with some exceptions; for example, some words in the second-to-last latent topic, such as “cousin”, “do I”, or “bc”, are not reasonable. In NVDM, we find more unreasonable latent topics: in the last latent topic, “diet” and “movie” should not be clustered into the same topic, and in the fourth-to-last latent topic, “immigrants” and “cup” are not related to each other. The quality of the extracted topical words from LDA is worse than that of the above three neural models; most of the topical words in the same topic are not semantically related to each other.

Appendix D Discussion of Topic Code Distribution

For labeled datasets, whether single-label or multi-label, we suggest sampling the topic code c according to the distribution of the human-labeled categories. For unlabeled datasets, we suggest setting the topic number of c to 10 to 20 and tuning downward from the larger number. With a greater topic number, TIGAN is still able to learn explainable topics, as this setting merely splits a single topic into several subtopics; in contrast, setting the topic number too small forces several unrelated topics to combine into a single topic and thus harms the performance. It is worth mentioning that sampling one-hot c from a uniform distribution works for most datasets, because it encourages the topic classifier to learn salient and highly interpretable features in order to maximize the mutual information between the discrete one-hot c and the generated discrete bag-of-words.

Generated Text
british airports leaders plan to ban on one of thousands the city health virus officials said friday
country health officials from world organisation has warned turkey ’s swine flu in britain pakistan saying it was unsafe bringing the bird
the talk of a new disease on tiny team of has formed in tanzania which based on ebola virus
sars epidemic has been recently detected in the hong kong and disease in france public health organisation is for
health experts said it was in southwest china to contain the outbreak of disease
national organizations have been established in beijing to sars for safety alert and common disease resources
the swine flu virus has been muted by health organisation response to tackle disease
thailand ’s sichuan province has given to foreign investors in airlines travel disease
indonesia has developed a discovery of health system wild birds at home from the flu
the impact of poor health organisation has been UNK swine flu season in for population
the governments gathered in thailand ’s health officials of world ministry earlier on friday
indonesia has been hospitalized with pneumonia the health organisation said
doctors from romania and countries are to explore ways of recent disease
the death of # percent bird flu in latest report region asia , according to a study released by u.s. health organization
agricultural experts have set to double the european commission meet with a new study on and human bird flu in animal was unveiled
Table 6: The generated texts of English Gigaword Topic ID 2 in Table 3 with different noise vectors. It is clear that all the generated sentences are about disease.
Topical Words
measure, equations, cm, density, equation, units, formula, solar, atoms, inches, cycle, calculate, volume, radius, physics, triangle, molecules, height, equal, gravity
deleted, files, folder, users, delete, messenger, spyware, scan, installed, settings, upload, downloaded, ip, disk, router, screen, firewall, software, message, xp
terrorists, democrats, troops, republicans, terrorism, terrorist, leaders, senate, iraq, politicians, liberals, democracy, saddam, congress, democratic, elections, elected, political, majority, war
biology, engineering, courses, resources, management, materials, algebra, technology, design, accounting, analysis, studies, communication, learning, education, programming, skills, teaching, study, marketing
nervous, abuse, depressed, drunk, crying, depression, orgasm, jealous, sexually, jail, hang, drinking, emotional, chat, busy, sleeping, herself, hanging, custody, diagnosed
sore, infection, skin, stomach, urine, milk, vitamin, pills, symptoms, muscle, therapy, breast, severe, treatment, muscles, fluid, bleeding, acne, pain, bone
spirit, holy, heaven, rock, satan, worship, ghost, evolution, spiritual, bible, devil, jesus, bands, mary, lyrics, adam, gods, soul, singer, christian
chat, pics, porn, orgasm, females, nasty, bored, penis, sensitive, avatar, attracted, emotional, ugly, naked, sexy, addicted, vagina, lesbian, jokes, dirty
champions, playoffs, championship, league, teams, cup, nba, team, finals, nfl, hockey, fifa, football, soccer, kobe, wrestling, basketball, brazil, matches, cricket
payments, estate, mortgage, property, loan, insurance, payment, fees, loans, funds, debt, companies, investment, agency, financial, employees, filed, employment, owner, taxes
Table 7: Topical words generated from TIGAN in Yahoo! Answers.
Topical Words
teaching, learning, subject, teach, learn, skills, studying, topic, study, project, prepare, courses, students, schools, english, guide, essay, language, spelling, education
tooth, teeth, knee, dentist, exercise, loose, dr, healthy, coffee, gym, counter, workout, stomach, pill, medication, severe, acne, bleeding, eating, diet
fees, employees, employee, employer, cash, agent, earn, selling, homes, banks, hire, mortgage, payment, invest, offered, funds, budget, estate, bills, agency
divide, motion, factor, formula, direction, scale, compound, length, equations, circle, zero, frequency, steel, wave, triangle, elements, electricity, velocity, solid, table
detroit, olympics, nfl, league, winning, wins, teams, 2002, playoffs, match, championship, johnson, coach, bowl, dallas, team, goal, soccer, baseball, finals
liberals, congress, conservative, voting, 911, saddam, liberal, soldiers, majority, elected, bush, politicians, vote, elections, violence, terrorism, freedom, vietnam, troops, democracy
band, bob, bands, lyrics, singing, artist, singer, scene, sang, 80s, movie, episode, sings, sing, pink, rap, dancing, movies, song, songs
bible, believe, gods, christianity, christ, worship, faith, spiritual, jesus, spirit, belief, christian, god, heaven, jewish, christians, eve, mary, lord, belive
dated, talked, weve, eachother, cousin, friendship, ex, gf, break, cheating, cheated, hang, dating, together, broke, boyfriends, bc, hanging, doi, talk
opened, icon, automatically, task, edition, comp, blocked, sharing, instant, deleted, press, blank, dell, keyboard, connection, bar, beta, update, mac, dsl
Table 8: Topical words generated from ProdLDA in Yahoo! Answers.
Topical Words
earn, points, money, countries, questions, energy, learn, invest, score, market, nuclear, iran, dollars, business, technology, government, study, stock, economy, world
school, schools, high, university, college, colleges, students, basketball, sport, la, de, girls, grade, courses, graduate, teacher, football, les, teachers, boys
english, lyrics, song, search, google, spanish, translate, site, language, christian, sites, myspace, sings, bible, love, translation, websites, songs, web, books
god, jesus, christians, christian, religion, truth, bush, religious, beliefs, faith, church, christ, believe, evil, heaven, feelings, bible, kill, gods, illegal
download, computer, software, email, myspace, laptop, install, phone, pc, card, cd, program, online, password, files, video, address, send, connect, downloaded
relationship, sex, math, grade, age, degree, feelings, shy, boyfriend, teacher, homework, advice, healthy, pregnant, study, dating, bf, classes, yrs, weight
win, hate, watch, immigrants, idol, americans, mexicans, hes, gonna, watching, iraq, games, tired, mexico, jobs, thinks, illegals, cup, republicans, democrats
foods, food, health, diet, products, skin, protein, fat, muscle, eat, eating, exercise, treatment, cure, disease, healthy, muscles, sugar, bacteria, weight
county, tax, state, federal, income, insurance, loan, taxes, visa, citizen, pay, financial, states, department, lawyer, child, california, legal, loans, court
christmas, season, diet, dvd, burn, buy, songs, movie, night, water, song, day, band, summer, drink, gift, eat, ebay, credit, price
Table 9: Topical words generated from NVDM in Yahoo! Answers.
Topical Words
looking, friend, girl, yahoo, song, new, look, good, trying, know, im, information, music, online, business, need, write, check, send, change
like, know, want, really, think, people, help, say, question, tell, answer, need, good, make, guy, feel, things, thing, bad, ask
school, job, work, home, state, high, good, college, number, age, small, book, run, law, area, want, best, parents, doctor, vs
time, best, long, person, years, money, man, life, sex, guys, like, women, email, way, great, little, lot, men, away, good
day, real, read, bush, anybody, eat, explain, 12, power, pain, mind, food, new, red, period, rid, weeks, air, hours, ex
water, times, end, english, card, point, states, difference, credit, page, gets, comes, different, 10, blood, light, time, used, order, green
need, help, free, site, computer, want, website, know, heard, try, weight, using, web, best, whats, download, windows, like, company, search
people, god, called, believe, country, think, pay, days, ago, means, war, child, come, world, probably, president, info, wondering, usa, test
use, old, better, internet, told, buy, type, stop, used, black, white, ones, turn, easy, language, date, science, fast, laptop, skin
love, going, think, got, world, year, talk, yes, team, body, win, hes, come, american, movie, game, play, cup, ive, boyfriend
Table 10: Topical words generated from LDA in Yahoo! Answers.