
Learning Robust Representation for Clustering through Locality Preserving Variational Discriminative Network

Ruixuan Luo, Wei Li, Zhiyuan Zhang, Ruihan Bao, Keiko Harimoto, Xu Sun
Abstract

Clustering is one of the fundamental problems in unsupervised learning. Recent deep learning based methods focus on learning clustering-oriented representations. Among these methods, Variational Deep Embedding (VaDE) achieves great success in various clustering tasks by imposing a Gaussian Mixture prior on the latent space. However, VaDE suffers from two problems: 1) it is fragile to input noise; 2) it ignores the locality information between neighboring data points. In this paper, we propose a joint learning framework that improves VaDE with a robust embedding discriminator and a local structure constraint, which are both helpful for improving the robustness of our model. Experimental results on various vision and textual datasets demonstrate that our method outperforms the state-of-the-art baseline models on all metrics. Further detailed analysis shows that our proposed model is very robust to adversarial inputs, which is a desirable property for practical applications.

Introduction

Clustering is an essential problem in unsupervised learning, which aims to group unlabeled data based on their similarities. Over the past decades, several traditional clustering methods such as k-means and the Gaussian Mixture Model (GMM) (Kohonen 1990; Hartigan and Wong 1979; Ester et al. 1996) have been established, which measure the distances between data points with the Euclidean distance and minimize the within-class variance. However, as the dimension of the input feature space grows, clustering with the Euclidean distance deteriorates significantly. To address this problem, various dimension reduction methods (Turk and Pentland 1991; Belhumeur, Hespanha, and Kriegman 1997; Donoho and Grimes 2003) have been proposed to convert data points from the feature space to a lower-dimensional space as a pre-processing step for clustering.

Due to the recent success of deep learning, deep learning based clustering has received a lot of attention. Earlier approaches adopt deterministic models such as Autoencoders to learn a latent representation by minimizing the reconstruction error. Furthermore, constraints are imposed to make the latent representation more suitable for clustering (Xie, Girshick, and Farhadi 2016; Yang et al. 2017). Other attempts build clustering methods on generative models because of their ability to learn the inherent data structure. Mukherjee et al. (2019) design a GAN based model which contains an inverse-mapping network and a clustering-specific loss. Their model achieves high scores on various tasks, but is vulnerable to unbalanced data. Jiang et al. (2017) propose Variational Deep Embedding (VaDE), which extends VAE by adding a GMM constraint to the latent layer. While VaDE achieves state-of-the-art performance in various clustering tasks, it suffers from two major drawbacks: 1) as VaDE trains a network to recover the original inputs, the latent representations are vulnerable to input noise; 2) it only models the global structure of the latent representations, while ignoring the local structure between latent variables, which carries important information.

In this paper, we resolve the problems of VaDE by proposing a Locality Preserving Variational Discriminative Network (LPVDN) that encodes robust global and local data structure in the latent representations used for clustering and is stable under noisy and unbalanced input. Specifically, our work first models the global data structure using a GMM regularized Variational Autoencoder (VAE), because VAE is good at capturing the global structure of the data distribution. Then we improve the robustness of the global structure modeling by introducing an embedding discriminator that maximizes the mutual information between the input data and the latent representations, because maximizing the mutual information improves the robustness to input noise (a mathematical proof is given in the Appendix). After the global representations are available, we encode the local structure information by modeling the pairwise distance probabilities and minimizing the KL-divergence between the distributions. We assume that the global structure is more sensitive to adversarial noise, while the local structure helps reduce the impact of the noise.

Specifically, to capture the local structure, we propose a locality preserving network module inspired by t-SNE (Maaten and Hinton 2008) due to its effectiveness and simplicity. It employs a Student t-distribution with one degree of freedom in the low-dimensional map and uses it to fit the joint probability in the high-dimensional space. The optimization objective attracts similar points while introducing strong but finite repulsion between dissimilar points.

We perform extensive experiments on both vision and textual clustering datasets. Experiments show that our model achieves state-of-the-art results on all metrics. Further analysis verifies the robustness of our model under different degrees of data poisoning and reveals its potential for dimension reduction and visualization.

Figure 1: Illustration of the Locality Preserving Variational Discriminative Network (LPVDN). Our model retains both the global (lower part) and the local (upper part) structure. Latent variables $\mu$ and $\sigma$ are regularized to a mixture of Gaussian distribution.

We conclude our main contributions as follows:

  • We propose a robust locality preserving variational discriminative network that retains both global and local data structure and is robust to noise in the input. (Our code is released at https://github.com/lancopku/LPVDN.)

  • Extensive experiments on both vision and textual benchmark datasets show that our proposed model outperforms state-of-the-art clustering methods. Further analysis verifies the superior robustness of our model over previous state-of-the-art methods.

  • Further analysis shows that our model’s performance remains steady even when the dimension of the latent representations is reduced to two, which means our model can serve as a promising candidate for data visualization.

Related Work

Recently, deep learning-based clustering methods have attracted a lot of attention due to their success in various problems.

One branch of popular methods is based on Autoencoders, in which a latent feature representation is learned by minimizing the reconstruction error. Specifically, Yang et al. (2017) propose DCN, which combines the K-means loss with the reconstruction loss to perform clustering and feature embedding simultaneously. Xie, Girshick, and Farhadi (2016) introduce DEC with a KL-divergence loss between the soft cluster assignments and an auxiliary distribution to minimize the within-cluster distance. In order to achieve better performance, the parameters are initialized with general Autoencoder training, which has become the major practice for other deep clustering networks. Ji et al. (2017) apply the idea of subspace clustering and introduce a self-expressive layer into the Autoencoder; the trained parameters of the self-expressive layer are used to construct an affinity matrix for spectral clustering. Leveraging recent progress in image recognition, Ghasedi Dizaji et al. (2017) adopt a convolutional Autoencoder with a softmax layer stacked on top of the encoder. Yang et al. (2019) extend the idea of convolutional Autoencoders by proposing a joint learning framework of spectral clustering and discriminative embedding regularized by a normal distribution. As convolutional Autoencoders can extract important image features, both methods work extremely well for image clustering tasks, but their performance often drops significantly on non-image clustering tasks.

Driven by the success of VAE (Kingma and Welling 2014) and GAN (Goodfellow et al. 2014), generative models have become favourable choices for clustering due to their ability to capture the data distribution. Among those methods, Variational Deep Embedding (Jiang et al. 2017) extends VAE by assuming the latent variables follow a mixture of Gaussian distribution, where the means and variances of the Gaussian components are trainable. Deep Adversarial Clustering (Harchaoui, Mattei, and Bouveyron 2017) proposes an adversarial Autoencoder, which utilizes an adversarial training procedure to match the aggregated posterior of the latent representation with the prior distribution. CategoricalGAN (Springenberg 2016) generalizes the GAN framework to multiple classes. InfoGAN (Chen et al. 2016) maximizes the mutual information between a subset of the latent variables and the observation, which yields disentangled representations for clustering. ClusterGAN (Mukherjee et al. 2019) samples latent variables from a mixture of one-hot and continuous latent variables, and its inverse network is trained jointly with a clustering-specific loss. However, as it requires sampling one-hot vectors from a uniform distribution over the clusters, its performance declines seriously when the data is highly unbalanced.

Locality Preserving Variational Discriminative Network (LPVDN)

In this section, we describe our proposed Locality Preserving Variational Discriminative Network (LPVDN).

The idea of our method is to jointly learn robust global and local structures in the latent space. The global structure modeling is based on VaDE, which achieves good clustering results with a Variational Autoencoder and a Gaussian Mixture Model (GMM); however, it is vulnerable to noise in the original input space and fails to model the local structure between data points. To tackle these problems, we propose a novel model that not only improves the robustness of the global structure modeling but also encodes the local structure between data points. A sketch of our model is given in Fig. 1.

More specifically, our method can be explained using the following optimization objective:

\min_{\phi,\theta,\psi,\rho}\sum_{i=1}^{N}\big(L_{\text{G}}(x_{i};\phi,\theta)+\alpha_{0}L_{\text{MI}}(x_{i};\phi,\rho)+\alpha_{1}L_{\text{LP}}(x_{i},x_{j};\phi,\psi)\big)    (1)

where $x_i$ and $x_j$ are input data points, $N$ is the total number of data points, and $\phi$, $\theta$, $\psi$, $\rho$ are trainable parameters.

As expressed in Eqn. 1, our model can be decomposed into three parts: $L_{\text{G}}(x_{i};\phi,\theta)$ captures the global structure using a Variational Autoencoder with a GMM prior. $L_{\text{MI}}(x_{i};\phi,\rho)$ applies an embedding discriminator to improve the robustness of the global structure modeling. Finally, $L_{\text{LP}}(x_{i},x_{j};\phi,\psi)$ models the local structure by minimizing the KL-divergence of the pairwise distance distribution. $\alpha_{0}$ and $\alpha_{1}$ are hyper-parameters that balance the different parts. After the training phase, as the final latent representations comprise robust global and local information, we use them as the input for general clustering methods such as K-means or GMM.
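For concreteness, the following is a minimal sketch of how the three terms of Eqn. 1 could be combined for a mini-batch; the helper function and its default weights (taken from our MNIST setting) are illustrative, and the per-term losses are assumed to be computed elsewhere.

```python
import torch

def total_loss(loss_g, loss_mi, loss_lp, alpha0=1.0, alpha1=1e-4):
    """Combine the three terms of Eqn. 1 for one mini-batch.

    loss_g  : negative ELBO of the GMM-regularized VAE (global structure)
    loss_mi : discriminator-based mutual information term
    loss_lp : KL divergence between pairwise distance distributions (local structure)
    """
    return loss_g + alpha0 * loss_mi + alpha1 * loss_lp

# toy usage with scalar per-batch losses
print(total_loss(torch.tensor(1.2), torch.tensor(0.3), torch.tensor(45.0)))
```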

Global Structure Modeling $L_{\text{G}}(x_{i};\phi,\theta)$

To model the global structure, we take advantage of Variational Deep Embedding (VaDE) (Jiang et al. 2017), which extends VAE with GMM to model the distribution of latent representations. Specifically, both VAE and VaDE maximize the log probability of the given data point:

\max_{\phi,\theta}\sum_{i=1}^{N}\log p(x_{i};\phi,\theta)    (2)

As the optimization is non-trivial, an evidence lower bound (ELBO) (Kingma and Welling 2014; Jiang et al. 2017) is proposed and maximized instead. In order to understand the ELBO, we further decompose it as:

L_{\text{ELBO}}(x_{i})=E_{q(z,c|x_{i})}[\log p(x_{i}|z)]-D_{\text{KL}}(q(z,c|x_{i})||p(z,c))    (3)

where $L_{\text{ELBO}}$ is the ELBO loss to be optimized, $x_i$ denotes the input data, $z$ represents the latent variables, and $c$ is the GMM cluster in the latent layer. Eqn. 3 shows that the ELBO loss consists of two terms. The first term corresponds to the reconstruction loss, which encourages the network to better approximate the original data. The second term is the regularization term, which requires the distribution of the latent variables $z$ to be as close as possible to the Mixture-of-Gaussian distribution $p(\bm{z},c)$. Since the ELBO preserves the global distribution of the latent variables, our global structure term is set to the negative ELBO loss (since we minimize the objective), precisely, $L_{\text{G}}(x_{i};\phi,\theta)=-L_{\text{ELBO}}(x_{i})$.

In practice, following the derivation of Jiang et al. (2017), $L_{\text{G}}(x_{i};\phi,\theta)$ is computed as

L_{\text{G}}(x_{i};\phi,\theta)
= -\sum_{d=1}^{D}\big[x_{i}^{d}\log\bm{\mu}_{x_{i}}|_{d}+(1-x_{i}^{d})\log(1-\bm{\mu}_{x_{i}}|_{d})\big]
  +\sum_{c=1}^{K}\gamma_{ic}\sum_{j=1}^{J}\Big(\log\bm{\sigma}_{c}^{2}|_{j}+\frac{\bm{\widetilde{\sigma}}_{i}^{2}|_{j}}{\bm{\sigma}_{c}^{2}|_{j}}+\frac{(\bm{\widetilde{\mu}}_{i}|_{j}-\bm{\mu}_{c}|_{j})^{2}}{\bm{\sigma}_{c}^{2}|_{j}}\Big)
  -\sum_{c=1}^{K}\gamma_{ic}\log\frac{\pi_{ic}}{\gamma_{ic}}-\frac{1}{2}\sum_{j=1}^{J}\big(1+\log\bm{\widetilde{\sigma}}_{i}^{2}|_{j}\big)    (4)

where $D$ is the dimension of the input $x_i$ and of the reconstructed output $\bm{\mu}_{x_i}$, $x_i^d$ is the $d$-th element of $x_i$, and $*|_j$ denotes the $j$-th element of $*$. $\gamma_{ic}$ represents $q(c|z_i)$, the GMM posterior of cluster $c$ given the latent variable $z_i$. Moreover,

[\bm{\widetilde{\mu}}_{i},\log\bm{\widetilde{\sigma}}_{i}^{2}] = g_{g}(x_{i};\phi)    (5)
z_{i} = \bm{\widetilde{\mu}}_{i}+\bm{\widetilde{\sigma}}_{i}\circ\epsilon_{i}    (6)
\bm{\mu}_{x_{i}} = f_{g}(z_{i};\theta)    (7)

where $g_{g}(*;\phi)$ and $f_{g}(*;\theta)$ are two neural networks parameterized by $\phi$ and $\theta$, $z_i$ is drawn from the latent distribution $\mathcal{N}(\bm{\widetilde{\mu}}_{i},\bm{\widetilde{\sigma}}_{i}^{2})$, $\epsilon_i$ is a vector whose elements are drawn from independent standard normal distributions, and $\circ$ denotes element-wise multiplication. $\bm{\mu}_{x_{i}}$ is the reconstruction of $x_i$.
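A minimal PyTorch sketch of Eqns. 5-7 is shown below; the layer widths are abbreviated relative to the full architecture given in the experiment settings, and producing the mean and log-variance as one doubled encoder output is an implementation assumption.

```python
import torch
import torch.nn as nn

class GaussianEncoderDecoder(nn.Module):
    """Sketch of Eqns. 5-7: g_g maps x to (mu_tilde, log sigma_tilde^2),
    z is drawn with the reparameterization trick, and f_g reconstructs mu_x."""

    def __init__(self, input_dim, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 500), nn.ReLU(),
                                     nn.Linear(500, 2 * latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 500), nn.ReLU(),
                                     nn.Linear(500, input_dim), nn.Sigmoid())

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)             # Eqn. 5
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # Eqn. 6
        mu_x = self.decoder(z)                                     # Eqn. 7
        return mu, log_var, z, mu_x

mu, log_var, z, mu_x = GaussianEncoderDecoder(input_dim=784)(torch.rand(16, 784))
```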

Robust Embedding Discriminator $L_{\text{MI}}(x_{i};\phi,\rho)$

In Eqn. 3, it is essential to reconstruct the original data, which is achieved by minimizing the cross entropy loss or mean square error between the raw data and the reconstruction. However, such objectives are too strict and make the model vulnerable to input noise. In order to improve the robustness of our model, we reformulate the reconstruction loss over all input data samples (for simplicity, we omit the variable $c$),

E_{x\sim P(X)}\big[E_{q(z|x)}[\log(p(x|z))]\big]
= \int\!\!\int p(x)q(z|x)\log(p(x|z))\,dz\,dx
= \int\!\!\int p(x)q(z|x)\log\Big(\frac{q(z|x)p(x)}{q(z)}\Big)dz\,dx
= \int\!\!\int p(x)q(z|x)\log\Big(\frac{q(z|x)p(x)}{q(z)p(x)}\Big)dz\,dx+\int\!\!\int p(x)q(z|x)\log(p(x))\,dz\,dx
= D_{\text{KL}}(Q(Z|X)P(X)||Q(Z)P(X))-H(X)
= I(X,Z)-H(X)    (8)

where $X=\{x_{1},\ldots,x_{n}\}$ represents the set of all input samples and $Z$ represents the corresponding latent variables. Since $H(X)$ depends solely on the input data, maximizing the first term in Eqn. 3 is equivalent to maximizing the mutual information $I(X,Z)$.

To estimate the mutual information $I(X,Z)$, Hjelm et al. (2019) introduce a JS-divergence based approximation,

I^{\text{(JSD)}}(X,Z)=D_{\text{JS}}(Q(Z|X)P(X),\,Q(Z)P(X))    (9)

One way to compute the JS-divergence in Eqn. 9 is to use a discriminator $D$ (Nowozin, Cseke, and Tomioka 2016; Hjelm et al. 2019; Yang et al. 2019):

D_{\text{JS}}(P(X),Q(X))=\frac{1}{2}\max_{D}\Big\{E_{x\sim P(X)}[\log(\sigma(D(x)))]+E_{x\sim Q(X)}[\log(1-\sigma(D(x)))]\Big\}+\log 2    (10)

Following Eqn. 9 and Eqn. 10, we utilize a neural network to serve as the robust embedding discriminator,

L_{\text{MI}}(x_{i};\phi,\rho)=-\log[\sigma(D(x_{i},z_{i};\rho))]-E_{(x_{j},z_{i})\sim p(z_{i})p(x_{j})}[\log(1-\sigma(D(x_{j},z_{i};\rho)))]    (11)

where $D(*;\rho)$ is a neural network parameterized by $\rho$, and $z_i$ is the latent variable generated from $x_i$ by Eqn. 5 and Eqn. 6. The discriminator $D(*;\rho)$ outputs a score indicating whether $z$ is generated from $x$. The second term in Eqn. 11 is a negative sampling estimate used to train the discriminator; in practice, we randomly draw samples from the dataset to form the negative pairs with $z_i$ (in the Appendix, we further provide a theoretical proof of the effectiveness of the proposed robust discriminator).
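Below is a minimal sketch of the discriminator and the loss in Eqn. 11; drawing negatives by shuffling the mini-batch is one convenient approximation of random sampling from the dataset and is an implementation assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingDiscriminator(nn.Module):
    """Sketch of D(x, z; rho): scores whether z was generated from x."""
    def __init__(self, input_dim, latent_dim=10, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(input_dim + latent_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def mi_loss(disc, x, z):
    """Eqn. 11: positive pairs (x_i, z_i) versus shuffled negative pairs (x_j, z_i)."""
    pos = disc(x, z)
    neg = disc(x[torch.randperm(x.size(0))], z)   # negative samples drawn from the batch
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) +
            F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
```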

Note that $L_{\text{MI}}(x_{i};\phi,\rho)$ cannot replace the original reconstruction loss in Eqn. 3. This is because Eqn. 9 is valid only when $P(X)$ and $Q(X)$ are close. As we maximize the mutual information $I(X,Z)$ during training, $P(X)$ and $Q(X)$ gradually drift apart and Eqn. 9 no longer holds. To avoid this problem, we combine the reconstruction term and the robust embedding discriminator in our method to achieve better clustering performance (Eqn. 1).

method MNIST Fashion-MNIST
ACC NMI ARI ACC NMI ARI
K-means 0.5850 0.4998 0.3652 0.5542 0.5117 0.3477
AE+K-means 0.7147 0.6610 0.5593 0.6240 0.6360 0.4440
VAE+K-means 0.8583 0.7432 0.7249 0.6049 0.5629 0.4082
DEC (Xie, Girshick, and Farhadi 2016) 0.8971 0.8631 0.8322 0.6303 0.6560 0.4987
ClusterGAN (Mukherjee et al. 2019) 0.9596 0.9051 0.9134 0.6374 0.6423 0.5067
VaDE (Jiang et al. 2017) 0.9446* 0.8761 0.8822 0.6383 0.6434 0.4955
LPVDN (Proposed) 0.9716 0.9276 0.9385 0.6837 0.6912 0.5151
Table 1: Evaluation on vision datasets of MNIST and Fashion-MNIST for different algorithms using ACC, NMI and ARI. * is taken from (Jiang et al. 2017), other results are generated by the released code.

Local Structure Modeling $L_{\text{LP}}(x_{i},x_{j};\phi,\psi)$

While the global structure models the latent space based on a clustering distribution (e.g., GMM), such methods may generate unsatisfactory results for data around the cluster boundaries. To solve this problem, we introduce a local structure model that captures pairwise relationships by minimizing the KL-divergence between distance distributions.

Inspired by Maaten and Hinton (2008), we define the following pairwise distance distribution $p_{ij}$ in the latent space:

p_{j|i}=\frac{\exp(-\|\bm{\widetilde{\mu}}_{i}-\bm{\widetilde{\mu}}_{j}\|^{2}/2\eta_{i}^{2})}{\sum_{k\not=i}\exp(-\|\bm{\widetilde{\mu}}_{k}-\bm{\widetilde{\mu}}_{i}\|^{2}/2\eta_{i}^{2})}    (12)
p_{ij}=\frac{p_{j|i}+p_{i|j}}{2n}    (13)

where $\bm{\widetilde{\mu}}_{i}$ and $\bm{\widetilde{\mu}}_{j}$ are the means of the latent variables obtained from Eqn. 5. Here, we select $\bm{\widetilde{\mu}}$ instead of $z$ because it contains much less noise. The term $\eta_{i}$ is a hyper-parameter adjusted according to the data density. One way to compute $\eta_{i}$ is from a pre-defined perplexity value:

\text{Perplexity}(P_{i})=2^{-\sum_{j}p_{j|i}\log_{2}p_{j|i}}    (14)
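One common way to set $\eta_{i}$ from a target perplexity, following standard t-SNE practice, is a per-point binary search; the sketch below is illustrative, and its tolerance and iteration limit are assumptions.

```python
import numpy as np

def eta_from_perplexity(sq_dists_i, target_perplexity=30.0, tol=1e-5, max_iter=50):
    """Binary search for eta_i so that p_{j|i} (Eqn. 12) reaches the desired
    perplexity (Eqn. 14). sq_dists_i holds ||mu_i - mu_j||^2 for all j != i."""
    target_entropy = np.log2(target_perplexity)
    lo, hi, eta = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        p = np.exp(-sq_dists_i / (2.0 * eta ** 2))
        p /= max(p.sum(), 1e-12)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:        # distribution too flat: shrink eta
            hi = eta
            eta = (lo + eta) / 2.0
        else:                               # too peaked: grow eta
            lo = eta
            eta = eta * 2.0 if np.isinf(hi) else (eta + hi) / 2.0
    return eta
```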

In order to encode the pairwise information, we apply a neural network to minimize the KL-divergence of the distance distribution. To be more precise, we first apply a locality mapping network that converts $\bm{\widetilde{\mu}}_{i}$ to $\bm{\widetilde{o}}^{\prime}_{i}$:

\bm{\widetilde{o}}^{\prime}_{i}=f_{lp}(\bm{\widetilde{\mu}}_{i};\psi)    (15)

where $f_{lp}(*;\psi)$ is a neural network parameterized by $\psi$. Once $\bm{\widetilde{o}}^{\prime}_{i}$ is obtained, the pairwise distribution $q_{ij}$ between $\bm{\widetilde{o}}^{\prime}_{i}$ and $\bm{\widetilde{o}}^{\prime}_{j}$ is defined using a Student t-distribution with one degree of freedom,

q_{ij}=\frac{(1+\|\bm{\widetilde{o}}^{\prime}_{i}-\bm{\widetilde{o}}^{\prime}_{j}\|^{2})^{-1}}{\sum_{k}\sum_{l\not=k}(1+\|\bm{\widetilde{o}}^{\prime}_{k}-\bm{\widetilde{o}}^{\prime}_{l}\|^{2})^{-1}}    (16)

The t-distribution is applied due to its computational efficiency and its ability to mitigate the crowding problem (Maaten and Hinton 2008).

Finally, the local structure modeling loss $L_{\text{LP}}(x_{i},x_{j};\phi,\psi)$ is given by the KL-divergence between the two joint probabilities $p_{ij}$ and $q_{ij}$:

L_{\text{LP}}(x_{i},x_{j};\phi,\psi)=\sum_{i\not=j}p_{ij}\log\frac{p_{ij}}{q_{ij}}    (17)

An intuitive interpretation of $L_{\text{LP}}(x_{i},x_{j};\phi,\psi)$ is that it encourages $\bm{\widetilde{o}}^{\prime}_{i}$ and $\bm{\widetilde{o}}^{\prime}_{j}$ to be closer when the pairwise distance between $\bm{\widetilde{\mu}}_{i}$ and $\bm{\widetilde{\mu}}_{j}$ is small, and vice versa. As a result, the final output $\bm{\widetilde{o}}^{\prime}$ not only inherits the global structure information from $\bm{\widetilde{\mu}}$ but also encodes local structure information from the pairwise relationships, and can thus be used as an effective representation for clustering.
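The sketch below illustrates Eqns. 16 and 17 for a mini-batch, assuming the target distribution $p_{ij}$ of Eqn. 13 has been precomputed; masking zero entries of $p$ is a numerical convenience rather than part of the formulation.

```python
import torch

def lp_loss(p, o):
    """p: precomputed pairwise target distribution (Eqn. 13), zero on the diagonal.
    o: outputs of the locality mapping network f_lp (Eqn. 15)."""
    inv = 1.0 / (1.0 + torch.cdist(o, o).pow(2))   # Student t kernel, one degree of freedom
    inv.fill_diagonal_(0.0)                        # exclude i == j
    q = inv / inv.sum()                            # Eqn. 16
    mask = p > 0
    return (p[mask] * (p[mask] / q[mask]).log()).sum()   # Eqn. 17
```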

Experiments

method REUTERS-10k REUTERS
ACC NMI ARI ACC NMI ARI
K-means 0.5403 0.4175 0.2775 0.5328 0.4014 0.2631
AE+K-means 0.7477 0.4714 0.5300 0.7006 0.3330 0.3394
VAE+K-means 0.7421 0.4319 0.5073 0.6922 0.3586 0.3317
DEC (Xie, Girshick, and Farhadi 2016) 0.7229 0.5229 0.5525 0.7082 0.4915 0.5387
ClusterGAN (Mukherjee et al. 2019) 0.4317 0.0271 0.0163 0.4621 0.0303 0.0411
VaDE (Jiang et al. 2017) 0.8092 0.5267 0.5951 0.7938* 0.5591 0.5779
LPVDN (Proposed) 0.8301 0.5857 0.6112 0.8483 0.6553 0.6806
Table 2: Evaluation on textual datasets of REUTERS-10k and REUTERS for different algorithms using ACC, NMI and ARI. * is taken from (Jiang et al. 2017), other results are generated by the released code.

In this section, extensive experiments are conducted to validate the effectiveness of the proposed LPVDN.

Dataset Description

To test the ability of our model to solve general clustering tasks, we do experiments on both vision and textual datasets.

Vision Datasets:

  • MNIST is a widely known dataset containing 70,000 (60,000/10,000) handwritten gray-scale images. There are 10 classes and each image is 28 × 28 pixels.

  • Fashion-MNIST: Fashion-MNIST is a dataset similar to MNIST with the same number of samples and image sizes. Instead of handwritten digits, Fashion-MNIST consists of fashion goods and products which yields a more challenging task.

Textual Datasets:

  • REUTERS: The original REUTERS dataset contains about 810,000 English news stories labeled with categories. Following Huang et al. (2014), we use four root categories during the evaluation, namely, corporate/industrial, government/social, markets, and economics. The documents with multiple root categories are pruned, resulting in a dataset of size 685,071.

  • REUTERS-10k: REUTERS-10k is a subset of REUTERS. Following a similar approach to Huang et al. (2014), we construct a small-scale dataset by sampling 10,000 examples from REUTERS.

Figure 2: ACC, NMI and ARI of the proposed LPVDN with different $\alpha_{0}$ and $\alpha_{1}$ combinations on the MNIST dataset. The performance is stable across different hyper-parameters.

Experiment Setting

For both vision and textual datasets, we combine the training and test sets because of the unsupervised setting. Furthermore, for vision data, we concatenate the pixels to form the input vector without preprocessing. Following the settings of Xie, Girshick, and Farhadi (2016) and Jiang et al. (2017), we set the encoder of the global structure model to $D$-500-500-2000-10 and the decoder to 10-2000-500-500-$D$, where $D$ is the input dimension. The robust discriminator uses the structure $(10+D)$-256-1, where 10 is the dimension of the latent representation. For the local structure model, we adopt a network of 10-256-256-256-10. All layers are fully connected with ReLU as the activation function. We set $\alpha_{0}=1$ and $\alpha_{1}=10^{-4}$ for MNIST and Fashion-MNIST, and $\alpha_{0}=1$ and $\alpha_{1}=10^{-3}$ for REUTERS-10k and REUTERS. The batch size is 1,000. We use the Adam optimizer (Kingma and Ba 2015). We apply K-means on the latent representations (i.e., $\bm{\widetilde{o}}^{\prime}$) to perform clustering.
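A minimal sketch of the four networks under the settings above is given below; doubling the last encoder width to expose the mean and log-variance heads is an implementation assumption.

```python
import torch.nn as nn

def mlp(dims, final_act=None):
    """Fully connected stack with ReLU between hidden layers."""
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(nn.ReLU())
    if final_act is not None:
        layers.append(final_act)
    return nn.Sequential(*layers)

D = 784                                            # e.g. flattened MNIST images
encoder       = mlp([D, 500, 500, 2000, 2 * 10])   # outputs [mu, log sigma^2]
decoder       = mlp([10, 2000, 500, 500, D], final_act=nn.Sigmoid())
discriminator = mlp([10 + D, 256, 1])
local_net     = mlp([10, 256, 256, 256, 10])
```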

method MNIST REUTERS
ACC NMI ARI ACC NMI ARI
$L_{\text{G}}$+$L_{\text{MI}}$+$L_{\text{LP}}$ 0.9716 0.9276 0.9385 0.8483 0.6553 0.6806
$L_{\text{G}}$+$L_{\text{LP}}$ 0.9710 0.9269 0.9373 0.8373 0.6542 0.6762
$L_{\text{G}}$+$L_{\text{MI}}$ 0.9477 0.8910 0.8888 0.8224 0.5969 0.6364
$L_{\text{G}}$ only 0.9446 0.8761 0.8822 0.7938 0.5591 0.5779
Table 3: Ablation study of the proposed LPVDN on MNIST and REUTERS. Both the robust embedding discriminator $L_{\text{MI}}$ and the local structure modeling $L_{\text{LP}}$ contribute to the improvement over using only the global structure modeling $L_{\text{G}}$.

Baseline Methods

We compare the results of our model with several state-of-the-art baseline models. All K-means algorithms mentioned below use 10 different K-means++ initializations (Arthur and Vassilvitskii 2007) and select the run with the lowest average distance.

  • K-means: apply the K-means algorithm to the original data.

  • AE + K-means: A pipeline method that trains an Autoencoder first and then applies the K-means algorithm to the latent representations produced by the Autoencoder.

  • VAE + K-means: A pipeline method that trains a Variational Autoencoder first and then applies K-means to the latent representations.

  • DEC (Xie, Girshick, and Farhadi 2016): Following the experiment in the paper, DEC is first pre-trained with the autoencoder setting, and then iteratively optimized by minimizing the KL-divergence between the distribution of the current soft cluster assignment and its proposed auxiliary target distribution.

  • ClusterGAN (Mukherjee et al. 2019): introduces a variation of GAN with an inverse-mapping network and a clustering-specific loss, resulting in superior results among GAN-based algorithms in the clustering task.

  • VaDE (Jiang et al. 2017): a generative clustering framework combining VAE with GMM prior. It is reported that VaDE achieves the best performance compared with other general AE/VAE based clustering methods.

The settings of the encoder and decoder in AE and VAE are the same as in our proposed model.

Evaluation Metrics

While evaluating the real performance of clustering is arguably a non-trivial problem, we follow mainstream practice and use unsupervised clustering accuracy (ACC), Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI) as the metrics to evaluate the models. In our experiments, we set the number of clusters to the number of ground-truth classes.

ACC

The clustering accuracy (ACC) is defined as:

\text{ACC}=\max_{m}\frac{\sum_{i=1}^{n}\mathbf{1}\{l_{i}=m(c_{i})\}}{n}    (18)

where $l_i$ denotes the ground-truth label and $c_i$ is the output label of the clustering algorithm. The mapping $m$ ranges over all possible one-to-one mappings between clusters and labels.
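In practice, the best one-to-one mapping $m$ can be found with the Hungarian algorithm; the sketch below uses SciPy's linear_sum_assignment and assumes labels are encoded as consecutive integers starting at 0.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels_true, labels_pred):
    """Eqn. 18: accuracy under the best cluster-to-class assignment."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    k = int(max(labels_true.max(), labels_pred.max())) + 1
    count = np.zeros((k, k), dtype=np.int64)        # count[cluster, class]
    for t, p in zip(labels_true, labels_pred):
        count[p, t] += 1
    row, col = linear_sum_assignment(-count)        # maximize matched counts
    return count[row, col].sum() / labels_true.size
```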

NMI

Normalized Mutual Information (NMI) computes a normalized measure of similarity between two label assignments of the same data, which is defined as:

\text{NMI}(U,V)=\frac{2\times\text{MI}(U,V)}{H(U)+H(V)}    (19)

where $H(U)$ and $H(V)$ are the entropies of the label assignments $U$ and $V$, respectively, and $\text{MI}(U,V)$ is the mutual information of the two label assignments.

ARI

The unadjusted Rand Index (RI) is a measure of the similarity between two clusterings, defined as $(a+b)/\binom{n}{2}$, where $n$ is the number of samples, $a$ is the number of pairs that are assigned to the same set in both the ground-truth labels and the output labels, and $b$ is the number of pairs that are assigned to different sets in both. Since the RI score does not guarantee that random labelings get a score close to 0, the Adjusted Rand Index (ARI) is defined as:

\text{ARI}=\frac{\text{RI}-E[\text{RI}]}{\max(\text{RI})-E[\text{RI}]}    (20)

where $E[\text{RI}]$ is the expected RI of random labeling.
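Standard implementations of NMI and ARI are available, for example in scikit-learn; a minimal usage sketch:

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]   # a permuted but otherwise perfect clustering

print(normalized_mutual_info_score(labels_true, labels_pred))  # 1.0 (Eqn. 19)
print(adjusted_rand_score(labels_true, labels_pred))           # 1.0 (Eqn. 20)
```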

Experiment Result

In this section, we explain the experiment results on the vision and textual datasets in detail.

Vision Results:

Table 1 shows the results on the vision datasets, MNIST and Fashion-MNIST.

From the experiment results, we have the following observations. 1) DEC performs better than the simple combination of AE+K-means, which means that modeling the global structure during dimension reduction helps improve clustering performance. 2) Generative methods such as VaDE and ClusterGAN achieve better results than discriminative methods such as DEC, owing to their ability to model the underlying distribution of the data instead of recovering the exact input. 3) Our method outperforms ClusterGAN and VaDE by a significant margin, because our model jointly learns a more robust global and local data structure, while ClusterGAN and VaDE only model the global structure.

Furthermore, we show the clustering performance when varying the hyper-parameters $\alpha_{0}$ and $\alpha_{1}$ on the MNIST dataset (Fig. 2). It can be seen that the performance of our proposed method is relatively stable across different combinations of $\alpha_{0}$ and $\alpha_{1}$.

Textual Results:

Table 2 shows the results on the textual datasets. Our proposed method achieves the best results on all datasets for all metrics, which further verifies the effectiveness of our model for non-image tasks. Notably, ClusterGAN performs badly on the textual datasets. As discussed in the related work, this is because ClusterGAN is sensitive to unbalanced data, which is the case for REUTERS. In comparison, our method makes no assumptions about the prior distribution of the datasets and is thus not influenced by such imbalance.

Ablation Study

In Table 3, we show the results of the ablation study on both the MNIST and REUTERS datasets. As our model contains three parts ($L_{\text{G}}$, $L_{\text{MI}}$, $L_{\text{LP}}$), we design the experiment to study the contribution of each component. We use the method that only models the global structure $L_{\text{G}}$ as the baseline, which is almost equivalent to VaDE. From the results, it can be seen that the local structure component ($L_{\text{LP}}$) is crucial and improves the clustering performance significantly, while the robust embedding discriminator $L_{\text{MI}}$ only improves the performance by a slight margin. This result is in line with our expectation, because the embedding discriminator is better at improving the robustness of the system than its accuracy.

Figure 3: Accuracy on the MNIST dataset corrupted by Gaussian noise with standard deviation $\sigma$. Sample images with the Gaussian noise are shown above.

Robustness under Input Pollution

To verify the robustness of our model, we conduct experiments on the MNIST dataset following Gilmer et al. (2019). In particular, we gradually corrupt the pixels of the input images with independent Gaussian noise $N(0,\sigma^{2})$. Fig. 3 shows that the proposed robust embedding discriminator ($L_{\text{MI}}$) performs significantly better than the method that only uses the global structure ($L_{\text{G}}$ only). In addition, the results show that the local structure ($L_{\text{LP}}$) also contributes to the robustness, especially when the images are seriously corrupted. We also observe that VaDE and ClusterGAN are fragile to input noise, especially ClusterGAN, which collapses even under small pollution ratios.
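The corruption step can be sketched as follows; clipping the noisy pixels back to [0, 1] is our own preprocessing assumption rather than a detail taken from Gilmer et al. (2019).

```python
import numpy as np

def corrupt(images, sigma, seed=0):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to images scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)   # clipping choice is an assumption
```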

Figure 4: The 10 nearest neighbors retrieved using our proposed model, VaDE and the raw data on the MNIST dataset. The picture on the left is the query picture of digit “4”, which looks similar to “9”.

Case Study

In Fig. 4, we show a concrete example by performing an image retrieval task using the latent representations. In particular, we use the representation of an image as the query to retrieve the top $K$ images based on Euclidean distance. Specifically, we extract a digit “4” from the dataset (a difficult example, as it looks like “9”). The top 10 nearest neighbors found by our model are all correct, while the ones obtained using VaDE are a mixture of “4” and “9”. This is because global structure modeling usually focuses on the cluster centroids and on modeling the data distribution around those centroids, so it is difficult to differentiate the data around the cluster boundary. On the other hand, since our model encodes the local pairwise relationships, it incorporates the information of the boundary points and therefore performs better in such situations.
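The retrieval step itself is straightforward; a sketch assuming the latent representations are stored as a NumPy array:

```python
import numpy as np

def retrieve(query_embedding, embeddings, k=10):
    """Indices of the k nearest latent representations by Euclidean distance."""
    dists = np.linalg.norm(embeddings - query_embedding, axis=1)
    return np.argsort(dists)[:k]
```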

Figure 5: Accuracy on the MNIST dataset with different embedding dimension sizes. Our method performs well even when the output dimension is reduced to two, which is a desirable property for data visualization.

Visualization and Dimension Reduction

Figure 6: Visualization of the MNIST test set (left) and the REUTERS-10k dataset (right) with the dimension of the latent representations set to two. The proposed LPVDN shows clear boundaries among clusters.

In Fig. 5 we examine the accuracy of our model on MNIST while adjusting the embedding dimension. The results are quite stable across dimensions. One interesting observation is that the clustering accuracy is almost preserved when we reduce the dimension to 2, which makes our model a promising data visualization tool. In Fig. 6 we show the visualization of the latent representations on the test sets of MNIST and REUTERS-10k, each containing 10,000 samples. Different colors represent the gold labels. Our proposed LPVDN creates a cohesive structure for points of the same cluster, while clearly separating the latent representations of different clusters.

Conclusion

In this paper, we propose a locality preserving variational discriminative network to solve the clustering problem. Because our model captures robust global and local data structures in the latent layer, it outperforms the state-of-the-art baselines on all metrics for the vision and textual benchmark datasets. Moreover, our model exhibits stable clustering performance even when the dimension is reduced to 2, which makes it a promising candidate for other unsupervised applications such as dimension reduction and data visualization.

References

  • Arthur and Vassilvitskii (2007) Arthur, D.; and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035. Society for Industrial and Applied Mathematics.
  • Belhumeur, Hespanha, and Kriegman (1997) Belhumeur, P. N.; Hespanha, J. P.; and Kriegman, D. J. 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis & Machine Intelligence (7): 711–720.
  • Chen et al. (2016) Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, 2172–2180.
  • Donoho and Grimes (2003) Donoho, D. L.; and Grimes, C. 2003. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences 100(10): 5591–5596.
  • Ester et al. (1996) Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X.; et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, 226–231.
  • Ghasedi Dizaji et al. (2017) Ghasedi Dizaji, K.; Herandi, A.; Deng, C.; Cai, W.; and Huang, H. 2017. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE International Conference on Computer Vision, 5736–5745.
  • Gilmer et al. (2019) Gilmer, J.; Ford, N.; Carlini, N.; and Cubuk, E. D. 2019. Adversarial Examples Are a Natural Consequence of Test Error in Noise. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, 2280–2289. URL http://proceedings.mlr.press/v97/gilmer19a.html.
  • Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
  • Harchaoui, Mattei, and Bouveyron (2017) Harchaoui, W.; Mattei, P.-A.; and Bouveyron, C. 2017. Deep adversarial Gaussian mixture auto-encoder for clustering .
  • Hartigan and Wong (1979) Hartigan, J. A.; and Wong, M. A. 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28(1): 100–108.
  • Hjelm et al. (2019) Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2019. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. URL https://openreview.net/forum?id=Bklr3j0cKX.
  • Huang et al. (2014) Huang, P.; Huang, Y.; Wang, W.; and Wang, L. 2014. Deep embedding network for clustering. In 2014 22nd International Conference on Pattern Recognition, 1532–1537. IEEE.
  • Ji et al. (2017) Ji, P.; Zhang, T.; Li, H.; Salzmann, M.; and Reid, I. 2017. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, 24–33.
  • Jiang et al. (2017) Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; and Zhou, H. 2017. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, 1965–1972. doi:10.24963/ijcai.2017/273. URL https://doi.org/10.24963/ijcai.2017/273.
  • Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. URL http://arxiv.org/abs/1412.6980.
  • Kingma and Welling (2014) Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. URL http://arxiv.org/abs/1312.6114.
  • Kohonen (1990) Kohonen, T. 1990. The self-organizing map. Proceedings of the IEEE 78(9): 1464–1480.
  • Maaten and Hinton (2008) Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research 9(Nov): 2579–2605.
  • Mukherjee et al. (2019) Mukherjee, S.; Asnani, H.; Lin, E.; and Kannan, S. 2019. Clustergan: Latent space clustering in generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 4610–4617.
  • Nowozin, Cseke, and Tomioka (2016) Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, 271–279.
  • Springenberg (2016) Springenberg, J. T. 2016. Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. URL http://arxiv.org/abs/1511.06390.
  • Turk and Pentland (1991) Turk, M.; and Pentland, A. 1991. Eigenfaces for recognition. Journal of cognitive neuroscience 3(1): 71–86.
  • Xie, Girshick, and Farhadi (2016) Xie, J.; Girshick, R.; and Farhadi, A. 2016. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, 478–487.
  • Yang et al. (2017) Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3861–3870. JMLR. org.
  • Yang et al. (2019) Yang, X.; Deng, C.; Zheng, F.; Yan, J.; and Liu, W. 2019. Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4066–4075.

Appendix

Theoretical Analysis of $\mathcal{L}_{\text{MI}}$

In this section, we conduct a theoretical analysis of the global robustness term $\mathcal{L}_{\text{MI}}$. First, we show the relation between $\mathcal{L}_{\text{MI}}$ and the margin loss. Using this relation, we replace the initial loss function with the margin loss, which can be written in the form of a penalty function. Second, we explain the effects of $\mathcal{L}_{\text{MI}}$ from the point of view of the penalty function. Last, we explain how $\mathcal{L}_{\text{MI}}$ improves the robustness of the model.

Relation between $\mathcal{L}_{\text{MI}}$ and the margin loss

In this subsection, we analyze the relation between $\mathcal{L}_{\text{MI}}$ and the margin loss.

The form of $\mathcal{L}_{\text{MI}}$ can be written as:

\mathcal{L}_{\text{MI}} = -\mathbb{E}_{x,z\sim q(z|x)p(x)}[\log(\sigma(D(x,z)))]-\mathbb{E}_{x,z\sim q(z)p(x)}[\log(1-\sigma(D(x,z)))]    (21)
= \mathbb{E}_{x,z\sim q(z|x)p(x)}[\log(1+\exp(-D(x,z)))]+\mathbb{E}_{x,z\sim q(z)p(x)}[\log(1+\exp(D(x,z)))]    (22)

The function $f(t)=\log(1+\exp(t))$ can be approximated by a margin loss $g(t)=[t+\gamma]_{+}=\max(0,t+\gamma)$ with $\gamma\geq 0$, which is differentiable almost everywhere. When the model is well learned, namely when the loss $f(t)$ is small enough (i.e., $t<-\gamma$), the Chebyshev distance $D$ between $f$ and $g$ and the Chebyshev distance $D^{\prime}$ between their gradients on this region are

D=\sup_{t<-\gamma}|f(t)-g(t)|=\log(1+\exp(-\gamma));\quad D^{\prime}=\sup_{t<-\gamma}|f^{\prime}(t)-g^{\prime}(t)|=\frac{\exp(-\gamma)}{1+\exp(-\gamma)}    (23)

Therefore, we can choose an appropriate $\gamma$ to ensure that, when we replace $f(t)$ with $g(t)$, the gradients and well-learned loss values are precise enough. For example, when $\gamma=3$, $D=\log(1+e^{-3})=0.049<0.05$ and $D^{\prime}=e^{-3}/(1+e^{-3})=0.047<0.05$, so the well-learned loss values and gradients have bounded errors of less than $0.05$.
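These bounds are easy to check numerically; a quick sketch:

```python
import math

gamma = 3.0
D  = math.log(1 + math.exp(-gamma))             # sup of |f(t) - g(t)| on t < -gamma
Dp = math.exp(-gamma) / (1 + math.exp(-gamma))  # sup of the gradient gap on t < -gamma
print(round(D, 3), round(Dp, 3))                # 0.049 0.047
```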

Equivalence between $\mathcal{L}_{\text{MI}}$ and the penalty function

In this subsection, we analyze the equivalence between $\mathcal{L}_{\text{MI}}$ and the penalty function under certain circumstances, which improves the robustness of the model.

After replacing $f(t)$ with $g(t)$,

\mathcal{L}_{\text{MI}}=\mathbb{E}_{q(z|x)p(x)}[f(-D(x,z))]+\mathbb{E}_{q(z)p(x)}[f(D(x,z))]\approx\mathbb{E}_{q(z|x)p(x)}[[\gamma-D(x,z)]_{+}]+\mathbb{E}_{q(z)p(x)}[[\gamma+D(x,z)]_{+}]    (24)

Let $x_{i},z_{i}\sim q(z|x)p(x)$ ($1\leq i\leq n$) be the positive samples and $x_{j},z_{j}\sim q(z)p(x)$ ($n+1\leq j\leq n+m$) be the negative samples used to train $\mathcal{L}_{\text{MI}}$. When minimizing the total loss function $\mathcal{L}+\lambda\mathcal{L}_{\text{MI}}$, where $\mathcal{L}$ is the sum of the other loss terms, the optimization can be written as:

\min\quad\mathcal{L}+\alpha\sum_{i=1}^{n}[\gamma-D(x_{i},z_{i})]_{+}+\beta\sum_{j=n+1}^{n+m}[\gamma+D(x_{j},z_{j})]_{+}    (25)

which, from the point of view of the penalty function, is equivalent under certain circumstances to the following constrained optimization,

\min\quad\mathcal{L}    (26)
\text{s.t.}\quad[\gamma-D(x_{i},z_{i})]_{+}=0,\quad(1\leq i\leq n)    (27)
\qquad\;\;[\gamma+D(x_{j},z_{j})]_{+}=0,\quad(n+1\leq j\leq n+m)    (28)

Therefore, for $x,z\sim q(z|x)p(x)$, $D(x,z)>\gamma$ holds with high probability, and for $x,z\sim q(z)p(x)$, $D(x,z)<-\gamma$ holds with high probability.

How $\mathcal{L}_{\text{MI}}$ improves the robustness of the model

In this subsection, we show how $\mathcal{L}_{\text{MI}}$ improves the robustness of the model.

First, we define the rationality of a hidden state $z$. Define $X$ as the manifold of real $x$. With a well trained discriminator $D$, $D(x,z)>0$ is the rational region for the distribution $q(z|x)p(x)$. From the point of view of the penalty function, we have

\alpha(x)=\int_{D(x,z)>0}q(z|x)dz\approx\int_{D(x,z)>\gamma}q(z|x)dz    (29)

So for $x$ in the manifold $X$, $\alpha(x)\approx 1$ if $D$ is well trained, and $\alpha(x)\approx 1$ implies that the hidden state generated by $x$ is rational.

Define $x^{\prime}=x+\Delta x$, $x\in X$, as data outside the manifold $X$. When we use a continuous function $q(z|x)$, we will show that the generated hidden state $z^{\prime}$ is still rational with high probability when $\Delta x$ is small.

\alpha(x+\Delta x)=\int_{D(x+\Delta x,z)>0}q(z|x+\Delta x)dz=\int_{D(x,z)+\nabla_{x}D\cdot\Delta x>0}q(z|x)dz+\int_{D(x,z)>0}\nabla_{x}[q(z|x)]\cdot\Delta x\,dz+o(\|\Delta x\|_{2})    (30)

where high-order infinitesimals are ignored, and we show in Corollary 1 that $\int_{D(x,z)>0}\nabla_{x}[q(z|x)]\cdot\Delta x\,dz$ can also be ignored for certain distributions.

Therefore, we only need to consider the first term. Define $m=-\nabla_{x}D\cdot\Delta x$; when $m>0$, $\alpha(x^{\prime})$ tends to get worse.

\alpha(x+\Delta x)=\int_{D(x,z)>0}q(z|x)dz-\int_{D(x,z)\in(0,m)}q(z|x)dz+o(\|\Delta x\|_{2})=\alpha(x)-\int_{D(x,z)\in(0,m)}q(z|x)dz+o(\|\Delta x\|_{2})    (31)

When $m<\gamma$, $\mathcal{L}_{\text{MI}}$ makes it almost impossible for $D(x,z)$ to be less than $\gamma$, so $\int_{D(x,z)\in(0,m)}q(z|x)dz$ is almost 0. Without $\mathcal{L}_{\text{MI}}$, $\alpha(x+\Delta x)$ tends to decrease because $\int_{D(x,z)\in(0,m)}q(z|x)dz$ tends to be greater than 0.

To conclude, $\mathcal{L}_{\text{MI}}$ improves the robustness of the model.

Corollary 1 and its proof

Corollary 1.

For $q(z|x)=\frac{1}{(2\pi)^{d_{z}/2}\prod_{i=0}^{d_{z}-1}\sigma_{i}}\exp\big[-\frac{1}{2}\sum_{i=0}^{d_{z}-1}\frac{(z_{i}-\mu_{i}(x))^{2}}{\sigma_{i}^{2}}\big]$, where $\mu_{i}(x)$ ($i=0,1,\cdots,d_{z}-1$) is a smooth function and $d_{z}$ is the dimension of $z$, the term $\int_{D(x,z)>0}\nabla_{x}[q(z|x)]\cdot\Delta x\,dz$ is a high-order infinitesimal and can be ignored, namely

\int_{D(x,z)>0}\nabla_{x}[q(z|x)]\cdot\Delta x\,dz=o(\|\Delta x\|_{2})    (32)
Proof.

Since $D(x,z)$ measures whether $z$ is generated from $q(z|x)$ with high probability, we may assume that the closed region $\mathcal{A}=\{z:\sum_{i=0}^{d_{z}-1}\frac{(z_{i}-\mu_{i}(x))^{2}}{\sigma_{i}^{2}}<\epsilon^{2}\}$ approximates the region $\{z:D(x,z)>0\}$ with only an $o(1)$ infinitesimal ignored. Therefore,

\int_{D(x,z)>0}\nabla_{x}[q(z|x)]\cdot\Delta x\,dz=\int_{D(x,z)>0}q(z|x)\sum_{i=0}^{d_{z}-1}\Big(\frac{z_{i}-\mu_{i}(x)}{\sigma_{i}^{2}}\mu_{i}^{\prime}(x)\Delta x_{i}\Big)dz    (33)
=\sum_{i=0}^{d_{z}-1}\Big(\int_{D(x,z)>0}q(z|x)(z_{i}-\mu_{i}(x))dz\times\frac{\mu_{i}^{\prime}(x)}{\sigma_{i}^{2}}\Delta x_{i}\Big)=\sum_{i=0}^{d_{z}-1}\Big(\big[\int_{\mathcal{A}}q(z|x)(z_{i}-\mu_{i}(x))dz+o(1)\big]\times\frac{\mu_{i}^{\prime}(x)}{\sigma_{i}^{2}}\Delta x_{i}\Big)    (34)
=\sum_{i=0}^{d_{z}-1}\Big(o(1)\times\frac{\mu_{i}^{\prime}(x)}{\sigma_{i}^{2}}\Delta x_{i}\Big)=o(\|\Delta x\|_{2})    (35)

Experimental Settings

In this section, we provide the details of the experiments in the paper. The experimental settings are shown in Table 4.

 MNIST Fashion-MNIST REUTERS-10k REUTERS
$\alpha_0$ 1 1 $10^{-2}$ 1
$\alpha_1$ $10^{-4}$ $10^{-4}$ $10^{-3}$ $10^{-2}$
batch size 800 800 1000 1000
epoch number 300 300 50 300
optimizer Adam Adam Adam Adam
learning rate $2\times 10^{-3}$ $2\times 10^{-3}$ $2\times 10^{-4}$ $2\times 10^{-4}$
lr-decay 0.95 / 10 epochs 0.95 / 10 epochs 0.95 / 10 epochs 0.95 / 10 epochs
Table 4: Experimental settings.