
Learning Robust Representation for Clustering through Locality Preserving Variational Discriminative Network

Ruixuan Luo, Wei Li, Zhiyuan Zhang, Ruihan Bao, Keiko Harimoto, Xu Sun
Abstract

Clustering is one of the fundamental problems in unsupervised learning. Recent deep learning based methods focus on learning clustering-oriented representations. Among these methods, Variational Deep Embedding (VaDE) achieves great success in various clustering tasks by imposing a Gaussian Mixture prior on the latent space. However, VaDE suffers from two problems: 1) it is fragile to input noise; 2) it ignores the locality information between neighboring data points. In this paper, we propose a joint learning framework that improves VaDE with a robust embedding discriminator and a local structure constraint, which are both helpful for improving the robustness of our model. Experimental results on various vision and textual datasets demonstrate that our method outperforms the state-of-the-art baseline models on all metrics. Further detailed analysis shows that our proposed model is very robust to adversarial inputs, which is a desirable property for practical applications.

Introduction

Clustering is an essential problem in unsupervised learning, which aims to group unlabeled data based on their similarities. Over the past decades, several traditional clustering methods such as k-means and the Gaussian Mixture Model (GMM) (Kohonen 1990; Hartigan and Wong 1979; Ester et al. 1996) have been established, which measure the distances between data points with the Euclidean distance and minimize the within-class variance. However, as the dimension of the input feature space grows, clustering with the Euclidean distance deteriorates significantly. To address this problem, various dimension reduction methods (Turk and Pentland 1991; Belhumeur, Hespanha, and Kriegman 1997; Donoho and Grimes 2003) have been proposed to convert data points from the feature space to a lower-dimensional space as a pre-processing step for clustering.

Due to the recent success of deep learning, deep learning based clustering has received a lot of attention. Earlier approaches adopt deterministic models such as Autoencoders to learn a latent representation by minimizing the reconstruction error. Furthermore, constraints are imposed to make the latent representation more suitable for clustering (Xie, Girshick, and Farhadi 2016; Yang et al. 2017). Other attempts build clustering methods on generative models because of their ability to learn the inherent data structure. Mukherjee et al. (2019) design a GAN based model which contains an inverse-mapping network and a clustering-specific loss. Their model achieves high scores on various tasks, but is vulnerable to unbalanced data. Jiang et al. (2017) propose Variational Deep Embedding (VaDE), which extends VAE by adding a GMM constraint to the latent layer. While VaDE achieves state-of-the-art performance in various clustering tasks, it suffers from two major drawbacks: 1) as VaDE trains a network to recover the original inputs, the latent representations are vulnerable to input noise; 2) it only models the global structure of the latent representations, while ignoring the local structure between latent variables, which carries important information.

In this paper, we resolve the problems of VaDE by proposing a Locality Preserving Variational Discriminative Network (LPVDN) that encodes robust global and local data structure in the latent representations used for clustering and is stable under noisy and unbalanced input. Specifically, our work first models the global data structure using a GMM regularized Variational Autoencoder (VAE), because VAE is good at capturing the global structure of the data distribution. Then we improve the robustness of the global structure modeling by introducing an embedding discriminator that maximizes the mutual information between the input data and the latent representations, because maximizing the mutual information improves the robustness to input noise (a mathematical proof is given in the Appendix). After the global representations are available, we encode the local structure information by modeling the pairwise distance probabilities and minimizing the KL-divergence between the distributions. We assume that the global structure is more sensitive to adversarial noise, while the local structure helps reduce the impact of the noise.

Specifically, to capture the local structure, we propose a locality preserving network module inspired by t-SNE (Maaten and Hinton 2008) due to its effectiveness and simplicity. It employs a Student t-distribution with one degree of freedom in the low-dimensional map and uses it to fit the joint probability in the high-dimensional space. The optimization objective attracts similar points while introducing strong but finite repulsion between dissimilar points.

We perform extensive experiments on both vision and textual clustering datasets. Experiments show that our model achieves state-of-the-art results on all metrics. Further analysis verifies the robustness of our model under different degrees of data poisoning and reveals its potential for dimension reduction and visualization.

Figure 1: Illustration of the Locality Preserving Variational Discriminative Network (LPVDN). Our model retains both the global (lower part) and the local (upper part) structure. Latent variables $\mu$ and $\sigma$ are regularized to a mixture of Gaussian distribution.

We conclude our main contributions as follows:

  • We propose a robust locality preserving variational discriminative network that retains both global and local data structure and is robust to noise in the input. (Our code is released at https://github.com/lancopku/LPVDN.)

  • Extensive experiments on both vision and textual benchmark datasets show that our proposed model outperforms state-of-the-art clustering methods. Further analysis verifies the superior robustness of our model over previous state-of-the-art methods.

  • Further analysis shows that our model’s performance remains steady even when the dimension of the latent representations is reduced to two, which means our model can serve as a promising candidate for data visualization.

Related Work

Recently, deep learning-based clustering methods have attracted a lot of attention due to their success in various problems.

One branch of popular methods is based on Autoencoders, in which a latent feature representation is learned by minimizing the reconstruction error. Specifically, Yang et al. (2017) propose DCN, which combines the K-means loss with the reconstruction loss to perform clustering and feature embedding simultaneously. Xie, Girshick, and Farhadi (2016) introduce DEC with a KL-divergence loss between the soft cluster assignments and an auxiliary distribution to minimize the within-cluster distance. In order to achieve better performance, the parameters are initialized with general Autoencoder training, which has become the major practice for other deep clustering networks. Ji et al. (2017) apply the idea of subspace clustering and introduce a self-expressive layer into the Autoencoder; the trained parameters of the self-expressive layer are used to construct an affinity matrix for spectral clustering. Leveraging recent progress in image recognition, Ghasedi Dizaji et al. (2017) adopt a convolutional Autoencoder with a softmax layer stacked on top of the encoder. Yang et al. (2019) extend the idea of convolutional Autoencoders by proposing a joint learning framework of spectral clustering and discriminative embedding regularized by a normal distribution. As convolutional Autoencoders can extract important image features, both methods work extremely well for image clustering tasks, but their performance often drops significantly on non-image clustering tasks.

Driven by the success of VAE (Kingma and Welling 2014) and GAN (Goodfellow et al. 2014), generative models have become favourable choices for clustering due to their ability to capture the data distribution. Among those methods, Variational Deep Embedding (Jiang et al. 2017) extends VAE by assuming the latent variables follow a mixture of Gaussian distribution, where the means and variances of the Gaussian components are trainable. Deep Adversarial Clustering (Harchaoui, Mattei, and Bouveyron 2017) proposes an adversarial Autoencoder, which utilizes an adversarial training procedure to match the aggregated posterior of the latent representation with the prior distribution. CategoricalGAN (Springenberg 2016) generalizes the GAN framework to multiple classes. InfoGAN (Chen et al. 2016) maximizes the mutual information between a subset of the latent variables and the observation, which yields disentangled representations for clustering. ClusterGAN (Mukherjee et al. 2019) samples latent variables from a mixture of one-hot and continuous latent variables, and its inverse network is trained jointly with a clustering-specific loss. However, as it requires sampling one-hot vectors from a uniform distribution over the clusters, its performance declines seriously when the data is highly unbalanced.

Locality Preserving Variational Discriminative Network (LPVDN)

In this section, we describe our proposed Locality Preserving Variational Discriminative Network (LPVDN).

The idea of our method is to jointly learn robust global and local structures in the latent space. The global structure modeling is based on VaDE, which achieves good clustering results with a Variational Autoencoder and a Gaussian Mixture Model (GMM); however, it is vulnerable to noise in the original input space and fails to model the local structure between data points. To tackle these problems, we propose a novel model that not only improves the robustness of the global structure modeling but also encodes the local structure between data points. A sketch of our model is given in Fig. 1.

More specifically, our method can be explained using the following optimization objective:

\min_{\phi,\theta,\psi,\rho}\sum_{i=1}^{N}\big(L_{\text{G}}(x_{i};\phi,\theta)+\alpha_{0}L_{\text{MI}}(x_{i};\phi,\rho)+\alpha_{1}L_{\text{LP}}(x_{i},x_{j};\phi,\psi)\big)    (1)

where $x_i$ and $x_j$ are input data points, $N$ is the total number of data points, and $\phi$, $\theta$, $\psi$, $\rho$ are trainable parameters.

As expressed in Eqn. 1, our model can be decomposed into three parts: $L_{\text{G}}(x_{i};\phi,\theta)$ captures the global structure using a Variational Autoencoder with a GMM prior. $L_{\text{MI}}(x_{i};\phi,\rho)$ applies an embedding discriminator to improve the robustness of the global structure modeling. Finally, $L_{\text{LP}}(x_{i},x_{j};\phi,\psi)$ models the local structure by minimizing the KL-divergence of the pairwise distance distribution. $\alpha_{0}$ and $\alpha_{1}$ are hyper-parameters that balance the different parts. After the training phase, as the final latent representations comprise robust global and local information, we use them as the input for general clustering methods such as K-means or GMM.
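For concreteness, the following is a minimal sketch of how the three terms of Eqn. 1 could be combined for a mini-batch; the helper function and its default weights (taken from our MNIST setting) are illustrative, and the per-term losses are assumed to be computed elsewhere.

```python
import torch

def total_loss(loss_g, loss_mi, loss_lp, alpha0=1.0, alpha1=1e-4):
    """Combine the three terms of Eqn. 1 for one mini-batch.

    loss_g  : negative ELBO of the GMM-regularized VAE (global structure)
    loss_mi : discriminator-based mutual information term
    loss_lp : KL divergence between pairwise distance distributions (local structure)
    """
    return loss_g + alpha0 * loss_mi + alpha1 * loss_lp

# toy usage with scalar per-batch losses
print(total_loss(torch.tensor(1.2), torch.tensor(0.3), torch.tensor(45.0)))
```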

Global Structure Modeling $L_{\text{G}}(x_{i};\phi,\theta)$

To model the global structure, we take advantage of Variational Deep Embedding (VaDE) (Jiang et al. 2017), which extends VAE with GMM to model the distribution of latent representations. Specifically, both VAE and VaDE maximize the log probability of the given data point:

\max_{\phi,\theta}\sum_{i=1}^{N}\log p(x_{i};\phi,\theta)    (2)

As the optimization is non-trivial, an evidence lower bound (ELBO) (Kingma and Welling 2014; Jiang et al. 2017) is proposed and maximized instead. In order to understand the ELBO, we further decompose it as:

L_{\text{ELBO}}(x_{i})=E_{q(z,c|x_{i})}[\log p(x_{i}|z)]-D_{\text{KL}}(q(z,c|x_{i})||p(z,c))    (3)

where $L_{\text{ELBO}}$ is the ELBO loss to be optimized, $x_i$ denotes the input data, $z$ represents the latent variables, and $c$ is the GMM cluster in the latent layer. Eqn. 3 shows that the ELBO loss consists of two terms. The first term corresponds to the reconstruction loss, which encourages the network to better approximate the original data. The second term is the regularization term, which requires the distribution of the latent variables $z$ to be as close as possible to the Mixture-of-Gaussian distribution $p(\bm{z},c)$. Since the ELBO preserves the global distribution of the latent variables, our global structure term is set to the negative ELBO loss (since we minimize the objective), precisely, $L_{\text{G}}(x_{i};\phi,\theta)=-L_{\text{ELBO}}(x_{i})$.

In practice, following the derivation of Jiang et al. (2017), $L_{\text{G}}(x_{i};\phi,\theta)$ is computed as

L_{\text{G}}(x_{i};\phi,\theta)
= -\sum_{d=1}^{D}\big[x_{i}^{d}\log\bm{\mu}_{x_{i}}|_{d}+(1-x_{i}^{d})\log(1-\bm{\mu}_{x_{i}}|_{d})\big]
  +\sum_{c=1}^{K}\gamma_{ic}\sum_{j=1}^{J}\Big(\log\bm{\sigma}_{c}^{2}|_{j}+\frac{\bm{\widetilde{\sigma}}_{i}^{2}|_{j}}{\bm{\sigma}_{c}^{2}|_{j}}+\frac{(\bm{\widetilde{\mu}}_{i}|_{j}-\bm{\mu}_{c}|_{j})^{2}}{\bm{\sigma}_{c}^{2}|_{j}}\Big)
  -\sum_{c=1}^{K}\gamma_{ic}\log\frac{\pi_{ic}}{\gamma_{ic}}-\frac{1}{2}\sum_{j=1}^{J}\big(1+\log\bm{\widetilde{\sigma}}_{i}^{2}|_{j}\big)    (4)

where $D$ is the dimension of the input $x_i$ and of the reconstructed output $\bm{\mu}_{x_i}$, $x_i^d$ is the $d$-th element of $x_i$, and $*|_j$ denotes the $j$-th element of $*$. $\gamma_{ic}$ represents $q(c|z_i)$, the GMM posterior of cluster $c$ given the latent variable $z_i$. Moreover,

[\bm{\widetilde{\mu}}_{i},\log\bm{\widetilde{\sigma}}_{i}^{2}] = g_{g}(x_{i};\phi)    (5)
z_{i} = \bm{\widetilde{\mu}}_{i}+\bm{\widetilde{\sigma}}_{i}\circ\epsilon_{i}    (6)
\bm{\mu}_{x_{i}} = f_{g}(z_{i};\theta)    (7)

where $g_{g}(*;\phi)$ and $f_{g}(*;\theta)$ are two neural networks parameterized by $\phi$ and $\theta$, $z_i$ is drawn from the latent distribution $\mathcal{N}(\bm{\widetilde{\mu}}_{i},\bm{\widetilde{\sigma}}_{i}^{2})$, $\epsilon_i$ is a vector whose elements are drawn from independent standard normal distributions, and $\circ$ denotes element-wise multiplication. $\bm{\mu}_{x_{i}}$ is the reconstruction of $x_i$.
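A minimal PyTorch sketch of Eqns. 5-7 is shown below; the layer widths are abbreviated relative to the full architecture given in the experiment settings, and producing the mean and log-variance as one doubled encoder output is an implementation assumption.

```python
import torch
import torch.nn as nn

class GaussianEncoderDecoder(nn.Module):
    """Sketch of Eqns. 5-7: g_g maps x to (mu_tilde, log sigma_tilde^2),
    z is drawn with the reparameterization trick, and f_g reconstructs mu_x."""

    def __init__(self, input_dim, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 500), nn.ReLU(),
                                     nn.Linear(500, 2 * latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 500), nn.ReLU(),
                                     nn.Linear(500, input_dim), nn.Sigmoid())

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)             # Eqn. 5
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # Eqn. 6
        mu_x = self.decoder(z)                                     # Eqn. 7
        return mu, log_var, z, mu_x

mu, log_var, z, mu_x = GaussianEncoderDecoder(input_dim=784)(torch.rand(16, 784))
```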

Robust Embedding Discriminator $L_{\text{MI}}(x_{i};\phi,\rho)$

In Eqn. 3, it is essential to reconstruct the original data, which is achieved by minimizing the cross entropy loss or mean square error between the raw data and the reconstruction. However, such objectives are too strict and make the model vulnerable to input noise. In order to improve the robustness of our model, we reformulate the reconstruction loss over all input data samples (for simplicity, we omit the variable $c$),

E_{x\sim P(X)}\big[E_{q(z|x)}[\log(p(x|z))]\big]
= \int\!\!\int p(x)q(z|x)\log(p(x|z))\,dz\,dx
= \int\!\!\int p(x)q(z|x)\log\Big(\frac{q(z|x)p(x)}{q(z)}\Big)dz\,dx
= \int\!\!\int p(x)q(z|x)\log\Big(\frac{q(z|x)p(x)}{q(z)p(x)}\Big)dz\,dx+\int\!\!\int p(x)q(z|x)\log(p(x))\,dz\,dx
= D_{\text{KL}}(Q(Z|X)P(X)||Q(Z)P(X))-H(X)
= I(X,Z)-H(X)    (8)

where $X=\{x_{1},\ldots,x_{n}\}$ represents the set of all input samples and $Z$ represents the corresponding latent variables. Since $H(X)$ depends solely on the input data, maximizing the first term in Eqn. 3 is equivalent to maximizing the mutual information $I(X,Z)$.

To estimate the mutual information $I(X,Z)$, Hjelm et al. (2019) introduce a JS-divergence based approximation,

I^{\text{(JSD)}}(X,Z)=D_{\text{JS}}(Q(Z|X)P(X),\,Q(Z)P(X))    (9)

One way to compute the JS-divergence in Eqn. 9 is to use a discriminator $D$ (Nowozin, Cseke, and Tomioka 2016; Hjelm et al. 2019; Yang et al. 2019):

D_{\text{JS}}(P(X),Q(X))=\frac{1}{2}\max_{D}\Big\{E_{x\sim P(X)}[\log(\sigma(D(x)))]+E_{x\sim Q(X)}[\log(1-\sigma(D(x)))]\Big\}+\log 2    (10)

Following Eqn. 9 and Eqn. 10, we utilize a neural network to serve as the robust embedding discriminator,

L_{\text{MI}}(x_{i};\phi,\rho)=-\log[\sigma(D(x_{i},z_{i};\rho))]-E_{(x_{j},z_{i})\sim p(z_{i})p(x_{j})}[\log(1-\sigma(D(x_{j},z_{i};\rho)))]    (11)

where $D(*;\rho)$ is a neural network parameterized by $\rho$, and $z_i$ is the latent variable generated from $x_i$ by Eqn. 5 and Eqn. 6. The discriminator $D(*;\rho)$ outputs a score indicating whether $z$ is generated from $x$. The second term in Eqn. 11 is a negative sampling estimate used to train the discriminator; in practice, we randomly draw samples from the dataset to form the negative pairs with $z_i$ (in the Appendix, we further provide a theoretical proof of the effectiveness of the proposed robust discriminator).
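Below is a minimal sketch of the discriminator and the loss in Eqn. 11; drawing negatives by shuffling the mini-batch is one convenient approximation of random sampling from the dataset and is an implementation assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingDiscriminator(nn.Module):
    """Sketch of D(x, z; rho): scores whether z was generated from x."""
    def __init__(self, input_dim, latent_dim=10, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(input_dim + latent_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def mi_loss(disc, x, z):
    """Eqn. 11: positive pairs (x_i, z_i) versus shuffled negative pairs (x_j, z_i)."""
    pos = disc(x, z)
    neg = disc(x[torch.randperm(x.size(0))], z)   # negative samples drawn from the batch
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) +
            F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
```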

Note that $L_{\text{MI}}(x_{i};\phi,\rho)$ cannot replace the original reconstruction loss in Eqn. 3. This is because Eqn. 9 is valid only when $P(X)$ and $Q(X)$ are close. As we maximize the mutual information $I(X,Z)$ during training, $P(X)$ and $Q(X)$ gradually drift apart and Eqn. 9 no longer holds. To avoid this problem, we combine the reconstruction term and the robust embedding discriminator in our method to achieve better clustering performance (Eqn. 1).

method MNIST Fashion-MNIST
ACC NMI ARI ACC NMI ARI
K-means 0.5850 0.4998 0.3652 0.5542 0.5117 0.3477
AE+K-means 0.7147 0.6610 0.5593 0.6240 0.6360 0.4440
VAE+K-means 0.8583 0.7432 0.7249 0.6049 0.5629 0.4082
DEC (Xie, Girshick, and Farhadi 2016) 0.8971 0.8631 0.8322 0.6303 0.6560 0.4987
ClusterGAN (Mukherjee et al. 2019) 0.9596 0.9051 0.9134 0.6374 0.6423 0.5067
VaDE (Jiang et al. 2017) 0.9446* 0.8761 0.8822 0.6383 0.6434 0.4955
LPVDN (Proposed) 0.9716 0.9276 0.9385 0.6837 0.6912 0.5151
Table 1: Evaluation on vision datasets of MNIST and Fashion-MNIST for different algorithms using ACC, NMI and ARI. * is taken from (Jiang et al. 2017), other results are generated by the released code.

Local Structure Modeling $L_{\text{LP}}(x_{i},x_{j};\phi,\psi)$

While the global structure models the latent space based on a clustering distribution (e.g., GMM), such methods may generate unsatisfactory results for data around the cluster boundaries. To solve this problem, we introduce a local structure model that captures pairwise relationships by minimizing the KL-divergence between distance distributions.

Inspired by Maaten and Hinton (2008), we define the following pairwise distance distribution $p_{ij}$ in the latent space:

p_{j|i}=\frac{\exp(-\|\bm{\widetilde{\mu}}_{i}-\bm{\widetilde{\mu}}_{j}\|^{2}/2\eta_{i}^{2})}{\sum_{k\not=i}\exp(-\|\bm{\widetilde{\mu}}_{k}-\bm{\widetilde{\mu}}_{i}\|^{2}/2\eta_{i}^{2})}    (12)
p_{ij}=\frac{p_{j|i}+p_{i|j}}{2n}    (13)

where $\bm{\widetilde{\mu}}_{i}$ and $\bm{\widetilde{\mu}}_{j}$ are the means of the latent variables obtained from Eqn. 5. Here, we select $\bm{\widetilde{\mu}}$ instead of $z$ because it contains much less noise. The term $\eta_{i}$ is a hyper-parameter adjusted according to the data density. One way to compute $\eta_{i}$ is from a pre-defined perplexity value:

\text{Perplexity}(P_{i})=2^{-\sum_{j}p_{j|i}\log_{2}p_{j|i}}    (14)
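One common way to set $\eta_{i}$ from a target perplexity, following standard t-SNE practice, is a per-point binary search; the sketch below is illustrative, and its tolerance and iteration limit are assumptions.

```python
import numpy as np

def eta_from_perplexity(sq_dists_i, target_perplexity=30.0, tol=1e-5, max_iter=50):
    """Binary search for eta_i so that p_{j|i} (Eqn. 12) reaches the desired
    perplexity (Eqn. 14). sq_dists_i holds ||mu_i - mu_j||^2 for all j != i."""
    target_entropy = np.log2(target_perplexity)
    lo, hi, eta = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        p = np.exp(-sq_dists_i / (2.0 * eta ** 2))
        p /= max(p.sum(), 1e-12)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:        # distribution too flat: shrink eta
            hi = eta
            eta = (lo + eta) / 2.0
        else:                               # too peaked: grow eta
            lo = eta
            eta = eta * 2.0 if np.isinf(hi) else (eta + hi) / 2.0
    return eta
```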

In order to encode the pairwise information, we apply a neural network to minimize the KL-divergence of the distance distribution. To be more precise, we first apply a locality mapping network that converts $\bm{\widetilde{\mu}}_{i}$ to $\bm{\widetilde{o}}^{\prime}_{i}$:

\bm{\widetilde{o}}^{\prime}_{i}=f_{lp}(\bm{\widetilde{\mu}}_{i};\psi)    (15)

where $f_{lp}(*;\psi)$ is a neural network parameterized by $\psi$. Once $\bm{\widetilde{o}}^{\prime}_{i}$ is obtained, the pairwise distribution $q_{ij}$ between $\bm{\widetilde{o}}^{\prime}_{i}$ and $\bm{\widetilde{o}}^{\prime}_{j}$ is defined using a Student t-distribution with one degree of freedom,

q_{ij}=\frac{(1+\|\bm{\widetilde{o}}^{\prime}_{i}-\bm{\widetilde{o}}^{\prime}_{j}\|^{2})^{-1}}{\sum_{k}\sum_{l\not=k}(1+\|\bm{\widetilde{o}}^{\prime}_{k}-\bm{\widetilde{o}}^{\prime}_{l}\|^{2})^{-1}}    (16)

The t-distribution is applied due to its computational efficiency and its ability to mitigate the crowding problem (Maaten and Hinton 2008).

Finally, the local structure modeling loss $L_{\text{LP}}(x_{i},x_{j};\phi,\psi)$ is given by the KL-divergence between the two joint probabilities $p_{ij}$ and $q_{ij}$:

L_{\text{LP}}(x_{i},x_{j};\phi,\psi)=\sum_{i\not=j}p_{ij}\log\frac{p_{ij}}{q_{ij}}    (17)

An intuitive interpretation of $L_{\text{LP}}(x_{i},x_{j};\phi,\psi)$ is that it encourages $\bm{\widetilde{o}}^{\prime}_{i}$ and $\bm{\widetilde{o}}^{\prime}_{j}$ to be closer when the pairwise distance between $\bm{\widetilde{\mu}}_{i}$ and $\bm{\widetilde{\mu}}_{j}$ is small, and vice versa. As a result, the final output $\bm{\widetilde{o}}^{\prime}$ not only inherits the global structure information from $\bm{\widetilde{\mu}}$ but also encodes local structure information from the pairwise relationships, and can thus be used as an effective representation for clustering.
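The sketch below illustrates Eqns. 16 and 17 for a mini-batch, assuming the target distribution $p_{ij}$ of Eqn. 13 has been precomputed; masking zero entries of $p$ is a numerical convenience rather than part of the formulation.

```python
import torch

def lp_loss(p, o):
    """p: precomputed pairwise target distribution (Eqn. 13), zero on the diagonal.
    o: outputs of the locality mapping network f_lp (Eqn. 15)."""
    inv = 1.0 / (1.0 + torch.cdist(o, o).pow(2))   # Student t kernel, one degree of freedom
    inv.fill_diagonal_(0.0)                        # exclude i == j
    q = inv / inv.sum()                            # Eqn. 16
    mask = p > 0
    return (p[mask] * (p[mask] / q[mask]).log()).sum()   # Eqn. 17
```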

Experiments

method REUTERS-10k REUTERS
ACC NMI ARI ACC NMI ARI
K-means 0.5403 0.4175 0.2775 0.5328 0.4014 0.2631
AE+K-means 0.7477 0.4714 0.5300 0.7006 0.3330 0.3394
VAE+K-means 0.7421 0.4319 0.5073 0.6922 0.3586 0.3317
DEC (Xie, Girshick, and Farhadi 2016) 0.7229 0.5229 0.5525 0.7082 0.4915 0.5387
ClusterGAN (Mukherjee et al. 2019) 0.4317 0.0271 0.0163 0.4621 0.0303 0.0411
VaDE (Jiang et al. 2017) 0.8092 0.5267 0.5951 0.7938* 0.5591 0.5779
LPVDN (Proposed) 0.8301 0.5857 0.6112 0.8483 0.6553 0.6806
Table 2: Evaluation on textual datasets of REUTERS-10k and REUTERS for different algorithms using ACC, NMI and ARI. * is taken from (Jiang et al. 2017), other results are generated by the released code.

In this section, extensive experiments are conducted to validate the effectiveness of the proposed LPVDN.

Dataset Description

To test the ability of our model to solve general clustering tasks, we do experiments on both vision and textual datasets.

Vision Datasets:

  • MNIST is a widely known dataset containing 70,000 (60,000/10,000) handwritten gray-scale images. There are 10 classes and each image is 28 × 28 pixels.

  • Fashion-MNIST: Fashion-MNIST is a dataset similar to MNIST with the same number of samples and image sizes. Instead of handwritten digits, Fashion-MNIST consists of fashion goods and products which yields a more challenging task.

Textual Datasets:

  • REUTERS: The original REUTERS dataset contains about 810,000 English news stories labeled with categories. Following Huang et al. (2014), we use four root categories during the evaluation, namely, corporate/industrial, government/social, markets, and economics. The documents with multiple root categories are pruned, resulting in a dataset of size 685,071.

  • REUTERS-10k: REUTERS-10k is a subset of REUTERS. Following a similar approach to Huang et al. (2014), we construct a small-scale dataset by sampling 10,000 examples from REUTERS.

Figure 2: ACC, NMI and ARI of the proposed LPVDN with different $\alpha_{0}$ and $\alpha_{1}$ combinations on the MNIST dataset. The performance is stable across different hyper-parameters.

Experiment Setting

For both vision and textual datasets, we combine the training and test sets because of the unsupervised setting. Furthermore, for vision data, we concatenate the pixels to form the input vector without preprocessing. Following the settings of Xie, Girshick, and Farhadi (2016) and Jiang et al. (2017), we set the encoder of the global structure model to $D$-500-500-2000-10 and the decoder to 10-2000-500-500-$D$, where $D$ is the input dimension. The robust discriminator uses the structure $(10+D)$-256-1, where 10 is the dimension of the latent representation. For the local structure model, we adopt a network of 10-256-256-256-10. All layers are fully connected with ReLU as the activation function. We set $\alpha_{0}=1$ and $\alpha_{1}=10^{-4}$ for MNIST and Fashion-MNIST, and $\alpha_{0}=1$ and $\alpha_{1}=10^{-3}$ for REUTERS-10k and REUTERS. The batch size is 1,000. We use the Adam optimizer (Kingma and Ba 2015). We apply K-means on the latent representations (i.e., $\bm{\widetilde{o}}^{\prime}$) to perform clustering.
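A minimal sketch of the four networks under the settings above is given below; doubling the last encoder width to expose the mean and log-variance heads is an implementation assumption.

```python
import torch.nn as nn

def mlp(dims, final_act=None):
    """Fully connected stack with ReLU between hidden layers."""
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(nn.ReLU())
    if final_act is not None:
        layers.append(final_act)
    return nn.Sequential(*layers)

D = 784                                            # e.g. flattened MNIST images
encoder       = mlp([D, 500, 500, 2000, 2 * 10])   # outputs [mu, log sigma^2]
decoder       = mlp([10, 2000, 500, 500, D], final_act=nn.Sigmoid())
discriminator = mlp([10 + D, 256, 1])
local_net     = mlp([10, 256, 256, 256, 10])
```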

method MNIST REUTERS
ACC NMI ARI ACC NMI ARI
$L_{\text{G}}$+$L_{\text{MI}}$+$L_{\text{LP}}$ 0.9716 0.9276 0.9385 0.8483 0.6553 0.6806
$L_{\text{G}}$+$L_{\text{LP}}$ 0.9710 0.9269 0.9373 0.8373 0.6542 0.6762
$L_{\text{G}}$+$L_{\text{MI}}$ 0.9477 0.8910 0.8888 0.8224 0.5969 0.6364
$L_{\text{G}}$ only 0.9446 0.8761 0.8822 0.7938 0.5591 0.5779
Table 3: Ablation study of the proposed LPVDN on MNIST and REUTERS. Both the robust embedding discriminator $L_{\text{MI}}$ and the local structure modeling $L_{\text{LP}}$ contribute to the improvement over using only the global structure modeling $L_{\text{G}}$.

Baseline Methods

We compare the results of our model with several state-of-the-art baseline models. All K-means algorithms mentioned below use 10 different K-means++ initializations (Arthur and Vassilvitskii 2007) and select the run with the lowest average distance.

  • K-means: apply the K-means algorithm to the original data.

  • AE + K-means: A pipeline method that trains an Autoencoder first and then applies the K-means algorithm to the latent representations produced by the Autoencoder.

  • VAE + K-means: A pipeline method that trains a Variational Autoencoder first and then applies K-means to the latent representations.

  • DEC (Xie, Girshick, and Farhadi 2016): Following the experiment in the paper, DEC is first pre-trained with the autoencoder setting, and then iteratively optimized by minimizing the KL-divergence between the distribution of the current soft cluster assignment and its proposed auxiliary target distribution.

  • ClusterGAN (Mukherjee et al. 2019): introduces a variation of GAN with an inverse-mapping network and a clustering-specific loss, resulting in superior results among GAN-based algorithms in the clustering task.

  • VaDE (Jiang et al. 2017): a generative clustering framework combining VAE with GMM prior. It is reported that VaDE achieves the best performance compared with other general AE/VAE based clustering methods.

The settings of the encoder and decoder in AE and VAE are the same as in our proposed model.

Evaluation Metrics

While evaluating the real performance of clustering is arguably a non-trivial problem, we follow mainstream practice and use unsupervised clustering accuracy (ACC), Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI) as the metrics to evaluate the models. In our experiments, we set the number of clusters to the number of ground-truth classes.

ACC

The clustering accuracy (ACC) is defined as:

\text{ACC}=\max_{m}\frac{\sum_{i=1}^{n}\mathbf{1}\{l_{i}=m(c_{i})\}}{n}    (18)

where $l_i$ denotes the ground-truth label and $c_i$ is the output label of the clustering algorithm. The mapping $m$ ranges over all possible one-to-one mappings between clusters and labels.
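In practice, the best one-to-one mapping $m$ can be found with the Hungarian algorithm; the sketch below uses SciPy's linear_sum_assignment and assumes labels are encoded as consecutive integers starting at 0.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels_true, labels_pred):
    """Eqn. 18: accuracy under the best cluster-to-class assignment."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    k = int(max(labels_true.max(), labels_pred.max())) + 1
    count = np.zeros((k, k), dtype=np.int64)        # count[cluster, class]
    for t, p in zip(labels_true, labels_pred):
        count[p, t] += 1
    row, col = linear_sum_assignment(-count)        # maximize matched counts
    return count[row, col].sum() / labels_true.size
```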

NMI

Normalized Mutual Information (NMI) computes a normalized measure of similarity between two label assignments of the same data, which is defined as:

\text{NMI}(U,V)=\frac{2\times\text{MI}(U,V)}{H(U)+H(V)}    (19)

where $H(U)$ and $H(V)$ are the entropies of the label assignments $U$ and $V$, respectively, and $\text{MI}(U,V)$ is the mutual information of the two label assignments.

ARI

The unadjusted Rand Index (RI) is a measure of the similarity between two clusterings, defined as $(a+b)/\binom{n}{2}$, where $n$ is the number of samples, $a$ is the number of pairs that are assigned to the same set in both the ground-truth labels and the output labels, and $b$ is the number of pairs that are assigned to different sets in both. Since the RI score does not guarantee that random labelings get a score close to 0, the Adjusted Rand Index (ARI) is defined as:

\text{ARI}=\frac{\text{RI}-E[\text{RI}]}{\max(\text{RI})-E[\text{RI}]}    (20)

where $E[\text{RI}]$ is the expected RI of random labeling.
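Standard implementations of NMI and ARI are available, for example in scikit-learn; a minimal usage sketch:

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]   # a permuted but otherwise perfect clustering

print(normalized_mutual_info_score(labels_true, labels_pred))  # 1.0 (Eqn. 19)
print(adjusted_rand_score(labels_true, labels_pred))           # 1.0 (Eqn. 20)
```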

Experiment Result

In this section, we explain the experiment results on the vision and textual datasets in detail.

Vision Results:

Table 1 shows the results on the vision datasets, MNIST and Fashion-MNIST.

From the experiment results, we have the following observations. 1) DEC performs better than the simple combination of AE+K-means, which means that modeling the global structure during dimension reduction helps improve clustering performance. 2) Generative methods such as VaDE and ClusterGAN achieve better results than discriminative methods such as DEC, owing to their ability to model the underlying distribution of the data instead of recovering the exact input. 3) Our method outperforms ClusterGAN and VaDE by a significant margin, because our model jointly learns a more robust global and local data structure, while ClusterGAN and VaDE only model the global structure.

Furthermore, we show the clustering performance when varying the hyper-parameters $\alpha_{0}$ and $\alpha_{1}$ on the MNIST dataset (Fig. 2). It can be seen that the performance of our proposed method is relatively stable across different combinations of $\alpha_{0}$ and $\alpha_{1}$.

Textual Results:

Table 2 shows the results on the textual datasets. Our proposed method achieves the best results on all datasets for all metrics, which further verifies the effectiveness of our model for non-image tasks. Notably, ClusterGAN performs badly on the textual datasets. As discussed in the related work, this is because ClusterGAN is sensitive to unbalanced data, which is the case for REUTERS. In comparison, our method makes no assumptions about the prior distribution of the datasets and is thus not influenced by such imbalance.

Ablation Study

In Table 3, we show the results of the ablation study on both the MNIST and REUTERS datasets. As our model contains three parts ($L_{\text{G}}$, $L_{\text{MI}}$, $L_{\text{LP}}$), we design the experiment to study the contribution of each component. We use the method that only models the global structure $L_{\text{G}}$ as the baseline, which is almost equivalent to VaDE. From the results, it can be seen that the local structure component ($L_{\text{LP}}$) is crucial and improves the clustering performance significantly, while the robust embedding discriminator $L_{\text{MI}}$ only improves the performance by a slight margin. This result is in line with our expectation, because the embedding discriminator is better at improving the robustness of the system than its accuracy.

Figure 3: Accuracy on the MNIST dataset corrupted by Gaussian noise with standard deviation $\sigma$. Sample images with the Gaussian noise are shown above.

Robustness under Input Pollution

To verify the robustness of our model, we conduct experiments on the MNIST dataset following Gilmer et al. (2019). In particular, we gradually corrupt the pixels of the input images with independent Gaussian noise $N(0,\sigma^{2})$. Fig. 3 shows that the proposed robust embedding discriminator ($L_{\text{MI}}$) performs significantly better than the method that only uses the global structure ($L_{\text{G}}$ only). In addition, the results show that the local structure ($L_{\text{LP}}$) also contributes to the robustness, especially when the images are seriously corrupted. We also observe that VaDE and ClusterGAN are fragile to input noise, especially ClusterGAN, which collapses even under small pollution ratios.
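The corruption step can be sketched as follows; clipping the noisy pixels back to [0, 1] is our own preprocessing assumption rather than a detail taken from Gilmer et al. (2019).

```python
import numpy as np

def corrupt(images, sigma, seed=0):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to images scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)   # clipping choice is an assumption
```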

Figure 4: The 10 nearest neighbors retrieved using our proposed model, VaDE and the raw data on the MNIST dataset. The picture on the left is the query picture of digit “4”, which looks similar to “9”.

Case Study

In Fig. 4, we show a concrete example by performing an image retrieval task using the latent representations. In particular, we use the representation of an image as the query to retrieve the top $K$ images based on Euclidean distance. Specifically, we extract a digit “4” from the dataset (a difficult example, as it looks like “9”). The top 10 nearest neighbors found by our model are all correct, while the ones obtained using VaDE are a mixture of “4” and “9”. This is because global structure modeling usually focuses on the cluster centroids and on modeling the data distribution around those centroids, so it is difficult to differentiate the data around the cluster boundary. On the other hand, since our model encodes the local pairwise relationships, it incorporates the information of the boundary points and therefore performs better in such situations.
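The retrieval step itself is straightforward; a sketch assuming the latent representations are stored as a NumPy array:

```python
import numpy as np

def retrieve(query_embedding, embeddings, k=10):
    """Indices of the k nearest latent representations by Euclidean distance."""
    dists = np.linalg.norm(embeddings - query_embedding, axis=1)
    return np.argsort(dists)[:k]
```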

Figure 5: Accuracy on the MNIST dataset with different embedding dimension sizes. Our method performs well even when the output dimension is reduced to two, which is a desirable property for data visualization.

Visualization and Dimension Reduction

Figure 6: Visualization of the MNIST test set (left) and the REUTERS-10k dataset (right) with the dimension of the latent representations set to two. The proposed LPVDN shows clear boundaries among clusters.

In Fig. 5 we examine the accuracy of our model on MNIST while adjusting the embedding dimension. The results are quite stable across dimensions. One interesting observation is that the clustering accuracy is almost preserved when we reduce the dimension to 2, which makes our model a promising data visualization tool. In Fig. 6 we show the visualization of the latent representations on the test sets of MNIST and REUTERS-10k, each containing 10,000 samples. Different colors represent the gold labels. Our proposed LPVDN creates a cohesive structure for points of the same cluster, while clearly separating the latent representations of different clusters.

Conclusion

In this paper, we propose a locality preserving variational discriminative network to solve the clustering problem. Because our model captures robust global and local data structures in the latent layer, it outperforms the state-of-the-art baselines on all metrics for the vision and textual benchmark datasets. Moreover, our model exhibits stable clustering performance even when the dimension is reduced to 2, which makes it a promising candidate for other unsupervised applications such as dimension reduction and data visualization.

References

  • Arthur and Vassilvitskii (2007) Arthur, D.; and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035. Society for Industrial and Applied Mathematics.
  • Belhumeur, Hespanha, and Kriegman (1997) Belhumeur, P. N.; Hespanha, J. P.; and Kriegman, D. J. 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis & Machine Intelligence (7): 711–720.
  • Chen et al. (2016) Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, 2172–2180.
  • Donoho and Grimes (2003) Donoho, D. L.; and Grimes, C. 2003. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences 100(10): 5591–5596.
  • Ester et al. (1996) Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X.; et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, 226–231.
  • Ghasedi Dizaji et al. (2017) Ghasedi Dizaji, K.; Herandi, A.; Deng, C.; Cai, W.; and Huang, H. 2017. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE International Conference on Computer Vision, 5736–5745.
  • Gilmer et al. (2019) Gilmer, J.; Ford, N.; Carlini, N.; and Cubuk, E. D. 2019. Adversarial Examples Are a Natural Consequence of Test Error in Noise. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, 2280–2289. URL http://proceedings.mlr.press/v97/gilmer19a.html.
  • Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
  • Harchaoui, Mattei, and Bouveyron (2017) Harchaoui, W.; Mattei, P.-A.; and Bouveyron, C. 2017. Deep adversarial Gaussian mixture auto-encoder for clustering .
  • Hartigan and Wong (1979) Hartigan, J. A.; and Wong, M. A. 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28(1): 100–108.
  • Hjelm et al. (2019) Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2019. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. URL https://openreview.net/forum?id=Bklr3j0cKX.
  • Huang et al. (2014) Huang, P.; Huang, Y.; Wang, W.; and Wang, L. 2014. Deep embedding network for clustering. In 2014 22nd International Conference on Pattern Recognition, 1532–1537. IEEE.
  • Ji et al. (2017) Ji, P.; Zhang, T.; Li, H.; Salzmann, M.; and Reid, I. 2017. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, 24–33.
  • Jiang et al. (2017) Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; and Zhou, H. 2017. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, 1965–1972. doi:10.24963/ijcai.2017/273. URL https://doi.org/10.24963/ijcai.2017/273.
  • Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. URL http://arxiv.org/abs/1412.6980.
  • Kingma and Welling (2014) Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. URL http://arxiv.org/abs/1312.6114.
  • Kohonen (1990) Kohonen, T. 1990. The self-organizing map. Proceedings of the IEEE 78(9): 1464–1480.
  • Maaten and Hinton (2008) Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research 9(Nov): 2579–2605.
  • Mukherjee et al. (2019) Mukherjee, S.; Asnani, H.; Lin, E.; and Kannan, S. 2019. Clustergan: Latent space clustering in generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 4610–4617.
  • Nowozin, Cseke, and Tomioka (2016) Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, 271–279.
  • Springenberg (2016) Springenberg, J. T. 2016. Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. URL http://arxiv.org/abs/1511.06390.
  • Turk and Pentland (1991) Turk, M.; and Pentland, A. 1991. Eigenfaces for recognition. Journal of cognitive neuroscience 3(1): 71–86.
  • Xie, Girshick, and Farhadi (2016) Xie, J.; Girshick, R.; and Farhadi, A. 2016. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, 478–487.
  • Yang et al. (2017) Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3861–3870. JMLR. org.
  • Yang et al. (2019) Yang, X.; Deng, C.; Zheng, F.; Yan, J.; and Liu, W. 2019. Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4066–4075.

Appendix

Theoretical Analysis of $\mathcal{L}_{\text{MI}}$

In this section, we conduct a theoretical analysis of the global robustness term $\mathcal{L}_{\text{MI}}$. First, we show the relation between $\mathcal{L}_{\text{MI}}$ and the margin loss. Using this relation, we replace the initial loss function with the margin loss, which can be written in the form of a penalty function. Second, we explain the effects of $\mathcal{L}_{\text{MI}}$ from the point of view of the penalty function. Last, we explain how $\mathcal{L}_{\text{MI}}$ improves the robustness of the model.

Relation between $\mathcal{L}_{\text{MI}}$ and the margin loss

In this subsection, we analyze the relation between $\mathcal{L}_{\text{MI}}$ and the margin loss.

The form of $\mathcal{L}_{\text{MI}}$ can be written as:

\mathcal{L}_{\text{MI}} = -\mathbb{E}_{x,z\sim q(z|x)p(x)}[\log(\sigma(D(x,z)))]-\mathbb{E}_{x,z\sim q(z)p(x)}[\log(1-\sigma(D(x,z)))]    (21)
= \mathbb{E}_{x,z\sim q(z|x)p(x)}[\log(1+\exp(-D(x,z)))]+\mathbb{E}_{x,z\sim q(z)p(x)}[\log(1+\exp(D(x,z)))]    (22)

The function $f(t)=\log(1+\exp(t))$ can be approximated by a margin loss $g(t)=[t+\gamma]_{+}=\max(0,t+\gamma)$ with $\gamma\geq 0$, which is differentiable almost everywhere. When the model is well learned, namely when the loss $f(t)$ is small enough (i.e., $t<-\gamma$), the Chebyshev distance $D$ between $f$ and $g$ and the Chebyshev distance $D^{\prime}$ between their gradients on this region are

D=\sup_{t<-\gamma}|f(t)-g(t)|=\log(1+\exp(-\gamma));\quad D^{\prime}=\sup_{t<-\gamma}|f^{\prime}(t)-g^{\prime}(t)|=\frac{\exp(-\gamma)}{1+\exp(-\gamma)}    (23)

Therefore, we can choose an appropriate $\gamma$ to ensure that, when we replace $f(t)$ with $g(t)$, the gradients and well-learned loss values are precise enough. For example, when $\gamma=3$, $D=\log(1+e^{-3})=0.049<0.05$ and $D^{\prime}=e^{-3}/(1+e^{-3})=0.047<0.05$, so the well-learned loss values and gradients have bounded errors of less than $0.05$.
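These bounds are easy to check numerically; a quick sketch:

```python
import math

gamma = 3.0
D  = math.log(1 + math.exp(-gamma))             # sup of |f(t) - g(t)| on t < -gamma
Dp = math.exp(-gamma) / (1 + math.exp(-gamma))  # sup of the gradient gap on t < -gamma
print(round(D, 3), round(Dp, 3))                # 0.049 0.047
```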

Equivalence between $\mathcal{L}_{\text{MI}}$ and the penalty function

In this subsection, we analyze the equivalence between $\mathcal{L}_{\text{MI}}$ and the penalty function under certain circumstances, which improves the robustness of the model.

After replacing $f(t)$ with $g(t)$,

\mathcal{L}_{\text{MI}}=\mathbb{E}_{q(z|x)p(x)}[f(-D(x,z))]+\mathbb{E}_{q(z)p(x)}[f(D(x,z))]\approx\mathbb{E}_{q(z|x)p(x)}[[\gamma-D(x,z)]_{+}]+\mathbb{E}_{q(z)p(x)}[[\gamma+D(x,z)]_{+}]    (24)

Let $x_{i},z_{i}\sim q(z|x)p(x)$ ($1\leq i\leq n$) be the positive samples and $x_{j},z_{j}\sim q(z)p(x)$ ($n+1\leq j\leq n+m$) be the negative samples used to train $\mathcal{L}_{\text{MI}}$. When minimizing the total loss function $\mathcal{L}+\lambda\mathcal{L}_{\text{MI}}$, where $\mathcal{L}$ is the sum of the other loss terms, the optimization can be written as:

\min\quad\mathcal{L}+\alpha\sum_{i=1}^{n}[\gamma-D(x_{i},z_{i})]_{+}+\beta\sum_{j=n+1}^{n+m}[\gamma+D(x_{j},z_{j})]_{+}    (25)

which, from the point of view of the penalty function, is equivalent under certain circumstances to the following constrained optimization,

\min\quad\mathcal{L}    (26)
\text{s.t.}\quad[\gamma-D(x_{i},z_{i})]_{+}=0,\quad(1\leq i\leq n)    (27)
\qquad\;\;[\gamma+D(x_{j},z_{j})]_{+}=0,\quad(n+1\leq j\leq n+m)    (28)

Therefore, for $x,z\sim q(z|x)p(x)$, $D(x,z)>\gamma$ holds with high probability, and for $x,z\sim q(z)p(x)$, $D(x,z)<-\gamma$ holds with high probability.

How $\mathcal{L}_{\text{MI}}$ improves the robustness of the model

In this subsection, we show how $\mathcal{L}_{\text{MI}}$ improves the robustness of the model.

First, we define the rationality of a hidden state $z$. Define $X$ as the manifold of real $x$. With a well trained discriminator $D$, $D(x,z)>0$ is the rational region for the distribution $q(z|x)p(x)$. From the point of view of the penalty function, we have

\alpha(x)=\int_{D(x,z)>0}q(z|x)dz\approx\int_{D(x,z)>\gamma}q(z|x)dz    (29)

So for $x$ in the manifold $X$, $\alpha(x)\approx 1$ if $D$ is well trained, and $\alpha(x)\approx 1$ implies that the hidden state generated by $x$ is rational.

Define $x^{\prime}=x+\Delta x$, $x\in X$, as data outside the manifold $X$. When we use a continuous function $q(z|x)$, we will show that the generated hidden state $z^{\prime}$ is still rational with high probability when $\Delta x$ is small.

\alpha(x+\Delta x)=\int_{D(x+\Delta x,z)>0}q(z|x+\Delta x)dz=\int_{D(x,z)+\nabla_{x}D\cdot\Delta x>0}q(z|x)dz+\int_{D(x,z)>0}\nabla_{x}[q(z|x)]\cdot\Delta x\,dz+o(\|\Delta x\|_{2})    (30)

where high-order infinitesimals are ignored, and we show in Corollary 1 that $\int_{D(x,z)>0}\nabla_{x}[q(z|x)]\cdot\Delta x\,dz$ can also be ignored for certain distributions.

Therefore, we only need to consider the first term. Define $m=-\nabla_{x}D\cdot\Delta x$; when $m>0$, $\alpha(x^{\prime})$ tends to get worse.

\alpha(x+\Delta x)=\int_{D(x,z)>0}q(z|x)dz-\int_{D(x,z)\in(0,m)}q(z|x)dz+o(\|\Delta x\|_{2})=\alpha(x)-\int_{D(x,z)\in(0,m)}q(z|x)dz+o(\|\Delta x\|_{2})    (31)

When $m<\gamma$, $\mathcal{L}_{\text{MI}}$ makes it almost impossible for $D(x,z)$ to be less than $\gamma$, so $\int_{D(x,z)\in(0,m)}q(z|x)dz$ is almost 0. Without $\mathcal{L}_{\text{MI}}$, $\alpha(x+\Delta x)$ tends to decrease because $\int_{D(x,z)\in(0,m)}q(z|x)dz$ tends to be greater than 0.

To conclude, $\mathcal{L}_{\text{MI}}$ improves the robustness of the model.

Corollary 1 and its proof

Corollary 1.

For $q(z|x)=\frac{1}{(2\pi)^{d_{z}/2}\prod_{i=0}^{d_{z}-1}\sigma_{i}}\exp\big[-\frac{1}{2}\sum_{i=0}^{d_{z}-1}\frac{(z_{i}-\mu_{i}(x))^{2}}{\sigma_{i}^{2}}\big]$, where $\mu_{i}(x)$ ($i=0,1,\cdots,d_{z}-1$) is a smooth function and $d_{z}$ is the dimension of $z$, the term $\int_{D(x,z)>0}\nabla_{x}[q(z|x)]\cdot\Delta x\,dz$ is a high-order infinitesimal and can be ignored, namely

\int_{D(x,z)>0}\nabla_{x}[q(z|x)]\cdot\Delta x\,dz=o(\|\Delta x\|_{2})    (32)
Proof.

Since $D(x,z)$ measures whether $z$ is generated from $q(z|x)$ with high probability, we may assume that the closed region $\mathcal{A}=\{z:\sum_{i=0}^{d_{z}-1}\frac{(z_{i}-\mu_{i}(x))^{2}}{\sigma_{i}^{2}}<\epsilon^{2}\}$ approximates the region $\{z:D(x,z)>0\}$ with only an $o(1)$ infinitesimal ignored. Therefore,

\int_{D(x,z)>0}\nabla_{x}[q(z|x)]\cdot\Delta x\,dz=\int_{D(x,z)>0}q(z|x)\sum_{i=0}^{d_{z}-1}\Big(\frac{z_{i}-\mu_{i}(x)}{\sigma_{i}^{2}}\mu_{i}^{\prime}(x)\Delta x_{i}\Big)dz    (33)
=\sum_{i=0}^{d_{z}-1}\Big(\int_{D(x,z)>0}q(z|x)(z_{i}-\mu_{i}(x))dz\times\frac{\mu_{i}^{\prime}(x)}{\sigma_{i}^{2}}\Delta x_{i}\Big)=\sum_{i=0}^{d_{z}-1}\Big(\big[\int_{\mathcal{A}}q(z|x)(z_{i}-\mu_{i}(x))dz+o(1)\big]\times\frac{\mu_{i}^{\prime}(x)}{\sigma_{i}^{2}}\Delta x_{i}\Big)    (34)
=\sum_{i=0}^{d_{z}-1}\Big(o(1)\times\frac{\mu_{i}^{\prime}(x)}{\sigma_{i}^{2}}\Delta x_{i}\Big)=o(\|\Delta x\|_{2})    (35)

Experimental Settings

In this section, we provide the details of the experiments in the paper. The experimental settings are shown in Table 4.

 MNIST Fashion-MNIST REUTERS-10k REUTERS
$\alpha_0$ 1 1 $10^{-2}$ 1
$\alpha_1$ $10^{-4}$ $10^{-4}$ $10^{-3}$ $10^{-2}$
batch size 800 800 1000 1000
epoch number 300 300 50 300
optimizer Adam Adam Adam Adam
learning rate $2\times 10^{-3}$ $2\times 10^{-3}$ $2\times 10^{-4}$ $2\times 10^{-4}$
lr-decay 0.95 / 10 epochs 0.95 / 10 epochs 0.95 / 10 epochs 0.95 / 10 epochs
Table 4: Experimental settings.