EXoN: EXplainable encoder Network
Abstract
We propose a new semi-supervised learning method for the Variational AutoEncoder (VAE) that yields a customized and explainable latent space via the EXplainable encoder Network (EXoN). Customization means a manual design of the latent space layout for specific labeled data. To improve the performance of our VAE on a classification task without loss of performance as a generative model, we employ a new semi-supervised classification method called SCI (Soft-label Consistency Interpolation). The classification loss and the Kullback-Leibler divergence play a crucial role in constructing an explainable latent space. The variability of samples generated from our proposed model depends on a specific subspace, called the activated latent subspace. Our numerical results with the MNIST and CIFAR-10 datasets show that EXoN produces an explainable latent space and reduces the cost of investigating representation patterns on the latent space.
1 Introduction
The Variational AutoEncoder (VAE) (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014) aims to construct a well-representing latent space and to recover the original observation faithfully. In general, the probabilistic model used in the VAE is parameterized by neural networks defined as two maps, from the domain of observations to the latent space and vice versa. However, since the marginal likelihood of the VAE is not available in closed form, the maximum likelihood method cannot be applied directly. As an alternative, the variational Bayesian method is popularly applied to the model to maximize the Evidence Lower BOund (ELBO) (Jordan et al. 1999).
Many semi-supervised learning methods for the VAE (Kingma et al. 2014; Maaløe et al. 2016; Siddharth et al. 2017; Maaløe, Fraccaro, and Winther 2017; Li et al. 2019; Feng et al. 2021; Hajimiri, Lotfi, and Baghshah 2021) have been introduced. In particular, (Maaløe, Fraccaro, and Winther 2017; Hajimiri, Lotfi, and Baghshah 2021) applied a mixture prior distribution so that the VAE provides a latent space that is explainable according to labels. However, existing semi-supervised VAE models still have practical limitations.
(Kingma et al. 2014; Maaløe et al. 2016; Siddharth et al. 2017; Li et al. 2019; Feng et al. 2021) introduced an additional latent space representing the labels and assumed probabilistic independence between the label and the other latent variables. While such a model is simply formulated, the trained latent space cannot provide explainable and measurable quantitative information for generating a new image by interpolating between two images. Hence, it is difficult to impose structural restrictions on the latent space, such as the proximity of latent features across some labeled observations. Moreover, the trained latent space differs according to the training process, so the latent space is not explained consistently even with the same dataset (Maaløe, Fraccaro, and Winther 2017; Hajimiri, Lotfi, and Baghshah 2021). For example, it is difficult to obtain information about interpolated images from the latent space before the trained latent space has been entirely investigated with observations. Thus, a manual design of the latent space is required for an explainable representation model.
The current study proposes a new semi-supervised VAE model in which such a manual design is incorporated to improve the clarity of the model. The model employs a Gaussian mixture for the prior and posterior latent distributions (Dilokthanakul et al. 2016; Zheng and Sun 2019; Willetts, Roberts, and Holmes 2019; Mathieu et al. 2019; Guo et al. 2020) and focuses on constructing an explainable encoder of the VAE. In our model, the latent space is customized by assigning each mixture component to a specific label and by specifying the proximity between components. Since the latent space is shared with the decoder and the classifier, each centroid of the mixture distribution is trained as an identifier representing a specific label. Therefore, the latent space is manually decomposed by labels, and a user can obtain a customized, explainable latent space.
In addition, we propose a measure of the representation power of latent variables with which the importance of a latent subspace can be investigated. We find that the encoder of our VAE model selectively activates only a part of the latent space, and that this latent subspace represents the characteristics of the generated samples. Because the activation is measured by a posterior variance linked with the encoder, we call the encoder EXoN (EXplainable encoder Network), borrowing the term exon from gene biology.
This paper is organized as follows. Section 2 briefly introduces Variational AutoEncoder, and Section 3 proposes our VAE model, including its derivation. Section 4 shows the results of numerical simulations. Concluding remarks follow in Section 5.
2 Preliminary
2.1 VAE for Unsupervised Learning
Let $x \in \mathbb{R}^p$ and $z \in \mathbb{R}^d$ with $d < p$ be the observed data and the latent variable, and let $p(x)$ and $p(z)$ denote the probability density functions (pdf) of $x$ and $z$. Let $p(x|z)$ be the conditional pdf of $x$ for a given $z$, modeled as $N\big(x \mid D_\theta(z), \beta \cdot I\big)$, where $I$ is the identity matrix, $\beta > 0$ is the observation noise variance, and $D_\theta$ is a neural network with parameter $\theta$. To emphasize the dependence on parameters, denote $p(x)$ and $p(x|z)$ by $p_\theta(x)$ and $p_\theta(x|z)$, respectively. $p_\theta(x|z)$ is referred to as the decoder of the VAE. The term decoder is also used for the map $D_\theta$, because $p_\theta(x|z)$ is fully parameterized by $D_\theta$.
The parameters of the VAE are estimated by the variational method (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014), which maximizes the ELBO. The ELBO of $\log p_\theta(x)$ is obtained from the inequality
$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q(z|x)}\big[\log p_\theta(x|z)\big] \;-\; KL\big(q(z|x)\,\|\,p(z)\big), \tag{1}$$
where $KL(q\,\|\,p)$ denotes the Kullback-Leibler divergence from $q$ to $p$. (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014) employ a neural network with parameter $\phi$ as $q_\phi(z|x)$ and obtain the objective function under finite samples $x_1, \dots, x_n$,
$$\frac{1}{n}\sum_{i=1}^{n}\Big\{\mathbb{E}_{q_\phi(z|x_i)}\big[\log p_\theta(x_i|z)\big] - KL\big(q_\phi(z|x_i)\,\|\,p(z)\big)\Big\}. \tag{2}$$
Here, $q_\phi(z|x)$ is referred to as the encoder of the VAE or the (approximate) posterior distribution over the latent variables. A multivariate Gaussian distribution is a popular choice for $q_\phi(z|x)$, in which the mean and covariance are given by neural network models. Thus, the parameters of the VAE consist of $\phi$ in the encoder and $\theta$ in the decoder.
In practice, the VAE is fitted by maximizing a stochastically approximated ELBO for (2) with respect to $(\theta, \phi)$. The approximated ELBO is given by
$$\frac{1}{n}\sum_{i=1}^{n}\Big\{-\frac{1}{2\beta}\big\|x_i - D_\theta(z_i)\big\|_2^2 - KL\big(q_\phi(z|x_i)\,\|\,p(z)\big)\Big\} + \text{const.}, \tag{3}$$
where $z_i$ is a sample from $q_\phi(z|x_i)$ and $\|\cdot\|_2$ is the $\ell_2$-norm. The VAE is fitted by maximizing (3) with respect to $(\theta, \phi)$ (Lucas et al. 2019). The first term in (3) is the precision of the generated samples. Since the KL-divergence is always non-negative, (Higgins et al. 2016; Mathieu et al. 2019; Li et al. 2019) explain $\beta$ as a tuning parameter regulating the trade-off between the reconstruction precision and the KL-divergence, and the last term of (3) is considered as a constant.
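As a point of reference for the following sections, the sketch below computes the stochastically approximated ELBO of (3) for one mini-batch. It assumes Keras-style `encoder` and `decoder` models and a single Monte Carlo sample per observation; the function and model names are illustrative assumptions, not the authors' implementation.

```python
import tensorflow as tf

def approximate_elbo(x, encoder, decoder, beta=1.0):
    # Posterior mean and log-variance produced by the (hypothetical) encoder network.
    mean, logvar = encoder(x)
    eps = tf.random.normal(tf.shape(mean))
    z = mean + tf.exp(0.5 * logvar) * eps                     # reparameterization trick
    x_hat = decoder(z)
    # Precision term: squared L2 distance between x and its reconstruction,
    # scaled by the observation-noise parameter beta.
    diff = tf.keras.layers.Flatten()(x - x_hat)
    recon = -tf.reduce_sum(tf.square(diff), axis=1) / (2.0 * beta)
    # Closed-form KL divergence from N(mean, diag(exp(logvar))) to the standard Gaussian prior.
    kl = 0.5 * tf.reduce_sum(tf.exp(logvar) + tf.square(mean) - 1.0 - logvar, axis=1)
    return tf.reduce_mean(recon - kl)
```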
3 Proposed Model
We propose a new VAE model with a customized, explainable latent space in which a conceptual center of the latent distribution for a specific label can be freely assigned. Let $y$ be a discrete random variable (the label) taking one of $K$ values, and denote the joint probability density of $(x, y)$ by $p(x, y)$. In this paper, the high-dimensional observations are only partly labeled by $y$. Denote the sets of indices for the labeled and unlabeled samples by $\mathcal{L}$ and $\mathcal{U}$, respectively.
3.1 Model Assumptions
The prior latent distribution for each label is assumed to be $p(z|y=k) = N\big(\mu_k, \mathrm{diag}(\sigma_k^2)\big)$ for $k = 1, \dots, K$, where $\mathrm{diag}(\sigma_k^2)$ indicates the diagonal matrix whose diagonal elements are $\sigma_k^2$. The prior distribution marginalized over $y$ is
$$p(z) \;=\; \sum_{k=1}^{K} p(y=k)\, N\big(z \mid \mu_k, \mathrm{diag}(\sigma_k^2)\big).$$
Here, $\mu_k$ and $\sigma_k^2$ are pre-determined parameters denoting the conceptual center and dispersion of the latent variable for each label. The choice of $\mu_k$ and $\sigma_k^2$ is the customization of the latent space. Since $\mu_k$ and $\sigma_k^2$ are fixed for all $k$, they are omitted from the notation of $p(z)$.
The encoder is assumed to be a mixture distribution, $q(z|x) = \sum_{k=1}^{K} q(y=k|x)\, q(z|x, y=k)$. Since using the exact posterior is computationally prohibitive, proposal distributions $q(y|x;\eta)$ for $p(y|x)$ and $q(z|x, y;\phi)$ for $p(z|x, y)$ are introduced. The two proposal distributions are multinomial and Gaussian, respectively, and their parameters are modeled by neural networks. The posterior distribution is approximated by
$$q(z|x;\phi,\eta) \;=\; \sum_{k=1}^{K} q(y=k|x;\eta)\, q(z|x, y=k;\phi), \tag{4}$$
where $\phi$ and $\eta$ are the parameters of the Gaussian proposal and the classification networks, respectively. Note that the posterior mixture weights $q(y|x;\eta)$ decompose the latent space according to the labels: given an image and its label, the posterior mixture weight assigns the latent variable to the mixture component corresponding to that label, so the latent space is separated by labels.
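To make the approximation (4) concrete, the sketch below shows one way the encoder could return the mixture parameters: a classifier head for the mixture weights and per-component Gaussian heads for the means and diagonal log-variances. The architecture and all names are illustrative assumptions, not the networks actually used in the experiments (see the appendix tables for those).

```python
import tensorflow as tf

class MixturePosterior(tf.keras.Model):
    """Toy encoder returning the parameters of the mixture posterior in (4)."""

    def __init__(self, latent_dim, num_classes):
        super().__init__()
        self.latent_dim, self.num_classes = latent_dim, num_classes
        self.flatten = tf.keras.layers.Flatten()
        self.backbone = tf.keras.layers.Dense(256, activation="relu")
        self.mean_head = tf.keras.layers.Dense(latent_dim * num_classes)
        self.logvar_head = tf.keras.layers.Dense(latent_dim * num_classes)
        self.classifier = tf.keras.layers.Dense(num_classes, activation="softmax")

    def call(self, x):
        h = self.backbone(self.flatten(x))
        # One Gaussian (mean, diagonal log-variance) per mixture component.
        means = tf.reshape(self.mean_head(h), [-1, self.num_classes, self.latent_dim])
        logvars = tf.reshape(self.logvar_head(h), [-1, self.num_classes, self.latent_dim])
        weights = self.classifier(h)   # posterior mixture weights q(y|x)
        return weights, means, logvars
```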
The decoder is assumed to be
$$p(x|z;\theta) \;=\; N\big(x \mid D_\theta(z), \beta \cdot I\big),$$
where the true label is not included. Even though a natural decoder could be a mixture of Gaussian distributions over the labels, the mixture distribution is approximated by a uni-modal Gaussian distribution, especially when $q(y=k|x;\eta)$ is assumed to be close to 1 for some $k$. The latent space is separated by labels in our encoder, and we can obtain such a classifier. Thus, the validity of the assumption mainly depends on the classification performance of $q(y|x;\eta)$. Although the decoder variance $\beta$ is trainable, we fix it due to computational issues.
3.2 EXoN: Semi-Supervised VAE
Our proposed VAE model is derived from the joint pdf $p(x, y)$, which is decomposed into $p(y|x)$ and $p(x)$. The derivation of the ELBO only for $p(x)$ leads to (5), in which $p(y|x)$ is approximated by the proposal classification model $q(y|x;\eta)$:
$$\log p(x, y) \;\gtrsim\; \mathbb{E}_{q(z|x)}\big[\log p(x|z;\theta)\big] \;-\; KL\big(q(z|x;\phi,\eta)\,\|\,p(z)\big) \;+\; \log q(y|x;\eta). \tag{5}$$
The first two terms in (5) are the typical ELBO used in unsupervised VAE learning, and the remaining term is the classification loss. Note that these terms are coupled through the shared parameter $\eta$. The classification loss term plays the role of training the posterior mixture weights. Therefore, a classifier with a lower classification error can separate the latent space more clearly by labels.
Whenever the label $y$ is not observed,
$$\log p(x) \;\geq\; \mathbb{E}_{q(z|x)}\big[\log p(x|z;\theta)\big] \;-\; \overline{KL}\big(q(z|x;\phi,\eta)\,\|\,p(z)\big),$$
where $\overline{KL}$ is the upper bound of the KL-divergence between the two mixture distributions (Wang et al. 2019; Guo et al. 2020), which is written in closed form as
$$\overline{KL}\big(q(z|x;\phi,\eta)\,\|\,p(z)\big) \;=\; KL\big(q(y|x;\eta)\,\|\,p(y)\big) \;+\; \sum_{k=1}^{K} q(y=k|x;\eta)\, KL\big(q(z|x, y=k;\phi)\,\|\,p(z|y=k)\big).$$
Therefore, the lower bound on the joint likelihood for the entire dataset is
$$\sum_{i \in \mathcal{L} \cup\, \mathcal{U}} \Big\{ \mathbb{E}_{q(z|x_i)}\big[\log p(x_i|z;\theta)\big] - \overline{KL}\big(q(z|x_i;\phi,\eta)\,\|\,p(z)\big) \Big\} \;+\; \sum_{i \in \mathcal{L}} \log q(y_i|x_i;\eta). \tag{6}$$
(Kingma et al. 2014) first employed the classification loss (the second term in (6)) as a penalty function of the VAE, and (Maaløe et al. 2016; Li et al. 2019; Siddharth et al. 2017) used the same penalty function in subsequent papers. In our semi-supervised VAE model, the penalty function is applied with a similar idea; however, unlike the existing studies, the regularized objective function is derived from (5).
3.3 SCI: Soft-label Consistency Interpolation
The derivation of (6) is mathematically plausible, but it does not guarantee state-of-the-art semi-supervised classification performance. To improve the performance on the classification task, we propose a new loss function for pseudo-labeling semi-supervised classification (Iscen et al. 2019; Berthelot et al. 2019; Arazo et al. 2020), called the SCI (Soft-label Consistency Interpolation) loss.
The SCI loss consists of three parts: 1) a new image interpolated from a pair of unlabeled images, 2) the pair of pseudo-labels of the unlabeled images, and 3) a convex combination of the cross-entropies. Let $(x_1, x_2)$ be a pair of images and let $x_\alpha = \alpha x_1 + (1 - \alpha) x_2$, $\alpha \in [0, 1]$, be the interpolated image. Denote the discrete probability produced by the classifier by $q(y|x;\eta)$. The pseudo-labels of $x_1$ and $x_2$ are defined by $q(y|x_1;\hat\eta)$ and $q(y|x_2;\hat\eta)$ for a given estimate $\hat\eta$ (the true label is used for the labeled dataset). Note that the pseudo-label is not a function of $\eta$ but a non-trainable quantity determined by $\hat\eta$. Then the SCI loss for $(x_1, x_2)$ with $\alpha$ is defined by
$$\alpha\, H\big(q(y|x_1;\hat\eta),\, q(y|x_\alpha;\eta)\big) \;+\; (1 - \alpha)\, H\big(q(y|x_2;\hat\eta),\, q(y|x_\alpha;\eta)\big), \tag{7}$$
where $H(p, q)$ is the cross entropy of the distribution $q$ relative to a distribution $p$.
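The sketch below implements our reading of the SCI loss (7): mix two images, compute the classifier output on the mixture, and take the convex combination of the cross-entropies against the frozen pseudo-labels. The `classifier` and `frozen_classifier` names are illustrative; the latter stands for the classifier evaluated at the fixed estimate used for pseudo-labeling.

```python
import tensorflow as tf

def sci_loss(x1, x2, classifier, frozen_classifier, alpha):
    x_mix = alpha * x1 + (1.0 - alpha) * x2                  # interpolated image
    y1 = tf.stop_gradient(frozen_classifier(x1))             # pseudo-label of x1
    y2 = tf.stop_gradient(frozen_classifier(x2))             # pseudo-label of x2
    p_mix = classifier(x_mix)                                 # prediction on the mixture
    cross_entropy = lambda y, p: -tf.reduce_sum(y * tf.math.log(p + 1e-8), axis=-1)
    # Convex combination of the two cross-entropy terms with the same weight alpha.
    return tf.reduce_mean(alpha * cross_entropy(y1, p_mix)
                          + (1.0 - alpha) * cross_entropy(y2, p_mix))
```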
The SCI loss is motivated by the assumption of consistency interpolation.
Definition 1.
The consistency interpolation assumes the existence of a linear map from an image to a pseudo-label. It is well known that such a mix-up strategy can improve the generalization error (Zhang et al. 2017).
Interestingly, the estimation of $\eta$ in our VAE model is also used in other existing semi-supervised learning methods (Feng et al. 2021; Arazo et al. 2020), and the following algorithm provides a general framework to estimate such an $\eta$ in the VAE model. Let $\hat\eta^{(t)}$ be the estimate of $\eta$ obtained at the $t$-th step of training the VAE, and let $\hat\eta^{(t+1)}$ be the solution of the following optimal interpolation problem (Feng et al. 2021):
$$\hat\eta^{(t+1)} \;=\; \arg\min_{\eta}\; KL\Big(\alpha\, q(y|x_1;\hat\eta^{(t)}) + (1 - \alpha)\, q(y|x_2;\hat\eta^{(t)}) \,\Big\|\, q(y|x_\alpha;\eta)\Big). \tag{8}$$
Then, it is easily shown that
$$\begin{aligned} &\alpha\, H\big(q(y|x_1;\hat\eta^{(t)}),\, q(y|x_\alpha;\eta)\big) + (1 - \alpha)\, H\big(q(y|x_2;\hat\eta^{(t)}),\, q(y|x_\alpha;\eta)\big) \\ &\quad= KL\Big(\alpha\, q(y|x_1;\hat\eta^{(t)}) + (1 - \alpha)\, q(y|x_2;\hat\eta^{(t)}) \,\Big\|\, q(y|x_\alpha;\eta)\Big) + \text{const.} \end{aligned} \tag{9}$$
If there exists an $\eta$ such that $q(y|x_\alpha;\eta) = \alpha\, q(y|x_1;\hat\eta^{(t)}) + (1 - \alpha)\, q(y|x_2;\hat\eta^{(t)})$ for any given two data points, then (9) finally implies the consistency interpolation. Since (8) is equivalent to (7) up to a constant, (7) is introduced into our proposed objective function. We also found that using stochastically augmented images for the SCI loss helps our classifier achieve higher accuracy (see Appendix A.4 for the detailed augmentation techniques).
Finally, the objective function is given by
$$\sum_{i \in \mathcal{L} \cup\, \mathcal{U}} \Big\{ \mathbb{E}_{q(z|x_i)}\big[\log p(x_i|z;\theta)\big] - \overline{KL}\big(q(z|x_i;\phi,\eta)\,\|\,p(z)\big) \Big\} \;+\; \lambda_1 \sum_{i \in \mathcal{L}} \log q(y_i|x_i;\eta) \;-\; \lambda_2\, w(t) \sum_{i \in \mathcal{L} \cup\, \mathcal{U}} \mathrm{SCI}\big(x_i, x_{j(i)}, \alpha;\, \eta\big), \tag{10}$$
where $\mathrm{SCI}(\cdot)$ denotes the loss in (7), $x_{j(i)}$ is a randomly chosen sample paired with $x_i$, and $w(\cdot)$ is a ramp-up function. If $\lambda_1$ and $\lambda_2$ are set to be large enough, the mixture components of the posterior are shrunk toward those of the prior according to the true matching labels, and the latent space is well separated. The shared parameter $\eta$ leads to this adaptation of the posterior to the prior.
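The exact ramp-up schedule for the SCI term is an implementation choice (see Appendix A.4); a commonly used Gaussian-shaped ramp in the style of (Laine and Aila 2016), given here only as a plausible sketch, is:

```python
import numpy as np

def ramp_up(epoch, ramp_length):
    # Weight in [0, 1] that increases smoothly and reaches 1 after `ramp_length` epochs.
    t = np.clip(epoch / float(ramp_length), 0.0, 1.0)
    return float(np.exp(-5.0 * (1.0 - t) ** 2))
```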
3.4 Activated Latent Subspace
In this section, we investigate the meaning of the explainability of the latent space. Denote the random vector associated with the distribution of the $k$-th component of the mixture distribution (prior or posterior) by $z_k$. Then, what is the role of the posterior conditional variance in the encoding process?
To answer the above question, consider an extreme case in which the posterior conditional variance of a coordinate of $z_k$ goes to zero for a given observation $x$. In this case, that coordinate becomes a constant given $x$; it has a deterministic relationship with $x$, implying that the coordinate can explain the given data point. Therefore, the value of the posterior conditional variance represents the explainability of the latent space, and we say that such a coordinate is activated.
Interestingly, there is a theorem (Theorem 1) which shows a connection between our objective function (10) and the posterior conditional variance.
Proof.
See Appendix A.1. ∎
Theorem 1 means that the KL-divergence upper bound in our objective function is bounded by the ratio of the prior and posterior variances. The lower bound of Theorem 1 is determined by the ratio of the posterior variance to the prior variance, so it can be interpreted as a coefficient of determination which measures the proportion of latent-variable variation explained by an observation.
On the other hand, a small $\beta$ increases the weight of the generated-sample precision term and indirectly decreases the relative weight of the KL-divergence term in (10). This implies that the KL-divergence term is relaxed (not fully minimized), and the relaxation induces an increase in the upper bound of Theorem 1. It means that there exists an activated coordinate whose posterior conditional variance shrinks toward zero for some observations, because the prior mixture weights and variances are fixed by the pre-design. Thus, Theorem 1 shows that the relaxation of the KL-divergence obtained by tuning $\beta$ is closely related to controlling the explainability of the latent space.
Based on Theorem 1, we propose a statistic, V-nat (VAE-natural unit of information), which screens such activated latent coordinates. Additionally, we define the set of activated latent coordinates, namely the coordinates whose V-nat values exceed a threshold, as the activated latent subspace for the $k$-labeled dataset. Our numerical study found that this subspace represents the informative characteristics of the generated samples and can be effectively used to produce high-quality images (see Section 4.2). These results are consistent with those of VQ-VAE (Van Den Oord, Vinyals et al. 2017), in which the image generation process mainly depends on a nearly deterministic encoding map and refining that map can improve the quality of the generated images.
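The sketch below screens activated coordinates from posterior variances. We assume here that the V-nat of a coordinate is the log-ratio (in nats) of the pre-designed prior variance to the posterior conditional variance, averaged over the images of a class; this is our reading of the statistic and of Theorem 1, so the exact formula should be taken from the paper's definition.

```python
import numpy as np

def activated_subspace(prior_var, post_vars, threshold):
    """prior_var: (d,) pre-designed prior variances; post_vars: (n_images, d)
    posterior conditional variances for one class; returns the activated coordinates."""
    v_nat = np.mean(np.log(prior_var / post_vars), axis=0)   # assumed V-nat per coordinate
    return np.where(v_nat > threshold)[0], v_nat
```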
4 Experiments
4.1 MNIST Dataset
We use the MNIST dataset (LeCun, Cortes, and Burges 2010) with a 2-dimensional latent space in our VAE model. The pixel values are scaled to the range of -1 to 1. The encoder returns the parameters of the 10-component Gaussian mixture distribution: the mixing probabilities, the mean vectors, and the diagonal covariance elements. In particular, the mixing probabilities are produced by the classifier in the encoder. The Gumbel-Max trick (Gumbel 1954; Jang, Gu, and Poole 2016) is used for sampling the discrete latent variables, as sketched below. Detailed network architectures of the encoder, decoder, and classifier are described in Appendix A.2, Tables 4 and 5.
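For reference, a minimal sketch of the Gumbel-Max sampling of the mixture component, together with its Gumbel-Softmax relaxation (Jang, Gu, and Poole 2016), is given below; the temperature value is an illustrative assumption.

```python
import tensorflow as tf

def sample_component(logits, temperature=0.67):
    # Gumbel-Max trick: the argmax of logits perturbed by Gumbel(0, 1) noise is an
    # exact sample from the categorical distribution; the softmax relaxation makes
    # the sampling differentiable for training.
    u = tf.random.uniform(tf.shape(logits), minval=1e-8, maxval=1.0)
    g = -tf.math.log(-tf.math.log(u))                  # Gumbel(0, 1) noise
    hard = tf.argmax(logits + g, axis=-1)              # discrete component index
    soft = tf.nn.softmax((logits + g) / temperature)   # relaxed one-hot sample
    return hard, soft
```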
[Figure 1: the pre-designed conceptual centers of the prior mixture on the 2-dimensional latent space]
The customized prior is illustrated in Figure 1. Each label is assigned counterclockwise, starting from the 3 o'clock position, to a component of the mixture. Note that Figure 1 illustrates one example of conceptual centers. The component of the mixture distribution assigned to digit $k$ corresponds to the distribution of the digit $k$ in the MNIST dataset, for $k = 0, \dots, 9$ (see Appendix A.4 for the detailed pre-design settings).
Our model is compared with (Kingma et al. 2014; Maaløe et al. 2016; Li et al. 2019; Hajimiri, Lotfi, and Baghshah 2021; Feng et al. 2021), and the simulation results show that our fitted model achieves a competitive classification performance, a 3.12% error with 59,900 unlabeled and 100 labeled images (see Appendix A.2, Table 6 for the comparison results). For implementation details, see Appendix A.4.
[Figure 2: latent representations of the test dataset (top) and images generated from grid points on the latent space (bottom) for various values of β]
Effect of Regularizations
First, the role of the tuning parameter $\beta$ in (10) is investigated on the fitted latent space (Lucas et al. 2019). The top panels of Figure 2 display samples from the posterior $q(z|x;\phi,\eta)$, where $x$ is an observation in the test dataset, and the bottom panels show images generated from grid points on the latent space.
The top panels demonstrate that a large $\beta$ regularizes the posterior toward the pre-designed prior by indirectly increasing the weight of the KL-divergence term in the objective function. The bottom panels illustrate that each generated image exactly matches the label defined on the pre-designed latent space. It is also confirmed that the generated images are naturally interpolated on the pre-designed latent space according to our arrangement of conceptual centers. These results show that the proposed VAE yields an explainable latent space with respect to the labels. See Appendix A.2 for additional evaluation results (the negative average single-scale structural similarity (negative SSIM) (Wang et al. 2004; Zheng and Sun 2019), the classification error, and the KL-divergence) with various $\beta$ values.
Diversity of Generated Images
The bottom panels of Figure 2 show that the images are generated with lower diversity for a larger $\beta$. This result can be explained by the mutual information between the observation and the latent variable. The expectation of the KL-divergence upper bound in (10) is rewritten in terms of mutual information in (11).
See Appendix A.1 for detailed derivation.
To maximize (10) under a large $\beta$, (11) should be nearly zero, and the mutual information between the observation and the latent variable must be close to zero for all classes as well. This conditional independence implies that the posterior component does not depend on the observation for each class. For this reason, as shown in the bottom row of Figure 2, the latent mixture component for a specific label is not able to capture the complex pattern of observations belonging to the corresponding label when $\beta$ is large.
Customized Latent Space
[Figure 3: reconstructed images along the interpolation path between latent variables A (label 0) and B (label 1); left: Parted-VAE, right: EXoN]
Figure 3 shows two series of reconstructed images obtained by interpolating on the prior structure. The latent variables A and B are sampled from the mixture components of labels 0 and 1, respectively. Ideally, the interpolation path from point A to B should produce interpolated images with labels 0 and 1 only. However, the left panel of Figure 3 shows reconstructed images whose labels are neither 0 nor 1 in the middle of the interpolation path.
This means that, with Parted-VAE (Hajimiri, Lotfi, and Baghshah 2021), the interpolation path is unpredictable before the trained latent space is investigated. In contrast, the interpolation path with our model consists only of interpolated images with labels 0 and 1, since our latent space can be manually designed (see the right panel of Figure 3). Furthermore, we can pre-determine the interpolation path and the resolution of the interpolation by controlling the proximity between mixture components (see Appendix A.2), as sketched below.
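A minimal sketch of such an interpolation on the pre-designed prior, assuming a decoder callable that maps a batch of latent points to images (names are illustrative):

```python
import numpy as np

def interpolate_on_prior(decoder, mu_a, mu_b, steps=11):
    # Decode equally spaced points on the segment joining two pre-designed
    # conceptual centers (e.g., the components assigned to labels 0 and 1).
    ts = np.linspace(0.0, 1.0, steps)[:, None]
    path = (1.0 - ts) * mu_a[None, :] + ts * mu_b[None, :]
    return decoder(path.astype(np.float32))   # one generated image per point on the path
```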
4.2 CIFAR-10 Dataset
We apply our VAE model to the CIFAR-10 dataset (Krizhevsky, Hinton et al. 2009), whose images have 10 labels. We evaluate our model with 400 labeled images per class and 46,000 unlabeled images. All pixel values of the images are scaled to the range of -1 to 1. A 256-dimensional latent space and a 10-component Gaussian mixture distribution are employed. Each component of the prior mixture distribution has a separate mean vector, and all components share the same covariance (see Appendix A.4 for the detailed pre-designed priors, hyper-parameters, and implementation settings). The Gumbel-Max trick (Gumbel 1954; Jang, Gu, and Poole 2016) is used for sampling the discrete latent variables. The network architectures of the encoder, decoder, and classifier are shown in Appendix A.3, Tables 8 and 9.
Comparison
Table 1: Comparison of classification error and Inception Score on the CIFAR-10 benchmark.

| models | error (%) | Inception Score |
|---|---|---|
| Π-model (4.5M) (Laine and Aila 2016) | 17.53 ± 0.15 | - |
| VAT (1.5M) (Miyato et al. 2018) | 10.55 ± 0.05 | - |
| MixMatch (5.8M) (Berthelot et al. 2019) | 5.91 ± 0.36 | - |
| PLCB (4.5M) (Arazo et al. 2020) | 7.82 ± 0.18 | - |
| M2 (4.5M) (Kingma et al. 2014) | 27.30 ± 0.25 | 1.86 ± 0.02 |
| Parted-VAE (5.8M) (Hajimiri, Lotfi, and Baghshah 2021) | 34.00 ± 1.67 | 1.73 ± 0.20 |
| SHOT-VAE (5.8M) (Feng et al. 2021) | 6.45 ± 0.28 | 3.43 ± 0.05 |
| Ours (4.5M) ($\beta$ = 0.01) | 7.45 ± 0.48 | 3.14 ± 0.05 |
[Figure 4: reconstructed images from interpolation on CIFAR-10, compared with Parted-VAE and SHOT-VAE]
Table 1 shows a quantitative comparison of the classification performance in the CIFAR-10 benchmark setting and of the image generation performance. The classification models (Laine and Aila 2016; Miyato et al. 2018; Berthelot et al. 2019; Arazo et al. 2020) are not generative, so the Inception Score is not computed for them. Although EXoN is not the best in terms of the Inception Score, our model shows remarkable image interpolation quality, as illustrated in Figure 4, which visualizes the reconstructed images from interpolation. The figure indicates that our model achieves much better reconstruction and interpolation performance than the other models (Hajimiri, Lotfi, and Baghshah 2021; Feng et al. 2021) (see Appendix Figure 11 for more reconstructed images from our model).
Activated Latent Subspace
[Figure 5: (a) V-nat values for the automobile class; (b) images generated from the original latent variable and from latent variables perturbed on and off the activated latent subspace]
The properties of the activated latent subspace are investigated. The activated latent subspace is obtained for the automobile class, and the associated V-nat values are displayed in Figure 5(a) (see Appendix A.3 for the other classes). We generate images from three cases of latent variables: 1) the original latent variable, 2) the latent variable perturbed by uniform noise on the activated latent subspace, and 3) the latent variable perturbed by uniform noise on the complement of the activated latent subspace. Figure 5(b) displays examples of the noise values and the generated images.
The left and right images in Figure 5(b) look identical even though their latent variables are far from each other, while the middle image of the bottom panel is distorted compared with the other images. This confirms that only the activated latent subspace determines the features of the generated images. Motivated by these numerical results in Figure 5(b), we manipulate images using only the activated latent subspace (see the results in Appendix A.3); a minimal sketch of the perturbation procedure follows.
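The sketch below reproduces the spirit of this experiment, assuming a NumPy latent vector and a decoder callable; `noise_bound` stands in for the bound of the uniform noise, which is not restated here.

```python
import numpy as np

def perturb_and_decode(decoder, z, active_idx, noise_bound, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-noise_bound, noise_bound, size=z.shape).astype(z.dtype)
    inactive_idx = np.setdiff1d(np.arange(z.shape[-1]), active_idx)
    z_act, z_comp = z.copy(), z.copy()
    z_act[active_idx] += noise[active_idx]        # case 2: perturb the activated subspace
    z_comp[inactive_idx] += noise[inactive_idx]   # case 3: perturb its complement
    return decoder(z_act[None, :]), decoder(z_comp[None, :])
```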
| $\beta$ | 0.01 | 0.05 | 0.1 | 0.5 | 1 |
|---|---|---|---|---|---|
| Inception Score | 3.14 ± 0.05 | 3.36 ± 0.05 | 3.37 ± 0.10 | 3.17 ± 0.08 | 2.86 ± 0.06 |
| | 189.2 ± 22.7 | 114.6 ± 4.22 | 82.8 ± 12.50 | 36.00 ± 5.87 | 22.00 ± 1.10 |
| error (%) | 7.45 ± 0.48 | 7.27 ± 0.19 | 7.51 ± 0.25 | 8.18 ± 0.63 | 9.50 ± 1.07 |
5 Conclusion and Limitations
This paper proposes a new method to construct an explainable latent space in a semi-supervised VAE. The explainable latent space is obtained by combining the classification losses (cross-entropy and SCI loss) with the relaxed KL-divergence term in our objective function. Through the classification, the latent space can be effectively decomposed according to the mixture components and the true labels of the observations. In addition, it is shown that the relaxation of the KL-divergence increases the dimension of the activated latent subspace, which determines the characteristics of the images generated from our proposed model. The activated latent subspace can be discovered through the V-nat measure.
In short, the explainability of our proposed model is demonstrated by two points: 1) the latent distribution has label-specific conceptual means and variances, and 2) the features of given images can be analyzed in association with the activated latent subspace. Thus, the practical advantage of our method is that manually interpolated images suited to the user's specific purpose can be obtained by customizing the prior distribution structure in advance. We conjecture that a clustering method could replace the role of the classification loss in our VAE model, and we leave the development of EXoN for unsupervised learning as future work.
References
- Abadi et al. (2015) Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.
- Arazo et al. (2020) Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N. E.; and McGuinness, K. 2020. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE.
- Berthelot et al. (2019) Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; and Raffel, C. A. 2019. Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems, 32.
- Dilokthanakul et al. (2016) Dilokthanakul, N.; Mediano, P. A.; Garnelo, M.; Lee, M. C.; Salimbeni, H.; Arulkumaran, K.; and Shanahan, M. 2016. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
- Feng et al. (2021) Feng, H.-Z.; Kong, K.; Chen, M.; Zhang, T.; Zhu, M.; and Chen, W. 2021. SHOT-VAE: semi-supervised deep generative models with label-aware ELBO approximations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7413–7421.
- Gumbel (1954) Gumbel, E. J. 1954. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office.
- Guo et al. (2020) Guo, C.; Zhou, J.; Chen, H.; Ying, N.; Zhang, J.; and Zhou, D. 2020. Variational Autoencoder With Optimizing Gaussian Mixture Model Priors. IEEE Access, 8: 43992–44005.
- Hajimiri, Lotfi, and Baghshah (2021) Hajimiri, S.; Lotfi, A.; and Baghshah, M. S. 2021. Semi-Supervised Disentanglement of Class-Related and Class-Independent Factors in VAE. arXiv preprint arXiv:2102.00892.
- Higgins et al. (2016) Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2016. beta-vae: Learning basic visual concepts with a constrained variational framework.
- Ioffe and Szegedy (2015) Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 448–456. PMLR.
- Iscen et al. (2019) Iscen, A.; Tolias, G.; Avrithis, Y.; and Chum, O. 2019. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5070–5079.
- Jang, Gu, and Poole (2016) Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
- Jordan et al. (1999) Jordan, M. I.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1999. An introduction to variational methods for graphical models. Machine learning, 37(2): 183–233.
- Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kingma et al. (2014) Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, 3581–3589.
- Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
- Laine and Aila (2016) Laine, S.; and Aila, T. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
- LeCun, Cortes, and Burges (2010) LeCun, Y.; Cortes, C.; and Burges, C. 2010. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2.
- Li et al. (2019) Li, Y.; Pan, Q.; Wang, S.; Peng, H.; Yang, T.; and Cambria, E. 2019. Disentangled variational auto-encoder for semi-supervised learning. Information Sciences, 482: 73–85.
- Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Lucas et al. (2019) Lucas, J.; Tucker, G.; Grosse, R. B.; and Norouzi, M. 2019. Don’t Blame the ELBO! A Linear VAE Perspective on Posterior Collapse. In Advances in Neural Information Processing Systems, 9408–9418.
- Maaløe, Fraccaro, and Winther (2017) Maaløe, L.; Fraccaro, M.; and Winther, O. 2017. Semi-supervised generation with cluster-aware generative models. arXiv preprint arXiv:1704.00637.
- Maaløe et al. (2016) Maaløe, L.; Sønderby, C. K.; Sønderby, S. K.; and Winther, O. 2016. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473.
- Mathieu et al. (2019) Mathieu, E.; Rainforth, T.; Siddharth, N.; and Teh, Y. W. 2019. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, 4402–4412.
- Miyato et al. (2018) Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8): 1979–1993.
- Rezende, Mohamed, and Wierstra (2014) Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
- Salimans et al. (2016) Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training gans. Advances in neural information processing systems, 29: 2234–2242.
- Siddharth et al. (2017) Siddharth, N.; Paige, B.; Van de Meent, J.-W.; Desmaison, A.; Goodman, N.; Kohli, P.; Wood, F.; and Torr, P. 2017. Learning disentangled representations with semi-supervised deep generative models. In Advances in neural information processing systems, 5925–5935.
- Van Den Oord, Vinyals et al. (2017) Van Den Oord, A.; Vinyals, O.; et al. 2017. Neural discrete representation learning. Advances in neural information processing systems, 30.
- Verma et al. (2019) Verma, V.; Kawaguchi, K.; Lamb, A.; Kannala, J.; Bengio, Y.; and Lopez-Paz, D. 2019. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825.
- Wang et al. (2019) Wang, W.; Gan, Z.; Xu, H.; Zhang, R.; Wang, G.; Shen, D.; Chen, C.; and Carin, L. 2019. Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137.
- Wang et al. (2004) Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4): 600–612.
- Willetts, Roberts, and Holmes (2019) Willetts, M.; Roberts, S.; and Holmes, C. 2019. Disentangling to Cluster: Gaussian Mixture Variational Ladder Autoencoders. arXiv preprint arXiv:1909.11501.
- Zhang et al. (2017) Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
- Zheng and Sun (2019) Zheng, Z.; and Sun, L. 2019. Disentangling latent space for vae by label relevant/irrelevant dimensions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 12192–12201.
Appendix A Appendix
A.1 Theoretical Derivations
Upper bound of the KL-divergence
Proof of Theorem 1
For the following equations, we index the coordinates of the mean and variance vectors by $j$. The closed form of the KL-divergence between two univariate normal distributions leads to a closed form of the KL-divergence from each posterior mixture component to the corresponding prior component,
(13)
Using the above equality (13), the expectation of the second term in (12) can be written as
(14)
where the subscripted quantities are the corresponding elements of the mean and variance vectors. The second equality in the above equations holds under the stated condition.
Let $z_k$ denote a random vector following the distribution of the $k$-th component of the posterior mixture distribution. First, the upper bound of (12) is written as
(15)
by Jensen's inequality.
Similarly, the lower bound of (12) is derived as
(16)
Mutual Information Derivation
Here, we use the standard notation for mutual information, and the expectation of the KL-divergence upper bound (12) is
(19)
We assume that .
(20)
where and .
Therefore,
(21)
where and .
In conclusion, the expectation of the KL-divergence upper bound is written in mutual information terms as
(22)
A.2 MNIST Dataset
Evaluation with various values of the tuning parameter $\beta$
Table 3: Negative SSIM, KL-divergence, and classification error on MNIST for various $\beta$.

| $\beta$ | 0.1 | 0.25 | 0.5 | 0.75 | 1 | 5 | 10 | 50 |
|---|---|---|---|---|---|---|---|---|
| negative SSIM | 0.43 | 0.438 | 0.44 | 0.445 | 0.435 | 0.439 | 0.418 | 0.316 |
| KL-divergence | 19.654 | 9.654 | 8.481 | 7.684 | 6.75 | 4.064 | 2.729 | 1.855 |
| error (%) | 3.29 | 3.23 | 3.25 | 3.33 | 3.46 | 3.74 | 4.07 | 3.38 |
Denote the set of indices for the test dataset by $\mathcal{T}$. The KL-divergence measures how different the trained posterior distribution is from the pre-designed prior distribution and is defined by:
$$\frac{1}{|\mathcal{T}|}\sum_{i \in \mathcal{T}} KL\big(q(z|x_i;\phi,\eta)\,\|\,p(z)\big). \tag{23}$$
The negative average single-scale structural similarity (negative SSIM) of a set of images $S$ is defined by
$$-\frac{2}{|S|\,(|S|-1)} \sum_{i < j} SSIM(x_i, x_j), \quad x_i, x_j \in S, \tag{24}$$
where $SSIM(x_i, x_j)$ is the similarity measure between two images $x_i$ and $x_j$ and takes a value in $[-1, 1]$ (Wang et al. 2004; Zheng and Sun 2019). So, the negative SSIM also takes a value in $[-1, 1]$ and indicates how diverse the images in $S$ are. When $S$ is the set of images generated from a fitted VAE model, the negative SSIM indicates how expressive the model is. We construct $S$ with the images generated from our trained decoder, where each image is produced from one of the 31 × 31 equally spaced grid points on the 2-dimensional latent space.
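A minimal sketch of this diversity score, assuming images scaled to [-1, 1] (hence `max_val=2.0`) and the pairwise averaging convention described above; this is our reading of (24), not the authors' evaluation code:

```python
import itertools
import tensorflow as tf

def negative_ssim(images, max_val=2.0):
    # Average the negated single-scale SSIM over all pairs of generated images;
    # a larger value means a more diverse set of images.
    pairs = itertools.combinations(range(len(images)), 2)
    scores = [tf.image.ssim(images[i], images[j], max_val=max_val) for i, j in pairs]
    return -tf.reduce_mean(tf.stack(scores))
```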
The classification error is given by
$$\frac{1}{|\mathcal{T}|}\sum_{i \in \mathcal{T}} I\Big(y_i \neq \arg\max_{k}\, q(y=k \mid x_i;\eta)\Big), \tag{25}$$
where $I(\cdot)$ is the indicator function. The classification error shows the degree of discrepancy between the label assigned by the posterior probability and the true label of the observation. Thus, a VAE model with a low classification error can separate the latent space into subsets on which the labels of the observations are well identified.
Table 3 indicates that the KL-divergence mostly depends on $\beta$. Because $\beta$ indirectly controls the weight of the KL-divergence loss term, the KL-divergence dominates our objective function for a large $\beta$ (Lucas et al. 2019). We also observe that the diversity of the generated samples (negative SSIM) increases as $\beta$ decreases. For all $\beta$ values, the latent space is clearly separated according to the true labels. This means that the classification performance does not depend on $\beta$, since the weights of the additional classification loss terms are scaled accordingly.
Controlling Proximity and Interpolation
[Figure 6: interpolated images for pre-designed distances of 8, 16, 24, and 32 between the two prior mixture components]
We investigate the patterns of interpolated images under various pre-designs of the proximity of the prior mixture components. We use a subset of the MNIST dataset containing only the labels 0 and 1, so a 2-component Gaussian mixture distribution is used. All Gaussian components have diagonal covariance matrices whose diagonal elements are all 4. We set the location parameters so that the distance between them, which determines the proximity, is 8, 16, 24, or 32. The images are generated from 11 equally spaced points on the line segment between the two location parameters on the 2-dimensional latent space. All other experiment settings are the same as those of the MNIST analysis in Section 4.1. Figure 6 shows that the interpolated images vary more slowly when the two location parameters are farther from each other, so the latent space is effectively adapted to the pre-designed characteristics.
Table 4: Encoder and decoder architectures for MNIST.

| encoder | decoder |
|---|---|
| (28, 28, 1) image | latent variable |
| Flatten | Dense(128), ReLU |
| Dense(256), ReLU | Dense(256), ReLU |
| Dense(128), ReLU | Dense(784), ReLU |
| 2 × K Dense(·), Linear | Reshape(28, 28, 1) |
Table 5: Classifier architecture for MNIST.

| classifier |
|---|
| input: 28 × 28 × 1 image |
| Conv2D(32, 5, 1, 'same'), BN, LeakyReLU(α=0.1) |
| MaxPool2D(pool size=(2, 2), strides=2) |
| SpatialDropout2D(rate=0.5) |
| Conv2D(64, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| MaxPool2D(pool size=(2, 2), strides=2) |
| SpatialDropout2D(rate=0.5) |
| Conv2D(128, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| MaxPool2D(pool size=(2, 2), strides=2) |
| SpatialDropout2D(rate=0.5) |
| GlobalAveragePooling2D |
| Dense(64), BN, ReLU |
| Dense(K), softmax |
A.3 CIFAR-10 Dataset
V-nat of the EXoN
[Figure 7: V-nat vectors for all classes of CIFAR-10]
[Figure 8: correlation matrix between the V-nat vectors of all classes]
Figure 7 shows the V-nat vectors for all classes in the CIFAR-10 dataset, where each element of a V-nat vector is the V-nat value of the corresponding latent coordinate. We found that the activated latent subspaces of the classes (the latent dimensions whose V-nat values exceed the threshold) do not differ significantly from each other.
To show that the variability of the generated images depends on almost the same latent subspace for all classes, the correlation matrix between the V-nat vectors of all classes is shown in Figure 8. The correlation plot indicates that the V-nat vectors are strongly correlated (the correlation coefficients are almost 1), implying that the diversity of the generated images depends on almost the same subspace.
Manipulation on the activated latent subspace
[Figure 9: images generated while varying a single activated latent coordinate from -3 to 3, using the latent axes with the top 5 V-nat values, for each β]
In Figure 9, five series of images are generated from latent variables in which the value of one activated latent axis is varied from -3 to 3 while the values of the other latent dimensions are fixed. From top to bottom, the latent axes having the top 5 largest V-nat values are used for each $\beta$. As the value of each activated latent coordinate changes, different features of the generated samples vary (such as the brightness, the color and shape of the object, or the color of the background).
Note that the characteristics of the images change more significantly when the decoder variance parameter $\beta$ is large, because a single activated latent dimension then determines relatively more image characteristics due to the small cardinality of the activated latent subspace. Disentangling the properties represented by the activated latent subspace is left for future work.
Furthermore, we manipulate the entire activated latent subspace, and the results are visualized in Figure 10 for each $\beta$. The figure shows images generated from latent variables of the automobile class perturbed by uniform noise on the activated latent subspace.
The top-left image of each panel is the reconstructed image given the unperturbed latent variable. As in the single-coordinate manipulation, this figure confirms that a single activated latent dimension determines relatively more characteristics of a given image when the cardinality of the activated latent subspace is small, i.e., for a large $\beta$.
[Figure 10: images generated from latent variables of the automobile class perturbed on the activated latent subspace, for each β]
[Figure 11: additional reconstructed images from our model]
A.4 Experiment settings
We run all experiments using a GeForce RTX 2080 Ti GPU and 16GB RAM, and our experimental code is written with TensorFlow (Abadi et al. 2015). Pre-designs of the prior distribution:
- MNIST: The conceptual centers and variances are pre-determined for each digit; the centers are arranged counterclockwise on the 2-dimensional latent space as illustrated in Figure 1.
- CIFAR-10: The mean vector of each component consists of two subvectors: a 10-dimensional label-relevant mean, which is the one-hot vector for the corresponding label, and a 246-dimensional label-irrelevant mean whose values are all zero (Zheng and Sun 2019). All covariance matrices are commonly given by a diagonal matrix whose diagonal elements all take the same pre-determined value. A minimal sketch of this pre-design is given after this list.
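A minimal sketch of the CIFAR-10 conceptual centers under the description above (the unit scale of the one-hot block follows the phrase "one-hot vector"; the shared variance value is set separately and is not shown here):

```python
import numpy as np

def cifar10_prior_means(num_classes=10, latent_dim=256):
    # One-hot label-relevant block (first 10 coordinates) followed by a zero
    # label-irrelevant block (remaining 246 coordinates) for each class.
    means = np.zeros((num_classes, latent_dim), dtype=np.float32)
    means[np.arange(num_classes), np.arange(num_classes)] = 1.0
    return means
```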
Stochastic image augmentations we used:
- MNIST: random rotation, random cropping
- CIFAR-10: random horizontal flip, random cropping
Table 7: Other experiment settings.

| dataset | epochs | batch size | drop rate | | |
|---|---|---|---|---|---|
| MNIST | 100 | (128, 32) | 0.5 | 6000 | |
| CIFAR-10 | 600 | (128, 32) | 0.1 | 5000 | |
In the implementation, we use the Adam optimizer (Kingma and Ba 2014) for both datasets. The initial learning rate of the MNIST experiments is 0.003, and we decay the learning rate exponentially after epoch 10 with a factor depending on the current epoch number. The initial learning rate of the CIFAR-10 experiments is 0.001, and we multiply the learning rate by 0.1 at epochs 250, 350, 450, and 550; after epoch 550, we decay the learning rate exponentially. For both datasets, we apply decoupled weight decay (Loshchilov and Hutter 2017) with a factor of 0.0005. Other experiment settings are shown in Table 7.
Table 8: Encoder and decoder architectures for CIFAR-10.

| encoder | decoder |
|---|---|
| input: 32 × 32 × 3 image | input: latent variable |
| Conv2D(32, 5, 2, 'same'), BN, ReLU | Dense(4 × 4 × 512), BN, ReLU |
| Conv2D(64, 4, 2, 'same'), BN, ReLU | Reshape(4, 4, 512) |
| Conv2D(128, 4, 2, 'same'), BN, ReLU | Conv2DTranspose(256, 5, 2, 'same'), BN, ReLU |
| Conv2D(256, 4, 1, 'same'), BN, ReLU | Conv2DTranspose(128, 5, 2, 'same'), BN, ReLU |
| Flatten | Conv2DTranspose(64, 5, 2, 'same'), BN, ReLU |
| Dense(1024), BN, Linear | Conv2DTranspose(32, 5, 1, 'same'), BN, ReLU |
| mean: K × Dense(·), Linear | Conv2D(3, 4, 1, 'same'), tanh |
| log-variance: K × Dense(·), Linear | - |
Table 9: Classifier architecture for CIFAR-10.

| classifier |
|---|
| input: 32 × 32 × 3 image |
| Conv2D(128, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| Conv2D(128, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| Conv2D(128, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| MaxPool2D(pool size=(2, 2), strides=2) |
| SpatialDropout2D(rate=0.1) |
| Conv2D(256, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| Conv2D(256, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| Conv2D(256, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| MaxPool2D(pool size=(2, 2), strides=2) |
| SpatialDropout2D(rate=0.1) |
| Conv2D(512, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| Conv2D(256, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| Conv2D(128, 3, 1, 'same'), BN, LeakyReLU(α=0.1) |
| GlobalAveragePooling2D |
| Dense(K), softmax |