
EXoN: EXplainable encoder Network

SeungHwan An1, Hosik Choi2, Jong-June Jeon1
Abstract

We propose a new semi-supervised learning method of Variational AutoEncoder (VAE) which yields a customized and explainable latent space by EXplainable encoder Network (EXoN). Customization means a manual design of latent space layout for specific labeled data. To improve the performance of our VAE in a classification task without the loss of performance as a generative model, we employ a new semi-supervised classification method called SCI (Soft-label Consistency Interpolation). The classification loss and the Kullback-Leibler divergence play a crucial role in constructing explainable latent space. The variability of generated samples from our proposed model depends on a specific subspace, called activated latent subspace. Our numerical results with MNIST and CIFAR-10 datasets show that EXoN produces an explainable latent space and reduces the cost of investigating representation patterns on the latent space.

1 Introduction

Variational AutoEncoder (VAE) (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014) aims to construct a well-representing latent space and to recover the original observation well. In general, the probabilistic model used in the VAE is parameterized by neural networks defined as two maps, from the domain of observations to the latent space and vice versa. However, since the probability model of the VAE is not written in closed form, the maximum likelihood method is unsuitable for the VAE. As an alternative, the variational Bayesian method is popularly applied to the model to maximize the Evidence Lower BOund (ELBO) (Jordan et al. 1999).

Plenty of semi-supervised learning methods for the VAE (Kingma et al. 2014; Maaløe et al. 2016; Siddharth et al. 2017; Maaløe, Fraccaro, and Winther 2017; Li et al. 2019; Feng et al. 2021; Hajimiri, Lotfi, and Baghshah 2021) have been introduced; in particular, (Maaløe, Fraccaro, and Winther 2017; Hajimiri, Lotfi, and Baghshah 2021) applied a mixture prior distribution so that the VAE model provides an explainable latent space according to labels. However, existing semi-supervised VAE models still have practical limitations.

(Kingma et al. 2014; Maaløe et al. 2016; Siddharth et al. 2017; Li et al. 2019; Feng et al. 2021) introduced an additional latent space representing labels and assumed probabilistic independence between the label and the other latent variables. While such a model is simply formulated, the trained latent space cannot provide the explainable and measurable quantitative information needed to generate a new image by interpolating between two images. Hence, it is difficult to impose structural restrictions on the latent space, such as the proximity of latent features across certain labeled observations. Moreover, the trained latent space varies with the training process, so the latent space is not explained consistently even for the same dataset (Maaløe, Fraccaro, and Winther 2017; Hajimiri, Lotfi, and Baghshah 2021). For example, it is difficult to obtain information about interpolated images from the latent space before the trained latent space has been investigated entirely through observations. Thus, a manual design of the latent space is required for an explainable representation model.

The current study proposes a new semi-supervised VAE model in which a manual design of the latent space is incorporated to improve the clarity of the model. The model employs a Gaussian mixture for the prior and posterior latent distributions (Dilokthanakul et al. 2016; Zheng and Sun 2019; Willetts, Roberts, and Holmes 2019; Mathieu et al. 2019; Guo et al. 2020) and focuses on constructing an explainable encoder of the VAE. In our model, the latent space is customized by assigning each mixture component to a specific label and by choosing the proximity between components. Since the latent space is shared with the decoder and classifier, each centroid of the mixture distribution is trained as an identifier representing a specific label. Therefore, the latent space is manually decomposed by labels, and a user can obtain a customized, explainable latent space.

In addition, we propose a measure of the representation power of latent variables with which the importance of a latent subspace can be investigated. We find that the encoder of our VAE model selectively activates only a part of the latent space and that this latent subspace represents the characteristics of generated samples. Because the activation is measured by a posterior variance linked with the encoder, we call the encoder EXoN (EXplainable encoder Network), borrowing the term from gene biology.

This paper is organized as follows. Section 2 briefly introduces Variational AutoEncoder, and Section 3 proposes our VAE model, including its derivation. Section 4 shows the results of numerical simulations. Concluding remarks follow in Section 5.

2 Preliminary

2.1 VAE for Unsupervised Learning

Let ${\bf x}\in\mathcal{X}\subset\mathbb{R}^{m}$ and ${\bf z}\in\mathcal{Z}\subset\mathbb{R}^{d}$ with $d<m$ be the observed data and the latent variable, respectively. $p({\bf x})$ and $p({\bf z})$ denote the probability density functions (pdfs) of ${\bf x}$ and ${\bf z}$. Let $p({\bf x}|{\bf z})$ be the conditional pdf of ${\bf x}$ for a given ${\bf z}$, and assume ${\bf x}|{\bf z}\sim N(D({\bf z}),\beta\cdot{\bf I})$, where $D:\mathcal{Z}\mapsto\mathcal{X}$, ${\bf I}$ is the $m\times m$ identity matrix, and $\beta>0$ is the observation noise. $D({\bf z})$ is a neural network with parameter $\theta$. To emphasize the dependence on parameters, we denote $D({\bf z})$ and $p({\bf x}|{\bf z})$ by $D({\bf z};\theta)$ and $p({\bf x}|{\bf z};\theta,\beta)$, respectively. $p({\bf x}|{\bf z};\theta,\beta)$ is referred to as the decoder of the VAE. The term decoder is also used for the map ${\bf z}\mapsto(D({\bf z};\theta),\beta)$, because $p({\bf x}|{\bf z};\theta,\beta)$ is fully parameterized by $(D({\bf z};\theta),\beta)$.

The parameters of the VAE, $(\theta,\beta)$, are estimated by the variational method (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014), which maximizes the ELBO. The ELBO of $\log p({\bf x};\theta,\beta)$ is obtained from the inequality

\[
\log p({\bf x};\theta,\beta) \geq \mathbb{E}_{q({\bf z}|{\bf x})}[\log p({\bf x}|{\bf z};\theta,\beta)] - \mathcal{KL}\left(q({\bf z}|{\bf x})\|p({\bf z})\right),
\]

where $\mathcal{KL}(q\|p)$ denotes the Kullback-Leibler divergence from $p$ to $q$. (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014) employ a neural network with parameter $\phi$ as $q({\bf z}|{\bf x};\phi)$ and obtain the objective function for finite samples $x_{1},\cdots,x_{n}$,

\[
\sum_{i=1}^{n}\mathbb{E}_{q}[\log p(x_{i}|{\bf z};\theta,\beta)]-\mathcal{KL}\left(q({\bf z}|x_{i};\phi)\|p({\bf z})\right). \tag{2}
\]

Here, $q({\bf z}|{\bf x};\phi)$ is referred to as the encoder of the VAE, or the posterior distribution over the latent variables. A multivariate Gaussian distribution is a popular choice for $q({\bf z}|{\bf x};\phi)$, in which the mean and covariance are given by neural network models. Thus, the parameters of the VAE consist of $\phi$ in the encoder and $(\theta,\beta)$ in the decoder.

In practice, the VAE is fitted by maximizing a stochastically approximated ELBO of (2) with respect to $(\theta,\beta,\phi)$. The approximated ELBO is given by

\[
\sum_{i=1}^{n}\left(-\frac{1}{2}\|x_{i}-D(z_{i};\theta)\|^{2}-\beta\cdot\mathcal{KL}(q({\bf z}|x_{i};\phi)\|p({\bf z}))\right)-\frac{nm}{2}\log 2\pi\beta, \tag{3}
\]

where $z_{i}$ is a sample from $q({\bf z}|x_{i};\phi)$ and $\|\cdot\|$ is the $l_{2}$-norm. The VAE is fitted by maximizing (3) with respect to $(\theta,\beta,\phi)$ (Lucas et al. 2019). The first term in (3) measures the precision of generated samples. Since the KL-divergence is always non-negative, (Higgins et al. 2016; Mathieu et al. 2019; Li et al. 2019) interpret $\beta$ as a tuning parameter regularizing $\theta$ and $\phi$, and the last term of (3) is treated as a constant.
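
As an illustration of (3), the following minimal NumPy sketch computes the approximated objective for one minibatch, assuming the common unsupervised setup in which $q({\bf z}|x;\phi)$ is a diagonal Gaussian and $p({\bf z})$ is the standard normal (so the KL term has a closed form); the function and variable names, and the toy linear decoder standing in for $D(\cdot;\theta)$, are illustrative rather than taken from our implementation.

```python
import numpy as np

def gaussian_kl_to_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def approximate_elbo(x, mu, log_var, decode, beta, rng=np.random.default_rng(0)):
    """Single-sample approximation of objective (3) for a minibatch x of shape (n, m).

    mu, log_var: encoder outputs of shape (n, d); decode: map (n, d) -> (n, m).
    """
    n, m = x.shape
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps                  # reparameterized z_i ~ q(z|x_i)
    recon = -0.5 * np.sum((x - decode(z))**2, axis=-1)    # -1/2 ||x_i - D(z_i)||^2
    kl = gaussian_kl_to_std_normal(mu, log_var)           # KL(q(z|x_i) || p(z))
    return np.sum(recon - beta * kl) - n * m / 2.0 * np.log(2.0 * np.pi * beta)

# toy usage with a random linear decoder
n, m, d = 8, 784, 2
rng = np.random.default_rng(1)
W = rng.standard_normal((d, m)) * 0.01
elbo = approximate_elbo(rng.standard_normal((n, m)),
                        rng.standard_normal((n, d)),
                        np.zeros((n, d)), lambda z: z @ W, beta=1.0)
print(elbo)
```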

3 Proposed Model

We propose a new VAE model with a customized explainable latent space, where a conceptual center of the latent distribution for a specific label can be freely assigned. Let ${\bf y}\in\mathcal{Y}=\{1,\cdots,K\}$ be a discrete random variable and denote the joint probability density of $({\bf x},{\bf y})$ by $p({\bf x},{\bf y})$. In this paper, the high-dimensional observations ${\bf x}$ are partly labeled by ${\bf y}$. Denote the sets of indices of labeled and unlabeled samples by $I_{L}$ and $I_{U}$, respectively.

3.1 Model Assumptions

The prior latent distribution for each label is assumed to be ${\bf z}|{\bf y}=k\sim N(m_{k},diag\{s_{k}^{2}\})$ with $p({\bf y}=k)=w_{k}$ for $k\in\mathcal{Y}$, where $diag\{a\}$, $a\in\mathbb{R}^{d}$, denotes the $d\times d$ diagonal matrix whose diagonal elements are $a$. The prior distribution marginalized over ${\bf y}$ is

\[
p({\bf z})=\sum_{k=1}^{K}w_{k}\cdot\mathcal{N}({\bf z}|m_{k},diag\{s_{k}^{2}\}).
\]

Here, $m_{k}\in\mathbb{R}^{d}$ and $s_{k}^{2}\in\mathbb{R}_{+}^{d}$ are pre-determined parameters denoting the conceptual center and dispersion of the latent variable for each label. The choice of $m_{k}$ and $s_{k}^{2}$ constitutes the customization of the latent space. Since $m_{k}$ and $w_{k}$ are fixed for all $k$, the notation $m_{k}$ and $diag\{s_{k}^{2}\}$ is omitted in $p({\bf z})$.

The encoder $p({\bf z}|{\bf x})$ is assumed to be a mixture distribution, $p({\bf z}|{\bf x})=\sum_{k=1}^{K}p({\bf y}=k|{\bf x})\cdot p({\bf z}|{\bf y}=k,{\bf x})$. Since using $p({\bf z}|{\bf x})$ directly is computationally prohibitive, the proposal distributions $w({\bf y}=k|{\bf x};\eta)$ for $p({\bf y}=k|{\bf x})$ and $g({\bf z}|{\bf y}=k,{\bf x};\xi)=\mathcal{N}({\bf z}|\mu_{k}({\bf x};\xi),diag\{\sigma^{2}_{k}({\bf x};\xi)\})$ for $p({\bf z}|{\bf y}=k,{\bf x})$ are introduced. The two proposal distributions are multinomial and Gaussian, respectively, and their parameters are modeled by neural networks. The posterior distribution is approximated by

\[
q({\bf z}|{\bf x};\eta,\xi)=\sum_{k=1}^{K}w({\bf y}=k|{\bf x};\eta)\cdot g({\bf z}|{\bf y}=k,{\bf x};\xi), \tag{4}
\]

where $\eta$ and $\xi$ are the parameters of the neural networks. Note that the posterior mixture weight $w({\bf y}=k|{\bf x};\eta)$ decomposes the latent space according to labels. Given an image ${\bf x}$ and a label ${\bf y}$, the posterior mixture weight assigns the latent variable to the mixture component corresponding to the given label ${\bf y}$, so the latent space is separated by labels.
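
For intuition, the following sketch samples from the mixture posterior (4): the classifier weight $w({\bf y}=k|{\bf x};\eta)$ picks a component, and the latent variable is drawn from the corresponding Gaussian $g({\bf z}|{\bf y}=k,{\bf x};\xi)$. A plain categorical draw is used here for simplicity; the experiments in Section 4 use the Gumbel-Max trick for the discrete sampling. All names are illustrative.

```python
import numpy as np

def sample_mixture_posterior(weights, mus, log_vars, rng):
    """Draw z ~ q(z|x) = sum_k w_k * N(mu_k(x), diag(sigma_k^2(x)))  (Eq. 4).

    weights: (K,) mixture probabilities w(y=k|x); mus, log_vars: (K, d).
    """
    k = rng.choice(len(weights), p=weights)          # pick a component y = k
    eps = rng.standard_normal(mus.shape[1])
    z = mus[k] + np.exp(0.5 * log_vars[k]) * eps     # z ~ N(mu_k(x), diag(sigma_k^2(x)))
    return k, z

rng = np.random.default_rng(0)
K, d = 10, 2
weights = np.full(K, 1.0 / K)
k, z = sample_mixture_posterior(weights, rng.standard_normal((K, d)), np.zeros((K, d)), rng)
print(k, z)
```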

The decoder is assumed to be

\[
p({\bf x}|{\bf z};\theta,\beta)=\mathcal{N}({\bf x}|D({\bf z};\theta),\beta\cdot{\bf I}),
\]

where the true label ${\bf y}$ is not included. Although a natural decoder would be the Gaussian mixture $p({\bf x}|{\bf z};\theta,\beta)=\sum_{k=1}^{K}p({\bf y}=k|{\bf z})\cdot\mathcal{N}({\bf x}|D_{k}({\bf z};\theta),\beta\cdot{\bf I})$, this mixture is well approximated by a uni-modal Gaussian when $p({\bf y}=k|{\bf z})$ is close to 1 for some $k$. In our encoder, the latent space is separated by $w({\bf y}|{\bf x};\eta)$, so such a $p({\bf y}|{\bf z})$ can be obtained. Thus, the validity of the assumption mainly depends on the classification performance of $w({\bf y}|{\bf x};\eta)$. Although the decoder variance $\beta$ is trainable, we fix it due to computational issues.

3.2 EXoN: Semi-Supervised VAE

Our proposed VAE model is derived from the joint pdf $p({\bf x},{\bf y})$. $p({\bf x},{\bf y})$ is decomposed into $p({\bf x})$ and $p({\bf y}|{\bf x})$, and deriving the ELBO only for $\log p({\bf x})$ leads to (5). In the derivation, $p({\bf y}|{\bf x})$ is approximated by the proposal classification model $w({\bf y}|{\bf x};\eta)$:

\[
\begin{aligned}
\log p({\bf x},{\bf y};\theta,\beta)
= & \ \log p({\bf x};\theta,\beta)+\log p({\bf y}|{\bf x};\theta,\beta) \\
\geq & \ \mathbb{E}_{q}[\log p({\bf x}|{\bf z};\theta,\beta)]-\mathcal{KL}\left(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z})\right)+\log p({\bf y}|{\bf x};\theta,\beta) \\
\simeq & \ \mathbb{E}_{q}[\log p({\bf x}|{\bf z};\theta,\beta)]-\mathcal{KL}\left(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z})\right)+\log w({\bf y}|{\bf x};\eta).
\end{aligned} \tag{5}
\]

The first two terms in (5) are the typical ELBO used in unsupervised VAE learning, and the remaining term is the classification loss. Note that these terms are coupled through the shared parameter $\eta$. The classification loss trains the posterior mixture weights. Therefore, a $w({\bf y}|{\bf x};\eta)$ with lower classification error lets $q({\bf z}|{\bf x};\eta,\xi)$ separate the latent space more clearly.

Whenever $p({\bf y}|{\bf x};\theta,\beta)=w({\bf y}|{\bf x};\eta)$,

\[
\log p({\bf x},{\bf y};\theta,\beta)\geq\mathcal{L}({\bf x};\theta,\beta,\xi,\eta)+\log w({\bf y}|{\bf x};\eta),
\]

where

\[
\mathcal{L}({\bf x};\theta,\beta,\xi,\eta) = \mathbb{E}_{q}[\log p({\bf x}|{\bf z};\theta,\beta)] - \mathcal{KL}^{u}\left(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z})\right)
\]

and $\mathcal{KL}^{u}\left(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z})\right)$ is an upper bound of $\mathcal{KL}\left(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z})\right)$ (Wang et al. 2019; Guo et al. 2020), written in closed form as

\[
\mathcal{KL}(w({\bf y}|{\bf x};\eta)\|p({\bf y})) + \sum_{k=1}^{K}w({\bf y}=k|{\bf x};\eta)\cdot\mathcal{KL}\left(q({\bf z}|{\bf x},{\bf y}=k;\eta,\xi)\|p({\bf z}|{\bf y}=k)\right).
\]

Therefore, the lower bound on the joint likelihood for the entire dataset is

\[
\sum_{i\in I_{L}\cup I_{U}}\mathcal{L}(x_{i};\theta,\beta,\xi,\eta)+\sum_{i\in I_{L}}\log w(y_{i}|x_{i};\eta). \tag{6}
\]

(Kingma et al. 2014) first employed the classification loss (the second term in (6)) as a penalty function of the VAE, and (Maaløe et al. 2016; Li et al. 2019; Siddharth et al. 2017) used the same penalty function in subsequent papers. In our semi-supervised VAE model, the penalty function is applied with a similar idea. However, unlike the existing studies, the regularized objective function is derived from (5).
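
Because $\mathcal{L}$ replaces the intractable mixture-to-mixture KL-divergence with the closed-form upper bound $\mathcal{KL}^{u}$, the bound can be evaluated componentwise: a categorical KL plus the weighted Gaussian KLs. The following is a minimal NumPy sketch of that computation; the function names are illustrative, and the toy circular prior is only a stand-in for a pre-designed $p({\bf z})$.

```python
import numpy as np

def categorical_kl(w, p):
    """KL(w || p) for two categorical distributions (zero-weight terms dropped)."""
    mask = w > 0
    return np.sum(w[mask] * (np.log(w[mask]) - np.log(p[mask])))

def diag_gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) with diagonal covariances."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p)**2) / var_p - 1.0)

def kl_upper_bound(w, mu_q, var_q, prior_w, prior_mu, prior_var):
    """KL^u(q(z|x) || p(z)): categorical KL plus weighted componentwise Gaussian KLs."""
    kl = categorical_kl(w, prior_w)
    for k in range(len(w)):
        kl += w[k] * diag_gaussian_kl(mu_q[k], var_q[k], prior_mu[k], prior_var[k])
    return kl

# toy check: when the posterior matches the prior exactly, the bound is zero
K, d = 10, 2
prior_w = np.full(K, 1.0 / K)
prior_mu = np.stack([4.0 * np.array([np.cos(t), np.sin(t)])
                     for t in 2 * np.pi * np.arange(K) / K])
prior_var = np.ones((K, d))
print(kl_upper_bound(prior_w, prior_mu, prior_var, prior_w, prior_mu, prior_var))  # 0.0
```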

3.3 SCI: Soft-label Consistency Interpolation

The derivation of (6) is mathematically plausible, but it does not guarantee state-of-the-art semi-supervised classification performance. To improve the performance on the classification task, we propose a new loss function for pseudo-labeling semi-supervised classification (Iscen et al. 2019; Berthelot et al. 2019; Arazo et al. 2020), called the SCI (Soft-label Consistency Interpolation) loss.

The SCI loss consists of three parts: 1) a new image interpolated from a pair of unlabeled images, 2) the pseudo-labels of the pair, and 3) a convex combination of cross-entropies. Let $({\bf x}^{1},{\bf x}^{2})$ be a pair of images and let $\tilde{{\bf x}}=\rho\cdot{\bf x}^{1}+(1-\rho)\cdot{\bf x}^{2}$ be the interpolated image. Denote the discrete probability $w({\bf y}|{\bf x};\hat{\eta})$ by $f({\bf x};\hat{\eta})$. The pseudo-labels of ${\bf x}^{1}$ and ${\bf x}^{2}$ are defined by $\tilde{{\bf y}}^{1}=f({\bf x}^{1};\hat{\eta})$ and $\tilde{{\bf y}}^{2}=f({\bf x}^{2};\hat{\eta})$ for a given estimate $\hat{\eta}$ (the true label is used for the labeled dataset). Note that a pseudo-label is not a function of the trainable parameter $\eta$ but a non-trainable quantity determined by $\hat{\eta}$. Then the SCI loss for $\eta$ with $({\bf x}^{1},{\bf x}^{2})$ is defined by

\[
\mbox{SCI}(({\bf x}^{1},\tilde{{\bf y}}^{1}),({\bf x}^{2},\tilde{{\bf y}}^{2});\eta)
= \rho\cdot H(\tilde{{\bf y}}^{1},f(\tilde{{\bf x}};\eta))+(1-\rho)\cdot H(\tilde{{\bf y}}^{2},f(\tilde{{\bf x}};\eta)), \tag{7}
\]

where $H(p,q)$ is the cross entropy of the distribution $q$ relative to the distribution $p$.
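
A minimal sketch of the SCI loss for one pair of images follows; the softmax classifier below stands in for $w({\bf y}|{\bf x};\eta)$, and the pseudo-labels are computed once with the current estimate and treated as constants, as described above. All names and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_k p_k log q_k."""
    return -np.sum(p * np.log(q + eps))

def sci_loss(x1, x2, y1_tilde, y2_tilde, rho, classifier):
    """Soft-label Consistency Interpolation loss (7) for one pair of images.

    y1_tilde, y2_tilde are pseudo-labels f(x^1; eta_hat), f(x^2; eta_hat)
    computed beforehand and treated as constants.
    """
    x_tilde = rho * x1 + (1.0 - rho) * x2                 # interpolated image
    f_mix = classifier(x_tilde)                           # f(x_tilde; eta)
    return rho * cross_entropy(y1_tilde, f_mix) + (1.0 - rho) * cross_entropy(y2_tilde, f_mix)

# toy usage with a fixed softmax classifier over 10 classes
rng = np.random.default_rng(0)
W = rng.standard_normal((784, 10)) * 0.01
softmax = lambda s: np.exp(s - s.max()) / np.exp(s - s.max()).sum()
clf = lambda x: softmax(x @ W)
x1, x2 = rng.standard_normal(784), rng.standard_normal(784)
print(sci_loss(x1, x2, clf(x1), clf(x2), rho=0.3, classifier=clf))
```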

The SCI loss is motivated by the assumption of consistency interpolation.

Definition 1.

Given two data points ${\bf x}^{1},{\bf x}^{2}$, if

\[
f(\rho\cdot{\bf x}^{1}+(1-\rho)\cdot{\bf x}^{2};\eta^{*})
= \rho\cdot f({\bf x}^{1};\eta^{*})+(1-\rho)\cdot f({\bf x}^{2};\eta^{*})
\]

for all $\rho\in[0,1]$, then the consistency interpolation (Zhang et al. 2017; Verma et al. 2019) is satisfied for the parameter $\eta^{*}$.

The consistency interpolation assumes the existence of a linear map ff from an image to a pseudo-label. It is well known that such a mix-up strategy can improve the generalization error (Zhang et al. 2017).

Interestingly, the estimation of $f(\cdot;\eta^{*})$ in our VAE model is also used in other existing semi-supervised learning methods (Feng et al. 2021; Arazo et al. 2020), and the following algorithm provides a general framework to estimate such an $f(\cdot;\eta^{*})$ in the VAE model. Let $\eta^{(t)}$ be the estimate of $\eta$ obtained at the $t$th step of training the VAE, and let $\eta^{(t+1)}$ be the solution of the following optimal interpolation problem (Feng et al. 2021):

\[
\min_{\eta}\ \rho\cdot\mathcal{KL}\left(f({\bf x}^{1};\eta^{(t)})\|f(\tilde{{\bf x}};\eta)\right) + (1-\rho)\cdot\mathcal{KL}\left(f({\bf x}^{2};\eta^{(t)})\|f(\tilde{{\bf x}};\eta)\right). \tag{8}
\]

Then, it is easily shown that

\[
f(\rho\cdot{\bf x}^{1}+(1-\rho)\cdot{\bf x}^{2};\eta^{(t+1)}) = \rho\cdot f({\bf x}^{1};\eta^{(t)})+(1-\rho)\cdot f({\bf x}^{2};\eta^{(t)}). \tag{9}
\]

If there exists $\eta^{*}$ such that $\eta^{(t)}\rightarrow\eta^{*}$ as $t\rightarrow\infty$ for the two given data points ${\bf x}^{1},{\bf x}^{2}$, then (9) implies the consistency interpolation. Since (8) is equivalent to the SCI loss (7) up to a constant, (7) is introduced into our proposed objective function. We also found that using stochastically augmented images for the SCI loss helps our classifier achieve higher accuracy (see Appendix A.4 for the detailed augmentation techniques).

Finally, the objective function is given by

\[
\begin{aligned}
& -\sum_{i\in I_{L}\cup I_{U}}\mathcal{L}(x_{i};\theta,\beta,\xi,\eta)-\sum_{i\in I_{L}}\log w(y_{i}|x_{i};\eta)
-\frac{\lambda_{1}}{\beta}\sum_{i\in I_{L}}\log w(y_{i}|x_{i};\eta) \\
& +\frac{\lambda_{1}}{\beta}\sum_{i\in I_{L}}\mbox{SCI}((x^{1}_{i},y^{1}_{i}),(x^{2}_{i},y^{2}_{i});\eta)
+\frac{\lambda_{2}(t)}{\beta}\sum_{i\in I_{U}}\mbox{SCI}((x^{1}_{i},\tilde{y}^{1}_{i}),(x^{2}_{i},\tilde{y}^{2}_{i});\eta),
\end{aligned} \tag{10}
\]

where $x_{i}^{1}=x_{i}$, $x_{i}^{2}$ is a randomly chosen sample, and $\lambda_{2}(t)$ is a ramp-up function. If $\lambda_{1}$ and $\lambda_{2}(t)$ are set large enough, the mixture components of $q({\bf z}|{\bf x};\eta,\xi)$ are shrunk toward those of $p({\bf z})$ according to the true matching labels, and the latent space is well separated. The shared parameter $\eta$ leads to this adaptation of $q({\bf z}|{\bf x};\eta,\xi)$ to $p({\bf z})$.
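
Schematically, the per-term quantities can be accumulated into the objective (10) as in the following sketch; the function and argument names are illustrative, the per-term values are assumed to be computed as in the sketches above, and the optimizer, augmentation, and ramp-up schedule used in our experiments are described in Appendix A.4.

```python
def exon_objective(L_terms, ce_labeled, sci_labeled, sci_unlabeled,
                   lam1, lam2_t, beta):
    """Accumulate the loss (10): negative ELBO terms, labeled cross-entropy,
    and the SCI penalties for labeled and unlabeled pairs.

    L_terms: list of L(x_i; theta, beta, xi, eta) over the labeled and unlabeled indices
    ce_labeled: list of -log w(y_i | x_i; eta) over the labeled indices
    sci_labeled / sci_unlabeled: lists of SCI losses over labeled / unlabeled pairs
    """
    loss = -sum(L_terms)
    loss += sum(ce_labeled)                          # - sum_{I_L} log w(y_i|x_i; eta)
    loss += (lam1 / beta) * sum(ce_labeled)          # - (lambda_1/beta) sum_{I_L} log w
    loss += (lam1 / beta) * sum(sci_labeled)         # + (lambda_1/beta) sum_{I_L} SCI
    loss += (lam2_t / beta) * sum(sci_unlabeled)     # + (lambda_2(t)/beta) sum_{I_U} SCI
    return loss

print(exon_objective([-1.2, -0.9], [0.4], [0.2], [0.5, 0.3], lam1=1.0, lam2_t=0.5, beta=0.1))
```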

3.4 Activated Latent Subspace

In this section, we investigate the meaning of the explainability of the latent space. Denote by ${\bf z}^{k}=({\bf z}_{1}^{k},\cdots,{\bf z}_{d}^{k})$ the random vector associated with the distribution of the $k$th component of the mixture distribution (prior or posterior). Then, what is the role of the posterior conditional variance $\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)$ in the encoding process?

To answer this question, consider an extreme case where $\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)$ goes to zero for a given $k$. If $\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)\rightarrow 0$, then ${\bf z}_{j}^{k}$ becomes constant given ${\bf x}$: ${\bf z}_{j}^{k}$ has a deterministic relationship with the given ${\bf x}$, implying that ${\bf z}_{j}^{k}$ can explain the given data point. Therefore, the value of $\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)$ represents the explainability of the latent space, and we say that the $j$th coordinate is activated.

Interestingly, the following theorem shows a connection between our objective function (10) and the posterior conditional variance $\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)$.

Theorem 1.

Let $p({\bf z})$ be the mixture prior distribution defined in Section 3.1 and let $q({\bf z}|{\bf x};\eta,\xi)$ be given by (4). Then

\[
\begin{aligned}
\frac{d}{2}-\sum_{k=1}^{K}\sum_{j=1}^{d}\frac{w_{k}}{2}\,\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\frac{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}{\mbox{Var}({\bf z}_{j}^{k})}\right]
\leq & \ \mathbb{E}_{{\bf x}}\left[\sum_{k=1}^{K}w({\bf y}=k|{\bf x};\eta)\cdot\mathcal{KL}\left(q({\bf z}|{\bf x},{\bf y}=k;\eta,\xi)\|p({\bf z}|{\bf y}=k)\right)\right] \\
\leq & \ \sum_{k=1}^{K}\sum_{j=1}^{d}\frac{w_{k}}{2}\,\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\frac{\mbox{Var}({\bf z}_{j}^{k})}{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}\right]-\frac{d}{2}.
\end{aligned}
\]
Proof.

See Appendix A.1. ∎

Theorem 1 means that the KL-divergence upper bound in our objective function is bounded by the ratios of the prior and posterior variances. The lower bound of Theorem 1 can be rewritten as $\sum_{k=1}^{K}\sum_{j=1}^{d}(w_{k}/2)\cdot\mbox{Var}_{{\bf x}|{\bf y}=k}[\mathbb{E}({\bf z}_{j}^{k}|{\bf x};\xi)]/\mbox{Var}({\bf z}_{j}^{k})$, so it can be interpreted as the coefficient of determination $R^{2}$, which measures the proportion of latent-variable variation explained by an observation.

On the other hand, a small $\beta$ increases the weight of the generated-sample precision term and indirectly decreases the relative weight of the KL-divergence term in (10). This implies that the KL-divergence term is relaxed (not fully minimized), and the relaxation induces an increase in the upper bound of Theorem 1. Since the prior mixture weight $w_{k}$ and the variance $\mbox{Var}({\bf z}_{j}^{k})$ are fixed by the pre-design, this means that there exist activated coordinates $j$ for which $\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)$ shrinks toward zero for some $k$. Thus, Theorem 1 shows that the relaxation of the KL-divergence obtained by tuning $\beta$ is closely related to controlling the explainability of the latent space for ${\bf x}$.

Based on Theorem 1, we propose a statistic, V-nat (VAE-natural unit of information), which screens such activated latent coordinates:

\[
\log\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\mbox{Var}({\bf z}_{j}^{k})/\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)\right].
\]

Additionally, for some $\delta>0$, we define the set of activated latent coordinates as the activated latent subspace for the $k$-labeled dataset,

\[
{\cal A}_{k}(\delta)=\bigg\{j\in[d]:\log\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\frac{\mbox{Var}({\bf z}_{j}^{k})}{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}\right]>\delta\bigg\},
\]

where $[d]=\{1,\cdots,d\}$. Our numerical study found that this subspace represents the informative characteristics of generated samples and can be effectively used to produce high-quality images (see Section 4.2). These results are consistent with those of VQ-VAE (Van Den Oord, Vinyals et al. 2017), where the image generation process mainly depends on a nearly deterministic encoding map and refining the map can improve the quality of generated images.
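
A sketch of estimating the V-nat statistic and ${\cal A}_{k}(\delta)$ from encoder outputs follows; it assumes the prior per-coordinate variances are the pre-designed $s_{k}^{2}$ and that the posterior variances $\sigma_{k}^{2}({\bf x};\xi)$ have been collected over the $k$-labeled samples, with the expectation replaced by a sample mean. The names and toy values are illustrative.

```python
import numpy as np

def v_nat(prior_var_k, post_var_k):
    """V-nat_j = log E_{x|y=k}[ Var(z_j^k) / Var(z_j^k | x) ] for each coordinate j.

    prior_var_k: (d,) pre-designed prior variances s_k^2.
    post_var_k: (n_k, d) posterior variances sigma_k^2(x) over k-labeled samples.
    """
    return np.log(np.mean(prior_var_k / post_var_k, axis=0))

def activated_subspace(prior_var_k, post_var_k, delta=1.0):
    """A_k(delta) = { j : V-nat_j > delta }."""
    return np.flatnonzero(v_nat(prior_var_k, post_var_k) > delta)

# toy example: coordinates whose posterior variance collapses are activated
d, n_k = 8, 100
prior_var = np.ones(d)
post_var = np.ones((n_k, d))
post_var[:, :3] = 0.05                     # first three coordinates nearly deterministic
print(activated_subspace(prior_var, post_var, delta=1.0))   # -> [0 1 2]
```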

4 Experiments

4.1 MNIST Dataset

We use the MNIST dataset (LeCun, Cortes, and Burges 2010) with a 2-dimensional latent space in our VAE model. The pixel values are scaled to the range of $-1$ to $1$. The encoder returns the parameters of a 10-component Gaussian mixture distribution: the mixing probabilities, the mean vectors, and the diagonal covariance elements. Thus, the encoder maps ${\cal X}$ to $\left((0,1)\times\mathbb{R}^{2}\times\mathbb{R}_{+}^{2}\right)^{10}$. In particular, the mixing probabilities are produced by the classifier in the encoder. The Gumbel-Max trick (Gumbel 1954; Jang, Gu, and Poole 2016) is used for sampling discrete latent variables. Detailed network architectures of the encoder, decoder and classifier are described in Appendix A.2, Tables 4 and 5.

Figure 1: Scatter plot of samples from $p({\bf z})$.

The customized $p({\bf z})$ is illustrated in Figure 1. Each label is assigned counterclockwise, starting from 3 o'clock, to a component of the mixture. Note that Figure 1 illustrates one example of conceptual centers. The $k$th component of the mixture distribution corresponds to the distribution of the digit $(k-1)$ in the MNIST dataset, for $k=1,\cdots,10$ (see Appendix A.4 for the detailed pre-design settings).
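
For concreteness, a sketch of this kind of pre-designed prior is given below: ten conceptual centers placed counterclockwise on a circle in the 2-dimensional latent space, one per digit, with a common dispersion. The radius and variance used here are illustrative; the exact values of our experiments are listed in Appendix A.4.

```python
import numpy as np

K, d = 10, 2
radius, s2 = 6.0, 1.0                                  # illustrative radius and dispersion
angles = 2 * np.pi * np.arange(K) / K                  # counterclockwise from 3 o'clock
m = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)   # conceptual centers m_k
s_sq = np.full((K, d), s2)                             # pre-set diagonal variances s_k^2
w = np.full(K, 1.0 / K)                                # mixture weights w_k

def prior_density(z):
    """p(z) = sum_k w_k * N(z | m_k, diag(s_k^2)) on the 2-d latent space."""
    diff = z - m                                       # (K, d)
    comp = np.exp(-0.5 * np.sum(diff**2 / s_sq, axis=1)) / \
           np.prod(np.sqrt(2 * np.pi * s_sq), axis=1)
    return np.sum(w * comp)

print(prior_density(m[0]))                             # density at the center of digit 0
```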

Our model is compared with (Kingma et al. 2014; Maaløe et al. 2016; Li et al. 2019; Hajimiri, Lotfi, and Baghshah 2021; Feng et al. 2021), and the simulation results show that our fitted model achieves competitive classification performance: a 3.12% error with 59,900 unlabeled and 100 labeled images (see Appendix A.2, Table 6 for the comparison results). For implementation details, see Appendix A.4.

Figure 2: Top row: scatter plot of $z\sim q({\bf z}|x;\eta,\xi)$ given the test dataset. Bottom row: generated images from grid points on the latent space.

Effect of Regularizations

First, the role of the tuning parameter $\beta$ in (10) is investigated on the fitted latent space (Lucas et al. 2019). The top panels of Figure 2 display samples from $q({\bf z}|x_{i};\eta,\xi)$, where $x_{i}$ is an observation in the test dataset, and the bottom panels show images generated from grid points on the latent space.

The top panels demonstrate that a large $\beta$ regularizes $q({\bf z}|x;\eta,\xi)$ toward $p({\bf z})$ by indirectly increasing the weight of the KL-divergence term in the objective function. The bottom panels illustrate that each generated image exactly matches the label defined on the pre-designed latent space. It is also confirmed that generated images are naturally interpolated on the pre-designed latent space according to our arrangement of conceptual centers. These results show that the proposed VAE yields an explainable latent space with the labels. See Appendix A.2 for additional evaluation results (the negative average single-scale structural similarity (negative SSIM) (Wang et al. 2004; Zheng and Sun 2019), the classification error, and the KL-divergence) with various $\beta$ values.

Diversity of Generated Images

The bottom panels of Figure 2 show that the images are generated with lower diversity for larger $\beta$. This result can be explained by the mutual information $I(\cdot,\cdot)$. The expectation of $\mathcal{KL}^{u}(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z}))$ in (10) is rewritten in terms of mutual information as

\[
\begin{aligned}
\mathbb{E}_{p({\bf x})}\left[\mathcal{KL}^{u}(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z}))\right]
= & \ I({\bf x},{\bf y};\eta)+\mathcal{KL}(w({\bf y};\eta)\|p({\bf y})) \\
& +\mathbb{E}_{p({\bf y})}[I({\bf x},{\bf z}|{\bf y};\eta,\xi)]
+\mathbb{E}_{p({\bf y})}[\mathcal{KL}(q({\bf z}|{\bf y};\eta,\xi)\|p({\bf z}|{\bf y}))].
\end{aligned} \tag{11}
\]

See Appendix A.1 for detailed derivation.

To optimize (10) under a large $\beta$, (11) should be nearly zero, and the mutual information between ${\bf x}|{\bf y}$ and ${\bf z}|{\bf y}$ must be close to zero for all classes ${\bf y}$ as well. The conditional independence of ${\bf x}$ and ${\bf z}$ implies that $D({\bf z};\theta)$ does not depend on ${\bf z}$ for each ${\bf y}$, because ${\bf x}|{\bf z}\sim N(D({\bf z};\theta),\beta\cdot{\bf I})$ is assumed. For this reason, as shown in the bottom row of Figure 2, the latent mixture component for a specific label cannot capture the complex pattern of the observations belonging to the corresponding label when $\beta$ is large.

Customized Latent Space

Figure 3: The interpolation path from point A (label 0) to B (label 1) and reconstructed images. (a) Parted-VAE (Hajimiri, Lotfi, and Baghshah 2021). (b) The EXoN ($\beta=5$).

Figure 3 shows two series of reconstructed images from interpolation on the prior structure. The latent variables A and B are sampled from the mixture components of labels 0 and 1, respectively. One would expect the interpolation path from point A to B to produce interpolated images only with labels 0 and 1. However, the left panel of Figure 3 shows reconstructed images whose labels are neither 0 nor 1 in the middle of the interpolation path.

This means that, with Parted-VAE (Hajimiri, Lotfi, and Baghshah 2021), the interpolation path is unpredictable before training. In contrast, the interpolation path of our model consists only of interpolated images with labels 0 and 1, since our latent space can be manually designed (see the right panel of Figure 3). Furthermore, we can pre-determine the interpolation path and the resolution of the interpolation by controlling the proximity between mixture components (see Appendix A.2).

4.2 CIFAR-10 Dataset

We apply our VAE model to the CIFAR-10 dataset (Krizhevsky, Hinton et al. 2009), whose images have 10 labels. We evaluate our model with 400 labeled images per class and 46,000 unlabeled images. All pixel values are scaled to the range $(-1,1)$. A 256-dimensional latent space and a 10-component Gaussian mixture distribution are employed. Each component of the prior mixture distribution has a separate mean vector, and all components share the same covariance (see Appendix A.4 for the detailed pre-designed priors, hyper-parameters, and implementation settings). The Gumbel-Max trick (Gumbel 1954; Jang, Gu, and Poole 2016) is used for sampling discrete latent variables. The network architectures of the encoder, decoder and classifier are shown in Appendix A.3, Tables 8 and 9.

Comparison

models | error (%) | Inception Score
$\Pi$-model (4.5M) (Laine and Aila 2016) | 17.53 ± 0.15 | -
VAT (1.5M) (Miyato et al. 2018) | 10.55 ± 0.05 | -
MixMatch (5.8M) (Berthelot et al. 2019) | 5.91 ± 0.36 | -
PLCB (4.5M) (Arazo et al. 2020) | 7.82 ± 0.18 | -
M2 (4.5M) (Kingma et al. 2014) | 27.30 ± 0.25 | 1.86 ± 0.02
Parted-VAE (5.8M) (Hajimiri, Lotfi, and Baghshah 2021) | 34.00 ± 1.67 | 1.73 ± 0.20
SHOT-VAE (5.8M) (Feng et al. 2021) | 6.45 ± 0.28 | 3.43 ± 0.05
Ours (4.5M) ($\beta=0.01$) | 7.45 ± 0.48 | 3.14 ± 0.05
Table 1: Mean and standard deviation of the test dataset classification error with 4,000 labeled samples and of the Inception Score, over 5 replicates of the experiment. The numbers in parentheses are the numbers of classifier parameters. The Inception Scores are computed using 10,000 generated images. * means that the result is taken from the original paper.
Figure 4: Series of images obtained by interpolating two points on the latent space (or subspace), produced by various models. The leftmost and rightmost images are the originals. (a) SHOT-VAE (between same classes). (b) EXoN ($\beta=0.05$) (between same classes). (c) Parted-VAE (between different classes). (d) EXoN ($\beta=0.05$) (between different classes).

Table 1 shows the quantitative comparison results for classification performance in the CIFAR-10 benchmark setting and for image generation performance. The classification models (Laine and Aila 2016; Miyato et al. 2018; Berthelot et al. 2019; Arazo et al. 2020) are not generative, so the Inception Score is not computed for them. Although the EXoN is not the best in the Inception Score, our model has remarkable image interpolation quality, as shown in Figure 4, which visualizes the reconstructed images from interpolation. The figure indicates that our model achieves much better recovery and interpolation performance than the other models (Hajimiri, Lotfi, and Baghshah 2021; Feng et al. 2021) (see Appendix Figure 11 for more images reconstructed by our model).

Activated Latent Subspace

Figure 5: EXoN ($\beta=0.05$). (a) Plot of V-nat values for the automobile class. (b) Top row: visualization of the noise added to the latent variable; bottom row: generated images from the corresponding perturbed latent variables.

The properties of the activated latent subspace are investigated. For $\beta=0.05$, a latent space with $|{\cal A}_{k}(1)|=115$ is obtained for the automobile class. The associated V-nat values are displayed in Figure 5(a) (see Appendix A.3 for the other classes). We generate images from three kinds of latent variables: 1) the original latent variable, 2) the latent variable perturbed by uniform noise $U(-2,2)$ on ${\cal A}_{k}(1)$, and 3) the latent variable perturbed by uniform noise $U(-2,2)$ on ${\cal A}^{c}_{k}(1)$. Figure 5(b) displays examples of noise values and generated images.

The left and right images in Figure 5(b) seem identical even though their latent variables are far from each other. In contrast, the middle image of the bottom panel is distorted compared with the other images, which confirms that only the activated latent subspace determines the features of generated images. Motivated by these numerical results, we manipulated images using only the activated latent subspace (see the results in Appendix A.3).
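
The perturbation experiment of Figure 5(b) can be sketched as follows: uniform $U(-2,2)$ noise is added either only on the activated coordinates ${\cal A}_{k}(1)$ or only on their complement, and both latent variables are decoded. The index set and the (commented-out) decoder call are stand-ins for the quantities obtained from the trained model.

```python
import numpy as np

def perturb(z, coords, rng, low=-2.0, high=2.0):
    """Return a copy of z with U(low, high) noise added on the given coordinates."""
    z_new = z.copy()
    z_new[coords] += rng.uniform(low, high, size=len(coords))
    return z_new

rng = np.random.default_rng(0)
d = 256
z = rng.standard_normal(d)                       # latent variable of an automobile image
activated = np.arange(115)                       # stand-in for A_k(1), |A_k(1)| = 115
inactive = np.setdiff1d(np.arange(d), activated)

z_on_activated = perturb(z, activated, rng)      # changes the generated image
z_on_inactive = perturb(z, inactive, rng)        # leaves the generated image nearly intact
# images = decode(np.stack([z, z_on_activated, z_on_inactive]))  # decoder of the trained model
print(np.linalg.norm(z - z_on_activated), np.linalg.norm(z - z_on_inactive))
```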

$\beta$ | 0.01 | 0.05 | 0.1 | 0.5 | 1
Inception Score | 3.14 ± 0.05 | 3.36 ± 0.05 | 3.37 ± 0.10 | 3.17 ± 0.08 | 2.86 ± 0.06
$|{\cal A}_{k}(1)|$ | 189.2 ± 22.7 | 114.6 ± 4.22 | 82.8 ± 12.50 | 36.00 ± 5.87 | 22.00 ± 1.10
error (%) | 7.45 ± 0.48 | 7.27 ± 0.19 | 7.51 ± 0.25 | 8.18 ± 0.63 | 9.50 ± 1.07
Table 2: For various $\beta$ values, mean and standard deviation of the Inception Score (Salimans et al. 2016), of the cardinality of the activated latent subspace $|{\cal A}_{k}(1)|$ for the automobile class, and of the test dataset classification error with 4,000 labeled samples, over 5 replicates of the experiment.

Theorem 1 theoretically supports that the relaxation of the KL-divergence (shrinking $\beta$) leads to a large set ${\cal A}_{k}(1)$. Meanwhile, (11) implies that strong regularization of the KL-divergence (increasing $\beta$) decreases $\mathbb{E}_{p({\bf y})}[I({\bf x},{\bf z}|{\bf y};\eta,\xi)]$, which results in a small set ${\cal A}_{k}(1)$. These theoretical results are confirmed numerically in Table 2.

5 Conclusion and Limitations

This paper proposes a new method to construct an explainable latent space in a semi-supervised VAE. The explainable latent space is obtained by combining the classification losses (cross-entropy and SCI loss) and the relaxed KL-divergence term in our objective function. Through the classification, the latent space is effectively decomposed according to the mixture components and the true labels of the observations. In addition, it is shown that the relaxation of the KL-divergence increases the dimension of the activated latent subspace, which determines the characteristics of the images generated from our proposed model. The activated latent subspace can be discovered through the V-nat measure.

In short, the explainability of our proposed model rests on two points: 1) the label-specific conceptual means and variances of the latent distribution, and 2) the features of given images can be analyzed in association with the activated latent subspace. Thus, the practical advantage of our method is that manually interpolated images serving the user's specific purpose can be obtained by customizing the prior distribution structure in advance. We conjecture that a clustering method could replace the role of the classification loss in our VAE model, and we leave the development of the EXoN for unsupervised learning as future work.

References

  • Abadi et al. (2015) Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.
  • Arazo et al. (2020) Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N. E.; and McGuinness, K. 2020. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE.
  • Berthelot et al. (2019) Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; and Raffel, C. A. 2019. Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems, 32.
  • Dilokthanakul et al. (2016) Dilokthanakul, N.; Mediano, P. A.; Garnelo, M.; Lee, M. C.; Salimbeni, H.; Arulkumaran, K.; and Shanahan, M. 2016. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
  • Feng et al. (2021) Feng, H.-Z.; Kong, K.; Chen, M.; Zhang, T.; Zhu, M.; and Chen, W. 2021. SHOT-VAE: semi-supervised deep generative models with label-aware ELBO approximations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7413–7421.
  • Gumbel (1954) Gumbel, E. J. 1954. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office.
  • Guo et al. (2020) Guo, C.; Zhou, J.; Chen, H.; Ying, N.; Zhang, J.; and Zhou, D. 2020. Variational Autoencoder With Optimizing Gaussian Mixture Model Priors. IEEE Access, 8: 43992–44005.
  • Hajimiri, Lotfi, and Baghshah (2021) Hajimiri, S.; Lotfi, A.; and Baghshah, M. S. 2021. Semi-Supervised Disentanglement of Class-Related and Class-Independent Factors in VAE. arXiv preprint arXiv:2102.00892.
  • Higgins et al. (2016) Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2016. beta-vae: Learning basic visual concepts with a constrained variational framework.
  • Ioffe and Szegedy (2015) Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 448–456. PMLR.
  • Iscen et al. (2019) Iscen, A.; Tolias, G.; Avrithis, Y.; and Chum, O. 2019. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5070–5079.
  • Jang, Gu, and Poole (2016) Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • Jordan et al. (1999) Jordan, M. I.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1999. An introduction to variational methods for graphical models. Machine learning, 37(2): 183–233.
  • Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kingma et al. (2014) Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, 3581–3589.
  • Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
  • Laine and Aila (2016) Laine, S.; and Aila, T. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
  • LeCun, Cortes, and Burges (2010) LeCun, Y.; Cortes, C.; and Burges, C. 2010. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2.
  • Li et al. (2019) Li, Y.; Pan, Q.; Wang, S.; Peng, H.; Yang, T.; and Cambria, E. 2019. Disentangled variational auto-encoder for semi-supervised learning. Information Sciences, 482: 73–85.
  • Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Lucas et al. (2019) Lucas, J.; Tucker, G.; Grosse, R. B.; and Norouzi, M. 2019. Don’t Blame the ELBO! A Linear VAE Perspective on Posterior Collapse. In Advances in Neural Information Processing Systems, 9408–9418.
  • Maaløe, Fraccaro, and Winther (2017) Maaløe, L.; Fraccaro, M.; and Winther, O. 2017. Semi-supervised generation with cluster-aware generative models. arXiv preprint arXiv:1704.00637.
  • Maaløe et al. (2016) Maaløe, L.; Sønderby, C. K.; Sønderby, S. K.; and Winther, O. 2016. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473.
  • Mathieu et al. (2019) Mathieu, E.; Rainforth, T.; Siddharth, N.; and Teh, Y. W. 2019. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, 4402–4412.
  • Miyato et al. (2018) Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8): 1979–1993.
  • Rezende, Mohamed, and Wierstra (2014) Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
  • Salimans et al. (2016) Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training gans. Advances in neural information processing systems, 29: 2234–2242.
  • Siddharth et al. (2017) Siddharth, N.; Paige, B.; Van de Meent, J.-W.; Desmaison, A.; Goodman, N.; Kohli, P.; Wood, F.; and Torr, P. 2017. Learning disentangled representations with semi-supervised deep generative models. In Advances in neural information processing systems, 5925–5935.
  • Van Den Oord, Vinyals et al. (2017) Van Den Oord, A.; Vinyals, O.; et al. 2017. Neural discrete representation learning. Advances in neural information processing systems, 30.
  • Verma et al. (2019) Verma, V.; Kawaguchi, K.; Lamb, A.; Kannala, J.; Bengio, Y.; and Lopez-Paz, D. 2019. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825.
  • Wang et al. (2019) Wang, W.; Gan, Z.; Xu, H.; Zhang, R.; Wang, G.; Shen, D.; Chen, C.; and Carin, L. 2019. Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137.
  • Wang et al. (2004) Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4): 600–612.
  • Willetts, Roberts, and Holmes (2019) Willetts, M.; Roberts, S.; and Holmes, C. 2019. Disentangling to Cluster: Gaussian Mixture Variational Ladder Autoencoders. arXiv preprint arXiv:1909.11501.
  • Zhang et al. (2017) Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
  • Zheng and Sun (2019) Zheng, Z.; and Sun, L. 2019. Disentangling latent space for vae by label relevant/irrelevant dimensions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 12192–12201.

Appendix A Appendix

A.1 Theoretical Derivations

Upper bound of the KL-divergence

We use the upper bound of the KL-divergence from $p({\bf z})$ to $q({\bf z}|{\bf x};\eta,\xi)$ (Wang et al. 2019; Guo et al. 2020), given by

\[
\begin{aligned}
\mathcal{KL}^{u}\left(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z})\right)
\equiv & \ \mathcal{KL}(w({\bf y}|{\bf x};\eta)\|p({\bf y})) \\
& +\sum_{k=1}^{K}w({\bf y}=k|{\bf x};\eta)\cdot\mathcal{KL}\Big(\mathcal{N}({\bf z}|\mu_{k}({\bf x};\xi),diag\{(\sigma_{k}({\bf x};\xi))^{2}\})\,\Big\|\,\mathcal{N}({\bf z}|\mu_{k}^{0},diag\{(\sigma_{k}^{0})^{2}\})\Big).
\end{aligned} \tag{12}
\]

Proof of Theorem 1

Let ${\bf z}=({\bf z}_{1},\cdots,{\bf z}_{d})\sim N(\mu,diag\{(\sigma)^{2}\})$ and ${\bf z}|{\bf x}\sim N(\mu({\bf x}),diag\{(\sigma({\bf x}))^{2}\})$. For clarity in the following equations, we denote the $j$th elements of $\mu$, $\mu({\bf x})$, $(\sigma)^{2}$ and $(\sigma({\bf x}))^{2}$ by $\mathbb{E}({\bf z}_{j})$, $\mathbb{E}({\bf z}_{j}|{\bf x})$, $\mbox{Var}({\bf z}_{j})$ and $\mbox{Var}({\bf z}_{j}|{\bf x})$, respectively. The closed form of the KL-divergence between two univariate normal distributions leads to a closed form of the KL-divergence from $N(\mu,diag\{(\sigma)^{2}\})$ to $N(\mu({\bf x}),diag\{(\sigma({\bf x}))^{2}\})$:

\[
\begin{aligned}
& \mathbb{E}_{{\bf x}}\Big[\mathcal{KL}\Big(N(\mu({\bf x}),diag\{(\sigma({\bf x}))^{2}\})\,\Big\|\,N(\mu,diag\{(\sigma)^{2}\})\Big)\Big] \\
= & \ \mathbb{E}_{{\bf x}}\Big[\frac{1}{2}\sum_{j=1}^{d}\Big(\frac{[\mathbb{E}({\bf z}_{j}|{\bf x})-\mathbb{E}({\bf z}_{j})]^{2}}{\mbox{Var}({\bf z}_{j})}+\frac{\mbox{Var}({\bf z}_{j}|{\bf x})}{\mbox{Var}({\bf z}_{j})}+\log\frac{\mbox{Var}({\bf z}_{j})}{\mbox{Var}({\bf z}_{j}|{\bf x})}-1\Big)\Big] \\
= & \ \frac{1}{2}\sum_{j=1}^{d}\Big(\frac{\mbox{Var}_{{\bf x}}[\mathbb{E}({\bf z}_{j}|{\bf x})]}{\mbox{Var}({\bf z}_{j})}+\frac{\mathbb{E}_{{\bf x}}[\mbox{Var}({\bf z}_{j}|{\bf x})]}{\mbox{Var}({\bf z}_{j})}+\mathbb{E}_{{\bf x}}\left[\log\frac{\mbox{Var}({\bf z}_{j})}{\mbox{Var}({\bf z}_{j}|{\bf x})}\right]\Big)-\frac{d}{2} \\
= & \ \frac{1}{2}\sum_{j=1}^{d}\left(\frac{\mbox{Var}({\bf z}_{j})}{\mbox{Var}({\bf z}_{j})}+\mathbb{E}_{{\bf x}}\left[\log\frac{\mbox{Var}({\bf z}_{j})}{\mbox{Var}({\bf z}_{j}|{\bf x})}\right]\right)-\frac{d}{2} \\
= & \ \frac{1}{2}\sum_{j=1}^{d}\mathbb{E}_{{\bf x}}\left[\log\frac{\mbox{Var}({\bf z}_{j})}{\mbox{Var}({\bf z}_{j}|{\bf x})}\right].
\end{aligned} \tag{13}
\]

Using the above equality (13), the expectation of the second term in (12) can be written as

\[
\begin{aligned}
& \sum_{k=1}^{K}\mathbb{E}_{{\bf x}}\Big[w({\bf y}=k|{\bf x};\eta)\cdot\mathcal{KL}\Big(\mathcal{N}({\bf z}|\mu_{k}({\bf x};\xi),diag\{(\sigma_{k}({\bf x};\xi))^{2}\})\,\Big\|\,\mathcal{N}({\bf z}|\mu_{k}^{0},diag\{(\sigma_{k}^{0})^{2}\})\Big)\Big] \\
= & \ \sum_{k=1}^{K}\int_{{\bf x}}p({\bf x})w({\bf y}=k|{\bf x};\eta)\cdot\mathcal{KL}\Big(\mathcal{N}({\bf z}|\mu_{k}({\bf x};\xi),diag\{(\sigma_{k}({\bf x};\xi))^{2}\})\,\Big\|\,\mathcal{N}({\bf z}|\mu_{k}^{0},diag\{(\sigma_{k}^{0})^{2}\})\Big)\,d{\bf x} \\
= & \ \sum_{k=1}^{K}\int_{{\bf x}}p({\bf y}=k)p({\bf x}|{\bf y}=k;\eta)\cdot\mathcal{KL}\Big(\mathcal{N}({\bf z}|\mu_{k}({\bf x};\xi),diag\{(\sigma_{k}({\bf x};\xi))^{2}\})\,\Big\|\,\mathcal{N}({\bf z}|\mu_{k}^{0},diag\{(\sigma_{k}^{0})^{2}\})\Big)\,d{\bf x} \\
= & \ \sum_{k=1}^{K}w_{k}^{0}\cdot\mathbb{E}_{{\bf x}|{\bf y}=k}\Big[\mathcal{KL}\Big(\mathcal{N}({\bf z}|\mu_{k}({\bf x};\xi),diag\{(\sigma_{k}({\bf x};\xi))^{2}\})\,\Big\|\,\mathcal{N}({\bf z}|\mu_{k}^{0},diag\{(\sigma_{k}^{0})^{2}\})\Big)\Big] \\
= & \ \frac{1}{2}\sum_{k=1}^{K}w_{k}^{0}\sum_{j=1}^{d}\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\log\frac{(\sigma^{0}_{kj})^{2}}{(\sigma_{kj}({\bf x};\xi))^{2}}\right],
\end{aligned} \tag{14}
\]

where $\sigma^{0}_{kj}$ and $\sigma_{kj}({\bf x};\xi)$ are the $j$th elements of $\sigma_{k}^{0}$ and $\sigma_{k}({\bf x};\xi)$, respectively. The second equality above holds because $p({\bf y}|{\bf x};\eta)=w({\bf y}|{\bf x};\eta)$.

Let ${\bf z}^{k}=({\bf z}_{1}^{k},\cdots,{\bf z}_{d}^{k})\sim N(\mu_{k}^{0},diag\{(\sigma_{k}^{0})^{2}\})$ follow the distribution of the $k$th component of the mixture distribution $p({\bf z})$. First, the upper bound of (12) is written as

\[
\begin{aligned}
\mathbb{E}_{{\bf x}}\,\mathcal{KL}^{u}\left(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z})\right)
= & \ \mathbb{E}_{{\bf x}}\,\mathcal{KL}(w({\bf y}|{\bf x};\eta)\|p({\bf y}))
+\frac{1}{2}\sum_{k=1}^{K}w_{k}^{0}\sum_{j=1}^{d}\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\log\frac{\mbox{Var}({\bf z}_{j}^{k})}{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}\right] \\
\leq & \ \mathbb{E}_{{\bf x}}\,\mathcal{KL}(w({\bf y}|{\bf x};\eta)\|p({\bf y}))
+\sum_{k=1}^{K}\sum_{j=1}^{d}\frac{w_{k}^{0}}{2}\log\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\frac{\mbox{Var}({\bf z}_{j}^{k})}{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}\right] \\
\leq & \ \mathbb{E}_{{\bf x}}\,\mathcal{KL}(w({\bf y}|{\bf x};\eta)\|p({\bf y}))
+\sum_{k=1}^{K}\sum_{j=1}^{d}\frac{w_{k}^{0}}{2}\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\frac{\mbox{Var}({\bf z}_{j}^{k})}{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}\right]-\frac{d}{2},
\end{aligned} \tag{15}
\]

by Jensen's inequality and $\log x\leq x-1$.

Similarly, the lower bound of (12) is derived as

\[
\begin{aligned}
\mathbb{E}_{{\bf x}}\,\mathcal{KL}^{u}\left(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z})\right)
= & \ \mathbb{E}_{{\bf x}}\,\mathcal{KL}(w({\bf y}|{\bf x};\eta)\|p({\bf y}))
-\frac{1}{2}\sum_{k=1}^{K}w_{k}^{0}\sum_{j=1}^{d}\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\log\frac{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}{\mbox{Var}({\bf z}_{j}^{k})}\right] \\
\geq & \ \mathbb{E}_{{\bf x}}\,\mathcal{KL}(w({\bf y}|{\bf x};\eta)\|p({\bf y}))
-\sum_{k=1}^{K}\sum_{j=1}^{d}\frac{w_{k}^{0}}{2}\log\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\frac{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}{\mbox{Var}({\bf z}_{j}^{k})}\right] \\
\geq & \ \mathbb{E}_{{\bf x}}\,\mathcal{KL}(w({\bf y}|{\bf x};\eta)\|p({\bf y}))
-\sum_{k=1}^{K}\sum_{j=1}^{d}\frac{w_{k}^{0}}{2}\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\frac{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}{\mbox{Var}({\bf z}_{j}^{k})}\right]+\frac{d}{2}.
\end{aligned} \tag{16}
\]

Therefore, combining (15) and (16),

\[
\begin{aligned}
\frac{d}{2}-\sum_{k=1}^{K}\sum_{j=1}^{d}\frac{w_{k}^{0}}{2}\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\frac{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}{\mbox{Var}({\bf z}_{j}^{k})}\right]
\leq & \ \mathbb{E}_{{\bf x}}\,\mathcal{KL}^{u}\left(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z})\right)-\mathbb{E}_{{\bf x}}\,\mathcal{KL}(w({\bf y}|{\bf x};\eta)\|p({\bf y})) \\
\leq & \ \sum_{k=1}^{K}\sum_{j=1}^{d}\frac{w_{k}^{0}}{2}\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\frac{\mbox{Var}({\bf z}_{j}^{k})}{\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)}\right]-\frac{d}{2},
\end{aligned} \tag{18}
\]

which completes the proof.

Mutual Information Derivation

Here, we denote the mutual information by $I(\cdot,\cdot)$. The expectation of the KL-divergence upper bound (12) is

\[
\begin{aligned}
\mathbb{E}_{p({\bf x})}\,\mathcal{KL}^{u}\left(q({\bf z}|{\bf x};\eta,\xi)\|p({\bf z})\right)
= & \ \mathbb{E}_{p({\bf x})}\left[\mathcal{KL}(w({\bf y}|{\bf x};\eta)\|p({\bf y}))\right] \\
& +\mathbb{E}_{p({\bf x})}\Big[\sum_{k=1}^{K}w({\bf y}=k|{\bf x};\eta)\cdot\mathcal{KL}\left(q({\bf z}|{\bf x},{\bf y}=k;\xi)\|p({\bf z}|{\bf y}=k)\right)\Big].
\end{aligned} \tag{19}
\]

We assume that $w({\bf y}|{\bf x};\eta)=p({\bf y}|{\bf x};\eta)$. For the first term in (19),

\[
\begin{aligned}
& \mathbb{E}_{p({\bf x})}[\mathcal{KL}\left(w({\bf y}|{\bf x};\eta)\|p({\bf y})\right)] \\
= & \ \mathbb{E}_{p({\bf x})}\mathbb{E}_{w({\bf y}|{\bf x};\eta)}\left[\log\frac{w({\bf y}|{\bf x};\eta)}{w({\bf y};\eta)}\right]
+\mathbb{E}_{p({\bf x})}\mathbb{E}_{w({\bf y}|{\bf x};\eta)}[\log w({\bf y};\eta)]
-\mathbb{E}_{p({\bf x})}\mathbb{E}_{w({\bf y}|{\bf x};\eta)}[\log p({\bf y})] \\
= & \ \mathbb{E}_{p({\bf x})}[\mathcal{KL}(w({\bf y}|{\bf x};\eta)\|w({\bf y};\eta))]
+\mathbb{E}_{w({\bf y};\eta)}\left[\log\frac{w({\bf y};\eta)}{p({\bf y})}\right] \\
= & \ \int p({\bf x})w({\bf y}|{\bf x};\eta)\log\frac{w({\bf y}|{\bf x};\eta)}{w({\bf y};\eta)}\,d{\bf y}\,d{\bf x}
+\mathcal{KL}(w({\bf y};\eta)\|p({\bf y})) \\
= & \ \int p({\bf x},{\bf y};\eta)\log\frac{p({\bf x},{\bf y};\eta)}{p({\bf x})w({\bf y};\eta)}\,d{\bf y}\,d{\bf x}
+\mathcal{KL}(w({\bf y};\eta)\|p({\bf y})) \\
= & \ I({\bf x},{\bf y};\eta)+\mathcal{KL}(w({\bf y};\eta)\|p({\bf y})),
\end{aligned} \tag{20}
\]

where $w({\bf y};\eta)=\mathbb{E}_{p({\bf x})}[w({\bf y}|{\bf x};\eta)]$ and $p({\bf x},{\bf y};\eta)=p({\bf x})w({\bf y}|{\bf x};\eta)$.

For the second term in (19), we define q(𝐳|𝐲;η,ξ)q({\bf z}|{\bf y};\eta,\xi) as

𝔼p(𝐱|𝐲;η)[q(𝐳|𝐱,𝐲;ξ)]\displaystyle\mathbb{E}_{p({\bf x}|{\bf y};\eta)}[q({\bf z}|{\bf x},{\bf y};\xi)]
=\displaystyle= p(𝐱|𝐲;η)q(𝐳|𝐱,𝐲;ξ)𝑑𝐱\displaystyle\int p({\bf x}|{\bf y};\eta)q({\bf z}|{\bf x},{\bf y};\xi)d{\bf x}
=\displaystyle= p(𝐱)w(𝐲|𝐱;η)p(𝐲)q(𝐳,𝐲|𝐱;η,ξ)w(𝐲|𝐱;η)𝑑𝐱\displaystyle\int\frac{p({\bf x})w({\bf y}|{\bf x};\eta)}{p({\bf y})}\frac{q({\bf z},{\bf y}|{\bf x};\eta,\xi)}{w({\bf y}|{\bf x};\eta)}d{\bf x}
=\displaystyle= q(𝐳,𝐲,𝐱;η,ξ)p(𝐲)𝑑𝐱=q(𝐳,𝐲;η,ξ)p(𝐲)\displaystyle\int\frac{q({\bf z},{\bf y},{\bf x};\eta,\xi)}{p({\bf y})}d{\bf x}=\frac{q({\bf z},{\bf y};\eta,\xi)}{p({\bf y})}
=\displaystyle= q(𝐳|𝐲;η,ξ)\displaystyle q({\bf z}|{\bf y};\eta,\xi)

where $p({\bf x}|{\bf y};\eta)=p({\bf x})w({\bf y}|{\bf x};\eta)/p({\bf y})$.

Therefore,

\begin{aligned}
&\mathbb{E}_{p({\bf x})}\Big[\sum_{k=1}^{K}w({\bf y}=k|{\bf x};\eta)\,\mathcal{KL}\Big(q({\bf z}|{\bf x},{\bf y}=k;\xi)\,\|\,p({\bf z}|{\bf y}=k)\Big)\Big] \\
=\;& \mathbb{E}_{p({\bf x})w({\bf y}|{\bf x};\eta)q({\bf z}|{\bf x},{\bf y};\xi)}\left[\log\frac{q({\bf z}|{\bf x},{\bf y};\xi)}{p({\bf z}|{\bf y})}\right] \\
=\;& \mathbb{E}_{p({\bf x})w({\bf y}|{\bf x};\eta)q({\bf z}|{\bf x},{\bf y};\xi)}\left[\log\frac{q({\bf z}|{\bf x},{\bf y};\xi)\,q({\bf z}|{\bf y};\eta,\xi)}{p({\bf z}|{\bf y})\,q({\bf z}|{\bf y};\eta,\xi)}\right] \\
=\;& \mathbb{E}_{p({\bf x})w({\bf y}|{\bf x};\eta)q({\bf z}|{\bf x},{\bf y};\xi)}\left[\log\frac{q({\bf z}|{\bf x},{\bf y};\xi)}{q({\bf z}|{\bf y};\eta,\xi)}\right]
+\mathbb{E}_{p({\bf x})w({\bf y}|{\bf x};\eta)q({\bf z}|{\bf x},{\bf y};\xi)}\left[\log\frac{q({\bf z}|{\bf y};\eta,\xi)}{p({\bf z}|{\bf y})}\right] \\
=\;& \mathbb{E}_{p({\bf y})p({\bf x}|{\bf y};\eta)q({\bf z}|{\bf x},{\bf y};\xi)}\left[\log\frac{q({\bf z}|{\bf x},{\bf y};\xi)}{q({\bf z}|{\bf y};\eta,\xi)}\right]
+\mathbb{E}_{p({\bf y})p({\bf x}|{\bf y};\eta)q({\bf z}|{\bf x},{\bf y};\xi)}\left[\log\frac{q({\bf z}|{\bf y};\eta,\xi)}{p({\bf z}|{\bf y})}\right] \\
=\;& \mathbb{E}_{p({\bf y})q({\bf x},{\bf z}|{\bf y};\eta,\xi)}\left[\log\frac{q({\bf z}|{\bf x},{\bf y};\xi)}{q({\bf z}|{\bf y};\eta,\xi)}\right]
+\mathbb{E}_{p({\bf y})q({\bf z}|{\bf y};\eta,\xi)}\left[\log\frac{q({\bf z}|{\bf y};\eta,\xi)}{p({\bf z}|{\bf y})}\right] \\
=\;& \mathbb{E}_{p({\bf y})}\int q({\bf x},{\bf z}|{\bf y};\eta,\xi)\log\frac{q({\bf x},{\bf z}|{\bf y};\eta,\xi)}{q({\bf z}|{\bf y};\eta,\xi)\,p({\bf x}|{\bf y};\eta)}\,d{\bf x}\,d{\bf z}
+\mathbb{E}_{p({\bf y})}\left[\mathcal{KL}\left(q({\bf z}|{\bf y};\eta,\xi)\,\|\,p({\bf z}|{\bf y})\right)\right] \\
=\;& \mathbb{E}_{p({\bf y})}\left[I({\bf x},{\bf z}|{\bf y};\eta,\xi)\right]
+\mathbb{E}_{p({\bf y})}\left[\mathcal{KL}\left(q({\bf z}|{\bf y};\eta,\xi)\,\|\,p({\bf z}|{\bf y})\right)\right],
\end{aligned} \tag{21}

where $p({\bf x}|{\bf y};\eta)p({\bf y})=p({\bf x})w({\bf y}|{\bf x};\eta)$ and $q({\bf x},{\bf z}|{\bf y};\eta,\xi)=p({\bf x}|{\bf y};\eta)q({\bf z}|{\bf x},{\bf y};\xi)$.

In conclusion, the expectation of the KL-divergence upper bound can be written in terms of mutual information as

\begin{aligned}
\mathbb{E}_{p({\bf x})}\left[\mathcal{KL}^{u}\left(q({\bf z}|{\bf x};\eta,\xi)\,\|\,p({\bf z})\right)\right]
=\;& I({\bf x},{\bf y};\eta)+\mathcal{KL}\left(w({\bf y};\eta)\,\|\,p({\bf y})\right) \\
&+\mathbb{E}_{p({\bf y})}\left[I({\bf x},{\bf z}|{\bf y};\eta,\xi)\right]
+\mathbb{E}_{p({\bf y})}\left[\mathcal{KL}\left(q({\bf z}|{\bf y};\eta,\xi)\,\|\,p({\bf z}|{\bf y})\right)\right].
\end{aligned} \tag{22}
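
Since (22) is a purely algebraic identity, it can be checked numerically on a small discrete toy model. The following sketch is our own illustration (not part of the original experiments): it draws random finite distributions for $p({\bf x})$, $w({\bf y}|{\bf x};\eta)$, $q({\bf z}|{\bf x},{\bf y};\xi)$, $p({\bf y})$, and $p({\bf z}|{\bf y})$ and evaluates both sides of the decomposition.

import numpy as np

rng = np.random.default_rng(0)
nx, ny, nz = 5, 3, 4                      # sizes of the discrete supports of x, y, z

def simplex(shape):
    # random probability distributions along the last axis
    a = rng.random(shape)
    return a / a.sum(axis=-1, keepdims=True)

px   = simplex(nx)                        # p(x)
w    = simplex((nx, ny))                  # w(y|x)
q    = simplex((nx, ny, nz))              # q(z|x,y)
py   = simplex(ny)                        # prior p(y)
pz_y = simplex((ny, nz))                  # prior p(z|y)

# left-hand side: E_p(x)[ KL(w(y|x)||p(y)) + sum_k w(k|x) KL(q(z|x,k)||p(z|k)) ]
lhs = np.sum(px[:, None] * w * np.log(w / py)) \
    + np.sum(px[:, None, None] * w[:, :, None] * q * np.log(q / pz_y))

# right-hand side of (22)
wy   = px @ w                                       # w(y) = E_p(x)[w(y|x)]
I_xy = np.sum(px[:, None] * w * np.log(w / wy))     # I(x,y)
kl_y = np.sum(wy * np.log(wy / py))                 # KL(w(y)||p(y))
qz_y = np.einsum('x,xy,xyz->yz', px, w, q) / py[:, None]                         # q(z|y)
I_xz = np.sum(px[:, None, None] * w[:, :, None] * q * np.log(q / qz_y[None]))    # E_p(y)[I(x,z|y)]
kl_z = np.sum(py[:, None] * qz_y * np.log(qz_y / pz_y))                          # E_p(y)[KL(q(z|y)||p(z|y))]
rhs = I_xy + kl_y + I_xz + kl_z

print(np.isclose(lhs, rhs))               # True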

A.2 MNIST Dataset

Evaluation with various values of the tuning parameter $\beta$

$\beta$          0.1      0.25    0.5     0.75    1       5       10      50
negative SSIM    0.43     0.438   0.44    0.445   0.435   0.439   0.418   0.316
KL-divergence    19.654   9.654   8.481   7.684   6.75    4.064   2.729   1.855
error (%)        3.29     3.23    3.25    3.33    3.46    3.74    4.07    3.38
Table 3: Negative SSIM, KL-divergence, and classification error with 100 labeled samples for various values of $\beta$.

Let $I_{test}$ denote the set of indices of the test dataset. The KL-divergence measures how far the trained posterior distribution is from the pre-designed prior distribution and is defined by

\frac{1}{|I_{test}|}\sum_{i\in I_{test}}\mathcal{KL}^{u}\left(q({\bf z}|x_{i};\eta,\xi)\,\|\,p({\bf z})\right). \tag{23}

The negative average single-scale structural similarity (negative SSIM) of ${\bf X}$ is defined by

\frac{1}{2}\bigg(1-\frac{1}{|{\bf X}|^{2}}\sum_{({\bf x},{\bf x}^{\prime})\in{\bf X}\times{\bf X}}\mbox{SSIM}({\bf x},{\bf x}^{\prime})\bigg), \tag{24}

where $\mbox{SSIM}({\bf x},{\bf x}^{\prime})$ for ${\bf x},{\bf x}^{\prime}\in{\bf X}$ is the similarity between the two images ${\bf x}$ and ${\bf x}^{\prime}$ and takes values in $[-1,1]$ (Wang et al. 2004; Zheng and Sun 2019). Hence, the negative SSIM takes values in $[0,1]$ and measures how diverse the images in ${\bf X}$ are. When ${\bf X}$ is a set of images generated by a fitted VAE model, the negative SSIM of ${\bf X}$ indicates how expressive the model is. We construct ${\bf X}$ from the images generated by our trained decoder, where each image is produced from one of the 31 × 31 equally spaced grid points on the 2-dimensional latent space.
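
For reference, a minimal sketch of how the negative SSIM (24) of a set of generated images might be computed is given below. It assumes grayscale images stored in a numpy array with values in $[0,1]$ and uses the SSIM implementation of scikit-image, which is not necessarily the implementation used in the original experiments.

import numpy as np
from skimage.metrics import structural_similarity

def negative_ssim(images):
    # images: array of shape (n, H, W) with pixel values in [0, 1]; implements (24)
    n = len(images)
    total = 0.0
    for x in images:
        for x_prime in images:
            total += structural_similarity(x, x_prime, data_range=1.0)
    return 0.5 * (1.0 - total / n ** 2)

# e.g., images decoded from the 31 x 31 grid on the 2-dimensional latent space;
# a small placeholder batch is used here for illustration
images = np.random.rand(16, 28, 28)
print(negative_ssim(images))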

The classification error is given by

\frac{1}{|I_{test}|}\sum_{i\in I_{test}}\mathbb{I}\left(y_{i}\neq\arg\max_{k}\,w({\bf y}=k|x_{i};\eta)\right), \tag{25}

where $\mathbb{I}(\cdot)$ is an indicator function. The classification error measures the discrepancy between the label assigned by the posterior probability and the true label of the observation. Thus, a VAE model with a low classification error separates the latent space into subsets on which the labels of observations are well identified.
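
As a reference, the error (25) can be computed directly from the posterior class probabilities $w({\bf y}=k|x_{i};\eta)$; the short sketch below assumes they are stored in an array of shape (n, K).

import numpy as np

def classification_error(posterior_probs, labels):
    # posterior_probs: (n, K) array of w(y=k|x_i; eta); labels: (n,) true labels
    return np.mean(np.argmax(posterior_probs, axis=1) != labels)

# e.g., classification_error(w_test, y_test) over the test indices I_test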

Table 3 indicates that the KL-divergence depends mostly on $\beta$. Because $\beta$ indirectly controls the weight of the KL-divergence loss term, the KL-divergence for a large $\beta$ dominates our objective function (Lucas et al. 2019). We also observe that the diversity of generated samples (negative SSIM) increases as $\beta$ decreases. For all values of $\beta$, the latent space is clearly separated according to the true labels, which indicates that the classification performance does not depend on $\beta$ since the weight of the additional classification loss terms is scaled by $1/\beta$.

Controlling Proximity and Interpolation

Figure 6: Interpolated images produced by various pre-designs of the prior distribution. From top to bottom, the distance between the centers of the two components increases: 8, 16, 24, and 32.

We investigate the patterns of interpolated images according to various pre-designs of the proximity of the prior mixture components. We use a subset of the MNIST dataset containing only the labels 0 and 1, so a 2-component Gaussian mixture distribution is used. All Gaussian components have diagonal covariance matrices whose diagonal elements are all 4. We set the location parameters to $(-r,0)$ and $(r,0)$, which determines the proximity, and choose the distance between the two location parameters as 8, 16, 24, and 32. The images are generated from 11 equally spaced points on the line segment from $(-10,0)$ to $(10,0)$ on the 2-dimensional latent space. All other experimental settings are identical to those of the MNIST analysis in Section 4.1. Figure 6 shows that the interpolated images vary more gradually when the two location parameters of $p({\bf z})$ are farther from each other, confirming that the latent space adapts effectively to the pre-designed characteristics.
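
A minimal sketch of this interpolation procedure is given below, assuming a trained decoder that maps a 2-dimensional latent code to an image; the function and argument names are our own and the procedure simply follows the description above.

import numpy as np
import tensorflow as tf

def interpolate_images(decoder, z_start=(-10.0, 0.0), z_end=(10.0, 0.0), n=11):
    # decode n equally spaced latent points on the segment from z_start to z_end
    steps = np.linspace(0.0, 1.0, n)[:, None]                        # (n, 1)
    z = (1.0 - steps) * np.array([z_start]) + steps * np.array([z_end])
    return decoder(tf.constant(z, dtype=tf.float32))                 # (n, 28, 28, 1)

# usage: images = interpolate_images(trained_decoder)   # trained_decoder: 2-d latent -> image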

encoder                                     decoder
(28, 28, 1) image                           $d$-dimensional latent variable
Flatten                                     Dense(128), ReLU
Dense(256), ReLU                            Dense(256), ReLU
Dense(128), ReLU                            Dense(784), ReLU
2 × K × Dense($d$), Linear                  Reshape(28, 28, 1)
Table 4: Model descriptions of the encoder and decoder. Here, $K$ denotes the number of classes, and $d$ is the dimension of the latent variable.
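
A Keras sketch that roughly follows Table 4 is given below. The way the $2\times K$ output heads (per-class mean and log-variance) are organized and the variable names are our own reading of the table, so details may differ from the original implementation.

import tensorflow as tf
from tensorflow.keras import layers

K, d = 10, 2      # number of classes and latent dimension for the MNIST experiments

# encoder: image -> K per-class means and K per-class log-variances (2 x K x Dense(d))
x_in = layers.Input(shape=(28, 28, 1))
h = layers.Flatten()(x_in)
h = layers.Dense(256, activation="relu")(h)
h = layers.Dense(128, activation="relu")(h)
mean = layers.Concatenate()([layers.Dense(d)(h) for _ in range(K)])      # (batch, K * d)
logvar = layers.Concatenate()([layers.Dense(d)(h) for _ in range(K)])    # (batch, K * d)
encoder = tf.keras.Model(x_in, [mean, logvar])

# decoder: d-dimensional latent variable -> image
z_in = layers.Input(shape=(d,))
g = layers.Dense(128, activation="relu")(z_in)
g = layers.Dense(256, activation="relu")(g)
g = layers.Dense(784, activation="relu")(g)
x_out = layers.Reshape((28, 28, 1))(g)
decoder = tf.keras.Model(z_in, x_out)
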
classifier
input: 28 × 28 × 1 Image
Conv2D(32, 5, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
MaxPool2D(pool size=(2, 2), strides=2)
SpatialDropout2D(rate=0.5)
Conv2D(64, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
MaxPool2D(pool size=(2, 2), strides=2)
SpatialDropout2D(rate=0.5)
Conv2D(128, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
MaxPool2D(pool size=(2, 2), strides=2)
SpatialDropout2D(rate=0.5)
GlobalAveragePooling2D
Dense(64), BN, ReLU
Dense(K), softmax
Table 5: Network structures of the classifier. Here, $K$ denotes the number of classes, and BN is a Batch-Normalization layer (Ioffe and Szegedy 2015). Conv2D hyper-parameters are filters, kernel size, strides, and padding.
models                      error (%)
M2 (2014)                   11.97
M1+M2 (2014)                3.33
ADGM (2016)                 0.96
Disentangled-VAE (2019)     2.71
Parted-VAE (2021)           9.79
SHOT-VAE (2021)             3.12
Ours ($\beta=0.25$)         3.23
Table 6: Semi-supervised test classification errors with 100 labeled samples (all results except Parted-VAE are from the original papers).

A.3 CIFAR-10 Dataset

V-nat of the EXoN

Figure 7: Visualization of the 256-dimensional V-nat vector of the EXoN for each class where $\beta=0.05$.
Figure 8: Correlation matrix between the V-nat vectors of all classes with $\beta=0.05$.

Figure 7 shows the V-nat vectors for all classes in the CIFAR-10 dataset, where the $j$th element of the V-nat vector is $\log\mathbb{E}_{{\bf x}|{\bf y}=k}\left[\mbox{Var}({\bf z}_{j}^{k})/\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)\right]$. We find that the activated latent subspaces of the classes (the latent dimensions whose V-nat values exceed $\delta=1$) do not differ significantly from each other.

To show that the variability of generated images depends on almost the same latent subspace for all classes, Figure 8 presents the correlation matrix between the V-nat vectors of all classes with $\beta=0.05$. The correlation plot indicates that the V-nat vectors are strongly correlated (correlation coefficients are close to 1), implying that the diversity of generated images depends on almost the same subspace.
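
The V-nat vector of a class $k$ can be estimated directly from the encoder outputs. The sketch below is our own reading of the definition above: `prior_var` holds $\mbox{Var}({\bf z}_{j}^{k})$, which we take to be the pre-designed prior variance of component $k$, and `post_var` holds the per-sample posterior variances $\mbox{Var}({\bf z}_{j}^{k}|{\bf x};\xi)$ produced by the encoder; the placeholder arrays are for illustration only.

import numpy as np

def v_nat(prior_var, post_var, delta=1.0):
    # prior_var: (d,) array with Var(z_j^k); post_var: (n_k, d) array with
    # Var(z_j^k | x_i; xi) for the n_k samples x_i of class k
    v = np.log(np.mean(prior_var[None, :] / post_var, axis=0))    # j-th V-nat element
    activated = np.where(v > delta)[0]                            # activated subspace A_k(delta)
    return v, activated

# illustration with the CIFAR-10 prior variances and placeholder posterior variances
prior_var = np.concatenate([0.1 * np.ones(10), np.ones(246)])
post_var = np.exp(np.random.randn(100, 256) * 0.5 - 1.0)          # stand-in for encoder outputs
v, activated = v_nat(prior_var, post_var)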

Manipulation on the activated latent subspace

Figure 9: Manipulation of a single activated latent dimension and the generated images: (a) $\beta=0.01$, (b) $\beta=0.1$.
Figure 10: Manipulation of all activated latent dimensions and the generated images: (a) $\beta=0.01$, (b) $\beta=0.1$.

In Figure 9, five series of images are generated from latent variables in which the value of a single activated latent axis is varied from -3 to 3 while the values of the other latent dimensions are fixed. From top to bottom, the latent axes with the top-5 largest V-nat values are used for each $\beta$. As the value of each activated latent dimension changes, different features of the generated samples vary (such as the brightness, the color and shape of the object, or the color of the background).

Note that the characteristics of the images change more significantly when the decoder variance parameter $\beta$ is large, because a single activated latent dimension then determines relatively more image characteristics due to the small cardinality of the activated latent subspace. Disentangling the properties represented by the activated latent subspace is left for future work.
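
A sketch of the single-axis manipulation might look as follows, assuming a trained decoder, a base latent code `z0` (e.g., a posterior mean of a reference image), and the V-nat vector `v` from the previous sketch; the names are our own.

import numpy as np
import tensorflow as tf

def sweep_axis(decoder, z0, axis, values=np.linspace(-3.0, 3.0, 5)):
    # vary a single activated latent dimension while the other dimensions stay fixed
    z = np.repeat(z0[None, :].astype("float32"), len(values), axis=0)
    z[:, axis] = values
    return decoder(tf.constant(z))

# usage sketch: one image series per axis among the top-5 V-nat axes
# for axis in np.argsort(v)[::-1][:5]:
#     images = sweep_axis(trained_decoder, z0, axis)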

Furthermore, we manipulate all activated latent dimensions simultaneously, and the results are visualized in Figure 10 for each $\beta$. The figure shows images generated from latent variables perturbed by uniform noise on ${\cal A}_{k}(1)$, where the noise is sampled from $U(-1.5,1.5)$ and $k$ is the automobile class.

The top-left images are the reconstructions from the unperturbed latent variables. As in the single-dimension manipulation, this figure confirms that a single activated latent dimension determines relatively more characteristics of a given image when $\beta$ is large, because the cardinality of the activated latent subspace is then small.

Figure 11: Recovered images given the training dataset where $\beta=0.05$. The figure confirms that the proposed model can recover the original observations without loss of information.
Figure 12: Recovered images given the test dataset where $\beta=0.05$.

A.4 Experiment settings

We run all experiments on a GeForce RTX 2080 Ti GPU with 16GB RAM, and our experimental code is implemented with TensorFlow (Abadi et al. 2015). The pre-designs of the prior distribution are as follows (a numpy sketch of both designs is given after the list):

  • MNIST: The conceptual centers and variances are given by $\mu_{k}^{0}=r\cdot\big(\cos(\frac{\pi}{10}(2k-1)),\sin(\frac{\pi}{10}(2k-1))\big)$, where $r=4/\sin(\frac{\pi}{10})$, and $(\sigma_{k}^{0})^{2}=(4,4)$ for $k=1,\cdots,10$.

  • CIFAR-10: The mean vector of $p({\bf z})$ consists of two subvectors: a 10-dimensional label-relevant mean, which is the one-hot vector for the corresponding label, and a 246-dimensional label-irrelevant mean whose elements are all zero (Zheng and Sun 2019). All covariance matrices are commonly given by $diag\{(0.1_{10},1_{246})\}$, where $a_{k}$ for $a\in\mathbb{R}$ and $k\in\mathbb{N}$ is a $k$-dimensional vector whose elements are all $a$.
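
The following numpy sketch restates both prior designs described above; it is an illustration of the formulas rather than the original configuration code.

import numpy as np

# MNIST: 10 conceptual centers equally spaced on a circle of radius r, variance 4 per axis
r = 4.0 / np.sin(np.pi / 10.0)
angles = np.pi / 10.0 * (2.0 * np.arange(1, 11) - 1.0)
mnist_means = r * np.stack([np.cos(angles), np.sin(angles)], axis=1)       # (10, 2)
mnist_vars = 4.0 * np.ones((10, 2))

# CIFAR-10: 10-dim one-hot label-relevant mean and 246-dim zero label-irrelevant mean
cifar_means = np.concatenate([np.eye(10), np.zeros((10, 246))], axis=1)    # (10, 256)
cifar_vars = np.tile(np.concatenate([0.1 * np.ones(10), np.ones(246)]), (10, 1))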

Stochastic image augmentations we used:

  • MNIST: random rotation, random cropping

  • CIFAR-10: random horizontal flip, random cropping

dataset     epochs   batch size ($U$, $L$)   drop rate   $\lambda_{1}$   $\lambda_{2}(e)$
MNIST       100      (128, 32)               0.5         6000            $\lambda_{1}\exp\left(-5(1-\min\{1,\frac{c}{10}\})^{2}\right)$
CIFAR-10    600      (128, 32)               0.1         5000            $\lambda_{1}\exp\left(-5(1-\min\{1,\frac{c}{50}\})^{2}\right)$
Table 7: Detailed experiment settings for the MNIST and CIFAR-10 datasets.

In our implementation, we use the Adam optimizer (Kingma and Ba 2014) for both datasets. The initial learning rate of the MNIST experiments is 0.003, and after epoch 10 we decay the learning rate exponentially by multiplying it by $\exp\left(-5(1-\frac{100-c}{100})^{2}\right)$, where $c$ is the current epoch number. The initial learning rate of the CIFAR-10 experiments is 0.001, and we multiply the learning rate by 0.1 at epochs 250, 350, 450, and 550. After epoch 550, we decay the learning rate exponentially by multiplying it by $\exp\left(-5(1-\frac{600-c}{600-550})^{2}\right)$. For both datasets, we apply decoupled weight decay (Loshchilov and Hutter 2017) with a factor of 0.0005. The remaining experiment settings are shown in Table 7.
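
The two learning-rate schedules and the ramp-up weight $\lambda_{2}(e)$ of Table 7 can be sketched as plain Python functions of the current epoch $c$. This is a restatement of the formulas above, not the original training code, and the exact epoch boundaries (e.g., whether a step applies at or after epoch 250) are our own reading.

import numpy as np

def lr_mnist(c, lr0=0.003):
    # MNIST: exponential decay applied after epoch 10 (c is the current epoch)
    if c <= 10:
        return lr0
    return lr0 * np.exp(-5.0 * (1.0 - (100.0 - c) / 100.0) ** 2)

def lr_cifar10(c, lr0=0.001):
    # CIFAR-10: step decay by 0.1 at epochs 250/350/450/550, then exponential decay
    lr = lr0 * 0.1 ** sum(c >= m for m in (250, 350, 450, 550))
    if c > 550:
        lr *= np.exp(-5.0 * (1.0 - (600.0 - c) / (600.0 - 550.0)) ** 2)
    return lr

def lambda2(c, lam1, ramp):
    # ramp-up weight lambda_2(e) of Table 7 (ramp = 10 for MNIST, 50 for CIFAR-10)
    return lam1 * np.exp(-5.0 * (1.0 - min(1.0, c / ramp)) ** 2)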

encoder                                      decoder
input: 32 × 32 × 3 Image                     input: $d$-dimensional latent variable
Conv2D(32, 5, 2, ‘same’), BN, ReLU           Dense(4 × 4 × 512), BN, ReLU
Conv2D(64, 4, 2, ‘same’), BN, ReLU           Reshape(4, 4, 512)
Conv2D(128, 4, 2, ‘same’), BN, ReLU          Conv2DTranspose(256, 5, 2, ‘same’), BN, ReLU
Conv2D(256, 4, 1, ‘same’), BN, ReLU          Conv2DTranspose(128, 5, 2, ‘same’), BN, ReLU
Flatten                                      Conv2DTranspose(64, 5, 2, ‘same’), BN, ReLU
Dense(1024), BN, Linear                      Conv2DTranspose(32, 5, 1, ‘same’), BN, ReLU
mean: K × Dense($d$), Linear                 Conv2D(3, 4, 1, ‘same’), tanh
log-variance: K × Dense($d$), Linear         -
Table 8: Model descriptions of the encoder and decoder. Here, $K$ denotes the number of classes, $d$ is the dimension of the latent variable, and BN is a Batch-Normalization layer (Ioffe and Szegedy 2015). Conv2D and Conv2DTranspose hyper-parameters are filters, kernel size, strides, and padding.
classifier
input: 32 × 32 × 3 Image
Conv2D(128, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
Conv2D(128, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
Conv2D(128, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
MaxPool2D(pool size=(2, 2), strides=2)
SpatialDropout2D(rate=0.1)
Conv2D(256, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
Conv2D(256, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
Conv2D(256, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
MaxPool2D(pool size=(2, 2), strides=2)
SpatialDropout2D(rate=0.1)
Conv2D(512, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
Conv2D(256, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
Conv2D(128, 3, 1, ‘same’), BN, LeakyReLU(α\alpha=0.1)
GlobalAveragePooling2D
Dense(K), softmax
Table 9: Network structures of the classifier. Here, $K$ denotes the number of classes, and BN is a Batch-Normalization layer (Ioffe and Szegedy 2015). Conv2D hyper-parameters are filters, kernel size, strides, and padding.