
Quantitative Understanding of VAE as a Non-linearly Scaled Isometric Embedding

Akira Nakagawa    Keizo Kato    Taiji Suzuki
Abstract

Variational autoencoder (VAE) estimates the posterior parameters (mean and variance) of latent variables corresponding to each input data point. While it is used for many tasks, the transparency of the model is still an underlying issue. This paper provides a quantitative understanding of the properties of VAE through differential geometric and information-theoretic interpretations. According to rate-distortion theory, optimal transform coding is achieved by an orthonormal transform with a PCA basis, where the transform space is isometric to the input. Considering the analogy of transform coding to VAE, we clarify theoretically and experimentally that VAE can be mapped to an implicit isometric embedding with a scale factor derived from the posterior parameter. As a result, we can estimate the data probabilities in the input space from the prior, the loss metric, and the corresponding posterior parameters; furthermore, the quantitative importance of each latent variable can be evaluated like the eigenvalues of PCA.

Keywords: Machine Learning, VAE, differential geometry, ICML

1 Introduction

Variational autoencoder (VAE) (Kingma & Welling, 2014) is one of the most successful generative models, estimating posterior parameters of latent variables for each input data point. In VAE, the latent representation is obtained by maximizing an evidence lower bound (ELBO). A number of studies have attempted to characterize the theoretical properties of VAE, yet unsolved questions remain: what the latent variables obtained by VAE mean, what $\beta$ represents in $\beta$-VAE (Higgins et al., 2017), whether ELBO converges to an appropriate value, and so on. Alemi et al. (2018) introduced the RD trade-off based on an information-theoretic framework to analyse $\beta$-VAE. However, they did not clarify what VAE captures after optimization. Dai et al. (2018) showed that VAE restricted to a linear transform can be regarded as a robust principal component analysis (PCA), but their linearity assumption prevents an analysis of each latent variable individually. Rolínek et al. (2019) showed that the Jacobian matrix of VAE is orthogonal, which seems to make latent variables disentangled implicitly. However, they do not quantify the impact of each latent variable on the input data, because they simplify the KL divergence as a constant. Locatello et al. (2019) showed that the unsupervised learning of disentangled representations fundamentally requires inductive biases on both the metric and the data, yet they do not uncover the quantitative properties of the disentangled representations obtained under given inductive biases. Kumar & Poole (2020) connected the VAE objective with a Riemannian metric and proposed new deterministic regularized objectives, but they still did not uncover the quantitative properties of VAE after optimizing these objectives.

These problems are essentially due to the lack of a clear formulation of the quantitative relationship between the input data and the latent variables. To overcome this, Kato et al. (2020) proposed an isometric autoencoder as a non-VAE model. In an isometric embedding (Han & Hong, 2006), the distance between any two input points is preserved in the embedding space, so the quantitative relationship between the input data and the latent variables becomes tractable. Our intuition is that if VAE could also be mapped to an isometric autoencoder, the behavior of the VAE latent variables would become clear. Thus, the challenge of this paper is to resolve these essential problems from the viewpoint of isometric embedding.

1. First of all, we show that VAE obtains an implicit isometric embedding of the support of the input distribution as its latent space. That is, the input variable is embedded through the encoder into a low-dimensional latent space in which the distance between two points, measured in the given metric of the input space, is preserved. Surprisingly, this characterization resolves most of the unsolved problems of VAE enumerated above. The implicit isometric embedding is derived as a non-linear scaling of the VAE embedding.

2. More concretely, we theoretically show the following via the isometric embedding characterization:
(a) Role of $\beta$ in $\beta$-VAE: $\beta$ controls the posterior variance of each dimension of the isometric embedding as a constant $\beta/2$.
(b) Estimation of the input distribution: the input distribution can be quantitatively estimated from the distribution of the implicit isometric embedding, because the Jacobian determinant between the input and implicit isometric spaces is constant.
(c) Disentanglement: if the data manifold has separable latent variables in the given metric by nature, the implicit isometric embedding captures such disentangled features as a result of minimizing the entropy.
(d) Rate-distortion (RD) optimality: VAE can be considered a rate-distortion optimal encoder as formulated by RD theory (Berger, 1971).

3. Finally, we justify our theoretical findings through several numerical experiments. We observe that the estimated distribution is proportional to the input distribution on a toy dataset. Exploiting this property, the anomaly detection performance on real data is comparable to state-of-the-art methods. We also observe that the variance of each dimensional component of the isometric embedding indicates the importance of each disentangled property, as in PCA.

2 Variational autoencoder

In VAE, ELBO is maximized instead of maximizing the log-likelihood directly. Let $\bm{x}\in\mathbb{R}^{m}$ be a point in a dataset. The original VAE model consists of a latent variable with fixed prior $\bm{z}\sim p(\bm{z})=\mathcal{N}(\bm{z};0,\bm{I}_{n})$, $\bm{z}\in\mathbb{R}^{n}$, a parametric encoder $\mathrm{Enc}_{\phi}:\bm{x}\Rightarrow\bm{z}$, and a parametric decoder $\mathrm{Dec}_{\theta}:\bm{z}\Rightarrow\hat{\bm{x}}$. The encoder provides $q_{\phi}(\bm{z}|\bm{x})=\mathcal{N}(\bm{z};\bm{\mu}_{(\bm{x})},\bm{\sigma}_{(\bm{x})})$ by estimating the parameters $\bm{\mu}_{(\bm{x})}$ and $\bm{\sigma}_{(\bm{x})}$. Let $L_{\bm{x}}$ be the local cost at a data point $\bm{x}$. Then, ELBO is described by

\[
E_{p(\bm{x})}\left[E_{q_{\phi}(\bm{z}|\bm{x})}[\log p_{\theta}(\bm{x}|\bm{z})]-D_{\mathrm{KL}}(q_{\phi}(\bm{z}|\bm{x})\|p(\bm{z}))\right]. \tag{1}
\]

In $E_{p(\bm{x})}[\,\cdot\,]$, the second term $D_{\mathrm{KL}}(\cdot)$ is a Kullback–Leibler (KL) divergence. Let $\mu_{j(\bm{x})}$, $\sigma_{j(\bm{x})}$, and $D_{\mathrm{KL}j(x)}$ be the $j$-th dimensional values of $\bm{\mu}_{(\bm{x})}$, $\bm{\sigma}_{(\bm{x})}$, and the KL divergence, respectively. Then $D_{\mathrm{KL}}(\cdot)$ is derived as:

\[
D_{\mathrm{KL}}(\cdot)=\sum_{j=1}^{n}D_{\mathrm{KL}j(x)}, \quad\text{where}\quad
D_{\mathrm{KL}j(x)}=\frac{1}{2}\left({\mu_{j(\bm{x})}}^{2}+{\sigma_{j(\bm{x})}}^{2}-\log{\sigma_{j(\bm{x})}}^{2}-1\right). \tag{2}
\]

The first term $E_{q_{\phi}(\bm{z}|\bm{x})}[\log p_{\theta}(\bm{x}|\bm{z})]$ is called the reconstruction loss. Instead of estimating $\log p_{\theta}(\bm{x}|\bm{z})$ directly during training, $\hat{\bm{x}}=\mathrm{Dec}_{\theta}(\bm{z})$ is derived and $D(\bm{x},\hat{\bm{x}})=-\log p_{\mathbb{R}p}(\bm{x}|\hat{\bm{x}})$ is evaluated as the reconstruction loss, where $p_{\mathbb{R}p}(\bm{x}|\hat{\bm{x}})$ denotes a predetermined conditional distribution. When Gaussian and Bernoulli distributions are used as $p_{\mathbb{R}p}(\bm{x}|\hat{\bm{x}})$, $D(\bm{x},\hat{\bm{x}})$ becomes the sum square error (SSE) and the binary cross-entropy (BCE), respectively. In training $\beta$-VAE (Higgins et al., 2017), the following objective is used instead of Eq. 1, where $\beta$ is a parameter that controls the trade-off:

\[
L_{\bm{x}}=E_{\bm{z}\sim q_{\phi}(\bm{z}|\bm{x})}[D(\bm{x},\hat{\bm{x}})]+\beta D_{\mathrm{KL}}(\cdot). \tag{3}
\]
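To make Eqs. 2–3 concrete, the following is a minimal numpy sketch of the per-sample $\beta$-VAE objective, assuming an SSE reconstruction loss and a single reparameterization sample; the linear `encode`/`decode` stand-ins and their weights are purely illustrative and are not the networks used in this paper.

```python
# Minimal sketch of the per-sample beta-VAE objective (Eq. 3) with SSE
# reconstruction and the KL divergence of Eq. 2. Encoder/decoder are toy
# linear maps for illustration only.
import numpy as np

rng = np.random.default_rng(0)
m, n = 16, 3                        # input and latent dimensions
W_enc = rng.normal(size=(n, m))     # illustrative encoder weights
W_dec = rng.normal(size=(m, n))     # illustrative decoder weights

def encode(x):
    """Return posterior parameters (mu, sigma) of q_phi(z|x)."""
    mu = W_enc @ x
    sigma = np.full(n, 0.1)         # fixed posterior scale, for illustration
    return mu, sigma

def decode(z):
    """Return the reconstruction x_hat = Dec_theta(z)."""
    return W_dec @ z

def beta_vae_loss(x, beta):
    mu, sigma = encode(x)
    z = mu + sigma * rng.standard_normal(n)                        # reparameterization trick
    x_hat = decode(z)
    recon = np.sum((x - x_hat) ** 2)                               # SSE, i.e. D(x, x_hat)
    kl = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)   # Eq. 2
    return recon + beta * kl                                       # Eq. 3

x = rng.normal(size=m)
print(beta_vae_loss(x, beta=0.5))
```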

3 Isometric embedding

Isometric embedding (Han & Hong, 2006) is a smooth embedding from $\bm{x}$ to $\bm{z}$ ($\bm{x},\bm{z}\in\mathbb{R}^{m}$) on a Riemannian manifold, in which the distance between any two points is the same in the input and embedding spaces. Assume that $\bm{x}$ belongs to a Riemannian metric space with metric tensor $\bm{G}_{\bm{x}}$ and that $\bm{z}$ belongs to a Euclidean space. Then, the isometric embedding from $\bm{x}$ to $\bm{z}$ satisfies the following condition for all inputs and dimensions, as shown in Kato et al. (2020), where $\delta_{jk}$ denotes the Kronecker delta:

\[
{}^{t}(\partial\bm{x}/\partial z_{j})\ \bm{G}_{\bm{x}}\ (\partial\bm{x}/\partial z_{k})=\delta_{jk}. \tag{4}
\]

The isometric embedding has several preferable properties. First, the probability density of the input data in the given metric is preserved in the isometric embedding space. Let $p(\bm{x})$ and $p(\bm{z})$ be the distributions in their respective metric spaces, and let $J_{\mathrm{det}}$ denote $|\mathrm{det}(\partial\bm{x}/\partial\bm{z})|$, the absolute value of the Jacobian determinant. Since $J_{\mathrm{det}}=1$ by orthonormality, the following equation holds:

\[
p(\bm{z})=J_{\mathrm{det}}\ p(\bm{x})=p(\bm{x}). \tag{5}
\]

Secondly, the entropies in both spaces are also equal. Let $X$ and $Z$ be the sets of $\bm{x}$ and $\bm{z}$, respectively, and let $H(X)$ and $H(Z)$ denote the entropies of $X$ and $Z$ in their respective metric spaces. Then $H(X)$ and $H(Z)$ are equal:

\[
H(Z) = -\int p(\bm{z})\log p(\bm{z})\,\mathrm{d}\bm{z}
= -\int J_{\mathrm{det}}\,p(\bm{x})\log\left(J_{\mathrm{det}}\,p(\bm{x})\right)J_{\mathrm{det}}^{-1}\,\mathrm{d}\bm{x}
= -\int p(\bm{x})\log p(\bm{x})\,\mathrm{d}\bm{x}
= H(X). \tag{6}
\]

Thus, the isometric embedding is a powerful tool to analyse input data. Note that Eqs. 5-6 do not hold in general if the embedding is not isometric.
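As an illustration of the condition in Eq. 4 (with $\bm{G}_{\bm{x}}=\bm{I}$), the sketch below numerically checks the orthonormality of the Jacobian columns for a simple isometric map, the plane rolled onto a cylinder; this example map is our own choice and is not taken from the paper.

```python
# Numerical check of Eq. 4 for a toy isometric embedding: the flat 2-D plane
# mapped onto a cylinder in R^3. The Jacobian columns should be orthonormal
# under G_x = I, i.e. J^T J ~ identity.
import numpy as np

def embed(z):
    # z in R^2 -> x in R^3 (isometric: arc length on the cylinder equals |dz1|)
    return np.array([np.cos(z[0]), np.sin(z[0]), z[1]])

def jacobian(f, z, eps=1e-6):
    # forward finite differences, one column per latent dimension
    J = np.zeros((3, 2))
    for j in range(2):
        e = np.zeros(2); e[j] = eps
        J[:, j] = (f(z + e) - f(z)) / eps
    return J

z = np.array([0.7, -1.2])
J = jacobian(embed, z)
print(J.T @ J)   # approximately the 2x2 identity: t(dx/dz_j) dx/dz_k = delta_jk
```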

Recently, Kato et al. (2020) proposed an isometric autoencoder, RaDOGAGA (rate-distortion optimization guided autoencoder for generative analysis), inspired by deep image compression (Ballé et al., 2018). In conventional image compression using orthonormal transform coding, the rate-distortion optimization (RDO) objective has been widely used (Sullivan & Wiegand, 1998). Let $R$ and $D$ be the rate and distortion after encoding, respectively. RDO finds the encoding parameters that minimize $L=D+\lambda R$ at a given Lagrange multiplier $\lambda$. In deep image compression (Ballé et al., 2018), the model is composed of a parametric prior and a posterior with constant variance, and is trained using the RDO objective. Kato et al. (2020) proved that such a model achieves an isometric embedding in Euclidean space, and they proposed the isometric autoencoder RaDOGAGA for quantitative analysis. By contrast, VAE uses a fixed prior with a variable posterior. Our intuition is that VAE can be mapped to an isometric embedding such as RaDOGAGA by introducing a non-linear scaling of the latent space. If this intuition is correct, the behavior of VAE can be quantitatively explained.

4 Understanding of VAE as a scaled isometric embedding

This section shows the quantitative properties of VAE by introducing an implicit isometric embedding. First, we present the hypothesis that VAE can be mapped to an implicit isometric embedding. Second, we theoretically derive the implicit isometric embedding from the minimum condition of the VAE objective. Lastly, we explain the quantitative properties of VAE that enable practical data analysis.

4.1 Mapping β\beta-VAE to implicit isometric embedding

In this section, we explain our motivation for introducing an implicit isometric embedding to analyse $\beta$-VAE. Rolínek et al. (2019) showed that, when $D(\bm{x},\hat{\bm{x}})$ is SSE, each pair of column vectors in the Jacobian matrix $\partial\bm{x}/\partial\bm{\mu}_{(\bm{x})}$ is orthogonal, i.e., ${}^{t}(\partial\bm{x}/\partial\mu_{j(\bm{x})})\cdot(\partial\bm{x}/\partial\mu_{k(\bm{x})})=0$ for $j\neq k$. From this property, we can introduce the implicit isometric embedding by scaling the VAE latent space appropriately as follows. Let $\bm{x}_{\mu_{j}}$ denote $\partial\bm{x}/\partial\mu_{j(\bm{x})}$, and let $\bm{y}$ be an implicit variable whose $j$-th dimensional component $y_{j}$ satisfies $\mathrm{d}y_{j}/\mathrm{d}\mu_{j(\bm{x})}=|\bm{x}_{\mu_{j}}|_{2}$. Then $\partial\bm{x}/\partial y_{j}$ forms an isometric embedding in Euclidean space:

\[
{}^{t}(\partial\bm{x}/\partial y_{j})\cdot(\partial\bm{x}/\partial y_{k})=\delta_{jk}. \tag{7}
\]

If the L2 norm of $\bm{x}_{\mu_{j}}$ can be derived mathematically, we can formulate the mapping of VAE to an implicit isometric embedding as in Eq. 7. This mapping then strongly helps to understand the quantitative behavior of VAE, as explained in Section 3. Thus, our goal in this paper is to formulate the implicit isometric embedding theoretically and to analyse the properties of VAE quantitatively.

Figure 1 shows how $\beta$-VAE is mapped to an implicit isometric embedding. In the VAE encoder, $\bm{\mu}_{(\bm{x})}$ is calculated from an input $\bm{x}\in X$. Then, the posterior sample $\bm{z}$ is derived by adding a stochastic noise $\mathcal{N}(0,\bm{\sigma}_{(\bm{x})})$ to $\bm{\mu}_{(\bm{x})}$. Finally, the reconstruction $\hat{\bm{x}}\in\hat{X}$ is decoded from $\bm{z}$.

Our theoretical analysis in Section 4.2 reveals that an implicit isometric embedding $\bm{y}\in Y$ can be introduced by mapping $\bm{\mu}_{(\bm{x})}$ to $\bm{y}$ with the scaling $\mathrm{d}y_{j}/\mathrm{d}\mu_{j(\bm{x})}=|\bm{x}_{\mu_{j}}|_{2}=\sqrt{\beta/2}/\sigma_{j(\bm{x})}$ in each dimension. Then, the posterior $\hat{\bm{y}}\in\hat{Y}$ is derived by adding a stochastic noise $\mathcal{N}(0,(\beta/2)I_{n})$ to $\bm{y}$. Note that the noise variances, i.e., the posterior variances, are a constant $\beta/2$ for all inputs and dimensions, which is analogous to RaDOGAGA. Then, the mutual information $I(X;\hat{X})$ in $\beta$-VAE can be estimated as:

\[
I(X;\hat{X}) = I(Y;\hat{Y})
\simeq H(Y)-H\left(\mathcal{N}(0,(\beta/2)I_{n})\right)
= H(Y)-\frac{n}{2}\log(\pi e\beta). \tag{8}
\]

This implies that the posterior entropy $\frac{n}{2}\log(\pi e\beta)$ should be sufficiently smaller than $H(X)$ to give the model enough expressive ability. Thus, the posterior variance $\beta/2$ should also be sufficiently smaller than the variance of the input data. Note that Eq. 8 is consistent with the rate-distortion (RD) optimal condition in RD theory, as shown in Section 6.
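The entropy terms in Eq. 8 can be checked numerically under Gaussian assumptions, as in the small sketch below; the per-dimension variances of $\bm{y}$ used here are illustrative placeholders, not values measured from a trained model.

```python
# Numerical check of the entropy terms in Eq. 8: the posterior entropy
# (n/2) log(pi*e*beta) and a Gaussian upper bound on H(Y) given assumed
# per-dimension variances of y. All variance values are illustrative.
import numpy as np

def gaussian_entropy(var):
    """Differential entropy of N(0, var) in nats."""
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

beta = 0.05
var_y = np.array([1.0, 0.5, 0.1])     # assumed per-dimension variances of y
n = len(var_y)

H_Y_upper = np.sum(gaussian_entropy(var_y))           # Gaussian upper bound on H(Y)
H_post = 0.5 * n * np.log(np.pi * np.e * beta)        # H(N(0, (beta/2) I_n))
print("H(Y) upper bound  :", H_Y_upper)
print("posterior entropy :", H_post)
print("I(X; X_hat) bound :", H_Y_upper - H_post)      # per Eq. 8
```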

Figure 1: Mapping of $\beta$-VAE to implicit isometric embedding.

4.2 Theoretical derivation of implicit isometric embedding

In this section, we derive the implicit isometric embedding theoretically. First, we reformulate $D(\bm{x},\hat{\bm{x}})$ and $D_{\mathrm{KL}}(\cdot)$ in the $\beta$-VAE objective $L_{\bm{x}}$ in Eq. 3 for mathematical analysis. Then we derive the implicit isometric embedding as a minimum condition of $L_{\bm{x}}$. Here, we set the prior $p(\bm{z})$ to $\mathcal{N}(\bm{z};0,I_{n})$ for ease of analysis. The approximations in this section are valid when $\beta/2$ is sufficiently smaller than the variance of the input dataset, which is important for achieving sufficient expressive ability. We also assume the data manifold is smooth and differentiable.

Firstly, we introduce a metric tensor to treat arbitrary kinds of metrics for the reconstruction loss within the same framework. Let $D(\bm{x},\acute{\bm{x}})=-\log p_{\mathbb{R}p}(\bm{x}|\acute{\bm{x}})$ denote a metric between two points $\bm{x}$ and $\acute{\bm{x}}$, and let $\delta\bm{x}$ be $\acute{\bm{x}}-\bm{x}$. If $\delta\bm{x}$ is small, $D(\bm{x},\acute{\bm{x}})=D(\bm{x},\bm{x}+\delta\bm{x})$ can be approximated by ${}^{t}\delta\bm{x}\ \bm{G}_{\bm{x}}\ \delta\bm{x}$ using a second-order Taylor expansion, where $\bm{G}_{\bm{x}}$ is an $\bm{x}$-dependent positive definite metric tensor. Appendix G.2 shows the derivations of $\bm{G}_{\bm{x}}$ for SSE, BCE, and the structural similarity (SSIM) (Wang et al., 2001). In particular, for SSE, $\bm{G}_{\bm{x}}$ is the identity matrix $\bm{I}$, i.e., the metric tensor of Euclidean space.

Next, we formulate the approximation of $L_{\bm{x}}$ via the following three lemmas, to examine the Jacobian matrix easily.

Lemma 1. Approximation of reconstruction loss:
Let $\breve{\bm{x}}$ be $\mathrm{Dec}_{\theta}(\bm{\mu}_{(\bm{x})})$ and let $\bm{x}_{\mu_{j}}$ denote $\partial\bm{x}/\partial\mu_{j(\bm{x})}$. Then the reconstruction loss in $L_{\bm{x}}$ can be approximated as:

\[
E_{\bm{z}\sim q_{\phi}(\bm{z}|\bm{x})}\left[D(\bm{x},\hat{\bm{x}})\right]\simeq D(\bm{x},\breve{\bm{x}})+\sum_{j=1}^{n}{\sigma_{j(\bm{x})}}^{2}\ {}^{t}\bm{x}_{\mu_{j}}\bm{G}_{\bm{x}}\bm{x}_{\mu_{j}}. \tag{9}
\]

Proof: Appendix A.1 describes the proof. The outline is as follows: Rolínek et al. (2019) show that $D(\bm{x},\hat{\bm{x}})$ can be decomposed into $D(\bm{x},\breve{\bm{x}})+D(\breve{\bm{x}},\hat{\bm{x}})$. We call the first term $D(\bm{x},\breve{\bm{x}})$ the transform loss. Obviously, the average of the transform loss over $\bm{z}\sim q_{\phi}(\bm{z}|\bm{x})$ is still $D(\bm{x},\breve{\bm{x}})$. We call the second term $D(\breve{\bm{x}},\hat{\bm{x}})$ the coding loss. The average of the coding loss can be further approximated as the second term of Eq. 9.

Lemma 2. Approximation of KL divergence:
Let $p(\bm{\mu}_{(\bm{x})})=\mathcal{N}(\bm{\mu}_{(\bm{x})};0,I_{n})$ be the prior probability density evaluated at $\bm{z}=\bm{\mu}_{(\bm{x})}$. Then the KL divergence in $L_{\bm{x}}$ can be approximated as:

\[
D_{\mathrm{KL}}(q_{\phi}(\bm{z}|\bm{x})\|p(\bm{z}))
\simeq-\log\Bigl(p(\bm{\mu}_{(\bm{x})})\prod_{j=1}^{n}\sigma_{j(\bm{x})}\Bigr)-\frac{n\log 2\pi e}{2}
\simeq-\log\Bigl(p(\bm{x})\Bigl|\mathrm{det}\Bigl(\frac{\partial\bm{x}}{\partial\bm{\mu}_{(\bm{x})}}\Bigr)\Bigr|\prod_{j=1}^{n}\sigma_{j(\bm{x})}\Bigr)-\frac{n\log 2\pi e}{2}. \tag{10}
\]

Proof: The details are described in Appendix A.2. The outline is as follows: First, ${\sigma_{j(\bm{x})}}^{2}\ll 1$ will be observed in meaningful dimensions. For example, ${\sigma_{j(\bm{x})}}^{2}<0.1$ will almost always hold if the dimensional component carries information exceeding about 1.2 nat. Furthermore, when ${\sigma_{j(\bm{x})}}^{2}<0.1$, we have $-({\sigma_{j(\bm{x})}}^{2}/\log{\sigma_{j(\bm{x})}}^{2})<0.05$. Thus, by ignoring ${\sigma_{j(\bm{x})}}^{2}$ in Eq. 2, $D_{\mathrm{KL}j(x)}$ can be approximated as:

\[
D_{\mathrm{KL}j(x)}\simeq\frac{1}{2}\left({\mu_{j(\bm{x})}}^{2}-\log{\sigma_{j(\bm{x})}}^{2}-1\right)
=-\log\left(\sigma_{j(\bm{x})}\ \mathcal{N}(\mu_{j(\bm{x})};0,1)\right)-\frac{\log 2\pi e}{2}. \tag{11}
\]

As a result, the first approximation in Eq. 10 is derived by summing the last expression of Eq. 11 over the dimensions. Then, using $p(\bm{\mu}_{(\bm{x})})=p(\bm{x})\,|\mathrm{det}(\partial\bm{x}/\partial\bm{\mu}_{(\bm{x})})|$, the last expression in Eq. 10 is derived. Appendix A.2 shows that the first approximation in Eq. 10 can also be derived for arbitrary priors, which suggests that the theoretical derivations that follow in this section also hold for arbitrary priors.
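The quality of the approximation in Eq. 11 can be checked directly, as in the following numpy sketch; the $(\mu,\sigma)$ test points are arbitrary and are not outputs of a trained encoder.

```python
# Quick numerical check of Eq. 11: when sigma^2 << 1, the exact per-dimension
# KL term of Eq. 2 is close to -log(sigma * N(mu;0,1)) - log(2*pi*e)/2.
# The difference is exactly sigma^2 / 2, hence negligible for small sigma.
import numpy as np

def std_normal_pdf(mu):
    return np.exp(-0.5 * mu**2) / np.sqrt(2.0 * np.pi)

def kl_exact(mu, sigma):
    return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)                   # Eq. 2

def kl_approx(mu, sigma):
    return -np.log(sigma * std_normal_pdf(mu)) - 0.5 * np.log(2 * np.pi * np.e)  # Eq. 11

for mu, sigma in [(0.5, 0.3), (1.5, 0.1), (0.0, 0.05)]:
    print(mu, sigma, kl_exact(mu, sigma), kl_approx(mu, sigma))
```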

Lemma 3. Estimation of transform loss:
Let $x\sim\mathcal{N}(x;0,{\sigma_{x}}^{2})$ be a 1-dimensional dataset. When $\beta$-VAE is trained on $x$, the ratio between the transform loss $D(x,\breve{x})$ and the coding loss $D(\breve{x},\hat{x})$ is estimated as:

\[
\frac{D(x,\breve{x})}{D(\breve{x},\hat{x})}\simeq\frac{\beta/2}{{\sigma_{x}}^{2}}. \tag{12}
\]

Proof: Appendix A.3 describes the proof. As explained there, this is analogous to the Wiener filter (Wiener, 1964), one of the most basic theories for signal restoration.

Lemma 3 is also validated experimentally on a multi-dimensional non-Gaussian toy dataset; Fig. 24 in Appendix D.2 shows that the experimental results match the theory well. Thus, we ignore the transform loss $D(x,\breve{x})$ in the discussion that follows, since we assume $\beta/2$ is sufficiently smaller than the variance of the input data. Using Lemmas 1–3, we can derive the approximate expansion of $L_{\bm{x}}$ as follows:

Theorem 1. Approximate expansion of VAE objective:
Assume $\beta/2$ is sufficiently smaller than the variance of the input dataset. The objective $L_{\bm{x}}$ can be approximated as:

\[
L_{\bm{x}}\simeq\sum_{j=1}^{n}{\sigma_{j(\bm{x})}}^{2}\ {}^{t}\bm{x}_{\mu_{j}}\bm{G}_{\bm{x}}\bm{x}_{\mu_{j}}
-\beta\log\Bigl(p(\bm{x})\Bigl|\mathrm{det}\Bigl(\frac{\partial\bm{x}}{\partial\bm{\mu}_{(\bm{x})}}\Bigr)\Bigr|\prod_{j=1}^{n}\sigma_{j(\bm{x})}\Bigr)-\frac{n\beta\log 2\pi e}{2}. \tag{13}
\]

Proof: Apply Lemmas 1–3 to $L_{\bm{x}}$ in Eq. 3.

Then, we can finally derive the implicit isometric embedding as a minimum condition of Eq. 13 via Lemmas 4–5.

Lemma 4. Orthogonality of Jacobian matrix in VAE:
At the minimum of Eq. 13, each pair $\bm{x}_{\mu_{j}}$ and $\bm{x}_{\mu_{k}}$ of column vectors in the Jacobian matrix $\partial\bm{x}/\partial\bm{\mu}_{(\bm{x})}$ is orthogonal in the Riemannian metric space, i.e., the inner product space with metric tensor $\bm{G}_{\bm{x}}$:

\[
(2{\sigma_{j(\bm{x})}}^{2}/\beta)\ {}^{t}\bm{x}_{\mu_{j}}\bm{G}_{\bm{x}}\bm{x}_{\mu_{k}}=\delta_{jk}. \tag{14}
\]

Proof: Eq. 14 is derived by examining the derivative $\mathrm{d}L_{\bm{x}}/\mathrm{d}\bm{x}_{\mu_{j}}=0$. The proof is described in Appendix A.4. The diagonal posterior covariance is the key to the orthogonality.

Eq. 14 is consistent with Rolínek et al. (2019), who showed the orthogonality for the SSE metric. In addition, we quantify the Jacobian matrix for arbitrary metric spaces.

Lemma 5. L2 norm of $\bm{x}_{\mu_{j}}$:
The L2 norm of $\bm{x}_{\mu_{j}}$ in the metric space of $\bm{G}_{\bm{x}}$ is derived as:

\[
|\bm{x}_{\mu_{j}}|_{2}=\sqrt{{}^{t}\bm{x}_{\mu_{j}}\bm{G}_{\bm{x}}\bm{x}_{\mu_{j}}}=\sqrt{\beta/2}/\sigma_{j(\bm{x})}. \tag{15}
\]

Proof: Apply $k=j$ to Eq. 14 and arrange it.

Theorem 2. Implicit isometric embedding:
An implicit isometric embedding $\bm{y}$ is introduced by mapping the $j$-th component $\mu_{j(\bm{x})}$ of the VAE latent variable to $y_{j}$ with the following scaling factor:

\[
\mathrm{d}y_{j}/\mathrm{d}\mu_{j(\bm{x})}=|\bm{x}_{\mu_{j}}|_{2}=\sqrt{\beta/2}/\sigma_{j(\bm{x})}. \tag{16}
\]

Let $\bm{x}_{y_{j}}$ denote $\partial\bm{x}/\partial y_{j}$. Then $\bm{x}_{y_{j}}$ satisfies:

\[
{}^{t}\bm{x}_{y_{j}}\bm{G}_{\bm{x}}\bm{x}_{y_{k}}=\delta_{jk}. \tag{17}
\]

This shows the isometric embedding from the inner product space of $\bm{x}$ with metric $\bm{G}_{\bm{x}}$ to the Euclidean space of $\bm{y}$.

Proof: Apply $\bm{x}_{\mu_{j}}=(\mathrm{d}y_{j}/\mathrm{d}\mu_{j(\bm{x})})\,\bm{x}_{y_{j}}$ to Eq. 14.

Remark 1: The isometricity in Eq. 17 is on the decoder side. Since the transform loss $D(\bm{x},\breve{\bm{x}})$ is close to 0, $\mathrm{Dec}_{\theta}(\bm{\mu}_{(\bm{x})})\simeq\mathrm{Enc}_{\phi}^{-1}(\bm{\mu}_{(\bm{x})})$ holds, so the isometricity on the encoder side is also almost achieved. If $D(\bm{x},\breve{\bm{x}})$ is explicitly reduced by using a decomposed loss, the isometricity is further promoted.

Theorem 3. Posterior variance in isometric embedding:
The posterior variance of the implicit isometric embedding is a constant $\beta/2$ for all inputs and dimensional components.

Proof: Let ${\sigma_{y_{j}(\bm{x})}}^{2}$ be the posterior variance of the implicit isometric component $y_{j}$. By scaling $\sigma_{j(\bm{x})}$ of the original VAE latent variable with Eq. 16, $\sigma_{y_{j}(\bm{x})}$ is derived as:

\[
\sigma_{y_{j}(\bm{x})}\simeq\sigma_{j(\bm{x})}\frac{\mathrm{d}y_{j}}{\mathrm{d}\mu_{j(\bm{x})}}=\sqrt{\beta/2}. \tag{18}
\]

Thus, the posterior variance ${\sigma_{y_{j}(\bm{x})}}^{2}$ is a constant $\beta/2$ for all dimensions $j$ and all inputs $\bm{x}$, as stated in Section 4.1.

4.3 Quantitative data analysis method using implicit isometric embedding in VAE

This section describes three quantitative data analysis methods by utilizing the property of isometric embedding.

4.3.1 Estimation of the data probability distribution:

Estimation of data distribution is one of the key targets in machine learning. We show VAE can estimate the distribution in both metric space and input space quantitatively.

Proposition 1. Probability estimation in metric space:
Let $p_{\bm{G}_{\bm{x}}}(\bm{x})$ be the probability distribution in the inner product space of $\bm{G}_{\bm{x}}$. $p_{\bm{G}_{\bm{x}}}(\bm{x})$ can be quantitatively estimated as:

\[
p_{\bm{G}_{\bm{x}}}(\bm{x})\ \simeq\ p(\bm{y})\ \propto\ p(\bm{\mu}_{(\bm{x})})\prod_{j=1}^{m}\sigma_{j(\bm{x})}\ \propto\ \exp(-L_{\bm{x}}/\beta). \tag{19}
\]

Proof: Appendix A.5 explains the details. The outline is as follows: The relation $p(\bm{y})\propto p(\bm{\mu}_{(\bm{x})})\prod_{j}\sigma_{j(\bm{x})}$ is derived by applying Eq. 16 to $p(\bm{y})=\prod_{j}p(y_{j})=\prod_{j}(\mathrm{d}y_{j}/\mathrm{d}\mu_{j(\bm{x})})^{-1}p(\mu_{j})$, showing that $\sigma_{j(\bm{x})}$ bridges the distributions of the input data and the prior. The relation to $\exp(-L_{\bm{x}}/\beta)$ is derived by applying Eq. 16 to Eq. 13, and implies that the VAE objective converges to the log-likelihood of the input $\bm{x}$, as expected. When the metric is SSE, Eq. 19 shows the probability distribution in the input space, since $\bm{G}_{\bm{x}}$ is the identity matrix.

Proposition 2. Probability estimation in the input space:
In the case $m=n$, the probability distribution $p(\bm{x})$ in the input space can be estimated as:

\[
p(\bm{x})=|\mathrm{det}(\bm{G}_{\bm{x}})|^{\frac{1}{2}}\ p_{\bm{G}_{\bm{x}}}(\bm{x})\simeq|\mathrm{det}(\bm{G}_{\bm{x}})|^{\frac{1}{2}}\ p(\bm{y})
\propto|\mathrm{det}(\bm{G}_{\bm{x}})|^{\frac{1}{2}}\ p(\bm{\mu}_{(\bm{x})})\prod_{j=1}^{m}\sigma_{j(\bm{x})}
\propto|\mathrm{det}(\bm{G}_{\bm{x}})|^{\frac{1}{2}}\exp(-L_{\bm{x}}/\beta). \tag{20}
\]

In the case $m>n$, when $\bm{G}_{\bm{x}}=a_{\bm{x}}\bm{I}_{m}$ holds for an $\bm{x}$-dependent scalar factor $a_{\bm{x}}$, $p(\bm{x})$ can be estimated as:

\[
p(\bm{x})\propto{a_{\bm{x}}}^{\frac{n}{2}}\ p(\bm{\mu}_{(\bm{x})})\prod_{j=1}^{n}\sigma_{j(\bm{x})}\propto{a_{\bm{x}}}^{\frac{n}{2}}\exp(-L_{\bm{x}}/\beta). \tag{21}
\]

Proof: The absolute value of the Jacobian determinant between the input and metric spaces gives the PDF ratio. In the case $m=n$, this is $|\mathrm{det}(\bm{G}_{\bm{x}})|^{\frac{1}{2}}$. In the case $m>n$ with $\bm{G}_{\bm{x}}=a_{\bm{x}}\bm{I}_{m}$, the Jacobian determinant is proportional to ${a_{\bm{x}}}^{n/2}$. Appendix A.6 explains the details.
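The following is a minimal numpy sketch of the probability estimates in Eqs. 19 and 21, assuming a standard normal prior; the posterior parameters, the per-sample objective `L_x`, and the metric factor `a_x` are placeholders standing in for values measured on a trained $\beta$-VAE.

```python
# Sketch of the (unnormalized) probability estimates of Eqs. 19 and 21,
# given per-sample posterior parameters (mu, sigma), the objective L_x,
# and the scalar metric factor a_x for the m > n case.
import numpy as np

def prob_metric_space(mu, sigma):
    """p_Gx(x) ~ p(mu) * prod_j sigma_j   (Eq. 19, standard normal prior)."""
    log_prior = -0.5 * np.sum(mu**2) - 0.5 * len(mu) * np.log(2 * np.pi)
    return np.exp(log_prior + np.sum(np.log(sigma)))

def prob_from_loss(L_x, beta, a_x=1.0, n=None):
    """p(x) ~ a_x^(n/2) * exp(-L_x / beta)   (Eq. 21; a_x = 1 recovers Eq. 19)."""
    scale = a_x ** (0.5 * n) if n is not None else 1.0
    return scale * np.exp(-L_x / beta)

mu = np.array([0.3, -1.1, 0.4])        # illustrative posterior means
sigma = np.array([0.05, 0.12, 0.30])   # illustrative posterior std devs
print(prob_metric_space(mu, sigma))
print(prob_from_loss(L_x=8.0, beta=0.2, a_x=1.1, n=3))
```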

4.3.2 Quantitative analysis of disentanglement

Assume that the data manifold has a disentangled structure with independent latent variables by nature. Then each $y_{j}$ will capture one such disentangled latent variable, as in PCA. This subsection explains how to derive the importance of each dimension in the given metric for data analysis.

Proposition 3. Meaningful dimension:
The dimensional components $y_{j}$ with $D_{\mathrm{KL}j(x)}>0$ carry meaningful information for the representation; in these dimensions, the entropy of $y_{j}$ is larger than $H(\mathcal{N}(0,\beta/2))=\log(\beta\pi e)/2$. In contrast, a dimension with $D_{\mathrm{KL}j(x)}=0$ carries no information, and $\mu_{j(\bm{x})}=0$ and $\sigma_{j(\bm{x})}=1$ will be observed there.

Proof: Appendix A.7 shows the details from the viewpoint of RD theory. The appendix also explains that the entropy of $\bm{y}$ becomes minimal after optimization.

Proposition 4. Importance of each dimension:
Assume that the prior $p(\bm{z})$ is a Gaussian distribution $\mathcal{N}(\bm{z};0,\bm{I}_{n})$. Let $\mathrm{Var}(y_{j})$ be the variance of the $j$-th implicit isometric component $y_{j}$, indicating the quantitative importance of each dimension. $\mathrm{Var}(y_{j})$ in a meaningful dimension ($D_{\mathrm{KL}j(x)}>0$) can be roughly estimated as:

\[
\mathrm{Var}(y_{j})\simeq(\beta/2)\ E_{\bm{x}\sim p(\bm{x})}[{\sigma_{j(\bm{x})}}^{-2}]. \tag{22}
\]

Proof: Appendix A.8 shows the derivation from Eq. 16. The case of a non-Gaussian prior is also explained there.
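A possible workflow for Propositions 3–4 is sketched below, under the assumption that posterior parameters have been collected over a dataset; here `mu_all` and `sigma_all` are synthetic placeholders standing in for encoder outputs of shape (num_samples, n).

```python
# Sketch of the dimension-importance estimate (Eq. 22) and the meaningful-
# dimension criterion (Proposition 3) from posterior parameters. The arrays
# below are synthetic stand-ins for trained-encoder outputs.
import numpy as np

rng = np.random.default_rng(1)
num_samples, n = 1000, 4
mu_all = rng.normal(size=(num_samples, n))
sigma_all = np.abs(rng.normal(loc=[0.05, 0.2, 0.6, 1.0], scale=0.01,
                              size=(num_samples, n)))

beta = 0.1
kl_per_dim = 0.5 * (mu_all**2 + sigma_all**2 - np.log(sigma_all**2) - 1).mean(axis=0)
var_y = (beta / 2.0) * np.mean(sigma_all**-2, axis=0)      # Eq. 22
meaningful = kl_per_dim > 0.01                             # D_KL_j(x) > 0, with a small tolerance

order = np.argsort(-var_y)                                 # rank dimensions like PCA eigenvalues
for j in order:
    print(f"dim {j}: Var(y_j)~{var_y[j]:.3f}, KL~{kl_per_dim[j]:.3f}, meaningful={meaningful[j]}")
```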

4.3.3 Check the isometricity after training

This subsection explains how to determine whether the model has acquired an isometric embedding by evaluating the norm of $\bm{x}_{y_{j}}$. Let $\bm{e}_{(j)}$ be the unit vector $(0,\cdots,1,\cdots,0)$ whose $j$-th dimension is 1 and whose other dimensions are 0. Let $D^{\prime}_{j}(\bm{z})$ be $D(\mathrm{Dec}_{\theta}(\bm{z}),\mathrm{Dec}_{\theta}(\bm{z}+\epsilon\bm{e}_{(j)}))/\epsilon^{2}$, where $\epsilon$ denotes a small value for numerical differentiation. Then the squared L2 norm of $\bm{x}_{y_{j}}$ can be evaluated as in the last expression of:

\[
{}^{t}\bm{x}_{y_{j}}\bm{G}_{\bm{x}}\bm{x}_{y_{j}} \simeq (2/\beta)\left({\sigma_{j(\bm{x})}}^{2}\ {}^{t}\bm{x}_{\mu_{j}}\bm{G}_{\bm{x}}\bm{x}_{\mu_{j}}\right) \simeq (2/\beta)\ {\sigma_{j(\bm{x})}}^{2}D^{\prime}_{j}(\bm{z}). \tag{23}
\]

Observing a value close to 1 means a unit norm and indicates that an implicit isometric embedding is captured.

Remark 2: Eq. 23 will not hold, and the norm will be 0, in a dimension where $D_{\mathrm{KL}j(x)}=0$, since the reconstruction loss, i.e., $\beta/2$ times the squared L2 norm of $\bm{x}_{y_{j}}$, and $D_{\mathrm{KL}j(x)}$ do not have to be balanced in Eq. 13.
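The finite-difference evaluation of Eq. 23 can be sketched as follows; the linear `decode` map and the posterior standard deviations are illustrative stand-ins for a trained decoder and encoder, so the printed values are not expected to be close to 1 here.

```python
# Sketch of the isometricity check in Eq. 23:
# D'_j(z) = D(Dec(z), Dec(z + eps*e_j)) / eps^2, then report (2/beta) sigma_j^2 D'_j(z).
import numpy as np

rng = np.random.default_rng(2)
m, n = 16, 3
W_dec = rng.normal(size=(m, n))      # illustrative decoder weights

def decode(z):
    return W_dec @ z

def sse(x, x_hat):
    return np.sum((x - x_hat) ** 2)  # D(.,.) with the SSE metric

def d_prime(z, j, eps=1e-3):
    e = np.zeros(n); e[j] = eps
    return sse(decode(z), decode(z + e)) / eps**2

beta = 0.1
z = rng.normal(size=n)
sigma = np.array([0.05, 0.1, 0.3])   # illustrative posterior std devs at this z
for j in range(n):
    print(j, (2.0 / beta) * sigma[j]**2 * d_prime(z, j))   # values ~1 would indicate isometry
```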

5 Experiment

This section describes three experimental results. First, the results of the toy dataset are examined to validate our theory. Next, the disentanglement analysis for the CelebA dataset is presented. Finally, an anomaly detection task is evaluated to show the usefulness of data distribution estimation.

5.1 Quantitative evaluation in the toy dataset

The toy dataset is generated as follows. First, three variables $s_{1}$, $s_{2}$, and $s_{3}$ are sampled according to the three different distributions $p(s_{1})$, $p(s_{2})$, and $p(s_{3})$ shown in Fig. 2. The variances of $s_{1}$, $s_{2}$, and $s_{3}$ are $1/6$, $2/3$, and $8/3$, respectively, such that the ratio of the variances is 1:4:16. Second, three 16-dimensional uncorrelated vectors $\bm{v}_{1}$, $\bm{v}_{2}$, and $\bm{v}_{3}$ with L2 norm 1 are provided. Finally, 50,000 toy data points with 16 dimensions are generated by $\bm{x}=\sum_{i=1}^{3}s_{i}\bm{v}_{i}$. The data distribution $p(\bm{x})$ is thus $p(s_{1})p(s_{2})p(s_{3})$. If our hypothesis is correct, $p(y_{j})$ will be close to $p(s_{j})$, and $\sigma_{j(\bm{x})}\propto\mathrm{d}z_{j}/\mathrm{d}y_{j}=p(y_{j})/p(z_{j})$ will vary considerably across these PDFs. Because the properties in Section 4.3 are derived from $\sigma_{j(\bm{x})}$, our theory can be easily validated by evaluating those properties.
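A generation sketch in the spirit of this description is given below. The three marginal distributions are stand-ins (normal, uniform, Laplace) scaled to the stated variances $1/6$, $2/3$, and $8/3$; the actual PDF shapes used in the paper are those of Fig. 2, and the orthonormal vectors $\bm{v}_{i}$ are drawn here via a QR decomposition.

```python
# Toy-data generation sketch: x = s1*v1 + s2*v2 + s3*v3 with Var(s) = 1/6, 2/3, 8/3.
# The marginal shapes below are assumptions, not the paper's exact Fig. 2 PDFs.
import numpy as np

rng = np.random.default_rng(3)
N, m = 50_000, 16

s1 = rng.normal(scale=np.sqrt(1 / 6), size=N)          # Var = 1/6
s2 = rng.uniform(-np.sqrt(2), np.sqrt(2), size=N)      # Var = (2*sqrt(2))^2 / 12 = 2/3
s3 = rng.laplace(scale=np.sqrt(4 / 3), size=N)         # Var = 2 * scale^2 = 8/3

Q, _ = np.linalg.qr(rng.normal(size=(m, 3)))           # three orthonormal 16-dim vectors
V = Q.T                                                # rows are v_1, v_2, v_3

X = np.stack([s1, s2, s3], axis=1) @ V                 # shape (N, 16)
print(X.shape, np.var(s1), np.var(s2), np.var(s3))
```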

Figure 2: PDFs of the three variables used to generate the toy dataset.
Table 1: Property measurements of the toy dataset trained with the square error loss.

variable                                 z1        z2        z3
$\frac{2}{\beta}\sigma_j^2 D'_j$   Av.   0.965     0.925     0.972
                                   SD    0.054     0.164     0.098
$D'_j(\bm{z})$                     Av.   0.162     0.726     2.922
                                   SD    0.040     0.466     1.738
$\sigma_{j(\bm{x})}^{-2}$          Av.   3.33e1    1.46e2    5.89e2
                           (Ratio) Av.   1.000     4.39      17.69
Table 2: Property measurements of the toy dataset trained with the downward-convex loss.

variable                                 z1        z2        z3
$\frac{2}{\beta}\sigma_j^2 D'_j$   Av.   0.964     0.928     0.978
                                   SD    0.060     0.160     0.088
$D'_j(\bm{z})$                     Av.   0.161     0.696     2.695
                                   SD    0.063     0.483     1.573
$\sigma_{j(\bm{x})}^{-2}$          Av.   3.30e1    1.40e2    5.43e2
                           (Ratio) Av.   1.000     4.25      16.22
Figure 3: Scatter plots of the data distribution (x-axis) versus four estimated probabilities (y-axis) for the downward-convex loss. The y-axes are (a) $p(\bm{\mu}_{(\bm{x})})$, (b) $\exp(-L_{\bm{x}}/\beta)$, (c) $a_{\bm{x}}^{3/2}p(\bm{\mu}_{(\bm{x})})\prod_{j}\sigma_{j(\bm{x})}$, and (d) $a_{\bm{x}}^{3/2}\exp(-L_{\bm{x}}/\beta)$.
Figure 4: Average of $\sigma_{j(\bm{x})}^{-2}$ and $\frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z})$ in VAE for the CelebA dataset.
Figure 5: Average of $\sigma_{j(\bm{x})}^{-2}$ and $\frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z})$ in VAE for the CelebA dataset with the explicit decomposed loss.
Figure 6: Dependency of the decoded image changes with $z_{j}=-2$ to $2$ on the average of $\sigma_{j(\bm{x})}^{-2}$.

Then, the VAE model is trained using Eq. 3. We use two kinds of reconstruction loss $D(\cdot,\cdot)$ to analyse the effect of the loss metric. The first is the square error loss, equivalent to SSE. The second is the downward-convex loss that we design as Eq. 24, such that its shape becomes similar to the BCE loss as described in Appendix G.2:

\[
D(\bm{x},\hat{\bm{x}})=a_{\bm{x}}\|\bm{x}-\hat{\bm{x}}\|_{2}^{2},\quad
\text{where}\ a_{\bm{x}}=2/3+2\|\bm{x}\|_{2}^{2}/21\ \text{and}\ \bm{G}_{\bm{x}}=a_{\bm{x}}\bm{I}_{m}. \tag{24}
\]

Here, $a_{\bm{x}}$ is chosen such that the mean of $a_{\bm{x}}$ over the toy dataset is 1.0, since the variance of $\bm{x}$ is $1/6+2/3+8/3=7/2$. The details of the networks and training conditions are given in Appendix C.1.
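The loss of Eq. 24 and the normalization of $a_{\bm{x}}$ can be checked with the small sketch below; the Gaussian `X` is only a stand-in with $E[\|\bm{x}\|_{2}^{2}]=7/2$, and the toy data generated earlier could be used instead.

```python
# Sketch of the downward-convex loss (Eq. 24) and a check that mean(a_x) ~ 1
# when E[||x||^2] = 7/2.
import numpy as np

def a_of_x(x):
    return 2.0 / 3.0 + 2.0 * np.sum(x**2, axis=-1) / 21.0       # a_x in Eq. 24

def downward_convex_loss(x, x_hat):
    return a_of_x(x) * np.sum((x - x_hat)**2, axis=-1)          # D(x, x_hat) = a_x ||x - x_hat||^2

rng = np.random.default_rng(4)
X = rng.normal(scale=np.sqrt(7.0 / 32.0), size=(1000, 16))      # 16 * 7/32 = 7/2 = E[||x||^2]
print(np.mean(a_of_x(X)))                                       # close to 1.0
```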

After training with the two types of reconstruction loss, the loss ratio $D(x,\breve{x})/D(\breve{x},\hat{x})$ is 0.023 for the square error loss and 0.024 for the downward-convex loss. As expected from Lemma 3, the transform losses are negligibly small.

First, the implicit isometric property is examined. Tables 1 and 2 show the measurements of $\frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z})$ (shown as $\frac{2}{\beta}\sigma_{j}^{2}D^{\prime}_{j}$), $D^{\prime}_{j}(\bm{z})$, and $\sigma_{j(\bm{x})}^{-2}$ described in Section 4.3. In these tables, $z_{1}$, $z_{2}$, and $z_{3}$ are the acquired latent variables, and "Av." and "SD" denote the average and standard deviation, respectively. In both tables, the values of $\frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z})$ are close to 1.0 in each dimension, showing the isometricity as in Eq. 23. By contrast, the average of $D^{\prime}_{j}(\bm{z})$, which corresponds to ${}^{t}\bm{x}_{\mu_{j}}\bm{G}_{\bm{x}}\bm{x}_{\mu_{j}}$, differs in each dimension. Thus, $\bm{x}_{\mu_{j}}$ for the original VAE latent variable is not isometric.

Next, the disentanglement analysis is examined. The average of $\sigma_{j(\bm{x})}^{-2}$ in Eq. 22 and its ratio are shown in Tables 1 and 2. Although the average of $\sigma_{j(\bm{x})}^{-2}$ is only a rough estimate of the variance, the ratio is close to 1:4:16, i.e., the variance ratio of the generation parameters $s_{1}$, $s_{2}$, and $s_{3}$. Comparing the two losses, the ratios for $s_{2}$ and $s_{3}$ under the downward-convex loss are somewhat smaller than those under the square error loss. This is explained as follows. Under the downward-convex loss, $|\bm{x}_{y_{j}}|_{2}^{2}$ tends to be $1/a_{\bm{x}}$ from Eq. 17, i.e., ${}^{t}\bm{x}_{y_{j}}(a_{\bm{x}}\bm{I}_{m})\bm{x}_{y_{k}}=\delta_{jk}$. Therefore, the region of the metric space with a larger norm is shrunk, and the estimated variances corresponding to $s_{2}$ and $s_{3}$ become smaller.

Finally, we examine the probability estimation. Figure 3 shows scatter plots of the data distribution $p(\bm{x})$ versus the estimated probabilities for the downward-convex loss. Figure 3(a) plots $p(\bm{x})$ against the prior probability $p(\bm{\mu}_{(\bm{x})})$. This graph implies that it is difficult to estimate $p(\bm{x})$ from the prior alone; the correlation coefficient, shown as "R" (0.434), is also low. Figure 3(b) plots $p(\bm{x})$ against $\exp(-L_{\bm{x}}/\beta)$ in Eq. 19. The correlation coefficient (0.771) is better, but still not high. Lastly, Figures 3(c)-(d) plot $a_{\bm{x}}^{3/2}p(\bm{\mu}_{(\bm{x})})\prod_{j}\sigma_{j(\bm{x})}$ and $a_{\bm{x}}^{3/2}\exp(-L_{\bm{x}}/\beta)$ in Eq. 21, showing high correlations around 0.91. This strongly supports our theoretical probability estimation, which takes the metric space into account.

Appendix D also shows the results using the square error loss. There, the correlation coefficient for $\exp(-L_{\bm{x}}/\beta)$ also reaches a high score of 0.904, since the input and metric spaces are equivalent.

Appendix D shows an exhaustive ablation study with different PDFs, losses, and $\beta$, which further supports our theory.

5.2 Evaluations in CelebA dataset

This section presents the disentanglement analysis using VAE for the CelebA dataset (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) (Liu et al., 2015). This dataset is composed of 202,599 celebrity facial images. The images are center-cropped to form $64\times 64$ sized images. As the reconstruction loss, we use SSIM, which is close to subjective quality evaluation. The details of the networks and training conditions are given in Appendix C.2.

Figure 4 shows the averages of $\sigma_{j(\bm{x})}^{-2}$ in Eq. 22 as the estimated variances, as well as the average and standard deviation of $\frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z})$ in Eq. 23 as the estimated squared norm of the implicit transform. The latent variables $z_{j}$ are numbered in descending order of the estimated variance. In the dimensions beyond the 27th, the averages of $\sigma_{j(\bm{x})}^{-2}$ are close to 1 and those of $\frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z})$ are close to 0, implying $D_{\mathrm{KL}}(\cdot)=0$. Between the 1st and 26th dimensions, the mean and standard deviation of the $\frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z})$ averages are 1.83 and 0.13, respectively. This implies that the variance ${\sigma_{y_{j}(\bm{x})}}^{2}$ is around $1.83(\beta/2)$. These values are almost constant with a small standard deviation; however, the mean is somewhat larger than the expected value of 1. This suggests that the implicit embedding $\bm{y}^{\prime}$ satisfying $\mathrm{d}y^{\prime}_{j}/\mathrm{d}\mu_{j(\bm{x})}=\sqrt{1.83(\beta/2)}/\sigma_{j(\bm{x})}$ can be considered almost isometric. Thus, the $\sigma_{j(\bm{x})}^{-2}$ averages can still determine the quantitative importance of each dimension.

We also train VAE using the decomposed loss explicitly, i.e., $L_{\bm{x}}=D(\bm{x},\breve{\bm{x}})+D(\breve{\bm{x}},\hat{\bm{x}})+\beta D_{\mathrm{KL}}(\cdot)$. Figure 5 shows the result. Here, the mean and standard deviation of the $\frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z})$ averages are 0.92 and 0.04, respectively, suggesting almost a unit norm. This result implies that the explicit use of the decomposed loss promotes the isometricity and allows for better analysis, as explained in Remark 1.

Figure 6 shows the decoder outputs when each selected latent variable is traversed from $-2$ to $2$ while the rest are set to 0. The average of $\sigma_{j(\bm{x})}^{-2}$ is also shown there. The components are grouped by their $\sigma_{j(\bm{x})}^{-2}$ averages: $z_{1}$, $z_{2}$, $z_{3}$ in the large group, $z_{16}$, $z_{17}$ in the medium group, and $z_{32}$ in the small group. In the large group, significant changes of background brightness, face direction, and hair color are observed. In the medium group, we can see minor changes such as facial expressions. In the small group, there are almost no changes. In addition, Appendix E.1 shows the traversed outputs of all dimensional components in descending order of the $\sigma_{j(\bm{x})}^{-2}$ averages, where the degree of image change clearly depends on the $\sigma_{j(\bm{x})}^{-2}$ average. Thus, it is strongly supported that the average of $\sigma_{j(\bm{x})}^{-2}$ indicates the importance of each dimensional component, as in PCA.

5.3 Anomaly detection with realistic data

Table 3: Average and standard deviation (in brackets) of F1. Scores other than those of the vanilla VAE are cited from Liao et al. (2018) (GMVAE) and Kato et al. (2020) (DAGMM, RaDOGAGA).

Dataset       Method              F1
KDDCup        GMVAE               0.9326
              DAGMM               0.9500 (0.0052)
              RaDOGAGA(d)         0.9624 (0.0038)
              RaDOGAGA(log(d))    0.9638 (0.0042)
              vanilla VAE         0.9642 (0.0007)
Thyroid       GMVAE               0.6353
              DAGMM               0.4755 (0.0491)
              RaDOGAGA(d)         0.6447 (0.0486)
              RaDOGAGA(log(d))    0.6702 (0.0585)
              vanilla VAE         0.6596 (0.0436)
Arrhythmia    GMVAE               0.4308
              DAGMM               0.5060 (0.0395)
              RaDOGAGA(d)         0.5433 (0.0468)
              RaDOGAGA(log(d))    0.5373 (0.0411)
              vanilla VAE         0.4985 (0.0412)
KDDCup-rev    DAGMM               0.9779 (0.0018)
              RaDOGAGA(d)         0.9797 (0.0015)
              RaDOGAGA(log(d))    0.9865 (0.0009)
              vanilla VAE         0.9880 (0.0008)

Using a vanilla VAE model with a single Gaussian prior, we finally examine the performance in anomaly detection, for which PDF estimation is the key issue. We use four public datasets (available at https://kdd.ics.uci.edu/ and http://odds.cs.stonybrook.edu): KDDCUP99, Thyroid, Arrhythmia, and KDDCUP-Rev. The details of the datasets and network configurations are given in Appendix H. For a fair comparison with previous works, we follow the setting in Zong et al. (2018). A randomly extracted 50% of the data is assigned to training and the rest to testing, and the model is trained using normal data only. Here, we use the explicit decomposed loss to promote the isometricity, with SSE as the coding loss. At test time, the anomaly score of each sample is set to $L_{\bm{x}}$ in Eq. 3 after training, since $-L_{\bm{x}}/\beta$ gives the log-likelihood of the input data from Proposition 1. Samples with anomaly scores above a threshold are identified as anomalies, where the threshold is given by the ratio of anomalous data in each dataset; for instance, in KDDCup99, data with $L_{\bm{x}}$ in the top 20% are detected as anomalies. We run the experiments 20 times for each dataset, split by 20 different random seeds.
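A minimal sketch of this scoring protocol is given below: the per-sample objective $L_{\bm{x}}$ is used as the anomaly score and the top fraction matching the known anomaly ratio is flagged. The random scores here are placeholders standing in for $L_{\bm{x}}$ values computed from a trained VAE.

```python
# Anomaly-detection sketch: score = L_x, flag the top `anomaly_ratio` fraction.
import numpy as np

def detect_anomalies(scores, anomaly_ratio):
    """Flag samples whose anomaly score L_x lies in the top `anomaly_ratio` fraction."""
    threshold = np.quantile(scores, 1.0 - anomaly_ratio)
    return scores > threshold

rng = np.random.default_rng(5)
scores = rng.gamma(shape=2.0, size=10_000)              # stand-in for L_x on test samples
flags = detect_anomalies(scores, anomaly_ratio=0.2)     # e.g., 20% as in KDDCup99
print(flags.mean())                                     # fraction flagged, ~0.2
```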

5.3.1 BASELINE METHODS

We compare with previous methods that conducted the same experiments: GMVAE (Liao et al., 2018), DAGMM (Zong et al., 2018), and RaDOGAGA (Kato et al., 2020). All of them apply a GMM as the prior, on the belief that a GMM is more appropriate for capturing a complex data distribution than a VAE with a single Gaussian prior.

5.3.2 Results

Table 3 reports the average F1 scores and standard deviations (in brackets); recall and precision are shown in Appendix H. Liao et al. (2018) argued that the vanilla VAE is not appropriate for PDF estimation. Contrary to their claim, by exploiting the quantitative property proven in this paper, even a vanilla VAE achieves state-of-the-art performance on KDDCup99 and KDDCup-rev. On the other datasets, the score of VAE is comparable to RaDOGAGA, the previous best method. RaDOGAGA attempts to fit a parametric distribution such as a GMM to the input distribution in the isometric space; however, the goodness of fit strongly depends on the capability of the parametric distribution. By contrast, VAE can flexibly adapt a simple prior distribution to the input distribution via the trainable posterior variance $\sigma_{j(\bm{x})}$. As a result, VAE provides a simpler tool for estimating the data distribution.

6 Relation with previous studies

First of all, we show that VAE can be interpreted as a rate-distortion (RD) optimal encoder based on RD theory (Berger, 1971), which has been successfully applied to image compression in industry. The optimal transform coding (Goyal, 2001) for Gaussian data with the SSE metric is formulated as follows. First, the data are transformed deterministically using an orthonormal transform (orthogonal with unit norm) with a PCA basis; note that the orthonormal transform is a special case of isometric embedding in which the encoder is restricted to be linear. Then, the transformed data are entropy-coded. The key point for RD optimality is to introduce an equivalent stochastic distortion in all dimensions (or to use a uniform quantizer in image compression). The rate $R_{\mathrm{opt}}$ at the optimum is derived as follows. Let $\bm{z}\in Z$ denote the transformed data, let $z_{j}$ be the $j$-th dimensional component of $\bm{z}$, and let ${\sigma_{zj}}^{2}$ denote the variance of $z_{j}$ over the dataset; ${\sigma_{zj}}^{2}$ is equivalent to the PCA eigenvalue of each dimension. Let ${\sigma_{d}}^{2}$ be the distortion equally allowed in each dimensional channel. Assume the input dimension is $m$ and ${\sigma_{d}}^{2}$ is smaller than ${\sigma_{zj}}^{2}$ for all $j$. Then, $R_{\mathrm{opt}}$ is derived as:

\[
R_{\mathrm{opt}} = \sum_{j=1}^{m}\bigl(H(\mathcal{N}(z_{j};0,{\sigma_{zj}}^{2}))-H(\mathcal{N}(z_{j};0,{\sigma_{d}}^{2}))\bigr)
= H(Z)-H(\mathcal{N}(0,{\sigma_{d}}^{2}\bm{I}_{m})). \tag{25}
\]

Here, if ${\sigma_{d}}^{2}$ is set to $\beta/2$, Eq. 8 is equivalent to Eq. 25. This suggests that VAE can be considered a rate-distortion optimal encoder, where RD theory is extended from the linear orthonormal transform to a general isometric embedding in the given metric. More details are described in Appendix B.6.
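Eq. 25 can be evaluated numerically under Gaussian assumptions, as in the short sketch below; the per-dimension eigenvalues are illustrative and are assumed to be larger than the allowed distortion ${\sigma_{d}}^{2}=\beta/2$.

```python
# Numerical sketch of Eq. 25: optimal rate R_opt from per-dimension signal
# variances (PCA eigenvalues) and an allowed distortion sigma_d^2 = beta/2.
import numpy as np

def gaussian_entropy(var):
    return 0.5 * np.log(2.0 * np.pi * np.e * var)   # differential entropy of N(0, var), in nats

eigvals = np.array([2.0, 1.0, 0.25])    # illustrative sigma_zj^2, all larger than the distortion
beta = 0.1
sigma_d2 = beta / 2.0

R_opt = np.sum(gaussian_entropy(eigvals) - gaussian_entropy(sigma_d2))   # Eq. 25
print(R_opt, "nats")
```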

Figure 7: Conceptual explanation of the captured features in the implicit isometric space for a 2D manifold with non-zero Gaussian curvature: (a) features for large $\beta$; (b) features for small $\beta$.

Next, our theory can intuitively explain how the features captured by $\beta$-VAE behave when $\beta$ is varied. Higgins et al. (2017) suggest that $\beta$-VAE with a large $\beta$ can capture global features while degrading the reconstruction quality. Our intuitive explanation is as follows. Consider a 2D manifold in 3D space. According to Gauss's Theorema Egregium, the Gaussian curvature is an intrinsic invariant of a 2D surface and is unchanged by any isometric embedding (Andrews, 2002). Figure 7 gives a conceptual illustration of the captured features in the implicit isometric space for a 2D manifold with non-zero Gaussian curvature. Our theory shows that $\beta/2$ can be regarded as the allowable distortion in each dimensional component of the implicit isometric embedding. If $\beta$ is large, as in Fig. 7(a), $\beta$-VAE captures global features in the implicit isometric space, allowing a large distortion at a lower rate. If $\beta$ is small, as in Fig. 7(b), $\beta$-VAE captures only fragmented features, allowing a small distortion at a higher rate. We believe similar behaviors occur in general higher-dimensional manifolds.

Finally, we correct the analysis in Alemi et al. (2018). They state that "the ELBO objective alone cannot distinguish between models that make no use of the latent variable versus models that make large use of the latent variable and learn useful representations for reconstruction," because the reconstruction loss and KL divergence have unstable values after training. For this reason, they introduce a new objective $D(\bm{x},\hat{\bm{x}})+|D_{\mathrm{KL}}(\cdot)-\sigma|$ with a target rate $\sigma$ to fix this instability. In fact, the reconstruction loss and KL divergence are stably derived as functions of $\beta$, as shown in Appendices B.1 and B.4.

Our theory can further explain the analysis results of related prior works such as Higgins et al. (2017); Alemi et al. (2018); Dai et al. (2018); Dai & Wipf (2019), and Tishby et al. (1999). The details are described in Appendix B.

7 Conclusion

This paper provides a quantitative understanding of VAE via a non-linear mapping to an isometric embedding. According to rate-distortion theory, optimal transform coding is achieved by an orthonormal transform with a PCA basis, where the transform space is isometric to the input. From this analogy, we show theoretically and experimentally that VAE can be mapped to an implicit isometric embedding with a scale factor derived from the posterior parameter. Based on this property, we also clarify that VAE enables practical quantitative analyses of the input data, such as probability estimation in the input space and PCA-like quantitative multivariate analysis. We believe the quantitative properties uncovered in this paper will be a milestone for further advancing information theory-based generative models such as VAE in the right direction.

Acknowledgement

We express our gratitude to Tomotake Sasaki and Takashi Katoh for improving the clarity of the manuscript. Taiji Suzuki was partially supported by JSPS KAKENHI (18H03201, and 20H00576), and JST CREST.

References

  • Alemi et al. (2018) Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., and Murphy, K. Fixing a broken ELBO. In Proceedings of the 35th International Conference on Machine Learning(ICML), pp.  159–168. PMLR, July 2018.
  • Andrews (2002) Andrews, B. Notes on the isometric embedding problem and the nash-moser implicit function theorem. Proceedings of the Centre for Mathematics and its Applications, 20:157–208, January 2002.
  • Ballé et al. (2016) Ballé, J., Laparra, V., and Simoncelli, E. P. Density modeling of images using a generalized normalization transformation. In Proceedings of the 4th International Conference on Learning Representations (ICLR), May 2016.
  • Ballé et al. (2018) Ballé, J., Minnen, D., Singh, S., Hwang, S. J., and Johnston, N. Variational image compression with a scale hyperprior. In Proceedings of the 6th International Conference on Learning Representations (ICLR), April 2018.
  • Berger (1971) Berger, T. (ed.). Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice Hall, 1971. ISBN 0137531036.
  • Bishop (2006) Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006. ISBN 978-0387-31073-2.
  • Dai & Wipf (2019) Dai, B. and Wipf, D. Diagnosing and enhancing vae models. In International Conference on Learning Representations (ICLR), May 2019.
  • Dai et al. (2018) Dai, B., Wang, Y., Aston, J., Hua, G., and Wipf, D. Hidden talents of the variational autoencoder. The Journal of Machine Learning Research, 19:1573–1614, January 2018.
  • Dua & Graff (2019) Dua, D. and Graff, C. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2019.
  • Goyal (2001) Goyal, V. K. Theoretical foundations of transform coding. IEEE Signal Processing Magazine, 18:9–21, September 2001.
  • Han & Hong (2006) Han, Q. and Hong, J.-X. Isometric Embedding of Riemannian Manifolds in Euclidean Spaces. American Mathematical Society, 2006. ISBN 0821840711.
  • Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations (ICLR), April 2017.
  • Huang et al. (2020) Huang, S., Makhzani, A., Cao, Y., and Grosse, R. Evaluating lossy compression rates of deep generative models. In Proceedings of the 37th International Conference on Machine Learning (ICML), July 2020.
  • Kato et al. (2020) Kato, K., Zhou, Z., Sasaki, T., and Nakagawa, A. Rate-distortion optimization guided autoencoder for generative analysis. In Proceedings of the 37th International Conference on Machine Learning (ICML), July 2020.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Banff, Canada, April 2014.
  • Kumar & Poole (2020) Kumar, A. and Poole, B. On implicit regularization in β\beta-VAEs. In Proceedings of the 37th International Conference on Machine Learning (ICML), July 2020.
  • Liao et al. (2018) Liao, W., Guo, Y., Chen, X., and Li, P. A unified unsupervised gaussian mixture variational autoencoder for high dimensional outlier detection. In 2018 IEEE International Conference on Big Data (Big Data), pp.  1208–1217. IEEE, 2018.
  • Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • Locatello et al. (2019) Locatello, F., Bauer, S., Lucic, M., Rätsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97, pp.  4114–4124. PMLR, June 2019.
  • Pearlman & Said (2011) Pearlman, W. A. and Said, A. Digital Signal Compression: Principles and Practice. Cambridge University Press, 2011. ISBN 0521899826.
  • Rao & Yip (2000) Rao, K. R. and Yip, P. (eds.). The Transform and Data Compression Handbook. CRC Press, Inc., Boca Raton, FL, USA, 2000. ISBN 0849336929.
  • Rolínek et al. (2019) Rolínek, M., Zietlow, D., and Martius, G. Variational autoencoders pursue pca directions (by accident). In Proceedings of Computer Vision and Pattern Recognition (CVPR), June 2019.
  • Sullivan & Wiegand (1998) Sullivan, G. J. and Wiegand, T. Rate-distortion optimization for video compression. IEEE Signal Processing Magazine, 15:74–90, November 1998.
  • Tishby et al. (1999) Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. In The 37th annual Allerton Conference on Communication, Control, and Computing, September 1999.
  • Wang et al. (2001) Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. on Image Processing, 13:600–612, April 2001.
  • Wiener (1964) Wiener, N. Extrapolation, Interpolation, and Smoothing of Stationary Time Series. The MIT Press, 1964. ISBN 978-0-262-73005-1.
  • Zong et al. (2018) Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., and Chen, H. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations (ICLR), 2018.

Appendix A Derivations and proofs in Section 4.3

A.1 Proof of Lemma 4.2: Approximated expansion of the reconstruction loss

The approximated expansion of the reconstruction loss largely follows Rolínek et al. (2019), except that we consider a metric tensor \bm{G}_{\bm{x}}, which is a positive definite Hermitian matrix.

Let \delta\breve{\bm{x}} and \delta\hat{\bm{x}} denote \breve{\bm{x}}-\bm{x} and \hat{\bm{x}}-\breve{\bm{x}}, respectively. Let \delta z_{j}\sim\mathcal{N}(\delta z_{j};0,\sigma_{j(\bm{x})}) be the noise added in the reparameterization trick, where z_{j}=\mu_{j(\bm{x})}+\delta z_{j}. Then, \delta\hat{\bm{x}}=\hat{\bm{x}}-\breve{\bm{x}} is approximated as:

\delta\hat{\bm{x}} \simeq \sum_{j=1}^{n}\delta z_{j}\,\bm{x}_{\mu_{j}}. (26)

Next, the reconstruction loss D(𝒙,𝒙^)D({\bm{x}},\hat{\bm{x}}) can be approximated as follows.

D(\bm{x},\hat{\bm{x}}) = D(\bm{x},\bm{x}+(\delta\breve{\bm{x}}+\delta\hat{\bm{x}})) (27)
 \simeq {}^{t}(\delta\breve{\bm{x}}+\delta\hat{\bm{x}})\,\bm{G}_{\bm{x}}(\delta\breve{\bm{x}}+\delta\hat{\bm{x}})
 = {}^{t}\delta\breve{\bm{x}}\,\bm{G}_{\bm{x}}\delta\breve{\bm{x}} + {}^{t}\delta\hat{\bm{x}}\,\bm{G}_{\bm{x}}\delta\hat{\bm{x}} + 2\,{}^{t}\delta\hat{\bm{x}}\,\bm{G}_{\bm{x}}\delta\breve{\bm{x}}
 \simeq D(\bm{x},\breve{\bm{x}}) + D(\breve{\bm{x}},\hat{\bm{x}}) + \sum_{j=1}^{n} 2\delta z_{j}\,{}^{t}\bm{x}_{\mu_{j}}\bm{G}_{\bm{x}}\delta\breve{\bm{x}}

Then, we evaluate the average of D(𝒙,𝒙^)D({\bm{x}},\hat{\bm{x}}) over 𝒛qϕ(𝒛|𝒙){\bm{z}}\sim q_{\phi}(\bm{z}|\bm{x}), i.e., δzj𝒩(δzj;0,σj(𝒙))\delta z_{j}\sim\mathcal{N}(\delta z_{j};0,\sigma_{j({\bm{x}})}) for all jj. Note that E[δzjδzk]=σj(𝒙)2δjkE[\delta{z_{j}}\delta{z_{k}}]={\sigma_{j({\bm{x}})}}^{2}\delta_{jk} where δjk\delta_{jk} is the Kronecker delta. First, the average of D(𝒙,𝒙˘)D({\bm{x}},\breve{\bm{x}}) in the last line of Eq. 27 is still D(𝒙,𝒙˘)D({\bm{x}},\breve{\bm{x}}) since this term does not depend on δzj\delta{z_{j}}. Second, the average of D(𝒙˘,𝒙^)D(\breve{\bm{x}},\hat{\bm{x}}) in the last line of Eq. 27 is approximated as:

E𝒛qϕ(𝒛|𝒙)[D(𝒙˘,𝒙^)]\displaystyle{E}_{{\bm{z}}\sim q_{\phi}(\bm{z}|\bm{x})}\left[D(\breve{\bm{x}},\hat{\bm{x}})\right] \displaystyle\simeq E𝒛qϕ(𝒛|𝒙)[δt𝒙^𝑮𝒙δ𝒙^]\displaystyle E_{{\bm{z}}\sim q_{\phi}(\bm{z}|\bm{x})}\left[{}^{t}{\delta\hat{\bm{x}}}\ {\bm{G}}_{\bm{x}}{\delta\hat{\bm{x}}}\right] (28)
\displaystyle\simeq E𝒛qϕ(𝒛|𝒙)[(j=1nδzj𝒙μjt)𝑮𝒙(k=1nδzk𝒙μk)]\displaystyle E_{{\bm{z}}\sim q_{\phi}(\bm{z}|\bm{x})}\Bigl{[}\Bigl{(}{\sum_{j=1}^{n}\delta{z_{j}}\ {}^{t}{{\bm{x}}_{\mu_{j}}}}\Bigr{)}{\bm{G}}_{\bm{x}}\Bigl{(}{\sum_{k=1}^{n}\delta{z_{k}}\ {{\bm{x}}_{\mu_{k}}}}\Bigr{)}\Bigr{]}
=\displaystyle= j=1nk=1nE𝒛qϕ(𝒛|𝒙)[δzjδzk]𝒙μjt𝑮𝒙𝒙μk\displaystyle{\sum_{j=1}^{n}}{\sum_{k=1}^{n}}E_{{\bm{z}}\sim q_{\phi}(\bm{z}|\bm{x})}[\delta{z_{j}}\delta{z_{k}}]\ {}^{t}{{\bm{x}}_{\mu_{j}}}{\bm{G}}_{\bm{x}}{{\bm{x}}_{\mu_{k}}}
=\displaystyle= j=1nσj(𝒙)2𝒙μjt𝑮x𝒙μj.\displaystyle\sum_{j=1}^{n}{{\sigma}_{j({\bm{x}})}}^{2}\ {}^{t}{{\bm{x}}_{\mu_{j}}}\bm{G}_{x}{{\bm{x}}_{\mu_{j}}}.\hskip 96.73936pt

Third, the average of the third term in the last line of Eq. 27, i.e., j=1n2δzj𝒙μjt𝑮xδ𝒙˘\sum_{j=1}^{n}2\delta{z_{j}}\ {}^{t}{\bm{x}}_{\mu_{j}}\bm{G}_{x}\delta\breve{\bm{x}}, is 0 since the average of δzj\delta z_{j} over 𝒩(δzj;0,σj(𝒙))\mathcal{N}(\delta z_{j};0,\sigma_{j({\bm{x}})}) is 0.

As a result, the average of D(𝒙,𝒙^)D({\bm{x}},\hat{\bm{x}}) over 𝒛qϕ(𝒛|𝒙){\bm{z}}\sim q_{\phi}(\bm{z}|\bm{x}) can be approximated as:

E_{\bm{z}\sim q_{\phi}(\bm{z}|\bm{x})}\left[D(\bm{x},\hat{\bm{x}})\right] \simeq D(\bm{x},\breve{\bm{x}}) + \sum_{j=1}^{n}\sigma_{j(\bm{x})}^{2}\,{}^{t}\bm{x}_{\mu_{j}}\bm{G}_{\bm{x}}\bm{x}_{\mu_{j}}. (29)
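As a sanity check of Eq. 29, the following is a minimal numerical sketch (our own illustration, not part of the derivation): a hypothetical smooth toy decoder, posterior parameters, and metric tensor are assumed, and the Monte-Carlo average of D(\bm{x},\hat{\bm{x}}) is compared with the right-hand side of Eq. 29.

# Minimal numerical check of Eq. 29 with a hypothetical toy decoder (assumption).
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 3                                   # latent dim, input dim (assumed)
mu = np.array([0.3, -0.5])                    # posterior mean mu_(x)
sigma = np.array([0.05, 0.02])                # posterior std sigma_j(x), << 1
A = rng.normal(size=(m, n))
def g(z):                                     # toy decoder: mildly non-linear
    return A @ z + 0.1 * (A @ z) ** 2

x_breve = g(mu)                               # deterministic reconstruction
x = x_breve + 0.01 * rng.normal(size=m)       # input near the decoded mean
G = np.diag([1.0, 2.0, 0.5])                  # metric tensor G_x (pos. definite)

def D(a, b):                                  # quadratic loss in the metric G_x
    d = a - b
    return d @ G @ d

# Monte-Carlo average of D(x, x_hat) over z ~ q(z|x)
z = mu + sigma * rng.normal(size=(20000, n))
mc = np.mean([D(x, g(zi)) for zi in z])

# Right-hand side of Eq. 29 with a numerical Jacobian of the decoder
eps = 1e-5
x_mu = np.stack([(g(mu + eps * e) - g(mu - eps * e)) / (2 * eps)
                 for e in np.eye(n)], axis=1)
approx = D(x, x_breve) + sum(sigma[j]**2 * x_mu[:, j] @ G @ x_mu[:, j]
                             for j in range(n))
print(mc, approx)                             # the two values should nearly agree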

A.2 Proof of Lemma 4.2: More precise derivation of KL divergence approximation.

This appendix gives a more precise derivation of the KL divergence approximation. First, we show the approximation for the Gaussian prior. At the end of this appendix, we also show that the approximation holds for an arbitrary prior.

In the case of a Gaussian prior, we show that the KL divergence can be interpreted as the amount of information in transform coding (Goyal, 2001) that allows a distortion of \sigma_{j(\bm{x})}^{2}. In transform coding, the input data is transformed by an orthonormal transform, the transformed data is quantized, and an entropy code is assigned to each quantized symbol such that the code length is equal to the negative logarithm of the estimated symbol probability. Here, we assume that \sigma_{j(\bm{x})}^{2}\ll 1 holds in meaningful dimensions, as justified later in this appendix.

It is generally intractable to derive the rate and distortion of individual symbols under ideal information coding. Thus, we first discuss the case of uniform quantization. Let P_{z_{j}} and R_{z_{j}} be the probability and the amount of information of the symbol in uniform quantization coding of z_{j}\sim\mathcal{N}(z_{j};0,1). Here, \mu_{j(\bm{x})} and \sigma_{j(\bm{x})}^{2} are regarded as the quantized value and the coding noise after uniform quantization, respectively. Since we assume \sigma_{j(\bm{x})}^{2}\ll 1, \mu_{j(\bm{x})}\sim\mathcal{N}(\mu_{j(\bm{x})};0,1) will also hold. Let T be the quantization step size. As explained in Appendix G.1, the coding noise after quantization is T^{2}/12; thus T=2\sqrt{3}\,\sigma_{j(\bm{x})} follows from \sigma_{j(\bm{x})}^{2}=T^{2}/12. As shown in Fig. 8, P_{z_{j}} is given by \int_{\mu_{j(\bm{x})}-T/2}^{\mu_{j(\bm{x})}+T/2}p(z_{j})\mathrm{d}z_{j}, where p(z_{j}) is \mathcal{N}(z_{j};0,1). Using Simpson's rule for numerical integration and the expansion e^{x}=1+x+O(x^{2}), P_{z_{j}} is approximated as:

P_{z_{j}} \simeq \frac{T}{6}\left(p(\mu_{j(\bm{x})}-\tfrac{T}{2}) + 4p(\mu_{j(\bm{x})}) + p(\mu_{j(\bm{x})}+\tfrac{T}{2})\right) (30)
 = \frac{T\,p(\mu_{j(\bm{x})})}{6}\left(4 + e^{\frac{4\mu_{j(\bm{x})}T-T^{2}}{8}} + e^{\frac{-4\mu_{j(\bm{x})}T-T^{2}}{8}}\right)
 \simeq T\,p(\mu_{j(\bm{x})})\left(1 - T^{2}/24\right)
 = \sqrt{\frac{6}{\pi}}\,\sigma_{j(\bm{x})}\,e^{-\mu_{j(\bm{x})}^{2}/2}\left(1 - \frac{\sigma_{j(\bm{x})}^{2}}{2}\right).

Using the expansion \log(1+x)=x+O(x^{2}), R_{z_{j}} is derived as:

R_{z_{j}} = -\log P_{z_{j}} \simeq \frac{1}{2}\left(\mu_{j(\bm{x})}^{2} + \sigma_{j(\bm{x})}^{2} - \log\sigma_{j(\bm{x})}^{2} - \log\frac{6}{\pi}\right) = D_{\mathrm{KL}j(\bm{x})}(\cdot) + \frac{1}{2}\log\frac{\pi e}{6}. (31)

Comparing R_{z_{j}} with D_{\mathrm{KL}j(\bm{x})}(\cdot) in Eq. 2, the two expressions agree up to a small constant difference \frac{1}{2}\log(\pi e/6)\simeq 0.176 per dimension. As a result, the KL divergence of the j-th dimension is equivalent to the rate of uniform quantization coding, up to this small constant.
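The agreement up to the constant \frac{1}{2}\log(\pi e/6) can be checked numerically; the following short sketch (our illustration, with assumed (\mu,\sigma) values) compares the exact -\log P_{z_{j}} computed from the Gaussian CDF with D_{\mathrm{KL}j(\bm{x})}(\cdot)+\frac{1}{2}\log(\pi e/6).

# Numerical sanity check of Eq. 31 (values are assumptions for illustration).
import numpy as np
from scipy.stats import norm

for mu, sig in [(0.0, 0.1), (0.5, 0.05), (1.5, 0.2)]:
    T = 2 * np.sqrt(3) * sig                      # step size from sigma^2 = T^2/12
    P = norm.cdf(mu + T / 2) - norm.cdf(mu - T / 2)
    R_exact = -np.log(P)
    dkl = 0.5 * (mu**2 + sig**2 - np.log(sig**2) - 1)
    R_approx = dkl + 0.5 * np.log(np.pi * np.e / 6)
    print(f"mu={mu:4.1f} sigma={sig:4.2f}  -log P = {R_exact:.4f}  D_KL + c = {R_approx:.4f}")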

To simplify the theoretical analysis, we use the simpler approximation P_{z_{j}}=T\,p(\mu_{j(\bm{x})})=2\sqrt{3}\,\sigma_{j(\bm{x})}\,p(\mu_{j(\bm{x})}) instead of Eq. 30, as shown in Fig. 8. Then, R_{z_{j}} is derived as:

R_{z_{j}} = -\log\left(2\sqrt{3}\,\sigma_{j(\bm{x})}\,p(\mu_{j(\bm{x})})\right) = \frac{1}{2}\left(\mu_{j(\bm{x})}^{2} - \log\sigma_{j(\bm{x})}^{2} - 1\right) + \frac{1}{2}\log\frac{\pi e}{6}. (32)

Here, the first term on the right-hand side is equivalent to Eq. 4.2. This also means that the approximation of the KL divergence in Eq. 4.2 is equivalent to the rate of uniform quantization coding under the approximation P_{z_{j}}=2\sqrt{3}\,\sigma_{j(\bm{x})}\,p(\mu_{j(\bm{x})}), up to the same small constant as in Eq. 31. Note that this approximation of P_{z_{j}} in Figure 8 applies to any prior PDF, because no explicit assumption is made on the prior. This implies that the theoretical discussion after Eq. 4.2 in the main text holds for arbitrary prior PDFs.

(a) Probability P_{z_{j}}. (b) Approximation of P_{z_{j}}.
Figure 8: Probability for a symbol with mean \mu and noise \sigma^{2}

The meaning of the small constant difference \frac{1}{2}\log\frac{\pi e}{6} in Eqs. 31 and 32 can be explained as follows. Pearlman & Said (2011) show that the rate difference between ideal information coding and uniform quantization is \frac{1}{2}\log\frac{\pi e}{6}, caused by the entropy difference of the noise distributions. In the ideal case, the noise distribution is Gaussian; for a noise variance \sigma^{2}, the entropy of the Gaussian noise is \frac{1}{2}\log(2\pi e\sigma^{2}). For uniform quantization, the noise distribution is uniform and its entropy is \frac{1}{2}\log(12\sigma^{2}). The difference is therefore exactly \frac{1}{2}\log\frac{\pi e}{6}. Because the rate estimation in this appendix uses uniform quantization, this small offset can be regarded as the gap between ideal information coding and uniform quantization. As a result, the KL divergence in Eqs. 2 and 4.2 can be regarded as the rate of ideal information coding for the symbol with mean \mu_{j(\bm{x})} and variance \sigma_{j(\bm{x})}^{2}.

Here, we validate the assumption that \sigma_{j(\bm{x})}^{2}\ll 1 holds in meaningful dimensions. From the discussion above, the information R_{z_{j}} in each dimension can be regarded as the KL divergence:

R_{z_{j}} = \frac{1}{2}\left(\mu_{j(\bm{x})}^{2} + \sigma_{j(\bm{x})}^{2} - \log\sigma_{j(\bm{x})}^{2} - 1\right). (33)

For simplicity, we assume that \sigma_{j(\bm{x})} is constant within the j-th dimension. We further assume \mu_{j(\bm{x})}\sim\mathcal{N}(\mu_{j(\bm{x})};0,1). Then E[R_{z_{j}}], which represents the amount of information of the j-th dimensional component, is derived as:

E[Rzj]\displaystyle E[R_{\ z_{j}}] \displaystyle\simeq Eμj(𝒙)𝒩(μj(𝒙);0,1)[Rzj]\displaystyle E_{\mu_{j({\bm{x}})}\sim\mathcal{N}(\mu_{j({\bm{x}})};0,1)}[R_{z_{j}}] (34)
=\displaystyle= 12(μj(𝒙)2+σj(𝒙)2logσj(𝒙)21)𝒩(μj(𝒙);0,1)dμj(𝒙)\displaystyle\int\frac{1}{2}\left({\mu_{j({\bm{x}})}}^{2}+{\sigma_{j({\bm{x}})}}^{2}-\log{\sigma_{j({\bm{x}})}}^{2}-1\right)\ \mathcal{N}(\mu_{j({\bm{x}})};0,1)\ \mathrm{d}\mu_{j({\bm{x}})}
=\displaystyle= 12(σj(𝒙)2logσj(𝒙)2).\displaystyle\frac{1}{2}\left({\sigma_{j({\bm{x}})}}^{2}-\log{\sigma_{j({\bm{x}})}}^{2}\right).

From this equation, we can estimate the amount of information in each dimension from the posterior variance. In particular, if E[R_{z_{j}}] exceeds about 1.20 nat (1.73 bit), then \sigma_{j(\bm{x})}^{2}<0.1 holds. Moreover, as E[R_{z_{j}}] increases, \sigma_{j(\bm{x})}^{2} decreases exponentially. As a result, the assumption that \sigma_{j(\bm{x})}^{2}\ll 1 holds in meaningful dimensions is reasonable.
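The relation of Eq. 34 can also be inverted numerically: given a target per-dimension rate, the corresponding posterior variance follows from a one-dimensional root search. The sketch below is our illustration under the same assumptions (constant \sigma_{j(\bm{x})}, Gaussian \mu_{j(\bm{x})}).

# Solve E[R_zj] = (sigma^2 - log sigma^2)/2 (Eq. 34) for sigma^2, given a rate in nats.
import numpy as np
from scipy.optimize import brentq

def sigma2_from_rate(R):
    # E[R_zj] is monotonically decreasing in sigma^2 on (0, 1), so bracketing works.
    f = lambda s2: 0.5 * (s2 - np.log(s2)) - R
    return brentq(f, 1e-12, 1.0)

for R in [0.7, 1.2, 2.0, 4.0]:
    print(f"E[R_zj] = {R:.2f} nat  ->  sigma_j^2 = {sigma2_from_rate(R):.4f}")
# about 1.20 nat corresponds to sigma_j^2 ~ 0.1, as stated in the text.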

Finally, we show that the approximation of the KL divergence in the second line of Eq. 4.2 also holds for an arbitrary prior. Let p(\bm{z}), q_{\phi}(\bm{z}|\bm{x}), and D_{\mathrm{KL}}(q_{\phi}(\bm{z}|\bm{x})\|p(\bm{z})) be an arbitrary prior, the posterior \prod_{j}\mathcal{N}(\mu_{j(\bm{x})},\sigma_{j(\bm{x})}), and the KL divergence, respectively. First, when each \sigma_{j(\bm{x})} is small, q_{\phi}(\bm{z}|\bm{x}) acts like a delta function \delta(\bm{z}-\bm{\mu}_{(\bm{x})}). Next, the differential entropy of q_{\phi}(\bm{z}|\bm{x}), i.e., -\int q_{\phi}(\bm{z}|\bm{x})\log q_{\phi}(\bm{z}|\bm{x})\mathrm{d}\bm{z}, is \sum_{j}^{n}\log(\sigma_{j(\bm{x})}\sqrt{2\pi e}). Using these facts, the KL divergence for an arbitrary prior can be approximated by the second line of Eq. 4.2 as follows:

DKL(qϕ(𝒛|𝒙)p(𝒛))\displaystyle D_{\mathrm{KL}}(q_{\phi}({\bm{z}}|{\bm{x}})\|p({\bm{z}})) =\displaystyle= qϕ(𝒛|𝒙)logp(𝒛)d𝒛+qϕ(𝒛|𝒙)logqϕ(𝒛|𝒙)d𝒛\displaystyle-\int q_{\phi}({\bm{z}}|{\bm{x}})\log p({\bm{z}})\mathrm{d}{\bm{z}}+\int q_{\phi}({\bm{z}}|{\bm{x}})\log q_{\phi}({\bm{z}}|{\bm{x}})\mathrm{d}{\bm{z}} (35)
\displaystyle\simeq δ(𝒛𝝁(𝒙))logp(𝒛)d𝒛+qϕ(𝒛|𝒙)logqϕ(𝒛|𝒙)d𝒛\displaystyle-\int\delta(\bm{z}-{\bm{\mu}}_{({\bm{x}})})\log p({\bm{z}})\mathrm{d}{\bm{z}}+\int q_{\phi}({\bm{z}}|{\bm{x}})\log q_{\phi}({\bm{z}}|{\bm{x}})\mathrm{d}{\bm{z}}
=\displaystyle= logp(𝝁(𝒙))jnlogσj2πe\displaystyle-\log p({\bm{\mu}}_{({\bm{x}})})-\sum_{j}^{n}\log\sigma_{j}\sqrt{2\pi e}
=\displaystyle= log(p(𝝁(𝒙))j=1nσj(𝒙))nlog2πe2\displaystyle-\log\Bigl{(}p(\bm{\mu}_{(\bm{x})})\prod_{j=1}^{n}{\sigma}_{j({\bm{x}})}\Bigr{)}-\frac{n\log{2\pi e}}{2}

In the derivation of the third line in Eq. 4.2, we assume that q_{\phi}(\bm{\mu}_{(\bm{x})}) is close to p(\bm{\mu}_{(\bm{x})}), where p(\cdot) is the prior distribution of \bm{z}. The reason for this assumption is as follows. ELBO can also be written as \log p(\bm{x})-D_{\mathrm{KL}}(q_{\phi}(\bm{z}|\bm{x})\|p_{\theta}(\bm{z}|\bm{x})) (Bishop, 2006). When ELBO is maximized at each \bm{x}, q_{\phi}(\bm{z}|\bm{x})\simeq p_{\theta}(\bm{z}|\bm{x}) will hold so as to minimize the KL divergence, since \log p(\bm{x}) is a constant. Marginalizing over \bm{x}, we obtain q_{\phi}(\bm{z})\simeq p(\bm{z}). Appendix A.3 also validates this assumption in the simple 1-dimensional VAE case, where q_{\phi}(\mu_{(x)})=p(\mu_{(x)})=\mathcal{N}(\mu_{(x)};0,1) holds. Thus, we can derive the approximation p(\bm{\mu}_{(\bm{x})})\simeq q_{\phi}(\bm{\mu}_{(\bm{x})})=p(\bm{x})\,|\mathrm{det}(\partial\bm{x}/\partial\bm{\mu}_{(\bm{x})})| in Eq. 4.2.

A.3 Proof of Lemma 4.2: Estimation of the coding loss and transform loss in 1-dimensional linear VAE

This appendix estimates the coding loss and transform loss of a 1-dimensional linear \beta-VAE for Gaussian data, and also shows that the result is consistent with the Wiener filter (Wiener, 1964). Let x be one-dimensional input data with a normal distribution:

x,x𝒩(x;0,σx2).\displaystyle x\in{\mathbb{R}},\hskip 5.69054ptx\sim\mathcal{N}(x;0,{{\sigma}_{x}}^{2}). (36)

First, a simple VAE model in this analysis is explained. zz denotes a one dimensional latent variable. Let the prior distribution p(z)p(z) be 𝒩(z;0,1)\mathcal{N}(z;0,1). Next, two linear parametric encoder and decoder are provided with constant parameters aa, bb, and σz\sigma_{z} to optimize:

Encϕ:z\displaystyle\mathrm{Enc}_{\phi}:\hskip 14.22636ptz =\displaystyle= μ+σzϵwhereμ=axandϵ𝒩(ϵ;0,1),\displaystyle\mu+{\sigma_{z}}{\epsilon}\hskip 5.69054pt\mathrm{where}\hskip 5.69054pt\mu=ax\hskip 5.69054pt\mathrm{and}\hskip 5.69054pt\epsilon\sim\mathcal{N}(\epsilon;0,1),
Decθ:x^\displaystyle\mathrm{Dec}_{\theta}:\hskip 14.22636pt\hat{x} =\displaystyle= bz.\displaystyle bz. (37)

Here, the encoding parameter ϕ\phi consists of {a,σz}\{a,\sigma_{z}\}, and the decoding parameter θ\theta consists of {b}\{b\}. Then the square error is used as a reconstruction loss.

Next, the objective is derived. DKLxD_{\mathrm{KL}x} and DxD_{x} denote the KL divergence and reconstruction loss at xx, respectively. We further assume that DxD_{x} uses a square error. Then we define the loss objective at xx as Lx=Dx+βDKLxL_{x}=D_{x}+\beta D_{\mathrm{KL}x}. Using Eq. 4.2, DKLxD_{\mathrm{KL}x} can be evaluated as:

D_{\mathrm{KL}x} = -\log(\sigma_{z}\,p(\mu)) - \frac{1}{2}\log 2\pi e (38)
 = -\log(\sigma_{z}\,\mathcal{N}(ax;0,1)) - \frac{1}{2}\log 2\pi e
 = -\log\sigma_{z} + \frac{a^{2}x^{2}}{2} - \frac{1}{2}.

Then, DxD_{x} is evaluated as:

Dx\displaystyle D_{x} =\displaystyle= Eϵ𝒩(ϵ;0,1)[(xDecθ(Encϕ(x)))2]\displaystyle{E}_{{\epsilon\sim\mathcal{N}(\epsilon;0,1)}}\left[\left(x-\mathrm{Dec}_{\theta}(\mathrm{Enc}_{\phi}(x))\right)^{2}\right] (39)
=\displaystyle= Eϵ𝒩(ϵ;0,1)[(x(b(ax+σzϵ)))2]\displaystyle{E}_{{\epsilon\sim\mathcal{N}(\epsilon;0,1)}}\left[\left(x-(b(ax+{\sigma_{z}}{\epsilon}))\right)^{2}\right]
=\displaystyle= ((ab1)2x2+2(ab1)xbσzϵ+b2σz2ϵ2)𝒩(ϵ;0,1)dϵ\displaystyle\int\left((ab-1)^{2}x^{2}+2(ab-1)xb{\sigma_{z}}\epsilon+b^{2}{\sigma_{z}}^{2}\epsilon^{2}\right)\mathcal{N}(\epsilon;0,1)\ \mathrm{d}\epsilon
=\displaystyle= (ab1)2x2+b2σz2.\displaystyle(ab-1)^{2}x^{2}+b^{2}{\sigma_{z}}^{2}.

By averaging LxL_{x} over x𝒩(x;0,σx2)x\sim\mathcal{N}(x;0,{\sigma_{x}}^{2}), the objective LL to minimize is derived as:

L\displaystyle L =\displaystyle= Ex𝒩(x;0,σx2)[Lx]\displaystyle{E}_{x\sim\mathcal{N}(x;0,{\sigma_{x}}^{2})}[L_{x}] (40)
=\displaystyle= ((ab1)2x2+b2σz2+β(logσz+a2x2212))𝒩(x;0,σx2)dx.\displaystyle\int\left((ab-1)^{2}x^{2}+b^{2}{\sigma_{z}}^{2}+\beta\left(-\log\sigma_{z}+\frac{a^{2}x^{2}}{2}-\frac{1}{2}\right)\right)\mathcal{N}(x;0,{\sigma_{x}}^{2})\ \mathrm{d}x.
=\displaystyle= (ab1)2σx2+b2σz2+β(logσz+a2σx2212).\displaystyle(ab-1)^{2}{\sigma_{x}}^{2}+b^{2}{\sigma_{z}}^{2}+\beta\left(-\log\sigma_{z}+\frac{a^{2}{\sigma_{x}}^{2}}{2}-\frac{1}{2}\right).

Here, the first term (ab-1)^{2}\sigma_{x}^{2} and the second term b^{2}\sigma_{z}^{2} in the last line correspond to the transform loss D_{\mathrm{T}} and the coding loss D_{\mathrm{C}}, respectively.

By solving dL/da=0{\mathrm{d}L}/{\mathrm{d}a}=0, dL/db=0{\mathrm{d}L}/{\mathrm{d}b}=0, and dL/dσz=0{\mathrm{d}L}/{\mathrm{d}\sigma_{z}}=0, aa , bb, and σz\sigma_{z} are derived as follows:

a = 1/\sigma_{x},
b = \frac{\sigma_{x}\left(1+\sqrt{1-2\beta/\sigma_{x}^{2}}\right)}{2},
\sigma_{z} = \frac{2\sqrt{\beta/2}}{\sigma_{x}\left(1+\sqrt{1-2\beta/\sigma_{x}^{2}}\right)}. (41)
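The closed-form solution of Eq. 41 can be verified by directly minimizing the objective L in Eq. 40. The following sketch is our illustration; \sigma_{x}, \beta, and the optimizer settings are assumptions.

# Numerical check of Eq. 41: minimize L(a, b, sigma_z) and compare with the analytic solution.
import numpy as np
from scipy.optimize import minimize

sigma_x, beta = 2.0, 0.5                         # assumed values

def L(params):
    a, b, log_sz = params
    sz = np.exp(log_sz)                          # parameterize sigma_z > 0
    return ((a * b - 1)**2 * sigma_x**2 + b**2 * sz**2
            + beta * (-np.log(sz) + a**2 * sigma_x**2 / 2 - 0.5))

res = minimize(L, x0=[0.3, 1.0, np.log(0.3)], method="Nelder-Mead")
a_n, b_n, sz_n = res.x[0], res.x[1], np.exp(res.x[2])

r = np.sqrt(1 - 2 * beta / sigma_x**2)
a_t = 1 / sigma_x
b_t = sigma_x * (1 + r) / 2
sz_t = 2 * np.sqrt(beta / 2) / (sigma_x * (1 + r))
print("numerical:", a_n, b_n, sz_n)
print("analytic :", a_t, b_t, sz_t)              # should agree up to optimizer tolerance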

From the solution in Eq. 41, D_{\mathrm{T}} and D_{\mathrm{C}} are derived as:

DT\displaystyle D_{\mathrm{T}} =\displaystyle= (12β/σx212)2σx2,\displaystyle\left(\frac{\sqrt{1-2\beta/{\sigma_{x}}^{2}}-1}{2}\right)^{2}{\sigma_{x}}^{2},
DC\displaystyle D_{\mathrm{C}} =\displaystyle= β/2.\displaystyle\beta/2. (42)

As shown in Section 4.1, the added noise \beta/2 should be reasonably smaller than the data variance \sigma_{x}^{2}. If \sigma_{x}^{2}\gg\beta, by approximating b and \sigma_{z} in Eq. 41, D_{\mathrm{T}} can be written as:

DT(β/2)2σx2=β/2σx2DC.\displaystyle D_{\mathrm{T}}\simeq\frac{(\beta/2)^{2}}{{\sigma_{x}}^{2}}=\frac{\beta/2}{{\sigma_{x}}^{2}}D_{\mathrm{C}}. (43)

As shown in this equation, D_{\mathrm{T}}/D_{\mathrm{C}} is small when the added noise of the VAE is reasonably small, so D_{\mathrm{T}} can be ignored.

Note that the distribution of μ=ax=x/σx\mu=a\ x=x/{\sigma_{x}}, i.e., qϕ(μ)q_{\phi}(\mu), is derived as 𝒩(μ;0,1)\mathcal{N}(\mu;0,1) by scaling p(x)=𝒩(x;0,σx2)p(x)=\mathcal{N}(x;0,{\sigma_{x}}^{2}) with a factor of a=1/σxa=1/{\sigma_{x}}. Thus qϕ(μ)q_{\phi}(\mu) is equivalent to the prior of zz, i.e., 𝒩(z;0,1)\mathcal{N}(z;0,1) in this simple VAE case.

Next, the relation to the Wiener filter (Wiener, 1964) is discussed. The Wiener filter is one of the most basic yet most important theories for signal restoration. We consider a simple 1-dimensional Gaussian process. Let x\sim\mathcal{N}(x;0,\sigma_{x}^{2}) be the input data. Then, x is scaled by s, and Gaussian noise n\sim\mathcal{N}(n;0,\sigma_{n}^{2}) is added, so that y=s\,x+n is observed. From Wiener filter theory, the minimum-distortion estimate \hat{x} is given by:

x^=sσx2s2σx2+σn2y.\hat{x}=\frac{s{\sigma_{x}}^{2}}{s^{2}{\sigma_{x}}^{2}+{\sigma_{n}}^{2}}y. (44)

In this case, the estimation error is derived as:

E[(x^x)2]=σn4(s2σx2+σn2)2σx2+s2σx4(s2σx2+σn2)2σn2=σx2σx2+(σn2/s2)(σn2/s2).E[(\hat{x}-x)^{2}]=\frac{{\sigma_{n}}^{4}}{(s^{2}{\sigma_{x}}^{2}+{\sigma_{n}}^{2})^{2}}{{\sigma_{x}}^{2}}+\frac{s^{2}{\sigma_{x}}^{4}}{(s^{2}{\sigma_{x}}^{2}+{\sigma_{n}}^{2})^{2}}{{\sigma_{n}}^{2}}=\frac{{\sigma_{x}}^{2}}{{\sigma_{x}}^{2}+({\sigma_{n}}^{2}/s^{2})}({{\sigma_{n}}^{2}}/s^{2}). (45)

In the second expression, the first term corresponds to the transform loss, and the second term corresponds to the coding loss. The ratio of the transform loss to the coding loss is therefore \sigma_{n}^{2}/(s^{2}\sigma_{x}^{2}). By applying s=1/\sigma_{x} and \sigma_{n}=\sigma_{z} to this ratio and assuming \sigma_{x}^{2}\gg\beta/2, it can be written as:

σn2s2σx2=σz2=β/2σx24(1+12β/σx2)2=β/2σx2+O((β/2σx2)2).\frac{{\sigma_{n}}^{2}}{s^{2}{{\sigma_{x}}^{2}}}={\sigma_{z}}^{2}=\frac{\beta/2}{\sigma_{x}^{2}}\frac{4}{\left(1+\sqrt{1-2\beta/{\sigma_{x}}^{2}}\right)^{2}}=\frac{\beta/2}{\sigma_{x}^{2}}+O\left(\left(\frac{\beta/2}{\sigma_{x}^{2}}\right)^{2}\right). (46)

This result is consistent with Eq. 43, implying that the optimized VAE and the Wiener filter show similar behaviour.
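The following short sketch (our illustration, with assumed \sigma_{x} and \beta) checks this consistency numerically by comparing the D_{\mathrm{T}}/D_{\mathrm{C}} ratio of the \beta-VAE solution with the Wiener-filter ratio \sigma_{n}^{2}/(s^{2}\sigma_{x}^{2}) and with the approximation (\beta/2)/\sigma_{x}^{2}.

# Consistency check between the 1-D linear beta-VAE solution and the Wiener filter.
import numpy as np

sigma_x, beta = 3.0, 0.2                        # assumed values
r = np.sqrt(1 - 2 * beta / sigma_x**2)

# beta-VAE side (Eqs. 41-42)
D_T = ((r - 1) / 2)**2 * sigma_x**2
D_C = beta / 2
ratio_vae = D_T / D_C

# Wiener side (Eq. 45) with s = 1/sigma_x and sigma_n = sigma_z from Eq. 41
s = 1 / sigma_x
sigma_n = 2 * np.sqrt(beta / 2) / (sigma_x * (1 + r))
ratio_wiener = sigma_n**2 / (s**2 * sigma_x**2)

print(ratio_vae, ratio_wiener, (beta / 2) / sigma_x**2)   # all three should be close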

A.4 Proof of Lemma 4.2: Derivation of the orthogonality

Lemma 4.2 is proved by examining the minimum condition of L𝒙L_{\bm{x}} at 𝒙\bm{x}. The proof outline is similar to Kato et al. (2020) while σj(𝒙){\sigma}_{j({\bm{x}})} should be also considered as a variable in our derivation.

We first show the following mathematical formula, which is used in our derivation. Let \bm{A} be a regular matrix and \bm{a}_{i} be its i-th column vector. \tilde{\bm{a}}_{i} denotes the i-th column vector of the cofactor matrix of \bm{A}. Then the following equation holds:

dlog|det(𝑨)|d𝒂i=dlog|det(𝑨)|ddet(𝑨)ddet(𝑨)d𝒂i=1det(𝑨)𝒂i~.\frac{\mathrm{d}\log|\mathrm{det}(\bm{A})|}{\mathrm{d}{{\bm{a}}_{i}}}=\frac{\mathrm{d}\log|\mathrm{det}(\bm{A})|}{\mathrm{d}\ \mathrm{det}(\bm{A})}\ \frac{\mathrm{d}\ \mathrm{det}(\bm{A})}{\mathrm{d}{{\bm{a}}_{i}}}=\frac{1}{\mathrm{det}(\bm{A})}\tilde{\bm{a}_{i}}. (47)

Let 𝒙~μj{\tilde{\bm{x}}_{\mu_{j}}} be the jj-th column vector of a cofactor matrix for Jacobian matrix 𝒙/𝝁(𝒙){\partial\bm{x}}/{\partial\bm{\mu}_{({\bm{x}})}}. Using the formula in Eq. 47, the partial derivative of L𝒙L_{\bm{x}} by 𝒙μj{\bm{x}}_{\mu_{j}} is described by

L𝒙𝒙μj=2σj(𝒙)2𝑮x𝒙μjβdet(𝒙/𝝁(𝒙))𝒙~μj.\displaystyle\frac{\partial L_{\bm{x}}\ }{\partial{\bm{x}}_{\mu_{j}}}=2{{\sigma}_{j({\bm{x}})}}^{2}\bm{G}_{x}{{\bm{x}}_{\mu_{j}}}-\frac{\beta}{\mathrm{det}\left({\partial\bm{x}}/{\partial\bm{\mu}_{({\bm{x}})}}\right)}{\tilde{\bm{x}}_{\mu_{j}}}. (48)

Note that {}^{t}\bm{x}_{\mu_{k}}\cdot\tilde{\bm{x}}_{\mu_{j}}=\mathrm{det}(\partial\bm{x}/\partial\bm{\mu}_{(\bm{x})})\,\delta_{jk} holds by the property of cofactors. Here, \cdot denotes the dot product, and \delta_{jk} denotes the Kronecker delta. By setting Eq. 48 to zero and multiplying by {}^{t}\bm{x}_{\mu_{k}} from the left, we obtain the following orthogonal form of \bm{x}_{\mu_{j}}:

(2σj(𝒙)2/β)𝒙μkt𝑮x𝒙μj=δjk.\displaystyle({2{{\sigma}_{j({\bm{x}})}}^{2}}/{{\beta}})\ {}^{t}{{\bm{x}}_{\mu_{k}}}\bm{G}_{x}{{\bm{x}}_{\mu_{j}}}=\delta_{jk}. (49)

Next, the partial derivative of L𝒙L_{\bm{x}} by σj(𝒙){\sigma_{j({\bm{x}})}} is derived as:

L𝒙σj(𝒙)=2σj(𝒙)𝒙μj𝑮xt𝒙μjβσj(𝒙).\displaystyle\frac{\partial L_{\bm{x}}\ }{\partial{\sigma}_{j({\bm{x}})}}=2{{\sigma}_{j({\bm{x}})}}\ {{\bm{x}}_{\mu_{j}}}{}^{t}\ \bm{G}_{x}{{\bm{x}}_{\mu_{j}}}-\frac{\beta}{{\sigma}_{j({\bm{x}})}}. (50)

By setting Eq. 50 to zero, we have the next equation:

(2σj(𝒙)2/β)𝒙μjt𝑮x𝒙μj=1.\displaystyle({2{{\sigma}_{j({\bm{x}})}}^{2}}/{{\beta}})\ {}^{t}{{\bm{x}}_{\mu_{j}}}\bm{G}_{x}{{\bm{x}}_{\mu_{j}}}=1. (51)

Note that Eq. 51 is the special case of Eq. 49 with j=k. As a result, the condition that minimizes L_{\bm{x}} is given by Eq. 49.
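The minimum condition can also be checked numerically. The sketch below (our illustration) runs plain gradient descent on the terms of L_{\bm{x}} that depend on the Jacobian columns \bm{x}_{\mu_{j}} and \sigma_{j(\bm{x})}, using the gradients of Eqs. 48 and 50, and then verifies the orthogonality condition of Eq. 49; the metric tensor, dimensions, and step size are assumptions.

# Gradient descent on the x_mu / sigma dependent part of L_x, then check Eq. 49.
import numpy as np

rng = np.random.default_rng(1)
n, beta, lr = 3, 0.1, 0.05
G = np.diag([1.0, 2.0, 0.5])                      # assumed metric tensor G_x
J = np.eye(n) + 0.1 * rng.normal(size=(n, n))     # columns play the role of x_mu_j
sigma = np.full(n, 0.3)                           # posterior std sigma_j(x)

for _ in range(50000):
    # Eq. 48: the cofactor term beta * cof(J)/det(J) equals beta * inv(J)^T
    grad_J = 2 * sigma**2 * (G @ J) - beta * np.linalg.inv(J).T
    # Eq. 50: derivative with respect to each sigma_j
    grad_s = 2 * sigma * np.sum(J * (G @ J), axis=0) - beta / sigma
    J -= lr * grad_J
    sigma -= lr * grad_s

check = 2 * (J.T @ G @ J) * sigma**2 / beta       # Eq. 49: should be the identity
print(np.round(check, 3))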

A.5 Proof of Proposition 4.3.1: Estimation of input data distribution in the metric space

This appendix explains the derivation of Eq. 19 in Proposition 4.3.1. Using Eq. 16, the third equation in Eq. 19 is derived as:

p(\bm{y}) = \prod_{j}^{n}p(y_{j}) = \prod_{j}^{n}\left(\mathrm{d}y_{j}/\mathrm{d}\mu_{j(\bm{x})}\right)^{-1}p(\mu_{j}) = \prod_{j}^{n}p(\mu_{j})\prod_{j}^{n}\frac{\sigma_{j(\bm{x})}}{\sqrt{\beta/2}} = (\beta/2)^{-n/2}\,p(\bm{\mu}_{(\bm{x})})\prod_{j}^{n}\sigma_{j(\bm{x})}. (52)

This shows that the posterior variance \sigma_{j(\bm{x})} bridges the data distribution and the prior. Thus, a prior close to the data distribution, for which \sigma_{j(\bm{x})} becomes close to constant, will facilitate training.

The fourth equation in Eq. 19 in Proposition 4.3.1 is derived by applying Eq. 16 to Eq. 4.2 and arranging the result. Let Lmin𝒙L_{\mathrm{min}\ {\bm{x}}} be a minimum of L𝒙L_{\bm{x}} at 𝒙\bm{x}. Dmin𝒙D_{\min{\bm{x}}} and Rmin𝒙R_{\mathrm{min}{\bm{x}}} denote a coding loss and KL divergence in Lmin𝒙L_{\min{\bm{x}}}, respectively.

First, Dmin𝒙D_{\min{\bm{x}}} is derived. The next equation holds from Eq. 14.

σj(𝒙)2𝒙μjt𝑮x𝒙μj=β/2.\displaystyle{{{\sigma}_{j({\bm{x}})}}^{2}}\ {}^{t}{{\bm{x}}_{\mu_{j}}}\bm{G}_{x}{{\bm{x}}_{\mu_{j}}}=\beta/2. (53)

By applying Eq. 53 to the first term of Eq. 4.2, Dmin𝒙D_{\min{\bm{x}}} is derived as:

Dmin𝒙=jnσj(𝒙)2𝒙μjt𝑮x𝒙μj=nβ/2.D_{\mathrm{min}\ {\bm{x}}}=\sum_{j}^{n}{{{\sigma}_{j({\bm{x}})}}^{2}}\ {}^{t}{{\bm{x}}_{\mu_{j}}}\bm{G}_{x}{{\bm{x}}_{\mu_{j}}}=n\beta/2. (54)

This implies that the reconstruction loss is constant for all inputs at the minimum condition.

Second, Rmin𝒙R_{\min{\bm{x}}} is derived. From Eq. 52, the next equation holds.

p(\bm{\mu}_{(\bm{x})})\prod_{j}^{n}\sigma_{j(\bm{x})} = (\beta/2)^{n/2}\,p(\bm{y}). (55)

By applying Eq. 55 to the second equation of Eq. 4.2, Rmin𝒙R_{\min{\bm{x}}} is derived as:

Rmin𝒙=logp(𝒚)nlog(βπe)2.R_{\mathrm{min}\ {\bm{x}}}=-\log p(\bm{y})-\frac{n\log(\beta\pi e)}{2}. (56)

As a result, the minimum value of the objective Lmin𝒙L_{\mathrm{min}\ {\bm{x}}} is derived as:

Lmin𝒙=Dmin𝒙+βRmin𝒙=βlogp(𝒚)+nβ2(1log(βπe)).\displaystyle L_{\mathrm{min}\ {\bm{x}}}=D_{\mathrm{min}\ {\bm{x}}}+\beta R_{\mathrm{min}\ {\bm{x}}}=-\beta\log p(\bm{y})+\frac{n{\beta}}{2}(1-\log(\beta\pi e)). (57)

As a result, p(𝒙)p({\bm{x}}) can be evaluated as:

exp(Lmin𝒙/β)=p(𝒚)exp(n(1log(βπe))2)p(𝒚)p𝑮𝒙(𝒙).\displaystyle\exp(-L_{\mathrm{min}\ {\bm{x}}}/\beta)=p({\bm{y}})\exp(-\frac{n(1-\log(\beta\pi e))}{2})\propto p({\bm{y}})\simeq p_{\bm{G}_{\bm{x}}}({\bm{x}}). (58)

This result implies that the VAE objective converges to the log-likelihood of the input 𝒙{\bm{x}} at the optimized condition as expected.

A.6 Proof of Proposition 4.3.1: Estimation of data distribution in the input space

This appendix shows the derivation of the variables in Eqs. 19 and 21. When we estimate probabilities on a real dataset, we use an approximation of L_{\bm{x}}. First, the derivation of the L_{\bm{x}} approximation for the input \bm{x} is presented. Then, the PDF ratio between the input space and the inner product space is explained for the cases m=n and m>n.

Derivation of LxL_{\bm{x}} approximation for the input x\bm{x} of real data:
As shown in Eq. 1, L_{\bm{x}} is denoted as -E_{\bm{z}\sim q_{\phi}(\bm{z}|\bm{x})}[\,\cdot\,]+\beta D_{\mathrm{KL}}(\,\cdot\,). We approximate E_{\bm{z}\sim q_{\phi}(\bm{z}|\bm{x})}[\,\cdot\,] as \frac{1}{2}(D(\bm{x},\mathrm{Dec}_{\theta}(\bm{\mu}_{\bm{x}}+\bm{\sigma}_{\bm{x}}))+D(\bm{x},\mathrm{Dec}_{\theta}(\bm{\mu}_{\bm{x}}-\bm{\sigma}_{\bm{x}}))), i.e., the average of two samples, instead of the average over \bm{z}\sim q_{\phi}(\bm{z}|\bm{x}). D_{\mathrm{KL}}(\,\cdot\,) can be calculated from \bm{\mu}_{\bm{x}} and \bm{\sigma}_{\bm{x}} using Eq. 2.
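The following is a minimal sketch (our illustration) of this two-sample approximation and of the resulting probability estimate \exp(-L_{\bm{x}}/\beta) in Eq. 58; enc and dec are hypothetical stand-ins for a trained encoder and decoder, and the SSE metric is assumed.

# Two-sample approximation of L_x and the log-probability estimate of Eq. 58.
import numpy as np

def estimate_log_px(x, enc, dec, beta):
    mu, sigma = enc(x)                         # posterior parameters for input x
    # reconstruction term: average of two deterministic samples mu +/- sigma
    d_plus = np.sum((x - dec(mu + sigma))**2)  # SSE metric assumed here
    d_minus = np.sum((x - dec(mu - sigma))**2)
    recon = 0.5 * (d_plus + d_minus)
    # KL divergence against the standard normal prior (Eq. 2)
    dkl = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1)
    L_x = recon + beta * dkl
    return -L_x / beta                         # log p(y) up to an additive constant

# toy usage with a hypothetical linear encoder/decoder pair
A = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
enc = lambda x: (np.linalg.pinv(A) @ x, np.array([0.1, 0.2]))
dec = lambda z: A @ z
print(estimate_log_px(np.array([1.0, 2.0, 1.5]), enc, dec, beta=0.5))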

The PDF ratio in the case m=nm=n:
The PDF ratio for m=n is the Jacobian determinant between the two spaces. First, {}^{t}(\partial\bm{x}/\partial\bm{y})\,\bm{G}_{\bm{x}}\,(\partial\bm{x}/\partial\bm{y})=\bm{I}_{m} holds from Eq. 17. Taking determinants, \left|\partial\bm{x}/\partial\bm{y}\right|^{2}|\bm{G}_{\bm{x}}|=1 also holds. Finally, \left|\partial\bm{y}/\partial\bm{x}\right| is derived as |\bm{G}_{\bm{x}}|^{1/2} using \left|\partial\bm{y}/\partial\bm{x}\right|=\left|\partial\bm{x}/\partial\bm{y}\right|^{-1}, which gives the PDF ratio.

The PDF ratio in the case m>nm>n and Gx=axIm\bm{G}_{\bm{x}}=a_{\bm{x}}\bm{I}_{m}:
Although the strict derivation needs the treatment of the Riemannian manifold, we provide a simple explanation in this appendix. Here, it is assumed that DKL(j)()>0D_{\mathrm{KL}(j)}(\cdot)>0 holds for all j=[1,..n]j=[1,..n]. If DKL(j)()=0D_{\mathrm{KL}(j)}(\cdot)=0 for some jj, nn is replaced by the number of latent variables with DKL(j)()>0D_{\mathrm{KL}(j)}(\cdot)>0.

For the implicit isometric space S_{\mathrm{iso}}(\subset\mathbb{R}^{m}), there exists a matrix \bm{L}_{\bm{x}} such that \bm{w}=\bm{L}_{\bm{x}}\bm{x} and \bm{G}_{\bm{x}}={}^{t}\bm{L}_{\bm{x}}\bm{L}_{\bm{x}} hold locally. \bm{w} denotes a point in S_{\mathrm{iso}}, i.e., \bm{w}\in S_{\mathrm{iso}}. Because \bm{G}_{\bm{x}} is assumed to be a_{\bm{x}}\bm{I}_{m} in Section 4.3, \bm{L}_{\bm{x}}=a_{\bm{x}}^{1/2}\bm{I}_{m} holds. Then, the mapping function \bm{w}=h(\bm{x}) between S_{\mathrm{input}} and S_{\mathrm{iso}} is defined such that:

h(𝒙)𝒙=𝒘𝒙=𝑳𝒙,andh(𝒙(0))=𝒘(0)for𝒙(0)Sinputand𝒘(0)Siso.\displaystyle\frac{\partial h(\bm{x})}{\partial\bm{x}}=\frac{\partial\bm{w}}{\partial\bm{x}}=\bm{L}_{\bm{x}},\hskip 5.69054pt\text{and}\hskip 5.69054pth({\bm{x}}^{(0)})={\bm{w}}^{(0)}\hskip 5.69054pt\text{for}\hskip 5.69054pt\exists\ {\bm{x}}^{(0)}\in S_{\mathrm{input}}\hskip 5.69054pt\text{and}\hskip 5.69054pt\exists\ {\bm{w}}^{(0)}\in S_{\mathrm{iso}}. (59)

Let \delta\bm{x} and \delta\bm{w} be infinitesimal displacements around \bm{x} and \bm{w}=h(\bm{x}), such that \bm{w}+\delta\bm{w}=h(\bm{x}+\delta\bm{x}). Then the next equation holds from Eq. 59:

δ𝒘=𝑳𝒙δ𝒙.\displaystyle\delta\bm{w}=\bm{L}_{\bm{x}}\delta\bm{x}. (60)

Let δ𝒙(1)\delta\bm{x}^{(1)}, δ𝒙(2)\delta\bm{x}^{(2)}, δ𝒘(1)\delta\bm{w}^{(1)}, and δ𝒘(2)\delta\bm{w}^{(2)} be two arbitrary infinitesimal displacements around 𝒙\bm{x} and 𝒘=h(𝒙)\bm{w}=h(\bm{x}), such that δ𝒘(1)=𝑳𝒙δ𝒙(1)\delta\bm{w}^{(1)}=\bm{L}_{\bm{x}}\delta\bm{x}^{(1)} and δ𝒘(2)=𝑳𝒙δ𝒙(2)\delta\bm{w}^{(2)}=\bm{L}_{\bm{x}}\delta\bm{x}^{(2)}. Then the following equation holds, where \cdot denotes the dot product.

δt𝒙(1)𝑮𝒙δ𝒙(2)=(𝑳𝒙δ𝒙(1))t(𝑳𝒙δ𝒙(2))=δ𝒘(1)δ𝒘(2).\displaystyle{}^{t}\delta\bm{x}^{(1)}\bm{G}_{\bm{x}}\delta\bm{x}^{(2)}={}^{t}(\bm{L}_{\bm{x}}\delta\bm{x}^{(1)})(\bm{L}_{\bm{x}}\delta\bm{x}^{(2)})=\delta\bm{w}^{(1)}\cdot\delta\bm{w}^{(2)}. (61)

This equation shows the isometric mapping from the inner product space for 𝒙Sinput\bm{x}\in S_{\mathrm{input}} with the metric tensor 𝑮𝒙\bm{G}_{\bm{x}} to the Euclidean space for 𝒘Siso\bm{w}\in S_{\mathrm{iso}}.

Note that all of the column vectors in the Jacobian matrix 𝒙/𝒚\partial\bm{x}/\partial\bm{y} also have a unit norm and are orthogonal to each other in the metric space for 𝒙Sinput\bm{x}\in S_{\mathrm{input}} with the metric tensor 𝑮𝒙\bm{G}_{\bm{x}}. Therefore, the m×nm\times n Jacobian matrix 𝒘/𝒚\partial\bm{w}/\partial\bm{y} should have a property that all of the column vectors have a unit norm and are orthogonal to each other in the Euclidean space.

Figure 9: Projection of the volume element from the implicit orthonormal space to the isometric space and input space. Vn()V_{n}(\cdot) denotes nn-dimensional volume.

Then nn-dimensional space which is composed of the meaningful dimensions from the implicit isometric space is named as the implicit orthonormal space SorthoS_{\mathrm{ortho}}. Figure 9 shows the projection of the volume element from the implicit orthonormal space to the isometric space and input space. Let dVortho\mathrm{d}V_{\mathrm{ortho}} be an infinitesimal nn-dimensional volume element in SorthoS_{\mathrm{ortho}}. This volume element is a nn-dimensional rectangular solid having each edge length dyj\mathrm{d}y_{j}. Let Vn(dVX)V_{n}(\mathrm{d}V_{\mathrm{X}}) be the nn-dimensional volume of a volume element dVX\mathrm{d}V_{\mathrm{X}}. Then, Vn(dVortho)=jndyjV_{n}(\mathrm{d}V_{\mathrm{ortho}})=\prod_{j}^{n}\mathrm{d}y_{j} holds. Next, dVortho\mathrm{d}V_{\mathrm{ortho}} is projected to nn dimensional infinitesimal element dViso\mathrm{d}V_{\mathrm{iso}} in SisoS_{\mathrm{iso}} by 𝒘/𝒚\partial\bm{w}/\partial\bm{y}. Because of the orthonormality, dViso\mathrm{d}V_{\mathrm{iso}} is equivalent to the rotation / reflection of dVortho\mathrm{d}V_{\mathrm{ortho}}, and Vn(dViso)V_{n}(\mathrm{d}V_{\mathrm{iso}}) is the same as Vn(dVortho)V_{n}(\mathrm{d}V_{\mathrm{ortho}}), i.e., jndyj\prod_{j}^{n}\mathrm{d}y_{j}. Then, dViso\mathrm{d}V_{\mathrm{iso}} is projected to nn-dimensional element dVinput\mathrm{d}V_{\mathrm{input}} in SinputS_{\mathrm{input}} by 𝒙/𝒘=𝑳𝒙1=a𝒙1/2𝑰m\partial\bm{x}/\partial\bm{w}=\bm{L}_{\bm{x}}^{-1}={a_{\bm{x}}}^{-1/2}\bm{I}_{m}. Because each dimension is scaled equally by the scale factor a𝒙1/2{a_{\bm{x}}}^{-{1}/{2}}, Vn(dVinput)=jna𝒙1/2dyj=a𝒙n/2Vn(dVortho)V_{n}(\mathrm{d}V_{\mathrm{input}})=\prod_{j}^{n}{a_{\bm{x}}}^{-{1}/{2}}\mathrm{d}y_{j}={a_{\bm{x}}}^{-{n}/{2}}\ V_{n}(\mathrm{d}V_{\mathrm{ortho}}) holds. Here, the ratio of the volume element between SinputS_{\mathrm{input}} and SorthoS_{\mathrm{ortho}} is Vn(dVinput)/Vn(dVortho)=a𝒙n/2V_{n}(\mathrm{d}V_{\mathrm{input}})/V_{n}(\mathrm{d}V_{\mathrm{ortho}})={a_{\bm{x}}}^{-{n}/{2}}. Note that the PDF ratio is derived by the reciprocal of Vn(dVinput)/Vn(dVortho)V_{n}(\mathrm{d}V_{\mathrm{input}})/V_{n}(\mathrm{d}V_{\mathrm{ortho}}). As a result, the PDF ratio is derived as a𝒙n/2{a_{\bm{x}}}^{{n}/{2}}.

A.7 Proof of Proposition 4.3.2: Determination of the meaningful dimensions for representation

This appendix explains the derivation of Proposition 4.3.2. Here, we estimate the KL divergence, i.e., the rate, for the dimensions whose variance is less than \beta. As shown later, the discussion in this appendix is closely related to rate-distortion theory (Berger, 1971; Pearlman & Said, 2011; Goyal, 2001).

Let LGL_{\mathrm{G}}, DGD_{\mathrm{G}}, and RGR_{\mathrm{G}} be averages of Lmin𝒙L_{\min{\bm{x}}}, Dmin𝒙D_{\min{\bm{x}}}, and Rmin𝒙R_{\min{\bm{x}}} in Appendix A.5 over 𝒙p(𝒙){\bm{x}}\sim p({\bm{x}}), respectively. Here, LG=DG+βRGL_{\mathrm{G}}=D_{\mathrm{G}}+\beta R_{\mathrm{G}} holds by definition. Since Dmin𝒙D_{\min{\bm{x}}} is a constant nβ/2n\beta/2 as in Eq. 54, DGD_{\mathrm{G}} is derived as:

DG=nβ/2.D_{\mathrm{G}}=n\beta/2. (62)

As D_{\mathrm{G}} is constant, the minimum condition of L_{\mathrm{G}} is equivalent to that of R_{\mathrm{G}}. Let D_{\mathrm{KLmin}\,j(\bm{x})} be the KL divergence of the j-th dimensional component at the minimum condition. Here, R_{\min\bm{x}}=\sum_{j}^{n}D_{\mathrm{KLmin}\,j(\bm{x})} holds by definition. Since Eq. 4.2 holds for small \beta/2, we can approximate D_{\mathrm{KLmin}\,j(\bm{x})} for small \beta/2 from Eqs. 16 and 4.2 as:

DKLminj(x)\displaystyle D_{\mathrm{KLmin}\ {j(x)}} \displaystyle\simeq log(σj(𝒙)p(μj(𝒙)))log2πe2\displaystyle-\log\left({\sigma_{j({\bm{x}})}}\ p(\mu_{j({\bm{x}})})\right)\ -\frac{\log{2\pi e}}{2} (63)
=\displaystyle= log(β/2p(yj))log2πe2\displaystyle-\log\left(\sqrt{\beta/2}\ p(y_{j})\right)\ -\frac{\log{2\pi e}}{2}
=\displaystyle= log(p(yj))logβπe2\displaystyle-\log\left(p(y_{j})\right)\ -\frac{\log{\beta\pi e}}{2}
=\displaystyle= log(p(yj))H(𝒩(yj;0,β/2)).\displaystyle-\log\left(p(y_{j})\right)\ -H(\mathcal{N}(y_{j};0,\beta/2)).

Here, H(\mathcal{N}(y_{j};0,\beta/2)) denotes the entropy of the Gaussian with variance \beta/2. Next, R_{\mathrm{G}} is expressed as:

R_{\mathrm{G}} = E_{\bm{x}\sim p(\bm{x})}[R_{\min\bm{x}}] (64)
 = E_{\bm{x}\sim p(\bm{x})}\Bigl[\sum_{j}^{n}D_{\mathrm{KLmin}\,j(\bm{x})}\Bigr]
 \simeq \int p(\bm{x})\sum_{j}^{n}\bigl(-\log p(y_{j})-H(\mathcal{N}(y_{j};0,\beta/2))\bigr)\mathrm{d}\bm{x}
 = \int p(\bm{y})\left|\mathrm{det}\left(\frac{\partial\bm{x}}{\partial\bm{y}}\right)\right|^{-1}\sum_{j}^{n}\bigl(-\log p(y_{j})-H(\mathcal{N}(y_{j};0,\beta/2))\bigr)\left|\mathrm{det}\left(\frac{\partial\bm{x}}{\partial\bm{y}}\right)\right|\mathrm{d}\bm{y}
 = \sum_{j}^{n}\left(-\int p(y_{j})\log p(y_{j})\,\mathrm{d}y_{j}-H(\mathcal{N}(y_{j};0,\beta/2))\right).

Note that the KL divergence is always greater than or equal to 0 by definition. Taking this into account, R_{\mathrm{G}} is further approximated as:

RGjnmax(p(yj)log(p(yj))dyjH(𝒩(yj;0,β/2)), 0).\displaystyle R_{\mathrm{G}}\simeq\sum_{j}^{n}\mathrm{max}\left(-\int p(y_{j})\log\left(p(y_{j})\right)\mathrm{d}y_{j}-H(\mathcal{N}(y_{j};0,\beta/2)),\ \ 0\right). (65)

Note that the approximation of Eq. 65 is reasonable from the Rate-distortion theory and optimal transform coding theory (Berger, 1971; Pearlman & Said, 2011; Goyal, 2001). The outline of Rate-distortion theory and optimal transform coding is explained in Appendix B.6. The term p(yj)logp(yj)dyj-\int p(y_{j})\log p(y_{j})\mathrm{d}y_{j} is the entropy of yjy_{j}. Thus, the optimal implicit isometric space is derived such that the entropy of data representation is minimum. When the data manifold has a disentangled property in the given metric by nature, each yjy_{j} will capture a disentangled feature with minimum entropy such that the mutual information between implicit isometric components becomes minimized. This is analogous to PCA for Gaussian data, which gives the disentangled representation with minimum entropy in SSE. Considering the similarity to the PCA eigenvalues, the variance of yjy_{j} will indicate the importance of each dimension.

Thus, if the entropy of y_{j} is larger than H(\mathcal{N}(0,\beta/2)), it is reasonable that D_{\mathrm{KLmin}\,j(\bm{x})}>0 holds. By contrast, if the entropy of y_{j} is less than H(\mathcal{N}(0,\beta/2)), then D_{\mathrm{KLmin}\,j(\bm{x})}=0 will hold. In such dimensions, \sigma_{j(\bm{x})}=1, \mu_{j(\bm{x})}=0, and D_{\mathrm{KL}j(\bm{x})}=0 will hold. In addition, \sigma_{j(\bm{x})}^{2}\,{}^{t}\bm{x}_{\mu_{j(\bm{x})}}\bm{G}_{\bm{x}}\bm{x}_{\mu_{j(\bm{x})}} will be close to 0, because this term need not be balanced with D_{\mathrm{KL}j(\bm{x})}.

These properties of VAE can be clearly explained by rate-distortion theory (Berger, 1971), which has been successfully applied to transform coding such as image / audio compression. Appendix B.6 explains that VAE can be interpreted as an optimal transform coding with non-linear scaling of latent space.

Thus, the latent variables with the largest variances and D_{\mathrm{KL}j(\bm{x})}>0 are sufficient for the representation, and the dimensions with D_{\mathrm{KL}j(\bm{x})}=0 can be ignored, allowing a reduction of the latent dimension n of \bm{z}.
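As a concrete illustration of Eq. 65 and of this dimension-selection rule, the sketch below assumes Gaussian-distributed implicit variables y_{j} with hypothetical variances; in that case the per-dimension rate reduces to the classical reverse water-filling form, and dimensions whose variance falls below \beta/2 carry no information.

# Eq. 65 for Gaussian y_j: rate_j = max(0.5*log(var_j/(beta/2)), 0).
import numpy as np

beta = 0.1
var_y = np.array([4.0, 1.0, 0.2, 0.04, 0.01])        # hypothetical variances of y_j
H_noise = 0.5 * np.log(2 * np.pi * np.e * beta / 2)  # H(N(0, beta/2))
H_y = 0.5 * np.log(2 * np.pi * np.e * var_y)         # entropy of each Gaussian y_j
rate = np.maximum(H_y - H_noise, 0.0)                # Eq. 65, per dimension, in nats

meaningful = rate > 0
print("per-dimension rate:", np.round(rate, 3))
print("meaningful dims   :", np.where(meaningful)[0])  # those with var_j > beta/2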

A.8 Proof of Proposition 4.3.2: Derivation of the estimated variance

This appendix explains the derivation of the quantitative importance of each dimension in Eq. 22 of Proposition 4.3.2.

First, we set yjy_{j} to 0 at μj(𝒙)=0\mu_{j({\bm{x}})}=0 to derive yjy_{j} value from dyj/dμj(𝒙){\mathrm{d}y_{j}}/{\mathrm{d}{\mu}_{j({\bm{x}})}} in Eq. 16. We also assume that the prior distribution is 𝒩(𝒛;0,𝑰n)\mathcal{N}({\bm{z}};0,{\bm{I}}_{n}). The variance is derived by the subtraction of E[yj]2{E[y_{j}]}^{2}, i.e., the square of the mean, from E[yj2]{E[y_{j}^{2}]}, i.e., the square mean. Thus, the approximations of both E[yj]{E[y_{j}]} and E[yj2]{E[y_{j}^{2}]} are needed.

First, the approximation of the mean E[yj]{E[y_{j}]} is explained. Because the cumulative distribution functions (CDFs) of yjy_{j} are the same as CDF of μj(𝒙)\mu_{j({\bm{x}})}, the following equations hold:

0p(yj)dyj=0p(μj(𝒙))dμj(𝒙)=0.5,0p(yj)dyj=0p(μj(𝒙))dμj(𝒙)=0.5.\displaystyle\int_{-\infty}^{0}p(y_{j})\mathrm{d}y_{j}=\int_{-\infty}^{0}p(\mu_{j({\bm{x}})})\mathrm{d}\mu_{j({\bm{x}})}=0.5,\hskip 2.84526pt\int_{0}^{\infty}p(y_{j})\mathrm{d}y_{j}=\int_{0}^{\infty}p(\mu_{j({\bm{x}})})\mathrm{d}\mu_{j({\bm{x}})}=0.5. (66)

This equation means that the median of the yjy_{j} distribution is 0. Because the mean and median are close in most cases, the mean E[yj]{E[y_{j}]} can be approximated as 0. As a result, the variance of yjy_{j} can be approximated by the square mean E[yj2]{E[y_{j}^{2}]}.

Second, the approximation of the mean square E[y_{j}^{2}] is explained. Since we assume the manifold has a disentangled property by nature, the posterior standard deviation \sigma_{j(\bm{x})} is assumed to be a function of \mu_{j(\bm{x})}, regardless of \bm{x}; this function is denoted as \sigma_{j}(\mu_{j(\bm{x})}). For \mu_{j(\bm{x})}\geq 0, y_{j} is approximated as follows, using Eq. 16 and replacing the average of 1/\sigma_{j}(\acute{\mu}_{j(\bm{x})}) over \acute{\mu}_{j(\bm{x})}\in[0,\mu_{j(\bm{x})}] by 1/\sigma_{j}(\mu_{j(\bm{x})}):

y_{j} = \int_{0}^{\mu_{j(\bm{x})}}\frac{\mathrm{d}y_{j}}{\mathrm{d}\acute{\mu}_{j}}\,\mathrm{d}\acute{\mu}_{j} = \sqrt{\frac{\beta}{2}}\int_{0}^{\mu_{j(\bm{x})}}\frac{1}{\sigma_{j}(\acute{\mu}_{j})}\,\mathrm{d}\acute{\mu}_{j} \simeq \sqrt{\frac{\beta}{2}}\,\frac{1}{\sigma_{j}(\mu_{j(\bm{x})})}\int_{0}^{\mu_{j(\bm{x})}}\mathrm{d}\acute{\mu}_{j} = \sqrt{\frac{\beta}{2}}\,\frac{\mu_{j(\bm{x})}}{\sigma_{j}(\mu_{j(\bm{x})})}. (67)

The same approximation is applied to \mu_{j(\bm{x})}<0. Then the mean square of y_{j} is approximated as follows, assuming that the correlation between \sigma_{j}(\mu_{j(\bm{x})})^{-2} and \mu_{j(\bm{x})}^{2} is low:

yj2p(yj)dyjβ2(μj(𝒙)σj(μj(𝒙)))2p(μj(𝒙))dμj(𝒙)β2σj(μj(𝒙))2p(μj(𝒙))dμj(𝒙)μj(𝒙)2p(μj(𝒙))dμj(𝒙).\displaystyle\int{y_{j}}^{2}p(y_{j})\mathrm{d}y_{j}\simeq\frac{\beta}{2}\int\left(\frac{\mu_{j({\bm{x}})}}{{\sigma}_{j}(\mu_{j({\bm{x}})})}\right)^{2}p(\mu_{j({\bm{x}})})\mathrm{d}\mu_{j({\bm{x}})}\simeq\frac{\beta}{2}\int{{\sigma_{j}}{({\mu_{j({\bm{x}})}})}}^{-2}p(\mu_{j({\bm{x}})})\mathrm{d}\mu_{j({\bm{x}})}{{\int{\mu_{j({\bm{x}})}}^{2}p(\mu_{j({\bm{x}})})\mathrm{d}\mu_{j({\bm{x}})}}}. (68)

Finally, the mean square of y_{j} is approximated as the following equation, using \int\mu_{j(\bm{x})}^{2}p(\mu_{j(\bm{x})})\mathrm{d}\mu_{j(\bm{x})}=1 and replacing \sigma_{j}(\mu_{j(\bm{x})})^{2} by \sigma_{j(\bm{x})}^{2}, i.e., the posterior variance derived from the input data:

yj2p(yj)dyjβ2σj(μj(𝒙))2p(μj(𝒙))dμj(𝒙)β2E𝝁j(𝒙)p(μj(𝒙))[σj(μj(𝒙))2]β2E𝒙p(𝒙)[σj(𝒙)2].\displaystyle\int{y_{j}}^{2}p(y_{j})\mathrm{d}y_{j}\simeq\frac{\beta}{2}\int{{\sigma_{j}}{({\mu_{j({\bm{x}})}})}}^{-2}p(\mu_{j({\bm{x}})})\mathrm{d}\mu_{j({\bm{x}})}\simeq\frac{\beta}{2}\ \ \underset{\bm{\mu}_{j({\bm{x}})}\sim p(\mu_{j({\bm{x}})})}{E}[{{\sigma_{j}}{({\mu_{j({\bm{x}})}})}}^{-2}]\simeq\frac{\beta}{2}\ \ \underset{\bm{x}\sim p(\bm{x})}{E}[{{\sigma}_{j({\bm{x}})}}^{-2}]. (69)

Although some rough approximations are used in this expansion, the estimated variance in the last equation still seems reasonable, because \sigma_{j(\bm{x})} acts as a scale factor between y_{j} and \mu_{j(\bm{x})}, while the variance of \mu_{j(\bm{x})} is always 1 under the prior \mathcal{N}(\mu_{j(\bm{x})};0,1). By keeping the variance of the prior \int\mu_{j(\bm{x})}^{2}p(\mu_{j(\bm{x})})\mathrm{d}\mu_{j(\bm{x})} in the expansion, this estimation method can be applied to any prior distribution.
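A practical sketch of Eq. 69 (our illustration): the quantitative importance of each latent dimension can be estimated as (\beta/2)\,E_{\bm{x}}[\sigma_{j(\bm{x})}^{-2}] from posterior standard deviations collected over a dataset; the array of standard deviations below is made up for illustration.

# Estimate the variance of the implicit isometric variable y_j per latent dimension.
import numpy as np

def estimated_variance(sigmas, beta):
    # sigmas[i, j] = posterior std of latent j for data point i
    return (beta / 2) * np.mean(sigmas**(-2.0), axis=0)

# toy usage with made-up posterior standard deviations
rng = np.random.default_rng(0)
sigmas = rng.uniform(0.05, 1.0, size=(1000, 5))
beta = 0.1
var_y = estimated_variance(sigmas, beta)
order = np.argsort(var_y)[::-1]               # rank dimensions like PCA eigenvalues
print(np.round(var_y[order], 3), order)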

Appendix B Detailed relation to prior works

This section first describes a clear formulation of the ELBO in VAE by utilizing the isometric embedding. Then, the detailed relationships with previous works, including corrections, are explained.

B.1 Derivation of ELBO with clear and quantitative form

This section clarifies that the ELBO value after optimization becomes close to the log-likelihood of input data in the metric space (not input space), by the theoretical derivation of the reconstruction loss and KL divergence via isometric embedding.

We derive the ELBO (without \beta) at \bm{x} in Eq. 1, i.e., E_{q_{\phi}(\bm{z}|\bm{x})}[\log p_{\theta}(\bm{x}|\bm{z})]-D_{\mathrm{KL}}(q_{\phi}(\bm{z}|\bm{x})\|p(\bm{z})), when the \beta-VAE objective L_{\bm{x}} (with \beta), i.e., L_{\bm{x}}=D(\bm{x},\hat{\bm{x}})+\beta D_{\mathrm{KL}}(\cdot), is optimized.

First, the reconstruction loss can be rewritten as:

E𝒛qϕ(𝒛|𝒙)[logpθ(𝒙|𝒛)]=qϕ(𝒛|𝒙)logpθ(𝒙|𝒛)d𝒛=qϕ(𝒚|𝒙)logpθ(𝒙|𝒚)d𝒚.E_{{\bm{z}}\sim q_{\phi}({\bm{z}}|{\bm{x}})}[\log p_{\theta}({\bm{x}}|{\bm{z}})]=\int q_{\phi}({\bm{z}}|{\bm{x}})\log p_{\theta}({\bm{x}}|{\bm{z}})\mathrm{d}{\bm{z}}=\int q_{\phi}({\bm{y}}|{\bm{x}})\log p_{\theta}({\bm{x}}|{\bm{y}})\mathrm{d}{\bm{y}}. (70)

Let \bm{\mu}_{\bm{y}(\bm{x})} be the implicit isometric variable corresponding to \bm{\mu}_{(\bm{x})}. Because the posterior variance of each isometric latent variable is a constant \beta/2, q_{\phi}(\bm{y}|\bm{x})\simeq\mathcal{N}(\bm{y};\bm{\mu}_{\bm{y}(\bm{x})},(\beta/2)\bm{I}_{n}) will hold. If \beta/2 is small, p(\hat{\bm{x}})\simeq p(\bm{x}) will hold. Then, also using the isometricity, the next equation will hold:

pθ(𝒙|𝒛)=pθ(𝒙|𝒚)=pθ(𝒙|𝒙^)=p(𝒙^|𝒙)p(𝒙)/p(𝒙^)p(𝒙^|𝒙)qϕ(𝒚|𝒙)𝒩(𝒚;𝝁𝒚(𝒙),(β/2)𝑰n).p_{\theta}({\bm{x}}|{\bm{z}})=p_{\theta}({\bm{x}}|{\bm{y}})=p_{\theta}({\bm{x}}|\hat{{\bm{x}}})=p(\hat{{\bm{x}}}|{{\bm{x}}})p({{\bm{x}}})/p(\hat{{\bm{x}}})\simeq p(\hat{{\bm{x}}}|{{\bm{x}}})\simeq q_{\phi}({\bm{y}}|{\bm{x}})\simeq\mathcal{N}({\bm{y}};{\bm{\mu}}_{{\bm{y}}({\bm{x}})},(\beta/2){\bm{I}}_{n}). (71)

Thus the reconstruction loss is estimated as:

E𝒛qϕ(𝒛|𝒙)[logpθ(𝒙|𝒛)]\displaystyle E_{{\bm{z}}\sim q_{\phi}({\bm{z}}|{\bm{x}})}[\log p_{\theta}({\bm{x}}|{\bm{z}})] \displaystyle\simeq 𝒩(𝒚;𝝁𝒚(𝒙),(β/2)𝑰n)log𝒩(𝒚;𝝁𝒚(𝒙),(β/2)𝑰n)d𝒚\displaystyle\int\mathcal{N}({\bm{y}};{\bm{\mu}}_{{\bm{y}}({\bm{x}})},(\beta/2){\bm{I}}_{n})\log\mathcal{N}({\bm{y}};{\bm{\mu}}_{{\bm{y}}({\bm{x}})},(\beta/2){\bm{I}}_{n})\ \mathrm{d}{\bm{y}} (72)
=\displaystyle= (n/2)log(βπe).\displaystyle-(n/2)\log(\beta\pi e).

Next, KL divergence is derived from Eq. 56 as:

DKL()=Rmin𝒙=logp(𝒚)(n/2)log(βπe).D_{\mathrm{KL}}(\cdot)=R_{\mathrm{min}{\bm{x}}}=-\log p({\bm{y}})-(n/2)\log(\beta\pi e). (73)

By summing both terms and averaging over \bm{x}\sim p(\bm{x}), the ELBO can be estimated as

ELBO\displaystyle ELBO =\displaystyle= E𝒙p(𝒙)[E𝒛qϕ(𝒛|𝒙)[logpθ(𝒙|𝒛)]DKL()]\displaystyle E_{{\bm{x}}\sim p({\bm{x}})}[E_{{\bm{z}}\sim q_{\phi}({\bm{z}}|{\bm{x}})}[\log p_{\theta}({\bm{x}}|{\bm{z}})]-D_{\mathrm{KL}}(\cdot)] (74)
\displaystyle\simeq E𝒙p(𝒙)[logp(𝒚)]\displaystyle E_{{\bm{x}}\sim p({\bm{x}})}[\log p({\bm{y}})]
\displaystyle\simeq E𝒙p(𝒙)[logp(𝒙)].\displaystyle E_{{\bm{x}}\sim p({\bm{x}})}[\log p({\bm{x}})].

As a result, the ELBO (Eq. 1) in its original form (Kingma & Welling, 2014) is close to the log-likelihood of \bm{x}, regardless of whether \beta=1 or not, when the objective of \beta-VAE (Higgins et al., 2017) is optimized. Note that \log p(\bm{x}) here is defined in the metric space. This also implies that the representation \bm{y} depends on the metric.

Next, the predetermined conditional distribution p_{\mathbb{R}p}(\bm{x}|\hat{\bm{x}}) used for training and the true conditional distribution p_{\theta}(\bm{x}|\bm{z})=p_{\theta}(\bm{x}|\hat{\bm{x}}) after optimization are examined. Although p_{\mathbb{R}p}(\bm{x}|\hat{\bm{x}}) and p_{\theta}(\bm{x}|\hat{\bm{x}}) are expected to be equivalent after optimization, the theoretical relationship between the two is not well discussed. Assume p_{\mathbb{R}p}(\bm{x}|\hat{\bm{x}})=\mathcal{N}(\bm{x};\hat{\bm{x}},\sigma^{2}\bm{I}). In this case, the metric D(\bm{x},\hat{\bm{x}}) is given by -\log p_{\mathbb{R}p}(\bm{x}|\hat{\bm{x}})=(1/2\sigma^{2})|\bm{x}-\hat{\bm{x}}|_{2}^{2}+\mathrm{Const}. Using Eq. 53, the following equations are derived:

Ep(𝒙)[D(𝒙,𝒙^)]=Ep(𝒙)[(1/2σ2)|𝒙𝒙^|22]=Ep(𝒙)[(1/2σ2)i(xix^i)2]nβ/2.E_{p({\bm{x}})}[D({\bm{x}},\hat{{\bm{x}}})]=E_{p({\bm{x}})}\bigl{[}(1/2\sigma^{2})|{\bm{x}}-\hat{{\bm{x}}}|_{2}^{2}\bigr{]}=E_{p({\bm{x}})}\bigl{[}(1/2\sigma^{2})\sum_{i}(x_{i}-\hat{x}_{i})^{2}\bigr{]}\simeq n\beta/2. (75)

Assume that (x_{i}-\hat{x}_{i})^{2} is the same for all i. Then the next equation is derived:

Ep(𝒙)[(xix^i)2]βσ2.E_{p({\bm{x}})}\bigl{[}(x_{i}-\hat{x}_{i})^{2}\bigr{]}\simeq\beta\sigma^{2}.\hskip 247.53897pt (76)

Because the variance of each dimension is βσ2\beta\sigma^{2}, the conditional distribution after optimization is estimated as pθ(𝒙|𝒙^)=𝒩(𝒙;𝒙^,βσ2𝑰)p_{\theta}({\bm{x}}|\hat{{\bm{x}}})=\mathcal{N}({\bm{x}};\hat{{\bm{x}}},\beta\sigma^{2}\bm{I}).

If β=1\beta=1, i.e., the original VAE objective, both pp(𝒙|𝒙^)p_{\mathbb{R}p}({\bm{x}}|\hat{{\bm{x}}}) and pθ(𝒙|𝒙^)p_{\theta}({\bm{x}}|\hat{{\bm{x}}}) are equivalent. This result is consistent with what is expected.

If \beta\neq 1, however, p_{\mathbb{R}p}(\bm{x}|\hat{\bm{x}}) and p_{\theta}(\bm{x}|\hat{\bm{x}}) are different. In other words, what \beta-VAE really does is to scale the variance of the predetermined conditional distribution of the original VAE by a factor of \beta, as in Eq. 76. The details are explained in Appendix B.3.

If D(\bm{x},\bm{x}+\delta\bm{x})={}^{t}\delta\bm{x}\,\bm{G}_{\bm{x}}\delta\bm{x}+O(\|\delta\bm{x}\|^{3}) is not SSE, by introducing a variable \acute{\bm{x}}=\bm{L}_{\bm{x}}\bm{x}, where \bm{L}_{\bm{x}} satisfies {}^{t}\bm{L}_{\bm{x}}\bm{L}_{\bm{x}}=\bm{G}_{\bm{x}}, the metric D(\cdot,\cdot) can be replaced by SSE in the Euclidean space of \acute{\bm{x}}.

B.2 Relation to Tishby et al. (1999)

The theory described in Tishby et al. (1999), which first proposes the concept of information bottleneck (IB), is consistent with our analysis. Tishby et al. (1999) clarified the behaviour of the compressed representation when the rate-distortion trade-off is optimized. 𝒙X\bm{x}\in X denotes the signal space with a fixed probability p(𝒙)p(\bm{x}) and 𝒙^X^\hat{\bm{x}}\in\hat{X} denotes its compressed representation. Let D(𝒙,𝒙^)D(\bm{x},\hat{\bm{x}}) be a loss metric. Then the rate-distortion trade-off can be described as:

L=I(X;X^)+βEp(𝒙,𝒙^)[D(𝒙,𝒙^)].\displaystyle L=I(X;\hat{X})+\beta^{\prime}\underset{p(\bm{x},\hat{\bm{x}})}{E}[D(\bm{x},\hat{\bm{x}})]. (77)

By solving this condition, they derive the following equation:

p(𝒙^|𝒙)exp(βD(𝒙,𝒙^)).\displaystyle{p(\hat{\bm{x}}|\bm{x})}\propto\exp(-\beta^{\prime}D(\bm{x},\hat{\bm{x}})). (78)

As shown in our discussion above, p(𝒙^|𝒙)𝒩(𝒙^;𝒙,(β/2)𝑰m)p(\hat{\bm{x}}|{\bm{x}})\simeq\mathcal{N}(\hat{\bm{x}};\bm{x},(\beta/2){\bm{I}}_{m}) will hold in the metric defined space from our VAE analysis. This result is equivalent to Eq. 78 in their work if D(𝒙,𝒙^)D(\bm{x},\hat{\bm{x}}) is SSE and β\beta^{\prime} is set to β1\beta^{-1}, as follows:

p(𝒙^|𝒙)exp(βD(𝒙,𝒙^))=exp(𝒙𝒙^222(β/2))𝒩(𝒙^;𝒙,(β/2)𝑰m).\displaystyle{p(\hat{\bm{x}}|\bm{x})}\propto\exp(-\beta^{\prime}D(\bm{x},\hat{\bm{x}}))=\exp\left(-\frac{||\bm{x}-\hat{\bm{x}}||_{2}^{2}}{2(\beta/2)}\right)\propto\mathcal{N}(\hat{\bm{x}};\bm{x},(\beta/2){\bm{I}}_{m}). (79)

If D(𝒙,𝒙^)D(\bm{x},\hat{\bm{x}}) is not SSE, the use of the space transformation explained in appendix B.1 will lead to the same result.

B.3 Relation to β\beta-VAE (Higgins et al., 2017)

This section provides a clearer understanding of \beta-VAE (Higgins et al., 2017) and also corrects part of its theory.

In Higgins et al. (2017), the ELBO equation is modified as:

E_{p(\bm{x})}\bigl[\,E_{\hat{\bm{x}}\sim p_{\phi}(\hat{\bm{x}}|\bm{x})}[\log q_{\theta}(\bm{x}|\hat{\bm{x}})]-\beta D_{\mathrm{KL}}(\cdot)\,\bigr]. (80)

However, they use predetermined conditional distributions q_{\theta}(\bm{x}|\hat{\bm{x}}), such as the Bernoulli and Gaussian distributions, in training (described in Table 1 of Higgins et al. (2017)). As shown in our Appendix G.2, the log-likelihoods of the Bernoulli and Gaussian distributions can be regarded as the BCE and SSE metrics, respectively. As a result, the actual training objective in Higgins et al. (2017) is not Eq. 80, but the objective L_{\bm{x}}=D(\bm{x},\hat{\bm{x}})+\beta D_{\mathrm{KL}}(\cdot) in Eq. 3 using BCE and SSE metrics with varying \beta. Thus the ELBO in the form of Eq. 1 becomes \log p(\bm{x}) in the BCE/SSE metric-defined space, regardless of whether \beta=1 or not, as shown in Appendix B.1.

Actually, Eq. 80 does not give the log-likelihood of \bm{x} after optimization. When D_{\mathrm{KL}}(\cdot)\simeq-\log p(\bm{x})-(n/2)\log(\beta\pi e) and E_{\hat{\bm{x}}\sim p(\hat{\bm{x}}|\bm{x})}[\log p(\bm{x}|\hat{\bm{x}})]\simeq-(n/2)\log(\beta\pi e) are applied, the value of Eq. 80 becomes \beta\log p(\bm{x})+(\beta-1)(n/2)\log(\beta\pi e), which differs from the log-likelihood of \bm{x} derived above if \beta\neq 1.

What \beta-VAE really does is simply scale the variance of the predetermined conditional distribution in the original VAE by a factor of \beta. When the predetermined conditional distribution is a Gaussian \mathcal{N}({\bm{x}};\hat{{\bm{x}}},\sigma^{2}\bm{I}), the objective of \beta-VAE can be rewritten as a linearly scaled original VAE objective with a Gaussian \mathcal{N}({\bm{x}};\hat{{\bm{x}}},\beta\sigma^{2}\bm{I}) whose variance is \beta\sigma^{2} instead of \sigma^{2}:

Eqϕ()[log𝒩(𝒙;𝒙^,σ2𝑰)]βDKL()\displaystyle E_{q_{\phi}(\cdot)}[\log\mathcal{N}({\bm{x}};\hat{{\bm{x}}},\sigma^{2}\bm{I})]-\beta D_{\mathrm{KL}}(\cdot) =\displaystyle= Eqϕ()[12log2πσ2|𝒙𝒙^|222σ2]βDKL()\displaystyle E_{q_{\phi}(\cdot)}\left[-\frac{1}{2}\log 2\pi\sigma^{2}-\frac{|{\bm{x}}-\hat{\bm{x}}|_{2}^{2}}{2\sigma^{2}}\right]-\beta D_{\mathrm{KL}}(\cdot) (81)
=\displaystyle= β(Eqϕ()[12log2πβσ2|𝒙𝒙^|222βσ2]DKL())\displaystyle\beta\left(E_{q_{\phi}(\cdot)}\left[-\frac{1}{2}\log 2\pi\beta\sigma^{2}-\frac{|{\bm{x}}-\hat{\bm{x}}|_{2}^{2}}{2\beta\sigma^{2}}\right]-D_{\mathrm{KL}}(\cdot)\right)
+β2log2πβσ212log2πσ2\displaystyle+\frac{\beta}{2}\log 2\pi\beta\sigma^{2}-\frac{1}{2}\log 2\pi\sigma^{2}
=\displaystyle= β(Eqϕ()[log𝒩(𝒙;𝒙^,βσ2𝑰)]DKL()¯)+const.\displaystyle\beta\left(\ \underline{E_{q_{\phi}(\cdot)}[\log\mathcal{N}({\bm{x}};\hat{{\bm{x}}},\beta\sigma^{2}\bm{I})]-D_{\mathrm{KL}}(\cdot)}\ \right)+\mathrm{const}.

Here, the underlined terms in the last equation are just the ELBO with the predetermined conditional distribution \mathcal{N}({\bm{x}};\hat{{\bm{x}}},\beta\sigma^{2}\bm{I}). So the optimization of the \beta-VAE objective with the predetermined conditional distribution \mathcal{N}({\bm{x}};\hat{{\bm{x}}},\sigma^{2}\bm{I}) is just the same as the optimization of the original VAE objective (\beta=1) with the predetermined conditional distribution \mathcal{N}({\bm{x}};\hat{{\bm{x}}},\beta\sigma^{2}\bm{I}).

B.4 Relation to Alemi et al. (2018)

Alemi et al. (2018) discuss the rate-distortion trade-off by a theoretical entropy analysis. We presume that their work also does not distinguish the objective L_{\bm{x}} from the ELBO, which leads to an incorrect discussion. In their work, the differential entropy for the input H, the distortion D, and the rate R are derived carefully. They suggest that VAE with \beta=1 is sensitive (unstable) because D and R can take arbitrary values on the line R=H-\beta D=H-D. Furthermore, they also suggest that R\geq H,\ D=0 at \beta\rightarrow 0 and R=0,\ D\geq H at \beta\rightarrow\infty will hold, as shown in figure 1 of their work.

In this appendix, we show that \beta determines the values of R and D specifically. We also show that R\simeq H-D will hold regardless of whether \beta=1 or not.

In their work, the values of H, D, and R are mathematically defined as:

H\displaystyle H \displaystyle\equiv d𝒙p(𝒙)logp(𝒙),\displaystyle-\int\mathrm{d}\bm{x}\ p^{*}(\bm{x})\log p^{*}(\bm{x}), (82)
D\displaystyle D \displaystyle\equiv d𝒙p(𝒙)d𝒛e(𝒛|𝒙)logd(𝒙|𝒛),\displaystyle-\int\mathrm{d}\bm{x}\ p^{*}(\bm{x})\int\mathrm{d}\bm{z}\ e(\bm{z}|\bm{x})\log d(\bm{x}|\bm{z}), (83)
R\displaystyle R \displaystyle\equiv d𝒙p(𝒙)d𝒛e(𝒛|𝒙)loge(𝒛|𝒙)m(𝒛).\displaystyle\int\mathrm{d}\bm{x}\ p^{*}(\bm{x})\int\mathrm{d}\bm{z}\ e(\bm{z}|\bm{x})\log\frac{e(\bm{z}|\bm{x})}{m(\bm{z})}. (84)

Here, p^{*}(\bm{x}) is the true PDF of \bm{x}, e(\bm{z}|\bm{x}) is a stochastic encoder, d(\bm{x}|\bm{z}) is a decoder, and m(\bm{z}) is the marginal probability of \bm{z}.

Our work allows a rough estimation of Eqs. 82-84 in terms of \beta by introducing the implicit isometric variable {\bm{y}}.

Using isometric variable 𝒚\bm{y} and the relation d𝒛e(𝒛|𝒙)=d𝒚e(𝒚|𝒙)\mathrm{d}\bm{z}\ e(\bm{z}|\bm{x})=\mathrm{d}\bm{y}\ e(\bm{y}|\bm{x}), Eq. 83 can be rewritten as:

D=d𝒙p(𝒙)d𝒚e(𝒚|𝒙)logd(𝒙|𝒚).\displaystyle D=-\int\mathrm{d}{\bm{x}}\ p^{*}({\bm{x}})\int\mathrm{d}{\bm{y}}\ e({\bm{y}}|{\bm{x}})\log d({\bm{x}}|{\bm{y}}). (85)

Let {\bm{\mu}}_{y} be the implicit isometric latent variable corresponding to the mean of the encoder output {\bm{\mu}}_{(\bm{x})}. As discussed in section 4.1, e({\bm{y}}|{\bm{x}})=\mathcal{N}(\bm{y};\bm{\mu_{y}},({\beta}/{2}){\bm{I}}_{n}) will hold. Because of the isometricity, the value of d({\bm{x}}|{\bm{y}}) will also be close to e({\bm{y}}|{\bm{x}})=\mathcal{N}(\bm{y};\bm{\mu_{y}},({\beta}/{2}){\bm{I}}_{n}). Although d({\bm{x}}|{\bm{z}}) must depend on e({\bm{z}}|{\bm{x}}), this important point has not been discussed well in their work. By using the implicit isometric variable, we can connect the two theoretically. Thus, D can be estimated as:

D\displaystyle D \displaystyle\simeq d𝒙p(𝒙)d𝒚𝒩(𝒚;𝝁y,(β/2)𝑰n)log𝒩(𝒚;𝝁𝒚,(β/2)𝑰n)\displaystyle\int\mathrm{d}{\bm{x}}\ p^{*}({\bm{x}})\int\mathrm{d}{\bm{y}}\ \mathcal{N}({\bm{y}};{\bm{\mu}}_{y},({\beta}/{2}){\bm{I}}_{n})\log\mathcal{N}(\bm{y};\bm{\mu_{y}},({\beta}/{2}){\bm{I}}_{n}) (86)
\displaystyle\simeq d𝒙p(𝒙)(n2log(βπe))\displaystyle\int\mathrm{d}{\bm{x}}\ p^{*}({\bm{x}})\left(\frac{n}{2}\log(\beta\pi e)\right)
=\displaystyle= n2log(βπe).\displaystyle\frac{n}{2}\log(\beta\pi e).
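The estimate above is simply the differential entropy of an isotropic Gaussian with variance \beta/2 per dimension. The following minimal numerical sketch (ours, not the authors' code; n and \beta are arbitrary assumed values) verifies the identity by Monte Carlo:

import numpy as np

rng = np.random.default_rng(0)
n, beta = 3, 0.01                    # latent dimension and beta (assumed example values)
sigma2 = beta / 2.0                  # posterior variance per dimension in the isometric space
y = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))          # samples of y - mu_y
# negative log-density of N(y; mu_y, (beta/2) I_n), averaged over the samples
neg_log_lik = 0.5 * n * np.log(2 * np.pi * sigma2) + (y ** 2).sum(axis=1) / (2 * sigma2)
print(neg_log_lik.mean(), 0.5 * n * np.log(beta * np.pi * np.e))  # both approx. (n/2) log(beta*pi*e)

Note that the value is negative for small \beta, which is consistent with the discussion below that D should be treated as a differential entropy rather than a nonnegative distortion.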

Second, RR is examined. m(𝒚)m(\bm{y}) is a marginal probability of 𝒚\bm{y}. Using the relation d𝒛e(𝒛|𝒙)=d𝒚e(𝒚|𝒙)\mathrm{d}\bm{z}\ e(\bm{z}|\bm{x})=\mathrm{d}\bm{y}\ e(\bm{y}|\bm{x}) and e(𝒛|𝒙)/m(𝒛)=(e(𝒚|𝒙)(d𝒚/d𝒛))/(m(𝒚)(d𝒚/d𝒛))=e(𝒚|𝒙)/m(𝒚){e(\bm{z}|\bm{x})}/{m(\bm{z})}=({e(\bm{y}|\bm{x})}(\mathrm{d}\bm{y}/\mathrm{d}\bm{z}))/({m(\bm{y})}(\mathrm{d}\bm{y}/\mathrm{d}\bm{z}))={e(\bm{y}|\bm{x})}/{m(\bm{y})}, Eq. 84 can be rewritten as:

Rd𝒙p(𝒙)d𝒚e(𝒚|𝒙)loge(𝒚|𝒙)m(𝒚).\displaystyle R\simeq\int\mathrm{d}\bm{x}\ p^{*}(\bm{x})\int\mathrm{d}\bm{y}\ e(\bm{y}|\bm{x})\log\frac{e(\bm{y}|\bm{x})}{m(\bm{y})}. (87)

Because of isometricity, e(𝒚|𝒙)p(𝒙^|𝒙)𝒩(𝒙^;𝒙,(β/2)𝑰m)e(\bm{y}|\bm{x})\simeq p(\hat{\bm{x}}|{\bm{x}})\simeq\mathcal{N}(\hat{\bm{x}};\bm{x},(\beta/2){\bm{I}}_{m}) will approximately hold where 𝒙^\hat{\bm{x}} denotes a decoder output. Thus m(𝒚)m(\bm{y}) can be approximated by:

m(𝒚)d𝒙p(𝒙)e(𝒚|𝒙)d𝒙p(𝒙)𝒩(𝒙^;𝒙,(β/2)𝑰m)\displaystyle m(\bm{y})\simeq\int\mathrm{d}{\bm{x}}\ p^{*}({\bm{x}})e(\bm{y}|\bm{x})\simeq\int\mathrm{d}{\bm{x}}\ p^{*}({\bm{x}})\ \mathcal{N}(\hat{\bm{x}};\bm{x},(\beta/2){\bm{I}}_{m}) (88)

Here, if β/2\beta/2, i.e., added noise, is small enough compared to the variance of 𝒙\bm{x}, a normal distribution function term in this equation will act like a delta function. Thus m(𝒚)m(\bm{y}) can be approximated as:

m(𝒚)d𝒙´p(𝒙´)δ(𝒙´𝒙)p(𝒙).\displaystyle m(\bm{y})\simeq\int\mathrm{d}{\acute{\bm{x}}}\ p^{*}(\acute{\bm{x}})\ \delta(\acute{\bm{x}}-\bm{x})\simeq p^{*}({\bm{x}}). (89)

In a similar way, the following approximation also holds.

d𝒚e(𝒚|𝒙)logm(𝒚)d𝒚e(𝒚|𝒙)logp(𝒙)d𝒙´δ(𝒙´𝒙)logp(𝒙´)logp(𝒙).\displaystyle\int\mathrm{d}\bm{y}\ e(\bm{y}|\bm{x})\log{m(\bm{y})}\simeq\int\mathrm{d}\bm{y}\ e(\bm{y}|\bm{x})\log{p^{*}(\bm{x})}\simeq\int\mathrm{d}{\acute{\bm{x}}}\ \delta(\acute{\bm{x}}-\bm{x})\ \log\ p^{*}(\acute{\bm{x}})\simeq\log{p^{*}(\bm{x})}. (90)

By using these approximations and applying Eqs. 85-86, R in Eq. 84 can be approximated as:

R\displaystyle R \displaystyle\simeq d𝒙p(𝒙)d𝒚e(𝒚|𝒙)loge(𝒚|𝒙)p(𝒙)\displaystyle\int\mathrm{d}\bm{x}\ p^{*}(\bm{x})\int\mathrm{d}\bm{y}\ e(\bm{y}|\bm{x})\log\frac{e(\bm{y}|\bm{x})}{p^{*}(\bm{x})} (91)
\displaystyle\simeq d𝒙p(𝒙)logp(𝒙)(d𝒙p(𝒙)d𝒚e(𝒚|𝒙)loge(𝒚|𝒙))\displaystyle-\int\mathrm{d}\bm{x}\ p^{*}(\bm{x})\ \log{p^{*}(\bm{x})}-\left(-\int\mathrm{d}\bm{x}\ p^{*}(\bm{x})\int\mathrm{d}\bm{y}\ e(\bm{y}|\bm{x})\log{e(\bm{y}|\bm{x})}\right)
\displaystyle\simeq Hn2log(βπe)\displaystyle H-\frac{n}{2}\log(\beta\pi e)
\displaystyle\simeq HD.\displaystyle H-D.

As discussed above, R and D can be specifically derived from \beta. In addition, the Shannon lower bound discussed in Alemi et al. (2018) can be roughly verified for the optimized VAE with a clearer notation in terms of \beta.

From the discussion above, we presume that Alemi et al. (2018) might treat D incorrectly in their work. They suggest that VAE with \beta=1 is sensitive (unstable) because D and R can take arbitrary values on the line R=H-\beta D=H-D; however, our work as well as Tishby et al. (1999) (appendix B.2) and Dai & Wipf (2019) (appendix B.5) show that the distortion and rate, i.e., D and R, are specifically determined by \beta after optimization, and R=H-D will hold regardless of whether \beta=1 or not. Alemi et al. (2018) also suggest that D should satisfy D\geq 0 because D is a distortion; however, we suggest that D should be treated as a differential entropy and can be less than 0, because \bm{x} is handled as a continuous signal with a stochastic process in Eqs. 82-84. Here, D\simeq(n/2)\log(\beta\pi e) approaches -\infty as \beta\rightarrow 0, as also shown in Dai & Wipf (2019). Thus, the upper bound of R at \beta\rightarrow 0 is not H, but R=H-(-\infty)=\infty, as shown in RD theory for a continuous signal. Huang et al. (2020) show this property experimentally in their figures 4-8, where R appears to diverge as MSE approaches 0.

B.5 Relation to Dai et al. (2018) and Dai & Wipf (2019)

Our work is consistent with Dai et al. (2018) and Dai & Wipf (2019).

Dai et al. (2018) analyse VAE by assuming a linear model. As a result, the estimated posterior variance is constant. If the distribution of the manifold is Gaussian, our work and Dai et al. (2018) give a similar result with constant posterior variances. For non-Gaussian data, however, quantitative analysis such as probability estimation is intractable with their linear model. Our work reveals that the posterior variance gives a scaling factor between \bm{z} in VAE and \bm{y} in the isometric space when VAE is ideally trained with rich parameters. This is validated by Figure 3, where the estimation of the posterior variance at each data point is a key.

Next, the relation to Dai & Wipf (2019) is discussed. They analyse the behavior of VAE when it is ideally trained. For example, Theorem 5 in their work shows that D\rightarrow(d/2)\log\gamma+O(1) and R\rightarrow-(\hat{\gamma}/2)\log\gamma+O(1) hold if \gamma\rightarrow+0, where \gamma, d, and \hat{\gamma} denote the variance of d({\bm{x}}|{\bm{z}}), the data dimension, and the latent dimension, respectively. By setting \gamma=\beta/2 and d=\hat{\gamma}=n, this theorem is consistent with R and D derived in Eq. 86 and Eq. 91.

B.6 Relation to Rate-distortion theory (Berger, 1971) and transform coding (Goyal, 2001; Pearlman & Said, 2011)

RD theory (Berger, 1971) formulated the optimal transform coding (Goyal, 2001; Pearlman & Said, 2011) for the Gaussian source with the square error metric as follows. Let \bm{x}\in\mathbb{R}^{m} be a point in a dataset. First, the data are transformed deterministically with an orthonormal transform (orthogonal and unit norm) such as the Karhunen-Loève transform (KLT) (Rao & Yip, 2000). Note that the basis of the KLT is equivalent to a PCA basis. Let \bm{z}\in\mathbb{R}^{m} be the point transformed from \bm{x}. Then, \bm{z} is entropy-coded by allowing an equivalent stochastic distortion (or posterior with constant variance) in each dimension. A lower bound of the rate R at a distortion D is denoted by R(D). The derivation of R(D) is as follows. Let z_{j} be the j-th dimensional component of \bm{z} and {\sigma_{zj}}^{2} be the variance of z_{j} in the dataset. It is noted that {\sigma_{zj}}^{2} is equivalent to an eigenvalue of PCA for the dataset. Let d be the distortion equally allowed in each dimensional channel. At the optimal condition, the distortion D_{\mathrm{opt}} and rate R_{\mathrm{opt}} on the curve R(D) are calculated as functions of d:

Ropt\displaystyle R_{\mathrm{opt}} =\displaystyle= 12j=1mmax(log(σzj2/d),0),\displaystyle\frac{1}{2}\sum_{j=1}^{m}\max(\log({\sigma_{zj}}^{2}/d),0),\hskip 8.53581pt
Dopt\displaystyle D_{\mathrm{opt}} =\displaystyle= j=1mmin(d,σzj2).\displaystyle\sum_{j=1}^{m}\min(d,\ {\sigma_{zj}}^{2}). (92)

The simplest way to allow equivalent distortion is to use a uniform quantization (Goyal, 2001). Let TT be a quantization step, and round()\mathrm{round}(\cdot) be a round function. Quantized value zj^\hat{z_{j}} is derived as kTkT, where k=round(zj/T)k=\mathrm{round}({z_{j}}/T). Then, dd is approximated by T2/12{T^{2}}/{12} as explained in Appendix G.1.

To practically achieve the best RD trade-off in image compression, rate-distortion optimization (RDO) has also been widely used (Sullivan & Wiegand, 1998). In RDO, the best trade-off is achieved by finding an encoding parameter that minimizes the cost L for a given Lagrange parameter \lambda:

L=D+λR.L=D+\lambda R. (93)

This equation is equivalent to the VAE objective L_{\bm{x}}=D(\cdot,\cdot)+\beta D_{\mathrm{KL}}(\cdot) when \lambda=\beta.

We show that the optimum condition of VAE shown in Eqs. 62 and 65 can be mapped to the optimum condition of transform coding (Goyal, 2001) shown in Eq. 92. First, the derivation of Eq. 92 is explained by solving the optimal distortion assignment to each dimension. In transform coding of m-dimensional Gaussian data, an input data point \bm{x} is transformed to \bm{z} using an orthonormal transform such as KLT/DCT. Then each dimensional component z_{j} is encoded while allowing a distortion d_{j}. Let D be a target distortion satisfying D=\sum_{j=1}^{m}d_{j}. Next, let \sigma_{zj}^{2} denote the variance of each dimensional component z_{j} over the input dataset. Then, the rate R can be derived as \sum_{j=1}^{m}\frac{1}{2}\log(\sigma_{zj}^{2}/d_{j}). By introducing a Lagrange parameter \lambda and minimizing the rate-distortion optimization cost L=D+\lambda R, the optimum condition is derived as:

λopt=2D/m,dj=D/m=λopt/2.\lambda_{\mathrm{opt}}=2D/m,\hskip 5.69054ptd_{j}=D/m=\lambda_{\mathrm{opt}}/2. (94)

This result is consistent with Eq.  62 and 65 by setting β=λopt=2D/m\beta=\lambda_{\mathrm{opt}}=2D/m. This implies that LG=DG+βRGL_{\mathrm{G}}=D_{\mathrm{G}}+\beta R_{\mathrm{G}} is a rate-distortion optimization (RDO) cost of transform coding when 𝒙\bm{x} is deterministically transformed to 𝒚\bm{y} in the implicit isometric space and stochastically encoded with a distortion β/2\beta/2.
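As a concrete illustration, the following short sketch (ours, not taken from the cited references) evaluates Eqs. 92 and 94 for a Gaussian source, using the toy-dataset variances (1/6, 2/3, 8/3) of Section 5.1 as assumed eigenvalues; rates are in nats.

import numpy as np

def rd_point(var_z, d):
    # R_opt and D_opt of Eq. 92 for a common per-dimension distortion d
    var_z = np.asarray(var_z, dtype=float)
    R_opt = 0.5 * np.maximum(np.log(var_z / d), 0.0).sum()
    D_opt = np.minimum(d, var_z).sum()
    return R_opt, D_opt

var_z = [1.0 / 6.0, 2.0 / 3.0, 8.0 / 3.0]      # assumed per-dimension variances (PCA eigenvalues)
for beta in [0.01, 0.1, 1.0]:
    d = beta / 2.0                             # Eq. 94 with lambda_opt = beta: d_j = beta / 2
    R, D = rd_point(var_z, d)
    print(f"beta={beta:5.2f}  d={d:5.3f}  R_opt={R:6.3f}  D_opt={D:6.3f}")

When d=\beta/2 is smaller than every variance, D_{\mathrm{opt}}=m\beta/2, which for m=3 matches the theoretical coding loss 3\beta/2 used in Appendix D.2.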

Appendix C Details of the networks and training conditions in the experiments

This appendix explains the networks and training conditions in Section 5.

C.1 Toy data set

This appendix explains the details of the networks and training conditions in the experiment of the toy data set in Section 5.1.

Network configurations:
FC(i, o, f) denotes an FC layer with input dimension i, output dimension o, and activation function f.

The encoder network is composed of FC(16, 128, tanh)-FC(128, 64, tanh)-FC(64, 3, linear)\times 2 (for \mu and \sigma). The decoder network is composed of FC(3, 64, tanh)-FC(64, 128, tanh)-FC(128, 16, linear).
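A possible PyTorch realization of this architecture is sketched below (ours, not the authors' original code; the choice of outputting \log\sigma^{2} rather than \sigma for the second head is an assumption):

import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    def __init__(self, m=16, n=3):
        super().__init__()
        # encoder trunk: FC(16, 128, tanh) - FC(128, 64, tanh)
        self.enc = nn.Sequential(nn.Linear(m, 128), nn.Tanh(),
                                 nn.Linear(128, 64), nn.Tanh())
        self.fc_mu = nn.Linear(64, n)         # FC(64, 3, linear) for mu
        self.fc_logvar = nn.Linear(64, n)     # FC(64, 3, linear) for log sigma^2 (assumed)
        # decoder: FC(3, 64, tanh) - FC(64, 128, tanh) - FC(128, 16, linear)
        self.dec = nn.Sequential(nn.Linear(n, 64), nn.Tanh(),
                                 nn.Linear(64, 128), nn.Tanh(),
                                 nn.Linear(128, m))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.dec(z), mu, logvar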

Training conditions:
The reconstruction loss D(\cdot,\cdot) is computed by summing the per-dimension losses and averaging them over the input dimension m=16. The KL divergence is derived as a summation of D_{\mathrm{KL}(j)}(\cdot) as explained in Eq. 2.

In our code, we use a loss objective that is essentially the same as the original \beta-VAE form L_{\bm{x}}=D(\cdot,\cdot)+\beta D_{\mathrm{KL}(j)}(\cdot) in Eq. 1, but scaled by a constant factor:

L𝒙=λD(,)+DKL(j)().L_{\bm{x}}=\lambda\ D(\cdot,\cdot)+D_{\mathrm{KL}(j)}(\cdot). (95)

Equation 95 is essentially equivalent to L=D(\cdot,\cdot)+\beta D_{\mathrm{KL}(j)}(\cdot), obtained by multiplying the original form by the constant \lambda=\beta^{-1}. The reason why we use this form is as follows. Let \mathrm{ELBO}_{\mathrm{true}} be the true ELBO in the sense of the log-likelihood, i.e., E[\log p(\bm{x})]. As shown in Eq. 57, the minimum of the loss objective in the original \beta-VAE form is likely to be -\beta\,\mathrm{ELBO}_{\mathrm{true}}+\mathrm{constant}. If we use Eq. 95, the minimum of the loss objective will be -\mathrm{ELBO}_{\mathrm{true}}+\mathrm{constant}, which seems a more natural form of the ELBO. Thus, Eq. 95 allows estimating the data probability from L_{\bm{x}} in Eqs. 19 and 21 without scaling L_{\bm{x}} by 1/\beta.
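For concreteness, a minimal sketch of Eq. 95 (ours; assuming the square error loss averaged over the m input dimensions and a diagonal Gaussian posterior with the prior \mathcal{N}(0,I)) is given below:

import torch

def loss_eq95(x, x_hat, mu, logvar, lam=100.0):
    m = x.shape[1]
    # reconstruction loss D(., .): square error averaged over the input dimension m
    recon = ((x - x_hat) ** 2).sum(dim=1) / m
    # KL divergence of N(mu, diag(sigma^2)) from the prior N(0, I), summed over latent dimensions
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1.0).sum(dim=1)
    return (lam * recon + kl).mean()          # L_x = lambda * D(., .) + D_KL(.), lambda = 1/beta

Combined with the ToyVAE sketch above, a training step would compute x_hat, mu, logvar = model(x) and minimize loss_eq95(x, x_hat, mu, logvar, lam=100.0).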

Then the network is trained with \lambda=\beta^{-1}=100 for 500 epochs with a batch size of 128. Here, the Adam optimizer is used with a learning rate of 1e-3. We use a PC with an Intel(R) Xeon(R) CPU [email protected] and 32GB memory, equipped with an NVIDIA GeForce GTX 1080. The simulation time for each trial is about 20 minutes, including the statistics evaluation codes.

C.2 CelebA data set

This appendix explains the details of the networks and training conditions in the experiment of the CelebA data set in Section 5.2.

Network configurations:
CNN(w, h, s, c, f) denotes a CNN layer with kernel size (w, h), stride size s, dimension c, and activation function f. GDN and IGDN, for which Google provides code in the official TensorFlow compression library (https://github.com/tensorflow/compression), are activation functions designed for image compression (Ballé et al., 2016). These activation functions are effective and popular in deep image compression studies.

The encoder network is composed of CNN(9, 9, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - FC(1024, 1024, softplus) - FC(1024, 32, None)\times 2 (for \bm{\mu} and \bm{\sigma}).

The decoder network is composed of FC(32, 1024, softplus) - FC(1024, 1024, softplus) - CNN(5, 5, 2, 64, IGDN) - CNN(5, 5, 2, 64, IGDN) - CNN(5, 5, 2, 64, IGDN)-CNN(9, 9, 2, 3, IGDN).

Training conditions:
In this experiment, SSIM, explained in Appendix G.2, is used as the reconstruction loss. The reconstruction loss D(\cdot,\cdot) is derived as follows. Let \mathrm{SSIM} be the SSIM value calculated from the two input images. As explained in Appendix G.2, SSIM is measured for a whole image, and its range is between 0 and 1; the higher the quality, the closer the SSIM value is to 1. Then D(\cdot,\cdot) is set to 1-\mathrm{SSIM}.

We also use the loss form of Equation 95 in our code. In the case of the decomposed loss, the loss function L_{\bm{x}} is set to \lambda(D(\bm{x},\breve{\bm{x}})+D(\breve{\bm{x}},\hat{\bm{x}}))+D_{\mathrm{KL}}(\cdot). Then, the network is trained with \lambda=\beta^{-1}=1{,}000 using a batch size of 64 for 300,000 iterations. Here, the Adam optimizer is used with a learning rate of 1e-3.
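A minimal sketch of this decomposed form is given below (ours; it assumes \breve{\bm{x}} is the decoder output for the posterior mean \bm{\mu}_{(\bm{x})}, \hat{\bm{x}} is the decoder output for the sampled \bm{z}, and recon_loss(a, b) is any per-sample metric D(\cdot,\cdot) from Appendix G.2, such as 1-\mathrm{SSIM} computed per image or SSE):

import torch

def decomposed_loss(x, x_breve, x_hat, mu, logvar, recon_loss, lam=1000.0):
    transform = recon_loss(x, x_breve)     # D(x, x_breve): transform loss (input vs. Dec(mu))
    coding = recon_loss(x_breve, x_hat)    # D(x_breve, x_hat): coding loss (Dec(mu) vs. Dec(z))
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1.0).sum(dim=1)
    return (lam * (transform + coding) + kl).mean()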

We use a PC with CPU Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz, 12GB memory equipped with NVIDIA GeForce GTX 1080. The simulation time for each trial is about 180 minutes, including the statistics evaluation codes.

Appendix D Additional results in the toy datasets

D.1 Scatter plots for the square error loss in Section 5.1

Figure 10 shows the plots of p(𝒙)p(\bm{x}) and estimated probabilities for the square error coding loss in Section 5.1, where the scale factor a𝒙a_{\bm{x}} in Eq. 21 is 11. Thus, both exp(Lx/β)\exp(-L_{x}/\beta) and p(𝝁(𝒙))jσj(𝒙)p({\bm{\mu}}_{(\bm{x})})\prod_{j}\sigma_{j{(\bm{x})}} show a high correlation, allowing easy estimation of the data probability in the input space. In contrast, p(𝝁(𝒙))p({\bm{\mu}}_{(\bm{x})}) still shows a low correlation. These results are consistent with our theory.

Figure 10: Plots of the data generation probability (x-axis) versus estimated probabilities (y-axes) for the square error loss. The y-axes are (a) p({\bm{\mu}}_{(\bm{x})}), (b) \exp(-L_{x}/\beta), and (c) p({\bm{\mu}}_{(\bm{x})})\prod_{j}\sigma_{j(\bm{x})}.

D.2 Ablation study using 3 toy datasets, 3 coding losses, and 10 β\beta parameters.

In this appendix, we explain the ablation study for the toy datasets. We introduce three toy datasets and three coding losses including those used in Section 5.1. We also change β1=λ\beta^{-1}=\lambda from 11 to 1,0001,000 in training. The details of the experimental conditions are shown as follows.

Datasets: First, we call the toy dataset used in Section 5.1 the Mix dataset in order to distinguish the three datasets. The second dataset is generated such that three-dimensional variables s_{1}, s_{2}, and s_{3} are sampled in accordance with the distributions p(s_{1}), p(s_{2}), and p(s_{3}) in Figure 11. The variances of the variables are the same as those of the Mix dataset, i.e., 1/6, 2/3, and 8/3, respectively. We call this the Ramp dataset. Because the PDF shape of this dataset is quite different from the prior \mathcal{N}({\bm{z}};0,I_{3}), the fitting will be the most difficult among the three. The third dataset is generated such that three-dimensional variables s_{1}, s_{2}, and s_{3} are sampled in accordance with the normal distributions \mathcal{N}(s_{1};0,1/6), \mathcal{N}(s_{2};0,2/3), and \mathcal{N}(s_{3};0,8/3), respectively. We call this the Norm dataset. The fitting will be the easiest, because both the prior and the input follow normal distributions, and the posterior standard deviation, given by the PDF ratio at the same CDF, can be a constant.

Coding losses: Two of the three coding losses are the square error loss and the downward-convex loss described in Section 5.1. The third coding loss is an upward-convex loss, which we design as Eq. 96 such that the scale factor a_{\bm{x}} becomes the reciprocal of the scale factor in Eq. 5.1:

D(𝒙,𝒙^)=a𝒙𝒙𝒙^22,wherea𝒙=(2/3+2𝒙22/21)1and𝑮x=a𝒙𝑰m.\displaystyle D({\bm{x}},\hat{\bm{x}})=a_{\bm{x}}\|{\bm{x}}-\hat{\bm{x}}\|_{2}^{2},\hskip 8.53581pt\text{where}\ a_{\bm{x}}=(2/3+2\ \|{\bm{x}}\|_{2}^{2}/21)^{-1}\ \text{and}\ \bm{G}_{x}=a_{\bm{x}}\bm{I}_{m}.\ (96)

Figure 12 shows the scale factors a𝒙a_{\bm{x}} in Eqs. 5.1 and 96, where s1s_{1} in 𝒙=(s1,0,0)\bm{x}=(s_{1},0,0) moves within ±5\pm 5.
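For clarity, a small sketch of the two non-SSE coding losses is given below (ours; it assumes, from the reciprocal relation stated above, that the downward-convex scale factor of Eq. 5.1 is a_{\bm{x}}=2/3+2\|{\bm{x}}\|_{2}^{2}/21):

import numpy as np

def scaled_sse(x, x_hat, convex="down"):
    # downward-convex scale factor (assumed form of Eq. 5.1); Eq. 96 uses its reciprocal
    a_down = 2.0 / 3.0 + 2.0 * np.sum(x ** 2) / 21.0
    a = a_down if convex == "down" else 1.0 / a_down
    return a * np.sum((x - x_hat) ** 2)       # D(x, x_hat) = a_x * ||x - x_hat||^2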

Parameters: As explained in Appendix C.1, λ=1/β\lambda=1/\beta is used as a hyper parameter. Specifically, λ=1\lambda=1, 22, 55, 1010, 2020, 5050, 100100, 200200, 500500, and 1,0001,000 are used.

Figure 11: PDFs of the three variables used to generate the Ramp dataset.
Figure 12: Scale factor a_{\bm{x}} for the downward-convex loss and the upward-convex loss.

Figures 13 - 21 show the property measurements for all combinations of the datasets and coding losses, with changing λ\lambda. In each Figure, the estimated norms of the implicit transform are shown in the figure (a), the ratios of the estimated variances are shown in the figure (b), and the correlation coefficients between p(𝒙)p(\bm{x}) and estimated data probabilities are shown in the figure (c), respectively.

First, the estimated norm of the implicit transform in the figures (a) is discussed. In all conditions, the norms are close to 1 as described in Eq. 23 in the λ\lambda range 5050 to 10001000. These results show consistency with our theoretical analysis, supporting the existence of the implicit orthonormal transform. The values in the Norm dataset are the closest to 11, and those in the Ramp dataset are the most different, which seems consistent with the difficulty of the fitting.

Second, the ratio of the estimated variances is discussed. In panel (b) of each figure, \mathrm{Var}(z_{j}) denotes the estimated variance, given by the average of \sigma_{j(\bm{x})}^{-2}. Then, \mathrm{Var}(z_{2})/\mathrm{Var}(z_{1}) and \mathrm{Var}(z_{3})/\mathrm{Var}(z_{1}) are plotted. In all conditions, the ratios \mathrm{Var}(z_{2})/\mathrm{Var}(z_{1}) and \mathrm{Var}(z_{3})/\mathrm{Var}(z_{1}) are close to the variance ratios of the input variables, i.e., 4 and 16, in the \lambda range 5 to 500. Figure 22 shows the detailed comparison of the ratios for the three datasets and three coding losses at \lambda=100. In most cases, the estimated variances in the downward-convex loss are the smallest, and those in the upward-convex loss are the largest, which is more distinct for \mathrm{Var}(z_{3})/\mathrm{Var}(z_{1}). This can be explained as follows. When using the downward-convex loss, the space region with a large norm is thought of as shrinking in the inner product space, as described in Section 5.1. This makes the variance smaller. In contrast, when using the upward-convex loss, the space region with a large norm is thought of as expanding in the inner product space, making the variance larger. Here, the dependency of the ratios on the choice of loss is smaller in the Norm dataset. The possible reason is that data in the normal distribution concentrate around the center, having less effect on the loss scale factor in the downward-convex and upward-convex losses.

Third, the correlation coefficients between p(\bm{x}) and the estimated data probabilities in panel (c) of each figure are discussed. In the Mix dataset and the Ramp dataset, the correlation coefficients are around 0.9 in the \lambda range from 20 to 200 when the estimated probabilities {a_{\bm{x}}}^{n/2}p(\bm{\mu}_{(\bm{x})})\prod_{j=1}^{n}{\sigma}_{j({\bm{x}})} and {a_{\bm{x}}}^{n/2}\exp(-(1/{\beta})L_{\bm{x}}) in Eq. 21 are used. When using p(\bm{\mu}_{(\bm{x})})\prod_{j=1}^{n}{\sigma}_{j({\bm{x}})} and \exp(-(1/{\beta})L_{\bm{x}}) with the downward-convex and upward-convex losses, the correlation coefficients become worse. In addition, when using the prior probability p(\bm{\mu}_{(\bm{x})}), the correlation coefficients are always the worst. In the Norm dataset, the correlation coefficients are close to 1.0 in a wider range of \lambda when using the estimated distribution in Eq. 21. When using p(\bm{\mu}_{(\bm{x})})\prod_{j=1}^{n}{\sigma}_{j({\bm{x}})} and \exp(-(1/{\beta})L_{\bm{x}}) with the downward-convex and upward-convex losses, the correlation coefficients also become worse. When using the prior probability p(\bm{\mu}_{(\bm{x})}), however, the correlation coefficients are close to 1, in contrast to the other two datasets. This can be explained by the fact that both the input distribution and the prior distribution are the same normal distribution, allowing the posterior variances to be almost constant. These results also show consistency with our theoretical analysis.

Figure 23 shows the dependency of the coding loss on \beta for the Mix, Ramp, and Norm datasets using the square error loss. From D_{\mathrm{G}} in Eq. 64 and n=3, the theoretical value of the coding loss is \frac{3\beta}{2}, as also shown in the figure. Unlike Figs. 13-21, the x-axis is \beta=\lambda^{-1} to evaluate the linearity. As expected from Theorem 4.2, the coding losses are close to the theoretical value where \beta<0.1, i.e., \lambda>10.

Figure 24 shows the dependency of the ratio of the transform loss to the coding loss on \beta for the Mix, Ramp, and Norm datasets using the square error loss. From Eq. 43, the estimated transform loss is \sum_{i=1}^{3}(\beta/2)^{2}/\mathrm{Var}(s_{i})=\frac{63\beta^{2}}{32}. Thus the theoretical value of the ratio is (\frac{63\beta^{2}}{32})/(\frac{3\beta}{2})=\frac{21\beta}{16}, as is also shown in the figure. The x-axis is also \beta=\lambda^{-1}, as in Figure 23. Considering the correlation coefficients discussed above, the useful range of \beta seems to be between 0.005-0.05 (20-200 for \lambda). In this range, the ratio is less than 0.1, implying that the transform loss is almost negligible. As expected from Lemma 4.2 and appendix A.3, the ratio is close to the theoretical value where \beta>0.01, i.e., \lambda<100. For \beta<0.01, the transform loss is still negligibly small, but the ratio is somewhat off the theoretical value. The reason is presumably that the transform loss is too small for the network to fit.

As shown above, this ablation study strongly supports our theoretical analysis in Section 4.

Figures 13-21: Property measurements for all combinations of the three datasets and three coding losses, with \lambda changed from 1 to 1,000. In each figure, panel (a) shows the estimated norm \frac{2}{\beta}{\sigma_{j(\bm{x})}}^{2}D^{\prime}_{j}(\bm{z}), panel (b) shows the ratio of the estimated variances \mathrm{Var}(z_{3})/\mathrm{Var}(z_{1}) and \mathrm{Var}(z_{2})/\mathrm{Var}(z_{1}), and panel (c) shows the correlation coefficient of the estimated data probability. \mathrm{Var}(z_{j}) denotes the estimated variance, given by the average of \sigma_{j(\bm{x})}^{-2}.
Figure 13: Mix dataset, square error loss. Figure 14: Mix dataset, downward-convex loss. Figure 15: Mix dataset, upward-convex loss. Figure 16: Ramp dataset, square error loss. Figure 17: Ramp dataset, downward-convex loss. Figure 18: Ramp dataset, upward-convex loss. Figure 19: Norm dataset, square error loss. Figure 20: Norm dataset, downward-convex loss. Figure 21: Norm dataset, upward-convex loss.
Figure 22: Ratio of the estimated variances for the three datasets and three coding losses at \lambda=100: (a) \mathrm{Var}(z_{2})/\mathrm{Var}(z_{1}); (b) \mathrm{Var}(z_{3})/\mathrm{Var}(z_{1}). \mathrm{Var}(z_{j}) denotes the estimated variance, given by the average of \sigma_{j(\bm{x})}^{-2}.
Figure 23: Dependency of the coding loss on \beta for the Mix, Norm, and Ramp datasets using the square error loss.
Figure 24: Dependency of the transform loss / coding loss ratio on \beta for the Mix, Norm, and Ramp datasets using the square error loss.

D.3 Increase of latent dimension

Table 4 and Figure 25 show the results under the condition of Table 2, except that the latent dimension is increased to 5. For z_{1} to z_{3}, each value is close to that of Table 2. For z_{4} and z_{5}, ({2}/{\beta}){\sigma_{j}}^{2}{D^{\prime}}_{j} is almost 0 and the average of {\sigma_{j(\bm{x})}}^{-2} is close to 1. In such dimensions, \mathrm{Var}(y_{j})<\beta/2 and D_{\mathrm{KL}j}(\cdot)=0 will hold, as explained in Appendices A.7, E.2, and B.6 (RD theory). Figure 25 describes the plot for {a_{x}}^{n/2}\exp(-L_{\bm{x}}/\beta) corresponding to Fig. 3, also showing an almost proportional relationship.

Table 4: Property measurements of the toy dataset with 5-dimensional latents trained using the square error loss.
variable                                              z_1       z_2       z_3       z_4       z_5
\frac{2}{\beta}\sigma_{j}^{2}D^{\prime}_{j} (Av.)     0.963     0.918     0.964     0.000     0.000
\frac{2}{\beta}\sigma_{j}^{2}D^{\prime}_{j} (SD)      0.053     0.169     0.103     0.000     0.000
\sigma_{j(\bm{x})}^{-2} (Av.)                         3.34e1    1.46e2    5.88e2    1.00e0    1.00e0
\sigma_{j(\bm{x})}^{-2} (Ratio of Av.)                1.0       4.39      17.69     0.03      0.03
Figure 25: Plots of the data generation probability (x-axis) versus estimated probabilities (y-axes) for the square error loss. The dimension of latents is set to 5.

Appendix E Additional results in CelebA dataset

E.1 Traversed outputs for all the component in the experimental section 5.2

Figure 26 shows the decoder outputs for all the components, where each latent variable is traversed from -2 to 2. The estimated variance of each y_{j}, i.e., \overline{\sigma^{-2}_{j}}, is also shown in these figures. The latent variables z_{j} are numbered in descending order by the estimated variances. Figure 26(a) is the result using the conventional loss form, i.e., L_{\bm{x}}=D(\bm{x},\hat{\bm{x}})+\beta D_{\mathrm{KL}}(\cdot). The degrees of change seem to descend in accordance with the estimated variances. In the range where j is from 1 to 10, the degrees of change are large. In the range j>10, the degrees of change become gradually smaller. Furthermore, almost no change is observed in the range j>27. As shown in Figure 6, D_{\mathrm{KL(j)}}(\cdot) is close to zero for j>27, meaning these dimensions carry no information. Note that the behavior of dimensional components where D_{\mathrm{KL(j)}}(\cdot)=0 is explained in section E.2. Thus, this result is clearly consistent with our theoretical analysis in section 4.3.

Figure 26(b) is the result using the decomposed loss form, i.e., L_{\bm{x}}=D(\bm{x},\breve{\bm{x}})+D(\breve{\bm{x}},\hat{\bm{x}})+\beta D_{\mathrm{KL}}(\cdot). The degrees of change also seem to descend in accordance with the estimated variances. Looking at the details, there are still minor changes even at j=32. As shown in Figure 6, the KL divergences D_{\mathrm{KL(j)}}(\cdot) for all the components are larger than zero. This implies that all of the dimensional components have meaningful information. Therefore, we can see a minor change even at j=32. Thus, this result is also consistent with our theoretical analysis in Section 4.3.

Another minor difference is sharpness. Although a quantitative comparison is difficult, the decoded images in Figure 26(b) seem somewhat sharper than those in Figure 26(a). A possible reason for this minor difference is as follows. The transform loss D(\bm{x},\breve{\bm{x}}) serves to bring the decoded image of \bm{\mu}_{(\bm{x})} closer to the input. In conventional image coding, the orthonormal transform and its inverse transform are used for encoding and decoding, respectively. Therefore, the input and the decoded output are equivalent when quantization is not used; if not, the quality of the decoded image suffers from degradation. Considering this analogy, the use of the decomposed loss might improve the decoded images for \bm{\mu}_{(\bm{x})}, encouraging the improvement of the orthonormality of the encoder/decoder in VAE.

E.2 The understanding of latent components where DKLj()=0D_{\mathrm{KL}j}(\cdot)=0 in Figure 6

This section explains the behaviors of the latent components where D_{\mathrm{KL}j}(\cdot)=0, especially in Fig. 6. First, we explain why the norm becomes 0 when D_{\mathrm{KL}j}(\cdot)=0. The loss in Eq. 4.2 consists of a norm (multiplied by \beta/2) and \beta D_{\mathrm{KL}j}(\cdot), and training finds the best balance (trade-off) between them. If D_{\mathrm{KL}j}(\cdot)=0, the norm also becomes zero because balancing them is no longer needed. Second, we explain the condition under which D_{\mathrm{KL}j}(\cdot)=0. Let \mathrm{Var}(y_{j}) and {\sigma_{y_{j}(x)}}^{2} be the variance and the posterior variance of the j-th implicit isometric component y_{j}, respectively. Here, {\sigma_{y_{j}(x)}}^{2} is \beta/2 in our theory. Then the condition for D_{\mathrm{KL}j}(\cdot)=0 is derived as \mathrm{Var}(y_{j})\leq{\sigma_{y_{j}(x)}}^{2}, as shown in Appendix A.7 and B.6 (RD theory). In RD theory, this corresponds to the case where the signal magnitude is always less than the quantizer size (\sqrt{\beta/2} in the \beta-VAE case) and no information needs to be encoded. Finally, we explain why the two behaviors in Fig. 6 are different. With the decomposed loss in Fig. 6, {\sigma_{y_{j}(x)}}^{2} is almost \beta/2 as the theory expects. In this case, all of the \mathrm{Var}(y_{j}) happen to be greater than {\sigma_{y_{j}(x)}}^{2}. With the conventional loss in Fig. 6, however, {\sigma_{y_{j}(x)}}^{2} is about 1.83 times greater than \beta/2. Note that in both cases, \mathrm{Var}(y_{j}) will be almost the same because of the isometric embedding. Since {\sigma_{y_{j}(x)}}^{2} becomes larger, the number of dimensions where \mathrm{Var}(y_{j})\leq{\sigma_{y_{j}(x)}}^{2} increases. Accordingly, the number of dimensions where the norms are zero also increases. Figure 27 shows the CelebA results with a smaller \beta, resulting in a smaller {\sigma_{y_{j}(x)}}^{2}. Here, all dimensions have nonzero norms because \mathrm{Var}(y_{j})>{\sigma_{y_{j}(x)}}^{2} will hold.

Figure 26: Traversed outputs for all the components, changing z_{j} from -2 to 2: (a) trained using the conventional loss form; (b) trained using the decomposed loss form. The latent variables z_{j} are numbered in descending order by the estimated variance \overline{\sigma_{j}^{-2}} shown in Figure 6.

E.3 Additional experimental results under another condition

Figure 27: Graph of the \sigma_{j(\bm{x})}^{-2} average and \frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z}) on the CelebA dataset: (a) conventional loss form; (b) decomposed loss form. The bottleneck size and \lambda are set to 256 and 10000, respectively.

In this section, we provide experimental results under another condition. We use essentially the same settings as described in Appendix C.2, except for the following. The bottleneck size and \lambda are set to 256 and 10000, respectively. The encoder network is composed of CNN(9, 9, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - FC(1024, 2048, softplus) - FC(2048, 256, None)\times 2 (for \bm{\mu} and \bm{\sigma}). The decoder network is composed of FC(256, 2048, softplus) - FC(2048, 1024, softplus) - CNN(5, 5, 2, 64, IGDN) - CNN(5, 5, 2, 64, IGDN) - CNN(5, 5, 2, 64, IGDN) - CNN(9, 9, 2, 3, IGDN).

Figures 27(a) and 27(b) show the averages of \sigma_{j(\bm{x})}^{-2} as well as the average and the standard deviation of \frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z}) for the conventional loss form and the decomposed loss form, respectively. When using the conventional loss form, the mean of \frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z}) is 1.25, which is closer to 1 than the mean of 1.83 in Section 5.2. This suggests that the implicit transform is closer to orthonormal. A possible reason is that a larger reconstruction error is likely to interfere with the RD trade-off and cause a slight violation of the theory, which might be compensated by a larger \lambda. When using the decomposed loss form, the mean of \frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z}) is 0.95, meaning an almost unit norm. These results also support that VAE provides the implicit orthonormal transform even if \lambda or the bottleneck size is varied.

Appendix F Additional Experimental Result with MNIST dataset

In this appendix, we provide the experimental result of Section 5.2 with the MNIST dataset (http://yann.lecun.com/exdb/mnist/), which consists of binary hand-written digits with an input dimension of 768. We use the standard training split, which includes 50,000 data points. For the reconstruction loss, we use the binary cross entropy (BCE) loss for the Bernoulli distribution. We average the BCE by the number of pixels.

The encoder network is composed of FC(768, 1024, relu) - FC(1024, 1024, relu) - FC(1024, bottleneck size). The decoder network is composed of FC(bottleneck size, 1024, relu) - FC(1024, 1024, relu) - FC(1024, 768, sigmoid). The batch size is 256 and the number of training iterations is 50,000. In this section, results with two settings, (bottleneck size=32, \lambda=2000) and (bottleneck size=64, \lambda=10000), are provided. Note that since we average the BCE loss by the number of pixels, \beta in the conventional \beta-VAE is derived as 768/\lambda. The model is optimized by the Adam optimizer with a learning rate of 1e-3, using the conventional (not decomposed) loss form.

We use a PC with CPU Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz, 12GB memory equipped with NVIDIA GeForce GTX 1080. The simulation time for each trial is about 10 minutes, including the statistics evaluation codes.

Figure 28 shows the averages of \sigma_{j(\bm{x})}^{-2} as well as the average and the standard deviation of \frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z}). In both conditions, the averages of \frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z}) are also close to 1, except in the dimensions where \sigma_{j(\bm{x})}^{-2} is less than 10. These results suggest that the theoretical property still holds when using the BCE loss. In the dimensions where \sigma_{j(\bm{x})}^{-2} is less than 10, \frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z}) is somewhat lower than 1. A possible reason is that D_{\mathrm{KL(j)}}(\cdot) in such dimensions is 0 for some inputs and larger than 0 for others. The understanding of this transition region needs further study.

Figure 28: Graph of the \sigma_{j(\bm{x})}^{-2} average and \frac{2}{\beta}\sigma_{j(\bm{x})}^{2}D^{\prime}_{j}(\bm{z}) on the MNIST dataset: (a) bottleneck size 32, \lambda=2000; (b) bottleneck size 64, \lambda=10000.

Appendix G Derivation/Explanation in RDO-related equation expansions

G.1 Approximation of distortion in uniform quantization

Let T be a quantization step. The quantized value \hat{z_{j}} is derived as k\,T, where k=\mathrm{round}({z_{j}}/T). Then d, the distortion per channel, is approximated by

d\displaystyle d =\displaystyle= k(k1/2)T(k+1/2)Tp(zj)(zjkT)2dzjkTp(kT)(k1/2)T(k+1/2)T1T(zjkT)2dzj\displaystyle\sum_{k}\int_{(k-1/2)T}^{(k+1/2)T}p(z_{j})(z_{j}-k\ T)^{2}\ \mathrm{d}z_{j}\hskip 5.69054pt\simeq\hskip 5.69054pt\sum_{k}T\ p(k\ T)\int_{(k-1/2)T}^{(k+1/2)T}\frac{1}{T}(z_{j}-k\ T)^{2}\ \mathrm{d}z_{j} (97)
=\displaystyle= T212kTp(kT)T212.\displaystyle\frac{T^{2}}{12}\sum_{k}T\ p(k\ T)\hskip 5.69054pt\simeq\hskip 5.69054pt\frac{T^{2}}{12}.

Here, kTp(kT)p(zj)dzj=1\sum_{k}T\ p(k\ T)\simeq\int_{-\infty}^{\infty}p(z_{j})\mathrm{d}z_{j}=1 is used. The distortion for the given quantized value is also estimated as T2/12T^{2}/12, because this value is approximated by (k1/2)T(k+1/2)T1T(zjkT)2dzj\int_{(k-1/2)T}^{(k+1/2)T}\frac{1}{T}(z_{j}-k\ T)^{2}\ \mathrm{d}z_{j}.
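A quick numerical sketch (ours; the unit-variance Gaussian source and the step T=0.05 are arbitrary assumed values) confirms this approximation:

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=1_000_000)     # source samples z_j (assumed unit variance)
T = 0.05                                     # quantization step
z_hat = np.round(z / T) * T                  # k * T with k = round(z_j / T)
print(np.mean((z - z_hat) ** 2), T ** 2 / 12)   # the two values nearly coincide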

G.2 Approximation of reconstruction loss as a quadratic form.

In this appendix, the approximations of the reconstruction losses as a quadratic form δt𝒙𝑮𝒙δ𝒙+C𝒙{}^{t}{\delta\bm{x}}\ {\bm{G}}_{\bm{x}}{\delta\bm{x}}+C_{\bm{x}} are explained for the sum of square error (SSE), binary cross entropy (BCE) and Structural Similarity (SSIM). Here, we have borrowed the derivation of BCE and SSIM from Kato et al. (2020), and add some explanation and clarification to them for convenience. We also describe the log-likelihood of the Gaussian distribution.

Let 𝒙^\hat{\bm{x}} and xi^\hat{x_{i}} be decoded sample Decθ(𝒛)\mathrm{Dec}_{\theta}(\bm{z}) and its ii-th dimensional component respectively. δ𝒙\delta\bm{x} and δxi\delta{x_{i}} denote 𝒙𝒙^\bm{x}-\hat{\bm{x}} and xixi^{x_{i}}-\hat{x_{i}}, respectively. It is also assumed that δ𝒙\delta\bm{x} and δxi\delta{x_{i}} are infinitesimal. The details of the approximations are described as follows.

Sum square error:
In the case of sum square error, 𝑮𝒙{\bm{G}}_{\bm{x}} is equal to 𝑰m{\bm{I}}_{m}. This can be derived as:

i=1m(xixi^)2=i=1mδxi2=δt𝒙𝑰mδ𝒙.\displaystyle\sum_{i=1}^{m}{(x_{i}-\hat{x_{i}})^{2}}=\sum_{i=1}^{m}{\delta x_{i}^{2}}={}^{t}\delta\bm{x}{\bm{I}}_{m}\delta\bm{x}. (98)

Binary cross entropy:
Binary cross entropy is a log likelihood of the Bernoulli distribution. The Bernoulli distribution is described as:

pθ(𝒙|𝒛)=i=1mxi^xi(1xi^)(1xi).\displaystyle p_{\theta}(\bm{x}|\bm{z})=\prod_{i=1}^{m}\hat{x_{i}}^{{x_{i}}}\ (1-\hat{x_{i}})^{(1-{x_{i}})}. (99)

Then, the binary cross-entropy (BCE) can be expanded as:

logpθ(𝒙|𝒛)\displaystyle-\log p_{\theta}(\bm{x}|\bm{z}) =\displaystyle= logi=1mxi^xi(1xi^)(1xi)\displaystyle-\log\prod_{i=1}^{m}\hat{x_{i}}^{{x_{i}}}\ (1-\hat{x_{i}})^{(1-{x_{i}})} (100)
=\displaystyle= i=1m(xilogxi^(1xi)log(1xi^))\displaystyle\sum_{i=1}^{m}(-x_{i}\log{\hat{x_{i}}}-(1-x_{i})\log{(1-\hat{x_{i}})})
=\displaystyle= i(xilog(1+δxixi)(1xi)log(1δxi1xi))\displaystyle\sum_{i}\left(-x_{i}\log\left(1+\frac{\delta x_{i}}{x_{i}}\right)-\left(1-x_{i}\right)\log\left(1-\frac{\delta x_{i}}{1-x_{i}}\right)\right)
+i(xilog(xi)(1xi)log(1xi)).\displaystyle+\sum_{i}(-x_{i}\log(x_{i})-(1-x_{i})\log(1-x_{i})).

Here, the second term of the last equation is a constant C𝒙C_{\bm{x}} depending on 𝒙\bm{x}. Using log(1+x)=xx2/2+O(x3)\log(1+x)=x-x^{2}/2+O(x^{3}), the first term of the last equation is further expanded as follows:

i(xi(δxixiδxi22xi2)(1xi)(δxi1xiδxi22(1xi)2)+O(δxi3))\displaystyle\sum_{i}\left(-x_{i}\left(\frac{\delta x_{i}}{x_{i}}-\frac{{\delta x_{i}}^{2}}{2{x_{i}}^{2}}\right)-\left(1-x_{i}\right)\left(-\frac{\delta x_{i}}{1-x_{i}}-\frac{{\delta x_{i}}^{2}}{2\left(1-x_{i}\right)^{2}}\right)+O\left({\delta x_{i}}^{3}\right)\right)
=i(12(1xi+11xi)δxi2+O(δxi3)).\displaystyle=\sum_{i}\left(\frac{1}{2}\left(\frac{1}{x_{i}}+\frac{1}{1-x_{i}}\right){{\delta x_{i}}^{2}}+O\left({\delta x_{i}}^{3}\right)\right). (101)

As a result, a metric tensor 𝑮𝒙\bm{G}_{\bm{x}} can be approximated as the following positive definite Hermitian matrix:

𝑮𝒙=(12(1x1+11x1)0012(1x2+11x2)).\displaystyle\bm{G}_{\bm{x}}=\left(\begin{array}[]{ccc}\frac{1}{2}\left(\frac{1}{x_{1}}+\frac{1}{1-x_{1}}\right)&0&\ldots\\ 0&\frac{1}{2}\left(\frac{1}{x_{2}}+\frac{1}{1-x_{2}}\right)&\ldots\\ \vdots&\vdots&\ddots\\ \end{array}\right). (105)

Here, the loss function in each dimension 12(1x1+11x1)\frac{1}{2}\left(\frac{1}{x_{1}}+\frac{1}{1-x_{1}}\right) is a downward-convex function as shown in Figure 29.

Figure 29: Graph of \frac{1}{2}\left(\frac{1}{x}+\frac{1}{1-x}\right) in the BCE approximation.
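As a small numerical sketch (ours; the input and the perturbation are randomly chosen assumed values), the quadratic form with the metric tensor of Eq. 105 closely matches the BCE of Eq. 100 around \bm{x}:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.05, 0.95, size=16)         # an input with components in (0, 1)
delta = 1e-3 * rng.normal(size=16)           # a small perturbation delta_x
x_hat = x + delta

bce = lambda p, q: np.sum(-p * np.log(q) - (1 - p) * np.log(1 - q))
lhs = bce(x, x_hat) - bce(x, x)              # BCE minus the constant C_x of Eq. 100
rhs = np.sum(0.5 * (1 / x + 1 / (1 - x)) * delta ** 2)   # quadratic form with G_x of Eq. 105
print(lhs, rhs)                              # nearly equal for a small delta_x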

Structural similarity (SSIM):
Structural similarity (SSIM) (Wang et al., 2001) is widely used as a picture quality metric that is close to subjective quality. Let \mathrm{SSIM} be the SSIM value between two pictures. The range of \mathrm{SSIM} is between 0 and 1; the higher the value, the better the quality. In this appendix, we also show that (1-\mathrm{SSIM}) can be approximated by a quadratic form such as {}^{t}{\delta\bm{x}}\,{\bm{G}}_{\bm{x}}{\delta\bm{x}}.

SSIMN×N(h,v)(𝒙,𝒚)\mathrm{SSIM}_{N\times N(h,v)}(\bm{x},\bm{y}) denotes a SSIM value between N×NN\times N windows in pictures XX and YY, where 𝒙N2\bm{x}\in\mathbb{R}^{N^{2}} and 𝒚N2\bm{y}\in\mathbb{R}^{N^{2}} denote N×NN\times N pixels cropped from the top-left coordinate (h,v)(h,v) in the images XX and YY, respectively. Let μ𝒙\mu_{\bm{x}}, μ𝒚\mu_{\bm{y}} be the averages of all dimensional components in 𝒙\bm{x}, 𝒚\bm{y}, and σ𝒙\sigma_{\bm{x}} , σ𝒚\sigma_{\bm{y}} be the variances of all dimensional components in 𝒙\bm{x}, 𝒚\bm{y} in the N×NN\times N windows, respectively. Then, SSIMN×N(h,v)(𝒙,𝒚)\mathrm{SSIM}_{N\times N(h,v)}(\bm{x},\bm{y}) is derived as

SSIMN×N(h,v)(𝒙,𝒚)=2μ𝒙μ𝒚μ𝒙2+μ𝒚22σ𝒙𝒚σ𝒙2+σ𝒚2.\mathrm{SSIM}_{N\times N(h,v)}({\bm{x}},{\bm{y}})=\frac{2{\mu}_{\bm{x}}{\mu}_{\bm{y}}}{{{\mu}_{\bm{x}}}^{2}+{{\mu}_{\bm{y}}}^{2}}\cdot\frac{2{\sigma_{\bm{xy}}}}{{{\sigma}_{\bm{x}}}^{2}+{{\sigma}_{\bm{y}}}^{2}}. (106)

In order to calculate a SSIM value for a picture, the window is shifted in a whole picture and all of SSIM values are averaged. Therefore, if (1SSIMN×N(h,v)(𝒙,𝒚))\left(1-\mathrm{SSIM}_{N\times N(h,v)}(\bm{x},\bm{y})\right) is expressed as a quadratic form δt𝒙𝑮(h,v)𝒙δ𝒙{}^{t}{\delta\bm{x}}\ {\bm{G}}_{(h,v)\bm{x}}\ {\delta\bm{x}}, (1SSIM)(1-\mathrm{SSIM}) can be also expressed in quadratic form δt𝒙𝑮𝒙δ𝒙{}^{t}{\delta\bm{x}}\ {\bm{G}}_{\bm{x}}{\delta\bm{x}}.

Let δ𝒙\delta\bm{x} be a minute displacement of 𝒙\bm{x}. μδ𝒙{\mu_{\delta\bm{x}}} and σδ𝒙2{\sigma_{\delta\bm{x}}}^{2} denote an average and variance of all dimensional components in δ𝒙\delta\bm{x}, respectively. Then, SSIM between 𝒙\bm{x} and 𝒙+δ𝒙\bm{x}+\delta\bm{x} can be approximated as:

SSIMN×N(h,v)(𝒙,𝒙+δ𝒙)1μδ𝒙22μx2σδ𝒙22σx2+O((|δ𝒙|/|𝒙|)3).\mathrm{SSIM}_{N\times N(h,v)}(\bm{x},\bm{x}+\delta\bm{x})\simeq 1-\frac{{\mu_{\delta\bm{x}}}^{2}}{2{\mu_{x}}^{2}}-\frac{{\sigma_{\delta\bm{x}}}^{2}}{2{\sigma_{x}}^{2}}+O\left(\left(|\delta\bm{x}|/|\bm{x}|\right)^{3}\right). (107)

Then μδ𝒙2{\mu_{\delta\bm{x}}}^{2} and σδ𝒙2{\sigma_{\delta\bm{x}}}^{2} can be expressed as

μδ𝒙2=δt𝒙𝑴δ𝒙,where𝑴=1N2(111111111),\displaystyle{\mu_{\delta\bm{x}}}^{2}={}^{t}{\delta\bm{x}}\ \bm{M}{\delta\bm{x}},\hskip 5.69054pt\text{where}\hskip 5.69054pt\bm{M}=\frac{1}{N^{2}}\ \left(\begin{array}[]{cccc}1&1&\ldots&1\\ 1&1&\ldots&1\\ \vdots&\vdots&\ddots&\vdots\\ 1&1&\ldots&1\\ \end{array}\right), (112)

and

σδ𝒙2=δt𝒙𝑽δ𝒙,where𝑽=1N𝑰NM,\displaystyle{\sigma_{\delta\bm{x}}}^{2}={}^{t}{\delta\bm{x}}\ \bm{V}{\delta\bm{x}},\hskip 5.69054pt\text{where}\hskip 5.69054pt\bm{V}=\frac{1}{N}\bm{I}_{N}-M, (113)

respectively. As a result, (1SSIMN×N(h,v)(𝒙,𝒙+δ𝒙))\left(1-\mathrm{SSIM}_{N\times N(h,v)}(\bm{x},\bm{x}+\delta\bm{x})\right) can be expressed in the following quadratic form as:

1SSIMN×N(h,v)(𝒙,𝒙+δ𝒙)δt𝒙𝑮(h,v)𝒙δ𝒙,where𝑮(h,v)𝒙=(12μx2𝑴+12σx2𝑽).1-\mathrm{SSIM}_{N\times N(h,v)}(\bm{x},\bm{x}+\delta\bm{x})\simeq{}^{t}{\delta\bm{x}}\ {\bm{G}}_{(h,v)\bm{x}}{\delta\bm{x}},\hskip 5.69054pt\text{where}\hskip 2.84526pt{\bm{G}}_{(h,v)\bm{x}}=\left(\frac{1}{2{\mu_{x}}^{2}}\bm{M}+\frac{1}{2{\sigma_{x}}^{2}}\bm{V}\right). (114)

It is noted that \bm{M} and \bm{V} are positive semidefinite Hermitian matrices, and their weighted sum {\bm{G}}_{(h,v)\bm{x}} is a positive definite Hermitian matrix, since {}^{t}{\delta\bm{x}}\,{\bm{G}}_{(h,v)\bm{x}}{\delta\bm{x}}=0 requires both the mean and the variance of \delta\bm{x} to vanish, i.e., \delta\bm{x}=0. As a result, (1-\mathrm{SSIM}) can also be expressed in the quadratic form {}^{t}{\delta\bm{x}}\,{\bm{G}}_{\bm{x}}{\delta\bm{x}}, where {\bm{G}}_{\bm{x}} is a positive definite Hermitian matrix.
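The following short sketch (ours) constructs the per-window matrices of Eqs. 112-114; here the N in those equations is read as the number of pixels d in the window, so that \mu_{\delta\bm{x}}^{2} and \sigma_{\delta\bm{x}}^{2} are exactly the quadratic forms with \bm{M} and \bm{V}:

import numpy as np

def ssim_metric_tensor(x_window):
    x = np.asarray(x_window, dtype=float).ravel()
    d = x.size                                # number of pixels in the window
    M = np.ones((d, d)) / d ** 2              # mu_{delta x}^2  = delta_x^T M delta_x
    V = np.eye(d) / d - M                     # sigma_{delta x}^2 = delta_x^T V delta_x
    return M / (2 * x.mean() ** 2) + V / (2 * x.var())   # G_{(h,v)x} of Eq. 114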

Log-likelihood of Gaussian distribution:
Gaussian distribution is described as:

pθ(𝒙|𝒛)=i=1m12πσ2e(xixi^)2/2σ2=i=1m12πσ2eδxi2/2σ2,\displaystyle p_{\theta}(\bm{x}|\bm{z})=\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-(x_{i}-\hat{x_{i}})^{2}/2\sigma^{2}}=\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-{\delta x_{i}}^{2}/2\sigma^{2}}, (115)

where σ2\sigma^{2} is a variance as a hyper parameter. Then, the log-likelihood of the Gaussian distribution is denoted as:

logpθ(𝒙|𝒛)=logi=1m12πσ2eδxi2/2σ2=12σ2i=1mδxi2+m2log(2πσ2).\displaystyle-\log p_{\theta}(\bm{x}|\bm{z})=-\log\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\delta x_{i}^{2}/2\sigma^{2}}=\frac{1}{2\sigma^{2}}\sum_{i=1}^{m}{\delta x_{i}^{2}}+\frac{m}{2}\log(2\pi\sigma^{2}). (116)

Since the first term is (1/2{\sigma}^{2})\,{}^{t}\delta{\bm{x}}\,\bm{I}_{m}\,\delta{\bm{x}}, \bm{G}_{\bm{x}}=(1/2{\sigma}^{2})\,\bm{I}_{m} holds. C_{\bm{x}} is the second term of the last equation in Eq. 116.

Appendix H Detail of the Experiment in Section 5.3

In this section, we provide further details of the experiment in Section 5.3.

H.1 Datasets

We describe the details of the following four public datasets:

KDDCUP99 (Dua & Graff, 2019) The KDDCUP99 10 percent dataset from the UCI repository is a dataset for cyber-attack detection. This dataset includes 494,021 instances. Each instance contains 34 continuous and 7 categorical features. We use a one-hot representation to encode the categorical features, and finally obtain a dataset with features of 121 dimensions. Only 20% of the instances are labeled "normal" and the rest are labeled "attack". Therefore, "normal" instances are used as anomalies, because they are in the minority group.

Thyroid (Dua & Graff, 2019) This dataset consists of 3,772 data samples with 6-dimensional features from patients. Each instance belongs to one of three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. We regard the hyperfunction class (2.5%) as an anomaly and the other two classes as normal.

Arrhythmia (Dua & Graff, 2019) This is a dataset for detecting cardiac arrhythmia, containing 452 instances with 274-dimensional features. We treat the minor classes (3, 4, 5, 7, 8, 9, 14, and 15, accounting for 15% of the total) as anomalies. The rest of the classes are treated as normal.

KDDCUP-Rev (Dua & Graff, 2019) This is a revised version of KDDCUP99. To treat "normal" instances as the majority in the KDDCUP dataset, we keep all "normal" instances and randomly pick "attack" instances so that the ratio of "normal" to "attack" becomes 8:2. The number of instances is 121,597 in the end.

Data is max-min normalized per dimension over the entire dataset, which is the same setting as in previous studies.
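For illustration, the preprocessing described above (one-hot encoding of categorical features and per-dimension max-min normalization over the entire dataset) could be sketched as follows; the column names are hypothetical placeholders rather than the actual KDDCUP99 schema.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: `cont_cols` are continuous features, `cat_cols` are categorical ones.
df = pd.DataFrame({
    "duration": [0.0, 12.0, 3.0],       # continuous (placeholder)
    "protocol": ["tcp", "udp", "tcp"],  # categorical (placeholder)
})
cont_cols, cat_cols = ["duration"], ["protocol"]

# One-hot encode the categorical features (for KDDCUP99 this yields 121 dimensions in total).
features = pd.concat([df[cont_cols], pd.get_dummies(df[cat_cols])], axis=1)
X = features.to_numpy(dtype=np.float64)

# Max-min normalization per dimension over the entire dataset.
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - x_min) / np.maximum(x_max - x_min, 1e-12)  # guard against constant columns
```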

H.2 Network architecture, Hyperparameters, and Training Details

The VAE in this experiment consists of FC layers. Except for the last layer of the encoder, Leaky ReLU (for KDDCup99, Thyroid, and Arrhythmia) or tanh (for KDDCup-rev) is attached as the activation function.

In this experiment, VAE is constructed with the decomposed form of the loss to promote isometricity, as explained in Remark 1. Here, the decomposed loss function L_{\bm{x}} is set to \lambda_{1}D(\bm{x},\breve{\bm{x}})+\lambda_{2}D(\breve{\bm{x}},\hat{\bm{x}})+D_{\mathrm{KL}}(\cdot); \lambda_{1} and \lambda_{2}, where \lambda_{2} corresponds to \beta^{-1}, are adjusted independently for the reconstruction loss and the transform loss. For the transform loss, we tested both the L2 norm and the SSE loss and chose the better one for each dataset. The reason for introducing the L2 loss for the transform loss is as follows. Reducing the transform loss promotes isometricity, as explained in Remark 1 of Section 4.2. Since the derivative of the L2 norm is steeper than that of the SSE used in the coding loss, using the L2 norm for the transform loss explicitly reduces its value and thereby promotes isometricity.
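A minimal sketch of this decomposed objective is given below, assuming (our reading, not the authors' code) that \breve{\bm{x}} is the decoder output for the noise-free latent and \hat{\bm{x}} the output for the reparameterized latent; the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def decomposed_vae_loss(x, encoder, decoder, lam1, lam2, transform_loss="l2"):
    # Sketch of lam1 * D(x, x_breve) + lam2 * D(x_breve, x_hat) + D_KL.
    # Assumption (not from the paper's code): x_breve = decoder(mu) is the
    # reconstruction from the noise-free latent, x_hat = decoder(z) from the
    # reparameterized latent; encoder(x) returns (mu, logvar).
    mu, logvar = encoder(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
    x_breve, x_hat = decoder(mu), decoder(z)

    # Reconstruction loss D(x, x_breve), here taken as SSE.
    recon = F.mse_loss(x_breve, x, reduction="sum")

    # Transform loss D(x_breve, x_hat): L2 norm (steeper gradient near zero) or SSE.
    if transform_loss == "l2":
        trans = torch.norm(x_hat - x_breve, dim=-1).sum()
    else:
        trans = F.mse_loss(x_hat, x_breve, reduction="sum")

    # KL divergence between the Gaussian posterior and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return lam1 * recon + lam2 * trans + kl
```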

Hyperparameters are described in Table 5. The Autoencoder column lists the number of neurons in each layer. For Thyroid, we also tested the size of (30, 24, 6, 24, 30); for the other datasets, we tested sizes of (200, 100 or 50, 10 or 20 or 50, 100 or 50, 200). The Transform loss column gives the type of transform loss. (\lambda_{1}, \lambda_{2}) are determined experimentally; both are varied from 6000 to 30000 in intervals of 6000. For all datasets, optimization is done by the Adam optimizer with a learning rate of 1\times 10^{-4} and a batch size of 1024. The epoch numbers for the four datasets are 600, 40000, 30000, and 600, respectively. Test models are saved every 1/10 of the total epochs and early stopping is applied. For this experiment, we use a GeForce GTX 1080.

Table 5: Hyperparameters for RaDOGAGA
Dataset | Autoencoder | Transform loss | \lambda_{1} | \lambda_{2}
KDDCup99 | 200, 100, 10, 100, 200 | L2 | 30000 | 6000
Thyroid | 60, 30, 6, 30, 60 | L2 | 6000 | 18000
Arrhythmia | 200, 100, 50, 100, 200 | L2 | 6000 | 24000
KDDCup-rev | 200, 50, 20, 50, 200 | SSE | 30000 | 6000

H.3 Precision, Recall, and F1

Due to the page limitation, we reported only the F1 score in the main paper. Here, we provide the Precision and Recall scores as well in Table 6.

Table 6: Average and standard deviations (in brackets) of Precision, Recall, and F1. Scores for GMVAE are cited from Liao et al. (2018), and scores for DAGMM and RaDOGAGA are cited from Kato et al. (2020).
Dataset | Method | Precision | Recall | F1
KDDCup | GMVAE | 0.952 | 0.9141 | 0.9326
KDDCup | DAGMM | 0.9427 (0.0052) | 0.9575 (0.0053) | 0.9500 (0.0052)
KDDCup | RaDOGAGA(d) | 0.9550 (0.0037) | 0.9700 (0.0038) | 0.9624 (0.0038)
KDDCup | RaDOGAGA(log(d)) | 0.9563 (0.0042) | 0.9714 (0.0042) | 0.9638 (0.0042)
KDDCup | VAE | 0.9568 (0.0007) | 0.9718 (0.0007) | 0.9642 (0.0007)
Thyroid | GMVAE | 0.7105 | 0.5745 | 0.6353
Thyroid | DAGMM | 0.4656 (0.0481) | 0.4859 (0.0502) | 0.4755 (0.0491)
Thyroid | RaDOGAGA(d) | 0.6313 (0.0476) | 0.6587 (0.0496) | 0.6447 (0.0486)
Thyroid | RaDOGAGA(log(d)) | 0.6562 (0.0572) | 0.6848 (0.0597) | 0.6702 (0.0585)
Thyroid | VAE | 0.6458 (0.04270) | 0.6739 (0.04455) | 0.6596 (0.0436)
Arrhythmia | GMVAE | 0.4375 | 0.4242 | 0.4308
Arrhythmia | DAGMM | 0.4985 (0.0389) | 0.5136 (0.0401) | 0.5060 (0.0395)
Arrhythmia | RaDOGAGA(d) | 0.5353 (0.0461) | 0.5515 (0.0475) | 0.5433 (0.0468)
Arrhythmia | RaDOGAGA(log(d)) | 0.5294 (0.0405) | 0.5455 (0.0418) | 0.5373 (0.0411)
Arrhythmia | VAE | 0.4912 (0.0406) | 0.5061 (0.0419) | 0.4985 (0.0413)
KDDCup-rev | DAGMM | 0.9778 (0.0018) | 0.9779 (0.0017) | 0.9779 (0.0018)
KDDCup-rev | RaDOGAGA(d) | 0.9768 (0.0033) | 0.9827 (0.0012) | 0.9797 (0.0015)
KDDCup-rev | RaDOGAGA(log(d)) | 0.9864 (0.0009) | 0.9865 (0.0009) | 0.9865 (0.0009)
KDDCup-rev | VAE | 0.9880 (0.0008) | 0.9881 (0.0008) | 0.9880 (0.0008)
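For completeness, the three scores reported in Table 6 follow the standard binary-classification definitions; a minimal sketch (our illustration, with anomalies labeled 1) is:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    # Standard binary-classification metrics (1 = anomaly, 0 = normal).
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy usage with dummy labels and predictions.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])
print(precision_recall_f1(y_true, y_pred))  # (0.666..., 0.666..., 0.666...)
```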