
Factor Analysis, Probabilistic Principal Component Analysis,
Variational Inference, and Variational Autoencoder: Tutorial and Survey

Benyamin Ghojogh Department of Electrical and Computer Engineering,
Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada
   Ali Ghodsi Department of Statistics and Actuarial Science & David R. Cheriton School of Computer Science,
Data Analytics Laboratory, University of Waterloo, Waterloo, ON, Canada
   Fakhri Karray Department of Electrical and Computer Engineering,
Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada
   Mark Crowley Department of Electrical and Computer Engineering,
Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada
Abstract

This is a tutorial and survey paper on factor analysis, probabilistic Principal Component Analysis (PCA), variational inference, and Variational Autoencoder (VAE). These methods, which are tightly related, are dimensionality reduction and generative models. They assume that every data point is generated from, or caused by, a low-dimensional latent factor. By learning the parameters of the distribution of the latent space, the corresponding low-dimensional factors are found for the sake of dimensionality reduction. Because of their stochastic and generative behaviour, these models can also be used to generate new data points in the data space. In this paper, we start with variational inference, where we derive the Evidence Lower Bound (ELBO) and Expectation Maximization (EM) for learning the parameters. Then, we introduce factor analysis, derive its joint and marginal distributions, and work out its EM steps. Probabilistic PCA is then explained as a special case of factor analysis, and its closed-form solutions are derived. Finally, VAE is explained, where the encoder, decoder, and sampling from the latent space are introduced. Training VAE using both EM and backpropagation is explained.

To appear as a part of an upcoming textbook on dimensionality reduction and manifold learning.


1 Introduction

Learning models can be divided into discriminative and generative models (Ng & Jordan, 2002; Bouchard & Triggs, 2004). Discriminative models discriminate the classes of data for better separation of classes while the generative models learn a latent space which generates the data points. The methods introduced in this paper are generative models.

Variational inference is a technique that finds a lower bound on the log-likelihood of data and maximizes this lower bound rather than the log-likelihood itself in Maximum Likelihood Estimation (MLE). This lower bound is usually referred to as the Evidence Lower Bound (ELBO). Learning the parameters of the latent space can be done using Expectation Maximization (EM) (Bishop, 2006). Variational Autoencoder (VAE) (Kingma & Welling, 2014) implements variational inference in an autoencoder neural network setup, where the encoder and decoder model the E-step and M-step of EM, respectively. In practice, however, VAE is usually trained using backpropagation (Rezende et al., 2014; Hou et al., 2017). Variational inference and VAE have had many applications in Bayesian analysis; for example, see the application of variational inference in 3D human motion analysis (Sminchisescu & Jepson, 2004) and the application of VAE in forecasting (Walker et al., 2016).

Factor analysis assumes that every data point is generated from a latent factor/variable where some noise may have been added to data in the data space. Using the EM introduced in variational inference, the ELBO is maximized and the parameters of the latent space are learned iteratively. Probabilistic PCA (PPCA), as a special case of factor analysis, restricts the noise of dimensions to be uncorrelated and assumes the variance of noise to be equal in all dimensions. This restriction makes the solution of PPCA closed-form and simpler.

In this paper, we explain the theory and details of factor analysis, PPCA, variational inference, and VAE. The remainder of this paper is organized as follows. Section 2 introduces variational inference. We explain factor analysis and PPCA in Sections 3 and 4, respectively. VAE is explained in Section 5. Finally, Section 6 concludes the paper.

Required Background for the Reader

This paper assumes that the reader has general knowledge of calculus, probability, linear algebra, and basics of optimization.

2 Variational Inference

Consider a dataset {𝒙i}i=1n\{\boldsymbol{x}_{i}\}_{i=1}^{n}. Assume that every data point 𝒙id\boldsymbol{x}_{i}\in\mathbb{R}^{d} is generated from a latent variable 𝒛ip\boldsymbol{z}_{i}\in\mathbb{R}^{p}. This latent variable has a prior distribution (𝒛i)\mathbb{P}(\boldsymbol{z}_{i}). According to Bayes’ rule, we have:

(𝒛i|𝒙i)=(𝒙i|𝒛i)(𝒛i)(𝒙i).\displaystyle\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i})=\frac{\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i})\,\mathbb{P}(\boldsymbol{z}_{i})}{\mathbb{P}(\boldsymbol{x}_{i})}. (1)

Let (𝒛i)\mathbb{P}(\boldsymbol{z}_{i}) be an arbitrary distribution denoted by q(𝒛i)q(\boldsymbol{z}_{i}). Suppose the parameter of conditional distribution of 𝒛i\boldsymbol{z}_{i} on 𝒙i\boldsymbol{x}_{i} is denoted by 𝜽\boldsymbol{\theta}; hence, (𝒛i|𝒙i)=(𝒛i|𝒙i,𝜽)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i})=\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}). Therefore, we can say:

(𝒛i|𝒙i,𝜽)=(𝒙i|𝒛i,𝜽)(𝒛i|𝜽)(𝒙i|𝜽).\displaystyle\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta})=\frac{\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\theta})\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})}{\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta})}. (2)

2.1 Evidence Lower Bound (ELBO)

Consider the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951) between the prior probability of the latent variable and the posterior of the latent variable:

KL(q(𝒛i)(𝒛i|𝒙i,𝜽))\displaystyle\text{KL}\big{(}q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta})\big{)}
=(a)q(𝒛i)log(q(𝒛i)(𝒛i|𝒙i,𝜽))𝑑𝒛i\displaystyle\overset{(a)}{=}\int q(\boldsymbol{z}_{i})\log\big{(}\frac{q(\boldsymbol{z}_{i})}{\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta})}\big{)}d\boldsymbol{z}_{i}
=q(𝒛i)(log(q(𝒛i))log((𝒛i|𝒙i,𝜽)))𝑑𝒛i\displaystyle=\int q(\boldsymbol{z}_{i})\big{(}\log(q(\boldsymbol{z}_{i}))-\log(\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}))\big{)}d\boldsymbol{z}_{i}
=(2)q(𝒛i)(log(q(𝒛i))log((𝒙i|𝒛i,𝜽))\displaystyle\overset{(\ref{equation_inference_Bayes2})}{=}\int q(\boldsymbol{z}_{i})\big{(}\log(q(\boldsymbol{z}_{i}))-\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\theta}))
log((𝒛i|𝜽))+log((𝒙i|𝜽)))d𝒛i\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\log(\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}))+\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}))\big{)}\,d\boldsymbol{z}_{i}
=(b)log((𝒙i|𝜽))+q(𝒛i)(log(q(𝒛i))\displaystyle\overset{(b)}{=}\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}))+\int q(\boldsymbol{z}_{i})\big{(}\log(q(\boldsymbol{z}_{i}))
log((𝒙i|𝒛i,𝜽))log((𝒛i|𝜽)))d𝒛i\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\theta}))-\log(\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}))\big{)}\,d\boldsymbol{z}_{i}
=log((𝒙i|𝜽))\displaystyle=\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}))
+q(𝒛i)log(q(𝒛i)(𝒙i|𝒛i,𝜽)(𝒛i|𝜽))𝑑𝒛i\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\int q(\boldsymbol{z}_{i})\log(\frac{q(\boldsymbol{z}_{i})}{\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\theta})\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})})\,d\boldsymbol{z}_{i}
=log((𝒙i|𝜽))+q(𝒛i)log(q(𝒛i)(𝒙i,𝒛i|𝜽))𝑑𝒛i\displaystyle=\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}))+\int q(\boldsymbol{z}_{i})\log(\frac{q(\boldsymbol{z}_{i})}{\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})})\,d\boldsymbol{z}_{i}
=log((𝒙i|𝜽))+KL(q(𝒛i)(𝒙i,𝒛i|𝜽)),\displaystyle=\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}))+\text{KL}\big{(}q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})\big{)},

where $(a)$ follows from the definition of KL divergence, and $(b)$ is because $\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}))$ is independent of $\boldsymbol{z}_{i}$, so it comes out of the integral and $\int q(\boldsymbol{z}_{i})\,d\boldsymbol{z}_{i}=1$. Hence:

log((𝒙i|𝜽))=\displaystyle\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}))= KL(q(𝒛i)(𝒛i|𝒙i,𝜽))\displaystyle\,\text{KL}\big{(}q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta})\big{)} (3)
KL(q(𝒛i)(𝒙i,𝒛i|𝜽)).\displaystyle-\text{KL}\big{(}q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})\big{)}.

We define the Evidence Lower Bound (ELBO) as:

(q,𝜽):=KL(q(𝒛i)(𝒙i,𝒛i|𝜽)).\displaystyle\mathcal{L}(q,\boldsymbol{\theta}):=-\text{KL}\big{(}q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})\big{)}. (4)

So:

log((𝒙i|𝜽))=KL(q(𝒛i)(𝒛i|𝒙i,𝜽))+(q,𝜽).\displaystyle\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}))=\text{KL}\big{(}q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta})\big{)}+\mathcal{L}(q,\boldsymbol{\theta}).

Therefore:

(q,𝜽)=log((𝒙i|𝜽))KL(q(𝒛i)(𝒛i|𝒙i,𝜽))0.\displaystyle\mathcal{L}(q,\boldsymbol{\theta})=\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}))-\underbrace{\text{KL}\big{(}q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta})\big{)}}_{\geq 0}. (5)

As the KL term is non-negative, subtracting it shows that the ELBO is a lower bound on the log-likelihood of data:

(q,𝜽)log((𝒙i|𝜽)).\displaystyle\mathcal{L}(q,\boldsymbol{\theta})\leq\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta})). (6)

The likelihood (𝒙i|𝜽)\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}) is also referred to as the evidence. Note that this lower bound gets tight when:

(q,𝜽)log((𝒙i|𝜽))\displaystyle\mathcal{L}(q,\boldsymbol{\theta})\approx\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}))
0KL(q(𝒛i)(𝒛i|𝒙i,𝜽))=set0\displaystyle~{}~{}~{}~{}~{}~{}~{}\implies 0\leq\text{KL}\big{(}q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta})\big{)}\overset{\text{set}}{=}0
q(𝒛i)=(𝒛i|𝒙i,𝜽).\displaystyle~{}~{}~{}~{}~{}~{}~{}\implies q(\boldsymbol{z}_{i})=\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}). (7)

This lower bound is depicted in Fig. 1.

Figure 1: Depiction of ELBO as the lower bound on log likelihood. The image is inspired by (Bishop, 2006).
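
As a quick numerical illustration of Eqs. (5)-(7), the following sketch uses an assumed toy conjugate model (not from the paper), $z\sim\mathcal{N}(0,1)$ and $x\,|\,z\sim\mathcal{N}(z,1)$, for which the evidence and the exact posterior are known in closed form. It shows that any $q$ gives a lower bound on the log evidence and that the bound is tight when $q$ equals the posterior.

```python
# Toy check of log P(x) = ELBO(q) + KL(q || P(z|x)); the model is an assumption
# chosen only because its evidence x ~ N(0,2) and posterior z|x ~ N(x/2, 1/2)
# are available in closed form.
import numpy as np
from scipy.stats import norm

x = 1.3                                   # an observed data point
zs = np.linspace(-10.0, 10.0, 20001)      # quadrature grid over the latent z
dz = zs[1] - zs[0]

def elbo(q_mean, q_std):
    """ELBO(q) = E_q[log P(x,z)] - E_q[log q(z)], computed by simple quadrature."""
    q = norm.pdf(zs, loc=q_mean, scale=q_std)
    log_joint = norm.logpdf(x, loc=zs, scale=1.0) + norm.logpdf(zs, loc=0.0, scale=1.0)
    log_q = norm.logpdf(zs, loc=q_mean, scale=q_std)
    return np.sum(q * (log_joint - log_q)) * dz

log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

print(elbo(0.0, 1.0), "<=", log_evidence)                 # strict lower bound
print(elbo(x / 2.0, np.sqrt(0.5)), "~=", log_evidence)    # tight for the exact posterior
```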

2.2 Expectation Maximization

2.2.1 Background on Expectation Maximization

This part is taken from our previous tutorial paper (Ghojogh et al., 2019a). Sometimes, the data are not fully observable. For example, it may only be known whether a datum is zero or greater than zero. In this case, Maximum Likelihood Estimation (MLE) cannot be directly applied because we do not have access to complete information and some data are missing. In this case, Expectation Maximization (EM) is useful. The main idea of EM can be summarized in this short friendly conversation:

– What shall we do? Some data are missing! The log-likelihood is not known completely so MLE cannot be used.
– Hmm, probably we can replace the missing data with something…
– Aha! Let us replace it with its mean.
– You are right! We can take the mean of log-likelihood over the possible values of the missing data. Then everything in the log-likelihood will be known, and then…
– And then we can do MLE!

EM consists of two steps which are the E-step and the M-step. In the E-step, the expectation of log-likelihood with respect to the missing data is calculated, in order to have a mean estimation of it. In the M-step, the MLE approach is used where the log-likelihood is replaced with its expectation. These two steps are iteratively repeated until convergence of the estimated parameters.

2.2.2 Expectation Maximization in Variational Inference

According to MLE, we want to maximize the log-likelihood of data. According to Eq. (6), maximizing the ELBO will also maximize the log-likelihood. Eq. (6) holds for any distribution $q$; we want to find the distribution that maximizes the lower bound. Hence, EM for variational inference is performed iteratively as:

E-step:q(t):=argmaxq(q,𝜽(t1)),\displaystyle\text{E-step:}~{}~{}~{}~{}~{}q^{(t)}:=\arg\max_{q}~{}~{}~{}~{}\mathcal{L}(q,\boldsymbol{\theta}^{(t-1)}), (8)
M-step:𝜽(t):=argmax𝜽(q(t),𝜽),\displaystyle\text{M-step:}~{}~{}~{}~{}~{}\boldsymbol{\theta}^{(t)}:=\arg\max_{\boldsymbol{\theta}}~{}~{}~{}~{}\mathcal{L}(q^{(t)},\boldsymbol{\theta}), (9)

where tt denotes the iteration index.

E-step in EM for Variational Inference: The E-step is:

\displaystyle\max_{q}\mathcal{L}(q,\boldsymbol{\theta}^{(t-1)})\overset{(\ref{ELBO_equation2})}{=}\max_{q}\Big(\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}^{(t-1)}))-\text{KL}\big(q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}^{(t-1)})\big)\Big)
\displaystyle=\log(\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\theta}^{(t-1)}))-\min_{q}\text{KL}\big(q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}^{(t-1)})\big),

where the first term does not depend on $q$.

The second term is always non-negative; hence, its minimum is zero:

KL(q(𝒛i)(𝒛i|𝒙i,𝜽(t1)))=set0\displaystyle\text{KL}\big{(}q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}^{(t-1)})\big{)}\overset{\text{set}}{=}0
q(𝒛i)=(𝒛i|𝒙i,𝜽(t1)),\displaystyle\implies q(\boldsymbol{z}_{i})=\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}^{(t-1)}),

which was already found in Eq. (7). Thus, the E-step assigns:

q(t)(𝒛i)(𝒛i|𝒙i,𝜽(t1)).\displaystyle q^{(t)}(\boldsymbol{z}_{i})\leftarrow\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}^{(t-1)}). (10)

In other words, in Fig. 1, it pushes the middle line toward the above line by maximizing the ELBO.

M-step in EM for Variational Inference: The M-step is:

max𝜽(q(t),𝜽)=(4)max𝜽(KL(q(t)(𝒛i)(𝒙i,𝒛i|𝜽)))\displaystyle\max_{\boldsymbol{\theta}}\mathcal{L}(q^{(t)},\boldsymbol{\theta})\overset{(\ref{ELBO_equation1})}{=}\max_{\boldsymbol{\theta}}\big{(}\!-\text{KL}\big{(}q^{(t)}(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})\big{)}\big{)}
=(a)max𝜽[q(t)(𝒛i)log(q(t)(𝒛i)(𝒙i,𝒛i|𝜽))𝑑𝒛i]\displaystyle\overset{(a)}{=}\max_{\boldsymbol{\theta}}\Big{[}-\int q^{(t)}(\boldsymbol{z}_{i})\log(\frac{q^{(t)}(\boldsymbol{z}_{i})}{\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})})\,d\boldsymbol{z}_{i}\Big{]}
=max𝜽q(t)(𝒛i)log((𝒙i,𝒛i|𝜽))𝑑𝒛i\displaystyle=\max_{\boldsymbol{\theta}}\int q^{(t)}(\boldsymbol{z}_{i})\log(\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}))\,d\boldsymbol{z}_{i}
max𝜽q(t)(𝒛i)log(q(t)(𝒛i))𝑑𝒛i,\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\max_{\boldsymbol{\theta}}\int q^{(t)}(\boldsymbol{z}_{i})\log(q^{(t)}(\boldsymbol{z}_{i}))\,d\boldsymbol{z}_{i},

where (a)(a) is for definition of KL divergence. The second term is constant w.r.t. 𝜽\boldsymbol{\theta}. Hence:

max𝜽(q(t),𝜽)=max𝜽q(t)(𝒛i)log((𝒙i,𝒛i|𝜽))𝑑𝒛i\displaystyle\max_{\boldsymbol{\theta}}\mathcal{L}(q^{(t)},\boldsymbol{\theta})=\max_{\boldsymbol{\theta}}\int q^{(t)}(\boldsymbol{z}_{i})\log(\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}))\,d\boldsymbol{z}_{i}
=(a)max𝜽𝔼q(t)(𝒛i)[log(𝒙i,𝒛i|𝜽)],\displaystyle\overset{(a)}{=}\max_{\boldsymbol{\theta}}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\big{[}\log\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})\big{]},

where (a)(a) is because of definition of expectation. Thus, the M-step assigns:

𝜽(t)argmax𝜽𝔼q(t)(𝒛i)[log(𝒙i,𝒛i|𝜽)].\displaystyle\boldsymbol{\theta}^{(t)}\leftarrow\arg\max_{\boldsymbol{\theta}}~{}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\big{[}\log\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})\big{]}. (11)

In other words, in Fig. 1, it pushes the above line higher. The E-step and M-step together play a game: the E-step tries to raise the middle line (the ELBO) up to the log-likelihood, while the M-step tries to push the above line (the log-likelihood) higher. This procedure is repeated so that the two steps help each other reach higher values.

To summarize, the EM in variational inference is:

q(t)(𝒛i)(𝒛i|𝒙i,𝜽(t1)),\displaystyle q^{(t)}(\boldsymbol{z}_{i})\leftarrow\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}^{(t-1)}), (12)
𝜽(t)argmax𝜽𝔼q(t)(𝒛i)[log(𝒙i,𝒛i|𝜽)].\displaystyle\boldsymbol{\theta}^{(t)}\leftarrow\arg\max_{\boldsymbol{\theta}}~{}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\big{[}\log\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})\big{]}. (13)

It is noteworthy that, in variational inference, sometimes, the parameter 𝜽\boldsymbol{\theta} is absorbed into the latent variable 𝒛i\boldsymbol{z}_{i}. According to the chain rule, we have:

(𝒙i,𝒛i,𝜽)=(𝒙i|𝒛i,𝜽)(𝒛i|𝜽)(𝜽).\displaystyle\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i},\boldsymbol{\theta})=\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\theta})\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})\,\mathbb{P}(\boldsymbol{\theta}).

Considering the term (𝒛i|𝜽)(𝜽)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta})\,\mathbb{P}(\boldsymbol{\theta}) as one probability term, we have:

(𝒙i,𝒛i)=(𝒙i|𝒛i)(𝒛i),\displaystyle\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i})=\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i})\,\mathbb{P}(\boldsymbol{z}_{i}),

where the parameter 𝜽\boldsymbol{\theta} disappears because of absorption.
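
To make Eqs. (12) and (13) concrete, the following minimal sketch runs EM on an assumed toy model (not from the paper): $z_i\sim\mathcal{N}(\theta,1)$ and $x_i\,|\,z_i\sim\mathcal{N}(z_i,1)$. The E-step posterior is $z_i\,|\,x_i,\theta\sim\mathcal{N}((x_i+\theta)/2,\,1/2)$, and the M-step maximizer of the expected joint log-likelihood is the average of the posterior means; the iterations converge to the MLE of the marginal, which is the sample mean.

```python
# A tiny concrete instance of the EM updates in Eqs. (12)-(13) (illustrative assumption).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=np.sqrt(2.0), size=1000)   # marginal of x is N(theta, 2)

theta = 0.0                                  # initial guess of the parameter
for t in range(50):
    Ez = (x + theta) / 2.0                   # E-step: posterior means of z_i, Eq. (12)
    theta = Ez.mean()                        # M-step: argmax of E_q[log P(x_i, z_i | theta)], Eq. (13)

print(theta, x.mean())                       # EM converges to the MLE of the marginal
```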

3 Factor Analysis

3.1 Background on Marginal Multivariate Gaussian Distribution

Consider two random variables 𝒙id\boldsymbol{x}_{i}\in\mathbb{R}^{d} and 𝒛ip\boldsymbol{z}_{i}\in\mathbb{R}^{p} and let 𝒚i:=[𝒙i,𝒛i]d+p\boldsymbol{y}_{i}:=[\boldsymbol{x}_{i}^{\top},\boldsymbol{z}_{i}^{\top}]^{\top}\in\mathbb{R}^{d+p}. Assume that 𝒙i\boldsymbol{x}_{i} and 𝒛i\boldsymbol{z}_{i} are jointly multivariate Gaussian; hence, the variable 𝒚i\boldsymbol{y}_{i} has a multivariate Gaussian distribution, i.e., 𝒚i𝒩(𝝁y,𝚺y)\boldsymbol{y}_{i}\sim\mathcal{N}(\boldsymbol{\mu}_{y},\boldsymbol{\Sigma}_{y}). The mean and covariance can be decomposed as:

𝝁y=[𝝁,𝝁0]d+p,\displaystyle\boldsymbol{\mu}_{y}=[\boldsymbol{\mu}^{\top},\boldsymbol{\mu}_{0}^{\top}]^{\top}\in\mathbb{R}^{d+p}, (14)
𝚺y=[𝚺11𝚺12𝚺21𝚺22](d+p)×(d+p),\displaystyle\boldsymbol{\Sigma}_{y}=\begin{bmatrix}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21}&\boldsymbol{\Sigma}_{22}\end{bmatrix}\in\mathbb{R}^{(d+p)\times(d+p)}, (15)

where 𝝁d\boldsymbol{\mu}\in\mathbb{R}^{d}, 𝝁0p\boldsymbol{\mu}_{0}\in\mathbb{R}^{p}, 𝚺11d×d\boldsymbol{\Sigma}_{11}\in\mathbb{R}^{d\times d}, 𝚺22p×p\boldsymbol{\Sigma}_{22}\in\mathbb{R}^{p\times p}, 𝚺12d×p\boldsymbol{\Sigma}_{12}\in\mathbb{R}^{d\times p}, and 𝚺21=𝚺12p×d\boldsymbol{\Sigma}_{21}=\boldsymbol{\Sigma}_{12}^{\top}\in\mathbb{R}^{p\times d}.

It can be shown that the marginal distributions for 𝒙i\boldsymbol{x}_{i} and 𝒛i\boldsymbol{z}_{i} are Gaussian distributions where 𝔼[𝒙i]=𝝁\mathbb{E}[\boldsymbol{x}_{i}]=\boldsymbol{\mu} and 𝔼[𝒛i]=𝝁0\mathbb{E}[\boldsymbol{z}_{i}]=\boldsymbol{\mu}_{0} (Ng, 2018). The covariance matrix of the joint distribution can be simplified as (Ng, 2018):

𝚺=𝔼[(𝒚i𝝁y)(𝒚i𝝁y)]\displaystyle\boldsymbol{\Sigma}=\mathbb{E}[(\boldsymbol{y}_{i}-\boldsymbol{\mu}_{y})(\boldsymbol{y}_{i}-\boldsymbol{\mu}_{y})^{\top}]
=𝔼[[𝒙i𝝁𝒛i𝝁0][𝒙i𝝁𝒛i𝝁0]]\displaystyle=\mathbb{E}\Bigg{[}\begin{bmatrix}\boldsymbol{x}_{i}-\boldsymbol{\mu}\\ \boldsymbol{z}_{i}-\boldsymbol{\mu}_{0}\end{bmatrix}\begin{bmatrix}\boldsymbol{x}_{i}-\boldsymbol{\mu}\\ \boldsymbol{z}_{i}-\boldsymbol{\mu}_{0}\end{bmatrix}^{\top}\Bigg{]}
=𝔼[[(𝒙i𝝁)(𝒙i𝝁),(𝒙i𝝁)(𝒛i𝝁0)(𝒛i𝝁0)(𝒙i𝝁),(𝒛i𝝁0)(𝒛i𝝁0)]].\displaystyle=\mathbb{E}\Bigg{[}\begin{bmatrix}(\boldsymbol{x}_{i}-\boldsymbol{\mu})(\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top},(\boldsymbol{x}_{i}-\boldsymbol{\mu})(\boldsymbol{z}_{i}-\boldsymbol{\mu}_{0})^{\top}\\ (\boldsymbol{z}_{i}-\boldsymbol{\mu}_{0})(\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top},(\boldsymbol{z}_{i}-\boldsymbol{\mu}_{0})(\boldsymbol{z}_{i}-\boldsymbol{\mu}_{0})^{\top}\end{bmatrix}\Bigg{]}. (16)

This shows that the marginal distributions are:

𝒙i𝒩(𝝁,𝚺11),\displaystyle\boldsymbol{x}_{i}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma}_{11}), (17)
𝒛i𝒩(𝝁0,𝚺22).\displaystyle\boldsymbol{z}_{i}\sim\mathcal{N}(\boldsymbol{\mu}_{0},\boldsymbol{\Sigma}_{22}). (18)

According to the definition of the multivariate Gaussian distribution, the conditional distribution is also a Gaussian distribution, i.e., 𝒙i|𝒛i𝒩(𝝁x|z,𝚺x|z)\boldsymbol{x}_{i}|\boldsymbol{z}_{i}\sim\mathcal{N}(\boldsymbol{\mu}_{x|z},\boldsymbol{\Sigma}_{x|z}) where (Ng, 2018):

d𝝁x|z:=𝝁+𝚺12𝚺221(𝒛i𝝁0),\displaystyle\mathbb{R}^{d}\ni\boldsymbol{\mu}_{x|z}:=\boldsymbol{\mu}+\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\boldsymbol{z}_{i}-\boldsymbol{\mu}_{0}), (19)
d×d𝚺x|z:=𝚺11𝚺12𝚺221𝚺21,\displaystyle\mathbb{R}^{d\times d}\ni\boldsymbol{\Sigma}_{x|z}:=\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}, (20)

and likewise for 𝒛i|𝒙i𝒩(𝝁z|x,𝚺z|x)\boldsymbol{z}_{i}|\boldsymbol{x}_{i}\sim\mathcal{N}(\boldsymbol{\mu}_{z|x},\boldsymbol{\Sigma}_{z|x}):

p𝝁z|x:=𝝁0+𝚺21𝚺111(𝒙i𝝁),\displaystyle\mathbb{R}^{p}\ni\boldsymbol{\mu}_{z|x}:=\boldsymbol{\mu}_{0}+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu}), (21)
p×p𝚺z|x:=𝚺22𝚺21𝚺111𝚺12.\displaystyle\mathbb{R}^{p\times p}\ni\boldsymbol{\Sigma}_{z|x}:=\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}. (22)
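
A small sketch of the conditioning formulas in Eqs. (21)-(22) is given below; the joint covariance, means, and query point are arbitrary assumptions used only to show the block computations of Eq. (15).

```python
# Conditioning a jointly Gaussian (x, z) on x, following Eqs. (21)-(22).
import numpy as np

rng = np.random.default_rng(1)
d, p = 3, 2
mu, mu0 = np.array([1.0, -1.0, 0.5]), np.array([0.0, 2.0])

# Build a valid joint covariance and split it into the blocks of Eq. (15).
A = rng.normal(size=(d + p, d + p))
Sigma_y = A @ A.T + 0.5 * np.eye(d + p)
S11, S12 = Sigma_y[:d, :d], Sigma_y[:d, d:]
S21, S22 = Sigma_y[d:, :d], Sigma_y[d:, d:]

x_i = np.array([0.2, 0.1, -0.3])             # a query point in the data space

mu_z_given_x = mu0 + S21 @ np.linalg.solve(S11, x_i - mu)        # Eq. (21)
Sigma_z_given_x = S22 - S21 @ np.linalg.solve(S11, S12)          # Eq. (22)
print(mu_z_given_x)
print(Sigma_z_given_x)
```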

3.2 Main Idea of Factor Analysis

Factor analysis (Fruchter, 1954; Cattell, 1965; Harman, 1976; Child, 1990) is one of the simplest and most fundamental generative models. Although its theoretical derivation is somewhat involved, its main idea is very simple. Factor analysis assumes that every data point $\boldsymbol{x}_{i}\in\mathbb{R}^{d}$ is generated from a latent variable $\boldsymbol{z}_{i}\in\mathbb{R}^{p}$. The latent variable is also referred to as the latent factor; hence, the name factor analysis comes from the fact that the method analyzes these latent factors.

In factor analysis, we assume that the data point $\boldsymbol{x}_{i}$ is obtained through the following steps: (1) linear projection of the $p$-dimensional $\boldsymbol{z}_{i}$ onto a $d$-dimensional space by the projection matrix $\boldsymbol{\Lambda}\in\mathbb{R}^{d\times p}$, then (2) applying some linear translation, and finally (3) adding a Gaussian noise $\boldsymbol{\epsilon}\in\mathbb{R}^{d}$ with covariance matrix $\boldsymbol{\Psi}\in\mathbb{R}^{d\times d}$. Note that, as the noises in different dimensions are independent, the covariance matrix $\boldsymbol{\Psi}$ is diagonal. Factor analysis can be illustrated as a graphical model (Ghahramani & Hinton, 1996) where the visible data variable is conditioned on the latent variable and the noise random variable. Figure 2 shows this graphical model.

Figure 2: The graphical model for factor analysis. The image is inspired by (Ghahramani & Hinton, 1996).

3.3 The Factor Analysis Model

For simplicity, the prior distribution of the latent variable can be assumed to be a multivariate Gaussian distribution:

(𝒛i)=𝒩(𝒛i|𝝁0,𝚺0)\displaystyle\mathbb{P}(\boldsymbol{z}_{i})=\mathcal{N}(\boldsymbol{z}_{i}\,|\,\boldsymbol{\mu}_{0},\boldsymbol{\Sigma}_{0})
=1(2π)p|𝚺0|exp((𝒛i𝝁0)𝚺01(𝒛i𝝁0)2),\displaystyle=\frac{1}{\sqrt{(2\pi)^{p}|\boldsymbol{\Sigma}_{0}|}}\exp\Big{(}\!\!-\frac{(\boldsymbol{z}_{i}-\boldsymbol{\mu}_{0})^{\top}\boldsymbol{\Sigma}_{0}^{-1}(\boldsymbol{z}_{i}-\boldsymbol{\mu}_{0})}{2}\Big{)}, (23)

where $\boldsymbol{\mu}_{0}\in\mathbb{R}^{p}$ and $\boldsymbol{\Sigma}_{0}\in\mathbb{R}^{p\times p}$ are the mean and the covariance matrix of $\boldsymbol{z}_{i}$, and $|\cdot|$ denotes the determinant of a matrix. As was explained in Section 3.2, $\boldsymbol{x}_{i}$ is obtained through (1) the linear projection of $\boldsymbol{z}_{i}$ by $\boldsymbol{\Lambda}\in\mathbb{R}^{d\times p}$, (2) applying some linear translation, and (3) adding a Gaussian noise $\boldsymbol{\epsilon}\in\mathbb{R}^{d}$ with covariance $\boldsymbol{\Psi}\in\mathbb{R}^{d\times d}$. Hence, the data point $\boldsymbol{x}_{i}$ has a conditional multivariate Gaussian distribution given the latent variable; its conditional likelihood is:

(𝒙i|𝒛i)=(𝒙i|𝒛i,𝚲,𝝁,𝚿)=𝒩(𝚲𝒛i+𝝁,𝚿),\displaystyle\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i})=\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\Lambda},\boldsymbol{\mu},\boldsymbol{\Psi})=\mathcal{N}(\boldsymbol{\Lambda}\boldsymbol{z}_{i}+\boldsymbol{\mu},\boldsymbol{\Psi}), (24)

where 𝝁\boldsymbol{\mu}, which is the translation vector, is the mean of data {𝒙i}i=1n\{\boldsymbol{x}_{i}\}_{i=1}^{n}:

d𝝁:=1ni=1n𝒙i.\displaystyle\mathbb{R}^{d}\ni\boldsymbol{\mu}:=\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{x}_{i}. (25)

The marginal distribution of 𝒙i\boldsymbol{x}_{i} is:

(𝒙i)=(𝒙i|𝒛i)(𝒛i)𝑑𝒛i\displaystyle\mathbb{P}(\boldsymbol{x}_{i})=\int\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i})\,\mathbb{P}(\boldsymbol{z}_{i})\,d\boldsymbol{z}_{i}\implies
(𝒙i|𝚲,𝝁,𝚿)\displaystyle\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\Lambda},\boldsymbol{\mu},\boldsymbol{\Psi})
=(𝒙i|𝒛i,𝚲,𝝁,𝚿)(𝒛i|𝝁0,𝚺0)𝑑𝒛i\displaystyle~{}~{}~{}~{}~{}=\int\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\Lambda},\boldsymbol{\mu},\boldsymbol{\Psi})\,\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{\mu}_{0},\boldsymbol{\Sigma}_{0})\,d\boldsymbol{z}_{i}
=(a)𝒩(𝚲𝝁0+𝝁,𝚿+𝚲𝚺0𝚲)\displaystyle~{}~{}~{}~{}~{}\overset{(a)}{=}\mathcal{N}(\boldsymbol{\Lambda}\boldsymbol{\mu}_{0}+\boldsymbol{\mu},\boldsymbol{\Psi}+\boldsymbol{\Lambda}\boldsymbol{\Sigma}_{0}\boldsymbol{\Lambda}^{\top}) (26)
=𝒩(𝝁^,𝚿+𝚲^𝚲^),\displaystyle~{}~{}~{}~{}~{}=\mathcal{N}(\widehat{\boldsymbol{\mu}},\boldsymbol{\Psi}+\widehat{\boldsymbol{\Lambda}}\widehat{\boldsymbol{\Lambda}}^{\top}), (27)

where $\mathbb{R}^{d}\ni\widehat{\boldsymbol{\mu}}:=\boldsymbol{\Lambda}\boldsymbol{\mu}_{0}+\boldsymbol{\mu}$, $\mathbb{R}^{d\times p}\ni\widehat{\boldsymbol{\Lambda}}:=\boldsymbol{\Lambda}\boldsymbol{\Sigma}_{0}^{1/2}$, and $(a)$ holds because a linear transformation of a Gaussian variable is Gaussian: the mean is transformed linearly, the covariance quadratically, and the independent noise adds its covariance $\boldsymbol{\Psi}$.

As the mean $\widehat{\boldsymbol{\mu}}$ and the covariance $\boldsymbol{\Psi}+\widehat{\boldsymbol{\Lambda}}\widehat{\boldsymbol{\Lambda}}^{\top}$ are the quantities to be learned, we can absorb $\boldsymbol{\mu}_{0}$ and $\boldsymbol{\Sigma}_{0}$ into $\boldsymbol{\mu}$ and $\boldsymbol{\Lambda}$ and assume, without loss of generality, that $\boldsymbol{\mu}_{0}=\boldsymbol{0}$ and $\boldsymbol{\Sigma}_{0}=\boldsymbol{I}$.

In summary, factor analysis assumes every data point 𝒙id\boldsymbol{x}_{i}\in\mathbb{R}^{d} is obtained by projecting a latent variable 𝒛ip\boldsymbol{z}_{i}\in\mathbb{R}^{p} onto a dd-dimensional space by projection matrix 𝚲d×p\boldsymbol{\Lambda}\in\mathbb{R}^{d\times p} and translating it by 𝝁d\boldsymbol{\mu}\in\mathbb{R}^{d} and finally adding some Gaussian noise ϵd\boldsymbol{\epsilon}\in\mathbb{R}^{d} (whose dimensions are independent) as:

𝒙i:=𝚲𝒛i+𝝁+ϵ,\displaystyle\boldsymbol{x}_{i}:=\boldsymbol{\Lambda}\boldsymbol{z}_{i}+\boldsymbol{\mu}+\boldsymbol{\epsilon}, (28)
(𝒛i)=𝒩(𝟎,𝑰),\displaystyle\mathbb{P}(\boldsymbol{z}_{i})=\mathcal{N}(\boldsymbol{0},\boldsymbol{I}), (29)
(ϵ)=𝒩(𝟎,𝚿).\displaystyle\mathbb{P}(\boldsymbol{\epsilon})=\mathcal{N}(\boldsymbol{0},\boldsymbol{\Psi}). (30)

3.4 The Joint and Marginal Distributions in Factor Analysis

The joint distribution of 𝒙i\boldsymbol{x}_{i} and 𝒛i\boldsymbol{z}_{i} is:

𝒚i:=[𝒙i𝒛i]𝒩(𝝁y,𝚺y).\displaystyle\boldsymbol{y}_{i}:=\begin{bmatrix}\boldsymbol{x}_{i}\\ \boldsymbol{z}_{i}\end{bmatrix}\sim\mathcal{N}(\boldsymbol{\mu}_{y},\boldsymbol{\Sigma}_{y}). (31)

The expectation of 𝒙i\boldsymbol{x}_{i} is:

𝔼[𝒙i]=(28)𝔼[𝚲𝒛i+𝝁+ϵ]=𝚲𝔼[𝒛i]+𝝁+𝔼[ϵ]=(a)𝝁,\displaystyle\mathbb{E}[\boldsymbol{x}_{i}]\overset{(\ref{equation_x_z_epsilon})}{=}\mathbb{E}[\boldsymbol{\Lambda}\boldsymbol{z}_{i}+\boldsymbol{\mu}+\boldsymbol{\epsilon}]=\boldsymbol{\Lambda}\mathbb{E}[\boldsymbol{z}_{i}]+\boldsymbol{\mu}+\mathbb{E}[\boldsymbol{\epsilon}]\overset{(a)}{=}\boldsymbol{\mu}, (32)

where (a)(a) is because of Eqs. (29) and (30). Hence:

𝝁y:=[𝝁x𝝁z]=(a)[𝝁𝟎],\displaystyle\boldsymbol{\mu}_{y}:=\begin{bmatrix}\boldsymbol{\mu}_{x}\\ \boldsymbol{\mu}_{z}\end{bmatrix}\overset{(a)}{=}\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{0}\end{bmatrix}, (33)

where (a)(a) is because of Eqs. (29) and (32). Consider Eq. (15). According to Eq. (29), we have 𝚺22=𝚺z=𝑰\boldsymbol{\Sigma}_{22}=\boldsymbol{\Sigma}_{z}=\boldsymbol{I}. According to Eq. (28), we have:

𝚺11=𝚺x=𝔼[(𝒙i𝝁)(𝒙i𝝁)]\displaystyle\boldsymbol{\Sigma}_{11}=\boldsymbol{\Sigma}_{x}=\mathbb{E}[(\boldsymbol{x}_{i}-\boldsymbol{\mu})(\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top}]
=𝔼[(𝚲𝒛i+𝝁+ϵ𝝁)(𝚲𝒛i+𝝁+ϵ𝝁)]\displaystyle=\mathbb{E}[(\boldsymbol{\Lambda}\boldsymbol{z}_{i}+\boldsymbol{\mu}+\boldsymbol{\epsilon}-\boldsymbol{\mu})(\boldsymbol{\Lambda}\boldsymbol{z}_{i}+\boldsymbol{\mu}+\boldsymbol{\epsilon}-\boldsymbol{\mu})^{\top}]
=𝔼[𝚲𝒛i𝒛i𝚲+ϵ𝒛i𝚲+𝚲𝒛iϵ+ϵϵ]\displaystyle=\mathbb{E}[\boldsymbol{\Lambda}\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}\boldsymbol{\Lambda}^{\top}+\boldsymbol{\epsilon}\boldsymbol{z}_{i}^{\top}\boldsymbol{\Lambda}^{\top}+\boldsymbol{\Lambda}\boldsymbol{z}_{i}\boldsymbol{\epsilon}^{\top}+\boldsymbol{\epsilon}\boldsymbol{\epsilon}^{\top}]
=𝚲𝔼[𝒛i𝒛i]𝚲+𝔼[ϵ]𝔼[𝒛i]𝚲+𝚲𝔼[𝒛i]𝔼[ϵ]+𝔼[ϵϵ]\displaystyle=\boldsymbol{\Lambda}\mathbb{E}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]\boldsymbol{\Lambda}^{\top}+\mathbb{E}[\boldsymbol{\epsilon}]\mathbb{E}[\boldsymbol{z}_{i}]^{\top}\boldsymbol{\Lambda}^{\top}+\boldsymbol{\Lambda}\mathbb{E}[\boldsymbol{z}_{i}]\mathbb{E}[\boldsymbol{\epsilon}]^{\top}+\mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^{\top}]
=(a)𝚲𝑰𝚲+𝟎+𝟎+𝚿=𝚲𝚲+𝚿,\displaystyle\overset{(a)}{=}\boldsymbol{\Lambda}\boldsymbol{I}\boldsymbol{\Lambda}^{\top}+\boldsymbol{0}+\boldsymbol{0}+\boldsymbol{\Psi}=\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\boldsymbol{\Psi}, (34)

where (a)(a) is because of Eqs. (29) and (30). Moreover, we have:

𝚺12=𝚺xz=𝔼[(𝒙i𝝁)(𝒛i𝝁0)]\displaystyle\boldsymbol{\Sigma}_{12}=\boldsymbol{\Sigma}_{xz}=\mathbb{E}[(\boldsymbol{x}_{i}-\boldsymbol{\mu})(\boldsymbol{z}_{i}-\boldsymbol{\mu}_{0})^{\top}]
=(a)𝔼[(𝚲𝒛i+𝝁+ϵ𝝁)(𝒛i𝟎)]\displaystyle\overset{(a)}{=}\mathbb{E}[(\boldsymbol{\Lambda}\boldsymbol{z}_{i}+\boldsymbol{\mu}+\boldsymbol{\epsilon}-\boldsymbol{\mu})(\boldsymbol{z}_{i}-\boldsymbol{0})^{\top}]
=(b)𝚲𝔼[𝒛i𝒛i]+𝔼[ϵ]𝔼[𝒛i]=𝚲𝑰+(𝟎𝟎)=𝚲,\displaystyle\overset{(b)}{=}\boldsymbol{\Lambda}\mathbb{E}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]+\mathbb{E}[\boldsymbol{\epsilon}]\mathbb{E}[\boldsymbol{z}_{i}^{\top}]=\boldsymbol{\Lambda}\boldsymbol{I}+(\boldsymbol{0}\boldsymbol{0}^{\top})=\boldsymbol{\Lambda}, (35)

where (a)(a) is because of Eqs. (28) and (29) and (b)(b) is because 𝒛i\boldsymbol{z}_{i} and ϵ\boldsymbol{\epsilon} are independent. We also have 𝚺21=𝚺12=𝚲\boldsymbol{\Sigma}_{21}=\boldsymbol{\Sigma}_{12}^{\top}=\boldsymbol{\Lambda}^{\top}. Therefore:

[𝒙i𝒛i]𝒩([𝝁𝟎],[𝚲𝚲+𝚿𝚲𝚲𝑰]).\displaystyle\begin{bmatrix}\boldsymbol{x}_{i}\\ \boldsymbol{z}_{i}\end{bmatrix}\sim\mathcal{N}\Bigg{(}\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{0}\end{bmatrix},\begin{bmatrix}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\boldsymbol{\Psi}&\boldsymbol{\Lambda}\\ \boldsymbol{\Lambda}^{\top}&\boldsymbol{I}\end{bmatrix}\Bigg{)}. (36)

Hence, the marginal distribution of data point 𝒙i\boldsymbol{x}_{i} is:

(𝒙i)=(𝒙i|𝚲,𝝁,𝚿)=𝒩(𝝁,𝚲𝚲+𝚿).\displaystyle\mathbb{P}(\boldsymbol{x}_{i})=\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\Lambda},\boldsymbol{\mu},\boldsymbol{\Psi})=\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\boldsymbol{\Psi}). (37)
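
As a sanity check, the generative process of Eqs. (28)-(30) can be simulated and its empirical moments compared against the marginal of Eq. (37); the dimensions and parameter values below are toy assumptions.

```python
# Simulate x_i = Lambda z_i + mu + eps and verify E[x] = mu, Cov[x] = Lambda Lambda^T + Psi.
import numpy as np

rng = np.random.default_rng(2)
d, p, n = 4, 2, 200_000
Lambda = rng.normal(size=(d, p))
mu = rng.normal(size=d)
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))                 # diagonal noise covariance

Z = rng.normal(size=(n, p))                                  # z_i ~ N(0, I), Eq. (29)
E = rng.multivariate_normal(np.zeros(d), Psi, size=n)        # eps ~ N(0, Psi), Eq. (30)
X = Z @ Lambda.T + mu + E                                    # Eq. (28)

print(np.abs(X.mean(axis=0) - mu).max())                     # close to 0
print(np.abs(np.cov(X.T) - (Lambda @ Lambda.T + Psi)).max()) # close to 0 (Eq. (37))
```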

According to Eqs. (21) and (22), the posterior or the conditional distribution of latent variable given data is:

q(𝒛i)=(12)(𝒛i|𝒙i)\displaystyle q(\boldsymbol{z}_{i})\overset{(\ref{equation_E_step_variationalInference})}{=}\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i}) =(𝒛i|𝒙i,𝚲,𝝁,𝚿)\displaystyle=\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\Lambda},\boldsymbol{\mu},\boldsymbol{\Psi}) (38)
=𝒩(𝝁z|x,𝚺z|x),\displaystyle=\mathcal{N}(\boldsymbol{\mu}_{z|x},\boldsymbol{\Sigma}_{z|x}),

where:

p𝝁z|x:=𝚲(𝚲𝚲+𝚿)1(𝒙i𝝁),\displaystyle\mathbb{R}^{p}\ni\boldsymbol{\mu}_{z|x}:=\boldsymbol{\Lambda}^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\boldsymbol{\Psi})^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu}), (39)
p×p𝚺z|x:=𝑰𝚲(𝚲𝚲+𝚿)1𝚲.\displaystyle\mathbb{R}^{p\times p}\ni\boldsymbol{\Sigma}_{z|x}:=\boldsymbol{I}-\boldsymbol{\Lambda}^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\boldsymbol{\Psi})^{-1}\boldsymbol{\Lambda}. (40)

Recall that the conditional distribution of data given the latent variable, i.e. (𝒙i|𝒛i)\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i}), was introduced in Eq. (24).

If data {𝒙i}i=1n\{\boldsymbol{x}_{i}\}_{i=1}^{n} are centered, i.e. 𝝁=𝟎\boldsymbol{\mu}=\boldsymbol{0}, the marginal of data, Eq. (37), and the likelihood of data, Eq. (24), become:

(𝒙i|𝚲,𝚿)=𝒩(𝟎,𝚿+𝚲𝚲),\displaystyle\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{\Lambda},\boldsymbol{\Psi})=\mathcal{N}(\boldsymbol{0},\boldsymbol{\Psi}+\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}), (41)
(𝒙i|𝒛i,𝚲,𝚿)=𝒩(𝚲𝒛i,𝚿),\displaystyle\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\Lambda},\boldsymbol{\Psi})=\mathcal{N}(\boldsymbol{\Lambda}\boldsymbol{z}_{i},\boldsymbol{\Psi}), (42)

respectively. In some works, people center the data as a pre-processing to factor analysis.
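
The posterior of Eqs. (39)-(40) is what reduces a data point to its low-dimensional factor. The short sketch below (with assumed toy parameters and an assumed query point) computes this posterior directly.

```python
# Posterior of the latent factor given a data point, Eqs. (39)-(40).
import numpy as np

rng = np.random.default_rng(3)
d, p = 5, 2
Lambda = rng.normal(size=(d, p))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))
mu = rng.normal(size=d)

x_i = rng.normal(size=d)                       # a hypothetical data point
C = Lambda @ Lambda.T + Psi                    # marginal covariance of x, Eq. (37)

mu_z_given_x = Lambda.T @ np.linalg.solve(C, x_i - mu)                 # Eq. (39)
Sigma_z_given_x = np.eye(p) - Lambda.T @ np.linalg.solve(C, Lambda)    # Eq. (40)
print(mu_z_given_x)        # the p-dimensional representation of x_i
print(Sigma_z_given_x)     # posterior uncertainty of the latent factor
```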

3.5 Expectation Maximization in Factor Analysis

3.5.1 Maximization of Joint Likelihood

In factor analysis, the parameter $\boldsymbol{\theta}$ of variational inference consists of the two parameters $\boldsymbol{\Lambda}$ and $\boldsymbol{\Psi}$. Following Eq. (13), consider the maximization of the expected joint log-likelihood over all $n$ data points, which reduces to the expected conditional likelihood of data given the latent variables:

max𝚲,𝚿i=1n𝔼q(t)(𝒛i)[log(𝒙i,𝒛i|𝚲,𝚿)]\displaystyle\max_{\boldsymbol{\Lambda},\boldsymbol{\Psi}}~{}\sum_{i=1}^{n}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\big{[}\log\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\Lambda},\boldsymbol{\Psi})\big{]}
=(a)max𝚲,𝚿i=1n(𝔼q(t)(𝒛i)[log(𝒙i|𝒛i,𝚲,𝚿)]\displaystyle\overset{(a)}{=}\max_{\boldsymbol{\Lambda},\boldsymbol{\Psi}}~{}\sum_{i=1}^{n}\Big{(}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\big{[}\log\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\Lambda},\boldsymbol{\Psi})\big{]}
+𝔼q(t)(𝒛i)[log(𝒛i)]),\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\big{[}\log\mathbb{P}(\boldsymbol{z}_{i})\big{]}\Big{)},
=(b)max𝚲,𝚿i=1n𝔼q(t)(𝒛i)[log(𝒙i|𝒛i,𝚲,𝚿)]\displaystyle\overset{(b)}{=}\max_{\boldsymbol{\Lambda},\boldsymbol{\Psi}}~{}\sum_{i=1}^{n}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\big{[}\log\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\Lambda},\boldsymbol{\Psi})\big{]}
=(24)max𝚲,𝚿i=1n𝔼q(t)(𝒛i)[log𝒩(𝚲𝒛i+𝝁,𝚿)]\displaystyle\overset{(\ref{equation_factor_analysis_likelihood_x_given_z})}{=}\max_{\boldsymbol{\Lambda},\boldsymbol{\Psi}}~{}\sum_{i=1}^{n}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\big{[}\log\mathcal{N}(\boldsymbol{\Lambda}\boldsymbol{z}_{i}+\boldsymbol{\mu},\boldsymbol{\Psi})\big{]}
\displaystyle=\max_{\boldsymbol{\Lambda},\boldsymbol{\Psi}}~\sum_{i=1}^{n}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\bigg[\log\Big(\frac{1}{(2\pi)^{d/2}|\boldsymbol{\Psi}|^{1/2}}\exp\Big(\!-\frac{(\boldsymbol{x}_{i}-\boldsymbol{\Lambda}\boldsymbol{z}_{i}-\boldsymbol{\mu})^{\top}\boldsymbol{\Psi}^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\Lambda}\boldsymbol{z}_{i}-\boldsymbol{\mu})}{2}\Big)\Big)\bigg]
=max𝚲,𝚿(dn2log(2π)constantn2log|𝚿|\displaystyle=\max_{\boldsymbol{\Lambda},\boldsymbol{\Psi}}~{}\Big{(}\underbrace{-\frac{d\,n}{2}\log(2\pi)}_{\text{constant}}-\frac{n}{2}\log|\boldsymbol{\Psi}|
i=1n𝔼q(t)(𝒛i)[12(𝒙i𝚲𝒛i𝝁)𝚿1\displaystyle~{}~{}~{}~{}-\sum_{i=1}^{n}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\big{[}\frac{1}{2}(\boldsymbol{x}_{i}-\boldsymbol{\Lambda}\boldsymbol{z}_{i}-\boldsymbol{\mu})^{\top}\boldsymbol{\Psi}^{-1}
(𝒙i𝚲𝒛i𝝁)])\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}(\boldsymbol{x}_{i}-\boldsymbol{\Lambda}\boldsymbol{z}_{i}-\boldsymbol{\mu})\big{]}\Big{)} (43)

where $(a)$ is because of the chain rule $\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\Lambda},\boldsymbol{\Psi})=\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\Lambda},\boldsymbol{\Psi})\,\mathbb{P}(\boldsymbol{z}_{i})$, and $(b)$ is because the second term does not depend on $\boldsymbol{\Lambda}$ and $\boldsymbol{\Psi}$ (the prior of $\boldsymbol{z}_{i}$ in Eq. (29) has no parameters to learn), so it can be dropped from the maximization.

3.5.2 The E-Step in EM for Factor Analysis

As we will see later in the M-step of EM, we will have two expectation terms which need to be computed in the E-step. These expectations, which are over the q(𝒛i)q(\boldsymbol{z}_{i}) distribution, are 𝔼q(t)(𝒛i)[𝒛i]\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}] and 𝔼q(t)(𝒛i)[𝒛i𝒛i]\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]. According to Eq. (12), we have q(𝒛i)=(𝒛i|𝒙i)q(\boldsymbol{z}_{i})=\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i}). Therefore, according to Eqs. (38), (39), and (40), we have:

\displaystyle\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]=\boldsymbol{\mu}_{z|x}=\boldsymbol{\Lambda}^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\boldsymbol{\Psi})^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu}), (44)
\displaystyle\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]=\boldsymbol{\Sigma}_{z|x}+\boldsymbol{\mu}_{z|x}\boldsymbol{\mu}_{z|x}^{\top}=\boldsymbol{I}-\boldsymbol{\Lambda}^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\boldsymbol{\Psi})^{-1}\boldsymbol{\Lambda}+\boldsymbol{\mu}_{z|x}\boldsymbol{\mu}_{z|x}^{\top}. (45)

3.5.3 The M-Step in EM for Factor Analysis

We have two variables 𝚲\boldsymbol{\Lambda} and 𝚿\boldsymbol{\Psi} so we solve the maximization w.r.t. these variables.

Finding parameter 𝚲\boldsymbol{\Lambda}:

d×pEq.(43)𝚲\displaystyle\mathbb{R}^{d\times p}\ni\frac{\partial\,\text{Eq.}\,(\ref{equation_factor_analysis_EM_likelihood})}{\partial\boldsymbol{\Lambda}}
=i=1n𝚲𝔼q(t)(𝒛i)[12tr(𝒛i𝚲𝚿1𝚲𝒛i)\displaystyle=-\sum_{i=1}^{n}\frac{\partial}{\partial\boldsymbol{\Lambda}}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\Big{[}\frac{1}{2}\textbf{tr}(\boldsymbol{z}_{i}^{\top}\boldsymbol{\Lambda}^{\top}\boldsymbol{\Psi}^{-1}\boldsymbol{\Lambda}\boldsymbol{z}_{i})
tr(𝒛i𝚲𝚿1(𝒙i𝝁))]\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\textbf{tr}(\boldsymbol{z}_{i}^{\top}\boldsymbol{\Lambda}^{\top}\boldsymbol{\Psi}^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu}))\Big{]}
=(a)i=1n𝚲𝔼q(t)(𝒛i)[12tr(𝚲𝚿1𝚲𝒛i𝒛i)\displaystyle\overset{(a)}{=}-\sum_{i=1}^{n}\frac{\partial}{\partial\boldsymbol{\Lambda}}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\Big{[}\frac{1}{2}\textbf{tr}(\boldsymbol{\Lambda}^{\top}\boldsymbol{\Psi}^{-1}\boldsymbol{\Lambda}\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top})
tr(𝚲𝚿1(𝒙i𝝁)𝒛i)]\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\textbf{tr}(\boldsymbol{\Lambda}^{\top}\boldsymbol{\Psi}^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu})\boldsymbol{z}_{i}^{\top})\Big{]}
=i=1n𝔼q(t)(𝒛i)[𝚿1𝚲𝒛i𝒛i𝚿1(𝒙i𝝁)𝒛i]\displaystyle=-\sum_{i=1}^{n}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\Big{[}\boldsymbol{\Psi}^{-1}\boldsymbol{\Lambda}\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}-\boldsymbol{\Psi}^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu})\boldsymbol{z}_{i}^{\top}\Big{]}
=i=1n[𝚿1𝚲𝔼q(t)(𝒛i)[𝒛i𝒛i]\displaystyle=-\sum_{i=1}^{n}\Big{[}\boldsymbol{\Psi}^{-1}\boldsymbol{\Lambda}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]
𝚿1(𝒙i𝝁)𝔼q(t)(𝒛i)[𝒛i]],\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\boldsymbol{\Psi}^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu})\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]^{\top}\Big{]},

where (a)(a) is because of the cyclic property of trace. Setting this derivative to zero gives us the optimum 𝚲\boldsymbol{\Lambda}:

\displaystyle\sum_{i=1}^{n}\boldsymbol{\Psi}^{-1}\boldsymbol{\Lambda}\,\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]=\sum_{i=1}^{n}\boldsymbol{\Psi}^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu})\,\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]^{\top}
\displaystyle\implies\boldsymbol{\Lambda}=\Big(\sum_{i=1}^{n}(\boldsymbol{x}_{i}-\boldsymbol{\mu})\,\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]^{\top}\Big)\Big(\sum_{i=1}^{n}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]\Big)^{-1}, (46)

where $\boldsymbol{\Psi}^{-1}$ cancels from both sides (by left-multiplying by $\boldsymbol{\Psi}$) because it does not depend on the index $i$.

Finding parameter 𝚿\boldsymbol{\Psi}: Now, consider maximization w.r.t 𝚿\boldsymbol{\Psi}. We restate Eq.  (43) as (Paola Garcia, 2018):

max𝚲,𝚿(dn2log(2π)constantn2log|𝚿|n2tr(𝚿1𝑺)),\displaystyle\max_{\boldsymbol{\Lambda},\boldsymbol{\Psi}}~{}\big{(}\underbrace{-\frac{d\,n}{2}\log(2\pi)}_{\text{constant}}-\frac{n}{2}\log|\boldsymbol{\Psi}|-\frac{n}{2}\textbf{tr}(\boldsymbol{\Psi}^{-1}\boldsymbol{S})\big{)}, (47)

where 𝑺d×d\boldsymbol{S}\in\mathbb{R}^{d\times d} is a sample covariance matrix defined as:

𝑺:=1ni=1n𝔼q(t)(𝒛i)[(𝒙i𝚲𝒛i𝝁)(𝒙i𝚲𝒛i𝝁)]\displaystyle\boldsymbol{S}\!:=\!\frac{1}{n}\sum_{i=1}^{n}\!\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\!\big{[}(\boldsymbol{x}_{i}-\boldsymbol{\Lambda}\boldsymbol{z}_{i}-\boldsymbol{\mu})(\boldsymbol{x}_{i}-\boldsymbol{\Lambda}\boldsymbol{z}_{i}-\boldsymbol{\mu})^{\top}\big{]}
=1ni=1n((𝒙i𝝁)(𝒙i𝝁)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\Big{(}(\boldsymbol{x}_{i}-\boldsymbol{\mu})(\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top}
2𝚲𝔼q(t)(𝒛i)[𝒛i](𝒙i𝝁)+𝚲𝔼q(t)(𝒛i)[𝒛𝒛]𝚲).\displaystyle~{}~{}-2\boldsymbol{\Lambda}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\![\boldsymbol{z}_{i}](\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top}+\boldsymbol{\Lambda}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\![\boldsymbol{z}\boldsymbol{z}^{\top}]\boldsymbol{\Lambda}^{\top}\Big{)}. (48)

The maximization results in (Paola Garcia, 2018):

d×dEq.(47)𝚿1=n2𝚿n2𝑺=set𝟎𝚿=𝑺.\displaystyle\mathbb{R}^{d\times d}\ni\frac{\partial\,\text{Eq.}\,(\ref{equation_factor_analysis_EM_likelihood_with_S})}{\partial\boldsymbol{\Psi}^{-1}}=\frac{n}{2}\boldsymbol{\Psi}-\frac{n}{2}\boldsymbol{S}\overset{\text{set}}{=}\boldsymbol{0}\implies\boldsymbol{\Psi}=\boldsymbol{S}.

Note that as the dimensions of noise ϵd\boldsymbol{\epsilon}\in\mathbb{R}^{d} are independent, the covariance matrix of noise, 𝚿\boldsymbol{\Psi}, is a diagonal matrix. Hence:

𝚿=diag(𝑺)=(48)1ndiag(i=1n[(𝒙i𝝁)(𝒙i𝝁)\displaystyle\boldsymbol{\Psi}=\textbf{diag}(\boldsymbol{S})\overset{(\ref{equation_factor_analysis_S})}{=}\frac{1}{n}\textbf{diag}\bigg{(}\sum_{i=1}^{n}\Big{[}(\boldsymbol{x}_{i}-\boldsymbol{\mu})(\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top} (49)
2𝚲𝔼q(t)(𝒛i)[𝒛i](𝒙i𝝁)+𝚲𝔼q(t)(𝒛i)[𝒛𝒛]𝚲]).\displaystyle-2\boldsymbol{\Lambda}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\![\boldsymbol{z}_{i}](\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top}+\boldsymbol{\Lambda}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}\![\boldsymbol{z}\boldsymbol{z}^{\top}]\boldsymbol{\Lambda}^{\top}\Big{]}\bigg{)}.

3.5.4 Summary of Factor Analysis Algorithm

According to the derived Eqs. (44), (45), (46), and (49), the EM algorithm in factor analysis is summarized as follows. The mean of data, 𝝁\boldsymbol{\mu}, is computed. Then, for every data point 𝒙i\boldsymbol{x}_{i}, we iteratively solve as:

\displaystyle\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]\leftarrow\boldsymbol{\Lambda}^{(t)\top}(\boldsymbol{\Lambda}^{(t)}\boldsymbol{\Lambda}^{(t)\top}+\boldsymbol{\Psi}^{(t)})^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu}),
\displaystyle\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]\leftarrow\boldsymbol{I}-\boldsymbol{\Lambda}^{(t)\top}(\boldsymbol{\Lambda}^{(t)}\boldsymbol{\Lambda}^{(t)\top}+\boldsymbol{\Psi}^{(t)})^{-1}\boldsymbol{\Lambda}^{(t)}+\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]\,\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]^{\top},
\displaystyle\boldsymbol{\Lambda}^{(t+1)}\leftarrow\Big(\sum_{i=1}^{n}(\boldsymbol{x}_{i}-\boldsymbol{\mu})\,\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]^{\top}\Big)\Big(\sum_{i=1}^{n}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]\Big)^{-1},
\displaystyle\boldsymbol{\Psi}^{(t+1)}\leftarrow\frac{1}{n}\textbf{diag}\Big(\sum_{i=1}^{n}\big[(\boldsymbol{x}_{i}-\boldsymbol{\mu})(\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top}-2\boldsymbol{\Lambda}^{(t+1)}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]\,(\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top}+\boldsymbol{\Lambda}^{(t+1)}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]\,\boldsymbol{\Lambda}^{(t+1)\top}\big]\Big).

Note that if data are centered as a pre-processing to factor analysis, i.e. 𝝁=𝟎\boldsymbol{\mu}=\boldsymbol{0}, the algorithm of factor analysis is simplified as:

\displaystyle\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]\leftarrow\boldsymbol{\Lambda}^{(t)\top}(\boldsymbol{\Lambda}^{(t)}\boldsymbol{\Lambda}^{(t)\top}+\boldsymbol{\Psi}^{(t)})^{-1}\boldsymbol{x}_{i},
\displaystyle\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]\leftarrow\boldsymbol{I}-\boldsymbol{\Lambda}^{(t)\top}(\boldsymbol{\Lambda}^{(t)}\boldsymbol{\Lambda}^{(t)\top}+\boldsymbol{\Psi}^{(t)})^{-1}\boldsymbol{\Lambda}^{(t)}+\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]\,\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]^{\top},
\displaystyle\boldsymbol{\Lambda}^{(t+1)}\leftarrow\Big(\sum_{i=1}^{n}\boldsymbol{x}_{i}\,\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]^{\top}\Big)\Big(\sum_{i=1}^{n}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]\Big)^{-1},
\displaystyle\boldsymbol{\Psi}^{(t+1)}\leftarrow\frac{1}{n}\textbf{diag}\Big(\sum_{i=1}^{n}\big[\boldsymbol{x}_{i}\boldsymbol{x}_{i}^{\top}-2\boldsymbol{\Lambda}^{(t+1)}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}]\,\boldsymbol{x}_{i}^{\top}+\boldsymbol{\Lambda}^{(t+1)}\mathbb{E}_{\sim q^{(t)}(\boldsymbol{z}_{i})}[\boldsymbol{z}_{i}\boldsymbol{z}_{i}^{\top}]\,\boldsymbol{\Lambda}^{(t+1)\top}\big]\Big).

As can be seen, factor analysis does not have a closed-form solution; its parameters, namely the projection matrix $\boldsymbol{\Lambda}$ and the noise covariance matrix $\boldsymbol{\Psi}$, are found iteratively until convergence.
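
To see these iterations in action, the following compact NumPy sketch runs the above EM updates on centered data ($\boldsymbol{\mu}=\boldsymbol{0}$); the dimensions, the simulated ground-truth parameters, and the number of iterations are illustrative assumptions rather than recommendations.

```python
# EM for factor analysis on centered data, following Eqs. (44)-(46) and (49).
import numpy as np

rng = np.random.default_rng(4)
d, p, n = 6, 2, 5000
Lambda_true = rng.normal(size=(d, p))
Psi_true = np.diag(rng.uniform(0.1, 0.5, size=d))
X = rng.normal(size=(n, p)) @ Lambda_true.T \
    + rng.multivariate_normal(np.zeros(d), Psi_true, size=n)    # centered data (mu = 0)

Lambda = rng.normal(size=(d, p))      # initialization
Psi = np.eye(d)

for t in range(300):
    # E-step: posterior moments of z_i under q(z_i) = P(z_i | x_i), Eqs. (44)-(45).
    C = Lambda @ Lambda.T + Psi
    G = Lambda.T @ np.linalg.inv(C)                 # p x d
    Ez = X @ G.T                                    # row i is E[z_i]
    Sigma_z = np.eye(p) - G @ Lambda                # posterior covariance (same for every i)
    Ezz_sum = n * Sigma_z + Ez.T @ Ez               # sum_i E[z_i z_i^T]

    # M-step: update Lambda and the diagonal Psi, Eqs. (46) and (49).
    Lambda = (X.T @ Ez) @ np.linalg.inv(Ezz_sum)
    S = (X.T @ X - 2.0 * Lambda @ (Ez.T @ X) + Lambda @ Ezz_sum @ Lambda.T) / n
    Psi = np.diag(np.diag(S))

# Lambda is identifiable only up to a rotation, so compare the implied data covariance.
print(np.abs((Lambda @ Lambda.T + Psi)
             - (Lambda_true @ Lambda_true.T + Psi_true)).max())   # small, up to sampling error
```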

It is noteworthy that a mixture of factor analyzers (Ghahramani & Hinton, 1996) also exists in the literature, which considers a mixture distribution for factor analysis and trains the parameters of the mixture using EM (Ghojogh et al., 2019a).

4 Probabilistic Principal Component Analysis

4.1 Main Idea of Probabilistic PCA

Probabilistic PCA (PPCA) (Roweis, 1997; Tipping & Bishop, 1999b) is a special case of factor analysis where the variance of noise is equal in all dimensions of the data space, with zero covariance between different dimensions, i.e.:

𝚿=σ2𝑰.\displaystyle\boldsymbol{\Psi}=\sigma^{2}\boldsymbol{I}. (50)

In other words, PPCA considers an isotropic noise in its formulation. Therefore, Eq. (30) is simplified to:

(ϵ)=𝒩(𝟎,σ2𝑰).\displaystyle\mathbb{P}(\boldsymbol{\epsilon})=\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}). (51)

Because the noise covariance between different dimensions is zero, the dimensions of a data point are conditionally independent of each other given its latent variable. As depicted in Figure 3, PPCA can be illustrated as a graphical model, where the visible data variable is conditioned on the latent variable and the isotropic noise random variable.

Figure 3: The graphical model for PPCA.

4.2 MLE for Probabilistic PCA

As PPCA is a special case of factor analysis, it can likewise be solved iteratively using EM (Roweis, 1997). However, one can also find a closed-form solution for its parameters (Tipping & Bishop, 1999b): by restricting the noise covariance to be isotropic, the solution becomes simpler and closed-form. The iterative approach is the same as in factor analysis; here, we derive the closed-form solution.

Consider the likelihood or the marginal distribution of data points {𝒙id}i=1n\{\boldsymbol{x}_{i}\in\mathbb{R}^{d}\}_{i=1}^{n} which is Eq. (37). The log-likelihood of data is:

i=1nlog(𝒙i)=(37)i=1nlog𝒩(𝝁,𝚲𝚲+σ2𝑰)\displaystyle\sum_{i=1}^{n}\log\mathbb{P}(\boldsymbol{x}_{i})\overset{(\ref{equation_factor_analysis_prior_of_data})}{=}\sum_{i=1}^{n}\log\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})
=i=1n[log(1(2π)d/2|𝚲𝚲+σ2𝑰|1/2×\displaystyle=\sum_{i=1}^{n}\bigg{[}\log\big{(}\frac{1}{(2\pi)^{d/2}|\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I}|^{1/2}}\times
exp((𝒙i𝝁)(𝚲𝚲+σ2𝑰)1(𝒙i𝝁)2))]\displaystyle~{}~{}~{}~{}~{}\exp\Big{(}\!\!-\frac{(\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu})}{2}\Big{)}\big{)}\bigg{]}
=dn2log(2π)constantn2log|𝚲𝚲+σ2𝑰|\displaystyle=\underbrace{-\frac{d\,n}{2}\log(2\pi)}_{\text{constant}}-\frac{n}{2}\log|\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I}|
i=1n12(𝒙i𝝁)(𝚲𝚲+σ2𝑰)1(𝒙i𝝁)\displaystyle~{}~{}~{}~{}-\sum_{i=1}^{n}\frac{1}{2}(\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu})
=dn2log(2π)constantn2log|𝚲𝚲+σ2𝑰|\displaystyle=\underbrace{-\frac{d\,n}{2}\log(2\pi)}_{\text{constant}}-\frac{n}{2}\log|\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I}|
n2tr((𝚲𝚲+σ2𝑰)1𝑺x),\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\frac{n}{2}\textbf{tr}\big{(}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{S}_{x}\big{)},

where 𝑺xd×d\boldsymbol{S}_{x}\in\mathbb{R}^{d\times d} is the sample covariance matrix of data:

𝑺x:=1ni=1n(𝒙i𝝁)(𝒙i𝝁).\displaystyle\boldsymbol{S}_{x}:=\frac{1}{n}\sum_{i=1}^{n}(\boldsymbol{x}_{i}-\boldsymbol{\mu})(\boldsymbol{x}_{i}-\boldsymbol{\mu})^{\top}. (52)

We use MLE where the variables of maximization optimization are the projection matrix 𝚲\boldsymbol{\Lambda} and the noise variance σ\sigma:

max𝚲,σ(dn2log(2π)constantn2log|𝚲𝚲+σ2𝑰|\displaystyle\max_{\boldsymbol{\Lambda},\sigma}~{}\Big{(}\underbrace{-\frac{d\,n}{2}\log(2\pi)}_{\text{constant}}-\frac{n}{2}\log|\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I}|
n2tr((𝚲𝚲+σ2𝑰)1𝑺x)).\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\frac{n}{2}\textbf{tr}\big{(}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{S}_{x}\big{)}\Big{)}. (53)
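
For reference, the objective of Eq. (53) can be evaluated numerically as in the small sketch below; the data and the candidate parameters are arbitrary assumptions, and such a helper is only meant for comparing candidate $(\boldsymbol{\Lambda},\sigma^{2})$ pairs.

```python
# Evaluate the PPCA log-likelihood of Eq. (53) for a given Lambda and sigma^2.
import numpy as np

def ppca_log_likelihood(Lambda, sigma2, S_x, n):
    d = S_x.shape[0]
    C = Lambda @ Lambda.T + sigma2 * np.eye(d)        # the marginal covariance of x
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet
                       + np.trace(np.linalg.solve(C, S_x)))

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))                         # assumed toy data
S_x = np.cov(X.T, bias=True)                          # Eq. (52) with the sample mean
print(ppca_log_likelihood(rng.normal(size=(4, 2)), 0.5, S_x, n=100))
```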

It is noteworthy that the literature usually defines:

d×d𝑪:=(𝚲𝚲+σ2𝑰d×d).\displaystyle\mathbb{R}^{d\times d}\ni\boldsymbol{C}:=(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I}_{d\times d}). (54)
p×p𝑴:=(𝚲𝚲+σ2𝑰p×p).\displaystyle\mathbb{R}^{p\times p}\ni\boldsymbol{M}:=(\boldsymbol{\Lambda}^{\top}\boldsymbol{\Lambda}+\sigma^{2}\boldsymbol{I}_{p\times p}). (55)

According to the matrix inversion lemma, we have:

\displaystyle\boldsymbol{C}^{-1}=\sigma^{-2}\boldsymbol{I}_{d\times d}-\sigma^{-2}\boldsymbol{\Lambda}\boldsymbol{M}^{-1}\boldsymbol{\Lambda}^{\top}. (56)

This inversion is interesting because the inverse of the $(d\times d)$ matrix $\boldsymbol{C}$ is reduced to the inversion of the $(p\times p)$ matrix $\boldsymbol{M}$, which is much cheaper because we usually have $p\ll d$.
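
The following short sketch (with assumed random $\boldsymbol{\Lambda}$ and $\sigma^{2}$) verifies Eq. (56) numerically; only the small $(p\times p)$ matrix $\boldsymbol{M}$ is inverted on the right-hand side.

```python
# Numerical check of the inversion identity C^{-1} = sigma^{-2} I - sigma^{-2} Lambda M^{-1} Lambda^T.
import numpy as np

rng = np.random.default_rng(6)
d, p, sigma2 = 500, 5, 0.3
Lambda = rng.normal(size=(d, p))

C = Lambda @ Lambda.T + sigma2 * np.eye(d)                     # Eq. (54)
M = Lambda.T @ Lambda + sigma2 * np.eye(p)                     # Eq. (55)
C_inv_fast = (np.eye(d) - Lambda @ np.linalg.inv(M) @ Lambda.T) / sigma2   # Eq. (56)

print(np.allclose(C_inv_fast, np.linalg.inv(C)))               # True
```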

4.2.1 MLE for Determining 𝚲\boldsymbol{\Lambda}

Taking the derivative of Eq. (53) w.r.t. 𝚲d×p\boldsymbol{\Lambda}\in\mathbb{R}^{d\times p} and setting it to zero is:

Eq.(53)𝚲=n((𝚲𝚲+σ2𝑰)1𝑺x(𝚲𝚲+σ2𝑰)1𝚲\displaystyle\frac{\partial\,\text{Eq.}(\ref{equation_PPCA_log_likelihood})}{\partial\boldsymbol{\Lambda}}=-n\big{(}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{S}_{x}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{\Lambda}
(𝚲𝚲+σ2𝑰)1𝚲)=set𝟎\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{\Lambda}\big{)}\overset{\text{set}}{=}\boldsymbol{0}
(𝚲𝚲+σ2𝑰)1𝑺x(𝚲𝚲+σ2𝑰)1𝚲\displaystyle\implies(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{S}_{x}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{\Lambda}
=(𝚲𝚲+σ2𝑰)1𝚲\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}=(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{\Lambda}
𝑺x(𝚲𝚲+σ2𝑰)1𝚲=𝚲,\displaystyle\implies\boldsymbol{S}_{x}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{\Lambda}=\boldsymbol{\Lambda},

whose trivial solutions are 𝚲=𝟎\boldsymbol{\Lambda}=\boldsymbol{0} and 𝑺x=(𝚲𝚲+σ2𝑰)\boldsymbol{S}_{x}=(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I}) which are not valid. For the non-trivial solution, consider the Singular Value Decomposition (SVD) d×p𝚲=𝑼𝑳𝑽\mathbb{R}^{d\times p}\ni\boldsymbol{\Lambda}=\boldsymbol{U}\boldsymbol{L}\boldsymbol{V}^{\top} where 𝑼d×p\boldsymbol{U}\in\mathbb{R}^{d\times p} and 𝑽p×p\boldsymbol{V}\in\mathbb{R}^{p\times p} contain the left and right singular vectors, respectively, and 𝑳p×p\boldsymbol{L}\in\mathbb{R}^{p\times p} is the diagonal matrix containing the singular values denoted by {lj}j=1p\{l_{j}\}_{j=1}^{p}. Moreover, note that tr(𝚲𝚲)=tr(𝑼𝑳𝑽𝑽𝑳𝑼)=tr(𝑼𝑳𝑳𝑼)=tr(𝑼𝑼𝑳2)=tr(𝑳2)\textbf{tr}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top})=\textbf{tr}(\boldsymbol{U}\boldsymbol{L}\boldsymbol{V}^{\top}\boldsymbol{V}\boldsymbol{L}\boldsymbol{U}^{\top})=\textbf{tr}(\boldsymbol{U}\boldsymbol{L}\boldsymbol{L}\boldsymbol{U}^{\top})=\textbf{tr}(\boldsymbol{U}^{\top}\boldsymbol{U}\boldsymbol{L}^{2})=\textbf{tr}(\boldsymbol{L}^{2}) because 𝑼\boldsymbol{U} and 𝑽\boldsymbol{V} are orthogonal matrices.

From the previous calculations, we have (Hauskrecht, 2007):

𝑺x𝑼𝑳(𝑳2+σ2𝑰)1𝑽=𝑼𝑳𝑽\displaystyle\boldsymbol{S}_{x}\boldsymbol{U}\boldsymbol{L}(\boldsymbol{L}^{2}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{V}^{\top}=\boldsymbol{U}\boldsymbol{L}\boldsymbol{V}^{\top}
𝑺x𝑼𝑳(𝑳2+σ2𝑰)1=𝑼𝑳\displaystyle\implies\boldsymbol{S}_{x}\boldsymbol{U}\boldsymbol{L}(\boldsymbol{L}^{2}+\sigma^{2}\boldsymbol{I})^{-1}=\boldsymbol{U}\boldsymbol{L}
𝑺x𝑼𝑳=𝑼(𝑳2+σ2𝑰)𝑳\displaystyle\implies\boldsymbol{S}_{x}\boldsymbol{U}\boldsymbol{L}=\boldsymbol{U}(\boldsymbol{L}^{2}+\sigma^{2}\boldsymbol{I})\boldsymbol{L}
𝑺x𝑼=𝑼(𝑳2+σ2𝑰),\displaystyle\implies\boldsymbol{S}_{x}\boldsymbol{U}=\boldsymbol{U}(\boldsymbol{L}^{2}+\sigma^{2}\boldsymbol{I}), (57)

which is an eigenvalue problem (Ghojogh et al., 2019b) for the covariance matrix 𝑺x\boldsymbol{S}_{x} where the columns of 𝑼\boldsymbol{U} are the eigenvectors of 𝑺x\boldsymbol{S}_{x} and the eigenvalues are σ2+lj2\sigma^{2}+l_{j}^{2}. Recall that σ2\sigma^{2} is the variance of the noise in each dimension and ljl_{j} is the jj-th singular value of 𝚲\boldsymbol{\Lambda} (sorted from largest to smallest). We denote the jj-th eigenvalue of the covariance matrix 𝑺x\boldsymbol{S}_{x} by:

δj:=σ2+lj2lj=(δjσ2)(1/2).\displaystyle\delta_{j}:=\sigma^{2}+l_{j}^{2}\implies l_{j}=(\delta_{j}-\sigma^{2})^{(1/2)}. (58)

We consider only the top pp singular values ljl_{j} and the top pp eigenvalues δj\delta_{j}; substituting the singular values in the SVD of the projection matrix 𝚲\boldsymbol{\Lambda} results in:

𝚲=𝑼𝑳𝑽=𝑼(𝚫pσ2𝑰)(1/2)𝑽,\displaystyle\boldsymbol{\Lambda}=\boldsymbol{U}\boldsymbol{L}\boldsymbol{V}^{\top}=\boldsymbol{U}(\boldsymbol{\Delta}_{p}-\sigma^{2}\boldsymbol{I})^{(1/2)}\boldsymbol{V}^{\top},

where 𝚫p:=diag(δ1,,δp)\boldsymbol{\Delta}_{p}:=\textbf{diag}(\delta_{1},\dots,\delta_{p}). However, as Eq. (57) does not include any 𝑽\boldsymbol{V}, we can replace 𝑽\boldsymbol{V}^{\top} with any arbitrary orthogonal matrix 𝑹p×p\boldsymbol{R}\in\mathbb{R}^{p\times p} (Tipping & Bishop, 1999b):

𝚲=𝑼(𝚫pσ2𝑰)(1/2)𝑹.\displaystyle\boldsymbol{\Lambda}=\boldsymbol{U}(\boldsymbol{\Delta}_{p}-\sigma^{2}\boldsymbol{I})^{(1/2)}\boldsymbol{R}. (59)

The arbitrary orthogonal matrix 𝑹\boldsymbol{R} is a rotation matrix which rotates data in projection. It is arbitrary because rotation is not important in dimensionality reduction (the relative distances of embedded data points do not change if all embedded points are rotated). A simple choice for this rotation matrix is 𝑹=𝑰\boldsymbol{R}=\boldsymbol{I} which results in:

𝚲=𝑼(𝚫pσ2𝑰)(1/2).\displaystyle\boldsymbol{\Lambda}=\boldsymbol{U}(\boldsymbol{\Delta}_{p}-\sigma^{2}\boldsymbol{I})^{(1/2)}. (60)

4.2.2 MLE for Determining σ\sigma

If we substitute Eq. (59) in the log-likelihood, Eq. (53), the log-likelihood becomes (Hauskrecht, 2007):

maxσn2(dlog(2π)constant+j=1plog(δj)\displaystyle\max_{\sigma}~{}-\frac{n}{2}\Big{(}\underbrace{d\log(2\pi)}_{\text{constant}}+\sum_{j=1}^{p}\log(\delta_{j})
\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}+\frac{1}{\sigma^{2}}\sum_{j=p+1}^{d}\delta_{j}+(d-p)\log(\sigma^{2})+p\Big{)}. (61)

Note that we have dd eigenvalues {δj}j=1d\{\delta_{j}\}_{j=1}^{d} because the covariance matrix 𝑺x\boldsymbol{S}_{x} is a (d×d)(d\times d) matrix. However, as 𝚲\boldsymbol{\Lambda} has only pp singular values {lj}j=1p\{l_{j}\}_{j=1}^{p}, the remaining eigenvalues {δj}j=p+1d\{\delta_{j}\}_{j=p+1}^{d} correspond to the discarded directions and are explained only by the noise variance σ2\sigma^{2} (their average, as will appear in Eq. (63)).

Taking the derivative of Eq. (61) w.r.t. σ2\sigma^{2} and setting it to zero gives (Tipping & Bishop, 1999b):

\displaystyle\frac{\partial\,\text{Eq.}(\ref{equation_PPCA_log_likelihood_for_sigma})}{\partial\sigma^{2}}=-\frac{n}{2}\big{(}\frac{d-p}{\sigma^{2}}-\frac{1}{\sigma^{4}}\sum_{j=p+1}^{d}\delta_{j}\big{)}\overset{\text{set}}{=}0
σ2=1dpj=p+1dδj.\displaystyle\implies\sigma^{2}=\frac{1}{d-p}\sum_{j=p+1}^{d}\delta_{j}. (62)

4.2.3 Summary of MLE Formulas

In summary, the MLE estimates for the parameters of PPCA are:

σ2=1dpj=p+1dδj,\displaystyle\sigma^{2}=\frac{1}{d-p}\sum_{j=p+1}^{d}\delta_{j}, (63)
𝚲=𝑼(𝚫pσ2𝑰)(1/2)𝑹=𝑼(𝚫pσ2𝑰)(1/2).\displaystyle\boldsymbol{\Lambda}=\boldsymbol{U}(\boldsymbol{\Delta}_{p}-\sigma^{2}\boldsymbol{I})^{(1/2)}\boldsymbol{R}=\boldsymbol{U}(\boldsymbol{\Delta}_{p}-\sigma^{2}\boldsymbol{I})^{(1/2)}. (64)

Note that Eq. (63) is the average variance lost per discarded dimension in the projection by the projection matrix 𝚲\boldsymbol{\Lambda}, and Eq. (64) gives the projection or mapping matrix from the latent space to the data space.
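
These closed-form solutions can be implemented in a few lines. The following sketch is a minimal NumPy illustration (the function and variable names are chosen here, not taken from any reference implementation); it assumes δ_j > σ² for the kept components and uses R = I:

```python
import numpy as np

def ppca_mle(X, p):
    """Closed-form PPCA MLE: X is (n x d); returns (mu, Lambda, sigma2)."""
    n, d = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)        # sample covariance S_x
    delta, U = np.linalg.eigh(S)                  # eigenvalues in ascending order
    delta, U = delta[::-1], U[:, ::-1]            # sort descending
    sigma2 = delta[p:].mean()                     # Eq. (63): average of discarded eigenvalues
    Lambda = U[:, :p] @ np.diag(np.sqrt(delta[:p] - sigma2))  # Eq. (64) with R = I
    return mu, Lambda, sigma2

# usage on random data (for illustration only)
X = np.random.default_rng(1).standard_normal((200, 10))
mu, Lam, s2 = ppca_mle(X, p=3)
```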

4.3 Zero Noise Limit: PCA Is a Special Case of Probabilistic PCA

Recall the posterior which is Eq. (38). According to Eqs. (39), (40), and (50), the posterior in PPCA is:

q(𝒛i)=(𝒛i|𝒙i)=𝒩(\displaystyle q(\boldsymbol{z}_{i})=\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i})=\mathcal{N}\Big{(} 𝚲(𝚲𝚲+σ2𝑰)1(𝒙i𝝁),\displaystyle\boldsymbol{\Lambda}^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu}), (65)
𝑰𝚲(𝚲𝚲+σ2𝑰)1𝚲).\displaystyle\boldsymbol{I}-\boldsymbol{\Lambda}^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{\Lambda}\Big{)}.

Consider zero noise limit where the variance of noise goes to zero:

limσ20(ϵ)=limσ20𝒩(𝟎,σ2𝑰)=𝒩(𝟎,limσ20(σ2)𝑰).\displaystyle\lim_{\sigma^{2}\rightarrow 0}\mathbb{P}(\boldsymbol{\epsilon})=\lim_{\sigma^{2}\rightarrow 0}\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I})=\mathcal{N}(\boldsymbol{0},\lim_{\sigma^{2}\rightarrow 0}(\sigma^{2})\boldsymbol{I}). (66)

In this case, the uncertainty of PPCA almost disappears. In the following, we show that in zero noise limit, PPCA is reduced to PCA (Ghojogh & Crowley, 2019; Jolliffe & Cadima, 2016) and this explains why the PPCA method is a probabilistic approach to PCA.

In the zero noise limit, the posterior becomes:

limσ20q(𝒛i)=limσ20(𝒛i|𝒙i)\displaystyle\lim_{\sigma^{2}\rightarrow 0}q(\boldsymbol{z}_{i})=\lim_{\sigma^{2}\rightarrow 0}\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i}) (67)
=(65)𝒩(𝚲(𝚲𝚲)1(𝒙i𝝁),𝑰𝚲(𝚲𝚲)1𝚲)\displaystyle\overset{(\ref{equation_PPCA_posterior})}{=}\mathcal{N}\Big{(}\boldsymbol{\Lambda}^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top})^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu}),\boldsymbol{I}-\boldsymbol{\Lambda}^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top})^{-1}\boldsymbol{\Lambda}\Big{)}
=(a)𝒩((𝚲𝚲)1𝚲(𝒙i𝝁),𝑰(𝚲𝚲)1𝚲𝚲).\displaystyle\overset{(a)}{=}\mathcal{N}\Big{(}(\boldsymbol{\Lambda}^{\top}\boldsymbol{\Lambda})^{-1}\boldsymbol{\Lambda}^{\top}(\boldsymbol{x}_{i}-\boldsymbol{\mu}),\boldsymbol{I}-(\boldsymbol{\Lambda}^{\top}\boldsymbol{\Lambda})^{-1}\boldsymbol{\Lambda}^{\top}\boldsymbol{\Lambda}\Big{)}.

where (a)(a) is because according to (Roweis, 1997, footnote 4), we have:

𝚲(𝚲𝚲)1=(𝚲𝚲)1𝚲.\displaystyle\boldsymbol{\Lambda}^{\top}(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top})^{-1}=(\boldsymbol{\Lambda}^{\top}\boldsymbol{\Lambda})^{-1}\boldsymbol{\Lambda}^{\top}. (68)

On the other hand, according to (Tipping & Bishop, 1999b, Appendix C), PCA minimizes the reconstruction error as:

minimize𝚲\displaystyle\underset{\boldsymbol{\Lambda}}{\text{minimize}} (𝑿𝝁)𝚲𝚲(𝑿𝝁)F2,\displaystyle||(\boldsymbol{X}-\boldsymbol{\mu})-\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}(\boldsymbol{X}-\boldsymbol{\mu})||_{F}^{2}, (69)

where .F\|.\|_{F} denotes the Frobenius norm of matrix. See (Ghojogh & Crowley, 2019) for more details on minimization of reconstruction error by PCA.

Instead of minimizing the reconstruction error, one may minimize the reconstruction error for the mean of posterior (Tipping & Bishop, 1999b, Appendix C):

minimize𝚲\displaystyle\underset{\boldsymbol{\Lambda}}{\text{minimize}} (𝑿𝝁)𝚲(𝚲𝚲)1𝚲(𝑿𝝁)F2.\displaystyle||(\boldsymbol{X}-\boldsymbol{\mu})-\boldsymbol{\Lambda}(\boldsymbol{\Lambda}^{\top}\boldsymbol{\Lambda})^{-1}\boldsymbol{\Lambda}^{\top}(\boldsymbol{X}-\boldsymbol{\mu})||_{F}^{2}. (70)

This is the minimization of the reconstruction error after projection by (𝚲𝚲)1𝚲(\boldsymbol{\Lambda}^{\top}\boldsymbol{\Lambda})^{-1}\boldsymbol{\Lambda}^{\top}. According to the posterior in the zero noise limit (see Eq. (67)), this projection is exactly the posterior mean of PPCA when σ20\sigma^{2}\rightarrow 0. Hence, PCA is a deterministic special case of PPCA where the variance of the noise goes to zero.
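
This limit can also be observed numerically: as σ² shrinks, the PPCA posterior mean of Eq. (65) approaches the least-squares projection (𝚲^⊤𝚲)^{-1}𝚲^⊤(𝒙−𝝁). A small NumPy check (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, p = 6, 2
Lambda = rng.standard_normal((d, p))
x_centered = rng.standard_normal(d)   # plays the role of (x_i - mu)

for sigma2 in [1.0, 1e-2, 1e-6]:
    C = Lambda @ Lambda.T + sigma2 * np.eye(d)
    ppca_mean = Lambda.T @ np.linalg.solve(C, x_centered)              # mean in Eq. (65)
    pca_proj = np.linalg.solve(Lambda.T @ Lambda, Lambda.T @ x_centered)
    print(sigma2, np.linalg.norm(ppca_mean - pca_proj))                # difference shrinks to ~0
```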

4.4 Other Variants of Probabilistic PCA

There exist some other variants of PPCA. We briefly mention them here. PPCA has a hyperparameter which is the dimensionality of latent space pp. Bayesian PCA (Bishop, 1999) models this hyperparameter as another latent random variable which is learned during the EM training.

According to Eqs. (28), (29), and (51), note that PPCA uses Gaussian distributions. PPCA with the Student-t distribution (Zhao & Jiang, 2006) replaces these Gaussian distributions with t distributions. This change may improve the embedding of PPCA because of the heavier tails of the Student-t distribution compared to the Gaussian distribution; heavier tails alleviate the crowding problem, which also motivated the proposal of t-SNE (Ghojogh et al., 2020a).

Sparse PPCA (Guan & Dy, 2009; Mattei et al., 2016) has inserted sparsity into PPCA. Supervised PPCA (Yu et al., 2006) makes use of class labels in the formulation of PPCA. Mixture of PPCA (Tipping & Bishop, 1999a) uses a mixture of distributions in the formulation of PPCA; it trains the parameters of a mixture distribution using EM (Ghojogh et al., 2019a). Generalized PPCA for correlated data is another recent variant of PPCA (Gu & Shen, 2020).

5 Variational Autoencoder

Variational Autoencoder (VAE) (Kingma & Welling, 2014) applies variational inference, i.e., maximizes the ELBO, but in an autoencoder setup and makes it differentiable for the backpropagation training (Rumelhart et al., 1986). As Fig. 4 shows, VAE includes an encoder and a decoder, each of which can have several network layers. A latent space is learned between the encoder and decoder. The latent variable 𝒛i\boldsymbol{z}_{i} is sampled from the latent space. In the following, we explain the encoder and decoder parts. The input of encoder in VAE is the data point 𝒙i\boldsymbol{x}_{i} and the output of decoder in VAE is its reconstruction 𝒙i\boldsymbol{x}_{i}.

Figure 4: The structure of a variational autoencoder.

5.1 Parts of Variational Autoencoder

5.1.1 Encoder of Variational Autoencoder

The encoder of VAE models the distribution q(𝒛i)=(𝒛i|𝒙i,𝜽e)q(\boldsymbol{z}_{i})=\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}) where the parameters of distribution 𝜽e\boldsymbol{\theta}_{e} are the weights of encoder layers in VAE. The input and output of encoder are 𝒙id\boldsymbol{x}_{i}\in\mathbb{R}^{d} and 𝒛ip\boldsymbol{z}_{i}\in\mathbb{R}^{p}, respectively. As Fig. 4 depicts, the output neurons of encoder are supposed to determine the parameters of the conditional distribution (𝒛i|𝒙i,𝜽e)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}). If this conditional distribution has mm number of parameters, we have mm sets of output neurons from the encoder, denoted by {𝒆j}j=1m\{\boldsymbol{e}_{j}\}_{j=1}^{m}. The dimensionality of these sets may differ depending on the size of the parameters.

For example, let the latent space be pp-dimensional, i.e., 𝒛ip\boldsymbol{z}_{i}\in\mathbb{R}^{p}. If the distribution (𝒛i|𝒙i,𝜽e)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}) is a multivariate Gaussian distribution, we have two sets of output neurons for the encoder, where one set has pp neurons for the mean of this distribution 𝝁z|x=𝒆1p\boldsymbol{\mu}_{z|x}=\boldsymbol{e}_{1}\in\mathbb{R}^{p} and the other set has (p×p)(p\times p) neurons for the covariance of this distribution 𝚺z|x=matrix form of 𝒆2p×p\boldsymbol{\Sigma}_{z|x}=\text{matrix form of }\boldsymbol{e}_{2}\in\mathbb{R}^{p\times p}. If the covariance matrix is diagonal, the second set has pp neurons rather than (p×p)(p\times p) neurons. In this case, we have 𝚺z|x=diag(𝒆2)p×p\boldsymbol{\Sigma}_{z|x}=\textbf{diag}(\boldsymbol{e}_{2})\in\mathbb{R}^{p\times p}. Any distribution with any number of parameters can be chosen for (𝒛i|𝒙i,𝜽e)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}) but the multivariate Gaussian with diagonal covariance is the most common choice:

q(𝒛i)=(𝒛i|𝒙i,𝜽e)=𝒩(𝒛i|𝝁z|x,𝚺z|x).\displaystyle q(\boldsymbol{z}_{i})=\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})=\mathcal{N}(\boldsymbol{z}_{i}\,|\,\boldsymbol{\mu}_{z|x},\boldsymbol{\Sigma}_{z|x}). (71)

Let the network weights for the output sets of encoder, {𝒆j}j=1m\{\boldsymbol{e}_{j}\}_{j=1}^{m}, be denoted by {𝜽e,j}j=1m\{\boldsymbol{\theta}_{e,j}\}_{j=1}^{m}. As the input of encoder is 𝒙i\boldsymbol{x}_{i}, the jj-th output set of encoder can be written as 𝒆j(𝒙i,𝜽e,j)\boldsymbol{e}_{j}(\boldsymbol{x}_{i},\boldsymbol{\theta}_{e,j}). In the case of multivariate Gaussian distribution for the latent space, the parameters are 𝝁z|x=𝒆1(𝒙i,𝜽e,1)\boldsymbol{\mu}_{z|x}=\boldsymbol{e}_{1}(\boldsymbol{x}_{i},\boldsymbol{\theta}_{e,1}) and 𝚺z|x=diag(𝒆2(𝒙i,𝜽e,2))\boldsymbol{\Sigma}_{z|x}=\textbf{diag}(\boldsymbol{e}_{2}(\boldsymbol{x}_{i},\boldsymbol{\theta}_{e,2})).
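
A minimal sketch of such an encoder, assuming PyTorch and a single hidden layer (the layer sizes and names are illustrative choices, not taken from the original paper), with two output heads 𝒆_1 and 𝒆_2 producing 𝝁_{z|x} and the log of the diagonal entries of 𝚺_{z|x}:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, d, p, hidden=128):
        super().__init__()
        self.hidden = nn.Linear(d, hidden)
        self.e1 = nn.Linear(hidden, p)   # mean head: mu_{z|x}
        self.e2 = nn.Linear(hidden, p)   # log-variance head: diagonal of Sigma_{z|x}

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        mu = self.e1(h)
        logvar = self.e2(h)              # log of the diagonal covariance entries
        return mu, logvar
```

Predicting log-variances rather than variances is a common design choice because it keeps the implied covariance entries positive without any constraint on the output layer.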

5.1.2 Sampling the Latent Variable

When the data point 𝒙i\boldsymbol{x}_{i} is fed as input to the encoder, the parameters of the conditional distribution q(𝒛i)q(\boldsymbol{z}_{i}) are obtained; hence, the distribution of latent space, which is q(𝒛i)q(\boldsymbol{z}_{i}), is determined corresponding to the data point 𝒙i\boldsymbol{x}_{i}. Now, in the latent space, we sample the corresponding latent variable from the distribution of latent space:

𝒛iq(𝒛i)=(𝒛i|𝒙i,𝜽e).\displaystyle\boldsymbol{z}_{i}\sim q(\boldsymbol{z}_{i})=\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}). (72)

This latent variable is fed as input to the decoder which is explained in the following.

5.1.3 Decoder of Variational Autoencoder

As Fig. 4 shows, the decoder of VAE models the conditional distribution (𝒙i|𝒛i,𝜽d)\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\theta}_{d}) where 𝜽d\boldsymbol{\theta}_{d} are the weights of decoder layers in VAE. The input and output of decoder are 𝒛ip\boldsymbol{z}_{i}\in\mathbb{R}^{p} and 𝒙id\boldsymbol{x}_{i}\in\mathbb{R}^{d}, respectively. The output neurons of decoder are supposed to either generate the reconstructed data point or determine the parameters of the conditional distribution (𝒙i|𝒛i,𝜽d)\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\theta}_{d}); the former is more common. In the latter case, if this conditional distribution has ll parameters, we have ll sets of output neurons from the decoder, denoted by {𝒅j}j=1l\{\boldsymbol{d}_{j}\}_{j=1}^{l}. The dimensionality of these sets may differ depending on the size of each parameter. As with the encoder, the multivariate Gaussian distribution is a common choice for the decoder. Let the network weights for the output sets of decoder, {𝒅j}j=1l\{\boldsymbol{d}_{j}\}_{j=1}^{l}, be denoted by {𝜽d,j}j=1l\{\boldsymbol{\theta}_{d,j}\}_{j=1}^{l}. As the input of decoder is 𝒛i\boldsymbol{z}_{i}, the jj-th output set of decoder can be written as 𝒅j(𝒛i,𝜽d,j)\boldsymbol{d}_{j}(\boldsymbol{z}_{i},\boldsymbol{\theta}_{d,j}).
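
A matching decoder sketch, again an illustrative PyTorch module rather than a reference implementation, whose output directly generates the reconstructed point (the more common of the two options above):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, p, d, hidden=128):
        super().__init__()
        self.hidden = nn.Linear(p, hidden)
        self.out = nn.Linear(hidden, d)  # reconstructed data point in R^d

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        return self.out(h)
```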

5.2 Training Variational Autoencoder with Expectation Maximization

We use EM for training the VAE. Recall Eqs. (8) and (9) for EM in variational inference. Inspired by that, VAE uses EM for training where the ELBO is a function of encoder weights 𝜽e\boldsymbol{\theta}_{e}, decoder weights 𝜽d\boldsymbol{\theta}_{d}, and data point 𝒙i\boldsymbol{x}_{i}:

\displaystyle\text{E-step:}~{}~{}~{}~{}~{}\boldsymbol{\theta}_{e}^{(t)}:=\arg\max_{\boldsymbol{\theta}_{e}}~{}~{}~{}~{}\mathcal{L}(\boldsymbol{\theta}_{e},\boldsymbol{\theta}_{d}^{(t-1)},\boldsymbol{x}_{i}), (73)
\displaystyle\text{M-step:}~{}~{}~{}~{}~{}\boldsymbol{\theta}_{d}^{(t)}:=\arg\max_{\boldsymbol{\theta}_{d}}~{}~{}~{}~{}\mathcal{L}(\boldsymbol{\theta}_{e}^{(t)},\boldsymbol{\theta}_{d},\boldsymbol{x}_{i}). (74)

We can simplify this iterative optimization algorithm by alternating optimization (Jain & Kar, 2017) where we take a step of gradient ascent optimization in every iteration. We consider mini-batch stochastic gradient ascent and take training data in batches where bb denotes the mini-batch size. Hence, the optimization is:

E-step:𝜽e(t):=𝜽e(t1)+ηei=1b(𝜽e,𝜽d(t1),𝒙i)𝜽e,\displaystyle\text{E-step:}~{}~{}~{}~{}~{}\boldsymbol{\theta}_{e}^{(t)}:=\boldsymbol{\theta}_{e}^{(t-1)}+\eta_{e}\frac{\partial\sum_{i=1}^{b}\mathcal{L}(\boldsymbol{\theta}_{e},\boldsymbol{\theta}_{d}^{(t-1)},\boldsymbol{x}_{i})}{\partial\boldsymbol{\theta}_{e}}, (75)
M-step:𝜽d(t):=𝜽d(t1)+ηdi=1b(𝜽e(t),𝜽d,𝒙i)𝜽d,\displaystyle\text{M-step:}~{}~{}~{}~{}~{}\boldsymbol{\theta}_{d}^{(t)}:=\boldsymbol{\theta}_{d}^{(t-1)}+\eta_{d}\frac{\partial\sum_{i=1}^{b}\mathcal{L}(\boldsymbol{\theta}_{e}^{(t)},\boldsymbol{\theta}_{d},\boldsymbol{x}_{i})}{\partial\boldsymbol{\theta}_{d}}, (76)

where ηe\eta_{e} and ηd\eta_{d} are the learning rates for 𝜽e\boldsymbol{\theta}_{e} and 𝜽d\boldsymbol{\theta}_{d}, respectively.
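
A minimal sketch of these alternating updates in PyTorch, assuming the encoder and decoder modules sketched earlier, an available mini-batch `dataloader`, and a hypothetical helper `elbo(x, encoder, decoder)` that returns Σ_i L̃ for the batch (none of these names come from the original paper). Two separate optimizers update the encoder weights in the E-step and the decoder weights in the M-step; gradient ascent on the ELBO is implemented as descent on its negative:

```python
import torch

# encoder, decoder: modules as sketched above; elbo, dataloader: assumed helpers
opt_e = torch.optim.SGD(encoder.parameters(), lr=1e-3)   # learning rate eta_e
opt_d = torch.optim.SGD(decoder.parameters(), lr=1e-3)   # learning rate eta_d

for x in dataloader:                       # mini-batches of size b
    # E-step: one gradient-ascent step on the ELBO w.r.t. the encoder weights
    opt_e.zero_grad()
    (-elbo(x, encoder, decoder)).backward()
    opt_e.step()

    # M-step: one gradient-ascent step on the ELBO w.r.t. the decoder weights
    opt_d.zero_grad()
    (-elbo(x, encoder, decoder)).backward()
    opt_d.step()
```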

The ELBO is simplified as:

i=1b(q,𝜽)=(4)i=1bKL(q(𝒛i)(𝒙i,𝒛i|𝜽d))\displaystyle\sum_{i=1}^{b}\mathcal{L}(q,\boldsymbol{\theta})\overset{(\ref{ELBO_equation1})}{=}-\sum_{i=1}^{b}\text{KL}\big{(}q(\boldsymbol{z}_{i})\,\|\,\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}_{d})\big{)}
=(12)i=1bKL((𝒛i|𝒙i,𝜽e)(𝒙i,𝒛i|𝜽d)).\displaystyle~{}~{}~{}~{}~{}~{}~{}\overset{(\ref{equation_E_step_variationalInference})}{=}-\sum_{i=1}^{b}\text{KL}\big{(}\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\,\|\,\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}_{d})\big{)}. (77)

Note that the parameter of (𝒙i,𝒛i|𝜽d)\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}_{d}) is 𝜽d\boldsymbol{\theta}_{d} because 𝒛i\boldsymbol{z}_{i} is generated after the encoder and before the decoder.

There are different ways for approximating the KL divergence in Eq. (77) (Hershey & Olsen, 2007; Duchi, 2007). We can simplify the ELBO in at least two different ways which are explained in the following.

5.2.1 Simplification Type 1

We continue the simplification of ELBO:

i=1b(q,𝜽)=i=1bKL((𝒛i|𝒙i,𝜽e)(𝒙i,𝒛i|𝜽d))\displaystyle\sum_{i=1}^{b}\mathcal{L}(q,\boldsymbol{\theta})=-\sum_{i=1}^{b}\text{KL}\big{(}\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\,\|\,\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}_{d})\big{)}
=i=1b𝔼q(t1)(𝒛i)[log((𝒛i|𝒙i,𝜽e)(𝒙i,𝒛i|𝜽d))]\displaystyle=-\sum_{i=1}^{b}\mathbb{E}_{\sim q^{(t-1)}(\boldsymbol{z}_{i})}\Big{[}\log\big{(}\frac{\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}{\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}_{d})}\big{)}\Big{]}
=i=1b𝔼(𝒛i|𝒙i,𝜽e)[log((𝒛i|𝒙i,𝜽e)(𝒙i,𝒛i|𝜽d))].\displaystyle=-\sum_{i=1}^{b}\mathbb{E}_{\sim\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}\Big{[}\log\big{(}\frac{\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}{\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}_{d})}\big{)}\Big{]}. (78)

This expectation can be approximated using Monte Carlo approximation (Ghojogh et al., 2020b) where we draw \ell samples {𝒛i,j}j=1\{\boldsymbol{z}_{i,j}\}_{j=1}^{\ell}, corresponding to the ii-th data point, from the conditional distribution as:

𝒛i,j(𝒛i|𝒙i,𝜽e),j{1,,}.\displaystyle\boldsymbol{z}_{i,j}\sim\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}),\quad\forall j\in\{1,\dots,\ell\}. (79)

Monte Carlo approximation (Ghojogh et al., 2020b), in general, approximates expectation as:

𝔼(𝒛i|𝒙i,𝜽e)[f(𝒛i)]1j=1f(𝒛i,j),\displaystyle\mathbb{E}_{\sim\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}\big{[}f(\boldsymbol{z}_{i})\big{]}\approx\frac{1}{\ell}\sum_{j=1}^{\ell}f(\boldsymbol{z}_{i,j}), (80)

where f(𝒛i)f(\boldsymbol{z}_{i}) is a function of 𝒛i\boldsymbol{z}_{i}. Here, the approximation is:

i=1b(q,𝜽)i=1b~(q,𝜽)\displaystyle\sum_{i=1}^{b}\mathcal{L}(q,\boldsymbol{\theta})\approx\sum_{i=1}^{b}\widetilde{\mathcal{L}}(q,\boldsymbol{\theta})
=i=1b1j=1log((𝒛i,j|𝒙i,𝜽e)(𝒙i,𝒛i,j|𝜽d))\displaystyle=-\sum_{i=1}^{b}\frac{1}{\ell}\sum_{j=1}^{\ell}\log\big{(}\frac{\mathbb{P}(\boldsymbol{z}_{i,j}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}{\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i,j}\,|\,\boldsymbol{\theta}_{d})}\big{)}
=i=1b1j=1[log((𝒙i,𝒛i,j|𝜽d))\displaystyle=\sum_{i=1}^{b}\frac{1}{\ell}\sum_{j=1}^{\ell}\!\Big{[}\log\big{(}\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i,j}\,|\,\boldsymbol{\theta}_{d})\big{)}
log((𝒛i,j|𝒙i,𝜽e))].\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\log\big{(}\mathbb{P}(\boldsymbol{z}_{i,j}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\big{)}\Big{]}. (81)
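
As a quick illustration of the Monte Carlo approximation in Eq. (80), the following NumPy snippet (toy values only) approximates E[f(z)] for f(z) = z² under a univariate Gaussian, where the exact value is μ² + σ²:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, ell = 1.5, 0.7, 100_000
z = rng.normal(mu, sigma, size=ell)        # ell samples z_{i,j}
print(np.mean(z**2), mu**2 + sigma**2)     # Monte Carlo estimate vs. exact value
```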

5.2.2 Simplification Type 2

We can simplify the ELBO using another approach:

i=1b(q,𝜽)=i=1bKL((𝒛i|𝒙i,𝜽e)(𝒙i,𝒛i|𝜽d))\displaystyle\sum_{i=1}^{b}\mathcal{L}(q,\boldsymbol{\theta})=-\sum_{i=1}^{b}\text{KL}\big{(}\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\,\|\,\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}_{d})\big{)}
=i=1b(𝒛i|𝒙i,𝜽e)log((𝒛i|𝒙i,𝜽e)(𝒙i,𝒛i|𝜽d))𝑑𝒛i\displaystyle=-\sum_{i=1}^{b}\int\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\log\big{(}\frac{\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}{\mathbb{P}(\boldsymbol{x}_{i},\boldsymbol{z}_{i}\,|\,\boldsymbol{\theta}_{d})}\big{)}\,d\boldsymbol{z}_{i}
=i=1b(𝒛i|𝒙i,𝜽e)log((𝒛i|𝒙i,𝜽e)(𝒙i|𝒛i,𝜽d)(𝒛i))𝑑𝒛i\displaystyle=-\sum_{i=1}^{b}\int\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\log\big{(}\frac{\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}{\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\theta}_{d})\,\mathbb{P}(\boldsymbol{z}_{i})}\big{)}\,d\boldsymbol{z}_{i}
=i=1b(𝒛i|𝒙i,𝜽e)log((𝒛i|𝒙i,𝜽e)(𝒛i))𝑑𝒛i\displaystyle=-\sum_{i=1}^{b}\int\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\log\big{(}\frac{\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}{\mathbb{P}(\boldsymbol{z}_{i})}\big{)}\,d\boldsymbol{z}_{i}
+i=1b(𝒛i|𝒙i,𝜽e)log((𝒙i|𝒛i,𝜽d))𝑑𝒛i\displaystyle~{}~{}~{}~{}~{}+\sum_{i=1}^{b}\int\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\log\big{(}\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\theta}_{d})\big{)}\,d\boldsymbol{z}_{i}
=i=1bKL((𝒛i|𝒙i,𝜽e)(𝒛i))\displaystyle=-\sum_{i=1}^{b}\text{KL}\big{(}\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\,\|\,\mathbb{P}(\boldsymbol{z}_{i})\big{)}
+i=1b𝔼(𝒛i|𝒙i,𝜽e)[log((𝒙i|𝒛i,𝜽d))].\displaystyle~{}~{}~{}~{}~{}+\sum_{i=1}^{b}\mathbb{E}_{\sim\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}\Big{[}\log\big{(}\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i},\boldsymbol{\theta}_{d})\big{)}\Big{]}. (82)

The second term in the above equation can be estimated using Monte Carlo approximation (Ghojogh et al., 2020b) where we draw \ell samples {𝒛i,j}j=1\{\boldsymbol{z}_{i,j}\}_{j=1}^{\ell} from (𝒛i|𝒙i,𝜽e)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}):

i=1b(q,𝜽)i=1b~(q,𝜽)\displaystyle\sum_{i=1}^{b}\mathcal{L}(q,\boldsymbol{\theta})\approx\sum_{i=1}^{b}\widetilde{\mathcal{L}}(q,\boldsymbol{\theta})
=i=1bKL((𝒛i|𝒙i,𝜽e)(𝒛i))\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}=-\sum_{i=1}^{b}\text{KL}\big{(}\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\,\|\,\mathbb{P}(\boldsymbol{z}_{i})\big{)}
+i=1b1j=1log((𝒙i|𝒛i,j,𝜽d)).\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\sum_{i=1}^{b}\frac{1}{\ell}\sum_{j=1}^{\ell}\log\big{(}\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i,j},\boldsymbol{\theta}_{d})\big{)}. (83)

The first term in the above equation can be converted to an expectation and then computed using Monte Carlo approximation (Ghojogh et al., 2020b) again, where we draw \ell samples {𝒛i,j}j=1\{\boldsymbol{z}_{i,j}\}_{j=1}^{\ell} from (𝒛i|𝒙i,𝜽e)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}):

i=1b(q,𝜽)i=1b~(q,𝜽)\displaystyle\sum_{i=1}^{b}\mathcal{L}(q,\boldsymbol{\theta})\approx\sum_{i=1}^{b}\widetilde{\mathcal{L}}(q,\boldsymbol{\theta})
=i=1b𝔼(𝒛i|𝒙i,𝜽e)[log((𝒛i|𝒙i,𝜽e)(𝒛i))]\displaystyle~{}~{}~{}=-\sum_{i=1}^{b}\mathbb{E}_{\sim\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}\Big{[}\log\big{(}\frac{\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}{\mathbb{P}(\boldsymbol{z}_{i})}\big{)}\Big{]}
+i=1b1j=1log((𝒙i|𝒛i,j,𝜽d))\displaystyle~{}~{}~{}~{}~{}~{}~{}+\sum_{i=1}^{b}\frac{1}{\ell}\sum_{j=1}^{\ell}\log\big{(}\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i,j},\boldsymbol{\theta}_{d})\big{)}
i=1b1j=1log((𝒛i,j|𝒙i,𝜽e))log((𝒛i,j))\displaystyle~{}~{}~{}\approx-\sum_{i=1}^{b}\frac{1}{\ell}\sum_{j=1}^{\ell}\log\big{(}\mathbb{P}(\boldsymbol{z}_{i,j}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\big{)}-\log\big{(}\mathbb{P}(\boldsymbol{z}_{i,j})\big{)}
+i=1b1j=1log((𝒙i|𝒛i,j,𝜽d)).\displaystyle~{}~{}~{}~{}~{}~{}~{}+\sum_{i=1}^{b}\frac{1}{\ell}\sum_{j=1}^{\ell}\log\big{(}\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i,j},\boldsymbol{\theta}_{d})\big{)}. (84)

In case we have some families of distributions, such as Gaussian distributions, for (𝒛i,j|𝒙i,𝜽e)\mathbb{P}(\boldsymbol{z}_{i,j}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}) and (𝒛i,j)\mathbb{P}(\boldsymbol{z}_{i,j}), the first term in Eq. (83) can be computed analytically. In the following, we simplify Eq. (83) further for Gaussian distributions.

5.2.3 Simplification Type 2 for Special Case of Gaussian Distributions

We can compute the KL divergence in the first term of Eq. (83) analytically for univariate or multivariate Gaussian distributions. For this, we need the following two lemmas.

Lemma 1.

The KL divergence between two univariate Gaussian distributions p1𝒩(μ1,σ12)p_{1}\sim\mathcal{N}(\mu_{1},\sigma_{1}^{2}) and p2𝒩(μ2,σ22)p_{2}\sim\mathcal{N}(\mu_{2},\sigma_{2}^{2}) is:

KL(p1p2)=log(σ2σ1)+σ12+(μ1μ2)22σ2212.\displaystyle\text{KL}(p_{1}\|p_{2})=\log(\frac{\sigma_{2}}{\sigma_{1}})+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2}. (85)
Proof.

See Appendix A for proof. ∎

Lemma 2.

The KL divergence between two multivariate Gaussian distributions p1𝒩(𝛍1,𝚺1)p_{1}\sim\mathcal{N}(\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}_{1}) and p2𝒩(𝛍2,𝚺2)p_{2}\sim\mathcal{N}(\boldsymbol{\mu}_{2},\boldsymbol{\Sigma}_{2}) with dimensionality pp is:

KL(p1p2)\displaystyle\text{KL}(p_{1}\|p_{2}) =12(log(|𝚺2||𝚺1|)p+tr(𝚺21𝚺1)\displaystyle=\frac{1}{2}\Big{(}\log(\frac{|\boldsymbol{\Sigma}_{2}|}{|\boldsymbol{\Sigma}_{1}|})-p+\textbf{tr}(\boldsymbol{\Sigma}_{2}^{-1}\boldsymbol{\Sigma}_{1})
+(𝝁2𝝁1)𝚺21(𝝁2𝝁1)).\displaystyle~{}~{}~{}~{}~{}+(\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1})^{\top}\boldsymbol{\Sigma}_{2}^{-1}(\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1})\Big{)}. (86)
Proof.

See (Duchi, 2007, Section 9) for proof. ∎
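
The closed form of Lemma 2 is straightforward to evaluate; a short NumPy sketch (illustrative only, with a helper name chosen here) is:

```python
import numpy as np

def kl_gaussians(mu1, Sigma1, mu2, Sigma2):
    """KL( N(mu1, Sigma1) || N(mu2, Sigma2) ) via Eq. (86)."""
    p = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)) - p
                  + np.trace(Sigma2_inv @ Sigma1)
                  + diff @ Sigma2_inv @ diff)
```

Setting mu2 to zero and Sigma2 to the identity recovers the KL term used in the VAE objective below.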

Consider the case in which we have:

(𝒛i|𝒙i,𝜽e)𝒩(𝝁z|x,𝚺z|x),\displaystyle\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})\sim\mathcal{N}(\boldsymbol{\mu}_{z|x},\boldsymbol{\Sigma}_{z|x}), (87)
(𝒛i)𝒩(𝝁z,𝚺z),\displaystyle\mathbb{P}(\boldsymbol{z}_{i})\sim\mathcal{N}(\boldsymbol{\mu}_{z},\boldsymbol{\Sigma}_{z}), (88)

where 𝒛ip\boldsymbol{z}_{i}\in\mathbb{R}^{p}. Note that the parameters 𝝁z|x\boldsymbol{\mu}_{z|x} and 𝚺z|x\boldsymbol{\Sigma}_{z|x} are learned by the encoder network, while the parameters of the prior (𝒛i)\mathbb{P}(\boldsymbol{z}_{i}) can be set to 𝝁z=𝟎\boldsymbol{\mu}_{z}=\boldsymbol{0} and 𝚺z=𝑰\boldsymbol{\Sigma}_{z}=\boldsymbol{I}, inspired by Eq. (29) in factor analysis. According to Lemma 2, the approximation of ELBO, i.e., Eq. (83), can be simplified to:

i=1b(q,𝜽)i=1b~(q,𝜽)\displaystyle\sum_{i=1}^{b}\mathcal{L}(q,\boldsymbol{\theta})\approx\sum_{i=1}^{b}\widetilde{\mathcal{L}}(q,\boldsymbol{\theta})
=i=1b12(log(|𝚺z||𝚺z|x|)p+tr(𝚺z1𝚺z|x)\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}=-\sum_{i=1}^{b}\frac{1}{2}\Big{(}\log(\frac{|\boldsymbol{\Sigma}_{z}|}{|\boldsymbol{\Sigma}_{z|x}|})-p+\textbf{tr}(\boldsymbol{\Sigma}_{z}^{-1}\boldsymbol{\Sigma}_{z|x})
+(𝝁z𝝁z|x)𝚺z1(𝝁z𝝁z|x))\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+(\boldsymbol{\mu}_{z}-\boldsymbol{\mu}_{z|x})^{\top}\boldsymbol{\Sigma}_{z}^{-1}(\boldsymbol{\mu}_{z}-\boldsymbol{\mu}_{z|x})\Big{)}
+i=1b1j=1log((𝒙i|𝒛i,j,𝜽d)).\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\sum_{i=1}^{b}\frac{1}{\ell}\sum_{j=1}^{\ell}\log\big{(}\mathbb{P}(\boldsymbol{x}_{i}\,|\,\boldsymbol{z}_{i,j},\boldsymbol{\theta}_{d})\big{)}. (89)
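
With 𝝁_z = 0, 𝚺_z = I, and a diagonal 𝚺_{z|x}, the KL term in Eq. (89) reduces to ½Σ_k(σ_k² + μ_k² − 1 − log σ_k²). The following PyTorch sketch of the resulting negative ELBO is illustrative only; it assumes ℓ = 1 sample per data point and a Gaussian decoder with fixed unit variance, so that the reconstruction term becomes a squared error up to additive constants:

```python
import torch

def negative_elbo(x, x_hat, mu, logvar):
    """-ELBO for one mini-batch; x_hat is the decoder output for the sampled z (ell = 1)."""
    # closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the batch
    kl = 0.5 * torch.sum(torch.exp(logvar) + mu**2 - 1.0 - logvar)
    # Gaussian log-likelihood with unit variance reduces to a squared error (up to constants)
    recon = 0.5 * torch.sum((x - x_hat)**2)
    return kl + recon
```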

5.2.4 Training Variational Autoencoder with Approximations

We can train VAE with EM, where Monte Carlo approximations are applied to ELBO. The Eqs. (75) and (76) are replaced by the following equations:

E-step:𝜽e(t):=𝜽e(t1)+ηei=1b~(𝜽e,𝜽d(t1),𝒙i)𝜽e,\displaystyle\text{E-step:}~{}~{}~{}~{}~{}\boldsymbol{\theta}_{e}^{(t)}:=\boldsymbol{\theta}_{e}^{(t-1)}+\eta_{e}\frac{\partial\sum_{i=1}^{b}\widetilde{\mathcal{L}}(\boldsymbol{\theta}_{e},\boldsymbol{\theta}_{d}^{(t-1)},\boldsymbol{x}_{i})}{\partial\boldsymbol{\theta}_{e}}, (90)
M-step:𝜽d(t):=𝜽d(t1)+ηdi=1b~(𝜽e(t),𝜽d,𝒙i)𝜽d,\displaystyle\text{M-step:}~{}~{}~{}~{}~{}\boldsymbol{\theta}_{d}^{(t)}:=\boldsymbol{\theta}_{d}^{(t-1)}+\eta_{d}\frac{\partial\sum_{i=1}^{b}\widetilde{\mathcal{L}}(\boldsymbol{\theta}_{e}^{(t)},\boldsymbol{\theta}_{d},\boldsymbol{x}_{i})}{\partial\boldsymbol{\theta}_{d}}, (91)

where the approximated ELBO was introduced in previous sections.

5.2.5 Prior Regularization

Some works regularize the ELBO, Eq. (77), with a penalty on the prior distribution (𝒛i)\mathbb{P}(\boldsymbol{z}_{i}). Using this, we guide the learned distribution of latent space (𝒛i|𝒙i,𝜽e)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}) to have a specific prior distribution (𝒛i)\mathbb{P}(\boldsymbol{z}_{i}). Some examples for prior regularization in VAE are geodesic priors (Hadjeres et al., 2017) and optimal priors (Takahashi et al., 2019). Note that this regularization can inject domain knowledge to the latent space. It can also be useful for making the latent space more interpretable.

5.3 The Reparametrization Trick

Sampling the \ell samples for the latent variables, i.e., Eq. (72), blocks the gradient flow because sampling is not a differentiable operation, and estimating the gradient through (𝒛i|𝒙i,𝜽e)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}) (e.g., with a score-function estimator) has high variance. In order to overcome this problem, we use the reparameterization technique (Kingma & Welling, 2014; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014). In this technique, instead of sampling 𝒛i(𝒛i|𝒙i,𝜽e)\boldsymbol{z}_{i}\sim\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}), we assume 𝒛i\boldsymbol{z}_{i} is a random variable that is a deterministic function of another random variable ϵi\boldsymbol{\epsilon}_{i} as follows:

𝒛i=g(ϵi,𝒙i,𝜽e),\displaystyle\boldsymbol{z}_{i}=g(\boldsymbol{\epsilon}_{i},\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}), (92)

where ϵi\boldsymbol{\epsilon}_{i} is a stochastic variable sampled from a distribution as:

ϵi(ϵ).\displaystyle\boldsymbol{\epsilon}_{i}\sim\mathbb{P}(\boldsymbol{\epsilon}). (93)

Eqs. (78) and (82) both contain an expectation of a function f(𝒛i)f(\boldsymbol{z}_{i}). Using this technique, this expectation is replaced as:

\displaystyle\mathbb{E}_{\sim\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e})}[f(\boldsymbol{z}_{i})]\rightarrow\mathbb{E}_{\sim\mathbb{P}(\boldsymbol{\epsilon})}\big{[}f(g(\boldsymbol{\epsilon}_{i},\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}))\big{]}. (94)

Using the reparameterization technique, the encoder, which implemented (𝒛i|𝒙i,𝜽e)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}), is replaced by g(ϵi,𝒙i,𝜽e)g(\boldsymbol{\epsilon}_{i},\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}) where in the latent space between encoder and decoder, we have ϵi(ϵ)\boldsymbol{\epsilon}_{i}\sim\mathbb{P}(\boldsymbol{\epsilon}) and 𝒛i=g(ϵi,𝒙i,𝜽e)\boldsymbol{z}_{i}=g(\boldsymbol{\epsilon}_{i},\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}).

A simple example for the reparameterization technique is when ziz_{i} and ϵi\epsilon_{i} are univariate Gaussian variables:

zi𝒩(μ,σ2),\displaystyle z_{i}\sim\mathcal{N}(\mu,\sigma^{2}),
ϵi𝒩(0,1),\displaystyle\epsilon_{i}\sim\mathcal{N}(0,1),
zi=g(ϵi)=μ+σϵi.\displaystyle z_{i}=g(\epsilon_{i})=\mu+\sigma\epsilon_{i}.
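
In code, this elementwise (diagonal-Gaussian) reparameterization is the usual way of sampling 𝒛_i in a VAE; a short illustrative PyTorch sketch is:

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I); gradients flow through mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)        # eps_i ~ P(eps) = N(0, I)
    return mu + std * eps
```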

For some more advanced reparameterization techniques, the reader can refer to (Figurnov et al., 2018).

5.4 Training Variational Autoencoder with Backpropagation

In practice, VAE is trained by backpropagation (Rezende et al., 2014), where the backpropagation algorithm (Rumelhart et al., 1986) is used for training the weights of the network. Recall that in training VAE with EM, the encoder and decoder are trained separately using the E-step and the M-step of EM, respectively. However, in training VAE with backpropagation, the whole network is trained together and not in separate steps. Suppose the whole set of weights of VAE is denoted by 𝜽:={𝜽e,𝜽d}\boldsymbol{\theta}:=\{\boldsymbol{\theta}_{e},\boldsymbol{\theta}_{d}\}. Backpropagation trains VAE using mini-batch stochastic gradient descent with the negative ELBO, i=1b~(𝜽,𝒙i)\sum_{i=1}^{b}-\widetilde{\mathcal{L}}(\boldsymbol{\theta},\boldsymbol{x}_{i}), as the loss function:

𝜽(t):=𝜽(t1)ηi=1b~(𝜽,𝒙i)𝜽,\displaystyle\boldsymbol{\theta}^{(t)}:=\boldsymbol{\theta}^{(t-1)}-\eta\frac{\partial\sum_{i=1}^{b}-\widetilde{\mathcal{L}}(\boldsymbol{\theta},\boldsymbol{x}_{i})}{\partial\boldsymbol{\theta}}, (95)

where η\eta is the learning rate. Note that we are minimizing here because neural networks usually minimize the loss function.
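
A minimal end-to-end training sketch with backpropagation, reusing the illustrative Encoder, Decoder, reparameterize, and negative_elbo sketches above together with an assumed mini-batch `dataloader` (all of these are assumptions of this sketch, not the reference implementation):

```python
import torch

encoder, decoder = Encoder(d=784, p=20), Decoder(p=20, d=784)   # illustrative sizes
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)                    # learning rate eta

for x in dataloader:                      # mini-batches of size b
    mu, logvar = encoder(x)               # parameters of P(z | x, theta_e)
    z = reparameterize(mu, logvar)        # sample the latent variable
    x_hat = decoder(z)                    # reconstruction
    loss = negative_elbo(x, x_hat, mu, logvar)
    optimizer.zero_grad()
    loss.backward()                       # backpropagation through the whole network
    optimizer.step()
```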

5.5 The Test Phase in Variational Autoencoder

In the test phase, we feed the test data point 𝒙i\boldsymbol{x}_{i} to the encoder to determine the parameters of the conditional distribution of latent space, i.e., (𝒛i|𝒙i,𝜽e)\mathbb{P}(\boldsymbol{z}_{i}\,|\,\boldsymbol{x}_{i},\boldsymbol{\theta}_{e}). Then, from this distribution, we sample the latent variable 𝒛i\boldsymbol{z}_{i} from the latent space and generate the corresponding reconstructed data point 𝒙i\boldsymbol{x}_{i} by the decoder. As you see, VAE is a generative model which generates data points (Ng & Jordan, 2002).
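
A sketch of this test-time pass, reusing the illustrative modules above (x_test denotes an assumed batch of test points):

```python
import torch

with torch.no_grad():
    mu, logvar = encoder(x_test)          # parameters of P(z | x, theta_e) for the test points
    z = reparameterize(mu, logvar)        # sample the latent variable
    x_reconstructed = decoder(z)          # reconstruction generated by the decoder
```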

5.6 Other Notes and Other Variants of Variational Autoencoder

There exist many improvements on VAE. Here, we briefly review some of these works. One of the problems of VAE is generating blurry images when the data points are images. This blurry artifact may be due to several reasons, including the following:

  • sampling for the Monte Carlo approximations

  • lower bound approximation by ELBO

  • restrictions on the family of distributions, where usually simple Gaussian distributions are used.

There are some other interpretations for the reason of this problem; for example, see (Zhao et al., 2017). This work also proposed a generalized ELBO. Note that generative adversarial networks (Goodfellow et al., 2014) usually generate clearer images; therefore, some works have combined variational and adversarial inferences (Mescheder et al., 2017) for using the advantages of both models.

Variational discriminant analysis (Yu et al., 2020) has also been proposed for classification and discrimination of classes. Two other tutorials on VAE are (Doersch, 2016) and (Odaibo, 2019). Some more recently published papers on VAE are nearly optimal VAE (Bai et al., 2020), deep VAE (Hou et al., 2017), Hamiltonian VAE (Caterini et al., 2018), and Nouveau VAE (Vahdat & Kautz, 2020) which is a hierarchical VAE. For image data and image and caption modeling, a fusion of VAE and convolutional neural network is also proposed (Pu et al., 2016). The influential factors in VAE are also analyzed in the paper (Liu et al., 2020).

6 Conclusion

This paper was a tutorial and survey on several tightly related dimensionality reduction and generative models. Factor analysis, probabilistic PCA, variational inference, and variational autoencoder were covered in this paper. All of these methods assume that every data point is generated from a latent variable or factor, where some noise is also applied to the data in the data space.

Acknowledgement

The authors hugely thank the instructors of the deep learning course at Carnegie Mellon University (whose lectures are available on their YouTube channel), which partly covered some of the materials mentioned in this tutorial paper.

Appendix A Proof for Lemma 1

KL(p1p2)=p1(x)log(p1(x)p2(x))𝑑x\displaystyle\text{KL}(p_{1}\|p_{2})=\int p_{1}(x)\log\big{(}\frac{p_{1}(x)}{p_{2}(x)}\big{)}dx
=p1(x)log(p1(x))𝑑xp1(x)log(p2(x))𝑑x.\displaystyle=\int p_{1}(x)\log(p_{1}(x))\,dx-\int p_{1}(x)\log(p_{2}(x))\,dx.

Using the (negative) differential entropy of the Gaussian distribution p1p_{1}, we have:

p1(x)log(p1(x))𝑑x=12(1+log(2πσ12)).\displaystyle\int p_{1}(x)\log(p_{1}(x))\,dx=-\frac{1}{2}(1+\log(2\pi\sigma_{1}^{2})).

We also have:

p1(x)log(p2(x))𝑑x\displaystyle-\int p_{1}(x)\log(p_{2}(x))\,dx
=p1(x)log(12πσ22e(xμ2)22σ22)𝑑x\displaystyle=-\int p_{1}(x)\log\Big{(}\frac{1}{\sqrt{2\pi\sigma_{2}^{2}}}e^{-\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}}\Big{)}\,dx
=12log(2πσ22)p1(x)𝑑x=1p1(x)log(e(xμ2)22σ22)𝑑x\displaystyle=\frac{1}{2}\log(2\pi\sigma_{2}^{2})\underbrace{\int p_{1}(x)dx}_{=1}-\int p_{1}(x)\log\big{(}e^{-\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}}\big{)}\,dx
=12log(2πσ22)+p1(x)(xμ2)22σ22𝑑x\displaystyle=\frac{1}{2}\log(2\pi\sigma_{2}^{2})+\int p_{1}(x)\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\,dx
=12log(2πσ22)+12σ22(p1(x)x2dx\displaystyle=\frac{1}{2}\log(2\pi\sigma_{2}^{2})+\frac{1}{2\sigma_{2}^{2}}\Big{(}\int p_{1}(x)x^{2}dx
p1(x)2xμ2dx+p1(x)μ22dx)\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\int p_{1}(x)2x\mu_{2}dx+\int p_{1}(x)\mu_{2}^{2}dx\Big{)}
=12log(2πσ22)\displaystyle=\frac{1}{2}\log(2\pi\sigma_{2}^{2})
+12σ22(𝔼p1(x)[x2]2μ2𝔼p1(x)[x]+μ22).\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\frac{1}{2\sigma_{2}^{2}}\Big{(}\mathbb{E}_{\sim p_{1}(x)}[x^{2}]-2\mu_{2}\mathbb{E}_{\sim p_{1}(x)}[x]+\mu_{2}^{2}\Big{)}.

We know that:

𝕍ar[x]=𝔼[x2]𝔼[x]2𝔼[x2]=σ12+μ12\displaystyle\mathbb{V}\text{ar}[x]=\mathbb{E}[x^{2}]-\mathbb{E}[x]^{2}\implies\mathbb{E}[x^{2}]=\sigma_{1}^{2}+\mu_{1}^{2}

Hence:

p1(x)log(p2(x))𝑑x\displaystyle-\int p_{1}(x)\log(p_{2}(x))\,dx
=12log(2πσ22)+12σ22(σ12+μ122μ2μ1+μ22)\displaystyle=\frac{1}{2}\log(2\pi\sigma_{2}^{2})+\frac{1}{2\sigma_{2}^{2}}\Big{(}\sigma_{1}^{2}+\mu_{1}^{2}-2\mu_{2}\mu_{1}+\mu_{2}^{2}\Big{)}
=\frac{1}{2}\log(2\pi\sigma_{2}^{2})+\frac{1}{2\sigma_{2}^{2}}\big{(}\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}\big{)}.

Therefore, finally, we have:

KL(p1p2)\displaystyle\text{KL}(p_{1}\|p_{2}) =12(1+log(2πσ12))\displaystyle=-\frac{1}{2}(1+\log(2\pi\sigma_{1}^{2}))
+\frac{1}{2}\log(2\pi\sigma_{2}^{2})+\frac{1}{2\sigma_{2}^{2}}\big{(}\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}\big{)}
=log(σ2σ1)+σ12+(μ1μ2)22σ2212.\displaystyle=\log(\frac{\sigma_{2}}{\sigma_{1}})+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2}.

Q.E.D.

References

  • Bai et al. (2020) Bai, Jincheng, Song, Qifan, and Cheng, Guang. Nearly optimal variational inference for high dimensional regression with shrinkage priors. arXiv preprint arXiv:2010.12887, 2020.
  • Bishop (1999) Bishop, Christopher M. Bayesian PCA. In Advances in neural information processing systems, pp. 382–388, 1999.
  • Bishop (2006) Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.
  • Bouchard & Triggs (2004) Bouchard, Guillaume and Triggs, Bill. The tradeoff between generative and discriminative classifiers. In 16th IASC International Symposium on Computational Statistics, 2004.
  • Caterini et al. (2018) Caterini, Anthony L, Doucet, Arnaud, and Sejdinovic, Dino. Hamiltonian variational auto-encoder. Advances in Neural Information Processing Systems, 31:8167–8177, 2018.
  • Cattell (1965) Cattell, Raymond B. A biometrics invited paper. factor analysis: An introduction to essentials i. the purpose and underlying models. Biometrics, 21(1):190–215, 1965.
  • Child (1990) Child, Dennis. The essentials of factor analysis. Cassell Educational, 1990.
  • Doersch (2016) Doersch, Carl. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
  • Duchi (2007) Duchi, John. Derivations for linear algebra and optimization. Technical report, Berkeley, California, 2007.
  • Figurnov et al. (2018) Figurnov, Mikhail, Mohamed, Shakir, and Mnih, Andriy. Implicit reparameterization gradients. Advances in Neural Information Processing Systems, 31:441–452, 2018.
  • Fruchter (1954) Fruchter, Benjamin. Introduction to factor analysis. Van Nostrand, 1954.
  • Ghahramani & Hinton (1996) Ghahramani, Zoubin and Hinton, Geoffrey E. The EM algorithm for mixtures of factor analyzers. Technical report, Technical Report CRG-TR-96-1, University of Toronto, 1996.
  • Ghojogh & Crowley (2019) Ghojogh, Benyamin and Crowley, Mark. Unsupervised and supervised principal component analysis: Tutorial. arXiv preprint arXiv:1906.03148, 2019.
  • Ghojogh et al. (2019a) Ghojogh, Benyamin, Ghojogh, Aydin, Crowley, Mark, and Karray, Fakhri. Fitting a mixture distribution to data: tutorial. arXiv preprint arXiv:1901.06708, 2019a.
  • Ghojogh et al. (2019b) Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark. Eigenvalue and generalized eigenvalue problems: Tutorial. arXiv preprint arXiv:1903.11240, 2019b.
  • Ghojogh et al. (2020a) Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and Crowley, Mark. Stochastic neighbor embedding with Gaussian and Student-t distributions: Tutorial and survey. arXiv preprint arXiv:2009.10301, 2020a.
  • Ghojogh et al. (2020b) Ghojogh, Benyamin, Nekoei, Hadi, Ghojogh, Aydin, Karray, Fakhri, and Crowley, Mark. Sampling algorithms, from survey sampling to Monte Carlo methods: Tutorial and literature review. arXiv preprint arXiv:2011.00901, 2020b.
  • Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Gu & Shen (2020) Gu, Mengyang and Shen, Weining. Generalized probabilistic principal component analysis of correlated data. Journal of Machine Learning Research, 21(13):1–41, 2020.
  • Guan & Dy (2009) Guan, Yue and Dy, Jennifer. Sparse probabilistic principal component analysis. In Artificial Intelligence and Statistics, pp.  185–192, 2009.
  • Hadjeres et al. (2017) Hadjeres, Gaëtan, Nielsen, Frank, and Pachet, François. GLSR-VAE: Geodesic latent space regularization for variational autoencoder architectures. In 2017 IEEE Symposium Series on Computational Intelligence, pp.  1–7. IEEE, 2017.
  • Harman (1976) Harman, Harry H. Modern factor analysis. University of Chicago press, 1976.
  • Hauskrecht (2007) Hauskrecht, Milos. CS3750 lecture notes for probabilistic principal component analysis and the E-M algorithm. Technical report, University of Pittsburgh, 2007.
  • Hershey & Olsen (2007) Hershey, John R and Olsen, Peder A. Approximating the Kullback Leibler divergence between Gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pp.  IV–317. IEEE, 2007.
  • Hou et al. (2017) Hou, Xianxu, Shen, Linlin, Sun, Ke, and Qiu, Guoping. Deep feature consistent variational autoencoder. In 2017 IEEE Winter Conference on Applications of Computer Vision, pp.  1133–1141. IEEE, 2017.
  • Jain & Kar (2017) Jain, Prateek and Kar, Purushottam. Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning, 10(3-4):142–336, 2017.
  • Jolliffe & Cadima (2016) Jolliffe, Ian T and Cadima, Jorge. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 2016.
  • Kingma & Welling (2014) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Kullback & Leibler (1951) Kullback, Solomon and Leibler, Richard A. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
  • Liu et al. (2020) Liu, Shiqi, Liu, Jingxin, Zhao, Qian, Cao, Xiangyong, Li, Huibin, Meng, Deyu, Meng, Hongying, and Liu, Sheng. Discovering influential factors in variational autoencoders. Pattern Recognition, 100:107166, 2020.
  • Mattei et al. (2016) Mattei, Pierre-Alexandre, Bouveyron, Charles, and Latouche, Pierre. Globally sparse probabilistic PCA. In Artificial Intelligence and Statistics, pp.  976–984, 2016.
  • Mescheder et al. (2017) Mescheder, Lars, Nowozin, Sebastian, and Geiger, Andreas. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning, 2017.
  • Ng (2018) Ng, Andrew. CS229 lecture notes for factor analysis. Technical report, Stanford University, 2018.
  • Ng & Jordan (2002) Ng, Andrew Y and Jordan, Michael I. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in neural information processing systems, pp. 841–848, 2002.
  • Odaibo (2019) Odaibo, Stephen. Tutorial: Deriving the standard variational autoencoder (VAE) loss function. arXiv preprint arXiv:1907.08956, 2019.
  • Paola Garcia (2018) Paola Garcia, Leibny. Lecture notes for factor analysis. Technical report, Carnegie Mellon University, 2018.
  • Pu et al. (2016) Pu, Yunchen, Gan, Zhe, Henao, Ricardo, Yuan, Xin, Li, Chunyuan, Stevens, Andrew, and Carin, Lawrence. Variational autoencoder for deep learning of images, labels and captions. In Advances in neural information processing systems, pp. 2352–2360, 2016.
  • Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • Roweis (1997) Roweis, Sam. EM algorithms for PCA and SPCA. Advances in neural information processing systems, 10:626–632, 1997.
  • Rumelhart et al. (1986) Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • Sminchisescu & Jepson (2004) Sminchisescu, Cristian and Jepson, Allan. Generative modeling for continuous non-linearly embedded visual inference. In Proceedings of the twenty-first international conference on Machine learning, pp.  96, 2004.
  • Takahashi et al. (2019) Takahashi, Hiroshi, Iwata, Tomoharu, Yamanaka, Yuki, Yamada, Masanori, and Yagi, Satoshi. Variational autoencoder with implicit optimal priors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  5066–5073, 2019.
  • Tipping & Bishop (1999a) Tipping, Michael E and Bishop, Christopher M. Mixtures of probabilistic principal component analyzers. Neural computation, 11(2):443–482, 1999a.
  • Tipping & Bishop (1999b) Tipping, Michael E and Bishop, Christopher M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999b.
  • Titsias & Lázaro-Gredilla (2014) Titsias, Michalis and Lázaro-Gredilla, Miguel. Doubly stochastic variational Bayes for non-conjugate inference. In International conference on machine learning, pp. 1971–1979, 2014.
  • Vahdat & Kautz (2020) Vahdat, Arash and Kautz, Jan. NVAE: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33, 2020.
  • Walker et al. (2016) Walker, Jacob, Doersch, Carl, Gupta, Abhinav, and Hebert, Martial. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pp.  835–851. Springer, 2016.
  • Yu et al. (2006) Yu, Shipeng, Yu, Kai, Tresp, Volker, Kriegel, Hans-Peter, and Wu, Mingrui. Supervised probabilistic principal component analysis. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp.  464–473, 2006.
  • Yu et al. (2020) Yu, Weichang, Azizi, Lamiae, and Ormerod, John T. Variational nonparametric discriminant analysis. Computational Statistics & Data Analysis, 142:106817, 2020.
  • Zhao & Jiang (2006) Zhao, J and Jiang, Q. Probabilistic PCA for t distributions. Neurocomputing, 69(16-18):2217–2226, 2006.
  • Zhao et al. (2017) Zhao, Shengjia, Song, Jiaming, and Ermon, Stefano. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.