
Langevin Autoencoders for Learning Deep Latent Variable Models

Shohei Taniguchi
The University of Tokyo
[email protected]

Yusuke Iwasawa
The University of Tokyo
[email protected]

Wataru Kumagai
The University of Tokyo
[email protected]

Yutaka Matsuo
The University of Tokyo
[email protected]
Abstract

Markov chain Monte Carlo (MCMC), such as Langevin dynamics, is valid for approximating intractable distributions. However, its usage is limited in the context of deep latent variable models owing to costly datapoint-wise sampling iterations and slow convergence. This paper proposes amortized Langevin dynamics (ALD), wherein datapoint-wise MCMC iterations are entirely replaced with updates of an encoder that maps observations into latent variables. This amortization enables efficient posterior sampling without datapoint-wise iterations. Despite its efficiency, we prove that ALD is valid as an MCMC algorithm, whose Markov chain has the target posterior as a stationary distribution under mild assumptions. Based on ALD, we also present a new deep latent variable model named the Langevin autoencoder (LAE). Interestingly, the LAE can be implemented by slightly modifying the traditional autoencoder. Using multiple synthetic datasets, we first validate that ALD can properly obtain samples from target posteriors. We then evaluate the LAE on the image generation task and show that it outperforms existing methods based on variational inference, such as the variational autoencoder, and other MCMC-based methods in terms of test likelihood.

1 Introduction

Variational inference (VI) and Markov chain Monte Carlo (MCMC) are two practical tools to approximate intractable distributions. Recently, VI has been dominantly used in deep latent variable models (DLVMs) to approximate the posterior distribution over the latent variable ${\mathbf{z}}$ given the observation ${\mathbf{x}}$, i.e., $p\left({\mathbf{z}}\mid{\mathbf{x}}\right)$. At the core of the success of VI is the invention of amortized variational inference (AVI) (Kingma and Welling, 2013; Rezende et al., 2014), which replaces optimization of datapoint-wise variational parameters with an encoder that predicts latent variables from observations. The framework of learning DLVMs based on AVI is called the variational autoencoder (VAE), which is widely used in applications (Kingma et al., 2014; An and Cho, 2015; Zhang et al., 2016; Eslami et al., 2018; Kumar et al., 2018). An important advantage of AVI over traditional VI is that the optimized encoder can be used to infer a latent representation even for new data at test time, owing to its generalization ability. On the other hand, the approximation power of AVI (or VI itself) is limited because it relies on tractable distributions to approximate the complex posteriors of DLVMs, as shown in Figure 1 (a). Although there have been attempts to improve their flexibility (e.g., normalizing flows (Rezende and Mohamed, 2015; Kingma et al., 2016; Van Den Berg et al., 2018; Huang et al., 2018)), such methods typically have architectural constraints (e.g., invertibility in normalizing flows).


(a) Variational Inference

(b) Langevin Dynamics

(c) Hoffman (2017)

(d) ALD (ours)

Figure 1: Comparison between existing approximated inference methods and our amortized Langevin dynamics (ALD). (a) In variational inference, posteriors are approximated using tractable distributions (e.g., Gaussians). (b) In traditional Langevin dynamics (LD), the approximation is performed by running gradient-based sampling iterations directly on the latent space for each datapoint. (c) Hoffman (2017) uses an encoder that maps the observation into the latent variable to initialize traditional LD, but it still relies on datapoint-wise iterations. (d) Our ALD also uses an encoder, but it treats the output of the encoder as a posterior sample, and its update is implicitly performed by updating the encoder.

Compared to VI, MCMC (e.g., Langevin dynamics) can approximate complex distributions by repeatedly sampling from a Markov chain that has the target distribution as its stationary distribution (Liu and Liu, 2001; Robert et al., 2004; Gilks et al., 1995; Geyer, 1992). However, despite its high approximation ability, MCMC has received relatively little attention in learning DLVMs, because MCMC methods take a long time to converge, which makes them difficult to use in the training of DLVMs. When learning DLVMs with MCMC, we need to run time-consuming MCMC iterations to sample from each datapoint-wise posterior $p\left({\mathbf{z}}\mid{\bm{x}}^{(i)}\right)$ $(i=1,\ldots,n)$, where $n$ is the number of mini-batch data, as shown in Figure 1 (b). Furthermore, we need to re-run the sampling procedure when we obtain new data at test time.

As in VI, there have been some attempts to introduce the concept of amortized inference to MCMC. For example, Hoffman (2017) initializes datapoint-wise MCMC using an encoder that predicts latent variables from observations as shown in Figure 1 (c). However, as they use encoders only for the initialization of MCMC, these methods still rely on datapoint-wise sampling iterations. Not only is it time-consuming, but implementations of such partially amortized methods also tend to be complicated compared to the simplicity of AVI. To make MCMC more suitable for the inference of DLVMs, a more sophisticated framework of amortization is needed.

This paper proposes amortized Langevin dynamics (ALD), which entirely replaces datapoint-wise MCMC iterations with updates of an encoder that maps the observation into the latent variable, as shown in Figure 1 (d). Since the latent variable depends on the encoder, the updates of the encoder can be regarded as implicit updates of latent variables. By replacing MCMC on the latent space with MCMC on the encoder's parameter space, we can benefit from the encoder's generalization ability. For example, after running a Markov chain of the encoder for some mini-batch data, the encoder is expected to map data into high-density regions of the latent space; hence we can accelerate the convergence of MCMC for other data by initializing their chains with the encoder. Moreover, despite this simplicity, we can theoretically guarantee that ALD has the true posterior as a stationary distribution under mild assumptions, which is a critical requirement for valid MCMC algorithms.

Using our ALD for sampling from the latent posterior, we derive a novel framework for learning DLVMs, which we refer to as the Langevin autoencoder (LAE). Interestingly, the learning algorithm of LAEs can be regarded as a small modification of traditional autoencoders (Hinton and Salakhutdinov, 2006). In our experiments, we first show that ALD can properly obtain samples from target distributions using toy datasets. Subsequently, we perform numerical experiments on the image generation task using the MNIST, SVHN, CIFAR-10, and CelebA datasets. We demonstrate that our LAE can outperform existing learning methods based on variational inference, such as the VAE, and existing MCMC-based methods in terms of test likelihood.

2 Preliminaries

2.1 Problem Definition

Consider a probabilistic model with the observation ${\mathbf{x}}$, the continuous latent variable ${\mathbf{z}}$, and the model parameter ${\bm{\theta}}$ as follows:

pโ€‹(๐ฑ;๐œฝ)=โˆซpโ€‹(๐ฑโˆฃ๐’›;๐œฝ)โ€‹pโ€‹(๐’›)โ€‹๐‘‘๐’›.\displaystyle p\left({\mathbf{x}};{\bm{\theta}}\right)=\int p\left({\mathbf{x}}\mid{\bm{z}};{\bm{\theta}}\right)p\left({\bm{z}}\right)d{\bm{z}}. (1)

To learn this latent variable model (LVM) via maximum likelihood in a gradient-based manner, we need to calculate the expectation over the posterior distribution $p\left({\mathbf{z}}\mid{\mathbf{x}};{\bm{\theta}}\right)$ as follows:

โˆ‡๐œฝ๐”ผp^dataโ€‹(๐ฑ)โ€‹[logโกpโ€‹(๐’™;๐œฝ)]โ‰ˆ1nโ€‹โˆ‘i=1n๐”ผpโ€‹(๐ณ(i)โˆฃ๐’™(i);๐œฝ)โ€‹[โˆ‡๐œฝlogโกpโ€‹(๐’™(i),๐’›(i);๐œฝ)],\displaystyle\nabla_{\bm{\theta}}\mathbb{E}_{\hat{p}_{\mathrm{data}}\left({\mathbf{x}}\right)}\left[\log p\left({\bm{x}};{\bm{\theta}}\right)\right]\approx\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{p\left({\mathbf{z}}^{(i)}\mid{\bm{x}}^{(i)};{\bm{\theta}}\right)}\left[\nabla_{\bm{\theta}}\log p\left({\bm{x}}^{(i)},{\bm{z}}^{(i)};{\bm{\theta}}\right)\right], (2)

where $\hat{p}_{\mathrm{data}}$ is the empirical distribution defined by the training set, and ${\bm{x}}^{(1)},\ldots,{\bm{x}}^{(n)}$ are mini-batch data drawn uniformly from $\hat{p}_{\mathrm{data}}$. However, this expectation cannot be calculated in closed form because the posterior is intractable. In this paper, we consider a Monte Carlo approximation that uses samples from the posterior of each data point.

2.2 Langevin Dynamics

Langevin dynamics (LD) (Neal, 2011) is a sampling algorithm based on the following Langevin equation:

dโ€‹๐’›=โˆ’โˆ‡๐’›Uโ€‹(๐’™,๐’›;๐œฝ)โ€‹dโ€‹t+2โ€‹ฮฒโˆ’1โ€‹dโ€‹B,d{\bm{z}}=-\nabla_{\bm{z}}U\left({\bm{x}},{\bm{z}};{\bm{\theta}}\right)dt+\sqrt{2\beta^{-1}}dB, (3)

where $U$ is a Lipschitz-continuous potential function that satisfies an appropriate growth condition, $\beta$ is an inverse temperature parameter, and $B$ is a Brownian motion. This stochastic differential equation (SDE) has $p^{\beta}\left({\bm{z}}\mid{\bm{x}};{\bm{\theta}}\right)\propto\exp\left(-\beta U\left({\bm{x}},{\bm{z}};{\bm{\theta}}\right)\right)$ as its stationary distribution. We set $\beta=1$ and define the potential as follows so that the stationary distribution is the target posterior $p\left({\mathbf{z}}\mid{\bm{x}};{\bm{\theta}}\right)$:

Uโ€‹(๐’™,๐’›;๐œฝ)\displaystyle U\left({\bm{x}},{\bm{z}};{\bm{\theta}}\right) =โˆ’logโกpโ€‹(๐’™,๐’›;๐œฝ).\displaystyle=-\log p\left({\bm{x}},{\bm{z}};{\bm{\theta}}\right). (4)

We can obtain samples from the posterior by simulating the SDE of Eq. (3) using the Euler–Maruyama method (Kloeden and Platen, 2013) as follows:

{\bm{z}}_{t+1}\sim q\left({\bm{z}}_{t+1}\mid{\bm{z}}_{t}\right), \qquad (5)
q\left({\bm{z}}^{\prime}\mid{\bm{z}}\right)\coloneqq\mathcal{N}\left({\bm{z}}^{\prime};{\bm{z}}-\eta\nabla_{\bm{z}}U\left({\bm{x}},{\bm{z}};{\bm{\theta}}\right),2\eta{\bm{I}}\right), \qquad (6)

where $\eta$ is a step size for the discretization. The initial value ${\bm{z}}_{0}$ is typically sampled from the prior $p\left({\bm{z}}\right)$. When the step size is sufficiently small, the samples asymptotically converge to the target posterior by repeating this sampling iteration. To remove the discretization error, an additional Metropolis–Hastings (MH) rejection step is often used. In an MH step, we first calculate the acceptance rate $\alpha_{t}$ as follows:

\alpha_{t}=\min\left\{1,\frac{\exp\left(-U\left({\bm{x}},{\bm{z}}_{t+1};{\bm{\theta}}\right)\right)q\left({\bm{z}}_{t}\mid{\bm{z}}_{t+1}\right)}{\exp\left(-U\left({\bm{x}},{\bm{z}}_{t};{\bm{\theta}}\right)\right)q\left({\bm{z}}_{t+1}\mid{\bm{z}}_{t}\right)}\right\}. \qquad (7)

The sample ${\bm{z}}_{t+1}$ is accepted with probability $\alpha_{t}$ and rejected with probability $1-\alpha_{t}$; if it is rejected, we set ${\bm{z}}_{t+1}={\bm{z}}_{t}$. LD can be applied to any posterior inference problem for continuous latent variables, provided the potential is differentiable on the latent space. To obtain posterior samples for all mini-batch data ${\bm{x}}^{(1)},\ldots,{\bm{x}}^{(n)}$, we have to perform the iterations of Eq. (5) per data point, as shown in Figure 1 (b).
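To make Eqs. (5)–(7) concrete, the following is a minimal sketch of LD with an MH correction for a hypothetical one-dimensional conjugate model (prior $\mathcal{N}(0,1)$, likelihood $\mathcal{N}(z,1)$), whose exact posterior is $\mathcal{N}(x/2,1/2)$; the model, step size, and chain length here are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 2.0      # a single observation; the exact posterior is N(1.0, 0.5)
eta = 0.1    # discretization step size

def U(z):  # potential of Eq. (4) for prior N(0, 1) and likelihood N(z, 1)
    return 0.5 * z**2 + 0.5 * (x - z)**2

def grad_U(z):
    return 2.0 * z - x

def log_q(z_to, z_from):  # log density of the proposal in Eq. (6), up to a constant
    mean = z_from - eta * grad_U(z_from)
    return -0.25 / eta * (z_to - mean) ** 2

z = rng.standard_normal()  # initialize from the prior
samples = []
for t in range(20000):
    prop = z - eta * grad_U(z) + np.sqrt(2 * eta) * rng.standard_normal()
    log_alpha = (-U(prop) + log_q(z, prop)) - (-U(z) + log_q(prop, z))
    if np.log(rng.uniform()) < log_alpha:  # MH accept/reject of Eq. (7)
        z = prop
    samples.append(z)

samples = np.array(samples[2000:])  # discard burn-in
print(samples.mean(), samples.var())
```

With a sufficiently long chain, the sample mean and variance approach the analytic posterior moments.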

After obtaining the samples, the gradient in Eq. (2) is approximated using the samples as follows:

โˆ‡๐œฝ๐”ผp^dataโ€‹(๐ฑ)โ€‹[logโกpโ€‹(๐’™;๐œฝ)]โ‰ˆ1nโ€‹Tโ€‹โˆ‘i=1nโˆ‘t=1Tโˆ‡๐œฝlogโกpโ€‹(๐’™(i),๐’›t(i);๐œฝ),\displaystyle\nabla_{\bm{\theta}}\mathbb{E}_{\hat{p}_{\mathrm{data}}\left({\mathbf{x}}\right)}\left[\log p\left({\bm{x}};{\bm{\theta}}\right)\right]\approx\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\nabla_{\bm{\theta}}\log p\left({\bm{x}}^{(i)},{\bm{z}}_{t}^{(i)};{\bm{\theta}}\right), (8)

The time-averaged gradient in Eq. (8) is sometimes replaced with one calculated using only the final sample:

โˆ‡๐œฝ๐”ผp^dataโ€‹(๐ฑ)โ€‹[logโกpโ€‹(๐’™;๐œฝ)]โ‰ˆ1nโ€‹โˆ‘i=1nโˆ‡๐œฝlogโกpโ€‹(๐’™(i),๐’›T(i);๐œฝ),\displaystyle\nabla_{\bm{\theta}}\mathbb{E}_{\hat{p}_{\mathrm{data}}\left({\mathbf{x}}\right)}\left[\log p\left({\bm{x}};{\bm{\theta}}\right)\right]\approx\frac{1}{n}\sum_{i=1}^{n}\nabla_{\bm{\theta}}\log p\left({\bm{x}}^{(i)},{\bm{z}}_{T}^{(i)};{\bm{\theta}}\right), (9)

However, in these traditional approaches, we need to run MCMC iterations from random initialization every time the model parameter ${\bm{\theta}}$ is updated. In addition, we need to run them from scratch to perform inference for new data at test time. This inefficiency hinders the practical use of MCMC for learning DLVMs. A naive way to mitigate this problem is to initialize MCMC with a persistent sample (Tieleman, 2008; Han et al., 2017), i.e., the final value of the previous Markov chain. However, this approach is also inefficient, especially when the training set is large, because we need to store persistent samples for all training examples.

To alleviate the inefficiency of LD for LVMs, Hoffman (2017) proposed using a stochastic encoder $q\left({\mathbf{z}}\mid{\mathbf{x}}\right)$ to initialize the datapoint-wise MCMC, as shown in Figure 1 (c). The encoder is typically defined as a Gaussian distribution, as in the VAE:

qโ€‹(๐’›โˆฃ๐ฑ;ฯ•)=๐’ฉโ€‹(๐’›;ฮผโ€‹(๐’™;ฯ•),diagโ€‹(ฯƒ2โ€‹(๐’™;ฯ•))),\displaystyle q\left({\bm{z}}\mid{\mathbf{x}};{\bm{\phi}}\right)=\mathcal{N}\left({\bm{z}};\mu\left({\bm{x}};{\bm{\phi}}\right),\mathrm{diag}\left(\sigma^{2}\left({\bm{x}};{\bm{\phi}}\right)\right)\right), (10)

where $\mu$ and $\sigma^{2}$ are mappings from the observation space to the latent space. During training, the LD iterations of Eq. (5) are initialized with a sample from the distribution of Eq. (10), and the model parameter is updated using the stochastic gradient in Eq. (9). Along with it, the encoder is trained by maximizing the evidence lower bound, as in VAEs:

\mathcal{L}\left({\bm{\phi}}\right)=\mathbb{E}_{q\left({\mathbf{z}}\mid{\mathbf{x}};{\bm{\phi}}\right)\hat{p}\left({\mathbf{x}}\right)}\left[\log\frac{p\left({\bm{x}},{\bm{z}};{\bm{\theta}}\right)}{q\left({\bm{z}}\mid{\bm{x}};{\bm{\phi}}\right)}\right]. \qquad (11)

Although initializing LD with the encoder can speed up convergence, this method still relies on datapoint-wise MCMC iterations. Moreover, the encoder has to be trained with a different objective, which makes the implementation complicated. In Section 3, we present a method that entirely removes the datapoint-wise iterations by amortization.

3 Method

Algorithm 1 Amortized Langevin dynamics
${\bm{\phi}}\leftarrow$ Initialize parameters
${\mathbb{Z}}^{(1)},\ldots,{\mathbb{Z}}^{(n)}\leftarrow\varnothing$ ▷ Initialize sample sets for all $n$ datapoints.
repeat
  ${\bm{\phi}}^{\prime}\sim q\left({\bm{\phi}}^{\prime}\mid{\bm{\phi}}\right)\coloneqq\mathcal{N}\left({\bm{\phi}}^{\prime};{\bm{\phi}}-\eta\sum_{i=1}^{n}\nabla_{\bm{\phi}}U\left({\bm{x}}^{(i)},f_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{x}}^{(i)};{\bm{\phi}}\right);{\bm{\theta}}\right),2\eta{\bm{I}}\right)$
  ${\bm{\phi}}\leftarrow{\bm{\phi}}^{\prime}$ with probability $\min\left\{1,\frac{\exp\left(-V\left({\bm{\phi}}^{\prime}\right)\right)q\left({\bm{\phi}}\mid{\bm{\phi}}^{\prime}\right)}{\exp\left(-V\left({\bm{\phi}}\right)\right)q\left({\bm{\phi}}^{\prime}\mid{\bm{\phi}}\right)}\right\}$. ▷ MH rejection step.
  ${\mathbb{Z}}^{(i)}\leftarrow{\mathbb{Z}}^{(i)}\cup\left\{f_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{x}}^{(i)};{\bm{\phi}}\right)\right\}$ for $i=1,\ldots,n$ ▷ Add samples.
until convergence of parameters
return ${\mathbb{Z}}^{(1)},\ldots,{\mathbb{Z}}^{(n)}$

3.1 Amortized Langevin Dynamics

As an alternative to directly simulating the latent dynamics of Eq. (3), we define a deterministic encoder $f_{{\mathbf{z}}\mid{\mathbf{x}}}$, which maps the observation to the latent variable, and consider an SDE over its parameter ${\bm{\phi}}$ as follows:

d{\bm{\phi}}=-\nabla_{\bm{\phi}}V\left({\bm{\phi}}\right)dt+\sqrt{2}dB, \qquad (12)
V\left({\bm{\phi}}\right)\coloneqq\sum_{i=1}^{n}U\left({\bm{x}}^{(i)},f_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{x}}^{(i)};{\bm{\phi}}\right);{\bm{\theta}}\right). \qquad (13)

Because the function $f_{{\mathbf{z}}\mid{\mathbf{x}}}$ outputs the latent variable, the stochastic dynamics on the parameter space induces dynamics on the latent space. The main idea of our amortized Langevin dynamics (ALD) is to regard transitions of this induced dynamics as a sampling procedure for the posterior distributions, as shown in Figure 1 (d). We can use the Euler–Maruyama method to simulate Eq. (12) with discretization:

{\bm{\phi}}_{t+1}\sim q\left({\bm{\phi}}_{t+1}\mid{\bm{\phi}}_{t}\right), \qquad (14)
q\left({\bm{\phi}}^{\prime}\mid{\bm{\phi}}\right)\coloneqq\mathcal{N}\left({\bm{\phi}}^{\prime};{\bm{\phi}}-\eta\nabla_{\bm{\phi}}V\left({\bm{\phi}}\right),2\eta{\bm{I}}\right). \qquad (15)

As in traditional LD, the discretization error can be removed by adding an MH rejection step, with the acceptance rate calculated as follows:

\alpha_{t}=\min\left\{1,\frac{\exp\left(-V\left({\bm{\phi}}_{t+1}\right)\right)q\left({\bm{\phi}}_{t}\mid{\bm{\phi}}_{t+1}\right)}{\exp\left(-V\left({\bm{\phi}}_{t}\right)\right)q\left({\bm{\phi}}_{t+1}\mid{\bm{\phi}}_{t}\right)}\right\}. \qquad (16)

Through these iterations, posterior sampling is implicitly performed by collecting the encoder's outputs for each data point, as described in Algorithm 1. Note that ${\mathbb{Z}}^{(i)}$ denotes the set of samples of the posterior for the $i$-th datapoint (i.e., $p\left({\mathbf{z}}\mid{\bm{x}}^{(i)}\right)$) obtained using ALD.
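To make Algorithm 1 concrete, here is a minimal numerical sketch of ALD for a hypothetical linear-Gaussian toy model (unit-variance prior and likelihood, scalar latents), where the fixed feature extractor $g$ is replaced by random features and, for brevity, the MH rejection step is omitted (a small step size keeps the discretization bias small). All names and settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 8                        # batch size and feature width; d > n satisfies the rank condition
x = np.array([2.0, -1.0, 0.5])     # toy observations; the posterior of z^(i) is N(x_i / 2, 1/2)
G = rng.standard_normal((d, n))    # fixed features: column i stands in for g(x^(i))
Phi = np.zeros((1, d))             # the encoder's trainable last linear layer
eta = 0.005

def grad_V(Phi):
    z = Phi @ G                    # induced latents z^(i) = Phi g(x^(i)), shape (1, n)
    dU_dz = 2.0 * z - x            # datapoint-wise potential gradients for the toy model
    return dU_dz @ G.T             # chain rule through the encoder, cf. Eq. (13)

chain = []
for t in range(40000):             # Langevin steps on the parameter space, cf. Eq. (15)
    Phi = Phi - eta * grad_V(Phi) + np.sqrt(2 * eta) * rng.standard_normal((1, d))
    chain.append((Phi @ G).ravel().copy())

Z = np.array(chain[10000:])        # implicit posterior samples, one column per datapoint
print(Z.mean(axis=0))              # approaches x / 2 for each datapoint
```

Even though the noise is injected only into the encoder's parameters, the induced latent chain mixes over all datapoint-wise posteriors simultaneously.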

If this implicit update iteration has the posterior as its stationary distribution, the sampling procedure is valid as an MCMC algorithm. To obtain the stationary distribution over the latent variables, we first derive the stationary distribution over the parameter ${\bm{\phi}}$ of Eq. (12), and then transform it into the latent space by considering the change of variables induced by the encoder $f_{{\mathbf{z}}\mid{\mathbf{x}}}$. Based on this strategy, we derive the following theorem:

Theorem 1.

Let $q\left({\bm{Z}}\mid{\bm{X}}\right)\coloneqq\prod_{i=1}^{n}q\left({\bm{z}}^{(i)}\mid{\bm{x}}^{(i)}\right)$ be the stationary distribution over the latent variables induced by Eq. (12). When the mapping $f_{{\mathbf{z}}\mid{\mathbf{x}}}$ meets the following conditions, the stationary distribution satisfies $q\left({\bm{Z}}\mid{\bm{X}}\right)\propto\exp\left(-\sum_{i=1}^{n}U\left({\bm{x}}^{(i)},{\bm{z}}^{(i)}\right)\right)$.

  1. The mapping has the form $f_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{x}};{\bm{\Phi}}\right)={\bm{\Phi}}g\left({\bm{x}}\right)$, where ${\bm{\Phi}}$ is a $d_{\mathbf{z}}\times d$ matrix, $g$ is a mapping from $\mathbb{R}^{d_{\mathbf{x}}}$ to $\mathbb{R}^{d}$, and $d_{\mathbf{x}}$, $d_{\mathbf{z}}$, and $d$ are the dimensionalities of ${\bm{x}}$, ${\bm{z}}$, and $g\left({\bm{x}}\right)$, respectively.

  2. The rank of ${\bm{G}}$ is $n$, where ${\bm{G}}$ is the matrix whose $i$-th row ${\bm{G}}_{i,:}$ is $g\left({\bm{x}}^{(i)}\right)$.

See the appendix for the full proof. Theorem 1 suggests that samples obtained by ALD asymptotically converge to the true posterior when the encoder $f_{{\mathbf{z}}\mid{\mathbf{x}}}$ has an appropriate form. Practically, we can implement such a function using a neural network whose parameters are fixed except for the last linear layer. In this implementation, the last linear layer plays the role of the parameter ${\bm{\Phi}}$, and the preceding feature extractor plays the role of the function $g$.

To meet the second condition, the input dimensionality $d$ of the last linear layer should be at least the batch size $n$. This condition is relatively easy to meet because the batch size is at most about 1,000 in practice.
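The rank condition can be checked numerically; the sketch below uses hypothetical random features in place of a trained extractor $g$, for which a sufficiently wide last layer ($d \ge n$) yields a full-rank ${\bm{G}}$ almost surely.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 16                      # batch size and input width of the last linear layer
G = rng.standard_normal((n, d))   # row i stands in for the feature vector g(x^(i))
rank = np.linalg.matrix_rank(G)
print(rank)                       # equals n: condition 2 of Theorem 1 holds
```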

3.2 Remarks

Remark 1: ALD completely removes datapoint-wise iterations.

Because ALD treats the encoder's outputs themselves as posterior samples, we no longer need to run time-consuming iterations of datapoint-wise MCMC, whereas existing methods, such as Hoffman (2017), only use the encoder for the initialization of datapoint-wise MCMC.

Remark 2: ALD is valid as an MCMC algorithm.

Although the sampling procedure of ALD is quite simple, ALD has the true posterior as its stationary distribution under mild assumptions, which guarantees that ALD is valid as an MCMC algorithm. Basically, we can meet the assumptions by using a sufficiently wide neural network for the encoder. In addition, it is worth mentioning that traditional LD can be seen as a special case of ALD where $g\left({\bm{x}}^{(i)}\right)=\textrm{one-hot}\left(i\right)$. In that case, ${\bm{z}}^{(i)}=f_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{x}}^{(i)};{\bm{\Phi}}\right)$ corresponds to the $i$-th column of ${\bm{\Phi}}$, which is equivalent to running MCMC on the latent space independently for each data point.
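The one-hot special case can be verified directly: with one-hot features, the gradient of $V$ with respect to the $i$-th column of ${\bm{\Phi}}$ reduces to the datapoint-wise gradient of $U$, so the columns update independently. The quadratic toy potential below (unit-variance Gaussian prior and likelihood) is an illustrative assumption.

```python
import numpy as np

n = 3
x = np.array([2.0, -1.0, 0.5])
G = np.eye(n)                    # one-hot features: g(x^(i)) = e_i, so d = n
Phi = np.zeros((1, n))           # z^(i) is simply the i-th column of Phi

z = (Phi @ G).ravel()
dU_dz = 2.0 * z - x              # datapoint-wise gradients for U(x, z) = z^2/2 + (x - z)^2/2
dV_dPhi = dU_dz[None, :] @ G.T   # amortized gradient; with G = I the columns decouple
print(dV_dPhi)
```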

Remark 3: The encoder may accelerate the convergence of MCMC.

After running ALD iterations of the encoder for some mini-batch data, the encoder is expected to map data into high-density regions of the latent space. Therefore, the encoder can accelerate the convergence of MCMC for other data by providing their initialization. This characteristic is useful not only for efficiently estimating the gradient of the model parameters during the training of DLVMs, but also for inferring the latent representations of new test data at test time.

3.3 Langevin Autoencoder

Algorithm 2 Langevin autoencoders
1: ${\bm{\theta}},{\bm{\Phi}},{\bm{\psi}}\leftarrow$ Initialize parameters
2: repeat
3:  ${\bm{x}}^{(1)},\ldots,{\bm{x}}^{(n)}\sim\hat{p}\left({\mathbf{x}}\right)$ ▷ Sample a minibatch of $n$ examples from the training data.
4:  for $t=0,\ldots,T-1$ do ▷ Run ALD iterations.
5:   $V_{t}=-\sum_{i=1}^{n}\log p\left({\bm{x}}^{(i)},{\bm{z}}^{(i)}={\bm{\Phi}}g\left({\bm{x}}^{(i)};{\bm{\psi}}\right);{\bm{\theta}}\right)$
6:   ${\bm{\Phi}}^{\prime}\sim q\left({\bm{\Phi}}^{\prime}\mid{\bm{\Phi}}\right)\coloneqq\mathcal{N}\left({\bm{\Phi}}^{\prime};{\bm{\Phi}}-\eta\nabla_{{\bm{\Phi}}}V_{t},2\eta{\bm{I}}\right)$
7:   $V^{\prime}_{t}=-\sum_{i=1}^{n}\log p\left({\bm{x}}^{(i)},{\bm{z}}^{(i)}={\bm{\Phi}}^{\prime}g\left({\bm{x}}^{(i)};{\bm{\psi}}\right);{\bm{\theta}}\right)$
8:   ${\bm{\Phi}}\leftarrow{\bm{\Phi}}^{\prime}$ with probability $\min\left\{1,\frac{\exp\left(-V^{\prime}_{t}\right)q\left({\bm{\Phi}}\mid{\bm{\Phi}}^{\prime}\right)}{\exp\left(-V_{t}\right)q\left({\bm{\Phi}}^{\prime}\mid{\bm{\Phi}}\right)}\right\}$. ▷ MH rejection step.
9:  end for
10: $V_{T}=-\sum_{i=1}^{n}\log p\left({\bm{x}}^{(i)},{\bm{z}}^{(i)}={\bm{\Phi}}g\left({\bm{x}}^{(i)};{\bm{\psi}}\right);{\bm{\theta}}\right)$
11: ${\bm{\theta}}\leftarrow{\bm{\theta}}-\eta\nabla_{\bm{\theta}}\frac{1}{T}\sum_{t=1}^{T}V_{t}$ ▷ Update the decoder.
12: ${\bm{\psi}}\leftarrow{\bm{\psi}}-\eta\nabla_{\bm{\psi}}\frac{1}{T}\sum_{t=1}^{T}V_{t}$ ▷ Update the encoder.
13: until convergence of parameters
14: return ${\bm{\theta}},{\bm{\Phi}},{\bm{\psi}}$

Based on ALD, we propose a novel framework for learning deep latent variable models, called the Langevin autoencoder (LAE). Algorithm 2 summarizes the LAE's training procedure. First, we prepare an encoder defined by $f_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{x}};{\bm{\Phi}},{\bm{\psi}}\right)\coloneqq{\bm{\Phi}}g\left({\bm{x}};{\bm{\psi}}\right)$. The feature extractor $g$ is typically implemented using a deep neural network. Before each update of the model parameter ${\bm{\theta}}$, the encoder's final linear layer ${\bm{\Phi}}$ is updated using ALD $T$ times, and the gradient of ${\bm{\theta}}$ in Eq. (8) is calculated using the time average of $V_{t}$. Along with the update of the model parameter, the encoder's feature extractor ${\bm{\psi}}$ is also updated using the gradient so that the encoder can map data into high-density regions of the posteriors. Although we describe the parameter update as a simple stochastic gradient step, it can be replaced with more sophisticated optimization methods, such as Adam (Kingma and Ba, 2015).

It can be seen that this algorithm is very similar to that of traditional deterministic autoencoders (Hinton and Salakhutdinov, 2006). In particular, when we skip the ALD iterations in lines 4 to 9 and instead train the encoder's final linear layer ${\bm{\Phi}}$ by plain gradient steps, the algorithm is identical to the training of an autoencoder regularized by the latent prior $p\left({\bm{z}}\right)$. In that case, the encoder tends to shrink to the maximum a posteriori (MAP) estimate rather than cover the posterior; hence the gradient of the model parameter ${\bm{\theta}}$ would be a strongly biased estimate of the marginal log-likelihood gradient. Therefore, the additional ALD iterations can be interpreted as a modification that reduces the bias of the traditional autoencoder by updating the encoder's parameters with the noise-injected gradient.

4 Related Works

Amortized inference is well investigated in the context of variational inference, where it is often referred to as amortized variational inference (AVI) (Rezende and Mohamed, 2015; Shu et al., 2018). The basic idea of AVI is to replace the optimization of datapoint-wise variational parameters with the optimization of parameters shared across all datapoints by introducing an encoder that predicts latent variables from observations. AVI is commonly used in generative modeling (Kingma and Welling, 2013), semi-supervised learning (Kingma et al., 2014), anomaly detection (An and Cho, 2015), machine translation (Zhang et al., 2016), and neural rendering (Eslami et al., 2018; Kumar et al., 2018). In the MCMC literature, however, there are few works on such amortization. Han et al. (2017) use traditional LD to obtain samples from posteriors for the training of deep latent variable models. Such Langevin-based algorithms for deep latent variable models are known as alternating back-propagation (ABP) and are widely applied in several fields (Xie et al., 2019; Zhang et al., 2020; Xing et al., 2018; Zhu et al., 2019). However, ABP requires datapoint-wise Langevin iterations, causing slow convergence. Moreover, when performing inference for new data at test time, ABP again requires MCMC iterations from randomly initialized samples. Li et al. (2017) and Hoffman (2017) propose to use a VAE-like encoder to initialize MCMC, and Salimans et al. (2015) propose to combine VAE-based inference and MCMC by interpreting each MCMC step as an auxiliary variable. However, these methods only amortize the initialization cost of MCMC with an encoder; hence, they still rely on datapoint-wise MCMC iterations.

Autoencoders (AEs) (Hinton and Salakhutdinov, 2006) are a special case of LAEs wherein the ALD iterations are omitted and a uniform distribution is used as the latent prior $p\left({\bm{z}}\right)$. When a different distribution is used as the latent prior for regularization, the model is known as a sparse autoencoder (SAE) (Ng et al., 2011). As described in the previous section, the encoder of the traditional AE tends to converge to MAP estimates of the latent posterior. Therefore, the gradient of the decoder's parameters is a biased estimate of the marginal log-likelihood gradient. Our LAE corrects this bias by adding the ALD iterations before each parameter update, making it valid as the training of a generative model.

Variational autoencoders (VAEs) are based on AVI, wherein an encoder is defined as a variational distribution $q\left({\mathbf{z}}\mid{\mathbf{x}};{\bm{\phi}}\right)$ using a neural network. Its parameter ${\bm{\phi}}$ is optimized by maximizing the evidence lower bound (ELBO), i.e., $\mathbb{E}_{q\left({\mathbf{z}}\mid{\bm{x}};{\bm{\phi}}\right)}\left[\log\frac{\exp\left(-U\left({\bm{x}},{\bm{z}}\right)\right)}{q\left({\bm{z}}\mid{\bm{x}};{\bm{\phi}}\right)}\right]$. Interestingly, there is a contrast between VAEs and LAEs in where stochastic noise enters posterior inference. In VAEs, noise is used to sample from the variational distribution when calculating the potential $U$, i.e., in the forward calculation. In LAEs, noise is used in the parameter update along with the gradient $\nabla_{\bm{\phi}}U$, i.e., in the backward calculation. This contrast characterizes their two different approaches to posterior approximation: the optimization-based approach of VAEs and the sampling-based approach of LAEs. The advantage of LAEs over VAEs is that LAEs can flexibly approximate complex posteriors by obtaining samples, whereas the approximation ability of VAEs is limited by the choice of the variational distribution $q\left({\mathbf{z}}\mid{\mathbf{x}};{\bm{\phi}}\right)$, which requires a tractable density. Although there have been attempts to improve approximation flexibility, such methods typically have architectural constraints (e.g., invertibility and ease of Jacobian calculation in normalizing flows (Rezende and Mohamed, 2015; Kingma et al., 2016; Van Den Berg et al., 2018; Huang et al., 2018)), or they incur additional computational costs (e.g., MCMC sampling for the reverse conditional distribution in unbiased implicit variational inference (Titsias and Ruiz, 2019)).

5 Experiment

In our experiments, we first test our ALD algorithm on toy examples to investigate its behavior. We then show the results of its application to training deep generative models on image datasets.

5.1 Toy Examples

[Figure 2: four panels of posterior samples — GT, d > n, d = n, and d < n.]

Figure 2: Effects of changing the encoderโ€™s capacity in the experiment of bivariate Gaussian examples. dd denotes the dimensionality of the last linear layerโ€™s input.

We perform numerical simulations using toy examples to demonstrate that our ALD can properly obtain samples from target distributions. First, we use an LVM whose posterior density can be derived in closed form. We initially generate three synthetic data points ๐’™1,๐’™2,๐’™3{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}, where each ๐’™i{\bm{x}}_{i} is sampled from a bivariate Gaussian model as follows:

pโ€‹(๐’›)=๐’ฉโ€‹(๐๐ณ,๐šบ๐ณ),pโ€‹(๐’™โˆฃ๐’›)=๐’ฉโ€‹(๐’›,๐šบ๐ฑ).\displaystyle p\left({\bm{z}}\right)=\mathcal{N}\left({\bm{\mu}}_{\mathbf{z}},{\bm{\Sigma}}_{\mathbf{z}}\right),\quad p\left({\bm{x}}\mid{\bm{z}}\right)=\mathcal{N}\left({\bm{z}},{\bm{\Sigma}}_{\mathbf{x}}\right). (17)

In this case, we can calculate the exact posterior as follows:

pโ€‹(๐’›โˆฃ๐’™)=๐’ฉโ€‹(๐๐ณโˆฃ๐ฑ,๐šบ๐ณโˆฃ๐ฑ),\displaystyle p\left({\bm{z}}\mid{\bm{x}}\right)=\mathcal{N}\left({\bm{\mu}}_{{\mathbf{z}}\mid{\mathbf{x}}},{\bm{\Sigma}}_{{\mathbf{z}}\mid{\mathbf{x}}}\right), (18)
๐๐ณโˆฃ๐ฑ=๐šบ๐ณโˆฃ๐ฑโ€‹(๐šบ๐ณโˆ’1โ€‹๐๐ณ+๐šบ๐ฑโˆ’1โ€‹๐’™),๐šบ๐ณโˆฃ๐ฑ=(๐šบ๐ณโˆ’1+๐šบ๐ฑโˆ’1)โˆ’1.\displaystyle{\bm{\mu}}_{{\mathbf{z}}\mid{\mathbf{x}}}={\bm{\Sigma}}_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{\Sigma}}_{\mathbf{z}}^{-1}{\bm{\mu}}_{\mathbf{z}}+{\bm{\Sigma}}_{\mathbf{x}}^{-1}{\bm{x}}\right),\ {\bm{\Sigma}}_{{\mathbf{z}}\mid{\mathbf{x}}}=\left({\bm{\Sigma}}_{\mathbf{z}}^{-1}+{\bm{\Sigma}}_{\mathbf{x}}^{-1}\right)^{-1}. (19)

In this experiment, we set ๐๐ณ=[00]{\bm{\mu}}_{\mathbf{z}}=\left[\begin{array}[]{c}0\\ 0\end{array}\right], ๐šบ๐ณ=[1001]{\bm{\Sigma}}_{\mathbf{z}}=\left[\begin{array}[]{cc}1&0\\ 0&1\end{array}\right], and ๐šบ๐ฑ=[0.70.60.60.8]{\bm{\Sigma}}_{\mathbf{x}}=\left[\begin{array}[]{cc}0.7&0.6\\ 0.6&0.8\end{array}\right]. We simulate our ALD algorithm in this setting to obtain samples from the posterior. We use a neural network of three fully connected layers of 128 units with ReLU activation for the encoder f๐ณโˆฃ๐ฑf_{{\mathbf{z}}\mid{\mathbf{x}}}, set the step size to 4ร—10โˆ’44\times 10^{-4}, and update the parameters for 3,000 steps. We discard the first 1,000 samples as burn-in steps and use the remaining 2,000 samples for qualitative evaluation. As described in Section 3.1, the last linear layer of the encoder must have sufficiently large dimensionality to guarantee convergence to the true posterior. To empirically demonstrate this theoretical finding, we vary the dimensionality of the last linear layer of the encoder from 2(<n)2\ (<n) to 128(โ‰ซn)128\ (\gg n).
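As a sanity check, the exact posterior in Eqs. (18)-(19) can be evaluated directly. The sketch below (assuming the symmetric covariance above) computes the posterior mean and covariance with NumPy:

```python
import numpy as np

# Closed-form Gaussian posterior (Eqs. 18-19) for the toy setting in the text.
mu_z = np.zeros(2)
Sigma_z = np.eye(2)
Sigma_x = np.array([[0.7, 0.6],
                    [0.6, 0.8]])   # symmetric observation covariance

# Posterior covariance: (Sigma_z^{-1} + Sigma_x^{-1})^{-1}  (Eq. 19)
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_z) + np.linalg.inv(Sigma_x))

def posterior_mean(x):
    """Posterior mean: Sigma_post (Sigma_z^{-1} mu_z + Sigma_x^{-1} x)."""
    return Sigma_post @ (np.linalg.inv(Sigma_z) @ mu_z
                         + np.linalg.inv(Sigma_x) @ x)
```

These closed-form quantities are what the ALD samples are compared against in Figure 2.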

The results are summarized in Figure 2. It can be observed that the sample quality is good when the dimensionality of the last linear layer is equal to or greater than the number of data points (i.e., dโ‰ฅnd\geq n). When the dimensionality is smaller than the number of data points, the samples for some data points shrink to a small area, while good samples are obtained for the remaining data points.

[Figure 3: four panels — GT, Mean-field VI, Full VI, and ALD (ours).]

Figure 3: Visualizations of the ground-truth posterior, the approximations by mean-field and full VI, and the samples obtained by ALD in the neural likelihood example.

In addition to the simple conjugate Gaussian example, we experiment with a complex posterior, wherein the likelihood is defined with a randomly initialized neural network. For comparison, we also implement variational inference (VI), in which the posterior is approximated with a Gaussian distribution. Figure 3 shows a typical example that characterizes the difference between VI and ALD. The mean-field VI and the full VI use Gaussians with diagonal and full covariance matrices as variational distributions, respectively. The advantage of our ALD over VI is the flexibility of posterior approximation. VI methods typically approximate posteriors using variational distributions with tractable density functions. Hence, their approximation power is limited by the choice of variational distribution family, and they often fail to approximate such complex posteriors. In particular, the mean-field VI, which is widely used for learning DLVMs, cannot capture correlations between dimensions owing to the constraint of the variational distribution. The full VI mitigates this inflexibility, but it still cannot capture the multimodality of the true posterior. Moreover, the full VI incurs computational costs proportional to the square of the dimensionality of the latent variable. On the other hand, ALD captures such posteriors well. Results for other examples are summarized in the appendix.

5.2 Image Generation

To demonstrate the applicability of our LAE to generative model training, we experiment on image generation tasks using the MNIST, SVHN, CIFAR-10, and CelebA datasets. Note that our goal here is not to provide state-of-the-art results on image generation benchmarks but to verify the effectiveness of our ALD as a method of approximate inference in deep latent variable models. For this aim, we compare our LAE with several baseline methods, as shown in Table 1. The VAE (Kingma and Welling, 2013) is one of the most popular deep latent variable models, in which the posterior distribution is approximated using VI. The VAE-flow is an extension of the VAE in which the flexibility of VI is improved using normalizing flows (Rezende and Mohamed, 2015). In addition to VI-based methods, we use Hoffman (2017) as a baseline based on Langevin dynamics (LD). As described in Section 2.2, Hoffman (2017) uses a VAE-like encoder to initialize LD, and the encoder is trained by maximizing the evidence lower bound. We use the same fully connected deep neural networks for all methods. We set the number of ALD iterations of the LAE to 2, i.e., T=2T=2 in Algorithm 2. Please refer to the appendix for further implementation details.

For evaluation, since the marginal log-likelihood for test data is intractable, we instead report its evidence lower bound (ELBO), computed with a proposal distribution qq as follows:

logโกpโ€‹(๐’™;๐œฝ)โ‰ฅ๐”ผqโ€‹(๐ณ)โ€‹[logโกpโ€‹(๐’™,๐’›;๐œฝ)qโ€‹(๐’›)]\displaystyle\log p\left({\bm{x}};{\bm{\theta}}\right)\geq\mathbb{E}_{q\left({\mathbf{z}}\right)}\left[\log\frac{p\left({\bm{x}},{\bm{z}};{\bm{\theta}}\right)}{q\left({\bm{z}}\right)}\right] (20)
Table 1: Quantitative results of the image generation for MNIST, SVHN, CIFAR-10, and CelebA. We report the mean and standard deviation of the negative evidence lower bound per data dimension over three different seeds. Lower is better.

| MNIST | SVHN | CIFAR-10 | CelebA
VAE | 1.189 ± 0.002 | 4.442 ± 0.003 | 4.820 ± 0.005 | 4.671 ± 0.001
VAE-flow | 1.183 ± 0.001 | 4.454 ± 0.016 | 4.828 ± 0.005 | 4.667 ± 0.005
Hoffman (2017) | 1.189 ± 0.002 | 4.440 ± 0.007 | 4.831 ± 0.005 | 4.662 ± 0.011
LAE (ours) | 1.177 ± 0.001 | 4.412 ± 0.002 | 4.773 ± 0.003 | 4.636 ± 0.003

For the baseline methods, their stochastic encoders are used for the proposal distribution, i.e., qโ€‹(๐’›)โ‰”qโ€‹(๐’›โˆฃ๐’™;ฯ•)q\left({\bm{z}}\right)\coloneqq q\left({\bm{z}}\mid{\bm{x}};{\bm{\phi}}\right). For our LAE, we use a Gaussian distribution whose mean parameter is defined by its encoder, i.e., qโ€‹(๐’›)โ‰”๐’ฉโ€‹(๐’›;๐šฝโ€‹gโ€‹(๐’™;๐),ฯƒ2โ€‹๐‘ฐ)q\left({\bm{z}}\right)\coloneqq\mathcal{N}\left({\bm{z}};{\bm{\Phi}}g\left({\bm{x}};{\bm{\psi}}\right),\ \sigma^{2}{\bm{I}}\right). We set ฯƒ=0.05\sigma=0.05 in the experiment.
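For reference, the ELBO in Eq. (20) reduces to a simple Monte Carlo average over the proposal. The sketch below mirrors only the estimator structure; the tractable joint density used here is an illustrative stand-in for a trained decoder, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint for a 2-D toy model: p(z) = N(0, I), p(x | z) = N(z, I).
def log_joint(x, z):
    return -0.5 * np.sum(z**2) - 0.5 * np.sum((x - z)**2) - 2 * np.log(2 * np.pi)

def log_q(z, mu, sigma):
    """Log density of the isotropic Gaussian proposal N(z; mu, sigma^2 I)."""
    return (-0.5 * np.sum(((z - mu) / sigma)**2)
            - z.size * np.log(sigma)
            - 0.5 * z.size * np.log(2 * np.pi))

def elbo(x, mu, sigma=0.05, n_samples=100):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)] as in Eq. (20)."""
    vals = []
    for _ in range(n_samples):
        z = mu + sigma * rng.normal(size=mu.shape)
        vals.append(log_joint(x, z) - log_q(z, mu, sigma))
    return np.mean(vals)
```

In the actual evaluation, `mu` would be the encoder's output for each test datapoint and `log_joint` the trained model's log density.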

[Figure 4]
Figure 4: Learning curves comparison.

The results are summarized in Table 1. Note that the negative ELBO is shown in the table, so lower values indicate better results. It can be observed that the LAE consistently outperforms the baseline methods, demonstrating that accurate posterior approximation by ALD leads to better results in the training of DLVMs. Regarding training speed, we observe that our LAE is 2.24 and 1.88 times slower than the VAE and VAE-flow, respectively. This is expected because the VAE and VAE-flow do not require MCMC steps. Hoffman (2017) and our LAE are almost identical in training speed.

We also investigate the effect of the MH rejection step and the number of ALD iterations, i.e., TT in Algorithm 2, using the MNIST dataset. Figure 4 shows a comparison of the learning curves of the negative ELBO on MNISTโ€™s test set. The number of ALD iterations has little effect on performance as long as it is at least 2. In addition, we observe that the MH rejection step is important for stabilizing training.
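The MH rejection step corresponds to the standard Metropolis-adjusted Langevin scheme: propose with a Langevin move, then accept or reject so the chain exactly targets exp(-U). A minimal sketch for a toy target (a standard Gaussian, not the paper's model) is:

```python
import numpy as np

rng = np.random.default_rng(0)

def U(z):                      # potential of a standard Gaussian target
    return 0.5 * np.sum(z**2)

def grad_U(z):
    return z

def mala_step(z, eta):
    """One Metropolis-adjusted Langevin step targeting exp(-U)."""
    prop = z - eta * grad_U(z) + np.sqrt(2 * eta) * rng.normal(size=z.shape)

    def log_q(a, b):           # log density (up to a constant) of q(a | b)
        mean = b - eta * grad_U(b)
        return -np.sum((a - mean)**2) / (4 * eta)

    log_alpha = -U(prop) + U(z) + log_q(z, prop) - log_q(prop, z)
    if np.log(rng.uniform()) < log_alpha:
        return prop, True      # accept
    return z, False            # reject: keep the current state
```

Without the accept/reject test, the discretized dynamics only approximately targets exp(-U), which matches the observation that the MH step stabilizes training.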

6 Conclusion

This paper proposed amortized Langevin dynamics (ALD), an efficient MCMC method for deep latent variable models (DLVMs). ALD amortizes the cost of datapoint-wise iterations by using an encoder that predicts the latent variable from the observation. We showed that our ALD algorithm can accurately approximate posteriors through both theoretical and empirical studies. Using ALD, we derived a novel deep generative model called the Langevin autoencoder (LAE). We demonstrated that our LAE performs better than VI-based methods, such as the variational autoencoder, and existing LD-based methods in terms of the marginal log-likelihood on test sets.

This study is a first step toward further work on encoder-based MCMC methods for latent variable models. For instance, developing algorithms based on more sophisticated Hamiltonian Monte Carlo methods is an exciting direction for future work.

One limitation of MCMC-based learning algorithms is the difficulty of choosing the number of MCMC iterations. To reduce the bias of the gradient estimate, we need to run the iterations many times, but there are few clues in advance as to how many MCMC iterations are sufficient. Recently, a method to remove the bias of MCMC with couplings was proposed by Jacob etย al. (2020), and it may help overcome this limitation of MCMC-based learning algorithms in future work. Another limitation of our LAE is the constraint on the structure of the encoder described in Theorem 1. Although the constraint is relatively minor, it may be problematic when applying our method to modern DLVMs that have a hierarchical structure in the latent variables (e.g., Vahdat and Kautz (2020) and Child (2020)).

From a broader perspective, developing deep generative models that can synthesize realistic images could have a negative impact, such as abuse of deepfake technology. We must consider these negative aspects and take measures against them.

Acknowledgments and Disclosure of Funding

This work was supported by JSPS KAKENHI Grant Number JP21J22342 and the Mohammed bin Salman Center for Future Science and Technology for Saudi-Japan Vision 2030 at The University of Tokyo (MbSC2030). Computational resources of AI Bridging Cloud Infrastructure (ABCI) provided by National Institute of Advanced Industrial Science and Technology (AIST) were used for the experiments.

References

  • Kingma and Welling (2013) Diederikย P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Rezende etย al. (2014) Daniloย Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278โ€“1286. PMLR, 2014.
  • Kingma etย al. (2014) Diederikย P Kingma, Shakir Mohamed, Daniloย Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pages 3581โ€“3589, 2014.
  • An and Cho (2015) Jinwon An and Sungzoon Cho. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1), 2015.
  • Zhang etย al. (2016) Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. Variational neural machine translation. arXiv preprint arXiv:1605.07869, 2016.
  • Eslami etย al. (2018) SMย Ali Eslami, Daniloย Jimenez Rezende, Frederic Besse, Fabio Viola, Ariย S Morcos, Marta Garnelo, Avraham Ruderman, Andreiย A Rusu, Ivo Danihelka, Karol Gregor, etย al. Neural scene representation and rendering. Science, 360(6394):1204โ€“1210, 2018.
  • Kumar etย al. (2018) Ananya Kumar, SMย Eslami, Daniloย J Rezende, Marta Garnelo, Fabio Viola, Edward Lockhart, and Murray Shanahan. Consistent generative query networks. arXiv preprint arXiv:1807.02033, 2018.
  • Rezende and Mohamed (2015) Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530โ€“1538. PMLR, 2015.
  • Kingma etย al. (2016) Durkย P Kingma, Tim Salimans, Rafal Jozefowicz, Xiย Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743โ€“4751, 2016.
  • Van Denย Berg etย al. (2018) Rianne Van Denย Berg, Leonard Hasenclever, Jakubย M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pages 393โ€“402. Association For Uncertainty in Artificial Intelligence (AUAI), 2018.
  • Huang etย al. (2018) Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2078โ€“2087. PMLR, 2018.
  • Hoffman (2017) Matthewย D Hoffman. Learning deep latent gaussian models with markov chain monte carlo. In International conference on machine learning, pages 1510โ€“1519. PMLR, 2017.
  • Liu (2001) Junย S Liu. Monte Carlo strategies in scientific computing, volumeย 10. Springer, 2001.
  • Robert and Casella (2004) Christianย P Robert and George Casella. Monte Carlo statistical methods, volumeย 2. Springer, 2004.
  • Gilks etย al. (1995) Walterย R Gilks, Sylvia Richardson, and David Spiegelhalter. Markov chain Monte Carlo in practice. CRC press, 1995.
  • Geyer (1992) Charlesย J Geyer. Practical markov chain monte carlo. Statistical science, pages 473โ€“483, 1992.
  • Hinton and Salakhutdinov (2006) Geoffreyย E Hinton and Ruslanย R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504โ€“507, 2006.
  • Neal (2011) Radfordย M Neal. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, page 113, 2011.
  • Kloeden and Platen (2013) Peterย E Kloeden and Eckhard Platen. Numerical solution of stochastic differential equations, volumeย 23. Springer Science & Business Media, 2013.
  • Tieleman (2008) Tijmen Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064โ€“1071, 2008.
  • Han etย al. (2017) Tian Han, Yang Lu, Song-Chun Zhu, and Yingย Nian Wu. Alternating back-propagation for generator network. In Proceedings of the AAAI Conference on Artificial Intelligence, volumeย 31, 2017.
  • Kingma and Ba (2015) Diederikย P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of the 3rd International Conference for Learning Representations (ICLR), 2015.
  • Shu etย al. (2018) Rui Shu, Hungย H Bui, Shengjia Zhao, Mykelย J Kochenderfer, and Stefano Ermon. Amortized inference regularization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4398โ€“4407, 2018.
  • Xie etย al. (2019) Jianwen Xie, Ruiqi Gao, Zilong Zheng, Song-Chun Zhu, and Yingย Nian Wu. Learning dynamic generator model by alternating back-propagation through time. In Proceedings of the AAAI Conference on Artificial Intelligence, volumeย 33, pages 5498โ€“5507, 2019.
  • Zhang etย al. (2020) Jing Zhang, Jianwen Xie, and Nick Barnes. Learning noise-aware encoder-decoder from noisy labels by alternating back-propagation for saliency detection. arXiv preprint arXiv:2007.12211, 2020.
  • Xing etย al. (2018) Xianglei Xing, Ruiqi Gao, Tian Han, Song-Chun Zhu, and Yingย Nian Wu. Deformable generator network: Unsupervised disentanglement of appearance and geometry. arXiv preprint arXiv:1806.06298, 2018.
  • Zhu etย al. (2019) Yizhe Zhu, Jianwen Xie, Bingchen Liu, and Ahmed Elgammal. Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9844โ€“9854, 2019.
  • Li etย al. (2017) Yingzhen Li, Richardย E Turner, and Qiang Liu. Approximate inference with amortised mcmc. arXiv preprint arXiv:1702.08343, 2017.
  • Salimans etย al. (2015) Tim Salimans, Diederik Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218โ€“1226. PMLR, 2015.
  • Ng etย al. (2011) Andrew Ng etย al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1โ€“19, 2011.
  • Titsias and Ruiz (2019) Michalisย K Titsias and Francisco Ruiz. Unbiased implicit variational inference. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 167โ€“176. PMLR, 2019.
  • Jacob etย al. (2020) Pierreย E Jacob, John Oโ€™Leary, and Yvesย F Atchadรฉ. Unbiased markov chain monte carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):543โ€“600, 2020.
  • Vahdat and Kautz (2020) Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33:19667โ€“19679, 2020.
  • Child (2020) Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. In International Conference on Learning Representations, 2020.
  • Salimans etย al. (2017) Tim Salimans, Andrej Karpathy, Xiย Chen, and Diederikย P Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
  • Ba etย al. (2016) Jimmyย Lei Ba, Jamieย Ryan Kiros, and Geoffreyย E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Rezende and Viola (2018) Daniloย Jimenez Rezende and Fabio Viola. Taming vaes. arXiv preprint arXiv:1810.00597, 2018.
  • LeCun etย al. (1998) Yann LeCun, Lรฉon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278โ€“2324, 1998.
  • Netzer etย al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Boย Wu, and Andrewย Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • Krizhevsky etย al. (2009) Alex Krizhevsky, Geoffrey Hinton, etย al. Learning multiple layers of features from tiny images. 2009.
  • Liu etย al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

Appendix A Proof of Theorem 1

First, we prepare some lemmas.

Lemma 2.

Let h:โ„d๐ณร—dโ†’โ„d๐ณร—nh:\mathbb{R}^{d_{\mathbf{z}}\times d}\to\mathbb{R}^{d_{\mathbf{z}}\times n} be a linear map defined by hโ€‹(๐šฝ)โ‰”๐šฝโ€‹๐†h\left({\bm{\Phi}}\right)\coloneqq{\bm{\Phi}}{\bm{G}}. When the rank of ๐†{\bm{G}} is nn, there exists an orthogonal linear map ฯ„:โ„d๐ณร—dโ†’โ„d๐ณร—d\tau:\mathbb{R}^{d_{\mathbf{z}}\times d}\to\mathbb{R}^{d_{\mathbf{z}}\times d} such that ฯ„โ€‹(๐šฝ)=[๐šฝ~(1),๐šฝ~(2)]\tau\left({\bm{\Phi}}\right)=[{\tilde{{\bm{\Phi}}}}^{(1)},{\tilde{{\bm{\Phi}}}}^{(2)}] satisfies kerโ€‹h~=spanโ€‹[๐ŸŽd๐ณร—n,๐šฝ~(2)]\mathrm{ker}\ \tilde{h}=\mathrm{span}\left[{\bm{0}}_{d_{\mathbf{z}}\times n},{\tilde{{\bm{\Phi}}}}^{(2)}\right], where h~โ‰”hโˆ˜ฯ„โˆ’1\tilde{h}\coloneqq h\circ\tau^{-1}, ๐šฝ~(1)โ‰”[ฯ•~1,โ€ฆ,ฯ•~n]{\tilde{{\bm{\Phi}}}}^{(1)}\coloneqq\left[\tilde{{\bm{\phi}}}_{1},\ldots,\tilde{{\bm{\phi}}}_{n}\right], ๐šฝ~(2)โ‰”[ฯ•~n+1,โ€ฆโ€‹ฯ•~d]{\tilde{{\bm{\Phi}}}}^{(2)}\coloneqq\left[\tilde{{\bm{\phi}}}_{n+1},\ldots\tilde{{\bm{\phi}}}_{d}\right] and ฯ•~iโˆˆโ„d๐ณ\tilde{{\bm{\phi}}}_{i}\in\mathbb{R}^{d_{\mathbf{z}}} for i=1,โ€ฆ,di=1,\ldots,d.

Proof.

The singular value decomposition of ๐‘ฎ{\bm{G}} is represented by ๐‘ปโ€‹๐‘ฎโ€‹๐‘ป~โŠค=[๐šฒ๐ŸŽ(dโˆ’nร—n)]{\bm{T}}{\bm{G}}\tilde{{\bm{T}}}^{\top}=\left[\begin{array}[]{c}{\bm{\Lambda}}\\ {\bm{0}}_{(d-n\times n)}\end{array}\right], where ๐šฒ{\bm{\Lambda}} is a nร—nn\times n diagonal matrix ๐šฒ=diagโ€‹(ฮป1,โ€ฆ,ฮปn){\bm{\Lambda}}=\mathrm{diag}\left(\lambda_{1},\ldots,\lambda_{n}\right), and ๐‘ป{\bm{T}} and ๐‘ป~\tilde{{\bm{T}}} are orthogonal matrices. Since the rank of ๐‘ฎ{\bm{G}} is nn, ฮป1,โ€ฆ,ฮปn\lambda_{1},\ldots,\lambda_{n} are non-zero. When we set ฯ„โ€‹(๐šฝ)โ‰”๐šฝโ€‹๐‘ปโŠค\tau\left({\bm{\Phi}}\right)\coloneqq{\bm{\Phi}}{\bm{T}}^{\top}, we obtain

h~โ€‹(๐šฝ~)\displaystyle\tilde{h}\left(\tilde{{\bm{\Phi}}}\right) โ‰”hโ€‹(ฯ„โˆ’1โ€‹(๐šฝ~))\displaystyle\coloneqq h\left(\tau^{-1}\left(\tilde{{\bm{\Phi}}}\right)\right)
=๐šฝ~โ€‹๐‘ปโ€‹๐‘ฎ=๐šฝ~โ€‹(๐‘ปโ€‹๐‘ฎโ€‹๐‘ป~โŠค)โ€‹๐‘ป~\displaystyle=\tilde{{\bm{\Phi}}}{\bm{T}}{\bm{G}}=\tilde{{\bm{\Phi}}}\left({\bm{T}}{\bm{G}}\tilde{{\bm{T}}}^{\top}\right)\tilde{{\bm{T}}}
=๐šฝ~โ€‹[๐šฒ๐ŸŽ(dโˆ’nร—n)]โ€‹๐‘ป~.\displaystyle=\tilde{{\bm{\Phi}}}\left[\begin{array}[]{c}{\bm{\Lambda}}\\ {\bm{0}}_{\left(d-n\times n\right)}\end{array}\right]\tilde{{\bm{T}}}. (23)

From the above equation, kerโ€‹h~=spanโ€‹[๐ŸŽd๐ณร—n,๐šฝ~(2)]\mathrm{ker}\ \tilde{h}=\mathrm{span}\left[{\bm{0}}_{d_{\mathbf{z}}\times n},{\tilde{{\bm{\Phi}}}}^{(2)}\right] holds. โˆŽ
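The statement of Lemma 2 can be checked numerically for small toy sizes. The sketch below (with illustrative dimensions) verifies that after the orthogonal rotation induced by the SVD of G, the map Phi -> Phi G depends only on the first n rotated columns:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative): latent dim d_z, layer width d, number of data n.
d_z, d, n = 2, 8, 3
G = rng.normal(size=(d, n))              # rank n almost surely

# SVD of G gives the orthogonal rotation T of the lemma: tau(Phi) = Phi T^T.
U_svd, s, Vt = np.linalg.svd(G, full_matrices=True)
T = U_svd.T

Phi = rng.normal(size=(d_z, d))
Phi_tilde = Phi @ T.T                    # rotated parameters tau(Phi)

# Zero out the trailing d - n columns: the output Phi G must be unchanged,
# i.e., the trailing columns lie in the kernel of the rotated map.
Phi_tilde_trunc = Phi_tilde.copy()
Phi_tilde_trunc[:, n:] = 0.0
out_full = Phi @ G
out_trunc = (Phi_tilde_trunc @ T) @ G    # apply tau^{-1}, then multiply by G
```

Here `T @ G` has zero bottom rows by construction of the SVD, which is exactly the kernel structure used in the proof.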

Lemma 3.

For ๐šฝ~โˆˆโ„d๐ณร—d\tilde{{\bm{\Phi}}}\in\mathbb{R}^{d_{\mathbf{z}}\times d}, let ๐šฝ{\bm{\Phi}} satisfy:

๐šฝ~=[๐šฝ~(1),๐šฝ~(2)]=ฯ„โ€‹(๐šฝ).\tilde{{\bm{\Phi}}}=\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\tilde{{\bm{\Phi}}}}^{(2)}\right]=\tau\left({\bm{\Phi}}\right).

A map h~(1):โ„d๐ณร—nโ†’โ„d๐ณร—n{\tilde{h}}^{(1)}:\mathbb{R}^{d_{\mathbf{z}}\times n}\to\mathbb{R}^{d_{\mathbf{z}}\times n} defined by h~(1)โ€‹(๐šฝ~(1))โ‰”h~โ€‹([๐šฝ~(1),๐ŸŽ]){\tilde{h}}^{(1)}\left({\tilde{{\bm{\Phi}}}}^{(1)}\right)\coloneqq\tilde{h}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right) satisfies ๐šฝโ€‹๐†=h~(1)โ€‹(๐šฝ~(1)){\bm{\Phi}}{\bm{G}}={\tilde{h}}^{(1)}\left({\tilde{{\bm{\Phi}}}}^{(1)}\right) and is linear isomorphic.

Proof.

From the definition, we have

๐šฝโ€‹๐‘ฎ\displaystyle{\bm{\Phi}}{\bm{G}} =ฯ„โˆ’1โ€‹(๐šฝ~)โ€‹๐‘ฎ=๐šฝ~โ€‹๐‘ปโ€‹๐‘ฎ\displaystyle=\tau^{-1}\left(\tilde{{\bm{\Phi}}}\right){\bm{G}}=\tilde{{\bm{\Phi}}}{\bm{T}}{\bm{G}}
=๐šฝ~โ€‹(๐‘ปโ€‹๐‘ฎโ€‹๐‘ป~โŠค)โ€‹๐‘ป~\displaystyle=\tilde{{\bm{\Phi}}}\left({\bm{T}}{\bm{G}}\tilde{{\bm{T}}}^{\top}\right)\tilde{{\bm{T}}}
=[๐šฝ~(1),๐šฝ~(2)]โ€‹[๐šฒ๐ŸŽ(dโˆ’nร—n)]โ€‹๐‘ป~\displaystyle=\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\tilde{{\bm{\Phi}}}}^{(2)}\right]\left[\begin{array}[]{c}{\bm{\Lambda}}\\ {\bm{0}}_{\left(d-n\times n\right)}\end{array}\right]\tilde{{\bm{T}}}
=[๐šฝ~(1),๐ŸŽ]โ€‹[๐šฒ๐ŸŽ(dโˆ’nร—n)]โ€‹๐‘ป~\displaystyle=\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\left[\begin{array}[]{c}{\bm{\Lambda}}\\ {\bm{0}}_{\left(d-n\times n\right)}\end{array}\right]\tilde{{\bm{T}}}
=ฯ„โˆ’1โ€‹([๐šฝ~(1),๐ŸŽ])โ€‹๐‘ฎ\displaystyle=\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right){\bm{G}}
=hโˆ˜ฯ„โˆ’1โ€‹([๐šฝ~(1),๐ŸŽ])=h~โ€‹([๐šฝ~(1),๐ŸŽ])\displaystyle=h\circ\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right)=\tilde{h}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right)
=h~(1)โ€‹(๐šฝ~(1)).\displaystyle={\tilde{h}}^{(1)}\left({\tilde{{\bm{\Phi}}}}^{(1)}\right).

By the definition, h~(1){\tilde{h}}^{(1)} is linear. Here, h~(1){\tilde{h}}^{(1)} is injective, since kerโ€‹h~=spanโ€‹[๐ŸŽd๐ณร—n,๐šฝ~(2)]\mathrm{ker}\ \tilde{h}=\mathrm{span}\left[{\bm{0}}_{d_{\mathbf{z}}\times n},{\tilde{{\bm{\Phi}}}}^{(2)}\right], and hence, dimโ€‹(Imโ€‹h~(1))โ‰ฅd๐ณร—n\mathrm{dim}\left(\mathrm{Im}\ {\tilde{h}}^{(1)}\right)\geq d_{\mathbf{z}}\times n. Since Imโ€‹h~(1)โŠ‚โ„d๐ณร—n\mathrm{Im}\ {\tilde{h}}^{(1)}\subset\mathbb{R}^{d_{\mathbf{z}}\times n}, h~(1){\tilde{h}}^{(1)} is surjective. โˆŽ

Lemma 4.

Let V:โ„Dโˆ‹ฯ•โ†ฆVโ€‹(๐šฝ)=Uโ€‹(๐—,f๐ณโˆฃ๐ฑโ€‹(๐—;๐šฝ))โ‰”โˆ‘i=1nUโ€‹(๐ฑ(i),f๐ณโˆฃ๐ฑโ€‹(๐ฑ(i);๐šฝ))โˆˆโ„V:\mathbb{R}^{D}\ni{\bm{\phi}}\mapsto V\left({\bm{\Phi}}\right)=U\left({\bm{X}},f_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{X}};{\bm{\Phi}}\right)\right)\coloneqq\sum_{i=1}^{n}U\left({\bm{x}}^{(i)},f_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{x}}^{(i)};{\bm{\Phi}}\right)\right)\in\mathbb{R}, ๐šฝ~=[๐šฝ~(1),๐šฝ~(2)]โ‰”ฯ„โ€‹(๐šฝ)\tilde{{\bm{\Phi}}}=\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\tilde{{\bm{\Phi}}}}^{(2)}\right]\coloneqq\tau\left({\bm{\Phi}}\right), V~โ‰”Vโˆ˜ฯ„โˆ’1\tilde{V}\coloneqq V\circ\tau^{-1}, and V~(1)โ€‹(๐šฝ~(1))โ‰”V~โ€‹([๐šฝ~(1),๐ŸŽd๐ณร—(dโˆ’n)]){\tilde{V}}^{(1)}\left({\tilde{{\bm{\Phi}}}}^{(1)}\right)\coloneqq\tilde{V}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}_{d_{\mathbf{z}}\times(d-n)}\right]\right). Then Eq. (12) is equivalent to

dโ€‹๐šฝ~(1)\displaystyle d\tilde{{\bm{\Phi}}}^{(1)} =โˆ’โˆ‡๐šฝ~(1)V~(1)โ€‹(๐šฝ~(1))โ€‹dโ€‹t+2โ€‹dโ€‹B,\displaystyle=-\nabla_{\tilde{{\bm{\Phi}}}^{(1)}}{\tilde{V}}^{(1)}\left({\tilde{{\bm{\Phi}}}}^{(1)}\right)dt+\sqrt{2}dB, (24)
dโ€‹๐šฝ~(2)\displaystyle d\tilde{{\bm{\Phi}}}^{(2)} =2โ€‹dโ€‹B.\displaystyle=\sqrt{2}dB. (25)
Proof.

By direct calculation, we obtain

V~โ€‹([๐šฝ~(1),๐šฝ~(2)])\displaystyle\tilde{V}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\tilde{{\bm{\Phi}}}}^{(2)}\right]\right) (26)
=Vโˆ˜ฯ„โˆ’1โ€‹([๐šฝ~(1),๐šฝ~(2)])\displaystyle=V\circ\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\tilde{{\bm{\Phi}}}}^{(2)}\right]\right)
=Uโ€‹(๐‘ฟ,f๐ณโˆฃ๐ฑโ€‹(๐‘ฟ;ฯ„โˆ’1โ€‹([๐šฝ~(1),๐šฝ~(2)])))\displaystyle=U\left({\bm{X}},f_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{X}};\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\tilde{{\bm{\Phi}}}}^{(2)}\right]\right)\right)\right)
=Uโ€‹(๐‘ฟ,hโ€‹(ฯ„โˆ’1โ€‹([๐šฝ~(1),๐šฝ~(2)])))\displaystyle=U\left({\bm{X}},h\left(\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\tilde{{\bm{\Phi}}}}^{(2)}\right]\right)\right)\right)
=Uโ€‹(๐‘ฟ,hโˆ˜ฯ„โˆ’1โ€‹([๐šฝ~(1),๐ŸŽ])+hโˆ˜ฯ„โˆ’1โ€‹([๐ŸŽ,๐šฝ~(2)]))\displaystyle=U\left({\bm{X}},h\circ\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right)+h\circ\tau^{-1}\left(\left[{\bm{0}},{\tilde{{\bm{\Phi}}}}^{(2)}\right]\right)\right)
=Uโ€‹(๐‘ฟ,hโ€‹(ฯ„โˆ’1โ€‹([๐šฝ~(1),๐ŸŽ])))\displaystyle=U\left({\bm{X}},h\left(\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right)\right)\right)
=Uโ€‹(๐‘ฟ,f๐ณโˆฃ๐ฑโ€‹(๐‘ฟ;ฯ„โˆ’1โ€‹([๐šฝ~(1),๐ŸŽ])))\displaystyle=U\left({\bm{X}},f_{{\mathbf{z}}\mid{\mathbf{x}}}\left({\bm{X}};\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right)\right)\right)
=Vโˆ˜ฯ„โˆ’1โ€‹([๐šฝ~(1),๐ŸŽ])\displaystyle=V\circ\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right)
=V~โ€‹([๐šฝ~(1),๐ŸŽ]).\displaystyle=\tilde{V}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right). (27)

Then, the following equivalence holds:

dโ€‹๐šฝ=โˆ’โˆ‡๐šฝVโ€‹(๐šฝ)โ€‹dโ€‹t+2โ€‹dโ€‹B,\displaystyle d{\bm{\Phi}}=-\nabla_{{\bm{\Phi}}}V\left({\bm{\Phi}}\right)dt+\sqrt{2}dB,
โ‡”dโ€‹ฯ„โˆ’1โ€‹(๐šฝ)=โˆ’ฯ„โŠคโ€‹(โˆ‡๐šฝ~V~โ€‹(๐šฝ~))โ€‹dโ€‹t+2โ€‹dโ€‹B\displaystyle\Leftrightarrow d\tau^{-1}\left({\bm{\Phi}}\right)=-\tau^{\top}\left(\nabla_{\tilde{{\bm{\Phi}}}}\tilde{V}\left(\tilde{{\bm{\Phi}}}\right)\right)dt+\sqrt{2}dB
โ‡”dโ€‹๐šฝ~=โˆ’ฯ„โˆ˜ฯ„โŠคโ€‹(โˆ‡๐šฝ~V~โ€‹(๐šฝ~))โ€‹dโ€‹t+2โ€‹dโ€‹ฯ„โ€‹(B)=โˆ’โˆ‡๐šฝ~V~โ€‹(๐šฝ~)โ€‹dโ€‹t+2โ€‹dโ€‹B,\displaystyle\begin{aligned} \Leftrightarrow d\tilde{{\bm{\Phi}}}&=-\tau\circ\tau^{\top}\left(\nabla_{\tilde{{\bm{\Phi}}}}\tilde{V}\left(\tilde{{\bm{\Phi}}}\right)\right)dt+\sqrt{2}d\tau\left(B\right)\\ &=-\nabla_{\tilde{{\bm{\Phi}}}}\tilde{V}\left(\tilde{{\bm{\Phi}}}\right)dt+\sqrt{2}dB,\end{aligned} (28)

where we used ฯ„โˆ˜ฯ„โŠค=id\tau\circ\tau^{\top}=\mathrm{id} because ฯ„\tau is orthogonal. From Eq. (27), the dynamics in Eq. (28) is equivalent to Eq. (24) and Eq. (25). โˆŽ

In the following, we prove Theorem 1 using the above lemmas.

Because the latent variables ๐’โ‰”h~(1)โ€‹(๐šฝ~(1)){\bm{Z}}\coloneqq\tilde{h}^{(1)}\left({\tilde{{\bm{\Phi}}}}^{(1)}\right) are independent of ๐šฝ~(2){\tilde{{\bm{\Phi}}}}^{(2)}, the stationary distribution qโ€‹(๐’โˆฃ๐‘ฟ)q\left({\bm{Z}}\mid{\bm{X}}\right) is given by (h~(1))#โ€‹(pโˆ—(1))โ€‹(๐’)\left({\tilde{h}}^{(1)}\right)_{\#}\left(p_{\ast}^{(1)}\right)\left({\bm{Z}}\right), which is the pushforward measure of the probability distribution p(1)โ€‹(๐šฝ~)p^{(1)}\left(\tilde{{\bm{\Phi}}}\right) by h~(1){\tilde{h}}^{(1)}. Then, we have

qโ€‹(๐’โˆฃ๐‘ฟ)\displaystyle q\left({\bm{Z}}\mid{\bm{X}}\right)
=(h~(1))#โ€‹(pโˆ—(1))โ€‹(๐’)\displaystyle=\left({\tilde{h}}^{(1)}\right)_{\#}\left(p_{\ast}^{(1)}\right)\left({\bm{Z}}\right)
=p(1)โ€‹((h~(1))โˆ’1โ€‹(๐’))โ€‹|detโ€‹dโ€‹(h~(1))โˆ’1dโ€‹๐’|\displaystyle=p^{(1)}\left(\left({\tilde{h}}^{(1)}\right)^{-1}\left({\bm{Z}}\right)\right)\left|\mathrm{det}\frac{d({\tilde{h}}^{(1)})^{-1}}{d{\bm{Z}}}\right|
=p(1)โ€‹((h~(1))โˆ’1โ€‹(๐’))โ€‹|detโ€‹dโ€‹h~(1)dโ€‹๐šฝ~(1)|โˆ’1\displaystyle=p^{(1)}\left(\left({\tilde{h}}^{(1)}\right)^{-1}\left({\bm{Z}}\right)\right)\left|\mathrm{det}\frac{d{\tilde{h}}^{(1)}}{d{\tilde{{\bm{\Phi}}}}^{(1)}}\right|^{-1}
=p(1)โ€‹((h~(1))โˆ’1โ€‹(๐’))ร—|detโ€‹h~(1)|โˆ’1\displaystyle=p^{(1)}\left(\left({\tilde{h}}^{(1)}\right)^{-1}\left({\bm{Z}}\right)\right)\times\left|\mathrm{det}{\tilde{h}}^{(1)}\right|^{-1}
โˆexpโก(โˆ’V~โ€‹((h~(1))โˆ’1โ€‹(๐’)))\displaystyle\propto\exp\left(-\tilde{V}\left(\left({\tilde{h}}^{(1)}\right)^{-1}\left({\bm{Z}}\right)\right)\right)
=expโก(โˆ’Vโ€‹(ฯ„โˆ’1โ€‹([(h~(1))โˆ’1โ€‹(๐’),๐ŸŽ])))\displaystyle=\exp\left(-V\left(\tau^{-1}\left(\left[\left({\tilde{h}}^{(1)}\right)^{-1}\left({\bm{Z}}\right),{\bm{0}}\right]\right)\right)\right)
=expโก(โˆ’Uโ€‹(๐‘ฟ,ฯ„โˆ’1โ€‹([(h~(1))โˆ’1โ€‹(๐’),๐ŸŽ])โ€‹๐‘ฎ))\displaystyle=\exp\left(-U\left({\bm{X}},\tau^{-1}\left(\left[\left({\tilde{h}}^{(1)}\right)^{-1}\left({\bm{Z}}\right),{\bm{0}}\right]\right){\bm{G}}\right)\right)
=expโก(โˆ’Uโ€‹(๐‘ฟ,๐’)),\displaystyle=\exp\left(-U\left({\bm{X}},{\bm{Z}}\right)\right),

where we used that dโ€‹h~(1)dโ€‹๐šฝ~(1)=h~(1)\frac{d{\tilde{h}}^{(1)}}{d{\tilde{{\bm{\Phi}}}}^{(1)}}={\tilde{h}}^{(1)} because of the linearity of h~(1){\tilde{h}}^{(1)} and is constant with respect to ๐’{\bm{Z}}. The last equation is derived as follows. From Lemma 3, ๐šฝโ€‹๐‘ฎ=h~(1)โ€‹(๐šฝ~(1)){\bm{\Phi}}{\bm{G}}={\tilde{h}}^{(1)}\left({\tilde{{\bm{\Phi}}}}^{(1)}\right) holds when ๐šฝ~=[๐šฝ~(1),๐šฝ~(2)]=ฯ„โ€‹(๐šฝ)\tilde{{\bm{\Phi}}}=\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\tilde{{\bm{\Phi}}}}^{(2)}\right]=\tau\left({\bm{\Phi}}\right). Thus, when ๐šฝ=ฯ„โˆ’1โ€‹([๐šฝ~(1),๐ŸŽ]){\bm{\Phi}}=\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right), we obtain h~(1)โ€‹(๐šฝ~(1))=๐šฝโ€‹๐‘ฎ=ฯ„โˆ’1โ€‹([๐šฝ~(1),๐ŸŽ])โ€‹๐‘ฎ{\tilde{h}}^{(1)}\left({\tilde{{\bm{\Phi}}}}^{(1)}\right)={\bm{\Phi}}{\bm{G}}=\tau^{-1}\left(\left[{\tilde{{\bm{\Phi}}}}^{(1)},{\bm{0}}\right]\right){\bm{G}}. In particular, for ๐šฝ~(1)=(h~(1))โˆ’1โ€‹(๐’){\tilde{{\bm{\Phi}}}}^{(1)}=\left({\tilde{h}}^{(1)}\right)^{-1}\left({\bm{Z}}\right), we have

๐’\displaystyle{\bm{Z}} =h~(1)โ€‹((h~(1))โˆ’1โ€‹(๐’))\displaystyle={\tilde{h}}^{(1)}\left(\left({\tilde{h}}^{(1)}\right)^{-1}\left({\bm{Z}}\right)\right)
=ฯ„โˆ’1โ€‹([(h~(1))โˆ’1โ€‹(๐’),๐ŸŽ])โ€‹๐‘ฎ.\displaystyle=\tau^{-1}\left(\left[\left({\tilde{h}}^{(1)}\right)^{-1}\left({\bm{Z}}\right),{\bm{0}}\right]\right){\bm{G}}.

โˆŽ

Appendix B Experimental Settings

B.1 Neural likelihood example

Figure 5: Neural likelihood experiments. Columns: ground truth (GT), VI with a diagonal Gaussian, VI with a full-covariance Gaussian, ALD (2D histograms), and ALD (samples).

We perform an experiment with a complex posterior, wherein the likelihood is defined by a randomly initialized neural network $f_{\theta}$. Specifically, we parameterize $f_{\theta}$ with four fully-connected layers of 128 units with ReLU activation and a two-dimensional output, defining $p\left({\mathbf{x}}\mid{\bm{z}}\right)=\mathcal{N}\left(f_{\theta}\left({\bm{z}}\right),\sigma_{x}^{2}I\right)$. We initialize the weight and bias parameters with $\mathcal{N}\left(0,0.2I\right)$ and $\mathcal{N}\left(0,0.1I\right)$, respectively, and set the observation standard deviation $\sigma_{x}$ to 0.25. We use the same neural network architecture for the encoder $f_{{\bm{\phi}}}$. The other settings are the same as in the conjugate Gaussian experiment described in Section 5.1.
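The paper does not include this toy-likelihood script, so the following NumPy sketch shows one reading of the setup (helper names are ours), taking "four fully-connected layers of 128 units" as four ReLU hidden layers followed by a linear two-dimensional output:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_random_mlp(sizes, w_std=0.2, b_std=0.1):
    """Randomly initialized MLP playing the role of f_theta.
    Weights ~ N(0, w_std^2), biases ~ N(0, b_std^2), as in B.1."""
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        W = rng.normal(0.0, w_std, size=(d_in, d_out))
        b = rng.normal(0.0, b_std, size=d_out)
        params.append((W, b))
    return params

def mlp_forward(params, z):
    """ReLU hidden layers, linear output layer."""
    h = z
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b

def log_likelihood(params, x, z, sigma_x=0.25):
    """log N(x; f_theta(z), sigma_x^2 I) up to an additive constant."""
    mu = mlp_forward(params, z)
    return -np.sum((x - mu) ** 2) / (2.0 * sigma_x ** 2)

# 2-D latent in, four hidden layers of 128 units, 2-D observation out
params = make_random_mlp([2, 128, 128, 128, 128, 2])
```

Because the network is random and nonlinear, the induced posterior $p\left({\bm{z}}\mid{\bm{x}}\right)\propto p\left({\bm{x}}\mid{\bm{z}}\right)p\left({\bm{z}}\right)$ is generally multimodal and skewed, which is the regime this experiment targets.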

The results are shown in Figure 5. The left three columns show density visualizations of the ground truth and of the approximate posteriors of the VI methods; the right two columns show 2D histograms and samples obtained using ALD. For VI, we use two different models: one uses diagonal Gaussians $\mathcal{N}\left(\mu\left({\bm{x}};{\bm{\phi}}\right),\mathrm{diag}\left(\sigma^{2}\left({\bm{x}};{\bm{\phi}}\right)\right)\right)$ for the variational distribution, and the other uses Gaussians with full covariance $\mathcal{N}\left(\mu\left({\bm{x}};{\bm{\phi}}\right),\Sigma\left({\bm{x}};{\bm{\phi}}\right)\right)$. The ground-truth density shows that the true posterior is multimodal and skewed, which causes the Gaussian VI methods to fail even when the covariance is modeled. In contrast, the ALD samples accurately capture this complex distribution, because ALD does not assume any tractable distribution for approximating the true posterior.

B.2 Image Generation

For the image generation experiments, we use a standard Gaussian distribution $\mathcal{N}\left({\bm{z}};{\bm{0}},{\bm{I}}\right)$ for the latent prior. The latent dimensionality is set to 8 for MNIST, 16 for SVHN, and 32 for CIFAR-10 and CelebA. The raw images, which take values in $\left\{0,1,\ldots,255\right\}$, are scaled into the range from $-1$ to $1$ via preprocessing. Because the values of the preprocessed images are not continuous in a precise sense due to quantization, it is not desirable to use continuous distributions, e.g., Gaussians, for the likelihood function $p\left({\bm{x}}\mid{\bm{z}};{\bm{\theta}}\right)$. Therefore, we define the likelihood using a discretized logistic distribution [Salimans et al., 2017] as follows:

pโ€‹(๐’™โˆฃ๐’›;๐œฝ)=โˆid๐ฑโˆซaibiLogisticโ€‹(x~i;ฮผi,s)โ€‹๐‘‘x~i,=โˆid๐ฑ(ฯƒโ€‹(biโˆ’ฮผis)โˆ’ฯƒโ€‹(aiโˆ’ฮผis)),ai={โˆ’โˆžx=โˆ’1xiโˆ’1255otherwise,bi={โˆžx=1xi+1255otherwise,\displaystyle\begin{aligned} p\left({\bm{x}}\mid{\bm{z}};{\bm{\theta}}\right)&=\prod_{i}^{d_{\mathbf{x}}}\int_{a_{i}}^{b_{i}}\mathrm{Logistic}\left(\tilde{{x}}_{i};{\mu}_{i},s\right)d\tilde{{x}}_{i},\\ &=\prod_{i}^{d_{\mathbf{x}}}\left(\sigma\left(\frac{b_{i}-{\mu}_{i}}{s}\right)-\sigma\left(\frac{a_{i}-{\mu}_{i}}{s}\right)\right),\\ a_{i}&=\begin{cases}-\infty&{x}=-1\\ {x}_{i}-\frac{1}{255}&\mathrm{otherwise}\end{cases},\\ b_{i}&=\begin{cases}\infty&{x}=1\\ {x}_{i}+\frac{1}{255}&\mathrm{otherwise}\end{cases},\end{aligned} (29)

where ${\bm{\mu}}\coloneqq f_{{\mathbf{x}}\mid{\mathbf{z}}}\left({\bm{z}};{\bm{\theta}}\right)$ with $f_{{\mathbf{x}}\mid{\mathbf{z}}}:\mathbb{R}^{d_{\mathbf{z}}}\to\mathbb{R}^{d_{\mathbf{x}}}$, $\mathrm{Logistic}\left(\cdot;\mu,s\right)$ is the density function of a logistic distribution with location parameter $\mu$ and scale parameter $s$, and $\sigma$ is the logistic sigmoid function. We use a neural network with four fully-connected layers for the decoder function $f_{{\mathbf{x}}\mid{\mathbf{z}}}$. The number of hidden units is set to 1,024, and ReLU is used as the activation function. Before each activation, we apply layer normalization [Ba et al., 2016] to stabilize training. The scale parameter $s$ is also optimized during training. Because it is constrained to $s>0$, we parameterize $s=\zeta\left(b\right)^{-1/2}$, where $\zeta$ is the softplus function, and treat $b$ as a learnable parameter instead. When the model has sufficiently high expressive power, $b$ may diverge to infinity [Rezende and Viola, 2018], so we add a regularization term $\left(b+2\zeta\left(-b\right)\right)/m$ to the loss function, where $m$ is the number of training examples. This regularization corresponds to assuming a standard logistic prior $\mathrm{Logistic}\left(b;0,1\right)$ for $b$, since $-\log\mathrm{Logistic}\left(b;0,1\right)=b+2\zeta\left(-b\right)$. We optimize the models using stochastic gradient ascent with a learning rate of $1\times 10^{-4}$ and a batch size of 100. We run two steps of ALD iterations, i.e., $T=2$ in Algorithm 2, and the step size $\eta$ is set to $1\times 10^{-4}$. We use the same experimental settings for the baseline models. We run the training for 50 epochs on MNIST, SVHN, and CIFAR-10, and for 20 epochs on CelebA. The implementation is available at https://github.com/iShohei220/LAE.
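The bin construction in Eq. (29) can be sketched as follows. This is a minimal NumPy version, not the paper's PyTorch implementation (function names are ours); the $\pm 1/255$ half-bin width corresponds to 256 quantization levels spaced $2/255$ apart in $[-1,1]$, with the edge bins extended to $\pm\infty$:

```python
import numpy as np

def sigmoid(t):
    # logistic sigmoid, i.e. the CDF of Logistic(0, 1)
    return 1.0 / (1.0 + np.exp(-t))

def discretized_logistic_log_likelihood(x, mu, s):
    """Log-likelihood of Eq. (29): the mass a logistic(mu, s) assigns
    to each pixel's quantization bin, summed over pixels.
    x, mu: arrays of pixel values scaled to [-1, 1]."""
    half_bin = 1.0 / 255.0
    # edge bins integrate to the tails: upper CDF -> 1 at x = 1,
    # lower CDF -> 0 at x = -1
    cdf_upper = np.where(x >= 1.0 - 1e-6, 1.0,
                         sigmoid((x + half_bin - mu) / s))
    cdf_lower = np.where(x <= -1.0 + 1e-6, 0.0,
                         sigmoid((x - half_bin - mu) / s))
    return np.sum(np.log(cdf_upper - cdf_lower + 1e-12))
```

Because adjacent bins share edges, the CDF differences telescope and the masses over all 256 bin centers sum to one, so this defines a proper distribution over quantized pixel values.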

B.3 Datasets

All the datasets we use in the experiments are public for non-commercial research purposes. MNIST [LeCun et al., 1998], SVHN [Netzer et al., 2011], CIFAR-10 [Krizhevsky et al., 2009], and CelebA [Liu et al., 2015] are available at http://yann.lecun.com/exdb/mnist/, http://ufldl.stanford.edu/housenumbers, https://www.cs.toronto.edu/~kriz/cifar.html, and http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, respectively; CelebA is also distributed at https://github.com/tkarras/progressive_growing_of_gans. The images of CelebA are resized to $32\times 32$ in the experiments. We use the default data splits for all datasets.

B.4 Computational Resources

We run all experiments on AI Bridging Cloud Infrastructure (ABCI), a large-scale computing infrastructure provided by the National Institute of Advanced Industrial Science and Technology (AIST). The experiments are performed on Computing Nodes (V) of ABCI, each of which has four NVIDIA V100 GPU accelerators, two Intel Xeon Gold 6148 CPUs, one NVMe SSD, 384 GiB of memory, and two InfiniBand EDR ports (100 Gbps each). Please see https://abci.ai for more details.

Appendix C Additional Experiments

In the main results in Section 5, we fix the number of MCMC iterations (i.e., $T$) for the model of Hoffman [2017]. In this additional experiment, we further investigate the effect of $T$ by varying it from 0 to 10. Note that when $T=0$, the model is identical to the standard VAE. The results are shown in Table 2. The effect of $T$ is relatively small, and our LAE (with $T=2$) outperforms all of these settings.

Table 2: Effects of the number of MCMC iterations of Hoffman [2017]. We report the mean and standard deviation of the negative evidence lower bound per data dimension over three different seeds. Lower is better.
MNIST SVHN CIFAR-10 CelebA
Hoffman [2017] ($T=0$)  $1.189\pm 0.002$  $4.442\pm 0.003$  $4.820\pm 0.005$  $4.671\pm 0.001$
Hoffman [2017] ($T=2$)  $1.189\pm 0.002$  $4.440\pm 0.007$  $4.831\pm 0.005$  $4.662\pm 0.011$
Hoffman [2017] ($T=10$)  $1.188\pm 0.001$  $4.437\pm 0.009$  $4.832\pm 0.006$  $4.664\pm 0.004$
LAE ($T=2$)  $\mathbf{1.177}\pm 0.001$  $\mathbf{4.412}\pm 0.002$  $\mathbf{4.773}\pm 0.003$  $\mathbf{4.636}\pm 0.003$