
Deep Latent-Variable Kernel Learning

Haitao Liu, Yew-Soon Ong, Xiaomo Jiang and Xiaofang Wang. Haitao Liu and Xiaofang Wang are with the School of Energy and Power Engineering, Dalian University of Technology, China, 116024. E-mail: [email protected], [email protected]. Yew-Soon Ong is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore, 639798. E-mail: [email protected]. Xiaomo Jiang is with the Digital Twin Laboratory for Industrial Equipment at Dalian University of Technology, China, 116024. E-mail: [email protected].
Abstract

Deep kernel learning (DKL) leverages the connection between Gaussian process (GP) and neural networks (NN) to build an end-to-end, hybrid model. It combines the capability of NN to learn rich representations under massive data and the non-parametric property of GP to achieve automatic regularization that incorporates a trade-off between model fit and model complexity. However, the deterministic encoder may weaken the model regularization of the following GP part, especially on small datasets, due to the free latent representation. We therefore present a complete deep latent-variable kernel learning (DLVKL) model wherein the latent variables perform stochastic encoding for regularized representation. We further enhance the DLVKL from two aspects: (i) the expressive variational posterior through a neural stochastic differential equation (NSDE) to improve the approximation quality, and (ii) the hybrid prior taking knowledge from both the SDE prior and the posterior to arrive at a flexible trade-off. Intensive experiments imply that the DLVKL-NSDE performs similarly to the well-calibrated GP on small datasets, and outperforms existing deep GPs on large datasets.

Index Terms:
Gaussian process, Neural network, Latent variable, Regularization, Hybrid prior, Neural stochastic differential equation.

I Introduction

In the machine learning community, the Gaussian process (GP) [1] is a well-known Bayesian model for learning the underlying function $f:\mathbf{x}\mapsto\mathbf{y}$. In comparison to deterministic, parametric machine learning models, e.g., neural networks (NN), the non-parametric GP can encode the user's prior knowledge, calibrate the model complexity automatically, and quantify the uncertainty of predictions, thus showing high flexibility and interpretability. Hence, it has been popularized within various scenarios like regression, classification [2], clustering [3], representation learning [4], sequence learning [5], multi-task learning [6], active learning and optimization [7, 8].

However, the two main weaknesses of GP are its poor scalability and limited model capability on massive data. Firstly, as a representative kernel method, the GP employs the kernel function $k(\cdot,\cdot)$ to encode the spatial correlations of the $n$ training points into the stochastic process. Consequently, it performs operations on the full-rank kernel matrix $\mathbf{K}_{nn}\in R^{n\times n}$, thus incurring a cubic time complexity $\mathcal{O}(n^{3})$ which prohibits the application in the era of big data. To improve the scalability, various scalable GPs have been presented and studied. For example, the sparse approximations introduce $m$ ($m\ll n$) inducing variables $\mathbf{u}$ to distillate the latent function values $\mathbf{f}$ through prior or posterior approximation [9, 10], thus reducing the time complexity to $\mathcal{O}(nm^{2})$. The variational inference with a reorganized evidence lower bound (ELBO) further makes stochastic gradient descent optimizers, e.g., Adam [11], available for training with a greatly reduced complexity of $\mathcal{O}(m^{3})$ [12]. Moreover, further complexity reduction can be achieved by exploiting structured inducing points and iterative methods through matrix-vector multiplies, see for example [13, 14]. In contrast to the global sparse approximation, the complexity of GP can also be reduced through distributed computation [15, 16] and local approximation [17, 18]. The idea of divide-and-conquer splits the data for subspace learning, which alleviates the computational burden and helps capture local patterns. The readers can refer to a recent review [19] of scalable GPs for further information.

Secondly, the GP usually uses (i) the Gaussian assumption to have closed-form inference, and (ii) stationary and smoothing kernels that simply quantify how quickly the correlations vary along dimensions, which raises an urgent demand for developing new GP paradigms that learn rich statistical representations under massive data. Hence, the interpretation of NN from the view of kernel learning [20] inspires the construction of deep kernels for GP to mimic the nonlinear and recurrent behaviors of NN [21, 22]. But the representation learning of deep kernels, in comparison to the deep models reviewed below, is limited unless they are richly parameterized [23].

Considering the theoretical connection between GP and wide deep neural networks [24, 25], a hybrid, end-to-end model, called deep kernel learning (DKL) [26, 27, 28], has been proposed to combine the non-parametric property of GP and the inductive biases of NN. In this framework, the NN serves as a deterministic encoder for representation learning, and the sparse GP is built on the latent inputs to provide Bayesian estimations. The NN+GP structure has thereafter been extended for handling semi-supervised learning [29] and time-series forecasting [30]. The automatic regularization through the marginal likelihood of the last GP layer is expected to improve the performance of DKL and reduce the requirement of fine-tuning and regularization. But we find that the deterministic encoder may deteriorate the regularization of DKL, especially on small datasets, which will be elaborated in the following sections. Alternatively, we could stack sparse GPs together to build the deep GP (DGP) [31, 32], which admits a layer-by-layer GP transformation that yields non-Gaussian distributions. Hence, the DGPs usually resort to variational inference for model training. Different from DKL, the DGPs employ the full GP paradigm to arrive at automatic regularization. But the representation learning through layer-by-layer sparse GPs suffers from (i) high time complexity, (ii) complicated approximate inference, and (iii) limited capability due to the finite global inducing variables in each layer.

From the foregoing review and discussion, it is observed that the simple and scalable DKL enjoys the great representation power of NN but suffers from the mismatch between the deterministic representation and the stochastic inference, which may risk over-fitting, especially on small datasets; while the sparse DGP enjoys the well-calibrated GP paradigm but suffers from high complexity and limited representation capability.

Therefore, this article presents a complete Bayesian version of DKL which inherits (i) the scalability and representation of DKL and (ii) the regularization of DGP. The main contributions of this article are three-fold:

  • We propose an end-to-end latent-variable framework called deep latent-variable kernel learning (DLVKL). It incorporates a stochastic encoding process for regularized representation learning and a sparse GP part for guarding against over-fitting. The whole stochastic framework ensures that it can fully benefit from the automatic regularization of GP;

  • We further improve the DLVKL by constructing (i) the informative variational posterior rather than the simple Gaussian through neural stochastic differential equations (NSDE) to reduce the gap to exact posterior, and (ii) the flexible prior incorporating the knowledge from both the SDE prior and the variational posterior to arrive at a trade-off. The NSDE transformation improves the representation learning and the trade-off provides an adjustable regularization in various scenarios;

  • We showcase the superiority of DLVKL-NSDE against existing deep GPs through extensive (un)supervised learning tasks. The TensorFlow implementations are available at https://github.com/LiuHaiTao01/DLVKL.

The remainder of the article is organized as follows. Section II briefly introduces the sparse GPs. Section III then presents the framework of DLVKL. Thereafter, Section IV proposes an enhanced version, named DLVKL-NSDE, through informative posterior and flexible prior. Then extensive numerical experiments are conducted in Section V to verify the superiority of DLVKL-NSDE on (un)supervised learning tasks. Finally, Section VI offers the concluding remarks.

II Scalable GPs revisited

Let $\mathbf{x}=\{\mathbf{x}_{i}\in R^{d_{\mathbf{x}}}\}_{i=1}^{n}=\{\mathbf{x}_{d}\in R^{n}\}_{d=1}^{d_{\mathbf{x}}}$ be the collection of $n$ points in the input space $\mathcal{X}\in R^{d_{\mathbf{x}}}$, and $\mathbf{y}=\{\mathbf{y}_{i}\in R^{d_{\mathbf{y}}}\}_{i=1}^{n}=\{\mathbf{y}_{d}\in R^{n}\}_{d=1}^{d_{\mathbf{y}}}$ the observations in the output space $\mathcal{Y}\in R^{d_{\mathbf{y}}}$. We seek to infer the latent mappings $\{f_{d}:R^{d_{\mathbf{x}}}\mapsto R\}_{d=1}^{d_{\mathbf{y}}}$ from the data $\mathcal{D}=\{\mathbf{x},\mathbf{y}\}$. To this end, the GP characterizes the distributions of the latent functions by placing independent zero-mean GP priors $f_{d}(\mathbf{x})\sim\mathcal{GP}(0,k_{d}(\mathbf{x},\mathbf{x}^{\prime}))$, $1\leq d\leq d_{\mathbf{y}}$. For regression and binary classification, we usually have $d_{\mathbf{y}}=1$; while for multi-class classification and unsupervised learning, we often face $d_{\mathbf{y}}>1$.

We are interested in two statistics in the GP paradigm. The first is the marginal likelihood $p(\mathbf{y}_{d}|\mathbf{x})=\int p(\mathbf{y}_{d}|\mathbf{f}_{d})p(\mathbf{f}_{d}|\mathbf{x})d\mathbf{f}_{d}$, where $p(\mathbf{f}_{d}|\mathbf{x})=\mathcal{N}(\mathbf{f}_{d}|\mathbf{0},\mathbf{K}_{nn})$ with $\mathbf{K}_{nn}=k(\mathbf{x},\mathbf{x})\in R^{n\times n}$. (For the sake of simplicity, we here use the same kernel for the $d_{\mathbf{y}}$ outputs.) The marginal likelihood $p(\mathbf{y}|\mathbf{x})=\prod_{d=1}^{d_{\mathbf{y}}}p(\mathbf{y}_{d}|\mathbf{x})$, the maximization of which optimizes the hyperparameters, automatically achieves a trade-off between model fit and model complexity [1]. As for $p(\mathbf{y}_{d}|\mathbf{f}_{d})$, we adopt the Gaussian likelihood $p(\mathbf{y}_{d}|\mathbf{f}_{d})=\mathcal{N}(\mathbf{y}_{d}|\mathbf{f}_{d},\nu_{d}^{\epsilon}\mathbf{I})$ for regression and unsupervised learning with continuous outputs, given the i.i.d. noise $\epsilon_{d}\sim\mathcal{N}(0,\nu_{d}^{\epsilon})$ [1, 4]. For binary classification with discrete outputs $y\in\{0,1\}$ and multi-class classification with $y\in\{1,\cdots,d_{\mathbf{y}}\}$, we have $p(\mathbf{y}_{d}|\mathbf{f}_{d})=\mathrm{Bernoulli}(\pi(\mathbf{f}_{d}))$ and $p(\mathbf{y}_{d}|\mathbf{f}_{d})=\mathrm{Categorical}(\pi(\mathbf{f}_{d}))$, respectively, wherein $\pi(\cdot)\in[0,1]$ is an inverse link function that squashes $f$ into the class probability space [33].

The second statistic of interest is the posterior $p(\mathbf{f}_{d}|\mathbf{y}_{d},\mathbf{x})\propto p(\mathbf{y}_{d}|\mathbf{f}_{d})p(\mathbf{f}_{d}|\mathbf{x})$, used to perform prediction at a test point $\mathbf{x}_{*}$ as $p(\mathbf{f}_{*}|\mathbf{y},\mathbf{x},\mathbf{x}_{*})=\int p(\mathbf{f}_{*}|\mathbf{f},\mathbf{x},\mathbf{x}_{*})p(\mathbf{f}|\mathbf{y},\mathbf{x})d\mathbf{f}$. Note that for non-Gaussian likelihoods the posterior is intractable, and we resort to approximate inference algorithms [34].
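
As a concrete illustration of these two statistics in the Gaussian-likelihood (regression) setting, the following minimal NumPy sketch evaluates the log marginal likelihood and the predictive posterior. The RBF kernel, its hyperparameters and the noise variance are illustrative assumptions rather than the settings used later in this paper.

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0, outputscale=1.0):
    # stationary kernel k(tau) = h^2 * exp(-0.5 * ||tau||^2 / l^2)
    d2 = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return outputscale * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_regression(x, y, x_star, noise_var=0.1):
    """Exact GP regression with a Gaussian likelihood: O(n^3) in the data size."""
    n = x.shape[0]
    L = np.linalg.cholesky(rbf(x, x) + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log marginal likelihood log p(y | x): the fit/complexity trade-off
    lml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
           - 0.5 * n * np.log(2.0 * np.pi))
    # predictive posterior p(f_* | y, x, x_*)
    Ksn = rbf(x_star, x)
    mean = Ksn @ alpha
    V = np.linalg.solve(L, Ksn.T)
    var = np.diag(rbf(x_star, x_star)) - np.sum(V ** 2, axis=0)
    return lml, mean, var
```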

The scalability of GPs however is severely limited on massive data, since the inversion and determinant of $\mathbf{K}_{nn}$ incur $\mathcal{O}(n^{3})$ operations. Hence, the sparse approximation [19] employs a set of inducing variables $\mathbf{u}_{d}\in R^{m}$ with $m\ll n$ at $\widetilde{\mathbf{x}}\in R^{m\times d_{\mathbf{x}}}$ to distillate the latent function values $\mathbf{f}_{d}$. Then, we have

\displaystyle\begin{split}p(\mathbf{u}_{d})&=\mathcal{N}(\mathbf{u}_{d}|\mathbf{0},\mathbf{K}_{mm}),\\ p(\mathbf{f}_{d}|\mathbf{u}_{d},\mathbf{x})&=\mathcal{N}(\mathbf{f}_{d}|\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u}_{d},\mathbf{K}_{nn}-\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{nm}^{\mathsf{T}}),\end{split}

where $\mathbf{K}_{mm}=k(\widetilde{\mathbf{x}},\widetilde{\mathbf{x}})$ and $\mathbf{K}_{nm}=k(\mathbf{x},\widetilde{\mathbf{x}})$. Thereafter, variational inference helps handle the intractable $\log p(\mathbf{y}|\mathbf{x})$. This is conducted by using a tractable variational posterior $q(\mathbf{u}_{d})=\mathcal{N}(\mathbf{u}_{d}|\mathbf{m}_{d},\mathbf{S}_{d})$ and minimizing the KL divergence

\displaystyle\mathrm{KL}[q(\mathbf{f},\mathbf{u}|\mathbf{x})||p(\mathbf{f},\mathbf{u}|\mathbf{x},\mathbf{y})]=\sum_{d=1}^{d_{\mathbf{y}}}\mathrm{KL}[q(\mathbf{f}_{d},\mathbf{u}_{d}|\mathbf{x})||p(\mathbf{f}_{d},\mathbf{u}_{d}|\mathbf{x},\mathbf{y}_{d})],

where $p(\mathbf{f}_{d},\mathbf{u}_{d}|\mathbf{x},\mathbf{y}_{d})=p(\mathbf{f}_{d}|\mathbf{u}_{d},\mathbf{x})p(\mathbf{u}_{d}|\mathbf{y}_{d})$ and $q(\mathbf{f}_{d},\mathbf{u}_{d}|\mathbf{x})=p(\mathbf{f}_{d}|\mathbf{u}_{d},\mathbf{x})q(\mathbf{u}_{d})$. This is equivalent to maximizing the following ELBO

\displaystyle\mathcal{L}_{\mathrm{gp}}=\mathbb{E}_{q(\mathbf{f}|\mathbf{x})}[\log p(\mathbf{y}|\mathbf{f})]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})]. (1)

The likelihood term of $\mathcal{L}_{\mathrm{gp}}$ represents the fitting error, and it factorizes over both data points and dimensions

\displaystyle\begin{split}\mathbb{E}_{q(\mathbf{f}|\mathbf{x})}[\log p(\mathbf{y}|\mathbf{f})]&=\sum_{d=1}^{d_{\mathbf{y}}}\mathbb{E}_{q(\mathbf{f}_{d}|\mathbf{x})}[\log p(\mathbf{y}_{d}|\mathbf{f}_{d})]\\ &=\sum_{i=1}^{n}\mathbb{E}_{q(\mathbf{f}_{i}|\mathbf{x}_{i})}[\log p(\mathbf{y}_{i}|\mathbf{f}_{i})],\end{split}

thus having a remarkably reduced time complexity of $\mathcal{O}(m^{3})$ when performing stochastic optimization. Note that $q(\mathbf{f}_{d}|\mathbf{x})=\int p(\mathbf{f}_{d}|\mathbf{u}_{d},\mathbf{x})q(\mathbf{u}_{d})d\mathbf{u}_{d}=\mathcal{N}(\mathbf{f}_{d}|\bm{\mu}^{f}_{d},\bm{\Sigma}_{d}^{f})$ is conditioned on the whole training set, where $\bm{\mu}^{f}_{d}=\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{m}_{d}$ and $\bm{\Sigma}_{d}^{f}=\mathbf{K}_{nn}-\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}[\mathbf{I}-\mathbf{S}_{d}\mathbf{K}_{mm}^{-1}]\mathbf{K}_{nm}^{\mathsf{T}}$. In contrast, the posterior $q(\mathbf{f}_{i}|\mathbf{x}_{i})=\prod_{d=1}^{d_{\mathbf{y}}}\int p(f_{id}|\mathbf{u}_{d},\mathbf{x}_{i})q(\mathbf{u}_{d})d\mathbf{u}_{d}=\mathcal{N}(\mathbf{f}_{i}|\bm{\mu}^{f}_{i},\bm{\nu}^{f}_{i})$ only depends on the related point $\mathbf{x}_{i}$, where $\bm{\mu}^{f}_{i}\in R^{d_{\mathbf{y}}}$ collects the $i$-th element from each member of the set $\{\bm{\mu}^{f}_{d}\}_{d=1}^{d_{\mathbf{y}}}$, and $\bm{\nu}^{f}_{i}\in R^{d_{\mathbf{y}}}$ collects the $i$-th diagonal element from each member of the set $\{\bm{\Sigma}^{f}_{d}\}_{d=1}^{d_{\mathbf{y}}}$. (This viewpoint inspires the doubly stochastic variational inference [32]. Besides, the covariance here is summarized by the inducing variables.) Moreover, the analytical KL term in (1) guards against over-fitting and seeks to deliver a good inducing set. Note that the likelihood $p(\mathbf{y}|\mathbf{f})$ in (1) is not limited to the Gaussian. By carefully reorganizing the formulations, we could derive analytical expressions of $\mathcal{L}_{\mathrm{gp}}$ for (un)supervised learning tasks [12, 35, 36].
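
The sketch below shows how the marginals of $q(\mathbf{f}_{d}|\mathbf{x})$ follow from $\mathbf{m}_{d}$ and $\mathbf{S}_{d}$, assuming the kernel matrices have already been evaluated; the jitter term is a numerical-stability assumption.

```python
import numpy as np

def svgp_marginals(Knm, Kmm, Knn_diag, m_d, S_d, jitter=1e-6):
    """Mean and marginal variance of q(f_d | x) = N(mu_d^f, Sigma_d^f) at the
    n inputs, following the expressions above (a sketch)."""
    m = Kmm.shape[0]
    L = np.linalg.cholesky(Kmm + jitter * np.eye(m))
    A = np.linalg.solve(L.T, np.linalg.solve(L, Knm.T))   # Kmm^{-1} Knm^T, shape (m, n)
    mu = A.T @ m_d                                        # Knm Kmm^{-1} m_d
    # diag of Knn - Knm Kmm^{-1} (Kmm - S_d) Kmm^{-1} Knm^T
    var = (Knn_diag
           - np.einsum('mn,mn->n', A, Kmm @ A)
           + np.einsum('mn,mn->n', A, S_d @ A))
    return mu, var
```

Since the likelihood term only needs these per-point marginals, the ELBO can be estimated on mini-batches, which is what enables the $\mathcal{O}(m^{3})$ stochastic training.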

Finally, when the inputs $\mathbf{x}$ are unobservable, we use the unsupervised ELBO for $\log p(\mathbf{y})$ as

\displaystyle\begin{split}\mathcal{L}_{\mathrm{gp}}^{\mathrm{u}}=&\mathbb{E}_{q(\mathbf{f}|\mathbf{x})q(\mathbf{x})}[\log p(\mathbf{y}|\mathbf{f})]-\mathrm{KL}[q(\mathbf{x})||p(\mathbf{x})]\\ &-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})]\end{split} (2)

to infer the latent variables $\mathbf{x}$ under the GP latent variable model (GPLVM) framework [37]. This unsupervised model can be used for dimensionality reduction, data imputation and density estimation [35].

III Deep latent-variable kernel learning

We consider a GP with additional $d_{\mathbf{z}}$-dimensional latent variables $\mathbf{z}$ (most existing deep GPs can be described by the model (3), see Appendix A) as

\displaystyle p(\mathbf{y}|\mathbf{x})=\int p(\mathbf{y}|\mathbf{z})p(\mathbf{z}|\mathbf{x})d\mathbf{z}, (3)

where the conditional distribution $p(\mathbf{z}|\mathbf{x})$ indicates a stochastic encoding of the original input $\mathbf{x}$, which could ease the inference of the following generative model $p(\mathbf{y}|\mathbf{z})$. As a result, the log marginal likelihood satisfies

\displaystyle\log p(\mathbf{y}|\mathbf{x})\geq\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{y}|\mathbf{z})]-\mathrm{KL}[q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}|\mathbf{x})]. (4)

In (4),

  • as for $p(\mathbf{y}|\mathbf{z})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\mathbf{z})d\mathbf{f}$, we use independent GPs $f_{d}\sim\mathcal{GP}(0,k_{d}(\cdot,\cdot))$, $1\leq d\leq d_{\mathbf{y}}$, to fit the mappings between $\mathbf{y}$ and $\mathbf{z}$. Consequently, we have the following lower bound by resorting to the sparse GP as

    \displaystyle\log p(\mathbf{y}|\mathbf{z})\geq\mathbb{E}_{q(\mathbf{f}|\mathbf{u},\mathbf{z})q(\mathbf{u})}[\log p(\mathbf{y}|\mathbf{f})]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})]. (5)

    Note that the GP mapping employed here learns a joint distribution $p(\mathbf{y}_{d}|\mathbf{z})=\mathcal{N}(\mathbf{y}_{d}|\mathbf{0},\mathbf{K}_{nn}+\nu_{d}^{\epsilon}\mathbf{I})$ by considering the correlations over the entire space in order to achieve automatic regularization;

  • as for the prior $p(\mathbf{z}|\mathbf{x})$, we usually employ the fully independent form $p(\mathbf{z}|\mathbf{x})=\prod_{i=1}^{n}\mathcal{N}(\mathbf{z}_{i}|\mathbf{x}_{i})=\prod_{i=1}^{n}\prod_{d=1}^{d_{\mathbf{z}}}\mathcal{N}(z_{i,d}|\mathbf{x}_{i})$ factorized over both the data index $i$ and the dimension index $d$;

  • as for the encoder (variational posterior) $q(\mathbf{z}|\mathbf{x})$, since we aim to utilize the great representational power of NN, it often takes the form of independent Gaussians $q(\mathbf{z}|\mathbf{x})=\prod_{i=1}^{n}\mathcal{N}(\mathbf{z}_{i}|\bm{\mu}_{i},\mathrm{diag}[\bm{\nu}_{i}])$. Instead of directly treating the means $\{\bm{\mu}_{i}\}_{i=1}^{n}$ and variances $\{\bm{\nu}_{i}\}_{i=1}^{n}$ as hyperparameters, they are made parameterized functions of the input $\mathbf{x}_{i}$ through a multi-layer perceptron (MLP), for example, as

    \displaystyle\bm{\mu}_{i}=\mathtt{Linear}(\mathtt{MLP}(\mathbf{x}_{i})),\quad\bm{\nu}_{i}=\mathtt{Softplus}(\mathtt{MLP}(\mathbf{x}_{i})).

    This strategy, called amortized variational inference, shares the parameters over all training points, thus allowing for efficient training.

Combining (4) and (5), the final ELBO of the proposed deep latent-variable kernel learning (DLVKL) model writes as

\displaystyle\begin{split}\mathcal{L}=&\mathbb{E}_{q(\mathbf{f}|\mathbf{z})q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{y}|\mathbf{f})]-\mathrm{KL}[q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}|\mathbf{x})]\\ &-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})],\end{split} (6)

which can be optimized by the reparameterization trick [38].
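
To make the stochastic encoding concrete, the sketch below produces the amortized Gaussian posterior, draws a reparameterized sample of $\mathbf{z}$, and evaluates the analytical KL between two factorized Gaussians that appears in (6). The single hidden layer, the tanh activation and the parameter names are illustrative assumptions, not the architecture used in the experiments.

```python
import numpy as np

def softplus(a):
    return np.log1p(np.exp(a))

def encode(x, params):
    """Amortized encoder q(z_i | x_i) = N(mu_i, diag(nu_i)); the MLP weights in
    `params` (W1, b1, W_mu, b_mu, W_nu, b_nu) are shared over all points."""
    h = np.tanh(x @ params['W1'] + params['b1'])            # MLP hidden layer
    mu = h @ params['W_mu'] + params['b_mu']                # Linear head
    nu = softplus(h @ params['W_nu'] + params['b_nu'])      # Softplus head
    return mu, nu

def reparameterize(mu, nu, rng):
    # z = mu + sqrt(nu) * eps keeps the sample differentiable w.r.t. mu and nu
    return mu + np.sqrt(nu) * rng.standard_normal(mu.shape)

def kl_diag_gauss(mu_q, nu_q, mu_p, nu_p):
    """KL[N(mu_q, diag(nu_q)) || N(mu_p, diag(nu_p))], summed over dimensions."""
    return 0.5 * np.sum(np.log(nu_p / nu_q) + (nu_q + (mu_q - mu_p) ** 2) / nu_p - 1.0,
                        axis=-1)
```

A mini-batch estimate of (6) then combines the expected log-likelihood under the GP marginals with this KL term and the analytical $\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})]$.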

In comparison to the ELBO of DKL in Appendix A-A, the additional $\mathrm{KL}[q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}|\mathbf{x})]$ in (6) regularizes the representation learning of the encoder, which distinguishes our DLVKL from the DKL proposed in [26]. The DKL adopts a deterministic representation learning $\mathbf{z}=\mathtt{MLP}(\mathbf{x})$, which is equivalent to the variational distribution $q(\mathbf{z}_{i}|\mathbf{x}_{i})=\mathcal{N}(\mathbf{z}_{i}|\bm{\mu}_{i},\bm{\nu}_{i}\rightarrow\mathbf{0}^{+})=\delta(\mathbf{z}_{i}-\bm{\mu}_{i})$. Intuitively, in order to maximize $\mathcal{L}$, the model is encouraged to learn a $p(\mathbf{y}|\mathbf{z})$ that maps $\mathbf{z}$ to a distribution concentrated on $\mathbf{y}$. Particularly, pushing all the mass of the distribution $p(\mathbf{y}|\mathbf{z})$ onto $\mathbf{y}$ results in $\mathcal{L}\rightarrow\infty$, which however risks severe over-fitting [39]. The last GP layer employed in DKL is expected to alleviate this issue by considering the joint correlations over the output space. But the deterministic encoder, which is inconsistent with the following stochastic GP part, incurs a free latent representation that weakens the model regularization, resulting in over-fitting and under-estimated prediction variance in scenarios with finite data points. Contrarily, the proposed DLVKL builds a complete statistical learning framework wherein the additional KL regularization avoids this issue by pushing $q(\mathbf{z}|\mathbf{x})$ to match the prior $p(\mathbf{z}|\mathbf{x})$.

Note that when $\mathbf{x}$ is unobservable, i.e., $\mathbf{x}\triangleq\mathbf{y}$, the proposed DLVKL recovers the GPLVM using back constraints (recognition model) [40]. Also, the bound (6) becomes the VAE-type ELBO

\displaystyle\begin{split}\mathcal{L}^{\mathrm{u}}=&\mathbb{E}_{q(\mathbf{f}|\mathbf{z})q(\mathbf{z})}[\log p(\mathbf{y}|\mathbf{f})]-\mathrm{KL}[q(\mathbf{z})||p(\mathbf{z})]\\ &-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})]\end{split} (7)

for unsupervised learning [38], wherein the main difference is that a GP decoder is employed.

Though the complete stochastic framework makes DLVKL more attractive than DKL, two challenges remain to be addressed:

  • firstly, the assumed variational Gaussian posterior $q(\mathbf{z}|\mathbf{x})$ is often significantly different from the exact posterior $p(\mathbf{z}|\mathbf{x},\mathbf{y})$. This gap may deteriorate the performance and thus raises the demand for an expressive $q(\mathbf{z}|\mathbf{x})$ for better approximation, which will be addressed in Section IV-A;

  • secondly, the choice of the prior $p(\mathbf{z}|\mathbf{x})$ affects the regularization $\mathrm{KL}[q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}|\mathbf{x})]$ on the latent representation. The flexibility and expressivity of the prior will be improved in Section IV-B.

IV Improvements of DLVKL

The improvements of DLVKL come from two aspects: (i) the more expressive variational posterior transformed through neural stochastic differential equation (NSDE) for better approximating the exact posterior; and (ii) the flexible, hybrid prior to introduce adjustable regularization on latent representation. The two improvements are elaborated respectively in the following subsections.

IV-A Expressive variational posterior $q(\mathbf{z}|\mathbf{x})$ via NSDE

In order to generate an expressive posterior $q(\mathbf{z}|\mathbf{x})$ rather than the simple Gaussian, which is beneficial for minimizing $\mathrm{KL}[q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}|\mathbf{x},\mathbf{y})]$, we interpret the stochastic encoding $\mathbf{x}\rightarrow\mathbf{z}$ as a continuous-time dynamic system governed by an SDE [41] over the time period $[0,T]$ as

\displaystyle d\mathbf{z}_{i}^{t}=\bm{\mu}_{i}^{t}dt+\mathbf{L}_{i}^{t}d\mathbf{w}^{t},\quad 0\leq t\leq T, (8)

where the initial state $\mathbf{z}_{i}^{0}\triangleq\mathbf{x}_{i}$; $\bm{\mu}_{i}^{t}=\mu(\mathbf{z}_{i}^{t})$ is the deterministic drift vector; $\bm{\Sigma}_{i}^{t}=\mathbf{L}_{i}^{t}(\mathbf{L}_{i}^{t})^{\mathsf{T}}=\mathrm{diag}[\bm{\nu}_{i}^{t}]$ with $\mathbf{L}_{i}^{t}=L(\mathbf{z}_{i}^{t})$ is the positive definite diffusion matrix which indicates the scale of the random Brownian motion $\mathbf{w}^{t}$ that scatters the state with random perturbation; and $\mathbf{w}^{t}$ represents the standard and uncorrelated Brownian process, which starts from a zero initial state $\mathbf{w}^{0}=\mathbf{0}$ and has independent Gaussian increments $\mathbf{w}^{t+\Delta t}-\mathbf{w}^{t}\sim\mathcal{N}(\mathbf{0},\Delta t\mathbf{I})$.

The SDE flow in (8) defines a sequence of transformations indexed on a continuous-time domain, the purpose of which is to evolve the simple initial state to one with an expressive distribution. In comparison to the normalizing flow [42] indexed on a discrete-time domain, the SDE flow is more theoretically grounded since it could approach any distribution asymptotically [43]. Besides, the diffusion term, which distinguishes the SDE from the ordinary differential equation (ODE) [44], makes the flow more stable and robust from the view of regularization [45].

The solution of the SDE is given by the Itô integral which integrates the state from the initial state to time $t$ as

\displaystyle\mathbf{z}_{i}^{t}=\mathbf{z}_{i}^{0}+\int_{0}^{t}\bm{\mu}_{i}^{\tau}d\tau+\int_{0}^{t}\mathbf{L}_{i}^{\tau}d\mathbf{w}^{\tau}.

Note that due to the non-differentiable $\mathbf{w}^{\tau}$, the SDE yields continuous but non-smooth trajectories $\mathbf{z}_{i}^{0:t}$. Practically, we usually work with the Euler-Maruyama scheme for time discretization in order to solve the SDE system. Suppose we have $L+1$ time points $t^{0},t^{1},\cdots,t^{L}=T$ in the period $[0,T]$, which are equally spaced with time window $\Delta t=T/L$. Then we have a generative transition between two consecutive states

\displaystyle\mathbf{z}_{i}^{l+1}=\mathbf{z}_{i}^{l}+\bm{\mu}_{i}^{l}\Delta t+\mathrm{diag}[(\bm{\nu}_{i}^{l})^{1/2}]\sqrt{\Delta t}\,\mathcal{N}(\mathbf{0},\mathbf{I}),\quad 0\leq l\leq L-1.

This is equivalent to the following Gaussian transition, given $\bm{\Sigma}_{i}^{l}=\mathrm{diag}[\bm{\nu}_{i}^{l}]$, as

\displaystyle p(\mathbf{z}_{i}^{l+1}|\mathbf{z}_{i}^{l})=\mathcal{N}(\mathbf{z}_{i}^{l+1}|\mathbf{z}_{i}^{l}+\bm{\mu}_{i}^{l}\Delta t,\bm{\Sigma}_{i}^{l}\Delta t). (9)

Note that though each transition is Gaussian, the SDE finally outputs an expressive posterior $q(\mathbf{z}^{L}|\mathbf{x})$ rather than a simple Gaussian, at the cost, however, of having no closed-form expression for $q(\mathbf{z}^{L}|\mathbf{x})$. But the related samples can be obtained by solving the SDE system via the Euler-Maruyama method.
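
A minimal sketch of drawing one sample of $q(\mathbf{z}^{L}|\mathbf{x})$ via the Euler-Maruyama transitions (9); `drift` and `diffusion` stand for the NN-parameterized functions introduced below and are assumed to be callables returning the mean increment and the diagonal variance.

```python
import numpy as np

def nsde_sample(x, drift, diffusion, T=1.0, L=10, rng=None):
    """Simulate z^0 -> z^L with L Euler-Maruyama steps of window dt = T / L."""
    rng = rng or np.random.default_rng()
    dt = T / L
    z = np.array(x, dtype=float)                 # initial state z^0 = x
    for _ in range(L):
        mu, nu = drift(z), diffusion(z)          # per-point drift / diagonal diffusion
        z = z + mu * dt + np.sqrt(nu * dt) * rng.standard_normal(z.shape)
    return z
```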

As for the drift and diffusion, they could alternatively be represented by the mean and variance of a sparse GP to describe the SDE field [46], resulting in analytical KL terms in the ELBO, see Appendix A-C. In order to enhance the representation learning, we herein build a more powerful SDE with NN-formed drift and diffusion, called neural SDE (NSDE). This however comes at the cost of an intractable KL term. It is found that the SDE transformation gives the following ELBO

\displaystyle\begin{split}\mathcal{L}_{\mathrm{sde}}=&\mathbb{E}_{q(\mathbf{f}|\mathbf{z}^{L})q(\mathbf{z}^{L}|\mathbf{x})}[\log p(\mathbf{y}|\mathbf{f})]\\ &-\mathrm{KL}[q(\mathbf{z}^{L}|\mathbf{x})||p(\mathbf{z}^{L}|\mathbf{x})]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})].\end{split} (10)

Different from the ELBO in (6), which poses a Gaussian assumption on $q(\mathbf{z}|\mathbf{x})$, the second KL term on the right-hand side of $\mathcal{L}_{\mathrm{sde}}$ is now intractable due to the implicit density $q(\mathbf{z}^{L}|\mathbf{x})$. Alternatively, it can be estimated through the obtained $s$ SDE trajectory samples as

\displaystyle\mathrm{KL}[q(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})||p(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})]\approx\frac{1}{s}\sum_{j=1}^{s}\log\frac{q(\mathbf{z}_{i}^{L(j)}|\mathbf{x}_{i})}{p(\mathbf{z}_{i}^{L(j)}|\mathbf{x}_{i})}. (11)

To estimate the implicit density $q(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})$, Chen et al. [43] used a simple empirical estimate based on the SDE samples, $q(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})=\sum_{j=1}^{s}\delta(\mathbf{z}_{i}^{L}-\mathbf{z}_{i}^{L(j)})/s$, where $\delta(\mathbf{z}_{i}^{L}-\mathbf{z}_{i}^{L(j)})$ is a point mass at $\mathbf{z}_{i}^{L(j)}$. The estimation quality of this empirical method however is not guaranteed. It is found that, since

\displaystyle q(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})=\mathbb{E}_{q(\mathbf{z}_{i}^{L-1}|\mathbf{x}_{i})}[q(\mathbf{z}_{i}^{L}|\mathbf{z}_{i}^{L-1})], (12)

we could evaluate the density through the SDE trajectory samples from the previous time $t^{L-1}$ as

\displaystyle\begin{split}q(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})&\approx\frac{1}{s}\sum_{j=1}^{s}\mathcal{N}(\mathbf{z}_{i}^{L}|\mathbf{z}_{i}^{(L-1)(j)}+\bm{\mu}_{i}^{(L-1)(j)}\Delta t,\bm{\Sigma}_{i}^{(L-1)(j)}\Delta t).\end{split} (13)

In practice, we adopt the single-sample approximation together with the reparameterization trick to perform backpropagation and obtain an unbiased estimate of the gradients [38].
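
The following sketch assembles the estimators (11) and (13); the array shapes and the `log_prior` callable (the log-density of $p(\mathbf{z}^{L}|\mathbf{x})$, e.g., the SDE prior of Section IV-B) are assumptions for illustration, and in practice a single sample ($s=1$) is used.

```python
import numpy as np

def log_gauss_diag(z, mean, var):
    # log N(z | mean, diag(var)), summed over the latent dimensions
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (z - mean) ** 2 / var, axis=-1)

def kl_z_estimate(zL, zLm1, drift, diffusion, log_prior, dt):
    """Monte-Carlo estimate of KL[q(z^L|x) || p(z^L|x)] per data point.
    zL, zLm1: (s, n, d_z) samples at SDE steps L and L-1; log_prior(zL) -> (s, n)."""
    s = zL.shape[0]
    mu = zLm1 + drift(zLm1) * dt                 # transition means from step L-1
    var = diffusion(zLm1) * dt                   # transition variances
    # Eq. (13): q(z^{L(k)} | x) as a mixture over the s Gaussian transitions
    comp = np.stack([[log_gauss_diag(zL[k], mu[j], var[j]) for j in range(s)]
                     for k in range(s)], axis=0)             # (s, s, n)
    cmax = comp.max(axis=1, keepdims=True)
    log_q = (cmax + np.log(np.mean(np.exp(comp - cmax), axis=1, keepdims=True))).squeeze(1)
    # Eq. (11): average the log-ratio over the s samples of z^L
    return np.mean(log_q - log_prior(zL), axis=0)            # (n,)
```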

IV-B Flexible prior $p(\mathbf{z}|\mathbf{x})$

IV-B1 How about an i.i.d prior?

Without a priori knowledge, we could simply choose an i.i.d. prior $p(\mathbf{z}|\mathbf{x})=\prod_{i=1}^{n}\mathcal{N}(\mathbf{z}_{i}|\mathbf{0},\mathbf{I})$ with isotropic unit Gaussians. This kind of prior however is found to impede our model.

As stated before, when $\mathbf{x}$ is unobservable, the bound (7) becomes the VAE-type ELBO for unsupervised learning. From the view of VAE, it is found that this ELBO may not guide the model to learn a good latent representation. The VAE is known to suffer from the issue of posterior collapse (also called KL vanishing) [47]. That is, the learned posterior is independent of the input $\mathbf{x}$, i.e., $q(\mathbf{z}|\mathbf{x})\approx p(\mathbf{z})$. As a result, the latent variable $\mathbf{z}$ does not encode any information from $\mathbf{x}$. This issue is mainly attributed to the optimization challenge of VAE [48]. It is observed that maximizing the ELBO (7) requires minimizing $\mathrm{KL}[q(\mathbf{z})||p(\mathbf{z})]$, which favors the posterior collapse in the initial training stage since it gives a zero KL value. In this case, if we are using a highly expressive decoder which is capable of modeling arbitrary data distributions, e.g., the PixelCNN [49], then the decoder will not use the information from $\mathbf{z}$.

As for our model, though the GP decoder is not so highly expressive, it cannot escape from the posterior collapse due to the property of GP. Furthermore, the posterior collapse of our model has been observed even in the supervised learning scenario, see Fig. 2. We will prove below that the DLVKL using the simple i.i.d. prior suffers from a non-trivial state when posterior collapse happens.

Before proceeding, we make some required clarifications. First, it is known that the positions $\widetilde{\mathbf{z}}$ of the inducing variables $\mathbf{u}_{d}$ fall into the latent space $\mathcal{Z}\in R^{d_{\mathbf{z}}}$. They are regarded as the variational parameters of $q(\mathbf{u}_{d})$ and need to be optimized. However, since the latent space is not known in advance and dynamically evolves through training, it is hard to properly initialize $\widetilde{\mathbf{z}}$, which in turn may deteriorate the training quality. Hence, we employ the encoded inducing strategy defined below.

Definition 1.

(Encoded inducing) It puts the positions $\widetilde{\mathbf{x}}$ of the inducing variables $\mathbf{u}_{d}$ into the original input space $\mathcal{X}$ instead of $\mathcal{Z}$, and then passes them through the encoder to obtain the related inducing positions in the latent space as

\displaystyle\mathrm{vec}[\widetilde{\mathbf{z}}]=\mathcal{N}(\mathrm{vec}[\widetilde{\mathbf{z}}]|\mathtt{Encoder}(\widetilde{\mathbf{x}}),\mathrm{diag}[\mathtt{Encoder}(\widetilde{\mathbf{x}})]).

Now the inducing positions $\widetilde{\mathbf{z}}$ take into account the characteristics of the latent space through the encoder and become Gaussian.
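
A sketch of the encoded inducing strategy, reusing a hypothetical `encode` callable that returns the encoder mean and variance (as in the earlier sketch); whether one samples $\widetilde{\mathbf{z}}$ or simply takes the encoder mean is an implementation choice.

```python
import numpy as np

def encoded_inducing(x_tilde, encode, rng=None):
    """Keep the inducing positions in the input space X and push them through
    the encoder to obtain the inducing positions z_tilde in the latent space Z."""
    rng = rng or np.random.default_rng()
    mu, nu = encode(x_tilde)                                  # Encoder(x_tilde)
    return mu + np.sqrt(nu) * rng.standard_normal(mu.shape)   # a draw of z_tilde
```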

Second, the GP part employs a stationary kernel, e.g., the RBF kernel, for learning. The properties of stationary kernels are given below.

Definition 2.

(Stationary kernel) The stationary kernel for GP is a function of the relative distance $\bm{\tau}=\mathbf{x}-\mathbf{x}^{\prime}$. Specifically, it is expressed as $k(\bm{\tau})=h^{2}g_{\bm{\psi}}(\bm{\tau})$, where $\bm{\psi}$ collects the kernel parameters which mainly control the smoothness along dimensions, and $h^{2}$ is the output-scale amplitude. Particularly, the stationary kernel satisfies $k(\mathbf{0})=h^{2}$ and $k(\bm{\tau})=k(-\bm{\tau})$.

Thereafter, the following proposition reveals that the DLVKL using i.i.d prior fails when posterior collapse happens.

Proposition 1.

Given the training data $\mathcal{D}=\{\mathbf{x},\mathbf{y}\}$, we build a DLVKL model using the i.i.d. prior $p(\mathbf{z}|\mathbf{x})$, the stationary kernel and the encoded inducing strategy. When the posterior collapse happens in the initial stage of the training process, the DLVKL falls into a non-trivial state: it degenerates to a constant predictor, and the optimizer can only calibrate the prediction variance.

A detailed proof of this proposition is provided in Appendix C. Furthermore, it is found that for any i.i.d. prior $p(\mathbf{z}_{i}|\mathbf{x}_{i})=\mathcal{N}(\bm{\mu}_{0},\bm{\nu}_{0})$, the degeneration in Proposition 1 happens due to the collapsed kernel matrices. The above analysis indicates that the simple i.i.d. prior impedes our model when posterior collapse happens. Though more informative priors have recently been proposed, for example, the mixture-of-Gaussians prior and the VampPrior [50], they still belong to the i.i.d. class. This motivates us to come up with a flexible and informative prior in order to avoid the posterior collapse.

IV-B2 A hybrid prior brings adjustable regularization

Interestingly, we could set the prior drift $\bm{\mu}=\mathbf{0}$ and diffusion $\bm{\Sigma}=\nu_{0}\mathbf{I}$, and let $\mathbf{z}$ pass through the SDE system to obtain an analytical prior at time $T$ as

\displaystyle\begin{split}p_{\mathrm{sde}}(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})&=\mathbf{x}_{i}+\int_{0}^{T}\mathbf{0}d\tau+\int_{0}^{T}\sqrt{\nu_{0}}\mathbf{I}d\mathbf{w}^{\tau}\\ &=\mathcal{N}(\mathbf{z}_{i}^{L}|\mathbf{x}_{i},\nu_{0}T\mathbf{I}).\end{split} (14)

The independent but not identically distributed SDE prior is more informative than $\mathcal{N}(\mathbf{0},\mathbf{I})$. For this SDE prior,

  • it varies over data points, and will not incur the collapsed kernel matrices like the i.i.d prior, thus sidestepping the posterior collapse, see the empirical demonstration in Fig. 2; and

  • the SDE prior performs like the skip connection [51] by connecting the original input $\mathbf{x}$ to the latent output $\mathbf{z}^{L}$ in order to avoid rank pathologies in deep models [23].

Besides, the additional variance $\nu_{0}$ in the SDE prior is used to build a connection to the posterior. For the posterior $q(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})$, we employ the following drift and diffusion for the transition $q(\mathbf{z}_{i}^{l+1}|\mathbf{z}_{i}^{l})$ as

\displaystyle\begin{split}\bm{\mu}_{i}^{l}&=\mathtt{Linear}(\mathtt{MLP}(\mathbf{z}_{i}^{l})),\\ \bm{\Sigma}^{l}_{i}&=\mathrm{diag}[\nu_{0}\times\mathtt{Sigmoid}(\mathtt{MLP}(\mathbf{z}_{i}^{l}))].\end{split} (15)

The connection through $\nu_{0}$ allows knowledge transfer between the prior and the posterior, thus further improving the flexibility of the SDE prior.

More generally, it is found that for the independent but not identically distributed prior $p(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})=\mathcal{N}(\mathbf{z}_{i}^{L}|\bm{\mu}_{i},\bm{\nu}_{i})$, the optimal choice for maximizing the ELBO is $p(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})\triangleq q(\mathbf{z}_{i}^{L}|\mathbf{x}_{i})$. But this would cancel the KL regularizer and risk severe over-fitting. Alternatively, we could construct a composite prior, like [52], as

\displaystyle p_{\beta}(\mathbf{z}^{L}_{i}|\mathbf{x}_{i})=\frac{1}{r_{i}}q^{1-\beta}(\mathbf{z}^{L}_{i}|\mathbf{x}_{i})p_{\mathrm{sde}}^{\beta}(\mathbf{z}^{L}_{i}|\mathbf{x}_{i}), (16)

where $\beta\in[0,1]$ is a trade-off parameter, and $r_{i}=\int q^{1-\beta}(\mathbf{z}^{L}_{i}|\mathbf{x}_{i})p_{\mathrm{sde}}^{\beta}(\mathbf{z}^{L}_{i}|\mathbf{x}_{i})d\mathbf{z}_{i}^{L}$ is a normalizer. When $\beta=1$, we are using the SDE prior; when $\beta$ takes a mild value, we are using a hybrid prior taking information from both the SDE prior and the variational posterior; when $\beta=0$, we are using the variational posterior as the prior.

The mixed prior gives the KL term regarding $\mathbf{z}$ in $\mathcal{L}_{\mathrm{sde}}$ as

\displaystyle\begin{split}&\mathrm{KL}[q(\mathbf{z}^{L}_{i}|\mathbf{x}_{i})||p_{\beta}(\mathbf{z}^{L}_{i}|\mathbf{x}_{i})]\\ =&\beta\mathrm{KL}[q(\mathbf{z}^{L}_{i}|\mathbf{x}_{i})||p_{\mathrm{sde}}(\mathbf{z}^{L}_{i}|\mathbf{x}_{i})]-\log r_{i}.\end{split} (17)

Note that the term $\log r_{i}$ has no trainable parameters. Thereafter, the ELBO rewrites to

\displaystyle\begin{split}\mathcal{L}_{\mathrm{sde}}^{\beta}=&\mathbb{E}_{q(\mathbf{f}|\mathbf{z}^{L})q(\mathbf{z}^{L}|\mathbf{x})}[\log p(\mathbf{y}|\mathbf{f})]\\ -&\beta\mathrm{KL}[q(\mathbf{z}^{L}|\mathbf{x})||p_{\mathrm{sde}}(\mathbf{z}^{L}|\mathbf{x})]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})].\end{split} (18)

The formulation of the $\beta$-ELBO in (18), on the other hand, indicates that $\beta$ can be interpreted as a trade-off between the likelihood (data fit) and the KL regularizer w.r.t. $\mathbf{z}^{L}$, which is similar to [53]. This raises an adjustable regularization on the latent representation $\mathbf{z}^{L}$: when $\beta=1$, the SDE prior poses the strictest regularization; when $\beta=0$, the optimal prior ignores the KL regularizer, like DKL, and focuses on fitting the training data. It is recommended to use an intermediate $\beta$, e.g., $10^{-2}$, to achieve a good trade-off, or to adopt an annealing-$\beta$ strategy [54, 55]. Note that the $\beta$-ELBO can also be derived from the view of the variational information bottleneck (VIB) [56], see Appendix B.
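
Putting (17)-(18) together, the hybrid prior simply down-weights the sample-based KL toward the SDE prior by $\beta$. The sketch below assembles a mini-batch estimate of the $\beta$-ELBO; the rescaling of the batch terms by $n/|\mathcal{B}|$ follows the usual stochastic-optimization convention and is an assumption here.

```python
def beta_elbo(exp_loglik_batch, kl_z_sde_batch, kl_u, beta=1e-2, n=None, batch_size=None):
    """Mini-batch estimate of the beta-ELBO in Eq. (18).
    exp_loglik_batch: Monte-Carlo estimate of E_q[log p(y|f)] summed over the batch;
    kl_z_sde_batch:   estimate of KL[q(z^L|x)||p_sde(z^L|x)] summed over the batch;
    kl_u:             analytical KL[q(u)||p(u)] over the inducing variables."""
    scale = 1.0 if n is None else float(n) / batch_size
    return scale * (exp_loglik_batch - beta * kl_z_sde_batch) - kl_u
```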

IV-C Discussions

Fig. 1 illustrates the structure of the proposed DLVKL and DLVKL-NSDE. It is found that the DLVKL is a special case of DLVKL-NSDE: when $T=1.0$ and $L=1$, $q(\mathbf{z}^{L}|\mathbf{x})$ becomes Gaussian. Compared to the DLVKL, the DLVKL-NSDE generates a more expressive variational posterior through NSDE, which is beneficial to further reduce the gap to the exact posterior. However, the NSDE transformation requires that the states $\{\mathbf{z}^{l}\}_{l=1}^{L}$ have the same dimensionality over time. This makes DLVKL-NSDE unsuitable for handling high-dimensional data, e.g., images. This however is not an issue for DLVKL since it has no limits on the dimensionality of $\mathbf{z}$. Besides, it is notable that when the encoder has $d_{\mathbf{z}}<d_{\mathbf{x}}$, we cannot directly use the SDE prior (14). Alternatively, we could apply a simple dimensionality reduction algorithm, e.g., principal component analysis (PCA), on the inputs to obtain the prior mean. When $d_{\mathbf{z}}>d_{\mathbf{x}}$, we use a zero-padding strategy to obtain the prior mean. In the experiments below, unless otherwise indicated, we use the DLVKL-NSDE with $d_{\mathbf{x}}=d_{\mathbf{z}}$, $\beta=10^{-2}$, $T=1.0$ and $L=10$. When predicting, we average over $s=10$ posterior samples.

Figure 1: The model structure of the proposed DLVKL and DLVKL-NSDE.

V Numerical experiments

This section first uses two toy cases to investigate the characteristics of the proposed models, followed by intensive evaluation on nine regression and classification datasets. Finally, we also showcase the capability of the unsupervised DLVKL on the $\mathtt{mnist}$ dataset. The model configurations in the experimental study are detailed in Appendix D. All the experiments are performed on a Windows workstation with twelve 3.50 GHz cores and 64 GB memory.

V-A Toy cases

This section seeks to investigate the characteristics of the proposed models on two toy cases. Firstly, we illustrate the benefits brought by the flexible prior $p_{\beta}(\mathbf{z}^{L}|\mathbf{x})$ and the expressive posterior $q(\mathbf{z}^{L}|\mathbf{x})$ on a regression case expressed as

\displaystyle y(x)=\left\{\begin{array}{rr}\cos(5x)\times\exp(-0.5x)+1,&x<0,\\ \cos(5x)\times\exp(-0.5x)-1,&x\geq 0,\end{array}\right.

which has a step behavior at the origin. The conventional GP using stationary kernels struggles to capture this non-stationary behavior. We draw 50 points from $[-1.5,1.5]$ together with their observations as training data.
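
For reproducibility, a small sketch of generating such a training set follows; the uniform sampling, the random seed and the noise-free observations are assumptions where the text leaves the details unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.5, 1.5, size=(50, 1))                 # 50 training inputs
y = np.where(x < 0,
             np.cos(5 * x) * np.exp(-0.5 * x) + 1.0,     # branch for x < 0
             np.cos(5 * x) * np.exp(-0.5 * x) - 1.0)     # branch for x >= 0
```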

Fig. 2 illustrates the predictions of DLVKL and DLVKL-NSDE, together with the mean of the learned latent input $\mathbf{z}$. It is found, from left to right, that the i.i.d. $\mathcal{N}(0,1)$ prior leads to a collapsed posterior and a constant predictor, which agrees with the analysis in Proposition 1. Instead, the independent but not identically distributed SDE prior in (14) helps the DLVKL sidestep the posterior collapse. But since this prior takes no knowledge from the posterior $q(\mathbf{z}|\mathbf{x})$ under $\beta=1.0$, the DLVKL leaves almost no space for the encoder $q(\mathbf{z}|\mathbf{x})$ to perform deep representation learning, which is crucial for capturing the step feature. As a result, the pure SDE prior makes DLVKL perform like a GP. Hence, to improve the capability of the encoder, we employ the $\beta$-mixed flexible prior in (16), which takes both the SDE prior and the variational posterior into consideration. Now we can observe that the encoder under $\beta=10^{-2}$ skews the original inputs in the latent space in order to let the GP part of DLVKL describe the step behavior easily. Moreover, the DLVKL-NSDE further employs the NSDE-transformed variational posterior rather than the Gaussian, thus resulting in a better latent representation.

Figure 2: The variants of the proposed model under various priors and variational posteriors on the toy regression case. The blue circles in the top row represent the pairs $(\mathbf{x},\mathbf{y})$ of training data, while the bottom ones are the pairs $(\mathbf{z},\mathbf{y})$. The red curve is the prediction mean and the red shaded region indicates the 95% confidence interval of the prediction.

Next, Fig. 3 investigates the impact of $\beta$ on the behavior of DLVKL-NSDE on a toy binary classification case by varying it from $1.0$ to $10^{-4}$. We also show the results of the stochastic variational GP (SVGP) [12] and DKL for comparison. It is observed that decreasing $\beta$ changes the behavior of DLVKL-NSDE from SVGP to DKL. The decreasing $\beta$ indicates that (a) the prior contains more information from the posterior, and it has $p(\mathbf{z}|\mathbf{x})=q(\mathbf{z}|\mathbf{x})$ when $\beta=0$, like DKL; (b) the KL penalty w.r.t. $\mathbf{z}^{L}$ in (18) is weakened, thus risking over-fitting; and meanwhile, (c) the encoder becomes more flexible, e.g., the extreme $\beta=0$ skews the 2D inputs to an almost 1D manifold, raising the issue of rank pathologies in deep models [23]. In practice, we need to balance, through the $\beta$ value, the KL regularization that guards against over-fitting and the representation learning that improves model capability.

Figure 3: Varying $\beta$ changes the behavior of DLVKL-NSDE from SVGP to DKL on the toy classification case. The bottom row shows the data points transformed in the latent space $\mathcal{Z}$.

Finally, Fig. 4 investigates the impact of the SDE flow parameters $T$ and $L$ on the performance of DLVKL-NSDE. We first fix the flow time as $T=1.0$ and increase the flow step from $L=1$ to $L=15$. When we directly use $T=1.0$, $L=1$ (DLVKL-NSDE herein degenerates to DLVKL), it leads to a single transition density with large variance, thus resulting in a high degree of feature skewness. This in turn raises slight over-fitting in Fig. 4, as it identifies several orange points within the blue group. In contrast, the refinement of the flow step makes the time discretization closer to the underlying process and stabilizes the SDE solver. Secondly, we fix the flow step as $L=10$ and increase the flow time from $T=1.0$ to $T=15.0$. This increases the time window $\Delta t$ and makes the transition density have a larger variance. Hence, the encoder is equipped with higher perturbation. But purely increasing $T$ will deteriorate the quality of the SDE solver, which is indicated by the issue of rank pathologies for DLVKL-NSDE with $T=15.0$, $L=10$. To summarize, in order to gain benefits from the SDE representation learning, the flow step $L$ should increase with the flow time $T$. For example, in comparison to the case of $T=15.0$ and $L=10$, the DLVKL-NSDE using $T=15.0$ and $L=50$ in Fig. 4 yields reasonable predictions and latent representations.

Figure 4: Impact of the flow time $T$ and flow step $L$ on the performance of DLVKL-NSDE on the toy classification case. The bottom row shows the data points transformed in the latent space $\mathcal{Z}$.

V-B Regression and classification

We here evaluate the proposed model on six UCI regression datasets, two classification datasets and the $\mathtt{cifar}$-$\mathtt{10}$ image classification dataset summarized in Table I. The data size ranges from 506 to 11M in order to conduct intensive comparison at different levels. It is notable that the first three small regression datasets in Table I help verify whether the proposed model can achieve reasonable regularization to guard against over-fitting.

The competitors include the pure SVGP [12] and NN, the DiffGP [46], an SDE-based deep GP, and finally the DKL [26]. For the regression tasks, we employ the root mean square error (RMSE) and the negative log likelihood (NLL) as performance criteria to quantify the quality of the predictive distribution. Similarly, we verify the performance of classification in terms of classification accuracy and NLL. Tables II and III report the comparative results in terms of RMSE (accuracy) and NLL, respectively. (We only provide the RMSE and accuracy results for the deterministic NN. Besides, the SVGP and DiffGP are not applied on the $\mathtt{cifar}$-$\mathtt{10}$ dataset, since the pure GPs cannot effectively handle the high-dimensional image data.) The best and second-best results are marked in gray and light gray, respectively. Based on the comparative results, we have the following findings.

TABLE I: The regression and classification datasets
dataset $n_{\mathrm{train}}$ $n_{\mathrm{test}}$ $d_{\mathbf{x}}$ no. of classes
boston 456 50 13 -
concrete 927 103 8 -
wine-red 1440 159 22 -
keggdirected 43945 4882 20 -
kin40k 36000 4000 8 -
protein 41157 4573 9 -
connect-4 60802 6755 43 2
higgs 9900000 1100000 28 2
cifar-10 50000 10000 32$\times$32$\times$3 10

The DKL risks over-fitting on small datasets. It is not surprising that the powerful NN without regularization is over-fitted on the first three small datasets. But we observe that, though equipped with a GP layer, the deterministic representation learning also makes DKL risk over-fitting on the small datasets. Fig. 5 depicts the comparative results on the small $\mathtt{boston}$ dataset. It indicates that the DKL improves the prediction quickly, since there is no regularization on the deterministic encoder. But the free latent representation weakens the regularization of the following GP part. Consequently, over-fitting occurs after around 200 iterations and severely underestimated prediction variances appear after 500 iterations.

Besides, the DKL directly optimizes the positions of the inducing points in the dynamic, unknown latent space $\mathcal{Z}$. Without prior knowledge about $\mathcal{Z}$, we can only use the inputs $\mathbf{x}$ to initialize $\widetilde{\mathbf{z}}$. The mismatch of the data distributions in the two spaces $\mathcal{X}$ and $\mathcal{Z}$ may deteriorate and even lead to inappropriate termination of the training process of DKL. For instance, the DKL fails in several runs on the $\mathtt{kin40k}$ dataset, as indicated by the high standard deviations in the tables.

The DLVKL-NSDE achieves a good trade-off. Different from the DKL, the proposed DLVKL-NSDE builds the whole model in the Bayesian framework. Due to the scarce training data, we use $\beta=1.0$ for DLVKL-NSDE on the first three datasets in order to fully use the KL regularizer w.r.t. $\mathbf{z}^{L}$ in (18). As a result, the DLVKL-NSDE performs similarly to the well-calibrated GP and DiffGP on the first three small datasets, see Tables II and III and Fig. 5. As for the six large datasets, it is not easy to incur over-fitting. Hence, the DLVKL-NSDE employs a small balance parameter $\beta=10^{-2}$ to fully use the power of deep latent representation, and consequently it shows superiority on four datasets.

Deep latent representation improves the prediction on large datasets. It is observed that, in comparison to the pure SVGP, the deep representation learning improves the capability of DiffGP, DKL and DLVKL-NSDE, thus resulting in better performance in most cases. Besides, in comparison to the sparse-GP-assisted representation learning in DiffGP, the more flexible NN-based representation learning further enhances the performance of DKL and DLVKL-NSDE, especially on large datasets.

Figure 5: Comparative results on the small $\mathtt{boston}$ dataset. The curves represent the average results over ten runs, while the shaded regions are the bounds (minimum and maximum) around the mean.
TABLE II: The RMSE results for regression and the accuracy results for classification. For the RMSE criterion, lower is better; while for the accuracy criterion, higher is better.
dataset NN SVGP DiffGP DKL DLVKL-NSDE
boston 0.4953±0.1350 0.3766±0.0696 0.3629±0.0668 0.4703±0.1748 0.3476±0.0745
concrete 0.3383±0.0314 0.3564±0.0250 0.3232±0.0303 0.3520±0.0471 0.3375±0.0278
wine-red 0.9710±0.0748 0.7779±0.0382 0.7779±0.0381 0.9414±0.0974 0.7779±0.0380
keggdirected 0.1125±0.0820 0.0924±0.0053 0.0900±0.0054 0.0894±0.0052 0.0875±0.0057
kin40k 0.1746±0.0109 0.2772±0.0043 0.2142±0.0044 0.7519±0.3751 0.1054±0.0048
protein 0.6307±0.0069 0.7101±0.0090 0.6763±0.0104 0.6452±0.0084 0.6098±0.0088
connect-4 0.8727±0.0037 0.8327±0.0030 0.8550±0.0045 0.8756±0.0017 0.8826±0.0032
higgs 0.7616±0.0014 0.7280±0.0004 0.7297±0.0008 0.7562±0.0017 0.7529±0.0017
cifar-10 0.9155 NA NA 0.9186 0.9176
TABLE III: The NLL results for regression and classification. For this criterion, lower is better.
dataset SVGP DiffGP DKL DLVKL-NSDE
boston 0.4752±0.2377 0.4081±0.2299 48.6696±57.1183 0.3792±0.2550
concrete 0.3759±0.0657 0.2736±0.0941 1.7932±1.1077 0.3295±0.0934
wine-red 1.1666±0.0475 1.1666±0.0473 4.1887±1.1973 1.1668±0.0471
keggdirected -0.9975±0.0321 -1.0283±0.0357 -1.0245±0.0360 -1.0568±0.0349
kin40k 0.1718±0.0092 -0.0853±0.0125 0.9002±0.7889 -0.8456±0.0483
protein 1.0807±0.0116 1.0298±0.0142 0.9817±0.0132 0.9258±0.0155
connect-4 0.3637±0.0049 0.3207±0.0058 0.3098±0.0101 0.2812±0.0076
higgs 0.5356±0.0004 0.5325±0.0016 0.4907±0.0021 0.4980±0.0025
cifar-10 NA NA 0.3546 0.3710

Impact of $\beta$ and flow parameters. Finally, we discuss the impact of $\beta$ and the SDE flow parameters $T$ and $L$ on the performance of DLVKL-NSDE. As for the trade-off parameter $\beta$, which adjusts the flexibility of the prior $p_{\beta}(\mathbf{z}^{L}|\mathbf{x})$ and the weight of the KL regularizer $\mathrm{KL}[q(\mathbf{z}^{L}|\mathbf{x})||p_{\mathrm{sde}}(\mathbf{z}^{L}|\mathbf{x})]$, Fig. 6 performs the investigation on the small $\mathtt{boston}$ and the medium-scale $\mathtt{keggdirected}$ datasets by varying $\beta$ from 1.0 to 0.0. The decrease of $\beta$ improves the flexibility of the hybrid prior since it takes into account more information from the posterior $q(\mathbf{z}^{L}|\mathbf{x})$ through (16). Meanwhile, it weakens the role of the KL regularizer and improves the freedom of representation learning, which is beneficial for minimizing the first likelihood term of (18). Consequently, a small $\beta$ speeds up the training of DLVKL-NSDE. But the weakened KL regularizer with decreasing $\beta$ makes DLVKL-NSDE approach the DKL. Hence, we observe over-fitting and underestimated prediction variance for DLVKL-NSDE with small $\beta$ on the $\mathtt{boston}$ dataset. As for the medium-scale $\mathtt{keggdirected}$ dataset with many more data points, these issues have not yet appeared. But it is found that (i) $\beta=1.0$ is over-regularized on this dataset; and (ii) the extreme $\beta=0.0$ slightly deteriorates the RMSE and NLL results. Hence, we can conclude that (i) a large $\beta$ is favored to fully regularize the model on small datasets in order to guard against over-fitting; (ii) a small $\beta$ is recommended to improve representation learning on complicated large datasets; and (iii) when we use an NN encoder with higher depth and more units, which increase the model complexity, $\beta$ should be increased accordingly.

Figure 6: Impact of the parameter $\beta$ on the performance of DLVKL-NSDE on the small $\mathtt{boston}$ dataset and the large $\mathtt{keggdirected}$ dataset.

Fig. 7 studies the impact of the SDE flow parameters on the $\mathtt{kin40k}$ and $\mathtt{protein}$ datasets by varying the flow time $T$ from 0.5 to 2.0. Note that according to the discussions in Section V-A, the flow step $L$ is accordingly increased to ensure the quality of the SDE solver. The longer SDE flow time transforms the inputs to a more expressive posterior $q(\mathbf{z}^{L}|\mathbf{x})$ in order to reduce the gap to the exact posterior. As a result, the DLVKL-NSDE improves its performance with increasing $T$ on the $\mathtt{kin40k}$ dataset. As for the $\mathtt{protein}$ dataset, it is found that $T=1.0$ is enough since a longer flow time does not further improve the results. Note that the time complexity of DLVKL-NSDE increases with $T$ and $L$. As a trade-off, we usually employ $T=1.0$ and $L=10$ in our experiments.

Figure 7: Impact of the SDE flow parameters $T$ and $L$ on the performance of DLVKL-NSDE on the $\mathtt{kin40k}$ and $\mathtt{protein}$ datasets.

V-C Unsupervised learning on the $\mathtt{mnist}$ dataset

It is notable that the proposed DLVKL-NSDE can also be used for unsupervised learning once we replace the input $\mathbf{x}$ with the output $\mathbf{y}$ in (18). To verify this, we apply the model to the $\mathtt{mnist}$ handwritten digit dataset, which contains 60000 gray images of size $28\times 28$. Since the VAE-type unsupervised learning structure requires feature transformations with varying dimensions, we employ the DLVKL-NSDE with $T=1.0$ and $L=1$, i.e., the DLVKL. In this case, the DLVKL is similar to the GPLVM using back constraints (recognition model) [40]. The difference is that DLVKL uses $\beta=10^{-2}$ instead of $\beta=1.0$ in GPLVM. Besides, the competitors include the VAE [38] and DKL. The details of the experimental configurations are provided in Appendix D.

Figure 8: Unsupervised learning on the 𝚖𝚗𝚒𝚜𝚝\mathtt{mnist} dataset. The top row represents the learned two-dimensional latent space, while the bottom row illustrates the reconstructed digits in comparison to the ground truth (bottom left).

Fig. 8 illustrates the two-dimensional latent space learned by the different models, together with several reconstructed digits. It is clearly observed that the models properly cluster the digits in the latent space. As for reconstruction, the three models recover the profile of the digits but lose some details due to the limited latent dimensionality. Besides, the reconstructed digits of DKL and DLVKL have a slightly noisy background due to the Bayesian averaging and the sharing of the kernel across the 784 outputs. Finally, the DLVKL is found to produce more reasonable reconstructions than the DKL for some digits, e.g., the digit “3”.

Finally, note that for the 𝚖𝚗𝚒𝚜𝚝\mathtt{mnist} dataset together with the 𝚌𝚒𝚏𝚊𝚛\mathtt{cifar}-𝟷𝟶\mathtt{10} dataset in Section V-B, we use the original GP in our models. Some recently developed GP paradigms, for example, the convolutional GP [57] for images, could be considered to further improve the performance.

VI Conclusions

This paper proposes the DLVKL which inherits the advantages of DKL but provides better calibration through regularized latent representation. We further improve the DLVKL through (i) the NSDE transformation to yield an expressive variational posterior, and (ii) the hybrid prior to achieve adjustable regularization on the latent representation. We investigate the algorithmic characteristics of DLVKL-NSDE and compare it against existing deep GPs. The comparative results imply that the DLVKL-NSDE performs similarly to the well calibrated GP on small datasets, and shows superiority on large datasets.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities (DUT19RC(3)070) at Dalian University of Technology, and it was partially supported by the Research and Innovation in Science and Technology Major Project of Liaoning Province (2019JH1-10100024), the National Key Research and Development Project (2016YFB0600104), and the MIIT Marine Welfare Project (Z135060009002).

Appendix A Viewing existing deep GPs in the framework of (3)

A-A Deterministic representation learning of 𝐳\mathbf{z}

In the DKL [26], the transformation from 𝐱i\mathbf{x}_{i} to 𝐳i\mathbf{z}_{i} is performed in a deterministic manner as 𝐳i=𝙼𝙻𝙿(𝐱i)\mathbf{z}_{i}=\mathtt{MLP}(\mathbf{x}_{i}). As a result, the ELBO is expressed as

dkl=𝔼q(𝐟|𝐳=𝙼𝙻𝙿(𝐱))[logp(𝐲|𝐟)]KL(q(𝐮)||p(𝐮)).\displaystyle\mathcal{L}_{\mathrm{dkl}}=\mathbb{E}_{q(\mathbf{f}|\mathbf{z}=\mathtt{MLP}(\mathbf{x}))}[\log p(\mathbf{y}|\mathbf{f})]-\mathrm{KL}(q(\mathbf{u})||p(\mathbf{u})).

Different from (6), the purely deterministic transformation in the above ELBO risks over-fitting, especially on small datasets, which has been verified in our numerical experiments.

A-B Bayesian representation learning of 𝐳\mathbf{z} via GPs

Inspired by NNs, the DGP [31] extends p(𝐳|𝐱)p(\mathbf{z}|\mathbf{x}) to an LL-layer hierarchical structure, wherein each layer is a sparse GP, as

p(𝐳1:L|𝐱)=l=1Lp(𝐳l|𝐳l1)=l=1Lp(𝐳l|𝐮l,𝐳l1)p(𝐮l),\displaystyle p(\mathbf{z}^{1:L}|\mathbf{x})=\prod_{l=1}^{L}p(\mathbf{z}^{l}|\mathbf{z}^{l-1})=\prod_{l=1}^{L}p(\mathbf{z}^{l}|\mathbf{u}^{l},\mathbf{z}^{l-1})p(\mathbf{u}^{l}),

where 𝐳0=𝐱\mathbf{z}^{0}=\mathbf{x}. As a result, the ELBO is expressed as

dgp=𝔼q(𝐟|𝐳L)q(𝐳L|𝐱)[logp(𝐲|𝐟)]l=1LKL[q(𝐮l)||p(𝐮l)]KL[q(𝐮)||p(𝐮)],\displaystyle\begin{split}\mathcal{L}_{\mathrm{dgp}}=&\mathbb{E}_{q(\mathbf{f}|\mathbf{z}^{L})q(\mathbf{z}^{L}|\mathbf{x})}[\log p(\mathbf{y}|\mathbf{f})]\\ &-\sum_{l=1}^{L}\mathrm{KL}[q(\mathbf{u}^{l})||p(\mathbf{u}^{l})]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})],\end{split}

where q(𝐳L|𝐱)=l=1Lp(𝐳l|𝐮l,𝐳l1)q(𝐮l)d𝐮1:Ld𝐳1:L1=l=1Lq(𝐳l|𝐳l1)d𝐳1:L1q(\mathbf{z}^{L}|\mathbf{x})=\int\prod_{l=1}^{L}p(\mathbf{z}^{l}|\mathbf{u}^{l},\mathbf{z}^{l-1})q(\mathbf{u}^{l})d\mathbf{u}^{1:L}d\mathbf{z}^{1:L-1}=\int\prod_{l=1}^{L}q(\mathbf{z}^{l}|\mathbf{z}^{l-1})d\mathbf{z}^{1:L-1}. Due to the complete GP paradigm, the DGP naturally guards against over-fitting. But the capability of representation learning is limited by the finite inducing variables {𝐮l}l=1L\{\mathbf{u}^{l}\}_{l=1}^{L} for massive data.

A-C Bayesian representation learning of 𝐳\mathbf{z} via SDE

Different from the traditional DGP, whose sequence of transformations is indexed on a discrete domain, the differential GP (DiffGP) [46] generalizes the transformation through an SDE indexed on a continuous-time domain. Specifically, given the same dimension dl=dl1=d𝐳d^{l}=d^{l-1}=d_{\mathbf{z}}, 1lL1\leq l\leq L, the posterior transformation q(𝐳il|𝐳il1)q(\mathbf{z}_{i}^{l}|\mathbf{z}_{i}^{l-1}) through a sparse GP can be extended and interpreted as the SDE

d𝐳it=𝝁itdt+𝑳itd𝐰t,t[0,T],\displaystyle d\mathbf{z}_{i}^{t}=\bm{\mu}^{t}_{i}dt+\bm{L}^{t}_{i}d\mathbf{w}^{t},\quad t\in[0,T],

where 𝚺it=𝑳it(𝑳it)𝖳\bm{\Sigma}^{t}_{i}=\bm{L}^{t}_{i}(\bm{L}^{t}_{i})^{\mathsf{T}} is a diagonal matrix; and we have

[𝝁it]d\displaystyle[\bm{\mu}^{t}_{i}]_{d} =𝐤im𝐊mm1𝐦dz,1dd𝐳,\displaystyle=\mathbf{k}_{im}\mathbf{K}_{mm}^{-1}\mathbf{m}_{d}^{z},\quad 1\leq d\leq d_{\mathbf{z}},
[𝚺it]dd\displaystyle[\bm{\Sigma}^{t}_{i}]_{dd} =kii𝐤im𝐊mm1[𝐈𝐒dz𝐊mm1]𝐤im𝖳,1dd𝐳.\displaystyle=k_{ii}-\mathbf{k}_{im}\mathbf{K}_{mm}^{-1}[\mathbf{I}-\mathbf{S}^{z}_{d}\mathbf{K}_{mm}^{-1}]\mathbf{k}_{im}^{\mathsf{T}},\quad 1\leq d\leq d_{\mathbf{z}}.

In the above equations, 𝐦dz\mathbf{m}_{d}^{z} and 𝐒dz\mathbf{S}^{z}_{d} are the mean and covariance of the inducing variables 𝐮dz\mathbf{u}^{z}_{d} shared across time. Thereafter, the ELBO over discrete time points {t0,t1,,tL}\{t^{0},t^{1},\cdots,t^{L}\} for supervised learning is derived as

diffgp=𝔼q(𝐟|𝐳L)q(𝐳L|𝐱)[logp(𝐲|𝐟)]KL[q(𝐮z)||p(𝐮z)]KL[q(𝐮)||p(𝐮)].\displaystyle\begin{split}\mathcal{L}_{\mathrm{diffgp}}=&\mathbb{E}_{q(\mathbf{f}|\mathbf{z}^{L})q(\mathbf{z}^{L}|\mathbf{x})}[\log p(\mathbf{y}|\mathbf{f})]\\ &-\mathrm{KL}[q(\mathbf{u}^{z})||p(\mathbf{u}^{z})]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})].\end{split}

Different from (10), the sparse GP assisted SDE here results in analytical KL terms due to the Gaussian posteriors.
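
For concreteness, the per-dimension moments above translate into a few lines of code. The following NumPy sketch implements exactly these two formulas; the inputs (kernel vectors and inducing statistics) are assumed to be precomputed, and the jitter term is our own addition for numerical stability.

import numpy as np

def sparse_gp_moments(k_im, k_ii, K_mm, m_d, S_d, jitter=1e-6):
    """Per-dimension drift mean and diffusion variance of the sparse-GP SDE:
    mu_d    = k_im K_mm^{-1} m_d
    Sigma_d = k_ii - k_im K_mm^{-1} [I - S_d K_mm^{-1}] k_im^T
    """
    m = K_mm.shape[0]
    K_inv = np.linalg.inv(K_mm + jitter * np.eye(m))   # K_mm^{-1} with a small jitter
    mu = k_im @ K_inv @ m_d                            # drift mean
    var = k_ii - k_im @ K_inv @ (np.eye(m) - S_d @ K_inv) @ k_im.T
    return mu, var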

Appendix B β\beta-ELBO interpreted from VIB

We can interpret the DLVKL from the view of variational information bottleneck (VIB) [56]. Suppose that 𝐳\mathbf{z} is a stochastic encoding of the input 𝐱\mathbf{x}, our goal is to learn an encoding that is maximally informative about the output 𝐲\mathbf{y} subject to a constraint on its complexity as

max 𝕀[𝐳,𝐲], s.t. 𝕀q[𝐳,𝐱]Ic,\displaystyle\mbox{max }\mathbb{I}[\mathbf{z},\mathbf{y}],\mbox{ s.t. }\mathbb{I}_{q}[\mathbf{z},\mathbf{x}]\leq I_{c},

where 𝕀[.,.]\mathbb{I}[.,.] represents the mutual information (MI) between two random variables. The above constraint is applied in order to learn a good representation rather than the simple identity 𝐳=𝐱\mathbf{z}=\mathbf{x}. By introducing a Lagrange multiplier β\beta to the above problem, we have

vib=𝕀[𝐳,𝐲]β𝕀q[𝐳,𝐱].\displaystyle\mathcal{L}_{\mathrm{vib}}=\mathbb{I}[\mathbf{z},\mathbf{y}]-\beta\mathbb{I}_{q}[\mathbf{z},\mathbf{x}].

As for the MI terms in vib\mathcal{L}_{\mathrm{vib}}, given the joint distribution p(𝐱,𝐳,𝐲)=q(𝐳|𝐱)p𝒟(𝐱,𝐲)p(\mathbf{x},\mathbf{z},\mathbf{y})=q(\mathbf{z}|\mathbf{x})p_{\mathcal{D}}(\mathbf{x},\mathbf{y}), the MI 𝕀[𝐳,𝐲]\mathbb{I}[\mathbf{z},\mathbf{y}] is expressed as

𝕀[𝐳,𝐲]=p(𝐲,𝐳)logp(𝐲|𝐳)𝑑𝐲𝑑𝐳+[𝐲]=𝔼p𝒟(𝐱,𝐲)[𝔼q(𝐳|𝐱)[logp(𝐲|𝐳)]]+[𝐲],\displaystyle\begin{split}\mathbb{I}[\mathbf{z},\mathbf{y}]&=\int p(\mathbf{y},\mathbf{z})\log p(\mathbf{y}|\mathbf{z})d\mathbf{y}d\mathbf{z}+\mathbb{H}[\mathbf{y}]\\ &=\mathbb{E}_{p_{\mathcal{D}}(\mathbf{x},\mathbf{y})}\left[\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{y}|\mathbf{z})]\right]+\mathbb{H}[\mathbf{y}],\end{split}

where p𝒟(𝐱,𝐲)=1ni=1nδ(𝐱𝐱i)δ(𝐲𝐲i)p_{\mathcal{D}}(\mathbf{x},\mathbf{y})=\frac{1}{n}\sum_{i=1}^{n}\delta(\mathbf{x}-\mathbf{x}_{i})\delta(\mathbf{y}-\mathbf{y}_{i}) is the empirical distribution estimated from training data. Note that [𝐲]\mathbb{H}[\mathbf{y}] has no trainable parameters. Besides, the MI 𝕀q[𝐳,𝐱]\mathbb{I}_{q}[\mathbf{z},\mathbf{x}] is

𝕀q[𝐳,𝐱]=q(𝐳,𝐱)logq(𝐳,𝐱)p(𝐳)p𝒟(𝐱)d𝐳d𝐱=𝔼p𝒟(𝐱)[KL(q(𝐳|𝐱)||p(𝐳))],\displaystyle\begin{split}\mathbb{I}_{q}[\mathbf{z},\mathbf{x}]&=\int q(\mathbf{z},\mathbf{x})\log\frac{q(\mathbf{z},\mathbf{x})}{p(\mathbf{z})p_{\mathcal{D}}(\mathbf{x})}d\mathbf{z}d\mathbf{x}\\ &=\mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}[\mathrm{KL}(q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}))],\end{split}

where p𝒟(𝐱)=1ni=1nδ(𝐱𝐱i)p_{\mathcal{D}}(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}\delta(\mathbf{x}-\mathbf{x}_{i}).

Finally, we have the ELBO

vib=𝔼p𝒟(𝐱,𝐲)[𝔼q(𝐳|𝐱)[logp(𝐲|𝐳)]]β𝔼p𝒟(𝐱)[KL(q(𝐳|𝐱)||p(𝐳))],\displaystyle\begin{split}\mathcal{L}_{\mathrm{vib}}=&\mathbb{E}_{p_{\mathcal{D}}(\mathbf{x},\mathbf{y})}\left[\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{y}|\mathbf{z})]\right]\\ &-\beta\mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}[\mathrm{KL}(q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}))],\end{split}

which recovers the bound in (18) when we use a sparse GP for p(𝐲|𝐳)p(\mathbf{y}|\mathbf{z}) and the NSDE transformation for q(𝐳|𝐱)q(\mathbf{z}|\mathbf{x}).
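
For illustration, the bound can be estimated by Monte Carlo. The sketch below assumes a diagonal Gaussian encoder, a standard normal prior p(𝐳)p(\mathbf{z}) and a generic log-likelihood; both encoder and loglik are hypothetical stand-ins, so this is a toy version of the objective rather than the DLVKL implementation.

import torch
from torch.distributions import Normal, kl_divergence

def beta_elbo(x, y, encoder, loglik, beta=1e-2, n_samples=1):
    """Monte Carlo sketch of L_vib = E_q(z|x)[log p(y|z)] - beta * KL(q(z|x) || p(z))."""
    mu, std = encoder(x)                                       # parameters of q(z|x)
    q_z = Normal(mu, std)
    p_z = Normal(torch.zeros_like(mu), torch.ones_like(std))   # p(z) = N(0, I)

    # expected log-likelihood via reparameterized samples
    ell = torch.stack([loglik(q_z.rsample(), y) for _ in range(n_samples)]).mean(0)

    # analytical KL between diagonal Gaussians, summed over latent dimensions
    kl = kl_divergence(q_z, p_z).sum(-1)
    return (ell - beta * kl).mean()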

Appendix C Proof of Proposition 1

Proof.

When the posterior collapse happens in the initial stage, we have (i) the zero KL term KL[q(𝐳|𝐱)||p(𝐳|𝐱)]\mathrm{KL}[q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}|\mathbf{x})] staying at its local minimum; and (ii) the mapped inputs 𝐳i𝒩(𝟎,𝐈)\mathbf{z}_{i}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), 1in1\leq i\leq n, and the inducing inputs 𝐳~i𝒩(𝟎,𝐈)\widetilde{\mathbf{z}}_{i}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), 1im1\leq i\leq m. As a result, the relative distance between any two inputs always follows 𝝉𝒩(𝟎,𝐈)\bm{\tau}\sim\mathcal{N}(\mathbf{0},\mathbf{I}). We therefore have the collapsed kernel value

𝔼[k(𝐳i,𝐳i)]=𝔼[k(𝐳i,𝐳~i)]=h2𝔼[g𝝍(𝝉i)]=c𝜽,\displaystyle\mathbb{E}[k(\mathbf{z}_{i},\mathbf{z}^{\prime}_{i})]=\mathbb{E}[k(\mathbf{z}_{i},\widetilde{\mathbf{z}}^{\prime}_{i})]=h^{2}\mathbb{E}[g_{\bm{\psi}}(\bm{\tau}_{i})]=c_{\bm{\theta}},

where c𝜽c_{\bm{\theta}}, which is composed of the model parameters 𝜽\bm{\theta}, is independent of the inputs. Consequently, 𝐊mm\mathbf{K}_{mm} and 𝐊nn\mathbf{K}_{nn} become matrices whose elements are all identical, and hence are not invertible. In practice, we usually add a small positive jitter to the diagonal elements of 𝐊mm\mathbf{K}_{mm} and 𝐊nn\mathbf{K}_{nn} in order to alleviate this issue.

When we attempt to optimize the GP parameters, the collapsed kernel cannot measure the correlations among inputs. In this case, it is observed that the posterior mean of q(𝐟d|𝐱)q(\mathbf{f}_{d}|\mathbf{x}) follows

𝝁fd=𝐊nm𝐊mm1𝐦dm¯d𝟏𝖳,\displaystyle\bm{\mu}^{f}_{d}=\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{m}_{d}\propto\overline{m}_{d}\mathbf{1}^{\mathsf{T}},

where m¯d\overline{m}_{d} is the average of 𝐦d\mathbf{m}_{d}. It indicates that the GP degenerates to a constant predictor. For example, we know that the optimum of the mean 𝐦d\mathbf{m}_{d} of q(𝐮d)q(\mathbf{u}_{d}) for GP regression satisfies [10]

𝐦d=1νϵd𝐊mm(𝐊mm+1νϵd𝐊mn𝐊nm)1𝐊mn𝐲d𝟏m𝟏n𝖳𝐲d.\displaystyle\mathbf{m}_{d}=\frac{1}{\nu^{\epsilon}_{d}}\mathbf{K}_{mm}(\mathbf{K}_{mm}+\frac{1}{\nu^{\epsilon}_{d}}\mathbf{K}_{mn}\mathbf{K}_{nm})^{-1}\mathbf{K}_{mn}\mathbf{y}_{d}\propto\mathbf{1}_{m}\mathbf{1}_{n}^{\mathsf{T}}\mathbf{y}_{d}.

When 𝐲d\mathbf{y}_{d} is normalized to have zero mean, i.e., 𝔼[𝐲d]=0\mathbb{E}[\mathbf{y}_{d}]=0, we have 𝐦d=𝟎\mathbf{m}_{d}=\mathbf{0} and therefore 𝝁fd=𝟎\bm{\mu}^{f}_{d}=\mathbf{0}. As for classification, the degenerated constant predictor simply uses the percentages of the training labels as the class probabilities.

Hence, due to the constant prediction mean, all the optimizer can do is adjust the parameters of the encoder and the GP to calibrate the prediction variances so as to fit the output variances in the training data. ∎
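
The degeneracy can be reproduced numerically. The NumPy sketch below takes the limiting case where all encoded inputs and inducing inputs are literally identical; the RBF kernel, the jitter value and the numbers are assumptions for illustration, and the printed sparse-GP mean is the same for every input, i.e., a constant predictor.

import numpy as np

def rbf(a, b, ls=1.0, var=1.0):
    """Simple RBF kernel between two sets of points."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(0)
n, m, d_z = 5, 3, 2
z = np.tile(rng.normal(size=(1, d_z)), (n, 1))       # collapsed encodings: all rows identical
z_ind = np.tile(z[:1], (m, 1))                       # collapsed inducing inputs
m_d = rng.normal(size=(m, 1))                        # arbitrary variational mean

K_nm = rbf(z, z_ind)
K_mm = rbf(z_ind, z_ind) + 1e-6 * np.eye(m)          # jitter keeps K_mm invertible
mu_f = K_nm @ np.linalg.solve(K_mm, m_d)
print(mu_f.ravel())                                  # identical entries: a constant predictor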

Appendix D Experimental details

Toy cases. For the two toy cases, we adopt the settings of the following regression and classification tasks except that (i) the inducing size is m=20m=20; (ii) the length-scales of the RBF kernel are initialized as 1.0; (iii) the batch size is 𝚖𝚒𝚗{64,n}\mathtt{min}\{64,n\}; and finally (iv) the Adam optimizer is run for 5000 iterations.

Regression and classification tasks. The experimental configurations for the six regression tasks (𝚋𝚘𝚜𝚝𝚘𝚗\mathtt{boston}, 𝚌𝚘𝚗𝚌𝚛𝚎𝚝𝚎\mathtt{concrete}, 𝚠𝚒𝚗𝚎\mathtt{wine}-𝚛𝚎𝚍\mathtt{red}, 𝚔𝚎𝚐𝚐𝚍𝚒𝚛𝚎𝚌𝚝𝚎𝚍\mathtt{keggdirected}, 𝚔𝚒𝚗𝟺𝟶𝚔\mathtt{kin40k}, 𝚙𝚛𝚘𝚝𝚎𝚒𝚗\mathtt{protein}) and the two classification tasks (𝚌𝚘𝚗𝚗𝚎𝚌𝚝\mathtt{connect}-𝟺\mathtt{4}, 𝚑𝚒𝚐𝚐𝚜\mathtt{higgs}) are elaborated below.

As for data preprocessing, we standardize the input dimensions to have zero mean and unit variance; the outputs are additionally standardized for regression. We shuffle the data and randomly select 10% for testing. The shuffle and split are repeated to obtain ten instances for multiple runs.
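
A minimal sketch of this preprocessing and splitting protocol is given below; it relies on scikit-learn utilities and assumes 2-D input and output arrays, so it illustrates the protocol rather than the exact pipeline used in the experiments.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def make_splits(X, Y, n_repeats=10, test_size=0.1, regression=True):
    """Standardize inputs (and outputs for regression), then build repeated 90/10 splits."""
    splits = []
    for seed in range(n_repeats):
        X_tr, X_te, Y_tr, Y_te = train_test_split(
            X, Y, test_size=test_size, shuffle=True, random_state=seed)
        x_scaler = StandardScaler().fit(X_tr)
        X_tr, X_te = x_scaler.transform(X_tr), x_scaler.transform(X_te)
        if regression:                                   # outputs are standardized for regression only
            y_scaler = StandardScaler().fit(Y_tr)
            Y_tr, Y_te = y_scaler.transform(Y_tr), y_scaler.transform(Y_te)
        splits.append((X_tr, Y_tr, X_te, Y_te))
    return splits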

As for the GP part, we adopt the RBF kernel with the length-scales initialized as 0.1d𝐳0.1\sqrt{d_{\mathbf{z}}} and the signal variance initialized as 1.01.0. The inducing size is m=100m=100. The positions of the inducing points 𝐱~\widetilde{\mathbf{x}} are initialized in the original input space 𝒳\mathcal{X} through clustering techniques and then passed through the SDE transformation as 𝐳~=SDE(𝐱~)\widetilde{\mathbf{z}}=\mbox{SDE}(\widetilde{\mathbf{x}}) for DLVKL-NSDE. The variational parameters of the inducing variables 𝐮d𝒩(𝐮d|𝐦d,𝐒d)\mathbf{u}_{d}\sim\mathcal{N}(\mathbf{u}_{d}|\mathbf{m}_{d},\mathbf{S}_{d}) are initialized as 𝐦d=𝟎\mathbf{m}_{d}=\mathbf{0} and 𝐒d=𝐈\mathbf{S}_{d}=\mathbf{I}. We set the prior parameter to β=1.0\beta=1.0 on the small 𝚋𝚘𝚜𝚝𝚘𝚗\mathtt{boston}, 𝚌𝚘𝚗𝚌𝚛𝚎𝚝𝚎\mathtt{concrete} and 𝚠𝚒𝚗𝚎\mathtt{wine}-𝚛𝚎𝚍\mathtt{red} datasets and to β=102\beta=10^{-2} on the remaining datasets, and set the SDE parameters to T=1.0T=1.0 and L=10L=10.
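
The sketch below illustrates one way to realize this initialization; k-means is our assumption for the unspecified clustering technique, and all names are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def init_gp_part(X, d_z, m=100, seed=0):
    """Initialize RBF hyperparameters, inducing positions and variational parameters."""
    lengthscale = 0.1 * np.sqrt(d_z)                     # length-scale init 0.1 * sqrt(d_z)
    signal_var = 1.0                                     # signal variance init
    X_tilde = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X).cluster_centers_
    m_d = np.zeros((m, 1))                               # variational mean of q(u_d)
    S_d = np.eye(m)                                      # variational covariance of q(u_d)
    return lengthscale, signal_var, X_tilde, m_d, S_d

# for DLVKL-NSDE the inducing positions X_tilde would subsequently be pushed
# through the SDE transformation (see the earlier nsde_flow sketch).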

As for the MLP part, we use a fully-connected (FC) NN with three hidden layers and the ReLU activation. The number of units in each hidden layer is max[2d𝐱,10]\mathrm{max}[2d_{\mathbf{x}},10]. In particular, the MLPs of DLVKL-NSDE in (15) share the hidden layers but have separate output layers for the drift and the diffusion, respectively. The diffusion variance ν0\nu_{0} is initialized as 0.01/T0.01/T. Additionally, since the SDE flows over time, we include the time tlt^{l} as an additional input to the MLPs; all layers except the output layers share their parameters over time.
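
A PyTorch sketch of this encoder structure is shown below; the softplus positivity transform for the diffusion scale and the class name are illustrative assumptions, and the module is written so that it can be plugged into the earlier nsde_flow sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SDEEncoder(nn.Module):
    """Drift/diffusion MLPs with shared ReLU hidden layers and the time t^l as an extra input."""

    def __init__(self, d_in, width=None):
        super().__init__()
        width = width or max(2 * d_in, 10)               # hidden width max[2 d_x, 10]
        self.shared = nn.Sequential(
            nn.Linear(d_in + 1, width), nn.ReLU(),       # +1 for the time input
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.drift_head = nn.Linear(width, d_in)         # separate output layer for the drift
        self.diff_head = nn.Linear(width, d_in)          # separate output layer for the diffusion

    def forward(self, z, t):
        h = self.shared(torch.cat([z, t], dim=-1))
        return self.drift_head(h), F.softplus(self.diff_head(h))  # positive diffusion scale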

As for the optimization, we employ Adam with a batch size of 256 and a learning rate of 5×1035\times 10^{-3}. (The training of GPs with Adam often uses a learning rate of 10210^{-2}, while the training of NNs often uses 10310^{-3}; since the DLVKL-NSDE is a hybrid model, we adopt the intermediate learning rate of 5×1035\times 10^{-3}.) The maximum number of iterations is 3000 on the small 𝚋𝚘𝚜𝚝𝚘𝚗\mathtt{boston}, 𝚌𝚘𝚗𝚌𝚛𝚎𝚝𝚎\mathtt{concrete} and 𝚠𝚒𝚗𝚎\mathtt{wine}-𝚛𝚎𝚍\mathtt{red} datasets, 100000 on the large 𝚑𝚒𝚐𝚐𝚜\mathtt{higgs} dataset, and 20000 on the remaining datasets. In the experiments, we do not adopt additional fine-tuning tricks for the MLPs, e.g., a scheduled learning rate, weight regularization, or dropout.

The 𝚌𝚒𝚏𝚊𝚛\mathtt{cifar}-𝟷𝟶\mathtt{10} image classification task. For this image classification dataset, we build our code upon the resnet-20 architecture implemented at https://github.com/yxlijun/cifar-tensorflow. We keep the convolution layers and the 64D FC layer, but additionally add the “FC(10+1)-tanh-FC(10+1)-tanh-FC(2×1\times 10)” layers plus the GP part for DLVKL-NSDE. For DKL, we drop the additional time input and use 10 units in the final layer. We use m=500m=500 inducing points and directly initialize them in the latent space, since the encoded inducing strategy in the high-dimensional image space yields too many parameters, which may make the model training difficult. We use the default data split, data augmentation and optimization strategies of resnet-20 and run for 200 epochs.

The 𝚖𝚗𝚒𝚜𝚝\mathtt{mnist} unsupervised learning task. For the 𝚖𝚗𝚒𝚜𝚝\mathtt{mnist} dataset, the intensity of the grayscale images is normalized to [0,1][0,1]. We build the encoder of the models using FC nets, the architecture of which is “784 Inputs-FC(196)-ReLU-FC(49)-ReLU-FC(2×\times2)”. Note that the DKL employs a deterministic encoder whose last layer is FC(2). The VAE uses a mirrored NN structure to build the corresponding decoder. In contrast, the DKL and DLVKL adopt a sparse GP decoder to map the 2D latent inputs to the 784 outputs using a shared RBF kernel with the length-scales initialized as 1.01.0 and the signal variance initialized as 1.01.0. The inducing size is m=100m=100 and the inducing positions are optimized through the encoded inducing strategy. The mean of the prior p(𝐳|𝐱)p(\mathbf{z}|\mathbf{x}) is obtained through the PCA transformation of 𝐱\mathbf{x}. Finally, we employ the Adam optimizer with a batch size of 256 and a learning rate of 5×1035\times 10^{-3}, and run it for 20000 iterations.
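
A PyTorch sketch of the stated encoder architecture follows; reading the final “2×2” head as the mean and log-variance of a 2-D latent code is our assumption.

import torch
import torch.nn as nn

class MnistEncoder(nn.Module):
    """784 -> FC(196) -> ReLU -> FC(49) -> ReLU -> FC(2 x 2): mean and log-variance of a 2-D latent."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 196), nn.ReLU(),
            nn.Linear(196, 49), nn.ReLU(),
            nn.Linear(49, 4),                            # 2 means + 2 log-variances
        )

    def forward(self, x):
        mu, log_var = self.net(x.flatten(1)).chunk(2, dim=-1)
        return mu, log_var.exp().sqrt()                  # mean and std of q(z|x)

# the deterministic DKL variant would instead end with nn.Linear(49, 2) and return a single code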

References

  • [1] C. K. Williams and C. E. Rasmussen, Gaussian processes for machine learning.   MIT press Cambridge, MA, 2006.
  • [2] L. Wang and C. Li, “Spectrum-based kernel length estimation for Gaussian process classification,” IEEE Transactions on Cybernetics, vol. 44, no. 6, pp. 805–816, 2014.
  • [3] P. Li and S. Chen, “Shared Gaussian process latent variable model for incomplete multiview clustering,” IEEE Transactions on Cybernetics, vol. 50, no. 1, pp. 61–73, 2020.
  • [4] N. Lawrence, “Probabilistic non-linear principal component analysis with Gaussian process latent variable models,” Journal of Machine Learning Research, vol. 6, no. Nov, pp. 1783–1816, 2005.
  • [5] R. Frigola, Y. Chen, and C. E. Rasmussen, “Variational Gaussian process state-space models,” in Advances in Neural Information Processing Systems, 2014, pp. 3680–3688.
  • [6] H. Liu, J. Cai, and Y.-S. Ong, “Remarks on multi-output Gaussian process regression,” Knowledge-Based Systems, vol. 144, no. March, pp. 102–121, 2018.
  • [7] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
  • [8] J. Luo, A. Gupta, Y.-S. Ong, and Z. Wang, “Evolutionary optimization of expensive multiobjective problems with co-sub-pareto front Gaussian process surrogates,” IEEE Transactions on Cybernetics, vol. 49, no. 5, pp. 1708–1721, 2019.
  • [9] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” in Advances in Neural Information Processing Systems.   MIT Press, 2006, pp. 1257–1264.
  • [10] M. Titsias, “Variational learning of inducing variables in sparse Gaussian processes,” in Artificial Intelligence and Statistics, 2009, pp. 567–574.
  • [11] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [12] J. Hensman, N. Fusi, and N. D. Lawrence, “Gaussian processes for big data,” in Uncertainty in Artificial Intelligence.   Citeseer, 2013, p. 282.
  • [13] A. Wilson and H. Nickisch, “Kernel interpolation for scalable structured Gaussian processes (KISS-GP),” in International Conference on Machine Learning.   PMLR, 2015, pp. 1775–1784.
  • [14] J. R. Gardner, G. Pleiss, R. Wu, K. Q. Weinberger, and A. G. Wilson, “Product kernel interpolation for scalable Gaussian processes,” in Artificial Intelligence and Statistics, 2018, pp. 1407–1416.
  • [15] Y. Gal, M. van der Wilk, and C. Rasmussen, “Distributed variational inference in sparse Gaussian process regression and latent variable models,” in Advances in Neural Information Processing Systems, 2014, pp. 3257–3265.
  • [16] H. Peng, S. Zhe, X. Zhang, and Y. Qi, “Asynchronous distributed variational Gaussian process for regression,” in International Conference on Machine Learning.   PMLR, 2017, pp. 2788–2797.
  • [17] M. P. Deisenroth and J. W. Ng, “Distributed Gaussian processes,” in International Conference on Machine Learning.   PMLR, 2015, pp. 1481–1490.
  • [18] H. Liu, J. Cai, Y. Wang, and Y.-S. Ong, “Generalized robust Bayesian committee machine for large-scale Gaussian process regression,” in International Conference on Machine Learning, 2018, pp. 1–10.
  • [19] H. Liu, Y.-S. Ong, X. Shen, and J. Cai, “When Gaussian process meets big data: A review of scalable GPs,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–19, 2020.
  • [20] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, “Deep neural networks as Gaussian processes,” arXiv preprint arXiv:1711.00165, 2017.
  • [21] Y. Cho and L. K. Saul, “Kernel methods for deep learning,” in Advances in Neural Information Processing Systems, 2009, pp. 342–350.
  • [22] M. Hermans and B. Schrauwen, “Recurrent kernel machines: Computing with infinite echo state networks,” Neural Computation, vol. 24, no. 1, pp. 104–133, 2012.
  • [23] D. K. Duvenaud, O. Rippel, R. P. Adams, and Z. Ghahramani, “Avoiding pathologies in very deep networks,” in Artificial Intelligence and Statistics, 2014, pp. 202–210.
  • [24] R. M. Neal, Bayesian Learning for Neural Networks.   Berlin, Heidelberg: Springer-Verlag, 1996.
  • [25] A. G. d. G. Matthews, M. Rowland, J. Hron, R. E. Turner, and Z. Ghahramani, “Gaussian process behaviour in wide deep neural networks,” arXiv preprint arXiv:1804.11271, 2018.
  • [26] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing, “Deep kernel learning,” in Artificial Intelligence and Statistics, 2016, pp. 370–378.
  • [27] A. G. Wilson, Z. Hu, R. R. Salakhutdinov, and E. P. Xing, “Stochastic variational deep kernel learning,” in Advances in Neural Information Processing Systems, 2016, pp. 2586–2594.
  • [28] G.-L. Tran, E. V. Bonilla, J. P. Cunningham, P. Michiardi, and M. Filippone, “Calibrating deep convolutional Gaussian processes,” in Artificial Intelligence and Statistics, 2019, pp. 1554–1563.
  • [29] N. Jean, S. M. Xie, and S. Ermon, “Semi-supervised deep kernel learning: Regression with unlabeled data by minimizing predictive variance,” in Advances in Neural Information Processing Systems, 2018, pp. 5322–5333.
  • [30] M. Al-Shedivat, A. G. Wilson, Y. Saatchi, Z. Hu, and E. P. Xing, “Learning scalable deep kernels with recurrent structure,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 2850–2886, 2017.
  • [31] A. Damianou and N. Lawrence, “Deep Gaussian processes,” in Artificial Intelligence and Statistics, 2013, pp. 207–215.
  • [32] H. Salimbeni and M. Deisenroth, “Doubly stochastic variational inference for deep Gaussian processes,” in Advances in Neural Information Processing Systems, 2017, pp. 4588–4599.
  • [33] H.-C. Kim and Z. Ghahramani, “Bayesian Gaussian process classification with the EM-EP algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 1948–1959, 2006.
  • [34] H. Nickisch and C. E. Rasmussen, “Approximations for binary Gaussian process classification,” Journal of Machine Learning Research, vol. 9, no. Oct, pp. 2035–2078, 2008.
  • [35] A. C. Damianou, M. K. Titsias, and N. D. Lawrence, “Variational inference for latent variables and uncertain inputs in Gaussian processes,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1425–1486, 2016.
  • [36] H. Liu, Y.-S. Ong, Z. Yu, J. Cai, and X. Shen, “Scalable Gaussian process classification with additive noise for various likelihoods,” arXiv preprint arXiv:1909.06541, 2019.
  • [37] M. K. Titsias and N. D. Lawrence, “Bayesian Gaussian process latent variable model,” in International Conference on Artificial Intelligence and Statistics, vol. 9, 2010, pp. 844–851.
  • [38] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [39] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, “Generating sentences from a continuous space,” arXiv preprint arXiv:1511.06349, 2015.
  • [40] T. D. Bui and R. E. Turner, “Stochastic variational inference for Gaussian process latent variable models using back constraints,” in Black Box Learning and Inference NIPS workshop, 2015.
  • [41] R. Friedrich, J. Peinke, M. Sahimi, and M. R. R. Tabar, “Approaching complexity by stochastic methods: From biological systems to turbulence,” Physics Reports, vol. 506, no. 5, pp. 87–162, 2011.
  • [42] D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in International Conference on Machine Learning, 2015.
  • [43] C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. C. Duke, “Continuous-time flows for efficient inference and density estimation,” in International Conference on Machine Learning, 2018, pp. 823–832.
  • [44] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds.   Curran Associates, Inc., 2018, pp. 6571–6583.
  • [45] X. Liu, T. Xiao, S. Si, Q. Cao, S. Kumar, and C.-J. Hsieh, “Neural SDE: Stabilizing neural ODE networks with stochastic noise,” arXiv preprint arXiv:1906.02355, 2019.
  • [46] P. Hegde, M. Heinonen, H. Lähdesmäki, and S. Kaski, “Deep learning with differential Gaussian process flows,” in Artificial Intelligence and Statistics, 2019, pp. 1812–1821.
  • [47] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel, “Variational lossy autoencoder,” in International Conference on Learning Representations, 2017.
  • [48] J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick, “Lagging inference networks and posterior collapse in variational autoencoders,” in International Conference on Learning Representations, 2019.
  • [49] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with PixelCNN decoders,” in Advances in Neural Information Processing Systems, 2016, pp. 4797–4805.
  • [50] J. Tomczak and M. Welling, “VAE with a VampPrior,” in Artificial Intelligence and Statistics, 2018, pp. 1214–1223.
  • [51] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [52] M. D. Hoffman, C. Riquelme, and M. J. Johnson, “The β\beta-VAE’s implicit prior,” in Workshop on Bayesian Deep Learning, NIPS, 2017, pp. 1–5.
  • [53] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning basic visual concepts with a constrained variational framework,” in International Conference on Learning Representations, 2017, pp. 1–6.
  • [54] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther, “How to train deep variational autoencoders and probabilistic ladder networks,” in International Conference on Machine Learning, 2016.
  • [55] H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin, “Cyclical annealing schedule: A simple approach to mitigating KL vanishing,” in The Association for Computational Linguistics: Human Language Technologies, 2019, pp. 240–250.
  • [56] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” arXiv preprint arXiv:1612.00410, 2016.
  • [57] M. van der Wilk, C. E. Rasmussen, and J. Hensman, “Convolutional Gaussian processes,” in Advances in Neural Information Processing Systems, 2017, pp. 2845–2854.