Modulating Scalable Gaussian Processes for Expressive Statistical Learning

Haitao Liu [email protected] Yew-Soon Ong [email protected] Xiaomo Jiang [email protected] Xiaofang Wang [email protected] School of Energy and Power Engineering, Dalian University of Technology, China, 116024 School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 Digital Twin Laboratory for Industrial Equipment at Dalian University of Technology, China, 116024.

Abstract

For a learning task, Gaussian process (GP) is interested in learning the statistical relationship between inputs and outputs, since it offers not only the prediction mean but also the associated variability. The vanilla GP however struggles to learn complicated distribution with the property of, e.g., heteroscedastic noise, multi-modality and non-stationarity, from massive data due to the Gaussian marginal and the cubic complexity. To this end, this article studies new scalable GP paradigms including the non-stationary heteroscedastic GP, the mixture of GPs and the latent GP, which introduce additional latent variables to modulate the outputs or inputs in order to learn richer, non-Gaussian statistical representation. We further resort to different variational inference strategies to arrive at analytical or tighter evidence lower bounds (ELBOs) of the marginal likelihood for efficient and effective model training. Extensive numerical experiments against state-of-the-art GP and neural network (NN) counterparts on various tasks verify the superiority of these scalable modulated GPs, especially the scalable latent GP, for learning diverse data distributions.

keywords:

Gaussian process , Modulation , Scalability , Heteroscedastic noise , Multi-modality , Non-stationarity

1 Introduction

Given the input $\mathbf{x}\in\mathbb{R}^{d_{\mathbf{x}}}$ , rather than simply predicting the point estimation of the output $y(\mathbf{x})$ , we are more interested in inferring the underlying generative process $f:\mathbb{R}^{d_{\mathbf{x}}}\mapsto\mathbb{R}$ , the distribution of which is most likely to produce the observed data $\mathcal{D}=\{\mathbf{X}\in\mathbb{R}^{N\times d_{\mathbf{x}}},\mathbf{y}\in\mathbb{R}^{N}\}$ , in order to figure out not only the prediction mean but also the associated variability. Along this line, the modeling of the marginal (conditional) distribution $p(\mathbf{y}|\mathbf{X})$ becomes the central task, which raises from many machine learning scenarios, for example, regression, conditional density estimation [1], data association [2], and uncertainty quantification [3].

The well-known Gaussian process (GP) [4] is suitable for building our beliefs upon the statistical relationship between the inputs and outputs due to the Bayesian perspective, thus showcasing widespread application in various scenarios [5, 6, 7, 8, 9]. Usually, the GP adopts the Gaussian assumption and the independent and identically distributed (i.i.d.) Gaussian noise to conduct efficient closed-form inference and prediction, and employs the stationary kernels $k(.,.)$ to simply quantify how quickly the correlations vary along each dimension.¹¹1The stationary kernel depends only on the relative distance $||\mathbf{x}-\mathbf{x}^{\prime}||$ . Consequently, the vanilla GP may not be appropriate for approximating non-Gaussuan, multi-modal or non-stationary probabilistic behaviors in reality, since its marginal is always Gaussian. Besides, another prominent weakness of GP is the poor scalability on massive data due to the operations of the $N\times N$ kernel matrix, which raise the cubic complexity $\mathcal{O}(N^{3})$ . Hence, the above issues raise an urgent demand for having novel GP paradigms which could effectively and efficiently learn rich statistical representations from massive data.

In order to improve the scalability, various scalable GPs have been exploited in recent years from different perspectives [10]. Alternatively, we could directly use advanced linear algebra methods, for example, the hierarchical off-diagonal low-rank and matrix vector multiplication algorithms [11, 12], and train the GP distributedly through multiple CPUs/GPUs. Besides, we could resort to the divide-and-conquer idea by splitting the data into small blocks and aggregating predictions from local GPs [13, 14, 15], which naturally support parallel and distributed learning. A more efficient and principled alternative is using sparse approximation [16, 17, 18], which introduces $M$ ( $M\ll N$ ) global inducing variables $\mathbf{u}\in\mathbb{R}^{M}$ to summarize the latent variables $\mathbf{f}=\{\mathbf{x}_{i}\}_{i=1}^{N}\in\mathbb{R}^{N}$ statistically, thus significantly reducing the complexity to $\mathcal{O}(M^{3})$ through variational inference (VI) [19]. Further reduction of model complexity is possible by exploiting the structured inducing points, see for example [20, 21]. This article builds our scalable GPs upon the widely used sparse approximation, which has been further elaborated in Sec. 2, due to the complete statistical framework and the remarkably low complexity with theoretically guaranteed property [22].

Aiming to develop new scalable GP paradigms for learning rich probabilistic behaviors, we particularly introduce a Bayesian framework wherein a latent modulation variable $\mathbf{w}$ is introduced for encoding the complicated statistical structures from outputs or inputs. Under this modulated GP paradigm, three representative models have been studied, including (i) the scalable heteroscedastic GP (SHGP) which modulates the amplitude of latent output and the noise variance simultaneously, (ii) the scalable mixture of GPs (SMGP) which modulates the assignment of global GP experts at data points for mixing distributions, and finally (iii) the scalable latent GP (SLGP) which augments the input space with latent variables in order to modulate the covariances and moreover the predictive distribution. The key for the above three modulated GPs is to derive scalable and effective evidence lower bounds (ELBOs) for model training.

Along this line, the main contributions of this article are four-fold:

1.

We use the sparse approximation strategy to build a scalable version for SHGP and derive an analytical and scalable ELBO for efficient training through variational inference;
2.

We sidestep the need of tackling the posterior of discrete assignment distribution for SMGP, and marginalize all the latent variables out to directly approximate the marginal likelihood and derive a tighter bound for model training with higher quality;
3.

We enhance the power of latent representation of SLGP by introducing a regularized stochastic encoder. Besides, in order to achieve higher training quality, we derive a hybrid and tighter ELBO that takes the advantage of both the VI and importance-weighted VI (IWVI) based bounds;
4.

We compare the three scalable modulated GPs against state-of-the-art GP and neural network (NN) counterparts to comprehensively investigate their characteristics on various tasks, and release the python implementations at https://github.com/LiuHaiTao01/ModulatedGPs.

The remaining of the article is organized as follows. Sec. 2 first briefly introduces the scalable GP using sparse approximation and variational inference. Thereafter, Sec. 3 develops three scalable GPs that modulate the probabilistic behavior from outputs or inputs, and derives the analytical or tight ELBOs for model training. Sec. 4 then discussed the related works and their differences to our work. Finally, Sec. 5 conducts extensive experiments to showcase the superiority of modulated GPs on various tasks, followed by the overall concluding remarks summarized in Sec. 6. Note that for improving the readability of this article, the employed acronyms and notations are summarized in Appendix A.

2 Scalable GP revisited

The GP learns the underlying, noise-free latent function $f:\mathbb{R}^{d_{\mathbf{x}}}\mapsto\mathbb{R}$ by placing the GP prior over the functional space as $f(\mathbf{x})\sim\mathcal{GP}(0,k(\mathbf{x},\mathbf{x}^{\prime}))$ , where we take the zero mean without loss of generality, and $k(.,.)$ is the kernel function describing the smoothness of $f$ .²²2The squared exponential (SE) kernel with automatic relevance determination $k(\mathbf{x},\mathbf{x}^{\prime})=\nu_{f}\exp(-0.5(\mathbf{x}-\mathbf{x}^{\prime})^{\mathsf{T}}\bm{\Delta}^{-1}(\mathbf{x}-\mathbf{x}^{\prime}))$ is commonly used in practice. The final observation polluted with independent and identically distributed (i.i.d.) Gaussian noise is expressed as

\displaystyle y(\mathbf{x})=f(\mathbf{x})+\epsilon,

(1)

where the observation noise $\epsilon\sim\mathcal{N}(0,\nu^{\epsilon})$ . Given $N$ training data $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$ , in order to infer the model hyperparameters, we marginalize all the latent variables out and maximize the marginal likelihood

\displaystyle p(\mathbf{y})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f})d\mathbf{f}=\mathcal{N}(\mathbf{y}|\mathbf{0},\mathbf{K}_{NN}+\nu^{\epsilon}\mathbf{I}),

(2)

where the GP prior $p(\mathbf{f})=\mathcal{N}(\mathbf{f}|\mathbf{0},\mathbf{K}_{NN})$ with the covariance matrix $\mathbf{K}_{NN}=k(\mathbf{X},\mathbf{X})\in\mathbb{R}^{N\times N}$ , and the Gaussian likelihood $p(\mathbf{y}|\mathbf{f})=\mathcal{N}(\mathbf{y}|\mathbf{f},\nu^{\epsilon}\mathbf{I})$ .³³3We omit the dependency of these distributions on the deterministic inputs $\mathbf{X}$ . After model training, we calculate the Bayes’ rule $p(\mathbf{f}|\mathbf{y})\propto p(\mathbf{y}|\mathbf{f})p(\mathbf{f})$ and use it to predict at the test point $\mathbf{x}_{*}$ as $p(f_{*}|\mathbf{y})=\int p(f_{*}|\mathbf{f})p(\mathbf{f}|\mathbf{y})d\mathbf{f}=\mathcal{N}(f_{*}|\mu_{*}^{f},\nu_{*}^{f})$ where the prediction mean $\mu_{*}^{f}=\mathbf{k}_{*N}(\mathbf{K}_{NN}+\nu^{\epsilon}\mathbf{I})^{-1}\mathbf{y}$ and the prediction variance $\nu_{*}^{f}=k_{**}-\mathbf{k}_{*N}(\mathbf{K}_{NN}+\nu^{\epsilon}\mathbf{I})^{-1}\mathbf{k}_{N*}$ with $k_{**}=k(\mathbf{x}_{*},\mathbf{x}_{*})\in\mathbb{R}$ and $\mathbf{k}_{*N}=\mathbf{k}_{N*}^{\mathsf{T}}=k(\mathbf{x}_{*},\mathbf{X})\in\mathbb{R}^{1\times N}$ . Furthermore, we have the final predictive distribution $p(y_{*}|\mathbf{y})=\int p(y_{*}|f_{*})p(f_{*}|\mathbf{y})df_{*}=\mathcal{N}(y_{*}|\mu_{*}^{f},\nu_{*}^{f}+\nu^{\epsilon})$ .

The vanilla GP however suffers from poor scalability due to the cubic complexity $\mathcal{O}(N^{3})$ raised by the operations over the $N\times N$ kernel matrix $\mathbf{K}_{NN}$ . To alleviate this issue in the era of big data, the well-known scalable GP (SGP) using sparse approximation [16, 17] introduces $M$ ( $M\ll N$ ) inducing variables $\mathbf{u}$ at the inducing points $\mathbf{Z}\in\mathbb{R}^{M\times d_{\mathbf{x}}}$ as sufficient statistics of $\mathbf{f}$ . Consequently, we arrive at the Nyström approximation $\mathbf{K}_{NN}\approx\mathbf{K}_{NM}\mathbf{K}_{MM}^{-1}\mathbf{K}_{MN}$ , where $\mathbf{K}_{NM}=k(\mathbf{X},\mathbf{Z})\in\mathbb{R}^{N\times M}$ and $\mathbf{K}_{MM}=k(\mathbf{Z},\mathbf{Z})\in\mathbb{R}^{M\times M}$ , with the time complexity reduced as $\mathcal{O}(NM^{2})$ .

Furthermore, the variational inference employs a tractable variational posterior $q(\mathbf{u})=\mathcal{N}(\mathbf{u}|\mathbf{m},\mathbf{S})$ to minimize the Kullback-Leibler (KL) divergence

\displaystyle\mathrm{KL}[q(\mathbf{f},\mathbf{u})||p(\mathbf{f},\mathbf{u}|\mathbf{y})]=\log p(\mathbf{y})-\mathcal{L}_{\mathrm{sgp}},

which is equivalent to maximizing the analytical ELBO $\mathcal{L}_{\mathrm{sgp}}$ expressed as [19]

\displaystyle\begin{split}\mathcal{L}_{\mathrm{sgp}}=&\mathbb{E}_{q(\mathbf{f})}[\log p(\mathbf{y}|\mathbf{f})]-\mathrm{KL}(q(\mathbf{u})||p(\mathbf{u}))\\ =&\sum_{i=1}^{N}\mathbb{E}_{q(f_{i})}[\log p(y_{i}|f_{i})]-\mathrm{KL}(q(\mathbf{u})||p(\mathbf{u})),\end{split}

(3)

where the GP prior $p(\mathbf{u})=\mathcal{N}(\mathbf{u}|\mathbf{0},\mathbf{K}_{MM})$ , the posterior $q(\mathbf{f})=\int p(\mathbf{f}|\mathbf{u})q(\mathbf{u})d\mathbf{u}=\mathcal{N}(\mathbf{f}|\bm{\mu},\bm{\Sigma})$ with the mean and covariance expressed respectively as

	$\displaystyle\bm{\mu}^{f}=$	$\displaystyle\mathbf{K}_{NM}\mathbf{K}_{MM}^{-1}\mathbf{m},$
	$\displaystyle\bm{\Sigma}^{f}=$	$\displaystyle\mathbf{K}_{NN}-\mathbf{K}_{NM}\mathbf{K}_{MM}^{-1}[\mathbf{I}-\mathbf{S}\mathbf{K}_{MM}^{-1}]\mathbf{K}_{MN},$

and the individual $q(f_{i})=\mathcal{N}(f_{i}|\mu^{f}_{i}\triangleq[\bm{\mu}^{f}]_{i},\nu^{f}_{i}\triangleq[\bm{\Sigma}^{f}]_{ii})$ . Due to the factorization of the expectation term in (3) over data points, the bound has an unbiased estimation

\displaystyle\mathcal{L}_{\mathrm{sgp}}\approx\frac{N}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\mathbb{E}_{q(f_{i})}[\log p(y_{i}|f_{i})]-\mathrm{KL}(q(\mathbf{u})||p(\mathbf{u})),

(4)

where $\mathcal{B}$ is a subset of the training data $\mathcal{D}$ . This bound owns a remarkable complexity of $\mathcal{O}(M^{3})$ through the stochastic optimization with single-sample approximation (i.e., $|\mathcal{B}|=1$ ).

It is observed that the vanilla SGP only (i) provide the Gaussian predictive distribution with homoscedastic noise; and (ii) describe the invariant spatio-temporal behaviors due to the stationary kernels, which therefore limit the application to complicated tasks dominated by rich underlying stochastic processes.

3 Modulated scalable Gaussian processes

It is known that the vanilla GP paradigm struggles to approximate complicated distribution. Hence, in order to improve the capability, we introduce an additional latent variable $\mathbf{w}$ to modulate the behavior of GP as

\displaystyle y(\mathbf{x})=f(\mathbf{x},\mathbf{w}).

(5)

The latent variable $\mathbf{w}$ permits disentangling the statistical structure to enhance the expressivity of modeling. Hence, the challenging details in data, e.g., the heteroscedastic noise, the multi-modality, and the non-stationarity, could be explained and absorbed by $\mathbf{w}$ . We next proceed to present three scalable modulated GPs which perform the modulation on the outputs or inputs.

Refer to caption — Figure 1: The graphical models of three scalable modulated GPs, wherein the (a) SHGP and (b) SMGP perform the modulation on the output (noise), while the (c) SLGP modulates the inputs.

3.1 Scalable heteroscedastic GP

Instead of using the poor i.i.d noise, the heteroscedastic GP (HGP) defines the following additive model

\displaystyle y(\mathbf{x})=e^{w(\mathbf{x})}f(\mathbf{x})+\epsilon(\mathbf{x})

(6)

to capture the variability in both outputs and noise. In (6), the additional latent function $w(.):\mathbb{R}^{d_{\mathbf{x}}}\mapsto\mathbb{R}$ is introduced to modulate the amplitude of output $f(.)$ in order to describe the non-stationarity [23]; besides, it has an input-dependent noise $\epsilon(\mathbf{x})=\mathcal{N}(0,ce^{2w(\mathbf{x})})$ [24], which also uses $w(.)$ to modulated the noise variance; and the positive parameter $c$ leaves flexibility for adjusting the noise variance. Note that the exponential form of $w(.)$ ensures the positivity of noise variance and the modulation for $f$ .

The HGP takes the following GP priors

\displaystyle p(\mathbf{f})=\mathcal{N}(\mathbf{f}|\mathbf{0},\mathbf{K}_{NN}),\quad p(\mathbf{w})=\mathcal{N}(\mathbf{w}|\mu_{0}^{w}\mathbf{1},\mathbf{K}_{NN}^{w}),

(7)

and the non-stationary, heteroscedastic likelihood

\displaystyle p(\mathbf{y}|\mathbf{f},\mathbf{w})=\mathcal{N}(\mathbf{y}|e^{\mathbf{w}}\mathbf{f},\mathbf{\Sigma}^{\epsilon}),

(8)

where the covariance $\mathbf{\Sigma}^{\epsilon}=\mathrm{diag}[ce^{2\mathbf{w}}]$ . Note that the prior mean $\mu_{0}^{w}$ is utilized to account for the variability of noise variance. Besides, it has been pointed out that the likelihood of this model is log-concave on both $f$ and $w$ , thus resulting in unimodal posteriors [25, 26].

Analytical ELBO. We next proceed to use variational inference to derive the scalable and analytical objective for model training of scalable HGP (SHGP), the graphical model of which is depicted in Fig. 1(a). We first introduce the inducing variables $\mathbf{u}$ and $\mathbf{u}^{w}\sim\mathcal{N}(\mathbf{m}^{w},\mathbf{S}^{w})$ for $\mathbf{f}$ and $\mathbf{w}$ respectively. Thereafter, by minimizing the KL divergence $\mathrm{KL}[q(\mathbf{f},\mathbf{w},\mathbf{u},\mathbf{u}^{w})||p(\mathbf{f},\mathbf{w},\mathbf{u},\mathbf{u}^{w}|\mathbf{y})]$ where $q(\mathbf{f},\mathbf{w},\mathbf{u},\mathbf{u}^{w})=p(\mathbf{f}|\mathbf{u})p(\mathbf{w}|\mathbf{u}^{w})q(\mathbf{u})q(\mathbf{u}^{w})$ , we arrive at the following ELBO for $\log p(\mathbf{y})$ as

\displaystyle\mathcal{L}_{\mathrm{shgp}}=\mathbb{E}_{q(\mathbf{w})q(\mathbf{f})}[\log p(\mathbf{y}|\mathbf{f},\mathbf{w})]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})]-\mathrm{KL}[q(\mathbf{u}^{w})||p(\mathbf{u}^{w})],

(9)

where the GP prior $p(\mathbf{u}^{w})=\mathcal{N}(\mathbf{u}^{w}|\mathbf{0},\mathbf{K}^{w}_{MM})$ , and the variational posterior $q(\mathbf{w})=\int p(\mathbf{w}|\mathbf{u}^{w})q(\mathbf{u}^{w})d\mathbf{u}^{w}=\mathcal{N}(\mathbf{w}|\bm{\mu}^{w},\bm{\Sigma}^{w})$ with the mean $\bm{\mu}^{w}=\mathbf{K}_{NM}^{w}[\mathbf{K}^{w}_{MM}]^{-1}(\mathbf{m}^{w}-\mu_{0}^{w}\mathbf{1})+\mu_{0}^{w}\mathbf{1}$ and covariance $\bm{\Sigma}^{w}=\mathbf{K}^{w}_{NN}-\mathbf{K}^{w}_{NM}[\mathbf{K}^{w}_{MM}]^{-1}[\mathbf{I}-\mathbf{S}^{w}[\mathbf{K}^{w}_{MM}]^{-1}]\mathbf{K}^{w}_{MN}$ . Due to the Gaussian posteriors $q(\mathbf{u})$ and $q(\mathbf{u}^{w})$ , the two KL terms in (9) have closed-form expressions. Furthermore, the expectation of the likelihood in the right-hand side of $\mathcal{L}_{\mathrm{shgp}}$ can be analytically expressed as

\displaystyle\begin{split}&\mathbb{E}_{q(\mathbf{w})q(\mathbf{f})}[\log p(\mathbf{y}|\mathbf{f},\mathbf{w})]\\ =&\mathbb{E}_{q(\mathbf{w})}\left[\log\mathcal{N}(\mathbf{y}|\bm{\mu}^{f},\mathbf{\Sigma}^{\epsilon})-\frac{1}{2}\mathrm{Tr}[(\mathbf{\Sigma}^{\epsilon})^{-1}\bm{\Sigma}^{f}]\right]\\ =&-\frac{n}{2}\log(2c\pi)-\frac{1}{2}\sum_{i=1}^{n}\left(2[\bm{\mu}^{w}]_{i}+\frac{1}{c}([\mathbf{R}_{1}^{w}]_{ii}y_{i}^{2}-2y_{i}[\mathbf{R}_{2}^{w}\bm{\mu}^{f}]_{i}+[\bm{\mu}^{f}]_{i}^{2}+[\mathbf{\Sigma}^{f}]_{ii})\right),\end{split}

(10)

where the diagonal matrices $\mathbf{R}_{1}^{w}=\mathrm{diag}[e^{2\bm{\nu}^{w}-2\bm{\mu}^{w}}]$ and $\mathbf{R}_{2}^{w}=\mathrm{diag}[e^{0.5\bm{\nu}^{w}-\bm{\mu}^{w}}]$ with $\bm{\nu}^{w}$ being the vector comprising the diagonal elements of $\bm{\Sigma}^{w}$ .

Prediction. Finally, when performing prediction at the test point $\mathbf{x}_{*}$ , we have

\displaystyle\begin{split}p(y_{*}|\mathbf{y})=&\int p(y_{*}|f_{*},w_{*})p(f_{*}|\mathbf{y})p(w_{*}|\mathbf{y})df_{*}dw_{*}\\ =&\int\mathcal{N}(y_{*}|\mu^{f}_{*}e^{w_{*}},e^{2w_{*}}\nu^{f}_{*}+ce^{2w_{*}})p(w_{*}|\mathbf{y})dw_{*},\end{split}

(11)

where the prediction $p(f_{*}|\mathbf{y})=\int p(f_{*}|\mathbf{u})q(\mathbf{u})d\mathbf{u}=\mathcal{N}(f_{*}|\mu_{*}^{f},\nu_{*}^{f})$ with the mean $\mu_{*}^{f}=\mathbf{k}_{*M}\mathbf{K}_{MM}^{-1}\mathbf{m}$ and variance $\nu_{*}^{f}=k_{**}-\mathbf{k}_{*M}\mathbf{K}_{MM}^{-1}[\mathbf{I}-\mathbf{S}\mathbf{K}_{MM}^{-1}]\mathbf{k}_{M*}$ ; and similarly, $p(w_{*}|\mathbf{y})=\int p(w_{*}|\mathbf{u}^{w})q(\mathbf{u}^{w})d\mathbf{u}^{w}=\mathcal{N}(w_{*}|\mu_{*}^{w},\nu_{*}^{w})$ with the mean $\mu_{*}^{w}=\mathbf{k}^{w}_{*M}[\mathbf{K}^{w}_{MM}]^{-1}(\mathbf{m}^{w}-\nu_{0}^{w}\mathbf{1})+\nu_{0}^{w}\mathbf{1}$ and variance $\nu_{*}^{w}=k^{w}_{**}-\mathbf{k}^{w}_{*M}[\mathbf{K}^{w}_{MM}]^{-1}[\mathbf{I}-\mathbf{S}^{w}[\mathbf{K}^{w}_{MM}]^{-1}]\mathbf{k}^{w}_{M*}$ . The prediction $p(y_{*}|\mathbf{y})$ can be estimated through the Markov Chain Monte Carlo (MCMC) sampling, or more efficiently, the Gauss-Hermite quadrature. Note that the predictive distribution of SHGP is non-Gaussian.

3.2 Scalable mixture of GPs

The capability of SHGP for describing complicated distribution is limited due to the Gaussian noise. However, it is known that the mixture of infinite Gaussians could approximate any target distribution. To this end, the mixture of GPs (MGP) aggregates the predictions from GP experts in order to break through the Gaussian assumption. By activating the GP experts in different locations, the MGP is enabled to tackle both the non-stationary and multi-modal data distributions.

Specifically, the MGP employs $T$ independent latent functions $\{f^{t}\}_{t=1}^{T}$ , and the indicator vector $\mathbf{w}_{i}\in\{0,1\}^{T}$ at point $\mathbf{x}_{i}$ to represent its assignment to one of the GP experts. Consequently, we have the following likelihood

\displaystyle p(\mathbf{y}|\mathbf{F},\mathbf{W})=\prod_{i=1}^{N}\prod_{t=1}^{T}p(y_{i}|f_{i}^{t},\nu^{t})^{\mathbb{I}(w_{i}^{t}=1)},

(12)

where the sets $\mathbf{F}=\{\mathbf{f}_{i}\in\mathbb{R}^{T}\}_{i=1}^{N}=\{\mathbf{f}^{t}\in\mathbb{R}^{N}\}_{t=1}^{T}$ , $\mathbf{W}=\{\mathbf{w}_{i}\in\mathbb{R}^{T}\}_{i=1}^{N}=\{\mathbf{w}^{t}\in\mathbb{R}^{N}\}_{t=1}^{T}$ , the variable $\nu^{t}$ is the noise variance for the $t$ -th GP expert, and $\mathbb{I}(.)$ is the indicator function. It is observed that (12) defines a generative process where the observed data is generated by the mixture of several global independent stochastic processes.

We further assume that the indicator $\mathbf{w}_{i}$ are drawn independently from a multinomial distribution with unnormalized logit parameters $\mathbf{a}_{i}=\{a_{i}^{t}=a^{t}(\mathbf{x}_{i})\}_{t=1}^{T}$ as

\displaystyle p(\mathbf{w}_{i}|\mathbf{a}_{i})=\mathcal{M}(\mathbf{w}_{i}|\mathtt{Softmax}(\mathbf{a}_{i})),\quad 1\leq i\leq N,

(13)

where the logit parameters are assumed to follow independent GP priors as

\displaystyle p(\mathbf{a}^{t})=\mathcal{N}(\mathbf{a}^{t}|\mathbf{0},\mathbf{K}^{a,t}_{NN}),\quad 1\leq t\leq T.

(14)

Now the GP priors $p(\mathbf{f}^{t})=\mathcal{N}(\mathbf{f}^{t}|\mathbf{0},\mathbf{K}^{t}_{NN})$ and $p(\mathbf{a}^{t})$ in MGP encode the structures of both functions and associations.

Tighter ELBO. We thereafter decide to build the scalable training framework for the scalable MGP (SMGP) depicted in Fig. 1(b). The joint prior of SMGP is defined as

\displaystyle p(\mathbf{y},\mathbf{F},\mathbf{U},\mathbf{W},\mathbf{A},\mathbf{U}^{a})=p(\mathbf{y}|\mathbf{F},\mathbf{W})p(\mathbf{F}|\mathbf{U})p(\mathbf{U})p(\mathbf{W}|\mathbf{A})p(\mathbf{A}|\mathbf{U}^{a})p(\mathbf{U}^{a}).

where the sets $\mathbf{A}=\{\mathbf{a}_{i}\in\mathbb{R}^{T}\}_{i=1}^{N}=\{\mathbf{a}^{t}\in\mathbb{R}^{N}\}_{t=1}^{T}$ and $\mathbf{U}^{a}=\{\mathbf{u}^{a,t}\in\mathbb{R}^{M}\}_{t=1}^{T}$ are the inducing variables for $T$ assignment GPs;⁴⁴4For the convenience of presentation, we assume that the inducing sizes for the GPs are the same. the conditionals factorize as $p(\mathbf{F}|\mathbf{U})=\prod_{t=1}^{T}p(\mathbf{f}^{t}|\mathbf{u}^{t})$ and $p(\mathbf{W}|\mathbf{A})=\prod_{i=1}^{N}p(\mathbf{w}_{i}|\mathbf{a}_{i})$ ; and finally, the GP priors $p(\mathbf{U})=\prod_{t=1}^{T}p(\mathbf{u}^{t})=\prod_{t=1}^{T}\mathcal{N}(\mathbf{u}^{t}|\mathbf{0},\mathbf{K}_{MM}^{t})$ and $p(\mathbf{U}^{a})=\prod_{t=1}^{T}p(\mathbf{u}^{a,t})=\prod_{t=1}^{T}\mathcal{N}(\mathbf{u}^{a,t}|\mathbf{0},\mathbf{K}_{MM}^{a,t})$ . Correspondingly, the joint variational posterior writes as

\displaystyle q(\mathbf{F},\mathbf{U},\mathbf{W},\mathbf{A},\mathbf{U}^{a})=p(\mathbf{F}|\mathbf{U})q(\mathbf{U})q(\mathbf{W}|\mathbf{A})p(\mathbf{A}|\mathbf{U}^{a})q(\mathbf{U}^{a}),

where the variational posteriors $q(\mathbf{U})=\prod_{t=1}^{T}q(\mathbf{u}^{t})=\prod_{t=1}^{T}\mathcal{N}(\mathbf{u}^{t}|\mathbf{m}^{t},\mathbf{S}^{t})$ and $q(\mathbf{U}^{a})=\prod_{t=1}^{T}q(\mathbf{u}^{a,t})=\prod_{t=1}^{T}\mathcal{N}(\mathbf{u}^{a,t}|\mathbf{m}^{a,t},\mathbf{S}^{a,t})$ .

The variational inference then helps derive the following ELBO for SMGP as

\displaystyle\begin{split}\mathcal{L}=&\mathbb{E}_{q(\mathbf{F})q(\mathbf{W}|\mathbf{A})q(\mathbf{A})}[\log p(\mathbf{y}|\mathbf{F},\mathbf{W})]-\mathrm{KL}[q(\mathbf{W}|\mathbf{A})||p(\mathbf{W}|\mathbf{A}))]\\ &-\mathrm{KL}[q(\mathbf{U})||p(\mathbf{U})]-\mathrm{KL}[q(\mathbf{U}^{a})||p(\mathbf{U}^{a})],\end{split}

(15)

where the posterior factorizes as $q(\mathbf{A})=\prod_{t=1}^{T}\int p(\mathbf{a}^{t}|\mathbf{u}^{a,t})q(\mathbf{u}^{a,t})d\mathbf{u}^{a,t}=\prod_{t=1}^{T}\mathcal{N}(\mathbf{a}^{t}|\bm{\mu}^{a,t},\bm{\Sigma}^{a,t})$ with the mean $\bm{\mu}^{a,t}=\mathbf{K}_{NM}^{a,t}[\mathbf{K}^{a,t}_{MM}]^{-1}\mathbf{m}^{a,t}$ and covariance $\bm{\Sigma}^{a,t}=\mathbf{K}^{a,t}_{NN}-\mathbf{K}^{a,t}_{NM}[\mathbf{K}^{a,t}_{MM}]^{-1}[\mathbf{I}-\mathbf{S}^{a,t}[\mathbf{K}^{a,t}_{MM}]^{-1}]\mathbf{K}^{a,t}_{MN}$ ; and similarly, the variational posterior $q(\mathbf{F})=\prod_{t=1}^{T}p(\mathbf{f}^{t}|\mathbf{u}^{t})q(\mathbf{u}^{t})d\mathbf{u}^{t}=\prod_{t=1}^{T}\mathcal{N}(\mathbf{f}^{t}|\bm{\mu}^{f,t},\bm{\Sigma}^{f,t})$ with the mean $\bm{\mu}^{f,t}=\mathbf{K}_{NM}^{t}[\mathbf{K}^{t}_{MM}]^{-1}\mathbf{m}^{t}$ and covariance $\bm{\Sigma}^{f,t}=\mathbf{K}^{t}_{NN}-\mathbf{K}^{t}_{NM}[\mathbf{K}^{t}_{MM}]^{-1}[\mathbf{I}-\mathbf{S}^{t}[\mathbf{K}^{t}_{MM}]^{-1}]\mathbf{K}^{t}_{MN}$ . However, since the prior $p(\mathbf{W}|\mathbf{A})=\prod_{i=1}^{N}p(\mathbf{w}_{i}|\mathbf{a}_{i})$ is the product of discrete multinomial distributions, it is not straightforward to handle the posterior $q(\mathbf{W}|\mathbf{A})$ and the related KL divergence in (15). To sidestep this issue, Kaiser et al. [2] decided to keep the latent variables $\mathbf{W}$ by approximating $\log p(\mathbf{y},\mathbf{W})$ rather than the interested $\log p(\mathbf{y})$ .

Differently, we here marginalize all the latent variables out to directly approximate $\log p(\mathbf{y})$ through a tighter ELBO. To this end, we first follow the sparse GP framework to derive the lower bound for the conditional $\log p(\mathbf{y}|\mathbf{W})$ as

\displaystyle\begin{split}\mathcal{L}_{\mathbf{W}}=&\mathbb{E}_{q(\mathbf{F})}[\log p(\mathbf{y}|\mathbf{F},\mathbf{W})]-\mathrm{KL}[q(\mathbf{U})||p(\mathbf{U})]\\ =&\tilde{\mathcal{L}}_{\mathbf{W}}^{\mathrm{mgp}}-\mathrm{KL}[q(\mathbf{U})||p(\mathbf{U})].\end{split}

(16)

Thereafter, according to Jensen’s inequality we have

\displaystyle\begin{split}\log p(\mathbf{y})\geq&\log\int\exp(\tilde{\mathcal{L}}_{\mathbf{W}}^{\mathrm{mgp}})p(\mathbf{W}|\mathbf{A})p(\mathbf{A}|\mathbf{U}^{a})\frac{p(\mathbf{U}^{a})}{q(\mathbf{U}^{a})}q(\mathbf{U}^{a})d\mathbf{W}d\mathbf{A}d\mathbf{U}^{a}\\ \geq&\mathbb{E}_{q(\mathbf{A})}\left[\log\mathbb{E}_{p(\mathbf{W}|\mathbf{A})}[\exp(\tilde{\mathcal{L}}_{\mathbf{W}}^{\mathrm{mgp}})]\right]-\mathrm{KL}[q(\mathbf{U})||p(\mathbf{U})]-\mathrm{KL}[q(\mathbf{U}^{a})||p(\mathbf{U}^{a})]\\ =&\mathcal{L}_{\mathrm{smgp}}\geq\mathcal{L}.\end{split}

(17)

It is observed that in comparison to the bound (15), the tighter bound $\mathcal{L}_{\mathrm{smgp}}$ avoids introducing an additional variational posterior $q(\mathbf{W}|\mathbf{A})$ . Besides, note that during the model training, instead of sampling from the discrete prior distribution $\mathbf{w}_{i}\sim\mathcal{M}(\mathbf{w}_{i}|\mathtt{Softmax}(\mathbf{a}_{i}))$ , we employ the continuous relaxation proposed in [27, 28] in order to perform back propagation for stochastic optimization: it adopts the reparameterization trick to sample from a Concrete random variable controlled by a temperature parameter $\lambda$ , and approaches a discrete random variable when $\lambda\to 0$ . Our SMGP adopts a small value of $\lambda=0.01$ .

Prediction. When performing prediction at the test point $\mathbf{x}_{*}$ , we have

\displaystyle p(y_{*}|\mathbf{y})=\int p(y_{*}|\mathbf{f}_{*},\mathbf{w}_{*})p(\mathbf{f}_{*}|\mathbf{y})p(\mathbf{w}_{*}|\mathbf{y})d\mathbf{w}_{*}d\mathbf{f}_{*},

where the prediction $p(\mathbf{f}_{*}|\mathbf{y})=\prod_{t=1}^{T}\int p(f^{t}_{*}|\mathbf{u}^{t})q(\mathbf{u}^{t})d\mathbf{u}^{t}=\prod_{t=1}^{T}\mathcal{N}(f^{t}_{*}|\mu^{f,t}_{*},\nu^{f,t}_{*})$ with the mean $\mu_{*}^{f,t}=\mathbf{k}^{t}_{*M}[\mathbf{K}^{t}_{MM}]^{-1}\mathbf{m}^{t}$ and variance $\nu_{*}^{f,t}=k^{t}_{**}-\mathbf{k}^{t}_{*M}[\mathbf{K}^{t}_{MM}]^{-1}[\mathbf{I}-\mathbf{S}^{t}[\mathbf{K}^{t}_{MM}]^{-1}]\mathbf{k}^{t}_{M*}$ ; and the assignment prediction $p(\mathbf{w}_{*}|\mathbf{y})=\int p(\mathbf{w}_{*}|\mathbf{a}_{*})p(\mathbf{a}_{*}|\mathbf{y})d\mathbf{a}_{*}$ where $p(\mathbf{a}_{*}|\mathbf{y})=\prod_{t=1}^{T}\int p(a^{t}_{*}|\mathbf{u}^{a,t})q(\mathbf{u}^{a,t})d\mathbf{u}^{a,t}=\mathcal{N}(a^{t}_{*}|\mu^{a,t}_{*},\nu^{a,t}_{*})$ with the mean $\mu_{*}^{a,t}=\mathbf{k}^{a,t}_{*M}[\mathbf{K}^{a,t}_{MM}]^{-1}\mathbf{m}^{a,t}$ and variance $\nu_{*}^{a,t}=k^{a,t}_{**}-\mathbf{k}^{a,t}_{*M}[\mathbf{K}^{a,t}_{MM}]^{-1}[\mathbf{I}-\mathbf{S}^{a,t}[\mathbf{K}^{a,t}_{MM}]^{-1}]\mathbf{k}^{a,t}_{M*}$ . Thereafter, we sample from the Gaussian and multinomial distributions, and pass through the model to produce predictive samples.

3.3 Scalable latent GP

The foregoing models including SHGP and SMGP directly modulate the outputs for capturing complicated distributions. Differently, we could introduce the stochasticity in the augmented input space, and modulate the covariances in order to output expressive predictive distribution more flexibly. It has been found that the GP could warp the normal stochasticity in the augmented inputs into complicated distribution like [29, 30]. Along this line, the latent GP (LGP) introduces additional latent inputs $\mathbf{w}\in R^{d_{\mathbf{w}}}$ to augment the input space as

\displaystyle y(\mathbf{x})=f([\mathbf{x},\mathbf{w}])+\epsilon.

(18)

It is usually assumed that the latent variables $\mathbf{W}=\{\mathbf{w}_{i}\}_{i=1}^{N}$ for data points are independent from each other. We next attempt to derive scalable ELBO for LGP using variational inference.

The IWVI-based ELBO. Similar to SMGP, the scalable LGP (SLGP) first obtains the bound for the conditional $\log p(\mathbf{y}|\mathbf{W})$ as

\displaystyle\log p(\mathbf{y}|\mathbf{W})\geq\tilde{\mathcal{L}}_{\mathbf{W}}^{\mathrm{lgp}}-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})]=\log\hat{p}(\mathbf{y}|\mathbf{W}),

(19)

where the partial bound $\tilde{\mathcal{L}}_{\mathbf{W}}^{\mathrm{lgp}}=\mathbb{E}_{q(\mathbf{f}|\mathbf{W})}[\log p(\mathbf{y}|\mathbf{f})]$ is analytical; the posterior $q(\mathbf{f}|\mathbf{W})=\int p(\mathbf{f}|\mathbf{u},\mathbf{W})q(\mathbf{u})d\mathbf{u}=\mathcal{N}(\mathbf{f}|\bm{\mu}^{f},\bm{\Sigma}^{f})$ wherein the kernel matrices in $\bm{\nu}^{f}$ and $\bm{\Sigma}^{f}$ are calculated in the augmented input space $\mathbb{R}^{d_{\mathbf{x}}+d_{\mathbf{w}}}$ ; and finally, $\hat{p}(\mathbf{y}|\mathbf{W})$ is an unnormalized distribution which satisfies $\lim_{\mathbf{u}\to\mathbf{f}}\log\hat{p}(\mathbf{y}|\mathbf{W})=\log p(\mathbf{y}|\mathbf{W})$ , since in this case the sparse GP recovers the full GP. Thereafter, by inserting the inequality (19) back into $\log p(\mathbf{y})$ and using the Jensen’s inequality, we get a scalable bound for $\log p(\mathbf{y})$ as

\displaystyle\begin{split}\log p(\mathbf{y})\geq&\log\mathbb{E}_{q(\mathbf{W})}\left[\exp(\tilde{\mathcal{L}}_{\mathbf{W}}^{\mathrm{lgp}})\frac{p(\mathbf{W})}{q(\mathbf{W})}\right]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})]\\ \geq&\mathbb{E}_{q(\mathbf{W})}\log\left[\exp(\tilde{\mathcal{L}}_{\mathbf{W}}^{\mathrm{lgp}})\frac{p(\mathbf{W})}{q(\mathbf{W})}\right]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})]=\mathcal{L}_{\mathrm{slgp}}.\end{split}

(20)

In (20), we could have the independent normal prior $p(\mathbf{W})=\prod_{i=1}^{N}p(\mathbf{w}_{i})=\prod_{i=1}^{N}\mathcal{N}(\mathbf{w}_{i}|\mathbf{0},\mathbf{I})$ , or more informatively, the independent but expressive prior [31]

\displaystyle p(\mathbf{w}_{i})=\mathcal{N}(\mathbf{w}_{i}|\mathtt{Linear}(\mathtt{MLP}(\mathbf{x}_{i})),\mathtt{SoftPlus}(\mathtt{MLP}(\mathbf{x}_{i}))),

(21)

which amortizes the parameters over a multi-layer perception (MLP) for efficient training; the variational posterior $q(\mathbf{W})=\prod_{i=1}^{N}q(\mathbf{w}_{i})$ also takes the Gaussian, with the mean and variance (i) treated as individual hyperparameters [1], or more efficiently, (ii) parameterized as a MLP of both $\mathbf{x}$ and $\mathbf{y}$ , i.e.,

\displaystyle q(\mathbf{w}_{i})=\mathcal{N}(\mathbf{w}_{i}|\mathtt{Linear}(\mathtt{MLP}([\mathbf{x}_{i},y_{i}])),\mathtt{SoftPlus}(\mathtt{MLP}([\mathbf{x}_{i},y_{i}]))).

(22)

Moreover, the importance weighted variational inference (IWVI) [32, 33] further helps pushing the bound towards the marginal likelihood by making the quantity inside the first expectation term of (20) concentrated about the mean. This is performed by a sample average over $S$ terms with respect to (w.r.t.) $\mathbf{w}$ as

\displaystyle\mathcal{L}_{\mathrm{slgp}}^{S}=\mathbb{E}_{q(\mathbf{W}^{1:S})}\left[\log\frac{1}{S}\sum_{s=1}^{S}\exp(\tilde{\mathcal{L}}_{\mathbf{W}^{s}}^{\mathrm{lgp}})\frac{p(\mathbf{W}^{s})}{q(\mathbf{W}^{s})}\right]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})],

(23)

where the posterior $q(\mathbf{W}^{1:S})=\prod_{s=1}^{S}q(\mathbf{W}^{s})$ , and maximizing the expectation of $p(\mathbf{W}^{s})/q(\mathbf{W}^{s})$ w.r.t. $q(\mathbf{W}^{s})$ represents a regularization: it pushes the posterior $q(\mathbf{W}^{s})$ towards the prior $p(\mathbf{W}^{s})$ . This bound becomes strictly tighter as $S$ increases, that is, $\mathcal{L}_{\mathrm{slgp}}=\mathcal{L}_{\mathrm{slgp}}^{1}\leq\cdots\mathcal{L}_{\mathrm{slgp}}^{S}\leq\log p(\mathbf{y})$ [32], which can be obtained by using Jensen’s inequality inversely. From another point of view [34], the IWVI is equivalent to employing a posterior $q_{\mathrm{IW}}(\mathbf{W})=\mathbb{E}_{q(\mathbf{W}^{1:S})}\left[\hat{p}(\mathbf{y},\mathbf{W})/\left(\frac{1}{S}\frac{\hat{p}(\mathbf{y},\mathbf{W}^{s})}{q(\mathbf{W}^{s})}\right)\right]$ with $\hat{p}(\mathbf{y},\mathbf{W})=\hat{p}(\mathbf{y}|\mathbf{W})p(\mathbf{W})$ . This posterior is more informative than the original $q(\mathbf{W})$ , and converges to $p(\mathbf{W}|\mathbf{y})$ when $S\to+\infty$ and $\mathbf{u}\to\mathbf{f}$ .

Enabling non-stationarity. Note that unlike the SHGP and SMGP, the SLGP described so for has no way to tackle non-stationary features. Therefore, as depicted in Fig. 1(c), we introduce an additional stochastic encoder $p(\mathbf{h}|\mathbf{w},\mathbf{x})=p(\mathbf{h}|\mathbf{w})$ ⁵⁵5The latent variable $\mathbf{h}$ is actually conditioned on $[\mathbf{x},\mathbf{w}]$ . But since $\mathbf{x}$ is deterministic, we omit it. as the mapping between the augmented input space $\mathbb{R}^{d_{\mathbf{x}}+d_{\mathbf{w}}}$ and the latent space $\mathbb{R}^{d_{\mathbf{h}}}$ for performing latent representation learning, which thus warps the non-stationary feature to ease the subsequent regression task. The stochastic encoder could be implemented by deep Gaussian process [35], which however significantly increases model complexity. Alternatively, inspired by the idea of amortized variational inference, we follow [36] and resort to the great representational power of neural networks. Hence, for the stochastic encoder, we take

1.

the factorized Gaussian prior

$\displaystyle p(\mathbf{h}|\mathbf{w})=\mathcal{N}(\mathbf{h}|\phi(\mathbf{x},\mathbf{w}),\nu_{0}\mathbf{I}),$ (24)

where the prior mean $\phi(\mathbf{x},\mathbf{w})$ represents the mapping of the augmented inputs. Particularly, when $d_{\mathbf{h}}=d_{\mathbf{x}}+d_{\mathbf{w}}$ , we have an identity mapping $\phi(\mathbf{x},\mathbf{w})=[\mathbf{x},\mathbf{w}]$ ; when $d_{\mathbf{h}}>d_{\mathbf{x}}+d_{\mathbf{w}}$ , we use the zero-padding strategy $\phi(\mathbf{x},\mathbf{w})=[\mathbf{x},\mathbf{w},\mathbf{0}]$ ; and finally, when $d_{\mathbf{h}}<d_{\mathbf{x}}+d_{\mathbf{w}}$ , we use some dimensionality reduction algorithms, for example, the principle component analysis (PCA), to map $\mathbf{w}$ as $\phi(\mathbf{x},\mathbf{w})=\mathtt{PCA}([\mathbf{x},\mathbf{w}])$ ; and

the variational posterior

\displaystyle q(\mathbf{h}|\mathbf{w})=\mathcal{N}(\mathbf{h}|\bm{\mu}^{h},\mathrm{diag}[\bm{\nu}^{h}])

to be Gaussians factorized over dimensions.The variational mean and variance are parameterized as a MLP of input as

\displaystyle\bm{\mu}^{h}=\mathtt{Linear}(\mathtt{MLP}([\mathbf{x},\mathbf{w}])),\quad\bm{\nu}^{h}=\nu_{0}\times\mathtt{Sigmoid}(\mathtt{MLP}([\mathbf{x},\mathbf{w}])),

(25)

where the shared parameter $\nu_{0}$ in $p(\mathbf{h}|\mathbf{w})$ and $q(\mathbf{h}|\mathbf{w})$ enables knowledge transfer between the prior and posterior.

Note that though we are using the MLPs for amortized inference, the GP framework alleviates the necessity for fine-tuning or regularization.

Furthermore, following [36], a hybrid prior

\displaystyle p_{\beta}(\mathbf{h}|\mathbf{w})=p^{\beta}(\mathbf{h}|\mathbf{w})q^{1-\beta}(\mathbf{h}|\mathbf{w}),\quad\beta\in[0,1]

(26)

is employed in order to arrive at (i) more expressive prior and (ii) adjustable regularization on the latent representation learning $[\mathbf{x},\mathbf{w}]\to\mathbf{h}$ . Consequently, we obtain the following ELBO

\displaystyle\begin{split}\mathcal{L}_{\mathrm{slgp}}^{S,\beta}=&\mathbb{E}_{q(\mathbf{H}|\mathbf{W}^{1:S})q(\mathbf{W}^{1:S})}\left[\log\frac{1}{S}\sum_{s=1}^{S}\exp(\tilde{\mathcal{L}}_{\mathbf{W}^{s}}^{\mathrm{lgp}})\frac{p_{\beta}(\mathbf{H}|\mathbf{W}^{s})}{q(\mathbf{H}|\mathbf{W}^{s})}\frac{p(\mathbf{W}^{s})}{q(\mathbf{W}^{s})}\right]\\ &-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})]\\ =&\mathbb{E}_{q(\mathbf{H}|\mathbf{W}^{1:S})q(\mathbf{W}^{1:S})}\left[\log\frac{1}{S}\sum_{s=1}^{S}\exp(\tilde{\mathcal{L}}_{\mathbf{W}^{s}}^{\mathrm{lgp}})\frac{p^{\beta}(\mathbf{H}|\mathbf{W}^{s})}{q^{\beta}(\mathbf{H}|\mathbf{W}^{s})}\frac{p(\mathbf{W}^{s})}{q(\mathbf{W}^{s})}\right]\\ &-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})],\end{split}

(27)

where $q(\mathbf{H}|\mathbf{W}^{1:S})=\prod_{i=1}^{N}q(\mathbf{h}_{i}|\mathbf{w}_{i}^{1:S})=\prod_{i=1}^{N}\prod_{s=1}^{S}q(\mathbf{h}_{i}|\mathbf{w}_{i}^{s})$ . A key advantage of this bound is the capability to tune and control the regularization on latent representation learning through the balance factor $\beta$ . As has been investigated in [36], when $\beta=1.0$ , we pose the complete regularization to constrain the latent representation learning $[\mathbf{x},\mathbf{w}]\to\mathbf{h}$ , since maximizing the expectation of $p(\mathbf{H}|\mathbf{W}^{s})/q(\mathbf{H}|\mathbf{W}^{s})$ w.r.t. $q(\mathbf{H}|\mathbf{W}^{s})$ pushes the posterior $q(\mathbf{H}|\mathbf{W}^{s})$ towards the prior $p(\mathbf{H}|\mathbf{W}^{s})$ ; contrarily, the decreasing $\beta$ weakens the regularization to have more powerful representation at the cost of however raising the possibility of over-fitting; particularly, $\beta=0$ offers a deterministic encoder for $[\mathbf{x},\mathbf{w}]\to\mathbf{h}$ .

Tighter bound is not necessarily better. Though increasing $S$ is found to bring tighter bound, it is however found that the bound $\mathcal{L}_{\mathrm{slgp}}^{S,\beta}$ using $\beta=1.0$ still often risks severe over-fitting in scenarios with finite number of data points. This may be attributed to the monte carlo approximation quality of $\mathcal{L}_{\mathrm{slgp}}^{S,\beta}$ , the maximization of which may not push $q(\mathbf{H}|\mathbf{W})$ towards $p(\mathbf{H}|\mathbf{W})$ . Empirical evidence has been provided in the numerical experiments in Sec. 5.

It is observed that when $S=1$ , the bound (27) degenerates to the relaxed VI-based ELBO as

\displaystyle\begin{split}\mathcal{L}_{\mathrm{slgp}}^{\beta}=&\mathbb{E}_{q(\mathbf{H}|\mathbf{W})q(\mathbf{W})}\left[\tilde{\mathcal{L}}_{\mathbf{W}}^{\mathrm{lgp}}\right]-\beta\mathbb{E}_{q(\mathbf{W})}\left[\mathrm{KL}[q(\mathbf{H}|\mathbf{W})||p(\mathbf{H}|\mathbf{W})]\right]\\ &-\mathrm{KL}[q(\mathbf{W})||p(\mathbf{W})]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})],\end{split}

(28)

which now owns an analytical, non-negative KL term to measure the gap between $q(\mathbf{H}|\mathbf{W})$ and $p(\mathbf{H}|\mathbf{W})$ , the minimization of which pushes the posterior towards the prior. This bound therefore fully utilizes the regularization when $\beta=1.0$ , with however less tight approximation to the objective $\log p(\mathbf{y})$ , which may harm the performance.

Therefore, aiming to take advantages from both the IWVI-based ELBO (27) and the VI-based ELBO (28), we propose a hybrid ELBO as

\displaystyle\begin{split}\mathcal{L}_{\mathrm{slgp}}^{\mathrm{hyb}}=&\mathbb{E}_{q(\mathbf{H}|\mathbf{W}^{1:S})q(\mathbf{W}^{1:S})}\left[\log\frac{1}{S}\sum_{s=1}^{S}\exp(\tilde{\mathcal{L}}_{\mathbf{W}^{s}}^{\mathrm{lgp}})\frac{p(\mathbf{W}^{s})}{q(\mathbf{W}^{s})}\right]\\ &-\beta\mathbb{E}_{q(\mathbf{W}^{1:S})}\left[\frac{1}{S}\sum_{s=1}^{S}\mathrm{KL}[q(\mathbf{H}|\mathbf{W}^{s})||p(\mathbf{H}|\mathbf{W}^{s})]\right]-\mathrm{KL}[q(\mathbf{u})||p(\mathbf{u})],\end{split}

(29)

which separates the KL term w.r.t. $\mathbf{H}$ to achieve the regularized latent representation learning. The hybrid ELBO has higher approximation quality for $\log p(\mathbf{y})$ in comparison to (28); meanwhile, it achieves controllable regularization w.r.t. $\mathbf{w}$ in comparison to (27).

Prediction. Finally, it is notable that when performing prediction at $\mathbf{x}_{*}$ , it depends on the posterior $q(\mathbf{w}_{*})$ which however is unknown. Instead, we take samples from the prior $p(\mathbf{w}_{*})$ and pass them through the model to output predictive samples. This inconsistency indicates that instead of using the simple unit normal prior $p(\mathbf{w})$ , we may need to use more informative prior, like (21), in order to mimic the posterior $q(\mathbf{w})$ . Consequently, it alleviates the inconsistency when performing prediction. That is why recent works have exploited the usage of expressive priors like mixture of Gaussians and normalizing flow [37].

3.4 Discussions

Table 1 summarizes the capability of various GP paradigms together with their time complexity in different scenarios. It is observed that the hemoscedastic, stationary SGP [19] fails in the three challenging scenarios. Contrarily, the three scalable modulated GPs achieve improvement. Among them, the SHGP is not designed for multi-modal scenario, and the Gaussian noise assumption prohibits the modeling of non-Gaussian residuals. Besides, the share of $w(.)$ in (6) for modulating both the output and noise may weaken the learning of non-stationary features. The time complexity of SHGP is two times that of SGP due to the additional log-GP. As for SMGP, ideally, it is capable of approximating any target distribution, given that we own many GP experts (i.e., a large $T$ ) and learn the assignment GPs reasonably, which however are quite challenging and raise remarkably high time complexity. As for the final SLGP, it showcases superiority in the following numerical experiments, with the additional complexity $\mathcal{O}(\mathrm{NN})$ , which is usually lower than that of SGP, brought by the MLPs for the prior $p(\mathbf{w})$ in (21), and the posteriors $q(\mathbf{w})$ in (21) and $q(\mathbf{h}|\mathbf{w})$ in (25).

Table 1: The capability and complexity of GP paradigms in various scenarios, with the number of stars indicating the model capability.

Model	Hetero. noise	Multi. modality	Non-stationarity	Time complexity
SGP	$\times$	$\times$	$\times$	$\mathcal{O}(\|\mathcal{B}\|M^{2}+M^{3})$
SHGP	$\star$	$\times$	$\star$	$2\mathcal{O}(\|\mathcal{B}\|M^{2}+M^{3})$
SMGP	$\star$ $\star$	$\star$ $\star$	$\star$ $\star$	$2T\mathcal{O}(\|\mathcal{B}\|M^{2}+M^{3})$
SLGP	$\star$ $\star$ $\star$	$\star$ $\star$ $\star$	$\star$ $\star$ $\star$	$\mathcal{O}(\mathrm{NN})+\mathcal{O}(\|\mathcal{B}\|M^{2}+M^{3})$

4 Related work

As for the modeling of heteroscedastic residuals, Goldberg et al [24] developed the heteroscedastic GP model $y(\mathbf{x})=f(\mathbf{x})+\mathcal{N}(0,e^{w(\mathbf{x})})$ , wherein the Gibbs sampling is used to sample from posteriors for model training. The scalability of this model has thereafter been improved through variational inference and distributed learning using sparse approximation [38, 39, 40]. Moreover, in order to tackle non-stationary features and arrive at log-concave likelihood, the HGP model in (6) has been first studied in [25, 26] by combining the non-stationary GP $y(\mathbf{x})=e^{w(\mathbf{x})}f(\mathbf{x})+\mathcal{N}(0,\nu^{\epsilon})$ [23] and the above heteroscedastic noise, the inference of which resorts to expectation propagation. The full GP paradigm however makes it only available on small datasets. As an improvement, we here improve the scalability of (6) under sparse approximation and extend it to the regime of big data; besides, we use variational inference to derive the closed-form ELBO (9) for efficient model training.

As for the mixture of GPs,⁶⁶6It belongs to the category of mixture of experts (MoE) [41, 42]. early works seek to infer a gating network $g(.)$ , which is usually parameterized as a softmax function, to perform probabilistic separation of the input domain and the assignment of data points, which in turn allow the training of local GP experts [43, 44, 45]. The sparse approximations [46, 47, 48] have then been introduced in MGP to reduce the model complexity under the variational expectation maximization framework. The key assumption in the above MGPs is that only one GP expert is used to explain the data at each point. Differently, a whole Bayesian generative model for both the GPs and the assignments has been recently investigated in [2], wherein the key assumption is that the data at each position is generated by a number of global GP experts. However, as discussed in Sec. 3.2, the discrete multinomial prior $p(\mathbf{W}|\mathbf{A})$ makes the VI-based ELBO (15) hard to use. To sidestep this issue, Kaiser et al. [2] decided to keep the latent variables $\mathbf{W}$ by approximating $\log p(\mathbf{y},\mathbf{W})$ rather than the interested $\log p(\mathbf{y})$ . Contrarily, our work devotes to marginalize all the latent variables out to directly approximate the marginal likelihood $\log p(\mathbf{y})$ and derive a tighter bound for model training with higher quality.

As for the latent GP, it was first proposed in [49] for modeling hereroscedastic non-Gaussian residuals, and then extended for conditional density estimation [50, 1, 35] and disentangled learning [51]. The similar idea of augmenting inputs with latent variables was also independently presented from the regime of reinforcement learning [52]. It is also notable that the LGP is close to the Gaussian process latent variable model (GPLVM) [30, 29], despite that the latter seeks to infer the low-dimensional latent variables for unsupervised learning. Besides, the LGP has the same model structure to the conditional variational autoencoder (CVAE) [53], though they were developed from different views. Our SLGP model is most related to the work in [1], which combines the amortized variational inference with LGP for conditional density estimation. The major differences are that (i) our SLGP introduces an additional, regularized stochastic encoder $q(\mathbf{h}|\mathbf{w})$ to enhance the power of latent representation; and (ii) we derive a hybrid and tighter ELBO to achieve high training quality.

5 Numerical experiments

This section comprehensively investigates the characteristics and performance of different scalable modulated GPs for approximating diverse distributions on three toy cases, eight UCI datasets, the large New York City Taxi dataset, and finally the mnist image generation task. All the experiments are performed on a Linux workstation with eight 3.20 GHz cores, nvidia GTX1080Ti, and 32GB memory.

5.1 Toy cases

This section employs three toy cases with different characteristics to showcase the performance of modulated GPs. The first case

\displaystyle y(x)=\cos(5x)\exp(-0.5x)+0.25\cos(6x+1)\exp(-x)\mathcal{N}(0,1),\quad x\in[-2,2]

has the heteroscedastic Gaussian noise; the second case is a non-stationary step function around $x=0.5$ with a constant noise $\epsilon\sim\mathcal{N}(0,1e^{-4})$ ; and finally the third case is the multi-modal moon data imported from the scikit-learn package [54]. We generate $n=1000$ training points for the heteroscedastic case, $n=500$ for the step case, and $n=200$ for the multi-modal case, and predict at 500 evenly distributed points.

In the comparison study, the training data has been normalized before model training. The modulated GPs use the squared exponential (SE) kernel, and $m=50$ inducing variables. Particularly, the SMGP uses $K=4$ GP experts; the SLGP uses the balance parameter $\beta=0.01$ and a single latent input (i.e., $d_{\mathbf{w}}=1$ ), and adopts the MLPs for $p(\mathbf{w})$ , $q(\mathbf{w})$ and $q(\mathbf{h}|\mathbf{w})$ with three hidden layers, each of which owns 100 units and takes the ReLU activation function. Note that both SMGP and SLGP use $S=10$ MC samples to evaluate their ELBOs for model training. As for the optimization, we employ the Adam solver with the learning rate of $0.005$ and run it over 10000 iterations. Fig. 2 illustrates the prediction samples generated by the three modulated GPs. The hemoscedastic, stationary SGP [19] is also included as a baseline. We conclude the following findings from the comparative results.

The modulated GPs outperform the SGP. The homoscedastic SGP is only able to estimate the simple constant noise in data, and has no way to tackle non-stationary and multi-modal features. As a result, the generated samples cannot well recover the three data distributions. Contrarily, the modulated GPs approximate the data more accurately by extracting richer statistical behavior behind data via the modulation $\mathbf{w}$ . As shown in Fig. 3, (i) the modulation variables learnt by SHGP capture the varying noise in the heteroscedastic case; (ii) the four GP experts learnt by SMGP describe part of the data for capturing the step behavior; and (iii) the mean of $q(\mathbf{w})$ learnt by SLGP encodes the multi-modality of the moon case, and the expressive prior $p(\mathbf{w})$ mimics the posterior for better prediction.

The SLGP is promising in various scenarios. The SHGP uses an additional latent function $w(.)$ to describe the varying Gaussian noise and calibrate the amplitude of output. Hence, it perfectly approximates the first case and improves over SGP in capturing the step case. Due to the Gaussian noise assumption, the SHGP however cannot fit the multi-modal case well. The SMGP describes the distributions more flexibly by mixing several GP experts. The proper number $T$ of experts however is problem-dependent. Ideally, the SMGP improves with $T$ ; however, it meanwhile toughens the model training due to the exponentially increased number of hyperparameters and the more complicated model structure. In comparison to the flexible but average performance of SMGP, the SLGP is promising. By including stochasticity in the augmented input space and using an NN-assisted stochastic encoder, the SLGP well recovers the three distributions.

The SLGP may sacrifice the homoscedastic noise. It is found that the GP framework uses the estimated variance $\nu^{f}_{*}$ to quantify the epistemic uncertainty in model parameters [55]. As for the aleatoric uncertainty inherited in the observations, the SHGP employs an additional GP to fit it; while the SGP, SMGP and SLGP use a simple i.i.d noise $\mathcal{N}(0,\nu^{\epsilon})$ for approximation. It is observed from Fig. 2 that the latent samples of SLGP from $p(f_{*}|\mathbf{y})$ are almost the same as the samples from $p(y_{*}|\mathbf{y})$ for the first heteroscedastic case, which indicates that the SLGP here sacrifices the i.i.d. noise. That is, the SLGP pushes the noise variance towards zero, e.g., it has the pretty small estimation $\nu^{\epsilon}=0.0043$ . This is because for SLGP, the i.i.d. noise is inconsistent to the heteroscedastic property of the first case. Thus, it tends to ignore the noise term and resorts to the $p(f_{*}|\mathbf{y})$ conditioned on augmented stochastic inputs to explain the data.

5.2 UCI regression

This section further investigates the performance of the three modulated GPs on eight UCI benchmarks [56] summarized in Table 2. In addition to the SGP [19], the comparative study adopts the GP competitors including

1.

the stochastic variational HGP (SVHGP) [40] which uses the latent function $w(.)$ to only modulate the noise variance,
2.

the data association GP (DAGP) [2] which maximizes the joint likelihood $p(\mathbf{y},\mathbf{W})$ rather than the marginal likelihood $p(\mathbf{y})$ for model training, and
3.

the GP conditional density estimation (GPCDE) [1] which derives an optimal ELBO for LGP.

Besides, the competitors also include the neural network counterparts including

1.

the heteroscedastic neural network (HNN) [57] with the last layer having two outputs, one for the prediction mean of a Gaussian and the other for the uncertainty of the prediction mean,
2.

the mixture of density network (MDN) [58] with the last layer outputting the means and variances of multiple Gaussians together with the related coefficients to mix them, and finally,
3.

the conditional variational auto-encoder(CVAE) [53] which has the similar structure to the proposed SLGP with the main difference being that it is a pure NN architecture without stochastic regularization.

The model configurations on the UCI datasets are elaborated in Appendix B. Note that the SLGP performs grid search of the balance parameter $\beta$ from the candidate set $\{1.0,0.5,0.1,0.01\}$ . Table 3 reports the average comparative results over ten runs in terms of negative log likelihood (NLL), which is estimated by the kernel density estimator built from the prediction samples. We have the following findings from the comparative results.

Table 2: Eight UCI benchmark datasets.

dataset	$N_{\mathrm{train}}$	$N_{\mathrm{test}}$	$d_{\mathbf{x}}$
boston	456	50	13
energy	692	76	8
concrete	927	103	8
wine-red	1440	159	11
kin8nm	7373	819	8
power	8612	956	4
naval	10741	1193	16
protein	41157	4573	9

Table 3: The negative log likelihood (NLL) results of modulated GPs and their GP (marked in orange) and NN (marked in blue) counterparts on the UCI datasets, with the best and second-best results marked in gray and light gray, respectively. For the NLL criterion, lower is better. Besides, the last row provides the optimal balance parameter

\beta

of SLGP through grid search.

model	boston	energy	concrete	wine-red	kin8nm	power	naval	protein
SGP	2.3325_±0.0723	1.4431_±0.0331	3.0514_±0.0313	0.9527_±0.0544	-0.9947_±0.0135	2.7274_±0.0097	-7.0355_±0.6860	2.8784_±0.0080
HNN	3.5959_±0.2348	2.0527_±0.3290	2.7110_±0.1831	1.0314_±0.1625	1.4230_±0.5217	2.5078_±0.0411	-5.2414_±0.7426	2.1944_±0.1062
SVHGP	2.1863_±0.0530	1.0108_±0.0786	2.9546_±0.0266	0.8585_±0.0652	-1.0413_±0.0118	2.6678_±0.0136	-7.1000_±0.3524	2.7445_±0.0130
SHGP	2.1174_±0.0678	1.0762_±0.0774	2.9706_±0.0307	0.8991_±0.0579	-1.0378_±0.0120	2.7019_±0.0137	-7.4956_±0.1271	2.7628_±0.0114
MDN	3.3399_±0.1768	1.3182_±0.2403	2.6500_±0.1326	0.3965_±0.9257	1.7145_±0.2523	2.4702_±0.0252	-3.5556_±1.0886	1.9195_±0.0219
DAGP	2.3004_±0.0535	1.1484_±0.0938	3.0369_±0.0377	0.9533_±0.0621	-0.9985_±0.0176	2.7186_±0.0087	-7.2886_±0.4019	2.8367_±0.0094
SMGP	2.2094_±0.0770	1.1500_±0.1117	2.9681_±0.0321	1.1342_±0.0846	-1.0263_±0.0128	2.7001_±0.0110	-7.3514_±0.1601	2.7529_±0.0118
CVAE	3.6231_±0.2309	1.7857_±0.1825	2.6141_±0.1813	-0.7183_±0.3034	1.3869_±0.2132	2.3990_±0.0472	-5.8454_±0.3416	1.9666_±0.0237
GPCDE	2.3292_±0.0662	1.2102_±0.0636	3.0572_±0.0339	0.9516_±0.0602	-0.9927_±0.0136	2.7293_±0.0083	-6.9172_±0.2030	2.8181_±0.0053
SLGP	2.2075_±0.0804	0.6821_±0.3064	2.5031_±0.1559	0.6663_±0.0904	-1.0414_±0.0135	2.4083_±0.0368	-6.3133_±0.2206	1.9089_±0.0311
$\beta$	1.0	1.0	0.5	1.0	0.5	0.1	0.1	0.01

The SLGP generally outperforms the others. Similar to the results on toy cases, the SLGP again showcases superiority on six out of the eight UCI datasets. Besides, it is observed in Table 3 that the optimal balance parameter $\beta$ is problem-dependent. As for the first four small datasets, the SLGP prefers using $\beta=1$ in order to fully utilize the regularization to guard against over-fitting. Contrarily, the SLGP decides to use a small $\beta$ on the remaining four large datasets for improving the latent representation learning, which is beneficial for extracting the underlying patterns under massive data. As for the other two modulated GPs, they outperform the SGP for most of the cases. Finally, the mixture of GPs, which ideally has high flexibility and capability, does not show significant superiority over the SHGP. This may be attributed to the difficulty in training such complex model.

The modulated GPs usually outperform the GP counterparts. As for SLGP, it performs better than GPCDE on all the eight benchmarks, due to the regularized latent representation learning $q(\mathbf{h}|\mathbf{w})$ together with the hybrid and tight ELBO (29). As for SMGP, it outperforms the DAGP in six out of the eight benchmarks due to the more reasonable ELBO (17), which marginalizes all the latent variables out. As for SHGP, however, it does not show superiority over SVHGP by additionally modulating the amplitude of output $f(.)$ in (6). As has been explained before, this may be attributed to the share of $w(.)$ in SHGP for modulating both the output and noise, which may weaken the learning of non-stationary features.

The NN counterparts suffer from over-fitting on small datasets. It is not surprising to observe that in comparison to the modulated GPs, the NN counterparts without regularization risk over-fitting on small datasets, for example, the boston and energy datasets. This issue can be alleviated by some fine-tuning tricks, e.g., model pruning, adaptive learning rate, early stop, dropout, and data augmentation. The NN models, especially the CVAE, perform well on the remaining large datasets, because the many data points themselves act as a regularization. Contrarily, though being a hybrid model of NN and GP, the well-calibrated SLGP under the GP framework yields robust and superior performance on both small and large datasets without fine-tuning tricks.

The SMGP improves with $T$ ? The mixture of more GP experts in SMGP is expected to improve the capability of fitting more complicated data distributions. The comparative results in Fig. 4 on the small concrete and large naval datasets however cannot strongly support this claim. It is observed that the SMGP using $T=4$ improves over that with $T=2$ on the concrete dataset; but the larger $T=6$ brings almost no improvement. More seriously, the SLMP is insensitive to $T$ on the naval dataset. This again confirms that training the SMGP comprising $2\times T$ GP experts with interactions is a challenging task.

The SLGP improves with $d_{\mathbf{w}}$ ? Fig. 5 investigates the impact of the dimensionality $d_{\mathbf{w}}$ of the latent inputs $\mathbf{w}$ on the performance of SLGP on the small wine-red and large protein datasets. It is found that the performance of SLGP is insensitive to $d_{\mathbf{w}}$ on the two datasets. This may be attributed to the sufficient capability of $d_{\mathbf{w}}=1$ for describing statistical structures on the two dataset. We again investigate the impact of $d_{\mathbf{w}}$ on the challenging mnist image dataset in Sec. 5.4 and observe the benefits brought by large $d_{\mathbf{w}}$ .

The hybrid inference in SLGP outperforms VI and IWVI. As stated before, the IWVI helps derive a tighter bound (27) than the commonly used VI-based bound (28). The bound is indeed tighter at the cost of however losing the explicit control of regularization on the stochastic encoder when resorting to the MC approximation. This risks severe over-fitting on the small wine-red dataset even when we are using $\beta=1.0$ , as shown in Fig. 6. Differently, the hybrid bound (29) takes the advantages of both the VI- and IWVI-based ELBOs. Consequently, it outperforms the VI-based inference on the small wine-red dataset, and even performs better than the IWVI-based inference on the large protein dataset.

5.3 Large-scale spatio-temporal modeling

This section applies the three scalable modulated GPs to the large-scale 2016 New York City Taxi (nytaxi) dataset, which is composed of the records of more than 1.4 million taxi trips.⁷⁷7The data is available at https://www.kaggle.com/c/nyc-taxi-trip-duration/data. We employ the Bayesian Benchmarks⁸⁸8https://github.com/hughsalimbeni/bayesian_benchmarks. package to choose the trips within the Manhattan area. We attempt to predict the duration of trip with respect to the drop-off location, the pick-up location, the day of week, and the time of day. It is notable that the day of week and the time of day are transformed as sine and cosine with the natural periods, thus resulting in totally eight inputs. The model configurations on this experiment are elaborated in Appendix B.

Fig. 7 shows the comparative results of modulated GPs and their counterparts over ten runs in terms of NLL. It is observed that all the three modulated GPs significantly outperform the conventional SGP. Specifically, the SMGP performs slightly better than the SHGP; the SMGP and SLGP outperform their GP counterparts (DAGP and GPCDE), but the SHGP and SMGP perform worse than their NN counterparts on this large data; finally, the SLGP shows remarkable superiority on this large dataset, even in comparison to CVAE. Furthermore, Fig. 8 illustrates the estimated densities of the four GPs with respect to the pick-up locations. It is observed that the three modulated GPs, especially the SLGP, have higher likelihood to explain the data.

5.4 Image generation task

Finally, since the powerful SLGP and its NN counterpart CVAE [53] showcase superiority on the foregoing numerical experiments, this section compares them on the high-dimensional mnist dataset that is composed of 70000 hand-written digit images for density estimation as well as sample generation.⁹⁹9The dataset is available at http://yann.lecun.com/exdb/mnist/. We choose 90% of the dataset as training data, and the remaining 10% for testing. Different from the above tasks, the image task has ten-dimensional inputs encoded from the one-hot labels, and 784 vectorized pixel outputs. Appendix B describes the model settings as well as the MLP architectures of SLGP and CVAE. In this experiment, we additionally investigate the impact of latent dimensionality by using $d_{\mathbf{w}}=2$ and $d_{\mathbf{w}}=16$ , respectively.

Table 4: The negative log likelihood (NLL) results on the mnist dataset. For this criterion, lower is better.

$d_{\mathbf{w}}$	CVAE	SLGP
2	-825.1102	-1198.2542
16	-935.4681	-1225.2195

Table 4 reports the NLL results of CVAE and SLGP on the mnist dataset. It is observed that the Bayesian framework helps the SLGP outperform the CVAE, and both of them improve with the latent dimensionality $d_{\mathbf{w}}$ . Besides, Fig. 9 depicts the 0-10 digit images sampled from the distributions learnt by SLGP using $d_{\mathbf{w}}=2$ and $d_{\mathbf{w}}=16$ , respectively. It is observed that the samples generated by the SLGP with $d_{\mathbf{w}}=2$ have obviously noisy background and misidentify some digits, for example, ’1’ and ’7’. By increasing $d_{\mathbf{w}}$ up to sixteen, we are now capable of generating clear and diverse digit images conditioned on the labels.

6 Conclusion

In order to enhance the capability of learning rich statistical representation for GP from massive data, we studied three scalable modulated GPs, which encode challenging details into the latent variable $\mathbf{w}$ by modulating the outputs or inputs. The variational inference helps further derive analytical or tighter ELBOs for efficient and effective model training. We extensively evaluate these scalable modulated GPs and compare them against state-of-the-art GP and NN counterparts on various tasks. It is observed that (i) they outperform the conventional scalable GP in terms of the quality of predictive distribution in various challenging scenarios, and (ii) particularly, the SLGP shows remarkable performance in learning various distributions at the cost of however may sacrificing the homoscedastic noise in some circumstances.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities (DUT19RC(3)070) at Dalian University of Technology, and it was partially supported by the Research and Innovation in Science and Technology Major Project of Liaoning Province (2019JH1-10100024), the National Key Research and Development Project (2016YFB0600104), and the MIIT Marine Welfare Project (Z135060009002).

Appendix A Acronyms and notations

A.1 Acronyms

$\begin{array}[]{ll}\mbox{CVAE}&\mbox{Conditional Variational Autoencoder}\\ \mbox{DAGP}&\mbox{Data Association Gaussian Process}\\ \mbox{ELBO}&\mbox{Evidence Lower Bound}\\ \mbox{GPR}&\mbox{Gaussian Process Regression}\\ \mbox{GPCDE}&\mbox{Gaussian Process Conditional Density Estimation}\\ \mbox{HNN}&\mbox{Heteroscedastic Neural Network}\\ \mbox{{i.i.d.}}&\mbox{Independent and Identically Distributed}\\ \mbox{IWVI}&\mbox{Importance Weighted Variational Inference}\\ \mbox{KL}&\mbox{Kullback-Leibler}\\ \mbox{MCMC}&\mbox{Markov Chain Monte Carlo}\\ \mbox{MDN}&\mbox{Mixture Density Network}\\ \mbox{NLL}&\mbox{Negative Log Likelihood}\\ \mbox{SGP}&\mbox{Scalable Gaussian Process}\\ \mbox{SE}&\mbox{Squared Exponential}\\ \mbox{SHGP}&\mbox{Scalable Heteroscedastic Gaussian Process}\\ \mbox{SLGP}&\mbox{Scalable Latent Gaussian Process}\\ \mbox{SMGP}&\mbox{Scalable Mixture of Gaussian Processes}\\ \mbox{SVHGP}&\mbox{Stochastic Variational Heteroscedastic Gaussian Process}\\ \mbox{VI}&\mbox{Variational Inference}\\ \mbox{w.r.t.}&\mbox{With Respect To}\end{array}$

A.2 Notation

$\begin{array}[]{ll}\mathbf{a}^{t},a^{t}_{*}&\mbox{Latent function values of }a^{t}\mbox{ }(0\leq t\leq T)\mbox{ at }\mathbf{X}\mbox{ and }\mathbf{x}_{*}\mbox{ in SMGP}\\ d_{\mathbf{x}},d_{\mathbf{w}},d_{\mathbf{h}}&\mbox{Dimensionality of }\mathbf{x}\mbox{, }\mathbf{w}\mbox{ and }\mathbf{h}\\ \mathbf{h}&\mbox{Output of stochastic encoder in SLGP}\\ \mathbf{K},\mathbf{k}&\mbox{Kernel matrix and vector}\\ \mathcal{L}&\mbox{Evidence lower bound}\\ M&\mbox{Number of inducing variables}\\ \mathbf{m},\mathbf{S}&\mbox{Mean and covariane for the Gaussian }q(\mathbf{u})\\ \mathbf{m}^{w},\mathbf{S}^{w}&\mbox{Mean and covariane for the Gaussian }q(\mathbf{u}^{w})\mbox{ in SHGP}\\ \mathbf{m}^{a,t},\mathbf{S}^{a,t}&\mbox{Mean and covariane for the Gaussian }q(\mathbf{u}^{a,t})\mbox{ in SMGP}\\ N&\mbox{Number of training points}\\ \mathbf{f},f_{*}&\mbox{Latent function values of }f\mbox{ at }\mathbf{X}\mbox{ and }\mathbf{x}_{*}\\ \mathbf{f}^{t},f^{t}_{*}&\mbox{Latent function values of }f^{t}\mbox{ at }\mathbf{X}\mbox{ and }\mathbf{x}_{*}\mbox{ in SMGP}\\ \mathbf{w},w_{*}&\mbox{Modulation variables at }\mathbf{X}\mbox{ and }\mathbf{x}_{*}\\ S&\mbox{Number of monte carlo samples in SMGP and SLGP}\\ T&\mbox{Number of assignment GPs in SMGP}\\ \bm{\mu}^{f},\bm{\Sigma}^{f}&\mbox{Mean and covariance for the Gaussian }q(\mathbf{f})\\ \bm{\mu}^{f,t},\bm{\Sigma}^{f,t}&\mbox{Mean and covariance for the Gaussian }q(\mathbf{f}^{t})\mbox{ in SMGP}\\ \bm{\mu}^{w},\bm{\Sigma}^{w}&\mbox{Mean and covariance for the Gaussian }q(\mathbf{w})\mbox{ in SHGP}\\ \bm{\mu}^{a,t},\bm{\Sigma}^{a,t}&\mbox{Mean and covariance for the Gaussian }q(\mathbf{a}^{t})\mbox{ in SMGP}\\ \mathbf{X},\mathbf{y}&\mbox{Training inputs and outputs}\\ \mathbf{x}_{*},y_{*}&\mbox{Test inputs and outputs}\\ \mathbf{Z},\mathbf{u}&\mbox{Inducing points and variables}\\ \mathbf{u}^{w}&\mbox{Inducing variables for the latent variables }\mathbf{w}\mbox{ in SHGP}\\ \mathbf{u}^{a,t}&\mbox{Inducing variables for the latent variables }\mathbf{a}^{t}\mbox{ in SMGP}\\ \bm{\nu}^{f},\bm{\nu}^{w}&\mbox{Vectors comprising the diagonal elements of }\mathbf{\Sigma}^{f}\mbox{ and }\mathbf{\Sigma}^{w}\mbox{ in SHGP}\\ \mathcal{B}&\mbox{Subset of training data}\\ \mathcal{D}&\mbox{Training data}\\ \mathcal{M}&\mbox{Multinomial distribution}\\ \mathcal{N}&\mbox{Gaussian distribution}\\ \beta&\mbox{Balance parameter in SLGP}\\ \epsilon&\mbox{Observation noise}\\ \end{array}$

Appendix B Experimental configurations

The UCI datasets. We elaborate the model configurations on the eight UCI datasets as below. First, we preprocess the data standardization over each dimension to have zero mean and unit variance. In addition, we shuffle and split the data to choose 90% for training and the rest 10% for testing.

Second, the modulated GPs employ the SE kernel with the length-scales initialized as $1.0$ and the signal variance initialized as $1.0$ . They adopt $M=100$ inducing variables, the positions of which are initialized by the $k$ -means clustering technique. The GP competitors including SGP, SVHGP, DAGP and GPCDE also take the same configuration for the fair of comparison. It is notable that, since the SLGP has an augmented input space $\mathbb{R}^{d_{\mathbf{x}}+d_{\mathbf{z}}}$ , we draw samples from normal distribution to act as inducing points for the extra dimensions. Besides, the SLGP performs grid search of the balance parameter $\beta$ from the candidate set $\{1.0,0.5,0.1,0.01\}$ , uses a single latent input (i.e., $d_{\mathbf{w}}=1$ ), and has the latent dimension of $d_{\mathbf{h}}=d_{\mathbf{w}}+d_{\mathbf{x}}$ without otherwise indicated. In addition, both the SMGP and SLGP use $S=10$ Monte Carlo samples to estimate their ELBOs (17) and (29). Note that the SHGP model has no need due to the analytical ELBO (9).

Third, as for the MLPs for $p(\mathbf{w})$ , $q(\mathbf{w})$ and $q(\mathbf{h}|\mathbf{w})$ in SLGP, they take the fully-connected (FC) neural networks with three hidden layers, each of which has 100 units and applies the ReLU activation. Besides, the parameter $\nu_{0}$ is initialized as $0.01$ for the MLPs of encoder in (25). It is notable that the experiments do not use additional fine-tuning tricks, e.g., the scheduled learning rate, the regularized weights, or the dropout technique, for MLPs. The NN counterparts including HNN, MDN and CVAE also take the same architecture, with the only difference being that the last layer is not GP.

Forth, the training process adopts the Adam solver [59] with the batch size of 512, the learning rate of $0.005$ , and the maximum number of iterations of 20000.

Finally, as for model evaluation on the test set, we sample 200 points from the predictive distribution at each test point, and then employ the kernel density estimator from the scikit-learn package [54] to estimate the density as well as the negative log likelihood. Note that the bandwidth parameter for the kernel density estimator is estimated by Silverman’s rule.

The large-scale nytaxi dataset. The three scalable modulated GPs take almost the same configuration to that on the UCI datasets. The differences are that they herein adopt $M=500$ inducing variables and run the Adam optimizer with the batch size of 1024 and the maximum number of iterations of 50000; besides, the SLGP employs a small balance parameter $\beta=0.01$ on this large dataset to enhance extracting the patterns under the massive data; finally, the MLPs in SLGP and the NN models take the hidden architecture of “FC(256)-Relu-FC(128)-Relu-FC(64)-Relu-FC(32)”.

The $\mathtt{mnist}$ image dataset. For this 10-input-784-output image dataset, we randomly choose 63000 images for training and the rest 7000 images for testing. The SLGP employs $M=100$ inducing points and the small balance parameter $\beta=0.01$ , and runs the Adam optimizer over 50000 iterations with the batch size of 64. As for the MLPs in SLGP and CVAE, they take the hidden architecture of “FC(512)-Relu-FC(512)-Relu-FC(512)”.

References

[1] V. Dutordoir, H. Salimbeni, J. Hensman, M. Deisenroth, Gaussian process conditional density estimation, in: Advances in Neural Information Processing Systems, 2018, pp. 2385–2395.
[2] M. Kaiser, C. Otte, T. A. Runkler, C. H. Ek, Data association with Gaussian processes, in: European Conference on Principles of Data Mining and Knowledge Discovery, 2019, pp. 548–564.
[3] I. Bilionis, N. Zabaras, Multi-output local Gaussian process regression: Applications to uncertainty quantification, Journal of Computational Physics 231 (17) (2012) 5718–5746.
[4] C. K. Williams, C. E. Rasmussen, Gaussian processes for machine learning, MIT press Cambridge, MA, 2006.
[5] Y. Son, S. Lee, S. Park, J. Lee, Learning representative exemplars using one-class gaussian process regression, Pattern Recognition 74 (2018) 185–197.
[6] M. Kandemir, T. Kekeç, R. Yeniterzi, Supervising topic models with gaussian processes, Pattern Recognition 77 (2018) 226–236.
[7] P. Li, S. Chen, Hierarchical gaussian processes model for multi-task learning, Pattern Recognition 74 (2018) 134–144.
[8] P. Li, S. Chen, Gaussian process approach for metric learning, Pattern Recognition 87 (2019) 17–28.
[9] S. Sun, W. Dong, Q. Liu, Multi-view representation learning with deep gaussian processes, IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) 1–15.
[10] H. Liu, Y.-S. Ong, X. Shen, J. Cai, When Gaussian process meets big data: A review of scalable GPs, IEEE Transactions on Neural Networks and Learning Systems (2020) 1–19.
[11] S. Ambikasaran, D. Foreman-Mackey, L. Greengard, D. W. Hogg, M. O/’Neil, Fast direct methods for Gaussian processes, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2) (2016) 252–265.
[12] K. A. Wang, G. Pleiss, J. R. Gardner, S. Tyree, K. Q. Weinberger, A. G. Wilson, Exact Gaussian processes on a million data points, arXiv preprint arXiv:1903.08114.
[13] M. P. Deisenroth, J. W. Ng, Distributed Gaussian processes, in: International Conference on Machine Learning, JMLR. org, 2015, pp. 1481–1490.
[14] H. Liu, J. Cai, Y. Wang, Y.-S. Ong, Generalized robust Bayesian committee machine for large-scale Gaussian process regression, in: International Conference on Machine Learning, 2018, pp. 3131–3140.
[15] G. Pillonetto, L. Schenato, D. Varagnolo, Distributed multi-agent gaussian regression via finite-dimensional approximations, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9) (2019) 2098–2111.
[16] E. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in: Advances in Neural Information Processing Systems, 2006, pp. 1257–1264.
[17] M. K. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in: Artificial Intelligence and Statistics, 2009, pp. 567–574.
[18] Y. Cao, M. A. Brubaker, D. J. Fleet, A. Hertzmann, Efficient optimization for sparse gaussian process regression, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (12) (2015) 2415–2427.
[19] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, in: Uncertainty in Artificial Intelligence, 2013, pp. 282–290.
[20] A. Wilson, H. Nickisch, Kernel interpolation for scalable structured Gaussian processes (KISS-GP), in: International Conference on Machine Learning, 2015, pp. 1775–1784.
[21] G. Pleiss, J. Gardner, K. Weinberger, A. G. Wilson, Constant-time predictive distributions for Gaussian processes, in: International Conference on Machine Learning, 2018, pp. 4111–4120.
[22] D. Burt, C. E. Rasmussen, M. Van Der Wilk, Rates of convergence for sparse variational Gaussian process regression, in: International Conference on Machine Learning, 2019, pp. 862–871.
[23] R. P. Adams, O. Stegle, Gaussian process product models for nonparametric nonstationarity, in: International Conference on Machine Learning, 2008, pp. 1–8.
[24] P. W. Goldberg, C. K. Williams, C. M. Bishop, Regression with input-dependent noise: A Gaussian process treatment, in: Advances in Neural Information Processing Systems, 1998, pp. 493–499.
[25] L. Munoz-Gonzalez, M. Lazaro-Gredilla, A. R. Figueiras-Vidal, Divisive Gaussian processes for nonstationary regression, IEEE Transactions on Neural Networks 25 (11) (2014) 1991–2003.
[26] L. Munoz-Gonzalez, M. Lazaro-Gredilla, A. R. Figueiras-Vidal, Laplace approximation for divisive Gaussian processes for nonstationary regression, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3) (2016) 618–624.
[27] C. J. Maddison, A. Mnih, Y. W. Teh, The concrete distribution: A continuous relaxation of discrete random variables, in: International Conference on Learning Representations, 2017.
[28] E. Jang, S. Gu, B. Poole, Categorical reparameterization with Gumbel-Softmax, in: International Conference on Learning Representations, 2017.
[29] A. C. Damianou, M. K. Titsias, N. D. Lawrence, Variational inference for uncertainty on the inputs of Gaussian process models, arXiv preprint arXiv:1409.2287.
[30] N. Lawrence, Probabilistic non-linear principal component analysis with Gaussian process latent variable models, Journal of Machine Learning Research 6 (Nov) (2005) 1783–1816.
[31] A. Pagnoni, K. Liu, S. Li, Conditional variational autoencoder for neural machine translation, arXiv preprint arXiv:1812.04405.
[32] Y. Burda, R. B. Grosse, R. Salakhutdinov, Importance weighted autoencoders, in: International Conference on Learning Representations, 2016.
[33] J. Domke, D. R. Sheldon, Importance weighting and variational inference, in: Advances in Neural Information Processing Systems, 2018, pp. 4470–4479.
[34] C. Cremer, Q. Morris, D. Duvenaud, Reinterpreting importance-weighted autoencoders, arXiv preprint arXiv:1704.02916.
[35] H. Salimbeni, V. Dutordoir, J. Hensman, M. Deisenroth, Deep Gaussian processes with importance-weighted variational inference, in: International Conference on Machine Learning, 2019, pp. 5589–5598.
[36] H. Liu, Y.-S. Ong, X. Jiang, X. Wang, Deep latent-variable kernel learning, arXiv preprint arXiv:2005.08467.
[37] A. Bhattacharyya, M. Hanselmann, M. Fritz, B. Schiele, C.-N. Straehle, Conditional flow variational autoencoders for structured sequence prediction, in: Bayesian Deep Learning NeurIPS 2019 Workshop, 2019.
[38] M. Lázaro-Gredilla, M. K. Titsias, Variational heteroscedastic Gaussian process regression, in: International Conference on Machine Learning, Omnipress, 2011, pp. 841–848.
[39] I. A. Almosallam, M. J. Jarvis, S. J. Roberts, GPZ: Non-stationary sparse Gaussian processes for heteroscedastic uncertainty estimation in photometric redshifts, Monthly Notices of the Royal Astronomical Society 462 (1) (2016) 726–739.
[40] H. Liu, Y.-S. Ong, J. Cai, Large-scale heteroscedastic regression via Gaussian process, IEEE Transactions on Neural Networks and Learning Systems (2018) 1–14.
[41] S. E. Yuksel, J. N. Wilson, P. D. Gader, Twenty years of mixture of experts, IEEE Transactions on Neural Networks and Learning Systems 23 (8) (2012) 1177–1193.
[42] S. Masoudnia, R. Ebrahimpour, Mixture of experts: A literature survey, Artificial Intelligence Review 42 (2) (2014) 275–293.
[43] C. E. Rasmussen, Z. Ghahramani, Infinite mixtures of Gaussian process experts, in: Advances in Neural Information Processing Systems, 2002, pp. 881–888.
[44] E. Meeds, S. Osindero, An alternative infinite mixture of Gaussian process experts, in: Advances in Neural Information Processing Systems, 2006, pp. 883–890.
[45] Y. Yang, J. Ma, An efficient EM approach to parameter learning of the mixture of Gaussian processes, in: International Symposium on Neural Networks, 2011, pp. 165–174.
[46] C. Yuan, C. Neubauer, Variational mixture of Gaussian process experts, in: Advances in Neural Information Processing Systems, 2009, pp. 1897–1904.
[47] S. Sun, X. Xu, Variational inference for infinite mixtures of Gaussian processes with applications to traffic flow prediction, IEEE Transactions on Intelligent Transportation Systems 12 (2) (2011) 466–475.
[48] T. Nguyen, E. Bonilla, Fast allocation of Gaussian process experts, in: International Conference on Machine Learning, 2014, pp. 145–153.
[49] C. Wang, R. M. Neal, Gaussian process regression with heteroscedastic or non-Gaussian residuals, arXiv preprint arXiv:1212.6246.
[50] E. Bodin, N. D. Campbell, C. H. Ek, Latent Gaussian process regression, arXiv preprint arXiv:1707.05534.
[51] K. Märtens, K. Campbell, C. Yau, Decomposing feature-level variation with covariate Gaussian process latent variable models, in: International Conference on Machine Learning, 2019, pp. 4372–4381.
[52] S. Depeweg, J. Hernández-Lobato, F. Doshi-Velez, S. Udluft, Learning and policy search in stochastic dynamical systems with Bayesian neural networks, in: International Conference on Learning Representations, 2019.
[53] K. Sohn, X. Yan, H. Lee, Learning structured output representation using deep conditional generative models, in: Advances in Neural Information Processing Systems, 2015, pp. 3483–3491.
[54] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[55] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision, in: Advances in Neural Information Processing Systems, 2017, pp. 5580–5590.
[56] D. Dua, C. Graff, UCI machine learning repository (2017).
URL http://archive.ics.uci.edu/ml
[57] D. Nix, A. Weigend, Estimating the mean and variance of the target probability distribution, in: IEEE International Conference on Neural Networks, Vol. 1, 1994, pp. 55–60.
[58] C. M. Bishop, Mixture density networks, Tech. Rep. NCRG/94/004 (January 1994).
URL https://www.microsoft.com/en-us/research/publication/mixture-density-networks/
[59] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.