
Gaussian Processes Sampling with Sparse Grids
under Additive Schwarz Preconditioner

Haoyuan Chen1  Rui Tuo1
1Department of Industrial and Systems Engineering, Texas A&M University
College Station, TX 77843
[email protected]
[email protected]
Submitted to the editors on January 31, 2024.
Abstract

Gaussian processes (GPs) are widely used in non-parametric Bayesian modeling and play an important role in various statistical and machine learning applications. In a variety of uncertainty quantification tasks, generating random sample paths of GPs is of interest. As GP sampling requires generating high-dimensional Gaussian random vectors, it is computationally challenging if a direct method, such as the Cholesky decomposition, is used. In this paper, we propose a scalable algorithm for sampling random realizations of the prior and posterior of GP models. The proposed algorithm leverages the inducing points approximation with sparse grids, as well as additive Schwarz preconditioners, which reduce the computational complexity and ensure fast convergence. We demonstrate the efficacy and accuracy of the proposed method through a series of experiments and comparisons with other recent works.

Keywords— Gaussian processes, sparse grids, additive Schwarz preconditioner

1 Introduction

Gaussian processes (GPs) hold significant importance in various fields, particularly in statistical and machine learning applications (Rasmussen, 2003; Santner et al., 2003; Cressie, 2015), due to their unique properties and capabilities. As probabilistic models, GPs are highly flexible and able to adapt to a wide range of data patterns, including those with complex, nonlinear relationships. GPs have been successfully applied in numerous domains including regression (O’Hagan, 1978; Bishop et al., 1995; MacKay et al., 2003), classification (Kuss et al., 2005; Nickisch and Rasmussen, 2008; Hensman et al., 2015), time series analysis (Girard et al., 2002; Roberts et al., 2013), spatial data analysis (Banerjee et al., 2008; Nychka et al., 2015), Bayesian networks (Neal, 2012), optimization (Srinivas et al., 2009), and more. This wide applicability showcases their versatility in handling different types of problems. As the size of the dataset increases, however, GP inference becomes computationally demanding because of operations on the covariance matrix such as inversion and determinant computation. These computational challenges have motivated the development of various approximation methods to make GPs more scalable. Recent advances in scalable GP regression include the Nyström approximation (Quinonero-Candela and Rasmussen, 2005; Titsias, 2009; Hensman et al., 2013), random Fourier features (Rahimi and Recht, 2007; Hensman et al., 2018), local approximation (Gramacy and Apley, 2015; Cole et al., 2021), structured kernel interpolation (Wilson and Nickisch, 2015; Stanton et al., 2021), state-space formulation (Hartikainen and Särkkä, 2010; Grigorievskiy et al., 2017; Nickisch et al., 2018), Krylov methods (Gardner et al., 2018; Wang et al., 2019), Vecchia approximation (Katzfuss and Guinness, 2021; Katzfuss et al., 2022), sparse representation (Ding et al., 2021; Chen et al., 2022; Chen and Tuo, 2022), etc.

The focus of this article is on generating samples from GPs, which includes sampling from both prior and posterior processes. GP sampling is a valuable tool in statistical modeling and machine learning, and is used in various applications, such as Bayesian optimization (Snoek et al., 2012; Frazier, 2018b; Frazier, 2018a), computer model calibration (Kennedy and O’Hagan, 2001; Tuo and Wu, 2015; Plumlee, 2017; Xie and Xu, 2021), reinforcement learning (Kuss and Rasmussen, 2003; Engel et al., 2005; Grande et al., 2014), and uncertainty quantification (Murray-Smith and Pearlmutter, 2004; Marzouk and Najm, 2009; Teckentrup, 2020). Practically, sampling from GPs involves generating random vectors from a multivariate normal distribution after discretizing the input space. However, sampling from GPs poses computational challenges, particularly when dealing with large datasets, due to the need to factorize large covariance matrices, which is an \mathcal{O}(n^{3}) operation for n inputs. There has been much research on scalable GP regression, yet sampling methodologies remain relatively unexplored.

Scalable sampling techniques for GPs are relatively rare in the existing literature. A notable recent development in the field is the work of Wilson et al. (2020), who introduced an efficient sampling method known as decoupled sampling. This approach utilizes Matheron’s rule and combines the inducing points approximation with random Fourier features. Furthermore, Wilson et al. (2021) expanded this concept to pathwise conditioning based on Matheron’s update. This method, which requires only sampling from GP priors, has proven to be a powerful tool for both understanding and implementing GPs. Inspired by this work, Maddox et al. (2021) applied Matheron’s rule to multi-task GPs for Bayesian optimization. Additionally, Nguyen et al. (2021) applied such techniques to tighten the variance and enhance GP posterior sampling. Wenger et al. (2022b) introduced IterGP, a novel method that ensures consistent estimation of combined uncertainties and facilitates efficient GP sampling by computing a low-rank approximation of the covariance matrix. Lin et al. (2023) developed a method for generating GP posterior samples using stochastic gradient descent, targeting the complementary problem of approximating these samples in \mathcal{O}(n) time. It is important to note, however, that none of these methods have explored sampling within a structured design, an area that possesses many inherent properties.

In this study, we propose novel algorithms for efficient sampling from GP priors and GP regression posteriors by utilizing inducing point approximations on sparse grids. We show that the proposed method can reduce the time complexity to \mathcal{O}\big{(}(\eta+n)n_{sg}\big{)} for input data of size n and inducing points on sparse grids of level \eta and size n_{sg}. This results in a linear-time sampling process, which is particularly effective when the size of the sparse grid is kept reasonably low. Specifically, our work makes the following contributions:

  • We propose a novel and scalable sampling approach for Gaussian processes using product kernels on multi-dimensional points. This approach leverages the inducing points approximation on sparse grids, and its computational time scales linearly with the product of the number of sampling points and the size of the sparse grid.

  • Our method provides a 2-Wasserstein distance error bound with a convergence rate of \mathcal{O}(n_{sg}^{-\nu}(\log n_{sg})^{(\nu+2)(d-1)+d+1}) for d-dimensional sparse grids of size n_{sg} and a product Matérn kernel of smoothness \nu.

  • We develop a two-level additive Schwarz preconditioner to accelerate the convergence of conjugate gradient during posterior sampling from GPs.

  • We illustrate the effectiveness of our algorithm in real-world scenarios by applying it to areas such as Bayesian optimization and dynamical system problems.

The structure of the paper is as follows: Background information is presented in Section 2, our methodology is detailed in Section 3, experimental results are discussed in Section 4, and the paper concludes with discussions in Section 5. Detailed theorems along with their proofs are provided in Appendix A, and supplementary plots can be found in Appendix B.

2 Background

This section provides an overview of the foundational concepts relevant to our proposed method. Section 2.1 introduces GPs, GP regression, and the fundamental methodology for GP sampling. The underlying assumptions of our work are outlined in Section 2.2. Additionally, detailed explanations of the inducing points approximation and sparse grids, which are key components of our sampling method, are presented in Section 2.3 and Section 2.4, respectively.

2.1 GP review

GPs

A GP is a collection of random variables uniquely specified by its mean function μ()\mu(\cdot) and kernel function K(,)K(\cdot,\cdot), such that any finite collection of those random variables have a joint Gaussian distribution (Williams and Rasmussen, , 2006). We denote the corresponding GP as 𝒢𝒫(μ(),K(,))\mathcal{GP}(\mu(\cdot),K(\cdot,\cdot)).

GP regression

Let f:df:\mathbb{R}^{d}\rightarrow\mathbb{R} be an unknown function. Suppose we are given a set of training points 𝐗={𝐱i}i=1n\mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n} and the observations 𝐲=(y1,,yn)\mathbf{y}=(y_{1},\ldots,y_{n})^{\top} where yi=f(𝐱i)+ϵiy_{i}=f(\mathbf{x}_{i})+\epsilon_{i} with the i.i.d. noise ϵi𝒩(0,σϵ2)\epsilon_{i}\sim\mathcal{N}(0,\sigma_{\epsilon}^{2}). The GP regression imposes a GP prior over the latent function as f()𝒢𝒫(μ(),K(,))f(\cdot)\sim\mathcal{GP}(\mu(\cdot),K(\cdot,\cdot)). Then the posterior distribution at mm test points 𝐗={𝐱i}i=1m\mathbf{X}^{*}=\{\mathbf{x}_{i}^{*}\}_{i=1}^{m} is

p(f(𝐗)|𝐲)=𝒩(𝝁|𝐲,𝐊,|𝐲),p(f(\mathbf{X}^{*})|\mathbf{y})=\mathcal{N}(\bm{\mu}_{*|\mathbf{y}},\mathbf{K}_{*,*|\mathbf{y}}), (1)

with

𝝁|𝐲=𝝁+𝐊,𝐗[𝐊𝐗,𝐗+σϵ2𝐈n]1(𝐲𝝁𝐗),\bm{\mu}_{*|\mathbf{y}}=\bm{\mu}_{*}+\mathbf{K}_{*,\mathbf{X}}\big{[}\mathbf{K}_{\mathbf{X},\mathbf{X}}+\sigma_{\epsilon}^{2}\mathbf{I}_{n}\big{]}^{-1}\left(\mathbf{y}-\bm{\mu}_{\mathbf{X}}\right), (2a)
𝐊,|𝐲=𝐊,𝐊,𝐗[𝐊𝐗,𝐗+σϵ2𝐈n]1𝐊𝐗,,\mathbf{K}_{*,*|\mathbf{y}}=\mathbf{K}_{*,*}-\mathbf{K}_{*,\mathbf{X}}\big{[}\mathbf{K}_{\mathbf{X},\mathbf{X}}+\sigma_{\epsilon}^{2}\mathbf{I}_{n}\big{]}^{-1}\mathbf{K}_{\mathbf{X},*}, (2b)

where σϵ2>0\sigma_{\epsilon}^{2}>0 is the noise variance, 𝐈n\mathbf{I}_{n} is a n×nn\times n identity matrix, 𝐊,𝐗=K(𝐗,𝐗)=[K(𝐱i,𝐱j)]i=1,j=1i=m,j=n=K(𝐗,𝐗)=𝐊𝐗,\mathbf{K}_{*,\mathbf{X}}=K(\mathbf{X}^{*},\mathbf{X})=\big{[}K(\mathbf{x}_{i}^{*},\mathbf{x}_{j})\big{]}_{i=1,j=1}^{i=m,j=n}=K(\mathbf{X},\mathbf{X}^{*})^{\top}=\mathbf{K}_{\mathbf{X},*}^{\top}, 𝐊𝐗,𝐗=[K(𝐱i,𝐱j)]i,j=1n\mathbf{K}_{\mathbf{X},\mathbf{X}}=\big{[}K(\mathbf{x}_{i},\mathbf{x}_{j})\big{]}_{i,j=1}^{n}, 𝐊,=[K(𝐱i,𝐱j)]i,j=1m\mathbf{K}_{*,*}=\big{[}K(\mathbf{x}_{i}^{*},\mathbf{x}_{j}^{*})\big{]}_{i,j=1}^{m}, 𝝁𝐗=(μ(𝐱1),,μ(𝐱n))\bm{\mu}_{\mathbf{X}}=\big{(}\mu(\mathbf{x}_{1}),\cdots,\mu(\mathbf{x}_{n})\big{)}^{\top} and 𝝁=(μ(𝐱1),,μ(𝐱m))\bm{\mu}_{*}=\big{(}\mu(\mathbf{x}_{1}^{*}),\cdots,\mu(\mathbf{x}_{m}^{*})\big{)}^{\top}.

Sampling from GPs

The goal is to sample a function f()𝒢𝒫(μ(),K(,))f(\cdot)\sim\mathcal{GP}(\mu(\cdot),K(\cdot,\cdot)). To represent this in a finite form, we discretize the input space and focus on the function values at a set of grid points 𝐙={𝐳i}i=1ns\mathbf{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n_{s}}. Our goal is to generate samples 𝐟𝐙=(f(𝐳1),,f(𝐳ns))\mathbf{f}_{\mathbf{Z}}=\big{(}f(\mathbf{z}_{1}),\cdots,f(\mathbf{z}_{n_{s}})\big{)}^{\top} from the multivariate normal distribution 𝒩(μ(𝐙),K(𝐙,𝐙))=𝒩(𝝁𝐙,𝐊𝐙,𝐙)\mathcal{N}(\mu(\mathbf{Z}),K(\mathbf{Z},\mathbf{Z}))=\mathcal{N}(\bm{\mu}_{\mathbf{Z}},\mathbf{K}_{\mathbf{Z},\mathbf{Z}}). The standard method for sampling from a multivariate normal distribution involves the following steps: (1) generate a vector of samples 𝐟𝐈\mathbf{f}_{\mathbf{I}} whose entries are independent and identically distributed normal, i.e. 𝐟𝐈ns𝒩(𝟎,𝐈ns)\mathbf{f}_{\mathbf{I}_{n_{s}}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n_{s}}); (2) Use the Cholesky decomposition (Golub and Van Loan, , 2013) to factorize the covariance matrix 𝐊𝐙,𝐙\mathbf{K}_{\mathbf{Z},\mathbf{Z}} as 𝐋𝐙𝐋𝐙=𝐊𝐙,𝐙\mathbf{L}_{\mathbf{Z}}\mathbf{L}_{\mathbf{Z}}^{\top}=\mathbf{K}_{\mathbf{Z},\mathbf{Z}}; (3) generate the output sample 𝐟𝐙\mathbf{f}_{\mathbf{Z}} as

𝐟𝐙𝐋𝐙𝐟𝐈ns+𝝁𝐙.\mathbf{f}_{\mathbf{Z}}\leftarrow\mathbf{L}_{\mathbf{Z}}\mathbf{f}_{\mathbf{I}_{n_{s}}}+\bm{\mu}_{\mathbf{Z}}. (3)

Sampling a posterior GP follows a similar procedure. Suppose we have observations \mathbf{y}=\big{(}y_{1},\cdots,y_{n}\big{)}^{\top} on n distinct points \mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n}, where y_{i}=f(\mathbf{x}_{i})+\epsilon_{i} with noise \epsilon_{i}\sim\mathcal{N}(0,\sigma_{\epsilon}^{2}) and f(\cdot)\sim\mathcal{GP}(\mu(\cdot),K(\cdot,\cdot)). The goal is to generate posterior samples f(\mathbf{X}^{*})\mid\mathbf{y} at m test points \mathbf{X}^{*}=\{\mathbf{x}_{i}^{*}\}_{i=1}^{m}. Since the posterior samples f(\mathbf{X}^{*})\mid\mathbf{y}\sim\mathcal{N}(\bm{\mu}_{*|\mathbf{y}},\mathbf{K}_{*,*|\mathbf{y}}) also follow a multivariate normal distribution, as detailed in eq. 2a and eq. 2b in Section 2.1, we can apply the Cholesky decomposition \mathbf{L}_{*|\mathbf{y}}\mathbf{L}_{*|\mathbf{y}}^{\top}=\mathbf{K}_{*,*|\mathbf{y}} to generate the GP posterior sample \mathbf{f}_{*}=f(\mathbf{X}^{*})=\big{(}f(\mathbf{x}_{1}^{*}),\cdots,f(\mathbf{x}_{m}^{*})\big{)}^{\top} as

\mathbf{f}_{*}\mid\mathbf{y}\leftarrow\mathbf{L}_{*|\mathbf{y}}\mathbf{f}_{\mathbf{I}_{m}}+\bm{\mu}_{*|\mathbf{y}}, (4)

where \mathbf{f}_{\mathbf{I}_{m}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{m}).
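To make the procedure above concrete, the following sketch implements the standard Cholesky-based prior and posterior samplers of eq. 3 and eq. 4 in NumPy. It is illustrative only: the callables kern and mu and the jitter argument are our own assumed helper names, with kern(A, B) returning the cross-covariance matrix between two point sets.

```python
import numpy as np

def sample_prior(kern, mu, Z, n_draws=1, jitter=1e-8):
    """Draw GP prior samples at points Z via Cholesky factorization (eq. 3)."""
    K = kern(Z, Z) + jitter * np.eye(len(Z))          # K_{Z,Z} (jitter for stability)
    L = np.linalg.cholesky(K)                         # L_Z L_Z^T = K_{Z,Z}
    eps = np.random.randn(len(Z), n_draws)            # f_I ~ N(0, I)
    return mu(Z)[:, None] + L @ eps                   # f_Z = L_Z f_I + mu_Z

def sample_posterior(kern, mu, X, y, Xstar, noise_var, n_draws=1, jitter=1e-8):
    """Draw GP posterior samples at test points Xstar via eqs. (2) and (4)."""
    Kxx = kern(X, X) + noise_var * np.eye(len(X))     # K_{X,X} + sigma_eps^2 I_n
    Ksx = kern(Xstar, X)                              # K_{*,X}
    Kss = kern(Xstar, Xstar)                          # K_{*,*}
    mu_post = mu(Xstar) + Ksx @ np.linalg.solve(Kxx, y - mu(X))
    K_post = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T) + jitter * np.eye(len(Xstar))
    L = np.linalg.cholesky(K_post)                    # L_{*|y} L_{*|y}^T = K_{*,*|y}
    eps = np.random.randn(len(Xstar), n_draws)
    return mu_post[:, None] + L @ eps                 # eq. (4)
```

Both routines require a dense Cholesky factorization, which is the \mathcal{O}(n^{3}) cost that the rest of the paper seeks to avoid.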

2.2 Assumptions

A widely used kernel structure in multi-dimensional settings is “separable” or “product” kernels, which can be expressed as follows:

K(𝐱,𝐱)=σ2j=1dK0(j)(x(j),x(j)),K(\mathbf{x},\mathbf{x}^{\prime})=\sigma^{2}\prod_{j=1}^{d}K_{0}^{(j)}(x^{(j)},x^{\prime(j)}), (5)

where x(j)x^{(j)} and x(j)x^{\prime(j)} are the jj-th components of 𝐱\mathbf{x} and 𝐱\mathbf{x}^{\prime} respectively, K0(j)(,)K_{0}^{(j)}(\cdot,\cdot) denotes the one-dimensional correlation function for each dimension jj with the variance set to one, dd denotes the dimension of the input points. While the product of one-dimensional kernels does not share identical smoothness properties with multi-dimensional kernels, the assumption of separability is widely adopted in the field of spatio-temporal statistics (Gneiting et al., , 2006; Genton, , 2007; Constantinou et al., , 2017). Its widespread adoption is mainly due to its efficacy in enabling the development of valid space-time parametric models and in simplifying computational processes, particularly when handling large datasets for purposes such as inference and parameter estimation.

In this article, we consider the inputs to be dd-dimensional points 𝐗={𝐱i}i=1n\mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n} where each 𝐱id\mathbf{x}_{i}\in\mathbb{R}^{d} and d+d\in\mathbb{N}^{+}, and we focus on product Matérn kernels that employ identical base kernels across every dimension. Matérn kernel (Genton, , 2001) is a popular choice of covariance functions in spatial statistics (Diggle et al., , 2003), geostatistics (Curriero, , 2006; Pardo-Iguzquiza and Chica-Olmo, , 2008), image analysis (Zafari et al., , 2020; Okhrin et al., , 2020), and other applications. The Matérn kernel is defined as the following form (Stein, , 1999):

K(x,x)=σ221νΓ(ν)(2ν|xx|ω)νKν(2ν|xx|ω),K(x,x^{\prime})=\sigma^{2}\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\frac{\lvert x-x^{\prime}\rvert}{\omega}\right)^{\nu}K_{\nu}\left(\sqrt{2\nu}\frac{\lvert x-x^{\prime}\rvert}{\omega}\right), (6)

for any x,xx,x^{\prime}\in\mathbb{R}, where σ2>0\sigma^{2}>0 is the variance, ν>0\nu>0 is the smoothness parameter, ω>0\omega>0 is the lengthscale and KνK_{\nu} is the modified Bessel function of the second kind.
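As an illustration of the assumptions above, the following sketch evaluates the one-dimensional Matérn correlation of eq. 6 (with \sigma^{2}=1) and assembles the separable kernel of eq. 5 from identical base kernels. The function names and the use of SciPy’s modified Bessel function are our own choices, not part of the paper.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_1d(x, xp, nu=1.5, omega=1.0):
    """One-dimensional Matérn correlation (eq. 6 with sigma^2 = 1)."""
    r = np.abs(np.asarray(x)[:, None] - np.asarray(xp)[None, :])
    s = np.sqrt(2.0 * nu) * r / omega
    with np.errstate(invalid="ignore"):          # 0 * inf at r = 0 is patched below
        K = (2.0 ** (1.0 - nu) / gamma(nu)) * s ** nu * kv(nu, s)
    return np.where(s == 0.0, 1.0, K)            # the correlation tends to 1 as r -> 0

def product_matern(X, Xp, nu=1.5, omega=1.0, sigma2=1.0):
    """Separable (product) Matérn kernel of eq. 5 with identical base kernels."""
    X, Xp = np.atleast_2d(X), np.atleast_2d(Xp)
    K = np.ones((X.shape[0], Xp.shape[0]))
    for j in range(X.shape[1]):                  # multiply the 1-D correlations over dimensions
        K *= matern_1d(X[:, j], Xp[:, j], nu, omega)
    return sigma2 * K
```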

2.3 Inducing points approximation

Williams and Seeger (2000) first applied the Nyström approximation to kernel-based methods such as GPs. Following this, Quinonero-Candela and Rasmussen (2005) presented a novel unifying perspective and reformulated the prior by introducing an additional set of latent variables \mathbf{f}_{\mathbf{u}}, known as inducing variables. These latent variables correspond to a specific set of input locations \mathbf{u}=(u_{1},\cdots,u_{q})^{\top}, referred to as inducing inputs. In the subset of regressors (SoR) algorithm (Silverman, 1985), the kernel can be approximated using the following expression:

KSoR(𝐱i,𝐱j)=K(𝐱i,𝐮)K(𝐮,𝐮)1K(𝐮,𝐱j).K_{\text{SoR}}(\mathbf{x}_{i},\mathbf{x}_{j})=K(\mathbf{x}_{i},\mathbf{u})K(\mathbf{u},\mathbf{u})^{-1}K(\mathbf{u},\mathbf{x}_{j}). (7)

Employing the SoR approximation detailed in eq. 7, the form of the SoR predictive distribution qSoR(𝐟𝐲)q_{\rm SoR}(\mathbf{f}_{*}\mid\mathbf{y}) can be derived as follows:

qSoR(𝐟𝐲)=𝒩(𝝁+σϵ2𝐊,𝐮𝚺𝐮1𝐊𝐮,𝐗(𝐲𝝁𝐗),𝐊,𝐮𝚺𝐮1𝐊𝐮,),q_{\rm SoR}(\mathbf{f}_{*}\mid\mathbf{y})=\mathcal{N}(\bm{\mu}_{*}+\sigma_{\epsilon}^{-2}\mathbf{K}_{*,\mathbf{u}}\bm{\Sigma}_{\mathbf{u}}^{-1}\mathbf{K}_{\mathbf{u},\mathbf{X}}(\mathbf{y}-\bm{\mu}_{\mathbf{X}}),\;\mathbf{K}_{*,\mathbf{u}}\bm{\Sigma}_{\mathbf{u}}^{-1}\mathbf{K}_{\mathbf{u},*}), (8)

where we define 𝚺𝐮=[𝐊𝐮,𝐮+σϵ2𝐊𝐮,𝐗𝐊𝐗,𝐮]\bm{\Sigma}_{\mathbf{u}}=\big{[}\mathbf{K}_{\mathbf{u},\mathbf{u}}+\sigma_{\epsilon}^{-2}\mathbf{K}_{\mathbf{u},\mathbf{X}}\mathbf{K}_{\mathbf{X},\mathbf{u}}\big{]}, 𝐊,𝐮=K(𝐗,𝐮)=[K(𝐮,𝐗)]=𝐊𝐮,\mathbf{K}_{*,\mathbf{u}}=K(\mathbf{X}^{*},\mathbf{u})=\big{[}K(\mathbf{u},\mathbf{X}^{*})\big{]}^{\top}=\mathbf{K}_{\mathbf{u},*}^{\top}, 𝐊𝐮,𝐮=[K(𝐮i,𝐮s)]i,s=1q\mathbf{K}_{\mathbf{u},\mathbf{u}}=\big{[}K(\mathbf{u}_{i},\mathbf{u}_{s})\big{]}_{i,s=1}^{q}, 𝝁𝐗=(μ(𝐱1),,μ(𝐱n))\bm{\mu}_{\mathbf{X}}=\big{(}\mu(\mathbf{x}_{1}),\cdots,\mu(\mathbf{x}_{n})\big{)}^{\top}, and 𝝁=(μ(𝐱1),,μ(𝐱m))\bm{\mu}_{*}=\big{(}\mu(\mathbf{x}_{1}^{*}),\cdots,\mu(\mathbf{x}_{m}^{*})\big{)}^{\top}. Consequently, the computation of GP regression using inducing points can be achieved with a time complexity of 𝒪(nq2)\mathcal{O}(nq^{2}) for qq inducing points.
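A minimal sketch of the SoR predictive distribution in eq. 8 is given below; it assumes a generic kernel callable (for instance, the product Matérn sketched earlier) and illustrates why the cost is \mathcal{O}(nq^{2}): the only matrix that must be factorized has size q\times q.

```python
import numpy as np

def sor_posterior(kern, mu, X, y, Xstar, U, noise_var):
    """SoR predictive mean and covariance of eq. (8) with inducing inputs U."""
    Kuu = kern(U, U)                                  # q x q
    Kux = kern(U, X)                                  # q x n
    Ksu = kern(Xstar, U)                              # m x q
    Sigma_u = Kuu + Kux @ Kux.T / noise_var           # Sigma_u = K_uu + s^-2 K_uX K_Xu
    rhs = Kux @ (y - mu(X)) / noise_var               # s^-2 K_uX (y - mu_X)
    mean = mu(Xstar) + Ksu @ np.linalg.solve(Sigma_u, rhs)
    cov = Ksu @ np.linalg.solve(Sigma_u, Ksu.T)       # K_{*,u} Sigma_u^{-1} K_{u,*}
    return mean, cov
```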

To quantify the approximation power in the inducing points approach, we shall use the Wasserstein metric. Consider a separable complete metric space denoted as (M,d)(M,d). The pp-Wasserstein distance between two probability measures μ1\mu_{1} and μ2\mu_{2}, on the space MM, is defined as follows:

W_{p}(\mu_{1},\mu_{2})=\left(\inf_{\gamma\in\Gamma(\mu_{1},\mu_{2})}\mathbb{E}_{(x,y)\sim\gamma}d^{p}(x,y)\right)^{1/p},

for p\in[1,+\infty), where \Gamma(\mu_{1},\mu_{2}) is the set of joint measures on M\times M with marginals \mu_{1} and \mu_{2}. Theorem A.1 in Appendix A shows that the inducing points approximation of a GP converges rapidly to the target GP, with a convergence rate of \mathcal{O}(n^{-\nu}) for Matérn kernels of smoothness \nu, provided that the inducing points are uniformly distributed. It is worth noting that the inequality presented in eq. 30 is also applicable to multivariate GPs.

Therefore, the convergence rate for multivariate inducing points approximation can be established, assuming the corresponding convergence rate of GP regression is known. A general framework to investigate the latter problem is given by Wang et al., (2020). Inspired by Theorem A.1, we assume that the vector 𝐲=(y1,,yn)\mathbf{y}=(y_{1},\cdots,y_{n})^{\top} is observed on nn distinct dd-dimensional points 𝐗={𝐱i}i=1n\mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n}. The inducing inputs 𝐔\mathbf{U}, as defined in Section 2.3, are given by a full grid, denoted as 𝐔=×j=1d𝐮(j)\mathbf{U}=\times_{j=1}^{d}\mathbf{u}^{(j)}, where each 𝐮(j)\mathbf{u}^{(j)} is a set of qjq_{j} one-dimensional data points, A×BA\times B denotes the Cartesian product of sets AA and BB, accordingly the grid 𝐔\mathbf{U} has q=j=1dqjq=\prod_{j=1}^{d}q_{j} points. Given the separable structure as defined in eq. 5 and utilizing the full grid 𝐔\mathbf{U}, we can transform the multi-dimensional problem to a one-dimensional problem since the covariance matrix 𝐊𝐔,𝐔:=K(𝐔,𝐔)\mathbf{K}_{\mathbf{U},\mathbf{U}}\colon=K(\mathbf{U},\mathbf{U}) can be represented by Kronecker products (Henderson et al., , 1983; Saatçi, , 2012; Wilson and Nickisch, , 2015) of matrices over each input dimension jj:

𝐊𝐔,𝐔=K(𝐔,𝐔)=j=1dKj(𝐮(j),𝐮(j)).\mathbf{K}_{\mathbf{U},\mathbf{U}}=K(\mathbf{U},\mathbf{U})=\bigotimes_{j=1}^{d}K_{j}(\mathbf{u}^{(j)},\mathbf{u}^{(j)}). (9)

However, the number of inducing points q grows exponentially as d gets large; we therefore consider using sparse grids (Garcke, 2013; Rieger and Wendland, 2017) to mitigate the curse of dimensionality.
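The Kronecker structure in eq. 9 allows matrix-vector products with \mathbf{K}_{\mathbf{U},\mathbf{U}} to be computed one dimension at a time, without ever forming the full q\times q matrix. The sketch below (our own illustration, relying on NumPy's row-major reshaping convention) implements this standard vec-trick.

```python
import numpy as np

def kron_matvec(factors, x):
    """Compute (A_1 kron A_2 kron ... kron A_d) x without forming the Kronecker product.

    Each square factor A_j acts on one tensor mode of x; the cost is
    O(sum_j q_j * prod_j q_j) instead of O((prod_j q_j)^2)."""
    for A in factors:
        q = A.shape[0]
        X = x.reshape(q, -1)          # expose the current leading mode
        x = (A @ X).T.reshape(-1)     # apply A_j, then rotate the modes
    return x

# small check against the dense Kronecker product
A, B = np.random.randn(3, 3), np.random.randn(4, 4)
v = np.random.randn(12)
assert np.allclose(kron_matvec([A, B], v), np.kron(A, B) @ v)
```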

2.4 Sparse grids

A sparse grid is built on a nested sequence of one-dimensional points =𝒰j,0𝒰j,t𝒰j,t+1\varnothing=\mathcal{U}_{j,0}\subseteq\ldots\subseteq\mathcal{U}_{j,t}\subseteq\mathcal{U}_{j,t+1}\subseteq\ldots for each dimension jj, j=1,,dj=1,\ldots,d. A sparse grid with level η\eta and dimension dd is a design as follows:

\mathcal{U}(\eta,d)=\bigcup_{\vv{t}\in\mathbb{G}(\eta)}\mathcal{U}_{1,t_{1}}\times\mathcal{U}_{2,t_{2}}\times\cdots\times\mathcal{U}_{d,t_{d}}, (10)

where \vvt=(t1,,td)\vv{t}=(t_{1},\ldots,t_{d}), 𝔾(η)={\vvtd|j=1dtj=η}\mathbb{G}(\eta)=\{\vv{t}\in\mathbb{N}^{d}|\sum_{j=1}^{d}t_{j}=\eta\}. Figure 1 and Figure 9 show sparse grids corresponding to the hyperbolic cross points (bisection) for two dimensional points at levels η=3,4,5,6\eta=3,4,5,6 and for three dimensional points at η=4,5,6,7\eta=4,5,6,7 respectively. Hyperbolic cross points sets, defined over the interval [0,1][0,1], are detailed in eq. 11:

𝒰j,tj={i2tj:i=1,,2tj1}.\mathcal{U}_{j,t_{j}}=\left\{\frac{i}{2^{t_{j}}}:i=1,\ldots,2^{t_{j}}-1\right\}. (11)
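The construction in eq. 10 and eq. 11 can be reproduced directly; the sketch below (with illustrative function names) builds the nested one-dimensional hyperbolic cross sets and takes the union of their Cartesian products. Because the one-dimensional sets are nested and \mathcal{U}_{j,0} is empty, it suffices to take the union over multi-indices with |\vv{t}|=\eta and t_{j}\geq 1.

```python
import itertools
import numpy as np

def u_1d(level):
    """Nested 1-D hyperbolic cross points of eq. (11): {i / 2^level, i = 1..2^level - 1}."""
    if level == 0:
        return np.array([])
    return np.arange(1, 2 ** level) / 2.0 ** level

def sparse_grid(eta, d):
    """Sparse grid U(eta, d) of eq. (10) as a union of products of the nested 1-D sets."""
    points = set()
    # indices with a zero entry contribute nothing, so restrict to t_j >= 1, |t| = eta
    for t in itertools.product(range(1, eta + 1), repeat=d):
        if sum(t) != eta:
            continue
        for p in itertools.product(*[u_1d(tj) for tj in t]):
            points.add(p)        # coordinates i / 2^t are exact binary floats
    return np.array(sorted(points))

print(sparse_grid(3, 2))   # the level-3, 2-D hyperbolic cross design (cf. Figure 1)
```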

An operator on the sparse grid can be efficiently computed using Smolyak’s algorithm (Smolyak, , 1963; Ullrich, , 2008), which states that the predictor operator 𝒫(η,d)\mathcal{P}(\eta,d) on function ff with respect to the sparse grid 𝒰(η,d)\mathcal{U}(\eta,d) can be evaluated by the following formula:

𝒫(η,d)(f)=\vvt(η,d)(1)η|\vvt|(d1η|\vvt|)(j=1d𝒰j,tj)(f),\mathcal{P}(\eta,d)(f)=\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\Big{(}\bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}\Big{)}(f), (12)

where |\vv{t}|=\sum_{j=1}^{d}t_{j}, \mathbb{P}(\eta,d)=\{\vv{t}\in\mathbb{N}^{d}|\max(d,\eta-d+1)\leq|\vv{t}|\leq\eta\}, and \Big{(}\bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}\Big{)} represents the operator over the Kronecker product points \bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}. Plumlee (2014) and Yadav et al. (2022) also provided linear-time computation methods for GP regression based on Smolyak’s algorithm. Theorem A.3 in Appendix A demonstrates that for a d-dimensional GP, the inducing points approximation converges to the target GP at the rate of \mathcal{O}(n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}) for product Matérn kernels of smoothness \nu, given that the inducing points form a sparse grid \mathcal{U}(\eta,d) of level \eta and dimension d, as defined in eq. 10.

Refer to caption
Figure 1: Sparse grids 𝒰(η,d)\mathcal{U}(\eta,d) over [0,1]d[0,1]^{d} of level η=3,4,5,6\eta=3,4,5,6 and dimension d=2d=2.

3 Sampling with sparse grids

In this section, we detail the proposed algorithms for sampling from multi-dimensional GPs with product kernels using sparse grids.

3.1 Prior

Sampling algorithm

Our goal is to sample from the GP prior f(\cdot)\sim\mathcal{GP}(\mu(\cdot),K(\cdot,\cdot)) over d-dimensional grid points \mathbf{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n_{s}}, \mathbf{z}_{i}\in\mathbb{R}^{d}. According to the SoR model in eq. 7, the distribution of \mathbf{f}_{\mathbf{Z}}=f(\mathbf{Z})=(f(\mathbf{z}_{1}),\cdots,f(\mathbf{z}_{n_{s}}))^{\top} can be approximated by \mathbf{K}_{\mathbf{Z},\mathcal{U}}\mathbf{K}_{\mathcal{U},\mathcal{U}}^{-1}\mathbf{f}_{\mathcal{U}} with \mathbf{f}_{\mathcal{U}}\sim\mathcal{N}(\mathbf{0},\mathbf{K}_{\mathcal{U},\mathcal{U}}), where \mathbf{f}_{\mathcal{U}}=f(\mathcal{U}), and \mathcal{U}=\mathcal{U}(\eta,d) is a sparse grid design with level \eta and dimension d defined in eq. 10. The problem then reduces to generating samples from \mathbf{K}_{\mathcal{U},\mathcal{U}}^{-1}\mathbf{f}_{\mathcal{U}}\sim\mathcal{N}(\mathbf{0},\mathbf{K}_{\mathcal{U},\mathcal{U}}^{-1}). Applying Smolyak’s algorithm to GP sampling (see Theorem A.4 in Appendix A), we infer that

𝐊𝒰,𝒰1𝐟𝒰=d\vvt(η,d)(1)η|\vvt|(d1η|\vvt|)j=1dchol(K(𝒰j,tj,𝒰j,tj)1)𝐟𝐈n\vvtδx(𝒰\vvt),\mathbf{K}_{\mathcal{U},\mathcal{U}}^{-1}\mathbf{f}_{\mathcal{U}}\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny d}}}{=}}\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\bigotimes_{j=1}^{d}\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)}\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\delta_{x}(\mathcal{U}_{\vv{t}}), (13)
δx(𝒰\vvt)={0if x𝒰𝒰\vvt1if x𝒰\vvt,\delta_{x}(\mathcal{U}_{\vv{t}})=\begin{cases}0\quad&\text{if $x\in\mathcal{U}\setminus\mathcal{U}_{\vv{t}}$}\\ 1\quad&\text{if $x\in\mathcal{U}_{\vv{t}}$}\end{cases}, (14)

where chol(K(𝒰j,tj,𝒰j,tj)1)\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)} is the Cholesky decomposition of K(𝒰j,tj,𝒰j,tj)1K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}, 𝒰=𝒰(η,d)\mathcal{U}=\mathcal{U}(\eta,d) is a sparse grid design with level η\eta and dimension dd defined in eq. 10, 𝐟𝐈n\vvt𝒩(𝟎,𝐈n\vvt)\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n_{\vv{t}}}) is standard multivariate normal distributed, 𝐈n\vvt\mathbf{I}_{n_{\vv{t}}} is a n\vvt×n\vvtn_{\vv{t}}\times n_{\vv{t}} identity matrix, n\vvtn_{\vv{t}} == j=1dtj\prod_{j=1}^{d}t_{j} is the size of the full grid design 𝒰\vvt=𝒰1,t1×𝒰2,t2××𝒰d,td\mathcal{U}_{\vv{t}}=\mathcal{U}_{1,t_{1}}\times\mathcal{U}_{2,t_{2}}\times\cdots\times\mathcal{U}_{d,t_{d}} associated with \vvt=(t1,,td)\vv{t}=(t_{1},\ldots,t_{d}), δx(𝒰\vvt)\delta_{x}(\mathcal{U}_{\vv{t}}) is a Dirac measure on the set 𝒰\mathcal{U}. Therefore, the GP prior 𝐟𝐙\mathbf{f}_{\mathbf{Z}} over grid points 𝐙\mathbf{Z} can be approximately sampled via

\mathbf{f}_{\mathbf{Z}}\leftarrow\mathbf{K}_{\mathbf{Z},\mathcal{U}}\left(\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\bigotimes_{j=1}^{d}\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)}\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\delta_{x}(\mathcal{U}_{\vv{t}})\right)+\bm{\mu}_{\mathbf{Z}}, (15)

where 𝐟𝐈n\vvt𝒩(𝟎,𝐈n\vvt)\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n_{\vv{t}}}) is standard multivariate normal distributed, δx(𝒰\vvt)\delta_{x}(\mathcal{U}_{\vv{t}}) is a Dirac measure on set 𝒰\mathcal{U} as defined in eq. 14, chol(K(𝒰j,tj,𝒰j,tj)1)\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)} is the Cholesky decomposition of K(𝒰j,tj,𝒰j,tj)1K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}.
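For concreteness, the sketch below assembles one draw of eq. 15 for a zero-mean product Matérn prior. It is a sketch under our own assumptions, reusing the helper functions matern_1d, product_matern, u_1d, sparse_grid, and kron_matvec from the earlier snippets, and forming the small one-dimensional factors \text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)} explicitly.

```python
import itertools
import numpy as np
from math import comb

def smolyak_indices(eta, d):
    """Multi-indices in P(eta, d) with their Smolyak combination coefficients (eq. 12)."""
    # entries t_j >= 1, since levels t_j = 0 correspond to empty 1-D sets
    for t in itertools.product(range(1, eta + 1), repeat=d):
        if max(d, eta - d + 1) <= sum(t) <= eta:
            coef = (-1) ** (eta - sum(t)) * comb(d - 1, eta - sum(t))
            yield t, coef

def sample_prior_insg(Z, eta, d, nu=1.5, omega=1.0, seed=0):
    """One approximate zero-mean GP prior draw at points Z (shape (n_s, d)) via eq. (15)."""
    rng = np.random.default_rng(seed)
    U = sparse_grid(eta, d)                              # inducing points, size n_sg
    index = {tuple(p): i for i, p in enumerate(U)}       # position of each point within U
    w = np.zeros(len(U))                                 # accumulates K_{U,U}^{-1} f_U
    for t, coef in smolyak_indices(eta, d):
        grids = [u_1d(tj) for tj in t]
        # per-dimension factors chol(K(U_{j,t_j}, U_{j,t_j})^{-1})
        Ls = [np.linalg.cholesky(np.linalg.inv(matern_1d(g, g, nu, omega)))
              for g in grids]
        n_t = int(np.prod([len(g) for g in grids]))
        draw = kron_matvec(Ls, rng.standard_normal(n_t))
        # scatter onto the sparse grid: the Dirac measure delta_x(U_t) of eq. (14)
        for val, p in zip(draw, itertools.product(*grids)):
            w[index[p]] += coef * val
    return product_matern(Z, U, nu=nu, omega=omega) @ w  # K_{Z,U} (K_{U,U}^{-1} f_U)
```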

Complexity

Since the Kronecker matrix-vector product can be written in the form of the vectorization operator (see Proposition 3.7 in Kolda, 2006), the Kronecker matrix-vector product in eq. 15 only costs \mathcal{O}(\sum_{j=1}^{d}t_{j}\prod_{j=1}^{d}t_{j})=\mathcal{O}(\eta n_{\vv{t}}) time; hence the time complexity of the prior sampling via eq. 15 is \mathcal{O}\Big{(}(\eta+n_{s})\sum_{\vv{t}\in\mathbb{P}(\eta,d)}n_{\vv{t}}\Big{)}=\mathcal{O}\Big{(}(\eta+n_{s})n_{sg}\Big{)}, where d is the dimension of the sampling points \mathbf{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n_{s}}, n_{s} is the size of the sampling points, n_{sg} is the size of the inducing points on a sparse grid \mathcal{U}(\eta,d), n_{\vv{t}}=\prod_{j=1}^{d}t_{j} is the size of the full grid design \mathcal{U}_{\vv{t}}=\times_{j=1}^{d}\mathcal{U}_{j,t_{j}}, \vv{t}=(t_{1},\ldots,t_{d}), and \mathbb{P}(\eta,d)=\{\vv{t}\in\mathbb{N}^{d}|\max(d,\eta-d+1)\leq\sum_{j=1}^{d}t_{j}\leq\eta\}.

Note that we can obtain the upper bound of the number of points in the Smolyak type sparse grids. According to Lemma 3 in (Rieger and Wendland, , 2017), given that |𝒰j,tj|2tj1|\mathcal{U}_{j,t_{j}}|\leq 2^{t_{j}}-1, when sparse grids are constructed on hyperbolic cross points as defined in eq. 11, the cardinality of a sparse grid 𝒰(η,d)\mathcal{U}(\eta,d) defined in eq. 10, is thereby bounded by

n_{sg}=|\mathcal{U}(\eta,d)|\leq 1^{d}\big{(}\frac{2-1}{2}\big{)}^{d}2^{\eta}\binom{\eta-1}{d-1}\min\{\frac{2}{2-1},\frac{\eta}{d}\}
\leq 2^{\eta-d}\frac{\eta^{d-1}}{(d-1)!}\min\{2,\frac{\eta}{d}\}
\leq 2^{\eta-d}\eta^{d}/(d!). (16)

3.2 Posterior

Matheron’s rule

Sampling from posterior GPs can be effectively conducted using Matheron’s rule. Matheron’s rule was popularized in the geostatistics field by Journel and Huijbregts (1976), and recently rediscovered by Wilson et al. (2020), who leveraged it to develop a novel approach for GP posterior sampling. Matheron’s rule can be described as follows: Consider \mathbf{a} and \mathbf{b} as jointly Gaussian random vectors. Then the random vector \mathbf{a}, conditional on \mathbf{b}=\bm{\beta}, is equal in distribution to

(𝐚𝐛=𝜷)=d𝐚+Cov(𝐚,𝐛)Cov(𝐛,𝐛)1(𝜷𝐛),(\mathbf{a}\mid\mathbf{b}=\bm{\beta})\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny d}}}{=}}\mathbf{a}+\mathrm{Cov}(\mathbf{a},\mathbf{b})\mathrm{Cov}(\mathbf{b},\mathbf{b})^{-1}(\bm{\beta}-\mathbf{b}), (17)

where Cov(𝐚,𝐛)\mathrm{Cov}(\mathbf{a},\mathbf{b}) denotes the covariance of (𝐚,𝐛)(\mathbf{a},\mathbf{b}).

According to eq. 17, exact GP posterior samplers can be obtained using two jointly Gaussian random variables. Following this technique, we derive

𝐟𝐲=d𝐟+𝐊,𝐗[𝐊𝐗,𝐗+σϵ2𝐈n]1(𝐲𝐟𝐗ϵ),\mathbf{f}_{*}\mid\mathbf{y}\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny d}}}{=}}\mathbf{f}_{*}+\mathbf{K}_{*,\mathbf{X}}\big{[}\mathbf{K}_{\mathbf{X},\mathbf{X}}+\sigma_{\epsilon}^{2}\mathbf{I}_{n}\big{]}^{-1}(\mathbf{y}-\mathbf{f}_{\mathbf{X}}-\epsilon), (18)

where 𝐟𝐗=f(𝐗)\mathbf{f}_{\mathbf{X}}=f(\mathbf{X}) and 𝐟=f(𝐗)\mathbf{f}_{*}=f(\mathbf{X}^{*}) are jointly Gaussian random variables from the prior distribution, noise variates ϵ𝒩(𝟎,σϵ2𝐈n)\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma_{\epsilon}^{2}\mathbf{I}_{n}). Clearly, the joint distribution of (𝐟𝐗,𝐟)(\mathbf{f}_{\mathbf{X}},\mathbf{f}_{*}) follows the multivariate normal distribution as follows:

(𝐟𝐗,𝐟)𝒩([𝝁𝐗𝝁],[𝐊𝐗,𝐗𝐊𝐗,𝐊,𝐗𝐊,]).(\mathbf{f}_{\mathbf{X}},\mathbf{f}_{*})\sim\mathcal{N}\Bigg{(}\begin{bmatrix}\bm{\mu}_{\mathbf{X}}\\ \bm{\mu}_{*}\\ \end{bmatrix},\begin{bmatrix}\mathbf{K}_{\mathbf{X},\mathbf{X}}&\mathbf{K}_{\mathbf{X},*}\\ \mathbf{K}_{*,\mathbf{X}}&\mathbf{K}_{*,*}\end{bmatrix}\Bigg{)}. (19)
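The following sketch illustrates Matheron’s update in its exact (non-scalable) form: it draws the joint prior of eq. 19 by a dense Cholesky factorization and then applies eq. 18. The callables kern and mu are assumed helper names as before.

```python
import numpy as np

def matheron_posterior_sample(kern, mu, X, y, Xstar, noise_var, jitter=1e-8):
    """One exact posterior draw at Xstar via Matheron's update, eqs. (18)-(19)."""
    n, m = len(X), len(Xstar)
    joint = np.vstack([X, Xstar])
    K = kern(joint, joint) + jitter * np.eye(n + m)      # covariance of (f_X, f_*)
    f = mu(joint) + np.linalg.cholesky(K) @ np.random.randn(n + m)
    f_X, f_star = f[:n], f[n:]                           # joint prior draw
    eps = np.sqrt(noise_var) * np.random.randn(n)        # noise variate
    Kxx = kern(X, X) + noise_var * np.eye(n)
    update = kern(Xstar, X) @ np.linalg.solve(Kxx, y - f_X - eps)
    return f_star + update                               # eq. (18)
```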

Sampling algorithm

We can apply SoR approximation eq. 7 to Matheron’s rule eq. 17 and obtain

𝐟𝐲d𝐟+σϵ2𝐊,𝒰𝚺𝒰1𝐊𝒰,𝐗(𝐲𝐟𝐗ϵ),\mathbf{f}_{*}\mid\mathbf{y}\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny d}}}{\approx}}\mathbf{f}_{*}+\sigma_{\epsilon}^{-2}\mathbf{K}_{*,\mathcal{U}}\bm{\Sigma}_{\mathcal{U}}^{-1}\mathbf{K}_{\mathcal{U},\mathbf{X}}(\mathbf{y}-\mathbf{f}_{\mathbf{X}}-\bm{\epsilon}), (20)

where 𝚺𝒰=[𝐊𝒰,𝒰+σϵ2𝐊𝒰,𝐗𝐊𝐗,𝒰]\bm{\Sigma}_{\mathcal{U}}=\big{[}\mathbf{K}_{\mathcal{U},\mathcal{U}}+\sigma_{\epsilon}^{-2}\mathbf{K}_{\mathcal{U},\mathbf{X}}\mathbf{K}_{\mathbf{X},\mathcal{U}}\big{]}, 𝐟𝐗=f(𝐗)\mathbf{f}_{\mathbf{X}}=f(\mathbf{X}) and 𝐟=f(𝐗)\mathbf{f}_{*}=f(\mathbf{X}^{*}) are jointly Gaussian random variables from the prior distribution, noise variates ϵ𝒩(𝟎,σϵ2𝐈n)\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma_{\epsilon}^{2}\mathbf{I}_{n}). The prior resulting from the SoR approximation can be easily derived from eq. 7, yielding the following expression:

qSoR(𝐟𝐗,𝐟)=𝒩(𝝁𝐗+,𝐊𝐗+,𝒰𝐊𝒰,𝒰1𝐊𝒰,𝐗+),q_{\rm SoR}(\mathbf{f}_{\mathbf{X}},\mathbf{f}_{*})=\mathcal{N}(\bm{\mu}_{\mathbf{X}+*},\;\mathbf{K}_{\mathbf{X}+*,\mathcal{U}}\mathbf{K}_{\mathcal{U},\mathcal{U}}^{-1}\mathbf{K}_{\mathcal{U},\mathbf{X}+*}), (21)

where 𝝁𝐗+=[𝝁𝐗,𝝁]\bm{\mu}_{\mathbf{X}+*}=[\bm{\mu}_{\mathbf{X}},\bm{\mu}_{*}]^{\top}, 𝐊𝐗+,𝒰=[K(𝒰,𝐗),K(𝒰,𝐗)]\mathbf{K}_{\mathbf{X}+*,\mathcal{U}}=[K(\mathcal{U},\mathbf{X}),K(\mathcal{U},\mathbf{X}^{*})]^{\top}. The same as above in section 3.1, (𝐟𝐗,𝐟)(\mathbf{f}_{\mathbf{X}},\mathbf{f}_{*}) can be sampled from

(\mathbf{f}_{\mathbf{X}},\mathbf{f}_{*})\leftarrow\mathbf{K}_{\mathbf{X}+*,\mathcal{U}}\left(\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\bigotimes_{j=1}^{d}\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)}\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\delta_{x}(\mathcal{U}_{\vv{t}})\right)+\bm{\mu}_{\mathbf{X}+*}, (22)

where 𝐟𝐈n\vvt\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}, δx(𝒰\vvt)\delta_{x}(\mathcal{U}_{\vv{t}}), and chol(K(𝒰j,tj,𝒰j,tj)1)\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)} are the same as that defined in section 3.1.

Conjugate gradient

Algorithm 1 The Preconditioned Conjugate Gradient (PCG) Algorithm.
1:  Input: matrix 𝚺\bm{\Sigma}, preconditioning matrix 𝐏\mathbf{P}, vector 𝐯\mathbf{v}, convergence threshold ϵ\epsilon, initial vector 𝐱0\mathbf{x}_{0}, maximum number of iterations TT
2:  Output: 𝐱𝚺1𝐯\mathbf{x}_{*}\approx\bm{\Sigma}^{-1}\mathbf{v}
3:  𝐫0:=𝐯Σ𝐱0\mathbf{r}_{0}:=\mathbf{v}-\Sigma\mathbf{x}_{0}; 𝐳0:=𝐏1𝐫0\mathbf{z}_{0}:=\mathbf{P}^{-1}\mathbf{r}_{0}
4:  𝐩0:=𝐳0\mathbf{p}_{0}:=\mathbf{z}_{0}
5:  for i=0i=0 to TT do
6:     αi:=𝐫iT𝐳i𝐩iT𝚺𝐩i\alpha_{i}:=\frac{\mathbf{r}_{i}^{T}\mathbf{z}_{i}}{\mathbf{p}_{i}^{T}\bm{\Sigma}\mathbf{p}_{i}}
7:     𝐱i+1:=𝐱i+αi𝐩i\mathbf{x}_{i+1}:=\mathbf{x}_{i}+\alpha_{i}\mathbf{p}_{i}
8:     𝐫i+1:=𝐫iαi𝚺𝐩i\mathbf{r}_{i+1}:=\mathbf{r}_{i}-\alpha_{i}\bm{\Sigma}\mathbf{p}_{i}
9:     if 𝐫i+1<ϵ\|\mathbf{r}_{i+1}\|<\epsilon then
10:        return 𝐱:=𝐱i+1\mathbf{x}_{*}:=\mathbf{x}_{i+1}
11:     end if
12:     𝐳i+1:=𝐏1𝐫i+1\mathbf{z}_{i+1}:=\mathbf{P}^{-1}\mathbf{r}_{i+1}
13:     βi:=𝐫i+1T𝐳i+1𝐫iT𝐳i\beta_{i}:=\frac{\mathbf{r}_{i+1}^{T}\mathbf{z}_{i+1}}{\mathbf{r}_{i}^{T}\mathbf{z}_{i}}
14:     𝐩i+1:=𝐳i+1+βi𝐩i\mathbf{p}_{i+1}:=\mathbf{z}_{i+1}+\beta_{i}\mathbf{p}_{i}
15:  end for

The only remaining computationally demanding step in eq. 20 is the computation of 𝚺𝒰1𝐯\bm{\Sigma}_{\mathcal{U}}^{-1}\mathbf{v} with 𝐯=𝐊𝒰,𝐗(𝐲𝐟𝐗ϵ)\mathbf{v}=\mathbf{K}_{\mathcal{U},\mathbf{X}}(\mathbf{y}-\mathbf{f}_{\mathbf{X}}-\bm{\epsilon}), requiring a time complexity of 𝒪(nsg3)\mathcal{O}(n_{sg}^{3}) where nsg=|𝒰(η,d)|n_{sg}=|\mathcal{U}(\eta,d)| is the size of the sparse grid 𝒰(η,d)\mathcal{U}(\eta,d). To reduce this computational intensity, we consider using the conjugate gradient (CG) method (Hestenes et al., , 1952; Golub and Van Loan, , 2013), an iterative algorithm for solving linear systems efficiently via matrix-vector multiplications. Preconditioning (Trefethen and Bau III, , 1997; Demmel, , 1997; Saad, , 2003; Van der Vorst, , 2003) is a well-known tool for accelerating the convergence of the CG method, which introduces a symmetric positive-definite matrix 𝐏𝒰𝚺𝒰\mathbf{P}_{\mathcal{U}}\approx\bm{\Sigma}_{\mathcal{U}} such that 𝐏𝒰1𝚺𝒰\mathbf{P}_{\mathcal{U}}^{-1}\bm{\Sigma}_{\mathcal{U}} has a smaller condition number than 𝚺𝒰\bm{\Sigma}_{\mathcal{U}}. The entire scheme of the preconditioned conjugate gradient (PCG) takes the form in Algorithm 1.
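A direct Python transcription of Algorithm 1 is given below; \bm{\Sigma} and \mathbf{P}^{-1} are passed as callables (matvec and apply_Pinv, illustrative names) so that neither matrix needs to be formed or factorized densely.

```python
import numpy as np

def pcg(matvec, apply_Pinv, v, x0=None, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient (Algorithm 1) for Sigma x = v."""
    x = np.zeros_like(v) if x0 is None else x0.copy()
    r = v - matvec(x)                 # r_0 = v - Sigma x_0
    z = apply_Pinv(r)                 # z_0 = P^{-1} r_0
    p = z.copy()
    for _ in range(max_iter):
        Sp = matvec(p)
        alpha = (r @ z) / (p @ Sp)
        x = x + alpha * p
        r_new = r - alpha * Sp
        if np.linalg.norm(r_new) < tol:
            return x
        z_new = apply_Pinv(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x
```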

Preconditioner choice

The choice of the preconditioner \mathbf{P}_{\mathcal{U}} involves a trade-off between the time complexity of applying the \mathbf{P}_{\mathcal{U}}^{-1} operation and the condition number of \mathbf{P}_{\mathcal{U}}^{-1}\bm{\Sigma}_{\mathcal{U}}. Cutajar et al. (2016) first investigated several groups of preconditioners for the radial basis function kernel; Gardner et al. (2018) then proposed a specialized preconditioner for general kernels based on the pivoted Cholesky decomposition. Wenger et al. (2022a) analyzed and summarized convergence rates for arbitrary multi-dimensional kernels and multiple preconditioners. Due to the hierarchical structure of the sparse grid \mathcal{U}, instead of using common classes of preconditioners, we consider the two-level additive Schwarz (TAS) preconditioner (Toselli and Widlund, 2004) for the system matrix \bm{\Sigma}_{\mathcal{U}}, which is formed as an additive Schwarz (AS) preconditioner (Dryja and Widlund, 1994; Cai and Sarkis, 1999) coupled with an additive coarse space correction. Since the sparse grid \mathcal{U}(\eta,d) defined in eq. 10 is covered by a set of overlapping subdomains \{\mathcal{U}_{\vv{t}}\}_{\vv{t}\in\mathbb{G}(\eta)} where \mathbb{G}(\eta)=\{\vv{t}\in\mathbb{N}^{d}|\sum_{j=1}^{d}t_{j}=\eta\}, the TAS preconditioner for the system matrix \bm{\Sigma}_{\mathcal{U}} is defined as

𝐏TAS1:=\vvt𝔾(η)𝐒\vvt𝐏\vvt1𝐒\vvt+𝐒c𝐏c1𝐒c.\mathbf{P}_{\text{TAS}}^{-1}:=\sum_{\vv{t}\in\mathbb{G}(\eta)}\mathbf{S}_{\vv{t}}^{\top}\mathbf{P}_{\vv{t}}^{-1}\mathbf{S}_{\vv{t}}+\mathbf{S}_{c}^{\top}\mathbf{P}_{c}^{-1}\mathbf{S}_{c}. (23)

On the right-hand side of eq. 23, the first term is the one-level AS preconditioner, and the second term is the additive coarse space correction. \mathbf{S}_{\vv{t}}\in\mathbb{R}^{n_{\vv{t}}\times n_{sg}} is the selection matrix that restricts \mathcal{U} to \mathcal{U}_{\vv{t}}, and \mathbf{P}_{\vv{t}}=\mathbf{S}_{\vv{t}}\bm{\Sigma}_{\mathcal{U}}\mathbf{S}_{\vv{t}}^{\top}\in\mathbb{R}^{n_{\vv{t}}\times n_{\vv{t}}} is a symmetric positive definite sub-matrix defined on \mathcal{U}_{\vv{t}}. \mathbf{S}_{c}\in\mathbb{R}^{n_{c}\times n_{sg}} is the selection matrix whose columns span the coarse space, and \mathbf{P}_{c}=\mathbf{S}_{c}\bm{\Sigma}_{\mathcal{U}}\mathbf{S}_{c}^{\top}\in\mathbb{R}^{n_{c}\times n_{c}} is the coarse space matrix. n_{c} is the dimension of the coarse space, n_{sg}:=|\mathcal{U}(\eta,d)| is the size of the sparse grid \mathcal{U}(\eta,d), and n_{\vv{t}}:=|\mathcal{U}_{\vv{t}}|=\prod_{j=1}^{d}t_{j} is the size of the full grid design \mathcal{U}_{\vv{t}}=\times_{j=1}^{d}\mathcal{U}_{j,t_{j}}, \vv{t}=(t_{1},\ldots,t_{d}). Figure 2 illustrates an example of the selection matrices \{\mathbf{S}_{\vv{t}}\}_{\vv{t}\in\mathbb{G}(\eta)} for the sparse grid \mathcal{U}(\eta=3,d=2).
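The action of \mathbf{P}_{\text{TAS}}^{-1} in eq. 23 can be applied without forming the selection matrices explicitly, by indexing the rows and columns of \bm{\Sigma}_{\mathcal{U}} that belong to each subdomain. The sketch below is our own illustration, under the assumption that the coarse space is itself spanned by a subset of sparse grid points (as proposed later in this section); the returned callable can be passed directly as apply_Pinv to the PCG sketch above.

```python
import numpy as np

def make_tas_apply(Sigma, subdomain_idx, coarse_idx):
    """Return a callable applying the TAS preconditioner inverse of eq. (23).

    `subdomain_idx` lists, for each full grid U_t with |t| = eta, the positions of its
    points inside the sparse grid U; `coarse_idx` plays the role of the coarse space."""
    # pre-factorize the local matrices P_t = S_t Sigma S_t^T and the coarse matrix P_c
    local_factors = [(idx, np.linalg.cholesky(Sigma[np.ix_(idx, idx)]))
                     for idx in subdomain_idx]
    coarse_factor = np.linalg.cholesky(Sigma[np.ix_(coarse_idx, coarse_idx)])

    def solve_chol(L, b):
        return np.linalg.solve(L.T, np.linalg.solve(L, b))

    def apply_Pinv(r):
        out = np.zeros_like(r)
        for idx, L in local_factors:                              # one-level AS term
            out[idx] += solve_chol(L, r[idx])
        out[coarse_idx] += solve_chol(coarse_factor, r[coarse_idx])  # coarse correction
        return out

    return apply_Pinv
```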

Refer to caption
Figure 2: Selection matrices with respect to the sparse grid 𝒰(η=3,d=2)\mathcal{U}(\eta=3,d=2).

Some recent works (Toselli and Widlund, 2004; Spillane et al., 2014; Dolean et al., 2015; Al Daas et al., 2023) provide robust convergence estimates for the TAS preconditioner, which bound the condition number of \mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}} in terms of the number of distinct colors required to color the graph of \bm{\Sigma}_{\mathcal{U}} so that the spaces \mathrm{span}\{\mathbf{S}_{\vv{t}}\}_{\vv{t}\in\mathbb{G}(\eta)} of the same color are mutually \bm{\Sigma}_{\mathcal{U}}-orthogonal, the maximum number of subdomains that share an unknown, and a user-defined tolerance \tau. Specifically, given a threshold \tau>0, a GenEO coarse space (see Section 3.2 in Al Daas et al., 2023) can be constructed, and the following inequality holds for the preconditioned matrix \mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}}:

κ(𝐏TAS1𝚺𝒰)(kc+1)(2+(2kc+1)kmτ),\kappa(\mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}})\leq(k_{c}+1)\left(2+(2k_{c}+1)\frac{k_{m}}{\tau}\right), (24)

where kck_{c} is the minimum number of distinct colors so that each two neighboring subdomains among {𝒰\vvt}\vvt𝔾(η)\{\mathcal{U}_{\vv{t}}\}_{\vv{t}\in\mathbb{G}(\eta)} have different colors, kmk_{m} is the coloring constant which is the maximum number of overlapping subdomains sharing a row of 𝚺𝒰\bm{\Sigma}_{\mathcal{U}}. Since each point in 𝒰(η,d)\mathcal{U}(\eta,d) belongs to at most |𝔾(η)||\mathbb{G}(\eta)| of the subdomains, kc=|𝔾(η)|k_{c}=|\mathbb{G}(\eta)|. Coloring constant kmk_{m} can be viewed as the number of colors needed to color each subdomain in such a way that any two subdomains with the same color are orthogonal, therefore km=|𝔾(η)|k_{m}=|\mathbb{G}(\eta)|.

However, the conventional approach for constructing a coarse space to bound the smallest eigenvalue of \mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}} relies on the algebraic local symmetric positive semidefinite (SPSD) splitting (Al Daas and Grigori, 2019), which is computationally involved. To obtain a coarse space that is practical and simpler to build, we propose constructing it based on the sparse grid \mathcal{U}(\eta_{c},d) with a lower level \eta_{c}<\eta. In practice, we set \eta_{c}=\max\{\lceil\eta/2\rceil,d\}. Although the TAS preconditioner with our proposed coarse space does not come with a theoretical guarantee of robust convergence, empirical results presented in Figure 3 indicate that TAS tends to converge faster than the one-level AS preconditioner and the CG method without preconditioning. Figure 10 shows cases where \Sigma_{\mathcal{U}} is ill-conditioned, resulting in instability (blow-up) of the CG method without any preconditioning.

Refer to caption
Refer to caption
Figure 3: Residuals of PCG for \Sigma_{\mathcal{U}}^{-1}\mathbf{v} with tolerance \tau=10^{-3} and different preconditioners, where \Sigma_{\mathcal{U}}=\big{[}\mathbf{K}_{\mathcal{U},\mathcal{U}}+\sigma_{\epsilon}^{-2}\mathbf{K}_{\mathcal{U},\mathbf{X}}\mathbf{K}_{\mathbf{X},\mathcal{U}}\big{]}, \mathcal{U}=\mathcal{U}(\eta,d) is the sparse grid defined in eq. 10, and the vector \mathbf{v} is randomly generated. The x-axis is the iteration step and the y-axis is the logarithm of the residual norm. We denote PCG with the TAS preconditioner by red dots, PCG with the AS preconditioner by orange crosses, PCG with the Jacobi preconditioner by green triangles, and CG without preconditioning by blue diamonds. Left: The sparse grid \mathcal{U}(\eta=12,d=2). Right: The sparse grid \mathcal{U}(\eta=10,d=4).

Complexity

Note that \mathbf{S}_{\vv{t}} is a sparse matrix with n_{\vv{t}} nonzeros, all equal to one; hence computing \mathbf{P}_{\vv{t}} requires \mathcal{O}(nn_{\vv{t}}^{2}) operations, and the computation of the first term of \mathbf{P}_{\text{TAS}}^{-1} costs \mathcal{O}(\sum_{\vv{t}\in\mathbb{G}(\eta)}nn_{\vv{t}}^{2}). The computational complexity of the preconditioner \mathbf{P}_{\text{TAS}}^{-1} is then \mathcal{O}(\sum_{\vv{t}\in\mathbb{G}(\eta)}nn_{\vv{t}}^{2}+n_{c}^{2})=\mathcal{O}(n|\mathbb{G}(\eta)|(\eta/d)^{2d})=\mathcal{O}(nn_{sg}(\eta/d)^{2d}), where n_{c}=|\mathcal{U}(\eta_{c},d)| is the size of the coarse sparse grid under the setting described above.

The complexity of the matrix-vector multiplication \bm{\Sigma}\mathbf{x}_{0} in Algorithm 1 is \mathcal{O}(nn_{sg}) if \bm{\Sigma}=\bm{\Sigma}_{\mathcal{U}}. In theory, the maximum number of iterations of PCG with error factor \epsilon is bounded by \lceil\frac{1}{2}\sqrt{\kappa}\ln\big{(}\frac{2}{\epsilon}\big{)}\rceil (Shewchuk et al., 1994), where \kappa is the condition number of the preconditioned matrix \mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}}, so the time complexity of Algorithm 1 for \bm{\Sigma}=\bm{\Sigma}_{\mathcal{U}} is \mathcal{O}((\sqrt{\kappa}+(\eta/d)^{2d})n_{sg}n) with \kappa=\kappa(\mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}}). Therefore, the time complexity of the entire posterior sampling scheme is \mathcal{O}\Big{(}\big{(}\eta+(\sqrt{\kappa}+(\eta/d)^{2d}+m)n\big{)}n_{sg}\Big{)}, where n_{sg} is the size of the inducing points on a sparse grid \mathcal{U}(\eta,d), n is the size of the observations \mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n}, m is the size of the test points \mathbf{X}^{*}=\{\mathbf{x}_{i}^{*}\}_{i=1}^{m}, and d is the dimension of the observations.

4 Experiments

In this section, we demonstrate the computational efficiency and accuracy of the proposed sampling algorithm. We first generate samples from GP priors and posteriors for dimensions d=2 and d=4 in Section 4.1. Then we apply our method to the same real-world problems as those addressed in Wilson et al. (2020), described in Section 4.2. For prior sampling, we use random Fourier features (RFF) with 2^{6}=64 features (Rahimi and Recht, 2007) and the Cholesky decomposition method as benchmarks. For posterior sampling, we compare our approach with the decoupled algorithm (Wilson et al., 2020) using RFF priors and the exact Matheron’s update, as well as the Cholesky decomposition method. We use Matérn kernels as defined in eq. 6 with variance \sigma^{2}=1, lengthscale \omega=\sqrt{2\nu}, and smoothness \nu=3/2. Our experiments employ “separable” Matérn kernels as specified in eq. 5, with the same base kernels and the same lengthscale parameter \omega=\sqrt{2\nu} in each dimension. For the proposed method, the inducing points are selected as a sparse grid \mathcal{U}(\eta,d) as defined in eq. 10 with (\eta=5,d=2) and (\eta=6,d=4), respectively. We set the same nested sequence \varnothing=\mathcal{U}_{j,0}\subseteq\ldots\subseteq\mathcal{U}_{j,t}\subseteq\mathcal{U}_{j,t+1}\subseteq\ldots for each dimension j, and set \mathcal{U}_{j,t_{j}} as the hyperbolic cross points defined in eq. 11 for all j. We set the noise variates \epsilon\sim\mathcal{N}(\mathbf{0},\sigma_{\epsilon}^{2}\mathbf{I}) with \sigma_{\epsilon}^{2}=10^{-4} for all experiments. The seed value is set to 9999, and each experiment is replicated 1000 times.

4.1 Simulation

Prior sampling

We generate prior samples at points \mathbf{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n_{s}} over the domain [0,1]^{d} using Matérn 3/2 kernels, with n_{s}=2^{6},\ldots,2^{13}. The left plots in Figure 4 and Figure 11 show the time required by different algorithms to sample at different numbers of points n_{s}. We can observe that the proposed algorithm (InSG) is comparable in efficiency to the RFF method. To evaluate accuracy, we compute the 2-Wasserstein distances between the empirical and true distributions. The 2-Wasserstein distance measures the similarity of two distributions. Let f_{1}\sim\mathcal{N}(\mu_{1},\Sigma_{1}) and f_{2}\sim\mathcal{N}(\mu_{2},\Sigma_{2}); the 2-Wasserstein distance between the Gaussian distributions f_{1} and f_{2} in the L^{2} norm is given by (Dowson and Landau, 1982; Gelbrich, 1990; Mallasto and Feragen, 2017)

W_{2}(f_{1},f_{2}):=\Big{(}||\mu_{1}-\mu_{2}||^{2}+{\rm tr}\big{(}\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}^{\frac{1}{2}}\Sigma_{2}\Sigma_{1}^{\frac{1}{2}})^{\frac{1}{2}}\big{)}\Big{)}^{\frac{1}{2}}. (25)

For the empirical distributions, the parameters are estimated from the replicated samples. The right plots in Figure 4 and Figure 11 demonstrate that the proposed algorithm performs with nearly the same precision as Cholesky decomposition and RFF.
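The metric in eq. 25 can be evaluated directly with SciPy’s matrix square root; the sketch below is our own illustration, and the small guard against negative round-off in the trace term is our own addition.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(mu1, Sigma1, mu2, Sigma2):
    """2-Wasserstein distance between N(mu1, Sigma1) and N(mu2, Sigma2), eq. (25)."""
    S1_half = sqrtm(Sigma1)
    cross = sqrtm(S1_half @ Sigma2 @ S1_half)
    # sqrtm may return a tiny imaginary part due to round-off; keep the real part
    bures = np.trace(Sigma1 + Sigma2 - 2.0 * np.real(cross))
    return np.sqrt(np.sum((mu1 - mu2) ** 2) + max(bures, 0.0))
```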

Refer to caption
Refer to caption
Figure 4: Time and accuracy of different algorithms for sampling from GP priors with Matérn 3/2 in dimension d=2. The x-axis is the number of sampling points n_{s}. The sparse grid level of the proposed method is set as \eta=5. We denote the proposed approach, inducing points on the sparse grid (InSG), by red dots, random Fourier features (RFF) by green triangles, and Cholesky decomposition (Chol) by blue diamonds. Left: Logarithm of the time taken to generate a draw from GP priors. Right: Logarithm of 2-Wasserstein distances between the sampled priors and the true distributions.
Refer to caption
Refer to caption
Figure 5: Time and accuracy of different algorithms for sampling from GP posteriors with Matérn 3/2 and dimension d=2. The x-axis is the number of observations n. The sparse grid level of the proposed method is set as \eta=5. The decoupled method is configured to use RFF priors with 2^{6}=64 features and the exact Matheron’s update. Left: Logarithm of the time taken to generate a draw from GP posteriors over m=1000 points. Right: Logarithm of 2-Wasserstein distances between posterior samples and ground-truth GP posteriors over m=1000 points.

Posterior sampling

We use the Griewank function (Griewank, 1981) as the test function, defined as:

f(\mathbf{x})=\sum_{j=1}^{d}\frac{x_{j}^{2}}{4000}-\prod_{j=1}^{d}\cos\Big{(}\frac{x_{j}}{\sqrt{j}}\Big{)}+1,\quad\mathbf{x}\in[-5,5]^{d}.

The sparse grid \mathcal{U}(\eta,d) is set over [-5,5]^{d} with configurations (\eta=5,d=2) and (\eta=6,d=4) for the design of our experiment. We then evaluate the average computational time and 2-Wasserstein distance over m=1000 random test points for each sampling method. Figure 5 and Figure 12 illustrate the performance of the different sampling strategies. We observe that our proposed algorithm is significantly more time-efficient than the Cholesky and decoupled algorithms, particularly when n is larger than 2^{10}. Moreover, our proposed method, the inducing points approximation on the sparse grid (InSG), demonstrates accuracy comparable to the decoupled method.

4.2 Application

Thompson sampling

Thompson Sampling (TS) (Thompson, , 1933) is a classical strategy for decision-making by selecting actions x𝒳x\in\mathcal{X} that minimize a black-box function f:𝒳f:\mathcal{X}\rightarrow\mathbb{R}. At each iteration tt, TS determines 𝐱t+1argmin𝐱𝒳(f|𝐲)(𝐱)\mathbf{x}_{t+1}\in\arg\min_{\mathbf{x}\in\mathcal{X}}(f|\mathbf{y})(\mathbf{x}), where 𝐲\mathbf{y} represents the observation set at current iteration. Upon finding the minimizer, the corresponding yt+1y_{t+1} is obtained by evaluating ff at 𝐱t+1\mathbf{x}_{t+1}, and then the pair (𝐱t+1,yt+1)(\mathbf{x}_{t+1},y_{t+1}) is added to the training set. In this experiment, we consider the Ackley function (Ackley, , 2012)

f(𝐱)=aexp(b1dj=1dxj2)+a+exp(1)exp(1dj=1dcos(cxj)),\displaystyle f(\mathbf{x})=-a\exp\Big{(}-b\sqrt{\frac{1}{d}\sum_{j=1}^{d}x_{j}^{2}}\Big{)}+a+\exp(1)-\exp\Big{(}\frac{1}{d}\sum_{j=1}^{d}\cos(cx_{j})\Big{)}, (26)

with a=20, b=0.2, c=2\pi. The goal of this experiment is to find the global minimizer of the target function f. We start with 3 initial samples; then, at each iteration of TS, we draw a posterior sample f|\mathbf{y} on 1024 uniformly distributed points over the domain [-5,5]^{d}, conditioned on the current observation set. Next, we pick the point with the smallest posterior sample value, add the corresponding pair to the training set, and repeat the above process. This iterative approach helps us progressively approximate the global minimum. To neutralize the impact of the lengthscales and seeds, we average the regret over 8 different lengthscales [0.2,0.4,0.6,0.8,1.0,1.2,1.4,1.6] and 5 different seeds [9,99,999,9999,99999]. In Figure 6, we compare the logarithm of the regret at each iteration for different sampling algorithms; the proposed approach (InSG) demonstrates performance comparable to the decoupled method.
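A schematic version of this TS loop is sketched below. The posterior sampler draw_posterior(X, y, Xcand) is an assumed interface returning one joint posterior draw at the candidate points (any of the samplers discussed in this paper could be plugged in), and the Ackley function follows eq. 26.

```python
import numpy as np

def ackley(X, a=20.0, b=0.2, c=2.0 * np.pi):
    """Ackley function of eq. (26), evaluated row-wise on X of shape (n, d)."""
    term1 = -a * np.exp(-b * np.sqrt(np.mean(X ** 2, axis=1)))
    term2 = -np.exp(np.mean(np.cos(c * X), axis=1))
    return term1 + term2 + a + np.e

def thompson_sampling(draw_posterior, d, n_iter=50, n_cand=1024, seed=0):
    """Schematic TS loop for minimizing the Ackley function with a GP posterior sampler."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5.0, 5.0, size=(3, d))          # 3 initial design points
    y = ackley(X)
    for _ in range(n_iter):
        cand = rng.uniform(-5.0, 5.0, size=(n_cand, d))
        f_draw = draw_posterior(X, y, cand)          # one joint posterior draw
        x_next = cand[np.argmin(f_draw)]             # minimizer of the sampled path
        X = np.vstack([X, x_next])
        y = np.append(y, ackley(x_next[None, :]))    # evaluate f and augment the data
    return X, y
```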

Refer to caption
Refer to caption
Refer to caption
Figure 6: Regret of Thompson sampling with different sampling algorithms for optimizing the Ackley function with d=2 (left), d=4 (middle), and d=6 (right) for Matérn 3/2. The sparse grid design is set as \mathcal{U}(\eta=5,d=2), \mathcal{U}(\eta=6,d=4), and \mathcal{U}(\eta=8,d=6), respectively. The decoupled method is configured to use RFF priors with 2^{6}=64 features and the exact Matheron’s update. The x-axis is the iteration step and the y-axis is the regret at each step.

Simulating dynamical systems

GP posteriors are also useful in dynamical systems where data is limited. An illustrative example is the FitzHugh–Nagumo model (FitzHugh, 1961; Nagumo et al., 1962), which models the activation and deactivation dynamics of a spiking neuron. The system’s equations are given by:

\dot{\mathbf{x}}=f(\mathbf{x},a)=\begin{bmatrix}\dot{v}\\ \dot{w}\end{bmatrix}=\begin{bmatrix}v-\frac{v^{3}}{3}-w+a\\ \frac{1}{\gamma}(v-\beta w+\alpha)\end{bmatrix}. (27)

Using the Euler–Maruyama method (Maruyama, , 1955), the system’s equations eq. 27 can be discretized into a stochastic difference equation as follows:

𝐲t=𝐱t+1𝐱t=τf(𝐱t,at)+τϵt,ϵt𝒩(𝟎,σϵ2𝐈),\mathbf{y}_{t}=\mathbf{x}_{t+1}-\mathbf{x}_{t}=\tau f(\mathbf{x}_{t},a_{t})+\sqrt{\tau}\epsilon_{t},\quad\epsilon_{t}\sim\mathcal{N}(\mathbf{0},\sigma_{\epsilon}^{2}\mathbf{I}), (28)

where \tau is the fixed step size. Writing the FitzHugh–Nagumo neuron eq. 27 in the form of eq. 28, we obtain:

𝐱t+1𝐱t=τ[vtvt33wt+at1γ(vtβwt+α)]+τϵt.\mathbf{x}_{t+1}-\mathbf{x}_{t}=\tau\begin{bmatrix}v_{t}-\frac{v_{t}^{3}}{3}-w_{t}+a_{t}\\ \frac{1}{\gamma}(v_{t}-\beta w_{t}+\alpha)\end{bmatrix}+\sqrt{\tau}\epsilon_{t}. (29)
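For reference, the Euler–Maruyama recursion of eq. 29 can be simulated directly as below; such trajectories serve as the ground truth against which the GP-based simulations are compared. The function name and the constant input current in the usage line are illustrative, and the parameter defaults follow the experimental setup described later in this section.

```python
import numpy as np

def fhn_em_trajectory(x0, a_seq, tau=0.25, alpha=0.75, beta=0.75, gamma=20.0,
                      noise_var=1e-4, seed=0):
    """Euler-Maruyama simulation of the stochastic FitzHugh-Nagumo neuron, eq. (29)."""
    rng = np.random.default_rng(seed)
    traj = [np.asarray(x0, dtype=float)]
    for a_t in a_seq:
        v, w = traj[-1]
        drift = np.array([v - v ** 3 / 3.0 - w + a_t,          # dv/dt
                          (v - beta * w + alpha) / gamma])      # dw/dt
        eps = rng.normal(0.0, np.sqrt(noise_var), size=2)       # eps_t ~ N(0, sigma_eps^2 I)
        traj.append(traj[-1] + tau * drift + np.sqrt(tau) * eps)
    return np.array(traj)

# e.g. a constant input current a_t = 0.5 for 100 steps from x0 = (-2.5, -1)
states = fhn_em_trajectory(x0=(-2.5, -1.0), a_seq=np.full(100, 0.5))
```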
Refer to caption
Figure 7: Trajectories of different algorithms for simulation of a stochastic FitzHugh–Nagumo neuron. Each iteration is generated over 1000 samples. The sparse grid level of the proposed method (InSG) is set as \eta=5. The decoupled method is configured to use RFF priors with 2^{6}=64 features and the exact Matheron’s update. Top: Voltage v_{t} trajectories generated by different algorithms. Bottom: Recovery variable w_{t} trajectories generated by different algorithms.
Refer to caption
Refer to caption
Figure 8: Time and accuracy of different algorithms for simulation of a stochastic FitzHugh–Nagumo neuron. Left: Time cost at each iteration. Right: Logarithm of 2-Wasserstein distance between simulations and ground truth state trajectories at each iteration.

Our objective is to simulate the state trajectories of this dynamical system. Starting from an initial point \mathbf{x}_{0}=(v_{0},w_{0})^{\top}, at iteration t we draw a posterior GP sample from the conditional distribution p(\mathbf{y}_{t}|D_{t-1}), where D_{t-1} denotes the set of training data \{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{n} along with the current trajectory \{(\mathbf{x}_{j},\mathbf{y}_{j})\}_{j=1}^{t-1}. In our implementation, we choose the model eq. 29 with step size \tau=0.25, parameters \alpha=0.75, \beta=0.75, \gamma=20, noise variance \sigma_{\epsilon}^{2}=10^{-4}, and initial point \mathbf{x}_{0}=(-2.5,-1)^{\top}. The training set \{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{n} was generated by evaluating eq. 29 at n=256 points \{\mathbf{x}_{i}\}_{i=1}^{n} (\mathbf{x}_{i}\in\mathcal{X}), uniformly distributed in the domain \mathcal{X}=[-2.5,2.5]\times[-1,2], with corresponding currents \{a_{i}\}_{i=1}^{n} (a_{i}\in\mathcal{A}) chosen uniformly from \mathcal{A}=[0,1]. The variations \mathbf{y}_{t} in each step were simulated using independent Matérn 3/2 GPs with \omega=\sqrt{3} and the inducing points approximation on the sparse grid \mathcal{U}(\eta=5,d=2). We generate 1000 samples at each iteration. Figure 7 presents the state trajectories of the voltage v_{t} and the recovery variable w_{t} versus the iteration step t for each algorithm. The left plots in Figure 7 show that our proposed algorithm can accurately characterize the state trajectories of this dynamical system. Figure 8 illustrates the computational time required for each iteration and the 2-Wasserstein distance between the states \mathbf{x}_{t} derived from the GP-based simulations and those obtained using the Euler–Maruyama method described in eq. 28 at each iteration t. These results demonstrate that our proposed algorithm not only reduces the computational time but also preserves accuracy comparable to the decoupled method.

5 Conclusion

In this work, we propose a scalable algorithm for GP sampling using inducing points on a sparse grid, which only requires a time complexity of \mathcal{O}\Big{(}(\eta+n_{s})n_{sg}\Big{)} for n_{s} sampling points and a sparse grid of size n_{sg} and level \eta. Additionally, we show that the convergence rate for GP sampling with a sparse grid is \mathcal{O}(n_{sg}^{-\nu}(\log n_{sg})^{(\nu+2)(d-1)+d+1}) for a d-dimensional sparse grid of size n_{sg} and a product Matérn kernel of smoothness \nu. For posterior sampling, we develop a two-level additive Schwarz preconditioner for the system matrix over the sparse grid, which empirically shows rapid convergence. In the numerical study, we demonstrate that the proposed algorithm is not only efficient but also maintains accuracy comparable to other approximation methods.

Acknowledgments and Disclosure of Funding

This work is supported by NSF grants DMS-2312173 and CNS-2328395.

Appendix A Theorems

A.1 Rate of convergence

Theorem A.1.

Let $M=L_{q}([0,1])$ for any $q\in[1,+\infty)$. Let $Z$ be a GP on $[0,1]$ with a Matérn kernel defined in eq. 6 with smoothness $\nu$, and let $\hat{Z}$ be an inducing points approximation of $Z$ with inducing points $\{0,1/n,2/n,\ldots,1\}$ for some positive integer $n$. Then for any $p\in[1,\infty)$, as $n\rightarrow+\infty$, the order of magnitude of the approximation error is

W_p(Z,\hat{Z})=\mathcal{O}(n^{-\nu}).
Proof.

Note that $Z$ and $\hat{Z}$ are defined on the same probability space, so their joint law is a coupling, that is, $(Z,\hat{Z})\in\Gamma(Z,\hat{Z})$. Therefore,

W_p(Z,\hat{Z})\leq\left(\mathbb{E}\|Z-\hat{Z}\|^{p}_{L_{q}([0,1])}\right)^{1/p}. (30)

According to Theorem 4 in (Tuo and Wang, 2020), $\|Z-\hat{Z}\|_{L_{q}([0,1])}$ has the order of magnitude $\mathcal{O}(n^{-\nu})$ and sub-Gaussian tails, which leads to the desired result. ∎
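
As a rough numerical sanity check of this rate (not part of the experiments reported in this paper), one can use the fact that, when the inducing points approximation takes the usual conditional-expectation form $\hat{Z}(x)=K(x,\mathcal{U})K(\mathcal{U},\mathcal{U})^{-1}Z(\mathcal{U})$, the pointwise error variance is $\mathbb{E}|Z(x)-\hat{Z}(x)|^{2}=K(x,x)-K(x,\mathcal{U})K(\mathcal{U},\mathcal{U})^{-1}K(\mathcal{U},x)$. The sketch below tracks the resulting root-mean-squared error as the grid $\{0,1/n,\ldots,1\}$ is refined; the Matérn 3/2 kernel, lengthscale, and jitter are illustrative choices.

```python
import numpy as np

def matern32(x, y, lengthscale=0.2):
    # Matérn 3/2 kernel k(x, y) = (1 + sqrt(3) r / l) exp(-sqrt(3) r / l), unit variance.
    r = np.abs(x[:, None] - y[None, :])
    s = np.sqrt(3.0) * r / lengthscale
    return (1.0 + s) * np.exp(-s)

def rmse_of_approximation(n, n_test=2000, jitter=1e-10):
    """Root-mean-squared error of the inducing points approximation on {0, 1/n, ..., 1}."""
    u = np.linspace(0.0, 1.0, n + 1)   # inducing points
    x = np.linspace(0.0, 1.0, n_test)  # test locations approximating the L2 norm
    Kuu = matern32(u, u) + jitter * np.eye(n + 1)
    Kxu = matern32(x, u)
    # Pointwise error variance: k(x, x) - k(x, U) K(U, U)^{-1} k(U, x).
    var = 1.0 - np.einsum("ij,ij->i", Kxu, np.linalg.solve(Kuu, Kxu.T).T)
    return np.sqrt(np.mean(np.maximum(var, 0.0)))

for n in [8, 16, 32, 64, 128]:
    # For nu = 3/2 the error should shrink roughly like n^(-3/2).
    print(n, rmse_of_approximation(n))
```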

Lemma A.2.

[Adapted from Proposition 1 and Corollary 2 in (Rieger and Wendland, 2017)] Let $\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a reproducing kernel of $W_{2}^{r;\otimes^{d}}(\mathbb{R}^{d})$, where $\Phi$ is the tensor product kernel $\Phi(\mathbf{x}):=\prod_{j=1}^{d}\phi(x_{j})$ with $\mathbf{x}=(x_{1},\ldots,x_{d})^{\top}$, and $\phi:\mathbb{R}\rightarrow\mathbb{R}$ is a reproducing kernel of the Sobolev space $W_{2}^{r}(\mathbb{R}^{1})$ of smoothness $r>1/2$ measured in the $L_{2}(\mathbb{R}^{1})$-norm. Suppose we are given the sparse grid $\mathcal{U}(\eta,d)$ defined in eq. 10 over $I^{d}=[0,1]^{d}$ with $n=\mathrm{card}(\mathcal{U}(\eta,d))$ points and an unknown function $f\in W_{2}^{r;\otimes^{d}}(I^{d})$. Let $s_{0}$ be the solution to the norm-minimal interpolation problem on the sparse grid $\mathcal{U}(\eta,d)$:

\min\|s\|_{W_{2}^{r;\otimes^{d}}}:\quad s\in W_{2}^{r;\otimes^{d}}(\mathbb{R}^{d})\quad\text{with}\quad s(\xi)=f(\xi),\quad\xi\in\mathcal{U}(\eta,d). (31)

Then, we have

\|f-s_{0}\|_{L_{\infty}(I^{d})}\leq C\,n^{-r+1/2}(\log n)^{(r+3/2)(d-1)+d+1}\|f\|_{W_{2}^{r;\otimes^{d}}(\mathbb{R}^{d})}, (32)

where $C>0$ is a constant.

Proof.

See the proofs of Proposition 1 and Corollary 2 in (Rieger and Wendland, 2017). ∎

Theorem A.3.

Let $M=L_{q}([0,1]^{d})$ for any $q\in[1,+\infty)$, where $d>1$ is the dimension of the input domain. Let $Z$ be a GP on $[0,1]^{d}$ with a product Matérn kernel $K(\mathbf{x},\mathbf{x}^{\prime})=\prod_{j=1}^{d}K_{0}(x^{(j)},x^{\prime(j)})$ defined in eq. 5 with variance $\sigma^{2}=1$ and base kernel $K_{0}(\cdot,\cdot)$ of smoothness $\nu$, and let $\hat{Z}$ be an inducing points approximation of $Z$ with inducing points on the sparse grid $\mathcal{U}(\eta,d)$ defined in eq. 10. We denote the cardinality of the sparse grid $\mathcal{U}(\eta,d)$ by $n$. Then, as $n\rightarrow+\infty$, the order of magnitude of the approximation error is

W_2(Z,\hat{Z})=\mathcal{O}\big(n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}\big).
Proof.

If $\phi$ in Lemma A.2 is a one-dimensional Matérn kernel, we can infer that $r=\nu+1/2$ by Lemma 15 in (Tuo and Wang, 2020), which states that the reproducing kernel Hilbert space (RKHS) $\mathcal{N}_{K}(\mathbb{R}^{1})$ is the Sobolev space $W_{2}^{\nu+1/2}(\mathbb{R}^{1})$ with equivalent norms. We denote the RKHS associated with the product Matérn kernel $K(\cdot,\cdot)$ by $\mathcal{N}_{K}^{\otimes^{d}}(\mathbb{R}^{d})$ and its norm by $\|\cdot\|_{K}$.

Let $\hat{f}$ be the interpolant of a function $f\in\mathcal{N}_{K}^{\otimes^{d}}(\mathbb{R}^{d})$ associated with the approximation $\hat{Z}$. Clearly, $\mathcal{N}_{K}^{\otimes^{d}}(\mathbb{R}^{d})=W_{2}^{\nu+1/2;\otimes^{d}}(\mathbb{R}^{d})$. Proposition 2 in (Rieger and Wendland, 2017) asserts that $\hat{f}=s_{0}$, where $s_{0}$ is the norm-minimal interpolant defined in Lemma A.2. Therefore, for any $f\in\mathcal{N}_{K}^{\otimes^{d}}(\mathbb{R}^{d})$, we have

\|f-\hat{f}\|_{L_{\infty}([0,1]^{d})}\leq C\,n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}\|f\|_{W_{2}^{\nu+1/2;\otimes^{d}}(\mathbb{R}^{d})}. (33)

Additionally, Lemma 15 in (Tuo and Wang, 2020) implies that

\mathbb{E}\|Z-\hat{Z}\|_{L_{q}([0,1]^{d})}^{2}=\sup_{\|f\|_{K}\leq 1}\|f-\hat{f}\|_{L_{q}([0,1]^{d})}^{2}. (34)

Note that $Z$ and $\hat{Z}$ are defined on the same probability space, so their joint law is a coupling, that is, $(Z,\hat{Z})\in\Gamma(Z,\hat{Z})$. Therefore, combining eq. 33 and eq. 34, we obtain the following:

W_2(Z,\hat{Z}) \leq \left(\mathbb{E}\|Z-\hat{Z}\|^{2}_{L_{q}([0,1]^{d})}\right)^{1/2}
= \sup_{\|f\|_{K}\leq 1}\|f-\hat{f}\|_{L_{q}([0,1]^{d})}
\leq \sup_{\|f\|_{K}\leq 1} C\,n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}\|f\|_{K}
\leq C\,n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}. (35)

Thus, the order of magnitude of the error of the inducing points approximation on the sparse grid is $W_2(Z,\hat{Z})=\mathcal{O}\big(n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}\big)$. ∎
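
For concreteness, an illustrative instantiation of this rate (not an additional result): with a Matérn base kernel of smoothness $\nu=3/2$ in dimension $d=2$, the exponent of the logarithmic factor is $(\nu+2)(d-1)+d+1=7/2+3=13/2$, so the bound specializes to

W_2(Z,\hat{Z})=\mathcal{O}\big(n^{-3/2}(\log n)^{13/2}\big),

which matches the one-dimensional rate of Theorem A.1 up to a polylogarithmic factor.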

A.2 Smolyak algorithm for GP sampling

Theorem A.4.

Let $K(\cdot,\cdot)$ be a kernel function on $\mathbb{R}^{d}\times\mathbb{R}^{d}$, and let $\mathcal{U}=\mathcal{U}(\eta,d)$ be a sparse grid design with level $\eta$ and dimension $d$ defined in eq. 10, i.e., $\mathcal{U}(\eta,d)=\bigcup_{\vv{t}\in\mathbb{G}(\eta)}\mathcal{U}_{1,t_{1}}\times\mathcal{U}_{2,t_{2}}\times\cdots\times\mathcal{U}_{d,t_{d}}$, where $\vv{t}=(t_{1},\ldots,t_{d})$, $\mathbb{G}(\eta)=\{\vv{t}\in\mathbb{N}^{d}\,|\,\sum_{j=1}^{d}t_{j}=\eta\}$, $\eta\geq d\geq 2$, and $\eta,d\in\mathbb{N}^{+}$. Let $f(\cdot)\sim\mathcal{GP}(\mathbf{0},K(\cdot,\cdot))$ be a GP. Then sampling $f(\cdot)$ on the sparse grid $\mathcal{U}$ can be done via the following formula:

f(\mathcal{U})\leftarrow\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\bigotimes_{j=1}^{d}\mathrm{chol}\big(K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})\big)\,\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\,\delta_{x}(\mathcal{U}_{\vv{t}}), (36)
\delta_{x}(\mathcal{U}_{\vv{t}}):=\begin{cases}0&\text{if }x\in\mathcal{U}\setminus\mathcal{U}_{\vv{t}}\\ 1&\text{if }x\in\mathcal{U}_{\vv{t}}\end{cases},\qquad\mathcal{U}_{\vv{t}}:=\mathcal{U}_{1,t_{1}}\times\mathcal{U}_{2,t_{2}}\times\cdots\times\mathcal{U}_{d,t_{d}}, (37)

where $\mathbb{P}(\eta,d)=\{\vv{t}\in\mathbb{N}^{d}\,|\,\max(d,\eta-d+1)\leq|\vv{t}|\leq\eta\}$, $\mathrm{chol}\big(K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})\big)$ is the Cholesky factor of $K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})$, $\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n_{\vv{t}}})$ is a standard multivariate normal vector, $n_{\vv{t}}=\prod_{j=1}^{d}t_{j}$ is the size of the full grid design $\mathcal{U}_{\vv{t}}=\mathcal{U}_{1,t_{1}}\times\mathcal{U}_{2,t_{2}}\times\cdots\times\mathcal{U}_{d,t_{d}}$ associated with $\vv{t}=(t_{1},\ldots,t_{d})$, and $\delta_{x}(\mathcal{U}_{\vv{t}})$ is a Dirac measure on the set $\mathcal{U}$.

Proof.

We know that sampling $f(\mathcal{U})\sim\mathcal{N}(\mathbf{0},K(\mathcal{U},\mathcal{U}))$ can be done via the Cholesky decomposition, i.e., by computing $\mathrm{chol}\big(K(\mathcal{U},\mathcal{U})\big)\mathbf{f}_{n_{sg}}$, where $\mathbf{f}_{n_{sg}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n_{sg}})$ is a standard multivariate normal vector on the sparse grid $\mathcal{U}$ of size $n_{sg}=|\mathcal{U}|$. Note that $\mathrm{chol}\big(K(\mathcal{U},\mathcal{U})\big)\mathbf{f}_{n_{sg}}$ can be regarded as a function $g$ evaluated on $\mathcal{U}$, i.e., $g(\mathcal{U})=\mathrm{chol}\big(K(\mathcal{U},\mathcal{U})\big)\mathbf{f}_{n_{sg}}$. According to eq. 12, applying Smolyak's algorithm $\mathcal{P}(\eta,d)$ to $g$ yields

\mathcal{P}(\eta,d)(g)=\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\Big(\bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}\Big)(g)
=\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\,g\big(\times_{j=1}^{d}\mathcal{U}_{j,t_{j}}\big)\,\delta_{x}(\mathcal{U}_{\vv{t}})
=\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\bigotimes_{j=1}^{d}\mathrm{chol}\big(K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})\big)\,\mathbf{f}_{n_{\vv{t}}}\,\delta_{x}(\mathcal{U}_{\vv{t}}), (38)

where $\delta_{x}(\mathcal{U}_{\vv{t}})$ and $\mathcal{U}_{\vv{t}}$ are defined in eq. 37. Note that $g$ is defined on $\mathcal{U}$, so applying the tensor product operator $\bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}$ to $g$ amounts to evaluating $g$ at the corresponding points, which is equivalent to applying the Dirac measure on the set $\times_{j=1}^{d}\mathcal{U}_{j,t_{j}}$ to $g\big(\bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}\big)$. ∎
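
A minimal Python sketch of the combination formula in eq. 36 is given below. It assumes, purely for illustration, nested dyadic one-dimensional designs $\mathcal{U}_{j,t}$ with $2^{t}+1$ points on $[0,1]$, a Matérn 3/2 base kernel with an arbitrary lengthscale, and a small jitter for numerical stability; these choices and the helper names are our assumptions, not the configuration used elsewhere in the paper. Each term of the sum draws an independent full-grid sample as a Kronecker product of one-dimensional Cholesky factors and scatters it onto the sparse-grid points with the signed binomial weight.

```python
import itertools
from math import comb
import numpy as np

def matern32(x, y, lengthscale=0.3):
    # 1-D Matérn 3/2 base kernel; the product over dimensions gives the d-D kernel.
    r = np.abs(np.subtract.outer(x, y))
    s = np.sqrt(3.0) * r / lengthscale
    return (1.0 + s) * np.exp(-s)

def one_dim_grid(level):
    # Assumed nested 1-D designs U_{j,t}: dyadic grids on [0, 1] with 2^t + 1 points.
    return np.linspace(0.0, 1.0, 2 ** level + 1)

def smolyak_prior_sample(eta, d, rng):
    """Draw one GP prior sample on the sparse grid U(eta, d) via the combination formula."""
    # Multi-indices t with t_j >= 1 and max(d, eta - d + 1) <= |t| <= eta.
    index_set = [t for t in itertools.product(range(1, eta + 1), repeat=d)
                 if max(d, eta - d + 1) <= sum(t) <= eta]
    # The sparse grid is the union of the full grids U_t with |t| = eta.
    points = {}
    for t in index_set:
        if sum(t) == eta:
            for p in itertools.product(*(one_dim_grid(tj) for tj in t)):
                key = tuple(round(float(c), 12) for c in p)
                points.setdefault(key, len(points))
    sample = np.zeros(len(points))
    for t in index_set:
        weight = (-1) ** (eta - sum(t)) * comb(d - 1, eta - sum(t))
        # Kronecker product of 1-D Cholesky factors = Cholesky factor of the
        # product kernel on the full grid U_t (points ordered lexicographically).
        L = np.array([[1.0]])
        for tj in t:
            g = one_dim_grid(tj)
            Lj = np.linalg.cholesky(matern32(g, g) + 1e-10 * np.eye(len(g)))
            L = np.kron(L, Lj)
        full_grid = list(itertools.product(*(one_dim_grid(tj) for tj in t)))
        draw = L @ rng.standard_normal(len(full_grid))
        # Scatter the full-grid draw onto the sparse-grid points (the Dirac term in eq. 36).
        for value, p in zip(draw, full_grid):
            key = tuple(round(float(c), 12) for c in p)
            sample[points[key]] += weight * value
    return np.array(list(points.keys())), sample

rng = np.random.default_rng(0)
grid, f_prior = smolyak_prior_sample(eta=5, d=2, rng=rng)
print(grid.shape, f_prior.shape)
```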

Appendix B Plots

Figure 9: Sparse grids $\mathcal{U}(\eta,d)$ over $[0,1]^{d}$ of levels $\eta=4,5,6,7$ and dimension $d=3$.
Figure 10: Residuals of PCG for computing $\Sigma_{\mathcal{U}}^{-1}\mathbf{v}$ with tolerance $\tau=10^{-3}$ under different preconditioners, where $\Sigma_{\mathcal{U}}=\big[\mathbf{K}_{\mathcal{U},\mathcal{U}}+\sigma_{\epsilon}^{-2}\mathbf{K}_{\mathcal{U},\mathbf{X}}\mathbf{K}_{\mathbf{X},\mathcal{U}}\big]$, $\mathcal{U}=\mathcal{U}(\eta,d)$ is the sparse grid defined in eq. 10, and the vector $\mathbf{v}$ is randomly generated. Left: the sparse grid $\mathcal{U}(\eta=5,d=2)$. Right: the sparse grid $\mathcal{U}(\eta=6,d=4)$.
Figure 11: Time and accuracy of different algorithms for sampling from GP priors with a Matérn 3/2 kernel in dimension $d=4$. The $x$-axis is the number of grid points $n_{s}$. The sparse grid level of the proposed method is set to $\eta=6$. Left: logarithm of the time taken to generate a draw from the GP prior. Right: logarithm of the 2-Wasserstein distances between the prior samples and the true distributions.
Figure 12: Time and accuracy of different algorithms for sampling from GP posteriors with a Matérn 3/2 kernel in dimension $d=4$. The $x$-axis is the number of observations $n$. The sparse grid level of the proposed method is set to $\eta=6$. The decoupled method is configured to use RFF priors with $2^{8}=64$ features and exact Matheron's update. Left: logarithm of the time taken to generate a draw from the GP posterior over $m=1000$ points. Right: logarithm of the 2-Wasserstein distances between the posterior samples and the ground truth GP posteriors over $m=1000$ points.

References

  • Ackley, (2012) Ackley, D. (2012). A connectionist machine for genetic hillclimbing, volume 28. Springer Science & Business Media.
  • Al Daas and Grigori, (2019) Al Daas, H. and Grigori, L. (2019). A class of efficient locally constructed preconditioners based on coarse spaces. SIAM Journal on Matrix Analysis and Applications, 40(1):66–91.
  • Al Daas et al., (2023) Al Daas, H., Jolivet, P., and Rees, T. (2023). Efficient Algebraic Two-Level Schwarz Preconditioner for Sparse Matrices. SIAM Journal on Scientific Computing, 45(3):A1199–A1213.
  • Banerjee et al., (2008) Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(4):825–848.
  • Bishop et al., (1995) Bishop, C. M. et al. (1995). Neural networks for pattern recognition. Oxford university press.
  • Cai and Sarkis, (1999) Cai, X.-C. and Sarkis, M. (1999). A restricted additive Schwarz preconditioner for general sparse linear systems. SIAM Journal on Scientific Computing, 21(2):792–797.
  • Chen et al., (2022) Chen, H., Ding, L., and Tuo, R. (2022). Kernel packet: An Exact and Scalable Algorithm for Gaussian Process Regression with Matérn Correlations. Journal of Machine Learning Research, 23(127):1–32.
  • Chen and Tuo, (2022) Chen, H. and Tuo, R. (2022). A Scalable and Exact Gaussian Process Sampler via Kernel Packets.
  • Cole et al., (2021) Cole, D. A., Christianson, R. B., and Gramacy, R. B. (2021). Locally induced Gaussian processes for large-scale simulation experiments. Statistics and Computing, 31:1–21.
  • Constantinou et al., (2017) Constantinou, P., Kokoszka, P., and Reimherr, M. (2017). Testing separability of space-time functional processes. Biometrika, 104(2):425–437.
  • Cressie, (2015) Cressie, N. (2015). Statistics for spatial data. John Wiley & Sons.
  • Curriero, (2006) Curriero, F. C. (2006). On the use of non-Euclidean distance measures in geostatistics. Mathematical Geology, 38:907–926.
  • Cutajar et al., (2016) Cutajar, K., Osborne, M., Cunningham, J., and Filippone, M. (2016). Preconditioning kernel matrices. In International conference on machine learning, pages 2529–2538. PMLR.
  • Demmel, (1997) Demmel, J. W. (1997). Applied numerical linear algebra. SIAM.
  • Diggle et al., (2003) Diggle, P. J., Ribeiro, P. J., and Christensen, O. F. (2003). An introduction to model-based geostatistics. Spatial statistics and computational methods, pages 43–86.
  • Ding et al., (2021) Ding, L., Tuo, R., and Shahrampour, S. (2021). A sparse expansion for deep Gaussian processes. arXiv preprint arXiv:2112.05888.
  • Dolean et al., (2015) Dolean, V., Jolivet, P., and Nataf, F. (2015). An introduction to domain decomposition methods: algorithms, theory, and parallel implementation. SIAM.
  • Dowson and Landau, (1982) Dowson, D. and Landau, B. (1982). The Fréchet distance between multivariate normal distributions. Journal of multivariate analysis, 12(3):450–455.
  • Dryja and Widlund, (1994) Dryja, M. and Widlund, O. B. (1994). Domain decomposition algorithms with small overlap. SIAM Journal on Scientific Computing, 15(3):604–620.
  • Engel et al., (2005) Engel, Y., Mannor, S., and Meir, R. (2005). Reinforcement learning with Gaussian processes. In Proceedings of the 22nd international conference on Machine learning, pages 201–208.
  • FitzHugh, (1961) FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophysical journal, 1(6):445–466.
  • (22) Frazier, P. I. (2018a). A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
  • (23) Frazier, P. I. (2018b). Bayesian optimization. In Recent advances in optimization and modeling of contemporary problems, pages 255–278. Informs.
  • Garcke, (2013) Garcke, J. (2013). Sparse grids in a nutshell. In Sparse grids and applications, pages 57–80. Springer.
  • Gardner et al., (2018) Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and Wilson, A. G. (2018). GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. Advances in neural information processing systems, 31.
  • Gelbrich, (1990) Gelbrich, M. (1990). On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):185–203.
  • Genton, (2001) Genton, M. G. (2001). Classes of kernels for machine learning: a statistics perspective. Journal of machine learning research, 2(Dec):299–312.
  • Genton, (2007) Genton, M. G. (2007). Separable approximations of space-time covariance matrices. Environmetrics: The official journal of the International Environmetrics Society, 18(7):681–695.
  • Girard et al., (2002) Girard, A., Rasmussen, C., Candela, J. Q., and Murray-Smith, R. (2002). Gaussian process priors with uncertain inputs application to multiple-step ahead time series forecasting. Advances in neural information processing systems, 15.
  • Gneiting et al., (2006) Gneiting, T., Genton, M. G., and Guttorp, P. (2006). Geostatistical space-time models, stationarity, separability, and full symmetry. Monographs On Statistics and Applied Probability, 107:151.
  • Golub and Van Loan, (2013) Golub, G. H. and Van Loan, C. F. (2013). Matrix Computations. JHU press.
  • Gramacy and Apley, (2015) Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578.
  • Grande et al., (2014) Grande, R., Walsh, T., and How, J. (2014). Sample efficient reinforcement learning with Gaussian processes. In International Conference on Machine Learning, pages 1332–1340. PMLR.
  • Griewank, (1981) Griewank, A. O. (1981). Generalized descent for global optimization. Journal of optimization theory and applications, 34(1):11–39.
  • Grigorievskiy et al., (2017) Grigorievskiy, A., Lawrence, N., and Särkkä, S. (2017). Parallelizable sparse inverse formulation Gaussian processes (SpInGP). In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE.
  • Hartikainen and Särkkä, (2010) Hartikainen, J. and Särkkä, S. (2010). Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In 2010 IEEE international workshop on machine learning for signal processing, pages 379–384. IEEE.
  • Henderson et al., (1983) Henderson, H. V., Pukelsheim, F., and Searle, S. R. (1983). On the history of the Kronecker product. Linear and Multilinear Algebra, 14(2):113–120.
  • Hensman et al., (2018) Hensman, J., Durrande, N., and Solin, A. (2018). Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(151):1–52.
  • Hensman et al., (2013) Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. arXiv preprint arXiv:1309.6835.
  • Hensman et al., (2015) Hensman, J., Matthews, A., and Ghahramani, Z. (2015). Scalable variational Gaussian process classification. In Artificial Intelligence and Statistics, pages 351–360. PMLR.
  • Hestenes et al., (1952) Hestenes, M. R., Stiefel, E., et al. (1952). Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards, 49(6):409–436.
  • Journel and Huijbregts, (1976) Journel, A. G. and Huijbregts, C. J. (1976). Mining geostatistics.
  • Katzfuss and Guinness, (2021) Katzfuss, M. and Guinness, J. (2021). A general framework for Vecchia approximations of Gaussian processes. Statistical Science, 36(1):124–141.
  • Katzfuss et al., (2022) Katzfuss, M., Guinness, J., and Lawrence, E. (2022). Scaled Vecchia approximation for fast computer-model emulation. SIAM/ASA Journal on Uncertainty Quantification, 10(2):537–554.
  • Kennedy and O’Hagan, (2001) Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425–464.
  • Kolda, (2006) Kolda, T. G. (2006). Multilinear operators for higher-order decompositions. Technical report, Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA ….
  • Kuss and Rasmussen, (2003) Kuss, M. and Rasmussen, C. (2003). Gaussian processes in reinforcement learning. Advances in neural information processing systems, 16.
  • Kuss et al., (2005) Kuss, M., Rasmussen, C. E., and Herbrich, R. (2005). Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of machine learning research, 6(10).
  • Lin et al., (2023) Lin, J. A., Antorán, J., Padhy, S., Janz, D., Hernández-Lobato, J. M., and Terenin, A. (2023). Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent. arXiv preprint arXiv:2306.11589.
  • MacKay et al., (2003) MacKay, D. J., Mac Kay, D. J., et al. (2003). Information theory, inference and learning algorithms. Cambridge university press.
  • Maddox et al., (2021) Maddox, W. J., Balandat, M., Wilson, A. G., and Bakshy, E. (2021). Bayesian optimization with high-dimensional outputs. Advances in Neural Information Processing Systems, 34:19274–19287.
  • Mallasto and Feragen, (2017) Mallasto, A. and Feragen, A. (2017). Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes. Advances in Neural Information Processing Systems, 30.
  • Maruyama, (1955) Maruyama, G. (1955). Continuous Markov processes and stochastic equations. Rend. Circ. Mat. Palermo, 4:48–90.
  • Marzouk and Najm, (2009) Marzouk, Y. M. and Najm, H. N. (2009). Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228(6):1862–1902.
  • Murray-Smith and Pearlmutter, (2004) Murray-Smith, R. and Pearlmutter, B. A. (2004). Transformations of Gaussian process priors. In International Workshop on Deterministic and Statistical Methods in Machine Learning, pages 110–123. Springer.
  • Nagumo et al., (1962) Nagumo, J., Arimoto, S., and Yoshizawa, S. (1962). An active pulse transmission line simulating nerve axon. Proceedings of the IRE, 50(10):2061–2070.
  • Neal, (2012) Neal, R. M. (2012). Bayesian learning for neural networks, volume 118. Springer Science & Business Media.
  • Nguyen et al., (2021) Nguyen, V., Deisenroth, M. P., and Osborne, M. A. (2021). Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. arXiv preprint arXiv:2110.12087.
  • Nickisch and Rasmussen, (2008) Nickisch, H. and Rasmussen, C. E. (2008). Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(Oct):2035–2078.
  • Nickisch et al., (2018) Nickisch, H., Solin, A., and Grigorevskiy, A. (2018). State space Gaussian processes with non-Gaussian likelihood. In International Conference on Machine Learning, pages 3789–3798. PMLR.
  • Nychka et al., (2015) Nychka, D., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. (2015). A multiresolution Gaussian process model for the analysis of large spatial datasets. Journal of Computational and Graphical Statistics, 24(2):579–599.
  • O’Hagan, (1978) O’Hagan, A. (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society: Series B (Methodological), 40(1):1–24.
  • Okhrin et al., (2020) Okhrin, Y., Schmid, W., and Semeniuk, I. (2020). New approaches for monitoring image data. IEEE Transactions on Image Processing, 30:921–933.
  • Pardo-Iguzquiza and Chica-Olmo, (2008) Pardo-Iguzquiza, E. and Chica-Olmo, M. (2008). Geostatistics with the Matern semivariogram model: A library of computer programs for inference, kriging and simulation. Computers & Geosciences, 34(9):1073–1079.
  • Plumlee, (2014) Plumlee, M. (2014). Fast prediction of deterministic functions using sparse grid experimental designs. Journal of the American Statistical Association, 109(508):1581–1591.
  • Plumlee, (2017) Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112(519):1274–1285.
  • Quinonero-Candela and Rasmussen, (2005) Quinonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959.
  • Rahimi and Recht, (2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. Advances in neural information processing systems, 20.
  • Rasmussen, (2003) Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer school on machine learning, pages 63–71. Springer.
  • Rieger and Wendland, (2017) Rieger, C. and Wendland, H. (2017). Sampling inequalities for sparse grids. Numerische Mathematik, 136:439–466.
  • Roberts et al., (2013) Roberts, S., Osborne, M., Ebden, M., Reece, S., Gibson, N., and Aigrain, S. (2013). Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550.
  • Saad, (2003) Saad, Y. (2003). Iterative methods for sparse linear systems. SIAM.
  • Saatçi, (2012) Saatçi, Y. (2012). Scalable inference for structured Gaussian process models. PhD thesis, Citeseer.
  • Santner et al., (2003) Santner, T. J., Williams, B. J., Notz, W. I., and Williams, B. J. (2003). The design and analysis of computer experiments, volume 1. Springer.
  • Shewchuk et al., (1994) Shewchuk, J. R. et al. (1994). An introduction to the conjugate gradient method without the agonizing pain.
  • Silverman, (1985) Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society: Series B (Methodological), 47(1):1–21.
  • Smolyak, (1963) Smolyak, S. A. (1963). Quadrature and interpolation formulas for tensor products of certain classes of functions. In Doklady Akademii Nauk, volume 148, pages 1042–1045. Russian Academy of Sciences.
  • Snoek et al., (2012) Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25.
  • Spillane et al., (2014) Spillane, N., Dolean, V., Hauret, P., Nataf, F., Pechstein, C., and Scheichl, R. (2014). Abstract robust coarse spaces for systems of PDEs via generalized eigenproblems in the overlaps. Numerische Mathematik, 126:741–770.
  • Srinivas et al., (2009) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.
  • Stanton et al., (2021) Stanton, S., Maddox, W., Delbridge, I., and Wilson, A. G. (2021). Kernel interpolation for scalable online Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 3133–3141. PMLR.
  • Stein, (1999) Stein, M. L. (1999). Interpolation of spatial data: some theory for kriging. Springer Science & Business Media.
  • Teckentrup, (2020) Teckentrup, A. L. (2020). Convergence of Gaussian process regression with estimated hyper-parameters and applications in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 8(4):1310–1337.
  • Thompson, (1933) Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294.
  • Titsias, (2009) Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Artificial intelligence and statistics, pages 567–574. PMLR.
  • Toselli and Widlund, (2004) Toselli, A. and Widlund, O. (2004). Domain decomposition methods-algorithms and theory, volume 34. Springer Science & Business Media.
  • Trefethen and Bau III, (1997) Trefethen, L. N. and Bau III, D. (1997). Numerical linear algebra, vol. 50.
  • Tuo and Wang, (2020) Tuo, R. and Wang, W. (2020). Kriging prediction with isotropic Matérn correlations: Robustness and experimental designs. The Journal of Machine Learning Research, 21(1):7604–7641.
  • Tuo and Wu, (2015) Tuo, R. and Wu, C. J. (2015). Efficient calibration for imperfect computer models. The Annals of Statistics, 43(6):2331 – 2352.
  • Ullrich, (2008) Ullrich, T. (2008). Smolyak’s algorithm, sampling on sparse grids and Sobolev spaces of dominating mixed smoothness. East Journal of Approximations, 14(1):1.
  • Van der Vorst, (2003) Van der Vorst, H. A. (2003). Iterative Krylov methods for large linear systems. Number 13. Cambridge University Press.
  • Wang et al., (2019) Wang, K., Pleiss, G., Gardner, J., Tyree, S., Weinberger, K. Q., and Wilson, A. G. (2019). Exact Gaussian processes on a million data points. Advances in neural information processing systems, 32.
  • Wang et al., (2020) Wang, W., Tuo, R., and Jeff Wu, C. (2020). On prediction properties of kriging: Uniform error bounds and robustness. Journal of the American Statistical Association, 115(530):920–930.
  • (94) Wenger, J., Pleiss, G., Hennig, P., Cunningham, J., and Gardner, J. (2022a). Preconditioning for scalable Gaussian process hyperparameter optimization. In International Conference on Machine Learning, pages 23751–23780. PMLR.
  • (95) Wenger, J., Pleiss, G., Pförtner, M., Hennig, P., and Cunningham, J. P. (2022b). Posterior and computational uncertainty in Gaussian processes. Advances in Neural Information Processing Systems, 35:10876–10890.
  • Williams and Seeger, (2000) Williams, C. and Seeger, M. (2000). Using the Nyström method to speed up kernel machines. Advances in neural information processing systems, 13.
  • Williams and Rasmussen, (2006) Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA.
  • Wilson and Nickisch, (2015) Wilson, A. and Nickisch, H. (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International conference on machine learning, pages 1775–1784. PMLR.
  • Wilson et al., (2020) Wilson, J., Borovitskiy, V., Terenin, A., Mostowsky, P., and Deisenroth, M. (2020). Efficiently sampling functions from Gaussian process posteriors. In International Conference on Machine Learning, pages 10292–10302. PMLR.
  • Wilson et al., (2021) Wilson, J. T., Borovitskiy, V., Terenin, A., Mostowsky, P., and Deisenroth, M. P. (2021). Pathwise Conditioning of Gaussian Processes. J. Mach. Learn. Res., 22:105–1.
  • Xie and Xu, (2021) Xie, F. and Xu, Y. (2021). Bayesian projected calibration of computer models. Journal of the American Statistical Association, 116(536):1965–1982.
  • Yadav et al., (2022) Yadav, M., Sheldon, D. R., and Musco, C. (2022). Kernel Interpolation with Sparse Grids. Advances in Neural Information Processing Systems, 35:22883–22894.
  • Zafari et al., (2020) Zafari, S., Murashkina, M., Eerola, T., Sampo, J., Kälviäinen, H., and Haario, H. (2020). Resolving overlapping convex objects in silhouette images by concavity analysis and Gaussian process. Journal of Visual Communication and Image Representation, 73:102962.