
Gaussian Processes Sampling with Sparse Grids
under Additive Schwarz Preconditioner

Haoyuan Chen1  Rui Tuo1
1Department of Industrial and Systems Engineering, Texas A&M University
College Station, TX 77843
[email protected]
[email protected]
Submitted to the editors on January 31, 2024.
Abstract

Gaussian processes (GPs) are widely used in non-parametric Bayesian modeling and play an important role in various statistical and machine learning applications. In a variety of uncertainty quantification tasks, generating random sample paths of GPs is of interest. As GP sampling requires generating high-dimensional Gaussian random vectors, it is computationally challenging if a direct method, such as the Cholesky decomposition, is used. In this paper, we propose a scalable algorithm for sampling random realizations of the prior and posterior of GP models. The proposed algorithm leverages the inducing points approximation with sparse grids, as well as additive Schwarz preconditioners, which reduce the computational complexity and ensure fast convergence. We demonstrate the efficacy and accuracy of the proposed method through a series of experiments and comparisons with other recent works.

Keywords— Gaussian processes, sparse grids, additive Schwarz preconditioner

1 Introduction

Gaussian processes (GPs) hold significant importance in various fields, particularly in statistical and machine learning applications (Rasmussen, 2003; Santner et al., 2003; Cressie, 2015), due to their unique properties and capabilities. As probabilistic models, GPs are highly flexible and able to adapt to a wide range of data patterns, including those with complex, nonlinear relationships. GPs have been successfully applied in numerous domains including regression (O’Hagan, 1978; Bishop et al., 1995; MacKay et al., 2003), classification (Kuss et al., 2005; Nickisch and Rasmussen, 2008; Hensman et al., 2015), time series analysis (Girard et al., 2002; Roberts et al., 2013), spatial data analysis (Banerjee et al., 2008; Nychka et al., 2015), Bayesian networks (Neal, 2012), optimization (Srinivas et al., 2009), and more. This wide applicability showcases their versatility in handling different types of problems. As the size of the dataset increases, however, GP inference becomes computationally demanding because of operations on the covariance matrix such as inversion and determinant computation. These computational challenges have motivated the development of various approximation methods to make GPs more scalable. Recent advances in scalable GP regression include the Nyström approximation (Quinonero-Candela and Rasmussen, 2005; Titsias, 2009; Hensman et al., 2013), random Fourier features (Rahimi and Recht, 2007; Hensman et al., 2018), local approximation (Gramacy and Apley, 2015; Cole et al., 2021), structured kernel interpolation (Wilson and Nickisch, 2015; Stanton et al., 2021), state-space formulation (Hartikainen and Särkkä, 2010; Grigorievskiy et al., 2017; Nickisch et al., 2018), Krylov methods (Gardner et al., 2018; Wang et al., 2019), Vecchia approximation (Katzfuss and Guinness, 2021; Katzfuss et al., 2022), sparse representation (Ding et al., 2021; Chen et al., 2022; Chen and Tuo, 2022), etc.

The focus of this article is on generating samples from GPs, which includes sampling from both prior and posterior processes. GP sampling is a valuable tool in statistical modeling and machine learning, and is used in various applications, such as Bayesian optimization (Snoek et al., 2012; Frazier, 2018b; Frazier, 2018a), computer model calibration (Kennedy and O’Hagan, 2001; Tuo and Wu, 2015; Plumlee, 2017; Xie and Xu, 2021), reinforcement learning (Kuss and Rasmussen, 2003; Engel et al., 2005; Grande et al., 2014), and uncertainty quantification (Murray-Smith and Pearlmutter, 2004; Marzouk and Najm, 2009; Teckentrup, 2020). Practically, sampling from GPs involves generating random vectors from a multivariate normal distribution after discretizing the input space. However, sampling from GPs poses computational challenges, particularly when dealing with large datasets, due to the need to factorize large covariance matrices, which is an \mathcal{O}(n^{3}) operation for n inputs. There has been much research on scalable GP regression, yet sampling methodologies remain relatively unexplored.

Scalable sampling techniques for GPs are relatively rare in the existing literature. A notable recent development in the field is the work of Wilson et al. (2020), who introduced an efficient sampling method known as decoupled sampling. This approach utilizes Matheron’s rule and combines the inducing points approximation with random Fourier features. Furthermore, Wilson et al. (2021) expanded this concept to pathwise conditioning based on Matheron’s update. This method, which requires only sampling from GP priors, has proven to be a powerful tool for both understanding and implementing GPs. Inspired by this work, Maddox et al. (2021) applied Matheron’s rule to multi-task GPs for Bayesian optimization. Additionally, Nguyen et al. (2021) applied such techniques to tighten the variance and enhance GP posterior sampling. Wenger et al. (2022b) introduced IterGP, a novel method that ensures consistent estimation of combined uncertainties and facilitates efficient GP sampling by computing a low-rank approximation of the covariance matrix. Lin et al. (2023) developed a method for generating GP posterior samples using stochastic gradient descent, targeting the complementary problem of approximating these samples in \mathcal{O}(n) time. It is important to note, however, that none of these methods have explored sampling within a structured design, an area that possesses many inherent properties.

In this study, we propose novel algorithms for efficient sampling from GP priors and GP regression posteriors by utilizing inducing point approximations on sparse grids. We show that the proposed method can reduce the time complexity to \mathcal{O}\big{(}(\eta+n)n_{sg}\big{)} for input data of size n and inducing points on sparse grids of level \eta and size n_{sg}. This results in a linear-time sampling process, which is particularly effective when the size of the sparse grid is kept reasonably low. Specifically, our work makes the following contributions:

  • We propose a novel and scalable sampling approach for Gaussian processes using product kernels on multi-dimensional points. This approach leverages the inducing points approximation on sparse grids, and its computational time scales linearly with the product of the number of sampling points and the size of the sparse grid.

  • Our method provides a 2-Wasserstein distance error bound with a convergence rate of \mathcal{O}(n_{sg}^{-\nu}(\log n_{sg})^{(\nu+2)(d-1)+d+1}) for d-dimensional sparse grids of size n_{sg} and a product Matérn kernel of smoothness \nu.

  • We develop a two-level additive Schwarz preconditioner to accelerate the convergence of conjugate gradient during posterior sampling from GPs.

  • We illustrate the effectiveness of our algorithm in real-world scenarios by applying it to areas such as Bayesian optimization and dynamical system problems.

The structure of the paper is as follows: Background information is presented in Section 2, our methodology is detailed in Section 3, experimental results are discussed in Section 4, and the paper concludes with discussions in Section 5. Detailed theorems along with their proofs are provided in Appendix A, and supplementary plots can be found in Appendix B.

2 Background

This section provides an overview of the foundational concepts relevant to our proposed method. Section 2.1 introduces GPs, GP regression, and the fundamental methodology for GP sampling. The underlying assumptions of our work are outlined in Section 2.2. Additionally, detailed explanations of the inducing points approximation and sparse grids, which are key components of our sampling method, are presented in Section 2.3 and Section 2.4, respectively.

2.1 GP review

GPs

A GP is a collection of random variables uniquely specified by its mean function μ()\mu(\cdot) and kernel function K(,)K(\cdot,\cdot), such that any finite collection of those random variables have a joint Gaussian distribution (Williams and Rasmussen, , 2006). We denote the corresponding GP as 𝒢𝒫(μ(),K(,))\mathcal{GP}(\mu(\cdot),K(\cdot,\cdot)).

GP regression

Let f:df:\mathbb{R}^{d}\rightarrow\mathbb{R} be an unknown function. Suppose we are given a set of training points 𝐗={𝐱i}i=1n\mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n} and the observations 𝐲=(y1,,yn)\mathbf{y}=(y_{1},\ldots,y_{n})^{\top} where yi=f(𝐱i)+ϵiy_{i}=f(\mathbf{x}_{i})+\epsilon_{i} with the i.i.d. noise ϵi𝒩(0,σϵ2)\epsilon_{i}\sim\mathcal{N}(0,\sigma_{\epsilon}^{2}). The GP regression imposes a GP prior over the latent function as f()𝒢𝒫(μ(),K(,))f(\cdot)\sim\mathcal{GP}(\mu(\cdot),K(\cdot,\cdot)). Then the posterior distribution at mm test points 𝐗={𝐱i}i=1m\mathbf{X}^{*}=\{\mathbf{x}_{i}^{*}\}_{i=1}^{m} is

p(f(𝐗)|𝐲)=𝒩(𝝁|𝐲,𝐊,|𝐲),p(f(\mathbf{X}^{*})|\mathbf{y})=\mathcal{N}(\bm{\mu}_{*|\mathbf{y}},\mathbf{K}_{*,*|\mathbf{y}}), (1)

with

𝝁|𝐲=𝝁+𝐊,𝐗[𝐊𝐗,𝐗+σϵ2𝐈n]1(𝐲𝝁𝐗),\bm{\mu}_{*|\mathbf{y}}=\bm{\mu}_{*}+\mathbf{K}_{*,\mathbf{X}}\big{[}\mathbf{K}_{\mathbf{X},\mathbf{X}}+\sigma_{\epsilon}^{2}\mathbf{I}_{n}\big{]}^{-1}\left(\mathbf{y}-\bm{\mu}_{\mathbf{X}}\right), (2a)
𝐊,|𝐲=𝐊,𝐊,𝐗[𝐊𝐗,𝐗+σϵ2𝐈n]1𝐊𝐗,,\mathbf{K}_{*,*|\mathbf{y}}=\mathbf{K}_{*,*}-\mathbf{K}_{*,\mathbf{X}}\big{[}\mathbf{K}_{\mathbf{X},\mathbf{X}}+\sigma_{\epsilon}^{2}\mathbf{I}_{n}\big{]}^{-1}\mathbf{K}_{\mathbf{X},*}, (2b)

where σϵ2>0\sigma_{\epsilon}^{2}>0 is the noise variance, 𝐈n\mathbf{I}_{n} is a n×nn\times n identity matrix, 𝐊,𝐗=K(𝐗,𝐗)=[K(𝐱i,𝐱j)]i=1,j=1i=m,j=n=K(𝐗,𝐗)=𝐊𝐗,\mathbf{K}_{*,\mathbf{X}}=K(\mathbf{X}^{*},\mathbf{X})=\big{[}K(\mathbf{x}_{i}^{*},\mathbf{x}_{j})\big{]}_{i=1,j=1}^{i=m,j=n}=K(\mathbf{X},\mathbf{X}^{*})^{\top}=\mathbf{K}_{\mathbf{X},*}^{\top}, 𝐊𝐗,𝐗=[K(𝐱i,𝐱j)]i,j=1n\mathbf{K}_{\mathbf{X},\mathbf{X}}=\big{[}K(\mathbf{x}_{i},\mathbf{x}_{j})\big{]}_{i,j=1}^{n}, 𝐊,=[K(𝐱i,𝐱j)]i,j=1m\mathbf{K}_{*,*}=\big{[}K(\mathbf{x}_{i}^{*},\mathbf{x}_{j}^{*})\big{]}_{i,j=1}^{m}, 𝝁𝐗=(μ(𝐱1),,μ(𝐱n))\bm{\mu}_{\mathbf{X}}=\big{(}\mu(\mathbf{x}_{1}),\cdots,\mu(\mathbf{x}_{n})\big{)}^{\top} and 𝝁=(μ(𝐱1),,μ(𝐱m))\bm{\mu}_{*}=\big{(}\mu(\mathbf{x}_{1}^{*}),\cdots,\mu(\mathbf{x}_{m}^{*})\big{)}^{\top}.

Sampling from GPs

The goal is to sample a function f()𝒢𝒫(μ(),K(,))f(\cdot)\sim\mathcal{GP}(\mu(\cdot),K(\cdot,\cdot)). To represent this in a finite form, we discretize the input space and focus on the function values at a set of grid points 𝐙={𝐳i}i=1ns\mathbf{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n_{s}}. Our goal is to generate samples 𝐟𝐙=(f(𝐳1),,f(𝐳ns))\mathbf{f}_{\mathbf{Z}}=\big{(}f(\mathbf{z}_{1}),\cdots,f(\mathbf{z}_{n_{s}})\big{)}^{\top} from the multivariate normal distribution 𝒩(μ(𝐙),K(𝐙,𝐙))=𝒩(𝝁𝐙,𝐊𝐙,𝐙)\mathcal{N}(\mu(\mathbf{Z}),K(\mathbf{Z},\mathbf{Z}))=\mathcal{N}(\bm{\mu}_{\mathbf{Z}},\mathbf{K}_{\mathbf{Z},\mathbf{Z}}). The standard method for sampling from a multivariate normal distribution involves the following steps: (1) generate a vector of samples 𝐟𝐈\mathbf{f}_{\mathbf{I}} whose entries are independent and identically distributed normal, i.e. 𝐟𝐈ns𝒩(𝟎,𝐈ns)\mathbf{f}_{\mathbf{I}_{n_{s}}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n_{s}}); (2) Use the Cholesky decomposition (Golub and Van Loan, , 2013) to factorize the covariance matrix 𝐊𝐙,𝐙\mathbf{K}_{\mathbf{Z},\mathbf{Z}} as 𝐋𝐙𝐋𝐙=𝐊𝐙,𝐙\mathbf{L}_{\mathbf{Z}}\mathbf{L}_{\mathbf{Z}}^{\top}=\mathbf{K}_{\mathbf{Z},\mathbf{Z}}; (3) generate the output sample 𝐟𝐙\mathbf{f}_{\mathbf{Z}} as

𝐟𝐙𝐋𝐙𝐟𝐈ns+𝝁𝐙.\mathbf{f}_{\mathbf{Z}}\leftarrow\mathbf{L}_{\mathbf{Z}}\mathbf{f}_{\mathbf{I}_{n_{s}}}+\bm{\mu}_{\mathbf{Z}}. (3)

Sampling a posterior GP follows a similar procedure. Suppose we have observations \mathbf{y}=\big{(}y_{1},\cdots,y_{n}\big{)}^{\top} on n distinct points \mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n}, where y_{i}=f(\mathbf{x}_{i})+\epsilon_{i} with noise \epsilon_{i}\sim\mathcal{N}(0,\sigma_{\epsilon}^{2}) and f(\cdot)\sim\mathcal{GP}(\mu(\cdot),K(\cdot,\cdot)). The goal is to generate posterior samples f(\mathbf{X}^{*})\mid\mathbf{y} at m test points \mathbf{X}^{*}=\{\mathbf{x}_{i}^{*}\}_{i=1}^{m}. Since the posterior samples f(\mathbf{X}^{*})\mid\mathbf{y}\sim\mathcal{N}(\bm{\mu}_{*|\mathbf{y}},\mathbf{K}_{*,*|\mathbf{y}}) also follow a multivariate normal distribution, as detailed in eq. 2a and eq. 2b in Section 2.1, we can apply the Cholesky decomposition \mathbf{L}_{*|\mathbf{y}}\mathbf{L}_{*|\mathbf{y}}^{\top}=\mathbf{K}_{*,*|\mathbf{y}} to generate the GP posterior sample \mathbf{f}_{*}=f(\mathbf{X}^{*})=\big{(}f(\mathbf{x}_{1}^{*}),\cdots,f(\mathbf{x}_{m}^{*})\big{)}^{\top} as

\mathbf{f}_{*}\mid\mathbf{y}\leftarrow\mathbf{L}_{*|\mathbf{y}}\mathbf{f}_{\mathbf{I}_{m}}+\bm{\mu}_{*|\mathbf{y}}, (4)

where \mathbf{f}_{\mathbf{I}_{m}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{m}).
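To make the procedure above concrete, the following sketch implements the standard Cholesky-based prior and posterior samplers of eq. 3 and eq. 4 in NumPy. It is illustrative only: the callables kern and mu and the jitter argument are our own assumed helper names, with kern(A, B) returning the cross-covariance matrix between two point sets.

```python
import numpy as np

def sample_prior(kern, mu, Z, n_draws=1, jitter=1e-8):
    """Draw GP prior samples at points Z via Cholesky factorization (eq. 3)."""
    K = kern(Z, Z) + jitter * np.eye(len(Z))          # K_{Z,Z} (jitter for stability)
    L = np.linalg.cholesky(K)                         # L_Z L_Z^T = K_{Z,Z}
    eps = np.random.randn(len(Z), n_draws)            # f_I ~ N(0, I)
    return mu(Z)[:, None] + L @ eps                   # f_Z = L_Z f_I + mu_Z

def sample_posterior(kern, mu, X, y, Xstar, noise_var, n_draws=1, jitter=1e-8):
    """Draw GP posterior samples at test points Xstar via eqs. (2) and (4)."""
    Kxx = kern(X, X) + noise_var * np.eye(len(X))     # K_{X,X} + sigma_eps^2 I_n
    Ksx = kern(Xstar, X)                              # K_{*,X}
    Kss = kern(Xstar, Xstar)                          # K_{*,*}
    mu_post = mu(Xstar) + Ksx @ np.linalg.solve(Kxx, y - mu(X))
    K_post = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T) + jitter * np.eye(len(Xstar))
    L = np.linalg.cholesky(K_post)                    # L_{*|y} L_{*|y}^T = K_{*,*|y}
    eps = np.random.randn(len(Xstar), n_draws)
    return mu_post[:, None] + L @ eps                 # eq. (4)
```

Both routines require a dense Cholesky factorization, which is the \mathcal{O}(n^{3}) cost that the rest of the paper seeks to avoid.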

2.2 Assumptions

A widely used kernel structure in multi-dimensional settings is “separable” or “product” kernels, which can be expressed as follows:

K(𝐱,𝐱)=σ2j=1dK0(j)(x(j),x(j)),K(\mathbf{x},\mathbf{x}^{\prime})=\sigma^{2}\prod_{j=1}^{d}K_{0}^{(j)}(x^{(j)},x^{\prime(j)}), (5)

where x(j)x^{(j)} and x(j)x^{\prime(j)} are the jj-th components of 𝐱\mathbf{x} and 𝐱\mathbf{x}^{\prime} respectively, K0(j)(,)K_{0}^{(j)}(\cdot,\cdot) denotes the one-dimensional correlation function for each dimension jj with the variance set to one, dd denotes the dimension of the input points. While the product of one-dimensional kernels does not share identical smoothness properties with multi-dimensional kernels, the assumption of separability is widely adopted in the field of spatio-temporal statistics (Gneiting et al., , 2006; Genton, , 2007; Constantinou et al., , 2017). Its widespread adoption is mainly due to its efficacy in enabling the development of valid space-time parametric models and in simplifying computational processes, particularly when handling large datasets for purposes such as inference and parameter estimation.

In this article, we consider the inputs to be dd-dimensional points 𝐗={𝐱i}i=1n\mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n} where each 𝐱id\mathbf{x}_{i}\in\mathbb{R}^{d} and d+d\in\mathbb{N}^{+}, and we focus on product Matérn kernels that employ identical base kernels across every dimension. Matérn kernel (Genton, , 2001) is a popular choice of covariance functions in spatial statistics (Diggle et al., , 2003), geostatistics (Curriero, , 2006; Pardo-Iguzquiza and Chica-Olmo, , 2008), image analysis (Zafari et al., , 2020; Okhrin et al., , 2020), and other applications. The Matérn kernel is defined as the following form (Stein, , 1999):

K(x,x)=σ221νΓ(ν)(2ν|xx|ω)νKν(2ν|xx|ω),K(x,x^{\prime})=\sigma^{2}\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\frac{\lvert x-x^{\prime}\rvert}{\omega}\right)^{\nu}K_{\nu}\left(\sqrt{2\nu}\frac{\lvert x-x^{\prime}\rvert}{\omega}\right), (6)

for any x,xx,x^{\prime}\in\mathbb{R}, where σ2>0\sigma^{2}>0 is the variance, ν>0\nu>0 is the smoothness parameter, ω>0\omega>0 is the lengthscale and KνK_{\nu} is the modified Bessel function of the second kind.
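As an illustration of the assumptions above, the following sketch evaluates the one-dimensional Matérn correlation of eq. 6 (with \sigma^{2}=1) and assembles the separable kernel of eq. 5 from identical base kernels. The function names and the use of SciPy’s modified Bessel function are our own choices, not part of the paper.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_1d(x, xp, nu=1.5, omega=1.0):
    """One-dimensional Matérn correlation (eq. 6 with sigma^2 = 1)."""
    r = np.abs(np.asarray(x)[:, None] - np.asarray(xp)[None, :])
    s = np.sqrt(2.0 * nu) * r / omega
    with np.errstate(invalid="ignore"):          # 0 * inf at r = 0 is patched below
        K = (2.0 ** (1.0 - nu) / gamma(nu)) * s ** nu * kv(nu, s)
    return np.where(s == 0.0, 1.0, K)            # the correlation tends to 1 as r -> 0

def product_matern(X, Xp, nu=1.5, omega=1.0, sigma2=1.0):
    """Separable (product) Matérn kernel of eq. 5 with identical base kernels."""
    X, Xp = np.atleast_2d(X), np.atleast_2d(Xp)
    K = np.ones((X.shape[0], Xp.shape[0]))
    for j in range(X.shape[1]):                  # multiply the 1-D correlations over dimensions
        K *= matern_1d(X[:, j], Xp[:, j], nu, omega)
    return sigma2 * K
```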

2.3 Inducing points approximation

Williams and Seeger (2000) first applied the Nyström approximation to kernel-based methods such as GPs. Following this, Quinonero-Candela and Rasmussen (2005) presented a novel unifying perspective and reformulated the prior by introducing an additional set of latent variables \mathbf{f}_{\mathbf{u}}, known as inducing variables. These latent variables correspond to a specific set of input locations \mathbf{u}=(u_{1},\cdots,u_{q})^{\top}, referred to as inducing inputs. In the subset of regressors (SoR) algorithm (Silverman, 1985), the kernel can be approximated using the following expression:

KSoR(𝐱i,𝐱j)=K(𝐱i,𝐮)K(𝐮,𝐮)1K(𝐮,𝐱j).K_{\text{SoR}}(\mathbf{x}_{i},\mathbf{x}_{j})=K(\mathbf{x}_{i},\mathbf{u})K(\mathbf{u},\mathbf{u})^{-1}K(\mathbf{u},\mathbf{x}_{j}). (7)

Employing the SoR approximation detailed in eq. 7, the form of the SoR predictive distribution qSoR(𝐟𝐲)q_{\rm SoR}(\mathbf{f}_{*}\mid\mathbf{y}) can be derived as follows:

qSoR(𝐟𝐲)=𝒩(𝝁+σϵ2𝐊,𝐮𝚺𝐮1𝐊𝐮,𝐗(𝐲𝝁𝐗),𝐊,𝐮𝚺𝐮1𝐊𝐮,),q_{\rm SoR}(\mathbf{f}_{*}\mid\mathbf{y})=\mathcal{N}(\bm{\mu}_{*}+\sigma_{\epsilon}^{-2}\mathbf{K}_{*,\mathbf{u}}\bm{\Sigma}_{\mathbf{u}}^{-1}\mathbf{K}_{\mathbf{u},\mathbf{X}}(\mathbf{y}-\bm{\mu}_{\mathbf{X}}),\;\mathbf{K}_{*,\mathbf{u}}\bm{\Sigma}_{\mathbf{u}}^{-1}\mathbf{K}_{\mathbf{u},*}), (8)

where we define 𝚺𝐮=[𝐊𝐮,𝐮+σϵ2𝐊𝐮,𝐗𝐊𝐗,𝐮]\bm{\Sigma}_{\mathbf{u}}=\big{[}\mathbf{K}_{\mathbf{u},\mathbf{u}}+\sigma_{\epsilon}^{-2}\mathbf{K}_{\mathbf{u},\mathbf{X}}\mathbf{K}_{\mathbf{X},\mathbf{u}}\big{]}, 𝐊,𝐮=K(𝐗,𝐮)=[K(𝐮,𝐗)]=𝐊𝐮,\mathbf{K}_{*,\mathbf{u}}=K(\mathbf{X}^{*},\mathbf{u})=\big{[}K(\mathbf{u},\mathbf{X}^{*})\big{]}^{\top}=\mathbf{K}_{\mathbf{u},*}^{\top}, 𝐊𝐮,𝐮=[K(𝐮i,𝐮s)]i,s=1q\mathbf{K}_{\mathbf{u},\mathbf{u}}=\big{[}K(\mathbf{u}_{i},\mathbf{u}_{s})\big{]}_{i,s=1}^{q}, 𝝁𝐗=(μ(𝐱1),,μ(𝐱n))\bm{\mu}_{\mathbf{X}}=\big{(}\mu(\mathbf{x}_{1}),\cdots,\mu(\mathbf{x}_{n})\big{)}^{\top}, and 𝝁=(μ(𝐱1),,μ(𝐱m))\bm{\mu}_{*}=\big{(}\mu(\mathbf{x}_{1}^{*}),\cdots,\mu(\mathbf{x}_{m}^{*})\big{)}^{\top}. Consequently, the computation of GP regression using inducing points can be achieved with a time complexity of 𝒪(nq2)\mathcal{O}(nq^{2}) for qq inducing points.
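A minimal sketch of the SoR predictive distribution in eq. 8 is given below; it assumes a generic kernel callable (for instance, the product Matérn sketched earlier) and illustrates why the cost is \mathcal{O}(nq^{2}): the only matrix that must be factorized has size q\times q.

```python
import numpy as np

def sor_posterior(kern, mu, X, y, Xstar, U, noise_var):
    """SoR predictive mean and covariance of eq. (8) with inducing inputs U."""
    Kuu = kern(U, U)                                  # q x q
    Kux = kern(U, X)                                  # q x n
    Ksu = kern(Xstar, U)                              # m x q
    Sigma_u = Kuu + Kux @ Kux.T / noise_var           # Sigma_u = K_uu + s^-2 K_uX K_Xu
    rhs = Kux @ (y - mu(X)) / noise_var               # s^-2 K_uX (y - mu_X)
    mean = mu(Xstar) + Ksu @ np.linalg.solve(Sigma_u, rhs)
    cov = Ksu @ np.linalg.solve(Sigma_u, Ksu.T)       # K_{*,u} Sigma_u^{-1} K_{u,*}
    return mean, cov
```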

To quantify the approximation power in the inducing points approach, we shall use the Wasserstein metric. Consider a separable complete metric space denoted as (M,d)(M,d). The pp-Wasserstein distance between two probability measures μ1\mu_{1} and μ2\mu_{2}, on the space MM, is defined as follows:

W_{p}(\mu_{1},\mu_{2})=\left(\inf_{\gamma\in\Gamma(\mu_{1},\mu_{2})}\mathbb{E}_{(x,y)\sim\gamma}d^{p}(x,y)\right)^{1/p},

for p\in[1,+\infty), where \Gamma(\mu_{1},\mu_{2}) is the set of joint measures on M\times M with marginals \mu_{1} and \mu_{2}. Theorem A.1 in Appendix A shows that the inducing points approximation of a GP converges rapidly to the target GP, with a convergence rate of \mathcal{O}(n^{-\nu}) for Matérn kernels of smoothness \nu, provided that the inducing points are uniformly distributed. It is worth noting that the inequality presented in eq. 30 is also applicable to multivariate GPs.

Therefore, the convergence rate for multivariate inducing points approximation can be established, assuming the corresponding convergence rate of GP regression is known. A general framework to investigate the latter problem is given by Wang et al., (2020). Inspired by Theorem A.1, we assume that the vector 𝐲=(y1,,yn)\mathbf{y}=(y_{1},\cdots,y_{n})^{\top} is observed on nn distinct dd-dimensional points 𝐗={𝐱i}i=1n\mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n}. The inducing inputs 𝐔\mathbf{U}, as defined in Section 2.3, are given by a full grid, denoted as 𝐔=×j=1d𝐮(j)\mathbf{U}=\times_{j=1}^{d}\mathbf{u}^{(j)}, where each 𝐮(j)\mathbf{u}^{(j)} is a set of qjq_{j} one-dimensional data points, A×BA\times B denotes the Cartesian product of sets AA and BB, accordingly the grid 𝐔\mathbf{U} has q=j=1dqjq=\prod_{j=1}^{d}q_{j} points. Given the separable structure as defined in eq. 5 and utilizing the full grid 𝐔\mathbf{U}, we can transform the multi-dimensional problem to a one-dimensional problem since the covariance matrix 𝐊𝐔,𝐔:=K(𝐔,𝐔)\mathbf{K}_{\mathbf{U},\mathbf{U}}\colon=K(\mathbf{U},\mathbf{U}) can be represented by Kronecker products (Henderson et al., , 1983; Saatçi, , 2012; Wilson and Nickisch, , 2015) of matrices over each input dimension jj:

𝐊𝐔,𝐔=K(𝐔,𝐔)=j=1dKj(𝐮(j),𝐮(j)).\mathbf{K}_{\mathbf{U},\mathbf{U}}=K(\mathbf{U},\mathbf{U})=\bigotimes_{j=1}^{d}K_{j}(\mathbf{u}^{(j)},\mathbf{u}^{(j)}). (9)

However, the number of inducing points q grows exponentially as d gets large; we therefore consider using sparse grids (Garcke, 2013; Rieger and Wendland, 2017) to mitigate the curse of dimensionality.
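The Kronecker structure in eq. 9 allows matrix-vector products with \mathbf{K}_{\mathbf{U},\mathbf{U}} to be computed one dimension at a time, without ever forming the full q\times q matrix. The sketch below (our own illustration, relying on NumPy's row-major reshaping convention) implements this standard vec-trick.

```python
import numpy as np

def kron_matvec(factors, x):
    """Compute (A_1 kron A_2 kron ... kron A_d) x without forming the Kronecker product.

    Each square factor A_j acts on one tensor mode of x; the cost is
    O(sum_j q_j * prod_j q_j) instead of O((prod_j q_j)^2)."""
    for A in factors:
        q = A.shape[0]
        X = x.reshape(q, -1)          # expose the current leading mode
        x = (A @ X).T.reshape(-1)     # apply A_j, then rotate the modes
    return x

# small check against the dense Kronecker product
A, B = np.random.randn(3, 3), np.random.randn(4, 4)
v = np.random.randn(12)
assert np.allclose(kron_matvec([A, B], v), np.kron(A, B) @ v)
```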

2.4 Sparse grids

A sparse grid is built on a nested sequence of one-dimensional points =𝒰j,0𝒰j,t𝒰j,t+1\varnothing=\mathcal{U}_{j,0}\subseteq\ldots\subseteq\mathcal{U}_{j,t}\subseteq\mathcal{U}_{j,t+1}\subseteq\ldots for each dimension jj, j=1,,dj=1,\ldots,d. A sparse grid with level η\eta and dimension dd is a design as follows:

\mathcal{U}(\eta,d)=\bigcup_{\vv{t}\in\mathbb{G}(\eta)}\mathcal{U}_{1,t_{1}}\times\mathcal{U}_{2,t_{2}}\times\cdots\times\mathcal{U}_{d,t_{d}}, (10)

where \vvt=(t1,,td)\vv{t}=(t_{1},\ldots,t_{d}), 𝔾(η)={\vvtd|j=1dtj=η}\mathbb{G}(\eta)=\{\vv{t}\in\mathbb{N}^{d}|\sum_{j=1}^{d}t_{j}=\eta\}. Figure 1 and Figure 9 show sparse grids corresponding to the hyperbolic cross points (bisection) for two dimensional points at levels η=3,4,5,6\eta=3,4,5,6 and for three dimensional points at η=4,5,6,7\eta=4,5,6,7 respectively. Hyperbolic cross points sets, defined over the interval [0,1][0,1], are detailed in eq. 11:

𝒰j,tj={i2tj:i=1,,2tj1}.\mathcal{U}_{j,t_{j}}=\left\{\frac{i}{2^{t_{j}}}:i=1,\ldots,2^{t_{j}}-1\right\}. (11)
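The construction in eq. 10 and eq. 11 can be reproduced directly; the sketch below (with illustrative function names) builds the nested one-dimensional hyperbolic cross sets and takes the union of their Cartesian products. Because the one-dimensional sets are nested and \mathcal{U}_{j,0} is empty, it suffices to take the union over multi-indices with |\vv{t}|=\eta and t_{j}\geq 1.

```python
import itertools
import numpy as np

def u_1d(level):
    """Nested 1-D hyperbolic cross points of eq. (11): {i / 2^level, i = 1..2^level - 1}."""
    if level == 0:
        return np.array([])
    return np.arange(1, 2 ** level) / 2.0 ** level

def sparse_grid(eta, d):
    """Sparse grid U(eta, d) of eq. (10) as a union of products of the nested 1-D sets."""
    points = set()
    # indices with a zero entry contribute nothing, so restrict to t_j >= 1, |t| = eta
    for t in itertools.product(range(1, eta + 1), repeat=d):
        if sum(t) != eta:
            continue
        for p in itertools.product(*[u_1d(tj) for tj in t]):
            points.add(p)        # coordinates i / 2^t are exact binary floats
    return np.array(sorted(points))

print(sparse_grid(3, 2))   # the level-3, 2-D hyperbolic cross design (cf. Figure 1)
```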

An operator on the sparse grid can be efficiently computed using Smolyak’s algorithm (Smolyak, , 1963; Ullrich, , 2008), which states that the predictor operator 𝒫(η,d)\mathcal{P}(\eta,d) on function ff with respect to the sparse grid 𝒰(η,d)\mathcal{U}(\eta,d) can be evaluated by the following formula:

𝒫(η,d)(f)=\vvt(η,d)(1)η|\vvt|(d1η|\vvt|)(j=1d𝒰j,tj)(f),\mathcal{P}(\eta,d)(f)=\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\Big{(}\bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}\Big{)}(f), (12)

where |\vv{t}|=\sum_{j=1}^{d}t_{j}, \mathbb{P}(\eta,d)=\{\vv{t}\in\mathbb{N}^{d}|\max(d,\eta-d+1)\leq|\vv{t}|\leq\eta\}, and \Big{(}\bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}\Big{)} represents the operator over the Kronecker product points \bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}. Plumlee (2014) and Yadav et al. (2022) also provided linear-time computation methods for GP regression based on Smolyak’s algorithm. Theorem A.3 in Appendix A demonstrates that for a d-dimensional GP, the inducing points approximation converges to the target GP at the rate of \mathcal{O}(n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}) for product Matérn kernels of smoothness \nu, given that the inducing points form a sparse grid \mathcal{U}(\eta,d) of level \eta and dimension d, as defined in eq. 10.

Refer to caption
Figure 1: Sparse grids 𝒰(η,d)\mathcal{U}(\eta,d) over [0,1]d[0,1]^{d} of level η=3,4,5,6\eta=3,4,5,6 and dimension d=2d=2.

3 Sampling with sparse grids

In this section, we detail the proposed algorithms for sampling from multi-dimensional GPs with product kernels using sparse grids.

3.1 Prior

Sampling algorithm

Our goal is to sample from the GP prior f(\cdot)\sim\mathcal{GP}(\mu(\cdot),K(\cdot,\cdot)) over d-dimensional grid points \mathbf{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n_{s}}, \mathbf{z}_{i}\in\mathbb{R}^{d}. According to the SoR model in eq. 7, the distribution of \mathbf{f}_{\mathbf{Z}}=f(\mathbf{Z})=(f(\mathbf{z}_{1}),\cdots,f(\mathbf{z}_{n_{s}}))^{\top} can be approximated by \mathbf{K}_{\mathbf{Z},\mathcal{U}}\mathbf{K}_{\mathcal{U},\mathcal{U}}^{-1}\mathbf{f}_{\mathcal{U}} with \mathbf{f}_{\mathcal{U}}\sim\mathcal{N}(\mathbf{0},\mathbf{K}_{\mathcal{U},\mathcal{U}}), where \mathbf{f}_{\mathcal{U}}=f(\mathcal{U}), and \mathcal{U}=\mathcal{U}(\eta,d) is a sparse grid design with level \eta and dimension d defined in eq. 10. The problem then reduces to generating samples from \mathbf{K}_{\mathcal{U},\mathcal{U}}^{-1}\mathbf{f}_{\mathcal{U}}\sim\mathcal{N}(\mathbf{0},\mathbf{K}_{\mathcal{U},\mathcal{U}}^{-1}). Applying Smolyak’s algorithm to GP sampling (see Theorem A.4 in Appendix A), we infer that

𝐊𝒰,𝒰1𝐟𝒰=d\vvt(η,d)(1)η|\vvt|(d1η|\vvt|)j=1dchol(K(𝒰j,tj,𝒰j,tj)1)𝐟𝐈n\vvtδx(𝒰\vvt),\mathbf{K}_{\mathcal{U},\mathcal{U}}^{-1}\mathbf{f}_{\mathcal{U}}\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny d}}}{=}}\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\bigotimes_{j=1}^{d}\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)}\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\delta_{x}(\mathcal{U}_{\vv{t}}), (13)
δx(𝒰\vvt)={0if x𝒰𝒰\vvt1if x𝒰\vvt,\delta_{x}(\mathcal{U}_{\vv{t}})=\begin{cases}0\quad&\text{if $x\in\mathcal{U}\setminus\mathcal{U}_{\vv{t}}$}\\ 1\quad&\text{if $x\in\mathcal{U}_{\vv{t}}$}\end{cases}, (14)

where chol(K(𝒰j,tj,𝒰j,tj)1)\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)} is the Cholesky decomposition of K(𝒰j,tj,𝒰j,tj)1K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}, 𝒰=𝒰(η,d)\mathcal{U}=\mathcal{U}(\eta,d) is a sparse grid design with level η\eta and dimension dd defined in eq. 10, 𝐟𝐈n\vvt𝒩(𝟎,𝐈n\vvt)\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n_{\vv{t}}}) is standard multivariate normal distributed, 𝐈n\vvt\mathbf{I}_{n_{\vv{t}}} is a n\vvt×n\vvtn_{\vv{t}}\times n_{\vv{t}} identity matrix, n\vvtn_{\vv{t}} == j=1dtj\prod_{j=1}^{d}t_{j} is the size of the full grid design 𝒰\vvt=𝒰1,t1×𝒰2,t2××𝒰d,td\mathcal{U}_{\vv{t}}=\mathcal{U}_{1,t_{1}}\times\mathcal{U}_{2,t_{2}}\times\cdots\times\mathcal{U}_{d,t_{d}} associated with \vvt=(t1,,td)\vv{t}=(t_{1},\ldots,t_{d}), δx(𝒰\vvt)\delta_{x}(\mathcal{U}_{\vv{t}}) is a Dirac measure on the set 𝒰\mathcal{U}. Therefore, the GP prior 𝐟𝐙\mathbf{f}_{\mathbf{Z}} over grid points 𝐙\mathbf{Z} can be approximately sampled via

\mathbf{f}_{\mathbf{Z}}\leftarrow\mathbf{K}_{\mathbf{Z},\mathcal{U}}\left(\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\bigotimes_{j=1}^{d}\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)}\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\delta_{x}(\mathcal{U}_{\vv{t}})\right)+\bm{\mu}_{\mathbf{Z}}, (15)

where 𝐟𝐈n\vvt𝒩(𝟎,𝐈n\vvt)\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n_{\vv{t}}}) is standard multivariate normal distributed, δx(𝒰\vvt)\delta_{x}(\mathcal{U}_{\vv{t}}) is a Dirac measure on set 𝒰\mathcal{U} as defined in eq. 14, chol(K(𝒰j,tj,𝒰j,tj)1)\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)} is the Cholesky decomposition of K(𝒰j,tj,𝒰j,tj)1K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}.
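For concreteness, the sketch below assembles one draw of eq. 15 for a zero-mean product Matérn prior. It is a sketch under our own assumptions, reusing the helper functions matern_1d, product_matern, u_1d, sparse_grid, and kron_matvec from the earlier snippets, and forming the small one-dimensional factors \text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)} explicitly.

```python
import itertools
import numpy as np
from math import comb

def smolyak_indices(eta, d):
    """Multi-indices in P(eta, d) with their Smolyak combination coefficients (eq. 12)."""
    # entries t_j >= 1, since levels t_j = 0 correspond to empty 1-D sets
    for t in itertools.product(range(1, eta + 1), repeat=d):
        if max(d, eta - d + 1) <= sum(t) <= eta:
            coef = (-1) ** (eta - sum(t)) * comb(d - 1, eta - sum(t))
            yield t, coef

def sample_prior_insg(Z, eta, d, nu=1.5, omega=1.0, seed=0):
    """One approximate zero-mean GP prior draw at points Z (shape (n_s, d)) via eq. (15)."""
    rng = np.random.default_rng(seed)
    U = sparse_grid(eta, d)                              # inducing points, size n_sg
    index = {tuple(p): i for i, p in enumerate(U)}       # position of each point within U
    w = np.zeros(len(U))                                 # accumulates K_{U,U}^{-1} f_U
    for t, coef in smolyak_indices(eta, d):
        grids = [u_1d(tj) for tj in t]
        # per-dimension factors chol(K(U_{j,t_j}, U_{j,t_j})^{-1})
        Ls = [np.linalg.cholesky(np.linalg.inv(matern_1d(g, g, nu, omega)))
              for g in grids]
        n_t = int(np.prod([len(g) for g in grids]))
        draw = kron_matvec(Ls, rng.standard_normal(n_t))
        # scatter onto the sparse grid: the Dirac measure delta_x(U_t) of eq. (14)
        for val, p in zip(draw, itertools.product(*grids)):
            w[index[p]] += coef * val
    return product_matern(Z, U, nu=nu, omega=omega) @ w  # K_{Z,U} (K_{U,U}^{-1} f_U)
```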

Complexity

Since the Kronecker matrix-vector product can be written in the form of the vectorization operator (see Proposition 3.7 in Kolda, 2006), the Kronecker matrix-vector product in eq. 15 only costs \mathcal{O}(\sum_{j=1}^{d}t_{j}\prod_{j=1}^{d}t_{j})=\mathcal{O}(\eta n_{\vv{t}}) time; hence the time complexity of the prior sampling via eq. 15 is \mathcal{O}\Big{(}(\eta+n_{s})\sum_{\vv{t}\in\mathbb{P}(\eta,d)}n_{\vv{t}}\Big{)}=\mathcal{O}\Big{(}(\eta+n_{s})n_{sg}\Big{)}, where d is the dimension of the sampling points \mathbf{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n_{s}}, n_{s} is the size of the sampling points, n_{sg} is the size of the inducing points on a sparse grid \mathcal{U}(\eta,d), n_{\vv{t}}=\prod_{j=1}^{d}t_{j} is the size of the full grid design \mathcal{U}_{\vv{t}}=\times_{j=1}^{d}\mathcal{U}_{j,t_{j}}, \vv{t}=(t_{1},\ldots,t_{d}), and \mathbb{P}(\eta,d)=\{\vv{t}\in\mathbb{N}^{d}|\max(d,\eta-d+1)\leq\sum_{j=1}^{d}t_{j}\leq\eta\}.

Note that we can obtain the upper bound of the number of points in the Smolyak type sparse grids. According to Lemma 3 in (Rieger and Wendland, , 2017), given that |𝒰j,tj|2tj1|\mathcal{U}_{j,t_{j}}|\leq 2^{t_{j}}-1, when sparse grids are constructed on hyperbolic cross points as defined in eq. 11, the cardinality of a sparse grid 𝒰(η,d)\mathcal{U}(\eta,d) defined in eq. 10, is thereby bounded by

n_{sg}=|\mathcal{U}(\eta,d)|\leq 1^{d}\big{(}\frac{2-1}{2}\big{)}^{d}2^{\eta}\binom{\eta-1}{d-1}\min\{\frac{2}{2-1},\frac{\eta}{d}\}
\leq 2^{\eta-d}\frac{\eta^{d-1}}{(d-1)!}\min\{2,\frac{\eta}{d}\}
\leq 2^{\eta-d}\eta^{d}/(d!). (16)

3.2 Posterior

Matheron’s rule

Sampling from posterior GPs can be effectively conducted using Matheron’s rule. Matheron’s rule was popularized in the geostatistics field by Journel and Huijbregts (1976), and recently rediscovered by Wilson et al. (2020), who leveraged it to develop a novel approach for GP posterior sampling. Matheron’s rule can be described as follows: Consider \mathbf{a} and \mathbf{b} as jointly Gaussian random vectors. Then the random vector \mathbf{a}, conditional on \mathbf{b}=\bm{\beta}, is equal in distribution to

(𝐚𝐛=𝜷)=d𝐚+Cov(𝐚,𝐛)Cov(𝐛,𝐛)1(𝜷𝐛),(\mathbf{a}\mid\mathbf{b}=\bm{\beta})\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny d}}}{=}}\mathbf{a}+\mathrm{Cov}(\mathbf{a},\mathbf{b})\mathrm{Cov}(\mathbf{b},\mathbf{b})^{-1}(\bm{\beta}-\mathbf{b}), (17)

where Cov(𝐚,𝐛)\mathrm{Cov}(\mathbf{a},\mathbf{b}) denotes the covariance of (𝐚,𝐛)(\mathbf{a},\mathbf{b}).

According to eq. 17, exact GP posterior samplers can be obtained using two jointly Gaussian random variables. Following this technique, we derive

𝐟𝐲=d𝐟+𝐊,𝐗[𝐊𝐗,𝐗+σϵ2𝐈n]1(𝐲𝐟𝐗ϵ),\mathbf{f}_{*}\mid\mathbf{y}\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny d}}}{=}}\mathbf{f}_{*}+\mathbf{K}_{*,\mathbf{X}}\big{[}\mathbf{K}_{\mathbf{X},\mathbf{X}}+\sigma_{\epsilon}^{2}\mathbf{I}_{n}\big{]}^{-1}(\mathbf{y}-\mathbf{f}_{\mathbf{X}}-\epsilon), (18)

where 𝐟𝐗=f(𝐗)\mathbf{f}_{\mathbf{X}}=f(\mathbf{X}) and 𝐟=f(𝐗)\mathbf{f}_{*}=f(\mathbf{X}^{*}) are jointly Gaussian random variables from the prior distribution, noise variates ϵ𝒩(𝟎,σϵ2𝐈n)\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma_{\epsilon}^{2}\mathbf{I}_{n}). Clearly, the joint distribution of (𝐟𝐗,𝐟)(\mathbf{f}_{\mathbf{X}},\mathbf{f}_{*}) follows the multivariate normal distribution as follows:

(𝐟𝐗,𝐟)𝒩([𝝁𝐗𝝁],[𝐊𝐗,𝐗𝐊𝐗,𝐊,𝐗𝐊,]).(\mathbf{f}_{\mathbf{X}},\mathbf{f}_{*})\sim\mathcal{N}\Bigg{(}\begin{bmatrix}\bm{\mu}_{\mathbf{X}}\\ \bm{\mu}_{*}\\ \end{bmatrix},\begin{bmatrix}\mathbf{K}_{\mathbf{X},\mathbf{X}}&\mathbf{K}_{\mathbf{X},*}\\ \mathbf{K}_{*,\mathbf{X}}&\mathbf{K}_{*,*}\end{bmatrix}\Bigg{)}. (19)
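The following sketch illustrates Matheron’s update in its exact (non-scalable) form: it draws the joint prior of eq. 19 by a dense Cholesky factorization and then applies eq. 18. The callables kern and mu are assumed helper names as before.

```python
import numpy as np

def matheron_posterior_sample(kern, mu, X, y, Xstar, noise_var, jitter=1e-8):
    """One exact posterior draw at Xstar via Matheron's update, eqs. (18)-(19)."""
    n, m = len(X), len(Xstar)
    joint = np.vstack([X, Xstar])
    K = kern(joint, joint) + jitter * np.eye(n + m)      # covariance of (f_X, f_*)
    f = mu(joint) + np.linalg.cholesky(K) @ np.random.randn(n + m)
    f_X, f_star = f[:n], f[n:]                           # joint prior draw
    eps = np.sqrt(noise_var) * np.random.randn(n)        # noise variate
    Kxx = kern(X, X) + noise_var * np.eye(n)
    update = kern(Xstar, X) @ np.linalg.solve(Kxx, y - f_X - eps)
    return f_star + update                               # eq. (18)
```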

Sampling algorithm

We can apply SoR approximation eq. 7 to Matheron’s rule eq. 17 and obtain

𝐟𝐲d𝐟+σϵ2𝐊,𝒰𝚺𝒰1𝐊𝒰,𝐗(𝐲𝐟𝐗ϵ),\mathbf{f}_{*}\mid\mathbf{y}\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny d}}}{\approx}}\mathbf{f}_{*}+\sigma_{\epsilon}^{-2}\mathbf{K}_{*,\mathcal{U}}\bm{\Sigma}_{\mathcal{U}}^{-1}\mathbf{K}_{\mathcal{U},\mathbf{X}}(\mathbf{y}-\mathbf{f}_{\mathbf{X}}-\bm{\epsilon}), (20)

where 𝚺𝒰=[𝐊𝒰,𝒰+σϵ2𝐊𝒰,𝐗𝐊𝐗,𝒰]\bm{\Sigma}_{\mathcal{U}}=\big{[}\mathbf{K}_{\mathcal{U},\mathcal{U}}+\sigma_{\epsilon}^{-2}\mathbf{K}_{\mathcal{U},\mathbf{X}}\mathbf{K}_{\mathbf{X},\mathcal{U}}\big{]}, 𝐟𝐗=f(𝐗)\mathbf{f}_{\mathbf{X}}=f(\mathbf{X}) and 𝐟=f(𝐗)\mathbf{f}_{*}=f(\mathbf{X}^{*}) are jointly Gaussian random variables from the prior distribution, noise variates ϵ𝒩(𝟎,σϵ2𝐈n)\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma_{\epsilon}^{2}\mathbf{I}_{n}). The prior resulting from the SoR approximation can be easily derived from eq. 7, yielding the following expression:

qSoR(𝐟𝐗,𝐟)=𝒩(𝝁𝐗+,𝐊𝐗+,𝒰𝐊𝒰,𝒰1𝐊𝒰,𝐗+),q_{\rm SoR}(\mathbf{f}_{\mathbf{X}},\mathbf{f}_{*})=\mathcal{N}(\bm{\mu}_{\mathbf{X}+*},\;\mathbf{K}_{\mathbf{X}+*,\mathcal{U}}\mathbf{K}_{\mathcal{U},\mathcal{U}}^{-1}\mathbf{K}_{\mathcal{U},\mathbf{X}+*}), (21)

where 𝝁𝐗+=[𝝁𝐗,𝝁]\bm{\mu}_{\mathbf{X}+*}=[\bm{\mu}_{\mathbf{X}},\bm{\mu}_{*}]^{\top}, 𝐊𝐗+,𝒰=[K(𝒰,𝐗),K(𝒰,𝐗)]\mathbf{K}_{\mathbf{X}+*,\mathcal{U}}=[K(\mathcal{U},\mathbf{X}),K(\mathcal{U},\mathbf{X}^{*})]^{\top}. The same as above in section 3.1, (𝐟𝐗,𝐟)(\mathbf{f}_{\mathbf{X}},\mathbf{f}_{*}) can be sampled from

(\mathbf{f}_{\mathbf{X}},\mathbf{f}_{*})\leftarrow\mathbf{K}_{\mathbf{X}+*,\mathcal{U}}\left(\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\bigotimes_{j=1}^{d}\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)}\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\delta_{x}(\mathcal{U}_{\vv{t}})\right)+\bm{\mu}_{\mathbf{X}+*}, (22)

where 𝐟𝐈n\vvt\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}, δx(𝒰\vvt)\delta_{x}(\mathcal{U}_{\vv{t}}), and chol(K(𝒰j,tj,𝒰j,tj)1)\text{chol}\Big{(}K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})^{-1}\Big{)} are the same as that defined in section 3.1.

Conjugate gradient

Algorithm 1 The Preconditioned Conjugate Gradient (PCG) Algorithm.
1:  Input: matrix 𝚺\bm{\Sigma}, preconditioning matrix 𝐏\mathbf{P}, vector 𝐯\mathbf{v}, convergence threshold ϵ\epsilon, initial vector 𝐱0\mathbf{x}_{0}, maximum number of iterations TT
2:  Output: 𝐱𝚺1𝐯\mathbf{x}_{*}\approx\bm{\Sigma}^{-1}\mathbf{v}
3:  𝐫0:=𝐯Σ𝐱0\mathbf{r}_{0}:=\mathbf{v}-\Sigma\mathbf{x}_{0}; 𝐳0:=𝐏1𝐫0\mathbf{z}_{0}:=\mathbf{P}^{-1}\mathbf{r}_{0}
4:  𝐩0:=𝐳0\mathbf{p}_{0}:=\mathbf{z}_{0}
5:  for i=0i=0 to TT do
6:     αi:=𝐫iT𝐳i𝐩iT𝚺𝐩i\alpha_{i}:=\frac{\mathbf{r}_{i}^{T}\mathbf{z}_{i}}{\mathbf{p}_{i}^{T}\bm{\Sigma}\mathbf{p}_{i}}
7:     𝐱i+1:=𝐱i+αi𝐩i\mathbf{x}_{i+1}:=\mathbf{x}_{i}+\alpha_{i}\mathbf{p}_{i}
8:     𝐫i+1:=𝐫iαi𝚺𝐩i\mathbf{r}_{i+1}:=\mathbf{r}_{i}-\alpha_{i}\bm{\Sigma}\mathbf{p}_{i}
9:     if 𝐫i+1<ϵ\|\mathbf{r}_{i+1}\|<\epsilon then
10:        return 𝐱:=𝐱i+1\mathbf{x}_{*}:=\mathbf{x}_{i+1}
11:     end if
12:     𝐳i+1:=𝐏1𝐫i+1\mathbf{z}_{i+1}:=\mathbf{P}^{-1}\mathbf{r}_{i+1}
13:     βi:=𝐫i+1T𝐳i+1𝐫iT𝐳i\beta_{i}:=\frac{\mathbf{r}_{i+1}^{T}\mathbf{z}_{i+1}}{\mathbf{r}_{i}^{T}\mathbf{z}_{i}}
14:     𝐩i+1:=𝐳i+1+βi𝐩i\mathbf{p}_{i+1}:=\mathbf{z}_{i+1}+\beta_{i}\mathbf{p}_{i}
15:  end for

The only remaining computationally demanding step in eq. 20 is the computation of 𝚺𝒰1𝐯\bm{\Sigma}_{\mathcal{U}}^{-1}\mathbf{v} with 𝐯=𝐊𝒰,𝐗(𝐲𝐟𝐗ϵ)\mathbf{v}=\mathbf{K}_{\mathcal{U},\mathbf{X}}(\mathbf{y}-\mathbf{f}_{\mathbf{X}}-\bm{\epsilon}), requiring a time complexity of 𝒪(nsg3)\mathcal{O}(n_{sg}^{3}) where nsg=|𝒰(η,d)|n_{sg}=|\mathcal{U}(\eta,d)| is the size of the sparse grid 𝒰(η,d)\mathcal{U}(\eta,d). To reduce this computational intensity, we consider using the conjugate gradient (CG) method (Hestenes et al., , 1952; Golub and Van Loan, , 2013), an iterative algorithm for solving linear systems efficiently via matrix-vector multiplications. Preconditioning (Trefethen and Bau III, , 1997; Demmel, , 1997; Saad, , 2003; Van der Vorst, , 2003) is a well-known tool for accelerating the convergence of the CG method, which introduces a symmetric positive-definite matrix 𝐏𝒰𝚺𝒰\mathbf{P}_{\mathcal{U}}\approx\bm{\Sigma}_{\mathcal{U}} such that 𝐏𝒰1𝚺𝒰\mathbf{P}_{\mathcal{U}}^{-1}\bm{\Sigma}_{\mathcal{U}} has a smaller condition number than 𝚺𝒰\bm{\Sigma}_{\mathcal{U}}. The entire scheme of the preconditioned conjugate gradient (PCG) takes the form in Algorithm 1.
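A direct Python transcription of Algorithm 1 is given below; \bm{\Sigma} and \mathbf{P}^{-1} are passed as callables (matvec and apply_Pinv, illustrative names) so that neither matrix needs to be formed or factorized densely.

```python
import numpy as np

def pcg(matvec, apply_Pinv, v, x0=None, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient (Algorithm 1) for Sigma x = v."""
    x = np.zeros_like(v) if x0 is None else x0.copy()
    r = v - matvec(x)                 # r_0 = v - Sigma x_0
    z = apply_Pinv(r)                 # z_0 = P^{-1} r_0
    p = z.copy()
    for _ in range(max_iter):
        Sp = matvec(p)
        alpha = (r @ z) / (p @ Sp)
        x = x + alpha * p
        r_new = r - alpha * Sp
        if np.linalg.norm(r_new) < tol:
            return x
        z_new = apply_Pinv(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x
```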

Preconditioner choice

The choice of the preconditioner \mathbf{P}_{\mathcal{U}} involves a trade-off between the time complexity of applying the \mathbf{P}_{\mathcal{U}}^{-1} operation and the condition number of \mathbf{P}_{\mathcal{U}}^{-1}\bm{\Sigma}_{\mathcal{U}}. Cutajar et al. (2016) first investigated several groups of preconditioners for the radial basis function kernel; Gardner et al. (2018) then proposed a specialized preconditioner for general kernels based on the pivoted Cholesky decomposition. Wenger et al. (2022a) analyzed and summarized convergence rates for arbitrary multi-dimensional kernels and multiple preconditioners. Due to the hierarchical structure of the sparse grid \mathcal{U}, instead of using common classes of preconditioners, we consider the two-level additive Schwarz (TAS) preconditioner (Toselli and Widlund, 2004) for the system matrix \bm{\Sigma}_{\mathcal{U}}, which is formed as an additive Schwarz (AS) preconditioner (Dryja and Widlund, 1994; Cai and Sarkis, 1999) coupled with an additive coarse space correction. Since the sparse grid \mathcal{U}(\eta,d) defined in eq. 10 is covered by a set of overlapping subdomains \{\mathcal{U}_{\vv{t}}\}_{\vv{t}\in\mathbb{G}(\eta)} where \mathbb{G}(\eta)=\{\vv{t}\in\mathbb{N}^{d}|\sum_{j=1}^{d}t_{j}=\eta\}, the TAS preconditioner for the system matrix \bm{\Sigma}_{\mathcal{U}} is defined as

𝐏TAS1:=\vvt𝔾(η)𝐒\vvt𝐏\vvt1𝐒\vvt+𝐒c𝐏c1𝐒c.\mathbf{P}_{\text{TAS}}^{-1}:=\sum_{\vv{t}\in\mathbb{G}(\eta)}\mathbf{S}_{\vv{t}}^{\top}\mathbf{P}_{\vv{t}}^{-1}\mathbf{S}_{\vv{t}}+\mathbf{S}_{c}^{\top}\mathbf{P}_{c}^{-1}\mathbf{S}_{c}. (23)

On the right-hand side of eq. 23, the first term is the one-level AS preconditioner, and the second term is the additive coarse space correction. \mathbf{S}_{\vv{t}}\in\mathbb{R}^{n_{\vv{t}}\times n_{sg}} is the selection matrix that restricts \mathcal{U} to \mathcal{U}_{\vv{t}}, and \mathbf{P}_{\vv{t}}=\mathbf{S}_{\vv{t}}\bm{\Sigma}_{\mathcal{U}}\mathbf{S}_{\vv{t}}^{\top}\in\mathbb{R}^{n_{\vv{t}}\times n_{\vv{t}}} is a symmetric positive definite sub-matrix defined on \mathcal{U}_{\vv{t}}. \mathbf{S}_{c}\in\mathbb{R}^{n_{c}\times n_{sg}} is the selection matrix whose columns span the coarse space, and \mathbf{P}_{c}=\mathbf{S}_{c}\bm{\Sigma}_{\mathcal{U}}\mathbf{S}_{c}^{\top}\in\mathbb{R}^{n_{c}\times n_{c}} is the coarse space matrix. n_{c} is the dimension of the coarse space, n_{sg}:=|\mathcal{U}(\eta,d)| is the size of the sparse grid \mathcal{U}(\eta,d), and n_{\vv{t}}:=|\mathcal{U}_{\vv{t}}|=\prod_{j=1}^{d}t_{j} is the size of the full grid design \mathcal{U}_{\vv{t}}=\times_{j=1}^{d}\mathcal{U}_{j,t_{j}}, \vv{t}=(t_{1},\ldots,t_{d}). Figure 2 illustrates an example of the selection matrices \{\mathbf{S}_{\vv{t}}\}_{\vv{t}\in\mathbb{G}(\eta)} for the sparse grid \mathcal{U}(\eta=3,d=2).
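The action of \mathbf{P}_{\text{TAS}}^{-1} in eq. 23 can be applied without forming the selection matrices explicitly, by indexing the rows and columns of \bm{\Sigma}_{\mathcal{U}} that belong to each subdomain. The sketch below is our own illustration, under the assumption that the coarse space is itself spanned by a subset of sparse grid points (as proposed later in this section); the returned callable can be passed directly as apply_Pinv to the PCG sketch above.

```python
import numpy as np

def make_tas_apply(Sigma, subdomain_idx, coarse_idx):
    """Return a callable applying the TAS preconditioner inverse of eq. (23).

    `subdomain_idx` lists, for each full grid U_t with |t| = eta, the positions of its
    points inside the sparse grid U; `coarse_idx` plays the role of the coarse space."""
    # pre-factorize the local matrices P_t = S_t Sigma S_t^T and the coarse matrix P_c
    local_factors = [(idx, np.linalg.cholesky(Sigma[np.ix_(idx, idx)]))
                     for idx in subdomain_idx]
    coarse_factor = np.linalg.cholesky(Sigma[np.ix_(coarse_idx, coarse_idx)])

    def solve_chol(L, b):
        return np.linalg.solve(L.T, np.linalg.solve(L, b))

    def apply_Pinv(r):
        out = np.zeros_like(r)
        for idx, L in local_factors:                              # one-level AS term
            out[idx] += solve_chol(L, r[idx])
        out[coarse_idx] += solve_chol(coarse_factor, r[coarse_idx])  # coarse correction
        return out

    return apply_Pinv
```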

Refer to caption
Figure 2: Selection matrices with respect to the sparse grid 𝒰(η=3,d=2)\mathcal{U}(\eta=3,d=2).

Some recent works (Toselli and Widlund, 2004; Spillane et al., 2014; Dolean et al., 2015; Al Daas et al., 2023) provide robust convergence estimates for the TAS preconditioner, which bound the condition number of \mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}} in terms of the number of distinct colors required to color the graph of \bm{\Sigma}_{\mathcal{U}} so that the spaces \mathrm{span}\{\mathbf{S}_{\vv{t}}\}_{\vv{t}\in\mathbb{G}(\eta)} of the same color are mutually \bm{\Sigma}_{\mathcal{U}}-orthogonal, the maximum number of subdomains that share an unknown, and a user-defined tolerance \tau. Specifically, given a threshold \tau>0, a GenEO coarse space (see Section 3.2 in Al Daas et al., 2023) can be constructed, and the following inequality holds for the preconditioned matrix \mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}}:

κ(𝐏TAS1𝚺𝒰)(kc+1)(2+(2kc+1)kmτ),\kappa(\mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}})\leq(k_{c}+1)\left(2+(2k_{c}+1)\frac{k_{m}}{\tau}\right), (24)

where kck_{c} is the minimum number of distinct colors so that each two neighboring subdomains among {𝒰\vvt}\vvt𝔾(η)\{\mathcal{U}_{\vv{t}}\}_{\vv{t}\in\mathbb{G}(\eta)} have different colors, kmk_{m} is the coloring constant which is the maximum number of overlapping subdomains sharing a row of 𝚺𝒰\bm{\Sigma}_{\mathcal{U}}. Since each point in 𝒰(η,d)\mathcal{U}(\eta,d) belongs to at most |𝔾(η)||\mathbb{G}(\eta)| of the subdomains, kc=|𝔾(η)|k_{c}=|\mathbb{G}(\eta)|. Coloring constant kmk_{m} can be viewed as the number of colors needed to color each subdomain in such a way that any two subdomains with the same color are orthogonal, therefore km=|𝔾(η)|k_{m}=|\mathbb{G}(\eta)|.

However, the conventional approach for constructing a coarse space to bound the smallest eigenvalue of \mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}} relies on the algebraic local symmetric positive semidefinite (SPSD) splitting (Al Daas and Grigori, 2019), which is computationally involved. To obtain a coarse space that is practical and simpler to build, we propose constructing it based on the sparse grid \mathcal{U}(\eta_{c},d) with a lower level \eta_{c}<\eta. In practice, we set \eta_{c}=\max\{\lceil\eta/2\rceil,d\}. Although the TAS preconditioner with our proposed coarse space does not come with a theoretical guarantee of robust convergence, empirical results presented in Figure 3 indicate that TAS tends to converge faster than the one-level AS preconditioner and the CG method without preconditioning. Figure 10 shows cases where \Sigma_{\mathcal{U}} is ill-conditioned, resulting in instability (blow-up) of the CG method without any preconditioning.

Refer to caption
Refer to caption
Figure 3: Residuals of PCG for \Sigma_{\mathcal{U}}^{-1}\mathbf{v} with tolerance \tau=10^{-3} and different preconditioners, where \Sigma_{\mathcal{U}}=\big{[}\mathbf{K}_{\mathcal{U},\mathcal{U}}+\sigma_{\epsilon}^{-2}\mathbf{K}_{\mathcal{U},\mathbf{X}}\mathbf{K}_{\mathbf{X},\mathcal{U}}\big{]}, \mathcal{U}=\mathcal{U}(\eta,d) is the sparse grid defined in eq. 10, and the vector \mathbf{v} is randomly generated. The x-axis is the iteration step and the y-axis is the logarithm of the residual norm. We denote PCG with the TAS preconditioner by red dots, PCG with the AS preconditioner by orange crosses, PCG with the Jacobi preconditioner by green triangles, and CG without preconditioning by blue diamonds. Left: The sparse grid \mathcal{U}(\eta=12,d=2). Right: The sparse grid \mathcal{U}(\eta=10,d=4).

Complexity

Note that \mathbf{S}_{\vv{t}} is a sparse matrix with n_{\vv{t}} nonzeros, all equal to one; hence computing \mathbf{P}_{\vv{t}} requires \mathcal{O}(nn_{\vv{t}}^{2}) operations, and the computation of the first term of \mathbf{P}_{\text{TAS}}^{-1} costs \mathcal{O}(\sum_{\vv{t}\in\mathbb{G}(\eta)}nn_{\vv{t}}^{2}). The computational complexity of the preconditioner \mathbf{P}_{\text{TAS}}^{-1} is then \mathcal{O}(\sum_{\vv{t}\in\mathbb{G}(\eta)}nn_{\vv{t}}^{2}+n_{c}^{2})=\mathcal{O}(n|\mathbb{G}(\eta)|(\eta/d)^{2d})=\mathcal{O}(nn_{sg}(\eta/d)^{2d}), where n_{c}=|\mathcal{U}(\eta_{c},d)| is the size of the coarse sparse grid under the setting described above.

The complexity of the matrix-vector multiplication \bm{\Sigma}\mathbf{x}_{0} in Algorithm 1 is \mathcal{O}(nn_{sg}) if \bm{\Sigma}=\bm{\Sigma}_{\mathcal{U}}. In theory, the maximum number of iterations of PCG with error factor \epsilon is bounded by \lceil\frac{1}{2}\sqrt{\kappa}\ln\big{(}\frac{2}{\epsilon}\big{)}\rceil (Shewchuk et al., 1994), where \kappa is the condition number of the preconditioned matrix \mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}}, so the time complexity of Algorithm 1 for \bm{\Sigma}=\bm{\Sigma}_{\mathcal{U}} is \mathcal{O}((\sqrt{\kappa}+(\eta/d)^{2d})n_{sg}n) with \kappa=\kappa(\mathbf{P}_{\text{TAS}}^{-1}\bm{\Sigma}_{\mathcal{U}}). Therefore, the time complexity of the entire posterior sampling scheme is \mathcal{O}\Big{(}\big{(}\eta+(\sqrt{\kappa}+(\eta/d)^{2d}+m)n\big{)}n_{sg}\Big{)}, where n_{sg} is the size of the inducing points on a sparse grid \mathcal{U}(\eta,d), n is the size of the observations \mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n}, m is the size of the test points \mathbf{X}^{*}=\{\mathbf{x}_{i}^{*}\}_{i=1}^{m}, and d is the dimension of the observations.

4 Experiments

In this section, we demonstrate the computational efficiency and accuracy of the proposed sampling algorithm. We first generate samples from GP priors and posteriors for dimensions d=2 and d=4 in Section 4.1. Then we apply our method to the same real-world problems as those addressed in Wilson et al. (2020), described in Section 4.2. For prior sampling, we use random Fourier features (RFF) with 2^{6}=64 features (Rahimi and Recht, 2007) and the Cholesky decomposition method as benchmarks. For posterior sampling, we compare our approach with the decoupled algorithm (Wilson et al., 2020) using RFF priors and the exact Matheron’s update, as well as the Cholesky decomposition method. We use Matérn kernels as defined in eq. 6 with variance \sigma^{2}=1, lengthscale \omega=\sqrt{2\nu}, and smoothness \nu=3/2. Our experiments employ “separable” Matérn kernels as specified in eq. 5, with the same base kernels and the same lengthscale parameter \omega=\sqrt{2\nu} in each dimension. For the proposed method, the inducing points are selected as a sparse grid \mathcal{U}(\eta,d) as defined in eq. 10 with (\eta=5,d=2) and (\eta=6,d=4), respectively. We set the same nested sequence \varnothing=\mathcal{U}_{j,0}\subseteq\ldots\subseteq\mathcal{U}_{j,t}\subseteq\mathcal{U}_{j,t+1}\subseteq\ldots for each dimension j, and set \mathcal{U}_{j,t_{j}} as the hyperbolic cross points defined in eq. 11 for all j. We set the noise variates \epsilon\sim\mathcal{N}(\mathbf{0},\sigma_{\epsilon}^{2}\mathbf{I}) with \sigma_{\epsilon}^{2}=10^{-4} for all experiments. The seed value is set to 9999, and each experiment is replicated 1000 times.

4.1 Simulation

Prior sampling

We generate prior samples at points \mathbf{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n_{s}} over the domain [0,1]^{d} using Matérn 3/2 kernels, with n_{s}=2^{6},\ldots,2^{13}. The left plots in Figure 4 and Figure 11 show the time required by different algorithms to sample at different numbers of points n_{s}. We can observe that the proposed algorithm (InSG) is comparable in efficiency to the RFF method. To evaluate accuracy, we compute the 2-Wasserstein distances between the empirical and true distributions. The 2-Wasserstein distance measures the similarity of two distributions. Let f_{1}\sim\mathcal{N}(\mu_{1},\Sigma_{1}) and f_{2}\sim\mathcal{N}(\mu_{2},\Sigma_{2}); the 2-Wasserstein distance between the Gaussian distributions f_{1} and f_{2} in the L^{2} norm is given by (Dowson and Landau, 1982; Gelbrich, 1990; Mallasto and Feragen, 2017)

W_{2}(f_{1},f_{2}):=\Big{(}||\mu_{1}-\mu_{2}||^{2}+{\rm tr}\big{(}\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}^{\frac{1}{2}}\Sigma_{2}\Sigma_{1}^{\frac{1}{2}})^{\frac{1}{2}}\big{)}\Big{)}^{\frac{1}{2}}. (25)

For the empirical distributions, the parameters are estimated from the replicated samples. The right plots in Figure 4 and Figure 11 demonstrate that the proposed algorithm performs with nearly the same precision as Cholesky decomposition and RFF.
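The metric in eq. 25 can be evaluated directly with SciPy’s matrix square root; the sketch below is our own illustration, and the small guard against negative round-off in the trace term is our own addition.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(mu1, Sigma1, mu2, Sigma2):
    """2-Wasserstein distance between N(mu1, Sigma1) and N(mu2, Sigma2), eq. (25)."""
    S1_half = sqrtm(Sigma1)
    cross = sqrtm(S1_half @ Sigma2 @ S1_half)
    # sqrtm may return a tiny imaginary part due to round-off; keep the real part
    bures = np.trace(Sigma1 + Sigma2 - 2.0 * np.real(cross))
    return np.sqrt(np.sum((mu1 - mu2) ** 2) + max(bures, 0.0))
```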

Refer to caption
Refer to caption
Figure 4: Time and accuracy of different algorithms for sampling from GP priors with Matérn 3/2 in dimension d=2. The x-axis is the number of sampling points n_{s}. The sparse grid level of the proposed method is set as \eta=5. We denote the proposed approach, inducing points on the sparse grid (InSG), by red dots, random Fourier features (RFF) by green triangles, and Cholesky decomposition (Chol) by blue diamonds. Left: Logarithm of the time taken to generate a draw from GP priors. Right: Logarithm of 2-Wasserstein distances between the sampled priors and the true distributions.
Refer to caption
Refer to caption
Figure 5: Time and accuracy of different algorithms for sampling from GP posteriors with Matérn 3/2 and dimension d=2. The x-axis is the number of observations n. The sparse grid level of the proposed method is set as \eta=5. The decoupled method is configured to use RFF priors with 2^{6}=64 features and the exact Matheron’s update. Left: Logarithm of the time taken to generate a draw from GP posteriors over m=1000 points. Right: Logarithm of 2-Wasserstein distances between posterior samples and ground-truth GP posteriors over m=1000 points.

Posterior sampling

We use the Griewank function (Griewank, 1981) as the test function, defined as:

f(\mathbf{x})=\sum_{j=1}^{d}\frac{x_{j}^{2}}{4000}-\prod_{j=1}^{d}\cos\Big{(}\frac{x_{j}}{\sqrt{j}}\Big{)}+1,\quad\mathbf{x}\in[-5,5]^{d}.

The sparse grid \mathcal{U}(\eta,d) is set over [-5,5]^{d} with configurations (\eta=5,d=2) and (\eta=6,d=4) for the design of our experiment. We then evaluate the average computational time and 2-Wasserstein distance over m=1000 random test points for each sampling method. Figure 5 and Figure 12 illustrate the performance of the different sampling strategies. We observe that our proposed algorithm is significantly more time-efficient than the Cholesky and decoupled algorithms, particularly when n is larger than 2^{10}. Moreover, our proposed method, the inducing points approximation on the sparse grid (InSG), demonstrates accuracy comparable to the decoupled method.

4.2 Application

Thompson sampling

Thompson Sampling (TS) (Thompson, , 1933) is a classical strategy for decision-making by selecting actions x𝒳x\in\mathcal{X} that minimize a black-box function f:𝒳f:\mathcal{X}\rightarrow\mathbb{R}. At each iteration tt, TS determines 𝐱t+1argmin𝐱𝒳(f|𝐲)(𝐱)\mathbf{x}_{t+1}\in\arg\min_{\mathbf{x}\in\mathcal{X}}(f|\mathbf{y})(\mathbf{x}), where 𝐲\mathbf{y} represents the observation set at current iteration. Upon finding the minimizer, the corresponding yt+1y_{t+1} is obtained by evaluating ff at 𝐱t+1\mathbf{x}_{t+1}, and then the pair (𝐱t+1,yt+1)(\mathbf{x}_{t+1},y_{t+1}) is added to the training set. In this experiment, we consider the Ackley function (Ackley, , 2012)

f(𝐱)=aexp(b1dj=1dxj2)+a+exp(1)exp(1dj=1dcos(cxj)),\displaystyle f(\mathbf{x})=-a\exp\Big{(}-b\sqrt{\frac{1}{d}\sum_{j=1}^{d}x_{j}^{2}}\Big{)}+a+\exp(1)-\exp\Big{(}\frac{1}{d}\sum_{j=1}^{d}\cos(cx_{j})\Big{)}, (26)

with a=20, b=0.2, c=2\pi. The goal of this experiment is to find the global minimizer of the target function f. We start with 3 initial samples; then, at each iteration of TS, we draw a posterior sample f|\mathbf{y} on 1024 uniformly distributed points over the domain [-5,5]^{d}, conditioned on the current observation set. Next, we pick the point with the smallest posterior sample value, add the corresponding pair to the training set, and repeat the above process. This iterative approach helps us progressively approximate the global minimum. To neutralize the impact of the lengthscales and seeds, we average the regret over 8 different lengthscales [0.2,0.4,0.6,0.8,1.0,1.2,1.4,1.6] and 5 different seeds [9,99,999,9999,99999]. In Figure 6, we compare the logarithm of the regret at each iteration for different sampling algorithms; the proposed approach (InSG) demonstrates performance comparable to the decoupled method.
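A schematic version of this TS loop is sketched below. The posterior sampler draw_posterior(X, y, Xcand) is an assumed interface returning one joint posterior draw at the candidate points (any of the samplers discussed in this paper could be plugged in), and the Ackley function follows eq. 26.

```python
import numpy as np

def ackley(X, a=20.0, b=0.2, c=2.0 * np.pi):
    """Ackley function of eq. (26), evaluated row-wise on X of shape (n, d)."""
    term1 = -a * np.exp(-b * np.sqrt(np.mean(X ** 2, axis=1)))
    term2 = -np.exp(np.mean(np.cos(c * X), axis=1))
    return term1 + term2 + a + np.e

def thompson_sampling(draw_posterior, d, n_iter=50, n_cand=1024, seed=0):
    """Schematic TS loop for minimizing the Ackley function with a GP posterior sampler."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5.0, 5.0, size=(3, d))          # 3 initial design points
    y = ackley(X)
    for _ in range(n_iter):
        cand = rng.uniform(-5.0, 5.0, size=(n_cand, d))
        f_draw = draw_posterior(X, y, cand)          # one joint posterior draw
        x_next = cand[np.argmin(f_draw)]             # minimizer of the sampled path
        X = np.vstack([X, x_next])
        y = np.append(y, ackley(x_next[None, :]))    # evaluate f and augment the data
    return X, y
```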

Refer to caption
Refer to caption
Refer to caption
Figure 6: Regret of Thompson sampling with different sampling algorithms for optimizing the Ackley function with d=2 (left), d=4 (middle), and d=6 (right) for Matérn 3/2. The sparse grid design is set as \mathcal{U}(\eta=5,d=2), \mathcal{U}(\eta=6,d=4), and \mathcal{U}(\eta=8,d=6), respectively. The decoupled method is configured to use RFF priors with 2^{6}=64 features and the exact Matheron’s update. The x-axis is the iteration step and the y-axis is the regret at each step.

Simulating dynamical systems

GP posteriors are also useful in dynamical systems where data is limited. An illustrative example is the FitzHugh–Nagumo model (FitzHugh, 1961; Nagumo et al., 1962), which models the activation and deactivation dynamics of a spiking neuron. The system’s equations are given by:

\dot{\mathbf{x}}=f(\mathbf{x},a)=\begin{bmatrix}\dot{v}\\ \dot{w}\end{bmatrix}=\begin{bmatrix}v-\frac{v^{3}}{3}-w+a\\ \frac{1}{\gamma}(v-\beta w+\alpha)\end{bmatrix}. (27)

Using the Euler–Maruyama method (Maruyama, , 1955), the system’s equations eq. 27 can be discretized into a stochastic difference equation as follows:

𝐲t=𝐱t+1𝐱t=τf(𝐱t,at)+τϵt,ϵt𝒩(𝟎,σϵ2𝐈),\mathbf{y}_{t}=\mathbf{x}_{t+1}-\mathbf{x}_{t}=\tau f(\mathbf{x}_{t},a_{t})+\sqrt{\tau}\epsilon_{t},\quad\epsilon_{t}\sim\mathcal{N}(\mathbf{0},\sigma_{\epsilon}^{2}\mathbf{I}), (28)

where \tau is the fixed step size. Writing the FitzHugh–Nagumo neuron eq. 27 in the form of eq. 28, we obtain:

𝐱t+1𝐱t=τ[vtvt33wt+at1γ(vtβwt+α)]+τϵt.\mathbf{x}_{t+1}-\mathbf{x}_{t}=\tau\begin{bmatrix}v_{t}-\frac{v_{t}^{3}}{3}-w_{t}+a_{t}\\ \frac{1}{\gamma}(v_{t}-\beta w_{t}+\alpha)\end{bmatrix}+\sqrt{\tau}\epsilon_{t}. (29)
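For reference, the Euler–Maruyama recursion of eq. 29 can be simulated directly as below; such trajectories serve as the ground truth against which the GP-based simulations are compared. The function name and the constant input current in the usage line are illustrative, and the parameter defaults follow the experimental setup described later in this section.

```python
import numpy as np

def fhn_em_trajectory(x0, a_seq, tau=0.25, alpha=0.75, beta=0.75, gamma=20.0,
                      noise_var=1e-4, seed=0):
    """Euler-Maruyama simulation of the stochastic FitzHugh-Nagumo neuron, eq. (29)."""
    rng = np.random.default_rng(seed)
    traj = [np.asarray(x0, dtype=float)]
    for a_t in a_seq:
        v, w = traj[-1]
        drift = np.array([v - v ** 3 / 3.0 - w + a_t,          # dv/dt
                          (v - beta * w + alpha) / gamma])      # dw/dt
        eps = rng.normal(0.0, np.sqrt(noise_var), size=2)       # eps_t ~ N(0, sigma_eps^2 I)
        traj.append(traj[-1] + tau * drift + np.sqrt(tau) * eps)
    return np.array(traj)

# e.g. a constant input current a_t = 0.5 for 100 steps from x0 = (-2.5, -1)
states = fhn_em_trajectory(x0=(-2.5, -1.0), a_seq=np.full(100, 0.5))
```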
Refer to caption
Figure 7: Trajectories of different algorithms for simulation of a stochastic FitzHugh–Nagumo neuron. Each iteration is generated over 1000 samples. The sparse grid level of the proposed method (InSG) is set as \eta=5. The decoupled method is configured to use RFF priors with 2^{6}=64 features and the exact Matheron’s update. Top: Voltage v_{t} trajectories generated by different algorithms. Bottom: Recovery variable w_{t} trajectories generated by different algorithms.
Refer to caption
Refer to caption
Figure 8: Time and accuracy of different algorithms for simulation of a stochastic FitzHugh–Nagumo neuron. Left: Time cost at each iteration. Right: Logarithm of 2-Wasserstein distance between simulations and ground truth state trajectories at each iteration.

Our objective is to simulate the state trajectories of this dynamical system. Starting from an initial point \mathbf{x}_{0}=(v_{0},w_{0})^{\top}, at iteration t we draw a posterior GP sample from the conditional distribution p(\mathbf{y}_{t}|D_{t-1}), where D_{t-1} denotes the set of training data \{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{n} along with the current trajectory \{(\mathbf{x}_{j},\mathbf{y}_{j})\}_{j=1}^{t-1}. In our implementation, we choose the model eq. 29 with step size \tau=0.25, parameters \alpha=0.75, \beta=0.75, \gamma=20, noise variance \sigma_{\epsilon}^{2}=10^{-4}, and initial point \mathbf{x}_{0}=(-2.5,-1)^{\top}. The training set \{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{n} was generated by evaluating eq. 29 at n=256 points \{\mathbf{x}_{i}\}_{i=1}^{n} (\mathbf{x}_{i}\in\mathcal{X}), uniformly distributed in the domain \mathcal{X}=[-2.5,2.5]\times[-1,2], with corresponding currents \{a_{i}\}_{i=1}^{n} (a_{i}\in\mathcal{A}) chosen uniformly from \mathcal{A}=[0,1]. The variations \mathbf{y}_{t} in each step were simulated using independent Matérn 3/2 GPs with \omega=\sqrt{3} and the inducing points approximation on the sparse grid \mathcal{U}(\eta=5,d=2). We generate 1000 samples at each iteration. Figure 7 presents the state trajectories of the voltage v_{t} and the recovery variable w_{t} versus the iteration step t for each algorithm. The left plots in Figure 7 show that our proposed algorithm can accurately characterize the state trajectories of this dynamical system. Figure 8 illustrates the computational time required for each iteration and the 2-Wasserstein distance between the states \mathbf{x}_{t} derived from the GP-based simulations and those obtained using the Euler–Maruyama method described in eq. 28 at each iteration t. These results demonstrate that our proposed algorithm not only reduces the computational time but also preserves accuracy comparable to the decoupled method.

5 Conclusion

In this work, we propose a scalable algorithm for GP sampling using inducing points on a sparse grid, which only requires a time complexity of \mathcal{O}\Big{(}(\eta+n_{s})n_{sg}\Big{)} for n_{s} sampling points and a sparse grid of size n_{sg} and level \eta. Additionally, we show that the convergence rate for GP sampling with a sparse grid is \mathcal{O}(n_{sg}^{-\nu}(\log n_{sg})^{(\nu+2)(d-1)+d+1}) for a d-dimensional sparse grid of size n_{sg} and a product Matérn kernel of smoothness \nu. For posterior sampling, we develop a two-level additive Schwarz preconditioner for the system matrix over the sparse grid, which empirically shows rapid convergence. In the numerical study, we demonstrate that the proposed algorithm is not only efficient but also maintains accuracy comparable to other approximation methods.

Acknowledgments and Disclosure of Funding

This work is supported by NSF grants DMS-2312173 and CNS-2328395.

Appendix A Theorems

A.1 Rate of convergence

Theorem A.1.

Let $M=L_{q}([0,1])$ for any $q\in[1,+\infty)$. Let $Z$ be a GP on $[0,1]$ with a Matérn kernel defined in eq. 6 with smoothness $\nu$, and let $\hat{Z}$ be an inducing points approximation of $Z$ with inducing points $\{0,1/n,2/n,\ldots,1\}$ for some positive integer $n$. Then for any $p\in[1,\infty)$, as $n\rightarrow+\infty$, the order of magnitude of the approximation error is

W_p(Z,\hat{Z})=\mathcal{O}(n^{-\nu}).
Proof.

Note that $Z$ and $\hat{Z}$ are defined on the same probability space, so their joint law is a coupling, that is, $(Z,\hat{Z})\in\Gamma(Z,\hat{Z})$. Therefore,

W_p(Z,\hat{Z})\leq\left(\mathbb{E}\|Z-\hat{Z}\|^{p}_{L_{q}([0,1])}\right)^{1/p}. (30)

According to Theorem 4 in (Tuo and Wang, 2020), $\|Z-\hat{Z}\|_{L_{q}([0,1])}$ has the order of magnitude $\mathcal{O}(n^{-\nu})$ and sub-Gaussian tails, which leads to the desired result. ∎
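
As a rough numerical sanity check of this rate (not part of the experiments reported in this paper), one can use the fact that, when the inducing points approximation takes the usual conditional-expectation form $\hat{Z}(x)=K(x,\mathcal{U})K(\mathcal{U},\mathcal{U})^{-1}Z(\mathcal{U})$, the pointwise error variance is $\mathbb{E}|Z(x)-\hat{Z}(x)|^{2}=K(x,x)-K(x,\mathcal{U})K(\mathcal{U},\mathcal{U})^{-1}K(\mathcal{U},x)$. The sketch below tracks the resulting root-mean-squared error as the grid $\{0,1/n,\ldots,1\}$ is refined; the Matérn 3/2 kernel, lengthscale, and jitter are illustrative choices.

```python
import numpy as np

def matern32(x, y, lengthscale=0.2):
    # Matérn 3/2 kernel k(x, y) = (1 + sqrt(3) r / l) exp(-sqrt(3) r / l), unit variance.
    r = np.abs(x[:, None] - y[None, :])
    s = np.sqrt(3.0) * r / lengthscale
    return (1.0 + s) * np.exp(-s)

def rmse_of_approximation(n, n_test=2000, jitter=1e-10):
    """Root-mean-squared error of the inducing points approximation on {0, 1/n, ..., 1}."""
    u = np.linspace(0.0, 1.0, n + 1)   # inducing points
    x = np.linspace(0.0, 1.0, n_test)  # test locations approximating the L2 norm
    Kuu = matern32(u, u) + jitter * np.eye(n + 1)
    Kxu = matern32(x, u)
    # Pointwise error variance: k(x, x) - k(x, U) K(U, U)^{-1} k(U, x).
    var = 1.0 - np.einsum("ij,ij->i", Kxu, np.linalg.solve(Kuu, Kxu.T).T)
    return np.sqrt(np.mean(np.maximum(var, 0.0)))

for n in [8, 16, 32, 64, 128]:
    # For nu = 3/2 the error should shrink roughly like n^(-3/2).
    print(n, rmse_of_approximation(n))
```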

Lemma A.2.

[Adapted from Proposition 1 and Corollary 2 in (Rieger and Wendland, 2017)] Let $\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a reproducing kernel of $W_{2}^{r;\otimes^{d}}(\mathbb{R}^{d})$, where $\Phi$ is the tensor product kernel $\Phi(\mathbf{x}):=\prod_{j=1}^{d}\phi(x_{j})$ with $\mathbf{x}=(x_{1},\ldots,x_{d})^{\top}$, and $\phi:\mathbb{R}\rightarrow\mathbb{R}$ is a reproducing kernel of the Sobolev space $W_{2}^{r}(\mathbb{R}^{1})$ of smoothness $r>1/2$ measured in the $L_{2}(\mathbb{R}^{1})$-norm. Suppose we are given the sparse grid $\mathcal{U}(\eta,d)$ defined in eq. 10 over $I^{d}=[0,1]^{d}$ with $n=\mathrm{card}(\mathcal{U}(\eta,d))$ points and an unknown function $f\in W_{2}^{r;\otimes^{d}}(I^{d})$. Let $s_{0}$ be the solution to the norm-minimal interpolation problem on the sparse grid $\mathcal{U}(\eta,d)$:

\min\|s\|_{W_{2}^{r;\otimes^{d}}}:\quad s\in W_{2}^{r;\otimes^{d}}(\mathbb{R}^{d})\quad\text{with}\quad s(\xi)=f(\xi),\quad\xi\in\mathcal{U}(\eta,d). (31)

Then, we have

\|f-s_{0}\|_{L_{\infty}(I^{d})}\leq C\,n^{-r+1/2}(\log n)^{(r+3/2)(d-1)+d+1}\|f\|_{W_{2}^{r;\otimes^{d}}(\mathbb{R}^{d})}, (32)

where $C>0$ is a constant.

Proof.

See the proofs of Proposition 1 and Corollary 2 in (Rieger and Wendland, 2017). ∎

Theorem A.3.

Let $M=L_{q}([0,1]^{d})$ for any $q\in[1,+\infty)$, where $d>1$ is the dimension of the input domain. Let $Z$ be a GP on $[0,1]^{d}$ with a product Matérn kernel $K(\mathbf{x},\mathbf{x}^{\prime})=\prod_{j=1}^{d}K_{0}(x^{(j)},x^{\prime(j)})$ defined in eq. 5 with variance $\sigma^{2}=1$ and base kernel $K_{0}(\cdot,\cdot)$ of smoothness $\nu$, and let $\hat{Z}$ be an inducing points approximation of $Z$ with inducing points on the sparse grid $\mathcal{U}(\eta,d)$ defined in eq. 10. We denote the cardinality of the sparse grid $\mathcal{U}(\eta,d)$ by $n$. Then, as $n\rightarrow+\infty$, the order of magnitude of the approximation error is

W_2(Z,\hat{Z})=\mathcal{O}\big(n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}\big).
Proof.

If $\phi$ in Lemma A.2 is a one-dimensional Matérn kernel, we can infer that $r=\nu+1/2$ by Lemma 15 in (Tuo and Wang, 2020), which states that the reproducing kernel Hilbert space (RKHS) $\mathcal{N}_{K}(\mathbb{R}^{1})$ is the Sobolev space $W_{2}^{\nu+1/2}(\mathbb{R}^{1})$ with equivalent norms. We denote the RKHS associated with the product Matérn kernel $K(\cdot,\cdot)$ by $\mathcal{N}_{K}^{\otimes^{d}}(\mathbb{R}^{d})$ and its norm by $\|\cdot\|_{K}$.

Let $\hat{f}$ be the interpolant of a function $f\in\mathcal{N}_{K}^{\otimes^{d}}(\mathbb{R}^{d})$ associated with the approximation $\hat{Z}$. Clearly, $\mathcal{N}_{K}^{\otimes^{d}}(\mathbb{R}^{d})=W_{2}^{\nu+1/2;\otimes^{d}}(\mathbb{R}^{d})$. Proposition 2 in (Rieger and Wendland, 2017) asserts that $\hat{f}=s_{0}$, where $s_{0}$ is the norm-minimal interpolant defined in Lemma A.2. Therefore, for any $f\in\mathcal{N}_{K}^{\otimes^{d}}(\mathbb{R}^{d})$, we have

\|f-\hat{f}\|_{L_{\infty}([0,1]^{d})}\leq C\,n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}\|f\|_{W_{2}^{\nu+1/2;\otimes^{d}}(\mathbb{R}^{d})}. (33)

Additionally, Lemma 15 in (Tuo and Wang, 2020) implies that

\mathbb{E}\|Z-\hat{Z}\|_{L_{q}([0,1]^{d})}^{2}=\sup_{\|f\|_{K}\leq 1}\|f-\hat{f}\|_{L_{q}([0,1]^{d})}^{2}. (34)

Note that $Z$ and $\hat{Z}$ are defined on the same probability space, so their joint law is a coupling, that is, $(Z,\hat{Z})\in\Gamma(Z,\hat{Z})$. Therefore, combining eq. 33 and eq. 34, we obtain the following:

W_2(Z,\hat{Z}) \leq \left(\mathbb{E}\|Z-\hat{Z}\|^{2}_{L_{q}([0,1]^{d})}\right)^{1/2}
= \sup_{\|f\|_{K}\leq 1}\|f-\hat{f}\|_{L_{q}([0,1]^{d})}
\leq \sup_{\|f\|_{K}\leq 1} C\,n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}\|f\|_{K}
\leq C\,n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}. (35)

Thus, the order of magnitude of the error of the inducing points approximation on the sparse grid is $W_2(Z,\hat{Z})=\mathcal{O}\big(n^{-\nu}(\log n)^{(\nu+2)(d-1)+d+1}\big)$. ∎
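
For concreteness, an illustrative instantiation of this rate (not an additional result): with a Matérn base kernel of smoothness $\nu=3/2$ in dimension $d=2$, the exponent of the logarithmic factor is $(\nu+2)(d-1)+d+1=7/2+3=13/2$, so the bound specializes to

W_2(Z,\hat{Z})=\mathcal{O}\big(n^{-3/2}(\log n)^{13/2}\big),

which matches the one-dimensional rate of Theorem A.1 up to a polylogarithmic factor.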

A.2 Smolyak algorithm for GP sampling

Theorem A.4.

Let $K(\cdot,\cdot)$ be a kernel function on $\mathbb{R}^{d}\times\mathbb{R}^{d}$, and let $\mathcal{U}=\mathcal{U}(\eta,d)$ be a sparse grid design with level $\eta$ and dimension $d$ defined in eq. 10, i.e., $\mathcal{U}(\eta,d)=\bigcup_{\vv{t}\in\mathbb{G}(\eta)}\mathcal{U}_{1,t_{1}}\times\mathcal{U}_{2,t_{2}}\times\cdots\times\mathcal{U}_{d,t_{d}}$, where $\vv{t}=(t_{1},\ldots,t_{d})$, $\mathbb{G}(\eta)=\{\vv{t}\in\mathbb{N}^{d}\,|\,\sum_{j=1}^{d}t_{j}=\eta\}$, $\eta\geq d\geq 2$, and $\eta,d\in\mathbb{N}^{+}$. Let $f(\cdot)\sim\mathcal{GP}(\mathbf{0},K(\cdot,\cdot))$ be a GP. Then sampling $f(\cdot)$ on the sparse grid $\mathcal{U}$ can be done via the following formula:

f(\mathcal{U})\leftarrow\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\bigotimes_{j=1}^{d}\mathrm{chol}\big(K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})\big)\,\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\,\delta_{x}(\mathcal{U}_{\vv{t}}), (36)
\delta_{x}(\mathcal{U}_{\vv{t}}):=\begin{cases}0&\text{if }x\in\mathcal{U}\setminus\mathcal{U}_{\vv{t}}\\ 1&\text{if }x\in\mathcal{U}_{\vv{t}}\end{cases},\qquad\mathcal{U}_{\vv{t}}:=\mathcal{U}_{1,t_{1}}\times\mathcal{U}_{2,t_{2}}\times\cdots\times\mathcal{U}_{d,t_{d}}, (37)

where $\mathbb{P}(\eta,d)=\{\vv{t}\in\mathbb{N}^{d}\,|\,\max(d,\eta-d+1)\leq|\vv{t}|\leq\eta\}$, $\mathrm{chol}\big(K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})\big)$ is the Cholesky factor of $K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})$, $\mathbf{f}_{\mathbf{I}_{n_{\vv{t}}}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n_{\vv{t}}})$ is a standard multivariate normal vector, $n_{\vv{t}}=\prod_{j=1}^{d}t_{j}$ is the size of the full grid design $\mathcal{U}_{\vv{t}}=\mathcal{U}_{1,t_{1}}\times\mathcal{U}_{2,t_{2}}\times\cdots\times\mathcal{U}_{d,t_{d}}$ associated with $\vv{t}=(t_{1},\ldots,t_{d})$, and $\delta_{x}(\mathcal{U}_{\vv{t}})$ is a Dirac measure on the set $\mathcal{U}$.

Proof.

We know that sampling $f(\mathcal{U})\sim\mathcal{N}(\mathbf{0},K(\mathcal{U},\mathcal{U}))$ can be done via the Cholesky decomposition, i.e., by computing $\mathrm{chol}\big(K(\mathcal{U},\mathcal{U})\big)\mathbf{f}_{n_{sg}}$, where $\mathbf{f}_{n_{sg}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n_{sg}})$ is a standard multivariate normal vector on the sparse grid $\mathcal{U}$ of size $n_{sg}=|\mathcal{U}|$. Note that $\mathrm{chol}\big(K(\mathcal{U},\mathcal{U})\big)\mathbf{f}_{n_{sg}}$ can be regarded as a function $g$ evaluated on $\mathcal{U}$, i.e., $g(\mathcal{U})=\mathrm{chol}\big(K(\mathcal{U},\mathcal{U})\big)\mathbf{f}_{n_{sg}}$. According to eq. 12, applying Smolyak's algorithm $\mathcal{P}(\eta,d)$ to $g$ yields

\mathcal{P}(\eta,d)(g)=\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\Big(\bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}\Big)(g)
=\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\,g\big(\times_{j=1}^{d}\mathcal{U}_{j,t_{j}}\big)\,\delta_{x}(\mathcal{U}_{\vv{t}})
=\sum_{\vv{t}\in\mathbb{P}(\eta,d)}(-1)^{\eta-|\vv{t}|}\binom{d-1}{\eta-|\vv{t}|}\bigotimes_{j=1}^{d}\mathrm{chol}\big(K(\mathcal{U}_{j,t_{j}},\mathcal{U}_{j,t_{j}})\big)\,\mathbf{f}_{n_{\vv{t}}}\,\delta_{x}(\mathcal{U}_{\vv{t}}), (38)

where $\delta_{x}(\mathcal{U}_{\vv{t}})$ and $\mathcal{U}_{\vv{t}}$ are defined in eq. 37. Note that $g$ is defined on $\mathcal{U}$, so applying the tensor product operator $\bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}$ to $g$ amounts to evaluating $g$ at the corresponding points, which is equivalent to applying the Dirac measure on the set $\times_{j=1}^{d}\mathcal{U}_{j,t_{j}}$ to $g\big(\bigotimes_{j=1}^{d}\mathcal{U}_{j,t_{j}}\big)$. ∎
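
A minimal Python sketch of the combination formula in eq. 36 is given below. It assumes, purely for illustration, nested dyadic one-dimensional designs $\mathcal{U}_{j,t}$ with $2^{t}+1$ points on $[0,1]$, a Matérn 3/2 base kernel with an arbitrary lengthscale, and a small jitter for numerical stability; these choices and the helper names are our assumptions, not the configuration used elsewhere in the paper. Each term of the sum draws an independent full-grid sample as a Kronecker product of one-dimensional Cholesky factors and scatters it onto the sparse-grid points with the signed binomial weight.

```python
import itertools
from math import comb
import numpy as np

def matern32(x, y, lengthscale=0.3):
    # 1-D Matérn 3/2 base kernel; the product over dimensions gives the d-D kernel.
    r = np.abs(np.subtract.outer(x, y))
    s = np.sqrt(3.0) * r / lengthscale
    return (1.0 + s) * np.exp(-s)

def one_dim_grid(level):
    # Assumed nested 1-D designs U_{j,t}: dyadic grids on [0, 1] with 2^t + 1 points.
    return np.linspace(0.0, 1.0, 2 ** level + 1)

def smolyak_prior_sample(eta, d, rng):
    """Draw one GP prior sample on the sparse grid U(eta, d) via the combination formula."""
    # Multi-indices t with t_j >= 1 and max(d, eta - d + 1) <= |t| <= eta.
    index_set = [t for t in itertools.product(range(1, eta + 1), repeat=d)
                 if max(d, eta - d + 1) <= sum(t) <= eta]
    # The sparse grid is the union of the full grids U_t with |t| = eta.
    points = {}
    for t in index_set:
        if sum(t) == eta:
            for p in itertools.product(*(one_dim_grid(tj) for tj in t)):
                key = tuple(round(float(c), 12) for c in p)
                points.setdefault(key, len(points))
    sample = np.zeros(len(points))
    for t in index_set:
        weight = (-1) ** (eta - sum(t)) * comb(d - 1, eta - sum(t))
        # Kronecker product of 1-D Cholesky factors = Cholesky factor of the
        # product kernel on the full grid U_t (points ordered lexicographically).
        L = np.array([[1.0]])
        for tj in t:
            g = one_dim_grid(tj)
            Lj = np.linalg.cholesky(matern32(g, g) + 1e-10 * np.eye(len(g)))
            L = np.kron(L, Lj)
        full_grid = list(itertools.product(*(one_dim_grid(tj) for tj in t)))
        draw = L @ rng.standard_normal(len(full_grid))
        # Scatter the full-grid draw onto the sparse-grid points (the Dirac term in eq. 36).
        for value, p in zip(draw, full_grid):
            key = tuple(round(float(c), 12) for c in p)
            sample[points[key]] += weight * value
    return np.array(list(points.keys())), sample

rng = np.random.default_rng(0)
grid, f_prior = smolyak_prior_sample(eta=5, d=2, rng=rng)
print(grid.shape, f_prior.shape)
```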

Appendix B Plots

Figure 9: Sparse grids $\mathcal{U}(\eta,d)$ over $[0,1]^{d}$ of levels $\eta=4,5,6,7$ and dimension $d=3$.
Figure 10: Residuals of PCG for computing $\Sigma_{\mathcal{U}}^{-1}\mathbf{v}$ with tolerance $\tau=10^{-3}$ under different preconditioners, where $\Sigma_{\mathcal{U}}=\big[\mathbf{K}_{\mathcal{U},\mathcal{U}}+\sigma_{\epsilon}^{-2}\mathbf{K}_{\mathcal{U},\mathbf{X}}\mathbf{K}_{\mathbf{X},\mathcal{U}}\big]$, $\mathcal{U}=\mathcal{U}(\eta,d)$ is the sparse grid defined in eq. 10, and the vector $\mathbf{v}$ is randomly generated. Left: the sparse grid $\mathcal{U}(\eta=5,d=2)$. Right: the sparse grid $\mathcal{U}(\eta=6,d=4)$.
Figure 11: Time and accuracy of different algorithms for sampling from GP priors with a Matérn 3/2 kernel in dimension $d=4$. The $x$-axis is the number of grid points $n_{s}$. The sparse grid level of the proposed method is set to $\eta=6$. Left: logarithm of the time taken to generate a draw from the GP prior. Right: logarithm of the 2-Wasserstein distances between the prior samples and the true distributions.
Figure 12: Time and accuracy of different algorithms for sampling from GP posteriors with a Matérn 3/2 kernel in dimension $d=4$. The $x$-axis is the number of observations $n$. The sparse grid level of the proposed method is set to $\eta=6$. The decoupled method is configured to use RFF priors with $2^{8}=64$ features and exact Matheron's update. Left: logarithm of the time taken to generate a draw from the GP posterior over $m=1000$ points. Right: logarithm of the 2-Wasserstein distances between the posterior samples and the ground truth GP posteriors over $m=1000$ points.

References

  • Ackley, (2012) Ackley, D. (2012). A connectionist machine for genetic hillclimbing, volume 28. Springer Science & Business Media.
  • Al Daas and Grigori, (2019) Al Daas, H. and Grigori, L. (2019). A class of efficient locally constructed preconditioners based on coarse spaces. SIAM Journal on Matrix Analysis and Applications, 40(1):66–91.
  • Al Daas et al., (2023) Al Daas, H., Jolivet, P., and Rees, T. (2023). Efficient Algebraic Two-Level Schwarz Preconditioner for Sparse Matrices. SIAM Journal on Scientific Computing, 45(3):A1199–A1213.
  • Banerjee et al., (2008) Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(4):825–848.
  • Bishop et al., (1995) Bishop, C. M. et al. (1995). Neural networks for pattern recognition. Oxford university press.
  • Cai and Sarkis, (1999) Cai, X.-C. and Sarkis, M. (1999). A restricted additive Schwarz preconditioner for general sparse linear systems. SIAM Journal on Scientific Computing, 21(2):792–797.
  • Chen et al., (2022) Chen, H., Ding, L., and Tuo, R. (2022). Kernel packet: An Exact and Scalable Algorithm for Gaussian Process Regression with Matérn Correlations. Journal of Machine Learning Research, 23(127):1–32.
  • Chen and Tuo, (2022) Chen, H. and Tuo, R. (2022). A Scalable and Exact Gaussian Process Sampler via Kernel Packets.
  • Cole et al., (2021) Cole, D. A., Christianson, R. B., and Gramacy, R. B. (2021). Locally induced Gaussian processes for large-scale simulation experiments. Statistics and Computing, 31:1–21.
  • Constantinou et al., (2017) Constantinou, P., Kokoszka, P., and Reimherr, M. (2017). Testing separability of space-time functional processes. Biometrika, 104(2):425–437.
  • Cressie, (2015) Cressie, N. (2015). Statistics for spatial data. John Wiley & Sons.
  • Curriero, (2006) Curriero, F. C. (2006). On the use of non-Euclidean distance measures in geostatistics. Mathematical Geology, 38:907–926.
  • Cutajar et al., (2016) Cutajar, K., Osborne, M., Cunningham, J., and Filippone, M. (2016). Preconditioning kernel matrices. In International conference on machine learning, pages 2529–2538. PMLR.
  • Demmel, (1997) Demmel, J. W. (1997). Applied numerical linear algebra. SIAM.
  • Diggle et al., (2003) Diggle, P. J., Ribeiro, P. J., and Christensen, O. F. (2003). An introduction to model-based geostatistics. Spatial statistics and computational methods, pages 43–86.
  • Ding et al., (2021) Ding, L., Tuo, R., and Shahrampour, S. (2021). A sparse expansion for deep Gaussian processes. arXiv preprint arXiv:2112.05888.
  • Dolean et al., (2015) Dolean, V., Jolivet, P., and Nataf, F. (2015). An introduction to domain decomposition methods: algorithms, theory, and parallel implementation. SIAM.
  • Dowson and Landau, (1982) Dowson, D. and Landau, B. (1982). The Fréchet distance between multivariate normal distributions. Journal of multivariate analysis, 12(3):450–455.
  • Dryja and Widlund, (1994) Dryja, M. and Widlund, O. B. (1994). Domain decomposition algorithms with small overlap. SIAM Journal on Scientific Computing, 15(3):604–620.
  • Engel et al., (2005) Engel, Y., Mannor, S., and Meir, R. (2005). Reinforcement learning with Gaussian processes. In Proceedings of the 22nd international conference on Machine learning, pages 201–208.
  • FitzHugh, (1961) FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophysical journal, 1(6):445–466.
  • (22) Frazier, P. I. (2018a). A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
  • (23) Frazier, P. I. (2018b). Bayesian optimization. In Recent advances in optimization and modeling of contemporary problems, pages 255–278. Informs.
  • Garcke, (2013) Garcke, J. (2013). Sparse grids in a nutshell. In Sparse grids and applications, pages 57–80. Springer.
  • Gardner et al., (2018) Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and Wilson, A. G. (2018). GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. Advances in neural information processing systems, 31.
  • Gelbrich, (1990) Gelbrich, M. (1990). On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):185–203.
  • Genton, (2001) Genton, M. G. (2001). Classes of kernels for machine learning: a statistics perspective. Journal of machine learning research, 2(Dec):299–312.
  • Genton, (2007) Genton, M. G. (2007). Separable approximations of space-time covariance matrices. Environmetrics: The official journal of the International Environmetrics Society, 18(7):681–695.
  • Girard et al., (2002) Girard, A., Rasmussen, C., Candela, J. Q., and Murray-Smith, R. (2002). Gaussian process priors with uncertain inputs application to multiple-step ahead time series forecasting. Advances in neural information processing systems, 15.
  • Gneiting et al., (2006) Gneiting, T., Genton, M. G., and Guttorp, P. (2006). Geostatistical space-time models, stationarity, separability, and full symmetry. Monographs On Statistics and Applied Probability, 107:151.
  • Golub and Van Loan, (2013) Golub, G. H. and Van Loan, C. F. (2013). Matrix Computations. JHU press.
  • Gramacy and Apley, (2015) Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578.
  • Grande et al., (2014) Grande, R., Walsh, T., and How, J. (2014). Sample efficient reinforcement learning with Gaussian processes. In International Conference on Machine Learning, pages 1332–1340. PMLR.
  • Griewank, (1981) Griewank, A. O. (1981). Generalized descent for global optimization. Journal of optimization theory and applications, 34(1):11–39.
  • Grigorievskiy et al., (2017) Grigorievskiy, A., Lawrence, N., and Särkkä, S. (2017). Parallelizable sparse inverse formulation Gaussian processes (SpInGP). In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE.
  • Hartikainen and Särkkä, (2010) Hartikainen, J. and Särkkä, S. (2010). Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In 2010 IEEE international workshop on machine learning for signal processing, pages 379–384. IEEE.
  • Henderson et al., (1983) Henderson, H. V., Pukelsheim, F., and Searle, S. R. (1983). On the history of the Kronecker product. Linear and Multilinear Algebra, 14(2):113–120.
  • Hensman et al., (2018) Hensman, J., Durrande, N., and Solin, A. (2018). Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(151):1–52.
  • Hensman et al., (2013) Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. arXiv preprint arXiv:1309.6835.
  • Hensman et al., (2015) Hensman, J., Matthews, A., and Ghahramani, Z. (2015). Scalable variational Gaussian process classification. In Artificial Intelligence and Statistics, pages 351–360. PMLR.
  • Hestenes et al., (1952) Hestenes, M. R., Stiefel, E., et al. (1952). Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards, 49(6):409–436.
  • Journel and Huijbregts, (1976) Journel, A. G. and Huijbregts, C. J. (1976). Mining geostatistics.
  • Katzfuss and Guinness, (2021) Katzfuss, M. and Guinness, J. (2021). A general framework for Vecchia approximations of Gaussian processes. Statistical Science, 36(1):124–141.
  • Katzfuss et al., (2022) Katzfuss, M., Guinness, J., and Lawrence, E. (2022). Scaled Vecchia approximation for fast computer-model emulation. SIAM/ASA Journal on Uncertainty Quantification, 10(2):537–554.
  • Kennedy and O’Hagan, (2001) Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425–464.
  • Kolda, (2006) Kolda, T. G. (2006). Multilinear operators for higher-order decompositions. Technical report, Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA ….
  • Kuss and Rasmussen, (2003) Kuss, M. and Rasmussen, C. (2003). Gaussian processes in reinforcement learning. Advances in neural information processing systems, 16.
  • Kuss et al., (2005) Kuss, M., Rasmussen, C. E., and Herbrich, R. (2005). Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of machine learning research, 6(10).
  • Lin et al., (2023) Lin, J. A., Antorán, J., Padhy, S., Janz, D., Hernández-Lobato, J. M., and Terenin, A. (2023). Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent. arXiv preprint arXiv:2306.11589.
  • MacKay et al., (2003) MacKay, D. J., Mac Kay, D. J., et al. (2003). Information theory, inference and learning algorithms. Cambridge university press.
  • Maddox et al., (2021) Maddox, W. J., Balandat, M., Wilson, A. G., and Bakshy, E. (2021). Bayesian optimization with high-dimensional outputs. Advances in Neural Information Processing Systems, 34:19274–19287.
  • Mallasto and Feragen, (2017) Mallasto, A. and Feragen, A. (2017). Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes. Advances in Neural Information Processing Systems, 30.
  • Maruyama, (1955) Maruyama, G. (1955). Continuous Markov processes and stochastic equations. Rend. Circ. Mat. Palermo, 4:48–90.
  • Marzouk and Najm, (2009) Marzouk, Y. M. and Najm, H. N. (2009). Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228(6):1862–1902.
  • Murray-Smith and Pearlmutter, (2004) Murray-Smith, R. and Pearlmutter, B. A. (2004). Transformations of Gaussian process priors. In International Workshop on Deterministic and Statistical Methods in Machine Learning, pages 110–123. Springer.
  • Nagumo et al., (1962) Nagumo, J., Arimoto, S., and Yoshizawa, S. (1962). An active pulse transmission line simulating nerve axon. Proceedings of the IRE, 50(10):2061–2070.
  • Neal, (2012) Neal, R. M. (2012). Bayesian learning for neural networks, volume 118. Springer Science & Business Media.
  • Nguyen et al., (2021) Nguyen, V., Deisenroth, M. P., and Osborne, M. A. (2021). Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. arXiv preprint arXiv:2110.12087.
  • Nickisch and Rasmussen, (2008) Nickisch, H. and Rasmussen, C. E. (2008). Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(Oct):2035–2078.
  • Nickisch et al., (2018) Nickisch, H., Solin, A., and Grigorevskiy, A. (2018). State space Gaussian processes with non-Gaussian likelihood. In International Conference on Machine Learning, pages 3789–3798. PMLR.
  • Nychka et al., (2015) Nychka, D., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. (2015). A multiresolution Gaussian process model for the analysis of large spatial datasets. Journal of Computational and Graphical Statistics, 24(2):579–599.
  • O’Hagan, (1978) O’Hagan, A. (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society: Series B (Methodological), 40(1):1–24.
  • Okhrin et al., (2020) Okhrin, Y., Schmid, W., and Semeniuk, I. (2020). New approaches for monitoring image data. IEEE Transactions on Image Processing, 30:921–933.
  • Pardo-Iguzquiza and Chica-Olmo, (2008) Pardo-Iguzquiza, E. and Chica-Olmo, M. (2008). Geostatistics with the Matern semivariogram model: A library of computer programs for inference, kriging and simulation. Computers & Geosciences, 34(9):1073–1079.
  • Plumlee, (2014) Plumlee, M. (2014). Fast prediction of deterministic functions using sparse grid experimental designs. Journal of the American Statistical Association, 109(508):1581–1591.
  • Plumlee, (2017) Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112(519):1274–1285.
  • Quinonero-Candela and Rasmussen, (2005) Quinonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959.
  • Rahimi and Recht, (2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. Advances in neural information processing systems, 20.
  • Rasmussen, (2003) Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer school on machine learning, pages 63–71. Springer.
  • Rieger and Wendland, (2017) Rieger, C. and Wendland, H. (2017). Sampling inequalities for sparse grids. Numerische Mathematik, 136:439–466.
  • Roberts et al., (2013) Roberts, S., Osborne, M., Ebden, M., Reece, S., Gibson, N., and Aigrain, S. (2013). Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550.
  • Saad, (2003) Saad, Y. (2003). Iterative methods for sparse linear systems. SIAM.
  • Saatçi, (2012) Saatçi, Y. (2012). Scalable inference for structured Gaussian process models. PhD thesis, Citeseer.
  • Santner et al., (2003) Santner, T. J., Williams, B. J., Notz, W. I., and Williams, B. J. (2003). The design and analysis of computer experiments, volume 1. Springer.
  • Shewchuk et al., (1994) Shewchuk, J. R. et al. (1994). An introduction to the conjugate gradient method without the agonizing pain.
  • Silverman, (1985) Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society: Series B (Methodological), 47(1):1–21.
  • Smolyak, (1963) Smolyak, S. A. (1963). Quadrature and interpolation formulas for tensor products of certain classes of functions. In Doklady Akademii Nauk, volume 148, pages 1042–1045. Russian Academy of Sciences.
  • Snoek et al., (2012) Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25.
  • Spillane et al., (2014) Spillane, N., Dolean, V., Hauret, P., Nataf, F., Pechstein, C., and Scheichl, R. (2014). Abstract robust coarse spaces for systems of PDEs via generalized eigenproblems in the overlaps. Numerische Mathematik, 126:741–770.
  • Srinivas et al., (2009) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.
  • Stanton et al., (2021) Stanton, S., Maddox, W., Delbridge, I., and Wilson, A. G. (2021). Kernel interpolation for scalable online Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 3133–3141. PMLR.
  • Stein, (1999) Stein, M. L. (1999). Interpolation of spatial data: some theory for kriging. Springer Science & Business Media.
  • Teckentrup, (2020) Teckentrup, A. L. (2020). Convergence of Gaussian process regression with estimated hyper-parameters and applications in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 8(4):1310–1337.
  • Thompson, (1933) Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294.
  • Titsias, (2009) Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Artificial intelligence and statistics, pages 567–574. PMLR.
  • Toselli and Widlund, (2004) Toselli, A. and Widlund, O. (2004). Domain decomposition methods-algorithms and theory, volume 34. Springer Science & Business Media.
  • Trefethen and Bau III, (1997) Trefethen, L. N. and Bau III, D. (1997). Numerical linear algebra, vol. 50.
  • Tuo and Wang, (2020) Tuo, R. and Wang, W. (2020). Kriging prediction with isotropic Matérn correlations: Robustness and experimental designs. The Journal of Machine Learning Research, 21(1):7604–7641.
  • Tuo and Wu, (2015) Tuo, R. and Wu, C. J. (2015). Efficient calibration for imperfect computer models. The Annals of Statistics, 43(6):2331 – 2352.
  • Ullrich, (2008) Ullrich, T. (2008). Smolyak’s algorithm, sampling on sparse grids and Sobolev spaces of dominating mixed smoothness. East Journal of Approximations, 14(1):1.
  • Van der Vorst, (2003) Van der Vorst, H. A. (2003). Iterative Krylov methods for large linear systems. Number 13. Cambridge University Press.
  • Wang et al., (2019) Wang, K., Pleiss, G., Gardner, J., Tyree, S., Weinberger, K. Q., and Wilson, A. G. (2019). Exact Gaussian processes on a million data points. Advances in neural information processing systems, 32.
  • Wang et al., (2020) Wang, W., Tuo, R., and Jeff Wu, C. (2020). On prediction properties of kriging: Uniform error bounds and robustness. Journal of the American Statistical Association, 115(530):920–930.
  • (94) Wenger, J., Pleiss, G., Hennig, P., Cunningham, J., and Gardner, J. (2022a). Preconditioning for scalable Gaussian process hyperparameter optimization. In International Conference on Machine Learning, pages 23751–23780. PMLR.
  • (95) Wenger, J., Pleiss, G., Pförtner, M., Hennig, P., and Cunningham, J. P. (2022b). Posterior and computational uncertainty in Gaussian processes. Advances in Neural Information Processing Systems, 35:10876–10890.
  • Williams and Seeger, (2000) Williams, C. and Seeger, M. (2000). Using the Nyström method to speed up kernel machines. Advances in neural information processing systems, 13.
  • Williams and Rasmussen, (2006) Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA.
  • Wilson and Nickisch, (2015) Wilson, A. and Nickisch, H. (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International conference on machine learning, pages 1775–1784. PMLR.
  • Wilson et al., (2020) Wilson, J., Borovitskiy, V., Terenin, A., Mostowsky, P., and Deisenroth, M. (2020). Efficiently sampling functions from Gaussian process posteriors. In International Conference on Machine Learning, pages 10292–10302. PMLR.
  • Wilson et al., (2021) Wilson, J. T., Borovitskiy, V., Terenin, A., Mostowsky, P., and Deisenroth, M. P. (2021). Pathwise Conditioning of Gaussian Processes. J. Mach. Learn. Res., 22:105–1.
  • Xie and Xu, (2021) Xie, F. and Xu, Y. (2021). Bayesian projected calibration of computer models. Journal of the American Statistical Association, 116(536):1965–1982.
  • Yadav et al., (2022) Yadav, M., Sheldon, D. R., and Musco, C. (2022). Kernel Interpolation with Sparse Grids. Advances in Neural Information Processing Systems, 35:22883–22894.
  • Zafari et al., (2020) Zafari, S., Murashkina, M., Eerola, T., Sampo, J., Kälviäinen, H., and Haario, H. (2020). Resolving overlapping convex objects in silhouette images by concavity analysis and Gaussian process. Journal of Visual Communication and Image Representation, 73:102962.