This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\renewtheoremstyle

plain \nolinenumbers

Radial Neighbors for Provably Accurate Scalable Approximations of Gaussian Processes

Yichen Zhu [email protected] Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, MI, Italy.    Michele Peruzzi [email protected] Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, U.S.A.    Cheng Li [email protected] Department of Statistics and Data Science, National University of Singapore, Sinagpore.       David B. Dunson [email protected] Department of Statistical Science & Mathematics, Duke University, North Carolina, U.S.A.
Abstract

In geostatistical problems with massive sample size, Gaussian processes can be approximated using sparse directed acyclic graphs to achieve scalable O(n)O(n) computational complexity. In these models, data at each location are typically assumed conditionally dependent on a small set of parents which usually include a subset of the nearest neighbors. These methodologies often exhibit excellent empirical performance, but the lack of theoretical validation leads to unclear guidance in specifying the underlying graphical model and sensitivity to graph choice. We address these issues by introducing radial neighbors Gaussian processes (RadGP), a class of Gaussian processes based on directed acyclic graphs in which directed edges connect every location to all of its neighbors within a predetermined radius. We prove that any radial neighbors Gaussian process can accurately approximate the corresponding unrestricted Gaussian process in Wasserstein-2 distance, with an error rate determined by the approximation radius, the spatial covariance function, and the spatial dispersion of samples. We offer further empirical validation of our approach via applications on simulated and real world data showing excellent performance in both prior and posterior approximations to the original Gaussian process.

keywords:
Approximations; Directed acyclic graph; Gaussian process; Spatial statistics; Wasserstein distance.
journal: Submitted to Biometrika

\arabicsection Introduction

Data indexed by spatial coordinates are routinely collected in massive quantities due to the rapid pace of technological development in imaging sensors, wearables and tracking devices. Statistical models of spatially correlated data are most commonly built using Gaussian processes due to their flexibility and their ability to straightforwardly quantify uncertainty. Spatial dependence in the data is typically characterized by a parametric covariance function whose unknown parameters are estimated by repeatedly evaluating a multivariate normal density with a dense covariance matrix. At each density evaluation, the inverse covariance matrix and its determinant must be computed at a complexity that is cubic on the data dimension nn. This steep cost severely restricts the applicability of Gaussian processes on modern geolocated datasets. This problem is exacerbated in Bayesian hierarchical models computed via Markov-chain Monte Carlo (MCMC) because several thousands of O(n3)O(n^{3}) operations must be performed to fully characterize the joint posterior distribution.

A natural idea for retaining the flexibility of Gaussian processes while circumventing their computational bottlenecks is to use approximate methods for evaluating high dimensional multivariate Gaussian densities. If the scalable approximation is “good” in some sense, then one can use the approximate method for backing out inferences on the original process parameters. A multitude of such scalable methods has been proposed in the literature. Low-rank methods using inducing variables or knots (Quiñonero-Candela & Rasmussen, 2005; Cressie & Johannesson, 2008; Banerjee et al., 2008; Finley et al., 2009; Guhaniyogi et al., 2011; Sang et al., 2011) are only appropriate for approximating very smooth Gaussian processes because the number of inducing variables must increase polynomially fast with the sample size to avoid oversmoothing of the spatial surface (Stein, 2014; Burt et al., 2020). Scalability can also be achieved by assuming sparsity of the Gaussian covariance matrix via compactly supported covariance functions (Furrer et al., 2006; Kaufman et al., 2008; Bevilacqua et al., 2019); these methods operate by assuming marginal independence of pairs of data points that are beyond a certain distance from each other. See also the related method of Gramacy & Apley (2015). Using a similar intuition, one can achieve scalability by approximating the high dimensional Gaussian density with a product of small dimensional densities that characterize dependence of nearby data points. Composite likelihood methods (Bai et al., 2012; Eidsvik et al., 2014; Bevilacqua & Gaetan, 2015) approximate the high dimensional Gaussian density with the product of sub-likelihoods, whereas Gaussian Markov Random Fields (GMRF; Cressie, 1993; Rue & Held, 2005) approximate it via a product of conditional densities whose conditioning sets include all the local spatial neighbors.

Markovian conditional independence assumptions based on spatial neighborhoods can be visualized as a sparse undirected graphical model; these assumptions are intuitively appealing and lead to sparse precision matrices and more efficient computations (Rue, 2001). However, the normalizing constant of GMRF densities may require additional expensive computations, and additional approximation steps may be required when making predictions at new spatial locations because GMRFs do not necessarily extend to a standalone stochastic process. Both these problems can be resolved by ordering the data and conditioning only using previously ordered data points. Because one can now use a directed acyclic graph to represent conditional independence in the data, this approximation (due to Vecchia, 1988) immediately leads to a valid density which can be extended to a standalone stochastic process.

There is a rich literature on approximations of Gaussian processes using directed acyclic graphs. In addition to extensions of Vecchia’s method (Datta et al., 2016; Finley et al., 2019; Katzfuss & Guinness, 2021), scalable models can be built on domain partitions (Peruzzi et al., 2022), and generalizations are available for non-Gaussian and multivariate data (Zilber & Katzfuss, 2021; Peruzzi & Dunson, 2022, 2024) as well as for modeling nonstationary processes (Jin et al., 2021; Kidd & Katzfuss, 2022). However, we identify two issues with current approaches based on Vecchia’s approximation. First, because the graphical model representation of Vecchia approximations is not unique, the numerical performance of these methods can be sensitive to the choice of graphical model in certain scenarios (Guinness, 2018). There is no consensus on how to choose these conditioning sets. Second, although their performance has been demonstrated in practice (Heaton et al., 2019) and some theory on parameter estimation is available (Zhang, 2012; Zhang et al., 2023), there is limited theoretical study focused on accuracy of the approximation. The recent work by Schäfer et al. (2021) established Kullback-Leibler divergence bounds for Vecchia approximations, but under strong conditions that are difficult to verify for commonly used spatial covariance functions.

In this paper, we address these two issues by introducing a novel method and corresponding theoretical guarantees. Our radial neighbors Gaussian processes scalably approximate an initial Gaussian process by assuming conditional independence at spatial locations as prescribed by a directed acyclic graph. For every location in the graph, directed edges connect it to all of its neighbors within a predetermined radius ρ\rho.

We also introduce–to the best of our knowledge–the first general theoretical result regarding the closeness between a Vecchia-like method and the Gaussian process it approximates. We show that any process from the RadGP family can accurately approximate the original Gaussian processes in Wasserstein-2 distance, with an error rate determined by the approximation radius, the spatial covariance function, and the spatial dispersion of samples. This approximation accuracy is insensitive to the graphical structure as long as the graphical structure does not alter the radial condition of RadGP. Our theory can be applied to a wide range of covariance functions including the Matérn, Gaussian and generalized Cauchy models. Empirical studies demonstrate that our RadGPs have excellent performance in approximating a Gaussian process prior and the posterior from a Gaussian process regression model.

\arabicsection Radial Neighbors Gaussian process

\arabicsection.\arabicsubsection Radial Neighbors Graphs

Let the spatial domain Ω\Omega be a connected subset of d\mathbb{R}^{d}. Let Z=(Zs:sΩ)Z=(Z_{s}:s\in\Omega) be a real valued Gaussian process on Ωd\Omega\subset\mathbb{R}^{d} with zero mean and covariance function K:Ω×ΩK:\Omega\times\Omega\to\mathbb{R} such that for all s1,s2Ωs_{1},s_{2}\in\Omega, 𝔼(Zs1)=0\mathbb{E}(Z_{s_{1}})=0 and Cov(Zs1,Zs2)=K(s1,s2)\mathrm{Cov}(Z_{s_{1}},Z_{s_{2}})=K(s_{1},s_{2}). The objective of this paper is to approximate the process ZZ with another process Z^\hat{Z} such that the distance between Z^\hat{Z} and ZZ is small in some metric and finite marginal densities of Z^\hat{Z} are easy to compute.

Let 𝒟={w1,w2,,wn}\mathcal{D}=\{w_{1},w_{2},\ldots,w_{n}\} be a finite collection of locations in Ω\Omega. A ρ\rho-radial neighbors graph on 𝒟\mathcal{D} is defined to be a directed acyclic graph such that for any wi,wj𝒟w_{i},w_{j}\in\mathcal{D} with wiwj2<ρ\|w_{i}-w_{j}\|_{2}<\rho, there is a directed edge between wiw_{i} and wjw_{j}. Such radial neighbors graphs are not unique. Section \arabicsection.\arabicsubsection provides one practical way of constructing a radial neighbors graph that will be used in our experimental studies, while Section \arabicsection.\arabicsubsection presents how to build a Gaussian process from any radial neighbors graph. Our theory in Section \arabicsection applies to any Gaussian process built from a radial neighbors graph, not just the specific graph we constructed in Section \arabicsection.\arabicsubsection.

\arabicsection.\arabicsubsection Construction of a Radial Neighbors Graph

We present a procedure to construct the radial neighbors graph that will be used for experiments in Section \arabicsection. First, we specify an order of all locations in 𝒟\mathcal{D}. This is done by choosing a “center” location of 𝒟\mathcal{D} and sorting all the locations in 𝒟\mathcal{D} according to their distances to this “center” in a non-decreasing order. If the locations in 𝒟\mathcal{D} are roughly evenly distributed in the domain Ω\Omega, then the coordinates of this center are set as the arithmetic mean of the coordinates of all locations in 𝒟\mathcal{D}.

Next, once an order of locations in 𝒟\mathcal{D} is obtained, for each location wi𝒟w_{i}\in\mathcal{D}, we set the parent set of wiw_{i} to be all the locations preceeding wiw_{i} in this order and within ρ\rho distance to wiw_{i}. As a consequence, for all locations within ρ\rho radius of wiw_{i}, the ones that precede it are its parents, whereas the ones that follow it are its children. This construction satisfies the definition of radial neighbors graphs in Section \arabicsection.\arabicsubsection that any pair of locations (wi,wj)(w_{i},w_{j}) with wiwj2<ρ\|w_{i}-w_{j}\|_{2}<\rho have a directed edge connecting them.

Depending on the spatial dispersion of locations in 𝒟\mathcal{D}, some locations may not have neighbors within ρ\rho distance. We thus populate the parent set of a location wi𝒟w_{i}\in\mathcal{D} without ρ\rho-neighbors by drawing an edge from wiw_{i}’s nearest neighbor among all locations that precede wiw_{i} in the order. This step ensures that the resulting directed acyclic graph is connected.

If the locations in 𝒟\mathcal{D} are roughly evenly distributed in the domain Ω\Omega, once the first ii locations are chosen and the directed graph on them is constructed, our procedure roughly chooses the (i+1)(i+1)th location to be as close to the first ii locations as possible. Since spatial covariance function K(s1,s2)K(s_{1},s_{2}) is often relatively large when s1s_{1} and s2s_{2} are close to each other, our procedure aims to minimize the conditional variance of the (i+1)(i+1)th location given the first ii locations, or in other words, maximize the spatial information that the (i+1)(i+1)th location can borrow from the first ii locations. Figure \arabicfigure illustrates the process of building this radial neighbors graph when 𝒟\mathcal{D} is a 15×1515\times 15 grid on [0,1]2[0,1]^{2}.

Refer to caption
Figure \arabicfigure: Construction of a radial neighbors graph. In each subpanel, the location in red follows all locations in blue and precedes all locations in gray in the predetermined order. The parent set for the location in red is built by drawing a directed edge to it from each blue location inside the red circle.

Finally, if we have a new data set 𝒟~\tilde{\mathcal{D}}, the radial neighbors graph constructed on 𝒟\mathcal{D} can be extended to a radial neighbors graph on 𝒟𝒟~\mathcal{D}\cup\tilde{\mathcal{D}}. Specifically, once an ordering of locations in 𝒟~\tilde{\mathcal{D}} is obtained, we concatenate the ordering of locations in 𝒟\mathcal{D} and 𝒟~\tilde{\mathcal{D}} to obtain an ordering of locations 𝒟𝒟~\mathcal{D}\cup\tilde{\mathcal{D}}. We then construct a radial neighbors graph on 𝒟𝒟~\mathcal{D}\cup\tilde{\mathcal{D}} via the same procedure as described above. The final graph will contain the graph on 𝒟\mathcal{D} as a subgraph. This extension framework will be useful when extending a graph from training data to test data.

For some dataset, the spatial dispersion of both training and testing locations can be highly uneven. While a radial neighbor graph ensures all locations are connected in the same directed graph, it can still result in poor predictions on some test locations. We can mitigate this issue by artificially adding more test locations, so that each test locations receives a decent number of parents.

\arabicsection.\arabicsubsection Gaussian Processes from Radial Neighbors Graphs

Given a ρ\rho-radial neighbors graph on 𝒟={w1,w2,wn}\mathcal{D}=\{w_{1},w_{2},\ldots\,w_{n}\} defined in Section \arabicsection.\arabicsubsection, we can build a RadGP on 𝒟\mathcal{D} in which each wiw_{i} is conditionally only dependent on its parent set. Let the symbol =𝑑\overset{d}{=} denote equality in distribution and define ΣA,B\Sigma_{A,B} as the covariance matrix between two spatial location sets AA and BB under the covariance function K(,)K(\cdot,\cdot). The conditional distribution of Z^wi\hat{Z}_{w_{i}} is defined as

[Z^wiZ^wj,j<i]=𝑑[Z^wiZ^pa(wi)]=𝑑[ZwiZpa(wi)]\displaystyle[\hat{Z}_{w_{i}}\mid\hat{Z}_{w_{j}},j<i]\overset{d}{=}[\hat{Z}_{w_{i}}\mid\hat{Z}_{\mathrm{pa}(w_{i})}]\overset{d}{=}[{Z}_{w_{i}}\mid{Z}_{\mathrm{pa}(w_{i})}]
\displaystyle\sim N(Σpa(wi),wiTΣpa(wi),pa(wi)1Z^pa(wi),K(wi,wi)Σpa(wi),wiTΣpa(wi),pa(wi)1Σpa(wi),wi),\displaystyle N\Big{(}\Sigma_{\mathrm{pa}(w_{i}),w_{i}}^{\mathrm{\scriptscriptstyle T}}\Sigma_{\mathrm{pa}(w_{i}),~{}\mathrm{pa}(w_{i})}^{-1}\hat{Z}_{\mathrm{pa}(w_{i})},\;K(w_{i},w_{i})-\Sigma_{\mathrm{pa}(w_{i}),w_{i}}^{\mathrm{\scriptscriptstyle T}}\Sigma_{\mathrm{pa}(w_{i}),~{}\mathrm{pa}(w_{i})}^{-1}\Sigma_{\mathrm{pa}(w_{i}),w_{i}}\Big{)}, (\arabicequation)

where for i=1i=1 the parent set pa(w1)\mathrm{pa}(w_{1}) is empty.

Equation (\arabicsection.\arabicsubsection) defines the distribution of Z^\hat{Z} on all finite subsets of 𝒟\mathcal{D}, which essentially says the conditional distribution of Z^wi\hat{Z}_{w_{i}} given the process Z^\hat{Z} on all locations preceding wiw_{i} has the same law as the conditional distribution of ZwiZ_{w_{i}} given the process ZZ on the parent set pa(wi)\mathrm{pa}(w_{i}). We then extend this process Z^\hat{Z} to arbitrary finite subsets of the whole space Ω\Omega. For all sΩ\𝒟s\in\Omega\backslash\mathcal{D}, we define the parents of ss as pa(s)={s𝒟:ss2<ρ}\mathrm{pa}(s)=\{s^{\prime}\in\mathcal{D}:\|s^{\prime}-s\|_{2}<\rho\}. For any finite set UΩ\𝒟U\subset\Omega\backslash\mathcal{D}, we define the conditional distribution of Z^U\hat{Z}_{U} given Z^𝒯1\hat{Z}_{\mathcal{T}_{1}} as

p(Z^UZ^𝒟)\displaystyle p(\hat{Z}_{U}\mid\hat{Z}_{\mathcal{D}}) =sUp(Z^sZ^pa(s))=sUp(ZsZpa(s)),\displaystyle=\prod_{s\in U}p(\hat{Z}_{s}\mid\hat{Z}_{\mathrm{pa}(s)})=\prod_{s\in U}p({Z}_{s}\mid Z_{\mathrm{pa}(s)}),
[Z^sZ^pa(s)]\displaystyle[\hat{Z}_{s}\mid\hat{Z}_{\mathrm{pa}(s)}] N(Σpa(s),sTΣpa(s),pa(s)1Z^pa(s),K(s,s)Σpa(s),sTΣpa(s),pa(s)1Σpa(s),s).\displaystyle\sim N\Big{(}\Sigma_{\mathrm{pa}(s),s}^{\mathrm{\scriptscriptstyle T}}\Sigma_{\mathrm{pa}(s),\mathrm{pa}(s)}^{-1}\hat{Z}_{\mathrm{pa}(s)},\;K(s,s)-\Sigma_{\mathrm{pa}(s),s}^{\mathrm{\scriptscriptstyle T}}\Sigma_{\mathrm{pa}(s),\mathrm{pa}(s)}^{-1}\Sigma_{\mathrm{pa}(s),s}\Big{)}. (\arabicequation)

Equations (\arabicsection.\arabicsubsection) and (\arabicsection.\arabicsubsection) together give the joint distribution of Z^\hat{Z} on any finite subset of Ω\Omega. Specifically, for a generic finite set A=UVA=U\cup V with UΩ\𝒟U\subset\Omega\backslash\mathcal{D} and V𝒟V\subset\mathcal{D}, the joint density of Z^\hat{Z} on AA implied by Equations (\arabicsection.\arabicsubsection) and (\arabicsection.\arabicsubsection) is

p(Z^A)=p(Z^U|Z^𝒟)i=1np(Z^wi|Z^pa(wi))wi𝒟\Vdwi.p(\hat{Z}_{A})=p(\hat{Z}_{U}|\hat{Z}_{\mathcal{D}})\int\prod_{i=1}^{n}p(\hat{Z}_{w_{i}}|\hat{Z}_{\mathrm{pa}(w_{i})})\prod_{w_{i}\in\mathcal{D}\backslash V}dw_{i}. (\arabicequation)

Equations (\arabicsection.\arabicsubsection)-(\arabicequation) complete the definition of our RadGP.

Lemma \arabicsection.\arabictheorem.

The radial neighbors Gaussian process (RadGP) Z^()\hat{Z}(\cdot) is a valid Gaussian process on the whole spatial domain Ω\Omega.

The proof of Lemma \arabicsection.\arabictheorem relies on the Kolmogorov extension theorem and is available at Section S1 of the Supplementary Material.

\arabicsection Theoretical Properties

\arabicsection.\arabicsubsection Wasserstein Distance Between Gaussian Processes

The objective of this section is to establish theoretical upper bounds for the Wasserstein distance between the original Gaussian process and any RadGP on 𝒟\mathcal{D} described above. For two generic random vectors Z1,Z2Z_{1},Z_{2} defined on the same space 𝒵\mathcal{Z} following the probability distributions μ1,μ2\mu_{1},\mu_{2}, their Wasserstein-2 (W2W_{2}) distance is

W2(Z1,Z2)={minμ~𝒵×𝒵z1z222dμ~(z1,z2)}1/2,W_{2}(Z_{1},Z_{2})=\left\{\min_{\tilde{\mu}\in\mathcal{M}}\int_{\mathcal{Z}\times\mathcal{Z}}\|z_{1}-z_{2}\|_{2}^{2}d\tilde{\mu}(z_{1},z_{2})\right\}^{1/2},

where \mathcal{M} is the set of all probability distributions on the product space 𝒵×𝒵\mathcal{Z}\times\mathcal{Z} whose marginals are μ1\mu_{1} and μ2\mu_{2}. Let Z𝒟Z_{\mathcal{D}} and Z^𝒟\hat{Z}_{\mathcal{D}} be the random vectors from the original Gaussian process and the RadGP on the set 𝒟\mathcal{D}. Denote the covariance matrix of Z𝒟Z_{\mathcal{D}} and Z^𝒟\hat{Z}_{\mathcal{D}} as Σ𝒟𝒟\Sigma_{\mathcal{D}\mathcal{D}} and Σ^𝒟𝒟\hat{\Sigma}_{\mathcal{D}\mathcal{D}}, respectively. Then the Wasserstein-2 distance between Z𝒟Z_{\mathcal{D}} and Z^𝒟\hat{Z}_{\mathcal{D}} has a closed form (Gelbrich, 1990):

W22(Z𝒟,Z^𝒟)=tr(Σ𝒟𝒟)+tr(Σ^𝒟𝒟)2tr{(Σ𝒟𝒟1/2Σ^𝒟𝒟Σ𝒟𝒟1/2)1/2},W_{2}^{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})=\mathrm{tr}(\Sigma_{\mathcal{D}\mathcal{D}})+\mathrm{tr}(\hat{\Sigma}_{\mathcal{D}\mathcal{D}})-2\mathrm{tr}\{(\Sigma_{\mathcal{D}\mathcal{D}}^{1/2}\hat{\Sigma}_{\mathcal{D}\mathcal{D}}\Sigma_{\mathcal{D}\mathcal{D}}^{1/2})^{1/2}\}, (\arabicequation)

where tr(A)\mathrm{tr}(A) denotes the trace of a square matrix AA. The right-hand side of equation (\arabicequation) involves various matrix powers that are difficult to analyze. Fortunately, for Gaussian measures, the squared Wasserstein-2 distance can be upper bounded by the trace norm of the difference between their covariance matrices, which greatly simplifies our analysis.

\arabicsection.\arabicsubsection Spatial Decaying Families

We consider the Gaussian process Z=(Zs:sΩ)Z=(Z_{s}:s\in\Omega) with zero mean and an isotropic nonnegative covariance function K(,)K(\cdot,\cdot). Thus we can reformulate KK as K(s1,s2)=K0(s1s22)K(s_{1},s_{2})=K_{0}(\|s_{1}-s_{2}\|_{2}) for all s1,s2Ωs_{1},s_{2}\in\Omega, where K0:[0,+)[0,+)K_{0}:[0,+\infty)\to[0,+\infty). For all r>0r>0, define the function vr():+v_{r}(\cdot):\mathbb{R}\to\mathbb{R}_{+} as vr(x)=k=0+|x|k/(k!)rv_{r}(x)=\sum_{k=0}^{+\infty}|x|^{k}/(k!)^{r}. Then 1/vr(x)1/v_{r}(x) is a monotone decreasing function of xx, where a larger rr results in a slower decay rate in 1/vr(x)1/v_{r}(x). Specifically, 1/v1(x)=exp(x)1/v_{1}(x)=\exp(-x) and 1/vr(x)1/v_{r}(x) with r>1r>1 has a decay rate faster than all polynomials but slower than the exponential function. We also define a series of polynomials cr(x)=1+xrc_{r}(x)=1+x^{r} for r>0r>0. For each vrv_{r} with r>1r>1 and each crc_{r} with r>0r>0, we define the following families of Gaussian processes:

𝒵vr\displaystyle\mathscr{Z}_{v_{r}} ={Z=(Zs:sΩ):K0(s1s22)1vr(s1s22)(1+s1s22d+1)},\displaystyle=\left\{Z=(Z_{s}:s\in\Omega):K_{0}(\|s_{1}-s_{2}\|_{2})\leq\frac{1}{v_{r}(\|s_{1}-s_{2}\|_{2})(1+\|s_{1}-s_{2}\|_{2}^{d+1})}\right\}, (\arabicequation)
𝒵cr\displaystyle\mathscr{Z}_{c_{r}} ={Z=(Zs:sΩ):K0(s1s22)1(1+s1s22)r}.\displaystyle=\left\{Z=(Z_{s}:s\in\Omega):K_{0}(\|s_{1}-s_{2}\|_{2})\leq\frac{1}{(1+\|s_{1}-s_{2}\|_{2})^{r}}\right\}. (\arabicequation)

Intuitively, 𝒵vr\mathscr{Z}_{v_{r}} is the family of isotropic Gaussian processes whose covariance function decays no slower than some subexponential rate of the spatial distance, while 𝒵cr\mathscr{Z}_{c_{r}} is the family with covariance functions decaying no slower than an rrth order polynomial of the spatial distance.

The objective of our theoretical study is to bound the Wasserstein-2 distance between the marginal distribution of RadGP and original Gaussian process on 𝒟\mathcal{D} when the original process belongs to the two spatial decaying families in (\arabicequation) and (\arabicequation):

supZ𝒵vrW22(Z𝒟,Z^𝒟)andsupZ𝒵crW22(Z𝒟,Z^𝒟).\sup_{Z\in\mathscr{Z}_{v_{r}}}W^{2}_{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\quad\text{and}\quad\sup_{Z\in\mathscr{Z}_{c_{r}}}W^{2}_{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}}). (\arabicequation)

\arabicsection.\arabicsubsection Rate of Approximation

For the finite set 𝒟\mathcal{D} described above, we define the minimal separation distance among all spatial locations in 𝒟\mathcal{D} as q=min1i<j|𝒟|wiwj2q=\min_{1\leq i<j\leq|\mathcal{D}|}\|w_{i}-w_{j}\|_{2}. This minimal separation distance is a parsimonious statistic for describing the spatial dispersion of 𝒟\mathcal{D}, and is useful in bounding various quantities such as the maximal eigenvalue and condition number of Σ𝒟𝒟\Sigma_{\mathcal{D}\mathcal{D}}; see Lemma S2 of the Supplementary Material for details. Let K^0(w)=(2π)d/2K0(x)exp(ıwTx)𝑑x\hat{K}_{0}(w)=(2\pi)^{-d/2}\int K_{0}(x)\exp(-\imath w^{{\mathrm{\scriptscriptstyle T}}}x)dx be the Fourier transform of K0K_{0} where ı2=1\imath^{2}=-1, and let ϕ0(x)=infw22xK^0(w)\phi_{0}(x)=\inf_{\|w\|_{2}\leq 2x}\hat{K}_{0}(w) for all x>0x>0. For two positive sequences ana_{n} and bnb_{n}, we use anbna_{n}\lesssim b_{n} to denote the relation that anCbna_{n}\leq Cb_{n} for all nn and some finite constant CC. We have the following theorems on the Wasserstein-2 distance between the RadGP and the original Gaussian process. The constants c1,c2c_{1},c_{2} are from Lemma S2 while c3c_{3} is from the proof of Lemma S5 in the Supplementary Material, all of which only depend on the dimension dd of the spatial domain Ω\Omega.

Theorem \arabicsection.\arabictheorem.

For all radial neighbors Gaussian processes Z^\hat{Z}, for the family 𝒵vr\mathscr{Z}_{v_{r}} in (\arabicequation) with r>1r>1, if 0<q<10<q<1 and n1/2{vr(ρd1/2)}1{ϕ0(c2/q)}9/2vr1[c3{ϕ0(c2/q)}1]c6n^{1/2}\{v_{r}(\rho d^{-1/2})\}^{-1}\{\phi_{0}(c_{2}/q)\}^{-9/2}v_{r-1}[c_{3}\{\phi_{0}(c_{2}/q)\}^{-1}]\leq c_{6} hold for some constant c6c_{6} only dependent on dd, then

supZ𝒵vrW22(Z𝒟,Z^𝒟)nvr(ρd1/2){ϕ0(c2/q)}5qdvr1[c3{ϕ0(c2/q)}1].\sup_{Z\in\mathscr{Z}_{v_{r}}}W_{2}^{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\lesssim\frac{n}{v_{r}(\rho d^{-1/2})}\{\phi_{0}(c_{2}/q)\}^{-5}q^{-d}v_{r-1}[c_{3}\{\phi_{0}(c_{2}/q)\}^{-1}]. (\arabicequation)

Else if q1q\geq 1 and n1/2/vr(ρd1/2)c6n^{1/2}/v_{r}(\rho d^{-1/2})\leq c^{\prime}_{6} hold for some constant c6c^{\prime}_{6} only dependent on dd, then supZ𝒵vrW22(Z𝒟,Z^𝒟)n/vr(ρd1/2).\sup_{Z\in\mathscr{Z}_{v_{r}}}W^{2}_{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\lesssim n/v_{r}(\rho d^{-1/2}).

Theorem \arabicsection.\arabictheorem.

For all radial neighbors Gaussian processes Z^\hat{Z}, for the family 𝒵cr\mathscr{Z}_{c_{r}} in (\arabicequation) with rd+1r\geq d+1, if 0<q<10<q<1 and n1/2(1+ρd1/2)(rd1)q(r7)d{ϕ0(c2/q)}(r+4)n^{1/2}(1+\rho d^{-1/2})^{-(r-d-1)}q^{(r-7)d}\{\phi_{0}(c_{2}/q)\}^{-(r+4)} (c1c5d2d1π/6)rc7(c_{1}c_{5}d2^{d-1}\pi/\surd{6})^{r}\leq c_{7} hold for some constant c7c_{7} only dependent on dd, then

supZ𝒵crW22(Z𝒟,Z^𝒟)n(1+ρd1/2)rd1q(r8)d{ϕ0(c2/q)}(r+9/2)(c1c5d2d1π/6)r,\sup_{Z\in\mathscr{Z}_{c_{r}}}W_{2}^{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\lesssim\frac{n}{(1+\rho d^{-1/2})^{r-d-1}}q^{(r-8)d}\{\phi_{0}(c_{2}/q)\}^{-(r+9/2)}(c_{1}c_{5}d2^{d-1}\pi/\surd{6})^{r}, (\arabicequation)

Else if q1q\geq 1 and n1/2(1+ρd1/2)(rd1){ϕ0(c2/q)}r(c1c5d2d1π/6)rc7n^{1/2}(1+\rho d^{-1/2})^{-(r-d-1)}\{\phi_{0}(c_{2}/q)\}^{-r}(c_{1}c_{5}d2^{d-1}\pi/\surd{6})^{r}\leq c^{\prime}_{7} hold for some constant c7c^{\prime}_{7} only dependent on dd, then supZ𝒵crW22(Z𝒟,Z^𝒟)\sup_{Z\in\mathscr{Z}_{c_{r}}}W^{2}_{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\lesssim n(1+ρd1/2)(rd1)n(1+\rho d^{-1/2})^{-(r-d-1)}.

The conditions for the above theorems, namely the equations regarding c6,c6,c7,c7c_{6},c^{\prime}_{6},c_{7},c^{\prime}_{7}, are necessary but not sufficient conditions for the upper bounds of Wasserstein-2 distance to go to zero. Thus, they can be safely ignored if we aim for accurate approximations. We now give a detailed analysis of these upper bounds for Wasserstein-2 distance involving three variables: the sample size nn, the minimal separation distance qq, and the approximation radius ρ\rho in the RadGP. We will discuss from the perspectives of both finite samples and asymptotics. From a finite sample perspective, we fix two out of three variables and discuss how and why the upper bounds change as the remaining variable changes; from an asymptotic perspective, we describe the relations between n,qn,q and ρ\rho such that as nn goes to infinity, the Wasserstein-2 distance goes to zero.

Viewing Theorems \arabicsection.\arabictheorem and \arabicsection.\arabictheorem as finite sample bounds, the upper bounds are linearly dependent on the sample size nn. This directly follows from the definition of Wasserstein distance which measures the difference between two nn-dimensional Gaussian distributions. Provided both the minimal separation distance qq and approximation radius ρ\rho are fixed, as the sample size nn increases, the local structure for existing locations will not change and the approximation accuracy for them will not improve. Thus, adding more locations will lead to a linearly increasing approximation error in the squared W2W_{2} distance.

Second, the upper bounds on the squared W2W_{2} distance decay with respect to the approximation radius ρd1/2\rho d^{-1/2} at the same rate as the spatial covariance function with respect to the spatial distance, up to a (d+1)(d+1)th order polynomial. For Theorem \arabicsection.\arabictheorem, the covariance function is required to decay faster than 1/vr(x)1/v_{r}(x) by a polynomial 1+xd+11+x^{d+1} as defined in equation (\arabicequation); for Theorem \arabicsection.\arabictheorem, the covariance function decays faster than an rrth order polynomial while the upper bound of W2W_{2} distance decays at an (rd1)(r-d-1)th order polynomial. The difference between the decay rate of the covariance function and the W2W_{2} distance in Theorems \arabicsection.\arabictheorem and \arabicsection.\arabictheorem comes from the fact that high dimension reduces the effective decay rate of covariance function. For example, under a fixed minimal separation distance, when d=1d=1, as long as the covariance function decays faster than 1/x1/x by a polynomial term, the matrix Σ𝒟𝒟\Sigma_{\mathcal{D}\mathcal{D}} has a bounded l1l_{1} norm regardless of sample size; however, for a general dd dimensional space, the covariance function must decay faster than 1/xd1/x^{d} for Σ𝒟𝒟\Sigma_{\mathcal{D}\mathcal{D}} to have a bounded l1l_{1} norm. These upper bounds show that the approximation radius ρ\rho indeed controls the approximation accuracy. Specifically, for fixed qq and nn, we can reach an arbitrarily small W2W_{2} error by increasing ρ\rho over some threshold value; see our detailed conditions on ρ\rho in Corollary \arabicsection.\arabictheorem for various types of covariance functions.

Third, for fixed ρ\rho and nn, both upper bounds in the two theorems increase as the minimal separation distance qq decreases. This is because as the minimal separation distance decreases, spatial locations get closer to each other and the covariance matrix gets close to singular, leading to a larger approximation error.

From the asymptotic perspective, we can increase the sample size nn while viewing the minimal separation distance qq and the approximation radius ρ\rho as some functions of nn. The objective is to choose an approximation radius ρ\rho dependent on nn and qq such that the squared Wasserstein distance goes to zero as nn goes to infinity. We first present a corollary showing how to choose such ρ\rho for three commmonly used covariance functions, where the convergence is understood in the sense that qq is also a function of nn.

Corollary \arabicsection.\arabictheorem.

We have W22(Z𝒟,Z^𝒟)0W_{2}^{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\to 0 as nn\to\infty if one of the following conditions is satisfied:
(1) The covariance function is the isotropic Matérn: K0(s1s2)=τ221νΓ(ν)1(ϕs1s2)ν𝒦ν(ϕs1s22)K_{0}(s_{1}-s_{2})=\tau^{2}2^{1-\nu}{\Gamma(\nu)}^{-1}\left(\phi\|s_{1}-s_{2}\|\right)^{\nu}\mathcal{K}_{\nu}\left(\phi\|s_{1}-s_{2}\|_{2}\right) with q<1/ϕq<1/\phi. Define the constant cm,1=Γ(ν)/{τ22dπd/2Γ(ν+d/2)}c_{m,1}=\Gamma(\nu)/\{\tau^{2}2^{d}\pi^{d/2}\Gamma(\nu+d/2)\}. The approximation radius ρ\rho satisfies

ρd1/2ϕ[c3cm,1(1+4c22ϕ2q2)ν+d2+ln{cm,1nqd(1+4c22ϕ2q2)5(ν+d2)}]3.\rho\geq\frac{d^{1/2}}{\phi}\left[c_{3}c_{m,1}\Big{(}1+\frac{4c_{2}^{2}}{\phi^{2}q^{2}}\Big{)}^{\nu+\frac{d}{2}}+\ln\Big{\{}c_{m,1}nq^{-d}\Big{(}1+\frac{4c_{2}^{2}}{\phi^{2}q^{2}}\Big{)}^{5(\nu+\frac{d}{2})}\Big{\}}\right]^{3}.

(2) The covariance function is Gaussian K0(s1s22)=τ2exp(as1s222)K_{0}(\|s_{1}-s_{2}\|_{2})=\tau^{2}\exp(-a\|s_{1}-s_{2}\|_{2}^{2}) with q<a1/2q<a^{-1/2}. The approximation radius ρ\rho satisfies

ρ(da)1/2[c3τ2exp{c22/(aq2)}+ln(nqdτ10)+5c22aq2]3.\rho\geq\left(\frac{d}{a}\right)^{1/2}\left[c_{3}\tau^{-2}\exp\big{\{}c_{2}^{2}/(aq^{2})\big{\}}+\ln(nq^{-d}\tau^{-10})+\frac{5c_{2}^{2}}{aq^{2}}\right]^{3}.

(3) The covariance function is the generalized Cauchy K0(s1s22)=τ2{1+(ϕs1s2)δ}λ/δK_{0}(\|s_{1}-s_{2}\|_{2})=\tau^{2}\big{\{}1+(\phi\|s_{1}-s_{2}\|)^{\delta}\big{\}}^{-\lambda/\delta} with parameter λ>d+1\lambda>d+1 and q<ϕq<\phi. Let c9c_{9} be a constant dependent on dimension dd and covariance function parameters τ2,ϕ,δ,λ\tau^{2},\phi,\delta,\lambda. The approximation radius ρ\rho satisfies

ρc9q252d+δ(λ+92)λ(d+1)n1λ(d+1).\rho\geq c_{9}q^{-\frac{\frac{25}{2}d+\delta(\lambda+\frac{9}{2})}{\lambda-(d+1)}}n^{\frac{1}{\lambda-(d+1)}}.

We then analyze our main theorems from the asymptotic perspective and verify the analysis using the three examples in Corollary \arabicsection.\arabictheorem. Depending on whether and how the minimal separation distance qq changes with respect to the sample size nn, there are three commonly used asymptotic frameworks: fixed domain, mixed domain and increasing domain asymptotics.

Under the increasing domain setting, as nn goes to infinity, the radius of the domain Ω\Omega increases while the minimal separation distance qq is fixed. Therefore, as long as nd1/2vr1(n)n\gtrsim d^{1/2}v_{r}^{-1}(n) under conditions of Theorem \arabicsection.\arabictheorem or nd1/2n1/(rd1)n\gtrsim d^{1/2}n^{1/(r-d-1)} under conditions of Theorem \arabicsection.\arabictheorem, we have o(1)o(1) approximation error. When applied to Matérn covariance in case (1) and Gaussian covariance in case (2) of Corollary \arabicsection.\arabictheorem, this yields a slowly increasing rate of O(ln3n)O(\ln^{3}n) for the approximation radius ρ\rho, which is much smaller than the sample size nn. On the other hand, when the covariance function decays polynomially such as the generalized Cauchy in case (3) of Corollary \arabicsection.\arabictheorem, the approximation radius ρ\rho needs to increase at a polynomial rate of sample size nn. This is because for covariance functions that decrease slowly in the spatial distance, one needs more spatial locations and a larger approximation radius to capture the dependence information.

Under the mixed domain setting, as nn goes to infinity, the radius of the domain Ω\Omega increases with nn while the minimal separation distance qq also decreases with nn. By solving the inequalities on the right hand side of equations (\arabicequation) and (\arabicequation) being no greater than o(1)o(1), we obtain the lower bound of ρ\rho that guarantees the o(1)o(1) approximation error. Corollary \arabicsection.\arabictheorem directly shows such lower bounds for Matérn, Gaussian and generalize Cauchy covariance functions. We can still achieve the o(1)o(1) approximation error as long as these lower bounds of ρ\rho in the three cases of Corollary \arabicsection.\arabictheorem increase slower than the radius of the domain Ω\Omega. Otherwise, if ρ\rho increases exactly at the same rate as the radius of Ω\Omega, RadGP is identical to the original Gaussian process and does not make any approximation at all. Our theory is not applicable when the lower bounds of ρ\rho exceed the radius of domain Ω\Omega. In particular, for the fixed domain setting, the radius of Ω\Omega is fixed as a constant and the minimal separation distance qq decreases to zero at a rate no slower than O(n1/d)O(n^{-1/d}). In such case, the lower bounds of ρ\rho in Corollary \arabicsection.\arabictheorem will diverge with nn, and therefore Theorems \arabicsection.\arabictheorem and \arabicsection.\arabictheorem do not provide a vanishing bound for Wasserstein-2 distances.

Our results fill a major gap in the literature by providing the previously lacking theoretical support for popular local or neighborhood-based Gaussian process approximations. Our theory only requires knowledge of the spatial decaying pattern of the covariance functions (e.g., exponential and polynomial in 𝒵vr\mathscr{Z}_{v_{r}} and 𝒵cr\mathscr{Z}_{c_{r}}, respectively). These conditions are easily verifiable, with derivations for the Matérn, Gaussian and the generalized Cauchy covariance functions provided in Corollary \arabicsection.\arabictheorem. Similar techniques can be used for a wider range of stationary covariance functions. Furthermore, our novel theory explicitly links the approximation quality from choosing radius ρ\rho with the minimal separation distance qq and the sample size nn, both of which are readily available from observed data.

\arabicsection Bayesian Regression with RadGP

Consider a linear regression model with spatial latent effects as:

Y(s)=X(s)β+Z(s)+ϵ(s),sΩ,Y(s)=X(s)\beta+Z(s)+\epsilon(s),\;s\in\Omega, (\arabicequation)

where X(s)X(s) are covariates at location ss, β\beta are regression coefficients for the covariates, Z(s)Z(s) is a Gaussian process with zero mean and covariance function Kθ:Ω×ΩK_{\theta}:\Omega\times\Omega\to\mathbb{R}, and ϵ(s)\epsilon(s) is a white noise process. Researchers observe Y(s),X(s)Y(s),X(s) at a collection of training locations 𝒯1\mathcal{T}_{1} with the objective of estimating the regression coefficients β\beta, the covariance function parameters θ\theta, and the spatial random effects Z(s)Z(s) at both training locations 𝒯1\mathcal{T}_{1} and test locations 𝒯2Ω\mathcal{T}_{2}\subset\Omega. If the spatial random effects Z(s)Z(s) are not of interest, one can combine Z(s)Z(s) and ϵ(s)\epsilon(s) into one Gaussian process, known as the response model. This is described in Section S4.2 of the Supplementary Material. Hereafter we assume that inference on Z(s)Z(s) is desired. Let Z𝒯1Z_{\mathcal{T}_{1}} denote the column vector consisting of (Z(s):s𝒯1)(Z(s):s\in\mathcal{T}_{1}). Similarly define other notations with 𝒯i\mathcal{T}_{i} as subscripts. Endow the spatial random effects on 𝒯1𝒯2\mathcal{T}_{1}\cup\mathcal{T}_{2} with our RadGP prior defined in Section \arabicsection.\arabicsubsection, which is built on the radial neighbors graph constructed in Section \arabicsection.\arabicsubsection such that the locations in the training set 𝒯1\mathcal{T}_{1} precede locations in the test set 𝒯2\mathcal{T}_{2}. The full Bayesian model can be outlined as follows:

(β,θ,σ2)p(β)p(θ)p(σ2),ϵ(s)i.i.d.N(0,σ2),sΩ,Z^𝒯1|θs𝒯1p(Z^s|Z^pa(s)),\displaystyle(\beta,\theta,\sigma^{2})\sim p(\beta)p(\theta)p(\sigma^{2}),\quad\epsilon(s)\overset{\text{i.i.d.}}{\sim}N(0,\sigma^{2}),s\in\Omega,\quad\hat{Z}_{\mathcal{T}_{1}}|\theta\sim\prod_{s\in\mathcal{T}_{1}}p(\hat{Z}_{s}|\hat{Z}_{\mathrm{pa}(s)}),
p(Z^𝒯2|Z^𝒯1,θ)s𝒯2p(Z^s|Z^pa(s)),Y(s)=X(s)β+Z^(s)+ϵ(s),for sΩ.\displaystyle p(\hat{Z}_{\mathcal{T}_{2}}|\hat{Z}_{\mathcal{T}_{1}},\theta)\propto\prod_{s\in\mathcal{T}_{2}}p(\hat{Z}_{s}|\hat{Z}_{\mathrm{pa}(s)}),\quad Y(s)=X(s)\beta+\hat{Z}(s)+\epsilon(s),\quad\text{for }s\in\Omega. (\arabicequation)

The priors for β,θ\beta,\theta and σ2\sigma^{2} can be flexible, but it is advised to set a proper prior for either σ2\sigma^{2} or θ\theta due to potential non-identifiability.

We employ a Gibbs sampling framework while handling each full conditional with different approaches. If the prior of β\beta is normal βN(β0,Φ01)\beta\sim N(\beta_{0},\Phi_{0}^{-1}), then the posterior of β\beta is also normal:

β|Y𝒯1,Z^𝒯1,σ2\displaystyle\beta|{Y}_{\mathcal{T}_{1}},\hat{Z}_{\mathcal{T}_{1}},\sigma^{2}
\displaystyle\sim N((Φ0+X𝒯1TX𝒯1/σ2)1{Φ0β0+X𝒯1T(Y𝒯1Z^𝒯1)/σ2},(Φ0+X𝒯1TX𝒯1/σ2)1).\displaystyle N\big{(}(\Phi_{0}+{X}_{\mathcal{T}_{1}}^{T}{X}_{\mathcal{T}_{1}}/\sigma^{2})^{-1}\{\Phi_{0}\beta_{0}+{X}_{\mathcal{T}_{1}}^{T}({Y}_{\mathcal{T}_{1}}-\hat{Z}_{\mathcal{T}_{1}})/\sigma^{2}\},(\Phi_{0}+{X}_{\mathcal{T}_{1}}^{T}{X}_{\mathcal{T}_{1}}/\sigma^{2})^{-1}\big{)}. (\arabicequation)

We then sample σ2\sigma^{2} from its full conditional. If the prior distribution for σ2\sigma^{2} is inverse gamma σ2IG(a0,b0)\sigma^{2}\sim\mathrm{IG}(a_{0},b_{0}), then the posterior of σ2\sigma^{2} is also inverse gamma:

σ2|Y𝒯1,Z^𝒯1,βIG(a0+n/2,b0+Y𝒯1X𝒯1βZ^𝒯122/2).\sigma^{2}|{Y}_{\mathcal{T}_{1}},\hat{Z}_{\mathcal{T}_{1}},\beta\sim\mathrm{IG}\left(a_{0}+n/2,b_{0}+\|{Y}_{\mathcal{T}_{1}}-{X}_{\mathcal{T}_{1}}\beta-\hat{Z}_{\mathcal{T}_{1}}\|_{2}^{2}/2\right). (\arabicequation)

Denote the precision matrix of Z^𝒯1|θ\hat{Z}_{\mathcal{T}_{1}}|\theta from the RadGP prior as Φ^\hat{\Phi}. We sample the spatial random effects on 𝒯1\mathcal{T}_{1} with conjugate gradients as illustrated in Nishimura & Suchard (2022), using the full conditional distribution:

Z^𝒯1|Y𝒯1,β,σ2,θN(ξ,(Φ^+In/σ2)1),\hat{Z}_{\mathcal{T}_{1}}|{Y}_{\mathcal{T}_{1}},\beta,\sigma^{2},\theta\sim N({\xi},(\hat{\Phi}+{I}_{n}/\sigma^{2})^{-1}),

where ξ=(Φ^+In/σ2)1(Y𝒯1X𝒯1β)/σ2{\xi}=(\hat{{\Phi}}+{I}_{n}/\sigma^{2})^{-1}({Y}_{\mathcal{T}_{1}}-{X}_{\mathcal{T}_{1}}\beta)/\sigma^{2}.

Finally, for the parameter θ\theta in the covariance function KθK_{\theta}, there is no available conjugate prior in general. We sample θ\theta using the Metropolis-Hastings algorithm relying on the full conditional p(θ|Z^𝒯1)p(Z^𝒯1|θ)p(θ)p(\theta|\hat{Z}_{\mathcal{T}_{1}})\propto p(\hat{Z}_{\mathcal{T}_{1}}|\theta)p(\theta). A detailed algorithm for posterior sampling can be found in Section S4.1 of the Supplementary Material.

If prediction on a test set is of interest, given MCMC samples of random effects on training set 𝒯1\mathcal{T}_{1} and parameter θ\theta, Z𝒯2Z_{\mathcal{T}_{2}} can be sampled from each unidimensional Gaussian distribution as in equation (\arabicequation). Unlike most previous related work, this prediction accounts for dependence among test locations, which is important in many applied contexts.

In practical applications where the sample sizes of training set and test set are similar and much larger than the cardinality of parent sets, the computational complexity is O(Mlncgn)O(Mln_{\text{cg}}n), where ll is the number of MCMC iterations, ncgn_{\text{cg}} is the average number of conjugate gradient iterations, nn is the sample size of training set, and MM is the average number of nonzero elements per column in the RadGP precision matrix Φ^\hat{\Phi}. The derivation of computational complexity is in Section S4 of the Supplementary Material; here we discuss implications. The variable ncgn_{\text{cg}} is determined by the distribution of eigenvalues of matrix Φ^+In/σ2\hat{\Phi}+I_{n}/\sigma^{2} and numerical tolerance. The latter is fixed as 10610^{-6} in our R package while the former is determined by the data set if the RadGP precision matrix Φ^\hat{\Phi} approximates the original Gaussian process precision matrix Φ\Phi decently well. Therefore, the computational complexity is roughly proportional to MM, the average number of nonzero elements per column in the precision matrix.

\arabicsection Experimental Studies

\arabicsection.\arabicsubsection Approximation of a Gaussian Process Prior

Throughout Section 5, we consider the specific RadGP described in Section 2. We compare the RadGP prior with the nearest-neighbor Gaussian process (NNGP, Datta et al. 2016) implemented on a maximin ordering of the data (Guinness, 2018), in terms of their approximation accuracy. The true Gaussian process has mean zero and an isotropic Matérn covariance function

K0(x)=τ221νΓ(ν)(ϕx)ν𝒦ν(ϕx),K_{0}(x)=\frac{\tau^{2}2^{1-\nu}}{\Gamma(\nu)}(\phi x)^{\nu}\mathcal{K}_{\nu}(\phi x), (\arabicequation)

with the true parameters (ϕ,τ2,ν)=(31.63,1,1.5)(\phi,\tau^{2},\nu)=(31.63,1,1.5). The value ϕ=31.63\phi=31.63 is obtained by setting K0(0.15)=0.05K_{0}(0.15)=0.05, a choice suggested by Katzfuss et al. (2020). The experiment is carried out on a 40×4040\times 40 grid in [0,1]2[0,1]^{2}. According to the analysis of computational complexity in Section \arabicsection, we use the average number of nonzero elements per column in the precision matrix as a measure of complexity for both methods. The quality of approximation is measured by squared Wasserstein-2 distance between each approximate process and the true Gaussian process, which can be computed efficiently using the R package transport available from CRAN (Schuhmacher et al. 2024). As shown in Figure \arabicfigure, the results are generally in favor of RadGP, as it achieves smaller approximation error under various complexity of the directed graphs.

Refer to caption
Figure \arabicfigure: Accuracy of approximating a Gaussian process prior. X-axis is the average number of nonzero elements per column in the precision matrix. Y-axis is the squared Wasserstein-2 distance and its logarithm.

\arabicsection.\arabicsubsection Approximation of a Gaussian Process Regression Posterior

We study the performance of RadGP for the spatial regression model (\arabicequation) with no covariates, where the true covariance function for spatial effects Z(s)Z(s) is the same as that in Section \arabicsection.\arabicsubsection. The white noise ϵ(s)\epsilon(s) follows N(0,σ2)N(0,\sigma^{2}) with σ2=0.01\sigma^{2}=0.01. The parameters ϕ,τ2,σ2\phi,\tau^{2},\sigma^{2} are unknown and need to be estimated from training data, while the Matérn smoothness parameter ν=3/2\nu=3/2 is assumed to be fixed and known. The domain is set as Ω=[0,1]2\Omega=[0,1]^{2}, with training data consisting of 16001600 grid locations in Ω\Omega. We replicate the analysis on 50 datasets. For each dataset, the test set is generated as 10001000 random samples from the uniform distribution on Ω\Omega.

Three methods are compared: RadGP proposed in this paper, nearest-neighbor Gaussian process with maximin ordering, and Vecchia Gaussian process predictions (V-Pred) (Katzfuss et al., 2020) with maximin ordering. The last two methods share the same training procedure but differ in how they make predictions: NNGP assumes that testing locations are independent given the training locations whereas V-Pred allows dependencies among testing locations. All methods are fitted via the MCMC algorithm described in Section \arabicsection and thus only differ in terms of their assumed graphs.

To measure the accuracy of approximations for prediction in the spatial regression model (\arabicequation), we compute the Wasserstein-2 distance between the posterior predictive distributions from each of the three approximation methods and the true Gaussian process regression model. The Wasserstein-2 distance is computed using MCMC samples. The results are shown in Figure \arabicfigure. When the complexity of the directed graph is sufficiently large, NNGP and RadGP tend to have the same approximation error while V-Pred is slightly worse. However, RadGP is able to achieve the limiting approximation accuracy with a much smaller complexity compared to the other two methods, indicating a faster approximation by RadGP. Unlike prior approximation in Section \arabicsection.\arabicsubsection, the posterior approximation error does not go to zero when the complexity of directed graphs increases. This nonzero error comes from two parts: first, the Gaussian process regression model (\arabicequation) has unknown parameters that need to be estimated from a finite amount of data; second, the Wasserstein-2 distance computed based on a finite number of MCMC samples contains Monte Carlo error.

Refer to caption
Figure \arabicfigure: Accuracy of approximating the posterior predictive distribution in Gaussian process regression. X-axis is the average number of nonzero elements per column in the precision matrix, and Y-axis is the squared Wasserstein-2 distance. Solid lines are mean values and shaded regions are 90%90\% Bayesian credible intervals.

\arabicsection.\arabicsubsection Surface Temperature Data

We consider the land surface temperature data on a 499km×581km499\mathrm{km}\times 581\mathrm{km} region southeast of Addis Ababa in Ethiopia at April 1st, 2020. The data are a product of the MODIS (Moderate Resolution Imaging Spectroradiometer) instrument on the Aqua satellite. The original data set comes as a 499×581499\times 581 image with each pixel corresponding to a 1km×1km1\mathrm{km}\times 1\mathrm{km} spatial region, as shown in Fig. \arabicfigure left. However, atmosphere conditions like clouds can block earth surfaces, resulting in missing data for many pixels. There are 211,683211,683 spatial locations with observed temperature values and 78,23678,236 locations with missing values. We randomly split the training data into 33 folds, using two folds for training and one fold for out-of-sample validation. This results in a total of 141,122141,122 training locations and 148,797148,797 testing locations.

We consider the same Gaussian process regression model as Section 5.1 with no covariates, zero mean function and exponential covariance function K0(x)=τ2exp(ϕx)K_{0}(x)=\tau^{2}\exp(-\phi x). The observed temperature values are centered to compensate for the lack of intercept term. We evaluate the performances of RadGP with radius 4.01km4.01\mathrm{km} and nearest neighbor Gaussian process with neighbor size 1515. The priors for both methods are set as p(ϕ)1p(\phi)\propto 1, τ2IG(2,1)\tau^{2}\sim\mathrm{IG}(2,1) and σ2IG(1,0.1)\sigma^{2}\sim\mathrm{IG}(1,0.1). Estimates for covariance parameters, out-of-sample mean squared errors and coverage for 95%95\% credible intervals are shown in Table \arabictable. The results are quite similar between the two methods.

The observed temperature and predicted temperature by RadGP are visualized in Fig. \arabicfigure. In general, the land surface temperature changes smoothly with respect to geological locations. The east and south part of the figure, which is covered by less vegetation, features slightly higher surface temperature. Small variations of temperature also occur in many local regions, indicating the geographical complexity of Ethiopia. There is a small blue plate in the west border of the figure that has significantly lower surface temperature than its surroundings, which corresponds to several lakes in the Great Rift Valley.

We further compare the performance of the two methods in terms of joint predictions at multiple held out locations. As we do not know the ground truth, we cannot calculate Wasserstein-2 error and instead focus on the joint likelihood ratio for held out data. Specifically, we select 567567 pairs of locations in the set reserved for out-of-sample validation. Each pair consists of two locations with 11km distance from each other. Any two locations from different pairs are at least 1010km apart. For each pair of locations, we compute the logarithm of the likelihood ratio between RadGP and NNGP. The distribution of the log likelihood ratio can provide important information about joint inference. Specifically, the mean of the log likelihood ratio is 2.1522.152 and there are 68%68\% pairs with positive log likelihood ratios. Since the distribution of the log likelihood ratio has heavy tails, we remove the bottom and top 5%5\% of the values and visualize the rest in Fig. \arabicfigure, where the red vertical line is ratio=0\mathrm{ratio}=0. We can see in the majority of cases, RadGP performs better than NNGP in terms of characterizing dependencies between temperature at two close locations.

RadGP NNGP
ϕ×102\phi\times 10^{2} 5.761 (5.432, 6.133) 5.689 (5.020, 6.321)
τ2\tau^{2} 1.706 (1.605, 1.807) 1.744 (1.565, 1.972)
σ2×103\sigma^{2}\times 10^{3} 1.222 (1.104, 1.351) 1.105 (0.994, 1.222)
MSE 0.0761 0.0770
coverage 0.954 0.953
time 1929 sec 1811 sec
Table \arabictable: Performance of RadGP and NNGP on
land surface temperature data.
Refer to caption
Figure \arabicfigure: Histogram of log likelihood ratio between RadGP and NNGP. The blue and red lines correspond to the mean log likelihood ratio value, and zero, respectively.
Refer to caption
Figure \arabicfigure: Ethiopian land surface temperature on April 1st, 2020. Left: observed data; Right: predicted values from RadGP.

\arabicsection Discussion

Vecchia approximations are arguably one of the most popular approaches for scaling up Gaussian process models to large datasets in spatial statistics. The major contribution of this article is to propose the first Vecchia-like approximation method that has strong theoretical guarantees in quantifying the accuracy of approximation to the original process. Our numerical experiments have shown competitive performance from the proposed RadGP compared to other Vecchia approximation methods. We have demonstrated that our method leads to improvements in characterizing joint predictive distributions while matching the performance of state-of-the-art methods in parameter estimation and marginal predictions.

Our current theory leads to several potential future research directions. First, we have established upper bounds for the approximation error to the original Gaussian process in Wasserstein-2 distance. We can further study lower bounds of this approximation error within specific families of approximations. It will be helpful to examine when the norm-controlled inversion bounds we have used in the proofs are tight, and whether the derived conditions on the approximation radius ρ\rho for various covariance functions in Corollary \arabicsection.\arabictheorem can be made optimal for such families.

Second, existing Vecchia approximations that use maximin ordering (Katzfuss et al., 2020) systematically generate remote locations across the domain. While our theory shows close locations are effective in quantifying spatial dependence, it is still unclear whether and how remote locations help with the approximation of covariance function. We would like to expand our research into general Vecchia approximations. The objective is to study the approximation error for different approximation approaches and discover the optimal strategy that minimizes the approximation error under certain sparsity constraints.

Third, although our main focus has been on approximation of Gaussian processes, we can further consider how the approximate process will affect posterior estimation of model parameters, including spatial fixed effects and covariance parameters. Under our current theoretical framework, since the minimal separation distance qq needs to be lower bounded, we can study Bayesian parameter estimation under increasing domain (Mardia & Marshall, 1984) or mixed domain (Lahiri & Zhu, 2006) asymptotic regimes. It remains an open problem whether our scalable approximation can maintain posterior consistency and optimal rates for all model parameters. We leave these directions for future exploration.

Acknowledgements

The authors thank Dr. Karlheinz Gröchenig for helpful discussion regarding the theory on norm-controlled inversion of Banach algebras, and two anonymous reviewers for detailed feedback and suggestion that greatly improved this article. Yichen Zhu has received funding from the European Research Council (ERC) under grant No. 101041064. Yichen Zhu, Michele Peruzzi and David Dunson have received funding from ERC under grant No. 856506 and United States National Institutes of Health under grant No. R01ES028804. Cheng Li has received funding from Singapore Ministry of Education Academic Research Funds Tier 1 Grant A-0004822-00-00. Part of this research was performed while Michele Peruzzi was visiting the Institute for Mathematical and Statistical Innovation (IMSI), which is supported by the National Science Foundation (grant No. DMS-1929348).

Supplementary Material

Proofs of lemmas, theorems and corollaries, detailed algorithms, computational complexity analysis and additional experimental studies are available at Supplementary Material for “Radial Neighbors for Provably Accurate Scalable Approximations of Gaussian Processes.” An R package for regression models using radial neighbors Gaussian processes is available at https://github.com/mkln/radgp.

References

  • Bai et al. (2012) Bai, Y., Song, P. X.-K. & Raghunathan, T. (2012). Joint composite estimating functions in spatiotemporal models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74, 799–824.
  • Banerjee et al. (2008) Banerjee, S., Gelfand, A. E., Finley, A. O. & Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 825–848.
  • Bevilacqua et al. (2019) Bevilacqua, M., Faouzi, T., Furrer, R. & Porcu, E. (2019). Estimation and prediction using generalized Wendland covariance functions under fixed domain asymptotics. The Annals of Statistics 47, 828–856.
  • Bevilacqua & Gaetan (2015) Bevilacqua, M. & Gaetan, C. (2015). Comparing composite likelihood methods based on pairs for spatial Gaussian random fields. Statistics and Computing 25, 877–892.
  • Burt et al. (2020) Burt, D., Rasmussen, C. E. & van der Wilk, M. (2020). Convergence of sparse variational inference in Gaussian processes regression. Journal of Machine Learning Research 21, 1–63.
  • Cressie & Johannesson (2008) Cressie, N. & Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 209–226.
  • Cressie (1993) Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley-Interscience.
  • Datta et al. (2016) Datta, A., Banerjee, S., Finley, A. O. & Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association 111, 800–812.
  • Eidsvik et al. (2014) Eidsvik, J., Shaby, B. A., Reich, B. J., Wheeler, M. & Niemi, J. (2014). Estimation and prediction in spatial models with block composite likelihoods. Journal of Computational and Graphical Statistics 23, 295–315.
  • Finley et al. (2019) Finley, A. O., Datta, A., Cook, B. D., Morton, D. C., Andersen, H. E. & Banerjee, S. (2019). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics 28, 401–414.
  • Finley et al. (2009) Finley, A. O., Sang, H., Banerjee, S. & Gelfand, A. E. (2009). Improving the performance of predictive process modeling for large datasets. Computational Statistics & Data Analysis 53, 2873–2884.
  • Furrer et al. (2006) Furrer, R., Genton, M. G. & Nychya, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics 15, 502–523.
  • Gelbrich (1990) Gelbrich, M. (1990). On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten 147, 185–203.
  • Gramacy & Apley (2015) Gramacy, R. B. & Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics 24, 561–578.
  • Guhaniyogi et al. (2011) Guhaniyogi, R., Finley, A. O., Banerjee, S. & Gelfand, A. E. (2011). Adaptive Gaussian predictive process models for large spatial datasets. Environmetrics 22, 997–1007.
  • Guinness (2018) Guinness, J. (2018). Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics 60, 415–429.
  • Heaton et al. (2019) Heaton, M. J., Datta, A., Finley, A., Furrer, R., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F. & Zammit-Mangion, A. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics 24, 398–425.
  • Jin et al. (2021) Jin, B., Peruzzi, M. & Dunson, D. B. (2021). Bag of DAGs: Flexible & scalable modeling of spatiotemporal dependence. arXiv preprint arXiv: 2112.11870 .
  • Katzfuss & Guinness (2021) Katzfuss, M. & Guinness, J. (2021). A general framework for Vecchia approximations of Gaussian processes. Statistical Science 36, 124–141.
  • Katzfuss et al. (2020) Katzfuss, M., Guinness, J., Gong, W. & Zilber, D. (2020). Vecchia approximations of Gaussian-process predictions. Journal of Agricultural, Biological and Environmental Statistics 25, 383–414.
  • Kaufman et al. (2008) Kaufman, C. G., Schervish, M. J. & Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association 103, 1545–1555.
  • Kidd & Katzfuss (2022) Kidd, B. & Katzfuss, M. (2022). Bayesian nonstationary and nonparametric covariance estimation for large spatial data (with discussion. Bayesian Analysis 17, 291–351.
  • Lahiri & Zhu (2006) Lahiri, S. N. & Zhu, J. (2006). Resampling methods for spatial regression models under a class of stochastic designs. The Annals of Statistics 34, 1774–1813.
  • Mardia & Marshall (1984) Mardia, K. V. & Marshall, R. J. (1984). Maximum likelihood estimation of models for residual covariance in spatial statistics. Biometrika 71, 135–146.
  • Nishimura & Suchard (2022) Nishimura, A. & Suchard, M. A. (2022). Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in “large n, large p” Bayesian sparse regression. Journal of the American Statistical Association 118, 2468–2481.
  • Peruzzi et al. (2022) Peruzzi, M., Banerjee, S. & Finley, A. O. (2022). Highly scalable Bayesian geostatistical modeling via meshed Gaussian processes on partitioned domains. Journal of the American Statistical Association 117, 969–982.
  • Peruzzi & Dunson (2022) Peruzzi, M. & Dunson, D. B. (2022). Spatial multivariate trees for big data Bayesian regression. Journal of Machine Learning Research 23, 1–40.
  • Peruzzi & Dunson (2024) Peruzzi, M. & Dunson, D. B. (2024). Spatial meshing for general Bayesian multivariate models. Journal of Machine Learning Research 25, 1–49.
  • Quiñonero-Candela & Rasmussen (2005) Quiñonero-Candela, J. & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6, 1939–1959.
  • Rue (2001) Rue, H. (2001). Fast sampling of Gaussian markov random fields. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, 325–338.
  • Rue & Held (2005) Rue, H. & Held, L. (2005). Gaussian Markov random fields: theory and applications. Chapman and Hall/CRC.
  • Sang et al. (2011) Sang, H., Un, M. & Huang, J. Z. (2011). Covariance approximation for large multivariate spatial data sets with an application to multiple climate model errors. Annals of Applied Statistics 5, 2519–2548.
  • Schäfer et al. (2021) Schäfer, F., Katzfuss, M. & Owhadi, H. (2021). Sparse Cholesky factorization by Kullback–Leibler minimization. SIAM Journal on Scientific Computing 43, A2019–A2046.
  • Schuhmacher et al. (2024) Schuhmacher, D., Bähre, B., Bonneel, N., Gottschlich, C., Hartmann, V., Heinemann, F., B., S. & Schrieber, J. (2024). transport: Computation of Optimal Transport Plans and Wasserstein Distances. R package version 0.15-0.
  • Stein (2014) Stein, M. L. (2014). Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics 8, 1–19.
  • Vecchia (1988) Vecchia, A. V. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 50, 297–312.
  • Zhang (2012) Zhang, H. (2012). Asymptotics and computation for spatial statistics. In Advances and Challenge in Space-time Modelling of Natural Events. Springer.
  • Zhang et al. (2023) Zhang, L., Tang, W. & Sudipto, B. (2023). Fixed-domain asymptotics under Vecchia’s approximation of spatial process likelihoods. Statistica Sinica , forthcoming.
  • Zilber & Katzfuss (2021) Zilber, D. & Katzfuss, M. (2021). Vecchia–Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data. Computational Statistics & Data Analysis 153, 107081.
\printhistory