\renewtheoremstyle

plain \nolinenumbers

Radial Neighbors for Provably Accurate Scalable Approximations of Gaussian Processes

Yichen Zhu [email protected] Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, MI, Italy. Michele Peruzzi [email protected] Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, U.S.A. Cheng Li [email protected] Department of Statistics and Data Science, National University of Singapore, Sinagpore. David B. Dunson [email protected] Department of Statistical Science & Mathematics, Duke University, North Carolina, U.S.A.

Abstract

In geostatistical problems with massive sample size, Gaussian processes can be approximated using sparse directed acyclic graphs to achieve scalable $O(n)$ computational complexity. In these models, data at each location are typically assumed conditionally dependent on a small set of parents which usually include a subset of the nearest neighbors. These methodologies often exhibit excellent empirical performance, but the lack of theoretical validation leads to unclear guidance in specifying the underlying graphical model and sensitivity to graph choice. We address these issues by introducing radial neighbors Gaussian processes (RadGP), a class of Gaussian processes based on directed acyclic graphs in which directed edges connect every location to all of its neighbors within a predetermined radius. We prove that any radial neighbors Gaussian process can accurately approximate the corresponding unrestricted Gaussian process in Wasserstein-2 distance, with an error rate determined by the approximation radius, the spatial covariance function, and the spatial dispersion of samples. We offer further empirical validation of our approach via applications on simulated and real world data showing excellent performance in both prior and posterior approximations to the original Gaussian process.

keywords:

Approximations; Directed acyclic graph; Gaussian process; Spatial statistics; Wasserstein distance.

^†^†journal: Submitted to Biometrika

\arabicsection Introduction

Data indexed by spatial coordinates are routinely collected in massive quantities due to the rapid pace of technological development in imaging sensors, wearables and tracking devices. Statistical models of spatially correlated data are most commonly built using Gaussian processes due to their flexibility and their ability to straightforwardly quantify uncertainty. Spatial dependence in the data is typically characterized by a parametric covariance function whose unknown parameters are estimated by repeatedly evaluating a multivariate normal density with a dense covariance matrix. At each density evaluation, the inverse covariance matrix and its determinant must be computed at a complexity that is cubic on the data dimension $n$ . This steep cost severely restricts the applicability of Gaussian processes on modern geolocated datasets. This problem is exacerbated in Bayesian hierarchical models computed via Markov-chain Monte Carlo (MCMC) because several thousands of $O(n^{3})$ operations must be performed to fully characterize the joint posterior distribution.

A natural idea for retaining the flexibility of Gaussian processes while circumventing their computational bottlenecks is to use approximate methods for evaluating high dimensional multivariate Gaussian densities. If the scalable approximation is “good” in some sense, then one can use the approximate method for backing out inferences on the original process parameters. A multitude of such scalable methods has been proposed in the literature. Low-rank methods using inducing variables or knots (Quiñonero-Candela & Rasmussen, 2005; Cressie & Johannesson, 2008; Banerjee et al., 2008; Finley et al., 2009; Guhaniyogi et al., 2011; Sang et al., 2011) are only appropriate for approximating very smooth Gaussian processes because the number of inducing variables must increase polynomially fast with the sample size to avoid oversmoothing of the spatial surface (Stein, 2014; Burt et al., 2020). Scalability can also be achieved by assuming sparsity of the Gaussian covariance matrix via compactly supported covariance functions (Furrer et al., 2006; Kaufman et al., 2008; Bevilacqua et al., 2019); these methods operate by assuming marginal independence of pairs of data points that are beyond a certain distance from each other. See also the related method of Gramacy & Apley (2015). Using a similar intuition, one can achieve scalability by approximating the high dimensional Gaussian density with a product of small dimensional densities that characterize dependence of nearby data points. Composite likelihood methods (Bai et al., 2012; Eidsvik et al., 2014; Bevilacqua & Gaetan, 2015) approximate the high dimensional Gaussian density with the product of sub-likelihoods, whereas Gaussian Markov Random Fields (GMRF; Cressie, 1993; Rue & Held, 2005) approximate it via a product of conditional densities whose conditioning sets include all the local spatial neighbors.

Markovian conditional independence assumptions based on spatial neighborhoods can be visualized as a sparse undirected graphical model; these assumptions are intuitively appealing and lead to sparse precision matrices and more efficient computations (Rue, 2001). However, the normalizing constant of GMRF densities may require additional expensive computations, and additional approximation steps may be required when making predictions at new spatial locations because GMRFs do not necessarily extend to a standalone stochastic process. Both these problems can be resolved by ordering the data and conditioning only using previously ordered data points. Because one can now use a directed acyclic graph to represent conditional independence in the data, this approximation (due to Vecchia, 1988) immediately leads to a valid density which can be extended to a standalone stochastic process.

There is a rich literature on approximations of Gaussian processes using directed acyclic graphs. In addition to extensions of Vecchia’s method (Datta et al., 2016; Finley et al., 2019; Katzfuss & Guinness, 2021), scalable models can be built on domain partitions (Peruzzi et al., 2022), and generalizations are available for non-Gaussian and multivariate data (Zilber & Katzfuss, 2021; Peruzzi & Dunson, 2022, 2024) as well as for modeling nonstationary processes (Jin et al., 2021; Kidd & Katzfuss, 2022). However, we identify two issues with current approaches based on Vecchia’s approximation. First, because the graphical model representation of Vecchia approximations is not unique, the numerical performance of these methods can be sensitive to the choice of graphical model in certain scenarios (Guinness, 2018). There is no consensus on how to choose these conditioning sets. Second, although their performance has been demonstrated in practice (Heaton et al., 2019) and some theory on parameter estimation is available (Zhang, 2012; Zhang et al., 2023), there is limited theoretical study focused on accuracy of the approximation. The recent work by Schäfer et al. (2021) established Kullback-Leibler divergence bounds for Vecchia approximations, but under strong conditions that are difficult to verify for commonly used spatial covariance functions.

In this paper, we address these two issues by introducing a novel method and corresponding theoretical guarantees. Our radial neighbors Gaussian processes scalably approximate an initial Gaussian process by assuming conditional independence at spatial locations as prescribed by a directed acyclic graph. For every location in the graph, directed edges connect it to all of its neighbors within a predetermined radius $\rho$ .

We also introduce–to the best of our knowledge–the first general theoretical result regarding the closeness between a Vecchia-like method and the Gaussian process it approximates. We show that any process from the RadGP family can accurately approximate the original Gaussian processes in Wasserstein-2 distance, with an error rate determined by the approximation radius, the spatial covariance function, and the spatial dispersion of samples. This approximation accuracy is insensitive to the graphical structure as long as the graphical structure does not alter the radial condition of RadGP. Our theory can be applied to a wide range of covariance functions including the Matérn, Gaussian and generalized Cauchy models. Empirical studies demonstrate that our RadGPs have excellent performance in approximating a Gaussian process prior and the posterior from a Gaussian process regression model.

\arabicsection Radial Neighbors Gaussian process

\arabicsection.\arabicsubsection Radial Neighbors Graphs

Let the spatial domain $\Omega$ be a connected subset of $\mathbb{R}^{d}$ . Let $Z=(Z_{s}:s\in\Omega)$ be a real valued Gaussian process on $\Omega\subset\mathbb{R}^{d}$ with zero mean and covariance function $K:\Omega\times\Omega\to\mathbb{R}$ such that for all $s_{1},s_{2}\in\Omega$ , $\mathbb{E}(Z_{s_{1}})=0$ and $\mathrm{Cov}(Z_{s_{1}},Z_{s_{2}})=K(s_{1},s_{2})$ . The objective of this paper is to approximate the process $Z$ with another process $\hat{Z}$ such that the distance between $\hat{Z}$ and $Z$ is small in some metric and finite marginal densities of $\hat{Z}$ are easy to compute.

Let $\mathcal{D}=\{w_{1},w_{2},\ldots,w_{n}\}$ be a finite collection of locations in $\Omega$ . A $\rho$ -radial neighbors graph on $\mathcal{D}$ is defined to be a directed acyclic graph such that for any $w_{i},w_{j}\in\mathcal{D}$ with $\|w_{i}-w_{j}\|_{2}<\rho$ , there is a directed edge between $w_{i}$ and $w_{j}$ . Such radial neighbors graphs are not unique. Section \arabicsection.\arabicsubsection provides one practical way of constructing a radial neighbors graph that will be used in our experimental studies, while Section \arabicsection.\arabicsubsection presents how to build a Gaussian process from any radial neighbors graph. Our theory in Section \arabicsection applies to any Gaussian process built from a radial neighbors graph, not just the specific graph we constructed in Section \arabicsection.\arabicsubsection.

\arabicsection.\arabicsubsection Construction of a Radial Neighbors Graph

We present a procedure to construct the radial neighbors graph that will be used for experiments in Section \arabicsection. First, we specify an order of all locations in $\mathcal{D}$ . This is done by choosing a “center” location of $\mathcal{D}$ and sorting all the locations in $\mathcal{D}$ according to their distances to this “center” in a non-decreasing order. If the locations in $\mathcal{D}$ are roughly evenly distributed in the domain $\Omega$ , then the coordinates of this center are set as the arithmetic mean of the coordinates of all locations in $\mathcal{D}$ .

Next, once an order of locations in $\mathcal{D}$ is obtained, for each location $w_{i}\in\mathcal{D}$ , we set the parent set of $w_{i}$ to be all the locations preceeding $w_{i}$ in this order and within $\rho$ distance to $w_{i}$ . As a consequence, for all locations within $\rho$ radius of $w_{i}$ , the ones that precede it are its parents, whereas the ones that follow it are its children. This construction satisfies the definition of radial neighbors graphs in Section \arabicsection.\arabicsubsection that any pair of locations $(w_{i},w_{j})$ with $\|w_{i}-w_{j}\|_{2}<\rho$ have a directed edge connecting them.

Depending on the spatial dispersion of locations in $\mathcal{D}$ , some locations may not have neighbors within $\rho$ distance. We thus populate the parent set of a location $w_{i}\in\mathcal{D}$ without $\rho$ -neighbors by drawing an edge from $w_{i}$ ’s nearest neighbor among all locations that precede $w_{i}$ in the order. This step ensures that the resulting directed acyclic graph is connected.

If the locations in $\mathcal{D}$ are roughly evenly distributed in the domain $\Omega$ , once the first $i$ locations are chosen and the directed graph on them is constructed, our procedure roughly chooses the $(i+1)$ th location to be as close to the first $i$ locations as possible. Since spatial covariance function $K(s_{1},s_{2})$ is often relatively large when $s_{1}$ and $s_{2}$ are close to each other, our procedure aims to minimize the conditional variance of the $(i+1)$ th location given the first $i$ locations, or in other words, maximize the spatial information that the $(i+1)$ th location can borrow from the first $i$ locations. Figure \arabicfigure illustrates the process of building this radial neighbors graph when $\mathcal{D}$ is a $15\times 15$ grid on $[0,1]^{2}$ .

Refer to caption — Figure \arabicfigure: Construction of a radial neighbors graph. In each subpanel, the location in red follows all locations in blue and precedes all locations in gray in the predetermined order. The parent set for the location in red is built by drawing a directed edge to it from each blue location inside the red circle.

Finally, if we have a new data set $\tilde{\mathcal{D}}$ , the radial neighbors graph constructed on $\mathcal{D}$ can be extended to a radial neighbors graph on $\mathcal{D}\cup\tilde{\mathcal{D}}$ . Specifically, once an ordering of locations in $\tilde{\mathcal{D}}$ is obtained, we concatenate the ordering of locations in $\mathcal{D}$ and $\tilde{\mathcal{D}}$ to obtain an ordering of locations $\mathcal{D}\cup\tilde{\mathcal{D}}$ . We then construct a radial neighbors graph on $\mathcal{D}\cup\tilde{\mathcal{D}}$ via the same procedure as described above. The final graph will contain the graph on $\mathcal{D}$ as a subgraph. This extension framework will be useful when extending a graph from training data to test data.

For some dataset, the spatial dispersion of both training and testing locations can be highly uneven. While a radial neighbor graph ensures all locations are connected in the same directed graph, it can still result in poor predictions on some test locations. We can mitigate this issue by artificially adding more test locations, so that each test locations receives a decent number of parents.

\arabicsection.\arabicsubsection Gaussian Processes from Radial Neighbors Graphs

Given a $\rho$ -radial neighbors graph on $\mathcal{D}=\{w_{1},w_{2},\ldots\,w_{n}\}$ defined in Section \arabicsection.\arabicsubsection, we can build a RadGP on $\mathcal{D}$ in which each $w_{i}$ is conditionally only dependent on its parent set. Let the symbol $\overset{d}{=}$ denote equality in distribution and define $\Sigma_{A,B}$ as the covariance matrix between two spatial location sets $A$ and $B$ under the covariance function $K(\cdot,\cdot)$ . The conditional distribution of $\hat{Z}_{w_{i}}$ is defined as

		$\displaystyle[\hat{Z}_{w_{i}}\mid\hat{Z}_{w_{j}},j<i]\overset{d}{=}[\hat{Z}_{w_{i}}\mid\hat{Z}_{\mathrm{pa}(w_{i})}]\overset{d}{=}[{Z}_{w_{i}}\mid{Z}_{\mathrm{pa}(w_{i})}]$
	$\displaystyle\sim$	$\displaystyle N\Big{(}\Sigma_{\mathrm{pa}(w_{i}),w_{i}}^{\mathrm{\scriptscriptstyle T}}\Sigma_{\mathrm{pa}(w_{i}),~{}\mathrm{pa}(w_{i})}^{-1}\hat{Z}_{\mathrm{pa}(w_{i})},\;K(w_{i},w_{i})-\Sigma_{\mathrm{pa}(w_{i}),w_{i}}^{\mathrm{\scriptscriptstyle T}}\Sigma_{\mathrm{pa}(w_{i}),~{}\mathrm{pa}(w_{i})}^{-1}\Sigma_{\mathrm{pa}(w_{i}),w_{i}}\Big{)},$		(\arabicequation)

where for $i=1$ the parent set $\mathrm{pa}(w_{1})$ is empty.

Equation (\arabicsection.\arabicsubsection) defines the distribution of $\hat{Z}$ on all finite subsets of $\mathcal{D}$ , which essentially says the conditional distribution of $\hat{Z}_{w_{i}}$ given the process $\hat{Z}$ on all locations preceding $w_{i}$ has the same law as the conditional distribution of $Z_{w_{i}}$ given the process $Z$ on the parent set $\mathrm{pa}(w_{i})$ . We then extend this process $\hat{Z}$ to arbitrary finite subsets of the whole space $\Omega$ . For all $s\in\Omega\backslash\mathcal{D}$ , we define the parents of $s$ as $\mathrm{pa}(s)=\{s^{\prime}\in\mathcal{D}:\|s^{\prime}-s\|_{2}<\rho\}$ . For any finite set $U\subset\Omega\backslash\mathcal{D}$ , we define the conditional distribution of $\hat{Z}_{U}$ given $\hat{Z}_{\mathcal{T}_{1}}$ as

	$\displaystyle p(\hat{Z}_{U}\mid\hat{Z}_{\mathcal{D}})$	$\displaystyle=\prod_{s\in U}p(\hat{Z}_{s}\mid\hat{Z}_{\mathrm{pa}(s)})=\prod_{s\in U}p({Z}_{s}\mid Z_{\mathrm{pa}(s)}),$
	$\displaystyle[\hat{Z}_{s}\mid\hat{Z}_{\mathrm{pa}(s)}]$	$\displaystyle\sim N\Big{(}\Sigma_{\mathrm{pa}(s),s}^{\mathrm{\scriptscriptstyle T}}\Sigma_{\mathrm{pa}(s),\mathrm{pa}(s)}^{-1}\hat{Z}_{\mathrm{pa}(s)},\;K(s,s)-\Sigma_{\mathrm{pa}(s),s}^{\mathrm{\scriptscriptstyle T}}\Sigma_{\mathrm{pa}(s),\mathrm{pa}(s)}^{-1}\Sigma_{\mathrm{pa}(s),s}\Big{)}.$		(\arabicequation)

Equations (\arabicsection.\arabicsubsection) and (\arabicsection.\arabicsubsection) together give the joint distribution of $\hat{Z}$ on any finite subset of $\Omega$ . Specifically, for a generic finite set $A=U\cup V$ with $U\subset\Omega\backslash\mathcal{D}$ and $V\subset\mathcal{D}$ , the joint density of $\hat{Z}$ on $A$ implied by Equations (\arabicsection.\arabicsubsection) and (\arabicsection.\arabicsubsection) is

p(\hat{Z}_{A})=p(\hat{Z}_{U}|\hat{Z}_{\mathcal{D}})\int\prod_{i=1}^{n}p(\hat{Z}_{w_{i}}|\hat{Z}_{\mathrm{pa}(w_{i})})\prod_{w_{i}\in\mathcal{D}\backslash V}dw_{i}.

(\arabicequation)

Equations (\arabicsection.\arabicsubsection)-(\arabicequation) complete the definition of our RadGP.

Lemma \arabicsection.\arabictheorem.

The radial neighbors Gaussian process (RadGP) $\hat{Z}(\cdot)$ is a valid Gaussian process on the whole spatial domain $\Omega$ .

The proof of Lemma \arabicsection.\arabictheorem relies on the Kolmogorov extension theorem and is available at Section S1 of the Supplementary Material.

\arabicsection Theoretical Properties

\arabicsection.\arabicsubsection Wasserstein Distance Between Gaussian Processes

The objective of this section is to establish theoretical upper bounds for the Wasserstein distance between the original Gaussian process and any RadGP on $\mathcal{D}$ described above. For two generic random vectors $Z_{1},Z_{2}$ defined on the same space $\mathcal{Z}$ following the probability distributions $\mu_{1},\mu_{2}$ , their Wasserstein-2 ( $W_{2}$ ) distance is

W_{2}(Z_{1},Z_{2})=\left\{\min_{\tilde{\mu}\in\mathcal{M}}\int_{\mathcal{Z}\times\mathcal{Z}}\|z_{1}-z_{2}\|_{2}^{2}d\tilde{\mu}(z_{1},z_{2})\right\}^{1/2},

where $\mathcal{M}$ is the set of all probability distributions on the product space $\mathcal{Z}\times\mathcal{Z}$ whose marginals are $\mu_{1}$ and $\mu_{2}$ . Let $Z_{\mathcal{D}}$ and $\hat{Z}_{\mathcal{D}}$ be the random vectors from the original Gaussian process and the RadGP on the set $\mathcal{D}$ . Denote the covariance matrix of $Z_{\mathcal{D}}$ and $\hat{Z}_{\mathcal{D}}$ as $\Sigma_{\mathcal{D}\mathcal{D}}$ and $\hat{\Sigma}_{\mathcal{D}\mathcal{D}}$ , respectively. Then the Wasserstein-2 distance between $Z_{\mathcal{D}}$ and $\hat{Z}_{\mathcal{D}}$ has a closed form (Gelbrich, 1990):

W_{2}^{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})=\mathrm{tr}(\Sigma_{\mathcal{D}\mathcal{D}})+\mathrm{tr}(\hat{\Sigma}_{\mathcal{D}\mathcal{D}})-2\mathrm{tr}\{(\Sigma_{\mathcal{D}\mathcal{D}}^{1/2}\hat{\Sigma}_{\mathcal{D}\mathcal{D}}\Sigma_{\mathcal{D}\mathcal{D}}^{1/2})^{1/2}\},

(\arabicequation)

where $\mathrm{tr}(A)$ denotes the trace of a square matrix $A$ . The right-hand side of equation (\arabicequation) involves various matrix powers that are difficult to analyze. Fortunately, for Gaussian measures, the squared Wasserstein-2 distance can be upper bounded by the trace norm of the difference between their covariance matrices, which greatly simplifies our analysis.

\arabicsection.\arabicsubsection Spatial Decaying Families

We consider the Gaussian process $Z=(Z_{s}:s\in\Omega)$ with zero mean and an isotropic nonnegative covariance function $K(\cdot,\cdot)$ . Thus we can reformulate $K$ as $K(s_{1},s_{2})=K_{0}(\|s_{1}-s_{2}\|_{2})$ for all $s_{1},s_{2}\in\Omega$ , where $K_{0}:[0,+\infty)\to[0,+\infty)$ . For all $r>0$ , define the function $v_{r}(\cdot):\mathbb{R}\to\mathbb{R}_{+}$ as $v_{r}(x)=\sum_{k=0}^{+\infty}|x|^{k}/(k!)^{r}$ . Then $1/v_{r}(x)$ is a monotone decreasing function of $x$ , where a larger $r$ results in a slower decay rate in $1/v_{r}(x)$ . Specifically, $1/v_{1}(x)=\exp(-x)$ and $1/v_{r}(x)$ with $r>1$ has a decay rate faster than all polynomials but slower than the exponential function. We also define a series of polynomials $c_{r}(x)=1+x^{r}$ for $r>0$ . For each $v_{r}$ with $r>1$ and each $c_{r}$ with $r>0$ , we define the following families of Gaussian processes:

	$\displaystyle\mathscr{Z}_{v_{r}}$	$\displaystyle=\left\{Z=(Z_{s}:s\in\Omega):K_{0}(\\|s_{1}-s_{2}\\|_{2})\leq\frac{1}{v_{r}(\\|s_{1}-s_{2}\\|_{2})(1+\\|s_{1}-s_{2}\\|_{2}^{d+1})}\right\},$		(\arabicequation)
	$\displaystyle\mathscr{Z}_{c_{r}}$	$\displaystyle=\left\{Z=(Z_{s}:s\in\Omega):K_{0}(\\|s_{1}-s_{2}\\|_{2})\leq\frac{1}{(1+\\|s_{1}-s_{2}\\|_{2})^{r}}\right\}.$		(\arabicequation)

Intuitively, $\mathscr{Z}_{v_{r}}$ is the family of isotropic Gaussian processes whose covariance function decays no slower than some subexponential rate of the spatial distance, while $\mathscr{Z}_{c_{r}}$ is the family with covariance functions decaying no slower than an $r$ th order polynomial of the spatial distance.

The objective of our theoretical study is to bound the Wasserstein-2 distance between the marginal distribution of RadGP and original Gaussian process on $\mathcal{D}$ when the original process belongs to the two spatial decaying families in (\arabicequation) and (\arabicequation):

\sup_{Z\in\mathscr{Z}_{v_{r}}}W^{2}_{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\quad\text{and}\quad\sup_{Z\in\mathscr{Z}_{c_{r}}}W^{2}_{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}}).

(\arabicequation)

\arabicsection.\arabicsubsection Rate of Approximation

For the finite set $\mathcal{D}$ described above, we define the minimal separation distance among all spatial locations in $\mathcal{D}$ as $q=\min_{1\leq i<j\leq|\mathcal{D}|}\|w_{i}-w_{j}\|_{2}$ . This minimal separation distance is a parsimonious statistic for describing the spatial dispersion of $\mathcal{D}$ , and is useful in bounding various quantities such as the maximal eigenvalue and condition number of $\Sigma_{\mathcal{D}\mathcal{D}}$ ; see Lemma S2 of the Supplementary Material for details. Let $\hat{K}_{0}(w)=(2\pi)^{-d/2}\int K_{0}(x)\exp(-\imath w^{{\mathrm{\scriptscriptstyle T}}}x)dx$ be the Fourier transform of $K_{0}$ where $\imath^{2}=-1$ , and let $\phi_{0}(x)=\inf_{\|w\|_{2}\leq 2x}\hat{K}_{0}(w)$ for all $x>0$ . For two positive sequences $a_{n}$ and $b_{n}$ , we use $a_{n}\lesssim b_{n}$ to denote the relation that $a_{n}\leq Cb_{n}$ for all $n$ and some finite constant $C$ . We have the following theorems on the Wasserstein-2 distance between the RadGP and the original Gaussian process. The constants $c_{1},c_{2}$ are from Lemma S2 while $c_{3}$ is from the proof of Lemma S5 in the Supplementary Material, all of which only depend on the dimension $d$ of the spatial domain $\Omega$ .

Theorem \arabicsection.\arabictheorem.

For all radial neighbors Gaussian processes $\hat{Z}$ , for the family $\mathscr{Z}_{v_{r}}$ in (\arabicequation) with $r>1$ , if $0<q<1$ and $n^{1/2}\{v_{r}(\rho d^{-1/2})\}^{-1}\{\phi_{0}(c_{2}/q)\}^{-9/2}v_{r-1}[c_{3}\{\phi_{0}(c_{2}/q)\}^{-1}]\leq c_{6}$ hold for some constant $c_{6}$ only dependent on $d$ , then

\sup_{Z\in\mathscr{Z}_{v_{r}}}W_{2}^{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\lesssim\frac{n}{v_{r}(\rho d^{-1/2})}\{\phi_{0}(c_{2}/q)\}^{-5}q^{-d}v_{r-1}[c_{3}\{\phi_{0}(c_{2}/q)\}^{-1}].

(\arabicequation)

Else if $q\geq 1$ and $n^{1/2}/v_{r}(\rho d^{-1/2})\leq c^{\prime}_{6}$ hold for some constant $c^{\prime}_{6}$ only dependent on $d$ , then $\sup_{Z\in\mathscr{Z}_{v_{r}}}W^{2}_{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\lesssim n/v_{r}(\rho d^{-1/2}).$

Theorem \arabicsection.\arabictheorem.

For all radial neighbors Gaussian processes $\hat{Z}$ , for the family $\mathscr{Z}_{c_{r}}$ in (\arabicequation) with $r\geq d+1$ , if $0<q<1$ and $n^{1/2}(1+\rho d^{-1/2})^{-(r-d-1)}q^{(r-7)d}\{\phi_{0}(c_{2}/q)\}^{-(r+4)}$ $(c_{1}c_{5}d2^{d-1}\pi/\surd{6})^{r}\leq c_{7}$ hold for some constant $c_{7}$ only dependent on $d$ , then

\sup_{Z\in\mathscr{Z}_{c_{r}}}W_{2}^{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\lesssim\frac{n}{(1+\rho d^{-1/2})^{r-d-1}}q^{(r-8)d}\{\phi_{0}(c_{2}/q)\}^{-(r+9/2)}(c_{1}c_{5}d2^{d-1}\pi/\surd{6})^{r},

(\arabicequation)

Else if $q\geq 1$ and $n^{1/2}(1+\rho d^{-1/2})^{-(r-d-1)}\{\phi_{0}(c_{2}/q)\}^{-r}(c_{1}c_{5}d2^{d-1}\pi/\surd{6})^{r}\leq c^{\prime}_{7}$ hold for some constant $c^{\prime}_{7}$ only dependent on $d$ , then $\sup_{Z\in\mathscr{Z}_{c_{r}}}W^{2}_{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\lesssim$ $n(1+\rho d^{-1/2})^{-(r-d-1)}$ .

The conditions for the above theorems, namely the equations regarding $c_{6},c^{\prime}_{6},c_{7},c^{\prime}_{7}$ , are necessary but not sufficient conditions for the upper bounds of Wasserstein-2 distance to go to zero. Thus, they can be safely ignored if we aim for accurate approximations. We now give a detailed analysis of these upper bounds for Wasserstein-2 distance involving three variables: the sample size $n$ , the minimal separation distance $q$ , and the approximation radius $\rho$ in the RadGP. We will discuss from the perspectives of both finite samples and asymptotics. From a finite sample perspective, we fix two out of three variables and discuss how and why the upper bounds change as the remaining variable changes; from an asymptotic perspective, we describe the relations between $n,q$ and $\rho$ such that as $n$ goes to infinity, the Wasserstein-2 distance goes to zero.

Viewing Theorems \arabicsection.\arabictheorem and \arabicsection.\arabictheorem as finite sample bounds, the upper bounds are linearly dependent on the sample size $n$ . This directly follows from the definition of Wasserstein distance which measures the difference between two $n$ -dimensional Gaussian distributions. Provided both the minimal separation distance $q$ and approximation radius $\rho$ are fixed, as the sample size $n$ increases, the local structure for existing locations will not change and the approximation accuracy for them will not improve. Thus, adding more locations will lead to a linearly increasing approximation error in the squared $W_{2}$ distance.

Second, the upper bounds on the squared $W_{2}$ distance decay with respect to the approximation radius $\rho d^{-1/2}$ at the same rate as the spatial covariance function with respect to the spatial distance, up to a $(d+1)$ th order polynomial. For Theorem \arabicsection.\arabictheorem, the covariance function is required to decay faster than $1/v_{r}(x)$ by a polynomial $1+x^{d+1}$ as defined in equation (\arabicequation); for Theorem \arabicsection.\arabictheorem, the covariance function decays faster than an $r$ th order polynomial while the upper bound of $W_{2}$ distance decays at an $(r-d-1)$ th order polynomial. The difference between the decay rate of the covariance function and the $W_{2}$ distance in Theorems \arabicsection.\arabictheorem and \arabicsection.\arabictheorem comes from the fact that high dimension reduces the effective decay rate of covariance function. For example, under a fixed minimal separation distance, when $d=1$ , as long as the covariance function decays faster than $1/x$ by a polynomial term, the matrix $\Sigma_{\mathcal{D}\mathcal{D}}$ has a bounded $l_{1}$ norm regardless of sample size; however, for a general $d$ dimensional space, the covariance function must decay faster than $1/x^{d}$ for $\Sigma_{\mathcal{D}\mathcal{D}}$ to have a bounded $l_{1}$ norm. These upper bounds show that the approximation radius $\rho$ indeed controls the approximation accuracy. Specifically, for fixed $q$ and $n$ , we can reach an arbitrarily small $W_{2}$ error by increasing $\rho$ over some threshold value; see our detailed conditions on $\rho$ in Corollary \arabicsection.\arabictheorem for various types of covariance functions.

Third, for fixed $\rho$ and $n$ , both upper bounds in the two theorems increase as the minimal separation distance $q$ decreases. This is because as the minimal separation distance decreases, spatial locations get closer to each other and the covariance matrix gets close to singular, leading to a larger approximation error.

From the asymptotic perspective, we can increase the sample size $n$ while viewing the minimal separation distance $q$ and the approximation radius $\rho$ as some functions of $n$ . The objective is to choose an approximation radius $\rho$ dependent on $n$ and $q$ such that the squared Wasserstein distance goes to zero as $n$ goes to infinity. We first present a corollary showing how to choose such $\rho$ for three commmonly used covariance functions, where the convergence is understood in the sense that $q$ is also a function of $n$ .

Corollary \arabicsection.\arabictheorem.

We have $W_{2}^{2}(Z_{\mathcal{D}},\hat{Z}_{\mathcal{D}})\to 0$ as $n\to\infty$ if one of the following conditions is satisfied:
(1) The covariance function is the isotropic Matérn: $K_{0}(s_{1}-s_{2})=\tau^{2}2^{1-\nu}{\Gamma(\nu)}^{-1}\left(\phi\|s_{1}-s_{2}\|\right)^{\nu}\mathcal{K}_{\nu}\left(\phi\|s_{1}-s_{2}\|_{2}\right)$ with $q<1/\phi$ . Define the constant $c_{m,1}=\Gamma(\nu)/\{\tau^{2}2^{d}\pi^{d/2}\Gamma(\nu+d/2)\}$ . The approximation radius $\rho$ satisfies

\rho\geq\frac{d^{1/2}}{\phi}\left[c_{3}c_{m,1}\Big{(}1+\frac{4c_{2}^{2}}{\phi^{2}q^{2}}\Big{)}^{\nu+\frac{d}{2}}+\ln\Big{\{}c_{m,1}nq^{-d}\Big{(}1+\frac{4c_{2}^{2}}{\phi^{2}q^{2}}\Big{)}^{5(\nu+\frac{d}{2})}\Big{\}}\right]^{3}.

(2) The covariance function is Gaussian $K_{0}(\|s_{1}-s_{2}\|_{2})=\tau^{2}\exp(-a\|s_{1}-s_{2}\|_{2}^{2})$ with $q<a^{-1/2}$ . The approximation radius $\rho$ satisfies

\rho\geq\left(\frac{d}{a}\right)^{1/2}\left[c_{3}\tau^{-2}\exp\big{\{}c_{2}^{2}/(aq^{2})\big{\}}+\ln(nq^{-d}\tau^{-10})+\frac{5c_{2}^{2}}{aq^{2}}\right]^{3}.

(3) The covariance function is the generalized Cauchy $K_{0}(\|s_{1}-s_{2}\|_{2})=\tau^{2}\big{\{}1+(\phi\|s_{1}-s_{2}\|)^{\delta}\big{\}}^{-\lambda/\delta}$ with parameter $\lambda>d+1$ and $q<\phi$ . Let $c_{9}$ be a constant dependent on dimension $d$ and covariance function parameters $\tau^{2},\phi,\delta,\lambda$ . The approximation radius $\rho$ satisfies

\rho\geq c_{9}q^{-\frac{\frac{25}{2}d+\delta(\lambda+\frac{9}{2})}{\lambda-(d+1)}}n^{\frac{1}{\lambda-(d+1)}}.

We then analyze our main theorems from the asymptotic perspective and verify the analysis using the three examples in Corollary \arabicsection.\arabictheorem. Depending on whether and how the minimal separation distance $q$ changes with respect to the sample size $n$ , there are three commonly used asymptotic frameworks: fixed domain, mixed domain and increasing domain asymptotics.

Under the increasing domain setting, as $n$ goes to infinity, the radius of the domain $\Omega$ increases while the minimal separation distance $q$ is fixed. Therefore, as long as $n\gtrsim d^{1/2}v_{r}^{-1}(n)$ under conditions of Theorem \arabicsection.\arabictheorem or $n\gtrsim d^{1/2}n^{1/(r-d-1)}$ under conditions of Theorem \arabicsection.\arabictheorem, we have $o(1)$ approximation error. When applied to Matérn covariance in case (1) and Gaussian covariance in case (2) of Corollary \arabicsection.\arabictheorem, this yields a slowly increasing rate of $O(\ln^{3}n)$ for the approximation radius $\rho$ , which is much smaller than the sample size $n$ . On the other hand, when the covariance function decays polynomially such as the generalized Cauchy in case (3) of Corollary \arabicsection.\arabictheorem, the approximation radius $\rho$ needs to increase at a polynomial rate of sample size $n$ . This is because for covariance functions that decrease slowly in the spatial distance, one needs more spatial locations and a larger approximation radius to capture the dependence information.

Under the mixed domain setting, as $n$ goes to infinity, the radius of the domain $\Omega$ increases with $n$ while the minimal separation distance $q$ also decreases with $n$ . By solving the inequalities on the right hand side of equations (\arabicequation) and (\arabicequation) being no greater than $o(1)$ , we obtain the lower bound of $\rho$ that guarantees the $o(1)$ approximation error. Corollary \arabicsection.\arabictheorem directly shows such lower bounds for Matérn, Gaussian and generalize Cauchy covariance functions. We can still achieve the $o(1)$ approximation error as long as these lower bounds of $\rho$ in the three cases of Corollary \arabicsection.\arabictheorem increase slower than the radius of the domain $\Omega$ . Otherwise, if $\rho$ increases exactly at the same rate as the radius of $\Omega$ , RadGP is identical to the original Gaussian process and does not make any approximation at all. Our theory is not applicable when the lower bounds of $\rho$ exceed the radius of domain $\Omega$ . In particular, for the fixed domain setting, the radius of $\Omega$ is fixed as a constant and the minimal separation distance $q$ decreases to zero at a rate no slower than $O(n^{-1/d})$ . In such case, the lower bounds of $\rho$ in Corollary \arabicsection.\arabictheorem will diverge with $n$ , and therefore Theorems \arabicsection.\arabictheorem and \arabicsection.\arabictheorem do not provide a vanishing bound for Wasserstein-2 distances.

Our results fill a major gap in the literature by providing the previously lacking theoretical support for popular local or neighborhood-based Gaussian process approximations. Our theory only requires knowledge of the spatial decaying pattern of the covariance functions (e.g., exponential and polynomial in $\mathscr{Z}_{v_{r}}$ and $\mathscr{Z}_{c_{r}}$ , respectively). These conditions are easily verifiable, with derivations for the Matérn, Gaussian and the generalized Cauchy covariance functions provided in Corollary \arabicsection.\arabictheorem. Similar techniques can be used for a wider range of stationary covariance functions. Furthermore, our novel theory explicitly links the approximation quality from choosing radius $\rho$ with the minimal separation distance $q$ and the sample size $n$ , both of which are readily available from observed data.

\arabicsection Bayesian Regression with RadGP

Consider a linear regression model with spatial latent effects as:

Y(s)=X(s)\beta+Z(s)+\epsilon(s),\;s\in\Omega,

(\arabicequation)

where $X(s)$ are covariates at location $s$ , $\beta$ are regression coefficients for the covariates, $Z(s)$ is a Gaussian process with zero mean and covariance function $K_{\theta}:\Omega\times\Omega\to\mathbb{R}$ , and $\epsilon(s)$ is a white noise process. Researchers observe $Y(s),X(s)$ at a collection of training locations $\mathcal{T}_{1}$ with the objective of estimating the regression coefficients $\beta$ , the covariance function parameters $\theta$ , and the spatial random effects $Z(s)$ at both training locations $\mathcal{T}_{1}$ and test locations $\mathcal{T}_{2}\subset\Omega$ . If the spatial random effects $Z(s)$ are not of interest, one can combine $Z(s)$ and $\epsilon(s)$ into one Gaussian process, known as the response model. This is described in Section S4.2 of the Supplementary Material. Hereafter we assume that inference on $Z(s)$ is desired. Let $Z_{\mathcal{T}_{1}}$ denote the column vector consisting of $(Z(s):s\in\mathcal{T}_{1})$ . Similarly define other notations with $\mathcal{T}_{i}$ as subscripts. Endow the spatial random effects on $\mathcal{T}_{1}\cup\mathcal{T}_{2}$ with our RadGP prior defined in Section \arabicsection.\arabicsubsection, which is built on the radial neighbors graph constructed in Section \arabicsection.\arabicsubsection such that the locations in the training set $\mathcal{T}_{1}$ precede locations in the test set $\mathcal{T}_{2}$ . The full Bayesian model can be outlined as follows:

	$\displaystyle(\beta,\theta,\sigma^{2})\sim p(\beta)p(\theta)p(\sigma^{2}),\quad\epsilon(s)\overset{\text{i.i.d.}}{\sim}N(0,\sigma^{2}),s\in\Omega,\quad\hat{Z}_{\mathcal{T}_{1}}\|\theta\sim\prod_{s\in\mathcal{T}_{1}}p(\hat{Z}_{s}\|\hat{Z}_{\mathrm{pa}(s)}),$
	$\displaystyle p(\hat{Z}_{\mathcal{T}_{2}}\|\hat{Z}_{\mathcal{T}_{1}},\theta)\propto\prod_{s\in\mathcal{T}_{2}}p(\hat{Z}_{s}\|\hat{Z}_{\mathrm{pa}(s)}),\quad Y(s)=X(s)\beta+\hat{Z}(s)+\epsilon(s),\quad\text{for }s\in\Omega.$		(\arabicequation)

The priors for $\beta,\theta$ and $\sigma^{2}$ can be flexible, but it is advised to set a proper prior for either $\sigma^{2}$ or $\theta$ due to potential non-identifiability.

We employ a Gibbs sampling framework while handling each full conditional with different approaches. If the prior of $\beta$ is normal $\beta\sim N(\beta_{0},\Phi_{0}^{-1})$ , then the posterior of $\beta$ is also normal:

		$\displaystyle\beta\|{Y}_{\mathcal{T}_{1}},\hat{Z}_{\mathcal{T}_{1}},\sigma^{2}$
	$\displaystyle\sim$	$\displaystyle N\big{(}(\Phi_{0}+{X}_{\mathcal{T}_{1}}^{T}{X}_{\mathcal{T}_{1}}/\sigma^{2})^{-1}\{\Phi_{0}\beta_{0}+{X}_{\mathcal{T}_{1}}^{T}({Y}_{\mathcal{T}_{1}}-\hat{Z}_{\mathcal{T}_{1}})/\sigma^{2}\},(\Phi_{0}+{X}_{\mathcal{T}_{1}}^{T}{X}_{\mathcal{T}_{1}}/\sigma^{2})^{-1}\big{)}.$		(\arabicequation)

We then sample $\sigma^{2}$ from its full conditional. If the prior distribution for $\sigma^{2}$ is inverse gamma $\sigma^{2}\sim\mathrm{IG}(a_{0},b_{0})$ , then the posterior of $\sigma^{2}$ is also inverse gamma:

\sigma^{2}|{Y}_{\mathcal{T}_{1}},\hat{Z}_{\mathcal{T}_{1}},\beta\sim\mathrm{IG}\left(a_{0}+n/2,b_{0}+\|{Y}_{\mathcal{T}_{1}}-{X}_{\mathcal{T}_{1}}\beta-\hat{Z}_{\mathcal{T}_{1}}\|_{2}^{2}/2\right).

(\arabicequation)

Denote the precision matrix of $\hat{Z}_{\mathcal{T}_{1}}|\theta$ from the RadGP prior as $\hat{\Phi}$ . We sample the spatial random effects on $\mathcal{T}_{1}$ with conjugate gradients as illustrated in Nishimura & Suchard (2022), using the full conditional distribution:

\hat{Z}_{\mathcal{T}_{1}}|{Y}_{\mathcal{T}_{1}},\beta,\sigma^{2},\theta\sim N({\xi},(\hat{\Phi}+{I}_{n}/\sigma^{2})^{-1}),

where ${\xi}=(\hat{{\Phi}}+{I}_{n}/\sigma^{2})^{-1}({Y}_{\mathcal{T}_{1}}-{X}_{\mathcal{T}_{1}}\beta)/\sigma^{2}$ .

Finally, for the parameter $\theta$ in the covariance function $K_{\theta}$ , there is no available conjugate prior in general. We sample $\theta$ using the Metropolis-Hastings algorithm relying on the full conditional $p(\theta|\hat{Z}_{\mathcal{T}_{1}})\propto p(\hat{Z}_{\mathcal{T}_{1}}|\theta)p(\theta)$ . A detailed algorithm for posterior sampling can be found in Section S4.1 of the Supplementary Material.

If prediction on a test set is of interest, given MCMC samples of random effects on training set $\mathcal{T}_{1}$ and parameter $\theta$ , $Z_{\mathcal{T}_{2}}$ can be sampled from each unidimensional Gaussian distribution as in equation (\arabicequation). Unlike most previous related work, this prediction accounts for dependence among test locations, which is important in many applied contexts.

In practical applications where the sample sizes of training set and test set are similar and much larger than the cardinality of parent sets, the computational complexity is $O(Mln_{\text{cg}}n)$ , where $l$ is the number of MCMC iterations, $n_{\text{cg}}$ is the average number of conjugate gradient iterations, $n$ is the sample size of training set, and $M$ is the average number of nonzero elements per column in the RadGP precision matrix $\hat{\Phi}$ . The derivation of computational complexity is in Section S4 of the Supplementary Material; here we discuss implications. The variable $n_{\text{cg}}$ is determined by the distribution of eigenvalues of matrix $\hat{\Phi}+I_{n}/\sigma^{2}$ and numerical tolerance. The latter is fixed as $10^{-6}$ in our R package while the former is determined by the data set if the RadGP precision matrix $\hat{\Phi}$ approximates the original Gaussian process precision matrix $\Phi$ decently well. Therefore, the computational complexity is roughly proportional to $M$ , the average number of nonzero elements per column in the precision matrix.

\arabicsection Experimental Studies

\arabicsection.\arabicsubsection Approximation of a Gaussian Process Prior

Throughout Section 5, we consider the specific RadGP described in Section 2. We compare the RadGP prior with the nearest-neighbor Gaussian process (NNGP, Datta et al. 2016) implemented on a maximin ordering of the data (Guinness, 2018), in terms of their approximation accuracy. The true Gaussian process has mean zero and an isotropic Matérn covariance function

K_{0}(x)=\frac{\tau^{2}2^{1-\nu}}{\Gamma(\nu)}(\phi x)^{\nu}\mathcal{K}_{\nu}(\phi x),

(\arabicequation)

with the true parameters $(\phi,\tau^{2},\nu)=(31.63,1,1.5)$ . The value $\phi=31.63$ is obtained by setting $K_{0}(0.15)=0.05$ , a choice suggested by Katzfuss et al. (2020). The experiment is carried out on a $40\times 40$ grid in $[0,1]^{2}$ . According to the analysis of computational complexity in Section \arabicsection, we use the average number of nonzero elements per column in the precision matrix as a measure of complexity for both methods. The quality of approximation is measured by squared Wasserstein-2 distance between each approximate process and the true Gaussian process, which can be computed efficiently using the R package transport available from CRAN (Schuhmacher et al. 2024). As shown in Figure \arabicfigure, the results are generally in favor of RadGP, as it achieves smaller approximation error under various complexity of the directed graphs.

\arabicsection.\arabicsubsection Approximation of a Gaussian Process Regression Posterior

We study the performance of RadGP for the spatial regression model (\arabicequation) with no covariates, where the true covariance function for spatial effects $Z(s)$ is the same as that in Section \arabicsection.\arabicsubsection. The white noise $\epsilon(s)$ follows $N(0,\sigma^{2})$ with $\sigma^{2}=0.01$ . The parameters $\phi,\tau^{2},\sigma^{2}$ are unknown and need to be estimated from training data, while the Matérn smoothness parameter $\nu=3/2$ is assumed to be fixed and known. The domain is set as $\Omega=[0,1]^{2}$ , with training data consisting of $1600$ grid locations in $\Omega$ . We replicate the analysis on 50 datasets. For each dataset, the test set is generated as $1000$ random samples from the uniform distribution on $\Omega$ .

Three methods are compared: RadGP proposed in this paper, nearest-neighbor Gaussian process with maximin ordering, and Vecchia Gaussian process predictions (V-Pred) (Katzfuss et al., 2020) with maximin ordering. The last two methods share the same training procedure but differ in how they make predictions: NNGP assumes that testing locations are independent given the training locations whereas V-Pred allows dependencies among testing locations. All methods are fitted via the MCMC algorithm described in Section \arabicsection and thus only differ in terms of their assumed graphs.

To measure the accuracy of approximations for prediction in the spatial regression model (\arabicequation), we compute the Wasserstein-2 distance between the posterior predictive distributions from each of the three approximation methods and the true Gaussian process regression model. The Wasserstein-2 distance is computed using MCMC samples. The results are shown in Figure \arabicfigure. When the complexity of the directed graph is sufficiently large, NNGP and RadGP tend to have the same approximation error while V-Pred is slightly worse. However, RadGP is able to achieve the limiting approximation accuracy with a much smaller complexity compared to the other two methods, indicating a faster approximation by RadGP. Unlike prior approximation in Section \arabicsection.\arabicsubsection, the posterior approximation error does not go to zero when the complexity of directed graphs increases. This nonzero error comes from two parts: first, the Gaussian process regression model (\arabicequation) has unknown parameters that need to be estimated from a finite amount of data; second, the Wasserstein-2 distance computed based on a finite number of MCMC samples contains Monte Carlo error.

\arabicsection.\arabicsubsection Surface Temperature Data

We consider the land surface temperature data on a $499\mathrm{km}\times 581\mathrm{km}$ region southeast of Addis Ababa in Ethiopia at April 1st, 2020. The data are a product of the MODIS (Moderate Resolution Imaging Spectroradiometer) instrument on the Aqua satellite. The original data set comes as a $499\times 581$ image with each pixel corresponding to a $1\mathrm{km}\times 1\mathrm{km}$ spatial region, as shown in Fig. \arabicfigure left. However, atmosphere conditions like clouds can block earth surfaces, resulting in missing data for many pixels. There are $211,683$ spatial locations with observed temperature values and $78,236$ locations with missing values. We randomly split the training data into $3$ folds, using two folds for training and one fold for out-of-sample validation. This results in a total of $141,122$ training locations and $148,797$ testing locations.

We consider the same Gaussian process regression model as Section 5.1 with no covariates, zero mean function and exponential covariance function $K_{0}(x)=\tau^{2}\exp(-\phi x)$ . The observed temperature values are centered to compensate for the lack of intercept term. We evaluate the performances of RadGP with radius $4.01\mathrm{km}$ and nearest neighbor Gaussian process with neighbor size $15$ . The priors for both methods are set as $p(\phi)\propto 1$ , $\tau^{2}\sim\mathrm{IG}(2,1)$ and $\sigma^{2}\sim\mathrm{IG}(1,0.1)$ . Estimates for covariance parameters, out-of-sample mean squared errors and coverage for $95\%$ credible intervals are shown in Table \arabictable. The results are quite similar between the two methods.

The observed temperature and predicted temperature by RadGP are visualized in Fig. \arabicfigure. In general, the land surface temperature changes smoothly with respect to geological locations. The east and south part of the figure, which is covered by less vegetation, features slightly higher surface temperature. Small variations of temperature also occur in many local regions, indicating the geographical complexity of Ethiopia. There is a small blue plate in the west border of the figure that has significantly lower surface temperature than its surroundings, which corresponds to several lakes in the Great Rift Valley.

We further compare the performance of the two methods in terms of joint predictions at multiple held out locations. As we do not know the ground truth, we cannot calculate Wasserstein-2 error and instead focus on the joint likelihood ratio for held out data. Specifically, we select $567$ pairs of locations in the set reserved for out-of-sample validation. Each pair consists of two locations with $1$ km distance from each other. Any two locations from different pairs are at least $10$ km apart. For each pair of locations, we compute the logarithm of the likelihood ratio between RadGP and NNGP. The distribution of the log likelihood ratio can provide important information about joint inference. Specifically, the mean of the log likelihood ratio is $2.152$ and there are $68\%$ pairs with positive log likelihood ratios. Since the distribution of the log likelihood ratio has heavy tails, we remove the bottom and top $5\%$ of the values and visualize the rest in Fig. \arabicfigure, where the red vertical line is $\mathrm{ratio}=0$ . We can see in the majority of cases, RadGP performs better than NNGP in terms of characterizing dependencies between temperature at two close locations.

	RadGP	NNGP
$\phi\times 10^{2}$	5.761 (5.432, 6.133)	5.689 (5.020, 6.321)
$\tau^{2}$	1.706 (1.605, 1.807)	1.744 (1.565, 1.972)
$\sigma^{2}\times 10^{3}$	1.222 (1.104, 1.351)	1.105 (0.994, 1.222)
MSE	0.0761	0.0770
coverage	0.954	0.953
time	1929 sec	1811 sec

\arabicsection Discussion

Vecchia approximations are arguably one of the most popular approaches for scaling up Gaussian process models to large datasets in spatial statistics. The major contribution of this article is to propose the first Vecchia-like approximation method that has strong theoretical guarantees in quantifying the accuracy of approximation to the original process. Our numerical experiments have shown competitive performance from the proposed RadGP compared to other Vecchia approximation methods. We have demonstrated that our method leads to improvements in characterizing joint predictive distributions while matching the performance of state-of-the-art methods in parameter estimation and marginal predictions.

Our current theory leads to several potential future research directions. First, we have established upper bounds for the approximation error to the original Gaussian process in Wasserstein-2 distance. We can further study lower bounds of this approximation error within specific families of approximations. It will be helpful to examine when the norm-controlled inversion bounds we have used in the proofs are tight, and whether the derived conditions on the approximation radius $\rho$ for various covariance functions in Corollary \arabicsection.\arabictheorem can be made optimal for such families.

Second, existing Vecchia approximations that use maximin ordering (Katzfuss et al., 2020) systematically generate remote locations across the domain. While our theory shows close locations are effective in quantifying spatial dependence, it is still unclear whether and how remote locations help with the approximation of covariance function. We would like to expand our research into general Vecchia approximations. The objective is to study the approximation error for different approximation approaches and discover the optimal strategy that minimizes the approximation error under certain sparsity constraints.

Third, although our main focus has been on approximation of Gaussian processes, we can further consider how the approximate process will affect posterior estimation of model parameters, including spatial fixed effects and covariance parameters. Under our current theoretical framework, since the minimal separation distance $q$ needs to be lower bounded, we can study Bayesian parameter estimation under increasing domain (Mardia & Marshall, 1984) or mixed domain (Lahiri & Zhu, 2006) asymptotic regimes. It remains an open problem whether our scalable approximation can maintain posterior consistency and optimal rates for all model parameters. We leave these directions for future exploration.

Acknowledgements

The authors thank Dr. Karlheinz Gröchenig for helpful discussion regarding the theory on norm-controlled inversion of Banach algebras, and two anonymous reviewers for detailed feedback and suggestion that greatly improved this article. Yichen Zhu has received funding from the European Research Council (ERC) under grant No. 101041064. Yichen Zhu, Michele Peruzzi and David Dunson have received funding from ERC under grant No. 856506 and United States National Institutes of Health under grant No. R01ES028804. Cheng Li has received funding from Singapore Ministry of Education Academic Research Funds Tier 1 Grant A-0004822-00-00. Part of this research was performed while Michele Peruzzi was visiting the Institute for Mathematical and Statistical Innovation (IMSI), which is supported by the National Science Foundation (grant No. DMS-1929348).

Supplementary Material

Proofs of lemmas, theorems and corollaries, detailed algorithms, computational complexity analysis and additional experimental studies are available at Supplementary Material for “Radial Neighbors for Provably Accurate Scalable Approximations of Gaussian Processes.” An R package for regression models using radial neighbors Gaussian processes is available at https://github.com/mkln/radgp.

References

Bai et al. (2012) Bai, Y., Song, P. X.-K. & Raghunathan, T. (2012). Joint composite estimating functions in spatiotemporal models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74, 799–824.
Banerjee et al. (2008) Banerjee, S., Gelfand, A. E., Finley, A. O. & Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 825–848.
Bevilacqua et al. (2019) Bevilacqua, M., Faouzi, T., Furrer, R. & Porcu, E. (2019). Estimation and prediction using generalized Wendland covariance functions under fixed domain asymptotics. The Annals of Statistics 47, 828–856.
Bevilacqua & Gaetan (2015) Bevilacqua, M. & Gaetan, C. (2015). Comparing composite likelihood methods based on pairs for spatial Gaussian random fields. Statistics and Computing 25, 877–892.
Burt et al. (2020) Burt, D., Rasmussen, C. E. & van der Wilk, M. (2020). Convergence of sparse variational inference in Gaussian processes regression. Journal of Machine Learning Research 21, 1–63.
Cressie & Johannesson (2008) Cressie, N. & Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 209–226.
Cressie (1993) Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley-Interscience.
Datta et al. (2016) Datta, A., Banerjee, S., Finley, A. O. & Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association 111, 800–812.
Eidsvik et al. (2014) Eidsvik, J., Shaby, B. A., Reich, B. J., Wheeler, M. & Niemi, J. (2014). Estimation and prediction in spatial models with block composite likelihoods. Journal of Computational and Graphical Statistics 23, 295–315.
Finley et al. (2019) Finley, A. O., Datta, A., Cook, B. D., Morton, D. C., Andersen, H. E. & Banerjee, S. (2019). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics 28, 401–414.
Finley et al. (2009) Finley, A. O., Sang, H., Banerjee, S. & Gelfand, A. E. (2009). Improving the performance of predictive process modeling for large datasets. Computational Statistics & Data Analysis 53, 2873–2884.
Furrer et al. (2006) Furrer, R., Genton, M. G. & Nychya, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics 15, 502–523.
Gelbrich (1990) Gelbrich, M. (1990). On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten 147, 185–203.
Gramacy & Apley (2015) Gramacy, R. B. & Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics 24, 561–578.
Guhaniyogi et al. (2011) Guhaniyogi, R., Finley, A. O., Banerjee, S. & Gelfand, A. E. (2011). Adaptive Gaussian predictive process models for large spatial datasets. Environmetrics 22, 997–1007.
Guinness (2018) Guinness, J. (2018). Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics 60, 415–429.
Heaton et al. (2019) Heaton, M. J., Datta, A., Finley, A., Furrer, R., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F. & Zammit-Mangion, A. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics 24, 398–425.
Jin et al. (2021) Jin, B., Peruzzi, M. & Dunson, D. B. (2021). Bag of DAGs: Flexible & scalable modeling of spatiotemporal dependence. arXiv preprint arXiv: 2112.11870 .
Katzfuss & Guinness (2021) Katzfuss, M. & Guinness, J. (2021). A general framework for Vecchia approximations of Gaussian processes. Statistical Science 36, 124–141.
Katzfuss et al. (2020) Katzfuss, M., Guinness, J., Gong, W. & Zilber, D. (2020). Vecchia approximations of Gaussian-process predictions. Journal of Agricultural, Biological and Environmental Statistics 25, 383–414.
Kaufman et al. (2008) Kaufman, C. G., Schervish, M. J. & Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association 103, 1545–1555.
Kidd & Katzfuss (2022) Kidd, B. & Katzfuss, M. (2022). Bayesian nonstationary and nonparametric covariance estimation for large spatial data (with discussion. Bayesian Analysis 17, 291–351.
Lahiri & Zhu (2006) Lahiri, S. N. & Zhu, J. (2006). Resampling methods for spatial regression models under a class of stochastic designs. The Annals of Statistics 34, 1774–1813.
Mardia & Marshall (1984) Mardia, K. V. & Marshall, R. J. (1984). Maximum likelihood estimation of models for residual covariance in spatial statistics. Biometrika 71, 135–146.
Nishimura & Suchard (2022) Nishimura, A. & Suchard, M. A. (2022). Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in “large n, large p” Bayesian sparse regression. Journal of the American Statistical Association 118, 2468–2481.
Peruzzi et al. (2022) Peruzzi, M., Banerjee, S. & Finley, A. O. (2022). Highly scalable Bayesian geostatistical modeling via meshed Gaussian processes on partitioned domains. Journal of the American Statistical Association 117, 969–982.
Peruzzi & Dunson (2022) Peruzzi, M. & Dunson, D. B. (2022). Spatial multivariate trees for big data Bayesian regression. Journal of Machine Learning Research 23, 1–40.
Peruzzi & Dunson (2024) Peruzzi, M. & Dunson, D. B. (2024). Spatial meshing for general Bayesian multivariate models. Journal of Machine Learning Research 25, 1–49.
Quiñonero-Candela & Rasmussen (2005) Quiñonero-Candela, J. & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6, 1939–1959.
Rue (2001) Rue, H. (2001). Fast sampling of Gaussian markov random fields. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, 325–338.
Rue & Held (2005) Rue, H. & Held, L. (2005). Gaussian Markov random fields: theory and applications. Chapman and Hall/CRC.
Sang et al. (2011) Sang, H., Un, M. & Huang, J. Z. (2011). Covariance approximation for large multivariate spatial data sets with an application to multiple climate model errors. Annals of Applied Statistics 5, 2519–2548.
Schäfer et al. (2021) Schäfer, F., Katzfuss, M. & Owhadi, H. (2021). Sparse Cholesky factorization by Kullback–Leibler minimization. SIAM Journal on Scientific Computing 43, A2019–A2046.
Schuhmacher et al. (2024) Schuhmacher, D., Bähre, B., Bonneel, N., Gottschlich, C., Hartmann, V., Heinemann, F., B., S. & Schrieber, J. (2024). transport: Computation of Optimal Transport Plans and Wasserstein Distances. R package version 0.15-0.
Stein (2014) Stein, M. L. (2014). Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics 8, 1–19.
Vecchia (1988) Vecchia, A. V. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 50, 297–312.
Zhang (2012) Zhang, H. (2012). Asymptotics and computation for spatial statistics. In Advances and Challenge in Space-time Modelling of Natural Events. Springer.
Zhang et al. (2023) Zhang, L., Tang, W. & Sudipto, B. (2023). Fixed-domain asymptotics under Vecchia’s approximation of spatial process likelihoods. Statistica Sinica , forthcoming.
Zilber & Katzfuss (2021) Zilber, D. & Katzfuss, M. (2021). Vecchia–Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data. Computational Statistics & Data Analysis 153, 107081.

\printhistory