
Semiparametric Regression for Spatial Data via Deep Learning

Kexuan Li, Jun Zhu, Anthony R. Ives, Volker C. Radeloff, and Fangfang Wang
Kexuan Li ([email protected]), Global Analytics and Data Sciences, Biogen, Cambridge, Massachusetts. Jun Zhu ([email protected]), Department of Statistics, University of Wisconsin-Madison. Anthony R. Ives ([email protected]), Department of Integrative Biology, University of Wisconsin-Madison. Volker C. Radeloff ([email protected]), Department of Forest and Wildlife Ecology, University of Wisconsin-Madison. Fangfang Wang ([email protected]), Department of Mathematical Sciences, Worcester Polytechnic Institute.
Abstract

In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. To be specific, we use a sparsely connected deep neural network with the rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under some mild conditions, the estimator is proven to be consistent, and the rate of convergence is determined by three factors: (1) the architecture of the neural network class, (2) the smoothness and (intrinsic) dimension of the true mean function, and (3) the magnitude of the spatial dependence. Owing to its use of stochastic gradient descent optimization, our method handles large data sets well. Simulation studies on synthetic data are conducted to assess the finite-sample performance, and the results indicate that the proposed method is capable of picking up the intricate relationship between response and covariates. Finally, a real data analysis is provided to demonstrate the validity and effectiveness of the proposed method.


Keywords: Semiparametric regression; Spatially dependent data; Deep Neural Networks; Stochastic gradient descent.

1 Introduction

With recent advances in remote sensing technology and geographical sciences, there has been considerable interest in modeling spatially referenced data. The purpose of this paper is to develop a new methodology that captures complex structures in such data via deep neural networks and Gaussian random fields. In addition, we provide a theoretical understanding of deep neural networks for spatially dependent data.

In recent years, deep neural networks (DNNs) have made great breakthroughs in many fields, such as computer vision (He et al., 2016), dynamical systems (Li et al., 2021), natural language processing (Bahdanau et al., 2014), drug discovery and toxicology (Jiménez-Luna et al., 2020), and variable selection (Li et al., 2022; Li, 2022). Beyond these successful applications, there has also been great progress in the theoretical development of deep learning. Liu et al., (2020) and Schmidt-Hieber, (2020) proved that the neural network estimator achieves the optimal (up to a logarithmic factor) minimax rate of convergence. Liu et al., (2022) further removed the logarithmic term and achieved the exact optimal nonparametric convergence rate. One of the appealing features of deep neural networks is that they can circumvent the curse of dimensionality under some mild conditions.

Owing to the superior performance and theoretical guarantees of deep learning, applying deep learning to spatial data has also drawn much attention. For example, Zammit-Mangion and Wikle, (2020) fitted the integro-difference equation through convolutional neural networks and obtained probabilistic spatio-temporal forecasts. Zammit-Mangion et al., (2021) constructed a deep probabilistic architecture to model nonstationary spatial processes using a warping approach. Mateu and Jalilian, (2022) used variational autoencoder generative neural networks to analyze spatio-temporal point processes. Kirkwood et al., (2022) applied Bayesian deep neural networks to spatial interpolation. However, there is a lack of theoretical understanding of the aforementioned work, which we address in this paper.

In addition, we model spatial dependence by Gaussian random fields and develop computationally efficient model estimation. Due to technological advances in the data collection process, spatial datasets are often massive, and traditional statistical methods face two challenges. One challenge is the aggravated computational burden. To reduce computational cost, various methods have been developed, such as covariance tapering, fixed rank kriging, and Gaussian Markov random fields (see Sun et al., (2012) for a comprehensive review). The other challenge is data storage and collection. Many spatial datasets are not only big, but are also generated by different sources or in an online fashion such that observations arrive one by one. In both cases, we cannot process the entire dataset at once. To overcome these two challenges, Robbins and Monro, (1951) proposed a computationally scalable algorithm called stochastic gradient descent (SGD), which has achieved great success in many areas. Instead of evaluating the actual gradient based on the entire dataset, SGD estimates the gradient using only one observation, which makes it computationally feasible for large-scale data and streaming data. Its statistical inferential properties have also been studied by many researchers (Su and Zhu, 2018; Liu et al., 2021).

Before proceeding, it is essential to provide a brief overview of the classical approaches to formulating spatial dependence in a spatial process. In spatially varying coefficient (SVC) models, covariates may have different effects at different locations, a phenomenon referred to as “spatial non-stationarity”. Thus, SVC models allow the coefficients of the covariates to change with location. In other words, the spatial dependence is entirely explained by the regressors, while the disturbances are independent (see, e.g., Hastie and Tibshirani, (1993); Fan and Zhang, (1999); Mu et al., (2018); Kim and Wang, (2021)). Another class of models is based on the spatial autoregressive (SAR) models of Cliff and Ord, (1981), where the information about spatial dependence is contained in a spatial weight matrix, and the response variable at each location is assumed to be affected by a weighted average of the dependent variables from the other sampled units (see Lee, (2002, 2004)). In SAR models, the spatial weight matrix is assumed to be known, which is often infeasible in practice, especially for large datasets. Both SVC and SAR models have limitations. First, they assume that the relationship between the response and the covariates is linear and ignore complex interactions and nonlinear structure. Second, they both involve computing the inverse of an $n\times n$ matrix, where $n$ is the number of observations, which requires $O(n^{3})$ time complexity and $O(n^{2})$ memory complexity. Thus, the computational burden of both SVC and SAR models is substantial.

To meet these challenges, here we develop deep learning-based semiparametric regression for spatial data. Specifically, we use a sparsely connected feedforward neural network to fit the regression model, where the spatial dependence is captured by Gaussian random fields. By assuming a compositional structure on the regression function, the consistency of the neural network estimator is guaranteed. The advantages of the proposed method are fourfold. First, we do not assume any parametric functional form for the regression function, allowing the true mean function to be nonlinear or to involve complex interactions. This is an improvement over many existing parametric, semiparametric, and nonparametric approaches (Hastie and Tibshirani, 1993; Fan and Zhang, 1999; Mu et al., 2018; Kim and Wang, 2021; Lee, 2002; Robinson, 2011; Jenish, 2012; Li, 2016; Lu and Tjøstheim, 2014; Kurisu, 2019, 2022). Second, under some mild technical conditions, we show that the estimator is consistent. To the best of our knowledge, this is the first theoretical result on deep neural networks for spatially dependent data. Third, the convergence rate is free of the input dimension, which means our estimator does not suffer from the curse of dimensionality. Finally, owing to the appealing properties of SGD, our method is feasible for large-scale datasets and streaming data.

The remainder of the paper is organized as follows. Section 2 formulates the statistical problem and presents the deep neural network estimator. The computational aspects and theoretical properties of the estimator are given in Section 3. Section 4 evaluates the finite-sample performance of the proposed estimator through a simulation study. We apply our method to a real-world dataset in Section 5, and some concluding remarks are provided in Section 6. Technical details are provided in the Appendix.

2 Model and Estimator

In this section, we first formulate the problem and then present the proposed estimator under a deep learning framework.

2.1 Model Setup

For a spatial domain of interest $\mathcal{S}$, we consider the following semiparametric spatial regression model:

y(\bm{s})=f_{0}(\bm{x}(\bm{s}))+e_{1}(\bm{s})+e_{2}(\bm{s}),\quad \bm{s}\in\mathcal{S},   (1)

where $f_{0}:[0,1]^{d}\rightarrow\mathbb{R}$ is an unknown mean function of interest, $\bm{x}(\bm{s})=(x_{1}(\bm{s}),\ldots,x_{d}(\bm{s}))^{\top}$ represents a $d$-dimensional vector of covariates at location $\bm{s}$ with $x_{i}(\bm{s})\in[0,1]$, $e_{1}(\bm{s})$ is a mean-zero Gaussian random field with covariance function $\gamma(\bm{s},\bm{s}^{\prime})$, $\bm{s},\bm{s}^{\prime}\in\mathcal{S}$, and $e_{2}(\bm{s})$ is a spatial Gaussian white noise process with mean 0 and variance $\sigma^{2}$. Furthermore, we assume that $e_{1}(\bm{s})$, $e_{2}(\bm{s})$, and $\bm{x}(\bm{s})$ are independent of each other. Thus the observation $y(\bm{s})$ comprises three components: the large-scale trend $f_{0}(\bm{x}(\bm{s}))$, the small-scale spatial variation $e_{1}(\bm{s})$, and the measurement error $e_{2}(\bm{s})$; see, for instance, Cressie and Johannesson, (2008).

In the spatial statistics literature, it is popular to focus on predicting the hidden spatial process $y^{*}(\bm{s})=f_{0}(\bm{x}(\bm{s}))+e_{1}(\bm{s})$ using the observed information (Cressie and Johannesson, 2008). However, the primary interest of this paper is to estimate the large-scale trend $f_{0}(\bm{x}(\bm{s}))$, where the relationship between the hidden spatial process and the covariates could be complex in nature. To capture such a complex relationship, we assume that $f_{0}$ is a composition of several functions, inspired by the characteristics of neural networks (Schmidt-Hieber, 2020). Hölder smoothness (see Definition 1 in the Appendix) is a commonly used smoothness assumption on the regression function in the nonparametric and semiparametric literature. It is thus natural to assume that the true mean function $f_{0}$ is a composition of Hölder smooth functions, which is formally stated in the following assumption.

Assumption 1.

The function $f_{0}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ has a compositional structure with parameters $(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$, where $L_{*}\in\mathbb{Z}_{+}$, $\bm{r}=(r_{0},\ldots,r_{L_{*}+1})^{\top}\in\mathbb{Z}_{+}^{L_{*}+2}$ with $r_{0}=d$ and $r_{L_{*}+1}=1$, $\bm{\tilde{r}}=(\tilde{r}_{0},\ldots,\tilde{r}_{L_{*}})^{\top}\in\mathbb{Z}_{+}^{L_{*}+1}$, $\bm{\beta}=(\beta_{0},\ldots,\beta_{L_{*}})^{\top}\in\mathbb{R}_{+}^{L_{*}+1}$, $\bm{a}=(a_{0},\ldots,a_{L_{*}+1})^{\top}$, $\bm{b}=(b_{0},\ldots,b_{L_{*}+1})^{\top}\in\mathbb{R}^{L_{*}+2}$, and $\bm{C}=(C_{0},\ldots,C_{L_{*}})^{\top}\in\mathbb{R}_{+}^{L_{*}+1}$; that is,

f_{0}(\bm{z})=\bm{g}_{L_{*}}\circ\ldots\circ\bm{g}_{1}\circ\bm{g}_{0}(\bm{z}),\quad\quad\textrm{ for }\bm{z}\in[a_{0},b_{0}]^{r_{0}},

where $\mathbb{Z}_{+}$ and $\mathbb{R}_{+}$ denote the sets of positive integers and positive real numbers, respectively, $\bm{g}_{i}=(g_{i,1},\ldots,g_{i,r_{i+1}})^{\top}:[a_{i},b_{i}]^{r_{i}}\to[a_{i+1},b_{i+1}]^{r_{i+1}}$ for some $|a_{i}|,|b_{i}|\leq C_{i}$, and the component functions $g_{i,j}:[a_{i},b_{i}]^{\tilde{r}_{i}}\to[a_{i+1},b_{i+1}]$ are $(\beta_{i},C_{i})$-Hölder smooth and depend on only $\tilde{r}_{i}$ variables, with $\tilde{r}_{i}\leq r_{i}$.

Without loss of generality, we assume $C_{i}>1$ in Assumption 1. The parameter $L_{*}$ refers to the total number of layers, i.e., the number of composite functions; $\bm{r}$ gives the total number of variables in each layer, whereas $\bm{\tilde{r}}$ gives the number of “active” variables in each layer. The two parameter vectors $\bm{\beta}$ and $\bm{C}$ pertain to the Hölder smoothness in each layer, while $\bm{a}$ and $\bm{b}$ define the domain of $\bm{g}_{i}$ in the $i$th layer. For the rest of the paper, we use $\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$ to denote the class of functions that have a compositional structure as specified in Assumption 1 with parameters $(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$.

It is worth mentioning that Assumption 1 is commonly adopted in the deep learning literature; see Bauer and Kohler, (2019), Schmidt-Hieber, (2020), Kohler and Langer, (2021), Liu et al., (2020), and Li et al., (2021), among others. This compositional structure covers a wide range of function classes, including the generalized additive model. In Example 1 and Figure 1, we present an illustrative example to help readers understand the concept of a compositional structure within the context of the generalized additive model.

Example 1.

The generalized additive model is a generalized linear model with a linear predictor involving a sum of smooth functions of covariates (Hastie and Tibshirani, 1986). Suppose $f(\bm{z})=\varphi(\sum_{i=1}^{d}h_{i}(z_{i}))$, where $\varphi(\cdot)$ is $(\beta_{\varphi},C_{\varphi})$-Hölder smooth and the $h_{i}(\cdot)$ are $(\beta_{h},C_{h})$-Hölder smooth, for some $(\beta_{\varphi},C_{\varphi})$ and $(\beta_{h},C_{h})$. Clearly, $f(\bm{z})$ can be written as a composition of three functions, $f(\bm{z})=\bm{g}_{2}\circ\bm{g}_{1}\circ\bm{g}_{0}(\bm{z})$, with $\bm{g}_{0}(z_{1},\ldots,z_{d})=(h_{1}(z_{1}),\ldots,h_{d}(z_{d}))^{\top}$, $\bm{g}_{1}(z_{1},\ldots,z_{d})=\sum_{i=1}^{d}z_{i}$, and $\bm{g}_{2}(z)=\varphi(z)$. Here, $L_{*}=2$, $\bm{r}=(d,d,1,1)^{\top}$, $\bm{\tilde{r}}=(d,d,1)^{\top}$, and $\bm{\beta}=(\beta_{h},\infty,\beta_{\varphi})^{\top}$.

Figure 1: Illustration of the compositional structure for the generalized additive model in Example 1.
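To make the composition in Example 1 concrete, the short numpy sketch below evaluates a generalized additive model both directly and as the three-layer composition $\bm{g}_{2}\circ\bm{g}_{1}\circ\bm{g}_{0}$; the particular choices of $h_{i}$ and $\varphi$ are arbitrary illustrations and are not part of the paper.

```python
import numpy as np

# Hypothetical smooth components for illustration: h_i(z) = sin((i+1) z), phi = exp,
# so f(z) = exp(sum_i sin((i+1) z_i)) has the compositional structure of Example 1.
d = 5
h = [lambda z, i=i: np.sin((i + 1) * z) for i in range(d)]   # inner Hölder smooth functions
phi = np.exp                                                  # outer Hölder smooth link

def g0(z):   # layer 0: apply each h_i to its own coordinate
    return np.array([h[i](z[i]) for i in range(d)])

def g1(u):   # layer 1: sum the d coordinates
    return np.sum(u)

def g2(v):   # layer 2: apply the link function
    return phi(v)

def f(z):    # f = g2 o g1 o g0, exactly as in Example 1
    return g2(g1(g0(z)))

z = np.random.rand(d)                                # a point in [0, 1]^d
print(f(z), phi(sum(h[i](z[i]) for i in range(d))))  # the two evaluations agree
```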

2.2 Deep Neural Network (DNN) Estimator

In this paper, we consider estimating the unknown function $f_{0}$ via a deep neural network, owing to the complexity of $f_{0}$ and the flexibility of neural networks. Before presenting our main results, we first briefly review the neural network terminology pertaining to this work.

An activation function is a nonlinear function used to learn complex patterns from data. In this paper, we focus on the Rectified Linear Unit (ReLU) shifted activation function, defined as $\sigma_{\bm{v}}(\bm{z})=(\sigma(z_{1}-v_{1}),\ldots,\sigma(z_{d}-v_{d}))^{\top}$, where $\sigma(s)=\max\{0,s\}$ and $\bm{z}=(z_{1},\ldots,z_{d})^{\top}\in\mathbb{R}^{d}$. The ReLU activation function enjoys both theoretical and computational advantages: the projection property $\sigma\circ\sigma=\sigma$ facilitates the proof of consistency, while ReLU helps avoid the vanishing gradient problem. The ReLU feedforward neural network $\bm{f}(\bm{z},W,v)$ is given by

\bm{f}(\bm{z},W,v)=W_{L}\sigma_{\bm{v}_{L}}\cdots W_{1}\sigma_{\bm{v}_{1}}W_{0}\bm{z},\quad\bm{z}\in\mathbb{R}^{p_{0}},   (2)

where $\{(W_{0},\ldots,W_{L}):W_{l}\in\mathbb{R}^{p_{l+1}\times p_{l}},0\leq l\leq L\}$ is the collection of weight matrices, $\{(\bm{v}_{1},\ldots,\bm{v}_{L}):\bm{v}_{l}\in\mathbb{R}^{p_{l}},1\leq l\leq L\}$ is the collection of so-called biases (in the neural network literature), and $\sigma_{\bm{v}_{l}}(\cdot)$, $1\leq l\leq L$, are the ReLU shifted activation functions. Here, $L$ is the number of hidden layers, i.e., the length of the network, while $p_{j}$ is the number of units in the $j$th layer, i.e., the width of the network. When using a ReLU feedforward neural network for the regression problem (1), we need $p_{0}=d$ and $p_{L+1}=1$; the parameters to be estimated are the weight matrices $(W_{j})_{j=0,\ldots,L}$ and the biases $(\bm{v}_{j})_{j=1,\ldots,L}$.
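For reference, the following minimal numpy sketch evaluates the shifted-ReLU feedforward network in (2); the network sizes and random weights are illustrative assumptions only.

```python
import numpy as np

def relu_network(z, W, v):
    """Evaluate f(z, W, v) = W_L sigma_{v_L}( ... W_1 sigma_{v_1}( W_0 z ) ... ) as in (2).

    W is a list of L+1 weight matrices [W_0, ..., W_L];
    v is a list of L shift ("bias") vectors [v_1, ..., v_L].
    """
    a = W[0] @ z                              # first affine map, no activation before it
    for Wl, vl in zip(W[1:], v):
        a = Wl @ np.maximum(a - vl, 0.0)      # shifted ReLU: sigma_v(z) = max(z - v, 0)
    return a

# A tiny example with L = 2 hidden layers, input dimension p_0 = d = 3, and width 4.
rng = np.random.default_rng(0)
p = [3, 4, 4, 1]                                              # (p_0, p_1, p_2, p_3)
W = [rng.uniform(-1, 1, size=(p[l + 1], p[l])) for l in range(3)]
v = [rng.uniform(-1, 1, size=p[l]) for l in (1, 2)]
print(relu_network(rng.random(3), W, v))                      # scalar output, p_{L+1} = 1
```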

By definition, a ReLU feedforward neural network can be written as a composition of simple nonlinear functions; that is,

\bm{f}(\bm{z},W,v)=\bm{g}_{L}\circ\ldots\circ\bm{g}_{1}\circ\bm{g}_{0}(\bm{z}),\quad\bm{z}\in\mathbb{R}^{p_{0}},

where $\bm{g}_{i}(\bm{z})=W_{i}\sigma_{\bm{v}_{i}}(\bm{z})$, $\bm{z}\in\mathbb{R}^{p_{i}}$, $i=1,\ldots,L$, and $\bm{g}_{0}(\bm{z})=W_{0}\bm{z}$, $\bm{z}\in\mathbb{R}^{p_{0}}$. Unlike traditional function approximation theory, where a complex function is represented as an infinite sum of simpler functions (such as a Taylor series, Fourier series, or Chebyshev approximation), deep neural networks approximate a complex function via composition, i.e., by composing simpler functions (Lu et al., 2020; Farrell et al., 2021; Yarotsky, 2017). Thus, a composite function can be well approximated by a feedforward neural network, which is why we assume the true mean function $f_{0}$ has a compositional structure.

In practice, the length and width of the network can be extremely large, which easily causes overfitting. To overcome this problem, a common practice in deep learning is to randomly set some neurons to zero, which is called dropout. Therefore, it is natural to assume that the network space is sparse and that all parameters are bounded by one, where the latter can be achieved by dividing all the weights by the maximum weight (Bauer and Kohler, 2019; Schmidt-Hieber, 2020; Kohler and Langer, 2021). As such, we consider the following sparse neural network class with bounded weights:

\mathcal{F}(L,\bm{p},\tau,F)=\left\{f\textrm{ is of form (2)}:\ \max_{j=0,\ldots,L}\|W_{j}\|_{\infty}+|\bm{v}_{j}|_{\infty}\leq 1,\ \sum_{j=0}^{L}(\|W_{j}\|_{0}+|\bm{v}_{j}|_{0})\leq\tau,\ \|f\|_{\infty}\leq F\right\},   (3)

where $\bm{p}=(p_{0},\ldots,p_{L+1})$ with $p_{0}=d$ and $p_{L+1}=1$, and $\bm{v}_{0}$ is a vector of zeros. This class of neural networks is also adopted in Schmidt-Hieber, (2020), Liu et al., (2020), and Li et al., (2021).

Suppose that the process $y(\cdot)$ is observed at a finite number of spatial locations $\{\bm{s}_{1},\ldots,\bm{s}_{n}\}$ in $\mathcal{S}$. The desired DNN estimator of $f_{0}$ in Model (1) is a sparse neural network in $\mathcal{F}(L,\bm{p},\tau,F)$ with the smallest empirical risk; that is,

\widehat{f}_{\textrm{global}}(\widehat{W},\widehat{v})=\mathop{\mathrm{argmin}}_{f\in\mathcal{F}(L,\bm{p},\tau,F)}\ n^{-1}\sum_{i=1}^{n}\big(y(\bm{s}_{i})-f(\bm{x}(\bm{s}_{i}))\big)^{2}.   (4)

For simplicity, we sometimes write $\widehat{f}_{\textrm{global}}$ for $\widehat{f}_{\textrm{global}}(\widehat{W},\widehat{v})$ if no confusion arises.

3 Computation and Theoretical Results

In this section, we describe the computational procedure used to optimize the objective function (4) and present the main theoretical results.

3.1 Computational Aspects

Because (4) does not admit a closed-form solution, we use a stochastic gradient descent (SGD)-based algorithm to optimize it. In contrast to a gradient descent algorithm, which requires the full dataset to compute the gradient at each iteration, SGD or mini-batch gradient descent only needs access to a subset of observations during each update, making it capable of training relatively complex models on large datasets and computationally feasible with streaming data. Despite its successful applications in machine learning and deep learning, SGD still suffers from some potential problems: for example, the rate of convergence to the minimum can be slow, and the performance is very sensitive to tuning parameters. To circumvent these problems, various methods have been proposed, such as RMSprop and Adam (Kingma and Ba, 2014). In this paper, we use the Adam optimizer to solve (4).

During the training process, there are several hyper-parameters to tune in our approach: the number of layers $L$, the number of neurons in each layer $\bm{p}$, the sparsity parameter $\tau$, and the learning rate. These hyper-parameters play an important role in the learning process, but it is challenging to determine their values without domain knowledge. In particular, it is difficult to control the sparsity parameter $\tau$ directly during training. Thus, we add an $\ell_{1}$-regularization penalty to control the number of inactive neurons in the network. The idea of adding a sparsity-inducing regularization to the hidden layers in deep learning is very common; see, for instance, Scardapane et al., (2017) and Lemhadri et al., (2021). In this paper, we use 5-fold cross-validation to select the tuning parameters.
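As an illustration of this training procedure, the PyTorch sketch below minimizes the empirical risk (4) with the Adam optimizer, mini-batch gradients, dropout, and an $\ell_{1}$ penalty on the weights. It is a minimal sketch under assumed settings: the architecture (3 hidden layers of width 30, dropout 0.2) mirrors the choices reported in Section 4, while the penalty weight `lam`, learning rate, batch size, and epoch count are hypothetical values that would in practice be chosen by cross-validation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def fit_dnn(x, y, width=30, n_layers=3, dropout=0.2, lam=1e-4, lr=1e-3, epochs=200):
    """Fit a ReLU feedforward network by minimizing mean squared error plus an
    l1 penalty (a practical surrogate for the sparsity constraint tau in (3))."""
    d = x.shape[1]
    layers, p_in = [], d
    for _ in range(n_layers):
        layers += [nn.Linear(p_in, width), nn.ReLU(), nn.Dropout(dropout)]
        p_in = width
    layers += [nn.Linear(p_in, 1)]
    net = nn.Sequential(*layers)

    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:                                   # mini-batch (stochastic) gradients
            opt.zero_grad()
            mse = ((net(xb).squeeze(-1) - yb) ** 2).mean()      # empirical risk as in (4)
            l1 = sum(p.abs().sum() for p in net.parameters())   # sparsity-inducing penalty
            (mse + lam * l1).backward()
            opt.step()
    return net

# x: n x d covariate matrix, y: length-n response vector, both torch.float32 tensors.
```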

3.2 Theoretical Results

Recall that the minimizer of (4), $\widehat{f}_{\textrm{global}}$, is practically unattainable, and we use an SGD-based algorithm to minimize the objective function (4), which may converge to a local minimum. The actual estimator obtained by minimizing (4) is denoted by $\widehat{f}_{\textrm{local}}\in\mathcal{F}(L,\bm{p},\tau,F)$. We define the difference between the expected empirical risks of $\widehat{f}_{\textrm{global}}$ and $\widehat{f}_{\textrm{local}}$ as

\Delta_{n}(\widehat{f}_{\textrm{local}})\doteq\bm{E}_{f_{0}}\left[\frac{1}{n}\sum_{i=1}^{n}\big(y(\bm{s}_{i})-\widehat{f}_{\textrm{local}}(\bm{x}(\bm{s}_{i}))\big)^{2}-\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\frac{1}{n}\sum_{i=1}^{n}\big(y(\bm{s}_{i})-\tilde{f}(\bm{x}(\bm{s}_{i}))\big)^{2}\right]
=\bm{E}_{f_{0}}\left[\frac{1}{n}\sum_{i=1}^{n}\big(y(\bm{s}_{i})-\widehat{f}_{\textrm{local}}(\bm{x}(\bm{s}_{i}))\big)^{2}-\frac{1}{n}\sum_{i=1}^{n}\big(y(\bm{s}_{i})-\widehat{f}_{\textrm{global}}(\bm{x}(\bm{s}_{i}))\big)^{2}\right],   (5)

where $\bm{E}_{f_{0}}$ stands for the expectation with respect to the true regression function $f_{0}$. For any $\widehat{f}\in\mathcal{F}(L,\bm{p},\tau,F)$, we consider the following estimation error:

R_{n}(\widehat{f},f_{0})\doteq\bm{E}_{f_{0}}\left[\frac{1}{n}\sum_{i=1}^{n}\big(\widehat{f}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big)^{2}\right].   (6)

The oracle-type theorem below gives an upper bound for the estimation error.

Theorem 1.

Suppose that the unknown true mean function $f_{0}$ in (1) satisfies $\|f_{0}\|_{\infty}\leq F$ for some $F\geq 1$. For any $\delta,\varepsilon\in(0,1]$ and $\widehat{f}\in\mathcal{F}(L,\bm{p},\tau,F)$, the following oracle inequality holds:

R_{n}(\widehat{f},f_{0})\lesssim(1+\varepsilon)\left(\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\|\tilde{f}-f_{0}\|^{2}_{\infty}+\zeta_{n,\varepsilon,\delta}+\Delta_{n}(\widehat{f})\right),

where

\zeta_{n,\varepsilon,\delta}\asymp\frac{1}{\varepsilon}\left[\delta\Big(n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big)+\frac{\tau}{n}\left(\log(L/\delta)+L\log\tau\right)\big(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1\big)\right],

and $\Gamma_{n}=[\gamma(\bm{s}_{i},\bm{s}_{i^{\prime}})]_{1\leq i,i^{\prime}\leq n}$.

The convergence rate in Theorem 1 is determined by three components. The first component, $\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\|\tilde{f}-f_{0}\|^{2}_{\infty}$, measures the distance between the neural network class $\mathcal{F}(L,\bm{p},\tau,F)$ and $f_{0}$, i.e., the approximation error. The second term, $\zeta_{n,\varepsilon,\delta}$, pertains to the estimation error, and $\Delta_{n}(\widehat{f})$ arises from the difference between $\widehat{f}$ and the oracle neural network estimator $\widehat{f}_{\textrm{global}}$. It is worth noting that the upper bound in Theorem 1 does not depend on the network architecture parameter $\bm{p}$, i.e., the width of the network, because the network is sparse and its “actual” width is controlled by the sparsity parameter $\tau$. To see this, note that after removing all inactive neurons, it is straightforward to show that $\mathcal{F}(L,\bm{p},\tau,F)=\mathcal{F}(L,(p_{0},p_{1}\wedge\tau,\ldots,p_{L}\wedge\tau,p_{L+1}),\tau,F)$ (Schmidt-Hieber, 2020).

Next, we turn to the consistency of the DNN estimator $\widehat{f}_{\textrm{local}}$ for $f_{0}\in\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$. In nonparametric regression, the convergence rate of estimation is heavily affected by the smoothness of the function. Consider the class of composite functions $\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$. Let $\beta_{i}^{*}=\beta_{i}\prod_{s=i+1}^{L_{*}}(\beta_{s}\wedge 1)$ for $i=0,\ldots,L_{*}$ and $i^{*}=\mathop{\mathrm{argmin}}_{0\leq i\leq L_{*}}\beta_{i}^{*}/\tilde{r}_{i}$, with the convention $\prod_{s=L_{*}+1}^{L_{*}}(\beta_{s}\wedge 1)=1$. Then $\beta^{*}=\beta_{i^{*}}^{*}$ and $r^{*}=\tilde{r}_{i^{*}}$ are known as the intrinsic smoothness and intrinsic dimension of $f\in\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$. Similar definitions can be found in Kurisu, (2019); Nakada and Imaizumi, (2020); Cloninger and Klock, (2021); Kohler et al., (2022); Wang et al., (2023). These quantities play an important role in controlling the convergence rate of the estimator. To better understand $\beta_{i}^{*}$ and $i^{*}$, consider the composite function from the $i$th to the last layer, i.e., $\bm{h}_{i}(\bm{z})=\bm{g}_{L_{*}}\circ\ldots\circ\bm{g}_{i+1}\circ\bm{g}_{i}(\bm{z}):[a_{i},b_{i}]^{r_{i}}\to\mathbb{R}$; then $\beta_{i}^{*}$ can be viewed as the smoothness of $\bm{h}_{i}$, and $i^{*}$ is the layer with the least smoothness after rescaling by the respective number of “active” variables $\tilde{r}_{i}$, $i=0,\ldots,L_{*}$. The following theorem establishes the consistency of $\widehat{f}_{\textrm{local}}$ as an estimator of $f_{0}$ and its convergence rate in the presence of spatial dependence.
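For concreteness, the small helper below computes $\beta^{*}$ and $r^{*}$ from $(\bm{\beta},\bm{\tilde{r}})$ exactly as defined above; the numerical values in the example call (Example 1 with hypothetical $\beta_{h}=2$, $\beta_{\varphi}=1.5$, $d=5$) are arbitrary illustrations.

```python
import numpy as np

def intrinsic_smoothness(beta, r_tilde):
    """Return (beta*, r*): beta_i* = beta_i * prod_{s>i} min(beta_s, 1) and
    i* = argmin_i beta_i* / r~_i, following the definitions in the text."""
    beta = np.asarray(beta, dtype=float)
    r_tilde = np.asarray(r_tilde, dtype=float)
    beta_star = np.array([beta[i] * np.prod(np.minimum(beta[i + 1:], 1.0))
                          for i in range(len(beta))])
    i_star = int(np.argmin(beta_star / r_tilde))
    return beta_star[i_star], r_tilde[i_star]

# Example 1 (generalized additive model): beta = (beta_h, inf, beta_phi), r~ = (d, d, 1).
print(intrinsic_smoothness([2.0, np.inf, 1.5], [5, 5, 1]))   # -> (2.0, 5.0)
```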

Theorem 2.

Suppose Assumption 1 is satisfied, i.e., $f_{0}\in\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$. Let $\widehat{f}_{\textrm{local}}\in\mathcal{F}(L,\bm{p},\tau,F)$ be an estimator of $f_{0}$. Further assume that $F\geq\max_{i=0,\ldots,L_{*}}(C_{i},1)$, $N\doteq\min_{i=1,\ldots,L}p_{i}\geq 6\eta\max_{i=0,\ldots,L_{*}}(\beta_{i}+1)^{\tilde{r}_{i}}\vee(\tilde{C}_{i}+1)e^{\tilde{r}_{i}}$, where $\eta=\max_{i=0,\ldots,L_{*}}(r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil))$, and $\tau\lesssim LN$. Then we have

R_{n}(\widehat{f}_{\textrm{local}},f_{0})\lesssim\varsigma_{n},

where

\varsigma_{n}\asymp(N2^{-L})^{2\prod_{l=1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{2\beta^{*}}{r^{*}}}+\frac{(\operatorname{tr}(\Gamma_{n}^{2})+n)(LN\log(Ln^{2})+L^{2}N\log(LN))}{n^{2}}+\Delta_{n}(\widehat{f}_{\textrm{local}}),

and $\tilde{C}_{i}$ are constants depending only on $\bm{C},\bm{a},\bm{b}$, and $\Gamma_{n}=[\gamma(\bm{s}_{i},\bm{s}_{i^{\prime}})]_{1\leq i,i^{\prime}\leq n}$.

The consistency of $\widehat{f}_{\textrm{local}}$ can be achieved by, for instance, letting $L\asymp\log(n)$, $N\asymp n^{\frac{r^{*}}{2\beta^{*}+r^{*}}}$, $\operatorname{tr}(\Gamma_{n}^{2})=o(n^{\frac{4\beta^{*}+r^{*}}{2\beta^{*}+r^{*}}}(\log n)^{-3})$, and $\Delta_{n}(\widehat{f}_{\textrm{local}})=o(1)$, in which case $\varsigma_{n}\asymp n^{-\frac{2\beta^{*}}{2\beta^{*}+r^{*}}}(\log n)^{3}+\Delta_{n}(\widehat{f}_{\textrm{local}})=o(1)$. As expected, the rate of convergence is affected by the intrinsic smoothness and intrinsic dimension of $\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$, the architecture of the neural network class $\mathcal{F}(L,\bm{p},\tau,F)$, and the magnitude of the spatial dependence.

4 Simulation Study

In this section, we evaluate the finite-sample performance of the proposed DNN estimator through a set of simulation studies. Two different simulation designs are considered, and for each design we generate 100 independent data sets. In both settings, we use the same neural network architecture, with length $L=3$ and width $N=30$. Additionally, to prevent overfitting and control the sparsity of the network, we apply dropout with a probability of 0.2.

In the first design, the spatial domain of interest $\mathcal{S}$ is a subset of $\mathbb{R}$. To be specific, we generate data from the following model

y(s_{i})=f_{0}(\bm{x}(s_{i}))+e_{1}(s_{i})+e_{2}(s_{i}),\quad s_{i}\in[0,D],~i=1,2,\ldots,n,

and

\bm{x}(s_{i})=(x_{1}(s_{i}),\ldots,x_{5}(s_{i}))^{\top}=\left(s_{i}/D,\ \sin(10s_{i}/D),\ (s_{i}/D)^{2},\ \exp(3s_{i}/D),\ (s_{i}/D+1)^{-1}\right)^{\top}\in\mathbb{R}^{5},

with the true mean function $f_{0}(\bm{x}(s_{i}))=\sum_{j=1}^{5}x_{j}(s_{i})$. The small-scale spatial variation $e_{1}(\cdot)$ is a zero-mean stationary and isotropic Gaussian process with exponential covariance function $\gamma(s_{i},s_{j})=\exp(-|s_{i}-s_{j}|/\rho)$ and range parameter $\rho=0.1,0.5,1$. The measurement error $e_{2}(\cdot)$ follows a standard normal distribution and is independent of $e_{1}(\cdot)$. It is worth mentioning that the covariates are location dependent.
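A minimal numpy sketch of one replicate of this design, assuming equally spaced locations and drawing the Gaussian process directly from its covariance matrix (feasible here because $n$ is small), is given below.

```python
import numpy as np

def simulate_design1(n=100, D=1.0, rho=0.5, seed=0):
    """One replicate of Simulation Design 1: location-dependent covariates, a
    zero-mean GP e1 with exponential covariance, and N(0,1) measurement error e2."""
    rng = np.random.default_rng(seed)
    s = np.linspace(0.0, D, n)                               # equally spaced locations
    u = s / D
    x = np.column_stack([u, np.sin(10 * u), u ** 2, np.exp(3 * u), 1.0 / (u + 1.0)])
    f0 = x.sum(axis=1)                                       # true mean function
    Gamma = np.exp(-np.abs(s[:, None] - s[None, :]) / rho)   # exponential covariance
    e1 = rng.multivariate_normal(np.zeros(n), Gamma)         # small-scale spatial variation
    e2 = rng.standard_normal(n)                              # measurement error
    return x, f0 + e1 + e2, f0

x, y, f0 = simulate_design1(n=100, D=1.0, rho=0.5)
```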

We consider two different spatial domains: a fixed domain and an expanding domain. For the fixed domain, $\mathcal{S}=[0,1]$ is a fixed interval, i.e., $D=1$, whereas for the expanding domain, the spatial domain $\mathcal{S}=[0,D]$ increases with the sample size $n$. The $n$ observations are equally spaced over the region. In both scenarios, we let $n=100,200,300$, and accordingly, $D=10,20,30$ in the expanding-domain case.

(a) Fixed domain: $\mathcal{S}=[0,1]$.
(b) Expanding domain: $\mathcal{S}=[0,10]$.
Figure 2: The estimated mean function and $95\%$ pointwise simulation intervals using our method in Simulation Design 1 with $n=100$ and $\rho=0.5$. In both plots, the solid line is the true mean function, and the two dashed lines are the $95\%$ pointwise simulation intervals. The grey lines are the estimated mean functions from each replication.

In the second design, the mean function is defined on $\mathbb{R}^{2}$ and given by

f_{0}(\bm{x}(\bm{s}_{i}))=\beta_{1}x_{1}(\bm{s}_{i})x_{2}(\bm{s}_{i})+\beta_{2}x_{2}(\bm{s}_{i})^{2}\sin(x_{3}(\bm{s}_{i}))+\beta_{3}\exp(x_{4}(\bm{s}_{i}))\max(x_{5}(\bm{s}_{i}),0)+\frac{\beta_{4}}{\mathop{\mathrm{sign}}(x_{4}(\bm{s}_{i}))(10+x_{5}(\bm{s}_{i}))}+\beta_{5}\tanh(x_{1}(\bm{s}_{i})),\quad\bm{s}_{i}\in[0,D]^{2},~i=1,\ldots,n,   (7)

where the coefficients $\beta_{j}$, $j=1,\ldots,5$, are drawn from $U(1,2)$. The covariates at each location are generated from standard normal distributions with a cross-covariate correlation of 0.5, and the covariates at different locations are assumed to be independent. We further normalize each covariate to have zero mean and unit variance. The mean function $f_{0}$ is nonlinear, featuring interactions among the covariates. We simulate $y(\bm{s}_{i})$ according to (1) with $e_{1}(\bm{s}_{i})$ and $e_{2}(\bm{s}_{i})$ similar to those in Design 1. That is, $e_{1}(\bm{s}_{i})$ is a zero-mean stationary and isotropic Gaussian process on $\mathbb{R}^{2}$ with exponential covariance function $\gamma(\bm{s}_{i},\bm{s}_{j})=\exp(-|\bm{s}_{i}-\bm{s}_{j}|/\rho)$, $\rho=0.1,0.5,1$, and $e_{2}(\bm{s}_{i})\sim N(0,1)$.
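A short sketch of the Design 2 mean function (7) is given below; for brevity the covariates are drawn as independent standard normals, i.e., the 0.5 cross-covariate correlation of the actual design is ignored in this illustration.

```python
import numpy as np

def f0_design2(x, beta):
    """Design 2 mean function (7); x is an n x 5 matrix of covariates and
    beta contains the five coefficients drawn from U(1, 2)."""
    x1, x2, x3, x4, x5 = x.T
    return (beta[0] * x1 * x2
            + beta[1] * x2 ** 2 * np.sin(x3)
            + beta[2] * np.exp(x4) * np.maximum(x5, 0.0)
            + beta[3] / (np.sign(x4) * (10.0 + x5))
            + beta[4] * np.tanh(x1))

rng = np.random.default_rng(1)
beta = rng.uniform(1.0, 2.0, size=5)
x = rng.standard_normal((100, 5))      # simplified: independent covariates for illustration
print(f0_design2(x, beta)[:3])
```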

Similar to the first design, we consider two types of spatial domain: a fixed domain, i.e., $D=1$, and an expanding domain, i.e., $D=10,20,30$. In both cases, we take $n=100,400,900$, and all locations are equally spaced over $[0,D]^{2}$.

4.1 Estimating $f_{0}$ via Other Methods

We also compare the proposed DNN estimator with several estimators from the literature. The first estimator of $f_{0}$ is based on the Gaussian process-based spatially varying coefficient model (GP-SVC), which is given by

y(\bm{s})=\beta_{1}(\bm{s})x_{1}(\bm{s})+\ldots+\beta_{p}(\bm{s})x_{p}(\bm{s})+\epsilon,\quad\epsilon\sim\mathcal{N}(0,\tau^{2}),\quad\bm{s}\in\mathcal{S},

where the spatially varying coefficient $\beta_{j}(\cdot)$ is the sum of a fixed effect and a random effect. That is, $\beta_{j}(\bm{s})=\mu_{j}+\eta_{j}(\bm{s})$, where $\mu_{j}$ is a non-random fixed effect and $\eta_{j}(\cdot)$ is a zero-mean Gaussian process with an isotropic covariance function $c(\cdot;\bm{\theta}_{j})$. In this work, we use the popular Matérn covariance function defined as

c\left(r;\rho,\nu,\sigma^{2}\right)=\sigma^{2}\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\frac{r}{\rho}\right)^{\nu}K_{\nu}\left(\sqrt{2\nu}\frac{r}{\rho}\right),

where $\rho>0$ is the range parameter, $\nu>0$ is the smoothness parameter, and $K_{\nu}(\cdot)$ is the modified Bessel function of the second kind with order $\nu$.
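A small scipy-based sketch of this Matérn covariance, using illustrative parameter values, is shown below; the case $r=0$ is handled separately since $c(0)=\sigma^{2}$.

```python
import numpy as np
from scipy.special import gamma, kv   # gamma function and modified Bessel K_nu

def matern(r, rho=1.0, nu=1.5, sigma2=1.0):
    """Matérn covariance c(r; rho, nu, sigma^2) as defined in the text."""
    r = np.asarray(r, dtype=float)
    c = np.full_like(r, sigma2)                  # c(0) = sigma^2
    pos = r > 0
    z = np.sqrt(2.0 * nu) * r[pos] / rho
    c[pos] = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * z ** nu * kv(nu, z)
    return c

print(matern(np.array([0.0, 0.5, 1.0]), rho=0.5, nu=1.5))
```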

The second estimator is the Nadaraya-Watson (N-W) kernel estimator for spatially dependent data discussed in Robinson, (2011), which considers the following spatial regression model

y(\bm{s}_{i})=f_{0}(\bm{x}(\bm{s}_{i}))+\sigma(\bm{x}(\bm{s}_{i}))V_{i},\quad V_{i}=\sum_{j=1}^{\infty}a_{ij}\epsilon_{j},\quad i=1,\ldots,n,

where $f_{0}(\bm{x}):[0,1]^{d}\rightarrow\mathbb{R}$ and $\sigma(\bm{x}):[0,1]^{d}\rightarrow[0,\infty)$ are the mean and variance functions, respectively, the $\epsilon_{j}$ are independent random variables with zero mean and unit variance, and $\sum_{j=1}^{\infty}a_{ij}^{2}=1$. Robinson, (2011) introduces the following Nadaraya-Watson kernel estimator of $f_{0}$:

\hat{f}(\bm{x})=\frac{\hat{\nu}(\bm{x})}{\widehat{g}(\bm{x})},

where

\widehat{g}(\bm{x})=\frac{1}{nh_{n}^{d}}\sum_{i=1}^{n}K_{i}(\bm{x}),\quad\hat{\nu}(\bm{x})=\frac{1}{nh_{n}^{d}}\sum_{i=1}^{n}y_{i}K_{i}(\bm{x}),

with

K_{i}(\bm{x})=K\left(\frac{\bm{x}-\bm{x}(\bm{s}_{i})}{h_{n}}\right),

and $h_{n}$ is a scalar, positive bandwidth sequence satisfying $h_{n}\rightarrow 0$ as $n\rightarrow\infty$.
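A compact vectorized sketch of this estimator is given below; the Gaussian product kernel is an illustrative choice of $K$, and the common factor $1/(nh_{n}^{d})$ cancels in the ratio $\hat{\nu}/\widehat{g}$, so it is omitted.

```python
import numpy as np

def nw_estimate(x0, x, y, h):
    """Nadaraya-Watson estimate of f0 at the m points x0 (m x d), given training
    covariates x (n x d), responses y (length n), and bandwidth h."""
    u = (x0[:, None, :] - x[None, :, :]) / h          # scaled differences, shape (m, n, d)
    K = np.exp(-0.5 * np.sum(u ** 2, axis=2))         # kernel weights K_i(x), up to constants
    num = K @ y                                       # proportional to hat{nu}(x)
    den = K.sum(axis=1)                               # proportional to hat{g}(x)
    return num / den
```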

The third estimator of $f_{0}$ is based on the generalized additive model (GAM) mentioned in Example 1. That is, we assume that

f_{0}(\bm{x}(\bm{s}))=\Psi\left(\sum_{j=1}^{d}g_{j}(x_{j}(\bm{s}))\right),

where $g_{j}(\cdot):[0,1]\rightarrow\mathbb{R}$ and $\Psi(\cdot):\mathbb{R}\rightarrow\mathbb{R}$ are smooth functions. In this model, spatial dependence is not assumed. The fourth estimator we consider is the linear regression (LR) model, which assumes a linear relationship between the response variable and the predictor variables, without explicitly incorporating spatial dependence.

4.2 Simulation Results

To evaluate the performance of each method, we generate an additional $m=n/10$ observations at new locations, treated as a test set. Similar to Chu et al., (2014), we adopt the mean squared estimation error (MSEE) and mean squared prediction error (MSPE) to evaluate estimation and prediction performance, where MSEE and MSPE are defined as

\textrm{MSEE}=m^{-1}\sum_{i=1}^{m}\big(\widehat{f}(\bm{x}(s_{i}))-f_{0}(\bm{x}(s_{i}))\big)^{2},\quad\textrm{MSPE}=m^{-1}\sum_{i=1}^{m}\big(\widehat{f}(\bm{x}(s_{i}))-y(s_{i})\big)^{2},

and $\widehat{f}(\bm{x}(s_{i}))$ is an estimator of $f_{0}(\bm{x}(s_{i}))$. The mean and standard deviation of MSEE and MSPE over the 100 independent replicates are summarized in Tables 1–4.
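These two criteria translate directly into code; the trivial sketch below evaluates them on a held-out test set.

```python
import numpy as np

def msee_mspe(f_hat, f0_test, y_test):
    """MSEE and MSPE on the m held-out locations; all inputs are length-m arrays
    of fitted values, true mean values, and observed responses, respectively."""
    msee = np.mean((f_hat - f0_test) ** 2)
    mspe = np.mean((f_hat - y_test) ** 2)
    return msee, mspe
```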

Tables 1 and 2 pertain to Simulation Design 1, for the fixed and expanding domains, respectively. For each combination of the sample size $n$ and the spatial dependence parameter $\rho$, we highlight in boldface the estimator that yields the smallest MSEE and MSPE. Overall, the GAM, GP-SVC, N-W, and LR methods perform similarly to each other. The proposed DNN estimator produces smaller estimation and prediction errors than the others in all cases except when $n=200$ and $\rho=0.1$ in the fixed-domain case, where GAM yields the smallest MSPE of 1.26; the MSPE produced by DNN is, however, close. Although spatial dependence has an adverse impact on performance, both estimation error and prediction error decrease as $n$ increases (and $D$ increases in the expanding-domain case), as expected.

We depict in Figure 2 the estimated mean functions $\widehat{f}(\bm{x}(s))$ obtained by our method with $n=100$ and $\rho=0.5$ over the 100 replications, along with the $95\%$ pointwise intervals for both fixed and expanding domains. Here, the $95\%$ pointwise intervals are defined as

\left(2^{-1}\big(\widehat{f}_{(2)}(\bm{x}(s_{i}))+\widehat{f}_{(3)}(\bm{x}(s_{i}))\big),\ 2^{-1}\big(\widehat{f}_{(97)}(\bm{x}(s_{i}))+\widehat{f}_{(98)}(\bm{x}(s_{i}))\big)\right),\quad i=1,2,\ldots,n,

where $\widehat{f}_{(k)}(\bm{x}(s_{i}))$ is the $k$th smallest value of $\{\widehat{f}_{[j]}(\bm{x}(s_{i})):j=1,\ldots,100\}$, and $\widehat{f}_{[j]}(\bm{x}(s_{i}))$ is the estimator of $f_{0}(\bm{x}(s_{i}))$ from the $j$th replicate.
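The interval construction above is a simple order-statistic computation; a sketch assuming exactly 100 replicates (rows) stored at each location (columns) is:

```python
import numpy as np

def pointwise_interval(f_hat_reps):
    """95% pointwise simulation interval at each location from 100 replicate
    estimates, averaging the 2nd/3rd and 97th/98th order statistics as above."""
    f_sorted = np.sort(f_hat_reps, axis=0)          # sort over replicates at each location
    lower = 0.5 * (f_sorted[1] + f_sorted[2])       # (f_(2) + f_(3)) / 2
    upper = 0.5 * (f_sorted[96] + f_sorted[97])     # (f_(97) + f_(98)) / 2
    return lower, upper
```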

Tables 3 and 4 report the results for Simulation Design 2. For both fixed and expanding domains, our method performs the best among the five methods, and N-W comes next. This is mainly because LR, GAM, and GP-SVC impose linear or additive structure on $f_{0}$ and cannot capture the complex interactions and nonlinear structure in $f_{0}$.

Table 1: Results of Simulation Design 1 with fixed domain: the averaged MSEE and MSPE over 100 replicates (with standard deviations in parentheses) of various methods with different $n$ and $\rho$.
Fixed domain $\rho=0.1$ $\rho=0.5$ $\rho=1$
$n$ MSEE MSPE MSEE MSPE MSEE MSPE
GAM 0.92 (0.58) 1.40 (0.69) 0.98 (0.78) 1.19 (0.40) 1.05 (0.87) 1.17 (0.38)
$n=100$ GP-SVC 0.87 (0.49) 1.37 (0.67) 0.92 (0.75) 1.15 (0.35) 1.01 (0.82) 1.13 (0.36)
N-W 0.89 (0.52) 1.38 (0.67) 0.94 (0.76) 1.16 (0.38) 1.03 (0.85) 1.15 (0.37)
LR 0.93 (0.59) 1.40 (0.69) 0.99 (0.80) 1.20 (0.39) 1.06 (0.86) 1.18 (0.38)
DNN 0.78 (0.41) 1.32 (0.64) 0.82 (0.71) 1.13 (0.35) 0.94 (0.79) 1.10 (0.33)
GAM 0.87 (0.50) 1.26 (0.30) 0.93 (0.72) 1.12 (0.27) 0.99 (0.77) 1.08 (0.24)
$n=200$ GP-SVC 0.81 (0.42) 1.34 (0.37) 0.88 (0.74) 1.09 (0.29) 0.95 (0.77) 1.09 (0.28)
N-W 0.84 (0.38) 1.33 (0.41) 0.91 (0.73) 1.10 (0.28) 0.93 (0.75) 1.06 (0.25)
LR 0.86 (0.50) 1.28 (0.32) 0.95 (0.74) 1.14 (0.28) 0.98 (0.76) 1.07 (0.24)
DNN 0.69 (0.32) 1.27 (0.39) 0.71 (0.66) 1.06 (0.26) 0.78 (0.68) 1.04 (0.29)
GAM 0.83 (0.47) 1.19 (0.44) 0.88 (0.68) 1.09 (0.20) 0.96 (0.66) 1.05 (0.19)
$n=300$ GP-SVC 0.77 (0.38) 1.15 (0.37) 0.86 (0.71) 1.06 (0.21) 0.91 (0.64) 1.05 (0.18)
N-W 0.80 (0.36) 1.13 (0.40) 0.88 (0.70) 1.07 (0.24) 0.92 (0.66) 1.06 (0.21)
LR 0.82 (0.46) 1.17 (0.42) 0.87 (0.67) 1.10 (0.22) 0.98 (0.69) 1.06 (0.20)
DNN 0.58 (0.27) 1.07 (0.34) 0.63 (0.55) 1.01 (0.22) 0.69 (0.55) 1.01 (0.25)
Table 2: Results of Simulation Design 1 with expanding domain (i.e., $D=10,20,30$): the averaged MSEE and MSPE over 100 replicates (with standard deviations in parentheses) of various methods with different $n$ and $\rho$.
Expanding domain $\rho=0.1$ $\rho=0.5$ $\rho=1$
$n$ MSEE MSPE MSEE MSPE MSEE MSPE
GAM 0.35 (0.23) 2.06 (0.67) 0.82 (0.38) 1.60 (0.59) 0.98 (0.59) 1.42 (0.31)
$n=100,D=10$ GP-SVC 0.33 (0.22) 2.01 (0.61) 0.78 (0.36) 1.53 (0.55) 0.94 (0.54) 1.37 (0.29)
N-W 0.38 (0.26) 2.03 (0.65) 0.81 (0.40) 1.57 (0.58) 0.96 (0.57) 1.39 (0.29)
LR 0.34 (0.22) 2.05 (0.66) 0.81 (0.36) 1.58 (0.58) 0.97 (0.58) 1.41 (0.30)
DNN 0.26 (0.19) 1.93 (0.55) 0.64 (0.37) 1.44 (0.55) 0.76 (0.44) 1.22 (0.26)
GAM 0.21 (0.14) 1.91 (0.58) 0.66 (0.34) 1.51 (0.36) 0.85 (0.44) 1.39 (0.39)
$n=200,D=20$ GP-SVC 0.18 (0.14) 1.89 (0.54) 0.61 (0.33) 1.48 (0.39) 0.81 (0.40) 1.36 (0.41)
N-W 0.20 (0.17) 1.93 (0.61) 0.63 (0.36) 1.47 (0.37) 0.88 (0.48) 1.40 (0.43)
LR 0.20 (0.13) 1.90 (0.56) 0.67 (0.35) 1.52 (0.35) 0.86 (0.45) 1.39 (0.40)
DNN 0.14 (0.11) 1.82 (0.55) 0.43 (0.28) 1.33 (0.32) 0.61 (0.37) 1.29 (0.38)
GAM 0.11 (0.07) 1.72 (0.39) 0.51 (0.29) 1.40 (0.31) 0.70 (0.31) 1.30 (0.26)
$n=300,D=30$ GP-SVC 0.13 (0.10) 1.76 (0.41) 0.56 (0.33) 1.43 (0.36) 0.74 (0.37) 1.34 (0.30)
N-W 0.13 (0.07) 1.77 (0.44) 0.53 (0.30) 1.44 (0.39) 0.69 (0.33) 1.29 (0.27)
LR 0.12 (0.07) 1.74 (0.40) 0.52 (0.31) 1.42 (0.33) 0.72 (0.33) 1.32 (0.28)
DNN 0.07 (0.09) 1.63 (0.38) 0.31 (0.23) 1.22 (0.23) 0.43 (0.31) 1.10 (0.19)
Table 3: Results of Simulation Design 2 with fixed domain: the averaged MSEE and MSPE over 100 replicates (with standard deviations in parentheses) of various methods with different $n$ and $\rho$.
Fixed domain $\rho=0.1$ $\rho=0.5$ $\rho=1$
$n$ MSEE MSPE MSEE MSPE MSEE MSPE
GAM 0.87 (0.92) 2.81 (1.19) 1.10 (2.40) 2.40 (1.53) 1.23 (1.06) 2.16 (1.10)
$n=100$ GP-SVC 0.93 (0.99) 2.93 (1.23) 1.23 (2.54) 2.59 (1.67) 1.53 (1.33) 2.37 (1.26)
N-W 0.73 (0.81) 2.70 (1.15) 1.01 (2.16) 2.30 (1.38) 1.09 (1.01) 2.09 (0.98)
LR 1.16 (0.99) 3.02 (1.25) 1.46 (2.62) 2.82 (1.76) 1.88 (1.44) 2.66 (1.39)
DNN 0.51 (0.60) 2.27 (1.09) 0.74 (1.11) 1.90 (1.02) 0.83 (1.19) 1.99 (1.17)
GAM 0.44 (0.52) 2.38 (0.58) 0.75 (0.65) 1.89 (0.42) 0.92 (0.92) 1.69 (0.47)
$n=400$ GP-SVC 0.50 (0.59) 2.43 (0.66) 0.82 (0.77) 1.94 (0.47) 1.00 (0.96) 1.80 (0.53)
N-W 0.39 (0.44) 1.99 (0.58) 0.68 (0.70) 1.73 (0.41) 0.83 (0.84) 1.56 (0.41)
LR 0.66 (0.52) 2.67 (0.71) 0.95 (0.79) 2.05 (0.55) 1.11 (0.97) 1.88 (0.57)
DNN 0.22 (0.39) 1.87 (0.49) 0.54 (0.61) 1.57 (0.37) 0.68 (0.71) 1.41 (0.37)
GAM 0.31 (0.40) 2.25 (0.53) 0.59 (0.53) 1.81 (0.36) 0.80 (0.79) 1.65 (0.36)
$n=900$ GP-SVC 0.38 (0.44) 2.29 (0.58) 0.66 (0.60) 1.88 (0.39) 0.88 (0.83) 1.73 (0.40)
N-W 0.25 (0.34) 1.86 (0.49) 0.51 (0.46) 1.70 (0.31) 0.71 (0.72) 1.52 (0.34)
LR 0.49 (0.48) 2.33 (0.66) 0.88 (0.73) 1.95 (0.44) 0.98 (0.86) 1.85 (0.42)
DNN 0.19 (0.27) 1.70 (0.42) 0.28 (0.33) 1.49 (0.29) 0.57 (0.59) 1.33 (0.34)
Table 4: Results of Simulation Design 2 with expanding domain ($D=10,20,30$): the averaged MSEE and MSPE over 100 replicates (with standard deviations in parentheses) of various methods with different $n$ and $\rho$.
Expanding domain $\rho=0.1$ $\rho=0.5$ $\rho=1$
$n$ MSEE MSPE MSEE MSPE MSEE MSPE
GAM 0.75 (0.81) 2.88 (1.04) 0.81 (0.65) 2.72 (0.94) 0.90 (0.76) 2.65 (0.88)
$n=100,D=10$ GP-SVC 0.84 (0.88) 2.93 (1.11) 0.89 (0.90) 2.80 (0.97) 0.96 (0.83) 2.71 (0.91)
N-W 0.66 (0.70) 2.71 (1.00) 0.71 (0.62) 2.66 (0.90) 0.82 (0.71) 2.59 (0.81)
LR 0.97 (0.91) 2.95 (1.14) 1.06 (0.92) 2.92 (0.98) 1.15 (0.86) 2.75 (0.95)
DNN 0.44 (0.49) 2.15 (1.02) 0.60 (1.03) 1.81 (0.99) 0.69 (1.07) 1.76 (0.94)
GAM 0.32 (0.25) 2.49 (0.44) 0.35 (0.24) 2.39 (0.46) 0.40 (0.33) 2.32 (0.51)
$n=400,D=20$ GP-SVC 0.40 (0.31) 2.55 (0.48) 0.49 (0.29) 2.44 (0.50) 0.54 (0.37) 2.40 (0.55)
N-W 0.28 (0.23) 2.35 (0.41) 0.31 (0.20) 2.33 (0.41) 0.35 (0.30) 2.27 (0.47)
LR 0.56 (0.35) 2.60 (0.54) 0.63 (0.33) 2.49 (0.49) 0.71 (0.41) 2.46 (0.53)
DNN 0.18 (0.20) 2.20 (0.38) 0.24 (0.21) 2.29 (0.38) 0.29 (0.24) 2.20 (0.41)
GAM 0.24 (0.29) 2.28 (0.33) 0.27 (0.16) 2.25 (0.27) 0.31 (0.17) 2.26 (0.25)
$n=900,D=30$ GP-SVC 0.31 (0.32) 2.31 (0.35) 0.37 (0.21) 2.17 (0.25) 0.41 (0.20) 2.29 (0.28)
N-W 0.21 (0.30) 1.82 (0.46) 0.26 (0.21) 1.61 (0.28) 0.29 (0.16) 1.50 (0.30)
LR 0.42 (0.35) 2.37 (0.38) 0.53 (0.24) 2.33 (0.29) 0.62 (0.23) 2.30 (0.31)
DNN 0.16 (0.22) 1.71 (0.30) 0.19 (0.20) 1.58 (0.25) 0.23 (0.15) 1.45 (0.28)

5 Data Example

In this section, we apply the proposed DNN method to analyze the California housing data, which are publicly available at https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html. After removing missing values, the dataset contains housing price information from $n=20433$ block groups in California from the 1990 census, where a block group on average includes 1425.5 individuals living in a geographically compact area. To be specific, the dataset comprises median house values and six covariates of interest: the median age of a house, the total number of rooms, the total number of bedrooms, the population, the total number of households, and the median income for households. Figure 3 displays the histograms of the six covariates, from which one can observe that all covariates except the median age of a house are right skewed. Thus, we first apply a logarithm transform to these five covariates and then use min-max normalization to rescale all six covariates so that the data lie within the range $[0,1]$.
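The preprocessing just described amounts to a few pandas operations, sketched below; the file name and column labels are hypothetical placeholders and should be matched to the actual data file.

```python
import numpy as np
import pandas as pd

# Hypothetical column names for the five right-skewed covariates and the sixth covariate.
skewed = ["total_rooms", "total_bedrooms", "population", "households", "median_income"]
covariates = skewed + ["housing_median_age"]

df = pd.read_csv("cal_housing.csv").dropna()      # drop block groups with missing values
df[skewed] = np.log(df[skewed])                   # logarithm transform of the skewed covariates
df[covariates] = (df[covariates] - df[covariates].min()) / (
    df[covariates].max() - df[covariates].min())  # min-max normalization to [0, 1]
```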

Figure 4 shows the spatial distribution of the five log-transformed covariates (i.e., the total number of rooms, the total number of bedrooms, population, the total number of households, and the median income for households) and the median age of a house. We also depict in Figure 5 (top panel) the map of the median house values in California. The data exhibit a clear geographical pattern. Home values in the coastal region, especially the San Francisco Bay Area and the South Coast, are higher than in other regions. Areas of high home values tend to be associated with high household income, dense population, large home size, and large households, which are clustered in the coastal region and the Central Valley. Our objective is to explore the intricate relationship between the median house value and the six covariates while taking into account their spatial autocorrelation.

Figure 3: Histograms of six covariates in California housing data example.
Figure 4: The map of six covariates in California housing data example.
Figure 5: The top panel is the map of 20433 observations and the corresponding median house value in California housing data example. The bottom panel is the estimated median house value.

As in the simulation study, we estimate the mean function $f_{0}(\cdot)$ via the five competing methods: DNN, GAM, GP-SVC, N-W, and LR. We use the same neural network architecture as in the simulation study, i.e., length $L=3$ and width $N=30$, and the dropout rate is set to 0.2 to avoid overfitting. To assess performance, we compute the out-of-sample prediction error measured by MSPE based on 10-fold cross-validation, and the results are summarized in Table 5. Consistent with the observations in the simulation study, the proposed DNN method yields much more accurate predictions than the others. The bottom panel of Figure 5 shows the estimated median house values from the DNN estimator, which exhibit a geographical pattern similar to the observations.

Table 5: Summary of the mean squared prediction error in California housing data example.
Methods GAM GP-SVC N-W LR DNN
MSPE ($\times 10^{4}$) 4.74 4.23 4.05 5.32 3.41

6 Conclusion

In this study, we have developed regression analysis for spatially dependent data using deep learning. By carefully accounting for spatial autocorrelation, our estimator is shown to be consistent. An interesting direction for future work concerns large-scale datasets. While the present study demonstrates the effectiveness of our approach through simulations, its full potential lies in applications to real-world data. An illustrative instance is the remote sensing data analyzed in Ma et al., (2022), which bring together a rich collection of spatially referenced variables with complex underlying relationships. Applying our model to such large and diverse datasets holds promise for uncovering hidden patterns and improving understanding of the underlying processes.

Furthermore, the scope of our methodology extends beyond predictive modeling to a variety of applications. As discussed earlier, interpolation and gap-filling problems in remote sensing, as well as assessments of the relative importance of covariates in ecological studies, stand to benefit from our model. Extending the framework to spatio-temporal prediction is a further natural direction, allowing for more intricate, dynamic, and temporally evolving phenomena. For a more comprehensive review, see Wikle et al., (2019).

In conclusion, regression analysis for spatially dependent data remains an active and challenging area of research. As larger datasets and more diverse problem domains continue to emerge, the proposed methodology contributes to the advancement of statistical methods for spatial analysis.

References

  • Bahdanau et al., (2014) Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Bauer and Kohler, (2019) Bauer, B. and Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist., 47(4):2261–2285.
  • Chu et al., (2014) Chu, T., Wang, H., and Zhu, J. (2014). On semiparametric inference of geostatistical models via local karhunen–loève expansion. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):817–832.
  • Cliff and Ord, (1981) Cliff, A. D. and Ord, J. K. (1981). Spatial processes: models & applications. Taylor & Francis.
  • Cloninger and Klock, (2021) Cloninger, A. and Klock, T. (2021). A deep network construction that adapts to intrinsic dimensionality beyond the domain. Neural Networks, 141:404–419.
  • Cressie and Johannesson, (2008) Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226.
  • Fan and Zhang, (1999) Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficient models. The annals of Statistics, 27(5):1491–1518.
  • Farrell et al., (2021) Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1):181–213.
  • Hastie and Tibshirani, (1986) Hastie, T. and Tibshirani, R. (1986). Generalized Additive Models. Statistical Science, 1(3):297 – 310.
  • Hastie and Tibshirani, (1993) Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Methodological), 55(4):757–779.
  • He et al., (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  • Jenish, (2012) Jenish, N. (2012). Nonparametric spatial regression under near-epoch dependence. Journal of Econometrics, 167(1):224–239.
  • Jiménez-Luna et al., (2020) Jiménez-Luna, J., Grisoni, F., and Schneider, G. (2020). Drug discovery with explainable artificial intelligence. Nature Machine Intelligence, 2(10):573–584.
  • Kim and Wang, (2021) Kim, M. and Wang, L. (2021). Generalized spatially varying coefficient models. Journal of Computational and Graphical Statistics, 30(1):1–10.
  • Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kirkwood et al., (2022) Kirkwood, C., Economou, T., Pugeault, N., and Odbert, H. (2022). Bayesian deep learning for spatial interpolation in the presence of auxiliary information. Mathematical Geosciences, pages 1–25.
  • Kohler et al., (2022) Kohler, M., Krzyżak, A., and Langer, S. (2022). Estimation of a function of low local dimensionality by deep neural networks. IEEE transactions on information theory, 68(6):4032–4042.
  • Kohler and Langer, (2021) Kohler, M. and Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249.
  • Kurisu, (2019) Kurisu, D. (2019). On nonparametric inference for spatial regression models under domain expanding and infill asymptotics. Statistics & Probability Letters, 154:108543.
  • Kurisu, (2022) Kurisu, D. (2022). Nonparametric regression for locally stationary random fields under stochastic sampling design. Bernoulli, 28(2):1250–1275.
  • Lee, (2002) Lee, L.-F. (2002). Consistency and efficiency of least squares estimation for mixed regressive, spatial autoregressive models. Econometric theory, 18(2):252–277.
  • Lee, (2004) Lee, L.-F. (2004). Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica, 72(6):1899–1925.
  • Lemhadri et al., (2021) Lemhadri, I., Ruan, F., and Tibshirani, R. (2021). Lassonet: Neural networks with feature sparsity. In International Conference on Artificial Intelligence and Statistics, pages 10–18. PMLR.
  • Li, (2022) Li, K. (2022). Variable selection for nonlinear cox regression model via deep learning. arXiv preprint arXiv:2211.09287.
  • Li et al., (2021) Li, K., Wang, F., Liu, R., Yang, F., and Shang, Z. (2021). Calibrating multi-dimensional complex ode from noisy data via deep neural networks. arXiv preprint arXiv:2106.03591.
  • Li et al., (2022) Li, K., Wang, F., and Yang, L. (2022). Deep feature screening: Feature selection for ultra high-dimensional data via deep neural networks. arXiv preprint arXiv:2204.01682.
  • Li, (2016) Li, L. (2016). Nonparametric regression on random fields with random design using wavelet method. Statistical Inference for Stochastic Processes, 19(1):51–69.
  • Liu et al., (2022) Liu, R., Boukai, B., and Shang, Z. (2022). Optimal nonparametric inference via deep neural network. Journal of Mathematical Analysis and Applications, 505(2):125561.
  • Liu et al., (2020) Liu, R., Shang, Z., and Cheng, G. (2020). On deep instrumental variables estimate.
  • Liu et al., (2021) Liu, R., Yuan, M., and Shang, Z. (2021). Online statistical inference for parameters estimation with linear-equality constraints. arXiv preprint arXiv:2105.10315.
  • Lu et al., (2020) Lu, J., Shen, Z., Yang, H., and Zhang, S. (2020). Deep network approximation for smooth functions.
  • Lu and Tjøstheim, (2014) Lu, Z. and Tjøstheim, D. (2014). Nonparametric estimation of probability density functions for irregularly observed spatial data. Journal of the American Statistical Association, 109(508):1546–1564.
  • Ma et al., (2022) Ma, T. F., Wang, F., Zhu, J., Ives, A. R., and Lewińska, K. E. (2022). Scalable semiparametric spatio-temporal regression for large data analysis. Journal of Agricultural, Biological and Environmental Statistics, pages 1–20.
  • Mateu and Jalilian, (2022) Mateu, J. and Jalilian, A. (2022). Spatial point processes and neural networks: A convenient couple. Spatial Statistics, page 100644.
  • Mu et al., (2018) Mu, J., Wang, G., and Wang, L. (2018). Estimation and inference in spatially varying coefficient models. Environmetrics, 29(1):e2485.
  • Nakada and Imaizumi, (2020) Nakada, R. and Imaizumi, M. (2020). Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. The Journal of Machine Learning Research, 21(1):7018–7055.
  • Robbins and Monro, (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.
  • Robinson, (2011) Robinson, P. M. (2011). Asymptotic theory for nonparametric regression with spatial data. Journal of Econometrics, 165(1):5–19.
  • Scardapane et al., (2017) Scardapane, S., Comminiello, D., Hussain, A., and Uncini, A. (2017). Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89.
  • Schmidt-Hieber, (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875–1897.
  • Su and Zhu, (2018) Su, W. J. and Zhu, Y. (2018). Uncertainty quantification for online learning and stochastic approximation via hierarchical incremental gradient descent. arXiv preprint arXiv:1802.04876.
  • Sun et al., (2012) Sun, Y., Li, B., and Genton, M. G. (2012). Geostatistics for large datasets. In Advances and challenges in space-time modelling of natural events, pages 55–77. Springer.
  • Wang et al., (2023) Wang, S., Cao, G., Shang, Z., and the Alzheimer's Disease Neuroimaging Initiative (2023). Deep neural network classifier for multi-dimensional functional data. Scandinavian Journal of Statistics.
  • Wikle et al., (2019) Wikle, C. K., Zammit-Mangion, A., and Cressie, N. (2019). Spatio-temporal statistics with R. CRC Press.
  • Yarotsky, (2017) Yarotsky, D. (2017). Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114.
  • Zammit-Mangion et al., (2021) Zammit-Mangion, A., Ng, T. L. J., Vu, Q., and Filippone, M. (2021). Deep compositional spatial models. Journal of the American Statistical Association, pages 1–22.
  • Zammit-Mangion and Wikle, (2020) Zammit-Mangion, A. and Wikle, C. K. (2020). Deep integro-difference equation models for spatio-temporal forecasting. Spatial Statistics, 37:100408.

7 Appendix

7.1 Notation and Definition

In this paper, all vectors are column vectors unless otherwise stated. Let $\|\bm{v}\|_{2}^{2}=\bm{v}^{\top}\bm{v}$ for any vector $\bm{v}$, and let $\|f\|_{2}=\sqrt{\int f(x)^{2}dx}$ be the $L_{2}$ norm of a real-valued function $f(x)$. For two positive sequences $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}\lesssim b_{n}$ if there exists a positive constant $c$ such that $a_{n}\leq cb_{n}$ for all $n$, and $a_{n}\asymp b_{n}$ if $c^{-1}a_{n}\leq b_{n}\leq ca_{n}$ for some constant $c>1$ and all sufficiently large $n$. Suppose that $\bm{x}=(x_{1},\ldots,x_{d})^{\top}$ is a $d$-dimensional vector. Let $|\bm{x}|=(|x_{1}|,\ldots,|x_{d}|)^{\top}$, $|\bm{x}|_{\infty}=\max_{i=1,\ldots,d}|x_{i}|$, and $|\bm{x}|_{0}=\sum_{i=1}^{d}\mathbbm{1}(x_{i}\neq 0)$. For two $d$-dimensional vectors $\bm{x}$ and $\bm{y}$, we write $\bm{x}\lesssim\bm{y}$ if $x_{i}\lesssim y_{i}$ for $i=1,\ldots,d$. Let $\lfloor x\rfloor$ be the largest integer less than $x$ and $\lceil x\rceil$ be the smallest integer greater than $x$. For a matrix $A=(a_{ij})$, let $\|A\|_{\infty}=\max_{ij}|a_{ij}|$ be the max norm of $A$ and $\|A\|_{0}$ be the number of non-zero entries of $A$. Define $\|f\|_{\infty}$ as the sup-norm of a real-valued function $f$. We use $a\wedge b$ to denote the minimum of two numbers $a$ and $b$, and $a\vee b$ the maximum.
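For readers who want to see these norms concretely, the following minimal NumPy sketch (added for illustration only; it is not part of the manuscript) evaluates the quantities defined above on a toy vector and matrix:

import numpy as np

x = np.array([1.0, -3.0, 0.0, 2.0])
A = np.array([[0.0, -4.0], [2.5, 0.0]])

l2_sq  = x @ x                    # ||x||_2^2 = x^T x, here 14.0
x_inf  = np.max(np.abs(x))        # |x|_inf = max_i |x_i|, here 3.0
x_zero = np.count_nonzero(x)      # |x|_0 = number of non-zero entries, here 3
A_inf  = np.max(np.abs(A))        # ||A||_inf = max_{ij} |a_ij| (max norm), here 4.0
A_zero = np.count_nonzero(A)      # ||A||_0 = number of non-zero entries, here 2
print(l2_sq, x_inf, x_zero, A_inf, A_zero)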

Definition 1 (Hölder smoothness).

A function $g:\mathbb{R}^{r_{0}}\to\mathbb{R}$ is said to be $(\beta,C)$-Hölder smooth for some positive constants $\beta$ and $C$ if, for every $\bm{\gamma}=(\gamma_{1},\ldots,\gamma_{r_{0}})\in\mathbb{N}^{r_{0}}$, the following two conditions hold:

\displaystyle\sup_{\bm{x}\in\mathbb{R}^{r_{0}}}\bigg{|}\frac{\partial^{\kappa}g}{\partial x_{1}^{\gamma_{1}}\ldots\partial x_{r_{0}}^{\gamma_{r_{0}}}}(\bm{x})\bigg{|}\leq C,\quad\textrm{ for }\kappa\leq\lfloor\beta\rfloor,

and

\displaystyle\bigg{|}\frac{\partial^{\kappa}g}{\partial x_{1}^{\gamma_{1}}\ldots\partial x_{r_{0}}^{\gamma_{r_{0}}}}(\bm{x})-\frac{\partial^{\kappa}g}{\partial x_{1}^{\gamma_{1}}\ldots\partial x_{r_{0}}^{\gamma_{r_{0}}}}(\widetilde{\bm{x}})\bigg{|}\leq C\|\bm{x}-\widetilde{\bm{x}}\|_{2}^{\beta-\lfloor\beta\rfloor},\quad\textrm{ for }\kappa=\lfloor\beta\rfloor\textrm{, }\bm{x},\widetilde{\bm{x}}\in\mathbb{R}^{r_{0}},

where $\kappa=\sum_{i=1}^{r_{0}}\gamma_{i}$. Moreover, we say $g$ is $(\infty,C)$-Hölder smooth if $g$ is $(\beta,C)$-Hölder smooth for all $\beta>0$.
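As a concrete illustration (an added example, not used elsewhere in the paper), take $r_{0}=1$, $g(x)=\sin(x)$, and $\beta=3/2$, so that $\lfloor\beta\rfloor=1$. Then

\sup_{x\in\mathbb{R}}|g(x)|\leq 1,\qquad\sup_{x\in\mathbb{R}}|g^{\prime}(x)|\leq 1,\qquad|g^{\prime}(x)-g^{\prime}(\widetilde{x})|\leq\min(2,|x-\widetilde{x}|)\leq 2|x-\widetilde{x}|^{1/2},

so $g$ is $(3/2,2)$-Hölder smooth; since the same argument works for every $\beta>0$, $g$ is in fact $(\infty,2)$-Hölder smooth.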

7.2 Proof of Theorem 1

The proof of Theorem 1 requires a preliminary lemma. First, we define a $\delta$-cover of a function space $\mathcal{F}$ as a set $\widetilde{\mathcal{F}}\subset\mathcal{F}$ satisfying the following property: for any $f\in\mathcal{F}$, there exists an $\widetilde{f}\in\widetilde{\mathcal{F}}$ such that $\|\widetilde{f}-f\|_{\infty}\leq\delta$. Next, we define the $\delta$-covering number of $\mathcal{F}$ as

\mathcal{N}(\delta,\mathcal{F},\|\cdot\|_{\infty})\doteq\min\{|\widetilde{\mathcal{F}}|:\widetilde{\mathcal{F}}\textrm{ is a $\delta$-cover of $\mathcal{F}$}\},

where $|\mathcal{A}|$ denotes the number of distinct elements in a set $\mathcal{A}$. In other words, $\mathcal{N}(\delta,\mathcal{F},\|\cdot\|_{\infty})$ is the minimal number of $\|\cdot\|_{\infty}$-balls of radius $\delta$ needed to cover $\mathcal{F}$.
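As an elementary example of these definitions (added for illustration only), consider the class of constant functions $\mathcal{F}_{c}=\{f\equiv c:c\in[-F,F]\}$ on $[0,1]^{d}$. The grid $\{-F+\delta,-F+3\delta,-F+5\delta,\ldots\}$ is a $\delta$-cover of $\mathcal{F}_{c}$, so

\mathcal{N}(\delta,\mathcal{F}_{c},\|\cdot\|_{\infty})\leq F/\delta+1,\qquad\log\mathcal{N}(\delta,\mathcal{F}_{c},\|\cdot\|_{\infty})\lesssim\log(F/\delta),

which mirrors the logarithmic dependence on $\delta^{-1}$ of the covering-number bound for the network class $\mathcal{F}(L,\bm{p},\tau,F)$ used in the proof of Theorem 1 below.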

Lemma 1.

Suppose that $f_{0}$ is the unknown true mean function in (1). Let $\mathcal{F}$ be a function class such that $\{f_{0}\}\cup\mathcal{F}\subset\{f:[0,1]^{d}\rightarrow[-F,F]\}$ for some $F\geq 1$. Then for all $\delta,\varepsilon\in(0,1]$ and $\widehat{f}\in\mathcal{F}$, the following inequality holds:

\displaystyle R_{n}(\widehat{f},f_{0})\leq(1+\varepsilon)\left(\inf_{\tilde{f}\in\mathcal{F}}R_{n}(\tilde{f},f_{0})+\Delta_{n}(\widehat{f})+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big{)}\right)
+(1+\varepsilon)\frac{2F^{2}}{n\varepsilon}(3\log\mathcal{N}+1)(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1),

where $\mathcal{N}=\mathcal{N}(\delta,\mathcal{F},\|\cdot\|_{\infty})$.

Proof.

Let $\Delta_{n}=\Delta_{n}(\widehat{f})$. For any fixed $f\in\mathcal{F}$, we have $\bm{E}_{f_{0}}\Big{[}n^{-1}\sum_{i=1}^{n}(y(\bm{s}_{i})-f(\bm{x}(\bm{s}_{i})))^{2}-\inf_{\tilde{f}\in\mathcal{F}}n^{-1}\sum_{i=1}^{n}(y(\bm{s}_{i})-\tilde{f}(\bm{x}(\bm{s}_{i})))^{2}\Big{]}\geq 0$. Therefore,

\displaystyle\bm{E}_{f_{0}}\Big{[}\frac{1}{n}\sum_{i=1}^{n}(y(\bm{s}_{i})-\widehat{f}(\bm{x}(\bm{s}_{i})))^{2}\Big{]}
\displaystyle\leq\bm{E}_{f_{0}}\Big{[}\frac{1}{n}\sum_{i=1}^{n}(y(\bm{s}_{i})-\widehat{f}(\bm{x}(\bm{s}_{i})))^{2}+\frac{1}{n}\sum_{i=1}^{n}(y(\bm{s}_{i})-f(\bm{x}(\bm{s}_{i})))^{2}-\inf_{\tilde{f}\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}(y(\bm{s}_{i})-\tilde{f}(\bm{x}(\bm{s}_{i})))^{2}\Big{]}
\displaystyle=\bm{E}_{f_{0}}\Big{[}\frac{1}{n}\sum_{i=1}^{n}(y(\bm{s}_{i})-{f}(\bm{x}(\bm{s}_{i})))^{2}\Big{]}+\Delta_{n}.

Let $\epsilon_{i}=e_{1}(\bm{s}_{i})+e_{2}(\bm{s}_{i})$. Furthermore, we have

\displaystyle R_{n}(\widehat{f},f_{0})=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}^{2}\big{]}
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-y(\bm{s}_{i})+y(\bm{s}_{i})-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}^{2}\big{]}
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-y(\bm{s}_{i})\big{)}^{2}+\epsilon_{i}^{2}+2\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-y(\bm{s}_{i})\big{)}\epsilon_{i}\big{]}
\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\Big{[}(y(\bm{s}_{i})-{f}(\bm{x}(\bm{s}_{i})))^{2}-\epsilon_{i}^{2}+2\widehat{f}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\Big{]}+\Delta_{n}
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}(y(\bm{s}_{i})-f_{0}(\bm{x}(\bm{s}_{i}))+f_{0}(\bm{x}(\bm{s}_{i}))-{f}(\bm{x}(\bm{s}_{i})))^{2}-\epsilon_{i}^{2}+2\widehat{f}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\big{]}+\Delta_{n}
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}(f_{0}(\bm{x}(\bm{s}_{i}))-{f}(\bm{x}(\bm{s}_{i})))^{2}+2\widehat{f}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\big{]}+\Delta_{n}. (8)

Next, we find an upper bound for $\bm{E}_{f_{0}}\big{[}\frac{2}{n}\sum_{i=1}^{n}\widehat{f}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\big{]}$. By the definition of the $\delta$-cover of the function space $\mathcal{F}$ and the $\delta$-covering number, we denote the centers of the balls by $f_{j}$, $j=1,2,\ldots,\mathcal{N}$, and there exists an $f_{j*}$ such that $\|\widehat{f}-f_{j*}\|_{\infty}\leq\delta$. Together with the fact that $\bm{E}\big{[}f_{0}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\big{]}=0$, we have

\displaystyle\bm{E}\big{[}\frac{2}{n}\sum_{i=1}^{n}\widehat{f}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\big{]}
\displaystyle=\bm{E}\big{[}\frac{2}{n}\sum_{i=1}^{n}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-f_{j*}(\bm{x}(\bm{s}_{i}))+f_{j*}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}\epsilon_{i}\big{]}
\displaystyle\leq\bm{E}\Big{|}\frac{2}{n}\sum_{i=1}^{n}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-f_{j*}(\bm{x}(\bm{s}_{i}))\big{)}\epsilon_{i}\Big{|}+\bm{E}\Big{|}\frac{2}{n}\sum_{i=1}^{n}\big{(}f_{j*}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}\epsilon_{i}\Big{|}
\displaystyle\leq\frac{2\delta}{n}\bm{E}\Big{[}\sum_{i=1}^{n}\big{|}\epsilon_{i}\big{|}\Big{]}+\frac{2}{n}\bm{E}\Big{|}\sum_{i=1}^{n}\big{(}f_{j*}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}\epsilon_{i}\Big{|}
\displaystyle\doteq T_{1}+T_{2}. (9)

It is easy to see that $T_{1}\leq 2\delta(n^{-1}\operatorname{tr}\Gamma_{n}+\sigma)$. For the second term $T_{2}$, notice that, conditionally on $\{\bm{x}(\bm{s}_{1}),\ldots,\bm{x}(\bm{s}_{n})\}$,

\eta_{j}\doteq\frac{\sum_{i=1}^{n}\big{(}f_{j}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}\epsilon_{i}}{\sqrt{\bm{a}_{j}^{\top}\Gamma_{n}\bm{a}_{j}+n\sigma^{2}\|f_{j}-f_{0}\|_{n}^{2}}}

follows $\mathcal{N}(0,1)$, where $\bm{a}_{j}=(f_{j}(\bm{x}(\bm{s}_{1}))-f_{0}(\bm{x}(\bm{s}_{1})),\ldots,f_{j}(\bm{x}(\bm{s}_{n}))-f_{0}(\bm{x}(\bm{s}_{n})))^{\top}$ and $\|f\|_{n}^{2}=n^{-1}\sum_{i=1}^{n}f(\bm{x}(\bm{s}_{i}))^{2}$. From Lemma C.1 of Schmidt-Hieber, (2020), $\bm{E}_{f_{0}}\big{[}\max_{j=1,\ldots,\mathcal{N}}\eta_{j}^{2}\,|\,\bm{x}(\bm{s}_{1}),\ldots,\bm{x}(\bm{s}_{n})\big{]}\leq 3\log\mathcal{N}+1$. Consequently,

\displaystyle T_{2}=\frac{2}{n}\bm{E}\Big{|}\eta_{j*}\sqrt{\bm{a}_{j*}^{\top}\Gamma_{n}\bm{a}_{j*}+n\sigma^{2}\|f_{j*}-f_{0}\|_{n}^{2}}\Big{|}
\displaystyle\leq\frac{2}{n}\bm{E}\Big{(}|\eta_{j*}|\sqrt{(\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2})\|f_{j*}-f_{0}\|_{n}^{2}}\Big{)}
\displaystyle\leq\frac{2\sqrt{\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2}}}{n}\bm{E}\Big{(}|\eta_{j*}|(\|\hat{f}-f_{0}\|_{n}+\delta)\Big{)}
\displaystyle\leq\frac{2\sqrt{\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2}}}{n}\sqrt{3\log\mathcal{N}+1}\Big{(}\sqrt{R_{n}(\widehat{f},f_{0})}+\delta\Big{)}.

Together with (8) and (9), we have

\displaystyle R_{n}(\widehat{f},f_{0})\leq R_{n}(f,f_{0})+\Delta_{n}+2\delta(n^{-1}\operatorname{tr}\Gamma_{n}+\sigma)+\frac{2\sqrt{\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2}}}{n}\sqrt{3\log\mathcal{N}+1}\Big{(}\sqrt{R_{n}(\widehat{f},f_{0})}+\delta\Big{)}.

If $\log\mathcal{N}\leq n$, then

\displaystyle R_{n}(\widehat{f},f_{0})\leq R_{n}(f,f_{0})+\Delta_{n}+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+\sigma+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}}\Big{)}
+\frac{2\sqrt{\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2}}}{n}\sqrt{3\log\mathcal{N}+1}\sqrt{R_{n}(\widehat{f},f_{0})}.

Applying the inequality (43) in Schmidt-Hieber, (2020), we have, for any $0<\varepsilon\leq 1$,

\displaystyle R_{n}(\widehat{f},f_{0})\leq(1+\varepsilon)\left(R_{n}(f,f_{0})+\Delta_{n}+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+\sigma+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}}\Big{)}\right)
+\frac{(1+\varepsilon)^{2}}{\varepsilon}\frac{1}{n^{2}}(3\log\mathcal{N}+1)(\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2})
\displaystyle\leq(1+\varepsilon)\left(R_{n}(f,f_{0})+\Delta_{n}+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big{)}\right)
+(1+\varepsilon)\frac{2F^{2}}{n\varepsilon}(3\log\mathcal{N}+1)(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1).

For $\log\mathcal{N}>n$, $R_{n}(\widehat{f},f_{0})=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}^{2}\big{]}\leq 4F^{2}$ and

\displaystyle(1+\varepsilon)\left(R_{n}(f,f_{0})+\Delta_{n}+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big{)}\right)
+(1+\varepsilon)\frac{2F^{2}}{n\varepsilon}(3\log\mathcal{N}+1)(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1)
\displaystyle>\frac{2F^{2}}{n}(3n+1)>6F^{2}.

Thus,

\displaystyle R_{n}(\widehat{f},f_{0})\leq(1+\varepsilon)\left(R_{n}(f,f_{0})+\Delta_{n}+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big{)}\right)
+(1+\varepsilon)\frac{2F^{2}}{n\varepsilon}(3\log\mathcal{N}+1)(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1).

Since the above inequality holds for any $f\in\mathcal{F}$, the result follows by letting $f=\mathop{\mathrm{arginf}}_{\tilde{f}\in\mathcal{F}}R_{n}(\tilde{f},f_{0})$. ∎
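The maximal inequality $\bm{E}[\max_{j=1,\ldots,\mathcal{N}}\eta_{j}^{2}]\leq 3\log\mathcal{N}+1$ borrowed from Lemma C.1 of Schmidt-Hieber, (2020) in the proof above can be checked empirically. The following minimal Python sketch is illustrative only; it draws independent standard normals, whereas the $\eta_{j}$ above may be correlated, so it is a sanity check rather than a verification:

import numpy as np

# Monte Carlo check of E[max_j eta_j^2] <= 3*log(N) + 1 for standard normal eta_j.
rng = np.random.default_rng(0)
for N in [10, 100, 1000]:
    draws = rng.standard_normal((5000, N))      # 5000 replications of (eta_1, ..., eta_N)
    lhs = np.mean(np.max(draws**2, axis=1))     # empirical E[max_j eta_j^2]
    rhs = 3 * np.log(N) + 1                     # bound used in the proof of Lemma 1
    print(f"N={N:5d}  E[max eta^2] approx {lhs:6.2f}  3*log(N)+1 = {rhs:6.2f}")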

Proof of Theorem 1: It follows from Lemma 5 and Remark 1 of Schmidt-Hieber, (2020) that

\displaystyle\log\mathcal{N}=\log\mathcal{N}(\delta,\mathcal{F}(L,\bm{p},\tau,F),\|\cdot\|_{\infty})\leq(1+\tau)\log(2^{5+2L}\delta^{-1}(L+1)\tau^{2L}d^{2}).

Because $F\geq 1$ and $0<\varepsilon\leq 1$, we have

\displaystyle R_{n}(\widehat{f},f_{0})\lesssim(1+\varepsilon)\left(\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\|\tilde{f}-f_{0}\|^{2}_{\infty}+\Delta_{n}(\widehat{f})+\varsigma_{n,\varepsilon,\delta}\right),

where

\displaystyle\varsigma_{n,\varepsilon,\delta}\asymp\frac{1}{\varepsilon}\left[\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big{)}+\frac{\tau}{n}\left(\log(L/\delta)+L\log\tau\right)(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1)\right].

\square
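To get a sense of the size of the entropy term $\log\mathcal{N}$ that drives $\varsigma_{n,\varepsilon,\delta}$, the covering-number bound above can be evaluated directly. The following minimal Python sketch uses purely illustrative values of $(n,L,\tau,d)$ (they are not taken from the paper) and sets $\delta=n^{-2}$ as later chosen in the proof of Theorem 2:

import numpy as np

# Evaluate log N <= (1 + tau) * log(2^(5+2L) * delta^(-1) * (L+1) * tau^(2L) * d^2)
# on the log scale to avoid overflow; all architecture values below are hypothetical.
def log_covering_bound(L, tau, d, delta):
    return (1 + tau) * ((5 + 2 * L) * np.log(2) - np.log(delta)
                        + np.log(L + 1) + 2 * L * np.log(tau) + 2 * np.log(d))

for n in [1000, 10000]:
    for L, tau, d in [(5, 200, 10), (10, 1000, 10)]:
        delta = n ** (-2.0)
        print(f"n={n:6d} L={L:2d} tau={tau:5d} d={d:2d}  "
              f"bound on log N = {log_covering_bound(L, tau, d, delta):.3e}")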

7.3 Proof of Theorem 2

Lemma 2.

For any $f:\mathbb{R}^{d}\to\mathbb{R}$ with $f\in\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$, any $m\in\mathbb{Z}^{+}$, and any $N\geq\max_{i=0,\ldots,L_{*}}(\beta_{i}+1)^{\tilde{r}_{i}}\vee(\tilde{C}_{i}+1)e^{\tilde{r}_{i}}$, there exists a neural network

f_{*}\in\mathcal{F}(L,(d,6\eta N,\ldots,6\eta N,1),\sum_{i=0}^{L_{*}}r_{i+1}(\tau_{i}+4),\infty),

such that

\|f_{*}-f\|_{\infty}\leq C_{L_{*}}\prod_{l=0}^{L_{*}-1}(2C_{l})^{\beta_{l+1}}\sum_{i=0}^{L_{*}}\left((2\tilde{C}_{i}+1)(1+\tilde{r}^{2}_{i}+\beta_{i}^{2})6^{\tilde{r}_{i}}N2^{-m}+\tilde{C}_{i}3^{\beta_{i}}N^{-\frac{\beta_{i}}{\tilde{r}_{i}}}\right)^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1},

where

\displaystyle\tilde{C}_{i}=\sum_{k=0}^{i}C_{k}\frac{b_{k}-a_{k}}{b_{k+1}-a_{k+1}},~{}i=0,\ldots,L_{*}-1,\quad\tilde{C}_{L_{*}}=\sum_{k=0}^{L_{*}}C_{k}\frac{b_{k}-a_{k}}{b_{k+1}-a_{k+1}}+b_{L_{*}}-a_{L_{*}},
\displaystyle L=3L_{*}+\sum_{i=0}^{L_{*}}L_{i},\quad\text{with }L_{i}=8+(m+5)(1+\lceil\log_{2}(\tilde{r}_{i}\vee\beta_{i})\rceil),
\displaystyle\tau_{i}\leq 141(\tilde{r}_{i}+\beta_{i}+1)^{3+\tilde{r}_{i}}N(m+6),~{}i=0,\ldots,L_{*},
\displaystyle\eta=\max_{i=0,\ldots,L_{*}}(r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil)).
Proof.

By definition, we write $f(\bm{z})$ as

f(\bm{z})=\bm{g}_{L_{*}}\circ\ldots\circ\bm{g}_{1}\circ\bm{g}_{0}(\bm{z}),\quad\quad\textrm{ for }\bm{z}\in[a_{0},b_{0}]^{r_{0}},

where $\bm{g}_{i}=(g_{i,1},\ldots,g_{i,r_{i+1}})^{\top}:[a_{i},b_{i}]^{r_{i}}\to[a_{i+1},b_{i+1}]^{r_{i+1}}$ for some $|a_{i}|,|b_{i}|\leq C_{i}$, the functions $g_{i,j}:[a_{i},b_{i}]^{\tilde{r}_{i}}\to[a_{i+1},b_{i+1}]$ are $(\beta_{i},C_{i})$-Hölder smooth, and $r_{L_{*}+1}=1$. For $i=0,\ldots,L_{*}-1$, the domain and range of $\bm{g}_{i}$ are $[a_{i},b_{i}]^{r_{i}}$ and $[a_{i+1},b_{i+1}]^{r_{i+1}}$, respectively. First, we rewrite $f$ as a composition of functions $\bm{h}_{i}:=(h_{i,1},\ldots,h_{i,r_{i+1}})^{\top}$, constructed via linear transformations, whose domain and range are $[0,1]^{r_{i}}$ and $[0,1]^{r_{i+1}}$. That is, we define

\displaystyle\bm{h}_{i}(\bm{z}):=\frac{\bm{g}_{i}((b_{i}-a_{i})\bm{z}+a_{i})-a_{i+1}}{b_{i+1}-a_{i+1}},\quad\quad\textrm{ for }\bm{z}\in[0,1]^{r_{i}},~{}i=0,\ldots,L_{*}-1,
\displaystyle\bm{h}_{L_{*}}(\bm{z}):=\bm{g}_{L_{*}}((b_{L_{*}}-a_{L_{*}})\bm{z}+a_{L_{*}}),\quad\quad\textrm{ for }\bm{z}\in[0,1]^{r_{L_{*}}}.

Therefore the following equality holds

f(\bm{z})=\bm{g}_{L_{*}}\circ\ldots\circ\bm{g}_{1}\circ\bm{g}_{0}(\bm{z})=\bm{h}_{L_{*}}\circ\ldots\circ\bm{h}_{1}\circ\bm{h}_{0}(\frac{\bm{z}-a_{0}}{b_{0}-a_{0}}),\quad\quad\textrm{ for }\bm{z}\in[a_{0},b_{0}]^{r_{0}}.

Since the $g_{i,j}:[a_{i},b_{i}]^{\tilde{r}_{i}}\to[a_{i+1},b_{i+1}]$ are all $(\beta_{i},C_{i})$-Hölder smooth, it follows that the $h_{i,j}:[0,1]^{\tilde{r}_{i}}\to[0,1]$ are all $(\beta_{i},\tilde{C}_{i})$-Hölder smooth as well, where $\tilde{C}_{i}$ is a constant depending only on $\bm{a}$, $\bm{b}$, and $\bm{C}$; that is, $\tilde{C}_{i}=\sum_{k=0}^{i}C_{k}\frac{b_{k}-a_{k}}{b_{k+1}-a_{k+1}}$ for $i=0,\ldots,L_{*}-1$, and $\tilde{C}_{L_{*}}=\sum_{k=0}^{L_{*}}C_{k}\frac{b_{k}-a_{k}}{b_{k+1}-a_{k+1}}+b_{L_{*}}-a_{L_{*}}$.

By Theorem 5 of Schmidt-Hieber, (2020), for any integer $m\geq 1$ and $N\geq\max_{i=0,\ldots,L_{*}}(\beta_{i}+1)^{\tilde{r}_{i}}\vee(\tilde{C}_{i}+1)e^{\tilde{r}_{i}}$, there exists a network

\tilde{h}_{i,j}\in\mathcal{F}(L_{i},(\tilde{r}_{i},6(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,\ldots,6(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,1),\tau_{i},\infty),

with $L_{i}=8+(m+5)(1+\lceil\log_{2}(\tilde{r}_{i}\vee\beta_{i})\rceil)$ and $\tau_{i}\leq 141(\tilde{r}_{i}+\beta_{i}+1)^{3+\tilde{r}_{i}}N(m+6)$, such that

\|\tilde{h}_{i,j}-h_{i,j}\|_{\infty}\leq(2\tilde{C}_{i}+1)(1+\tilde{r}^{2}_{i}+\beta_{i}^{2})6^{\tilde{r}_{i}}N2^{-m}+\tilde{C}_{i}3^{\beta_{i}}N^{-\frac{\beta_{i}}{\tilde{r}_{i}}}.

Note that $\tilde{h}_{i,j}$ takes values in $(-\infty,\infty)$. We therefore define $h^{*}_{i,j}:=\sigma(-\sigma(-\tilde{h}_{i,j}+1)+1)$, appending two layers of the form $\sigma(1-x)$ to restrict $h^{*}_{i,j}$ to the interval $[0,1]$, where $\sigma(x)=\max(0,x)$. This adds two layers and four parameters to the network. Since $h_{i,j}\in[0,1]$, we have $h^{*}_{i,j}\in\mathcal{F}(L_{i}+2,(\tilde{r}_{i},6(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,\ldots,6(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,1),\tau_{i}+4,\infty)$ and

\|h^{*}_{i,j}-h_{i,j}\|_{\infty}\leq\|\tilde{h}_{i,j}-h_{i,j}\|_{\infty}\leq(2\tilde{C}_{i}+1)(1+\tilde{r}^{2}_{i}+\beta_{i}^{2})6^{\tilde{r}_{i}}N2^{-m}+\tilde{C}_{i}3^{\beta_{i}}N^{-\frac{\beta_{i}}{\tilde{r}_{i}}}.
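The two extra layers $\sigma(-\sigma(-\tilde{h}_{i,j}+1)+1)$ simply clip the network output to $[0,1]$, i.e., they compute $\min(\max(x,0),1)$ with ReLU units. A minimal NumPy sketch (added for illustration; not part of the manuscript's code) makes this explicit:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def clip_via_relu(x):
    # sigma(-sigma(-x + 1) + 1) = min(max(x, 0), 1): two extra layers, four extra parameters
    return relu(-relu(-x + 1.0) + 1.0)

x = np.array([-2.0, -0.3, 0.0, 0.4, 1.0, 3.5])
assert np.allclose(clip_via_relu(x), np.clip(x, 0.0, 1.0))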

We further parallelize all $(h^{*}_{i,j})_{j=1,\ldots,r_{i+1}}$, obtaining $\bm{h}^{*}_{i}:=(h^{*}_{i,1},\ldots,h^{*}_{i,r_{i+1}})^{\top}\in\mathcal{F}(L_{i}+2,(r_{i},6r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,\ldots,6r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,r_{i+1}),r_{i+1}(\tau_{i}+4),\infty)$. Moreover, we construct the composite network $f_{*}:=\bm{h}^{*}_{L_{*}}\circ\ldots\circ\bm{h}^{*}_{1}\circ\bm{h}^{*}_{0}\in\mathcal{F}(3L_{*}+\sum_{i=0}^{L_{*}}L_{i},(r_{0},6\eta N,\ldots,6\eta N,1),\sum_{i=0}^{L_{*}}r_{i+1}(\tau_{i}+4),\infty)$, where $\eta=\max_{i=0,\ldots,L_{*}}(r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil))$.

By Lemma 3 in Schmidt-Hieber, (2020),

\displaystyle\|f-f_{*}\|_{\infty}=\|\bm{h}_{L_{*}}\circ\ldots\circ\bm{h}_{1}\circ\bm{h}_{0}-\bm{h}^{*}_{L_{*}}\circ\ldots\circ\bm{h}^{*}_{1}\circ\bm{h}^{*}_{0}\|_{\infty}
\displaystyle\leq C_{L_{*}}\prod_{l=0}^{L_{*}-1}(2C_{l})^{\beta_{l+1}}\sum_{i=0}^{L_{*}}\||\bm{h}_{i}-\bm{h}^{*}_{i}|_{\infty}\|_{\infty}^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}
\displaystyle\leq C_{L_{*}}\prod_{l=0}^{L_{*}-1}(2C_{l})^{\beta_{l+1}}\sum_{i=0}^{L_{*}}\left((2\tilde{C}_{i}+1)(1+\tilde{r}^{2}_{i}+\beta_{i}^{2})6^{\tilde{r}_{i}}N2^{-m}+\tilde{C}_{i}3^{\beta_{i}}N^{-\frac{\beta_{i}}{\tilde{r}_{i}}}\right)^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}
\displaystyle\leq C_{L_{*}}\prod_{l=0}^{L_{*}-1}(2C_{l})^{\beta_{l+1}}\sum_{i=0}^{L_{*}}((2\tilde{C}_{i}+1)(1+\tilde{r}^{2}_{i}+\beta_{i}^{2})6^{\tilde{r}_{i}}N2^{-m})^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}
+C_{L_{*}}\prod_{l=0}^{L_{*}-1}(2C_{l})^{\beta_{l+1}}\sum_{i=0}^{L_{*}}(\tilde{C}_{i}3^{\beta_{i}}N^{-\frac{\beta_{i}}{\tilde{r}_{i}}})^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}.
∎

Proof of Theorem 2: By Theorem 1 with $\delta=n^{-2}$ and $\varepsilon=1$, it follows that

R_{n}(\widehat{f}_{\textrm{local}},f_{0})\lesssim\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\|\tilde{f}-f_{0}\|^{2}_{\infty}+\frac{(\operatorname{tr}(\Gamma_{n}^{2})+n)\tau(\log(Ln^{2})+L\log{\tau})}{n^{2}}+\Delta_{n}(\widehat{f}_{\textrm{local}}).

Next, we analyze the first term. Since $f_{0}\in\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$, by Lemma 2, for any $m>0$, there exists a neural network

f_{*}\in\mathcal{F}(L,(d,N,\ldots,N,1),\tau,\infty),

with $L\asymp m$, $N\geq 6\eta\max_{i=0,\ldots,L_{*}}(\beta_{i}+1)^{\tilde{r}_{i}}\vee(\tilde{C}_{i}+1)e^{\tilde{r}_{i}}$, $\eta=\max_{i=0,\ldots,L_{*}}(r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil))$, and $\tau\lesssim mN$, such that

\displaystyle\|f_{*}-f_{0}\|_{\infty}\lesssim\sum_{i=0}^{L_{*}}(N2^{-m})^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}+(N^{-\frac{\beta_{i}}{\tilde{r}_{i}}})^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}
\displaystyle\lesssim\sum_{i=0}^{L_{*}}(N2^{-m})^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{\beta_{i}^{*}}{\tilde{r}_{i}}}
\displaystyle\lesssim(N2^{-m})^{\prod_{l=1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{\beta^{*}}{r^{*}}}, (10)

where we recall that $\beta^{*}=\beta_{i^{*}}^{*}$ and $r^{*}=\tilde{r}_{i^{*}}$. For simplicity, we let $\bm{p}=(d,N,\ldots,N,1)$. This means there exists a sequence of networks $(f_{n})_{n}$ such that, for all sufficiently large $n$, $\|f_{n}-f_{0}\|_{\infty}\lesssim(N2^{-m})^{\prod_{l=1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{\beta^{*}}{r^{*}}}$ and $f_{n}\in\mathcal{F}(L,\bm{p},\tau,\infty)$. Next, define $\grave{f}:=f_{n}(\|f_{0}\|_{\infty}/\|f_{n}\|_{\infty}\wedge 1)\in\mathcal{F}(L,\bm{p},\tau,F)$ with $F\geq\max_{i=0,\ldots,L_{*}}(C_{i},1)$; it is then clear that $\|\grave{f}-f_{0}\|_{\infty}\lesssim(N2^{-m})^{\prod_{l=1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{\beta^{*}}{r^{*}}}$. Then it follows that

\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\|\tilde{f}-f_{0}\|_{\infty}\lesssim\|\grave{f}-f_{0}\|_{\infty}\lesssim(N2^{-m})^{\prod_{l=1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{\beta^{*}}{r^{*}}}. (11)

By combining (11) with the risk bound displayed at the beginning of this proof and the fact that $\tau\lesssim LN$, the proof is completed. \square
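For intuition about the final rate, the two error terms in (10)–(11) can be compared numerically. The following minimal Python sketch uses hypothetical smoothness values $\beta^{*}$, $r^{*}$, and $\prod_{l}\beta_{l}\wedge 1$ (chosen for illustration, not taken from the paper) and shows that taking $m$ of order $\log_{2}N$ makes the $(N2^{-m})$-term negligible relative to $N^{-\beta^{*}/r^{*}}$:

import numpy as np

# Compare (N * 2^(-m))^(prod beta_l ^ 1) with N^(-beta*/r*) for hypothetical smoothness values.
beta_star, r_star, prod_beta = 2.0, 3.0, 1.0   # assumed for illustration only
for N in [16, 64, 256, 1024]:
    m = int(np.ceil((1 + beta_star / r_star) * np.log2(N)))   # depth parameter of order log2(N)
    term1 = (N * 2.0 ** (-m)) ** prod_beta
    term2 = N ** (-beta_star / r_star)
    print(f"N={N:5d} m={m:3d}  (N 2^-m)^prod = {term1:.2e}  N^(-beta*/r*) = {term2:.2e}")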