
Semiparametric Regression for Spatial Data via Deep Learning

Kexuan Li, Jun Zhu, Anthony R. Ives, Volker C. Radeloff, and Fangfang Wang
Kexuan Li ([email protected]), Global Analytics and Data Sciences, Biogen, Cambridge, Massachusetts. Jun Zhu ([email protected]), Department of Statistics, University of Wisconsin-Madison. Anthony R. Ives ([email protected]), Department of Integrative Biology, University of Wisconsin-Madison. Volker C. Radeloff ([email protected]), Department of Forest and Wildlife Ecology, University of Wisconsin-Madison. Fangfang Wang ([email protected]), Department of Mathematical Sciences, Worcester Polytechnic Institute.
Abstract

In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. To be specific, we use a sparsely connected deep neural network with the rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under some mild conditions, the estimator is proven to be consistent, and the rate of convergence is determined by three factors: (1) the architecture of the neural network class, (2) the smoothness and (intrinsic) dimension of the true mean function, and (3) the magnitude of the spatial dependence. Owing to its use of stochastic gradient descent optimization, our method handles large data sets well. Simulation studies on synthetic data are conducted to assess the finite-sample performance, and the results indicate that the proposed method is capable of picking up the intricate relationship between response and covariates. Finally, a real data analysis is provided to demonstrate the validity and effectiveness of the proposed method.


Keywords: Semiparametric regression; Spatially dependent data; Deep Neural Networks; Stochastic gradient descent.

1 Introduction

With recent advances in remote sensing technology and geographical sciences, there has been considerable interest in modeling spatially referenced data. The purpose of this paper is to develop a new methodology that captures complex structures in such data via deep neural networks and Gaussian random fields. In addition, we provide a theoretical understanding of deep neural networks for spatially dependent data.

In recent years, deep neural networks (DNNs) have made great breakthroughs in many fields, such as computer vision (He et al., 2016), dynamical systems (Li et al., 2021), natural language processing (Bahdanau et al., 2014), drug discovery and toxicology (Jiménez-Luna et al., 2020), and variable selection (Li et al., 2022; Li, 2022). Beyond these successful applications, there has also been great progress in the theoretical development of deep learning. Liu et al., (2020) and Schmidt-Hieber, (2020) proved that the neural network estimator achieves the optimal (up to a logarithmic factor) minimax rate of convergence. Liu et al., (2022) further removed the logarithmic term and achieved the exact optimal nonparametric convergence rate. One of the appealing features of deep neural networks is that they can circumvent the curse of dimensionality under some mild conditions.

Owing to the superior performance and theoretical guarantees of deep learning, applying deep learning to spatial data has also drawn much attention. For example, Zammit-Mangion and Wikle, (2020) fitted the integro-difference equation through convolutional neural networks and obtained probabilistic spatio-temporal forecasts. Zammit-Mangion et al., (2021) constructed a deep probabilistic architecture to model nonstationary spatial processes using a warping approach. Mateu and Jalilian, (2022) used variational autoencoder generative neural networks to analyze spatio-temporal point processes. Kirkwood et al., (2022) applied Bayesian deep neural networks to spatial interpolation. However, there is a lack of theoretical understanding of the aforementioned work, which we address in this paper.

In addition, we model spatial dependence by Gaussian random fields and develop computationally efficient model estimation. Due to technological advances in the data collection process, spatial datasets are often massive, and traditional statistical methods face two challenges. One challenge is the aggravated computational burden. To reduce computational cost, various methods have been developed, such as covariance tapering, fixed rank kriging, and Gaussian Markov random fields (see Sun et al., (2012) for a comprehensive review). The other challenge is data storage and collection. Many spatial datasets are not only big, but are also generated by different sources or in an online fashion such that observations arrive one by one. In both cases, we cannot process the entire dataset at once. To overcome these two challenges, Robbins and Monro, (1951) proposed a computationally scalable algorithm called stochastic gradient descent (SGD), which has achieved great success in many areas. Instead of evaluating the actual gradient based on the entire dataset, SGD estimates the gradient using only one observation, which makes it computationally feasible for large-scale data and streaming data. Its statistical inferential properties have also been studied by many researchers (Su and Zhu, 2018; Liu et al., 2021).

Before proceeding, it is essential to provide a brief overview of the classical approaches to formulating spatial dependence in a spatial process. In spatially varying coefficient (SVC) models, covariates may have different effects at different locations, a phenomenon referred to as “spatial non-stationarity”. Thus, SVC models allow the coefficients of the covariates to change with location. In other words, the spatial dependence is entirely explained by the regressors, while the disturbances are independent (see, e.g., Hastie and Tibshirani, (1993); Fan and Zhang, (1999); Mu et al., (2018); Kim and Wang, (2021)). Another class of models is based on the spatial autoregressive (SAR) models of Cliff and Ord, (1981), where the information about spatial dependence is contained in a spatial weight matrix, and the response variable at each location is assumed to be affected by a weighted average of the dependent variables from the other sampled units (see Lee, (2002, 2004)). In SAR models, the spatial weight matrix is assumed to be known, which is often infeasible in practice, especially for large datasets. Both SVC and SAR models have limitations. First, they assume that the relationship between the response and the covariates is linear and ignore complex interactions and nonlinear structure. Second, they both involve computing the inverse of an $n\times n$ matrix, where $n$ is the number of observations, which requires $O(n^{3})$ time complexity and $O(n^{2})$ memory complexity. Thus, the computational burden of both SVC and SAR models is substantial.

To meet these challenges, here we develop deep learning-based semiparametric regression for spatial data. Specifically, we use a sparsely connected feedforward neural network to fit the regression model, where the spatial dependence is captured by Gaussian random fields. By assuming a compositional structure on the regression function, the consistency of the neural network estimator is guaranteed. The advantages of the proposed method are fourfold. First, we do not assume any parametric functional form for the regression function, allowing the true mean function to be nonlinear or to involve complex interactions. This is an improvement over many existing parametric, semiparametric, and nonparametric approaches (Hastie and Tibshirani, 1993; Fan and Zhang, 1999; Mu et al., 2018; Kim and Wang, 2021; Lee, 2002; Robinson, 2011; Jenish, 2012; Li, 2016; Lu and Tjøstheim, 2014; Kurisu, 2019, 2022). Second, under some mild technical conditions, we show that the estimator is consistent. To the best of our knowledge, this is the first theoretical result on deep neural networks for spatially dependent data. Third, the convergence rate is free of the input dimension, which means our estimator does not suffer from the curse of dimensionality. Finally, owing to the appealing properties of SGD, our method is feasible for large-scale datasets and streaming data.

The remainder of the paper is organized as follows. Section 2 formulates the statistical problem and presents the deep neural network estimator. The computational aspects and theoretical properties of the estimator are given in Section 3. Section 4 evaluates the finite-sample performance of the proposed estimator through a simulation study. We apply our method to a real-world dataset in Section 5, and some concluding remarks are provided in Section 6. Technical details are provided in the Appendix.

2 Model and Estimator

In this section, we first formulate the problem and then present the proposed estimator under a deep learning framework.

2.1 Model Setup

For a spatial domain of interest $\mathcal{S}$, we consider the following semiparametric spatial regression model:

y(\bm{s})=f_{0}(\bm{x}(\bm{s}))+e_{1}(\bm{s})+e_{2}(\bm{s}),\quad \bm{s}\in\mathcal{S},   (1)

where $f_{0}:[0,1]^{d}\rightarrow\mathbb{R}$ is an unknown mean function of interest, $\bm{x}(\bm{s})=(x_{1}(\bm{s}),\ldots,x_{d}(\bm{s}))^{\top}$ represents a $d$-dimensional vector of covariates at location $\bm{s}$ with $x_{i}(\bm{s})\in[0,1]$, $e_{1}(\bm{s})$ is a mean-zero Gaussian random field with covariance function $\gamma(\bm{s},\bm{s}^{\prime})$, $\bm{s},\bm{s}^{\prime}\in\mathcal{S}$, and $e_{2}(\bm{s})$ is a spatial Gaussian white noise process with mean 0 and variance $\sigma^{2}$. Furthermore, we assume that $e_{1}(\bm{s})$, $e_{2}(\bm{s})$, and $\bm{x}(\bm{s})$ are independent of each other. Thus the observation $y(\bm{s})$ comprises three components: the large-scale trend $f_{0}(\bm{x}(\bm{s}))$, the small-scale spatial variation $e_{1}(\bm{s})$, and the measurement error $e_{2}(\bm{s})$; see, for instance, Cressie and Johannesson, (2008).

In the spatial statistics literature, it is popular to focus on predicting the hidden spatial process $y^{*}(\bm{s})=f_{0}(\bm{x}(\bm{s}))+e_{1}(\bm{s})$ using the observed information (Cressie and Johannesson, 2008). However, the primary interest of this paper is to estimate the large-scale trend $f_{0}(\bm{x}(\bm{s}))$, where the relationship between the hidden spatial process and the covariates could be complex in nature. To capture such a complex relationship, we assume that $f_{0}$ is a composition of several functions, inspired by the characteristics of neural networks (Schmidt-Hieber, 2020). Hölder smoothness (see Definition 1 in the Appendix) is a commonly used smoothness assumption on the regression function in the nonparametric and semiparametric literature. It is thus natural to assume that the true mean function $f_{0}$ is a composition of Hölder smooth functions, which is formally stated in the following assumption.

Assumption 1.

The function $f_{0}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ has a compositional structure with parameters $(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$, where $L_{*}\in\mathbb{Z}_{+}$, $\bm{r}=(r_{0},\ldots,r_{L_{*}+1})^{\top}\in\mathbb{Z}_{+}^{L_{*}+2}$ with $r_{0}=d$ and $r_{L_{*}+1}=1$, $\bm{\tilde{r}}=(\tilde{r}_{0},\ldots,\tilde{r}_{L_{*}})^{\top}\in\mathbb{Z}_{+}^{L_{*}+1}$, $\bm{\beta}=(\beta_{0},\ldots,\beta_{L_{*}})^{\top}\in\mathbb{R}_{+}^{L_{*}+1}$, $\bm{a}=(a_{0},\ldots,a_{L_{*}+1})^{\top}$, $\bm{b}=(b_{0},\ldots,b_{L_{*}+1})^{\top}\in\mathbb{R}^{L_{*}+2}$, and $\bm{C}=(C_{0},\ldots,C_{L_{*}})^{\top}\in\mathbb{R}_{+}^{L_{*}+1}$; that is,

f_{0}(\bm{z})=\bm{g}_{L_{*}}\circ\ldots\circ\bm{g}_{1}\circ\bm{g}_{0}(\bm{z}),\quad\quad\textrm{ for }\bm{z}\in[a_{0},b_{0}]^{r_{0}},

where $\mathbb{Z}_{+}$ and $\mathbb{R}_{+}$ denote the sets of positive integers and positive real numbers, respectively, $\bm{g}_{i}=(g_{i,1},\ldots,g_{i,r_{i+1}})^{\top}:[a_{i},b_{i}]^{r_{i}}\to[a_{i+1},b_{i+1}]^{r_{i+1}}$ for some $|a_{i}|,|b_{i}|\leq C_{i}$, and the component functions $g_{i,j}:[a_{i},b_{i}]^{\tilde{r}_{i}}\to[a_{i+1},b_{i+1}]$ are $(\beta_{i},C_{i})$-Hölder smooth and depend on only $\tilde{r}_{i}$ variables, with $\tilde{r}_{i}\leq r_{i}$.

Without loss of generality, we assume $C_{i}>1$ in Assumption 1. The parameter $L_{*}$ refers to the total number of layers, i.e., the number of composite functions; $\bm{r}$ gives the total number of variables in each layer, whereas $\bm{\tilde{r}}$ gives the number of “active” variables in each layer. The two parameter vectors $\bm{\beta}$ and $\bm{C}$ pertain to the Hölder smoothness in each layer, while $\bm{a}$ and $\bm{b}$ define the domain of $\bm{g}_{i}$ in the $i$th layer. For the rest of the paper, we use $\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$ to denote the class of functions that have a compositional structure as specified in Assumption 1 with parameters $(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$.

It is worth mentioning that Assumption 1 is commonly adopted in the deep learning literature; see Bauer and Kohler, (2019), Schmidt-Hieber, (2020), Kohler and Langer, (2021), Liu et al., (2020), and Li et al., (2021), among others. This compositional structure covers a wide range of function classes, including the generalized additive model. In Example 1 and Figure 1, we present an illustrative example to help readers understand the concept of a compositional structure within the context of the generalized additive model.

Example 1.

The generalized additive model is a generalized linear model with a linear predictor involving a sum of smooth functions of covariates (Hastie and Tibshirani, 1986). Suppose $f(\bm{z})=\varphi(\sum_{i=1}^{d}h_{i}(z_{i}))$, where $\varphi(\cdot)$ is $(\beta_{\varphi},C_{\varphi})$-Hölder smooth and the $h_{i}(\cdot)$ are $(\beta_{h},C_{h})$-Hölder smooth, for some $(\beta_{\varphi},C_{\varphi})$ and $(\beta_{h},C_{h})$. Clearly, $f(\bm{z})$ can be written as a composition of three functions, $f(\bm{z})=\bm{g}_{2}\circ\bm{g}_{1}\circ\bm{g}_{0}(\bm{z})$, with $\bm{g}_{0}(z_{1},\ldots,z_{d})=(h_{1}(z_{1}),\ldots,h_{d}(z_{d}))^{\top}$, $\bm{g}_{1}(z_{1},\ldots,z_{d})=\sum_{i=1}^{d}z_{i}$, and $\bm{g}_{2}(z)=\varphi(z)$. Here, $L_{*}=2$, $\bm{r}=(d,d,1,1)^{\top}$, $\bm{\tilde{r}}=(d,d,1)^{\top}$, and $\bm{\beta}=(\beta_{h},\infty,\beta_{\varphi})^{\top}$.

Figure 1: Illustration of the compositional structure for the generalized additive model in Example 1.
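To make the composition in Example 1 concrete, the short numpy sketch below evaluates a generalized additive model both directly and as the three-layer composition $\bm{g}_{2}\circ\bm{g}_{1}\circ\bm{g}_{0}$; the particular choices of $h_{i}$ and $\varphi$ are arbitrary illustrations and are not part of the paper.

```python
import numpy as np

# Hypothetical smooth components for illustration: h_i(z) = sin((i+1) z), phi = exp,
# so f(z) = exp(sum_i sin((i+1) z_i)) has the compositional structure of Example 1.
d = 5
h = [lambda z, i=i: np.sin((i + 1) * z) for i in range(d)]   # inner Hölder smooth functions
phi = np.exp                                                  # outer Hölder smooth link

def g0(z):   # layer 0: apply each h_i to its own coordinate
    return np.array([h[i](z[i]) for i in range(d)])

def g1(u):   # layer 1: sum the d coordinates
    return np.sum(u)

def g2(v):   # layer 2: apply the link function
    return phi(v)

def f(z):    # f = g2 o g1 o g0, exactly as in Example 1
    return g2(g1(g0(z)))

z = np.random.rand(d)                                # a point in [0, 1]^d
print(f(z), phi(sum(h[i](z[i]) for i in range(d))))  # the two evaluations agree
```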

2.2 Deep Neural Network (DNN) Estimator

In this paper, we consider estimating the unknown function $f_{0}$ via a deep neural network, owing to the complexity of $f_{0}$ and the flexibility of neural networks. Before presenting our main results, we first briefly review the neural network terminology pertaining to this work.

An activation function is a nonlinear function used to learn complex patterns from data. In this paper, we focus on the Rectified Linear Unit (ReLU) shifted activation function, defined as $\sigma_{\bm{v}}(\bm{z})=(\sigma(z_{1}-v_{1}),\ldots,\sigma(z_{d}-v_{d}))^{\top}$, where $\sigma(s)=\max\{0,s\}$ and $\bm{z}=(z_{1},\ldots,z_{d})^{\top}\in\mathbb{R}^{d}$. The ReLU activation function enjoys both theoretical and computational advantages: the projection property $\sigma\circ\sigma=\sigma$ facilitates the proof of consistency, while ReLU helps avoid the vanishing gradient problem. The ReLU feedforward neural network $\bm{f}(\bm{z},W,v)$ is given by

\bm{f}(\bm{z},W,v)=W_{L}\sigma_{\bm{v}_{L}}\cdots W_{1}\sigma_{\bm{v}_{1}}W_{0}\bm{z},\quad\bm{z}\in\mathbb{R}^{p_{0}},   (2)

where $\{(W_{0},\ldots,W_{L}):W_{l}\in\mathbb{R}^{p_{l+1}\times p_{l}},0\leq l\leq L\}$ is the collection of weight matrices, $\{(\bm{v}_{1},\ldots,\bm{v}_{L}):\bm{v}_{l}\in\mathbb{R}^{p_{l}},1\leq l\leq L\}$ is the collection of so-called biases (in the neural network literature), and $\sigma_{\bm{v}_{l}}(\cdot)$, $1\leq l\leq L$, are the ReLU shifted activation functions. Here, $L$ is the number of hidden layers, i.e., the length of the network, while $p_{j}$ is the number of units in the $j$th layer, i.e., the width of the network. When using a ReLU feedforward neural network for the regression problem (1), we need $p_{0}=d$ and $p_{L+1}=1$; the parameters to be estimated are the weight matrices $(W_{j})_{j=0,\ldots,L}$ and the biases $(\bm{v}_{j})_{j=1,\ldots,L}$.
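For reference, the following minimal numpy sketch evaluates the shifted-ReLU feedforward network in (2); the network sizes and random weights are illustrative assumptions only.

```python
import numpy as np

def relu_network(z, W, v):
    """Evaluate f(z, W, v) = W_L sigma_{v_L}( ... W_1 sigma_{v_1}( W_0 z ) ... ) as in (2).

    W is a list of L+1 weight matrices [W_0, ..., W_L];
    v is a list of L shift ("bias") vectors [v_1, ..., v_L].
    """
    a = W[0] @ z                              # first affine map, no activation before it
    for Wl, vl in zip(W[1:], v):
        a = Wl @ np.maximum(a - vl, 0.0)      # shifted ReLU: sigma_v(z) = max(z - v, 0)
    return a

# A tiny example with L = 2 hidden layers, input dimension p_0 = d = 3, and width 4.
rng = np.random.default_rng(0)
p = [3, 4, 4, 1]                                              # (p_0, p_1, p_2, p_3)
W = [rng.uniform(-1, 1, size=(p[l + 1], p[l])) for l in range(3)]
v = [rng.uniform(-1, 1, size=p[l]) for l in (1, 2)]
print(relu_network(rng.random(3), W, v))                      # scalar output, p_{L+1} = 1
```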

By definition, a ReLU feedforward neural network can be written as a composition of simple nonlinear functions; that is,

\bm{f}(\bm{z},W,v)=\bm{g}_{L}\circ\ldots\circ\bm{g}_{1}\circ\bm{g}_{0}(\bm{z}),\quad\bm{z}\in\mathbb{R}^{p_{0}},

where $\bm{g}_{i}(\bm{z})=W_{i}\sigma_{\bm{v}_{i}}(\bm{z})$, $\bm{z}\in\mathbb{R}^{p_{i}}$, $i=1,\ldots,L$, and $\bm{g}_{0}(\bm{z})=W_{0}\bm{z}$, $\bm{z}\in\mathbb{R}^{p_{0}}$. Unlike traditional function approximation theory, where a complex function is represented as an infinite sum of simpler functions (such as a Taylor series, Fourier series, or Chebyshev approximation), deep neural networks approximate a complex function via composition, i.e., by composing simpler functions (Lu et al., 2020; Farrell et al., 2021; Yarotsky, 2017). Thus, a composite function can be well approximated by a feedforward neural network, which is why we assume the true mean function $f_{0}$ has a compositional structure.

In practice, the length and width of the network can be extremely large, which easily causes overfitting. To overcome this problem, a common practice in deep learning is to randomly set some neurons to zero, which is called dropout. Therefore, it is natural to assume that the network space is sparse and that all parameters are bounded by one, where the latter can be achieved by dividing all the weights by the maximum weight (Bauer and Kohler, 2019; Schmidt-Hieber, 2020; Kohler and Langer, 2021). As such, we consider the following sparse neural network class with bounded weights:

\mathcal{F}(L,\bm{p},\tau,F)=\left\{f\textrm{ is of form (2)}:\ \max_{j=0,\ldots,L}\|W_{j}\|_{\infty}+|\bm{v}_{j}|_{\infty}\leq 1,\ \sum_{j=0}^{L}(\|W_{j}\|_{0}+|\bm{v}_{j}|_{0})\leq\tau,\ \|f\|_{\infty}\leq F\right\},   (3)

where $\bm{p}=(p_{0},\ldots,p_{L+1})$ with $p_{0}=d$ and $p_{L+1}=1$, and $\bm{v}_{0}$ is a vector of zeros. This class of neural networks is also adopted in Schmidt-Hieber, (2020), Liu et al., (2020), and Li et al., (2021).

Suppose that the process $y(\cdot)$ is observed at a finite number of spatial locations $\{\bm{s}_{1},\ldots,\bm{s}_{n}\}$ in $\mathcal{S}$. The desired DNN estimator of $f_{0}$ in Model (1) is a sparse neural network in $\mathcal{F}(L,\bm{p},\tau,F)$ with the smallest empirical risk; that is,

\widehat{f}_{\textrm{global}}(\widehat{W},\widehat{v})=\mathop{\mathrm{argmin}}_{f\in\mathcal{F}(L,\bm{p},\tau,F)}\ n^{-1}\sum_{i=1}^{n}\big(y(\bm{s}_{i})-f(\bm{x}(\bm{s}_{i}))\big)^{2}.   (4)

For simplicity, we sometimes write $\widehat{f}_{\textrm{global}}$ for $\widehat{f}_{\textrm{global}}(\widehat{W},\widehat{v})$ if no confusion arises.

3 Computation and Theoretical Results

In this section, we describe the computational procedure used to optimize the objective function (4) and present the main theoretical results.

3.1 Computational Aspects

Because (4) does not admit a closed-form solution, we use a stochastic gradient descent (SGD)-based algorithm to optimize it. In contrast to a gradient descent algorithm, which requires the full dataset to compute the gradient at each iteration, SGD or mini-batch gradient descent only needs access to a subset of observations during each update, making it capable of training relatively complex models on large datasets and computationally feasible with streaming data. Despite its successful applications in machine learning and deep learning, SGD still suffers from some potential problems: for example, the rate of convergence to the minimum can be slow, and the performance is very sensitive to tuning parameters. To circumvent these problems, various methods have been proposed, such as RMSprop and Adam (Kingma and Ba, 2014). In this paper, we use the Adam optimizer to solve (4).

During the training process, there are several hyper-parameters to tune in our approach: the number of layers $L$, the number of neurons in each layer $\bm{p}$, the sparsity parameter $\tau$, and the learning rate. These hyper-parameters play an important role in the learning process, but it is challenging to determine their values without domain knowledge. In particular, it is difficult to control the sparsity parameter $\tau$ directly during training. Thus, we add an $\ell_{1}$-regularization penalty to control the number of inactive neurons in the network. The idea of adding a sparsity-inducing regularization to the hidden layers in deep learning is very common; see, for instance, Scardapane et al., (2017) and Lemhadri et al., (2021). In this paper, we use 5-fold cross-validation to select the tuning parameters.
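As an illustration of this training procedure, the PyTorch sketch below minimizes the empirical risk (4) with the Adam optimizer, mini-batch gradients, dropout, and an $\ell_{1}$ penalty on the weights. It is a minimal sketch under assumed settings: the architecture (3 hidden layers of width 30, dropout 0.2) mirrors the choices reported in Section 4, while the penalty weight `lam`, learning rate, batch size, and epoch count are hypothetical values that would in practice be chosen by cross-validation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def fit_dnn(x, y, width=30, n_layers=3, dropout=0.2, lam=1e-4, lr=1e-3, epochs=200):
    """Fit a ReLU feedforward network by minimizing mean squared error plus an
    l1 penalty (a practical surrogate for the sparsity constraint tau in (3))."""
    d = x.shape[1]
    layers, p_in = [], d
    for _ in range(n_layers):
        layers += [nn.Linear(p_in, width), nn.ReLU(), nn.Dropout(dropout)]
        p_in = width
    layers += [nn.Linear(p_in, 1)]
    net = nn.Sequential(*layers)

    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:                                   # mini-batch (stochastic) gradients
            opt.zero_grad()
            mse = ((net(xb).squeeze(-1) - yb) ** 2).mean()      # empirical risk as in (4)
            l1 = sum(p.abs().sum() for p in net.parameters())   # sparsity-inducing penalty
            (mse + lam * l1).backward()
            opt.step()
    return net

# x: n x d covariate matrix, y: length-n response vector, both torch.float32 tensors.
```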

3.2 Theoretical Results

Recall that the minimizer of (4), $\widehat{f}_{\textrm{global}}$, is practically unattainable, and we use an SGD-based algorithm to minimize the objective function (4), which may converge to a local minimum. The actual estimator obtained by minimizing (4) is denoted by $\widehat{f}_{\textrm{local}}\in\mathcal{F}(L,\bm{p},\tau,F)$. We define the difference between the expected empirical risks of $\widehat{f}_{\textrm{global}}$ and $\widehat{f}_{\textrm{local}}$ as

\Delta_{n}(\widehat{f}_{\textrm{local}})\doteq\bm{E}_{f_{0}}\left[\frac{1}{n}\sum_{i=1}^{n}\big(y(\bm{s}_{i})-\widehat{f}_{\textrm{local}}(\bm{x}(\bm{s}_{i}))\big)^{2}-\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\frac{1}{n}\sum_{i=1}^{n}\big(y(\bm{s}_{i})-\tilde{f}(\bm{x}(\bm{s}_{i}))\big)^{2}\right]
=\bm{E}_{f_{0}}\left[\frac{1}{n}\sum_{i=1}^{n}\big(y(\bm{s}_{i})-\widehat{f}_{\textrm{local}}(\bm{x}(\bm{s}_{i}))\big)^{2}-\frac{1}{n}\sum_{i=1}^{n}\big(y(\bm{s}_{i})-\widehat{f}_{\textrm{global}}(\bm{x}(\bm{s}_{i}))\big)^{2}\right],   (5)

where $\bm{E}_{f_{0}}$ stands for the expectation with respect to the true regression function $f_{0}$. For any $\widehat{f}\in\mathcal{F}(L,\bm{p},\tau,F)$, we consider the following estimation error:

R_{n}(\widehat{f},f_{0})\doteq\bm{E}_{f_{0}}\left[\frac{1}{n}\sum_{i=1}^{n}\big(\widehat{f}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big)^{2}\right].   (6)

The oracle-type theorem below gives an upper bound for the estimation error.

Theorem 1.

Suppose that the unknown true mean function $f_{0}$ in (1) satisfies $\|f_{0}\|_{\infty}\leq F$ for some $F\geq 1$. For any $\delta,\varepsilon\in(0,1]$ and $\widehat{f}\in\mathcal{F}(L,\bm{p},\tau,F)$, the following oracle inequality holds:

R_{n}(\widehat{f},f_{0})\lesssim(1+\varepsilon)\left(\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\|\tilde{f}-f_{0}\|^{2}_{\infty}+\zeta_{n,\varepsilon,\delta}+\Delta_{n}(\widehat{f})\right),

where

\zeta_{n,\varepsilon,\delta}\asymp\frac{1}{\varepsilon}\left[\delta\Big(n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big)+\frac{\tau}{n}\left(\log(L/\delta)+L\log\tau\right)\big(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1\big)\right],

and $\Gamma_{n}=[\gamma(\bm{s}_{i},\bm{s}_{i^{\prime}})]_{1\leq i,i^{\prime}\leq n}$.

The convergence rate in Theorem 1 is determined by three components. The first component, $\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\|\tilde{f}-f_{0}\|^{2}_{\infty}$, measures the distance between the neural network class $\mathcal{F}(L,\bm{p},\tau,F)$ and $f_{0}$, i.e., the approximation error. The second term, $\zeta_{n,\varepsilon,\delta}$, pertains to the estimation error, and $\Delta_{n}(\widehat{f})$ arises from the difference between $\widehat{f}$ and the oracle neural network estimator $\widehat{f}_{\textrm{global}}$. It is worth noting that the upper bound in Theorem 1 does not depend on the network architecture parameter $\bm{p}$, i.e., the width of the network, because the network is sparse and its “actual” width is controlled by the sparsity parameter $\tau$. To see this, note that after removing all inactive neurons, it is straightforward to show that $\mathcal{F}(L,\bm{p},\tau,F)=\mathcal{F}(L,(p_{0},p_{1}\wedge\tau,\ldots,p_{L}\wedge\tau,p_{L+1}),\tau,F)$ (Schmidt-Hieber, 2020).

Next, we turn to the consistency of the DNN estimator $\widehat{f}_{\textrm{local}}$ for $f_{0}\in\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$. In nonparametric regression, the convergence rate of estimation is heavily affected by the smoothness of the function. Consider the class of composite functions $\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$. Let $\beta_{i}^{*}=\beta_{i}\prod_{s=i+1}^{L_{*}}(\beta_{s}\wedge 1)$ for $i=0,\ldots,L_{*}$ and $i^{*}=\mathop{\mathrm{argmin}}_{0\leq i\leq L_{*}}\beta_{i}^{*}/\tilde{r}_{i}$, with the convention $\prod_{s=L_{*}+1}^{L_{*}}(\beta_{s}\wedge 1)=1$. Then $\beta^{*}=\beta_{i^{*}}^{*}$ and $r^{*}=\tilde{r}_{i^{*}}$ are known as the intrinsic smoothness and intrinsic dimension of $f\in\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$. Similar definitions can be found in Kurisu, (2019); Nakada and Imaizumi, (2020); Cloninger and Klock, (2021); Kohler et al., (2022); Wang et al., (2023). These quantities play an important role in controlling the convergence rate of the estimator. To better understand $\beta_{i}^{*}$ and $i^{*}$, consider the composite function from the $i$th to the last layer, i.e., $\bm{h}_{i}(\bm{z})=\bm{g}_{L_{*}}\circ\ldots\circ\bm{g}_{i+1}\circ\bm{g}_{i}(\bm{z}):[a_{i},b_{i}]^{r_{i}}\to\mathbb{R}$; then $\beta_{i}^{*}$ can be viewed as the smoothness of $\bm{h}_{i}$, and $i^{*}$ is the layer with the least smoothness after rescaling by the respective number of “active” variables $\tilde{r}_{i}$, $i=0,\ldots,L_{*}$. The following theorem establishes the consistency of $\widehat{f}_{\textrm{local}}$ as an estimator of $f_{0}$ and its convergence rate in the presence of spatial dependence.
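For concreteness, the small helper below computes $\beta^{*}$ and $r^{*}$ from $(\bm{\beta},\bm{\tilde{r}})$ exactly as defined above; the numerical values in the example call (Example 1 with hypothetical $\beta_{h}=2$, $\beta_{\varphi}=1.5$, $d=5$) are arbitrary illustrations.

```python
import numpy as np

def intrinsic_smoothness(beta, r_tilde):
    """Return (beta*, r*): beta_i* = beta_i * prod_{s>i} min(beta_s, 1) and
    i* = argmin_i beta_i* / r~_i, following the definitions in the text."""
    beta = np.asarray(beta, dtype=float)
    r_tilde = np.asarray(r_tilde, dtype=float)
    beta_star = np.array([beta[i] * np.prod(np.minimum(beta[i + 1:], 1.0))
                          for i in range(len(beta))])
    i_star = int(np.argmin(beta_star / r_tilde))
    return beta_star[i_star], r_tilde[i_star]

# Example 1 (generalized additive model): beta = (beta_h, inf, beta_phi), r~ = (d, d, 1).
print(intrinsic_smoothness([2.0, np.inf, 1.5], [5, 5, 1]))   # -> (2.0, 5.0)
```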

Theorem 2.

Suppose Assumption 1 is satisfied, i.e., $f_{0}\in\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$. Let $\widehat{f}_{\textrm{local}}\in\mathcal{F}(L,\bm{p},\tau,F)$ be an estimator of $f_{0}$. Further assume that $F\geq\max_{i=0,\ldots,L_{*}}(C_{i},1)$, $N\doteq\min_{i=1,\ldots,L}p_{i}\geq 6\eta\max_{i=0,\ldots,L_{*}}(\beta_{i}+1)^{\tilde{r}_{i}}\vee(\tilde{C}_{i}+1)e^{\tilde{r}_{i}}$, where $\eta=\max_{i=0,\ldots,L_{*}}(r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil))$, and $\tau\lesssim LN$. Then we have

R_{n}(\widehat{f}_{\textrm{local}},f_{0})\lesssim\varsigma_{n},

where

\varsigma_{n}\asymp(N2^{-L})^{2\prod_{l=1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{2\beta^{*}}{r^{*}}}+\frac{(\operatorname{tr}(\Gamma_{n}^{2})+n)(LN\log(Ln^{2})+L^{2}N\log(LN))}{n^{2}}+\Delta_{n}(\widehat{f}_{\textrm{local}}),

and $\tilde{C}_{i}$ are constants depending only on $\bm{C},\bm{a},\bm{b}$, and $\Gamma_{n}=[\gamma(\bm{s}_{i},\bm{s}_{i^{\prime}})]_{1\leq i,i^{\prime}\leq n}$.

The consistency of $\widehat{f}_{\textrm{local}}$ can be achieved by, for instance, letting $L\asymp\log(n)$, $N\asymp n^{\frac{r^{*}}{2\beta^{*}+r^{*}}}$, $\operatorname{tr}(\Gamma_{n}^{2})=o(n^{\frac{4\beta^{*}+r^{*}}{2\beta^{*}+r^{*}}}(\log n)^{-3})$, and $\Delta_{n}(\widehat{f}_{\textrm{local}})=o(1)$, in which case $\varsigma_{n}\asymp n^{-\frac{2\beta^{*}}{2\beta^{*}+r^{*}}}(\log n)^{3}+\Delta_{n}(\widehat{f}_{\textrm{local}})=o(1)$. As expected, the rate of convergence is affected by the intrinsic smoothness and intrinsic dimension of $\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$, the architecture of the neural network class $\mathcal{F}(L,\bm{p},\tau,F)$, and the magnitude of the spatial dependence.

4 Simulation Study

In this section, we evaluate the finite-sample performance of the proposed DNN estimator through a set of simulation studies. Two different simulation designs are considered, and for each design we generate 100 independent data sets. In both settings, we use the same neural network architecture, with length $L=3$ and width $N=30$. Additionally, to prevent overfitting and control the sparsity of the network, we apply dropout with a probability of 0.2.

In the first design, the spatial domain of interest $\mathcal{S}$ is a subset of $\mathbb{R}$. To be specific, we generate data from the following model

y(s_{i})=f_{0}(\bm{x}(s_{i}))+e_{1}(s_{i})+e_{2}(s_{i}),\quad s_{i}\in[0,D],~i=1,2,\ldots,n,

and

\bm{x}(s_{i})=(x_{1}(s_{i}),\ldots,x_{5}(s_{i}))^{\top}=\left(s_{i}/D,\ \sin(10s_{i}/D),\ (s_{i}/D)^{2},\ \exp(3s_{i}/D),\ (s_{i}/D+1)^{-1}\right)^{\top}\in\mathbb{R}^{5},

with the true mean function $f_{0}(\bm{x}(s_{i}))=\sum_{j=1}^{5}x_{j}(s_{i})$. The small-scale spatial variation $e_{1}(\cdot)$ is a zero-mean stationary and isotropic Gaussian process with exponential covariance function $\gamma(s_{i},s_{j})=\exp(-|s_{i}-s_{j}|/\rho)$ and range parameter $\rho=0.1,0.5,1$. The measurement error $e_{2}(\cdot)$ follows a standard normal distribution and is independent of $e_{1}(\cdot)$. It is worth mentioning that the covariates are location dependent.
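A minimal numpy sketch of one replicate of this design, assuming equally spaced locations and drawing the Gaussian process directly from its covariance matrix (feasible here because $n$ is small), is given below.

```python
import numpy as np

def simulate_design1(n=100, D=1.0, rho=0.5, seed=0):
    """One replicate of Simulation Design 1: location-dependent covariates, a
    zero-mean GP e1 with exponential covariance, and N(0,1) measurement error e2."""
    rng = np.random.default_rng(seed)
    s = np.linspace(0.0, D, n)                               # equally spaced locations
    u = s / D
    x = np.column_stack([u, np.sin(10 * u), u ** 2, np.exp(3 * u), 1.0 / (u + 1.0)])
    f0 = x.sum(axis=1)                                       # true mean function
    Gamma = np.exp(-np.abs(s[:, None] - s[None, :]) / rho)   # exponential covariance
    e1 = rng.multivariate_normal(np.zeros(n), Gamma)         # small-scale spatial variation
    e2 = rng.standard_normal(n)                              # measurement error
    return x, f0 + e1 + e2, f0

x, y, f0 = simulate_design1(n=100, D=1.0, rho=0.5)
```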

We consider two different spatial domains: a fixed domain and an expanding domain. For the fixed domain, $\mathcal{S}=[0,1]$ is a fixed interval, i.e., $D=1$, whereas for the expanding domain, the spatial domain $\mathcal{S}=[0,D]$ increases with the sample size $n$. The $n$ observations are equally spaced over the region. In both scenarios, we let $n=100,200,300$, and accordingly, $D=10,20,30$ in the expanding-domain case.

(a) Fixed domain: $\mathcal{S}=[0,1]$.
(b) Expanding domain: $\mathcal{S}=[0,10]$.
Figure 2: The estimated mean function and $95\%$ pointwise simulation intervals using our method in Simulation Design 1 with $n=100$ and $\rho=0.5$. In both plots, the solid line is the true mean function, and the two dashed lines are the $95\%$ pointwise simulation intervals. The grey lines are the estimated mean functions from each replication.

In the second design, the mean function is defined on $\mathbb{R}^{2}$ and given by

f_{0}(\bm{x}(\bm{s}_{i}))=\beta_{1}x_{1}(\bm{s}_{i})x_{2}(\bm{s}_{i})+\beta_{2}x_{2}(\bm{s}_{i})^{2}\sin(x_{3}(\bm{s}_{i}))+\beta_{3}\exp(x_{4}(\bm{s}_{i}))\max(x_{5}(\bm{s}_{i}),0)+\frac{\beta_{4}}{\mathop{\mathrm{sign}}(x_{4}(\bm{s}_{i}))(10+x_{5}(\bm{s}_{i}))}+\beta_{5}\tanh(x_{1}(\bm{s}_{i})),\quad\bm{s}_{i}\in[0,D]^{2},~i=1,\ldots,n,   (7)

where the coefficients $\beta_{j}$, $j=1,\ldots,5$, are drawn from $U(1,2)$. The covariates at each location are generated from standard normal distributions with a cross-covariate correlation of 0.5, and the covariates at different locations are assumed to be independent. We further normalize each covariate to have zero mean and unit variance. The mean function $f_{0}$ is nonlinear, featuring interactions among the covariates. We simulate $y(\bm{s}_{i})$ according to (1) with $e_{1}(\bm{s}_{i})$ and $e_{2}(\bm{s}_{i})$ similar to those in Design 1. That is, $e_{1}(\bm{s}_{i})$ is a zero-mean stationary and isotropic Gaussian process on $\mathbb{R}^{2}$ with exponential covariance function $\gamma(\bm{s}_{i},\bm{s}_{j})=\exp(-|\bm{s}_{i}-\bm{s}_{j}|/\rho)$, $\rho=0.1,0.5,1$, and $e_{2}(\bm{s}_{i})\sim N(0,1)$.
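A short sketch of the Design 2 mean function (7) is given below; for brevity the covariates are drawn as independent standard normals, i.e., the 0.5 cross-covariate correlation of the actual design is ignored in this illustration.

```python
import numpy as np

def f0_design2(x, beta):
    """Design 2 mean function (7); x is an n x 5 matrix of covariates and
    beta contains the five coefficients drawn from U(1, 2)."""
    x1, x2, x3, x4, x5 = x.T
    return (beta[0] * x1 * x2
            + beta[1] * x2 ** 2 * np.sin(x3)
            + beta[2] * np.exp(x4) * np.maximum(x5, 0.0)
            + beta[3] / (np.sign(x4) * (10.0 + x5))
            + beta[4] * np.tanh(x1))

rng = np.random.default_rng(1)
beta = rng.uniform(1.0, 2.0, size=5)
x = rng.standard_normal((100, 5))      # simplified: independent covariates for illustration
print(f0_design2(x, beta)[:3])
```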

Similar to the first design, we consider two types of spatial domain: a fixed domain, i.e., $D=1$, and an expanding domain, i.e., $D=10,20,30$. In both cases, we take $n=100,400,900$, and all locations are equally spaced over $[0,D]^{2}$.

4.1 Estimating $f_{0}$ via Other Methods

We also compare the proposed DNN estimator with several estimators from the literature. The first estimator of $f_{0}$ is based on the Gaussian process-based spatially varying coefficient model (GP-SVC), which is given by

y(\bm{s})=\beta_{1}(\bm{s})x_{1}(\bm{s})+\ldots+\beta_{p}(\bm{s})x_{p}(\bm{s})+\epsilon,\quad\epsilon\sim\mathcal{N}(0,\tau^{2}),\quad\bm{s}\in\mathcal{S},

where the spatially varying coefficient $\beta_{j}(\cdot)$ is the sum of a fixed effect and a random effect. That is, $\beta_{j}(\bm{s})=\mu_{j}+\eta_{j}(\bm{s})$, where $\mu_{j}$ is a non-random fixed effect and $\eta_{j}(\cdot)$ is a zero-mean Gaussian process with an isotropic covariance function $c(\cdot;\bm{\theta}_{j})$. In this work, we use the popular Matérn covariance function defined as

c\left(r;\rho,\nu,\sigma^{2}\right)=\sigma^{2}\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\frac{r}{\rho}\right)^{\nu}K_{\nu}\left(\sqrt{2\nu}\frac{r}{\rho}\right),

where $\rho>0$ is the range parameter, $\nu>0$ is the smoothness parameter, and $K_{\nu}(\cdot)$ is the modified Bessel function of the second kind with order $\nu$.
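A small scipy-based sketch of this Matérn covariance, using illustrative parameter values, is shown below; the case $r=0$ is handled separately since $c(0)=\sigma^{2}$.

```python
import numpy as np
from scipy.special import gamma, kv   # gamma function and modified Bessel K_nu

def matern(r, rho=1.0, nu=1.5, sigma2=1.0):
    """Matérn covariance c(r; rho, nu, sigma^2) as defined in the text."""
    r = np.asarray(r, dtype=float)
    c = np.full_like(r, sigma2)                  # c(0) = sigma^2
    pos = r > 0
    z = np.sqrt(2.0 * nu) * r[pos] / rho
    c[pos] = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * z ** nu * kv(nu, z)
    return c

print(matern(np.array([0.0, 0.5, 1.0]), rho=0.5, nu=1.5))
```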

The second estimator is the Nadaraya-Watson (N-W) kernel estimator for spatially dependent data discussed in Robinson, (2011), which considers the following spatial regression model

y(\bm{s}_{i})=f_{0}(\bm{x}(\bm{s}_{i}))+\sigma(\bm{x}(\bm{s}_{i}))V_{i},\quad V_{i}=\sum_{j=1}^{\infty}a_{ij}\epsilon_{j},\quad i=1,\ldots,n,

where $f_{0}(\bm{x}):[0,1]^{d}\rightarrow\mathbb{R}$ and $\sigma(\bm{x}):[0,1]^{d}\rightarrow[0,\infty)$ are the mean and variance functions, respectively, the $\epsilon_{j}$ are independent random variables with zero mean and unit variance, and $\sum_{j=1}^{\infty}a_{ij}^{2}=1$. Robinson, (2011) introduces the following Nadaraya-Watson kernel estimator of $f_{0}$:

\hat{f}(\bm{x})=\frac{\hat{\nu}(\bm{x})}{\widehat{g}(\bm{x})},

where

\widehat{g}(\bm{x})=\frac{1}{nh_{n}^{d}}\sum_{i=1}^{n}K_{i}(\bm{x}),\quad\hat{\nu}(\bm{x})=\frac{1}{nh_{n}^{d}}\sum_{i=1}^{n}y_{i}K_{i}(\bm{x}),

with

K_{i}(\bm{x})=K\left(\frac{\bm{x}-\bm{x}(\bm{s}_{i})}{h_{n}}\right),

and $h_{n}$ is a scalar, positive bandwidth sequence satisfying $h_{n}\rightarrow 0$ as $n\rightarrow\infty$.
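A compact vectorized sketch of this estimator is given below; the Gaussian product kernel is an illustrative choice of $K$, and the common factor $1/(nh_{n}^{d})$ cancels in the ratio $\hat{\nu}/\widehat{g}$, so it is omitted.

```python
import numpy as np

def nw_estimate(x0, x, y, h):
    """Nadaraya-Watson estimate of f0 at the m points x0 (m x d), given training
    covariates x (n x d), responses y (length n), and bandwidth h."""
    u = (x0[:, None, :] - x[None, :, :]) / h          # scaled differences, shape (m, n, d)
    K = np.exp(-0.5 * np.sum(u ** 2, axis=2))         # kernel weights K_i(x), up to constants
    num = K @ y                                       # proportional to hat{nu}(x)
    den = K.sum(axis=1)                               # proportional to hat{g}(x)
    return num / den
```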

The third estimator of $f_{0}$ is based on the generalized additive model (GAM) mentioned in Example 1. That is, we assume that

f_{0}(\bm{x}(\bm{s}))=\Psi\left(\sum_{j=1}^{d}g_{j}(x_{j}(\bm{s}))\right),

where $g_{j}(\cdot):[0,1]\rightarrow\mathbb{R}$ and $\Psi(\cdot):\mathbb{R}\rightarrow\mathbb{R}$ are smooth functions. In this model, spatial dependence is not assumed. The fourth estimator we consider is the linear regression (LR) model, which assumes a linear relationship between the response variable and the predictor variables, without explicitly incorporating spatial dependence.

4.2 Simulation Results

To evaluate the performance of each method, we generate an additional $m=n/10$ observations at new locations, treated as a test set. Similar to Chu et al., (2014), we adopt the mean squared estimation error (MSEE) and mean squared prediction error (MSPE) to evaluate estimation and prediction performance, where MSEE and MSPE are defined as

\textrm{MSEE}=m^{-1}\sum_{i=1}^{m}\big(\widehat{f}(\bm{x}(s_{i}))-f_{0}(\bm{x}(s_{i}))\big)^{2},\quad\textrm{MSPE}=m^{-1}\sum_{i=1}^{m}\big(\widehat{f}(\bm{x}(s_{i}))-y(s_{i})\big)^{2},

and $\widehat{f}(\bm{x}(s_{i}))$ is an estimator of $f_{0}(\bm{x}(s_{i}))$. The mean and standard deviation of MSEE and MSPE over the 100 independent replicates are summarized in Tables 1–4.
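These two criteria translate directly into code; the trivial sketch below evaluates them on a held-out test set.

```python
import numpy as np

def msee_mspe(f_hat, f0_test, y_test):
    """MSEE and MSPE on the m held-out locations; all inputs are length-m arrays
    of fitted values, true mean values, and observed responses, respectively."""
    msee = np.mean((f_hat - f0_test) ** 2)
    mspe = np.mean((f_hat - y_test) ** 2)
    return msee, mspe
```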

Tables 1 and 2 pertain to Simulation Design 1, for the fixed and expanding domains, respectively. For each combination of the sample size $n$ and the spatial dependence parameter $\rho$, we highlight in boldface the estimator that yields the smallest MSEE and MSPE. Overall, the GAM, GP-SVC, N-W, and LR methods perform similarly to each other. The proposed DNN estimator produces smaller estimation and prediction errors than the others in all cases except when $n=200$ and $\rho=0.1$ in the fixed-domain case, where GAM yields the smallest MSPE of 1.26; the MSPE produced by DNN is, however, close. Although spatial dependence has an adverse impact on performance, both estimation error and prediction error decrease as $n$ increases (and $D$ increases in the expanding-domain case), as expected.

We depict in Figure 2 the estimated mean functions $\widehat{f}(\bm{x}(s))$ obtained by our method with $n=100$ and $\rho=0.5$ over the 100 replications, along with the $95\%$ pointwise intervals for both fixed and expanding domains. Here, the $95\%$ pointwise intervals are defined as

\left(2^{-1}\big(\widehat{f}_{(2)}(\bm{x}(s_{i}))+\widehat{f}_{(3)}(\bm{x}(s_{i}))\big),\ 2^{-1}\big(\widehat{f}_{(97)}(\bm{x}(s_{i}))+\widehat{f}_{(98)}(\bm{x}(s_{i}))\big)\right),\quad i=1,2,\ldots,n,

where $\widehat{f}_{(k)}(\bm{x}(s_{i}))$ is the $k$th smallest value of $\{\widehat{f}_{[j]}(\bm{x}(s_{i})):j=1,\ldots,100\}$, and $\widehat{f}_{[j]}(\bm{x}(s_{i}))$ is the estimator of $f_{0}(\bm{x}(s_{i}))$ from the $j$th replicate.
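The interval construction above is a simple order-statistic computation; a sketch assuming exactly 100 replicates (rows) stored at each location (columns) is:

```python
import numpy as np

def pointwise_interval(f_hat_reps):
    """95% pointwise simulation interval at each location from 100 replicate
    estimates, averaging the 2nd/3rd and 97th/98th order statistics as above."""
    f_sorted = np.sort(f_hat_reps, axis=0)          # sort over replicates at each location
    lower = 0.5 * (f_sorted[1] + f_sorted[2])       # (f_(2) + f_(3)) / 2
    upper = 0.5 * (f_sorted[96] + f_sorted[97])     # (f_(97) + f_(98)) / 2
    return lower, upper
```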

Tables 3 and 4 report the results for Simulation Design 2. For both fixed and expanding domains, our method performs the best among the five methods, and N-W comes next. This is mainly because LR, GAM, and GP-SVC impose linear or additive structure on $f_{0}$ and cannot capture the complex interactions and nonlinear structure in $f_{0}$.

Table 1: Results of Simulation Design 1 with fixed domain: the averaged MSEE and MSPE over 100 replicates (with standard deviations in parentheses) of various methods with different $n$ and $\rho$.
Fixed domain $\rho=0.1$ $\rho=0.5$ $\rho=1$
$n$ MSEE MSPE MSEE MSPE MSEE MSPE
GAM 0.92 (0.58) 1.40 (0.69) 0.98 (0.78) 1.19 (0.40) 1.05 (0.87) 1.17 (0.38)
$n=100$ GP-SVC 0.87 (0.49) 1.37 (0.67) 0.92 (0.75) 1.15 (0.35) 1.01 (0.82) 1.13 (0.36)
N-W 0.89 (0.52) 1.38 (0.67) 0.94 (0.76) 1.16 (0.38) 1.03 (0.85) 1.15 (0.37)
LR 0.93 (0.59) 1.40 (0.69) 0.99 (0.80) 1.20 (0.39) 1.06 (0.86) 1.18 (0.38)
DNN 0.78 (0.41) 1.32 (0.64) 0.82 (0.71) 1.13 (0.35) 0.94 (0.79) 1.10 (0.33)
GAM 0.87 (0.50) 1.26 (0.30) 0.93 (0.72) 1.12 (0.27) 0.99 (0.77) 1.08 (0.24)
$n=200$ GP-SVC 0.81 (0.42) 1.34 (0.37) 0.88 (0.74) 1.09 (0.29) 0.95 (0.77) 1.09 (0.28)
N-W 0.84 (0.38) 1.33 (0.41) 0.91 (0.73) 1.10 (0.28) 0.93 (0.75) 1.06 (0.25)
LR 0.86 (0.50) 1.28 (0.32) 0.95 (0.74) 1.14 (0.28) 0.98 (0.76) 1.07 (0.24)
DNN 0.69 (0.32) 1.27 (0.39) 0.71 (0.66) 1.06 (0.26) 0.78 (0.68) 1.04 (0.29)
GAM 0.83 (0.47) 1.19 (0.44) 0.88 (0.68) 1.09 (0.20) 0.96 (0.66) 1.05 (0.19)
$n=300$ GP-SVC 0.77 (0.38) 1.15 (0.37) 0.86 (0.71) 1.06 (0.21) 0.91 (0.64) 1.05 (0.18)
N-W 0.80 (0.36) 1.13 (0.40) 0.88 (0.70) 1.07 (0.24) 0.92 (0.66) 1.06 (0.21)
LR 0.82 (0.46) 1.17 (0.42) 0.87 (0.67) 1.10 (0.22) 0.98 (0.69) 1.06 (0.20)
DNN 0.58 (0.27) 1.07 (0.34) 0.63 (0.55) 1.01 (0.22) 0.69 (0.55) 1.01 (0.25)
Table 2: Results of Simulation Design 1 with expanding domain (i.e., $D=10,20,30$): the averaged MSEE and MSPE over 100 replicates (with standard deviations in parentheses) of various methods with different $n$ and $\rho$.
Expanding domain $\rho=0.1$ $\rho=0.5$ $\rho=1$
$n$ MSEE MSPE MSEE MSPE MSEE MSPE
GAM 0.35 (0.23) 2.06 (0.67) 0.82 (0.38) 1.60 (0.59) 0.98 (0.59) 1.42 (0.31)
$n=100,D=10$ GP-SVC 0.33 (0.22) 2.01 (0.61) 0.78 (0.36) 1.53 (0.55) 0.94 (0.54) 1.37 (0.29)
N-W 0.38 (0.26) 2.03 (0.65) 0.81 (0.40) 1.57 (0.58) 0.96 (0.57) 1.39 (0.29)
LR 0.34 (0.22) 2.05 (0.66) 0.81 (0.36) 1.58 (0.58) 0.97 (0.58) 1.41 (0.30)
DNN 0.26 (0.19) 1.93 (0.55) 0.64 (0.37) 1.44 (0.55) 0.76 (0.44) 1.22 (0.26)
GAM 0.21 (0.14) 1.91 (0.58) 0.66 (0.34) 1.51 (0.36) 0.85 (0.44) 1.39 (0.39)
$n=200,D=20$ GP-SVC 0.18 (0.14) 1.89 (0.54) 0.61 (0.33) 1.48 (0.39) 0.81 (0.40) 1.36 (0.41)
N-W 0.20 (0.17) 1.93 (0.61) 0.63 (0.36) 1.47 (0.37) 0.88 (0.48) 1.40 (0.43)
LR 0.20 (0.13) 1.90 (0.56) 0.67 (0.35) 1.52 (0.35) 0.86 (0.45) 1.39 (0.40)
DNN 0.14 (0.11) 1.82 (0.55) 0.43 (0.28) 1.33 (0.32) 0.61 (0.37) 1.29 (0.38)
GAM 0.11 (0.07) 1.72 (0.39) 0.51 (0.29) 1.40 (0.31) 0.70 (0.31) 1.30 (0.26)
$n=300,D=30$ GP-SVC 0.13 (0.10) 1.76 (0.41) 0.56 (0.33) 1.43 (0.36) 0.74 (0.37) 1.34 (0.30)
N-W 0.13 (0.07) 1.77 (0.44) 0.53 (0.30) 1.44 (0.39) 0.69 (0.33) 1.29 (0.27)
LR 0.12 (0.07) 1.74 (0.40) 0.52 (0.31) 1.42 (0.33) 0.72 (0.33) 1.32 (0.28)
DNN 0.07 (0.09) 1.63 (0.38) 0.31 (0.23) 1.22 (0.23) 0.43 (0.31) 1.10 (0.19)
Table 3: Results of Simulation Design 2 with fixed domain: the averaged MSEE and MSPE over 100 replicates (with standard deviations in parentheses) of various methods with different $n$ and $\rho$.
Fixed domain $\rho=0.1$ $\rho=0.5$ $\rho=1$
$n$ MSEE MSPE MSEE MSPE MSEE MSPE
GAM 0.87 (0.92) 2.81 (1.19) 1.10 (2.40) 2.40 (1.53) 1.23 (1.06) 2.16 (1.10)
$n=100$ GP-SVC 0.93 (0.99) 2.93 (1.23) 1.23 (2.54) 2.59 (1.67) 1.53 (1.33) 2.37 (1.26)
N-W 0.73 (0.81) 2.70 (1.15) 1.01 (2.16) 2.30 (1.38) 1.09 (1.01) 2.09 (0.98)
LR 1.16 (0.99) 3.02 (1.25) 1.46 (2.62) 2.82 (1.76) 1.88 (1.44) 2.66 (1.39)
DNN 0.51 (0.60) 2.27 (1.09) 0.74 (1.11) 1.90 (1.02) 0.83 (1.19) 1.99 (1.17)
GAM 0.44 (0.52) 2.38 (0.58) 0.75 (0.65) 1.89 (0.42) 0.92 (0.92) 1.69 (0.47)
$n=400$ GP-SVC 0.50 (0.59) 2.43 (0.66) 0.82 (0.77) 1.94 (0.47) 1.00 (0.96) 1.80 (0.53)
N-W 0.39 (0.44) 1.99 (0.58) 0.68 (0.70) 1.73 (0.41) 0.83 (0.84) 1.56 (0.41)
LR 0.66 (0.52) 2.67 (0.71) 0.95 (0.79) 2.05 (0.55) 1.11 (0.97) 1.88 (0.57)
DNN 0.22 (0.39) 1.87 (0.49) 0.54 (0.61) 1.57 (0.37) 0.68 (0.71) 1.41 (0.37)
GAM 0.31 (0.40) 2.25 (0.53) 0.59 (0.53) 1.81 (0.36) 0.80 (0.79) 1.65 (0.36)
$n=900$ GP-SVC 0.38 (0.44) 2.29 (0.58) 0.66 (0.60) 1.88 (0.39) 0.88 (0.83) 1.73 (0.40)
N-W 0.25 (0.34) 1.86 (0.49) 0.51 (0.46) 1.70 (0.31) 0.71 (0.72) 1.52 (0.34)
LR 0.49 (0.48) 2.33 (0.66) 0.88 (0.73) 1.95 (0.44) 0.98 (0.86) 1.85 (0.42)
DNN 0.19 (0.27) 1.70 (0.42) 0.28 (0.33) 1.49 (0.29) 0.57 (0.59) 1.33 (0.34)
Table 4: Results of Simulation Design 2 with expanding domain ($D=10,20,30$): the averaged MSEE and MSPE over 100 replicates (with standard deviations in parentheses) of various methods with different $n$ and $\rho$.
Expanding domain $\rho=0.1$ $\rho=0.5$ $\rho=1$
$n$ MSEE MSPE MSEE MSPE MSEE MSPE
GAM 0.75 (0.81) 2.88 (1.04) 0.81 (0.65) 2.72 (0.94) 0.90 (0.76) 2.65 (0.88)
$n=100,D=10$ GP-SVC 0.84 (0.88) 2.93 (1.11) 0.89 (0.90) 2.80 (0.97) 0.96 (0.83) 2.71 (0.91)
N-W 0.66 (0.70) 2.71 (1.00) 0.71 (0.62) 2.66 (0.90) 0.82 (0.71) 2.59 (0.81)
LR 0.97 (0.91) 2.95 (1.14) 1.06 (0.92) 2.92 (0.98) 1.15 (0.86) 2.75 (0.95)
DNN 0.44 (0.49) 2.15 (1.02) 0.60 (1.03) 1.81 (0.99) 0.69 (1.07) 1.76 (0.94)
GAM 0.32 (0.25) 2.49 (0.44) 0.35 (0.24) 2.39 (0.46) 0.40 (0.33) 2.32 (0.51)
$n=400,D=20$ GP-SVC 0.40 (0.31) 2.55 (0.48) 0.49 (0.29) 2.44 (0.50) 0.54 (0.37) 2.40 (0.55)
N-W 0.28 (0.23) 2.35 (0.41) 0.31 (0.20) 2.33 (0.41) 0.35 (0.30) 2.27 (0.47)
LR 0.56 (0.35) 2.60 (0.54) 0.63 (0.33) 2.49 (0.49) 0.71 (0.41) 2.46 (0.53)
DNN 0.18 (0.20) 2.20 (0.38) 0.24 (0.21) 2.29 (0.38) 0.29 (0.24) 2.20 (0.41)
GAM 0.24 (0.29) 2.28 (0.33) 0.27 (0.16) 2.25 (0.27) 0.31 (0.17) 2.26 (0.25)
$n=900,D=30$ GP-SVC 0.31 (0.32) 2.31 (0.35) 0.37 (0.21) 2.17 (0.25) 0.41 (0.20) 2.29 (0.28)
N-W 0.21 (0.30) 1.82 (0.46) 0.26 (0.21) 1.61 (0.28) 0.29 (0.16) 1.50 (0.30)
LR 0.42 (0.35) 2.37 (0.38) 0.53 (0.24) 2.33 (0.29) 0.62 (0.23) 2.30 (0.31)
DNN 0.16 (0.22) 1.71 (0.30) 0.19 (0.20) 1.58 (0.25) 0.23 (0.15) 1.45 (0.28)

5 Data Example

In this section, we apply the proposed DNN method to analyze the California housing data, which are publicly available at https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html. After removing missing values, the dataset contains housing price information from $n=20433$ block groups in California from the 1990 census, where a block group on average includes 1425.5 individuals living in a geographically compact area. To be specific, the dataset comprises median house values and six covariates of interest: the median age of a house, the total number of rooms, the total number of bedrooms, the population, the total number of households, and the median income for households. Figure 3 displays the histograms of the six covariates, from which one can observe that all covariates except the median age of a house are right skewed. Thus, we first apply a logarithm transform to these five covariates and then use min-max normalization to rescale all six covariates so that the data lie within the range $[0,1]$.
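The preprocessing just described amounts to a few pandas operations, sketched below; the file name and column labels are hypothetical placeholders and should be matched to the actual data file.

```python
import numpy as np
import pandas as pd

# Hypothetical column names for the five right-skewed covariates and the sixth covariate.
skewed = ["total_rooms", "total_bedrooms", "population", "households", "median_income"]
covariates = skewed + ["housing_median_age"]

df = pd.read_csv("cal_housing.csv").dropna()      # drop block groups with missing values
df[skewed] = np.log(df[skewed])                   # logarithm transform of the skewed covariates
df[covariates] = (df[covariates] - df[covariates].min()) / (
    df[covariates].max() - df[covariates].min())  # min-max normalization to [0, 1]
```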

Figure 4 shows the spatial distribution of the five log-transformed covariates (i.e., the total number of rooms, the total number of bedrooms, population, the total number of households, and the median income for households) and the median age of a house. We also depict in Figure 5 (top panel) the map of the median house values in California. The data exhibit a clear geographical pattern. Home values in the coastal region, especially the San Francisco Bay Area and the South Coast, are higher than in other regions. Areas of high home values tend to be associated with high household income, dense population, large home size, and large households, which are clustered in the coastal region and the Central Valley. Our objective is to explore the intricate relationship between the median house value and the six covariates while taking into account their spatial autocorrelation.

Figure 3: Histograms of six covariates in California housing data example.
Figure 4: The map of six covariates in California housing data example.
Figure 5: The top panel is the map of 20433 observations and the corresponding median house value in California housing data example. The bottom panel is the estimated median house value.

As in the simulation study, we estimate the mean function $f_{0}(\cdot)$ via the five competing methods: DNN, GAM, GP-SVC, N-W, and LR. We use the same neural network architecture as in the simulation study, i.e., length $L=3$ and width $N=30$, and the dropout rate is set to 0.2 to avoid overfitting. To assess performance, we compute the out-of-sample prediction error measured by MSPE based on 10-fold cross-validation, and the results are summarized in Table 5. Consistent with the observations in the simulation study, the proposed DNN method yields much more accurate predictions than the others. The bottom panel of Figure 5 shows the estimated median house values from the DNN estimator, which exhibit a geographical pattern similar to the observations.

Table 5: Summary of the mean squared prediction error in California housing data example.
Methods GAM GP-SVC N-W LR DNN
MSPE ($\times 10^{4}$) 4.74 4.23 4.05 5.32 3.41

6 Conclusion

In this study, we have developed regression analysis for spatially dependent data using deep learning. By carefully accounting for spatial autocorrelation, our estimator is shown to be consistent. An interesting direction for future work concerns large-scale datasets. While the present study demonstrates the effectiveness of our approach through simulations, its full potential lies in applications to real-world data. An illustrative instance is the remote sensing data analyzed in Ma et al., (2022), which bring together a rich collection of spatially referenced variables with complex underlying relationships. Applying our model to such large and diverse datasets holds promise for uncovering hidden patterns and improving understanding of the underlying processes.

Furthermore, the scope of our methodology extends beyond predictive modeling to a variety of applications. As discussed earlier, interpolation and gap-filling problems in remote sensing, as well as assessments of the relative importance of covariates in ecological studies, stand to benefit from our model. Extending the framework to spatio-temporal prediction is a further natural direction, allowing for more intricate, dynamic, and temporally evolving phenomena. For a more comprehensive review, see Wikle et al., (2019).

In conclusion, regression analysis for spatially dependent data remains an active and challenging area of research. As larger datasets and more diverse problem domains continue to emerge, the proposed methodology contributes to the advancement of statistical methods for spatial analysis.

References

  • Bahdanau et al., (2014) Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Bauer and Kohler, (2019) Bauer, B. and Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist., 47(4):2261–2285.
  • Chu et al., (2014) Chu, T., Wang, H., and Zhu, J. (2014). On semiparametric inference of geostatistical models via local karhunen–loève expansion. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):817–832.
  • Cliff and Ord, (1981) Cliff, A. D. and Ord, J. K. (1981). Spatial processes: models & applications. Taylor & Francis.
  • Cloninger and Klock, (2021) Cloninger, A. and Klock, T. (2021). A deep network construction that adapts to intrinsic dimensionality beyond the domain. Neural Networks, 141:404–419.
  • Cressie and Johannesson, (2008) Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226.
  • Fan and Zhang, (1999) Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficient models. The annals of Statistics, 27(5):1491–1518.
  • Farrell et al., (2021) Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1):181–213.
  • Hastie and Tibshirani, (1986) Hastie, T. and Tibshirani, R. (1986). Generalized Additive Models. Statistical Science, 1(3):297 – 310.
  • Hastie and Tibshirani, (1993) Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Methodological), 55(4):757–779.
  • He et al., (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  • Jenish, (2012) Jenish, N. (2012). Nonparametric spatial regression under near-epoch dependence. Journal of Econometrics, 167(1):224–239.
  • Jiménez-Luna et al., (2020) Jiménez-Luna, J., Grisoni, F., and Schneider, G. (2020). Drug discovery with explainable artificial intelligence. Nature Machine Intelligence, 2(10):573–584.
  • Kim and Wang, (2021) Kim, M. and Wang, L. (2021). Generalized spatially varying coefficient models. Journal of Computational and Graphical Statistics, 30(1):1–10.
  • Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kirkwood et al., (2022) Kirkwood, C., Economou, T., Pugeault, N., and Odbert, H. (2022). Bayesian deep learning for spatial interpolation in the presence of auxiliary information. Mathematical Geosciences, pages 1–25.
  • Kohler et al., (2022) Kohler, M., Krzyżak, A., and Langer, S. (2022). Estimation of a function of low local dimensionality by deep neural networks. IEEE transactions on information theory, 68(6):4032–4042.
  • Kohler and Langer, (2021) Kohler, M. and Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249.
  • Kurisu, (2019) Kurisu, D. (2019). On nonparametric inference for spatial regression models under domain expanding and infill asymptotics. Statistics & Probability Letters, 154:108543.
  • Kurisu, (2022) Kurisu, D. (2022). Nonparametric regression for locally stationary random fields under stochastic sampling design. Bernoulli, 28(2):1250–1275.
  • Lee, (2002) Lee, L.-F. (2002). Consistency and efficiency of least squares estimation for mixed regressive, spatial autoregressive models. Econometric theory, 18(2):252–277.
  • Lee, (2004) Lee, L.-F. (2004). Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica, 72(6):1899–1925.
  • Lemhadri et al., (2021) Lemhadri, I., Ruan, F., and Tibshirani, R. (2021). Lassonet: Neural networks with feature sparsity. In International Conference on Artificial Intelligence and Statistics, pages 10–18. PMLR.
  • Li, (2022) Li, K. (2022). Variable selection for nonlinear cox regression model via deep learning. arXiv preprint arXiv:2211.09287.
  • Li et al., (2021) Li, K., Wang, F., Liu, R., Yang, F., and Shang, Z. (2021). Calibrating multi-dimensional complex ode from noisy data via deep neural networks. arXiv preprint arXiv:2106.03591.
  • Li et al., (2022) Li, K., Wang, F., and Yang, L. (2022). Deep feature screening: Feature selection for ultra high-dimensional data via deep neural networks. arXiv preprint arXiv:2204.01682.
  • Li, (2016) Li, L. (2016). Nonparametric regression on random fields with random design using wavelet method. Statistical Inference for Stochastic Processes, 19(1):51–69.
  • Liu et al., (2022) Liu, R., Boukai, B., and Shang, Z. (2022). Optimal nonparametric inference via deep neural network. Journal of Mathematical Analysis and Applications, 505(2):125561.
  • Liu et al., (2020) Liu, R., Shang, Z., and Cheng, G. (2020). On deep instrumental variables estimate.
  • Liu et al., (2021) Liu, R., Yuan, M., and Shang, Z. (2021). Online statistical inference for parameters estimation with linear-equality constraints. arXiv preprint arXiv:2105.10315.
  • Lu et al., (2020) Lu, J., Shen, Z., Yang, H., and Zhang, S. (2020). Deep network approximation for smooth functions.
  • Lu and Tjøstheim, (2014) Lu, Z. and Tjøstheim, D. (2014). Nonparametric estimation of probability density functions for irregularly observed spatial data. Journal of the American Statistical Association, 109(508):1546–1564.
  • Ma et al., (2022) Ma, T. F., Wang, F., Zhu, J., Ives, A. R., and Lewińska, K. E. (2022). Scalable semiparametric spatio-temporal regression for large data analysis. Journal of Agricultural, Biological and Environmental Statistics, pages 1–20.
  • Mateu and Jalilian, (2022) Mateu, J. and Jalilian, A. (2022). Spatial point processes and neural networks: A convenient couple. Spatial Statistics, page 100644.
  • Mu et al., (2018) Mu, J., Wang, G., and Wang, L. (2018). Estimation and inference in spatially varying coefficient models. Environmetrics, 29(1):e2485.
  • Nakada and Imaizumi, (2020) Nakada, R. and Imaizumi, M. (2020). Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. The Journal of Machine Learning Research, 21(1):7018–7055.
  • Robbins and Monro, (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.
  • Robinson, (2011) Robinson, P. M. (2011). Asymptotic theory for nonparametric regression with spatial data. Journal of Econometrics, 165(1):5–19.
  • Scardapane et al., (2017) Scardapane, S., Comminiello, D., Hussain, A., and Uncini, A. (2017). Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89.
  • Schmidt-Hieber, (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875–1897.
  • Su and Zhu, (2018) Su, W. J. and Zhu, Y. (2018). Uncertainty quantification for online learning and stochastic approximation via hierarchical incremental gradient descent. arXiv preprint arXiv:1802.04876.
  • Sun et al., (2012) Sun, Y., Li, B., and Genton, M. G. (2012). Geostatistics for large datasets. In Advances and challenges in space-time modelling of natural events, pages 55–77. Springer.
  • Wang et al., (2023) Wang, S., Cao, G., Shang, Z., and the Alzheimer's Disease Neuroimaging Initiative (2023). Deep neural network classifier for multi-dimensional functional data. Scandinavian Journal of Statistics.
  • Wikle et al., (2019) Wikle, C. K., Zammit-Mangion, A., and Cressie, N. (2019). Spatio-temporal statistics with R. CRC Press.
  • Yarotsky, (2017) Yarotsky, D. (2017). Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114.
  • Zammit-Mangion et al., (2021) Zammit-Mangion, A., Ng, T. L. J., Vu, Q., and Filippone, M. (2021). Deep compositional spatial models. Journal of the American Statistical Association, pages 1–22.
  • Zammit-Mangion and Wikle, (2020) Zammit-Mangion, A. and Wikle, C. K. (2020). Deep integro-difference equation models for spatio-temporal forecasting. Spatial Statistics, 37:100408.

7 Appendix

7.1 Notation and Definition

In this paper, all vectors are column vectors unless otherwise stated. Let $\|\bm{v}\|_{2}^{2}=\bm{v}^{\top}\bm{v}$ for any vector $\bm{v}$, and let $\|f\|_{2}=\sqrt{\int f(x)^{2}dx}$ be the $L_{2}$ norm of a real-valued function $f(x)$. For two positive sequences $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}\lesssim b_{n}$ if there exists a positive constant $c$ such that $a_{n}\leq cb_{n}$ for all $n$, and $a_{n}\asymp b_{n}$ if $c^{-1}a_{n}\leq b_{n}\leq ca_{n}$ for some constant $c>1$ and all sufficiently large $n$. Suppose that $\bm{x}=(x_{1},\ldots,x_{d})^{\top}$ is a $d$-dimensional vector. Let $|\bm{x}|=(|x_{1}|,\ldots,|x_{d}|)^{\top}$, $|\bm{x}|_{\infty}=\max_{i=1,\ldots,d}|x_{i}|$, and $|\bm{x}|_{0}=\sum_{i=1}^{d}\mathbbm{1}(x_{i}\neq 0)$. For two $d$-dimensional vectors $\bm{x}$ and $\bm{y}$, we write $\bm{x}\lesssim\bm{y}$ if $x_{i}\lesssim y_{i}$ for $i=1,\ldots,d$. Let $\lfloor x\rfloor$ be the largest integer less than $x$ and $\lceil x\rceil$ be the smallest integer greater than $x$. For a matrix $A=(a_{ij})$, let $\|A\|_{\infty}=\max_{ij}|a_{ij}|$ be the max norm of $A$ and $\|A\|_{0}$ be the number of non-zero entries of $A$. Define $\|f\|_{\infty}$ as the sup-norm of a real-valued function $f$. We use $a\wedge b$ to denote the minimum of two numbers $a$ and $b$, and $a\vee b$ the maximum.
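For readers who want to see these norms concretely, the following minimal NumPy sketch (added for illustration only; it is not part of the manuscript) evaluates the quantities defined above on a toy vector and matrix:

import numpy as np

x = np.array([1.0, -3.0, 0.0, 2.0])
A = np.array([[0.0, -4.0], [2.5, 0.0]])

l2_sq  = x @ x                    # ||x||_2^2 = x^T x, here 14.0
x_inf  = np.max(np.abs(x))        # |x|_inf = max_i |x_i|, here 3.0
x_zero = np.count_nonzero(x)      # |x|_0 = number of non-zero entries, here 3
A_inf  = np.max(np.abs(A))        # ||A||_inf = max_{ij} |a_ij| (max norm), here 4.0
A_zero = np.count_nonzero(A)      # ||A||_0 = number of non-zero entries, here 2
print(l2_sq, x_inf, x_zero, A_inf, A_zero)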

Definition 1 (Hölder smoothness).

A function $g:\mathbb{R}^{r_{0}}\to\mathbb{R}$ is said to be $(\beta,C)$-Hölder smooth for some positive constants $\beta$ and $C$ if, for every $\bm{\gamma}=(\gamma_{1},\ldots,\gamma_{r_{0}})\in\mathbb{N}^{r_{0}}$, the following two conditions hold:

\displaystyle\sup_{\bm{x}\in\mathbb{R}^{r_{0}}}\bigg{|}\frac{\partial^{\kappa}g}{\partial x_{1}^{\gamma_{1}}\ldots\partial x_{r_{0}}^{\gamma_{r_{0}}}}(\bm{x})\bigg{|}\leq C,\quad\textrm{ for }\kappa\leq\lfloor\beta\rfloor,

and

\displaystyle\bigg{|}\frac{\partial^{\kappa}g}{\partial x_{1}^{\gamma_{1}}\ldots\partial x_{r_{0}}^{\gamma_{r_{0}}}}(\bm{x})-\frac{\partial^{\kappa}g}{\partial x_{1}^{\gamma_{1}}\ldots\partial x_{r_{0}}^{\gamma_{r_{0}}}}(\widetilde{\bm{x}})\bigg{|}\leq C\|\bm{x}-\widetilde{\bm{x}}\|_{2}^{\beta-\lfloor\beta\rfloor},\quad\textrm{ for }\kappa=\lfloor\beta\rfloor\textrm{, }\bm{x},\widetilde{\bm{x}}\in\mathbb{R}^{r_{0}},

where $\kappa=\sum_{i=1}^{r_{0}}\gamma_{i}$. Moreover, we say $g$ is $(\infty,C)$-Hölder smooth if $g$ is $(\beta,C)$-Hölder smooth for all $\beta>0$.
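As a concrete illustration (an added example, not used elsewhere in the paper), take $r_{0}=1$, $g(x)=\sin(x)$, and $\beta=3/2$, so that $\lfloor\beta\rfloor=1$. Then

\sup_{x\in\mathbb{R}}|g(x)|\leq 1,\qquad\sup_{x\in\mathbb{R}}|g^{\prime}(x)|\leq 1,\qquad|g^{\prime}(x)-g^{\prime}(\widetilde{x})|\leq\min(2,|x-\widetilde{x}|)\leq 2|x-\widetilde{x}|^{1/2},

so $g$ is $(3/2,2)$-Hölder smooth; since the same argument works for every $\beta>0$, $g$ is in fact $(\infty,2)$-Hölder smooth.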

7.2 Proof of Theorem 1

The proof of Theorem 1 requires a preliminary lemma. First, we define a $\delta$-cover of a function space $\mathcal{F}$ as a set $\widetilde{\mathcal{F}}\subset\mathcal{F}$ satisfying the following property: for any $f\in\mathcal{F}$, there exists an $\widetilde{f}\in\widetilde{\mathcal{F}}$ such that $\|\widetilde{f}-f\|_{\infty}\leq\delta$. Next, we define the $\delta$-covering number of $\mathcal{F}$ as

\mathcal{N}(\delta,\mathcal{F},\|\cdot\|_{\infty})\doteq\min\{|\widetilde{\mathcal{F}}|:\widetilde{\mathcal{F}}\textrm{ is a $\delta$-cover of $\mathcal{F}$}\},

where $|\mathcal{A}|$ denotes the number of distinct elements in a set $\mathcal{A}$. In other words, $\mathcal{N}(\delta,\mathcal{F},\|\cdot\|_{\infty})$ is the minimal number of $\|\cdot\|_{\infty}$-balls of radius $\delta$ needed to cover $\mathcal{F}$.
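As an elementary example of these definitions (added for illustration only), consider the class of constant functions $\mathcal{F}_{c}=\{f\equiv c:c\in[-F,F]\}$ on $[0,1]^{d}$. The grid $\{-F+\delta,-F+3\delta,-F+5\delta,\ldots\}$ is a $\delta$-cover of $\mathcal{F}_{c}$, so

\mathcal{N}(\delta,\mathcal{F}_{c},\|\cdot\|_{\infty})\leq F/\delta+1,\qquad\log\mathcal{N}(\delta,\mathcal{F}_{c},\|\cdot\|_{\infty})\lesssim\log(F/\delta),

which mirrors the logarithmic dependence on $\delta^{-1}$ of the covering-number bound for the network class $\mathcal{F}(L,\bm{p},\tau,F)$ used in the proof of Theorem 1 below.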

Lemma 1.

Suppose that $f_{0}$ is the unknown true mean function in (1). Let $\mathcal{F}$ be a function class such that $\{f_{0}\}\cup\mathcal{F}\subset\{f:[0,1]^{d}\rightarrow[-F,F]\}$ for some $F\geq 1$. Then for all $\delta,\varepsilon\in(0,1]$ and $\widehat{f}\in\mathcal{F}$, the following inequality holds:

\displaystyle R_{n}(\widehat{f},f_{0})\leq(1+\varepsilon)\left(\inf_{\tilde{f}\in\mathcal{F}}R_{n}(\tilde{f},f_{0})+\Delta_{n}(\widehat{f})+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big{)}\right)
+(1+\varepsilon)\frac{2F^{2}}{n\varepsilon}(3\log\mathcal{N}+1)(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1),

where $\mathcal{N}=\mathcal{N}(\delta,\mathcal{F},\|\cdot\|_{\infty})$.

Proof.

Let $\Delta_{n}=\Delta_{n}(\widehat{f})$. For any fixed $f\in\mathcal{F}$, we have $\bm{E}_{f_{0}}\Big{[}n^{-1}\sum_{i=1}^{n}(y(\bm{s}_{i})-f(\bm{x}(\bm{s}_{i})))^{2}-\inf_{\tilde{f}\in\mathcal{F}}n^{-1}\sum_{i=1}^{n}(y(\bm{s}_{i})-\tilde{f}(\bm{x}(\bm{s}_{i})))^{2}\Big{]}\geq 0$. Therefore,

\displaystyle\bm{E}_{f_{0}}\Big{[}\frac{1}{n}\sum_{i=1}^{n}(y(\bm{s}_{i})-\widehat{f}(\bm{x}(\bm{s}_{i})))^{2}\Big{]}
\displaystyle\leq\bm{E}_{f_{0}}\Big{[}\frac{1}{n}\sum_{i=1}^{n}(y(\bm{s}_{i})-\widehat{f}(\bm{x}(\bm{s}_{i})))^{2}+\frac{1}{n}\sum_{i=1}^{n}(y(\bm{s}_{i})-f(\bm{x}(\bm{s}_{i})))^{2}-\inf_{\tilde{f}\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}(y(\bm{s}_{i})-\tilde{f}(\bm{x}(\bm{s}_{i})))^{2}\Big{]}
\displaystyle=\bm{E}_{f_{0}}\Big{[}\frac{1}{n}\sum_{i=1}^{n}(y(\bm{s}_{i})-{f}(\bm{x}(\bm{s}_{i})))^{2}\Big{]}+\Delta_{n}.

Let $\epsilon_{i}=e_{1}(\bm{s}_{i})+e_{2}(\bm{s}_{i})$. Furthermore, we have

\displaystyle R_{n}(\widehat{f},f_{0})=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}^{2}\big{]}
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-y(\bm{s}_{i})+y(\bm{s}_{i})-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}^{2}\big{]}
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-y(\bm{s}_{i})\big{)}^{2}+\epsilon_{i}^{2}+2\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-y(\bm{s}_{i})\big{)}\epsilon_{i}\big{]}
\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\Big{[}(y(\bm{s}_{i})-{f}(\bm{x}(\bm{s}_{i})))^{2}-\epsilon_{i}^{2}+2\widehat{f}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\Big{]}+\Delta_{n}
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}(y(\bm{s}_{i})-f_{0}(\bm{x}(\bm{s}_{i}))+f_{0}(\bm{x}(\bm{s}_{i}))-{f}(\bm{x}(\bm{s}_{i})))^{2}-\epsilon_{i}^{2}+2\widehat{f}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\big{]}+\Delta_{n}
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}(f_{0}(\bm{x}(\bm{s}_{i}))-{f}(\bm{x}(\bm{s}_{i})))^{2}+2\widehat{f}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\big{]}+\Delta_{n}. (8)

Next, we find an upper bound for $\bm{E}_{f_{0}}\big{[}\frac{2}{n}\sum_{i=1}^{n}\widehat{f}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\big{]}$. By the definition of the $\delta$-cover of the function space $\mathcal{F}$ and the $\delta$-covering number, we denote the centers of the balls by $f_{j}$, $j=1,2,\ldots,\mathcal{N}$, and there exists an $f_{j*}$ such that $\|\widehat{f}-f_{j*}\|_{\infty}\leq\delta$. Together with the fact that $\bm{E}\big{[}f_{0}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\big{]}=0$, we have

\displaystyle\bm{E}\big{[}\frac{2}{n}\sum_{i=1}^{n}\widehat{f}(\bm{x}(\bm{s}_{i}))\epsilon_{i}\big{]}
\displaystyle=\bm{E}\big{[}\frac{2}{n}\sum_{i=1}^{n}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-f_{j*}(\bm{x}(\bm{s}_{i}))+f_{j*}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}\epsilon_{i}\big{]}
\displaystyle\leq\bm{E}\Big{|}\frac{2}{n}\sum_{i=1}^{n}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-f_{j*}(\bm{x}(\bm{s}_{i}))\big{)}\epsilon_{i}\Big{|}+\bm{E}\Big{|}\frac{2}{n}\sum_{i=1}^{n}\big{(}f_{j*}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}\epsilon_{i}\Big{|}
\displaystyle\leq\frac{2\delta}{n}\bm{E}\Big{[}\sum_{i=1}^{n}\big{|}\epsilon_{i}\big{|}\Big{]}+\frac{2}{n}\bm{E}\Big{|}\sum_{i=1}^{n}\big{(}f_{j*}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}\epsilon_{i}\Big{|}
\displaystyle\doteq T_{1}+T_{2}. (9)

It is easy to see that $T_{1}\leq 2\delta(n^{-1}\operatorname{tr}\Gamma_{n}+\sigma)$. For the second term $T_{2}$, notice that, conditionally on $\{\bm{x}(\bm{s}_{1}),\ldots,\bm{x}(\bm{s}_{n})\}$,

\eta_{j}\doteq\frac{\sum_{i=1}^{n}\big{(}f_{j}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}\epsilon_{i}}{\sqrt{\bm{a}_{j}^{\top}\Gamma_{n}\bm{a}_{j}+n\sigma^{2}\|f_{j}-f_{0}\|_{n}^{2}}}

follows $\mathcal{N}(0,1)$, where $\bm{a}_{j}=(f_{j}(\bm{x}(\bm{s}_{1}))-f_{0}(\bm{x}(\bm{s}_{1})),\ldots,f_{j}(\bm{x}(\bm{s}_{n}))-f_{0}(\bm{x}(\bm{s}_{n})))^{\top}$ and $\|f\|_{n}^{2}=n^{-1}\sum_{i=1}^{n}f(\bm{x}(\bm{s}_{i}))^{2}$. From Lemma C.1 of Schmidt-Hieber, (2020), $\bm{E}_{f_{0}}\big{[}\max_{j=1,\ldots,\mathcal{N}}\eta_{j}^{2}\,|\,\bm{x}(\bm{s}_{1}),\ldots,\bm{x}(\bm{s}_{n})\big{]}\leq 3\log\mathcal{N}+1$. Consequently,

\displaystyle T_{2}=\frac{2}{n}\bm{E}\Big{|}\eta_{j*}\sqrt{\bm{a}_{j*}^{\top}\Gamma_{n}\bm{a}_{j*}+n\sigma^{2}\|f_{j*}-f_{0}\|_{n}^{2}}\Big{|}
\displaystyle\leq\frac{2}{n}\bm{E}\Big{(}|\eta_{j*}|\sqrt{(\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2})\|f_{j*}-f_{0}\|_{n}^{2}}\Big{)}
\displaystyle\leq\frac{2\sqrt{\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2}}}{n}\bm{E}\Big{(}|\eta_{j*}|(\|\hat{f}-f_{0}\|_{n}+\delta)\Big{)}
\displaystyle\leq\frac{2\sqrt{\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2}}}{n}\sqrt{3\log\mathcal{N}+1}\Big{(}\sqrt{R_{n}(\widehat{f},f_{0})}+\delta\Big{)}.

Together with (8) and (9), we have

\displaystyle R_{n}(\widehat{f},f_{0})\leq R_{n}(f,f_{0})+\Delta_{n}+2\delta(n^{-1}\operatorname{tr}\Gamma_{n}+\sigma)+\frac{2\sqrt{\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2}}}{n}\sqrt{3\log\mathcal{N}+1}\Big{(}\sqrt{R_{n}(\widehat{f},f_{0})}+\delta\Big{)}.

If $\log\mathcal{N}\leq n$, then

\displaystyle R_{n}(\widehat{f},f_{0})\leq R_{n}(f,f_{0})+\Delta_{n}+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+\sigma+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}}\Big{)}
+\frac{2\sqrt{\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2}}}{n}\sqrt{3\log\mathcal{N}+1}\sqrt{R_{n}(\widehat{f},f_{0})}.

Applying the inequality (43) in Schmidt-Hieber, (2020), we have, for any $0<\varepsilon\leq 1$,

\displaystyle R_{n}(\widehat{f},f_{0})\leq(1+\varepsilon)\left(R_{n}(f,f_{0})+\Delta_{n}+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+\sigma+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}}\Big{)}\right)
+\frac{(1+\varepsilon)^{2}}{\varepsilon}\frac{1}{n^{2}}(3\log\mathcal{N}+1)(\operatorname{tr}(\Gamma_{n}^{2})+n\sigma^{2})
\displaystyle\leq(1+\varepsilon)\left(R_{n}(f,f_{0})+\Delta_{n}+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big{)}\right)
+(1+\varepsilon)\frac{2F^{2}}{n\varepsilon}(3\log\mathcal{N}+1)(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1).

For $\log\mathcal{N}>n$, $R_{n}(\widehat{f},f_{0})=\frac{1}{n}\sum_{i=1}^{n}\bm{E}_{f_{0}}\big{[}\big{(}\widehat{f}(\bm{x}(\bm{s}_{i}))-f_{0}(\bm{x}(\bm{s}_{i}))\big{)}^{2}\big{]}\leq 4F^{2}$ and

\displaystyle(1+\varepsilon)\left(R_{n}(f,f_{0})+\Delta_{n}+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big{)}\right)
+(1+\varepsilon)\frac{2F^{2}}{n\varepsilon}(3\log\mathcal{N}+1)(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1)
\displaystyle>\frac{2F^{2}}{n}(3n+1)>6F^{2}.

Thus,

\displaystyle R_{n}(\widehat{f},f_{0})\leq(1+\varepsilon)\left(R_{n}(f,f_{0})+\Delta_{n}+2\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big{)}\right)
+(1+\varepsilon)\frac{2F^{2}}{n\varepsilon}(3\log\mathcal{N}+1)(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1).

Since the above inequality holds for any $f\in\mathcal{F}$, the result follows by letting $f=\mathop{\mathrm{arginf}}_{\tilde{f}\in\mathcal{F}}R_{n}(\tilde{f},f_{0})$. ∎
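The maximal inequality $\bm{E}[\max_{j=1,\ldots,\mathcal{N}}\eta_{j}^{2}]\leq 3\log\mathcal{N}+1$ borrowed from Lemma C.1 of Schmidt-Hieber, (2020) in the proof above can be checked empirically. The following minimal Python sketch is illustrative only; it draws independent standard normals, whereas the $\eta_{j}$ above may be correlated, so it is a sanity check rather than a verification:

import numpy as np

# Monte Carlo check of E[max_j eta_j^2] <= 3*log(N) + 1 for standard normal eta_j.
rng = np.random.default_rng(0)
for N in [10, 100, 1000]:
    draws = rng.standard_normal((5000, N))      # 5000 replications of (eta_1, ..., eta_N)
    lhs = np.mean(np.max(draws**2, axis=1))     # empirical E[max_j eta_j^2]
    rhs = 3 * np.log(N) + 1                     # bound used in the proof of Lemma 1
    print(f"N={N:5d}  E[max eta^2] approx {lhs:6.2f}  3*log(N)+1 = {rhs:6.2f}")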

Proof of Theorem 1: It follows from Lemma 5 and Remark 1 of Schmidt-Hieber, (2020) that

\displaystyle\log\mathcal{N}=\log\mathcal{N}(\delta,\mathcal{F}(L,\bm{p},\tau,F),\|\cdot\|_{\infty})\leq(1+\tau)\log(2^{5+2L}\delta^{-1}(L+1)\tau^{2L}d^{2}).

Because $F\geq 1$ and $0<\varepsilon\leq 1$, we have

\displaystyle R_{n}(\widehat{f},f_{0})\lesssim(1+\varepsilon)\left(\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\|\tilde{f}-f_{0}\|^{2}_{\infty}+\Delta_{n}(\widehat{f})+\varsigma_{n,\varepsilon,\delta}\right),

where

\displaystyle\varsigma_{n,\varepsilon,\delta}\asymp\frac{1}{\varepsilon}\left[\delta\Big{(}n^{-1}\operatorname{tr}(\Gamma_{n})+2\sqrt{n^{-1}\operatorname{tr}(\Gamma_{n}^{2})}+3\sigma\Big{)}+\frac{\tau}{n}\left(\log(L/\delta)+L\log\tau\right)(n^{-1}\operatorname{tr}(\Gamma_{n}^{2})+\sigma^{2}+1)\right].

\square
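To get a sense of the size of the entropy term $\log\mathcal{N}$ that drives $\varsigma_{n,\varepsilon,\delta}$, the covering-number bound above can be evaluated directly. The following minimal Python sketch uses purely illustrative values of $(n,L,\tau,d)$ (they are not taken from the paper) and sets $\delta=n^{-2}$ as later chosen in the proof of Theorem 2:

import numpy as np

# Evaluate log N <= (1 + tau) * log(2^(5+2L) * delta^(-1) * (L+1) * tau^(2L) * d^2)
# on the log scale to avoid overflow; all architecture values below are hypothetical.
def log_covering_bound(L, tau, d, delta):
    return (1 + tau) * ((5 + 2 * L) * np.log(2) - np.log(delta)
                        + np.log(L + 1) + 2 * L * np.log(tau) + 2 * np.log(d))

for n in [1000, 10000]:
    for L, tau, d in [(5, 200, 10), (10, 1000, 10)]:
        delta = n ** (-2.0)
        print(f"n={n:6d} L={L:2d} tau={tau:5d} d={d:2d}  "
              f"bound on log N = {log_covering_bound(L, tau, d, delta):.3e}")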

7.3 Proof of Theorem 2

Lemma 2.

For any $f:\mathbb{R}^{d}\to\mathbb{R}$ with $f\in\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$, any $m\in\mathbb{Z}^{+}$, and any $N\geq\max_{i=0,\ldots,L_{*}}(\beta_{i}+1)^{\tilde{r}_{i}}\vee(\tilde{C}_{i}+1)e^{\tilde{r}_{i}}$, there exists a neural network

f_{*}\in\mathcal{F}(L,(d,6\eta N,\ldots,6\eta N,1),\sum_{i=0}^{L_{*}}r_{i+1}(\tau_{i}+4),\infty),

such that

\|f_{*}-f\|_{\infty}\leq C_{L_{*}}\prod_{l=0}^{L_{*}-1}(2C_{l})^{\beta_{l+1}}\sum_{i=0}^{L_{*}}\left((2\tilde{C}_{i}+1)(1+\tilde{r}^{2}_{i}+\beta_{i}^{2})6^{\tilde{r}_{i}}N2^{-m}+\tilde{C}_{i}3^{\beta_{i}}N^{-\frac{\beta_{i}}{\tilde{r}_{i}}}\right)^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1},

where

\displaystyle\tilde{C}_{i}=\sum_{k=0}^{i}C_{k}\frac{b_{k}-a_{k}}{b_{k+1}-a_{k+1}},~{}i=0,\ldots,L_{*}-1,\quad\tilde{C}_{L_{*}}=\sum_{k=0}^{L_{*}}C_{k}\frac{b_{k}-a_{k}}{b_{k+1}-a_{k+1}}+b_{L_{*}}-a_{L_{*}},
\displaystyle L=3L_{*}+\sum_{i=0}^{L_{*}}L_{i},\quad\text{with }L_{i}=8+(m+5)(1+\lceil\log_{2}(\tilde{r}_{i}\vee\beta_{i})\rceil),
\displaystyle\tau_{i}\leq 141(\tilde{r}_{i}+\beta_{i}+1)^{3+\tilde{r}_{i}}N(m+6),~{}i=0,\ldots,L_{*},
\displaystyle\eta=\max_{i=0,\ldots,L_{*}}(r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil)).
Proof.

By definition, we write $f(\bm{z})$ as

f(\bm{z})=\bm{g}_{L_{*}}\circ\ldots\circ\bm{g}_{1}\circ\bm{g}_{0}(\bm{z}),\quad\quad\textrm{ for }\bm{z}\in[a_{0},b_{0}]^{r_{0}},

where $\bm{g}_{i}=(g_{i,1},\ldots,g_{i,r_{i+1}})^{\top}:[a_{i},b_{i}]^{r_{i}}\to[a_{i+1},b_{i+1}]^{r_{i+1}}$ for some $|a_{i}|,|b_{i}|\leq C_{i}$, the functions $g_{i,j}:[a_{i},b_{i}]^{\tilde{r}_{i}}\to[a_{i+1},b_{i+1}]$ are $(\beta_{i},C_{i})$-Hölder smooth, and $r_{L_{*}+1}=1$. For $i=0,\ldots,L_{*}-1$, the domain and range of $\bm{g}_{i}$ are $[a_{i},b_{i}]^{r_{i}}$ and $[a_{i+1},b_{i+1}]^{r_{i+1}}$, respectively. First, we rewrite $f$ as a composition of functions $\bm{h}_{i}:=(h_{i,1},\ldots,h_{i,r_{i+1}})^{\top}$, constructed via linear transformations, whose domain and range are $[0,1]^{r_{i}}$ and $[0,1]^{r_{i+1}}$. That is, we define

\displaystyle\bm{h}_{i}(\bm{z}):=\frac{\bm{g}_{i}((b_{i}-a_{i})\bm{z}+a_{i})-a_{i+1}}{b_{i+1}-a_{i+1}},\quad\quad\textrm{ for }\bm{z}\in[0,1]^{r_{i}},~{}i=0,\ldots,L_{*}-1,
\displaystyle\bm{h}_{L_{*}}(\bm{z}):=\bm{g}_{L_{*}}((b_{L_{*}}-a_{L_{*}})\bm{z}+a_{L_{*}}),\quad\quad\textrm{ for }\bm{z}\in[0,1]^{r_{L_{*}}}.

Therefore the following equality holds

f(\bm{z})=\bm{g}_{L_{*}}\circ\ldots\circ\bm{g}_{1}\circ\bm{g}_{0}(\bm{z})=\bm{h}_{L_{*}}\circ\ldots\circ\bm{h}_{1}\circ\bm{h}_{0}(\frac{\bm{z}-a_{0}}{b_{0}-a_{0}}),\quad\quad\textrm{ for }\bm{z}\in[a_{0},b_{0}]^{r_{0}}.

Since the $g_{i,j}:[a_{i},b_{i}]^{\tilde{r}_{i}}\to[a_{i+1},b_{i+1}]$ are all $(\beta_{i},C_{i})$-Hölder smooth, it follows that the $h_{i,j}:[0,1]^{\tilde{r}_{i}}\to[0,1]$ are all $(\beta_{i},\tilde{C}_{i})$-Hölder smooth as well, where $\tilde{C}_{i}$ is a constant depending only on $\bm{a}$, $\bm{b}$, and $\bm{C}$; that is, $\tilde{C}_{i}=\sum_{k=0}^{i}C_{k}\frac{b_{k}-a_{k}}{b_{k+1}-a_{k+1}}$ for $i=0,\ldots,L_{*}-1$, and $\tilde{C}_{L_{*}}=\sum_{k=0}^{L_{*}}C_{k}\frac{b_{k}-a_{k}}{b_{k+1}-a_{k+1}}+b_{L_{*}}-a_{L_{*}}$.

By Theorem 5 of Schmidt-Hieber, (2020), for any integer $m\geq 1$ and $N\geq\max_{i=0,\ldots,L_{*}}(\beta_{i}+1)^{\tilde{r}_{i}}\vee(\tilde{C}_{i}+1)e^{\tilde{r}_{i}}$, there exists a network

\tilde{h}_{i,j}\in\mathcal{F}(L_{i},(\tilde{r}_{i},6(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,\ldots,6(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,1),\tau_{i},\infty),

with $L_{i}=8+(m+5)(1+\lceil\log_{2}(\tilde{r}_{i}\vee\beta_{i})\rceil)$ and $\tau_{i}\leq 141(\tilde{r}_{i}+\beta_{i}+1)^{3+\tilde{r}_{i}}N(m+6)$, such that

\|\tilde{h}_{i,j}-h_{i,j}\|_{\infty}\leq(2\tilde{C}_{i}+1)(1+\tilde{r}^{2}_{i}+\beta_{i}^{2})6^{\tilde{r}_{i}}N2^{-m}+\tilde{C}_{i}3^{\beta_{i}}N^{-\frac{\beta_{i}}{\tilde{r}_{i}}}.

Note that $\tilde{h}_{i,j}$ takes values in $(-\infty,\infty)$. We therefore define $h^{*}_{i,j}:=\sigma(-\sigma(-\tilde{h}_{i,j}+1)+1)$, appending two layers of the form $\sigma(1-x)$ to restrict $h^{*}_{i,j}$ to the interval $[0,1]$, where $\sigma(x)=\max(0,x)$. This adds two layers and four parameters to the network. Since $h_{i,j}\in[0,1]$, we have $h^{*}_{i,j}\in\mathcal{F}(L_{i}+2,(\tilde{r}_{i},6(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,\ldots,6(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,1),\tau_{i}+4,\infty)$ and

\|h^{*}_{i,j}-h_{i,j}\|_{\infty}\leq\|\tilde{h}_{i,j}-h_{i,j}\|_{\infty}\leq(2\tilde{C}_{i}+1)(1+\tilde{r}^{2}_{i}+\beta_{i}^{2})6^{\tilde{r}_{i}}N2^{-m}+\tilde{C}_{i}3^{\beta_{i}}N^{-\frac{\beta_{i}}{\tilde{r}_{i}}}.
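The two extra layers $\sigma(-\sigma(-\tilde{h}_{i,j}+1)+1)$ simply clip the network output to $[0,1]$, i.e., they compute $\min(\max(x,0),1)$ with ReLU units. A minimal NumPy sketch (added for illustration; not part of the manuscript's code) makes this explicit:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def clip_via_relu(x):
    # sigma(-sigma(-x + 1) + 1) = min(max(x, 0), 1): two extra layers, four extra parameters
    return relu(-relu(-x + 1.0) + 1.0)

x = np.array([-2.0, -0.3, 0.0, 0.4, 1.0, 3.5])
assert np.allclose(clip_via_relu(x), np.clip(x, 0.0, 1.0))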

We further parallelize all $(h^{*}_{i,j})_{j=1,\ldots,r_{i+1}}$, obtaining $\bm{h}^{*}_{i}:=(h^{*}_{i,1},\ldots,h^{*}_{i,r_{i+1}})^{\top}\in\mathcal{F}(L_{i}+2,(r_{i},6r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,\ldots,6r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil)N,r_{i+1}),r_{i+1}(\tau_{i}+4),\infty)$. Moreover, we construct the composite network $f_{*}:=\bm{h}^{*}_{L_{*}}\circ\ldots\circ\bm{h}^{*}_{1}\circ\bm{h}^{*}_{0}\in\mathcal{F}(3L_{*}+\sum_{i=0}^{L_{*}}L_{i},(r_{0},6\eta N,\ldots,6\eta N,1),\sum_{i=0}^{L_{*}}r_{i+1}(\tau_{i}+4),\infty)$, where $\eta=\max_{i=0,\ldots,L_{*}}(r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil))$.

By Lemma 3 in Schmidt-Hieber, (2020),

\displaystyle\|f-f_{*}\|_{\infty}=\|\bm{h}_{L_{*}}\circ\ldots\circ\bm{h}_{1}\circ\bm{h}_{0}-\bm{h}^{*}_{L_{*}}\circ\ldots\circ\bm{h}^{*}_{1}\circ\bm{h}^{*}_{0}\|_{\infty}
\displaystyle\leq C_{L_{*}}\prod_{l=0}^{L_{*}-1}(2C_{l})^{\beta_{l+1}}\sum_{i=0}^{L_{*}}\||\bm{h}_{i}-\bm{h}^{*}_{i}|_{\infty}\|_{\infty}^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}
\displaystyle\leq C_{L_{*}}\prod_{l=0}^{L_{*}-1}(2C_{l})^{\beta_{l+1}}\sum_{i=0}^{L_{*}}\left((2\tilde{C}_{i}+1)(1+\tilde{r}^{2}_{i}+\beta_{i}^{2})6^{\tilde{r}_{i}}N2^{-m}+\tilde{C}_{i}3^{\beta_{i}}N^{-\frac{\beta_{i}}{\tilde{r}_{i}}}\right)^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}
\displaystyle\leq C_{L_{*}}\prod_{l=0}^{L_{*}-1}(2C_{l})^{\beta_{l+1}}\sum_{i=0}^{L_{*}}((2\tilde{C}_{i}+1)(1+\tilde{r}^{2}_{i}+\beta_{i}^{2})6^{\tilde{r}_{i}}N2^{-m})^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}
+C_{L_{*}}\prod_{l=0}^{L_{*}-1}(2C_{l})^{\beta_{l+1}}\sum_{i=0}^{L_{*}}(\tilde{C}_{i}3^{\beta_{i}}N^{-\frac{\beta_{i}}{\tilde{r}_{i}}})^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}.
∎

Proof of Theorem 2: By Theorem 1 with $\delta=n^{-2}$ and $\varepsilon=1$, it follows that

R_{n}(\widehat{f}_{\textrm{local}},f_{0})\lesssim\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\|\tilde{f}-f_{0}\|^{2}_{\infty}+\frac{(\operatorname{tr}(\Gamma_{n}^{2})+n)\tau(\log(Ln^{2})+L\log{\tau})}{n^{2}}+\Delta_{n}(\widehat{f}_{\textrm{local}}).

Next, we analyze the first term. Since $f_{0}\in\mathcal{CS}(L_{*},\bm{r},\bm{\tilde{r}},\bm{\beta},\bm{a},\bm{b},\bm{C})$, by Lemma 2, for any $m>0$, there exists a neural network

f_{*}\in\mathcal{F}(L,(d,N,\ldots,N,1),\tau,\infty),

with $L\asymp m$, $N\geq 6\eta\max_{i=0,\ldots,L_{*}}(\beta_{i}+1)^{\tilde{r}_{i}}\vee(\tilde{C}_{i}+1)e^{\tilde{r}_{i}}$, $\eta=\max_{i=0,\ldots,L_{*}}(r_{i+1}(\tilde{r}_{i}+\lceil\beta_{i}\rceil))$, and $\tau\lesssim mN$, such that

\displaystyle\|f_{*}-f_{0}\|_{\infty}\lesssim\sum_{i=0}^{L_{*}}(N2^{-m})^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}+(N^{-\frac{\beta_{i}}{\tilde{r}_{i}}})^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}
\displaystyle\lesssim\sum_{i=0}^{L_{*}}(N2^{-m})^{\prod_{l=i+1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{\beta_{i}^{*}}{\tilde{r}_{i}}}
\displaystyle\lesssim(N2^{-m})^{\prod_{l=1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{\beta^{*}}{r^{*}}}, (10)

where we recall that $\beta^{*}=\beta_{i^{*}}^{*}$ and $r^{*}=\tilde{r}_{i^{*}}$. For simplicity, we let $\bm{p}=(d,N,\ldots,N,1)$. This means there exists a sequence of networks $(f_{n})_{n}$ such that, for all sufficiently large $n$, $\|f_{n}-f_{0}\|_{\infty}\lesssim(N2^{-m})^{\prod_{l=1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{\beta^{*}}{r^{*}}}$ and $f_{n}\in\mathcal{F}(L,\bm{p},\tau,\infty)$. Next, define $\grave{f}:=f_{n}(\|f_{0}\|_{\infty}/\|f_{n}\|_{\infty}\wedge 1)\in\mathcal{F}(L,\bm{p},\tau,F)$ with $F\geq\max_{i=0,\ldots,L_{*}}(C_{i},1)$; it is then clear that $\|\grave{f}-f_{0}\|_{\infty}\lesssim(N2^{-m})^{\prod_{l=1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{\beta^{*}}{r^{*}}}$. Then it follows that

\inf_{\tilde{f}\in\mathcal{F}(L,\bm{p},\tau,F)}\|\tilde{f}-f_{0}\|_{\infty}\lesssim\|\grave{f}-f_{0}\|_{\infty}\lesssim(N2^{-m})^{\prod_{l=1}^{L_{*}}\beta_{l}\wedge 1}+N^{-\frac{\beta^{*}}{r^{*}}}. (11)

By combining (11) with the risk bound displayed at the beginning of this proof and the fact that $\tau\lesssim LN$, the proof is completed. \square
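For intuition about the final rate, the two error terms in (10)–(11) can be compared numerically. The following minimal Python sketch uses hypothetical smoothness values $\beta^{*}$, $r^{*}$, and $\prod_{l}\beta_{l}\wedge 1$ (chosen for illustration, not taken from the paper) and shows that taking $m$ of order $\log_{2}N$ makes the $(N2^{-m})$-term negligible relative to $N^{-\beta^{*}/r^{*}}$:

import numpy as np

# Compare (N * 2^(-m))^(prod beta_l ^ 1) with N^(-beta*/r*) for hypothetical smoothness values.
beta_star, r_star, prod_beta = 2.0, 3.0, 1.0   # assumed for illustration only
for N in [16, 64, 256, 1024]:
    m = int(np.ceil((1 + beta_star / r_star) * np.log2(N)))   # depth parameter of order log2(N)
    term1 = (N * 2.0 ** (-m)) ** prod_beta
    term2 = N ** (-beta_star / r_star)
    print(f"N={N:5d} m={m:3d}  (N 2^-m)^prod = {term1:.2e}  N^(-beta*/r*) = {term2:.2e}")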