Semiparametric Regression for Spatial Data via Deep Learning
Abstract
In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. To be specific, we use a sparsely connected deep neural network with the rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under some mild conditions, the estimator is proven to be consistent, and the rate of convergence is determined by three factors: (1) the architecture of the neural network class, (2) the smoothness and (intrinsic) dimension of the true mean function, and (3) the magnitude of the spatial dependence. Our method can handle large data sets well owing to the stochastic gradient descent optimization algorithm. Simulation studies on synthetic data are conducted to assess the finite-sample performance, the results of which indicate that the proposed method is capable of picking up the intricate relationship between response and covariates. Finally, a real data analysis is provided to demonstrate the validity and effectiveness of the proposed method.
Keywords: Semiparametric regression; Spatially dependent data; Deep Neural Networks; Stochastic gradient descent.
1 Introduction
With recent advances in remote sensing technology and geographical sciences, there has been a considerable interest in modeling spatially referenced data. The purpose of this paper is to develop new methodology that captures complex structures in such data via deep neural networks and Gaussian random fields. In addition, we provide a theoretical understanding of deep neural networks for spatially dependent data.
In recent years, deep neural networks (DNNs) have made great breakthroughs in many fields, such as computer vision (He et al.,, 2016), dynamical systems (Li et al.,, 2021), natural language processing (Bahdanau et al.,, 2014), drug discovery and toxicology (Jiménez-Luna et al.,, 2020), and variable selection (Li et al.,, 2022; Li,, 2022). Besides these successful applications, there has also been great progress in the theoretical development of deep learning. Liu et al., (2020) and Schmidt-Hieber, (2020) proved that the neural network estimator achieves the optimal (up to a logarithmic factor) minimax rate of convergence. Liu et al., (2022) further removed the logarithmic term and achieved the exact optimal nonparametric convergence rate. One of the appealing features of deep neural networks is that they can circumvent the curse of dimensionality under some mild conditions.
Owing to the superior performance and theoretical guarantees of deep learning, applying deep learning to spatial data has also drawn much attention. For example, Zammit-Mangion and Wikle, (2020) fitted the integro-difference equation through convolutional neural networks and obtained probabilistic spatio-temporal forecasting. Zammit-Mangion et al., (2021) constructed a deep probabilistic architecture to model nonstationary spatial processes using warping approach. Mateu and Jalilian, (2022) used variational autoencoder generative neural networks to analyze spatio-temporal point processes. Kirkwood et al., (2022) applied Bayesian deep neural network to spatial interpolation. However, there is a lack of theoretical understanding of the aforementioned work, which we will address in this paper.
In addition, we model spatial dependence by Gaussian random fields and develop model estimation with computational efficiency. Due to technological advances in the data collection process, spatial datasets are often massive, and traditional statistical methods encounter two challenges. One challenge is the aggravated computational burden. To reduce the computational cost, various methods have been developed, such as covariance tapering, fixed rank kriging, and Gaussian Markov random fields (see Sun et al., (2012) for a comprehensive review). The other challenge is data storage and data collection. Many spatial datasets are not only big, but are also generated by different sources or in an online fashion such that the observations arrive one by one. In both cases, we cannot process the entire dataset at once. To overcome these two challenges, one can resort to stochastic gradient descent (SGD), a computationally scalable algorithm rooted in the stochastic approximation method of Robbins and Monro, (1951), which has achieved great success in many areas. Instead of evaluating the actual gradient based on an entire dataset, SGD estimates the gradient using only one observation, which makes it computationally feasible with large-scale data and streaming data. Its statistical inferential properties have also been studied by many researchers (Su and Zhu,, 2018; Liu et al.,, 2021).
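As a minimal sketch of this idea (not the exact procedure used later in the paper, which relies on the Adam variant of SGD described in Section 3.1; the learning rate and the least-squares example are placeholders), a single-observation SGD update looks as follows.

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.01, epochs=5):
    """Plain SGD for least squares: one observation per gradient update."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            resid = y[i] - X[i] @ beta      # residual for the single observation i
            beta += lr * resid * X[i]       # descent step on (y_i - x_i' beta)^2 / 2
    return beta
```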
Before proceeding, it is essential to provide a brief overview of the classical approaches for formulating spatial dependence in a spatial process. In spatially varying coefficient (SVC) models, covariates may have different effects at different locations, which is referred to as "spatial non-stationarity". Thus, SVC models allow the coefficients of the covariates to change with the locations. In other words, the spatial dependence is entirely explained by the regressors, while the disturbances are independent (see, e.g., Hastie and Tibshirani, (1993); Fan and Zhang, (1999); Mu et al., (2018); Kim and Wang, (2021)). Another class of models is based on the spatial autoregressive (SAR) models of Cliff and Ord, (1981), where the information about spatial dependence is contained in the spatial weight matrix, and the response variable at each location is assumed to be affected by a weighted average of the dependent variables from other sampled units (see Lee, (2002, 2004)). In SAR models, the spatial weight matrix is assumed to be known, which is often infeasible in practice, especially for large datasets. Obviously, both SVC and SAR models have some limitations. First, they assume that the relationship between the response and the independent variables is linear and ignore complex interactions and nonlinear structures. Second, they both involve computing the inverse of an $n \times n$ matrix, where $n$ is the number of observations, which requires $O(n^3)$ time complexity and $O(n^2)$ memory complexity. Thus, the computational burden for both SVC and SAR models is substantial.
To meet these challenges, we develop a deep learning-based semiparametric regression for spatial data. Specifically, we use a sparsely connected feedforward neural network to fit the regression model, where the spatial dependence is captured by Gaussian random fields. By assuming a compositional structure on the regression function, the consistency of the neural network estimator is guaranteed. The advantages of the proposed method are fourfold. First, we do not assume any parametric functional form for the regression function, allowing the true mean function to be nonlinear or to involve complex interactions. This is an improvement over many of the existing parametric, semiparametric, or nonparametric approaches (Hastie and Tibshirani,, 1993; Fan and Zhang,, 1999; Mu et al.,, 2018; Kim and Wang,, 2021; Lee,, 2002; Robinson,, 2011; Jenish,, 2012; Li,, 2016; Lu and Tjøstheim,, 2014; Kurisu,, 2019, 2022). Second, under some mild technical conditions, we show that the estimator is consistent. To the best of our knowledge, this is the first theoretical result on deep neural networks for spatially dependent data. Third, the convergence rate is free of the input dimension, which means our estimator does not suffer from the curse of dimensionality. Finally, owing to the appealing properties of SGD, our method is feasible for large-scale datasets and streaming data.
The remainder of the paper is organized as follows. Section 2 formulates the statistical problem and presents the deep neural network estimator. The computational aspects and theoretical properties of the estimator are given in Section 3. Section 4 evaluates the finite-sample performance of the proposed estimator by a simulation study. We apply our method to a real-world dataset in Section 5 and some concluding remarks are provided in Section 6. Technical details are provided in Appendix.
2 Model and Estimator
In this section, we first formulate the problem and then present the proposed estimator under a deep learning framework.
2.1 Model Setup
For a spatial domain of interest $\mathcal{D}$, we consider the following semiparametric spatial regression model:

$$Y(\boldsymbol{s}) = f_0\{\boldsymbol{X}(\boldsymbol{s})\} + \varepsilon(\boldsymbol{s}) + e(\boldsymbol{s}), \quad \boldsymbol{s} \in \mathcal{D}, \qquad (1)$$

where $f_0$ is an unknown mean function of interest, $\boldsymbol{X}(\boldsymbol{s})$ represents a $d$-dimensional vector of covariates at location $\boldsymbol{s}$, $\varepsilon(\boldsymbol{s})$ is a mean-zero Gaussian random field with covariance function $C(\cdot,\cdot)$, and $e(\boldsymbol{s})$ is a spatial Gaussian white noise process with mean 0 and variance $\sigma^2$. Furthermore, we assume that $\boldsymbol{X}(\boldsymbol{s})$, $\varepsilon(\boldsymbol{s})$, and $e(\boldsymbol{s})$ are independent of each other. Thus the observation $Y(\boldsymbol{s})$ comprises three components: the large-scale trend $f_0\{\boldsymbol{X}(\boldsymbol{s})\}$, the small-scale spatial variation $\varepsilon(\boldsymbol{s})$, and the measurement error $e(\boldsymbol{s})$; see, for instance, Cressie and Johannesson, (2008).
In the spatial statistics literature, it is popular to focus on predicting the hidden spatial process using the observed information (Cressie and Johannesson,, 2008). However, the primary interest of this paper is to estimate the large-scale trend $f_0$, where the relationship between the hidden spatial process and the covariates could be complex in nature. To capture such a complex relationship, we assume that $f_0$ is a composition of several functions, inspired by the characteristics of neural networks (Schmidt-Hieber,, 2020). Hölder smoothness (see Definition 1 in the Appendix) is a commonly used smoothness assumption for regression functions in the nonparametric and semiparametric literature. Thus it is natural to assume that the true mean function is a composition of Hölder smooth functions, which is formally stated in the following assumption.
Assumption 1.
The function $f_0$ has a compositional structure with parameters $(q, \boldsymbol{d}, \boldsymbol{t}, \boldsymbol{\beta}, \boldsymbol{a}, \boldsymbol{b})$, where $q \in \mathbb{N}_+$, $\boldsymbol{d} = (d_0, \ldots, d_{q+1}) \in \mathbb{N}_+^{q+2}$ with $d_0 = d$ and $d_{q+1} = 1$, $\boldsymbol{t} = (t_0, \ldots, t_q) \in \mathbb{N}_+^{q+1}$, $\boldsymbol{\beta} = (\beta_0, \ldots, \beta_q) \in \mathbb{R}_+^{q+1}$, $\boldsymbol{a} = (a_0, \ldots, a_{q+1}) \in \mathbb{R}^{q+2}$, and $\boldsymbol{b} = (b_0, \ldots, b_{q+1}) \in \mathbb{R}^{q+2}$; that is,

$$f_0(\boldsymbol{x}) = g_q \circ g_{q-1} \circ \cdots \circ g_1 \circ g_0(\boldsymbol{x}),$$

where $\mathbb{N}_+$ and $\mathbb{R}_+$ denote the sets of positive integers and positive real numbers, respectively, $g_i = (g_{i1}, \ldots, g_{i d_{i+1}})^{\top} : [a_i, b_i]^{d_i} \to [a_{i+1}, b_{i+1}]^{d_{i+1}}$ for some $|a_i|, |b_i| \le A$, and the functions $g_{ij}$ are $\beta_i$-Hölder smooth and rely only on $t_i$ variables, $i = 0, \ldots, q$.
Without loss of generality, we assume in Assumption 1. The parameter $q$ refers to the total number of layers, i.e., the number of composite functions, $\boldsymbol{d}$ gives the whole number of variables in each layer, whereas $\boldsymbol{t}$ gives the number of "active" variables in each layer. The parameter vector $\boldsymbol{\beta}$ pertains to the Hölder smoothness in each layer, while $\boldsymbol{a}$ and $\boldsymbol{b}$ define the domain of $g_i$ in the $i$th layer. For the rest of the paper, we will use $\mathcal{CS}(q, \boldsymbol{d}, \boldsymbol{t}, \boldsymbol{\beta}, \boldsymbol{a}, \boldsymbol{b})$ to denote the class of functions that have a compositional structure as specified in Assumption 1 with parameters $(q, \boldsymbol{d}, \boldsymbol{t}, \boldsymbol{\beta}, \boldsymbol{a}, \boldsymbol{b})$.
It is worth mentioning that Assumption 1 is commonly adopted in deep learning literature; see Bauer and Kohler, (2019), Schmidt-Hieber, (2020), Kohler and Langer, (2021), Liu et al., (2020), Li et al., (2021), among others. This compositional structure covers a wide range of function classes including the generalized additive model. In Example 1 and Figure 1, we present an illustrative example to help readers understand the concept of compositional structure within the context of the generalized additive model.
Example 1.
The generalized additive model is a generalized linear model with a linear predictor involving a sum of smooth functions of covariates (Hastie and Tibshirani,, 1986). Suppose $f_0(\boldsymbol{x}) = h\{h_1(x_1) + \cdots + h_d(x_d)\}$, where $h$ is $\gamma$-Hölder smooth and $h_1, \ldots, h_d$ are $\gamma'$-Hölder smooth, for some $\gamma, \gamma' > 0$. Clearly, $f_0$ can be written as a composition of three functions $f_0 = g_2 \circ g_1 \circ g_0$ with $g_0(\boldsymbol{x}) = \{h_1(x_1), \ldots, h_d(x_d)\}^{\top}$, $g_1(\boldsymbol{y}) = y_1 + \cdots + y_d$, and $g_2 = h$. Here, $q = 2$, $\boldsymbol{d} = (d, d, 1, 1)$, $\boldsymbol{t} = (1, d, 1)$, and each component function inherits its Hölder smoothness from the corresponding $h_j$ or $h$.
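To make Example 1 concrete, the sketch below builds a generalized-additive-style mean function as an explicit three-layer composition; the particular component functions and the choice d = 3 are illustrative stand-ins, not functions used in the paper.

```python
import numpy as np

def g0(x):
    # Layer 0: componentwise smooth transforms, each relying on a single "active" variable.
    return np.stack([np.sin(x[..., 0]), np.sqrt(np.abs(x[..., 1])), x[..., 2] ** 2], axis=-1)

def g1(z):
    # Layer 1: the additive linear predictor (sum over transformed covariates).
    return z.sum(axis=-1, keepdims=True)

def g2(u):
    # Layer 2: a smooth link applied to the scalar predictor.
    return np.tanh(u)

def f0(x):
    """Compositional mean function f0 = g2 o g1 o g0 with d = 3 covariates."""
    return g2(g1(g0(x))).squeeze(-1)

x = np.random.uniform(0.0, 1.0, size=(5, 3))  # five evaluation points, d = 3
print(f0(x))
```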
Figure 1: An illustration of the compositional structure of the generalized additive model in Example 1.
2.2 Deep Neural Network (DNN) Estimator
In this paper, we consider estimating the unknown function $f_0$ via a deep neural network owing to the complexity of $f_0$ and the flexibility of neural networks. So before presenting our main results, we first briefly review the neural network terminologies pertaining to this work.
An activation function is a nonlinear function used to learn complex patterns from data. In this paper, we focus on the Rectified Linear Unit (ReLU) shifted activation function, which is defined as $\sigma_{\boldsymbol{v}}(\boldsymbol{y}) = \{\sigma(y_1 - v_1), \ldots, \sigma(y_r - v_r)\}^{\top}$, where $\sigma(x) = \max(x, 0)$, $\boldsymbol{y} = (y_1, \ldots, y_r)^{\top}$, and $\boldsymbol{v} = (v_1, \ldots, v_r)^{\top}$ is a shift vector. The ReLU activation function enjoys both theoretical and computational advantages. The projection property $\sigma \circ \sigma = \sigma$ can facilitate the proof of consistency, while the ReLU activation function helps avoid the vanishing gradient problem. The ReLU feedforward neural network is given by
$$f(\boldsymbol{x}) = W_L \, \sigma_{\boldsymbol{v}_L} W_{L-1} \, \sigma_{\boldsymbol{v}_{L-1}} \cdots W_1 \, \sigma_{\boldsymbol{v}_1} W_0 \, \boldsymbol{x}, \qquad (2)$$
where $\{W_l \in \mathbb{R}^{p_{l+1} \times p_l} : l = 0, \ldots, L\}$ is the collection of weight matrices, $\{\boldsymbol{v}_l \in \mathbb{R}^{p_l} : l = 1, \ldots, L\}$ is the collection of so-called biases (in the neural network literature), and $\sigma_{\boldsymbol{v}_1}, \ldots, \sigma_{\boldsymbol{v}_L}$ are the ReLU shifted activation functions. Here, $L$ measures the number of hidden layers, i.e., the length of the network, while $\boldsymbol{p} = (p_0, \ldots, p_{L+1})$ gives the number of units in each layer, i.e., the width of the network. When using a ReLU feedforward neural network for the regression problem (1), we need to have $p_0 = d$ and $p_{L+1} = 1$, and the parameters that need to be estimated are the weight matrices $W_0, \ldots, W_L$ and the biases $\boldsymbol{v}_1, \ldots, \boldsymbol{v}_L$.
By definition, a ReLU feedforward neural network can be written as a composition of simple nonlinear functions; that is,

$$f(\boldsymbol{x}) = h_L \circ h_{L-1} \circ \cdots \circ h_1 \circ h_0(\boldsymbol{x}),$$

where $h_0(\boldsymbol{x}) = W_0 \boldsymbol{x}$ and $h_l(\boldsymbol{y}) = W_l \, \sigma_{\boldsymbol{v}_l}(\boldsymbol{y})$ for $l = 1, \ldots, L$. Unlike traditional function approximation theory, where a complex function is expressed as an infinite sum of simpler functions (such as Taylor series, Fourier series, Chebyshev approximation, etc.), deep neural networks approximate a complex function via composition, i.e., by composing simpler functions (Lu et al.,, 2020; Farrell et al.,, 2021; Yarotsky,, 2017). Thus, a composite function can be well approximated by a feedforward neural network, which is why we assume that the true mean function has a compositional structure.
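A minimal sketch of the feedforward map in (2), assuming plain shifted-ReLU layers and randomly drawn (dense) weights for illustration:

```python
import numpy as np

def relu_network(x, weights, shifts):
    """Evaluate f(x) = W_L sigma_{v_L}( ... W_1 sigma_{v_1}(W_0 x) ... ).

    weights: list [W_0, ..., W_L]; shifts: list [v_1, ..., v_L].
    """
    z = weights[0] @ x
    for W, v in zip(weights[1:], shifts):
        z = W @ np.maximum(z - v, 0.0)  # shifted ReLU, then the next affine map
    return z

# Tiny example: d = 3 inputs, two hidden layers of width 4, scalar output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
shifts = [rng.normal(size=4), rng.normal(size=4)]
print(relu_network(rng.normal(size=3), weights, shifts))
```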
In practice, the length and width of the networks can be extremely large, thereby easily causing overfitting. To overcome this problem, a common practice in deep learning is to randomly set some neurons to zero, which is called dropout. Therefore, it is natural to assume that the network space is sparse and that all the parameters are bounded by one, where the latter can be achieved by dividing all the weights by the maximum weight (Bauer and Kohler,, 2019; Schmidt-Hieber,, 2020; Kohler and Langer,, 2021). As such, we consider the following sparse neural network class with bounded weights
$$\mathcal{F}(L, \boldsymbol{p}, s) = \bigg\{ f \text{ of the form } (2) : \max_{0 \le l \le L} \big( \|W_l\|_{\infty} \vee \|\boldsymbol{v}_l\|_{\infty} \big) \le 1, \ \sum_{l=0}^{L} \big( \|W_l\|_{0} + \|\boldsymbol{v}_l\|_{0} \big) \le s \bigg\}, \qquad (3)$$

where $\boldsymbol{p} = (p_0, \ldots, p_{L+1})$ with $p_0 = d$ and $p_{L+1} = 1$, $\|\cdot\|_{\infty}$ and $\|\cdot\|_{0}$ denote the maximum absolute entry and the number of nonzero entries of a matrix or vector, respectively, and, by convention, $\boldsymbol{v}_0$ is a vector of zeros. This class of neural networks is also adopted in Schmidt-Hieber, (2020), Liu et al., (2020), and Li et al., (2021).
Suppose that the process is observed at a finite number of spatial locations $\boldsymbol{s}_1, \ldots, \boldsymbol{s}_n$ in $\mathcal{D}$. The desired DNN estimator of $f_0$ in Model (1) is a sparse neural network in $\mathcal{F}(L, \boldsymbol{p}, s)$ with the smallest empirical risk; that is,

$$\widehat{f}_n \in \operatorname*{arg\,min}_{f \in \mathcal{F}(L, \boldsymbol{p}, s)} \frac{1}{n} \sum_{i=1}^{n} \big[ Y(\boldsymbol{s}_i) - f\{\boldsymbol{X}(\boldsymbol{s}_i)\} \big]^2. \qquad (4)$$
For simplicity, we sometimes write for if no confusion arises.
3 Computation and Theoretical Results
In this section, we describe the computational procedure used to optimize the objective function (4) and present the main theoretical results.
3.1 Computational Aspects
Because (4) does not have an exact solution, we use a stochastic gradient descent (SGD)-based algorithm to optimize (4). In contrast to a gradient descent algorithm, which requires the full dataset to estimate the gradient in each iteration, SGD or mini-batch gradient descent only needs access to a subset of observations during each update, which makes it capable of training relatively complex models on large datasets and computationally feasible with streaming data. Despite its successful applications in machine learning and deep learning, SGD still suffers from some potential problems; for example, the rate of convergence to the minimum can be slow, and the performance is very sensitive to the tuning parameters. To circumvent these problems, various methods have been proposed, such as RMSprop and Adam (Kingma and Ba,, 2014). In this paper, we use the Adam optimizer to solve (4).
During the training process, there are many hyper-parameters to tune in our approach: the number of layers $L$, the number of neurons in each layer $\boldsymbol{p}$, the sparsity parameter $s$, and the learning rate. These hyper-parameters play an important role in the learning process. However, it is challenging to determine the values of the hyper-parameters without domain knowledge. In particular, it is challenging to control the sparsity parameter $s$ directly in the training process. Thus, we add an $\ell_1$-regularization penalty to control the number of inactive neurons in the network. The idea of adding a sparsity-inducing regularization to hidden layers in deep learning is very common; see, for instance, Scardapane et al., (2017) and Lemhadri et al., (2021). In this paper, we use cross-validation to select the tuning parameters.
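To make the training recipe concrete, here is a minimal PyTorch sketch (not the authors' exact implementation): a ReLU network with dropout, an $\ell_1$ penalty on the weights, and the Adam optimizer. The architecture sizes, learning rate, and penalty weight `lam` are placeholder values, and `X`, `y` are assumed to be torch tensors.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_net(d, width=100, depth=3, p_drop=0.2):
    """ReLU feedforward network with dropout; input dimension d, scalar output."""
    layers, in_dim = [], d
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), nn.ReLU(), nn.Dropout(p_drop)]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))
    return nn.Sequential(*layers)

def fit(X, y, epochs=200, batch_size=128, lr=1e-3, lam=1e-4):
    """Minimize the empirical risk in (4) plus an l1 penalty via Adam."""
    net = make_net(X.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            mse = ((net(xb).squeeze(-1) - yb) ** 2).mean()      # empirical risk on the batch
            l1 = sum(p.abs().sum() for p in net.parameters())   # sparsity-inducing penalty
            (mse + lam * l1).backward()
            opt.step()
    return net
```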
3.2 Theoretical Results
Recall that the exact minimizer of (4), $\widehat{f}_n$, is practically unattainable, and we use an SGD-based algorithm to minimize the objective function (4), which may converge to a local minimum. The actual estimator obtained by (approximately) minimizing (4) is denoted by $\widetilde{f}_n$. We define the difference between the expected empirical risks of $\widetilde{f}_n$ and $\widehat{f}_n$ as

$$\Delta_n(\widetilde{f}_n, \widehat{f}_n) = \mathrm{E}_{f_0}\!\left[ \frac{1}{n} \sum_{i=1}^{n} \big[ Y(\boldsymbol{s}_i) - \widetilde{f}_n\{\boldsymbol{X}(\boldsymbol{s}_i)\} \big]^2 - \frac{1}{n} \sum_{i=1}^{n} \big[ Y(\boldsymbol{s}_i) - \widehat{f}_n\{\boldsymbol{X}(\boldsymbol{s}_i)\} \big]^2 \right], \qquad (5)$$

where $\mathrm{E}_{f_0}$ stands for the expectation with respect to the true regression function $f_0$. For any estimator $f$ of $f_0$, we consider the following estimation error:

$$R_n(f, f_0) = \mathrm{E}_{f_0}\!\left[ \frac{1}{n} \sum_{i=1}^{n} \big[ f\{\boldsymbol{X}(\boldsymbol{s}_i)\} - f_0\{\boldsymbol{X}(\boldsymbol{s}_i)\} \big]^2 \right]. \qquad (6)$$
The oracle-type theorem below gives an upper bound for the estimation error.
Theorem 1.
Suppose that the unknown true mean function in (1) satisfies for some . For any and , the following oracle inequality holds:
where
and .
The convergence rate in Theorem 1 is determined by three components. The first component measures the distance between the neural network class $\mathcal{F}(L, \boldsymbol{p}, s)$ and $f_0$, i.e., the approximation error. The second term pertains to the estimation error and is owing to the difference between $\widetilde{f}_n$ and the oracle neural network estimator $\widehat{f}_n$. It is worth noting that the upper bound in Theorem 1 does not depend on the network architecture parameter $\boldsymbol{p}$, i.e., the width of the network, in that the network is sparse and its "actual" width is controlled by the sparsity parameter $s$. To see this, after removing all the inactive neurons, it is straightforward to show that the width of each hidden layer can be taken to be at most $s$ (Schmidt-Hieber,, 2020).
Next, we turn to the consistency of the DNN estimator $\widetilde{f}_n$ for $f_0$. In nonparametric regression, the estimation convergence rate is heavily affected by the smoothness of the function. Consider the class of composite functions $\mathcal{CS}(q, \boldsymbol{d}, \boldsymbol{t}, \boldsymbol{\beta}, \boldsymbol{a}, \boldsymbol{b})$. Let $\beta_i^{*} = \beta_i \prod_{k=i+1}^{q} (\beta_k \wedge 1)$ for $i = 0, \ldots, q$ and $i^{*} = \operatorname*{arg\,max}_{0 \le i \le q} t_i / (2\beta_i^{*} + t_i)$, with the convention $\prod_{k=q+1}^{q} (\beta_k \wedge 1) = 1$. Then $\beta^{*} = \beta_{i^{*}}^{*}$ and $t^{*} = t_{i^{*}}$ are known as the intrinsic smoothness and intrinsic dimension of $f_0$, respectively. Similar definitions can be found in the literature (Kurisu,, 2019; Nakada and Imaizumi,, 2020; Cloninger and Klock,, 2021; Kohler et al.,, 2022; Wang et al.,, 2023). These quantities play an important role in controlling the convergence rate of the estimator. To better understand $\beta^{*}$ and $t^{*}$, think about the composite function from the $i$th to the last layer, i.e., $g_q \circ \cdots \circ g_i$; then $\beta_i^{*}$ can be viewed as the smoothness of this function, and $i^{*}$ is the layer of the least smoothness after rescaling by the respective number of "active" variables $t_i$, $i = 0, \ldots, q$. The following theorem establishes the consistency of $\widetilde{f}_n$ as an estimator of $f_0$ and its convergence rate in the presence of spatial dependence.
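Based on the definitions just given (which follow standard conventions for compositional function classes and should be read as a sketch), the intrinsic smoothness and dimension can be computed directly from the per-layer parameters; the example values of `beta` and `t` below are hypothetical.

```python
import numpy as np

def intrinsic_smoothness(beta, t):
    """Return (beta*, t*) for a compositional function with per-layer (beta_i, t_i)."""
    beta, t = np.asarray(beta, dtype=float), np.asarray(t, dtype=float)
    q = len(beta) - 1
    beta_star = np.array([beta[i] * np.prod(np.minimum(beta[i + 1:], 1.0))
                          for i in range(q + 1)])                 # rescaled smoothness per layer
    i_star = int(np.argmax(t / (2.0 * beta_star + t)))            # least-smooth layer after rescaling
    return beta_star[i_star], t[i_star]

# Example 1 style: q = 2, t = (1, d, 1) with d = 5 and hypothetical smoothness exponents.
print(intrinsic_smoothness(beta=[2.0, 1.0, 1.5], t=[1, 5, 1]))
```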
Theorem 2.
Suppose Assumption 1 is satisfied, i.e., . Let be an estimator of . Further assume that , where , and . Then we have
where
and are constants only depending on , and .
The consistency of $\widetilde{f}_n$ can be achieved by, for instance, choosing the network architecture so that both the approximation error and the estimation error in Theorem 2 vanish as $n \to \infty$. As expected, the rate of convergence is affected by the intrinsic smoothness and intrinsic dimension of $f_0$, the architecture of the neural network $(L, \boldsymbol{p}, s)$, and the magnitude of the spatial dependence.
4 Simulation Study
In this section, we evaluate the finite sample performance of the proposed DNN estimator through a set of simulation studies. Two different simulation designs are considered, and for each design, we generate 100 independent data sets. In both settings, we use the same neural network architecture, with the same length and width. Additionally, to prevent overfitting and control the sparsity of the network, we apply dropout with a probability of 0.2.
In the first design, the spatial domain of interest is one-dimensional. To be specific, we generate data from model (1) with a specified true mean function $f_0$ and location-dependent covariates. The small-scale spatial variation $\varepsilon(\boldsymbol{s})$ is a zero-mean stationary and isotropic Gaussian process with an exponential covariance function and range parameter $\rho$. The measurement error $e(\boldsymbol{s})$ is standard normally distributed and independent of $\varepsilon(\boldsymbol{s})$. It is worth mentioning that the covariates are location dependent.
We consider two different spatial domains: a fixed domain and an expanding domain. For the fixed domain, $\mathcal{D}$ is a fixed interval, whereas for the expanding domain, the spatial domain increases with the sample size $n$. The observations are equally spaced over the region. In both scenarios, we consider several sample sizes $n$, and the spatial domain grows accordingly with $n$ in the expanding-domain case.
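A sketch of the data-generating mechanism for Design 1 (the mean function, covariate process, range parameter, and sample size below are illustrative placeholders, not the exact choices used in the simulations) is:

```python
import numpy as np

def simulate_design1(n=500, rho=0.2, expanding=False, seed=0):
    """Simulate 1-D data from model (1): Y = f0(X) + eps + e, with an exponential-covariance GP."""
    rng = np.random.default_rng(seed)
    length = n / 100.0 if expanding else 1.0                     # expanding vs. fixed domain
    s = np.linspace(0.0, length, n)                              # equally spaced locations
    x = np.sin(2 * np.pi * s) + rng.normal(scale=0.5, size=n)    # location-dependent covariate
    f0 = np.cos(np.pi * x) + 0.5 * x                             # placeholder mean function
    cov = np.exp(-np.abs(s[:, None] - s[None, :]) / rho)         # exponential covariance of eps
    eps = rng.multivariate_normal(np.zeros(n), cov)              # small-scale spatial variation
    e = rng.normal(size=n)                                       # standard normal measurement error
    return s, x, f0 + eps + e

s, x, y = simulate_design1()
```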
In the second design, the mean function $f_0$ is defined on a $d$-dimensional covariate space and is given by
(7) |
where the coefficients in (7) are randomly drawn. The covariates at each location are generated from standard normal distributions with a cross-covariate correlation of 0.5, and the covariates at different locations are assumed to be independent. We further normalize each covariate to have zero mean and unit variance. The mean function is nonlinear, featuring interactions among the covariates. We simulate the response according to (1) with $\varepsilon(\boldsymbol{s})$ and $e(\boldsymbol{s})$ similar to those in Design 1; that is, $\varepsilon(\boldsymbol{s})$ is a zero-mean stationary and isotropic Gaussian process with an exponential covariance function and range parameter $\rho$, and $e(\boldsymbol{s})$ is standard normal.
Similar to the first design, we consider two types of spatial domain: a fixed domain and an expanding domain. In both cases, the locations are equally spaced over the spatial domain.
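The covariate mechanism of Design 2 can be sketched as follows (the dimension d and sample size are placeholders): equicorrelated standard normal covariates, independent across locations, followed by standardization.

```python
import numpy as np

def simulate_covariates(n=500, d=5, corr=0.5, seed=0):
    """Standard normal covariates with cross-covariate correlation 0.5, standardized columnwise."""
    rng = np.random.default_rng(seed)
    sigma = np.full((d, d), corr) + (1.0 - corr) * np.eye(d)   # equicorrelation matrix
    X = rng.multivariate_normal(np.zeros(d), sigma, size=n)    # one row per location
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = simulate_covariates()
```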
4.1 Estimating $f_0$ via other methods
We also compare the proposed DNN estimator with various estimators in the literature. The first estimator of $f_0$ is based on the Gaussian process-based spatially varying coefficient model (GP-SVC), which is given by

$$Y(\boldsymbol{s}) = \boldsymbol{X}(\boldsymbol{s})^{\top} \boldsymbol{\beta}(\boldsymbol{s}) + e(\boldsymbol{s}),$$

and the spatially varying coefficient $\boldsymbol{\beta}(\boldsymbol{s})$ is the sum of a fixed effect and a random effect. That is, $\boldsymbol{\beta}(\boldsymbol{s}) = \boldsymbol{b} + \boldsymbol{\eta}(\boldsymbol{s})$, where $\boldsymbol{b}$ is a non-random fixed effect and $\boldsymbol{\eta}(\boldsymbol{s})$ is a zero-mean Gaussian process with an isotropic covariance function. In this work, we use the popular Matérn covariance function defined as
$$C(h) = \sigma_{\eta}^{2} \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{h}{\rho} \right)^{\nu} K_{\nu}\!\left( \frac{h}{\rho} \right), \quad h > 0,$$

where $\rho > 0$ is the range parameter, $\nu > 0$ is the smoothness parameter, and $K_{\nu}$ is the modified Bessel function of the second kind with order $\nu$.
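For reference, a direct implementation of the Matérn covariance under the parameterization written above (other common parameterizations rescale the distance by a factor involving √(2ν)) might look as follows.

```python
import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function of the second kind

def matern_cov(h, sigma2=1.0, rho=0.2, nu=1.5):
    """Matern covariance C(h) = sigma2 * 2^(1-nu)/Gamma(nu) * (h/rho)^nu * K_nu(h/rho)."""
    h = np.asarray(h, dtype=float)
    scaled = np.where(h > 0, h / rho, 1e-10)                    # avoid evaluating K_nu at 0
    c = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * scaled ** nu * kv(nu, scaled)
    return np.where(h > 0, c, sigma2)                           # C(0) = sigma2

print(matern_cov(np.array([0.0, 0.1, 0.5, 1.0])))
```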
The second estimator is the Nadaraya-Watson (N-W) kernel estimator for spatially dependent data discussed in Robinson, (2011), which considers the following spatial regression model:

$$Y(\boldsymbol{s}_i) = g\{\boldsymbol{X}(\boldsymbol{s}_i)\} + \sigma\{\boldsymbol{X}(\boldsymbol{s}_i)\} V_i, \quad i = 1, \ldots, n,$$

where $g(\cdot)$ and $\sigma^2(\cdot)$ are the mean and variance functions, respectively, and $V_1, \ldots, V_n$ are independent random variables with zero mean and unit variance. Robinson, (2011) introduces the following Nadaraya-Watson kernel estimator for $g$:

$$\widehat{g}(\boldsymbol{x}) = \frac{\sum_{i=1}^{n} K\!\big\{ (\boldsymbol{x} - \boldsymbol{X}(\boldsymbol{s}_i)) / h_n \big\} \, Y(\boldsymbol{s}_i)}{\sum_{i=1}^{n} K\!\big\{ (\boldsymbol{x} - \boldsymbol{X}(\boldsymbol{s}_i)) / h_n \big\}},$$

where $K(\cdot)$ is a kernel function and $h_n$ is a scalar, positive bandwidth sequence satisfying $h_n \to 0$ as $n \to \infty$.
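A simple sketch of this kernel estimator with a Gaussian product kernel (the kernel choice, bandwidth, and toy data are illustrative) is given below.

```python
import numpy as np

def nadaraya_watson(x0, X, y, h):
    """Nadaraya-Watson estimate of the mean at x0 with a Gaussian product kernel and bandwidth h."""
    u = (x0[None, :] - X) / h                        # (n, d) scaled differences
    w = np.exp(-0.5 * np.sum(u ** 2, axis=1))        # unnormalized Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)
print(nadaraya_watson(np.array([0.0, 0.0]), X, y, h=0.3))
```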
The third estimator of $f_0$ is based on the generalized additive model (GAM) mentioned in Example 1. That is, we assume that

$$Y(\boldsymbol{s}) = h_1\{X_1(\boldsymbol{s})\} + \cdots + h_d\{X_d(\boldsymbol{s})\} + e(\boldsymbol{s}),$$

where $h_1, \ldots, h_d$ are some smooth functions. In this model, spatial dependence is not assumed. The fourth estimator we consider is the linear regression (LR) model, which assumes a linear relationship between the response variable and the predictor variables, without incorporating spatial dependence explicitly.
4.2 Simulation Results
To evaluate the performance of each method, we generate additional observations at new locations $\boldsymbol{s}_1^{*}, \ldots, \boldsymbol{s}_m^{*}$, treated as a test set. Similar to Chu et al., (2014), we adopt the mean squared estimation error (MSEE) and the mean squared prediction error (MSPE) to evaluate the estimation and prediction performance, where MSEE and MSPE are defined as

$$\mathrm{MSEE} = \frac{1}{m} \sum_{j=1}^{m} \big[ \widehat{f}\{\boldsymbol{X}(\boldsymbol{s}_j^{*})\} - f_0\{\boldsymbol{X}(\boldsymbol{s}_j^{*})\} \big]^2, \qquad \mathrm{MSPE} = \frac{1}{m} \sum_{j=1}^{m} \big[ \widehat{Y}(\boldsymbol{s}_j^{*}) - Y(\boldsymbol{s}_j^{*}) \big]^2,$$

and $\widehat{f}$ is an estimator of $f_0$. The mean and standard deviation of MSEE and MSPE over the 100 independent replicates are summarized in Tables 1–4.
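Under the definitions written above (the exact form of the averaging is reconstructed here, so treat it as an assumption), both criteria are straightforward to compute on the held-out locations.

```python
import numpy as np

def msee(f_hat_test, f0_test):
    """Mean squared estimation error over the test locations."""
    return float(np.mean((np.asarray(f_hat_test) - np.asarray(f0_test)) ** 2))

def mspe(y_hat_test, y_test):
    """Mean squared prediction error over the test locations."""
    return float(np.mean((np.asarray(y_hat_test) - np.asarray(y_test)) ** 2))
```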
Tables 1 and 2 pertain to Simulation Design 1, for the fixed and expanding domains, respectively. For each combination of the sample size and the level of spatial dependence, we highlight in boldface the estimator that yields the smallest MSEE and MSPE. Overall, the GAM, GP-SVC, N-W, and LR methods perform similarly to each other. The proposed DNN estimator produces smaller estimation and prediction errors than the others in all cases except one fixed-domain setting, in which GAM yields the smallest MSPE of 1.26; even there, the MSPE produced by DNN is close. Although spatial dependence has an adverse impact on the performance, as $n$ increases (and the domain expands accordingly in the expanding-domain case), both the estimation error and the prediction error decrease, as expected.
We depict in Figure 2 the estimated mean functions obtained by our method from the 100 replications, along with the pointwise confidence intervals, for both fixed and expanding domains. Here, for a nominal level $1 - \alpha$, the pointwise intervals are defined as

$$\Big[ \widehat{f}^{(\lceil 100 \alpha / 2 \rceil)}(\boldsymbol{x}), \ \widehat{f}^{(\lceil 100 (1 - \alpha/2) \rceil)}(\boldsymbol{x}) \Big],$$

where $\widehat{f}^{(k)}(\boldsymbol{x})$ is the $k$th smallest value of $\widehat{f}^{[1]}(\boldsymbol{x}), \ldots, \widehat{f}^{[100]}(\boldsymbol{x})$, and $\widehat{f}^{[r]}$ is the estimator of $f_0$ from the $r$th replicate.
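A sketch of how such replicate-based pointwise bands can be computed (with generic lower and upper quantile levels) is:

```python
import numpy as np

def pointwise_band(f_hat_reps, lower=0.025, upper=0.975):
    """Pointwise band from replicated estimates.

    f_hat_reps: array of shape (n_replicates, n_grid), each row an estimated mean
    function evaluated on a common grid; returns (lower_band, upper_band).
    """
    f_hat_reps = np.asarray(f_hat_reps)
    return (np.quantile(f_hat_reps, lower, axis=0),
            np.quantile(f_hat_reps, upper, axis=0))
```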
Tables 3 and 4 report the results for Simulation Design 2. For both fixed and expanding domains, our method performs the best among the five methods, and N-W comes next. This is mainly because LR, GAM, and GP-SVC treat the covariate effects as linear or additive and cannot handle the complex interactions and nonlinear structures in $f_0$.
Fixed domain | |||||||
MSEE | MSPE | MSEE | MSPE | MSEE | MSPE | ||
GAM | 0.92 (0.58) | 1.40 (0.69) | 0.98 (0.78) | 1.19 (0.40) | 1.05 (0.87) | 1.17 (0.38) | |
GP-SVC | 0.87 (0.49) | 1.37 (0.67) | 0.92 (0.75) | 1.15 (0.35) | 1.01 (0.82) | 1.13 (0.36) | |
N-W | 0.89 (0.52) | 1.38 (0.67) | 0.94 (0.76) | 1.16 (0.38) | 1.03 (0.85) | 1.15 (0.37) | |
LR | 0.93 (0.59) | 1.40 (0.69) | 0.99 (0.80) | 1.20 (0.39) | 1.06 (0.86) | 1.18 (0.38) | |
DNN | 0.78 (0.41) | 1.32 (0.64) | 0.82 (0.71) | 1.13 (0.35) | 0.94 (0.79) | 1.10 (0.33) | |
GAM | 0.87 (0.50) | 1.26 (0.30) | 0.93 (0.72) | 1.12 (0.27) | 0.99 (0.77) | 1.08 (0.24) | |
GP-SVC | 0.81 (0.42) | 1.34 (0.37) | 0.88 (0.74) | 1.09 (0.29) | 0.95 (0.77) | 1.09 (0.28) | |
N-W | 0.84 (0.38) | 1.33 (0.41) | 0.91 (0.73) | 1.10 (0.28) | 0.93 (0.75) | 1.06 (0.25) | |
LR | 0.86 (0.50) | 1.28 (0.32) | 0.95 (0.74) | 1.14 (0.28) | 0.98 (0.76) | 1.07 (0.24) | |
DNN | 0.69 (0.32) | 1.27 (0.39) | 0.71 (0.66) | 1.06 (0.26) | 0.78 (0.68) | 1.04 (0.29) | |
GAM | 0.83 (0.47) | 1.19 (0.44) | 0.88 (0.68) | 1.09 (0.20) | 0.96 (0.66) | 1.05 (0.19) | |
GP-SVC | 0.77 (0.38) | 1.15 (0.37) | 0.86 (0.71) | 1.06 (0.21) | 0.91 (0.64) | 1.05 (0.18) | |
N-W | 0.80 (0.36) | 1.13 (0.40) | 0.88 (0.70) | 1.07 (0.24) | 0.92 (0.66) | 1.06 (0.21) | |
LR | 0.82 (0.46) | 1.17 (0.42) | 0.87 (0.67) | 1.10 (0.22) | 0.98 (0.69) | 1.06 (0.20) | |
DNN | 0.58 (0.27) | 1.07 (0.34) | 0.63 (0.55) | 1.01 (0.22) | 0.69 (0.55) | 1.01 (0.25) |
Expanding domain | |||||||
MSEE | MSPE | MSEE | MSPE | MSEE | MSPE | ||
GAM | 0.35 (0.23) | 2.06 (0.67) | 0.82 (0.38) | 1.60 (0.59) | 0.98 (0.59) | 1.42 (0.31) | |
GP-SVC | 0.33 (0.22) | 2.01 (0.61) | 0.78 (0.36) | 1.53 (0.55) | 0.94 (0.54) | 1.37 (0.29) | |
N-W | 0.38 (0.26) | 2.03 (0.65) | 0.81 (0.40) | 1.57 (0.58) | 0.96 (0.57) | 1.39 (0.29) | |
LR | 0.34 (0.22) | 2.05 (0.66) | 0.81 (0.36) | 1.58 (0.58) | 0.97 (0.58) | 1.41 (0.30) |
DNN | 0.26 (0.19) | 1.93 (0.55) | 0.64 (0.37) | 1.44 (0.55) | 0.76 (0.44) | 1.22 (0.26) | |
GAM | 0.21 (0.14) | 1.91 (0.58) | 0.66 (0.34) | 1.51 (0.36) | 0.85 (0.44) | 1.39 (0.39) | |
GP-SVC | 0.18 (0.14) | 1.89 (0.54) | 0.61 (0.33) | 1.48 (0.39) | 0.81 (0.40) | 1.36 (0.41) | |
N-W | 0.20 (0.17) | 1.93 (0.61) | 0.63 (0.36) | 1.47 (0.37) | 0.88 (0.48) | 1.40 (0.43) | |
LR | 0.20 (0.13) | 1.90 (0.56) | 0.67 (0.35) | 1.52 (0.35) | 0.86 (0.45) | 1.39 (0.40) | |
DNN | 0.14 (0.11) | 1.82 (0.55) | 0.43 (0.28) | 1.33 (0.32) | 0.61 (0.37) | 1.29 (0.38) | |
GAM | 0.11 (0.07) | 1.72 (0.39) | 0.51 (0.29) | 1.40 (0.31) | 0.70 (0.31) | 1.30 (0.26) | |
GP-SVC | 0.13 (0.10) | 1.76 (0.41) | 0.56 (0.33) | 1.43 (0.36) | 0.74 (0.37) | 1.34 (0.30) | |
N-W | 0.13 (0.07) | 1.77 (0.44) | 0.53 (0.30) | 1.44 (0.39) | 0.69 (0.33) | 1.29 (0.27) | |
LR | 0.12 (0.07) | 1.74 (0.40) | 0.52 (0.31) | 1.42 (0.33) | 0.72 (0.33) | 1.32 (0.28) | |
DNN | 0.07 (0.09) | 1.63 (0.38) | 0.31 (0.23) | 1.22 (0.23) | 0.43 (0.31) | 1.10 (0.19) |
Fixed domain | | | | | | |
MSEE | MSPE | MSEE | MSPE | MSEE | MSPE | ||
GAM | 0.87 (0.92) | 2.81 (1.19) | 1.10 (2.40) | 2.40 (1.53) | 1.23 (1.06) | 2.16 (1.10) | |
GP-SVC | 0.93 (0.99) | 2.93 (1.23) | 1.23 (2.54) | 2.59 (1.67) | 1.53 (1.33) | 2.37 (1.26) | |
N-W | 0.73 (0.81) | 2.70 (1.15) | 1.01 (2.16) | 2.30 (1.38) | 1.09 (1.01) | 2.09 (0.98) |
LR | 1.16 (0.99) | 3.02 (1.25) | 1.46 (2.62) | 2.82 (1.76) | 1.88 (1.44) | 2.66 (1.39) | |
DNN | 0.51 (0.60) | 2.27 (1.09) | 0.74 (1.11) | 1.90 (1.02) | 0.83 (1.19) | 1.99 (1.17) | |
GAM | 0.44 (0.52) | 2.38 (0.58) | 0.75 (0.65) | 1.89 (0.42) | 0.92 (0.92) | 1.69 (0.47) | |
GP-SVC | 0.50 (0.59) | 2.43 (0.66) | 0.82 (0.77) | 1.94 (0.47) | 1.00 (0.96) | 1.80 (0.53) | |
N-W | 0.39 (0.44) | 1.99 (0.58) | 0.68 (0.70) | 1.73 (0.41) | 0.83 (0.84) | 1.56 (0.41) |
LR | 0.66 (0.52) | 2.67 (0.71) | 0.95 (0.79) | 2.05 (0.55) | 1.11 (0.97) | 1.88 (0.57) | |
DNN | 0.22 (0.39) | 1.87 (0.49) | 0.54 (0.61) | 1.57 (0.37) | 0.68 (0.71) | 1.41 (0.37) | |
GAM | 0.31 (0.40) | 2.25 (0.53) | 0.59 (0.53) | 1.81 (0.36) | 0.80 (0.79) | 1.65 (0.36) | |
GP-SVC | 0.38 (0.44) | 2.29 (0.58) | 0.66 (0.60) | 1.88 (0.39) | 0.88 (0.83) | 1.73 (0.40) | |
N-W | 0.25 (0.34) | 1.86 (0.49) | 0.51 (0.46) | 1.70 (0.31) | 0.71 (0.72) | 1.52 (0.34) |
LR | 0.49 (0.48) | 2.33 (0.66) | 0.88 (0.73) | 1.95 (0.44) | 0.98 (0.86) | 1.85 (0.42) | |
DNN | 0.19 (0.27) | 1.70 (0.42) | 0.28 (0.33) | 1.49 (0.29) | 0.57 (0.59) | 1.33 (0.34)
Expanding domain | | | | | | |
MSEE | MSPE | MSEE | MSPE | MSEE | MSPE | ||
GAM | 0.75 (0.81) | 2.88 (1.04) | 0.81 (0.65) | 2.72 (0.94) | 0.90 (0.76) | 2.65 (0.88) | |
GP-SVC | 0.84 (0.88) | 2.93 (1.11) | 0.89 (0.90) | 2.80 (0.97) | 0.96 (0.83) | 2.71 (0.91) |
N-W | 0.66 (0.70) | 2.71 (1.00) | 0.71 (0.62) | 2.66 (0.90) | 0.82 (0.71) | 2.59 (0.81) |
LR | 0.97 (0.91) | 2.95 (1.14) | 1.06 (0.92) | 2.92 (0.98) | 1.15 (0.86) | 2.75 (0.95) | |
DNN | 0.44 (0.49) | 2.15 (1.02) | 0.60 (1.03) | 1.81 (0.99) | 0.69 (1.07) | 1.76 (0.94) | |
GAM | 0.32 (0.25) | 2.49 (0.44) | 0.35 (0.24) | 2.39 (0.46) | 0.40 (0.33) | 2.32 (0.51) | |
GP-SVC | 0.40 (0.31) | 2.55 (0.48) | 0.49 (0.29) | 2.44 (0.50) | 0.54 (0.37) | 2.40 (0.55) |
N-W | 0.28 (0.23) | 2.35 (0.41) | 0.31 (0.20) | 2.33 (0.41) | 0.35 (0.30) | 2.27 (0.47) | |
LR | 0.56 (0.35) | 2.60 (0.54) | 0.63 (0.33) | 2.49 (0.49) | 0.71 (0.41) | 2.46 (0.53) | |
DNN | 0.18 (0.20) | 2.20 (0.38) | 0.24 (0.21) | 2.29 (0.38) | 0.29 (0.24) | 2.20 (0.41) |
GAM | 0.24 (0.29) | 2.28 (0.33) | 0.27 (0.16) | 2.25 (0.27) | 0.31 (0.17) | 2.26 (0.25) | |
GP-SVC | 0.31 (0.32) | 2.31 (0.35) | 0.37 (0.21) | 2.17 (0.25) | 0.41 (0.20) | 2.29 (0.28) |
N-W | 0.21 (0.30) | 1.82 (0.46) | 0.26 (0.21) | 1.61 (0.28) | 0.29 (0.16) | 1.50 (0.30) | |
LR | 0.42 (0.35) | 2.37 (0.38) | 0.53 (0.24) | 2.33 (0.29) | 0.62 (0.23) | 2.30 (0.31) | |
DNN | 0.16 (0.22) | 1.71 (0.30) | 0.19 (0.20) | 1.58 (0.25) | 0.23 (0.15) | 1.45 (0.28)
5 Data Example
In this section, we use the proposed DNN method to analyze the California Housing data that are publicly available from the website https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html. After removing missing values, the dataset contains housing price information from block groups in California from the 1990 census, where a block group on average includes 1425.5 individuals living in a geographically compact area. To be specific, the dataset comprises median house values and six covariates of interest: the median age of a house, the total number of rooms, the total number of bedrooms, the population, the total number of households, and the median income for households. Figure 3 displays the histograms of the six covariates, from which one can observe that the covariates are all right-skewed except for the median age of a house. Thus, we first apply the logarithm transform to the five right-skewed covariates and then use min-max normalization to rescale all six covariates so that the data are within the range $[0, 1]$.
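A preprocessing sketch along these lines is shown below; the column names follow a commonly distributed version of this dataset and are assumptions that may differ from the file actually used.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop missing values, log-transform the right-skewed covariates, then min-max normalize."""
    df = df.dropna().copy()
    skewed = ["total_rooms", "total_bedrooms", "population", "households", "median_income"]
    df[skewed] = np.log(df[skewed])                              # log transform of skewed covariates
    covariates = ["housing_median_age"] + skewed
    df[covariates] = (df[covariates] - df[covariates].min()) / (
        df[covariates].max() - df[covariates].min())             # min-max normalization to [0, 1]
    return df
```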
Figure 4 shows the spatial distributions of the five log-transformed covariates (i.e., the total number of rooms, the total number of bedrooms, the population, the total number of households, and the median income for households) and the median age of a house. We also depict in Figure 5 (top panel) the map of the median house values in California. The data exhibit a clear geographical pattern. Home values in the coastal region, especially the San Francisco Bay Area and the South Coast, are higher than in the other regions. Areas of high home values are generally associated with high household income, dense population, large home sizes, and large households, which are clustered in the coastal region and the Central Valley. Our objective is to explore the intricate relationship between the median house value and the six covariates while taking into account their spatial autocorrelation.
Figure 3: Histograms of the six covariates.
Figure 4: Spatial distributions of the five log-transformed covariates and the median age of a house.
Figure 5: Maps of the observed (top) and estimated (bottom) median house values in California.
As in the simulation study, we estimate the mean function via five methods: DNN, GAM, GP-SVC, N-W, and LR. We use the same neural network architecture as in the simulation study, i.e., the same length and width, and the dropout rate is set to 0.2 to avoid overfitting. To assess their performance, we compute the out-of-sample prediction error measured by MSPE based on 10-fold cross-validation, and the results are summarized in Table 5. Consistent with the observations in the simulation study, the proposed DNN method yields a much more accurate prediction than the others. The bottom panel of Figure 5 shows the estimated median house values using the DNN estimator, which exhibit a similar geographical pattern to the observations.
Methods | GAM | GP-SVC | N-W | LR | DNN |
---|---|---|---|---|---|
MSPE () | 4.74 | 4.23 | 4.05 | 5.32 | 3.41 |
6 Conclusion
In this study, we have developed a deep learning-based approach to regression analysis for spatially dependent data and, by carefully accounting for spatial autocorrelation, established the consistency of the resulting estimator. An interesting direction for future work concerns large-scale datasets. While the present study demonstrates the effectiveness of our approach through simulations, its full potential lies in real-world applications. One illustrative example is the remote sensing data analyzed in Ma et al., (2022), which combine many spatially referenced variables with complex underlying relationships. Applying our model to such large and diverse datasets holds promise for uncovering hidden patterns and improving the understanding of the underlying processes.
Furthermore, the scope of our methodology extends beyond predictive modeling. As discussed earlier, interpolation and gap-filling problems in remote sensing, as well as the assessment of the relative importance of covariates in ecological studies, stand to benefit from the proposed model. Extending the framework to spatiotemporal prediction is another promising direction, allowing for more complex, dynamic, and temporally evolving phenomena; see Wikle et al., (2019) for a comprehensive review.
In conclusion, regression analysis for spatially dependent data remains an active and challenging area of research. As large datasets and diverse problem domains continue to pose new challenges, the proposed methodology contributes to the advancement of statistical methods for spatial analysis.
References
- Bahdanau et al., (2014) Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Bauer and Kohler, (2019) Bauer, B. and Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist., 47(4):2261–2285.
- Chu et al., (2014) Chu, T., Wang, H., and Zhu, J. (2014). On semiparametric inference of geostatistical models via local karhunen–loève expansion. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):817–832.
- Cliff and Ord, (1981) Cliff, A. D. and Ord, J. K. (1981). Spatial processes: models & applications. Taylor & Francis.
- Cloninger and Klock, (2021) Cloninger, A. and Klock, T. (2021). A deep network construction that adapts to intrinsic dimensionality beyond the domain. Neural Networks, 141:404–419.
- Cressie and Johannesson, (2008) Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226.
- Fan and Zhang, (1999) Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficient models. The annals of Statistics, 27(5):1491–1518.
- Farrell et al., (2021) Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1):181–213.
- Hastie and Tibshirani, (1986) Hastie, T. and Tibshirani, R. (1986). Generalized Additive Models. Statistical Science, 1(3):297 – 310.
- Hastie and Tibshirani, (1993) Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Methodological), 55(4):757–779.
- He et al., (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
- Jenish, (2012) Jenish, N. (2012). Nonparametric spatial regression under near-epoch dependence. Journal of Econometrics, 167(1):224–239.
- Jiménez-Luna et al., (2020) Jiménez-Luna, J., Grisoni, F., and Schneider, G. (2020). Drug discovery with explainable artificial intelligence. Nature Machine Intelligence, 2(10):573–584.
- Kim and Wang, (2021) Kim, M. and Wang, L. (2021). Generalized spatially varying coefficient models. Journal of Computational and Graphical Statistics, 30(1):1–10.
- Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kirkwood et al., (2022) Kirkwood, C., Economou, T., Pugeault, N., and Odbert, H. (2022). Bayesian deep learning for spatial interpolation in the presence of auxiliary information. Mathematical Geosciences, pages 1–25.
- Kohler et al., (2022) Kohler, M., Krzyżak, A., and Langer, S. (2022). Estimation of a function of low local dimensionality by deep neural networks. IEEE transactions on information theory, 68(6):4032–4042.
- Kohler and Langer, (2021) Kohler, M. and Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249.
- Kurisu, (2019) Kurisu, D. (2019). On nonparametric inference for spatial regression models under domain expanding and infill asymptotics. Statistics & Probability Letters, 154:108543.
- Kurisu, (2022) Kurisu, D. (2022). Nonparametric regression for locally stationary random fields under stochastic sampling design. Bernoulli, 28(2):1250–1275.
- Lee, (2002) Lee, L.-F. (2002). Consistency and efficiency of least squares estimation for mixed regressive, spatial autoregressive models. Econometric theory, 18(2):252–277.
- Lee, (2004) Lee, L.-F. (2004). Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica, 72(6):1899–1925.
- Lemhadri et al., (2021) Lemhadri, I., Ruan, F., and Tibshirani, R. (2021). Lassonet: Neural networks with feature sparsity. In International Conference on Artificial Intelligence and Statistics, pages 10–18. PMLR.
- Li, (2022) Li, K. (2022). Variable selection for nonlinear cox regression model via deep learning. arXiv preprint arXiv:2211.09287.
- Li et al., (2021) Li, K., Wang, F., Liu, R., Yang, F., and Shang, Z. (2021). Calibrating multi-dimensional complex ode from noisy data via deep neural networks. arXiv preprint arXiv:2106.03591.
- Li et al., (2022) Li, K., Wang, F., and Yang, L. (2022). Deep feature screening: Feature selection for ultra high-dimensional data via deep neural networks. arXiv preprint arXiv:2204.01682.
- Li, (2016) Li, L. (2016). Nonparametric regression on random fields with random design using wavelet method. Statistical Inference for Stochastic Processes, 19(1):51–69.
- Liu et al., (2022) Liu, R., Boukai, B., and Shang, Z. (2022). Optimal nonparametric inference via deep neural network. Journal of Mathematical Analysis and Applications, 505(2):125561.
- Liu et al., (2020) Liu, R., Shang, Z., and Cheng, G. (2020). On deep instrumental variables estimate.
- Liu et al., (2021) Liu, R., Yuan, M., and Shang, Z. (2021). Online statistical inference for parameters estimation with linear-equality constraints. arXiv preprint arXiv:2105.10315.
- Lu et al., (2020) Lu, J., Shen, Z., Yang, H., and Zhang, S. (2020). Deep network approximation for smooth functions.
- Lu and Tjøstheim, (2014) Lu, Z. and Tjøstheim, D. (2014). Nonparametric estimation of probability density functions for irregularly observed spatial data. Journal of the American Statistical Association, 109(508):1546–1564.
- Ma et al., (2022) Ma, T. F., Wang, F., Zhu, J., Ives, A. R., and Lewińska, K. E. (2022). Scalable semiparametric spatio-temporal regression for large data analysis. Journal of Agricultural, Biological and Environmental Statistics, pages 1–20.
- Mateu and Jalilian, (2022) Mateu, J. and Jalilian, A. (2022). Spatial point processes and neural networks: A convenient couple. Spatial Statistics, page 100644.
- Mu et al., (2018) Mu, J., Wang, G., and Wang, L. (2018). Estimation and inference in spatially varying coefficient models. Environmetrics, 29(1):e2485.
- Nakada and Imaizumi, (2020) Nakada, R. and Imaizumi, M. (2020). Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. The Journal of Machine Learning Research, 21(1):7018–7055.
- Robbins and Monro, (1951) Robbins, H. and Monro, S. (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400 – 407.
- Robinson, (2011) Robinson, P. M. (2011). Asymptotic theory for nonparametric regression with spatial data. Journal of Econometrics, 165(1):5–19.
- Scardapane et al., (2017) Scardapane, S., Comminiello, D., Hussain, A., and Uncini, A. (2017). Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89.
- Schmidt-Hieber, (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with relu activation function. Ann. Statist., 48(4):1875–1897.
- Su and Zhu, (2018) Su, W. J. and Zhu, Y. (2018). Uncertainty quantification for online learning and stochastic approximation via hierarchical incremental gradient descent. arXiv preprint arXiv:1802.04876.
- Sun et al., (2012) Sun, Y., Li, B., and Genton, M. G. (2012). Geostatistics for large datasets. In Advances and challenges in space-time modelling of natural events, pages 55–77. Springer.
- Wang et al., (2023) Wang, S., Cao, G., Shang, Z., Initiative, A. D. N., Weiner, M. W., Aisen, P., Petersen, R., Weiner, M. W., Aisen, P., Petersen, R., et al. (2023). Deep neural network classifier for multi-dimensional functional data. Scandinavian Journal of Statistics.
- Wikle et al., (2019) Wikle, C. K., Zammit-Mangion, A., and Cressie, N. (2019). Spatio-temporal statistics with R. CRC Press.
- Yarotsky, (2017) Yarotsky, D. (2017). Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114.
- Zammit-Mangion et al., (2021) Zammit-Mangion, A., Ng, T. L. J., Vu, Q., and Filippone, M. (2021). Deep compositional spatial models. Journal of the American Statistical Association, pages 1–22.
- Zammit-Mangion and Wikle, (2020) Zammit-Mangion, A. and Wikle, C. K. (2020). Deep integro-difference equation models for spatio-temporal forecasting. Spatial Statistics, 37:100408.
7 Appendix
7.1 Notation and Definition
In this paper all vectors are column vectors, unless otherwise stated. Let $\|\boldsymbol{a}\|_{\infty} = \max_j |a_j|$ for any vector $\boldsymbol{a} = (a_1, \ldots, a_m)^{\top}$, and let $\|f\|_{2}$ be the $L_2$ norm of a real-valued function $f$. For two positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ if there exists a positive constant $C$ such that $a_n \le C b_n$ for all $n$, and $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$ for sufficiently large $n$. Suppose that $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)^{\top}$ is an $m$-dimensional vector. Let $|\boldsymbol{\alpha}| = \sum_{j=1}^{m} \alpha_j$, $\boldsymbol{x}^{\boldsymbol{\alpha}} = x_1^{\alpha_1} \cdots x_m^{\alpha_m}$, and $\partial^{\boldsymbol{\alpha}} = \partial^{\alpha_1} \cdots \partial^{\alpha_m}$. For two $m$-dimensional vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\gamma}$, we write $\boldsymbol{\alpha} \le \boldsymbol{\gamma}$ if $\alpha_j \le \gamma_j$ for $j = 1, \ldots, m$. Let $\lfloor x \rfloor$ be the largest integer strictly less than $x$ and $\lceil x \rceil$ be the smallest integer greater than $x$. For a matrix $W$, let $\|W\|_{\infty}$ be the max norm of $W$ and $\|W\|_{0}$ be the number of non-zero entries of $W$. Define $\|f\|_{\infty} = \sup_{\boldsymbol{x}} |f(\boldsymbol{x})|$ as the sup-norm of a real-valued function $f$. We use $a \wedge b$ to represent the minimum of two numbers $a$ and $b$, while $a \vee b$ is the maximum of $a$ and $b$.
Definition 1 (Hölder smoothness).
A function $f : \mathcal{X} \subset \mathbb{R}^{m} \to \mathbb{R}$ is said to be $(\beta, C)$-Hölder smooth for some positive constants $\beta$ and $C$, if for every $\boldsymbol{\alpha}$ with $|\boldsymbol{\alpha}| \le \lfloor \beta \rfloor$, the following two conditions hold:

$$\sup_{\boldsymbol{x} \in \mathcal{X}} |\partial^{\boldsymbol{\alpha}} f(\boldsymbol{x})| \le C,$$

and

$$|\partial^{\boldsymbol{\alpha}} f(\boldsymbol{x}) - \partial^{\boldsymbol{\alpha}} f(\boldsymbol{y})| \le C \|\boldsymbol{x} - \boldsymbol{y}\|_{\infty}^{\beta - \lfloor \beta \rfloor} \quad \text{for all } \boldsymbol{x}, \boldsymbol{y} \in \mathcal{X} \text{ with } |\boldsymbol{\alpha}| = \lfloor \beta \rfloor,$$

where $\lfloor \beta \rfloor$ is the largest integer strictly less than $\beta$. Moreover, we say $f$ is $\infty$-Hölder smooth if $f$ is $\beta$-Hölder smooth for all $\beta > 0$.
7.2 Proof of Theorem 1
The proof of Theorem 1 requires a preliminary lemma. First, we define a $\delta$-cover of a function space $\mathcal{F}$ (with respect to a norm $\|\cdot\|$) as a set $\mathcal{F}_{\delta}$ satisfying the following property: for any $f \in \mathcal{F}$, there exists a $g \in \mathcal{F}_{\delta}$ such that $\|f - g\| \le \delta$. Next, we define the $\delta$-covering number of $\mathcal{F}$ as

$$\mathcal{N}(\delta, \mathcal{F}, \|\cdot\|) = \min\big\{ |\mathcal{F}_{\delta}| : \mathcal{F}_{\delta} \text{ is a } \delta\text{-cover of } \mathcal{F} \big\},$$

where $|\mathcal{F}_{\delta}|$ means the number of distinct elements in the set $\mathcal{F}_{\delta}$. In other words, $\mathcal{N}(\delta, \mathcal{F}, \|\cdot\|)$ is the minimal number of balls with radius $\delta$ that cover $\mathcal{F}$.
Lemma 1.
Suppose that is the unknown true mean function in (1). Let be a function class such that for some . Then for all and , the following inequality holds:
where .
Proof.
Let . For any fixed , we have . Therefore,
Let . Furthermore, we have
(8) |
Next, we will find an upper bound for . By the definition of the -cover of a function space and the -covering number, we denote the centers of the balls by , ; and there exists such that . Together with the fact that , we have
(9) |
It is easy to see that . For the second term , notice that, conditionally on ,
follows where , . From Lemma C.1 of Schmidt-Hieber, (2020), . Consequently,
For , and
Thus,
Since the above inequality holds true for any , we can prove the result by letting . ∎
7.3 Proof of Theorem 2
Lemma 2.
For any , , and , there exists a neural network
such that
where
Proof.
By definition, we write as
where for some and the functions are -Hölder smooth and . For , the domain and range of are and , respectively. First of all, we will rewrite as the composition of the functions whose domain and range are and which are constructed via linear transformation. That is, we define
Therefore the following equality holds
Since are all -Hölder smooth, it follows that are all -Hölder smooth as well, where is a constant only depending on , and , i.e., for , and .
By Theorem 5 of Schmidt-Hieber, (2020), for any integer and , there exists a network
with and , such that
Note that the value of is . So we define by adding two more layers to restrict into the interval , where . This introduces two more layers and four more parameters. By the fact that , we have and
We further parallelize all together, obtaining . Moreover, we construct the composite network , where .
Proof of Theorem 2: By Theorem 1 with and , it follows that
Next, we need to analyze the first term. Since , by Lemma 2, for any , there exists a neural network
with such that
(10) |
where recall that and . For simplicity, we let . This means there exists a sequence of networks such that for all sufficiently large , and Next define , and it is obvious that . Then it follows that
(11) |
By combining (7.3) and the fact that , the proof is completed.