
Scalable Subsampling Inference for Deep Neural Networks

Kejin Wu, Department of Mathematics, University of California, San Diego; Dimitris N. Politis, Department of Mathematics and Halicioğlu Data Science Institute, University of California, San Diego
Abstract

Deep neural networks (DNN) have received increasing attention in machine learning applications in the last several years. Recently, a non-asymptotic error bound was developed to measure the performance of the fully connected DNN estimator with ReLU activation functions for estimating regression models. The paper at hand gives a small improvement on the current error bound based on the latest results on the approximation ability of DNN. More importantly, however, a non-random subsampling technique, scalable subsampling, is applied to construct a ‘subagged’ DNN estimator. Under regularity conditions, it is shown that the subagged DNN estimator is computationally efficient without sacrificing accuracy for either estimation or prediction tasks. Beyond point estimation/prediction, we propose different approaches to build confidence and prediction intervals based on the subagged DNN estimator. In addition to being asymptotically valid, the proposed confidence/prediction intervals appear to work well in finite samples. All in all, the scalable subsampling DNN estimator offers the complete package in terms of statistical inference, i.e., (a) computational efficiency; (b) point estimation/prediction accuracy; and (c) allowing for the construction of practically useful confidence and prediction intervals.

1 Introduction

In the last several years, machine learning (ML) methods have developed rapidly, fueled by ever-increasing amounts of data and computational power. Among the different ML methods, a popular and widely-used technique is the Neural Network (NN), which models the relationship between inputs and outputs through layers of connected computational neurons. The idea of applying such a biology-analogous framework can be traced to the work of McCulloch and Pitts, (1943).

At the end of the 20th century, research focused on feed-forward Shallow Neural Networks (SNN) with sigmoid-type activation functions. An SNN has only one hidden layer but is shown to possess the universal approximation property, i.e., it can be used to approximate any Borel measurable function from one finite-dimensional space to another with any desired degree of accuracy—see Cybenko, (1989); Hornik et al., (1989) and references within. However, the practical performance of SNNs left much to be desired. In the last ten or so years, Deep Neural Networks (DNN) received increased attention due to their great empirical performance.

Although DNNs have become a state-of-the-art model, their theoretical foundation is still in development. Notably, Yarotsky, (2018); Yarotsky and Zhevnerchuk, (2020) explored the approximation ability of DNN for any function f that belongs to a Hölder Banach space; here, the sigmoid-type activation functions are replaced by ReLU-type functions to avoid the gradient vanishing problem. The aforementioned work showed that the optimal error of the DNN estimator f_{\text{DNN}} can be uniformly bounded, i.e.,

\|f-f_{\text{DNN}}\|_{\infty}=O\left(W^{-2\xi/d}\right); (1)

here, \xi is some smoothness measure of the target function f:{\bf R}^{d}\to{\bf R}—see Section 4 for a formal definition; W is the size of the neural network f_{\text{DNN}}, i.e., the total number of parameters; and d is the dimension of the function inputs.

However, the bound (1) is not useful in practice. The reason is three-fold: (a) it requires a discontinuous weight assignment to build the desired DNN, so it is not feasible to train such a DNN with the usual gradient-based methods; (b) the structure of the DNN might not be the standard fully connected form, so finding a suitable specific structure becomes another difficulty; most importantly, (c) this error bound concerns the optimal estimation achievable by a carefully designed DNN; it says nothing about what happens when a DNN estimator is applied to solve real-world problems.

For example, what is the performance of the DNN in estimating a regression function with n independent samples \{(Y_{i},\bm{X}_{i})\}_{i=1}^{n} generated from an underlying true model f? It is easy to see that the error \varepsilon of f_{\text{DNN}} in sup-norm can be arbitrarily small if we allow W to be arbitrarily large based on Eq. 1. However, this optimal performance is hardly achievable and only represents the theoretically best estimation. What we attempt to do in this paper is to determine an empirically optimal \widehat{f}_{\text{DNN}} with samples \{(Y_{i},\bm{X}_{i})\}_{i=1}^{n} and then explore its estimation and prediction inference. Guided by this spirit, one usually views \widehat{f}_{\text{DNN}} as an M-estimator and sets different loss functions for various purposes:

\widehat{f}_{\text{DNN}}\in\arg\min_{f_{\theta}\in\mathscr{F}_{\text{DNN}}}\frac{1}{n}\sum_{i=1}^{n}L(f_{\theta}(\bm{x}_{i}),y_{i}); (2)

here \mathscr{F}_{\text{DNN}} is a user-chosen space that contains all DNN candidates; L(\cdot,\cdot) is the loss function, e.g., the Mean Squared Error loss for the regression problem with real-valued output, i.e., L(f_{\theta}(\bm{x}_{i}),y_{i})=(f_{\theta}(\bm{x}_{i})-y_{i})^{2}/2; \{(y_{i},\bm{x}_{i})\}_{i=1}^{n} are realizations of \{(Y_{i},\bm{X}_{i})\}_{i=1}^{n}.

In the paper at hand, we consider DNN-based estimation and prediction inference in the data-generating model: Y_{i}=f(\bm{X}_{i})+\epsilon_{i}; here, the \epsilon_{i} are independent, identically distributed (i.i.d.) from a distribution F_{\epsilon} that has mean 0 and variance \sigma^{2}—we will use the shorthand \epsilon_{i}\sim i.i.d. (0,\sigma^{2}). Consequently, f(\bm{x}_{i})=\mathbb{E}(Y_{i}|\bm{X}_{i}=\bm{x}_{i}). Furthermore, the regression function f(\cdot) will be assumed to satisfy a smoothness condition that will be specified later. Note that the additive model with heteroscedastic error, Y_{i}=f(\bm{X}_{i})+g(\bm{X}_{i})\cdot\epsilon_{i}, can be analyzed similarly by applying two DNNs, one to estimate f(\cdot) and one for g(\cdot).

From a nonparametric regression view, it is well known that the optimal convergence rate for estimating a p-times continuously differentiable regression function of a d-dimensional argument is n^{2p/(2p+d)}—see Stone, (1982). If we assume the regression function belongs to a more general Hölder Banach space, we can define a non-integer \xi=p+s to represent the smoothness order of f; here 0<s\leq 1 is the Hölder coefficient. The optimal rate of nonparametric estimation can also be extended to such non-integer smoothness orders; see Condition 3' and Definition 2 of Kohler et al., (2023). Focusing on DNN estimation, the optimal and achievable error bound on the L_{2} norm of \widehat{f}_{\text{DNN}} is O(n^{-\xi/(\xi+d)}\cdot\log^{8}(n)) with high probability; this bound is due to Farrell et al., (2021) but the rate appears slower than the optimal rate that can be attained. Besides, although \widehat{f}_{\text{DNN}} will become more accurate as the sample size increases, training the DNN becomes very time-consuming. Moreover, it can be infeasible to load massive data into a PC or even a supercomputer, since node memory is also limited in the computation process, as pointed out by Zou et al., (2021).

In this paper, we first give a small improvement on the bound of Farrell et al., (2021) using the latest results on the DNN approximation ability. Then, we resolve the computational issue involving huge data by applying the Scalable Subsampling technique of Politis, (2021) to create a set of subsamples and then build a so-called subagging DNN estimator \overline{f}_{\text{DNN}}. Under regularity conditions, we show that the subagging DNN estimator \overline{f}_{\text{DNN}} can possess a faster convergence rate than a single DNN estimator \widehat{f}_{\text{DNN}} trained on the whole sample. Lastly, using the same set of subsamples, we can build a Confidence Interval (CI) for f based on \overline{f}_{\text{DNN}}. Due to the prevalent undercoverage phenomenon of CIs with finite samples, we propose two ideas to improve the empirical coverage rate: (1) we enlarge the CI by maximizing the margin of error in an appropriate way; (2) we take an iterated subsampling approach to build a specifically designed CI which is a combination of pivot-CI and quantile-CI. Beyond estimation inference, we also perform predictive inference (with both point and interval predictions).

The paper is organized as follows. In Section 2, we give a short introduction to the structure of DNN. In Section 3, we describe the methodology of scalable subsampling. Subsequently, the performance of the subagging DNN estimator and its associated confidence/prediction intervals are analyzed in Section 4 and Section 5. Simulations and conclusions are provided in Section 6 and Section 7, respectively. Proofs are given in Appendix: A and additional simulation studies are presented in Appendix: B and Appendix: C.

In terms of notation, we will use the following norms: \|g\|_{L_{2}(\bm{X})}:=\mathbb{E}[g(\bm{X})^{2}]^{1/2}; \|g\|_{\infty}:=\sup_{\bm{x}}|g(\bm{x})|. Also, we employ the notation a_{n}=\Theta\left(d_{n}\right) to denote “exact order”, i.e., that there exist two constants c_{1},c_{2} satisfying c_{1}\cdot c_{2}>0 and c_{1}d_{n}\leq a_{n}\leq c_{2}d_{n}. We also use \mathbb{E}_{n}[\cdot] to represent the sample average operator.

2 Standard fully connected deep neural network

For completeness, we now give a brief introduction to the fully connected feed-forward DNN with ReLU activation functions. Hereafter, we refer to DNN as the standard fully connected deep neural network with the ReLU activation function. In short, the DNN can be viewed as a parameterized family of functions. Its structure mainly depends on the input dimension d, depth L\in\mathbb{N}, width \bm{H}\in\mathbb{N}^{L} and the output dimension. The depth L describes how many hidden layers a DNN possesses; the width \bm{H}=(H_{1},\ldots,H_{L}) represents the number of neurons in each hidden layer. The feed-forward property implies that the input, hidden neurons and output are connected in an acyclic relationship. The fully connected property indicates that each hidden neuron receives information from all hidden neurons at the previous hidden layer in a functional way.

Formally, let \bm{u}_{l}=(u_{l,1},\ldots,u_{l,H_{l}})^{T} represent the vector of neurons at the l-th hidden layer for l=0,\ldots,L+1; here, \bm{u}_{0} represents the input vector (x_{1},\ldots,x_{d})^{T} and \bm{u}_{L+1} is the output. Therefore, we can pretend that the input layer and the output layer are the 0-th and (L+1)-th hidden layers, respectively. Then, u_{l,i}=\sigma(\bm{u}_{l-1}^{T}\bm{w}_{l-1,i}+b_{l-1,i}) for l=1,\ldots,L and i=1,\ldots,H_{l}; here \bm{w}_{l-1,i}\in\mathbb{R}^{H_{l-1}} is the weight vector which connects the (l-1)-th hidden layer and the neuron u_{l,i}; b_{l-1,i}\in\mathbb{R} is the corresponding intercept term; \sigma(\cdot) is the so-called activation function, taken to be the ReLU function in this paper. To get the output layer, we just take u_{L+1,i}=\bm{u}_{L}^{T}\bm{w}_{L,i}+b_{L,i} for i=1,\ldots,H_{L+1}; here H_{L+1} is equal to the output dimension. To express the functionality of the DNN in a more concise way, we can stack \{\bm{w}_{l-1,i}^{T}\}_{i=1}^{H_{l}} by row to get \bm{W}_{l-1}\in\mathbb{R}^{H_{l}\times H_{l-1}} and collect \{b_{l-1,i}\}_{i=1}^{H_{l}} into a vector \bm{b}_{l-1} for l=1,\ldots,L+1. Subsequently, we can treat the DNN as a function that takes the input \bm{x} and returns the output as follows:

f_{\text{DNN}}(\bm{x})=\bm{W}_{L}(\sigma(\bm{W}_{L-1}(\cdots\sigma(\bm{W}_{2}\sigma(\bm{W}_{1}\sigma(\bm{W}_{0}\bm{x}+\bm{b}_{0})+\bm{b}_{1})+\bm{b}_{2})\cdots)+\bm{b}_{L-1})+\bm{b}_{L}.

In other words, the function f_{\text{DNN}}(\bm{x}) maps \bm{x} to the 1-st hidden layer, then maps the 1-st hidden layer to the 2-nd hidden layer, and so on iteratively with weights \{\bm{W}_{l}\}_{l=0}^{L}, \{\bm{b}_{l}\}_{l=0}^{L} and the activation function \sigma(\cdot). We can then compute the total number of parameters in a DNN by the formula W=\sum_{i=0}^{L}(H_{i}\cdot H_{i+1}+H_{i+1}). A simple DNN is presented in Fig. 1. It has a constant width of 4 and a depth of 2.
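As a concrete illustration, the following minimal sketch (assuming PyTorch, which is also the framework used in Section 6) builds the network of Fig. 1 and verifies the parameter-count formula; the helper name make_dnn is ours and purely illustrative.

import torch.nn as nn

def make_dnn(d=2, widths=(4, 4), out_dim=1):
    """Fully connected feed-forward ReLU network with the given hidden widths."""
    layers, h_prev = [], d
    for h in widths:                               # hidden layers 1,...,L with ReLU
        layers += [nn.Linear(h_prev, h), nn.ReLU()]
        h_prev = h
    layers.append(nn.Linear(h_prev, out_dim))      # linear output layer (no activation)
    return nn.Sequential(*layers)

model = make_dnn()                                 # d=2, L=2, H=4, output dimension 1
W = sum(p.numel() for p in model.parameters())     # total number of parameters
print(W)  # 37 = (2*4+4) + (4*4+4) + (4*1+1), matching W = sum_i (H_i*H_{i+1} + H_{i+1})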

Figure 1: The illustration of a fully connected DNN with L=2, H=4 and W=37, and input dimension d=2 and output dimension 1.


3 Scalable subsampling

Scalable subsampling is a non-stochastic subsampling technique proposed by Politis, (2021). Assume that we observe the sample \{\bm{Z}_{1},\ldots,\bm{Z}_{n}\}; then, scalable subsampling relies on q=\lfloor(n-b)/h\rfloor+1 subsamples B_{1},\ldots,B_{q} where B_{j}=\{\bm{Z}_{(j-1)h+1},\ldots,\bm{Z}_{(j-1)h+b}\}; here, \lfloor\cdot\rfloor denotes the floor function, and h controls the amount of overlap (or separation) between B_{j} and B_{j+1}. In general, the subsample size b and the overlap h are functions of n, but these dependencies will not be explicitly denoted, e.g.,

b=\Theta(n^{\beta})~{};~{}h=a\cdot b,

where 0<\beta<1 and a>0. More importantly, tuning b and h gives the scalable subsampling samples different properties. For example, if h=1, the overlap is the maximum possible; if h=0.2b, there is 80% overlap between B_{j} and B_{j+1}; if h=b, there is no overlap between B_{j} and B_{j+1} but these two blocks are adjacent; if h=1.2b, there is a block of about 0.2b data points that separates the blocks B_{j} and B_{j+1}.
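For concreteness, the block construction can be sketched in a few lines of Python (numpy assumed); the function name scalable_subsamples is hypothetical and only illustrates the index bookkeeping described above.

import numpy as np

def scalable_subsamples(n, b, h):
    """Return the q scalable-subsampling blocks B_1,...,B_q as 0-based index arrays."""
    q = (n - b) // h + 1                       # q = floor((n-b)/h) + 1
    return [np.arange(j * h, j * h + b) for j in range(q)]

# Example: n = 100, b = 25, h = b gives q = 4 non-overlapping, adjacent blocks.
blocks = scalable_subsamples(100, 25, 25)
print(len(blocks), blocks[0][:3], blocks[1][:3])  # 4 [0 1 2] [25 26 27]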

The bagging idea was initially proposed by Breiman, (1996), where the subsample is bootstrapped (sampled with replacement) with the same size as the original sample. As revealed by that work, the main benefit of this technique is that the mean-squared error (MSE) of the bagging estimator can decrease, especially for unstable estimators that may change a lot with different samples, e.g., neural networks and regression trees. There is ample work on combining neural networks with the bagging technique to improve their generalization performance; e.g., see applications in the work of Ha et al., (2005); Khwaja et al., (2015) for references. However, the drawback of the original bagging method is that the estimation process needs to be performed with n-size bootstrap resamples many times, which is infeasible with massive data. Bühlmann and Yu, (2002) proposed the subagging idea which is based on subsamples as opposed to bootstrap resamples. However, even choosing a single random subsample could be computationally challenging when n is large. As pointed out in Ting, (2021), drawing a random sample of size b from n items using the Sparse Fisher-Yates Sampler takes O(b) time and space, which corresponds to the optimal time and space complexity for this problem.

Facing such computational dilemmas, scalable subsampling and subagging as proposed by Politis, (2021) can be seen as an extension of the Divide-and-Conquer principle—see e.g. Jordan, (2013). Moreover, in addition to the computational savings, scalable subagging may yield an estimator that is not less (and sometimes more) accurate than the original; the following example illustrates such a case.

Example 3.1 (Kernel-smoothed function estimation).

A remarkable example from the work of Politis, (2021) is the scalable subagging kernel estimator. Suppose our goal is estimating the value of a function g at a specific point; here, the function g can be a probability density, spectral density, or another function that is estimated in a nonparametric setting. Denote the estimand by \theta and its corresponding kernel-smoothed estimator based on the whole sample by \hat{\theta}_{n}, and assume that \hat{\theta}_{n} satisfies the following conditions:

  • (1)

    \mathbb{E}(\hat{\theta}_{n}^{2})<\infty for all n;

  • (2)

    n^{\gamma}(\mathbb{E}(\hat{\theta}_{n})-\theta)\rightarrow C and \text{Var}(n^{\alpha}\hat{\theta}_{n})\rightarrow\sigma^{2} as n\rightarrow\infty, where C is a non-zero constant, \sigma^{2}>0 and \gamma>\alpha>0.

Define the scalable subagging estimator as:

\bar{\theta}_{b,n,SS}=q^{-1}\sum_{i=1}^{q}\hat{\theta}_{b,i},

here q is the total number of subsamples and \hat{\theta}_{b,i} is the nonparametric estimator based on the i-th subsample B_{i}. To achieve the fastest convergence rate of \bar{\theta}_{b,n,SS} we may let \beta=\frac{1}{1+2(\gamma-\alpha)}. As a result, the Mean Squared Error (MSE) of the scalable subagging estimator \bar{\theta}_{b,n,SS} is \Theta(n^{-2\gamma/(1+2(\gamma-\alpha))}); see Lemma 4.1 of Politis, (2021) for a detailed discussion. To achieve such a convergence rate in the context of nonparametric estimation, the crucial point is using an undersmoothed bandwidth for the subsample statistics. To elaborate, suppose we are employing a non-negative (second-order) kernel for smoothing, in which case the MSE-optimal bandwidth is \Theta\left(n^{-1/5}\right). To conduct efficient subagging, however, the \hat{\theta}_{b,i} should be computed using an undersmoothed bandwidth of order o\left(b^{-1/5}\right). For example, if we choose the bandwidth for \hat{\theta}_{b,i} to be \Theta\left(b^{-1/4}\right) instead, then the choices \alpha=3/8, \gamma=1/2, h=O(b), and b=\Theta\left(n^{\beta}\right) with \beta=0.8 imply that the rate of convergence of \bar{\theta}_{b,n,SS} is n^{2/5}. This rate is not only faster than the rate of \hat{\theta}_{n} that used the sub-optimal bandwidth \Theta\left(n^{-1/4}\right); it is actually the fastest rate achievable by any estimator that uses a non-negative kernel with its associated MSE-optimal bandwidth. Nevertheless, \bar{\theta}_{b,n,SS} can be computed faster than \hat{\theta}_{n}, and may thus be preferable. In addition to the asymptotic results, the simulation study of Politis, (2021) reveals that the error of the scalable subsampling estimator can actually be smaller than that of the full-sample nonparametric estimator with its own optimal bandwidth choice.
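The following is a minimal sketch of Example 3.1 for kernel density estimation at a point, with the undersmoothed bandwidth of order b^{-1/4} and non-overlapping blocks; the function names and the choice of a Gaussian kernel are our own illustrative assumptions.

import numpy as np

def kde_at(x0, data, bw):
    """Gaussian kernel density estimate of the density at the point x0."""
    u = (x0 - data) / bw
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / bw

def subagged_kde(x0, sample, beta=0.8, a=1.0):
    n = len(sample)
    b = int(n**beta)                       # subsample size b = Theta(n^beta)
    h = int(a * b)                         # h = a*b; a = 1 gives adjacent blocks
    q = (n - b) // h + 1
    bw = b**(-0.25)                        # undersmoothed bandwidth Theta(b^{-1/4})
    ests = [kde_at(x0, sample[j*h : j*h + b], bw) for j in range(q)]
    return np.mean(ests)                   # average of the q subsample estimates

rng = np.random.default_rng(0)
print(subagged_kde(0.0, rng.standard_normal(10_000)))  # approx. 1/sqrt(2*pi) ~ 0.399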

In the next section, we will introduce how to compute the scalable subsampling DNN estimator. Then, we will show that our aggregated DNN estimator could possess a smaller MSE than the optimal DNN estimator trained on the whole sample, under some conditions. We also discuss some specifically designed confidence intervals to measure the estimation accuracy via the approaches mentioned in Section 1.

4 Estimation inference with DNN

Although the DNN has captured much attention in practice, its theoretical validation is still in development. Recently, Farrell et al., (2021) gave a non-asymptotic error bound to measure the performance of the DNN estimator under two regularity assumptions. To sync with the latest results on the approximation ability of DNNs, we replace their second assumption (Assumption 2 in their work) with the smooth function condition assumed by Yarotsky and Zhevnerchuk, (2020). We present these new assumptions below:

  • A1: The regression data are i.i.d. copies of \bm{Z}=(Y,\bm{X})\in\mathscr{Y}\times[-1,1]^{d}, where \bm{X} has a continuous distribution, and \mathscr{Y}\subset[-M,M] for some positive constant M. Correspondingly, we set the space of all DNN candidate functions to be \mathscr{F}_{\text{DNN}}=\{f_{\theta}:||f_{\theta}||_{\infty}\leq 2M\}.

  • A2: The target regression function f lies in the Hölder Banach space \mathscr{C}^{k,\alpha}\left([-1,1]^{d}\right) which is the space of k times continuously differentiable functions on [-1,1]^{d} having a finite norm defined by

    \|f\|_{\mathscr{C}^{k,\alpha}\left([-1,1]^{d}\right)}=\max\left\{\max_{\mathbf{k}:|\mathbf{k}|\leq k}\max_{\mathbf{x}\in[-1,1]^{d}}\left|D^{\mathbf{k}}f(\mathbf{x})\right|,\max_{\mathbf{k}:|\mathbf{k}|=k}\sup_{\begin{subarray}{c}\mathbf{x},\mathbf{y}\in[-1,1]^{d}\\ \mathbf{x}\neq\mathbf{y}\end{subarray}}\frac{\left|D^{\mathbf{k}}f(\mathbf{x})-D^{\mathbf{k}}f(\mathbf{y})\right|}{\|\mathbf{x}-\mathbf{y}\|^{\alpha}}\right\},

    where the smoothness index is \xi=k+\alpha with an integer k\geq 0 and 0<\alpha\leq 1.

  • A3: The sample size n is larger than (2eM)^{2}\vee\text{Pdim}(f_{\text{DNN}}) where \text{Pdim}(f_{\text{DNN}}) is the pseudo-dimension of f_{\text{DNN}} which satisfies:

    c\cdot WL\log(W/L)\leq\operatorname{Pdim}(f_{\text{DNN}})\leq C\cdot WL\log W,

    with some universal constants c,C>0 and Euler’s number e; see Bartlett et al., (2019) for details.

Remark.

We can weaken the assumption on the domain of \bm{X} to [-C_{x},C_{x}]^{d} for some constant C_{x}, i.e., we can work on a compact domain of \bm{X}; see also the proof of Yarotsky and Zhevnerchuk, (2020).

As shown in Farrell et al., (2021), with H_{1}=H_{2}=\cdots=H_{L}=\Theta(n^{\frac{d}{2(\xi+\alpha)}}\log^{2}n) and L=\Theta(\log n), the L_{2} norm loss and empirical mean squared error of the deep fully connected ReLU network estimator from Eq. 2 can be bounded with probability at least 1-\exp\left(-n^{\frac{d}{\xi+d}}\log^{8}n\right), i.e.,

\begin{split}&\left\|\widehat{f}_{\text{DNN}}-f\right\|_{L_{2}(X)}^{2}\leq C_{1}\cdot\left\{n^{-\frac{\xi}{\xi+d}}\log^{8}n+\frac{\log\log n}{n}\right\},\\ &\mbox{and}\ \ \mathbb{E}_{n}\left[\left(\widehat{f}_{\text{DNN}}-f\right)^{2}\right]=\Theta\left(\left\|\widehat{f}_{\text{DNN}}-f\right\|_{L_{2}(X)}^{2}\right);\end{split} (3)

here C_{1}>0 is a constant which is independent of n and may depend on d, M, and other fixed constants.

Obviously, the L_{2} norm error bound in (3) is sub-optimal compared to the fastest convergence rate achievable for nonparametric function estimation. With the latest approximation theory on DNNs, we can improve the error bound in Eq. 3 by decreasing the power of the \log(n) term. Moreover, this faster rate is attained with a narrower DNN. We give our first theorem about the convergence rate of \widehat{f}_{\text{DNN}} below.

Theorem 4.1.

Assume assumptions A1 to A3, and take width H=\Theta(n^{\frac{d}{2(\xi+\alpha)}}\log n) and depth L=\Theta(\log n). Then, the L_{2} norm loss of the deep fully connected ReLU network estimator of Eq. 2 can be bounded with probability at least 1-\exp\left(-n^{\frac{d}{\xi+d}}\log^{6}n\right), i.e.,

\left\|\widehat{f}_{\text{DNN}}-f\right\|_{L_{2}(X)}^{2}\leq C_{2}\cdot\left\{n^{-\frac{\xi}{\xi+d}}\log^{6}n+\frac{\log\log n}{n}\right\}; (4)

here C_{2}>0 is another constant.

It appears that the above gives the fastest rate obtainable based on the current literature. Later, we will show how this error bound can be further improved by applying the scalable subagging technique under some mild conditions.

Remark.

The improvement implied by Theorem 4.1 can also be applied to Corollaries 1 and 2 of Farrell et al., (2021) to improve the corresponding error bounds.

4.1 Scalable subagging DNN estimator

We first review the idea of scalable subagging and explain how this technique can be combined with DNN estimation. We focus on the regression problem and assume we observe the sample \{(Y_{1},\bm{X}_{1}),\ldots,(Y_{n},\bm{X}_{n})\}.

Analogously to the subagging kernel-smoothed estimator of Example 3.1, we can define the subagging DNN estimator as:

\overline{f}_{\text{DNN}}(\bm{X})=\frac{1}{q}\sum_{j=1}^{q}\widehat{f}_{\text{DNN},b,j}(\bm{X}); (5)

here, q=\lfloor(n-b)/h\rfloor+1, and \widehat{f}_{\text{DNN},b,j}(\cdot) is the minimizer of the empirical loss function in Eq. 2 using only the data in the j-th subsample, namely B_{j}=\{(Y_{(j-1)h+1},\bm{X}_{(j-1)h+1}),\ldots,(Y_{(j-1)h+b},\bm{X}_{(j-1)h+b})\}.
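A minimal sketch of this estimator is given below, assuming PyTorch and float tensors X of shape (n, d) and Y of shape (n, 1); the architecture, learning rate and number of epochs are illustrative placeholders rather than the settings used in Section 6.

import torch
import torch.nn as nn

def fit_dnn(X, Y, width=20, depth=3, epochs=200, lr=0.01):
    layers, h_prev = [], X.shape[1]
    for _ in range(depth):                        # fully connected ReLU hidden layers
        layers += [nn.Linear(h_prev, width), nn.ReLU()]
        h_prev = width
    layers.append(nn.Linear(h_prev, 1))
    model = nn.Sequential(*layers)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                       # full-batch MSE training, cf. Eq. 2
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), Y)
        loss.backward()
        opt.step()
    return model

def subagging_dnn(X, Y, b, h):
    n = X.shape[0]
    q = (n - b) // h + 1
    models = [fit_dnn(X[j*h : j*h + b], Y[j*h : j*h + b]) for j in range(q)]
    def f_bar(x_new):                             # Eq. 5: average of the q DNN fits
        with torch.no_grad():
            return torch.stack([m(x_new) for m in models]).mean(dim=0)
    return f_bar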

In nonparametric function estimation where the estimation is performed via the kernel technique, the bandwidth controls the bias order of the kernel-smoothed estimator. As shown in Example 3.1, the optimal convergence rate can be recovered by combining the scalable subagging trick with an undersmoothed bandwidth. Similarly, in the context of neural network estimation, the overall architecture of a DNN controls its smoothness, much like the shape (order) of a kernel. The depth and width of a DNN play the role of tuning parameters similar to the bandwidth of a kernel. Moreover, according to the prevailing wisdom, a deeper DNN may possess a lower bias; this conjecture was confirmed by Yang et al., (2020) with ResNet on some image datasets.

However, as far as we know, there is no theoretical result that explains the relationship between bias and the width/depth of a DNN. Here, we make the assumption below to restrict the order of the bias of \widehat{f}_{\text{DNN}}:

  • A4: \mathbb{E}(\widehat{f}_{\text{DNN}}(\bm{x})-f(\bm{x}))=O(n^{-\Lambda/2}) uniformly in \bm{x} for some constant \Lambda>0.

To boost the scalable subagging method, a fundamental preliminary condition is that the bias of the estimator is comparatively negligible to its standard deviation—see Politis, (2021) for details. Thus, we further impose an additional assumption on the order of bias:

  • A5: The bias exponent in Assumption A4 satisfies the inequality: \Lambda>\frac{\xi}{\xi+d}.

We claim that assumptions A4 and A5 are achievable. As revealed in Yarotsky and Zhevnerchuk, (2020), the approximation error of a DNN in the uniform norm can decay as fast as W^{-2\xi/d}. Although this rate is not instructive in practice, the existence of a DNN that satisfies the bias order requirement A4 is plausible.

We should also notice that practitioners tend to build a large DNN whose size is larger than the sample size, i.e., the DNN interpolates the sample, in modern machine-learning practice. Interestingly, such an over-parameterized estimator breaks the classical understanding of the bias-variance trade-off since its generalization performance can even be better than that of a DNN which lies in the under-parameterized regime. This phenomenon is described as the double descent of the risk by Belkin et al., (2019). Thus, A4 and A5 should be reasonable when we consider DNNs with an overwhelming number of parameters; however, assumption A3 may then fail, which means the consistency property of the DNN estimator may be lost. It is interesting to explore whether scalable subsampling can work for DNN estimators in the over-parameterized regime; we leave this to future work.

Remark 4.1.

In this paper, we focus on applying scalable subagging to DNNs whose size is less than the sample size, but the extension to a large DNN is straightforward. From the computability aspect, as we can expect, the saving of computational cost from applying scalable subagging will be more significant when estimating with a large DNN. To see this, assume that we consider a DNN with size W=\Theta(n^{\phi}), \phi>1. Then, the computational complexity will be mainly determined by how many manipulations (e.g., forward calculation and backward updating) we carry out to train the DNN. The total number of manipulations is also affected by the batch size and the number of epochs. Thus, we summarize that the total number of manipulations is O(n\cdot W\cdot E); here E represents the number of epochs, i.e., the number of complete passes of the training data through the algorithm. It is fair to assume that the complexity is of order n^{\phi+1}:=n^{\varphi}. When the size of the DNN is larger than the sample size, \varphi>2. Thus, for the subagging estimator, the computational complexity is approximately O(n^{\beta\varphi}q)=O(n^{1+\beta(\varphi-1)}). The ratio of n^{1+\beta(\varphi-1)} over n^{\varphi} is n^{-(\varphi-1)(1-\beta)}. Thus, for a fixed \beta, the larger \varphi is, the more computation can be saved by deploying the subagging technique.

Aggregating all the above, the following theorem quantifies the error bound of the scalable subagging DNN estimator of Eq. 5:

Theorem 4.2.

Assume Assumptions A1 to A5, and let \beta=\frac{1}{1+\Lambda-\frac{\xi}{\xi+d}}. Then, with probability at least (1-\exp(-n^{\frac{d}{\xi+d}}\log^{6}n))^{q}, the error bound of the subagging estimator Eq. 5 in L_{2} norm is:

\left\|\overline{f}_{\text{DNN}}-f\right\|_{L_{2}(X)}^{2}\leq n^{\frac{-\Lambda}{\Lambda+\frac{d}{\xi+d}}}{\cal L}(n),

where {\cal L}(n) is a slowly varying function involving a constant and all \log(n) terms.

Remark 4.2.

Choosing \beta=\frac{1}{1+\Lambda-\frac{\xi}{\xi+d}} in Theorem 4.2 ensures that the squared bias term will always be relatively negligible compared to the variance, which is important for the success of scalable subsampling; see the related discussion in Remark 4.3.

Note that the final accuracy of DNN heavily depends on many other factors in practice, e.g., which optimizer we choose in the training stage, which parameter initialization strategy we take, and how large the batch size should be. Thus, a solely theoretical rate is insufficient to verify the superiority of the scalable subsampling DNN estimator. We then deploy simulation studies in Section 6 to provide supplementary evidence.

4.2 Estimation of the bias order of DNN estimator

Although Theorem 4.2 shows the possibility of getting a smaller error bound, it depends on the bias exponent \Lambda which is typically unknown. In this subsection, we propose two approaches to estimate the value of \Lambda via subsampling. As far as we know, this is the first attempt to quantify the bias of the DNN estimator.

First note that A4 implies that, for any i, \mathbb{E}(\widehat{f}_{\text{DNN},b,i}(\bm{x})-f(\bm{x}))=O(n^{-\beta\Lambda/2}). Since \overline{f}_{\text{DNN}}(\bm{X})=\frac{1}{q}\sum_{j=1}^{q}\widehat{f}_{\text{DNN},b,j}(\bm{X}), it follows that the bias of \overline{f}_{\text{DNN}}(\bm{x}) is O(n^{-\beta\Lambda/2}), so we can write

\mathbb{E}(\overline{f}_{\text{DNN}}(\bm{x})-f(\bm{x}))=c_{b}\cdot b^{-\Lambda/2}+o(b^{-\Lambda/2}). (6)

Recall that \overline{f}_{\text{DNN}}(\bm{X}) was built based on subsamples of size b. If we have another DNN estimator \widehat{f}_{\text{DNN},b_{0}}(\bm{x}) trained on a sample of size b_{0}, then its bias will be c_{b}\cdot b_{0}^{-\Lambda/2}+o(b_{0}^{-\Lambda/2}). Then,

\begin{split}\mathbb{E}\left(\overline{f}_{\text{DNN}}(\bm{x})-f(\bm{x})\right)&=\mathbb{E}\left(\overline{f}_{\text{DNN}}(\bm{x})-\widehat{f}_{\text{DNN},b_{0}}(\bm{x})+\widehat{f}_{\text{DNN},b_{0}}(\bm{x})-f(\bm{x})\right)\\ &=\mathbb{E}\left(\overline{f}_{\text{DNN}}(\bm{x})-\widehat{f}_{\text{DNN},b_{0}}(\bm{x})\right)+\mathbb{E}\left(\widehat{f}_{\text{DNN},b_{0}}(\bm{x})-f(\bm{x})\right).\end{split} (7)

If b\to\infty and b/b_{0}\to 0, the bias of \overline{f}_{\text{DNN}}(\bm{x}) is asymptotically determined by the first term on the r.h.s. of Eq. 7. So we can try to estimate \mathbb{E}\left(\overline{f}_{\text{DNN}}(\bm{x})-\widehat{f}_{\text{DNN},b_{0}}(\bm{x})\right) to approximate the l.h.s. of Eq. 7.

Ideally, if we have a large enough sample, we can carve out M non-overlapping (or partially overlapping) b_{0}-size subsamples and compute \{\widehat{f}_{\text{DNN},b_{0}}^{(i)}(\bm{x})\}_{i=1}^{M}. If we further separate each b_{0}-size subsample into multiple non-overlapping (or partially overlapping) b-size subsamples, \{\overline{f}_{\text{DNN}}^{(i)}(\bm{x})\}_{i=1}^{M} can be built and each \overline{f}_{\text{DNN}}^{(i)}(\bm{x}) possesses the same bias order as our desired DNN estimator. Subsequently, the bias of \overline{f}_{\text{DNN}}(\bm{x}) can be estimated by the sample mean of \{\overline{f}_{\text{DNN}}^{(i)}(\bm{x})-\widehat{f}_{\text{DNN},b_{0}}^{(i)}(\bm{x})\}_{i=1}^{M}. We can then use this information to estimate the value of \Lambda. By the law of large numbers, we can get an accurate bias estimate as M\to\infty. However, as we can easily see, this approach is computationally heavy and requires a large dataset.

Consequently, we propose another way to perform the bias estimation; we will call it the scaling-down estimation method. To elaborate, recall that our goal is estimating the bias of \overline{f}_{\text{DNN}}(\bm{x}) that was built based on subsamples of size b. Consider different DNN estimators \widehat{f}_{\text{DNN},b_{1}}(\bm{x}) and \widehat{f}_{\text{DNN},b_{2}}(\bm{x}) which are trained on samples of size b_{1} and b_{2} respectively; here b_{1}\ll b and b_{2}\ll b_{1}. As before, A4 implies that the bias of \widehat{f}_{\text{DNN},b_{i}}(\bm{x}) is c_{b}\cdot b_{i}^{-\Lambda/2}+o(b_{i}^{-\Lambda/2}) for i=1,2. Then, a key observation is that:

\begin{split}\mathbb{E}\left(\widehat{f}_{\text{DNN},b_{i}}(\bm{x})-f(\bm{x})\right)&=\mathbb{E}\left(\widehat{f}_{\text{DNN},b_{i}}(\bm{x})-\overline{f}_{\text{DNN}}(\bm{x})+\overline{f}_{\text{DNN}}(\bm{x})-f(\bm{x})\right)\\ &=\mathbb{E}\left(\widehat{f}_{\text{DNN},b_{i}}(\bm{x})-\overline{f}_{\text{DNN}}(\bm{x})\right)+\mathbb{E}\left(\overline{f}_{\text{DNN}}(\bm{x})-f(\bm{x})\right),~{}\text{for}~{}i=1,2.\end{split} (8)

Due to the relationship between b,b_{1},b_{2}, the bias of \widehat{f}_{\text{DNN},b_{i}}(\bm{x}) is dominated by the first term on the r.h.s. of Eq. 8. We then have two different estimates of the bias of \widehat{f}_{\text{DNN},b_{i}}(\bm{x}), namely:

\widehat{B}_{i}=\frac{1}{q_{i}}\sum_{j=1}^{q_{i}}\left(\widehat{f}_{\text{DNN},b_{i}}^{(j)}(\bm{x})-\overline{f}_{\text{DNN}}(\bm{x})\right),~{}\text{for}~{}i=1,2.

Fixing the value of i, \{\widehat{f}_{\text{DNN},b_{i}}^{(j)}(\bm{x})\}_{j=1}^{q_{i}} is the value of \widehat{f}_{\text{DNN},b_{i}}(\bm{x}) computed from the j-th subsample of size b_{i} carved out of the whole sample; as before, these subsamples can be non-overlapping or partially overlapping and their number is denoted by q_{i}. Ignoring the o(\cdot) term in Eq. 6, we can solve the following system of equations to approximate both c_{b} and \Lambda:

\begin{cases}\widehat{B}_{1}&=c_{b}\cdot b_{1}^{-\Lambda/2}\\ \widehat{B}_{2}&=c_{b}\cdot b_{2}^{-\Lambda/2}.\\ \end{cases} (9)

Taking logarithms in Eq. 9 turns it into a linear system in \log(c_{b}) and \Lambda. Finally, we can estimate the bias of \overline{f}_{\text{DNN}}(\bm{x}) by scaling down \widehat{B}_{1} by a factor (b/b_{1})^{-\Lambda/2}, i.e., the bias of \overline{f}_{\text{DNN}}(\bm{x}) is approximately \widehat{B}_{1}\cdot(b/b_{1})^{-\Lambda/2}. We summarize this procedure in Algorithm 1.

Algorithm 1 Scaling-down bias estimation of DNN estimator
Step 1 Fix a subsample size b, and compute \overline{f}_{\text{DNN}}(\bm{x}) at point \bm{x}.
Step 2 Fix two subsample sizes b_{1}\ll b and b_{2}\ll b_{1}, and separate the whole sample into q_{1} and q_{2} number of b_{1}-size and b_{2}-size subsamples, respectively. Compute \{\widehat{f}_{\text{DNN},b_{i}}^{(j)}(\bm{x})\}_{j=1}^{q_{i}} at \bm{x} for i=1,2.
Step 3 Solve Eq. 9 to get c_{b} and \Lambda.
Step 4 Estimate the bias of \overline{f}_{\text{DNN}}(\bm{x}) by \widehat{B}_{1}\cdot(b/b_{1})^{-\Lambda/2}.
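Steps 3-4 of Algorithm 1 amount to solving the log-linear version of Eq. 9 and rescaling; a minimal sketch is given below, where B1_hat and B2_hat denote the averaged differences \widehat{B}_{1} and \widehat{B}_{2} computed beforehand.

import numpy as np

def scale_down_bias(B1_hat, B2_hat, b1, b2, b):
    # log|B_i| = log(c_b) - (Lambda/2) * log(b_i),  i = 1, 2
    Lambda = 2 * np.log(np.abs(B2_hat) / np.abs(B1_hat)) / np.log(b1 / b2)
    c_b = np.abs(B1_hat) * b1 ** (Lambda / 2)
    bias_bar = B1_hat * (b / b1) ** (-Lambda / 2)   # estimated bias of f_bar_DNN(x)
    return Lambda, c_b, bias_bar

# Toy check with a known bias c_b * b^{-Lambda/2}, c_b = 2 and Lambda = 1:
print(scale_down_bias(2 * 100**-0.5, 2 * 25**-0.5, 100, 25, 400))
# -> (1.0, 2.0, 0.1), since 2 * 400**-0.5 = 0.1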

4.3 Confidence intervals

Beyond point estimation, it is important to quantify DNN estimation accuracy; this can be done via a standard error or, even better, via a Confidence Interval (CI). More specifically, for a point of interest \bm{X}=\bm{x}, we hope to find a CI which satisfies:

\mathbb{P}(B_{l}\leq f(\bm{x})\leq B_{u})=1-\delta;

here \mathbb{P} should be understood as the conditional probability given \bm{X}=\bm{x}; B_{l} and B_{u} are lower and upper bounds for f(\bm{x}) that are functions of the DNN estimator; \delta is the significance level. Since we can have different CI constructions with the same \delta, we are also interested in the CI length (CIL) which is defined as \text{CIL}=B_{u}-B_{l}. We aim for a (conditional) CI that is the most accurate (in terms of its coverage being close to 1-\delta) but with the shortest length.

Analogously to Example 3.1, we make an assumption about the variance term of the DNN estimator trained with sample size n and evaluated at \bm{x}:

  • B1

    \text{Var}(n^{\alpha}\widehat{f}_{\text{DNN}}(\bm{x}))\to\sigma^{2}>0 as n\to\infty.

Generally speaking, we have two choices to build a CI for f(\bm{x}): (1) Pivot-CI (PCI), the type of CI obtained by estimating the sampling distribution of a pivotal quantity, e.g., the estimator centered at its expectation; (2) Quantile-CI (QCI), the type of CI based on quantiles of the estimated sampling distribution of the (uncentered) estimator of interest. More details are given in the example below.

Example 4.1 (Types of CI).

For any unknown quantity \theta estimated by \hat{\theta}_{n}, we may build a scalable subagging estimator \bar{\theta}_{b,n,SS}=q^{-1}\sum_{i=1}^{q}\hat{\theta}_{b,i} to approximate it. To construct a CI for \theta based on \bar{\theta}_{b,n,SS}, we are aided by the CLT of Politis, (2021), i.e.,

\kappa_{n}(\bar{\theta}_{b,n,SS}-\theta)\overset{d}{\to}N(C_{\mu},C_{\sigma}^{2}),~{}\text{as}~{}n\to\infty, (10)

under mild conditions; here C_{\mu} and C_{\sigma}^{2} are the mean and variance of the limiting distribution, respectively, and \kappa_{n}=n^{\frac{1-\beta+2\alpha\beta}{2}}. [By the way, note the typo in Politis, (2021) where \kappa_{n} was incorrectly written as n^{-\frac{1-\beta+2\alpha\beta}{2}}.]

The form of the PCI based on the CLT (10) depends crucially on whether C_{\mu}=0 or not; see the next two subsections for details. On the other hand, the QCI is easier to build but it has its own deficiencies. In the context of this example, it is tempting to create a QCI for \theta by taking the \delta/2 and 1-\delta/2 quantile values of the empirical distribution of the points \{\hat{\theta}_{b,1},\ldots,\hat{\theta}_{b,q}\}. However, the resulting CI will be too conservative, i.e., its coverage will be (much) bigger than 1-\delta. The reason is that the empirical distribution of \{\hat{\theta}_{b,1},\ldots,\hat{\theta}_{b,q}\} approximates the sampling distribution of the estimator \hat{\theta}_{b} which has a bigger variance than that of the target \hat{\theta}_{n}. We could try to re-scale the empirical distribution of \{\hat{\theta}_{b,1},\ldots,\hat{\theta}_{b,q}\} as in classical subsampling—see Politis et al., (1999). We still consider the QCI in the simulation studies. As expected, this QCI is the most conservative one; see details in Section 6.

4.3.1 PCI in the case where C_{\mu}=0

If C_{\mu}=0, i.e., when the squared bias is relatively negligible compared to the variance in estimation, we can rely on Eq. 10 to build a PCI for the true function f at a point \bm{x}. All we need is a consistent estimator of C_{\sigma}^{2}, e.g., \widehat{C}_{\sigma}^{2}=b^{2\alpha}q^{-1}\sum_{i=1}^{q}\left(\hat{\theta}_{b,i}-\bar{\theta}_{b,n,SS}\right)^{2}. In that case, a PCI for \theta based on the CLT can be written as:

\bar{\theta}_{b,n,SS}\pm z_{1-\delta/2}\cdot\widehat{C}_{\sigma}\cdot\kappa_{n}^{-1}, (11)

where z_{1-\delta/2} is the 1-\delta/2 quantile of the standard normal distribution.

Observing that there is a common term n^{\beta\alpha} in \kappa_{n} and \widehat{C}_{\sigma}, we can estimate \widehat{C}_{\sigma}\cdot\kappa_{n}^{-1} as a whole rather than computing \kappa_{n} and \widehat{C}^{2}_{\sigma} separately. As a result, we can get a simplified PCI based on Eq. 11 as follows:

\overline{f}_{\text{DNN}}(\bm{x})\pm z_{1-\delta/2}\cdot M_{\sigma}; (12)

here M_{\sigma}=\widehat{C}_{\sigma}\cdot\kappa_{n}^{-1}, which can be approximated by \sqrt{q^{-1}\sum_{i=1}^{q}\left(\widehat{f}_{\text{DNN},b,i}(\bm{x})-\overline{f}_{\text{DNN}}(\bm{x})\right)^{2}}/n^{\frac{1-\beta}{2}}. Note that constructing this CI does not require knowledge of \alpha, which is the order of the variance term in B1. However, the estimate \widehat{C}_{\sigma} may not be accurate when q is small since it is only an average of q terms. As a result, the PCI of Eq. 12 may undercover the true model values. Thus, we may relax the desired property of the CI. Instead of requiring the exact coverage rate of a CI to be 1-\delta, we seek a CI such that:

\mathbb{P}(B_{l}\leq f(\bm{x})\leq B_{u})\geq 1-\delta. (13)

Thus, the optimal candidate will be the CI which has the shortest length while guaranteeing a coverage rate no smaller than 1-\delta. To satisfy Eq. 13, we may enlarge the CI appropriately by replacing \widehat{C}^{2}_{\sigma} with \widetilde{C}^{2}_{\sigma}=\widehat{C}^{2}_{\sigma}+(\overline{f}_{\text{DNN}}(\bm{x})-y)^{2}; here y=f(\bm{x})+\epsilon.

It is appealing to think that \widetilde{C}^{2}_{\sigma} is close to the MSE of \overline{f}_{\text{DNN}}(\bm{x}). However,

(\overline{f}_{\text{DNN}}(\bm{x})-y)^{2}=\left(\frac{1}{q}\sum_{i=1}^{q}(\widehat{f}_{\text{DNN},b,i}(\bm{x})-y)\right)^{2}=\left(\frac{1}{q}\sum_{i=1}^{q}\left(\widehat{f}_{\text{DNN},b,i}(\bm{x})-f(\bm{x})\right)-\epsilon\right)^{2}.

When q is large, (\overline{f}_{\text{DNN}}(\bm{x})-y)^{2}\to(C_{\mu}-\epsilon)^{2} where C_{\mu} is the bias of \overline{f}_{\text{DNN}}(\bm{x}). Therefore, \widetilde{C}^{2}_{\sigma} is not exactly the MSE, but it can still be used to enlarge the CI to some extent. We can then define another PCI as:

\overline{f}_{\text{DNN}}(\bm{x})\pm z_{1-\delta/2}\cdot\widetilde{M}_{\sigma}, (14)

where

\widetilde{M}_{\sigma}=\sqrt{q^{-1}\sum_{i=1}^{q}\left(\widehat{f}_{\text{DNN},b,i}(\bm{x})-\overline{f}_{\text{DNN}}(\bm{x})\right)^{2}/n^{1-\beta}+(\overline{f}_{\text{DNN}}(\bm{x})-y)^{2}/n^{1-\beta+2\alpha\beta}}. (15)

Since the order \alpha of the variance term is involved in the above expression, we consider two extreme situations in the simulation sections: (1) we take 2\alpha=0, which is the most enlarged case; or (2) we take 2\alpha=1, which is a mildly enlarged case.
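A minimal sketch of the PCIs of Eq. 12 and Eq. 14 is given below; f_hats holds the q subsample estimates \widehat{f}_{\text{DNN},b,i}(\bm{x}), f_bar is \overline{f}_{\text{DNN}}(\bm{x}), and two_alpha in {0, 1} selects between the two enlargement variants just discussed (passing no response y gives the basic PCI of Eq. 12).

import numpy as np
from scipy.stats import norm

def pci(f_hats, f_bar, n, beta, delta=0.05, y=None, two_alpha=0.0):
    var_term = np.mean((np.asarray(f_hats) - f_bar) ** 2) / n ** (1 - beta)
    if y is None:
        M = np.sqrt(var_term)                                   # M_sigma of Eq. 12
    else:                                                       # M_tilde of Eq. 15
        M = np.sqrt(var_term + (f_bar - y) ** 2 / n ** (1 - beta + two_alpha * beta))
    z = norm.ppf(1 - delta / 2)
    return f_bar - z * M, f_bar + z * M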

Remark 4.3 (The condition to guarantee Cμ=0C_{\mu}=0).

According to Eq. 10, C_{\mu}=0 is satisfied as long as \beta>\frac{1}{1+\Lambda-2\alpha} under A5. If we take \beta=\frac{1}{1+\Lambda-\frac{\xi}{\xi+d}} as in Theorem 4.2, we find that the condition for C_{\mu}=0 is always satisfied. This is not surprising since A5 imposes a requirement on the convergence rate of the bias term. However, as explained in Remark 4.2, this \beta is not the optimal one to generate the smallest error bound. Thus, we could arrive at a stage where the orders of squared bias and variance are the same once we know \alpha. Due to the high variability of training a DNN in practice, we introduce a method in Section 4.3.2 to build the CI appropriately under the situation that C_{\mu}\neq 0, which serves for cases where the bias is not relatively negligible.

4.3.2 PCI in the case where C_{\mu}\neq 0

It is worthwhile to discuss how we can build a PCI for the scalable subsampling DNN estimator when C_{\mu}\neq 0. Note that Politis, (2021) proposed an iterated scalable subsampling technique that is applicable in the case C_{\mu}\neq 0. While this technique is also applicable in the case C_{\mu}=0, we may prefer the construction of Section 4.3.1 since it is less computer-intensive. However, we should notice that the additional computational burden brought by iterated subsampling is negligible when n\to\infty; see the analysis in Appendix: C. For completeness, we present this method in the remark below.

Remark 4.4 (Iterated subsampling).

With the same notation as in Example 4.1, we can perform the iterated subsampling in three steps: (1) Let b=\left\lfloor n^{\beta}\right\rfloor, then apply the scalable subsampling technique to the sample X_{1},\ldots,X_{n} and get q subsets \{B_{i}\}_{i=1}^{q}. Compute \bar{\theta}_{b,n,SS}; we call this the “first stage subsampling”; (2) Take another subsample size b^{\prime}=\lfloor b^{\beta}\rfloor and apply the scalable subagging method again to each of the \{B_{i}\}_{i=1}^{q}, i.e., as if B_{i} were the only data at hand, and compute a subagging estimator for each subsample B_{i}; such a subagging estimator \bar{\theta}_{b^{\prime},b,SS,i} is computed by averaging q^{\prime} estimators \{\hat{\theta}_{b^{\prime},b,SS,i}^{(j)}\}_{j=1}^{q^{\prime}}; here q^{\prime}=\lfloor(b-b^{\prime})/h^{\prime}\rfloor+1. As a result, we get q estimators \{\bar{\theta}_{b^{\prime},b,SS,i}\}_{i=1}^{q}; we call this the “iterated stage subsampling”; (3) Find the subsampling distribution L_{b^{\prime},b,SS}(z)=q^{-1}\sum_{i=1}^{q}1\left\{\kappa_{b}\left(\bar{\theta}_{b^{\prime},b,SS,i}-\bar{\theta}_{b,n,SS}\right)\leq z\right\}; \kappa_{b} is a function of b. In the context of DNN estimation, we use \widehat{f}^{(j)}_{\text{DNN},b,i} to represent the DNN estimator in the iterated subsampling stage on the j-th subsample from the i-th subsample in the first stage subsampling.

Denote J_{n}(z)=\mathbb{P}(\kappa_{n}(\bar{\theta}_{b,n,SS}-\theta)\leq z), and let J(z) be the limit of J_{n}(z) as n\to\infty; recall that (10) implies that J(z) is Gaussian. Proposition 2.1 of Politis, (2021) shows that L_{b^{\prime},b,SS}(z) converges to J(z) in probability for all points of continuity of J(z). Due to Eq. 10, J(z) is continuous everywhere, and therefore the convergence is uniform. Thus, both L_{b^{\prime},b,SS}(z) and J_{n}(z) converge in a uniform fashion to J(z) in probability, which implies that:

\sup_{z}\left|L_{b^{\prime},b,SS}(z)-J_{n}(z)\right|\overset{p}{\to}0,~{}\text{as}~{}n\to\infty. (16)

Thus, iterated subsampling can be used to estimate the distribution J_{n}. We can build the CI in a pivotal style without explicitly referring to the form of J, which involves the two unknown parameters. A further issue is that normality might not be well represented in J_{n} since it is based on an average of q quantities; having a large q requires a huge n. To compensate for the data size requirement, we take a specific approach to build the CI which can be considered as a combination of PCI and QCI to some extent. Algorithm 2 describes all the steps to construct the CI for f at a point \bm{x} based on the subagging DNN estimator and the iterated subsampling method.

Algorithm 2 PCI of f(\bm{x}) based on iterated subsampling
Step 1 Fix the subsample size b, compute \overline{f}_{\text{DNN}}(\bm{x}) at point \bm{x}.
Step 2 Fix the subsample size b^{\prime} of the iterated subsampling, and perform the necessary steps in Remark 4.4 to find L_{b^{\prime},b,SS}(z)=q^{-1}\sum_{i=1}^{q}1\left\{\kappa_{b}\left(\overline{f}_{\text{DNN},i}(\bm{x})-\overline{f}_{\text{DNN}}(\bm{x})\right)\leq z\right\}; here \overline{f}_{\text{DNN},i}(\bm{x})=\frac{1}{q^{\prime}}\sum_{j=1}^{q^{\prime}}\widehat{f}^{(j)}_{\text{DNN},b,i} is the subagging DNN estimator on the i-th subsample from Step 1, evaluated at the point \bm{x}.
Step 3 Denote the \delta/2 and 1-\delta/2 quantile values of the distribution L_{b^{\prime},b,SS}(z) by b_{l} and b_{u}.
Step 4 Determine the PCI of f(\bm{x}) as: [\overline{f}_{\text{DNN}}(\bm{x})-b_{u}/\kappa_{n}~{},~{}\overline{f}_{\text{DNN}}(\bm{x})-b_{l}/\kappa_{n}]. (17) In other words, we take B_{l}=\overline{f}_{\text{DNN}}(\bm{x})-b_{u}/\kappa_{n} and B_{u}=\overline{f}_{\text{DNN}}(\bm{x})-b_{l}/\kappa_{n}.
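Assuming the iterated-stage estimates \overline{f}_{\text{DNN},i}(\bm{x}) and the first-stage \overline{f}_{\text{DNN}}(\bm{x}) have already been computed, Steps 2-4 of Algorithm 2 reduce to the short sketch below; the scaling constants shown follow the bounds of Remark 4.5 and are purely illustrative.

import numpy as np

def iterated_pci(fbar_i, fbar, kappa_b, kappa_n, delta=0.05):
    # Subsampling distribution L_{b',b,SS}: point masses at kappa_b*(fbar_i - fbar)
    roots = kappa_b * (np.asarray(fbar_i) - fbar)
    b_l, b_u = np.quantile(roots, [delta / 2, 1 - delta / 2])
    return fbar - b_u / kappa_n, fbar - b_l / kappa_n            # Eq. 17

# Example bounds with beta = 0.8 (cf. Remark 4.5):
n, beta = 10_000, 0.8
kappa_b_bar = n ** (beta / 2)                  # upper bound for kappa_b (alpha = 1/2)
kappa_n_under = n ** ((1 - beta / 2) / 2)      # lower bound for kappa_n (alpha = 1/4)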

Note that to construct the PCI (17), the values of \kappa_{n} and \kappa_{b} are required. Recall that \kappa_{n}=n^{\frac{1-\beta+2\alpha\beta}{2}} and \kappa_{b}=n^{\beta\frac{1-\beta+2\alpha\beta}{2}}. Although \beta is the practitioner’s choice, \alpha is typically unknown. Remark 4.5 explains how upper and lower bounds for \alpha can be used in the PCI construction.

Remark 4.5.

In constructing the PCI (17) we can replace \kappa_{b} by a larger value (say \bar{\kappa}_{b}) and replace \kappa_{n} by a smaller value (say \underline{\kappa}_{n}) and the coverage bound of Eq. 13 will still be met. From Theorem 4.1, the fastest rate of variance decrease is of order O(n^{-1}); so \alpha could be as large as 1/2, in which case \bar{\kappa}_{b}=n^{\frac{\beta}{2}}. On the other hand, the slowest rate is influenced by n^{-\frac{\xi}{\xi+d}}; if we pretend that the smoothness of the true model is equal to the input dimension (although it is actually smoother), we can take \alpha=1/4 to compute \underline{\kappa}_{n}=n^{\frac{1-\beta/2}{2}}.

5 Predictive inference with the DNN estimator

Most of the work on DNN estimation has applications in prediction, although this is typically point prediction. However, as in the estimation case, it is important to be able to quantify the accuracy of the point predictors, which can be done via the construction of Prediction Intervals (PI); see the related work of Pan and Politis, (2016); Wang and Politis, (2021); Zhang and Politis, (2023); Wu and Politis, (2023) on predictive inference with dependent or independent data.

Consider the problem of predicting a response Y_{0} that is associated with a regressor value of interest denoted by \bm{x}_{0}, and its corresponding prediction interval. The L_{2} optimal point predictor of Y_{0} is f(\bm{x}_{0}) which is well approximated by \bar{f}_{\text{DNN}}(\bm{x}_{0}) as Theorem 4.2 shows. To construct a PI for Y_{0}, we need to take the variability of the errors into account since, conditionally on \bm{X}_{0}=\bm{x}_{0}, we have Y_{0}=f(\bm{x}_{0})+\epsilon_{0}.

If the model f and the error distribution F_{\epsilon} were both known, we could construct a PI which covers Y_{0} with 1-\delta confidence level as follows:

\left[f(\bm{x}_{0})+z_{\epsilon,\delta/2},f(\bm{x}_{0})+z_{\epsilon,1-\delta/2}\right]; (18)

here z_{\epsilon,1-\delta/2} and z_{\epsilon,\delta/2} are the 1-\delta/2 and \delta/2 quantile values of F_{\epsilon}, respectively. Of course, we do not know the true model f but we may replace it with our scalable subsampling DNN estimator \overline{f}_{\text{DNN}}. In addition, F_{\epsilon} is also unknown and must be estimated; a typical estimator is \widehat{F}_{\epsilon}, the empirical distribution of the residuals. To elaborate, we define \widehat{F}_{\epsilon} as follows:

\begin{split}&\widehat{F}_{\epsilon}(z):=\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}_{\hat{\epsilon}_{i}\leq z};~{}\mathbbm{1}_{(\cdot)}~\text{is the indicator function;}\\ &\hat{\epsilon}_{i}=y_{i}-\overline{f}_{\text{DNN}}(\bm{x}_{i}),~{}\text{for}~{}i=1,\ldots,n.\end{split} (19)

To consistently estimate the error distribution F_{\epsilon}, we need to make some mild assumptions on F_{\epsilon}, namely:

  • B2: The error distribution F_{\epsilon} has zero mean and is differentiable on the real line, and \sup_{z}p_{\epsilon}(z)<\infty where p_{\epsilon}(z) is the density function of the error \epsilon.

The following Lemma can be proved analogously to the proof of Lemma 4.1 in Wu and Politis, (2023).

Lemma 5.1.

Under A1-A5 and B1, we have \sup_{z}|\widehat{F}_{\epsilon}(z)-F_{\epsilon}(z)|\overset{p}{\to}0.

We can then apply the PI below to approximate the ‘oracle’ PI of Eq. 18:

\left[\overline{f}_{\text{DNN}}(\bm{x}_{0})+\hat{z}_{\epsilon,\delta/2},\overline{f}_{\text{DNN}}(\bm{x}_{0})+\hat{z}_{\epsilon,1-\delta/2}\right]; (20)

here \hat{z}_{\epsilon,1-\delta/2} and \hat{z}_{\epsilon,\delta/2} are the 1-\delta/2 and \delta/2 quantile values of \widehat{F}_{\epsilon}, respectively. To construct this PI in practice, we can rely on Algorithm 3 below:

Algorithm 3 PI of Y_{0} conditional on \bm{x}_{0}
Step 1 Train the subagging DNN estimator \overline{f}_{\text{DNN}}(\cdot) and find the empirical distribution of residuals \widehat{F}_{\epsilon} as in Eq. 19.
Step 2 Evaluate the subagging DNN estimator at \bm{x}_{0} to get \overline{f}_{\text{DNN}}(\bm{x}_{0}).
Step 3 Determine \hat{z}_{\epsilon,\delta/2} and \hat{z}_{\epsilon,1-\delta/2} by taking the lower \delta/2 and 1-\delta/2 quantiles of \widehat{F}_{\epsilon}.
Step 4 Construct PI as Eq. 20.
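A minimal sketch of Algorithm 3 follows, assuming a fitted subagging estimator f_bar (e.g., from the earlier subagging_dnn sketch) that maps an array of regressors to an array of fitted values; numpy arrays are assumed throughout.

import numpy as np

def prediction_interval(f_bar, X, Y, x0, delta=0.05):
    resid = np.asarray(Y) - np.asarray(f_bar(X))          # fitted residuals, Eq. 19
    z_lo, z_hi = np.quantile(resid, [delta / 2, 1 - delta / 2])
    point = float(np.asarray(f_bar(x0)).squeeze())        # point predictor at x0
    return point + z_lo, point + z_hi                     # PI of Eq. 20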

We claim that the PI in Eq. 20 is asymptotically valid (conditionally on \bm{X}_{0}=\bm{x}_{0}), i.e., it satisfies

\mathbb{P}\left(Y_{0}\in\left[\overline{f}_{\text{DNN}}(\bm{x}_{0})+\hat{z}_{\epsilon,\delta/2},\overline{f}_{\text{DNN}}(\bm{x}_{0})+\hat{z}_{\epsilon,1-\delta/2}\right]\right)\overset{p}{\to}1-\delta, (21)

where the above probability is conditional on \bm{X}_{0}=\bm{x}_{0}. This statement is guaranteed by Theorem 5.1 below. To describe it, denote Y^{*}_{0}=\overline{f}_{\text{DNN}}(\bm{x}_{0})+\epsilon^{*}_{0} where \epsilon^{*}_{0} has the distribution \widehat{F}_{\epsilon}.

Theorem 5.1.

Under A1-A5 and B1-B2, the distribution of Y^{*}_{0} converges to the distribution of Y_{0} uniformly (in probability), i.e.,

\sup_{z}\left|F_{Y^{*}_{0}|\bm{X}_{0}=\bm{x}_{0}}(z)-F_{Y_{0}|\bm{X}_{0}=\bm{x}_{0}}(z)\right|\overset{p}{\to}0,~{}\text{as}~{}n\to\infty. (22)

Although the PI in Eq. 20 is asymptotically valid, it may undercover Y_{0} in the finite sample case. This problem is mainly due to two reasons: (1) the PI in Eq. 20 does not take the variability of model estimation into account; and (2) the scale of the error distribution is typically underestimated by the residual distribution with finite samples. For issue (1), we can rely on a so-called pertinent PI which is able to capture the model estimation variability; this pertinence property is crucial, especially for the prediction inference of time series data in which multiple-step-ahead forecasting is usually required. For issue (2), we can “enlarge” the residual distribution by basing it on the so-called predictive (as opposed to fitted) residuals. Although the predictive residuals are asymptotically equivalent to the fitted residuals, i.e., \hat{\epsilon} in Eq. 19, the corresponding PI could have a better coverage rate; see Politis, (2015) for the formal definition of pertinent PIs and predictive residuals.

In this paper, due to the computational cost of fitting DNN models, we only build the PI in Eq. 20. With a fairly large sample size, as in Section 6, this PI works well, and its empirical coverage rate is only slightly lower than that of the oracle.

6 Simulations

In this section, we check the performance of the scalable subagging DNN estimator with simulation examples. More specifically, we consider two aspects of each estimator: (1) time complexity: we take the running time of the training stage to measure complexity under a fixed hyperparameter setting, e.g., a fixed number of epochs and batch size; (2) estimation accuracy: we take the empirical MSE (mean squared error)/MSPE (mean squared prediction error) and the empirical coverage rate to measure the accuracy of point estimates/predictions and confidence/prediction intervals.

6.1 Simulations on point estimations

As shown in Section 4, under some mild conditions the scalable subagging DNN estimator is not only more computationally efficient but also more accurate than the DNN estimator trained on the whole sample. Here, we hope to verify this dominating performance with simulated data. To perform simulations, we consider the models below:

  • Model-1: Y=i=110Xi+ϵY=\sum_{i=1}^{10}X_{i}+\epsilon, where (X1,,X10)N(0,𝑰)(X_{1},\ldots,X_{10})\sim N(0,\bm{I}).

  • Model-2: Y=i=110iXi+ϵY=\sum_{i=1}^{10}i\cdot X_{i}+\epsilon, where (X1,,X10)N(0,𝑰)(X_{1},\ldots,X_{10})\sim N(0,\bm{I}).

  • Model-3: Y=X12+sin(X2+X3)+ϵY=X_{1}^{2}+\sin(X_{2}+X_{3})+\epsilon, where (X1,X2,X3)N(0,𝑰)(X_{1},X_{2},X_{3})\sim N(0,\bm{I}).

  • Model-4: Y=X12+sin(X2+X3)+exp(|X4+X5|)+ϵY=X_{1}^{2}+\sin(X_{2}+X_{3})+\exp(-|X_{4}+X_{5}|)+\epsilon, where (X1,X2,X3,X4,X5)N(0,𝑰)(X_{1},X_{2},X_{3},X_{4},X_{5})\sim N(0,\bm{I});

here 𝑰\bm{I} is an identity matrix with the appropriate dimension for each model; ϵ\epsilon is a standard normal error. We build the DNN estimator with PyTorch in Python. To train the DNN, we use the stochastic gradient descent algorithm Adam developed by Kingma and Ba, (2014) with a learning rate of 0.010.01. In addition, we take the number of epochs and the batch size to be 200 and 10, respectively, so that the DNN is fully trained in both the first and the iterated subsampling stages. We use the function time.time() in Python to compute the running time of the training procedure, namely Training Time.
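As an illustration, a minimal PyTorch sketch of this training setup is given below; the Model-1 data generation and the [20, 20] hidden widths are for illustration only, and the subsampling loop is omitted.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Simulate data from Model-1: Y = sum_{i=1}^{10} X_i + eps (sizes are illustrative)
n, d = 10_000, 10
X = torch.randn(n, d)
y = X.sum(dim=1, keepdim=True) + torch.randn(n, 1)

# A fully connected ReLU network with hidden widths [20, 20], as for S-DNN in Table 1
net = nn.Sequential(nn.Linear(d, 20), nn.ReLU(),
                    nn.Linear(20, 20), nn.ReLU(),
                    nn.Linear(20, 1))

optimizer = torch.optim.Adam(net.parameters(), lr=0.01)   # Adam with learning rate 0.01
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(X, y), batch_size=10, shuffle=True)

for epoch in range(200):          # 200 epochs, batch size 10, as described above
    for xb, yb in loader:
        optimizer.zero_grad()
        loss_fn(net(xb), yb).backward()
        optimizer.step()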

To be consistent with folk wisdom, we build f^DNN,b,i\widehat{f}_{\text{DNN},b,i} with a relatively large depth to decrease the bias. Meanwhile, we take the width as large as possible so that its size is close to the sample size, so that A3 is satisfied and we stay in the under-parameterized regime. In order to make a comprehensive comparison between the scalable subsampling DNN (SS-DNN) estimator f¯DNN\overline{f}_{\text{DNN}} and classical DNN estimators, we consider five DNN estimators, each trained on the whole sample:

  • (1)

    A DNN possesses the same depth and width as f^DNN,b,i\widehat{f}_{\text{DNN},b,i}. We denote it “S-DNN”.

  • (2)

    A DNN possesses the same depth as f^DNN,b,i\widehat{f}_{\text{DNN},b,i}, but a larger width so that its size is close to the sample size. We denote it “DNN-deep-1”.

  • (3)

    A DNN possesses the same depth as f^DNN,b,i\widehat{f}_{\text{DNN},b,i}, but a larger width so that its size is close to half of the sample size. We denote it “DNN-deep-2”.

  • (4)

    A DNN possesses only one hidden layer, but a larger width so that its size is close to the sample size. We denote it “DNN-wide-1”.

  • (5)

    A DNN possesses only one hidden layer, but a larger width so that its size is close to half of the sample size. We denote it “DNN-wide-2”.

We deploy DNN (1) to check the performance of a DNN with the same structure as f^DNN,b,i\widehat{f}_{\text{DNN},b,i} but trained on the whole dataset. We deploy DNNs (2)-(5) to challenge the scalable subsampling DNN estimator with various wide or deep DNNs. To evaluate the point estimation performance, we apply two empirical MSE criteria:

MSE-1:1ni=1n(f^DNN(𝒙i)yi)2;MSE-2:1ni=1n(f^DNN(𝒙i)f(𝒙i))2;\text{MSE-1:}\frac{1}{n}\sum_{i=1}^{n}(\widehat{f}_{\text{DNN}}(\bm{x}_{i})-y_{i})^{2}~{};~{}\text{MSE-2:}\frac{1}{n}\sum_{i=1}^{n}(\widehat{f}_{\text{DNN}}(\bm{x}_{i})-f(\bm{x}_{i}))^{2};

here f^DNN()\widehat{f}_{\text{DNN}}(\cdot) represents the different DNN estimators and f()f(\cdot) is the true regression function; {𝒙i,yi}i=1n\{\bm{x}_{i},y_{i}\}_{i=1}^{n} are the realized samples, which we call the training data.

An estimator is optimal in the MSE-1 criterion if its MSE-1 is closest to the sample variance of the errors, namely σ^ϵ2=1ni=1nϵi2\hat{\sigma}^{2}_{\epsilon}=\frac{1}{n}\sum_{i=1}^{n}\epsilon^{2}_{i}; here {ϵi}i=1n\{\epsilon_{i}\}_{i=1}^{n} are the observed error values. An estimator is optimal in the MSE-2 criterion if its MSE-2 is closest to 0. We present the MSE-1 and MSE-2 of the different estimators in Table 1. In addition, we also present σ^ϵ2\hat{\sigma}^{2}_{\epsilon} of the corresponding simulated sample as the benchmark for comparing the performance of the different estimators according to the MSE-1 criterion.
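A small sketch of how the two criteria can be computed is given below (MSPE-1/MSPE-2 are analogous, with the test data in place of the training data); the function and argument names are ours.

import numpy as np

def mse_criteria(f_hat, x, y, f_true):
    # f_hat: fitted DNN estimator (callable); f_true: true regression function (known in simulations)
    pred = f_hat(x)
    mse1 = np.mean((pred - y) ** 2)          # MSE-1, compared with the sample error variance
    mse2 = np.mean((pred - f_true(x)) ** 2)  # MSE-2, compared with 0
    return mse1, mse2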

Beyond the point estimation performance measured on the training data, we are also interested in the performance of the different DNN estimators on test data. Thus, we generate new samples {𝒙0,i,y0,i}i=1N\{\bm{x}_{0,i},y_{0,i}\}_{i=1}^{N}; here we take N=2105N=2\cdot 10^{5} to evaluate the prediction performance. Similarly, we consider two MSPEs, denoted MSPE-1 and MSPE-2, as follows:

MSPE-1:1Ni=1N(f^DNN(𝒙0,i)y0,i)2;MSPE-2:1Ni=1N(f^DNN(𝒙0,i)f(𝒙0,i))2;\text{MSPE-1:}\frac{1}{N}\sum_{i=1}^{N}(\widehat{f}_{\text{DNN}}(\bm{x}_{0,i})-y_{0,i})^{2}~{};~{}\text{MSPE-2:}\frac{1}{N}\sum_{i=1}^{N}(\widehat{f}_{\text{DNN}}(\bm{x}_{0,i})-f(\bm{x}_{0,i}))^{2};

we expect that the best estimator for prediction tasks should have the smallest MSPE-2 and an MSPE-1 closest to σ^ϵ,02=1Ni=1N(ϵ0,i)2\hat{\sigma}^{2}_{\epsilon,0}=\frac{1}{N}\sum_{i=1}^{N}(\epsilon_{0,i})^{2}; here {ϵ0,i}i=1N\{\epsilon_{0,i}\}_{i=1}^{N} are the observed error values for the test data. We present all simulation results in Table 1; the empirical MSE/MSPE and Training Time (in seconds) were computed as averages over 200 replications.

Table 1: MSE/MSPE and Training Time (in seconds) of different DNN models on various simulation datasets with error terms
Estimator: SS-DNN S-DNN DNN-deep-1 DNN-deep-2 DNN-wide-1 DNN-wide-2
Model-1, n=104n=10^{4}, σ^ϵ2=1.0011\hat{\sigma}^{2}_{\epsilon}=1.0011, σ^ϵ,02=1.0003\hat{\sigma}^{2}_{\epsilon,0}=1.0003
Width [20,20] [20,20] [90,90] [60,60] [800] [400]
MSE-1 1.0034 1.0168 0.9975 1.0036 1.0136 1.0151
MSE-2 0.1011 0.0579 0.1039 0.0894 0.0466 0.0433
MSPE-1 1.1020 1.0678 1.1299 1.1059 1.0543 1.0487
MSPE-2 0.1019 0.0675 0.1296 0.1057 0.0540 0.0484
Training Time 209 225 403 303 373 274
Model-2, n=104n=10^{4}, σ^ϵ2=1.0012\hat{\sigma}^{2}_{\epsilon}=1.0012, σ^ϵ,02=1.0011\hat{\sigma}^{2}_{\epsilon,0}=1.0011
Width [20,20] [20,20] [90,90] [60,60] [800] [400]
MSE-1 1.0506 1.1355 1.1314 1.1350 1.0768 1.0745
MSE-2 0.1232 0.1625 0.1889 0.1839 0.1249 0.1194
MSPE-1 1.1339 1.1469 1.1841 1.1737 1.1254 1.1237
MSPE-2 0.1338 0.1468 0.1841 0.1736 0.1253 0.1238
Training Time 224 240 417 320 376 280
Model-3, n=104n=10^{4}, σ^ϵ2=0.9997\hat{\sigma}^{2}_{\epsilon}=0.9997, σ^ϵ,02=1.0001\hat{\sigma}^{2}_{\epsilon,0}=1.0001
Width [15,15,15] [15,15,15] [65,65,65] [45,45,45] [2000] [1000]
MSE-1 1.0014 1.0361 1.0299 1.0308 1.0286 1.0290
MSE-2 0.0296 0.0536 0.0533 0.0522 0.0426 0.0431
MSPE-1 1.0310 1.0565 1.0572 1.0571 1.0453 1.0449
MSPE-2 0.0310 0.0564 0.0572 0.0570 0.0453 0.0449
Training Time 353 379 561 468 483 363
Model-4, n=104n=10^{4}, σ^ϵ2=1.0014\hat{\sigma}^{2}_{\epsilon}=1.0014, σ^ϵ,02=1.0003\hat{\sigma}^{2}_{\epsilon,0}=1.0003
Width [15,15,15] [15,15,15] [65,65,65] [45,45,45] [2000] [1000]
MSE-1 1.0243 1.0488 1.0318 1.0350 1.0457 1.0460
MSE-2 0.0757 0.0830 0.1076 0.0980 0.0729 0.0728
MSPE-1 1.0792 1.0878 1.1117 1.1048 1.0756 1.0752
MSPE-2 0.0790 0.0875 0.1114 0.1045 0.0754 0.0749
Training Time 359 376 560 471 551 394
Model-4, n=2104n=2\cdot 10^{4}, σ^ϵ2=0.9991\hat{\sigma}^{2}_{\epsilon}=0.9991, σ^ϵ,02=0.9999\hat{\sigma}^{2}_{\epsilon,0}=0.9999
Width [20,20,20] [20,20,20] [95,95,95] [65,65,65] [2800] [1400]
MSE-1 1.0093 1.0483 1.0419 1.0438 1.0508 1.0508
MSE-2 0.0490 0.0653 0.0686 0.0675 0.0635 0.0635
MSPE-1 1.0501 1.0669 1.0692 1.0689 1.0622 1.0625
MSPE-2 0.0502 0.0670 0.0692 0.0689 0.0623 0.0626
Training Time 748 775 1684 1198 1549 998

Note: “width” represents the number of neurons in each hidden layer, e.g., [20, 20] means that there are two hidden layers within the DNN, each with 20 neurons.

We can summarize several notable findings from the simulation results:

  • f¯DNN\overline{f}_{\text{DNN}} is always the most computationally efficient estimator; it is even faster than a single DNN estimator with the same structure as f^DNN,b,i\widehat{f}_{\text{DNN},b,i} but trained on the whole sample.

  • According to the MSE-1 criterion, f¯DNN\overline{f}_{\text{DNN}} is the most accurate estimator in all simulations.

  • According to the MSE-2 criterion, f¯DNN\overline{f}_{\text{DNN}} works best when the data are large enough for Models 3-4, which are non-linear. For Model-2, the performance of f¯DNN\overline{f}_{\text{DNN}} is only slightly worse than that of the optimal estimator. For Model-1, the performance of f¯DNN\overline{f}_{\text{DNN}} is still worse than the optimal estimator. We conjecture that the reason is that Model-1 and Model-2 are linear models; in this case, a wide DNN is sufficient to mimic such a linear relationship.

  • In terms of MSPE, f¯DNN\overline{f}_{\text{DNN}} works slightly worse than the optimal model in the Model-1 and Model-2 cases, but it turns out to be the optimal one in the Model-3 and Model-4 cases. This phenomenon is consistent with the behavior of the MSEs.

  • The model-selection step for “wide” or “deep” type DNN estimators is necessary but also obscure: DNN-wide-2 works better than DNN-wide-1 for the Model-2 MSE case, yet the situation reverses for the Model-3 MSE case. This phenomenon also occurs for “deep” type DNN estimators; comparing the performance of S-DNN, DNN-deep-1 and DNN-deep-2, no single one beats the others uniformly. For MSPE, we find the same reversal phenomenon. On the other hand, by applying the scalable subagging estimator, we can avoid this model-selection difficulty and just make f^DNN,b,i\widehat{f}_{\text{DNN},b,i} deep and large enough.

To analyze solely the ability of the various DNN estimators to estimate regression functions, we present additional simulation results in Appendix B, where the four models described above have no error terms, so MSE-1 and MSE-2 coincide with each other.

6.2 Simulations for CI and PI

We continue using the four models in Section 6.1 to test the accuracy of the multiple confidence and prediction intervals defined in previous sections with scalable subagging DNN estimators. To make sure we have enough subsamples to do iterated subsampling for the CI, we take the sample size to be 21052\cdot 10^{5}, which implies q=38q=38 when β=0.7\beta=0.7. It further implies that the number of subsamples for the iterated subagging stage is q′=nβ(1β)=12q^{\prime}=\lfloor n^{\beta(1-\beta)}\rfloor=12. For developing the prediction interval, we take the sample size to be 10410^{4} or 21042\cdot 10^{4}. To determine the structure of the subagging DNN estimator, we keep the strategy summarized in the previous subsection, i.e., we make its size as close to the sample size as possible in both the first and the iterated subsampling stages. We take the same training setting with PyTorch to find f¯DNN(𝒙0)\overline{f}_{\text{DNN}}(\bm{x}_{0}) as in Section 6.1.
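As a quick numerical check of the counts quoted above (a sketch; the exact definitions of b, q, b′ and q′ follow the scalable subsampling setup of the earlier sections):

import math

n, beta = 200_000, 0.7
b = math.floor(n ** beta)                        # first-stage subsample size, b = n^beta
q = math.floor(n ** (1 - beta))                  # number of first-stage subsamples: 38
b_prime = math.floor(n ** (beta ** 2))           # iterated-stage subsample size, b' = b^beta
q_prime = math.floor(n ** (beta * (1 - beta)))   # iterated-stage subsamples per subsample: 12
print(b, q, b_prime, q_prime)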

We call the naive QCI determined by the equal-tail quantiles of the estimates {f^DNN,b,1(𝒙),\{\widehat{f}_{\text{DNN},b,1}(\bm{x}), ,f^DNN,b,q(𝒙)}\ldots,\widehat{f}_{\text{DNN},b,q}(\bm{x})\} QCI-1; as we explained in 4.1, this QCI may be too conservative. We call the QCI based on Eq. 17 QCI-2; the PCI based on Eq. 12 PCI-1; the PCI based on Eq. 15 with 2α=02\alpha=0 PCI-2; and the PCI based on Eq. 15 with 2α=12\alpha=1 PCI-3. PI denotes the prediction interval defined in Eq. 20. All CIs and the PI defined in previous sections are asymptotically valid conditionally on the observation 𝑿=𝒙\bm{X}=\bm{x}. We attempt to check the conditional coverage rate with simulations in finite-sample cases. To achieve this purpose, for each simulation model we fix 10 unchanged test points {(yt,1,𝒙t,1),,(yt,10,𝒙t,10)}\{(y_{t,1},\bm{x}_{t,1}),\ldots,(y_{t,10},\bm{x}_{t,10})\} which are different from the training points; these 10 points can be recovered by setting numpy.random.seed(0) and generating samples according to the model.

To evaluate the performance of the (conditional) CI for each test point, we repeat the simulation process K=500K=500 times and apply the formulas below to compute the empirical coverage rate (ECR) and empirical length (EL) of the different CIs for each test point:

ECRj=1Ki=1K𝟙f(𝒙t,j)[Bl,i,j,Bu,i,j],ELj=1Ki=1K(Bu,i,jBl,i,j),forj=1,,10;\text{ECR}_{j}=\frac{1}{K}\sum_{i=1}^{K}\mathbbm{1}_{f(\bm{x}_{t,j})\in[B_{l,i,j},B_{u,i,j}]}~{},~{}\text{EL}_{j}=\frac{1}{K}\sum_{i=1}^{K}(B_{u,i,j}-B_{l,i,j}),\text{for}~{}j=1,\ldots,10;

here f(𝒙t,j)f(\bm{x}_{t,j}) is the true model value evaluated at the jj-th test point; Bu,i,jB_{u,i,j} and Bl,i,jB_{l,i,j} are the corresponding upper and lower bounds of the different CIs at the ii-th replication for the jj-th test point, respectively. We take two nominal significance levels, δ=0.05\delta=0.05 and δ=0.1\delta=0.1. Simulation results are tabulated in Tables 2 and 3.
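A sketch of this computation, assuming the CI bounds over the K=500K=500 replications have been stored in arrays (the array names are ours):

import numpy as np

def ci_ecr_el(f_true_vals, lower, upper):
    # f_true_vals: (J,) true values f(x_{t,j}); lower, upper: (K, J) CI bounds per replication
    cover = (lower <= f_true_vals) & (f_true_vals <= upper)
    ecr = cover.mean(axis=0)            # ECR_j for each test point
    el = (upper - lower).mean(axis=0)   # EL_j for each test point
    return ecr, el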

To evaluate the performance of the (conditional) PI for each test point, the procedure is slightly more involved; we summarize it in the four steps below:

  • Step 1

    Take the sample size nn to be 10410^{4} or 21042\cdot 10^{4}; simulate K=500K=500 sample sets: {(yi(k),𝒙i(k))i=1n}k=1K\{(y^{(k)}_{i},\bm{x}^{(k)}_{i})_{i=1}^{n}\}_{k=1}^{K} based on one of four simulation models.

  • Step 2

    For each sample set, train the subsampling DNN estimator and build the prediction interval for 10 test points by:

    [f¯DNN(𝒙t,j)+z^ϵ,δ/2,f¯DNN(𝒙t,j)+z^ϵ,1δ/2],forj=1,,10,[\overline{f}_{\text{DNN}}(\bm{x}_{t,j})+\hat{z}_{\epsilon,\delta/2},\overline{f}_{\text{DNN}}(\bm{x}_{t,j})+\hat{z}_{\epsilon,1-\delta/2}],~{}\text{for}~{}j=1,\ldots,10,

    where z^ϵ,1δ/2\hat{z}_{\epsilon,1-\delta/2} and z^ϵ,δ/2\hat{z}_{\epsilon,\delta/2} are the 1δ/21-\delta/2 and δ/2\delta/2 quantile values of the empirical distribution of the residuals, respectively.

  • Step 3

    To check the performance of the PIs for the test points based on each sample set, simulate {ys,j}s=1M\{y_{s,j}\}_{s=1}^{M} conditional on 𝒙t,j\bm{x}_{t,j} for j=1,,10j=1,\ldots,10, pretending the true data-generating model is known, and check the empirical coverage rate and empirical length by the formulas below:

    ECRi,j=1Ms=1M𝟙ys,j[Bl,i,j,Bu,i,j],ELi,j=Bu,i,jBl,i,j,forj=1,,10;i=1,,500;\text{ECR}_{i,j}=\frac{1}{M}\sum_{s=1}^{M}\mathbbm{1}_{y_{s,j}\in[B_{l,i,j},B_{u,i,j}]}~{},~{}\text{EL}_{i,j}=B_{u,i,j}-B_{l,i,j},\text{for}~{}j=1,\ldots,10;i=1,\ldots,500;

    Bl,i,jB_{l,i,j} and Bu,i,jB_{u,i,j} are the corresponding lower and upper bounds of the PI for the jj-th test point based on the ii-th sample set defined in Step 2; M=3000M=3000.

  • Step 4

    For j=1,,10j=1,\ldots,10, estimate (Y0PI|𝑿0=𝒙t,j)\mathbb{P}(Y_{0}\in PI|\bm{X}_{0}=\bm{x}_{t,j}) by the average of the empirical coverage rates of the corresponding (conditional) PI over the KK sample sets, i.e., Average(ECRi,j)\text{Average}(\text{ECR}_{i,j}) w.r.t. ii; estimate the length of the (conditional) PI for the jj-th test point by Average(ELi,j)\text{Average}(\text{EL}_{i,j}) w.r.t. ii.

We again take two nominal significance levels, δ=0.05\delta=0.05 and δ=0.1\delta=0.1. Simulation results are tabulated in Table 4.
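A sketch of Steps 3-4 above for a single test point is given below, assuming standard normal errors as in the simulation models; the function and argument names are ours.

import numpy as np

def conditional_pi_check(f_true_x, lower_i, upper_i, M=3000, rng=None):
    # f_true_x: true value f(x_{t,j}); lower_i, upper_i: (K,) PI bounds over the K sample sets
    rng = np.random.default_rng() if rng is None else rng
    y_future = f_true_x + rng.standard_normal(M)      # simulated y_{s,j}, s = 1, ..., M
    ecr_i = [np.mean((lo <= y_future) & (y_future <= up))
             for lo, up in zip(lower_i, upper_i)]     # ECR_{i,j} for each sample set
    return np.mean(ecr_i), np.mean(np.asarray(upper_i) - np.asarray(lower_i))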

Remark (Different levels of conditioning).

As explained in the work of Wang and Politis, (2021), there are several conditioning levels at which the performance of a PI or CI can be measured. What we consider in this paper is 1:=(|𝐗0=𝐱0)\mathbb{P}_{1}:=\mathbb{P}(\cdot|\bm{X}_{0}=\bm{x}_{0}), which shall be interpreted as the probability conditional on 𝐗0=𝐱0\bm{X}_{0}=\bm{x}_{0}. If we consider ECRi,j\text{ECR}_{i,j}, it approximates another conditioning level, i.e., 2:=(|𝐗0=𝐱0,(Yn,𝐗n))\mathbb{P}_{2}:=\mathbb{P}(\cdot|\bm{X}_{0}=\bm{x}_{0},(Y_{n},\bm{X}_{n})); here (Yn,𝐗n)(Y_{n},\bm{X}_{n}) represents the whole sample. By Lemma 4 of Wang and Politis, (2021), if Aσ(𝐗n,𝐘n,Xf,Y0)A\in\sigma\left(\mathbf{X}_{n},\mathbf{Y}_{n},X_{f},Y_{0}\right) is an arbitrary measurable event, then 𝔼Yn,𝐗n2(A)=1(A)\mathbb{E}_{Y_{n},\bm{X}_{n}}\mathbb{P}_{2}(A)=\mathbb{P}_{1}(A). Moreover, 1δ1-\delta conditional coverage under 1\mathbb{P}_{1} implies 1δ1-\delta marginal coverage under 0:=𝔼𝐗1\mathbb{P}_{0}:=\mathbb{E}_{\bm{X}}\mathbb{P}_{1}. This unconditional coverage is the type of guarantee offered by the popular Conformal Prediction method in the machine learning community. Simulation studies show that our CIs and PIs also have great unconditional coverage; see the results in Appendix C.

We can summarize several findings based on simulation results:

  • Regarding the empirical coverage rate of the quantile-type CIs, the naive QCI-1 over-covers the true model values, as we expected. Also, its CI length is always larger than that of QCI-2, and it is actually the largest among the 5 different CIs. On the other hand, the specifically designed QCI-2 returns ECRs that are closer to the specified confidence level than QCI-1. Meanwhile, the ECR of QCI-2 is larger than the nominal confidence level for almost all test points since we take κn\kappa_{n} and κb\kappa_{b} according to the strategy in 4.5 to enlarge the CI.

  • Regarding the empirical coverage rate of the pivot-type CIs, although PCI-1 has the shortest length, its ECR is below the nominal confidence level for almost all test points, since Cσ2C_{\sigma}^{2} may be underestimated and a bias issue may arise in practice. For PCI-3, whose margin of error is enlarged in a mild way, the ECR is always larger than that of PCI-1, but it still mostly undercovers the true model value. For PCI-2, in which the margin of error is enlarged in the most extreme way, the coverage rate is much better, at the cost of a larger CI length. We claim that PCI-2 is the optimal CI candidate according to the overall performance based on length and coverage rate. For QCI-2, we conjecture that it will be a good alternative if we have more samples, so that Lb,b,SS(x)L_{b^{\prime},b,SS}(x) can approximate Jn(x)J_{n}(x) well in the iterated subsampling stage.

  • For the prediction task, the PIs for the four models and all test points have almost the same coverage rate and length. All ECRs are slightly below the nominal confidence level, which is not surprising since we omit the variability of the model estimation and the residual distribution may underestimate the scale of the true error distribution in a finite-sample case. All PI lengths are close to 2z0.952\cdot z_{0.95} or 2z0.9752\cdot z_{0.975}, since the true error distribution is standard normal in the simulations and we took equal-tail PIs.

Table 2: Empirical Coverage Rate and Empirical Length of different (conditional) CIs with various simulation models
Test point: 1 2 3 4 5 6 7 8 9 10
Model-1, n=2105n=2\cdot 10^{5}
ECR
QCI-1 δ=0.10\delta=0.10 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
QCI-1 δ=0.05\delta=0.05 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
QCI-2 δ=0.10\delta=0.10 0.966 0.964 0.990 0.990 0.946 0.814 0.994 0.972 0.978 0.946
QCI-2 δ=0.05\delta=0.05 0.988 0.984 0.998 1.000 0.984 0.932 0.998 0.984 0.994 0.976
PCI-1 δ=0.10\delta=0.10 0.884 0.854 0.896 0.894 0.748 0.510 0.874 0.884 0.830 0.860
PCI-1 δ=0.05\delta=0.05 0.938 0.918 0.954 0.940 0.836 0.610 0.942 0.938 0.890 0.918
PCI-2 δ=0.10\delta=0.10 0.892 0.986 0.996 1.000 0.784 0.954 0.998 0.908 0.876 0.998
PCI-2 δ=0.05\delta=0.05 0.946 0.996 0.998 1.000 0.882 0.984 1.000 0.954 0.938 1.000
PCI-3 δ=0.10\delta=0.10 0.886 0.862 0.906 0.914 0.748 0.524 0.880 0.884 0.830 0.878
PCI-3 δ=0.05\delta=0.05 0.938 0.922 0.958 0.960 0.836 0.616 0.948 0.938 0.890 0.928
EL
QCI-1 δ=0.10\delta=0.10 2.39 2.24 1.93 1.92 1.51 1.04 1.38 1.94 1.46 3.10
QCI-1 δ=0.05\delta=0.05 2.94 3.05 2.49 2.46 1.85 1.30 1.94 2.65 1.81 4.19
QCI-2 δ=0.10\delta=0.10 1.44 1.25 1.27 1.27 1.00 0.80 0.96 1.23 1.02 1.50
QCI-2 δ=0.05\delta=0.05 1.73 1.50 1.54 1.54 1.20 0.97 1.16 1.48 1.24 1.81
PCI-1 δ=0.10\delta=0.10 0.39 0.37 0.32 0.31 0.24 0.17 0.23 0.32 0.24 0.51
PCI-1 δ=0.05\delta=0.05 0.46 0.45 0.38 0.37 0.29 0.20 0.28 0.38 0.28 0.61
PCI-2 δ=0.10\delta=0.10 0.40 0.53 0.57 1.07 0.25 0.31 0.51 0.34 0.26 0.96
PCI-2 δ=0.05\delta=0.05 0.47 0.63 0.67 1.28 0.30 0.37 0.61 0.41 0.31 1.15
PCI-3 δ=0.10\delta=0.10 0.39 0.38 0.32 0.34 0.24 0.17 0.24 0.32 0.24 0.53
PCI-3 δ=0.05\delta=0.05 0.46 0.45 0.38 0.41 0.29 0.21 0.28 0.39 0.28 0.63
Model-2, n=2105n=2\cdot 10^{5}
ECR
QCI-1 δ=0.10\delta=0.10 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
QCI-1 δ=0.05\delta=0.05 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
QCI-2 δ=0.10\delta=0.10 0.910 0.960 0.948 0.918 0.910 0.628 0.972 0.936 0.930 0.928
QCI-2 δ=0.05\delta=0.05 0.952 0.980 0.980 0.962 0.960 0.746 0.984 0.974 0.970 0.958
PCI-1 δ=0.10\delta=0.10 0.886 0.890 0.892 0.862 0.860 0.828 0.888 0.866 0.882 0.888
PCI-1 δ=0.05\delta=0.05 0.932 0.952 0.954 0.926 0.916 0.894 0.944 0.940 0.936 0.958
PCI-2 δ=0.10\delta=0.10 0.904 0.974 0.990 1.000 0.880 0.958 0.982 0.896 0.902 0.998
PCI-2 δ=0.05\delta=0.05 0.946 0.984 0.996 1.000 0.940 0.990 0.984 0.960 0.952 1.000
PCI-3 δ=0.10\delta=0.10 0.886 0.892 0.900 0.892 0.860 0.830 0.896 0.868 0.882 0.920
PCI-3 δ=0.05\delta=0.05 0.932 0.956 0.958 0.948 0.916 0.898 0.950 0.940 0.936 0.964
EL
QCI-1 δ=0.10\delta=0.10 2.66 2.05 2.45 2.72 2.13 1.93 2.05 2.48 2.55 3.17
QCI-1 δ=0.05\delta=0.05 3.31 2.60 3.04 3.35 2.58 2.35 2.75 3.04 3.09 4.21
QCI-2 δ=0.10\delta=0.10 1.19 1.08 1.15 1.26 1.04 0.92 1.15 1.18 1.19 1.33
QCI-2 δ=0.05\delta=0.05 1.44 1.30 1.39 1.52 1.25 1.10 1.39 1.43 1.44 1.60
PCI-1 δ=0.10\delta=0.10 0.43 0.33 0.39 0.44 0.34 0.31 0.34 0.40 0.41 0.52
PCI-1 δ=0.05\delta=0.05 0.51 0.40 0.47 0.52 0.41 0.37 0.40 0.48 0.49 0.61
PCI-2 δ=0.10\delta=0.10 0.44 0.48 0.61 1.12 0.35 0.39 0.56 0.42 0.42 0.96
PCI-2 δ=0.05\delta=0.05 0.52 0.57 0.73 1.33 0.41 0.47 0.67 0.50 0.50 1.14
PCI-3 δ=0.10\delta=0.10 0.43 0.33 0.40 0.47 0.34 0.31 0.34 0.40 0.41 0.53
PCI-3 δ=0.05\delta=0.05 0.51 0.40 0.48 0.55 0.41 0.37 0.41 0.48 0.49 0.63
Table 3: Empirical Coverage Rate and Empirical Length of different (conditional) CIs with various simulation models
Test point: 1 2 3 4 5 6 7 8 9 10
Model-3, n=2105n=2\cdot 10^{5}
ECR
QCI-1 δ=0.10\delta=0.10 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
QCI-1 δ=0.05\delta=0.05 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
QCI-2 δ=0.10\delta=0.10 0.988 0.996 1.000 0.996 0.976 0.962 0.990 0.992 0.998 0.994
QCI-2 δ=0.05\delta=0.05 0.998 1.000 1.000 0.998 0.996 0.974 0.996 0.998 1.000 1.000
PCI-1 δ=0.10\delta=0.10 0.788 0.868 0.638 0.862 0.704 0.810 0.868 0.846 0.838 0.842
PCI-1 δ=0.05\delta=0.05 0.858 0.920 0.728 0.920 0.804 0.880 0.930 0.918 0.914 0.920
PCI-2 δ=0.10\delta=0.10 1.000 0.884 1.000 1.000 1.000 0.902 0.908 0.978 0.876 0.970
PCI-2 δ=0.05\delta=0.05 1.000 0.938 1.000 1.000 1.000 0.948 0.964 0.998 0.932 0.988
PCI-3 δ=0.10\delta=0.10 0.860 0.870 0.754 0.870 0.726 0.810 0.868 0.852 0.840 0.844
PCI-3 δ=0.05\delta=0.05 0.920 0.920 0.842 0.924 0.820 0.882 0.930 0.922 0.914 0.922
EL
QCI-1 δ=0.10\delta=0.10 1.24 1.22 0.41 0.47 0.77 2.39 1.64 1.10 0.91 0.78
QCI-1 δ=0.05\delta=0.05 1.51 1.47 0.51 0.57 0.93 2.94 2.00 1.32 1.12 0.94
QCI-2 δ=0.10\delta=0.10 0.85 0.97 0.39 0.43 0.62 1.52 1.31 0.80 0.74 0.62
QCI-2 δ=0.05\delta=0.05 1.03 1.16 0.47 0.52 0.75 1.83 1.57 0.97 0.90 0.76
PCI-1 δ=0.10\delta=0.10 0.20 0.19 0.07 0.07 0.12 0.39 0.26 0.18 0.15 0.12
PCI-1 δ=0.05\delta=0.05 0.24 0.23 0.08 0.09 0.15 0.46 0.31 0.21 0.17 0.15
PCI-2 δ=0.10\delta=0.10 1.22 0.21 0.76 0.20 0.45 0.53 0.29 0.27 0.16 0.20
PCI-2 δ=0.05\delta=0.05 1.46 0.25 0.90 0.23 0.53 0.63 0.35 0.33 0.20 0.24
PCI-3 δ=0.10\delta=0.10 0.24 0.20 0.08 0.08 0.13 0.39 0.26 0.18 0.15 0.13
PCI-3 δ=0.05\delta=0.05 0.28 0.23 0.10 0.09 0.15 0.47 0.31 0.21 0.17 0.15
Model-4, n=2105n=2\cdot 10^{5}
ECR
QCI-1 δ=0.10\delta=0.10 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
QCI-1 δ=0.05\delta=0.05 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
QCI-2 δ=0.10\delta=0.10 0.738 1.000 0.996 0.910 0.994 0.924 0.982 0.998 0.988 0.864
QCI-2 δ=0.05\delta=0.05 0.852 1.000 0.998 0.966 0.998 0.968 0.994 1.000 0.996 0.940
PCI-1 δ=0.10\delta=0.10 0.878 0.832 0.870 0.856 0.870 0.868 0.860 0.872 0.494 0.662
PCI-1 δ=0.05\delta=0.05 0.932 0.902 0.940 0.912 0.928 0.938 0.910 0.932 0.590 0.776
PCI-2 δ=0.10\delta=0.10 0.962 1.000 0.948 0.984 0.998 0.894 1.000 0.998 0.998 0.998
PCI-2 δ=0.05\delta=0.05 0.980 1.000 0.986 0.996 0.998 0.948 1.000 0.998 1.000 1.000
PCI-3 δ=0.10\delta=0.10 0.878 0.898 0.870 0.858 0.884 0.868 0.888 0.876 0.522 0.664
PCI-3 δ=0.05\delta=0.05 0.936 0.948 0.942 0.920 0.938 0.938 0.936 0.940 0.614 0.780
EL
QCI-1 δ=0.10\delta=0.10 2.90 0.63 0.79 1.50 2.35 1.16 1.09 1.99 1.21 0.69
QCI-1 δ=0.05\delta=0.05 3.71 0.79 1.00 1.91 2.92 1.48 1.35 2.48 1.49 0.89
QCI-2 δ=0.10\delta=0.10 1.74 0.61 0.70 1.14 1.67 0.80 0.85 1.38 0.89 0.58
QCI-2 δ=0.05\delta=0.05 2.10 0.74 0.84 1.38 2.02 0.97 1.03 1.68 1.08 0.70
PCI-1 δ=0.10\delta=0.10 0.47 0.10 0.13 0.24 0.38 0.19 0.18 0.32 0.20 0.11
PCI-1 δ=0.05\delta=0.05 0.56 0.12 0.15 0.29 0.45 0.22 0.21 0.39 0.23 0.13
PCI-2 δ=0.10\delta=0.10 0.69 0.78 0.17 0.45 0.85 0.20 0.76 0.54 0.60 0.24
PCI-2 δ=0.05\delta=0.05 0.82 0.93 0.20 0.54 1.01 0.24 0.90 0.64 0.72 0.29
PCI-3 δ=0.10\delta=0.10 0.48 0.12 0.13 0.24 0.39 0.19 0.19 0.33 0.20 0.11
PCI-3 δ=0.05\delta=0.05 0.57 0.14 0.15 0.29 0.47 0.22 0.23 0.39 0.24 0.14
Table 4: Empirical Coverage Rate and Empirical Length of (conditional) PIs with various simulation models
Test point: 1 2 3 4 5 6 7 8 9 10
n=104n=10^{4}
Model-1:
EL = 3.28, δ=0.10\delta=0.10; EL = 3.91, δ=0.05\delta=0.05
ECR δ=0.10\delta=0.10: 0.870 0.877 0.877 0.873 0.884 0.891 0.886 0.878 0.886 0.865
ECR δ=0.05\delta=0.05: 0.929 0.934 0.934 0.931 0.939 0.944 0.940 0.935 0.940 0.925
Model-2:
EL = 3.35, δ=0.10\delta=0.10; EL = 4.00, δ=0.05\delta=0.05
ECR δ=0.10\delta=0.10: 0.880 0.882 0.883 0.879 0.889 0.892 0.888 0.881 0.885 0.869
ECR δ=0.05\delta=0.05: 0.936 0.938 0.938 0.935 0.942 0.945 0.942 0.937 0.939 0.928
Model-3:
EL = 3.29, δ=0.10\delta=0.10; EL = 3.93, δ=0.05\delta=0.05
ECR δ=0.10\delta=0.10: 0.896 0.894 0.898 0.899 0.897 0.880 0.889 0.895 0.896 0.897
ECR δ=0.05\delta=0.05: 0.948 0.946 0.949 0.949 0.948 0.936 0.943 0.947 0.947 0.949
Model-4:
EL = 3.32, δ=0.10\delta=0.10; EL = 3.96, δ=0.05\delta=0.05
ECR δ=0.10\delta=0.10: 0.830 0.900 0.900 0.889 0.882 0.898 0.897 0.889 0.895 0.899
ECR δ=0.05\delta=0.05: 0.901 0.951 0.950 0.943 0.938 0.949 0.949 0.943 0.947 0.950
n=2104n=2\cdot 10^{4}
Model-1:
EL = 3.28, δ=0.10\delta=0.10; EL = 3.91, δ=0.05\delta=0.05
ECR δ=0.10\delta=0.10: 0.885 0.886 0.887 0.886 0.892 0.895 0.892 0.887 0.891 0.880
ECR δ=0.05\delta=0.05: 0.939 0.941 0.941 0.940 0.945 0.946 0.945 0.941 0.943 0.936
Model-2:
EL = 3.32, δ=0.10\delta=0.10; EL = 3.95, δ=0.05\delta=0.05
ECR δ=0.10\delta=0.10: 0.887 0.891 0.889 0.889 0.893 0.895 0.894 0.889 0.891 0.882
ECR δ=0.05\delta=0.05: 0.941 0.944 0.942 0.943 0.945 0.947 0.946 0.943 0.944 0.938
Model-3:
EL = 3.29, δ=0.10\delta=0.10; EL = 3.92, δ=0.05\delta=0.05
ECR δ=0.10\delta=0.10: 0.897 0.896 0.899 0.899 0.899 0.889 0.893 0.897 0.897 0.899
ECR δ=0.05\delta=0.05: 0.948 0.947 0.950 0.949 0.949 0.943 0.945 0.948 0.948 0.949
Model-4:
EL = 3.30, δ=0.10\delta=0.10; EL = 3.94, δ=0.05\delta=0.05
ECR δ=0.10\delta=0.10: 0.858 0.900 0.899 0.892 0.887 0.897 0.898 0.891 0.895 0.899
ECR δ=0.05\delta=0.05: 0.921 0.950 0.950 0.945 0.941 0.948 0.949 0.944 0.947 0.949

7 Conclusions

In this paper, we revisit the error bound of fully connected DNNs with ReLU activation functions for estimating regression models. By taking into account the latest DNN approximation results, we improve the current error bound. Under some mild conditions, we show that the error bound of the DNN estimator may be further improved by applying the scalable subsampling technique. As a result, the scalable subsampling DNN estimator is computationally efficient without sacrificing accuracy. The theoretical result is verified by extensive simulations with various linear and non-linear regression models.

Beyond the error analysis for point estimations and point predictions, we propose different approaches to build asymptotically valid confidence and prediction intervals. More specifically, to overcome the undercoverage issue of CIs with finite samples, we consider several methods to enlarge the CI. As shown by simulations, our point estimations/predictions and confidence/prediction intervals based on scalable subsampling work well in practice. All in all, the scalable subsampling DNN estimator offers the complete package in terms of statistical inference, i.e., (a) computational efficiency; (b) point estimation/prediction accuracy; and (c) allowing for the construction of practically useful confidence and prediction intervals.

Appendix A: Proofs

Proof of Theorem 4.1.

This result can be easily shown based on the proof of Theorem 1 in the work of Farrell et al., (2021). We take the intermediate result from the final step of their proof: With probability at least 1exp(γ)1-\exp(-\gamma),

f^DNNfL2(X)C(H2L2log(H2L)nlogn+loglogn+γn+ϵn),\left\|\widehat{f}_{\text{DNN}}-f\right\|_{L_{2}(X)}\leq C\left(\sqrt{\frac{H^{2}L^{2}\log\left(H^{2}L\right)}{n}\log n}+\sqrt{\frac{\log\log n+\gamma}{n}}+\epsilon_{n}\right), (23)

where CC is an appropriate constant; throughout this proof, CC denotes an appropriate constant whose value may change according to the context; ϵn=fDNNf\epsilon_{n}=\left\|f_{\text{DNN}}-f\right\|_{\infty}; fDNN=argminfθDNNfθff_{\text{DNN}}=\arg\min_{f_{\theta}\in\mathscr{F}_{\text{DNN}}}\left\|f_{\theta}-f\right\|_{\infty}. By Theorem 3.1 of Yarotsky and Zhevnerchuk, (2020) and Lemma 1 of Farrell et al., (2021), we can conclude that there is a standard fully connected DNN whose width HH and depth LL satisfy the inequalities below:

HCϵndξlog(1/ϵn),LClog(1/ϵn),\begin{split}&H\leq C\epsilon_{n}^{-\frac{d}{\xi}}\log\left(1/\epsilon_{n}\right),\\ &L\leq C\cdot\log\left(1/\epsilon_{n}\right),\end{split} (24)

for any ϵn\epsilon_{n}. Furthermore, we can find the upper bound of H2L2log(H2L)H^{2}L^{2}\log\left(H^{2}L\right) based on Eq. 24:

H2L2log(H2L)Cϵn2dξ(log(1/ϵn))5.H^{2}L^{2}\log\left(H^{2}L\right)\leq C\cdot\epsilon_{n}^{-\frac{2d}{\xi}}\left(\log\left(1/\epsilon_{n}\right)\right)^{5}.

Subsequently, we can rewrite Eq. 23 as:

f^DNNfL2(X)C(ϵn2dξ(log(1/ϵn))5nlogn+loglogn+γn+ϵn).\left\|\widehat{f}_{\text{DNN}}-f\right\|_{L_{2}(X)}\leq C\left(\sqrt{\frac{\epsilon_{n}^{-\frac{2d}{\xi}}\left(\log\left(1/\epsilon_{n}\right)\right)^{5}}{n}\log n}+\sqrt{\frac{\log\log n+\gamma}{n}}+\epsilon_{n}\right). (25)

To optimize the bound, we can choose ϵn=nξ2(ξ+d),H=Θ(nd2(ξ+d)logn),L=Θ(logn)\epsilon_{n}=n^{-\frac{\xi}{2(\xi+d)}},H=\Theta(n^{\frac{d}{2(\xi+d)}}\log n),L=\Theta(\log n), which balances the first and third terms on the r.h.s. of Eq. 25 up to logarithmic factors. This gives:

f^DNNfL2(X)C(nξ2(ξ+d)log3n+loglogn+γn).\left\|\widehat{f}_{\text{DNN}}-f\right\|_{L_{2}(X)}\leq C\left(n^{-\frac{\xi}{2(\xi+d)}}\log^{3}n+\sqrt{\frac{\log\log n+\gamma}{n}}\right). (26)

As a result, we get:

f^DNNfL2(X)2C(nξ(ξ+d)log6n+loglogn+γn).\left\|\widehat{f}_{\text{DNN}}-f\right\|_{L_{2}(X)}^{2}\leq C\left(n^{-\frac{\xi}{(\xi+d)}}\log^{6}n+\frac{\log\log n+\gamma}{n}\right). (27)

Finally, we take γ=ndd+ξlog6(n)\gamma=n^{\frac{d}{d+\xi}}\log^{6}(n), which implies Theorem 4.1.
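For the reader's convenience, here is a short check (our own verification) that the choice \epsilon_{n}=n^{-\frac{\xi}{2(\xi+d)}} indeed balances the first and third terms in Eq. 25 up to logarithmic factors: it gives \epsilon_{n}^{-\frac{2d}{\xi}}=n^{\frac{d}{\xi+d}} and \log(1/\epsilon_{n})\asymp\log n, so that

\sqrt{\frac{\epsilon_{n}^{-\frac{2d}{\xi}}\left(\log\left(1/\epsilon_{n}\right)\right)^{5}}{n}\log n}\asymp\sqrt{n^{\frac{d}{\xi+d}-1}\log^{6}n}=n^{-\frac{\xi}{2(\xi+d)}}\log^{3}n,

which is the first term appearing in Eq. 26 and is of the same polynomial order as \epsilon_{n}.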

Proof of Theorem 4.2.

Under A1-A5, we can analyze the expected squared error of the subagging DNN estimator as follows:

𝔼(f¯DNN(𝑿)f(𝑿))2=𝔼[1qi=1qf^DNN,b,i(𝑿)f(𝑿)]2=1q2𝔼[i=1q(f^DNN,b,i(𝑿)f(𝑿))]2=1q2𝔼[i=1q(f^DNN,b,i(𝑿)f(𝑿))2]+1q2𝔼[i,j,ij(f^DNN,b,i(𝑿)f(𝑿))(f^DNN,b,j(𝑿)f(𝑿))].\begin{split}&\mathbb{E}(\overline{f}_{\text{DNN}}(\bm{X})-f(\bm{X}))^{2}\\ &=\mathbb{E}\left[\frac{1}{q}\sum_{i=1}^{q}\widehat{f}_{\text{DNN},b,i}(\bm{X})-f(\bm{X})\right]^{2}\\ &=\frac{1}{q^{2}}\mathbb{E}\left[\sum_{i=1}^{q}\left(\widehat{f}_{\text{DNN},b,i}(\bm{X})-f(\bm{X})\right)\right]^{2}\\ &=\frac{1}{q^{2}}\mathbb{E}\left[\sum_{i=1}^{q}\left(\widehat{f}_{\text{DNN},b,i}(\bm{X})-f(\bm{X})\right)^{2}\right]+\frac{1}{q^{2}}\mathbb{E}\left[\sum_{i,j,i\neq j}\left(\widehat{f}_{\text{DNN},b,i}(\bm{X})-f(\bm{X})\right)\cdot\left(\widehat{f}_{\text{DNN},b,j}(\bm{X})-f(\bm{X})\right)\right].\end{split} (28)

For the first term on the r.h.s. of Eq. 28, by the error bound (ignoring the slowly varying term), we get:

1q2𝔼[i=1q(f^DNN,b,i(𝑿)f(𝑿))2]1q2qO(nβξξ+d)=1qO(1nβξξ+d)=O(1nβξξ+d+1β);\begin{split}\frac{1}{q^{2}}\mathbb{E}\left[\sum_{i=1}^{q}\left(\widehat{f}_{\text{DNN},b,i}(\bm{X})-f(\bm{X})\right)^{2}\right]&\leq\frac{1}{q^{2}}\cdot q\cdot O\left(n^{-\frac{\beta\xi}{\xi+d}}\right)\\ &=\frac{1}{q}O\left(\frac{1}{n^{\frac{\beta\xi}{\xi+d}}}\right)\\ &=O\left(\frac{1}{n^{\frac{\beta\xi}{\xi+d}+1-\beta}}\right);\end{split} (29)

this is satisfied with at least probability (1exp(ndξ+dlog6n))q(1-\exp(-n^{\frac{d}{\xi+d}}\log^{6}n))^{q}.

Ideally, we hope β\beta can take a small value to improve the error bound in Eq. 29. However, this is restricted, since the bias of the subagging estimator increases once we take β\beta smaller and smaller. Thus, we need to consider the second term on the r.h.s. of Eq. 28. Start by considering one specific pair:

𝔼[(f^DNN,b,i(𝑿)f(𝑿))(f^DNN,b,j(𝑿)f(𝑿))]=𝔼[𝔼[(f^DNN,b,i(𝑿)f(𝑿))(f^DNN,b,j(𝑿)f(𝑿))|𝑿]]=𝔼[𝔼[(f^DNN,b,i(𝑿)f(𝑿))|𝑿]𝔼[(f^DNN,b,j(𝑿)f(𝑿))|𝑿]].\begin{split}&\mathbb{E}\left[\left(\widehat{f}_{\text{DNN},b,i}(\bm{X})-f(\bm{X})\right)\cdot\left(\widehat{f}_{\text{DNN},b,j}(\bm{X})-f(\bm{X})\right)\right]\\ &=\mathbb{E}\left[\mathbb{E}\left[\left(\widehat{f}_{\text{DNN},b,i}(\bm{X})-f(\bm{X})\right)\cdot\left(\widehat{f}_{\text{DNN},b,j}(\bm{X})-f(\bm{X})\right)\bigg{|}\bm{X}\right]\right]\\ &=\mathbb{E}\left[\mathbb{E}\left[\left(\widehat{f}_{\text{DNN},b,i}(\bm{X})-f(\bm{X})\right)\bigg{|}\bm{X}\right]\cdot\mathbb{E}\left[\left(\widehat{f}_{\text{DNN},b,j}(\bm{X})-f(\bm{X})\right)\bigg{|}\bm{X}\right]\right].\end{split} (30)

The last equality is due to the independence between the subsamples BiB_{i} and BjB_{j}. As we mentioned in the main text, we face difficulty in determining the rate of the bias of the subagging estimator. Thus, A4 and A5 are used to make additional assumptions on the bias term. We restate A4 below:

𝔼(f^DNN(𝒙)f(𝒙))=O(nΛ/2);𝔼(f^DNN,b,i(𝒙)f(𝒙))=O(nβΛ/2).\mathbb{E}(\widehat{f}_{\text{DNN}}(\bm{x})-f(\bm{x}))=O(n^{-\Lambda/2})~{};~{}\mathbb{E}(\widehat{f}_{\text{DNN},b,i}(\bm{x})-f(\bm{x}))=O(n^{-\beta\Lambda/2}).

A5 then requires that the bias order of f^DNN\widehat{f}_{\text{DNN}} satisfies the inequality Λ>ξξ+d\Lambda>\frac{\xi}{\xi+d}.

Then, we find that the order of Eq. 30 is:

𝔼[(f^DNN,b,i(𝑿)f(𝑿))(f^DNN,b,j(𝑿)f(𝑿))]=O(nβΛ).\mathbb{E}\left[\left(\widehat{f}_{\text{DNN},b,i}(\bm{X})-f(\bm{X})\right)\cdot\left(\widehat{f}_{\text{DNN},b,j}(\bm{X})-f(\bm{X})\right)\right]=O(n^{-\beta\Lambda}).

Combining these two pieces, we can bound Eq. 28:

𝔼(f¯DNN(𝑿)f(𝑿))2O(1nβξξ+d+1β)+21q2(q2)O(1nβΛ)=O(1nβξξ+d+1β)+O(1nβΛ).\begin{split}&\mathbb{E}(\overline{f}_{\text{DNN}}(\bm{X})-f(\bm{X}))^{2}\\ &\leq O\left(\frac{1}{n^{\frac{\beta\xi}{\xi+d}+1-\beta}}\right)+2\cdot\frac{1}{q^{2}}\cdot{{q}\choose{2}}\cdot O\left(\frac{1}{n^{\beta\Lambda}}\right)\\ &=O\left(\frac{1}{n^{\frac{\beta\xi}{\xi+d}+1-\beta}}\right)+O\left(\frac{1}{n^{\beta\Lambda}}\right).\end{split} (31)

For the bias term to be negligible relative to the other term, we need:

βΛβξξ+d+1β,i.e.,β11+Λξξ+d.\beta\Lambda\geq\frac{\beta\xi}{\xi+d}+1-\beta,~{}i.e.,\beta\geq\frac{1}{1+\Lambda-\frac{\xi}{\xi+d}}.

The above lower bound on β\beta is automatically positive. For it to be less than 1, Λ\Lambda needs to be larger than ξξ+d\frac{\xi}{\xi+d}, which is satisfied due to A5. Meanwhile, we want to take β\beta as small as possible, i.e., β=11+Λξξ+d\beta=\frac{1}{1+\Lambda-\frac{\xi}{\xi+d}}. This results in the error bound below:

𝔼(f¯DNN(𝑿)f(𝑿))2O(nΛΛ+dξ+d).\mathbb{E}(\overline{f}_{\text{DNN}}(\bm{X})-f(\bm{X}))^{2}\leq O\left(n^{\frac{-\Lambda}{\Lambda+\frac{d}{\xi+d}}}\right).
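For completeness, here is the short algebra behind this exponent (a verification under the choice of \beta above): since 1-\frac{\xi}{\xi+d}=\frac{d}{\xi+d},

\beta=\frac{1}{1+\Lambda-\frac{\xi}{\xi+d}}=\frac{1}{\Lambda+\frac{d}{\xi+d}},\qquad\text{and hence}\qquad\beta\Lambda=\frac{\Lambda}{\Lambda+\frac{d}{\xi+d}},

which is exactly the exponent in the bound above.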

The fact that ΛΛ+dξ+d\frac{\Lambda}{\Lambda+\frac{d}{\xi+d}} is larger than ξξ+d\frac{\xi}{\xi+d} is guaranteed by the requirement that Λ>ξξ+d\Lambda>\frac{\xi}{\xi+d}, i.e., A5 again.

Proof of Theorem 5.1.

Since the errors ϵ0\epsilon_{0} and ϵ0\epsilon^{*}_{0} are independent of 𝒙0\bm{x}_{0}, we actually have supz|Fϵ0|𝑿0=𝒙0(z)Fϵ0|𝑿0=𝒙0(z)|𝑝0\sup_{z}\left|F_{\epsilon^{*}_{0}|\bm{X}_{0}=\bm{x}_{0}}(z)-F_{\epsilon_{0}|\bm{X}_{0}=\bm{x}_{0}}(z)\right|\overset{p}{\to}0 based on Lemma 5.1. Thus, we can write:

supz|(Y0f¯DNN(𝒙0)z)(Y0f(𝒙0)z)|𝑝0,\sup_{z}|\mathbb{P}(Y^{*}_{0}-\overline{f}_{\text{DNN}}(\bm{x}_{0})\leq z)-\mathbb{P}(Y_{0}-f(\bm{x}_{0})\leq z)|\overset{p}{\to}0, (32)

where ()\mathbb{P}(\cdot) represents (|𝑿0=𝒙0)\mathbb{P}(\cdot|\bm{X}_{0}=\bm{x}_{0}). We start by considering the expression below:

supz|(Y0f(𝒙0)z)(Y0f(𝒙0)z)|=supz|(Y0f(𝒙0)z)(Y0f¯DNN(𝒙0)z)+(Y0f¯DNN(𝒙0)z)(Y0f(𝒙0)z)|supz|(Y0f(𝒙0)z)(Y0f¯DNN(𝒙0)z)|+supz|(Y0f¯DNN(𝒙0)z)(Y0f(𝒙0)z)|.\begin{split}&\sup_{z}|\mathbb{P}(Y^{*}_{0}-f(\bm{x}_{0})\leq z)-\mathbb{P}(Y_{0}-f(\bm{x}_{0})\leq z)|\\ &=\sup_{z}|\mathbb{P}(Y^{*}_{0}-f(\bm{x}_{0})\leq z)-\mathbb{P}(Y_{0}^{*}-\overline{f}_{\text{DNN}}(\bm{x}_{0})\leq z)+\mathbb{P}(Y_{0}^{*}-\overline{f}_{\text{DNN}}(\bm{x}_{0})\leq z)-\mathbb{P}(Y_{0}-f(\bm{x}_{0})\leq z)|\\ &\leq\sup_{z}|\mathbb{P}(Y^{*}_{0}-f(\bm{x}_{0})\leq z)-\mathbb{P}(Y_{0}^{*}-\overline{f}_{\text{DNN}}(\bm{x}_{0})\leq z)|+\sup_{z}|\mathbb{P}(Y_{0}^{*}-\overline{f}_{\text{DNN}}(\bm{x}_{0})\leq z)-\mathbb{P}(Y_{0}-f(\bm{x}_{0})\leq z)|.\\ \end{split} (33)

For the first term on the r.h.s. of the above inequality, we have:

supz|(Y0f(𝒙0)z)(Y0f¯DNN(𝒙0)z)|=supz|(Y0f¯DNN(𝒙0)+f¯DNN(𝒙0)f(𝒙0)z)(Y0f¯DNN(𝒙0)z)|=supz|Fϵ0(z+f(𝒙0)f¯DNN(𝒙0))Fϵ0(z)|=supz|Fϵ0(z+f(𝒙0)f¯DNN(𝒙0))Fϵ0(z+f(𝒙0)f¯DNN(𝒙0))+Fϵ0(z+f(𝒙0)f¯DNN(𝒙0))Fϵ0(z)+Fϵ0(z)Fϵ0(z)|supz|Fϵ0(z+f(𝒙0)f¯DNN(𝒙0))Fϵ0(z+f(𝒙0)f¯DNN(𝒙0))|+supz|Fϵ0(z+f(𝒙0)f¯DNN(𝒙0))Fϵ0(z)|+supz|Fϵ0(z)Fϵ0(z)|.\begin{split}&\sup_{z}|\mathbb{P}(Y^{*}_{0}-f(\bm{x}_{0})\leq z)-\mathbb{P}(Y_{0}^{*}-\overline{f}_{\text{DNN}}(\bm{x}_{0})\leq z)|\\ &=\sup_{z}|\mathbb{P}(Y_{0}^{*}-\overline{f}_{\text{DNN}}(\bm{x}_{0})+\overline{f}_{\text{DNN}}(\bm{x}_{0})-f(\bm{x}_{0})\leq z)-\mathbb{P}(Y_{0}^{*}-\overline{f}_{\text{DNN}}(\bm{x}_{0})\leq z)|\\ &=\sup_{z}|F_{\epsilon_{0}^{*}}(z+f(\bm{x}_{0})-\overline{f}_{\text{DNN}}(\bm{x}_{0}))-F_{\epsilon_{0}^{*}}(z)|\\ &=\sup_{z}|F_{\epsilon_{0}^{*}}(z+f(\bm{x}_{0})-\overline{f}_{\text{DNN}}(\bm{x}_{0}))-F_{\epsilon_{0}}(z+f(\bm{x}_{0})-\overline{f}_{\text{DNN}}(\bm{x}_{0}))\\ &+F_{\epsilon_{0}}(z+f(\bm{x}_{0})-\overline{f}_{\text{DNN}}(\bm{x}_{0}))-F_{\epsilon_{0}}(z)+F_{\epsilon_{0}}(z)-F_{\epsilon_{0}^{*}}(z)|\\ &\leq\sup_{z}|F_{\epsilon_{0}^{*}}(z+f(\bm{x}_{0})-\overline{f}_{\text{DNN}}(\bm{x}_{0}))-F_{\epsilon_{0}}(z+f(\bm{x}_{0})-\overline{f}_{\text{DNN}}(\bm{x}_{0}))|\\ &+\sup_{z}|F_{\epsilon_{0}}(z+f(\bm{x}_{0})-\overline{f}_{\text{DNN}}(\bm{x}_{0}))-F_{\epsilon_{0}}(z)|+\sup_{z}|F_{\epsilon_{0}}(z)-F_{\epsilon_{0}^{*}}(z)|.\end{split} (34)

We should notice that the first and third terms on the r.h.s. of Eq. 34 converge to 0 in probability by Lemma 5.1. For the middle term, since f¯DNN(𝒙0)\overline{f}_{\text{DNN}}(\bm{x}_{0}) converges to f(𝒙0)f(\bm{x}_{0}) in probability and supz|pϵ0(z)|\sup_{z}|p_{\epsilon_{0}}(z)| is bounded by assumption B2, this term also converges to 0 in probability by applying a Taylor expansion. Combining all the pieces, we have:

supz|FY0|𝑿0=𝒙0(z)FY0|𝑿0=𝒙0(z)|𝑝0.\sup_{z}\left|F_{Y_{0}^{*}|\bm{X}_{0}=\bm{x}_{0}}(z)-F_{Y_{0}|\bm{X}_{0}=\bm{x}_{0}}(z)\right|\overset{p}{\to}0. (35)

Appendix B: Additional simulations on point estimations

In this part, we consider the models below to check the performance of the SS-DNN estimator:

  • Model-1’: Y=i=110XiY=\sum_{i=1}^{10}X_{i}, where (X1,,X10)N(0,𝑰)(X_{1},\ldots,X_{10})\sim N(0,\bm{I}).

  • Model-2’: Y=i=110iXiY=\sum_{i=1}^{10}i\cdot X_{i}, where (X1,,X10)N(0,𝑰)(X_{1},\ldots,X_{10})\sim N(0,\bm{I}).

  • Model-3’: Y=X12+sin(X2+X3)Y=X_{1}^{2}+\sin(X_{2}+X_{3}), where (X1,X2,X3)N(0,𝑰)(X_{1},X_{2},X_{3})\sim N(0,\bm{I}).

  • Model-4’: Y=X12+sin(X2+X3)+exp(|X4+X5|)Y=X_{1}^{2}+\sin(X_{2}+X_{3})+\exp(-|X_{4}+X_{5}|), where (X1,X2,X3,X4,X5)N(0,𝑰)(X_{1},X_{2},X_{3},X_{4},X_{5})\sim N(0,\bm{I}).

Similar to the main text, 𝑰\bm{I} is the identity matrix with the appropriate dimension. Compared to the data-generating models in the main text, the only difference is that the error term is removed. Due to this change, we can investigate the approximation ability of the various DNN estimators directly, i.e., MSE-1 is now equivalent to MSE-2. Using the same training procedure, we summarize all simulation results in Table 5; here, the empirical MSE and Run Times (in seconds) were also computed as averages over 200 replications.

Table 5: MSE and Run Times of different DNN models with various simulation models
SS-DNN S-DNN DNN-deep-1 DNN-deep-2 DNN-wide-1 DNN-wide-2
Model-1 n=104n=10^{4}
Width [20,20] [20,20] [90,90] [60,60] [800] [400]
MSE 0.0002 0.0008 0.0009 0.0007 0.0007 0.0006
Run Times 238 252 446 338 405 298
Model-2 n=104n=10^{4}
Width [20,20] [20,20] [90,90] [60,60] [800] [400]
MSE 0.0071 0.0126 0.0171 0.0213 0.0079 0.0104
Run Times 209 227 406 308 367 270
Model-3 n=104n=10^{4}
Width [15,15,15] [15,15,15] [65,65,65] [45,45,45] [2000] [1000]
MSE 0.0038 0.0075 0.0061 0.0065 0.0068 0.0066
Run Times 251 271 445 356 397 289
Model-4 n=104n=10^{4}
Width [15,15,15] [15,15,15] [65,65,65] [45,45,45] [2000] [1000]
MSE 0.0073 0.0126 0.0090 0.0096 0.0119 0.0124
Run Times 252 270 449 358 476 328
Model-4 n=2104n=2\cdot 10^{4}
Width [20,20,20] [20,20,20] [95,95,95] [65,65,65] [2800] [1400]
MSE 0.0083 0.0233 0.0194 0.0188 0.0247 0.0248
Run Times 518 555 1438 962 1369 862

Note: Here, “width” represents the number of neurons in each hidden layer, e.g., [20, 20] means that there are two hidden layers within the DNN, each with 20 neurons.

The SS-DNN is still the most time-efficient estimator; it even runs faster than training the S-DNN on the whole sample. Applying the scalable subagging method can gain even more computational savings when training with a larger sample size or a larger model. The SS-DNN is also the most accurate estimator except in the case of 10410^{4} Model-4 simulated data. For this case, the accuracy of SS-DNN is slightly worse than that of the estimator DNN-deep-1. We conjecture the reason is that Model-4 is relatively complicated, so a DNN with depth 3 and constant width 15 has a high bias. After increasing the sample size to 20000, the subagging estimator beats the other models.

Appendix C: Simulations for unconditional CI and PI

In this part, we consider the unconditional (not conditional on 𝑿0=𝒙0\bm{X}_{0}=\bm{x}_{0}) performance of the CIs and PI defined in the main text. We take the empirical coverage rate (ECR) and the empirical length of the CI (EL) below as measurement criteria for the (unconditional) CI:

ECR=1nk=1n𝟙f(𝒙k)[Bl,k,Bu,k];EL=1nk=1n(Bu,kBl,k);\text{ECR}=\frac{1}{n}\sum_{k=1}^{n}\mathbbm{1}_{f(\bm{x}_{k})\in[B_{l,k},B_{u,k}]}~{};~{}\text{EL}=\frac{1}{n}\sum_{k=1}^{n}(B_{u,k}-B_{l,k});

here f(𝒙k)f(\bm{x}_{k}) is the true model value at the kk-th data point in the training dataset; Bu,kB_{u,k} and Bl,kB_{l,k} are the corresponding upper and lower bounds of CI, respectively. To measure the performance of (unconditional) PI, we take

ECR=1Ni=1N𝟙yi,0[Bl,i,Bu,i],EL=1Ni=1N(Bu,iBl,i);\text{ECR}=\frac{1}{N}\sum_{i=1}^{N}\mathbbm{1}_{y_{i,0}\in[B_{l,i},B_{u,i}]}~{},~{}\text{EL}=\frac{1}{N}\sum_{i=1}^{N}(B_{u,i}-B_{l,i});

here yi,0y_{i,0} is the ii-th observed response value in the test dataset; Bu,iB_{u,i} and Bl,iB_{l,i} are the corresponding upper and lower bounds of the PI. We take n=N=2105n=N=2\cdot 10^{5}. Since the (unconditional) ECR and EL hardly change in the simulation studies, we perform only 50 replications and present the average results of the various CIs and PIs in Table 6 and Table 7, respectively. Similar to the simulation results for conditional CIs and PIs in the main text, QCI-1 over-covers the true model values; PCI-2 shows the best overall performance in terms of length and coverage rate. All PIs show great performance but slightly undercover the true future values.

Regarding the computational cost of the iterated subsampling stage, the total time of training all DNN estimators f^DNN,b,i(j)\widehat{f}^{(j)}_{\text{DNN},b,i} for i{1,,q}i\in\{1,\ldots,q\} and j{1,,q}j\in\{1,\ldots,q^{\prime}\} (iterated subsampling stage) is less than the time of training all DNN estimators in the first subsampling stage, i.e., f^DNN,b,i\widehat{f}_{\text{DNN},b,i} for i{1,,q}i\in\{1,\ldots,q\}. We can see the reason by analyzing the computational complexity of the iterated subsampling stage. In total, we need to train qq=O(n1β2)q\cdot q^{\prime}=O(n^{1-\beta^{2}}) models with sample size nβ2n^{\beta^{2}}. Under the assumption made in 4.1, the complexity of training a DNN is mainly determined by its size, the sample size and the number of epochs, so the training time of the iterated stage is around qqO(nβ2nβ2)=O(n1+β2)q\cdot q^{\prime}\cdot O(n^{\beta^{2}}\cdot n^{\beta^{2}})=O(n^{1+\beta^{2}}) when the sample size is close to the size of the DNN. Similarly, the complexity of training the DNNs in the first subsampling stage is around O(n1+β)O(n^{1+\beta}). Since β<1\beta<1, the complexity of the first subsampling stage dominates that of the iterated stage when nn is large enough. In other words, the complexity cost of applying the iterated subsampling technique is negligible when we are dealing with a huge dataset.
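As a rough numerical illustration of this comparison (a sketch that ignores constants, network widths and epoch counts), one can compare the two orders for, say, n=2105n=2\cdot 10^{5} and β=0.7\beta=0.7:

n, beta = 200_000, 0.7
first_stage = n ** (1 + beta)           # order of the training cost of the first stage
iterated_stage = n ** (1 + beta ** 2)   # order of the training cost of the iterated stage
print(iterated_stage / first_stage)     # n^(beta^2 - beta) = n^(-0.21), about 0.077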

Table 6: Empirical Coverage Rate and Empirical Length of different (unconditional) CIs with various simulation models; Run Times of first and iterated subsampling stages
Nominal δ\delta: 0.1 0.05
Model-1, n=2105n=2\cdot 10^{5}
ECR of QCI-1 1.000 1.000
ECR of QCI-2 0.961 0.983
ECR of PCI-1 0.842 0.905
ECR of PCI-2 0.953 0.976
ECR of PCI-3 0.853 0.914
EL of QCI-1 1.752 2.371
EL of QCI-2 1.090 1.316
EL of PCI-1 0.292 0.348
EL of PCI-2 0.549 0.654
EL of PCI-3 0.299 0.356
Training time of first and iterated subagging estimators: 5613 &\& 4566
Structure of f^DNN,b,i\widehat{f}_{\text{DNN},b,i} and f^DNN,b,i(j)\widehat{f}^{(j)}_{\text{DNN},b,i}: [65,65] &\& [10,10]
Model-2, n=2105n=2\cdot 10^{5}
ECR of QCI-1 1.000 1.000
ECR of QCI-2 0.925 0.961
ECR of PCI-1 0.875 0.931
ECR of PCI-2 0.957 0.979
ECR of PCI-3 0.883 0.937
EL of QCI-1 2.213 2.847
EL of QCI-2 1.117 1.349
EL of PCI-1 0.359 0.428
EL of PCI-2 0.597 0.711
EL of PCI-3 0.367 0.437
Training time of first and iterated subagging estimators: 5451 &\& 4566
Structure of f^DNN,b,i\widehat{f}_{\text{DNN},b,i} and f^DNN,b,i(j)\widehat{f}^{(j)}_{\text{DNN},b,i}: [65,65] &\& [10,10]
Model-3, n=2105n=2\cdot 10^{5}
ECR of QCI-1 1.000 1.000
ECR of QCI-2 0.967 0.985
ECR of PCI-1 0.776 0.846
ECR of PCI-2 0.961 0.976
ECR of PCI-3 0.802 0.867
EL of QCI-1 0.774 0.956
EL of QCI-2 0.625 0.755
EL of PCI-1 0.125 0.149
EL of PCI-2 0.457 0.545
EL of PCI-3 0.132 0.158
Training time of first and iterated subagging estimators: 6597 &\& 5551
Structure of f^DNN,b,i\widehat{f}_{\text{DNN},b,i} and f^DNN,b,i(j)\widehat{f}^{(j)}_{\text{DNN},b,i}: [45,45,45] &\& [10,10,10]
Model-4, n=2105n=2\cdot 10^{5}
ECR of QCI-1 1.000 1.000
ECR of QCI-2 0.904 0.945
ECR of PCI-1 0.762 0.830
ECR of PCI-2 0.931 0.956
ECR of PCI-3 0.778 0.844
EL of QCI-1 1.242 1.570
EL of QCI-2 0.903 1.091
EL of PCI-1 0.202 0.240
EL of PCI-2 0.496 0.591
EL of PCI-3 0.209 0.249
Training time of first and iterated subagging estimators: 6451 &\& 5419
Structure of f^DNN,b,i\widehat{f}_{\text{DNN},b,i} and f^DNN,b,i(j)\widehat{f}^{(j)}_{\text{DNN},b,i}: [45,45,45] &\& [10,10,10]
Table 7: Empirical Coverage Rate and Empirical Length of (unconditional) PI with various simulation models
Nominal δ\delta: 0.1 0.05
Model-1, N=2105N=2\cdot 10^{5}
ECR of PI 0.8981 0.9486
EL of PI 3.2919 3.9234
Model-2, N=2105N=2\cdot 10^{5}
ECR of PI 0.8983 0.9487
EL of PI 3.3008 3.9316
Model-3, N=2105N=2\cdot 10^{5}
ECR of PI 0.8973 0.9481
EL of PI 3.3022 3.9339
Model-4, N=2105N=2\cdot 10^{5}
ECR of PI 0.8975 0.9482
EL of PI 3.3135 3.9411

References

  • Bartlett et al., (2019) Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. (2019). Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301.
  • Belkin et al., (2019) Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854.
  • Breiman, (1996) Breiman, L. (1996). Bagging predictors. Machine learning, 24:123–140.
  • Bühlmann and Yu, (2002) Bühlmann, P. and Yu, B. (2002). Analyzing bagging. The annals of Statistics, 30(4):927–961.
  • Cybenko, (1989) Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314.
  • Farrell et al., (2021) Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1):181–213.
  • Ha et al., (2005) Ha, K., Cho, S., and MacLachlan, D. (2005). Response models based on bagging neural networks. Journal of Interactive Marketing, 19(1):17–30.
  • Hornik et al., (1989) Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366.
  • Jordan, (2013) Jordan, M. I. (2013). On statistics, computation and scalability. Bernoulli, 19(4):1378–1390.
  • Khwaja et al., (2015) Khwaja, A., Naeem, M., Anpalagan, A., Venetsanopoulos, A., and Venkatesh, B. (2015). Improved short-term load forecasting using bagged neural networks. Electric Power Systems Research, 125:109–115.
  • Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kohler et al., (2023) Kohler, M., Langer, S., and Reif, U. (2023). Estimation of a regression function on a manifold by fully connected deep neural networks. Journal of Statistical Planning and Inference, 222:160–181.
  • McCulloch and Pitts, (1943) McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133.
  • Pan and Politis, (2016) Pan, L. and Politis, D. N. (2016). Bootstrap prediction intervals for linear, nonlinear and nonparametric autoregressions. Journal of Statistical Planning and Inference, 177:1–27.
  • Politis, (2015) Politis, D. N. (2015). Model-free prediction and regression: a transformation-based approach to inference. Springer.
  • Politis, (2021) Politis, D. N. (2021). Scalable subsampling: computation, aggregation and inference. arXiv preprint arXiv:2112.06434. Accepted by Biometrika.
  • Politis et al., (1999) Politis, D. N., Romano, J. P., and Wolf, M. (1999). Subsampling. Springer Science & Business Media.
  • Stone, (1982) Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The annals of statistics, 10(4):1040–1053.
  • Ting, (2021) Ting, D. (2021). Simple, optimal algorithms for random sampling without replacement. arXiv preprint arXiv:2104.05091.
  • Wang and Politis, (2021) Wang, Y. and Politis, D. N. (2021). Model-free bootstrap and conformal prediction in regression: Conditionality, conjecture testing, and pertinent prediction intervals. arXiv preprint arXiv:2109.12156.
  • Wu and Politis, (2023) Wu, K. and Politis, D. N. (2023). Bootstrap prediction inference of non-linear autoregressive models. arXiv preprint arXiv:2306.04126.
  • Yang et al., (2020) Yang, Z., Yu, Y., You, C., Steinhardt, J., and Ma, Y. (2020). Rethinking bias-variance trade-off for generalization of neural networks. In International Conference on Machine Learning, pages 10767–10777. PMLR.
  • Yarotsky, (2018) Yarotsky, D. (2018). Optimal approximation of continuous functions by very deep relu networks. In Conference on learning theory, pages 639–649. PMLR.
  • Yarotsky and Zhevnerchuk, (2020) Yarotsky, D. and Zhevnerchuk, A. (2020). The phase diagram of approximation rates for deep neural networks. Advances in neural information processing systems, 33:13005–13015.
  • Zhang and Politis, (2023) Zhang, Y. and Politis, D. N. (2023). Bootstrap prediction intervals with asymptotic conditional validity and unconditional guarantees. Information and Inference: A Journal of the IMA, 12(1):157–209.
  • Zou et al., (2021) Zou, T., Li, X., Liang, X., and Wang, H. (2021). On the subbagging estimation for massive data. arXiv preprint arXiv:2103.00631.