Scalable Subsampling Inference for Deep Neural Networks
Abstract
Deep neural networks (DNN) have received increasing attention in machine learning applications in the last several years. Recently, a non-asymptotic error bound was developed to measure the performance of the fully connected DNN estimator with ReLU activation functions for estimating regression models. The paper at hand gives a small improvement on the current error bound based on the latest results on the approximation ability of DNNs. More importantly, however, a non-random subsampling technique, scalable subsampling, is applied to construct a ‘subagged’ DNN estimator. Under regularity conditions, it is shown that the subagged DNN estimator is computationally efficient without sacrificing accuracy for either estimation or prediction tasks. Beyond point estimation/prediction, we propose different approaches to build confidence and prediction intervals based on the subagged DNN estimator. In addition to being asymptotically valid, the proposed confidence/prediction intervals appear to work well in finite samples. All in all, the scalable subsampling DNN estimator offers the complete package in terms of statistical inference, i.e., (a) computational efficiency; (b) point estimation/prediction accuracy; and (c) allowing for the construction of practically useful confidence and prediction intervals.
1 Introduction
In the last several years, machine learning (ML) methods have been developed rapidly fueled by ever-increasing amounts of data and computational power. Among different ML methods, a popular and widely-used technique is Neural Networks (NN) that models the relationship between inputs and outputs through layers of connected computational neurons. The idea of applying such a biology-analogous framework can be traced to the work of McCulloch and Pitts, (1943).
At the end of the 20th century, attention focused on feed-forward Shallow Neural Networks (SNN) with sigmoid-type activation functions. An SNN has only one hidden layer but is shown to possess the universal approximation property, i.e., it can be used to approximate any Borel measurable function from one finite dimensional space to another with any desired degree of accuracy—see Cybenko, (1989); Hornik et al., (1989) and references within. However, the practical performance of SNNs left much to be desired. In the last ten or so years, Deep Neural Networks (DNN) have received increased attention due to their great empirical performance.
Although DNN have become a state-of-the-art model, their theoretical foundation is still in development. Notably, Yarotsky, (2018); Yarotsky and Zhevnerchuk, (2020) explored the approximation ability of DNN for any function that belongs to a Hölder Banach space; here, the sigmoid-type activation functions are now replaced by ReLU-type functions to avoid the gradient vanishing problem. The aforementioned work showed that the optimal error of the DNN estimator can be uniformly bounded, i.e.,
(1) $\sup_{x} \big| f^*(x) - f_{\mathrm{DNN}}(x) \big| = O\big(W^{-2\beta/d}\big);$
here, $\beta$ is some smoothness measurement of the target function $f^*$—see Section 4 for a formal definition; $W$ is the size of a neural network $f_{\mathrm{DNN}}$, i.e., the total number of parameters; and $d$ is the dimension of the function inputs.
However, the bound (1) is not useful in practice. The reason is three-fold: (a) it requires a discontinuous weight assignment to build the desired DNN, so it is not feasible to train such a DNN with the usual gradient-based methods; (b) the structure of the DNN might not be the standard fully connected form, so finding a satisfactory specific structure becomes another difficulty; most importantly, (c) this error bound concerns the optimal estimation we can achieve from a finely designed DNN. It tells us nothing about what happens when the DNN estimator is applied to solve real-world problems.
For example, what is the performance of the DNN in estimating a regression function $f^*$ with $n$ independent samples generated from an underlying true model? It is easy to see that the error of $f_{\mathrm{DNN}}$ in sup-norm can be arbitrarily small if we allow $W$ to be arbitrarily large, based on Eq. 1. However, this optimal performance is hardly achievable and only represents the theoretically best estimation. What we attempt to do in this paper is to determine an empirically optimal DNN estimator $\hat{f}_{\mathrm{DNN}}$ with $n$ samples and then explore its estimation and prediction inference. Guided by this spirit, one usually regards $\hat{f}_{\mathrm{DNN}}$ as an $M$-estimator and sets different loss functions for various purposes:
(2) $\hat{f}_{\mathrm{DNN}} = \operatorname*{argmin}_{f \in \mathcal{F}_{\mathrm{DNN}}} \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(X_i), Y_i\big);$
here $\mathcal{F}_{\mathrm{DNN}}$ is a user-chosen space that contains all DNN candidates; $\ell$ is the loss function, e.g., the Mean Squared Error loss for the regression problem with real-valued output, i.e., $\ell\big(f(X), Y\big) = \big(f(X) - Y\big)^2$; $\{(X_i, Y_i)\}_{i=1}^{n}$ are realizations of $(X, Y)$.
In the paper at hand, we consider DNN-based estimation and prediction inference in the data-generating model: $Y_i = f^*(X_i) + \epsilon_i$; here, the $\epsilon_i$ are independent, identically distributed (i.i.d.) from a distribution that has mean 0 and variance $\sigma^2$—we will use the shorthand $\epsilon_i \sim$ i.i.d. $(0, \sigma^2)$. Consequently, $\mathbb{E}(Y_i \mid X_i = x) = f^*(x)$. Furthermore, the regression function $f^*$ will be assumed to satisfy a smoothness condition which will be specified later. Note that the additive model with heteroscedastic error, $Y_i = f^*(X_i) + \sigma(X_i)\epsilon_i$, can be analyzed similarly by applying two DNNs, one to estimate $f^*(\cdot)$ and one for $\sigma(\cdot)$.
From a nonparametric regression view, it is well-known that the optimal convergence rate for estimating a $p$-times continuously differentiable regression function of a $d$-dimensional argument is $n^{-p/(2p+d)}$—see Stone, (1982). If we assume the regression function belongs to a more general Hölder Banach space, we can define a non-integer $\beta = s + r$ to represent the smoothness order of $f^*$; here $s$ is a non-negative integer and $r \in (0, 1]$ is the Hölder coefficient. The optimal rate of non-parametric estimation can also be extended to such non-integer smoothness orders; see Condition and Definition 2 of Kohler et al., (2023). Focusing on DNN estimation, the optimal and achievable error bound on the squared $L_2$ norm of $\hat{f}_{\mathrm{DNN}} - f^*$ is of order $n^{-\frac{\beta}{\beta+d}} \log^8 n$ with high probability; this bound is due to Farrell et al., (2021) but the rate appears slower than the optimal rate that we can attain. Besides, although $\hat{f}_{\mathrm{DNN}}$ will become more accurate as the sample size increases, training a DNN becomes very time-consuming. Moreover, it is infeasible to load massive data into a PC or even a supercomputer since node memory is also limited in the computation process, as pointed out by Zou et al., (2021).
In this paper, we first give a small improvement on the bound of Farrell et al., (2021) using the latest results on the DNN approximation ability. Then, we resolve the computational issue involving huge data by applying the Scalable Subsampling technique of Politis, (2021) to create a set of subsamples and then build a so-called subagging DNN estimator $\bar{f}_{\mathrm{DNN}}$. Under regularity conditions, we can show that the subagging DNN estimator can possess a faster convergence rate than a single DNN estimator trained on the whole sample. Lastly, using the same set of subsamples, we can build a Confidence Interval (CI) for the regression function based on $\bar{f}_{\mathrm{DNN}}$. Due to the prevalent undercoverage phenomenon of CIs with finite samples, we propose two ideas to improve the empirical coverage rate: (1) we enlarge the CI by maximizing the margin of error in an appropriate way; (2) we take an iterated subsampling method to build a specifically designed CI which is a combination of pivot-CI and quantile-CI. Beyond estimation inference, we also perform predictive inference (with both point and interval predictions).
The paper is organized as follows. In Section 2, we give a short introduction to the structure of DNNs. In Section 3, we describe the methodology of scalable subsampling. Subsequently, the performance of the subagging DNN estimator and its associated confidence/prediction intervals are analyzed in Section 4 and Section 5. Simulations and conclusions are provided in Section 6 and Section 7, respectively. Proofs are given in Appendix: A and additional simulation studies are presented in Appendix: B and Appendix: C.
In terms of notation, we will use the following norms: $\|f\|_{\infty} := \sup_x |f(x)|$; $\|f\|_{L_2}^2 := \mathbb{E}|f(X)|^2$. Also, we employ the notation $a_n \asymp b_n$ to denote “exact order”, i.e., that there exist two constants $c_1, c_2 > 0$ satisfying $c_1 b_n \le a_n$ and $a_n \le c_2 b_n$ for all large $n$. We also use $\mathbb{E}_n$ to represent the sample average operator, i.e., $\mathbb{E}_n g(X) := n^{-1} \sum_{i=1}^{n} g(X_i)$.
2 Standard fully connected deep neural network
For completeness, we now give a brief introduction to the fully connected forward DNN with ReLU activation functions. Hereafter, we refer to DNN as the standard fully connected deep neural network with the ReLU activation function. In short, the DNN can be viewed as a parameterized family of functions. Its structure mainly depends on the input dimension, depth, width and the output dimension. The depth describes how many hidden layers a DNN possesses; the width represents the number of neurons in each hidden layer. The forward property implies that the input, hidden neurons and output are connected in an acyclic relationship. The fully connected property indicates that each hidden neuron receives information from all hidden neurons at the previous hidden layer in a functional way.
Formally, let $H_l$ denote the number of neurons at the $l$-th hidden layer for $l = 1, \ldots, L$; here, $x \in \mathbb{R}^d$ represents the input vector. Therefore, we can pretend that the input layer and the output layer are the $0$-th and $(L+1)$-th hidden layers, respectively, with $H_0 = d$. Then, $z_j^{(l)} = \sigma\big( w_j^{(l)\top} z^{(l-1)} + b_j^{(l)} \big)$ for $l = 1, \ldots, L$ and $j = 1, \ldots, H_l$; here $w_j^{(l)}$ is the weight vector which connects the $(l-1)$-th hidden layer and the neuron $z_j^{(l)}$; $b_j^{(l)}$ is the corresponding intercept term; $\sigma(\cdot)$ is the so-called activation function and we take the ReLU function $\sigma(z) = \max(z, 0)$ in this paper. To get the output layer, we just take $z_j^{(L+1)} = w_j^{(L+1)\top} z^{(L)} + b_j^{(L+1)}$ for $j = 1, \ldots, H_{L+1}$; here $H_{L+1}$ is equal to the output dimension. To express the functionality of the DNN in a more concise way, we can stack the $w_j^{(l)}$ by row to get a matrix $W_l$ and collect the intercepts into a vector $b_l$ for $l = 1, \ldots, L+1$. Subsequently, we can treat the DNN as a function that takes the input $x$ and returns the output in the below way:
$f_{\mathrm{DNN}}(x) = W_{L+1}\, \sigma\big( W_L\, \sigma( \cdots \sigma( W_1 x + b_1 ) \cdots ) + b_L \big) + b_{L+1},$
where $\sigma(\cdot)$ is applied componentwise. We can understand that the function maps $x$ to the $1$-st hidden layer, then maps the $1$-st hidden layer to the $2$-nd hidden layer and so on iteratively with the weights $\{W_l\}$, intercepts $\{b_l\}$ and the activation function $\sigma$. We can then compute the total number of parameters in a DNN by the formula $W = \sum_{l=1}^{L+1} (H_{l-1} + 1) H_l$. A simple DNN is presented in Fig. 1. It has a constant width of 4 and a depth of 2.
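To make this bookkeeping concrete, here is a minimal NumPy sketch (function names are our own, not from the paper) that builds a fully connected ReLU network and counts its parameters; each layer contributes (fan-in + 1) × fan-out numbers, i.e., weights plus intercepts:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_dnn(d, width, depth, out_dim=1, rng=None):
    """Random parameters for a fully connected ReLU net with `depth` hidden
    layers of equal `width`; layer sizes are [d, width, ..., width, out_dim]."""
    rng = np.random.default_rng(rng)
    sizes = [d] + [width] * depth + [out_dim]
    # One (fan_in x fan_out) weight matrix plus a fan_out intercept per layer.
    return [(rng.normal(size=(m, k)), rng.normal(size=k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def n_params(layers):
    """Total parameter count: sum over layers of (fan_in + 1) * fan_out."""
    return sum(W.size + b.size for W, b in layers)

def forward(layers, x):
    for W, b in layers[:-1]:
        x = relu(x @ W + b)   # hidden layers use ReLU
    W, b = layers[-1]
    return x @ W + b          # linear output layer

layers = init_dnn(d=3, width=4, depth=2)  # like Fig. 1: width 4, depth 2
# (3+1)*4 + (4+1)*4 + (4+1)*1 = 16 + 20 + 5 = 41 parameters
print(n_params(layers))
```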

3 Scalable subsampling
Scalable subsampling is one type of non-stochastic subsampling technique proposed by Politis, (2021). Assume that we observe the sample $\{X_1, \ldots, X_n\}$; then, scalable subsampling relies on $q$ subsamples $B_1, \ldots, B_q$ where $B_i = \{X_{(i-1)h+1}, \ldots, X_{(i-1)h+b}\}$ and $q = \lfloor (n-b)/h \rfloor + 1$; here, $\lfloor \cdot \rfloor$ denotes the floor function, and $h$ controls the amount of overlap (or separation) between $B_i$ and $B_{i+1}$. In general, the subsample size $b$ and the overlap parameter $h$ are functions of $n$, but these dependencies will not be explicitly denoted. More importantly, tuning $b$ and $h$ can make the scalable subsampling samples have different properties. For example, if $h = 1$, the overlap is the maximum possible; if $h = 0.2b$, there is 80% overlap between $B_i$ and $B_{i+1}$; if $h = b$, there is no overlap between $B_i$ and $B_{i+1}$ but these two blocks are adjacent; if $h = b + c$ for some $c > 0$, there is a block of about $c$ data points that separates the blocks $B_i$ and $B_{i+1}$.
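The block construction can be sketched in a few lines; the indexing below is a straightforward reading of the scheme above (0-based indices, names are ours):

```python
def subsample_blocks(n, b, h):
    """Index sets B_i = {(i-1)h, ..., (i-1)h + b - 1} (0-based),
    with q = floor((n - b)/h) + 1 blocks in total."""
    if not (1 <= b <= n and h >= 1):
        raise ValueError("need 1 <= b <= n and h >= 1")
    q = (n - b) // h + 1
    return [range(i * h, i * h + b) for i in range(q)]

# h = b gives disjoint, adjacent blocks; h = 1 gives maximal overlap.
blocks = subsample_blocks(n=20, b=5, h=5)
print(len(blocks))       # 4
print(list(blocks[1]))   # [5, 6, 7, 8, 9]
```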
The bagging idea was initially proposed by Breiman, (1996), where the subsample is bootstrapped (sampling with replacement) with the same size as the original sample. As revealed by that work, the main benefit of this technique is that the mean-squared error (MSE) of the bagging estimator can decrease, especially for unstable estimators that may change a lot with different samples, e.g., neural networks and regression trees. There are ample works about combining neural networks with the bagging technique to improve generalization performance; e.g., see applications in the work of Ha et al., (2005); Khwaja et al., (2015) for references. However, the drawback of the original bagging method is that the estimation process needs to be performed many times with $n$-size bootstrap resamples, which is infeasible with massive data. Bühlmann and Yu, (2002) proposed the subagging idea which is based on subsamples of smaller size as opposed to bootstrap resamples. However, even choosing a single random subsample could be computationally challenging when $n$ is large. As pointed out in Ting, (2021), drawing a random sample of size $b$ from $n$ items using the Sparse Fisher-Yates Sampler takes $O(b)$ time and space, which corresponds to the optimal time and space complexity for this problem.
Facing such computational dilemmas, scalable subsampling and subagging as proposed by Politis, (2021) can be seen as an extension of the Divide-and-Conquer principle—see e.g. Jordan, (2013). Moreover, in addition to the computational savings, scalable subagging may yield an estimator that is not less (and sometimes more) accurate than the original; the following example illustrates such a case.
Example 3.1 (Kernel-smoothed function estimation).
A remarkable example from the work of Politis, (2021) is the scalable subagging kernel estimator. Suppose our goal is estimating the value of a function at a specific point; here, the function can be a probability density, spectral density, or other function that is estimated in a nonparametric setting. Denote $\theta$ the estimand and $\hat{\theta}_n$ its corresponding kernel-smoothed estimator based on the whole sample, and assume that $\hat{\theta}_n$ satisfies the following conditions:
(1) $\mathrm{Bias}(\hat{\theta}_n) = \mathbb{E}\hat{\theta}_n - \theta = O(n^{-\gamma})$ for all large $n$;
(2) $n^{\delta}\, \mathrm{Var}(\hat{\theta}_n) \to V$ as $n \to \infty$, where $V$ is a non-zero constant, and $0 < \delta \le 1$.
Define the scalable subagging estimator as:
$\bar{\theta}_{b,n} := \frac{1}{q} \sum_{i=1}^{q} \hat{\theta}_{b,i};$
here $q$ is the total number of subsamples and $\hat{\theta}_{b,i}$ is the non-parametric estimator based on the $i$-th subsample $B_i$. To achieve the fastest convergence rate of $\bar{\theta}_{b,n}$ we may let the subsample size $b$ grow at an appropriate rate relative to $n$. As a result, the Mean Squared Error (MSE) of the scalable subagging estimator can be quantified; see Lemma 4.1 of Politis, (2021) for a detailed discussion. To achieve such a convergence rate in the context of nonparametric estimation, the crucial point is using an undersmoothed bandwidth on the subsample statistics. To elaborate, suppose we are employing a non-negative (second-order) kernel for smoothing, in which case the MSE-optimal bandwidth is of order $n^{-1/5}$. To conduct efficient subagging, however, the subsample estimators should be computed using an undersmoothed bandwidth, i.e., one of smaller order. For example, if we choose an appropriately undersmoothed bandwidth for the $\hat{\theta}_{b,i}$, then suitable choices of $b$ and $h$ imply that the rate of convergence of $\bar{\theta}_{b,n}$ is not only faster than the rate of the full-sample estimator that used the sub-optimal bandwidth; it is actually faster than the best rate achievable by any estimator that uses a non-negative kernel with its associated MSE-optimal bandwidth. Nevertheless, $\bar{\theta}_{b,n}$ can be computed faster than $\hat{\theta}_n$, and may thus be preferable. In addition to the asymptotic results, the simulation study of Politis, (2021) reveals that the error of the scalable subsampling estimator can actually be smaller than that of the full-sample nonparametric estimator with its own optimal bandwidth choice.
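As an illustration of this example, the following sketch subaggs Gaussian-kernel density estimates at a point; the specific bandwidth exponents are illustrative assumptions for demonstration, not the tuned choices of Politis, (2021):

```python
import numpy as np

def kde_at(x0, data, h):
    """Gaussian-kernel density estimate at the point x0 with bandwidth h."""
    u = (x0 - np.asarray(data)) / h
    return np.exp(-0.5 * u ** 2).mean() / (h * np.sqrt(2 * np.pi))

def subagged_kde_at(x0, data, b, step, bw_exponent=-1 / 4):
    """Average of KDEs over scalable-subsampling blocks; the bandwidth
    b**bw_exponent undersmooths relative to the MSE-optimal order b**(-1/5)
    (an illustrative choice)."""
    n = len(data)
    q = (n - b) // step + 1
    ests = [kde_at(x0, data[i * step : i * step + b], b ** bw_exponent)
            for i in range(q)]
    return float(np.mean(ests))

rng = np.random.default_rng(0)
X = rng.normal(size=10_000)
true_val = 1 / np.sqrt(2 * np.pi)          # standard normal density at 0
full = kde_at(0.0, X, len(X) ** (-1 / 5))  # full-sample, MSE-optimal-order bandwidth
sub = subagged_kde_at(0.0, X, b=1_000, step=1_000)
```

Both estimates land close to the true density value; the subagged version only ever touches one block of data at a time.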
In the next section, we will introduce how to compute the scalable subsampling DNN estimator. Then, we will show that our aggregated DNN estimator could possess a smaller MSE than the optimal DNN estimator trained on the whole sample, under some conditions. We also discuss some specifically designed confidence intervals to measure the estimation accuracy via the approaches mentioned in Section 1.
4 Estimation inference with DNN
Although the DNN has captured much attention in practice, its theoretical validation is still in development. Recently, Farrell et al., (2021) gave a non-asymptotic error bound to measure the performance of the DNN estimator under two regularity assumptions. To align with the latest results on the approximation ability of DNNs, we replace their second assumption (Assumption 2 in their work) with the smooth function condition assumed by Yarotsky and Zhevnerchuk, (2020). We present these new assumptions below:
• A1: The regression data $\{(X_i, Y_i)\}_{i=1}^{n}$ are i.i.d. copies of $(X, Y)$, where $X$ has a continuous distribution supported on $[0,1]^d$, and $|Y| \le M$ for some positive constant $M$. Correspondingly, we set the space of all DNN candidate functions to be $\mathcal{F}_{\mathrm{DNN}} = \{f : \|f\|_{\infty} \le 2M\}$.
• A2: The target regression function $f^*$ lies in the Hölder Banach space $\mathcal{H}^{\beta}([0,1]^d)$, which is the space of $s$ times continuously differentiable functions on $[0,1]^d$ having a finite norm defined by
$\|f\|_{\mathcal{H}^{\beta}} := \max_{|\alpha| \le s} \sup_{x} |D^{\alpha} f(x)| + \max_{|\alpha| = s} \sup_{x \ne y} \frac{|D^{\alpha} f(x) - D^{\alpha} f(y)|}{\|x - y\|^{r}},$
where the smoothness index is $\beta = s + r$ with an integer $s \ge 0$ and $r \in (0, 1]$.
• A3: The sample size $n$ is larger than $\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})$, the pseudo-dimension of $\mathcal{F}_{\mathrm{DNN}}$, which satisfies:
$c \cdot WL \log(W/L) \le \mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}}) \le C \cdot WL \log(eW),$
with some universal constants $c, C > 0$ and Euler’s number $e$; see Bartlett et al., (2019) for details.
Remark.
We can weaken the assumption on the domain of $X$ to $[-M_1, M_1]^d$ for some constant $M_1 > 0$, i.e., we can work on any compact domain of $X$; see also the proof of Yarotsky and Zhevnerchuk, (2020).
As shown in Farrell et al., (2021), with width $H \asymp n^{\frac{d}{2(\beta+d)}} \log^2 n$ and depth $L \asymp \log n$, the $L_2$ norm loss and empirical mean squared error of the deep fully connected ReLU network estimator $\hat{f}_{\mathrm{DNN}}$ from Eq. 2 can be bounded with high probability, i.e.,
(3) $\big\| \hat{f}_{\mathrm{DNN}} - f^* \big\|_{L_2}^2 \le C \left( n^{-\frac{\beta}{\beta+d}} \log^8 n + \frac{\log\log n}{n} \right);$
here $C$ is a constant which is independent of $n$ and may depend on $d$, $M$ and other fixed constants.
Obviously, the $L_2$ norm error bound in (3) is sub-optimal compared to the fastest convergence rate we can achieve for nonparametric function estimation. With the latest approximation theory on DNNs, we can improve the error bound in Eq. 3 by decreasing the power of the $\log n$ term. Meanwhile, this faster rate is attained with a narrower DNN. We give our first theorem about the convergence rate of $\hat{f}_{\mathrm{DNN}}$ below.
Theorem 4.1.
Under assumptions A1 to A3, and with the width and depth of the network chosen suitably (in particular, a narrower width than in Farrell et al., (2021)), the $L_2$ norm loss of the deep fully connected ReLU network estimator of Eq. 2 can be bounded with high probability, i.e.,
(4) |
here is another constant.
It appears that the above gives the fastest rate obtainable based on the current literature. Later, we will show how this error bound can be further improved by applying the scalable subagging technique under some mild conditions.
4.1 Scalable subagging DNN estimator
We first review the idea of scalable subagging and explain how this technique can be combined with DNN estimation. We focus on the regression problem and assume we observe the sample $\{(X_i, Y_i)\}_{i=1}^{n}$.
Analogously to the subagging kernel-smoothed estimator of Example 3.1, we can define the subagging DNN estimator as:
(5) $\bar{f}_{\mathrm{DNN}}(x) := \frac{1}{q} \sum_{i=1}^{q} \hat{f}_{\mathrm{DNN},i}(x);$
here, $q = \lfloor (n-b)/h \rfloor + 1$, and $\hat{f}_{\mathrm{DNN},i}$ is the minimizer of the empirical loss function in Eq. 2 using only the data in the $i$-th subsample, namely $B_i$.
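A minimal sketch of the subagging recipe of Eq. 5 follows; for brevity a cubic least-squares fit stands in for the DNN fit on each block (the aggregation logic is identical whichever learner is plugged in):

```python
import numpy as np

def fit_block(Xb, yb):
    """Stand-in for the DNN fit of Eq. 2 on one block: cubic least squares.
    In practice this would be a ReLU network trained on the block."""
    coef = np.polyfit(Xb, yb, deg=3)
    return lambda x: np.polyval(coef, x)

def subagged_predict(X, y, x_eval, b, h):
    """Subagging estimator: average the q block estimators at x_eval (Eq. 5)."""
    n = len(X)
    q = (n - b) // h + 1
    fits = [fit_block(X[i * h : i * h + b], y[i * h : i * h + b])
            for i in range(q)]
    return np.mean([f(x_eval) for f in fits], axis=0)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=2_000)
y = np.sin(np.pi * X) + 0.1 * rng.normal(size=X.size)
pred = subagged_predict(X, y, x_eval=np.array([0.0, 0.5]), b=400, h=400)
```

With $h = b$ the five blocks are disjoint, so each fit touches only a fifth of the data, and the fits can be run in parallel.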
In nonparametric function estimation where the estimation is performed through the kernel technique, the bandwidth can control the bias order of the kernel-smoothed estimator. As shown in Example 3.1, the optimal convergence rate can be recovered by combining the scalable subagging trick with an undersmoothing bandwidth. Similarly, in the context of neural network estimation, the whole architecture of a DNN controls its smoothness, similar to the role of the shape (order) of a kernel. The depth and width of a DNN play the role of tuning parameters, similar to the bandwidth of a kernel. Moreover, according to the prevailing wisdom, a deeper DNN may possess a lower bias; this conjecture was confirmed empirically by Yang et al., (2020) with ResNet on some image datasets.
However, as far as we know, there is no theoretical result that explains the relationship between the bias and the width/depth of a DNN. Here, we make the below assumption to restrict the order of the bias of $\hat{f}_{\mathrm{DNN}}$:
• A4: $\mathbb{E}\hat{f}_{\mathrm{DNN}}(x) - f^*(x) = O(n^{-\gamma})$ uniformly in $x$, for some constant $\gamma > 0$.
To boost the scalable subagging method, a fundamental preliminary condition is that the bias of the estimator is comparatively negligible to its standard deviation—see Politis, (2021) for details. Thus, we further impose an additional assumption on the order of the bias:
• A5: The bias exponent $\gamma$ in Assumption A4 is large enough that the squared bias of the subagging estimator is asymptotically negligible compared to its variance.
We claim that assumptions A4 and A5 are achievable. As revealed in Yarotsky and Zhevnerchuk, (2020), the approximation error of a DNN in the uniform norm can decay as fast as $O(W^{-2\beta/d})$. Although this rate is not instructive in practice, the existence of a DNN that satisfies the bias order requirement of A4 is possible.
We should also notice that practitioners tend to build a large DNN whose size is larger than the sample size, i.e., the DNN interpolates the sample, as is common in modern machine-learning practice. Interestingly, such an over-parameterized estimator breaks the classical understanding of the bias-variance trade-off since its generalization performance can even be better than that of a DNN which lies in the under-parameterized regime. This phenomenon is described as the double-descent of the risk by Belkin et al., (2019). Thus, A4 and A5 should be reasonable when we consider DNNs with an overwhelming number of parameters; however, assumption A3 may then fail, which means the consistency property of the DNN estimator may be lost. It is interesting to explore whether scalable subsampling can work for DNN estimators in the over-parameterized regime; we leave this to future work.
Remark 4.1.
In this paper, we focus on applying scalable subagging to DNNs whose size is less than the sample size, but the extension to a large DNN is straightforward. From the computability aspect, as we can expect, the savings in computational cost from applying scalable subagging will be more significant when executing estimation with a large DNN. To see this fact, assume that we consider a DNN whose size grows faster than the sample size. Then, the computational complexity will mainly be determined by how many manipulations (e.g., forward calculation and backward updating) we carry out to train the DNN. The total number of manipulations is also affected by the batch size and the number of epochs, i.e., the number of complete passes of the training data through the algorithm. When the size of the DNN is larger than the sample size, the subagging estimator requires substantially fewer manipulations than the full-sample estimator. Thus, for a fixed subsample size $b$, the larger the DNN is, the more computation can be saved by deploying the subagging technique.
Aggregating all the above, the following theorem quantifies the error bound of the scalable subagging DNN estimator of Eq. 5:
Theorem 4.2.
Assume Assumptions A1 to A5, and let the subsample size $b$ be chosen appropriately. Then, with high probability, the error bound of the subagging estimator of Eq. 5 in $L_2$ norm improves on that of Theorem 4.1, up to a slowly varying factor involving a constant and all logarithmic terms.
Remark 4.2.
Note that the final accuracy of a DNN heavily depends on many other factors in practice, e.g., which optimizer we choose in the training stage, which parameter initialization strategy we take, and how large the batch size should be. Thus, a theoretical rate alone is insufficient to verify the superiority of the scalable subsampling DNN estimator. We therefore deploy simulation studies in Section 6 to provide supplementary evidence.
4.2 Estimation of the bias order of DNN estimator
Although Theorem 4.2 shows the possibility of getting a smaller error bound, it depends on the bias exponent $\gamma$ which is typically unknown. In this subsection, we propose two approaches to estimate the value of $\gamma$ via subsampling. As far as we know, this is the first attempt towards quantifying the bias of the DNN estimator.
First note that A4 implies that, for any $x$, $\mathbb{E}\hat{f}_{\mathrm{DNN}}(x) - f^*(x) = O(n^{-\gamma})$. Since each subsample has size $b$, it follows that the bias of $\bar{f}_{\mathrm{DNN}}$ is $O(b^{-\gamma})$, so we can write
(6) $\mathbb{E}\bar{f}_{\mathrm{DNN}}(x) - f^*(x) = C(x)\, b^{-\gamma} + o\big(b^{-\gamma}\big).$
Recall that $\bar{f}_{\mathrm{DNN}}$ was built based on subsamples of size $b$. If we have another DNN estimator $\hat{f}_{\mathrm{DNN}}$ trained on the whole sample of size $n$, then its bias will be $O(n^{-\gamma})$. Then,
(7) $\mathbb{E}\big[ \bar{f}_{\mathrm{DNN}}(x) - \hat{f}_{\mathrm{DNN}}(x) \big] = C(x)\, b^{-\gamma} + o\big(b^{-\gamma}\big) + O\big(n^{-\gamma}\big).$
If $b \to \infty$ and $b/n \to 0$, the bias of $\bar{f}_{\mathrm{DNN}}$ is asymptotically determined by the first term on the r.h.s. of Eq. 7. So we can try to estimate $\mathbb{E}\big[ \bar{f}_{\mathrm{DNN}}(x) - \hat{f}_{\mathrm{DNN}}(x) \big]$ to approximate the l.h.s. of Eq. 7.
Ideally, if we have a large enough sample, we can carve out non-overlapping (or partially overlapping) -size subsamples and compute . If we further separate each -size subsample into multiple non-overlapping (or partially overlapping) -size subsamples, can be built and each possesses the same bias order as our desired DNN estimator. Subsequently, the bias of can be estimated by the sample mean of . We can then use this information to estimate the value of . By the law of large numbers, we can get accurate bias estimation as . However, as we can easily see, this approach is computationally heavy and requires a large dataset.
Consequently, we propose another way to perform the bias estimation; we will call it the scaling-down estimation method. To elaborate, recall that our goal is estimating the bias of $\bar{f}_{\mathrm{DNN}}$ that was built based on subsamples of size $b$. Consider different DNN estimators $\hat{f}^{b_1}_{\mathrm{DNN}}$ and $\hat{f}^{b_2}_{\mathrm{DNN}}$ which are trained on samples of size $b_1$ and $b_2$, respectively; here $b_1 < b_2$ and $b_2 = o(n)$. As before, A4 implies that the bias of $\hat{f}^{b_j}_{\mathrm{DNN}}$ is $O(b_j^{-\gamma})$ for $j = 1, 2$. Then, a key observation is that:
(8) $\mathbb{E}\big[ \hat{f}^{b_j}_{\mathrm{DNN}}(x) - \hat{f}_{\mathrm{DNN}}(x) \big] = C(x)\, b_j^{-\gamma} + o\big(b_j^{-\gamma}\big) + O\big(n^{-\gamma}\big), \quad j = 1, 2.$
Due to the relationship between $b_1$, $b_2$ and $n$, the bias of $\hat{f}^{b_j}_{\mathrm{DNN}}$ is dominated by the first term on the r.h.s. of Eq. 8. We then have two different estimates of the bias, one at each scale $b_1$ and $b_2$. Fixing the value of $x$, let $\hat{f}^{b_j}_{\mathrm{DNN},i}(x)$ be the value of $\hat{f}^{b_j}_{\mathrm{DNN}}$ computed from the $i$-th subsample of size $b_j$ carved out of the whole sample; as before, these subsamples can be non-overlapping or partially overlapping and their number is denoted by $q_j$. Ignoring the remainder terms in Eq. 6, we can solve the following system of equations to approximate both $C(x)$ and $\gamma$:
(9) $\frac{1}{q_j} \sum_{i=1}^{q_j} \Big| \hat{f}^{b_j}_{\mathrm{DNN},i}(x) - \hat{f}_{\mathrm{DNN}}(x) \Big| = C(x)\, b_j^{-\gamma}, \quad j = 1, 2.$
Taking logarithms in Eq. 9 turns it into a linear system in $\log C(x)$ and $\gamma$. Finally, we can estimate the bias of $\bar{f}_{\mathrm{DNN}}$ by scaling down to the subsample size $b$, i.e., the bias of $\bar{f}_{\mathrm{DNN}}$ is approximately $\hat{C}(x)\, b^{-\hat{\gamma}}$. We summarize this procedure in Algorithm 1.
Step 1 | Fix the subsample size $b$, and compute the full-sample estimator $\hat{f}_{\mathrm{DNN}}$ at the point $x$. |
Step 2 | Fix two subsample sizes $b_1$ and $b_2$, and separate the whole sample into $q_1$ and $q_2$ number of $b_1$-size and $b_2$-size subsamples, respectively. Compute $\hat{f}^{b_j}_{\mathrm{DNN},i}$ at $x$ for $i = 1, \ldots, q_j$ and $j = 1, 2$. |
Step 3 | Solve Eq. 9 to get $\hat{C}(x)$ and $\hat{\gamma}$. |
Step 4 | Estimate the bias of $\bar{f}_{\mathrm{DNN}}$ by $\hat{C}(x)\, b^{-\hat{\gamma}}$. |
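Steps 3 and 4 can be sketched as follows, assuming the working bias model bias(b) ≈ C·b^(−γ) discussed above (function and variable names are ours, not the paper's):

```python
import math

def fit_bias_exponent(d1, b1, d2, b2):
    """Solve the 2x2 log-linear system  log d_j = log C - gamma * log b_j
    for (C, gamma), given bias proxies d_j measured at subsample sizes b_j."""
    gamma = (math.log(d1) - math.log(d2)) / (math.log(b2) - math.log(b1))
    C = d1 * b1 ** gamma
    return C, gamma

def scaled_down_bias(C, gamma, b):
    """Step 4: extrapolate the fitted bias law to the working subsample size b."""
    return C * b ** (-gamma)

# Synthetic check: if the bias truly follows 2.0 * b**(-0.7), the fit recovers it.
C_hat, g_hat = fit_bias_exponent(2.0 * 100 ** (-0.7), 100,
                                 2.0 * 400 ** (-0.7), 400)
print(round(g_hat, 3), round(C_hat, 3))   # 0.7 2.0
```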
4.3 Confidence intervals
Beyond point estimation, it is important to quantify DNN estimation accuracy; this can be done via a standard error or –even better– via a Confidence Interval (CI). More specifically, for a point of interest $x$, we hope to find a CI $[L(x), U(x)]$ which satisfies:
$\mathbb{P}\big( L(x) \le f^*(x) \le U(x) \big) \approx 1 - \alpha;$
here $\mathbb{P}$ should be understood as the conditional probability given the regressors; $L(x)$ and $U(x)$ are lower and upper bounds for $f^*(x)$ that are functions of the DNN estimator; $\alpha$ is the significance level. Since we can have different CI constructions having the same coverage, we are also interested in the CI length (CIL) which is defined as $U(x) - L(x)$. We aim for a (conditional) CI that is the most accurate (in terms of its coverage being close to $1 - \alpha$) but with the shortest length.
Analogously to Example 3.1, we make an assumption about the variance term of the DNN estimator trained with sample size $n$ and evaluated at $x$:
• B1: $\mathrm{Var}\big( \hat{f}_{\mathrm{DNN}}(x) \big) \asymp n^{-\delta}$ as $n \to \infty$, for some constant $\delta > 0$.
Generally speaking, we have two choices to build a CI for $f^*(x)$: (1) Pivot-CI (PCI), the type of CI obtained by estimating the sampling distribution of a pivotal quantity, e.g., the estimator centered at its expectation; (2) Quantile-CI (QCI), the type of CI based on quantiles of the estimated sampling distribution of the (uncentered) estimator of interest. More details are given in the example below.
Example 4.1 (Types of CI).
For any unknown quantity $\theta$ estimated by $\hat{\theta}_n$, we may build a scalable subagging estimator $\bar{\theta}_{b,n}$ to approximate it. To construct a CI for $\theta$ based on $\bar{\theta}_{b,n}$, we are aided by the CLT of Politis, (2021), i.e.,
(10) $\kappa_b \big( \bar{\theta}_{b,n} - \theta \big) \overset{d}{\longrightarrow} N(\mu, V)$
under mild conditions; here $\mu$ and $V$ are the mean and variance of the limiting distribution, respectively, and $\kappa_b$ is the appropriate rate of convergence. [By the way, note the typo in Politis, (2021) where the rate was written incorrectly.]
The form of the PCI based on the CLT (10) depends crucially on whether $\mu = 0$ or not; see the next two subsections for details. On the other hand, the QCI is easier to build but it has its own deficiencies. In the context of this example, it is tempting to create a QCI for $\theta$ by taking the $\alpha/2$ and $1 - \alpha/2$ quantile values of the empirical distribution of the points $\{\hat{\theta}_{b,i}\}_{i=1}^{q}$. However, the resulting CI will be too conservative, i.e., its coverage will be (much) bigger than $1 - \alpha$. The reason is that the empirical distribution of $\{\hat{\theta}_{b,i}\}$ approximates the sampling distribution of the estimator $\hat{\theta}_b$ which has bigger variance than that of the target estimator $\bar{\theta}_{b,n}$. We could try to re-scale the empirical distribution as in classical subsampling—see Politis et al., (1999). We still consider the QCI in the simulation studies. As expected, this QCI is the most conservative one; see details in Section 6.
4.3.1 PCI in the case where $\mu = 0$
If $\mu = 0$, i.e., when the squared bias is relatively negligible compared to the variance in estimation, we can rely on Eq. 10 to build a PCI for the true function $f^*$ at a point $x$. All we need is a consistent estimator $\hat{V}$ of $V$. In that case, a PCI for $f^*(x)$ based on the CLT can be written as:
(11) $\Big[ \bar{f}_{\mathrm{DNN}}(x) - z_{1-\alpha/2}\, \sqrt{\hat{V}} / \kappa_b, \ \ \bar{f}_{\mathrm{DNN}}(x) + z_{1-\alpha/2}\, \sqrt{\hat{V}} / \kappa_b \Big],$
where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution.
Observing that there is a common term in $\sqrt{V}$ and $\kappa_b$, we can estimate the standard deviation $\sqrt{V}/\kappa_b$ of $\bar{f}_{\mathrm{DNN}}(x)$ as a whole rather than computing the two pieces separately. As a result, we can get a simplified PCI based on Eq. 11 as follows:
(12) $\Big[ \bar{f}_{\mathrm{DNN}}(x) - z_{1-\alpha/2}\, \hat{\sigma}(x), \ \ \bar{f}_{\mathrm{DNN}}(x) + z_{1-\alpha/2}\, \hat{\sigma}(x) \Big];$
here $\hat{\sigma}(x)$ estimates the standard deviation of $\bar{f}_{\mathrm{DNN}}(x)$, which can be approximated by $\Big( \frac{1}{q} \cdot \frac{1}{q} \sum_{i=1}^{q} \big( \hat{f}_{\mathrm{DNN},i}(x) - \bar{f}_{\mathrm{DNN}}(x) \big)^2 \Big)^{1/2}$, treating the subsample estimators as approximately uncorrelated. Note that the building of the CI does not require knowledge of $\delta$, which is the order of the variance term in B1. However, the estimation may not be accurate when $q$ is small since it is only an average of $q$ terms. As a result, the PCI according to Eq. 12 may undercover the true model values. Thus, we may relax the desired property of the CI. Instead of requiring the exact coverage rate of a CI to be $1 - \alpha$, we seek a CI such that:
(13) $\liminf_{n \to \infty} \mathbb{P}\big( L(x) \le f^*(x) \le U(x) \big) \ge 1 - \alpha.$
Thus, the optimal candidate will be the CI which has the shortest length while guaranteeing an asymptotic coverage rate of at least $1 - \alpha$. To satisfy Eq. 13, we may enlarge the CI appropriately by replacing $\hat{\sigma}(x)$ with a larger value.
It is appealing to think that is close to the MSE of . However,
When is large, where is the bias of . Therefore, is not exactly the MSE, but it can still be used to enlarge the CI to some extent. We can then define another PCI as:
(14) |
where
(15) |
Since the order of the variance term is involved in the above terms, we consider two extreme situations in the simulation sections: (1) we take the choice which yields the most enlarged CI; or (2) the choice which yields a mildly enlarged CI.
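A sketch of the normal-quantile PCI built from the $q$ subsample estimates follows; the 1/√q scaling treats the block estimators as roughly uncorrelated, which is an assumption, and the `inflate` knob implements the enlargement idea (names are ours):

```python
import math
from statistics import NormalDist

def subagging_pci(estimates, alpha=0.05, inflate=1.0):
    """Normal-quantile PCI around the subagging point estimate.
    `estimates` holds the q subsample values f_hat_i(x); their spread, scaled
    by 1/sqrt(q), proxies the standard error of the subagged mean (blocks
    treated as roughly uncorrelated -- an assumption).  `inflate` > 1 widens
    the CI to guard against finite-sample undercoverage."""
    q = len(estimates)
    center = sum(estimates) / q
    s2 = sum((e - center) ** 2 for e in estimates) / (q - 1)
    se = inflate * math.sqrt(s2 / q)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return center - z * se, center + z * se

lo, hi = subagging_pci([0.98, 1.03, 1.01, 0.95, 1.05], alpha=0.05)
```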
Remark 4.3 (The condition to guarantee $\mu = 0$).
According to Eq. 10, $\mu = 0$ is satisfied as long as the bias vanishes faster than the standard deviation, which holds under A5. If we take the choice of $b$ in Theorem 4.2, we can find that the condition for $\mu = 0$ is always satisfied. This is not surprising since A5 imposes the requirement on the convergence rate of the bias term. However, as explained in Theorem 4.2, this is not the optimal choice to generate the smallest error bound. Thus, we could arrive at a stage where the orders of squared bias and variance are the same once we know $\gamma$. Due to the high variability of training a DNN in practice, we introduce a method in Section 4.3.2 to build a CI appropriately under the situation that $\mu \ne 0$, which serves for cases where the bias is not relatively negligible.
4.3.2 PCI in the case where $\mu \ne 0$
It is worthwhile to discuss how we can build a PCI for the scalable subsampling DNN estimator when $\mu \ne 0$. Note that Politis, (2021) proposed an iterated scalable subsampling technique that is applicable in the case $\mu \ne 0$. While this technique is also applicable in the case $\mu = 0$, we may prefer the construction of Section 4.3.1 since it is less computer-intensive. However, we should notice that the additional computational burden brought by iterated subsampling is negligible when the DNN is large; see the analysis in Appendix: C. For completeness, we present this method in the remark below.
Remark 4.4 (Iterated subsampling).
With the same notation as in Section 4.1, we can perform the iterated subsampling in three steps: (1) Apply the scalable subsampling technique to the sample and get the subsets $B_1, \ldots, B_q$; compute the subagging estimator; we call this the “first stage subsampling”. (2) Take another subsample size $b'$ and apply the scalable subagging method again within each $B_i$, i.e., as if $B_i$ were the only data at hand, and compute a subagging estimator for each subsample; such a subagging estimator is computed by averaging the estimators built on the sub-subsamples of $B_i$. As a result, we can get $q$ number of subagging estimators $\bar{f}_{\mathrm{DNN},i}$; we call this the “iterated stage subsampling”. (3) Find the subsampling distribution of these iterated estimators. In the context of DNN estimation, we use $\hat{f}_{\mathrm{DNN},i,j}$ to represent the DNN estimator in the iterated subsampling stage on the $j$-th subsample from the $i$-th subsample in the first stage subsampling.
Denote by $S(z)$ the subsampling distribution from step (3), and let $S_{\infty}(z)$ be the limit of the distribution of $\kappa_b\big( \bar{f}_{\mathrm{DNN}}(x) - f^*(x) \big)$ as $n \to \infty$; recall that (10) implies that $S_{\infty}$ is Gaussian. Proposition 2.1 of Politis, (2021) shows that $S(z)$ converges to $S_{\infty}(z)$ in probability for all points of continuity of $S_{\infty}$. Due to Eq. 10, $S_{\infty}$ is continuous everywhere, and therefore the convergence is uniform in probability, which implies that:
(16) $\sup_z \big| S(z) - S_{\infty}(z) \big| \overset{p}{\longrightarrow} 0.$
Thus, iterated subsampling can be used to estimate the distribution $S_{\infty}$. We can build the CI in a pivotal style without explicitly referring to the form of $S_{\infty}$ that involves the two unknown parameters. A further issue is that normality might not be well represented in $S(z)$ since it is based on an average of $q$ quantities; having a large $q$ requires a huge $n$. To compensate for the data size requirement, we take a specific approach to building the CI which can be considered as a combination of PCI and QCI to some extent. Algorithm 2 describes all the steps to construct the CI for $f^*(x)$ at a point $x$ based on the subagging DNN estimator and the iterated subsampling method.
Step 1 | Fix the subsample size , compute at point . |
Step 2 | Fix the subsample size of the iterated subsampling, and perform the necessary steps in 4.4 to find $L_{b',b,SS}(z)= q^{-1} \sum_{i=1}^{q} \mathbf{1}\{\kappa_b(\bar{f}_{DNN,i}(x)-\bar{f}_{DNN}(x)) \le z\}$; here $\bar{f}_{DNN,i}(x)$ is the subagging DNN estimator on the $i$-th subsample of Step 1 at the point $x$. |
Step 3 | Denote the and quantile values of the distribution as and . |
Step 4 | Determine the PCI of by: (17) In other words, we take and . |
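The block structure of scalable subsampling underlying the algorithm above is easy to sketch. The following minimal numpy illustration splits the data into non-overlapping blocks of size b, fits an estimator on each block (a least-squares line stands in for the DNN purely for illustration), averages the block fits into the subagged point estimate, and reads off a naive equal-tail interval from the spread of the block estimates. The block size and the stand-in estimator are our assumptions, not the paper's settings.

```python
import numpy as np

def scalable_subsample_indices(n, b):
    # Non-random (scalable) subsampling: q = floor(n/b) non-overlapping
    # blocks of consecutive indices, each of size b.
    q = n // b
    return [np.arange(i * b, (i + 1) * b) for i in range(q)]

def subagged_estimate(X, y, b, fit, x0):
    # Average the estimator trained on each block (stand-in for the DNN).
    blocks = scalable_subsample_indices(len(y), b)
    preds = [fit(X[idx], y[idx])(x0) for idx in blocks]
    return np.mean(preds), np.array(preds)

def ls_fit(X, y):
    # Toy estimator: least-squares line, a hypothetical stand-in for
    # the DNN fit on one subsample.
    coef = np.polyfit(X.ravel(), y, deg=1)
    return lambda x0: np.polyval(coef, x0)

rng = np.random.default_rng(0)
n, b = 1000, 200
X = rng.uniform(-1, 1, size=(n, 1))
y = 2.0 * X.ravel() + rng.normal(size=n)
x0 = 0.5
point, preds = subagged_estimate(X, y, b, ls_fit, x0)
# Naive equal-tail interval from the q block estimates (the QCI-1 of
# Section 6.2); the pivotal construction above rescales instead.
lo, hi = np.quantile(preds, [0.025, 0.975])
```

The same partition logic carries over to the iterated stage, where each first-stage block is itself split into smaller blocks.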
Note that to construct the PCI (17), the values of and are required. Recall that and . Although is the practitioner’s choice, is typically unknown. 4.5 explains how upper and lower bounds for can be used in the PCI construction.
Remark 4.5.
In constructing the PCI (17), we can replace by a larger value (say ) and by a smaller value (say ), and the coverage bound of Eq. 13 would still be met. From 4.1, the fastest rate of variance decrease is of order ; so could be as large as , in which . On the other hand, the slowest rate is influenced by ; if we pretend that the smoothness of the true model equals the input dimension (although it is actually smoother), we can take to compute .
5 Predictive inference with the DNN estimator
Most of the work in DNN estimation has applications in prediction, although typically this is point prediction. However, as in the estimation case, it is important to be able to quantify the accuracy of the point predictors, which can be done via the construction of Prediction Intervals (PI); see the related work of Pan and Politis, (2016); Wang and Politis, (2021); Zhang and Politis, (2023); Wu and Politis, (2023) on predictive inference with dependent or independent data.
Consider the problem of predicting a response associated with a regressor value of interest denoted by , together with its corresponding prediction interval. The optimal point predictor of is , which is well approximated by as 4.2 shows. To construct a PI for , we need to take the variability of the errors into account since, conditionally on , we have .
If the model and the error distribution were both known, we could construct a PI which covers with confidence level as follows:
(18) |
here and are the and quantile values of , respectively. Of course, we do not know the true model , but we may replace it with our scalable subsampling DNN estimator . In addition, is also unknown and must be estimated; a typical estimator is , the empirical distribution of the residuals. To elaborate, we define as follows:
(19) |
To consistently estimate the error distribution , we need to make some mild assumptions on , namely:
-
•
B2: The error distribution has zero mean and is differentiable on the real line, where is the density function of the error .
The following Lemma can be proved analogously to the proof of Lemma 4.1 in Wu and Politis, (2023).
Lemma 5.1.
Under A1-A5 and B1, we have
We can then apply the PI below to approximate the ‘oracle’ PI of Eq. 18:
(20) |
here and are the and quantile values of , respectively. To construct this PI in practice, we can rely on Algorithm 3 below:
Step 1 | Train the subagging DNN estimator and find the empirical distribution of residuals as Eq. 19. |
Step 2 | Evaluate the subagging DNN estimator at to get . |
Step 3 | Determine and by taking lower and quantiles of . |
Step 4 | Construct PI as Eq. 20. |
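Algorithm 3 amounts to centering the empirical quantiles of the residual distribution of Eq. 19 at the point prediction. Below is a minimal sketch of that recipe; the toy data and the perfect fit standing in for the trained subagging DNN are illustrative assumptions only.

```python
import numpy as np

def residual_prediction_interval(y_train, fitted_train, pred_at_x0, alpha=0.05):
    # Eq. 19: empirical distribution of the fitted residuals.
    residuals = y_train - fitted_train
    # Eq. 20: equal-tail residual quantiles centered at the point prediction.
    lo_q, hi_q = np.quantile(residuals, [alpha / 2, 1 - alpha / 2])
    return pred_at_x0 + lo_q, pred_at_x0 + hi_q

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=1.0, size=5000)
fitted = np.full_like(y, 3.0)   # pretend a perfect fit, for illustration
lo, hi = residual_prediction_interval(y, fitted, pred_at_x0=3.0)
# For standard normal errors the equal-tail 95% PI has length near 2 * 1.96.
```

In practice `fitted` would hold the subagged DNN's in-sample predictions and `pred_at_x0` its evaluation at the new regressor value.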
We claim that the PI in Eq. 20 is asymptotically valid (conditionally on ), i.e., it satisfies
(21) |
where the above probability is conditional on . This statement is guaranteed by 5.1. To describe it, denote where has the distribution .
Theorem 5.1.
Under A1-A5 and B1-B2, the distribution of converges to the distribution of uniformly (in probability), i.e.,
(22) |
Although the PI in Eq. 20 is asymptotically valid, it may undercover in finite samples. This problem is mainly due to two reasons: (1) the PI in Eq. 20 does not take the variability of model estimation into account; and (2) the scale of the error distribution is typically underestimated by the residual distribution in finite samples. For issue (1), we can rely on a so-called pertinent PI which is able to capture the model-estimation variability; this pertinence property is crucial, especially for the predictive inference of time series data where multiple-step-ahead forecasting is usually required. For issue (2), we can “enlarge” the residual distribution by basing it on so-called predictive (as opposed to fitted) residuals. Although the predictive residuals are asymptotically equivalent to the fitted residuals, i.e., in Eq. 19, the corresponding PI can have a better coverage rate; see Politis, (2015) for the formal definitions of pertinent PI and predictive residuals.
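As a hedged illustration of predictive residuals (a sketch of the idea only; see Politis, (2015) for the formal construction, and note that refitting a DNN once per observation would be costly in practice), here is a leave-one-out version with the sample mean as a hypothetical stand-in predictor:

```python
import numpy as np

def predictive_residuals(X, y, fit):
    # Leave-one-out "predictive" residuals: refit without observation i
    # and evaluate the refit at x_i. They tend to be slightly larger in
    # scale than the fitted residuals, widening the resulting PI.
    n = len(y)
    res = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        model = fit(X[mask], y[mask])
        res[i] = y[i] - model(X[i])
    return res

# Toy predictor: the sample mean, a stand-in for the fitted DNN.
mean_fit = lambda X, y: (lambda x: np.mean(y))

rng = np.random.default_rng(2)
X, y = rng.normal(size=20), rng.normal(size=20)
pred_res = predictive_residuals(X, y, mean_fit)
fit_res = y - np.mean(y)
# For the mean predictor, each predictive residual equals the fitted
# one inflated by the factor n/(n-1): a small, systematic enlargement.
```

The inflation factor shrinks to 1 as the sample grows, which is the asymptotic equivalence mentioned above.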
6 Simulations
In this section, we check the performance of the scalable subagging DNN estimator with simulation examples. More specifically, we consider two aspects of each estimator: (1) time complexity: we take the running time of the training stage to measure complexity for a fixed hyperparameter setting, e.g., fixed number of epochs and batch size; (2) estimation accuracy: we take the empirical MSE (mean squared error)/MSPE (mean squared prediction error) and the empirical coverage rate to measure the accuracy of point estimates/predictions and confidence/prediction intervals.
6.1 Simulations on point estimations
As shown in Section 4, under some mild conditions the scalable subagging DNN estimator is not only more computationally efficient but also more accurate than the DNN estimator trained on the whole sample. Here, we hope to verify this dominating performance with simulated data. We consider the models below:
-
•
Model-1: , where .
-
•
Model-2: , where .
-
•
Model-3: , where .
-
•
Model-4: , where ;
here is an identity matrix of the appropriate dimension for each model; is the standard normal error. We build the DNN estimator with PyTorch in Python. To train the DNN, we use the stochastic gradient descent algorithm Adam developed by Kingma and Ba, (2014) with a learning rate . In addition, we take the number of epochs and the batch size to be 200 and 10, respectively, to make the DNN fully trained in both the first and the iterated subsampling stages. We use the function time.time() in Python to compute the running time of the training procedure, namely the Training Time.
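For readers who want the training loop spelled out, here is a dependency-free numpy sketch of minibatch Adam on a one-hidden-layer ReLU net. The paper itself uses PyTorch's Adam with 200 epochs and batch size 10; the hyperparameters and toy data below are scaled down for illustration and are not the paper's settings.

```python
import numpy as np

def train_relu_net(X, y, width=20, lr=0.01, epochs=100, batch=10, seed=0):
    # One-hidden-layer ReLU net fit by minibatch Adam on the MSE loss.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=1.0 / np.sqrt(d), size=(d, width)); b1 = np.zeros(width)
    W2 = rng.normal(scale=1.0 / np.sqrt(width), size=(width, 1)); b2 = np.zeros(1)
    params = [W1, b1, W2, b2]
    m = [np.zeros_like(p) for p in params]
    v = [np.zeros_like(p) for p in params]
    beta1, beta2, eps, t = 0.9, 0.999, 1e-8, 0

    def forward(xb):
        h = np.maximum(xb @ W1 + b1, 0.0)   # ReLU hidden layer
        return h, h @ W2 + b2

    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            xb, yb = X[idx], y[idx][:, None]
            h, out = forward(xb)
            g_out = 2.0 * (out - yb) / len(idx)      # d(MSE)/d(out)
            g_h = (g_out @ W2.T) * (h > 0.0)         # backprop through ReLU
            grads = [xb.T @ g_h, g_h.sum(0), h.T @ g_out, g_out.sum(0)]
            t += 1
            for p, g, mi, vi in zip(params, grads, m, v):
                mi[:] = beta1 * mi + (1 - beta1) * g
                vi[:] = beta2 * vi + (1 - beta2) * g * g
                p -= lr * (mi / (1 - beta1 ** t)) / (np.sqrt(vi / (1 - beta2 ** t)) + eps)
    return lambda Xq: forward(Xq)[1].ravel()

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X.ravel()
net = train_relu_net(X, y)
mse = np.mean((net(X) - y) ** 2)
```

The in-place `p -= ...` update keeps the closure over the weight arrays valid, mirroring what an optimizer object does in PyTorch.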
To be consistent with the folk wisdom, we build with a relatively large depth to decrease the bias. Meanwhile, we take the width as large as possible to make its size close to the sample size so that A3 could be satisfied and we are in the under-parameterized region. In order to make a comprehensive comparison between the scalable subsampling DNN (SS-DNN) estimator and classical DNN estimators, we consider 5 DNN estimators which are trained with the whole sample:
-
(1)
A DNN possesses the same depth and width as . We denote it “S-DNN”.
-
(2)
A DNN possesses the same depth as , but a larger width so that its size is close to the sample size. We denote it “DNN-deep-1”.
-
(3)
A DNN possesses the same depth as , but a larger width so that its size is close to half of the sample size. We denote it “DNN-deep-2”.
-
(4)
A DNN possesses only one hidden layer, but a larger width so that its size is close to the sample size. We denote it “DNN-wide-1”.
-
(5)
A DNN possesses only one hidden layer, but a larger width so that its size is close to half of the sample size. We denote it “DNN-wide-2”.
We deploy DNN (1) to check the performance of a DNN with the same structure as but trained on the whole dataset. We deploy DNNs (2) - (5) to challenge the scalable subsampling DNN estimator with various wide or deep DNNs. To evaluate point estimation performance, we apply two empirical MSE criteria:
here represents the different DNN estimators and is the true regression function; are realizations of samples; we call these the training data.
An estimator is optimal under the MSE-1 criterion if its MSE-1 is closest to the sample variance of the errors, namely ; here are the observed error values. An estimator is optimal under the MSE-2 criterion if its MSE-2 is closest to 0. We present the MSE-1 and MSE-2 of the different estimators in Table 1. We also present for the corresponding simulated sample as a benchmark for comparing the different estimators under the MSE-1 criterion.
Beyond point estimation measured on the training data, we are also interested in the performance of the different DNN estimators on test data. Thus, we generate new samples: ; here we take to evaluate prediction performance. Similarly, we consider two MSPEs, denoted MSPE-1 and MSPE-2:
we expect the best estimator for prediction tasks to have the smallest MSPE-2 and the MSPE-1 closest to ; here are the observed error values for the test data. We present all simulation results in Table 1; the empirical MSE/MSPE and Training Time (in seconds) were computed as averages over 200 replications.
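The two MSE criteria can be computed as follows. Since the displayed formulas did not survive in this copy, the definitions below reflect our reading of the surrounding description: MSE-1 measures the fit against the noisy responses (so it is optimal when close to the error variance), while MSE-2 measures it against the true regression function (optimal when close to 0).

```python
import numpy as np

def mse_criteria(y, f_true, f_hat):
    # MSE-1: discrepancy between predictions and noisy responses.
    # MSE-2: discrepancy between predictions and the true regression.
    mse1 = np.mean((y - f_hat) ** 2)       # optimal when close to Var(error)
    mse2 = np.mean((f_true - f_hat) ** 2)  # optimal when close to 0
    return mse1, mse2

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=1000)
f = np.sin(np.pi * x)                      # a hypothetical true regression
eps = rng.normal(size=1000)
y = f + eps
# An oracle estimator equal to the true regression function:
mse1, mse2 = mse_criteria(y, f, f)
```

For the oracle, MSE-2 is exactly 0 and MSE-1 equals the realized mean squared error, matching the optimality notions above. The MSPEs are the same quantities computed on held-out test data.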
Estimator: | SS-DNN | S-DNN | DNN-deep-1 | DNN-deep-2 | DNN-wide-1 | DNN-wide-2 |
---|---|---|---|---|---|---|
Model-1, , , | ||||||
Width | [20,20] | [20,20] | [90,90] | [60,60] | [800] | [400] |
MSE-1 | 1.0034 | 1.0168 | 0.9975 | 1.0036 | 1.0136 | 1.0151 |
MSE-2 | 0.1011 | 0.0579 | 0.1039 | 0.0894 | 0.0466 | 0.0433 |
MSPE-1 | 1.1020 | 1.0678 | 1.1299 | 1.1059 | 1.0543 | 1.0487 |
MSPE-2 | 0.1019 | 0.0675 | 0.1296 | 0.1057 | 0.0540 | 0.0484 |
Training Time | 209 | 225 | 403 | 303 | 373 | 274 |
Model-2, , , | ||||||
Width | [20,20] | [20,20] | [90,90] | [60,60] | [800] | [400] |
MSE-1 | 1.0506 | 1.1355 | 1.1314 | 1.1350 | 1.0768 | 1.0745 |
MSE-2 | 0.1232 | 0.1625 | 0.1889 | 0.1839 | 0.1249 | 0.1194 |
MSPE-1 | 1.1339 | 1.1469 | 1.1841 | 1.1737 | 1.1254 | 1.1237 |
MSPE-2 | 0.1338 | 0.1468 | 0.1841 | 0.1736 | 0.1253 | 0.1238 |
Training Time | 224 | 240 | 417 | 320 | 376 | 280 |
Model-3, , , , | ||||||
Width | [15,15,15] | [15,15,15] | [65,65,65] | [45,45,45] | [2000] | [1000] |
MSE-1 | 1.0014 | 1.0361 | 1.0299 | 1.0308 | 1.0286 | 1.0290 |
MSE-2 | 0.0296 | 0.0536 | 0.0533 | 0.0522 | 0.0426 | 0.0431 |
MSPE-1 | 1.0310 | 1.0565 | 1.0572 | 1.0571 | 1.0453 | 1.0449 |
MSPE-2 | 0.0310 | 0.0564 | 0.0572 | 0.0570 | 0.0453 | 0.0449 |
Training Time | 353 | 379 | 561 | 468 | 483 | 363 |
Model-4, , , | ||||||
Width | [15,15,15] | [15,15,15] | [65,65,65] | [45,45,45] | [2000] | [1000] |
MSE-1 | 1.0243 | 1.0488 | 1.0318 | 1.0350 | 1.0457 | 1.0460 |
MSE-2 | 0.0757 | 0.0830 | 0.1076 | 0.0980 | 0.0729 | 0.0728 |
MSPE-1 | 1.0792 | 1.0878 | 1.1117 | 1.1048 | 1.0756 | 1.0752 |
MSPE-2 | 0.0790 | 0.0875 | 0.1114 | 0.1045 | 0.0754 | 0.0749 |
Training Time | 359 | 376 | 560 | 471 | 551 | 394 |
Model-4, , , | ||||||
Width | [20,20,20] | [20,20,20] | [95,95,95] | [65,65,65] | [2800] | [1400] |
MSE-1 | 1.0093 | 1.0483 | 1.0419 | 1.0438 | 1.0508 | 1.0508 |
MSE-2 | 0.0490 | 0.0653 | 0.0686 | 0.0675 | 0.0635 | 0.0635 |
MSPE-1 | 1.0501 | 1.0669 | 1.0692 | 1.0689 | 1.0622 | 1.0625 |
MSPE-2 | 0.0502 | 0.0670 | 0.0692 | 0.0689 | 0.0623 | 0.0626 |
Training Time | 748 | 775 | 1684 | 1198 | 1549 | 998 |
Note: “Width” is the number of neurons in each hidden layer, e.g., [20, 20] means that the DNN has two hidden layers with 20 neurons each.
We can summarize several notable findings from the simulation results:
-
•
is always the most computationally efficient; it is even faster than a single DNN estimator with the same structure as but trained on the whole sample.
-
•
According to MSE-1, is the most accurate for all simulations.
-
•
According to MSE-2, works best when the data is large enough for Models 3-4, which are non-linear. For Model-2, the performance of is only slightly worse than that of the optimal estimator. For Model-1, the performance of is still worse than the optimal estimator. We conjecture the reason is that Model-1 and Model-2 are linear models; in this case, a wide DNN is sufficient to mimic such a linear relationship.
-
•
For the MSPEs, works slightly worse than the optimal model for Model-1 and Model-2, but turns out to be optimal for Model-3 and Model-4. This is consistent with the behavior of the MSEs.
-
•
The model-selection step for “wide”- or “deep”-type DNN estimators is necessary, but its outcome is hard to foresee: DNN-wide-2 works better than DNN-wide-1 for the Model-2 MSE case, yet the situation reverses for the Model-3 MSE case. This phenomenon also occurs for “deep”-type DNN estimators; see the performance of S-DNN, DNN-deep-1 and DNN-deep-2; no single one beats the others uniformly. For the MSPEs, we find such reversals as well. On the other hand, by applying the scalable subagging estimator, we can avoid the model-selection difficulty and just make deep and large enough.
To analyze the pure ability of various DNN estimators to estimate regression models, we present additional simulation results in Appendix: B, where the four models described above have no error terms, so that MSE-1 and MSE-2 coincide.
6.2 Simulations for CI and PI
We continue using the four models of Section 6.1 to test the accuracy of the confidence and prediction intervals defined in previous sections with scalable subagging DNN estimators. To make sure we have enough subsamples for iterated subsampling in the CI construction, we take the sample size to be , which implies when . It further implies that the number of subsamples for the iterated subagging stage is . For the prediction interval, we take the sample size to be or . To determine the structure of the subagging DNN estimator, we keep the strategy summarized in the previous subsection, i.e., we make its size as close to the sample size as possible in both the first and the iterated subsampling stages. We use the same PyTorch training setting to find as in Section 6.1.
We call the naive QCI determined by the equal-tail quantiles of the estimates QCI-1; note that this QCI may be too conservative, as explained in 4.1. We call the QCI based on Eq. 17 QCI-2; the PCI based on Eq. 12 PCI-1; the PCI based on Eq. 15 taking , PCI-2; and the PCI based on Eq. 15 taking , PCI-3. PI denotes the prediction interval defined in Eq. 20. All CIs and the PI defined in the previous sections are asymptotically valid conditional on the observation . We check the conditional coverage rate with simulations in finite-sample cases. To this end, for each simulation model we fix 10 test points that are different from the training points; these 10 points can be recovered by setting numpy.random.seed(0) and generating a sample according to the model.
To evaluate the performance of the (conditional) CI for each test point, we repeat the simulation process times and apply the formulas below to compute the empirical coverage rate (ECR) and empirical length (EL) of the different CIs for each test point:
here is the true model value evaluated at the -th test point; and are the corresponding upper and lower bounds of the different CIs at the -th replication for the -th test point, respectively. We take two nominal significance levels, and . Simulation results are tabulated in Tables 2 and 3.
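The ECR and EL computations are straightforward; here is a minimal sketch for a single test point, with toy interval endpoints as purely illustrative inputs:

```python
import numpy as np

def ecr_el(true_value, lowers, uppers):
    # Empirical coverage rate and average length of an interval for one
    # test point across replications (one lower/upper pair per run).
    lowers, uppers = np.asarray(lowers), np.asarray(uppers)
    ecr = np.mean((lowers <= true_value) & (true_value <= uppers))
    el = np.mean(uppers - lowers)
    return ecr, el

# Three replications: the first two intervals cover the value 1, the third misses.
ecr, el = ecr_el(1.0, lowers=[0.0, 0.0, 2.0], uppers=[2.0, 2.0, 3.0])
```

In the tables below, this computation is applied per test point over the full set of replications.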
To evaluate the performance of the (conditional) PI for each test point, the procedure is slightly more complicated; we summarize it in the four steps below:
-
Step 1
Take the sample size to be or ; simulate sample sets: based on one of four simulation models.
-
Step 2
For each sample set, train the subsampling DNN estimator and build the prediction interval for 10 test points by:
where and are the and quantile values of the empirical distribution of the residuals, respectively.
-
Step 3
To check the performance of the PIs for the test points based on each sample set, simulate conditional on , pretending the true data-generating model is known, and compute the empirical coverage rate and empirical length by the formulas below:
and are the corresponding upper and lower bounds of the PI for the -th test point based on the -th sample set defined in Step 2; .
-
Step 4
For , estimate by the average of the empirical coverage rates of the corresponding (conditional) PI over the sample sets, i.e., w.r.t. ; estimate the length of the (conditional) PI for the -th test point by w.r.t. .
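Step 3 above can be sketched as follows, assuming (as in the simulations) standard normal errors; the test-point value, interval endpoints, and sample sizes are illustrative:

```python
import numpy as np

def conditional_pi_coverage(f_x0, pi_lo, pi_hi, n_future=5000, seed=0):
    # Simulate future responses at a test point from the (known, in
    # simulation) model with standard normal errors and record how
    # often they land inside the prediction interval.
    rng = np.random.default_rng(seed)
    y_future = f_x0 + rng.normal(size=n_future)
    return np.mean((pi_lo <= y_future) & (y_future <= pi_hi))

# A perfectly centered 95% interval should cover roughly 95% of futures.
cov = conditional_pi_coverage(f_x0=1.0, pi_lo=1.0 - 1.96, pi_hi=1.0 + 1.96)
```

Step 4 then simply averages these per-sample-set coverages and lengths.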
We again take two nominal significance levels, and . Simulation results are tabulated in Table 4.
Remark (Different levels of conditioning).
As explained in the work of Wang and Politis, (2021), there are several conditioning levels at which to measure the performance of a PI or CI. What we consider in this paper is , which shall be interpreted as the probability conditional on . If we consider the empirical coverage rate of , it approximates another conditioning level, i.e., ; here represents the whole sample. By Lemma 4 of Wang and Politis, (2021), if is an arbitrary measurable event, then . Moreover, conditional coverage under implies the marginal coverage . This unconditional coverage is what is implied by the popular Conformal Prediction method in the machine learning community. Simulation studies show that our CIs and PIs also have good unconditional coverage; see the results in Appendix: C.
We can summarize several findings based on simulation results:
-
•
For the empirical coverage rate of the quantile-type CIs, the naive QCI-1 over-covers the true model values, as we expect. The corresponding CI length is always larger than that of QCI-2 and is, in fact, the largest among the 5 different CIs. On the other hand, the specifically designed QCI-2 returns ECRs closer to the specified confidence level than QCI-1. Meanwhile, the ECR of QCI-2 is larger than the nominal confidence level for almost all test points, since we take and according to the strategy in 4.5 to enlarge the CI.
-
•
For the empirical coverage rate of the pivot-type CIs, although the length of PCI-1 is the shortest, the ECR of PCI-1 is below the nominal confidence level for almost all test points, since may be underestimated and we may face a bias issue in practice. PCI-3, whose margin of error is enlarged in a mild way, always has a larger ECR than PCI-1 but still mostly undercovers the true model value. PCI-2, whose margin of error is enlarged in the most extreme way, performs much better in terms of coverage rate, with a larger CI length as the price. We claim that PCI-2 is the optimal CI candidate according to overall performance based on length and coverage rate. As for QCI-2, we conjecture it would be a good alternative if we had more samples, so that could approximate well in the iterated subsampling stage.
-
•
For the prediction task, the PIs for all four models and all test points have almost the same coverage rate and length. All ECRs are slightly below the nominal confidence level, which is not a surprise since we omit the variability of the model estimation, and the residual distribution may underestimate the true error distribution in finite samples. As for length, all PIs' lengths are close to or , since the true error distribution in the simulations is standard normal and we took equal-tail PIs.
Test point: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Model-1, | ||||||||||
ECR | ||||||||||
QCI-1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
QCI-1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
QCI-2 | 0.966 | 0.964 | 0.990 | 0.990 | 0.946 | 0.814 | 0.994 | 0.972 | 0.978 | 0.946 |
QCI-2 | 0.988 | 0.984 | 0.998 | 1.000 | 0.984 | 0.932 | 0.998 | 0.984 | 0.994 | 0.976 |
PCI-1 | 0.884 | 0.854 | 0.896 | 0.894 | 0.748 | 0.510 | 0.874 | 0.884 | 0.830 | 0.860 |
PCI-1 | 0.938 | 0.918 | 0.954 | 0.940 | 0.836 | 0.610 | 0.942 | 0.938 | 0.890 | 0.918 |
PCI-2 | 0.892 | 0.986 | 0.996 | 1.000 | 0.784 | 0.954 | 0.998 | 0.908 | 0.876 | 0.998 |
PCI-2 | 0.946 | 0.996 | 0.998 | 1.000 | 0.882 | 0.984 | 1.000 | 0.954 | 0.938 | 1.000 |
PCI-3 | 0.886 | 0.862 | 0.906 | 0.914 | 0.748 | 0.524 | 0.880 | 0.884 | 0.830 | 0.878 |
PCI-3 | 0.938 | 0.922 | 0.958 | 0.960 | 0.836 | 0.616 | 0.948 | 0.938 | 0.890 | 0.928 |
EL | ||||||||||
QCI-1 | 2.39 | 2.24 | 1.93 | 1.92 | 1.51 | 1.04 | 1.38 | 1.94 | 1.46 | 3.10 |
QCI-1 | 2.94 | 3.05 | 2.49 | 2.46 | 1.85 | 1.30 | 1.94 | 2.65 | 1.81 | 4.19 |
QCI-2 | 1.44 | 1.25 | 1.27 | 1.27 | 1.00 | 0.80 | 0.96 | 1.23 | 1.02 | 1.50 |
QCI-2 | 1.73 | 1.50 | 1.54 | 1.54 | 1.20 | 0.97 | 1.16 | 1.48 | 1.24 | 1.81 |
PCI-1 | 0.39 | 0.37 | 0.32 | 0.31 | 0.24 | 0.17 | 0.23 | 0.32 | 0.24 | 0.51 |
PCI-1 | 0.46 | 0.45 | 0.38 | 0.37 | 0.29 | 0.20 | 0.28 | 0.38 | 0.28 | 0.61 |
PCI-2 | 0.40 | 0.53 | 0.57 | 1.07 | 0.25 | 0.31 | 0.51 | 0.34 | 0.26 | 0.96 |
PCI-2 | 0.47 | 0.63 | 0.67 | 1.28 | 0.30 | 0.37 | 0.61 | 0.41 | 0.31 | 1.15 |
PCI-3 | 0.39 | 0.38 | 0.32 | 0.34 | 0.24 | 0.17 | 0.24 | 0.32 | 0.24 | 0.53 |
PCI-3 | 0.46 | 0.45 | 0.38 | 0.41 | 0.29 | 0.21 | 0.28 | 0.39 | 0.28 | 0.63 |
Model-2, | ||||||||||
ECR | ||||||||||
QCI-1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
QCI-1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
QCI-2 | 0.910 | 0.960 | 0.948 | 0.918 | 0.910 | 0.628 | 0.972 | 0.936 | 0.930 | 0.928 |
QCI-2 | 0.952 | 0.980 | 0.980 | 0.962 | 0.960 | 0.746 | 0.984 | 0.974 | 0.970 | 0.958 |
PCI-1 | 0.886 | 0.890 | 0.892 | 0.862 | 0.860 | 0.828 | 0.888 | 0.866 | 0.882 | 0.888 |
PCI-1 | 0.932 | 0.952 | 0.954 | 0.926 | 0.916 | 0.894 | 0.944 | 0.940 | 0.936 | 0.958 |
PCI-2 | 0.904 | 0.974 | 0.990 | 1.000 | 0.880 | 0.958 | 0.982 | 0.896 | 0.902 | 0.998 |
PCI-2 | 0.946 | 0.984 | 0.996 | 1.000 | 0.940 | 0.990 | 0.984 | 0.960 | 0.952 | 1.000 |
PCI-3 | 0.886 | 0.892 | 0.900 | 0.892 | 0.860 | 0.830 | 0.896 | 0.868 | 0.882 | 0.920 |
PCI-3 | 0.932 | 0.956 | 0.958 | 0.948 | 0.916 | 0.898 | 0.950 | 0.940 | 0.936 | 0.964 |
EL | ||||||||||
QCI-1 | 2.66 | 2.05 | 2.45 | 2.72 | 2.13 | 1.93 | 2.05 | 2.48 | 2.55 | 3.17 |
QCI-1 | 3.31 | 2.60 | 3.04 | 3.35 | 2.58 | 2.35 | 2.75 | 3.04 | 3.09 | 4.21 |
QCI-2 | 1.19 | 1.08 | 1.15 | 1.26 | 1.04 | 0.92 | 1.15 | 1.18 | 1.19 | 1.33 |
QCI-2 | 1.44 | 1.30 | 1.39 | 1.52 | 1.25 | 1.10 | 1.39 | 1.43 | 1.44 | 1.60 |
PCI-1 | 0.43 | 0.33 | 0.39 | 0.44 | 0.34 | 0.31 | 0.34 | 0.40 | 0.41 | 0.52 |
PCI-1 | 0.51 | 0.40 | 0.47 | 0.52 | 0.41 | 0.37 | 0.40 | 0.48 | 0.49 | 0.61 |
PCI-2 | 0.44 | 0.48 | 0.61 | 1.12 | 0.35 | 0.39 | 0.56 | 0.42 | 0.42 | 0.96 |
PCI-2 | 0.52 | 0.57 | 0.73 | 1.33 | 0.41 | 0.47 | 0.67 | 0.50 | 0.50 | 1.14 |
PCI-3 | 0.43 | 0.33 | 0.40 | 0.47 | 0.34 | 0.31 | 0.34 | 0.40 | 0.41 | 0.53 |
PCI-3 | 0.51 | 0.40 | 0.48 | 0.55 | 0.41 | 0.37 | 0.41 | 0.48 | 0.49 | 0.63 |
Test point: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Model-3, | ||||||||||
ECR | ||||||||||
QCI-1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
QCI-1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
QCI-2 | 0.988 | 0.996 | 1.000 | 0.996 | 0.976 | 0.962 | 0.990 | 0.992 | 0.998 | 0.994 |
QCI-2 | 0.998 | 1.000 | 1.000 | 0.998 | 0.996 | 0.974 | 0.996 | 0.998 | 1.000 | 1.000 |
PCI-1 | 0.788 | 0.868 | 0.638 | 0.862 | 0.704 | 0.810 | 0.868 | 0.846 | 0.838 | 0.842 |
PCI-1 | 0.858 | 0.920 | 0.728 | 0.920 | 0.804 | 0.880 | 0.930 | 0.918 | 0.914 | 0.920 |
PCI-2 | 1.000 | 0.884 | 1.000 | 1.000 | 1.000 | 0.902 | 0.908 | 0.978 | 0.876 | 0.970 |
PCI-2 | 1.000 | 0.938 | 1.000 | 1.000 | 1.000 | 0.948 | 0.964 | 0.998 | 0.932 | 0.988 |
PCI-3 | 0.860 | 0.870 | 0.754 | 0.870 | 0.726 | 0.810 | 0.868 | 0.852 | 0.840 | 0.844 |
PCI-3 | 0.920 | 0.920 | 0.842 | 0.924 | 0.820 | 0.882 | 0.930 | 0.922 | 0.914 | 0.922 |
EL | ||||||||||
QCI-1 | 1.24 | 1.22 | 0.41 | 0.47 | 0.77 | 2.39 | 1.64 | 1.10 | 0.91 | 0.78 |
QCI-1 | 1.51 | 1.47 | 0.51 | 0.57 | 0.93 | 2.94 | 2.00 | 1.32 | 1.12 | 0.94 |
QCI-2 | 0.85 | 0.97 | 0.39 | 0.43 | 0.62 | 1.52 | 1.31 | 0.80 | 0.74 | 0.62 |
QCI-2 | 1.03 | 1.16 | 0.47 | 0.52 | 0.75 | 1.83 | 1.57 | 0.97 | 0.90 | 0.76 |
PCI-1 | 0.20 | 0.19 | 0.07 | 0.07 | 0.12 | 0.39 | 0.26 | 0.18 | 0.15 | 0.12 |
PCI-1 | 0.24 | 0.23 | 0.08 | 0.09 | 0.15 | 0.46 | 0.31 | 0.21 | 0.17 | 0.15 |
PCI-2 | 1.22 | 0.21 | 0.76 | 0.20 | 0.45 | 0.53 | 0.29 | 0.27 | 0.16 | 0.20 |
PCI-2 | 1.46 | 0.25 | 0.90 | 0.23 | 0.53 | 0.63 | 0.35 | 0.33 | 0.20 | 0.24 |
PCI-3 | 0.24 | 0.20 | 0.08 | 0.08 | 0.13 | 0.39 | 0.26 | 0.18 | 0.15 | 0.13 |
PCI-3 | 0.28 | 0.23 | 0.10 | 0.09 | 0.15 | 0.47 | 0.31 | 0.21 | 0.17 | 0.15 |
Model-4, | ||||||||||
ECR | ||||||||||
QCI-1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
QCI-1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
QCI-2 | 0.738 | 1.000 | 0.996 | 0.910 | 0.994 | 0.924 | 0.982 | 0.998 | 0.988 | 0.864 |
QCI-2 | 0.852 | 1.000 | 0.998 | 0.966 | 0.998 | 0.968 | 0.994 | 1.000 | 0.996 | 0.940 |
PCI-1 | 0.878 | 0.832 | 0.870 | 0.856 | 0.870 | 0.868 | 0.860 | 0.872 | 0.494 | 0.662 |
PCI-1 | 0.932 | 0.902 | 0.940 | 0.912 | 0.928 | 0.938 | 0.910 | 0.932 | 0.590 | 0.776 |
PCI-2 | 0.962 | 1.000 | 0.948 | 0.984 | 0.998 | 0.894 | 1.000 | 0.998 | 0.998 | 0.998 |
PCI-2 | 0.980 | 1.000 | 0.986 | 0.996 | 0.998 | 0.948 | 1.000 | 0.998 | 1.000 | 1.000 |
PCI-3 | 0.878 | 0.898 | 0.870 | 0.858 | 0.884 | 0.868 | 0.888 | 0.876 | 0.522 | 0.664 |
PCI-3 | 0.936 | 0.948 | 0.942 | 0.920 | 0.938 | 0.938 | 0.936 | 0.940 | 0.614 | 0.780 |
EL | ||||||||||
QCI-1 | 2.90 | 0.63 | 0.79 | 1.50 | 2.35 | 1.16 | 1.09 | 1.99 | 1.21 | 0.69 |
QCI-1 | 3.71 | 0.79 | 1.00 | 1.91 | 2.92 | 1.48 | 1.35 | 2.48 | 1.49 | 0.89 |
QCI-2 | 1.74 | 0.61 | 0.70 | 1.14 | 1.67 | 0.80 | 0.85 | 1.38 | 0.89 | 0.58 |
QCI-2 | 2.10 | 0.74 | 0.84 | 1.38 | 2.02 | 0.97 | 1.03 | 1.68 | 1.08 | 0.70 |
PCI-1 | 0.47 | 0.10 | 0.13 | 0.24 | 0.38 | 0.19 | 0.18 | 0.32 | 0.20 | 0.11 |
PCI-1 | 0.56 | 0.12 | 0.15 | 0.29 | 0.45 | 0.22 | 0.21 | 0.39 | 0.23 | 0.13 |
PCI-2 | 0.69 | 0.78 | 0.17 | 0.45 | 0.85 | 0.20 | 0.76 | 0.54 | 0.60 | 0.24 |
PCI-2 | 0.82 | 0.93 | 0.20 | 0.54 | 1.01 | 0.24 | 0.90 | 0.64 | 0.72 | 0.29 |
PCI-3 | 0.48 | 0.12 | 0.13 | 0.24 | 0.39 | 0.19 | 0.19 | 0.33 | 0.20 | 0.11 |
PCI-3 | 0.57 | 0.14 | 0.15 | 0.29 | 0.47 | 0.22 | 0.23 | 0.39 | 0.24 | 0.14 |
Test point: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Model-1: | ||||||||||
EL = 3.28, ; EL = 3.91, | ||||||||||
ECR : | 0.870 | 0.877 | 0.877 | 0.873 | 0.884 | 0.891 | 0.886 | 0.878 | 0.886 | 0.865 |
ECR : | 0.929 | 0.934 | 0.934 | 0.931 | 0.939 | 0.944 | 0.940 | 0.935 | 0.940 | 0.925 |
Model-2: | ||||||||||
EL = 3.35, ; EL = 4.00, | ||||||||||
ECR : | 0.880 | 0.882 | 0.883 | 0.879 | 0.889 | 0.892 | 0.888 | 0.881 | 0.885 | 0.869 |
ECR : | 0.936 | 0.938 | 0.938 | 0.935 | 0.942 | 0.945 | 0.942 | 0.937 | 0.939 | 0.928 |
Model-3: | ||||||||||
EL = 3.29, ; EL = 3.93, | ||||||||||
ECR : | 0.896 | 0.894 | 0.898 | 0.899 | 0.897 | 0.880 | 0.889 | 0.895 | 0.896 | 0.897 |
ECR : | 0.948 | 0.946 | 0.949 | 0.949 | 0.948 | 0.936 | 0.943 | 0.947 | 0.947 | 0.949 |
Model-4: | ||||||||||
EL = 3.32, ; EL = 3.96, | ||||||||||
ECR : | 0.830 | 0.900 | 0.900 | 0.889 | 0.882 | 0.898 | 0.897 | 0.889 | 0.895 | 0.899 |
ECR : | 0.901 | 0.951 | 0.950 | 0.943 | 0.938 | 0.949 | 0.949 | 0.943 | 0.947 | 0.950 |
Model-1: | ||||||||||
EL = 3.28, ; EL = 3.91, | ||||||||||
ECR : | 0.885 | 0.886 | 0.887 | 0.886 | 0.892 | 0.895 | 0.892 | 0.887 | 0.891 | 0.880 |
ECR : | 0.939 | 0.941 | 0.941 | 0.940 | 0.945 | 0.946 | 0.945 | 0.941 | 0.943 | 0.936 |
Model-2: | ||||||||||
EL = 3.32, ; EL = 3.95, | ||||||||||
ECR : | 0.887 | 0.891 | 0.889 | 0.889 | 0.893 | 0.895 | 0.894 | 0.889 | 0.891 | 0.882 |
ECR : | 0.941 | 0.944 | 0.942 | 0.943 | 0.945 | 0.947 | 0.946 | 0.943 | 0.944 | 0.938 |
Model-3: | ||||||||||
EL = 3.29, ; EL = 3.92, | ||||||||||
ECR : | 0.897 | 0.896 | 0.899 | 0.899 | 0.899 | 0.889 | 0.893 | 0.897 | 0.897 | 0.899 |
ECR : | 0.948 | 0.947 | 0.950 | 0.949 | 0.949 | 0.943 | 0.945 | 0.948 | 0.948 | 0.949 |
Model-4: | ||||||||||
EL = 3.30, ; EL = 3.94, | ||||||||||
ECR : | 0.858 | 0.900 | 0.899 | 0.892 | 0.887 | 0.897 | 0.898 | 0.891 | 0.895 | 0.899 |
ECR : | 0.921 | 0.950 | 0.950 | 0.945 | 0.941 | 0.948 | 0.949 | 0.944 | 0.947 | 0.949 |
7 Conclusions
In this paper, we revisit the error bound of the fully connected DNN with ReLU activation function for estimating regression models. By taking into account the latest DNN approximation results, we improve the current error bound. Under some mild conditions, we show that the error bound of the DNN estimator may be further improved by applying the scalable subsampling technique. As a result, the scalable subsampling DNN estimator is computationally efficient without sacrificing accuracy. The theoretical results are verified by extensive simulations with various linear and non-linear regression models.
Beyond the error analysis for point estimations and point predictions, we propose different approaches to build asymptotically valid confidence and prediction intervals. More specifically, to overcome the undercoverage issue of CIs with finite samples, we consider several methods to enlarge the CI. As shown by simulations, our point estimations/predictions and confidence/prediction intervals based on scalable subsampling work well in practice. All in all, the scalable subsampling DNN estimator offers the complete package in terms of statistical inference, i.e., (a) computational efficiency; (b) point estimation/prediction accuracy; and (c) allowing for the construction of practically useful confidence and prediction intervals.
Appendix A: Proofs
Proof of Theorem 4.1.
This result can be easily shown based on the proof of Theorem 1 in the work of Farrell et al., (2021). We take the intermediate result from the final step of their proof: With probability at least ,
(23) |
where is an appropriate constant; in this proof, represents appropriate constants whose meaning may change with the context; ; . By Theorem 3.1 of Yarotsky and Zhevnerchuk, (2020) and Lemma 1 of Farrell et al., (2021), we can conclude that there is a standard fully connected DNN whose depth and width satisfy the inequalities below:
(24) |
for any . Furthermore, we can find the upper bound of based on Eq. 24:
Subsequently, we rewrite Eq. 23 as follows:
(25) |
To optimize the bound, we can choose . This gives:
(26) |
As a result, we get:
(27) |
Finally, we take , which implies 4.1.
∎
Proof of Theorem 4.2.
Under A1-A5, we can analyze the expected square error for the subagging DNN estimator as below:
(28) |
For the first term on the r.h.s. of Eq. 28, by the error bound ignoring the slowly varying term, we can get:
(29) |
this is satisfied with at least probability .
Ideally, we would hope to take a small value of to improve the error bound in Eq. 29. However, we are restricted in doing so, since the bias of the subagging estimator increases as we take smaller and smaller. Thus, we need to consider the second term on the r.h.s. of Eq. 28. Start by considering one specific pair:
(30) |
The last equality is due to the independence between subsamples and . As mentioned in the main text, it is difficult to determine the rate of the bias of the subagging estimator. Thus, A4 and A5 impose additional assumptions on the bias term. We restate A4 below:
A5 then requires that the bias order of satisfy the inequality: .
Then, we can find that the order of Eq. 30 is:
Combining these two pieces, we can analyze Eq. 28:
(31) |
If the bias term is negligible relative to the other term, i.e.,
The above lower bound satisfies the requirement of being positive. Then, needs to be larger than to ensure that the lower bound of is less than 1, which is satisfied due to A5. Meanwhile, we want to take as small as possible, i.e., . This results in the error bound below:
The fact that is larger than is guaranteed by the requirement that , i.e., A5 again.
∎
Proof of Theorem 5.1.
Since the error and are independent of , we actually have based on 5.1. Thus, we can write:
(32) |
where represents . We can start by considering the below expression:
(33) |
For the first term on the r.h.s. of the above inequality, we have:
(34)
Note that the first and third terms on the r.h.s. of Eq. 34 converge to 0 in probability. For the middle term, since converges to in probability and is assumed to be bounded by B2, this term also converges to 0 in probability via a Taylor expansion. Combining all the pieces, we have:
(35)
∎
Appendix B: Additional simulations on point estimation
In this part, we consider the models below to check the performance of SS-DNN:
• Model-1’: , where .
• Model-2’: , where .
• Model-3’: , where .
• Model-4’: , where .
As in the main text, is the identity matrix of the appropriate dimension. Compared to the data-generating models in the main text, the only difference is that the error term is removed. Due to this change, we can investigate the approximation ability of the various DNN estimators directly, i.e., the MSE-1 error is equivalent to the MSE-2 error. Using the same training procedure, we summarize all simulation results in Table 5; the empirical MSE and run times (in seconds) were again computed as averages over 200 replications.
 | SS-DNN | S-DNN | DNN-deep-1 | DNN-deep-2 | DNN-wide-1 | DNN-wide-2
---|---|---|---|---|---|---
Model-1 | | | | | |
Width | [20,20] | [20,20] | [90,90] | [60,60] | [800] | [400]
MSE | 0.0002 | 0.0008 | 0.0009 | 0.0007 | 0.0007 | 0.0006
Run Times | 238 | 252 | 446 | 338 | 405 | 298
Model-2 | | | | | |
Width | [20,20] | [20,20] | [90,90] | [60,60] | [800] | [400]
MSE | 0.0071 | 0.0126 | 0.0171 | 0.0213 | 0.0079 | 0.0104
Run Times | 209 | 227 | 406 | 308 | 367 | 270
Model-3 | | | | | |
Width | [15,15,15] | [15,15,15] | [65,65,65] | [45,45,45] | [2000] | [1000]
MSE | 0.0038 | 0.0075 | 0.0061 | 0.0065 | 0.0068 | 0.0066
Run Times | 251 | 271 | 445 | 356 | 397 | 289
Model-4 | | | | | |
Width | [15,15,15] | [15,15,15] | [65,65,65] | [45,45,45] | [2000] | [1000]
MSE | 0.0073 | 0.0126 | 0.0090 | 0.0096 | 0.0119 | 0.0124
Run Times | 252 | 270 | 449 | 358 | 476 | 328
Model-4 (n = 20000) | | | | | |
Width | [20,20,20] | [20,20,20] | [95,95,95] | [65,65,65] | [2800] | [1400]
MSE | 0.0083 | 0.0233 | 0.0194 | 0.0188 | 0.0247 | 0.0248
Run Times | 518 | 555 | 1438 | 962 | 1369 | 862
Note: Here, “width” represents the number of neurons in each hidden layer; e.g., [20, 20] means that the DNN has two hidden layers, each with 20 neurons.
The SS-DNN is still the most time-efficient estimator; it even runs faster than training the S-DNN on the whole sample. Applying the scalable subagging method yields even greater computational savings when training with a larger sample size or a larger model. The SS-DNN is also the most accurate estimator except in the case of Model-4 simulated data, where its accuracy is slightly worse than that of DNN-deep-1. We conjecture the reason is that Model-4 is relatively complicated, so a DNN with depth 3 and constant width 15 has a high bias. After increasing the sample size to 20000, the subagging estimator beats the other models.
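The subagging scheme underlying SS-DNN can be sketched as follows. This is a minimal illustration, assuming the non-random block scheme of Politis (2021): the n observations are split into q = ⌊n/b⌋ disjoint consecutive blocks of size b, one estimator is fitted per block, and the predictions are averaged. Ordinary least squares stands in for the DNN fit purely to keep the sketch self-contained; the function names are ours.

```python
import numpy as np

def scalable_subagging(X, y, b, fit, predict):
    """Non-random scalable subsampling (subagging): split the n observations
    into q = n // b disjoint, consecutive blocks of size b, fit one estimator
    per block, and return a predictor that averages the block estimators."""
    n = len(y)
    q = n // b
    models = [fit(X[i * b:(i + 1) * b], y[i * b:(i + 1) * b]) for i in range(q)]

    def aggregated(X_new):
        # average the q block predictions pointwise
        return np.mean([predict(m, X_new) for m in models], axis=0)

    return aggregated

# Toy illustration: a linear model with OLS standing in for the DNN fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=1000)

fit = lambda Xb, yb: np.linalg.lstsq(Xb, yb, rcond=None)[0]
predict = lambda beta, Xn: Xn @ beta
f_bag = scalable_subagging(X, y, b=200, fit=fit, predict=predict)
preds = f_bag(X[:5])  # subagged predictions at the first five points
```

Because the blocks are disjoint, each of the q = 5 fits here touches only b = 200 observations, which is the source of the computational savings when the base learner (a DNN in the paper) scales super-linearly in the sample size.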
Appendix C: Simulations for unconditional CI and PI
In this part, we consider the unconditional (not conditional on ) performance of the CIs and PIs defined in the main text. We take the empirical coverage rate (ECR) and the empirical length (EL) below as the measurement criteria for the (unconditional) CI:
here, is the true model value at the -th data point in the training dataset, and and are the corresponding upper and lower bounds of the CI, respectively. To measure the performance of the (unconditional) PI, we take
here, is the -th observed response value in the test dataset, and and are the corresponding upper and lower bounds of the PI. We take . Since the (unconditional) ECR and EL hardly change across the simulation studies, we perform only 50 replications and present the average results for the various CIs and PIs in Table 6 and Table 7, respectively. Similar to the simulation results for conditional CIs and PIs in the main text, QCI-1 overcovers true model values; PCI-2 shows the best overall performance in terms of length and coverage rate. All PIs perform well but slightly undercover true future values.
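The ECR and EL criteria above amount to the fraction of true values covered by the intervals and the average interval length; a sketch under that reading is below (the helper name `ecr_el` is ours). For CIs, `truth` holds the true model values at the training points; for PIs, it holds the observed responses in the test set.

```python
import numpy as np

def ecr_el(truth, lower, upper):
    """Empirical coverage rate (fraction of true values falling inside
    [lower, upper]) and empirical length (average of upper - lower)."""
    truth, lower, upper = map(np.asarray, (truth, lower, upper))
    ecr = float(np.mean((lower <= truth) & (truth <= upper)))
    el = float(np.mean(upper - lower))
    return ecr, el

# Tiny hand-made example: the second interval misses its target value.
truth = np.array([0.0, 1.0, 2.0, 3.0])
lower = np.array([-0.5, 1.1, 1.5, 1.0])
upper = np.array([0.5, 1.6, 2.5, 5.0])
ecr, el = ecr_el(truth, lower, upper)  # ecr = 0.75, el = 1.625
```

Comparing ECR against the nominal level 1 − α then reveals over- or under-coverage, while EL distinguishes between intervals with similar coverage (shorter is better).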
Regarding the computational cost of the iterated subsampling stage: the total time of training all DNN estimators for and (the iterated subsampling stage) is less than the time of training all DNN estimators in the first subsampling stage, i.e., for . We can see why by analyzing the computational complexity of the iterated subsampling stage. In total, we need to train models with sample size . Under the assumption made in 4.1, the complexity of training a DNN is mainly determined by its size, the sample size, and the number of epochs, so the training time of the iterated stage is around when the sample size is close to the size of the DNN. Similarly, the complexity of training the DNNs in the first subsampling stage is around . Since , the complexity of the first subsampling stage dominates that of the iterated stage when is large enough. In other words, the cost of applying the iterated subsampling technique is negligible when dealing with a huge dataset.
Nominal : | 0.1 | 0.05
---|---|---
Model-1, | |
ECR of QCI-1 | 1.000 | 1.000
ECR of QCI-2 | 0.961 | 0.983
ECR of PCI-1 | 0.842 | 0.905
ECR of PCI-2 | 0.953 | 0.976
ECR of PCI-3 | 0.853 | 0.914
EL of QCI-1 | 1.752 | 2.371
EL of QCI-2 | 1.090 | 1.316
EL of PCI-1 | 0.292 | 0.348
EL of PCI-2 | 0.549 | 0.654
EL of PCI-3 | 0.299 | 0.356
Training time of first and iterated subagging estimators: | 5613 / 4566
Structure of and : | [65,65] / [10,10]
Model-2, | |
ECR of QCI-1 | 1.000 | 1.000
ECR of QCI-2 | 0.925 | 0.961
ECR of PCI-1 | 0.875 | 0.931
ECR of PCI-2 | 0.957 | 0.979
ECR of PCI-3 | 0.883 | 0.937
EL of QCI-1 | 2.213 | 2.847
EL of QCI-2 | 1.117 | 1.349
EL of PCI-1 | 0.359 | 0.428
EL of PCI-2 | 0.597 | 0.711
EL of PCI-3 | 0.367 | 0.437
Training time of first and iterated subagging estimators: | 5451 / 4566
Structure of and : | [65,65] / [10,10]
Model-3, | |
ECR of QCI-1 | 1.000 | 1.000
ECR of QCI-2 | 0.967 | 0.985
ECR of PCI-1 | 0.776 | 0.846
ECR of PCI-2 | 0.961 | 0.976
ECR of PCI-3 | 0.802 | 0.867
EL of QCI-1 | 0.774 | 0.956
EL of QCI-2 | 0.625 | 0.755
EL of PCI-1 | 0.125 | 0.149
EL of PCI-2 | 0.457 | 0.545
EL of PCI-3 | 0.132 | 0.158
Training time of first and iterated subagging estimators: | 6597 / 5551
Structure of and : | [45,45,45] / [10,10,10]
Model-4, | |
ECR of QCI-1 | 1.000 | 1.000
ECR of QCI-2 | 0.904 | 0.945
ECR of PCI-1 | 0.762 | 0.830
ECR of PCI-2 | 0.931 | 0.956
ECR of PCI-3 | 0.778 | 0.844
EL of QCI-1 | 1.242 | 1.570
EL of QCI-2 | 0.903 | 1.091
EL of PCI-1 | 0.202 | 0.240
EL of PCI-2 | 0.496 | 0.591
EL of PCI-3 | 0.209 | 0.249
Training time of first and iterated subagging estimators: | 6451 / 5419
Structure of and : | [45,45,45] / [10,10,10]
Nominal : | 0.1 | 0.05
---|---|---
Model-1, | |
ECR of PI | 0.8981 | 0.9486
EL of PI | 3.2919 | 3.9234
Model-2, | |
ECR of PI | 0.8983 | 0.9487
EL of PI | 3.3008 | 3.9316
Model-3, | |
ECR of PI | 0.8973 | 0.9481
EL of PI | 3.3022 | 3.9339
Model-4, | |
ECR of PI | 0.8975 | 0.9482
EL of PI | 3.3135 | 3.9411
References
- Bartlett et al., (2019) Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. (2019). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301.
- Belkin et al., (2019) Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854.
- Breiman, (1996) Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123–140.
- Bühlmann and Yu, (2002) Bühlmann, P. and Yu, B. (2002). Analyzing bagging. The Annals of Statistics, 30(4):927–961.
- Cybenko, (1989) Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.
- Farrell et al., (2021) Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1):181–213.
- Ha et al., (2005) Ha, K., Cho, S., and MacLachlan, D. (2005). Response models based on bagging neural networks. Journal of Interactive Marketing, 19(1):17–30.
- Hornik et al., (1989) Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.
- Jordan, (2013) Jordan, M. I. (2013). On statistics, computation and scalability. Bernoulli, 19(4):1378–1390.
- Khwaja et al., (2015) Khwaja, A., Naeem, M., Anpalagan, A., Venetsanopoulos, A., and Venkatesh, B. (2015). Improved short-term load forecasting using bagged neural networks. Electric Power Systems Research, 125:109–115.
- Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kohler et al., (2023) Kohler, M., Langer, S., and Reif, U. (2023). Estimation of a regression function on a manifold by fully connected deep neural networks. Journal of Statistical Planning and Inference, 222:160–181.
- McCulloch and Pitts, (1943) McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133.
- Pan and Politis, (2016) Pan, L. and Politis, D. N. (2016). Bootstrap prediction intervals for linear, nonlinear and nonparametric autoregressions. Journal of Statistical Planning and Inference, 177:1–27.
- Politis, (2015) Politis, D. N. (2015). Model-free prediction and regression: a transformation-based approach to inference. Springer.
- Politis, (2021) Politis, D. N. (2021). Scalable subsampling: computation, aggregation and inference. arXiv preprint arXiv:2112.06434. Accepted by Biometrika.
- Politis et al., (1999) Politis, D. N., Romano, J. P., and Wolf, M. (1999). Subsampling. Springer Science & Business Media.
- Stone, (1982) Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040–1053.
- Ting, (2021) Ting, D. (2021). Simple, optimal algorithms for random sampling without replacement. arXiv preprint arXiv:2104.05091.
- Wang and Politis, (2021) Wang, Y. and Politis, D. N. (2021). Model-free bootstrap and conformal prediction in regression: Conditionality, conjecture testing, and pertinent prediction intervals. arXiv preprint arXiv:2109.12156.
- Wu and Politis, (2023) Wu, K. and Politis, D. N. (2023). Bootstrap prediction inference of non-linear autoregressive models. arXiv preprint arXiv:2306.04126.
- Yang et al., (2020) Yang, Z., Yu, Y., You, C., Steinhardt, J., and Ma, Y. (2020). Rethinking bias-variance trade-off for generalization of neural networks. In International Conference on Machine Learning, pages 10767–10777. PMLR.
- Yarotsky, (2018) Yarotsky, D. (2018). Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pages 639–649. PMLR.
- Yarotsky and Zhevnerchuk, (2020) Yarotsky, D. and Zhevnerchuk, A. (2020). The phase diagram of approximation rates for deep neural networks. Advances in Neural Information Processing Systems, 33:13005–13015.
- Zhang and Politis, (2023) Zhang, Y. and Politis, D. N. (2023). Bootstrap prediction intervals with asymptotic conditional validity and unconditional guarantees. Information and Inference: A Journal of the IMA, 12(1):157–209.
- Zou et al., (2021) Zou, T., Li, X., Liang, X., and Wang, H. (2021). On the subbagging estimation for massive data. arXiv preprint arXiv:2103.00631.