
A Cheap Bootstrap Method for Fast Inference

Henry Lam Email: [email protected] Department of Industrial Engineering and Operations Research, Columbia University
Abstract

The bootstrap is a versatile inference method that has proven powerful in many statistical problems. However, when applied to modern large-scale models, it could face substantial computational demand from repeated data resampling and model fitting. We present a bootstrap methodology that uses minimal computation, namely with a resample effort as low as one Monte Carlo replication, while maintaining desirable statistical guarantees. We present the theory of this method, which rests on a simple twist of the standard bootstrap principle. We also present generalizations of this method to nested sampling problems and to a range of subsampling variants, and illustrate how it can be used for fast inference across different estimation problems.

1 Introduction

The bootstrap is a versatile method for statistical inference that has proven powerful in many problems. It is advantageously data-driven and automatic: Instead of using analytic calculation based on detailed model knowledge such as the delta method, the bootstrap uses resampling directly from the data as a mechanism to approximate the sampling distribution. The flexibility and ease of the bootstrap have made it widely popular; see, e.g., the monographs Efron and Tibshirani (1994); Davison and Hinkley (1997); Shao and Tu (2012); Hall and Martin (1988) for comprehensive reviews on the subject.

Despite its popularity, an arguable challenge in using the bootstrap is its computational demand. To execute the bootstrap, one needs to run enough Monte Carlo replications to generate resamples and refit models. While this is a non-issue in classical problems, for modern large-scale models and data sizes, even one model refitting could incur substantial costs. This is the case if the estimation or model fitting involves big optimization or numerical root-finding problems arising from, for instance, empirical risk minimization in machine learning, where the computing routine can require running time-consuming stochastic gradient descent or advanced mathematical programming procedures. Moreover, in these problems, what constitutes an adequate Monte Carlo replication size can be case-by-case and necessitate trial-and-error, which adds further complication. Such computation issues in applying resampling techniques to massive learning models have been discussed and motivate some recent works, e.g., Kleiner et al. (2012); Lu et al. (2020); Giordano et al. (2019); Schulam and Saria (2019); Alaa and Van Der Schaar (2020).

Moreover, an additional challenge faced by the bootstrap is that its implementation can sometimes involve a nested amount of Monte Carlo runs. This issue arises for predictive models that are corrupted by noises from both the data and the computation procedure. An example is predictive inference using high-fidelity stochastic simulation modeling (Nelson (2013); Law et al. (1991)). These models are used to estimate output performance measures in sophisticated systems, and comprise a popular approach for causal decision-making in operations management and scientific disciplines. In this context, obtaining a point estimate of the performance measure itself requires a large number of Monte Carlo runs on the model to average out the aleatory uncertainty. When the model contains parameters that are calibrated from exogenous data, conducting bootstrap inference on the prediction would involve first resampling the exogenous data, then, given each resampled input parameter value, running and averaging a large number of model runs. This consequently leads to a multiplicative total computation effort between the resample size and the model run size per resample, an issue known as the input uncertainty problem in simulation (see, e.g., Henderson (2003); Song et al. (2014); Barton (2012); Lam (2016)). Here, to guarantee consistency, the sizes in the two sampling layers not only need to be sufficiently large, but also depend on the input data size, which adds more complication to their choices (Lam and Qian (2021)). Other than simulation modeling, a similar phenomenon arises in some machine learning models. For example, a bagging predictor (Breiman (1996); Wager et al. (2014); Mentch and Hooker (2016)) averages a large number of base learning models each trained from a resample, and a deep ensemble (Lakshminarayanan et al. (2016); Lee et al. (2015)) averages several neural networks each trained with a different randomization seed (e.g., in initializing a stochastic gradient descent). Running bootstraps for the prediction values of these models again requires a multiplicative effort of data resampling and, given each resample, building a new predictor which requires multiple training runs.

Motivated by these challenges, our goal in this paper is to study a statistically valid bootstrap method that allows using very few resamples. More precisely, our method reduces the number of bootstrap resamples to the minimum possible, namely as low as one Monte Carlo replication, while still delivering asymptotically exact coverage as the data size grows. This latter guarantee holds under essentially the same regularity conditions as conventional bootstrap methods. As a result, our method can give valid inference under a constrained budget when existing bootstrap approaches fail. Moreover, the validity of our inference under any number of resamples endows more robustness, and hence alleviates the tuning effort, in choosing the number of resamples, which as mentioned before could be a trial-and-error task in practice. For convenience, we call our method the Cheap Bootstrap, where “Cheap” refers to our low computation cost, and also to the correspondingly lower-quality half-width performance that comes with this low cost, as we will see.

In terms of theory, the underpinning mechanism of our Cheap Bootstrap relies on a simple twist to the basic principle used in conventional bootstraps. The latter dictates that the sampling distribution of a statistic can be approximated by the resample counterpart, conditional on the data. To utilize this principle for inference, one then typically generates many resample estimates to obtain their approximate distribution. Our main idea takes this principle a step further by leveraging the following: The resemblance of the resampling distribution to the original sampling distribution, universally regardless of the data realization, implies the asymptotic independence between the original estimate and any resample estimate. Thus, under asymptotic normality, we can construct pivotal statistics using the original estimate and any number of resample estimates that cancel out the standard error as a nuisance parameter, which subsequently allows us to conduct inference using as low as one resample.

In addition to the basic asymptotic exact coverage guarantee, we study the Cheap Bootstrap in several other aspects. The first regards the half-width of the Cheap Bootstrap confidence interval, a statistical efficiency measure in addition to coverage performance. We show how the Cheap Bootstrap interval follows the behavior of a $t$-interval, which is wide when we use only one resample but shrinks rapidly as the number of resamples increases. Second, we investigate the higher-order coverage of the Cheap Bootstrap via Edgeworth expansions. Different from conventional bootstrap procedures, the Cheap Bootstrap pivotal statistic has a limiting $t$-distribution, not a normal distribution. Edgeworth expansion on the $t$-distribution is, to the best of our knowledge, open in the literature (except the recent working paper He and Lam (2021)). Here, we build the Edgeworth expansions for Cheap Bootstrap coverage probabilities and show their errors match the order $O(n^{-1})$ (where $n$ is the sample size) incurred in conventional bootstraps in the two-sided case. Moreover, we explicitly identify the coefficient in this error term, which is expressible as a high-dimensional integral that, although it cannot be evaluated in closed form, is amenable to Monte Carlo integration.

We also generalize the Cheap Bootstrap in several directions. First, we devise the Cheap Bootstrap confidence intervals for models that encounter both data and computation noises and hence the aforementioned nested sampling issue. When applying the Cheap Bootstrap to these problems, the number of outer resamples can be driven down to one or a very small number, and thus the total model run size is no longer multiplicatively big. On the other hand, because of the presence of multiple sources of noises, the joint distribution between the original estimate and the resample estimate is no longer asymptotically independent, but exhibits a dependent structure that nonetheless could be tractably exploited. Second, our Cheap Bootstrap can also be used in conjunction with “subsampling” variants. By subsampling here we broadly refer to schemes that resample data sets of size smaller than the original size, such as the mm-out-of-nn Bootstrap (Politis et al. (1999); Bickel et al. (1997)), and more recently the Bags of Little Bootstraps (Kleiner et al. (2014)) and Subsampled Double Bootstrap (Sengupta et al. (2016)). We show how these procedures can all be implemented “cheaply”. Finally, we also extend the Cheap Bootstrap to multivariate problems.

We close this introduction by discussing other related works. Regarding motivation, our study appears orthogonal to most of the classical bootstrap literature, as its focus is on higher-order coverage accuracy, using techniques such as studentization (Davison and Hinkley (1997) §2.4), calibration or iterated bootstrap (Hall (2013) §3.11; Hall and Martin (1988); Hall (1986a); Beran (1987)), and bias-correction and acceleration (Efron (1982) §10.4; Efron (1987); DiCiccio et al. (1996)). The literature also studies the relaxation of assumptions to handle non-smooth functions and dependent data (Politis et al. (1999); Bickel et al. (1997)). In contrast to these studies, we focus on the capability to output confidence intervals under extremely few resamples, and the produced intervals are admittedly less accurate than the refined intervals in the literature constructed under idealized infinite resamples.

Our study is related to Hall (1986b), which shows a uniform coverage error of O(n1)O(n^{-1}) for one-sided intervals regardless of the number of resamples BB, as long as the nominal coverage probability is a multiple of (B+1)1(B+1)^{-1}. From this, Hall (1986b) suggests, for 95%95\% interval for instance, that it suffices to use B=19B=19 to obtain a reasonable coverage accuracy. Our suggestion in this paper drives this number down to B=1B=1, which is made possible via our use of asymptotic independence and normality instead of order-statistic analyses on the quantile-based bootstrap in Hall (1986b). Our result on a small BB resulting in a large interval width matches the insight in Hall (1986b), though we make this tradeoff more explicit by looking at the width’s moments. Other related works on bootstrap computation cost include Booth and Hall (1994) and Lee and Young (1995) that analyze or reduce Monte Carlo sizes by replacing sampling with analytical calculation when applying the iterated bootstrap, and Booth and Do (1993) and Efron and Tibshirani (1994) §23 on variance reduction. These works, however, do not consider the exceedingly low number of resamples that we advocate in this paper.

The remainder of this paper is as follows. Section 2 introduces the Cheap Bootstrap. Section 3 presents the main theoretical results on asymptotic exactness and higher-order coverage error, and discusses half-width behaviors. Section 4 generalizes the Cheap Bootstrap to nested sampling problems, and Section 5 to subsampling. Section 6 shows numerical results to demonstrate our method and compare with benchmarks. The Appendix details proofs that are not included in the main body, presents additional theoretical and numerical results, and documents some technical background materials.

2 Cheap Bootstrap Confidence Intervals

We aim to construct a confidence interval for a target parameter ψ:=ψ(P)\psi:=\psi(P), where PP is the data distribution, and ψ:𝒫\psi:\mathcal{P}\to\mathbb{R} is a functional with 𝒫\mathcal{P} as the set of all distributions on the data domain. This ψ(P)\psi(P) can range from simple statistical summaries such as correlation coefficient, quantile, conditional value-at-risk, to model parameters such as regression coefficient and prediction error measurement.

Suppose we are given independent and identically distributed (i.i.d.) data of size nn, say X1,,XnX_{1},\ldots,X_{n}. A natural point estimate of ψ(P)\psi(P) is ψ^n:=ψ(P^n)\hat{\psi}_{n}:=\psi(\hat{P}_{n}), where P^n():=(1/n)i=1nI(Xi)\hat{P}_{n}(\cdot):=(1/n)\sum_{i=1}^{n}I(X_{i}\in\cdot) is the empirical distribution constructed from the data, and I()I(\cdot) denotes the indicator function.

Our approach to construct a confidence interval for ψ\psi proceeds as follows. For each replication b=1,,Bb=1,\ldots,B, we resample the data set, namely independently and uniformly sample with replacement from {X1,,Xn}\{X_{1},\ldots,X_{n}\} nn times, to obtain {X1b,,Xnb}\{X_{1}^{*b},\ldots,X_{n}^{*b}\}, and evaluate the resample estimate ψnb:=ψ(Pnb)\psi_{n}^{*b}:=\psi(P_{n}^{*b}), where Pnb()=(1/n)i=1nI(Xib)P_{n}^{*b}(\cdot)=(1/n)\sum_{i=1}^{n}I(X_{i}^{*b}\in\cdot) is the resample empirical distribution. Our confidence interval is

=[ψ^ntB,1α/2S,ψ^n+tB,1α/2S]\mathcal{I}=\left[\hat{\psi}_{n}-t_{B,1-\alpha/2}S,\ \hat{\psi}_{n}+t_{B,1-\alpha/2}S\right] (1)

where

S2=1Bb=1B(ψnbψ^n)2S^{2}=\frac{1}{B}\sum_{b=1}^{B}\left(\psi_{n}^{*b}-\hat{\psi}_{n}\right)^{2} (2)

Here, $S^{2}$ resembles the sample variance of the resample estimates, but “centered” at the original point estimate $\hat{\psi}_{n}$ instead of the resample mean, and using $B$ in the denominator instead of $B-1$ as in the “textbook” sample variance. The critical value $t_{B,1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of $t_{B}$, the Student $t$-distribution with $B$ degrees of freedom. That is, the number of degrees of freedom of this $t$-distribution is precisely the resampling computation effort.

The interval \mathcal{I} in (1) is defined for any positive integer B1B\geq 1. In particular, when B=1B=1, it becomes

[ψ^nt1,1α/2|ψnψ^n|,ψ^n+t1,1α/2|ψnψ^n|]\left[\hat{\psi}_{n}-t_{1,1-\alpha/2}\left|\psi_{n}^{*}-\hat{\psi}_{n}\right|,\ \hat{\psi}_{n}+t_{1,1-\alpha/2}\left|\psi_{n}^{*}-\hat{\psi}_{n}\right|\right] (3)

where ψn\psi_{n}^{*} is the single resample estimate.
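As a concrete illustration, the following is a minimal Python sketch of the interval (1)-(2); the helper name cheap_bootstrap_ci, the seeding convention, and the use of the sample mean as the estimator in the usage line are illustrative assumptions rather than part of the method's specification.

```python
# A minimal sketch of the Cheap Bootstrap interval (1)-(2), assuming the target
# functional psi is supplied as a plain function of a data array.
import numpy as np
from scipy import stats

def cheap_bootstrap_ci(data, psi, B=1, alpha=0.05, seed=None):
    """Return the (1 - alpha)-level Cheap Bootstrap interval using B resamples."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = psi(data)  # original estimate psi(P_hat_n)
    # resample estimates psi_n^{*b}, b = 1, ..., B (each a full n-out-of-n resample)
    psi_star = np.array([psi(rng.choice(data, size=n, replace=True)) for _ in range(B)])
    S = np.sqrt(np.mean((psi_star - psi_hat) ** 2))   # eq. (2), centered at psi_hat
    t_crit = stats.t.ppf(1 - alpha / 2, df=B)         # t_{B, 1 - alpha/2}
    return psi_hat - t_crit * S, psi_hat + t_crit * S  # eq. (1)

# Usage: a 95% interval for the mean from a single resample (B = 1).
if __name__ == "__main__":
    x = np.random.default_rng(0).normal(size=100)
    print(cheap_bootstrap_ci(x, np.mean, B=1))
```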

The form of (1) looks similar to the so-called standard error bootstrap as we explain momentarily, especially when BB is large. However, the point here is that BB does not need to be large. In fact, we have the following basic coverage guarantee for (1) and (3). First, consider the following condition that is standard in the bootstrap literature:

Assumption 1 (Standard condition for bootstrap validity).

We have n(ψ^nψ)N(0,σ2)\sqrt{n}(\hat{\psi}_{n}-\psi)\Rightarrow N(0,\sigma^{2}) where σ2>0\sigma^{2}>0. Moreover, a resample estimate ψn\psi_{n}^{*} satisfies n(ψnψ^n)N(0,σ2)\sqrt{n}(\psi_{n}^{*}-\hat{\psi}_{n})\Rightarrow N(0,\sigma^{2}) conditional on the data X1,X2,X_{1},X_{2},\ldots in probability as nn\to\infty.

In Assumption 1, “\Rightarrow” denotes convergence in distribution, and the conditional “\Rightarrow”-convergence in probability means P(n(ψnψ^n)x|P^n)pP(N(0,σ2)x)P(\sqrt{n}(\psi_{n}^{*}-\hat{\psi}_{n})\leq x|\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}P(N(0,\sigma^{2})\leq x) for any xx\in\mathbb{R}, where “p\stackrel{{\scriptstyle p}}{{\to}}” denotes convergence in probability. Assumption 1 is a standard condition to justify bootstrap validity, and is ensured when ψ()\psi(\cdot) is Hadamard differentiable (see Proposition 2 in the sequel which follows from Van der Vaart (2000) §23). This assumption implies that, conditional on the data, the asymptotic distributions of the centered resample estimate n(ψnψ^n)\sqrt{n}(\psi_{n}^{*}-\hat{\psi}_{n}) and the centered original estimate n(ψ^nψ)\sqrt{n}(\hat{\psi}_{n}-\psi) are the same. Thus, one can use the former distribution, which is computable via Monte Carlo, to approximate the latter unknown distribution. Simply put, we can use a “plug-in” of P^n\hat{P}_{n} in place of PP, namely ξ(ψ^n,ψn)\xi(\hat{\psi}_{n},\psi_{n}^{*}), to approximate ξ(ψ,ψ^n)\xi(\psi,\hat{\psi}_{n}) for any suitable data-dependent quantity ξ(,)\xi(\cdot,\cdot).

Under Assumption 1, (1) is an asymptotically exact (1α)(1-\alpha)-level confidence interval:

Theorem 1 (Asymptotic exactness of Cheap Bootstrap).

Under Assumption 1, for any B1B\geq 1, the interval \mathcal{I} in (1) is an asymptotically exact (1α)(1-\alpha)-level confidence interval for ψ\psi, i.e.,

n(ψ)1α\mathbb{P}_{n}(\psi\in\mathcal{I})\to 1-\alpha

as nn\to\infty, where n\mathbb{P}_{n} denotes the probability with respect to the data X1,,XnX_{1},\ldots,X_{n} and all randomness from the resampling.

Theorem 1 states that, under the same condition to justify the validity of conventional bootstraps, the Cheap Bootstrap interval \mathcal{I} has asymptotically exact coverage, for resample effort as low as 1. Before we explain how Theorem 1 is derived, we first compare the Cheap Bootstrap to conventional bootstraps.

2.1 Comparisons with Established Bootstrap Methods

The commonest bootstrap approach for interval construction uses the quantiles of resample estimates to calibrate the interval limits. This includes the basic bootstrap and (Efron’s) percentile bootstrap (Efron and Tibshirani (1994); Davison and Hinkley (1997)). In the basic bootstrap, we first run Monte Carlo to approximate the $\alpha/2$- and $(1-\alpha/2)$-quantiles of $\psi_{n}^{*}-\hat{\psi}_{n}$, say $q_{\alpha/2}$ and $q_{1-\alpha/2}$. Assumption 1 guarantees that $q_{\alpha/2}$ and $q_{1-\alpha/2}$ approximate the corresponding quantiles of $\hat{\psi}_{n}-\psi$, thus giving $[\hat{\psi}_{n}-q_{1-\alpha/2},\hat{\psi}_{n}-q_{\alpha/2}]$ as a $(1-\alpha)$ confidence interval for $\psi$ (or equivalently using the $\alpha/2$ and $(1-\alpha/2)$ quantiles of $2\hat{\psi}_{n}-\psi_{n}^{*}$). The percentile bootstrap directly uses the $\alpha/2$- and $(1-\alpha/2)$-quantiles of $\psi_{n}^{*}$ as the interval limits, i.e., $[q_{\alpha/2},q_{1-\alpha/2}]$, justified additionally by the symmetry of the asymptotic distribution. Alternatively, one can bootstrap the standard error and plug it into a normality confidence interval (Efron (1981)). That is, we compute $Var_{*}(\psi^{*}_{n})$ to approximate $Var(\hat{\psi}_{n})$ (where $Var$ and $Var_{*}$ denote respectively the variance with respect to the data and the conditional variance with respect to a resample drawn from $\hat{P}_{n}$), and output $[\hat{\psi}_{n}\pm z_{1-\alpha/2}\sqrt{Var_{*}(\psi^{*}_{n})}]$ as the confidence interval, with $z_{1-\alpha/2}$ being the $(1-\alpha/2)$-quantile of the standard normal.

All approaches above require generating enough resamples. When BB is large, the S2S^{2} defined in (2) satisfies S2Var(ψn)S^{2}\approx Var_{*}(\psi^{*}_{n}), and moreover tB,1α/2z1α/2t_{B,1-\alpha/2}\approx z_{1-\alpha/2}. Thus in this case the Cheap Bootstrap interval \mathcal{I} becomes ψ^±z1α/2Var(ψn)\hat{\psi}\pm z_{1-\alpha/2}\sqrt{Var_{*}(\psi^{*}_{n})}, which is nothing but the standard error bootstrap. In other words, the Cheap Bootstrap can be viewed as a generalization of the standard error bootstrap to using any BB.

We also contrast our approach with the studentized bootstrap (e.g., Davison and Hinkley (1997) §2.4), which resamples pivotal quantities in the form (ψ^nψ)/σ^n(\hat{\psi}_{n}-\psi)/\hat{\sigma}_{n} where σ^n\hat{\sigma}_{n} is a sample standard error (computed from P^n\hat{P}_{n}). As widely known, while this approach bears the term “studentized”, it does not use the tt-distribution, and is motivated from better higher-order coverage accuracy. Moreover, attaining such an interval could require additional computation to resample the standard error. Our approach is orthogonal to this studentization in that we aim to minimize computation instead of expanding it for the sake of attaining higher-order accuracy.

Lastly, (1) and (3) have natural analogs for one-sided intervals, where we use

lower=(,ψ^n+tB,1αS] or upper=[ψ^ntB,1αS,).\mathcal{I}_{lower}=\left(-\infty,\ \hat{\psi}_{n}+t_{B,1-\alpha}S\right]\text{\ \ or\ \ }\mathcal{I}_{upper}=\left[\hat{\psi}_{n}-t_{B,1-\alpha}S,\ \infty\right). (4)

Theorem 1 applies to (4) under the same assumption. In the one-sided case, Hall (1986b) has shown an error of $O(n^{-1})$, uniformly in $B$, for the studentized bootstrap when the nominal coverage probability $(1-\alpha)$ is a multiple of $(B+1)^{-1}$. This translates, in the case of a $95\%$ interval, to a minimum of $B=19$. Our Theorem 1 drives down this suggestion of Hall (1986b) for the studentized bootstrap to $B=1$ for the Cheap Bootstrap.

2.2 A Basic Numerical Demonstration

We numerically compare our Cheap Bootstrap with the conventional approaches, namely the basic bootstrap, percentile bootstrap and standard error bootstrap described above. We use a basic example on estimating a 95%95\% confidence interval of the 0.60.6-quantile of, say, an exponential distribution with unit rate from i.i.d. data. This example can be handled readily by other means, but it demonstrates how Cheap Bootstrap can outperform baselines under limited replication budget.
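One synthetic run of this quantile example can be sketched as follows, reusing the cheap_bootstrap_ci helper from the sketch in Section 2 (again an illustrative implementation rather than the exact experimental code).

```python
# One synthetic run of the 0.6-quantile example with a single resample (B = 1).
# The true 0.6-quantile of a unit-rate exponential is -log(0.4), about 0.916.
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=1.0, size=100)      # n = 100 i.i.d. Exp(1) observations
q60 = lambda x: np.quantile(x, 0.6)              # the target functional psi
lo, hi = cheap_bootstrap_ci(data, q60, B=1)
print((lo, hi), lo <= -np.log(0.4) <= hi)        # interval and whether it covers the truth
```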

We use a data size n=100n=100, and vary B=1,2,5,10,50B=1,2,5,10,50. For each BB, we generate synthetic data 10001000 times, each time running all the competing methods, and outputting the empirical coverage and interval width statistics from the 10001000 experimental repetitions. Table 1 shows that when B=1B=1, Cheap Bootstrap already gives confidence intervals with a reasonable coverage of 92%92\%, while all other bootstrap methods fail because they simply cannot operate with only one resample (basic and percentile bootstraps cannot output two different finite numbers as the upper and lower interval limits, and standard error bootstrap has a zero denominator in the formula). When B=2B=2, all baseline methods still have poor performance, as B=2B=2 is clearly still too small a size to possess any statistical guarantees. In contrast, Cheap Bootstrap gives a similar coverage of 93%93\% as in B=1B=1, and continues to have stable coverages when BB increases. As BB increases through 5 to 50, the baseline methods gradually catch up on the coverage level.

Table 1: Interval performances with different bootstrap methods, at nominal confidence level 95%95\% and with sample size n=100n=100.
Method    Replication size $B$    Empirical coverage (margin of error)    Width mean (standard deviation)
Cheap Bootstrap 1 0.92  (0.02) 2.42  (2.06)
Basic Bootstrap 1 NA NA
Percentile Bootstrap 1 NA NA
Standard Error Bootstrap 1 NA NA
Cheap Bootstrap 2 0.93  (0.02) 0.95  (0.60)
Basic Bootstrap 2 0.32  (0.03) 0.14  (0.12)
Percentile Bootstrap 2 0.32  (0.03) 0.14  (0.12)
Standard Error Bootstrap 2 0.67  (0.03) 0.38  (0.33)
Cheap Bootstrap 5 0.92  (0.02) 0.63  (0.28)
Basic Bootstrap 5 0.62  (0.03) 0.29  (0.14)
Percentile Bootstrap 5 0.69  (0.03) 0.29  (0.14)
Standard Error Bootstrap 5 0.86  (0.02) 0.47  (0.22)
Cheap Bootstrap 10 0.92  (0.02) 0.53  (0.20)
Basic Bootstrap 10 0.73  (0.03) 0.37  (0.14)
Percentile Bootstrap 10 0.80  (0.02) 0.37  (0.14)
Standard Error Bootstrap 10 0.88  (0.02) 0.46  (0.17)
Cheap Bootstrap 50 0.94  (0.02) 0.50  (0.13)
Basic Bootstrap 50 0.86  (0.02) 0.44  (0.12)
Percentile Bootstrap 50 0.92  (0.02) 0.44  (0.12)
Standard Error Bootstrap 50 0.93  (0.02) 0.48  (0.13)

Regarding interval length, we see that the Cheap Bootstrap interval shrinks as $B$ increases, sharply for small $B$ and then stabilizing. In particular, the mean length decreases by $1.47$ when $B$ increases from 1 to 2, compared to a much smaller $0.03$ when $B$ increases from 10 to 50. Though the length of the Cheap Bootstrap interval is always larger than that of the baselines, it closes in at $B=10$ and even more so at $B=50$. Both the good coverage starting from $B=1$ for the Cheap Bootstrap, and the sharp decrease of its interval length at very small $B$ followed by a level-off, will be explained by our theory next.

3 Theory of Cheap Bootstrap

We describe the theory of Cheap Bootstrap, including its asymptotic exactness (Section 3.1), higher-order coverage error (Section 3.2) and interval width behavior (Section 3.3).

3.1 Asymptotic Exactness

We present the proof of Theorem 1 on the asymptotic exactness of Cheap Bootstrap for any B1B\geq 1. The first key ingredient is the following joint asymptotic characterization among the original estimate and the resample estimates:

Proposition 1 (Asymptotic independence and normality among original and resample estimates).

Under Assumption 1, we have, for the original estimate ψ^n\hat{\psi}_{n} and resample estimates ψnb,b=1,,B\psi_{n}^{*b},b=1,\ldots,B,

n(ψ^nψ,ψn1ψ^n,,ψnBψ^n)(σZ0,σZ1,,σZB)\sqrt{n}(\hat{\psi}_{n}-\psi,\psi^{*1}_{n}-\hat{\psi}_{n},\ldots,\psi^{*B}_{n}-\hat{\psi}_{n})\Rightarrow(\sigma Z_{0},\sigma Z_{1},\ldots,\sigma Z_{B}) (5)

where Zb,b=0,,BZ_{b},b=0,\ldots,B are i.i.d. N(0,1)N(0,1) variables.

The convergence (5) entails that, under n\sqrt{n}-scaling, the centered resample estimates and the original point estimate are all asymptotically independent, and moreover are distributed according to the same normal, with the only unknown being σ\sigma that captures the asymptotic standard error. The asymptotic independence is thanks to the universal limit of a resample estimate n(ψnbψ^n)\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n}) conditional on any data sequence, as detailed below.

Proof of Proposition 1.

Denote Φσ()\Phi_{\sigma}(\cdot) as the distribution function of N(0,σ2)N(0,\sigma^{2}). To show the joint weak convergence of n(ψ^nψ)\sqrt{n}(\hat{\psi}_{n}-\psi) and n(ψnbψ^n)\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n}), b=1,,Bb=1,\ldots,B, to i.i.d. N(0,σ2)N(0,\sigma^{2}) variables, we will show, for any xb,b=0,,Bx_{b}\in\mathbb{R},b=0,\ldots,B,

P(n(ψ^nψ)x0,n(ψn1ψ^n)x1,,n(ψnBψ^n)xB)b=0BΦσ(xb)P\left(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0},\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1},\ldots,\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}\right)\to\prod_{b=0}^{B}\Phi_{\sigma}(x_{b})

as nn\to\infty. To this end, we have

\begin{aligned}
& \left|P\left(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0},\ \sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1},\ \ldots,\ \sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}\right)-\prod_{b=0}^{B}\Phi_{\sigma}(x_{b})\right| \\
={}& \left|E\left[I(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\,P(\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1}|\hat{P}_{n})\cdots P(\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}|\hat{P}_{n})\right]-\prod_{b=0}^{B}\Phi_{\sigma}(x_{b})\right| \\
&\qquad\text{since $\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n})$, $b=1,\ldots,B$, are independent conditional on the data $X_{1},\ldots,X_{n}$} \\
={}& \bigg|E\left[I(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\,P(\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1}|\hat{P}_{n})\cdots P(\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}|\hat{P}_{n})\right] \\
&\quad -P(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})+P(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})-\prod_{b=0}^{B}\Phi_{\sigma}(x_{b})\bigg| \\
\leq{}& E\bigg[I(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\bigg|P(\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1}|\hat{P}_{n})\cdots P(\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}|\hat{P}_{n})-\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})\bigg|\bigg] \\
&\quad +\left|\left(P(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})-\Phi_{\sigma}(x_{0})\right)\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})\right| \quad\text{by the triangle inequality} \\
\leq{}& E\left|P(\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1}|\hat{P}_{n})\cdots P(\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}|\hat{P}_{n})-\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})\right| \\
&\quad +\left|P(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})-\Phi_{\sigma}(x_{0})\right|\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})
\end{aligned} (6)

Now, by Assumption 1, we have P(n(ψnbψ^n)xb|P^n)pΦσ(xb)P(\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n})\leq x_{b}|\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}\Phi_{\sigma}(x_{b}) for b=1,,Bb=1,\ldots,B, and thus b=1BP(n(ψnbψ^n)xb|P^n)pb=1BΦσ(xb)\prod_{b=1}^{B}P(\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n})\leq x_{b}|\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}\prod_{b=1}^{B}\Phi_{\sigma}(x_{b}). Hence, since P(n(ψnbψ^n)xb|P^n)P(\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n})\leq x_{b}|\hat{P}_{n}) and Φσ(xb)\Phi_{\sigma}(x_{b}) are bounded by 1, the first term in (6) converges to 0 by the bounded convergence theorem. The second term in (6) also goes to 0 because P(n(ψ^nψ)x0)Φσ(x0)P(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\to\Phi_{\sigma}(x_{0}) by Assumption 1 again. Therefore (6) converges to 0 as nn\to\infty. ∎
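A quick numerical illustration of Proposition 1 (not needed for the theory, and using the sample mean purely as a stand-in estimator) is to verify that, across many synthetic data sets, the centered original estimate and a centered resample estimate are essentially uncorrelated; a minimal sketch is below.

```python
# Empirically checking the asymptotic independence in Proposition 1:
# the correlation between sqrt(n)(psi_hat - psi) and sqrt(n)(psi* - psi_hat)
# should be close to zero for moderately large n.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 2000
orig, boot = np.empty(reps), np.empty(reps)
for i in range(reps):
    x = rng.exponential(size=n)                       # data with psi = E[X] = 1
    psi_hat = x.mean()
    x_star = rng.choice(x, size=n, replace=True)      # one resample
    orig[i] = np.sqrt(n) * (psi_hat - 1.0)
    boot[i] = np.sqrt(n) * (x_star.mean() - psi_hat)
print(np.corrcoef(orig, boot)[0, 1])                  # expected to be near 0
```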

Given Proposition 1, to infer ψ\psi we can now leverage classical normality inference tools to “cancel out” the nuisance parameter σ\sigma. In particular, we can use the pivotal statistic

T:=ψ^nψST:=\frac{\hat{\psi}_{n}-\psi}{S} (7)

where $S^{2}$ is as defined in (2); this statistic converges to a Student $t$-distribution with $B$ degrees of freedom. More concretely:

Proof of Theorem 1.

The pivotal statistic (7) satisfies

T=ψ^nψS=n(ψ^nψ)1Bb=1B(n(ψnbψ^n))2Z01Bb=1BZb2T=\frac{\hat{\psi}_{n}-\psi}{S}=\frac{\sqrt{n}(\hat{\psi}_{n}-\psi)}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n}))^{2}}}\Rightarrow\frac{Z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}Z_{b}^{2}}} (8)

for i.i.d. N(0,1)N(0,1) variables Z0,Z1,,ZBZ_{0},Z_{1},\ldots,Z_{B}, where we use Proposition 1 and the continuous mapping theorem to deduce the weak convergence. Note that

Z01Bb=1BZb2=dN(0,1)χB2/B=dtB\frac{Z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}Z_{b}^{2}}}\stackrel{{\scriptstyle d}}{{=}}\frac{N(0,1)}{\sqrt{\chi^{2}_{B}/B}}\stackrel{{\scriptstyle d}}{{=}}t_{B}

where N(0,1)N(0,1) and χB2\chi^{2}_{B} here represent standard normal and χB2\chi^{2}_{B} random variables. Here the two equalities in distribution (denoted “=d\stackrel{{\scriptstyle d}}{{=}}”) follow from the elementary constructions of χ2\chi^{2}- and tt-distributions respectively. Thus

T=ψ^nψStBT=\frac{\hat{\psi}_{n}-\psi}{S}\Rightarrow t_{B}

Hence we have

n(tB,1α/2ψ^nψStB,1α/2)1α\mathbb{P}_{n}\left(-t_{B,1-\alpha/2}\leq\frac{\hat{\psi}_{n}-\psi}{S}\leq t_{B,1-\alpha/2}\right)\to 1-\alpha

from which we conclude

n(ψ^ntB,1α/2Sψψ^n+tB,1α/2S)1α\mathbb{P}_{n}\left(\hat{\psi}_{n}-t_{B,1-\alpha/2}S\leq\psi\leq\hat{\psi}_{n}+t_{B,1-\alpha/2}S\right)\to 1-\alpha

and the theorem. ∎

Note that instead of using the tt-statistic approach, it is also possible to produce intervals from the normal variables in other ways (e.g., Wall et al. (2001)) which could have potential benefits for very small BB. However, the tt-interval form of (1) is intuitive, matches the standard error bootstrap as BB grows, and its width is easy to quantify.

In Theorem 1 we have used Assumption 1. This assumption is ensured when the functional ψ()\psi(\cdot) is Hadamard differentiable with a non-degenerate derivative. More precisely, for a class of functions \mathcal{F} from 𝒳\mathcal{X}\to\mathbb{R}, define ():={z:z:=supf|z(f)|<}\ell^{\infty}(\mathcal{F}):=\left\{z:\|z\|_{\mathcal{F}}:=\sup_{f\in\mathcal{F}}|z(f)|<\infty\right\} where zz is a map from \mathcal{F} to \mathbb{R}. We have the following:

Proposition 2 (Sufficient conditions for bootstrap validity).

Consider P^n\hat{P}_{n} and PnP_{n}^{*} as random elements that take values in ()\ell^{\infty}(\mathcal{F}), where \mathcal{F} is a Donsker class with finite envelope. Suppose ψ:()\psi:\ell^{\infty}(\mathcal{F})\to\mathbb{R} is Hadamard differentiable at PP (tangential to ()\ell^{\infty}(\mathcal{F})) where the derivative ψP\psi_{P}^{\prime} satisfies that ψP(𝔾P)\psi_{P}^{\prime}(\mathbb{G}_{P}) is a non-degenerate random variable (i.e., with positive variance), for a tight Gaussian process 𝔾P\mathbb{G}_{P} on ()\ell^{\infty}(\mathcal{F}) with mean 0 and covariance Cov(𝔾P(f1),𝔾P(f2))=CovP(f1(X),f2(X))Cov(\mathbb{G}_{P}(f_{1}),\mathbb{G}_{P}(f_{2}))=Cov_{P}(f_{1}(X),f_{2}(X)) (where CovPCov_{P} denotes the covariance taken with respect to PP). Then Assumption 1 holds under i.i.d. data.

Proposition 2 follows immediately from the functional delta method (see Van der Vaart (2000) §23 or Appendix F.2). Note that we may as well use the conditions in Proposition 2 as the assumption for Theorem 1, but Assumption 1 helps highlight our point that Cheap Bootstrap is valid whenever conventional bootstraps are.

3.2 Higher-Order Coverage Errors

We analyze the higher-order coverage errors of the Cheap Bootstrap. A common approach to analyze coverage errors in conventional bootstraps is to use Edgeworth expansions, which we will also utilize. However, unlike these existing methods, the pivotal statistic used in the Cheap Bootstrap has a limiting $t$-distribution, not a normal distribution. Edgeworth expansions on limiting $t$-distributions appear, to the best of our knowledge, open in the literature (except for the recent working paper He and Lam (2021)). Here, we derive our expansions for the Cheap Bootstrap by integrating the expansions of the original estimate and resample estimates that follow (conditional) asymptotic normal distributions. The resulting coefficients can be explicitly identified and, even though they cannot be evaluated in closed form, are amenable to Monte Carlo integration.

As is customary in the bootstrap literature, we consider the function-of-mean model, namely ψ=g(𝝁)\psi=g(\bm{\mu}), where 𝝁=E𝐗\bm{\mu}=E\mathbf{X} for a dd-dimensional random vector 𝐗\mathbf{X} and g:dg:\mathbb{R}^{d}\to\mathbb{R} is a function. Denote 𝐗¯=(1/n)i=1n𝐗i\overline{\mathbf{X}}=(1/n)\sum_{i=1}^{n}\mathbf{X}_{i} as the sample mean of i.i.d. data {𝐗1,,𝐗n}\{\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\}. To facilitate our discussion, define

A(𝐱)=g(𝐱)g(𝝁)h(𝝁)A(\mathbf{x})=\frac{g(\mathbf{x})-g(\bm{\mu})}{h(\bm{\mu})} (9)

for function h:dh:\mathbb{R}^{d}\to\mathbb{R}, where h(𝝁)2h(\bm{\mu})^{2} is the asymptotic variance of ng(𝐗¯)\sqrt{n}g(\overline{\mathbf{X}}) that can be written in terms of 𝝁\bm{\mu} by the delta method (under regularity conditions that will be listed explicitly in our following theorem). To be more precise on the latter point, note that given any 𝝁~=E𝐗~\tilde{\bm{\mu}}=E\tilde{\mathbf{X}}, we can augment it to 𝝁=E[𝐗~𝐗~2]\bm{\mu}=E\left[\begin{array}[]{c}\tilde{\mathbf{X}}\\ \tilde{\mathbf{X}}^{2}\end{array}\right], with the operation 2\cdot^{2} defined component-wise (viewing 𝐗~\tilde{\mathbf{X}} as column vector). Thus, for a given g~(𝝁~)\tilde{g}(\tilde{\bm{\mu}}), we can define g(𝝁)=g~(𝝁~)g(\bm{\mu})=\tilde{g}(\tilde{\bm{\mu}}) and h(𝝁)2=g~(𝝁~)Σg~(𝝁~)h(\bm{\mu})^{2}=\nabla{\tilde{g}}(\tilde{\bm{\mu}})^{\top}\Sigma\nabla{\tilde{g}}(\tilde{\bm{\mu}}), where Σ\Sigma is the covariance matrix Cov(𝐗~)Cov(\tilde{\mathbf{X}}) which is a function of 𝝁\bm{\mu}, and \top denotes transpose.

We also define the “studentized” version of AA given by

As(𝐱)=g(𝐱)g(𝝁)h(𝐱)A_{s}(\mathbf{x})=\frac{g(\mathbf{x})-g(\bm{\mu})}{h(\mathbf{x})} (10)

We have the following:

Theorem 2 (Higher-order coverage errors for Cheap Bootstrap).

Consider the function-of-mean model where $\psi=g(\bm{\mu})$ for some function $g:\mathbb{R}^{d}\to\mathbb{R}$ and $\bm{\mu}=E\mathbf{X}$ for a $d$-dimensional random vector $\mathbf{X}$. Consider also the function $h:\mathbb{R}^{d}\to\mathbb{R}$ that appears in (9). Assume that $g$ and $h$ each has $\nu+3$ bounded derivatives in a neighborhood of $\bm{\mu}$, that $E\|\mathbf{X}\|^{l}<\infty$ for a sufficiently large positive number $l$, and that the characteristic function $\chi$ of $\mathbf{X}$ satisfies Cramér’s condition $\limsup_{\|\mathbf{t}\|\to\infty}|\chi(\mathbf{t})|<1$. Then

  1.

    When ν2\nu\geq 2, the two-sided Cheap Bootstrap confidence interval \mathcal{I} satisfies

    n(g(𝝁))=(1α)+ζn+o(1n)\mathbb{P}_{n}(g(\bm{\mu})\in\mathcal{I})=(1-\alpha)+\frac{\zeta}{n}+o\left(\frac{1}{n}\right)

    where the coefficient

    \zeta = B\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d\Phi(z_{0})+\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d(q_{2}(z_{0})\phi(z_{0}))
  2.

    When ν1\nu\geq 1, the one-sided upper Cheap Bootstrap confidence interval upper\mathcal{I}_{upper} satisfies

    n(g(𝝁)upper)=(1α)+ζuppern+o(1n)\mathbb{P}_{n}(g(\bm{\mu})\in\mathcal{I}_{upper})=(1-\alpha)+\frac{\zeta_{upper}}{\sqrt{n}}+o\left(\frac{1}{\sqrt{n}}\right)

    where the coefficient

    \zeta_{upper} = B\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d(p_{1}(z_{B})\phi(z_{B}))\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d\Phi(z_{0})+\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d(q_{1}(z_{0})\phi(z_{0}))

    and the one-sided lower Cheap Bootstrap confidence interval lower\mathcal{I}_{lower} satisfies

    n(g(𝝁)lower)=(1α)+ζlowern+o(1n)\mathbb{P}_{n}(g(\bm{\mu})\in\mathcal{I}_{lower})=(1-\alpha)+\frac{\zeta_{lower}}{\sqrt{n}}+o\left(\frac{1}{\sqrt{n}}\right)

    where the coefficient

    \zeta_{lower} = B\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\geq t_{B,1-\alpha}}d(p_{1}(z_{B})\phi(z_{B}))\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d\Phi(z_{0})+\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\geq t_{B,1-\alpha}}d\Phi(z_{B})\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d(q_{1}(z_{0})\phi(z_{0}))

    In the above, Φ\Phi and ϕ\phi are the standard normal distribution and density functions respectively, pjp_{j} and qjq_{j} are polynomials of degree 3j13j-1, odd for even jj and even for odd jj, with coefficients depending on moments of 𝐗\mathbf{X} up to order j+2j+2 polynomially and also g()g(\cdot) and h()h(\cdot).

In Theorem 2, the polynomials pjp_{j} and qjq_{j} are related to AA and AsA_{s} defined in (9) and (10) as follows. Under the assumptions in the theorem, the jj-th cumulant of nA(𝐗¯)\sqrt{n}A(\overline{\mathbf{X}}) and nAs(𝐗¯)\sqrt{n}A_{s}(\overline{\mathbf{X}}) can be expanded as

κj,n=n(j2)/2(kj,1+kj,2n+kj,3n2+)\kappa_{j,n}=n^{-(j-2)/2}\left(k_{j,1}+\frac{k_{j,2}}{n}+\frac{k_{j,3}}{n^{2}}+\cdots\right)

for coefficients kj,lk_{j,l}’s depending on whether we are considering AA or AsA_{s}. Then p1p_{1} or q1q_{1} is equal to

(k1,2+16k3,1H2(x))=(k1,2+16k3,1(x21))-\left(k_{1,2}+\frac{1}{6}k_{3,1}H_{2}(x)\right)=-\left(k_{1,2}+\frac{1}{6}k_{3,1}(x^{2}-1)\right)

while p2p_{2} or q2q_{2} is equal to

-\left(\frac{1}{2}(k_{2,2}+k_{1,2}^{2})H_{1}(x)+\frac{1}{24}(k_{4,1}+4k_{1,2}k_{3,1})H_{3}(x)+\frac{1}{72}k_{3,1}^{2}H_{5}(x)\right) = -x\left(\frac{1}{2}(k_{2,2}+k_{1,2}^{2})+\frac{1}{24}(k_{4,1}+4k_{1,2}k_{3,1})(x^{2}-3)+\frac{1}{72}k_{3,1}^{2}(x^{4}-10x^{2}+15)\right)

Here Hj()H_{j}(\cdot) is the jj-th order Hermite polynomial, and kj,lk_{j,l}’s are determined from AA for p1,p2p_{1},p_{2} and determined from AsA_{s} for q1,q2q_{1},q_{2}.

The coverage error O(n1)O(n^{-1}) of the Cheap Bootstrap in the two-sided case and O(n1/2)O(n^{-1/2}) in the one-sided case in Theorem 2 match the conventional basic and percentile bootstraps described in Section 2.1. Nonetheless, these errors are inferior to more refined approaches, including the studentized bootstrap also mentioned earlier which attains O(n1)O(n^{-1}) in the one-sided case (Davison and Hinkley (1997)).

Theorem 2 is proved by expressing the distribution of the pivotal statistic in (7) as a multi-dimensional integral, with respect to measures that are approximated by the Edgeworth expansions of the original estimate n(ψ^nψ)\sqrt{n}(\hat{\psi}_{n}-\psi) and the conditional expansions of the resample estimates n(ψnbψ^)\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}) which have limiting normal distributions. From these expansions, we could also identify the polynomials pjp_{j} and qjq_{j} in our discussion above by using equations (2.20), (2.24) and (2.25) in Hall (2013). Lastly, we note that the remainder term in the coverage of the two-sided confidence interval in Theorem 2 can be refined to O(1/n3/2)O(1/n^{3/2}) when ν3\nu\geq 3, and the one-sided intervals can be refined to O(1/n)O(1/n) when ν2\nu\geq 2. These can be seen by tracing our proof (in Appendix B).

The coefficients $\zeta$, $\zeta_{upper}$ and $\zeta_{lower}$ are computable via Monte Carlo integration, because an integral of the form

(z0,,zB)𝒮d(π(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)𝑑Φ(z0)\int\cdots\int_{(z_{0},\cdots,z_{B})\in\mathcal{S}}d(\pi(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})d\Phi(z_{0})

for some set 𝒮\mathcal{S} and polynomial π\pi can be written as

\int\cdots\int_{(z_{0},\ldots,z_{B})\in\mathcal{S}}\left(\pi^{\prime}(z_{B})\phi(z_{B})-z_{B}\pi(z_{B})\phi(z_{B})\right)dz_{B}\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d\Phi(z_{0}) = \int\cdots\int_{(z_{0},\ldots,z_{B})\in\mathcal{S}}\left(\pi^{\prime}(z_{B})-z_{B}\pi(z_{B})\right)d\Phi(z_{B})\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d\Phi(z_{0})

which is expressible as an expectation

E[π(ZB)ZBπ(ZB);(Z0,,ZB)𝒮]E[\pi^{\prime}(Z_{B})-Z_{B}\pi(Z_{B});(Z_{0},\cdots,Z_{B})\in\mathcal{S}]

taken with respect to independent standard normal variables Z0,,ZBZ_{0},\ldots,Z_{B}.
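As an illustration, a single term of these coefficients can be approximated by Monte Carlo along the lines below; the polynomial pi (standing in for $p_{2}$ or $q_{2}$), its example coefficients, and the function name zeta_term are hypothetical placeholders, since the actual polynomials depend on the problem through the moments of $\mathbf{X}$.

```python
# Monte Carlo sketch of one term of zeta in the two-sided case:
# E[pi'(Z_B) - Z_B * pi(Z_B); |Z_0 / sqrt((1/B) sum_b Z_b^2)| <= t_{B,1-alpha/2}].
import numpy as np
from numpy.polynomial import Polynomial
from scipy import stats

def zeta_term(pi, B, alpha=0.05, n_mc=10**6, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_mc, B + 1))                  # (Z_0, ..., Z_B)
    T = Z[:, 0] / np.sqrt(np.mean(Z[:, 1:] ** 2, axis=1))   # the limiting pivot
    in_S = np.abs(T) <= stats.t.ppf(1 - alpha / 2, df=B)    # the two-sided region
    zB = Z[:, B]
    integrand = pi.deriv()(zB) - zB * pi(zB)                # pi'(z) - z * pi(z)
    return np.mean(np.where(in_S, integrand, 0.0))

# Example with a hypothetical odd polynomial standing in for p_2.
print(zeta_term(Polynomial([0.0, -0.5, 0.0, -0.1]), B=1))
```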

3.3 Interval Tightness

Besides coverage, another important efficiency criterion is the interval width. From Section 3.1, we know that the Cheap Bootstrap interval (1) arises from a tt-interval construction, using which we can readily extract its width behavior. More specifically, for a fixed number of resamples BB, S2S^{2} satisfies

nS1Bb=1BZb2=dσχB2B\sqrt{n}S\Rightarrow\sqrt{\frac{1}{B}\sum_{b=1}^{B}Z_{b}^{2}}\stackrel{{\scriptstyle d}}{{=}}\sigma\sqrt{\frac{\chi^{2}_{B}}{B}}

as nn\to\infty so that the half-width of \mathcal{I} is HWtB,1α/2σχB2/(nB)\text{HW}\approx t_{B,1-\alpha/2}\sigma\sqrt{\chi_{B}^{2}/(nB)}. Plugging in the moments of χB2\chi_{B}^{2}, we see that the half-width for large sample size nn has mean and variance given by

E[HW]\displaystyle E[\text{HW}] tB,1α/22BΓ((B+1)/2)Γ(B/2)σn\displaystyle\approx t_{B,1-\alpha/2}\sqrt{\frac{2}{B}}\frac{\Gamma((B+1)/2)}{\Gamma(B/2)}\frac{\sigma}{\sqrt{n}} (11)
Var(HW)\displaystyle Var(\text{HW}) tB,1α/22(B2Γ((B+1)/2)2Γ(B/2)2)σ2nB\displaystyle\approx t_{B,1-\alpha/2}^{2}\left(B-\frac{2\Gamma((B+1)/2)^{2}}{\Gamma(B/2)^{2}}\right)\frac{\sigma^{2}}{nB} (12)

respectively, where Γ()\Gamma(\cdot) is the Gamma function. As BB increases, both the mean and variance decrease, which signifies a natural gain in statistical efficiency, until in the limit B=B=\infty we get a mean z1α/2σ/nz_{1-\alpha/2}\sigma/\sqrt{n} and a variance 0, which correspond to the normality interval with a known σ\sigma.
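The approximations (11) and (12) are straightforward to evaluate numerically; the short sketch below (with illustrative function and variable names) reproduces the entries of Table 2.

```python
# Approximate half-width mean and standard deviation from (11)-(12),
# dropping the sigma / sqrt(n) factor as in Table 2.
import numpy as np
from scipy import stats
from scipy.special import gammaln

def half_width_stats(B, alpha=0.05):
    t_crit = stats.t.ppf(1 - alpha / 2, df=B)
    # E[sqrt(chi^2_B / B)] = sqrt(2/B) * Gamma((B+1)/2) / Gamma(B/2)
    mean_chi = np.sqrt(2.0 / B) * np.exp(gammaln((B + 1) / 2) - gammaln(B / 2))
    mean = t_crit * mean_chi
    sd = t_crit * np.sqrt(1.0 - mean_chi ** 2)   # since E[chi^2_B / B] = 1
    return mean, sd

for B in (1, 2, 3, 10, 20):
    print(B, half_width_stats(B))  # e.g. B = 1 gives approximately (10.14, 7.66)
```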

The expressions (11) and (12) reveal that the half-width of the Cheap Bootstrap is large when $B=1$, but falls and stabilizes quickly as $B$ increases. Table 2 reports the approximate half-width mean and standard deviation in (11) and (12) at $\alpha=5\%$ (ignoring the $\sigma/\sqrt{n}$ factor), and the relative inflation in mean half-width compared to the case $B=\infty$ (i.e., $z_{1-\alpha/2}=1.96$). We see that, as $B$ increases from 1 to 2, the half-width mean drops drastically from $10.14$ ($417.3\%$ inflation relative to $B=\infty$) to $3.81$ ($94.6\%$ inflation), and the half-width standard deviation from $7.66$ to $1.99$. As $B$ increases from 2 to 3, the mean continues to drop notably to $2.93$ ($49.6\%$ inflation) and the standard deviation to $1.24$. The drop rate slows down as $B$ increases further. For instance, at $B=10$ the mean is $2.17$ ($10.9\%$ inflation) and the standard deviation is $0.49$, while at $B=20$ the mean is $2.06$ ($5.1\%$ inflation) and the standard deviation is $0.33$. Though what constitutes an acceptable inflation level compared to $B=\infty$ is context-dependent, generally the inflation appears reasonably low even when $B$ is a small number, except perhaps when $B$ is 1 or 2.

Table 2: Approximate half-width performance of Cheap Bootstrap against BB at 95%95\% confidence level.
Replication size $B$    Mean (ignoring the $\sigma/\sqrt{n}$ factor)    Mean inflation relative to $B=\infty$ ($z_{0.975}$)    Standard deviation (ignoring the $\sigma/\sqrt{n}$ factor)
1 10.14 417.3% 7.66
2 3.81 94.6% 1.99
3 2.93 49.6% 1.24
4 2.61 33.2% 0.95
5 2.45 24.8% 0.79
6 2.35 19.8% 0.69
7 2.28 16.4% 0.62
8 2.24 14.0% 0.57
9 2.20 12.3% 0.53
10 2.17 10.9% 0.49
11 2.15 9.8% 0.46
12 2.13 8.9% 0.44
13 2.12 8.1% 0.42
14 2.11 7.5% 0.40
15 2.10 7.0% 0.39
16 2.09 6.5% 0.37
17 2.08 6.1% 0.36
18 2.07 5.7% 0.35
19 2.07 5.4% 0.34
20 2.06 5.1% 0.33

In Appendices A.1 and A.2, we discuss inference on standard error and a multivariate version of the Cheap Bootstrap, which are immediate extensions of our developments in this section. In the next two sections, we discuss two generalizations of our approach to models corrupted by both data and computation noises (Section 4) and to subsampling (Section 5).

4 Applying Cheap Bootstrap to Nested Sampling Problems

Our Cheap Bootstrap can reduce computational cost for problems which, when applying the conventional bootstraps, require nested sampling. This phenomenon arises when the estimate involves noises coming from both the data and computation procedure. To facilitate discussion, as in Section 2, denote ψ:=ψ(P)\psi:=\psi(P) as a target parameter or quantity of interest. However, here the computation of ψ(Q)\psi(Q), for any given QQ, could be noisy. More precisely, suppose that, given any QQ, we could only generate ψ^r(Q)\hat{\psi}_{r}(Q) where ψ^r(Q)\hat{\psi}_{r}(Q) is an unbiased output for ψ(Q)\psi(Q). Then, to compute a point estimate of ψ\psi, we use the data X1,,XnX_{1},\ldots,X_{n} to construct the empirical distribution P^n\hat{P}_{n}, and with this P^n\hat{P}_{n}, we output

ψ^^n,R=1Rr=1Rψ^r(P^n)\hat{\hat{\psi}}_{n,R}=\frac{1}{R}\sum_{r=1}^{R}\hat{\psi}_{r}(\hat{P}_{n}) (13)

where ψ^r(P^n)\hat{\psi}_{r}(\hat{P}_{n}), r=1,,Rr=1,\ldots,R is a sequence of unbiased runs for ψ(P^n)\psi(\hat{P}_{n}), independent given P^n\hat{P}_{n}. The “double hat” above ψ\psi in the left hand side of (13) signifies the two sources of noises, one from the estimation of PP and one from the estimation of ψ()\psi(\cdot), in the resulting overall point estimate.

The above construction that requires generating and averaging multiple noisy outputs of ψ^r\hat{\psi}_{r} arises in the following examples:

  1.

    Input uncertainty quantification in simulation modeling: Here ψ(P)\psi(P) denotes the output performance measure of a stochastic simulation model, where PP is the distribution of input random variates fed into the simulation logic. For instance, ψ(P)\psi(P) could be the expected workload of a queueing network, and PP denotes the inter-arrival time distribution. Estimating the performance measure ψ(P)\psi(P), even with a known input distribution PP, would require running the simulation model many times (i.e., RR times) and taking their average. When PP is observed from exogenous data, we use a plug-in estimate P^n\hat{P}_{n} to drive the input variates in the simulation model (Henderson (2003); Song et al. (2014); Barton (2012); Lam (2016)). This amounts to the point estimate (13).

  2.

    Bagging: Bagging predictors are constructed by averaging a large number of base predictors, where each base predictor is obtained by re-training the model on a new resample of the data (Breiman (1996); Bühlmann and Yu (2002)). Here, ψ(P)\psi(P) denotes the target predicted value (at some tested point) of the bagging predictor, and PP is the data distribution. The quantity ψ^r(P^n)\hat{\psi}_{r}(\hat{P}_{n}) denotes the predicted value of a model trained on a resample from P^n\hat{P}_{n}.

  3.

    Deep ensemble: Deep ensembles are predictors constructed by averaging several neural networks, each trained from the same data but using a different randomization seed (Lakshminarayanan et al. (2016); Lee et al. (2015)). The seed controls, for instance, the initialization of a stochastic gradient descent. Like bagging, ψ(P)\psi(P) here denotes the target predicted value (at some tested point) where PP is the data distribution. However, the quantity ψ^r(P^n)\hat{\psi}_{r}(\hat{P}_{n}) refers to the output of a trained individual neural network.

Applying the bootstrap to assess the uncertainty of (13) requires a nested sampling that involves two layers: First, we resample the data BB times to obtain Pnb,b=1,,BP_{n}^{*b},b=1,\ldots,B. Then, given each resample PnbP_{n}^{*b}, we run our computation procedure RR times to obtain ψ^r(Pnb),r=1,,R\hat{\psi}_{r}(P_{n}^{*b}),r=1,\ldots,R. The quantities

ψn,Rb=1Rr=1Rψ^r(Pnb),b=1,,B\psi_{n,R}^{**b}=\frac{1}{R}\sum_{r=1}^{R}\hat{\psi}_{r}(P_{n}^{*b}),\ \ b=1,\ldots,B (14)

then serve as the resample estimates whose statistic (e.g., quantiles or standard deviation) can be used to obtain a confidence interval for the target quantity ψ\psi. The computation effort in generating these intervals is the product between the outer number of resamples and the inner number of computation runs, namely BRBR. The number BB needs to be sufficiently large as required by conventional bootstraps. The number RR, depending on the context, may also need to be large. For instance, the simulation modeling and bagging examples above require a large enough RR. Furthermore, not only does RR need to be large, but it could also depend (linearly) on the data size nn in order to achieve consistency, as shown in the context of simulation modeling (Lam and Qian (2021, 2018c)). In other words, the difficulty with nested sampling when running the bootstrap not only lies in the multiplicative computation effort between BB and RR, but also that their choices can depend intricately on the data size which adds to the complication.
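To fix ideas, the estimates (13) and (14) can be sketched as follows; here noisy_psi stands for one unbiased but noisy computation run $\hat{\psi}_{r}(Q)$ driven by a data set drawn from $Q$ (e.g., one simulation replication or one trained base learner), and the helper name and arguments are illustrative assumptions. With the Cheap Bootstrap, $B$ below can be taken as small as 1, so the total effort $R_{0}+BR$ is no longer multiplicatively large in $B$.

```python
# Sketch of the nested point estimate (13) and resample estimates (14).
import numpy as np

def nested_estimates(data, noisy_psi, B=1, R=50, R0=50, seed=None):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # point estimate: average R0 noisy computation runs driven by P_hat_n, eq. (13)
    psi_hathat = np.mean([noisy_psi(data, rng) for _ in range(R0)])
    # resample estimates: for each of B data resamples, average R runs, eq. (14)
    psi_star = []
    for _ in range(B):
        resample = rng.choice(data, size=n, replace=True)
        psi_star.append(np.mean([noisy_psi(resample, rng) for _ in range(R)]))
    return psi_hathat, np.array(psi_star)

# A toy unbiased-but-noisy run for psi(Q) = mean of Q: a single random draw from Q.
toy_noisy_psi = lambda d, rng: rng.choice(d)
print(nested_estimates(np.random.default_rng(3).exponential(size=100), toy_noisy_psi, B=1))
```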

Cheap Bootstrap reduces the outer number of resamples BB to a low number, a strength that we investigate in this section. To build the theoretical framework, we first derive an asymptotic on problems involving noises from both data and computation. Denote τ2(Q)=Var(ψ^r(Q))\tau^{2}(Q)=Var(\hat{\psi}_{r}(Q)) and κ3(Q)=E|ψ^r(Q)ψ(Q)|3\kappa_{3}(Q)=E|\hat{\psi}_{r}(Q)-\psi(Q)|^{3} as the variance and third-order absolute central moment of a computation run driven by a given input distribution QQ. We assume the following.

Assumption 2 (Moment consistency of computation runs).

We have τ2(P^n)pτ2(P)=:τ2\tau^{2}(\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}\tau^{2}(P)=:\tau^{2} and κ3(P^n)pκ3(P)=:κ3\kappa_{3}(\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}\kappa_{3}(P)=:\kappa_{3} as nn\to\infty, where 0<τ2<0<\tau^{2}<\infty and κ3<\kappa_{3}<\infty. Similarly, τ2(Pn)pτ2(P)=τ2\tau^{2}(P_{n}^{*})\stackrel{{\scriptstyle p}}{{\to}}\tau^{2}(P)=\tau^{2} and κ3(Pn)pκ3(P)=κ3\kappa_{3}(P_{n}^{*})\stackrel{{\scriptstyle p}}{{\to}}\kappa_{3}(P)=\kappa_{3}.

Assumption 2 can be justified if τ2()\tau^{2}(\cdot) and κ3()\kappa_{3}(\cdot) are Hadamard differentiable with non-degenerate derivatives like in Assumption 1. Specifically, we have the following:

Proposition 3.

If τ2()\tau^{2}(\cdot) and κ3()\kappa_{3}(\cdot) satisfy the assumptions on ψ()\psi(\cdot) in Proposition 2, then Assumption 2 holds.

In stating our asymptotic result below, we consider a slightly more general version of the bootstrap where the computation sizes in the original estimate and the resample estimates are allowed to be different, which gives some extra flexibility to our procedures. That is, we use ψ^^n,R0\hat{\hat{\psi}}_{n,R_{0}} as defined in (13) where R0R_{0} denotes the computation size in constructing the point estimate. Then we resample BB times and obtain ψn,Rb\psi_{n,R}^{**b} as defined in (14) where RR denotes the computation size in constructing each resample estimate. Moreover, we denote ψ^n=ψ(P^n)\hat{\psi}_{n}=\psi(\hat{P}_{n}) as the hypothetical original estimate constructed from P^n\hat{P}_{n} without any computation noise, and ψnb=ψ(Pnb)\psi_{n}^{*b}=\psi(P_{n}^{*b}) as the hypothetical bb-th resample estimate without computation noise. We have the following:

Theorem 3 (Joint asymptotic of original and resample estimates under both data and computation noises).

Suppose Assumptions 1 and 2 hold and n,R0,Rn,R_{0},R\to\infty such that R0/np0R_{0}/n\to p_{0} and R/npR/n\to p for some constants 0<p0,p<10<p_{0},p<1. We have

\sqrt{n}\left(\hat{\hat{\psi}}_{n,R_{0}}-\psi,\ \psi_{n,R}^{**1}-\hat{\hat{\psi}}_{n,R_{0}},\ \ldots,\ \psi_{n,R}^{**B}-\hat{\hat{\psi}}_{n,R_{0}}\right)\Rightarrow\left(\sigma Z_{0}+\frac{\tau}{\sqrt{p_{0}}}W_{0},\ \sigma Z_{1}+\frac{\tau}{\sqrt{p}}W_{1}-\frac{\tau}{\sqrt{p_{0}}}W_{0},\ \ldots,\ \sigma Z_{B}+\frac{\tau}{\sqrt{p}}W_{B}-\frac{\tau}{\sqrt{p_{0}}}W_{0}\right) (15)

where Z0,Z1,,ZB,W0,W1,,WBN(0,1)Z_{0},Z_{1},\ldots,Z_{B},W_{0},W_{1},\ldots,W_{B}\sim N(0,1) and are all independent.

Note that the scalings $R_{0}/n\to p_{0}$ and $R/n\to p$ in Theorem 3 are introduced only for mathematical convenience. The convergence (15) stipulates that

\left(\hat{\hat{\psi}}_{n,R_{0}}-\psi,\ \psi_{n,R}^{**1}-\hat{\hat{\psi}}_{n,R_{0}},\ \ldots,\ \psi_{n,R}^{**B}-\hat{\hat{\psi}}_{n,R_{0}}\right)\stackrel{d}{\approx}\left(\frac{\sigma}{\sqrt{n}}Z_{0}+\frac{\tau}{\sqrt{R_{0}}}W_{0},\ \frac{\sigma}{\sqrt{n}}Z_{1}+\frac{\tau}{\sqrt{R}}W_{1}-\frac{\tau}{\sqrt{R_{0}}}W_{0},\ \ldots,\ \frac{\sigma}{\sqrt{n}}Z_{B}+\frac{\tau}{\sqrt{R}}W_{B}-\frac{\tau}{\sqrt{R_{0}}}W_{0}\right)

It implies, in particular, that the standard error of ψ^^n,R0\hat{\hat{\psi}}_{n,R_{0}} is given by σ2/n+τ2/R0\sqrt{\sigma^{2}/n+\tau^{2}/R_{0}} which consists of two parts: σ2/n\sigma^{2}/n captures the variability from the data and τ2/R0\tau^{2}/R_{0} from the computation.

Theorem 3 is a generalization of Proposition 1. The random variables ZbZ_{b}’s and WbW_{b}’s in the limit each signifies a source of noise coming from either data or computation. More specifically, Z0Z_{0} comes from the original data, Zb,b=1,,BZ_{b},b=1,\ldots,B from each resample, W0W_{0} from the computation of the original estimate, and Wb,b=1,,BW_{b},b=1,\ldots,B from the computation of each resample. Unlike (5), in (15) the (centered) resample estimate ψn,Rbψ^^n,R0\psi_{n,R}^{**b}-\hat{\hat{\psi}}_{n,R_{0}} and original estimate ψ^^n,R0ψ\hat{\hat{\psi}}_{n,R_{0}}-\psi are no longer asymptotically independent, because they share the same source of noise W0W_{0}. This causes modifications to our Cheap Bootstrap in canceling out the nuisance parameter.

We will describe two approaches to construct Cheap Bootstrap confidence intervals for ψ\psi under the conditions in Theorem 3, both resulting in the form

[ψ^^n,R0q,1α/2S,ψ^^n,R0+q,1α/2S]\left[\hat{\hat{\psi}}_{n,R_{0}}-q_{\cdot,1-\alpha/2}S_{\cdot},\ \hat{\hat{\psi}}_{n,R_{0}}+q_{\cdot,1-\alpha/2}S_{\cdot}\right] (16)

where SS_{\cdot}, like the SS in (1), is a standard error estimate of ψ^^n,R0\hat{\hat{\psi}}_{n,R_{0}}. These two approaches differ by how we pool together the resample estimates in SS_{\cdot}, where the \cdot therein denotes the approach. Correspondingly, the critical value q,1α/2q_{\cdot,1-\alpha/2} is obtained from the (1α/2)(1-\alpha/2)-quantile of a distribution pertinent to the pivotal statistic using SS_{\cdot}. Both approaches we propose achieve asymptotic validity with very few resamples.

4.1 Centered at Original Estimate

In the first approach, we use in (16)

SO2=1Bb=1B(ψn,Rbψ^^n,R0)2S_{O}^{2}=\frac{1}{B}\sum_{b=1}^{B}(\psi_{n,R}^{**b}-\hat{\hat{\psi}}_{n,R_{0}})^{2} (17)

That is, SO2{S_{O}}^{2} acts as the sample variance of the bootstrap estimates, where the center of the squares is set to be ψ^^n,R0\hat{\hat{\psi}}_{n,R_{0}}. We choose qO,1α/2q_{O,1-\alpha/2} to be

qO,1α/2=min{q:minθ0F(q;θ,ρ)1α2}q_{O,1-\alpha/2}=\min\left\{q:\min_{\theta\geq 0}F(q;\theta,\rho)\geq 1-\frac{\alpha}{2}\right\} (18)

where F(q;θ,ρ)F(q;\theta,\rho) is the distribution function of

θV1+V2θ2+ρ2B(Y+V32)2θ2+ρ2BV3V2+V22\frac{\theta V_{1}+V_{2}}{\sqrt{\frac{\theta^{2}+\rho^{2}}{B}\left(Y+V_{3}^{2}\right)-2\sqrt{\frac{\theta^{2}+\rho^{2}}{B}}V_{3}V_{2}+V_{2}^{2}}} (19)

with V1,V2,V3i.i.d.N(0,1)V_{1},V_{2},V_{3}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1) and YχB12Y\sim\chi^{2}_{B-1} being all independent, and ρ=p0/p\rho=\sqrt{p_{0}/p} is the square-rooted ratio between the computation size used in the original estimate and each resample estimate. When B=1B=1, we set Y=0Y=0 in (19) so that the expression is equivalent to

θV1+V2|θ2+ρ2V3V2|\frac{\theta V_{1}+V_{2}}{|\sqrt{\theta^{2}+\rho^{2}}V_{3}-V_{2}|} (20)

We call the resulting interval

O=[ψ^^n,R0qO,1α/2SO,ψ^^n,R0+qO,1α/2SO]\mathcal{I}_{O}=\left[\hat{\hat{\psi}}_{n,R_{0}}-q_{O,1-\alpha/2}S_{O},\ \hat{\hat{\psi}}_{n,R_{0}}+q_{O,1-\alpha/2}S_{O}\right] (21)

the Cheap Bootstrap interval “centered at original estimate”.

Note that the distribution function F(q;θ,ρ)F(q;\theta,\rho) has no closed-form expression or reduction to any common distribution where a table or built-in calculator is available. However, it can be easily simulated, for instance, by running many independent copies of (V1,V2,V3,Y)(V_{1},V_{2},V_{3},Y), say (V1j,V2j,V3j,Yj)(V_{1}^{j},V_{2}^{j},V_{3}^{j},Y^{j}) for j=1,,Nj=1,\ldots,N, and computing (via, e.g., a simple grid search)

min{q:minθ01Nj=1NI(θV1j+V2jθ2+ρ2B(Yj+V3j2)2θ2+ρ2BV3jV2j+V2j2q)1α2}\min\left\{q:\min_{\theta\geq 0}\frac{1}{N}\sum_{j=1}^{N}I\left(\frac{\theta V_{1}^{j}+V_{2}^{j}}{\sqrt{\frac{\theta^{2}+\rho^{2}}{B}\left(Y^{j}+{V_{3}^{j}}^{2}\right)-2\sqrt{\frac{\theta^{2}+\rho^{2}}{B}}V_{3}^{j}V_{2}^{j}+{V_{2}^{j}}^{2}}}\leq q\right)\geq 1-\frac{\alpha}{2}\right\} (22)

where I()I(\cdot) denotes the indicator function, to approximate qO,1α/2q_{O,1-\alpha/2}. Note that this Monte Carlo computation only involves standard normal and χ2\chi^{2} random variables, not the noisy model evaluation ψ^r\hat{\psi}_{r} that could be expensive.
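For illustration, the following is a minimal Python sketch of this computation; the routine name, the coarse default grids and the simulation size are ours for illustration and can be refined (e.g., to the finer grid and N used in Section 6.3).

import numpy as np

def cheap_bootstrap_q_original(B, rho, alpha=0.05, N=100_000,
                               theta_grid=None, q_step=0.01, seed=0):
    # Simulate N independent copies of (V1, V2, V3, Y) appearing in (22).
    rng = np.random.default_rng(seed)
    V1, V2, V3 = rng.standard_normal((3, N))
    Y = rng.chisquare(B - 1, N) if B >= 2 else np.zeros(N)   # set Y = 0 when B = 1
    if theta_grid is None:
        theta_grid = np.arange(0.01, 100.01, 0.5)            # coarse grid over theta >= 0; refine as needed
    q_grid = np.arange(0.0, 50.0, q_step)
    worst_cdf = np.ones_like(q_grid)
    for theta in theta_grid:
        c = (theta ** 2 + rho ** 2) / B
        denom = np.sqrt(np.maximum(
            c * (Y + V3 ** 2) - 2.0 * np.sqrt(c) * V3 * V2 + V2 ** 2, 1e-12))
        stat = np.sort((theta * V1 + V2) / denom)
        ecdf = np.searchsorted(stat, q_grid, side="right") / N   # empirical F(q; theta, rho)
        worst_cdf = np.minimum(worst_cdf, ecdf)                  # minimize over theta
    feasible = worst_cdf >= 1 - alpha / 2
    return q_grid[np.argmax(feasible)] if feasible.any() else np.inf

# Example: q = cheap_bootstrap_q_original(B=2, rho=1.0) approximates the 97.5% critical value.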

We have the following guarantee:

Theorem 4 (Asymptotic validity of Cheap Bootstrap “centered at original estimate”).

Under the same assumptions and setting as in Theorem 3, the interval O\mathcal{I}_{O} in (21), where SO2S_{O}^{2} and qO,1α/2q_{O,1-\alpha/2} are defined in (17) and (18), is an asymptotically valid (1α)(1-\alpha)-level confidence interval, i.e., it satisfies

lim infnn(ψO)1α\liminf_{n\to\infty}\mathbb{P}_{n}(\psi\in\mathcal{I}_{O})\geq 1-\alpha

as nn\to\infty, where n\mathbb{P}_{n} denotes the probability with respect to the data X1,,XnX_{1},\ldots,X_{n} and all randomness from the resampling and computation.

Theorem 4 is obtained by looking at the asymptotic distribution of the (approximately) pivotal statistic (ψ^^n,R0ψ)/SO(\hat{\hat{\psi}}_{n,R_{0}}-\psi)/S_{O}. However, because of the dependence between the original and resample estimates in (15), the unknown nuisance parameters σ2\sigma^{2} and τ2\tau^{2} are not directly canceled out. To tackle this, we consider a worst-case calculation which leads to some conservativeness in the coverage guarantee, i.e., inequality instead of exact equality in Theorem 4.

4.2 Centered at Resample Mean

We consider our second approach. For B2B\geq 2, we use in (16)

SM2=1B1b=1B(ψn,Rbψn,R¯)2S_{M}^{2}=\frac{1}{B-1}\sum_{b=1}^{B}(\psi_{n,R}^{**b}-\overline{\psi_{n,R}^{**}})^{2} (23)

where ψn,R¯\overline{\psi_{n,R}^{**}} is the sample mean of ψn,Rb\psi_{n,R}^{**b}’s given by ψn,R¯=(1/B)b=1Bψn,Rb\overline{\psi_{n,R}^{**}}=(1/B)\sum_{b=1}^{B}\psi^{**b}_{n,R}. That is, we use the sample variance of ψn,Rb\psi^{**b}_{n,R}’s as our standard error estimate. Correspondingly, we set

qM,1α/2=max{ρ1,1}tB1,1α/2q_{M,1-\alpha/2}=\max\{\rho^{-1},1\}t_{B-1,1-\alpha/2} (24)

where, like in Section 4.1, ρ=p0/p\rho=\sqrt{p_{0}/p} is the square-rooted ratio between the computation sizes of the original estimate and each resample estimate. Also, recall tB1,1α/2t_{B-1,1-\alpha/2} is the (1α/2)(1-\alpha/2)-quantile of tB1t_{B-1}. We call the resulting interval

M=[ψ^^n,R0qM,1α/2SM,ψ^^n,R0+qM,1α/2SM]\mathcal{I}_{M}=\left[\hat{\hat{\psi}}_{n,R_{0}}-q_{M,1-\alpha/2}S_{M},\ \hat{\hat{\psi}}_{n,R_{0}}+q_{M,1-\alpha/2}S_{M}\right] (25)

the Cheap Bootstrap interval “centered at resample mean”.

We have the following guarantee:

Theorem 5 (Asymptotic validity of Cheap Bootstrap “centered at resample mean”).

Under the same assumptions and setting as in Theorem 3, for B2B\geq 2, the interval M\mathcal{I}_{M} in (25), where SM2S_{M}^{2} and qM,1α/2q_{M,1-\alpha/2} are defined in (23) and (24), is an asymptotically valid (1α)(1-\alpha)-level confidence interval, i.e., it satisfies

lim infnn(ψM)1α\liminf_{n\to\infty}\mathbb{P}_{n}(\psi\in\mathcal{I}_{M})\geq 1-\alpha

as nn\to\infty, where n\mathbb{P}_{n} denotes the probability with respect to the data X1,,XnX_{1},\ldots,X_{n} and all randomness from the resampling and computation. Moreover, if the computation sizes for each resample estimate and the original estimate are the same, i.e., ρ=1\rho=1, then M\mathcal{I}_{M} is asymptotically exact, i.e.,

limnn(ψM)=1α\lim_{n\to\infty}\mathbb{P}_{n}(\psi\in\mathcal{I}_{M})=1-\alpha

Compared to O\mathcal{I}_{O} in Section 4.1, M\mathcal{I}_{M} does not require Monte Carlo computation for qM,1α/2q_{M,1-\alpha/2} and is thus easier to use. Moreover, note that when the computation size used in each resample estimate is at most that used in the original estimate, i.e., ρ11\rho^{-1}\leq 1, we have qM,1α/2=tB1,1α/2q_{M,1-\alpha/2}=t_{B-1,1-\alpha/2}, so the critical value in M\mathcal{I}_{M} reduces to a standard tt-quantile, just like the interval (1) for non-nested problems. Furthermore, when these computation sizes are the same, we have asymptotically exact coverage in Theorem 5. On the other hand, M\mathcal{I}_{M} is defined for B2B\geq 2 instead of B1B\geq 1, and thus its requirement on BB is slightly more stringent than that of O\mathcal{I}_{O} and (1).
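As an illustration of the overall procedure, the following is a minimal Python sketch of constructing the interval centered at resample mean in (25); the routine estimate(data, R), which averages R noisy model evaluations under the empirical distribution of its input as in (13) and (14), is an assumed user-supplied function, and the names and defaults are ours.

import numpy as np
from scipy import stats

def cheap_bootstrap_resample_mean(data, estimate, R0, R, B=2, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = estimate(data, R0)                     # original estimate using R0 computation runs
    psi_star = np.array([estimate(data[rng.integers(0, n, n)], R)   # B resample estimates, R runs each
                         for _ in range(B)])
    S_M = np.std(psi_star, ddof=1)                   # sample standard deviation around the resample mean, (23)
    rho = np.sqrt(R0 / R)                            # square-rooted ratio of computation sizes
    q = max(1.0 / rho, 1.0) * stats.t.ppf(1 - alpha / 2, df=B - 1)   # critical value (24)
    return psi_hat - q * S_M, psi_hat + q * S_M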

5 Cheap Subsampling

We integrate Cheap Bootstrap into, in broad terms, subsampling methods in which the bootstrap resample has a smaller size than the original full data size. Subsampling is motivated by the lack of consistency when applying standard bootstraps to non-smooth problems (Politis et al. (1999); Bickel et al. (1997)), but it can also be used to address the computational challenge of repeatedly fitting models over large data sets (Kleiner et al. (2012)). The latter arises because standard sampling with replacement on the original data retains around 63%63\% of the observation values, so for many problems each bootstrap resample estimate requires roughly the same order of computation as the original estimate constructed from the raw data.

Here we consider three subsampling variants: mm-out-of-nn Bootstrap (Bickel et al. (1997)), and the more recent Bag of Little Bootstraps (Kleiner et al. (2014)) and Subsampled Double Bootstrap (Sengupta et al. (2016)). The validity of these methods relies on generalizations of Assumption 1 (though not always, as explained at the end of this section), in which the distribution of n(ψ^nψ)\sqrt{n}(\hat{\psi}_{n}-\psi) can be approximated by that of N(ΨΨ)\sqrt{N}(\Psi^{*}-\Psi), where Ψ\Psi^{*} and/or Ψ\Psi are some estimators obtained using a small resample data size or number of distinct observed values, and NN is a suitable scaling parameter. More precisely, suppose we have N(ΨΨ)N(0,σ2)\sqrt{N}(\Psi^{*}-\Psi)\Rightarrow N(0,\sigma^{2}) in probability conditional on the data X1,,XnX_{1},\ldots,X_{n} and possibly independent randomness (the independent randomness is sometimes used to help determine Ψ\Psi). Then, to construct a (1α)(1-\alpha)-level confidence interval for ψ\psi, we can simulate the α/2\alpha/2 and (1α/2)(1-\alpha/2)-quantiles of ΨΨ\Psi^{*}-\Psi, say Qα/2Q_{\alpha/2} and Q1α/2Q_{1-\alpha/2}, and then output

[ψ^nNnQ1α/2,ψ^nNnQα/2]\left[\hat{\psi}_{n}-\sqrt{\frac{N}{n}}Q_{1-\alpha/2},\ \hat{\psi}_{n}-\sqrt{\frac{N}{n}}Q_{\alpha/2}\right]

Like in the standard bootstrap, such an approach requires many Monte Carlo replications.

We describe how to use subsampling in conjunction with Cheap Bootstrap to devise bootstrap schemes that are small simultaneously in the number of resamples and in the resample data size (or the number of distinct observed values in use). Following the ideas in Sections 2 and 3, our Cheap Subsampling uses the following unified framework: Generate Ψb\Psi^{*b} and Ψb\Psi^{b} given X1,,XnX_{1},\ldots,X_{n} and any required additional randomness, for b=1,,Bb=1,\ldots,B. Then the (1α)(1-\alpha)-level confidence interval is given by

[ψ^ntB,1α/2NnS,ψ^n+tB,1α/2NnS]\left[\hat{\psi}_{n}-t_{B,1-\alpha/2}\sqrt{\frac{N}{n}}S,\ \hat{\psi}_{n}+t_{B,1-\alpha/2}\sqrt{\frac{N}{n}}S\right] (26)

where

S2=1Bb=1B(ΨbΨb)2S^{2}=\frac{1}{B}\sum_{b=1}^{B}\left(\Psi^{*b}-\Psi^{b}\right)^{2}

Compared to (1), the main difference in (26) is the adjusting factor N/n\sqrt{N/n} in the standard error.

Below we describe the application of the above framework to three subsampling variants. As in Section 2, suppose the target parameter is ψ=ψ(P)\psi=\psi(P) and we have obtained the point estimate ψ^n=ψ(P^n)\hat{\psi}_{n}=\psi(\hat{P}_{n}).

5.1 Cheap mm-out-of-nn Bootstrap

We compute BB subsample estimates ψsb:=ψ(Psb)\psi_{s}^{*b}:=\psi(P_{s}^{*b}), b=1,,Bb=1,\ldots,B, where PsbP_{s}^{*b} is the empirical distribution constructed from a subsample of size s<ns<n drawn from the raw data {X1,,Xn}\{X_{1},\ldots,X_{n}\} via sampling with replacement. Then output the interval

[ψ^ntB,1α/2snS,ψ^n+tB,1α/2snS]\left[\hat{\psi}_{n}-t_{B,1-\alpha/2}\sqrt{\frac{s}{n}}S,\ \hat{\psi}_{n}+t_{B,1-\alpha/2}\sqrt{\frac{s}{n}}S\right] (27)

where S2=1Bb=1B(ψsbψ^n)2S^{2}=\frac{1}{B}\sum_{b=1}^{B}\left(\psi_{s}^{*b}-\hat{\psi}_{n}\right)^{2}.

Cast in our framework above, here we take Ψb\Psi^{b} to be always the original estimate ψ^n\hat{\psi}_{n}, Ψb\Psi^{*b} to be a subsample estimate ψsb\psi_{s}^{*b}, and the scale parameter NN to be the subsample size ss.
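A minimal Python sketch of this procedure is as follows; psi denotes an assumed user-supplied estimation routine mapping a data array to the point estimate, and the function name and defaults are ours for illustration.

import numpy as np
from scipy import stats

def cheap_m_out_of_n(data, psi, s, B=1, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = psi(data)                                       # original full-data estimate
    psi_sub = np.array([psi(data[rng.integers(0, n, s)])      # size-s resamples drawn with replacement
                        for _ in range(B)])
    S = np.sqrt(np.mean((psi_sub - psi_hat) ** 2))            # S^2 centered at the original estimate
    half_width = stats.t.ppf(1 - alpha / 2, df=B) * np.sqrt(s / n) * S   # interval (27)
    return psi_hat - half_width, psi_hat + half_width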

5.2 Cheap Bag of Little Bootstraps

We first construct ψs:=ψ(Ps)\psi_{s}^{*}:=\psi(P_{s}^{*}), where PsP_{s}^{*} is the empirical distribution of a subsample of size s<ns<n, called 𝐗s\mathbf{X}_{s}^{*}, drawn from the raw data {X1,,Xn}\{X_{1},\ldots,X_{n}\} via sampling without replacement (i.e., 𝐗s\mathbf{X}_{s}^{*} is an ss-subset of {X1,,Xn}\{X_{1},\ldots,X_{n}\}). The subsample 𝐗s\mathbf{X}_{s}^{*}, once drawn, is fixed throughout. Given 𝐗s\mathbf{X}_{s}^{*}, we generate ψnb:=ψ(Pnb)\psi^{**b}_{n}:=\psi(P_{n}^{**b}), b=1,,Bb=1,\ldots,B, where PnbP_{n}^{**b} is the empirical distribution constructed from sampling with replacement from the subsample 𝐗s\mathbf{X}_{s}^{*} for nn times, with nn being the original full data size (i.e., PnbP_{n}^{**b} is a weighted empirical distribution over 𝐗s\mathbf{X}_{s}^{*}). We output the interval in (1), where now

S2=1Bb=1B(ψnbψs)2S^{2}=\frac{1}{B}\sum_{b=1}^{B}\left(\psi_{n}^{**b}-\psi_{s}^{*}\right)^{2} (28)

Cast in our framework, here we take Ψb\Psi^{b} to be always ψs\psi_{s}^{*}, Ψb\Psi^{*b} to be the double resample estimate ψnb\psi_{n}^{**b}, and the scale parameter NN to be the original data size nn. The subsampling used to obtain ψs\psi_{s}^{*} introduces an additional randomness that is conditioned upon in the conditional weak convergence N(ΨΨ)N(0,σ2)\sqrt{N}(\Psi^{*}-\Psi)\Rightarrow N(0,\sigma^{2}) described before. Note that when ss is small, the double resample estimate ψnb\psi_{n}^{**b}, though constructed with a full size data, uses only a small number of distinct data points which, in problems such as MM-estimation, involves only a weighted estimation on the small-size subsample (Kleiner et al. (2014)).
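The following minimal Python sketch illustrates this scheme; psi is the plain estimator as before, while psi_w(points, weights) is an assumed weighted version that evaluates the estimate under a weighted empirical distribution on the subsample points, so the size-n double resamples are represented only through multinomial weights.

import numpy as np
from scipy import stats

def cheap_blb(data, psi, psi_w, s, B=1, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = psi(data)                                        # original full-data estimate
    sub = data[rng.choice(n, size=s, replace=False)]           # one fixed subsample, drawn without replacement
    psi_s = psi(sub)                                           # subsample estimate
    counts = rng.multinomial(n, np.full(s, 1.0 / s), size=B)   # size-n double resamples as multinomial counts
    psi_dstar = np.array([psi_w(sub, c / n) for c in counts])  # weighted double resample estimates
    S = np.sqrt(np.mean((psi_dstar - psi_s) ** 2))             # S^2 as in (28)
    half_width = stats.t.ppf(1 - alpha / 2, df=B) * S          # interval (1); N = n so the adjusting factor is 1
    return psi_hat - half_width, psi_hat + half_width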

5.3 Cheap Subsampled Double Bootstrap

For b=1,,Bb=1,\ldots,B, we do the following: First, generate a subsample estimate ψsb:=ψ(Psb)\psi_{s}^{*b}:=\psi(P_{s}^{*b}) where PsbP_{s}^{*b} is the empirical distribution of a subsample of size s<ns<n, called 𝐗sb\mathbf{X}_{s}^{*b}, drawn from the raw data {X1,,Xn}\{X_{1},\ldots,X_{n}\} via sampling without replacement (i.e., 𝐗sb\mathbf{X}_{s}^{*b} is an ss-subset of {X1,,Xn}\{X_{1},\ldots,X_{n}\}). Then, given 𝐗sb\mathbf{X}_{s}^{*b}, we construct ψnb=ψ(Pnb)\psi_{n}^{**b}=\psi(P_{n}^{**b}) where PnbP_{n}^{**b} is the empirical distribution of a size-nn resample constructed by sampling with replacement from 𝐗sb\mathbf{X}_{s}^{*b}. We output the interval in (1), where now

S2=1Bb=1B(ψnbψsb)2S^{2}=\frac{1}{B}\sum_{b=1}^{B}\left(\psi_{n}^{**b}-\psi_{s}^{*b}\right)^{2} (29)

Cheap Subsampled Double Bootstrap is similar to Cheap Bag of Little Bootstraps, but the Ψb\Psi^{b} in our framework is now taken as a subsample estimate ψsb\psi_{s}^{*b} that is newly generated from a new subsample 𝐗sb\mathbf{X}_{s}^{*b} for each bb, and Ψb\Psi^{*b} is obtained by resampling with replacement from that particular 𝐗sb\mathbf{X}_{s}^{*b}.
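A minimal Python sketch follows, reusing the same assumed psi and psi_w routines as in the previous sketch; the only change is that a fresh subsample is drawn within each replication.

import numpy as np
from scipy import stats

def cheap_sdb(data, psi, psi_w, s, B=1, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = psi(data)
    diffs = []
    for _ in range(B):
        sub = data[rng.choice(n, size=s, replace=False)]       # a new subsample for each replication
        counts = rng.multinomial(n, np.full(s, 1.0 / s))       # size-n resample from this subsample
        diffs.append(psi_w(sub, counts / n) - psi(sub))        # double resample estimate minus subsample estimate
    S = np.sqrt(np.mean(np.square(diffs)))                     # S^2 as in (29)
    half_width = stats.t.ppf(1 - alpha / 2, df=B) * S          # interval (1); N = n so the adjusting factor is 1
    return psi_hat - half_width, psi_hat + half_width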

5.4 Theory of Cheap Subsampling

All of Cheap mm-out-of-nn Bootstrap, Cheap Bag of Little Bootstraps, and Cheap Subsampled Double Bootstrap achieve asymptotic exactness for any B1B\geq 1. To state this concretely, we denote δ={fg:f,g,ρP(fg)<δ}\mathcal{F}_{\delta}=\{f-g:f,g\in\mathcal{F},\ \rho_{P}(f-g)<\delta\}, where ρP(fg):=(VarP(f(X)g(X)))1/2\rho_{P}(f-g):=(Var_{P}(f(X)-g(X)))^{1/2} is the canonical metric.

Theorem 6 (Asymptotic exactness of Cheap Subsampling).

Suppose the assumptions in Proposition 2 hold, and in addition δ\mathcal{F}_{\delta} is measurable for every δ>0\delta>0. Then the intervals produced by Cheap mm-out-of-nn Bootstrap (i.e., (27)), Cheap Bag of Little Bootstraps (i.e., (1) using (28)), and Cheap Subsampled Double Bootstrap (i.e., (1) using (29)) are all asymptotically exact for ψ\psi as n,sn,s\to\infty with sns\leq n.

The proof of Theorem 6 follows a roadmap similar to that of Theorem 1, together with the subsampling analogs of Assumption 1, which hold under the additional technical measurability condition on δ\mathcal{F}_{\delta} and have been used to justify the original versions of these subsampling methods.

We make a couple of remarks. First, the original Bag of Little Bootstraps suggests using several, or even a growing number of, different subsample estimates, instead of a fixed subsample as we have presented earlier. Then from each subsample, an adequate number of resamples are drawn to obtain a quantile that informs an interval limit. These quantile estimates from different subsamples are averaged to obtain the final interval limit. Note that this suggestion requires a multiplicative amount of computation effort, and is motivated by better convergence rates (Kleiner et al. (2014) Theorems 2 and 3). Here, we simply use one subsample as we focus on the case of a small Monte Carlo replication budget, but the modification to include multiple subsamples is feasible, as evidenced by the validity of Cheap Subsampled Double Bootstrap which essentially uses a different subsample in each bootstrap replication. Second, in addition to allowing a smaller data size in model refitting, the mm-out-of-nn Bootstrap, and also the closely related subsampling without replacement or so-called (nm)\binom{n}{m} sampling, as well as sample splitting (Politis et al. (1999); Bickel et al. (1997)), are all motivated as remedies to handle non-smooth functions where standard bootstraps could fail. Instead of the conditional weak convergence that we utilize in this work, these approaches are shown to provide consistent estimates based on symmetric statistics (Politis and Romano (1994)).

6 Numerical Results

We test the numerical performance of our Cheap Bootstrap and compare it with baseline bootstrap approaches. Specifically, we consider elementary variance and correlation estimation (Section 6.1), regression (Section 6.2), simulation input uncertainty quantification (Section 6.3) and deep ensemble prediction (Section 6.4). Due to space limitations, we show other examples in Appendix E. Among these examples, Cheap Subsampling is also investigated in Section 6.2, and nested sampling issues arise in Sections 6.3 and 6.4.

6.1 Elementary Examples

We consider four setups. In the first two setups, we estimate the variance of a distribution using the sample variance. We consider a folded standard normal (i.e., |N(0,1)||N(0,1)|) and double exponential with rate 1 (i.e., Sgn×Exp(1)Sgn\times Exp(1) where Sgn=+1Sgn=+1 or 1-1 with equal probability and is independent of Exp(1)Exp(1)) as the distribution. In the last two setups, we estimate the correlation using the sample correlation. We consider bivariate normal with mean zero, unit variance and correlation 0.50.5, and bivariate lognormal (i.e., (eZ1,eZ2)(e^{Z_{1}},e^{Z_{2}}) where (Z1,Z2)(Z_{1},Z_{2}) is the bivariate normal just described). We use a sample size n=1000n=1000 in all cases. We set the confidence level 1α=95%1-\alpha=95\%. The ground truths of all the setups are known, namely 12/π1-2/\pi, 22, 0.50.5 and (e3/2e)/(e2e)(e^{3/2}-e)/(e^{2}-e) respectively (these examples are also used in, e.g., Schenker (1985); DiCiccio et al. (1992); Lee and Young (1995)).
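For concreteness, the following minimal Python sketch forms a single Cheap Bootstrap interval, as in (1), for the first setup (variance of the folded standard normal); the seed and the choice B=3 are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, B, alpha = 1000, 3, 0.05
data = np.abs(rng.standard_normal(n))                          # folded standard normal sample
psi_hat = np.var(data, ddof=1)                                 # sample variance point estimate
psi_star = np.array([np.var(data[rng.integers(0, n, n)], ddof=1)   # B resample estimates
                     for _ in range(B)])
S = np.sqrt(np.mean((psi_star - psi_hat) ** 2))
half_width = stats.t.ppf(1 - alpha / 2, df=B) * S
print(psi_hat - half_width, psi_hat + half_width)              # ground truth variance is 1 - 2/pi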

We run Cheap Bootstrap using a small number of resamples B=1,2,3,4,5,10B=1,2,3,4,5,10, and compare with the Basic Bootstrap, Percentile Bootstrap and Standard Error Bootstrap. For each setting, we repeat the experiments 10001000 times and report the empirical coverage and interval width mean and standard deviation. Table 3 shows the performances for variance and correlation estimation, which are similar to those in Table 1 in Section 2.2. Cheap Bootstrap gives rise to accurate coverage (91%96%91\%-96\%) in all considered cases, including when BB is as low as 11. On the other hand, all baseline methods fail to generate two-sided intervals when B=1B=1, and encounter significant under-coverage when B=2B=2 to 55 (e.g., as low as 28%28\% for B=2B=2 and 58%58\% when B=5B=5). When B=10B=10, these baseline methods start to catch up, with Standard Error Bootstrap rising to a coverage level close to 90%90\%. These observations coincide with our theory of Cheap Bootstrap presented in Sections 2 and 3, and also with the fact that the baseline methods are all designed to work with a large BB.

Regarding the interval width, its mean and standard deviation for Cheap Bootstrap are initially large at B=1B=1, signifying a price in statistical efficiency when the computation budget is very small. These numbers drop quickly as BB increases from 11 to 22 and continue to drop at a decreasing rate as BB increases further. In contrast, all baseline methods have lower width means and standard deviations than our Cheap Bootstrap, with an increasing trend for the mean as BB increases while the standard deviation remains roughly constant for each method. When B=10B=10, the Cheap Bootstrap interval has width mean and standard deviation generally comparable to the baselines, though still larger. Note that while the baseline methods generate shorter intervals, they have significant under-coverage in the considered range of BB.

Table 3: Interval performances with different bootstrap methods: Cheap, Basic, Percentile and Standard Error Bootstrap, for variance and correlation estimation, at nominal confidence level 95%95\% and with sample size n=1000n=1000.
Method, BB | Variance of Folded normal | Variance of Double exponential | Correlation of Bivariate normal | Correlation of Bivariate lognormal
Each column block reports Coverage (margin of error) followed by Width mean (st. dev.)
Cheap 1 0.95 (0.01) 0.38 (0.29) 0.94 (0.01) 2.84 (2.27) 0.93 (0.02) 0.47 (0.37) 0.95 (0.01) 1.03 (0.83)
Basic 1 NA NA NA NA NA NA NA NA
Per. 1 NA NA NA NA NA NA NA NA
S.E. 1 NA NA NA NA NA NA NA NA
Cheap 2 0.95 (0.01) 0.15 (0.08) 0.94 (0.01) 1.10 (0.60) 0.95 (0.01) 0.18 (0.10) 0.94 (0.01) 0.38 (0.25)
Basic 2 0.30 (0.03) 0.02 (0.02) 0.31 (0.03) 0.16 (0.13) 0.32 (0.03) 0.03 (0.02) 0.28 (0.03) 0.06 (0.05)
Per. 2 0.33 (0.03) 0.02 (0.02) 0.32 (0.03) 0.16 (0.13) 0.32 (0.03) 0.03 (0.02) 0.32 (0.03) 0.06 (0.05)
S.E. 2 0.68 (0.03) 0.06 (0.05) 0.69 (0.03) 0.45 (0.35) 0.69 (0.03) 0.07 (0.06) 0.65 (0.03) 0.15 (0.14)
Cheap 3 0.95 (0.01) 0.11 (0.05) 0.94 (0.02) 0.81 (0.37) 0.95 (0.01) 0.14 (0.06) 0.94 (0.01) 0.29 (0.15)
Basic 3 0.50 (0.03) 0.03 (0.02) 0.48 (0.03) 0.23 (0.13) 0.48 (0.03) 0.04 (0.02) 0.46 (0.03) 0.08 (0.05)
Per. 3 0.50 (0.03) 0.03 (0.02) 0.48 (0.03) 0.23 (0.13) 0.52 (0.03) 0.04 (0.02) 0.46 (0.03) 0.08 (0.05)
S.E. 3 0.81 (0.02) 0.07 (0.04) 0.78 (0.03) 0.48 (0.27) 0.81 (0.02) 0.08 (0.04) 0.80 (0.02) 0.17 (0.11)
Cheap 4 0.96 (0.01) 0.10 (0.04) 0.96 (0.01) 0.73 (0.28) 0.94 (0.01) 0.12 (0.05) 0.94 (0.02) 0.26 (0.12)
Basic 4 0.62 (0.03) 0.04 (0.02) 0.62 (0.03) 0.29 (0.13) 0.61 (0.03) 0.05 (0.02) 0.54 (0.03) 0.10 (0.05)
Per. 4 0.62 (0.03) 0.04 (0.02) 0.60 (0.03) 0.29 (0.13) 0.61 (0.03) 0.05 (0.02) 0.59 (0.03) 0.10 (0.05)
S.E. 4 0.86 (0.02) 0.07 (0.03) 0.86 (0.02) 0.50 (0.22) 0.86 (0.02) 0.09 (0.04) 0.81 (0.02) 0.17 (0.09)
Cheap 5 0.95 (0.01) 0.10 (0.03) 0.95 (0.01) 0.68 (0.24) 0.94 (0.01) 0.12 (0.04) 0.91 (0.02) 0.25 (0.12)
Basic 5 0.67 (0.03) 0.05 (0.02) 0.67 (0.03) 0.32 (0.13) 0.66 (0.03) 0.06 (0.02) 0.58 (0.03) 0.12 (0.06)
Per. 5 0.66 (0.03) 0.05 (0.02) 0.65 (0.03) 0.32 (0.13) 0.65 (0.03) 0.06 (0.02) 0.61 (0.03) 0.12 (0.06)
S.E. 5 0.87 (0.02) 0.07 (0.03) 0.86 (0.02) 0.51 (0.20) 0.86 (0.02) 0.09 (0.03) 0.83 (0.02) 0.19 (0.10)
Cheap 10 0.93 (0.02) 0.08 (0.02) 0.94 (0.01) 0.62 (0.17) 0.94 (0.01) 0.10 (0.02) 0.91 (0.02) 0.21 (0.09)
Basic 10 0.79 (0.03) 0.06 (0.02) 0.80 (0.02) 0.43 (0.13) 0.83 (0.02) 0.07 (0.02) 0.73 (0.03) 0.15 (0.06)
Per. 10 0.79 (0.03) 0.06 (0.02) 0.80 (0.02) 0.43 (0.13) 0.80 (0.02) 0.07 (0.02) 0.78 (0.03) 0.15 (0.06)
S.E. 10 0.90 (0.02) 0.07 (0.02) 0.91 (0.02) 0.54 (0.15) 0.92 (0.02) 0.09 (0.02) 0.86 (0.02) 0.19 (0.08)

6.2 Regression Problems

We apply our Cheap Bootstrap and compare it with standard approaches on a linear regression. The example is adapted from Sengupta et al. (2016) §4 and Kleiner et al. (2014) §4. We fit a model Y=β1X1++βdXd+ϵY=\beta_{1}X_{1}+\cdots+\beta_{d}X_{d}+\epsilon, where we set dimension d=100d=100 and use data (X1,i,,Xd,i,Yi)(X_{1,i},\ldots,X_{d,i},Y_{i}) of size n=105n=10^{5} to fit the model and estimate the coefficients βj\beta_{j}’s. The ground truth is set as Xjt3X_{j}\sim t_{3} for j=1,,dj=1,\ldots,d, ϵN(0,10)\epsilon\sim N(0,10), and βj=1\beta_{j}=1 for j=1,,dj=1,\ldots,d.

In addition to full-size Cheap Bootstrap, we run the three variants of Cheap Subsampling, namely Cheap mm-out-of-nn Bootstrap, Cheap Bag of Little Bootstraps (BLB), and Cheap Subsampled Double Bootstrap (SDB). Moreover, we compare each cheap bootstrap method with its standard quantile-based counterpart (i.e., the basic bootstrap described in Section 2.1 and the subsample counterparts at the beginning of Section 5, with one fixed subsample in the Bag of Little Bootstraps case) to compute 95%95\% confidence intervals for the first coefficient β1\beta_{1}. For the subsampling methods, we use subsample size n0.6=1000n^{0.6}=1000, which is also used in Sengupta et al. (2016) and Kleiner et al. (2014). We use in total 5050 resamples for each method. We repeat the experiments 10001000 times, each time regenerating a new synthetic data set. For each number of resamples from 1 to 5050, we compute the empirical coverage probability and the mean and standard deviation of the confidence interval width for the first coefficient.
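For illustration, a minimal Python sketch of the synthetic data and of the estimator used here (an ordinary least squares fit returning the first coefficient) is given below; it can be plugged into the full-size Cheap Bootstrap or the subsampling sketches of Section 5, and we read N(0,10) as having variance 10.

import numpy as np

def psi_first_coef(data):
    X, y = data[:, :-1], data[:, -1]                  # last column stores the response
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # ordinary least squares fit
    return beta[0]                                    # estimate of the first coefficient

rng = np.random.default_rng(0)
n, d = 10 ** 5, 100
X = rng.standard_t(3, size=(n, d))                    # t_3 covariates
y = X.sum(axis=1) + rng.normal(0.0, np.sqrt(10.0), size=n)   # all coefficients equal 1; noise variance 10
data = np.column_stack([X, y])
# e.g., cheap_m_out_of_n(data, psi_first_coef, s=1000, B=5) using the Section 5.1 sketch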

Figure 1: Confidence interval coverage probabilities of Standard versus Cheap Bootstrap methods in linear regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the coverage probability estimates from 1000 experimental repetitions.
Figure 2: Mean confidence interval widths of Standard versus Cheap Bootstrap methods in linear regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the mean width estimates from 1000 experimental repetitions.
Figure 3: Standard deviations of confidence interval widths of Standard versus Cheap Bootstrap methods in linear regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the standard deviation estimates from 1000 experimental repetitions.

Figure 1 shows the empirical coverage probability against the number of resamples from the 10001000 experimental repetitions. In the first graph, we see that Standard Bootstrap falls short of the nominal coverage, severely so when BB is small, and gradually improves as BB increases. For example, the coverage probability at B=5B=5 is 70%70\% and at B=10B=10 is 86%86\%. On the other hand, Cheap Bootstrap gives coverage close to the nominal level starting from the first replication (94%94\%), and the coverage stays between 94%94\% and 97%97\% throughout. The comparison is similar for subsampling methods, where with small BB Standard mm-out-of-nn, Bag of Little Bootstraps and Subsampled Double Bootstrap all under-cover and gradually approach the nominal level as BB increases. For a more specific comparison, at B=5B=5 for instance, Standard mm-out-of-nn, Bag of Little Bootstraps and Subsampled Double Bootstrap have coverages of only 73%73\%, 69%69\% and 68%68\% respectively, while the Cheap counterparts attain 98%98\%, 95%95\% and 96%96\%. We note that the “sudden drop” in the coverage probability for all Standard bootstrap methods at the 4141-st resample in Figure 1 (as well as all the following figures) is due to the discontinuity in the empirical quantile: the quantile is calculated by inverting the empirical distribution, and at this number of resamples the inverse switches from outputting the largest or smallest resample estimate to the second largest or smallest.

Figure 2 shows the mean interval width against the number of resamples (note that the confidence intervals for the mean width estimates are very narrow, meaning that the estimates are accurate, which makes the shaded regions in the figure difficult to visualize). These plots show that the mean interval widths of Cheap bootstrap methods are initially large when only one replication is used, then decrease sharply as the number of resamples increases, and continue to decrease at a slower rate thereafter. Figure 3 shows the standard deviation of the interval width against the number of resamples, which follows a trend similar to that of the mean in Figure 2. More specifically, the interval width mean and standard deviation of Cheap Bootstrap are 0.360.36 and 0.280.28 at B=1B=1, drop sharply to 0.110.11 and 0.050.05 at B=3B=3, further to 0.090.09 and 0.030.03 at B=5B=5, and 0.080.08 and 0.020.02 at B=10B=10, with negligible reduction afterwards. These trends match our theory regarding the interval width pattern in Section 3.3. On the other hand, the interval width means of Standard methods are initially small and increase with the number of resamples, while the standard deviations are roughly constant. Note that at B=1B=1 it is not possible to generate confidence intervals using Standard bootstrap methods, and the interval width mean and standard deviation are undefined. Moreover, even though the interval widths in Standard bootstrap methods appear generally lower, these intervals can substantially under-cover the truth when BB is small.

We present another example on logistic regression in Appendix E.1 which shows similar experimental observations discussed above.

6.3 Input Uncertainty Quantification in Simulation Modeling

We present an example on the input uncertainty quantification problem in simulation modeling discussed in Section 4. Our target quantity of interest is the expected average waiting time of the first 10 customers in a single-server queueing system, where the interarrival times and service times are i.i.d. with ground truth distributions being exponential with rates 11 and 1.11.1 respectively. The queue starts empty and the first customer arrives immediately. This system is amenable to discrete-event simulation and has been commonly used in existing works on input uncertainty quantification (e.g., Barton et al. (2014); Song and Nelson (2015); Zouaoui and Wilson (2004)). Suppose we do not know the interarrival time distribution but instead have external data of size n=100n=100. To compute a point estimate of the expected waiting time, we use the empirical distribution constructed from the 100 observations, denoted P^n\hat{P}_{n}, as an approximation to the interarrival time distribution that is used to drive simulation runs. We conduct R0=50R_{0}=50 unbiased simulation runs, each giving output ψ^r(P^n)\hat{\psi}_{r}(\hat{P}_{n}), and then average the outputs of these runs to get (1/R0)r=1R0ψ^r(P^n)(1/R_{0})\sum_{r=1}^{R_{0}}\hat{\psi}_{r}(\hat{P}_{n}).
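For concreteness, the following is a minimal Python sketch, based on an assumed Lindley-recursion implementation of the queue, of one unbiased simulation run and of the point estimate that averages R0=50 such runs driven by the empirical interarrival distribution.

import numpy as np

def waiting_time_run(interarrival_sampler, rng, n_customers=10, service_rate=1.1):
    # One unbiased run: average waiting time of the first n_customers via the Lindley recursion.
    wait, total = 0.0, 0.0                       # the first customer arrives to an empty queue and waits 0
    for _ in range(n_customers - 1):
        service = rng.exponential(1.0 / service_rate)
        gap = interarrival_sampler(rng)          # next interarrival time
        wait = max(wait + service - gap, 0.0)
        total += wait
    return total / n_customers

rng = np.random.default_rng(0)
data = rng.exponential(1.0, size=100)            # stand-in for the external interarrival data
empirical_sampler = lambda r: r.choice(data)     # draws from the empirical distribution of the data
R0 = 50
point_estimate = np.mean([waiting_time_run(empirical_sampler, rng) for _ in range(R0)])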

To obtain a 95%95\% confidence interval, we use Cheap Bootstrap centered at original estimate and centered at resample mean presented in Sections 4.1 and 4.2 respectively. More precisely, the first approach constructs the interval O\mathcal{I}_{O} using (16) with (17) and (18), while the second approach constructs the interval M\mathcal{I}_{M} using (16) with (23) and (24). In both approaches, we set the simulation size for each resample estimate to equal that for the original estimate, i.e., R=R0=50R=R_{0}=50. To construct O\mathcal{I}_{O}, we need to approximate the critical value qO,1α/2q_{O,1-\alpha/2} using (22), where we set N=100,000N=100,000 and discretize θ\theta over a grid 0.01,0.02,,1000.01,0.02,\ldots,100. Then we increase qq from tB,1α/20.5t_{B,1-\alpha/2}-0.5 and iteratively check whether

minθ01Nj=1NI(θV1j+V2jθ2+ρ2B(Yj+V3j2)2θ2+ρ2BV3jV2j+V2j2q)\min_{\theta\geq 0}\frac{1}{N}\sum_{j=1}^{N}I\left(\frac{\theta V_{1}^{j}+V_{2}^{j}}{\sqrt{\frac{\theta^{2}+\rho^{2}}{B}\left(Y^{j}+{V_{3}^{j}}^{2}\right)-2\sqrt{\frac{\theta^{2}+\rho^{2}}{B}}V_{3}^{j}V_{2}^{j}+{V_{2}^{j}}^{2}}}\leq q\right) (30)

is greater than 1α/21-\alpha/2; if not, we increase qq by a step size 0.010.01, until we reach a qq such that (30) passes 1α/21-\alpha/2. We detail the computed values of qO,1α/2q_{O,1-\alpha/2} in Appendix E.2.

Using the above, we construct O\mathcal{I}_{O} and M\mathcal{I}_{M} for BB ranging from 11 to 1010. For each BB, we repeat our experiments 10001000 times to record the empirical coverage and interval width mean and standard deviation. To calculate the empirical coverage, we run 1 million simulation runs under the true interarrival and service time distributions to obtain an accurate estimate of the ground truth value. Table 4 shows the performances of O\mathcal{I}_{O} and M\mathcal{I}_{M}, and also the comparisons with the Basic and Percentile Bootstraps that treat ψn,Rb,b=1,,B\psi_{n,R}^{**b},b=1,\ldots,B in (14) as the resample estimates. We see that the coverage of O\mathcal{I}_{O} is very close to the nominal confidence level 95%95\% at all considered BB, ranging within 95%96%95\%-96\%. The coverage of M\mathcal{I}_{M} is also close, ranging within 93%94%93\%-94\%. The interval width performances for both O\mathcal{I}_{O} and M\mathcal{I}_{M} behave similarly to the previous examples, with initially large width at B=1B=1 or B=2B=2, falling sharply when BB increases by 1 or 2, and flattening out afterwards. The Basic and Percentile Bootstraps have coverages that deviate substantially from the nominal 95%95\%, though they improve as BB increases. The inferior coverages of these conventional methods can be attributed both to the inadequate BB and to the fact that their implementation ignores the simulation noise.

Table 4: Interval performances using Cheap Bootstrap centered at original estimate O\mathcal{I}_{O}, Cheap Bootstrap centered at resample mean M\mathcal{I}_{M}, Basic Bootstrap and Percentile Bootstrap, at nominal confidence level 95%95\% for input uncertainty quantification in a queueing system simulation.
BB | Cheap bootstrap centered at original estimate | Cheap bootstrap centered at resample mean | Basic bootstrap | Percentile bootstrap
Each column block reports Coverage (margin of error) followed by Width mean (st. dev.)
1 0.96 (0.01) 6.73 (5.41) NA NA NA NA NA NA
2 0.95 (0.01) 2.55 (1.50) 0.94 (0.02) 5.84 (4.86) 0.35 (0.03) 0.33 (0.26) 0.27 (0.03) 0.33 (0.26)
3 0.95 (0.01) 1.97 (0.99) 0.94 (0.01) 2.26 (1.30) 0.52 (0.03) 0.50 (0.29) 0.41 (0.03) 0.50 (0.29)
4 0.95 (0.01) 1.74 (0.79) 0.94 (0.01) 1.73 (0.84) 0.64 (0.03) 0.59 (0.29) 0.48 (0.03) 0.59 (0.29)
5 0.95 (0.01) 1.64 (0.69) 0.93 (0.02) 1.54 (0.68) 0.71 (0.03) 0.66 (0.29) 0.54 (0.03) 0.66 (0.29)
6 0.95 (0.01) 1.58 (0.63) 0.93 (0.02) 1.45 (0.59) 0.75 (0.03) 0.72 (0.29) 0.58 (0.03) 0.72 (0.29)
7 0.95 (0.01) 1.54 (0.58) 0.93 (0.02) 1.40 (0.53) 0.77 (0.03) 0.76 (0.29) 0.61 (0.03) 0.76 (0.29)
8 0.95 (0.01) 1.50 (0.55) 0.93 (0.02) 1.35 (0.49) 0.79 (0.03) 0.80 (0.29) 0.64 (0.03) 0.80 (0.29)
9 0.95 (0.01) 1.48 (0.53) 0.93 (0.02) 1.33 (0.47) 0.81 (0.02) 0.82 (0.29) 0.65 (0.03) 0.82 (0.29)
10 0.95 (0.01) 1.46 (0.51) 0.93 (0.02) 1.31 (0.44) 0.83 (0.02) 0.85 (0.29) 0.67 (0.03) 0.85 (0.29)

Compared with existing works on input uncertainty quantification, we highlight that the computational load in Cheap Bootstrap is significantly lower. According to Table 4, BB can be taken as, for instance, 3 or 4 in O\mathcal{I}_{O} to obtain a reasonable confidence interval, so that the total computation cost, including the R0R_{0} runs used to compute the original point estimate, is BR+R0BR+R_{0}, which is 200 or 250 when we use R=R0=50R=R_{0}=50. In contrast, bootstrap approaches proposed in the literature suggest using a BB of at least 5050 in numerical examples (e.g., Cheng and Holland (2004); Song and Nelson (2015)) or linearly dependent on the parameter dimension (e.g., 1010 times the dimension when using the so-called metamodel-assisted bootstrap in the parametric case; Xie et al. (2014)).

6.4 Deep Ensemble Prediction

We consider the deep ensemble prediction described in Section 4. We build a prediction model y=f(𝐱)y=f(\mathbf{x}) from i.i.d. supervised data (𝐗i,Yi),i=1,,n(\mathbf{X}_{i},Y_{i}),i=1,\ldots,n of size n=200n=200. Here the feature vector 𝐱=(x(1),,x(d))\mathbf{x}=(x(1),\ldots,x(d)) is 16-dimensional and yy\in\mathbb{R}. Suppose each dimension of 𝐗i\mathbf{X}_{i} is generated from the uniform distribution on [0,1][0,1], and the ground-truth model is y=j=116x(j)+N(0,1)y=\sum_{j=1}^{16}x(j)+N(0,1). We build a deep ensemble by training R0=5R_{0}=5 base neural networks, each using an independent random initialization, and averaging these networks to obtain a predictor. More specifically, each base neural network has two fully connected layers with 10241024 hidden neurons, using rectified linear units as the activation. It is trained using the squared loss with L2L_{2}-regularization on the neuron weights, and optimized with Adam. All the weights in each neural network are initialized as independent N(0,1)N(0,1) variables.
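The following minimal Python sketch illustrates the ensemble construction, using scikit-learn's MLPRegressor as a stand-in for the fully connected networks described above; the regularization strength, iteration cap and seeds are illustrative assumptions rather than the exact training configuration.

import numpy as np
from sklearn.neural_network import MLPRegressor

def deep_ensemble_predict(X, y, x_test, R0=5, seed=0):
    preds = []
    for r in range(R0):
        net = MLPRegressor(hidden_layer_sizes=(1024, 1024), activation="relu",
                           alpha=1e-3, solver="adam", max_iter=500,
                           random_state=seed + r)      # independent random initialization per base learner
        net.fit(X, y)
        preds.append(net.predict(x_test.reshape(1, -1))[0])
    return float(np.mean(preds))                       # the ensemble (point) prediction

rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.uniform(0.0, 1.0, size=(n, d))
y = X.sum(axis=1) + rng.standard_normal(n)             # ground truth: sum of features plus N(0,1) noise
x_test = np.full(d, 0.5)
prediction = deep_ensemble_predict(X, y, x_test)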

As in Section 6.3, we use Cheap Bootstrap centered at original estimate O\mathcal{I}_{O} and centered at resample mean M\mathcal{I}_{M} to construct 95%95\% confidence intervals for the prediction output. Here, we consider a test point with value 0.50.5 in all dimensions. We set R=R0R=R_{0} in both approaches, and use Table 5 in Appendix E.2 to calibrate the critical value qO,1α/2q_{O,1-\alpha/2} in O\mathcal{I}_{O}. We repeat the experiments 100100 times to record statistics on the generated confidence intervals. Figure 4 depicts the box plots of the confidence interval widths using B=1B=1 to 1010, for O\mathcal{I}_{O} and M\mathcal{I}_{M} respectively. We see that the interval widths are relatively large for B=1B=1 (in the case of O\mathcal{I}_{O}) and for B=2B=2 (in the case of M\mathcal{I}_{M}, which is only well-defined starting from B=2B=2), but decrease quickly as BB increases. When B=3B=3 (in the case of O\mathcal{I}_{O}) and B=4B=4 (in the case of M\mathcal{I}_{M}), the interval widths appear to have more or less stabilized.

Figure 4: Box plots of 95%95\% confidence interval widths for deep ensemble prediction on synthetic data at test point that takes value 0.50.5 in all dimensions, using Cheap Bootstrap centered at original estimate O\mathcal{I}_{O} (left graph) and Cheap Bootstrap centered at resample mean M\mathcal{I}_{M} (right graph).

Next, we also train deep ensemble predictors for a real data set on Boston housing, available at https://www.kaggle.com/c/boston-housing. The data set has size 504504 and feature dimension 1313. As in the synthetic-data example, we use R0=5R_{0}=5, and the same configurations of base neural networks and training methods. We split the data into an 80%80\% training set and a 20%20\% testing set, where only the training set is used to construct the predictor and to run our Cheap Bootstrap, again with R=R0=5R=R_{0}=5. Then we create the 95%95\% Cheap Bootstrap confidence intervals for the prediction values of the testing set. Figure 5 shows the box plots of the widths of these intervals, namely O\mathcal{I}_{O} and M\mathcal{I}_{M}, for BB ranging from 11 to 1010. We see that the widths fall sharply when BB is small (B=1,2B=1,2 in O\mathcal{I}_{O} and B=2,3B=2,3 in M\mathcal{I}_{M}) and stabilize quickly afterwards.

Figure 5: Box plots of 95%95\% confidence interval widths for deep ensemble prediction for the test points in the Boston housing data set, using Cheap Bootstrap centered at original estimate O\mathcal{I}_{O} (left graph) and Cheap Bootstrap centered at resample mean M\mathcal{I}_{M} (right graph).

In Appendix E.3, we present an additional example on constructing confidence bounds for the optimality gaps of data-driven stochastic optimization problems. This problem, which possesses a nested sampling structure, is of interest to stochastic programming. There we illustrate the performance of Cheap Bootstrap in constructing one-sided confidence bounds.

7 Discussion

Motivated by the computational demand in large-scale problems where repeated model refitting can be costly, in this paper we propose a Cheap Bootstrap method that can use very few resamples for bootstrap inference. A key element of our method is that it attains asymptotically exact coverage for any number of resamples, including as few as one. This is in contrast to conventional bootstrap approaches that require running many Monte Carlo replications. Our theory on the Cheap Bootstrap also differs from these approaches by exploiting the asymptotic independence between the original estimate and the resample estimates, instead of a direct approximation of the sampling distribution via the resample counterpart, the latter typically executed by running many Monte Carlo replications in order to obtain accurate summary statistics from the resamples.

Besides the basic asymptotic coverage guarantee, we have studied several aspects of the Cheap Bootstrap. First is its higher-order coverage error, which matches that of the conventional basic and percentile bootstraps. This error analysis is based on an Edgeworth expansion on a tt-limit which, to our knowledge, has not been studied in the literature, whose focus has been on normal limits. Second is the half-width behavior, where we show that the half widths of Cheap Bootstrap intervals are naturally wider for a small number of resamples, and match those of the conventional approaches as the number of resamples increases. Moreover, the decrease in the half width is sharp when the resample number is very small and flattens quickly, thus leading to a reasonable half-width performance with only a few resamples. In addition, we have also investigated several generalizations of the Cheap Bootstrap, including its applications to nested sampling problems, subsampling procedures, and multivariate extensions. Our numerical results validate the performance of the Cheap Bootstrap, especially its ability to conduct valid inference with an extremely small number of resamples.

This work is intended to lay the foundation of the proposed Cheap Bootstrap method. In principle, this method can be applied to many problems other than those shown in our numerical section. Essentially, for any problem where a bootstrap interval is used and justified via the standard condition of conditional asymptotic normality of the resamples, one can adopt the Cheap Bootstrap in lieu of conventional bootstraps. Besides applying and testing the performance of the Cheap Bootstrap across these other problems, we believe the following are also worth further investigation:

Comparison with the infinitesimal jackknife and the jackknife: While we have focused our comparisons primarily within the bootstrap family, the Cheap Bootstrap bears advantages when compared to non-bootstrap alternatives including the infinitesimal jackknife and the jackknife. These advantages are inherited from the bootstrap approach in general, but are strengthened further by the light resampling effort. The infinitesimal jackknife, or the delta method, uses a linear approximation to evaluate the standard error. This approach relies on influence function calculations that could be tedious analytically, and also computationally due to, for instance, the inversion of big matrices in MM-estimation. The jackknife, which relies on leave-one-out estimates, generally requires a number of model evaluations equal to the sample size. The Cheap Bootstrap thus provides potentially significant computational savings compared to both methods. Note, however, that it is possible to combine the jackknife with batching to reduce the computation load (see the next discussion point).

In problems facing nested sampling such as those in Section 4, it is also possible to use non-bootstrap techniques such as the infinitesimal jackknife. In fact, the latter is particularly handy to derive for problems involving double resampling such as bagging predictors (e.g., Wager et al. (2014)). Nonetheless, it remains open, to our knowledge, whether these alternatives can generally resolve the expensive nested sampling requirement faced by conventional bootstraps, especially in problems lacking unbiased estimators for the influence function. In these cases, the need to control the entangled noises coming from both the data and the computation could necessitate a large computation size, despite the apparent avoidance of nested sampling.

Comparison with batching: A technique prominently used in the simulation literature, but perhaps less common in statistics, is known as batch means or more generally standardized time series (Glynn and Iglehart (1990); Schmeiser (1982); Schruben (1983); Glynn and Lam (2018)). This technique conducts inference by grouping data into batches and aggregating them via tt-statistics. The batches can be disjoint or overlapping (Meketon and Schmeiser (1984); Song and Schmeiser (1995)), similar to the blocks in subsampling (Politis and Romano (1994)). While batch means uses tt-statistics like our Cheap Bootstrap, it utilizes solely the original sample instead of resamples, and is motivated by the need to handle serially dependent observations in simulation outputs such as from Markov chain Monte Carlo (Geyer (1992); Flegal et al. (2010); Jones et al. (2006)).

Batch means can in principle be used to construct confidence intervals with few model evaluations, by using a small number of batches or blocks. However, a potential downside of this approach is that, in the disjoint-batches case, dividing a limited sample into batches thins out the sample size per batch and degrades the accuracy of the asymptotic approximation. In other words, there is a constrained tradeoff between the number of batches and the sample size per batch that limits its accuracy. The Cheap Bootstrap, on the other hand, bypasses this constraint by allowing the use of any number of resamples. In fact, since the observations in different resamples overlap, the Cheap Bootstrap appears closer to overlapping batching. To this end, overlapping batching contains multiple tuning parameters that specify how the batches overlap, which could also affect the asymptotic distributions. In this regard, the Cheap Bootstrap appears easier to use, as the only parameter needed is the resample budget, which is decided by the amount of available computation resource.

One approach to obtain more in-depth theoretical comparisons between the Cheap Bootstrap and batching is to analyze the coefficients in the respective higher-order coverage expansions. This comprises an immediate future direction.

High-dimensional problems: As the main advantage of the Cheap Bootstrap relative to other approaches is its light computation, it would be revealing to analyze the error of the Cheap Bootstrap in high-dimensional problems where computational savings are of utmost importance. Regarding this, it is possible to obtain bounds for the coverage error of the Cheap Bootstrap with explicit dependence on the problem dimension, by following our analysis roadmap in Section 3.2 but integrating it with high-dimensional Berry-Esseen-type bounds for normal limits (Chernozhukov et al. (2017); Fang and Koike (2021)).

Higher-order coverage accuracy: Although our investigation is orthogonal to the bootstrap literature on higher-order coverage error refinements, we speculate that our Cheap Bootstrap can be sharpened to exhibit second-order accurate intervals, by replacing the tt-quantile with a bootstrap-calibrated quantile. This approach is similar to the studentized bootstrap, but uses few outer resamples in a potential iterated bootstrap procedure. In other words, our computation effort could be larger than that of elementary bootstraps, as we need to run two layers of resampling, but less than that of a double bootstrap, as one of the layers consists of only a few resamples. As a related application, we may apply the Cheap Bootstrap to assess the error of the bootstrap itself, which is done conventionally via bootstrapping the bootstrap or, alternatively, the jackknife-after-bootstrap (Efron (1992)).

Serial dependence: A common bootstrap approach to handling serial dependence is to use block sampling (e.g., Davison and Hinkley (1997) §8). The computational advantage of the Cheap Bootstrap likely continues to hold when using blocks. More specifically, instead of regenerating many series each concatenated from resampled blocks, the Cheap Bootstrap regenerates only a few such series and aggregates them via a tt-approximation. On the other hand, as mentioned above, batch means and subsampling also comprise viable approaches to handling serially dependent data, and the effectiveness of the Cheap Bootstrap compared with these approaches will need further investigation.

Acknowledgments

I gratefully acknowledge support from the National Science Foundation under grants CAREER CMMI-1834710 and IIS-1849280.

References

  • Alaa and Van Der Schaar (2020) Alaa, A. and Van Der Schaar, M. (2020) Discriminative jackknife: Quantifying uncertainty in deep learning via higher-order influence functions. In International Conference on Machine Learning, 165–174. PMLR.
  • Barton (2012) Barton, R. R. (2012) Tutorial: Input uncertainty in output analysis. In Proceedings of the 2012 Winter Simulation Conference (WSC), 1–12. IEEE.
  • Barton et al. (2014) Barton, R. R., Nelson, B. L. and Xie, W. (2014) Quantifying input uncertainty via simulation confidence intervals. INFORMS Journal on Computing, 26, 74–87.
  • Beran (1987) Beran, R. (1987) Prepivoting to reduce level error of confidence sets. Biometrika, 74, 457–468.
  • Bickel et al. (1997) Bickel, P. J., Götze, F. and van Zwet, W. R. (1997) Resampling fewer than nn observations: Gains, losses, and remedies for losses. Statistica Sinica, 7, 1–31.
  • Birge and Louveaux (2011) Birge, J. R. and Louveaux, F. (2011) Introduction to stochastic programming. Springer Science & Business Media.
  • Booth and Do (1993) Booth, J. G. and Do, K.-A. (1993) Simple and efficient methods for constructing bootstrap confidence intervals. Computational Statistics, 8, 333–333.
  • Booth and Hall (1994) Booth, J. G. and Hall, P. (1994) Monte Carlo approximation and the iterated bootstrap. Biometrika, 81, 331–340.
  • Breiman (1996) Breiman, L. (1996) Bagging predictors. Machine Learning, 24, 123–140.
  • Bühlmann and Yu (2002) Bühlmann, P. and Yu, B. (2002) Analyzing bagging. The Annals of Statistics, 30, 927–961.
  • Cheng and Holland (2004) Cheng, R. C. and Holland, W. (2004) Calculation of confidence intervals for simulation output. ACM Transactions on Modeling and Computer Simulation (TOMACS), 14, 344–362.
  • Chernozhukov et al. (2017) Chernozhukov, V., Chetverikov, D. and Kato, K. (2017) Central limit theorems and bootstrap in high dimensions. The Annals of Probability, 45, 2309–2352.
  • Davison and Hinkley (1997) Davison, A. C. and Hinkley, D. V. (1997) Bootstrap Methods and Their Application. No. 1. Cambridge University Press.
  • DiCiccio et al. (1996) DiCiccio, T. J., Efron, B. et al. (1996) Bootstrap confidence intervals. Statistical Science, 11, 189–228.
  • DiCiccio et al. (1992) DiCiccio, T. J., Martin, M. A. and Young, G. A. (1992) Analytical approximations for iterated bootstrap confidence intervals. Statistics and Computing, 2, 161–171.
  • Efron (1981) Efron, B. (1981) Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods. Biometrika, 68, 589–599.
  • Efron (1982) — (1982) The Jackknife, the Bootstrap and Other Resampling Plans. SIAM.
  • Efron (1987) — (1987) Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171–185.
  • Efron (1992) — (1992) Jackknife-after-bootstrap standard errors and influence functions. Journal of the Royal Statistical Society: Series B (Methodological), 54, 83–111.
  • Efron and Tibshirani (1994) Efron, B. and Tibshirani, R. J. (1994) An Introduction to the Bootstrap. CRC Press.
  • Fang and Koike (2021) Fang, X. and Koike, Y. (2021) High-dimensional central limit theorems by Stein’s method. The Annals of Applied Probability, 31, 1660–1686.
  • Flegal et al. (2010) Flegal, J. M., Jones, G. L. et al. (2010) Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics, 38, 1034–1070.
  • Geyer (1992) Geyer, C. J. (1992) Practical Markov chain Monte Carlo. Statistical Science, 7, 473–483.
  • Giordano et al. (2019) Giordano, R., Stephenson, W., Liu, R., Jordan, M. and Broderick, T. (2019) A swiss army infinitesimal jackknife. In The 22nd International Conference on Artificial Intelligence and Statistics, 1139–1147. PMLR.
  • Glasserman (2004) Glasserman, P. (2004) Monte Carlo methods in financial engineering, vol. 53. Springer.
  • Glynn and Iglehart (1990) Glynn, P. W. and Iglehart, D. L. (1990) Simulation output analysis using standardized time series. Mathematics of Operations Research, 15, 1–16.
  • Glynn and Lam (2018) Glynn, P. W. and Lam, H. (2018) Constructing simulation output intervals under input uncertainty via data sectioning. In 2018 Winter Simulation Conference (WSC), 1551–1562. IEEE.
  • Hall (1986a) Hall, P. (1986a) On the bootstrap and confidence intervals. Annals of Statistics, 14, 1431–1452.
  • Hall (1986b) — (1986b) On the number of bootstrap simulations required to construct a confidence interval. The Annals of Statistics, 1453–1462.
  • Hall (2013) — (2013) The Bootstrap and Edgeworth Expansion. Springer Science & Business Media.
  • Hall and Martin (1988) Hall, P. and Martin, M. A. (1988) On bootstrap resampling and iteration. Biometrika, 75, 661–671.
  • He and Lam (2021) He, S. and Lam, H. (2021) Higher-order coverage errors of batching methods via edgeworth expansions on tt-statistics. arXiv preprint arXiv:2111.06859.
  • Henderson (2003) Henderson, S. G. (2003) Input modeling: Input model uncertainty: Why do we care and what should we do about it? In Proceedings of the 2003 Winter Simulation Conference (eds. S. Chick, P. J. Sánchez, D. Ferrin and D. J. Morrice), 90–100. Piscataway, New Jersey: IEEE.
  • Jones et al. (2006) Jones, G. L., Haran, M., Caffo, B. S. and Neath, R. (2006) Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101, 1537–1547.
  • Kleiner et al. (2012) Kleiner, A., Talwalkar, A., Sarkar, P. and Jordan, M. I. (2012) The big data bootstrap. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, 1787–1794. Madison, WI, USA: Omnipress.
  • Kleiner et al. (2014) — (2014) A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 76, 795–816.
  • Kosorok (2007) Kosorok, M. R. (2007) Introduction to Empirical Processes and Semiparametric Inference. Springer Science & Business Media.
  • Lakshminarayanan et al. (2016) Lakshminarayanan, B., Pritzel, A. and Blundell, C. (2016) Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474.
  • Lam (2016) Lam, H. (2016) Advanced tutorial: Input uncertainty and robust analysis in stochastic simulation. In Winter Simulation Conference (WSC), 2016, 178–192. IEEE.
  • Lam and Qian (2018a) Lam, H. and Qian, H. (2018a) Assessing solution quality in stochastic optimization via bootstrap aggregating. In 2018 Winter Simulation Conference (WSC), 2061–2071. IEEE.
  • Lam and Qian (2018b) — (2018b) Bounding optimality gap in stochastic optimization via bagging: Statistical efficiency and stability. arXiv preprint arXiv:1810.02905.
  • Lam and Qian (2018c) — (2018c) Subsampling variance for input uncertainty quantification. In 2018 Winter Simulation Conference (WSC), 1611–1622. IEEE.
  • Lam and Qian (2021) — (2021) Subsampling to enhance efficiency in input uncertainty quantification. Operations Research.
  • Law et al. (1991) Law, A. M. and Kelton, W. D. (1991) Simulation modeling and analysis, vol. 2. McGraw-Hill, New York.
  • Lee et al. (2015) Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D. and Batra, D. (2015) Why m heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314.
  • Lee and Young (1995) Lee, S. M. and Young, G. A. (1995) Asymptotic iterated bootstrap confidence intervals. The Annals of Statistics, 23, 1301–1330.
  • Lu et al. (2020) Lu, Z., Ie, E. and Sha, F. (2020) Uncertainty estimation with infinitesimal jackknife, its distribution and mean-field approximation. arXiv preprint arXiv:2006.07584.
  • Mak et al. (1999) Mak, W.-K., Morton, D. P. and Wood, R. K. (1999) Monte Carlo bounding techniques for determining solution quality in stochastic programs. Operations Research Letters, 24, 47–56.
  • Meketon and Schmeiser (1984) Meketon, M. S. and Schmeiser, B. (1984) Overlapping batch means: Something for nothing? In Proceedings of the Winter Simulation Conference, 227–230.
  • Mentch and Hooker (2016) Mentch, L. and Hooker, G. (2016) Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17, 841–881.
  • Nelson (2013) Nelson, B. (2013) Foundations and Methods of Stochastic Simulation: A First Course. Springer Science & Business Media.
  • Politis and Romano (1994) Politis, D. N. and Romano, J. P. (1994) Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, 22, 2031–2050.
  • Politis et al. (1999) Politis, D. N., Romano, J. P. and Wolf, M. (1999) Subsampling. Springer Science & Business Media.
  • Schenker (1985) Schenker, N. (1985) Qualms about bootstrap confidence intervals. Journal of the American Statistical Association, 80, 360–361.
  • Schmeiser (1982) Schmeiser, B. (1982) Batch size effects in the analysis of simulation output. Operations Research, 30, 556–568.
  • Schruben (1983) Schruben, L. (1983) Confidence interval estimation using standardized time series. Operations Research, 31, 1090–1108.
  • Schulam and Saria (2019) Schulam, P. and Saria, S. (2019) Can you trust this prediction? auditing pointwise reliability after learning. In The 22nd International Conference on Artificial Intelligence and Statistics, 1022–1031. PMLR.
  • Sengupta et al. (2016) Sengupta, S., Volgushev, S. and Shao, X. (2016) A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 111, 1222–1232.
  • Shao and Tu (2012) Shao, J. and Tu, D. (2012) The jackknife and bootstrap. Springer Science & Business Media.
  • Shapiro et al. (2021) Shapiro, A., Dentcheva, D. and Ruszczynski, A. (2021) Lectures on stochastic programming: modeling and theory. SIAM.
  • Song and Nelson (2015) Song, E. and Nelson, B. L. (2015) Quickly assessing contributions to input uncertainty. IIE Transactions, 47, 893–909.
  • Song et al. (2014) Song, E., Nelson, B. L. and Pegden, C. D. (2014) Advanced tutorial: Input uncertainty quantification. In Proceedings of the 2014 Winter Simulation Conference (eds. A. Tolk, S. Diallo, I. Ryzhov, L. Yilmaz, S. Buckley and J. Miller), 162–176. Piscataway, New Jersey: IEEE.
  • Song and Schmeiser (1995) Song, W. T. and Schmeiser, B. W. (1995) Optimal mean-squared-error batch sizes. Management Science, 41, 110–123.
  • Van der Vaart (2000) Van der Vaart, A. W. (2000) Asymptotic Statistics, vol. 3. Cambridge University Press.
  • Van der Vaart and Wellner (2013) Van der Vaart, A. W. and Wellner, J. A. (2013) Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media.
  • Wager et al. (2014) Wager, S., Hastie, T. and Efron, B. (2014) Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. The Journal of Machine Learning Research, 15, 1625–1651.
  • Wall et al. (2001) Wall, M. M., Boen, J. and Tweedie, R. (2001) An effective confidence interval for the mean with samples of size one and two. The American Statistician, 55, 102–105.
  • Xie et al. (2014) Xie, W., Nelson, B. L. and Barton, R. R. (2014) A Bayesian framework for quantifying uncertainty in stochastic simulation. Operations Research, 62, 1439–1452.
  • Zouaoui and Wilson (2004) Zouaoui, F. and Wilson, J. R. (2004) Accounting for input-model and input-parameter uncertainties in simulation. IIE Transactions, 36, 1135–1151.

Appendix A Immediate Extensions from Section 3

A.1 Inference on Standard Error

We can conduct inference on the standard error of ψ^n\hat{\psi}_{n} using Cheap Bootstrap. Here by standard error we mean the standard deviation of ψ^n\hat{\psi}_{n}, which is asymptotically σ/n\sigma/\sqrt{n}, and our inference target is σ\sigma. Using the same argument as in Section 3.1, we have that

𝒥=[BnSχ1α/2,B2,BnSχα/2,B2]\mathcal{J}=\left[\frac{\sqrt{Bn}S}{\sqrt{\chi^{2}_{1-\alpha/2,B}}},\frac{\sqrt{Bn}S}{\sqrt{\chi^{2}_{\alpha/2,B}}}\right] (31)

is an asymptotically exact (1α)(1-\alpha)-level confidence interval for σ\sigma, where χα/2,B2\chi^{2}_{\alpha/2,B} and χ1α/2,B2\chi^{2}_{1-\alpha/2,B} are the α/2\alpha/2 and (1α/2)(1-\alpha/2)-quantiles of χB2\chi^{2}_{B}. We summarize this as follows.

Theorem 7 (Cheap Bootstrap interval for standard error).

Under Assumption 1, we have, for any bootstrap replication size B1B\geq 1, the interval 𝒥\mathcal{J} defined in (31) is an asymptotically exact (1α)(1-\alpha)-level confidence interval for σ\sigma, i.e., n(σ𝒥)1α\mathbb{P}_{n}(\sigma\in\mathcal{J})\to 1-\alpha as the sample size nn\to\infty, where n\mathbb{P}_{n} denotes the probability with respect to the data X1,,XnX_{1},\ldots,X_{n} and all randomness from the resampling.
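For illustration, the following is a minimal Python sketch of the interval (31); the resampling scheme is the standard n-out-of-n nonparametric bootstrap, and the sample-mean estimator and exponential data in the usage example are hypothetical placeholders chosen only for concreteness.

```python
import numpy as np
from scipy.stats import chi2

def cheap_bootstrap_se_interval(data, psi, B=5, alpha=0.05, seed=None):
    """Cheap Bootstrap interval (31) for sigma, the scaled standard error of
    psi_hat, using only B bootstrap refits of the estimator psi."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = psi(data)                                  # original point estimate
    psi_star = np.array([psi(data[rng.integers(0, n, size=n)]) for _ in range(B)])
    S = np.sqrt(np.mean((psi_star - psi_hat) ** 2))      # resample standard deviation
    lower = np.sqrt(B * n) * S / np.sqrt(chi2.ppf(1 - alpha / 2, df=B))
    upper = np.sqrt(B * n) * S / np.sqrt(chi2.ppf(alpha / 2, df=B))
    return lower, upper

# Usage example: for the sample mean of Exp(1) data, the target sigma equals 1
data = np.random.default_rng(0).exponential(size=500)
print(cheap_bootstrap_se_interval(data, np.mean, B=5, seed=1))
```

By Theorem 7, any B ≥ 1 yields asymptotically exact coverage; smaller B simply tends to produce a wider interval through the more spread-out χ²_B quantiles.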

A.2 Multivariate Generalization

We present a multivariate generalization of Cheap Bootstrap. Consider now ψ:=ψ(P)d\psi:=\psi(P)\in\mathbb{R}^{d}. Our multivariate confidence region proceeds similarly to the univariate case, except that we use Hotelling’s T2T^{2} distribution instead of the tt distribution. More specifically, we form the point estimate ψ^n=ψ(P^n)\hat{\psi}_{n}=\psi(\hat{P}_{n}), and for each b=1,,Bb=1,\ldots,B we resample from {X1,,Xn}\{X_{1},\ldots,X_{n}\} to obtain {X1b,,Xnb}\{X_{1}^{*b},\ldots,X_{n}^{*b}\} and evaluate the resample estimate ψnb:=ψ(Pnb)\psi_{n}^{*b}:=\psi(P_{n}^{*b}). Our confidence region is

={ψ:(ψ^nψ)S1(ψ^nψ)Td,B,1α2}\mathcal{R}=\left\{\psi:(\hat{\psi}_{n}-\psi)^{\top}S^{-1}(\hat{\psi}_{n}-\psi)\leq T^{2}_{d,B,1-\alpha}\right\} (32)

where SS now denotes a d×dd\times d matrix given by

S=1Bb=1B(ψnbψ^n)(ψnbψ^n)S=\frac{1}{B}\sum_{b=1}^{B}\left(\psi_{n}^{*b}-\hat{\psi}_{n}\right)\left(\psi_{n}^{*b}-\hat{\psi}_{n}\right)^{\top} (33)

and the critical value Td,B,1α2T^{2}_{d,B,1-\alpha} is the (1α)(1-\alpha)-quantile of Hotelling’s T2T^{2} distribution with parameters dd and BB.

We have the following asymptotic exact guarantee for \mathcal{R}. First we make the following multivariate analog of Assumption 1:

Assumption 3 (Standard condition for multivariate bootstrap validity).

We have n(ψ^nψ)N(0,Σ)\sqrt{n}(\hat{\psi}_{n}-\psi)\Rightarrow N(0,\Sigma) and n(ψnψ^n)N(0,Σ)\sqrt{n}(\psi_{n}^{*}-\hat{\psi}_{n})\Rightarrow N(0,\Sigma) conditional on the data X1,X2,X_{1},X_{2},\ldots in probability as nn\to\infty, where N(0,Σ)N(0,\Sigma) is a multivariate normal vector with mean 0d0\in\mathbb{R}^{d} and covariance matrix Σd×d\Sigma\in\mathbb{R}^{d\times d} that is positive definite.

Then we have:

Theorem 8 (Asymptotic exactness of multivariate Cheap Bootstrap).

Under Assumption 3, for any BdB\geq d, the region \mathcal{R} in (32) is an asymptotically exact (1α)(1-\alpha)-level confidence region for ψ\psi, i.e., n(ψ)1α\mathbb{P}_{n}(\psi\in\mathcal{R})\to 1-\alpha as nn\to\infty, where n\mathbb{P}_{n} denotes the probability with respect to the data X1,,XnX_{1},\ldots,X_{n} and all randomness from the resampling.

Note that in Theorem 8 we require BB to be at least dd, the dimension of ψ\psi. In the univariate case (d=1d=1), this reduces to the requirement B1B\geq 1.
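To make the construction concrete, the sketch below computes the region (32)–(33) and tests membership of a candidate point in Python. The critical value uses the standard relation T²_{d,B,1−α} = dB/(B−d+1) · F_{d,B−d+1,1−α} between Hotelling’s T² and the F distribution; the vector-valued estimator psi is a hypothetical placeholder.

```python
import numpy as np
from scipy.stats import f as f_dist

def hotelling_t2_quantile(d, B, alpha):
    # (1 - alpha)-quantile of Hotelling's T^2 with parameters d and B,
    # via the identity T^2_{d,B} = d*B/(B - d + 1) * F_{d, B-d+1}
    return d * B / (B - d + 1) * f_dist.ppf(1 - alpha, d, B - d + 1)

def cheap_bootstrap_region(data, psi, B, alpha=0.05, seed=None):
    """Return (psi_hat, S, critical value) defining the region (32)-(33)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = np.asarray(psi(data))
    d = psi_hat.size
    if B < d:
        raise ValueError("Theorem 8 requires B >= d")
    diffs = np.array([np.asarray(psi(data[rng.integers(0, n, size=n)])) - psi_hat
                      for _ in range(B)])
    S = diffs.T @ diffs / B                    # the d x d matrix in (33)
    return psi_hat, S, hotelling_t2_quantile(d, B, alpha)

def in_region(psi_hat, S, crit, psi0):
    """Check whether the candidate psi0 lies in the confidence region (32)."""
    delta = psi_hat - np.asarray(psi0)
    return float(delta @ np.linalg.solve(S, delta)) <= crit
```

The check B ≥ d also reflects that S in (33), being an average of B rank-one matrices, can only be invertible when B ≥ d.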

Finally, the following is a multivariate analog of Proposition 2 that uses Hadamard differentiability with non-degenerate derivative to ensure Assumption 3:

Proposition 4 (Sufficient conditions for multivariate bootstrap validity).

Consider P^n\hat{P}_{n} and PnP_{n}^{*} as random elements that take values in ()\ell^{\infty}(\mathcal{F}), where \mathcal{F} is a Donsker class with finite envelope. Suppose ψ:()d\psi:\ell^{\infty}(\mathcal{F})\to\mathbb{R}^{d} is Hadamard differentiable at PP where the derivative ψP\psi_{P}^{\prime} satisfies that the covariance matrix of ψP(𝔾P)\psi_{P}^{\prime}(\mathbb{G}_{P}) is positive definite, for a tight Gaussian process 𝔾P\mathbb{G}_{P} on ()\ell^{\infty}(\mathcal{F}) with mean 0 and covariance Cov(𝔾P(f1),𝔾P(f2))=CovP(f1(X),f2(X))Cov(\mathbb{G}_{P}(f_{1}),\mathbb{G}_{P}(f_{2}))=Cov_{P}(f_{1}(X),f_{2}(X)). Then Assumption 3 holds under i.i.d. data.

Appendix B Proofs for Section 3 and Appendix A

Proof of Proposition 2.

Apply Theorem 12 (in Appendix F.2) with d=1d=1 and note that ψP(𝔾P)\psi_{P}^{\prime}(\mathbb{G}_{P}) is normal. ∎

Proof of Theorem 2.

For convenience throughout the proof we use CC to denote a positive constant that is not necessarily the same every time it appears. We first define

A^(𝐱)=g(𝐱)g(𝐗¯)h(𝐗¯)\hat{A}(\mathbf{x})=\frac{g(\mathbf{x})-g(\overline{\mathbf{X}})}{h(\overline{\mathbf{X}})}

and consider A^(𝐗¯)\hat{A}(\overline{\mathbf{X}}^{*}), where 𝐗¯=(1/n)i=1n𝐗i\overline{\mathbf{X}}^{*}=(1/n)\sum_{i=1}^{n}\mathbf{X}_{i}^{*} is the mean of a resample {𝐗1,,𝐗n}\{\mathbf{X}_{1}^{*},\ldots,\mathbf{X}_{n}^{*}\}. We can view A^(𝐗¯)\hat{A}(\overline{\mathbf{X}}^{*}) as the resample counterpart of A(𝐗¯)A(\overline{\mathbf{X}}).

Under the function-of-mean model described in the theorem, the pivotal statistic (7) can be written as

T=g(𝐗¯)g(𝝁)1Bb=1B(g(𝐗¯b)g(𝐗¯))2=As(𝐗¯)1Bb=1BA^(𝐗¯b)2T=\frac{g(\overline{\mathbf{X}})-g(\bm{\mu})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(g(\overline{\mathbf{X}}^{*b})-g(\overline{\mathbf{X}}))^{2}}}=\frac{A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}\hat{A}(\overline{\mathbf{X}}^{*b})^{2}}}

For a two-sided interval, coverage is defined by the event

|g(𝐗¯)g(𝝁)1Bb=1B(g(𝐗¯b)g(𝐗¯))2|tB,1α/2\left|\frac{g(\overline{\mathbf{X}})-g(\bm{\mu})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(g(\overline{\mathbf{X}}^{*b})-g(\overline{\mathbf{X}}))^{2}}}\right|\leq t_{B,1-\alpha/2}

or equivalently

|nAs(𝐗¯)1Bb=1B(nA^(𝐗¯b))2|tB,1α/2\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(\sqrt{n}\hat{A}(\overline{\mathbf{X}}^{*b}))^{2}}}\right|\leq t_{B,1-\alpha/2}

Define QQ^{*} as the conditional distribution of nA^(𝐗¯b)\sqrt{n}\hat{A}(\overline{\mathbf{X}}^{*b}) given the data 𝒳n={𝐗1,,𝐗n}\mathcal{X}_{n}=\{\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\}. Note that we can write the coverage probability as

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)]E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1})\right] (34)

where the expectation EE is taken with respect to the data 𝒳n\mathcal{X}_{n}.

Consider a positive number λ3/2\lambda\geq 3/2. We first show that, with probability 1O(nλ)1-O(n^{-\lambda}),

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1})
=|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q^(z1)+R\displaystyle=\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})\cdots d\hat{Q}^{*}(z_{1})+R (35)

where

Q^(x)=Φ(x)+j=1,,νj evennj/2p^j(x)ϕ(x)\hat{Q}^{*}(x)=\Phi(x)+\sum_{\begin{subarray}{c}j=1,\ldots,\nu\\ j\text{\ even}\end{subarray}}n^{-j/2}\hat{p}_{j}(x)\phi(x) (36)

is a random signed measure (constructed from the random polynomial p^j\hat{p}_{j}), and RR satisfies |R|Cn(ν+1)/2|R|\leq Cn^{-(\nu+1)/2} for some constant CC. Here, the polynomials p^j\hat{p}_{j} are the ones defined in Theorem 14 (in Appendix F.3). To this end, denote \mathcal{E} as the event that

sup<x<|P(nA^(𝐗¯)x|𝒳n)Φ(x)j=1νnj/2p^j(x)ϕ(x)|Cn(ν+1)/2\sup_{-\infty<x<\infty}\left|P(\sqrt{n}\hat{A}(\overline{\mathbf{X}}^{*})\leq x|\mathcal{X}_{n})-\Phi(x)-\sum_{j=1}^{\nu}n^{-j/2}\hat{p}_{j}(x)\phi(x)\right|\leq Cn^{-(\nu+1)/2} (37)
max1jνsup<x<(1+|x|)(3j1)|p^j(x)|C\max_{1\leq j\leq\nu}\sup_{-\infty<x<\infty}(1+|x|)^{-(3j-1)}|\hat{p}_{j}(x)|\leq C (38)

and

max1jνsup<x<(1+|x|)(3j1)|p^j(x)|C\max_{1\leq j\leq\nu}\sup_{-\infty<x<\infty}(1+|x|)^{-(3j-1)}|\hat{p}_{j}^{\prime}(x)|\leq C (39)

hold simultaneously. By Theorem 14, \mathcal{E} occurs with probability 1O(nλ)1-O(n^{-\lambda}). Now, conditional on the data 𝒳n\mathcal{X}_{n} and for any z1,,zB1z_{1},\ldots,z_{B-1}\in\mathbb{R},

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B}) (40)

is expressible as 1Q(q)+Q(q)1-Q^{*}(q)+Q^{*}(-q) for some qq\in\mathbb{R}. Thus, using the oddness and evenness property of pjp_{j} in Theorem 13, we have, under \mathcal{E},

sup<z1,,zB1<||nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)|Cn(ν+1)/2\sup_{-\infty<z_{1},\ldots,z_{B-1}<\infty}\left|\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})-\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})\right|\leq Cn^{-(\nu+1)/2} (41)

for some C>0C>0 by (37). Thus, integrating (40) with respect to dQ(zB1)dQ(z1)dQ^{*}(z_{B-1})\cdots dQ^{*}(z_{1}) and using (41), we get

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1}) (42)
=\displaystyle= |nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q(zB1)𝑑Q(z1)+RB\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})dQ^{*}(z_{B-1})\cdots dQ^{*}(z_{1})+R_{B}

where RBR_{B} satisfies |RB|Cn(ν+1)/2|R_{B}|\leq Cn^{-(\nu+1)/2}. Iterating (42), this time considering

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB1)\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B-1})

and using Fubini’s theorem with the signed measure Q^\hat{Q}^{*}, we have

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q(zB1)𝑑Q(z1)\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})dQ^{*}(z_{B-1})\cdots dQ^{*}(z_{1})
=\displaystyle= |nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q^(zB1)𝑑Q(zB2)𝑑Q(z1)+RB1\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})d\hat{Q}^{*}(z_{B-1})dQ^{*}(z_{B-2})\cdots dQ^{*}(z_{1})+R_{B-1}

where RB1R_{B-1} satisfies

|RB1|Cn(ν+1)/2|(Q^)(zB)|𝑑zB𝑑Q(zB2)𝑑Q(z1)=Cn(ν+1)/2|(Q^)(zB)|𝑑zBCn(ν+1)/2|R_{B-1}|\leq Cn^{-(\nu+1)/2}\int\cdots\int|(\hat{Q}^{*})^{\prime}(z_{B})|dz_{B}dQ^{*}(z_{B-2})\cdots dQ^{*}(z_{1})=Cn^{-(\nu+1)/2}\int|(\hat{Q}^{*})^{\prime}(z_{B})|dz_{B}\leq Cn^{-(\nu+1)/2}

by (38) and (39) (where the last CC is a different constant from the previous one). Continuing in this fashion, we get

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1})
=\displaystyle= |nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q^(z1)+RB+RB1++R1\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})\cdots d\hat{Q}^{*}(z_{1})+R_{B}+R_{B-1}+\cdots+R_{1}

where each RbR_{b} satisfies |Rb|Cn(ν+1)/2|R_{b}|\leq Cn^{-(\nu+1)/2}. This gives (35).

Now consider (34), which we can write as

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1})\right] (43)
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1});\mathcal{E}\right]{}
+E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1);c]\displaystyle{}+E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1});\mathcal{E}^{c}\right]
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1);]+O(nλ)\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1});\mathcal{E}\right]+O(n^{-\lambda}){}
  since |nAs(𝐗¯)(1/B)b=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)1\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{(1/B)\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1})\leq 1
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q^(z1);]+O(n(ν+1)/2)+O(nλ)\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})\cdots d\hat{Q}^{*}(z_{1});\mathcal{E}\right]+O(n^{-(\nu+1)/2})+O(n^{-\lambda}){}
  by (35)

Using (36), we write the first term in (43) as

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q^(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})\cdots d\hat{Q}^{*}(z_{1});\mathcal{E}\right] (44)
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(Φ(zB)+j=1,,νj evennj/2p^j(zB)ϕ(zB))\displaystyle E\Bigg{[}\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\left(\Phi(z_{B})+\sum_{\begin{subarray}{c}j=1,\ldots,\nu\\ j\text{\ even}\end{subarray}}n^{-j/2}\hat{p}_{j}(z_{B})\phi(z_{B})\right)\cdots{}
d(Φ(z1)+j=1,,νj evennj/2p^j(z1)ϕ(z1));]\displaystyle{}d\left(\Phi(z_{1})+\sum_{\begin{subarray}{c}j=1,\ldots,\nu\\ j\text{\ even}\end{subarray}}n^{-j/2}\hat{p}_{j}(z_{1})\phi(z_{1})\right);\mathcal{E}\Bigg{]}
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1});\mathcal{E}\right]{}
+BnE[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p^2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle{}+\frac{B}{n}E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(\hat{p}_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]{}
+O(1n2)\displaystyle{}+O\left(\frac{1}{n^{2}}\right)

where the last equality follows by expanding out the product for ν2\nu\geq 2, using the symmetry among z1,,zBz_{1},\ldots,z_{B} to get the second term, and using (38) and (39) to get the last remainder term.

Next, we can write

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p^2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(\hat{p}_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right] (45)
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]{}
+E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d((p^2(zB)p2(zB))ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle{}+E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d((\hat{p}_{2}(z_{B})-p_{2}(z_{B}))\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]{}
+E[poly(μ^m1,,md,m1++md4)poly(μm1,,md,m1++md4);]\displaystyle{}+E\left[\text{poly}(\hat{\mu}_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4)-\text{poly}(\mu_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4);\mathcal{E}\right]

where poly(μ^m1,,md,m1++md4)\text{poly}(\hat{\mu}_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4) and poly(μm1,,md,m1++md4)\text{poly}(\mu_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4) denote the same polynomial of bounded degree in μ^m1,,md\hat{\mu}_{m_{1},\ldots,m_{d}} or μm1,,md\mu_{m_{1},\ldots,m_{d}} for m1++md4m_{1}+\cdots+m_{d}\leq 4, and where the coefficients of the polynomial consist of linear combinations of terms of the form

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(zBkϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(z_{B}^{k}\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})

for some positive integer k5k\leq 5, which is absolutely bounded by

B|(zBkϕ(zB))|𝑑zB𝑑Φ(zB1)𝑑Φ(z1)C\int\cdots\int_{\mathbb{R}^{B}}|(z_{B}^{k}\phi(z_{B}))^{\prime}|dz_{B}d\Phi(z_{B-1})\cdots d\Phi(z_{1})\leq C

for some constant C>0C>0 independent of k5k\leq 5. Now, with E𝐗l<E\|\mathbf{X}\|^{l}<\infty for sufficiently large ll, uniform integrability gives the convergence of the power moments of the sample moments, i.e.,

E|μ^m1,,mdkμm1,,mdk|=O(1n)E\left|\hat{\mu}_{m_{1},\ldots,m_{d}}^{k}-\mu_{m_{1},\ldots,m_{d}}^{k}\right|=O\left(\frac{1}{\sqrt{n}}\right)

where m1++md4m_{1}+\cdots+m_{d}\leq 4 and any k5k\leq 5. So we have

|E[poly(μ^m1,,md,m1++md4)poly(μm1,,md,m1++md4);]|\displaystyle\left|E\left[\text{poly}(\hat{\mu}_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4)-\text{poly}(\mu_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4);\mathcal{E}\right]\right|
\displaystyle\leq CE[m1++md4|μ^m1,,mdkμm1,,mdk|]\displaystyle CE\left[\sum_{m_{1}+\cdots+m_{d}\leq 4}\left|\hat{\mu}_{m_{1},\ldots,m_{d}}^{k}-\mu_{m_{1},\ldots,m_{d}}^{k}\right|\right]
=\displaystyle= O(1n)\displaystyle O\left(\frac{1}{\sqrt{n}}\right)

where CC in the inequality could be different from the previous CC.

On the other hand,

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]+O(nλ)\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]+O(n^{-\lambda})

since

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);c]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}^{c}\right]
\displaystyle\leq E[|(p2(zB)ϕ(zB))|𝑑zB𝑑Φ(zB1)𝑑Φ(z1);c]=O(nλ)\displaystyle E\left[\int\cdots\int|(p_{2}(z_{B})\phi(z_{B}))^{\prime}|dz_{B}d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}^{c}\right]=O(n^{-\lambda})

So (45) becomes

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]+O(nλ)+O(1n)E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]+O(n^{-\lambda})+O\left(\frac{1}{\sqrt{n}}\right) (46)

Similarly,

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1});\mathcal{E}\right] (47)
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)]+O(nλ)\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right]+O(n^{-\lambda})

Combining (43), (44), (46) and (47), we can write (34) as

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)]+O(nλ)\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right]+O(n^{-\lambda}){} (48)
+Bn{E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]+O(nλ)+O(1n)}\displaystyle{}+\frac{B}{n}\left\{E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]+O(n^{-\lambda})+O\left(\frac{1}{\sqrt{n}}\right)\right\}{}
+O(1n2)+O(n(ν+1)/2)+O(nλ)\displaystyle{}+O\left(\frac{1}{n^{2}}\right)+O(n^{-(\nu+1)/2})+O(n^{-\lambda})
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right]{}
+BnE[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]\displaystyle{}+\frac{B}{n}E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]{}
+O(nλ)+O(1n3/2)+O(n(ν+1)/2)\displaystyle{}+O(n^{-\lambda})+O\left(\frac{1}{n^{3/2}}\right)+O(n^{-(\nu+1)/2})
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right]{}
+BnE[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]+O(1n3/2)\displaystyle{}+\frac{B}{n}E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]+O\left(\frac{1}{n^{3/2}}\right)

when λ3/2\lambda\geq 3/2 and ν2\nu\geq 2 so that (ν+1)/23/2(\nu+1)/2\geq 3/2.

Lastly, using Theorem 13, and a similar argument as before with Fubini’s theorem and the observation that

P(|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2|z1,,zB)P\left(\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}\Bigg{|}z_{1},\ldots,z_{B}\right)

is expressible as P(nAs(𝐗¯)q)P(nAs(𝐗¯)q)P(\sqrt{n}A_{s}(\overline{\mathbf{X}})\leq q)-P(\sqrt{n}A_{s}(\overline{\mathbf{X}})\leq-q) for some qq\in\mathbb{R}, we have

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right] (49)
=\displaystyle= |z01Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)𝑑Φ(z0)\displaystyle\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})d\Phi(z_{0}){}
+1n|z01Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)d(q2(z0)ϕ(z0))+o(1n)\displaystyle{}+\frac{1}{n}\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})d(q_{2}(z_{0})\phi(z_{0}))+o\left(\frac{1}{n}\right)

when ν2\nu\geq 2. Similarly,

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right] (50)
=\displaystyle= |z01Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z0)+O(1n)\displaystyle\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{0})+O\left(\frac{1}{n}\right)

Thus, using (49) and (50), we can write (48) as

|z01Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)𝑑Φ(z0)\displaystyle\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})d\Phi(z_{0}){}
+1n{B|z01Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))dΦ(zB1)dΦ(z0)\displaystyle{}+\frac{1}{n}\Bigg{\{}B\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{0}){}
+|z01Bb=1Bzb2|tB,1α/2dΦ(zB)dΦ(z1)d(q2(z0)ϕ(z0))}+o(1n)\displaystyle{}+\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})d(q_{2}(z_{0})\phi(z_{0}))\Bigg{\}}+o\left(\frac{1}{n}\right) (51)

which gives the first part of the theorem. Note that when ν3\nu\geq 3, the remainder term in (49) is refined to o(1/n3/2)o(1/n^{3/2}) and, as a result, the remainder term in (51) is refined to O(1/n3/2)O(1/n^{3/2}).

The second part of the theorem follows analogously by replacing the event |g(𝐗¯)g(𝝁)1Bb=1B(g(𝐗¯b)g(𝐗¯))2|tB,1α/2\left|\frac{g(\overline{\mathbf{X}})-g(\bm{\mu})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(g(\overline{\mathbf{X}}^{*b})-g(\overline{\mathbf{X}}))^{2}}}\right|\leq t_{B,1-\alpha/2} with g(𝐗¯)g(𝝁)1Bb=1B(g(𝐗¯b)g(𝐗¯))2tB,1α\frac{g(\overline{\mathbf{X}})-g(\bm{\mu})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(g(\overline{\mathbf{X}}^{*b})-g(\overline{\mathbf{X}}))^{2}}}\leq t_{B,1-\alpha} or g(𝐗¯)g(𝝁)1Bb=1B(g(𝐗¯b)g(𝐗¯))2tB,1α\frac{g(\overline{\mathbf{X}})-g(\bm{\mu})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(g(\overline{\mathbf{X}}^{*b})-g(\overline{\mathbf{X}}))^{2}}}\geq t_{B,1-\alpha}. Consider now λ>1/2\lambda>1/2. Because of the aforementioned change of the considered event, now (36) is replaced by

Q^(x)=Φ(x)+j=1νnj/2p^j(x)ϕ(x)\hat{Q}^{*}(x)=\Phi(x)+\sum_{j=1}^{\nu}n^{-j/2}\hat{p}_{j}(x)\phi(x)

and (44) becomes, in the upper interval case,

=\displaystyle= E[nAs(𝐗¯)1Bb=1Bzb2tB,1α𝑑Φ(zB)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1});\mathcal{E}\right]{}
+BnE[nAs(𝐗¯)1Bb=1Bzb2tB,1αd(p^1(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]+O(1n)\displaystyle{}+\frac{B}{\sqrt{n}}E\left[\int\cdots\int_{\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d(\hat{p}_{1}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]+O\left(\frac{1}{n}\right)

for ν1\nu\geq 1 giving a modified (48) as

E[nAs(𝐗¯)1Bb=1Bzb2tB,1α𝑑Φ(zB)𝑑Φ(z1)]\displaystyle E\left[\int\cdots\int_{\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right]{}
+BnE[nAs(𝐗¯)1Bb=1Bzb2tB,1αd(p1(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]+o(1n)\displaystyle{}+\frac{B}{\sqrt{n}}E\left[\int\cdots\int_{\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d(p_{1}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]+o\left(\frac{1}{\sqrt{n}}\right) (52)

when λ>1/2\lambda>1/2 and ν1\nu\geq 1. Moreover, (49) becomes

z01Bb=1Bzb2tB,1α𝑑Φ(zB)𝑑Φ(z1)𝑑Φ(z0)\displaystyle\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1})d\Phi(z_{0}){}
+1nz01Bb=1Bzb2tB,1α𝑑Φ(zB)𝑑Φ(z1)d(q1(z0)ϕ(z0))+o(1n)\displaystyle{}+\frac{1}{\sqrt{n}}\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1})d(q_{1}(z_{0})\phi(z_{0}))+o\left(\frac{1}{\sqrt{n}}\right) (53)

and (50) becomes

z01Bb=1Bzb2tB,1αd(p1(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z0)+O(1n)\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d(p_{1}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{0})+O\left(\frac{1}{\sqrt{n}}\right)

giving rise to a modified (51) as

z01Bb=1Bzb2tB,1α𝑑Φ(zB)𝑑Φ(z1)𝑑Φ(z0)\displaystyle\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1})d\Phi(z_{0}){}
+1n{Bz01Bb=1Bzb2tB,1αd(p1(zB)ϕ(zB))dΦ(zB1)dΦ(z0)\displaystyle{}+\frac{1}{\sqrt{n}}\Bigg{\{}B\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d(p_{1}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{0}){}
+z01Bb=1Bzb2tB,1αdΦ(zB)dΦ(z1)d(q1(z0)ϕ(z0))}+o(1n)\displaystyle{}+\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1})d(q_{1}(z_{0})\phi(z_{0}))\Bigg{\}}+o\left(\frac{1}{\sqrt{n}}\right) (54)

which gives the upper interval case in the second part of the theorem. The lower interval case follows analogously. Moreover, when λ1\lambda\geq 1 and ν2\nu\geq 2, the remainder term in (52) is refined to O(1/n)O(1/n), the remainder term in (53) is refined to O(1/n)O(1/n), and, as a result, the remainder term in (54) is refined to O(1/n)O(1/n). ∎

Proof of Theorem 7.

Using the same argument as in (8) in the proof of Theorem 1, we have nSσχB2B\sqrt{n}S\Rightarrow\sigma\sqrt{\frac{\chi^{2}_{B}}{B}} and hence

n(σχα/2,B2BnSσχ1α/2,B2B)1α\mathbb{P}_{n}\left(\sigma\sqrt{\frac{\chi^{2}_{\alpha/2,B}}{B}}\leq\sqrt{n}S\leq\sigma\sqrt{\frac{\chi^{2}_{1-\alpha/2,B}}{B}}\right)\to 1-\alpha

The conclusion then follows. ∎

Proof of Theorem 8.

A straightforward modification of Proposition 1 from the univariate to the multivariate n(ψ^nψ)\sqrt{n}(\hat{\psi}_{n}-\psi) and n(ψnbψ^n)\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n}) gives the joint weak convergence

n(ψ^nψ,ψn1ψ^n,,ψnBψ^n)(Z0,Z1,,ZB)\sqrt{n}(\hat{\psi}_{n}-\psi,\ \psi_{n}^{*1}-\hat{\psi}_{n},\ldots,\ \psi_{n}^{*B}-\hat{\psi}_{n})\Rightarrow(Z_{0},Z_{1},\ldots,Z_{B}) (55)

where Z0,Z1,,ZBdZ_{0},Z_{1},\ldots,Z_{B}\in\mathbb{R}^{d} are i.i.d. N(0,Σ)N(0,\Sigma). By the continuous mapping theorem, we have

(ψ^nψ)S1(ψ^nψ)=(n(ψ^nψ))(nS)1(n(ψ^nψ))Td,B2(\hat{\psi}_{n}-\psi)^{\top}S^{-1}(\hat{\psi}_{n}-\psi)=(\sqrt{n}(\hat{\psi}_{n}-\psi))^{\top}(nS)^{-1}(\sqrt{n}(\hat{\psi}_{n}-\psi))\Rightarrow T^{2}_{d,B}

where Td,B2T^{2}_{d,B} denotes Hotelling’s T2T^{2} variable with parameters dd and BB. Hence

n(ψ)=n((ψ^nψ)S1(ψ^nψ)Td,B,1α2)1α\mathbb{P}_{n}(\psi\in\mathcal{R})=\mathbb{P}_{n}\left((\hat{\psi}_{n}-\psi)^{\top}S^{-1}(\hat{\psi}_{n}-\psi)\leq T^{2}_{d,B,1-\alpha}\right)\to 1-\alpha

which concludes the theorem. ∎

Proof of Proposition 4.

This is the same as Proposition 2 except we now consider general dd when using Theorem 12 in Appendix F.2. ∎

Appendix C Further Details for Section 4

C.1 Proofs for the Beginning of Section 4

Proof of Proposition 3.

We focus on τ2()\tau^{2}(\cdot), as the argument for κ3()\kappa_{3}(\cdot) is the same. Suppose τ2()\tau^{2}(\cdot) is Hadamard differentiable and satisfies the assumptions in Proposition 2. Then, by Proposition 2, Assumption 1 holds for τ2()\tau^{2}(\cdot), so n(τ2(P^n)τ2(P))\sqrt{n}(\tau^{2}(\hat{P}_{n})-\tau^{2}(P)) converges weakly to a tight random variable. Hence τ2(P^n)pτ2(P)\tau^{2}(\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}\tau^{2}(P) by Slutsky's theorem. Moreover, by Proposition 1, n(τ2(Pn)τ2(P^n))\sqrt{n}(\tau^{2}(P_{n}^{*})-\tau^{2}(\hat{P}_{n})) also converges weakly to a tight random variable, and hence τ2(Pn)τ2(P^n)p0\tau^{2}(P_{n}^{*})-\tau^{2}(\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}0 by Slutsky's theorem again. Thus, τ2(Pn)τ2(P)=(τ2(Pn)τ2(P^n))+(τ2(P^n)τ2(P))p0\tau^{2}(P_{n}^{*})-\tau^{2}(P)=(\tau^{2}(P_{n}^{*})-\tau^{2}(\hat{P}_{n}))+(\tau^{2}(\hat{P}_{n})-\tau^{2}(P))\stackrel{{\scriptstyle p}}{{\to}}0, once again by Slutsky's theorem, which concludes the proposition. ∎

Proof of Theorem 3.

We divide the proof into two steps:

Step 1. We first show the convergence of the following joint distribution

(n(ψ^nψ)σ,n(ψn1ψ^n)σ,,n(ψnBψ^n)σ,\displaystyle\Bigg{(}\frac{\sqrt{n}(\hat{\psi}_{n}-\psi)}{\sigma},\frac{\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})}{\sigma},\ldots,\frac{\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})}{\sigma},{} (56)
R0(ψ^^n,R0ψ^n)τ,R(ψn,R1ψn1)τ,,R(ψn,RBψnB)τ)\displaystyle\frac{\sqrt{R_{0}}(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n})}{\tau},\frac{\sqrt{R}(\psi_{n,R}^{**1}-\psi_{n}^{*1})}{\tau},\ldots,\frac{\sqrt{R}(\psi_{n,R}^{**B}-\psi_{n}^{*B})}{\tau}\Bigg{)}
\displaystyle\Rightarrow (Z0,Z1,,ZB,W0,W1,,WB)\displaystyle(Z_{0},Z_{1},\ldots,Z_{B},W_{0},W_{1},\ldots,W_{B})

where Z0,Z1,,ZB,W0,W1,,WBi.i.d.N(0,1)Z_{0},Z_{1},\ldots,Z_{B},W_{0},W_{1},\ldots,W_{B}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1).

For convenience, denote

Z0n\displaystyle Z_{0}^{n} =n(ψ^nψ)σ\displaystyle=\frac{\sqrt{n}(\hat{\psi}_{n}-\psi)}{\sigma}
Zbn\displaystyle Z_{b}^{n} =n(ψnbψ^n)σ,b=1,,B\displaystyle=\frac{\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n})}{\sigma},\ b=1,\ldots,B
W0n\displaystyle W_{0}^{n} =R0(ψ^^n,R0ψ^n)τ\displaystyle=\frac{\sqrt{R_{0}}(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n})}{\tau}
Wbn\displaystyle W_{b}^{n} =R(ψn,Rbψnb)τ,b=1,,B\displaystyle=\frac{\sqrt{R}(\psi_{n,R}^{**b}-\psi_{n}^{*b})}{\tau},\ b=1,\ldots,B

Also, denote τ^n=τ(P^n)\hat{\tau}^{n}=\tau(\hat{P}_{n}) and κ^3n=κ3(P^n)\hat{\kappa}_{3}^{n}=\kappa_{3}(\hat{P}_{n}).

Then, we have, for any fixed real constants zb,wbz_{b},w_{b} for b=0,,Bb=0,\ldots,B,

|P(Zbnzb,Wbnwb,b=0,,B)b=0BΦ(zb)Φ(wb)|\displaystyle\left|P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=0,\ldots,B\right)-\prod_{b=0}^{B}\Phi(z_{b})\Phi(w_{b})\right| (57)
=\displaystyle= |E[I(Z0nz0,W0nw0)P(Zbnzb,Wbnwb,b=1,,B|P^n,ξR0)]b=0BΦ(zb)Φ(wb)|\displaystyle\left|E\left[I(Z_{0}^{n}\leq z_{0},W_{0}^{n}\leq w_{0})P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}\hat{P}_{n},\xi_{R_{0}}\right)\right]-\prod_{b=0}^{B}\Phi(z_{b})\Phi(w_{b})\right|
              where ξR0\xi_{R_{0}} refers to all the computation randomness in generating ψ^^n,R0\hat{\hat{\psi}}_{n,R_{0}},
              by noting that Z0nZ_{0}^{n} and W0nW_{0}^{n} are determined solely by P^n\hat{P}_{n} and ξR0\xi_{R_{0}}
=\displaystyle= |E[I(Z0nz0,W0nw0)P(Zbnzb,Wbnwb,b=1,,B|P^n,ξR0)]\displaystyle\Bigg{|}E\left[I(Z_{0}^{n}\leq z_{0},W_{0}^{n}\leq w_{0})P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}\hat{P}_{n},\xi_{R_{0}}\right)\right]{}
P(Z0nz0,W0nw0)b=1BΦ(zb)Φ(wb)+P(Z0nz0,W0nw0)b=1BΦ(zb)Φ(wb)b=0BΦ(zb)Φ(wb)|\displaystyle{}-P(Z_{0}^{n}\leq z_{0},W_{0}^{n}\leq w_{0})\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})+P(Z_{0}^{n}\leq z_{0},W_{0}^{n}\leq w_{0})\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})-\prod_{b=0}^{B}\Phi(z_{b})\Phi(w_{b})\Bigg{|}
\displaystyle\leq E[|P(Zbnzb,Wbnwb,b=1,,B|P^,ξR0)b=1BΦ(zb)Φ(wb)|;Z0nz0,W0nw0]\displaystyle E\left[\left|P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}\hat{P},\xi_{R_{0}}\right)-\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})\right|;Z_{0}^{n}\leq z_{0},\ W_{0}^{n}\leq w_{0}\right]{}
+|P(Z0nz0,W0nw0)Φ(z0)Φ(w0)|b=1BΦ(zb)Φ(wb)\displaystyle{}+\left|P(Z_{0}^{n}\leq z_{0},\ W_{0}^{n}\leq w_{0})-\Phi(z_{0})\Phi(w_{0})\right|\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b}){}
              by the triangle inequality
\displaystyle\leq E|P(Zbnzb,Wbnwb,b=1,,B|P^,ξR0)b=1BΦ(zb)Φ(wb)|\displaystyle E\left|P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}\hat{P},\xi_{R_{0}}\right)-\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})\right|{}
+|P(Z0nz0,W0nw0)Φ(z0)Φ(w0)|b=1BΦ(zb)Φ(wb)\displaystyle{}+\left|P(Z_{0}^{n}\leq z_{0},\ W_{0}^{n}\leq w_{0})-\Phi(z_{0})\Phi(w_{0})\right|\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})

We consider the two terms in (57) one by one, starting with the second term. By conditioning on P^n\hat{P}_{n} and telescoping, we have

|P(Z0nz0,W0nw0)Φ(z0)Φ(w0)|\displaystyle\left|P(Z_{0}^{n}\leq z_{0},\ W_{0}^{n}\leq w_{0})-\Phi(z_{0})\Phi(w_{0})\right| (58)
\displaystyle\leq E|P(W0nw0|P^n)Φ(w0)|+|P(Z0nz0)Φ(z0)|Φ(w0)\displaystyle E|P(W_{0}^{n}\leq w_{0}|\hat{P}_{n})-\Phi(w_{0})|+|P(Z_{0}^{n}\leq z_{0})-\Phi(z_{0})|\Phi(w_{0})

The first term in (58) can be bounded from above by

P(|τ^nτ|>δ or κ^3n>κ3+δ)+E[|P(W0nw0|P^n)Φ(w0)|;|τ^nτ|δ,κ^3nκ3+δ]P(|\hat{\tau}^{n}-\tau|>\delta\text{\ or\ }\hat{\kappa}_{3}^{n}>\kappa_{3}+\delta)+E[|P(W_{0}^{n}\leq w_{0}|\hat{P}_{n})-\Phi(w_{0})|;|\hat{\tau}^{n}-\tau|\leq\delta,\hat{\kappa}_{3}^{n}\leq\kappa_{3}+\delta]

for some small δ>0\delta>0, where the second term can be written as

E[|P(W0nττ^nw0ττ^n|P^n)Φ(w0ττ^n)|+|Φ(w0ττ^n)Φ(w0)|;|τ^nτ|δ,κ^3nκ3+δ]\displaystyle E\left[\left|P\left(\frac{W_{0}^{n}\tau}{\hat{\tau}^{n}}\leq\frac{w_{0}\tau}{\hat{\tau}^{n}}\Bigg{|}\hat{P}_{n}\right)-\Phi\left(\frac{w_{0}\tau}{\hat{\tau}^{n}}\right)\right|+\left|\Phi\left(\frac{w_{0}\tau}{\hat{\tau}^{n}}\right)-\Phi(w_{0})\right|;|\hat{\tau}^{n}-\tau|\leq\delta,\hat{\kappa}_{3}^{n}\leq\kappa_{3}+\delta\right]
\displaystyle\leq E[C1κ^3n(τ^n)3R0+|Φ(w0ττ^n)Φ(w0)|;|τ^nτ|δ,κ^3nκ3+δ]\displaystyle E\left[\frac{C_{1}\hat{\kappa}_{3}^{n}}{(\hat{\tau}^{n})^{3}\sqrt{R_{0}}}+\left|\Phi\left(\frac{w_{0}\tau}{\hat{\tau}^{n}}\right)-\Phi(w_{0})\right|;|\hat{\tau}^{n}-\tau|\leq\delta,\hat{\kappa}_{3}^{n}\leq\kappa_{3}+\delta\right]

for some constant C1>0C_{1}>0 by the Berry-Esseen theorem, which is further bounded from above by

E[C1(κ3+δ)(τδ)3R0+C2δτδ;|τ^nτ|δ,κ^3nκ3+δ]C1(κ3+δ)(τδ)3R0+C2δτδE\left[\frac{C_{1}(\kappa_{3}+\delta)}{(\tau-\delta)^{3}\sqrt{R_{0}}}+\frac{C_{2}\delta}{\tau-\delta};|\hat{\tau}^{n}-\tau|\leq\delta,\hat{\kappa}_{3}^{n}\leq\kappa_{3}+\delta\right]\leq\frac{C_{1}(\kappa_{3}+\delta)}{(\tau-\delta)^{3}\sqrt{R_{0}}}+\frac{C_{2}\delta}{\tau-\delta}

for some constant C2>0C_{2}>0, which follows from applying the mean value theorem to the function Φ(w0τ/)\Phi(w_{0}\tau/\cdot) and noting that the function xϕ(x)x\phi(x) is bounded over xx\in\mathbb{R}. Hence, the first term in (58) is bounded from above by

P(|τ^nτ|>δ or κ^3n>κ3+δ)+C1(κ3+δ)(τδ)3R0+C2δτδP(|\hat{\tau}^{n}-\tau|>\delta\text{\ or\ }\hat{\kappa}_{3}^{n}>\kappa_{3}+\delta)+\frac{C_{1}(\kappa_{3}+\delta)}{(\tau-\delta)^{3}\sqrt{R_{0}}}+\frac{C_{2}\delta}{\tau-\delta} (59)

Since τ^npτ\hat{\tau}^{n}\stackrel{{\scriptstyle p}}{{\to}}\tau and κ^3npκ3\hat{\kappa}_{3}^{n}\stackrel{{\scriptstyle p}}{{\to}}\kappa_{3} in Assumption 2, given arbitrary ϵ>0\epsilon>0, we can choose a small enough δ>0\delta>0, a large enough nn and a large enough R0R_{0} such that (59) is bounded above by ϵ\epsilon. Thus, the first term in (58) converges to 0 as n,R0n,R_{0}\to\infty. The second term in (58) converges to 0 as nn\to\infty by Assumption 1. We therefore have the second term in (57) go to 0 as n,R0n,R_{0}\to\infty.

We handle the first term in (57) with a similar argument. We have

E|P(Zbnzb,Wbnwb,b=1,,B|P^n,ξR0)b=1BΦ(zb)Φ(wb)|\displaystyle E\left|P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}\hat{P}_{n},\xi_{R_{0}}\right)-\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})\right| (60)
=\displaystyle= E|E[I(Zbnzb,b=1,,B)P(Wbnwb,b=1,,B|Pnb,b=1,,B,P^n,ξR0)|P^n,ξR0]\displaystyle E\Bigg{|}E\left[I(Z_{b}^{n}\leq z_{b},b=1,\ldots,B)P\left(W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}P_{n}^{*b},b=1,\ldots,B,\hat{P}_{n},\xi_{R_{0}}\right)\Big{|}\hat{P}_{n},\xi_{R_{0}}\right]{}
P(Zbnzb,b=1,,B|P^n,ξR0)b=1BΦ(wb)+P(Zbnzb,b=1,,B|P^n,ξR0)b=1BΦ(wb)\displaystyle{}-P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n},\xi_{R_{0}}\right)\prod_{b=1}^{B}\Phi(w_{b})+P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n},\xi_{R_{0}}\right)\prod_{b=1}^{B}\Phi(w_{b}){}
b=1BΦ(zb)Φ(wb)|\displaystyle{}-\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})\Bigg{|}
              by noting that Zbn,b=1,,BZ_{b}^{n},b=1,\ldots,B are determined solely by Pnb,b=1,,BP_{n}^{*b},b=1,\ldots,B and P^n\hat{P}_{n}
\displaystyle\leq E|P(Wbnwb,b=1,,B|Pnb,b=1,,B,P^n,ξR0)b=1BΦ(wb)|\displaystyle E\left|P\left(W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}P_{n}^{*b},b=1,\ldots,B,\hat{P}_{n},\xi_{R_{0}}\right)-\prod_{b=1}^{B}\Phi(w_{b})\right|
+E|P(Zbnzb,b=1,,B|P^n)b=1BΦ(zb)|b=1BΦ(wb)\displaystyle{}+E\left|P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n}\right)-\prod_{b=1}^{B}\Phi(z_{b})\right|\prod_{b=1}^{B}\Phi(w_{b})
              by the triangle and Jensen inequalities
=\displaystyle= E|b=1BP(Wbnwb|Pnb)b=1BΦ(wb)|+E|P(Zbnzb,b=1,,B|P^n)b=1BΦ(zb)|b=1BΦ(wb)\displaystyle E\left|\prod_{b=1}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)-\prod_{b=1}^{B}\Phi(w_{b})\right|+E\left|P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n}\right)-\prod_{b=1}^{B}\Phi(z_{b})\right|\prod_{b=1}^{B}\Phi(w_{b})
              by the conditional independence of Wbn,b=1,,BW_{b}^{n},b=1,\ldots,B given Pnb,b=1,,BP_{n}^{*b},b=1,\ldots,B and that,
              given PnbP_{n}^{*b}, WbnW_{b}^{n} is independent of PnkP_{n}^{*k} for kbk\neq b and P^n\hat{P}_{n} and ξR0\xi_{R_{0}}
=\displaystyle= E|b=1BP(Wbnwb|Pnb)Φ(w1)b=2BP(Wbnwb|Pnb)+Φ(w1)b=2BP(Wbnwb|Pnb)\displaystyle E\Bigg{|}\prod_{b=1}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)-\Phi(w_{1})\prod_{b=2}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)+\Phi(w_{1})\prod_{b=2}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right){}
Φ(w1)Φ(w2)b=3BP(Wbnwb|Pnb)+Φ(w1)Φ(w2)b=3BP(Wbnwb|Pnb)b=1BΦ(wb)|\displaystyle{}-\Phi(w_{1})\Phi(w_{2})\prod_{b=3}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)+\Phi(w_{1})\Phi(w_{2})\prod_{b=3}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)-\cdots-\prod_{b=1}^{B}\Phi(w_{b})\Bigg{|}{}
+E|P(Zbnzb,b=1,,B|P^n)b=1BΦ(zb)|b=1BΦ(wb)\displaystyle{}+E\left|P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n}\right)-\prod_{b=1}^{B}\Phi(z_{b})\right|\prod_{b=1}^{B}\Phi(w_{b})
\displaystyle\leq E[b=1B|P(Wbnwb|Pnb)Φ(wb)|]+E|P(Zbnzb,b=1,,B|P^n)b=1BΦ(zb)|b=1BΦ(wb)\displaystyle E\left[\sum_{b=1}^{B}\left|P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)-\Phi(w_{b})\right|\right]+E\left|P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n}\right)-\prod_{b=1}^{B}\Phi(z_{b})\right|\prod_{b=1}^{B}\Phi(w_{b})
              by the triangle inequality

Note that in the first term in (60), each

E|P(Wbnwb|Pb)Φ(wb)|E\left|P\left(W_{b}^{n}\leq w_{b}\big{|}P^{*b}\right)-\Phi(w_{b})\right|

converges to 0 as n,Rn,R\to\infty by the same argument as for the first term in (58), except that we use the bootstrapped moments τ(Pn)pτ\tau(P_{n}^{*})\stackrel{{\scriptstyle p}}{{\to}}\tau and κ3(Pn)pκ3\kappa_{3}(P_{n}^{*})\stackrel{{\scriptstyle p}}{{\to}}\kappa_{3} in Assumption 2 instead of τ^npτ\hat{\tau}^{n}\stackrel{{\scriptstyle p}}{{\to}}\tau and κ^3npκ3\hat{\kappa}_{3}^{n}\stackrel{{\scriptstyle p}}{{\to}}\kappa_{3}. The second term in (60) also goes to 0 as nn\to\infty by Assumption 1 and the dominated convergence theorem.

Therefore, (57) goes to 0. This proves (56).

Step 2. We consider the following decompositions

ψ^^n,R0ψ=(ψ^^n,R0ψ^n)+(ψ^nψ)\hat{\hat{\psi}}_{n,R_{0}}-\psi=(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n})+(\hat{\psi}_{n}-\psi)

and

ψn,Rbψ^^n,R0=(ψn,Rbψnb)+(ψnbψ^n)+(ψ^nψ^^n,R0)\psi_{n,R}^{**b}-\hat{\hat{\psi}}_{n,R_{0}}=(\psi^{**b}_{n,R}-\psi_{n}^{*b})+(\psi_{n}^{*b}-\hat{\psi}_{n})+(\hat{\psi}_{n}-\hat{\hat{\psi}}_{n,R_{0}})

Then apply the continuous mapping theorem to get

n(ψ^^n,R0ψ,ψn,R1ψ^^n,R0,,ψn,RBψ^^n,R0)\displaystyle\sqrt{n}\left(\hat{\hat{\psi}}_{n,R_{0}}-\psi,\ \psi_{n,R}^{**1}-\hat{\hat{\psi}}_{n,R_{0}},\ldots,\ \psi_{n,R}^{**B}-\hat{\hat{\psi}}_{n,R_{0}}\right)
=\displaystyle= (nR0R0(ψ^^n,R0ψ^n)+n(ψ^nψ),nRR(ψn,R1ψn1)+n(ψn1ψ^n)nR0R0(ψ^^n,R0ψ^n),\displaystyle\Bigg{(}\sqrt{\frac{n}{R_{0}}}\sqrt{R_{0}}(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n})+\sqrt{n}(\hat{\psi}_{n}-\psi),\sqrt{\frac{n}{R}}\sqrt{R}(\psi_{n,R}^{**1}-\psi_{n}^{*1})+\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})-\sqrt{\frac{n}{R_{0}}}\sqrt{R_{0}}(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n}),{}
,nRR(ψn,RBψnB)+n(ψnBψ^n)nR0R0(ψ^^n,R0ψ^n))\displaystyle{}\ldots,\sqrt{\frac{n}{R}}\sqrt{R}(\psi_{n,R}^{**B}-\psi_{n}^{*B})+\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})-\sqrt{\frac{n}{R_{0}}}\sqrt{R_{0}}(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n})\Bigg{)}
\displaystyle\Rightarrow (τp0W0+σZ0,τpW1+σZ1τp0W0,,τpWB+σZBτp0W0)\displaystyle\left(\frac{\tau}{\sqrt{p_{0}}}W_{0}+\sigma Z_{0},\ \frac{\tau}{\sqrt{p}}W_{1}+\sigma Z_{1}-\frac{\tau}{\sqrt{p_{0}}}W_{0},\ldots,\ \frac{\tau}{\sqrt{p}}W_{B}+\sigma Z_{B}-\frac{\tau}{\sqrt{p_{0}}}W_{0}\right)

This concludes the theorem. ∎

C.2 Proofs and Additional Discussions for Section 4.1

We first prove Theorem 4:

Proof of Theorem 4.

Consider the pivotal statistic TO=(ψ^^n,R0ψ)/SOT_{O}=(\hat{\hat{\psi}}_{n,R_{0}}-\psi)/S_{O}. We argue that qO,1α/2q_{O,1-\alpha/2} defined in (18) satisfies

lim infnP(ψO)=lim infnP(|TO|qO,1α/2)1α\liminf_{n\to\infty}P(\psi\in\mathcal{I}_{O})=\liminf_{n\to\infty}P\left(\left|T_{O}\right|\leq q_{O,1-\alpha/2}\right)\geq 1-\alpha

which would conclude that O\mathcal{I}_{O} is an asymptotically valid (1α)(1-\alpha)-level confidence interval.

To this end, by Theorem 3 and the continuous mapping theorem, as nn\to\infty, TOT_{O} converges weakly to

σZ0+τp0W01Bb=1B(σZb+τpWbτp0W0)2\frac{\sigma Z_{0}+\frac{\tau}{\sqrt{p_{0}}}W_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\sigma Z_{b}+\frac{\tau}{\sqrt{p}}W_{b}-\frac{\tau}{\sqrt{p_{0}}}W_{0}\right)^{2}}} (61)

where Z0,Z1,,ZB,W0,W1,,WBi.i.d.N(0,1)Z_{0},Z_{1},\ldots,Z_{B},W_{0},W_{1},\ldots,W_{B}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1). A direct inspection of the homogeneity of the expression reveals that (61) depends on σ\sigma and τ\tau only through their ratio. Multiplying both the numerator and the denominator by the factor p0/τ\sqrt{p_{0}}/\tau, we rewrite (61) as

θZ0+W01Bb=1B(θZb+ρWbW0)2\frac{\theta Z_{0}+W_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\theta Z_{b}+\rho W_{b}-W_{0}\right)^{2}}} (62)

where θ=σp0/τ\theta=\sigma\sqrt{p_{0}}/\tau and ρ=p0/p\rho=\sqrt{p_{0}/p} are as defined before. We will see momentarily that (62) has the same distribution as (19). For now, recall that the distribution function of (19) is F(;θ,ρ)F(\cdot;\theta,\rho), which depends on the unknown θ\theta. We have

lim infnP(ψO)\displaystyle\liminf_{n\to\infty}P(\psi\in\mathcal{I}_{O}) =\displaystyle= lim infnP(|TO|qO,1α/2)=F(qO,1α/2;θ,ρ)F(qO,1α/2;θ,ρ)\displaystyle\liminf_{n\to\infty}P\left(|T_{O}|\leq q_{O,1-\alpha/2}\right)=F\left(q_{O,1-\alpha/2};\theta,\rho\right)-F\left(-q_{O,1-\alpha/2};\theta,\rho\right){}
=\displaystyle= 2F(qO,1α/2;θ,ρ)12minθ0F(qO,1α/2;θ,ρ)1=1α\displaystyle{}2F\left(q_{O,1-\alpha/2};\theta,\rho\right)-1\geq 2\min_{\theta\geq 0}F\left(q_{O,1-\alpha/2};\theta,\rho\right)-1=1-\alpha

where the third equality follows from the symmetry of (19) or (62).

Finally, we see that (62) and (19) have the same distribution, since by expanding the sum of squares in (62) we have

θZ0+W01Bb=1B(θZb+ρWbW0)2=dθZ0+W0(θ2+ρ2)(1Bb=1B(XbX¯)2+X¯2)2θ2+ρ2X¯W0+W02\frac{\theta Z_{0}+W_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\theta Z_{b}+\rho W_{b}-W_{0}\right)^{2}}}\\ \stackrel{{\scriptstyle d}}{{=}}\frac{\theta Z_{0}+W_{0}}{\sqrt{(\theta^{2}+\rho^{2})\left(\frac{1}{B}\sum_{b=1}^{B}(X_{b}-\bar{X})^{2}+\bar{X}^{2}\right)-2\sqrt{\theta^{2}+\rho^{2}}\bar{X}W_{0}+W_{0}^{2}}}

where we have written θZb+ρWb=θ2+ρ2Xb\theta Z_{b}+\rho W_{b}=\sqrt{\theta^{2}+\rho^{2}}X_{b} with X1,,XBi.i.d.N(0,1)X_{1},\ldots,X_{B}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1) which are independent of Z0,W0Z_{0},W_{0}. Noting that b=1B(XbX¯)2χB12\sum_{b=1}^{B}(X_{b}-\bar{X})^{2}\sim\chi^{2}_{B-1}, X¯N(0,1/B)\bar{X}\sim N(0,1/B), which are independent by the property of standard normals, we get (19). ∎

To understand how the additional intricacy from the computation noise affects the interval half-width, we consider the scenario when BB grows. The asymptotic distribution of TOT_{O} given by (19) becomes

θV1+V2θ2+ρ2+V22\frac{\theta V_{1}+V_{2}}{\sqrt{\theta^{2}+\rho^{2}+V_{2}^{2}}} (63)

with V1,V2i.i.d.N(0,1)V_{1},V_{2}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1), since Y/B1Y/B\to 1 and V32/B,V3V2/B0V_{3}^{2}/B,V_{3}V_{2}/\sqrt{B}\to 0 a.s. in (19). Correspondingly, SOS_{O} defined in (17) is distributed approximately as

1n((σ2+τ2p)Y+V32B2τp0σ2+τ2pV3V2B+τ2p0V22)\sqrt{\frac{1}{n}\left(\left(\sigma^{2}+\frac{\tau^{2}}{p}\right)\frac{Y+V_{3}^{2}}{B}-2\frac{\tau}{\sqrt{p_{0}}}\sqrt{\sigma^{2}+\frac{\tau^{2}}{p}}\frac{V_{3}V_{2}}{\sqrt{B}}+\frac{\tau^{2}}{p_{0}}V_{2}^{2}\right)} (64)

which, when BB is large, becomes

σ2+τ2p+τ2p0V22n\sqrt{\frac{\sigma^{2}+\frac{\tau^{2}}{p}+\frac{\tau^{2}}{p_{0}}V_{2}^{2}}{n}}

So the half-width of O\mathcal{I}_{O} behaves like

q~O,1α/2σ2+τ2p+τ2p0V22n=q~O,1α/2σ2n+τ2R+τ2R0V22\tilde{q}_{O,1-\alpha/2}\sqrt{\frac{\sigma^{2}+\frac{\tau^{2}}{p}+\frac{\tau^{2}}{p_{0}}V_{2}^{2}}{n}}=\tilde{q}_{O,1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R}+\frac{\tau^{2}}{R_{0}}V_{2}^{2}} (65)

where q~O,1α/2\tilde{q}_{O,1-\alpha/2} is the maximum (1α/2)(1-\alpha/2)-quantile of (63) over all possible θ0\theta\geq 0. Now, supposing we know the values of σ\sigma and τ\tau, the normality confidence interval obtained from extracting the first component in the limit in (15) is

[ψ^^n,R0z1α/2σ2n+τ2R0,ψ^^n,R0+z1α/2σ2n+τ2R0]\left[\hat{\hat{\psi}}_{n,R_{0}}-z_{1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R_{0}}},\ \hat{\hat{\psi}}_{n,R_{0}}+z_{1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R_{0}}}\right]

thus with a half-width

z1α/2σ2n+τ2R0z_{1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R_{0}}} (66)

The standard error in this half-width has the notable interpretation of being a combination of the data variability σ2/n\sigma^{2}/n and the computation variability τ2/R0\tau^{2}/R_{0}. Suppose in (65) we use R=R0R=R_{0}, so that a resample estimate exhibits the same variability as the original estimate. Comparing (65) and (66), we see that (65) has an additional contribution coming from V2V_{2}. If V2V_{2} were not present, then SOS_{O} would become σ2/n+τ2/R0\sqrt{\sigma^{2}/n+\tau^{2}/R_{0}} and q~O,1α/2\tilde{q}_{O,1-\alpha/2} would become the quantile of V1V_{1}, a standard normal variable, since in this case the maximum quantile over all θ0\theta\geq 0 is approached by choosing θ\theta\to\infty. In other words, without V2V_{2} we would recover the normality interval half-width as BB increases. Thus, the added variability from V2V_{2} can be viewed as a price we pay to handle the additional computation noise without knowledge of σ\sigma and τ\tau.
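As a complement to this discussion, the following small Monte Carlo sketch approximates q̃_{O,1−α/2} by maximizing the empirical (1−α/2)-quantile of the large-B limit (63) over a grid of θ values; the grid and the simulation size are arbitrary illustrative choices.

```python
import numpy as np

def q_tilde_O(rho, alpha=0.05, theta_grid=np.linspace(0.0, 50.0, 201),
              m=200_000, seed=0):
    """Monte Carlo estimate of the maximum (1 - alpha/2)-quantile of
    (theta*V1 + V2) / sqrt(theta^2 + rho^2 + V2^2), the large-B limit (63),
    over a grid of theta >= 0."""
    rng = np.random.default_rng(seed)
    v1, v2 = rng.standard_normal(m), rng.standard_normal(m)
    quantiles = [
        np.quantile((th * v1 + v2) / np.sqrt(th ** 2 + rho ** 2 + v2 ** 2),
                    1 - alpha / 2)
        for th in theta_grid
    ]
    return max(quantiles)

# As theta -> infinity, (63) tends to V1, whose 0.975-quantile is about 1.96;
# the maximized quantile reflects the price paid for the extra V2 term.
print(q_tilde_O(rho=1.0))
```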

C.3 Proofs and Additional Discussions for Section 4.2

We first prove Theorem 5:

Proof of Theorem 5.

The proof follows the roadmap of that of Theorem 4. Consider the pivotal statistic TM=(ψ^^n,R0ψ)/SMT_{M}=(\hat{\hat{\psi}}_{n,R_{0}}-\psi)/S_{M}. We first argue that qM,1α/2q_{M,1-\alpha/2} satisfies

lim infnP(ψM)=lim infnP(|TM|qM,1α/2)1α\liminf_{n\to\infty}P(\psi\in\mathcal{I}_{M})=\liminf_{n\to\infty}P\left(\left|T_{M}\right|\leq q_{M,1-\alpha/2}\right)\geq 1-\alpha

which would conclude that M\mathcal{I}_{M} is an asymptotically valid (1α)(1-\alpha)-level confidence interval.

By Theorem 3 and the continuous mapping theorem, as nn\to\infty, TMT_{M} converges weakly to

σZ0+τp0W01B1b=1B((σZb+τpWb)(σZ¯+τpW¯))2=dσZ0+τp0W0(σ2+τ2p)YB1=dσ2+τ2p0σ2+τ2ptB1\frac{\sigma Z_{0}+\frac{\tau}{\sqrt{p_{0}}}W_{0}}{\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\left(\sigma Z_{b}+\frac{\tau}{\sqrt{p}}W_{b}\right)-\left(\sigma\bar{Z}+\frac{\tau}{\sqrt{p}}\bar{W}\right)\right)^{2}}}\stackrel{{\scriptstyle d}}{{=}}\frac{\sigma Z_{0}+\frac{\tau}{\sqrt{p_{0}}}W_{0}}{\sqrt{\left(\sigma^{2}+\frac{\tau^{2}}{p}\right)\frac{Y}{B-1}}}\stackrel{{\scriptstyle d}}{{=}}\sqrt{\frac{\sigma^{2}+\frac{\tau^{2}}{p_{0}}}{\sigma^{2}+\frac{\tau^{2}}{p}}}t_{B-1} (67)

where Z0,Z1,,ZB,W0,W1,,WBi.i.d.N(0,1)Z_{0},Z_{1},\ldots,Z_{B},W_{0},W_{1},\ldots,W_{B}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1), Z¯=(1/B)b=1BZb\bar{Z}=(1/B)\sum_{b=1}^{B}Z_{b}, W¯=(1/B)b=1BWb\bar{W}=(1/B)\sum_{b=1}^{B}W_{b}, and YχB12Y\sim\chi^{2}_{B-1} which is independent of Z0Z_{0} and W0W_{0}. The equalities in distribution use the standard properties of normals to obtain χB12\chi^{2}_{B-1} and tB1t_{B-1} distributions. By multiplying both the numerator and denominator of (67) by p0/τ\sqrt{p_{0}}/\tau, we get

θ2+1θ2+ρ2tB1\sqrt{\frac{\theta^{2}+1}{\theta^{2}+\rho^{2}}}t_{B-1} (68)

where θ=σp0/τ\theta=\sigma\sqrt{p_{0}}/\tau and ρ=p0/p\rho=\sqrt{p_{0}/p} as defined before. Denote the distribution of (68) as F~(;θ,ρ)\tilde{F}(\cdot;\theta,\rho). We have

minθ0F~(q;θ,ρ)=P(max{ρ1,1}tB1q)\min_{\theta\geq 0}\tilde{F}(q;\theta,\rho)=P\left(\max\{\rho^{-1},1\}t_{B-1}\leq q\right)

for any q0q\geq 0, by noting that the minimum is approached by setting θ\theta\to\infty when ρ1\rho\geq 1 and attained at θ=0\theta=0 when ρ<1\rho<1. Thus setting qM,1α/2=max{ρ1,1}tB1,1α/2q_{M,1-\alpha/2}=\max\{\rho^{-1},1\}t_{B-1,1-\alpha/2} gives

lim infnP(ψM)\displaystyle\liminf_{n\to\infty}P(\psi\in\mathcal{I}_{M}) =\displaystyle= lim infnP(|TM|qM,1α/2)=F~(qM,1α/2;θ,ρ)F~(qM,1α/2;θ,ρ)\displaystyle\liminf_{n\to\infty}P\left(|T_{M}|\leq q_{M,1-\alpha/2}\right)=\tilde{F}\left(q_{M,1-\alpha/2};\theta,\rho\right)-\tilde{F}\left(-q_{M,1-\alpha/2};\theta,\rho\right){}
=\displaystyle= 2F~(qM,1α/2;θ,ρ)12minθ0F~(qM,1α/2;θ,ρ)1=1α\displaystyle{}2\tilde{F}\left(q_{M,1-\alpha/2};\theta,\rho\right)-1\geq 2\min_{\theta\geq 0}\tilde{F}\left(q_{M,1-\alpha/2};\theta,\rho\right)-1=1-\alpha

where the third equality follows from the symmetry of (68).

Finally, note that when ρ=1\rho=1, (68) becomes tB1t_{B-1} regardless of the value of θ\theta. Thus in this case TMT_{M} is asymptotically tB1t_{B-1}, and asymptotic exactness of M\mathcal{I}_{M} holds. ∎

Note that, in contrast to (61) where W0W_{0} appears both in the numerator and denominator, in (67) W0W_{0} only appears in the former and thus the numerator and denominator there are independent. This simplifies the asymptotic distribution to be used in M\mathcal{I}_{M}.

We discuss the half-width efficiency of M\mathcal{I}_{M} and contrast it with O\mathcal{I}_{O}. First, unlike O\mathcal{I}_{O}, it is possible for M\mathcal{I}_{M} to be asymptotically exact as nn\to\infty, in particular when we use the same computation size in the resample estimates and the original point estimate, which is a natural configuration. Moreover, from (24) and (67), we see that as nn increases, the half-width of M\mathcal{I}_{M} behaves approximately as

max{ρ1,1}tB1,1α/2(σ2n+τ2R)χB12B1\max\{\rho^{-1},1\}t_{B-1,1-\alpha/2}\sqrt{\left(\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R}\right)\frac{\chi^{2}_{B-1}}{B-1}}

When BB increases, this becomes

max{ρ1,1}z1α/2σ2n+τ2R\max\{\rho^{-1},1\}z_{1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R}}

so that when ρ=1\rho=1 we get

z1α/2σ2n+τ2Rz_{1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R}}

which is the half-width of the normality confidence interval (when R is set equal to R_{0}). This agreement indicates that \mathcal{I}_{M} is preferable to \mathcal{I}_{O} when B is large. Nonetheless, when B is small, or when the computation sizes of the resample and original estimates differ, \mathcal{I}_{O} could possibly outperform \mathcal{I}_{M}.
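To make this comparison concrete, the following minimal Python sketch (our own illustration, assuming scipy is available) tabulates the critical-value ratio \max\{\rho^{-1},1\}t_{B-1,1-\alpha/2}/z_{1-\alpha/2}, which drives how much wider \mathcal{I}_{M} is than the normality interval apart from the \chi^{2} fluctuation in its half-width.

```python
# A minimal sketch (our own illustration) of the critical-value ratio
# max{rho^{-1}, 1} * t_{B-1,1-alpha/2} / z_{1-alpha/2} discussed above.
from scipy.stats import norm, t

alpha = 0.05
z = norm.ppf(1 - alpha / 2)                  # z_{1-alpha/2}

for rho in (0.5, 1.0, 2.0):                  # rho = sqrt(p_0/p) as above; rho = 1 when R_0 = R
    for B in (2, 5, 10, 20, 50):             # q_M requires B >= 2
        q_M = max(1.0 / rho, 1.0) * t.ppf(1 - alpha / 2, df=B - 1)
        print(f"rho = {rho:3.1f}, B = {B:2d}: q_M / z = {q_M / z:.3f}")
```

For \rho=1 the ratio decreases toward 1 as B grows, consistent with the discussion above.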

Lastly, both O\mathcal{I}_{O} and M\mathcal{I}_{M} have natural one-sided analogs, where we replace qO,1α/2q_{O,1-\alpha/2} and qM,1α/2q_{M,1-\alpha/2} by qO,1αq_{O,1-\alpha} and qM,1αq_{M,1-\alpha} in one of the interval limits (and with the other side unbounded).

Appendix D Proofs for Section 5

Proof of Theorem 6.

In each of the three subsampling variants, we show that an analog of Assumption 1 holds and hence we can use the same roadmap as the proofs of Proposition 1 and Theorem 1 to conclude our result. Note that, under the assumptions in Proposition 2, we have immediately that n(ψ^nψ)N(0,σ2)\sqrt{n}(\hat{\psi}_{n}-\psi)\Rightarrow N(0,\sigma^{2}) for some σ2>0\sigma^{2}>0 by the functional delta method (Theorem 10 in Appendix F.2). Now, under the additional assumption that δ\mathcal{F}_{\delta} is measurable for every δ>0\delta>0, we have the following:

Cheap mm-out-of-nn Bootstrap: Denote ψs\psi_{s}^{*} as a subsample estimate. We invoke Theorem 15 (in Appendix F.4) to conclude that s(ψsψ^n)N(0,σ2)\sqrt{s}(\psi_{s}^{*}-\hat{\psi}_{n})\Rightarrow N(0,\sigma^{2}) given X1,X2,X_{1},X_{2},\ldots in probability as nn\to\infty, for any ss dependent on nn such that sns\leq n and ss\to\infty. Following the same argument in the proof of Proposition 1, except we replace n(ψnbψ^n)\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n}) by s(ψsbψ^n)\sqrt{s}(\psi_{s}^{*b}-\hat{\psi}_{n}), we obtain

(n(ψ^nψ),s(ψs1ψ^n),,s(ψsBψ^n))(σZ0,σZ1,,σZB)\left(\sqrt{n}(\hat{\psi}_{n}-\psi),\sqrt{s}(\psi_{s}^{*1}-\hat{\psi}_{n}),\ldots,\sqrt{s}(\psi_{s}^{*B}-\hat{\psi}_{n})\right)\Rightarrow(\sigma Z_{0},\sigma Z_{1},\ldots,\sigma Z_{B})

where Z0,Z1,,ZBZ_{0},Z_{1},\ldots,Z_{B} are i.i.d. N(0,1)N(0,1). Therefore,

ψ^nψsnSZ01Bb=1BZb2\frac{\hat{\psi}_{n}-\psi}{\sqrt{\frac{s}{n}}S}\Rightarrow\frac{Z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}Z_{b}^{2}}}

by the continuous mapping theorem. Then, following the proof of Theorem 1 and noting that the right-hand side above is a t_{B}-variable, we get

P(t1α/2,Bψ^nψsnSt1α/2,B)1αP\left(-t_{1-\alpha/2,B}\leq\frac{\hat{\psi}_{n}-\psi}{\sqrt{\frac{s}{n}}S}\leq t_{1-\alpha/2,B}\right)\to 1-\alpha

from which we conclude the result.
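As a concrete illustration of the resulting interval, the following minimal Python sketch (our own, with `estimator` a hypothetical placeholder for the possibly expensive model-fitting routine) implements the Cheap m-out-of-n Bootstrap interval \hat{\psi}_{n}\pm t_{1-\alpha/2,B}\sqrt{s/n}\,S implied by the display above, taking S^{2}=(1/B)\sum_{b=1}^{B}(\psi_{s}^{*b}-\hat{\psi}_{n})^{2}.

```python
# A minimal sketch (our own illustration) of the Cheap m-out-of-n Bootstrap interval:
#   psi_hat +/- t_{1-alpha/2,B} * sqrt(s/n) * S,  S^2 = (1/B) sum_b (psi_s^{*b} - psi_hat)^2.
# `estimator` is a hypothetical placeholder for the model-fitting routine.
import numpy as np
from scipy.stats import t

def cheap_m_out_of_n_ci(data, estimator, s, B=1, alpha=0.05, seed=None):
    rng = np.random.default_rng(seed)
    n = len(data)
    psi_hat = estimator(data)                       # full-data point estimate
    psi_sub = np.empty(B)
    for b in range(B):
        idx = rng.choice(n, size=s, replace=True)   # size-s resample (with replacement)
        psi_sub[b] = estimator(data[idx])
    S = np.sqrt(np.mean((psi_sub - psi_hat) ** 2))
    half_width = t.ppf(1 - alpha / 2, df=B) * np.sqrt(s / n) * S
    return psi_hat - half_width, psi_hat + half_width

# Example usage with the sample mean as the estimator and s = n^{0.6}:
# data = np.random.default_rng(0).standard_normal(10_000)
# print(cheap_m_out_of_n_ci(data, np.mean, s=int(10_000 ** 0.6), B=1))
```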

Cheap Bag of Little Bootstraps: Recall \psi_{s}^{*} is the fixed subsample estimate, and denote \psi_{n}^{**} as a second-layer resample estimate. Note that the subsample encoded by P_{s}^{*}, which is obtained by sampling without replacement from the data encoded by \hat{P}_{n}, is distributed i.i.d. from P. Thus we can invoke Theorem 15 to conclude \sqrt{n}(\psi_{n}^{**}-\psi_{s}^{*})\Rightarrow N(0,\sigma^{2}) in probability, given X_{1},X_{2},\ldots,X_{n} and the subsampling randomness, as n,s\to\infty for any s dependent on n such that s\leq n. The rest is identical to the proofs of Proposition 1 and Theorem 1, except that we replace \psi_{n}^{*b}-\hat{\psi}_{n} by \psi_{n}^{**b}-\psi_{s}^{*} and, instead of conditioning on \hat{P}_{n} in the series of inequalities in the proof of Proposition 1, we condition on both \hat{P}_{n} and the randomness in the subsampling that produces \psi_{s}^{*}.

Cheap Subsampled Double Bootstrap: Denote ψs\psi_{s}^{*} as a first-layer subsample estimate and ψn\psi_{n}^{**} as the derived second-layer resample estimate. We invoke Theorem 16 (in Appendix F.4) to conclude that n(ψnψs)N(0,σ2)\sqrt{n}(\psi_{n}^{**}-\psi_{s}^{*})\Rightarrow N(0,\sigma^{2}) given X1,X2,X_{1},X_{2},\ldots in probability as nn\to\infty, for any ss dependent on nn such that sns\leq n and ss\to\infty. The rest is identical to the proofs of Proposition 1 and Theorem 1, except we replace ψnbψ^n\psi_{n}^{*b}-\hat{\psi}_{n} by ψnbψsb\psi_{n}^{**b}-\psi_{s}^{*b}. ∎

Appendix E Additional Numerical Results

E.1 Logistic Regression

We present another example on a logistic regression model YBernoulli(p)Y\sim Bernoulli(p) where p=1/(1+exp((β1X1++βdXd)))p=1/(1+\exp(-(\beta_{1}X_{1}+\cdots+\beta_{d}X_{d}))). We set d=10d=10 and use data (X1,i,,Xd,i,Yi)(X_{1,i},\ldots,X_{d,i},Y_{i}) of size 10510^{5} to fit the model and estimate the coefficients βj\beta_{j}’s. The ground truth is set as Xjt3X_{j}\sim t_{3} and the coefficients (β1,,β10)=(1.9,1.7,1.3,1.8,1.1,1.2,1.9,2.2,1.5,2.0)(\beta_{1},\ldots,\beta_{10})=(1.9,1.7,1.3,1.8,1.1,1.2,1.9,2.2,1.5,2.0), a set of numbers arbitrarily chosen from the interval [1,3][1,3]. Our setup is similar to the linear regression example in Section 6.2, where we test all methods to compute 95%95\% confidence intervals on the first coefficient β1\beta_{1}, and use subsample size n0.6=1000n^{0.6}=1000 for mm-out-of-nn Bootstrap, Bag of Little Bootstraps and Subsampled Double Bootstrap. We again use 5050 resamples in total for each method to depict the trend.
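For illustration, a minimal Python sketch of this experiment (our own, not the code used to produce the reported results; it assumes numpy, scipy and scikit-learn are available) generates data from the ground-truth model and computes a Cheap Bootstrap confidence interval for \beta_{1} with a small number of resamples B.

```python
# Minimal sketch (our own illustration) of the logistic regression experiment:
# t_3 covariates, Bernoulli response, and a Cheap Bootstrap interval for beta_1.
import numpy as np
from scipy.stats import t
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, B, alpha = 10**5, 10, 3, 0.05
beta = np.array([1.9, 1.7, 1.3, 1.8, 1.1, 1.2, 1.9, 2.2, 1.5, 2.0])

X = rng.standard_t(df=3, size=(n, d))
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

def fit_beta1(X, Y):
    # Large C makes the fit effectively unregularized; no intercept in the model.
    model = LogisticRegression(C=1e6, fit_intercept=False, max_iter=1000)
    return model.fit(X, Y).coef_[0, 0]

beta1_hat = fit_beta1(X, Y)
resamples = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)        # resample the data with replacement
    resamples[b] = fit_beta1(X[idx], Y[idx])

S = np.sqrt(np.mean((resamples - beta1_hat) ** 2))
hw = t.ppf(1 - alpha / 2, df=B) * S         # Cheap Bootstrap half-width
print(f"95% CI for beta_1: [{beta1_hat - hw:.3f}, {beta1_hat + hw:.3f}]")
```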

Figures 6, 7 and 8 show the coverage probabilities, mean interval widths, and standard deviations of interval widths, respectively, for the Standard and Cheap bootstrap methods. The comparisons are largely similar to the linear regression example. The Cheap bootstrap methods all attain close to the target 95\% coverage at B=1, with 96\%, 96\%, 97\% and 97\% for the Cheap Bootstrap, m-out-of-n Bootstrap, Bag of Little Bootstraps and Subsampled Double Bootstrap respectively. In contrast, the Standard bootstrap methods all fail at B=1 and require much larger B to approach the target coverage. Both the means and standard deviations of interval widths for the Cheap bootstrap methods decrease sharply from B=1 (e.g., mean 0.38 and standard deviation 0.28 for the Cheap Bootstrap) to B=2 (mean 0.14 and standard deviation 0.07), continue to drop at B=3 (mean 0.11 and standard deviation 0.04), and then decrease at a progressively slower rate. Meanwhile, the mean interval widths of the Standard bootstrap methods start small and exhibit increasing trends, whereas their standard deviations appear roughly constant in B. Thus, as in the linear regression example, the Cheap bootstrap methods consistently attain accurate coverage regardless of B, and their interval widths drop quickly to levels comparable to those at large B, whereas the Standard methods under-cover when B is small and converge to the nominal level at much slower rates.

Figure 6: Confidence interval coverage probabilities of Standard versus Cheap Bootstrap methods in logistic regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the coverage probability estimates from 1000 experimental repetitions.
Figure 7: Mean confidence interval widths of Standard versus Cheap Bootstrap methods in logistic regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the mean width estimates from 1000 experimental repetitions.
Figure 8: Standard deviations of confidence interval widths of Standard versus Cheap Bootstrap methods in logistic regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the standard deviation estimates from 1000 experimental repetitions.

E.2 Critical Values for Cheap Bootstrap in Nested Sampling Problems

Table 5 shows the approximated q_{O,1-\alpha/2} for B ranging from 1 to 20 and \alpha=0.05 (and also \alpha=0.1, used later in the example in Appendix E.3) when R_{0}=R, i.e., \rho=1, where we also show q_{M,1-\alpha/2}=t_{B-1,1-\alpha/2} for comparison. Note that when B=1, only q_{O,1-\alpha/2} is well-defined but not q_{M,1-\alpha/2}. We also see that while q_{O,1-\alpha/2} is smaller than q_{M,1-\alpha/2} when B is small, the two become very similar by the time B reaches 20.

Table 5: Values of qO,1α/2q_{O,1-\alpha/2} and qM,1α/2q_{M,1-\alpha/2} when R0=RR_{0}=R, at α=0.05\alpha=0.05 and α=0.1\alpha=0.1.
B   q_{O,1-\alpha/2} (\alpha=0.05)   q_{M,1-\alpha/2} (\alpha=0.05)   q_{O,1-\alpha/2} (\alpha=0.1)   q_{M,1-\alpha/2} (\alpha=0.1)
1 12.75 NA 6.32 NA
2 4.32 12.71 2.92 6.31
3 3.19 4.30 2.36 2.92
4 2.78 3.18 2.14 2.35
5 2.57 2.78 2.02 2.13
6 2.45 2.57 1.95 2.02
7 2.37 2.45 1.90 1.94
8 2.31 2.36 1.86 1.89
9 2.27 2.31 1.84 1.86
10 2.23 2.26 1.82 1.83
11 2.21 2.23 1.80 1.81
12 2.19 2.20 1.79 1.80
13 2.16 2.18 1.78 1.78
14 2.15 2.16 1.77 1.77
15 2.14 2.14 1.76 1.76
16 2.12 2.13 1.75 1.75
17 2.11 2.12 1.74 1.75
18 2.11 2.11 1.74 1.74
19 2.10 2.10 1.73 1.73
20 2.09 2.09 1.73 1.73
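For reference, the q_{M,1-\alpha/2} column of Table 5 can be reproduced directly from Student-t quantiles; the short Python sketch below (our own illustration) does this, whereas approximating q_{O,1-\alpha/2} would additionally require Monte Carlo simulation of its limiting distribution and is not reproduced here.

```python
# A minimal sketch (our own) reproducing q_{M,1-alpha/2} = t_{B-1,1-alpha/2} in Table 5.
from scipy.stats import t

for alpha in (0.05, 0.1):
    print(f"alpha = {alpha}")
    for B in range(2, 21):                  # q_M is defined only for B >= 2
        print(f"  B = {B:2d}: q_M = {t.ppf(1 - alpha / 2, df=B - 1):.2f}")
```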

E.3 Bagging Estimation for Bounding Optimality Gap

We apply the Cheap Bootstrap to a bagging method for constructing upper confidence bounds in data-driven stochastic optimization problems. More specifically, consider an expected-value optimization problem \min_{\theta\in\Theta}\{H(\theta):=E[h(\theta,X)]\}, where the distribution P governing X in the expectation E[\cdot] is unknown and is only informed by data. The cost function h(\cdot,\cdot) is known or evaluatable, and \Theta denotes the feasible region for the decision variable \theta. Such optimization formulations appear broadly across disciplines, including revenue management and portfolio selection (e.g., Shapiro et al. (2021); Birge and Louveaux (2011)).

Suppose we have a given solution, say θ^\hat{\theta} (which is presumably obtained from a data-driven procedure such as solving an empirical optimization or sample average approximation Shapiro et al. (2021)). To assess the quality of this given solution, we can construct an upper confidence bound for its optimality gap G=H(θ^)HG=H(\hat{\theta})-H^{*}, where H=minθΘH(θ)H^{*}=\min_{\theta\in\Theta}H(\theta) denotes the unknown true optimal value. To this end, a general upper bound for GG is given by E[H~(X1,X2,,Xk)]E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})] where X1,X2,,Xki.i.d.PX_{1},X_{2},\ldots,X_{k}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P and E[]E[\cdot] denotes the corresponding expectation. The quantity

H~(X1,X2,,Xk)=maxθΘ1ki=1k(h(θ^,Xi)h(θ,Xi))\tilde{H}(X_{1},X_{2},\ldots,X_{k})=\max_{\theta\in\Theta}\frac{1}{k}\sum_{i=1}^{k}(h(\hat{\theta},X_{i})-h(\theta,X_{i}))

is the optimal value of a sample average approximation with cost function h(\hat{\theta},X)-h(\theta,X), which equals the difference between an empirical estimate of E[h(\hat{\theta},X)] and the optimal value of a sample average approximation of \min_{\theta\in\Theta}H(\theta). That E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})] is an upper bound for G can be argued via the Jensen inequality, as spelled out below, and is a well-established optimistic bound in stochastic optimization (e.g., Mak et al. (1999), Glasserman (2004) §8).
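Concretely, holding \hat{\theta} fixed and using that a maximum of expectations is no larger than the expectation of the maximum,

E\left[\tilde{H}(X_{1},\ldots,X_{k})\right]=E\left[\max_{\theta\in\Theta}\frac{1}{k}\sum_{i=1}^{k}\left(h(\hat{\theta},X_{i})-h(\theta,X_{i})\right)\right]\geq\max_{\theta\in\Theta}E\left[\frac{1}{k}\sum_{i=1}^{k}\left(h(\hat{\theta},X_{i})-h(\theta,X_{i})\right)\right]=\max_{\theta\in\Theta}\left(H(\hat{\theta})-H(\theta)\right)=H(\hat{\theta})-H^{*}=G.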

To implement E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})] using data, Lam and Qian (2018a, b) suggest using bagging. Namely, given data X_{1},\ldots,X_{n}, where n denotes the sample size (which could differ from k), we repeatedly resample data sets \{X_{1}^{*b},\ldots,X_{k}^{*b}\} and solve the corresponding sample average approximations \tilde{H}(X_{1}^{*b},X_{2}^{*b},\ldots,X_{k}^{*b}), for b=1,\ldots,B. Then we output their average

1Bb=1BH~(X1b,X2b,,Xkb)\frac{1}{B}\sum_{b=1}^{B}\tilde{H}(X_{1}^{*b},X_{2}^{*b},\ldots,X_{k}^{*b})

to give a point estimate of E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})]. Cast in our framework in Section 4, we view \psi(P)=E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})], where X_{1},\ldots,X_{k}\stackrel{i.i.d.}{\sim}P, and view each noisy computation run as \hat{\psi}_{r}(Q)=\tilde{H}(X_{1}^{*},X_{2}^{*},\ldots,X_{k}^{*}), where X_{1}^{*},\ldots,X_{k}^{*}\stackrel{i.i.d.}{\sim}Q. Thus we can use the Cheap Bootstrap centered at the original estimate and centered at the resample mean in Section 4 to construct valid confidence bounds. Note that each computation run here involves solving a sample average approximation problem, which could be costly.

To test the performance of our Cheap Bootstrap, we consider more specifically the following expected-value optimization problem

minθE[Xθ]subject toAθbθi{0,1} for i=1,2,,10\begin{array}[]{ll}\min_{\theta}&E[X^{\top}\theta]\\ \text{subject to}&A\theta\leq b\\ &\theta_{i}\in\{0,1\}\text{\ for\ }i=1,2,\ldots,10\end{array}

This is a 1010-dimensional stochastic integer program, with binary decision variables θi,i=1,,10\theta_{i},i=1,\ldots,10. The true distribution of X10X\in\mathbb{R}^{10} is N(μ,Σ)N(\mu,\Sigma), where μ=(1,7/9,5/9,,7/9,1)\mu=(-1,-7/9,-5/9,\ldots,7/9,1)^{\top} and Σ\Sigma is an arbitrarily generated covariance matrix. We set b=(1,2)b=(-1,2)^{\top} and

A=[11111111110000001111].A=\left[\begin{matrix}-1&-1&-1&-1&-1&-1&-1&-1&-1&-1\\ 0&0&0&0&0&0&1&1&1&1\\ \end{matrix}\right].

This example is also used in Lam and Qian (2018b). We use data size n=100 and sample average approximation size k=50. Throughout our experiments, we set \hat{\theta}=(1,1,0,1,0,0,0,0,0,0)^{\top}, which is obtained by solving a sample average approximation from an independent data set of size 30. We construct our one-sided Cheap Bootstrap intervals using q_{O,1-\alpha} and q_{M,1-\alpha} at \alpha=0.05 (which corresponds to the case \alpha=0.1 in Table 5), for B ranging from 1 to 10. We repeat our experiment 1000 times to obtain summary statistics on the generated bounds, including the one-sided empirical coverage, i.e., the proportion of experiments in which our upper confidence bound is at least the ground-truth value, as well as the mean and standard deviation of the upper confidence bound.
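For concreteness, the following minimal Python sketch (our own illustration; the covariance matrix Sigma is a hypothetical placeholder, as the arbitrarily generated covariance used in the experiment is not reported) computes one computation run \tilde{H}(X_{1}^{*},\ldots,X_{k}^{*}) for this integer program by brute-force enumeration of the 2^{10} binary decisions; in practice, each such run would typically be handled by an integer programming solver.

```python
# Minimal sketch (our own) of one computation run H~(X_1^*, ..., X_k^*) for the
# 10-dimensional stochastic integer program, via brute-force enumeration.
import itertools
import numpy as np

d, n, k = 10, 100, 50
mu = np.linspace(-1.0, 1.0, d)                 # (-1, -7/9, ..., 7/9, 1)
Sigma = 0.1 * np.eye(d)                        # hypothetical placeholder covariance
A = np.vstack([-np.ones(d), np.r_[np.zeros(6), np.ones(4)]])
b = np.array([-1.0, 2.0])
theta_hat = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0], dtype=float)

# Enumerate all feasible binary decisions (A theta <= b).
feasible = []
for v in itertools.product([0, 1], repeat=d):
    theta = np.array(v, dtype=float)
    if np.all(A @ theta <= b):
        feasible.append(theta)

def H_tilde(X):
    """Optimality-gap SAA: max over feasible theta of
    (1/k) sum_i (h(theta_hat, X_i) - h(theta, X_i)), with h(theta, x) = x^T theta."""
    mean_x = X.mean(axis=0)
    return max(mean_x @ (theta_hat - theta) for theta in feasible)

rng = np.random.default_rng(0)
data = rng.multivariate_normal(mu, Sigma, size=n)   # observed data of size n
idx = rng.integers(0, n, size=k)                    # one size-k bootstrap resample
print("One computation run:", H_tilde(data[idx]))
```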

Table 6 shows the results. The empirical coverages range from 96\% to 98\% for all considered B, for the Cheap Bootstrap centered at the original estimate and centered at the resample mean (note that the latter is defined only starting from B=2). The coverages are higher than the nominal value 95\%, likely because the true value targeted by the confidence bound, namely E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})], is itself an upper bound on the true optimality gap G. The trends of the generated confidence bounds appear consistent with our previous examples. For the bound centered at the original estimate, the mean falls quickly from 2.83 at B=1 to 1.91 at B=2, further to 1.72 at B=3, and much more steadily to 1.58 at B=10, while the standard deviation falls from 1.88 at B=1 to 0.91 at B=2, to 0.73 at B=3, and steadily to 0.62 at B=10. Similarly, for the bound centered at the resample mean, the mean falls quickly from 2.79 at B=2 to 1.84 at B=3 and then steadily to 1.56 at B=10, while the standard deviation falls from 1.71 at B=2 to 0.81 at B=3 and steadily to 0.62 at B=10. Centering at the original estimate performs better than centering at the resample mean for small B, but their performances become similar as B approaches 10.

Table 6: Performances of upper confidence bounds using Cheap Bootstrap centered at original estimate and centered at resample mean, at nominal confidence level 95%95\% for bagging estimation of optimality gap.
B   Centered at original estimate   Centered at resample mean
    Empirical coverage (margin of error)   Bound mean (st. dev.)   Empirical coverage (margin of error)   Bound mean (st. dev.)
1 0.96   (0.01) 2.83  (1.88) NA NA
2 0.97   (0.01) 1.91  (0.91) 0.98   (0.01) 2.79  (1.71)
3 0.97   (0.01) 1.72  (0.73) 0.98   (0.01) 1.84  (0.81)
4 0.96   (0.01) 1.67  (0.69) 0.97   (0.01) 1.72  (0.72)
5 0.97   (0.01) 1.64  (0.67) 0.97   (0.01) 1.66  (0.68)
6 0.97   (0.01) 1.61  (0.65) 0.96   (0.01) 1.62  (0.66)
7 0.97   (0.01) 1.60  (0.65) 0.97   (0.01) 1.60  (0.65)
8 0.97   (0.01) 1.58  (0.63) 0.97   (0.01) 1.58  (0.63)
9 0.97   (0.01) 1.58  (0.63) 0.97   (0.01) 1.57  (0.63)
10 0.97   (0.01) 1.58  (0.62) 0.97   (0.01) 1.56  (0.62)

Appendix F Useful Technical Backgrounds

For self-containedness, we provide some existing results needed for our technical developments.

F.1 Hadamard Differentiability

We give some details on the Hadamard differentiability used in Proposition 2 and Theorem 6. Consider a functional \psi(\cdot):\mathcal{P}\to\mathbb{R}^{d}, where \mathcal{P} denotes the space of probability distributions on the domain \mathcal{X}. We call \psi(\cdot) Hadamard differentiable at P with derivative \psi_{P}^{\prime}(H), tangential to some subset \mathcal{Q} of \mathcal{P}, if there exists a continuous, linear map \psi_{P}^{\prime}:\mathcal{Q}\to\mathbb{R}^{d} such that

ψ(P+tHt)ψ(P)tψP(H)0\left\|\frac{\psi(P+tH_{t})-\psi(P)}{t}-\psi_{P}^{\prime}(H)\right\|\to 0

as t\searrow 0, for every sequence H_{t} converging to H\in\mathcal{Q} and such that P+tH_{t}\in\mathcal{P} for all small t>0 (Van der Vaart (2000) §20.2).
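As a simple illustration (a standard example, stated under the assumption that the domain \mathcal{X} is a bounded subset of \mathbb{R} so that the map H\mapsto\int x\,dH(x) is continuous in the norm used), the mean functional is linear and hence Hadamard differentiable:

\psi(P)=\int x\,dP(x),\qquad\psi_{P}^{\prime}(H)=\int x\,dH(x),\qquad\frac{\psi(P+tH_{t})-\psi(P)}{t}=\int x\,dH_{t}(x)\to\int x\,dH(x)\ \text{ as }t\searrow 0.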

F.2 Bootstrap Empirical Processes

We say that a sequence of random elements GnG_{n} in a normed space 𝔻\mathbb{D}, with norm denoted \|\cdot\|, converges in distribution to a tight limit GG in 𝔻\mathbb{D} if

suphBL1(𝔻)|Eh(Gn)Eh(G)|0\sup_{h\in BL_{1}(\mathbb{D})}|E^{*}h(G_{n})-Eh(G)|\to 0 (69)

where BL1(𝔻)BL_{1}(\mathbb{D}) is the set of all functions h:𝔻[1,1]h:\mathbb{D}\to[-1,1] that are uniformly Lipschitz, i.e., |h(z1)h(z2)|z1z2|h(z_{1})-h(z_{2})|\leq\|z_{1}-z_{2}\| for every pair z1,z2𝔻z_{1},z_{2}\in\mathbb{D}, and E[]E^{*}[\cdot] denotes the outer expectation.

For a class of functions \mathcal{F} from 𝒳\mathcal{X} to \mathbb{R}, define

():={z:z:=supf|z(f)|<}\ell^{\infty}(\mathcal{F}):=\left\{z:\|z\|_{\mathcal{F}}:=\sup_{f\in\mathcal{F}}|z(f)|<\infty\right\}

where zz is a map from \mathcal{F} to \mathbb{R}. Consider the empirical process 𝔾n=n(P^nP)\mathbb{G}_{n}=\sqrt{n}(\hat{P}_{n}-P) as a random element that takes value in ()\ell^{\infty}(\mathcal{F}), and consider the bootstrap empirical process 𝔾n=n(PnP^n)\mathbb{G}_{n}^{*}=\sqrt{n}(P_{n}^{*}-\hat{P}_{n}). Supposing \mathcal{F} is Donsker with a finite envelope function, it is well established that 𝔾n𝔾P\mathbb{G}_{n}\Rightarrow\mathbb{G}_{P} in ()\ell^{\infty}(\mathcal{F}) where 𝔾P\mathbb{G}_{P} is a tight Gaussian process (with mean 0 and covariance Cov(𝔾P(f1),𝔾P(f2))=CovP(f1(X),f2(X))Cov(\mathbb{G}_{P}(f_{1}),\mathbb{G}_{P}(f_{2}))=Cov_{P}(f_{1}(X),f_{2}(X)) where CovPCov_{P} denotes the covariance taken with respect to PP). Denote EM[]E_{M}[\cdot] as the expectation conditional on X1,,XnX_{1},\ldots,X_{n}. We recall the following result for the bootstrap empirical process.

Theorem 9 (Adapted from Van der Vaart (2000) Theorem 23.7).

For every Donsker class \mathcal{F} of measurable functions with a finite envelope function,

suphBL1(())|EMh(𝔾n)Eh(𝔾P)|p0\sup_{h\in BL_{1}(\ell^{\infty}(\mathcal{F}))}|E_{M}h(\mathbb{G}_{n}^{*})-Eh(\mathbb{G}_{P})|\stackrel{{\scriptstyle p}}{{\to}}0

Furthermore, the sequence 𝔾n\mathbb{G}_{n}^{*} is asymptotically measurable.

Next we recall two theorems:

Theorem 10 (Adapted from Van der Vaart (2000) Theorem 20.8).

Let 𝔻\mathbb{D} be a normed space and ϕ:𝔻ϕ𝔻d\phi:\mathbb{D}_{\phi}\subset\mathbb{D}\to\mathbb{R}^{d} be Hadamard differentiable at θ\theta tangential to some subspace 𝔻0\mathbb{D}_{0}. Let θ^n\hat{\theta}_{n} be random maps with values in 𝔻ϕ\mathbb{D}_{\phi} such that n(θ^nθ)𝕋\sqrt{n}(\hat{\theta}_{n}-\theta)\Rightarrow\mathbb{T} where 𝕋\mathbb{T} takes values in 𝔻0\mathbb{D}_{0}. Then n(ϕ(θ^n)ϕ(θ))ϕθ(𝕋)\sqrt{n}(\phi(\hat{\theta}_{n})-\phi(\theta))\Rightarrow\phi_{\theta}^{\prime}(\mathbb{T}).

Theorem 11 (Adapted from Van der Vaart (2000) Theorem 23.9).

Let 𝔻\mathbb{D} be a normed space and ϕ:𝔻ϕ𝔻d\phi:\mathbb{D}_{\phi}\subset\mathbb{D}\to\mathbb{R}^{d} be Hadamard differentiable at θ\theta tangential to some subspace 𝔻0\mathbb{D}_{0}. Let θ^n\hat{\theta}_{n} and θn\theta_{n}^{*} be random maps with values in 𝔻ϕ\mathbb{D}_{\phi} such that n(θ^nθ)𝕋\sqrt{n}(\hat{\theta}_{n}-\theta)\Rightarrow\mathbb{T} and suphBL1(𝔻)|EMh(n(θnθ^n))Eh(𝕋)|p0\sup_{h\in BL_{1}(\mathbb{D})}|E_{M}h(\sqrt{n}(\theta_{n}^{*}-\hat{\theta}_{n}))-Eh(\mathbb{T})|\stackrel{{\scriptstyle p}}{{\to}}0, in which n(θnθ^n)\sqrt{n}(\theta_{n}^{*}-\hat{\theta}_{n}) is asymptotically measurable and 𝕋\mathbb{T} is tight and takes values in 𝔻0\mathbb{D}_{0}. Then n(ϕ(θn)ϕ(θ^n))ϕθ(𝕋)\sqrt{n}(\phi(\theta_{n}^{*})-\phi(\hat{\theta}_{n}))\Rightarrow\phi_{\theta}^{\prime}(\mathbb{T}) conditionally given X1,X2,X_{1},X_{2},\ldots in probability.

In general, we say n(θnθ^n)\sqrt{n}(\theta_{n}^{*}-\hat{\theta}_{n}) weakly converges to 𝕋\mathbb{T} conditionally given X1,X2,X_{1},X_{2},\ldots in probability if the condition suphBL1(𝔻)|EMh(n(θnθ^n))Eh(𝕋)|p0\sup_{h\in BL_{1}(\mathbb{D})}|E_{M}h(\sqrt{n}(\theta_{n}^{*}-\hat{\theta}_{n}))-Eh(\mathbb{T})|\stackrel{{\scriptstyle p}}{{\to}}0 in Theorem 11 holds (Van der Vaart (2000) equation (23.8)). By Kosorok (2007) Lemma 10.11, in the case 𝔻=d\mathbb{D}=\mathbb{R}^{d}, this condition implies P(n(θnθ^n)x|P^n)pF(x)P(\sqrt{n}(\theta_{n}^{*}-\hat{\theta}_{n})\leq x|\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}F(x) for all xdx\in\mathbb{R}^{d} if the distribution function F()F(\cdot) of 𝕋\mathbb{T} is continuous.

The following theorem, which is an immediate consequence of the above results, is used to justify Propositions 2 and 4.

Theorem 12 (Delta method for empirical bootstrap).

Consider P^n\hat{P}_{n} and PnP_{n}^{*} as random elements that take values in ()\ell^{\infty}(\mathcal{F}), where \mathcal{F} is a Donsker class with a finite envelope. Suppose ϕ:()d\phi:\ell^{\infty}(\mathcal{F})\to\mathbb{R}^{d} is Hadamard differentiable at PP (tangential to ()\ell^{\infty}(\mathcal{F})). Then n(ϕ(P^n)ϕ(P))ϕP(𝔾P)\sqrt{n}(\phi(\hat{P}_{n})-\phi(P))\Rightarrow\phi_{P}^{\prime}(\mathbb{G}_{P}), and also n(ϕ(Pn)ϕ(P^n))ϕP(𝔾P)\sqrt{n}(\phi(P_{n}^{*})-\phi(\hat{P}_{n}))\Rightarrow\phi_{P}^{\prime}(\mathbb{G}_{P}) given X1,X2,X_{1},X_{2},\ldots in probability.

Proof of Theorem 12.

Setting θ^n=P^n\hat{\theta}_{n}=\hat{P}_{n} and 𝔻=()\mathbb{D}=\ell^{\infty}(\mathcal{F}), Theorem 10 implies n(ϕ(P^n)ϕ(P))ϕP(𝔾P)\sqrt{n}(\phi(\hat{P}_{n})-\phi(P))\Rightarrow\phi_{P}^{\prime}(\mathbb{G}_{P}). Moreover, Theorem 9 gives the conditions needed in Theorem 11 to conclude that n(ϕ(Pn)ϕ(P^n))ϕP(𝔾P)\sqrt{n}(\phi(P_{n}^{*})-\phi(\hat{P}_{n}))\Rightarrow\phi_{P}^{\prime}(\mathbb{G}_{P}) given X1,X2,X_{1},X_{2},\ldots in probability. ∎

F.3 Edgeworth Expansions

We have the following higher-order expansion of coverage probability for a general function-of-mean model:

Theorem 13 (Hall (2013) Theorem 2.2).

Assume that for an integer ν1\nu\geq 1, the function A~\tilde{A} has ν+2\nu+2 continuous derivatives in a neighborhood of 𝛍\bm{\mu}, that A~(𝛍)=0\tilde{A}(\bm{\mu})=0, that E𝐗ν+2<E\|\mathbf{X}\|^{\nu+2}<\infty, that the characteristic function χ\chi of 𝐗\mathbf{X} satisfies Cramer’s condition lim sup𝐭|χ(𝐭)|<1\limsup_{\|\mathbf{t}\|\to\infty}|\chi(\mathbf{t})|<1, and that the asymptotic variance of nA~(𝐗¯)\sqrt{n}\tilde{A}(\overline{\mathbf{X}}) equals 1. Then

P(nA~(𝐗¯)x)=Φ(x)+j=1νnj/2πj(x)ϕ(x)+o(nν/2)P(\sqrt{n}\tilde{A}(\overline{\mathbf{X}})\leq x)=\Phi(x)+\sum_{j=1}^{\nu}n^{-j/2}\pi_{j}(x)\phi(x)+o(n^{-\nu/2})

uniformly in xx, where πj\pi_{j} is a polynomial of degree 3j13j-1, odd for even jj and even for odd jj, with coefficients depending on moments of 𝐗\mathbf{X} up to order j+2j+2 polynomially and also AA.

In Theorem 13, A~\tilde{A} is generally defined and the condition A~(𝝁)=0\tilde{A}(\bm{\mu})=0 is satisfied by AA and AsA_{s} defined in (9) and (10). When A~=A\tilde{A}=A, we denote its πj\pi_{j} as pjp_{j}. When A~=As\tilde{A}=A_{s}, we denote its πj\pi_{j} as qjq_{j}.

Now denote π^j()\hat{\pi}_{j}(\cdot) as πj()\pi_{j}(\cdot) but with all moments in its coefficients replaced by the sample moments, i.e., denoting 𝐗=(X(1),,X(d))\mathbf{X}=(X(1),\ldots,X(d)), the moment

μm1,,md=E[X(1)m1X(d)md]\mu_{m_{1},\ldots,m_{d}}=E[{X(1)}^{m_{1}}\cdots{X(d)}^{m_{d}}]

is replaced by

μ^m1,,md=1ni=1nXi(1)m1Xi(d)md\hat{\mu}_{m_{1},\ldots,m_{d}}=\frac{1}{n}\sum_{i=1}^{n}{X_{i}(1)}^{m_{1}}\cdots{X_{i}(d)}^{m_{d}}

for sample 𝐗i=(Xi(1),,Xi(d)),i=1,,n\mathbf{X}_{i}=(X_{i}(1),\ldots,X_{i}(d)),i=1,\ldots,n. Specifically, in the case of AA, we denote p^j\hat{p}_{j} as the pjp_{j} with moments replaced by sample moments. Moreover, we also define

A^(𝐱)=g(𝐱)g(𝐗¯)h(𝐗¯)\hat{A}(\mathbf{x})=\frac{g(\mathbf{x})-g(\overline{\mathbf{X}})}{h(\overline{\mathbf{X}})}

Then, A^(𝐗¯)\hat{A}(\overline{\mathbf{X}}^{*}), where 𝐗¯=(1/n)i=1n𝐗i\overline{\mathbf{X}}^{*}=(1/n)\sum_{i=1}^{n}\mathbf{X}_{i}^{*} for a resample 𝐗i,i=1,,n\mathbf{X}_{i}^{*},i=1,\ldots,n, is the resample counterpart of A(𝐗¯)A(\overline{\mathbf{X}}). We have the following expansion and bounds for the resample counterpart:

Theorem 14 (Adapted from Hall (2013) Theorem 5.1).

Let λ>0\lambda>0 be given, and let l=l(λ)l=l(\lambda) be a sufficiently large positive number. Assume that gg and hh each have ν+3\nu+3 bounded derivatives in a neighborhood of 𝛍\bm{\mu}, that E𝐗l<E\|\mathbf{X}\|^{l}<\infty, and that the characteristic function χ\chi of 𝐗\mathbf{X} satisfies Cramer’s condition lim sup𝐭|χ(𝐭)|<1\limsup_{\|\mathbf{t}\|\to\infty}|\chi(\mathbf{t})|<1. Then there exists a constant C>0C>0 such that

P(sup<x<|P(nA^(𝐗¯)x|𝒳n)Φ(x)j=1νnj/2p^j(x)ϕ(x)|>Cn(ν+1)/2)=O(nλ)P\left(\sup_{-\infty<x<\infty}\left|P(\sqrt{n}\hat{A}(\overline{\mathbf{X}}^{*})\leq x|\mathcal{X}_{n})-\Phi(x)-\sum_{j=1}^{\nu}n^{-j/2}\hat{p}_{j}(x)\phi(x)\right|>Cn^{-(\nu+1)/2}\right)=O(n^{-\lambda})
P(max1jνsup<x<(1+|x|)(3j1)|p^j(x)|>C)=O(nλ)P\left(\max_{1\leq j\leq\nu}\sup_{-\infty<x<\infty}(1+|x|)^{-(3j-1)}|\hat{p}_{j}(x)|>C\right)=O(n^{-\lambda})

and

P(max1jνsup<x<(1+|x|)(3j1)|p^j(x)|>C)=O(nλ)P\left(\max_{1\leq j\leq\nu}\sup_{-\infty<x<\infty}(1+|x|)^{-(3j-1)}|\hat{p}_{j}^{\prime}(x)|>C\right)=O(n^{-\lambda}) (70)

where 𝒳n={𝐗1,,𝐗n}\mathcal{X}_{n}=\{\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\} denotes the data.

Theorem 14 slightly strengthens Hall (2013) in that we include (70) as an additional conclusion. The proof in Hall (2013) works for any polynomial of degree at most 3j-1, and hence also applies to p_{j}^{\prime}; the additional implication is thus immediate (and useful for our proof of Theorem 2).

F.4 Subsampling

Consider 𝔾n,k=k(PkP^n)\mathbb{G}_{n,k}^{*}=\sqrt{k}(P_{k}^{*}-\hat{P}_{n}) where PkP_{k}^{*} is the bootstrap empirical distribution constructed using kk resampled values from X1,,XnX_{1},\ldots,X_{n} by sampling with replacement. Let δ={fg:f,g,ρP(fg)<δ}\mathcal{F}_{\delta}=\{f-g:f,g\in\mathcal{F},\ \rho_{P}(f-g)<\delta\}, where ρP(fg):=(VarP(f(X)g(X)))1/2\rho_{P}(f-g):=(Var_{P}(f(X)-g(X)))^{1/2} is the canonical metric. The following result is useful for analyzing Cheap mm-out-of-nn Bootstrap and Cheap Bag of Little Bootstraps:

Theorem 15 (Adapted from Van der Vaart and Wellner (2013) Theorem 3.6.3).

Let \mathcal{F} be a Donsker class of measurable functions such that δ\mathcal{F}_{\delta} is measurable for every δ>0\delta>0. Then

suphBL1(())|EMh(𝔾n,kn)Eh(𝔾P)|p0\sup_{h\in BL_{1}(\ell^{\infty}(\mathcal{F}))}|E_{M}h(\mathbb{G}_{n,k_{n}}^{*})-Eh(\mathbb{G}_{P})|\stackrel{{\scriptstyle p}}{{\to}}0

as nn\to\infty, for any sequence knk_{n}\to\infty, where EM[]E_{M}[\cdot] denotes the expectation conditional on the data X1,X2,,XnX_{1},X_{2},\ldots,X_{n}.

The following is useful for analyzing Cheap Subsampled Double Bootstrap:

Theorem 16 (Adapted from Sengupta et al. (2016) Theorem 1).

Let \mathcal{F} be a Donsker class of measurable functions such that δ\mathcal{F}_{\delta} is measurable for every δ>0\delta>0. Then the Subsampled Double Bootstrap process defined by 𝔾SDB,n,s=n(PnPs)\mathbb{G}_{SDB,n,s}^{*}=\sqrt{n}(P_{n}^{**}-P_{s}^{*}) satisfies

suphBL1(())|EMh(𝔾SDB,n,s)Eh(𝔾P)|p0\sup_{h\in BL_{1}(\ell^{\infty}(\mathcal{F}))}|E_{M}h(\mathbb{G}_{SDB,n,s}^{*})-Eh(\mathbb{G}_{P})|\stackrel{{\scriptstyle p}}{{\to}}0

as min(n,s)\min(n,s)\to\infty, where EM[]E_{M}[\cdot] is with respect to the randomness of both the first and second-layer resampling in Subsampled Double Bootstrap, conditional on the data X1,X2,,XnX_{1},X_{2},\ldots,X_{n}.