
A Cheap Bootstrap Method for Fast Inference

Henry Lam Email: [email protected] Department of Industrial Engineering and Operations Research, Columbia University
Abstract

The bootstrap is a versatile inference method that has proven powerful in many statistical problems. However, when applied to modern large-scale models, it could face substantial computational demand from repeated data resampling and model fitting. We present a bootstrap methodology that uses minimal computation, namely with a resample effort as low as one Monte Carlo replication, while maintaining desirable statistical guarantees. We present the theory of this method, which rests on a simple twist of the standard bootstrap principle. We also present generalizations of this method to nested sampling problems and to a range of subsampling variants, and illustrate how it can be used for fast inference across different estimation problems.

1 Introduction

The bootstrap is a versatile method for statistical inference that has proven powerful in many problems. It is advantageously data-driven and automatic: Instead of using analytic calculation based on detailed model knowledge such as the delta method, the bootstrap uses resampling directly from the data as a mechanism to approximate the sampling distribution. The flexibility and ease of the bootstrap have made it widely popular; see, e.g., the monographs Efron and Tibshirani (1994); Davison and Hinkley (1997); Shao and Tu (2012); Hall and Martin (1988) for comprehensive reviews on the subject.

Despite its popularity, an arguable challenge in using the bootstrap is its computational demand. To execute the bootstrap, one needs to run enough Monte Carlo replications to generate resamples and refit models. While this is a non-issue in classical problems, for modern large-scale models and data sizes, even one model refitting could incur substantial costs. This is the case if the estimation or model fitting involves big optimization or numerical root-finding problems arising from, for instance, empirical risk minimization in machine learning, where the computing routine can require running time-consuming stochastic gradient descent or advanced mathematical programming procedures. Moreover, in these problems, what constitutes an adequate Monte Carlo replication size can be case-by-case and necessitate trial-and-error, which adds further complication. Such computation issues in applying resampling techniques to massive learning models have been discussed and motivate some recent works, e.g., Kleiner et al. (2012); Lu et al. (2020); Giordano et al. (2019); Schulam and Saria (2019); Alaa and Van Der Schaar (2020).

Moreover, an additional challenge faced by the bootstrap is that its implementation can sometimes involve a nested amount of Monte Carlo runs. This issue arises for predictive models that are corrupted by noises from both the data and the computation procedure. An example is predictive inference using high-fidelity stochastic simulation modeling (Nelson (2013); Law et al. (1991)). These models are used to estimate output performance measures in sophisticated systems, and comprise a popular approach for causal decision-making in operations management and scientific disciplines. In this context, obtaining a point estimate of the performance measure itself requires a large number of Monte Carlo runs on the model to average out the aleatory uncertainty. When the model contains parameters that are calibrated from exogenous data, conducting bootstrap inference on the prediction would involve first resampling the exogenous data, then, given each resampled input parameter value, running and averaging a large number of model runs. This consequently leads to a multiplicative total computation effort between the resample size and the model run size per resample, an issue known as the input uncertainty problem in simulation (see, e.g., Henderson (2003); Song et al. (2014); Barton (2012); Lam (2016)). Here, to guarantee consistency, the sizes in the two sampling layers not only need to be sufficiently large, but also depend on the input data size, which adds more complication to their choices (Lam and Qian (2021)). Other than simulation modeling, a similar phenomenon arises in some machine learning models. For example, a bagging predictor (Breiman (1996); Wager et al. (2014); Mentch and Hooker (2016)) averages a large number of base learning models each trained from a resample, and a deep ensemble (Lakshminarayanan et al. (2016); Lee et al. (2015)) averages several neural networks each trained with a different randomization seed (e.g., in initializing a stochastic gradient descent). Running bootstraps for the prediction values of these models again requires a multiplicative effort of data resampling and, given each resample, building a new predictor which requires multiple training runs.

Motivated by these challenges, our goal in this paper is to study a statistically valid bootstrap method that allows using very few resamples. More precisely, our method reduces the number of bootstrap resamples to the minimum possible, namely as low as one Monte Carlo replication, while still delivering asymptotically exact coverage as the data size grows. This latter guarantee holds under essentially the same regularity conditions as conventional bootstrap methods. As a result, our method can give valid inference under a constrained budget when existing bootstrap approaches fail. Moreover, the validity of our inference under any number of resamples endows more robustness, and hence alleviates the tuning effort, in choosing the number of resamples, which as mentioned before could be a trial-and-error task in practice. For convenience, we call our method the Cheap Bootstrap, where “Cheap” refers to our low computation cost, and also to the correspondingly lower-quality half-width performance that comes with this low cost, as we will see.

In terms of theory, the underpinning mechanism of our Cheap Bootstrap relies on a simple twist to the basic principle used in conventional bootstraps. The latter dictates that the sampling distribution of a statistic can be approximated by the resample counterpart, conditional on the data. To utilize this principle for inference, one then typically generates many resample estimates to obtain their approximate distribution. Our main idea takes this principle a step further by leveraging the following: The resemblance of the resampling distribution to the original sampling distribution, universally regardless of the data realization, implies the asymptotic independence between the original estimate and any resample estimate. Thus, under asymptotic normality, we can construct pivotal statistics using the original estimate and any number of resample estimates that cancel out the standard error as a nuisance parameter, which subsequently allows us to conduct inference using as low as one resample.

In addition to the basic asymptotic exact coverage guarantee, we study the Cheap Bootstrap in several other aspects. The first regards the half-width of the Cheap Bootstrap confidence interval, a statistical efficiency measure in addition to coverage performance. We show how the Cheap Bootstrap interval follows the behavior of a $t$-interval, which is wide when we use only one resample but shrinks rapidly as the number of resamples increases. Second, we investigate the higher-order coverage of the Cheap Bootstrap via Edgeworth expansions. Different from conventional bootstrap procedures, the Cheap Bootstrap pivotal statistic has a limiting $t$-distribution, not a normal distribution. Edgeworth expansion on the $t$-distribution is, to the best of our knowledge, open in the literature (except the recent working paper He and Lam (2021)). Here, we build the Edgeworth expansions for Cheap Bootstrap coverage probabilities and show their errors match the order $O(n^{-1})$ (where $n$ is the sample size) incurred in conventional bootstraps in the two-sided case. Moreover, we explicitly identify the coefficient in this error term, which is expressible as a high-dimensional integral that, although it cannot be evaluated in closed form, is amenable to Monte Carlo integration.

We also generalize the Cheap Bootstrap in several directions. First, we devise the Cheap Bootstrap confidence intervals for models that encounter both data and computation noises and hence the aforementioned nested sampling issue. When applying the Cheap Bootstrap to these problems, the number of outer resamples can be driven down to one or a very small number, and thus the total model run size is no longer multiplicatively big. On the other hand, because of the presence of multiple sources of noises, the joint distribution between the original estimate and the resample estimate is no longer asymptotically independent, but exhibits a dependent structure that nonetheless could be tractably exploited. Second, our Cheap Bootstrap can also be used in conjunction with “subsampling” variants. By subsampling here we broadly refer to schemes that resample data sets of size smaller than the original size, such as the mm-out-of-nn Bootstrap (Politis et al. (1999); Bickel et al. (1997)), and more recently the Bags of Little Bootstraps (Kleiner et al. (2014)) and Subsampled Double Bootstrap (Sengupta et al. (2016)). We show how these procedures can all be implemented “cheaply”. Finally, we also extend the Cheap Bootstrap to multivariate problems.

We close this introduction by discussing other related works. Regarding motivation, our study appears orthogonal to most of the classical bootstrap literature, as its focus is on higher-order coverage accuracy, using techniques such as studentization (Davison and Hinkley (1997) §2.4), calibration or iterated bootstrap (Hall (2013) §3.11; Hall and Martin (1988); Hall (1986a); Beran (1987)), and bias-correction and acceleration (Efron (1982) §10.4; Efron (1987); DiCiccio et al. (1996)). The literature also studies the relaxation of assumptions to handle non-smooth functions and dependent data (Politis et al. (1999); Bickel et al. (1997)). In contrast to these studies, we focus on the capability to output confidence intervals under extremely few resamples, and the produced intervals are admittedly less accurate than the refined intervals in the literature constructed under idealized infinite resamples.

Our study is related to Hall (1986b), which shows a uniform coverage error of O(n1)O(n^{-1}) for one-sided intervals regardless of the number of resamples BB, as long as the nominal coverage probability is a multiple of (B+1)1(B+1)^{-1}. From this, Hall (1986b) suggests, for 95%95\% interval for instance, that it suffices to use B=19B=19 to obtain a reasonable coverage accuracy. Our suggestion in this paper drives this number down to B=1B=1, which is made possible via our use of asymptotic independence and normality instead of order-statistic analyses on the quantile-based bootstrap in Hall (1986b). Our result on a small BB resulting in a large interval width matches the insight in Hall (1986b), though we make this tradeoff more explicit by looking at the width’s moments. Other related works on bootstrap computation cost include Booth and Hall (1994) and Lee and Young (1995) that analyze or reduce Monte Carlo sizes by replacing sampling with analytical calculation when applying the iterated bootstrap, and Booth and Do (1993) and Efron and Tibshirani (1994) §23 on variance reduction. These works, however, do not consider the exceedingly low number of resamples that we advocate in this paper.

The remainder of this paper is as follows. Section 2 introduces the Cheap Bootstrap. Section 3 presents the main theoretical results on asymptotic exactness and higher-order coverage error, and discusses half-width behaviors. Section 4 generalizes the Cheap Bootstrap to nested sampling problems, and Section 5 to subsampling. Section 6 shows numerical results to demonstrate our method and compare with benchmarks. The Appendix details proofs that are not included in the main body, presents additional theoretical and numerical results, and documents some technical background materials.

2 Cheap Bootstrap Confidence Intervals

We aim to construct a confidence interval for a target parameter ψ:=ψ(P)\psi:=\psi(P), where PP is the data distribution, and ψ:𝒫\psi:\mathcal{P}\to\mathbb{R} is a functional with 𝒫\mathcal{P} as the set of all distributions on the data domain. This ψ(P)\psi(P) can range from simple statistical summaries such as correlation coefficient, quantile, conditional value-at-risk, to model parameters such as regression coefficient and prediction error measurement.

Suppose we are given independent and identically distributed (i.i.d.) data of size nn, say X1,,XnX_{1},\ldots,X_{n}. A natural point estimate of ψ(P)\psi(P) is ψ^n:=ψ(P^n)\hat{\psi}_{n}:=\psi(\hat{P}_{n}), where P^n():=(1/n)i=1nI(Xi)\hat{P}_{n}(\cdot):=(1/n)\sum_{i=1}^{n}I(X_{i}\in\cdot) is the empirical distribution constructed from the data, and I()I(\cdot) denotes the indicator function.

Our approach to construct a confidence interval for ψ\psi proceeds as follows. For each replication b=1,,Bb=1,\ldots,B, we resample the data set, namely independently and uniformly sample with replacement from {X1,,Xn}\{X_{1},\ldots,X_{n}\} nn times, to obtain {X1b,,Xnb}\{X_{1}^{*b},\ldots,X_{n}^{*b}\}, and evaluate the resample estimate ψnb:=ψ(Pnb)\psi_{n}^{*b}:=\psi(P_{n}^{*b}), where Pnb()=(1/n)i=1nI(Xib)P_{n}^{*b}(\cdot)=(1/n)\sum_{i=1}^{n}I(X_{i}^{*b}\in\cdot) is the resample empirical distribution. Our confidence interval is

=[ψ^ntB,1α/2S,ψ^n+tB,1α/2S]\mathcal{I}=\left[\hat{\psi}_{n}-t_{B,1-\alpha/2}S,\ \hat{\psi}_{n}+t_{B,1-\alpha/2}S\right] (1)

where

S2=1Bb=1B(ψnbψ^n)2S^{2}=\frac{1}{B}\sum_{b=1}^{B}\left(\psi_{n}^{*b}-\hat{\psi}_{n}\right)^{2} (2)

Here, $S^{2}$ resembles the sample variance of the resample estimates, but “centered” at the original point estimate $\hat{\psi}_{n}$ instead of the resample mean, and using $B$ in the denominator instead of $B-1$ as in the “textbook” sample variance. The critical value $t_{B,1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of $t_{B}$, the Student $t$-distribution with $B$ degrees of freedom. That is, the number of degrees of freedom of this $t$-distribution is precisely the resampling computation effort.

The interval \mathcal{I} in (1) is defined for any positive integer B1B\geq 1. In particular, when B=1B=1, it becomes

[ψ^nt1,1α/2|ψnψ^n|,ψ^n+t1,1α/2|ψnψ^n|]\left[\hat{\psi}_{n}-t_{1,1-\alpha/2}\left|\psi_{n}^{*}-\hat{\psi}_{n}\right|,\ \hat{\psi}_{n}+t_{1,1-\alpha/2}\left|\psi_{n}^{*}-\hat{\psi}_{n}\right|\right] (3)

where ψn\psi_{n}^{*} is the single resample estimate.
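As a concrete illustration, the following is a minimal Python sketch of the interval (1)-(2); the helper name cheap_bootstrap_ci, the seeding convention, and the use of the sample mean as the estimator in the usage line are illustrative assumptions rather than part of the method's specification.

```python
# A minimal sketch of the Cheap Bootstrap interval (1)-(2), assuming the target
# functional psi is supplied as a plain function of a data array.
import numpy as np
from scipy import stats

def cheap_bootstrap_ci(data, psi, B=1, alpha=0.05, seed=None):
    """Return the (1 - alpha)-level Cheap Bootstrap interval using B resamples."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = psi(data)  # original estimate psi(P_hat_n)
    # resample estimates psi_n^{*b}, b = 1, ..., B (each a full n-out-of-n resample)
    psi_star = np.array([psi(rng.choice(data, size=n, replace=True)) for _ in range(B)])
    S = np.sqrt(np.mean((psi_star - psi_hat) ** 2))   # eq. (2), centered at psi_hat
    t_crit = stats.t.ppf(1 - alpha / 2, df=B)         # t_{B, 1 - alpha/2}
    return psi_hat - t_crit * S, psi_hat + t_crit * S  # eq. (1)

# Usage: a 95% interval for the mean from a single resample (B = 1).
if __name__ == "__main__":
    x = np.random.default_rng(0).normal(size=100)
    print(cheap_bootstrap_ci(x, np.mean, B=1))
```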

The form of (1) looks similar to the so-called standard error bootstrap as we explain momentarily, especially when BB is large. However, the point here is that BB does not need to be large. In fact, we have the following basic coverage guarantee for (1) and (3). First, consider the following condition that is standard in the bootstrap literature:

Assumption 1 (Standard condition for bootstrap validity).

We have n(ψ^nψ)N(0,σ2)\sqrt{n}(\hat{\psi}_{n}-\psi)\Rightarrow N(0,\sigma^{2}) where σ2>0\sigma^{2}>0. Moreover, a resample estimate ψn\psi_{n}^{*} satisfies n(ψnψ^n)N(0,σ2)\sqrt{n}(\psi_{n}^{*}-\hat{\psi}_{n})\Rightarrow N(0,\sigma^{2}) conditional on the data X1,X2,X_{1},X_{2},\ldots in probability as nn\to\infty.

In Assumption 1, “\Rightarrow” denotes convergence in distribution, and the conditional “\Rightarrow”-convergence in probability means P(n(ψnψ^n)x|P^n)pP(N(0,σ2)x)P(\sqrt{n}(\psi_{n}^{*}-\hat{\psi}_{n})\leq x|\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}P(N(0,\sigma^{2})\leq x) for any xx\in\mathbb{R}, where “p\stackrel{{\scriptstyle p}}{{\to}}” denotes convergence in probability. Assumption 1 is a standard condition to justify bootstrap validity, and is ensured when ψ()\psi(\cdot) is Hadamard differentiable (see Proposition 2 in the sequel which follows from Van der Vaart (2000) §23). This assumption implies that, conditional on the data, the asymptotic distributions of the centered resample estimate n(ψnψ^n)\sqrt{n}(\psi_{n}^{*}-\hat{\psi}_{n}) and the centered original estimate n(ψ^nψ)\sqrt{n}(\hat{\psi}_{n}-\psi) are the same. Thus, one can use the former distribution, which is computable via Monte Carlo, to approximate the latter unknown distribution. Simply put, we can use a “plug-in” of P^n\hat{P}_{n} in place of PP, namely ξ(ψ^n,ψn)\xi(\hat{\psi}_{n},\psi_{n}^{*}), to approximate ξ(ψ,ψ^n)\xi(\psi,\hat{\psi}_{n}) for any suitable data-dependent quantity ξ(,)\xi(\cdot,\cdot).

Under Assumption 1, (1) is an asymptotically exact (1α)(1-\alpha)-level confidence interval:

Theorem 1 (Asymptotic exactness of Cheap Bootstrap).

Under Assumption 1, for any B1B\geq 1, the interval \mathcal{I} in (1) is an asymptotically exact (1α)(1-\alpha)-level confidence interval for ψ\psi, i.e.,

n(ψ)1α\mathbb{P}_{n}(\psi\in\mathcal{I})\to 1-\alpha

as nn\to\infty, where n\mathbb{P}_{n} denotes the probability with respect to the data X1,,XnX_{1},\ldots,X_{n} and all randomness from the resampling.

Theorem 1 states that, under the same condition to justify the validity of conventional bootstraps, the Cheap Bootstrap interval \mathcal{I} has asymptotically exact coverage, for resample effort as low as 1. Before we explain how Theorem 1 is derived, we first compare the Cheap Bootstrap to conventional bootstraps.

2.1 Comparisons with Established Bootstrap Methods

The commonest bootstrap approach for interval construction uses the quantiles of resample estimates to calibrate the interval limits. This includes the basic bootstrap and (Efron’s) percentile bootstrap (Efron and Tibshirani (1994); Davison and Hinkley (1997)). In the basic bootstrap, we first run Monte Carlo to approximate the $\alpha/2$- and $(1-\alpha/2)$-quantiles of $\psi_{n}^{*}-\hat{\psi}_{n}$, say $q_{\alpha/2}$ and $q_{1-\alpha/2}$. Assumption 1 guarantees that $q_{\alpha/2}$ and $q_{1-\alpha/2}$ approximate the corresponding quantiles of $\hat{\psi}_{n}-\psi$, thus giving $[\hat{\psi}_{n}-q_{1-\alpha/2},\hat{\psi}_{n}-q_{\alpha/2}]$ as a $(1-\alpha)$ confidence interval for $\psi$ (or equivalently using the $\alpha/2$ and $(1-\alpha/2)$ quantiles of $2\hat{\psi}_{n}-\psi_{n}^{*}$). The percentile bootstrap directly uses the $\alpha/2$- and $(1-\alpha/2)$-quantiles of $\psi_{n}^{*}$ as the interval limits, i.e., $[q_{\alpha/2},q_{1-\alpha/2}]$, justified additionally by the symmetry of the asymptotic distribution. Alternatively, one can bootstrap the standard error and plug it into a normality confidence interval (Efron (1981)). That is, we compute $Var_{*}(\psi^{*}_{n})$ to approximate $Var(\hat{\psi}_{n})$ (where $Var$ and $Var_{*}$ denote respectively the variance with respect to the data and the conditional variance with respect to a resample drawn from $\hat{P}_{n}$), and output $[\hat{\psi}_{n}\pm z_{1-\alpha/2}\sqrt{Var_{*}(\psi^{*}_{n})}]$ as the confidence interval, with $z_{1-\alpha/2}$ being the $(1-\alpha/2)$-quantile of the standard normal.

All approaches above require generating enough resamples. When BB is large, the S2S^{2} defined in (2) satisfies S2Var(ψn)S^{2}\approx Var_{*}(\psi^{*}_{n}), and moreover tB,1α/2z1α/2t_{B,1-\alpha/2}\approx z_{1-\alpha/2}. Thus in this case the Cheap Bootstrap interval \mathcal{I} becomes ψ^±z1α/2Var(ψn)\hat{\psi}\pm z_{1-\alpha/2}\sqrt{Var_{*}(\psi^{*}_{n})}, which is nothing but the standard error bootstrap. In other words, the Cheap Bootstrap can be viewed as a generalization of the standard error bootstrap to using any BB.

We also contrast our approach with the studentized bootstrap (e.g., Davison and Hinkley (1997) §2.4), which resamples pivotal quantities in the form (ψ^nψ)/σ^n(\hat{\psi}_{n}-\psi)/\hat{\sigma}_{n} where σ^n\hat{\sigma}_{n} is a sample standard error (computed from P^n\hat{P}_{n}). As widely known, while this approach bears the term “studentized”, it does not use the tt-distribution, and is motivated from better higher-order coverage accuracy. Moreover, attaining such an interval could require additional computation to resample the standard error. Our approach is orthogonal to this studentization in that we aim to minimize computation instead of expanding it for the sake of attaining higher-order accuracy.

Lastly, (1) and (3) have natural analogs for one-sided intervals, where we use

lower=(,ψ^n+tB,1αS] or upper=[ψ^ntB,1αS,).\mathcal{I}_{lower}=\left(-\infty,\ \hat{\psi}_{n}+t_{B,1-\alpha}S\right]\text{\ \ or\ \ }\mathcal{I}_{upper}=\left[\hat{\psi}_{n}-t_{B,1-\alpha}S,\ \infty\right). (4)

Theorem 1 applies to (4) under the same assumption. In the one-sided case, Hall (1986b) has shown an error of $O(n^{-1})$, uniformly in $B$, for the studentized bootstrap when the nominal coverage probability $(1-\alpha)$ is a multiple of $(B+1)^{-1}$. This translates, in the case of a $95\%$ interval, to a minimum of $B=19$. Our Theorem 1 drives down this suggestion of Hall (1986b) for the studentized bootstrap to $B=1$ for the Cheap Bootstrap.

2.2 A Basic Numerical Demonstration

We numerically compare our Cheap Bootstrap with the conventional approaches, namely the basic bootstrap, percentile bootstrap and standard error bootstrap described above. We use a basic example on estimating a 95%95\% confidence interval of the 0.60.6-quantile of, say, an exponential distribution with unit rate from i.i.d. data. This example can be handled readily by other means, but it demonstrates how Cheap Bootstrap can outperform baselines under limited replication budget.
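One synthetic run of this quantile example can be sketched as follows, reusing the cheap_bootstrap_ci helper from the sketch in Section 2 (again an illustrative implementation rather than the exact experimental code).

```python
# One synthetic run of the 0.6-quantile example with a single resample (B = 1).
# The true 0.6-quantile of a unit-rate exponential is -log(0.4), about 0.916.
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=1.0, size=100)      # n = 100 i.i.d. Exp(1) observations
q60 = lambda x: np.quantile(x, 0.6)              # the target functional psi
lo, hi = cheap_bootstrap_ci(data, q60, B=1)
print((lo, hi), lo <= -np.log(0.4) <= hi)        # interval and whether it covers the truth
```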

We use a data size n=100n=100, and vary B=1,2,5,10,50B=1,2,5,10,50. For each BB, we generate synthetic data 10001000 times, each time running all the competing methods, and outputting the empirical coverage and interval width statistics from the 10001000 experimental repetitions. Table 1 shows that when B=1B=1, Cheap Bootstrap already gives confidence intervals with a reasonable coverage of 92%92\%, while all other bootstrap methods fail because they simply cannot operate with only one resample (basic and percentile bootstraps cannot output two different finite numbers as the upper and lower interval limits, and standard error bootstrap has a zero denominator in the formula). When B=2B=2, all baseline methods still have poor performance, as B=2B=2 is clearly still too small a size to possess any statistical guarantees. In contrast, Cheap Bootstrap gives a similar coverage of 93%93\% as in B=1B=1, and continues to have stable coverages when BB increases. As BB increases through 5 to 50, the baseline methods gradually catch up on the coverage level.

Table 1: Interval performances with different bootstrap methods, at nominal confidence level 95%95\% and with sample size n=100n=100.
Method    Replication size $B$    Empirical coverage (margin of error)    Width mean (standard deviation)
Cheap Bootstrap 1 0.92  (0.02) 2.42  (2.06)
Basic Bootstrap 1 NA NA
Percentile Bootstrap 1 NA NA
Standard Error Bootstrap 1 NA NA
Cheap Bootstrap 2 0.93  (0.02) 0.95  (0.60)
Basic Bootstrap 2 0.32  (0.03) 0.14  (0.12)
Percentile Bootstrap 2 0.32  (0.03) 0.14  (0.12)
Standard Error Bootstrap 2 0.67  (0.03) 0.38  (0.33)
Cheap Bootstrap 5 0.92  (0.02) 0.63  (0.28)
Basic Bootstrap 5 0.62  (0.03) 0.29  (0.14)
Percentile Bootstrap 5 0.69  (0.03) 0.29  (0.14)
Standard Error Bootstrap 5 0.86  (0.02) 0.47  (0.22)
Cheap Bootstrap 10 0.92  (0.02) 0.53  (0.20)
Basic Bootstrap 10 0.73  (0.03) 0.37  (0.14)
Percentile Bootstrap 10 0.80  (0.02) 0.37  (0.14)
Standard Error Bootstrap 10 0.88  (0.02) 0.46  (0.17)
Cheap Bootstrap 50 0.94  (0.02) 0.50  (0.13)
Basic Bootstrap 50 0.86  (0.02) 0.44  (0.12)
Percentile Bootstrap 50 0.92  (0.02) 0.44  (0.12)
Standard Error Bootstrap 50 0.93  (0.02) 0.48  (0.13)

Regarding interval length, we see that the Cheap Bootstrap interval shrinks as $B$ increases, sharply for small $B$ and then stabilizing. In particular, the mean length decreases by $1.47$ when $B$ increases from 1 to 2, compared to a much smaller $0.03$ when $B$ increases from 10 to 50. Though the length of the Cheap Bootstrap interval is always larger than that of the baselines, it closes in at $B=10$ and even more so at $B=50$. Both the good coverage starting from $B=1$ for the Cheap Bootstrap, and the sharp decrease of its interval length at very small $B$ followed by a level-off, will be explained by our theory next.

3 Theory of Cheap Bootstrap

We describe the theory of Cheap Bootstrap, including its asymptotic exactness (Section 3.1), higher-order coverage error (Section 3.2) and interval width behavior (Section 3.3).

3.1 Asymptotic Exactness

We present the proof of Theorem 1 on the asymptotic exactness of Cheap Bootstrap for any B1B\geq 1. The first key ingredient is the following joint asymptotic characterization among the original estimate and the resample estimates:

Proposition 1 (Asymptotic independence and normality among original and resample estimates).

Under Assumption 1, we have, for the original estimate ψ^n\hat{\psi}_{n} and resample estimates ψnb,b=1,,B\psi_{n}^{*b},b=1,\ldots,B,

n(ψ^nψ,ψn1ψ^n,,ψnBψ^n)(σZ0,σZ1,,σZB)\sqrt{n}(\hat{\psi}_{n}-\psi,\psi^{*1}_{n}-\hat{\psi}_{n},\ldots,\psi^{*B}_{n}-\hat{\psi}_{n})\Rightarrow(\sigma Z_{0},\sigma Z_{1},\ldots,\sigma Z_{B}) (5)

where Zb,b=0,,BZ_{b},b=0,\ldots,B are i.i.d. N(0,1)N(0,1) variables.

The convergence (5) entails that, under n\sqrt{n}-scaling, the centered resample estimates and the original point estimate are all asymptotically independent, and moreover are distributed according to the same normal, with the only unknown being σ\sigma that captures the asymptotic standard error. The asymptotic independence is thanks to the universal limit of a resample estimate n(ψnbψ^n)\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n}) conditional on any data sequence, as detailed below.

Proof of Proposition 1.

Denote Φσ()\Phi_{\sigma}(\cdot) as the distribution function of N(0,σ2)N(0,\sigma^{2}). To show the joint weak convergence of n(ψ^nψ)\sqrt{n}(\hat{\psi}_{n}-\psi) and n(ψnbψ^n)\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n}), b=1,,Bb=1,\ldots,B, to i.i.d. N(0,σ2)N(0,\sigma^{2}) variables, we will show, for any xb,b=0,,Bx_{b}\in\mathbb{R},b=0,\ldots,B,

P(n(ψ^nψ)x0,n(ψn1ψ^n)x1,,n(ψnBψ^n)xB)b=0BΦσ(xb)P\left(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0},\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1},\ldots,\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}\right)\to\prod_{b=0}^{B}\Phi_{\sigma}(x_{b})

as nn\to\infty. To this end, we have

\begin{aligned}
& \left|P\left(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0},\ \sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1},\ \ldots,\ \sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}\right)-\prod_{b=0}^{B}\Phi_{\sigma}(x_{b})\right| \\
={}& \left|E\left[I(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\,P(\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1}|\hat{P}_{n})\cdots P(\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}|\hat{P}_{n})\right]-\prod_{b=0}^{B}\Phi_{\sigma}(x_{b})\right| \\
&\qquad\text{since $\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n})$, $b=1,\ldots,B$, are independent conditional on the data $X_{1},\ldots,X_{n}$} \\
={}& \bigg|E\left[I(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\,P(\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1}|\hat{P}_{n})\cdots P(\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}|\hat{P}_{n})\right] \\
&\quad -P(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})+P(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})-\prod_{b=0}^{B}\Phi_{\sigma}(x_{b})\bigg| \\
\leq{}& E\bigg[I(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\bigg|P(\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1}|\hat{P}_{n})\cdots P(\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}|\hat{P}_{n})-\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})\bigg|\bigg] \\
&\quad +\left|\left(P(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})-\Phi_{\sigma}(x_{0})\right)\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})\right| \quad\text{by the triangle inequality} \\
\leq{}& E\left|P(\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})\leq x_{1}|\hat{P}_{n})\cdots P(\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})\leq x_{B}|\hat{P}_{n})-\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})\right| \\
&\quad +\left|P(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})-\Phi_{\sigma}(x_{0})\right|\prod_{b=1}^{B}\Phi_{\sigma}(x_{b})
\end{aligned} (6)

Now, by Assumption 1, we have P(n(ψnbψ^n)xb|P^n)pΦσ(xb)P(\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n})\leq x_{b}|\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}\Phi_{\sigma}(x_{b}) for b=1,,Bb=1,\ldots,B, and thus b=1BP(n(ψnbψ^n)xb|P^n)pb=1BΦσ(xb)\prod_{b=1}^{B}P(\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n})\leq x_{b}|\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}\prod_{b=1}^{B}\Phi_{\sigma}(x_{b}). Hence, since P(n(ψnbψ^n)xb|P^n)P(\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n})\leq x_{b}|\hat{P}_{n}) and Φσ(xb)\Phi_{\sigma}(x_{b}) are bounded by 1, the first term in (6) converges to 0 by the bounded convergence theorem. The second term in (6) also goes to 0 because P(n(ψ^nψ)x0)Φσ(x0)P(\sqrt{n}(\hat{\psi}_{n}-\psi)\leq x_{0})\to\Phi_{\sigma}(x_{0}) by Assumption 1 again. Therefore (6) converges to 0 as nn\to\infty. ∎
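A quick numerical illustration of Proposition 1 (not needed for the theory, and using the sample mean purely as a stand-in estimator) is to verify that, across many synthetic data sets, the centered original estimate and a centered resample estimate are essentially uncorrelated; a minimal sketch is below.

```python
# Empirically checking the asymptotic independence in Proposition 1:
# the correlation between sqrt(n)(psi_hat - psi) and sqrt(n)(psi* - psi_hat)
# should be close to zero for moderately large n.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 2000
orig, boot = np.empty(reps), np.empty(reps)
for i in range(reps):
    x = rng.exponential(size=n)                       # data with psi = E[X] = 1
    psi_hat = x.mean()
    x_star = rng.choice(x, size=n, replace=True)      # one resample
    orig[i] = np.sqrt(n) * (psi_hat - 1.0)
    boot[i] = np.sqrt(n) * (x_star.mean() - psi_hat)
print(np.corrcoef(orig, boot)[0, 1])                  # expected to be near 0
```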

Given Proposition 1, to infer ψ\psi we can now leverage classical normality inference tools to “cancel out” the nuisance parameter σ\sigma. In particular, we can use the pivotal statistic

T:=ψ^nψST:=\frac{\hat{\psi}_{n}-\psi}{S} (7)

where $S^{2}$ is as defined in (2); this statistic converges to a Student $t$-distribution with $B$ degrees of freedom. More concretely:

Proof of Theorem 1.

The pivotal statistic (7) satisfies

T=ψ^nψS=n(ψ^nψ)1Bb=1B(n(ψnbψ^n))2Z01Bb=1BZb2T=\frac{\hat{\psi}_{n}-\psi}{S}=\frac{\sqrt{n}(\hat{\psi}_{n}-\psi)}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n}))^{2}}}\Rightarrow\frac{Z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}Z_{b}^{2}}} (8)

for i.i.d. N(0,1)N(0,1) variables Z0,Z1,,ZBZ_{0},Z_{1},\ldots,Z_{B}, where we use Proposition 1 and the continuous mapping theorem to deduce the weak convergence. Note that

Z01Bb=1BZb2=dN(0,1)χB2/B=dtB\frac{Z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}Z_{b}^{2}}}\stackrel{{\scriptstyle d}}{{=}}\frac{N(0,1)}{\sqrt{\chi^{2}_{B}/B}}\stackrel{{\scriptstyle d}}{{=}}t_{B}

where N(0,1)N(0,1) and χB2\chi^{2}_{B} here represent standard normal and χB2\chi^{2}_{B} random variables. Here the two equalities in distribution (denoted “=d\stackrel{{\scriptstyle d}}{{=}}”) follow from the elementary constructions of χ2\chi^{2}- and tt-distributions respectively. Thus

T=ψ^nψStBT=\frac{\hat{\psi}_{n}-\psi}{S}\Rightarrow t_{B}

Hence we have

n(tB,1α/2ψ^nψStB,1α/2)1α\mathbb{P}_{n}\left(-t_{B,1-\alpha/2}\leq\frac{\hat{\psi}_{n}-\psi}{S}\leq t_{B,1-\alpha/2}\right)\to 1-\alpha

from which we conclude

n(ψ^ntB,1α/2Sψψ^n+tB,1α/2S)1α\mathbb{P}_{n}\left(\hat{\psi}_{n}-t_{B,1-\alpha/2}S\leq\psi\leq\hat{\psi}_{n}+t_{B,1-\alpha/2}S\right)\to 1-\alpha

and the theorem. ∎

Note that instead of using the tt-statistic approach, it is also possible to produce intervals from the normal variables in other ways (e.g., Wall et al. (2001)) which could have potential benefits for very small BB. However, the tt-interval form of (1) is intuitive, matches the standard error bootstrap as BB grows, and its width is easy to quantify.

In Theorem 1 we have used Assumption 1. This assumption is ensured when the functional ψ()\psi(\cdot) is Hadamard differentiable with a non-degenerate derivative. More precisely, for a class of functions \mathcal{F} from 𝒳\mathcal{X}\to\mathbb{R}, define ():={z:z:=supf|z(f)|<}\ell^{\infty}(\mathcal{F}):=\left\{z:\|z\|_{\mathcal{F}}:=\sup_{f\in\mathcal{F}}|z(f)|<\infty\right\} where zz is a map from \mathcal{F} to \mathbb{R}. We have the following:

Proposition 2 (Sufficient conditions for bootstrap validity).

Consider P^n\hat{P}_{n} and PnP_{n}^{*} as random elements that take values in ()\ell^{\infty}(\mathcal{F}), where \mathcal{F} is a Donsker class with finite envelope. Suppose ψ:()\psi:\ell^{\infty}(\mathcal{F})\to\mathbb{R} is Hadamard differentiable at PP (tangential to ()\ell^{\infty}(\mathcal{F})) where the derivative ψP\psi_{P}^{\prime} satisfies that ψP(𝔾P)\psi_{P}^{\prime}(\mathbb{G}_{P}) is a non-degenerate random variable (i.e., with positive variance), for a tight Gaussian process 𝔾P\mathbb{G}_{P} on ()\ell^{\infty}(\mathcal{F}) with mean 0 and covariance Cov(𝔾P(f1),𝔾P(f2))=CovP(f1(X),f2(X))Cov(\mathbb{G}_{P}(f_{1}),\mathbb{G}_{P}(f_{2}))=Cov_{P}(f_{1}(X),f_{2}(X)) (where CovPCov_{P} denotes the covariance taken with respect to PP). Then Assumption 1 holds under i.i.d. data.

Proposition 2 follows immediately from the functional delta method (see Van der Vaart (2000) §23 or Appendix F.2). Note that we may as well use the conditions in Proposition 2 as the assumption for Theorem 1, but Assumption 1 helps highlight our point that Cheap Bootstrap is valid whenever conventional bootstraps are.

3.2 Higher-Order Coverage Errors

We analyze the higher-order coverage errors of the Cheap Bootstrap. A common approach to analyze coverage errors in conventional bootstraps is to use Edgeworth expansions, which we will also utilize. However, unlike these existing methods, the pivotal statistic used in the Cheap Bootstrap has a limiting $t$-distribution, not a normal distribution. Edgeworth expansions on limiting $t$-distributions appear, to the best of our knowledge, open in the literature (except for the recent working paper He and Lam (2021)). Here, we derive our expansions for the Cheap Bootstrap by integrating the expansions of the original estimate and resample estimates that follow (conditional) asymptotic normal distributions. The resulting coefficients can be explicitly identified and, even though they cannot be evaluated in closed form, are amenable to Monte Carlo integration.

As is customary in the bootstrap literature, we consider the function-of-mean model, namely ψ=g(𝝁)\psi=g(\bm{\mu}), where 𝝁=E𝐗\bm{\mu}=E\mathbf{X} for a dd-dimensional random vector 𝐗\mathbf{X} and g:dg:\mathbb{R}^{d}\to\mathbb{R} is a function. Denote 𝐗¯=(1/n)i=1n𝐗i\overline{\mathbf{X}}=(1/n)\sum_{i=1}^{n}\mathbf{X}_{i} as the sample mean of i.i.d. data {𝐗1,,𝐗n}\{\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\}. To facilitate our discussion, define

A(𝐱)=g(𝐱)g(𝝁)h(𝝁)A(\mathbf{x})=\frac{g(\mathbf{x})-g(\bm{\mu})}{h(\bm{\mu})} (9)

for function h:dh:\mathbb{R}^{d}\to\mathbb{R}, where h(𝝁)2h(\bm{\mu})^{2} is the asymptotic variance of ng(𝐗¯)\sqrt{n}g(\overline{\mathbf{X}}) that can be written in terms of 𝝁\bm{\mu} by the delta method (under regularity conditions that will be listed explicitly in our following theorem). To be more precise on the latter point, note that given any 𝝁~=E𝐗~\tilde{\bm{\mu}}=E\tilde{\mathbf{X}}, we can augment it to 𝝁=E[𝐗~𝐗~2]\bm{\mu}=E\left[\begin{array}[]{c}\tilde{\mathbf{X}}\\ \tilde{\mathbf{X}}^{2}\end{array}\right], with the operation 2\cdot^{2} defined component-wise (viewing 𝐗~\tilde{\mathbf{X}} as column vector). Thus, for a given g~(𝝁~)\tilde{g}(\tilde{\bm{\mu}}), we can define g(𝝁)=g~(𝝁~)g(\bm{\mu})=\tilde{g}(\tilde{\bm{\mu}}) and h(𝝁)2=g~(𝝁~)Σg~(𝝁~)h(\bm{\mu})^{2}=\nabla{\tilde{g}}(\tilde{\bm{\mu}})^{\top}\Sigma\nabla{\tilde{g}}(\tilde{\bm{\mu}}), where Σ\Sigma is the covariance matrix Cov(𝐗~)Cov(\tilde{\mathbf{X}}) which is a function of 𝝁\bm{\mu}, and \top denotes transpose.

We also define the “studentized” version of AA given by

As(𝐱)=g(𝐱)g(𝝁)h(𝐱)A_{s}(\mathbf{x})=\frac{g(\mathbf{x})-g(\bm{\mu})}{h(\mathbf{x})} (10)

We have the following:

Theorem 2 (Higher-order coverage errors for Cheap Bootstrap).

Consider the function-of-mean model where $\psi=g(\bm{\mu})$ for some function $g:\mathbb{R}^{d}\to\mathbb{R}$ and $\bm{\mu}=E\mathbf{X}$ for a $d$-dimensional random vector $\mathbf{X}$. Consider also the function $h:\mathbb{R}^{d}\to\mathbb{R}$ that appears in (9). Assume that $g$ and $h$ each has $\nu+3$ bounded derivatives in a neighborhood of $\bm{\mu}$, that $E\|\mathbf{X}\|^{l}<\infty$ for a sufficiently large positive number $l$, and that the characteristic function $\chi$ of $\mathbf{X}$ satisfies Cramér’s condition $\limsup_{\|\mathbf{t}\|\to\infty}|\chi(\mathbf{t})|<1$. Then

  1.

    When ν2\nu\geq 2, the two-sided Cheap Bootstrap confidence interval \mathcal{I} satisfies

    n(g(𝝁))=(1α)+ζn+o(1n)\mathbb{P}_{n}(g(\bm{\mu})\in\mathcal{I})=(1-\alpha)+\frac{\zeta}{n}+o\left(\frac{1}{n}\right)

    where the coefficient

    \zeta = B\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d\Phi(z_{0})+\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d(q_{2}(z_{0})\phi(z_{0}))
  2.

    When ν1\nu\geq 1, the one-sided upper Cheap Bootstrap confidence interval upper\mathcal{I}_{upper} satisfies

    n(g(𝝁)upper)=(1α)+ζuppern+o(1n)\mathbb{P}_{n}(g(\bm{\mu})\in\mathcal{I}_{upper})=(1-\alpha)+\frac{\zeta_{upper}}{\sqrt{n}}+o\left(\frac{1}{\sqrt{n}}\right)

    where the coefficient

    \zeta_{upper} = B\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d(p_{1}(z_{B})\phi(z_{B}))\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d\Phi(z_{0})+\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d(q_{1}(z_{0})\phi(z_{0}))

    and the one-sided lower Cheap Bootstrap confidence interval lower\mathcal{I}_{lower} satisfies

    n(g(𝝁)lower)=(1α)+ζlowern+o(1n)\mathbb{P}_{n}(g(\bm{\mu})\in\mathcal{I}_{lower})=(1-\alpha)+\frac{\zeta_{lower}}{\sqrt{n}}+o\left(\frac{1}{\sqrt{n}}\right)

    where the coefficient

    \zeta_{lower} = B\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\geq t_{B,1-\alpha}}d(p_{1}(z_{B})\phi(z_{B}))\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d\Phi(z_{0})+\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\geq t_{B,1-\alpha}}d\Phi(z_{B})\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d(q_{1}(z_{0})\phi(z_{0}))

    In the above, Φ\Phi and ϕ\phi are the standard normal distribution and density functions respectively, pjp_{j} and qjq_{j} are polynomials of degree 3j13j-1, odd for even jj and even for odd jj, with coefficients depending on moments of 𝐗\mathbf{X} up to order j+2j+2 polynomially and also g()g(\cdot) and h()h(\cdot).

In Theorem 2, the polynomials pjp_{j} and qjq_{j} are related to AA and AsA_{s} defined in (9) and (10) as follows. Under the assumptions in the theorem, the jj-th cumulant of nA(𝐗¯)\sqrt{n}A(\overline{\mathbf{X}}) and nAs(𝐗¯)\sqrt{n}A_{s}(\overline{\mathbf{X}}) can be expanded as

κj,n=n(j2)/2(kj,1+kj,2n+kj,3n2+)\kappa_{j,n}=n^{-(j-2)/2}\left(k_{j,1}+\frac{k_{j,2}}{n}+\frac{k_{j,3}}{n^{2}}+\cdots\right)

for coefficients kj,lk_{j,l}’s depending on whether we are considering AA or AsA_{s}. Then p1p_{1} or q1q_{1} is equal to

(k1,2+16k3,1H2(x))=(k1,2+16k3,1(x21))-\left(k_{1,2}+\frac{1}{6}k_{3,1}H_{2}(x)\right)=-\left(k_{1,2}+\frac{1}{6}k_{3,1}(x^{2}-1)\right)

while p2p_{2} or q2q_{2} is equal to

-\left(\frac{1}{2}(k_{2,2}+k_{1,2}^{2})H_{1}(x)+\frac{1}{24}(k_{4,1}+4k_{1,2}k_{3,1})H_{3}(x)+\frac{1}{72}k_{3,1}^{2}H_{5}(x)\right) = -x\left(\frac{1}{2}(k_{2,2}+k_{1,2}^{2})+\frac{1}{24}(k_{4,1}+4k_{1,2}k_{3,1})(x^{2}-3)+\frac{1}{72}k_{3,1}^{2}(x^{4}-10x^{2}+15)\right)

Here Hj()H_{j}(\cdot) is the jj-th order Hermite polynomial, and kj,lk_{j,l}’s are determined from AA for p1,p2p_{1},p_{2} and determined from AsA_{s} for q1,q2q_{1},q_{2}.

The coverage error O(n1)O(n^{-1}) of the Cheap Bootstrap in the two-sided case and O(n1/2)O(n^{-1/2}) in the one-sided case in Theorem 2 match the conventional basic and percentile bootstraps described in Section 2.1. Nonetheless, these errors are inferior to more refined approaches, including the studentized bootstrap also mentioned earlier which attains O(n1)O(n^{-1}) in the one-sided case (Davison and Hinkley (1997)).

Theorem 2 is proved by expressing the distribution of the pivotal statistic in (7) as a multi-dimensional integral, with respect to measures that are approximated by the Edgeworth expansions of the original estimate n(ψ^nψ)\sqrt{n}(\hat{\psi}_{n}-\psi) and the conditional expansions of the resample estimates n(ψnbψ^)\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}) which have limiting normal distributions. From these expansions, we could also identify the polynomials pjp_{j} and qjq_{j} in our discussion above by using equations (2.20), (2.24) and (2.25) in Hall (2013). Lastly, we note that the remainder term in the coverage of the two-sided confidence interval in Theorem 2 can be refined to O(1/n3/2)O(1/n^{3/2}) when ν3\nu\geq 3, and the one-sided intervals can be refined to O(1/n)O(1/n) when ν2\nu\geq 2. These can be seen by tracing our proof (in Appendix B).

The coefficients $\zeta$, $\zeta_{upper}$ and $\zeta_{lower}$ are computable via Monte Carlo integration, because an integral of the form

(z0,,zB)𝒮d(π(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)𝑑Φ(z0)\int\cdots\int_{(z_{0},\cdots,z_{B})\in\mathcal{S}}d(\pi(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})d\Phi(z_{0})

for some set 𝒮\mathcal{S} and polynomial π\pi can be written as

\int\cdots\int_{(z_{0},\ldots,z_{B})\in\mathcal{S}}\left(\pi^{\prime}(z_{B})\phi(z_{B})-z_{B}\pi(z_{B})\phi(z_{B})\right)dz_{B}\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d\Phi(z_{0}) = \int\cdots\int_{(z_{0},\ldots,z_{B})\in\mathcal{S}}\left(\pi^{\prime}(z_{B})-z_{B}\pi(z_{B})\right)d\Phi(z_{B})\,d\Phi(z_{B-1})\cdots d\Phi(z_{1})\,d\Phi(z_{0})

which is expressible as an expectation

E[π(ZB)ZBπ(ZB);(Z0,,ZB)𝒮]E[\pi^{\prime}(Z_{B})-Z_{B}\pi(Z_{B});(Z_{0},\cdots,Z_{B})\in\mathcal{S}]

taken with respect to independent standard normal variables Z0,,ZBZ_{0},\ldots,Z_{B}.
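As an illustration, a single term of these coefficients can be approximated by Monte Carlo along the lines below; the polynomial pi (standing in for $p_{2}$ or $q_{2}$), its example coefficients, and the function name zeta_term are hypothetical placeholders, since the actual polynomials depend on the problem through the moments of $\mathbf{X}$.

```python
# Monte Carlo sketch of one term of zeta in the two-sided case:
# E[pi'(Z_B) - Z_B * pi(Z_B); |Z_0 / sqrt((1/B) sum_b Z_b^2)| <= t_{B,1-alpha/2}].
import numpy as np
from numpy.polynomial import Polynomial
from scipy import stats

def zeta_term(pi, B, alpha=0.05, n_mc=10**6, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_mc, B + 1))                  # (Z_0, ..., Z_B)
    T = Z[:, 0] / np.sqrt(np.mean(Z[:, 1:] ** 2, axis=1))   # the limiting pivot
    in_S = np.abs(T) <= stats.t.ppf(1 - alpha / 2, df=B)    # the two-sided region
    zB = Z[:, B]
    integrand = pi.deriv()(zB) - zB * pi(zB)                # pi'(z) - z * pi(z)
    return np.mean(np.where(in_S, integrand, 0.0))

# Example with a hypothetical odd polynomial standing in for p_2.
print(zeta_term(Polynomial([0.0, -0.5, 0.0, -0.1]), B=1))
```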

3.3 Interval Tightness

Besides coverage, another important efficiency criterion is the interval width. From Section 3.1, we know that the Cheap Bootstrap interval (1) arises from a tt-interval construction, using which we can readily extract its width behavior. More specifically, for a fixed number of resamples BB, S2S^{2} satisfies

nS1Bb=1BZb2=dσχB2B\sqrt{n}S\Rightarrow\sqrt{\frac{1}{B}\sum_{b=1}^{B}Z_{b}^{2}}\stackrel{{\scriptstyle d}}{{=}}\sigma\sqrt{\frac{\chi^{2}_{B}}{B}}

as nn\to\infty so that the half-width of \mathcal{I} is HWtB,1α/2σχB2/(nB)\text{HW}\approx t_{B,1-\alpha/2}\sigma\sqrt{\chi_{B}^{2}/(nB)}. Plugging in the moments of χB2\chi_{B}^{2}, we see that the half-width for large sample size nn has mean and variance given by

E[HW]\displaystyle E[\text{HW}] tB,1α/22BΓ((B+1)/2)Γ(B/2)σn\displaystyle\approx t_{B,1-\alpha/2}\sqrt{\frac{2}{B}}\frac{\Gamma((B+1)/2)}{\Gamma(B/2)}\frac{\sigma}{\sqrt{n}} (11)
Var(HW)\displaystyle Var(\text{HW}) tB,1α/22(B2Γ((B+1)/2)2Γ(B/2)2)σ2nB\displaystyle\approx t_{B,1-\alpha/2}^{2}\left(B-\frac{2\Gamma((B+1)/2)^{2}}{\Gamma(B/2)^{2}}\right)\frac{\sigma^{2}}{nB} (12)

respectively, where Γ()\Gamma(\cdot) is the Gamma function. As BB increases, both the mean and variance decrease, which signifies a natural gain in statistical efficiency, until in the limit B=B=\infty we get a mean z1α/2σ/nz_{1-\alpha/2}\sigma/\sqrt{n} and a variance 0, which correspond to the normality interval with a known σ\sigma.
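The approximations (11) and (12) are straightforward to evaluate numerically; the short sketch below (with illustrative function and variable names) reproduces the entries of Table 2.

```python
# Approximate half-width mean and standard deviation from (11)-(12),
# dropping the sigma / sqrt(n) factor as in Table 2.
import numpy as np
from scipy import stats
from scipy.special import gammaln

def half_width_stats(B, alpha=0.05):
    t_crit = stats.t.ppf(1 - alpha / 2, df=B)
    # E[sqrt(chi^2_B / B)] = sqrt(2/B) * Gamma((B+1)/2) / Gamma(B/2)
    mean_chi = np.sqrt(2.0 / B) * np.exp(gammaln((B + 1) / 2) - gammaln(B / 2))
    mean = t_crit * mean_chi
    sd = t_crit * np.sqrt(1.0 - mean_chi ** 2)   # since E[chi^2_B / B] = 1
    return mean, sd

for B in (1, 2, 3, 10, 20):
    print(B, half_width_stats(B))  # e.g. B = 1 gives approximately (10.14, 7.66)
```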

The expressions (11) and (12) reveal that the half-width of the Cheap Bootstrap is large when $B=1$, but falls and stabilizes quickly as $B$ increases. Table 2 reports the approximate half-width mean and standard deviation in (11) and (12) at $\alpha=5\%$ (ignoring the $\sigma/\sqrt{n}$ factor), and the relative inflation in mean half-width compared to the case $B=\infty$ (i.e., $z_{1-\alpha/2}=1.96$). We see that, as $B$ increases from 1 to 2, the half-width mean drops drastically from $10.14$ ($417.3\%$ inflation relative to $B=\infty$) to $3.81$ ($94.6\%$ inflation), and the half-width standard deviation from $7.66$ to $1.99$. As $B$ increases from 2 to 3, the mean continues to drop notably to $2.93$ ($49.6\%$ inflation) and the standard deviation to $1.24$. The drop rate slows down as $B$ increases further. For instance, at $B=10$ the mean is $2.17$ ($10.9\%$ inflation) and the standard deviation is $0.49$, while at $B=20$ the mean is $2.06$ ($5.1\%$ inflation) and the standard deviation is $0.33$. Though what constitutes an acceptable inflation level compared to $B=\infty$ is context-dependent, generally the inflation appears reasonably low even when $B$ is a small number, except perhaps when $B$ is 1 or 2.

Table 2: Approximate half-width performance of Cheap Bootstrap against BB at 95%95\% confidence level.
Replication size $B$    Mean (ignoring the $\sigma/\sqrt{n}$ factor)    Mean inflation relative to $B=\infty$ ($z_{0.975}$)    Standard deviation (ignoring the $\sigma/\sqrt{n}$ factor)
1 10.14 417.3% 7.66
2 3.81 94.6% 1.99
3 2.93 49.6% 1.24
4 2.61 33.2% 0.95
5 2.45 24.8% 0.79
6 2.35 19.8% 0.69
7 2.28 16.4% 0.62
8 2.24 14.0% 0.57
9 2.20 12.3% 0.53
10 2.17 10.9% 0.49
11 2.15 9.8% 0.46
12 2.13 8.9% 0.44
13 2.12 8.1% 0.42
14 2.11 7.5% 0.40
15 2.10 7.0% 0.39
16 2.09 6.5% 0.37
17 2.08 6.1% 0.36
18 2.07 5.7% 0.35
19 2.07 5.4% 0.34
20 2.06 5.1% 0.33

In Appendices A.1 and A.2, we discuss inference on standard error and a multivariate version of the Cheap Bootstrap, which are immediate extensions of our developments in this section. In the next two sections, we discuss two generalizations of our approach to models corrupted by both data and computation noises (Section 4) and to subsampling (Section 5).

4 Applying Cheap Bootstrap to Nested Sampling Problems

Our Cheap Bootstrap can reduce computational cost for problems which, when applying the conventional bootstraps, require nested sampling. This phenomenon arises when the estimate involves noises coming from both the data and computation procedure. To facilitate discussion, as in Section 2, denote ψ:=ψ(P)\psi:=\psi(P) as a target parameter or quantity of interest. However, here the computation of ψ(Q)\psi(Q), for any given QQ, could be noisy. More precisely, suppose that, given any QQ, we could only generate ψ^r(Q)\hat{\psi}_{r}(Q) where ψ^r(Q)\hat{\psi}_{r}(Q) is an unbiased output for ψ(Q)\psi(Q). Then, to compute a point estimate of ψ\psi, we use the data X1,,XnX_{1},\ldots,X_{n} to construct the empirical distribution P^n\hat{P}_{n}, and with this P^n\hat{P}_{n}, we output

ψ^^n,R=1Rr=1Rψ^r(P^n)\hat{\hat{\psi}}_{n,R}=\frac{1}{R}\sum_{r=1}^{R}\hat{\psi}_{r}(\hat{P}_{n}) (13)

where ψ^r(P^n)\hat{\psi}_{r}(\hat{P}_{n}), r=1,,Rr=1,\ldots,R is a sequence of unbiased runs for ψ(P^n)\psi(\hat{P}_{n}), independent given P^n\hat{P}_{n}. The “double hat” above ψ\psi in the left hand side of (13) signifies the two sources of noises, one from the estimation of PP and one from the estimation of ψ()\psi(\cdot), in the resulting overall point estimate.

The above construction that requires generating and averaging multiple noisy outputs of ψ^r\hat{\psi}_{r} arises in the following examples:

  1.

    Input uncertainty quantification in simulation modeling: Here ψ(P)\psi(P) denotes the output performance measure of a stochastic simulation model, where PP is the distribution of input random variates fed into the simulation logic. For instance, ψ(P)\psi(P) could be the expected workload of a queueing network, and PP denotes the inter-arrival time distribution. Estimating the performance measure ψ(P)\psi(P), even with a known input distribution PP, would require running the simulation model many times (i.e., RR times) and taking their average. When PP is observed from exogenous data, we use a plug-in estimate P^n\hat{P}_{n} to drive the input variates in the simulation model (Henderson (2003); Song et al. (2014); Barton (2012); Lam (2016)). This amounts to the point estimate (13).

  2.

    Bagging: Bagging predictors are constructed by averaging a large number of base predictors, where each base predictor is obtained by re-training the model on a new resample of the data (Breiman (1996); Bühlmann and Yu (2002)). Here, ψ(P)\psi(P) denotes the target predicted value (at some tested point) of the bagging predictor, and PP is the data distribution. The quantity ψ^r(P^n)\hat{\psi}_{r}(\hat{P}_{n}) denotes the predicted value of a model trained on a resample from P^n\hat{P}_{n}.

  3.

    Deep ensemble: Deep ensembles are predictors constructed by averaging several neural networks, each trained from the same data but using a different randomization seed (Lakshminarayanan et al. (2016); Lee et al. (2015)). The seed controls, for instance, the initialization of a stochastic gradient descent. Like bagging, ψ(P)\psi(P) here denotes the target predicted value (at some tested point) where PP is the data distribution. However, the quantity ψ^r(P^n)\hat{\psi}_{r}(\hat{P}_{n}) refers to the output of a trained individual neural network.

Applying the bootstrap to assess the uncertainty of (13) requires a nested sampling that involves two layers: First, we resample the data BB times to obtain Pnb,b=1,,BP_{n}^{*b},b=1,\ldots,B. Then, given each resample PnbP_{n}^{*b}, we run our computation procedure RR times to obtain ψ^r(Pnb),r=1,,R\hat{\psi}_{r}(P_{n}^{*b}),r=1,\ldots,R. The quantities

ψn,Rb=1Rr=1Rψ^r(Pnb),b=1,,B\psi_{n,R}^{**b}=\frac{1}{R}\sum_{r=1}^{R}\hat{\psi}_{r}(P_{n}^{*b}),\ \ b=1,\ldots,B (14)

then serve as the resample estimates whose statistic (e.g., quantiles or standard deviation) can be used to obtain a confidence interval for the target quantity ψ\psi. The computation effort in generating these intervals is the product between the outer number of resamples and the inner number of computation runs, namely BRBR. The number BB needs to be sufficiently large as required by conventional bootstraps. The number RR, depending on the context, may also need to be large. For instance, the simulation modeling and bagging examples above require a large enough RR. Furthermore, not only does RR need to be large, but it could also depend (linearly) on the data size nn in order to achieve consistency, as shown in the context of simulation modeling (Lam and Qian (2021, 2018c)). In other words, the difficulty with nested sampling when running the bootstrap not only lies in the multiplicative computation effort between BB and RR, but also that their choices can depend intricately on the data size which adds to the complication.
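To fix ideas, the estimates (13) and (14) can be sketched as follows; here noisy_psi stands for one unbiased but noisy computation run $\hat{\psi}_{r}(Q)$ driven by a data set drawn from $Q$ (e.g., one simulation replication or one trained base learner), and the helper name and arguments are illustrative assumptions. With the Cheap Bootstrap, $B$ below can be taken as small as 1, so the total effort $R_{0}+BR$ is no longer multiplicatively large in $B$.

```python
# Sketch of the nested point estimate (13) and resample estimates (14).
import numpy as np

def nested_estimates(data, noisy_psi, B=1, R=50, R0=50, seed=None):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # point estimate: average R0 noisy computation runs driven by P_hat_n, eq. (13)
    psi_hathat = np.mean([noisy_psi(data, rng) for _ in range(R0)])
    # resample estimates: for each of B data resamples, average R runs, eq. (14)
    psi_star = []
    for _ in range(B):
        resample = rng.choice(data, size=n, replace=True)
        psi_star.append(np.mean([noisy_psi(resample, rng) for _ in range(R)]))
    return psi_hathat, np.array(psi_star)

# A toy unbiased-but-noisy run for psi(Q) = mean of Q: a single random draw from Q.
toy_noisy_psi = lambda d, rng: rng.choice(d)
print(nested_estimates(np.random.default_rng(3).exponential(size=100), toy_noisy_psi, B=1))
```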

Cheap Bootstrap reduces the outer number of resamples BB to a low number, a strength that we investigate in this section. To build the theoretical framework, we first derive an asymptotic on problems involving noises from both data and computation. Denote τ2(Q)=Var(ψ^r(Q))\tau^{2}(Q)=Var(\hat{\psi}_{r}(Q)) and κ3(Q)=E|ψ^r(Q)ψ(Q)|3\kappa_{3}(Q)=E|\hat{\psi}_{r}(Q)-\psi(Q)|^{3} as the variance and third-order absolute central moment of a computation run driven by a given input distribution QQ. We assume the following.

Assumption 2 (Moment consistency of computation runs).

We have τ2(P^n)pτ2(P)=:τ2\tau^{2}(\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}\tau^{2}(P)=:\tau^{2} and κ3(P^n)pκ3(P)=:κ3\kappa_{3}(\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}\kappa_{3}(P)=:\kappa_{3} as nn\to\infty, where 0<τ2<0<\tau^{2}<\infty and κ3<\kappa_{3}<\infty. Similarly, τ2(Pn)pτ2(P)=τ2\tau^{2}(P_{n}^{*})\stackrel{{\scriptstyle p}}{{\to}}\tau^{2}(P)=\tau^{2} and κ3(Pn)pκ3(P)=κ3\kappa_{3}(P_{n}^{*})\stackrel{{\scriptstyle p}}{{\to}}\kappa_{3}(P)=\kappa_{3}.

Assumption 2 can be justified if τ2()\tau^{2}(\cdot) and κ3()\kappa_{3}(\cdot) are Hadamard differentiable with non-degenerate derivatives like in Assumption 1. Specifically, we have the following:

Proposition 3.

If τ2()\tau^{2}(\cdot) and κ3()\kappa_{3}(\cdot) satisfy the assumptions on ψ()\psi(\cdot) in Proposition 2, then Assumption 2 holds.

In stating our asymptotic result below, we consider a slightly more general version of the bootstrap where the computation sizes in the original estimate and the resample estimates are allowed to be different, which gives some extra flexibility to our procedures. That is, we use ψ^^n,R0\hat{\hat{\psi}}_{n,R_{0}} as defined in (13) where R0R_{0} denotes the computation size in constructing the point estimate. Then we resample BB times and obtain ψn,Rb\psi_{n,R}^{**b} as defined in (14) where RR denotes the computation size in constructing each resample estimate. Moreover, we denote ψ^n=ψ(P^n)\hat{\psi}_{n}=\psi(\hat{P}_{n}) as the hypothetical original estimate constructed from P^n\hat{P}_{n} without any computation noise, and ψnb=ψ(Pnb)\psi_{n}^{*b}=\psi(P_{n}^{*b}) as the hypothetical bb-th resample estimate without computation noise. We have the following:

Theorem 3 (Joint asymptotic of original and resample estimates under both data and computation noises).

Suppose Assumptions 1 and 2 hold and n,R0,Rn,R_{0},R\to\infty such that R0/np0R_{0}/n\to p_{0} and R/npR/n\to p for some constants 0<p0,p<10<p_{0},p<1. We have

\sqrt{n}\left(\hat{\hat{\psi}}_{n,R_{0}}-\psi,\ \psi_{n,R}^{**1}-\hat{\hat{\psi}}_{n,R_{0}},\ \ldots,\ \psi_{n,R}^{**B}-\hat{\hat{\psi}}_{n,R_{0}}\right)\Rightarrow\left(\sigma Z_{0}+\frac{\tau}{\sqrt{p_{0}}}W_{0},\ \sigma Z_{1}+\frac{\tau}{\sqrt{p}}W_{1}-\frac{\tau}{\sqrt{p_{0}}}W_{0},\ \ldots,\ \sigma Z_{B}+\frac{\tau}{\sqrt{p}}W_{B}-\frac{\tau}{\sqrt{p_{0}}}W_{0}\right) (15)

where Z0,Z1,,ZB,W0,W1,,WBN(0,1)Z_{0},Z_{1},\ldots,Z_{B},W_{0},W_{1},\ldots,W_{B}\sim N(0,1) and are all independent.

Note that the scalings $R_{0}/n\to p_{0}$ and $R/n\to p$ in Theorem 3 are introduced only for mathematical convenience. The convergence (15) stipulates that

\left(\hat{\hat{\psi}}_{n,R_{0}}-\psi,\ \psi_{n,R}^{**1}-\hat{\hat{\psi}}_{n,R_{0}},\ \ldots,\ \psi_{n,R}^{**B}-\hat{\hat{\psi}}_{n,R_{0}}\right)\stackrel{d}{\approx}\left(\frac{\sigma}{\sqrt{n}}Z_{0}+\frac{\tau}{\sqrt{R_{0}}}W_{0},\ \frac{\sigma}{\sqrt{n}}Z_{1}+\frac{\tau}{\sqrt{R}}W_{1}-\frac{\tau}{\sqrt{R_{0}}}W_{0},\ \ldots,\ \frac{\sigma}{\sqrt{n}}Z_{B}+\frac{\tau}{\sqrt{R}}W_{B}-\frac{\tau}{\sqrt{R_{0}}}W_{0}\right)

It implies, in particular, that the standard error of ψ^^n,R0\hat{\hat{\psi}}_{n,R_{0}} is given by σ2/n+τ2/R0\sqrt{\sigma^{2}/n+\tau^{2}/R_{0}} which consists of two parts: σ2/n\sigma^{2}/n captures the variability from the data and τ2/R0\tau^{2}/R_{0} from the computation.

Theorem 3 is a generalization of Proposition 1. The random variables ZbZ_{b}’s and WbW_{b}’s in the limit each signifies a source of noise coming from either data or computation. More specifically, Z0Z_{0} comes from the original data, Zb,b=1,,BZ_{b},b=1,\ldots,B from each resample, W0W_{0} from the computation of the original estimate, and Wb,b=1,,BW_{b},b=1,\ldots,B from the computation of each resample. Unlike (5), in (15) the (centered) resample estimate ψn,Rbψ^^n,R0\psi_{n,R}^{**b}-\hat{\hat{\psi}}_{n,R_{0}} and original estimate ψ^^n,R0ψ\hat{\hat{\psi}}_{n,R_{0}}-\psi are no longer asymptotically independent, because they share the same source of noise W0W_{0}. This causes modifications to our Cheap Bootstrap in canceling out the nuisance parameter.

We will describe two approaches to construct Cheap Bootstrap confidence intervals for ψ\psi under the conditions in Theorem 3, both resulting in the form

[ψ^^n,R0q,1α/2S,ψ^^n,R0+q,1α/2S]\left[\hat{\hat{\psi}}_{n,R_{0}}-q_{\cdot,1-\alpha/2}S_{\cdot},\ \hat{\hat{\psi}}_{n,R_{0}}+q_{\cdot,1-\alpha/2}S_{\cdot}\right] (16)

where SS_{\cdot}, like the SS in (1), is a standard error estimate of ψ^^n,R0\hat{\hat{\psi}}_{n,R_{0}}. These two approaches differ by how we pool together the resample estimates in SS_{\cdot}, where the \cdot therein denotes the approach. Correspondingly, the critical value q,1α/2q_{\cdot,1-\alpha/2} is obtained from the (1α/2)(1-\alpha/2)-quantile of a distribution pertinent to the pivotal statistic using SS_{\cdot}. Both approaches we propose achieve asymptotic validity with very few resamples.

4.1 Centered at Original Estimate

In the first approach, we use in (16)

SO2=1Bb=1B(ψn,Rbψ^^n,R0)2S_{O}^{2}=\frac{1}{B}\sum_{b=1}^{B}(\psi_{n,R}^{**b}-\hat{\hat{\psi}}_{n,R_{0}})^{2} (17)

That is, SO2{S_{O}}^{2} acts as the sample variance of the bootstrap estimates, where the center of the squares is set to be ψ^^n,R0\hat{\hat{\psi}}_{n,R_{0}}. We choose qO,1α/2q_{O,1-\alpha/2} to be

qO,1α/2=min{q:minθ0F(q;θ,ρ)1α2}q_{O,1-\alpha/2}=\min\left\{q:\min_{\theta\geq 0}F(q;\theta,\rho)\geq 1-\frac{\alpha}{2}\right\} (18)

where F(q;θ,ρ)F(q;\theta,\rho) is the distribution function of

θV1+V2θ2+ρ2B(Y+V32)2θ2+ρ2BV3V2+V22\frac{\theta V_{1}+V_{2}}{\sqrt{\frac{\theta^{2}+\rho^{2}}{B}\left(Y+V_{3}^{2}\right)-2\sqrt{\frac{\theta^{2}+\rho^{2}}{B}}V_{3}V_{2}+V_{2}^{2}}} (19)

with V1,V2,V3i.i.d.N(0,1)V_{1},V_{2},V_{3}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1) and YχB12Y\sim\chi^{2}_{B-1} being all independent, and ρ=p0/p\rho=\sqrt{p_{0}/p} is the square-rooted ratio between the computation size used in the original estimate and each resample estimate. When B=1B=1, we set Y=0Y=0 in (19) so that the expression is equivalent to

θV1+V2|θ2+ρ2V3V2|\frac{\theta V_{1}+V_{2}}{|\sqrt{\theta^{2}+\rho^{2}}V_{3}-V_{2}|} (20)

We call the resulting interval

O=[ψ^^n,R0qO,1α/2SO,ψ^^n,R0+qO,1α/2SO]\mathcal{I}_{O}=\left[\hat{\hat{\psi}}_{n,R_{0}}-q_{O,1-\alpha/2}S_{O},\ \hat{\hat{\psi}}_{n,R_{0}}+q_{O,1-\alpha/2}S_{O}\right] (21)

the Cheap Bootstrap interval “centered at original estimate”.

Note that the distribution function F(q;θ,ρ)F(q;\theta,\rho) has no closed-form expression or reduction to any common distribution where a table or built-in calculator is available. However, it can be easily simulated, for instance, by running many independent copies of (V1,V2,V3,Y)(V_{1},V_{2},V_{3},Y), say (V1j,V2j,V3j,Yj)(V_{1}^{j},V_{2}^{j},V_{3}^{j},Y^{j}) for j=1,,Nj=1,\ldots,N, and computing (via, e.g., a simple grid search)

min{q:minθ01Nj=1NI(θV1j+V2jθ2+ρ2B(Yj+V3j2)2θ2+ρ2BV3jV2j+V2j2q)1α2}\min\left\{q:\min_{\theta\geq 0}\frac{1}{N}\sum_{j=1}^{N}I\left(\frac{\theta V_{1}^{j}+V_{2}^{j}}{\sqrt{\frac{\theta^{2}+\rho^{2}}{B}\left(Y^{j}+{V_{3}^{j}}^{2}\right)-2\sqrt{\frac{\theta^{2}+\rho^{2}}{B}}V_{3}^{j}V_{2}^{j}+{V_{2}^{j}}^{2}}}\leq q\right)\geq 1-\frac{\alpha}{2}\right\} (22)

where I()I(\cdot) denotes the indicator function, to approximate qO,1α/2q_{O,1-\alpha/2}. Note that this Monte Carlo computation only involves standard normal and χ2\chi^{2} random variables, not the noisy model evaluation ψ^r\hat{\psi}_{r} that could be expensive.
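For illustration, the following is a minimal Python sketch of this computation; the routine name, the coarse default grids and the simulation size are ours for illustration and can be refined (e.g., to the finer grid and N used in Section 6.3).

import numpy as np

def cheap_bootstrap_q_original(B, rho, alpha=0.05, N=100_000,
                               theta_grid=None, q_step=0.01, seed=0):
    # Simulate N independent copies of (V1, V2, V3, Y) appearing in (22).
    rng = np.random.default_rng(seed)
    V1, V2, V3 = rng.standard_normal((3, N))
    Y = rng.chisquare(B - 1, N) if B >= 2 else np.zeros(N)   # set Y = 0 when B = 1
    if theta_grid is None:
        theta_grid = np.arange(0.01, 100.01, 0.5)            # coarse grid over theta >= 0; refine as needed
    q_grid = np.arange(0.0, 50.0, q_step)
    worst_cdf = np.ones_like(q_grid)
    for theta in theta_grid:
        c = (theta ** 2 + rho ** 2) / B
        denom = np.sqrt(np.maximum(
            c * (Y + V3 ** 2) - 2.0 * np.sqrt(c) * V3 * V2 + V2 ** 2, 1e-12))
        stat = np.sort((theta * V1 + V2) / denom)
        ecdf = np.searchsorted(stat, q_grid, side="right") / N   # empirical F(q; theta, rho)
        worst_cdf = np.minimum(worst_cdf, ecdf)                  # minimize over theta
    feasible = worst_cdf >= 1 - alpha / 2
    return q_grid[np.argmax(feasible)] if feasible.any() else np.inf

# Example: q = cheap_bootstrap_q_original(B=2, rho=1.0) approximates the 97.5% critical value.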

We have the following guarantee:

Theorem 4 (Asymptotic validity of Cheap Bootstrap “centered at original estimate”).

Under the same assumptions and setting as in Theorem 3, the interval O\mathcal{I}_{O} in (21), where SO2S_{O}^{2} and qO,1α/2q_{O,1-\alpha/2} are defined in (17) and (18), is an asymptotically valid (1α)(1-\alpha)-level confidence interval, i.e., it satisfies

lim infnn(ψO)1α\liminf_{n\to\infty}\mathbb{P}_{n}(\psi\in\mathcal{I}_{O})\geq 1-\alpha

as nn\to\infty, where n\mathbb{P}_{n} denotes the probability with respect to the data X1,,XnX_{1},\ldots,X_{n} and all randomness from the resampling and computation.

Theorem 4 is obtained by looking at the asymptotic distribution of the (approximately) pivotal statistic (ψ^^n,R0ψ)/SO(\hat{\hat{\psi}}_{n,R_{0}}-\psi)/S_{O}. However, because of the dependence between the original and resample estimates in (15), the unknown nuisance parameters σ2\sigma^{2} and τ2\tau^{2} are not directly canceled out. To tackle this, we consider a worst-case calculation which leads to some conservativeness in the coverage guarantee, i.e., inequality instead of exact equality in Theorem 4.

4.2 Centered at Resample Mean

We consider our second approach. For B2B\geq 2, we use in (16)

SM2=1B1b=1B(ψn,Rbψn,R¯)2S_{M}^{2}=\frac{1}{B-1}\sum_{b=1}^{B}(\psi_{n,R}^{**b}-\overline{\psi_{n,R}^{**}})^{2} (23)

where ψn,R¯\overline{\psi_{n,R}^{**}} is the sample mean of ψn,Rb\psi_{n,R}^{**b}’s given by ψn,R¯=(1/B)b=1Bψn,Rb\overline{\psi_{n,R}^{**}}=(1/B)\sum_{b=1}^{B}\psi^{**b}_{n,R}. That is, we use the sample variance of ψn,Rb\psi^{**b}_{n,R}’s as our standard error estimate. Correspondingly, we set

qM,1α/2=max{ρ1,1}tB1,1α/2q_{M,1-\alpha/2}=\max\{\rho^{-1},1\}t_{B-1,1-\alpha/2} (24)

where, like in Section 4.1, ρ=p0/p\rho=\sqrt{p_{0}/p} is the square-rooted ratio between the computation sizes of the original estimate and each resample estimate. Also, recall tB1,1α/2t_{B-1,1-\alpha/2} is the (1α/2)(1-\alpha/2)-quantile of tB1t_{B-1}. We call the resulting interval

M=[ψ^^n,R0qM,1α/2SM,ψ^^n,R0+qM,1α/2SM]\mathcal{I}_{M}=\left[\hat{\hat{\psi}}_{n,R_{0}}-q_{M,1-\alpha/2}S_{M},\ \hat{\hat{\psi}}_{n,R_{0}}+q_{M,1-\alpha/2}S_{M}\right] (25)

the Cheap Bootstrap interval “centered at resample mean”.

We have the following guarantee:

Theorem 5 (Asymptotic validity of Cheap Bootstrap “centered at resample mean”).

Under the same assumptions and setting as in Theorem 3, for B2B\geq 2, the interval M\mathcal{I}_{M} in (25), where SM2S_{M}^{2} and qM,1α/2q_{M,1-\alpha/2} are defined in (23) and (24), is an asymptotically valid (1α)(1-\alpha)-level confidence interval, i.e., it satisfies

lim infnn(ψM)1α\liminf_{n\to\infty}\mathbb{P}_{n}(\psi\in\mathcal{I}_{M})\geq 1-\alpha

as nn\to\infty, where n\mathbb{P}_{n} denotes the probability with respect to the data X1,,XnX_{1},\ldots,X_{n} and all randomness from the resampling and computation. Moreover, if the computation sizes for each resample estimate and the original estimate are the same, i.e., ρ=1\rho=1, then M\mathcal{I}_{M} is asymptotically exact, i.e.,

limnn(ψM)=1α\lim_{n\to\infty}\mathbb{P}_{n}(\psi\in\mathcal{I}_{M})=1-\alpha

Compared to O\mathcal{I}_{O} in Section 4.1, M\mathcal{I}_{M} does not require Monte Carlo computation for qM,1α/2q_{M,1-\alpha/2} and is thus easier to use. Moreover, note that when the computation size used in each resample estimate is at most that used in the original estimate, i.e., ρ11\rho^{-1}\leq 1, we have qM,1α/2=tB1,1α/2q_{M,1-\alpha/2}=t_{B-1,1-\alpha/2}, so the critical value in M\mathcal{I}_{M} reduces to a standard tt-quantile, just like the interval (1) for non-nested problems. Furthermore, when these computation sizes are the same, we have asymptotically exact coverage in Theorem 5. On the other hand, M\mathcal{I}_{M} is defined for B2B\geq 2 instead of B1B\geq 1, and thus its requirement on BB is slightly more stringent than that of O\mathcal{I}_{O} and (1).
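As an illustration of the overall procedure, the following is a minimal Python sketch of constructing the interval centered at resample mean in (25); the routine estimate(data, R), which averages R noisy model evaluations under the empirical distribution of its input as in (13) and (14), is an assumed user-supplied function, and the names and defaults are ours.

import numpy as np
from scipy import stats

def cheap_bootstrap_resample_mean(data, estimate, R0, R, B=2, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = estimate(data, R0)                     # original estimate using R0 computation runs
    psi_star = np.array([estimate(data[rng.integers(0, n, n)], R)   # B resample estimates, R runs each
                         for _ in range(B)])
    S_M = np.std(psi_star, ddof=1)                   # sample standard deviation around the resample mean, (23)
    rho = np.sqrt(R0 / R)                            # square-rooted ratio of computation sizes
    q = max(1.0 / rho, 1.0) * stats.t.ppf(1 - alpha / 2, df=B - 1)   # critical value (24)
    return psi_hat - q * S_M, psi_hat + q * S_M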

5 Cheap Subsampling

We integrate Cheap Bootstrap into, in broad terms, subsampling methods in which the bootstrap resample has a smaller size than the original full data size. Subsampling is motivated by the lack of consistency when applying standard bootstraps to non-smooth problems (Politis et al. (1999); Bickel et al. (1997)), but it can also be used to address the computational challenge of repeatedly fitting models over large data sets (Kleiner et al. (2012)). The latter arises because standard sampling with replacement on the original data retains around 63%63\% of the observation values, so for many problems each bootstrap resample estimate requires roughly the same order of computation as the original estimate constructed from the raw data.

Here we consider three subsampling variants: mm-out-of-nn Bootstrap (Bickel et al. (1997)), and the more recent Bag of Little Bootstraps (Kleiner et al. (2014)) and Subsampled Double Bootstrap (Sengupta et al. (2016)). The validity of these methods relies on generalizations of Assumption 1 (though not always, as explained at the end of this section), in which the distribution of n(ψ^nψ)\sqrt{n}(\hat{\psi}_{n}-\psi) can be approximated by that of N(ΨΨ)\sqrt{N}(\Psi^{*}-\Psi), where Ψ\Psi^{*} and/or Ψ\Psi are some estimators obtained using a small resample data size or number of distinct observed values, and NN is a suitable scaling parameter. More precisely, suppose we have N(ΨΨ)N(0,σ2)\sqrt{N}(\Psi^{*}-\Psi)\Rightarrow N(0,\sigma^{2}) in probability conditional on the data X1,,XnX_{1},\ldots,X_{n} and possibly independent randomness (the independent randomness is sometimes used to help determine Ψ\Psi). Then, to construct a (1α)(1-\alpha)-level confidence interval for ψ\psi, we can simulate the α/2\alpha/2 and (1α/2)(1-\alpha/2)-quantiles of ΨΨ\Psi^{*}-\Psi, say Qα/2Q_{\alpha/2} and Q1α/2Q_{1-\alpha/2}, and then output

[ψ^nNnQ1α/2,ψ^nNnQα/2]\left[\hat{\psi}_{n}-\sqrt{\frac{N}{n}}Q_{1-\alpha/2},\ \hat{\psi}_{n}-\sqrt{\frac{N}{n}}Q_{\alpha/2}\right]

Like in the standard bootstrap, such an approach requires many Monte Carlo replications.

We describe how to use subsampling in conjunction with Cheap Bootstrap to devise bootstrap schemes that are small simultaneously in the number of resamples and in the resample data size (or the number of distinct observed values in use). Following the ideas in Sections 2 and 3, our Cheap Subsampling uses the following unified framework: Generate Ψb\Psi^{*b} and Ψb\Psi^{b} given X1,,XnX_{1},\ldots,X_{n} and any required additional randomness, for b=1,,Bb=1,\ldots,B. Then the (1α)(1-\alpha)-level confidence interval is given by

[ψ^ntB,1α/2NnS,ψ^n+tB,1α/2NnS]\left[\hat{\psi}_{n}-t_{B,1-\alpha/2}\sqrt{\frac{N}{n}}S,\ \hat{\psi}_{n}+t_{B,1-\alpha/2}\sqrt{\frac{N}{n}}S\right] (26)

where

S2=1Bb=1B(ΨbΨb)2S^{2}=\frac{1}{B}\sum_{b=1}^{B}\left(\Psi^{*b}-\Psi^{b}\right)^{2}

Compared to (1), the main difference in (26) is the adjusting factor N/n\sqrt{N/n} in the standard error.

Below we describe the application of the above framework to three subsampling variants. As in Section 2, suppose the target parameter is ψ=ψ(P)\psi=\psi(P) and we have obtained the point estimate ψ^n=ψ(P^n)\hat{\psi}_{n}=\psi(\hat{P}_{n}).

5.1 Cheap mm-out-of-nn Bootstrap

We compute BB subsample estimates ψsb:=ψ(Psb)\psi_{s}^{*b}:=\psi(P_{s}^{*b}), b=1,,Bb=1,\ldots,B, where PsbP_{s}^{*b} is the empirical distribution constructed from a subsample of size s<ns<n drawn from the raw data {X1,,Xn}\{X_{1},\ldots,X_{n}\} via sampling with replacement. Then output the interval

[ψ^ntB,1α/2snS,ψ^n+tB,1α/2snS]\left[\hat{\psi}_{n}-t_{B,1-\alpha/2}\sqrt{\frac{s}{n}}S,\ \hat{\psi}_{n}+t_{B,1-\alpha/2}\sqrt{\frac{s}{n}}S\right] (27)

where S2=1Bb=1B(ψsbψ^n)2S^{2}=\frac{1}{B}\sum_{b=1}^{B}\left(\psi_{s}^{*b}-\hat{\psi}_{n}\right)^{2}.

Cast in our framework above, here we take Ψb\Psi^{b} to be always the original estimate ψ^n\hat{\psi}_{n}, Ψb\Psi^{*b} to be a subsample estimate ψsb\psi_{s}^{*b}, and the scale parameter NN to be the subsample size ss.
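A minimal Python sketch of this procedure is as follows; psi denotes an assumed user-supplied estimation routine mapping a data array to the point estimate, and the function name and defaults are ours for illustration.

import numpy as np
from scipy import stats

def cheap_m_out_of_n(data, psi, s, B=1, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = psi(data)                                       # original full-data estimate
    psi_sub = np.array([psi(data[rng.integers(0, n, s)])      # size-s resamples drawn with replacement
                        for _ in range(B)])
    S = np.sqrt(np.mean((psi_sub - psi_hat) ** 2))            # S^2 centered at the original estimate
    half_width = stats.t.ppf(1 - alpha / 2, df=B) * np.sqrt(s / n) * S   # interval (27)
    return psi_hat - half_width, psi_hat + half_width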

5.2 Cheap Bag of Little Bootstraps

We first construct ψs:=ψ(Ps)\psi_{s}^{*}:=\psi(P_{s}^{*}), where PsP_{s}^{*} is the empirical distribution of a subsample of size s<ns<n, called 𝐗s\mathbf{X}_{s}^{*}, drawn from the raw data {X1,,Xn}\{X_{1},\ldots,X_{n}\} via sampling without replacement (i.e., 𝐗s\mathbf{X}_{s}^{*} is an ss-subset of {X1,,Xn}\{X_{1},\ldots,X_{n}\}). The subsample 𝐗s\mathbf{X}_{s}^{*}, once drawn, is fixed throughout. Given 𝐗s\mathbf{X}_{s}^{*}, we generate ψnb:=ψ(Pnb)\psi^{**b}_{n}:=\psi(P_{n}^{**b}), b=1,,Bb=1,\ldots,B, where PnbP_{n}^{**b} is the empirical distribution constructed from sampling with replacement from the subsample 𝐗s\mathbf{X}_{s}^{*} for nn times, with nn being the original full data size (i.e., PnbP_{n}^{**b} is a weighted empirical distribution over 𝐗s\mathbf{X}_{s}^{*}). We output the interval in (1), where now

S2=1Bb=1B(ψnbψs)2S^{2}=\frac{1}{B}\sum_{b=1}^{B}\left(\psi_{n}^{**b}-\psi_{s}^{*}\right)^{2} (28)

Cast in our framework, here we take Ψb\Psi^{b} to be always ψs\psi_{s}^{*}, Ψb\Psi^{*b} to be the double resample estimate ψnb\psi_{n}^{**b}, and the scale parameter NN to be the original data size nn. The subsampling used to obtain ψs\psi_{s}^{*} introduces an additional randomness that is conditioned upon in the conditional weak convergence N(ΨΨ)N(0,σ2)\sqrt{N}(\Psi^{*}-\Psi)\Rightarrow N(0,\sigma^{2}) described before. Note that when ss is small, the double resample estimate ψnb\psi_{n}^{**b}, though constructed with a full size data, uses only a small number of distinct data points which, in problems such as MM-estimation, involves only a weighted estimation on the small-size subsample (Kleiner et al. (2014)).
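The following minimal Python sketch illustrates this scheme; psi is the plain estimator as before, while psi_w(points, weights) is an assumed weighted version that evaluates the estimate under a weighted empirical distribution on the subsample points, so the size-n double resamples are represented only through multinomial weights.

import numpy as np
from scipy import stats

def cheap_blb(data, psi, psi_w, s, B=1, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = psi(data)                                        # original full-data estimate
    sub = data[rng.choice(n, size=s, replace=False)]           # one fixed subsample, drawn without replacement
    psi_s = psi(sub)                                           # subsample estimate
    counts = rng.multinomial(n, np.full(s, 1.0 / s), size=B)   # size-n double resamples as multinomial counts
    psi_dstar = np.array([psi_w(sub, c / n) for c in counts])  # weighted double resample estimates
    S = np.sqrt(np.mean((psi_dstar - psi_s) ** 2))             # S^2 as in (28)
    half_width = stats.t.ppf(1 - alpha / 2, df=B) * S          # interval (1); N = n so the adjusting factor is 1
    return psi_hat - half_width, psi_hat + half_width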

5.3 Cheap Subsampled Double Bootstrap

For b=1,,Bb=1,\ldots,B, we do the following: First, generate a subsample estimate ψsb:=ψ(Psb)\psi_{s}^{*b}:=\psi(P_{s}^{*b}) where PsbP_{s}^{*b} is the empirical distribution of a subsample of size s<ns<n, called 𝐗sb\mathbf{X}_{s}^{*b}, drawn from the raw data {X1,,Xn}\{X_{1},\ldots,X_{n}\} via sampling without replacement (i.e., 𝐗sb\mathbf{X}_{s}^{*b} is an ss-subset of {X1,,Xn}\{X_{1},\ldots,X_{n}\}). Then, given 𝐗sb\mathbf{X}_{s}^{*b}, we construct ψnb=ψ(Pnb)\psi_{n}^{**b}=\psi(P_{n}^{**b}) where PnbP_{n}^{**b} is the empirical distribution of a size-nn resample constructed by sampling with replacement from 𝐗sb\mathbf{X}_{s}^{*b}. We output the interval in (1), where now

S2=1Bb=1B(ψnbψsb)2S^{2}=\frac{1}{B}\sum_{b=1}^{B}\left(\psi_{n}^{**b}-\psi_{s}^{*b}\right)^{2} (29)

Cheap Subsampled Double Bootstrap is similar to Cheap Bag of Little Bootstraps, but the Ψb\Psi^{b} in our framework is now taken as a subsample estimate ψsb\psi_{s}^{*b} that is newly generated from a new subsample 𝐗sb\mathbf{X}_{s}^{*b} for each bb, and Ψb\Psi^{*b} is obtained by resampling with replacement from that particular 𝐗sb\mathbf{X}_{s}^{*b}.
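A minimal Python sketch follows, reusing the same assumed psi and psi_w routines as in the previous sketch; the only change is that a fresh subsample is drawn within each replication.

import numpy as np
from scipy import stats

def cheap_sdb(data, psi, psi_w, s, B=1, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = psi(data)
    diffs = []
    for _ in range(B):
        sub = data[rng.choice(n, size=s, replace=False)]       # a new subsample for each replication
        counts = rng.multinomial(n, np.full(s, 1.0 / s))       # size-n resample from this subsample
        diffs.append(psi_w(sub, counts / n) - psi(sub))        # double resample estimate minus subsample estimate
    S = np.sqrt(np.mean(np.square(diffs)))                     # S^2 as in (29)
    half_width = stats.t.ppf(1 - alpha / 2, df=B) * S          # interval (1); N = n so the adjusting factor is 1
    return psi_hat - half_width, psi_hat + half_width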

5.4 Theory of Cheap Subsampling

All of Cheap mm-out-of-nn Bootstrap, Cheap Bag of Little Bootstraps, and Cheap Subsampled Double Bootstrap achieve asymptotic exactness for any B1B\geq 1. To state this concretely, we denote δ={fg:f,g,ρP(fg)<δ}\mathcal{F}_{\delta}=\{f-g:f,g\in\mathcal{F},\ \rho_{P}(f-g)<\delta\}, where ρP(fg):=(VarP(f(X)g(X)))1/2\rho_{P}(f-g):=(Var_{P}(f(X)-g(X)))^{1/2} is the canonical metric.

Theorem 6 (Asymptotic exactness of Cheap Subsampling).

Suppose the assumptions in Proposition 2 hold, and in addition δ\mathcal{F}_{\delta} is measurable for every δ>0\delta>0. Then the intervals produced by Cheap mm-out-of-nn Bootstrap (i.e., (27)), Cheap Bag of Little Bootstraps (i.e., (1) using (28)), and Cheap Subsampled Double Bootstrap (i.e., (1) using (29)) are all asymptotically exact for ψ\psi as n,sn,s\to\infty with sns\leq n.

The proof of Theorem 6 follows a roadmap similar to that of Theorem 1, together with the subsampling analogs of Assumption 1, which hold under the additional technical measurability condition on δ\mathcal{F}_{\delta} and have been used to justify the original versions of these subsampling methods.

We make a couple of remarks. First, the original Bag of Little Bootstraps suggests using several, or even a growing number of, different subsample estimates, instead of a fixed subsample as we have presented earlier. Then from each subsample, an adequate number of resamples are drawn to obtain a quantile that informs an interval limit. These quantile estimates from different subsamples are averaged to obtain the final interval limit. Note that this suggestion requires a multiplicative amount of computation effort, and is motivated by better convergence rates (Kleiner et al. (2014) Theorems 2 and 3). Here, we simply use one subsample as we focus on the case of a small Monte Carlo replication budget, but the modification to include multiple subsamples is feasible, as evidenced by the validity of Cheap Subsampled Double Bootstrap which essentially uses a different subsample in each bootstrap replication. Second, in addition to allowing a smaller data size in model refitting, the mm-out-of-nn Bootstrap, and also the closely related subsampling without replacement or so-called (nm)\binom{n}{m} sampling, as well as sample splitting (Politis et al. (1999); Bickel et al. (1997)), are all motivated as remedies to handle non-smooth functions where standard bootstraps could fail. Instead of the conditional weak convergence that we utilize in this work, these approaches are shown to provide consistent estimates based on symmetric statistics (Politis and Romano (1994)).

6 Numerical Results

We test the numerical performance of our Cheap Bootstrap and compare it with baseline bootstrap approaches. Specifically, we consider elementary variance and correlation estimation (Section 6.1), regression (Section 6.2), simulation input uncertainty quantification (Section 6.3) and deep ensemble prediction (Section 6.4). Due to space limitations, we show other examples in Appendix E. Among these examples, Cheap Subsampling is also investigated in Section 6.2, and nested sampling issues arise in Sections 6.3 and 6.4.

6.1 Elementary Examples

We consider four setups. In the first two setups, we estimate the variance of a distribution using the sample variance. We consider a folded standard normal (i.e., |N(0,1)||N(0,1)|) and double exponential with rate 1 (i.e., Sgn×Exp(1)Sgn\times Exp(1) where Sgn=+1Sgn=+1 or 1-1 with equal probability and is independent of Exp(1)Exp(1)) as the distribution. In the last two setups, we estimate the correlation using the sample correlation. We consider bivariate normal with mean zero, unit variance and correlation 0.50.5, and bivariate lognormal (i.e., (eZ1,eZ2)(e^{Z_{1}},e^{Z_{2}}) where (Z1,Z2)(Z_{1},Z_{2}) is the bivariate normal just described). We use a sample size n=1000n=1000 in all cases. We set the confidence level 1α=95%1-\alpha=95\%. The ground truths of all the setups are known, namely 12/π1-2/\pi, 22, 0.50.5 and (e3/2e)/(e2e)(e^{3/2}-e)/(e^{2}-e) respectively (these examples are also used in, e.g., Schenker (1985); DiCiccio et al. (1992); Lee and Young (1995)).
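For concreteness, the following minimal Python sketch forms a single Cheap Bootstrap interval, as in (1), for the first setup (variance of the folded standard normal); the seed and the choice B=3 are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, B, alpha = 1000, 3, 0.05
data = np.abs(rng.standard_normal(n))                          # folded standard normal sample
psi_hat = np.var(data, ddof=1)                                 # sample variance point estimate
psi_star = np.array([np.var(data[rng.integers(0, n, n)], ddof=1)   # B resample estimates
                     for _ in range(B)])
S = np.sqrt(np.mean((psi_star - psi_hat) ** 2))
half_width = stats.t.ppf(1 - alpha / 2, df=B) * S
print(psi_hat - half_width, psi_hat + half_width)              # ground truth variance is 1 - 2/pi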

We run Cheap Bootstrap using a small number of resamples B=1,2,3,4,5,10B=1,2,3,4,5,10, and compare with the Basic Bootstrap, Percentile Bootstrap and Standard Error Bootstrap. For each setting, we repeat the experiments 10001000 times and report the empirical coverage and interval width mean and standard deviation. Table 3 shows the performances for variance and correlation estimation, which are similar to those in Table 1 in Section 2.2. Cheap Bootstrap gives rise to accurate coverage (91%96%91\%-96\%) in all considered cases, including when BB is as low as 11. On the other hand, all baseline methods fail to generate two-sided intervals when B=1B=1, and encounter significant under-coverage when B=2B=2 to 55 (e.g., as low as 28%28\% for B=2B=2 and 58%58\% when B=5B=5). When B=10B=10, these baseline methods start to catch up, with Standard Error Bootstrap rising to a coverage level close to 90%90\%. These observations coincide with our theory of Cheap Bootstrap presented in Sections 2 and 3, and also with the fact that the baseline methods are all designed to work with a large BB.

Regarding the interval width, its mean and standard deviation for Cheap Bootstrap are initially large at B=1B=1, signifying a price in statistical efficiency when the computation budget is very small. These numbers drop quickly as BB increases from 11 to 22 and continue to drop at a decreasing rate as BB increases further. In contrast, all baseline methods have lower width means and standard deviations than our Cheap Bootstrap, with an increasing trend for the mean as BB increases while the standard deviation remains roughly constant for each method. When B=10B=10, the Cheap Bootstrap interval has width mean and standard deviation generally comparable to the baselines, though still larger. Note that while the baseline methods generate shorter intervals, they have significant under-coverage in the considered range of BB.

Table 3: Interval performances with different bootstrap methods: Cheap, Basic, Percentile and Standard Error Bootstrap, for variance and correlation estimation, at nominal confidence level 95%95\% and with sample size n=1000n=1000.
Method, BB | Variance of Folded normal | Variance of Double exponential | Correlation of Bivariate normal | Correlation of Bivariate lognormal
Each column block reports Coverage (margin of error) followed by Width mean (st. dev.)
Cheap 1 0.95 (0.01) 0.38 (0.29) 0.94 (0.01) 2.84 (2.27) 0.93 (0.02) 0.47 (0.37) 0.95 (0.01) 1.03 (0.83)
Basic 1 NA NA NA NA NA NA NA NA
Per. 1 NA NA NA NA NA NA NA NA
S.E. 1 NA NA NA NA NA NA NA NA
Cheap 2 0.95 (0.01) 0.15 (0.08) 0.94 (0.01) 1.10 (0.60) 0.95 (0.01) 0.18 (0.10) 0.94 (0.01) 0.38 (0.25)
Basic 2 0.30 (0.03) 0.02 (0.02) 0.31 (0.03) 0.16 (0.13) 0.32 (0.03) 0.03 (0.02) 0.28 (0.03) 0.06 (0.05)
Per. 2 0.33 (0.03) 0.02 (0.02) 0.32 (0.03) 0.16 (0.13) 0.32 (0.03) 0.03 (0.02) 0.32 (0.03) 0.06 (0.05)
S.E. 2 0.68 (0.03) 0.06 (0.05) 0.69 (0.03) 0.45 (0.35) 0.69 (0.03) 0.07 (0.06) 0.65 (0.03) 0.15 (0.14)
Cheap 3 0.95 (0.01) 0.11 (0.05) 0.94 (0.02) 0.81 (0.37) 0.95 (0.01) 0.14 (0.06) 0.94 (0.01) 0.29 (0.15)
Basic 3 0.50 (0.03) 0.03 (0.02) 0.48 (0.03) 0.23 (0.13) 0.48 (0.03) 0.04 (0.02) 0.46 (0.03) 0.08 (0.05)
Per. 3 0.50 (0.03) 0.03 (0.02) 0.48 (0.03) 0.23 (0.13) 0.52 (0.03) 0.04 (0.02) 0.46 (0.03) 0.08 (0.05)
S.E. 3 0.81 (0.02) 0.07 (0.04) 0.78 (0.03) 0.48 (0.27) 0.81 (0.02) 0.08 (0.04) 0.80 (0.02) 0.17 (0.11)
Cheap 4 0.96 (0.01) 0.10 (0.04) 0.96 (0.01) 0.73 (0.28) 0.94 (0.01) 0.12 (0.05) 0.94 (0.02) 0.26 (0.12)
Basic 4 0.62 (0.03) 0.04 (0.02) 0.62 (0.03) 0.29 (0.13) 0.61 (0.03) 0.05 (0.02) 0.54 (0.03) 0.10 (0.05)
Per. 4 0.62 (0.03) 0.04 (0.02) 0.60 (0.03) 0.29 (0.13) 0.61 (0.03) 0.05 (0.02) 0.59 (0.03) 0.10 (0.05)
S.E. 4 0.86 (0.02) 0.07 (0.03) 0.86 (0.02) 0.50 (0.22) 0.86 (0.02) 0.09 (0.04) 0.81 (0.02) 0.17 (0.09)
Cheap 5 0.95 (0.01) 0.10 (0.03) 0.95 (0.01) 0.68 (0.24) 0.94 (0.01) 0.12 (0.04) 0.91 (0.02) 0.25 (0.12)
Basic 5 0.67 (0.03) 0.05 (0.02) 0.67 (0.03) 0.32 (0.13) 0.66 (0.03) 0.06 (0.02) 0.58 (0.03) 0.12 (0.06)
Per. 5 0.66 (0.03) 0.05 (0.02) 0.65 (0.03) 0.32 (0.13) 0.65 (0.03) 0.06 (0.02) 0.61 (0.03) 0.12 (0.06)
S.E. 5 0.87 (0.02) 0.07 (0.03) 0.86 (0.02) 0.51 (0.20) 0.86 (0.02) 0.09 (0.03) 0.83 (0.02) 0.19 (0.10)
Cheap 10 0.93 (0.02) 0.08 (0.02) 0.94 (0.01) 0.62 (0.17) 0.94 (0.01) 0.10 (0.02) 0.91 (0.02) 0.21 (0.09)
Basic 10 0.79 (0.03) 0.06 (0.02) 0.80 (0.02) 0.43 (0.13) 0.83 (0.02) 0.07 (0.02) 0.73 (0.03) 0.15 (0.06)
Per. 10 0.79 (0.03) 0.06 (0.02) 0.80 (0.02) 0.43 (0.13) 0.80 (0.02) 0.07 (0.02) 0.78 (0.03) 0.15 (0.06)
S.E. 10 0.90 (0.02) 0.07 (0.02) 0.91 (0.02) 0.54 (0.15) 0.92 (0.02) 0.09 (0.02) 0.86 (0.02) 0.19 (0.08)

6.2 Regression Problems

We apply our Cheap Bootstrap and compare it with standard approaches on a linear regression. The example is adapted from Sengupta et al. (2016) §4 and Kleiner et al. (2014) §4. We fit a model Y=β1X1++βdXd+ϵY=\beta_{1}X_{1}+\cdots+\beta_{d}X_{d}+\epsilon, where we set dimension d=100d=100 and use data (X1,i,,Xd,i,Yi)(X_{1,i},\ldots,X_{d,i},Y_{i}) of size n=105n=10^{5} to fit the model and estimate the coefficients βj\beta_{j}’s. The ground truth is set as Xjt3X_{j}\sim t_{3} for j=1,,dj=1,\ldots,d, ϵN(0,10)\epsilon\sim N(0,10), and βj=1\beta_{j}=1 for j=1,,dj=1,\ldots,d.

In addition to full-size Cheap Bootstrap, we run the three variants of Cheap Subsampling, namely Cheap mm-out-of-nn Bootstrap, Cheap Bag of Little Bootstraps (BLB), and Cheap Subsampled Double Bootstrap (SDB). Moreover, we compare each cheap bootstrap method with its standard quantile-based counterpart (i.e., the basic bootstrap described in Section 2.1 and the subsample counterparts at the beginning of Section 5, with one fixed subsample in the Bag of Little Bootstraps case) to compute 95%95\% confidence intervals for the first coefficient β1\beta_{1}. For the subsampling methods, we use subsample size n0.6=1000n^{0.6}=1000, which is also used in Sengupta et al. (2016) and Kleiner et al. (2014). We use in total 5050 resamples for each method. We repeat the experiments 10001000 times, each time regenerating a new synthetic data set. For each number of resamples from 1 to 5050, we compute the empirical coverage probability and the mean and standard deviation of the confidence interval width for the first coefficient.
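For illustration, a minimal Python sketch of the synthetic data and of the estimator used here (an ordinary least squares fit returning the first coefficient) is given below; it can be plugged into the full-size Cheap Bootstrap or the subsampling sketches of Section 5, and we read N(0,10) as having variance 10.

import numpy as np

def psi_first_coef(data):
    X, y = data[:, :-1], data[:, -1]                  # last column stores the response
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # ordinary least squares fit
    return beta[0]                                    # estimate of the first coefficient

rng = np.random.default_rng(0)
n, d = 10 ** 5, 100
X = rng.standard_t(3, size=(n, d))                    # t_3 covariates
y = X.sum(axis=1) + rng.normal(0.0, np.sqrt(10.0), size=n)   # all coefficients equal 1; noise variance 10
data = np.column_stack([X, y])
# e.g., cheap_m_out_of_n(data, psi_first_coef, s=1000, B=5) using the Section 5.1 sketch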

Figure 1: Confidence interval coverage probabilities of Standard versus Cheap Bootstrap methods in linear regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the coverage probability estimates from 1000 experimental repetitions.
Figure 2: Mean confidence interval widths of Standard versus Cheap Bootstrap methods in linear regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the mean width estimates from 1000 experimental repetitions.
Figure 3: Standard deviations of confidence interval widths of Standard versus Cheap Bootstrap methods in linear regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the standard deviation estimates from 1000 experimental repetitions.

Figure 1 shows the empirical coverage probability against the number of resamples from the 10001000 experimental repetitions. In the first graph, we see that Standard Bootstrap falls short of the nominal coverage, severely so when BB is small, and gradually improves as BB increases. For example, the coverage probability at B=5B=5 is 70%70\% and at B=10B=10 is 86%86\%. On the other hand, Cheap Bootstrap gives coverage close to the nominal level starting from the first replication (94%94\%), and the coverage stays between 94%94\% and 97%97\% throughout. The comparison is similar for subsampling methods, where with small BB Standard mm-out-of-nn, Bag of Little Bootstraps and Subsampled Double Bootstrap all under-cover and gradually approach the nominal level as BB increases. For a more specific comparison, at B=5B=5 for instance, Standard mm-out-of-nn, Bag of Little Bootstraps and Subsampled Double Bootstrap have coverages of only 73%73\%, 69%69\% and 68%68\% respectively, while the Cheap counterparts attain 98%98\%, 95%95\% and 96%96\%. We note that the “sudden drop” in the coverage probability for all Standard bootstrap methods at the 4141-st resample in Figure 1 (as well as all the following figures) is due to the discontinuity in the empirical quantile: the quantile is calculated by inverting the empirical distribution, and at this number of resamples the inverse switches from outputting the largest or smallest resample estimate to the second largest or smallest.

Figure 2 shows the mean interval width against the number of resamples (note that the confidence intervals for the mean width estimates are very narrow, meaning that the estimates are accurate, which makes the shaded regions in the figure difficult to visualize). These plots show that the mean interval widths of Cheap bootstrap methods are initially large when only one replication is used, then decrease sharply as the number of resamples increases, and continue to decrease at a slower rate thereafter. Figure 3 shows the standard deviation of the interval width against the number of resamples, which follows a trend similar to that of the mean in Figure 2. More specifically, the interval width mean and standard deviation of Cheap Bootstrap are 0.360.36 and 0.280.28 at B=1B=1, drop sharply to 0.110.11 and 0.050.05 at B=3B=3, further to 0.090.09 and 0.030.03 at B=5B=5, and 0.080.08 and 0.020.02 at B=10B=10, with negligible reduction afterwards. These trends match our theory regarding the interval width pattern in Section 3.3. On the other hand, the interval width means of Standard methods are initially small and increase with the number of resamples, while the standard deviations are roughly constant. Note that at B=1B=1 it is not possible to generate confidence intervals using Standard bootstrap methods, and the interval width mean and standard deviation are undefined. Moreover, even though the interval widths in Standard bootstrap methods appear generally lower, these intervals can substantially under-cover the truth when BB is small.

We present another example on logistic regression in Appendix E.1 which shows similar experimental observations discussed above.

6.3 Input Uncertainty Quantification in Simulation Modeling

We present an example on the input uncertainty quantification problem in simulation modeling discussed in Section 4. Our target quantity of interest is the expected average waiting time of the first 10 customers in a single-server queueing system, where the interarrival times and service times are i.i.d. with ground truth distributions being exponential with rates 11 and 1.11.1 respectively. The queue starts empty and the first customer arrives immediately. This system is amenable to discrete-event simulation and has been commonly used in existing works on input uncertainty quantification (e.g., Barton et al. (2014); Song and Nelson (2015); Zouaoui and Wilson (2004)). Suppose we do not know the interarrival time distribution but instead have external data of size n=100n=100. To compute a point estimate of the expected waiting time, we use the empirical distribution constructed from the 100 observations, denoted P^n\hat{P}_{n}, as an approximation to the interarrival time distribution that is used to drive simulation runs. We conduct R0=50R_{0}=50 unbiased simulation runs, each giving output ψ^r(P^n)\hat{\psi}_{r}(\hat{P}_{n}), and then average the outputs of these runs to get (1/R0)r=1R0ψ^r(P^n)(1/R_{0})\sum_{r=1}^{R_{0}}\hat{\psi}_{r}(\hat{P}_{n}).
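For concreteness, the following is a minimal Python sketch, based on an assumed Lindley-recursion implementation of the queue, of one unbiased simulation run and of the point estimate that averages R0=50 such runs driven by the empirical interarrival distribution.

import numpy as np

def waiting_time_run(interarrival_sampler, rng, n_customers=10, service_rate=1.1):
    # One unbiased run: average waiting time of the first n_customers via the Lindley recursion.
    wait, total = 0.0, 0.0                       # the first customer arrives to an empty queue and waits 0
    for _ in range(n_customers - 1):
        service = rng.exponential(1.0 / service_rate)
        gap = interarrival_sampler(rng)          # next interarrival time
        wait = max(wait + service - gap, 0.0)
        total += wait
    return total / n_customers

rng = np.random.default_rng(0)
data = rng.exponential(1.0, size=100)            # stand-in for the external interarrival data
empirical_sampler = lambda r: r.choice(data)     # draws from the empirical distribution of the data
R0 = 50
point_estimate = np.mean([waiting_time_run(empirical_sampler, rng) for _ in range(R0)])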

To obtain a 95%95\% confidence interval, we use Cheap Bootstrap centered at original estimate and centered at resample mean presented in Sections 4.1 and 4.2 respectively. More precisely, the first approach constructs the interval O\mathcal{I}_{O} using (16) with (17) and (18), while the second approach constructs the interval M\mathcal{I}_{M} using (16) with (23) and (24). In both approaches, we set the simulation size for each resample estimate to equal that for the original estimate, i.e., R=R0=50R=R_{0}=50. To construct O\mathcal{I}_{O}, we need to approximate the critical value qO,1α/2q_{O,1-\alpha/2} using (22), where we set N=100,000N=100,000 and discretize θ\theta over a grid 0.01,0.02,,1000.01,0.02,\ldots,100. Then we increase qq from tB,1α/20.5t_{B,1-\alpha/2}-0.5 and iteratively check whether

minθ01Nj=1NI(θV1j+V2jθ2+ρ2B(Yj+V3j2)2θ2+ρ2BV3jV2j+V2j2q)\min_{\theta\geq 0}\frac{1}{N}\sum_{j=1}^{N}I\left(\frac{\theta V_{1}^{j}+V_{2}^{j}}{\sqrt{\frac{\theta^{2}+\rho^{2}}{B}\left(Y^{j}+{V_{3}^{j}}^{2}\right)-2\sqrt{\frac{\theta^{2}+\rho^{2}}{B}}V_{3}^{j}V_{2}^{j}+{V_{2}^{j}}^{2}}}\leq q\right) (30)

is greater than 1α/21-\alpha/2; if not, we increase qq by a step size 0.010.01, until we reach a qq such that (30) passes 1α/21-\alpha/2. We detail the computed values of qO,1α/2q_{O,1-\alpha/2} in Appendix E.2.

Using the above, we construct O\mathcal{I}_{O} and M\mathcal{I}_{M} for BB ranging from 11 to 1010. For each BB, we repeat our experiments 10001000 times to record the empirical coverage and interval width mean and standard deviation. To calculate the empirical coverage, we run 1 million simulation runs under the true interarrival and service time distributions to obtain an accurate estimate of the ground truth value. Table 4 shows the performances of O\mathcal{I}_{O} and M\mathcal{I}_{M}, and also the comparisons with the Basic and Percentile Bootstraps that treat ψn,Rb,b=1,,B\psi_{n,R}^{**b},b=1,\ldots,B in (14) as the resample estimates. We see that the coverage of O\mathcal{I}_{O} is very close to the nominal confidence level 95%95\% at all considered BB, ranging within 95%96%95\%-96\%. The coverage of M\mathcal{I}_{M} is also close, ranging within 93%94%93\%-94\%. The interval width performances for both O\mathcal{I}_{O} and M\mathcal{I}_{M} behave similarly to the previous examples, with initially large width at B=1B=1 or B=2B=2, falling sharply when BB increases by 1 or 2, and flattening out afterwards. The Basic and Percentile Bootstraps have coverages that deviate substantially from the nominal 95%95\%, though they improve as BB increases. The inferior coverages of these conventional methods can be attributed both to the inadequate BB and to the fact that their implementation ignores the simulation noise.

Table 4: Interval performances using Cheap Bootstrap centered at original estimate O\mathcal{I}_{O}, Cheap Bootstrap centered at resample mean M\mathcal{I}_{M}, Basic Bootstrap and Percentile Bootstrap, at nominal confidence level 95%95\% for input uncertainty quantification in a queueing system simulation.
BB | Cheap bootstrap centered at original estimate | Cheap bootstrap centered at resample mean | Basic bootstrap | Percentile bootstrap
Each column block reports Coverage (margin of error) followed by Width mean (st. dev.)
1 0.96 (0.01) 6.73 (5.41) NA NA NA NA NA NA
2 0.95 (0.01) 2.55 (1.50) 0.94 (0.02) 5.84 (4.86) 0.35 (0.03) 0.33 (0.26) 0.27 (0.03) 0.33 (0.26)
3 0.95 (0.01) 1.97 (0.99) 0.94 (0.01) 2.26 (1.30) 0.52 (0.03) 0.50 (0.29) 0.41 (0.03) 0.50 (0.29)
4 0.95 (0.01) 1.74 (0.79) 0.94 (0.01) 1.73 (0.84) 0.64 (0.03) 0.59 (0.29) 0.48 (0.03) 0.59 (0.29)
5 0.95 (0.01) 1.64 (0.69) 0.93 (0.02) 1.54 (0.68) 0.71 (0.03) 0.66 (0.29) 0.54 (0.03) 0.66 (0.29)
6 0.95 (0.01) 1.58 (0.63) 0.93 (0.02) 1.45 (0.59) 0.75 (0.03) 0.72 (0.29) 0.58 (0.03) 0.72 (0.29)
7 0.95 (0.01) 1.54 (0.58) 0.93 (0.02) 1.40 (0.53) 0.77 (0.03) 0.76 (0.29) 0.61 (0.03) 0.76 (0.29)
8 0.95 (0.01) 1.50 (0.55) 0.93 (0.02) 1.35 (0.49) 0.79 (0.03) 0.80 (0.29) 0.64 (0.03) 0.80 (0.29)
9 0.95 (0.01) 1.48 (0.53) 0.93 (0.02) 1.33 (0.47) 0.81 (0.02) 0.82 (0.29) 0.65 (0.03) 0.82 (0.29)
10 0.95 (0.01) 1.46 (0.51) 0.93 (0.02) 1.31 (0.44) 0.83 (0.02) 0.85 (0.29) 0.67 (0.03) 0.85 (0.29)

Compared with existing works on input uncertainty quantification, we highlight that the computational load in Cheap Bootstrap is significantly lower. According to Table 4, BB can be taken as, for instance, 3 or 4 in O\mathcal{I}_{O} to obtain a reasonable confidence interval, so that the total computation cost, including the R0R_{0} runs used to compute the original point estimate, is BR+R0BR+R_{0}, which is 200 or 250 when we use R=R0=50R=R_{0}=50. In contrast, bootstrap approaches proposed in the literature suggest using a BB of at least 5050 in numerical examples (e.g., Cheng and Holland (2004); Song and Nelson (2015)) or linearly dependent on the parameter dimension (e.g., 1010 times the dimension when using the so-called metamodel-assisted bootstrap in the parametric case; Xie et al. (2014)).

6.4 Deep Ensemble Prediction

We consider the deep ensemble prediction described in Section 4. We build a prediction model y=f(𝐱)y=f(\mathbf{x}) from i.i.d. supervised data (𝐗i,Yi),i=1,,n(\mathbf{X}_{i},Y_{i}),i=1,\ldots,n of size n=200n=200. Here the feature vector 𝐱=(x(1),,x(d))\mathbf{x}=(x(1),\ldots,x(d)) is 16-dimensional and yy\in\mathbb{R}. Suppose each dimension of 𝐗i\mathbf{X}_{i} is generated from the uniform distribution on [0,1][0,1], and the ground-truth model is y=j=116x(j)+N(0,1)y=\sum_{j=1}^{16}x(j)+N(0,1). We build a deep ensemble by training R0=5R_{0}=5 base neural networks, each using an independent random initialization, and averaging these networks to obtain a predictor. More specifically, each base neural network has two fully connected layers with 10241024 hidden neurons, using rectified linear units as the activation. It is trained using the squared loss with L2L_{2}-regularization on the neuron weights, and optimized with Adam. All the weights in each neural network are initialized as independent N(0,1)N(0,1) variables.
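The following minimal Python sketch illustrates the ensemble construction, using scikit-learn's MLPRegressor as a stand-in for the fully connected networks described above; the regularization strength, iteration cap and seeds are illustrative assumptions rather than the exact training configuration.

import numpy as np
from sklearn.neural_network import MLPRegressor

def deep_ensemble_predict(X, y, x_test, R0=5, seed=0):
    preds = []
    for r in range(R0):
        net = MLPRegressor(hidden_layer_sizes=(1024, 1024), activation="relu",
                           alpha=1e-3, solver="adam", max_iter=500,
                           random_state=seed + r)      # independent random initialization per base learner
        net.fit(X, y)
        preds.append(net.predict(x_test.reshape(1, -1))[0])
    return float(np.mean(preds))                       # the ensemble (point) prediction

rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.uniform(0.0, 1.0, size=(n, d))
y = X.sum(axis=1) + rng.standard_normal(n)             # ground truth: sum of features plus N(0,1) noise
x_test = np.full(d, 0.5)
prediction = deep_ensemble_predict(X, y, x_test)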

As in Section 6.3, we use Cheap Bootstrap centered at original estimate O\mathcal{I}_{O} and centered at resample mean M\mathcal{I}_{M} to construct 95%95\% confidence intervals for the prediction output. Here, we consider a test point with value 0.50.5 in all dimensions. We set R=R0R=R_{0} in both approaches, and use Table 5 in Appendix E.2 to calibrate the critical value qO,1α/2q_{O,1-\alpha/2} in O\mathcal{I}_{O}. We repeat the experiments 100100 times to record statistics on the generated confidence intervals. Figure 4 depicts the box plots of the confidence interval widths using B=1B=1 to 1010, for O\mathcal{I}_{O} and M\mathcal{I}_{M} respectively. We see that the interval widths are relatively large for B=1B=1 (in the case of O\mathcal{I}_{O}) and for B=2B=2 (in the case of M\mathcal{I}_{M}, which is only well-defined starting from B=2B=2), but decrease quickly as BB increases. When B=3B=3 (in the case of O\mathcal{I}_{O}) and B=4B=4 (in the case of M\mathcal{I}_{M}), the interval widths appear to have more or less stabilized.

Figure 4: Box plots of 95%95\% confidence interval widths for deep ensemble prediction on synthetic data at test point that takes value 0.50.5 in all dimensions, using Cheap Bootstrap centered at original estimate O\mathcal{I}_{O} (left graph) and Cheap Bootstrap centered at resample mean M\mathcal{I}_{M} (right graph).

Next, we also train deep ensemble predictors for a real data set on Boston housing, available at https://www.kaggle.com/c/boston-housing. The data set has size 504504 and feature dimension 1313. As in the synthetic-data example, we use R0=5R_{0}=5, and the same configurations of base neural networks and training methods. We split the data into an 80%80\% training set and a 20%20\% testing set, where only the training set is used to construct the predictor and to run our Cheap Bootstrap, again with R=R0=5R=R_{0}=5. Then we create the 95%95\% Cheap Bootstrap confidence intervals for the prediction values of the testing set. Figure 5 shows the box plots of the widths of these intervals, namely O\mathcal{I}_{O} and M\mathcal{I}_{M}, for BB ranging from 11 to 1010. We see that the widths fall sharply when BB is small (B=1,2B=1,2 in O\mathcal{I}_{O} and B=2,3B=2,3 in M\mathcal{I}_{M}) and stabilize quickly afterwards.

Figure 5: Box plots of 95%95\% confidence interval widths for deep ensemble prediction for the test points in the Boston housing data set, using Cheap Bootstrap centered at original estimate O\mathcal{I}_{O} (left graph) and Cheap Bootstrap centered at resample mean M\mathcal{I}_{M} (right graph).

In Appendix E.3, we present an additional example on constructing confidence bounds for the optimality gaps of data-driven stochastic optimization problems. This problem, which possesses a nested sampling structure, is of interest to stochastic programming. There we illustrate the performance of Cheap Bootstrap in constructing one-sided confidence bounds.

7 Discussion

Motivated by the computational demand in large-scale problems where repeated model refitting can be costly, in this paper we propose a Cheap Bootstrap method that can use very few resamples for bootstrap inference. A key element of our method is that it attains asymptotically exact coverage for any number of resamples, including as few as one. This is in contrast to conventional bootstrap approaches that require running many Monte Carlo replications. Our theory on the Cheap Bootstrap also differs from these approaches by exploiting the asymptotic independence between the original estimate and the resample estimates, instead of a direct approximation of the sampling distribution via the resample counterpart, the latter typically executed by running many Monte Carlo replications in order to obtain accurate summary statistics from the resamples.

Besides the basic asymptotic coverage guarantee, we have studied several aspects of the Cheap Bootstrap. First is its higher-order coverage error, which matches that of the conventional basic and percentile bootstraps. This error analysis is based on an Edgeworth expansion on a tt-limit which, to our knowledge, has not been studied in the literature, whose focus has been on normal limits. Second is the half-width behavior, where we show that the half widths of Cheap Bootstrap intervals are naturally wider for a small number of resamples, and match those of the conventional approaches as the number of resamples increases. Moreover, the decrease in the half width is sharp when the resample number is very small and flattens quickly, thus leading to a reasonable half-width performance with only a few resamples. In addition, we have also investigated several generalizations of the Cheap Bootstrap, including its applications to nested sampling problems, subsampling procedures, and multivariate extensions. Our numerical results validate the performance of the Cheap Bootstrap, especially its ability to conduct valid inference with an extremely small number of resamples.

This work is intended to lay the foundation of the proposed Cheap Bootstrap method. In principle, this method can be applied to many problems other than those shown in our numerical section. Essentially, for any problem where a bootstrap interval is used and justified via the standard condition of conditional asymptotic normality of the resamples, one can adopt the Cheap Bootstrap in lieu of conventional bootstraps. Besides applying and testing the performance of the Cheap Bootstrap across these other problems, we believe the following are also worth further investigation:

Comparison with the infinitesimal jackknife and the jackknife: While we have focused our comparisons primarily within the bootstrap family, the Cheap Bootstrap bears advantages when compared to non-bootstrap alternatives including the infinitesimal jackknife and the jackknife. These advantages are inherited from the bootstrap approach in general, but are strengthened further by the light resampling effort. The infinitesimal jackknife, or the delta method, uses a linear approximation to evaluate the standard error. This approach relies on influence function calculations that could be tedious analytically, and also computationally due to, for instance, the inversion of big matrices in MM-estimation. The jackknife, which relies on leave-one-out estimates, generally requires a number of model evaluations equal to the sample size. The Cheap Bootstrap thus provides potentially significant computational savings compared to both methods. Note, however, that it is possible to combine the jackknife with batching to reduce the computation load (see the next discussion point).

In problems facing nested sampling such as those in Section 4, it is also possible to use non-bootstrap techniques such as the infinitesimal jackknife. In fact, the latter is particularly handy to derive for problems involving double resampling such as bagging predictors (e.g., Wager et al. (2014)). Nonetheless, it remains open, to our knowledge, whether these alternatives can generally resolve the expensive nested sampling requirement faced by conventional bootstraps, especially in problems lacking unbiased estimators for the influence function. In these cases, the need to control the entangled noises coming from both the data and the computation could necessitate a large computation size, despite the apparent avoidance of nested sampling.

Comparison with batching: A technique prominently used in the simulation literature, but perhaps less common in statistics, is known as batch means or more generally standardized time series (Glynn and Iglehart (1990); Schmeiser (1982); Schruben (1983); Glynn and Lam (2018)). This technique conducts inference by grouping data into batches and aggregating them via tt-statistics. The batches can be disjoint or overlapping (Meketon and Schmeiser (1984); Song and Schmeiser (1995)), similar to the blocks in subsampling (Politis and Romano (1994)). While batch means uses tt-statistics like our Cheap Bootstrap, it utilizes solely the original sample instead of resamples, and is motivated by the need to handle serially dependent observations in simulation outputs such as from Markov chain Monte Carlo (Geyer (1992); Flegal et al. (2010); Jones et al. (2006)).

Batch means can in principle be used to construct confidence intervals with few model evaluations, by using a small number of batches or blocks. However, a potential downside of this approach is that, in the disjoint-batches case, dividing a limited sample into batches thins out the sample size per batch and degrades the accuracy of the asymptotic approximation. In other words, there is a constrained tradeoff between the number of batches and the sample size per batch that limits its accuracy. The Cheap Bootstrap, on the other hand, bypasses this constraint by allowing the use of any number of resamples. In fact, since the observations in different resamples overlap, the Cheap Bootstrap appears closer to overlapping batching. To this end, overlapping batching contains multiple tuning parameters that specify how the batches overlap, which could also affect the asymptotic distributions. In this regard, the Cheap Bootstrap appears easier to use, as the only parameter needed is the resample budget, which is decided by the amount of available computation resource.

One approach to obtain more in-depth theoretical comparisons between the Cheap Bootstrap and batching is to analyze the coefficients in the respective higher-order coverage expansions. This comprises an immediate future direction.

High-dimensional problems: As the main advantage of the Cheap Bootstrap relative to other approaches is its light computation, it would be revealing to analyze the error of the Cheap Bootstrap in high-dimensional problems where computational savings are of utmost importance. Regarding this, it is possible to obtain bounds for the coverage error of the Cheap Bootstrap with explicit dependence on the problem dimension, by following our analysis roadmap in Section 3.2 but integrating it with high-dimensional Berry-Esseen-type bounds for normal limits (Chernozhukov et al. (2017); Fang and Koike (2021)).

Higher-order coverage accuracy: Although our investigation is orthogonal to the bootstrap literature on higher-order coverage error refinements, we speculate that our Cheap Bootstrap can be sharpened to exhibit second-order accurate intervals, by replacing the tt-quantile with a bootstrap-calibrated quantile. This approach is similar to the studentized bootstrap, but uses few outer resamples in a potential iterated bootstrap procedure. In other words, our computation effort could be larger than that of elementary bootstraps, as we need to run two layers of resampling, but less than that of a double bootstrap, as one of the layers consists of only a few resamples. As a related application, we may apply the Cheap Bootstrap to assess the error of the bootstrap itself, which is done conventionally via bootstrapping the bootstrap or, alternatively, the jackknife-after-bootstrap (Efron (1992)).

Serial dependence: A common bootstrap approach to handling serial dependence is to use block sampling (e.g., Davison and Hinkley (1997) §8). The computational advantage of the Cheap Bootstrap likely continues to hold when using blocks. More specifically, instead of regenerating many series each concatenated from resampled blocks, the Cheap Bootstrap regenerates only a few such series and aggregates them via a tt-approximation. On the other hand, as mentioned above, batch means and subsampling also comprise viable approaches to handling serially dependent data, and the effectiveness of the Cheap Bootstrap compared with these approaches will need further investigation.

Acknowledgments

I gratefully acknowledge support from the National Science Foundation under grants CAREER CMMI-1834710 and IIS-1849280.

References

  • Alaa and Van Der Schaar (2020) Alaa, A. and Van Der Schaar, M. (2020) Discriminative jackknife: Quantifying uncertainty in deep learning via higher-order influence functions. In International Conference on Machine Learning, 165–174. PMLR.
  • Barton (2012) Barton, R. R. (2012) Tutorial: Input uncertainty in output analysis. In Proceedings of the 2012 Winter Simulation Conference (WSC), 1–12. IEEE.
  • Barton et al. (2014) Barton, R. R., Nelson, B. L. and Xie, W. (2014) Quantifying input uncertainty via simulation confidence intervals. INFORMS Journal on Computing, 26, 74–87.
  • Beran (1987) Beran, R. (1987) Prepivoting to reduce level error of confidence sets. Biometrika, 74, 457–468.
  • Bickel et al. (1997) Bickel, P. J., Götze, F. and van Zwet, W. R. (1997) Resampling fewer than nn observations: Gains, losses, and remedies for losses. Statistica Sinica, 7, 1–31.
  • Birge and Louveaux (2011) Birge, J. R. and Louveaux, F. (2011) Introduction to stochastic programming. Springer Science & Business Media.
  • Booth and Do (1993) Booth, J. G. and Do, K.-A. (1993) Simple and efficient methods for constructing bootstrap confidence intervals. Computational Statistics, 8, 333–333.
  • Booth and Hall (1994) Booth, J. G. and Hall, P. (1994) Monte Carlo approximation and the iterated bootstrap. Biometrika, 81, 331–340.
  • Breiman (1996) Breiman, L. (1996) Bagging predictors. Machine Learning, 24, 123–140.
  • Bühlmann and Yu (2002) Bühlmann, P. and Yu, B. (2002) Analyzing bagging. The Annals of Statistics, 30, 927–961.
  • Cheng and Holland (2004) Cheng, R. C. and Holland, W. (2004) Calculation of confidence intervals for simulation output. ACM Transactions on Modeling and Computer Simulation (TOMACS), 14, 344–362.
  • Chernozhukov et al. (2017) Chernozhukov, V., Chetverikov, D. and Kato, K. (2017) Central limit theorems and bootstrap in high dimensions. The Annals of Probability, 45, 2309–2352.
  • Davison and Hinkley (1997) Davison, A. C. and Hinkley, D. V. (1997) Bootstrap Methods and Their Application. No. 1. Cambridge University Press.
  • DiCiccio et al. (1996) DiCiccio, T. J., Efron, B. et al. (1996) Bootstrap confidence intervals. Statistical Science, 11, 189–228.
  • DiCiccio et al. (1992) DiCiccio, T. J., Martin, M. A. and Young, G. A. (1992) Analytical approximations for iterated bootstrap confidence intervals. Statistics and Computing, 2, 161–171.
  • Efron (1981) Efron, B. (1981) Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods. Biometrika, 68, 589–599.
  • Efron (1982) — (1982) The Jackknife, the Bootstrap and Other Resampling Plans. SIAM.
  • Efron (1987) — (1987) Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171–185.
  • Efron (1992) — (1992) Jackknife-after-bootstrap standard errors and influence functions. Journal of the Royal Statistical Society: Series B (Methodological), 54, 83–111.
  • Efron and Tibshirani (1994) Efron, B. and Tibshirani, R. J. (1994) An Introduction to the Bootstrap. CRC Press.
  • Fang and Koike (2021) Fang, X. and Koike, Y. (2021) High-dimensional central limit theorems by Stein’s method. The Annals of Applied Probability, 31, 1660–1686.
  • Flegal et al. (2010) Flegal, J. M., Jones, G. L. et al. (2010) Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics, 38, 1034–1070.
  • Geyer (1992) Geyer, C. J. (1992) Practical Markov chain Monte Carlo. Statistical Science, 7, 473–483.
  • Giordano et al. (2019) Giordano, R., Stephenson, W., Liu, R., Jordan, M. and Broderick, T. (2019) A swiss army infinitesimal jackknife. In The 22nd International Conference on Artificial Intelligence and Statistics, 1139–1147. PMLR.
  • Glasserman (2004) Glasserman, P. (2004) Monte Carlo methods in financial engineering, vol. 53. Springer.
  • Glynn and Iglehart (1990) Glynn, P. W. and Iglehart, D. L. (1990) Simulation output analysis using standardized time series. Mathematics of Operations Research, 15, 1–16.
  • Glynn and Lam (2018) Glynn, P. W. and Lam, H. (2018) Constructing simulation output intervals under input uncertainty via data sectioning. In 2018 Winter Simulation Conference (WSC), 1551–1562. IEEE.
  • Hall (1986a) Hall, P. (1986a) On the bootstrap and confidence intervals. Annals of Statistics, 14, 1431–1452.
  • Hall (1986b) — (1986b) On the number of bootstrap simulations required to construct a confidence interval. The Annals of Statistics, 1453–1462.
  • Hall (2013) — (2013) The Bootstrap and Edgeworth Expansion. Springer Science & Business Media.
  • Hall and Martin (1988) Hall, P. and Martin, M. A. (1988) On bootstrap resampling and iteration. Biometrika, 75, 661–671.
  • He and Lam (2021) He, S. and Lam, H. (2021) Higher-order coverage errors of batching methods via edgeworth expansions on tt-statistics. arXiv preprint arXiv:2111.06859.
  • Henderson (2003) Henderson, S. G. (2003) Input modeling: Input model uncertainty: Why do we care and what should we do about it? In Proceedings of the 2003 Winter Simulation Conference (eds. S. Chick, P. J. Sánchez, D. Ferrin and D. J. Morrice), 90–100. Piscataway, New Jersey: IEEE.
  • Jones et al. (2006) Jones, G. L., Haran, M., Caffo, B. S. and Neath, R. (2006) Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101, 1537–1547.
  • Kleiner et al. (2012) Kleiner, A., Talwalkar, A., Sarkar, P. and Jordan, M. I. (2012) The big data bootstrap. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, 1787–1794. Madison, WI, USA: Omnipress.
  • Kleiner et al. (2014) — (2014) A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 76, 795–816.
  • Kosorok (2007) Kosorok, M. R. (2007) Introduction to Empirical Processes and Semiparametric Inference. Springer Science & Business Media.
  • Lakshminarayanan et al. (2016) Lakshminarayanan, B., Pritzel, A. and Blundell, C. (2016) Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474.
  • Lam (2016) Lam, H. (2016) Advanced tutorial: Input uncertainty and robust analysis in stochastic simulation. In Winter Simulation Conference (WSC), 2016, 178–192. IEEE.
  • Lam and Qian (2018a) Lam, H. and Qian, H. (2018a) Assessing solution quality in stochastic optimization via bootstrap aggregating. In 2018 Winter Simulation Conference (WSC), 2061–2071. IEEE.
  • Lam and Qian (2018b) — (2018b) Bounding optimality gap in stochastic optimization via bagging: Statistical efficiency and stability. arXiv preprint arXiv:1810.02905.
  • Lam and Qian (2018c) — (2018c) Subsampling variance for input uncertainty quantification. In 2018 Winter Simulation Conference (WSC), 1611–1622. IEEE.
  • Lam and Qian (2021) — (2021) Subsampling to enhance efficiency in input uncertainty quantification. Operations Research.
  • Law et al. (1991) Law, A. M. and Kelton, W. D. (1991) Simulation modeling and analysis, vol. 2. McGraw-Hill, New York.
  • Lee et al. (2015) Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D. and Batra, D. (2015) Why m heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314.
  • Lee and Young (1995) Lee, S. M. and Young, G. A. (1995) Asymptotic iterated bootstrap confidence intervals. The Annals of Statistics, 23, 1301–1330.
  • Lu et al. (2020) Lu, Z., Ie, E. and Sha, F. (2020) Uncertainty estimation with infinitesimal jackknife, its distribution and mean-field approximation. arXiv preprint arXiv:2006.07584.
  • Mak et al. (1999) Mak, W.-K., Morton, D. P. and Wood, R. K. (1999) Monte Carlo bounding techniques for determining solution quality in stochastic programs. Operations Research Letters, 24, 47–56.
  • Meketon and Schmeiser (1984) Meketon, M. S. and Schmeiser, B. (1984) Overlapping batch means: Something for nothing? In Proceedings of the Winter Simulation Conference, 227–230.
  • Mentch and Hooker (2016) Mentch, L. and Hooker, G. (2016) Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17, 841–881.
  • Nelson (2013) Nelson, B. (2013) Foundations and Methods of Stochastic Simulation: A First Course. Springer Science & Business Media.
  • Politis and Romano (1994) Politis, D. N. and Romano, J. P. (1994) Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, 22, 2031–2050.
  • Politis et al. (1999) Politis, D. N., Romano, J. P. and Wolf, M. (1999) Subsampling. Springer Science & Business Media.
  • Schenker (1985) Schenker, N. (1985) Qualms about bootstrap confidence intervals. Journal of the American Statistical Association, 80, 360–361.
  • Schmeiser (1982) Schmeiser, B. (1982) Batch size effects in the analysis of simulation output. Operations Research, 30, 556–568.
  • Schruben (1983) Schruben, L. (1983) Confidence interval estimation using standardized time series. Operations Research, 31, 1090–1108.
  • Schulam and Saria (2019) Schulam, P. and Saria, S. (2019) Can you trust this prediction? auditing pointwise reliability after learning. In The 22nd International Conference on Artificial Intelligence and Statistics, 1022–1031. PMLR.
  • Sengupta et al. (2016) Sengupta, S., Volgushev, S. and Shao, X. (2016) A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 111, 1222–1232.
  • Shao and Tu (2012) Shao, J. and Tu, D. (2012) The jackknife and bootstrap. Springer Science & Business Media.
  • Shapiro et al. (2021) Shapiro, A., Dentcheva, D. and Ruszczynski, A. (2021) Lectures on stochastic programming: modeling and theory. SIAM.
  • Song and Nelson (2015) Song, E. and Nelson, B. L. (2015) Quickly assessing contributions to input uncertainty. IIE Transactions, 47, 893–909.
  • Song et al. (2014) Song, E., Nelson, B. L. and Pegden, C. D. (2014) Advanced tutorial: Input uncertainty quantification. In Proceedings of the 2014 Winter Simulation Conference (eds. A. Tolk, S. Diallo, I. Ryzhov, L. Yilmaz, S. Buckley and J. Miller), 162–176. Piscataway, New Jersey: IEEE.
  • Song and Schmeiser (1995) Song, W. T. and Schmeiser, B. W. (1995) Optimal mean-squared-error batch sizes. Management Science, 41, 110–123.
  • Van der Vaart (2000) Van der Vaart, A. W. (2000) Asymptotic Statistics, vol. 3. Cambridge University Press.
  • Van der Vaart and Wellner (2013) Van der Vaart, A. W. and Wellner, J. A. (2013) Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media.
  • Wager et al. (2014) Wager, S., Hastie, T. and Efron, B. (2014) Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. The Journal of Machine Learning Research, 15, 1625–1651.
  • Wall et al. (2001) Wall, M. M., Boen, J. and Tweedie, R. (2001) An effective confidence interval for the mean with samples of size one and two. The American Statistician, 55, 102–105.
  • Xie et al. (2014) Xie, W., Nelson, B. L. and Barton, R. R. (2014) A Bayesian framework for quantifying uncertainty in stochastic simulation. Operations Research, 62, 1439–1452.
  • Zouaoui and Wilson (2004) Zouaoui, F. and Wilson, J. R. (2004) Accounting for input-model and input-parameter uncertainties in simulation. IIE Transactions, 36, 1135–1151.

Appendix A Immediate Extensions from Section 3

A.1 Inference on Standard Error

We can conduct inference on the standard error of ψ^n\hat{\psi}_{n} using Cheap Bootstrap. Here by standard error we mean the standard deviation of ψ^n\hat{\psi}_{n}, which is asymptotically σ/n\sigma/\sqrt{n}, and our inference target is σ\sigma. Using the same argument as in Section 3.1, we have that

𝒥=[BnSχ1α/2,B2,BnSχα/2,B2]\mathcal{J}=\left[\frac{\sqrt{Bn}S}{\sqrt{\chi^{2}_{1-\alpha/2,B}}},\frac{\sqrt{Bn}S}{\sqrt{\chi^{2}_{\alpha/2,B}}}\right] (31)

is an asymptotically exact (1α)(1-\alpha)-level confidence interval for σ\sigma, where χα/2,B2\chi^{2}_{\alpha/2,B} and χ1α/2,B2\chi^{2}_{1-\alpha/2,B} are the α/2\alpha/2 and (1α/2)(1-\alpha/2)-quantiles of χB2\chi^{2}_{B}. We summarize this as follows.

Theorem 7 (Cheap Bootstrap interval for standard error).

Under Assumption 1, we have, for any bootstrap replication size B1B\geq 1, the interval 𝒥\mathcal{J} defined in (31) is an asymptotically exact (1α)(1-\alpha)-level confidence interval for σ\sigma, i.e., n(σ𝒥)1α\mathbb{P}_{n}(\sigma\in\mathcal{J})\to 1-\alpha as the sample size nn\to\infty, where n\mathbb{P}_{n} denotes the probability with respect to the data X1,,XnX_{1},\ldots,X_{n} and all randomness from the resampling.
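For illustration, the following is a minimal Python sketch of the interval (31); the resampling scheme is the standard n-out-of-n nonparametric bootstrap, and the sample-mean estimator and exponential data in the usage example are hypothetical placeholders chosen only for concreteness.

```python
import numpy as np
from scipy.stats import chi2

def cheap_bootstrap_se_interval(data, psi, B=5, alpha=0.05, seed=None):
    """Cheap Bootstrap interval (31) for sigma, the scaled standard error of
    psi_hat, using only B bootstrap refits of the estimator psi."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = psi(data)                                  # original point estimate
    psi_star = np.array([psi(data[rng.integers(0, n, size=n)]) for _ in range(B)])
    S = np.sqrt(np.mean((psi_star - psi_hat) ** 2))      # resample standard deviation
    lower = np.sqrt(B * n) * S / np.sqrt(chi2.ppf(1 - alpha / 2, df=B))
    upper = np.sqrt(B * n) * S / np.sqrt(chi2.ppf(alpha / 2, df=B))
    return lower, upper

# Usage example: for the sample mean of Exp(1) data, the target sigma equals 1
data = np.random.default_rng(0).exponential(size=500)
print(cheap_bootstrap_se_interval(data, np.mean, B=5, seed=1))
```

By Theorem 7, any B ≥ 1 yields asymptotically exact coverage; smaller B simply tends to produce a wider interval through the more spread-out χ²_B quantiles.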

A.2 Multivariate Generalization

We present a multivariate generalization of Cheap Bootstrap. Consider now ψ:=ψ(P)d\psi:=\psi(P)\in\mathbb{R}^{d}. Our multivariate confidence region proceeds similarly to the univariate case, except that we use Hotelling’s T2T^{2} distribution instead of the tt distribution. More specifically, we form the point estimate ψ^n=ψ(P^n)\hat{\psi}_{n}=\psi(\hat{P}_{n}), and for each b=1,,Bb=1,\ldots,B we resample from {X1,,Xn}\{X_{1},\ldots,X_{n}\} to obtain {X1b,,Xnb}\{X_{1}^{*b},\ldots,X_{n}^{*b}\} and evaluate the resample estimate ψnb:=ψ(Pnb)\psi_{n}^{*b}:=\psi(P_{n}^{*b}). Our confidence region is

={ψ:(ψ^nψ)S1(ψ^nψ)Td,B,1α2}\mathcal{R}=\left\{\psi:(\hat{\psi}_{n}-\psi)^{\top}S^{-1}(\hat{\psi}_{n}-\psi)\leq T^{2}_{d,B,1-\alpha}\right\} (32)

where SS now denotes a d×dd\times d matrix given by

S=1Bb=1B(ψnbψ^n)(ψnbψ^n)S=\frac{1}{B}\sum_{b=1}^{B}\left(\psi_{n}^{*b}-\hat{\psi}_{n}\right)\left(\psi_{n}^{*b}-\hat{\psi}_{n}\right)^{\top} (33)

and the critical value Td,B,1α2T^{2}_{d,B,1-\alpha} is the (1α)(1-\alpha)-quantile of Hotelling’s T2T^{2} distribution with parameters dd and BB.

We have the following asymptotic exact guarantee for \mathcal{R}. First we make the following multivariate analog of Assumption 1:

Assumption 3 (Standard condition for multivariate bootstrap validity).

We have n(ψ^nψ)N(0,Σ)\sqrt{n}(\hat{\psi}_{n}-\psi)\Rightarrow N(0,\Sigma) and n(ψnψ^n)N(0,Σ)\sqrt{n}(\psi_{n}^{*}-\hat{\psi}_{n})\Rightarrow N(0,\Sigma) conditional on the data X1,X2,X_{1},X_{2},\ldots in probability as nn\to\infty, where N(0,Σ)N(0,\Sigma) is a multivariate normal vector with mean 0d0\in\mathbb{R}^{d} and covariance matrix Σd×d\Sigma\in\mathbb{R}^{d\times d} that is positive definite.

Then we have:

Theorem 8 (Asymptotic exactness of multivariate Cheap Bootstrap).

Under Assumption 3, for any BdB\geq d, the region \mathcal{R} in (32) is an asymptotically exact (1α)(1-\alpha)-level confidence region for ψ\psi, i.e., n(ψ)1α\mathbb{P}_{n}(\psi\in\mathcal{R})\to 1-\alpha as nn\to\infty, where n\mathbb{P}_{n} denotes the probability with respect to the data X1,,XnX_{1},\ldots,X_{n} and all randomness from the resampling.

Note that in Theorem 8 we require BB to be at least dd, the dimension of ψ\psi. In the univariate case (d=1d=1), this reduces to the requirement B1B\geq 1.
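To make the construction concrete, the sketch below computes the region (32)–(33) and tests membership of a candidate point in Python. The critical value uses the standard relation T²_{d,B,1−α} = dB/(B−d+1) · F_{d,B−d+1,1−α} between Hotelling’s T² and the F distribution; the vector-valued estimator psi is a hypothetical placeholder.

```python
import numpy as np
from scipy.stats import f as f_dist

def hotelling_t2_quantile(d, B, alpha):
    # (1 - alpha)-quantile of Hotelling's T^2 with parameters d and B,
    # via the identity T^2_{d,B} = d*B/(B - d + 1) * F_{d, B-d+1}
    return d * B / (B - d + 1) * f_dist.ppf(1 - alpha, d, B - d + 1)

def cheap_bootstrap_region(data, psi, B, alpha=0.05, seed=None):
    """Return (psi_hat, S, critical value) defining the region (32)-(33)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    psi_hat = np.asarray(psi(data))
    d = psi_hat.size
    if B < d:
        raise ValueError("Theorem 8 requires B >= d")
    diffs = np.array([np.asarray(psi(data[rng.integers(0, n, size=n)])) - psi_hat
                      for _ in range(B)])
    S = diffs.T @ diffs / B                    # the d x d matrix in (33)
    return psi_hat, S, hotelling_t2_quantile(d, B, alpha)

def in_region(psi_hat, S, crit, psi0):
    """Check whether the candidate psi0 lies in the confidence region (32)."""
    delta = psi_hat - np.asarray(psi0)
    return float(delta @ np.linalg.solve(S, delta)) <= crit
```

The check B ≥ d also reflects that S in (33), being an average of B rank-one matrices, can only be invertible when B ≥ d.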

Finally, the following is a multivariate analog of Proposition 2 that uses Hadamard differentiability with non-degenerate derivative to ensure Assumption 3:

Proposition 4 (Sufficient conditions for multivariate bootstrap validity).

Consider P^n\hat{P}_{n} and PnP_{n}^{*} as random elements that take values in ()\ell^{\infty}(\mathcal{F}), where \mathcal{F} is a Donsker class with finite envelope. Suppose ψ:()d\psi:\ell^{\infty}(\mathcal{F})\to\mathbb{R}^{d} is Hadamard differentiable at PP where the derivative ψP\psi_{P}^{\prime} satisfies that the covariance matrix of ψP(𝔾P)\psi_{P}^{\prime}(\mathbb{G}_{P}) is positive definite, for a tight Gaussian process 𝔾P\mathbb{G}_{P} on ()\ell^{\infty}(\mathcal{F}) with mean 0 and covariance Cov(𝔾P(f1),𝔾P(f2))=CovP(f1(X),f2(X))Cov(\mathbb{G}_{P}(f_{1}),\mathbb{G}_{P}(f_{2}))=Cov_{P}(f_{1}(X),f_{2}(X)). Then Assumption 3 holds under i.i.d. data.

Appendix B Proofs for Section 3 and Appendix A

Proof of Proposition 2.

Apply Theorem 12 (in Appendix F.2) with d=1d=1 and note that ψP(𝔾P)\psi_{P}^{\prime}(\mathbb{G}_{P}) is normal. ∎

Proof of Theorem 2.

For convenience throughout the proof we use CC to denote a positive constant that is not necessarily the same every time it appears. We first define

A^(𝐱)=g(𝐱)g(𝐗¯)h(𝐗¯)\hat{A}(\mathbf{x})=\frac{g(\mathbf{x})-g(\overline{\mathbf{X}})}{h(\overline{\mathbf{X}})}

and consider A^(𝐗¯)\hat{A}(\overline{\mathbf{X}}^{*}), where 𝐗¯=(1/n)i=1n𝐗i\overline{\mathbf{X}}^{*}=(1/n)\sum_{i=1}^{n}\mathbf{X}_{i}^{*} is the mean of a resample {𝐗1,,𝐗n}\{\mathbf{X}_{1}^{*},\ldots,\mathbf{X}_{n}^{*}\}. We can view A^(𝐗¯)\hat{A}(\overline{\mathbf{X}}^{*}) as the resample counterpart of A(𝐗¯)A(\overline{\mathbf{X}}).

Under the function-of-mean model described in the theorem, the pivotal statistic (7) can be written as

T=g(𝐗¯)g(𝝁)1Bb=1B(g(𝐗¯b)g(𝐗¯))2=As(𝐗¯)1Bb=1BA^(𝐗¯b)2T=\frac{g(\overline{\mathbf{X}})-g(\bm{\mu})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(g(\overline{\mathbf{X}}^{*b})-g(\overline{\mathbf{X}}))^{2}}}=\frac{A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}\hat{A}(\overline{\mathbf{X}}^{*b})^{2}}}

For a two-sided interval, coverage is defined by the event

|g(𝐗¯)g(𝝁)1Bb=1B(g(𝐗¯b)g(𝐗¯))2|tB,1α/2\left|\frac{g(\overline{\mathbf{X}})-g(\bm{\mu})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(g(\overline{\mathbf{X}}^{*b})-g(\overline{\mathbf{X}}))^{2}}}\right|\leq t_{B,1-\alpha/2}

or equivalently

|nAs(𝐗¯)1Bb=1B(nA^(𝐗¯b))2|tB,1α/2\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(\sqrt{n}\hat{A}(\overline{\mathbf{X}}^{*b}))^{2}}}\right|\leq t_{B,1-\alpha/2}

Define QQ^{*} as the conditional distribution of nA^(𝐗¯b)\sqrt{n}\hat{A}(\overline{\mathbf{X}}^{*b}) given the data 𝒳n={𝐗1,,𝐗n}\mathcal{X}_{n}=\{\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\}. Note that we can write the coverage probability as

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)]E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1})\right] (34)

where the expectation EE is taken with respect to the data 𝒳n\mathcal{X}_{n}.

Consider a positive number λ3/2\lambda\geq 3/2. We first show that, with probability 1O(nλ)1-O(n^{-\lambda}),

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1})
=|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q^(z1)+R\displaystyle=\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})\cdots d\hat{Q}^{*}(z_{1})+R (35)

where

Q^(x)=Φ(x)+j=1,,νj evennj/2p^j(x)ϕ(x)\hat{Q}^{*}(x)=\Phi(x)+\sum_{\begin{subarray}{c}j=1,\ldots,\nu\\ j\text{\ even}\end{subarray}}n^{-j/2}\hat{p}_{j}(x)\phi(x) (36)

is a random signed measure (constructed from the random polynomial p^j\hat{p}_{j}), and RR satisfies |R|Cn(ν+1)/2|R|\leq Cn^{-(\nu+1)/2} for some constant CC. Here, the polynomials p^j\hat{p}_{j} are the ones defined in Theorem 14 (in Appendix F.3). To this end, denote \mathcal{E} as the event that

sup<x<|P(nA^(𝐗¯)x|𝒳n)Φ(x)j=1νnj/2p^j(x)ϕ(x)|Cn(ν+1)/2\sup_{-\infty<x<\infty}\left|P(\sqrt{n}\hat{A}(\overline{\mathbf{X}}^{*})\leq x|\mathcal{X}_{n})-\Phi(x)-\sum_{j=1}^{\nu}n^{-j/2}\hat{p}_{j}(x)\phi(x)\right|\leq Cn^{-(\nu+1)/2} (37)
max1jνsup<x<(1+|x|)(3j1)|p^j(x)|C\max_{1\leq j\leq\nu}\sup_{-\infty<x<\infty}(1+|x|)^{-(3j-1)}|\hat{p}_{j}(x)|\leq C (38)

and

max1jνsup<x<(1+|x|)(3j1)|p^j(x)|C\max_{1\leq j\leq\nu}\sup_{-\infty<x<\infty}(1+|x|)^{-(3j-1)}|\hat{p}_{j}^{\prime}(x)|\leq C (39)

hold simultaneously. By Theorem 14, \mathcal{E} occurs with probability 1O(nλ)1-O(n^{-\lambda}). Now, conditional on the data 𝒳n\mathcal{X}_{n} and for any z1,,zB1z_{1},\ldots,z_{B-1}\in\mathbb{R},

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B}) (40)

is expressible as 1Q(q)+Q(q)1-Q^{*}(q)+Q^{*}(-q) for some qq\in\mathbb{R}. Thus, using the oddness and evenness property of pjp_{j} in Theorem 13, we have, under \mathcal{E},

sup<z1,,zB1<||nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)|Cn(ν+1)/2\sup_{-\infty<z_{1},\ldots,z_{B-1}<\infty}\left|\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})-\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})\right|\leq Cn^{-(\nu+1)/2} (41)

for some C>0C>0 by (37). Thus, integrating (40) with respect to dQ(zB1)dQ(z1)dQ^{*}(z_{B-1})\cdots dQ^{*}(z_{1}) and using (41), we get

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1}) (42)
=\displaystyle= |nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q(zB1)𝑑Q(z1)+RB\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})dQ^{*}(z_{B-1})\cdots dQ^{*}(z_{1})+R_{B}

where RBR_{B} satisfies |RB|Cn(ν+1)/2|R_{B}|\leq Cn^{-(\nu+1)/2}. Iterating (42), this time considering

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB1)\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B-1})

and using Fubini’s theorem with the signed measure Q^\hat{Q}^{*}, we have

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q(zB1)𝑑Q(z1)\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})dQ^{*}(z_{B-1})\cdots dQ^{*}(z_{1})
=\displaystyle= |nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q^(zB1)𝑑Q(zB2)𝑑Q(z1)+RB1\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})d\hat{Q}^{*}(z_{B-1})dQ^{*}(z_{B-2})\cdots dQ^{*}(z_{1})+R_{B-1}

where RB1R_{B-1} satisfies

|RB1|Cn(ν+1)/2|(Q^)(zB)|𝑑zB𝑑Q(zB2)𝑑Q(z1)=Cn(ν+1)/2|(Q^)(zB)|𝑑zBCn(ν+1)/2|R_{B-1}|\leq Cn^{-(\nu+1)/2}\int\cdots\int|(\hat{Q}^{*})^{\prime}(z_{B})|dz_{B}dQ^{*}(z_{B-2})\cdots dQ^{*}(z_{1})=Cn^{-(\nu+1)/2}\int|(\hat{Q}^{*})^{\prime}(z_{B})|dz_{B}\leq Cn^{-(\nu+1)/2}

by (38) and (39) (where the last CC is a different constant from the previous one). Continuing in this fashion, we get

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1})
=\displaystyle= |nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q^(z1)+RB+RB1++R1\displaystyle\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})\cdots d\hat{Q}^{*}(z_{1})+R_{B}+R_{B-1}+\cdots+R_{1}

where each RbR_{b} satisfies |Rb|Cn(ν+1)/2|R_{b}|\leq Cn^{-(\nu+1)/2}. This gives (35).

Now consider (34), which we can write as

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1})\right] (43)
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1});\mathcal{E}\right]{}
+E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1);c]\displaystyle{}+E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1});\mathcal{E}^{c}\right]
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1);]+O(nλ)\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1});\mathcal{E}\right]+O(n^{-\lambda}){}
  since |nAs(𝐗¯)(1/B)b=1Bzb2|tB,1α/2𝑑Q(zB)𝑑Q(z1)1\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{(1/B)\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}dQ^{*}(z_{B})\cdots dQ^{*}(z_{1})\leq 1
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q^(z1);]+O(n(ν+1)/2)+O(nλ)\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})\cdots d\hat{Q}^{*}(z_{1});\mathcal{E}\right]+O(n^{-(\nu+1)/2})+O(n^{-\lambda}){}
  by (35)

Using (36), we write the first term in (43) as

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Q^(zB)𝑑Q^(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\hat{Q}^{*}(z_{B})\cdots d\hat{Q}^{*}(z_{1});\mathcal{E}\right] (44)
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(Φ(zB)+j=1,,νj evennj/2p^j(zB)ϕ(zB))\displaystyle E\Bigg{[}\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\left(\Phi(z_{B})+\sum_{\begin{subarray}{c}j=1,\ldots,\nu\\ j\text{\ even}\end{subarray}}n^{-j/2}\hat{p}_{j}(z_{B})\phi(z_{B})\right)\cdots{}
d(Φ(z1)+j=1,,νj evennj/2p^j(z1)ϕ(z1));]\displaystyle{}d\left(\Phi(z_{1})+\sum_{\begin{subarray}{c}j=1,\ldots,\nu\\ j\text{\ even}\end{subarray}}n^{-j/2}\hat{p}_{j}(z_{1})\phi(z_{1})\right);\mathcal{E}\Bigg{]}
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1});\mathcal{E}\right]{}
+BnE[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p^2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle{}+\frac{B}{n}E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(\hat{p}_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]{}
+O(1n2)\displaystyle{}+O\left(\frac{1}{n^{2}}\right)

where the last equality follows by expanding out the product for ν2\nu\geq 2, using the symmetry among z1,,zBz_{1},\ldots,z_{B} to get the second term, and using (38) and (39) to get the last remainder term.

Next, we can write

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p^2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(\hat{p}_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right] (45)
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]{}
+E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d((p^2(zB)p2(zB))ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle{}+E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d((\hat{p}_{2}(z_{B})-p_{2}(z_{B}))\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]{}
+E[poly(μ^m1,,md,m1++md4)poly(μm1,,md,m1++md4);]\displaystyle{}+E\left[\text{poly}(\hat{\mu}_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4)-\text{poly}(\mu_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4);\mathcal{E}\right]

where poly(μ^m1,,md,m1++md4)\text{poly}(\hat{\mu}_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4) and poly(μm1,,md,m1++md4)\text{poly}(\mu_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4) denote the same polynomial of bounded degree in μ^m1,,md\hat{\mu}_{m_{1},\ldots,m_{d}} or μm1,,md\mu_{m_{1},\ldots,m_{d}} for m1++md4m_{1}+\cdots+m_{d}\leq 4, and where the coefficients of the polynomial consist of linear combinations of terms of the form

|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(zBkϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(z_{B}^{k}\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})

for some positive integer k5k\leq 5, which is absolutely bounded by

B|(zBkϕ(zB))|𝑑zB𝑑Φ(zB1)𝑑Φ(z1)C\int\cdots\int_{\mathbb{R}^{B}}|(z_{B}^{k}\phi(z_{B}))^{\prime}|dz_{B}d\Phi(z_{B-1})\cdots d\Phi(z_{1})\leq C

for some constant C>0C>0 independent of k5k\leq 5. Now, with E𝐗l<E\|\mathbf{X}\|^{l}<\infty for sufficiently large ll, uniform integrability gives the convergence of the power moments of the sample moments, i.e.,

E|μ^m1,,mdkμm1,,mdk|=O(1n)E\left|\hat{\mu}_{m_{1},\ldots,m_{d}}^{k}-\mu_{m_{1},\ldots,m_{d}}^{k}\right|=O\left(\frac{1}{\sqrt{n}}\right)

where m1++md4m_{1}+\cdots+m_{d}\leq 4 and any k5k\leq 5. So we have

|E[poly(μ^m1,,md,m1++md4)poly(μm1,,md,m1++md4);]|\displaystyle\left|E\left[\text{poly}(\hat{\mu}_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4)-\text{poly}(\mu_{m_{1},\ldots,m_{d}},m_{1}+\cdots+m_{d}\leq 4);\mathcal{E}\right]\right|
\displaystyle\leq CE[m1++md4|μ^m1,,mdkμm1,,mdk|]\displaystyle CE\left[\sum_{m_{1}+\cdots+m_{d}\leq 4}\left|\hat{\mu}_{m_{1},\ldots,m_{d}}^{k}-\mu_{m_{1},\ldots,m_{d}}^{k}\right|\right]
=\displaystyle= O(1n)\displaystyle O\left(\frac{1}{\sqrt{n}}\right)

where CC in the inequality could be different from the previous CC.

On the other hand,

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]+O(nλ)\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]+O(n^{-\lambda})

since

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);c]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}^{c}\right]
\displaystyle\leq E[|(p2(zB)ϕ(zB))|𝑑zB𝑑Φ(zB1)𝑑Φ(z1);c]=O(nλ)\displaystyle E\left[\int\cdots\int|(p_{2}(z_{B})\phi(z_{B}))^{\prime}|dz_{B}d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}^{c}\right]=O(n^{-\lambda})

So (45) becomes

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]+O(nλ)+O(1n)E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]+O(n^{-\lambda})+O\left(\frac{1}{\sqrt{n}}\right) (46)

Similarly,

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1});\mathcal{E}\right] (47)
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)]+O(nλ)\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right]+O(n^{-\lambda})

Combining (43), (44), (46) and (47), we can write (34) as

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)]+O(nλ)\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right]+O(n^{-\lambda}){} (48)
+Bn{E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]+O(nλ)+O(1n)}\displaystyle{}+\frac{B}{n}\left\{E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]+O(n^{-\lambda})+O\left(\frac{1}{\sqrt{n}}\right)\right\}{}
+O(1n2)+O(n(ν+1)/2)+O(nλ)\displaystyle{}+O\left(\frac{1}{n^{2}}\right)+O(n^{-(\nu+1)/2})+O(n^{-\lambda})
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right]{}
+BnE[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]\displaystyle{}+\frac{B}{n}E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]{}
+O(nλ)+O(1n3/2)+O(n(ν+1)/2)\displaystyle{}+O(n^{-\lambda})+O\left(\frac{1}{n^{3/2}}\right)+O(n^{-(\nu+1)/2})
=\displaystyle= E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right]{}
+BnE[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]+O(1n3/2)\displaystyle{}+\frac{B}{n}E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]+O\left(\frac{1}{n^{3/2}}\right)

when λ3/2\lambda\geq 3/2 and ν2\nu\geq 2 so that (ν+1)/23/2(\nu+1)/2\geq 3/2.

Lastly, using Theorem 13, and a similar argument as before with Fubini’s theorem and the observation that

P(|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2|z1,,zB)P\left(\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}\Bigg{|}z_{1},\ldots,z_{B}\right)

is expressible as P(nAs(𝐗¯)q)P(nAs(𝐗¯)q)P(\sqrt{n}A_{s}(\overline{\mathbf{X}})\leq q)-P(\sqrt{n}A_{s}(\overline{\mathbf{X}})\leq-q) for some qq\in\mathbb{R}, we have

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right] (49)
=\displaystyle= |z01Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)𝑑Φ(z0)\displaystyle\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})d\Phi(z_{0}){}
+1n|z01Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)d(q2(z0)ϕ(z0))+o(1n)\displaystyle{}+\frac{1}{n}\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})d(q_{2}(z_{0})\phi(z_{0}))+o\left(\frac{1}{n}\right)

when ν2\nu\geq 2. Similarly,

E[|nAs(𝐗¯)1Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]\displaystyle E\left[\int\cdots\int_{\left|\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right] (50)
=\displaystyle= |z01Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z0)+O(1n)\displaystyle\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{0})+O\left(\frac{1}{n}\right)

Thus, using (49) and (50), we can write (48) as

|z01Bb=1Bzb2|tB,1α/2𝑑Φ(zB)𝑑Φ(z1)𝑑Φ(z0)\displaystyle\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})d\Phi(z_{0}){}
+1n{B|z01Bb=1Bzb2|tB,1α/2d(p2(zB)ϕ(zB))dΦ(zB1)dΦ(z0)\displaystyle{}+\frac{1}{n}\Bigg{\{}B\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d(p_{2}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{0}){}
+|z01Bb=1Bzb2|tB,1α/2dΦ(zB)dΦ(z1)d(q2(z0)ϕ(z0))}+o(1n)\displaystyle{}+\int\cdots\int_{\left|\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\right|\leq t_{B,1-\alpha/2}}d\Phi(z_{B})\cdots d\Phi(z_{1})d(q_{2}(z_{0})\phi(z_{0}))\Bigg{\}}+o\left(\frac{1}{n}\right) (51)

which gives the first part of the theorem. Note that when ν3\nu\geq 3, the remainder term in (49) is refined to o(1/n3/2)o(1/n^{3/2}) and, as a result, the remainder term in (51) is refined to O(1/n3/2)O(1/n^{3/2}).

The second part of the theorem follows analogously by replacing the event |g(𝐗¯)g(𝝁)1Bb=1B(g(𝐗¯b)g(𝐗¯))2|tB,1α/2\left|\frac{g(\overline{\mathbf{X}})-g(\bm{\mu})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(g(\overline{\mathbf{X}}^{*b})-g(\overline{\mathbf{X}}))^{2}}}\right|\leq t_{B,1-\alpha/2} with g(𝐗¯)g(𝝁)1Bb=1B(g(𝐗¯b)g(𝐗¯))2tB,1α\frac{g(\overline{\mathbf{X}})-g(\bm{\mu})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(g(\overline{\mathbf{X}}^{*b})-g(\overline{\mathbf{X}}))^{2}}}\leq t_{B,1-\alpha} or g(𝐗¯)g(𝝁)1Bb=1B(g(𝐗¯b)g(𝐗¯))2tB,1α\frac{g(\overline{\mathbf{X}})-g(\bm{\mu})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}(g(\overline{\mathbf{X}}^{*b})-g(\overline{\mathbf{X}}))^{2}}}\geq t_{B,1-\alpha}. Consider now λ>1/2\lambda>1/2. Because of the aforementioned change of the considered event, now (36) is replaced by

Q^(x)=Φ(x)+j=1νnj/2p^j(x)ϕ(x)\hat{Q}^{*}(x)=\Phi(x)+\sum_{j=1}^{\nu}n^{-j/2}\hat{p}_{j}(x)\phi(x)

and (44) becomes, in the upper interval case,

=\displaystyle= E[nAs(𝐗¯)1Bb=1Bzb2tB,1α𝑑Φ(zB)𝑑Φ(z1);]\displaystyle E\left[\int\cdots\int_{\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1});\mathcal{E}\right]{}
+BnE[nAs(𝐗¯)1Bb=1Bzb2tB,1αd(p^1(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1);]+O(1n)\displaystyle{}+\frac{B}{\sqrt{n}}E\left[\int\cdots\int_{\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d(\hat{p}_{1}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1});\mathcal{E}\right]+O\left(\frac{1}{n}\right)

for ν1\nu\geq 1 giving a modified (48) as

E[nAs(𝐗¯)1Bb=1Bzb2tB,1α𝑑Φ(zB)𝑑Φ(z1)]\displaystyle E\left[\int\cdots\int_{\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1})\right]{}
+BnE[nAs(𝐗¯)1Bb=1Bzb2tB,1αd(p1(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z1)]+o(1n)\displaystyle{}+\frac{B}{\sqrt{n}}E\left[\int\cdots\int_{\frac{\sqrt{n}A_{s}(\overline{\mathbf{X}})}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d(p_{1}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{1})\right]+o\left(\frac{1}{\sqrt{n}}\right) (52)

when λ>1/2\lambda>1/2 and ν1\nu\geq 1. Moreover, (49) becomes

z01Bb=1Bzb2tB,1α𝑑Φ(zB)𝑑Φ(z1)𝑑Φ(z0)\displaystyle\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1})d\Phi(z_{0}){}
+1nz01Bb=1Bzb2tB,1α𝑑Φ(zB)𝑑Φ(z1)d(q1(z0)ϕ(z0))+o(1n)\displaystyle{}+\frac{1}{\sqrt{n}}\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1})d(q_{1}(z_{0})\phi(z_{0}))+o\left(\frac{1}{\sqrt{n}}\right) (53)

and (50) becomes

z01Bb=1Bzb2tB,1αd(p1(zB)ϕ(zB))𝑑Φ(zB1)𝑑Φ(z0)+O(1n)\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d(p_{1}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{0})+O\left(\frac{1}{\sqrt{n}}\right)

giving rise to a modified (51) as

z01Bb=1Bzb2tB,1α𝑑Φ(zB)𝑑Φ(z1)𝑑Φ(z0)\displaystyle\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1})d\Phi(z_{0}){}
+1n{Bz01Bb=1Bzb2tB,1αd(p1(zB)ϕ(zB))dΦ(zB1)dΦ(z0)\displaystyle{}+\frac{1}{\sqrt{n}}\Bigg{\{}B\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d(p_{1}(z_{B})\phi(z_{B}))d\Phi(z_{B-1})\cdots d\Phi(z_{0}){}
+z01Bb=1Bzb2tB,1αdΦ(zB)dΦ(z1)d(q1(z0)ϕ(z0))}+o(1n)\displaystyle{}+\int\cdots\int_{\frac{z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}z_{b}^{2}}}\leq t_{B,1-\alpha}}d\Phi(z_{B})\cdots d\Phi(z_{1})d(q_{1}(z_{0})\phi(z_{0}))\Bigg{\}}+o\left(\frac{1}{\sqrt{n}}\right) (54)

which gives the upper interval case in the second part of the theorem. The lower interval case follows analogously. Moreover, when λ1\lambda\geq 1 and ν2\nu\geq 2, the remainder term in (52) is refined to O(1/n)O(1/n), the remainder term in (53) is refined to O(1/n)O(1/n), and, as a result, the remainder term in (54) is refined to O(1/n)O(1/n). ∎

Proof of Theorem 7.

Using the same argument as in (8) in the proof of Theorem 1, we have nSσχB2B\sqrt{n}S\Rightarrow\sigma\sqrt{\frac{\chi^{2}_{B}}{B}} and hence

n(σχα/2,B2BnSσχ1α/2,B2B)1α\mathbb{P}_{n}\left(\sigma\sqrt{\frac{\chi^{2}_{\alpha/2,B}}{B}}\leq\sqrt{n}S\leq\sigma\sqrt{\frac{\chi^{2}_{1-\alpha/2,B}}{B}}\right)\to 1-\alpha

The conclusion then follows. ∎

Proof of Theorem 8.

A straightforward modification of Proposition 1 from the univariate to the multivariate n(ψ^nψ)\sqrt{n}(\hat{\psi}_{n}-\psi) and n(ψnbψ^n)\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n}) gives the joint weak convergence

n(ψ^nψ,ψn1ψ^n,,ψnBψ^n)(Z0,Z1,,ZB)\sqrt{n}(\hat{\psi}_{n}-\psi,\ \psi_{n}^{*1}-\hat{\psi}_{n},\ldots,\ \psi_{n}^{*B}-\hat{\psi}_{n})\Rightarrow(Z_{0},Z_{1},\ldots,Z_{B}) (55)

where Z0,Z1,,ZBdZ_{0},Z_{1},\ldots,Z_{B}\in\mathbb{R}^{d} are i.i.d. N(0,Σ)N(0,\Sigma). By the continuous mapping theorem, we have

(ψ^nψ)S1(ψ^nψ)=(n(ψ^nψ))(nS)1(n(ψ^nψ))Td,B2(\hat{\psi}_{n}-\psi)^{\top}S^{-1}(\hat{\psi}_{n}-\psi)=(\sqrt{n}(\hat{\psi}_{n}-\psi))^{\top}(nS)^{-1}(\sqrt{n}(\hat{\psi}_{n}-\psi))\Rightarrow T^{2}_{d,B}

where Td,B2T^{2}_{d,B} denotes Hotelling’s T2T^{2} variable with parameters dd and BB. Hence

n(ψ)=n((ψ^nψ)S1(ψ^nψ)Td,B,1α2)1α\mathbb{P}_{n}(\psi\in\mathcal{R})=\mathbb{P}_{n}\left((\hat{\psi}_{n}-\psi)^{\top}S^{-1}(\hat{\psi}_{n}-\psi)\leq T^{2}_{d,B,1-\alpha}\right)\to 1-\alpha

which concludes the theorem. ∎

Proof of Proposition 4.

This is the same as Proposition 2 except we now consider general dd when using Theorem 12 in Appendix F.2. ∎

Appendix C Further Details for Section 4

C.1 Proofs for the Beginning of Section 4

Proof of Proposition 3.

We focus on τ2()\tau^{2}(\cdot), as the argument for κ3()\kappa_{3}(\cdot) is the same. Suppose τ2()\tau^{2}(\cdot) is Hadamard differentiable and satisfies the assumptions in Proposition 2. Then, by Proposition 2, Assumption 1 holds for τ2()\tau^{2}(\cdot), so n(τ2(P^n)τ2(P))\sqrt{n}(\tau^{2}(\hat{P}_{n})-\tau^{2}(P)) converges weakly to a tight random variable. Hence τ2(P^n)pτ2(P)\tau^{2}(\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}\tau^{2}(P) by Slutsky's theorem. Moreover, by Proposition 1, n(τ2(Pn)τ2(P^n))\sqrt{n}(\tau^{2}(P_{n}^{*})-\tau^{2}(\hat{P}_{n})) also converges weakly to a tight random variable, and hence τ2(Pn)τ2(P^n)p0\tau^{2}(P_{n}^{*})-\tau^{2}(\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}0 by Slutsky's theorem again. Thus, τ2(Pn)τ2(P)=(τ2(Pn)τ2(P^n))+(τ2(P^n)τ2(P))p0\tau^{2}(P_{n}^{*})-\tau^{2}(P)=(\tau^{2}(P_{n}^{*})-\tau^{2}(\hat{P}_{n}))+(\tau^{2}(\hat{P}_{n})-\tau^{2}(P))\stackrel{{\scriptstyle p}}{{\to}}0, once again by Slutsky's theorem, which concludes the proposition. ∎

Proof of Theorem 3.

We divide the proof into two steps:

Step 1. We first show the convergence of the following joint distribution

(n(ψ^nψ)σ,n(ψn1ψ^n)σ,,n(ψnBψ^n)σ,\displaystyle\Bigg{(}\frac{\sqrt{n}(\hat{\psi}_{n}-\psi)}{\sigma},\frac{\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})}{\sigma},\ldots,\frac{\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})}{\sigma},{} (56)
R0(ψ^^n,R0ψ^n)τ,R(ψn,R1ψn1)τ,,R(ψn,RBψnB)τ)\displaystyle\frac{\sqrt{R_{0}}(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n})}{\tau},\frac{\sqrt{R}(\psi_{n,R}^{**1}-\psi_{n}^{*1})}{\tau},\ldots,\frac{\sqrt{R}(\psi_{n,R}^{**B}-\psi_{n}^{*B})}{\tau}\Bigg{)}
\displaystyle\Rightarrow (Z0,Z1,,ZB,W0,W1,,WB)\displaystyle(Z_{0},Z_{1},\ldots,Z_{B},W_{0},W_{1},\ldots,W_{B})

where Z0,Z1,,ZB,W0,W1,,WBi.i.d.N(0,1)Z_{0},Z_{1},\ldots,Z_{B},W_{0},W_{1},\ldots,W_{B}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1).

For convenience, denote

Z0n\displaystyle Z_{0}^{n} =n(ψ^nψ)σ\displaystyle=\frac{\sqrt{n}(\hat{\psi}_{n}-\psi)}{\sigma}
Zbn\displaystyle Z_{b}^{n} =n(ψnbψ^n)σ,b=1,,B\displaystyle=\frac{\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n})}{\sigma},\ b=1,\ldots,B
W0n\displaystyle W_{0}^{n} =R0(ψ^^n,R0ψ^n)τ\displaystyle=\frac{\sqrt{R_{0}}(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n})}{\tau}
Wbn\displaystyle W_{b}^{n} =R(ψn,Rbψnb)τ,b=1,,B\displaystyle=\frac{\sqrt{R}(\psi_{n,R}^{**b}-\psi_{n}^{*b})}{\tau},\ b=1,\ldots,B

Also, denote τ^n=τ(P^n)\hat{\tau}^{n}=\tau(\hat{P}_{n}) and κ^3n=κ3(P^n)\hat{\kappa}_{3}^{n}=\kappa_{3}(\hat{P}_{n}).

Then, we have, for any fixed real constants zb,wbz_{b},w_{b} for b=0,,Bb=0,\ldots,B,

|P(Zbnzb,Wbnwb,b=0,,B)b=0BΦ(zb)Φ(wb)|\displaystyle\left|P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=0,\ldots,B\right)-\prod_{b=0}^{B}\Phi(z_{b})\Phi(w_{b})\right| (57)
=\displaystyle= |E[I(Z0nz0,W0nw0)P(Zbnzb,Wbnwb,b=1,,B|P^n,ξR0)]b=0BΦ(zb)Φ(wb)|\displaystyle\left|E\left[I(Z_{0}^{n}\leq z_{0},W_{0}^{n}\leq w_{0})P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}\hat{P}_{n},\xi_{R_{0}}\right)\right]-\prod_{b=0}^{B}\Phi(z_{b})\Phi(w_{b})\right|
              where ξR0\xi_{R_{0}} refers to all the computation randomness in generating ψ^^n,R0\hat{\hat{\psi}}_{n,R_{0}},
              by noting that Z0nZ_{0}^{n} and W0nW_{0}^{n} are determined solely by P^n\hat{P}_{n} and ξR0\xi_{R_{0}}
=\displaystyle= |E[I(Z0nz0,W0nw0)P(Zbnzb,Wbnwb,b=1,,B|P^n,ξR0)]\displaystyle\Bigg{|}E\left[I(Z_{0}^{n}\leq z_{0},W_{0}^{n}\leq w_{0})P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}\hat{P}_{n},\xi_{R_{0}}\right)\right]{}
P(Z0nz0,W0nw0)b=1BΦ(zb)Φ(wb)+P(Z0nz0,W0nw0)b=1BΦ(zb)Φ(wb)b=0BΦ(zb)Φ(wb)|\displaystyle{}-P(Z_{0}^{n}\leq z_{0},W_{0}^{n}\leq w_{0})\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})+P(Z_{0}^{n}\leq z_{0},W_{0}^{n}\leq w_{0})\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})-\prod_{b=0}^{B}\Phi(z_{b})\Phi(w_{b})\Bigg{|}
\displaystyle\leq E[|P(Zbnzb,Wbnwb,b=1,,B|P^,ξR0)b=1BΦ(zb)Φ(wb)|;Z0nz0,W0nw0]\displaystyle E\left[\left|P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}\hat{P},\xi_{R_{0}}\right)-\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})\right|;Z_{0}^{n}\leq z_{0},\ W_{0}^{n}\leq w_{0}\right]{}
+|P(Z0nz0,W0nw0)Φ(z0)Φ(w0)|b=1BΦ(zb)Φ(wb)\displaystyle{}+\left|P(Z_{0}^{n}\leq z_{0},\ W_{0}^{n}\leq w_{0})-\Phi(z_{0})\Phi(w_{0})\right|\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b}){}
              by the triangle inequality
\displaystyle\leq E|P(Zbnzb,Wbnwb,b=1,,B|P^,ξR0)b=1BΦ(zb)Φ(wb)|\displaystyle E\left|P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}\hat{P},\xi_{R_{0}}\right)-\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})\right|{}
+|P(Z0nz0,W0nw0)Φ(z0)Φ(w0)|b=1BΦ(zb)Φ(wb)\displaystyle{}+\left|P(Z_{0}^{n}\leq z_{0},\ W_{0}^{n}\leq w_{0})-\Phi(z_{0})\Phi(w_{0})\right|\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})

We consider the two terms in (57) one by one, starting with the second term. By conditioning on P^n\hat{P}_{n} and telescoping, we have

|P(Z0nz0,W0nw0)Φ(z0)Φ(w0)|\displaystyle\left|P(Z_{0}^{n}\leq z_{0},\ W_{0}^{n}\leq w_{0})-\Phi(z_{0})\Phi(w_{0})\right| (58)
\displaystyle\leq E|P(W0nw0|P^n)Φ(w0)|+|P(Z0nz0)Φ(z0)|Φ(w0)\displaystyle E|P(W_{0}^{n}\leq w_{0}|\hat{P}_{n})-\Phi(w_{0})|+|P(Z_{0}^{n}\leq z_{0})-\Phi(z_{0})|\Phi(w_{0})

The first term in (58) can be bounded from above by

P(|τ^nτ|>δ or κ^3n>κ3+δ)+E[|P(W0nw0|P^n)Φ(w0)|;|τ^nτ|δ,κ^3nκ3+δ]P(|\hat{\tau}^{n}-\tau|>\delta\text{\ or\ }\hat{\kappa}_{3}^{n}>\kappa_{3}+\delta)+E[|P(W_{0}^{n}\leq w_{0}|\hat{P}_{n})-\Phi(w_{0})|;|\hat{\tau}^{n}-\tau|\leq\delta,\hat{\kappa}_{3}^{n}\leq\kappa_{3}+\delta]

for some small δ>0\delta>0, where the second term can be written as

E[|P(W0nττ^nw0ττ^n|P^n)Φ(w0ττ^n)|+|Φ(w0ττ^n)Φ(w0)|;|τ^nτ|δ,κ^3nκ3+δ]\displaystyle E\left[\left|P\left(\frac{W_{0}^{n}\tau}{\hat{\tau}^{n}}\leq\frac{w_{0}\tau}{\hat{\tau}^{n}}\Bigg{|}\hat{P}_{n}\right)-\Phi\left(\frac{w_{0}\tau}{\hat{\tau}^{n}}\right)\right|+\left|\Phi\left(\frac{w_{0}\tau}{\hat{\tau}^{n}}\right)-\Phi(w_{0})\right|;|\hat{\tau}^{n}-\tau|\leq\delta,\hat{\kappa}_{3}^{n}\leq\kappa_{3}+\delta\right]
\displaystyle\leq E[C1κ^3n(τ^n)3R0+|Φ(w0ττ^n)Φ(w0)|;|τ^nτ|δ,κ^3nκ3+δ]\displaystyle E\left[\frac{C_{1}\hat{\kappa}_{3}^{n}}{(\hat{\tau}^{n})^{3}\sqrt{R_{0}}}+\left|\Phi\left(\frac{w_{0}\tau}{\hat{\tau}^{n}}\right)-\Phi(w_{0})\right|;|\hat{\tau}^{n}-\tau|\leq\delta,\hat{\kappa}_{3}^{n}\leq\kappa_{3}+\delta\right]

for some constant C1>0C_{1}>0 by the Berry-Esseen theorem, which is further bounded from above by

E[C1(κ3+δ)(τδ)3R0+C2δτδ;|τ^nτ|δ,κ^3nκ3+δ]C1(κ3+δ)(τδ)3R0+C2δτδE\left[\frac{C_{1}(\kappa_{3}+\delta)}{(\tau-\delta)^{3}\sqrt{R_{0}}}+\frac{C_{2}\delta}{\tau-\delta};|\hat{\tau}^{n}-\tau|\leq\delta,\hat{\kappa}_{3}^{n}\leq\kappa_{3}+\delta\right]\leq\frac{C_{1}(\kappa_{3}+\delta)}{(\tau-\delta)^{3}\sqrt{R_{0}}}+\frac{C_{2}\delta}{\tau-\delta}

for some constant C2>0C_{2}>0, which follows from applying the mean value theorem to the function Φ(w0τ/)\Phi(w_{0}\tau/\cdot) and noting that the function xϕ(x)x\phi(x) is bounded over xx\in\mathbb{R}. Hence, the first term in (58) is bounded from above by

P(|τ^nτ|>δ or κ^3n>κ3+δ)+C1(κ3+δ)(τδ)3R0+C2δτδP(|\hat{\tau}^{n}-\tau|>\delta\text{\ or\ }\hat{\kappa}_{3}^{n}>\kappa_{3}+\delta)+\frac{C_{1}(\kappa_{3}+\delta)}{(\tau-\delta)^{3}\sqrt{R_{0}}}+\frac{C_{2}\delta}{\tau-\delta} (59)

Since τ^npτ\hat{\tau}^{n}\stackrel{{\scriptstyle p}}{{\to}}\tau and κ^3npκ3\hat{\kappa}_{3}^{n}\stackrel{{\scriptstyle p}}{{\to}}\kappa_{3} in Assumption 2, given arbitrary ϵ>0\epsilon>0, we can choose a small enough δ>0\delta>0, a large enough nn and a large enough R0R_{0} such that (59) is bounded above by ϵ\epsilon. Thus, the first term in (58) converges to 0 as n,R0n,R_{0}\to\infty. The second term in (58) converges to 0 as nn\to\infty by Assumption 1. We therefore have the second term in (57) go to 0 as n,R0n,R_{0}\to\infty.

We handle the first term in (57) with a similar argument. We have

E|P(Zbnzb,Wbnwb,b=1,,B|P^n,ξR0)b=1BΦ(zb)Φ(wb)|\displaystyle E\left|P\left(Z_{b}^{n}\leq z_{b},W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}\hat{P}_{n},\xi_{R_{0}}\right)-\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})\right| (60)
=\displaystyle= E|E[I(Zbnzb,b=1,,B)P(Wbnwb,b=1,,B|Pnb,b=1,,B,P^n,ξR0)|P^n,ξR0]\displaystyle E\Bigg{|}E\left[I(Z_{b}^{n}\leq z_{b},b=1,\ldots,B)P\left(W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}P_{n}^{*b},b=1,\ldots,B,\hat{P}_{n},\xi_{R_{0}}\right)\Big{|}\hat{P}_{n},\xi_{R_{0}}\right]{}
P(Zbnzb,b=1,,B|P^n,ξR0)b=1BΦ(wb)+P(Zbnzb,b=1,,B|P^n,ξR0)b=1BΦ(wb)\displaystyle{}-P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n},\xi_{R_{0}}\right)\prod_{b=1}^{B}\Phi(w_{b})+P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n},\xi_{R_{0}}\right)\prod_{b=1}^{B}\Phi(w_{b}){}
b=1BΦ(zb)Φ(wb)|\displaystyle{}-\prod_{b=1}^{B}\Phi(z_{b})\Phi(w_{b})\Bigg{|}
              by noting that Zbn,b=1,,BZ_{b}^{n},b=1,\ldots,B are determined solely by Pnb,b=1,,BP_{n}^{*b},b=1,\ldots,B and P^n\hat{P}_{n}
\displaystyle\leq E|P(Wbnwb,b=1,,B|Pnb,b=1,,B,P^n,ξR0)b=1BΦ(wb)|\displaystyle E\left|P\left(W_{b}^{n}\leq w_{b},b=1,\ldots,B\Big{|}P_{n}^{*b},b=1,\ldots,B,\hat{P}_{n},\xi_{R_{0}}\right)-\prod_{b=1}^{B}\Phi(w_{b})\right|
+E|P(Zbnzb,b=1,,B|P^n)b=1BΦ(zb)|b=1BΦ(wb)\displaystyle{}+E\left|P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n}\right)-\prod_{b=1}^{B}\Phi(z_{b})\right|\prod_{b=1}^{B}\Phi(w_{b})
              by the triangle and Jensen inequalities
=\displaystyle= E|b=1BP(Wbnwb|Pnb)b=1BΦ(wb)|+E|P(Zbnzb,b=1,,B|P^n)b=1BΦ(zb)|b=1BΦ(wb)\displaystyle E\left|\prod_{b=1}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)-\prod_{b=1}^{B}\Phi(w_{b})\right|+E\left|P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n}\right)-\prod_{b=1}^{B}\Phi(z_{b})\right|\prod_{b=1}^{B}\Phi(w_{b})
              by the conditional independence of Wbn,b=1,,BW_{b}^{n},b=1,\ldots,B given Pnb,b=1,,BP_{n}^{*b},b=1,\ldots,B and that,
              given PnbP_{n}^{*b}, WbnW_{b}^{n} is independent of PnkP_{n}^{*k} for kbk\neq b and P^n\hat{P}_{n} and ξR0\xi_{R_{0}}
=\displaystyle= E|b=1BP(Wbnwb|Pnb)Φ(w1)b=2BP(Wbnwb|Pnb)+Φ(w1)b=2BP(Wbnwb|Pnb)\displaystyle E\Bigg{|}\prod_{b=1}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)-\Phi(w_{1})\prod_{b=2}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)+\Phi(w_{1})\prod_{b=2}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right){}
Φ(w1)Φ(w2)b=3BP(Wbnwb|Pnb)+Φ(w1)Φ(w2)b=3BP(Wbnwb|Pnb)b=1BΦ(wb)|\displaystyle{}-\Phi(w_{1})\Phi(w_{2})\prod_{b=3}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)+\Phi(w_{1})\Phi(w_{2})\prod_{b=3}^{B}P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)-\cdots-\prod_{b=1}^{B}\Phi(w_{b})\Bigg{|}{}
+E|P(Zbnzb,b=1,,B|P^n)b=1BΦ(zb)|b=1BΦ(wb)\displaystyle{}+E\left|P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n}\right)-\prod_{b=1}^{B}\Phi(z_{b})\right|\prod_{b=1}^{B}\Phi(w_{b})
\displaystyle\leq E[b=1B|P(Wbnwb|Pnb)Φ(wb)|]+E|P(Zbnzb,b=1,,B|P^n)b=1BΦ(zb)|b=1BΦ(wb)\displaystyle E\left[\sum_{b=1}^{B}\left|P\left(W_{b}^{n}\leq w_{b}\Big{|}P_{n}^{*b}\right)-\Phi(w_{b})\right|\right]+E\left|P\left(Z_{b}^{n}\leq z_{b},b=1,\ldots,B\Big{|}\hat{P}_{n}\right)-\prod_{b=1}^{B}\Phi(z_{b})\right|\prod_{b=1}^{B}\Phi(w_{b})
              by the triangle inequality

Note that in the first term in (60), each

E|P(Wbnwb|Pb)Φ(wb)|E\left|P\left(W_{b}^{n}\leq w_{b}\big{|}P^{*b}\right)-\Phi(w_{b})\right|

converges to 0 as n,Rn,R\to\infty by the same argument as for the first term in (58), except that we use the bootstrapped moments τ(Pn)pτ\tau(P_{n}^{*})\stackrel{{\scriptstyle p}}{{\to}}\tau and κ3(Pn)pκ3\kappa_{3}(P_{n}^{*})\stackrel{{\scriptstyle p}}{{\to}}\kappa_{3} in Assumption 2 instead of τ^npτ\hat{\tau}^{n}\stackrel{{\scriptstyle p}}{{\to}}\tau and κ^3npκ3\hat{\kappa}_{3}^{n}\stackrel{{\scriptstyle p}}{{\to}}\kappa_{3}. The second term in (60) also goes to 0 as nn\to\infty by Assumption 1 and the dominated convergence theorem.

Therefore, (57) goes to 0. This proves (56).

Step 2. We consider the following decompositions

ψ^^n,R0ψ=(ψ^^n,R0ψ^n)+(ψ^nψ)\hat{\hat{\psi}}_{n,R_{0}}-\psi=(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n})+(\hat{\psi}_{n}-\psi)

and

ψn,Rbψ^^n,R0=(ψn,Rbψnb)+(ψnbψ^n)+(ψ^nψ^^n,R0)\psi_{n,R}^{**b}-\hat{\hat{\psi}}_{n,R_{0}}=(\psi^{**b}_{n,R}-\psi_{n}^{*b})+(\psi_{n}^{*b}-\hat{\psi}_{n})+(\hat{\psi}_{n}-\hat{\hat{\psi}}_{n,R_{0}})

Then apply the continuous mapping theorem to get

n(ψ^^n,R0ψ,ψn,R1ψ^^n,R0,,ψn,RBψ^^n,R0)\displaystyle\sqrt{n}\left(\hat{\hat{\psi}}_{n,R_{0}}-\psi,\ \psi_{n,R}^{**1}-\hat{\hat{\psi}}_{n,R_{0}},\ldots,\ \psi_{n,R}^{**B}-\hat{\hat{\psi}}_{n,R_{0}}\right)
=\displaystyle= (nR0R0(ψ^^n,R0ψ^n)+n(ψ^nψ),nRR(ψn,R1ψn1)+n(ψn1ψ^n)nR0R0(ψ^^n,R0ψ^n),\displaystyle\Bigg{(}\sqrt{\frac{n}{R_{0}}}\sqrt{R_{0}}(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n})+\sqrt{n}(\hat{\psi}_{n}-\psi),\sqrt{\frac{n}{R}}\sqrt{R}(\psi_{n,R}^{**1}-\psi_{n}^{*1})+\sqrt{n}(\psi_{n}^{*1}-\hat{\psi}_{n})-\sqrt{\frac{n}{R_{0}}}\sqrt{R_{0}}(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n}),{}
,nRR(ψn,RBψnB)+n(ψnBψ^n)nR0R0(ψ^^n,R0ψ^n))\displaystyle{}\ldots,\sqrt{\frac{n}{R}}\sqrt{R}(\psi_{n,R}^{**B}-\psi_{n}^{*B})+\sqrt{n}(\psi_{n}^{*B}-\hat{\psi}_{n})-\sqrt{\frac{n}{R_{0}}}\sqrt{R_{0}}(\hat{\hat{\psi}}_{n,R_{0}}-\hat{\psi}_{n})\Bigg{)}
\displaystyle\Rightarrow (τp0W0+σZ0,τpW1+σZ1τp0W0,,τpWB+σZBτp0W0)\displaystyle\left(\frac{\tau}{\sqrt{p_{0}}}W_{0}+\sigma Z_{0},\ \frac{\tau}{\sqrt{p}}W_{1}+\sigma Z_{1}-\frac{\tau}{\sqrt{p_{0}}}W_{0},\ldots,\ \frac{\tau}{\sqrt{p}}W_{B}+\sigma Z_{B}-\frac{\tau}{\sqrt{p_{0}}}W_{0}\right)

This concludes the theorem. ∎

C.2 Proofs and Additional Discussions for Section 4.1

We first prove Theorem 4:

Proof of Theorem 4.

Consider the pivotal statistic TO=(ψ^^n,R0ψ)/SOT_{O}=(\hat{\hat{\psi}}_{n,R_{0}}-\psi)/S_{O}. We argue that qO,1α/2q_{O,1-\alpha/2} defined in (18) satisfies

lim infnP(ψO)=lim infnP(|TO|qO,1α/2)1α\liminf_{n\to\infty}P(\psi\in\mathcal{I}_{O})=\liminf_{n\to\infty}P\left(\left|T_{O}\right|\leq q_{O,1-\alpha/2}\right)\geq 1-\alpha

which would conclude that O\mathcal{I}_{O} is an asymptotically valid (1α)(1-\alpha)-level confidence interval.

To this end, by Theorem 3 and the continuous mapping theorem, as nn\to\infty, TOT_{O} converges weakly to

σZ0+τp0W01Bb=1B(σZb+τpWbτp0W0)2\frac{\sigma Z_{0}+\frac{\tau}{\sqrt{p_{0}}}W_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\sigma Z_{b}+\frac{\tau}{\sqrt{p}}W_{b}-\frac{\tau}{\sqrt{p_{0}}}W_{0}\right)^{2}}} (61)

where Z0,Z1,,ZB,W0,W1,,WBi.i.d.N(0,1)Z_{0},Z_{1},\ldots,Z_{B},W_{0},W_{1},\ldots,W_{B}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1). A direct inspection of the homogeneity of the expression reveals that (61) depends on σ\sigma and τ\tau only through their ratio. Multiplying both the numerator and the denominator by the factor p0/τ\sqrt{p_{0}}/\tau, we rewrite (61) as

θZ0+W01Bb=1B(θZb+ρWbW0)2\frac{\theta Z_{0}+W_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\theta Z_{b}+\rho W_{b}-W_{0}\right)^{2}}} (62)

where θ=σp0/τ\theta=\sigma\sqrt{p_{0}}/\tau and ρ=p0/p\rho=\sqrt{p_{0}/p} are as defined before. We will see momentarily that (62) has the same distribution as (19). For now, recall that the distribution function of (19) is F(;θ,ρ)F(\cdot;\theta,\rho), which depends on the unknown θ\theta. We have

lim infnP(ψO)\displaystyle\liminf_{n\to\infty}P(\psi\in\mathcal{I}_{O}) =\displaystyle= lim infnP(|TO|qO,1α/2)=F(qO,1α/2;θ,ρ)F(qO,1α/2;θ,ρ)\displaystyle\liminf_{n\to\infty}P\left(|T_{O}|\leq q_{O,1-\alpha/2}\right)=F\left(q_{O,1-\alpha/2};\theta,\rho\right)-F\left(-q_{O,1-\alpha/2};\theta,\rho\right){}
=\displaystyle= 2F(qO,1α/2;θ,ρ)12minθ0F(qO,1α/2;θ,ρ)1=1α\displaystyle{}2F\left(q_{O,1-\alpha/2};\theta,\rho\right)-1\geq 2\min_{\theta\geq 0}F\left(q_{O,1-\alpha/2};\theta,\rho\right)-1=1-\alpha

where the third equality follows from the symmetry of (19) or (62).

Finally, we see that (62) and (19) have the same distribution, since by expanding the sum of squares in (62) we have

θZ0+W01Bb=1B(θZb+ρWbW0)2=dθZ0+W0(θ2+ρ2)(1Bb=1B(XbX¯)2+X¯2)2θ2+ρ2X¯W0+W02\frac{\theta Z_{0}+W_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\theta Z_{b}+\rho W_{b}-W_{0}\right)^{2}}}\\ \stackrel{{\scriptstyle d}}{{=}}\frac{\theta Z_{0}+W_{0}}{\sqrt{(\theta^{2}+\rho^{2})\left(\frac{1}{B}\sum_{b=1}^{B}(X_{b}-\bar{X})^{2}+\bar{X}^{2}\right)-2\sqrt{\theta^{2}+\rho^{2}}\bar{X}W_{0}+W_{0}^{2}}}

where we have written θZb+ρWb=θ2+ρ2Xb\theta Z_{b}+\rho W_{b}=\sqrt{\theta^{2}+\rho^{2}}X_{b} with X1,,XBi.i.d.N(0,1)X_{1},\ldots,X_{B}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1) which are independent of Z0,W0Z_{0},W_{0}. Noting that b=1B(XbX¯)2χB12\sum_{b=1}^{B}(X_{b}-\bar{X})^{2}\sim\chi^{2}_{B-1}, X¯N(0,1/B)\bar{X}\sim N(0,1/B), which are independent by the property of standard normals, we get (19). ∎

To understand how the additional intricacy from the computation noise affects the interval half-width, we consider the scenario when BB grows. The asymptotic distribution of TOT_{O} given by (19) becomes

θV1+V2θ2+ρ2+V22\frac{\theta V_{1}+V_{2}}{\sqrt{\theta^{2}+\rho^{2}+V_{2}^{2}}} (63)

with V1,V2i.i.d.N(0,1)V_{1},V_{2}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1), since Y/B1Y/B\to 1 and V32/B,V3V2/B0V_{3}^{2}/B,V_{3}V_{2}/\sqrt{B}\to 0 a.s. in (19). Correspondingly, SOS_{O} defined in (17) is distributed approximately as

1n((σ2+τ2p)Y+V32B2τp0σ2+τ2pV3V2B+τ2p0V22)\sqrt{\frac{1}{n}\left(\left(\sigma^{2}+\frac{\tau^{2}}{p}\right)\frac{Y+V_{3}^{2}}{B}-2\frac{\tau}{\sqrt{p_{0}}}\sqrt{\sigma^{2}+\frac{\tau^{2}}{p}}\frac{V_{3}V_{2}}{\sqrt{B}}+\frac{\tau^{2}}{p_{0}}V_{2}^{2}\right)} (64)

which, when BB is large, becomes

σ2+τ2p+τ2p0V22n\sqrt{\frac{\sigma^{2}+\frac{\tau^{2}}{p}+\frac{\tau^{2}}{p_{0}}V_{2}^{2}}{n}}

So the half-width of O\mathcal{I}_{O} behaves like

q~O,1α/2σ2+τ2p+τ2p0V22n=q~O,1α/2σ2n+τ2R+τ2R0V22\tilde{q}_{O,1-\alpha/2}\sqrt{\frac{\sigma^{2}+\frac{\tau^{2}}{p}+\frac{\tau^{2}}{p_{0}}V_{2}^{2}}{n}}=\tilde{q}_{O,1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R}+\frac{\tau^{2}}{R_{0}}V_{2}^{2}} (65)

where q~O,1α/2\tilde{q}_{O,1-\alpha/2} is the maximum (1α/2)(1-\alpha/2)-quantile of (63) over all possible θ0\theta\geq 0. Now, supposing we know the values of σ\sigma and τ\tau, the normality confidence interval obtained from extracting the first component in the limit in (15) is

[ψ^^n,R0z1α/2σ2n+τ2R0,ψ^^n,R0+z1α/2σ2n+τ2R0]\left[\hat{\hat{\psi}}_{n,R_{0}}-z_{1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R_{0}}},\ \hat{\hat{\psi}}_{n,R_{0}}+z_{1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R_{0}}}\right]

thus with a half-width

z1α/2σ2n+τ2R0z_{1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R_{0}}} (66)

The standard error in this half-width has the notable interpretation of being a combination of the data variability σ2/n\sigma^{2}/n and the computation variability τ2/R0\tau^{2}/R_{0}. Suppose in (65) we use R=R0R=R_{0}, so that a resample estimate exhibits the same variability as the original estimate. Comparing (65) and (66), we see that (65) has an additional contribution coming from V2V_{2}. If V2V_{2} were not present, then SOS_{O} would become σ2/n+τ2/R0\sqrt{\sigma^{2}/n+\tau^{2}/R_{0}} and q~O,1α/2\tilde{q}_{O,1-\alpha/2} would become the quantile of V1V_{1}, a standard normal variable, since in this case the maximum quantile over all θ0\theta\geq 0 is approached by choosing θ\theta\to\infty. In other words, without V2V_{2} we would recover the normality interval half-width as BB increases. Thus, the added variability from V2V_{2} can be viewed as a price we pay to handle the additional computation noise without knowledge of σ\sigma and τ\tau.
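As a complement to this discussion, the following small Monte Carlo sketch approximates q̃_{O,1−α/2} by maximizing the empirical (1−α/2)-quantile of the large-B limit (63) over a grid of θ values; the grid and the simulation size are arbitrary illustrative choices.

```python
import numpy as np

def q_tilde_O(rho, alpha=0.05, theta_grid=np.linspace(0.0, 50.0, 201),
              m=200_000, seed=0):
    """Monte Carlo estimate of the maximum (1 - alpha/2)-quantile of
    (theta*V1 + V2) / sqrt(theta^2 + rho^2 + V2^2), the large-B limit (63),
    over a grid of theta >= 0."""
    rng = np.random.default_rng(seed)
    v1, v2 = rng.standard_normal(m), rng.standard_normal(m)
    quantiles = [
        np.quantile((th * v1 + v2) / np.sqrt(th ** 2 + rho ** 2 + v2 ** 2),
                    1 - alpha / 2)
        for th in theta_grid
    ]
    return max(quantiles)

# As theta -> infinity, (63) tends to V1, whose 0.975-quantile is about 1.96;
# the maximized quantile reflects the price paid for the extra V2 term.
print(q_tilde_O(rho=1.0))
```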

C.3 Proofs and Additional Discussions for Section 4.2

We first prove Theorem 5:

Proof of Theorem 5.

The proof follows the roadmap of that of Theorem 4. Consider the pivotal statistic TM=(ψ^^n,R0ψ)/SMT_{M}=(\hat{\hat{\psi}}_{n,R_{0}}-\psi)/S_{M}. We first argue that qM,1α/2q_{M,1-\alpha/2} satisfies

lim infnP(ψM)=lim infnP(|TM|qM,1α/2)1α\liminf_{n\to\infty}P(\psi\in\mathcal{I}_{M})=\liminf_{n\to\infty}P\left(\left|T_{M}\right|\leq q_{M,1-\alpha/2}\right)\geq 1-\alpha

which would conclude that M\mathcal{I}_{M} is an asymptotically valid (1α)(1-\alpha)-level confidence interval.

By Theorem 3 and the continuous mapping theorem, as nn\to\infty, TMT_{M} converges weakly to

σZ0+τp0W01B1b=1B((σZb+τpWb)(σZ¯+τpW¯))2=dσZ0+τp0W0(σ2+τ2p)YB1=dσ2+τ2p0σ2+τ2ptB1\frac{\sigma Z_{0}+\frac{\tau}{\sqrt{p_{0}}}W_{0}}{\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\left(\sigma Z_{b}+\frac{\tau}{\sqrt{p}}W_{b}\right)-\left(\sigma\bar{Z}+\frac{\tau}{\sqrt{p}}\bar{W}\right)\right)^{2}}}\stackrel{{\scriptstyle d}}{{=}}\frac{\sigma Z_{0}+\frac{\tau}{\sqrt{p_{0}}}W_{0}}{\sqrt{\left(\sigma^{2}+\frac{\tau^{2}}{p}\right)\frac{Y}{B-1}}}\stackrel{{\scriptstyle d}}{{=}}\sqrt{\frac{\sigma^{2}+\frac{\tau^{2}}{p_{0}}}{\sigma^{2}+\frac{\tau^{2}}{p}}}t_{B-1} (67)

where Z0,Z1,,ZB,W0,W1,,WBi.i.d.N(0,1)Z_{0},Z_{1},\ldots,Z_{B},W_{0},W_{1},\ldots,W_{B}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(0,1), Z¯=(1/B)b=1BZb\bar{Z}=(1/B)\sum_{b=1}^{B}Z_{b}, W¯=(1/B)b=1BWb\bar{W}=(1/B)\sum_{b=1}^{B}W_{b}, and YχB12Y\sim\chi^{2}_{B-1} which is independent of Z0Z_{0} and W0W_{0}. The equalities in distribution use the standard properties of normals to obtain χB12\chi^{2}_{B-1} and tB1t_{B-1} distributions. By multiplying both the numerator and denominator of (67) by p0/τ\sqrt{p_{0}}/\tau, we get

θ2+1θ2+ρ2tB1\sqrt{\frac{\theta^{2}+1}{\theta^{2}+\rho^{2}}}t_{B-1} (68)

where θ=σp0/τ\theta=\sigma\sqrt{p_{0}}/\tau and ρ=p0/p\rho=\sqrt{p_{0}/p} as defined before. Denote the distribution of (68) as F~(;θ,ρ)\tilde{F}(\cdot;\theta,\rho). We have

minθ0F~(q;θ,ρ)=P(max{ρ1,1}tB1q)\min_{\theta\geq 0}\tilde{F}(q;\theta,\rho)=P\left(\max\{\rho^{-1},1\}t_{B-1}\leq q\right)

for any q0q\geq 0, by noting that the minimum is approached by setting θ\theta\to\infty when ρ1\rho\geq 1 and attained at θ=0\theta=0 when ρ<1\rho<1. Thus setting qM,1α/2=max{ρ1,1}tB1,1α/2q_{M,1-\alpha/2}=\max\{\rho^{-1},1\}t_{B-1,1-\alpha/2} gives

lim infnP(ψM)\displaystyle\liminf_{n\to\infty}P(\psi\in\mathcal{I}_{M}) =\displaystyle= lim infnP(|TM|qM,1α/2)=F~(qM,1α/2;θ,ρ)F~(qM,1α/2;θ,ρ)\displaystyle\liminf_{n\to\infty}P\left(|T_{M}|\leq q_{M,1-\alpha/2}\right)=\tilde{F}\left(q_{M,1-\alpha/2};\theta,\rho\right)-\tilde{F}\left(-q_{M,1-\alpha/2};\theta,\rho\right){}
=\displaystyle= 2F~(qM,1α/2;θ,ρ)12minθ0F~(qM,1α/2;θ,ρ)1=1α\displaystyle{}2\tilde{F}\left(q_{M,1-\alpha/2};\theta,\rho\right)-1\geq 2\min_{\theta\geq 0}\tilde{F}\left(q_{M,1-\alpha/2};\theta,\rho\right)-1=1-\alpha

where the third equality follows from the symmetry of (68).

Finally, note that when ρ=1\rho=1, (68) becomes tB1t_{B-1} regardless of the value of θ\theta. Thus in this case TMT_{M} is asymptotically tB1t_{B-1}, and asymptotic exactness of M\mathcal{I}_{M} holds. ∎

Note that, in contrast to (61) where W0W_{0} appears both in the numerator and denominator, in (67) W0W_{0} only appears in the former and thus the numerator and denominator there are independent. This simplifies the asymptotic distribution to be used in M\mathcal{I}_{M}.

We discuss the half-width efficiency of M\mathcal{I}_{M} and contrast it with O\mathcal{I}_{O}. First, unlike O\mathcal{I}_{O}, it is possible for M\mathcal{I}_{M} to be asymptotically exact as nn\to\infty, in particular when we use the same computation size in the resample estimates and the original point estimate, which is a natural configuration. Moreover, from (24) and (67), we see that as nn increases, the half-width of M\mathcal{I}_{M} behaves approximately as

max{ρ1,1}tB1,1α/2(σ2n+τ2R)χB12B1\max\{\rho^{-1},1\}t_{B-1,1-\alpha/2}\sqrt{\left(\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R}\right)\frac{\chi^{2}_{B-1}}{B-1}}

When BB increases, this becomes

max{ρ1,1}z1α/2σ2n+τ2R\max\{\rho^{-1},1\}z_{1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R}}

so that when ρ=1\rho=1 we get

z1α/2σ2n+τ2Rz_{1-\alpha/2}\sqrt{\frac{\sigma^{2}}{n}+\frac{\tau^{2}}{R}}

which is the half-width of the normality confidence interval (when R is set equal to R_{0}). This agreement indicates that \mathcal{I}_{M} is preferable to \mathcal{I}_{O} when B is large. Nonetheless, when B is small, or when the computation sizes of the resample and original estimates differ, \mathcal{I}_{O} could possibly outperform \mathcal{I}_{M}.
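To make this comparison concrete, the following minimal Python sketch (our own illustration, assuming scipy is available) tabulates the critical-value ratio \max\{\rho^{-1},1\}t_{B-1,1-\alpha/2}/z_{1-\alpha/2}, which drives how much wider \mathcal{I}_{M} is than the normality interval apart from the \chi^{2} fluctuation in its half-width.

```python
# A minimal sketch (our own illustration) of the critical-value ratio
# max{rho^{-1}, 1} * t_{B-1,1-alpha/2} / z_{1-alpha/2} discussed above.
from scipy.stats import norm, t

alpha = 0.05
z = norm.ppf(1 - alpha / 2)                  # z_{1-alpha/2}

for rho in (0.5, 1.0, 2.0):                  # rho = sqrt(p_0/p) as above; rho = 1 when R_0 = R
    for B in (2, 5, 10, 20, 50):             # q_M requires B >= 2
        q_M = max(1.0 / rho, 1.0) * t.ppf(1 - alpha / 2, df=B - 1)
        print(f"rho = {rho:3.1f}, B = {B:2d}: q_M / z = {q_M / z:.3f}")
```

For \rho=1 the ratio decreases toward 1 as B grows, consistent with the discussion above.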

Lastly, both O\mathcal{I}_{O} and M\mathcal{I}_{M} have natural one-sided analogs, where we replace qO,1α/2q_{O,1-\alpha/2} and qM,1α/2q_{M,1-\alpha/2} by qO,1αq_{O,1-\alpha} and qM,1αq_{M,1-\alpha} in one of the interval limits (and with the other side unbounded).

Appendix D Proofs for Section 5

Proof of Theorem 6.

In each of the three subsampling variants, we show that an analog of Assumption 1 holds and hence we can use the same roadmap as the proofs of Proposition 1 and Theorem 1 to conclude our result. Note that, under the assumptions in Proposition 2, we have immediately that n(ψ^nψ)N(0,σ2)\sqrt{n}(\hat{\psi}_{n}-\psi)\Rightarrow N(0,\sigma^{2}) for some σ2>0\sigma^{2}>0 by the functional delta method (Theorem 10 in Appendix F.2). Now, under the additional assumption that δ\mathcal{F}_{\delta} is measurable for every δ>0\delta>0, we have the following:

Cheap mm-out-of-nn Bootstrap: Denote ψs\psi_{s}^{*} as a subsample estimate. We invoke Theorem 15 (in Appendix F.4) to conclude that s(ψsψ^n)N(0,σ2)\sqrt{s}(\psi_{s}^{*}-\hat{\psi}_{n})\Rightarrow N(0,\sigma^{2}) given X1,X2,X_{1},X_{2},\ldots in probability as nn\to\infty, for any ss dependent on nn such that sns\leq n and ss\to\infty. Following the same argument in the proof of Proposition 1, except we replace n(ψnbψ^n)\sqrt{n}(\psi_{n}^{*b}-\hat{\psi}_{n}) by s(ψsbψ^n)\sqrt{s}(\psi_{s}^{*b}-\hat{\psi}_{n}), we obtain

(n(ψ^nψ),s(ψs1ψ^n),,s(ψsBψ^n))(σZ0,σZ1,,σZB)\left(\sqrt{n}(\hat{\psi}_{n}-\psi),\sqrt{s}(\psi_{s}^{*1}-\hat{\psi}_{n}),\ldots,\sqrt{s}(\psi_{s}^{*B}-\hat{\psi}_{n})\right)\Rightarrow(\sigma Z_{0},\sigma Z_{1},\ldots,\sigma Z_{B})

where Z0,Z1,,ZBZ_{0},Z_{1},\ldots,Z_{B} are i.i.d. N(0,1)N(0,1). Therefore,

ψ^nψsnSZ01Bb=1BZb2\frac{\hat{\psi}_{n}-\psi}{\sqrt{\frac{s}{n}}S}\Rightarrow\frac{Z_{0}}{\sqrt{\frac{1}{B}\sum_{b=1}^{B}Z_{b}^{2}}}

by the continuous mapping theorem. Then, following the proof of Theorem 1 and noting that the right-hand side above is a t_{B}-variable, we get

P(t1α/2,Bψ^nψsnSt1α/2,B)1αP\left(-t_{1-\alpha/2,B}\leq\frac{\hat{\psi}_{n}-\psi}{\sqrt{\frac{s}{n}}S}\leq t_{1-\alpha/2,B}\right)\to 1-\alpha

from which we conclude the result.
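As a concrete illustration of the resulting interval, the following minimal Python sketch (our own, with `estimator` a hypothetical placeholder for the possibly expensive model-fitting routine) implements the Cheap m-out-of-n Bootstrap interval \hat{\psi}_{n}\pm t_{1-\alpha/2,B}\sqrt{s/n}\,S implied by the display above, taking S^{2}=(1/B)\sum_{b=1}^{B}(\psi_{s}^{*b}-\hat{\psi}_{n})^{2}.

```python
# A minimal sketch (our own illustration) of the Cheap m-out-of-n Bootstrap interval:
#   psi_hat +/- t_{1-alpha/2,B} * sqrt(s/n) * S,  S^2 = (1/B) sum_b (psi_s^{*b} - psi_hat)^2.
# `estimator` is a hypothetical placeholder for the model-fitting routine.
import numpy as np
from scipy.stats import t

def cheap_m_out_of_n_ci(data, estimator, s, B=1, alpha=0.05, seed=None):
    rng = np.random.default_rng(seed)
    n = len(data)
    psi_hat = estimator(data)                       # full-data point estimate
    psi_sub = np.empty(B)
    for b in range(B):
        idx = rng.choice(n, size=s, replace=True)   # size-s resample (with replacement)
        psi_sub[b] = estimator(data[idx])
    S = np.sqrt(np.mean((psi_sub - psi_hat) ** 2))
    half_width = t.ppf(1 - alpha / 2, df=B) * np.sqrt(s / n) * S
    return psi_hat - half_width, psi_hat + half_width

# Example usage with the sample mean as the estimator and s = n^{0.6}:
# data = np.random.default_rng(0).standard_normal(10_000)
# print(cheap_m_out_of_n_ci(data, np.mean, s=int(10_000 ** 0.6), B=1))
```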

Cheap Bag of Little Bootstraps: Recall \psi_{s}^{*} is the fixed subsample estimate, and denote \psi_{n}^{**} as a second-layer resample estimate. Note that the subsample encoded by P_{s}^{*}, which is obtained by sampling without replacement from the data encoded by \hat{P}_{n}, is distributed i.i.d. from P. Thus we can invoke Theorem 15 to conclude \sqrt{n}(\psi_{n}^{**}-\psi_{s}^{*})\Rightarrow N(0,\sigma^{2}) in probability, given X_{1},X_{2},\ldots,X_{n} and the subsampling randomness, as n,s\to\infty for any s dependent on n such that s\leq n. The rest is identical to the proofs of Proposition 1 and Theorem 1, except that we replace \psi_{n}^{*b}-\hat{\psi}_{n} by \psi_{n}^{**b}-\psi_{s}^{*} and, instead of conditioning on \hat{P}_{n} in the series of inequalities in the proof of Proposition 1, we condition on both \hat{P}_{n} and the randomness in the subsampling that produces \psi_{s}^{*}.

Cheap Subsampled Double Bootstrap: Denote ψs\psi_{s}^{*} as a first-layer subsample estimate and ψn\psi_{n}^{**} as the derived second-layer resample estimate. We invoke Theorem 16 (in Appendix F.4) to conclude that n(ψnψs)N(0,σ2)\sqrt{n}(\psi_{n}^{**}-\psi_{s}^{*})\Rightarrow N(0,\sigma^{2}) given X1,X2,X_{1},X_{2},\ldots in probability as nn\to\infty, for any ss dependent on nn such that sns\leq n and ss\to\infty. The rest is identical to the proofs of Proposition 1 and Theorem 1, except we replace ψnbψ^n\psi_{n}^{*b}-\hat{\psi}_{n} by ψnbψsb\psi_{n}^{**b}-\psi_{s}^{*b}. ∎

Appendix E Additional Numerical Results

E.1 Logistic Regression

We present another example on a logistic regression model YBernoulli(p)Y\sim Bernoulli(p) where p=1/(1+exp((β1X1++βdXd)))p=1/(1+\exp(-(\beta_{1}X_{1}+\cdots+\beta_{d}X_{d}))). We set d=10d=10 and use data (X1,i,,Xd,i,Yi)(X_{1,i},\ldots,X_{d,i},Y_{i}) of size 10510^{5} to fit the model and estimate the coefficients βj\beta_{j}’s. The ground truth is set as Xjt3X_{j}\sim t_{3} and the coefficients (β1,,β10)=(1.9,1.7,1.3,1.8,1.1,1.2,1.9,2.2,1.5,2.0)(\beta_{1},\ldots,\beta_{10})=(1.9,1.7,1.3,1.8,1.1,1.2,1.9,2.2,1.5,2.0), a set of numbers arbitrarily chosen from the interval [1,3][1,3]. Our setup is similar to the linear regression example in Section 6.2, where we test all methods to compute 95%95\% confidence intervals on the first coefficient β1\beta_{1}, and use subsample size n0.6=1000n^{0.6}=1000 for mm-out-of-nn Bootstrap, Bag of Little Bootstraps and Subsampled Double Bootstrap. We again use 5050 resamples in total for each method to depict the trend.
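For illustration, a minimal Python sketch of this experiment (our own, not the code used to produce the reported results; it assumes numpy, scipy and scikit-learn are available) generates data from the ground-truth model and computes a Cheap Bootstrap confidence interval for \beta_{1} with a small number of resamples B.

```python
# Minimal sketch (our own illustration) of the logistic regression experiment:
# t_3 covariates, Bernoulli response, and a Cheap Bootstrap interval for beta_1.
import numpy as np
from scipy.stats import t
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, B, alpha = 10**5, 10, 3, 0.05
beta = np.array([1.9, 1.7, 1.3, 1.8, 1.1, 1.2, 1.9, 2.2, 1.5, 2.0])

X = rng.standard_t(df=3, size=(n, d))
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

def fit_beta1(X, Y):
    # Large C makes the fit effectively unregularized; no intercept in the model.
    model = LogisticRegression(C=1e6, fit_intercept=False, max_iter=1000)
    return model.fit(X, Y).coef_[0, 0]

beta1_hat = fit_beta1(X, Y)
resamples = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)        # resample the data with replacement
    resamples[b] = fit_beta1(X[idx], Y[idx])

S = np.sqrt(np.mean((resamples - beta1_hat) ** 2))
hw = t.ppf(1 - alpha / 2, df=B) * S         # Cheap Bootstrap half-width
print(f"95% CI for beta_1: [{beta1_hat - hw:.3f}, {beta1_hat + hw:.3f}]")
```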

Figures 6, 7 and 8 show the coverage probabilities, mean interval widths, and standard deviations of interval widths, respectively, for the Standard and Cheap bootstrap methods. The comparisons are largely similar to the linear regression example. The Cheap bootstrap methods all attain close to the target 95\% coverage at B=1, with 96\%, 96\%, 97\% and 97\% for the Cheap Bootstrap, m-out-of-n Bootstrap, Bag of Little Bootstraps and Subsampled Double Bootstrap respectively. In contrast, the Standard bootstrap methods all fail at B=1 and require much larger B to approach the target coverage. Both the means and standard deviations of interval widths for the Cheap bootstrap methods decrease sharply from B=1 (e.g., mean 0.38 and standard deviation 0.28 for the Cheap Bootstrap) to B=2 (mean 0.14 and standard deviation 0.07), continue to drop at B=3 (mean 0.11 and standard deviation 0.04), and then decrease at a progressively slower rate. Meanwhile, the mean interval widths of the Standard bootstrap methods start small and exhibit increasing trends, whereas their standard deviations appear roughly constant in B. Thus, as in the linear regression example, the Cheap bootstrap methods consistently attain accurate coverage regardless of B, and their interval widths drop quickly to levels comparable to those at large B, whereas the Standard methods under-cover when B is small and converge to the nominal level at much slower rates.

Figure 6: Confidence interval coverage probabilities of Standard versus Cheap Bootstrap methods in logistic regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the coverage probability estimates from 1000 experimental repetitions.
Figure 7: Mean confidence interval widths of Standard versus Cheap Bootstrap methods in logistic regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the mean width estimates from 1000 experimental repetitions.
Figure 8: Standard deviations of confidence interval widths of Standard versus Cheap Bootstrap methods in logistic regression. Nominal confidence level =95%=95\% and sample size n=105n=10^{5}. Shaded areas depict the associated confidence intervals of the standard deviation estimates from 1000 experimental repetitions.

E.2 Critical Values for Cheap Bootstrap in Nested Sampling Problems

Table 5 shows the approximated q_{O,1-\alpha/2} for B ranging from 1 to 20 and \alpha=0.05 (and also \alpha=0.1, used later in the example in Appendix E.3) when R_{0}=R, i.e., \rho=1, where we also show q_{M,1-\alpha/2}=t_{B-1,1-\alpha/2} for comparison. Note that when B=1, only q_{O,1-\alpha/2} is well-defined but not q_{M,1-\alpha/2}. We also see that while q_{O,1-\alpha/2} is smaller than q_{M,1-\alpha/2} when B is small, the two become very similar by the time B reaches 20.

Table 5: Values of qO,1α/2q_{O,1-\alpha/2} and qM,1α/2q_{M,1-\alpha/2} when R0=RR_{0}=R, at α=0.05\alpha=0.05 and α=0.1\alpha=0.1.
B   q_{O,1-\alpha/2} (\alpha=0.05)   q_{M,1-\alpha/2} (\alpha=0.05)   q_{O,1-\alpha/2} (\alpha=0.1)   q_{M,1-\alpha/2} (\alpha=0.1)
1 12.75 NA 6.32 NA
2 4.32 12.71 2.92 6.31
3 3.19 4.30 2.36 2.92
4 2.78 3.18 2.14 2.35
5 2.57 2.78 2.02 2.13
6 2.45 2.57 1.95 2.02
7 2.37 2.45 1.90 1.94
8 2.31 2.36 1.86 1.89
9 2.27 2.31 1.84 1.86
10 2.23 2.26 1.82 1.83
11 2.21 2.23 1.80 1.81
12 2.19 2.20 1.79 1.80
13 2.16 2.18 1.78 1.78
14 2.15 2.16 1.77 1.77
15 2.14 2.14 1.76 1.76
16 2.12 2.13 1.75 1.75
17 2.11 2.12 1.74 1.75
18 2.11 2.11 1.74 1.74
19 2.10 2.10 1.73 1.73
20 2.09 2.09 1.73 1.73
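For reference, the q_{M,1-\alpha/2} column of Table 5 can be reproduced directly from Student-t quantiles; the short Python sketch below (our own illustration) does this, whereas approximating q_{O,1-\alpha/2} would additionally require Monte Carlo simulation of its limiting distribution and is not reproduced here.

```python
# A minimal sketch (our own) reproducing q_{M,1-alpha/2} = t_{B-1,1-alpha/2} in Table 5.
from scipy.stats import t

for alpha in (0.05, 0.1):
    print(f"alpha = {alpha}")
    for B in range(2, 21):                  # q_M is defined only for B >= 2
        print(f"  B = {B:2d}: q_M = {t.ppf(1 - alpha / 2, df=B - 1):.2f}")
```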

E.3 Bagging Estimation for Bounding Optimality Gap

We apply the Cheap Bootstrap to a bagging method for constructing upper confidence bounds in data-driven stochastic optimization problems. More specifically, consider an expected-value optimization problem \min_{\theta\in\Theta}\{H(\theta):=E[h(\theta,X)]\}, where the distribution P governing X in the expectation E[\cdot] is unknown and is only informed by data. The cost function h(\cdot,\cdot) is known or evaluatable, and \Theta denotes the feasible region for the decision variable \theta. Such optimization formulations appear broadly across disciplines, including revenue management and portfolio selection (e.g., Shapiro et al. (2021); Birge and Louveaux (2011)).

Suppose we have a given solution, say θ^\hat{\theta} (which is presumably obtained from a data-driven procedure such as solving an empirical optimization or sample average approximation Shapiro et al. (2021)). To assess the quality of this given solution, we can construct an upper confidence bound for its optimality gap G=H(θ^)HG=H(\hat{\theta})-H^{*}, where H=minθΘH(θ)H^{*}=\min_{\theta\in\Theta}H(\theta) denotes the unknown true optimal value. To this end, a general upper bound for GG is given by E[H~(X1,X2,,Xk)]E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})] where X1,X2,,Xki.i.d.PX_{1},X_{2},\ldots,X_{k}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P and E[]E[\cdot] denotes the corresponding expectation. The quantity

H~(X1,X2,,Xk)=maxθΘ1ki=1k(h(θ^,Xi)h(θ,Xi))\tilde{H}(X_{1},X_{2},\ldots,X_{k})=\max_{\theta\in\Theta}\frac{1}{k}\sum_{i=1}^{k}(h(\hat{\theta},X_{i})-h(\theta,X_{i}))

is the optimal value of a sample average approximation with cost function h(\hat{\theta},X)-h(\theta,X), which equals the difference between an empirical estimate of E[h(\hat{\theta},X)] and the optimal value of a sample average approximation of \min_{\theta\in\Theta}H(\theta). That E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})] is an upper bound for G can be argued via the Jensen inequality, as spelled out below, and is a well-established optimistic bound in stochastic optimization (e.g., Mak et al. (1999), Glasserman (2004) §8).
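Concretely, holding \hat{\theta} fixed and using that a maximum of expectations is no larger than the expectation of the maximum,

E\left[\tilde{H}(X_{1},\ldots,X_{k})\right]=E\left[\max_{\theta\in\Theta}\frac{1}{k}\sum_{i=1}^{k}\left(h(\hat{\theta},X_{i})-h(\theta,X_{i})\right)\right]\geq\max_{\theta\in\Theta}E\left[\frac{1}{k}\sum_{i=1}^{k}\left(h(\hat{\theta},X_{i})-h(\theta,X_{i})\right)\right]=\max_{\theta\in\Theta}\left(H(\hat{\theta})-H(\theta)\right)=H(\hat{\theta})-H^{*}=G.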

To implement E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})] using data, Lam and Qian (2018a, b) suggest using bagging. Namely, given data X_{1},\ldots,X_{n}, where n denotes the sample size (which could differ from k), we repeatedly resample data sets \{X_{1}^{*b},\ldots,X_{k}^{*b}\} and solve the corresponding sample average approximations \tilde{H}(X_{1}^{*b},X_{2}^{*b},\ldots,X_{k}^{*b}), for b=1,\ldots,B. Then we output their average

1Bb=1BH~(X1b,X2b,,Xkb)\frac{1}{B}\sum_{b=1}^{B}\tilde{H}(X_{1}^{*b},X_{2}^{*b},\ldots,X_{k}^{*b})

to give a point estimate of E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})]. Cast in our framework in Section 4, we view \psi(P)=E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})], where X_{1},\ldots,X_{k}\stackrel{i.i.d.}{\sim}P, and view each noisy computation run as \hat{\psi}_{r}(Q)=\tilde{H}(X_{1}^{*},X_{2}^{*},\ldots,X_{k}^{*}), where X_{1}^{*},\ldots,X_{k}^{*}\stackrel{i.i.d.}{\sim}Q. Thus we can use the Cheap Bootstrap centered at the original estimate and centered at the resample mean in Section 4 to construct valid confidence bounds. Note that each computation run here involves solving a sample average approximation problem, which could be costly.

To test the performance of our Cheap Bootstrap, we consider more specifically the following expected-value optimization problem

minθE[Xθ]subject toAθbθi{0,1} for i=1,2,,10\begin{array}[]{ll}\min_{\theta}&E[X^{\top}\theta]\\ \text{subject to}&A\theta\leq b\\ &\theta_{i}\in\{0,1\}\text{\ for\ }i=1,2,\ldots,10\end{array}

This is a 1010-dimensional stochastic integer program, with binary decision variables θi,i=1,,10\theta_{i},i=1,\ldots,10. The true distribution of X10X\in\mathbb{R}^{10} is N(μ,Σ)N(\mu,\Sigma), where μ=(1,7/9,5/9,,7/9,1)\mu=(-1,-7/9,-5/9,\ldots,7/9,1)^{\top} and Σ\Sigma is an arbitrarily generated covariance matrix. We set b=(1,2)b=(-1,2)^{\top} and

A=[11111111110000001111].A=\left[\begin{matrix}-1&-1&-1&-1&-1&-1&-1&-1&-1&-1\\ 0&0&0&0&0&0&1&1&1&1\\ \end{matrix}\right].

This example is also used in Lam and Qian (2018b). We use data size n=100 and sample average approximation size k=50. Throughout our experiments, we set \hat{\theta}=(1,1,0,1,0,0,0,0,0,0)^{\top}, which is obtained by solving a sample average approximation from an independent data set of size 30. We construct our one-sided Cheap Bootstrap intervals using q_{O,1-\alpha} and q_{M,1-\alpha} at \alpha=0.05 (which corresponds to the case \alpha=0.1 in Table 5), for B ranging from 1 to 10. We repeat our experiment 1000 times to obtain summary statistics on the generated bounds, including the one-sided empirical coverage, i.e., the proportion of experiments in which our upper confidence bound is at least the ground-truth value, as well as the mean and standard deviation of the upper confidence bound.
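For concreteness, the following minimal Python sketch (our own illustration; the covariance matrix Sigma is a hypothetical placeholder, as the arbitrarily generated covariance used in the experiment is not reported) computes one computation run \tilde{H}(X_{1}^{*},\ldots,X_{k}^{*}) for this integer program by brute-force enumeration of the 2^{10} binary decisions; in practice, each such run would typically be handled by an integer programming solver.

```python
# Minimal sketch (our own) of one computation run H~(X_1^*, ..., X_k^*) for the
# 10-dimensional stochastic integer program, via brute-force enumeration.
import itertools
import numpy as np

d, n, k = 10, 100, 50
mu = np.linspace(-1.0, 1.0, d)                 # (-1, -7/9, ..., 7/9, 1)
Sigma = 0.1 * np.eye(d)                        # hypothetical placeholder covariance
A = np.vstack([-np.ones(d), np.r_[np.zeros(6), np.ones(4)]])
b = np.array([-1.0, 2.0])
theta_hat = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0], dtype=float)

# Enumerate all feasible binary decisions (A theta <= b).
feasible = []
for v in itertools.product([0, 1], repeat=d):
    theta = np.array(v, dtype=float)
    if np.all(A @ theta <= b):
        feasible.append(theta)

def H_tilde(X):
    """Optimality-gap SAA: max over feasible theta of
    (1/k) sum_i (h(theta_hat, X_i) - h(theta, X_i)), with h(theta, x) = x^T theta."""
    mean_x = X.mean(axis=0)
    return max(mean_x @ (theta_hat - theta) for theta in feasible)

rng = np.random.default_rng(0)
data = rng.multivariate_normal(mu, Sigma, size=n)   # observed data of size n
idx = rng.integers(0, n, size=k)                    # one size-k bootstrap resample
print("One computation run:", H_tilde(data[idx]))
```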

Table 6 shows the results. The empirical coverages range from 96\% to 98\% for all considered B, for the Cheap Bootstrap centered at the original estimate and centered at the resample mean (note that the latter is defined only starting from B=2). The coverages are higher than the nominal value 95\%, likely because the true value targeted by the confidence bound, namely E[\tilde{H}(X_{1},X_{2},\ldots,X_{k})], is itself an upper bound on the true optimality gap G. The trends of the generated confidence bounds appear consistent with our previous examples. For the bound centered at the original estimate, the mean falls quickly from 2.83 at B=1 to 1.91 at B=2, further to 1.72 at B=3, and much more steadily to 1.58 at B=10, while the standard deviation falls from 1.88 at B=1 to 0.91 at B=2, to 0.73 at B=3, and steadily to 0.62 at B=10. Similarly, for the bound centered at the resample mean, the mean falls quickly from 2.79 at B=2 to 1.84 at B=3 and then steadily to 1.56 at B=10, while the standard deviation falls from 1.71 at B=2 to 0.81 at B=3 and steadily to 0.62 at B=10. Centering at the original estimate performs better than centering at the resample mean for small B, but their performances become similar as B approaches 10.

Table 6: Performances of upper confidence bounds using Cheap Bootstrap centered at original estimate and centered at resample mean, at nominal confidence level 95%95\% for bagging estimation of optimality gap.
B   Centered at original estimate   Centered at resample mean
    Empirical coverage (margin of error)   Bound mean (st. dev.)   Empirical coverage (margin of error)   Bound mean (st. dev.)
1 0.96   (0.01) 2.83  (1.88) NA NA
2 0.97   (0.01) 1.91  (0.91) 0.98   (0.01) 2.79  (1.71)
3 0.97   (0.01) 1.72  (0.73) 0.98   (0.01) 1.84  (0.81)
4 0.96   (0.01) 1.67  (0.69) 0.97   (0.01) 1.72  (0.72)
5 0.97   (0.01) 1.64  (0.67) 0.97   (0.01) 1.66  (0.68)
6 0.97   (0.01) 1.61  (0.65) 0.96   (0.01) 1.62  (0.66)
7 0.97   (0.01) 1.60  (0.65) 0.97   (0.01) 1.60  (0.65)
8 0.97   (0.01) 1.58  (0.63) 0.97   (0.01) 1.58  (0.63)
9 0.97   (0.01) 1.58  (0.63) 0.97   (0.01) 1.57  (0.63)
10 0.97   (0.01) 1.58  (0.62) 0.97   (0.01) 1.56  (0.62)

Appendix F Useful Technical Backgrounds

For self-containedness, we provide some existing results needed for our technical developments.

F.1 Hadamard Differentiability

We give some details on the Hadamard differentiability used in Proposition 2 and Theorem 6. Consider a functional \psi(\cdot):\mathcal{P}\to\mathbb{R}^{d}, where \mathcal{P} denotes the space of probability distributions on the domain \mathcal{X}. We call \psi(\cdot) Hadamard differentiable at P with derivative \psi_{P}^{\prime}(H), tangential to some subset \mathcal{Q} of \mathcal{P}, if there exists a continuous, linear map \psi_{P}^{\prime}:\mathcal{Q}\to\mathbb{R}^{d} such that

ψ(P+tHt)ψ(P)tψP(H)0\left\|\frac{\psi(P+tH_{t})-\psi(P)}{t}-\psi_{P}^{\prime}(H)\right\|\to 0

as t\searrow 0, for every sequence H_{t} converging to H\in\mathcal{Q} and such that P+tH_{t}\in\mathcal{P} for all small t>0 (Van der Vaart (2000) §20.2).
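As a simple illustration (a standard example, stated under the assumption that the domain \mathcal{X} is a bounded subset of \mathbb{R} so that the map H\mapsto\int x\,dH(x) is continuous in the norm used), the mean functional is linear and hence Hadamard differentiable:

\psi(P)=\int x\,dP(x),\qquad\psi_{P}^{\prime}(H)=\int x\,dH(x),\qquad\frac{\psi(P+tH_{t})-\psi(P)}{t}=\int x\,dH_{t}(x)\to\int x\,dH(x)\ \text{ as }t\searrow 0.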

F.2 Bootstrap Empirical Processes

We say that a sequence of random elements GnG_{n} in a normed space 𝔻\mathbb{D}, with norm denoted \|\cdot\|, converges in distribution to a tight limit GG in 𝔻\mathbb{D} if

suphBL1(𝔻)|Eh(Gn)Eh(G)|0\sup_{h\in BL_{1}(\mathbb{D})}|E^{*}h(G_{n})-Eh(G)|\to 0 (69)

where BL1(𝔻)BL_{1}(\mathbb{D}) is the set of all functions h:𝔻[1,1]h:\mathbb{D}\to[-1,1] that are uniformly Lipschitz, i.e., |h(z1)h(z2)|z1z2|h(z_{1})-h(z_{2})|\leq\|z_{1}-z_{2}\| for every pair z1,z2𝔻z_{1},z_{2}\in\mathbb{D}, and E[]E^{*}[\cdot] denotes the outer expectation.

For a class of functions \mathcal{F} from 𝒳\mathcal{X} to \mathbb{R}, define

():={z:z:=supf|z(f)|<}\ell^{\infty}(\mathcal{F}):=\left\{z:\|z\|_{\mathcal{F}}:=\sup_{f\in\mathcal{F}}|z(f)|<\infty\right\}

where zz is a map from \mathcal{F} to \mathbb{R}. Consider the empirical process 𝔾n=n(P^nP)\mathbb{G}_{n}=\sqrt{n}(\hat{P}_{n}-P) as a random element that takes value in ()\ell^{\infty}(\mathcal{F}), and consider the bootstrap empirical process 𝔾n=n(PnP^n)\mathbb{G}_{n}^{*}=\sqrt{n}(P_{n}^{*}-\hat{P}_{n}). Supposing \mathcal{F} is Donsker with a finite envelope function, it is well established that 𝔾n𝔾P\mathbb{G}_{n}\Rightarrow\mathbb{G}_{P} in ()\ell^{\infty}(\mathcal{F}) where 𝔾P\mathbb{G}_{P} is a tight Gaussian process (with mean 0 and covariance Cov(𝔾P(f1),𝔾P(f2))=CovP(f1(X),f2(X))Cov(\mathbb{G}_{P}(f_{1}),\mathbb{G}_{P}(f_{2}))=Cov_{P}(f_{1}(X),f_{2}(X)) where CovPCov_{P} denotes the covariance taken with respect to PP). Denote EM[]E_{M}[\cdot] as the expectation conditional on X1,,XnX_{1},\ldots,X_{n}. We recall the following result for the bootstrap empirical process.

Theorem 9 (Adapted from Van der Vaart (2000) Theorem 23.7).

For every Donsker class \mathcal{F} of measurable functions with a finite envelope function,

suphBL1(())|EMh(𝔾n)Eh(𝔾P)|p0\sup_{h\in BL_{1}(\ell^{\infty}(\mathcal{F}))}|E_{M}h(\mathbb{G}_{n}^{*})-Eh(\mathbb{G}_{P})|\stackrel{{\scriptstyle p}}{{\to}}0

Furthermore, the sequence 𝔾n\mathbb{G}_{n}^{*} is asymptotically measurable.

Next we recall two theorems:

Theorem 10 (Adapted from Van der Vaart (2000) Theorem 20.8).

Let 𝔻\mathbb{D} be a normed space and ϕ:𝔻ϕ𝔻d\phi:\mathbb{D}_{\phi}\subset\mathbb{D}\to\mathbb{R}^{d} be Hadamard differentiable at θ\theta tangential to some subspace 𝔻0\mathbb{D}_{0}. Let θ^n\hat{\theta}_{n} be random maps with values in 𝔻ϕ\mathbb{D}_{\phi} such that n(θ^nθ)𝕋\sqrt{n}(\hat{\theta}_{n}-\theta)\Rightarrow\mathbb{T} where 𝕋\mathbb{T} takes values in 𝔻0\mathbb{D}_{0}. Then n(ϕ(θ^n)ϕ(θ))ϕθ(𝕋)\sqrt{n}(\phi(\hat{\theta}_{n})-\phi(\theta))\Rightarrow\phi_{\theta}^{\prime}(\mathbb{T}).

Theorem 11 (Adapted from Van der Vaart (2000) Theorem 23.9).

Let 𝔻\mathbb{D} be a normed space and ϕ:𝔻ϕ𝔻d\phi:\mathbb{D}_{\phi}\subset\mathbb{D}\to\mathbb{R}^{d} be Hadamard differentiable at θ\theta tangential to some subspace 𝔻0\mathbb{D}_{0}. Let θ^n\hat{\theta}_{n} and θn\theta_{n}^{*} be random maps with values in 𝔻ϕ\mathbb{D}_{\phi} such that n(θ^nθ)𝕋\sqrt{n}(\hat{\theta}_{n}-\theta)\Rightarrow\mathbb{T} and suphBL1(𝔻)|EMh(n(θnθ^n))Eh(𝕋)|p0\sup_{h\in BL_{1}(\mathbb{D})}|E_{M}h(\sqrt{n}(\theta_{n}^{*}-\hat{\theta}_{n}))-Eh(\mathbb{T})|\stackrel{{\scriptstyle p}}{{\to}}0, in which n(θnθ^n)\sqrt{n}(\theta_{n}^{*}-\hat{\theta}_{n}) is asymptotically measurable and 𝕋\mathbb{T} is tight and takes values in 𝔻0\mathbb{D}_{0}. Then n(ϕ(θn)ϕ(θ^n))ϕθ(𝕋)\sqrt{n}(\phi(\theta_{n}^{*})-\phi(\hat{\theta}_{n}))\Rightarrow\phi_{\theta}^{\prime}(\mathbb{T}) conditionally given X1,X2,X_{1},X_{2},\ldots in probability.

In general, we say n(θnθ^n)\sqrt{n}(\theta_{n}^{*}-\hat{\theta}_{n}) weakly converges to 𝕋\mathbb{T} conditionally given X1,X2,X_{1},X_{2},\ldots in probability if the condition suphBL1(𝔻)|EMh(n(θnθ^n))Eh(𝕋)|p0\sup_{h\in BL_{1}(\mathbb{D})}|E_{M}h(\sqrt{n}(\theta_{n}^{*}-\hat{\theta}_{n}))-Eh(\mathbb{T})|\stackrel{{\scriptstyle p}}{{\to}}0 in Theorem 11 holds (Van der Vaart (2000) equation (23.8)). By Kosorok (2007) Lemma 10.11, in the case 𝔻=d\mathbb{D}=\mathbb{R}^{d}, this condition implies P(n(θnθ^n)x|P^n)pF(x)P(\sqrt{n}(\theta_{n}^{*}-\hat{\theta}_{n})\leq x|\hat{P}_{n})\stackrel{{\scriptstyle p}}{{\to}}F(x) for all xdx\in\mathbb{R}^{d} if the distribution function F()F(\cdot) of 𝕋\mathbb{T} is continuous.

The following theorem, which is an immediate consequence of the above results, is used to justify Propositions 2 and 4.

Theorem 12 (Delta method for empirical bootstrap).

Consider P^n\hat{P}_{n} and PnP_{n}^{*} as random elements that take values in ()\ell^{\infty}(\mathcal{F}), where \mathcal{F} is a Donsker class with a finite envelope. Suppose ϕ:()d\phi:\ell^{\infty}(\mathcal{F})\to\mathbb{R}^{d} is Hadamard differentiable at PP (tangential to ()\ell^{\infty}(\mathcal{F})). Then n(ϕ(P^n)ϕ(P))ϕP(𝔾P)\sqrt{n}(\phi(\hat{P}_{n})-\phi(P))\Rightarrow\phi_{P}^{\prime}(\mathbb{G}_{P}), and also n(ϕ(Pn)ϕ(P^n))ϕP(𝔾P)\sqrt{n}(\phi(P_{n}^{*})-\phi(\hat{P}_{n}))\Rightarrow\phi_{P}^{\prime}(\mathbb{G}_{P}) given X1,X2,X_{1},X_{2},\ldots in probability.

Proof of Theorem 12.

Setting θ^n=P^n\hat{\theta}_{n}=\hat{P}_{n} and 𝔻=()\mathbb{D}=\ell^{\infty}(\mathcal{F}), Theorem 10 implies n(ϕ(P^n)ϕ(P))ϕP(𝔾P)\sqrt{n}(\phi(\hat{P}_{n})-\phi(P))\Rightarrow\phi_{P}^{\prime}(\mathbb{G}_{P}). Moreover, Theorem 9 gives the conditions needed in Theorem 11 to conclude that n(ϕ(Pn)ϕ(P^n))ϕP(𝔾P)\sqrt{n}(\phi(P_{n}^{*})-\phi(\hat{P}_{n}))\Rightarrow\phi_{P}^{\prime}(\mathbb{G}_{P}) given X1,X2,X_{1},X_{2},\ldots in probability. ∎

F.3 Edgeworth Expansions

We have the following higher-order expansion of coverage probability for a general function-of-mean model:

Theorem 13 (Hall (2013) Theorem 2.2).

Assume that for an integer ν1\nu\geq 1, the function A~\tilde{A} has ν+2\nu+2 continuous derivatives in a neighborhood of 𝛍\bm{\mu}, that A~(𝛍)=0\tilde{A}(\bm{\mu})=0, that E𝐗ν+2<E\|\mathbf{X}\|^{\nu+2}<\infty, that the characteristic function χ\chi of 𝐗\mathbf{X} satisfies Cramer’s condition lim sup𝐭|χ(𝐭)|<1\limsup_{\|\mathbf{t}\|\to\infty}|\chi(\mathbf{t})|<1, and that the asymptotic variance of nA~(𝐗¯)\sqrt{n}\tilde{A}(\overline{\mathbf{X}}) equals 1. Then

P(nA~(𝐗¯)x)=Φ(x)+j=1νnj/2πj(x)ϕ(x)+o(nν/2)P(\sqrt{n}\tilde{A}(\overline{\mathbf{X}})\leq x)=\Phi(x)+\sum_{j=1}^{\nu}n^{-j/2}\pi_{j}(x)\phi(x)+o(n^{-\nu/2})

uniformly in xx, where πj\pi_{j} is a polynomial of degree 3j13j-1, odd for even jj and even for odd jj, with coefficients depending on moments of 𝐗\mathbf{X} up to order j+2j+2 polynomially and also AA.

In Theorem 13, A~\tilde{A} is generally defined and the condition A~(𝝁)=0\tilde{A}(\bm{\mu})=0 is satisfied by AA and AsA_{s} defined in (9) and (10). When A~=A\tilde{A}=A, we denote its πj\pi_{j} as pjp_{j}. When A~=As\tilde{A}=A_{s}, we denote its πj\pi_{j} as qjq_{j}.

Now denote π^j()\hat{\pi}_{j}(\cdot) as πj()\pi_{j}(\cdot) but with all moments in its coefficients replaced by the sample moments, i.e., denoting 𝐗=(X(1),,X(d))\mathbf{X}=(X(1),\ldots,X(d)), the moment

μm1,,md=E[X(1)m1X(d)md]\mu_{m_{1},\ldots,m_{d}}=E[{X(1)}^{m_{1}}\cdots{X(d)}^{m_{d}}]

is replaced by

μ^m1,,md=1ni=1nXi(1)m1Xi(d)md\hat{\mu}_{m_{1},\ldots,m_{d}}=\frac{1}{n}\sum_{i=1}^{n}{X_{i}(1)}^{m_{1}}\cdots{X_{i}(d)}^{m_{d}}

for sample 𝐗i=(Xi(1),,Xi(d)),i=1,,n\mathbf{X}_{i}=(X_{i}(1),\ldots,X_{i}(d)),i=1,\ldots,n. Specifically, in the case of AA, we denote p^j\hat{p}_{j} as the pjp_{j} with moments replaced by sample moments. Moreover, we also define

A^(𝐱)=g(𝐱)g(𝐗¯)h(𝐗¯)\hat{A}(\mathbf{x})=\frac{g(\mathbf{x})-g(\overline{\mathbf{X}})}{h(\overline{\mathbf{X}})}

Then, A^(𝐗¯)\hat{A}(\overline{\mathbf{X}}^{*}), where 𝐗¯=(1/n)i=1n𝐗i\overline{\mathbf{X}}^{*}=(1/n)\sum_{i=1}^{n}\mathbf{X}_{i}^{*} for a resample 𝐗i,i=1,,n\mathbf{X}_{i}^{*},i=1,\ldots,n, is the resample counterpart of A(𝐗¯)A(\overline{\mathbf{X}}). We have the following expansion and bounds for the resample counterpart:

Theorem 14 (Adapted from Hall (2013) Theorem 5.1).

Let λ>0\lambda>0 be given, and let l=l(λ)l=l(\lambda) be a sufficiently large positive number. Assume that gg and hh each have ν+3\nu+3 bounded derivatives in a neighborhood of 𝛍\bm{\mu}, that E𝐗l<E\|\mathbf{X}\|^{l}<\infty, and that the characteristic function χ\chi of 𝐗\mathbf{X} satisfies Cramer’s condition lim sup𝐭|χ(𝐭)|<1\limsup_{\|\mathbf{t}\|\to\infty}|\chi(\mathbf{t})|<1. Then there exists a constant C>0C>0 such that

P(sup<x<|P(nA^(𝐗¯)x|𝒳n)Φ(x)j=1νnj/2p^j(x)ϕ(x)|>Cn(ν+1)/2)=O(nλ)P\left(\sup_{-\infty<x<\infty}\left|P(\sqrt{n}\hat{A}(\overline{\mathbf{X}}^{*})\leq x|\mathcal{X}_{n})-\Phi(x)-\sum_{j=1}^{\nu}n^{-j/2}\hat{p}_{j}(x)\phi(x)\right|>Cn^{-(\nu+1)/2}\right)=O(n^{-\lambda})
P(max1jνsup<x<(1+|x|)(3j1)|p^j(x)|>C)=O(nλ)P\left(\max_{1\leq j\leq\nu}\sup_{-\infty<x<\infty}(1+|x|)^{-(3j-1)}|\hat{p}_{j}(x)|>C\right)=O(n^{-\lambda})

and

P(max1jνsup<x<(1+|x|)(3j1)|p^j(x)|>C)=O(nλ)P\left(\max_{1\leq j\leq\nu}\sup_{-\infty<x<\infty}(1+|x|)^{-(3j-1)}|\hat{p}_{j}^{\prime}(x)|>C\right)=O(n^{-\lambda}) (70)

where 𝒳n={𝐗1,,𝐗n}\mathcal{X}_{n}=\{\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\} denotes the data.

Theorem 14 slightly strengthens Hall (2013) in that we include (70) as an additional conclusion. The proof in Hall (2013) works for any polynomial of degree at most 3j-1, and hence also applies to p_{j}^{\prime}; the additional implication is thus immediate (and useful for our proof of Theorem 2).

F.4 Subsampling

Consider 𝔾n,k=k(PkP^n)\mathbb{G}_{n,k}^{*}=\sqrt{k}(P_{k}^{*}-\hat{P}_{n}) where PkP_{k}^{*} is the bootstrap empirical distribution constructed using kk resampled values from X1,,XnX_{1},\ldots,X_{n} by sampling with replacement. Let δ={fg:f,g,ρP(fg)<δ}\mathcal{F}_{\delta}=\{f-g:f,g\in\mathcal{F},\ \rho_{P}(f-g)<\delta\}, where ρP(fg):=(VarP(f(X)g(X)))1/2\rho_{P}(f-g):=(Var_{P}(f(X)-g(X)))^{1/2} is the canonical metric. The following result is useful for analyzing Cheap mm-out-of-nn Bootstrap and Cheap Bag of Little Bootstraps:

Theorem 15 (Adapted from Van der Vaart and Wellner (2013) Theorem 3.6.3).

Let \mathcal{F} be a Donsker class of measurable functions such that δ\mathcal{F}_{\delta} is measurable for every δ>0\delta>0. Then

suphBL1(())|EMh(𝔾n,kn)Eh(𝔾P)|p0\sup_{h\in BL_{1}(\ell^{\infty}(\mathcal{F}))}|E_{M}h(\mathbb{G}_{n,k_{n}}^{*})-Eh(\mathbb{G}_{P})|\stackrel{{\scriptstyle p}}{{\to}}0

as nn\to\infty, for any sequence knk_{n}\to\infty, where EM[]E_{M}[\cdot] denotes the expectation conditional on the data X1,X2,,XnX_{1},X_{2},\ldots,X_{n}.

The following is useful for analyzing Cheap Subsampled Double Bootstrap:

Theorem 16 (Adapted from Sengupta et al. (2016) Theorem 1).

Let \mathcal{F} be a Donsker class of measurable functions such that δ\mathcal{F}_{\delta} is measurable for every δ>0\delta>0. Then the Subsampled Double Bootstrap process defined by 𝔾SDB,n,s=n(PnPs)\mathbb{G}_{SDB,n,s}^{*}=\sqrt{n}(P_{n}^{**}-P_{s}^{*}) satisfies

suphBL1(())|EMh(𝔾SDB,n,s)Eh(𝔾P)|p0\sup_{h\in BL_{1}(\ell^{\infty}(\mathcal{F}))}|E_{M}h(\mathbb{G}_{SDB,n,s}^{*})-Eh(\mathbb{G}_{P})|\stackrel{{\scriptstyle p}}{{\to}}0

as min(n,s)\min(n,s)\to\infty, where EM[]E_{M}[\cdot] is with respect to the randomness of both the first and second-layer resampling in Subsampled Double Bootstrap, conditional on the data X1,X2,,XnX_{1},X_{2},\ldots,X_{n}.