Gaussian and Bootstrap Approximations for Suprema of Empirical Processes
Abstract
In this paper we develop non-asymptotic Gaussian approximation results for the sampling distribution of suprema of empirical processes when the indexing function class varies with the sample size and may not be Donsker. Prior approximations of this type required upper bounds on the metric entropy of and uniform lower bounds on the variance of , both of which limited their applicability to high-dimensional inference problems.
In contrast, the results in this paper hold under simpler conditions on boundedness, continuity, and the strong variance of the approximating Gaussian process. The results are broadly applicable and yield a novel procedure for bootstrapping the distribution of empirical process suprema based on the truncated Karhunen-Loève decomposition of the approximating Gaussian process. We demonstrate the flexibility of this new bootstrap procedure by applying it to three fundamental problems in high-dimensional statistics: simultaneous inference on parameter vectors, inference on the spectral norm of covariance matrices, and construction of simultaneous confidence bands for functions in reproducing kernel Hilbert spaces.
Keywords: Gaussian Approximation; Gaussian Comparison; Gaussian Process Bootstrap; High-Dimensional Inference.
1 Introduction
1.1 Approximating the sampling distribution of suprema of empirical processes
This paper is concerned with non-asymptotic bounds on the Kolmogorov distance between the sampling distribution of suprema of empirical processes and the distribution of suprema of Gaussian proxy processes. Consider a simple random sample of independent and identically distributed random variables with common law taking values in the measurable space . Let be a class of measurable functions and define the empirical process
Also, denote by a centered Gaussian process with some positive semi-definite covariance function. Within this minimal setup (and a few additional technical assumptions), the paper addresses the problem of deriving non-asymptotic bounds on
(1)
and, consequently, conditions under which an empirical process indexed by is Gaussian approximable, i.e. as . Evidently, if is a Donsker class then the empirical process is trivially Gaussian approximable by the centered Gaussian process with covariance function . Thus, in this paper we are concerned with conditions that are strictly weaker than those that guarantee a central limit theorem.
Gaussian approximation results of this type have gained significant interest recently, as they prove to be effective in tackling high- and infinite-dimensional inference problems. Applications include simultaneous inference on high-dimensional parameters (Dezeure et al., 2017; Zhang and Cheng, 2017), inference on nonparametric models (Chernozhukov et al., 2014a; Chen et al., 2015, 2016), testing for spurious correlations (Fan et al., 2018), testing for shape restrictions (Chetverikov, 2019), inference on large covariance matrices (Chen, 2018; Han et al., 2018; Lopes et al., 2019), goodness-of-fit tests for high-dimensional linear models (Janková et al., 2020), simultaneous confidence bands in functional data analysis (Lopes et al., 2020; Singh and Vijaykumar, 2023), and error quantification in randomized algorithms (Lopes et al., 2023).
The theoretical groundwork on Gaussian approximation was laid in three seminal papers by Chernozhukov et al. (2013, 2014b, 2015). Since then, subsequent theoretical works have developed numerous refinements and extensions (e.g. Chernozhukov et al., 2016, 2019, 2020; Deng and Zhang, 2020; Fang and Koike, 2021; Kuchibhotla et al., 2021; Cattaneo et al., 2022; Lopes, 2022a, b; Bong et al., 2023). In this paper, we resolve two limitations of Chernozhukov et al.’s (2014b) original results that have not been previously addressed:
• Entropy conditions. The original bounds on the Kolmogorov distance depend on the metric entropy of the function class . These upper bounds are non-trivial only if the metric entropy grows at a much slower rate than the sample size . As a result, most applications only address sparse inference problems involving function classes with either a finite number of functions, discretized functions, or a small VC-index. The new upper bounds in this paper no longer depend on the metric entropy of the function class and therefore open the possibility to tackle non-sparse, high-dimensional inference problems. We call these bounds dimension-/entropy-free.
• Lower bounds on weak variances. The original bounds on require a strictly positive lower bound on the weak variance of the Gaussian proxy process, i.e. . This condition limits the scope of the original results to problems with standardized function classes (studentized statistics) and excludes situations with variance decay, typically observed in non-sparse, high-dimensional problems. The new Gaussian approximation results in this paper only depend on the strong variance and are therefore applicable to a broad range of problems including those with degenerate distributions. We say that these bounds are weak variance-free.
Even though these limitations (and our solution) are quite technical, the resulting new Gaussian and bootstrap approximations have immediate practical consequences. We present three substantial applications to problems in high-dimensional inference in Section 4 and two toy examples that cast light on why and when the original approximation results by Chernozhukov et al. (2014b) fail and the new results succeed in Appendix A.
Notably, in the special case of inference on spectral norms of covariance matrices, entropy-free bounds on already exist. Namely, for with , Lopes et al. (2019, 2020) and Lopes (2022a, b) have devised an approach to bounding that combines a specific variance decay assumption with a truncation argument. For a carefully chosen truncation level, the resulting bound on depends only on the sample size and the parameter characterizing the variance decay. Since their approach is intimately related to the bilinearity of the functions in and the specific variance decay assumption, it does not easily extend to arbitrary function classes. We therefore develop a different strategy in this paper.
Furthermore, for finite function classes , Chernozhukov et al. (2020) and Deng and Zhang (2020) have been able to slightly relax the requirements on the weak variance of the Gaussian proxy process. However, their results do not generalize to arbitrary function classes with . Our strategy for replacing the weak variance with the strong variance in the upper bounds on is therefore conceptually completely different from theirs. For details, we refer to our companion paper on Gaussian anti-concentration inequalities Giessing (2023).
1.2 Contributions and overview of the results
This paper consists of three parts which contribute to probability theory (Section 2), mathematical statistics and bootstrap methodology (Section 3), and high-dimensional inference (Section 4). The Appendices A–F contain additional supporting results and all proofs.
Section 2 contains the main mathematical innovations of this paper. We establish dimension- and weak variance-free Gaussian and bootstrap approximations for maxima of sums of independent and identically distributed high-dimensional random vectors (Section 2.1). Specifically, we derive a Gaussian approximation inequality (Proposition 1), a Gaussian comparison inequality (Proposition 3), and two bootstrap approximation inequalities (Propositions 4 and 5). At the core of these four theoretical results is a new proof of a multivariate Berry-Esseen-type bound which leverages two new technical auxiliary results: an anti-concentration inequality for suprema of separable Gaussian processes (Lemma 6 in Appendix B.2) and a smoothing inequality for partial derivatives of the Ornstein-Uhlenbeck semigroup associated to a multivariate normal measure (Lemma 2 in Appendix B.1). We conclude this section with a comparison to results from the literature (Section 2.2).
Section 3 contains our contributions to mathematical statistics and bootstrap methodology. The results in this section generalize the four basic results of Section 2 from finite dimensional vectors to empirical processes indexed by totally bounded (and hence separable) function classes. As one would expect, dimension- (a.k.a. entropy-) and weak variance-freeness of the finite dimensional results carry over to empirical processes. We establish Gaussian approximation inequalities (Section 3.2), Gaussian comparison inequalities (Section 3.3), and an abstract bootstrap approximation inequality (Section 3.4). The latter result motivates a new procedure for bootstrapping the sampling distribution of suprema of empirical processes. We call this procedure the Gaussian process bootstrap (Algorithm 1) and discuss practical aspects of its implementation via the truncated Karhunen-Loève decomposition of a Gaussian proxy process (Section 3.5). We include a selective comparison with results from the literature (Section 3.6).
In Section 4 we showcase the flexibility of the Gaussian process bootstrap by applying it to three fundamental problems in high-dimensional statistics: simultaneous inference on parameter vectors (Section 4.1), inference on the spectral norm of covariance matrices (Section 4.2), and construction of simultaneous confidence bands for functions in reproducing kernel Hilbert spaces (Section 4.3). For each of these three examples we include a brief comparison with alternative bootstrap methods from the literature to explain in what sense our method improves over (or matches with) existing results.
2 Dimension- and weak variance-free results for the maximum norm of sums of i.i.d. random vectors
2.1 Four basic results
We begin with four propositions on non-asymptotic Gaussian and bootstrap approximations for finite-dimensional random vectors. These propositions are our main mathematical contribution and essential for the statistical results to come. In Section 3 we lift these propositions to general empirical processes which are widely applicable to problems in mathematical statistics. Throughout this section, denotes the maximum norm of a vector .
The first result is a Gaussian approximation inequality. This inequality provides a non-asymptotic bound on the Kolmogorov distance between the laws of the maximum norm of sums of independent and identically distributed random vectors and a Gaussian proxy statistic.
Proposition 1 (Gaussian approximation).
Let be i.i.d. random vectors with mean zero and positive semi-definite covariance matrix . Set and . Then, for , ,
where hides an absolute constant independent of , and the distribution of the ’s.
Remark 1 (Extension to independent, non-identically distributed random vectors).
The proof of Proposition 1 involves an inductive argument inspired by Theorem 3.7.1 in Nourdin and Peccati (2012) who attribute it to Bolthausen (1984). This argument relies crucially on independent and identically distributed data. Generalizing it to independent but non-identically distributed data would require additional assumptions on the variances similar to the uniform asymptotic negligibility condition in the classical Lindeberg-Feller CLT for triangular arrays. For empirical processes such a generalization is less relevant. We therefore leave this to future research.
Remark 2 (Extension to non-centered random vectors).
If the covariance matrix is strictly positive definite, then Proposition 1 also holds for i.i.d. random vectors with non-zero mean and . Nevertheless, in the broader context of empirical processes a strictly positive definite covariance (function) is a very strong assumption. Therefore, in this paper, we do not pursue this refinement. Instead, the interested reader may consult the companion paper Giessing (2023).
Remark 3 (Gaussian approximation versus CLT).
Proposition 1 is strictly weaker than a multivariate CLT because the Kolmogorov distance between the maximum norm of two sequences of random vectors induces a topology on the set of probability measures on which is coarser than the topology of convergence in distribution. Indeed, consider and for and arbitrary random variables. Then, the Kolmogorov distance between and is zero for all , but if and only if and are exchangeable. For another perspective on the same issue, see Section 2.2.
The above Gaussian approximation result differs in several ways from related results in the literature. The two most striking differences are the following: First, the non-asymptotic upper bound in Proposition 1 does not explicitly depend on the dimension , but only on the (truncated) third moments of and and the variance of . Under suitable conditions on the marginal and/or joint distribution of the coordinates of these three quantities grow substantially slower than the dimension . Among other things, this opens up the possibility of applying this bound in the context of high-dimensional random vectors when . Second, Proposition 1 holds without stringent assumptions on the distribution of and applies even to degenerate distributions that do not have a strictly positive definite covariance matrix . In particular, the proposition does not require a lower bound on the variances of the coordinates of or the minimum eigenvalue of . This is fundamental for an effortless extension to general empirical processes. For lower bounds on the variance of the reader may consult Giessing (2023). For a comprehensive comparison of Proposition 1 with previous work, we refer to Section 2.2.
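To illustrate the practical content of Proposition 1, the following Monte Carlo sketch estimates the Kolmogorov distance between the law of the maximum norm of a normalized sum of non-Gaussian random vectors and that of its Gaussian proxy in a regime where the dimension exceeds the sample size. The simulation design (exponential coordinates, fast spectral decay) and all names are illustrative choices of ours, not prescriptions of the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

def kolmogorov_distance(a, b, grid_size=400):
    """Empirical Kolmogorov distance between two samples of a real-valued statistic."""
    grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), grid_size)
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

n, d, reps = 100, 500, 2000               # dimension d exceeds the sample size n
eigvals = 1.0 / np.arange(1, d + 1) ** 2  # fast spectral decay, small effective rank

# statistic: || n^{-1/2} sum_i X_i ||_inf for centered, skewed (exponential) data
T_emp = np.empty(reps)
for r in range(reps):
    X = (rng.exponential(1.0, size=(n, d)) - 1.0) * np.sqrt(eigvals)
    T_emp[r] = np.abs(X.sum(axis=0)).max() / np.sqrt(n)

# Gaussian proxy: || Z ||_inf with Z ~ N(0, diag(eigvals))
T_gauss = np.abs(rng.standard_normal((reps, d)) * np.sqrt(eigvals)).max(axis=1)

print("estimated Kolmogorov distance:", kolmogorov_distance(T_emp, T_gauss))
```

Varying the spectral decay lets one probe how the variance of the supremum of the proxy, rather than the raw dimension, governs the quality of the approximation.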
If we are willing to impose a lower bound on the variances of the coordinates of , we can deduce the following useful inequality:
Corollary 2.
Recall the setup of Proposition 1. In addition, suppose that and that has coordinate-wise finite moments, . Define , . Then, for all ,
where hides an absolute constant independent of , and the distribution of the ’s. If the coordinates in are equicorrelated with correlation coefficient , then the above inequality holds with replaced by .
Since for , the right hand side of the inequality in Corollary 2 can be easily upper bounded under a variety of moment assumptions on the marginal distributions of the coordinates of . For a few concrete examples relevant in high-dimensional statistics we refer to Lemmas 1–3 in Giessing and Fan (2023).
The second basic result is a Gaussian comparison inequality. The novelty of this result is (again) that it holds even for degenerate Gaussian laws with singular covariance matrices and does not explicitly depend on the dimension of the random vectors.
Proposition 3 (Gaussian comparison).
Let be Gaussian random vectors with mean zero and positive semi-definite covariance matrices and , respectively. Then,
where hides an absolute constant independent of .
Since the Gaussian distribution is fully characterized by its first two moments, Proposition 1 instantly suggests that it should be possible to approximate the sampling distribution of with the sampling distribution of , where and is a positive semi-definite estimate of . The third basic result formalizes this idea; it is a simple consequence of the triangle inequality combined with Propositions 1 and 3:
Proposition 4 (Bootstrap approximation of the sampling distribution).
Let be i.i.d. random vectors with mean zero and positive semi-definite covariance matrix . Let be any positive semi-definite estimate of . Set , , and . Then, for , ,
where hides an absolute constant independent of , and the distribution of the ’s.
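For orientation, the triangle-inequality step behind Proposition 4 can be written out explicitly. The notation below ($S_n^X$ for the normalized sum, $Z \sim N(0,\Sigma)$ for the Gaussian proxy, $\widehat{Z} \sim N(0,\widehat{\Sigma})$ for the bootstrap proxy, and $\mathbb{P}_{\mid X}$ for the probability conditional on the data) is introduced only for this display.

```latex
\sup_{s \ge 0} \Bigl| \mathbb{P}\bigl(\|S_n^X\|_\infty \le s\bigr)
  - \mathbb{P}_{\mid X}\bigl(\|\widehat Z\|_\infty \le s\bigr) \Bigr|
\;\le\;
\underbrace{\sup_{s \ge 0} \Bigl| \mathbb{P}\bigl(\|S_n^X\|_\infty \le s\bigr)
  - \mathbb{P}\bigl(\|Z\|_\infty \le s\bigr) \Bigr|}_{\text{Proposition 1}}
\;+\;
\underbrace{\sup_{s \ge 0} \Bigl| \mathbb{P}\bigl(\|Z\|_\infty \le s\bigr)
  - \mathbb{P}_{\mid X}\bigl(\|\widehat Z\|_\infty \le s\bigr) \Bigr|}_{\text{Proposition 3}} ,
```

where the second term is controlled by applying Proposition 3 conditionally on the data with the two covariance matrices and their estimate.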
Typically, statistical applications require estimates of the quantiles of the sampling distribution. Since the covariance matrix is unknown, the quantiles of with are infeasible. Hence, for arbitrary, we define the feasible Gaussian bootstrap quantiles as
Since this quantity is random, it is not immediately obvious that it is a valid approximation of the -quantile of the sampling distribution of . However, combining Proposition 4 with standard arguments (e.g. Chernozhukov et al., 2013) we obtain the fourth basic result:
Proposition 5 (Bootstrap approximation of quantiles).
Consider the setup of Proposition 4. Let be a sequence of arbitrary random variables, not necessarily independent of . Then, for , ,
where hides an absolute constant independent of , and the distribution of the ’s.
Remark 4 (On the purpose of the random variables ).
In applications may be a higher-order approximation error such as the remainder of a first-order Taylor approximation. For concrete examples we refer to Corollary 2 and Theorem 10 in Giessing and Fan (2023). The random variables may also be taken identically equal to zero. In this case, the expression in the last line in Proposition 5 vanishes.
Propositions 4 and 5 inherit the dimension- and weak variance-freeness from Propositions 1 and 3. Since these propositions do not require lower bounds on the minimum eigenvalue of the estimate of , we can always use the naive sample covariance matrix with to estimate even if . If additional information about is available (viz. low-rank, bandedness, or approximate sparsity), we can of course use more sophisticated estimators to improve the non-asymptotic bounds. This will prove particularly effective when we lift Propositions 4 and 5 to general empirical processes (see Section 3.4). The reader can find several more examples in Section 4.2 of the companion paper Giessing and Fan (2023).
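As a concrete illustration of Propositions 4 and 5, the following sketch computes a feasible Gaussian bootstrap quantile for the maximum norm of a normalized sum using the naive sample covariance matrix. The function name, the Student-t data, and all tuning choices are illustrative assumptions.

```python
import numpy as np

def gaussian_bootstrap_quantile(X, alpha=0.1, n_draws=5_000, rng=None):
    """Feasible Gaussian bootstrap quantile for || n^{-1/2} sum_i (X_i - mean) ||_inf.

    Only positive semi-definiteness of the plug-in covariance matrix is used;
    no lower bound on its smallest eigenvalue is required, in line with
    Propositions 4 and 5.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    Sigma_hat = np.cov(X, rowvar=False, bias=True)        # naive PSD estimate
    # draw Z ~ N(0, Sigma_hat) via a symmetric eigendecomposition; negative
    # eigenvalues (numerical noise) are clipped at zero
    w, V = np.linalg.eigh(Sigma_hat)
    root = V * np.sqrt(np.clip(w, 0.0, None))
    Z = rng.standard_normal((n_draws, d)) @ root.T
    return np.quantile(np.abs(Z).max(axis=1), 1 - alpha)

# usage: simultaneous critical value for d = 1000 > n = 150 coordinates
rng = np.random.default_rng(1)
X = rng.standard_t(df=8, size=(150, 1000))
print(gaussian_bootstrap_quantile(X, alpha=0.05, rng=rng))
```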
2.2 Relation to previous work
Here, we compare Propositions 1 and 3 with the relevant results in the literature. From a mathematical point of view Propositions 4 and 5 are just an afterthought. For a comprehensive review of the entire literature on Gaussian and bootstrap approximations of maxima of sums of random vectors see Chernozhukov et al. (2023).
• On the dependence on the dimension. As emphasized above, the upper bounds in Propositions 1 and 3 are dimension-free. This is a significant improvement over Theorems 2.2, 2.1, 3.2, 2.1 in Chernozhukov et al. (2013, 2017, 2019, 2020), Theorem 1.1 and Corollary 1.3 in Fang and Koike (2021), and Theorems 2.1 and 2.2 in Lopes (2022a), which all feature logarithmic factors of the dimension. Such bounds generalize poorly to empirical processes since the resulting upper bounds necessarily depend on the -entropy of the function class. This precludes (or, at the very least, substantially complicates) applications to objects as basic as the operator norm of a high-dimensional covariance matrix.
• On the moment assumptions. The above-mentioned results by Chernozhukov et al. (2013, 2017, 2019, 2020), Fang and Koike (2021), and Lopes (2022a) all require strictly positive lower bounds on the variances of the components of the random vector and/ or on the minimum eigenvalue of the covariance matrix . The strictly positive lower bounds are especially awkward if we try to extend their finite dimensional results to general empirical processes: While it is often sensible to impose an upper bound on the variances of the increments of an empirical process, lower bounds on the variances are much harder to justify and in general only achievable via a discretization (or thinning) of the function class. The weak variance-free upper bounds in Propositions 1 and 3 give us greater leeway and increase the scope of our approximation results considerably (see also Appendix A).
• On the sharpness of the upper bounds. The results in Chernozhukov et al. (2020), Fang and Koike (2021), and Lopes (2022a) show that the upper bound in Proposition 1 is sub-optimal and that its dependence on sample size can be improved to . The proof techniques in Chernozhukov et al. (2020) and Fang and Koike (2021) are both based on delicate estimates of Hermite polynomials and thus inherently dimension dependent. Extending their approaches to the coordinate-free Wiener chaos decomposition (which would yield dimension-free results) is a formidable research task. The proof strategy in Lopes (2022a) is very different and requires sub-Gaussian data. The results in the aforementioned papers also show that under additional distributional assumptions the exponent on the upper bound in Proposition 3 can be improved to .
• Proper generalizations of classical CLTs. In a certain regard, Propositions 1 and 3 are strictly weaker than the results in Chernozhukov et al. (2017, 2020), Fang and Koike (2021), and Lopes (2022a). Proposition 1 (and similarly Proposition 3) provides a bound on
where denotes the collection of all hypercubes in centered at the origin. Chernozhukov et al. (2017), Fang and Koike (2021), and Lopes (2022a) provide bounds on the above quantity not for the supremum over but for the supremum over the much larger collection of all hyper-rectangles in . When the dimension is fixed, their results imply convergence in distribution of to as and are therefore stronger than our results (see also Remark 3). In particular, their results can be considered proper generalizations of classical CLTs to high dimensions.
The results in this paper depend on a dimension-free anti-concentration inequality for (or, in other words, for the supremum over ) and a smoothing inequality specific to the map (i.e. Lemmas 2 and 6). Since these inequalities do not apply to the class of all hyper-rectangles in , the arguments in this paper cannot be easily modified to yield dimension-free generalizations of the classical CLTs to high dimensions.
3 Approximation results for suprema of empirical processes
3.1 Empirical process notation and definitions
In the previous section we got away with intuitive standard notation. We now introduce precise empirical process notation for the remainder of the paper: We denote by a sequence of i.i.d. random variables taking values in a measurable space with common distribution , i.e. , , are the coordinate projections of the infinite product probability space with law . If auxiliary variables independent of the ’s are involved, the underlying probability space is assumed to be of the form . We define the empirical measures associated with observations as random measures on given by for all , where is the Dirac measure at .
For a class of measurable functions from onto the real line we define the empirical process indexed by as
Further, we denote by the Gaussian -bridge process with mean zero and the same covariance function as the process , i.e.
(2)
Moreover, we denote by the Gaussian -motion with mean zero and covariance function given by
(3)
For probability measures on we define the -norm, , for a function by , the -semimetric by and the intrinsic standard deviation metric by , . We denote by , , the space of all real-valued measurable functions on with finite -norm. A function is an envelope for the class if it satisfies for all and . For a semimetric space and any we write to denote the -covering numbers of with respect to . For two deterministic sequences and , we write if and if there exist absolute constants such that for all . For we write for and for .
3.2 Gaussian approximation inequalities
We present the first main theoretical result of this paper: a Gaussian approximation inequality for empirical processes indexed by function classes that need not be Donsker. This result generalizes Proposition 1 from finite dimensional index sets to general function classes.
Theorem 1 (Gaussian approximation).
Let be a totally bounded pseudo-metric space. Further, let have envelope function . Suppose that there exist functions such that for ,
(4)
where . Let . Then, for each ,
where hides an absolute constant independent of , and .
Remark 1 (Lower and upper bounds on ).
Since may change with the sample size, so may . In a few special cases it is possible to exactly compute the variance (Giessing and Fan, 2023), but in most cases one can only derive lower and upper bounds. In our companion work Giessing (2023) we derive such bounds under mild conditions on the Gaussian -bridge process. For the reader’s convenience we include the relevant results from that paper in Appendix B.2.
The total boundedness of in the above theorem is a standard assumption which allows us to reduce the proof of Theorem 1 to an application of Proposition 1 combined with a discretization argument. Since is totally bounded whenever (see Lemma 16 in Appendix B.4), this is a rather mild technical assumption. For a detailed discussion of this result and a comparison with the literature we refer to Section 3.6.
Under more specific smoothness assumptions on the Gaussian -bridge process (beyond the abstract control of the moduli of continuity in eq. (4)) Theorem 1 simplifies as follows:
Corollary 2.
Let and have envelope . If the Gaussian -Bridge process has almost surely uniformly -continuous sample paths and , then for each ,
where hides an absolute constant independent of , and .
Remark 2 (Further simplification with lower bound on second moment).
Recall the setup of Corollary 2. If in addition there exists such that for all and , , then
where hides an absolute constant independent of , and the distribution of the ’s. This is the empirical process analogue to Corollary 2. Since the proof of this inequality is identical to that of Corollary 2 and only differs in notation, we omit the details.
Since centered Gaussian processes are either almost surely uniformly continuous w.r.t. their intrinsic standard deviation metric or almost surely discontinuous, the smoothness condition in Corollary 2 is natural. (For the proofs to go through, uniform -continuity in probability would be sufficient.) In Lemmas 17 and 18 in Appendix B.4 we provide simple sufficient and necessary conditions for almost sure uniform -continuity of on . Importantly, is fixed; the uniform -continuity is a point-wise requirement and does not need to hold when taking the limit . This is especially relevant in high-dimensional statistics where Gaussian processes are typically unbounded and have diverging sample paths as , and, more generally, whenever the function classes are non-Donsker. Of course, just as with Proposition 1, the conclusion of Corollary 2 is weaker than weak convergence (see Section 2.2).
3.3 Gaussian comparison inequalities
Our second main result is a comparison inequality which bounds the Kolmogorov distance between a Gaussian -bridge process and a Gaussian -motion both indexed by (possibly) different function classes and . This generalizes Proposition 3 to Gaussian processes.
To state the theorem, we introduce the following notation: Given the pseudo-metric spaces and , , we write to denote the projection from onto defined by for all . In the case that the image is not a singleton for some , we choose any one of the equivalent points in . Similarly, we write for the projection from onto defined via .
Theorem 3 (Gaussian comparison).
Let and be totally bounded pseudo-metric spaces. Further, let and have envelope functions and , respectively. Suppose that there exist functions such that
where and . Let . Then,
where
(5)
where hides an absolute constant independent of and .
Remark 3 (Comparison inequality for Gaussian -bridge processes).
Theorem 3 holds without changes (and identical proof) for Gaussian -bridge processes . This easily follows from the almost sure representation for all , where is independent of for all . We state the comparison inequality for -motions because this is the version that we use in the context of the Gaussian process bootstrap in Section 3.4.
Informally speaking, Theorem 3 states that the distributions of the suprema of any two Gaussian processes are close whenever their covariance functions are not too different. Note that we compare two Gaussian processes with different covariance functions and that differ not only in their measures ( and ) but also their functional form (viz. eq. (2) and (3) in Section 3.1). This turns out to be essential for developing alternatives to the classical Gaussian multiplier bootstrap procedure for suprema of empirical processes proposed by Chernozhukov et al. (2014a) and Chernozhukov et al. (2014b). We discuss this in detail in Section 3.4.
Under additional smoothness conditions on the Gaussian processes and assumptions on the function classes Theorem 3 simplifies. The special cases and are of particular interest in the context of the bootstrap. They are the content of the next two corollaries.
Corollary 4.
Let . If the Gaussian -Bridge process and the Gaussian -motion have almost surely uniformly continuous sample paths w.r.t. their respective standard deviation metrics and , and , then
where hides an absolute constant independent of , and .
Corollary 5.
Let and be a -net of with respect to , . If the Gaussian -Bridge process has almost surely uniformly continuous sample paths w.r.t. metric and , then
where hides an absolute constant independent of , and .
3.4 Gaussian process bootstrap
In this section we develop a general framework for bootstrapping the distribution of suprema of empirical processes. We recover the classical Gaussian multiplier bootstrap as a special case of this more general framework. For concrete applications to statistical problems we refer to Section 4.
The following abstract approximation result generalizes Proposition 4 to empirical processes.
Theorem 6 (Abstract bootstrap approximation).
Let and be totally bounded pseudo-metric spaces. Further, let and have envelope functions and , respectively. Suppose that there exist functions such that for ,
(6)
where and . Let . Then, for each ,
where is given in eq. (5) and hides an absolute constant independent of , and .
Remark 4 (Bootstrap approximation with Gaussian -bridge process).
Theorem 6 is an immediate consequence of Theorems 1 and 3 and the triangle inequality, and thus a mathematical triviality. What matters is its statistical interpretation: It implies that we can approximate the sampling distribution of the supremum of an empirical process with the distribution of the supremum of a Gaussian -motion provided that their covariance functions and do not differ by too much. Importantly, up to the smoothness condition in (6), we are completely free to choose whatever -measure and function class suit us best.
This interpretation of Theorem 6 motivates the following bootstrap procedure for estimating the distribution of the supremum of an empirical process when the -measure is unknown.
Algorithm 1 (Gaussian process bootstrap).
Let be a simple random sample drawn from distribution .
Step 1: Construct a positive semi-definite estimate of the covariance function of the empirical process .
Step 2: Construct a Gaussian process such that for all ,
(7)
Step 3: Approximate the distribution of by drawing Monte Carlo samples from .
By Kolmogorov’s consistency theorem the Gaussian process will always exist. If the estimate is uniformly consistent and if the envelope function and the supremum of the Gaussian -bridge process satisfy certain moment conditions, Theorem 6 readily implies uniform consistency of the Gaussian process bootstrap as . The challenge is to actually construct a (measurable) version of from which we can draw Monte Carlo samples. This is the content of the next section.
Remark 5 (Relation to the Gaussian multiplier bootstrap by Chernozhukov et al. (2014b)).
The Gaussian multiplier bootstrap is a special case of the Gaussian process bootstrap. To see this, note that if we estimate the covariance function nonparametrically via the sample covariance function , , then the corresponding Gaussian -Bridge process can be expanded into a finite sum of non-orthogonal functions
where are i.i.d. standard normal random variables independent of . While this representation is extremely simple (in theory and practice!), unlike the Gaussian process bootstrap it does not allow one to incorporate additional structural information about the covariance function. Empirically, this often results in less accurate bootstrap estimates (Giessing and Fan, 2023).
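For a finite (or discretized) function class this multiplier representation reduces to a few lines of code; the array layout and function name below are our own conventions.

```python
import numpy as np

def multiplier_bootstrap_sup(F, n_draws=2000, rng=None):
    """Gaussian multiplier bootstrap for the supremum of an empirical process
    over a finite (or discretized) function class.

    F[i, j] holds f_j(X_i).  Each draw equals
        sup_j | n^{-1/2} sum_i xi_i (f_j(X_i) - bar f_j) |
    with xi_1, ..., xi_n i.i.d. standard normal multipliers, which corresponds
    to plugging in the nonparametric sample covariance function.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = F.shape[0]
    F_centered = F - F.mean(axis=0, keepdims=True)
    xi = rng.standard_normal((n_draws, n))
    return np.abs(xi @ F_centered).max(axis=1) / np.sqrt(n)
```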
3.5 A practical guide to the Gaussian process bootstrap
To implement the Gaussian process bootstrap we need two things: a uniformly consistent estimate of the covariance function and a systematic way of constructing Gaussian processes for given covariance functions.
Finding consistent estimates of the covariance function is relatively straightforward. For example, a natural estimate is the nonparametric sample covariance function defined as , . Under suitable conditions on the probability measure and the function class this estimate is uniformly consistent in high dimensions (e.g. Koltchinskii and Lounici, 2017). Of course, an important feature of the Gaussian process bootstrap is that it can be combined with any positive semi-definite estimate of the covariance function. In particular, we can use (semi-)parametric estimators to exploit structural constraints induced by and . For concrete examples we refer to Section 4 and also Giessing and Fan (2023).
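For instance, with the empirical measure as plug-in, one standard version of this estimate can be coded directly; the callable interface below is an illustrative choice.

```python
import numpy as np

def sample_cov_fn(X, f, g):
    """Nonparametric sample covariance of two index functions f and g,
    i.e. the empirical analogue of P_n(fg) - (P_n f)(P_n g), evaluated on the
    data array X (one observation per row; f and g must be vectorized)."""
    fX, gX = f(X), g(X)
    return np.mean(fX * gX) - np.mean(fX) * np.mean(gX)
```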
Constructing Gaussian processes for given covariance functions and defined on potentially arbitrary index sets is a more challenging problem. We propose to base the construction on an almost sure version of the classical Karhunen-Loève decomposition.
To develop this decomposition in the framework of this paper, we need new notation and concepts. In the following, denotes a measurable space for some finite measure with support . Typically, is the Borel -algebra on and the measure is chosen for convenience. For example, can be set to the Lebesgue measure when is an interval, or the counting measure when is a finite (discrete) set. The space equipped with the inner product is a Hilbert space. Given a positive semi-definite and continuous kernel we define the linear operator on via
(8)
If , then is a bounded linear operator. If is a compact metric space, then is a compact operator. In this case, has a spectral decomposition which depends on the kernel alone; the measure is exogenous. For further details we refer the reader to Chapters 4 and 7 in Hsing and Eubank (2015).
The following result is well-known (Jain and Kallianpur, 1970); since it is slightly adapted to our setting we have included its proof in the appendix.
Proposition 7 (Karhunen-Loève decomposition of Gaussian processes).
Let be a compact pseudo-metric space and be a Gaussian -motion. Suppose that has a continuous covariance function and . Let be the eigenvalue and eigenfunction pairs of the linear operator corresponding to , and be a sequence of i.i.d. standard normal random variables. Then, with probability one,
where , , is an almost surely bounded and continuous Gaussian process on for all .
This proposition justifies the following constructive approximation of the Gaussian proxy process defined in (7): Let be a generic estimate of the covariance function of the empirical process . Suppose that is such that the associated integral operator defined in (8) admits a spectral decomposition. Denote by its eigenvalue and eigenfunction pairs, and, for concreteness, assume that the ’s are sorted in nonincreasing order. Given a sequence of i.i.d. standard normal random variables and truncation level , define the Gaussian process and associated covariance function
(9)
While there are only a few situations in which the eigenvalues and eigenfunctions of can be found analytically, from a computational perspective this is a standard problem and efficient numerical solvers exist (Ghanem and Spanos, 2003; Berlinet and Thomas-Agnan, 2004). Thus, constructing the process in (9) does not pose any practical challenges. Proposition 7 now guarantees (under appropriate boundedness and smoothness conditions on ) that the approximation error between and can be made arbitrarily small by choosing sufficiently large. In fact, by Mercer’s theorem the error can be quantified in terms of the operator norm of the difference between the covariance functions and (e.g. Hsing and Eubank, 2015, Lemma 4.6.6 and Corollary 8).
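A minimal numerical sketch of this construction, using a Nyström-type discretization of the integral operator in (8) on a finite grid with the uniform measure playing the role of the finite reference measure, reads as follows; the grid, quadrature weights, and truncation level are illustrative choices.

```python
import numpy as np

def truncated_kl_sup_draws(cov_fn, grid, K=20, n_draws=5000, rng=None):
    """Monte Carlo draws of the supremum of the truncated Karhunen-Loeve process
    built from an admissible covariance estimate cov_fn(s, t).

    The finite measure mu is taken to be the uniform measure on `grid`, so the
    integral operator in (8) is approximated by a weighted kernel matrix
    (Nystrom discretization) whose K leading eigenpairs drive the expansion (9).
    """
    rng = np.random.default_rng() if rng is None else rng
    m = len(grid)
    w = np.full(m, 1.0 / m)                               # quadrature weights
    C = np.array([[cov_fn(s, t) for t in grid] for s in grid])
    B = np.sqrt(w)[:, None] * C * np.sqrt(w)[None, :]     # symmetrized operator
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:K]                     # K leading eigenpairs
    lam = np.clip(evals[idx], 0.0, None)
    phi = evecs[:, idx] / np.sqrt(w)[:, None]             # eigenfunctions on the grid
    xi = rng.standard_normal((n_draws, K))
    paths = xi @ (np.sqrt(lam)[:, None] * phi.T)          # truncated KL sample paths
    return np.abs(paths).max(axis=1)
```

For a one-dimensional index set one would, for example, pass grid = np.linspace(0, 1, 200) together with a plug-in covariance estimate such as the sample covariance function above; bootstrap quantiles can then be read off from the returned draws.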
In the above discussion we have implicitly imposed several assumptions on the estimate . For future reference, we summarize these assumptions in a single definition.
Definition 1 (Admissibility of ).
We say that an estimate of the covariance function is admissible if it is continuous, symmetric, positive semi-definite, and its associated integral operator is a bounded linear operator on for some finite measure with support .
Remark 6 (Admissible estimates exist).
Under the assumption that has an envelope function, the nonparametric sample covariance is admissible. Indeed, by definition, is continuous (w.r.t. ), symmetric, positive semi-definite, and, by existence of the envelope function, . Also, the existence of an envelope function implies that is a compact metric space. See also Section 4.
The next result establishes consistency of the Gaussian process bootstrap based on the truncated Karhunen-Loève expansion in (9). It is a simple corollary of Theorem 6. Note that the intrinsic standard deviation metric associated with (the pushforward probability measure induced by) can be expressed in terms of its covariance function as .
Corollary 8 (Consistency of the Gaussian process bootstrap).
Let be an admissible estimate of and its best rank- approximation. Let have envelope . If the Gaussian processes and have almost surely uniformly continuous sample paths w.r.t. their respective standard deviation metrics and , and , then, for each ,
where hides an absolute constant independent of , and .
Moreover, if is compact w.r.t. , then the last term on the right hand side in the above display vanishes as .
In general, the strong variance depends on the dimension of the function class . Hence, the truncation level has to be chosen (inversely) proportional to to ensure that in the upper bound of Corollary 8 the deterministic approximation error is negligible compared to the stochastic estimation errors.
We conclude this section with a consistency result on the bootstrap approximation of quantiles of suprema of empirical processes. This result is the empirical process analogue to Proposition 5. It is relevant in the context of hypothesis testing and construction of confidence regions. For we denote the conditional -quantile of the supremum of by
We have the following result:
Theorem 9 (Quantiles of the Gaussian process bootstrap).
Consider the setup of Corollary 8. Let be a sequence of arbitrary random variables, not necessarily independent of . Then, for each ,
where hides an absolute constant independent of , and .
In statistical applications, the statistic of interest is rarely a simple empirical process. Instead, the empirical process usually arises as the leading term of a (functional) Taylor expansion. The random sequence in the above theorem can be used to capture the higher-order approximation errors of such an expansion.
Remark 7 (Additional practical considerations).
It is often infeasible to draw Monte Carlo samples directly from . In practice, we suggest approximating via a finite -net with respect to . With this additional approximation step and by Corollary 5, the conclusion of Corollary 8 holds with substituted for and the additional quantity in the upper bound on the Kolmogorov distance. This shows that the level of discretization should be chosen proportional to .
3.6 Relation to previous work
The papers by Chernozhukov et al. (2014a, b, 2016) are currently the only other existing works on Gaussian and bootstrap approximations of suprema of empirical processes indexed by (potentially) non-Donsker function classes. In this section, we compare their results to our Theorems 1–9. To keep the comparison concise, we focus on the key aspects that motivated us to write this paper.
It is important to note that the results presented in Chernozhukov et al. (2014a, b, 2016) differ slightly in nature from ours. Instead of establishing bounds on Kolmogorov distances, Chernozhukov and his co-authors derive coupling inequalities. However, through standard arguments (Strassen’s theorem and anti-concentration) these coupling inequalities indirectly yield bounds on Kolmogorov distances. These implied bounds on the Kolmogorov distances are at most as sharp as the ones in the coupling inequalities, up to multiplicative constants. Since we do not care about absolute constants, but only the dependence of the upper bounds on characteristics of the function class and the law , we can meaningfully compare their findings and ours.
• Unbounded function classes. In classical empirical process theory the function class is typically assumed to be uniformly bounded, i.e. for all (e.g. van der Vaart and Wellner, 1996). A key feature of Theorems 1–9 as well as Theorems A.2, 2.1, and 2.1-2.3 in Chernozhukov et al. (2014a, b, 2016) is that they hold for unbounded function classes and only require the envelope function to have finite moments. Relaxing the uniform boundedness of the function class is useful, among other things, for inference on high-dimensional statistical models, functional data analysis, nonparametric regression, and series/ sieve estimation (Chernozhukov et al., 2014a, b; Giessing and Fan, 2023).
• Entropy conditions. The upper bounds provided in Theorems A.2, 2.1, and 2.1-2.3 in Chernozhukov et al. (2014a, b, 2016) depend on a combination of (truncated) second and th () moments of , “local quantities” of order and (disregarding -factors), , and the entropy number , arbitrary. These upper bounds are not only more complex than ours but also weaker in terms of their implications: Since metric entropy with respect to intrinsic standard deviation metric typically scales linearly in the (VC-)dimension of the statistical model (see Appendix A for two (counter-)examples), the upper bounds in Chernozhukov et al. (2014a, b, 2016) are vacuous in high-dimensional situations without sparsity or when the (VC-)dimension exceeds the sample size. In contrast, the upper bounds in our Theorems 1–9 depend only on the expected value and standard deviation of the supremum of the Gaussian -bridge process and the (truncated) third moments of the envelope. Under mild assumptions, these quantities can be upper bounded independently of the (VC-)dimension, thus offering useful bounds even in high-dimensional problems (see Section 4).
• Lower bounds on weak variances. Lemmas 2.3 and 2.4 in Chernozhukov et al. (2014b) and all of the results in Chernozhukov et al. (2014a, 2016) require a strictly positive lower bound on the weak variance of the Gaussian -bridge process, i.e. . This assumption automatically limits the applicability of these lemmas to studentized statistics/ standardized function classes (Chernozhukov et al., 2014a, b) and excludes relevant scenarios with variance decay (Lopes et al., 2020; Lopes, 2022b; Lopes et al., 2023). In contrast, our Theorems 1–9 apply to all function classes for which the strong variance of the Gaussian -bridge process, , does not vanish “too fast”. A slowly vanishing strong variance is a weaker requirement than a strictly positive lower bound on the weak variance (see Giessing (2023) and also Lemmas 6 and 7).
The lower bound on the weak variance is an artifact of Chernozhukov et al.’s (2014a; 2014b; 2016) proof technique which requires the joint distribution of the finite-dimensional marginals of the Gaussian -bridge process to be non-degenerate.
4 Applications
4.1 Confidence ellipsoids for high-dimensional parameter vectors
Confidence regions are fundamental to uncertainty quantification in multivariate statistics. Typically, their asymptotic validity relies on multivariate CLTs. However, in high dimensions, when the number of parameters exceeds the sample size, validity of confidence regions needs to be justified differently. In this section, we show how the Gaussian process bootstrap offers a practical solution to this problem. Existing work on bootstrap confidence regions in high dimensions has focused exclusively on conservative, rectangular confidence regions (e.g. Chernozhukov et al., 2013, 2014b, 2023). For the first time, our results allow the construction of tighter, elliptical confidence regions without sparsity assumptions.
Let be an unknown parameter and an estimator for based on the simple random sample . Then, an asymptotic confidence ellipsoid for can be constructed as
(10)
where denotes the Euclidean norm and solves
In general, it is difficult to determine from the above equation because the sampling distribution of often does not have an explicit form. However, in many cases, admits an asymptotically linear expansion of the form
where are centered i.i.d. random vectors (influence functions) and is a remainder term. Thus, in these cases, we have
Using this formulation, we can now apply Theorem 9 to approximate the quantile and complete the construction of the asymptotic confidence ellipsoid. By Algorithm 1 we need to construct a Gaussian process with index set and whose covariance function approximates the bi-linear map , where . Clearly, the following process does the job:
and and . We denote the -quantile of the supremum of this process by
(11)
To show that these quantiles uniformly approximate the quantiles of the (asymptotic) distribution of , we introduce the following assumptions:
Definition 2 (Sub-Gaussian random vector).
We call a centered random vector sub-Gaussian if for all .
Assumption 1 (Sub-Gaussian influence functions).
The influence functions are i.i.d. sub-Gaussian random vectors with covariance matrix and (i) , (ii) , and (iii) , where is the effective rank of the matrix .
Assumption 2 (Heavy-tailed influence functions).
The influence functions are centered i.i.d. random vectors with covariance matrix such that and for some . Furthermore, (i) , (ii) , (iii) , (iv) , and (v) , where is the effective rank of the matrix .
Under either assumption, we have the following result:
Proposition 1 (Bootstrap confidence ellipsoid).
It is worth noticing that neither Assumption 1 nor 2 requires sparsity of the parameter , the estimator , or the influence functions . Instead, Assumptions 1 and 2 already hold if the covariance matrix has bounded effective rank (see Giessing and Fan, 2023, Section 2.1 and Appendices A.1 and A.2). In this context it is important to note that, unlike Chernozhukov et al. (2014b), we do not require a strictly positive lower bound on , the smallest eigenvalue of the covariance matrix . If such a lower bound were needed, the effective rank would grow linearly in the dimension and Assumptions 1 and 2 would be violated if (see Appendix A). Moreover, Proposition 1 cannot be deduced from Theorem 2.1 in Chernozhukov et al. (2014b) because the upper bound in their coupling inequality would feature the term (and other terms as well), which is a remnant of the entropy number of the -net discretization of the -dimensional Euclidean unit-ball.
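A minimal sketch of the resulting procedure for the Euclidean-norm confidence region in (10) is given below; the influence functions (here those of the sample mean), the plain sample covariance, and all parameter choices are illustrative assumptions.

```python
import numpy as np

def euclidean_ball_radius(psi, alpha=0.1, n_draws=5_000, rng=None):
    """Bootstrap (1 - alpha)-quantile of || n^{-1/2} sum_i psi_i ||_2, rescaled
    to the radius of the confidence region around the estimator.

    psi[i] is the (estimated) influence function of observation i; the Gaussian
    proxy is drawn from N(0, Sigma_hat) with Sigma_hat the sample covariance of
    the psi_i, so no sparsity or minimum-eigenvalue condition is imposed.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = psi.shape
    psi_c = psi - psi.mean(axis=0, keepdims=True)
    w, V = np.linalg.eigh(psi_c.T @ psi_c / n)
    root = V * np.sqrt(np.clip(w, 0.0, None))
    Z = rng.standard_normal((n_draws, d)) @ root.T        # Z ~ N(0, Sigma_hat)
    q = np.quantile(np.linalg.norm(Z, axis=1), 1 - alpha)
    return q / np.sqrt(n)

# usage for the sample mean: psi_i = X_i - theta_hat, region {theta : ||theta_hat - theta||_2 <= r}
rng = np.random.default_rng(2)
X = rng.standard_t(df=6, size=(200, 800))
theta_hat = X.mean(axis=0)
r = euclidean_ball_radius(X - theta_hat, alpha=0.05, rng=rng)
```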
4.2 Inference on the spectral norm of high-dimensional covariance matrices
Spectral statistics of random matrices play an important role in multivariate statistical analysis. The asymptotic distributions of spectral statistics are well established in low dimensions (e.g. Anderson, 1963; Waternaux, 1976; Fujikoshi, 1980) and when the dimension is comparable to the sample size (e.g. Johnstone, 2001; El Karoui, 2007; Péché, 2009; Bao et al., 2015). In the high-dimensional case, when asymptotic arguments do not apply, bootstrap procedures have proved to be effective in approximating distributions of certain maximum-type spectral statistics (e.g. Han et al., 2018; Naumov et al., 2019; Lopes et al., 2019, 2020; Lopes, 2022b). Here, we demonstrate that the Gaussian process bootstrap is a viable alternative to these bootstrap procedures in approximating the distribution of the spectral norm of a high-dimensional sample covariance matrix.
Let be i.i.d. random vectors with law , mean zero, and covariance matrix . Consider the spectral statistic
Since there does not exist a closed form expression of the sampling distribution of when , we apply Algorithm 1 to obtain an approximation based on a Gaussian proxy process. We introduce the following notation: Let and note that , where denotes the Kronecker product, the half-vectorization operator which turns a symmetric matrix into a column vector (of unique entries), and the duplication matrix such that where is the ordinary vectorization operator. Whence, where is the empirical measure of the collection of , , the pushforward of under the map , and . Since each has a (not necessarily unique) representation in terms of , in the following we identify with pairs . The covariance function associated with the empirical process is thus given by
where . Thus, in the light of Algorithm 1 the natural choice for the Gaussian bootstrap process is
(12)
and is the sample analogue of .
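Before stating the formal assumptions, we record a minimal sketch of one concrete implementation. It uses Gaussian multipliers so that, conditionally on the data, the proxy process over the unit sphere has the sample covariance function described above; for symmetric matrices the supremum over the sphere coincides with the operator norm, so no discretization of the sphere is needed. The multiplier form and all names are illustrative and need not coincide with the exact estimator in (12).

```python
import numpy as np

def spectral_norm_bootstrap(X, n_draws=500, rng=None):
    """Gaussian bootstrap draws approximating the law of sqrt(n) * || Sigma_hat - Sigma ||_op.

    Each draw is || n^{-1/2} sum_i xi_i (X_i X_i' - Sigma_hat) ||_op with xi_i
    i.i.d. N(0,1).  Conditionally on the data this is the supremum over the unit
    sphere of a centered Gaussian process whose covariance function is the
    sample analogue of the population one; for symmetric matrices that supremum
    equals the operator (spectral) norm.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    Sigma_hat = X.T @ X / n                       # data assumed centered
    draws = np.empty(n_draws)
    for b in range(n_draws):
        xi = rng.standard_normal(n)
        M = (X * xi[:, None]).T @ X / n - xi.mean() * Sigma_hat
        draws[b] = np.sqrt(n) * np.linalg.norm(M, ord=2)
    return draws

# usage: 95% critical value for sqrt(n) * || Sigma_hat - Sigma ||_op
rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(50), np.diag(1.0 / np.arange(1, 51) ** 2), size=200)
crit = np.quantile(spectral_norm_bootstrap(X, rng=rng), 0.95)
```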
We make the following assumption:
Assumption 3 (Sub-Gaussian data).
The data are i.i.d. sub-Gaussian random vectors with covariance matrix and .
Remark 1 (On the lower bound on the variances).
The strictly positive lower bound on the variance of the quadratic form is mild. The existence of the lower bound is equivalent to for all . The latter inequality holds if the law of does not concentrate on a lower dimensional subspace of . Since the bounds in Proposition 2 are explicit in , Proposition 2 also applies to scenarios in which as . Similar lower bounds on the variance of appear in Lopes (2022b) and Lopes et al. (2023).
We have the following result:
Proposition 2 (Bootstrap approximation of the distribution of spectral norms of covariance matrices).
Remark 2.
Note that the matrix is symmetric just as the target matrix .
We conclude this section with a comparison of Proposition 2 to existing results in the literature: First, if , then the upper bound in the above proposition is asymptotically negligible. In this case, the bootstrapped distribution of the Gaussian proxy statistic consistently approximates the distribution of . Since this rate depends only on the effective rank, it is dimension-free and cannot be derived through the results in Chernozhukov et al. (2014b). Second, unlike the results in Lopes (2022b) and Lopes et al. (2023), Proposition 2 does not rely on specific assumptions about the decay of the eigenvalues of . And yet, under certain circumstances, the consistency rate provided by Proposition 2 can be faster than theirs. Specifically, the bootstrap procedure described in Lopes (2022b) achieves consistency at a rate of for and . In our context, the parameter determines the rate at which the eigenvalues of decrease, i.e. , . To achieve a rate faster than , must be greater than which requires an extremely fast decay of the eigenvalues. Third, Lopes et al. (2023) conduct extensive numerical experiments and observe that the bootstrap approximation exhibits a sharp phase transition from accurate to inaccurate when switches from greater than to less than . This observation aligns not only with their own theoretical findings but also with the upper bound presented in Proposition 2, since, under their modeling assumptions, the effective rank remains bounded if but diverges if .
4.3 Simultaneous confidence bands for functions in reproducing kernel Hilbert spaces
Reproducing kernel Hilbert spaces (RKHS) are an integral part of statistics, with applications in classical non-parametric statistics (Wahba, 1990), machine learning (Schölkopf and Smola, 2002; Steinwart and Christmann, 2008) and, most recently, (deep) neural nets (Belkin et al., 2018; Jacot et al., 2018; Bohn et al., 2019; Unser, 2019; Chen and Xu, 2020). In this section, we consider constructing simultaneous confidence bands for functions in RKHS by bootstrapping the distribution of a bias-corrected kernel ridge regression estimator. Recently, this problem has been addressed by Singh and Vijaykumar (2023) with a symmetrized multiplier bootstrap. Here, we propose an alternative based on the Gaussian process bootstrap using a truncated Karhunen-Loève decomposition. We point out several commonalities and differences between the two procedures.
In the following, denotes the RKHS of continuous functions from to associated to the symmetric and positive definite kernel . We allow , , and to change with the sample size , but we do not make this dependence explicit. We write for the norm induced by the inner product and for the supremum norm. For , we let be the function . Then, for all and, by the reproducing property, for all and . The kernel induces the so-called kernel metric , . Given () we denote its dual by (). For we define the tensor product by . For operators on we use , , and to denote operator norm, Hilbert-Schmidt norm, and trace, respectively. For further details on RKHSs we refer to Berlinet and Thomas-Agnan (2004).
Let and have joint law . Given a simple random sample our goal is to construct uniform confidence bands for the conditional mean function or, rather, its best approximation in hypothesis space , i.e.
To this end, consider the classical kernel ridge regression estimator
and define its bias-corrected version as
(13)
We propose to construct simultaneous confidence bands for based on via the rectangle
(14)
where approximates the (asymptotic) quantile of the law of . To compute we proceed in two steps: First, we show that can be written as the sum of the supremum of an empirical process and a negligible remainder term. Then, we apply the strategy developed in Section 3.5 to bootstrap the supremum of the empirical process.
where and is a higher-order remainder term. Since is a reproducing kernel, we have for all . Hence, the above expansion implies
and , . By Lemma 21 in Appendix B.5, is negligible with high probability. Moreover, since is the best approximation in square loss, the random elements ’s have mean zero. Thus, , where is the empirical measure of the ’s, the pushforward of under the map , and . Since the functions are just the evaluation functionals of , in the following we identify with its corresponding . The covariance function associated with the empirical process is thus given by
with covariance operator
where we have used that the operators and commute.
We proceed to construct a Gaussian proxy process as outlined in Section 3.5: Let and be the plug-in estimates of the covariance operator and the variance , respectively. Define by . Recall definition (8) of the integral operator . In the present setup, by Fubini’s theorem. Denote by the eigenvalue and eigenfunction pairs of . Further, let be a sequence of i.i.d. standard normal random variables. Then, for and define
(15)
where is the best rank- approximation of .
Given Proposition 7 we postulate that the process is an almost sure version of a Gaussian process on with covariance function (or, equivalently, with covariance operator ). Consequently, Theorem 9 guarantees validity of the Gaussian process bootstrap based on . To make all these claims rigorous, consider the following assumptions:
Assumption 4 (On the kernel).
The kernel is symmetric, positive semi-definite, continuous, and bounded, i.e. .
Remark 3.
The assumptions on the kernel are standard and important (Berlinet and Thomas-Agnan, 2004). The continuity of guarantees that is a random element on whenever is a random variable. It also implies that the RKHS is separable whenever is separable. The boundedness and the reproducing property of imply that for all .
Assumption 5 (On the data).
The data are i.i.d. random elements defined on an abstract product probability space . The ’s are almost surely bounded, i.e. there exists an absolute constant such that almost surely.
Remark 4.
The almost sure boundedness of the ’s is a strong assumption. We introduce this assumption to keep technical arguments at a minimum. Singh and Vijaykumar (2023) impose an equivalent boundedness condition on the pseudo-residuals , .
Assumption 6 (On the population and sample covariance operators).
For all (i) there exists such that and (ii) .
Remark 5.
Condition (i) is the Hilbert space equivalent to the lower bound on the variance in Assumption 3. While Singh and Vijaykumar (2023) do not explicitly impose a lower bound on the covariance operator, such a lower bound is implied by their Assumption 5.2 (see Giessing, 2023, for details). In Appendix 4 we provide a general non-asymptotic complement to Proposition 3 below which applies even if as . Condition (ii) is a classical assumption in learning theory on RKHS (Mendelson, 2002). Together with Condition (i) it implies that and are finite rank and trace class operators, respectively.
Denote the -quantile of the supremum of the Gaussian proxy process in (15) by
(16)
An application of Theorem 9 yields:
Proposition 3 (Bootstrap quantiles).
The finite metric entropy condition on the set ensures that the Gaussian bootstrap process (15) is almost surely bounded and uniformly continuous on (as required by Theorem 9). This condition is not merely technical but also intuitive: Since the RKHS is the completion of and is bounded and continuous, conditions that guarantee the continuity of Gaussian bootstrap processes on (a subset of) should indeed be attributable to properties of . Importantly, the metric entropy condition on does not impose restrictions on the dimension of . Only Assumption 6 (ii) implicitly imposes restrictions on the dimension of .
Since under the conditions of Proposition 3, almost surely for all , it follows that the bootstrap confidence band proposed in (14) is asymptotically valid:
Corollary 4 (Simultaneous bootstrap confidence bands).
A thorough comparison of the Gaussian process bootstrap and Singh and Vijaykumar’s (2023) symmetrized multiplier bootstrap is beyond the scope of this paper. In practice, both methods yield biased confidence bands for , albeit for different reasons: Singh and Vijaykumar’s (2023) bias stems from constructing a confidence band for the pseudo-true regression function without correcting the regularization bias induced by ; ours is due to using an -truncated Karhunen-Loève decomposition based on a finite number of eigenfunctions. In future work we will explore ways to mitigate these biases by judiciously choosing and .
5 Conclusion
In this paper we have developed a new approach to the theory and practice of Gaussian and bootstrap approximations of the sampling distribution of suprema of empirical processes. We have put special emphasis on non-asymptotic approximations that are entropy- and weak variance-free, and have allowed the function class to vary with the sample size and to be non-Donsker. We have shown that such general approximation results are useful, among other things, for inference on high-dimensional statistical models and reproducing kernel Hilbert spaces. However, the theory and methodology in this paper have three limitations that need to be addressed in future work:
• Reliance on independent and identically distributed data. All statistically relevant results in this paper depend on Proposition 1, which heavily relies on the assumption of independent and identically distributed data. Expanding Proposition 1 to accommodate non-identically distributed data would be a first step towards solving simultaneous and large-scale two-sample testing problems and conducting inference in high-dimensional fixed design settings. Currently, the results in this paper are exclusively applicable to one-sample testing and unconditional inference.
• Lack of tight lower bounds on the strong variances of most Gaussian processes. One of the most notable features of the results in this paper is the fact that all upper bounds depend on the inverse of the strong variance of some Gaussian proxy process. Unfortunately, in statistical applications this poses a formidable challenge since, up until now, there exist only a few techniques to derive tight lower bounds on these strong variances (Giessing and Fan, 2023; Giessing, 2023). We either need new tools or we need to develop Gaussian and bootstrap approximations for statistics other than maxima/suprema. The latter will require new anti-concentration inequalities.
• Biased quantile estimates due to bootstrapping a non-pivotal statistic. The Gaussian process bootstrap is based on a non-pivotal statistic, i.e. the sampling distribution of the supremum depends on the unknown population covariance function. In practice, when bootstrapping non-pivotal statistics the estimated quantiles often differ substantially from the true quantiles. Several bias correction schemes have been proposed in the classical setting (Davison et al., 1986; Beran, 1987; Hall and Martin, 1988; Shi, 1992). Since the Gaussian process bootstrap is not a re-sampling procedure in the classical sense, these techniques do not apply. In Giessing and Fan (2023) we therefore develop the spherical bootstrap to improve accuracy and efficiency when bootstrapping -statistics. However, this approach does not generalize to arbitrary empirical processes such as the one in Section 4.3. Thus, there is an urgent need for new bias correction schemes.
Acknowledgement
Alexander Giessing is supported by NSF grant DMS-2310578.
References
- Anderson (2003) T. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Statistics. Wiley, 2003. ISBN 9780471360919.
- Anderson (1963) T. W. Anderson. Asymptotic Theory for Principal Component Analysis. The Annals of Mathematical Statistics, 34(1):122 – 148, 1963.
- Bao et al. (2015) Z. Bao, G. Pan, and W. Zhou. Universality for the largest eigenvalue of sample covariance matrices with general population. The Annals of Statistics, 43(1):382–421, 2015.
- Belkin et al. (2018) M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 541–549. PMLR, 2018.
- Bentkus (2003) V. Bentkus. On the dependence of the Berry–Esseen bound on dimension. Journal of Statistical Planning and Inference, 113(2):385 – 402, 2003. ISSN 0378-3758.
- Beran (1987) R. Beran. Prepivoting to reduce level error of confidence sets. Biometrika, 74(3):457–468, 1987.
- Berlinet and Thomas-Agnan (2004) A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer, 2004.
- Bhattacharya and Holmes (2010) R. Bhattacharya and S. Holmes. An exposition of Götze’s estimation of the rate of convergence in the multivariate central limit theorem, 2010.
- Bohn et al. (2019) B. Bohn, C. Rieger, and M. Griebel. A representer theorem for deep kernel learning. The Journal of Machine Learning Research, 20(1):2302–2333, 2019.
- Bolthausen (1984) E. Bolthausen. An estimate of the remainder in a combinatorial central limit theorem. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 66(3):379–386, 1984.
- Bong et al. (2023) H. Bong, A. K. Kuchibhotla, and A. Rinaldo. Dual induction CLT for high-dimensional m-dependent data, 2023.
- Cattaneo et al. (2022) M. D. Cattaneo, R. P. Masini, and W. G. Underwood. Yurinskii’s coupling for martingales, 2022.
- Chen and Xu (2020) L. Chen and S. Xu. Deep neural tangent kernel and Laplace kernel have the same RKHS. In International Conference on Learning Representations, 2020.
- Chen (2018) X. Chen. Gaussian and bootstrap approximations for high-dimensional U-statistics and their applications. Ann. Statist., 46(2):642–678, 04 2018.
- Chen et al. (2015) Y.-C. Chen, C. R. Genovese, and L. Wasserman. Asymptotic theory for density ridges. The Annals of Statistics, 43(5):1896–1928, 2015.
- Chen et al. (2016) Y.-C. Chen, C. R. Genovese, R. J. Tibshirani, and L. Wasserman. Nonparametric modal regression. The Annals of Statistics, 44(2):489 – 514, 2016.
- Chernozhukov et al. (2013) V. Chernozhukov, D. Chetverikov, and K. Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 12 2013.
- Chernozhukov et al. (2014a) V. Chernozhukov, D. Chetverikov, and K. Kato. Anti-concentration and honest, adaptive confidence bands. The Annals of Statistics, 42(5):1787–1818, 2014a.
- Chernozhukov et al. (2014b) V. Chernozhukov, D. Chetverikov, and K. Kato. Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42(4):1564–1597, 08 2014b.
- Chernozhukov et al. (2015) V. Chernozhukov, D. Chetverikov, and K. Kato. Comparison and anti-concentration bounds for maxima of gaussian random vectors. Probability Theory and Related Fields, 162(1):47–70, Jun 2015.
- Chernozhukov et al. (2016) V. Chernozhukov, D. Chetverikov, and K. Kato. Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings. Stochastic Processes and their Applications, 126(12):3632–3651, 2016. ISSN 0304-4149. In Memoriam: Evarist Giné.
- Chernozhukov et al. (2017) V. Chernozhukov, D. Chetverikov, and K. Kato. Central limit theorems and bootstrap in high dimensions. The Annals of Probability, 45(4):2309–2352, 07 2017.
- Chernozhukov et al. (2019) V. Chernozhukov, D. Chetverikov, K. Kato, and Y. Koike. Improved central limit theorem and bootstrap approximations in high dimensions. arXiv preprint, arXiv:1912.10529, 12 2019.
- Chernozhukov et al. (2020) V. Chernozhukov, D. Chetverikov, and Y. Koike. Nearly optimal central limit theorem and bootstrap approximations in high dimensions. 2020.
- Chernozhukov et al. (2023) V. Chernozhukov, D. Chetverikov, K. Kato, and Y. Koike. High-dimensional data bootstrap. Annual Review of Statistics and Its Application, 10(1):427–449, 2023.
- Chetverikov (2019) D. Chetverikov. Testing regression monotonicity in econometric models. Econometric Theory, 35(4):729–776, 2019.
- Davison et al. (1986) A. C. Davison, D. V. Hinkley, and E. Schechtman. Efficient bootstrap simulation. Biometrika, 73(3):555–566, 1986.
- Deng and Zhang (2020) H. Deng and C.-H. Zhang. Beyond gaussian approximation: Bootstrap for maxima of sums of independent random vectors. arXiv preprint, arXiv:1705.09528, 2020.
- Dezeure et al. (2017) R. Dezeure, P. Bühlmann, and C.-H. Zhang. High-dimensional simultaneous inference with the bootstrap. Test, 26(4):685–719, 2017.
- Dudley (2002) R. Dudley. Real Analysis and Probability. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2002.
- Dudley (2014) R. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2014.
- El Karoui (2007) N. El Karoui. Tracy–Widom limit for the largest eigenvalue of a large class of complex sample covariance matrices. The Annals of Probability, 35(2):663 – 714, 2007.
- Fan et al. (2018) J. Fan, Q.-M. Shao, and W.-X. Zhou. Are discoveries spurious? Distributions of maximum spurious correlations and their applications. Ann. Statist., 46(3):989–1017, 06 2018.
- Fang and Koike (2021) X. Fang and Y. Koike. High-dimensional central limit theorems by Stein’s method. The Annals of Applied Probability, 31(4):1660 – 1686, 2021.
- Folland (1999) G. Folland. Real Analysis: Modern Techniques and Their Applications. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts. Wiley, 1999.
- Fujikoshi (1980) Y. Fujikoshi. Asymptotic expansions for the distributions of the sample roots under nonnormality. Biometrika, 67(1):45–51, 1980.
- Ghanem and Spanos (2003) R. Ghanem and P. Spanos. Stochastic Finite Elements: A Spectral Approach. Dover Publications, 2003.
- Giessing (2023) A. Giessing. Anti-Concentration of Suprema of Gaussian Processes and Gaussian Order Statistics. Working Paper, 2023.
- Giessing and Fan (2023) A. Giessing and J. Fan. A bootstrap hypothesis test for high-dimensional mean vectors. Working Paper, 2023.
- Giessing and Wang (2021) A. Giessing and J. Wang. Debiased inference on heterogeneous quantile treatment effects with regression rank-scores, 2021.
- Giné and Nickl (2016) E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2016.
- Götze (1991) F. Götze. On the rate of convergence in the multivariate CLT. The Annals of Probability, 19(2):724–739, 1991.
- Hall and Martin (1988) P. Hall and M. A. Martin. On bootstrap resampling and iteration. Biometrika, 75(4):661–671, 1988.
- Han et al. (2018) F. Han, S. Xu, and W.-X. Zhou. On Gaussian comparison inequality and its application to spectral analysis of large random matrices. Bernoulli, 24(3):1787 – 1833, 2018.
- Hardy (2006) M. Hardy. Combinatorics of partial derivatives. Electronic Journal of Combinatorics, 2006.
- Hsing and Eubank (2015) T. Hsing and R. Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley, 2015.
- Jacot et al. (2018) A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. NIPS’18, page 8580–8589, 2018.
- Jain and Kallianpur (1970) N. C. Jain and G. Kallianpur. A Note on Uniform Convergence of Stochastic Processes. The Annals of Mathematical Statistics, 41(4):1360 – 1362, 1970.
- Janková et al. (2020) J. Janková, R. D. Shah, P. Bühlmann, and R. J. Samworth. Goodness-of-fit Testing in High Dimensional Generalized Linear Models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(3):773–795, 05 2020.
- Johnstone (2001) I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295 – 327, 2001.
- Koltchinskii and Lounici (2017) V. Koltchinskii and K. Lounici. Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 23(1):110 – 133, 2017.
- Kuchibhotla et al. (2021) A. K. Kuchibhotla, S. Mukherjee, and D. Banerjee. High-dimensional CLT: Improvements, non-uniform extensions and large deviations. Bernoulli, 27(1):192 – 217, 2021.
- Le Cam (1986) L. Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer Series in Statistics. Springer New York, 1986.
- Lopes (2022a) M. E. Lopes. Central limit theorem and bootstrap approximation in high dimensions: Near rates via implicit smoothing. The Annals of Statistics, 50(5):2492 – 2513, 2022a.
- Lopes (2022b) M. E. Lopes. Improved rates of bootstrap approximation for the operator norm: A coordinate-free approach, 2022b.
- Lopes et al. (2019) M. E. Lopes, A. Blandino, and A. Aue. Bootstrapping spectral statistics in high dimensions. Biometrika, 106(4):781–801, 09 2019.
- Lopes et al. (2020) M. E. Lopes, Z. Lin, and H.-G. Müller. Bootstrapping max statistics in high dimensions: Near-parametric rates under weak variance decay and application to functional and multinomial data. Annals of Statistics, 48(2):1214–1229, 04 2020.
- Lopes et al. (2023) M. E. Lopes, N. B. Erichson, and M. W. Mahoney. Bootstrapping the operator norm in high dimensions: Error estimation for covariance matrices and sketching. Bernoulli, 29(1):428 – 450, 2023.
- Mendelson (2002) S. Mendelson. Geometric parameters of kernel machines. In International conference on computational learning theory, pages 29–43. Springer, 2002.
- Naumov et al. (2019) A. Naumov, V. Spokoiny, and V. Ulyanov. Bootstrap confidence sets for spectral projectors of sample covariance. Probability Theory and Related Fields, 174(3):1091–1132, 2019.
- Nourdin and Peccati (2012) I. Nourdin and G. Peccati. Normal approximations with Malliavin calculus: from Stein’s method to universality. Number 192. Cambridge University Press, 2012.
- Péché (2009) S. Péché. Universality results for the largest eigenvalues of some sample covariance matrix ensembles. Probability Theory and Related Fields, 143(3):481–516, 2009.
- Raič (2019) M. Raič. A multivariate Berry–Esseen theorem with explicit constants. Bernoulli, 25(4A):2824–2853, 11 2019.
- Rockafellar (1999) R. T. Rockafellar. Second-order convex analysis. J. Nonlinear Convex Anal, 1(1-16):84, 1999.
- Schölkopf and Smola (2002) B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive computation and machine learning. MIT Press, 2002.
- Sedaghat (2003) H. Sedaghat. Nonlinear Difference Equations: Theory with Applications to Social Science Models. Springer, 2003.
- Shi (1992) S. G. Shi. Accurate and efficient double-bootstrap confidence limit method. Computational Statistics and Data Analysis, 13(1):21–32, 1992.
- Singh and Vijaykumar (2023) R. Singh and S. Vijaykumar. Kernel ridge regression inference, 2023.
- Steinwart and Christmann (2008) I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
- Tanguy (2017) K. Tanguy. Quelques inégalités de superconcentration: théorie et applications. PhD thesis, Université Paul Sabatier-Toulouse III, 2017.
- Unser (2019) M. Unser. A representer theorem for deep neural networks. Journal of Machine Learning Research, 20, 2019.
- van der Vaart and Wellner (1996) A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
- Vershynin (2018) R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
- Wahba (1990) G. Wahba. Spline Models for Observational Data. SIAM, 1990.
- Waternaux (1976) C. M. Waternaux. Asymptotic distribution of the sample roots for a nonnormal population. Biometrika, 63(3):639–645, 1976.
- Yurinsky (2006) V. Yurinsky. Sums and Gaussian Vectors. Springer, 2006.
- Zhang and Cheng (2017) X. Zhang and G. Cheng. Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association, 112(518):757–768, 2017.
Supplementary Materials for “Gaussian and Bootstrap Approximations for Empirical Processes”
Alexander Giessing
Appendix A Two toy examples
In this section we present two examples to illustrate the limitations of existing Gaussian approximation results and the advantages of the new ones. The presentation is deliberately expository; we omit proofs as much as possible. In both examples, we take with , which plays a role in the construction of high-dimensional confidence regions and multiple testing problems (see Section 4.1).
Consider a simple random sample from the law of , where is a fixed vector and a centered random variable with finite third moment. Then, . Whence, the supremum of the empirical process reduces to the average of i.i.d. scalar-valued random variables and the classical univariate Berry-Esseen theorem yields the following bound on the Kolmogorov distance:
(17) |
This non-asymptotic bound is independent of the dimension and the weak variance of the Gaussian proxy process. If we ignore the low-rank structure of the data and instead use the Gaussian approximation results in Chernozhukov et al. (2014b) (Theorem 2.1 combined with Lemma 2.3) we obtain (qualitatively)
(18) |
where and hides a complicated multiplicative factor which, among other things, depends on the sample size , third and higher moments of , and the inverse of the discretization level (see Section 3.6). Since the covariance matrix has rank one, the unit ball in with respect to the intrinsic standard deviation metric is isometrically isomorphic to the interval . Hence, the metric entropy can be upper bounded independently of the dimension ; in particular, we have . However, the low-rank structure of the data also implies that the weak variance of the associated Gaussian proxy process vanishes; indeed, . Thus, the upper bound in (18) is in fact invalid (or trivial if we interpret ) and fails to replicate the univariate Berry-Esseen bound in (17).
In contrast, the new Gaussian approximation inequality (Theorem 1 in Section 3.2; Theorem A.1 in Giessing and Fan (2023)) is agnostic to the covariance structure of the data and yields
(19) |
where hides an absolute constant independent of , and the distribution of the ’s and . The upper bound in this inequality depends only on the third moment of the envelope function and the strong variance of the Gaussian proxy process . If we use that and for , we obtain
(20) |
This inequality is obviously dimension- and weak variance-free. In this sense, it recovers the essential feature of the univariate Berry-Esseen bound in (17) and improves over the bound in (18). The dependence on the sample size and the moments of is still sub-optimal; but refinements in this direction are beyond the scope of this paper. Obviously, in this example, we have chosen a rank one covariance matrix only to be able to compare the Gaussian approximation result with the Berry-Esseen theorem. Any low-rank structure implies a vanishing weak variance and, hence, a breakdown of the results in Chernozhukov et al. (2014b).
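The collapse to a univariate problem can be checked numerically. The sketch below simulates the rank-one example, assuming the function class consists of the (signed) coordinate projections so that the supremum is a maximum of absolute coordinate averages; the loading vector and the centered Exp(1) innovations are purely illustrative choices.

```python
import numpy as np

def ks_distance_rank_one(n=50, p=1000, reps=5000, seed=0):
    """Monte Carlo illustration of the first toy example: for X_i = v * z_i the
    covariance has rank one, the maximal absolute coordinate average collapses
    to a scaled univariate average, and the Gaussian approximation error does
    not depend on the dimension p (illustrative sketch, not a proof)."""
    rng = np.random.default_rng(seed)
    v = rng.uniform(0.5, 1.5, size=p)                  # fixed loading vector (illustrative)
    z = rng.exponential(1.0, size=(reps, n)) - 1.0     # centered, finite third moment, variance 1
    coord_means = np.sqrt(n) * z.mean(axis=1)[:, None] * v[None, :]   # sqrt(n) * mean of X_{ij} over i
    stats = np.abs(coord_means).max(axis=1)            # empirical process supremum, one value per replication
    proxy = np.abs(v).max() * np.abs(rng.standard_normal(reps))       # supremum of the Gaussian proxy
    grid = np.linspace(0.0, stats.max(), 200)
    ks = np.abs((stats[:, None] <= grid).mean(axis=0)
                - (proxy[:, None] <= grid).mean(axis=0)).max()
    return ks   # up to Monte Carlo noise, the value does not change with p, mirroring (20)
```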
Next, suppose that the data are a simple random sample drawn from the law of a random vector with mean zero, element-wise bounded entries almost surely for some , and equi-correlated covariance matrix for some . The constraints on guarantee that has full rank. Therefore, the multivariate Berry-Esseen theorem by Bentkus (2003) (Theorem 1.1) implies
(21) |
This upper bound is not useful in high-dimensional settings because the expected value is polynomial in the dimension , i.e. .
The Gaussian approximation results by Chernozhukov et al. (2014b) yield (again) inequality (18). Since the covariance matrix has full rank, the weak variance is now strictly positive and equal to the smallest eigenvalue of , i.e. . However, in this example, the metric entropy with respect to the intrinsic standard deviation metric poses a problem. Indeed, let and be the eigenvalues of . Then, the unit ball in with respect to can be identified with the weighted Euclidean ball . Let be the standard unit ball in with respect to the Euclidean distance. Then, whenever . Therefore, standard covering number arguments apply. Thus, the metric entropy grows linearly in the dimension , i.e. for . We conclude that the results in Chernozhukov et al. (2014b) are again not useful in high dimensions with .
The new Gaussian approximation inequality implies (again) (19). Since almost surely and it follows that . Moreover, by Theorem A.6 in Giessing and Fan (2023) and since , we have . Therefore, inequality (19) simplifies to
(22) |
This inequality is not only dimension- and weak variance-free but also improves qualitatively over both the results by Chernozhukov et al. (2014b) and Bentkus (2003). Intuitively, the reason why we are able to shed the -factor from the upper bound compared to the results in Bentkus (2003) is that we only take the supremum over all Euclidean balls with center at the origin whereas he takes the supremum over all convex sets in . (Note that to apply his bound in the context of this example, we need to take the supremum over at least all weighted Euclidean balls .) This second example is related to the first one insofar as the covariance matrix has “approximately” rank one. Indeed, as the dimension increases the law of the random vector concentrates in the neighborhood of the one-dimensional subspace spanned by the eigenvector associated with the largest eigenvalue of . This becomes even more obvious if we consider the standardized covariance with eigenvalues and as .
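A short computation makes the “approximately rank one” observation concrete, assuming the equi-correlated matrix has unit diagonal and off-diagonal entries ρ (the reading suggested by the description above): its spectrum consists of one large eigenvalue 1 + (p − 1)ρ and p − 1 eigenvalues equal to 1 − ρ, so the standardized matrix Σ/p has a single non-vanishing eigenvalue as p grows.

```python
import numpy as np

# Equi-correlated covariance (unit diagonal, off-diagonal rho, an assumption
# consistent with the description above): Sigma = (1 - rho) * I + rho * 1 1^T.
p, rho = 500, 0.3
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
eig = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
print(eig[0], 1 + (p - 1) * rho)   # largest eigenvalue matches the closed form
print(eig[1], 1 - rho)             # all remaining eigenvalues equal 1 - rho
print(eig[0] / p)                  # standardized top eigenvalue, close to rho for large p
```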
In both examples, the exact and approximate low-rank structures of the data are crucial in order to go from the abstract bound (19) to the dimension-/entropy-free bounds (20) and (22), respectively. This is not coincidental and is in fact representative of the entire theory that we develop in this paper: While our Gaussian and bootstrap approximation inequalities hold without assumptions on the metric entropy, in concrete examples they often only yield entropy-free upper bounds if the trace of the covariance operator is bounded or grows at a much slower rate than the sample size (e.g. low-rank, bounded effective rank, or variance decay, see Section 4). This is certainly a limitation of our theory, but a low-rank covariance (function) is an empirically well-documented property of many data sets and therefore a common assumption in multivariate and high-dimensional statistics (e.g. Anderson, 2003; Vershynin, 2018).
Appendix B Auxiliary results
B.1 Smoothing inequalities and partial derivatives
Lemma 1.
Let be arbitrary random variables. There exists a map such that
(i) for all and ,
where is a constant depending only on ; and
(ii) for all ,
where for real-valued .
Remark 1.
We can take (see proof).
Lemma 2.
Let be the map from Lemma 1 and define for . Let be the Ornstein-Uhlenbeck semi-group with stationary measure and positive definite covariance matrix .
(i) For arbitrary indices , , and all ,
(ii) for almost every the absolute value of the derivative in (i) can be upper bounded by
where , , and is the absolute constant from Lemma 1.
Remark 2.
While claim (i) looks a lot like a “differentiating under the integral sign” type result, it is more accurate to think of it as a specific smoothing property of the Ornstein-Uhlenbeck semigroup with stationary measure when applied to (compositions of Lipschitz continuous functions with) the map .
Lemma 3.
Let with . For and a map on set
where denotes the th standard unit vector in . For , , , and arbitrary indices ,
Remark 3.
The proof of this result on “partially regularized” functions is similar to the one on “fully regularized” functions (Folland, 1999, Theorem 8.14 (a)). The conditions on the functions can probably be relaxed, but they are sufficiently general to apply to the situations that we encounter.
Lemma 4 (Partial derivatives of compositions of almost everywhere diff’able functions).
Let and and is a null set with respect to the Lebesgue measure on . Then, and, for arbitrary indices , , and all ,
where is the set of all partitions of , denotes the number of “blocks of indices” in partition , and is the number of indices in block .
Remark 4.
This result is a trivial modification of the multivariate version of Faà di Bruno’s formula due to Hardy (2006). The modification is that we only require to be -times differentiable almost everywhere on . The formula is generally false if is only -times differentiable almost everywhere on .
Remark 5.
The first three partial derivatives are of particular interest to us. They are given by (whenever they exist)
Lemma 5 (Partial and total derivatives of -norms).
The map is partially differentiable of any order almost everywhere on with partial derivatives (whenever they exist)
for and . Moreover, is twice totally differentiable almost everywhere on with Jacobian and Hessian matrices (whenever they exist)
Remark 6.
Note that the first partial derivative can be re-written (less compactly) as a piecewise linear function.
B.2 Anti-concentration inequalities and lower bounds on variances
Lemma 6 (Giessing, 2023).
Let be a centered separable Gaussian process indexed by a semi-metric space . Set and assume that a.s. For all ,
The result remains true if is replaced by .
Remark 7.
If the covariance function of is positive definite, then the above inequalities hold even for uncentered and a.s.
Lemma 7 (Giessing, 2023).
Let be a separable Gaussian process indexed by a semi-metric space such that , , and for all . Set and assume that a.s. Then, and there exist absolute constants such that
with the convention that “”. The result remains true if is replaced by .
Lemma 9.
Let be arbitrary random variables. Then,
where for . Moreover, if , then
B.3 Quantile comparison lemmas
Throughout this section, and . For we define the th quantile of and , respectively, by
(23) |
Lemma 10.
For all ,
where and is an absolute constant.
If we use non-Gaussian proxy statistics to test hypotheses (such as Efron’s empirical bootstrap), we replace Lemma 10 by the following result:
Lemma 11.
Let be a statistic; and need not be independent. For arbitrary define . Then, for all ,
where , , , and is an absolute constant.
For we define the (conditional) th quantile of the supremum of by
and the th quantile of the supremum of the Gaussian -bridge process by
The following two lemmas are straightforward generalizations of the preceding lemmas to empirical processes. The proofs of Lemmas 12 and 13 are identical to the ones of Lemmas 10 and 11, respectively. We therefore omit them.
Lemma 12.
For all ,
where and is an absolute constant.
Lemma 13.
Let be an arbitrary stochastic process. For arbitrary define . Then, for all ,
where , , , and is an absolute constant.
B.4 Boundedness and continuity of centered Gaussian processes
This section contains classical results on boundedness and continuity of the sample paths of centered Gaussian processes. We provide a proof for Lemma 18 in Appendix F.4; all other results (with proofs or references to proofs) can be found in Appendix A of van der Vaart and Wellner (1996).
Throughout this section denotes a centered separable Gaussian process indexed by a semi-metric space , , , and the intrinsic standard deviation metric associated with .
Lemma 14 (Equivalence of bounded sample path and finite expectation).
is almost surely bounded on if and only if .
Lemma 15 (Reverse Liapunov inequality).
If is almost surely bounded on , then there exist constants depending on only such that
Lemma 16 (Sudakov’s lower bound).
Let be the -covering number of w.r.t. . Then, there exists an absolute constant such that
Consequently, if , then is totally bounded w.r.t. .
Lemma 17 (Metric entropy condition for bounded and continuous sample paths).
Let be the -covering number of w.r.t. . If , then there exists a version of that is almost surely bounded and has almost surely uniformly -continuous sample paths.
Lemma 18 (Continuous sample paths and modulus of continuity).
If is almost surely bounded on , then the sample paths of on are almost surely uniformly -continuous if and only if
(24) |
B.5 Auxiliary results for applications
In this section we collect several technical results needed for the applications in Section 4.
Lemma 19.
Let be i.i.d. sub-Gaussian random vectors with mean zero and covariance matrix . Define by . Then,
where and hides an absolute constant independent of , , and . (Here, we tacitly identify the ’s with linear maps and denotes the tensor product between these linear maps.)
Remark 9.
This result is useful because the upper bound is dimension-free in the sense that it only depends on the effective rank and the operator norm . However, the dependence on the sample size is sub-optimal.
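For reference, the effective rank in Remark 9 is understood here in the usual sense tr(Σ)/‖Σ‖_op, as in Koltchinskii and Lounici (2017); this is an assumption about notation, since the displayed bound is not reproduced above. The short check below evaluates it for the equi-correlated matrix from Appendix A, where it remains bounded by 1/ρ regardless of the dimension.

```python
import numpy as np

def effective_rank(Sigma):
    """Effective rank r(Sigma) = trace(Sigma) / operator norm of Sigma,
    the standard notion from Koltchinskii and Lounici (2017); assumed to be
    the quantity referred to in Remark 9."""
    eig = np.linalg.eigvalsh(Sigma)
    return eig.sum() / eig.max()

# Example: for the equi-correlated matrix from Appendix A the effective rank is
# p / (1 + (p - 1) * rho), which stays below 1 / rho for every dimension p.
p, rho = 500, 0.3
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
print(effective_rank(Sigma), 1 / rho)
```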
Lemma 20.
The bias-corrected kernel ridge regression estimator defined in (13) satisfies
where is a higher-order remainder term and
where hides an absolute constant.
Lemma 21.
Remark 10.
The quantity also appears in Singh and Vijaykumar (2023). For an interpretation and its relation to the effective rank of the operator we refer to Section H (ibid., pp. 80ff).
Remark 11.
The above upper bound is if (i) , (ii) , and (iii) .
Lemma 22.
Remark 12.
The above upper bound is if .
Lemma 23.
Remark 13.
If is pre-Gaussian, then weaker conditions and generic chaining arguments yield tighter bounds (e.g. Koltchinskii and Lounici, 2017, Theorem 9).
The next lemma is a version of Bernstein’s exponential tail bound for random elements on separable Hilbert spaces.
Lemma 24 (Theorem 3.3.4, Yurinsky, 2006; Lemma G.2, Singh and Vijaykumar, 2023).
Let be i.i.d. centered random elements on a separable Hilbert space with induced norm . Suppose that there exist absolute constants such that for all . Then, for arbitrary,
In particular, for arbitrary, with probability at least ,
where hides an absolute constant independent of , and .
Appendix C Proofs of the results in Section 2
C.1 Proof of Proposition 1
Proof of Proposition 1.
Our proof is inspired by Nourdin and Peccati (2012) (Theorem 3.7.1) who establish a Berry-Esseen type bound for the univariate case using an inductive argument (attributed to Bolthausen, 1984). The multivariate case requires several modifications, some of which we take from Götze (1991), Bhattacharya and Holmes (2010), and Fang and Koike (2021). We also borrow a truncation argument from Chernozhukov et al. (2017), which explains the qualitative similarity between their bound and ours. Original ideas in our proof are mostly those related to the way in which we use our Gaussian anti-concentration inequality (Lemma 6) and exploit the mollifying properties of the Ornstein-Uhlenbeck semi-group operator to by-pass dimension dependent smoothing inequalities (Lemmas 1 and 2).
Our proof strategy has two drawbacks: First, the inductive argument relies substantially on the i.i.d. assumption on the data. Generalizing this argument to independent but non-identically distributed data requires additional assumptions on the variances similar to the uniform asymptotic negligibility condition in the classical Lindeberg-Feller CLT. We leave this generalization to future research. Second, the recent results by Chernozhukov et al. (2020) suggest that our Berry-Esseen type bound is not sharp. Unfortunately, their proof technique (based on delicate estimates of Hermite polynomials) is inherently dimension dependent. Extending their approach to the coordinate-free Wiener chaos decomposition is another interesting research task.
The case of positive definite .
Suppose that is positive definite. Let be independent of and define, for each ,
Further, for each , let be the smallest constant greater than or equal to one such that, for all i.i.d. random variables with , , and ,
where
The factor 12 in front of ensures that so that . Indeed, one easily computes (by the equivalence of moments of suprema of Gaussian processes, e.g. van der Vaart and Wellner, 1996, Proposition A.2.4) and, hence, .
While the upper bound is too loose to conclude the proof, it is nonetheless an important first step towards a tighter bound. For the moment, assume that there exists an absolute constant such that
(25) |
Then, by construction of ,
(26) |
We shall now show that this difference inequality implies that independent of the distribution of the ’s: Define the map and consider the nonlinear first-order difference equation
We easily verify that the fixed point solving satisfies
and that for all and for all . We also notice that is monotone increasing on . Thus,
Hence, by Theorem 2.1.2 in Sedaghat (2003) every trajectory with converges to . In particular, . Returning to the inequality (26) we conclude that there exists such that for all and all . Since for all , it follows that for all and .
To complete the proof of the theorem, it remains to show that eq. (25) holds. Let , be arbitrary and be the map from Lemma 1. Define . By Lemma 6 and Lemma 1 (ii),
(27) |
where denotes the Ornstein-Uhlenbeck semi-group with stationary measure , i.e.
Since is Lipschitz continuous (with constant ) and positive definite, Proposition 4.3.2 in Nourdin and Peccati (2012) implies that
(28) |
where and
Since almost surely for all integrable maps (semi-group property!), we have
We proceed by re-writing eq. (28) in multi-index notation as
where the ’s are independent copies of the ’s. A Taylor expansion around with exact integral remainder term yields
(29) |
where is independent of the ’s and ’s. The first and fourth terms cancel out, and the third term vanishes because and are independent and have mean zero. (Eq. (C.1) is essentially a re-statement of Lemmas 2.9 and 2.4 in Götze (1991) and Raič (2019), respectively.)
Notice that because it is the convolution of a bounded Lipschitz map with the density of (e.g. Nourdin and Peccati, 2012, Proposition 4.2.2). Derivatives on are usually obtained by differentiating the density of (e.g. Götze, 1991, Lemma 2.1; Raič, 2019, Lemma 2.5 and 2.6; Fang and Koike, 2021, Lemma 2.2; and Chernozhukov et al., 2020, Lemmas 6.1, 6.2, 6.3). Here, we proceed differently. Let be the indices corresponding to the multi-indices and . By Lemma 2 (i) we have
where denotes the expectation taken with respect to the law of only. And by Lemma 2 (ii),
where
Let be the value of for corresponding to the indices . An application of Hölder’s inequality yields
(30) |
where in the last line we have used that almost surely because for all and (since is positive definite no pair of entries in and can be perfectly (positively or negatively) correlated!). Taking expectation with respect to the ’s and over the above inequality, we obtain
(31) |
Notice that and , and , where is an independent copy of . Set and bound the probability in line (C.1) by
(32) |
where we have used that under the i.i.d. assumption
By Lemma 6, Lemma 1 (ii), and monotonicity and concavity of the map , , we have, for arbitrary ,
(33) |
Combine eq. (C.1)–(C.1) with (C.1) and integrate over to conclude via eq. (27)–(C.1) and the i.i.d. assumption that there exists an absolute constant such that
(34) |
where we have used Harris’ association inequality to simplify several summands, i.e.
and
Observe that eq. (34) holds for arbitrary . Setting
we deduce from eq. (34), the definition of , and (because !) that, for all ,
We have thus established eq. (25). This concludes the proof of the theorem in the case of a positive definite covariance matrix.
The case of positive semi-definite .
Suppose that is positive semi-definite but not identical to zero.
Take and such that are mutually independent. Let be arbitrary and define , , and with . Clearly, and the ’s are i.i.d. with mean zero and positive definite covariance . Hence, by the first part of the proof there exists an absolute constant independent of , , and the distribution of the ’s (and hence, independent of !) such that for and ,
(35) |
where
and
At this point, it is tempting to take in eq. (35). However, this would yield the desired result only for the case in which the law of is continuous. (Alternatively, we could replace the supremum over by the supremum over , where is the set of continuity points of the law of .) Therefore, we proceed differently. Recall eq. (27), i.e.
where denotes the Ornstein-Uhlenbeck semi-group with stationary measure ,
Let be the Ornstein-Uhlenbeck semi-group with stationary measure and expand the above inequality to obtain the following modified version of eq. (27):
where (a) holds because is -Lipschitz and is -Lipschitz w.r.t. the metric induced by the -norm.
Since is positive definite we can proceed as in the first part of the proof and arrive at the following modified version of eq. (34): There exists an absolute constant such that
(36) |
Set
and combine eq. (35) and eq. (36) to conclude (as in the first part of the proof) that for all ,
(37) |
Since is arbitrary, we can take . To complete the proof, we only need to find the limit of the expression on the right hand side in the above display.
Let be a monotone decreasing null sequence. For all and , a.s. and (otherwise the upper bound in Proposition 1 is trivial!) and (e.g. van der Vaart and Wellner, 1996, Proposition A.2.4). Hence, is uniformly integrable for . Since in addition a.s., the sequence converges in . Arguing similarly, we conclude that the sequence converges in as well. Thus,
and
Hence, by eq. (37) we have shown that for all and . This completes the proof of the proposition. ∎
C.2 Proof of Corollary 2
Proof of Corollary 2.
Define and where . Thus, by Lemma 7,
(38) |
Moreover, for arbitrary,
and, hence, for and ,
(39) |
Combine the upper bound in Proposition 1 with eq. (38) and (39) and simplify the expression to conclude the proof of the first claim. (Obviously, this bound is not tight, but it is aesthetically pleasing.) The second claim about equicorrelated coordinates in follows from the lower bound on the variance of in Proposition 4.1.1 in Tanguy (2017) combined with the upper bound in Proposition 1 and eq. (39). ∎
C.3 Proof of Proposition 3
Proof of Proposition 3.
The main proof idea is standard, e.g. similar arguments have been used in proofs by Fang and Koike (2021) (Theorem 1.1) and Chernozhukov et al. (2020) (Theorem 3.2). While our bound is dimension-free, it is not sharp (e.g. Chernozhukov et al., 2020, Proposition 2.1).
The case of positive definite .
Suppose that is positive semi-definite and is positive definite. To simplify notation, we set
Moreover, for , arbitrary denote by the map from Lemma 1 and define . Then, by Lemma 6 and Lemma 1 (ii),
(40) |
Since is Lipschitz continuous (with constant ) and is positive definite, Proposition 4.3.2 in Nourdin and Peccati (2012) implies that
(41) |
where and
and denotes the Ornstein-Uhlenbeck semi-group with stationary measure , i.e.
Using Stein’s lemma we re-write eq. (41) as
Notice that the above identity holds even if is only positive semi-definite (e.g. Nourdin and Peccati, 2012, Lemma 4.1.3)! By Hölder’s inequality for matrix inner products,
(42) |
To complete the proof, we now bound the second derivative . By Lemma 2 (i), for arbitrary indices ,
and, hence, by Lemma 2 (ii),
where
Since is positive definite, no pair of entries in can be perfectly (positively or negatively) correlated. Therefore, almost surely. Hence,
(43) |
Combine eq. (42)–(43), integrate over , and conclude that
(44) |
To conclude, combine eq. (40) and eq. (44) and optimize over .
The case of positive semi-definite .
Suppose that both, and are positive semi-definite. To avoid trivialities, we assume that is not identical to zero.
Take independent of and, for arbitrary, define . Now, consider eq. (40), i.e.
Expanding the above inequality yields
(45) |
where the last inequality holds because is -Lipschitz continuous w.r.t. the metric induced by the -norm.
C.4 Proofs of Propositions 4 and 5
Proof of Proposition 4.
Proof of Proposition 5.
The proof is an adaptation of the proof of Theorem 3.1 in Chernozhukov et al. (2013) to our setup. To simplify notation we write and ; see also eq. (23). Note that
(47) |
For arbitrary, the first term can be upper bounded by Lemma 10 as
(48) | ||||
(49) |
where the second inequality follows from (several applications of) Lemma 6 and the third from the definition of quantiles and because has no point masses.
Appendix D Proofs of the results in Section 3
D.1 Proofs of Theorem 1 and Corollary 2
Proof of Theorem 1.
Let be arbitrary and define . Let be a -net of and set . Since is totally bounded with respect to , the ’s are finite. Moreover, for each there exists a map such that for all and, hence,
where the first inequality holds by the reverse triangle inequality and the prelinearity of the Gaussian -bridge process (e.g. Dudley, 2014, p. 65, eq. 2.4). (Here and in the following stands for the identity map.) Since the map is obviously linear, the same inequality holds for the empirical process .
By the triangle inequality,
Since is finite, Proposition 1 implies, for all ,
where is an absolute constant and
Therefore, by Strassen’s theorem (e.g. Dudley, 2002, Theorem 11.6.2),
(51) |
where is an absolute constant and
By the hypothesis on the moduli of continuity of the Gaussian -bridge and the empirical process, we can upper bound the right hand side in eq. (51) by
(52) |
Since is non-increasing in , we can assume without loss of generality that there exist such that for all and all , ; otherwise the upper bound is trivial by eq. (D.1). Thus, since , Lemma 9 with and yields, for all ,
where (a) holds because and (b) holds because of and for suprema of Gaussian processes we can reverse Liapunov’s inequality (i.e. Lemma 15, the unknown constant can be absorbed into ). Conclude that for all ,
(53) |
Combine eq. (51)–(53) and obtain for all ,
where
Increase the constant on the right hand side until the bound holds also for all . Since is arbitrary, set such that .
∎
Proof of Corollary 2.
We only need to verify that under the stated assumptions Theorem 1 applies with . Let be arbitrary and define , where and . Note that . First, by Lemma 16, is totally bounded w.r.t. the pseudo-metric . Second, by Lemma 18, as . Thus, for each , there exists some such that as . Third, . Thus, for each , there also exists such that as . This completes the proof.
(Note that the last argument does not imply that the empirical process is asymptotically -equicontinuous: Not only do we use a different norm, but we keep the sample size fixed and only consider the limit . This shows once again that Gaussian approximability is a strictly weaker concept than weak convergence, see also Remark 3 and the discussion in Section 2.2.) ∎
D.2 Proofs of Theorem 3 and Corollaries 4 and 5
Proof of Theorem 3.
The key idea is to construct “entangled -nets” of the function classes and . The precise meaning of this term and our reasoning for this idea will become clear in the course of the proof.
Let be arbitrary and define . Let and be - and -nets of and w.r.t. and , respectively. Since and are totally bounded w.r.t. and , respectively, the ’s and ’s are finite.
Next, let be the projection from onto defined by for all . If this projection is not unique for some , choose any of the equivalent points. Define the projection from onto analogously. Finally, define the “entangled -nets” of and as
Note that the entangled sets and are still - and -nets of and w.r.t. and . Hence, for each , there exist maps and such that for all and for all . Therefore, by the reverse triangle inequality and the linearity of Gaussian -bridge processes and -motions,
where and .
(54) |
By the triangle inequality,
(55) |
Since and (note: two finite dimensional mean zero Gaussian vectors with the same (!) index sets; to achieve this we needed the “entangled -nets”) we have by Proposition 3, for all ,
where is an absolute constant and, for all ,
Therefore, by eq. (54) and eq. (55), Strassen’s theorem (e.g. Dudley, 2002, Theorem 11.6.2; see also proof of Theorem 1), and Markov’s inequality,
(56) |
By the same arguments as in the proof of Theorem 1, there exists such that for all and all we have and, hence,
(57) |
Combine eq. (56) and eq. (57) to conclude that the claim of the theorem holds for all . Since for the upper bound is trivial by eq. (56), the theorem holds in fact for all . Since is arbitrary, set such that . ∎
Proof of Corollary 4.
D.3 Proofs of Theorems 6 and 9, Proposition 7, and Corollary 8
Proof of Theorem 6.
Proof of Proposition 7.
Our proof is modeled after the proof of Theorem 1 in Jain and Kallianpur (1970). However, unlike them we do not argue via the reproducing kernel Hilbert space associated to . Instead, we leverage the fact that under the stated assumptions has a version that is both a mean-square continuous stochastic process and a random element on some Hilbert space.
Since is compact it is totally bounded and, by Lemma 16, separable w.r.t. . Hence, the process has a separable and jointly measurable version (e.g. Giné and Nickl, 2016, Proposition 2.1.12; note that is the intrinsic standard deviation metric of the Gaussian -motion ). Since is compact and is continuous, is a bounded linear operator. Let be the eigenvalue and eigenfunction pairs of and define
(58) |
By continuity of , is a mean-square continuous stochastic process in . Moreover, by joint measurability, is also a random element on the Hilbert space . Hence, Theorems 7.3.5 and 7.4.3 in Hsing and Eubank (2015) apply, and we conclude that the partial sums converge to in as pointwise in .
Now, observe that
(59) |
Thus, the ’s are uncorrelated random variables. Since the ’s are necessarily Gaussian (inner product of a Gaussian random element with a deterministic function !), they are in fact independent Gaussian random variables with mean zero and variance . Therefore, by Lévy’s theorem, converges to almost surely pointwise in . Thus, for ,
(60) |
Since , Lemma 14 implies that is almost surely bounded. Moreover, since is continuous and compact, Lemma 4.6.6 (4) in Hsing and Eubank (2015) implies that is uniformly continuous. Consequently, by construction, is almost surely bounded and uniformly continuous, too. Thus, and can be considered random elements on the Banach space of bounded continuous functions on equipped with the supremum norm.
Recall that the dual space is the space of finite signed Borel measures on equipped with the total variation norm. The dual pairing between and is . We compute
(61) |
where (a) holds by (60) and (b) holds by Lemma 4.6.6 (3) in Hsing and Eubank (2015). Since (D.3) implies convergence of the dual pairings in probability for every , we have, by Itô-Nisio’s theorem,
almost surely. Recall that is a version of . This completes the proof. ∎
Proof of Corollary 8.
By Theorem 6 and arguments as in the proof of Corollary 2 we have, for each ,
where hides an absolute constant independent of , and .
The above approximation error can be further upper bounded by
If is compact w.r.t. , then by Mercer’s theorem (e.g. Hsing and Eubank, 2015, Lemma 4.6.6),
This completes the proof of the corollary. ∎
Appendix E Proofs of the results in Section 4
Proof of Proposition 1.
Since the problem is finite dimensional we do not have to develop the Karhunen-Loève expansion from Section 3.5. The covariance function of the empirical process with has the explicit form . Hence, a natural estimate of is . A version of a centered Gaussian process defined on and with covariance function is where . It is standard to verify that, under the assumptions of the theorem, the metric entropy integrals associated to the Gaussian processes and are finite for every fixed and . Thus, by Lemma 17 there exist versions of and which are almost surely bounded and almost surely uniformly continuous. Hence, the modulus of continuity condition (6) holds for these versions and we can take . It follows by Theorem 6 that, for all ,
Under Assumption 1 or 2 the right hand side in the above inequality vanishes as diverges. For details we refer to Appendix A.1 in Giessing and Fan (2023). This completes the proof. ∎
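As an aside, the finite-dimensional Gaussian proxy used in this proof is straightforward to simulate: draw vectors from a centered normal law with the plug-in covariance Σ̂ and record the maximal absolute coordinate. The sketch below does exactly this, assuming the supremum is taken over the (signed) coordinate projections; the diagonal jitter only stabilizes the Cholesky factorization of a possibly singular sample covariance and is not part of the construction in the proof.

```python
import numpy as np

def gaussian_proxy_sup(X, n_draws=2000, seed=None):
    """Monte Carlo draws of the supremum of the finite-dimensional Gaussian
    proxy with plug-in covariance Sigma_hat (sample covariance of the rows of X),
    reading the supremum as a maximum of absolute coordinates; an illustrative
    sketch, not the paper's exact construction."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    Sigma_hat = np.cov(X, rowvar=False, bias=True)          # plug-in covariance estimate
    L = np.linalg.cholesky(Sigma_hat + 1e-8 * np.eye(p))    # jitter keeps the factorization stable
    G = rng.standard_normal((n_draws, p)) @ L.T              # draws approximately N(0, Sigma_hat)
    return np.abs(G).max(axis=1)                             # supremum over (signed) coordinate projections
```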
Proof of Proposition 2.
Since the problem is finite dimensional we do not have to develop the Karhunen-Loève expansion from Section 3.5. Consider where is the empirical measure of the collection of , , the pushforward of under the map , and . Since each has a (not necessarily unique) representation in terms of , we will identify with pairs when there is no danger of confusion. Clearly, is an envelope function for .
Define the Gaussian processes , where with , and , where with .
Without loss of generality, we can assume that and are finite since otherwise the statement is trivially true. Thus, under the assumptions of the theorem, the metric entropy integrals associated to the Gaussian processes and are finite for every fixed and . Therefore by Lemma 17 there exist versions of and (which we also denote by and ) that are almost surely bounded and almost surely uniformly continuous. The modulus of continuity condition (6) holds for these versions and, in particular, we can take . By Theorem 6,
(62) |
In the remainder of the proof we derive upper bounds on the quantities on the right-hand side in the above display.
Note that
(63) |
where (a) follows from Hölder’s inequality and (b) holds because is sub-Gaussian with mean zero and covariance and therefore
Next, note that for all . Thus, we compute
Also, since for all symmetric matrices , we have
where the last inequality follows from similar arguments as used to derive the upper bound in (63). Thus,
(64) |
Moreover, since for all symmetric matrices and matching vectors , we have, for arbitrary ,
(65) |
Combine eq. (64)–(E) with Lemma 7 to obtain
(66) |
Next, since (again!) for all symmetric matrices and matching vectors , we compute
(67) |
where the last line holds since by Markov’s inequality and Lemma 19,
and by Theorem 4 in Koltchinskii and Lounici (2017)
Lastly, for , by the upper and lower bounds in (63),
(68) |
Proof of Proposition 3.
Let . Further, let be the empirical measure based on the , and the pushforward measure of under the map . Since is the best approximation in square loss, and and the ’s have mean zero. Hence,
where is a remainder term which satisfies . We plan to apply Theorem 9 to the far right hand side in the above display. To this end, we need to (i) find an envelope for the function class , (ii) establish compactness of , (iii) construct certain Gaussian processes (to be defined below) with continuous covariance functions, and (iv) establish almost sure uniform continuity of these processes w.r.t. their intrinsic standard deviation metrics.
Recall that for all and that for all and . Hence, for all and . Thus, is an envelope of the function class .
Next, let and be centered Gaussian random elements on with covariance operators and , respectively, i.e. for all ,
By Cauchy-Schwarz these covariance functions are obviously continuous w.r.t. . Denote the standard deviation metrics associated with above Gaussian random elements by and . Then, for all ,
and, completely analogous,
Since is compact and by assumption, these inequalities imply that both and (by continuity of the inner product) are compact w.r.t. the standard deviation metrics and . Moreover, since , these inequalities also imply that and . Hence, by Lemma 17 there exist versions of and that are almost surely bounded and have almost surely uniformly - and -continuous sample paths. In the following, we keep using and to denote these versions.
Hence, Theorem 9 applies and we have
(69) |
In the remainder of the proof we derive upper bounds on the quantities on the right-hand side in the above display.
First, from the proof of Lemma 20 we know that
Hence, by the proof of Lemma 22
and
Therefore,
(70) |
where and , .
Second, by Cauchy-Schwarz,
(71) |
Third, for ,
(72) |
Fifth, by Lemma 22, with probability at least ,
(75) |
Finally, Lemma 21 combined with (69)–(E), and implies that, with probability at least ,
(76) | ||||
where (a) holds provided that the sample size , regularization parameter , kernel , and operators and are such that
(77) |
where and , .
Simplifying these rates is beyond the scope of this illustrative example. The interested reader may consult Appendix H in Singh and Vijaykumar (2023) and Sections 6 and 7 in Lopes (2022b) for potentially useful results. If the Hilbert space is finite dimensional and is invertible, then these rates are satisfied for and (the exact rates feature some - factor for some ). ∎
Appendix F Proofs of the results in Section B
F.1 Proofs of Lemmas 1, 4, and 5
Proof of Lemma 1.
For and arbitrary, define
(78) |
Since for all and , it follows that (draw a sketch!)
(79) |
where and
where is an absolute constant such that . Since with support , we easily verify that the map is continuously differentiable and its th derivative satisfies
(80) |
where is a constant depending only on . This establishes the first claim of the lemma with . To prove the second claim of the lemma we proceed in two steps: First, we show that
By the chain of inequalities in eq. (79), for arbitrary,
(81) |
Similarly,
(82) |
Now, take the supremum over in eq. (F.1) and (F.1) and switch the roles of and . Next, we show
As before, we compute
(83) |
and
(84) |
To conclude, take the supremum over in (F.1) and (F.1) and switch the roles of and . ∎
Proof of Lemma 4.
The set on which is not differentiable is contained in the null set on which is not differentiable. Thus, is -times differentiable almost everywhere. The derivatives now follow from Hardy (2006). ∎
Proof of Lemma 5.
Since is a piecewise linear function, it has partial derivatives of any order at all its continuity points. Since the set of discontinuity points of forms a null set with respect to the Lebesgue measure on , is differentiable almost everywhere on . (These partial derivatives need not be continuous!) The expressions of these partial derivatives follow from direct calculation.
Even more is true: Since is Lipschitz continuous, Rademacher’s theorem implies that is in fact totally differentiable almost everywhere on . Furthermore, since is convex, Alexandrov’s theorem implies that satisfies a second order quadratic expansion almost everywhere on , i.e. for all ,
According to Rockafellar (1999) the matrix is symmetric, positive semi-definite, and equal to the Jacobian of for all at which exists. Thus, can be identified with the second derivative for almost all . ∎
F.2 Proofs of Lemmas 2 and 3
Proof of Lemma 2.
Throughout the proof we write with as defined in Lemma 1 and . To simplify the notation, we denote partial derivatives w.r.t by , e.g. we write for , for , etc. We use to denote the Lebesgue measure on .
Proof of part (i).
Special case .
Let and the -th standard unit vector in . Recall that , where is -Lipschitz w.r.t. the norm induced by the absolute value and is -Lipschitz w.r.t. the metric induced by the -norm. Thus, for and arbitrary,
Hence, the difference quotient is bounded uniformly in and . Furthermore, by Lemma 5 is differentiable -a.e. Therefore, the conditions of Corollary A.5 in Dudley (2014) are met and we can pass the derivative through both integrals (over and ) to obtain
(85) |
The cases .
“Off-the-shelf” differentiating under the integral sign seems to be only possible in the case . In all other cases, Corollary A.5 and related results in Dudley (2002) do not apply, because the higher-order difference quotients of are not uniformly integrable (considered as a collection of random variables indexed by a null sequence ). Therefore, to prove the claim for we develop a more specific inductive argument which is tailored explicitly to the map and the fact that we integrate w.r.t. a non-degenerate Gaussian measure.
Base case and .
Let , , and be the density of the law of . By a change of variable we can re-write eq. (85) as
(86) |
where
Since is full rank, is a smooth function of -a.e. . Hence, the map exists and is continuous for -a.e. . Furthermore, the map is uniformly bounded and integrable in and for -a.e. . Hence, by Corollary A.4 in Dudley (2014) we can pass a second partial derivative through the integral in (F.2) and compute
(87) |
In the following we show that one can “pull back” the partial derivative from onto . Our argument involves a version of Stein’s lemma (i.e. Chernozhukov et al., 2020, Theorem 11.1) and a regularization of . Several computations also crucially depend on the fact that .
Let be the standard mollifier
where is an absolute constant such that . For , set and define the “partial regularization” of a function on in its th coordinate by
(88)
where is the th standard unit vector in . With this notation, define
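A minimal sketch of the objects just introduced, assuming the usual conventions (the exact normalization and sign convention used here may differ): the standard mollifier is
\[
\varphi(x) \;=\; c \exp\!\Big( -\tfrac{1}{1 - x^2} \Big)\, \mathbf{1}\{ |x| < 1 \}, \qquad \varphi_\varepsilon(x) \;=\; \varepsilon^{-1} \varphi(x / \varepsilon),
\]
with $c > 0$ chosen so that $\int_{\mathbb{R}} \varphi(x)\, dx = 1$, and the partial regularization of a function $f$ in its $k$th coordinate is the one-dimensional convolution
\[
f^{(k)}_\varepsilon(x) \;=\; \int_{\mathbb{R}} f(x - t e_k)\, \varphi_\varepsilon(t)\, dt ,
\]
which smooths $f$ along the direction $e_k$ only and leaves all other coordinates untouched.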
Obviously, the regularization is just with the discontinuity at smoothed out. (We comment on the rationale behind this partial regularization after eq. (94).) In particular, exists for all , and, by Lemmas 4 and 5,
(89)
(90)
Observe that by Leibniz’s integral rule the partial derivative in line (90) takes the form
where
(91)
where , . Thus, by Lemma 1 (i) we easily verify that and are bounded and integrable.
We return to eq. (87). Adding and subtracting we expand its right hand side as
(92)
(93)
Consider the integral in line (92). A change of variable yields
By Stein’s lemma for non-differentiable (but bounded and integrable) functions (i.e. Chernozhukov et al., 2020, Theorem 11.1) the last line in the above display is equal to
where denotes the -th standard unit vector in .
Consider the expression in the previous line. Since exists and is uniformly bounded in , we can push the partial derivative through the expectation (e.g. Folland, 1999, Theorem 2.27 (b); the integrability condition is satisfied because we integrate w.r.t. the non-degenerate law of ). Thus, we have shown that the integral in line (92) equals
(94)
Obviously, we could have derived eq. (94) under any kind of smoothing. However, under the “partial regularization” of in its th coordinate the partial derivative takes on a simple closed-form expression (eq. (89)–(F.2)). The simple form of the expression in eq. (F.2) is particularly important as it strongly suggests that does not “blow up” as . Indeed, if we had “fully” regularized in all its coordinates, the expression in eq. (F.2) would not be as simple and its asymptotic behavior (as ) would be less clear.
We now show that the integral in eq. (94) converges to
(95)
as , where
We record the following useful bounds: First, by Lemma 1 (i), for all ,
(96)
(97)
Second, there exists (depending on and ) such that for all and ,
(98)
(99)
where the limit in the last line follows from Lemma 3.
Since the law of is non-degenerate,
and, hence, the event is a -null set for all and -a.e. (for a more detailed argument, see also the proof of part (ii) of this lemma below). Thus,
Therefore, by eq. (F.2), (96), and (98),
(100)
where the limit follows from Lemma 3. The limit in line (95) now follows by combining eq. (F.2) and (F.2) with eq. (89) and (90).
Lastly, we return to the integral in line (93). By eq. (97) and (98),
(101)
where the limit follows from Lemma 3.
Combine eq. (95) and (F.2) with eq. (87), (92), and (93) to conclude via Lemma 5 that for -a.e. ,
(102)
Notice that the order in which we take the partial derivatives and does not matter.
Base case and .
The strategy is identical to that of the preceding case . The only difference is the regularization of . Recall the notion of “partial regularization” from eq. (88) and define, for ,
The map exists for all , and, by Lemmas 4 and 5,
(103)
(104)
By Leibniz’s integral rule we find that the partial derivative in line (104) equals
(105)
where , . Thus, from Lemma 1 (i) we infer that and are both bounded and integrable.
The same arguments that led to eq. (87), (92), and (93) also yield
(106)
(107)
Repeating the arguments that gave eq. (94) we find that the integral in line (106) equals
(108)
We now study the behavior of the integrals in lines (107) and (108) as . The arguments are similar to those used in the case ; we provide them for completeness only. As before, let . By eq. (97) and (98) and Lemma 3,
(109)
Since the law of is non-degenerate, by eq. (F.2), (96), and (98) and Lemma 3,
(110)
Combine eq. (F.2) and (F.2) to conclude that, as , the integral in (108) converges to
(111)
Lastly, we turn to the integral in line (107). By eq. (97) and (98) and Lemma 3,
(112)
Thus, combining eq. (111) and (F.2) with eq. (106) and (107) and invoking Lemma 5 we conclude that, for -a.e. ,
(113)
Inductive step from to .
Suppose that for arbitrary indices , ,
(114)
Under the induction hypothesis (114) the argument that gave identity (87) also gives
(115)
As in the above case with , we now show that one can “pull back” the partial derivative from onto . Let be an arbitrary index and, for , define
where denotes the “partial regularization” in the th coordinate as defined in eq. (88). The map exists for all , and satisfies, by the chain rule,
(116)
(117)
(118)
Notice that the partial derivatives in eq. (117) and (118) follow the patterns derived in eq. (F.2) and (F.2), respectively.
Next, expand the right hand side of (115) as
(119)
(120)
Recall the argument developed to establish the limit (95) for . The same argument, now combined with (116)–(118), also yields that the integral in (119) converges to
(121)
Similarly, the same argument used to show that the integral in (93) vanishes for also guarantees that the integral in (120) vanishes. Combine eq. (115) and (119)–(F.2) to conclude the inductive step from to for .
Proof of part (ii).
Combine Lemmas 1 (i), 4, and 5 and conclude that, for all ,
where is the absolute constant from Lemma 1. Notice that
and, hence,
Thus, there exists a -null set such that for all ,
(122)
Since is positive definite, is absolutely continuous with respect to . Thus, the -a.e. upper bound (122) continues to hold when evaluated at and integrated over and . We conclude that for all ,
(123)
∎
Proof of Lemma 3.
For any function on and vector , we define the translation operator by . With this notation,
Without loss of generality we can assume that the ’s are bounded by one. Also, since integrates to one, we have
Hence, by the product comparison inequality,
(124)
Next, compute
Since and are both bounded by for all , Proposition 8.5 in Folland (1999) implies that, for all ,
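For reference, Proposition 8.5 in Folland (1999) is the continuity of translation in $L^p$: for $1 \le p < \infty$ and $g \in L^p$, $\| \tau_h g - g \|_{L^p} \to 0$ as $h \to 0$, with $\tau_h$ the translation operator defined above.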
Hence, by the dominated convergence theorem
Conclude that each summand in eq. (124) vanishes as . This completes the proof. ∎
F.3 Proofs of Lemmas 8, 9, 10, and 11
Proof of Lemma 8.
This result has already been established by Le Cam (1986, p. 402, Lemma 2); we provide a full proof for completeness only. We compute,
For the reverse inequality,
and, hence,
Take the supremum over and combine both inequalities. Then switch the roles of and to conclude the proof. ∎
Proof of Lemma 9.
Observe that . Now, by Cauchy-Schwarz,
Similarly, use the estimate in the first line of the above display to obtain
Combine both inequalities to obtain the desired two-sided inequality. Further, if , then by convexity of the map , , and Jensen’s inequality,
∎
Proof of Lemma 10.
The proof is identical to that of Lemma 3.2 in Chernozhukov et al. (2013). Let . By Lemma 3 there exists an absolute constant such that, on the event , for all ,
In particular, for , we obtain
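The generic quantile step used here, stated for orientation with hypothetical random variables $V$ and $W$: if $\mathbb{P}(V \le x) \ge \mathbb{P}(W \le x) - \pi$ for all $x \in \mathbb{R}$, then the (generalized) quantile functions satisfy $q_V(\alpha) \le q_W(\alpha + \pi)$ for all $\alpha \in (0, 1 - \pi)$, since $\mathbb{P}\big(V \le q_W(\alpha + \pi)\big) \ge \alpha + \pi - \pi = \alpha$.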
To conclude the proof of the first statement, apply the definition of quantiles. The second claim follows in the same way. ∎
F.4 Proof of Lemma 18
Proof of Lemma 18.
First, we show necessity: Suppose that the sample paths of on are almost surely uniformly continuous w.r.t. , i.e. for almost every ,
Since is Gaussian, by Lemma 14, for arbitrary,
Hence, the claim follows from the dominated convergence theorem.
Next, we show sufficiency: Given the premise, we can find a null sequence such that
(125)
Define events
By Markov’s inequality and (125),
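The two generic ingredients of this step, recorded for orientation (writing $A_m$ for the events just defined): Markov's inequality, $\mathbb{P}(Z > t) \le \mathbb{E}[Z] / t$ for $t > 0$ and $Z \ge 0$, and the first Borel-Cantelli lemma, by which $\sum_{m \ge 1} \mathbb{P}(A_m) < \infty$ implies $\mathbb{P}(\limsup_{m} A_m) = 0$, i.e. almost surely only finitely many of the events $A_m$ occur.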
Thus, by Borel-Cantelli, , i.e. is almost surely uniformly continuous on . (Note that we have not used the fact that is Gaussian and the intrinsic standard deviation. Hence, sufficiency holds for arbitrary stochastic processes on general metric spaces.) ∎
F.5 Proofs of Lemmas 19, 20, 21, 22, and 23
Proof of Lemma 19.
Throughout the proof denotes the tensor product between linear maps.
Let . Define the map . This map is Lipschitz continuous with Lipschitz constant and . Let be i.i.d. Rademacher random variables independent of . Then, the symmetrization and contraction principles applied to yield
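For orientation, the symmetrization step has the generic form
\[
\mathbb{E}\Big\| \sum_{i=1}^{n} \big( \xi_i - \mathbb{E}\xi_i \big) \Big\| \;\le\; 2\, \mathbb{E}\Big\| \sum_{i=1}^{n} \varepsilon_i\, \xi_i \Big\|
\]
for independent summands $\xi_i$ and independent Rademacher signs $\varepsilon_i$, and the contraction principle then removes the Lipschitz map at the cost of (a multiple of) its Lipschitz constant; the precise form used here involves the map and constants introduced above.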
We upper bound the right-hand side of the above display using the Cauchy-Schwarz, Hoffmann-Jørgensen, and de-symmetrization inequalities by
Since is sub-Gaussian with mean zero and covariance , we compute
Hence, by Lemma 35 in Giessing and Wang (2021) and Lemma 2.2.2 in van der Vaart and Wellner (1996) (see Exercise 2.14.1 (ibid.) for how to handle the non-convexity of the map inducing the -norm)
Moreover, by Theorem 4 in Koltchinskii and Lounici (2017),
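For reference, the bound in Koltchinskii and Lounici (2017) has the general shape
\[
\mathbb{E}\big\| \widehat{\Sigma} - \Sigma \big\|_{op} \;\lesssim\; \| \Sigma \|_{op} \Big( \sqrt{ \tfrac{\mathbf{r}(\Sigma)}{n} } + \tfrac{\mathbf{r}(\Sigma)}{n} \Big), \qquad \mathbf{r}(\Sigma) := \frac{ \operatorname{tr}(\Sigma) }{ \| \Sigma \|_{op} },
\]
for the sample covariance operator $\widehat{\Sigma}$ of $n$ i.i.d. centered Gaussian observations with covariance operator $\Sigma$ and effective rank $\mathbf{r}(\Sigma)$; the statement invoked here may differ in constants and in the moment assumptions.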
Thus, we conclude that
Adjust some absolute constants to complete the proof. ∎
Proof of Lemma 20.
To simplify notation we write for .
We begin with the following fundamental identity: Let be the identity operator and be such that is invertible. Then,
Indeed, we have . Now, rearrange the terms on the far left- and right-hand sides of this identity to conclude. Applied to , we obtain
(126)
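One standard identity of this type, recorded here only for orientation: if $I + \Delta$ is invertible, then
\[
(I + \Delta)^{-1} \;=\; I - \Delta + (I + \Delta)^{-1} \Delta^2 ,
\]
which is verified by multiplying both sides by $I + \Delta$ and rearranging; the precise operators substituted for $\Delta$ in (126) are those introduced above.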
Hence, we compute
(127)
where (a), (b), and (c) follow from identity (126).
We now further upper bound the terms in (127). Retracing the steps from (c) to (b), we obtain
(128)
Note that
(129)
Recall that for all . Let be the unit ball of . We have
(130)
where (a) and (b) hold because and are self-adjoint, (c) follows from the reverse triangle inequality and the definition of the operator norm, and (d) follows from (129).
Proof of Lemma 21.
Observe that . Therefore, by Lemma 23, with probability at least ,
and
and
We combine the above three inequalities and conclude that, with probability at least ,
∎
Proof of Lemma 22.
Note that and commute for any operator . Hence, we compute
Bound on term . Since , almost surely, and Bernstein’s inequality for real-valued random variables implies that, with probability at least ,
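In its standard high-probability form (the constants in the version used here may differ): for i.i.d. real-valued, centered random variables $Z_1, \dots, Z_n$ with $|Z_i| \le M$ almost surely and $\mathbb{E}[Z_i^2] \le \sigma^2$, Bernstein's inequality yields, with probability at least $1 - \delta$,
\[
\Big| \frac{1}{n} \sum_{i=1}^{n} Z_i \Big| \;\le\; \sigma \sqrt{ \frac{ 2 \log(2/\delta) }{ n } } + \frac{ M \log(2/\delta) }{ 3 n } .
\]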
Also,
Hence, with probability at least ,
Bound on term . Compute
Since for , it follows that
Hence, by Lemma 23, with probability at least ,
(134)
Next, recall the approach that led to the bound in (F.5) in the proof of Lemma 20. We iterate this approach to obtain
Thus, by Lemma 23 and since , with probability at least ,
(135)
Combine (134) and (135) and conclude that, with probability at least ,
Since it follows that . Hence, under the rate conditions the upper bounds on terms and simplify as stated in the lemma. ∎
Proof of Lemma 23.
Proof of claim (i). Let and be arbitrary. Since minimizes the expected square loss, we have
Therefore, as in the proof of Lemma G.4 in Singh and Vijaykumar (2023), we compute,
and
Moreover, since is separable and is continuous, is separable as well. Hence, the conditions of Lemma 24 are satisfied with and . This completes the proof of the first claim.
Proof of claim (ii). Let be arbitrary. Then . Moreover,
and
The space of linear operators equipped with the Hilbert-Schmidt norm is a Hilbert space. This space can be identified with the tensor product . Since this tensor product space is separable whenever is separable, the conditions of Lemma 24 are satisfied and . This completes the proof of the second claim. ∎