Tail Bounds for Canonical -Statistics and -Processes with Unbounded Kernels††thanks: An initial version of this work was available here. This draft is a revised version with some edits.
Abstract
In this paper, we prove exponential tail bounds for canonical (or degenerate) -statistics and -processes under exponential-type tail assumptions on the kernels. Most of the existing results in the relevant literature often assume bounded kernels or obtain sub-optimal tail behavior under unbounded kernels. We obtain sharp rates and optimal tail behavior under sub-Weibull kernel functions. Some examples from nonparametric and semiparametric statistics literature are considered.
Keywords and phrases: Degenerate -statistics and -processes, Unbounded kernels, Sub-Weibull tails, Exponential tail bounds, Nonparametric/semiparametric statistics.
1 Introduction and Motivation
In this paper, we study moment and tail bounds of second-order degenerate -statistics and -processes. Averages, the simplest function of a collection of random variables, are sums with each summand depending only on one element of the collection. On the other hand, -statistics depend on tuples of elements in the collection. Formally, second-order -statistics based on the collection of random variables is of the form
(1) |
for some functions . In this paper, we consider -statistics defined on independent but possible non-identically distributed random variables defined on some measurable space. -statistics, in general, are ubiquitous in statistical applications including, e.g., goodness-of-fit tests, two-sample tests using kernel-based distances as well as independence testing via permutation tests; see Kim, (2020) for an overview of this literature.
We motivate our interest in -statistics using a few prototypical examples. Suppose are independent and identically distributed (i.i.d.) realizations of a random vector with Lebesgue density . Consider the problem of estimating the quadratic functional
(2) |
A natural estimator for this functional is given by
(3) |
where represents the bandwidth and represents the leave-one-out kernel density estimator:
(4) |
Here the function is assumed to be symmetric and satisfies . This estimator was introduced by Hall and Marron, (1987) and was studied thoroughly (in terms of adaptivity) for in Giné and Nickl, (2008).
Similarly, to estimate integrals involving the conditional expectation function from i.i.d. realizations of , the following -statistics appears:
(5) |
Aside from these prototypical examples, various other examples of such -statistics are encountered in the literature on integral approximation involving kernel smoothing estimators (Newey and Ruud,, 2005; Delyon and Portier,, 2016) and the semiparametric inference literature on quadratic and integral-type functionals (Robins et al.,, 2016). In the latter literature, -statistics of this type – especially in their degenerate form (see below for the definition) – are fundamentally involved in the analysis of so-called doubly robust estimators of certain functionals encountered in missing data or causal inference problems (Robins et al.,, 1994; Bang and Robins,, 2005), as well as in the literature on adaptive estimation of functionals based on so-called higher order influence functions (Robins et al.,, 2008, 2017; Liu et al.,, 2021).
Apart from the nonparametric and semiparametric statistics literature, second order -statistics also arise in relation to Hanson-Wright-type inequalities. The classical Hanson-Wright inequality concerns tail bounds for the quadratic form where is a standard multivariate normal random vector in and is a positive semi-definite matrix; see Theorem 3.1.9 of Giné and Nickl, (2016). For further applications of Hanson-Wright inequalities, see Rudelson and Vershynin, (2013) and Spokoiny and Zhilova, (2013), as well as the recent work of He et al., (2024) on sparse random vectors. Note that for any random vector and matrix
(6) |
where represents the -th row, -th column entry in the matrix
Motivated by the examples above, we study the properties of the -statistic . Before proceeding further, we briefly discuss degenerate and non-degenerate -statistics. See Serfling, (1980, Chapter 5) for more details. This discussion proves that for a precise understanding of the tail behavior of a -statistics it suffices to consider degenerate -statistics. In fact, most of the asymptotic normality results related to -statistics are shown by proving asymptotic negligebility of the degenerate -statistics compared to the linear statistic; see, for example, Chen and Kato, (2020). This paper is partly motivated by the cases where such asymptotic negligebility may not hold. For example, in the context of estimating based on IID observations with mean , the unbiased estimator exhibits a phase transition at in terms of rate and also the limiting distribution.
Degenerate or Canonical -statistics.
For any sequence of functions (called kernels) and independent random variables , a -statistic is given by
Note that the diagonal terms ( cases) are ignored in the summation above. If these diagonal terms are included then the resulting statistic is called a -statistic. The -statistic is called degenerate or canonical if the kernel functions satisfy
(7) |
If the kernel functions do not satisfy (7), then the corresponding -statistic is called non-degenerate. It is not difficult to see that a non-degenerate -statistic can be written as a sum of independent mean zero random variables and a degenerate -statistic:
(8) |
where
(9) | ||||
It is clear from these expressions that the kernels satisfy (7) and so are degenerate kernels. Since and in (8) are sums of independent random variables with mean zero, they can be understood easily from the classical results like the central limit theorem (asymptotically) and Bernstein/Hoeffding or more general inequalities (non-asymptotically). For this reason, we focus mostly on the degenerate part of (8) in the rest of the paper and derive non-asymptotic moment as well as tail bounds when the non-degenerate -statistics is of the form (1). Our main tool is the decoupling inequality proved in de la Peña, (1992). We refer to de la Peña and Giné, (1999, Chapter 3) for more details regarding decoupling in -statistics.
After deriving non-asymptotic tail bounds for degenerate -statistics, we provide the same for supremum of degenerate -statistics over a function class. Suppose is a class of sequence of functions (degenerate kernels) of type and define
Then can be viewed as a process called the -process and we provide exponential tail bounds for the supremum:
An important application would be the study of uniform-in-bandwidth properties of the estimator in (3), that is,
for some numbers . Further applications can be found in de la Peña and Giné, (1999, Section 5.5) and Major, (2013). As a final note, we mention that even though our techniques extend to -statistics/processes of higher order, we restrict ourselves to second order -statistics/processes for simplicity and ease of exposition.
1.1 Related Literature
In this section, we review some of the by-now classical exponential tail bounds for degenerate -statistics and supremum of -processes. Proposition 2.3 of Arcones and Giné, (1993) proved a Bernstein type inequality for degenerate -statistics/processes. Specifically, for the degenerate -statistics
with i.i.d. random variables , and , they show there exists constants such that for any ,
This tail bound has two regimes: exponential and Weibull of order . Because of the appearance of the variance, this tail bound provides the correct rate of convergence. Theorem 3.3 of Giné et al., (2000) improved the tail bound by providing the optimal four regimes of the tail: Gaussian, exponential, Weibull of orders and . Houdré and Reynaud-Bouret, (2003) gave an alternative proof to the result of Giné et al., (2000) using martingale inequalities with explicit constants. In particular, Theorem 3.3 of Giné et al., (2000) shows that for all ,
for some constants and . The main disadvantage of the results above is the restrictive boundedness assumption. Theorem 3.2 of Giné et al., (2000) actually applies without the boundedness condition but the tail bound thus obtained is sub-optimal. For example, if where , and ’s are mean zero (conditionally) sub-Weibull variables of order , that is, . Then, Theorem 3.2 of Giné et al., (2000) implies a tail bound of the form:
where and . This is sub-optimal in comparison with the results of Kolesko and Latała, (2015, Example 3). On the other hand, the results of Kolesko and Latała, (2015) do not get the correct rate of convergence as can be obtained from the results of Giné et al., (2000). This is because the bound of Kolesko and Latała, (2015) does not depend on the variance. We are not aware of any tail bounds in the literature that implies the correct rate of convergence as well as the optimal tail behavior. We also note here the recent work of Bakhshizadeh, (2023) which appeared after the initial working version (Chakrabortty and Kuchibhotla,, 2018) of this preprint. While they do consider general unbounded kernels, their focus is primarily on exponential bounds and large deviation principles for non-degenerate -statistics, different from ours.
In regards to the tail bounds for degenerate -processes, some of the important works are Adamczak, (2006), Clémençon et al., (2008) and Major, (2013). The latter two papers only consider bounded kernels and the bounds of Adamczak, (2006) are written in terms of functionals that are in general hard to control. The results of Major, (2005) and Major, (2013) apply only to bounded kernels and are written for VC classes but imply the correct rate of convergence. However, the results there do not show the optimal four regimes in the tail behavior. Theorem 11 of Clémençon et al., (2008) is written as a deviation inequality but does not imply the correct rate of convergence. For instance, if with being Rademacher random variables independent of , then the rate of convergence of
from Theorem 11 of Clémençon et al., (2008) is (because of in the moment bound) but the correct rate of convergence is (that can be obtained by calculating the variance). As in the case of -statistics, we are not aware of any tail bound results that can obtain the correct rate of convergence and apply to unbounded kernels. Using the techniques of truncation, decoupling technique and the entropy method of Boucheron et al., (2005), we prove a deviation inequality for degenerate -processes that implies the correct rate of convergence and the optimal tail behavior.
Organization.
The rest of the article is organized as follows. In Section 2 we prove exponential tail bounds for second order degenerate -statistics. In Section 3 we prove a deviation bound for degenerate -processes and also provide maximal inequalities to control the expectation of the maximum. The proofs of all the results are distributed in Appendices A, B and C.
2 Tail Bounds for Degenerate –Statistics
We prove two tail bounds for degenerate -statistics. The first is a general result applicable to all kernels that are bounded above by a product kernel and the second result is for more structured kernels that are of importance in non- and semi-parametric estimation. Define a random variable to be sub-Weibull of order if where for and
Several properties of sum of independent sub-Weibull random variables are derived in Kuchibhotla and Chakrabortty, (2022). The main focus of this section is to extend these results to degenerate -statistics.
Consider a degenerate -statistics
where are independent random variables and is a collection of degenerate (or canonical) kernels, i.e.,
We assume the following on the degenerate kernel :
-
(A1)
For , there exist non-negative functions and such that
The first part of assumption (A1) implies that the degenerate kernel can be expressed as for some collection of bounded kernels . No further structure on ’s is required. In the second result we consider below, we place additional structure on ’s. The second part of assumption (A1) means that
Equivalently, is sub-Weibull and is sub-Weibull, in the terminology of Kuchibhotla and Chakrabortty, (2022). To present the result, we define a few quantities. Let be an independent copy of .
(10) |
The quantities and also appear in the moment bound for degenerate -statistics with bounded kernels; see Theorem 3.2 of Giné et al., (2000). Note that can be trivially bounded as
Similar comment holds for as well. We use and to denote universal constants, constants depending on , respectively. We now present the first main result.
Theorem 1.
Under assumption (A1), for every ,
(11) |
where and . Consequently, for every , with probability at least ,
Proof.
See Appendix A.1 for a proof. ∎
Theorem 1 reduces to Theorem 3.2 of Giné et al., (2000) by setting ; note that if , then and the log factors in the result become 1. The result is asymmetric in only because of the structure of the proof. One can apply the result by switching the roles of and take the minimum of the two bounds. We do not present this for brevity. It is interesting to note that the tail exhibits five different behaviors including the commonly expected sub-Gaussian and sub-exponential tails. Because we did not make any assumption on the symmetry of the kernel, and can be different. Under an assumption of symmetry, and Theorem 1 now yields a tail bound that only exhibits five regmies.
Assuming only (A1), Theorem 1 provides a moment and tail bound for degenerate -statistics. The appearance of the constants and might make this result difficult to apply in some applications. For this reason, we provide our second result assuming a little more structure on the kernel. Suppose we have independent random variables on some measurable space and sequence of functions . Consider, for functions and , the -statistic
(12) |
The kernels are not required to be degenerate here. We will derive moment and tail bounds for the degenerate version of the -statistics given by
for the kernel defined in (9). We first prove a basic lemma that reduces the problem of moment bounds on to a symmetrized version of ; see Theorem 3.5.3 of de la Peña and Giné, (1999). For any random variable , set for .
Lemma 1.
For any ,
for Rademacher random variables . Here can be taken to be and represents an independent of independent random variables such that is identically distributed as for .
The proof of this lemma (given in Appendix A.2) is based on the by-now classical decoupling inequalities of de la Peña, (1992) and de la Peña and Giné, (1999, Chapter 3). The result also holds in case of degenerate -processes and does not require the special structure of the kernels in (12).
To prove moment and tail bounds for degenerate second order -statistics with unbounded kernels, we use the following assumptions. Consider the following assumptions.
-
(B1)
There exists constants such that
hold almost surely.
-
(B2)
The functions are all uniformly bounded, that is,
The main technique in our proof is truncation and Hoffmann-Jørgensen’s inequality. Assumption (B1) implies that conditional on ’s the maximum of is at most a polynomial of (in rate). This along with Assumption (B2) allows us to apply truncation at this rate and study the truncated part using the sharp results of Giné et al., (2000). The unbounded parts of smaller order are controlled using Hoffmann-Jørgensen’s inequality. The bound in Assumption (B2) is allowed to grown in and all the kernels are also allowed to be function of . All the results to be presented here are non-asymptotic. For more applications of this technique see Kuchibhotla and Chakrabortty, (2022).
Define
and the truncated random variables
(13) |
It is clear that and Based on these, note that
(14) |
The first term on the right hand side is bounded by . The second and third terms are non-zero only when and , are respectively non-zero, which can only happen with only a small probability under Assumption (B1). Finally, the fourth term can be non-zero only if both and are non-zero which can happen with even smaller probability. These four terms leads to four different degenerate -statistics that will be controlled separately in Section A.3 to prove the following result. We need the following notation: for ,
Define and
The quantities also appear in the case of bounded kernels as shown in Theorem 3.2 of Giné et al., (2000).
Theorem 2.
Proof.
See Appendix A.3 for a proof. ∎
Remark 2.1 (Comparison with previous results) As noted in the introduction, an important feature of our result is that the kernel is allowed to be unbounded with proper tail behavior. The tail of the degenerate -statistics as shown in (15) has seven different regimes, the prominent ones being the Gaussian and exponential parts. These seven regimes collapse to five if . In particular, if , then for ,
If , then our assumption (B1) implies boundedness of the kernels. In this case, only four regimes remain and these four regimes coincide with those shown in Theorem 3.2 of Giné et al., (2000). Additionally in the case of bounded kernels (), Theorem 2 essentially coincides with Theorem 3.2 of Giné et al., (2000) except for the additional and factors. We believe these to be artifacts of our proof and closely following the proof of Theorem 1, they could be avoided.
3 Tail Bounds for Degenerate –Processes
In this section, we generalize Theorem 2 to degenerate -processes. Consider
for some function class with elements of the type . If is a singleton, then this reduces to the -statistic studied in Section 2. Here denote an independent sequence of Rademacher random variables as before. For simplicity, we consider the symmetrized version and by Lemma 1 the results also hold for the original degenerate -process; see Theorem 3.5.3 of de la Peña and Giné, (1999) for details.
-processes were introduced in Nolan and Pollard, (1987) to study cross-validation in the context of kernel density estimation. They studied uniform almost sure limit theorems and established the rate of convergence. These results parallel the Glivenko-Cantelli theorems well-known for empirical processes. Functional limit theorems were established in Nolan et al., (1988). Exponential tail bounds that parallel the classical Bernstein’s inequality for non-degenerate and degenerate -statistics were given in Arcones and Giné, (1993). They also established LLN and CLT type results under various metric entropy conditions. Most of these results require boundedness of the kernel functions. Being asymptotic in nature, some of these results can be extended to the case of unbounded kernels using a truncation argument. Finite sample concentration inequalities for degenerate unbounded -processes are not readily available.
The only work (we are aware of) that provides general results for -processes applicable to is Adamczak, (2006). In this work, degenerate -processes of arbitrary order were considered. However, the moment bounds for -processes in this work depend further on the moment bounds of some complicated degenerate -processes of lower order. Furthermore, the tail behavior thus obtained is not sharp for unbounded -processes.
To avoid measurability issues for , we use either of the following conventions. One simple assumption on used in van der Vaart and Wellner, (1996) that implies measurability is separability and in this case we can take to be a dense countable subset of . Another convention used in Talagrand, (2014) is to define for any and increasing function ,
Based on either convention, we treat as a countable set for the remaining part of this section.
One “simple” way to obtain tail bounds for is via generic chaining as follows: First apply Theorem 2 for for functions . The tail bound (15) provides a mixed tail in terms of various semi-metrics on . Using these and following the proof of classical generic chaining bound (e.g., Theorem 3.5 of Dirksen, (2015)), one can obtain tail bounds for -processes in terms of -functionals; see Talagrand, (2014) and Dirksen, (2015) for details. A problem with this approach is the complication in controlling the -functionals. This approach with Dudley’s chaining (instead of generic chaining) was used for bounded kernel -processes in Nolan and Pollard, (1987) and Nolan et al., (1988).
In the following, we first provide a deviation inequality for and then prove a maximal inequality to control the expectations appearing in the deviation inequality. For these results, we consider the following generalization of assumption (B2).
-
(A2′)
The functions are all uniformly bounded, that is,
We will use the notation of given in (13). For the main result of this section, define
Here in the definitions, the supremum over (or ) represents supremum over all function (or )satisfying
where denotes the probability measure of . Note that is similar to defined for Theorem 2.
Theorem 3.
Proof.
See Appendix B.1 for a proof. ∎
If is a singleton set, then the above result reduces to Theorem 2. From the moment bound above, it is easy to derive a tail bound using Markov’s inequality. In comparison, we again get seven different tail regimes that again reduce to five if . Unlike the result of Adamczak, (2006), the moment bound in Theorem 3 only depends on some expectations. An additional advantage of Theorem 3 is that all the expectations only involve bounded random variables.
3.1 Maximal Inequality for Bounded Degenerate U-Processes
To apply Theorem 3, we need to control various expectations appearing on the right hand side of the moment bound there. Expect for , all the other quantities are maximal inequalities related to empirical processes. See van der Vaart and Wellner, (2011) and Lemmas 3.4.2-3.4.3 of van der Vaart and Wellner, (1996) for maximal inequalities of empirical processes. In this section, we provide a maximal inequality for . For independent and identically distributed random variables, Chen and Kato, (2020, Theorem 5.1) provide a maximal inequality for degenerate -processes of arbitrary order. This result is similar to Theorem 2.1 of van der Vaart and Wellner, (2011) for empirical processes. The same proof as in Chen and Kato, (2020) does not provide the “correct” bound in the case of possibly non-identically distributed observations since they use Hoeffding averaging which can lead to sub-optimal rate if the observations are not identically distributed. A modification of the proof leads to the maximal inequality below.
For any , function class containing functions and a discrete probability measure with support , let denotes the minimum such that there exists satisfying
where for ,
Note that the right hand side is expectation with respect to the measure induced on . Define the uniform entropy integral needed for -processes is given by
Here represents the envelope function for satisfying for all and the supremum is taken over all discrete probability measures supported on .
The following Lemma proves a maximal inequality using Theorem 5.1.4 of de la Peña and Giné, (1999). The proof is very similar to that of Theorem 5.1 of Chen and Kato, (2020) which itself was based on the proof of Theorem 2.1 of van der Vaart and Wellner, (2011).
Theorem 4.
Suppose represent a class of real-valued functions uniformly bounded by with the envelope function . Then there exists a universal constant such that
for any and , where ,
Proof.
See Appendix C for a proof. ∎
APPENDIX
Appendix A Proofs of Results in Section 2
A.1 Proof of Theorem 1
By Theorem 3.5.3 of de la Peña and Giné, (1999), it follows that
where
For and any , define
Observe that
First, we consider the behavior of for a fixed and then the behavior of . By the degeneracy of the kernel, we have that is a sum of independent mean zero random variables for every . Moreover, for all . Hence, Theorem 3.4 of Kuchibhotla and Chakrabortty, (2022) (with and ) implies
where . Based on this, define
where
Getting back to the behavior of , we first note that by degeneracy and symmetrization,
(16) |
Here are independent Rademacher random variables independent of , . Hence, it suffices to understand the behavior of
(The introduction of Rademacher variables is only done for notational convenience in applying truncation.)
(17) |
Because are independent random variables conditional on , we get by another application of Theorem 3.4 of Kuchibhotla and Chakrabortty, (2022) (with and ) that conditional on , with probability at least ,
(18) |
Observe now that
To bound the first term on the right hand side of (18), we follow the argument in the proof of Theorem 3.2 of Giné et al., (2000). However, in place of inequality (3.8) of Giné et al., (2000), we apply Theorem B.1 of Kuchibhotla and Chakrabortty, (2022). Following the display after inequality (3.11) of Giné et al., (2000), we have
where the supremum is taken over a countable subset of mean zero vector functions . Define
Degeneracy of implies that ’s are mean zero independent random variables. Hence, by Theorem B.1 of Kuchibhotla and Chakrabortty, (2022), we get
Following the argument in Theorem 3.2 of Giné et al., (2000), we get
Therefore, by Markov’s inequality, with probability at least ,
(19) |
A.2 Proof of Lemma 1
From Theorem 3.1.1 of de la Peña and Giné, (1999) and following the arguments similar to those in Theorem 3.5.3 of de la Peña and Giné, (1999), we get for all
where are Rademacher random variables independent of . Note from (9) that
Here represents the probability measure of for . By Jensen’s inequality, it is clear that for ,
Therefore, for ,
Throughout the proofs in all the appendices to follow, we use the notation
Note that this is different from and defined in the main text.
A.3 Proof of Theorem 2
Based on the basic decomposition (14), we get
where
(20) |
It is easy to verify that are all degenerate -statistics. From Theorem 3.2 of Giné et al., (2000), we get that there exists a constant such that for all ,
where
(21) |
It is clear that
The quantity appears as the square root of the wimpy variance of the supremum of an empirical process; see Boucheron et al., (2013, page 314). Lemma 4 of Section A.4 implies that
For bounding , note that
Combining these two inequalities implies that
Finally, it is clear from assumption (B2) that Combining all these with Theorem 3.2 of Giné et al., (2000) and noting
we get that there exists a constant such that for all
(22) |
To bound and in (33), we use Hoffmann-Jøgensen’s inequality (Proposition 6.8 of Ledoux and Talagrand, (1991)). Observe that
With and , note that
and so, by Equation (6.8) of Ledoux and Talagrand, (1991), we get
From assumption (B1) and Theorem 6.21 of Ledoux and Talagrand, (1991), we thus get for ,
(23) |
for some constant depending only on (and can be different in different lines). If , then we get
See proof of Theorem 3.3 in Kuchibhotla and Chakrabortty, (2022) for similar argument. Thus,
Thus, for ,
(24) |
To control the right hand side above, recall that
is a sum of mean zero independent random variables that are bounded by . Also, note that
Therefore by Bernstein’s inequality (Lemma 4 of van de Geer and Lederer, (2013)), we get that
(25) |
where
So, by Propositions A.3 and A.4 of Kuchibhotla and Chakrabortty, (2022), we get that for ,
Hence for ,
(26) |
A similar calculation for shows that for ,
(27) |
where
To control , recall that
Following the arguments leading to (23), we have
Conditioning on , the right hand side satisfies the hypothesis of (6.8) of Ledoux and Talagrand, (1991) and so by Theorem 6.21 of Ledoux and Talagrand, (1991), we get
for some constant depending only on . Therefore, for
(28) |
for some constant .
A.4 Auxiliary Lemmas Used in Theorem 2
The two lemmas to follow in this section provide explicit (but not necessarily optimal) constants for Equations (3.1) and (2.6) of Giné et al., (2000). These lemmas can be used in the proof of Theorem 3.2 of Giné et al., (2000) to get explicit constants. In this respect, we note that Theorem 3.4.8 of Giné and Nickl, (2016) (which was first proved in Houdré and Reynaud-Bouret, (2003)) does not imply Theorem 3.2 of Giné et al., (2000) since the result of Giné et al., (2000) applies for unbounded kernels in -statistics while the result of Giné and Nickl, (2016) applies exclusively for bounded kernel -statistics.
Lemma 2.
Suppose are independent mean zero random variables. Then for ,
Proof.
Lemma 3.
Suppose are independent random variables, then for and ,
Proof.
Fix . Define such that
By (1.4.4) of de la Peña and Giné, (1999), it follows that
(29) |
Observe that
Inequality (a) follows from (29). To prove the result now, we consider two cases:
-
–
Case 1: If
then
Therefore (in case 1),
(30) -
–
Case 2: If
then
Therefore (in case 2),
(31)
Combining inequalities (30) and (31), we get for and that
This proves the result. ∎
Proof.
Following the proof of Theorem 3.2 of Giné et al., (2000), the quantity is the square root of the wimpy variance of
where and
This implies that
where for ,
Note that depends on since the random variables are allowed to be non-identically distributed. Now observe that
(32) |
To prove this, note that for any satisfying the (integral) constraint,
To prove the reverse inequality, define for ,
It is clear that satisfy the integral constraint in (32) and
This completes the proof of (32). Rewriting the representation (32), we get
This representation shows that is indeed the supremum of an empirical process. The wimpy variance of this supremum is given by
Now a duality argument implies that
Thus the result follows. ∎
Appendix B Proofs of Results in Section 3
B.1 Proof of Theorem 3
Similar to defined in the proof of Theorem 2, we define
(33) |
As in the proof of Theorem 2, we will control each of the terms separately in the following lemmas. All the lemmas below assume (B1) and (A2′).
Lemma 5 (Control of ).
There exists a constant (depending only on ) such that for all ,
Proof.
The following lemma controls the moments of and .
Lemma 6 (Control of and ).
There exists a constant (depending only on ) such that for ,
Proof.
We will only prove the bound for and the proof for follows very similar arguments. Recall that
Here again (6.8) of Ledoux and Talagrand, (1991) applies and we get
By a similar calculation, we get
Thus, for ,
(34) |
The right hand side quantities involve supremum of bounded empirical processes for which Talagrand’s inequality applies; see proposition 3.1 of Giné et al., (2000). Observe that for any ,
By proposition 3.1 of Giné et al., (2000), we obtain for any and ,
where and . Therefore, by following the argument that lead to (26), we get that
(35) | |||
Substituting this in (34), we get
By a similar calculation, we get
This completes the proof of the result. ∎
The following lemma controls the moments of . This is a bounded degenerate -process and is (usually) the dominating term among the four parts.
Lemma 7 (Control of ).
There exists a constant (depending only on ) such that for all ,
Proof.
Recall that
Observe that conditional on , is a bounded empirical process and so Talagrand’s inequality applies. Thus by Proposition 3.1 of Giné et al., (2000), we get for
Therefore, for ,
Controlling Using (35) from Lemma 6, we get
Controlling To control , we use a technique similar to the one used in Lemma 4. For this note by (32) that for any
Therefore,
Now observe that
where represents the sequence satisfying and
(36) |
Thus
The right hand side is a bounded empirical process and by proposition 3.1 of Giné et al., (2000), we get
(37) |
We will now control each of the three terms appearing in (37). Using the fact , we get
By following the duality argument (32), we get
and so,
(38) |
Also, note that
Hence, again following the duality argument (32), we get
(39) |
Here represents a sequence satisfying .
Substituting (39) and (38) in (37), we get
(40) |
Controlling We use Lemma 8 (a restatement of Lemma 2 of Adamczak, (2006)) to control In the notation of Lemma 8, take
and for ,
This implies
Observe that . Thus we get for
(41) | ||||
where
with defined in Lemma 8. We now simplify the last two terms on the right hand side of (41). First observe that for the third term
To control , observe that
So, using the definition of , we get
Equality (a) above follows from the duality argument (32) while equality (b) follows from the argument given in Lemma 8.
∎
B.2 Auxiliary Lemmas Used in Theorem 3
The following lemma is a rewording of Lemma 2 of Adamczak, (2006). For this result, define the class of functions
The domain of functions in is left out on purpose.
Lemma 8.
Suppose represents a countable class of vector functions. Define for independent random variables ,
where represents the expectation only with respect to . (So, is a random variable that depends on ). If for a.e , then there exists a constant such that for all ,
Proof.
Following the proof of Lemma 2 of Adamczak, (2006), we get
To see this, define such that
Here satisfying
Therefore,
The right hand side above is the supremum of a mean zero empirical process and so by proposition 3.1 of Giné et al., (2000), we get
From the definition of , we get
Thus,
So, the result follows. ∎
Appendix C Proof of the Maximal Inequality (Theorem 4)
The following moment bound of Rademacher chaos is used in the proof. See corollary 3.2.6 of de la Peña and Giné, (1999) and inequalities leading to (4.1.20) on page 167 of de la Peña and Giné, (1999).
Lemma 9.
Let be a homogeneous Rademacher chaos of degree 2, that is,
for some constants . Then , where
Proof of Theorem 4.
As before, let Also, let
By Lemma 9, we get conditional on ,
where
and define the discrete probability measure with support as
Now, following the proof of Theorem 5.1.4 of de la Peña and Giné, (1999),
where
Therefore,
This implies that
(42) |
Using concavity of as in the proof of Theorem 2.1 of van der Vaart and Wellner, (2011), it follows that
(43) |
where
At this point the proof of Theorem 5.1 of Chen and Kato, (2020) uses Hoeffding averaging to bound which proves the result for i.i.d. random variables . To allow for non-identically distributed random variables , we bound in terms of on the right hand side of (43). This is similar to the proof of Theorem 2.1 of van der Vaart and Wellner, (2011). To bound , define for
Using these definitions, we get
(44) |
where
By decoupling and symmetrization, we obtain
Set for ,
Again by Lemma 9 and using for all and , we get
Hence by following the first part of the proof, we get
(45) |
Substituting this in (44) after taking expectations,
where
It follows that
for any and . Therefore, by Lemma 2.1 of van der Vaart and Wellner, (2011), it follows that for any and ,
Substituting this in (43) and (42), we get
for any and . The result is proved.∎
References
- Adamczak, (2006) Adamczak, R. (2006). Moment inequalities for -statistics. Ann. Probab., 34(6):2288–2314.
- Arcones and Giné, (1993) Arcones, M. A. and Giné, E. (1993). Limit theorems for -processes. Ann. Probab., 21(3):1494–1542.
- Bakhshizadeh, (2023) Bakhshizadeh, M. (2023). Exponential tail bounds and large deviation principle for heavy-tailed -statistics. arXiv preprint arXiv:2301.11563.
- Bang and Robins, (2005) Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
- Bentkus and Götze, (1999) Bentkus, V. and Götze, F. (1999). Optimal bounds in non-gaussian limit theorems for -statistics. The Annals of Probability, 27(1):454–521.
- Boucheron et al., (2005) Boucheron, S., Bousquet, O., Lugosi, G., and Massart, P. (2005). Moment inequalities for functions of independent random variables. Ann. Probab., 33(2):514–560.
- Boucheron et al., (2013) Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities. Oxford University Press, Oxford. A nonasymptotic theory of independence, With a foreword by Michel Ledoux.
- Chakrabortty and Kuchibhotla, (2018) Chakrabortty, A. and Kuchibhotla, A. K. (2018). Tail bounds for canonical U-statistics and U-processes with unbounded kernels. Technical report, Working paper, Wharton School, University of Pennsylvania.
- Chen and Kato, (2020) Chen, X. and Kato, K. (2020). Jackknife multiplier bootstrap: finite sample approximations to the -process supremum with applications. Probability Theory and Related Fields, 176:1097–1163.
- Clémençon et al., (2008) Clémençon, S., Lugosi, G., and Vayatis, N. (2008). Ranking and empirical minimization of -statistics. Ann. Statist., 36(2):844–874.
- de la Peña, (1992) de la Peña, V. H. (1992). Decoupling and Khintchine’s inequalities for -statistics. Ann. Probab., 20(4):1877–1892.
- de la Peña and Giné, (1999) de la Peña, V. H. and Giné, E. (1999). Decoupling. Probability and its Applications (New York). Springer-Verlag, New York. From dependence to independence, Randomly stopped processes. -statistics and processes. Martingales and beyond.
- Delyon and Portier, (2016) Delyon, B. and Portier, F. (2016). Integral approximation by kernel smoothing. Bernoulli, 22(4):2177–2208.
- Dirksen, (2015) Dirksen, S. (2015). Tail bounds via generic chaining. Electron. J. Probab., 20:no. 53, 29.
- Giné et al., (2000) Giné, E., Latała, R., and Zinn, J. (2000). Exponential and moment inequalities for -statistics. In High dimensional probability, II (Seattle, WA, 1999), volume 47 of Progr. Probab., pages 13–38. Birkhäuser Boston, Boston, MA.
- Giné and Nickl, (2008) Giné, E. and Nickl, R. (2008). A simple adaptive estimator of the integrated square of a density. Bernoulli, 14(1):47–61.
- Giné and Nickl, (2016) Giné, E. and Nickl, R. (2016). Mathematical foundations of infinite-dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York.
- Hall and Marron, (1987) Hall, P. and Marron, J. S. (1987). Estimation of integrated squared density derivatives. Statist. Probab. Lett., 6(2):109–115.
- He et al., (2024) He, Y., Wang, K., and Zhu, Y. (2024). Sparse hanson-wright inequalities with applications. arXiv preprint arXiv:2410.15652.
- Houdré and Reynaud-Bouret, (2003) Houdré, C. and Reynaud-Bouret, P. (2003). Exponential inequalities, with constants, for U-statistics of order two. In Stochastic inequalities and applications, volume 56 of Progr. Probab., pages 55–69. Birkhäuser, Basel.
- Kim, (2020) Kim, I. (2020). Statistical Theory and Methods for Comparing Distributions. PhD thesis, Carnegie Mellon University.
- Kolesko and Latała, (2015) Kolesko, K. and Latała, R. (2015). Moment estimates for chaoses generated by symmetric random variables with logarithmically convex tails. Statist. Probab. Lett., 107:210–214.
- Kuchibhotla and Chakrabortty, (2022) Kuchibhotla, A. K. and Chakrabortty, A. (2022). Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Information and Inference: A Journal of the IMA, 11(4):1389–1456.
- Ledoux and Talagrand, (1991) Ledoux, M. and Talagrand, M. (1991). Probability in Banach spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)]. Springer-Verlag, Berlin. Isoperimetry and processes.
- Liu et al., (2021) Liu, L., Mukherjee, R., Robins, J. M., and Tchetgen, E. T. (2021). Adaptive estimation of nonparametric functionals. Journal of Machine Learning Research, 22(99):1–66.
- Major, (2005) Major, P. (2005). Tail behaviour of multiple random integrals and -statistics. Probab. Surv., 2:448–505.
- Major, (2013) Major, P. (2013). On the estimation of multiple random integrals and -statistics, volume 2079 of Lecture Notes in Mathematics. Springer, Heidelberg.
- Newey and Ruud, (2005) Newey, W. K. and Ruud, P. A. (2005). Density weighted linear least squares. In Identification and inference for econometric models, pages 554–573. Cambridge Univ. Press, Cambridge.
- Nolan and Pollard, (1987) Nolan, D. and Pollard, D. (1987). -processes: rates of convergence. Ann. Statist., 15(2):780–799.
- Nolan et al., (1988) Nolan, D., Pollard, D., et al. (1988). Functional limit theorems for -processes. The Annals of Probability, 16(3):1291–1298.
- Robins et al., (2008) Robins, J., Li, L., Tchetgen, E., van der Vaart, A., et al. (2008). Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and statistics: essays in honor of David A. Freedman, volume 2, pages 335–422. Institute of Mathematical Statistics.
- Robins et al., (2017) Robins, J. M., Li, L., Mukherjee, R., Tchetgen, E. T., and van der Vaart, A. (2017). Minimax estimation of a functional on a structured high-dimensional model. The Annals of Statistics, 45(5):1951–1987.
- Robins et al., (2016) Robins, J. M., Li, L., Tchetgen, E. T., and van der Vaart, A. (2016). Asymptotic normality of quadratic estimators. Stochastic processes and their applications, 126(12):3733–3759.
- Robins et al., (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866.
- Rudelson and Vershynin, (2013) Rudelson, M. and Vershynin, R. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab., 18:no. 82, 9.
- Serfling, (1980) Serfling, R. J. (1980). Approximation theorems of mathematical statistics. John Wiley & Sons, Inc., New York. Wiley Series in Probability and Mathematical Statistics.
- Spokoiny and Zhilova, (2013) Spokoiny, V. and Zhilova, M. (2013). Sharp deviation bounds for quadratic forms. Math. Methods Statist., 22(2):100–113.
- Talagrand, (2014) Talagrand, M. (2014). Upper and lower bounds for stochastic processes, volume 60 of Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge. A Series of Modern Surveys in Mathematics [Results in Mathematics and Related Areas. 3rd Series. A Series of Modern Surveys in Mathematics]. Springer, Heidelberg. Modern methods and classical problems.
- van de Geer and Lederer, (2013) van de Geer, S. and Lederer, J. (2013). The Bernstein-Orlicz norm and deviation inequalities. Probab. Theory Related Fields, 157(1-2):225–250.
- van der Vaart and Wellner, (2011) van der Vaart, A. and Wellner, J. A. (2011). A local maximal inequality under uniform entropy. Electron. J. Stat., 5:192–203.
- van der Vaart and Wellner, (1996) van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York. With applications to statistics.