1. Introduction
Suppose X_1, \dots, X_n are independent and identically distributed random variables
taking values in a given measurable space \mathcal{X}. Hoeffding (1948) first introduced U-statistics, which, for a given degree d \geq 2, are statistics of the form

(1.1)   U_n := \binom{n}{d}^{-1} \sum_{1 \leq i_1 < \cdots < i_d \leq n} h(X_{i_1}, \dots, X_{i_d}),

where n \geq d and h : \mathcal{X}^d \to \mathbb{R} is a given symmetric function in d arguments, i.e.

h(x_1, \dots, x_d) = h(x_{\pi(1)}, \dots, x_{\pi(d)})

for any permutation \pi of the indices 1, \dots, d. It generalizes the sample mean and encompasses many examples such as the sample variance; see Koroljuk and Borovskich (1994) for a comprehensive treatment of the classical theory of U-statistics. To simplify notation, for each d-tuple \iota = (i_1, \dots, i_d) with 1 \leq i_1 < \cdots < i_d \leq n, we will henceforth adopt the shorthands X_\iota := (X_{i_1}, \dots, X_{i_d}) and h(X_\iota) := h(X_{i_1}, \dots, X_{i_d}), as well as using \mathbf{X}_n := (X_1, \dots, X_n) to represent all the raw data inserted into U_n. Throughout this paper, we will assume
(1.2)   \theta := \mathrm{E}\, h(X_1, \dots, X_d) \ \text{ exists, with } \ \sigma_h^2 := \mathrm{Var}(h(X_1, \dots, X_d)) < \infty,

and that the single-argument projection function associated with the kernel h, defined by
g(x) := \mathrm{E}[h(x, X_2, \dots, X_d)] - \theta, is non-degenerate, i.e.

(1.3)   \sigma_g^2 := \mathrm{Var}(g(X_1)) > 0;

note that (1.3) necessarily implies \sigma_h^2 > 0 by Jensen’s inequality.
The statistic U_n in (1.1) is complete, in the sense that it is an average of the kernel evaluated at all possible subsamples of size d from the raw data.
Blom (1976) suggested alternative constructions dubbed “incomplete U-statistics”, which only average over an appropriate subset of the index tuples, with the intuition that the strong dependence among the summands in (1.1) should allow one to use only a well-chosen subset of them without losing too much statistical efficiency. In this work, we study one particular such construction, known as the incomplete U-statistic with Bernoulli sampling, which has the form
(1.4)   U'_{n,N} := \frac{1}{\hat{N}} \sum_{1 \leq i_1 < \cdots < i_d \leq n} Z_\iota\, h(X_\iota),

where the Z_\iota’s are independent and identically distributed Bernoulli random variables with success probability

(1.5)   p := N \Big/ \binom{n}{d},

i.e., \mathrm{P}(Z_\iota = 1) = p and \mathrm{P}(Z_\iota = 0) = 1 - p,
and \hat{N} is defined as

\hat{N} := \sum_{1 \leq i_1 < \cdots < i_d \leq n} Z_\iota.

We call N the computational budget, because it is the expected number of kernel evaluations that U'_{n,N} winds up summing over.
In the sequel, we will also use \mathbf{Z}_n := (Z_\iota)_{1 \leq i_1 < \cdots < i_d \leq n} to represent all these Bernoulli samplers, which are also assumed to be independent of the raw data \mathbf{X}_n.
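To make the construction concrete, the following Python sketch builds both a complete U-statistic as in (1.1) and its Bernoulli-sampled incomplete counterpart as in (1.4), using the degree-2 sample-variance kernel h(x, y) = (x − y)²/2; the kernel choice and all numeric values are our own illustrative assumptions, not taken from the paper.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)

def h(x, y):
    # Degree-2 sample-variance kernel: a symmetric, non-degenerate example.
    return 0.5 * (x - y) ** 2

n, N = 200, 200                      # raw sample size and computational budget
X = rng.standard_normal(n)
pairs = list(itertools.combinations(range(n), 2))
p = N / math.comb(n, 2)              # Bernoulli success probability, as in (1.5)

# Complete U-statistic: average over all C(n, 2) pairs.
U_complete = np.mean([h(X[i], X[j]) for i, j in pairs])

# Incomplete U-statistic with Bernoulli sampling: each pair is kept
# independently with probability p; N_hat is the realized number of kept pairs.
Z = rng.random(len(pairs)) < p
N_hat = Z.sum()                      # E[N_hat] = N, the computational budget
U_incomplete = np.mean([h(X[i], X[j]) for (i, j), z in zip(pairs, Z) if z])

print(U_complete, U_incomplete, N_hat)
```

Note how only about N of the roughly n²/2 kernel evaluations are actually averaged, while the incomplete statistic stays close to the complete one.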
It is an established result (Janson, 1984, Corollary 1) that, if

(1.6)   n, N \to \infty \ \text{ with } \ N/n \to \alpha \ \text{ for some constant } \alpha \in (0, \infty),

the following weak convergence holds for any degree d under (1.2)-(1.3):

(1.7)   \sqrt{N}\,(U'_{n,N} - \theta) \to N(0, \sigma^2) \ \text{ in distribution},

where

\sigma^2 := \alpha\, d^2 \sigma_g^2 + \sigma_h^2.

Note that the asymptotic regime in (1.6) captures a common computationally efficient use case, where the number of kernel evaluations being summed over by U'_{n,N} is of the same order as the raw sample size n, whereas the full-blown U-statistic U_n entails many more summations, of the order n^d. The asymptotic normality of incomplete U-statistics such as (1.7) has found a resurgence of interest in the machine learning community since Mentch and Hooker (2016) pointed out that it provides a computationally feasible framework for quantifying the uncertainty of ensemble predictions; see also Zhou et al. (2021), Peng et al. (2022), DiCiccio and Romano (2022) and Xu et al. (2022) for some related recent works.
While the convergence result (1.7) itself has been known for a long time, a Berry-Esseen (B-E) bound of a natural rate characterizing the accuracy of the associated normal approximation has thus far been missing from the literature.
In this paper, we fill this gap. Capitalizing on the recent framework of Stein’s method for the so-called “Studentized nonlinear statistics” (Leung et al., 2024) and an exponential lower tail bound for non-negative U-statistics (Leung and Shao, 2023, Lemma 4.3), we establish B-E bounds for the incomplete U-statistic under minimalistic moment assumptions on the kernel; these bounds attain the rate n^{-1/2} when n and N are of the same order, the best rate we believe one could possibly achieve. In passing, we mention that the statistical community has also been developing B-E bounds for high-dimensional incomplete U-statistics (Song et al., 2019, Chen and Kato, 2019); our focus here is instead to provide a bound of the best rate for the one-dimensional convergence result in (1.7).
1.1. Other notation and convention
P and E denote generic probability and expectation operators. For any q \geq 1, we use \|\xi\|_q := (\mathrm{E}|\xi|^q)^{1/q} to denote the L_q-norm of a real-valued random variable \xi. Given and a function in arguments in , we use and as shorthands for and , respectively.
\varphi and \Phi are respectively the standard normal density and distribution functions, with \bar{\Phi} := 1 - \Phi;
N(\mu, \sigma^2) denotes the univariate normal distribution with mean \mu and variance \sigma^2. The indicator function is denoted by I(\cdot). Generally speaking,
C denotes an unspecified absolute positive constant, where “absolute” means it is universal for the statements it appears in and does not depend on other quantities;
its value generally varies at different occurrences throughout the paper.
2. The main result
We first review the nature of the normal convergence in (1.7). To facilitate the discussion, we define the U-statistics
(2.1) |
|
|
|
whose non-negative kernels and are induced by taking powers of the absolute value of the original kernel .
One can easily check that, as in the recent works on B-E bounds for incomplete U-statistics (Chen and Kato, 2019, Song et al., 2019, Peng et al., 2022, Sturma et al., 2022), admits the decomposition
(2.2) |
|
|
|
where
(2.3) |
|
|
|
Hence, up to multiplicative scaling factors, can be understood as a sum of the complete U-statistic and the term . Under the assumptions (1.2) and (1.3), it is well-known that as tends to infinity, converges weakly to the limit (Koroljuk and Borovskich, 1994, Ch.4). On the other hand, when conditioned on the raw data , should be approximately distributed as , for being a sum of independent mean-zero terms solely driven by the randomness of , with variance ; since
(2.4) |
|
|
|
marginally, is expected to be approximately distributed as and also approximately independent of . Altogether, in light of , , being a sum of two approximately independent and normal random variables, is expected to be asymptotically normal with mean zero and variance
|
|
|
under the regime of (1.6), because . Indeed, this is stated by the weak convergence in (1.7).
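The heuristic above — that the complete-U-statistic part and the sampling-driven part of the decomposition (2.2) are approximately independent — can be checked in simulation. The sketch below (our own illustration, again with the sample-variance kernel and arbitrarily chosen sizes) estimates the correlation between the two parts across replications; it should be close to zero.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)

def h(x, y):
    return 0.5 * (x - y) ** 2        # non-degenerate degree-2 kernel

n, N, reps = 30, 30, 2000
pairs = np.array(list(itertools.combinations(range(n), 2)))
p = N / math.comb(n, 2)
theta = 1.0                          # E h = Var(X) = 1 for standard normal data

part_complete, part_sampling = [], []
for _ in range(reps):
    X = rng.standard_normal(n)
    vals = h(X[pairs[:, 0]], X[pairs[:, 1]])
    U_n = vals.mean()                          # complete U-statistic
    Z = rng.random(len(vals)) < p              # Bernoulli samplers
    U_inc = vals[Z].mean() if Z.any() else U_n
    part_complete.append(U_n - theta)          # "part-complete-U-statistic" piece
    part_sampling.append(U_inc - U_n)          # "part-independent-sum" piece

corr = np.corrcoef(part_complete, part_sampling)[0, 1]
print(corr)
```

Conditionally on the data, the sampling piece has mean zero, so the two pieces are exactly uncorrelated in population; the printed sample correlation reflects only Monte Carlo noise.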
An extension of the preceding discussion also sheds light on what one should expect from a B-E bound that assesses the normal approximating accuracy of (1.7). Firstly, the complete U-statistic has a well-established B-E bound (Chen and Shao, 2007, Peng et al., 2022) that reads as
(2.5) |
|
|
|
Moreover, ignoring the technical difficulty that may arise from division by zero when for the time being, one can write
|
|
|
so is a sum of conditionally independent random variables indexed by , each having the conditional mean
|
|
|
and the conditional second and third absolute moments
|
|
|
where we have used the calculation
.
Since the second moment calculation above suggests that
|
|
|
the classical B-E bound (Berry, 1941, Esseen, 1942, Shevtsova, 2010) for a sum of independent variables with unit variance implies that
(2.6) |
|
|
|
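To illustrate the classical bound just invoked, the following Python sketch computes the exact Kolmogorov distance between a standardized Binomial(n, q) and the standard normal, and compares it with the Berry-Esseen bound using the constant 0.56 of Shevtsova (2010) cited above; the values n = 400 and q = 0.3 are our own illustrative choices.

```python
import math

def phi_cdf(x):
    # Standard normal distribution function via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, q = 400, 0.3                      # i.i.d. Bernoulli(q) summands
sigma = math.sqrt(q * (1 - q))
rho = q * (1 - q) * ((1 - q) ** 2 + q ** 2)   # E|X - EX|^3 for a Bernoulli(q)

# Exact Kolmogorov distance between the standardized Binomial(n, q) and N(0, 1),
# evaluated at the jump points of the discrete distribution.
cdf, dist = 0.0, 0.0
pmf = (1 - q) ** n                    # P(S_n = 0)
for k in range(n + 1):
    z = (k - n * q) / (sigma * math.sqrt(n))
    dist = max(dist, abs(cdf - phi_cdf(z)))          # just below the jump
    cdf += pmf
    dist = max(dist, abs(cdf - phi_cdf(z)))          # at the jump
    pmf *= (n - k) * q / ((k + 1) * (1 - q))

berry_esseen = 0.56 * rho / (sigma ** 3 * math.sqrt(n))
print(dist, berry_esseen)
```

The computed distance sits below the bound, as the theorem guarantees; the gap narrows as the summands become more skewed.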
As discussed, since the asymptotic normality of
under the regime (1.6) can be viewed as the distribution resulting from summing the two approximately independent and normal random variables and , the two B-E bounds in (2.5) and (2.6) intuitively suggest that a “correct” B-E bound
for the marginal cumulative distribution of should be of the rate when , i.e. when there exists a positive constant such that
(2.7) |
|
|
|
as a non-asymptotic analog of the regime (1.6).
This paper materializes the latter intuition into a Berry-Esseen theorem, which can be stated in terms of projection functions with respect to the non-negative kernel of the U-statistic in (2.1). For each , first define the function in arguments by
|
|
|
which is centered in the sense that ; specifically for , we also let be its “uncentered version”
(2.8) |
|
|
|
which must also be non-negative.
Next, we can recursively define the Hoeffding projection kernels of as
|
|
|
and
|
|
|
These projection functions give rise to the famous Hoeffding decomposition (De la Pena and Giné, 1999, p.137)
of the U-statistic that, upon re-centering and rescaling with , can be written as
(2.9) |
|
|
|
where
(2.10) |
|
|
|
are the leading terms in the decomposition. We note that each of the ’s has mean , giving rise to the re-centering of by in (2.9).
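As a worked example of the first-order projection entering the decomposition above: for the sample-variance kernel h(x, y) = (x − y)²/2 with standard normal data (an illustrative choice of ours, not the paper's setting), the projection g(x) = E[h(x, X₂)] − θ has the closed form (x² − 1)/2, which the Monte Carlo sketch below verifies.

```python
import numpy as np

rng = np.random.default_rng(2)

def h(x, y):
    return 0.5 * (x - y) ** 2

theta = 1.0                           # E h(X1, X2) = Var(X) = 1 for N(0, 1) data

def g_mc(x, m=200_000):
    # Monte Carlo estimate of the first-order Hoeffding projection
    # g(x) = E[h(x, X2)] - theta at a fixed point x.
    return h(x, rng.standard_normal(m)).mean() - theta

def g_exact(x):
    # Closed form for this kernel: E[(x - X2)^2] / 2 - 1 = (x^2 - 1) / 2.
    return 0.5 * (x * x - 1.0)

for x in (-1.5, 0.0, 2.0):
    print(x, g_mc(x), g_exact(x))
```

The higher-order projection kernels for this example are degenerate in the same recursive sense as in the display above, which is what makes the linear term dominant.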
We can now state our main B-E bound for ; recall and , as per our convention in Section 1.1.
Theorem 2.1 (Berry-Esseen theorem for incomplete U-statistics with Bernoulli sampling).
Let be constructed as in (1.4) with a symmetric kernel on and be the greatest integer less than . Suppose that (1.2)-(1.3), and hold. For an absolute constant , one has the B-E bound
|
|
|
where is a non-negative term that can be bounded as
|
|
|
If, in addition, , then we have
|
|
|
It is easy to see that each term on the right hand side of our B-E bound in Theorem 2.1 decays no slower than the rate as , when as in (2.7); in particular, since the expression
(2.11) |
|
|
|
is seen to be of order , the term decays at the rate under (2.7), whether under the minimalistic finite third moment assumption or the stronger finite fourth moment assumption . Our result can also be contrasted with Peng et al. (2022, Theorem 5), which to our best knowledge is the first attempt in proving a B-E bound for the convergence in (1.7). However, Peng et al. (2022) had to assume six finite moments on the kernel and their bound exhibits a slower rate than under (2.7). To further appreciate Theorem 2.1, the various parts of our bound could be interpreted as follows:
-
•
stems from the complete U-statistic B-E bound in (2.5), which accounts for the “part-complete-U-statistic” behavior of coming from ; recall the decomposition in (2.2).
-
•
is analogous to the B-E bound in (2.6) for the conditional normal convergence of , which accounts for the “part-independent-sum” behavior of coming from as per the decomposition in (2.2). However, because Theorem 2.1 really concerns the convergence of the marginal distribution for ,
the data-dependent ratio in (2.6)
is now replaced by , a ratio of
population moments that can be roughly thought of as its “average”; note that and . The extra term captures the fact that can never be too large due to the exponentially decaying lower tail behavior of as a non-negative U-statistic; see (3.20) below.
The factor should also be understood with some care, because depends on and as per (1.5). Indeed, as approaches , approaches , which makes blow up and the whole bound become vacuous. However, our B-E bound in Theorem 2.1 is really geared toward and having the same order as in (2.7), in which case as . In fact, it is well-known that has different limiting distributions from the one considered in (1.7) when the relative growth rates of and differ from the prescription in (1.6) (Janson, 1984, Mentch and Hooker, 2016).
-
•
stems from various “remainder terms” emerging in the proof of Theorem 2.1. For instance, one such remainder term is defined in (3.6) below, which corresponds to the error incurred by estimating with , as is implicit in (2.4). While the bound on under the minimalistic third moment assumption appears to have a relatively complicated form as exhibited by (2.11), it comes from standard moment inequalities on the centered U-statistic ; see
Ferger (1996) and Koroljuk and Borovskich (1994, Chapter 2) for instance.
The proof of Theorem 2.1 adopts a technique called variable censoring, recently introduced in Leung et al. (2024) as a refinement of the variable truncation technique for proving B-E bounds with Stein’s method. For a given real-valued random variable W, its censored version with respect to two constants a, b \in [-\infty, +\infty], where a < b, is defined as

W^{(a,b)} := \min\{\max\{W, a\}, b\} = (a \vee W) \wedge b,

i.e. once W goes beyond the range [a, b], its censored version can only be either a or b, depending on which direction the value of W goes; that a or/and b are allowed to be infinite means that W may only be censored in one direction, or not be censored at all.
This technique has the following apparent but very useful properties for establishing bounds:
Property 2.2 (Properties of variable censoring).
Let and be any two real-valued random variables. The following facts hold:
-
(i)
Suppose, for some with ,
|
|
|
and
|
|
|
then it must be that
-
(ii)
If is a non-negative random variable, then it must also be true that
|
|
|
i.e., the upper-censored version of is always no larger than itself.
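The censoring map and the properties above are straightforward to check numerically; here is a minimal Python sketch (the function name censor is ours) using numpy's clip.

```python
import numpy as np

def censor(w, a, b):
    # Censored version of w w.r.t. constants a <= b: values below a are set
    # to a, values above b are set to b; infinite endpoints disable censoring.
    return np.clip(w, a, b)

rng = np.random.default_rng(3)
w = rng.standard_normal(10_000)

# Censoring keeps w inside [a, b] and leaves in-range values untouched.
a, b = -1.0, 2.0
wc = censor(w, a, b)
inside = (w > a) & (w < b)
assert wc.min() >= a and wc.max() <= b
assert np.all(wc[inside] == w[inside])

# Property 2.2(ii): for a non-negative variable, the upper-censored version
# is never larger than the variable itself.
v = np.abs(w)
assert np.all(censor(v, 0.0, 1.0) <= v)

# One-sided censoring: an infinite endpoint means no censoring on that side.
assert np.all(censor(w, -np.inf, np.inf) == w)
print("censoring checks passed")
```

Unlike truncation (which sets out-of-range values to zero), censoring preserves the ordering of the sample, which is what makes the monotonicity in Property 2.2(i) hold.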
We now explain the structure of Theorem 2.1’s proof. Throughout, we will use to denote the entire underlying probability space. Moreover, we will define
the following censored version of ,
(2.12) |
|
|
|
and
the event
(2.13) |
|
|
|
that only depends on the Bernoulli samplers in ; note that on . The first step of the proof is to establish, for any , the bound
(2.14) |
|
|
|
where
(2.15) |
|
|
|
(2.16) |
|
|
|
and
(2.17) |
|
|
|
Following other existing works on B-E bounds for incomplete U-statistics, this can be developed with iterative arguments similar to those pioneered by Chen and Kato (2019), and will be proved at the end of this section.
Evidently, the remaining steps boil down to bounding each term on the right hand side of (2.14). The first term can be readily handled by the B-E bound for complete U-statistics in (2.5), whereas the second term can be bounded by
Bernstein’s inequality (van der Vaart and Wellner, 1996, Lemma 2.2.9, p.102) as in
(2.18) |
|
|
|
where the last inequality comes from the fact that
(2.19) |
|
|
|
and . The proof of Theorem 2.1 can then be completed by the following two lemmas on the other two terms in (2.14):
Lemma 2.3 (B-E bound for with the random threshold ).
Suppose the same assumptions as in Theorem 2.1 hold. If , then
|
|
|
where is a non-negative term that can be bounded as
|
|
|
If, in addition, , then we have
|
|
|
Lemma 2.4 (Bound for ).
Under the same assumptions as in Theorem 2.1, there exists an absolute constant such that for all ,
|
|
|
Collecting (2.5), (2.14) and (2.18), Lemmas 2.3 and 2.4 immediately render Theorem 2.1.
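As a numerical aside on the Bernstein step behind (2.18): the number of sampled tuples is a sum of i.i.d. Bernoulli variables, so its deviations from the budget N obey a Bernstein-type bound 2·exp(−t²/(2(v + t/3))) with v = M·p·(1 − p). The sketch below (with illustrative values M, N and t of our own choosing) compares a Monte Carlo estimate of the tail with this bound.

```python
import math
import numpy as np

rng = np.random.default_rng(4)

M = 19_900                       # number of index tuples, e.g. C(200, 2)
N = 200                          # computational budget
p = N / M                        # Bernoulli success probability
t = 50.0                         # deviation level

# Monte Carlo tail probability of |N_hat - N| >= t over many replications,
# where N_hat ~ Binomial(M, p) is the realized number of sampled tuples.
reps = 20_000
n_hat = rng.binomial(M, p, size=reps)
tail_mc = np.mean(np.abs(n_hat - N) >= t)

# Bernstein's inequality for a sum of M centered Bernoulli(p) variables,
# each bounded by 1, with total variance v = M * p * (1 - p):
v = M * p * (1 - p)
tail_bernstein = 2.0 * math.exp(-t * t / (2.0 * (v + t / 3.0)))

print(tail_mc, tail_bernstein)
```

The empirical tail sits comfortably below the Bernstein bound, which is what allows the event in (2.13) to be truncated away at negligible cost.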
The critical result is Lemma 2.3, the B-E bound for the term with the random threshold , which also depends on the data via as suggested in (2.16). Most of the rest of this paper is devoted to proving Lemma 2.3, while the straightforward proof of Lemma 2.4 is included in Appendix A. We note that in (2.15), is a lower-order term while is the main term that drives the distribution of ; hence, it may be tempting to draw a parallel between Lemma 2.3 and the B-E bound for the conditional distribution in (2.6). Nevertheless, while (2.6) is a direct corollary of the classical result by Berry (1941) and Esseen (1942), the normalizer for is not the latter’s conditional standard deviation given . Moreover, unlike standard B-E bound problems, one has to handle a random threshold. These attributes make proving Lemma 2.3 non-trivially challenging.
As it turns out, two ingredients that have only recently been formalized in the literature come in handy: the framework of proving B-E bounds for the so-called “Studentized nonlinear statistics” with Stein’s method (Leung et al., 2024), and an exponential lower tail bound for the U-statistic in (2.1). We note that analyzing the tail behavior of is quite an important aspect, as will become apparent later in our proof of Lemma 2.3 in Section 3. For another technical comparison with Peng et al. (2022, Theorem 5), we note that the Chebyshev inequality used therein (Peng et al., 2022, p.285) for controlling the tail of is too crude to produce a bound of the desired rate . In contrast, we exploit the non-negativity of and control only its lower tail, which decays exponentially as per the latest result in Leung and Shao (2023, Lemma 4.3).
Before ending this section, we establish (2.14) as promised:
Proof of (2.14).
Let and be two independent standard normal random variables that are also independent of the data .
With the decomposition in (2.2), one can first write the inequality
|
|
|
|
(2.20) |
|
|
|
|
The first term on the right hand side of (2.20) can be further bounded as
(2.21) |
|
|
|
Continuing, we can bound the first term on the right hand side of (2.21) as
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(2.22) |
|
|
|
|
where is as defined in (2.17);
the last equality above uses that is distributed as a normal random variable with mean and variance
|
|
|
Combining (2.20)-(2.22), we have
(2.23) |
|
|
|
By reverse arguments, in parallel to (2.20)-(2.22), we can get
|
|
|
|
|
|
and
|
|
|
which can be combined to give
(2.24) |
|
|
|
Together, (2.23) and (2.24) render (2.14).
3. Proof of the critical Lemma 2.3
Now we start proving Lemma 2.3. Define the event
(3.1) |
|
|
|
that only depends on the raw data . For the rest of this paper, if is any event whose occurrence is fully determined by the raw data , with a slight abuse of notation we will use to denote a situation where the event is true. For instance, indicates the realized value of is such that . In an analogous manner, means the realized values of the Bernoulli samplers are such that , for the event defined in (2.13). Next, we censor various random quantities: Define
the lower censored version of as
(3.2) |
|
|
|
from which we let
(3.3) |
|
|
|
whose first moments conditional on the raw data can be computed as
(3.4) |
|
|
|
With these, by letting
(3.5) |
|
|
|
and
(3.6) |
|
|
|
we will define the random variable
(3.7) |
|
|
|
where “SN” stands for “Studentized Nonlinear”, and it will soon become clear why it is named this way. We also remark that the quantities , and , which have and figuring as denominators in their definitions, are all well-defined on the entire space ; unlike and , the censored versions and can never equal zero by construction, so division by zero is not a concern.
Because
(3.8) |
|
|
|
in light of how and are defined, by tracing the definitions in (3.3), (3.5) and (3.6), one should see that
|
|
|
Together with the definition of in (2.15), the facts in the prior display can be put together to render that
|
|
|
this gets us off on the right foot with developing the B-E bound in Lemma 2.3, as it allows one to write
(3.9) |
|
|
|
by truncating away the complementary event , which has a small probability as we will see. We have thus transformed the problem into one of establishing a B-E bound for with respect to the random threshold , i.e. a bound on the absolute difference
(3.10) |
|
|
|
and can adopt the paradigm recently formalized in Leung et al. (2024) of using Stein’s method to prove B-E bounds for Studentized nonlinear statistics: For real-valued deterministic functions and a sample of independent random variables such that for all and , a Studentized nonlinear statistic is a statistic that can be written in the generic form
(3.11) |
|
|
|
where and are (usually small) remainder terms that may depend on the data , with almost surely; in particular, the generic statistic in (3.11) is “Studentized” by the data-driven estimate of its own variance, and we refer the readers to Shao et al. (2016, Sec. 4) for some typical examples of such statistics. Indeed,
(3.12) |
|
|
|
From the conditional moment calculations in (3.4), we have
(3.13) |
|
|
|
where the last equality comes from the fact in (3.8).
Moreover, only depends on , and is no smaller than because, by (3.8) again,
|
|
|
These verify the claim in (3.12).
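As a concrete instance of a statistic of the generic Studentized form above, one can take the classical one-sample t-statistic: it is linked to the self-normalized sum S_n/V_n (with S_n = Σ X_i and V_n² = Σ X_i²) by the well-known identity t_n = (S_n/V_n)·√((n − 1)/(n − S_n²/V_n²)), which the sketch below verifies numerically.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(50)
n = len(x)

# One-sample t-statistic, the prototypical Studentized statistic:
# the sample mean normalized by a data-driven estimate of its own scale.
t_stat = x.mean() * np.sqrt(n) / x.std(ddof=1)

# Self-normalized sum and the classical identity linking it to t_stat:
# t_n = (S_n / V_n) * sqrt((n - 1) / (n - (S_n / V_n)^2)).
s = x.sum()
v = np.sqrt((x ** 2).sum())
r = s / v
t_from_selfnorm = r * np.sqrt((n - 1) / (n - r ** 2))

print(t_stat, t_from_selfnorm)   # the two agree up to floating-point error
```

This is one of the typical examples surveyed in Shao et al. (2016, Sec. 4): the nonlinear remainder terms in the generic form capture exactly how such a statistic deviates from a plain standardized sum.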
Based on the preceding observation, we can adopt the framework of Leung et al. (2024) to develop a preliminary bound on the absolute difference in (3.10).
To state the resulting bound, we shall
define the following “leave-one-out” versions of , and with respect to each , by leaving out any term involving :
|
|
|
(3.14) |
|
|
|
and
(3.15) |
|
|
|
Further, we censor and , as well as and for each , as:
|
|
|
(3.16) |
|
|
|
and
(3.17) |
|
|
|
The following B-E bound for with respect to the random threshold is proven in Appendix B via suitably adapting the arguments from Leung et al. (2024):
Lemma 3.1 (B-E bound for with the random threshold ).
Under the same assumptions as Theorem 2.1, for any , there exists an absolute constant such that
|
|
|
for
|
|
|
and
|
|
|
where
(3.18) |
|
|
|
is the solution to the Stein equation (Stein, 1972)
(3.19) |
|
|
|
and denotes in (3.18) with substituted by the random quantity .
Evidently, to arrive at Lemma 2.3,
we must further bound the quantities and in Lemma 3.1, which respectively contain terms related to the remainders in the numerator and denominator of in (3.7):
Lemma 3.2 (Bounding ).
Under the same assumptions as in Theorem 2.1, the term defined in Lemma 3.1 satisfies the bound
|
|
|
Lemma 3.3 (Bounding ).
Suppose the same assumptions as in Theorem 2.1 hold. If , then the term defined in Lemma 3.1 satisfies the bound
|
|
|
If, in addition, , then
.
The proof of Lemma 3.2 in Appendix C entails computing tight moment estimates that exploit conditioning on the raw data wherever necessary, and the proof of Lemma 3.3 in Section 4 involves leave-one-out arguments unique to this problem. To seal the proof of Lemma 2.3, we critically require the following exponential lower tail bound for U-statistics constructed with non-negative kernels, recently proved in Leung and Shao (2023, Lemma 4.3):
Lemma 3.4 (Exponential lower tail bound for U-statistics with non-negative kernels).
Suppose is a U-statistic, and is a symmetric kernel of degree that can only take non-negative values, with the property that for some . Then for ,
|
|
|
where is defined as the greatest integer less than .
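A quick Monte Carlo sketch of the phenomenon described by Lemma 3.4 (our own illustration with the non-negative degree-2 kernel h(x, y) = (x − y)²/2, for which θ = E h = 1 under standard normal data): the lower tail of such a U-statistic is already negligible at moderate sample sizes.

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)

def u_stat(x, pairs):
    # Complete U-statistic with the non-negative kernel h(x, y) = (x - y)^2 / 2;
    # for this kernel, U_n coincides with the usual sample variance.
    return np.mean(0.5 * (x[pairs[:, 0]] - x[pairs[:, 1]]) ** 2)

n, reps, theta = 40, 2000, 1.0
pairs = np.array(list(itertools.combinations(range(n), 2)))
u = np.array([u_stat(rng.standard_normal(n), pairs) for _ in range(reps)])

# Empirical lower-tail probability P(U_n <= theta / 4): essentially zero,
# in line with the exponential lower tail decay asserted by Lemma 3.4.
lower_tail = np.mean(u <= theta / 4)
print(lower_tail)
```

Note that only the lower tail enjoys this exponential decay; the upper tail is governed by the moment assumptions on the kernel, which is precisely why the lemma is stated one-sidedly.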
Taking and in Lemma 3.4, from the definition of in (3.1) one has
(3.20) |
|
|
|
Combining (3.9), (3.20),
Lemmas 3.1-3.3 and the fact that
|
|
|
(by virtue of (2.19) and ), the proof of Lemma 2.3 is finished.
4. Bounding in Lemma 3.3
In this section, we will use the following bounds on the solution to the Stein equation in (3.18) proven in Appendix D.1.
Lemma 4.1 (Bounds for and ).
The following bounds are true for in (3.18) and its derivative :
-
(i)
For all , and .
-
(ii)
|
|
|
and
|
|
|
-
(iii)
For ,
|
|
|
and
|
|
|
Due to the bounds for in Lemma 4.1, it is easy to see that
(4.1) |
|
|
|
Given , by Property 2.2 and (4.1), in light of the definition of we have
|
|
|
where the second last inequality comes from and the last inequality uses a classical bound on U-statistic variance (Koroljuk and Borovskich, 1994, Lemma 1.1.4). This proves the bound for in Lemma 3.3 under the stronger finite fourth moment assumption .
The bound for in Lemma 3.3 under the finite third moment assumption is considerably more involved. To prove it,
we first upper-censor the ’s in (2.10) as
|
|
|
from which we can further define
(4.2) |
|
|
|
(4.3) |
|
|
|
|
(4.4) |
|
|
|
and
|
|
|
The following moment bounds on (4.2) and (4.3), proven in Appendix D.2, will be useful in the sequel.
Lemma 4.2 (Moment estimates on and ).
If , the following moment bounds hold for and defined in (4.2) and (4.3):
|
|
|
where is a constant depending only on ,
and
|
|
|
Since
|
|
|
in light of the Hoeffding decomposition in (2.9), one can see that
(4.5) |
|
|
|
by comparing the definitions of and in (3.6) and (4.4). Now, because
’s component terms satisfy the bounds , and by (4.1),
it is easy to see from (4.5) that
(4.6) |
|
|
|
where we have defined the event
|
|
|
In Appendices D.3-D.5, we will derive that one can bound each of the first three terms on the right hand side of (4.6) in moments of and as:
(4.7) |
|
|
|
(4.8) |
|
|
|
(4.9) |
|
|
|
Hence, upon applying the moment bound for , and in Lemma 4.2 to the right hand sides of (4.7)-(4.9), one can further get
(4.10) |
|
|
|
(4.11) |
|
|
|
(4.12) |
|
|
|
where we have absorbed the term showing up in the bound for into . This is permissible, because
|
|
|
by virtue of (4.1) and being censored within the interval ; for the bound on to be non-vacuous, one can hence assume without loss of generality that , which implies , thereby justifying the absorption. By recalling (4.6), (4.10)-(4.12) give
(4.13) |
|
|
|
For the rest of this section, we will show the estimate
(4.14) |
|
|
|
with leave-one-out arguments that are unique to this problem; combining (4.13)-(4.14), the union bound
|
|
|
and the bound for in Lemma 4.2 will then prove Lemma 3.3 under the finite third moment assumption .
For the proof of (4.14) to follow, we will use the symbol to denote a summation over the index vectors in for which for any . In the same spirit, we will use to denote a summation over the index vectors in for which for some .
Let
(4.15) |
|
|
|
be the “leave-one-out” version of eliminating terms involving , and define
(4.16) |
|
|
|
analogously to and . Then write
(4.17) |
|
|
|
We will next establish the bounds
(4.18) |
|
|
|
and
(4.19) |
|
|
|
combining (4.17)-(4.19) will then finish the proof of (4.14).
4.1. Bound on the term in (4.18)
We first have to separately consider and . Since by Lemma 4.1,
(4.20) |
|
|
|
Since
|
|
|
we can form the bound
|
|
|
|
|
|
|
|
|
|
|
|
(4.21) |
|
|
|
where the first term in (4.21) comes from the following facts:
-
(i)
For any with , if by the bounds for in Lemma 4.1 and ;
-
(ii)
For any with , implies because by definition,
and the second term in (4.21) comes from by Lemma 4.1.
As
(4.22) |
|
|
|
(4.20)-(4.21) imply
(4.23) |
|
|
|
Now we estimate the term in (4.23) as
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(4.24) |
|
|
|
|
To estimate the term in (4.23), we first write
|
|
|
|
|
|
|
|
(4.25) |
|
|
|
|
where the last inequality uses the concavity of the function.
Since , continuing from (4.25), we have
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(4.26) |
|
|
|
|
Combining (4.23)-(4.26), we have proved the bound in (4.18).
4.2. Bound on the term in (4.19)
Let
(4.27) |
|
|
|
where
|
|
|
First we write the equality
(4.28) |
|
|
|
and will bound and . By the independence between and , we have
|
|
|
so we can write as
|
|
|
|
|
|
by the independence between and |
|
|
|
|
|
|
|
|
|
From here, with both and being non-negative, we can immediately write
(4.29) |
|
|
|
At this point, we will bound the integrand in (4.29). By from Lemma 4.1,
(4.30) |
|
|
|
Since by Lemma 4.1 and for any and by Lemma 4.1 and , for any , we have
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(4.31) |
|
|
|
Hence, from (4.30)-(4.31), we have
|
|
|
plugging this latter fact into (4.29), we get
|
|
|
|
|
|
|
|
|
|
|
|
(4.32) |
|
|
|
|
where (4.32) comes from evaluating and bounding the conditional expectation as
|
|
|
Since , we can then further continue from (4.32) as
|
|
|
|
|
|
|
|
(4.33) |
|
|
|
|
where the last inequality is true via .
Now we bound in (4.28). Using that in Lemma 4.1 and , we first write
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(4.34) |
|
|
|
|
Combining (4.28), (4.33) and (4.34) gives (4.19).