Efficient median of means estimator
Abstract
The goal of this note is to present a modification of the popular median of means estimator that achieves sub-Gaussian deviation bounds with nearly optimal constants under minimal assumptions on the underlying distribution. We build on recent work on the topic by the author and prove that the desired guarantees can be attained under weaker requirements.
The author acknowledges support from the National Science Foundation grants DMS CAREER-2045068 and CCF-1908905.
1 Introduction.
Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$. A sub-Gaussian estimator of $\mu$ based on a sample $X_1,\dots,X_N$ of i.i.d. copies of $X$ is a measurable function $\widetilde\mu_N = \widetilde\mu_N(X_1,\dots,X_N)$ such that
$$\Pr\left(\left|\widetilde\mu_N-\mu\right|>L\sigma\sqrt{\frac{t}{N}}\right)\le 2e^{-t}$$
for absolute constants $L,c>0$ and all $1\le t\le cN$. It is known (for instance, see the work by Catoni (2012)) that necessarily $L\ge\sqrt{2}$. A natural question, posed previously by Devroye et al. (2016), is whether sub-Gaussian estimators with $L=\sqrt{2}+o(1)$, where $o(1)$ is a function that goes to $0$ as $N$ (and possibly $t$) tend to infinity, exist.
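To see why $\sqrt{2}$ is the benchmark value of the constant, note that in the ideal Gaussian case the sample mean itself attains it; the following short computation (ours, included for context) makes this explicit. If $X_i\sim N(\mu,\sigma^2)$, then $\bar X_N\sim N(\mu,\sigma^2/N)$, and the standard Gaussian tail bound $\Pr(|Z|>s)\le 2e^{-s^2/2}$ for $Z\sim N(0,1)$ yields
$$\Pr\left(\left|\bar X_N-\mu\right|>\sigma\sqrt{\frac{2t}{N}}\right)=\Pr\left(|Z|>\sqrt{2t}\right)\le 2e^{-t},$$
so that the sample mean of Gaussian data is sub-Gaussian with $L=\sqrt{2}$.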
Several authors showed that such estimators can indeed be constructed under various additional assumptions. In one of the earliest works on the topic, Catoni (2012) presented the first known example of sharp sub-Gaussian estimators for distributions with finite fourth moment and a known upper bound on the kurtosis, as well as for distributions with known variance. The construction by Devroye et al. (2016) similarly required the fourth moment to be finite. One of the strongest results is the one by Lee and Valiant (2020): their estimator attains the required guarantees uniformly over the class of distributions with finite variance, assuming nothing beyond the finite second moment, albeit with $L\to\sqrt{2}$ only in the limit as $t\to\infty$. Minsker (2023) proposed a permutation-invariant version of the well known median of means (MOM) estimator (Nemirovski and Yudin, 1983; Jerrum et al., 1986; Alon et al., 1996) and proved that it achieves the desired guarantees for the class of distributions with sufficiently many finite moments and “sufficiently regular” probability density functions.
The main goal of this note is to present a modification of the “permutation-invariant” MOM estimator that attains sub-Gaussian guarantees with asymptotically optimal constants for distributions possessing $2+\delta$ moments for some $\delta>0$. This result could yield improvements for a variety of robust algorithms (e.g., see the survey by Lugosi and Mendelson (2019)) in which the classical MOM estimator serves as a subroutine.
1.1 Notation.
For a positive integer $n$, $[n]$ will denote the set $\{1,\dots,n\}$. We employ standard big-O and small-o notation for asymptotic relations between functions and sequences; it will be implicitly assumed that $O(\cdot)$ and $o(\cdot)$ may denote different functions from line to line. Moreover, given two sequences $\{a_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ where $b_n\ne0$ for all $n$, we will write $a_n\ll b_n$ if $a_n/b_n\to0$ as $n\to\infty$. Additional notation will be introduced in the main text whenever necessary.
2 Main results.
Let us recall the definition of the classical median of means estimator. Given an i.i.d. sample $X_1,\dots,X_N$ from a distribution with mean $\mu$ and variance $\sigma^2$, let $G_1,\dots,G_k$ be a collection of disjoint subsets of $[N]$ of cardinality $n=\lfloor N/k\rfloor$ each, and
$$\widehat\mu_{\mathrm{MOM}}:=\mathrm{med}\left(\bar X_{G_1},\dots,\bar X_{G_k}\right),$$
where $\bar X_{G_j}=\frac1n\sum_{i\in G_j}X_i$ and “med” stands for the “median.” It is known that $\widehat\mu_{\mathrm{MOM}}$ satisfies the inequality $\Pr\left(\left|\widehat\mu_{\mathrm{MOM}}-\mu\right|>L\sigma\sqrt{t/N}\right)\le 2e^{-t}$ with $L=\sqrt{\pi}+o(1)$, where $o(1)$ goes to $0$ as $\min(k,t)\to\infty$. Minsker (2023) proved that allowing overlapping subsets of data improves the constant: given $J\subseteq[N]$ of cardinality $|J|=n$, set $\bar X_J:=\frac1n\sum_{i\in J}X_i$ and define $\widetilde\mu_N:=\mathrm{med}\left(\mathcal A_n\right)$, where $\mathcal A_n$ denotes the set of sample averages $\bar X_J$ computed over all possible subsets $J$ of $[N]$ of cardinality $n$. Then $\widetilde\mu_N$ attains sub-Gaussian deviations with $L=\sqrt{2}+o(1)$ under the assumptions described in section 1. Essentially, $\widetilde\mu_N$ is a function of the order statistics $X_{(1)},\dots,X_{(N)}$, which are complete and sufficient for the family of all distributions with finite variance.
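As a point of reference, here is a minimal sketch of the classical MOM estimator (an illustration written by us, not code from the paper; the function name and signature are ours):

import numpy as np

def median_of_means(x: np.ndarray, k: int) -> float:
    """Median of k disjoint block means; the remainder N mod k is discarded."""
    n = len(x) // k                              # block size
    block_means = x[: n * k].reshape(k, n).mean(axis=1)
    return float(np.median(block_means))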
Our construction, presented below, shows that it is not necessary to use all possible sample means, and that a much smaller collection of averages suffices: not only does this make computation easier, but the theoretical guarantees for the resulting estimator hold under weaker assumptions. The main idea is to split the data into subsets of size smaller than $n$, and to construct all possible sample means using these subsets as “building blocks.” The size of the overlap is then naturally proportional to the size of the block. For instance, the estimator $\widetilde\mu_N$ corresponds to blocks of size $1$, resulting in the sample means computed over all possible subsets of a given size. Our results show that allowing the block size to grow slowly with the sample size can be beneficial. Formally, let $k$, $m$ and $n$ be positive integers such that $N=nk$. Assume that $G_1,\dots,G_k$ are disjoint subsets of $[N]$ of cardinality $n$ each, and $\bigcup_{j=1}^kG_j=[N]$. It will be convenient to set $Y_j:=\frac1n\sum_{i\in G_j}X_i$, $j\in[k]$, and to view $Y_1,\dots,Y_k$ as a new i.i.d. sample; clearly, $Y_1$ has mean $\mu$ and variance $\sigma^2/n$. Given $J\subseteq[k]$ of cardinality $|J|=m$, set $\bar Y_J:=\frac1m\sum_{j\in J}Y_j$; note that $\bar Y_J$ is an average of $mn$ observations from the original sample, same as in the definition of the standard MOM estimator. Define
$$\widehat\mu:=\mathrm{med}\left(\mathcal A_m\right),\qquad(1)$$
where $\mathcal A_m:=\left\{\bar Y_J:\ J\subseteq[k],\ |J|=m\right\}$ denotes the set of sample averages computed over all possible subsets of $\{Y_1,\dots,Y_k\}$ of cardinality $m$. In other words, $\widehat\mu$ is the median of means computed over overlapping subsets of data, where the size of the overlap is proportional to $n$, the size of the blocks $G_j$. We remark here that all explicit, non-asymptotic deviation guarantees that are valid for the classical MOM estimator automatically extend to $\widehat\mu$ in view of the so-called “Hoeffding representation” of U-statistics (Lee, 2019) as the average of averages of independent random variables; the pursuit of optimal constants, however, appears to require bounds that include asymptotic terms. Everywhere below, it is assumed that $k$, $m$ and $n$ are functions of the sample size $N$. We proceed with the statement of our main result. Denote
$$\Delta_n:=\frac{1}{\sigma^2}\,\mathbb E\left[(X-\mu)^2\,\mathbb 1\left\{|X-\mu|>\sigma\sqrt n\right\}\right]+\frac{1}{\sigma^3\sqrt n}\,\mathbb E\left[|X-\mu|^3\,\mathbb 1\left\{|X-\mu|\le\sigma\sqrt n\right\}\right].\qquad(2)$$
Feller (1968) proved that $\Delta_n$ controls the rate of convergence in the central limit theorem, namely that $\sup_{s\in\mathbb R}\left|F_n(s)-\Phi(s)\right|\le C\,\Delta_n$ for an absolute constant $C>0$, where $F_n$ and $\Phi$ are the distribution functions of $\sqrt n\,(\bar X_n-\mu)/\sigma$ and the standard normal law respectively. It is well known that $\Delta_n\to0$ as $n\to\infty$ for distributions with finite variance. Moreover, $\Delta_n$ admits an upper bound of the form $O\left(n^{-\delta/2}\right)$ whenever $\mathbb E|X-\mu|^{2+\delta}<\infty$ for some $\delta\in(0,1]$. In the context of median of means estimation, the role of $\Delta_n$ is to control the difference between the mean and the median of the distribution of $\bar Y_J$, which can be seen as the main contribution to the bias of $\widehat\mu$.
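Before stating the main result, we include a small illustration (ours, not part of the construction beyond what (1) prescribes) of how $\widehat\mu$ can be computed by exact enumeration for small $k$ and $m$; the function name and signature are ours:

import numpy as np
from itertools import combinations

def overlapping_mom(x: np.ndarray, k: int, m: int) -> float:
    """Estimator (1): median of the averages of the block means Y_j
    over all subsets J of [k] with |J| = m."""
    n = len(x) // k                              # block size
    y = x[: n * k].reshape(k, n).mean(axis=1)    # block means Y_1, ..., Y_k
    subset_means = [y[list(J)].mean() for J in combinations(range(k), m)]
    return float(np.median(subset_means))

Exact enumeration involves $\binom km$ subsets and quickly becomes infeasible; see Remark 1(c) below for a randomized alternative.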
Theorem 1.
Assume that $\mathbb E|X-\mu|^{2+\delta}<\infty$ for some $\delta\in(0,1]$. Suppose that $N=nk$, and let $m=m(N)$ and $n=n(N)$ be any sequences such that $m,n\to\infty$ and $\sqrt m\,\Delta_{mn}\to0$. Then for all $1\le t\le c\,m$,
$$\Pr\left(\left|\widehat\mu-\mu\right|>\sqrt{2}\,\sigma\left(1+o(1)\right)\sqrt{\frac tN}\right)\le2e^{-t}\left(1+o(1)\right),$$
where $o(1)\to0$ as $N\to\infty$ uniformly over all $1\le t\le c\,m$.
Remark 1.
-
(a)
A possible choice of parameters is to let the block size $n$ grow slowly (for instance, logarithmically) with $N$, with $m$ and $k=N/n$ chosen accordingly. By varying $m$, the deviation guarantees can be attained in the desired range of the confidence parameter $t$.
-
(b)
The question of uniformity of the bounds with respect to the underlying distribution is not explicitly addressed in this note. In particular, the $o(1)$ quantities appearing in the inequalities are distribution-dependent. With additional effort, it should be possible to prove uniformity with respect to classes of distributions of $X$ satisfying moment conditions of the form $\mathbb E|X-\mu|^{2+\delta}\le M_N$ for a sequence $M_N$ that grows sufficiently slowly.
-
(c)
Exact computation of $\widehat\mu$ is still prohibitively expensive from a numerical standpoint, as the naive upper bound on the cost of evaluating the estimator exactly is of order $\binom km$. Instead, one may select a collection of subsets of $[k]$ of cardinality $m$ uniformly at random and compute the median of the corresponding sample means: Theorem 1 in section 4.3.3 of the book by Lee (2019) implies that the asymptotic distribution of the estimator constructed in this way coincides with the asymptotic distribution of $\widehat\mu$ as soon as the number of sampled subsets grows sufficiently fast. However, this asymptotic equivalence does not automatically imply sharp non-asymptotic bounds for the estimator computed from subsampled blocks: results of such nature are currently unknown to us and require further investigation. A sketch of this randomized variant is given after this list.
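The following sketch (ours; the parameter $B$ for the number of sampled subsets and the function name are hypothetical illustration choices, not prescribed by the remark) implements the randomized variant just described:

import numpy as np

def subsampled_mom(x: np.ndarray, k: int, m: int, B: int, seed: int = 0) -> float:
    """Randomized variant of (1): median over the means of B subsets of
    [k] of cardinality m drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    n = len(x) // k                              # block size
    y = x[: n * k].reshape(k, n).mean(axis=1)    # block means Y_1, ..., Y_k
    means = [y[rng.choice(k, size=m, replace=False)].mean() for _ in range(B)]
    return float(np.median(means))

The cost is $O(Bm+N)$ rather than of order $\binom km$; how large $B$ must be for sharp non-asymptotic guarantees is precisely the open question mentioned above.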
Proof.
As the inequality in question is scale-invariant, we can and will assume that $\sigma=1$. Set $\mathcal J:=\left\{J\subseteq[k]:\ |J|=m\right\}$, and note that the equivalent characterization of $\widehat\mu$ as an M-estimator is
$$\widehat\mu\in\mathrm{argmin}_{z\in\mathbb R}\ L(z),\qquad L(z):=\sum_{J\in\mathcal J}\left|\bar Y_J-z\right|.$$
The necessary conditions for the minimum of $L$ imply that $\partial_-L(\widehat\mu)\le0$, hence the left derivative satisfies $\#\left\{J\in\mathcal J:\ \bar Y_J\ge\widehat\mu\right\}\ge\frac12\binom km$. Therefore, if $\widehat\mu-\mu\ge u$ for some $u>0$, then, due to $\partial_-L$ being nondecreasing, $\#\left\{J\in\mathcal J:\ \bar Y_J-\mu\ge u\right\}\ge\frac12\binom km$. It implies that
$$\Pr\left(\widehat\mu-\mu\ge u\right)\le\Pr\left(\frac1{\binom km}\sum_{J\in\mathcal J}\left(\mathbb 1\left\{\bar Y_J-\mu\ge u\right\}-p(u)\right)\ge\frac12-p(u)\right),\qquad(3)$$
where we used the shortcut $p(u)$ in place of $\Pr\left(\bar Y_J-\mu\ge u\right)$. Note that
$$\frac12-p(u)=\Phi\left(u\sqrt{mn}\right)-\frac12+\theta\,C\,\Delta_{mn}\quad\text{for some }\theta\in[-1,1].\qquad(4)$$
Since
$$\Phi(s)-\frac12=\frac{s}{\sqrt{2\pi}}+\rho(s),\qquad(5)$$
where $|\rho(s)|\le\frac{s^3}{6\sqrt{2\pi}}$, we see that
$$\frac12-p(u)=\frac{u\sqrt{mn}}{\sqrt{2\pi}}\left(1+o(1)\right),$$
which is the case whenever $u\sqrt{mn}\to0$ and $\Delta_{mn}\ll u\sqrt{mn}$. It remains to analyze the U-statistic
$$U_N(u):=\frac1{\binom km}\sum_{J\in\mathcal J}\left(\mathbb 1\left\{\bar Y_J-\mu\ge u\right\}-p(u)\right).\qquad(6)$$
As the expression above is invariant with respect to the shift $X_i\mapsto X_i-\mu$, we can assume that $\mu=0$. For $y\in\mathbb R$, let
$$g(y):=\Pr\left(\bar Y'_{[m]}\ge u\ \middle|\ Y'_1=y\right)-p(u),$$
where $\bar Y'_{[m]}=\frac1m\left(Y'_1+\dots+Y'_m\right)$ is an independent copy of $\bar Y_{[m]}$ based on a sample that is an independent copy of $X_1,\dots,X_N$. Our goal is to determine the size of $\mathbb E g^2(Y_1)$.
Remark 2.
The quantity $g(Y_1)$ is related to the so-called Hájek projection that can be viewed as the best (in the mean squared sense) approximation of the U-statistic in terms of sums of i.i.d. random variables. For related background on U-statistics, we refer the reader to an excellent monograph by Lee (2019).
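For convenience, we recall the standard form of the Hájek projection (a known fact stated here for context, not a result of this note): for a square-integrable statistic $U=U(Y_1,\dots,Y_k)$ with $\mathbb EU=\theta$,
$$\widehat U:=\theta+\sum_{j=1}^k\mathbb E\left[U-\theta\ \middle|\ Y_j\right]$$
minimizes $\mathbb E(U-S)^2$ over all statistics $S$ of the form $S=\sum_{j=1}^kf_j(Y_j)$ with $\mathbb Ef_j^2(Y_j)<\infty$. For the U-statistic (6), the summands of the projection are exactly multiples of $g(Y_j)$.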
Lemma 1.

Suppose that $u=u(N)$ is such that $u\sqrt{mn}\to0$ and that $\sqrt m\,\Delta_{mn}\to0$. Then
$$\mathbb E g^2(Y_1)=\frac{1}{2\pi m}\left(1+o(1)\right)\quad\text{as }N\to\infty.$$
The proof of the lemma is given in section 3. The following result, a deviation inequality for U-statistics of order that grows with the sample size, is the second key technical tool required to complete the argument.
Theorem 2.
Let $h:\mathbb R^m\to\mathbb R$ be a function that is invariant with respect to permutations of its arguments, and let $U_N$ be the corresponding U-statistic with kernel $h$ evaluated on an i.i.d. sample $Z_1,\dots,Z_N$. Assume that $m=m(N)$ is an increasing function of $N$, and that
-
(a)
$h$ is uniformly bounded;
-
(b)
$\sigma_1^2:=\mathrm{Var}\left(\mathbb E\left[h(Z_1,\dots,Z_m)\ \middle|\ Z_1\right]\right)>0$, where $Z_1,\dots,Z_m$ are i.i.d. copies of $Z_1$.
Let $\tau=\tau(N,m)$ be increasing in $N$, decreasing in $m$, and such that $\tau\ll m$. Then for all $1\le t\le\tau$,
$$\Pr\left(\left|U_N-\mathbb EU_N\right|\ge\sigma_1\,m\sqrt{\frac{2t}{N}}\left(1+o(1)\right)\right)\le2e^{-t}\left(1+o(1)\right),$$
where $o(1)\to0$ as $N\to\infty$ uniformly over $1\le t\le\tau$.
The proof of this result is outlined in section 4. (We note that closely related results for U-statistics were obtained by Maurer (2019), and it may be possible to use Maurer’s inequality in place of Theorem 2.) To get the desired inequality for the estimator $\widehat\mu$, it remains to apply Theorem 2 and Lemma 1 to the U-statistic $U_N(u)$ defined in (6): specifically, taking $u=\sqrt{2}\left(1+o(1)\right)\sqrt{t/N}$, we deduce that
$$\Pr\left(U_N(u)\ge\frac12-p(u)\right)\le2e^{-t}\left(1+o(1)\right)$$
uniformly over $1\le t\le c\,m$, and the final result follows.
∎
3 Proof of Lemma 1.
Note that we can rewrite $g(y)$ as
$$g(y)=1-F_{m-1}\left(\frac{(mu-y)\sqrt n}{\sqrt{m-1}}\right)-p(u).$$
Given an integer $q$, let $F_q$ be the cumulative distribution function of $\sqrt{\frac nq}\sum_{j=1}^qY_j$. Then
$$p(u)=1-F_m\left(u\sqrt{mn}\right),$$
with $F_m$ being the law of $\sqrt{\frac nm}\sum_{j=1}^mY_j$. Feller’s version of the Berry-Esseen theorem implies that
$$\sup_{s\in\mathbb R}\left|F_q(s)-\Phi(s)\right|\le C\,\Delta_{qn},$$
where $\Phi$ is the distribution function of the standard normal law. Therefore,
$$g(y)=\Phi\left(u\sqrt{mn}\right)-\Phi\left(\frac{(mu-y)\sqrt n}{\sqrt{m-1}}\right)+O\left(\Delta_{(m-1)n}+\Delta_{mn}\right),$$
where the error term is $o\left(m^{-1/2}\right)$ by assumption. At the same time,
$$\Phi\left(u\sqrt{mn}\right)-\Phi\left(\frac{(mu-y)\sqrt n}{\sqrt{m-1}}\right)=\phi\left(u\sqrt{mn}\right)\frac{y\sqrt n}{\sqrt m}\left(1+o(1)\right)+R(y),$$
where $R$ is such that $\sqrt m\,R(Y_1)\to0$ as $m,n\to\infty$ and $\phi$, the standard normal density, is a bounded function. Therefore, $\sqrt m\,g(Y_1)-\phi(0)\sqrt n\,Y_1\to0$ almost surely, assuming that $u\sqrt{mn}\to0$. Finally, note that
$$\sup_{x\in\mathbb R}\Pr\left(x\le\sqrt n\,Y_1\le x+\lambda\right)\le C'\left(\lambda+n^{-1/2}\right),$$
where the last inequality follows from the well known bound for the concentration function (Theorem 2.20 in the book by Petrov (1995)); here, $C'$ is a constant that may depend on the distribution of $X$. We therefore conclude that the sequence $\left\{m\,g^2(Y_1)\right\}$ is uniformly integrable (as $\left\{n\,Y_1^2\right\}$ is), hence the claim follows.
4 Proof of Theorem 2.
The union bound together with Hoeffding’s decomposition entails that for any $u>0$ and $\epsilon\in(0,1)$ (to be chosen later as a decreasing sequence $\epsilon_N\to0$),
$$\Pr\left(\left|U_N-\mathbb EU_N\right|\ge u\right)\le\Pr\left(\frac mN\left|\sum_{j=1}^Ng_1(Z_j)\right|\ge(1-\epsilon)u\right)+\Pr\left(\left|\sum_{c=2}^m\binom mcU_N^{(c)}\right|\ge\epsilon u\right),$$
where $U_N^{(c)}$ are the U-statistics with the degenerate kernels $g_c$ corresponding to the higher-order terms of Hoeffding’s decomposition (not to be confused with the derivatives!). Specifically,
$$g_c(z_1,\dots,z_c)=\left(\delta_{z_1}-P\right)\times\dots\times\left(\delta_{z_c}-P\right)\times P^{m-c}\,h,$$
where $\delta_z$ is the point measure concentrated at $z$ and $P$ is the distribution of $Z_1$; in particular, $g_1(z)=\mathbb E\left[h(z,Z_2,\dots,Z_m)\right]-\mathbb Eh$. It is known that the kernels $g_c$ can be viewed geometrically as orthogonal projections of $h$ onto a particular subspace of $L_2(P^m)$. We refer the reader to the book by Lee (2019) for further details related to the background material on U-statistics and Hoeffding’s decomposition. Bernstein’s inequality yields that
$$\Pr\left(\frac mN\left|\sum_{j=1}^Ng_1(Z_j)\right|\ge(1-\epsilon)u\right)\le2\exp\left(-\frac{(1-\epsilon)^2u^2N}{2m^2\sigma_1^2}\left(1+o(1)\right)\right),$$
where $o(1)\to0$ as $N\to\infty$ uniformly over the stated range of $t$. It remains to control the expression involving the higher-order Hoeffding decomposition terms: we will show that it is bounded from above by $2e^{-t}\,o(1)$, where $o(1)\to0$ uniformly over the range of $t$. To this end, we will need the concentration inequality for U-statistics of growing order established in Minsker (2023, Theorem 4.1). Set $u=\sigma_1\,m\sqrt{2t/N}\,(1+\epsilon)$, and note that, in view of the union bound,
$$\Pr\left(\left|\sum_{c=2}^m\binom mcU_N^{(c)}\right|\ge\epsilon u\right)\le\sum_{c=2}^m\Pr\left(\binom mc\left|U_N^{(c)}\right|\ge\epsilon_cu\right),\qquad\sum_{c=2}^m\epsilon_c=\epsilon,$$
where each term on the right-hand side can be bounded using the first bound of Theorem 4.1 in Minsker (2023). Whenever $1\le t\le\tau$ and $\epsilon=\epsilon_N$ tends to $0$ sufficiently slowly, the resulting expression is at most $2e^{-t}\,o(1)$, where $o(1)\to0$ uniformly over the stated range of $t$. The desired conclusion follows.
Acknowledgements.
The author is grateful to the anonymous Referees for their insightful comments and constructive feedback that helped improve the quality of the presentation.
References
- Alon et al., (1996) Alon, N., Matias, Y., and Szegedy, M. (1996). The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20–29. ACM.
- Catoni, (2012) Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 48, pages 1148–1185. Institut Henri Poincaré.
- Devroye et al., (2016) Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725.
- Feller, (1968) Feller, W. (1968). On the Berry-Esseen theorem. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 10(3):261–268.
- Jerrum et al., (1986) Jerrum, M. R., Valiant, L. G., and Vazirani, V. V. (1986). Random generation of combinatorial structures from a uniform distribution. Theoretical computer science, 43:169–188.
- Lee, (2019) Lee, A. J. (2019). U-statistics: Theory and Practice. Routledge.
- Lee and Valiant, (2020) Lee, J. C. H. and Valiant, P. (2020). Optimal sub-Gaussian mean estimation in $\mathbb R$. arXiv preprint arXiv:2011.08384.
- Lugosi and Mendelson, (2019) Lugosi, G. and Mendelson, S. (2019). Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190.
- Maurer, (2019) Maurer, A. (2019). A Bernstein-type inequality for functions of bounded interaction. Bernoulli, 25(2):1451–1471.
- Minsker, (2023) Minsker, S. (2023). U-statistics of growing order and sub-Gaussian mean estimators with sharp constants. Mathematical Statistics and Learning, to appear.
- Nemirovski and Yudin, (1983) Nemirovski, A. and Yudin, D. (1983). Problem complexity and method efficiency in optimization. John Wiley & Sons Inc.
- Petrov, (1995) Petrov, V. V. (1995). Limit theorems of probability theory: sequences of independent random variables. Oxford University Press, New York.