
Efficient median of means estimator

Stanislav Minsker ([email protected])
Department of Mathematics, University of Southern California
Abstract

The goal of this note is to present a modification of the popular median of means estimator that achieves sub-Gaussian deviation bounds with nearly optimal constants under minimal assumptions on the underlying distribution. We build on recent work on the topic by the author and prove that the desired guarantees can be attained under weaker requirements.

Keywords: median of means estimator, U-statistics, heavy tails, robustness.

The author acknowledges support by the National Science Foundation grants DMS CAREER-2045068 and CCF-1908905.

1 Introduction.

Let $X$ be a random variable with mean $\mu$ and variance $\sigma^{2}$. A sub-Gaussian estimator of $\mu$ based on a sample $\mathcal{X}=\{X_{1},\ldots,X_{N}\}$ of i.i.d. copies of $X$ is a measurable function $\widetilde{\mu}:=\widetilde{\mu}(\mathcal{X};t)$ such that $\mathbb{P}\left(\left|\widetilde{\mu}-\mu\right|\geq C\sigma\sqrt{\frac{t}{N}}\right)\leq ce^{-t}$ for absolute constants $c,C>0$ and all $t\in\left[1,t_{\mathrm{max}}(N)\right]$. It is known (for instance, see the work by Catoni (2012)) that $C\geq\sqrt{2}$. A natural question, posed previously by Devroye et al. (2016), is whether there exist sub-Gaussian estimators with $C=\sqrt{2}+o(1)$, where $o(1)$ is a function that goes to $0$ as $N$ (and possibly $t$) tends to infinity.
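As a point of reference for this definition, the following simulation (ours, purely illustrative; numpy is assumed to be available) compares the deviation probabilities of the empirical mean with the sub-Gaussian benchmark $e^{-t}$ at the level $C=\sqrt{2}$: for Gaussian data the benchmark is met, while for a heavy-tailed distribution with finite variance it typically is not once $t$ is large relative to $N$.

```python
import numpy as np

# Estimate P(|mean - mu| >= C * sigma * sqrt(t / N)) with C = sqrt(2)
# and compare it with the sub-Gaussian target e^{-t}.
rng = np.random.default_rng(0)
N, reps, t = 200, 100_000, 10.0

def deviation_freq(draw, sigma):
    means = draw((reps, N)).mean(axis=1)   # `reps` independent sample means
    return np.mean(np.abs(means) >= np.sqrt(2.0) * sigma * np.sqrt(t / N))

p_gauss = deviation_freq(rng.standard_normal, sigma=1.0)
p_heavy = deviation_freq(lambda size: rng.standard_t(df=3, size=size),
                         sigma=np.sqrt(3.0))   # variance of t(3) equals 3
print(f"target e^-t   : {np.exp(-t):.1e}")
print(f"Gaussian mean : {p_gauss:.1e}")   # well below the target
print(f"t(3) mean     : {p_heavy:.1e}")   # typically exceeds the target
```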

Several authors showed that such estimators can indeed be constructed under various additional assumptions. In one of the earliest works on the topic, Catoni (2012) presented the first known examples of sharp sub-Gaussian estimators, valid for distributions with finite fourth moment and a known upper bound on the kurtosis, as well as for distributions with known variance. The construction by Devroye et al. (2016) similarly required the fourth moment to be finite. One of the strongest results is due to Lee and Valiant (2020): their estimator attains the required guarantees uniformly over the class of distributions with finite variance, assuming nothing beyond the finite second moment, albeit with $C=\sqrt{2}$ only in the limit as $t\to\infty$. Minsker (2023) proposed a permutation-invariant version of the well-known median of means (MOM) estimator (Nemirovski and Yudin, 1983; Jerrum et al., 1986; Alon et al., 1996) and proved that it achieves the desired guarantees for the class of distributions with more than $\frac{3+\sqrt{5}}{2}$ finite moments and “sufficiently regular” probability density functions.

The main goal of this note is to present a modification of the “permutation-invariant” MOM estimator that attains sub-Gaussian guarantees with asymptotically optimal constants for distributions possessing $2+\varepsilon$ moments for some $\varepsilon>0$. This result could yield improvements for a variety of robust algorithms (e.g., see the survey by Lugosi and Mendelson (2019)) that use the classical MOM estimator as a subroutine.

1.1 Notation.

For a positive integer $N$, $[N]$ will denote the set $\{1,\ldots,N\}$. We employ standard big-O and small-o notation for asymptotic relations between functions and sequences; it will be implicitly assumed that $o(1)$ and $O(1)$ may denote different functions from line to line. Moreover, given two sequences $\{a_{n}\}_{n\geq 1}$ and $\{b_{n}\}_{n\geq 1}$ where $b_{n}\neq 0$ for all $n$, we will write $a_{n}\ll b_{n}$ if $\frac{a_{n}}{b_{n}}=o(1)$ as $n\to\infty$. Additional notation will be introduced in the main text whenever necessary.

2 Main results.

Let us recall the definition of the classical median of means estimator. Given an i.i.d. sample $\mathcal{X}=\{X_{1},\ldots,X_{N}\}$ from a distribution $P$ with mean $\mu$ and variance $\sigma^{2}$, let $G_{1}\cup\ldots\cup G_{k}\subseteq[N]$ be a collection of $k$ disjoint subsets of cardinality $\lfloor N/k\rfloor$ each, $\bar{X}_{j}:=\frac{1}{|G_{j}|}\sum_{i\in G_{j}}X_{i}$ and $\widehat{\mu}_{\mathrm{MOM}}=\mbox{med}\left(\bar{X}_{1},\ldots,\bar{X}_{k}\right)$, where $\mbox{med}\left(\cdot\right)$ stands for the “median.” It is known that $\widehat{\mu}_{\mathrm{MOM}}$ satisfies the inequality $\mathbb{P}\left(\left|\widehat{\mu}_{\mathrm{MOM}}-\mu\right|\geq C\sigma\sqrt{\frac{t}{N}}\right)\leq 2e^{-t}$ with $C=\sqrt{\pi}+o(1)$, where $o(1)$ goes to $0$ as $k,N/k\to\infty$. Minsker (2023) proved that allowing overlapping subsets of the data improves the constant: given $J\subseteq[N]$ of cardinality $|J|=\lfloor N/k\rfloor$, set $\bar{X}_{J}:=\frac{1}{|J|}\sum_{j\in J}X_{j}$ and define $\widehat{\mu}_{U}=\mbox{med}\left(\bar{X}_{J},\ |J|=\lfloor N/k\rfloor\right)$, where $\left\{\bar{X}_{J},\ |J|=\lfloor N/k\rfloor\right\}$ denotes the set of sample averages computed over all possible subsets of $[N]$ of cardinality $\lfloor N/k\rfloor$. Then $\widehat{\mu}_{U}$ attains sub-Gaussian deviations with $C=\sqrt{2}+o(1)$ under the assumptions described in section 1. Essentially, $\widehat{\mu}_{U}$ is a function of the order statistics, which are complete and sufficient for the family of all distributions with finite variance.
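For concreteness, here is a minimal sketch (ours, not from the original text; numpy is assumed) of the classical estimator $\widehat{\mu}_{\mathrm{MOM}}$:

```python
import numpy as np

def mom(X: np.ndarray, k: int) -> float:
    """Classical median of means: split X into k disjoint blocks of size
    floor(N / k), average within each block, return the median of the averages."""
    N = len(X)
    m = N // k
    block_means = X[: k * m].reshape(k, m).mean(axis=1)  # discard the remainder
    return float(np.median(block_means))
```

The permutation-invariant version $\widehat{\mu}_{U}$ would instead take the median over all $\binom{N}{\lfloor N/k\rfloor}$ subset averages, which is infeasible to enumerate directly for all but trivial sample sizes; the construction below reduces this combinatorial burden.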

Our construction, presented below, shows that it is not necessary to use all possible sample means, and that a much smaller collection of averages suffices: not only does this make computation easier, but the theoretical guarantees for the resulting estimator hold under weaker assumptions. The main idea is to split the data into subsets of size smaller than $\lfloor N/k\rfloor$, and to construct all possible sample means using these subsets as “building blocks”. The size of the overlap is then naturally proportional to the size of the block. For instance, the estimator $\widehat{\mu}_{U}$ corresponds to blocks of size $1$, resulting in the sample means over all possible subsets of a given size. Our results show that allowing the block size to grow slowly with the sample size can be beneficial. Formally, let $k,l$ be positive integers such that $\frac{\lfloor N/k\rfloor}{l}\in\mathbb{N}$. Assume that $G_{1}\cup\ldots\cup G_{lk}\subseteq[N]$ are disjoint subsets of cardinality $\lfloor\frac{N}{lk}\rfloor$ each, and $Z_{j}:=\bar{X}_{j}=\frac{1}{|G_{j}|}\sum_{i\in G_{j}}X_{i},\ j=1,\ldots,lk$. It will be convenient to set $n=lk$, $m=\lfloor\frac{N}{k}\rfloor$, and to view $Z_{1},\ldots,Z_{n}$ as a new i.i.d. sample; clearly, $Z_{1}$ has mean $\mu$ and variance $\frac{\sigma^{2}}{m/l}$. Given $J\subseteq[n]$ of cardinality $|J|=l$, set $\bar{Z}_{J}:=\frac{1}{l}\sum_{j\in J}Z_{j}$; note that $\bar{Z}_{J}$ is an average of $m$ observations from the original sample $\mathcal{X}_{N}$, the same as in the definition of the standard MOM estimator. Define $\mathcal{A}_{n}^{(l)}=\left\{J\subset[n]:\ |J|=l\right\}$ and

\widehat{\mu}_{N}:=\mbox{med}\left(\bar{Z}_{J},\ J\in\mathcal{A}_{n}^{(l)}\right), \qquad (1)

where $\left\{\bar{Z}_{J},\ J\in\mathcal{A}_{n}^{(l)}\right\}$ denotes the set of sample averages computed over all possible subsets of $[n]$ of cardinality $l$. In other words, $\widehat{\mu}_{N}$ is the median of means computed over overlapping subsets of the data, where the size of the overlap is proportional to $\lfloor N/lk\rfloor$, the size of the block $G_{1}$. We remark here that all explicit, non-asymptotic deviation guarantees valid for the classical MOM estimator $\widehat{\mu}_{\mathrm{MOM}}$ automatically extend to $\widehat{\mu}_{N}$ in view of the so-called “Hoeffding representation” of U-statistics (Lee, 2019) as an average of averages of independent random variables; the pursuit of optimal constants, however, appears to require bounds that include asymptotic terms. Everywhere below, it is assumed that $k$, $m$ and $l$ are functions of the sample size $N$.
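The next sketch (ours, for illustration; it enumerates all of $\mathcal{A}_{n}^{(l)}$ and is therefore practical only when $\binom{n}{l}$ is small) makes definition (1) concrete:

```python
import numpy as np
from itertools import combinations

def mom_overlapping(X: np.ndarray, k: int, l: int) -> float:
    """Estimator (1): median of the averages Z_bar_J over all subsets J of
    cardinality l chosen from n = l * k disjoint blocks of size N // (l * k)."""
    N = len(X)
    n = l * k
    b = N // n                                  # block size, floor(N / (l k))
    Z = X[: n * b].reshape(n, b).mean(axis=1)   # block means Z_1, ..., Z_n
    subset_means = [Z[list(J)].mean() for J in combinations(range(n), l)]
    return float(np.median(subset_means))
```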

We proceed with the statement of our main result. Denote

g(m):=\frac{6}{\sqrt{m}}\,\mathbb{E}\left[\left(\frac{X_{1}-\mu}{\sigma}\right)^{2}\min\left(\left|\frac{X_{1}-\mu}{\sigma}\right|,\sqrt{m}\right)\right]. \qquad (2)

Feller (1968) proved that $g(m)$ controls the rate of convergence in the central limit theorem, namely that $\sup_{t\in\mathbb{R}}\left|\Phi_{m}(t)-\Phi(t)\right|\leq g(m)$, where $\Phi_{m}$ and $\Phi$ are the distribution functions of $\frac{\sum_{j=1}^{m}(X_{j}-\mu)}{\sigma\sqrt{m}}$ and the standard normal law, respectively. It is well known that $g(m)\to 0$ as $m\to\infty$ for distributions with finite variance. Moreover, $g(m)$ admits an upper bound of the form $g(m)\leq C\,\mathbb{E}\left|\frac{X_{1}-\mu}{\sigma}\right|^{2+\varepsilon}m^{-\varepsilon/2}$ whenever $\mathbb{E}|X_{1}-\mu|^{2+\varepsilon}<\infty$ for some $\varepsilon\in(0,1]$. In the context of median of means estimation, the role of $g(m)$ is to control the difference between the mean and the median of the distribution of $\frac{1}{m}\sum_{j=1}^{m}X_{j}$, which can be seen as the main contribution to the bias of $\widehat{\mu}_{N}$.
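Since $g(m)$ governs both the bias term and the admissible range of $t$ in Theorem 1 below, its size can be gauged numerically; a Monte Carlo sketch (ours, with an arbitrarily chosen Pareto example and plug-in estimates of $\mu$ and $\sigma$):

```python
import numpy as np

def g_mc(m: int, sample: np.ndarray) -> float:
    """Monte Carlo estimate of g(m) from (2), with mu and sigma replaced
    by their sample counterparts."""
    z = (sample - sample.mean()) / sample.std()
    return 6.0 / np.sqrt(m) * np.mean(z ** 2 * np.minimum(np.abs(z), np.sqrt(m)))

rng = np.random.default_rng(0)
X = rng.pareto(4.5, size=10**6)     # E|X|^(2+eps) < infinity for eps < 2.5
for m in (10, 100, 1000):
    print(m, round(g_mc(m, X), 4))  # the third moment is finite here,
                                    # so g(m) decays roughly like m**(-1/2)
```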

Theorem 1.

Assume that $\mathbb{E}|X_{1}-\mu|^{2+\varepsilon}<\infty$ for some $\varepsilon>0$. Suppose that $l=o(m^{\varepsilon})$ and let $L(n,l)$ and $M(n,l)$ be any sequences such that $L(n,l)\gg\frac{n}{l}g^{2}(m)$ and $M(n,l)\ll\frac{n}{l^{2}}$. Then for all $L(n,l)\leq t\leq M(n,l)$,

\mathbb{P}\left(\left|\widehat{\mu}_{N}-\mu\right|\geq\sigma\sqrt{\frac{t}{N}}\right)\leq 3\exp\left(-\frac{t}{2(1+o(1))}\right),

where $o(1)\to 0$ as $l,k\to\infty$ uniformly over all $t\in\left[L(n,l),M(n,l)\right]$.

Remark 1.
  (a) A possible choice of parameters is $l=\log(m)$, $L(n,l)=\frac{n}{l}\,\frac{\log(m)}{m^{\varepsilon}}$ and $M(n,l)=\frac{n}{l^{2}\log(l)}$. By varying $k$, the deviation guarantees can be attained in the desired range of the confidence parameter.

  (b) The question of uniformity of the bounds with respect to the underlying distribution is not explicitly addressed in this note. In particular, the $o(1)$ quantities appearing in the inequalities are distribution-dependent. With additional effort, it should be possible to prove uniformity with respect to classes $\mathcal{P}_{N}$ of distributions of $X$ satisfying moment conditions of the form $\mathbb{E}\left|\frac{X-\mu}{\sigma}\right|^{2+\varepsilon}\leq a_{N}$ for a sequence $a_{N}$ that grows sufficiently slowly.

  (c) Exact computation of $\widehat{\mu}_{N}$ is still prohibitively expensive from a numerical standpoint, as the naive upper bound on the cost of evaluating the estimator exactly is $O\left((n/l)^{l}\log(n/l)\right)$. Instead, one may select a collection of $T$ subsets $J\in\mathcal{A}_{n}^{(l)}$ uniformly at random and compute the median of the corresponding sample means (a sketch of this randomized variant is given after the remark): Theorem 1 in section 4.3.3 of the book by Lee (2019) implies that the asymptotic distribution of the estimator constructed in this way coincides with the asymptotic distribution $N(0,\sigma^{2})$ of $\widehat{\mu}_{N}$ as soon as $T\gg n/l$. However, this asymptotic equivalence no longer automatically implies sharp non-asymptotic bounds for the estimator computed from subsampled blocks: results of this nature are currently unknown to us and require further investigation.
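A minimal sketch (ours) of the randomized variant discussed in part (c), with $T$ subsets of blocks drawn uniformly at random; per the remark, its non-asymptotic behaviour is an open question:

```python
import numpy as np

def mom_subsampled(X: np.ndarray, k: int, l: int, T: int,
                   rng: np.random.Generator) -> float:
    """Median of T subset averages, each over l blocks chosen uniformly at
    random (without replacement) from the n = l * k disjoint blocks;
    T >> n / l random subsets stand in for the full collection A_n^(l)."""
    N = len(X)
    n = l * k
    b = N // n
    Z = X[: n * b].reshape(n, b).mean(axis=1)
    idx = np.stack([rng.choice(n, size=l, replace=False) for _ in range(T)])
    return float(np.median(Z[idx].mean(axis=1)))

rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=10_000)   # heavy-tailed sample with mean 0
print(mom_subsampled(X, k=50, l=10, T=5_000, rng=rng))
```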

Proof.

As $\widehat{\mu}_{N}$ is scale-invariant, we can and will assume that $\sigma^{2}=1$. Set $\rho(x)=|x|$, and note that the equivalent characterization of $\widehat{\mu}_{N}$ as an M-estimator is

\widehat{\mu}_{N}\in\mathop{\mbox{argmin}}_{z\in\mathbb{R}}\sum_{J\in\mathcal{A}_{n}^{(l)}}\rho\left(\sqrt{m}\left(\bar{Z}_{J}-z\right)\right).

The necessary conditions for the minimum of $F(z):=\sum_{J\in\mathcal{A}_{n}^{(l)}}\rho\left(\sqrt{m}\left(\bar{Z}_{J}-z\right)\right)$ imply that $0\in\partial F(\widehat{\mu}_{N})$, hence the left derivative satisfies $F^{\prime}_{-}(\widehat{\mu}_{N})\leq 0$. Therefore, if $\sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\geq\sqrt{t}$ for some $t>0$, then $\widehat{\mu}_{N}\geq\mu+\sqrt{t/N}$ and, due to $F^{\prime}_{-}$ being nondecreasing, $F^{\prime}_{-}\left(\mu+\sqrt{t/N}\right)\leq 0$. This implies that

\mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{N}-\mu)\geq\sqrt{t}\right)\leq\mathbb{P}\left(\sum_{J\in\mathcal{A}_{n}^{(l)}}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{Z}_{J}-\mu-\sqrt{t/N}\right)\right)\geq 0\right)
=\mathbb{P}\left(\frac{\sqrt{k}}{\binom{n}{l}}\sum_{J\in\mathcal{A}_{n}^{(l)}}\left(\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{Z}_{J}-\mu-\sqrt{t/N}\right)\right)-\mathbb{E}\rho^{\prime}_{-}\right)\geq-\sqrt{k}\,\mathbb{E}\rho^{\prime}_{-}\right), \qquad (3)

where we used the shortcut $\mathbb{E}\rho^{\prime}_{-}$ in place of

\mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{Z}_{J}-\mu-\sqrt{t/N}\right)\right)=\mathbb{P}\left(\sqrt{m}(\bar{Z}_{J}-\mu)>\sqrt{\frac{t}{k}}\right)-\mathbb{P}\left(\sqrt{m}(\bar{Z}_{J}-\mu)\leq\sqrt{\frac{t}{k}}\right)=1-2\,\mathbb{P}\left(\sqrt{m}(\bar{Z}_{J}-\mu)\leq\sqrt{\frac{t}{k}}\right);

here we used the fact that $\sqrt{m}\sqrt{t/N}=\sqrt{t/k}$, up to the integer rounding in $m=\lfloor N/k\rfloor$.

Note that

-\sqrt{k}\,\mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{Z}_{J}-\mu-\sqrt{t/N}\right)\right)=-\sqrt{k}\left(1-2\,\mathbb{P}\left(\sqrt{m}\left(\bar{Z}_{J}-\mu-\sqrt{t/N}\right)\leq 0\right)\right)
=2\sqrt{k}\left(\Phi\left(\sqrt{\frac{t}{k}}\right)-\Phi(0)\right)-2\sqrt{k}\left(\Phi\left(\sqrt{\frac{t}{k}}\right)-\mathbb{P}\left(\sqrt{m}\left(\bar{Z}_{J}-\mu\right)\leq\sqrt{\frac{t}{k}}\right)\right)
\geq-2\sqrt{k}\cdot g(m)+2\sqrt{t}\,\frac{1}{\sqrt{t/k}}\left(\Phi\left(\sqrt{\frac{t}{k}}\right)-\Phi(0)\right). \qquad (4)

Since

2\sqrt{t}\,\frac{1}{\sqrt{t/k}}\left(\Phi\left(\sqrt{\frac{t}{k}}\right)-\Phi(0)\right)=2\sqrt{t}\left(\phi(0)+O(\sqrt{t/k})\right)=\sqrt{t}\left(\sqrt{\frac{2}{\pi}}+O(\sqrt{t/k})\right), \qquad (5)

where $\phi(t)=\Phi^{\prime}(t)$, we see that

-\sqrt{k}\,\mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{Z}_{J}-\mu-\sqrt{t/N}\right)\right)\geq-2\sqrt{k}\cdot g(m)+\sqrt{t}\left(\sqrt{\frac{2}{\pi}}+O(\sqrt{t/k})\right),

which is $\sqrt{t}\sqrt{\frac{2}{\pi}}\left(1+o(1)\right)$ whenever $t\ll k$ and $t\gg k\,g^{2}(m)$ (recall that $k=n/l$). It remains to analyze the U-statistic

\sqrt{k}\,U_{n,l}(\rho^{\prime}_{-})=\frac{\sqrt{k}}{\binom{n}{l}}\sum_{J\in\mathcal{A}_{n}^{(l)}}\left(\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{Z}_{J}-\mu-\sqrt{t/N}\right)\right)-\mathbb{E}\rho^{\prime}_{-}\right). \qquad (6)

As the expression above is invariant with respect to the shift $Z_{j}\mapsto Z_{j}-\mu$, we can assume that $\mu=0$. For $i\in[n]$, let

h^{(1)}(Z_{i})=\sqrt{l}\,\mathbb{E}\left[\rho^{\prime}_{-}\left(\sqrt{m}\left(\frac{1}{l}\sum_{j=1}^{l-1}\tilde{Z}_{j}+\frac{Z_{i}}{l}-\sqrt{t/N}\right)\right)\,\Big{|}\,Z_{i}\right]-\sqrt{l}\,\mathbb{E}\rho^{\prime}_{-},

where $(\tilde{Z}_{1},\ldots,\tilde{Z}_{l})$ is an independent copy of $(Z_{1},\ldots,Z_{l})$ based on a sample $\tilde{\mathcal{X}}_{N}$ that is an independent copy of $\mathcal{X}_{N}$. Our goal is to determine the size of $\mbox{Var}(h^{(1)}(Z_{1}))$.

Remark 2.

The quantity $h^{(1)}(Z)$ is related to the so-called Hájek projection, which can be viewed as the best (in the mean squared sense) approximation of the U-statistic $U_{n,l}(\rho^{\prime}_{-})$ among sums of i.i.d. random variables. For the related background on U-statistics, we refer the reader to the excellent monograph by Lee (2019).
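For the reader's convenience, we spell out this standard computation in our setting: since $\mathbb{E}U_{n,l}(\rho^{\prime}_{-})=0$ and each index $i$ belongs to a $\binom{n-1}{l-1}/\binom{n}{l}=l/n$ fraction of the subsets $J\in\mathcal{A}_{n}^{(l)}$, the Hájek projection equals

\widehat{U}_{n,l}=\sum_{i=1}^{n}\mathbb{E}\left[U_{n,l}(\rho^{\prime}_{-})\,|\,Z_{i}\right]=\frac{\sqrt{l}}{n}\sum_{i=1}^{n}h^{(1)}(Z_{i}),

so that $\mbox{Var}\left(\sqrt{k}\,\widehat{U}_{n,l}\right)=\frac{kl}{n}\,\mbox{Var}\left(h^{(1)}(Z_{1})\right)=\mbox{Var}\left(h^{(1)}(Z_{1})\right)$ (recall that $n=lk$); this is why the limit $2/\pi$ identified in Lemma 1 below appears as the effective variance in the final bound.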

Lemma 1.

In the framework of Theorem 1,

\mbox{Var}\left(h^{(1)}(Z_{1})\right)\to\frac{2}{\pi}

as $l,k\to\infty$, uniformly over all $t\in\left[L(n,l),M(n,l)\right]$.

The proof of the lemma is given in section 3. The following result, a deviation inequality for U-statistics of order that grows with the sample size, is the second key technical tool required to complete the argument.

Theorem 2.

Let $h:\mathbb{R}^{l}\mapsto\mathbb{R}$ be a function that is invariant with respect to permutations of its arguments, and let $U_{n,l}(h)=\frac{1}{\binom{n}{l}}\sum_{J\in\mathcal{A}_{n}^{(l)}}\left(h\left(X_{j},\ j\in J\right)-\mathbb{E}h\left(X_{1},\ldots,X_{l}\right)\right)$ be the corresponding U-statistic with kernel $h$ evaluated on a sample $X_{1},\ldots,X_{n}$. Assume that $l$ is an increasing function of $n$, and that

  (a) $h$ is uniformly bounded;

  (b) $\liminf_{l\to\infty}\mbox{Var}\left(\sqrt{l}\,h^{(1)}(X_{1})\right)>0$, where $h^{(1)}(X_{1})=\mathbb{E}\left[h(X_{1},X_{2},\ldots,X_{l})\,|\,X_{1}\right]$.

Let $q(n,l)$ be increasing in $n$, decreasing in $l$, and such that $q(n,l)=o\left(\frac{n}{l^{2}}\right)$. Then for all $2\leq t\leq q(n,l)$,

\mathbb{P}\left(\left|U_{n,l}(h)\right|\geq\sqrt{\frac{tl}{n}}\right)\leq(2+o(1))\exp\left(-\frac{t}{2(1+o(1))\,\mbox{Var}\left(\sqrt{l}\,h^{(1)}(X_{1})\right)}\right),

where $o(1)\to 0$ as $l,n/l\to\infty$ uniformly over $2\leq t\leq q(n,l)$.

The proof of this result is outlined in section 4. (We note that closely related results for U-statistics were obtained by Maurer (2019), and it may be possible to use Maurer’s inequality in place of Theorem 2.) To get the desired inequality for the estimator $\widehat{\mu}_{N}$, it remains to apply Theorem 2 and Lemma 1 to the U-statistic defined in (6): specifically, we deduce that

\mathbb{P}\left(\left|\sqrt{k}\,U_{n,l}(\rho^{\prime}_{-})\right|\geq\sqrt{t}\sqrt{\frac{2}{\pi}}\left(1+o(1)\right)\right)\leq 2\exp\left(-\frac{t}{2(1+o(1))}\right),

uniformly over $\frac{n}{l}g^{2}(m)\ll t\ll\frac{n}{l^{2}}$, and the final result follows.

3 Proof of Lemma 1.

Note that we can rewrite $h^{(1)}(Z_{1})$ as

h^{(1)}(Z_{1})=\sqrt{l}\,\mathbb{E}\left[\rho^{\prime}_{-}\left(\sqrt{m}\left(\frac{1}{m}\sum_{j=1}^{m-m/l}\tilde{X}_{j}+\frac{1}{\sqrt{ml}}\left(Z_{1}\sqrt{m/l}\right)-\sqrt{t/N}\right)\right)\,\Big{|}\,Z_{1}\right]-\sqrt{l}\,\mathbb{E}\rho^{\prime}_{-}.

Given an integer $r\geq 1$, let $\widetilde{\Phi}_{r}(t)$ be the cumulative distribution function of $\sum_{j=1}^{r}\tilde{X}_{j}$. Then

h^{(1)}(Z_{1})=\sqrt{l}\left(2\widetilde{\Phi}_{m-m/l}\left(m\sqrt{\frac{t}{N}}-\sqrt{m/l}\left(Z_{1}\sqrt{m/l}\right)\right)-1\right)-\sqrt{l}\,\mathbb{E}\rho^{\prime}_{-}
=2\sqrt{l}\left(\widetilde{\Phi}_{m-m/l}\left(m\sqrt{\frac{t}{N}}-\sqrt{m/l}\left(Z_{1}\sqrt{m/l}\right)\right)-\mathbb{E}\widetilde{\Phi}_{m-m/l}\left(m\sqrt{\frac{t}{N}}-\sqrt{m/l}\left(Z_{1}\sqrt{m/l}\right)\right)\right)
=2\sqrt{l}\int_{\mathbb{R}}\left(\widetilde{\Phi}_{m-m/l}\left(m\sqrt{\frac{t}{N}}-\sqrt{m/l}\left(Z_{1}\sqrt{m/l}\right)\right)-\widetilde{\Phi}_{m-m/l}\left(m\sqrt{\frac{t}{N}}-x\sqrt{m/l}\right)\right)dP_{Z_{1}\sqrt{m/l}}(x),

with $P_{Z_{1}\sqrt{m/l}}$ being the law of $Z_{1}\sqrt{m/l}$. Feller’s version of the Berry-Esseen theorem implies that

\sup_{x\in\mathbb{R}}\left|\widetilde{\Phi}_{m-m/l}(x)-\Phi\left(x/\sqrt{m-m/l}\right)\right|\leq 6\,g(m-m/l),

where $\Phi$ is the distribution function of the standard normal law. Therefore,

\left|h^{(1)}(Z_{1})-2\sqrt{l}\int_{\mathbb{R}}\left(\Phi\left(\frac{m}{\sqrt{m-m/l}}\sqrt{\frac{t}{N}}-\sqrt{\frac{m/l}{m-m/l}}\left(Z_{1}\sqrt{m/l}\right)\right)-\Phi\left(\frac{m}{\sqrt{m-m/l}}\sqrt{\frac{t}{N}}-x\sqrt{\frac{m/l}{m-m/l}}\right)\right)dP_{Z_{1}\sqrt{m/l}}(x)\right|\leq 12\sqrt{l}\,g(m-m/l)\to 0

by assumption. At the same time,

\sqrt{l}\left(\Phi\left(\frac{m}{\sqrt{m-m/l}}\sqrt{\frac{t}{N}}-\sqrt{\frac{m/l}{m-m/l}}\left(Z_{1}\sqrt{m/l}\right)\right)-\Phi\left(\frac{m}{\sqrt{m-m/l}}\sqrt{\frac{t}{N}}-x\sqrt{\frac{m/l}{m-m/l}}\right)\right)
=\frac{1}{\sqrt{2\pi}}\exp\left(-q(x)/2\right)\left(x-Z_{1}\sqrt{m/l}\right)\sqrt{\frac{m}{m-m/l}}+\frac{C(x,Z_{1})}{\sqrt{l}}\left(x-Z_{1}\sqrt{m/l}\right)^{2}\frac{m/l}{m-m/l},

where $q(x):=\left(\frac{m}{\sqrt{m-m/l}}\sqrt{\frac{t}{N}}-x\sqrt{\frac{m/l}{m-m/l}}\right)^{2}$ is such that $q(x)\to 0$ as $l\to\infty$, and $C(x,Z_{1})$ is a bounded function. Therefore, $h^{(1)}(Z_{1})-\sqrt{\frac{2}{\pi}}\,Z_{1}\sqrt{m/l}\to 0$ almost surely, provided that $\sqrt{l}\,g(m)=o(1)$. Finally, note that

\sqrt{l}\left(\widetilde{\Phi}_{m-m/l}\left(m\sqrt{\frac{t}{N}}-\sqrt{m/l}\left(Z_{1}\sqrt{m/l}\right)\right)-\widetilde{\Phi}_{m-m/l}\left(m\sqrt{\frac{t}{N}}-x\sqrt{m/l}\right)\right)
\leq\sup_{z\in\mathbb{R}}\sqrt{l}\,\mathbb{P}\left(\sum_{j=1}^{m-m/l}\tilde{X}_{j}\in\Big{(}z,\,z+\left|\sqrt{m/l}\left(x-Z_{1}\sqrt{m/l}\right)\right|\Big{]}\right)\leq C\left|x-Z_{1}\sqrt{m/l}\right|,

where the last inequality follows from the well-known bound for the concentration function (Theorem 2.20 in the book by Petrov (1995)); here, $C=C(P)>0$ is a constant that may depend on the distribution of $X_{1}$. We therefore conclude that the sequence $\left(h^{(1)}(Z_{1})-\sqrt{\frac{2}{\pi}}\,Z_{1}\sqrt{m/l}\right)^{2}$ is uniformly integrable (as $Z_{1}\sqrt{m/l}$ is), hence the claim follows.

4 Proof of Theorem 2.

The union bound together with Hoeffding’s decomposition entails that for any $t>0$ and $0<\varepsilon<1$ (to be chosen later as a decreasing sequence $\varepsilon(l)$),

\mathbb{P}\left(\left|U_{n,l}(h)\right|\geq\sqrt{\frac{tl}{n}}\right)\leq\mathbb{P}\left(\left|\frac{l}{n}\sum_{j=1}^{n}h^{(1)}(Z_{j})\right|\geq(1-\varepsilon)\sqrt{t}\sqrt{\frac{l}{n}}\right)+\mathbb{P}\left(\left|\sum_{j=2}^{l}\frac{\binom{l}{j}}{\binom{n}{j}}\sum_{J\in\mathcal{A}_{n}^{(j)}}h^{(j)}(Z_{i},\,i\in J)\right|\geq\varepsilon\sqrt{t}\sqrt{\frac{l}{n}}\right),

where $h^{(j)},\ j=2,\ldots,l$, are the degenerate kernels corresponding to the higher-order terms of Hoeffding’s decomposition (not to be confused with the derivatives!). Specifically,

h^{(j)}(y_{1},\ldots,y_{j})=(\delta_{y_{1}}-P_{Y})\times\ldots\times(\delta_{y_{j}}-P_{Y})\times P_{Y}^{l-j}\,h,

where $\delta_{y}$ is the point measure concentrated at $y$; in particular, $\delta_{y}(h)=h(y)$. It is known that the $h^{(j)}$ can be viewed geometrically as orthogonal projections of $h$ onto particular subspaces of $L_{2}(P_{Y}^{l})$. We refer the reader to the book by Lee (2019) for further details related to the background material on U-statistics and Hoeffding’s decomposition.
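For instance (writing out the first two instances of this standard formula for the reader's convenience, with $Y_{1},\ldots,Y_{l}$ i.i.d. with distribution $P_{Y}$),

h^{(1)}(y_{1})=\mathbb{E}\,h(y_{1},Y_{2},\ldots,Y_{l})-\mathbb{E}\,h(Y_{1},\ldots,Y_{l}),

h^{(2)}(y_{1},y_{2})=\mathbb{E}\,h(y_{1},y_{2},Y_{3},\ldots,Y_{l})-h^{(1)}(y_{1})-h^{(1)}(y_{2})-\mathbb{E}\,h(Y_{1},\ldots,Y_{l}),

and each $h^{(j)}$ integrates to zero in every argument (degeneracy). Note that this $h^{(1)}$ is the centered version of the kernel $h^{(1)}$ appearing in the statement of Theorem 2; the two differ by a constant, which does not affect the variance appearing there.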

Bernstein’s inequality yields that

\mathbb{P}\left(\left|\frac{l}{n}\sum_{j=1}^{n}h^{(1)}(Z_{j})\right|\geq(1-\varepsilon)\sqrt{t}\sqrt{\frac{l}{n}}\right)\leq 2\exp\left(-\frac{(1-\varepsilon)^{2}\,t/2}{\mbox{Var}\left(\sqrt{l}\,h^{(1)}(Z_{1})\right)+(1-\varepsilon)\frac{1}{3}\sqrt{\frac{l}{n}}\,\|h\|_{\infty}\,t^{1/2}}\right)=2\exp\left(-\frac{(1-\varepsilon)^{2}\,t}{2\,\mbox{Var}\left(\sqrt{l}\,h^{(1)}(Z_{1})\right)(1+o(1))}\right),

where $o(1)\to 0$ as $n/l\to\infty$ uniformly over $t$. It remains to control the expression involving the higher-order Hoeffding decomposition terms. We will show that it is bounded from above by $\exp\left(-\frac{t}{2\,\mbox{Var}\left(\sqrt{l}\,h^{(1)}(X_{1})\right)}\right)\cdot o(1)$, where $o(1)\to 0$ uniformly over the range of $t$; for this, we will need the concentration inequality for U-statistics of growing order established in Minsker (2023, Theorem 4.1). Set $t_{j,\varepsilon}=\left(\varepsilon\frac{\sqrt{t}}{j^{2}}\left(\frac{n}{l}\right)^{\frac{j-1}{2}}\right)^{2}$ and note that, in view of the union bound,

\mathbb{P}\left(\left|\sum_{j=2}^{l}\frac{\binom{l}{j}}{\binom{n}{j}}\sum_{J\in\mathcal{A}_{n}^{(j)}}h^{(j)}(Z_{i},\,i\in J)\right|\geq\varepsilon\sqrt{t}\sqrt{\frac{l}{n}}\right)\leq\sum_{j=2}^{l}\mathbb{P}\left(\left|\frac{\binom{l}{j}}{\binom{n}{j}}\sum_{J\in\mathcal{A}_{n}^{(j)}}h^{(j)}(Z_{i},\,i\in J)\right|\geq\sqrt{t_{j,\varepsilon}}\sqrt{\frac{l}{n}}\right)
\leq l\max_{2\leq j\leq l}\exp\left(-c\min\left((t\varepsilon^{2})^{1/j}\left(\frac{n}{l}\right)^{\frac{j-1}{j}},\left(\frac{t\varepsilon^{2}}{\|h\|_{\infty}^{2}}\right)^{\frac{1}{j+1}}\left(\frac{nj}{l^{2}}\right)^{\frac{j}{j+1}}\right)\right),

where the last inequality follows from the first bound of Theorem 4.1 in Minsker (2023). Whenever $l\log^{2}(l)\ll k$ and $\varepsilon\gg\log^{-1/2}(l)$, the last expression is at most

\max_{2\leq j\leq l}\exp\left(-c_{1}\min\left((t\varepsilon^{2})^{1/j}\left(\frac{n}{l}\right)^{\frac{j-1}{j}},\left(\frac{t\varepsilon^{2}}{\|h\|_{\infty}^{2}}\right)^{\frac{1}{j+1}}\left(\frac{nj}{l^{2}}\right)^{\frac{j}{j+1}}\right)\right).

In turn, this is bounded by $e^{-c_{2}t/\varepsilon}$ whenever $t<\frac{n}{l^{2}}\varepsilon^{4}$. The desired conclusion follows.

Acknowledgements.

The author is grateful to the anonymous Referees for their insightful comments and constructive feedback that helped improve the quality of the presentation.

References

  • Alon et al., (1996) Alon, N., Matias, Y., and Szegedy, M. (1996). The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20–29. ACM.
  • Catoni, (2012) Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 48, pages 1148–1185. Institut Henri Poincaré.
  • Devroye et al., (2016) Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725.
  • Feller, (1968) Feller, W. (1968). On the Berry-Esseen theorem. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 10(3):261–268.
  • Jerrum et al., (1986) Jerrum, M. R., Valiant, L. G., and Vazirani, V. V. (1986). Random generation of combinatorial structures from a uniform distribution. Theoretical computer science, 43:169–188.
  • Lee, (2019) Lee, A. J. (2019). U-statistics: Theory and Practice. Routledge.
  • Lee and Valiant, (2020) Lee, J. C. H. and Valiant, P. (2020). Optimal sub-Gaussian mean estimation in $\mathbf{R}$. arXiv preprint arXiv:2011.08384.
  • Lugosi and Mendelson, (2019) Lugosi, G. and Mendelson, S. (2019). Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190.
  • Maurer, (2019) Maurer, A. (2019). A Bernstein-type inequality for functions of bounded interaction. Bernoulli, 25(2):1451–1471.
  • Minsker, (2023) Minsker, S. (2023). U-statistics of growing order and sub-Gaussian mean estimators with sharp constants. Mathematical Statistics and Learning, to appear.
  • Nemirovski and Yudin, (1983) Nemirovski, A. and Yudin, D. (1983). Problem complexity and method efficiency in optimization. John Wiley & Sons Inc.
  • Petrov, (1995) Petrov, V. V. (1995). Limit theorems of probability theory: sequences of independent random variables. Oxford, New York.