Optimal Sub-Gaussian Mean Estimation in $\mathbb{R}$
Abstract
We revisit the problem of estimating the mean of a real-valued distribution, presenting a novel estimator with sub-Gaussian convergence: intuitively, “our estimator, on any distribution, is as accurate as the sample mean is for the Gaussian distribution of matching variance.” Crucially, in contrast to prior works, our estimator does not require prior knowledge of the variance, and works across the entire gamut of distributions with finite variance, including those without any higher moments. Parameterized by the sample size $n$, the failure probability $\delta$, and the variance $\sigma^2$, our estimator is accurate to within $(1+o(1))\,\sigma\sqrt{\frac{2\log\frac{1}{\delta}}{n}}$, tight up to the $1+o(1)$ factor. Our estimator construction and analysis give a framework generalizable to other problems, tightly analyzing a sum of dependent random variables by viewing the sum implicitly as a 2-parameter $\psi$-estimator, and constructing bounds using mathematical programming and duality techniques.
1 Introduction
We revisit one of the most fundamental problems in statistics: estimating the mean of a real-valued distribution, using as few independent samples from it as possible. Our proposed estimator has convergence that is optimal not only in a big-O sense (i.e. “up to multiplicative constants”), but tight to a $1+o(1)$ factor, under the minimal (and essentially necessary, see below) assumption of the finiteness of the variance. Previous works, discussed further in Section 2, are either only big-O tight [12, 20, 1], or require additional strong assumptions such as the variance being known to the estimator [4], or assumptions that allow for accurate estimates of the variance, such as the kurtosis (fourth moment) being finite [4, 7].
1.1 The Model and Main Result
Given a set of i.i.d. samples from a real-valued distribution, the goal is to return, with extremely high probability, an accurate estimate of the distribution’s mean. Specifically, given a sample set of size $n$ consisting of independent draws from a real-valued distribution $D$, an $(\epsilon,\delta)$-estimator of the mean is a function $\hat\mu:\mathbb{R}^n\to\mathbb{R}$ such that, except with failure probability $\delta$, the estimate is within $\epsilon$ of the true mean $\mu$. Namely,
$$\Pr_{x_1,\dots,x_n\sim D}\Big[\,\big|\hat\mu(x_1,\dots,x_n)-\mu\big|\le\epsilon\,\Big]\;\ge\;1-\delta. \qquad (1)$$
The goal is to find the optimal tradeoff between the sample size $n$ and the error parameters $\epsilon$ and $\delta$, for the distribution $D$. Fixing any two of the three parameters and minimizing the third yields essentially equivalent reformulations of the problem: we can fix $\epsilon,\delta$ and minimize the sample complexity $n$; we can fix $n,\delta$ and minimize the error $\epsilon$; or we can fix $n,\epsilon$ and minimize the failure probability $\delta$ (maximizing the robustness $\log\frac{1}{\delta}$).
Perhaps the most standard and well-behaved setting for mean estimation is when the distribution is a Gaussian. The sample mean (the empirical mean) is a provably optimal estimator in our sense when $D$ is Gaussian: for any $\epsilon,\delta$, the sample mean is an $(\epsilon,\delta)$-estimator when given a sample set of size $n=(2+o(1))\frac{\sigma^2\log\frac{1}{\delta}}{\epsilon^2}$ (all logarithms will be base $e$); and there is no $(\epsilon,\delta)$-estimator for Gaussians if the constant 2 in the previous expression for the sample size is changed to any smaller number.
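As a quick sanity check of this Gaussian benchmark (a Monte Carlo of our own, with arbitrarily chosen parameters): the empirical mean of $n$ Gaussian samples should exceed the error threshold $\sigma\sqrt{2\log\frac{1}{\delta}/n}$ with probability at most $\delta$—in fact somewhat less, since the Gaussian tail bound is loose by lower-order factors; only asymptotically is the constant 2 unimprovable, per the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, delta = 200, 1.0, 0.01                      # illustrative parameters
eps = sigma * np.sqrt(2 * np.log(1 / delta) / n)      # sub-Gaussian error threshold

# the sample mean of n i.i.d. N(0, sigma^2) draws is exactly N(0, sigma^2 / n)
trials = 1_000_000
sample_means = rng.normal(0.0, sigma / np.sqrt(n), size=trials)
failure_rate = np.mean(np.abs(sample_means) > eps)
print(f"eps = {eps:.4f}   target delta = {delta}   empirical failure rate = {failure_rate:.4f}")
```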
The main result of this paper is an estimator that performs as well, on any distribution with finite variance, as the sample mean does on a Gaussian, without knowledge of the distribution or its variance:
Theorem 1.
Estimator 1, given $n$ and $\delta$, defines a function $\hat\mu:\mathbb{R}^n\to\mathbb{R}$ such that, with probability at least $1-\delta$, given a sample set of size $n$ drawn from a distribution of variance $\sigma^2$, it yields an estimate with error at most $(1+o(1))\,\sigma\sqrt{\frac{2\log\frac{1}{\delta}}{n}}$.
Here, the $o(1)$ term tends to 0 as $n$ grows. Furthermore, as evidenced by the Gaussian case, there is no estimator which, under the same settings, produces an error that improves on our guarantees by more than a $1+o(1)$ multiplicative factor.
We have parameterized the above theorem in terms of fixing the sample size $n$ and the robustness parameter $\delta$ and asking for the minimum error $\epsilon$; however, because of the simple functional form of the bounds of Theorem 1, we can equivalently rephrase it as saying that, for any $\epsilon,\delta$, (a reparameterized) Algorithm 1 is an $(\epsilon,\delta)$-estimator using $n=(2+o(1))\frac{\sigma^2\log\frac{1}{\delta}}{\epsilon^2}$ samples; or, for any $n,\epsilon$, Algorithm 1 gives an estimate whose failure probability $\delta$ satisfies $\log\frac{1}{\delta}=(1+o(1))\frac{n\epsilon^2}{2\sigma^2}$. For each of these formulations, the performance is optimal up to the $1+o(1)$ factor, as evidenced by the well-known Gaussian case, as explained above.
We make the following observations regarding the main (minimal) assumption in the theorem, namely the finiteness of the variance of the unknown distribution. First, imposing further assumptions about the finiteness of higher moments will not yield any improvements to the result, since matching lower bounds are provided by Gaussians, for which all moments are finite. Second, as shown by Devroye et al. [7], relaxing the finite variance assumption by only assuming, say, the finiteness of the $(1+\alpha)$-th moment for some $\alpha<1$ will yield strictly worse sample complexity. In particular, the sample complexity will have an $\epsilon$-dependence that is strictly worse than $1/\epsilon^2$. Thus, our result shows that mean estimation can be performed at a sub-Gaussian rate, with the optimal multiplicative constant of 2 in the sample complexity, if and only if the variance of the underlying distribution is finite.
We also contrast with previous works that attain optimal sub-Gaussian convergence but make additional assumptions such as the finiteness of the kurtosis (4th moment) [4, 7]. The gap in assumptions between those works and this work is not only theoretical, but also of practical consequence: power law distributions are known to be good models of certain real-world phenomena, and for a substantial range of exponents, the variance exists, but not the kurtosis.
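For a quick numerical illustration of this regime (a toy check of our own; the tail exponent 2.5 is an arbitrary choice for which the variance is finite but the fourth moment is not): the empirical variance of a heavy-tailed Pareto sample settles near its finite limit as the sample grows, while the empirical fourth central moment has no finite limit to settle to and, in a typical run, keeps drifting upward.

```python
import numpy as np

rng = np.random.default_rng(3)
a = 2.5   # Lomax/Pareto-II tail exponent: E[X^k] is finite iff k < a, so the
          # variance exists but the fourth moment (and hence the kurtosis) does not
for n in (10**4, 10**5, 10**6, 10**7):
    x = rng.pareto(a, size=n)
    m2 = np.var(x)                       # empirical variance (has a finite limit)
    m4 = np.mean((x - x.mean()) ** 4)    # empirical 4th central moment (no finite limit)
    print(f"n = {n:>8}   empirical variance = {m2:9.3f}   empirical 4th moment = {m4:14.1f}")
```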
1.2 Our Approach
We briefly describe the main features of our estimator, as a setting for what follows, and to distinguish it from prior work. At the highest level: in order to return a $\delta$-robust estimate of the mean, our estimator “throws out the most extreme points in the samples”, and returns the mean of what remains. More specifically, outliers are thrown out in a weighted manner: we throw out a fraction of each data point, with the fraction proportional to the square of its distance from a median-of-means initial guess for the mean, where the fraction is capped at 1, and the proportionality constant is chosen so that the total weight thrown out equals exactly $\frac{1}{3}\log\frac{1}{\delta}$. See Estimator 1 for full details, but we stress here that the estimator is simple to implement—it may be computed in linear time—and therefore applicable in practice.
The above description is rather different from the typical M-estimator/$\psi$-estimator approach of Catoni [4] and other works in this area. However, as we see in Section 3, our estimator can be reinterpreted as a 2-parameter $\psi$-estimator, and the proof of our main result will crucially rely on this reformulation.
1.3 Motivation: 3rd-order corrections of the empirical mean
Perhaps the most non-obvious part of our estimator is throwing out exactly $\frac{1}{3}\log\frac{1}{\delta}$ (in total weight) many samples. We motivate this quantity in this section, by considering the special case of estimating the mean of asymmetric—very biased—Bernoulli distributions, which is in some sense an extremal case for our setting.
Example 1.
Consider the mean estimation problem, given $n$ samples from a Bernoulli distribution supported on 0 and 1, where the probability of drawing 1 equals some parameter $p$. Thus the number of 1s observed is distributed as the Binomial distribution $\mathrm{Bin}(n,p)$, of mean $np$ and variance $np(1-p)$. The interesting regime for us is when $p$ is very small, and thus $np(1-p)\approx np$, and the Binomial distribution is essentially the Poisson distribution of mean and variance $\lambda=np$. In this setting, the mean estimation problem becomes: given a sample from $\mathrm{Poi}(\lambda)$, and the parameters $n$ and $\delta$, return an estimate that, except with failure probability $\delta$, is as close as possible to $\lambda$ (or equivalently $p=\lambda/n$). Given a Poisson sample $k$, returning simply $k$ is a natural estimate of $\lambda$; however, since Poisson distributions are slightly skewed, it turns out that one should instead return the correction $k-\frac{1}{3}\log\frac{1}{\delta}$.
Explicitly, the Poisson distribution has pmf $e^{-\lambda}\lambda^k/k!$, whose logarithm, using Stirling’s approximation for the factorial, expanding to 3rd order in $k-\lambda$, and dropping lower-order terms in $\lambda$, is $-\frac{(k-\lambda)^2}{2\lambda}+\frac{(k-\lambda)^3}{6\lambda^2}$. The 2nd-order term here corresponds to a Gaussian centered at $\lambda$ of variance $\lambda$, which is a standard approximation for the Poisson distribution. However, crucially, the 3rd-order term, corresponding to the positive skewness of the Poisson distribution, increases the pmf to the right of $\lambda$ and decreases it by an essentially symmetric factor to the left.
Seeking a $\delta$-robust estimation of $\lambda$ from a single sample of $\mathrm{Poi}(\lambda)$, we are concerned, essentially, with the interval where the Poisson pmf is greater than $\delta$, or equivalently, where the log pmf is greater than $-\log\frac{1}{\delta}$. The quadratic first term of the above approximation equals $-\log\frac{1}{\delta}$ when $k=\lambda\pm\sqrt{2\lambda\log\frac{1}{\delta}}$, and this interval is centered at the mean $\lambda$. However, crucially, when we take into account the 3rd-order term, the interval where the log pmf exceeds $-\log\frac{1}{\delta}$ essentially shifts to become $\lambda+\frac{1}{3}\log\frac{1}{\delta}\pm\sqrt{2\lambda\log\frac{1}{\delta}}$. Thus, given a single sample, one can $\delta$-robustly estimate the mean of a Poisson distribution similarly well as for the Gaussian of the same mean and variance, but only if one returns the sample minus $\frac{1}{3}\log\frac{1}{\delta}$.
Thus, the $\frac{1}{3}\log\frac{1}{\delta}$ term in our estimator arises essentially from a 3rd-order correction to the sample mean, at least in the special case of Bernoulli distributions. For additional intuition and motivation about the “3rd-order correction” in our estimator, please refer to Appendix A.
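This shift is easy to see numerically. The sketch below (our own check, with illustrative parameters) computes the two-sided $\delta$-level quantiles of a Poisson distribution and compares the midpoint of that interval to the mean $\lambda$: the midpoint sits to the right of $\lambda$ by an amount comparable to $\frac{1}{3}\log\frac{1}{\delta}$, matching the derivation above up to lower-order terms.

```python
import numpy as np
from scipy.stats import poisson

lam, delta = 1000.0, 1e-6                 # illustrative parameters
L = np.log(1.0 / delta)

k_hi = poisson.ppf(1.0 - delta, lam)      # upper delta-level quantile
k_lo = poisson.ppf(delta, lam)            # lower delta-level quantile
midpoint = 0.5 * (k_hi + k_lo)

print(f"upper quantile offset from the mean: {k_hi - lam:+.0f}")
print(f"lower quantile offset from the mean: {k_lo - lam:+.0f}")
print(f"interval midpoint minus the mean:    {midpoint - lam:+.1f}"
      f"   vs   (1/3) log(1/delta) = {L / 3:.2f}")
```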
1.4 Key Contributions in Our Construction and Analysis
In addition to settling the fundamental sample complexity question of mean estimation, we point out that the estimator construction and analysis may also be of independent interest. In particular, the analysis framework—as described below—is generalizable to other problem settings and estimator constructions.
Our overall analysis framework may be viewed as a Chernoff bound—showing exponentially small probability of estimation error via bounds on a moment generating function (the expectation of an exponentiated real-valued random variable). However, since we seek to analyze our estimator to sub-constant accuracy, many standard approaches fail to yield the required resolution. We point out three crucial components of our approach.
First, our estimator (Estimator 1) is not a sum of independent terms, a structure which is fundamental to standard Chernoff bound approaches, and thus we instead reformulate our estimator as a 2-parameter $\psi$-estimator (see Definition 2). This technique rewrites our estimate as the first coordinate of the root of a system of 2 equations in 2 unknowns, where each equation is explicitly a sum of a corresponding function applied to each of the independent data points in the sample set. Thus we have bought independence at the price of making the estimator an implicit function, introducing two new variables. One-dimensional estimators of this form are standard: for example, Catoni’s [4] mean estimator in the case of known variance is a (1-parameter) $\psi$-estimator for which he proves finite-sample concentration. However, adding another dimension—a new implicit variable whose value the estimator will ultimately discard—is less standard, without standard analysis techniques, yet significantly increases the expressive power of such estimators [24]. Our high-level approach is to find carefully chosen linear combinations of the two defining functions, each of which is now a sum of independent terms, and prove Chernoff bounds about these linear combinations.
Second, even after identifying these linear combinations of functions, the corresponding Chernoff bound analysis is difficult to directly tackle. The Chernoff bound analysis, as it turns out, is essentially equivalent to bounding a max-min optimization problem where the maximization is over the set of real-valued probability measures with mean 0 and variance 1. In other words, the max-min optimization problem can be interpreted as having uncountably infinitely many variables. In order to drastically simplify the problem and make it amenable to analysis, we use convex-concave programming and linear programming duality techniques to reduce the problem to a pure minimization problem with a small finite number of variables, which we can analyze tightly.
We believe that the above two ideas—1) reformulating an estimator as a multi-parameter $\psi$-estimator, so as to find a proxy of the estimator that is a sum of independent variables, and 2) viewing the corresponding Chernoff bound analysis as an optimization problem and applying relevant duality techniques—form a general analysis framework which expands the space of possible estimators that are amenable to tight analysis.
2 Related Work
There is a long history of work on real-valued mean estimation in a variety of models. In the problem setting we adopt, where the sole assumption is on the finiteness of the second moment, the median-of-means algorithm [12, 20, 1] has long been known to have sample complexity tight to within constant multiplicative factors, albeit with a sub-optimal constant. Catoni [4] improved this sample complexity to essentially optimal (tight up to a $1+o(1)$ factor), by focusing on the special cases where the variance of the underlying distribution is known or the 4th moment is finite and bounded (in which case the second moment can be accurately estimated). We stress however that the finiteness of the 4th moment is nonetheless a much stronger assumption than our minimal assumption on the finiteness of the variance (see the discussion at the end of Section 1.1).
Moving beyond the original problem formulation, Devroye et al. [7] drew the distinction between a single-$\delta$ estimator, which takes in the robustness parameter $\delta$ as input, versus a multiple-$\delta$ estimator, which does not take any $\delta$ as input, but still provides guarantees across a wide range of $\delta$ values. In their work, making the same finite kurtosis assumption as Catoni, they achieved a multiple-$\delta$ estimator with essentially optimal sample complexity, for a wide range of $\delta$ values. It is thus natural and prudent to ask whether a multiple-$\delta$ estimator can exist for the entire class of distributions with finite variance, for a meaningful range of $\delta$ values. Unfortunately, Devroye et al. [7] showed strong lower bounds answering the question in the negative. Hence, in this work, our proposed estimator is (and must be) a single-$\delta$ estimator, taking $\delta$ as input.
Many applications have arisen from the success of sub-Gaussian mean estimation, showing how to leverage or extend Catoni-style estimators to new settings, achieving sub-Gaussian performance on problems such as regression, empirical risk minimization, and online learning (bandit settings): for example see [18, 2, 5, 3].
A separate but closely related line of work is on high dimensional mean estimation. While estimators generalizing the “median-of-means” construction were found to have statistical convergence tight to multiplicative constants, until recently, such estimators took super-polynomial time to compute [17]. A recent line of work [10, 6, 15], started by Hopkins [10], thus focuses on the computational aspect, and brought the computation time first down to polynomial time, with subsequent work bringing it further down to quadratic time using spectral methods.
A recent comprehensive survey by Lugosi and Mendelson [16] explains much of the above works in greater detail.
Other works have focused on mean estimation in restrictive settings, for example, with differential privacy constraints. For example, Kamath et al. [14] studied the differentially private mean estimation problem in the constant probability regime, and showed strong sample complexity separations from our unrestricted setting. Duchi, Jordan and Wainwright [8, 9] also studied the problem under the stricter constraint of local differential privacy. See the work of Kamath et al. [14] for a more comprehensive literature review on differentially private mean estimation.
3 Our Estimator
In this section, we present our estimator (Estimator 1), as well as its reformulation as a 2-parameter $\psi$-estimator. We then present some perspective and basic structural properties of the estimator that will serve as a foundation for the analysis to follow.
Inputs:
• $n$ independent samples $x_1,\dots,x_n$ from the unknown underlying distribution (guaranteed to have finite variance)
• Confidence parameter $\delta$
1. Compute the median-of-means estimate $\kappa$: evenly partition the data into groups and let $\kappa$ be the median of the set of means of the groups.
2. Find the solution to the monotonic, piecewise-linear equation
3. Output:
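The following is a minimal Python sketch of the estimator as just described and as explained in Section 3.1. Since the equation of Step 2 and the output expression of Step 3 are stated only verbally above, the concrete forms below—the quadratic weighting rule, the discard budget $\frac{1}{3}\log\frac{1}{\delta}$ (taken from the motivation of Section 1.3), the number of median-of-means groups, and the function names `median_of_means` and `estimate_mean`—are our reconstructions and illustrative choices, not the paper's exact specification.

```python
import numpy as np

def median_of_means(x, num_groups):
    """Median of the group means of an even partition of the data (Step 1)."""
    groups = np.array_split(np.asarray(x, dtype=float), num_groups)
    return float(np.median([g.mean() for g in groups]))

def estimate_mean(x, delta, budget_const=1.0 / 3.0, num_groups=None):
    """Sketch of Estimator 1; budget_const and the default num_groups are assumptions,
    not the paper's exact constants."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    budget = budget_const * np.log(1.0 / delta)   # total sample weight to discard
    assert budget < n, "needs n comfortably larger than the discard budget"

    if num_groups is None:
        num_groups = max(1, int(np.ceil(np.log(1.0 / delta))))  # assumed Theta(log 1/delta) groups
    kappa = median_of_means(x, num_groups)        # Step 1: robust initial guess

    # Step 2: the fraction of sample i to discard is min(alpha * (x_i - kappa)^2, 1);
    # choose alpha so the total discarded weight equals the budget.  The left-hand side
    # is monotone and piecewise linear in alpha, so bisection suffices.
    dev2 = (x - kappa) ** 2

    def discarded(alpha):
        return np.minimum(alpha * dev2, 1.0).sum()

    if np.count_nonzero(dev2) <= budget:
        return kappa                              # degenerate case: nearly all samples sit at kappa
    lo, hi = 0.0, 1.0
    while discarded(hi) < budget:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if discarded(mid) < budget:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)

    # Step 3: weighted mean, where weight w_i = 1 - (fraction of sample i discarded);
    # the discarded fraction of each sample defaults to the median-of-means estimate kappa
    w = 1.0 - np.minimum(alpha * dev2, 1.0)
    return kappa + float(np.mean(w * (x - kappa)))
```

As a smoke test on skewed data, `estimate_mean(np.random.default_rng(0).standard_exponential(100_000) - 1.0, delta=1e-9)` should land close to the true mean of 0.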
3.1 Meaning of the Estimator
Consider the expression in Step 3 for the final returned value of the estimator. Without the final weighting factor, the expression computes exactly the sample mean. The factor may be thought of as a weight on the corresponding element, between 0 and 1, where a weight of 1 leaves that element as is, while a weight towards 0 essentially throws out part of the sample and instead defaults to the median-of-means estimate $\kappa$. Thus, rather than either keeping or discarding each entry, the weight specifies what fraction of that sample to discard.
The condition in Step 2 of Estimator 1 picks the proportionality constant so that the total, weighted, number of discarded samples equals $\frac{1}{3}\log\frac{1}{\delta}$. The expression specifying what fraction of each sample $x_i$ to discard says, essentially, that this fraction should be proportional to the square of the deviation of $x_i$ from the mean estimate $\kappa$, capped at 1 so that we do not discard “more than 100% of” any sample $x_i$.
3.2 Structural Properties of the Estimator
We point out three basic structural properties of Estimator 1 that both shed light on the estimator itself, and will be crucial to its analysis. We formally state and prove these properties in Appendix B.
First, the estimator is “affine invariant” in the sense that, if its input samples undergo an affine map then its output will be mapped correspondingly. Second, as is well known, the median-of-means estimate $\kappa$ of Step 1, while not as accurate as what we will eventually return, is robust in the sense that, except with a small fraction of the allowed failure probability $\delta$, it has additive error from the true mean that is at most $O\big(\sigma\sqrt{\log\frac{1}{\delta}/n}\big)$—proportional to the eventual guarantees of our estimator, but with a somewhat worse proportionality constant. Third, if we temporarily ignore Step 1, treating $\kappa$ as a free parameter, we show that the final output of the algorithm varies very little with $\kappa$. Combined with the accuracy guarantees of the median-of-means estimate, the difference in the final estimate between using the median-of-means as $\kappa$ versus using the true mean as $\kappa$ is inconsequential (a lower-order factor) compared to the total additive error we aim for. Therefore, for the purposes of analysis, it suffices to assume that $\kappa$ takes the value of the true mean (though an algorithm could not do this in practice, as the true mean is unknown).
These structural properties allow us to drastically simplify the analysis: the affine invariance means it is sufficient to show our estimator works for the special case of distributions with mean 0 and variance 1; the second and third properties mean that errors in $\kappa$ effectively do not matter, and, for distributions with mean 0, it is sufficient to omit Step 1 and instead just analyze the case where $\kappa=0$.
We point out that Estimator 1 when modified to set $\kappa=0$ (independently of the samples) is no longer affine invariant, nor is its reformulation as a $\psi$-estimator in Section 3.3. The structural properties in this section show that, instead of analyzing the actual estimator (Estimator 1, which is affine invariant), it suffices to analyze this artificially simplified, although no longer affine invariant, estimator which sets $\kappa=0$, on distributions with mean 0 and variance 1. Explicitly, in the rest of the paper we will show Proposition 3 (Section 4), which analyzes the mean-0 variance-1 case of the $\psi$-estimator defined below in Definition 2; the discussion of this section—made formal in Appendix B—shows that this proposition implies our main result, Theorem 1.
3.3 Representing a Special Case of Estimator 1 as a $\psi$-Estimator
As discussed in Section 1.4, our estimator, even its simplified version with $\kappa=0$, is not a sum of independent terms, making it difficult to tightly bound its moment generating function, and hence also difficult to prove its concentration around the true mean using a Chernoff-style bound. Our solution is to reformulate Estimator 1, with the simplifying assumption that $\kappa=0$, as a 2-parameter $\psi$-estimator, as defined in Definition 2. This reformulation defines our estimate implicitly in terms of two new functions that are indeed sums of independent terms, each term depending on a single sample $x_i$. We will use this representation crucially for the concentration analysis of the estimator.
Definition 2.
Consider Estimator 1 but with Step 1 replaced with “set $\kappa=0$”. The estimator can be equivalently expressed as follows:
1. Input: $n$ independent samples $x_1,\dots,x_n$
2. Solve for the (unique) pair satisfying the two equations, where the defining functions are as follows:
(Note that always)
3. Output: the first coordinate of the pair from the previous step
We will sometimes omit from the arguments of since is not used in the definition of the function. We will often refer to the pair as a 2-element vector .
For convenience in the rest of the paper, we define , which we refer to as the “truncated empirical variance”; this is because, if we modify the condition by removing the “truncation” of taking the min with 1, then the resulting condition, when expressed in terms of and rearranged, is exactly the condition that is the empirical variance: . Thus may be thought of as a proxy for the empirical variance, as equals the empirical variance, except in cases when samples are far enough from 0 that they are “truncated” by the “”.
Interestingly, in the case that none of the samples are “truncated”, the overall output of the estimator becomes “the empirical mean, corrected by subtracting $\frac{\log(1/\delta)}{3n}$ times the ratio of the empirical 3rd moment over the empirical 2nd moment.”
Proof that Definition 2 is equivalent to Estimator 1 when $\kappa$ is set to 0.
Fix a set of samples $x_1,\dots,x_n$. We observe that Estimator 1, with the additional simplifying assumption that $\kappa=0$, can be represented by the following 2 equations.
(2)
Estimator 1 solves for in the first line, and uses this value to compute the estimate in the second line. The two conditions of Equation 2 are equivalent to the two conditions , respectively, and thus the two estimators are equivalent. ∎
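To make the reformulation concrete, the sketch below spells out the $\kappa=0$ system as two functions that are each an explicit sum of one term per sample, and reads the estimate off as a coordinate of the joint root. As with the earlier sketch, the weighting form, the $\frac{1}{3}\log\frac{1}{\delta}$ budget, and all function names (`make_system`, `f`, `g`) are our own reconstructions; the paper's exact defining functions may be normalized differently.

```python
import numpy as np
from scipy.optimize import brentq

def make_system(x, delta, budget_const=1.0 / 3.0):
    """Two per-sample sums whose joint root (eps, alpha) defines the kappa = 0 estimator.
    The functional form and budget constant are assumptions mirroring the earlier sketch."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    budget = budget_const * np.log(1.0 / delta)

    def f(alpha):          # Step 2 condition: total discarded weight equals the budget
        return np.minimum(alpha * x**2, 1.0).sum() - budget

    def g(eps, alpha):     # Step 3 condition: eps is the weighted mean of the samples
        return (x * (1.0 - np.minimum(alpha * x**2, 1.0))).sum() - n * eps

    return f, g

rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)                        # mean-0, variance-1 toy data
f, g = make_system(x, delta=1e-6)

alpha = brentq(f, 1e-12, 1e12)                         # f is monotone in alpha: 1-D root-find
eps_hat = brentq(lambda e: g(e, alpha), -10.0, 10.0)   # g is linear in eps; root-find mirrors the implicit view
print("estimate of the mean:", eps_hat)
```

Note that the first function does not involve the estimate coordinate at all, while the second is linear in it, so solving the system reduces to a one-dimensional root-find followed by a weighted average—exactly Steps 2 and 3 of Estimator 1. The value of writing it as a joint root is that each equation is a sum over independent samples, which is what the Chernoff-style analysis of Section 4 exploits.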
4 Analyzing our estimator
In this section, we outline the proof of our main theorem, restated as follows.
Theorem 1. Estimator 1, given $\delta$ and a sample set of $n$ independent samples from a distribution of variance $\sigma^2$, will, with probability at least $1-\delta$ over the sampling process, yield an estimate with error at most $(1+o(1))\,\sigma\sqrt{\frac{2\log\frac{1}{\delta}}{n}}$. Here, the $o(1)$ term tends to 0 as $n$ grows.
The discussion of the structural properties of Estimator 1 in Section 3.2 shows that it is sufficient to instead show that, for any distribution of mean 0 and variance 1, the $\psi$-estimator of Definition 2 will return an estimate that is close to 0, except with tiny probability. (See Appendix B for the formal statements of the claims of Section 3.2.) Recall also that, since the $\psi$-estimator solves a system of two equations over the sample set and returns the first coordinate of the solution, the claim that the returned estimate will be close to 0 is equivalent to saying that every pair whose first coordinate is far from 0 must violate the system of equations. We thus prove the following proposition (Proposition 3), to yield Theorem 1. Note that the failure probability in Proposition 3 is smaller than the $\delta$ of Theorem 1, accounting for an additional probability that the median-of-means estimate in Step 1 of Estimator 1 fails.
Proposition 3.
There exists a universal constant such that, fixing , we have that for all distributions with mean 0 and variance 1, with probability at least over the set of samples , for all where and , the vector .
Proposition 3 asks us to show that, with high probability, is not at the origin for any choice of ; instead, as a proof strategy, we choose a finite bounded mesh of and show that the function is 1) not just nonzero, but far from the origin on this set, 2) Lipschitz in between mesh elements, and 3) monotonic (in an appropriate sense) outside the mesh bounds. Step 1), discussed below, contains the most noteworthy part of the proof, a mathematical programming-inspired bound to help complete a delicate Chernoff bound argument.
For simplicity, we reparameterize to work with (the “truncated empirical variance”) instead of : the mesh we analyze, covering the most delicate region for analysis, will span the interval , namely, where the truncated empirical variance is within a constant factor of the true variance of 1. Note that this should not be taken to imply that with high probability—the truncated empirical variance is not designed to be a good estimate of the variance, merely as a step in robustly estimating the mean; and further, accurate estimates of the variance are simply impossible in general without further assumptions such as bounds on the distribution’s or moments. We also want to distinguish our estimator from Catoni’s [4]: Catoni’s estimator relies on having a high-precision estimate of the variance (to within a factor) in order to achieve the desired performance. By contrast, our estimator is robust against wild inaccuracies of the (truncated) empirical variance compared to the true variance of 1. In short, the approach of our estimator should be viewed as distinct from Catoni’s, since, while Catoni’s estimator relies on an initial good guess at the variance, ours thrives in the inevitable situations where is far from 1.
We return to describing our strategy for analyzing the performance of our estimator. For each that we analyze (from the finite mesh): instead of directly showing that, with probability, is far from the origin in some direction, we instead linearize this claim; we prove the stronger claim that there exists a specific direction such that with probability, is more than distance from the origin in direction (specifically we lower bound the dot product , while we upper bound each coordinate of inversely with the Lipschitz coefficients of ). The crucial advantage of this reformulation is that, since each of is a sum of terms, that are each a function of an independent sample from , the dot product is thus also a sum of independent terms, and thus we finish the proof with a Chernoff bound, Lemma 4. The Chernoff bound argument itself is standard; however, to bound the resulting expression requires an extremely delicate analysis that we pull out into a separate 4-variable inequality expressed as Lemma 7—see the discussion around the lemma for more details and for motivation of the analysis from a mathematical programming perspective.
We state the crucial Chernoff bound (Lemma 4) and the Lipschitz bounds (Lemma 5), and then use them to prove Proposition 3. We prove Lemmas 4 and 5 in the next section, along with the statement and proof of the delicate component that is Lemma 7.
Lemma 4.
Consider an arbitrary distribution with mean 0 and variance 1. There exists a universal constant where the following claim is true. Fixing , then for all smaller than some universal constant, and for all , there exists a vector where , and both are bounded by a universal constant, such that
Furthermore, for we have , ; and for we have , .
Lemma 5.
Consider an arbitrary set of samples . Consider the expressions , reparameterized in terms of in place of . Suppose the equation has a solution in the range . Then the functions and are Lipschitz with respect to on the entire interval , with Lipschitz constant for some universal constant .
Proof of Proposition 3.
As in Lemma 4, we fix , where is some universal constant.
By symmetry, instead of considering positive and negative , it suffices to consider the case (as opposed to ) and show that this case succeeds with probability at least .
To prove the claim, we first prove a stronger statement on a restricted domain, that with probability at least over the randomness of the sample set , for each there exists a vector such that , with throughout, and, for we have , ; and for we have , .
We will first apply Lemma 4 to each in a discrete mesh: let consist of evenly spaced points between and with spacing (thus with many points).
By Lemma 4 and a union bound over these points, we have that with probability at least (which is at least for smaller than some universal constant) over the set of samples , for all , there exists a vector such that , where further satisfies the desired positivity and boundary conditions, and where both are bounded by a universal constant. For the rest of the proof, we will only consider sets of samples satisfying the above condition.
Now consider an arbitrary and consider the vector evaluated at . We wish to extend the dot product inequality to hold also for . If then there is nothing to prove: set and ; otherwise, means we may apply Lemma 5 to conclude that both and are Lipschitz with respect to on the interval , with Lipschitz constant for some universal constant .
Consider the closest to , which by definition of is at most away. By assumption on , there exists a vector such that , with and both are bounded by a universal constant. Because of the Lipschitz bounds on , combined with the bounds on the size of the , we conclude that the Lipschitz constant of the dot product (treating the vector as fixed) is . Thus, the large positive dot product at implies at least a positive dot product nearby at : , for sufficiently small as given in the proposition statement.
Having shown the stronger version of the claim for the restriction and we now extend to the entire domain via three monotonicity arguments. Explicitly, assume the set of samples satisfies the dot product inequality above with the vector function , where satisfies the boundary conditions at and specified in Lemma 4. From this assumption, we will show that for any positive , and for any .
First consider (still fixing ). The function is an increasing function of , and thus a decreasing function of . Since for , the dot product with , the dot product will thus remain positive for this same choice of as we increase from .
Next, for (again still fixing ), we analogously show that the dot product of with the fixed vector will increase as we decrease . The term in the sums defining or depends on (and thus ) only in the factor . Further, there is no dependence unless the first term attains the min, namely , which in turn is upper bounded by because of our assumption that . Thus, the only terms in the dot product which have dependent are simply equal to . By our choice of and from Lemma 4, the expression is thus always non-negative, and thus the overall dot product cannot decrease as we send to —equivalently, sending to 0—as desired.
We have thus shown that, for all non-negative , there is a vector with whose dot product with is greater than 0. We complete the proof by noting that the only dependence on in is that is (trivially) increasing in . Since , increasing from will only increase the dot product, and thus the dot product remains strictly greater than 0, implying that as desired. ∎
5 Proofs of Lemmas 4 and 5
The main purpose of this section is to present and motivate the proof of Lemma 4—since our results are tight across such a wide parameter space, the resulting inequalities are somewhat subtle. After, we also present the short proof of Lemma 5.
Lemma 4. Consider an arbitrary distribution with mean 0 and variance 1. There exists a universal constant where the following claim is true. Fixing , then for all smaller than some universal constant, and for all , there exists a vector where , and both are bounded by a universal constant, such that
(3)
Furthermore, for we have , ; and for we have , .
We start the analysis with standard Chernoff bounds on the complement of the probability in Equation 3 (Lemma 6), before pausing to discuss how mathematical programming and duality insights lead to the formulation of the crucial Lemma 7; we then complete the proof.
Lemma 6.
Consider an arbitrary distribution with mean 0 and variance 1. For all sufficiently small , for any and vector , we have
Proof.
We upper-bound the probability by exponentiating the negation of both sides of the expression inside the probability, and then using Markov’s inequality:
(4)
∎
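In generic form, the step just carried out is the standard exponentiation-plus-Markov argument. Writing $Z=\sum_{i=1}^n Z_i$ for a sum of independent per-sample terms (generic notation of ours, standing in for the appropriately shifted dot product analyzed here) and $\lambda>0$ for a free parameter,
$$\Pr[Z \le 0] \;=\; \Pr\!\left[e^{-\lambda Z} \ge 1\right] \;\le\; \mathbb{E}\!\left[e^{-\lambda Z}\right] \;=\; \prod_{i=1}^{n} \mathbb{E}\!\left[e^{-\lambda Z_i}\right],$$
where the final factorization is exactly where the independence bought by the $\psi$-estimator reformulation enters, and the free parameter $\lambda$ is chosen afterwards to make the bound as small as possible—the optimization that Section 5.1 recasts via duality.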
5.1 Mathematical Programming and Duality Analysis
In order to show Lemma 4, we aim to find bounds on the failure probability that are as strong as possible. Appealing to Lemma 6, which we have just proven, recall that, as in the standard Chernoff bound methodology, we are still free to choose the parameters, which we choose so as to minimize the resulting bound on the failure probability. Phrased abstractly, the goal is, in the setting of Lemma 4, to show that, for any distribution of mean 0 and variance 1, there is a choice that makes Equation 4 sufficiently small. Phrased as an optimization problem, our goal is to evaluate (or tightly bound):
(5)
where the maximization ranges over distributions of mean 0 and variance 1.
We will use convex-concave programming and linear programming duality to significantly simplify the max-min program in Equation 5 before we dive into the part of analysis that is ad hoc for this problem. We wish to emphasize here again that the steps of 1) writing an estimator as a multi-parameter -estimator and finding an analogous lemma to our Lemma 4, then 2) using mathematical programming duality to simplify the Chernoff bound analysis, are a framework generalizable for tightly analyzing other estimators.
For simplicity of exposition, assume that we restrict the support of the distribution to some sufficiently fine-grained finite set, meaning that the maximization in Equation 5 is now finite-dimensional, albeit of arbitrarily large finite dimension. For each support element, let a variable represent the probability of choosing that element under the distribution. The expectation component of Equation 5 may now be expressed as a sum that is a linear function in these probability variables:
(6)
Using the standard max-min inequality (a form of weak duality in optimization), we have that Equation 6 is upper bounded by swapping the maximization and minimization (Equation 7), meaning that the vector no longer depends on the distribution .
(7)
Crucially, however, Equation 7 is not just an upper bound on Equation 6, but is in fact equal to it, due to Sion’s minimax theorem [23]. To apply Sion’s minimax theorem, it suffices to check that 1) both sets of variables are constrained to convex sets, at least one of which is compact, 2) the objective is convex in the minimization variables, and 3) the objective is concave in the maximization (probability) variables. For the first condition, we note that the set of distributions on a finite domain is compact. The objective is convex in the minimization variables since it is a sum of exponentials that are each linear in them. And the objective is concave in the probability variables because it is in fact a linear function of them.
The guarantee of Sion’s minimax theorem means that we may work with Equation 7 instead of Equation 6 without sacrificing tightness in our analysis. This justifies why, in Lemma 4, we are free to choose a vector that does not depend on the distribution.
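As a toy sanity check of this minimax swap (our own construction, unrelated to the specific program above): take an objective that is a sum of exponentials—hence convex—in a scalar variable, and linear—hence concave—in a probability vector over a three-point support. Brute-force evaluation shows the max-min and min-max values essentially coincide, as Sion's theorem guarantees, while weak duality alone already gives the "$\le$" direction used for upper bounds.

```python
import numpy as np
from itertools import product

a = np.array([-1.0, 0.5, 2.0])          # toy support points (our choice)
ts = np.linspace(-2.0, 2.0, 401)        # grid for the minimization variable
E = np.exp(np.outer(ts, a))             # E[i, j] = exp(ts[i] * a[j]); objective = E @ p

# min over t of max over p: for fixed t, the maximum of a linear function of p over the
# probability simplex is attained at a vertex, i.e. at a single support point
min_max = E.max(axis=1).min()

# max over p of min over t: brute-force p over a grid on the simplex
max_min = -np.inf
grid = np.linspace(0.0, 1.0, 51)
for p1, p2 in product(grid, grid):
    if p1 + p2 <= 1.0:
        p = np.array([p1, p2, 1.0 - p1 - p2])
        max_min = max(max_min, (E @ p).min())

print(f"max-min = {max_min:.4f}  <=  min-max = {min_max:.4f}")
```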
To further simplify the problem in Equation 7, we note again that both the objective and the constraints on are linear in the variables , meaning that the inner maximization is in fact a linear program. We can then apply linear programming (strong) duality to yield the following equivalent optimization (Equation 8). We note that, as above, for the purposes of upper bounding Equation 5, it suffices to only use weak duality. Strong duality however guarantees that this step does not introduce slack into the analysis.
The three variables in the inner minimization below are the dual variables corresponding to the three constraints originally imposed on the distribution: that it has variance 1, mean 0, and total probability mass 1.
(8)
We have thus reduced the infinite-dimensional optimization problem of Equation 5 to the five-dimensional problem of Equation 8 (or six dimensions, if we include the universal quantification for ), a significant simplification. We bound Equation 8 by explicitly choosing values for as functions of , and showing that they jointly satisfy the constraint of Equation 8, for all . We factor out the terms in the exponential that do not depend on ; we make the variable substitutions and to replace dependence on with dependence on the single variable ; taking the log of both sides (and swapping sides) yields an expression that is recognizable in the following lemma, where the multipliers of respectively on the right hand side are essentially our choices of :
Lemma 7.
For all , there exist and such that
where and for positive constants . Further, for , the pair works.
We emphasize that the application of Lemma 7 in the proof of Lemma 4 below is straightforward, though finding the particular form of Lemma 7 is not. Further, one would not seek a result of the form of Lemma 7 without the guarantees of this section, derived via duality and mathematical programming, showing that “results of the form of Lemma 7 encompass the full power of the Chernoff bounds of Equation 4.” See the end of Section 5.2 for the proof of Lemma 7.
5.2 Proof of Lemma 4
We now prove Lemma 4 by combining the Chernoff bound analysis of Lemma 6 with the inequality from Lemma 7. We point out that the proof below is direct, without any reference to duality or mathematical programming; however, the discussion of Section 5.1 was crucial to discovering the right formulation for Lemma 7. We prove Lemma 7 at the end of the section.
Lemma 4. Consider an arbitrary distribution with mean 0 and variance 1. There exists a universal constant where the following claim is true. Fixing , then for all smaller than some universal constant, and for all , there exists a vector where , and both are bounded by a universal constant, such that
Furthermore, for we have , ; and for we have , .
Proof.
Start with the bound on the probability of failure given by Lemma 6:
For we bound the exponential inside the expectation via the exponential of Lemma 7; we also use Lemma 7 to choose for us (the case is covered at the end). Namely, in Lemma 7 use as given, substitute (where as always), and choose , and —in particular, for this gives . Thus the failure probability is bounded by
We prove the case now. We choose and , substituting into the bound of Equation 4 to yield
which is bounded as desired for small enough . ∎
We now prove Lemma 7.
Lemma 7. For all , there exist and such that
(9)
where and for positive constants . Further, for , the pair works.
Proof.
We first prove the special case of 1) , before moving to the general case of 2) . We note that our choice of is not continuous in at , but the usage of the lemma does not require any continuity. We choose at the edge case for convenience.
1) For , we choose . This special case of Equation 9 simplifies to:
(where 0.174 is a lower bound on ). This is a 1-dimensional bound and can be easily analyzed in many ways. For the range the right hand side is at least , which in this range is at least , which is easily shown to be greater than the polynomial expression that the left hand side reduces to in this range, . For the remaining range, , the left hand side is the constant , and it is easy to check that the quadratic in the argument of the right hand side, , always exceeds .
2) To show Equation 9 for the rest of the range of , we choose to be the positive root of the quadratic equation and let —we will see the motivation for this choice shortly. For now, note that the definition of implies , for otherwise would be greater than 0.
Our proof will analyze the sign of the derivative with respect to of the difference between the right and left hand sides of Equation 9. For the critical region this derivative equals:
(10)
The crucial step is to choose to be the positive root of the quadratic equation and let , after which Equation 10 miraculously factors as
From this expression for the derivative, it is straightforward to read off its sign. The discriminant of the quadratic in the denominator is , meaning the denominator is always positive. The squared term in the numerator cannot affect the overall sign. Thus the sign of the derivative equals the sign of , meaning that the difference between the right and left side of Equation 9 is monotonically increasing for , and unimodal for , having non-positive derivative for and nonnegative derivative for smaller . Thus to show the inequality holds for all it suffices to check it at and .
The case is trivial as both sides of Equation 9 equal 0.
For , Equation 9, after expressing both and in terms of becomes
(11)
For , the inverse of the rational expression inside the log is bounded by its linear approximation, . Calling this a new variable , which is between 0 and 1, Equation 11 becomes the claim that , which is easily verified for .
Lastly, we show Equation 9 for . Reexpressing and in terms of , the left hand side of the inequality is the constant value (independent of ), while the right hand side is . Analyzing the quadratic inside the log shows that the right hand side has a minimum of , attained at .
When the location of this minimum, , is inside the interval , then because this quadratic is monotonic to either side of the minimum, the fact that we have already proven Equation 9 for implies the inequality holds for all further from 0.
The remaining case is when the minimum is not in , namely, , meaning ; since is monotonic in , is at least its value when , namely . Equation 9 thus reduces to showing that, for we have , which is trivially implied, substituting , by the inequality for , yielding the claim. ∎
5.3 Proof of Lemma 5
Lemma 5. Consider an arbitrary set of samples . Consider the expressions , reparameterized in terms of in place of . Suppose the equation has a solution in the range . Then the functions and are Lipschitz with respect to on the entire interval , with Lipschitz constant for some universal constant .
Proof.
Consider the derivative of . The derivative of is either or 0, depending on which term in the min is smaller, and in either case has magnitude at most . Thus the overall derivative of has magnitude at most . Since we are guaranteed that for some , we thus have that the derivative is within a constant factor of this across the entire range, as desired.
Similarly, consider the derivative of . The term of this is the derivative of , which is either or 0 depending on whether , and thus the magnitude of this derivative may be bounded by . Since is bounded by a constant times (as in the last paragraph), and is bounded by a constant times , the magnitude of the derivative of is bounded by a constant times , as desired. ∎
References
- [1] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci, 58(1):137–147, 1999.
- [2] Christian Brownlees, Emilien Joly, Gábor Lugosi, et al. Empirical risk minimization for heavy-tailed losses. Ann. Stat., 43(6):2507–2536, 2015.
- [3] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Trans. Inf. Theory, 59(11):7711–7717, 2013.
- [4] Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. Ann. I. H. Poincare-PR, 48(4):1148–1185, 2012.
- [5] Olivier Catoni and Ilaria Giulini. Dimension-free pac-bayesian bounds for matrices, vectors, and linear least squares regression. arXiv:1712.02747, 2017.
- [6] Yeshwanth Cherapanamjeri, Nicolas Flammarion, and Peter L Bartlett. Fast mean estimation with sub-gaussian rates. In Proc. COLT ’19, pages 786–806, 2019.
- [7] Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I. Oliveira. Sub-gaussian mean estimators. Ann. Stat, 44(6):2695–2725, 2016.
- [8] John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In Proc. FOCS ’13, pages 429–438, 2013.
- [9] John C Duchi, Michael I Jordan, and Martin J Wainwright. Minimax optimal procedures for locally private estimation. J. Am. Stat. Assoc, 113(521):182–201, 2018.
- [10] Samuel B Hopkins et al. Mean estimation with sub-gaussian rates in polynomial time. Ann. Stat., 48(2):1193–1213, 2020.
- [11] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. J. Mach. Learn. Res, 17(1):543–582, 2016.
- [12] Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. Random generation of combinatorial structures from a uniform distribution. Theor. Comput. Sci, 43:169–188, 1986.
- [13] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory, 61(5):2835–2885, 2015.
- [14] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. In Proc. COLT ’20, volume 125, pages 2204–2235, 2020.
- [15] Zhixian Lei, Kyle Luh, Prayaag Venkat, and Fred Zhang. A fast spectral algorithm for mean estimation with sub-gaussian rates. In Proc. COLT ’20, pages 2598–2612, 2020.
- [16] Gabor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions–a survey, 2019.
- [17] Gábor Lugosi, Shahar Mendelson, et al. Sub-Gaussian estimators of the mean of a random vector. Ann. Stat., 47(2):783–794, 2019.
- [18] Stanislav Minsker. Uniform bounds for robust mean estimators. arXiv:1812.03523, 2018.
- [19] Ankur Moitra and Michael Saks. A polynomial time algorithm for lossy population recovery. In Proc. FOCS’13, pages 110–116, 2013.
- [20] A.S. Nemirovsky and D.B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
- [21] Yury Polyanskiy, Ananda Theertha Suresh, and Yihong Wu. Sample complexity of population recovery. In Proc. COLT’17, volume 65, 2017.
- [22] Yury Polyanskiy and Yihong Wu. Dualizing Le Cam’s method, with applications to estimating the unseens. arXiv:1902.05616, 2019.
- [23] Maurice Sion. On general minimax theorems. Pac. J. Math, 8(1):171–176, 1958.
- [24] Leonard A. Stefanski and Dennis D. Boos. The calculus of M-estimation. The American Statistician, 56(1):29–38, 2002.
- [25] Gregory Valiant and Paul Valiant. The power of linear estimators. In Proc. FOCS’11, pages 403–412, 2011.
- [26] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inf. Theory, 62(6):3702–3720, 2016.
- [27] Yihong Wu, Pengkun Yang, et al. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. Ann. Stat, 47(2):857–883, 2019.
Appendix A Additional “3rd-Order” Motivation for Our Estimator
In this appendix, we give additional motivation of our estimator as a “3rd-order correction” to the sample mean.
Suppose (for this section only), as in [4], that one knows the variance of the distribution in question, or has a good estimate of it.
Example 2.
Given samples from a distribution of mean 0 and variance 1 and bounded higher moments, suppose our goal is to construct a slight variant of the empirical mean that will robustly return an estimate that is close to 0, the true mean; we consider estimates of the form for some function . Explicitly, given a bound , we want our estimate to be between , with as high probability as possible. For simplicity we will consider the positive case, namely, bounding . With a view towards deriving a Chernoff bound, we rearrange, multiply by an arbitrary positive constant , and exponentiate inside the probability to yield that this probability equals ; by Markov’s inequality, this probability is at most , for our choice of . We set . Since each is independent, this probability becomes .
Considering the empirical estimator, where , we thus have that the probability the empirical mean estimate exceeds is at most the $n$-th power of , where this expression can be expanded to 3rd order as
As we assumed the data distribution has mean 0 and variance 1, we can simplify the above expression to
Ignoring, for the moment, the 3rd- or higher-order terms, this expression is , whose power equals , which is exactly the bound one would expect for the standard Gaussian case, of the probability that the empirical mean of samples is more than from the true value. However, the 3rd-order term is a crucial obstacle here, as the third moment could be of either sign, skewing either the left tail or the right tail to have substantially more mass than in our benchmark of the Gaussian case.
We thus choose a correction function so as to cancel out this 3rd-order term and improve the estimate in this regime: to cancel out the term in the 3rd-order expansion of our Chernoff bound , we replace by , yielding a bound on the failure probability of the $n$-th power of
as desired.
For the sake of clarity, we can change variables, letting the leading term of our probability bound equal , and thus the correction becomes , meaning the correction amounts essentially to a 3rd-moment correction, split $n$ ways and scaled by the same $\frac{1}{3}\log\frac{1}{\delta}$ quantity as in our main algorithm.
We explicitly relate this estimator to Estimator 1 by pointing out that, when none of the samples are “truncated” by Estimator 1 (namely, always), then may be expressed in terms of the empirical variance ; taking for simplicity, the returned estimate will be , which equals the above-derived “3rd-order corrected estimator” when the empirical variance is the true variance, 1.
In the above example we showed that Chernoff bounds for the empirical mean deteriorate for distributions with large third moments (skew), and that adding a 3rd-order correction to the empirical mean corrects for this, leaving essentially “Gaussian-like” performance. These calculations motivate several features of Estimator 1—including the $\frac{1}{3}\log\frac{1}{\delta}$ quantity, and the 3rd-order terms in the expression for the estimate—even though the overall form of Estimator 1 is rather different, as it must work in all regimes and not just in the cartoon asymptotic regime considered in this example.
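As a final illustration (entirely our own simulation, with arbitrary parameters): for a right-skewed distribution of mean 0 and variance 1, the right tail of the empirical mean at the Gaussian-benchmark threshold is noticeably heavier than the corresponding Gaussian tail, and shifting the estimate down by a third-moment correction—using the same assumed $\frac{1}{3}\log\frac{1}{\delta}$ scaling as in our earlier sketches and, for simplicity, the population rather than empirical moments—brings it much closer to the Gaussian benchmark.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, delta, trials = 100, 0.01, 2_000_000        # illustrative parameters
L = np.log(1 / delta)
eps = np.sqrt(2 * L / n)                       # Gaussian-benchmark threshold for sigma = 1

# sample means of n centered-exponential draws (mean 0, variance 1, third central moment 2);
# the sum of n Exp(1) variables is Gamma(n), so the means can be drawn directly
means = rng.gamma(shape=n, scale=1.0, size=trials) / n - 1.0

# third-moment correction with the assumed (1/3) log(1/delta) scaling; for simplicity we plug
# in the population moments m3 = 2, m2 = 1 rather than their empirical counterparts
correction = (L / (3 * n)) * (2.0 / 1.0)

gaussian_rate = norm.sf(eps * np.sqrt(n))            # right tail of the Gaussian benchmark
raw_rate = float(np.mean(means > eps))               # empirical mean, uncorrected
corrected_rate = float(np.mean(means - correction > eps))

print(f"Gaussian benchmark right tail:  {gaussian_rate:.5f}")
print(f"empirical mean right tail:      {raw_rate:.5f}")
print(f"corrected estimator right tail: {corrected_rate:.5f}")
```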
Appendix B Proposition 3 implies Theorem 1
For completeness’ sake, we explicitly state and prove the properties described in Section 3.2, which combine to show that Proposition 3 implies Theorem 1.
Lemma 8.
Suppose is a set of samples in . Then for any and any scale and shift ,
where denotes the output of Estimator 1.
The above lemma follows trivially from the fact that the median-of-means estimate also respects shift and scale in the input samples, and that is chosen in Step 2 of Estimator 1 so that does not depend on the affine parameters .
Fact 9 ([11]).
For any distribution with mean and standard deviation , the median-of-means estimate , on input samples, satisfies
Lemma 10.
Consider a fixed sample set of size , and a confidence parameter . Let denote Estimator 1 but where Step 1 is omitted and is instead considered as an input. Then,
Fact 9 shows that, except with small probability, the median-of-means estimate is within $O\big(\sigma\sqrt{\log\frac{1}{\delta}/n}\big)$ of the true mean, and multiplying this by the Lipschitz constant from Lemma 10 shows that the change in output of Algorithm 1, between using the median-of-means estimate as $\kappa$ versus setting $\kappa$ to the true mean, is a lower-order fraction of the additive error guaranteed by Theorem 1.
We now prove Lemma 10.
Proof.
We compute the derivatives with respect to and of the (computed in Step 3 of Estimator 1), and the expression on the left hand side of Step 2, which we denote . We note that for terms where , all derivatives are 0, so we adopt the notation “” to denote summing only over those indices for which . Thus we have
Recall that is defined implicitly so as to make the expression ; thus in Estimator 1, if we change at a rate of 1, then also changes at rate to keep unchanged. Thus, the overall derivative of the estimate with respect to changing equals . We bound this from the derivatives computed above.
To bound , we note that the number of indices not in the sum “” is at most because each such contributes 1 to the left hand side of the condition in Step 2 of Estimator 1 and the right hand side equals . Thus the initial terms of are bounded as . The remaining part of , namely is times the corresponding terms in itself, and thus is at most . Thus .
We now bound the remaining term . Since for each index in “” we have , we may bound , involving a moment term, by the simpler . Combining this, with the other derivatives and the bound from the previous paragraph yields:
where the last inequality is Cauchy-Schwarz applied to the sequence and the all-1s sequence. ∎