Robust estimation algorithms don’t need to know the corruption level.

Ayush Jain, Alon Orlitsky, Vaishakh Ravindrakumar
University of California, San Diego
{ayjain,alon,varavind}@eng.ucsd.edu
Abstract

Real data are rarely pure. Hence the past half century has seen great interest in robust estimation algorithms that perform well even when part of the data is corrupt. However, the vast majority of them approach optimal accuracy only when given a tight upper bound on the fraction of corrupt data. Such bounds are not available in practice, resulting in weak guarantees and often poor performance. This brief note abstracts the complex and pervasive robustness problem into a simple geometric puzzle. It then applies the puzzle’s solution to derive a universal meta-technique that converts any robust estimation algorithm requiring a tight corruption-level upper bound to achieve its optimal accuracy into one achieving essentially the same accuracy without using any upper bounds.

1 Motivation

Much of statistics concerns learning from samples, typically assumed to be generated by an unknown distribution. As the number of samples increases, many statistical tasks can be learned to arbitrarily small error. However, in practical applications, often not all samples are generated according to the underlying distribution. Some may be inaccurate, corrupt, or even adversarial.

Robust algorithms that learn accurately even when some samples are corrupt have therefore been studied since the early works of Tukey [Tuk60] and Huber [Hub64], and have been the subject of several books [HR09, HRRS11]. The research continues to this day with recent efficient algorithms for robust mean and density estimation [LRV16, DKK+16, DKK+17], sparse mean estimation [BDLS17, DKK+19c], learning from batches [QV18, CLM20a, JO20b, JO20a, CLM20b, JO21], Erdős–Rényi graphs [AJK+21], and many more [BDLS17, CSV17, KKM18, SCV18, DKK+18, HL18, KSS18, DKK+19b, DKK+19a, ZJS19, PSBR20, LSLC20, CKMY20, PJL21, JLST21]; see [DK19] for a recent survey.

Essentially all these works consider the Huber model [Hub64] and its generalizations. A fraction $1-\alpha$ of the samples are genuine, while the remaining $\alpha$ fraction may be inaccurate, corrupt, or adversarial, even chosen after viewing the genuine samples. A significant effort within this research concerns statistical estimation, approximating an underlying distribution or parameter set. This note addresses statistical estimation under any of the corruption models above.

Most robust algorithms, both in general and specifically for estimation, either make the theoretically convenient assumption that $\alpha$ is known, or more practically utilize an input parameter $\beta$ that is assumed to upper bound $\alpha$, and otherwise may return an arbitrarily erroneous result. Since $\alpha$ is never known, the first approach cannot work in practice. Assuming an upper bound necessitates a large $\beta$ to ensure correctness. Yet, as the following examples show, the algorithm’s estimation error is an increasing function $f(\beta)$.

[DKK+16, DKK+17] considered robust learning of high-dimensional Gaussians with identity covariance. They showed that as long as the corrupted fraction $\alpha$ is at most the upper bound $\beta$, with sufficiently many samples, the distribution mean can be learned to $\ell_2$ distance ${\cal O}(\beta\sqrt{\log(1/\beta)})$, and the same accuracy applies for learning the distribution itself in TV distance. They also derived similar results for learning general Gaussian distributions and product distributions when provided an upper bound $\beta\geq\alpha$. A sequence of papers [QV18, CLM20a, JO20b, JO20a, CLM20b, JO21] studied discrete and piecewise-polynomial continuous distributions where the samples arrive in batches of size $n$, and a fraction $\alpha\leq\beta$ of the batches can be arbitrary. They showed that the underlying distribution can be learned to TV distance ${\cal O}(\beta\sqrt{\log(1/\beta)/n})$. Similarly, [AJK+21] showed that for Erdős–Rényi graphs with $n$ nodes, where the edges connected to a fraction $\alpha\leq\beta$ of the nodes may be corrupt, the connection probability $p$ can be learned to accuracy ${\cal O}\bigl(\beta\sqrt{\tfrac{p(1-p)\log(1/\beta)}{n}}+\tfrac{\beta}{n}\log n+\tfrac{\sqrt{p(1-p)\log n}}{n}\bigr)$.

While this note addresses robust estimation, the known-upper-bound assumption is prevalent in many other robust learning applications, including robust regression and robust gradient descent, see for example [KKM18, CKMY20, DKS19, PSBR20]. It would be interesting to see whether they could be similarly addressed as well.

The conflicting dependence of accuracy and validity on $\beta$ raises several natural concerns. First, the corrupt fraction $\alpha$ is typically unknown. Upper bounding it by a fixed $\beta$ goes against the very essence of robustness. Even if one is willing to assume an upper bound $\beta\geq\alpha$, what should it be? A large $\beta$ can drastically reduce the algorithm’s accuracy, yet a small $\beta$ risks invalidating its results altogether.

Questions about the validity and choice of β\beta have therefore haunted this approach in both presentations and applications.

This brief note takes a bird’s-eye view of optimal robust estimation. Instead of addressing each individual problem, it reformulates all of them as an elementary geometric puzzle whose exceedingly simple and elegant solution yields a universal, unified, and efficient method achieving optimal error without knowing or bounding $\alpha$.

The next section describes the puzzle and its solution. Section 3 applies the result to remove the upper bound requirement from all robust estimation problems, achieving the same accuracy as if the corruption level were known in advance. It demonstrates the effect on the three robust learning examples provided above. Section 4 considers some of the technique’s limitations and possible extensions.

2 AirTag

Apple’s AirTag approximates the location $x$ of a misplaced item. Its beta-version successor lets the user select an approximation distance $\beta$ that in turn affects the search space $S(\beta)$: if $\beta^{\prime}\leq\beta$ then $S(\beta^{\prime})\subseteq S(\beta)$. AirTag then returns an approximate location $x_{\beta}$ that is within distance $\beta$ from $x$ if $x\in S(\beta)$, and is arbitrary otherwise. The set of possible locations and the distance may form any metric space, $\beta$ can assume any finite set of positive values, and, par for Apple’s course, except for its growth with $\beta$, $S(\beta)$ is completely unknown.

The best approximation distance is clearly the smallest $\beta$ such that $x\in S(\beta)$, which we denote by $\alpha$. Choosing $\beta>\alpha$ worsens $x_{\beta}$’s accuracy, while for $\beta<\alpha$, $x_{\beta}$ is arbitrary. However, $x$ and $S(\beta)$ are unknown, hence so is $\alpha$. Upon obtaining the locations $x_{\beta}$ for all $\beta$, can you approximate $x$ to a distance at most $c\cdot\alpha$ for some small constant $c$?

We will soon describe two simple solutions, but first observe that the puzzle captures both the essence and the functionality of robust algorithms. Given an upper bound $\beta$ on the corruption level, these algorithms approximate an underlying distribution or parameter set. If the actual corruption level is below $\beta$, their output is within the specified distance from the distribution, and otherwise, no guarantees are provided.

One small difference is that for estimation algorithms, the distance guarantee is not $\beta$, but rather some known increasing function $f(\beta)$ specific to each problem. The addition of $f$ can be viewed as a simple reparametrization of $\beta$, hence it does not change the problem. Yet it will be convenient to use it in the applications, hence we phrase the results in this more general form.

Consider a metric space $({\cal M},d)$, a finite set ${\cal B}\subseteq\mathbb{R}$, an association $x_{\beta}\in{\cal M}$ with every $\beta\in{\cal B}$, and a non-decreasing $f:{\cal B}\to[0,\infty)$. An estimator is given $x_{\beta}$ for all $\beta\in{\cal B}$ as input, and outputs a point $\hat{x}\in{\cal M}$. It achieves approximation factor $c$ if for every input and every $x\in{\cal M}$ and $\alpha\in{\cal B}$ such that $d(x,x_{\beta})\leq f(\beta)$ for all $\beta\geq\alpha$, it satisfies $d(\hat{x},x)\leq c\cdot f(\alpha)$.

The following estimator achieves approximation factor 2. Let $B(x,r):=\{y\in{\cal M}:d(x,y)\leq r\}$ be the ball of radius $r\geq 0$ around $x\in{\cal M}$. Define $\beta^{\prime}$ to be the smallest number in ${\cal B}$ such that $\bigcap_{{\cal B}\ni\beta\geq\beta^{\prime}}B(x_{\beta},f(\beta))$ is non-empty, and let $x^{\prime}$ be any point in this intersection.

Lemma 1.

$x^{\prime}$ achieves approximation factor 2.

Proof.

Let $x\in{\cal M}$ and $\alpha\in{\cal B}$ satisfy $d(x,x_{\beta})\leq f(\beta)$ for all $\beta\geq\alpha$. Then for all $\beta\geq\alpha$, $x\in B(x_{\beta},f(\beta))$. Hence, by definition, $\beta^{\prime}\leq\alpha$, and $x^{\prime}\in B(x_{\alpha},f(\alpha))$. By the triangle inequality, $d(x,x^{\prime})\leq d(x,x_{\alpha})+d(x_{\alpha},x^{\prime})\leq 2f(\alpha)$. ∎
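On the real line the balls are intervals, so the intersection above can be computed by comparing endpoints. The following Python sketch implements the factor-2 estimator in this special case; the names `estimates` (a map from each $\beta\in{\cal B}$ to $x_{\beta}$) and `f` are illustrative placeholders, not part of the original formulation.

```python
# Minimal sketch of the factor-2 estimator on the real line (M = R, d = |.|).
def factor2_estimate(estimates, f):
    """Return a point in the intersection of the intervals [x_beta - f(beta), x_beta + f(beta)]
    over all beta >= beta', for the smallest feasible beta'."""
    betas = sorted(estimates)                            # the finite set B, in increasing order
    for i, beta_prime in enumerate(betas):
        tail = betas[i:]                                 # all beta >= beta'
        lo = max(estimates[b] - f(b) for b in tail)      # left end of the intersection
        hi = min(estimates[b] + f(b) for b in tail)      # right end of the intersection
        if lo <= hi:                                     # intersection is non-empty
            return (lo + hi) / 2                         # any point in it works; take the midpoint
    raise ValueError("unreachable")                      # the largest beta alone is always feasible
```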

The next lemma shows that approximation factor 2 is best possible.

Lemma 2.

No estimator achieves approximation factor less than 2.

Proof.

Consider the reals with the absolute-difference distance, ${\cal B}=\{\epsilon,1\}$ for $\epsilon\in(0,0.5]$, $f(\beta)=\beta$ for all $\beta$, and $x_{\epsilon}=1+\epsilon$, $x_{1}=0$. Let estimator $\hat{x}$ achieve approximation factor $c$. For every $x\in\mathbb{R}$ and $\alpha\in{\cal B}$ such that $|x-x_{\beta}|\leq f(\beta)$ for all $\beta\geq\alpha$, it must satisfy $|\hat{x}-x|\leq c\cdot f(\alpha)$. Hence for $x=1$ and $\alpha=\epsilon$, $|\hat{x}-1|\leq c\epsilon$, while for $x=-1$ and $\alpha=1$, $|\hat{x}-(-1)|\leq c$. By the triangle inequality, $c+c\epsilon\geq|\hat{x}-1|+|\hat{x}+1|\geq 2$, hence $c\geq 2/(1+\epsilon)$, which can be made to exceed any number less than 2. ∎

While finding an $x^{\prime}$ at the intersection of several balls may be feasible in some metric spaces, for example when ${\cal M}=\mathbb{R}$, for general $({\cal M},d)$ this may be computationally challenging. The next method achieves a slightly larger approximation factor $c=3$, but requires only pairwise distance computations among the $x_{\beta}$’s.

Define $\hat{\beta}$ to be the smallest number in ${\cal B}$ such that for all ${\cal B}\ni\beta\geq\hat{\beta}$, $d(x_{\beta},x_{\hat{\beta}})\leq f(\beta)+f(\hat{\beta})$.

Lemma 3.

$x_{\hat{\beta}}$ achieves approximation factor 3.

Proof.

Let $x\in{\cal M}$ and $\alpha\in{\cal B}$ satisfy $d(x,x_{\beta})\leq f(\beta)$ for all $\beta\geq\alpha$. By the triangle inequality, for all $\beta\geq\alpha$,

$$d(x_{\beta},x_{\alpha})\leq d(x_{\beta},x)+d(x,x_{\alpha})\leq f(\beta)+f(\alpha),$$

hence, by $\hat{\beta}$’s definition, $\hat{\beta}\leq\alpha$. Combining the triangle inequality, the condition on $\alpha$, and $\hat{\beta}\leq\alpha$,

$$d(x_{\hat{\beta}},x)\leq d(x_{\hat{\beta}},x_{\alpha})+d(x_{\alpha},x)\leq f(\hat{\beta})+f(\alpha)+f(\alpha)\leq 3f(\alpha).\qquad\qed$$

Returning to the AirTag, Lemmas 1 and 3 provide estimators that locate the item to distances $2\alpha$ and $3\alpha$, respectively.
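For concreteness, the following Python sketch implements the Lemma 3 estimator in an arbitrary metric space; the names `estimates`, `f`, and `dist` are illustrative placeholders for the map $\beta\mapsto x_{\beta}$, the error function $f$, and the metric $d$.

```python
# Minimal sketch of the factor-3 estimator of Lemma 3, using only pairwise distances.
def factor3_estimate(estimates, f, dist):
    """Return x_{beta_hat}, where beta_hat is the smallest beta in B such that
    d(x_beta, x_{beta_hat}) <= f(beta) + f(beta_hat) for all beta >= beta_hat."""
    betas = sorted(estimates)                             # the finite set B, in increasing order
    for i, beta_hat in enumerate(betas):
        if all(dist(estimates[b], estimates[beta_hat]) <= f(b) + f(beta_hat)
               for b in betas[i:]):                       # check all beta >= beta_hat
            return estimates[beta_hat]
    raise ValueError("unreachable")                       # the largest beta always passes its own check
```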

3 Robustness applications

The AirTag puzzle and its solution suggest a simple universal method for removing the upper bound requirement from any robust estimation algorithm.

Recall that many existing algorithms utilize an input parameter $\beta$, and so long as it exceeds the actual fraction of corrupt data $\alpha$, they estimate the unknown distribution or parameter vector $x$ to error $f(\beta)$, while for $\beta<\alpha$ the algorithm fails and its error can be arbitrarily large. If $\alpha$ were known in advance, one could run the algorithm with input parameter $\beta=\alpha$, resulting in an output $x_{\alpha}$ that achieves the best error guarantee $d(x,x_{\alpha})\leq f(\alpha)$. We show that even without knowing $\alpha$, we can still find an estimate $\hat{x}\in{\cal M}$ such that $d(x,\hat{x})\leq 3\cdot f(\alpha)$.

Let $\beta_{\max}$ denote the algorithm’s breakdown point, the largest corruption fraction for which the algorithm gives a meaningful answer. For example, when more than half the data is corrupt, no parameter can be accurately estimated, hence for every algorithm $\beta_{\max}<1/2$. We henceforth assume that the actual corruption level $\alpha\leq\beta_{\max}$.

The AirTag solution involves finding $x_{\beta}$ potentially for every $\beta$ in the set ${\cal B}$ of all possible $\alpha$ values. In corruption applications, every $\alpha\in[0,\beta_{\max}]$ is possible, hence the approach cannot be applied directly. Instead, we select a small geometric sequence ${\cal B}\subseteq[0,\beta_{\max}]$ that contains a tight approximation of any $\alpha\in[0,\beta_{\max}]$. For any choice of $\theta>1$ and $\epsilon>0$, let ${\cal B}$ be the collection of $\beta_{i}:=\beta_{\max}/\theta^{i}$ for $i=0,1,\ldots$, until $\beta_{i}\leq f^{-1}(\epsilon)$. Let $\alpha^{\prime}:=\min\{{\cal B}\ni\beta\geq\alpha\}$ be the closest upper bound of $\alpha$ in ${\cal B}$. Clearly $\alpha^{\prime}\leq\max\{\theta\alpha,f^{-1}(\epsilon)\}$, and since $\alpha^{\prime}\geq\alpha$, for all $\beta\geq\alpha^{\prime}$ we have $d(x,x_{\beta})\leq f(\beta)$.

Running the original algorithm $|{\cal B}|\leq\lceil\log_{\theta}(1/f^{-1}(\epsilon))\rceil$ times, once for each value in ${\cal B}$, Lemmas 1 and 3 achieve error $c\cdot f(\alpha^{\prime})\leq c\cdot\max\{f(\theta\alpha),\epsilon\}$ for $c=2$ and $3$, respectively. For example, selecting $\theta=1.1$, the method achieves error at most $c\cdot\max\{f(1.1\alpha),\epsilon\}$ by running the algorithm ${\cal O}(\log(1/f^{-1}(\epsilon)))$ times. For typical problems $f(\beta)$ is sublinear in $\beta$, hence the method achieves error at most $c\cdot\max\{1.1f(\alpha),\epsilon\}$.
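Putting the pieces together, here is a minimal Python sketch of the resulting meta-method. It assumes a black-box `robust_alg(samples, beta)` implementing any robust estimator that takes an upper bound $\beta$ on the corruption level, its known error bound `f`, and a metric `dist` on its outputs; these names and the default values of `beta_max`, `theta`, and `eps` are illustrative assumptions, not part of the original algorithms.

```python
# Minimal sketch: run a black-box robust algorithm on a geometric grid of candidate
# corruption levels and pick an output with the factor-3 rule of Lemma 3.
def estimate_without_alpha(samples, robust_alg, f, dist,
                           beta_max=0.49, theta=1.1, eps=1e-3):
    # Geometric grid beta_max, beta_max/theta, ..., down to (roughly) f^{-1}(eps).
    grid, beta = [], beta_max
    while f(beta) > eps and beta > 0:
        grid.append(beta)
        beta /= theta
    grid.append(beta)
    grid.sort()

    estimates = {b: robust_alg(samples, b) for b in grid}   # one run per grid point

    # Factor-3 selection: smallest beta_hat consistent with every larger beta.
    for i, beta_hat in enumerate(grid):
        if all(dist(estimates[b], estimates[beta_hat]) <= f(b) + f(beta_hat)
               for b in grid[i:]):
            return estimates[beta_hat]
```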

This approach applies to every robust estimation algorithm. It converts any robust estimation algorithm that requires a tight upper bound on $\alpha$ to achieve its optimal guarantee into one that achieves essentially the same guarantees without knowing $\alpha$. Consider for example the problems mentioned in the introduction. For any positive $\epsilon$, however small, without knowledge of, or an upper bound on, the corruption level $\alpha$, high-dimensional identity-covariance Gaussians can be learned to TV distance ${\cal O}(\alpha\sqrt{\log(1/\alpha)})+\epsilon$, and the same holds for approximating the mean in $\ell_2$ distance. Similar results hold for the other problems considered in [DKK+16, DKK+17]. When the samples arrive in batches of size $n$, the underlying distribution can be estimated to TV distance ${\cal O}(\alpha\sqrt{\log(1/\alpha)/n})+\epsilon$.

For Erdős–Rényi graphs, recall that for any $\beta\in[\alpha,\beta_{\max}]$, one can estimate the connection probability $p$ to accuracy $f(\beta,p):={\cal O}\bigl(\beta\sqrt{\tfrac{p(1-p)\log(1/\beta)}{n}}+\tfrac{\beta}{n}\log n+\tfrac{\sqrt{p(1-p)\log n}}{n}\bigr)$. However, since $p$ is unknown, so is this accuracy, hence we cannot apply our approach directly. To mitigate this problem, we first use the worst possible corruption level $\beta_{\max}$ to obtain a weak approximation $\tilde{p}$ of $p$, and then use $\tilde{p}$ to obtain a tight upper bound on $f(\beta,p)$ for all $\beta$, which therefore upper bounds the actual error for all $\beta\geq\alpha$.

A simple calculation shows that for all $0\leq p,\tilde{p}\leq 1$, $\varepsilon\geq 0$, and $\beta$, if $\tilde{p}(1-\tilde{p})-p(1-p)\in[0,\varepsilon]$ then $f(\beta,\tilde{p})-f(\beta,p)\in\bigl[0,{\cal O}\bigl(\beta\sqrt{\tfrac{\varepsilon\log(1/\beta)}{n}}+\tfrac{\sqrt{\varepsilon\log n}}{n}\bigr)\bigr]$. Therefore, given such a $\tilde{p}$, $f(\beta,\tilde{p})$ upper bounds the algorithm’s error for every input parameter $\beta\geq\alpha$, and using our approach we can estimate $p$ to an accuracy $c\cdot f(\alpha,\tilde{p})=c\cdot f(\alpha,p)+{\cal O}\bigl(\alpha\sqrt{\tfrac{\varepsilon\log(1/\alpha)}{n}}+\tfrac{\sqrt{\varepsilon\log n}}{n}\bigr)+\epsilon$.
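For completeness, here is a hedged sketch of that calculation, treating the ${\cal O}$ bounds as explicit functions of $p(1-p)$: only the terms of $f(\beta,p)$ involving $p(1-p)$ change, and by subadditivity of the square root,
$$\sqrt{\tilde{p}(1-\tilde{p})}-\sqrt{p(1-p)}\leq\sqrt{\tilde{p}(1-\tilde{p})-p(1-p)}\leq\sqrt{\varepsilon},$$
hence
$$f(\beta,\tilde{p})-f(\beta,p)={\cal O}\Bigl(\beta\sqrt{\tfrac{\log(1/\beta)}{n}}+\tfrac{\sqrt{\log n}}{n}\Bigr)\cdot\bigl(\sqrt{\tilde{p}(1-\tilde{p})}-\sqrt{p(1-p)}\bigr)\leq{\cal O}\Bigl(\beta\sqrt{\tfrac{\varepsilon\log(1/\beta)}{n}}+\tfrac{\sqrt{\varepsilon\log n}}{n}\Bigr).$$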

Next, we obtain a weak estimate of $p$ by using the maximal corruption level as the input parameter, namely $\beta=\beta_{\max}$. This way we can estimate $p$ (and hence $p(1-p)$) to an accuracy $f(\beta_{\max},p)$, and using $\beta_{\max}={\cal O}(1)$ we get $f(\beta_{\max},p)={\cal O}\bigl(\sqrt{\tfrac{p(1-p)}{n}}+\tfrac{\log n}{n}\bigr)\leq{\cal O}\bigl(p(1-p)+\tfrac{\log n}{n}\bigr)$. This estimate of $p$ can be used to find $\tilde{p}$ such that $\tilde{p}(1-\tilde{p})-p(1-p)\in[0,\varepsilon]$, where $\varepsilon={\cal O}\bigl(p(1-p)+\tfrac{\log n}{n}\bigr)$. Using this $\tilde{p}$, as shown above, we can estimate $p$ to an accuracy $c\cdot f(\alpha,\tilde{p})=c\cdot f(\alpha,p)+{\cal O}\bigl(\alpha\sqrt{\tfrac{\varepsilon\log(1/\alpha)}{n}}+\tfrac{\sqrt{\varepsilon\log n}}{n}\bigr)+\epsilon=c\cdot f(\alpha,p)+{\cal O}\bigl(\tfrac{\log n}{n^{3/2}}\bigr)+\epsilon$. Note that the extra $\tfrac{\log n}{n^{3/2}}$ term is very small and dominated by $f(\alpha,p)$ except in the extreme regime where $p(1-p)={\cal O}\bigl(\tfrac{\log n}{n}\bigr)$ and $\alpha={\cal O}\bigl(\tfrac{1}{\sqrt{n}}\bigr)$.

4 Extensions and limitations

This note concerns robust estimation of an underlying distribution, or some of its parameters, from samples. It would be interesting to generalize this approach to other robust learning problems such as regression, gradient descent, etc. In fact, as the AirTag puzzle itself suggests, the approach is not limited to robustness problems. It applies to any algorithm whose accuracy depends monotonically on an input parameter, up to an unknown limit beyond which the algorithm is no longer correct. It would be interesting to find other types of problems with this property.

Another natural generalization may be to multi-parameter problems. Yet, even with two parameters we cannot achieve guarantees similar to those in the one-parameter case. Consider a metric space $({\cal M},d)$, a finite set ${\cal B}$, and for $\beta_{1},\beta_{2}\in{\cal B}$ let $f(\beta_{1},\beta_{2})$ be non-negative and non-decreasing in both $\beta_{1}$ and $\beta_{2}$. For each $\beta_{1},\beta_{2}\in{\cal B}$, let $x_{\beta_{1},\beta_{2}}$ be a point in ${\cal M}$. Can we find a point $\hat{x}\in{\cal M}$ such that for all $x\in{\cal M}$ and $\alpha_{1},\alpha_{2}\in{\cal B}$, if $d(x_{\beta_{1},\beta_{2}},x)\leq f(\beta_{1},\beta_{2})$ for all $\beta_{1}\geq\alpha_{1}$ and $\beta_{2}\geq\alpha_{2}$, then $d(\hat{x},x)\leq c\cdot f(\alpha_{1},\alpha_{2})$ for some universal constant $c>0$?

Unfortunately, such a point does not necessarily exist. Consider the reals with the absolute-difference distance and ${\cal B}=\{0,1\}$. Let $x_{1,1}=0$, $x_{0,1}=1$, $x_{1,0}=-1$, and let $x_{0,0}$ be arbitrary. Let $f(1,1)=1$ and $f(\beta_{1},\beta_{2})=0$ for $(\beta_{1},\beta_{2})\neq(1,1)$. For $x=1$, $(\alpha_{1},\alpha_{2})=(0,1)$ satisfies the above condition, and $f(\alpha_{1},\alpha_{2})=0$. For $x=-1$, $(\alpha_{1},\alpha_{2})=(1,0)$ satisfies the above condition, and $f(\alpha_{1},\alpha_{2})=0$. Hence, for $\hat{x}$ to be within a constant multiple of $f(\alpha_{1},\alpha_{2})$ from each corresponding $x$, it must have distance zero from both $1$ and $-1$; therefore no such $\hat{x}$ exists.

In fact, this limitation is not an artifact of our approach, but is inherent to some 2-parameter estimation problems.

Let ${\cal D}$ be an unknown distribution over $\mathbb{R}$ with unknown mean $x$ and variance $\sigma^{2}$. Given a tight upper bound on either $\sigma$ or $\alpha$, simple truncated-mean estimators achieve the optimal error $\Theta(\sigma\sqrt{\alpha})$. When a tight upper bound on $\alpha$ is known, trim an $\alpha$ fraction of samples from each extreme, and when a tight upper bound on $\sigma$ is known, recursively remove the sample farthest from the mean of the remaining samples until the mean square deviation of the remaining samples is ${\cal O}(\sigma^{2})$. We show that if both $\alpha$ and $\sigma$ can be arbitrary, then for any choice of $\epsilon$ and $C$, any algorithm incurs an error much larger than $C\sigma\sqrt{\alpha}+\epsilon$ for some $\alpha$ and $\sigma$.
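Before turning to the lower bound, here are minimal Python sketches of the two truncated-mean estimators just mentioned; the function names, the use of numpy, and the constant in the ${\cal O}(\sigma^{2})$ stopping rule are illustrative assumptions rather than the exact estimators analyzed in the literature.

```python
import numpy as np

def trimmed_mean_known_alpha(samples, alpha):
    """Drop the top and bottom alpha fraction of samples, then average the rest."""
    xs = np.sort(np.asarray(samples, dtype=float))
    k = min(int(np.ceil(alpha * len(xs))), (len(xs) - 1) // 2)  # samples to trim per side
    return xs[k:len(xs) - k].mean()

def trimmed_mean_known_sigma(samples, sigma, c=4.0):
    """Repeatedly drop the sample farthest from the current mean until the mean
    squared deviation of the remaining samples is at most c * sigma^2."""
    xs = list(map(float, samples))
    while len(xs) > 1:
        mean = sum(xs) / len(xs)
        msd = sum((x - mean) ** 2 for x in xs) / len(xs)
        if msd <= c * sigma ** 2:
            break
        xs.remove(max(xs, key=lambda x: abs(x - mean)))  # drop the farthest sample
    return sum(xs) / len(xs)
```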

Consider two distributions: ${\cal D}_{1}$ assigns probability $1$ to $0$, while ${\cal D}_{2}$ assigns probability $9/10$ to $0$ and probability $1/10$ to $21\epsilon$. For $\alpha>1/10$ and ${\cal D}={\cal D}_{1}$, the adversary can make the distribution of the overall samples appear to be ${\cal D}_{2}$. In this case $\sigma=0$, therefore any estimate $\hat{x}$ that achieves error at most $C\sigma\sqrt{\alpha}+\epsilon$ must satisfy $\hat{x}\leq\epsilon$. However, for the case $\alpha=0$ and ${\cal D}={\cal D}_{2}$, where the mean of ${\cal D}$ is $21\epsilon/10$, any $\hat{x}\leq\epsilon$ incurs error $|2.1\epsilon-\hat{x}|\geq 1.1\epsilon>C\sigma\sqrt{\alpha}+\epsilon$.

References

  • [AJK+21] Jayadev Acharya, Ayush Jain, Gautam Kamath, Ananda Theertha Suresh, and Huanyu Zhang. Robust estimation for random graphs. arXiv preprint arXiv:2111.05320, 2021.
  • [BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 30th Annual Conference on Learning Theory, COLT ’17, pages 169–212, 2017.
  • [CKMY20] Sitan Chen, Frederic Koehler, Ankur Moitra, and Morris Yau. Online and distribution-free robustness: Regression and contextual bandits with huber contamination. arXiv preprint arXiv:2010.04157, 2020.
  • [CLM20a] Sitan Chen, Jerry Li, and Ankur Moitra. Efficiently learning structured distributions from untrusted batches. In Proceedings of the 52nd Annual ACM Symposium on the Theory of Computing, STOC ’20, pages 960–973, New York, NY, USA, 2020. ACM.
  • [CLM20b] Sitan Chen, Jerry Li, and Ankur Moitra. Learning structured distributions from untrusted batches: Faster and simpler. In Advances in Neural Information Processing Systems 33, NeurIPS ’20. Curran Associates, Inc., 2020.
  • [CSV17] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM Symposium on the Theory of Computing, STOC ’17, pages 47–60, New York, NY, USA, 2017. ACM.
  • [DK19] Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
  • [DKK+16] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’16, pages 655–664, Washington, DC, USA, 2016. IEEE Computer Society.
  • [DKK+17] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, ICML ’17, pages 999–1008. JMLR, Inc., 2017.
  • [DKK+18] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’18, Philadelphia, PA, USA, 2018. SIAM.
  • [DKK+19a] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
  • [DKK+19b] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In Proceedings of the 36th International Conference on Machine Learning, ICML ’19, pages 1596–1606. JMLR, Inc., 2019.
  • [DKK+19c] Ilias Diakonikolas, Daniel Kane, Sushrut Karmalkar, Eric Price, and Alistair Stewart. Outlier-robust high-dimensional sparse estimation via iterative filtering. In Advances in Neural Information Processing Systems 32, NeurIPS ’19, pages 10688–10699. Curran Associates, Inc., 2019.
  • [DKS19] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’19, pages 2745–2754, Philadelphia, PA, USA, 2019. SIAM.
  • [HL18] Samuel B. Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC ’18, pages 1021–1034, New York, NY, USA, 2018. ACM.
  • [HR09] Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. Wiley, 2009.
  • [HRRS11] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, 2011.
  • [Hub64] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
  • [JLST21] Arun Jambulapati, Jerry Li, Tselil Schramm, and Kevin Tian. Robust regression revisited: Acceleration and improved estimation rates. Advances in Neural Information Processing Systems, 34, 2021.
  • [JO20a] Ayush Jain and Alon Orlitsky. A general method for robust learning from batches. Advances in Neural Information Processing Systems, 33, 2020.
  • [JO20b] Ayush Jain and Alon Orlitsky. Optimal robust learning of discrete distributions from batches. In Proceedings of the 37th International Conference on Machine Learning, ICML ’20, pages 4651–4660. JMLR, Inc., 2020.
  • [JO21] Ayush Jain and Alon Orlitsky. Robust density estimation from batches: The best things in life are (nearly) free. In International Conference on Machine Learning, pages 4698–4708. PMLR, 2021.
  • [KKM18] Adam Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. In Proceedings of the 31st Annual Conference on Learning Theory, COLT ’18, pages 1420–1430, 2018.
  • [KSS18] Pravesh Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC ’18, pages 1035–1046, New York, NY, USA, 2018. ACM.
  • [LRV16] Kevin A. Lai, Anup B. Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’16, pages 665–674, Washington, DC, USA, 2016. IEEE Computer Society.
  • [LSLC20] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, AISTATS ’20, pages 411–421. JMLR, Inc., 2020.
  • [PJL21] Ankit Pensia, Varun Jog, and Po-Ling Loh. Robust regression with covariate filtering: Heavy tails and adversarial contamination, 2021.
  • [PSBR20] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):601–627, 2020.
  • [QV18] Mingda Qiao and Gregory Valiant. Learning discrete distributions from untrusted batches. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS ’18, pages 47:1–47:20, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [SCV18] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS ’18, pages 45:1–45:21, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [Tuk60] John W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, pages 448–485, 1960.
  • [ZJS19] Banghua Zhu, Jiantao Jiao, and Jacob Steinhardt. Generalized resilience and robust statistics. arXiv preprint arXiv:1909.08755, 2019.