Robust estimation algorithms don’t need to know the corruption level.

Ayush Jain, Alon Orlitsky, Vaishakh Ravindrakumar
University of California, San Diego
{ayjain,alon,varavind}@eng.ucsd.edu
Abstract

Real data are rarely pure. Hence the past half century has seen great interest in robust estimation algorithms that perform well even when part of the data is corrupt. However, the vast majority of them approach optimal accuracy only when given a tight upper bound on the fraction of corrupt data. Such bounds are not available in practice, resulting in weak guarantees and often poor performance. This brief note abstracts the complex and pervasive robustness problem into a simple geometric puzzle. It then applies the puzzle’s solution to derive a universal meta-technique that converts any robust estimation algorithm requiring a tight corruption-level upper bound to achieve its optimal accuracy into one achieving essentially the same accuracy without using any upper bounds.

1 Motivation

Much of statistics concerns learning from samples, typically assumed to be generated by an unknown distribution. As the number of samples increases, many statistical tasks can be learned to arbitrarily small error. However, in practical applications, often not all samples are generated according to the underlying distribution. Some may be inaccurate, corrupt, or even adversarial.

Robust algorithms that learn accurately even when some samples are corrupt have therefore been studied since the early works of Tukey [Tuk60] and Huber [Hub64], and have been the subject of several books [HR09, HRRS11]. The research continues to this day with recent efficient algorithms for robust mean and density estimation [LRV16, DKK+16, DKK+17], sparse mean estimation [BDLS17, DKK+19c], learning from batches [QV18, CLM20a, JO20b, JO20a, CLM20b, JO21], Erdős–Rényi graphs [AJK+21], and many more [BDLS17, CSV17, KKM18, SCV18, DKK+18, HL18, KSS18, DKK+19b, DKK+19a, ZJS19, PSBR20, LSLC20, CKMY20, PJL21, JLST21]; see [DK19] for a recent survey.

Essentially all these works consider the Huber model [Hub64] and its generalizations. A fraction $1-\alpha$ of the samples are genuine, while the remaining $\alpha$ fraction may be inaccurate, corrupt, or adversarial, even chosen after viewing the genuine samples. A significant effort within this research concerns statistical estimation, approximating an underlying distribution or parameter set. This note addresses statistical estimation under any of the corruption models above.

Most robust algorithms, both in general and specifically for estimation, either make the theoretically convenient assumption that $\alpha$ is known, or more practically utilize an input parameter $\beta$ that is assumed to upper bound $\alpha$, and otherwise may return an arbitrarily erroneous result. Since $\alpha$ is never known, the first approach cannot work in practice. Assuming an upper bound necessitates a large $\beta$ to ensure correctness. Yet, as the following examples show, the algorithm’s estimation error is an increasing function $f(\beta)$.

[DKK+16, DKK+17] considered robust learning of high-dimensional Gaussians with identity covariance. They showed that as long as the corrupted fraction $\alpha$ is at most the upper bound $\beta$, with sufficiently many samples, the distribution mean can be learned to $\ell_2$ distance ${\cal O}(\beta\sqrt{\log(1/\beta)})$, and the same accuracy applies for learning the distribution itself in TV distance. They also derived similar results for learning general Gaussian distributions and product distributions when provided an upper bound $\beta\geq\alpha$. A sequence of papers [QV18, CLM20a, JO20b, JO20a, CLM20b, JO21] studied discrete and piecewise-polynomial continuous distributions where the samples arrive in batches of size $n$, and a fraction $\alpha\leq\beta$ of the batches can be arbitrary. They showed that the underlying distribution can be learned to TV distance ${\cal O}(\beta\sqrt{\log(1/\beta)/n})$. Similarly, [AJK+21] showed that for Erdős–Rényi graphs with $n$ nodes, where the edges connected to a fraction $\alpha\leq\beta$ of the nodes may be corrupt, the connection probability $p$ can be learned to accuracy ${\cal O}\bigl(\beta\sqrt{\tfrac{p(1-p)\log(1/\beta)}{n}}+\tfrac{\beta}{n}\log n+\tfrac{\sqrt{p(1-p)\log n}}{n}\bigr)$.

While this note addresses robust estimation, the known-upper-bound assumption is prevalent in many other robust learning applications, including robust regression and robust gradient descent, see for example [KKM18, CKMY20, DKS19, PSBR20]. It would be interesting to see whether they could be similarly addressed as well.

The conflicting dependence of accuracy and validity on $\beta$ raises several natural concerns. First, the corrupt fraction $\alpha$ is typically unknown. Upper bounding it by a fixed $\beta$ goes against the very essence of robustness. Even if one is willing to assume an upper bound $\beta\geq\alpha$, what should it be? A large $\beta$ can drastically reduce the algorithm’s accuracy, yet a small $\beta$ risks invalidating its results altogether.

Questions about the validity and choice of β\beta have therefore haunted this approach in both presentations and applications.

This brief note takes a bird’s-eye view of optimal robust estimation. Instead of addressing each individual problem, it reformulates all of them as an elementary geometric puzzle whose exceedingly simple and elegant solution yields a universal, unified, and efficient method achieving optimal error without knowing or bounding $\alpha$.

The next section describes the puzzle and its solution. Section 3 applies the result to remove the upper bound requirement from all robust estimation problems, achieving the same accuracy as if the corruption level were known in advance. It demonstrates the effect on the three robust learning examples provided above. Section 4 considers some of the technique’s limitations and possible extensions.

2 AirTag

Apple’s AirTag approximates the location $x$ of a misplaced item. Its beta-version successor lets the user select an approximation distance $\beta$ that in turn affects the search space $S(\beta)$: if $\beta^{\prime}\leq\beta$ then $S(\beta^{\prime})\subseteq S(\beta)$. AirTag then returns an approximate location $x_{\beta}$ that is within distance $\beta$ from $x$ if $x\in S(\beta)$, and is arbitrary otherwise. The set of possible locations and the distance may form any metric space, $\beta$ can assume any finite set of positive values, and, par for Apple’s course, except for its growth with $\beta$, $S(\beta)$ is completely unknown.

The best approximation distance is clearly the smallest $\beta$ such that $x\in S(\beta)$, which we denote by $\alpha$. Choosing $\beta>\alpha$ worsens $x_{\beta}$’s accuracy, while for $\beta<\alpha$, $x_{\beta}$ is arbitrary. However, $x$ and $S(\beta)$ are unknown, hence so is $\alpha$. Upon obtaining the locations $x_{\beta}$ for all $\beta$, can you approximate $x$ to a distance at most $c\cdot\alpha$ for some small constant $c$?

We will soon describe two simple solutions, but first observe that the puzzle captures both the essence and the functionality of robust algorithms. Given an upper bound $\beta$ on the corruption level, these algorithms approximate an underlying distribution or parameter set. If the actual corruption level is below $\beta$, their output is within the specified distance from the distribution, and otherwise, no guarantees are provided.

One small difference is that for estimation algorithms, the distance guarantee is not $\beta$, but rather some known increasing function $f(\beta)$ specific to each problem. The addition of $f$ can be viewed as a simple reparametrization of $\beta$, hence it does not change the problem. Yet it will be convenient to use it in the applications, hence we phrase the results in this more general form.

Consider a metric space $({\cal M},d)$, a finite set ${\cal B}\subseteq\mathbb{R}$, an association $x_{\beta}\in{\cal M}$ with every $\beta\in{\cal B}$, and a non-decreasing $f:{\cal B}\to[0,\infty)$. An estimator is given $x_{\beta}$ for all $\beta\in{\cal B}$ as input, and outputs a point $\hat{x}\in{\cal M}$. It achieves approximation factor $c$ if for every input and every $x\in{\cal M}$ and $\alpha\in{\cal B}$ such that $d(x,x_{\beta})\leq f(\beta)$ for all $\beta\geq\alpha$, it satisfies $d(\hat{x},x)\leq c\cdot f(\alpha)$.

The following estimator achieves approximation factor 2. Let $B(x,r):=\{y\in{\cal M}:d(x,y)\leq r\}$ be the ball of radius $r\geq 0$ around $x\in{\cal M}$. Define $\beta^{\prime}$ to be the smallest number in ${\cal B}$ such that $\bigcap_{{\cal B}\ni\beta\geq\beta^{\prime}}B(x_{\beta},f(\beta))$ is non-empty, and let $x^{\prime}$ be any point in this intersection.

Lemma 1.

$x^{\prime}$ achieves approximation factor 2.

Proof.

Let $x\in{\cal M}$ and $\alpha\in{\cal B}$ satisfy $d(x,x_{\beta})\leq f(\beta)$ for all $\beta\geq\alpha$. Then for all $\beta\geq\alpha$, $x\in B(x_{\beta},f(\beta))$. Hence, by definition, $\beta^{\prime}\leq\alpha$, and $x^{\prime}\in B(x_{\alpha},f(\alpha))$. By the triangle inequality, $d(x,x^{\prime})\leq d(x,x_{\alpha})+d(x_{\alpha},x^{\prime})\leq 2f(\alpha)$. ∎
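On the real line the balls are intervals, so the intersection above can be computed by comparing endpoints. The following Python sketch implements the factor-2 estimator in this special case; the names `estimates` (a map from each $\beta\in{\cal B}$ to $x_{\beta}$) and `f` are illustrative placeholders, not part of the original formulation.

```python
# Minimal sketch of the factor-2 estimator on the real line (M = R, d = |.|).
def factor2_estimate(estimates, f):
    """Return a point in the intersection of the intervals [x_beta - f(beta), x_beta + f(beta)]
    over all beta >= beta', for the smallest feasible beta'."""
    betas = sorted(estimates)                            # the finite set B, in increasing order
    for i, beta_prime in enumerate(betas):
        tail = betas[i:]                                 # all beta >= beta'
        lo = max(estimates[b] - f(b) for b in tail)      # left end of the intersection
        hi = min(estimates[b] + f(b) for b in tail)      # right end of the intersection
        if lo <= hi:                                     # intersection is non-empty
            return (lo + hi) / 2                         # any point in it works; take the midpoint
    raise ValueError("unreachable")                      # the largest beta alone is always feasible
```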

The next lemma shows that approximation factor 2 is best possible.

Lemma 2.

No estimator achieves approximation factor less than 2.

Proof.

Consider the reals with the absolute-difference distance, ${\cal B}=\{\epsilon,1\}$ for $\epsilon\in(0,0.5]$, $f(\beta)=\beta$ for all $\beta$, and $x_{\epsilon}=1+\epsilon$, $x_{1}=0$. Let estimator $\hat{x}$ achieve approximation factor $c$. For every $x\in\mathbb{R}$ and $\alpha\in{\cal B}$ such that $|x-x_{\beta}|\leq f(\beta)$ for all $\beta\geq\alpha$, it must satisfy $|\hat{x}-x|\leq c\cdot f(\alpha)$. Hence for $x=1$ and $\alpha=\epsilon$, $|\hat{x}-1|\leq c\epsilon$, while for $x=-1$ and $\alpha=1$, $|\hat{x}-(-1)|\leq c$. By the triangle inequality, $c+c\epsilon\geq|\hat{x}-1|+|\hat{x}+1|\geq 2$, hence $c\geq 2/(1+\epsilon)$, which can be made to exceed any number less than 2. ∎

While finding an $x^{\prime}$ at the intersection of several balls may be feasible in some metric spaces, for example when ${\cal M}=\mathbb{R}$, for general $({\cal M},d)$ this may be computationally challenging. The next method achieves a slightly larger approximation factor $c=3$, but requires only pairwise distance computations among the $x_{\beta}$’s.

Define $\hat{\beta}$ to be the smallest number in ${\cal B}$ such that for all ${\cal B}\ni\beta\geq\hat{\beta}$, $d(x_{\beta},x_{\hat{\beta}})\leq f(\beta)+f(\hat{\beta})$.

Lemma 3.

$x_{\hat{\beta}}$ achieves approximation factor 3.

Proof.

Let $x\in{\cal M}$ and $\alpha\in{\cal B}$ satisfy $d(x,x_{\beta})\leq f(\beta)$ for all $\beta\geq\alpha$. By the triangle inequality, for all $\beta\geq\alpha$,

$$d(x_{\beta},x_{\alpha})\leq d(x_{\beta},x)+d(x,x_{\alpha})\leq f(\beta)+f(\alpha),$$

hence, by $\hat{\beta}$’s definition, $\hat{\beta}\leq\alpha$. Combining the triangle inequality, the condition on $\alpha$, and $\hat{\beta}\leq\alpha$,

$$d(x_{\hat{\beta}},x)\leq d(x_{\hat{\beta}},x_{\alpha})+d(x_{\alpha},x)\leq f(\hat{\beta})+f(\alpha)+f(\alpha)\leq 3f(\alpha).\qquad\qed$$

Returning to the AirTag, Lemmas 1 and 3 provide estimators that locate the item to distances $2\alpha$ and $3\alpha$, respectively.
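For concreteness, the following Python sketch implements the Lemma 3 estimator in an arbitrary metric space; the names `estimates`, `f`, and `dist` are illustrative placeholders for the map $\beta\mapsto x_{\beta}$, the error function $f$, and the metric $d$.

```python
# Minimal sketch of the factor-3 estimator of Lemma 3, using only pairwise distances.
def factor3_estimate(estimates, f, dist):
    """Return x_{beta_hat}, where beta_hat is the smallest beta in B such that
    d(x_beta, x_{beta_hat}) <= f(beta) + f(beta_hat) for all beta >= beta_hat."""
    betas = sorted(estimates)                             # the finite set B, in increasing order
    for i, beta_hat in enumerate(betas):
        if all(dist(estimates[b], estimates[beta_hat]) <= f(b) + f(beta_hat)
               for b in betas[i:]):                       # check all beta >= beta_hat
            return estimates[beta_hat]
    raise ValueError("unreachable")                       # the largest beta always passes its own check
```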

3 Robustness applications

The AirTag puzzle and its solution suggest a simple universal method for removing the upper bound requirement from any robust estimation algorithm.

Recall that many existing algorithms utilize an input parameter $\beta$, and so long as it exceeds the actual fraction of corrupt data $\alpha$, they estimate the unknown distribution or parameter vector $x$ to error $f(\beta)$, while for $\beta<\alpha$ the algorithm fails and its error can be arbitrarily large. If $\alpha$ were known in advance, one could run the algorithm with input parameter $\beta=\alpha$, resulting in an output $x_{\alpha}$ that achieves the best error guarantee $d(x,x_{\alpha})\leq f(\alpha)$. We show that even without knowing $\alpha$, we can still find an estimate $\hat{x}\in{\cal M}$ such that $d(x,\hat{x})\leq 3\cdot f(\alpha)$.

Let $\beta_{\max}$ denote the algorithm’s breakdown point, the largest corruption fraction for which the algorithm gives a meaningful answer. For example, when more than half the data is corrupt, no parameter can be accurately estimated, hence for every algorithm $\beta_{\max}<1/2$. We henceforth assume that the actual corruption level $\alpha\leq\beta_{\max}$.

The AirTag solution involves finding $x_{\beta}$ potentially for every $\beta$ in the set ${\cal B}$ of all possible $\alpha$ values. In corruption applications, every $\alpha\in[0,\beta_{\max}]$ is possible, hence the approach cannot be applied directly. Instead, we select a small geometric sequence ${\cal B}\subseteq[0,\beta_{\max}]$ that contains a tight approximation of any $\alpha\in[0,\beta_{\max}]$. For any choice of $\theta>1$ and $\epsilon>0$, let ${\cal B}$ be the collection of $\beta_{i}:=\beta_{\max}/\theta^{i}$ for $i=0,1,\ldots$, until $\beta_{i}\leq f^{-1}(\epsilon)$. Let $\alpha^{\prime}:=\min\{{\cal B}\ni\beta\geq\alpha\}$ be the closest upper bound of $\alpha$ in ${\cal B}$. Clearly $\alpha^{\prime}\leq\max\{\theta\alpha,f^{-1}(\epsilon)\}$, and since $\alpha^{\prime}\geq\alpha$, for all $\beta\geq\alpha^{\prime}$ we have $d(x,x_{\beta})\leq f(\beta)$.

Running the original algorithm $|{\cal B}|\leq\lceil\log_{\theta}(1/f^{-1}(\epsilon))\rceil$ times, once for each value in ${\cal B}$, Lemmas 1 and 3 achieve error $c\cdot f(\alpha^{\prime})\leq c\cdot\max\{f(\theta\alpha),\epsilon\}$ for $c=2$ and $3$, respectively. For example, selecting $\theta=1.1$, the method achieves error at most $c\cdot\max\{f(1.1\alpha),\epsilon\}$ by running the algorithm ${\cal O}(\log(1/f^{-1}(\epsilon)))$ times. For typical problems $f(\beta)$ is sublinear in $\beta$, hence the method achieves error at most $c\cdot\max\{1.1f(\alpha),\epsilon\}$.
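Putting the pieces together, here is a minimal Python sketch of the resulting meta-method. It assumes a black-box `robust_alg(samples, beta)` implementing any robust estimator that takes an upper bound $\beta$ on the corruption level, its known error bound `f`, and a metric `dist` on its outputs; these names and the default values of `beta_max`, `theta`, and `eps` are illustrative assumptions, not part of the original algorithms.

```python
# Minimal sketch: run a black-box robust algorithm on a geometric grid of candidate
# corruption levels and pick an output with the factor-3 rule of Lemma 3.
def estimate_without_alpha(samples, robust_alg, f, dist,
                           beta_max=0.49, theta=1.1, eps=1e-3):
    # Geometric grid beta_max, beta_max/theta, ..., down to (roughly) f^{-1}(eps).
    grid, beta = [], beta_max
    while f(beta) > eps and beta > 0:
        grid.append(beta)
        beta /= theta
    grid.append(beta)
    grid.sort()

    estimates = {b: robust_alg(samples, b) for b in grid}   # one run per grid point

    # Factor-3 selection: smallest beta_hat consistent with every larger beta.
    for i, beta_hat in enumerate(grid):
        if all(dist(estimates[b], estimates[beta_hat]) <= f(b) + f(beta_hat)
               for b in grid[i:]):
            return estimates[beta_hat]
```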

This approach applies to every robust estimation algorithm. It converts any robust estimation algorithm that requires a tight upper bound on $\alpha$ to achieve its optimal guarantee into one that achieves essentially the same guarantees without knowing $\alpha$. Consider for example the problems mentioned in the introduction. For any positive $\epsilon$, however small, without knowledge of, or an upper bound on, the corruption level $\alpha$, high-dimensional identity-covariance Gaussians can be learned to TV distance ${\cal O}(\alpha\sqrt{\log(1/\alpha)})+\epsilon$, and the same holds for approximating the mean in $\ell_2$ distance. Similar results hold for the other problems considered in [DKK+16, DKK+17]. When the samples arrive in batches of size $n$, the underlying distribution can be estimated to TV distance ${\cal O}(\alpha\sqrt{\log(1/\alpha)/n})+\epsilon$.

For Erdős–Rényi graphs, recall that for any $\beta\in[\alpha,\beta_{\max}]$, one can estimate the connection probability $p$ to accuracy $f(\beta,p):={\cal O}\bigl(\beta\sqrt{\tfrac{p(1-p)\log(1/\beta)}{n}}+\tfrac{\beta}{n}\log n+\tfrac{\sqrt{p(1-p)\log n}}{n}\bigr)$. However, since $p$ is unknown, so is this accuracy, hence we cannot apply our approach directly. To mitigate this problem, we first use the worst possible corruption level $\beta_{\max}$ to obtain a weak approximation $\tilde{p}$ of $p$, and then use $\tilde{p}$ to obtain a tight upper bound on $f(\beta,p)$ for all $\beta$, which therefore upper bounds the actual error for all $\beta\geq\alpha$.

A simple calculation shows that for all $0\leq p,\tilde{p}\leq 1$, $\varepsilon\geq 0$, and $\beta$, if $\tilde{p}(1-\tilde{p})-p(1-p)\in[0,\varepsilon]$ then $f(\beta,\tilde{p})-f(\beta,p)\in\bigl[0,{\cal O}\bigl(\beta\sqrt{\tfrac{\varepsilon\log(1/\beta)}{n}}+\tfrac{\sqrt{\varepsilon\log n}}{n}\bigr)\bigr]$. Therefore, given such a $\tilde{p}$, $f(\beta,\tilde{p})$ upper bounds the algorithm’s error for every input parameter $\beta\geq\alpha$, and using our approach we can estimate $p$ to an accuracy $c\cdot f(\alpha,\tilde{p})=c\cdot f(\alpha,p)+{\cal O}\bigl(\alpha\sqrt{\tfrac{\varepsilon\log(1/\alpha)}{n}}+\tfrac{\sqrt{\varepsilon\log n}}{n}\bigr)+\epsilon$.
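For completeness, here is a hedged sketch of that calculation, treating the ${\cal O}$ bounds as explicit functions of $p(1-p)$: only the terms of $f(\beta,p)$ involving $p(1-p)$ change, and by subadditivity of the square root,
$$\sqrt{\tilde{p}(1-\tilde{p})}-\sqrt{p(1-p)}\leq\sqrt{\tilde{p}(1-\tilde{p})-p(1-p)}\leq\sqrt{\varepsilon},$$
hence
$$f(\beta,\tilde{p})-f(\beta,p)={\cal O}\Bigl(\beta\sqrt{\tfrac{\log(1/\beta)}{n}}+\tfrac{\sqrt{\log n}}{n}\Bigr)\cdot\bigl(\sqrt{\tilde{p}(1-\tilde{p})}-\sqrt{p(1-p)}\bigr)\leq{\cal O}\Bigl(\beta\sqrt{\tfrac{\varepsilon\log(1/\beta)}{n}}+\tfrac{\sqrt{\varepsilon\log n}}{n}\Bigr).$$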

Next, we obtain a weak estimate of $p$ by using the maximal corruption level as the input parameter, namely $\beta=\beta_{\max}$. This way we can estimate $p$ (and hence $p(1-p)$) to an accuracy $f(\beta_{\max},p)$, and using $\beta_{\max}={\cal O}(1)$ we get $f(\beta_{\max},p)={\cal O}\bigl(\sqrt{\tfrac{p(1-p)}{n}}+\tfrac{\log n}{n}\bigr)\leq{\cal O}\bigl(p(1-p)+\tfrac{\log n}{n}\bigr)$. This estimate of $p$ can be used to find $\tilde{p}$ such that $\tilde{p}(1-\tilde{p})-p(1-p)\in[0,\varepsilon]$, where $\varepsilon={\cal O}\bigl(p(1-p)+\tfrac{\log n}{n}\bigr)$. Using this $\tilde{p}$, as shown above, we can estimate $p$ to an accuracy $c\cdot f(\alpha,\tilde{p})=c\cdot f(\alpha,p)+{\cal O}\bigl(\alpha\sqrt{\tfrac{\varepsilon\log(1/\alpha)}{n}}+\tfrac{\sqrt{\varepsilon\log n}}{n}\bigr)+\epsilon=c\cdot f(\alpha,p)+{\cal O}\bigl(\tfrac{\log n}{n^{3/2}}\bigr)+\epsilon$. Note that the extra $\tfrac{\log n}{n^{3/2}}$ term is very small and dominated by $f(\alpha,p)$ except in the extreme regime where $p(1-p)={\cal O}\bigl(\tfrac{\log n}{n}\bigr)$ and $\alpha={\cal O}\bigl(\tfrac{1}{\sqrt{n}}\bigr)$.

4 Extensions and limitations

This note concerns robust estimation of an underlying distribution, or some of its parameters, from samples. It would be interesting to generalize this approach to other robust learning problems such as regression, gradient descent, etc. In fact, as the AirTag puzzle itself suggests, the approach is not limited to robustness problems. It applies to any algorithm whose accuracy depends monotonically on an input parameter, up to an unknown limit beyond which the algorithm is no longer correct. It would be interesting to find other types of problems with this property.

Another natural generalization may be to multi-parameter problems. Yet, even with two parameters we cannot achieve guarantees similar to those in the one-parameter case. Consider a metric space $({\cal M},d)$, a finite set ${\cal B}$, and for $\beta_{1},\beta_{2}\in{\cal B}$ let $f(\beta_{1},\beta_{2})$ be non-negative and non-decreasing in both $\beta_{1}$ and $\beta_{2}$. For each $\beta_{1},\beta_{2}\in{\cal B}$, let $x_{\beta_{1},\beta_{2}}$ be a point in ${\cal M}$. Can we find a point $\hat{x}\in{\cal M}$ such that for all $x\in{\cal M}$ and $\alpha_{1},\alpha_{2}\in{\cal B}$, if $d(x_{\beta_{1},\beta_{2}},x)\leq f(\beta_{1},\beta_{2})$ for all $\beta_{1}\geq\alpha_{1}$ and $\beta_{2}\geq\alpha_{2}$, then $d(\hat{x},x)\leq c\cdot f(\alpha_{1},\alpha_{2})$ for some universal constant $c>0$?

Unfortunately, such a point does not necessarily exist. Consider the reals with the absolute-difference distance and ${\cal B}=\{0,1\}$. Let $x_{1,1}=0$, $x_{0,1}=1$, $x_{1,0}=-1$, and let $x_{0,0}$ be arbitrary. Let $f(1,1)=1$ and $f(\beta_{1},\beta_{2})=0$ for $(\beta_{1},\beta_{2})\neq(1,1)$. For $x=1$, $(\alpha_{1},\alpha_{2})=(0,1)$ satisfies the above condition, and $f(\alpha_{1},\alpha_{2})=0$. For $x=-1$, $(\alpha_{1},\alpha_{2})=(1,0)$ satisfies the above condition, and $f(\alpha_{1},\alpha_{2})=0$. Hence, for $\hat{x}$ to be within a constant multiple of $f(\alpha_{1},\alpha_{2})$ from each corresponding $x$, it must have distance zero from both $1$ and $-1$; therefore no such $\hat{x}$ exists.

In fact, this limitation is not an artifact of our approach, but is inherent to some 2-parameter estimation problems.

Let ${\cal D}$ be an unknown distribution over $\mathbb{R}$ with unknown mean $x$ and variance $\sigma^{2}$. Given a tight upper bound on either $\sigma$ or $\alpha$, simple truncated-mean estimators achieve the optimal error $\Theta(\sigma\sqrt{\alpha})$. When a tight upper bound on $\alpha$ is known, trim an $\alpha$ fraction of samples from each extreme, and when a tight upper bound on $\sigma$ is known, recursively remove the sample farthest from the mean of the remaining samples until the mean square deviation of the remaining samples is ${\cal O}(\sigma^{2})$. We show that if both $\alpha$ and $\sigma$ can be arbitrary, then for any choice of $\epsilon$ and $C$, any algorithm incurs an error much larger than $C\sigma\sqrt{\alpha}+\epsilon$ for some $\alpha$ and $\sigma$.
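Before turning to the lower bound, here are minimal Python sketches of the two truncated-mean estimators just mentioned; the function names, the use of numpy, and the constant in the ${\cal O}(\sigma^{2})$ stopping rule are illustrative assumptions rather than the exact estimators analyzed in the literature.

```python
import numpy as np

def trimmed_mean_known_alpha(samples, alpha):
    """Drop the top and bottom alpha fraction of samples, then average the rest."""
    xs = np.sort(np.asarray(samples, dtype=float))
    k = min(int(np.ceil(alpha * len(xs))), (len(xs) - 1) // 2)  # samples to trim per side
    return xs[k:len(xs) - k].mean()

def trimmed_mean_known_sigma(samples, sigma, c=4.0):
    """Repeatedly drop the sample farthest from the current mean until the mean
    squared deviation of the remaining samples is at most c * sigma^2."""
    xs = list(map(float, samples))
    while len(xs) > 1:
        mean = sum(xs) / len(xs)
        msd = sum((x - mean) ** 2 for x in xs) / len(xs)
        if msd <= c * sigma ** 2:
            break
        xs.remove(max(xs, key=lambda x: abs(x - mean)))  # drop the farthest sample
    return sum(xs) / len(xs)
```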

Consider two distributions: ${\cal D}_{1}$ assigns probability $1$ to $0$, while ${\cal D}_{2}$ assigns probability $9/10$ to $0$ and probability $1/10$ to $21\epsilon$. For $\alpha>1/10$ and ${\cal D}={\cal D}_{1}$, the adversary can make the distribution of the overall samples appear to be ${\cal D}_{2}$. In this case $\sigma=0$, therefore any estimate $\hat{x}$ that achieves error at most $C\sigma\sqrt{\alpha}+\epsilon$ must satisfy $\hat{x}\leq\epsilon$. However, for the case $\alpha=0$ and ${\cal D}={\cal D}_{2}$, where the mean of ${\cal D}$ is $21\epsilon/10$, any $\hat{x}\leq\epsilon$ incurs error $|2.1\epsilon-\hat{x}|\geq 1.1\epsilon>C\sigma\sqrt{\alpha}+\epsilon$.

References

  • [AJK+21] Jayadev Acharya, Ayush Jain, Gautam Kamath, Ananda Theertha Suresh, and Huanyu Zhang. Robust estimation for random graphs. arXiv preprint arXiv:2111.05320, 2021.
  • [BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 30th Annual Conference on Learning Theory, COLT ’17, pages 169–212, 2017.
  • [CKMY20] Sitan Chen, Frederic Koehler, Ankur Moitra, and Morris Yau. Online and distribution-free robustness: Regression and contextual bandits with huber contamination. arXiv preprint arXiv:2010.04157, 2020.
  • [CLM20a] Sitan Chen, Jerry Li, and Ankur Moitra. Efficiently learning structured distributions from untrusted batches. In Proceedings of the 52nd Annual ACM Symposium on the Theory of Computing, STOC ’20, pages 960–973, New York, NY, USA, 2020. ACM.
  • [CLM20b] Sitan Chen, Jerry Li, and Ankur Moitra. Learning structured distributions from untrusted batches: Faster and simpler. In Advances in Neural Information Processing Systems 33, NeurIPS ’20. Curran Associates, Inc., 2020.
  • [CSV17] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM Symposium on the Theory of Computing, STOC ’17, pages 47–60, New York, NY, USA, 2017. ACM.
  • [DK19] Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
  • [DKK+16] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’16, pages 655–664, Washington, DC, USA, 2016. IEEE Computer Society.
  • [DKK+17] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, ICML ’17, pages 999–1008. JMLR, Inc., 2017.
  • [DKK+18] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’18, Philadelphia, PA, USA, 2018. SIAM.
  • [DKK+19a] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
  • [DKK+19b] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In Proceedings of the 36th International Conference on Machine Learning, ICML ’19, pages 1596–1606. JMLR, Inc., 2019.
  • [DKK+19c] Ilias Diakonikolas, Daniel Kane, Sushrut Karmalkar, Eric Price, and Alistair Stewart. Outlier-robust high-dimensional sparse estimation via iterative filtering. In Advances in Neural Information Processing Systems 32, NeurIPS ’19, pages 10688–10699. Curran Associates, Inc., 2019.
  • [DKS19] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’19, pages 2745–2754, Philadelphia, PA, USA, 2019. SIAM.
  • [HL18] Samuel B. Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC ’18, pages 1021–1034, New York, NY, USA, 2018. ACM.
  • [HR09] Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. Wiley, 2009.
  • [HRRS11] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, 2011.
  • [Hub64] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
  • [JLST21] Arun Jambulapati, Jerry Li, Tselil Schramm, and Kevin Tian. Robust regression revisited: Acceleration and improved estimation rates. Advances in Neural Information Processing Systems, 34, 2021.
  • [JO20a] Ayush Jain and Alon Orlitsky. A general method for robust learning from batches. Advances in Neural Information Processing Systems, 33, 2020.
  • [JO20b] Ayush Jain and Alon Orlitsky. Optimal robust learning of discrete distributions from batches. In Proceedings of the 37th International Conference on Machine Learning, ICML ’20, pages 4651–4660. JMLR, Inc., 2020.
  • [JO21] Ayush Jain and Alon Orlitsky. Robust density estimation from batches: The best things in life are (nearly) free. In International Conference on Machine Learning, pages 4698–4708. PMLR, 2021.
  • [KKM18] Adam Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. In Proceedings of the 31st Annual Conference on Learning Theory, COLT ’18, pages 1420–1430, 2018.
  • [KSS18] Pravesh Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC ’18, pages 1035–1046, New York, NY, USA, 2018. ACM.
  • [LRV16] Kevin A. Lai, Anup B. Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’16, pages 665–674, Washington, DC, USA, 2016. IEEE Computer Society.
  • [LSLC20] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, AISTATS ’20, pages 411–421. JMLR, Inc., 2020.
  • [PJL21] Ankit Pensia, Varun Jog, and Po-Ling Loh. Robust regression with covariate filtering: Heavy tails and adversarial contamination, 2021.
  • [PSBR20] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):601–627, 2020.
  • [QV18] Mingda Qiao and Gregory Valiant. Learning discrete distributions from untrusted batches. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS ’18, pages 47:1–47:20, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [SCV18] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS ’18, pages 45:1–45:21, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [Tuk60] John W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, pages 448–485, 1960.
  • [ZJS19] Banghua Zhu, Jiantao Jiao, and Jacob Steinhardt. Generalized resilience and robust statistics. arXiv preprint arXiv:1909.08755, 2019.