
Subject classification: Theory of computation → Mathematical optimization

Changyu Gao, University of Wisconsin-Madison, Madison, WI, [email protected]
Andrew Lowy, University of Wisconsin-Madison, Madison, WI, [email protected]
Xingyu Zhou, Wayne State University, Detroit, MI, [email protected]
Stephen J. Wright, University of Wisconsin-Madison, Madison, WI, [email protected]

Copyright: Changyu Gao, Andrew Lowy, Xingyu Zhou, and Stephen J. Wright

Optimal Rates for Robust Stochastic Convex Optimization

Changyu Gao    Andrew Lowy    Xingyu Zhou    Stephen J. Wright
Abstract

Machine learning algorithms in high-dimensional settings are highly susceptible to the influence of even a small fraction of structured outliers, making robust optimization techniques essential. In particular, within the ϵ\epsilon-contamination model, where an adversary can inspect and replace up to an ϵ\epsilon-fraction of the samples, a fundamental open problem is determining the optimal rates for robust stochastic convex optimization (SCO) under such contamination. We develop novel algorithms that achieve minimax-optimal excess risk (up to logarithmic factors) under the ϵ\epsilon-contamination model. Our approach improves over existing algorithms, which are not only suboptimal but also require stringent assumptions, including Lipschitz continuity and smoothness of individual sample functions. By contrast, our optimal algorithms do not require these restrictive assumptions, and can handle nonsmooth but Lipschitz population loss functions. We complement our algorithmic developments with a tight lower bound for robust SCO.

keywords:
Optimization Algorithm, Robust Optimization, Stochastic Convex Optimization

1 Introduction

Machine learning models are increasingly deployed in security-critical applications, yet they remain vulnerable to data manipulation. A particular threat is data poisoning, where adversaries deliberately insert malicious points into training data to degrade model performance (biggio2012poisoning, ). Even in non-adversarial settings, naturally occurring outliers can significantly impact learning algorithms, especially in high-dimensional settings. These challenges motivate our study of optimization algorithms for training machine learning models in the presence of outliers, both natural and adversarial.

Motivation for our work traces to Tukey’s pioneering research on robust estimation (tukey1960survey, ). Recent breakthroughs have produced efficient algorithms for high-dimensional robust estimation under the ϵ\epsilon-contamination model, where an adversary can arbitrarily replace up to an ϵ\epsilon fraction of the samples. Notable advances include polynomial-time algorithms for robust mean estimation in high dimensions (diakonikolas2017being, ; diakonikolas2019robust, ). See diakonikolas2019recent for a comprehensive survey of recent developments in high-dimensional robust estimation.

These developments in robust estimation naturally lead to a fundamental question: Can we solve stochastic optimization problems, under the ϵ\epsilon-contamination model? Stochastic optimization is a fundamental problem in machine learning, where we aim to find the parameter that minimizes the population risk using training samples. We focus specifically on robust stochastic optimization with convex objective functions whose gradients exhibit bounded covariance, a standard assumption in robust mean estimation (diakonikolas2020outlier, ). While our goal aligns with classical stochastic convex optimization in minimizing population risk, the presence of adversarial contamination introduces significant new challenges.

Prior research in robust optimization has primarily concentrated on narrow problem classes. One line of work focuses on robust linear regression (klivans2018efficient, ; diakonikolas2019efficient, ; cherapanamjeri2020optimal, ). While (jambulapati2021robust, ; prasad2020robust, ) consider more general frameworks, their results still center on robust regression. To the best of our knowledge, SEVER (diakonikolas2019sever, ) is the only work that considers general stochastic optimization problems. However, their approach has several limitations. First, it focuses only on achieving dimension-independent error due to corruption, with a suboptimal sample complexity. Second, the results for SEVER depend on several stringent assumptions, including Lipschitzness and smoothness conditions on individual sample functions. These constraints restrict the applicability of SEVER. Consequently, optimal excess risk bounds for robust stochastic convex optimization, and the conditions under which they can be achieved, remain unknown.

Our work addresses these limitations by developing efficient algorithms for robust stochastic convex optimization that achieve optimal excess risk bounds (up to logarithmic factors) under the ϵ\epsilon-contamination model. Notably, Algorithm 1 handles even non-smooth population risks. Moreover, we prove a matching lower bound to show the minimax-optimality of our algorithms.

1.1 Problem Setup and Motivation

Notations. Throughout the paper, we use the following notation for vectors and matrices. For a vector vdv\in\mathbb{R}^{d}, let v\|v\| denote the 2\ell_{2} norm of vv. For a matrix Ad×dA\in\mathbb{R}^{d\times d}, let A\|A\| denote the spectral norm of AA. For symmetric matrices AA and BB, we write ABA\preceq B if BAB-A is positive semidefinite (PSD). We use O~\tilde{O} and Ω~\tilde{\Omega} to hide logarithmic factors in our bounds.

Let 𝒲d\mathcal{W}\subset\mathbb{R}^{d} be a closed convex set. Consider a distribution pp^{\ast} over functions f:𝒲f:\mathcal{W}\to\mathbb{R}. Stochastic optimization aims to find a parameter vector w𝒲w^{\ast}\in\mathcal{W} minimizing the population risk f¯(w):=𝔼fp[f(w)]\overline{f}(w):=\mathbb{E}_{f\sim p^{\ast}}[f(w)]. For example, function ff can take the form of a loss function fx(w)f_{x}(w) dependent on the data point xx, and the data distribution on xx induces the function distribution pp^{\ast}. In robust stochastic optimization, some data samples may be corrupted. We adopt the strong ϵ\epsilon-contamination model, following (diakonikolas2019sever, ).

Definition 1.1 (ϵ\epsilon-contamination model).

Given ϵ>0\epsilon>0 and a distribution pp^{\ast} over functions f:𝒲f:\mathcal{W}\to\mathbb{R}, data is generated as follows: first, nn clean samples f1,,fnf_{1},\ldots,f_{n} are drawn from pp^{\ast}. An adversary is then permitted to examine the samples and replace up to ϵn\epsilon n of them with arbitrary samples. The algorithm is subsequently provided with this modified set of functions, which we refer to as ϵ\epsilon-corrupted samples (with respect to pp^{\ast}).

The ϵ\epsilon-contamination model allows the adversary to replace up to an ϵ\epsilon fraction of the samples. This model is strictly stronger than the Huber contamination model (huber1964robust, ), in which the samples are drawn from a mixture of the clean and adversarial distributions of the form (1ϵ)p+ϵq(1-\epsilon)p+\epsilon q, where pp is the clean distribution and qq is the adversarial distribution.
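To make the model concrete, here is a small Python simulation of ϵ\epsilon-contamination; the Gaussian data and the mean-shifting adversary are illustrative placeholders, not part of our formal setup.

import numpy as np

def eps_contaminate(clean_samples, eps, adversary, rng):
    # Strong eps-contamination: the adversary sees all clean samples and may
    # replace up to an eps fraction of them with arbitrary points.
    n = len(clean_samples)
    corrupted = clean_samples.copy()
    idx = rng.choice(n, size=int(eps * n), replace=False)
    corrupted[idx] = adversary(clean_samples, idx)
    return corrupted

rng = np.random.default_rng(0)
clean = rng.normal(loc=1.0, scale=1.0, size=(1000, 5))        # n = 1000 clean samples in R^5
shift_adversary = lambda samples, idx: samples[idx] + 50.0    # crude mean-shifting attack
corrupted = eps_contaminate(clean, eps=0.05, adversary=shift_adversary, rng=rng)
print(clean.mean(axis=0))        # close to the true mean (1, ..., 1)
print(corrupted.mean(axis=0))    # shifted by roughly eps * 50 = 2.5 per coordinate

Even this naive attack biases the empirical mean by an amount proportional to ϵ\epsilon times the magnitude of the injected points, which is why robust (rather than empirical) gradient estimates are needed.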

Our objective is to develop an efficient algorithm that minimizes the population risk f¯(w)\overline{f}(w), even when the data is ϵ\epsilon-corrupted. The following is assumed throughout the paper.

Assumption 1.2.

𝒲d\mathcal{W}\subset\mathbb{R}^{d} is a compact convex set with diameter bounded by DD, that is, supw,w𝒲wwD\sup_{w,w^{\prime}\in\mathcal{W}}\|w-w^{\prime}\|\leq D. The population risk f¯(w)\overline{f}(w) is a convex function.

We also assume that the gradients of the functions have bounded covariance as in diakonikolas2019sever , which is a typical assumption used in robust mean estimation.

Assumption 1.3.

For all w𝒲w\in\mathcal{W}, the covariance matrix of the gradients, defined by Σw:=𝔼fp[(f(w)f¯(w))(f(w)f¯(w))T]\Sigma_{w}:=\mathbb{E}_{f\sim p^{*}}[(\nabla f(w)-\nabla\overline{f}(w))(\nabla f(w)-\nabla\overline{f}(w))^{T}] satisfies Σwσ2I\Sigma_{w}\preceq\sigma^{2}I for some σ>0\sigma>0.

We will additionally assume that the population risk f¯(w)\overline{f}(w) satisfies certain properties, or that certain properties are satisfied almost surely for functions ff from distribution pp^{*}, as needed.

To the best of our knowledge, SEVER (diakonikolas2019sever, ) is the only work that studies robust stochastic optimization for general convex losses. While SEVER focuses on finding approximate critical points, our work focuses on minimizing the population risk f¯(w)\overline{f}(w), and we measure the performance of our algorithm in terms of the excess risk f¯(w^)minwf¯(w)\overline{f}(\hat{w})-\min_{w}\overline{f}(w), where w^\hat{w} is the output of the algorithm.

We remark that SEVER also derives excess risk bounds. To contrast with SEVER, we decompose the excess risk of a stochastic optimization algorithm as follows111 We omit the term due to optimization error that depends on the number of iterations of the algorithm, since it will be dominated by the other terms when we run the optimization algorithm for a sufficient number of iterations. :

Excess risk=Error due to corruption+Statistical error,\text{Excess risk}=\text{Error due to corruption}+\text{Statistical error},

where “error due to corruption” refers to the error due to the presence of corruption in the data, whereas “statistical error” denotes the error that accrues even when no corruption is present. SEVER (diakonikolas2019sever, ) focuses only on the error due to corruption. The statistical error term is implicit in their requirement on the sample complexity nn, that is,

Excess risk=Error due to corruption,if n[sample complexity].\text{Excess risk}=\text{Error due to corruption},\;\text{if }n\geq\,\text{[sample complexity]}.

Specifically, they design a polynomial-time algorithm that achieves the error due to corruption term O(Dσϵ)O(D\sigma\sqrt{\epsilon}) for n=Ω~(dL2ϵσ2+dL4σ4)n=\tilde{\Omega}\left(\frac{dL^{2}}{\epsilon\sigma^{2}}+\frac{dL^{4}}{\sigma^{4}}\right), provided that ff¯f-\overline{f} is LL-Lipschitz and β\beta-smooth almost surely for fpf\sim p^{*}. (We correct their erroneous sample complexity result; see Appendix A for details.) This sample complexity can be huge, or even infinite, since some functions in the distribution may have a very large, possibly unbounded, Lipschitz constant. Moreover, SEVER implicitly requires that ff is smooth almost surely.

Consider functions of the form fx(w)=12xw2f_{x}(w)=-\tfrac{1}{2}x\cdot\|w\|^{2} for ww such that wD\|w\|\leq D, where xPx\sim P for a probability distribution PP with bounded mean and variance but unbounded support, e.g., a normal distribution. We have fx(w)=xw\nabla f_{x}(w)=-x\cdot w. Since xx is unbounded, the worst-case Lipschitz and smoothness parameters of fxf_{x} are both infinite. However, the population risk f¯(w)=12w2𝐄[x]\overline{f}(w)=-\tfrac{1}{2}\|w\|^{2}\cdot\mathbf{E}[x] is smooth and Lipschitz. This motivating example demonstrates that the assumptions in SEVER, which impose properties uniformly on individual functions fpf\sim p^{*}, can be too stringent. In this paper, we aim to answer the following question:

Can we design computationally efficient algorithms that achieve the optimal excess risk for robust SCO, under much milder conditions?

We give positive answers to this question and summarize our contributions below.

1.2 Our Contributions

1. Optimal Rates for Robust SCO (Section 3):   We develop algorithms that achieve the following minimax-optimal (up to logarithmic factors) excess risk:

f¯(w^T)minw𝒲f¯(w)=O~(D(σϵ+σdlog(1/τ)n)).\overline{f}(\hat{w}_{T})-\min_{w\in\mathcal{W}}\overline{f}(w)=\tilde{O}\left(D\left(\sigma\sqrt{\epsilon}+\sigma\sqrt{\frac{d\log(1/\tau)}{n}}\right)\right).

Compared with SEVER, we achieve the same error due to corruption O(Dσϵ)O(D\sigma\sqrt{\epsilon}) provided n=Ω~(d/ϵ)n=\tilde{\Omega}(d/\epsilon), a significant improvement in sample complexity.222 We remark that in excess risk bounds, O~\tilde{O} always hides logarithmic factors only in the statistical error term, and the robust term is always O(Dσϵ)O(D\sigma\sqrt{\epsilon}).
2. Much Weaker Assumptions for Robust SCO:   Algorithm 1 achieves the optimal rates while only assuming the smoothness of the population risk, which is significantly weaker than the assumptions used in SEVER. By contrast, SEVER requires ff¯f-\overline{f} to have bounded worst-case Lipschitz and smoothness parameters, and that individual functions ff are smooth almost surely. Algorithm 1 can also handle nonsmooth but Lipschitz population risks using Moreau smoothing. Additionally, the algorithm can be adapted to the case when the covariance parameter σ\sigma is unknown.
3. A Matching Lower Bound for Robust SCO:   We prove a matching lower bound (see Appendix B), demonstrating that our excess risk bound is minimax-optimal (up to logarithmic factors). Consequently, our sample complexity for achieving the error due to corruption O(Dσϵ)O(D\sigma\sqrt{\epsilon}) is also minimax-optimal.
4. A Straightforward Algorithm for Robust SCO (Section 4):   We provide a simple projected gradient descent algorithm (Algorithm 2) that achieves the same optimal excess risk, under slightly weaker assumptions compared to SEVER. Our approach builds on the “many-good-sets” assumption, which SEVER briefly introduced without providing a concrete analysis.

Our results might be surprising, as cover-based approaches (e.g., uniform convergence) typically suffer from suboptimal error. Our results, however, imply that the cover-based approach can indeed achieve the optimal excess risk for the corruption case. See Section G.2 for a discussion. A high-level summary of our results is outlined in the table below.

Algorithm | Assumptions | Excess Risk | Sample Complexity
SEVER (diakonikolas2019sever, ) | 1. ff¯f-\overline{f} has bounded worst-case Lipschitz and smoothness parameters; 2. ff is smooth a.s. | suboptimal | Ω~(dL2ϵσ2+dL4σ4)\tilde{\Omega}\left(\frac{dL^{2}}{\epsilon\sigma^{2}}+\frac{dL^{4}}{\sigma^{4}}\right)
Algorithm 1 | f¯\overline{f} is smooth or Lipschitz | optimal | Ω~(d/ϵ)\tilde{\Omega}\left(d/\epsilon\right)
Algorithm 2 | 1. ff¯f-\overline{f} has bounded worst-case Lipschitz and smoothness parameters; 2. f¯\overline{f} is smooth or Lipschitz | optimal | Ω~(d/ϵ)\tilde{\Omega}\left(d/\epsilon\right)
Table 1: Comparison of assumptions, rates, and sample complexity of SEVER and our two algorithms. All algorithms assume 1.2, and bounded covariance of the gradients, that is, the covariance matrix Σw\Sigma_{w} satisfies Σwσ2I\Sigma_{w}\preceq\sigma^{2}I for all ww. Optimality is up to logarithmic factors.

2 Revisiting SEVER

In this section, we revisit SEVER (diakonikolas2019sever, ) to motivate our work. Below we fix the corruption parameter ϵ\epsilon and the covariance boundedness parameter σ>0\sigma>0. Given ϵ\epsilon-corrupted function samples f1,,fnf_{1},\ldots,f_{n}, we say a subset of functions is “good” with respect to ww if their sample mean and covariance at ww are close to those of the true distribution, as defined below.

Definition 2.1 (“Good” set).

We say a set Sgood[n]S_{\text{good}}\subseteq[n] with |Sgood|(1ϵ)n|S_{\text{good}}|\geq(1-\epsilon)n is “good” with respect to (w.r.t) ww if the functions {fi}iSgood\{f_{i}\}_{i\in S_{\text{good}}} satisfy the following,

1|Sgood|iSgood(fi(w)f¯(w))(fi(w)f¯(w))T\displaystyle\left\|\frac{1}{|S_{\text{good}}|}\sum_{i\in S_{\text{good}}}\big{(}\nabla f_{i}(w)-\nabla\bar{f}(w)\big{)}\big{(}\nabla f_{i}(w)-\nabla\bar{f}(w)\big{)}^{T}\right\| σ2,\displaystyle\leq\sigma^{2}, (1)
1|Sgood|iSgood(fi(w)f¯(w))\displaystyle\left\|\frac{1}{|S_{\text{good}}|}\sum_{i\in S_{\text{good}}}(\nabla f_{i}(w)-\nabla\bar{f}(w))\right\| σϵ.\displaystyle\leq\sigma\sqrt{\epsilon}.

A “good” set w.r.t. ww allows us to robustly estimate the gradient at ww. SEVER requires the existence of a set that is uniformly good for all ww, which we refer to as the “uniform-good-set” assumption.

Assumption 2.2 (“Uniform good set”, Assumption B.1 diakonikolas2019sever ).

There exists a set Sgood[n]S_{\text{good}}\subseteq[n] with |Sgood|(1ϵ)n|S_{\text{good}}|\geq(1-\epsilon)n such that SgoodS_{\text{good}} is “good” with respect to ww, for all w𝒲w\in\mathcal{W}.

SEVER operates through an iterative filtering framework built around a black-box learner. Its core algorithm consists of three main steps: (1) The black-box learner processes the current set of functions to find approximate critical points. (2) A filtering mechanism identifies and removes outlier functions. (3) The algorithm updates its working set with the remaining functions. This process repeats until convergence. Crucially, SEVER’s theoretical guarantees rely on its “uniform-good-set” assumption. Without this assumption (as opposed to the “many-good-sets” assumption introduced later), the set of “good” functions can change at each iteration, potentially preventing the iterative filtering process from converging.
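For intuition, the following Python-style outline mirrors the three steps just described; learner and filter_outliers are placeholders for SEVER’s black-box learner and filtering mechanism, not its actual implementation.

def sever_outline(functions, learner, filter_outliers):
    # Schematic outline of SEVER's iterative filtering loop (illustration only).
    working_set = list(functions)
    while True:
        # (1) The black-box learner finds an approximate critical point
        #     of the average of the current working set.
        w = learner(working_set)
        # (2) The filtering mechanism scores the functions (e.g., by their
        #     gradients at w) and removes apparent outliers.
        filtered_set = filter_outliers(working_set, w)
        # (3) If nothing was removed, stop; otherwise repeat with the reduced set.
        if len(filtered_set) == len(working_set):
            return w
        working_set = filtered_set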

We argue that the “uniform-good-set” assumption can be too strong. Recall that SEVER requires a sample complexity of n=Ω~(dL2ϵσ2+dL4σ4)n=\tilde{\Omega}\left(\frac{dL^{2}}{\epsilon\sigma^{2}}+\frac{dL^{4}}{\sigma^{4}}\right). When n=Ω~(d/ϵ)n=\tilde{\Omega}(d/\epsilon), the “uniform-good-set” assumption can no longer be guaranteed to hold. In contrast, the “many-good-sets” assumption, which we will introduce later, is weaker and aligns with the general framework of robustly estimating gradients in each iteration.

Besides the “uniform-good-set” assumption, SEVER additionally assumes the existence of a black box approximate learner.

Definition 2.3 (γ\gamma-approximate learner).

A learning algorithm \mathcal{L} is called γ\gamma-approximate if, for any functions f1,,fm:𝒲f_{1},\ldots,f_{m}:\mathcal{W}\to\mathbb{R} each bounded below on a closed domain \mathcal{H}, the output ww of \mathcal{L} is a γ\gamma-approximate critical point of f^(x):=1mi=1mfi(x)\hat{f}(x):=\frac{1}{m}\sum_{i=1}^{m}f_{i}(x), that is, there exists δ>0\delta>0 such that for all unit vectors vv where w+δv𝒲w+\delta v\in\mathcal{W}, we have that vf^(w)γv\cdot\nabla\hat{f}(w)\geq-\gamma.

Remark 2.4.

We remark that the existence of a γ\gamma-approximate learner implies that the learner can find a γ\gamma-approximate critical point of any function ff by choosing f1==fm=ff_{1}=\ldots=f_{m}=f. To the best of our knowledge, any polynomial-time algorithm that finds approximate critical points requires smoothness of the objective. Therefore, SEVER does not apply to problems where some functions in the distribution are nonsmooth. For example, consider a distribution pp^{*} consisting of two functions with equal probability, h+gh+g and hgh-g, where hh is smooth but gg is nonsmooth. The population risk is smooth, but the individual functions are not.

In the appendix of diakonikolas2019sever , the authors consider the “many-good-sets” assumption, an alternative weaker assumption that allows the good set to be dependent on the point ww.

Assumption 2.5 (“Many good sets”, Assumption D.1 in diakonikolas2019sever ).

For each ww, there exists a set Sgood=Sgood(w)[n]S_{\text{good}}=S_{\text{good}}(w)\subseteq[n] with |Sgood|(1ϵ)n|S_{\text{good}}|\geq(1-\epsilon)n such that SgoodS_{\text{good}} is “good” with respect to ww.

We remark that the “many-good-sets” assumption allows us to perform robust gradient estimation in each iteration. The SEVER paper mentions, without going into detail, that under the “many-good-sets” assumption, projected gradient descent can be used to find an O(σϵ)O(\sigma\sqrt{\epsilon})-approximate critical point. It is unclear under what conditions the “many-good-sets” assumption can be satisfied, and hence no excess risk bound or sample complexity is provided.

In this paper, we utilize a further relaxed assumption stated below, which only requires the existence of good sets at points in a dense covering of the domain.

Assumption 2.6 (“Dense good sets”).

There exists a ξ\xi-covering 𝒞\mathcal{C} of the domain 𝒲\mathcal{W} (that is, for any w𝒲w\in\mathcal{W}, there exists w𝒞w^{\prime}\in\mathcal{C} with wwξ\|w-w^{\prime}\|\leq\xi, for some small ξ>0\xi>0) such that for each w𝒞w\in\mathcal{C}, there exists a set Sgood=Sgood(w)[n]S_{\text{good}}=S_{\text{good}}(w)\subseteq[n] with |Sgood|(1ϵ)n|S_{\text{good}}|\geq(1-\epsilon)n such that SgoodS_{\text{good}} is “good” with respect to ww.

Throughout our paper, we will refer to 𝒞\mathcal{C} as a cover of 𝒲\mathcal{W} for simplicity333Technically, the cover consists of the family of open balls {B(w,ξ)}w𝒞\{B(w,\xi)\}_{w\in\mathcal{C}}, where B(w,ξ)B(w,\xi) is the open ball centered at ww with radius ξ\xi.. The idea of “dense good sets” is that, if we can estimate the gradient robustly at each point in the cover, then we can estimate the gradient robustly at any point in the domain 𝒲\mathcal{W} by the smoothness of the population risk, provided that the cover is fine enough (parameter ξ\xi will depend on σ\sigma, as we will see in Algorithm 1). This relaxed assumption allows us to circumvent the technical difficulties of dealing with infinitely many ww via a net argument, and thus remove the requirement of uniform Lipschitzness and smoothness of ff¯f-\overline{f} for all ff as used in SEVER. As a consequence, we are able to achieve the same corruption error as SEVER with a significantly reduced sample complexity. The next section presents our algorithm that achieves this result.

3 Optimal Rates for Robust SCO under Weak Distributional Assumptions

We now present a cover-based algorithm that achieves the minimax-optimal excess risk under the weak assumption that the population risk f¯\overline{f} is smooth or Lipschitz.

Assumption 3.1.

Let pp^{*} be a distribution over functions f:𝒲f:\mathcal{W}\rightarrow\mathbb{R} with f¯=𝐄fp[f]\overline{f}=\mathbf{E}_{f\sim p^{*}}[f] so that:

  1. 1.

    For each w𝒲w\in\mathcal{W} and unit vector vv, 𝐄fp[(v(f(w)f¯(w)))2]σ2\mathbf{E}_{f\sim p^{*}}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))^{2}]\leq\sigma^{2}.

  2. 2.

    f¯\overline{f} is β¯\bar{\beta}-smooth or L¯\bar{L}-Lipschitz.

Individual functions ff are allowed to be nonsmooth and non-Lipschitz, as long as their average, the population risk, is smooth or Lipschitz. We remark that the first condition is equivalent to the assumption that the covariance matrix Σw\Sigma_{w} of the gradients f(w)\nabla f(w) satisfies Σwσ2I\Sigma_{w}\preceq\sigma^{2}I. See Appendix G for a proof. Below, due to space limits, we only consider smooth population risks. The case of (non-smooth) Lipschitz population risks can be reduced to the smooth case via Moreau smoothing. See Section 3.2 for details.

We outline our algorithm below; see Algorithm 1. The algorithm is based on projected gradient descent with a robust gradient estimator. Here, we treat the robust gradient estimator RobustEstimator as a black box, which can be any deterministic stability-based algorithm (e.g., iterative filtering; see Appendix I).

The key innovation lies in its gradient estimation strategy: rather than computing gradients at arbitrary points, it leverages a cover of the domain 𝒲\mathcal{W}. In each iteration, the gradient is estimated at the nearest point ww^{\prime} in the cover to the current iterate ww. The smoothness of the population risk ensures this approximation remains accurate. As mentioned at the end of Section 2, this strategy helps us avoid the technical challenges of handling infinitely many ww with a net argument, thereby achieving optimal rates under significantly weaker distributional assumptions compared to SEVER.

Algorithm 1 Cover-based Projected Gradient Descent with Robust Gradient Estimator
1:Input: ϵ\epsilon-corrupted set of functions f1,,fnf_{1},\ldots,f_{n}, stepsize parameters {ηt}t[T]\{\eta_{t}\}_{t\in[T]}, robust gradient estimator RobustEstimator(w)\texttt{RobustEstimator}(w), (σϵ/β¯)(\sigma\sqrt{\epsilon}/\bar{\beta})-cover of 𝒲\mathcal{W} denoted 𝒞\mathcal{C}.
2:Initialize w0𝒲w_{0}\in\mathcal{W} and t=1t=1.
3:for t[T]t\in[T] do
4:     Let wt1:=argminw𝒞wwt1w_{t-1}^{\prime}:=\operatornamewithlimits{arg\,min}_{w\in\mathcal{C}}\|w-w_{t-1}\| denote the closest point to wt1w_{t-1} in the cover.
5:     Robustly estimate gradient at wt1w_{t-1}^{\prime}, denoted g~t:=RobustEstimator(wt1)\tilde{g}_{t}:=\texttt{RobustEstimator}(w_{t-1}^{\prime}).
6:     wtProj𝒲(wt1ηtg~t)w_{t}\leftarrow\text{Proj}_{\mathcal{W}}(w_{t-1}-\eta_{t}\tilde{g}_{t}).
7:end for
8:Output: w^T=1Tt=1Twt\hat{w}_{T}=\frac{1}{T}\sum_{t=1}^{T}w_{t}.
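For reference, here is a minimal Python sketch of Algorithm 1; robust_estimator, nearest_cover_point, and project are assumed black boxes (a stability-based robust mean estimator applied to the sample gradients, the cover lookup of Step 4, and Euclidean projection onto 𝒲\mathcal{W}, respectively), and unlike the pseudocode above, the estimator here also receives the corrupted functions explicitly.

import numpy as np

def cover_based_robust_pgd(functions, robust_estimator, nearest_cover_point,
                           project, w0, stepsizes):
    # Sketch of Algorithm 1: projected gradient descent with gradients
    # estimated robustly at the nearest cover point of each iterate.
    w = np.asarray(w0, dtype=float)
    iterates = []
    for eta in stepsizes:
        w_cover = nearest_cover_point(w)                 # Step 4: snap to the cover
        g_tilde = robust_estimator(functions, w_cover)   # Step 5: robust gradient estimate
        w = project(w - eta * g_tilde)                   # Step 6: projected gradient step
        iterates.append(w)
    return np.mean(iterates, axis=0)                     # Step 8: average the iterates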

Efficient Implementation. For implementation efficiency, we propose a grid-based cover construction. Let ξ=σϵ/β¯\xi=\sigma\sqrt{\epsilon}/\bar{\beta}. We can use grid points spaced ξ/d\xi/\sqrt{d} apart in each dimension, i.e.,

{ξdz=(ξdz1,ξdz2,,ξdzd):z=(z1,z2,,zd)d,ξdz2D}\left\{\frac{\xi}{\sqrt{d}}\cdot z=\left(\frac{\xi}{\sqrt{d}}\cdot z_{1},\frac{\xi}{\sqrt{d}}\cdot z_{2},\ldots,\frac{\xi}{\sqrt{d}}\cdot z_{d}\right):z=(z_{1},z_{2},\ldots,z_{d})\in\mathbb{Z}^{d},\left\|\frac{\xi}{\sqrt{d}}\cdot z\right\|_{2}\leq D\right\}

to construct a ξ\xi-cover.444Technically, we can choose a grid spaced 2ξ/d2\xi/\sqrt{d} apart in each dimension, and add additional points to cover the boundary of the feasible set. This would reduce the size of grid points by almost a factor of 2d2^{d}. Given a point ww, we can find a cover point within ξ\xi distance in O(d)O(d) time through: (1) Scaling: Divide ww by ξ/d\xi/\sqrt{d}. (2) Rounding: Convert to the nearest integral vector in d\mathbb{Z}^{d}. (3) Rescaling: Multiply by ξ/d\xi/\sqrt{d}.
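A minimal Python sketch of this lookup, assuming NumPy; here xi stands for the cover radius ξ=σϵ/β¯\xi=\sigma\sqrt{\epsilon}/\bar{\beta} used by Algorithm 1.

import numpy as np

def nearest_grid_point(w, xi):
    # O(d) lookup: (1) scale by sqrt(d)/xi, (2) round to the nearest integer vector,
    # (3) rescale. Each coordinate moves by at most (xi / sqrt(d)) / 2, so the
    # returned point is within sqrt(d) * (xi / sqrt(d)) / 2 = xi / 2 <= xi of w.
    d = w.shape[0]
    spacing = xi / np.sqrt(d)
    return spacing * np.round(w / spacing)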

This construction yields a cover of size |𝒞|=O(Dd/ξ)d|\mathcal{C}|=O\left(D\sqrt{d}/\xi\right)^{d}, which is larger than the optimal covering number O((D/ξ)d)O((D/\xi)^{d}). While this introduces an extra logd\log d factor in the excess risk bound (due to union bound over cover points), it offers two significant practical advantages: (1) Implicit cover: No need to explicitly construct and store the cover. (2) Efficient computation: O(d)O(d) time for finding the nearest cover point. An exponential-time algorithm that achieves the excess risk without the logd\log d factor is described in Appendix H.

Algorithm 1 has the following guarantees.

Theorem 3.2.

Grant 3.1. There are choices of stepsizes {ηt}t=1T\{\eta_{t}\}_{t=1}^{T} and TT such that, with probability at least 1τ1-\tau, we have

f¯(w^T)minw𝒲f¯(w)=O~(σDϵ+σDdlog(1/τ)n).\overline{f}(\hat{w}_{T})-\min_{w\in\mathcal{W}}\overline{f}(w)=\tilde{O}\left(\sigma D\sqrt{\epsilon}+\sigma D\sqrt{\frac{d\log(1/\tau)}{n}}\,\right).

As a consequence, the algorithm achieves excess risk of O(Dσϵ)O(D\sigma\sqrt{\epsilon}) with high probability whenever n=Ω~(d/ϵ)n=\tilde{\Omega}(d/\epsilon). The expected excess risk is bounded by O~(σDϵ+σDd/n)\tilde{O}\left(\sigma D\sqrt{\epsilon}+\sigma D\sqrt{d/n}\right).

Remark 3.3.

Theorem 3.2 is minimax-optimal (up to logarithmic factors). For comparison, our sample complexity n=Ω~(d/ϵ)n=\tilde{\Omega}(d/\epsilon) significantly improves over the sample complexity of SEVER, n=Ω~(dL2ϵσ2+dL4σ4)n=\tilde{\Omega}\left(\frac{dL^{2}}{\epsilon\sigma^{2}}+\frac{dL^{4}}{\sigma^{4}}\right).

Lower Bound for Robust SCO: In Appendix B, we derive the following matching lower bound, showing the minimax-optimality of Algorithm 1.

Theorem 3.4.

There exist a closed bounded set 𝒲d\mathcal{W}\subset\mathbb{R}^{d} with diameter at most 2D2D, and a distribution pp^{*} over functions f:𝒲f:\mathcal{W}\rightarrow\mathbb{R} that satisfy the following: Let f¯=𝐄fp[f]\overline{f}=\mathbf{E}_{f\sim p^{*}}[f]. We have that for each w𝒲w\in\mathcal{W} and unit vector vv that 𝐄fp[(v(f(w)f¯(w)))2]σ2\mathbf{E}_{f\sim p^{*}}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))^{2}]\leq\sigma^{2}. There exist parameters L,β>0L,\beta>0 such that ff are convex, LL-Lipschitz and β\beta-smooth almost surely. The output w^\hat{w} of any algorithm with access to an ϵ\epsilon-corrupted set of functions f1,,fnf_{1},\dots,f_{n} sampled from pp^{*} satisfies the following,

𝔼f¯(w^)f¯=Ω(Dσϵ+Dσdn).\mathbb{E}\overline{f}(\hat{w})-\overline{f}^{*}=\Omega\left(D\sigma\sqrt{\epsilon}+D\sigma\sqrt{\frac{d}{n}}\right). (2)

3.1 Analysis of Algorithm 1 for Smooth Population Risks

To prove Theorem 3.2 for smooth population risks, we will use the following robust estimation results.

Lemma 3.5 (diakonikolas2020outlier ).

Let SS be an ϵ\epsilon-corrupted set of nn samples from a distribution in d\mathbb{R}^{d} with mean μ\mu and covariance Σ\Sigma such that Σσ2I\Sigma\preceq\sigma^{2}I. Let ϵ=Θ(log(1/τ)/n+ϵ)c\epsilon^{\prime}=\Theta(\log(1/\tau)/n+\epsilon)\leq c be given, for a constant c>0c>0. Then any stability-based algorithm on input SS and ϵ\epsilon^{\prime}, efficiently computes μ^\widehat{\mu} such that with probability at least 1τ1-\tau, we have

μ^μ=O(σδ(τ)),whereδ(τ)=ϵ+d/n+log(1/τ)/n.\|\widehat{\mu}-\mu\|=O(\sigma\cdot\delta(\tau)),\;\text{where}\;\delta(\tau)=\sqrt{\epsilon}+\sqrt{d/n}+\sqrt{\log(1/\tau)/n}. (3)
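The estimator in Lemma 3.5 is used as a black box; purely for illustration, here is a simplified spectral-filtering heuristic in the spirit of stability-based algorithms (the threshold c and the 2% removal rule are arbitrary choices for this sketch, not those of diakonikolas2020outlier).

import numpy as np

def filtered_mean(points, sigma, eps, c=9.0):
    # Simplified spectral filtering for robust mean estimation (illustration only):
    # while the empirical covariance has a large eigenvalue, remove the samples
    # that project most heavily onto the corresponding eigenvector.
    pts = np.asarray(points, dtype=float)
    for _ in range(int(2 * eps * len(pts)) + 1):
        mu = pts.mean(axis=0)
        centered = pts - mu
        cov = centered.T @ centered / len(pts)
        eigvals, eigvecs = np.linalg.eigh(cov)
        if eigvals[-1] <= c * sigma ** 2:            # spectral norm already small: accept
            return mu
        scores = (centered @ eigvecs[:, -1]) ** 2    # outliers score high along the top direction
        pts = pts[scores <= np.quantile(scores, 0.98)]
    return pts.mean(axis=0)

In Algorithm 1, such an estimator would be applied to the sample gradients {fi(w)}i[n]\{\nabla f_{i}(w^{\prime})\}_{i\in[n]} at the cover point ww^{\prime}.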

Recall that in each iteration, given current ww, we estimate the gradient at w=argminz𝒞zww^{\prime}=\operatornamewithlimits{arg\,min}_{z\in\mathcal{C}}\|z-w\|, the nearest point in the cover to ww. The above lemma allows us to bound the robust estimation error.

We defer the full proof to Appendix C and sketch the high level ideas below.

  1. 1.

    We use robust mean estimation results to establish an upper bound on the gradient estimator error at ww^{\prime}, specifically g~(w)f¯(w)\|\tilde{g}(w^{\prime})-\nabla\overline{f}(w^{\prime})\|. Subsequently, we leverage the smoothness property of the population risk to bound the error resulting from estimating the gradient at ww^{\prime} instead of ww, i.e., f¯(w)f¯(w)\|\nabla\overline{f}(w)-\nabla\overline{f}(w^{\prime})\|. By combining these two bounds, we obtain a bound on the bias of our gradient estimator at ww, that is, g~(w)f¯(w)\|\tilde{g}(w^{\prime})-\nabla\overline{f}(w)\|.

  2. 2.

    The robust gradient estimator has a failure probability τ\tau at fixed ww. Since we apply the robust gradient estimator exclusively at grid points, it suffices to employ the union bound across all grid points to account for the total failure probability.

  3. 3.

    We utilize the projected biased gradient descent analysis framework to establish an upper bound on the excess risk.

3.2 Handling Nonsmooth but Lipschitz Population Risks

For an L¯\bar{L}-Lipschitz but nonsmooth population risk f¯\overline{f}, we use Nesterov’s smoothing (nesterov2005smooth, ) and run Algorithm 1 on the smoothed objective. Nesterov’s smoothing has the following properties.

Lemma 3.6 ((nesterov2005smooth, )).

Given a convex and LL-Lipschitz loss function f(w)f(w), the β\beta-Moreau envelope of ff is defined as fβ(w)=minu{f(u)+β2uw22}f_{\beta}(w)=\min_{u}\left\{f(u)+\frac{\beta}{2}\|u-w\|_{2}^{2}\right\}. The following properties hold:

  1. 1.

    fβf_{\beta} is convex, β\beta-smooth and 2L2L-Lipschitz.

  2. 2.

    For any ww, fβ(w)f(w)fβ(w)+L22βf_{\beta}(w)\leq f(w)\leq f_{\beta}(w)+\frac{L^{2}}{2\beta}.

  3. 3.

    We have minf(w)=minfβ(w)\min f(w)=\min f_{\beta}(w).

Therefore, running Algorithm 1 on the smoothed objective fβf_{\beta} (smoothing the function samples f1,f2,,fnf_{1},f_{2},\ldots,f_{n} accordingly), with w^\hat{w} as the output, we have that

f¯(w^)minw𝒲f¯(w)f¯β(w^)minw𝒲f¯β(w)+L¯22β,\overline{f}(\hat{w})-\min_{w\in\mathcal{W}}\overline{f}(w)\leq\overline{f}_{\beta}(\hat{w})-\min_{w\in\mathcal{W}}\overline{f}_{\beta}(w)+\frac{\bar{L}^{2}}{2\beta},

and f¯β(w^)minw𝒲f¯β(w)\overline{f}_{\beta}(\hat{w})-\min_{w\in\mathcal{W}}\overline{f}_{\beta}(w) is bounded by the same bound as in Theorem 3.2. It suffices to choose β=O(L¯2σDmin(1/ϵ,n/(dlog(1/τ))))\beta=O\left(\frac{\bar{L}^{2}}{\sigma D}\min\left(1/\sqrt{\epsilon},\sqrt{n/(d\log(1/\tau))}\right)\right) to achieve the same optimal excess risk bound. Note that while the size of the cover depends on β\beta, the excess risk bound only depends on β\beta through logarithmic terms.
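For concreteness, the choice of β\beta comes from balancing the smoothing error L¯2/(2β)\bar{L}^{2}/(2\beta) against the bound of Theorem 3.2 (constants omitted):

\[
\beta \;\gtrsim\; \frac{\bar{L}^{2}}{\sigma D}\,\min\!\left(\frac{1}{\sqrt{\epsilon}},\;\sqrt{\frac{n}{d\log(1/\tau)}}\right)
\;=\;\frac{\bar{L}^{2}}{\sigma D\,\max\!\left(\sqrt{\epsilon},\,\sqrt{d\log(1/\tau)/n}\right)}
\quad\Longrightarrow\quad
\frac{\bar{L}^{2}}{2\beta}\;\lesssim\;\sigma D\left(\sqrt{\epsilon}+\sqrt{\frac{d\log(1/\tau)}{n}}\right),
\]

so the smoothing term is absorbed into the excess risk bound of Theorem 3.2.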

3.3 Handling Unknown Covariance Parameter σ\sigma

Algorithm 1 can be adapted to handle the case where the covariance parameter σ\sigma is unknown. In particular, we can first use robust estimation at any point to obtain a lower bound on σ\sigma, and then run the algorithm with this lower bound. This gives the same optimal excess risk up to logarithmic factors, with high probability. Details can be found in Appendix J.

4 Projected Gradient Descent with Robust Gradient Estimator

Algorithm 1 uses a cover-based strategy to estimate gradients robustly. A more straightforward approach is to estimate gradients at arbitrary points using a robust gradient estimator. We will show that the simple projected gradient descent algorithm can achieve the same optimal rate as Algorithm 1 under stronger assumptions. Even so, our new assumptions are still slightly weaker than those used in SEVER diakonikolas2019sever . Concretely, we make the following assumptions on the distribution over functions.

Assumption 4.1.

Let pp^{*} be a distribution over functions f:𝒲f:\mathcal{W}\rightarrow\mathbb{R} with f¯=𝐄fp[f]\overline{f}=\mathbf{E}_{f\sim p^{*}}[f] so that:

  1. 1.

    For each w𝒲w\in\mathcal{W} and unit vector vv, 𝐄fp[(v(f(w)f¯(w)))2]σ2\mathbf{E}_{f\sim p^{*}}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))^{2}]\leq\sigma^{2}.

  2. 2.

    ff¯f-\overline{f} is LL-Lipschitz and β\beta-smooth almost surely, where555 Without loss of generality, see Appendix G. LσL\geq\sigma.

  3. 3.

    f¯\overline{f} is β¯\bar{\beta}-smooth or L¯\bar{L}-Lipschitz. Note that we put an overline to reflect properties of f¯\overline{f}.

Remark 4.2.

Compared to the assumptions used in SEVER, our algorithm additionally covers the case where each individual function is Lipschitz and possibly nonsmooth.

Algorithm 2 builds on the “many-good-sets” assumption. We are able to robustly estimate the gradient of the population risk f¯\overline{f} at any point ww with high probability, at the cost of requiring additional almost-sure assumptions on ff¯f-\overline{f} compared to Algorithm 1.

Algorithm 2 Projected Gradient Descent with Robust Gradient Estimator
1:Input: ϵ\epsilon-corrupted set of functions f1,,fnf_{1},\ldots,f_{n}, stepsize parameters {ηt}t[T]\{\eta_{t}\}_{t\in[T]}, robust gradient estimator RobustEstimator(w)\texttt{RobustEstimator}(w).
2:Initialize w0𝒲w_{0}\in\mathcal{W} and t=1t=1.
3:for t[T]t\in[T] do
4:     Apply robust gradient estimator to get g~t=RobustEstimator(wt1)\tilde{g}_{t}=\texttt{RobustEstimator}(w_{t-1}).
5:     wtProj𝒲(wt1ηtg~t)w_{t}\leftarrow\text{Proj}_{\mathcal{W}}(w_{t-1}-\eta_{t}\tilde{g}_{t}).
6:end for
7:Output: w^T=1Tt=1Twt\hat{w}_{T}=\frac{1}{T}\sum_{t=1}^{T}w_{t}.

Algorithm 2 achieves the same optimal excess risk bounds as in Theorem 3.2.

Theorem 4.3.

Grant 4.1. There are choices of stepsizes {ηt}t=1T\{\eta_{t}\}_{t=1}^{T} and TT such that, with probability at least 1τ1-\tau, we have

f¯(w^T)minw𝒲f¯(w)=O~(σDϵ+σDdlog(1/τ)n).\overline{f}(\hat{w}_{T})-\min_{w\in\mathcal{W}}\overline{f}(w)=\tilde{O}\left(\sigma D\sqrt{\epsilon}+\sigma D\sqrt{\frac{d\log(1/\tau)}{n}}\,\right).

As a consequence, the algorithm achieves excess risk of O(Dσϵ)O(D\sigma\sqrt{\epsilon}) with high probability whenever n=Ω~(d/ϵ)n=\tilde{\Omega}(d/\epsilon). The expected excess risk is bounded by O~(σDϵ+σDd/n)\tilde{O}\left(\sigma D\sqrt{\epsilon}+\sigma D\sqrt{d/n}\right).

The proof of our results is similar to the one used in li2024robust . The high level idea is as follows. For simplicity, we say ww is “good” if there exists a good set of functions at ww. We need to show that with high probability, there exists a good set for all ww, so that we can robustly estimate the gradient at all ww. To do this, we use a net argument. We will show that if ww is “good”, then all points in a small neighborhood of ww are also “good”. After choosing a proper cover of 𝒲\mathcal{W}, it suffices to apply the union bound to show that with high probability, all points in the cover are “good”. The full proof can be found in Appendix D.

5 Conclusion and Future Work

In this work, we have advanced robust stochastic convex optimization under the ϵ\epsilon-contamination model. While the previous state of the art, SEVER (diakonikolas2019sever, ), focused on finding approximate critical points under stringent assumptions, we have developed algorithms that directly tackle population risk minimization, obtaining the optimal excess risk under more practical assumptions. Our first algorithm (Algorithm 1) achieves the minimax-optimal excess risk by leveraging our relaxed “dense-good-sets” assumption and estimating gradients only at points in a cover of the domain, removing the stringent requirements used in SEVER. Our second algorithm (Algorithm 2) provides a simple projected gradient descent approach that achieves the same optimal excess risk, concretely addressing the “many-good-sets” assumption briefly noted in their paper. Both of our algorithms significantly reduce the sample complexity compared to the state-of-the-art SEVER algorithm.

For future work, it would be interesting to explore the following directions: (1) For our grid-based implementation of Algorithm 1, there is a logd\log d factor in the excess risk bound. Under the same assumptions, can we design a polynomial-time algorithm that achieves the same optimal rate without the logd\log d factor? (2) Robustness has been shown to be closely related to differential privacy (hopkins2023robustness, ). Can we design optimization algorithms that are both robust and differentially private?

References

  • (1) Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.
  • (2) Yeshwanth Cherapanamjeri, Efe Aras, Nilesh Tripuraneni, Michael I Jordan, Nicolas Flammarion, and Peter L Bartlett. Optimal robust linear regression in nearly linear time. arXiv preprint arXiv:2007.08137, 2020.
  • (3) Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
  • (4) Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. SEVER: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606. PMLR, 2019.
  • (5) Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008. PMLR, 2017.
  • (6) Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
  • (7) Ilias Diakonikolas and Daniel M Kane. Algorithmic high-dimensional robust statistics. Cambridge University Press, 2023.
  • (8) Ilias Diakonikolas, Daniel M Kane, and Ankit Pensia. Outlier robust mean estimation with subgaussian rates via stability. Advances in Neural Information Processing Systems, 33:1830–1840, 2020.
  • (9) Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2745–2754. SIAM, 2019.
  • (10) Vitaly Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. Advances in Neural Information Processing Systems, 29, 2016.
  • (11) Samuel B Hopkins, Gautam Kamath, Mahbod Majid, and Shyam Narayanan. Robustness implies privacy in statistical estimation. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 497–506, 2023.
  • (12) Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pages 492–518. Springer, 1992.
  • (13) Arun Jambulapati, Jerry Li, Tselil Schramm, and Kevin Tian. Robust regression revisited: Acceleration and improved estimation rates. Advances in Neural Information Processing Systems, 34:4475–4488, 2021.
  • (14) Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, pages 1420–1430. PMLR, 2018.
  • (15) Shuyao Li, Yu Cheng, Ilias Diakonikolas, Jelena Diakonikolas, Rong Ge, and Stephen Wright. Robust second-order nonconvex optimization and its application to low rank matrix sensing. Advances in Neural Information Processing Systems, 36, 2024.
  • (16) Andrew Lowy and Meisam Razaviyayn. Private stochastic optimization with large worst-case lipschitz parameter: Optimal rates for (non-smooth) convex losses and extension to non-convex losses. In International Conference on Algorithmic Learning Theory, pages 986–1054. PMLR, 2023.
  • (17) Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103:127–152, 2005.
  • (18) Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(3):601–627, 2020.
  • (19) John Wilder Tukey. A survey of sampling from contaminated distributions. Contributions to probability and statistics, pages 448–485, 1960.

Appendix A Fixing SEVER’s Sample Complexity Result

In this section, we fix SEVER’s sample complexity result in their Proposition B.5. Their proof is incorrect due to an error in the application of Hoeffding’s inequality, resulting in an incorrect sample complexity bound (n=Ω~(dL2/(ϵσ2))n=\tilde{\Omega}(dL^{2}/(\epsilon\sigma^{2}))).

We will provide a correct, more rigorous proof for their result. The correct bound is worse than the one claimed in their paper, as we show below.

Lemma A.1 (Fixed from Proposition B.5 in diakonikolas2019sever ).

Let pp^{*} be a distribution over functions f:𝒲f:\mathcal{W}\rightarrow\mathbb{R} with f¯=𝐄fp[f]\overline{f}=\mathbf{E}_{f\sim p^{*}}[f] so that ff¯f-\overline{f} is LL-Lipschitz and β\beta-smooth almost surely. Assume further that for each w𝒲w\in\mathcal{W} and unit vector vv, 𝐄fp[(v(f(w)f¯(w)))2]σ2\mathbf{E}_{f\sim p^{*}}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))^{2}]\leq\sigma^{2}. If n=Ω~(dL2ϵσ2+dL4σ4),n=\tilde{\Omega}\left(\frac{dL^{2}}{\epsilon\sigma^{2}}+\frac{dL^{4}}{\sigma^{4}}\right), then with high probability, any ϵ\epsilon-corrupted set of functions f1,,fnf_{1},\dots,f_{n} (with respect to pp^{\ast}) satisfies 2.2.

Proof A.2.

For any set of functions A={fi}iAA=\{f_{i}\}_{i\in A} and functional gg, we denote 𝔼iA[g(fi)]\mathbb{E}_{i\in A}[g(f_{i})] as the empirical average of g(fi)g(f_{i}) over AA, i.e., 1|A|iAg(fi)\frac{1}{|A|}\sum_{i\in A}g(f_{i}). Let TT denote the original uncontaminated samples, and let SS denote the ϵ\epsilon-contaminated samples of TT. Let SgoodSS_{\text{good}}\subset S be the set of uncorrupted functions fif_{i}. It is then the case that SgoodTS_{\text{good}}\subset T and that |Sgood|(1ϵ)n|S_{\text{good}}|\geq(1-\epsilon)n.

Let |𝒞||\mathcal{C}| be the size of the cover 𝒞\mathcal{C}, where 𝒞\mathcal{C} is a cover of 𝒲\mathcal{W} and will be determined later. We have log|𝒞|=O~(d)\log|\mathcal{C}|=\tilde{O}(d). By the bounded covariance assumption,

𝔼p[(v(f(w)f¯(w)))2]σ2.\mathbb{E}_{p^{\ast}}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))^{2}]\leq\sigma^{2}.

Observe that the term inside the expectation is bounded by L2L^{2}. By Hoeffding’s inequality, with probability at least 1τ/(2|𝒞|)1-\tau/(2|\mathcal{C}|), we have

𝔼iT[(v(fi(w)f¯(w)))2]σ2+L21nlog(|𝒞|/τ)=O(σ2).\mathbb{E}_{i\in T}[(v\cdot(\nabla f_{i}(w)-\nabla\overline{f}(w)))^{2}]\leq\sigma^{2}+L^{2}\sqrt{\frac{1}{n}\log(|\mathcal{C}|/\tau)}=O(\sigma^{2}). (4)

The last equality uses n=Ω~(dL4/σ4)n=\tilde{\Omega}(dL^{4}/\sigma^{4}) and log|𝒞|=O~(d)\log|\mathcal{C}|=\tilde{O}(d).

Now, since SgoodTS_{\text{good}}\subset T and they differ by at most ϵn\epsilon n samples (so the average of these nonnegative terms over SgoodS_{\text{good}} is at most a factor 1/(1ϵ)1/(1-\epsilon) larger than over TT), we have

𝔼iSgood[(v(fi(w)f¯(w)))2]O(σ2).\mathbb{E}_{i\in S_{\text{good}}}[(v\cdot(\nabla f_{i}(w)-\nabla\overline{f}(w)))^{2}]\leq O(\sigma^{2}). (5)

For the other part, we start by observing that,

𝔼p[(v(f(w)f¯(w)))]=0.\mathbb{E}_{p^{\ast}}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))]=0.

By Chernoff (Hoeffding) bound, with probability at least 1τ/(2|𝒞|)1-\tau/(2|\mathcal{C}|), we have that

𝔼iT[(v(fi(w)f¯(w)))]O(L1nlog(|𝒞|/τ))=O(σϵ),\mathbb{E}_{i\in T}[(v\cdot(\nabla f_{i}(w)-\nabla\overline{f}(w)))]\leq O\left(L\sqrt{\frac{1}{n}\log(|\mathcal{C}|/\tau)}\right)=O(\sigma\sqrt{\epsilon}), (6)

where the last equality uses n=Ω~(dL2/(ϵσ2))n=\tilde{\Omega}(dL^{2}/(\epsilon\sigma^{2})).

For any subset S1TS_{1}\subset T with |S1|ϵn|S_{1}|\leq\epsilon n, by the Cauchy-Schwarz inequality, we have

1niS1v(fi(w)f¯(w))\displaystyle\frac{1}{n}\sum_{i\in S_{1}}v\cdot(\nabla f_{i}(w)-\nabla\overline{f}(w)) 1n|S1|(iT(v(fi(w)f¯(w)))2)1/2\displaystyle\leq\frac{1}{n}\sqrt{|S_{1}|}\left(\sum_{i\in T}(v\cdot(\nabla f_{i}(w)-\nabla\overline{f}(w)))^{2}\right)^{1/2} (7)
1nϵnO(σ2n)=O(σϵ),\displaystyle\leq\frac{1}{n}\sqrt{\epsilon n}\sqrt{O(\sigma^{2}n)}=O(\sigma\sqrt{\epsilon}),

where we used (4). Therefore, removing ϵ\epsilon-fraction of these samples cannot change this value by more than σϵ\sigma\sqrt{\epsilon}. Since SgoodTS_{\text{good}}\subset T and they differ by at most ϵn\epsilon n samples, we know that the above bound holds for SgoodS_{\text{good}} as well, that is

𝔼iSgood[(v(fi(w)f¯(w)))]O(σϵ).\mathbb{E}_{i\in S_{\text{good}}}[(v\cdot(\nabla f_{i}(w)-\nabla\overline{f}(w)))]\leq O(\sigma\sqrt{\epsilon}). (8)

We now proceed with a net argument and show that (5) and (8) hold for all ww with high probability. Suppose they hold for some ww. Then, by the LL-Lipschitzness and β\beta-smoothness of ff¯f-\overline{f}, (8) holds for all ww^{\prime} within distance σϵ/β\sigma\sqrt{\epsilon}/\beta of ww, and (5) holds for all ww^{\prime} within distance σ2/(2Lβ)\sigma^{2}/(2L\beta) of ww. The first statement follows directly from the β\beta-smoothness (each term changes by at most βww\beta\|w-w^{\prime}\|), and the second statement is due to the following calculation:

\left|\left\{v\cdot\left(\nabla f_{i}(w)-\nabla\overline{f}(w)\right)\right\}^{2}-\left\{v\cdot\left(\nabla f_{i}(w^{\prime})-\nabla\overline{f}(w^{\prime})\right)\right\}^{2}\right|
=\left|v\cdot\left(\nabla f_{i}(w)-\nabla\overline{f}(w)\right)+v\cdot\left(\nabla f_{i}(w^{\prime})-\nabla\overline{f}(w^{\prime})\right)\right|\cdot\left|v\cdot\left(\nabla f_{i}(w)-\nabla\overline{f}(w)\right)-v\cdot\left(\nabla f_{i}(w^{\prime})-\nabla\overline{f}(w^{\prime})\right)\right|
\leq 2L\cdot\beta\|w-w^{\prime}\|,

for any unit vector vv, where the first factor is at most 2L2L by the LL-Lipschitzness and the second factor is at most βww\beta\|w-w^{\prime}\| by the β\beta-smoothness of fif¯f_{i}-\overline{f}. It suffices to choose a cover 𝒞\mathcal{C} such that for any ww, there exists a point in the cover within distance min(σϵ/β,σ2/(2Lβ))\min(\sigma\sqrt{\epsilon}/\beta,\sigma^{2}/(2L\beta)) of ww.

Applying a union bound over cover 𝒞\mathcal{C}, we have that (5) and (8) hold for all ww with probability at least 1τ1-\tau.

Appendix B Lower Bound for Robust Stochastic Optimization

In this section, we demonstrate a matching lower bound for robust stochastic optimization under ϵ\epsilon-strong contamination with bounded covariance, showing that our algorithm achieves the minimax-optimal excess risk rate (up to logarithmic factors). Formally we will show the following,

Theorem B.1.

There exist a closed bounded set 𝒲d\mathcal{W}\subset\mathbb{R}^{d} with diameter at most 2D2D, and a distribution pp^{*} over functions f:𝒲f:\mathcal{W}\rightarrow\mathbb{R} that satisfy the following: Let f¯=𝐄fp[f]\overline{f}=\mathbf{E}_{f\sim p^{*}}[f]. We have that for each w𝒲w\in\mathcal{W} and unit vector vv that 𝐄fp[(v(f(w)f¯(w)))2]σ2\mathbf{E}_{f\sim p^{*}}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))^{2}]\leq\sigma^{2}. Both ff and f¯\overline{f} are convex, Lipschitz and smooth. The output w^\hat{w} of any algorithm with access to an ϵ\epsilon-corrupted set of functions f1,,fnf_{1},\dots,f_{n} sampled from pp^{*} satisfies the following,

𝔼f¯(w^)f¯=Ω(Dσϵ+Dσdn).\mathbb{E}\overline{f}(\hat{w})-\overline{f}^{*}=\Omega\left(D\sigma\sqrt{\epsilon}+D\sigma\sqrt{\frac{d}{n}}\right). (9)
Remark B.2.

Our construction of hard instances satisfies the assumptions of both our algorithms and SEVER, i.e., Lipschitzness and smoothness of the individual functions (and consequently of the population risk), and bounded covariance of the gradients. The lower bound consists of two terms. The first term is due to corruption and the second term is necessary even without corruption. We will prove these two terms separately.

B.1 Lower Bound: Term due to Corruption

We will leverage the following proposition that characterizes the information-theoretic limit of robust estimation.

Proposition B.3 (diakonikolas2023algorithmic ).

Let XX and YY be distributions with dTV(X,Y)2ϵd_{\text{TV}}(X,Y)\leq 2\epsilon for some 0<ϵ<10<\epsilon<1. A distribution 𝒟\mathcal{D} is taken to be either XX or YY. Then an algorithm, given any number of samples from 𝒟\mathcal{D} under ϵ\epsilon-contamination, cannot reliably distinguish between the cases 𝒟=X\mathcal{D}=X and 𝒟=Y\mathcal{D}=Y.

Consider a random variable X1X_{1} that takes value 0 with probability 1ϵ1-\epsilon and takes value ±σ/ϵ\pm\sigma/\sqrt{\epsilon} with probability ϵ/2\epsilon/2 each. That is,

X1={0with probability 1ϵσ/ϵwith probability ϵ/2σ/ϵwith probability ϵ/2.X_{1}=\begin{cases}0&\text{with probability }1-\epsilon\\ \sigma/\sqrt{\epsilon}&\text{with probability }\epsilon/2\\ -\sigma/\sqrt{\epsilon}&\text{with probability }\epsilon/2\end{cases}. (10)

The variance of X1X_{1} is σ2\sigma^{2}. Now, consider X1X_{1}^{\prime} that takes value 0 with probability 1ϵ1-\epsilon and takes value σ/ϵ\sigma/\sqrt{\epsilon} with probability ϵ\epsilon. The mean of X1X_{1}^{\prime} is σϵ\sigma\sqrt{\epsilon}, and the variance of X1X_{1}^{\prime} is

Var(X1)\displaystyle\text{Var}(X_{1}^{\prime}) =𝔼[X12](𝔼[X1])2\displaystyle=\mathbb{E}[X_{1}^{\prime 2}]-\left(\mathbb{E}[X_{1}^{\prime}]\right)^{2} (11)
=ϵσ2ϵ+0(σϵ)2\displaystyle=\epsilon\cdot\frac{\sigma^{2}}{\epsilon}+0-(\sigma\sqrt{\epsilon})^{2}
=σ2σ2ϵ<σ2.\displaystyle=\sigma^{2}-\sigma^{2}\epsilon<\sigma^{2}.

Let 𝒟1\mathcal{D}_{1} and 𝒟1\mathcal{D}_{1}^{\prime} denote the probability distributions of X1X_{1} and X1X_{1}^{\prime} respectively.

Consider the following robust optimization instance. Let {w|w|D}\{w\mid|w|\leq D\} denote the feasible set. Define the loss function as fx(w)=wxf_{x}(w)=-w\cdot x. We know that fx(w)=x\nabla f_{x}(w)=-x, so that both ff and f¯\overline{f} are Lipschitz and smooth. Let f¯(w,𝒟)=𝔼X𝒟[fX(w)]\overline{f}(w,\mathcal{D})=\mathbb{E}_{X\sim\mathcal{D}}[f_{X}(w)] denote the population risk for a given distribution 𝒟\mathcal{D}.

Expanding the expectation, we have that

f¯(w,𝒟1)\displaystyle-\overline{f}(w,\mathcal{D}_{1}) =(1ϵ)w0+ϵ2wσϵ+ϵ2wσϵ=0,\displaystyle=(1-\epsilon)w\cdot 0+\frac{\epsilon}{2}w\cdot\frac{\sigma}{\sqrt{\epsilon}}+\frac{\epsilon}{2}w\cdot-\frac{\sigma}{\sqrt{\epsilon}}=0, (12)
f¯(w,𝒟1)\displaystyle-\overline{f}(w,\mathcal{D}_{1}^{\prime}) =(1ϵ)w0+ϵwσϵ=wσϵ.\displaystyle=(1-\epsilon)w\cdot 0+\epsilon w\cdot\frac{\sigma}{\sqrt{\epsilon}}=w\cdot\sigma\sqrt{\epsilon}.

So we have

minwf¯(w,𝒟1)=0andminwf¯(w,𝒟1)=Dσϵ.\min_{w}\overline{f}(w,\mathcal{D}_{1})=0\quad\text{and}\quad\min_{w}\overline{f}(w,\mathcal{D}_{1}^{\prime})=-D\sigma\sqrt{\epsilon}. (13)

Therefore,

minwf¯(w,𝒟1)minwf¯(w,𝒟1)=Dσϵ.\min_{w}\overline{f}(w,\mathcal{D}_{1})-\min_{w}\overline{f}(w,\mathcal{D}_{1}^{\prime})=D\cdot\sigma\sqrt{\epsilon}. (14)

The total variation distance between 𝒟1\mathcal{D}_{1} and 𝒟1\mathcal{D}_{1}^{\prime} is at most ϵ\epsilon. Therefore, by Proposition B.3, given ϵ\epsilon-corrupted samples, no algorithm can reliably distinguish when these samples are generated from 𝒟1\mathcal{D}_{1} or 𝒟1\mathcal{D}_{1}^{\prime}. If an algorithm could optimize the population risk within DσϵD\cdot\sigma\sqrt{\epsilon}, then it could use the output to distinguish between 𝒟1\mathcal{D}_{1} and 𝒟1\mathcal{D}_{1}^{\prime}, which is a contradiction.

B.2 Lower Bound: Term due to Stochastic Optimization

lowy2023private proves a lower bound for stochastic optimization with heavy-tailed distributions. (Note that their result is for private optimization; we only use the nonprivate part of the lower bound.) We state the hard instance and the result below.

Lemma B.4 (lowy2023private [Theorem 36, part 3 for k=2k=2 and γ=σ2\gamma=\sigma^{2}]).

There exists a product distribution (i.e., a distribution with independent coordinates) QνQ_{\nu} that is supported on {±σ}d\{\pm\sigma\}^{d}. Let the feasible set be 𝒲=B2d(0,D)\mathcal{W}=B_{2}^{d}(0,D). Define the loss fx(w)=w,xf_{x}(w)=-\langle w,x\rangle and f¯(w):=𝔼xQνfx(w)\overline{f}(w):=\mathbb{E}_{x\sim Q_{\nu}}f_{x}(w). We have that supw𝒲𝔼xQν|fx(w)f¯(w),ej|2σ2\sup_{w\in\mathcal{W}}\mathbb{E}_{x\sim Q_{\nu}}|\langle\nabla f_{x}(w)-\nabla\overline{f}(w),e_{j}\rangle|^{2}\leq\sigma^{2} for all j[d]j\in[d], where eje_{j} denotes the jj-th standard basis vector in d\mathbb{R}^{d}.

Let XQνnX\sim Q_{\nu}^{n}. Any algorithm 𝒜\mathcal{A} has the following excess risk lower bound:

𝔼f¯(𝒜(X))f¯=Ω(Dσdn).\mathbb{E}\overline{f}(\mathcal{A}(X))-\overline{f}^{*}=\Omega\left(D\sigma\sqrt{\frac{d}{n}}\right). (15)

We now verify that this hard instance satisfies the assumptions used in our algorithms. We know that wfx(w)=x\nabla_{w}f_{x}(w)=-x, so that both ff and f¯\overline{f} are Lipschitz and smooth. It remains to show that this hard instance satisfies the bounded covariance assumption. For any unit vector udu\in\mathbb{R}^{d}, write u=j=1dujeju=\sum_{j=1}^{d}u_{j}e_{j}. Then

𝔼[(u,f(w)f¯(w))2]\displaystyle\mathbb{E}[(\langle u,\nabla f(w)-\nabla\overline{f}(w)\rangle)^{2}] =𝔼(j=1dujej,f(w)f¯(w))2\displaystyle=\mathbb{E}\left(\sum_{j=1}^{d}u_{j}\langle e_{j},\nabla f(w)-\nabla\overline{f}(w)\rangle\right)^{2} (16)
=()𝔼[j=1duj2(ej,f(w)f¯(w))2]\displaystyle\stackrel{{\scriptstyle(*)}}{{=}}\mathbb{E}\left[\sum_{j=1}^{d}u_{j}^{2}(\langle e_{j},\nabla f(w)-\nabla\overline{f}(w)\rangle)^{2}\right]
=j=1duj2𝔼[(ej,f(w)f¯(w))2]\displaystyle=\sum_{j=1}^{d}u_{j}^{2}\mathbb{E}[(\langle e_{j},\nabla f(w)-\nabla\overline{f}(w)\rangle)^{2}]
j=1duj2σ2=σ2,\displaystyle\leq\sum_{j=1}^{d}u_{j}^{2}\sigma^{2}=\sigma^{2},

where in ()(*) we use the fact that QνQ_{\nu} is a product distribution and thus cross terms vanish in the expectation. Therefore, the lower bound in (15) is also a lower bound for our problem. Combining this with the lower bound term due to corruption, we have the lower bound in (2).

Appendix C Analysis of Algorithm 1 for Smooth Population Risks

First, recall the robust estimation result (Lemma 3.5).

Proof C.1.

1. Bound the bias of the gradient estimator at ww. By Lemma 3.5, for a given ww, we have that with probability at least 1τ1-\tau^{\prime}, the robust gradient estimator at w=argminz𝒞zww^{\prime}=\operatornamewithlimits{arg\,min}_{z\in\mathcal{C}}\|z-w\| satisfies

g~(w)f¯(w)=σO~(ϵ+d/n+log(1/τ)/n).\|\tilde{g}(w^{\prime})-\nabla\overline{f}(w^{\prime})\|=\sigma\cdot\tilde{O}\left(\sqrt{\epsilon}+\sqrt{d/n}+\sqrt{\log(1/\tau^{\prime})/n}\right). (17)

We have wwσϵ/β¯\|w-w^{\prime}\|\leq\sigma\sqrt{\epsilon}/\bar{\beta} by definition of the cover. By β¯\bar{\beta}-smoothness of the population risk f¯\overline{f}, we have

f¯(w)f¯(w)β¯wwσϵ.\|\nabla\overline{f}(w)-\nabla\overline{f}(w^{\prime})\|\leq\bar{\beta}\|w-w^{\prime}\|\leq\sigma\sqrt{\epsilon}. (18)

Combining the two bounds, we have

g~(w)f¯(w)=σO~(ϵ+d/n+log(1/τ)/n).\|\tilde{g}(w^{\prime})-\nabla\overline{f}(w)\|=\sigma\cdot\tilde{O}\left(\sqrt{\epsilon}+\sqrt{d/n}+\sqrt{\log(1/\tau^{\prime})/n}\right).

2. Apply the union bound over all grid points. By the union bound, setting τ=τ/|𝒞|\tau^{\prime}=\tau/|\mathcal{C}|, we have that with probability at least 1τ1-\tau, (17) holds for all w𝒞w^{\prime}\in\mathcal{C}. Recall |𝒞|=O(Dd/ξ)d|\mathcal{C}|=O\left(D\sqrt{d}/\xi\right)^{d}. We have log|𝒞|=O~(d)\log|\mathcal{C}|=\tilde{O}(d). It follows that, with probability at least 1τ1-\tau, simultaneously for all w𝒲w\in\mathcal{W}, with w=argminz𝒞zww^{\prime}=\operatornamewithlimits{arg\,min}_{z\in\mathcal{C}}\|z-w\|, we have

g~(w)f¯(w)=σO~(ϵ+d/n+dlog(1/τ)/n).\|\tilde{g}(w^{\prime})-\nabla\overline{f}(w)\|=\sigma\cdot\tilde{O}\left(\sqrt{\epsilon}+\sqrt{d/n}+\sqrt{d\log(1/\tau)/n}\right). (19)

Therefore, with probability at least 1τ1-\tau, the bias of the gradient estimator at ww is bounded by the above expression, simultaneously for all w𝒲w\in\mathcal{W}.

3. Apply the projected biased gradient descent analysis. By Lemma E.2, choosing a constant step size η=1/β¯\eta=1/\bar{\beta}, the excess risk of the algorithm is bounded by

f¯(w^T)minw𝒲f¯(w)=O~(β¯D2T+D(σϵ+σdlog(1/τ)n)).\overline{f}(\hat{w}_{T})-\min_{w\in\mathcal{W}}\overline{f}(w)=\tilde{O}\left(\frac{\bar{\beta}D^{2}}{T}+D\cdot\left(\sigma\sqrt{\epsilon}+\sigma\sqrt{\frac{d\log(1/\tau)}{n}}\right)\right). (20)

Choosing T=Ω~(β¯Dσϵ+σdlog(1/τ)n)T=\tilde{\Omega}\left(\frac{\bar{\beta}D}{\sigma\sqrt{\epsilon}+\sigma\sqrt{\frac{d\log(1/\tau)}{n}}}\right) gives the optimal rate. To convert the high probability bound to an in-expectation bound, we apply Lemma F.1.
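To see that this choice of $T$ balances the two terms in (20), note that

\frac{\bar{\beta}D^{2}}{T}\leq\frac{\bar{\beta}D^{2}}{\tilde{\Omega}\left(\frac{\bar{\beta}D}{\sigma\sqrt{\epsilon}+\sigma\sqrt{d\log(1/\tau)/n}}\right)}=D\cdot\tilde{O}\left(\sigma\sqrt{\epsilon}+\sigma\sqrt{\frac{d\log(1/\tau)}{n}}\right),

so both terms in (20) are of the same order, and the excess risk is $D\sigma\cdot\tilde{O}\left(\sqrt{\epsilon}+\sqrt{d\log(1/\tau)/n}\right)$.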

Appendix D Analysis of Algorithm 2

Before proving Theorem 4.3, we need some results from robust estimation literature.

D.1 Results from Robust Mean Estimation

Recall Definition 2.1. The “good” set property is a special case of stability, defined as follows.

Definition D.1 (Stability diakonikolas2019robust ).

Fix 0<ϵ<1/20<\epsilon<1/2 and δϵ\delta\geq\epsilon. A finite set SdS\subset\mathbb{R}^{d} is (ϵ,δ)(\epsilon,\delta)-stable with respect to mean μd\mu\in\mathbb{R}^{d} and σ2\sigma^{2} if for every SSS^{\prime}\subseteq S with |S|(1ϵ)|S||S^{\prime}|\geq(1-\epsilon)|S|, the following conditions hold: (i) μSμσδ\|\mu_{S^{\prime}}-\mu\|\leq\sigma\delta, and (ii) Σ¯Sσ2Iσ2δ2/ϵ\|\overline{\Sigma}_{S^{\prime}}-\sigma^{2}I\|\leq\sigma^{2}\delta^{2}/\epsilon.
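For intuition, the following is a minimal numerical sketch (not part of the original text) of the two conditions in Definition D.1 for a single candidate subset $S^{\prime}$. It assumes that $\mu_{S^{\prime}}$ denotes the sample mean of $S^{\prime}$ and that $\overline{\Sigma}_{S^{\prime}}$ denotes the second-moment matrix of $S^{\prime}$ centered at the reference mean $\mu$; checking every large subset, as the definition requires, is of course not feasible this way.

```python
import numpy as np

def satisfies_stability_conditions(S_prime, mu, sigma, eps, delta):
    """Check conditions (i) and (ii) of Definition D.1 for one candidate subset S'.

    Assumption: Sigma_bar is the second-moment matrix of S' centered at the
    reference mean mu (this convention is not spelled out in the definition above).
    """
    S_prime = np.asarray(S_prime, dtype=float)
    d = S_prime.shape[1]
    mu_S = S_prime.mean(axis=0)                 # sample mean of S'
    centered = S_prime - mu                     # center at the reference mean mu
    Sigma_bar = centered.T @ centered / len(S_prime)
    cond_mean = np.linalg.norm(mu_S - mu) <= sigma * delta
    cond_cov = np.linalg.norm(Sigma_bar - sigma**2 * np.eye(d), ord=2) <= sigma**2 * delta**2 / eps
    return cond_mean and cond_cov
```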

With the stability condition, we can robustly estimate the mean of a distribution with bounded covariance.

Lemma D.2 (Robust Mean Estimation Under Stability diakonikolas2019robust ).

Let $T\subset\mathbb{R}^{d}$ be an $\epsilon$-corrupted version of a set $S$ with the following properties: $S$ contains a subset $S^{\prime}\subseteq S$ such that $|S^{\prime}|\geq(1-\epsilon)|S|$ and $S^{\prime}$ is $(C\epsilon,\delta)$-stable with respect to $\mu\in\mathbb{R}^{d}$ and $\sigma^{2}$, for a sufficiently large constant $C>0$. Then there is a polynomial-time algorithm that, on input $\epsilon$ and $T$, computes $\widehat{\mu}$ such that $\|\widehat{\mu}-\mu\|=O(\sigma\delta)$.

Remark D.3.

In Algorithm 2, we use the robust gradient estimator RobustEstimator(w)\texttt{RobustEstimator}(w) as a black box, and assume that it satisfies the property in this lemma. See Appendix I for a specific instantiation of the robust gradient estimator.

The following results due to diakonikolas2020outlier achieve subgaussian rates for robust mean estimation for bounded covariance distributions.

Lemma D.4 (diakonikolas2020outlier ).

Fix any 0<τ<10<\tau^{\prime}<1. Let SS be a multiset of nn i.i.d. samples from a distribution on d\mathbb{R}^{d} with mean μ\mu and covariance Σ\Sigma such that Σσ2I\Sigma\preceq\sigma^{2}I. Let ϵ=Θ(log(1/τ)/n+ϵ)c\epsilon^{\prime}=\Theta(\log(1/\tau^{\prime})/n+\epsilon)\leq c, for a sufficiently small constant c>0c>0. Then, with probability at least 1τ1-\tau^{\prime}, there exists a subset SSS^{\prime}\subseteq S such that |S|(1ϵ)n|S^{\prime}|\geq(1-\epsilon^{\prime})n and SS^{\prime} is (2ϵ,δ)(2\epsilon^{\prime},\delta^{\prime})-stable with respect to μ\mu and σ2\sigma^{2}, where δ=δ(τ)\delta^{\prime}=\delta(\tau^{\prime}) depends on τ\tau^{\prime} as δ(τ)=O((dlogd)/n+ϵ+log(1/τ)/n)\delta(\tau^{\prime})=O(\sqrt{(d\log d)/n}+\sqrt{\epsilon}+\sqrt{\log(1/\tau^{\prime})/n}).

D.2 Proof of Theorem 4.3

As long as the stability condition holds, we can use deterministic stability-based algorithms (e.g., deterministic filtering) to robustly estimate the mean. Using a union bound over the covering, it suffices to argue that, at a given point $w$, if a stable subset of the form $\{\nabla f_{i}(w)\}_{i\in\mathcal{I}}$ exists (where $\mathcal{I}$ denotes the index set of the stable subset at $w$), then the same subset remains stable in a small neighborhood of $w$; that is, $\{\nabla f_{i}(w^{\prime})\}_{i\in\mathcal{I}}$ is stable for all $w^{\prime}$ in a small neighborhood of $w$. We have the following stability result, which corresponds to ``many-good-sets'' (2.5).

Lemma D.5.

Under 4.1, let f1,,fnf_{1},\ldots,f_{n} denote an ϵ\epsilon-corrupted set of functions sampled from pp^{*}. Let ϵ=Θ(log(1/τ)/n+ϵ)c\epsilon^{\prime}=\Theta(\log(1/\tau)/n+\epsilon)\leq c be given, for a constant c>0c>0. With probability at least 1τ1-\tau, for all w𝒲w\in\mathcal{W}, there exists index set [n]\mathcal{I}\subseteq[n] (here \mathcal{I} depends on the choice of ww) such that ||(1ϵ)n|\mathcal{I}|\geq(1-\epsilon^{\prime})n and {fi(w)}i\{\nabla f_{i}(w)\}_{i\in\mathcal{I}} is (2ϵ,δ(τ))(2\epsilon^{\prime},\delta(\tau^{\prime}))-stable with respect to f¯(w)\nabla\overline{f}(w) and σ2\sigma^{2}, where τ=τ/exp(O~(d))\tau^{\prime}=\tau/\exp(\tilde{O}(d)) and δ(τ)=O~(ϵ+dlog(1/τ)/n)\delta(\tau^{\prime})=\tilde{O}\left(\sqrt{\epsilon}+\sqrt{d\log(1/\tau)/n}\right).

Proof D.6.

We use a net argument to show that the stability condition holds for all $w$, following proof techniques similar to those in li2024robust. For fixed $w$, by Lemma D.4, with probability at least $1-\tau^{\prime}$, there exists a subset $\mathcal{I}\subseteq[n]$ such that $|\mathcal{I}|\geq(1-\epsilon^{\prime})n$ and $\{\nabla f_{i}(w)\}_{i\in\mathcal{I}}$ is $(2\epsilon^{\prime},\delta^{\prime})$-stable with respect to $\nabla\overline{f}(w)$ and $\sigma^{2}$, where $\delta^{\prime}=\delta(\tau^{\prime})$; that is,

1||ifi(w)f¯(w)O(σδ),\left\|\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\nabla f_{i}(w)-\nabla\overline{f}(w)\right\|\leq O(\sigma\delta^{\prime}), (21a)
1||i(fi(w)f¯(w))(fi(w)f¯(w))σ2IO(σ2δ2/ϵ).\left\|\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}(\nabla f_{i}(w)-\nabla\overline{f}(w))(\nabla f_{i}(w)-\nabla\overline{f}(w))^{\top}-\sigma^{2}I\right\|\leq O(\sigma^{2}\delta^{\prime 2}/\epsilon^{\prime}). (21b)

By β\beta-smoothness of fif¯f_{i}-\overline{f}, we have

\left\|\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\nabla f_{i}(w^{\prime})-\nabla\overline{f}(w^{\prime})\right\|\leq\left\|\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\left[\left(\nabla f_{i}(w^{\prime})-\nabla\overline{f}(w^{\prime})\right)-\left(\nabla f_{i}(w)-\nabla\overline{f}(w)\right)\right]\right\|+\left\|\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\nabla f_{i}(w)-\nabla\overline{f}(w)\right\| (22)
\leq\beta\|w^{\prime}-w\|+O(\sigma\delta^{\prime}).

Therefore, (21a) holds for all ww^{\prime} such that wwσδ/β\|w-w^{\prime}\|\leq\sigma\delta^{\prime}/\beta. We note that Equation 21b is equivalent to the following condition: for any unit vector vv, we have that

|1||i(v(fi(w)f¯(w)))2σ2|O(σ2δ2/ϵ).\left|\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}(v\cdot(\nabla f_{i}(w)-\nabla\overline{f}(w)))^{2}-\sigma^{2}\right|\leq O(\sigma^{2}\delta^{\prime 2}/\epsilon^{\prime}).

By LL-Lipschitzness and β\beta-smoothness of fif¯f_{i}-\overline{f}, for any unit vector vv, we have the following

\left|\left\{v\cdot\left(\nabla f_{i}(w)-\nabla\overline{f}(w)\right)\right\}^{2}-\left\{v\cdot\left(\nabla f_{i}(w^{\prime})-\nabla\overline{f}(w^{\prime})\right)\right\}^{2}\right| (23)
=\left|v\cdot\left(\nabla f_{i}(w)-\nabla\overline{f}(w)\right)+v\cdot\left(\nabla f_{i}(w^{\prime})-\nabla\overline{f}(w^{\prime})\right)\right|\cdot\left|v\cdot\left(\nabla f_{i}(w)-\nabla\overline{f}(w)\right)-v\cdot\left(\nabla f_{i}(w^{\prime})-\nabla\overline{f}(w^{\prime})\right)\right|
\leq 2L\cdot\beta\|w-w^{\prime}\|.

It follows that (21b) holds for ww^{\prime} such that wwσ2δ2/(ϵLβ)\|w-w^{\prime}\|\leq\sigma^{2}\delta^{\prime 2}/(\epsilon^{\prime}L\beta).

Let ξ=min(σδ/β,σ2δ2/(ϵLβ))\xi=\min\left(\sigma\delta^{\prime}/\beta,\sigma^{2}\delta^{\prime 2}/(\epsilon^{\prime}L\beta)\right). Then, for all ww^{\prime} such that wwξ\|w-w^{\prime}\|\leq\xi, {fi(w)}i\{\nabla f_{i}(w^{\prime})\}_{i\in\mathcal{I}} is (2ϵ,δ)(2\epsilon^{\prime},\delta^{\prime})-stable with respect to f¯(w)\nabla\overline{f}(w^{\prime}) and σ2\sigma^{2}. It suffices to choose a ξ\xi-cover 𝒞\mathcal{C} of 𝒲\mathcal{W}, where the optimal size of the cover is |𝒞|=O((D/ξ)d)|\mathcal{C}|=O((D/\xi)^{d}), and choose τ=τ/|𝒞|\tau^{\prime}=\tau/|\mathcal{C}|. By union bound, with probability at least 1|𝒞|τ1-|\mathcal{C}|\tau^{\prime}, the stable subset exists for all w𝒞w\in\mathcal{C} simultaneously. Since we have argued that for fixed ww, the same stable subset applies for all ww^{\prime} within distance ξ\xi from ww, the subset stability holds simultaneously for all ww with probability at least 1τ1-\tau, as claimed.

Proof D.7 (Proof of Theorem 4.3).

Combining Lemma D.5 and Lemma D.2, in each iteration, we can estimate the gradient up to a bias as follows:

\|\tilde{g}(w_{t})-\nabla\overline{f}(w_{t})\|=O(\sigma\cdot\delta(\tau^{\prime}))=\sigma\cdot\tilde{O}\left(\sqrt{\epsilon}+\sqrt{d\log(1/\tau)/n}\right).

The excess risk bound then follows by applying Lemma E.2 for smooth losses, or Lemma E.4 for Lipschitz losses, with the corresponding choices of step sizes and large enough $T$. When $\sigma$ is unknown, we can use the approach in Appendix J to obtain a lower bound $\hat{\sigma}$ on $\sigma$ and choose $T$ large enough so that the optimization error term that depends on $T$ is dominated by $\hat{\sigma}\cdot\tilde{O}\left(\sqrt{\epsilon}+\sqrt{d\log(d)/n}+\sqrt{d\log(1/\tau)/n}\right)$.

Appendix E Projected Biased Gradient Descent

In this section, we analyze the convergence of the projected gradient descent algorithm with a biased gradient estimator. We assume the loss function is convex throughout this section. The general form is stated below.

Algorithm 3 Projected Gradient Descent with Biased Gradient Estimator
1:Input: Convex function $F$, step sizes $\{\eta_{t}\}_{t=0}^{T-1}$, biased gradient estimator $\texttt{BiasedEstimator}(w)$, feasible set $\mathcal{W}$
2:Initialize $w_{0}\in\mathcal{W}$.
3:for $t=0,1,\ldots,T-1$ do
4:     Let $\tilde{g}_{t}=\texttt{BiasedEstimator}(w_{t})$.
5:     $w_{t+1}\leftarrow\Pi_{\mathcal{W}}(w_{t}-\eta_{t}\tilde{g}_{t})$.
6:end for
7:Output: w^T=1Tt=1Twt\hat{w}_{T}=\frac{1}{T}\sum_{t=1}^{T}w_{t}.

Here, Π𝒲()\Pi_{\mathcal{W}}(\cdot) denotes the projection operator onto the feasible set 𝒲\mathcal{W}, that is,

Π𝒲(y)=argminw𝒲wy2.\Pi_{\mathcal{W}}(y)=\arg\min_{w\in\mathcal{W}}\|w-y\|^{2}.

The projection operation ensures that the iterates wtw_{t} remain within the feasible set 𝒲\mathcal{W} throughout the optimization process. The projection step is crucial when the optimization problem is constrained, as it guarantees that the updates do not violate the constraints defined by 𝒲\mathcal{W}.
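For concreteness, the following is a minimal NumPy sketch of Algorithm 3 (an illustration, not a reference implementation). The routines biased_estimator and project are assumed to be supplied, e.g., a robust mean of sample gradients and the Euclidean projection onto $\mathcal{W}$, respectively.

```python
import numpy as np

def projected_biased_gd(w0, step_sizes, biased_estimator, project):
    """Minimal sketch of Algorithm 3: projected GD with a biased gradient oracle."""
    w = project(np.asarray(w0, dtype=float))
    iterates = []
    for eta in step_sizes:                 # one step per step size eta_t
        g = biased_estimator(w)            # biased estimate of the gradient at w
        w = project(w - eta * g)           # projected descent step
        iterates.append(w)
    return np.mean(iterates, axis=0)       # averaged iterate, matching the output step

def project_onto_ball(w, D=1.0):
    """Example projection: Euclidean ball of radius D (one possible choice of W)."""
    norm = np.linalg.norm(w)
    return w if norm <= D else w * (D / norm)
```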

We analyze the convergence of the algorithm for (1) smooth losses and (2) Lipschitz losses. For convenience, we always write $\tilde{g}_{t}=g_{t}+b_{t}$, where $g_{t}=\nabla F(w_{t})$ is the true gradient and $b_{t}$ is the bias of the BiasedEstimator. We assume that the bias is bounded, i.e., $\|b_{t}\|\leq B$ for all iterations $t$.

We will use the following property of the projection operator.

Lemma E.1.

Let w𝒲w\in\mathcal{W} and ydy\in\mathbb{R}^{d}. We have

(Π𝒲(y)y)(wΠ𝒲(y))0.\left(\Pi_{\mathcal{W}}(y)-y\right)^{\top}\left(w-\Pi_{\mathcal{W}}(y)\right)\geq 0.

E.1 Smooth Loss

Lemma E.2.

Suppose FF is β\beta-smooth. Running Algorithm 3 with constant step size η=1β\eta=\frac{1}{\beta}, we have

F(1Tt=1Twt)F(w)β2T(w0w2wTw2)+BD.F\left(\frac{1}{T}\sum_{t=1}^{T}w_{t}\right)-F(w^{*})\leq\frac{\beta}{2T}\left(\|w_{0}-w^{*}\|^{2}-\|w_{T}-w^{*}\|^{2}\right)+BD. (24)
Proof E.3.

By convexity, we have

F(wt)F(w)+gt(wtw).F(w_{t})\leq F(w^{*})+g_{t}^{\top}(w_{t}-w^{*}). (25)

By $\beta$-smoothness, we have

F(wt+1)F(wt)+gt(wt+1wt)+β2wtwt+12.F(w_{t+1})\leq F(w_{t})+g_{t}^{\top}(w_{t+1}-w_{t})+\frac{\beta}{2}\|w_{t}-w_{t+1}\|^{2}. (26)

Using Lemma E.1, we have

(wt+1wt+ηtg~t)(wwt+1)0.(w_{t+1}-w_{t}+\eta_{t}\tilde{g}_{t})^{\top}(w^{*}-w_{t+1})\geq 0. (27)

We break the left-hand side into the two terms $(w_{t+1}-w_{t})^{\top}(w^{*}-w_{t+1})$ and $\eta_{t}\tilde{g}_{t}^{\top}(w^{*}-w_{t+1})$. We can write the first term as

(wt+1wt)(wwt+1)=12(wtw2wt+1wt2wt+1w2)(w_{t+1}-w_{t})^{\top}(w^{*}-w_{t+1})=\tfrac{1}{2}\left(\|w_{t}-w^{*}\|^{2}-\|w_{t+1}-w_{t}\|^{2}-\|w_{t+1}-w^{*}\|^{2}\right) (28)

For the second term, we have

ηtg~t(wwt+1)=ηtgt(wwt)+ηtgt(wtwt+1)+ηtbt(wwt+1)\eta_{t}\tilde{g}_{t}^{\top}(w^{*}-w_{t+1})=\eta_{t}g_{t}^{\top}(w^{*}-w_{t})+\eta_{t}g_{t}^{\top}(w_{t}-w_{t+1})+\eta_{t}b_{t}^{\top}(w^{*}-w_{t+1}) (29)

Using (25), (26), and Cauchy-Schwarz inequality to bound the three terms respectively, we have

\eta_{t}\tilde{g}_{t}^{\top}(w^{*}-w_{t+1})\leq\eta_{t}(F(w^{*})-F(w_{t}))+\eta_{t}(F(w_{t})-F(w_{t+1}))+\tfrac{\beta\eta_{t}}{2}\|w_{t}-w_{t+1}\|^{2}+\eta_{t}BD. (30)

Now going back to (27), we can combine the above inequalities and choose ηt=1/β\eta_{t}=1/\beta to get

F(wt+1)F(w)β2wtw2β2wt+1w2+BDF(w_{t+1})-F(w^{*})\leq\tfrac{\beta}{2}\|w_{t}-w^{*}\|^{2}-\tfrac{\beta}{2}\|w_{t+1}-w^{*}\|^{2}+BD (31)

Summing over $t=0,\ldots,T-1$ and dividing by $T$, we have

1Tt=1T(F(wt)F(w))\displaystyle\frac{1}{T}\sum_{t=1}^{T}\left(F(w_{t})-F(w^{*})\right) β2T(w0w2wTw2)+BD\displaystyle\leq\frac{\beta}{2T}\left(\|w_{0}-w^{*}\|^{2}-\|w_{T}-w^{*}\|^{2}\right)+BD (32)
βD22T+BD.\displaystyle\leq\frac{\beta D^{2}}{2T}+BD.

The result then follows by convexity.

E.2 Lipschitz Loss

Alternatively, we can consider the case where the loss function F(w)F(w) is convex and LL-Lipschitz. The following lemma holds.

Lemma E.4.

Suppose $F$ is $L$-Lipschitz. Running Algorithm 3 with constant step size $\eta=\frac{D}{(L+B)\sqrt{T}}$, we have

F(1Tt=1Twt)F(w)DLT+(1T+1)BD.F\left(\frac{1}{T}\sum_{t=1}^{T}w_{t}\right)-F(w^{*})\leq\frac{DL}{\sqrt{T}}+\left(\frac{1}{\sqrt{T}}+1\right)BD. (33)
Proof E.5.

Let us denote $y_{t+1}=w_{t}-\eta\tilde{g}_{t}$, so that $w_{t+1}=\Pi_{\mathcal{W}}(y_{t+1})$. Using Lemma E.1, we have

\|w_{t+1}-w^{*}\|\leq\|y_{t+1}-w^{*}\|.

By the update rule, we have

\tilde{g}_{t}^{\top}(w_{t}-w^{*})=\frac{1}{\eta}(w_{t}-y_{t+1})^{\top}(w_{t}-w^{*}) (34)
=\frac{1}{2\eta}\left(\|w_{t}-w^{*}\|^{2}+\|w_{t}-y_{t+1}\|^{2}-\|y_{t+1}-w^{*}\|^{2}\right)
\leq\frac{1}{2\eta}\left(\|w_{t}-w^{*}\|^{2}-\|w_{t+1}-w^{*}\|^{2}\right)+\frac{\eta}{2}\|\tilde{g}_{t}\|^{2},

where the last step uses $\|w_{t+1}-w^{*}\|\leq\|y_{t+1}-w^{*}\|$ and $\|w_{t}-y_{t+1}\|=\eta\|\tilde{g}_{t}\|$.

Now by convexity, we have

F(wt)F(w)gt(wtw)=g~t(wtw)bt(wtw).F(w_{t})-F(w^{*})\leq g_{t}^{\top}(w_{t}-w^{*})=\tilde{g}_{t}^{\top}(w_{t}-w^{*})-b_{t}^{\top}(w_{t}-w^{*}). (35)

Recall our assumptions on $g_{t}$ and $b_{t}$: we have $\|\tilde{g}_{t}\|^{2}=\|g_{t}+b_{t}\|^{2}\leq(L+B)^{2}$, and $-b_{t}^{\top}(w_{t}-w^{*})\leq\|b_{t}\|\|w_{t}-w^{*}\|\leq BD$. Summing over $t=0,\ldots,T-1$ and dividing by $T$ gives

\frac{1}{T}\sum_{t=0}^{T-1}\left(F(w_{t})-F(w^{*})\right)\leq\frac{1}{2\eta T}\left(\|w_{0}-w^{*}\|^{2}-\|w_{T}-w^{*}\|^{2}\right)+\frac{\eta}{2}(L+B)^{2}+BD (36)
\leq\frac{D^{2}}{2\eta T}+\frac{\eta}{2}(L+B)^{2}+BD.

Choosing η=D(L+B)T\eta=\frac{D}{(L+B)\sqrt{T}} and using convexity of FF gives the desired result.
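For completeness, plugging the step size $\eta=\frac{D}{(L+B)\sqrt{T}}$ into the last bound in (36) gives

\frac{D^{2}}{2\eta T}+\frac{\eta}{2}(L+B)^{2}+BD=\frac{D(L+B)}{2\sqrt{T}}+\frac{D(L+B)}{2\sqrt{T}}+BD=\frac{DL}{\sqrt{T}}+\left(\frac{1}{\sqrt{T}}+1\right)BD,

which matches (33).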

Appendix F A Lemma for Converting High Probability Bounds to Expectation Bounds

Lemma F.1.

Let XX be a nonnegative random variable such that for all τ(0,1)\tau\in(0,1),

P(Xa+blog(1/τ))1τ,P(X\leq a+b\sqrt{\log(1/\tau)})\geq 1-\tau,

where aa and bb are positive constants. Then the expectation of XX satisfies

𝔼[X]a+bπ2.\mathbb{E}[X]\leq a+b\cdot\frac{\sqrt{\pi}}{2}.
Proof F.2.

The expectation of XX can be expressed as:

𝔼[X]=0P(X>t)𝑑t.\mathbb{E}[X]=\int_{0}^{\infty}P(X>t)\,dt.

From the given condition, we have P(X>a+blog(1/τ))τP(X>a+b\sqrt{\log(1/\tau)})\leq\tau. Set t=a+blog(1/τ)t=a+b\sqrt{\log(1/\tau)}, then solving for τ\tau gives τ=e(tab)2\tau=e^{-\left(\frac{t-a}{b}\right)^{2}}, so that for tat\geq a, we have

P(X>t)e(tab)2.P(X>t)\leq e^{-\left(\frac{t-a}{b}\right)^{2}}.

Thus, the expectation can be bounded by:

𝔼[X]0a1𝑑t+ae(tab)2𝑑t=a+b0eu2𝑑u.\mathbb{E}[X]\leq\int_{0}^{a}1\,dt+\int_{a}^{\infty}e^{-\left(\frac{t-a}{b}\right)^{2}}\,dt=a+b\int_{0}^{\infty}e^{-u^{2}}\,du.

The Gaussian integral evaluates to 0eu2𝑑u=π2\int_{0}^{\infty}e^{-u^{2}}\,du=\frac{\sqrt{\pi}}{2}, giving the final bound:

𝔼[X]a+bπ2.\mathbb{E}[X]\leq a+b\cdot\frac{\sqrt{\pi}}{2}.
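As an illustration (not part of the proof), the bound is attained when the tail condition holds with equality, i.e., when $X=a+b\sqrt{\log(1/U)}$ for $U\sim\mathrm{Uniform}(0,1)$. A quick Monte Carlo check:

```python
import numpy as np

# If P(X <= a + b*sqrt(log(1/tau))) = 1 - tau exactly, then X = a + b*sqrt(log(1/U))
# with U ~ Uniform(0,1), and Lemma F.1's bound E[X] <= a + b*sqrt(pi)/2 is attained.
rng = np.random.default_rng(0)
a, b = 1.0, 2.0
u = rng.uniform(size=10_000_000)
x = a + b * np.sqrt(np.log(1.0 / u))
print("empirical mean :", x.mean())               # approximately a + b*sqrt(pi)/2
print("lemma F.1 bound:", a + b * np.sqrt(np.pi) / 2)
```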

Appendix G Discussions on the Assumptions

G.1 On the bounded covariance assumption

The reason for the assumption $\sigma\leq L$ is as follows. Due to Lipschitzness, we have $\|\nabla f(w)-\nabla\overline{f}(w)\|\leq L$ almost surely. It follows from the Cauchy--Schwarz inequality that $\mathbf{E}_{f\sim p^{*}}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))^{2}]\leq L^{2}$ holds for any unit vector $v$. In the other direction, $L$ can be much larger than $\sigma$, since the Lipschitz bound must hold almost surely.

The assumption 𝐄[(v(f(w)f¯(w)))2]σ2\mathbf{E}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))^{2}]\leq\sigma^{2} is equivalent to the assumption that the covariance matrix of the gradients f(w)\nabla f(w) satisfies Σσ2I\Sigma\preceq\sigma^{2}I.

Proposition G.1.

Let Σw\Sigma_{w} denote the covariance matrix of the gradients f(w)\nabla f(w). For given ww, the following two assumptions are equivalent:

  1. For every unit vector $v$, we have $\mathbf{E}_{f\sim p^{*}}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))^{2}]\leq\sigma^{2}$.

  2. The covariance matrix satisfies $\Sigma_{w}\preceq\sigma^{2}I$.

Furthermore, since Σw\Sigma_{w} is positive semidefinite, by definition of the spectral norm, the latter assumption can be equivalently written as Σwσ2\|\Sigma_{w}\|\leq\sigma^{2}.

Proof G.2.

By definition,

Σw=𝐄fp[(f(w)f¯(w))(f(w)f¯(w))].\Sigma_{w}=\mathbf{E}_{f\sim p^{*}}[(\nabla f(w)-\nabla\overline{f}(w))(\nabla f(w)-\nabla\overline{f}(w))^{\top}]. (37)

We have

𝐄fp[(v(f(w)f¯(w)))2]\displaystyle\mathbf{E}_{f\sim p^{*}}[(v\cdot(\nabla f(w)-\nabla\overline{f}(w)))^{2}] =𝐄fp[v(f(w)f¯(w))(f(w)f¯(w))v]\displaystyle=\mathbf{E}_{f\sim p^{*}}[v\cdot(\nabla f(w)-\nabla\overline{f}(w))(\nabla f(w)-\nabla\overline{f}(w))^{\top}\cdot v] (38)
=v𝐄fp[(f(w)f¯(w))(f(w)f¯(w))]v=vΣwv.\displaystyle=v^{\top}\cdot\mathbf{E}_{f\sim p^{*}}[(\nabla f(w)-\nabla\overline{f}(w))(\nabla f(w)-\nabla\overline{f}(w))^{\top}]\cdot v=v^{\top}\Sigma_{w}v.

Therefore, the two assumptions are equivalent.
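As a small numerical illustration (for an arbitrary positive semidefinite matrix, not tied to any particular loss), the worst-case directional second moment $\max_{\|v\|=1}v^{\top}\Sigma_{w}v$ coincides with the spectral norm $\|\Sigma_{w}\|$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T                                   # an arbitrary PSD "covariance" matrix
v = rng.standard_normal(5)
v /= np.linalg.norm(v)                            # a random unit direction
top_eig = np.linalg.eigvalsh(Sigma)[-1]           # largest eigenvalue of Sigma
assert v @ Sigma @ v <= top_eig + 1e-12           # any direction is dominated by it
print(top_eig, np.linalg.norm(Sigma, 2))          # equal: spectral norm of a PSD matrix
```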

G.2 Comparing the bounded covariance assumption with the bounded variance assumption

Cover-based approaches (e.g., uniform convergence) often suffer from suboptimal error (feldman2016generalization, ). However, Algorithm 1 indeed achieves the optimal rate. We believe this is due to the bounded covariance assumption $\Sigma\preceq\sigma^{2}I$. Below, we discuss the bounded covariance assumption and compare it with the bounded variance assumption.

The bounded covariance assumption Σσ2I\Sigma\preceq\sigma^{2}I is different from the bounded variance assumption 𝐄f(w)f¯(w)2Φ2\mathbf{E}\left\|\nabla f(w)-\nabla\overline{f}(w)\right\|^{2}\leq\Phi^{2} as commonly used in optimization literature without corruption. Using the property tr(AB)=tr(BA)\operatorname{tr}{(AB)}=\operatorname{tr}{(BA)}, this is equivalent to tr(Σ)Φ2\operatorname{tr}{(\Sigma)}\leq\Phi^{2}.

We comment that neither assumption implies the other with comparable parameters. For the isotropic Gaussian distribution, where the covariance matrix is $\Sigma=\sigma^{2}I$, we have $\|\Sigma\|=\sigma^{2}$ but $\operatorname{tr}(\Sigma)=d\sigma^{2}$. On the other hand, consider a distribution whose variance is concentrated in one direction, i.e., $\Sigma=\Phi^{2}\cdot vv^{\top}$ for some unit vector $v$; then $\operatorname{tr}(\Sigma)=\Phi^{2}$ and $\|\Sigma\|=\Phi^{2}$. In general, we only know that $\|\Sigma\|\leq\operatorname{tr}(\Sigma)\leq d\|\Sigma\|$.

Recall Lemma D.4. The complete version of the lemma is as follows.

Lemma G.3 (diakonikolas2020outlier ).

Fix any $0<\tau<1$. Let $S$ be a multiset of $n$ i.i.d. samples from a distribution on $\mathbb{R}^{d}$ with mean $\mu$ and covariance $\Sigma$. Let $\epsilon^{\prime}=\Theta(\log(1/\tau)/n+\epsilon)\leq c$, for a sufficiently small constant $c>0$. Then, with probability at least $1-\tau$, there exists a subset $S^{\prime}\subseteq S$ such that $|S^{\prime}|\geq(1-\epsilon^{\prime})n$ and $S^{\prime}$ is $(2\epsilon^{\prime},\delta(\tau))$-stable with respect to $\mu$ and $\|\Sigma\|$, where $\delta(\tau)=O(\sqrt{(\operatorname{r}(\Sigma)\log\operatorname{r}(\Sigma))/n}+\sqrt{\epsilon}+\sqrt{\log(1/\tau)/n})$. Here we use $\operatorname{r}(M)$ to denote the stable rank (or intrinsic dimension) of a positive semidefinite matrix, i.e., $\operatorname{r}(M):=\operatorname{tr}(M)/\|M\|$.

Following identical proof steps (recall proofs for our algorithms), we can express our excess risk bound in terms of the covariance matrix Σ\Sigma as follows:

DO~(Σϵ+tr(Σ)/n+dΣlog(1/τ)/n).D\cdot\tilde{O}\left(\sqrt{\|\Sigma\|\epsilon}+\sqrt{\operatorname{tr}{(\Sigma)}/n}+\sqrt{d\|\Sigma\|\log(1/\tau)/n}\right). (39)

In our paper, we consider the bounded covariance assumption $\Sigma\preceq\sigma^{2}I$, which is a standard assumption in the robust optimization literature; otherwise, we cannot control the error term $\sqrt{\|\Sigma\|\epsilon}$ due to corruption. In the worst case (e.g., the isotropic Gaussian), we have $\|\Sigma\|=\sigma^{2}$ and $\operatorname{tr}(\Sigma)=d\sigma^{2}$, so the bound reduces to

σO(ϵ+d/n+dlog(1/τ)/n).\sigma\cdot O\left(\sqrt{\epsilon}+\sqrt{d/n}+\sqrt{d\log(1/\tau)/n}\right). (40)

We see that the second term already contains the dependence on dd. Therefore, the dd factor in the last term due to our cover-based approach, specifically the use of the union bound over the cover points, does not affect the rate.

Appendix H An Exponential-Time Algorithm that Achieves the Minimax-Optimal Excess Risk Bound without the $\log d$ Factor

Our two algorithms achieve the minimax-optimal excess risk bound up to logarithmic factors. In this section, we show that the minimax-optimal excess risk bound can be achieved without the logd\log d factor, but at the cost of exponential time complexity.

Based on Lemma D.4, we can remove the $\log d$ factor when estimating the gradients by using the following framework, as shown in diakonikolas2020outlier (a short code sketch follows the list).

  1. Set $k=\lfloor\epsilon^{\prime}n\rfloor$. Randomly partition $S$ into $k$ buckets of size $\lfloor n/k\rfloor$ (remove the last bucket if $n$ is not divisible by $k$).

  2. Compute the empirical mean within each bucket and denote the means as $z_{1},\ldots,z_{k}$.

  3. Run stability-based robust mean estimation on the set $\{z_{1},\ldots,z_{k}\}$.
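Below is one possible NumPy sketch of this bucketing preprocessing. The stability-based estimator robust_mean used in the last step is treated as a black box; its interface here is an assumption for illustration, not something specified by the framework above.

```python
import numpy as np

def bucketed_robust_mean(samples, eps_prime, robust_mean, seed=0):
    """Bucket the samples, average within buckets, then robustly estimate the mean."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    k = max(1, int(np.floor(eps_prime * n)))       # number of buckets
    bucket_size = n // k
    perm = np.random.default_rng(seed).permutation(n)
    # Keep k full buckets; leftover samples (when n is not divisible by k) are dropped.
    buckets = perm[: k * bucket_size].reshape(k, bucket_size)
    bucket_means = samples[buckets].mean(axis=1)   # z_1, ..., z_k
    return robust_mean(bucket_means, eps_prime)    # black-box stability-based estimator
```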

Here, the first two steps serve as preprocessing before feeding the data into the robust mean estimation algorithm. We now restate the robust estimation result without the $\log d$ factor below.

Lemma H.1.

Let SS be an ϵ\epsilon-corrupted set of nn samples from a distribution in d\mathbb{R}^{d} with mean μ\mu and covariance Σσ2I\Sigma\preceq\sigma^{2}I. Let ϵ=Θ(log(1/τ)/n+ϵ)c\epsilon^{\prime}=\Theta(\log(1/\tau)/n+\epsilon)\leq c be given, for a constant c>0c>0. Then any stability-based algorithm on input SS and ϵ\epsilon^{\prime}, efficiently computes μ^\widehat{\mu} such that with probability at least 1τ1-\tau, we have μ^μ=σO(ϵ+d/n+log(1/τ)/n)\|\widehat{\mu}-\mu\|=\sigma\cdot O\left(\sqrt{\epsilon}+\sqrt{d/n}+\sqrt{\log(1/\tau)/n}\right).

We recall that our efficient implementation using grid points costs a $\log d$ factor due to the suboptimal cover size. Using a cover whose size matches the covering number $O((D/\xi)^{d})$ would remove the $\log d$ factor, but at the cost of exponential time complexity for constructing the cover and for finding a cover point within distance $O(\xi)$ of a given point.

Following the same proof steps, as in Section 3.1, we can derive the excess risk bound without the logd\log d factor, at the cost of exponential time complexity.

Appendix I Iterative Filtering Algorithm for Robust Mean Estimation

In our algorithms, we treat robust mean estimation as a black box. In this section, we provide an instantiation of such an algorithm due to diakonikolas2020outlier. Recall the robust mean estimation result (Lemma D.2); the following algorithm achieves this guarantee.

Algorithm 4 Iterative Filtering diakonikolas2020outlier
1:Input: ϵ\epsilon-corrupted set AdA\subset\mathbb{R}^{d} of nn samples that satisfies the properties in Lemma D.2.
2:Initialize weight function h:A0h:A\to\mathbb{R}_{\geq 0} with h(x)=1/|A|h(x)=1/|A| for all xAx\in A
3:while $\|h\|_{1}>1-2\epsilon$ do
4:     Compute μ(h)=1h1xAh(x)x\mu(h)=\frac{1}{\|h\|_{1}}\sum_{x\in A}h(x)x
5:     Compute $\Sigma(h)=\frac{1}{\|h\|_{1}}\sum_{x\in A}h(x)(x-\mu(h))(x-\mu(h))^{\top}$
6:     Compute approximate largest eigenvector vv of Σ(h)\Sigma(h)
7:     Define g(x)=|v(xμ(h))|2g(x)=|v\cdot(x-\mu(h))|^{2} for all xAx\in A
8:     Find largest tt such that xA:g(x)th(x)ϵ\sum_{x\in A:g(x)\geq t}h(x)\geq\epsilon
9:     Define f(x)={g(x)if g(x)t0otherwisef(x)=\begin{cases}g(x)&\text{if }g(x)\geq t\\ 0&\text{otherwise}\end{cases}
10:     Let m=max{f(x):xA,h(x)0}m=\max\{f(x):x\in A,h(x)\neq 0\}
11:     Update h(x)h(x)(1f(x)/m)h(x)\leftarrow h(x)(1-f(x)/m) for all xAx\in A
12:end while
13:Output: μ(h)\mu(h)

In each iteration, the algorithm filters out (down-weights) points that are ``far'' from the weighted sample mean along a direction of large variance. Note that this algorithm is deterministic and runs in polynomial time. Notably, it modifies the terminal condition of the algorithm described in diakonikolas2019recent so that it works even when $\sigma$ is unknown.
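For illustration, the following is a minimal NumPy transcription of Algorithm 4 (a sketch, not the authors' implementation). Numerical details such as the eigenvector computation and tie-breaking in the choice of $t$ are simplified, and the final weights $h$ are returned alongside the estimate since Appendix J makes use of $\Sigma(h)$.

```python
import numpy as np

def iterative_filter(A, eps):
    """Sketch of Algorithm 4: soft filtering along the top-variance direction."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    h = np.full(n, 1.0 / n)                        # uniform initial weights, ||h||_1 = 1
    while h.sum() > 1.0 - 2.0 * eps:               # stop once ~2*eps of the weight is removed
        w = h / h.sum()
        mu = w @ A                                 # weighted mean mu(h)
        centered = A - mu
        Sigma = (centered * w[:, None]).T @ centered   # weighted covariance Sigma(h)
        v = np.linalg.eigh(Sigma)[1][:, -1]        # (approximate) top eigenvector
        g = (centered @ v) ** 2                    # squared projections g(x)
        order = np.argsort(-g)                     # scores in decreasing order
        cum = np.cumsum(h[order])
        t = g[order][np.searchsorted(cum, eps)]    # largest t with >= eps weight at or above it
        f = np.where(g >= t, g, 0.0)
        m = f[h > 0].max()
        if m <= 0:                                 # degenerate case: nothing left to filter
            break
        h = h * (1.0 - f / m)                      # down-weight points far from the mean
    return (h / h.sum()) @ A, h                    # estimate mu(h) and the final weights
```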

Appendix J Dealing with Unknown σ\sigma

We can adapt Algorithm 1 to work without knowing σ\sigma by first getting a lower bound on σ\sigma using the filtering algorithm in Appendix I.

In Algorithm 1, we use $\sigma$ only to determine the fineness of the covering via $\xi=\sigma\sqrt{\epsilon}/\bar{\beta}$. A smaller $\xi$ results in a finer covering and consequently reduces the error incurred by evaluating gradients at the cover point $w^{\prime}$ instead of $w$; that is, (18) still holds with a smaller $\xi$. Since the excess risk depends on $\xi$ only through logarithmic terms, the same analysis (see Appendix C) holds with a smaller $\xi$. It then suffices to choose $\xi=\hat{\sigma}\delta/\bar{\beta}$, where $\hat{\sigma}$ is a lower bound on $\sigma$. We also need to choose $T=\tilde{\Omega}\left(\frac{\beta D}{\hat{\sigma}\sqrt{\epsilon}+\hat{\sigma}\sqrt{\frac{d\log(1/\tau)}{n}}}\right)$, where we use $\hat{\sigma}$ in place of $\sigma$. When using smoothing to handle nonsmooth losses, we can choose $\beta$ similarly by replacing $\sigma$ with $\hat{\sigma}$.

Recall that Algorithm 4 works even when σ\sigma is unknown. Moreover, the output hh satisfies Σ(h)σ2(1+O(δ2/ϵ))\|\Sigma(h)\|\leq\sigma^{2}(1+O(\delta^{2}/\epsilon)) (see diakonikolas2020outlier ). It follows that we can use Σ(h)\|\Sigma(h)\| to obtain a lower bound on σ\sigma. Using Lemma D.4, at any fixed ww, we can run Algorithm 4 with input A={fi(w)}i=1nA=\{\nabla f_{i}(w)\}_{i=1}^{n} to obtain a lower bound σ^\hat{\sigma} on σ\sigma. We have that (plugging in δ(τ)\delta(\tau^{\prime}) in Lemma D.4), with probability at least 1τ1-\tau^{\prime},

Σ(h)σ2(1+O(δ2/ϵ)),\|\Sigma(h)\|\leq\sigma^{2}(1+O(\delta^{2}/\epsilon)), (41)

where δ=O~(ϵ+d/n+log(1/τ)/n)\delta=\tilde{O}\left(\sqrt{\epsilon}+\sqrt{d/n}+\sqrt{\log(1/\tau^{\prime})/n}\right). Therefore, σ^:=Σ(h)/1+O(δ2/ϵ)\hat{\sigma}:=\sqrt{\|\Sigma(h)\|}/\sqrt{1+O(\delta^{2}/\epsilon)} is a lower bound on σ\sigma with probability at least 1τ1-\tau^{\prime}.
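A minimal sketch of this estimate is given below (it reuses the iterative_filter sketch from Appendix I, which returns the final weights $h$ alongside the estimate; C_const stands in for the unspecified constant in the $O(\delta^{2}/\epsilon)$ term).

```python
import numpy as np

def sigma_lower_bound(grads_at_w, eps, delta, C_const=1.0):
    """Sketch: lower-bound sigma via the spectral norm of the weighted covariance Sigma(h)."""
    grads_at_w = np.asarray(grads_at_w, dtype=float)
    _, h = iterative_filter(grads_at_w, eps)        # weights from the filtering sketch (Appendix I)
    w = h / h.sum()
    mu = w @ grads_at_w
    centered = grads_at_w - mu
    Sigma_h = (centered * w[:, None]).T @ centered  # weighted covariance Sigma(h)
    spec = np.linalg.norm(Sigma_h, 2)               # spectral norm ||Sigma(h)||
    return np.sqrt(spec / (1.0 + C_const * delta**2 / eps))
```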

Our modified algorithm is as follows. (1) Estimate $\sigma$: choose a point $w$ and run Algorithm 4 with input $A=\{\nabla f_{i}(w)\}_{i=1}^{n}$ to obtain a lower bound $\hat{\sigma}$. (2) Run Algorithm 1 with $\xi=\hat{\sigma}\delta/\bar{\beta}$.

To account for the failure probability of the estimation of $\hat{\sigma}$, we can set $\tau^{\prime}=\tau/(1+|\mathcal{C}|)$, so that the overall failure probability is at most $\tau$.