
Corruption-tolerant Algorithms for Generalized Linear Models

Bhaskar Mukhoty (Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; work done while the author was a student at IIT Kanpur)    Debojyoti Dey (Indian Institute of Technology Kanpur, Uttar Pradesh, India)    Purushottam Kar (Indian Institute of Technology Kanpur, Uttar Pradesh, India)
[email protected], {debojyot,purushot}@cse.iitk.ac.in
Abstract

This paper presents SVAM (Sequential Variance-Altered MLE), a unified framework for learning generalized linear models under adversarial label corruption in training data. SVAM extends to tasks such as least squares regression, logistic regression, and gamma regression, whereas many existing works on learning with label corruptions focus only on least squares regression. SVAM is based on a novel variance reduction technique that may be of independent interest and works by iteratively solving weighted MLEs over variance-altered versions of the GLM objective. SVAM offers provable model recovery guarantees superior to the state-of-the-art for robust regression even when a constant fraction of training labels are adversarially corrupted. SVAM also empirically outperforms several existing problem-specific techniques for robust regression and classification. Code for SVAM is available at https://github.com/purushottamkar/svam/

1 Introduction

Generalized linear models (GLMs) [17] are effective models for a variety of discrete and continuous label spaces, allowing the prediction of binary or count-valued labels (logistic, Poisson regression) as well as real-valued labels (gamma, least-squares regression). Inference in a GLM involves two steps: given a feature vector $\mathbf{x} \in \mathbb{R}^d$ and model parameters $\mathbf{w}^\ast$, a canonical parameter is generated as $\theta := \langle \mathbf{w}^\ast, \mathbf{x} \rangle$, then the label $y$ is sampled from the exponential family distribution

$$\mathbb{P}[y \,|\, \theta] = \exp(y \cdot \theta - \psi(\theta) - h(y)),$$

where the function $h(\cdot)$ is specific to the GLM and $\psi(\cdot)$ is a normalization term, also known as the log-partition function. It is common to use a non-canonical link such as $\theta := \exp(\langle \mathbf{w}^\ast, \mathbf{x} \rangle)$ for the gamma distribution. GLMs also admit vector-valued labels $\mathbf{y} \in \mathbb{R}^n$ by replacing the scalar product with the inner product $\langle \mathbf{y}, \boldsymbol{\eta} \rangle$, where $\boldsymbol{\eta} := \mathbf{X}\mathbf{w}^\ast$ is the canonical parameter and $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the covariate matrix.
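To make the data model concrete, the following is a minimal numpy sketch (not the paper's released code; all names and the chosen parameter values are illustrative) of how labels could be sampled in the three GLMs considered here, using the canonical link for least-squares and logistic regression and the non-canonical link $\eta = \exp(\langle \mathbf{w}^\ast, \mathbf{x} \rangle)$ for gamma regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))           # covariates
w_star = rng.standard_normal(d)           # true model
w_star /= np.linalg.norm(w_star)

# Least-squares (Gaussian) GLM: y ~ N(<w*, x>, 1/beta*)
beta_star = 25.0
y_gauss = X @ w_star + rng.normal(0.0, 1.0 / np.sqrt(beta_star), size=n)

# Logistic GLM: P[y = +1 | x] = 1 / (1 + exp(-<w*, x>))
p = 1.0 / (1.0 + np.exp(-(X @ w_star)))
y_logit = np.where(rng.random(n) < p, 1, -1)

# Gamma GLM with the non-canonical link eta = exp(<w*, x>) and shape 1/phi
phi = 0.5
eta = np.exp(X @ w_star)
y_gamma = rng.gamma(shape=1.0 / phi, scale=phi / eta)   # rate eta/phi, mean 1/eta
```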

Problem Description: Given data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$ generated using a known GLM but unknown model parameters $\mathbf{w}^\ast$, statistically efficient techniques exist to recover a consistent estimate of the model $\mathbf{w}^\ast$ [14]. However, these techniques break down if several observed labels $y_i$ are corrupted, not just by random statistical noise but by adversarially generated structured noise. Suppose $k < n$ labels are corrupted, i.e. for some $k$ data points $i_1, \ldots, i_k$, the actual labels $y_{i_j}, j = 1, \ldots, k$ generated by the GLM are replaced by the adversary with corrupted ones, say $\tilde{y}_{i_j}$. Can we still recover $\mathbf{w}^\ast$? Note that the learning algorithm is unaware of which points are corrupted.

Breakdown Point: The largest fraction $\alpha = k/n$ of corruptions that a learning algorithm can tolerate while still offering an estimate of $\mathbf{w}^\ast$ with bounded error is known as its breakdown point. This paper proposes the SVAM algorithm that can tolerate $k = \Omega(n)$ corruptions, i.e. $\alpha = \Omega(1)$.

Adversary Models: Contamination of the training labels $y_1, \ldots, y_n$ by an adversary can misguide the learning algorithm into selecting model parameters of the adversary's choice. An adversary has to choose (1) which labels $i_1, \ldots, i_k$ to corrupt and (2) what corrupted labels $\tilde{y}_{i_1}, \ldots, \tilde{y}_{i_k}$ to put there. Adversary models emerge based on what information the adversary can consult while making these choices. The oblivious adversary must make both choices with no access to the original data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$ or the true model $\mathbf{w}^\ast$ and thus can only corrupt a random/fixed subset of $k$ labels by sampling $\tilde{y}_{i_j}$ from some predetermined noise distribution. This is also known as the Huber noise model. On the other hand, a fully adaptive adversary has full access to the original data and true model while making both choices. Finally, the partially adaptive adversary must choose the corruption locations without knowledge of the original data or true model but has full access to these while deciding the corrupted labels. See Appendix B for details.

Contributions: This paper describes the SVAM (Sequential Variance-Altered MLE) framework that offers:
1. robust estimation with a breakdown point $\alpha = \Omega(1)$ against partially and fully adaptive adversaries for robust least-squares regression and mean estimation, and $\alpha = \Omega(1/\sqrt{d})$ for robust gamma regression. Prior works do not offer any breakdown point for gamma regression.
2. exact recovery of the true model $\mathbf{w}^\ast$ against a fully adaptive adversary for the case of least-squares regression,
3. the use of a variance reduction technique (see §3.1) in robust learning, which is novel to the best of our knowledge,
4. extensive empirical evaluation demonstrating that despite being a generic framework, SVAM is competitive with or outperforms algorithms specifically designed to solve problems such as least-squares and logistic regression.

2 Related Works

In the interest of space, we review aspects of literature most related to SVAM and refer to others [6, 15] for a detailed review.

Robust GLM learning has been studied in a variety of settings. [2] considered an oblivious adversary (Huber's noise model) but offered a breakdown point of $\alpha = \mathcal{O}(\frac{1}{\sqrt{n}})$, i.e. it tolerates $k \leq \mathcal{O}(\sqrt{n})$ corruptions. [25] solve robust GLM estimation by solving M-estimation problems. However, they require the magnitude of the corruptions to be upper-bounded by some constant, i.e. $|y_i - \tilde{y}_i| \leq \mathcal{O}(1)$, and offer a breakdown point of $\alpha = \mathcal{O}(\frac{1}{\sqrt{n}})$. Moreover, their approach solves $L_1$-regularized problems using projected gradient descent, which converges slowly. In contrast, SVAM offers a linear rate of convergence, offers a breakdown point of $\alpha = \Omega(1)$, i.e. it tolerates $k = \Omega(n)$ corruptions, and can tolerate corruptions of unbounded magnitude introduced by a partially or fully adaptive adversary.

Specific GLMs such as robust regression have received focused attention. Here the model is $\mathbf{y} = X\mathbf{w}^\ast + \mathbf{b}$, where $X \in \mathbb{R}^{n \times d}$ is the feature matrix and $\mathbf{b}$ is a $k$-sparse corruption vector denoting the adversarial corruptions. A variant studies a hybrid noise model that replaces the zero entries of $\mathbf{b}$ with Gaussian noise $\mathcal{N}(0, \sigma^2)$. [18, 24] solve an $L_1$ minimization problem which is slow in practice (see §5). [1] use hard thresholding techniques to estimate the subset of uncorrupted points while [15] modify the IRLS algorithm to do so. However, [1, 15] are unable to offer consistent model estimates in the hybrid noise model even if the corruption rate $\alpha = k/n \rightarrow 0$, which is surprising since $\alpha \rightarrow 0$ implies vanishing corruption. In contrast, SVAM offers consistent model recovery in the hybrid noise model against a fully adaptive adversary when $\alpha \rightarrow 0$. [21] also offer consistent recovery with breakdown points $\alpha > 0.5$ but assume an oblivious adversary.

Robust classification with $y_i \in \{-1, +1\}$ has been explored using robust surrogate loss functions [16] and ranking [8, 19] techniques. These works do not offer breakdown points but offer empirical comparisons.

Robust mean estimation entails recovering an estimate $\hat{\boldsymbol{\mu}} \in \mathbb{R}^d$ of the mean $\boldsymbol{\mu}^\ast$ of a multivariate Gaussian $\mathcal{N}(\boldsymbol{\mu}^\ast, \Sigma)$ given $n$ samples, of which an $\alpha$ fraction are corrupted [11]. The estimation error is known to be lower bounded as $\|\hat{\boldsymbol{\mu}} - \boldsymbol{\mu}^\ast\|_2 \geq \Omega(\alpha\sqrt{\log\frac{1}{\alpha}})$ for this problem even if $n \to \infty$ [5]. [6] use convex programming techniques and offer $\mathcal{O}(\alpha\log^{3/2}\frac{1}{\alpha})$ error given $n \geq \tilde{\Omega}(\frac{d^2}{\alpha^2})$ samples and a $\text{poly}(n, d, \frac{1}{\alpha})$ runtime. [3] improve the running time to $\frac{\tilde{\mathcal{O}}(nd)}{\text{poly}(\alpha)}$. The recent work of [4] uses an IRLS-style approach that internally relies on expensive SDP calls but offers high breakdown points. SVAM uses $n = \mathcal{O}(\log^2\frac{1}{\alpha})$ samples and offers a recovery error of $\mathcal{O}(\text{trace}(\Sigma)(\log\frac{1}{\alpha})^{-1/2})$. This is comparable to existing works if $\text{trace}(\Sigma) = \mathcal{O}(1)$. Moreover, SVAM is much faster and simpler to implement in practice.

Meta-algorithms such as robust gradient techniques, median-of-means [12], tilted ERM [13], and the maximum correntropy criterion [9] have been studied. SEVER [7] uses the gradient covariance matrix to filter out outliers along its largest eigenspace while RGD [20] uses robust gradient estimates to perform robust first-order optimization directly. While convenient to execute, these methods may require larger training sets, e.g., SEVER requires $n > d^5$ samples for robust least-squares regression whereas SVAM requires $n > \Omega(d\log(d))$. In terms of recovery guarantees, for least-squares regression without Gaussian noise, SVAM and other methods [1, 15] offer exact recovery of $\mathbf{w}^\ast$ so long as the fraction of corrupted points is less than the breakdown point, while SEVER's error continues to be bounded away from zero. RGD only considers an oblivious/Huber adversary while SVAM can tolerate partially/fully adaptive adversaries. SEVER does not report an explicit breakdown point, RGD offers a breakdown point of $\alpha = 1/\log d$ (see Thm 2 in their paper), while SVAM offers an explicit breakdown point independent of $d$. SVAM also offers faster convergence than existing methods such as SEVER and RGD.

3 The SVAM Algorithm

A popular approach in robust learning is to assign weights to data points, hoping that large weights are given to uncorrupted points and low weights to corrupted ones, followed by weighted likelihood maximization. Often the weights are updated and the process repeated. [2] use Huber-style weighting functions used in Mallows-type M-estimators, [15] use truncated inverse residuals, and [22] use Mahalanobis distance-based weights.

SVAM notes that the label likelihood offers a natural measure of how likely a point is to be uncorrupted. Given a model estimate $\hat{\mathbf{w}}^t$ at iteration $t$, the weight $s_i = \mathbb{P}[y_i \,|\, \eta_i^t] = \exp(y_i \cdot \eta_i^t - \psi(\eta_i^t) - h(y_i))$ can be assigned to the $i$-th point, where $\eta_i^t = \langle \hat{\mathbf{w}}^t, \mathbf{x}^i \rangle$ (recall that for gamma/Poisson regression we need to set $\eta_i^t = \exp(\langle \hat{\mathbf{w}}^t, \mathbf{x}^i \rangle)$ given the non-canonical link for these problems). This gives us the weighted MLE $\tilde{Q}(\mathbf{w} \,|\, \hat{\mathbf{w}}^t) = -\sum_{i=1}^n s_i \cdot \log\mathbb{P}[y_i \,|\, \langle \mathbf{w}, \mathbf{x}^i \rangle]$, solving which gives us the next model iterate as

$$\hat{\mathbf{w}}^{t+1} = \arg\min_{\mathbf{w} \in \mathbb{R}^d} \tilde{Q}(\mathbf{w} \,|\, \hat{\mathbf{w}}^t) \qquad (1)$$

However, as §5 will show, this strategy does not perform well. If the initial model $\hat{\mathbf{w}}^1$ is far from $\mathbf{w}^\ast$, it may result in imprecise weights $s_i$ that are large for the corrupted points. For example, if the adversary introduces corruptions using a different model $\tilde{\mathbf{w}}$, i.e. $\tilde{y}_{i_j} \sim \mathbb{P}[y \,|\, \langle \tilde{\mathbf{w}}, \mathbf{x}^{i_j} \rangle], j \in [k]$, and we happen to initialize close to $\tilde{\mathbf{w}}$, i.e. $\hat{\mathbf{w}}^1 \approx \tilde{\mathbf{w}}$, then it is the corrupted points that get large weights initially, which may cause the algorithm to converge to $\tilde{\mathbf{w}}$ itself.

Key Idea: It is thus better to avoid drastic decisions, say setting $s_i \gg 0$ in the initial stages, no matter how clean a data point appears to be. SVAM implements this intuition by setting weights using a label likelihood distribution with very large variance initially. This ensures that no data point (not even the uncorrupted ones) gets a large weight (cf. the uniform distribution, which has large variance and assigns no point a high density). As SVAM progresses towards $\mathbf{w}^\ast$, it starts using likelihood distributions with progressively lower variance. This allows data points (hopefully the uncorrupted ones) to get larger weights (cf. the Dirac delta distribution, which has vanishing variance and assigns high density to isolated points).

3.1 Mode-preserving Variance-altering Likelihood Transformations

Table 1: Some common distributions and their variance-altered forms. Note that in all cases, the form of the distribution is preserved after transformation, and the variance asymptotically goes down at the rate $\Theta(1/\beta)$ as $\beta \rightarrow \infty$.

  • Gaussian (univariate), $\mathcal{N}(y \,|\, \eta)$: standard form $\sqrt{\frac{1}{2\pi}}\exp(-\frac{1}{2}(y-\eta)^2)$; variance-altered form $\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(y-\eta)^2)$; variance $\frac{1}{\beta}$; asymptotic form ($\beta \rightarrow \infty$) $\delta_\eta(y)$.
  • Gaussian (multivariate), $\mathcal{N}(\mathbf{y} \,|\, \boldsymbol{\eta})$: standard form $(\frac{1}{2\pi})^{d/2}\exp(-\frac{1}{2}\|\mathbf{y}-\boldsymbol{\eta}\|_2^2)$; variance-altered form $(\frac{\beta}{2\pi})^{d/2}\exp(-\frac{\beta}{2}\|\mathbf{y}-\boldsymbol{\eta}\|_2^2)$; variance $\frac{1}{\beta}$; asymptotic form $\delta_{\boldsymbol{\eta}}(\mathbf{y})$.
  • Bernoulli, $y \in \{-1, +1\}$: standard form $\mathbb{P}[y=1 \,|\, \eta] = \pi$ with $\pi = (1+\exp(-y\eta))^{-1}$; variance-altered form $\mathbb{P}[y=1 \,|\, \eta] = \tilde{\pi}$ with $\tilde{\pi} = (1+\exp(-\beta y\eta))^{-1}$; variance $< \frac{1}{\beta\eta}$; asymptotic form $\delta_{\text{sign}(\eta)}(y)$.
  • Gamma, $\mathcal{G}(y \,|\, \eta, \phi)$ with $\phi < 1$ and the non-canonical link $\eta = \exp(\langle \mathbf{w}, \mathbf{x} \rangle)$: standard form $\frac{1}{y\Gamma(1/\phi)}(\frac{y\eta}{\phi})^{1/\phi}\exp(-\frac{y\eta}{\phi})$; variance-altered form $\frac{1}{y\Gamma(1/\tilde{\phi}_\beta)}(\frac{y\tilde{\eta}_\beta}{\tilde{\phi}_\beta})^{1/\tilde{\phi}_\beta}\exp(-\frac{y\tilde{\eta}_\beta}{\tilde{\phi}_\beta})$ where $\tilde{\phi}_\beta = \phi/(\phi + \beta(1-\phi))$ and $\tilde{\eta}_\beta = \eta\beta/(\phi + \beta(1-\phi))$; variance $\frac{\phi}{\eta^2}\cdot\frac{\phi + \beta(1-\phi)}{\beta^2}$; asymptotic form $\delta_{\frac{1-\phi}{\eta}}(y)$.

[Figures accompanying Table 1, showing how the Gaussian and gamma likelihood functions change with varying values of $\beta$ while remaining order/mode preserving, are omitted here.]

To implement the above strategy, SVAM (Algorithm 1) needs techniques to alter the variance of a likelihood distribution at will. Note that the likelihood values of the altered distributions must be computable as they will be used as weights $s_i$, i.e. merely being able to sample the distribution is not enough. Moreover, the transformation must be order-preserving: if the original and transformed distributions are $\mathbb{P}$ and $\tilde{\mathbb{P}}$ respectively, then for every pair of labels $y, y'$ and every parameter value $\eta$, we must have $\mathbb{P}[y \,|\, \eta] > \mathbb{P}[y' \,|\, \eta] \Leftrightarrow \tilde{\mathbb{P}}[y \,|\, \eta] > \tilde{\mathbb{P}}[y' \,|\, \eta]$. If this is not true, then SVAM could exhibit anomalous behavior.

The Transformation: If $\mathbb{P}[y \,|\, \eta] = \exp(y \cdot \eta - \psi(\eta) - h(y))$ is an exponential family distribution with parameter $\eta$ and log-partition function $\psi(\eta) = \log\int\exp(y \cdot \eta - h(y))\, dy$, then for any $\beta > 0$, we get the variance-altered density

$$\tilde{\mathbb{P}}_\beta[y \,|\, \eta] = \frac{1}{Z(\eta, \beta)}\exp(\beta \cdot (y \cdot \eta - \psi(\eta) - h(y))),$$

where $Z(\eta, \beta) = \int\exp(\beta \cdot (y \cdot \eta - \psi(\eta) - h(y)))\, dy$. This transformation is order- and mode-preserving since $x^\beta$ is an increasing function for any $\beta > 0$. This generalized likelihood distribution has variance [17] $\frac{1}{\beta}\nabla^2\psi(\eta)$, which tends to $0$ as $\beta \rightarrow \infty$. Table 1 lists a few popular distributions, their variance-altered versions, and their asymptotic versions as $\beta \rightarrow \infty$.
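As a small illustration of the transformation (a sketch assuming numpy only and scalar labels; not from the released code), the variance-altered Gaussian and Bernoulli forms from Table 1 can be computed directly and used as weights:

```python
import numpy as np

def gaussian_weight(y, eta, beta):
    """beta-variance-altered Gaussian likelihood, i.e. N(y | eta, 1/beta)."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (y - eta) ** 2)

def bernoulli_weight(y, eta, beta):
    """beta-altered Bernoulli likelihood for labels y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-beta * y * eta))

# As beta grows, points with small residuals keep comparatively large weights while
# points with large residuals are down-weighted aggressively.
print(gaussian_weight(0.1, 0.0, beta=0.01), gaussian_weight(5.0, 0.0, beta=0.01))
print(gaussian_weight(0.1, 0.0, beta=10.0), gaussian_weight(5.0, 0.0, beta=10.0))
```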

We note that [10] also study variance-altering transformations for learning hidden Markov models, topic models, etc. However, their transformations are unsuitable for use in SVAM for a few reasons:
1. SVAM's transformed distributions are always available in closed form whereas those of [10] are not necessarily available in closed form.
2. SVAM's transformations are order-preserving while [10] offer mean-preserving transformations that are not guaranteed to be order-preserving.

The Algorithm: As presented in Algorithm 1, SVAM repeatedly constructs weighted MLEs $\tilde{Q}_\beta(\mathbf{w} \,|\, \hat{\mathbf{w}}^t)$ that use $\beta$-variance-altered weights $s_i = \tilde{\mathbb{P}}_\beta[y_i \,|\, \langle \hat{\mathbf{w}}^t, \mathbf{x}^i \rangle]$ for all $i \in [n]$ and solves them to get new model estimates.

Algorithm 1 SVAM: Sequential Variance-Altered MLE
Input: Data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$, initial model $\hat{\mathbf{w}}^1$, initial scale $\beta_1$, scale increment $\xi > 1$, likelihood dist. $\mathbb{P}[\cdot \,|\, \cdot]$
1: for $t = 1, 2, \ldots, T-1$ do
2:    $s_i^t \leftarrow \tilde{\mathbb{P}}_{\beta_t}[y_i \,|\, \langle \hat{\mathbf{w}}^t, \mathbf{x}^i \rangle]$  // $\beta_t$-variance-altered $\mathbb{P}[\cdot \,|\, \cdot]$
3:    $\tilde{Q}_{\beta_t}(\mathbf{w} \,|\, \hat{\mathbf{w}}^t) \overset{\mathrm{def}}{=} -\sum_{i=1}^n s_i^t \cdot \log\mathbb{P}[y_i \,|\, \langle \mathbf{w}, \mathbf{x}^i \rangle]$
4:    $\hat{\mathbf{w}}^{t+1} = \arg\min_{\mathbf{w}} \tilde{Q}_{\beta_t}(\mathbf{w} \,|\, \hat{\mathbf{w}}^t)$
5:    $\beta_{t+1} \leftarrow \xi \cdot \beta_t$  // variance of $\tilde{\mathbb{P}}_\beta[\cdot \,|\, \cdot]$ decreases as $\beta$ increases
6: end for
7: return $\hat{\mathbf{w}}^T$
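For concreteness, the generic loop in Algorithm 1 could be sketched as below; `weight_fn` and `solve_fn` are hypothetical problem-specific callables supplied by the caller (computing the $\beta$-variance-altered likelihoods and solving the weighted MLE respectively), not functions from the released code:

```python
import numpy as np

def svam(X, y, w_init, beta1, xi, T, weight_fn, solve_fn):
    """Generic SVAM loop: reweight, solve the weighted MLE, then shrink the variance."""
    w, beta = np.array(w_init, dtype=float), float(beta1)
    for _ in range(T - 1):
        s = weight_fn(X, y, w, beta)   # step 2: s_i = P~_beta[y_i | <w, x_i>]
        w = solve_fn(X, y, s)          # steps 3-4: weighted MLE
        beta *= xi                     # step 5: decrease the likelihood variance
    return w
```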

We pause to note that the approach in [15], although similar at first glance to Eq (1), applies only to least-squares regression as it relies on notions of residuals missing from other GLMs. In contrast, SVAM works for all GLMs, e.g. least-squares/logistic/gamma regression, and offers stronger theoretical guarantees.

Theorem 1 shows that SVAM enjoys a linear rate of convergence. However, we first define notions of Local Weighted Strong Convexity and Lipschitz Continuity. Let $\mathcal{B}_2(\mathbf{v}, r) := \{\mathbf{w} : \|\mathbf{w} - \mathbf{v}\|_2 \leq r\}$ denote the $L_2$ ball of radius $r$ centered at the vector $\mathbf{v} \in \mathbb{R}^d$.

Definition 1 (LWSC/LWLC).

Given data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$ and $\beta > 0$, an exponential family distribution $\mathbb{P}[\cdot \,|\, \cdot]$ is said to satisfy $\lambda_\beta$-Local Weighted Strong Convexity and $\Lambda_\beta$-Local Weighted Lipschitz Continuity if for any true model $\mathbf{w}^\ast$ and any $\mathbf{u}, \mathbf{v} \in \mathcal{B}_2(\mathbf{w}^\ast, \sqrt{\frac{1}{\beta}})$ the following hold:

1. $\nabla^2\tilde{Q}_\beta(\mathbf{v} \,|\, \mathbf{u}) \overset{\mathrm{def}}{=} \left.\nabla^2\tilde{Q}_\beta(\cdot \,|\, \mathbf{u})\right|_{\mathbf{v}} \succeq \lambda_\beta \cdot I$

2. $\left\|\nabla\tilde{Q}_\beta(\mathbf{w}^\ast \,|\, \mathbf{u})\right\|_2 \overset{\mathrm{def}}{=} \left\|\left.\nabla\tilde{Q}_\beta(\cdot \,|\, \mathbf{u})\right|_{\mathbf{w}^\ast}\right\|_2 \leq \Lambda_\beta$

The above requires the $\tilde{Q}_\beta$-function to be strongly convex and Lipschitz continuous in a ball of radius $\frac{1}{\sqrt{\beta}}$ around the true model $\mathbf{w}^\ast$, i.e. as $\beta$ increases, the neighborhood in which these properties are required shrinks. We will show that likelihood functions corresponding to GLMs, e.g. least-squares and gamma regression, satisfy these properties for appropriate ranges of $\beta$, even in the presence of corrupted samples.

Theorem 1 (SVAM convergence).

If the data and likelihood distribution satisfy the LWSC/LWLC properties for all $\beta \in (0, \beta_{\max}]$ and if SVAM is initialized at $\hat{\mathbf{w}}^1$ and scale $\beta_1 > 0$ s.t. $\beta_1 \cdot \|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2^2 \leq 1$, then for any $\epsilon > 1/\beta_{\max}$ and a small enough scale increment $\xi > 1$, SVAM ensures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \epsilon$ after $T = \mathcal{O}(\log\frac{1}{\epsilon})$ iterations.

It is useful to take a moment to analyze this result. Note that if the LWSC/LWLC properties hold for larger values of $\beta$, SVAM is able to offer smaller model recovery errors. Let us take least-squares regression with hybrid noise (see §4) as an example. The proofs will show that the LWSC/LWLC properties are assured for $\beta$ as large as $\beta_{\max} = \tilde{\mathcal{O}}(\min\{\frac{1}{\alpha^{2/3}}, \sqrt{\frac{n}{d}}\})$ (see §4). Thus, with proper initialization of $\hat{\mathbf{w}}^1, \xi$ and $\beta_1$ (discussed below), SVAM ensures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \tilde{\mathcal{O}}(\max\{\alpha^{2/3}, \sqrt{\frac{d}{n}}\})$ within $T = \mathcal{O}(\ln(n))$ steps. This holds so long as SVAM is offered at least $n = \Omega(d\log d)$ training samples.

Figure 1 (panels: (a) $\beta_1$ Sensitivity, (b) $\xi$ Sensitivity, (c) $\alpha$ Sensitivity, (d) $d$ Sensitivity): SVAM offers stable convergence and recovery superior to competitor algorithms for a wide range of hyperparameters $\beta_1, \xi$, corruption rates $\alpha = k/n$, and feature dimensionality $d$.

Initialization: SVAM needs to be invoked with $\hat{\mathbf{w}}^1, \beta_1$ that satisfy the requirements of Thm 1 and a small enough $\xi$. If we initialize at the origin, i.e. $\hat{\mathbf{w}}^1 = \mathbf{0}$, then Theorem 1's requirement translates to $\beta_1 \leq \frac{1}{\|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2^2}$, i.e. we need only find a small enough $\beta_1$. Thus, SVAM needs to tune two scalars $\xi, \beta_1$ to take small enough values, which it does as described below. In practice, SVAM offered stable performance for a wide range of $\beta_1, \xi$ (see Fig 1).

Hyperparameter Tuning: SVAM's two hyperparameters $\beta_1, \xi$ were tuned using a held-out validation set. As the validation data could also contain corruptions, the validation error was calculated by rejecting the top $\alpha$ fraction of validation points with the highest prediction error. The true value of $\alpha$ was provided to competitor algorithms as a handicap but not to SVAM. Thus, $\alpha$ itself was treated as a (third) hyperparameter for SVAM.
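The trimmed validation error could be computed as in the following sketch (illustrative names and squared error assumed; the released code may differ): the top $\alpha$ fraction of validation points with the largest prediction error is dropped before averaging.

```python
import numpy as np

def trimmed_validation_error(y_val, y_pred, alpha):
    """Average prediction error after rejecting the top-alpha fraction of errors."""
    err = (y_val - y_pred) ** 2
    keep = int(np.ceil((1.0 - alpha) * len(err)))
    return np.sort(err)[:keep].mean()   # average over the smallest errors only
```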

4 Robust GLM Applications with SVAM

This section adapts SVAM to robust least-squares/gamma/logistic regression and robust mean estimation and establishes breakdown points and LWSC/LWLC guarantees for their respective $\tilde{Q}_\beta$ functions (see Defn 1). We refer the reader to §1 for definitions of partially/fully adaptive adversaries.

Robust Least Squares Regression. We have $n$ data points $(\mathbf{x}^i, y_i)$, $\mathbf{x}^i \in \mathbb{R}^d$, sampled from a sub-Gaussian distribution $\mathcal{D}$ over $\mathbb{R}^d$. We consider the hybrid corruption setting where, on the $G = (1-\alpha) \cdot n$ "good" data points, we get labels $y_i = \langle \mathbf{w}^\ast, \mathbf{x}^i \rangle + \epsilon_i$ with Gaussian noise $\epsilon_i \sim \mathcal{N}(0, \frac{1}{\beta^\ast})$ of variance $\frac{1}{\beta^\ast}$ added. On the remaining $B = \alpha \cdot n$ "bad" points, we get adversarially corrupted labels $\tilde{y}_i = \langle \mathbf{w}^\ast, \mathbf{x}^i \rangle + b_i$ where $b_i \in \mathbb{R}$ is chosen by the adversary. Note that $b_i$ can be unbounded. We also consider the pure corruption setting where clean points receive no Gaussian noise (this corresponds to $\beta^\ast = \infty$). SVAM-RR (Alg. 2) adapts SVAM to the task of robust regression.

Theorem 2 (Partially Adaptive Adversary).

For hybrid corruptions by a partially adaptive adversary with corruption rate $\alpha \leq 0.18$, there exists $\xi > 1$ s.t. with probability at least $1 - \exp(-\Omega(d))$, the LWSC/LWLC properties are satisfied for the $\tilde{Q}_\beta$ function for $\beta$ values as large as $\beta_{\max} = \mathcal{O}(\beta^\ast\min\{\frac{1}{\alpha^{2/3}}, \sqrt{\frac{n}{d\log(n)}}\})$. If initialized with $\hat{\mathbf{w}}^1, \beta_1$ s.t. $\beta_1 \cdot \|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2^2 \leq 1$, SVAM-RR assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \mathcal{O}(\frac{1}{\beta^\ast}\max\{\alpha^{2/3}, \sqrt{\frac{d\log(n)}{n}}\})$ within $T \leq \mathcal{O}(\log\frac{n}{\beta_1})$ iterations. For pure corruptions by a partially adaptive adversary, we have $\beta_{\max} = \infty$ and thus, for any $\epsilon > 0$, SVAM-RR assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \epsilon$ within $T \leq \mathcal{O}(\log\frac{1}{\epsilon\beta_1})$ iterations.

Note that in the pure corruption setting, SVAM assures exact recovery of $\mathbf{w}^\ast$ simply by running the algorithm long enough. This is not a contradiction since in this case the LWSC/LWLC properties can be shown to hold for all values of $\beta < \infty$, as we effectively have $\beta^\ast = \infty$. Thm 2 holds against a partially adaptive adversary but can be extended to a fully adaptive adversary at the cost of a worse breakdown point (see Thm 3 below). Note that SVAM continues to assure exact recovery of $\mathbf{w}^\ast$.

Theorem 3 (Fully Adaptive Adversary).

For pure corruptions by a fully adaptive adversary with corruption rate $\alpha \leq 0.0036$, LWSC/LWLC are satisfied for all $\beta \in (0, \infty)$, i.e. $\beta_{\max} = \infty$, and for any $\epsilon > 0$, SVAM-RR assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \epsilon$ within $T \leq \mathcal{O}(\log\frac{1}{\epsilon\beta_1})$ iterations if initialized as described in the statement of Theorem 2.

Establishing LWSC/LWLC: In the appendices, Lemmata 15 and 16 establish the LWSC/LWLC properties for robust least-squares regression while Theorems 14 and 21 establish the breakdown points and the existence of suitable increments $\xi > 1$. Handling a fully adaptive adversary requires mild modifications to the notions of LWSC/LWLC, details of which are presented in Appendix G.1.

Model Recovery and Breakdown Point: For pure corruption, SVAM-RR offers exact model recovery against partially and fully adaptive adversaries as it assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \epsilon$ for any $\epsilon > 0$ if executed long enough. For hybrid corruption, where even "clean" points receive Gaussian noise of variance $\frac{1}{\beta^\ast}$, SVAM-RR assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \mathcal{O}(\frac{1}{\beta^\ast}\sqrt{\frac{d\log(n)}{n}})$ as $\alpha \rightarrow 0$, i.e. $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \rightarrow 0$ as $n \rightarrow \infty$, assuring consistent recovery. This significantly improves upon previous results by [1, 15] which offer $\mathcal{O}(\frac{1}{\beta^\ast})$ error even if $\alpha \rightarrow 0$ and $n \rightarrow \infty$. Note that SVAM-RR has a superior breakdown point (allowing up to an 18% corruption rate) against an oblivious adversary. The breakdown point deteriorates as expected (still allowing up to a 0.36% corruption rate) against a fully adaptive adversary. We now present analyses for other GLM problems.

Algorithm 2 SVAM-RR: Robust Least Squares Regression
Input: Data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$, initial scale $\beta_1$, initial model $\hat{\mathbf{w}}^1$, scale increment $\xi$
Output: A model estimate $\hat{\mathbf{w}} \approx \mathbf{w}^\ast$
1: for $t = 1, 2, \ldots, T-1$ do
2:    $s_i \leftarrow \exp(-\frac{\beta_t}{2}(y_i - \langle \mathbf{x}^i, \hat{\mathbf{w}}^t \rangle)^2)$
3:    $S \leftarrow \text{diag}(s_1, \ldots, s_n)$
4:    $\hat{\mathbf{w}}^{t+1} \leftarrow (XSX^\top)^{-1}(XS\mathbf{y})$
5:    $\beta_{t+1} \leftarrow \xi \cdot \beta_t$
6: end for
7: return $\hat{\mathbf{w}}^T$

Algorithm 3 SVAM-ME: Robust Mean Estimation
Input: Data $\{\mathbf{x}^i\}_{i=1}^n$, initial scale $\beta_1$, initial model $\hat{\boldsymbol{\mu}}^1$, scale increment $\xi$
Output: A mean estimate $\hat{\boldsymbol{\mu}} \approx \boldsymbol{\mu}^\ast$
1: for $t = 1, 2, \ldots, T-1$ do
2:    $s_i \leftarrow \exp(-\frac{\beta_t}{2}\|\mathbf{x}^i - \hat{\boldsymbol{\mu}}^t\|_2^2)$
3:    $\hat{\boldsymbol{\mu}}^{t+1} \leftarrow (\sum_{i=1}^n s_i)^{-1}(\sum_{i=1}^n s_i\mathbf{x}^i)$
4:    $\beta_{t+1} \leftarrow \xi \cdot \beta_t$
5: end for
6: return $\hat{\boldsymbol{\mu}}^T$

Algorithm 4 SVAM-Gamma: Robust Gamma Regression
Input: Data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$, initial scale $\beta_1$, initial model $\hat{\mathbf{w}}^1$, scale increment $\xi$
Output: A model estimate $\hat{\mathbf{w}} \approx \mathbf{w}^\ast$
1: for $t = 1, 2, \ldots, T-1$ do
2:    $s_i \leftarrow \mathcal{G}(y_i \,|\, \tilde{\eta}_{\beta_t}, \tilde{\phi}_{\beta_t})$  // see Table 1
3:    $\hat{\mathbf{w}}^{t+1} \leftarrow \arg\min_{\mathbf{w}} \sum_{i=1}^n s_i \cdot \ell(\mathbf{w}, \mathbf{x}^i, y_i)$ where $\ell(\mathbf{w}, \mathbf{x}, y) = (1-\phi)^{-1}y\exp(\langle \mathbf{w}, \mathbf{x} \rangle) - \langle \mathbf{w}, \mathbf{x} \rangle$
4:    $\beta_{t+1} \leftarrow \xi \cdot \beta_t$
5: end for
6: return $\hat{\mathbf{w}}^T$

Algorithm 5 SVAM-LR: Robust Classification
Input: Data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$, initial scale $\beta_1$, initial model $\hat{\mathbf{w}}^1$, scale increment $\xi$
Output: A model estimate $\hat{\mathbf{w}} \approx \mathbf{w}^\ast$
1: for $t = 1, 2, \ldots, T-1$ do
2:    $s_i \leftarrow (1 + \exp(-\beta_t y_i\langle \mathbf{x}^i, \hat{\mathbf{w}}^t \rangle))^{-1}$
3:    $\hat{\mathbf{w}}^{t+1} \leftarrow \arg\min_{\mathbf{w}} \sum_{i=1}^n s_i \cdot \ell(\mathbf{w}, \mathbf{x}^i, y_i)$ where $\ell(\mathbf{w}, \mathbf{x}, y) = \log(1 + \exp(-y\langle \mathbf{x}, \mathbf{w} \rangle))$
4:    $\beta_{t+1} \leftarrow \xi \cdot \beta_t$
5: end for
6: return $\hat{\mathbf{w}}^T$
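The weighted least-squares step in Algorithm 2 has a closed form; a minimal numpy sketch of the SVAM-RR iteration (not the released implementation; here $X$ is stored as an $n \times d$ matrix and all names are illustrative) could be:

```python
import numpy as np

def svam_rr(X, y, w_init, beta1, xi, T):
    """SVAM-RR sketch: variance-altered Gaussian weights + weighted least squares."""
    w, beta = np.array(w_init, dtype=float), float(beta1)
    for _ in range(T - 1):
        r = y - X @ w                             # residuals under the current model
        s = np.exp(-0.5 * beta * r ** 2)          # variance-altered Gaussian weights
        XtS = X.T * s                             # d x n, column i scaled by s_i
        w = np.linalg.solve(XtS @ X, XtS @ y)     # weighted least-squares solution
        beta *= xi
    return w
```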

Robust Gamma Regression. The data generation and corruption model for gamma regression are slightly different given that the gamma distribution has support only over the positive reals. First, the canonical parameter is calculated as $\eta_i = \exp(\langle \mathbf{w}^\ast, \mathbf{x}^i \rangle)$, using which a clean label $y_i$ is generated. To simplify the analysis, we assume that $\|\mathbf{w}^\ast\|_2 = 1$, $\phi = 0.5$, and $\mathbf{x}^i \sim \mathcal{N}(\mathbf{0}, I)$. For the $G = (1-\alpha) \cdot n$ "good" points, labels are generated as $y_i = \exp(\langle \mathbf{w}^\ast, \mathbf{x}^i \rangle)(1-\phi)$, i.e. the no-noise model. For the remaining $B = \alpha \cdot n$ "bad" points, the label is corrupted as $\tilde{y}_i = y_i \cdot b_i$ where $b_i > 0$ is a positive real number (but otherwise arbitrary and unbounded). A multiplicative corruption makes more sense since the final label must be positive. SVAM-Gamma (Algorithm 4) adapts SVAM to robust gamma regression. Due to the alternate canonical parameter used in gamma regression, the initialization requirement also needs to be modified to $\beta_1 \cdot (\exp(\|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2) - 1)^2 \leq 1$. However, the hyperparameter tuning strategy discussed in §3 continues to apply.
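The SVAM-Gamma iteration does not have a closed-form inner solve; the following sketch (assumptions: $\phi = 0.5$, a plain gradient-descent inner solver with an illustrative step size, scipy available for $\log\Gamma$; not the released code) shows how the Table 1 weights and the weighted loss of Algorithm 4 fit together:

```python
import numpy as np
from scipy.special import gammaln

def gamma_weight(y, eta, phi, beta):
    """beta-variance-altered gamma likelihood from Table 1."""
    phi_b = phi / (phi + beta * (1 - phi))
    eta_b = eta * beta / (phi + beta * (1 - phi))
    logp = (-np.log(y) - gammaln(1 / phi_b)
            + (1 / phi_b) * np.log(y * eta_b / phi_b) - y * eta_b / phi_b)
    return np.exp(logp)

def svam_gamma(X, y, w_init, beta1, xi, T, phi=0.5, lr=0.01, inner_steps=200):
    w, beta = np.array(w_init, dtype=float), float(beta1)
    for _ in range(T - 1):
        s = gamma_weight(y, np.exp(X @ w), phi, beta)      # weights from current model
        for _ in range(inner_steps):                       # weighted MLE by gradient descent
            grad = X.T @ (s * (y * np.exp(X @ w) / (1 - phi) - 1.0))
            w -= lr * grad / len(y)
        beta *= xi
    return w
```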

Theorem 4.

For data corrupted by a partially adaptive adversary with $\alpha \leq \frac{0.002}{\sqrt{d}}$, there exists $\xi > 1$ s.t. with probability at least $1 - \exp(-\Omega(d))$, the LWSC/LWLC conditions are satisfied for the $\tilde{Q}_\beta$ function for $\beta$ values as large as $\beta_{\max} = \mathcal{O}(1/(\exp(\mathcal{O}(\alpha\sqrt{d})) - 1))$. If initialized at $\hat{\mathbf{w}}^1, \beta_1$ s.t. $\beta_1 \cdot (\exp(\|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2) - 1)^2 \leq 1$ and $\beta_1 \geq 1$, SVAM-Gamma assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2 \leq \epsilon$ for any $\epsilon \geq \mathcal{O}(\alpha\sqrt{d})$ within $T \leq \mathcal{O}(\log\frac{1}{\epsilon})$ steps.

Model recovery, Consistency, Breakdown point. It is notable that prior results in the literature do not offer any breakdown point for gamma regression. We find that Thm 4 requires $\beta_1 \cdot (\exp(\|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2) - 1)^2 \leq 1$ and $\beta_1 \geq 1$, which together imply $\|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2 \leq \ln 2$. This is in contrast to Thms 2 and 3, which allow any initial $\hat{\mathbf{w}}^1$ so long as $\beta_1, \xi$ are sufficiently small. SVAM-Gamma guarantees convergence to a region of radius $\mathcal{O}(\alpha\sqrt{d})$ around $\mathbf{w}^\ast$ whereas Thms 2 and 3 assure exact recovery. However, these do not seem to be artifacts of the proof technique. In experiments, SVAM-Gamma did not offer vanishingly small recovery errors and did indeed struggle if initialized with $\beta_1 \ll 1$. It may be the case that there exist lower bounds preventing exact recovery for gamma regression, similar to mean estimation.

Robust Mean Estimation. We have nn data points of which the set GG of (1α)n(1-\alpha)\cdot n ”good” points are generated from a dd-dimensional spherical Gaussian 𝐱i𝒩(𝝁,Σ){{\mathbf{x}}}^{i}\sim{\mathcal{N}}(\text{\boldmath$\mathbf{\mu}$},\Sigma) i.e. 𝐱i=𝝁+ϵi{{\mathbf{x}}}^{i}=\text{\boldmath$\mathbf{\mu}$}+\text{\boldmath$\mathbf{\epsilon}$}^{i} where ϵi𝒩(𝟎,Σ)\text{\boldmath$\mathbf{\epsilon}$}^{i}\sim{\mathcal{N}}({\mathbf{0}},\Sigma) and Σ=1βI\Sigma=\frac{1}{\beta^{\ast}}\cdot I for some β>0\beta^{\ast}>0. The rest are the set BB of αn\alpha\cdot n ”bad” points that are corrupted by an adversary i.e. 𝐱~i=𝝁+𝐛i\tilde{{\mathbf{x}}}^{i}=\text{\boldmath$\mathbf{\mu}$}^{\ast}+{{\mathbf{b}}}^{i} where 𝐛id{{\mathbf{b}}}^{i}\in{\mathbb{R}}^{d} can be unbounded. SVAM-ME (Algorithm 3) adapts SVAM to the robust mean estimation problem. For notational clarity we use, 𝜼=𝝁\text{\boldmath$\mathbf{\eta}$}=\text{\boldmath$\mathbf{\mu}$}, in this problem.
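Since each SVAM-ME update is simply a soft weighted mean, a minimal numpy sketch of Algorithm 3 (illustrative names only, not the released code) is:

```python
import numpy as np

def svam_me(X, mu_init, beta1, xi, T):
    """SVAM-ME sketch: weights from the variance-altered multivariate Gaussian."""
    mu, beta = np.array(mu_init, dtype=float), float(beta1)
    for _ in range(T - 1):
        s = np.exp(-0.5 * beta * np.sum((X - mu) ** 2, axis=1))   # per-point weights
        mu = (s[:, None] * X).sum(axis=0) / s.sum()               # soft weighted mean
        beta *= xi
    return mu
```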

Theorem 5.

For data corrupted by a partially adaptive adversary with corruption rate $\alpha \leq 0.26$, there exists $\xi > 1$ s.t. with probability at least $1 - \exp(-\Omega(d))$, the LWSC/LWLC conditions are satisfied for the $\tilde{Q}_\beta$ function for $\beta$ up to $\beta_{\max} = \mathcal{O}(\frac{\beta^\ast}{d}\min\{\log\frac{1}{\alpha}, \sqrt{nd}\})$. If initialized with $\hat{\boldsymbol{\mu}}^1, \beta_1$ s.t. $\beta_1 \cdot \|\hat{\boldsymbol{\mu}}^1 - \boldsymbol{\mu}^\ast\|_2^2 \leq 1$, SVAM-ME assures $\|\hat{\boldsymbol{\mu}}^T - \boldsymbol{\mu}^\ast\|_2^2 \leq \epsilon$ for any $\epsilon \geq \mathcal{O}(\text{trace}^2(\Sigma) \cdot \max\{\frac{1}{\ln(1/\alpha)}, \frac{1}{\sqrt{nd}}\})$ within $T \leq \mathcal{O}(\log\frac{n}{\beta_1})$ iterations.

Model recovery, Consistency, Breakdown point. Note that for any constant $\alpha > 0$, the estimation error does not go to zero as $n \rightarrow \infty$. As mentioned in §2, an error of $\Omega(\alpha)$ is unavoidable no matter how large $n$ gets. Thus, the best we can hope for is that the estimation error goes to zero as $\alpha \rightarrow 0$ and $n \rightarrow \infty$. The error in Theorem 5 does indeed go to zero in this setting. Also, note that the error depends only on the trace of the covariance matrix of the clean points; thus for $\text{trace}(\Sigma) = \mathcal{O}(1)$, the result offers an estimation error independent of the dimension. SVAM-ME offers a large breakdown point (allowing up to a 26% corruption rate).

Establishing LWSC/LWLC for Gamma Regression and Mean Estimation: In the appendices, Lemmata 28 and 29 and Lemmata 23 and 24 establish the LWSC/LWLC properties of the $\tilde{Q}_\beta$ function for gamma regression and mean estimation respectively, while Theorems 27 and 22 establish the breakdown points and the existence of increments $\xi > 1$.

Robust Classification. In this case the labels are generated as $y_i = \text{sign}(\langle \mathbf{w}^\ast, \mathbf{x}^i \rangle)$ and the bad points in the set $B$ get their labels flipped, i.e. $\tilde{y}_i = -\text{sign}(\langle \mathbf{w}^\ast, \mathbf{x}^i \rangle)$. SVAM-LR (Algorithm 5) adapts SVAM to robust logistic regression.
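A short sketch of the SVAM-LR iteration in Algorithm 5 (assumptions: a plain gradient-descent inner solver for the weighted logistic loss; step size and iteration count are illustrative choices, not from the paper) is given below:

```python
import numpy as np

def svam_lr(X, y, w_init, beta1, xi, T, lr=0.1, inner_steps=200):
    """SVAM-LR sketch: temperature-scaled sigmoid weights + weighted logistic regression."""
    w, beta = np.array(w_init, dtype=float), float(beta1)
    for _ in range(T - 1):
        s = 1.0 / (1.0 + np.exp(-beta * y * (X @ w)))   # beta-altered Bernoulli weights
        for _ in range(inner_steps):                    # weighted logistic MLE by gradient descent
            p = 1.0 / (1.0 + np.exp(-y * (X @ w)))      # P[y_i | <w, x_i>]
            grad = -X.T @ (s * y * (1.0 - p))
            w -= lr * grad / len(y)
        beta *= xi
    return w
```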

5 Experiments

We used a 64-bit machine with Intel® Core™ i7-6500U CPU @ 2.50GHz, 4 cores, 16 GB RAM, Ubuntu 16.04 OS.

Benchmarks. SVAM was benchmarked against several baselines: (a) VAM: SVAM executed with a fixed value of the scale $\beta$ by setting the scale increment to $\xi = 1$, to investigate the benefits of varying the scale $\beta$; (b) MLE: likelihood maximization on all points (clean + corrupted) without any weights assigned to data points, which checks for the benefits of performing weighted MLE; (c) Oracle: an execution of the MLE on only the clean points, which is the gold standard in robust learning and offers the best possible outcome. In addition, several problem-specific competitors were also considered. For robust regression, STIR [15], TORRENT [1], SEVER [7], RGD [20], and the classical robust M-estimator with Tukey's bisquare loss were included. Note that TORRENT already outperforms $L_1$ regularization methods while achieving better or competitive recovery errors (see [1, Fig 2(b)]). Since SVAM-RR was faster than TORRENT itself, $L_1$-regularized methods such as [18, 24] were not considered. For robust mean estimation, popular baselines such as the coordinate-wise median and the geometric median were taken. For robust classification, the rank-pruning method RP-LR [19] and the method from [16] were used.

Experimental Setting and Reproducibility. Due to lack of space, details of the experimental setup, data generation, how adversaries were simulated, etc. are presented in Appendix C. SVAM also offered superior robustness compared to competitors against a wide range of ways to simulate adversarial corruption (see Appendix D for details). Code for SVAM is available at https://github.com/purushottamkar/svam/.

Figure 2 (panels: (a) Robust Regression, (b) Robust Mean Estimation, (c) Robust Gamma Regression, (d) Robust Classification, (e) Adversarial Initialization, (f) Global Convergence): Figs 2(a,b,c,d) compare SVAM and various competitors on robust GLM problems. The number of data points $n$, dimensions $d$, and fraction of corruptions $\alpha = k/n$ are mentioned at the top of each figure. For algorithms for which iteration-wise performance was unavailable, their final performance level is plotted as a horizontal dashed line. A marker is placed on the line indicating the time it took for that algorithm to converge. The figures clarify that executing VAM with a single fixed value of $\beta$ cannot replace the gradual variations in $\beta_t$ done by SVAM. Figs 2(e,f) confirm that SVAM offers convergence to $\mathbf{w}^\ast$ irrespective of the model with which it is initialized. In these figures, corruptions were introduced using an adversarial model $\tilde{\mathbf{w}}$, i.e. for corrupted points, the label was set to $\tilde{y}_i = \langle \tilde{\mathbf{w}}, \mathbf{x}^i \rangle$. SVAM was then initialized at $\tilde{\mathbf{w}}$ itself to check if it gets misled by faulty initialization but was found to offer exact recovery regardless.

5.1 Experimental Observations

Robust Regression. Fig 2(a) shows that SVAM-RR, SEVER, RGD, STIR, and TORRENT are competitive and achieve oracle-level error. However, SVAM-RR can be twice as fast in terms of execution time. Since TORRENT itself outperforms $L_1$ regularization methods while achieving better or competitive recovery errors (see Fig 2(b) in [1]), we do not compare against $L_1$ methods. SVAM-RR is several times faster than classical robust M-estimators such as Tukey's bisquare loss. Also, no single value of $\beta$ can offer the performance of SVAM, as indicated by the poor performance of VAM. Fig 4 in the appendix shows that this is true even if very large or very small values of $\beta$ are used with VAM. We note that SEVER chooses a threshold in each iteration to eliminate specific points as corrupted. This threshold is chosen randomly (possibly for ease of proof) but causes SEVER to offer sluggish convergence. Thus, we also report the performance of a modification SEVER-M that was given an unfair advantage by revealing to it the actual number of corrupted points (SVAM was not given this information). This sped up SEVER but SVAM continued to outperform SEVER-M. Fig 3 in the appendix reports repeated runs of the experiment in which SVAM continues to lead.

Robust Logistic and Gamma Regression. Figs 2(c,d) report results of SVAM on robust gamma and logistic regression problems. The figures show that executing VAM with a fixed value of $\beta$ cannot replace the gradual variations in $\beta_t$ done by SVAM. Additionally, for robust classification, SVAM-LR achieves an error an order of magnitude smaller than all competitors except the oracle. SVAM also outperforms the RP-LR [19] and [16] algorithms that were specifically designed for robust classification. A horizontal dashed line is used to indicate the final performance of algorithms for which iteration-wise performance was unavailable.

Robust Mean Estimation. Fig 2(b) reports results on robust mean estimation problems. SVAM outperforms VAM with any fixed value of $\beta$ as well as the naive sample mean (the MLE in this case). The popular coordinate-wise median and geometric median approaches were fast but offered poor results. SVAM, on the other hand, achieved oracle-level error by assigning proper scores to all data points.

Sensitivity to Hyperparameter Tuning. In Figs 1(a,b), SVAM-RR was offered hyperparameters from a wide range of values to study how it responds to mis-specified hyperparameters. SVAM offered stable convergence for a wide range of $\beta_1, \xi$, indicating that it is resilient to minor mis-specifications of its hyperparameters.

Sensitivity to Dimension and Corruption. Figs 1(c,d) compare the error offered by various algorithms in recovering $\mathbf{w}^\ast$ for robust least-squares regression when the fraction of corrupted points $\alpha$ and the feature dimension $d$ were varied. All values are averaged over 20 experiments, with each experiment using 1000 data points. $\alpha$ was varied in the range $[0, 0.4]$ and $d$ in the range $[10, 100]$ with fixed hyperparameters. STIR and the bisquare M-estimator are sensitive to corruption while SEVER is sensitive to both corruption and dimension. RGD is not visible in the figures as its error exceeded the figure boundaries. Experiments for Fig 1(c) fixed $d = 10$ and varied $\alpha$ while those for Fig 1(d) fixed $\alpha = 0.15$ and varied $d$. Figs 1(c,d) show that SVAM-RR can tolerate large fractions of the data getting corrupted and is not sensitive to $d$.

Testing SVAM for Global Convergence. To test the effect of initialization, in Fig 2(e), corruptions were introduced using an adversarial model $\tilde{\mathbf{w}}$, i.e. for corrupted points, labels were set to $\tilde{y}_i = \langle \tilde{\mathbf{w}}, \mathbf{x}^i \rangle$. SVAM-RR was initialized at 1000 randomly chosen models, the origin, as well as at the adversarial model $\tilde{\mathbf{w}}$ itself. WORST-1000 (resp. AVG-1000) indicates the worst (resp. average) performance SVAM had over the 1000 initializations. Fig 2(f) further emphasizes this using a toy 2D problem. SVAM was initialized at all points on a grid. An initialization was called a success if SVAM achieved error $< 10^{-6}$ within eight or fewer iterations. In all these experiments, SVAM rapidly converged to the true model irrespective of model initialization.

Acknowledgements

The authors thank the anonymous reviewers of this paper for suggesting illustrative experiments and pointing to relevant literature. B.M. is supported by the Technology Innovation Institute and MBZUAI joint project (NO. TII/ARRC/2073/2021): Energy-based Probing for Spiking Neural Networks. D.D. is supported by the Research-I Foundation at IIT Kanpur and acknowledges support from the Visvesvaraya PhD Scheme for Electronics & IT (FELLOW/2016-17/MLA/194). P.K. thanks Microsoft Research India and Tower Research for research grants.

References

  • Bhatia et al. [2015] K. Bhatia, P. Jain, and P. Kar. Robust Regression via Hard Thresholding. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), 2015.
  • Cantoni and Ronchetti [2001] E. Cantoni and E. Ronchetti. Robust Inference for Generalized Linear Models. Journal of the American Statistical Association, 96(455):1022–1030, 2001.
  • Cheng et al. [2019] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2755–2771. SIAM, 2019.
  • Dalalyan and Minasyan [2022] A. S. Dalalyan and A. Minasyan. All-In-One Robust Estimator of the Gaussian Mean. Annals of Statistics, 50(2):1193–1219, 2022.
  • Diakonikolas and Kane [2019] I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
  • Diakonikolas et al. [2019a] I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart. Robust Estimators in High-Dimensions Without the Computational Intractability. SIAM Journal on Computing, 48(2):742–864, 2019a.
  • Diakonikolas et al. [2019b] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A Robust Meta-Algorithm for Stochastic Optimization. In 36th International Conference on Machine Learning (ICML), 2019b.
  • Feng et al. [2014] J. Feng, H. Xu, S. Mannor, and S. Yan. Robust Logistic Regression and Classification. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
  • Feng et al. [2015] Y. Feng, X. Huang, L. Shi, Y. Yang, and J. A. Suykens. Learning with the Maximum Correntropy Criterion Induced Losses for Regression. Journal of Machine Learning Research, 16(30):993–1034, 2015.
  • Jiang et al. [2012] K. Jiang, B. Kulis, and M. Jordan. Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012.
  • Lai et al. [2016] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.
  • Lecué and Lerasle [2020] G. Lecué and M. Lerasle. Robust machine learning by median-of-means: Theory and practice. Annals of Statistics, 48(2):906–931, 2020.
  • Li et al. [2021] T. Li, A. Beirami, M. Sanjabi, and V. Smith. Tilted Empirical Risk Minimization. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021.
  • McCullagh and Nelder [1989] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, 1989.
  • Mukhoty et al. [2019] B. Mukhoty, G. Gopakumar, P. Jain, and P. Kar. Globally-convergent Iteratively Reweighted Least Squares for Robust Regression Problems. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • Natarajan et al. [2013] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.
  • Nelder and Wedderburn [1972] J. A. Nelder and R. W. M. Wedderburn. Generalized Linear Models. Journal of the Royal Statistical Society. Series A, 135(3):370–384, 1972.
  • Nguyen and Tran [2013] N. H. Nguyen and T. D. Tran. Exact Recoverability From Dense Corrupted Observations via $\ell_1$-Minimization. IEEE Transactions on Information Theory, 59(4):2017–2035, 2013.
  • Northcutt et al. [2017] C. G. Northcutt, T. Wu, and I. L. Chuang. Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017. URL http://auai.org/uai2017/proceedings/papers/35.pdf.
  • Prasad et al. [2018] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar. Robust Estimation via Robust Gradient Estimation. arXiv:1802.06485 [stat.ML], 2018.
  • Suggala et al. [2019] A. S. Suggala, K. Bhatia, P. Ravikumar, and P. Jain. Adaptive Hard Thresholding for Near-optimal Consistent Robust Regression. In 32nd Conference on Learning Theory (COLT), 2019.
  • Valdora and Yohai [2014] M. Valdora and V. J. Yohai. Robust estimators for generalized linear models. Journal of Statistical Planning and Inference, 146:31–48, 2014.
  • Vershynin [2018] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  • Wright and Ma [2010] J. Wright and Y. Ma. Dense Error Correction via $\ell_1$ Minimization. IEEE Transactions on Information Theory, 56(7):3540–3560, 2010.
  • Yang et al. [2013] E. Yang, A. Tewari, and P. Ravikumar. On Robust Estimation of High Dimensional Generalized Linear Models. In 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2013.

Appendix A Summary of Assumptions

This paper presented SVAM, a framework for robust GLM problems based on a novel variance reduced reweighted MLE technique that can be readily adapted to arbitrary GLM problems such as robust least squares/logistic/gamma regression and mean estimation. Here, we summarize the theoretical and empirical assumptions made by the SVAM framework for easy inspection.

Experimental Assumptions. SVAM requires minimal assumptions to be executed in practice and only requires two scalar hyperparameters to be tuned properly. SVAM is robust to minor misspecifications of its hyperparameters (see Figure 1), and the hyperparameter tuning described in §3 works well in practice, allowing SVAM to offer superior or competitive empirical performance when compared to state-of-the-art task-specific estimation techniques, e.g. TORRENT for robust regression, classical and popular techniques such as Tukey's bisquare or the geometric median for robust mean estimation, as well as recent robust gradient-based techniques such as SEVER and RGD.

Theoretical Assumptions. SVAM establishes explicit breakdown points in several interesting scenarios against both partially and fully adaptive adversaries. To do so, SVAM assumes a realizable setting, e.g. in least-squares regression, labels for the $G$ clean points are assumed to be generated as $y_i = \langle \mathbf{w}^\ast, \mathbf{x}^i \rangle + \epsilon_i$ where $\mathbf{w}^\ast$ is the gold model and $\epsilon_i \sim \mathcal{N}(0, \frac{1}{\beta^\ast})$ is Gaussian noise. Of course, on the $B$ bad points, the (partially/fully adaptive) adversary is free to introduce corruptions jointly in any manner. For least-squares regression, the covariates/feature vectors $\mathbf{x}^i$ are assumed to be sampled from some sub-Gaussian distribution, which includes arbitrary bounded distributions as well as multivariate Gaussian distributions in $d$ dimensions (both standard and non-standard); see Appendix G for details. For the gamma regression and mean estimation settings, the covariates are assumed to be sampled from a spherical multivariate Gaussian in $d$ dimensions. Other assumptions are listed in the statements of the theorems in §4.

Appendix B Adversary Models

We explain the various adversary models in more detail here. Several adversary models popular in the literature grant the adversary varying degrees of control. This section offers a more relaxed discussion of some prominent adversary models along with examples of applications in which they arise.

(Oblivious) Huber Adversary. Corruption locations i1,,iki_{1},\ldots,i_{k} are chosen randomly for which corrupted labels are sampled i.i.d. from some pre-decided distribution {\mathcal{B}} i.e. y~ij\tilde{y}_{i_{j}}\sim{\mathcal{B}}. Next, data features 𝐱i{{\mathbf{x}}}^{i} and the true model 𝐰{{\mathbf{w}}}^{\ast} are selected and clean labels are generated according to the GLM for all non-corrupted points.

Partially Adaptive Adversary. The adversary first chooses the corruption locations i1,,iki_{1},\ldots,i_{k} (e.g. some fixed choice or randomly). Then data features 𝐱i{{\mathbf{x}}}^{i} and the true model 𝐰{{\mathbf{w}}}^{\ast} are selected and clean labels y1,,yny_{1},\ldots,y_{n} are generated according to the GLM. Next, the adversary is presented with the collection {𝐰,{(𝐱i,yi)}i=1n,{ij}j=1k}\left\{{{{\mathbf{w}}}^{\ast},\left\{{({{\mathbf{x}}}^{i},y_{i})}\right\}_{i=1}^{n},\left\{{i_{j}}\right\}_{j=1}^{k}}\right\} and is allowed to use this information to generate corrupted labels y~ij\tilde{y}_{i_{j}} for the points marked for corruption.

Fully Adaptive Adversary. First data features 𝐱i{{\mathbf{x}}}^{i} and the true model 𝐰{{\mathbf{w}}}^{\ast} are selected and clean labels y1,,yny_{1},\ldots,y_{n} are generated according to the GLM. Then the adversary is presented with the collection {𝐰,{(𝐱i,yi)}i=1n}\left\{{{{\mathbf{w}}}^{\ast},\left\{{({{\mathbf{x}}}^{i},y_{i})}\right\}_{i=1}^{n}}\right\} and is allowed to use this information to select which kk points to corrupt as well as generate corrupted labels y~ij\tilde{y}_{i_{j}} for those points.

Discussion. The fully adaptive adversary can choose corruption locations i1,,iki_{1},\ldots,i_{k} and the corrupted labels y~ij\tilde{y}_{i_{j}} with complete information of the true model 𝐰{{\mathbf{w}}}^{\ast}, the clean labels {yi}\left\{{y_{i}}\right\} and the feature vectors {𝐱i}\left\{{{{\mathbf{x}}}^{i}}\right\} and is the most powerful. The partially adaptive adversary can decide the corruptions y~ij\tilde{y}_{i_{j}} after inspecting 𝐰,{𝐱i,yi}{{\mathbf{w}}}^{\ast},\left\{{{{\mathbf{x}}}^{i},y_{i}}\right\} but cannot control the corruption locations. This can model e.g. an adversary that corrupts user data by installing malware on their systems. The adversary cannot force malware on a system of their choice but can manipulate data coming from already compromised systems at will. The Huber adversary is the least powerful with corruption locations that are random as well as corrupted labels that are sampled randomly and cannot depend on {𝐱i,yi}\left\{{{{\mathbf{x}}}^{i},y_{i}}\right\}. Although weak, this adversary can nevertheless model sensor noise e.g. pixels in a CCD array that misfire with a certain probability. As noted earlier, SVAM results are shown against both fully and partially adaptive adversaries.

Appendix C Experimental Setup

Experiments were carried out on a 64-bit machine with Intel® Core™ i7-6500U CPU @ 2.50GHz, 4 cores, 16 GB RAM and Ubuntu 16.04 OS. Statistics such as dataset size nn, feature dimension dd and corruption fraction α\alpha are mentioned above each figure. 20% of the training data was used as a held-out validation set on which (β1,ξ\beta_{1},\xi) were tuned via line search. Figure 1 indicates that SVAM is not sensitive to the exact settings of β1\beta_{1} or ξ\xi and good values can be found using line search.

Synthetic Data Generation. Synthetic datasets were used in the experiments to demonstrate recovery of the true model parameters. All regression co-variates/features were generated using a standard normal distribution 𝒩(0,Id){\mathcal{N}}(0,I_{d}). Clean responses in the least-squares regression settings were generated without additional Gaussian noise, while corrupted responses were generated for an α\alpha fraction of the data points. For least-squares regression, clean responses were generated as yi=𝐰,𝐱iy_{i}=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle, while for logistic regression, clean binary labels were generated as yi=𝕀[𝐰,𝐱i>0]y_{i}={\mathbb{I}}[\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle>0]. For gamma regression, clean responses were generated using a likelihood distribution with vanishing variance. This was done by using a variance-altered likelihood distribution with the setting β\beta^{\ast}\rightarrow\infty (see Table 1 for the likelihood expressions). Clean data points for mean estimation were sampled from 𝒩(𝝁,1dId){\mathcal{N}}(\text{\boldmath$\mathbf{\mu}$}^{\ast},\frac{1}{d}\cdot I_{d}). Unless stated otherwise, corrupted labels were generated using an adversarial model. Specifically, to simulate the adversary, an adversarial model 𝐰~\tilde{{\mathbf{w}}} (for least-squares/logistic/gamma regression) or 𝝁~\tilde{\text{\boldmath$\mathbf{\mu}$}} (for mean estimation) was sampled and labels for data points chosen for corruption were generated using this adversarial model instead of the true model. For example, corrupted labels for least-squares regression were generated as y~i=𝐰~,𝐱i\tilde{y}_{i}=\left\langle{\tilde{{\mathbf{w}}}},{{{\mathbf{x}}}_{i}}\right\rangle for all kk locations chosen for corruption. Please also refer to Appendix D for an extensive study on several other ways of simulating the adversary and initialization schemes.
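
As a concrete illustration, the following Python sketch reproduces this generation process for the least-squares, logistic and mean-estimation settings under the assumptions stated above (standard normal covariates, noiseless clean responses). The variable names are illustrative and not taken from the released code; corruption and initialization are sketched after the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10                            # dataset size and feature dimension

X = rng.standard_normal((n, d))            # covariates x_i ~ N(0, I_d)
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)           # true model taken as a random unit vector

y_ls = X @ w_star                          # clean least-squares responses (no Gaussian noise)
y_logit = (X @ w_star > 0).astype(int)     # clean binary labels for logistic regression

mu_star = rng.standard_normal(d)
Z = mu_star + rng.standard_normal((n, d)) / np.sqrt(d)   # clean points for mean estimation, N(mu*, I_d / d)
```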

Simulating the Adversary. Corruptions were introduced using an adversarial model. Specifically, an adversarial model 𝐰~\tilde{{\mathbf{w}}} was chosen, and for bad points, the adversary generated a label using 𝐰~\tilde{{\mathbf{w}}}, which overwrote the true label. For least-squares/logistic/gamma regression, both 𝐰,𝐰~{{\mathbf{w}}}^{\ast},\tilde{{\mathbf{w}}} were independently chosen to be random unit vectors. For robust mean estimation (Fig 2(a)), 𝝁\text{\boldmath$\mathbf{\mu}$}^{\ast} and 𝝁~\tilde{\text{\boldmath$\mathbf{\mu}$}} were chosen as random Gaussian vectors of length 22 and 66 respectively. Except for Fig 2(e,f), in which the setting is different, SVAM variants were always initialized at the adversarial model itself i.e. 𝐰^1=𝐰~\hat{{\mathbf{w}}}^{1}=\tilde{{\mathbf{w}}} to test the ability of SVAM to converge to the true model regardless of the initialization.
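
Continuing the sketch above, the adversarial-model corruption and the adversarial initialization described here can be simulated as follows (again an illustrative reconstruction, not the released implementation):

```python
alpha = 0.15                                   # corruption fraction
k = int(alpha * n)

w_adv = rng.standard_normal(d)
w_adv /= np.linalg.norm(w_adv)                 # adversarial model, also a random unit vector

bad = rng.choice(n, size=k, replace=False)     # corruption locations (chosen at random here)
y_corr = y_ls.copy()
y_corr[bad] = X[bad] @ w_adv                   # overwrite labels using the adversarial model

w_init = w_adv.copy()                          # adversarial initialization, i.e. w^1 = w_tilde
```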

Other Corruption Models. SVAM was found to offer superior robustness as compared to competitor algorithms against a wide range of ways to simulate adversarial corruption, including powerful ones that use expensive leverage score computations to decide corruptions. Fig 5 in Appendix D reports results of experiments with a variety of adversaries.

Figure 3: Convergence results obtained after repeated experiments with the 3 leading methods for robust least squares regression. SVAM continues to lead despite the natural variance in convergence plots.
Figure 4: Convergence results for VAM with various (fixed) values of β\beta for robust least squares regression. The plots indicate that the best performance achieved by VAM is with a moderately large value of β\beta. Excessively large and excessively small values of β\beta are both ill-suited to model recovery. However, even if provided such an optimal value of β\beta, VAM’s performance is still inferior to that of SVAM. We note that SVAM does not use a fixed value of β\beta and instead dynamically updates it.
Figure 5: SVAM exhibits a high degree of tolerance (superior to competitor algorithms) to a variety of corruption models such as choice of corruption location (e.g. choosing corruption locations randomly, based on leverage scores, based on magnitude of the clean label, etc), choice of type of corruption (e.g. sign-flip, constant shift, using an adversarial model, etc) and initialization schemes (e.g. random, adversarial, etc). Please see the text in Appendix D for more details.

Appendix D Tolerance to Different Adversaries and Adversarial Initialization Schemes

Figure 3 repeats the experiment of Fig. 2(a) 10 times, showing that the convergence results are consistent across several runs of the experiment. In this comparison, we include the two closest competitors, STIR and TORRENT, along with SVAM, and omit other competitors to avoid clutter in the figure. The experiment shows that the performance of the methods does not vary wildly across runs and that SVAM’s convergence remains faster than that of its competitors.

Figure 4 considers the VAM method with several values of β\beta ranging from very small to very large. It is apparent that no single fixed value of β\beta is able to offer satisfactory results, indicating that dynamically updating the value of β\beta, as done by SVAM, is required for better convergence.

Figure 5 demonstrates recovery of 𝐰{{\mathbf{w}}}^{\ast} under different ways of simulating the adversary and initializing the algorithms. We consider corruption models that differ in how the locations to be corrupted are chosen and in how the chosen points are corrupted. We demonstrate recovery when points are chosen for corruption in three ways: i) at random, ii) using the magnitude of the response, and iii) using the leverage score. The leverage score of a point increases as it lies farther from the mean of the data points. The diagonal elements of the projection matrix P=X(XX)1XP=X^{\top}(XX^{\top})^{-1}X give the respective leverage scores, and the kk data points having the largest leverage scores are chosen for corruption (a sketch of this selection is given below). After selecting the data points to be corrupted, we either i) flip the sign of the response, ii) set the response to a constant value B, or iii) use an adversarial model 𝐰~\tilde{\mathbf{w}} to generate the corrupted response, i.e. set the response to y~i=𝐰~,𝐱i\tilde{y}_{i}=\left\langle{\tilde{\mathbf{w}}},{{{\mathbf{x}}}_{i}}\right\rangle. The initialization was also varied in two ways: i) random initialization, and ii) adversarial initialization where SVAM was initialized at 𝐰~\tilde{{\mathbf{w}}}, the same adversarial model using which corruptions were introduced.
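
A minimal sketch of the leverage-score-based selection described above, with the data matrix stored column-wise as in the notation of Appendix G (X of shape d × n); the function and variable names are illustrative.

```python
import numpy as np

def top_leverage_indices(X, k):
    """X has shape (d, n) with data points as columns. Returns the indices of the
    k columns with the largest leverage scores, i.e. the largest diagonal entries
    of the hat matrix X^T (X X^T)^{-1} X."""
    H = X.T @ np.linalg.solve(X @ X.T, X)   # n x n hat matrix
    leverage = np.diag(H)
    return np.argsort(leverage)[-k:]        # indices of the k largest leverage scores
```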

Experiments were performed by varying the corruption location scheme in {random, absolute magnitude, leverage score}, the corruption type in {adversarial, sign change, set constant} and the initialization in {random, adversarial}. The experimental setting is given in the title of each figure, while all of them use n=1000n=1000, d=10d=10 and α=0.15\alpha=0.15. It can be observed that, in general, SVAM demonstrates superior performance irrespective of the adversarial model and initialization.

It can also be observed that SVAM converges after a single iteration when corruptions are introduced by setting responses to a constant (second column), whereas Tukey’s bisquare method does not converge well when the absolute magnitude of the response is used to select the corruption locations (second row), except in the case of constant-response corruptions.

Appendix E Proof of Theorem 1

Theorem 6 (SVAM convergence - Restated).

Suppose the data and likelihood distribution satisfy the λβ\lambda_{\beta}-LWSC and Λβ\Lambda_{\beta}-LWLC properties for all values of β\beta in the range (0,βmax](0,\beta_{\max}]. Then if SVAM is initialized at a point 𝐰^1\hat{{\mathbf{w}}}^{1} and initial scale β1>0\beta_{1}>0 such that β1𝐰^1𝐰221\beta_{1}\cdot\left\|{\hat{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1 then for any ϵ>1βmax\epsilon>\frac{1}{\beta_{\max}}, for small-enough scale increment ξ>1\xi>1, SVAM ensures 𝐰^T𝐰2ϵ\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\epsilon within T=𝒪(log1ϵ)T={\cal O}\left({{\log\frac{1}{\epsilon}}}\right) iterations.

Proof.

The key to this proof is to maintain the invariant βt𝐰^t𝐰221\beta_{t}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1. Note that initialization is done precisely to ensure this at the beginning of the execution of the algorithm which acts as the base case for an inductive argument. For the inductive case, consider an iteration tt and let βt𝐰^t𝐰21\sqrt{\beta_{t}}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq 1. LWSC ensures strong convexity giving

Q~βt(𝐰^t+1|𝐰^t)Q~βt(𝐰|𝐰^t)Q~βt(𝐰|𝐰^t),𝐰^t+1𝐰+λβt2𝐰^t+1𝐰22\tilde{Q}_{\beta_{t}}(\hat{{\mathbf{w}}}^{t+1}\,|\,\hat{{\mathbf{w}}}^{t})-\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,\hat{{\mathbf{w}}}^{t})\geq\left\langle{\nabla\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,\hat{{\mathbf{w}}}^{t})},{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\rangle+\frac{\lambda_{\beta_{t}}}{2}\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}

Since 𝐰^t+1\hat{{\mathbf{w}}}^{t+1} minimizes Q~βt(|𝐰^t)\tilde{Q}_{\beta_{t}}(\cdot\,|\,\hat{{\mathbf{w}}}^{t}), we have Q~βt(𝐰^t+1|𝐰^t)Q~βt(𝐰|𝐰^t)\tilde{Q}_{\beta_{t}}(\hat{{\mathbf{w}}}^{t+1}\,|\,\hat{{\mathbf{w}}}^{t})\leq\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,\hat{{\mathbf{w}}}^{t}). Elementary manipulations and the Cauchy-Schwarz inequality now give us

𝐰^t+1𝐰22Q~βt(𝐰|𝐰^t)2λβt2Λβtλβt.\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\frac{2\left\|{\nabla\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,\hat{{\mathbf{w}}}^{t})}\right\|_{2}}{\lambda_{\beta_{t}}}\leq\frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}.

Now, we will additionally ensure that we choose the scale increment ξ\xi to be small enough (while still ensuring ξ>1\xi>1) such that 2Λβλβ<1ξβ\frac{2\Lambda_{\beta}}{\lambda_{\beta}}<\sqrt{\frac{1}{\xi\beta}} for all β(0,βmax]\beta\in(0,\beta_{\max}]. Combining this with the above result gives us

𝐰^t+1𝐰221ξβt\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\frac{1}{\xi\beta_{t}}

Thus, if we now set βt+1=ξβt\beta_{t+1}=\xi\beta_{t}, then rearranging the terms in the above inequality tells us that βt+1𝐰^t+1𝐰221\beta_{t+1}\cdot\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1 which lets us continue the inductive argument. Note that this process can continue until we have βtβmax\beta_{t}\leq\beta_{\max} since the LWSC/LWLC properties are assured till that point. Moreover, since βt\beta_{t} goes up by a constant factor at each step and 𝐰^t𝐰221βt\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\frac{1}{\beta_{t}} due to the invariant, a linear rate of convergence is assured, which finishes the proof. The existence of a suitable scale increment ξ\xi satisfying the above requirements is established in a case-wise manner by Theorems 14, 21, 22 and 27. We also note that, as discussed in §4 and elaborated in Appendix I, since gamma regression requires an alternate parameterization owing to its need to support only non-negative labels, the invariant used for the convergence bound for SVAM-Gamma is also slightly altered as mentioned in Thm 4 to instead use βt(exp(𝐰^t𝐰2)1)21\beta_{t}\cdot(\exp(\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2})-1)^{2}\leq 1. ∎
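
For intuition, the iterative scheme analyzed in this proof can be summarized by the following sketch of the SVAM loop, instantiated for robust least-squares regression where the weighted MLE step reduces to a weighted least-squares solve (see the form of the Q~β\tilde{Q}_{\beta} function in Appendix G). This is an illustrative reconstruction under the stated assumptions, not the released implementation.

```python
import numpy as np

def svam_rr(X, y, w_init, beta1, xi, T):
    """Sketch of SVAM for robust least-squares regression.
    X: (n, d) covariates, y: (n,) possibly corrupted responses,
    beta1: initial scale chosen so that beta1 * ||w_init - w*||^2 <= 1,
    xi > 1: scale increment, T: number of iterations."""
    w, beta = w_init.copy(), beta1
    for _ in range(T):
        r = y - X @ w                             # residuals under the current model
        s = np.exp(-0.5 * beta * r ** 2)          # small weights on poorly-fit (likely corrupted) points
        Xs = X * s[:, None]
        w = np.linalg.solve(Xs.T @ X, Xs.T @ y)   # weighted MLE = weighted least squares
        beta *= xi                                # variance-altering scale update beta_{t+1} = xi * beta_t
    return w
```

For other GLMs the inner solve is the corresponding weighted MLE (e.g. a weighted logistic or gamma regression) rather than a closed-form least-squares solve.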

Appendix F Some Helpful Results

Below we present a few helpful results.

Lemma 7.

Suppose ϵi𝒩(𝟎,I),i=1,,n\text{\boldmath$\mathbf{\epsilon}$}^{i}\sim{\mathcal{N}}({\mathbf{0}},I),i=1,\ldots,n and denote RX:=maxinϵi2R_{X}:=\max_{i\in n}\left\|{\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}. Then we have RXnR_{X}\leq\sqrt{n} with probability at least 1exp(Ω(n))1-\exp(-\Omega\left({{n}}\right)).

Proof.

Follows from standard arguments. ∎

Lemma 8.

For covariate vectors X=[𝐱1,,𝐱n]X=\left[{{{\mathbf{x}}}_{1},\ldots,{{\mathbf{x}}}_{n}}\right] generated from an isotropic sub-Gaussian distribution, for any fixed set S[n]S\subset[n] and n=Ω(d)n=\Omega\left({{d}}\right), with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)),

0.99|S|λmin(XSXS)λmax(XSXS)1.01|S|,0.99\left|{S}\right|\leq\lambda_{\min}(X_{S}X_{S}^{\top})\leq\lambda_{\max}(X_{S}X_{S}^{\top})\leq 1.01\left|{S}\right|,

where the constant inside Ω()\Omega\left({{\cdot}}\right) depends only on the sub-Gaussian distribution and universal constants.

Proof.

Taken from [1]. ∎

Lemma 9.

If ϵ𝒩(𝟎,1βI)\text{\boldmath$\mathbf{\epsilon}$}\sim{\mathcal{N}}({\mathbf{0}},\frac{1}{\beta^{\ast}}\cdot I), then for any function f:df:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}, any β>0\beta>0 and any 𝚫d\text{\boldmath$\mathbf{\Delta}$}\in{\mathbb{R}}^{d}, we have

𝔼[exp(β2ϵ𝚫22)f(ϵ)]=1c~𝔼[f(𝐱)],{\mathbb{E}}\left[{{\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})}}\right]=\frac{1}{\tilde{c}}\cdot{\mathbb{E}}\left[{{f({{\mathbf{x}}})}}\right],

where 𝐱𝒩(ββ+β𝚫,1β+βI){{\mathbf{x}}}\sim{\mathcal{N}}\left({\frac{\beta}{\beta+\beta^{\ast}}\text{\boldmath$\mathbf{\Delta}$},\frac{1}{\beta+\beta^{\ast}}\cdot I}\right) and c~=(β+ββ)dexp(ββ2(β+β)𝚫22)\tilde{c}=\left({\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}}\right)^{d}\exp\left({\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right).

Proof.

We have

𝔼[exp(β2ϵ𝚫22)f(ϵ)]=(β2π)d∫⋯∫dexp(β2ϵ𝚫22)f(ϵ)exp(β2ϵ22)𝑑ϵ\displaystyle{\mathbb{E}}\left[{{\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})}}\right]=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\cdot\exp\left({-\frac{\beta^{\ast}}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}^{2}}\right)\ d\text{\boldmath$\mathbf{\epsilon}$} =(β2π)dexp(β2𝚫22)∫⋯∫dexp(β+β2ϵ22+βϵ𝚫)f(ϵ)𝑑ϵ\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\frac{\beta+\beta^{\ast}}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}^{2}+\beta\text{\boldmath$\mathbf{\epsilon}$}^{\top}\text{\boldmath$\mathbf{\Delta}$}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\ d\text{\boldmath$\mathbf{\epsilon}$} =(β2π)dexp(β2𝚫22+β22(β+β)𝚫22)∫⋯∫dexp(β+β2ϵβ2(β+β)𝚫22)f(ϵ)𝑑ϵ\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}+\frac{\beta^{2}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\left\|{\sqrt{\frac{\beta+\beta^{\ast}}{2}}\text{\boldmath$\mathbf{\epsilon}$}-\frac{\beta}{\sqrt{2(\beta+\beta^{\ast})}}\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\ d\text{\boldmath$\mathbf{\epsilon}$} =(β2π)dexp(ββ2(β+β)𝚫22)∫⋯∫dexp(β+β2ϵββ+β𝚫22)f(ϵ)𝑑ϵ\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\frac{\beta+\beta^{\ast}}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\frac{\beta}{\beta+\beta^{\ast}}\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\ d\text{\boldmath$\mathbf{\epsilon}$} =(β2π)d(2πβ+β)dexp(ββ2(β+β)𝚫22)𝔼[f(𝐱)]\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\left({\sqrt{\frac{2\pi}{\beta+\beta^{\ast}}}}\right)^{d}\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right){\mathbb{E}}\left[{{f({{\mathbf{x}}})}}\right]

which finishes the proof upon using c~(β2π)d(2πβ+β)dexp(ββ2(β+β)𝚫22)=1\tilde{c}\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\left({\sqrt{\frac{2\pi}{\beta+\beta^{\ast}}}}\right)^{d}\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)=1. ∎
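
As an informal sanity check (not needed for the proof), the identity of Lemma 9 can be verified by Monte Carlo simulation; the snippet below compares both sides for an arbitrary test function f and illustrative parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta, beta_star = 3, 2.0, 5.0
Delta = rng.standard_normal(d)
f = lambda z: np.sin(z[:, 0]) + z[:, 1] ** 2       # an arbitrary test function

m = 1_000_000
eps = rng.standard_normal((m, d)) / np.sqrt(beta_star)          # eps ~ N(0, I / beta*)
lhs = np.mean(np.exp(-0.5 * beta * np.sum((eps - Delta) ** 2, axis=1)) * f(eps))

x = (beta / (beta + beta_star)) * Delta + rng.standard_normal((m, d)) / np.sqrt(beta + beta_star)
c_tilde = ((beta + beta_star) / beta_star) ** (d / 2) * np.exp(
    beta * beta_star / (2 * (beta + beta_star)) * np.sum(Delta ** 2))
rhs = np.mean(f(x)) / c_tilde

print(lhs, rhs)    # the two estimates should agree up to Monte Carlo error
```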

Lemma 10.

If ϵ𝒩(𝟎,1βI)\text{\boldmath$\mathbf{\epsilon}$}\sim{\mathcal{N}}({\mathbf{0}},\frac{1}{\beta^{\ast}}\cdot I), then for any function f:df:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}, any β>0\beta>0 and any 𝚫d\text{\boldmath$\mathbf{\Delta}$}\in{\mathbb{R}}^{d}, we have

𝔼[exp(βϵ𝚫)f(ϵ)]=q~𝔼[f(𝐱)],{\mathbb{E}}\left[{{\exp\left({\beta\text{\boldmath$\mathbf{\epsilon}$}^{\top}\text{\boldmath$\mathbf{\Delta}$}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})}}\right]=\tilde{q}\cdot{\mathbb{E}}\left[{{f({{\mathbf{x}}})}}\right],

where 𝐱𝒩(ββ𝚫,1βI){{\mathbf{x}}}\sim{\mathcal{N}}\left({\frac{\beta}{\beta^{\ast}}\text{\boldmath$\mathbf{\Delta}$},\frac{1}{\beta^{\ast}}\cdot I}\right) and q~=exp(β22β𝚫22)\tilde{q}=\exp\left({\frac{\beta^{2}}{2\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right).

Proof.

We have

𝔼[exp(βϵ𝚫)f(ϵ)]=(β2π)d∫⋯∫dexp(βϵ𝚫)f(ϵ)exp(β2ϵ22)𝑑ϵ\displaystyle{\mathbb{E}}\left[{{\exp\left({\beta\text{\boldmath$\mathbf{\epsilon}$}^{\top}\text{\boldmath$\mathbf{\Delta}$}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})}}\right]=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\idotsint_{{\mathbb{R}}^{d}}\exp\left({\beta\text{\boldmath$\mathbf{\epsilon}$}^{\top}\text{\boldmath$\mathbf{\Delta}$}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\cdot\exp\left({-\frac{\beta^{\ast}}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}^{2}}\right)\ d\text{\boldmath$\mathbf{\epsilon}$}
=(β2π)dexp(β22β𝚫22)∫⋯∫dexp(12βϵββ𝚫22)f(ϵ)𝑑ϵ\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\exp\left({\frac{\beta^{2}}{2\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\frac{1}{2}\left\|{\sqrt{\beta^{\ast}}\text{\boldmath$\mathbf{\epsilon}$}-\frac{\beta}{\sqrt{\beta^{\ast}}}\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\ d\text{\boldmath$\mathbf{\epsilon}$}
=(β2π)dexp(β22β𝚫22)∫⋯∫dexp(β2ϵββ𝚫22)f(ϵ)𝑑ϵ\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\exp\left({\frac{\beta^{2}}{2\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\frac{\beta^{\ast}}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\frac{\beta}{\beta^{\ast}}\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\ d\text{\boldmath$\mathbf{\epsilon}$}

which finishes the proof. ∎

Lemma 11.

If ϵ𝒩(𝟎,1βI)\text{\boldmath$\mathbf{\epsilon}$}\sim{\mathcal{N}}({\mathbf{0}},\frac{1}{\beta^{\ast}}\cdot I), then for any constant C>0C>0 and fixed vectors 𝐮,𝐯{{\mathbf{u}}},{{\mathbf{v}}} such that 𝐮2=u\left\|{{{\mathbf{u}}}}\right\|_{2}=u and 𝐯2=v\left\|{{{\mathbf{v}}}}\right\|_{2}=v, we have

𝔼[11+Cexp(β𝐯ϵ)𝐮ϵ]min{C,1C}exp(β2v22β)βvβu{\mathbb{E}}\left[{{\frac{1}{1+C\exp\left({\beta{{\mathbf{v}}}^{\top}\text{\boldmath$\mathbf{\epsilon}$}}\right)}\cdot{{\mathbf{u}}}^{\top}\text{\boldmath$\mathbf{\epsilon}$}}}\right]\leq\min\left\{{C,\frac{1}{C}}\right\}\exp\left({\frac{\beta^{2}v^{2}}{2\beta^{\ast}}}\right)\frac{\beta v}{\beta^{\ast}}u
Proof.

We begin by analyzing the vector 𝝂=𝔼[11+Cexp(β𝐯ϵ)ϵ]\text{\boldmath$\mathbf{\nu}$}={\mathbb{E}}\left[{{\frac{1}{1+C\exp\left({\beta{{\mathbf{v}}}^{\top}\text{\boldmath$\mathbf{\epsilon}$}}\right)}\cdot\text{\boldmath$\mathbf{\epsilon}$}}}\right] itself. Note that due to the rotational symmetry of the Gaussian distribution, we can, w.l.o.g. assume that 𝐯=(v,0,0,,0){{\mathbf{v}}}=(v,0,0,\ldots,0). This means that 𝐯ϵ=vϵ1{{\mathbf{v}}}^{\top}\text{\boldmath$\mathbf{\epsilon}$}=v\cdot\text{\boldmath$\mathbf{\epsilon}$}_{1}. Thus, the ithi\text{${}^{\text{th}}$} coordinate of the vector 𝝂\mathbf{\nu} i.e. 𝝂i=𝔼[11+Cexp(βvϵ1)ϵi]\text{\boldmath$\mathbf{\nu}$}_{i}={\mathbb{E}}\left[{{\frac{1}{1+C\exp\left({\beta v\cdot\text{\boldmath$\mathbf{\epsilon}$}_{1}}\right)}\cdot\text{\boldmath$\mathbf{\epsilon}$}_{i}}}\right]. Thus, by independence and unbiased-ness of the coordinates of a Gaussian vector, we have 𝝂i=0\text{\boldmath$\mathbf{\nu}$}_{i}=0 for all i1i\neq 1. So all we are left to analyze is 𝝂1\text{\boldmath$\mathbf{\nu}$}_{1}. We have

|𝝂1|\displaystyle\left|{\text{\boldmath$\mathbf{\nu}$}_{1}}\right| =β2π|exp(βϵ22)ϵ1+Cexp(βvϵ)𝑑ϵ|\displaystyle=\sqrt{\frac{\beta^{\ast}}{2\pi}}\left|{\int_{\mathbb{R}}\frac{\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon}{1+C\exp(\beta v\epsilon)}\ d\epsilon}\right|
=β2π0exp(βϵ22)ϵC(exp(βvϵ)exp(βvϵ))1+C2+C(exp(βvϵ)+exp(βvϵ))𝑑ϵ\displaystyle=\sqrt{\frac{\beta^{\ast}}{2\pi}}\int_{0}^{\infty}\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon\cdot\frac{C(\exp(\beta v\epsilon)-\exp(-\beta v\epsilon))}{1+C^{2}+C(\exp(\beta v\epsilon)+\exp(-\beta v\epsilon))}\ d\epsilon
β2π0exp(βϵ22)ϵCexp(βvϵ)1+C2+Cexp(βvϵ)𝑑ϵ\displaystyle\leq\sqrt{\frac{\beta^{\ast}}{2\pi}}\int_{0}^{\infty}\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon\cdot\frac{C\exp(\beta v\epsilon)}{1+C^{2}+C\exp(\beta v\epsilon)}\ d\epsilon
β2π0exp(βϵ22)ϵCexp(βvϵ)1+C2𝑑ϵ\displaystyle\leq\sqrt{\frac{\beta^{\ast}}{2\pi}}\int_{0}^{\infty}\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon\cdot\frac{C\exp(\beta v\epsilon)}{1+C^{2}}\ d\epsilon
min{C,1C}β2π0exp(βϵ22)ϵexp(βvϵ)𝑑ϵ\displaystyle\leq\min\left\{{C,\frac{1}{C}}\right\}\sqrt{\frac{\beta^{\ast}}{2\pi}}\int_{0}^{\infty}\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon\cdot\exp(\beta v\epsilon)\ d\epsilon
min{C,1C}β2πexp(βϵ22)ϵexp(βvϵ)𝑑ϵ\displaystyle\leq\min\left\{{C,\frac{1}{C}}\right\}\sqrt{\frac{\beta^{\ast}}{2\pi}}\int_{\mathbb{R}}\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon\cdot\exp(\beta v\epsilon)\ d\epsilon
=min{C,1C}𝔼[exp(βvϵ)ϵ]\displaystyle=\min\left\{{C,\frac{1}{C}}\right\}{\mathbb{E}}\left[{{\exp(\beta v\epsilon)\cdot\epsilon}}\right]
=min{C,1C}exp(β2v22β)βvβ\displaystyle=\min\left\{{C,\frac{1}{C}}\right\}\exp\left({\frac{\beta^{2}v^{2}}{2\beta^{\ast}}}\right)\frac{\beta v}{\beta^{\ast}}

since C1+C2min{C,1C}\frac{C}{1+C^{2}}\leq\min\left\{{C,\frac{1}{C}}\right\} and we used Lemma 10 in the last step since that lemma is independent of the dimensionality of the Gaussian vector. Applying the Cauchy-Schwarz inequality then gives us the result. ∎

Lemma 12.

Let β,V>2\beta,V>2, then we have

exp(x22)1+exp(βV(Vx))𝑑xexp(x22)1+exp(βV(Vx))x𝑑x}exp(Ω(V2))\left.\begin{array}[]{c}\int_{\mathbb{R}}\frac{\exp\left({-\frac{x^{2}}{2}}\right)}{1+\exp\left({\beta V(V-x)}\right)}\ dx\\ \int_{\mathbb{R}}\frac{\exp\left({-\frac{x^{2}}{2}}\right)}{1+\exp\left({\beta V(V-x)}\right)}x\ dx\end{array}\right\}\leq\exp(-\Omega\left({{V^{2}}}\right))
Proof.

By completing squares we have

exp(x22)1+exp(βV(Vx))=exp(x22)exp(βx24)exp(βx24)+exp(β4(x2V)2)\frac{\exp\left({-\frac{x^{2}}{2}}\right)}{1+\exp\left({\beta V(V-x)}\right)}=\exp\left({-\frac{x^{2}}{2}}\right)\frac{\exp\left({\frac{\beta x^{2}}{4}}\right)}{\exp\left({\frac{\beta x^{2}}{4}}\right)+\exp\left({\frac{\beta}{4}\left({x-2V}\right)^{2}}\right)}

Now, we consider two cases

  1. 1.

    Case 1 (x<V2)(x<\frac{V}{2}): In this case exp(βx24)exp(βV216)\exp\left({\frac{\beta x^{2}}{4}}\right)\leq\exp\left({\frac{\beta V^{2}}{16}}\right) whereas exp(β4(x2V)2)exp(9βV216)\exp\left({\frac{\beta}{4}\left({x-2V}\right)^{2}}\right)\geq\exp\left({\frac{9\beta V^{2}}{16}}\right). Thus, we have exp(βx24)exp(βx24)+exp(β4(x2V)2)exp(βx24)exp(β4(x2V)2)exp(βV22)\frac{\exp\left({\frac{\beta x^{2}}{4}}\right)}{\exp\left({\frac{\beta x^{2}}{4}}\right)+\exp\left({\frac{\beta}{4}\left({x-2V}\right)^{2}}\right)}\leq\frac{\exp\left({\frac{\beta x^{2}}{4}}\right)}{\exp\left({\frac{\beta}{4}\left({x-2V}\right)^{2}}\right)}\leq\exp\left({-\frac{\beta V^{2}}{2}}\right) in this region. Using standard Gaussian integrals we conclude that the region (,V2)\left({-\infty,\frac{V}{2}}\right) contributes at most 𝒪(exp(βV22))exp(Ω(V2)){\cal O}\left({{\exp\left({-\frac{\beta V^{2}}{2}}\right)}}\right)\leq\exp(-\Omega\left({{V^{2}}}\right)) (since β>2\beta>2) to both integrals.

  2. 2.

    Case 2 (x>V2)(x>\frac{V}{2}): In this case we simply bound exp(βx24)exp(βx24)+exp(β4(x2V)2)1\frac{\exp\left({\frac{\beta x^{2}}{4}}\right)}{\exp\left({\frac{\beta x^{2}}{4}}\right)+\exp\left({\frac{\beta}{4}\left({x-2V}\right)^{2}}\right)}\leq 1 and use standard bounds on the complementary error function to conclude that the contribution of the region (V2,)\left({\frac{V}{2},\infty}\right) to both integrals is at most exp(Ω(V2))\exp\left({-\Omega\left({{V^{2}}}\right)}\right).

Lemma 13.

Suppose ϵ𝒩(𝟎,1βI)\text{\boldmath$\mathbf{\epsilon}$}\sim{\mathcal{N}}\left({{\mathbf{0}},\frac{1}{\beta^{\ast}}\cdot I}\right) and 𝐯,𝚫{{\mathbf{v}}},\text{\boldmath$\mathbf{\Delta}$} are fixed vectors such that β𝚫21\sqrt{\beta}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq 1. Then the random variable X:=exp(β2ϵ𝚫22)ϵ𝐯X:=\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\text{\boldmath$\mathbf{\epsilon}$}^{\top}{{\mathbf{v}}} has a subexponential constant at most

2β+d(ββ+β)dexp(ββ2(β+β)𝚫22)\frac{2}{\sqrt{\beta+d}}\left({\sqrt{\frac{\beta^{\ast}}{\beta+\beta^{\ast}}}}\right)^{d}\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)
Proof.

The subexponential constant of a random variable XX is defined as the value Xψ1:=supp11p(𝔼[|X|p])1/p\left\|{X}\right\|_{\psi_{1}}:=\sup_{p\geq 1}\frac{1}{p}\left({{\mathbb{E}}\left[{{\left|{X}\right|^{p}}}\right]}\right)^{1/p}. In contrast, the subGaussian constant of a random variable XX is defined as the value Xψ2:=supp11p(𝔼[|X|p])1/p\left\|{X}\right\|_{\psi_{2}}:=\sup_{p\geq 1}\frac{1}{\sqrt{p}}\left({{\mathbb{E}}\left[{{\left|{X}\right|^{p}}}\right]}\right)^{1/p}. Using Lemma 9 gives us

𝔼[|X|p]\displaystyle{\mathbb{E}}\left[{{\left|{X}\right|^{p}}}\right] =𝔼[exp(pβ2ϵ𝚫22)(ϵ𝐯)p]\displaystyle={\mathbb{E}}\left[{{\exp\left({-\frac{p\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)(\text{\boldmath$\mathbf{\epsilon}$}^{\top}{{\mathbf{v}}})^{p}}}\right]
=(βpβ+β)dexp(pββ2(pβ+β)𝚫22)𝔼[(𝐱𝐯)p],\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{p\beta+\beta^{\ast}}}}\right)^{d}\exp\left({-\frac{p\beta\beta^{\ast}}{2(p\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right){\mathbb{E}}\left[{{({{\mathbf{x}}}^{\top}{{\mathbf{v}}})^{p}}}\right],

where 𝐱𝒩(pβpβ+β𝚫,1pβ+βI){{\mathbf{x}}}\sim{\mathcal{N}}\left({\frac{p\beta}{p\beta+\beta^{\ast}}\text{\boldmath$\mathbf{\Delta}$},\frac{1}{p\beta+\beta^{\ast}}\cdot I}\right). Now by virtue of 𝐱{{\mathbf{x}}} being a Gaussian and using the triangle inequality for the subGaussian norm, we know that the random variable 𝐱𝐯{{\mathbf{x}}}^{\top}{{\mathbf{v}}} is (pβpβ+β𝚫2+1pβ+β)\left({\frac{p\beta}{p\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}+\frac{1}{\sqrt{p\beta+\beta^{\ast}}}}\right)-subGaussian. This, in turn implies that

(𝔼[(𝐱𝐯)p])1/pp(pβpβ+β𝚫2+1pβ+β)p(pβpβ+β+1pβ+β)\left({{\mathbb{E}}\left[{{({{\mathbf{x}}}^{\top}{{\mathbf{v}}})^{p}}}\right]}\right)^{1/p}\leq\sqrt{p}\cdot\left({\frac{p\beta}{p\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}+\frac{1}{\sqrt{p\beta+\beta^{\ast}}}}\right)\leq\sqrt{p}\cdot\left({\frac{p\sqrt{\beta}}{p\beta+\beta^{\ast}}+\frac{1}{\sqrt{p\beta+\beta^{\ast}}}}\right)

Thus, we have

1p(𝔼[|X|p])1/p1p(βpβ+β)d/pexp(ββ2(pβ+β)𝚫22)(pβpβ+β+1pβ+β)(A)\displaystyle\frac{1}{p}\left({{\mathbb{E}}\left[{{\left|{X}\right|^{p}}}\right]}\right)^{1/p}\leq\frac{1}{\sqrt{p}}\cdot\underbrace{\left({\sqrt{\frac{\beta^{\ast}}{p\beta+\beta^{\ast}}}}\right)^{d/p}\exp\left({-\frac{\beta\beta^{\ast}}{2(p\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\left({\frac{p\sqrt{\beta}}{p\beta+\beta^{\ast}}+\frac{1}{\sqrt{p\beta+\beta^{\ast}}}}\right)}_{(A)}

Now, (A)(A) is a bounded function whereas 1p\frac{1}{\sqrt{p}} is a decreasing function. Thus, 1p(𝔼[|X|p])1/p\frac{1}{p}\left({{\mathbb{E}}\left[{{\left|{X}\right|^{p}}}\right]}\right)^{1/p} is a decreasing function of pp and hence achieves its maximum value in the range p1p\geq 1 at p=1p=1 itself. Noting that (ββ+β+1β+β)2β+d\left({\frac{\sqrt{\beta}}{\beta+\beta^{\ast}}+\frac{1}{\sqrt{\beta+\beta^{\ast}}}}\right)\leq\frac{2}{\sqrt{\beta+d}} finishes the proof. ∎

Appendix G Robust Regression

For this proof we use the notation X=[𝐱1,,𝐱n]d×n,𝐲=[y1,,yn]n,𝐛=[b1,,bn]nX=[{{\mathbf{x}}}^{1},\ldots,{{\mathbf{x}}}^{n}]\in{\mathbb{R}}^{d\times n},{{\mathbf{y}}}=[y_{1},\ldots,y_{n}]\in{\mathbb{R}}^{n},{{\mathbf{b}}}=[b_{1},\ldots,b_{n}]\in{\mathbb{R}}^{n}. For any vector 𝐯m{{\mathbf{v}}}\in{\mathbb{R}}^{m} and any set T[m]T\subseteq[m], 𝐯T{{\mathbf{v}}}_{T} denotes the vector with all coordinates other than those in the set TT zeroed out. Similarly, for any matrix Ak×m,ATA\in{\mathbb{R}}^{k\times m},A_{T} denotes the matrix with all columns other than those in the set TT zeroed out. We will let G,BG,B respectively denote the set of “good” uncorrupted points and “bad” corrupted points. We will abuse notation to let G=(1α)nG=(1-\alpha)\cdot n and B=αnB=\alpha\cdot n respectively denote the number of good and bad points too.

Theorem 14 (Theorem 2 restated – Partially Adaptive Adversary).

For data generated in the robust regression model as described in §4, suppose corruptions are introduced by a partially adaptive adversary i.e. the locations of the corruptions (the set BB) are not decided adversarially but the corruptions themselves are decided jointly, adversarially and may be unbounded, then SVAM-RR enjoys a breakdown point of 0.18660.1866, i.e. it ensures a bounded 𝒪(1){\cal O}\left({{1}}\right) error even if k=αnk=\alpha\cdot n corruptions are introduced where the value of α\alpha can go up to at least 0.18660.1866. More generally, for corruption rates α0.1866\alpha\leq 0.1866, there always exist values of the scale increment ξ>1\xi>1 s.t. with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)), the LWSC/LWLC conditions are satisfied for the Q~β\tilde{Q}_{\beta} function corresponding to the robust least squares model for β\beta values at least as large as βmax=𝒪(βmin{1α2/3,ndlog(n)})\beta_{\max}={\cal O}\left({{\beta^{\ast}\min\left\{{\frac{1}{\alpha^{2/3}},\sqrt{\frac{n}{d\log(n)}}}\right\}}}\right).
Hybrid Corruption Model: If initialized with 𝐰^1,β1\hat{{\mathbf{w}}}^{1},\beta^{1} s.t. β1𝐰^1𝐰221\beta_{1}\cdot\left\|{\hat{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1, SVAM-RR assures

𝐰^T𝐰22𝒪(1βmax{α2/3,dlog(n)n})\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq{\cal O}\left({{\frac{1}{\beta^{\ast}}\max\left\{{\alpha^{2/3},\sqrt{\frac{d\log(n)}{n}}}\right\}}}\right)

within T𝒪(lognβ1)T\leq{\cal O}\left({{\log\frac{n}{\beta^{1}}}}\right) iterations for the hybrid corruption model where even points uncorrupted by the adversary receive Gaussian noise with variance 1β\frac{1}{\beta^{\ast}} i.e. yi=𝐰,𝐱i+ϵiy_{i}=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle+\epsilon_{i} for iGi\in G where ϵi𝒩(0,1β)\epsilon_{i}\sim{\mathcal{N}}\left({0,\frac{1}{\beta^{\ast}}}\right).
Pure Corruption Model: For the pure corruption model where uncorrupted points receive no Gaussian noise i.e. yi=𝐰,𝐱iy_{i}=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle for iGi\in G, SVAM-RR assures exact model recovery. Specifically, for any ϵ>0\epsilon>0, SVAM-RR assures

𝐰^T𝐰22ϵ\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\epsilon

within T𝒪(log1ϵβ1)T\leq{\cal O}\left({{\log\frac{1}{\epsilon\beta^{1}}}}\right) iterations.

Proof.

For any two models 𝐯,𝐰{{\mathbf{v}}},{{\mathbf{w}}}, the Q~β\tilde{Q}_{\beta} function for robust least squares has the following form

Q~β(𝐯|𝐰)=i=1nsi(𝐯,𝐱iyi)2,\tilde{Q}_{\beta}({{\mathbf{v}}}\,|\,{{\mathbf{w}}})=\sum_{i=1}^{n}s_{i}\cdot\left({\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}^{i}}\right\rangle-y_{i}}\right)^{2},

where siexp(β2(yi𝐱i,𝐰)2)s_{i}\leftarrow\exp\left({-\frac{\beta}{2}(y_{i}-\left\langle{{{\mathbf{x}}}^{i}},{{{\mathbf{w}}}}\right\rangle)^{2}}\right). We first outline the proof below.

Proof Outline. This proof has three key elements

  1. 1.

We will establish the LWSC and LWLC properties for any fixed value of β>0\beta>0 with probability 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)). As promised in the statement of Theorem 2, we will execute SVAM-RR for no more than 𝒪(logn){\cal O}\left({{\log n}}\right) iterations; taking a naive union bound over these iterations would offer a confidence level of 1lognexp(Ω(d))1-\log n\exp(-\Omega\left({{d}}\right)). However, this can be improved by noticing that the confidence levels offered by the LWSC/LWLC results are actually of the form 1exp(Ω(nζ2dlogn))1-\exp(-\Omega\left({{n\zeta^{2}-d\log n}}\right)). Thus, a union over 𝒪(logn){\cal O}\left({{\log n}}\right) such events will at best deteriorate the confidence bounds to 1lognexp(Ω(nζ2dlogn))=1exp(Ω(nζ2dlognloglogn))1-\log n\exp(-\Omega\left({{n\zeta^{2}-d\log n}}\right))=1-\exp(-\Omega\left({{n\zeta^{2}-d\log n-\log\log n}}\right)), which is still 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) for the values of ζ\zeta we shall set.

  2. 2.

The key to this proof is to maintain the invariant βt𝐰^t𝐰221\beta_{t}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1. Recall that initialization ensures β1𝐰^1𝐰221\beta_{1}\cdot{\left\|{\hat{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}}\leq 1 to start things off. §3 gives details on how to initialize in practice. This establishes the base case of an inductive argument. Next, inductively assuming that βt𝐰^t𝐰221\beta_{t}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1 for an iteration tt, we will establish that 𝐰^t+1𝐰22Λβtλβt(A)βt\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}\leq\frac{(A)}{\sqrt{\beta_{t}}} where (A)(A) will be an application-specific expression derived below.

  3. 3.

We will then ensure that (A)<1(A)<1, say (A)=1/ξ(A)=1/\sqrt{\xi} for some ξ>1\xi>1, whenever the number of corruptions is below the breakdown point. This ensures 𝐰^t+1𝐰221ξβt\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\frac{1}{{\xi\beta_{t}}}, in other words, βt+1𝐰^t+1𝐰221{\beta_{t+1}}\cdot{\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}}\leq 1 for βt+1=ξβt\beta_{t+1}=\xi\cdot\beta_{t} so that the invariant is preserved. However, notice that the above step simultaneously ensures that 2Λβtλβt1ξβt\frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}\leq\frac{1}{\sqrt{\xi\beta_{t}}}. This ensures that a valid value of the scale increment ξ\xi can always be found as long as βtβmax\beta_{t}\leq\beta_{\max}. Specifically, we will be able to assure the existence of a scale increment ξ>1\xi>1 satisfying the conditions of Theorem 1 w.r.t. the LWSC/LWLC results only for β<𝒪(βmin{1α2/3,ndlog(n)})\beta<{\cal O}\left({{\beta^{\ast}\min\left\{{\frac{1}{\alpha^{2/3}},\sqrt{\frac{n}{d\log(n)}}}\right\}}}\right).

We now present the proof. Lemmata 15,16 establish the LWSC/LWLC properties for the Q~β\tilde{Q}_{\beta} function for robust least squares regression. Let 𝚫:=𝐰^t𝐰\text{\boldmath$\mathbf{\Delta}$}:=\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}. By Lemma 8, with probability at least 1exp(Ω(nd))1-\exp(-\Omega\left({{n-d}}\right)), we have XB2=λmax(XBXB)1.01B\left\|{X_{B}}\right\|_{2}=\sqrt{\lambda_{\max}(X_{B}X_{B}^{\top})}\leq\sqrt{1.01B}. The proof of Lemma 16 tells us that with the same probability, we have

S𝐛2B2π[(β𝚫21.01)1/3+(1e)1/3]32B2π[(κ21.01)1/3+(1e)1/3]32.\left\|{S{{\mathbf{b}}}}\right\|_{2}\leq\sqrt{\frac{B}{2\pi}}[(\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}1.01)^{1/3}+(\frac{1}{e})^{1/3}]^{\frac{3}{2}}\leq\sqrt{\frac{B}{2\pi}}[(\kappa^{2}1.01)^{1/3}+(\frac{1}{e})^{1/3}]^{\frac{3}{2}}.

On the other hand, the proof of Lemma 16 also tells us that with probability at least 1exp(Ω(nν2dlog1νdlog(n)))1-\exp\left({-\Omega\left({{n\nu^{2}-d\log\frac{1}{\nu}-d\log(n)}}\right)}\right) we have,

XSϵ2=XGSGϵGG(1+ν)κ22πββ+β1g~.\left\|{XS\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}=\left\|{X_{G}S_{G}\text{\boldmath$\mathbf{\epsilon}$}_{G}}\right\|\leq G(1+\nu)\sqrt{\frac{\kappa^{2}}{2\pi}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\tilde{g}}.

By Lemma 15, with probability at least 1exp(Ω(nζ2dlog1ζdlog(n)))1-\exp\left({-\Omega\left({{n\zeta^{2}-d\log\frac{1}{\zeta}-d\log(n)}}\right)}\right), we have λmin(XSX)λmin(XGSGXG)β2π(1ζ)1g~G\lambda_{\min}(XSX^{\top})\geq\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})\geq\sqrt{\frac{\beta}{2\pi}}(1-\zeta)\frac{1}{\tilde{g}}\cdot G. This gives us,

𝐰^t+1𝐰2B1.012π[(1.01κ2)1/3+(1e)1/3]32+G(1+ν)κ22πββ+β1g~β2π(1ζ)1g~G\displaystyle\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\frac{B\sqrt{\frac{1.01}{2\pi}}[(1.01\kappa^{2})^{1/3}+(\frac{1}{e})^{1/3}]^{\frac{3}{2}}+G(1+\nu)\sqrt{\frac{\kappa^{2}}{2\pi}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\tilde{g}}}{\sqrt{\frac{\beta}{2\pi}}(1-\zeta)\frac{1}{\tilde{g}}\cdot G} =κβ11ζ(α1αg~1.01[(1.01)1/3+(eκ2)1/3]32+(1+ν)ββ+β)\displaystyle=\frac{\kappa}{\sqrt{\beta}}\cdot\frac{1}{1-\zeta}\left({\frac{\alpha}{1-\alpha}\tilde{g}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}+(1+\nu)\frac{\beta}{\beta+\beta^{\ast}}}\right) =κβ11ζ(α1αβ+ββ(1+βββ+β𝚫22)3/21.01[(1.01)1/3+(eκ2)1/3]32+(1+ν)ββ+β)\displaystyle=\frac{\kappa}{\sqrt{\beta}}\cdot\frac{1}{1-\zeta}\left({\frac{\alpha}{1-\alpha}\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}+(1+\nu)\frac{\beta}{\beta+\beta^{\ast}}}\right) κβ11ζ(α1α1+ββ(1+βκ2β+β)3/21.01[(1.01)1/3+(eκ2)1/3]32+(1+ν)ββ+β)\displaystyle\leq\frac{\kappa}{\sqrt{\beta}}\cdot\frac{1}{1-\zeta}\left({\frac{\alpha}{1-\alpha}\sqrt{1+\frac{\beta}{\beta^{\ast}}}{\left({1+\frac{\beta^{\ast}\kappa^{2}}{\beta+\beta^{\ast}}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}+(1+\nu)\frac{\beta}{\beta+\beta^{\ast}}}\right) κβ11ζ(α1α1+ββ(1+κ2)3/21.01[(1.01)1/3+(eκ2)1/3]32+(1+ν)ββ+β)(A)\displaystyle\leq\frac{\kappa}{\sqrt{\beta}}\cdot\underbrace{\frac{1}{1-\zeta}\left({\frac{\alpha}{1-\alpha}\sqrt{1+\frac{\beta}{\beta^{\ast}}}{\left({1+\kappa^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}+(1+\nu)\frac{\beta}{\beta+\beta^{\ast}}}\right)}_{(A)}

where in the last two steps, we used that β𝚫2κ\sqrt{\beta}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq\kappa and ββ+β1\frac{\beta^{\ast}}{\beta+\beta^{\ast}}\leq 1. We shall set κ=0.47\kappa=0.47 below. Now, in order to satisfy the conditions of Theorem 1, we need to assure the existence of a scale increment ξ>1\xi>1 that assures a linear rate of convergence. Since SVAM-RR always maintains the invariant βt𝐰^t𝐰221{\beta_{t}}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1 and κ<1\kappa<1, assuring the existence of a scale increment ξ>1\xi>1 is easily seen to be equivalent to showing that (A)<1(A)<1. This is done below and gives us our breakdown point.

  1. 1.

    Breakdown Point: In order to ensure a linear rate of convergence, we need only ensure (A)<1(A)<1. If we set ν,ζ\nu,\zeta to small constants (which we can always do for large enough nn) and also set ββ\frac{\beta}{\beta^{\ast}} to a small constant (which still allows us to offer an error 𝚫2𝒪(1β)\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq{\cal O}\left({{\frac{1}{\sqrt{\beta^{\ast}}}}}\right)), then we can get (A)<1(A)<1 if

    α1α(1+κ2)3/21.01[(1.01)1/3+(eκ2)1/3]32<1\frac{\alpha}{1-\alpha}{\left({1+\kappa^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}<1

    Setting κ=0.47\kappa=0.47 gives us a breakdown point of α0.1866\alpha\leq 0.1866.

  2. 2.

    Consistency (Hybrid Corruption): For vanishing corruption i.e. α0\alpha\rightarrow 0, we can instead show a much stronger, consistent estimation guarantee. Suppose we promise that we would always set ζ,ν1n\zeta,\nu\geq\frac{1}{n}. Then, the results in Lemmata 15 and 16 hold with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) even if we set ζ,ν=Ω(dlog(n)n)\zeta,\nu=\Omega\left({{\sqrt{\frac{d\log(n)}{n}}}}\right). Now, recall that we need to set (A)<1(A)<1 to expect a linear rate of convergence. Setting ζ=ν\zeta=\nu to simplify notation (and also since both can be set to similar values without sacrificing 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) confidence guarantees), using α0.5\alpha\leq 0.5, and using the shorthand ρ:=1+ββ\rho:=1+\frac{\beta}{\beta^{\ast}} gives us the requirement

    (A)1\displaystyle(A)\leq 1\Leftrightarrow
    11ζ(α1α1+ββ(1+κ2)3/21.01[(1.01)1/3+(eκ2)1/3]32+(1+ν)ββ+β)<1\displaystyle\frac{1}{1-\zeta}\left({\frac{\alpha}{1-\alpha}\sqrt{1+\frac{\beta}{\beta^{\ast}}}{\left({1+\kappa^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}+(1+\nu)\frac{\beta}{\beta+\beta^{\ast}}}\right)<1
    11ν(9αρ+(1+ν)ρ1ρ)<1\displaystyle\Leftrightarrow\frac{1}{1-\nu}\left({9\alpha\sqrt{\rho}+(1+\nu)\frac{\rho-1}{\rho}}\right)<1

    The last requirement can be fulfilled for ββmin{Ω(1α2/3),Ω(ndlog(n))}\beta\leq\beta^{\ast}\cdot\min\left\{{\Omega\left({{\frac{1}{\alpha^{2/3}}}}\right),\Omega\left({{\sqrt{\frac{n}{d\log(n)}}}}\right)}\right\} which assures us of an error guarantee of 𝚫22𝒪(1βmax{α2/3,dlog(n)n})\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq{\cal O}\left({{\frac{1}{\beta^{\ast}}\max\left\{{\alpha^{2/3},\sqrt{\frac{d\log(n)}{n}}}\right\}}}\right) and finishes the proof for hybrid corruption case. Note that for no corruptions i.e. α=0\alpha=0, we do recover 𝚫22𝒪(dlog(n)n)\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq{\cal O}\left({{\sqrt{\frac{d\log(n)}{n}}}}\right).

  3. 3.

    Consistency (Pure Corruption): The pure corruption case corresponds to β=\beta^{\ast}=\infty. In this case, the above analysis shows that LWSC/LWLC properties continue to hold for arbitrarily large values of β\beta that assures arbitrarily accurate recovery of 𝐰{{\mathbf{w}}}^{\ast}. Given the linear rate of convergence and the fact that SVAM-RR maintains the invariant βt𝐰^t𝐰21\sqrt{\beta_{t}}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq 1, it is clear that for any ϵ>0\epsilon>0, a model recovery error of 𝐰^T𝐰22ϵ\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\epsilon is assured within T𝒪(log1ϵβ1)T\leq{\cal O}\left({{\log\frac{1}{\epsilon\beta^{1}}}}\right) iterations.

Lemma 15 (LWSC for Robust Least Squares Regression).

For any 0βn0\leq\beta\leq n, the Q~β\tilde{Q}_{\beta}-function for robust regression satisfies the LWSC property with constant λβGcε(1ζ)\lambda_{\beta}\geq Gc_{\varepsilon}(1-\zeta) with probability at least 1exp(d)1-\exp(-d) for any ζΩ(dlog(n)n)\zeta\geq\Omega\left({{\sqrt{\frac{d\log(n)}{n}}}}\right). In particular, for standard Gaussian covariates and Gaussian noise with variance 1β\frac{1}{\beta^{\ast}}, we can take cε1g~c_{\varepsilon}\geq\frac{1}{\tilde{g}} where g~=β+ββ(1+βββ+β𝚫22)3/2\tilde{g}=\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}.

Proof.

It is easy to see that 2Q~β(𝐰^|𝐰)=XSX\nabla^{2}\tilde{Q}_{\beta}(\hat{{\mathbf{w}}}\,|\,{{\mathbf{w}}})=XSX^{\top} for any 𝐰^2(𝐰,1β)\hat{{\mathbf{w}}}\in{\mathcal{B}}_{2}\left({{{\mathbf{w}}}^{\ast},\sqrt{\frac{1}{\beta}}}\right). Let 𝐱𝒟,ϵ𝒟ε{{\mathbf{x}}}\sim{\mathcal{D}},\epsilon\sim{\mathcal{D}}_{\varepsilon} and let y=𝐰,𝐱+ϵy=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}}\right\rangle+\epsilon be the response of an uncorrupted data point and 𝐰2(𝐰,κβ){{\mathbf{w}}}\in{\mathcal{B}}_{2}\left({{{\mathbf{w}}}^{\ast},\frac{\kappa}{\sqrt{\beta}}}\right) be any fixed model. Then if we let 𝚫:=𝐰𝐰\text{\boldmath$\mathbf{\Delta}$}:={{\mathbf{w}}}^{\ast}-{{\mathbf{w}}}, then weight 𝐬i=β2πexp(β2(𝚫,𝐱i+ϵi)2){{\mathbf{s}}}_{i}=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle+\epsilon_{i})^{2}).

For any fixed 𝐯Sd1{{\mathbf{v}}}\in S^{d-1}, we have:

𝔼[𝐯XGSGXG𝐯]=G𝔼[si𝐱i,𝐯2]\displaystyle{\mathbb{E}}\left[{{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}}}\right]=G\cdot{\mathbb{E}}\left[{{s_{i}\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}}}\right] =Gβ2π𝔼𝐱i𝒟,ϵi𝒟ϵ[𝐱i,𝐯2exp(β2(𝚫,𝐱i+ϵi)2)]\displaystyle=G\sqrt{\frac{\beta}{2\pi}}\cdot\underset{{{\mathbf{x}}}_{i}\sim{\mathcal{D}},\epsilon_{i}\sim{{\mathcal{D}}}_{\epsilon}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}(\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}_{i}}\right\rangle+\epsilon_{i})^{2})}}\right]
Gβ2πc(β,𝚫,σ)\displaystyle\geq G\sqrt{\frac{\beta}{2\pi}}\cdot c(\beta,\text{\boldmath$\mathbf{\Delta}$},\sigma)

where,

cε:=c(β,𝚫,σ):=inf𝐯Sd1{𝔼𝐱𝒟,ϵ𝒟ϵ[𝐱,𝐯2exp(β2(𝚫,𝐱+ϵ)2)]}c_{\varepsilon}:=c(\beta,\text{\boldmath$\mathbf{\Delta}$},\sigma):=\inf_{{{\mathbf{v}}}\in S^{d-1}}\left\{{\underset{{{\mathbf{x}}}\sim{\mathcal{D}},\epsilon\sim{{\mathcal{D}}}_{\epsilon}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}(\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle+\epsilon)^{2})}}\right]}\right\}

Similar to Lemma 17 we have:

[λmin(XGSGXG)<(1ζ2)cεGβ2π]29dexp[mnζ2cε2128R4]{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<\left({1-\frac{\zeta}{2}}\right)\,c_{\varepsilon}\,G\sqrt{\frac{\beta}{2\pi}}}}\right]\leq 2\cdot 9^{d}\exp\left[{-\frac{mn\zeta^{2}c_{\varepsilon}^{2}}{128R^{4}}}\right]

Note that Lemma 18 continues to hold in this setting. Proceeding as in the proof of Lemma 20 to set up a τ\tau-net over 2(𝐰,κβ){\mathcal{B}}_{2}\left({{{\mathbf{w}}}^{\ast},\frac{\kappa}{\sqrt{\beta}}}\right) and taking a union bound over this net finishes the proof.

We now simplify cε:=c(β,𝚫,σ)c_{\varepsilon}:=c(\beta,\text{\boldmath$\mathbf{\Delta}$},\sigma) for various distributions:

Centered Isotropic Gaussian

For the special case of 𝒟=𝒩(𝟎,Id){\mathcal{D}}={\mathcal{N}}({\mathbf{0}},I_{d}), using rotational symmetry, we can w.l.o.g. take 𝚫=(Δ1,0,0,,0)\text{\boldmath$\mathbf{\Delta}$}=(\Delta_{1},0,0,\ldots,0) and 𝐯=(v1,v2,0,0,,0),v12+v22=1{{\mathbf{v}}}=(v_{1},v_{2},0,0,\ldots,0),\,v_{1}^{2}+v_{2}^{2}=1. Thus, if x1,x2𝒩(0,1)x_{1},x_{2}\sim{\mathcal{N}}(0,1) i.i.d. then

𝔼𝐱𝒟,ϵ𝒟ϵ[𝐱,𝐯2exp(β2(𝚫,𝐱+ϵ)2)]\displaystyle\underset{{{\mathbf{x}}}\sim{\mathcal{D}},\epsilon\sim{{\mathcal{D}}}_{\epsilon}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}(\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle+\epsilon)^{2})}}\right]
=𝔼x1,x2𝒩(0,1),ϵ𝒩(0,σ2)[(v12x12+v22x22+2v1v2x1x2)exp(β2(Δ1x1+ϵ)2)]\displaystyle=\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1),\epsilon\sim{\mathcal{N}}(0,\sigma^{2})}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2}x_{2}^{2}+2v_{1}v_{2}x_{1}x_{2})\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right]
=𝔼x1,x2,ϵ[(v12x12+v22x22)exp(β2(Δ1x1+ϵ)2)][as,𝔼[x2]=0]\displaystyle=\underset{x_{1},x_{2},\epsilon}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2}x_{2}^{2})\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right]\quad[\text{as},\underset{}{{\mathbb{E}}}\left[{{x_{2}}}\right]=0]
=𝔼x1,ϵ[(v12x12+v22)exp(β2(Δ1x1+ϵ)2)][as,𝔼[x22]=1]\displaystyle=\underset{x_{1},\epsilon}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2})\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right]\quad[\text{as},\underset{}{{\mathbb{E}}}\left[{{x_{2}^{2}}}\right]=1]
=v12𝔼x1,ϵ[x12exp(β2(Δ1x1+ϵ)2)]+v22𝔼x1,ϵ[exp(β2(Δ1x1+ϵ)2)]\displaystyle=v_{1}^{2}\underset{x_{1},\epsilon}{{\mathbb{E}}}\left[{{x_{1}^{2}\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right]+v_{2}^{2}\underset{x_{1},\epsilon}{{\mathbb{E}}}\left[{{\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right]

Using,

12πσ2exp(β2(Δ1x1+ϵ)2ϵ22σ2)dϵ=11+βσ2exp(βΔ12x122(1+βσ2))\displaystyle\frac{1}{\sqrt{2\pi\sigma^{2}}}\int\limits_{-\infty}^{\infty}\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2}-\frac{\epsilon^{2}}{2\sigma^{2}})d\epsilon\quad=\frac{1}{\sqrt{1+\beta\sigma^{2}}}\exp(-\frac{\beta\Delta_{1}^{2}x_{1}^{2}}{2(1+\beta\sigma^{2})})

We have,

𝔼x1,ϵ[x12exp(β2(Δ1x1+ϵ)2)]\displaystyle\underset{x_{1},\epsilon}{{\mathbb{E}}}\left[{{x_{1}^{2}\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right] =12π[x12exp(x122)12πσ2exp(β2(Δ1x1+ϵ)2ϵ22σ2)𝑑ϵ]𝑑x1\displaystyle=\frac{1}{\sqrt{2\pi}}\int\limits_{-\infty}^{\infty}\left[x_{1}^{2}\exp(-\frac{x_{1}^{2}}{2})\frac{1}{\sqrt{2\pi\sigma^{2}}}\int\limits_{-\infty}^{\infty}\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2}-\frac{\epsilon^{2}}{2\sigma^{2}})d\epsilon\right]\,dx_{1} =11+βσ212πx12exp(x122βΔ12x122(1+βσ2))𝑑x1\displaystyle=\frac{1}{\sqrt{1+\beta\sigma^{2}}}\frac{1}{\sqrt{2\pi}}\int\limits_{-\infty}^{\infty}x_{1}^{2}\exp(-\frac{x_{1}^{2}}{2}-\frac{\beta\Delta_{1}^{2}x_{1}^{2}}{2(1+\beta\sigma^{2})})\,dx_{1} =11+βσ2(1+βσ21+β(σ2+Δ12))32\displaystyle=\frac{1}{\sqrt{1+\beta\sigma^{2}}}\left(\frac{1+\beta\sigma^{2}}{1+\beta(\sigma^{2}+\Delta_{1}^{2})}\right)^{\frac{3}{2}}

and,

𝔼x1,ϵ[exp(β2(Δ1x1+ϵ)2)]\displaystyle\underset{x_{1},\epsilon}{{\mathbb{E}}}\left[{{\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right] =11+βσ2(1+βσ21+β(σ2+Δ12))12\displaystyle=\frac{1}{\sqrt{1+\beta\sigma^{2}}}\left(\frac{1+\beta\sigma^{2}}{1+\beta(\sigma^{2}+\Delta_{1}^{2})}\right)^{\frac{1}{2}}

This gives us

cε\displaystyle c_{\varepsilon} =inf(v1,v2)S1{v121+βσ2(1+βσ21+β(σ2+Δ12))32+v221+βσ2(1+βσ21+β(σ2+Δ12))12}\displaystyle=\inf_{(v_{1},v_{2})\in S^{1}}\left\{{\frac{v_{1}^{2}}{\sqrt{1+\beta\sigma^{2}}}\left(\frac{1+\beta\sigma^{2}}{1+\beta(\sigma^{2}+\Delta_{1}^{2})}\right)^{\frac{3}{2}}+\frac{v_{2}^{2}}{\sqrt{1+\beta\sigma^{2}}}\left(\frac{1+\beta\sigma^{2}}{1+\beta(\sigma^{2}+\Delta_{1}^{2})}\right)^{\frac{1}{2}}}\right\}
=1+βσ2(1+βσ2+β𝚫2)32[using, v12+v22=1 and 𝚫2=Δ12]\displaystyle=\frac{1+\beta\sigma^{2}}{(1+\beta\sigma^{2}+\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2})^{\frac{3}{2}}}\qquad[\text{using, }v_{1}^{2}+v_{2}^{2}=1\text{ and }\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}=\Delta_{1}^{2}]
=β+ββ(β+ββ+β𝚫2)32=1β+ββ(1+βββ+β𝚫22)3/2=1g~\displaystyle=\frac{\frac{\beta^{\ast}+\beta}{\beta^{\ast}}}{\left({\frac{\beta^{\ast}+\beta}{\beta^{\ast}}+\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}}\right)^{\frac{3}{2}}}=\frac{1}{\sqrt{\frac{\beta^{\ast}+\beta}{\beta^{\ast}}}\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}=\frac{1}{\tilde{g}}

which finishes the proof.

Lemma 16 (LWLC for Robust Least Squares Regression).

For any 0βn0\leq\beta\leq n, the Q~β\tilde{Q}_{\beta}-function for robust regression satisfies the LWLC property with constant ΛβG(1+ν)V+1.01B[(β𝚫21.012π)1/3+(12πe)1/3]3/2\Lambda_{\beta}\leq G(1+\nu)V+1.01B[(\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}1.01}{2\pi})^{1/3}+(\frac{1}{2\pi e})^{1/3}]^{3/2} with probability at least 1exp(d)1-\exp(-d) for any νΩ(dβlog(n)n(β+d))\nu\geq\Omega\left({{\sqrt{\frac{d\beta\log(n)}{n(\beta+d)}}}}\right), where V=κ22πββ+β1g~V=\sqrt{\frac{\kappa^{2}}{2\pi}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\tilde{g}} and g~=β+ββ(1+βββ+β𝚫22)3/2\tilde{g}=\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}.

Proof.

It is easy to see that Q~β(𝐰|𝐰)=XGSGϵG+XBSB𝐛\nabla\tilde{Q}_{\beta}({{\mathbf{w}}}^{\ast}\,|\,{{\mathbf{w}}})=X_{G}S_{G}\text{\boldmath$\mathbf{\epsilon}$}_{G}+X_{B}S_{B}{{\mathbf{b}}}. We bound these separately below.

Weights on Bad Points.

Suppose we denote 𝚫:=𝐰𝐰\text{\boldmath$\mathbf{\Delta}$}:={{\mathbf{w}}}-{{\mathbf{w}}}^{\ast} and let S=diag(𝐬t)S=\text{diag}({{\mathbf{s}}}^{t}) be the weights assigned by the algorithm, then the analysis below shows that we must have

XS𝐛21.01B[(β𝚫21.012π)1/3+(12πe)1/3]3/2\left\|{XS{{\mathbf{b}}}}\right\|_{2}\leq 1.01B[(\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}1.01}{2\pi})^{1/3}+(\frac{1}{2\pi e})^{1/3}]^{3/2}

We will bound S𝐛2\left\|{S{{\mathbf{b}}}}\right\|_{2} below and use Lemma 8 to get the above bound. Let bib_{i} denote the corruption on the data point 𝐱i{{\mathbf{x}}}_{i}, i.e. yi=𝐰,𝐱i+biy_{i}=\langle{{\mathbf{w}}}^{\ast},{{\mathbf{x}}}_{i}\rangle+b_{i}. The proof proceeds via a simple case analysis:

Case 1: |bi|k|𝚫,𝐱i|\left|{b_{i}}\right|\leq k\left|{\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle}\right|

In this case we simply bound (sibi)2β2πbi2β2πk2𝚫,𝐱i2(s_{i}b_{i})^{2}\leq\frac{\beta}{2\pi}b_{i}^{2}\leq\frac{\beta}{2\pi}k^{2}\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle^{2}.

Case 2: |bi|>k|𝚫,𝐱i|\left|{b_{i}}\right|>k\left|{\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle}\right|

In this case we have \left|{\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle}\right|<\frac{\left|{b_{i}}\right|}{k}, which gives

|bi𝚫,𝐱i||bi||𝚫,𝐱i|>|bi|k1k\left|{b_{i}-\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle}\right|\geq\left|{b_{i}}\right|-\left|{\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle}\right|>\left|{b_{i}}\right|\frac{k-1}{k}

Therefore,

bi2si2\displaystyle b_{i}^{2}s_{i}^{2} =bi2𝒩(yi|𝐰,𝐱i,1β)2\displaystyle=b_{i}^{2}\mathcal{N}(y_{i}|\langle{{\mathbf{w}}},\mathbf{x}_{i}\rangle,\frac{1}{\beta})^{2}
=bi2β2πexp(β2(yi𝐰,𝐱i)2)2\displaystyle=b_{i}^{2}\frac{\beta}{2\pi}\exp(-\frac{\beta}{2}(y_{i}-\langle{{\mathbf{w}}},{{\mathbf{x}}}_{i}\rangle)^{2})^{2}
=bi2β2πexp(β(bi𝚫,𝐱i)2)\displaystyle=b_{i}^{2}\frac{\beta}{2\pi}\exp(-\beta(b_{i}-\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle)^{2})
bi2β2πexp(βbi2(k1)2k2)for, k1\displaystyle\leq b_{i}^{2}\frac{\beta}{2\pi}\exp(-\beta b_{i}^{2}\frac{(k-1)^{2}}{k^{2}})\qquad\text{for, }k\geq 1
=k22π(k1)2zexp(z)for, z=βbi2(k1)2k2\displaystyle=\frac{k^{2}}{2\pi(k-1)^{2}}z\exp(-z)\qquad\text{for, }z=\beta b_{i}^{2}\frac{(k-1)^{2}}{k^{2}}
k22πe(k1)2as, maxz{zexp(z)}=1e\displaystyle\leq\frac{k^{2}}{2\pi e(k-1)^{2}}\qquad\text{as, }\max_{z}\{z\exp(-z)\}=\frac{1}{e}

Combining cases 1 and 2,

S𝐛22\displaystyle\left\|{S{{\mathbf{b}}}}\right\|_{2}^{2} =iB(sibi)2iBmax{β2πk2𝚫,𝐱i2,k22πe(k1)2}where, k1\displaystyle=\sum_{i\in B}(s_{i}b_{i})^{2}\leq\sum_{i\in B}\max\{\frac{\beta}{2\pi}k^{2}\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle^{2},\frac{k^{2}}{2\pi e(k-1)^{2}}\}\qquad\text{where, }k\geq 1
β2πk2iB𝚫,𝐱i2+B2πek2(k1)2\displaystyle\leq\frac{\beta}{2\pi}k^{2}\sum_{i\in B}\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle^{2}+\frac{B}{2\pi e}\frac{k^{2}}{(k-1)^{2}}
β𝚫2λmax(XBXBT)2πk2+B2πek2(k1)2\displaystyle\leq\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\lambda_{max}(X_{B}X_{B}^{T})}{2\pi}k^{2}+\frac{B}{2\pi e}\frac{k^{2}}{(k-1)^{2}}
=qk2+pk2(k1)2where, p=B2πe,q=β𝚫2λmax(XBXBT)2π\displaystyle=qk^{2}+p\frac{k^{2}}{(k-1)^{2}}\qquad\text{where, }p=\frac{B}{2\pi e},\quad q=\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\lambda_{max}(X_{B}X_{B}^{T})}{2\pi}

Let,

\displaystyle g(k)=qk^{2}+\frac{pk^{2}}{(k-1)^{2}};\qquad g^{\prime}(k)=2qk+\frac{2pk}{(k-1)^{2}}-\frac{2pk^{2}}{(k-1)^{3}}=0
k=1+(pq)1/3minkg(k)=(q1/3+p1/3)3\displaystyle\implies k=1+(\frac{p}{q})^{1/3}\implies\min_{k}g(k)=(q^{1/3}+p^{1/3})^{3}

This gives us

S𝐛22\displaystyle\left\|{S{{\mathbf{b}}}}\right\|_{2}^{2} [(β𝚫2λmax(XBXBT)2π)1/3+(B2πe)1/3]3\displaystyle\leq[(\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\lambda_{max}(X_{B}X_{B}^{T})}{2\pi})^{1/3}+(\frac{B}{2\pi e})^{1/3}]^{3}
B[(β𝚫21.012π)1/3+(12πe)1/3]3\displaystyle\leq B[(\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}1.01}{2\pi})^{1/3}+(\frac{1}{2\pi e})^{1/3}]^{3}
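
The optimization over the case-splitting threshold k used above admits a direct numerical check. The short sketch below (assuming numpy, with arbitrary illustrative values of p and q) confirms that k=1+(p/q)^{1/3} minimizes g(k) and that the minimum value equals (q^{1/3}+p^{1/3})^{3}.

```python
# Sanity check for min_{k>1} q*k^2 + p*k^2/(k-1)^2 = (q^(1/3) + p^(1/3))^3.
import numpy as np

p, q = 0.7, 2.3                        # arbitrary positive constants
k_grid = np.linspace(1.001, 20, 200000)
g = q * k_grid**2 + p * k_grid**2 / (k_grid - 1)**2

k_star = 1 + (p / q)**(1/3)            # claimed minimizer
g_star = (q**(1/3) + p**(1/3))**3      # claimed minimum value

print(k_grid[np.argmin(g)], k_star)    # the minimizers should match
print(g.min(), g_star)                 # the minimum values should match
```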

Bounding the Weights on Good Points.

Suppose the noise values are sampled from {\mathcal{D}}_{\varepsilon}={\mathcal{N}}(0,\frac{1}{\beta^{\ast}}) and the set of corrupted points B is chosen independently of the {{\mathbf{x}}}_{i}. Then for any \beta>0 and S=\text{diag}({{\mathbf{s}}}), with {{\mathbf{s}}} computed w.r.t. the model {{\mathbf{w}}} at variance \frac{1}{\beta}, the following analysis shows that

[XGSGϵG2>G(1+ν)V](12RXVν)3dexp(mβν2V2n32),{\mathbb{P}}\left[{{\left\|{X_{G}S_{G}\text{\boldmath$\mathbf{\epsilon}$}_{G}}\right\|_{2}>G(1+\nu)V}}\right]\leq\left({\frac{12R_{X}}{V\nu}}\right)^{3d}\exp\left({-\frac{m\beta^{\ast}\nu^{2}V^{2}\cdot n}{32}}\right),

where V=κ22πββ+β1g~V=\sqrt{\frac{\kappa^{2}}{2\pi}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\tilde{g}} and g~=β+ββ(1+βββ+β𝚫22)3/2\tilde{g}=\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}. Suppose ϵ𝒩(0,1β)\epsilon\sim{\mathcal{N}}(0,\frac{1}{\beta^{\ast}}) and 𝐱𝒩(𝟎,I){{\mathbf{x}}}\sim{\mathcal{N}}({\mathbf{0}},I). Then, for any fixed error vector 𝚫2(κβ)\text{\boldmath$\mathbf{\Delta}$}\in{\mathcal{B}}_{2}\left({\frac{\kappa}{\sqrt{\beta}}}\right), if we set s=β2πexp(β2(ϵ𝚫𝐱)2)s=\sqrt{\frac{\beta}{2\pi}}\exp\left({-\frac{\beta}{2}(\epsilon-\text{\boldmath$\mathbf{\Delta}$}\cdot{{\mathbf{x}}})^{2}}\right), then we can analyze the vector 𝔼[ϵs𝐱]{\mathbb{E}}\left[{{\epsilon s\cdot{{\mathbf{x}}}}}\right] as

𝔼[ϵs𝐱]=𝔼𝐱[𝔼ϵ[ϵs]𝐱]=β2π𝔼𝐱[𝔼ϵ[exp(β2(ϵ𝚫𝐱)2)ϵ]𝐱]\displaystyle{\mathbb{E}}\left[{{\epsilon s\cdot{{\mathbf{x}}}}}\right]=\underset{{{\mathbf{x}}}}{{\mathbb{E}}}\left[{{\underset{\epsilon}{{\mathbb{E}}}\left[{{\epsilon s}}\right]\cdot{{\mathbf{x}}}}}\right]=\sqrt{\frac{\beta}{2\pi}}\underset{{{\mathbf{x}}}}{{\mathbb{E}}}\left[{{\underset{\epsilon}{{\mathbb{E}}}\left[{{\exp\left({-\frac{\beta}{2}(\epsilon-\text{\boldmath$\mathbf{\Delta}$}^{\top}{{\mathbf{x}}})^{2}}\right)\epsilon}}\right]\cdot{{\mathbf{x}}}}}\right]
=β2πββ+βββ+β𝔼𝐱[(exp(ββ2(β+β)(𝚫𝐱)2)𝚫𝐱)𝐱]𝐩\displaystyle=\sqrt{\frac{\beta}{2\pi}}\sqrt{\frac{\beta^{\ast}}{\beta+\beta^{\ast}}}\frac{\beta}{\beta+\beta^{\ast}}\underbrace{\underset{{{\mathbf{x}}}}{{\mathbb{E}}}\left[{{\left({\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}(\text{\boldmath$\mathbf{\Delta}$}^{\top}{{\mathbf{x}}})^{2}}\right)\text{\boldmath$\mathbf{\Delta}$}^{\top}{{\mathbf{x}}}}\right)\cdot{{\mathbf{x}}}}}\right]}_{{\mathbf{p}}}

where in the last step, we used Lemma 9 in one dimension. Now, by rotational symmetry of the Gaussian distribution and the unbiased and independent nature of its coordinates, we can assume w.l.o.g. that the error vector is of the form \text{\boldmath$\mathbf{\Delta}$}=(\delta,0,\ldots,0) and conclude, as we did in the Gaussian mixture model analysis, that the vector {{\mathbf{p}}} in this situation must also have only its first coordinate nonzero. Thus, we have

|𝐩1|=|𝔼[exp(ββ2(β+β)δ2x2)δx2]|,\left|{{{\mathbf{p}}}_{1}}\right|=\left|{{\mathbb{E}}\left[{{\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\delta^{2}x^{2}}\right)\delta x^{2}}}\right]}\right|,

where x\sim{\mathcal{N}}(0,1) since we had {{\mathbf{x}}}\sim{\mathcal{N}}({\mathbf{0}},I). Applying Lemma 9 in a single dimension yet again, along with the Cauchy-Schwarz inequality, gives us, for any unit vector {{\mathbf{v}}},

𝔼[ϵs𝐱𝐯]β2πββ+βββ+β1(1+βββ+β𝚫22)3/2𝚫2κ22πββ+β1g~,{\mathbb{E}}\left[{{\epsilon s\cdot{{\mathbf{x}}}^{\top}{{\mathbf{v}}}}}\right]\leq\sqrt{\frac{\beta}{2\pi}}\sqrt{\frac{\beta^{\ast}}{\beta+\beta^{\ast}}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq\sqrt{\frac{\kappa^{2}}{2\pi}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\tilde{g}},

where g~=β+ββ(1+βββ+β𝚫22)3/2\tilde{g}=\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}} and in the last step we used β𝚫22κ2\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq\kappa^{2}. The above, when combined with uniform convergence arguments over appropriate nets over the unit vectors 𝐯{{\mathbf{v}}} and the error vector 𝚫\mathbf{\Delta} gives us the claimed result.

Lemma 17.

With the same preconditions as in Lemma 19, we have

[λmin(XGSGXG)<(1ζ)cGβ2π][λmax(XGSGXG)>(1+ζ)Gβ2π]}29dexp[mnc2ζ232R4],\left.\begin{array}[]{r}{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<(1-\zeta)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\\ \vspace*{-2ex}\hfil\\ {\mathbb{P}}\left[{{\lambda_{\max}(X_{G}S_{G}X_{G}^{\top})>(1+\zeta)\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\end{array}\right\}\leq 2\cdot 9^{d}\exp\left[{-\frac{mnc^{2}\zeta^{2}}{32R^{4}}}\right],

where R is the subGaussian constant of the distribution {\mathcal{D}} that generated the covariate vectors {{\mathbf{x}}}^{i}. When {\mathcal{D}} is the standard Gaussian i.e. {\mathcal{N}}({\mathbf{0}},I), we have R\leq 1. For {\mathcal{N}}({\mathbf{0}},\frac{1}{\beta^{\ast}}\cdot I), we have R\leq\sqrt{\frac{1}{\beta^{\ast}}}. In the above, c is a constant that depends only on the distribution {\mathcal{D}} and is bounded for various distributions in Lemma 19.

Proof.

Let A\in{\mathbb{R}}^{d\times d} be a square symmetric matrix. For \delta>0, we have:

AcI2δ𝐯Sd1,|𝐯A𝐯c|δcδλmin(A)λmax(A)c+δ\displaystyle\left\|{A-c\cdot I}\right\|_{2}\leq\delta\iff\forall{{\mathbf{v}}}\in S^{d-1},\left|{{{\mathbf{v}}}^{\top}A{{\mathbf{v}}}-c}\right|\leq\delta\iff c-\delta\leq\lambda_{\min}(A)\leq\lambda_{\max}(A)\leq c+\delta

Also, for any square symmetric matrix F\in{\mathbb{R}}^{d\times d} and {\mathcal{N}}_{\epsilon} being an \epsilon-net over S^{d-1},

F2(12ϵ)1sup𝐯𝒩ϵ|𝐯F𝐯|\displaystyle\left\|{F}\right\|_{2}\leq(1-2\epsilon)^{-1}\sup_{{{\mathbf{v}}}\in{\mathcal{N}}_{\epsilon}}\left|{{{\mathbf{v}}}^{\top}F{{\mathbf{v}}}}\right|

Taking F=AcIF=A-c\cdot I and ϵ=1/4\epsilon=1/4, we have

AcI22sup𝐯𝒩1/4|𝐯A𝐯c|\displaystyle\left\|{A-c\cdot I}\right\|_{2}\leq 2\sup_{{{\mathbf{v}}}\in{\mathcal{N}}_{1/4}}\left|{{{\mathbf{v}}}^{\top}A{{\mathbf{v}}}-c}\right| (2)

Let Z_{i}:=\sqrt{s_{i}}\cdot\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle with {{\mathbf{x}}}_{i}\sim{\mathcal{D}}. Then for any fixed {{\mathbf{v}}}\in S^{d-1}, we have

Ziψ2=supp1(𝔼[|Zi|p])1/pp(β2π)1/4supp1(𝔼[|𝐱i,𝐯|p])1/pp=(β2π)1/4R\left\|{Z_{i}}\right\|_{\psi_{2}}=\sup_{p\geq 1}\frac{\left({{\mathbb{E}}\left[{{\left|{Z_{i}}\right|^{p}}}\right]}\right)^{1/p}}{\sqrt{p}}\leq(\frac{\beta}{2\pi})^{1/4}\cdot\sup_{p\geq 1}\frac{\left({{\mathbb{E}}\left[{{\left|{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle}\right|^{p}}}\right]}\right)^{1/p}}{\sqrt{p}}=(\frac{\beta}{2\pi})^{1/4}R

where we use the fact that 𝐱i,𝐯Ψ2R\left\|{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle}\right\|_{\Psi_{2}}\leq R since 𝒟{\mathcal{D}} is RR-sub-Gaussian and si(β2π)1/4\sqrt{s_{i}}\leq(\frac{\beta}{2\pi})^{1/4}.

Also, since Z_{i} is R(\frac{\beta}{2\pi})^{1/4}-sub-Gaussian, Z_{i}^{2} is R^{2}\sqrt{\frac{\beta}{2\pi}}-sub-exponential and, by centering, Z_{i}^{2}-{\mathbb{E}}Z_{i}^{2} is 2R^{2}\sqrt{\frac{\beta}{2\pi}}-sub-exponential. For a single good data point, Lemma 19 gives \mu:={\mathbb{E}}Z_{i}^{2}\in[c\sqrt{\frac{\beta}{2\pi}},\sqrt{\frac{\beta}{2\pi}}].

[|𝐯XGSGXG𝐯Gμ|εGβ2π]\displaystyle{\mathbb{P}}\left[{{\left|{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}-G\mu}\right|\geq\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]
=[|iG(Zi2μ)|εGβ2π]\displaystyle={\mathbb{P}}\left[{{\left|{\sum_{i\in G}(Z^{2}_{i}-\mu)}\right|\geq\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]
2exp[mmin{(εGβ2π)2G(2R2β2π)2,εGβ2π2R2β2π}][Theorem 2.8.2]\displaystyle\leq 2\exp\left[{-m\cdot\min\left\{{\frac{\left(\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}\right)^{2}}{G\left(2R^{2}\sqrt{\frac{\beta}{2\pi}}\right)^{2}},\frac{\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}}{2R^{2}\sqrt{\frac{\beta}{2\pi}}}}\right\}}\right]\text{[Theorem 2.8.2]}
2exp[mnε28R4]\displaystyle\leq 2\exp\left[{-\frac{mn\varepsilon^{2}}{8R^{4}}}\right]

where m>0m>0 is a universal constant and in the last step we used Gn/2G\geq n/2 and w.l.o.g. we assumed that ε2R2\varepsilon\leq 2R^{2}. Taking a union bound over all 9d9^{d} elements of 𝒩1/4{\mathcal{N}}_{1/4}, we get

[XGSGXGGμI2εGβ2π]\displaystyle{\mathbb{P}}\left[{{\left\|{X_{G}S_{G}X_{G}^{\top}-G\mu\cdot I}\right\|_{2}\geq\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right] [2sup𝐯𝒩1/4|𝐯XGSGXG𝐯Gμ|εGβ2π]using 2\displaystyle\leq{\mathbb{P}}\left[{{2\sup_{{{\mathbf{v}}}\in{\mathcal{N}}_{1/4}}\left|{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}-G\mu}\right|\geq\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\text{using \ref{net}}
29dexp[mnε232R4]\displaystyle\leq 2\cdot 9^{d}\exp\left[{-\frac{mn\varepsilon^{2}}{32R^{4}}}\right]

Setting ε=ζc\varepsilon=\zeta c and noticing that μ[cβ2π,β2π]\mu\in[c\sqrt{\frac{\beta}{2\pi}},\sqrt{\frac{\beta}{2\pi}}] by Lemma 19 finishes the proof. ∎
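
The concentration asserted by this lemma can also be observed empirically. The following minimal simulation (assuming numpy; all parameter values are arbitrary and the good points are taken noiseless, as in Lemma 19) computes the extreme eigenvalues of X_{G}S_{G}X_{G}^{\top} and compares them to the G\sqrt{\frac{\beta}{2\pi}} scale appearing in the bounds.

```python
# Empirical look at the extreme eigenvalues of X_G S_G X_G^T (good points only).
import numpy as np

rng = np.random.default_rng(0)
n, d, beta = 20000, 10, 4.0
w_star = rng.normal(size=d)
delta = rng.normal(size=d)
delta = 0.2 * delta / np.linalg.norm(delta)   # so that sqrt(beta)*||Delta|| < 0.47
w = w_star + delta                            # a nearby model estimate

X = rng.normal(size=(d, n))                   # columns are covariates x_i ~ N(0, I)
y = X.T @ w_star                              # noiseless good points, as in Lemma 19

s = np.sqrt(beta/(2*np.pi)) * np.exp(-0.5*beta*(y - X.T @ w)**2)
M = (X * s) @ X.T                             # X_G S_G X_G^T

eig = np.linalg.eigvalsh(M)
scale = n * np.sqrt(beta/(2*np.pi))           # the G*sqrt(beta/(2*pi)) scale, G = n here
print(eig.min()/scale, eig.max()/scale)       # both ratios should lie strictly inside (0, 1)
```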

Lemma 18.

Consider two models 𝐰1,𝐰2d{{\mathbf{w}}}^{1},{{\mathbf{w}}}^{2}\in{\mathbb{R}}^{d} such that 𝐰1𝐰22τ\left\|{{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{2}}\right\|_{2}\leq\tau and let 𝐬1,𝐬2{{\mathbf{s}}}^{1},{{\mathbf{s}}}^{2} denote the corresponding weight vectors, i.e. sij=β2πexp(β2(yi𝐰j,𝐱i)2),j=1,2s_{i}^{j}=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(y_{i}-\langle{{\mathbf{w}}}^{j},{{\mathbf{x}}}_{i}\rangle)^{2}),\,j=1,2. Also let S1=diag(𝐬1)S^{1}=\text{diag}({{\mathbf{s}}}^{1}) and S2=diag(𝐬2)S^{2}=\text{diag}({{\mathbf{s}}}^{2}). Then for any X=[𝐱1,,𝐱n]d×nX=[{{\mathbf{x}}}_{1},\ldots,{{\mathbf{x}}}_{n}]\in{\mathbb{R}}^{d\times n} such that 𝐱i2RX\left\|{{{\mathbf{x}}}_{i}}\right\|_{2}\leq R_{X} for all ii,

|λmin(XS1X)λmin(XS2X)|nτβRX32πe,\left|{\lambda_{\min}(XS^{1}X^{\top})-\lambda_{\min}(XS^{2}X^{\top})}\right|\leq\frac{n\tau\beta R_{X}^{3}}{\sqrt{2\pi e}},

where RXR_{X} is the maximum length in a set of nn vectors, each sampled from a dd-dimensional Gaussian (see Lemma 7).

Proof.

Let s_{i}^{j}=f(r_{i}^{j})=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(y_{i}-r_{i}^{j})^{2}) where r_{i}^{j}=\langle{{\mathbf{w}}}^{j},{{\mathbf{x}}}_{i}\rangle,\,j=1,2.

Since f:{\mathbb{R}}\rightarrow{\mathbb{R}} is everywhere differentiable with a bounded derivative, it is L-Lipschitz continuous with L=\sup\limits_{r}\left|{f^{\prime}(r)}\right|:

f(r)\displaystyle f^{\prime}(r) =β2πexp(β2(yir)2)β(yir)\displaystyle=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(y_{i}-r)^{2})\beta(y_{i}-r)
=β2πtexp(t2)2β\displaystyle=\sqrt{\frac{\beta}{2\pi}}t\exp(-t^{2})\sqrt{2\beta} where,t=β2(yir)\displaystyle where,t=\sqrt{\frac{\beta}{2}}(y_{i}-r)
βπ12e\displaystyle\leq\frac{\beta}{\sqrt{\pi}}\frac{1}{\sqrt{2e}} texp(t2)12e\displaystyle t\exp(-t^{2})\leq\frac{1}{\sqrt{2e}}
=β2πe\displaystyle=\frac{\beta}{\sqrt{2\pi e}}

Hence, \frac{\left|{f(r_{i}^{1})-f(r_{i}^{2})}\right|}{\left|{r_{i}^{1}-r_{i}^{2}}\right|}\leq\frac{\beta}{\sqrt{2\pi e}}, i.e. \left|{s_{i}^{1}-s_{i}^{2}}\right|\leq\frac{\beta}{\sqrt{2\pi e}}\left|{\langle{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{2},{{\mathbf{x}}}_{i}\rangle}\right|\leq\frac{\beta\tau R_{X}}{\sqrt{2\pi e}}. This gives us \left\|{{{\mathbf{s}}}^{1}-{{\mathbf{s}}}^{2}}\right\|_{1}\leq\frac{n\tau\beta R_{X}}{\sqrt{2\pi e}}. Now, letting S^{1}=\text{diag}({{\mathbf{s}}}^{1}) and S^{2}=\text{diag}({{\mathbf{s}}}^{2}), for any unit vector {{\mathbf{v}}}\in S^{d-1}, denoting R_{X}:=\max_{i\in[n]}\left\|{{{\mathbf{x}}}_{i}}\right\|_{2} we have

|𝐯XS1X𝐯𝐯XS2X𝐯|\displaystyle\left|{{{\mathbf{v}}}^{\top}XS^{1}X^{\top}{{\mathbf{v}}}-{{\mathbf{v}}}^{\top}XS^{2}X^{\top}{{\mathbf{v}}}}\right| =|i=1n(𝐬i1𝐬i2)𝐱i,𝐯2|\displaystyle=\left|{\sum_{i=1}^{n}\left({{{\mathbf{s}}}^{1}_{i}-{{\mathbf{s}}}^{2}_{i}}\right)\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}}\right|
𝐬1𝐬21maxi[n]𝐱i,𝐯2\displaystyle\leq\left\|{{{\mathbf{s}}}^{1}-{{\mathbf{s}}}^{2}}\right\|_{1}\cdot\max_{i\in[n]}\ \left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}
𝐬1𝐬21RX2\displaystyle\leq\left\|{{{\mathbf{s}}}^{1}-{{\mathbf{s}}}^{2}}\right\|_{1}\cdot R_{X}^{2}
nτβRX32πe.\displaystyle\leq\frac{n\tau\beta R_{X}^{3}}{\sqrt{2\pi e}}.

This proves that \left\|{XS^{1}X^{\top}-XS^{2}X^{\top}}\right\|_{2}\leq\frac{n\tau\beta R_{X}^{3}}{\sqrt{2\pi e}}. Since the smallest eigenvalue of a symmetric matrix is 1-Lipschitz with respect to the spectral norm (Weyl's inequality), the claimed bound on \left|{\lambda_{\min}(XS^{1}X^{\top})-\lambda_{\min}(XS^{2}X^{\top})}\right| follows. ∎
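
The Lipschitz constant derived above can be checked numerically in a few lines; the sketch below assumes numpy, and \beta and y_{i} take arbitrary illustrative values.

```python
# Check that sup_r |f'(r)| = beta/sqrt(2*pi*e) for f(r) = sqrt(beta/(2*pi)) * exp(-beta/2 (y - r)^2).
import numpy as np

beta, y = 3.0, 1.2                              # arbitrary values
r = np.linspace(y - 10, y + 10, 200_001)
f = np.sqrt(beta/(2*np.pi)) * np.exp(-0.5*beta*(y - r)**2)

slope = np.abs(np.diff(f) / np.diff(r))         # finite-difference estimate of |f'(r)|
print(slope.max(), beta/np.sqrt(2*np.pi*np.e))  # the two values should agree closely
```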

Lemma 19.

Let X=[{{\mathbf{x}}}_{1},\ldots,{{\mathbf{x}}}_{n}]\in{\mathbb{R}}^{d\times n} be generated from an isotropic R-sub-Gaussian distribution {\mathcal{D}}. For any fixed model {{\mathbf{w}}} and \beta>0, let s_{i}=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(y_{i}-\langle{{\mathbf{w}}},{{\mathbf{x}}}_{i}\rangle)^{2}) be the weight of the data point {{\mathbf{x}}}_{i} and let S_{G}=\text{diag}({{\mathbf{s}}}_{G}). Then there exists a constant c>0 that depends only on {\mathcal{D}} such that for any fixed unit vector {{\mathbf{v}}}\in S^{d-1},

cGβ2π𝔼[𝐯XGSGXG𝐯]Gβ2π.c\cdot G\sqrt{\frac{\beta}{2\pi}}\leq{\mathbb{E}}\left[{{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}}}\right]\leq G\sqrt{\frac{\beta}{2\pi}}.
Proof.

Let {{\mathbf{x}}}_{i}\sim{\mathcal{D}} and note that for good points y_{i}=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle. Let \text{\boldmath$\mathbf{\Delta}$}:={{\mathbf{w}}}^{\ast}-{{\mathbf{w}}}. Since s_{i}\leq\sqrt{\frac{\beta}{2\pi}}, we have by linearity of expectation,

𝔼[𝐯XGSGXG𝐯]=𝔼[iGsi𝐱i,𝐯2]=G𝔼[si𝐱i,𝐯2]Gβ2π𝔼[𝐱i,𝐯2]=Gβ2π,{\mathbb{E}}\left[{{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}}}\right]={\mathbb{E}}\left[{{\sum_{i\in G}s_{i}\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}}}\right]=G\cdot{\mathbb{E}}\left[{{s_{i}\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}}}\right]\leq G\sqrt{\frac{\beta}{2\pi}}\cdot{\mathbb{E}}\left[{{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}}}\right]=G\sqrt{\frac{\beta}{2\pi}},

since 𝒟{\mathcal{D}} is isotropic.

For the lower bound, we may write,

𝔼[𝐯XGSGXG𝐯]=G𝔼[𝐱i,𝐯2si]\displaystyle{\mathbb{E}}\left[{{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}}}\right]=G\cdot{\mathbb{E}}\left[{{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}\cdot s_{i}}}\right] =Gβ2π𝔼[𝐱i,𝐯2exp(β2𝚫,𝐱i2)]\displaystyle=G\sqrt{\frac{\beta}{2\pi}}\cdot{\mathbb{E}}\left[{{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}_{i}}\right\rangle^{2})}}\right]
Gβ2πc(β,𝚫)\displaystyle\geq G\sqrt{\frac{\beta}{2\pi}}\cdot c(\beta,\text{\boldmath$\mathbf{\Delta}$})

where, for any distribution 𝒟{\mathcal{D}} over d{\mathbb{R}}^{d}, we define the constant cc as

c(β,𝚫):=inf𝐯Sd1{𝔼𝐱𝒟[𝐱,𝐯2exp(β2𝚫,𝐱2)]}c(\beta,\text{\boldmath$\mathbf{\Delta}$}):=\inf_{{{\mathbf{v}}}\in S^{d-1}}\left\{{\underset{{{\mathbf{x}}}\sim{\mathcal{D}}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle^{2})}}\right]}\right\}
Centered Isotropic Gaussian

For the special case of 𝒟=𝒩(𝟎,Id){\mathcal{D}}={\mathcal{N}}({\mathbf{0}},I_{d}), using rotational symmetry, we can w.l.o.g. take 𝚫=(Δ1,0,0,,0)\text{\boldmath$\mathbf{\Delta}$}=(\Delta_{1},0,0,\ldots,0) and 𝐯=(v1,v2,0,0,,0){{\mathbf{v}}}=(v_{1},v_{2},0,0,\ldots,0). Thus, if x1,x2𝒩(0,1)x_{1},x_{2}\sim{\mathcal{N}}(0,1) i.i.d. then

\displaystyle\underset{{{\mathbf{x}}}\sim{\mathcal{D}}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle^{2})}}\right] =\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2}x_{2}^{2}+2v_{1}v_{2}x_{1}x_{2})\exp(-\frac{\beta}{2}\Delta_{1}^{2}x_{1}^{2})}}\right]
\displaystyle=\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2}x_{2}^{2})\exp(-\frac{\beta}{2}\Delta_{1}^{2}x_{1}^{2})}}\right]\quad[\text{as }{\mathbb{E}}\left[{{x_{2}}}\right]=0]
\displaystyle=\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2})\exp(-\frac{\beta}{2}\Delta_{1}^{2}x_{1}^{2})}}\right]\quad[\text{as }{\mathbb{E}}\left[{{x_{2}^{2}}}\right]=1]
\displaystyle=v_{1}^{2}\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{x_{1}^{2}\exp(-\frac{\beta}{2}\Delta_{1}^{2}x_{1}^{2})}}\right]+v_{2}^{2}\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{\exp(-\frac{\beta}{2}\Delta_{1}^{2}x_{1}^{2})}}\right]
\displaystyle=\frac{v_{1}^{2}}{(1+\beta\Delta_{1}^{2})^{3/2}}+\frac{v_{2}^{2}}{(1+\beta\Delta_{1}^{2})^{1/2}}

This gives us

c(β,𝚫)\displaystyle c(\beta,\text{\boldmath$\mathbf{\Delta}$}) =inf(v1,v2)S1{v12(1+β𝚫2)3/2+v22(1+β𝚫2)1/2}1(1+β𝚫2)3/2\displaystyle=\inf_{(v_{1},v_{2})\in S^{1}}\left\{{\frac{v_{1}^{2}}{(1+\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2})^{3/2}}+\frac{v_{2}^{2}}{(1+\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2})^{1/2}}}\right\}\geq\frac{1}{(1+\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2})^{3/2}}
\displaystyle\geq\frac{1}{(1+\kappa^{2})^{3/2}}\qquad[\text{using }\sqrt{\beta}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|<\kappa]
Centered Non-isotropic Gaussian

For the case 𝒟=𝒩(𝟎,Σ){\mathcal{D}}={\mathcal{N}}({\mathbf{0}},\Sigma), we have 𝐱𝒟=Σ1/2.𝒩(𝟎,Id){{\mathbf{x}}}\sim{\mathcal{D}}=\Sigma^{1/2}.{\mathcal{N}}({\mathbf{0}},I_{d}). Thus for any fixed unit vector 𝐯{{\mathbf{v}}}, we have 𝐯,𝐱𝐯~,𝐳\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}}\right\rangle\sim\left\langle{\tilde{{{\mathbf{v}}}}},{{{\mathbf{z}}}}\right\rangle where 𝐯~=Σ1/2𝐯\tilde{{{\mathbf{v}}}}=\Sigma^{-1/2}{{\mathbf{v}}} and 𝐳𝒩(𝟎,Id){{\mathbf{z}}}\sim{\mathcal{N}}({\mathbf{0}},I_{d}). We also have 𝐯~2[1Λ,1λ]\left\|{\tilde{{{\mathbf{v}}}}}\right\|_{2}\in\left[{\frac{1}{\sqrt{\Lambda}},\frac{1}{\sqrt{\lambda}}}\right], where Λ=λmax(Σ)\Lambda=\lambda_{max}(\Sigma) and λ=λmin(Σ)\lambda=\lambda_{min}(\Sigma). Now for any fixed vectors 𝚫,𝐯\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{v}}} we first perform rotations so that we have 𝚫~=(Δ,0,0,,0)\tilde{\text{\boldmath$\mathbf{\Delta}$}}=(\Delta,0,0,\cdots,0) and 𝐯~=(v1,v2,0,0,,0)𝔹(𝟎,r)\tilde{{{\mathbf{v}}}}=(v_{1},v_{2},0,0,\cdots,0)\in\mathbb{B}({\mathbf{0}},r) where Δ[𝚫Λ,𝚫λ]\Delta\in\left[{\frac{\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|}{\sqrt{\Lambda}},\frac{\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|}{\sqrt{\lambda}}}\right] and r[1Λ,1λ]r\in\left[{\frac{1}{\sqrt{\Lambda}},\frac{1}{\sqrt{\lambda}}}\right]. This gives us c(β,𝚫)infv1,v2f(v1,v2)c(\beta,\text{\boldmath$\mathbf{\Delta}$})\geq\inf_{v_{1},v_{2}}f(v_{1},v_{2}) where,

f(v1,v2)\displaystyle f(v_{1},v_{2}) =𝔼x1,x2𝒩(0,1)[(v12x12+v22x22+2v1v2x1x2)exp(β2Δ2x12)]\displaystyle=\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2}x_{2}^{2}+2v_{1}v_{2}x_{1}x_{2})\exp(-\frac{\beta}{2}\Delta^{2}x_{1}^{2})}}\right]
=v12(1+βΔ2)3/2+v22(1+βΔ2)1/2\displaystyle=\frac{v_{1}^{2}}{(1+\beta\Delta^{2})^{3/2}}+\frac{v_{2}^{2}}{(1+\beta\Delta^{2})^{1/2}}

similar to the isotropic counterpart, giving the following:

c(β,𝚫)\displaystyle c(\beta,\text{\boldmath$\mathbf{\Delta}$}) =inf(v1,v2)𝔹(𝟎,r){v12(1+βΔ2)3/2+v22(1+βΔ2)1/2}1Λ(1+βλ𝚫2)3/2\displaystyle=\inf_{(v_{1},v_{2})\in\mathbb{B}({\mathbf{0}},r)}\left\{{\frac{v_{1}^{2}}{(1+\beta\Delta^{2})^{3/2}}+\frac{v_{2}^{2}}{(1+\beta\Delta^{2})^{1/2}}}\right\}\geq\frac{1}{\Lambda(1+\frac{\beta}{\lambda}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2})^{3/2}}

As the above term needs to be bounded away from 0, we require \lambda=\lambda_{min}(\Sigma) to be bounded reasonably away from 0.

Non-centered Isotropic Gaussian

Suppose the covariates are generated from a distribution {\mathcal{D}}={\mathcal{N}}(\text{\boldmath$\mathbf{\mu}$},I_{d}). As earlier, by rotational symmetry, we can take \text{\boldmath$\mathbf{\Delta}$}=(\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|,0,0,\ldots,0), {{\mathbf{v}}}=(v_{1},v_{2},0,\ldots,0), \text{\boldmath$\mathbf{\mu}$}=(\mu_{1},\mu_{2},\mu_{3},0,0,\ldots,0). Assume \left\|{\text{\boldmath$\mathbf{\mu}$}}\right\|_{2}=\rho. Letting \left\langle{\text{\boldmath$\mathbf{\mu}$}},{{{\mathbf{v}}}}\right\rangle=:p\leq\rho and x_{1},x_{2},x_{3}\sim{\mathcal{N}}(0,1) i.i.d. gives c(\beta,\text{\boldmath$\mathbf{\Delta}$})\geq\inf_{v_{1},v_{2}}f(v_{1},v_{2}), where the independence of x_{1},x_{2},x_{3} and the facts that {\mathbb{E}}\left[{{x_{2}}}\right]=0 and {\mathbb{E}}\left[{{x_{2}^{2}}}\right]=1 give us

\displaystyle f(v_{1},v_{2}) =\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{(p+v_{1}x_{1}+v_{2}x_{2})^{2}\exp\left({-\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}}{2}(x_{1}+\mu_{1})^{2}}\right)}}\right]
\displaystyle=\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{((p+v_{1}x_{1})^{2}+v_{2}^{2}x_{2}^{2}+2(p+v_{1}x_{1})v_{2}x_{2})\exp\left({-\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}}{2}(x_{1}+\mu_{1})^{2}}\right)}}\right]
\displaystyle=\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{((p+v_{1}x_{1})^{2}+v_{2}^{2})\exp\left({-\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}}{2}(x_{1}+\mu_{1})^{2}}\right)}}\right]

Now, since (v1,v2)S1(v_{1},v_{2})\in S^{1} we have the following two cases:

Case 1: v2212v_{2}^{2}\geq\frac{1}{2}. In this case

f(v1,v2)\displaystyle f(v_{1},v_{2}) 12𝔼x1𝒩(0,1)[exp(β𝚫22(x1+μ1)2)]\displaystyle\geq\frac{1}{2}\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{\exp\left({-\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}}{2}(x_{1}+\mu_{1})^{2}}\right)}}\right]
=122πexp(c2(x+μ1)2x22)𝑑x ,where c=β𝚫2\displaystyle=\frac{1}{2\sqrt{2\pi}}\int\limits_{-\infty}^{\infty}\exp\left({-\frac{c}{2}(x+\mu_{1})^{2}-\frac{x^{2}}{2}}\right)dx\text{ ,where }c=\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}
if μ1>0\displaystyle\text{if }\mu_{1}>0
122πexp(c+12(x+μ1)2)𝑑x=12c+112κ2+1=0.45\displaystyle\geq\frac{1}{2\sqrt{2\pi}}\int\limits_{-\infty}^{\infty}\exp\left({-\frac{c+1}{2}(x+\mu_{1})^{2}}\right)dx=\frac{1}{2\sqrt{c+1}}\geq\frac{1}{2\sqrt{\kappa^{2}+1}}=0.45
else if μ1<0\displaystyle\text{else if }\mu_{1}<0
122πexp(c+12x2)𝑑x=12c+112κ2+1=0.45\displaystyle\geq\frac{1}{2\sqrt{2\pi}}\int\limits_{-\infty}^{\infty}\exp\left({-\frac{c+1}{2}x^{2}}\right)dx=\frac{1}{2\sqrt{c+1}}\geq\frac{1}{2\sqrt{\kappa^{2}+1}}=0.45

for c=β𝚫2κ2c=\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\leq\kappa^{2} and κ=0.47\kappa=0.47.

Case 2: v1212v_{1}^{2}\geq\frac{1}{2}. In this case, if x122ρx_{1}\geq 2\sqrt{2}\rho, then |v1x1+p|ρ\left|{v_{1}x_{1}+p}\right|\geq\rho and also |v1x1+p|x122\left|{v_{1}x_{1}+p}\right|\geq\frac{x_{1}}{2\sqrt{2}}. So we can write (v1x1+p)2ρx122(v_{1}x_{1}+p)^{2}\geq\frac{\rho x_{1}}{2\sqrt{2}}. Also |x1+μ1|2x1\left|{x_{1}+\mu_{1}}\right|\leq 2x_{1}. Hence,

f(v1,v2)\displaystyle f(v_{1},v_{2}) ρ22𝔼x1𝒩(0,1)[x1exp(c2(2x1)2)𝕀{x122ρ}]\displaystyle\geq\frac{\rho}{2\sqrt{2}}\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{x_{1}\exp\left({-\frac{c}{2}(2x_{1})^{2}}\right){\mathbb{I}}\left\{{{x_{1}\geq 2\sqrt{2}\rho}}\right\}}}\right]
=ρ4π22ρxexp(c2(2x)2x22)𝑑x\displaystyle=\frac{\rho}{4\sqrt{\pi}}\int\limits_{2\sqrt{2}\rho}^{\infty}x\exp\left({-\frac{c}{2}(2x)^{2}-\frac{x^{2}}{2}}\right)dx
=ρ4π22ρxexp((12+2c)x2)𝑑x\displaystyle=\frac{\rho}{4\sqrt{\pi}}\int\limits_{2\sqrt{2}\rho}^{\infty}x\exp\left({-(\frac{1}{2}+2c)x^{2}}\right)dx
=ρ4π(1+4c)22ρ(1+4c)x.exp(1+4c2x2)dx\displaystyle=\frac{\rho}{4\sqrt{\pi}(1+4c)}\int\limits_{2\sqrt{2}\rho}^{\infty}(1+4c)x.\exp\left({-\frac{1+4c}{2}x^{2}}\right)dx
=ρ4π(1+4c)4(1+4c)ρ2exp(z)𝑑z\displaystyle=\frac{\rho}{4\sqrt{\pi}(1+4c)}\int\limits_{4(1+4c)\rho^{2}}^{\infty}exp(-z)dz
=ρ4π(1+4c)e4(1+4c)ρ2\displaystyle=\frac{\rho}{4\sqrt{\pi}(1+4c)}e^{-4(1+4c)\rho^{2}}
ρ4π(1+4κ2)e4(1+4κ2)ρ2ρ14e8ρ2\displaystyle\geq\frac{\rho}{4\sqrt{\pi}(1+4\kappa^{2})}e^{-4(1+4\kappa^{2})\rho^{2}}\geq\frac{\rho}{14}e^{-8\rho^{2}}

using the fact that c=\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\leq\kappa^{2} and \kappa=0.47. We see from the above that we need to avoid large values of \rho. One way to do so is to center the covariates, i.e. use \tilde{x_{i}}=x_{i}-\hat{\mu} where \hat{\mu}:=\frac{1}{n}\sum_{i=1}^{n}x_{i}. This would approximately center the covariates and ensure an effective value of \rho\approx{\cal O}\left({{\sqrt{\frac{d}{n}}}}\right).
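
The effect of the centering trick is easy to see in simulation. The sketch below (assuming numpy; the choices of \mu, n and d are arbitrary) shows that after subtracting the empirical mean, the residual shift, i.e. the effective value of \rho, is on the order of \sqrt{d/n}.

```python
# After centering with the empirical mean, the residual shift (effective rho) is ~ sqrt(d/n).
import numpy as np

rng = np.random.default_rng(1)
n, d = 5000, 20
mu = 3.0 * rng.normal(size=d)            # an arbitrary, fairly large mean
X = mu + rng.normal(size=(n, d))         # rows are covariates drawn from N(mu, I)

mu_hat = X.mean(axis=0)                  # empirical mean used for centering
X_centered = X - mu_hat                  # the centering trick: x_i - mu_hat
print(np.linalg.norm(mu), "->", np.linalg.norm(mu_hat - mu), "~", np.sqrt(d / n))
```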

Bounded Distribution

Suppose our covariate distribution has bounded support, that is, supp({\mathcal{D}})\subset\mathcal{B}_{2}(\rho) for some \rho>0. Assume \rho>1 w.l.o.g. Also, using the centering trick above, assume that \underset{{{\mathbf{x}}}\sim{\mathcal{D}}}{{\mathbb{E}}}\left[{{{{\mathbf{x}}}}}\right]={\mathbf{0}}. Then we have \left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle\leq\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|\rho which implies \exp\left({-\frac{\beta}{2}\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle^{2}}\right)\geq\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\rho^{2}}\right). Let \Sigma denote the covariance of the distribution {\mathcal{D}} and let \lambda:=\lambda_{min}(\Sigma) denote its smallest eigenvalue. This gives us c\geq e^{-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\rho^{2}}\underset{{{\mathbf{x}}}\sim{\mathcal{D}}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}},{{{\mathbf{v}}}}\right\rangle^{2}}}\right]\geq e^{-\kappa^{2}\rho^{2}/2}\cdot\lambda=e^{-0.11\rho^{2}}\cdot\lambda, using \kappa=0.47.

This finishes the proof. ∎
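
For the centered isotropic case analyzed above, the lower bound on c(\beta,\text{\boldmath$\mathbf{\Delta}$}) can also be cross-checked numerically. The sketch below (assuming numpy and scipy; \beta and \Delta_{1} are arbitrary values with \sqrt{\beta}\Delta_{1}<\kappa) evaluates the expectation over a grid of unit vectors v=(\cos\theta,\sin\theta) and compares the smallest value against (1+\beta\Delta_{1}^{2})^{-3/2}.

```python
# Numerical check of c(beta, Delta) >= (1 + beta*||Delta||^2)^(-3/2) for D = N(0, I).
import numpy as np
from scipy import integrate

beta, Delta1 = 2.0, 0.3                     # arbitrary values with sqrt(beta)*Delta1 < 0.47

def gauss_moment(power):
    # E[x^power * exp(-beta/2 * Delta1^2 * x^2)] for x ~ N(0, 1), via quadrature
    f = lambda x: x**power * np.exp(-0.5*beta*Delta1**2*x**2) * \
                  np.exp(-x**2/2) / np.sqrt(2*np.pi)
    return integrate.quad(f, -10, 10)[0]

m2, m0 = gauss_moment(2), gauss_moment(0)   # closed forms: (1+b)^(-3/2) and (1+b)^(-1/2)
thetas = np.linspace(0, 2*np.pi, 1000)
vals = np.cos(thetas)**2 * m2 + np.sin(thetas)**2 * m0
print(vals.min(), (1 + beta*Delta1**2)**(-1.5))   # the two numbers should agree closely
```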

Lemma 20.

Let X=\left[{{{\mathbf{x}}}_{1},\ldots,{{\mathbf{x}}}_{n}}\right] be generated from an isotropic R-sub-Gaussian distribution {\mathcal{D}} and let G denote the set of uncorrupted points. Then there exists a distribution-specific constant c such that

[λmin(XGSGXG)<(1ζ)cGβ2π][λmax(XGSGXG)>(1+ζ)Gβ2π]}exp(Ω(nζ2dlog1ζdlog(n))).\left.\begin{array}[]{r}{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<(1-\zeta)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\\ \vspace*{-2ex}\hfil\\ {\mathbb{P}}\left[{{\lambda_{\max}(X_{G}S_{G}X_{G}^{\top})>(1+\zeta)\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\end{array}\right\}\leq\exp\left({-\Omega\left({{n\zeta^{2}-d\log\frac{1}{\zeta}-d\log(n)}}\right)}\right).
Proof.

The bound on the largest eigenvalue follows directly from the fact that all weights are upper bounded by \sqrt{\frac{\beta}{2\pi}}, hence X_{G}S_{G}X_{G}^{\top}\preceq\sqrt{\frac{\beta}{2\pi}}\cdot X_{G}X_{G}^{\top}, and from applying Lemma 8. For the bound on the smallest eigenvalue, notice that Lemma 17 shows us that for any fixed variance \frac{1}{\beta}, we have

[λmin(XGSGXG)<(1ζ2)cGβ2π]29dexp[mnζ2c2128R4]{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<\left({1-\frac{\zeta}{2}}\right)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\leq 2\cdot 9^{d}\exp\left[{-\frac{mn\zeta^{2}c^{2}}{128R^{4}}}\right]

Given R_{X}:=\max_{i\in[n]}\left\|{{{\mathbf{x}}}_{i}}\right\|_{2}, Lemma 18 shows us that if {{\mathbf{w}}}^{1},{{\mathbf{w}}}^{2} are two models at the variance level \frac{1}{\beta} such that \left\|{{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{2}}\right\|_{2}\leq\tau, then the following holds almost surely.

|λmin(XGSG1XG)λmin(XGSG2XG)|GτβRX32πe\left|{\lambda_{\min}(X_{G}S^{1}_{G}X_{G}^{\top})-\lambda_{\min}(X_{G}S^{2}_{G}X_{G}^{\top})}\right|\leq\frac{G\tau\beta R_{X}^{3}}{\sqrt{2\pi e}}

This prompts us to initiate a uniform convergence argument by setting up a τ\tau-net over 2(𝐰,2πβ){\mathcal{B}}_{2}\left({{{\mathbf{w}}}^{\ast},\sqrt{\frac{2\pi}{\beta}}}\right) for τ=ζc2RX3eβ\tau=\frac{\zeta c}{2R_{X}^{3}}\sqrt{\frac{e}{\beta}}. Note that such a net has at most (6RX3ζc2πe)d\left({\frac{6R_{X}^{3}}{\zeta c}\sqrt{\frac{2\pi}{e}}}\right)^{d} elements by applying standard covering number bounds for the Euclidean ball [23, Corollary 4.2.13]. Taking a union bound over this net gives us

\displaystyle{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<(1-\zeta)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right] \displaystyle\leq 2\cdot\left({\frac{54R_{X}^{3}}{\zeta c}\sqrt{\frac{2\pi}{e}}}\right)^{d}\exp\left[{-\frac{mn\zeta^{2}c^{2}}{128R^{4}}}\right]
exp(Ω(nζ2dlog1ζdlog(n))),\displaystyle\leq\exp\left({-\Omega\left({{n\zeta^{2}-d\log\frac{1}{\zeta}-d\log(n)}}\right)}\right),

where in the last step we used Lemma 7 to bound RX=𝒪(Rn)R_{X}={\cal O}\left({{R\sqrt{n}}}\right) with probability at least 1exp(Ω(n))1-\exp(-\Omega\left({{n}}\right)). ∎

G.1 Robust Least-squares Regression with a Fully Adaptive Adversary

To handle a fully adaptive adversary, we need mild modifications to the notions of LWSC and LWLC given in Definition 1, so that the adversary is now allowed to choose the locations of the corruptions arbitrarily. For the sake of simplicity, we present these re-definitions and subsequent arguments in the context of robust least-squares regression, but similar extensions hold for the other GLM tasks as well. Let us introduce the shorthand \bar{\alpha}=1-\alpha and let {\mathcal{T}}_{\bar{\alpha}}=\{T\subset[n]:\left|{T}\right|=\bar{\alpha}\cdot n\} denote the collection of all possible subsets of \bar{\alpha}\cdot n data points.

Definition 2 (LWSC/LWLC against Fully Adaptive Adversaries).

Suppose we are given an exponential family likelihood distribution [|]{\mathbb{P}}\left[{{\cdot\,|\,\cdot}}\right] and data points {(𝐱i,yi)}i=1n\left\{{({{\mathbf{x}}}^{i},y_{i})}\right\}_{i=1}^{n} of which an α>0\alpha>0 fraction has been corrupted by a fully adaptive adversary and β>0\beta>0 is any positive real value. Then, we say that the adaptive λ~β\tilde{\lambda}_{\beta}-local weighted strong convexity property is satisfied if for any model 𝐰{{\mathbf{w}}}, we have

minG𝒯α¯λmin(XGSGXG)λ~β,\min_{G\in{\mathcal{T}}_{\bar{\alpha}}}\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})\geq\tilde{\lambda}_{\beta},

where SS is a diagonal matrix containing the data point scores sis_{i} assigned by the model 𝐰{{\mathbf{w}}} (see Definition 1 for a definition of the scores). Similarly, we say that the adaptive Λ~β\tilde{\Lambda}_{\beta}-weighted strong smoothness properties are satisfied if for any true model 𝐰{{\mathbf{w}}}^{\ast} and any model 𝐰2(𝐰,1β){{\mathbf{w}}}\in{\mathcal{B}}_{2}\left({{{\mathbf{w}}}^{\ast},\sqrt{\frac{1}{\beta}}}\right), we have

maxG𝒯α¯XBSB𝐛2Λ~β,\max_{G\in{\mathcal{T}}_{\bar{\alpha}}}\left\|{X_{B}S_{B}{{\mathbf{b}}}}\right\|_{2}\leq\tilde{\Lambda}_{\beta},

where we used the shorthand B=[n]\GB=[n]\backslash G and SS continues to be the diagonal matrix containing the data point scores sis_{i} assigned by the model 𝐰{{\mathbf{w}}}.

Note that for the setting of robust least-squares regression, for any two models 𝐰^,𝐰\hat{{\mathbf{w}}},{{\mathbf{w}}} we have λmin(2Q~β(𝐰^|𝐰))=λmin(XSX)λmin(XGSGXG)\lambda_{\min}(\nabla^{2}\tilde{Q}_{\beta}(\hat{{\mathbf{w}}}|{{\mathbf{w}}}))=\lambda_{\min}(XSX^{\top})\geq\lambda_{\min}(X_{G}S_{G}X_{G}^{\top}) (since the scores sis_{i} are non-negative) which motivates the above re-definition of adaptive LWSC. For the same setting we also have Q~β(𝐰|𝐰)=XGSGϵG+XBSB𝐛\nabla\tilde{Q}_{\beta}({{\mathbf{w}}}^{\ast}\,|\,{{\mathbf{w}}})=X_{G}S_{G}\text{\boldmath$\mathbf{\epsilon}$}_{G}+X_{B}S_{B}{{\mathbf{b}}} (with the shorthand B=[n]\GB=[n]\backslash G) i.e. Q~β(𝐰|𝐰)2XGSGϵG2+XBSB𝐛2\left\|{\nabla\tilde{Q}_{\beta}({{\mathbf{w}}}^{\ast}\,|\,{{\mathbf{w}}})}\right\|_{2}\leq\left\|{X_{G}S_{G}\text{\boldmath$\mathbf{\epsilon}$}_{G}}\right\|_{2}+\left\|{X_{B}S_{B}{{\mathbf{b}}}}\right\|_{2} by triangle inequality. However, for sake of simplicity we will be analyzing the noiseless setting i.e. ϵ=𝟎\text{\boldmath$\mathbf{\epsilon}$}={\mathbf{0}} which motivates the above re-definition of adaptive LWLC. These re-definitions can be readily adapted to settings with hybrid noise i.e. when β<\beta^{\ast}<\infty as well.

We now show that for the same settings as considered for the proof of Theorem 2, the adaptive LWSC and LWLC properties are also satisfied, albeit with worse constants due to the application of union bounds over all possible “good” sets in the collection {\mathcal{T}}_{\bar{\alpha}} that the adversary could have left untouched by its corruptions.

Lemma 20 gives us, for any fixed set G𝒯α¯G\in{\mathcal{T}}_{\bar{\alpha}}

[λmin(XGSGXG)<(1ζ)cGβ2π]exp(Ω(nζ2dlog1ζdlog(n)))\displaystyle{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<(1-\zeta)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\leq\exp\left({-\Omega\left({{n\zeta^{2}-d\log\frac{1}{\zeta}-d\log(n)}}\right)}\right)

By taking a union bound over all sets G\in{\mathcal{T}}_{\bar{\alpha}} and observing that

(nk)=(nnk)(nenk)nk=(eα)αn=exp(αn(1logα)),{n\choose k}={n\choose n-k}\leq\left({\frac{ne}{n-k}}\right)^{n-k}=\left({\frac{e}{\alpha}}\right)^{\alpha n}=\exp(\alpha n(1-\log\alpha)),

we have

[λ~β<(1ζ)cGβ2π]exp(Ω(n(ζ2+αlogαα)dlog1ζdlog(n))){\mathbb{P}}\left[{{\tilde{\lambda}_{\beta}<(1-\zeta)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\leq\exp\left({-\Omega\left({{n(\zeta^{2}+\alpha\log\alpha-\alpha)-d\log\frac{1}{\zeta}-d\log(n)}}\right)}\right)

which requires setting \zeta\geq\Omega(\sqrt{\frac{d\log n}{n}+\alpha-\alpha\log\alpha}) to obtain a confidence of 1-\exp(-\Omega\left({{d}}\right)). This establishes the adaptive LWSC guarantee with a confidence level similar to the one we had for the partially adaptive case. We now establish the adaptive LWLC guarantee.

We notice that Lemma 8 can be extended to the “adaptive” setting using [1, Lemma 15] to show that with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)), we have

maxG𝒯α¯λmax(XBXB)\displaystyle\max_{G\in{\mathcal{T}}_{\bar{\alpha}}}\lambda_{\max}(X_{B}X_{B}^{\top}) αn(1+3e6logeα)+𝒪(nd)\displaystyle\leq\alpha n\left({1+3e\sqrt{6\log\frac{e}{\alpha}}}\right)+{\cal O}\left({{\sqrt{nd}}}\right)
B(1+3.01e6logeα)\displaystyle\leq B\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)

where we continue to use the notation B=[n]\backslash G and, for large enough n, we absorbed the \sqrt{nd} term into the first term by increasing the constant 3 to 3.01, since the second term asymptotically vanishes in comparison to the first term, which is linear in n. Now, following steps similar to those in the proof of Lemma 16 gives us

maxG𝒯α¯XBSB𝐛2\displaystyle\max_{G\in{\mathcal{T}}_{\bar{\alpha}}}\left\|{X_{B}S_{B}{{\mathbf{b}}}}\right\|_{2} XB2SB𝐛B2\displaystyle\leq\left\|{X_{B}}\right\|_{2}\left\|{S_{B}{{\mathbf{b}}}_{B}}\right\|_{2}
maxG𝒯α¯XB2[(κ2λmax(XBXBT)2π)1/3+(B2πe)1/3]32\displaystyle\leq\max_{G\in{\mathcal{T}}_{\bar{\alpha}}}\left\|{X_{B}}\right\|_{2}[(\frac{\kappa^{2}\lambda_{max}(X_{B}X_{B}^{T})}{2\pi})^{1/3}+(\frac{B}{2\pi e})^{1/3}]^{\frac{3}{2}}
maxG𝒯α¯λmax(XBXB)[(κ22π)1/3+(12πe)1/3]32\displaystyle\leq\max_{G\in{\mathcal{T}}_{\bar{\alpha}}}\lambda_{\max}(X_{B}X_{B}^{\top})[(\frac{\kappa^{2}}{2\pi})^{1/3}+(\frac{1}{2\pi e})^{1/3}]^{\frac{3}{2}}
B(1+3.01e6logeα)[(κ22π)1/3+(12πe)1/3]32,\displaystyle\leq B\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)[(\frac{\kappa^{2}}{2\pi})^{1/3}+(\frac{1}{2\pi e})^{1/3}]^{\frac{3}{2}},

where the second-last step uses the fact that our upper bound on \lambda_{\max}(X_{B}X_{B}^{\top}) is at least B. This establishes the adaptive LWLC property with confidence at least 1-\exp(-\Omega\left({{d}}\right)).

Theorem 21 (Theorem 3 restated – Fully Adaptive Adversary).

Suppose data is corrupted by a fully adaptive adversary that is able to decide the location of the corruptions as well as the corrupted labels using complete information of the true model, data features and clean labels, and SVAM-RR is initialized and executed as described in the statement of Theorem 2. Then SVAM-RR enjoys a breakdown point of α0.0036\alpha\leq 0.0036, i.e. it ensures model recovery even if k=αnk=\alpha\cdot n corruptions are introduced by a fully adaptive adversary where the value of α\alpha can go upto at least 0.00360.0036. More specifically, in the noiseless setting where β\beta^{\ast}\rightarrow\infty where clean data points do not experience any Gaussian noise i.e. ϵi=0\epsilon_{i}=0 and yi=𝐰,𝐱iy_{i}=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}^{i}}\right\rangle for clean points, with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)), the LWSC/LWLC conditions are satisfied for all β(0,)\beta\in(0,\infty) i.e. βmax=\beta_{\max}=\infty. Consequently, for any ϵ>0\epsilon>0, within T𝒪(log1ϵβ1)T\leq{\cal O}\left({{\log\frac{1}{\epsilon\beta^{1}}}}\right) iterations, we have 𝐰^T𝐰22ϵ\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\epsilon.

Proof.

The above arguments establishing the adaptive LWSC and adaptive LWLC properties allow us to obtain the following result in a manner similar to that used in the proof of Theorem 14 (but in the noiseless setting)

𝐰^t+1𝐰22Λ~βλ~βB(1+3.01e6logeα)12π[(1.01κ2)1/3+(1e)1/3]32β2π(1ζ)1g~G\displaystyle\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\frac{2\tilde{\Lambda}_{\beta}}{\tilde{\lambda}_{\beta}}\leq\frac{B\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)\sqrt{\frac{1}{2\pi}}[(1.01\kappa^{2})^{1/3}+(\frac{1}{e})^{1/3}]^{\frac{3}{2}}}{\sqrt{\frac{\beta}{2\pi}}(1-\zeta)\frac{1}{\tilde{g}}\cdot G}
κβ11ζ(α(1+3.01e6logeα)1α1+ββ(1+κ2)3/21.01[(1.01)1/3+(eκ2)1/3]32)(A)\displaystyle\leq\frac{\kappa}{\sqrt{\beta}}\cdot\underbrace{\frac{1}{1-\zeta}\left({\frac{\alpha\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)}{1-\alpha}\sqrt{1+\frac{\beta}{\beta^{\ast}}}{\left({1+\kappa^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}}\right)}_{(A)}

Applying the limit β\beta^{\ast}\rightarrow\infty (since we are working in the pure corruption setting without any Gaussian noise on labels of the uncorrupted points) transforms the requirement (A)1(A)\leq 1 (which as before, assures us of the existence of a scale increment ξ>1\xi>1 satisfying the requirements of Theorem 1) to:

11ζ(α(1+3.01e6logeα)1α(1+κ2)3/21.01[(1.01)1/3+(eκ2)1/3]32)1\displaystyle\frac{1}{1-\zeta}\left({\frac{\alpha\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)}{1-\alpha}{\left({1+\kappa^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}}\right)\leq 1

Setting κ=0.47\kappa=0.47 as done before further simplifies this requirement to

11ζ(α(1+3.01e6logeα)1α)14.38\frac{1}{1-\zeta}\left({\frac{\alpha\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)}{1-\alpha}}\right)\leq\frac{1}{4.38}

However, unlike earlier, where we could simply set \zeta to an arbitrarily small value for large enough n, we cannot do so now since, as noted earlier, we must set \zeta\geq\Omega(\sqrt{\frac{d\log n}{n}+\alpha-\alpha\log\alpha}) to obtain a confidence of 1-\exp(-\Omega\left({{d}}\right)) in the LWSC guarantee. Nevertheless, for large enough n we can still obtain \zeta\rightarrow\sqrt{\alpha-\alpha\log\alpha}, which transforms the requirement further to

11ααlogα(α(1+3.01e6logeα)1α)14.38\frac{1}{1-\sqrt{\alpha-\alpha\log\alpha}}\left({\frac{\alpha\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)}{1-\alpha}}\right)\leq\frac{1}{4.38}

which is satisfied for all values \alpha\leq 0.0036. This finishes the proof. ∎
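
The requirement above is a one-dimensional condition in \alpha and can be checked directly. The short sketch below (assuming numpy) evaluates the left-hand side and confirms that it falls below 1/4.38 at \alpha=0.0036 but not at slightly larger corruption rates.

```python
# Check the fully adaptive breakdown condition at alpha = 0.0036 (and just above it).
import numpy as np

def lhs(alpha):
    zeta = np.sqrt(alpha - alpha*np.log(alpha))       # the limiting value of zeta
    corr = alpha*(1 + 3.01*np.e*np.sqrt(6*np.log(np.e/alpha)))/(1 - alpha)
    return corr / (1 - zeta)

for alpha in (0.0036, 0.0040):
    print(alpha, lhs(alpha), 1/4.38)   # satisfied at 0.0036, violated at 0.0040
```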

Appendix H Robust Mean Estimation

We will let G,BG,B respectively denote the set of “good” uncorrupted points and “bad” corrupted points. We will abuse notation to let G=(1α)nG=(1-\alpha)\cdot n and B=αnB=\alpha\cdot n respectively denote the number of good and bad points too.

Theorem 22 (Theorem 5 restated).

For data generated in the robust mean estimation model as described in §4, suppose corruptions are introduced by a partially adaptive adversary i.e. the locations of the corruptions (the set BB) is not decided adversarially but the corruptions are decided jointly, adversarially and may be unbounded, then SVAM-ME enjoys a breakdown point of 0.26210.2621, i.e. it ensures a bounded 𝒪(1){\cal O}\left({{1}}\right) error even if k=αnk=\alpha\cdot n corruptions are introduced where the value of α\alpha can go upto at least 0.26210.2621. More generally, for corruption rates α0.2621\alpha\leq 0.2621, there always exists values of scale increment ξ>1\xi>1 s.t. with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)), LWSC/LWLC conditions are satisfied for the Q~β\tilde{Q}_{\beta} function corresponding to the robust mean estimation model for β\beta values at least as large as βmax=𝒪(βdmin{log1α,nd})\beta_{\max}={\cal O}\left({{\frac{\beta^{\ast}}{d}\min\left\{{\log\frac{1}{\alpha},\sqrt{nd}}\right\}}}\right). If initialized with 𝛍^1,β1\hat{\text{\boldmath$\mathbf{\mu}$}}^{1},\beta^{1} s.t. β1𝛍^1𝛍221{\beta_{1}}\cdot\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}\leq 1, SVAM-ME assures 𝛍^T𝛍22ϵ\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{T}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}\leq\epsilon for any ϵ𝒪(trace2(Σ)max{1ln(1/α),1nd})\epsilon\geq{\cal O}\left({{\text{trace}^{2}(\Sigma)\cdot\max\left\{{{\frac{1}{\ln(1/\alpha)}},\frac{1}{\sqrt{nd}}}\right\}}}\right) within T𝒪(lognβ1)T\leq{\cal O}\left({{\log\frac{n}{\beta^{1}}}}\right) iterations.

Proof.

For any two models 𝝁,𝜹\text{\boldmath$\mathbf{\mu}$},\text{\boldmath$\mathbf{\delta}$}, the Q~β\tilde{Q}_{\beta} function for robust mean estimation has the following form

Q~β(𝜹|𝝁)=i=1nsi𝐱i𝜹22,\tilde{Q}_{\beta}(\text{\boldmath$\mathbf{\delta}$}\,|\,\text{\boldmath$\mathbf{\mu}$})=\sum_{i=1}^{n}s_{i}\cdot\left\|{{{\mathbf{x}}}^{i}-\text{\boldmath$\mathbf{\delta}$}}\right\|_{2}^{2},

where siexp(β2𝐱i𝝁22)s_{i}\leftarrow\exp\left({-\frac{\beta}{2}\left\|{{{\mathbf{x}}}^{i}-\text{\boldmath$\mathbf{\mu}$}}\right\|_{2}^{2}}\right). We first outline the proof below.

Proof Outline. This proof has four key elements

  1.

    We will first establish this result for Σ=1βI\Sigma=\frac{1}{\beta^{\ast}}\cdot I for β=d\beta^{\ast}=d, then generalize the result for arbitrary β>0\beta^{\ast}>0. Note that for β=d\beta^{\ast}=d, we have trace(Σ)=1\text{trace}(\Sigma)=1.

  2.

    To establish the LWSC and LWLC properties, we will first consider, as before, a fixed value of \beta>0 for which the properties will be shown to hold with probability 1-\exp(-\Omega\left({{d}}\right)). As promised in the statement of Theorem 5, we will execute SVAM-ME for no more than {\cal O}\left({{\log n}}\right) iterations, so taking a naive union bound would offer a confidence level of 1-\log n\exp(-\Omega\left({{d}}\right)). However, this can be improved by noticing that the confidence levels offered by the LWSC/LWLC results are actually of the form 1-\exp(-\Omega\left({{n\zeta^{2}-d\log n}}\right)). Thus, a union over {\cal O}\left({{\log n}}\right) such events will at worst deteriorate the confidence bounds to 1-\log n\exp(-\Omega\left({{n\zeta^{2}-d\log n}}\right))=1-\exp(-\Omega\left({{n\zeta^{2}-d\log n-\log\log n}}\right)) which is still 1-\exp(-\Omega\left({{d}}\right)) for the values of \zeta we shall set.

  3.

    The key to this proof is to maintain the invariant βt𝝁^t𝝁21\sqrt{\beta_{t}}\cdot\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{t}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}\leq 1. Recall that initialization ensures β1𝝁^1𝝁221{\beta_{1}}\cdot{\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}}\leq 1 to start things off. §3 gives details on how to initialize in practice. This establishes the base case of an inductive argument. Next, inductively assuming that βt𝝁^t𝝁221{\beta_{t}}\cdot{\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{t}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}}\leq 1 for an iteration tt, we will establish that 𝝁^t+1𝝁22Λβtλβt(A)βt\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{t+1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}\leq\frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}\leq\frac{(A)}{\sqrt{\beta}_{t}} where (A)(A) will be an application-specific expression derived below.

  4.

    We will then ensure that (A)<1, say (A)=1/\sqrt{\xi} for some \xi>1, whenever the number of corruptions is below the breakdown point. This ensures \left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{t+1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}\leq\frac{1}{{\xi\beta_{t}}}, in other words, {\beta_{t+1}}\cdot{\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{t+1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}}\leq 1 for \beta_{t+1}=\xi\cdot\beta_{t} so that the invariant is preserved. However, notice that the above step simultaneously ensures that \frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}\leq\frac{1}{\sqrt{\xi\beta_{t}}}. This ensures that a valid value of the scale increment \xi can always be found till \beta_{t}\leq\beta_{\max}. Specifically, we will be able to assure the existence of a scale increment \xi>1 satisfying the conditions of Theorem 1 w.r.t. the LWSC/LWLC results only till \beta<{\cal O}\left({{\frac{\beta^{\ast}}{d}\min\left\{{\log\frac{1}{\alpha},\sqrt{nd}}\right\}}}\right). (A minimal sketch of the update being analyzed appears right after this outline.)
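
Before the formal argument, the following minimal sketch (assuming numpy; the data-generation choices are arbitrary and the code is illustrative rather than the reference implementation) shows the update being analyzed: Gaussian weights at the current scale \beta_{t}, a weighted-mean step minimizing \tilde{Q}_{\beta_{t}}, and the scale increment \beta_{t+1}=\xi\beta_{t} that maintains the invariant from steps 3 and 4 above.

```python
# Illustrative SVAM-ME iteration (a sketch, not the reference implementation):
# Gaussian weights at scale beta_t, weighted-mean update, then beta_{t+1} = xi * beta_t.
import numpy as np

rng = np.random.default_rng(2)
n, d, alpha = 1000, 5, 0.1
mu_star = rng.normal(size=d)
X = mu_star + rng.normal(size=(n, d)) / np.sqrt(d)   # good points: covariance I/d, i.e. beta_star = d
X[: int(alpha * n)] += 10.0                          # corrupt an alpha fraction of the points

mu_hat, beta, xi = X.mean(axis=0), 0.01, 1.3          # crude initialization with beta_1*||Delta||^2 <= 1
print("initial error:", np.linalg.norm(mu_hat - mu_star))
for _ in range(20):
    s = np.exp(-0.5 * beta * ((X - mu_hat) ** 2).sum(axis=1))   # weights at the current scale beta
    mu_hat = (s[:, None] * X).sum(axis=0) / s.sum()             # weighted mean: minimizer of Q_beta-tilde
    beta *= xi                                                  # scale increment
print("final error:", np.linalg.norm(mu_hat - mu_star))
```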

We now present the proof. Lemmata 23 and 24 establish the LWSC/LWLC properties for the \tilde{Q}_{\beta} function for robust mean estimation. Let \text{\boldmath$\mathbf{\Delta}$}=\hat{\text{\boldmath$\mathbf{\mu}$}}^{t}-\text{\boldmath$\mathbf{\mu}$}^{\ast} and \text{\boldmath$\mathbf{\Delta}$}^{+}=\hat{\text{\boldmath$\mathbf{\mu}$}}^{t+1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}. To simplify the notation, we will analyze below the updates made with weights scaled up by the constant \tilde{c}=\left({\sqrt{\frac{\beta+{\beta^{\ast}}}{\beta^{\ast}}}}\right)^{d}\exp\left({\frac{\beta{\beta^{\ast}}}{2(\beta+{\beta^{\ast}})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right) as defined in Lemma 9. This in no way affects the execution of the algorithm since this scaling factor appears in both the numerator and the denominator of the update terms and simply cancels away. However, it will simplify our presentation. We also, w.l.o.g., first analyze the case of \beta^{\ast}=d and then scale the space to retrieve a result for general \beta^{\ast}.

Using the bound \tilde{c}\leq\exp(\beta+0.5) from Lemma 26, Lemmata 23 and 24 give us

𝚫+2iBsi𝐛i2+iGsiϵi2i=1nsiBc~κβ+𝐦2+νGβG(1ζ)\displaystyle\left\|{\text{\boldmath$\mathbf{\Delta}$}^{+}}\right\|_{2}\leq\frac{\left\|{\sum_{i\in B}s_{i}{{\mathbf{b}}}^{i}}\right\|_{2}+\left\|{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}}{\sum_{i=1}^{n}s_{i}}\leq\frac{\frac{B\tilde{c}\kappa}{\sqrt{\beta}}+\left\|{{{\mathbf{m}}}}\right\|_{2}+\frac{\nu G}{\sqrt{\beta}}}{G(1-\zeta)}
1βκc~B+Gββ+d+νGG(1ζ)\displaystyle\leq\sqrt{\frac{1}{\beta}}\cdot\frac{\kappa\tilde{c}B+G\frac{\beta}{\beta+{d}}+\nu G}{G(1-\zeta)}
1β(11ζ[κc~BG+ββ+d+ν])(A),\displaystyle\leq\sqrt{\frac{1}{\beta}}\cdot\underbrace{\left({\frac{1}{1-\zeta}\left[{\kappa\tilde{c}\frac{B}{G}+\frac{\beta}{\beta+{d}}+\nu}\right]}\right)}_{(A)},

where κ=1+12\kappa=1+\sqrt{\frac{1}{2}}. As was the case of robust least squares regression in the proof of Theorem 14, to assure the existence of a scale increment ξ>1\xi>1 satisfying the requirements of Theorem 1 and hence a linear rate of convergence, all we require is to ensure (A)(A) has a value of the form 1ξ<1\frac{1}{\xi}<1 where ξ>1\xi>1. Now, noting that RXnR_{X}\leq\sqrt{n}, and promising that we will never set βn\beta\geq n as well as never set ν,ζ1n\nu,\zeta\leq\frac{1}{n}, we note that we need to set ζΩ(dlog(n)n)\zeta\geq\Omega\left({{\sqrt{\frac{d\log(n)}{n}}}}\right) as well as νΩ(dβlog(n)n(β+d))\nu\geq\Omega\left({{\sqrt{\frac{d\beta\log(n)}{n(\beta+d)}}}}\right), in order to ensure a confidence of 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) in the tail bounds we have established.

  1.

    Breakdown Point: As observed above, with large enough n, we can set \zeta,\nu to be small, yet positive, constants. For large enough d, if we set \beta={\cal O}\left({{1}}\right) to be a small enough constant then we have \frac{\beta}{\beta+d}\rightarrow 0, as well as \tilde{c}\leq\exp(0.5+\beta)\approx\sqrt{e}. This means we need only ensure \left({1+\sqrt{\frac{1}{2}}}\right)\sqrt{e}\frac{\alpha}{1-\alpha}\leq 1. The above is satisfied for all \alpha\leq 0.2621, which establishes a breakdown point of 26.21\% (see the numerical check following this proof). Note that even at this breakdown point, we still set r=\beta=\Omega\left({{1}}\right), and thus can still assure \left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq\sqrt{\frac{1}{\beta}}={\cal O}\left({{1}}\right).

  2.

    Consistency: To analyze the consistency properties of the algorithm, we recall that, ignoring universal constants, to obtain a linear rate of convergence we need to ensure

    11ζ[c~α+ββ+d+ν]<1\frac{1}{1-\zeta}\left[{\tilde{c}\alpha+\frac{\beta}{\beta+{d}}+\nu}\right]<1

    which can be rewritten as βd+11ζ+ν+c~α\frac{\beta}{d}+1\leq\frac{1}{\zeta+\nu+\tilde{c}\alpha}. Recall from above that we need to set ζΩ(dlog(n)n)\zeta\geq\Omega\left({{\sqrt{\frac{d\log(n)}{n}}}}\right) as well as νΩ(dβlog(n)n(β+d))\nu\geq\Omega\left({{\sqrt{\frac{d\beta\log(n)}{n(\beta+d)}}}}\right). Setting them at these lower bounds, using c~eexp(β)\tilde{c}\leq\sqrt{e}\exp(\beta), ignoring universal constants (since we are only interested in the asymptotic behavior of the algorithm) and some simple manipulations, we can show that, for all ndn\geq d, we can allow values of β\beta as large as

    βmin{𝒪(log1α),𝒪(nd)}\beta\leq\min\left\{{{\cal O}\left({{\log\frac{1}{\alpha}}}\right),{\cal O}\left({{\sqrt{nd}}}\right)}\right\}

    Note that the above assures us, when α=0\alpha=0 i.e. corruptions are absent, an error of 𝚫221β1nd0\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq\frac{1}{\beta}\leq\frac{1}{\sqrt{nd}}\rightarrow 0 as nn\rightarrow\infty. Thus, the method is consistent when corruptions are absent.

Scaling the space back up by a factor of dβ\sqrt{\frac{d}{\beta^{\ast}}} gives us the desired result. ∎
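
The breakdown constant claimed in step 1 of the above proof comes from the scalar condition \left({1+\sqrt{\frac{1}{2}}}\right)\sqrt{e}\frac{\alpha}{1-\alpha}\leq 1; the short check below (assuming numpy) confirms that it holds at \alpha=0.2621 and fails just above.

```python
# Check the robust mean estimation breakdown condition (1 + 1/sqrt(2)) * sqrt(e) * alpha/(1-alpha) <= 1.
import numpy as np

const = (1 + 1/np.sqrt(2)) * np.sqrt(np.e)
for alpha in (0.2621, 0.2622):
    print(alpha, const * alpha / (1 - alpha))   # <= 1 at 0.2621, > 1 just above
```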

Lemma 23 (LWSC for Robust Mean Estimation).

For any 0βn0\leq\beta\leq n, the Q~β\tilde{Q}_{\beta}-function for robust mean estimation satisfies the LWSC property with constant λβG(1ζ)\lambda_{\beta}\geq G(1-\zeta) with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) for any ζΩ(dlog(n)n)\zeta\geq\Omega\left({{\sqrt{\frac{d\log(n)}{n}}}}\right).

Proof.

It is easy to see that 2Q~β(𝝁^|𝝁)=(i=1nsi)I\nabla^{2}\tilde{Q}_{\beta}(\hat{\text{\boldmath$\mathbf{\mu}$}}\,|\,\text{\boldmath$\mathbf{\mu}$})=(\sum_{i=1}^{n}s_{i})\cdot I for any 𝝁^2(𝝁,1β)\hat{\text{\boldmath$\mathbf{\mu}$}}\in{\mathcal{B}}_{2}\left({\text{\boldmath$\mathbf{\mu}$}^{\ast},\sqrt{\frac{1}{\beta}}}\right). Applying Lemma 9 gives us

𝔼[c~exp(β2ϵ𝚫22)]=1,{\mathbb{E}}\left[{{\tilde{c}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)}}\right]=1,

where \tilde{c}=\left({\sqrt{\frac{\beta+{\beta^{\ast}}}{\beta^{\ast}}}}\right)^{d}\exp\left({\frac{\beta{\beta^{\ast}}}{2(\beta+{\beta^{\ast}})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right) is defined in Lemma 9. The analysis in the proof of Lemma 13, on the other hand, tells us that the random variable s=\tilde{c}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right) has a sub-Gaussian constant at most unity. Applying Hoeffding's inequality for sub-Gaussian variables and noticing G\geq n/2 gives us

[iGsiG(1ζ2)]exp(mζ2n8),{\mathbb{P}}\left[{{\sum_{i\in G}s_{i}\leq G\left({1-\frac{\zeta}{2}}\right)}}\right]\leq\exp\left({-\frac{m\zeta^{2}n}{8}}\right),

where m>0m>0 is a universal constant. Again notice that the above result holds for a fixed error vector 𝚫\mathbf{\Delta}. Suppose 𝚫1,𝚫2(1β)\text{\boldmath$\mathbf{\Delta}$}_{1},\text{\boldmath$\mathbf{\Delta}$}_{2}\in{\mathcal{B}}\left({\frac{1}{\sqrt{\beta}}}\right) are two error vectors such that 𝚫1𝚫22τ\left\|{\text{\boldmath$\mathbf{\Delta}$}_{1}-\text{\boldmath$\mathbf{\Delta}$}_{2}}\right\|_{2}\leq\tau. Then, denoting si1,si2s^{1}_{i},s^{2}_{i} as the weights assigned by these two error vectors, for all τ23βRX\tau\leq\frac{2}{3\sqrt{\beta}R_{X}}, by applying Lemma 25, we get

|iGsi1iGsi2|3Gc~τβRX,\left|{\sum_{i\in G}s^{1}_{i}-\sum_{i\in G}s^{2}_{i}}\right|\leq 3G\tilde{c}\tau\sqrt{\beta}R_{X},

where RXR_{X} is the maximum length in a set of nn vectors, each sampled from a dd-dimensional Gaussian (see Lemma 7). Applying a union bound over a τ\tau-net over 2(1β){\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right) with τ=ζ6βc~RX\tau=\frac{\zeta}{6\sqrt{\beta}\tilde{c}R_{X}} gives us

[𝚫2(1β):iGsiG(1ζ)](12RXc~βζ)dexp(mζ2n8),{\mathbb{P}}\left[{{\exists\text{\boldmath$\mathbf{\Delta}$}\in{\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right):\sum_{i\in G}s_{i}\leq G(1-\zeta)}}\right]\leq\left({\frac{12R_{X}\tilde{c}\sqrt{\beta}}{\zeta}}\right)^{d}\exp\left({-\frac{m\zeta^{2}n}{8}}\right),

Promising that we will always set β<nd\beta<\sqrt{\frac{n}{d}} and noting that Lemma 26 gives us c~eexp(β)\tilde{c}\leq\sqrt{e}\exp(\beta) for β=d\beta^{\ast}=d and noting that Lemma 7 gives us RXnR_{X}\leq\sqrt{n} finishes the proof. ∎

Lemma 24 (LWLC for Robust Mean Estimation).

For any 0βn0\leq\beta\leq n, the Q~β\tilde{Q}_{\beta}-function for robust mean estimation satisfies the LWLC property with constant ΛβG(1+ν)+Bc~κβ\Lambda_{\beta}\leq G(1+\nu)+\frac{B\tilde{c}\kappa}{\sqrt{\beta}} with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) for any νΩ(dβlog(n)n(β+d))\nu\geq\Omega\left({{\sqrt{\frac{d\beta\log(n)}{n(\beta+d)}}}}\right).

Proof.

It is easy to see that Q~β(𝝁|𝝁)=iGsiϵi+iBsi𝐛i\nabla\tilde{Q}_{\beta}(\text{\boldmath$\mathbf{\mu}$}^{\ast}\,|\,\text{\boldmath$\mathbf{\mu}$})=\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}_{i}+\sum_{i\in B}s_{i}{{\mathbf{b}}}^{i}. We bound these separately below. Recall that we are working with weights that are scaled by a factor of c~\tilde{c}, where c~=(β+ββ)dexp(ββ2(β+β)𝚫22)\tilde{c}=\left({\sqrt{\frac{\beta+{\beta^{\ast}}}{\beta^{\ast}}}}\right)^{d}\exp\left({\frac{\beta{\beta^{\ast}}}{2(\beta+{\beta^{\ast}})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right) is defined in Lemma 9.

Bad Points.

We have si=c~exp(β2𝐛i𝚫22)s_{i}=\tilde{c}\exp(-\frac{\beta}{2}\left\|{{{\mathbf{b}}}^{i}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}) for iBi\in B. Let κ=1+12\kappa=1+\sqrt{\frac{1}{2}}. This gives us two cases

  1.

    \left\|{{{\mathbf{b}}}^{i}}\right\|_{2}\leq\kappa\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}: in this case we use s_{i}\leq\tilde{c} and thus s_{i}\cdot\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}\leq\tilde{c}\kappa\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq\frac{\tilde{c}\kappa}{\sqrt{\beta}}

  2.

    \left\|{{{\mathbf{b}}}^{i}}\right\|_{2}>\kappa\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}: in this case we have \left\|{{{\mathbf{b}}}^{i}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\geq\left({1-\frac{1}{\kappa}}\right)\left\|{{{\mathbf{b}}}^{i}}\right\|_{2} and thus we also have s_{i}\leq\tilde{c}\exp(-\frac{\beta}{2}\left({1-\frac{1}{\kappa}}\right)^{2}\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}^{2}), which gives us s_{i}\cdot\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}\leq\frac{\tilde{c}\kappa}{\sqrt{\beta}} upon using the fact that x\cdot\exp(-x^{2})<\frac{1}{2} for all x\geq 0 (this step is spelled out right after the case analysis).

The above tells us, by an application of the triangle inequality, that iBsi𝐛i2Bc~κβ\left\|{\sum_{i\in B}s_{i}{{\mathbf{b}}}^{i}}\right\|_{2}\leq\frac{B\tilde{c}\kappa}{\sqrt{\beta}}.
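For completeness, the substitution behind the second case above can be spelled out as follows: with x:=\sqrt{\frac{\beta}{2}}\left({1-\frac{1}{\kappa}}\right)\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}, we have

s_{i}\cdot\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}\leq\tilde{c}\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}\exp\left({-\frac{\beta}{2}\left({1-\frac{1}{\kappa}}\right)^{2}\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}^{2}}\right)=\frac{\tilde{c}}{1-\frac{1}{\kappa}}\sqrt{\frac{2}{\beta}}\cdot x\exp(-x^{2})\leq\frac{\tilde{c}}{\left({1-\frac{1}{\kappa}}\right)\sqrt{2\beta}}=\frac{\tilde{c}\kappa}{\sqrt{\beta}},

since x\exp(-x^{2})\leq\frac{1}{2} for all x\geq 0 and \frac{1}{\sqrt{2}\left({1-\frac{1}{\kappa}}\right)}=\frac{1}{\sqrt{2}\left({\sqrt{2}-1}\right)}=1+\frac{1}{\sqrt{2}}=\kappa.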

Good Points.

We have si=c~exp(β2ϵi𝚫22)s_{i}=\tilde{c}\exp(-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}^{i}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}) for iGi\in G. Thus, Lemma 9 gives us

𝔼[iGsiϵi]=G𝔼[𝐱]=Gββ+d𝚫=:𝐦,\displaystyle{\mathbb{E}}\left[{{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}}\right]=G\cdot{\mathbb{E}}\left[{{{{\mathbf{x}}}}}\right]=G\cdot\frac{\beta}{\beta+{d}}\text{\boldmath$\mathbf{\Delta}$}=:{{\mathbf{m}}},

where 𝐱𝒩(ββ+d𝚫,1β+dI){{\mathbf{x}}}\sim{\mathcal{N}}\left({\frac{\beta}{\beta+{d}}\text{\boldmath$\mathbf{\Delta}$},\frac{1}{\beta+{d}}\cdot I}\right). Note that since β𝚫221\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq 1, we have

𝐦2Gββ+d\left\|{{{\mathbf{m}}}}\right\|_{2}\leq G\cdot\frac{\sqrt{\beta}}{\beta+{d}}

Using Lemma 13 and the linearity of the subexponential norm tells us that the subexponential norm of the random variable s\cdot\text{\boldmath$\mathbf{\epsilon}$}^{\top}{{\mathbf{v}}}=\tilde{c}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\text{\boldmath$\mathbf{\epsilon}$}^{\top}{{\mathbf{v}}}, for a fixed unit vector {{\mathbf{v}}}, is at most \frac{2}{\sqrt{\beta+d}} (where \tilde{c} is as defined in Lemma 9). Applying Bernstein's inequality for subexponential variables gives us

[iGsiϵi𝐯>𝐦𝐯+t]exp(mmin{t2(β+d)G,tβ+d}),{\mathbb{P}}\left[{{\sum_{i\in G}s_{i}{\text{\boldmath$\mathbf{\epsilon}$}^{i}}^{\top}{{\mathbf{v}}}>{{\mathbf{m}}}\cdot{{\mathbf{v}}}+t}}\right]\leq\exp\left({-m\cdot\min\left\{{\frac{t^{2}(\beta+d)}{G},t\sqrt{\beta+d}}\right\}}\right),

for some universal constant m>0m>0. Now, if 𝐯1,𝐯2Sd1{{\mathbf{v}}}^{1},{{\mathbf{v}}}^{2}\in S^{d-1} are two unit vectors such that 𝐯1𝐯22τ\left\|{{{\mathbf{v}}}^{1}-{{\mathbf{v}}}^{2}}\right\|_{2}\leq\tau, we have

|iGsiϵi𝐯1iGsiϵi𝐯2|iGsiϵi2τGRXτ,\left|{\sum_{i\in G}s_{i}{\text{\boldmath$\mathbf{\epsilon}$}^{i}}^{\top}{{\mathbf{v}}}^{1}-\sum_{i\in G}s_{i}{\text{\boldmath$\mathbf{\epsilon}$}^{i}}^{\top}{{\mathbf{v}}}^{2}}\right|\leq\left\|{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}\cdot\tau\leq GR_{X}\tau,

where RXR_{X} is the maximum length in a set of nn vectors, each sampled from a dd-dimensional Gaussian (see Lemma 7) and where in the last step we used the triangle inequality and the bounds si1s_{i}\leq 1 and ϵi2RX\left\|{\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}\leq R_{X} for all ii. Thus, taking a union bound over a τ\tau-net over the surface of the unit sphere Sd1S^{d-1} gives us

[𝐯Sd1,iGsiϵi𝐯>𝐦𝐯+t+GRXτ](2τ)dexp(mmin{t2(β+d)G,tβ+d}),{\mathbb{P}}\left[{{\exists{{\mathbf{v}}}\in S^{d-1},\sum_{i\in G}s_{i}{\text{\boldmath$\mathbf{\epsilon}$}^{i}}^{\top}{{\mathbf{v}}}>{{\mathbf{m}}}\cdot{{\mathbf{v}}}+t+GR_{X}\tau}}\right]\leq\left({\frac{2}{\tau}}\right)^{d}\exp\left({-m\cdot\min\left\{{\frac{t^{2}(\beta+d)}{G},t\sqrt{\beta+d}}\right\}}\right),

The above can be seen as simply an affirmation that iGsiϵi2𝐦2+t+GRXτ\left\|{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}\leq\left\|{{{\mathbf{m}}}}\right\|_{2}+t+GR_{X}\tau with high probability. Setting t=νG4βt=\frac{\nu G}{4\sqrt{\beta}} and τ=ν4RX\tau=\frac{\nu}{4R_{X}}, and noticing Gn/2G\geq n/2 gives us, upon promising that we always set ν1\nu\leq 1,

[iGsiϵi2>𝐦2+νG2β](8RXν)dexp(mν2n8β+dβ).{\mathbb{P}}\left[{{\left\|{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}>\left\|{{{\mathbf{m}}}}\right\|_{2}+\frac{\nu G}{2\sqrt{\beta}}}}\right]\leq\left({\frac{8R_{X}}{\nu}}\right)^{d}\exp\left({-\frac{m\nu^{2}n}{8}\cdot\frac{\beta+d}{\beta}}\right).

Now notice that this result holds for a fixed error vector \text{\boldmath$\mathbf{\Delta}$}\in{\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right). Suppose now that we have two vectors \text{\boldmath$\mathbf{\Delta}$}^{1},\text{\boldmath$\mathbf{\Delta}$}^{2}\in{\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right) such that \left\|{\text{\boldmath$\mathbf{\Delta}$}^{1}-\text{\boldmath$\mathbf{\Delta}$}^{2}}\right\|_{2}\leq\tau. If we let s^{1}_{i} and s^{2}_{i} denote the weights with respect to these two error vectors, then Lemma 25 tells us that, for any \tau, we must have

iG(si1si2)ϵi23GRXc~τ(βRX+2β).\left\|{\sum_{i\in G}(s^{1}_{i}-s^{2}_{i})\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}\leq 3GR_{X}\tilde{c}\tau\left({\beta R_{X}+2\sqrt{\beta}}\right).

Taking a union bound over a τ\tau-net over 2(1β){\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right) for τ=ν6c~RX(βRX+2β)β\tau=\frac{\nu}{6\tilde{c}R_{X}\left({\beta R_{X}+2\sqrt{\beta}}\right)\sqrt{\beta}} gives us

[𝚫2(1β):iGsiϵi2>𝐦2+νGβ](24RX2β2c~ν)d(8RXν)dexp(mν2n8β+dβ).{\mathbb{P}}\left[{{\exists\text{\boldmath$\mathbf{\Delta}$}\in{\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right):\left\|{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}>\left\|{{{\mathbf{m}}}}\right\|_{2}+\frac{\nu G}{\sqrt{\beta}}}}\right]\leq\left({\frac{24R_{X}^{2}\beta^{2}\tilde{c}}{\nu}}\right)^{d}\left({\frac{8R_{X}}{\nu}}\right)^{d}\exp\left({-\frac{m\nu^{2}n}{8}\cdot\frac{\beta+d}{\beta}}\right).

This finishes the proof upon simple modifications and promising that we will always set β<nd\beta<\sqrt{\frac{n}{d}} and noting that Lemma 26 gives us c~eexp(β)\tilde{c}\leq\sqrt{e}\exp(\beta) for β=d\beta^{\ast}=d. ∎

Lemma 25.

Suppose \text{\boldmath$\mathbf{\Delta}$}^{1},\text{\boldmath$\mathbf{\Delta}$}^{2}\in{\mathcal{B}}_{2}\left({{\mathbf{0}},\frac{1}{\sqrt{\beta}}}\right) are two error vectors such that \left\|{\text{\boldmath$\mathbf{\Delta}$}^{1}-\text{\boldmath$\mathbf{\Delta}$}^{2}}\right\|_{2}\leq\tau and, for some vector \text{\boldmath$\mathbf{\epsilon}$}, define s^{i}=\tilde{c}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}^{i}}\right\|_{2}^{2}}\right) for i=1,2. Then we must have

|s1s2|3c~τ(βϵ2+2β)\left|{s^{1}-s^{2}}\right|\leq 3\tilde{c}\tau\left({\beta\left\|{\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}+2\sqrt{\beta}}\right)
Proof.

Since exp(x)\exp(-x) is a 11-Lipschitz function in the region x0x\geq 0, we have

|exp(β2ϵ𝚫122)exp(β2ϵ𝚫222)||β2ϵ𝚫122+β2ϵ𝚫222|\displaystyle\left|{\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}^{1}}\right\|_{2}^{2}}\right)-\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}^{2}}\right\|_{2}^{2}}\right)}\right|\leq\left|{-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}^{1}}\right\|_{2}^{2}+\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}^{2}}\right\|_{2}^{2}}\right|
β2|(𝚫1+𝚫2+2ϵ)(𝚫1𝚫2)|3τ(βϵ2+2β),\displaystyle\leq\frac{\beta}{2}\left|{(\text{\boldmath$\mathbf{\Delta}$}^{1}+\text{\boldmath$\mathbf{\Delta}$}^{2}+2\text{\boldmath$\mathbf{\epsilon}$})^{\top}(\text{\boldmath$\mathbf{\Delta}$}^{1}-\text{\boldmath$\mathbf{\Delta}$}^{2})}\right|\leq 3\tau\left({\beta\left\|{\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}+2\sqrt{\beta}}\right),

where in the last step we applied the Cauchy-Schwarz inequality and used \sqrt{\beta}\left\|{\text{\boldmath$\mathbf{\Delta}$}^{i}}\right\|_{2}\leq 1 for i=1,2. ∎

Lemma 26.

For β=d\beta^{\ast}=d, we have c~eexp(β)\tilde{c}\leq\sqrt{e}\exp(\beta) where c~=(β+ββ)dexp(ββ2(β+β)𝚫22)\tilde{c}=\left({\sqrt{\frac{\beta+{\beta^{\ast}}}{\beta^{\ast}}}}\right)^{d}\exp\left({\frac{\beta{\beta^{\ast}}}{2(\beta+{\beta^{\ast}})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right) is defined in Lemma 9.

Proof.

We have

\left({\sqrt{\frac{\beta+d}{d}}}\right)^{d}=\left({1+\frac{\beta}{d}}\right)^{\frac{d}{2}}\leq\exp\left({\frac{\beta}{2}}\right).

We also have, using β𝚫221\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq 1,

exp(βd2(β+d)𝚫22)exp(d2(β+d)),\exp\left({\frac{\beta{d}}{2(\beta+{d})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\leq\exp\left({\frac{{d}}{2(\beta+{d})}}\right),

Thus, we have

\tilde{c}\leq\exp\left({\frac{\beta}{2}+\frac{d}{2(\beta+d)}}\right)\leq\exp\left({\frac{\beta}{2}+\frac{1}{2}}\right)\leq\exp(\beta+0.5),
which finishes the proof. ∎
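As an illustrative numerical check of this bound (not part of the paper's code), one can verify on random instances that \tilde{c}\leq\sqrt{e}\exp(\beta) whenever \beta^{\ast}=d and \beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq 1:

```python
# Check c~ = ((beta + d)/d)^(d/2) * exp(beta*d*||Delta||^2 / (2*(beta + d))) <= exp(beta + 0.5)
# on random instances with beta* = d and beta * ||Delta||^2 <= 1.
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    d = int(rng.integers(1, 200))
    beta = rng.uniform(0.01, 50.0)
    delta_sq = rng.uniform(0.0, 1.0 / beta)   # ||Delta||_2^2 <= 1/beta
    c_tilde = ((beta + d) / d) ** (d / 2) * np.exp(beta * d * delta_sq / (2 * (beta + d)))
    assert c_tilde <= np.exp(beta + 0.5) + 1e-9
print("Lemma 26 bound verified on random instances")
```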

Appendix I Gamma Regression

For this proof we use the notation X=[𝐱1,,𝐱n]d×n,𝐲=[y1,,yn]+n,𝐛=[b1,,bn]+nX=[{{\mathbf{x}}}^{1},\ldots,{{\mathbf{x}}}^{n}]\in{\mathbb{R}}^{d\times n},{{\mathbf{y}}}=[y_{1},\ldots,y_{n}]\in{\mathbb{R}}_{+}^{n},{{\mathbf{b}}}=[b_{1},\ldots,b_{n}]\in{\mathbb{R}}_{+}^{n}. Recall that the labels for gamma distribution are always non-negative and, as specified in Section 4, the corruptions are multiplicative in this case. We will let G,BG,B respectively denote the set of “good” uncorrupted points and “bad” corrupted points. We will abuse notation to let G=(1α)nG=(1-\alpha)\cdot n and B=αnB=\alpha\cdot n respectively denote the number of good and bad points too.

To simplify the analysis, we assume that the data features {{\mathbf{x}}}^{i} are sampled uniformly from the surface of the unit sphere in d dimensions, i.e. S^{d-1}. We will also assume that, for clean points, labels are generated at the mode of the Gamma distribution, i.e. y_{i}=\exp(-\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}^{i}}\right\rangle)(1-\phi) (the no-noise model). For corrupted points, the labels are \tilde{y}_{i}=y_{i}\cdot b_{i} where b_{i}>0 but is otherwise arbitrary, or even unbounded. To simplify the presentation of the results, we will additionally assume \left\|{{{\mathbf{w}}}^{\ast}}\right\|_{2}=1 and \left\|{{{\mathbf{w}}}}\right\|_{2}\leq R.

For any vector 𝐯m{{\mathbf{v}}}\in{\mathbb{R}}^{m} and any set T[m]T\subseteq[m], 𝐯T{{\mathbf{v}}}_{T} denotes the vector with all coordinates other than those in the set TT zeroed out. Similarly, for any matrix Ak×m,ATA\in{\mathbb{R}}^{k\times m},A_{T} denotes the matrix with all columns other than those in the set TT zeroed out.

I.1 Variance Reduction with the Gamma distribution

Since the variance reduction step is a bit more involved with the Gamma distribution, we give a detailed derivation here. Consider the gamma distribution:

𝒢(yi;ηi,ϕ)\displaystyle{\cal G}(y_{i};\eta_{i},\phi) =1yiΓ(1ϕ)(yiηiϕ)1ϕexp(yiηiϕ)\displaystyle=\frac{1}{y_{i}\Gamma(\frac{1}{\phi})}\left(\frac{y_{i}\eta_{i}}{\phi}\right)^{\frac{1}{\phi}}\exp(-\frac{y_{i}\eta_{i}}{\phi})
=exp(yiηilnηiϕ+(1ϕ1)lnyi1ϕlnϕlnΓ(1ϕ))\displaystyle=\exp\left(\frac{y_{i}\eta_{i}-\ln\eta_{i}}{-\phi}+(\frac{1}{\phi}-1)\ln y_{i}-\frac{1}{\phi}\ln\phi-\ln\Gamma(\frac{1}{\phi})\right)

where the natural parameter is \eta_{i}=\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle). The mode-preserving, variance-reduced distribution is then:

𝒢(yi;ηi,ϕ,βt)\displaystyle{\cal G}(y_{i};\eta_{i},\phi,\beta_{t}) =exp(βt(yiηilnηiϕ+(1ϕ1)lnyi1ϕlnϕlnΓ(1ϕ))W(ηi,βt,ϕ))\displaystyle=\exp\left(\beta_{t}\left(\frac{y_{i}\eta_{i}-\ln\eta_{i}}{-\phi}+(\frac{1}{\phi}-1)\ln y_{i}-\frac{1}{\phi}\ln\phi-\ln\Gamma(\frac{1}{\phi})\right)-W(\eta_{i},\beta_{t},\phi)\right)

writing 1ϕ~βt:=βtϕβt+1\frac{1}{\tilde{\phi}_{\beta_{t}}}:=\frac{\beta_{t}}{\phi}-\beta_{t}+1 we have,

W(ηi,βt,ϕ)\displaystyle W(\eta_{i},\beta_{t},\phi) =ln(0exp(βt(yiηilnηiϕ+(1ϕ1)lnyi1ϕlnϕlnΓ(1ϕ)))𝑑y)\displaystyle=\ln\left(\int_{0}^{\infty}\exp\left(\beta_{t}\left(\frac{y_{i}\eta_{i}-\ln\eta_{i}}{-\phi}+(\frac{1}{\phi}-1)\ln y_{i}-\frac{1}{\phi}\ln\phi-\ln\Gamma(\frac{1}{\phi})\right)\right)dy\right)
=βtϕlnηiϕβtlnΓ(1ϕ)+1ϕ~βtlnϕβtηi+lnΓ(1ϕ~βt)\displaystyle=\frac{\beta_{t}}{\phi}\ln\frac{\eta_{i}}{\phi}-\beta_{t}\ln\Gamma(\frac{1}{\phi})+\frac{1}{\tilde{\phi}_{\beta_{t}}}\ln\frac{\phi}{\beta_{t}\eta_{i}}+\ln\Gamma\left(\frac{1}{\tilde{\phi}_{\beta_{t}}}\right)

So that,

𝒢(yi;ηi,ϕ,βt)\displaystyle{\cal G}(y_{i};\eta_{i},\phi,\beta_{t}) =exp(βtyiηiϕ+(1ϕ~t1)lnyi1ϕ~tlnϕβtηilnΓ(1ϕ~t))\displaystyle=\exp\left(-\frac{\beta_{t}y_{i}\eta_{i}}{\phi}+(\frac{1}{\tilde{\phi}_{t}}-1)\ln y_{i}-\frac{1}{\tilde{\phi}_{t}}\ln\frac{\phi}{\beta_{t}\eta_{i}}-\ln\Gamma(\frac{1}{\tilde{\phi}_{t}})\right)
=exp(yiη~iϕ~βt+(1ϕ~t1)lnyi1ϕ~βtlnϕ~βtη~ilnΓ(1ϕ~t))setting, η~i:=ηiβtϕ~tϕ\displaystyle=\exp\left(-\frac{y_{i}\tilde{\eta}_{i}}{\tilde{\phi}_{\beta_{t}}}+(\frac{1}{\tilde{\phi}_{t}}-1)\ln y_{i}-\frac{1}{\tilde{\phi}_{\beta_{t}}}\ln\frac{\tilde{\phi}_{\beta_{t}}}{\tilde{\eta}_{i}}-\ln\Gamma(\frac{1}{\tilde{\phi}_{t}})\right)\quad\text{setting, }\tilde{\eta}_{i}:=\frac{\eta_{i}\beta_{t}\tilde{\phi}_{t}}{\phi}
=𝒢(yi;η~i,ϕ~t)\displaystyle={\mathcal{G}}(y_{i};\tilde{\eta}_{i},\tilde{\phi}_{t})

Hence, to perform variance reduction, the following parameter update suffices (a numerical check of this identity is given below):

η~i=ηiβtϕ~tϕ=ηiβtϕ+βtϕβt;ϕ~βt=ϕβtϕβt+ϕ\displaystyle\tilde{\eta}_{i}=\frac{\eta_{i}\beta_{t}\tilde{\phi}_{t}}{\phi}=\eta_{i}\frac{\beta_{t}}{\phi+\beta_{t}-\phi\beta_{t}};\quad\tilde{\phi}_{\beta_{t}}=\frac{\phi}{\beta_{t}-\phi\beta_{t}+\phi}
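The algebra above can be sanity-checked numerically: raising the Gamma density to the power \beta_{t} and renormalizing should again yield a Gamma density with the tilted parameters (\tilde{\eta}_{i},\tilde{\phi}_{\beta_{t}}). The snippet below (an illustrative check using the parametrization of the density given above; it is not code from the SVAM repository) verifies that the ratio of the two unnormalized densities is constant in y.

```python
# Numerical check that G(y; eta, phi)^beta is proportional to G(y; eta_t, phi_t)
# with eta_t, phi_t given by the variance-reduction update above.
import numpy as np
from scipy.stats import gamma

eta, phi, beta = 1.7, 0.5, 3.0            # arbitrary test values with 0 < phi < 1

def g(y, eta_, phi_):
    """Gamma density in the (eta, phi) parametrization: shape 1/phi, scale phi/eta."""
    return gamma.pdf(y, a=1.0 / phi_, scale=phi_ / eta_)

phi_t = phi / (beta - phi * beta + phi)   # 1/phi_t = beta/phi - beta + 1
eta_t = eta * beta * phi_t / phi          # tilted natural parameter

ys = np.linspace(0.1, 5.0, 50)
ratio = g(ys, eta, phi) ** beta / g(ys, eta_t, phi_t)
# The ratio is constant in y; it equals the normalization constant exp(W).
assert np.allclose(ratio, ratio[0])
print("constant ratio:", ratio[0])
```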

We will give the proof for the setting where clean points suffer no noise. Let (\hat{\eta},\hat{\phi}) be the no-noise parameters. Assuming 0<\phi<1, we have \hat{\eta}_{i}=\lim\limits_{\beta\rightarrow\infty}\frac{\eta_{i}\beta}{\phi+\beta-\phi\beta}=\frac{\eta_{i}}{1-\phi}=\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{1-\phi} and \hat{\phi}=\lim\limits_{\beta\rightarrow\infty}\frac{\phi}{\phi+\beta-\phi\beta}=0. This gives

mode(𝒢(yi;η^i,ϕ^))=1ϕ^η^i=1ϕηi=mode(𝒢(yi;ηi,ϕ))\displaystyle mode({\mathcal{G}}(y_{i};\hat{\eta}_{i},\hat{\phi}))=\frac{1-\hat{\phi}}{\hat{\eta}_{i}}=\frac{1-\phi}{\eta_{i}}=mode({\mathcal{G}}(y_{i};\eta_{i},\phi))

Let b_{i}\in[0,\infty) denote the multiplicative corruption, with uncorrupted points having b_{i}=1.
We have,

yi=mode(𝒢(yi;η^i,ϕ^))bi=biexp(𝐰,𝐱i)(1ϕ)\displaystyle y_{i}=mode({\mathcal{G}}(y_{i};\hat{\eta}_{i},\hat{\phi}))\cdot b_{i}=b_{i}\exp(-\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)(1-\phi)

Let c:=\frac{1}{\phi}-1 and \Delta^{t}:=\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}. Then the weights take the form

si\displaystyle s_{i} =𝒢(yi;η~i,ϕ~t)=exp(𝐰,𝐱i)(1ϕ)Γ(1ϕ~t)(bicβtexp(𝚫t,𝐱i))1ϕ~texp(bicβtexp(𝚫t,𝐱i))\displaystyle={\mathcal{G}}(y_{i};\tilde{\eta}_{i},\tilde{\phi}_{t})=\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{t}})}\left(b_{i}c\beta_{t}\exp(\left\langle{\text{\boldmath$\mathbf{\Delta}$}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)\right)^{\frac{1}{\tilde{\phi}_{t}}}\exp(-b_{i}c\beta_{t}\exp(\left\langle{\text{\boldmath$\mathbf{\Delta}$}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle))

We may write,

\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}\,|\,\hat{{\mathbf{w}}}^{t})=-\log\prod\limits_{i=1}^{n}{\mathcal{G}}(y_{i};\frac{\eta_{i}}{1-\phi},\hat{\phi})^{s_{i}}=\sum\limits_{i=1}^{n}s_{i}\left(\frac{b_{i}\exp(\left\langle{{{\mathbf{w}}}-{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)-\left\langle{{{\mathbf{w}}}},{{{\mathbf{x}}}_{i}}\right\rangle+\ln(1-\phi)}{\hat{\phi}}-(\frac{1}{\hat{\phi}}-1)\ln y_{i}+\frac{1}{\hat{\phi}}\ln\hat{\phi}+\ln\Gamma(\frac{1}{\hat{\phi}})\right)

so that

\nabla_{{{\mathbf{w}}}}\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}\,|\,\hat{{\mathbf{w}}}^{t})=\frac{1}{\hat{\phi}}\sum\limits_{i=1}^{n}\left(b_{i}\exp(\left\langle{{{\mathbf{w}}}-{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)-1\right)s_{i}{{\mathbf{x}}}_{i}
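In implementation terms, the weighted update suggested by this derivation might look roughly as follows. This is only a minimal sketch and not the repository's implementation: it computes the variance-altered weights s_{i} from the tilted Gamma density and then takes a single gradient step on the s-weighted Gamma negative log-likelihood written in terms of the observed labels y_{i}; the step size, the normalization by the total weight, and the use of a gradient step in place of an exact weighted MLE are illustrative assumptions.

```python
# Sketch of one SVAM-Gamma style update (illustrative assumptions noted above).
import numpy as np
from scipy.stats import gamma

def svam_gamma_step(X, y, w_hat, beta_t, phi=0.5, lr=0.1):
    """X: (n, d) features, y: (n,) positive labels, w_hat: (d,) current estimate."""
    phi_t = phi / (beta_t - phi * beta_t + phi)   # tilted dispersion phi~_{beta_t}
    eta = np.exp(X @ w_hat)                       # eta_i = exp(<w_hat, x_i>)
    eta_t = eta * beta_t * phi_t / phi            # tilted natural parameter eta~_i
    # Weights s_i = G(y_i; eta~_i, phi~_{beta_t}): Gamma with shape 1/phi~, scale phi~/eta~.
    s = gamma.pdf(y, a=1.0 / phi_t, scale=phi_t / eta_t)
    # Gradient at w_hat of sum_i s_i * (y_i * exp(<w, x_i>) - <w, x_i>) / phi,
    # the s-weighted Gamma negative log-likelihood (up to terms independent of w).
    grad = X.T @ (s * (y * eta - 1.0)) / phi
    return w_hat - lr * grad / max(s.sum(), 1e-12)   # weight-normalized step (illustrative)
```

A full SVAM iteration would additionally update the scale as \beta_{t+1}=\xi\beta_{t} for some scale increment \xi>1, as in Theorem 27 below.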

Theorem 27 (Theorem 4 restated).

For data generated in the robust gamma regression model as described in §4, suppose corruptions are introduced by a partially adaptive adversary, i.e. the locations of the corruptions (the set B) are not decided adversarially but the corruptions themselves are decided jointly, adversarially, and may be unbounded. Then SVAM-Gamma enjoys a breakdown point of \frac{0.002}{\sqrt{d}}, i.e. it ensures a bounded {\cal O}\left({1}\right) error even if k=\alpha\cdot n corruptions are introduced, where the value of \alpha can go up to at least \frac{0.002}{\sqrt{d}}. More generally, for corruption rates \alpha\leq\frac{0.002}{\sqrt{d}}, there always exist values of the scale increment \xi>1 such that, with probability at least 1-\exp(-\Omega\left({d}\right)), the LWSC/LWLC conditions are satisfied for the \tilde{Q}_{\beta} function corresponding to the robust gamma regression model for \beta values at least as large as \beta_{\max}={\cal O}\left({1/\left({\exp\left({{\cal O}\left({\alpha\sqrt{d}}\right)}\right)-1}\right)}\right). Specifically, if initialized at \hat{{\mathbf{w}}}^{1},\beta_{1}\geq 1 such that \beta_{1}\cdot\left({\exp\left({\left\|{\hat{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}}\right)-1}\right)^{2}\leq\frac{\phi}{1-\phi}, then for any \epsilon\geq{\cal O}\left({\alpha\sqrt{d}}\right), SVAM-Gamma assures

𝐰^T𝐰2ϵ\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\epsilon

within T𝒪(log1ϵ)T\leq{\cal O}\left({{\log\frac{1}{\epsilon}}}\right) iterations.

Proof.

We first outline the proof below. We note that Theorem 4 is obtained by setting ϕ=0.5\phi=0.5 in the above statement.

Proof Outline.

The proof outline is similar to the one followed for robust least squares regression and robust mean estimation in Theorems 14 and 22, but adapted to suit the alternate parametrization and invariant used by SVAM-Gamma for gamma regression. Lemmata 28, 29 below establish the LWSC and LWLC properties with high probability for \beta_{t}\geq 1. Let \Delta^{t}=\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}. Unlike in mean estimation and robust regression, where we maintained the invariant \beta_{t}\cdot\left\|{\Delta^{t}}\right\|_{2}^{2}\leq 1, in this setting we will instead maintain the invariant c\beta_{t}(\exp(\left\|{\Delta^{t}}\right\|)-1)\leq\frac{\phi}{1-\phi}. This is because gamma regression uses a non-standard canonical parameter to enable support only over non-negative labels. Note however that this altered invariant still ensures that as \beta_{t}\rightarrow\infty, we also have \left\|{\Delta^{t}}\right\|_{2}\rightarrow 0. Since we will always use \beta_{t}\geq 1 (since we set \beta_{1}=1), we correspondingly require \left\|{\Delta^{1}}\right\|_{2}\leq\ln(\frac{1}{c}+1) during initialization, as mentioned in the statement of Theorem 4, where c=\frac{1}{\phi}-1. Since we will analyze the special case \phi=0.5 for the sake of simplicity, we get c=1.

Lemmata 28, 29 establish the LWSC/LWLC properties for the \tilde{Q}_{\beta} function for robust gamma regression. Given the above proof outline, applying Lemmata 28, 29 gives us (taking \left\|{{{\mathbf{w}}}^{\ast}}\right\|_{2}=1 and \left\|{{{\mathbf{w}}}}\right\|_{2}\leq R)

\left\|{\Delta^{t+1}}\right\|_{2}\leq\frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}\leq\frac{2\frac{m(\beta_{t})B}{\hat{\phi}}\sqrt{\frac{2}{d}}}{(1-\zeta)\mu_{c}}
2Bϕ^2d1+cβt(cβt)2exp(1)(1ϕ)Γ(1ϕ~βt)(cβt+2)cβt+2exp(cβt2)(1ζ)exp(1ϕ~βtln(1ϕ~βt1))ϕ^(1ϕ)Γ(1ϕ~βt)Gdexp(R(cβt+1)ln(1+1cβt)1cβt)\displaystyle\leq\frac{\frac{2B}{\hat{\phi}}\sqrt{\frac{2}{d}}\frac{1+c\beta_{t}}{(c\beta_{t})^{2}}\frac{\exp(1)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(c\beta_{t}+2\right)^{c\beta_{t}+2}\exp(-c\beta_{t}-2)}{(1-\zeta)\frac{\exp(\frac{1}{\tilde{\phi}_{\beta_{t}}}\ln(\frac{1}{\tilde{\phi}_{\beta_{t}}}-1))}{\hat{\phi}(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\frac{G}{d}\exp\left(-R-(c\beta_{t}+1)\ln(1+\frac{1}{c\beta_{t}})-1-c\beta_{t}\right)}
22dBG(1ζ)(1+2cβt)2exp(R+3)\displaystyle\leq\frac{2\sqrt{2d}B}{G(1-\zeta)}(1+\frac{2}{c\beta_{t}})^{2}\exp(R+3)

For the breakdown point calculation, we set \zeta=0.5 and R=1, and require

22dBG(1ζ)(1+2cβt)2exp(R+3)ln(1ξβt+1)\displaystyle\frac{2\sqrt{2d}B}{G(1-\zeta)}(1+\frac{2}{c\beta_{t}})^{2}\exp(R+3)\leq\ln(\frac{1}{\xi\beta_{t}}+1)

to get \alpha\leq\frac{0.002}{\sqrt{d}}, which finishes the proof. ∎

Lemma 28 (LWSC for Robust Gamma Regression).

For any 0βtn0\leq\beta_{t}\leq n, the Q~β\tilde{Q}_{\beta}-function for gamma regression satisfies the LWSC property with constant λβ(1ζ)μc\lambda_{\beta}\geq(1-\zeta)\mu_{c} with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) where μc\mu_{c} is defined in the proof.

Proof.

Since \frac{1}{\tilde{\phi}_{\beta_{t}}}=c\beta_{t}+1, it is easy to see that at the good points we have

2Q~G(𝐰|𝐰^t)\displaystyle\nabla^{2}\tilde{Q}_{G}({{\mathbf{w}}}\,|\,\hat{{\mathbf{w}}}^{t}) =1ϕ^iGexp(𝐰𝐰,𝐱i)si𝐱i𝐱i\displaystyle=\frac{1}{\hat{\phi}}\sum\limits_{i\in G}\exp(\left\langle{{{\mathbf{w}}}-{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)s_{i}{{\mathbf{x}}}_{i}{{\mathbf{x}}}_{i}^{\top}
=κ(ϕ,ϕ^,ϕ~βt)iGexp(𝐰,𝐱i(1ϕ~βt1)exp(Δt,𝐱i)+1ϕ~βtΔt,𝐱i)𝐱i𝐱i,\displaystyle=\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\sum\limits_{i\in G}\exp\left(\left\langle{{{\mathbf{w}}}},{{{\mathbf{x}}}_{i}}\right\rangle-(\frac{1}{\tilde{\phi}_{\beta_{t}}}-1)\exp(\left\langle{\Delta^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)+\frac{1}{\tilde{\phi}_{\beta_{t}}}\left\langle{\Delta^{t}},{{{\mathbf{x}}}_{i}}\right\rangle\right){{\mathbf{x}}}_{i}{{\mathbf{x}}}_{i}^{\top},

where κ(ϕ,ϕ^,ϕ~βt)=exp(1ϕ~βtln(1ϕ~βt1))ϕ^(1ϕ)Γ(1ϕ~βt)\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})=\frac{\exp(\frac{1}{\tilde{\phi}_{\beta_{t}}}\ln(\frac{1}{\tilde{\phi}_{\beta_{t}}}-1))}{\hat{\phi}(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}.

Let us write,

φ(𝐱1,..,𝐱i,..𝐱n):=𝐯𝐰2Q~G(𝐰|𝐰^t)𝐯=κ(ϕ,ϕ^,ϕ~βt)[iGg(𝐱i,𝐰,Δt)𝐯,𝐱i2]\displaystyle\varphi({{\mathbf{x}}}_{1},..,{{\mathbf{x}}}_{i},..{{\mathbf{x}}}_{n}):={{\mathbf{v}}}^{\top}\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}){{\mathbf{v}}}=\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\left[\sum\limits_{i\in G}g({{\mathbf{x}}}_{i},{{\mathbf{w}}},\Delta^{t})\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}_{i}}\right\rangle^{2}\right]

with,

g(𝐱i,𝐰,Δt):=\displaystyle g({{\mathbf{x}}}_{i},{{\mathbf{w}}},\Delta^{t}):= exp(𝐰,𝐱i(1ϕ~βt1)exp(Δt,𝐱i)+1ϕ~βtΔt,𝐱i)\displaystyle\exp\left(\left\langle{{{\mathbf{w}}}},{{{\mathbf{x}}}_{i}}\right\rangle-(\frac{1}{\tilde{\phi}_{\beta_{t}}}-1)\exp(\left\langle{\Delta^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)+\frac{1}{\tilde{\phi}_{\beta_{t}}}\left\langle{\Delta^{t}},{{{\mathbf{x}}}_{i}}\right\rangle\right)
\displaystyle\geq exp(R(cβt+1)Δtcβtexp(Δt))using, Δt,𝐱iΔt\displaystyle\exp\left(-R-(c\beta_{t}+1)\left\|{\Delta^{t}}\right\|-c\beta_{t}\exp(\left\|{\Delta^{t}}\right\|)\right)\quad\text{using, }\left\langle{\Delta^{t}},{{{\mathbf{x}}}_{i}}\right\rangle\geq-\left\|{\Delta^{t}}\right\|
\displaystyle\geq exp(R(cβt+1)ln(1+1cβt)1cβt)=:gmin\displaystyle\exp\left(-R-(c\beta_{t}+1)\ln(1+\frac{1}{c\beta_{t}})-1-c\beta_{t}\right)=:g_{min}

and,

g(𝐱i,𝐰,Δt)\displaystyle g({{\mathbf{x}}}_{i},{{\mathbf{w}}},\Delta^{t})\leq exp(R+(cβt+1)Δtcβtexp(Δt))\displaystyle\exp\left(R+(c\beta_{t}+1)\left\|{\Delta^{t}}\right\|-c\beta_{t}\exp(-\left\|{\Delta_{t}}\right\|)\right)
\displaystyle\leq exp(R+(cβt+1)ln(1+1cβt)(cβt)21+cβt)=:gmax\displaystyle\exp\left(R+(c\beta_{t}+1)\ln(1+\frac{1}{c\beta_{t}})-\frac{(c\beta_{t})^{2}}{1+c\beta_{t}}\right)=:g_{\max}

We have,

μ:=𝔼𝐱i𝒮d1[φ(𝐱1,..,𝐱i,..𝐱n)]Gκ(ϕ,ϕ^,ϕ~βt)gmin𝔼𝐱i𝒮d1[𝐯,𝐱i2]=Gκ(ϕ,ϕ^,ϕ~βt)gmind=:μc\displaystyle\mu:=\underset{{{\mathbf{x}}}_{i}\sim{\cal S}^{d-1}}{{\mathbb{E}}}\left[{{\varphi({{\mathbf{x}}}_{1},..,{{\mathbf{x}}}_{i},..{{\mathbf{x}}}_{n})}}\right]\geq G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\min}\underset{{{\mathbf{x}}}_{i}\sim{\cal S}^{d-1}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}_{i}}\right\rangle^{2}}}\right]=\frac{G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\min}}{d}=:\mu_{c}
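The last step uses {\mathbb{E}}_{{{\mathbf{x}}}\sim{\cal S}^{d-1}}\left[{\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}}\right\rangle^{2}}\right]=\frac{1}{d} for any unit vector {{\mathbf{v}}}, which can be confirmed with a quick Monte Carlo simulation (illustrative only):

```python
# Monte Carlo check that E[<v, x>^2] = 1/d for x uniform on the unit sphere S^{d-1}.
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200_000
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # normalized Gaussians are uniform on S^{d-1}
v = np.zeros(d)
v[0] = 1.0                                      # any fixed unit vector works by symmetry
print(np.mean((x @ v) ** 2), 1.0 / d)           # both approximately 0.0625
```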

To get a high-probability lower bound on the LWSC constant \lambda_{\beta}, we apply McDiarmid's inequality. Observing that

|φ(𝐱1,..,𝐱i,..,𝐱n)φ(𝐱1,..,𝐱i,..,𝐱n)|\displaystyle\left|{\varphi({{\mathbf{x}}}_{1},..,{{\mathbf{x}}}_{i},..,{{\mathbf{x}}}_{n})-\varphi({{\mathbf{x}}}_{1},..,{{\mathbf{x}}}_{i}^{\prime},..,{{\mathbf{x}}}_{n})}\right|
=κ(ϕ,ϕ^,ϕ~βt)|g(𝐱i,𝐰,Δt)𝐯,𝐱i2g(𝐱i,𝐰,Δt)𝐯,𝐱i2|\displaystyle=\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\left|g({{\mathbf{x}}}_{i},{{\mathbf{w}}},\Delta^{t})\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}_{i}}\right\rangle^{2}-g({{\mathbf{x}}}_{i}^{\prime},{{\mathbf{w}}},\Delta^{t})\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}_{i}^{\prime}}\right\rangle^{2}\right|
κ(ϕ,ϕ^,ϕ~βt)(g(𝐱i,𝐰,Δt)+g(𝐱i,𝐰,Δt))\displaystyle\leq\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})(g({{\mathbf{x}}}_{i},{{\mathbf{w}}},\Delta^{t})+g({{\mathbf{x}}}_{i}^{\prime},{{\mathbf{w}}},\Delta^{t}))
2κ(ϕ,ϕ^,ϕ~βt)gmax\displaystyle\leq 2\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}

we may write, for any fixed {{\mathbf{v}}}\in{\cal S}^{d-1} and {{\mathbf{w}}}\in{\mathbb{R}}^{d},

(|φ(𝐱1,..,𝐱i,..,𝐱n)μ|t)2exp(2t22Gκ(ϕ,ϕ^,ϕ~βt)gmax)\displaystyle{\mathbb{P}}(\left|{\varphi({{\mathbf{x}}}_{1},..,{{\mathbf{x}}}_{i},..,{{\mathbf{x}}}_{n})-\mu}\right|\geq t)\leq 2\exp\left(\frac{-2t^{2}}{2G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}}\right)

For any square symmetric matrix F\in\mathbb{R}^{d\times d}, we have \left\|{F}\right\|_{2}\leq\frac{1}{1-2\epsilon}\sup\limits_{{{\mathbf{v}}}\in{\cal N}_{\epsilon}}\left|{{{\mathbf{v}}}^{\top}F{{\mathbf{v}}}}\right|. Taking F=\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})-\mu I_{d} gives \left\|{\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})-\mu I_{d}}\right\|_{2}\leq 2\sup\limits_{{{\mathbf{v}}}\in{\cal N}_{\frac{1}{4}}}\left|{{{\mathbf{v}}}^{\top}\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}){{\mathbf{v}}}-\mu}\right|.

Taking a union bound over a \frac{1}{4}-net of the unit sphere S^{d-1} gives

(𝐰2Q~G(𝐰|𝐰^t)μId2t)\displaystyle{\mathbb{P}}\left(\left\|{\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})-\mu I_{d}}\right\|_{2}\geq t\right) (2sup𝐯𝒩14|𝐯𝐰2Q~G(𝐰|𝐰^t)𝐯μ|t)\displaystyle\leq{\mathbb{P}}\left(2\sup\limits_{{{\mathbf{v}}}\in{\cal N}_{\frac{1}{4}}}\left|{{{\mathbf{v}}}^{\top}\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}){{\mathbf{v}}}-\mu}\right|\geq t\right)
29dexp(t24Gκ(ϕ,ϕ^,ϕ~βt)gmax)\displaystyle\leq 2\cdot 9^{d}\exp\left(\frac{-t^{2}}{4G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}}\right)

Since, λβt:=λmin(𝐰2Q~(𝐰|𝐰^t))λmin(𝐰2Q~G(𝐰|𝐰^t))\lambda_{\beta_{t}}:=\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))\geq\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})) and μμc\mu\geq\mu_{c}, we have,

(λβtμct)(λmin(𝐰2Q~G(𝐰|𝐰^t))μct)(λmin(𝐰2Q~G(𝐰|𝐰^t))μt)(|λmin(𝐰2Q~G(𝐰|𝐰^t))μ|t)(𝐰2Q~G(𝐰|𝐰^t)μId2t)29dexp(t24Gκ(ϕ,ϕ^,ϕ~βt)gmax){\mathbb{P}}(\lambda_{\beta_{t}}-\mu_{c}\leq-t)\leq{\mathbb{P}}(\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))-\mu_{c}\leq-t)\leq{\mathbb{P}}(\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))-\mu\leq-t)\leq{\mathbb{P}}(\left|{\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))-\mu}\right|\geq t)\leq{\mathbb{P}}(\left\|{\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})-\mu I_{d}}\right\|_{2}\geq t)\leq 2\cdot 9^{d}\exp\left(\frac{-t^{2}}{4G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}}\right)

Putting t=\frac{\zeta\mu_{c}}{2} with 0<\zeta<1, for any fixed {{\mathbf{w}}}\in\mathbb{R}^{d} we have:

(λmin(𝐰2Q~G(𝐰|𝐰^t))(1ζ2)μc)29dexp(ζ2μc216Gκ(ϕ,ϕ^,ϕ~βt)gmax)\displaystyle{\mathbb{P}}\left(\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))\leq(1-\frac{\zeta}{2})\mu_{c}\right)\leq 2\cdot 9^{d}\exp\left(\frac{-\zeta^{2}\mu_{c}^{2}}{16G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}}\right)

In order to take a union bound over {{\mathbf{w}}}, let \tau=\left\|{\hat{{\mathbf{w}}}^{1}-\hat{{\mathbf{w}}}^{2}}\right\|_{2}; using \exp(x)\leq 1+2x for 0\leq x\leq 1, we observe

g(𝐱i,𝐰^1,Δt)g(𝐱i,𝐰^2,Δt)=g(𝐱i,𝐰^2,Δt)(exp(𝐰^1𝐰^2,𝐱i)1)2τg(𝐱i,𝐰^2,Δt)2τgmaxg({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{1},\Delta^{t})-g({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{2},\Delta^{t})=g({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{2},\Delta^{t})(\exp\left(\left\langle{\hat{{\mathbf{w}}}^{1}-\hat{{\mathbf{w}}}^{2}},{{{\mathbf{x}}}_{i}}\right\rangle\right)-1)\leq 2\tau g({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{2},\Delta^{t})\leq 2\tau g_{\max}

So that,

|λmin(𝐰^12Q~G(𝐰|𝐰^t))λmin(𝐰^22Q~G(𝐰|𝐰^t))|sup𝐯2=1|𝐯(𝐰^12Q~G(𝐰|𝐰^t)𝐰^22Q~G(𝐰|𝐰^t))𝐯|\displaystyle\left|{\lambda_{\min}(\nabla_{\hat{{\mathbf{w}}}^{1}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))-\lambda_{\min}(\nabla_{\hat{{\mathbf{w}}}^{2}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))}\right|\leq\sup_{\left\|{{{\mathbf{v}}}}\right\|_{2}=1}\left|{{{\mathbf{v}}}^{\top}(\nabla_{\hat{{\mathbf{w}}}^{1}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})-\nabla_{\hat{{\mathbf{w}}}^{2}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})){{\mathbf{v}}}}\right|
=κ(ϕ,ϕ^,ϕ~βt)sup𝐯2=1|[iG(g(𝐱i,𝐰^1,Δt)g(𝐱i,𝐰^2,Δt))𝐯,𝐱i2]|2κ(ϕ,ϕ^,ϕ~βt)τGgmax\displaystyle=\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\sup_{\left\|{{{\mathbf{v}}}}\right\|_{2}=1}\left|{\left[\sum\limits_{i\in G}\left(g({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{1},\Delta^{t})-g({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{2},\Delta^{t})\right)\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}_{i}}\right\rangle^{2}\right]}\right|\leq 2\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\tau Gg_{\max}

In order to set the \tau-net, we would require (1-\zeta)\mu_{c}\leq(1-\frac{\zeta}{2})\mu_{c}-2\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\tau Gg_{\max}, i.e.

τζμc4κ(ϕ,ϕ^,ϕ~βt)Ggmax=ζGκ(ϕ,ϕ^,ϕ~βt)dgmin4κ(ϕ,ϕ^,ϕ~βt)Ggmax=ζgmin4dgmax\displaystyle\tau\leq\frac{\zeta\mu_{c}}{4\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})Gg_{\max}}=\frac{\zeta\frac{G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})}{d}g_{\min}}{4\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})Gg_{\max}}=\frac{\zeta g_{\min}}{4dg_{\max}}

Taking the covering number of {\cal B}({{\mathbf{w}}}^{\ast},R), which is at most (\frac{3R}{\tau})^{d}=(\frac{12dRg_{\max}}{\zeta g_{\min}})^{d}, and observing

gmaxgmin(1+1cβt)2exp(2R+3+cβt1+cβt)\frac{g_{\max}}{g_{\min}}\leq(1+\frac{1}{c\beta_{t}})^{2}\exp\left(2R+3+\frac{c\beta_{t}}{1+c\beta_{t}}\right), κ(ϕ,ϕ^,ϕ~βt)gmin2gmaxβtϕϕ^(1+1cβt)3exp(3R5cβt1+cβt)\frac{\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\min}^{2}}{g_{\max}}\geq\frac{\beta_{t}}{\phi\hat{\phi}}(1+\frac{1}{c\beta_{t}})^{-3}\exp\left(-3R-5-\frac{c\beta_{t}}{1+c\beta_{t}}\right)

We may write,

(𝐰(0,R):λmin(𝐰2Q~G(𝐰|𝐰^t))(1ζ)μc)29d(12dRgmaxζgmin)dexp(ζ2μc216Gκ(ϕ,ϕ^,ϕ~βt)gmax)\displaystyle{\mathbb{P}}\left(\exists{{\mathbf{w}}}\in{\cal B}(0,R):\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))\leq(1-\zeta)\mu_{c}\right)\leq 2\cdot 9^{d}\cdot(\frac{12dRg_{\max}}{\zeta g_{\min}})^{d}\exp\left(\frac{-\zeta^{2}\mu_{c}^{2}}{16G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}}\right) =2exp(ζ2Gκ(ϕ,ϕ^,ϕ~βt)gmin216d2gmax+dln(108dRgmaxζgmin))=exp(Ω(ζ2Gβtd2ϕϕ^dlndζ))\displaystyle=2\cdot\exp\left(\frac{-\zeta^{2}G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\min}^{2}}{16d^{2}g_{\max}}+d\ln(\frac{108dRg_{\max}}{\zeta g_{\min}})\right)=\exp\left(-\Omega(\frac{\zeta^{2}G\beta_{t}}{d^{2}\phi\hat{\phi}}-\frac{d\ln d}{\zeta})\right)

This finishes the proof. ∎

Lemma 29 (LWLC for Robust Gamma Regression).

For any 0\leq\beta\leq n, the \tilde{Q}_{\beta}-function for robust gamma regression satisfies the LWLC property with constant \Lambda_{\beta}\leq\frac{m(\beta_{t})B}{\hat{\phi}}\sqrt{\frac{2}{d}} with probability at least 1-\exp(-\Omega\left({d}\right)), where m(\beta_{t}) is defined in the proof.

Proof.

It is easy to see that Q~βt(𝐰|𝐰)=1ϕ^iB(bi1)si𝐱i\nabla\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,{{\mathbf{w}}})=\frac{1}{{\hat{\phi}}}\sum\limits_{i\in B}(b_{i}-1)s_{i}{{\mathbf{x}}}_{i}. Since bi=1b_{i}=1 for good points in the no-noise setting we have assumed, good points do not contribute to the gradient at all. Thus, we get

Q~βt(𝐰|𝐰)2=1ϕ^XB𝐭21ϕ^XB2𝐭2,\left\|{\nabla\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,{{\mathbf{w}}})}\right\|_{2}=\frac{1}{\hat{\phi}}\left\|{X_{B}{{\mathbf{t}}}}\right\|_{2}\leq\frac{1}{\hat{\phi}}\left\|{X_{B}}\right\|_{2}\left\|{{{\mathbf{t}}}}\right\|_{2},

where {{\mathbf{t}}}=[t_{i}]_{i\in B} and t_{i}=(b_{i}-1)\cdot s_{i}. Now, since the {{\mathbf{x}}}_{i} lie on {\cal S}^{d-1} and the locations of the corruptions were chosen randomly without looking at the data, with probability at least 1-\exp(-\Omega\left({d}\right)) we have \left\|{X_{B}}\right\|_{2}\leq\sqrt{\frac{B}{d}}. Thus, we are left to bound \left\|{{{\mathbf{t}}}}\right\|_{2}. We have t_{i}^{2}=(b_{i}-1)^{2}s_{i}^{2}\leq(b_{i}^{2}+1)s_{i}^{2} since b_{i}\geq 0.

Let z_{i}:=c\beta_{t}\exp(\left\langle{\text{\boldmath$\mathbf{\Delta}$}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle); we then have \frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{z_{i}}=\frac{1}{c\beta_{t}}\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{\exp(\left\langle{\text{\boldmath$\mathbf{\Delta}$}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)}\leq\frac{1}{c\beta_{t}}\exp(\left\|{{{\mathbf{w}}}^{\ast}-\Delta^{t}}\right\|_{2})\leq\frac{1+c\beta_{t}}{(c\beta_{t})^{2}}\exp(1), so that

bisi\displaystyle b_{i}s_{i} =biexp(𝐰,𝐱i)(1ϕ)Γ(1ϕ~βt)(bizi)1ϕ~βtexp(bizi)=1ziexp(𝐰,𝐱i)(1ϕ)Γ(1ϕ~βt)(bizi)1ϕ~βt+1exp(bizi)\displaystyle=b_{i}\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(b_{i}z_{i}\right)^{\frac{1}{\tilde{\phi}_{\beta_{t}}}}\exp(-b_{i}z_{i})=\frac{1}{z_{i}}\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(b_{i}z_{i}\right)^{\frac{1}{\tilde{\phi}_{\beta_{t}}}+1}\exp(-b_{i}z_{i})
1ziexp(𝐰,𝐱i)(1ϕ)Γ(1ϕ~βt)(1ϕ~βt+1)1ϕ~βt+1exp(1ϕ~βt1)since, xaexp(x)aaexp(a) for, x>0\displaystyle\leq\frac{1}{z_{i}}\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(\frac{1}{\tilde{\phi}_{\beta_{t}}}+1\right)^{\frac{1}{\tilde{\phi}_{\beta_{t}}}+1}\exp(-\frac{1}{\tilde{\phi}_{\beta_{t}}}-1)\quad\text{since, }x^{a}\exp(-x)\leq a^{a}\exp(-a)\text{ for, }x>0
1+cβt(cβt)2exp(1)(1ϕ)Γ(1ϕ~βt)(cβt+2)cβt+2exp(cβt2)=:m(βt)\displaystyle\leq\frac{1+c\beta_{t}}{(c\beta_{t})^{2}}\frac{\exp(1)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(c\beta_{t}+2\right)^{c\beta_{t}+2}\exp(-c\beta_{t}-2)=:m(\beta_{t})

Also, using \exp(\left\langle{\hat{{\mathbf{w}}}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)\leq\exp(\left\|{\hat{{\mathbf{w}}}^{t}}\right\|_{2})\leq\exp(\left\|{{{\mathbf{w}}}^{\ast}+\Delta^{t}}\right\|_{2})\leq(1+\frac{1}{c\beta_{t}})\exp(1) and the fact that the Gamma density is maximized at its mode,

s_{i}=f(y_{i};\tilde{\eta}_{i},\tilde{\phi}_{\beta_{t}})\leq f(\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\eta}_{i}};\tilde{\eta}_{i},\tilde{\phi}_{\beta_{t}})=\frac{1}{\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\eta}_{i}}\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\eta}_{i}}\frac{\tilde{\eta}_{i}}{\tilde{\phi}_{\beta_{t}}}\right)^{\frac{1}{\tilde{\phi}_{\beta_{t}}}}\exp(-\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\eta}_{i}}\frac{\tilde{\eta}_{i}}{\tilde{\phi}_{\beta_{t}}})
=exp(𝐰^t,𝐱i)(1ϕ)Γ(1ϕ~βt)(1ϕ~βtϕ~βt)1ϕ~βtexp(1ϕ~βtϕ~βt)\displaystyle=\frac{\exp(\left\langle{\hat{{\mathbf{w}}}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\phi}_{\beta_{t}}}\right)^{\frac{1}{\tilde{\phi}_{\beta_{t}}}}\exp(-\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\phi}_{\beta_{t}}})
\leq(1+\frac{1}{c\beta_{t}})\frac{\exp(1)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(c\beta_{t}\right)^{c\beta_{t}+1}\exp(-c\beta_{t})
m(βt)\displaystyle\leq m(\beta_{t})

which gives us

Λβt1ϕ^XB2𝐭21ϕ^XB2iB(bisi)2+si2m(βt)Bϕ^2d\displaystyle\Lambda_{\beta_{t}}\leq\frac{1}{\hat{\phi}}\left\|{X_{B}}\right\|_{2}\left\|{{{\mathbf{t}}}}\right\|_{2}\leq\frac{1}{\hat{\phi}}\left\|{X_{B}}\right\|_{2}\sqrt{\sum\limits_{i\in B}(b_{i}s_{i})^{2}+s_{i}^{2}}\leq\frac{m(\beta_{t})B}{\hat{\phi}}\sqrt{\frac{2}{d}}