
Corruption-tolerant Algorithms for Generalized Linear Models

Bhaskar Mukhoty (Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; work done while the author was a student at IIT Kanpur)    Debojyoti Dey (Indian Institute of Technology Kanpur, Uttar Pradesh, India)    Purushottam Kar (Indian Institute of Technology Kanpur, Uttar Pradesh, India)
[email protected], {debojyot,purushot}@cse.iitk.ac.in
Abstract

This paper presents SVAM (Sequential Variance-Altered MLE), a unified framework for learning generalized linear models under adversarial label corruption in training data. SVAM extends to tasks such as least squares regression, logistic regression, and gamma regression, whereas many existing works on learning with label corruptions focus only on least squares regression. SVAM is based on a novel variance reduction technique that may be of independent interest and works by iteratively solving weighted MLEs over variance-altered versions of the GLM objective. SVAM offers provable model recovery guarantees superior to the state-of-the-art for robust regression even when a constant fraction of training labels are adversarially corrupted. SVAM also empirically outperforms several existing problem-specific techniques for robust regression and classification. Code for SVAM is available at https://github.com/purushottamkar/svam/

1 Introduction

Generalized linear models (GLMs) [17] are effective models for a variety of discrete and continuous label spaces, allowing the prediction of binary or count-valued labels (logistic, Poisson regression) as well as real-valued labels (gamma, least-squares regression). Inference in a GLM involves two steps: given a feature vector $\mathbf{x} \in \mathbb{R}^d$ and model parameters $\mathbf{w}^\ast$, a canonical parameter is generated as $\theta := \langle \mathbf{w}^\ast, \mathbf{x} \rangle$, then the label $y$ is sampled from the exponential family distribution

$$\mathbb{P}[y \,|\, \theta] = \exp(y \cdot \theta - \psi(\theta) - h(y)),$$

where the function $h(\cdot)$ is specific to the GLM and $\psi(\cdot)$ is a normalization term, also known as the log-partition function. It is common to use a non-canonical link such as $\theta := \exp(\langle \mathbf{w}^\ast, \mathbf{x} \rangle)$ for the gamma distribution. GLMs also admit vector-valued labels $\mathbf{y} \in \mathbb{R}^n$ by replacing the scalar product with the inner product $\langle \mathbf{y}, \boldsymbol{\eta} \rangle$, where $\boldsymbol{\eta} := \mathbf{X}\mathbf{w}^\ast$ is the canonical parameter and $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the covariate matrix.
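To make the data model concrete, the following is a minimal numpy sketch (not the paper's released code; all names and the chosen parameter values are illustrative) of how labels could be sampled in the three GLMs considered here, using the canonical link for least-squares and logistic regression and the non-canonical link $\eta = \exp(\langle \mathbf{w}^\ast, \mathbf{x} \rangle)$ for gamma regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))           # covariates
w_star = rng.standard_normal(d)           # true model
w_star /= np.linalg.norm(w_star)

# Least-squares (Gaussian) GLM: y ~ N(<w*, x>, 1/beta*)
beta_star = 25.0
y_gauss = X @ w_star + rng.normal(0.0, 1.0 / np.sqrt(beta_star), size=n)

# Logistic GLM: P[y = +1 | x] = 1 / (1 + exp(-<w*, x>))
p = 1.0 / (1.0 + np.exp(-(X @ w_star)))
y_logit = np.where(rng.random(n) < p, 1, -1)

# Gamma GLM with the non-canonical link eta = exp(<w*, x>) and shape 1/phi
phi = 0.5
eta = np.exp(X @ w_star)
y_gamma = rng.gamma(shape=1.0 / phi, scale=phi / eta)   # rate eta/phi, mean 1/eta
```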

Problem Description: Given data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$ generated using a known GLM but unknown model parameters $\mathbf{w}^\ast$, statistically efficient techniques exist to recover a consistent estimate of the model $\mathbf{w}^\ast$ [14]. However, these techniques break down if several observed labels $y_i$ are corrupted, not just by random statistical noise but by adversarially generated structured noise. Suppose $k < n$ labels are corrupted, i.e. for some $k$ data points $i_1, \ldots, i_k$, the actual labels $y_{i_j}, j = 1, \ldots, k$ generated by the GLM are replaced by the adversary with corrupted ones, say $\tilde{y}_{i_j}$. Can we still recover $\mathbf{w}^\ast$? Note that the learning algorithm is unaware of which points are corrupted.

Breakdown Point: The largest fraction $\alpha = k/n$ of corruptions that a learning algorithm can tolerate while still offering an estimate of $\mathbf{w}^\ast$ with bounded error is known as its breakdown point. This paper proposes the SVAM algorithm that can tolerate $k = \Omega(n)$ corruptions, i.e. $\alpha = \Omega(1)$.

Adversary Models: Contamination of the training labels $y_1, \ldots, y_n$ by an adversary can misguide the learning algorithm into selecting model parameters of the adversary's choice. An adversary has to choose (1) which labels $i_1, \ldots, i_k$ to corrupt and (2) what corrupted labels $\tilde{y}_{i_1}, \ldots, \tilde{y}_{i_k}$ to put there. Adversary models emerge based on what information the adversary can consult while making these choices. The oblivious adversary must make both choices with no access to the original data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$ or the true model $\mathbf{w}^\ast$ and thus can only corrupt a random/fixed subset of $k$ labels by sampling $\tilde{y}_{i_j}$ from some predetermined noise distribution. This is also known as the Huber noise model. On the other hand, a fully adaptive adversary has full access to the original data and true model while making both choices. Finally, the partially adaptive adversary must choose the corruption locations without knowledge of the original data or true model but has full access to these while deciding the corrupted labels. See Appendix B for details.

Contributions: This paper describes the SVAM (Sequential Variance-Altered MLE) framework that offers:
1. robust estimation with a breakdown point $\alpha = \Omega(1)$ against partially and fully adaptive adversaries for robust least-squares regression and mean estimation, and $\alpha = \Omega(1/\sqrt{d})$ for robust gamma regression. Prior works do not offer any breakdown point for gamma regression.
2. exact recovery of the true model $\mathbf{w}^\ast$ against a fully adaptive adversary for the case of least-squares regression,
3. the use of a variance reduction technique (see §3.1) in robust learning, which is novel to the best of our knowledge,
4. extensive empirical evaluation demonstrating that despite being a generic framework, SVAM is competitive with or outperforms algorithms specifically designed to solve problems such as least-squares and logistic regression.

2 Related Works

In the interest of space, we review aspects of literature most related to SVAM and refer to others [6, 15] for a detailed review.

Robust GLM learning has been studied in a variety of settings. [2] considered an oblivious adversary (Huber's noise model) but offered a breakdown point of $\alpha = \mathcal{O}(\frac{1}{\sqrt{n}})$, i.e. it tolerates $k \leq \mathcal{O}(\sqrt{n})$ corruptions. [25] solve robust GLM estimation by solving M-estimation problems. However, they require the magnitude of the corruptions to be upper-bounded by some constant, i.e. $|y_i - \tilde{y}_i| \leq \mathcal{O}(1)$, and offer a breakdown point of $\alpha = \mathcal{O}(\frac{1}{\sqrt{n}})$. Moreover, their approach solves $L_1$-regularized problems using projected gradient descent, which converges slowly. In contrast, SVAM offers a linear rate of convergence, offers a breakdown point of $\alpha = \Omega(1)$, i.e. it tolerates $k = \Omega(n)$ corruptions, and can tolerate corruptions of unbounded magnitude introduced by a partially or fully adaptive adversary.

Specific GLMs such as robust regression have received focused attention. Here the model is $\mathbf{y} = X\mathbf{w}^\ast + \mathbf{b}$, where $X \in \mathbb{R}^{n \times d}$ is the feature matrix and $\mathbf{b}$ is a $k$-sparse corruption vector denoting the adversarial corruptions. A variant studies a hybrid noise model that replaces the zero entries of $\mathbf{b}$ with Gaussian noise $\mathcal{N}(0, \sigma^2)$. [18, 24] solve an $L_1$ minimization problem which is slow in practice (see §5). [1] use hard thresholding techniques to estimate the subset of uncorrupted points while [15] modify the IRLS algorithm to do so. However, [1, 15] are unable to offer consistent model estimates in the hybrid noise model even if the corruption rate $\alpha = k/n \rightarrow 0$, which is surprising since $\alpha \rightarrow 0$ implies vanishing corruption. In contrast, SVAM offers consistent model recovery in the hybrid noise model against a fully adaptive adversary when $\alpha \rightarrow 0$. [21] also offer consistent recovery with breakdown points $\alpha > 0.5$ but assume an oblivious adversary.

Robust classification with $y_i \in \{-1, +1\}$ has been explored using robust surrogate loss functions [16] and ranking [8, 19] techniques. These works do not offer breakdown points but offer empirical comparisons.

Robust mean estimation entails recovering an estimate $\hat{\boldsymbol{\mu}} \in \mathbb{R}^d$ of the mean $\boldsymbol{\mu}^\ast$ of a multivariate Gaussian $\mathcal{N}(\boldsymbol{\mu}^\ast, \Sigma)$ given $n$ samples, of which an $\alpha$ fraction are corrupted [11]. The estimation error is known to be lower bounded as $\|\hat{\boldsymbol{\mu}} - \boldsymbol{\mu}^\ast\|_2 \geq \Omega(\alpha\sqrt{\log\frac{1}{\alpha}})$ for this problem even if $n \to \infty$ [5]. [6] use convex programming techniques and offer $\mathcal{O}(\alpha\log^{3/2}\frac{1}{\alpha})$ error given $n \geq \tilde{\Omega}(\frac{d^2}{\alpha^2})$ samples and a $\text{poly}(n, d, \frac{1}{\alpha})$ runtime. [3] improve the running time to $\frac{\tilde{\mathcal{O}}(nd)}{\text{poly}(\alpha)}$. The recent work of [4] uses an IRLS-style approach that internally relies on expensive SDP calls but offers high breakdown points. SVAM uses $n = \mathcal{O}(\log^2\frac{1}{\alpha})$ samples and offers a recovery error of $\mathcal{O}(\text{trace}(\Sigma)(\log\frac{1}{\alpha})^{-1/2})$. This is comparable to existing works if $\text{trace}(\Sigma) = \mathcal{O}(1)$. Moreover, SVAM is much faster and simpler to implement in practice.

Meta-algorithms such as robust gradient techniques, median-of-means [12], tilted ERM [13], and the maximum correntropy criterion [9] have been studied. SEVER [7] uses the gradient covariance matrix to filter out outliers along its largest eigenspace while RGD [20] uses robust gradient estimates to perform robust first-order optimization directly. While convenient to execute, these methods may require larger training sets, e.g., SEVER requires $n > d^5$ samples for robust least-squares regression whereas SVAM requires $n > \Omega(d\log(d))$. In terms of recovery guarantees, for least-squares regression without Gaussian noise, SVAM and other methods [1, 15] offer exact recovery of $\mathbf{w}^\ast$ so long as the fraction of corrupted points is less than the breakdown point, while SEVER's error continues to be bounded away from zero. RGD only considers an oblivious/Huber adversary while SVAM can tolerate partially/fully adaptive adversaries. SEVER does not report an explicit breakdown point, RGD offers a breakdown point of $\alpha = 1/\log d$ (see Thm 2 in their paper), while SVAM offers an explicit breakdown point independent of $d$. SVAM also offers faster convergence than existing methods such as SEVER and RGD.

3 The SVAM Algorithm

A popular approach in robust learning is to assign weights to data points, hoping that large weights are given to uncorrupted points and low weights to corrupted ones, followed by weighted likelihood maximization. Often the weights are updated and the process repeated. [2] use Huber-style weighting functions used in Mallows-type M-estimators, [15] use truncated inverse residuals, and [22] use Mahalanobis distance-based weights.

SVAM notes that the label likelihood offers a natural measure of how likely a point is to be uncorrupted. Given a model estimate $\hat{\mathbf{w}}^t$ at iteration $t$, the weight $s_i = \mathbb{P}[y_i \,|\, \eta_i^t] = \exp(y_i \cdot \eta_i^t - \psi(\eta_i^t) - h(y_i))$ can be assigned to the $i$-th point, where $\eta_i^t = \langle \hat{\mathbf{w}}^t, \mathbf{x}^i \rangle$ (recall that for gamma/Poisson regression we need to set $\eta_i^t = \exp(\langle \hat{\mathbf{w}}^t, \mathbf{x}^i \rangle)$ given the non-canonical link for these problems). This gives us the weighted MLE $\tilde{Q}(\mathbf{w} \,|\, \hat{\mathbf{w}}^t) = -\sum_{i=1}^n s_i \cdot \log\mathbb{P}[y_i \,|\, \langle \mathbf{w}, \mathbf{x}^i \rangle]$, solving which gives us the next model iterate as

$$\hat{\mathbf{w}}^{t+1} = \arg\min_{\mathbf{w} \in \mathbb{R}^d} \tilde{Q}(\mathbf{w} \,|\, \hat{\mathbf{w}}^t) \qquad (1)$$

However, as §5 will show, this strategy does not perform well. If the initial model $\hat{\mathbf{w}}^1$ is far from $\mathbf{w}^\ast$, it may result in imprecise weights $s_i$ that are large for the corrupted points. For example, if the adversary introduces corruptions using a different model $\tilde{\mathbf{w}}$, i.e. $\tilde{y}_{i_j} \sim \mathbb{P}[y \,|\, \langle \tilde{\mathbf{w}}, \mathbf{x}^{i_j} \rangle], j \in [k]$, and we happen to initialize close to $\tilde{\mathbf{w}}$, i.e. $\hat{\mathbf{w}}^1 \approx \tilde{\mathbf{w}}$, then it is the corrupted points that get large weights initially, which may cause the algorithm to converge to $\tilde{\mathbf{w}}$ itself.

Key Idea: It is thus better to avoid drastic decisions, say setting $s_i \gg 0$ in the initial stages, no matter how clean a data point appears to be. SVAM implements this intuition by setting weights using a label likelihood distribution with very large variance initially. This ensures that no data point (not even the uncorrupted ones) gets a large weight (cf. the uniform distribution, which has large variance and assigns no point a high density). As SVAM progresses towards $\mathbf{w}^\ast$, it starts using likelihood distributions with progressively lower variance. This allows data points (hopefully the uncorrupted ones) to get larger weights (cf. the Dirac delta distribution, which has vanishing variance and assigns high density to isolated points).

3.1 Mode-preserving Variance-altering Likelihood Transformations

Table 1: Some common distributions and their variance-altered forms. Note that in all cases, the form of the distribution is preserved after transformation, and the variance asymptotically goes down at the rate $\Theta(1/\beta)$ as $\beta \rightarrow \infty$.

  • Gaussian (univariate), $\mathcal{N}(y \,|\, \eta)$: standard form $\sqrt{\frac{1}{2\pi}}\exp(-\frac{1}{2}(y-\eta)^2)$; variance-altered form $\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(y-\eta)^2)$; variance $\frac{1}{\beta}$; asymptotic form ($\beta \rightarrow \infty$) $\delta_\eta(y)$.
  • Gaussian (multivariate), $\mathcal{N}(\mathbf{y} \,|\, \boldsymbol{\eta})$: standard form $(\frac{1}{2\pi})^{d/2}\exp(-\frac{1}{2}\|\mathbf{y}-\boldsymbol{\eta}\|_2^2)$; variance-altered form $(\frac{\beta}{2\pi})^{d/2}\exp(-\frac{\beta}{2}\|\mathbf{y}-\boldsymbol{\eta}\|_2^2)$; variance $\frac{1}{\beta}$; asymptotic form $\delta_{\boldsymbol{\eta}}(\mathbf{y})$.
  • Bernoulli, $y \in \{-1, +1\}$: standard form $\mathbb{P}[y=1 \,|\, \eta] = \pi$ with $\pi = (1+\exp(-y\eta))^{-1}$; variance-altered form $\mathbb{P}[y=1 \,|\, \eta] = \tilde{\pi}$ with $\tilde{\pi} = (1+\exp(-\beta y\eta))^{-1}$; variance $< \frac{1}{\beta\eta}$; asymptotic form $\delta_{\text{sign}(\eta)}(y)$.
  • Gamma, $\mathcal{G}(y \,|\, \eta, \phi)$ with $\phi < 1$ and the non-canonical link $\eta = \exp(\langle \mathbf{w}, \mathbf{x} \rangle)$: standard form $\frac{1}{y\Gamma(1/\phi)}(\frac{y\eta}{\phi})^{1/\phi}\exp(-\frac{y\eta}{\phi})$; variance-altered form $\frac{1}{y\Gamma(1/\tilde{\phi}_\beta)}(\frac{y\tilde{\eta}_\beta}{\tilde{\phi}_\beta})^{1/\tilde{\phi}_\beta}\exp(-\frac{y\tilde{\eta}_\beta}{\tilde{\phi}_\beta})$ where $\tilde{\phi}_\beta = \phi/(\phi + \beta(1-\phi))$ and $\tilde{\eta}_\beta = \eta\beta/(\phi + \beta(1-\phi))$; variance $\frac{\phi}{\eta^2}\cdot\frac{\phi + \beta(1-\phi)}{\beta^2}$; asymptotic form $\delta_{\frac{1-\phi}{\eta}}(y)$.

[Figures accompanying Table 1, showing how the Gaussian and gamma likelihood functions change with varying values of $\beta$ while remaining order/mode preserving, are omitted here.]

To implement the above strategy, SVAM (Algorithm 1) needs techniques to alter the variance of a likelihood distribution at will. Note that the likelihood values of the altered distributions must be computable as they will be used as weights $s_i$, i.e. merely being able to sample the distribution is not enough. Moreover, the transformation must be order-preserving: if the original and transformed distributions are $\mathbb{P}$ and $\tilde{\mathbb{P}}$ respectively, then for every pair of labels $y, y'$ and every parameter value $\eta$, we must have $\mathbb{P}[y \,|\, \eta] > \mathbb{P}[y' \,|\, \eta] \Leftrightarrow \tilde{\mathbb{P}}[y \,|\, \eta] > \tilde{\mathbb{P}}[y' \,|\, \eta]$. If this is not true, then SVAM could exhibit anomalous behavior.

The Transformation: If $\mathbb{P}[y \,|\, \eta] = \exp(y \cdot \eta - \psi(\eta) - h(y))$ is an exponential family distribution with parameter $\eta$ and log-partition function $\psi(\eta) = \log\int\exp(y \cdot \eta - h(y))\, dy$, then for any $\beta > 0$, we get the variance-altered density

$$\tilde{\mathbb{P}}_\beta[y \,|\, \eta] = \frac{1}{Z(\eta, \beta)}\exp(\beta \cdot (y \cdot \eta - \psi(\eta) - h(y))),$$

where $Z(\eta, \beta) = \int\exp(\beta \cdot (y \cdot \eta - \psi(\eta) - h(y)))\, dy$. This transformation is order- and mode-preserving since $x^\beta$ is an increasing function for any $\beta > 0$. This generalized likelihood distribution has variance [17] $\frac{1}{\beta}\nabla^2\psi(\eta)$, which tends to $0$ as $\beta \rightarrow \infty$. Table 1 lists a few popular distributions, their variance-altered versions, and their asymptotic versions as $\beta \rightarrow \infty$.
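As a small illustration of the transformation (a sketch assuming numpy only and scalar labels; not from the released code), the variance-altered Gaussian and Bernoulli forms from Table 1 can be computed directly and used as weights:

```python
import numpy as np

def gaussian_weight(y, eta, beta):
    """beta-variance-altered Gaussian likelihood, i.e. N(y | eta, 1/beta)."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (y - eta) ** 2)

def bernoulli_weight(y, eta, beta):
    """beta-altered Bernoulli likelihood for labels y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-beta * y * eta))

# As beta grows, points with small residuals keep comparatively large weights while
# points with large residuals are down-weighted aggressively.
print(gaussian_weight(0.1, 0.0, beta=0.01), gaussian_weight(5.0, 0.0, beta=0.01))
print(gaussian_weight(0.1, 0.0, beta=10.0), gaussian_weight(5.0, 0.0, beta=10.0))
```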

We note that [10] also study variance-altering transformations for learning hidden Markov models, topic models, etc. However, their transformations are unsuitable for use in SVAM for a few reasons:
1. SVAM's transformed distributions are always available in closed form whereas those of [10] are not necessarily available in closed form.
2. SVAM's transformations are order-preserving while [10] offer mean-preserving transformations that are not guaranteed to be order-preserving.

The Algorithm: As presented in Algorithm 1, SVAM repeatedly constructs weighted MLEs $\tilde{Q}_\beta(\mathbf{w} \,|\, \hat{\mathbf{w}}^t)$ that use $\beta$-variance-altered weights $s_i = \tilde{\mathbb{P}}_\beta[y_i \,|\, \langle \hat{\mathbf{w}}^t, \mathbf{x}^i \rangle]$ for all $i \in [n]$ and solves them to get new model estimates.

Algorithm 1 SVAM: Sequential Variance-Altered MLE
Input: Data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$, initial model $\hat{\mathbf{w}}^1$, initial scale $\beta_1$, scale increment $\xi > 1$, likelihood dist. $\mathbb{P}[\cdot \,|\, \cdot]$
1: for $t = 1, 2, \ldots, T-1$ do
2:    $s_i^t \leftarrow \tilde{\mathbb{P}}_{\beta_t}[y_i \,|\, \langle \hat{\mathbf{w}}^t, \mathbf{x}^i \rangle]$  // $\beta_t$-variance-altered $\mathbb{P}[\cdot \,|\, \cdot]$
3:    $\tilde{Q}_{\beta_t}(\mathbf{w} \,|\, \hat{\mathbf{w}}^t) \overset{\mathrm{def}}{=} -\sum_{i=1}^n s_i^t \cdot \log\mathbb{P}[y_i \,|\, \langle \mathbf{w}, \mathbf{x}^i \rangle]$
4:    $\hat{\mathbf{w}}^{t+1} = \arg\min_{\mathbf{w}} \tilde{Q}_{\beta_t}(\mathbf{w} \,|\, \hat{\mathbf{w}}^t)$
5:    $\beta_{t+1} \leftarrow \xi \cdot \beta_t$  // variance of $\tilde{\mathbb{P}}_\beta[\cdot \,|\, \cdot]$ decreases as $\beta$ increases
6: end for
7: return $\hat{\mathbf{w}}^T$
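For concreteness, the generic loop in Algorithm 1 could be sketched as below; `weight_fn` and `solve_fn` are hypothetical problem-specific callables supplied by the caller (computing the $\beta$-variance-altered likelihoods and solving the weighted MLE respectively), not functions from the released code:

```python
import numpy as np

def svam(X, y, w_init, beta1, xi, T, weight_fn, solve_fn):
    """Generic SVAM loop: reweight, solve the weighted MLE, then shrink the variance."""
    w, beta = np.array(w_init, dtype=float), float(beta1)
    for _ in range(T - 1):
        s = weight_fn(X, y, w, beta)   # step 2: s_i = P~_beta[y_i | <w, x_i>]
        w = solve_fn(X, y, s)          # steps 3-4: weighted MLE
        beta *= xi                     # step 5: decrease the likelihood variance
    return w
```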

We pause to note that the approach in [15], although similar at first glance to Eq (1), applies only to least-squares regression as it relies on notions of residuals missing from other GLMs. In contrast, SVAM works for all GLMs, e.g. least-squares/logistic/gamma regression, and offers stronger theoretical guarantees.

Theorem 1 shows that SVAM enjoys a linear rate of convergence. However, we first define notions of Local Weighted Strong Convexity and Lipschitz Continuity. Let $\mathcal{B}_2(\mathbf{v}, r) := \{\mathbf{w} : \|\mathbf{w} - \mathbf{v}\|_2 \leq r\}$ denote the $L_2$ ball of radius $r$ centered at the vector $\mathbf{v} \in \mathbb{R}^d$.

Definition 1 (LWSC/LWLC).

Given data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$ and $\beta > 0$, an exponential family distribution $\mathbb{P}[\cdot \,|\, \cdot]$ is said to satisfy $\lambda_\beta$-Local Weighted Strong Convexity and $\Lambda_\beta$-Local Weighted Lipschitz Continuity if for any true model $\mathbf{w}^\ast$ and any $\mathbf{u}, \mathbf{v} \in \mathcal{B}_2(\mathbf{w}^\ast, \sqrt{\frac{1}{\beta}})$ the following hold:

1. $\nabla^2\tilde{Q}_\beta(\mathbf{v} \,|\, \mathbf{u}) \overset{\mathrm{def}}{=} \left.\nabla^2\tilde{Q}_\beta(\cdot \,|\, \mathbf{u})\right|_{\mathbf{v}} \succeq \lambda_\beta \cdot I$

2. $\left\|\nabla\tilde{Q}_\beta(\mathbf{w}^\ast \,|\, \mathbf{u})\right\|_2 \overset{\mathrm{def}}{=} \left\|\left.\nabla\tilde{Q}_\beta(\cdot \,|\, \mathbf{u})\right|_{\mathbf{w}^\ast}\right\|_2 \leq \Lambda_\beta$

The above requires the $\tilde{Q}_\beta$-function to be strongly convex and Lipschitz continuous in a ball of radius $\frac{1}{\sqrt{\beta}}$ around the true model $\mathbf{w}^\ast$, i.e. as $\beta$ increases, the neighborhood in which these properties are required shrinks. We will show that likelihood functions corresponding to GLMs, e.g. least-squares and gamma regression, satisfy these properties for appropriate ranges of $\beta$, even in the presence of corrupted samples.

Theorem 1 (SVAM convergence).

If the data and likelihood distribution satisfy the LWSC/LWLC properties for all $\beta \in (0, \beta_{\max}]$ and if SVAM is initialized at $\hat{\mathbf{w}}^1$ and scale $\beta_1 > 0$ s.t. $\beta_1 \cdot \|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2^2 \leq 1$, then for any $\epsilon > 1/\beta_{\max}$ and a small enough scale increment $\xi > 1$, SVAM ensures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \epsilon$ after $T = \mathcal{O}(\log\frac{1}{\epsilon})$ iterations.

It is useful to take a moment to analyze this result. Note that if the LWSC/LWLC properties hold for larger values of $\beta$, SVAM is able to offer smaller model recovery errors. Let us take least-squares regression with hybrid noise (see §4) as an example. The proofs will show that the LWSC/LWLC properties are assured for $\beta$ as large as $\beta_{\max} = \tilde{\mathcal{O}}(\min\{\frac{1}{\alpha^{2/3}}, \sqrt{\frac{n}{d}}\})$ (see §4). Thus, with proper initialization of $\hat{\mathbf{w}}^1, \xi$ and $\beta_1$ (discussed below), SVAM ensures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \tilde{\mathcal{O}}(\max\{\alpha^{2/3}, \sqrt{\frac{d}{n}}\})$ within $T = \mathcal{O}(\ln(n))$ steps. This holds so long as SVAM is offered at least $n = \Omega(d\log d)$ training samples.

Figure 1 (panels: (a) $\beta_1$ Sensitivity, (b) $\xi$ Sensitivity, (c) $\alpha$ Sensitivity, (d) $d$ Sensitivity): SVAM offers stable convergence and recovery superior to competitor algorithms for a wide range of hyperparameters $\beta_1, \xi$, corruption rates $\alpha = k/n$, and feature dimensionality $d$.

Initialization: SVAM needs to be invoked with $\hat{\mathbf{w}}^1, \beta_1$ that satisfy the requirements of Thm 1 and a small enough $\xi$. If we initialize at the origin, i.e. $\hat{\mathbf{w}}^1 = \mathbf{0}$, then Theorem 1's requirement translates to $\beta_1 \leq \frac{1}{\|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2^2}$, i.e. we need only find a small enough $\beta_1$. Thus, SVAM needs to tune two scalars $\xi, \beta_1$ to take small enough values, which it does as described below. In practice, SVAM offered stable performance for a wide range of $\beta_1, \xi$ (see Fig 1).

Hyperparameter Tuning: SVAM's two hyperparameters $\beta_1, \xi$ were tuned using a held-out validation set. As the validation data could also contain corruptions, the validation error was calculated by rejecting the top $\alpha$ fraction of validation points with the highest prediction error. The true value of $\alpha$ was provided to competitor algorithms as a handicap but not to SVAM. Thus, $\alpha$ itself was treated as a (third) hyperparameter for SVAM.
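The trimmed validation error could be computed as in the following sketch (illustrative names and squared error assumed; the released code may differ): the top $\alpha$ fraction of validation points with the largest prediction error is dropped before averaging.

```python
import numpy as np

def trimmed_validation_error(y_val, y_pred, alpha):
    """Average prediction error after rejecting the top-alpha fraction of errors."""
    err = (y_val - y_pred) ** 2
    keep = int(np.ceil((1.0 - alpha) * len(err)))
    return np.sort(err)[:keep].mean()   # average over the smallest errors only
```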

4 Robust GLM Applications with SVAM

This section adapts SVAM to robust least-squares/gamma/logistic regression and robust mean estimation and establishes breakdown points and LWSC/LWLC guarantees for their respective $\tilde{Q}_\beta$ functions (see Defn 1). We refer the reader to §1 for definitions of partially/fully adaptive adversaries.

Robust Least Squares Regression. We have $n$ data points $(\mathbf{x}^i, y_i)$, $\mathbf{x}^i \in \mathbb{R}^d$, sampled from a sub-Gaussian distribution $\mathcal{D}$ over $\mathbb{R}^d$. We consider the hybrid corruption setting where, on the $G = (1-\alpha) \cdot n$ "good" data points, we get labels $y_i = \langle \mathbf{w}^\ast, \mathbf{x}^i \rangle + \epsilon_i$ with Gaussian noise $\epsilon_i \sim \mathcal{N}(0, \frac{1}{\beta^\ast})$ of variance $\frac{1}{\beta^\ast}$ added. On the remaining $B = \alpha \cdot n$ "bad" points, we get adversarially corrupted labels $\tilde{y}_i = \langle \mathbf{w}^\ast, \mathbf{x}^i \rangle + b_i$ where $b_i \in \mathbb{R}$ is chosen by the adversary. Note that $b_i$ can be unbounded. We also consider the pure corruption setting where clean points receive no Gaussian noise (this corresponds to $\beta^\ast = \infty$). SVAM-RR (Alg. 2) adapts SVAM to the task of robust regression.

Theorem 2 (Partially Adaptive Adversary).

For hybrid corruptions by a partially adaptive adversary with corruption rate $\alpha \leq 0.18$, there exists $\xi > 1$ s.t. with probability at least $1 - \exp(-\Omega(d))$, the LWSC/LWLC properties are satisfied for the $\tilde{Q}_\beta$ function for $\beta$ values as large as $\beta_{\max} = \mathcal{O}(\beta^\ast\min\{\frac{1}{\alpha^{2/3}}, \sqrt{\frac{n}{d\log(n)}}\})$. If initialized with $\hat{\mathbf{w}}^1, \beta_1$ s.t. $\beta_1 \cdot \|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2^2 \leq 1$, SVAM-RR assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \mathcal{O}(\frac{1}{\beta^\ast}\max\{\alpha^{2/3}, \sqrt{\frac{d\log(n)}{n}}\})$ within $T \leq \mathcal{O}(\log\frac{n}{\beta_1})$ iterations. For pure corruptions by a partially adaptive adversary, we have $\beta_{\max} = \infty$ and thus, for any $\epsilon > 0$, SVAM-RR assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \epsilon$ within $T \leq \mathcal{O}(\log\frac{1}{\epsilon\beta_1})$ iterations.

Note that in the pure corruption setting, SVAM assures exact recovery of $\mathbf{w}^\ast$ simply by running the algorithm long enough. This is not a contradiction since in this case the LWSC/LWLC properties can be shown to hold for all values of $\beta < \infty$, as we effectively have $\beta^\ast = \infty$. Thm 2 holds against a partially adaptive adversary but can be extended to a fully adaptive adversary at the cost of a worse breakdown point (see Thm 3 below). Note that SVAM continues to assure exact recovery of $\mathbf{w}^\ast$.

Theorem 3 (Fully Adaptive Adversary).

For pure corruptions by a fully adaptive adversary with corruption rate $\alpha \leq 0.0036$, LWSC/LWLC are satisfied for all $\beta \in (0, \infty)$, i.e. $\beta_{\max} = \infty$, and for any $\epsilon > 0$, SVAM-RR assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \epsilon$ within $T \leq \mathcal{O}(\log\frac{1}{\epsilon\beta_1})$ iterations if initialized as described in the statement of Theorem 2.

Establishing LWSC/LWLC: In the appendices, Lemmata 15 and 16 establish the LWSC/LWLC properties for robust least-squares regression while Theorems 14 and 21 establish the breakdown points and the existence of suitable increments $\xi > 1$. Handling a fully adaptive adversary requires mild modifications to the notions of LWSC/LWLC, details of which are presented in Appendix G.1.

Model Recovery and Breakdown Point: For pure corruption, SVAM-RR offers exact model recovery against partially and fully adaptive adversaries as it assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \epsilon$ for any $\epsilon > 0$ if executed long enough. For hybrid corruption, where even "clean" points receive Gaussian noise of variance $\frac{1}{\beta^\ast}$, SVAM-RR assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \leq \mathcal{O}(\frac{1}{\beta^\ast}\sqrt{\frac{d\log(n)}{n}})$ as $\alpha \rightarrow 0$, i.e. $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2^2 \rightarrow 0$ as $n \rightarrow \infty$, assuring consistent recovery. This significantly improves upon previous results by [1, 15] which offer $\mathcal{O}(\frac{1}{\beta^\ast})$ error even if $\alpha \rightarrow 0$ and $n \rightarrow \infty$. Note that SVAM-RR has a superior breakdown point (allowing up to an 18% corruption rate) against an oblivious adversary. The breakdown point deteriorates as expected (still allowing up to a 0.36% corruption rate) against a fully adaptive adversary. We now present analyses for other GLM problems.

Algorithm 2 SVAM-RR: Robust Least Squares Regression
Input: Data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$, initial scale $\beta_1$, initial model $\hat{\mathbf{w}}^1$, scale increment $\xi$
Output: A model estimate $\hat{\mathbf{w}} \approx \mathbf{w}^\ast$
1: for $t = 1, 2, \ldots, T-1$ do
2:    $s_i \leftarrow \exp(-\frac{\beta_t}{2}(y_i - \langle \mathbf{x}^i, \hat{\mathbf{w}}^t \rangle)^2)$
3:    $S \leftarrow \text{diag}(s_1, \ldots, s_n)$
4:    $\hat{\mathbf{w}}^{t+1} \leftarrow (XSX^\top)^{-1}(XS\mathbf{y})$
5:    $\beta_{t+1} \leftarrow \xi \cdot \beta_t$
6: end for
7: return $\hat{\mathbf{w}}^T$

Algorithm 3 SVAM-ME: Robust Mean Estimation
Input: Data $\{\mathbf{x}^i\}_{i=1}^n$, initial scale $\beta_1$, initial model $\hat{\boldsymbol{\mu}}^1$, scale increment $\xi$
Output: A mean estimate $\hat{\boldsymbol{\mu}} \approx \boldsymbol{\mu}^\ast$
1: for $t = 1, 2, \ldots, T-1$ do
2:    $s_i \leftarrow \exp(-\frac{\beta_t}{2}\|\mathbf{x}^i - \hat{\boldsymbol{\mu}}^t\|_2^2)$
3:    $\hat{\boldsymbol{\mu}}^{t+1} \leftarrow (\sum_{i=1}^n s_i)^{-1}(\sum_{i=1}^n s_i\mathbf{x}^i)$
4:    $\beta_{t+1} \leftarrow \xi \cdot \beta_t$
5: end for
6: return $\hat{\boldsymbol{\mu}}^T$

Algorithm 4 SVAM-Gamma: Robust Gamma Regression
Input: Data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$, initial scale $\beta_1$, initial model $\hat{\mathbf{w}}^1$, scale increment $\xi$
Output: A model estimate $\hat{\mathbf{w}} \approx \mathbf{w}^\ast$
1: for $t = 1, 2, \ldots, T-1$ do
2:    $s_i \leftarrow \mathcal{G}(y_i \,|\, \tilde{\eta}_{\beta_t}, \tilde{\phi}_{\beta_t})$  // see Table 1
3:    $\hat{\mathbf{w}}^{t+1} \leftarrow \arg\min_{\mathbf{w}} \sum_{i=1}^n s_i \cdot \ell(\mathbf{w}, \mathbf{x}^i, y_i)$ where $\ell(\mathbf{w}, \mathbf{x}, y) = (1-\phi)^{-1}y\exp(\langle \mathbf{w}, \mathbf{x} \rangle) - \langle \mathbf{w}, \mathbf{x} \rangle$
4:    $\beta_{t+1} \leftarrow \xi \cdot \beta_t$
5: end for
6: return $\hat{\mathbf{w}}^T$

Algorithm 5 SVAM-LR: Robust Classification
Input: Data $\{(\mathbf{x}^i, y_i)\}_{i=1}^n$, initial scale $\beta_1$, initial model $\hat{\mathbf{w}}^1$, scale increment $\xi$
Output: A model estimate $\hat{\mathbf{w}} \approx \mathbf{w}^\ast$
1: for $t = 1, 2, \ldots, T-1$ do
2:    $s_i \leftarrow (1 + \exp(-\beta_t y_i\langle \mathbf{x}^i, \hat{\mathbf{w}}^t \rangle))^{-1}$
3:    $\hat{\mathbf{w}}^{t+1} \leftarrow \arg\min_{\mathbf{w}} \sum_{i=1}^n s_i \cdot \ell(\mathbf{w}, \mathbf{x}^i, y_i)$ where $\ell(\mathbf{w}, \mathbf{x}, y) = \log(1 + \exp(-y\langle \mathbf{x}, \mathbf{w} \rangle))$
4:    $\beta_{t+1} \leftarrow \xi \cdot \beta_t$
5: end for
6: return $\hat{\mathbf{w}}^T$
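The weighted least-squares step in Algorithm 2 has a closed form; a minimal numpy sketch of the SVAM-RR iteration (not the released implementation; here $X$ is stored as an $n \times d$ matrix and all names are illustrative) could be:

```python
import numpy as np

def svam_rr(X, y, w_init, beta1, xi, T):
    """SVAM-RR sketch: variance-altered Gaussian weights + weighted least squares."""
    w, beta = np.array(w_init, dtype=float), float(beta1)
    for _ in range(T - 1):
        r = y - X @ w                             # residuals under the current model
        s = np.exp(-0.5 * beta * r ** 2)          # variance-altered Gaussian weights
        XtS = X.T * s                             # d x n, column i scaled by s_i
        w = np.linalg.solve(XtS @ X, XtS @ y)     # weighted least-squares solution
        beta *= xi
    return w
```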

Robust Gamma Regression. The data generation and corruption model for gamma regression are slightly different given that the gamma distribution has support only over the positive reals. First, the canonical parameter is calculated as $\eta_i = \exp(\langle \mathbf{w}^\ast, \mathbf{x}^i \rangle)$, using which a clean label $y_i$ is generated. To simplify the analysis, we assume that $\|\mathbf{w}^\ast\|_2 = 1$, $\phi = 0.5$, and $\mathbf{x}^i \sim \mathcal{N}(\mathbf{0}, I)$. For the $G = (1-\alpha) \cdot n$ "good" points, labels are generated as $y_i = \exp(\langle \mathbf{w}^\ast, \mathbf{x}^i \rangle)(1-\phi)$, i.e. the no-noise model. For the remaining $B = \alpha \cdot n$ "bad" points, the label is corrupted as $\tilde{y}_i = y_i \cdot b_i$ where $b_i > 0$ is a positive real number (but otherwise arbitrary and unbounded). A multiplicative corruption makes more sense since the final label must be positive. SVAM-Gamma (Algorithm 4) adapts SVAM to robust gamma regression. Due to the alternate canonical parameter used in gamma regression, the initialization requirement also needs to be modified to $\beta_1 \cdot (\exp(\|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2) - 1)^2 \leq 1$. However, the hyperparameter tuning strategy discussed in §3 continues to apply.
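The SVAM-Gamma iteration does not have a closed-form inner solve; the following sketch (assumptions: $\phi = 0.5$, a plain gradient-descent inner solver with an illustrative step size, scipy available for $\log\Gamma$; not the released code) shows how the Table 1 weights and the weighted loss of Algorithm 4 fit together:

```python
import numpy as np
from scipy.special import gammaln

def gamma_weight(y, eta, phi, beta):
    """beta-variance-altered gamma likelihood from Table 1."""
    phi_b = phi / (phi + beta * (1 - phi))
    eta_b = eta * beta / (phi + beta * (1 - phi))
    logp = (-np.log(y) - gammaln(1 / phi_b)
            + (1 / phi_b) * np.log(y * eta_b / phi_b) - y * eta_b / phi_b)
    return np.exp(logp)

def svam_gamma(X, y, w_init, beta1, xi, T, phi=0.5, lr=0.01, inner_steps=200):
    w, beta = np.array(w_init, dtype=float), float(beta1)
    for _ in range(T - 1):
        s = gamma_weight(y, np.exp(X @ w), phi, beta)      # weights from current model
        for _ in range(inner_steps):                       # weighted MLE by gradient descent
            grad = X.T @ (s * (y * np.exp(X @ w) / (1 - phi) - 1.0))
            w -= lr * grad / len(y)
        beta *= xi
    return w
```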

Theorem 4.

For data corrupted by a partially adaptive adversary with $\alpha \leq \frac{0.002}{\sqrt{d}}$, there exists $\xi > 1$ s.t. with probability at least $1 - \exp(-\Omega(d))$, the LWSC/LWLC conditions are satisfied for the $\tilde{Q}_\beta$ function for $\beta$ values as large as $\beta_{\max} = \mathcal{O}(1/(\exp(\mathcal{O}(\alpha\sqrt{d})) - 1))$. If initialized at $\hat{\mathbf{w}}^1, \beta_1$ s.t. $\beta_1 \cdot (\exp(\|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2) - 1)^2 \leq 1$ and $\beta_1 \geq 1$, SVAM-Gamma assures $\|\hat{\mathbf{w}}^T - \mathbf{w}^\ast\|_2 \leq \epsilon$ for any $\epsilon \geq \mathcal{O}(\alpha\sqrt{d})$ within $T \leq \mathcal{O}(\log\frac{1}{\epsilon})$ steps.

Model recovery, Consistency, Breakdown point. It is notable that prior results in the literature do not offer any breakdown point for gamma regression. We find that Thm 4 requires $\beta_1 \cdot (\exp(\|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2) - 1)^2 \leq 1$ and $\beta_1 \geq 1$, which together imply $\|\hat{\mathbf{w}}^1 - \mathbf{w}^\ast\|_2 \leq \ln 2$. This is in contrast to Thms 2 and 3, which allow any initial $\hat{\mathbf{w}}^1$ so long as $\beta_1, \xi$ are sufficiently small. SVAM-Gamma guarantees convergence to a region of radius $\mathcal{O}(\alpha\sqrt{d})$ around $\mathbf{w}^\ast$ whereas Thms 2 and 3 assure exact recovery. However, these do not seem to be artifacts of the proof technique. In experiments, SVAM-Gamma did not offer vanishingly small recovery errors and did indeed struggle if initialized with $\beta_1 \ll 1$. It may be the case that there exist lower bounds preventing exact recovery for gamma regression, similar to mean estimation.

Robust Mean Estimation. We have nn data points of which the set GG of (1α)n(1-\alpha)\cdot n ”good” points are generated from a dd-dimensional spherical Gaussian 𝐱i𝒩(𝝁,Σ){{\mathbf{x}}}^{i}\sim{\mathcal{N}}(\text{\boldmath$\mathbf{\mu}$},\Sigma) i.e. 𝐱i=𝝁+ϵi{{\mathbf{x}}}^{i}=\text{\boldmath$\mathbf{\mu}$}+\text{\boldmath$\mathbf{\epsilon}$}^{i} where ϵi𝒩(𝟎,Σ)\text{\boldmath$\mathbf{\epsilon}$}^{i}\sim{\mathcal{N}}({\mathbf{0}},\Sigma) and Σ=1βI\Sigma=\frac{1}{\beta^{\ast}}\cdot I for some β>0\beta^{\ast}>0. The rest are the set BB of αn\alpha\cdot n ”bad” points that are corrupted by an adversary i.e. 𝐱~i=𝝁+𝐛i\tilde{{\mathbf{x}}}^{i}=\text{\boldmath$\mathbf{\mu}$}^{\ast}+{{\mathbf{b}}}^{i} where 𝐛id{{\mathbf{b}}}^{i}\in{\mathbb{R}}^{d} can be unbounded. SVAM-ME (Algorithm 3) adapts SVAM to the robust mean estimation problem. For notational clarity we use, 𝜼=𝝁\text{\boldmath$\mathbf{\eta}$}=\text{\boldmath$\mathbf{\mu}$}, in this problem.
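Since each SVAM-ME update is simply a soft weighted mean, a minimal numpy sketch of Algorithm 3 (illustrative names only, not the released code) is:

```python
import numpy as np

def svam_me(X, mu_init, beta1, xi, T):
    """SVAM-ME sketch: weights from the variance-altered multivariate Gaussian."""
    mu, beta = np.array(mu_init, dtype=float), float(beta1)
    for _ in range(T - 1):
        s = np.exp(-0.5 * beta * np.sum((X - mu) ** 2, axis=1))   # per-point weights
        mu = (s[:, None] * X).sum(axis=0) / s.sum()               # soft weighted mean
        beta *= xi
    return mu
```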

Theorem 5.

For data corrupted by a partially adaptive adversary with corruption rate $\alpha \leq 0.26$, there exists $\xi > 1$ s.t. with probability at least $1 - \exp(-\Omega(d))$, the LWSC/LWLC conditions are satisfied for the $\tilde{Q}_\beta$ function for $\beta$ up to $\beta_{\max} = \mathcal{O}(\frac{\beta^\ast}{d}\min\{\log\frac{1}{\alpha}, \sqrt{nd}\})$. If initialized with $\hat{\boldsymbol{\mu}}^1, \beta_1$ s.t. $\beta_1 \cdot \|\hat{\boldsymbol{\mu}}^1 - \boldsymbol{\mu}^\ast\|_2^2 \leq 1$, SVAM-ME assures $\|\hat{\boldsymbol{\mu}}^T - \boldsymbol{\mu}^\ast\|_2^2 \leq \epsilon$ for any $\epsilon \geq \mathcal{O}(\text{trace}^2(\Sigma) \cdot \max\{\frac{1}{\ln(1/\alpha)}, \frac{1}{\sqrt{nd}}\})$ within $T \leq \mathcal{O}(\log\frac{n}{\beta_1})$ iterations.

Model recovery, Consistency, Breakdown point. Note that for any constant $\alpha > 0$, the estimation error does not go to zero as $n \rightarrow \infty$. As mentioned in §2, an error of $\Omega(\alpha)$ is unavoidable no matter how large $n$ gets. Thus, the best we can hope for is that the estimation error goes to zero as $\alpha \rightarrow 0$ and $n \rightarrow \infty$. The error in Theorem 5 does indeed go to zero in this setting. Also, note that the error depends only on the trace of the covariance matrix of the clean points; thus for $\text{trace}(\Sigma) = \mathcal{O}(1)$, the result offers an estimation error independent of the dimension. SVAM-ME offers a large breakdown point (allowing up to a 26% corruption rate).

Establishing LWSC/LWLC for Gamma Regression and Mean Estimation: In the appendices, Lemmata 28 and 29 and Lemmata 23 and 24 establish the LWSC/LWLC properties of the $\tilde{Q}_\beta$ function for gamma regression and mean estimation respectively, while Theorems 27 and 22 establish the breakdown points and the existence of increments $\xi > 1$.

Robust Classification. In this case the labels are generated as $y_i = \text{sign}(\langle \mathbf{w}^\ast, \mathbf{x}^i \rangle)$ and the bad points in the set $B$ get their labels flipped, i.e. $\tilde{y}_i = -\text{sign}(\langle \mathbf{w}^\ast, \mathbf{x}^i \rangle)$. SVAM-LR (Algorithm 5) adapts SVAM to robust logistic regression.
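A short sketch of the SVAM-LR iteration in Algorithm 5 (assumptions: a plain gradient-descent inner solver for the weighted logistic loss; step size and iteration count are illustrative choices, not from the paper) is given below:

```python
import numpy as np

def svam_lr(X, y, w_init, beta1, xi, T, lr=0.1, inner_steps=200):
    """SVAM-LR sketch: temperature-scaled sigmoid weights + weighted logistic regression."""
    w, beta = np.array(w_init, dtype=float), float(beta1)
    for _ in range(T - 1):
        s = 1.0 / (1.0 + np.exp(-beta * y * (X @ w)))   # beta-altered Bernoulli weights
        for _ in range(inner_steps):                    # weighted logistic MLE by gradient descent
            p = 1.0 / (1.0 + np.exp(-y * (X @ w)))      # P[y_i | <w, x_i>]
            grad = -X.T @ (s * y * (1.0 - p))
            w -= lr * grad / len(y)
        beta *= xi
    return w
```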

5 Experiments

We used a 64-bit machine with Intel® Core™ i7-6500U CPU @ 2.50GHz, 4 cores, 16 GB RAM, Ubuntu 16.04 OS.

Benchmarks. SVAM was benchmarked against several baselines: (a) VAM: SVAM executed with a fixed value of the scale $\beta$ by setting the scale increment to $\xi = 1$, to investigate the benefits of varying the scale $\beta$; (b) MLE: likelihood maximization on all points (clean + corrupted) without any weights assigned to data points, which checks for the benefits of performing weighted MLE; (c) Oracle: an execution of the MLE on only the clean points, which is the gold standard in robust learning and offers the best possible outcome. In addition, several problem-specific competitors were also considered. For robust regression, STIR [15], TORRENT [1], SEVER [7], RGD [20], and the classical robust M-estimator with Tukey's bisquare loss were included. Note that TORRENT already outperforms $L_1$ regularization methods while achieving better or competitive recovery errors (see [1, Fig 2(b)]). Since SVAM-RR was faster than TORRENT itself, $L_1$-regularized methods such as [18, 24] were not considered. For robust mean estimation, popular baselines such as the coordinate-wise median and the geometric median were taken. For robust classification, the rank-pruning method RP-LR [19] and the method from [16] were used.

Experimental Setting and Reproducibility. Due to lack of space, details of the experimental setup, data generation, how adversaries were simulated, etc. are presented in Appendix C. SVAM also offered superior robustness compared to competitors against a wide range of ways to simulate adversarial corruption (see Appendix D for details). Code for SVAM is available at https://github.com/purushottamkar/svam/.

Figure 2 (panels: (a) Robust Regression, (b) Robust Mean Estimation, (c) Robust Gamma Regression, (d) Robust Classification, (e) Adversarial Initialization, (f) Global Convergence): Figs 2(a,b,c,d) compare SVAM and various competitors on robust GLM problems. The number of data points $n$, dimensions $d$, and fraction of corruptions $\alpha = k/n$ are mentioned at the top of each figure. For algorithms for which iteration-wise performance was unavailable, their final performance level is plotted as a horizontal dashed line. A marker is placed on the line indicating the time it took for that algorithm to converge. The figures clarify that executing VAM with a single fixed value of $\beta$ cannot replace the gradual variations in $\beta_t$ done by SVAM. Figs 2(e,f) confirm that SVAM offers convergence to $\mathbf{w}^\ast$ irrespective of the model with which it is initialized. In these figures, corruptions were introduced using an adversarial model $\tilde{\mathbf{w}}$, i.e. for corrupted points, the label was set to $\tilde{y}_i = \langle \tilde{\mathbf{w}}, \mathbf{x}^i \rangle$. SVAM was then initialized at $\tilde{\mathbf{w}}$ itself to check if it gets misled by faulty initialization but was found to offer exact recovery regardless.

5.1 Experimental Observations

Robust Regression. Fig 2(a) shows that SVAM-RR, SEVER, RGD, STIR, and TORRENT are competitive and achieve oracle-level error. However, SVAM-RR can be twice as fast in terms of execution time. Since TORRENT itself outperforms $L_1$ regularization methods while achieving better or competitive recovery errors (see Fig 2(b) in [1]), we do not compare against $L_1$ methods. SVAM-RR is several times faster than classical robust M-estimators such as Tukey's bisquare loss. Also, no single value of $\beta$ can offer the performance of SVAM, as indicated by the poor performance of VAM. Fig 4 in the appendix shows that this is true even if very large or very small values of $\beta$ are used with VAM. We note that SEVER chooses a threshold in each iteration to eliminate specific points as corrupted. This threshold is chosen randomly (possibly for ease of proof) but causes SEVER to offer sluggish convergence. Thus, we also report the performance of a modification SEVER-M that was given an unfair advantage by revealing to it the actual number of corrupted points (SVAM was not given this information). This sped up SEVER but SVAM continued to outperform SEVER-M. Fig 3 in the appendix reports repeated runs of the experiment in which SVAM continues to lead.

Robust Logistic and Gamma Regression. Figs 2(c,d) report results of SVAM on robust gamma and logistic regression problems. The figures show that executing VAM with a fixed value of $\beta$ cannot replace the gradual variations in $\beta_t$ done by SVAM. Additionally, for robust classification, SVAM-LR achieves an error an order of magnitude smaller than all competitors except the oracle. SVAM also outperforms the RP-LR [19] and [16] algorithms that were specifically designed for robust classification. A horizontal dashed line is used to indicate the final performance of algorithms for which iteration-wise performance was unavailable.

Robust Mean Estimation. Fig 2(b) reports results on robust mean estimation problems. SVAM outperforms VAM with any fixed value of $\beta$ as well as the naive sample mean (the MLE in this case). The popular coordinate-wise median and geometric median approaches were fast but offered poor results. SVAM, on the other hand, achieved oracle-level error by assigning proper scores to all data points.

Sensitivity to Hyperparameter Tuning. In Figs 1(a,b), SVAM-RR was offered hyperparameters from a wide range of values to study how it responds to mis-specified hyperparameters. SVAM offered stable convergence for a wide range of $\beta_1, \xi$, indicating that it is resilient to minor mis-specifications of its hyperparameters.

Sensitivity to Dimension and Corruption. Figs 1(c,d) compare the error offered by various algorithms in recovering $\mathbf{w}^\ast$ for robust least-squares regression when the fraction of corrupted points $\alpha$ and the feature dimension $d$ were varied. All values are averaged over 20 experiments, with each experiment using 1000 data points. $\alpha$ was varied in the range $[0, 0.4]$ and $d$ in the range $[10, 100]$ with fixed hyperparameters. STIR and the bisquare M-estimator are sensitive to corruption while SEVER is sensitive to both corruption and dimension. RGD is not visible in the figures as its error exceeded the figure boundaries. Experiments for Fig 1(c) fixed $d = 10$ and varied $\alpha$ while those for Fig 1(d) fixed $\alpha = 0.15$ and varied $d$. Figs 1(c,d) show that SVAM-RR can tolerate large fractions of the data getting corrupted and is not sensitive to $d$.

Testing SVAM for Global Convergence. To test the effect of initialization, in Fig 2(e), corruptions were introduced using an adversarial model $\tilde{\mathbf{w}}$, i.e. for corrupted points, labels were set to $\tilde{y}_i = \langle \tilde{\mathbf{w}}, \mathbf{x}^i \rangle$. SVAM-RR was initialized at 1000 randomly chosen models, the origin, as well as at the adversarial model $\tilde{\mathbf{w}}$ itself. WORST-1000 (resp. AVG-1000) indicates the worst (resp. average) performance SVAM had over the 1000 initializations. Fig 2(f) further emphasizes this using a toy 2D problem. SVAM was initialized at all points on a grid. An initialization was called a success if SVAM achieved error $< 10^{-6}$ within eight or fewer iterations. In all these experiments, SVAM rapidly converged to the true model irrespective of model initialization.

Acknowledgements

The authors thank the anonymous reviewers of this paper for suggesting illustrative experiments and pointing to relevant literature. B.M. is supported by the Technology Innovation Institute and MBZUAI joint project (NO. TII/ARRC/2073/2021): Energy-based Probing for Spiking Neural Networks. D.D. is supported by the Research-I Foundation at IIT Kanpur and acknowledges support from the Visvesvaraya PhD Scheme for Electronics & IT (FELLOW/2016-17/MLA/194). P.K. thanks Microsoft Research India and Tower Research for research grants.

References

  • Bhatia et al. [2015] K. Bhatia, P. Jain, and P. Kar. Robust Regression via Hard Thresholding. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), 2015.
  • Cantoni and Ronchetti [2001] E. Cantoni and E. Ronchetti. Robust Inference for Generalized Linear Models. Journal of the American Statistical Association, 96(455):1022–1030, 2001.
  • Cheng et al. [2019] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2755–2771. SIAM, 2019.
  • Dalalyan and Minasyan [2022] A. S. Dalalyan and A. Minasyan. All-In-One Robust Estimator of the Gaussian Mean. Annals of Statistics, 50(2):1193–1219, 2022.
  • Diakonikolas and Kane [2019] I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
  • Diakonikolas et al. [2019a] I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart. Robust Estimators in High-Dimensions Without the Computational Intractability. SIAM Journal on Computing, 48(2):742–864, 2019a.
  • Diakonikolas et al. [2019b] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A Robust Meta-Algorithm for Stochastic Optimization. In 36th International Conference on Machine Learning (ICML), 2019b.
  • Feng et al. [2014] J. Feng, H. Xu, S. Mannor, and S. Yan. Robust Logistic Regression and Classification. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
  • Feng et al. [2015] Y. Feng, X. Huang, L. Shi, Y. Yang, and J. A. Suykens. Learning with the Maximum Correntropy Criterion Induced Losses for Regression. Journal of Machine Learning Research, 16(30):993–1034, 2015.
  • Jiang et al. [2012] K. Jiang, B. Kulis, and M. Jordan. Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012.
  • Lai et al. [2016] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.
  • Lecué and Lerasle [2020] G. Lecué and M. Lerasle. Robust machine learning by median-of-means: Theory and practice. Annals of Statistics, 48(2):906–931, 2020.
  • Li et al. [2021] T. Li, A. Beirami, M. Sanjabi, and V. Smith. Tilted Empirical Risk Minimization. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021.
  • McCullagh and Nelder [1989] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, 1989.
  • Mukhoty et al. [2019] B. Mukhoty, G. Gopakumar, P. Jain, and P. Kar. Globally-convergent Iteratively Reweighted Least Squares for Robust Regression Problems. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • Natarajan et al. [2013] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.
  • Nelder and Wedderburn [1972] J. A. Nelder and R. W. M. Wedderburn. Generalized Linear Models. Journal of the Royal Statistical Society. Series A, 135(3):370–384, 1972.
  • Nguyen and Tran [2013] N. H. Nguyen and T. D. Tran. Exact Recoverability From Dense Corrupted Observations via $\ell_1$-Minimization. IEEE Transactions on Information Theory, 59(4):2017–2035, 2013.
  • Northcutt et al. [2017] C. G. Northcutt, T. Wu, and I. L. Chuang. Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017. URL http://auai.org/uai2017/proceedings/papers/35.pdf.
  • Prasad et al. [2018] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar. Robust Estimation via Robust Gradient Estimation. arXiv:1802.06485 [stat.ML], 2018.
  • Suggala et al. [2019] A. S. Suggala, K. Bhatia, P. Ravikumar, and P. Jain. Adaptive Hard Thresholding for Near-optimal Consistent Robust Regression. In 32nd Conference on Learning Theory (COLT), 2019.
  • Valdora and Yohai [2014] M. Valdora and V. J. Yohai. Robust estimators for generalized linear models. Journal of Statistical Planning and Inference, 146:31–48, 2014.
  • Vershynin [2018] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  • Wright and Ma [2010] J. Wright and Y. Ma. Dense Error Correction via $\ell_1$ Minimization. IEEE Transactions on Information Theory, 56(7):3540–3560, 2010.
  • Yang et al. [2013] E. Yang, A. Tewari, and P. Ravikumar. On Robust Estimation of High Dimensional Generalized Linear Models. In 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2013.

Appendix A Summary of Assumptions

This paper presented SVAM, a framework for robust GLM problems based on a novel variance reduced reweighted MLE technique that can be readily adapted to arbitrary GLM problems such as robust least squares/logistic/gamma regression and mean estimation. Here, we summarize the theoretical and empirical assumptions made by the SVAM framework for easy inspection.

Experimental Assumptions. SVAM requires minimal assumptions to be executed in practice and only requires two scalar hyperparameters to be tuned properly. SVAM is robust to minor misspecifications of its hyperparameters (see Figure 1), and the hyperparameter tuning described in §3 works well in practice, allowing SVAM to offer superior or competitive empirical performance when compared to state-of-the-art task-specific estimation techniques, e.g. TORRENT for robust regression, classical and popular techniques such as Tukey's bisquare or the geometric median for robust mean estimation, as well as recent robust gradient-based techniques such as SEVER and RGD.

Theoretical Assumptions. SVAM establishes explicit breakdown points in several interesting scenarios against both partially and fully adaptive adversaries. To do so, SVAM assumes a realizable setting, e.g. in least-squares regression, labels for the $G$ clean points are assumed to be generated as $y_i = \langle \mathbf{w}^\ast, \mathbf{x}^i \rangle + \epsilon_i$ where $\mathbf{w}^\ast$ is the gold model and $\epsilon_i \sim \mathcal{N}(0, \frac{1}{\beta^\ast})$ is Gaussian noise. Of course, on the $B$ bad points, the (partially/fully adaptive) adversary is free to introduce corruptions jointly in any manner. For least-squares regression, the covariates/feature vectors $\mathbf{x}^i$ are assumed to be sampled from some sub-Gaussian distribution, which includes arbitrary bounded distributions as well as multivariate Gaussian distributions in $d$ dimensions (both standard and non-standard); see Appendix G for details. For the gamma regression and mean estimation settings, the covariates are assumed to be sampled from a spherical multivariate Gaussian in $d$ dimensions. Other assumptions are listed in the statements of the theorems in §4.

Appendix B Adversary Models

We explain the various adversary models in more detail here. Several adversary models popular in the literature grant the adversary varying degrees of control. This section offers a more relaxed discussion of some prominent adversary models along with examples of applications in which they arise.

(Oblivious) Huber Adversary. Corruption locations i1,,iki_{1},\ldots,i_{k} are chosen randomly for which corrupted labels are sampled i.i.d. from some pre-decided distribution {\mathcal{B}} i.e. y~ij\tilde{y}_{i_{j}}\sim{\mathcal{B}}. Next, data features 𝐱i{{\mathbf{x}}}^{i} and the true model 𝐰{{\mathbf{w}}}^{\ast} are selected and clean labels are generated according to the GLM for all non-corrupted points.

Partially Adaptive Adversary. The adversary first chooses the corruption locations i1,,iki_{1},\ldots,i_{k} (e.g. some fixed choice or randomly). Then data features 𝐱i{{\mathbf{x}}}^{i} and the true model 𝐰{{\mathbf{w}}}^{\ast} are selected and clean labels y1,,yny_{1},\ldots,y_{n} are generated according to the GLM. Next, the adversary is presented with the collection {𝐰,{(𝐱i,yi)}i=1n,{ij}j=1k}\left\{{{{\mathbf{w}}}^{\ast},\left\{{({{\mathbf{x}}}^{i},y_{i})}\right\}_{i=1}^{n},\left\{{i_{j}}\right\}_{j=1}^{k}}\right\} and is allowed to use this information to generate corrupted labels y~ij\tilde{y}_{i_{j}} for the points marked for corruption.

Fully Adaptive Adversary. First data features 𝐱i{{\mathbf{x}}}^{i} and the true model 𝐰{{\mathbf{w}}}^{\ast} are selected and clean labels y1,,yny_{1},\ldots,y_{n} are generated according to the GLM. Then the adversary is presented with the collection {𝐰,{(𝐱i,yi)}i=1n}\left\{{{{\mathbf{w}}}^{\ast},\left\{{({{\mathbf{x}}}^{i},y_{i})}\right\}_{i=1}^{n}}\right\} and is allowed to use this information to select which kk points to corrupt as well as generate corrupted labels y~ij\tilde{y}_{i_{j}} for those points.

Discussion. The fully adaptive adversary can choose corruption locations i1,,iki_{1},\ldots,i_{k} and the corrupted labels y~ij\tilde{y}_{i_{j}} with complete information of the true model 𝐰{{\mathbf{w}}}^{\ast}, the clean labels {yi}\left\{{y_{i}}\right\} and the feature vectors {𝐱i}\left\{{{{\mathbf{x}}}^{i}}\right\} and is the most powerful. The partially adaptive adversary can decide the corruptions y~ij\tilde{y}_{i_{j}} after inspecting 𝐰,{𝐱i,yi}{{\mathbf{w}}}^{\ast},\left\{{{{\mathbf{x}}}^{i},y_{i}}\right\} but cannot control the corruption locations. This can model e.g. an adversary that corrupts user data by installing malware on their systems. The adversary cannot force malware on a system of their choice but can manipulate data coming from already compromised systems at will. The Huber adversary is the least powerful with corruption locations that are random as well as corrupted labels that are sampled randomly and cannot depend on {𝐱i,yi}\left\{{{{\mathbf{x}}}^{i},y_{i}}\right\}. Although weak, this adversary can nevertheless model sensor noise e.g. pixels in a CCD array that misfire with a certain probability. As noted earlier, SVAM results are shown against both fully and partially adaptive adversaries.

Appendix C Experimental Setup

Experiments were carried out on a 64-bit machine with Intel® Core™ i7-6500U CPU @ 2.50GHz, 4 cores, 16 GB RAM and Ubuntu 16.04 OS. Statistics such as dataset size nn, feature dimension dd and corruption fraction α\alpha are mentioned above each figure. 20% of the training data was used as a held-out validation set on which (β1,ξ\beta_{1},\xi) were tuned via line search. Figure 1 indicates that SVAM is not sensitive to the exact settings of β1\beta_{1} or ξ\xi and good values can be found using line search.

Synthetic Data Generation. Synthetic datasets were used in the experiments to demonstrate recovery of the true model parameters. All regression co-variates/features were generated using a standard normal distribution 𝒩(0,Id){\mathcal{N}}(0,I_{d}). Clean responses in the least-squares regression settings were generated without additional Gaussian noise, while corrupted responses were generated for an α\alpha fraction of the data points. For least-squares regression, clean responses were generated as yi=𝐰,𝐱iy_{i}=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle, while for logistic regression, clean binary labels were generated as yi=𝕀[𝐰,𝐱i>0]y_{i}={\mathbb{I}}[\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle>0]. For gamma regression, clean responses were generated using a likelihood distribution with vanishing variance. This was done by using a variance-altered likelihood distribution with the setting β\beta^{\ast}\rightarrow\infty (see Table 1 for the likelihood expressions). Clean data points for mean estimation were sampled from 𝒩(𝝁,1dId){\mathcal{N}}(\text{\boldmath$\mathbf{\mu}$}^{\ast},\frac{1}{d}\cdot I_{d}). Unless stated otherwise, corrupted labels were generated using an adversarial model. Specifically, to simulate the adversary, an adversarial model 𝐰~\tilde{{\mathbf{w}}} (for least-squares/logistic/gamma regression) or 𝝁~\tilde{\text{\boldmath$\mathbf{\mu}$}} (for mean estimation) was sampled and labels for data points chosen for corruption were generated using this adversarial model instead of the true model. For example, corrupted labels for least-squares regression were generated as y~i=𝐰~,𝐱i\tilde{y}_{i}=\left\langle{\tilde{{\mathbf{w}}}},{{{\mathbf{x}}}_{i}}\right\rangle for all kk locations chosen for corruption. Please also refer to Appendix D for an extensive study on several other ways of simulating the adversary and initialization schemes.
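
As a concrete illustration, the following Python sketch reproduces this generation process for the least-squares, logistic and mean-estimation settings under the assumptions stated above (standard normal covariates, noiseless clean responses). The variable names are illustrative and not taken from the released code; corruption and initialization are sketched after the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10                            # dataset size and feature dimension

X = rng.standard_normal((n, d))            # covariates x_i ~ N(0, I_d)
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)           # true model taken as a random unit vector

y_ls = X @ w_star                          # clean least-squares responses (no Gaussian noise)
y_logit = (X @ w_star > 0).astype(int)     # clean binary labels for logistic regression

mu_star = rng.standard_normal(d)
Z = mu_star + rng.standard_normal((n, d)) / np.sqrt(d)   # clean points for mean estimation, N(mu*, I_d / d)
```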

Simulating the Adversary. Corruptions were introduced using an adversarial model. Specifically, an adversarial model 𝐰~\tilde{{\mathbf{w}}} was chosen, and for bad points, the adversary generated a label using 𝐰~\tilde{{\mathbf{w}}}, which overwrote the true label. For least-squares/logistic/gamma regression, both 𝐰,𝐰~{{\mathbf{w}}}^{\ast},\tilde{{\mathbf{w}}} were independently chosen to be random unit vectors. For robust mean estimation (Fig 2(a)), 𝝁\text{\boldmath$\mathbf{\mu}$}^{\ast} and 𝝁~\tilde{\text{\boldmath$\mathbf{\mu}$}} were chosen as random Gaussian vectors of length 22 and 66 respectively. Except for Fig 2(e,f), in which the setting is different, SVAM variants were always initialized at the adversarial model itself i.e. 𝐰^1=𝐰~\hat{{\mathbf{w}}}^{1}=\tilde{{\mathbf{w}}} to test the ability of SVAM to converge to the true model regardless of the initialization.
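
Continuing the sketch above, the adversarial-model corruption and the adversarial initialization described here can be simulated as follows (again an illustrative reconstruction, not the released implementation):

```python
alpha = 0.15                                   # corruption fraction
k = int(alpha * n)

w_adv = rng.standard_normal(d)
w_adv /= np.linalg.norm(w_adv)                 # adversarial model, also a random unit vector

bad = rng.choice(n, size=k, replace=False)     # corruption locations (chosen at random here)
y_corr = y_ls.copy()
y_corr[bad] = X[bad] @ w_adv                   # overwrite labels using the adversarial model

w_init = w_adv.copy()                          # adversarial initialization, i.e. w^1 = w_tilde
```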

Other Corruption Models. SVAM was found to offer superior robustness as compared to competitor algorithms against a wide range of ways to simulate adversarial corruption, including powerful ones that use expensive leverage score computations to decide corruptions. Fig 5 in Appendix D reports results of experiments with a variety of adversaries.

Figure 3: Convergence results obtained after repeated experiments with the 3 leading methods for robust least squares regression. SVAM continues to lead despite the natural variance in convergence plots.
Figure 4: Convergence results for VAM with various (fixed) values of β\beta for robust least squares regression. The plots indicate that the best performance achieved by VAM is with a moderately large value of β\beta. Excessively large and excessively small values of β\beta are both ill-suited to model recovery. However, even if provided such an optimal value of β\beta, VAM’s performance is still inferior to that of SVAM. We note that SVAM does not use a fixed value of β\beta and instead dynamically updates it.
Figure 5: SVAM exhibits a high degree of tolerance (superior to competitor algorithms) to a variety of corruption models such as choice of corruption location (e.g. choosing corruption locations randomly, based on leverage scores, based on magnitude of the clean label, etc), choice of type of corruption (e.g. sign-flip, constant shift, using an adversarial model, etc) and initialization schemes (e.g. random, adversarial, etc). Please see the text in Appendix D for more details.

Appendix D Tolerance to Different Adversaries and Adversarial Initialization Schemes

Figure 3 repeats the experiment of Fig. 2(a) 10 times, showing that the convergence results are consistent across several runs of the experiment. In this comparison, we include the two closest competitors, STIR and TORRENT, along with SVAM, and omit other competitors to avoid clutter in the figure. The experiment shows that the performance of the methods does not vary wildly across runs and that SVAM’s convergence remains faster than that of its competitors.

Figure 4 considers the VAM method with several values of β\beta ranging from very small to very large. It is apparent that no single fixed value of β\beta is able to offer satisfactory results, indicating that dynamically updating the value of β\beta, as done by SVAM, is required for better convergence.

Figure 5 demonstrates recovery of 𝐰{{\mathbf{w}}}^{\ast} under different ways of simulating the adversary and initializing the algorithms. We consider corruption models that differ in how the locations to be corrupted are chosen and in how the chosen points are corrupted. We demonstrate recovery when points are chosen for corruption in three ways: i) at random, ii) using the magnitude of the response, and iii) using the leverage score. The leverage score of a point increases as it lies farther from the mean of the data points. The diagonal elements of the projection matrix P=X(XX)1XP=X^{\top}(XX^{\top})^{-1}X give the respective leverage scores, and the kk data points having the largest leverage scores are chosen for corruption (a sketch of this selection is given below). After selecting the data points to be corrupted, we either i) flip the sign of the response, ii) set the response to a constant value B, or iii) use an adversarial model 𝐰~\tilde{\mathbf{w}} to generate the corrupted response, i.e. set the response to y~i=𝐰~,𝐱i\tilde{y}_{i}=\left\langle{\tilde{\mathbf{w}}},{{{\mathbf{x}}}_{i}}\right\rangle. The initialization was also varied in two ways: i) random initialization, and ii) adversarial initialization where SVAM was initialized at 𝐰~\tilde{{\mathbf{w}}}, the same adversarial model using which corruptions were introduced.
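
A minimal sketch of the leverage-score-based selection described above, with the data matrix stored column-wise as in the notation of Appendix G (X of shape d × n); the function and variable names are illustrative.

```python
import numpy as np

def top_leverage_indices(X, k):
    """X has shape (d, n) with data points as columns. Returns the indices of the
    k columns with the largest leverage scores, i.e. the largest diagonal entries
    of the hat matrix X^T (X X^T)^{-1} X."""
    H = X.T @ np.linalg.solve(X @ X.T, X)   # n x n hat matrix
    leverage = np.diag(H)
    return np.argsort(leverage)[-k:]        # indices of the k largest leverage scores
```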

Experiments were performed by varying the corruption location scheme in {random, absolute magnitude, leverage score}, the corruption type in {adversarial, sign change, set constant} and the initialization in {random, adversarial}. The experimental setting is given in the title of each figure, while all of them use n=1000n=1000, d=10d=10 and α=0.15\alpha=0.15. It can be observed that, in general, SVAM demonstrates superior performance irrespective of the adversarial model and initialization.

It can also be observed that SVAM converges after a single iteration when corruptions are introduced by setting responses to a constant (second column), whereas Tukey’s bisquare method does not converge well when the absolute magnitude of the response is used to select the corruption locations (second row), except in the case of constant-response corruptions.

Appendix E Proof of Theorem 1

Theorem 6 (SVAM convergence - Restated).

Suppose the data and likelihood distribution satisfy the λβ\lambda_{\beta}-LWSC and Λβ\Lambda_{\beta}-LWLC properties for all values of β\beta in the range (0,βmax](0,\beta_{\max}]. Then if SVAM is initialized at a point 𝐰^1\hat{{\mathbf{w}}}^{1} and initial scale β1>0\beta_{1}>0 such that β1𝐰^1𝐰221\beta_{1}\cdot\left\|{\hat{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1 then for any ϵ>1βmax\epsilon>\frac{1}{\beta_{\max}}, for small-enough scale increment ξ>1\xi>1, SVAM ensures 𝐰^T𝐰2ϵ\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\epsilon within T=𝒪(log1ϵ)T={\cal O}\left({{\log\frac{1}{\epsilon}}}\right) iterations.

Proof.

The key to this proof is to maintain the invariant βt𝐰^t𝐰221\beta_{t}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1. Note that initialization is done precisely to ensure this at the beginning of the execution of the algorithm which acts as the base case for an inductive argument. For the inductive case, consider an iteration tt and let βt𝐰^t𝐰21\sqrt{\beta_{t}}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq 1. LWSC ensures strong convexity giving

Q~βt(𝐰^t+1|𝐰^t)Q~βt(𝐰|𝐰^t)Q~βt(𝐰|𝐰^t),𝐰^t+1𝐰+λβt2𝐰^t+1𝐰22\tilde{Q}_{\beta_{t}}(\hat{{\mathbf{w}}}^{t+1}\,|\,\hat{{\mathbf{w}}}^{t})-\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,\hat{{\mathbf{w}}}^{t})\geq\left\langle{\nabla\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,\hat{{\mathbf{w}}}^{t})},{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\rangle+\frac{\lambda_{\beta_{t}}}{2}\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}

Since 𝐰^t+1\hat{{\mathbf{w}}}^{t+1} minimizes Q~βt(|𝐰^t)\tilde{Q}_{\beta_{t}}(\cdot\,|\,\hat{{\mathbf{w}}}^{t}), we have Q~βt(𝐰^t+1|𝐰^t)Q~βt(𝐰|𝐰^t)\tilde{Q}_{\beta_{t}}(\hat{{\mathbf{w}}}^{t+1}\,|\,\hat{{\mathbf{w}}}^{t})\leq\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,\hat{{\mathbf{w}}}^{t}). Elementary manipulations and the Cauchy-Schwarz inequality now give us

𝐰^t+1𝐰22Q~βt(𝐰|𝐰^t)2λβt2Λβtλβt.\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\frac{2\left\|{\nabla\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,\hat{{\mathbf{w}}}^{t})}\right\|_{2}}{\lambda_{\beta_{t}}}\leq\frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}.

Now, we will additionally ensure that we choose the scale increment ξ\xi to be small enough (while still ensuring ξ>1\xi>1) such that 2Λβλβ<1ξβ\frac{2\Lambda_{\beta}}{\lambda_{\beta}}<\sqrt{\frac{1}{\xi\beta}} for all β(0,βmax]\beta\in(0,\beta_{\max}]. Combining this with the above result gives us

𝐰^t+1𝐰221ξβt\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\frac{1}{\xi\beta_{t}}

Thus, if we now set βt+1=ξβt\beta_{t+1}=\xi\beta_{t}, then rearranging the terms in the above inequality tells us that βt+1𝐰^t+1𝐰221\beta_{t+1}\cdot\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1 which lets us continue the inductive argument. Note that this process can continue until we have βtβmax\beta_{t}\leq\beta_{\max} since the LWSC/LWLC properties are assured till that point. Moreover, since βt\beta_{t} goes up by a constant factor at each step and 𝐰^t𝐰221βt\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\frac{1}{\beta_{t}} due to the invariant, a linear rate of convergence is assured, which finishes the proof. The existence of a suitable scale increment ξ\xi satisfying the above requirements is established in a case-wise manner by Theorems 14, 21, 22 and 27. We also note that, as discussed in §4 and elaborated in Appendix I, since gamma regression requires an alternate parameterization owing to its need to support only non-negative labels, the invariant used for the convergence bound for SVAM-Gamma is also slightly altered as mentioned in Thm 4 to instead use βt(exp(𝐰^t𝐰2)1)21\beta_{t}\cdot(\exp(\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2})-1)^{2}\leq 1. ∎
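
For intuition, the iterative scheme analyzed in this proof can be summarized by the following sketch of the SVAM loop, instantiated for robust least-squares regression where the weighted MLE step reduces to a weighted least-squares solve (see the form of the Q~β\tilde{Q}_{\beta} function in Appendix G). This is an illustrative reconstruction under the stated assumptions, not the released implementation.

```python
import numpy as np

def svam_rr(X, y, w_init, beta1, xi, T):
    """Sketch of SVAM for robust least-squares regression.
    X: (n, d) covariates, y: (n,) possibly corrupted responses,
    beta1: initial scale chosen so that beta1 * ||w_init - w*||^2 <= 1,
    xi > 1: scale increment, T: number of iterations."""
    w, beta = w_init.copy(), beta1
    for _ in range(T):
        r = y - X @ w                             # residuals under the current model
        s = np.exp(-0.5 * beta * r ** 2)          # small weights on poorly-fit (likely corrupted) points
        Xs = X * s[:, None]
        w = np.linalg.solve(Xs.T @ X, Xs.T @ y)   # weighted MLE = weighted least squares
        beta *= xi                                # variance-altering scale update beta_{t+1} = xi * beta_t
    return w
```

For other GLMs the inner solve is the corresponding weighted MLE (e.g. a weighted logistic or gamma regression) rather than a closed-form least-squares solve.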

Appendix F Some Helpful Results

Below we present a few helpful results.

Lemma 7.

Suppose ϵi𝒩(𝟎,I),i=1,,n\text{\boldmath$\mathbf{\epsilon}$}^{i}\sim{\mathcal{N}}({\mathbf{0}},I),i=1,\ldots,n and denote RX:=maxinϵi2R_{X}:=\max_{i\in n}\left\|{\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}. Then we have RXnR_{X}\leq\sqrt{n} with probability at least 1exp(Ω(n))1-\exp(-\Omega\left({{n}}\right)).

Proof.

Follows from standard arguments. ∎

Lemma 8.

For covariate vectors X=[𝐱1,,𝐱n]X=\left[{{{\mathbf{x}}}_{1},\ldots,{{\mathbf{x}}}_{n}}\right] generated from an isotropic sub-Gaussian distribution, for any fixed set S[n]S\subset[n] and n=Ω(d)n=\Omega\left({{d}}\right), with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)),

0.99|S|λmin(XSXS)λmax(XSXS)1.01|S|,0.99\left|{S}\right|\leq\lambda_{\min}(X_{S}X_{S}^{\top})\leq\lambda_{\max}(X_{S}X_{S}^{\top})\leq 1.01\left|{S}\right|,

where the constant inside Ω()\Omega\left({{\cdot}}\right) depends only on the sub-Gaussian distribution and universal constants.

Proof.

Taken from [1]. ∎

Lemma 9.

If ϵ𝒩(𝟎,1βI)\text{\boldmath$\mathbf{\epsilon}$}\sim{\mathcal{N}}({\mathbf{0}},\frac{1}{\beta^{\ast}}\cdot I), then for any function f:df:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}, any β>0\beta>0 and any 𝚫d\text{\boldmath$\mathbf{\Delta}$}\in{\mathbb{R}}^{d}, we have

𝔼[exp(β2ϵ𝚫22)f(ϵ)]=1c~𝔼[f(𝐱)],{\mathbb{E}}\left[{{\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})}}\right]=\frac{1}{\tilde{c}}\cdot{\mathbb{E}}\left[{{f({{\mathbf{x}}})}}\right],

where 𝐱𝒩(ββ+β𝚫,1β+βI){{\mathbf{x}}}\sim{\mathcal{N}}\left({\frac{\beta}{\beta+\beta^{\ast}}\text{\boldmath$\mathbf{\Delta}$},\frac{1}{\beta+\beta^{\ast}}\cdot I}\right) and c~=(β+ββ)dexp(ββ2(β+β)𝚫22)\tilde{c}=\left({\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}}\right)^{d}\exp\left({\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right).

Proof.

We have

𝔼[exp(β2ϵ𝚫22)f(ϵ)]=(β2π)d∫⋯∫dexp(β2ϵ𝚫22)f(ϵ)exp(β2ϵ22)𝑑ϵ\displaystyle{\mathbb{E}}\left[{{\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})}}\right]=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\cdot\exp\left({-\frac{\beta^{\ast}}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}^{2}}\right)\ d\text{\boldmath$\mathbf{\epsilon}$} =(β2π)dexp(β2𝚫22)∫⋯∫dexp(β+β2ϵ22+βϵ𝚫)f(ϵ)𝑑ϵ\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\frac{\beta+\beta^{\ast}}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}^{2}+\beta\text{\boldmath$\mathbf{\epsilon}$}^{\top}\text{\boldmath$\mathbf{\Delta}$}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\ d\text{\boldmath$\mathbf{\epsilon}$} =(β2π)dexp(β2𝚫22+β22(β+β)𝚫22)∫⋯∫dexp(β+β2ϵβ2(β+β)𝚫22)f(ϵ)𝑑ϵ\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}+\frac{\beta^{2}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\left\|{\sqrt{\frac{\beta+\beta^{\ast}}{2}}\text{\boldmath$\mathbf{\epsilon}$}-\frac{\beta}{\sqrt{2(\beta+\beta^{\ast})}}\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\ d\text{\boldmath$\mathbf{\epsilon}$} =(β2π)dexp(ββ2(β+β)𝚫22)∫⋯∫dexp(β+β2ϵββ+β𝚫22)f(ϵ)𝑑ϵ\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\frac{\beta+\beta^{\ast}}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\frac{\beta}{\beta+\beta^{\ast}}\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\ d\text{\boldmath$\mathbf{\epsilon}$} =(β2π)d(2πβ+β)dexp(ββ2(β+β)𝚫22)𝔼[f(𝐱)]\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\left({\sqrt{\frac{2\pi}{\beta+\beta^{\ast}}}}\right)^{d}\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right){\mathbb{E}}\left[{{f({{\mathbf{x}}})}}\right]

which finishes the proof upon using c~(β2π)d(2πβ+β)dexp(ββ2(β+β)𝚫22)=1\tilde{c}\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\left({\sqrt{\frac{2\pi}{\beta+\beta^{\ast}}}}\right)^{d}\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)=1. ∎
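
As an informal sanity check (not needed for the proof), the identity of Lemma 9 can be verified by Monte Carlo simulation; the snippet below compares both sides for an arbitrary test function f and illustrative parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta, beta_star = 3, 2.0, 5.0
Delta = rng.standard_normal(d)
f = lambda z: np.sin(z[:, 0]) + z[:, 1] ** 2       # an arbitrary test function

m = 1_000_000
eps = rng.standard_normal((m, d)) / np.sqrt(beta_star)          # eps ~ N(0, I / beta*)
lhs = np.mean(np.exp(-0.5 * beta * np.sum((eps - Delta) ** 2, axis=1)) * f(eps))

x = (beta / (beta + beta_star)) * Delta + rng.standard_normal((m, d)) / np.sqrt(beta + beta_star)
c_tilde = ((beta + beta_star) / beta_star) ** (d / 2) * np.exp(
    beta * beta_star / (2 * (beta + beta_star)) * np.sum(Delta ** 2))
rhs = np.mean(f(x)) / c_tilde

print(lhs, rhs)    # the two estimates should agree up to Monte Carlo error
```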

Lemma 10.

If ϵ𝒩(𝟎,1βI)\text{\boldmath$\mathbf{\epsilon}$}\sim{\mathcal{N}}({\mathbf{0}},\frac{1}{\beta^{\ast}}\cdot I), then for any function f:df:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}, any β>0\beta>0 and any 𝚫d\text{\boldmath$\mathbf{\Delta}$}\in{\mathbb{R}}^{d}, we have

𝔼[exp(βϵ𝚫)f(ϵ)]=q~𝔼[f(𝐱)],{\mathbb{E}}\left[{{\exp\left({\beta\text{\boldmath$\mathbf{\epsilon}$}^{\top}\text{\boldmath$\mathbf{\Delta}$}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})}}\right]=\tilde{q}\cdot{\mathbb{E}}\left[{{f({{\mathbf{x}}})}}\right],

where 𝐱𝒩(ββ𝚫,1βI){{\mathbf{x}}}\sim{\mathcal{N}}\left({\frac{\beta}{\beta^{\ast}}\text{\boldmath$\mathbf{\Delta}$},\frac{1}{\beta^{\ast}}\cdot I}\right) and q~=exp(β22β𝚫22)\tilde{q}=\exp\left({\frac{\beta^{2}}{2\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right).

Proof.

We have

𝔼[exp(βϵ𝚫)f(ϵ)]=(β2π)d∫⋯∫dexp(βϵ𝚫)f(ϵ)exp(β2ϵ22)𝑑ϵ\displaystyle{\mathbb{E}}\left[{{\exp\left({\beta\text{\boldmath$\mathbf{\epsilon}$}^{\top}\text{\boldmath$\mathbf{\Delta}$}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})}}\right]=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\idotsint_{{\mathbb{R}}^{d}}\exp\left({\beta\text{\boldmath$\mathbf{\epsilon}$}^{\top}\text{\boldmath$\mathbf{\Delta}$}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\cdot\exp\left({-\frac{\beta^{\ast}}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}^{2}}\right)\ d\text{\boldmath$\mathbf{\epsilon}$}
=(β2π)dexp(β22β𝚫22)∫⋯∫dexp(12βϵββ𝚫22)f(ϵ)𝑑ϵ\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\exp\left({\frac{\beta^{2}}{2\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\frac{1}{2}\left\|{\sqrt{\beta^{\ast}}\text{\boldmath$\mathbf{\epsilon}$}-\frac{\beta}{\sqrt{\beta^{\ast}}}\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\ d\text{\boldmath$\mathbf{\epsilon}$}
=(β2π)dexp(β22β𝚫22)∫⋯∫dexp(β2ϵββ𝚫22)f(ϵ)𝑑ϵ\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{2\pi}}}\right)^{d}\exp\left({\frac{\beta^{2}}{2\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\idotsint_{{\mathbb{R}}^{d}}\exp\left({-\frac{\beta^{\ast}}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\frac{\beta}{\beta^{\ast}}\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\cdot f(\text{\boldmath$\mathbf{\epsilon}$})\ d\text{\boldmath$\mathbf{\epsilon}$}

which finishes the proof. ∎

Lemma 11.

If ϵ𝒩(𝟎,1βI)\text{\boldmath$\mathbf{\epsilon}$}\sim{\mathcal{N}}({\mathbf{0}},\frac{1}{\beta^{\ast}}\cdot I), then for any constant C>0C>0 and fixed vectors 𝐮,𝐯{{\mathbf{u}}},{{\mathbf{v}}} such that 𝐮2=u\left\|{{{\mathbf{u}}}}\right\|_{2}=u and 𝐯2=v\left\|{{{\mathbf{v}}}}\right\|_{2}=v, we have

𝔼[11+Cexp(β𝐯ϵ)𝐮ϵ]min{C,1C}exp(β2v22β)βvβu{\mathbb{E}}\left[{{\frac{1}{1+C\exp\left({\beta{{\mathbf{v}}}^{\top}\text{\boldmath$\mathbf{\epsilon}$}}\right)}\cdot{{\mathbf{u}}}^{\top}\text{\boldmath$\mathbf{\epsilon}$}}}\right]\leq\min\left\{{C,\frac{1}{C}}\right\}\exp\left({\frac{\beta^{2}v^{2}}{2\beta^{\ast}}}\right)\frac{\beta v}{\beta^{\ast}}u
Proof.

We begin by analyzing the vector 𝝂=𝔼[11+Cexp(β𝐯ϵ)ϵ]\text{\boldmath$\mathbf{\nu}$}={\mathbb{E}}\left[{{\frac{1}{1+C\exp\left({\beta{{\mathbf{v}}}^{\top}\text{\boldmath$\mathbf{\epsilon}$}}\right)}\cdot\text{\boldmath$\mathbf{\epsilon}$}}}\right] itself. Note that due to the rotational symmetry of the Gaussian distribution, we can, w.l.o.g. assume that 𝐯=(v,0,0,,0){{\mathbf{v}}}=(v,0,0,\ldots,0). This means that 𝐯ϵ=vϵ1{{\mathbf{v}}}^{\top}\text{\boldmath$\mathbf{\epsilon}$}=v\cdot\text{\boldmath$\mathbf{\epsilon}$}_{1}. Thus, the ithi\text{${}^{\text{th}}$} coordinate of the vector 𝝂\mathbf{\nu} i.e. 𝝂i=𝔼[11+Cexp(βvϵ1)ϵi]\text{\boldmath$\mathbf{\nu}$}_{i}={\mathbb{E}}\left[{{\frac{1}{1+C\exp\left({\beta v\cdot\text{\boldmath$\mathbf{\epsilon}$}_{1}}\right)}\cdot\text{\boldmath$\mathbf{\epsilon}$}_{i}}}\right]. Thus, by independence and unbiased-ness of the coordinates of a Gaussian vector, we have 𝝂i=0\text{\boldmath$\mathbf{\nu}$}_{i}=0 for all i1i\neq 1. So all we are left to analyze is 𝝂1\text{\boldmath$\mathbf{\nu}$}_{1}. We have

|𝝂1|\displaystyle\left|{\text{\boldmath$\mathbf{\nu}$}_{1}}\right| =β2π|exp(βϵ22)ϵ1+Cexp(βvϵ)𝑑ϵ|\displaystyle=\sqrt{\frac{\beta^{\ast}}{2\pi}}\left|{\int_{\mathbb{R}}\frac{\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon}{1+C\exp(\beta v\epsilon)}\ d\epsilon}\right|
=β2π0exp(βϵ22)ϵC(exp(βvϵ)exp(βvϵ))1+C2+C(exp(βvϵ)+exp(βvϵ))𝑑ϵ\displaystyle=\sqrt{\frac{\beta^{\ast}}{2\pi}}\int_{0}^{\infty}\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon\cdot\frac{C(\exp(\beta v\epsilon)-\exp(-\beta v\epsilon))}{1+C^{2}+C(\exp(\beta v\epsilon)+\exp(-\beta v\epsilon))}\ d\epsilon
β2π0exp(βϵ22)ϵCexp(βvϵ)1+C2+Cexp(βvϵ)𝑑ϵ\displaystyle\leq\sqrt{\frac{\beta^{\ast}}{2\pi}}\int_{0}^{\infty}\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon\cdot\frac{C\exp(\beta v\epsilon)}{1+C^{2}+C\exp(\beta v\epsilon)}\ d\epsilon
β2π0exp(βϵ22)ϵCexp(βvϵ)1+C2𝑑ϵ\displaystyle\leq\sqrt{\frac{\beta^{\ast}}{2\pi}}\int_{0}^{\infty}\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon\cdot\frac{C\exp(\beta v\epsilon)}{1+C^{2}}\ d\epsilon
min{C,1C}β2π0exp(βϵ22)ϵexp(βvϵ)𝑑ϵ\displaystyle\leq\min\left\{{C,\frac{1}{C}}\right\}\sqrt{\frac{\beta^{\ast}}{2\pi}}\int_{0}^{\infty}\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon\cdot\exp(\beta v\epsilon)\ d\epsilon
min{C,1C}β2πexp(βϵ22)ϵexp(βvϵ)𝑑ϵ\displaystyle\leq\min\left\{{C,\frac{1}{C}}\right\}\sqrt{\frac{\beta^{\ast}}{2\pi}}\int_{\mathbb{R}}\exp\left({-\frac{\beta^{\ast}\epsilon^{2}}{2}}\right)\epsilon\cdot\exp(\beta v\epsilon)\ d\epsilon
=min{C,1C}𝔼[exp(βvϵ)ϵ]\displaystyle=\min\left\{{C,\frac{1}{C}}\right\}{\mathbb{E}}\left[{{\exp(\beta v\epsilon)\cdot\epsilon}}\right]
=min{C,1C}exp(β2v22β)βvβ\displaystyle=\min\left\{{C,\frac{1}{C}}\right\}\exp\left({\frac{\beta^{2}v^{2}}{2\beta^{\ast}}}\right)\frac{\beta v}{\beta^{\ast}}

since C1+C2min{C,1C}\frac{C}{1+C^{2}}\leq\min\left\{{C,\frac{1}{C}}\right\} and we used Lemma 10 in the last step since that lemma is independent of the dimensionality of the Gaussian vector. Applying the Cauchy-Schwarz inequality then gives us the result. ∎

Lemma 12.

Let β,V>2\beta,V>2, then we have

exp(x22)1+exp(βV(Vx))𝑑xexp(x22)1+exp(βV(Vx))x𝑑x}exp(Ω(V2))\left.\begin{array}[]{c}\int_{\mathbb{R}}\frac{\exp\left({-\frac{x^{2}}{2}}\right)}{1+\exp\left({\beta V(V-x)}\right)}\ dx\\ \int_{\mathbb{R}}\frac{\exp\left({-\frac{x^{2}}{2}}\right)}{1+\exp\left({\beta V(V-x)}\right)}x\ dx\end{array}\right\}\leq\exp(-\Omega\left({{V^{2}}}\right))
Proof.

By completing squares we have

exp(x22)1+exp(βV(Vx))=exp(x22)exp(βx24)exp(βx24)+exp(β4(x2V)2)\frac{\exp\left({-\frac{x^{2}}{2}}\right)}{1+\exp\left({\beta V(V-x)}\right)}=\exp\left({-\frac{x^{2}}{2}}\right)\frac{\exp\left({\frac{\beta x^{2}}{4}}\right)}{\exp\left({\frac{\beta x^{2}}{4}}\right)+\exp\left({\frac{\beta}{4}\left({x-2V}\right)^{2}}\right)}

Now, we consider two cases

  1. 1.

    Case 1 (x<V2)(x<\frac{V}{2}): In this case exp(βx24)exp(βV216)\exp\left({\frac{\beta x^{2}}{4}}\right)\leq\exp\left({\frac{\beta V^{2}}{16}}\right) whereas exp(β4(x2V)2)exp(9βV216)\exp\left({\frac{\beta}{4}\left({x-2V}\right)^{2}}\right)\geq\exp\left({\frac{9\beta V^{2}}{16}}\right). Thus, we have exp(βx24)exp(βx24)+exp(β4(x2V)2)exp(βx24)exp(β4(x2V)2)exp(βV22)\frac{\exp\left({\frac{\beta x^{2}}{4}}\right)}{\exp\left({\frac{\beta x^{2}}{4}}\right)+\exp\left({\frac{\beta}{4}\left({x-2V}\right)^{2}}\right)}\leq\frac{\exp\left({\frac{\beta x^{2}}{4}}\right)}{\exp\left({\frac{\beta}{4}\left({x-2V}\right)^{2}}\right)}\leq\exp\left({-\frac{\beta V^{2}}{2}}\right) in this region. Using standard Gaussian integrals we conclude that the region (,V2)\left({-\infty,\frac{V}{2}}\right) contributes at most 𝒪(exp(βV22))exp(Ω(V2)){\cal O}\left({{\exp\left({-\frac{\beta V^{2}}{2}}\right)}}\right)\leq\exp(-\Omega\left({{V^{2}}}\right)) (since β>2\beta>2) to both integrals.

  2. 2.

    Case 2 (x>V2)(x>\frac{V}{2}): In this case we simply bound exp(βx24)exp(βx24)+exp(β4(x2V)2)1\frac{\exp\left({\frac{\beta x^{2}}{4}}\right)}{\exp\left({\frac{\beta x^{2}}{4}}\right)+\exp\left({\frac{\beta}{4}\left({x-2V}\right)^{2}}\right)}\leq 1 and use standard bounds on the complementary error function to conclude that the contribution of the region (V2,)\left({\frac{V}{2},\infty}\right) to both integrals is at most exp(Ω(V2))\exp\left({-\Omega\left({{V^{2}}}\right)}\right).

Lemma 13.

Suppose ϵ𝒩(𝟎,1βI)\text{\boldmath$\mathbf{\epsilon}$}\sim{\mathcal{N}}\left({{\mathbf{0}},\frac{1}{\beta^{\ast}}\cdot I}\right) and 𝐯,𝚫{{\mathbf{v}}},\text{\boldmath$\mathbf{\Delta}$} are fixed vectors such that β𝚫21\sqrt{\beta}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq 1. Then the random variable X:=exp(β2ϵ𝚫22)ϵ𝐯X:=\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\text{\boldmath$\mathbf{\epsilon}$}^{\top}{{\mathbf{v}}} has a subexponential constant at most

2β+d(ββ+β)dexp(ββ2(β+β)𝚫22)\frac{2}{\sqrt{\beta+d}}\left({\sqrt{\frac{\beta^{\ast}}{\beta+\beta^{\ast}}}}\right)^{d}\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)
Proof.

The subexponential constant of a random variable XX is defined as the value Xψ1:=supp11p(𝔼[|X|p])1/p\left\|{X}\right\|_{\psi_{1}}:=\sup_{p\geq 1}\frac{1}{p}\left({{\mathbb{E}}\left[{{\left|{X}\right|^{p}}}\right]}\right)^{1/p}. In contrast, the subGaussian constant of a random variable XX is defined as the value Xψ2:=supp11p(𝔼[|X|p])1/p\left\|{X}\right\|_{\psi_{2}}:=\sup_{p\geq 1}\frac{1}{\sqrt{p}}\left({{\mathbb{E}}\left[{{\left|{X}\right|^{p}}}\right]}\right)^{1/p}. Using Lemma 9 gives us

𝔼[|X|p]\displaystyle{\mathbb{E}}\left[{{\left|{X}\right|^{p}}}\right] =𝔼[exp(pβ2ϵ𝚫22)(ϵ𝐯)p]\displaystyle={\mathbb{E}}\left[{{\exp\left({-\frac{p\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)(\text{\boldmath$\mathbf{\epsilon}$}^{\top}{{\mathbf{v}}})^{p}}}\right]
=(βpβ+β)dexp(pββ2(pβ+β)𝚫22)𝔼[(𝐱𝐯)p],\displaystyle=\left({\sqrt{\frac{\beta^{\ast}}{p\beta+\beta^{\ast}}}}\right)^{d}\exp\left({-\frac{p\beta\beta^{\ast}}{2(p\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right){\mathbb{E}}\left[{{({{\mathbf{x}}}^{\top}{{\mathbf{v}}})^{p}}}\right],

where 𝐱𝒩(pβpβ+β𝚫,1pβ+βI){{\mathbf{x}}}\sim{\mathcal{N}}\left({\frac{p\beta}{p\beta+\beta^{\ast}}\text{\boldmath$\mathbf{\Delta}$},\frac{1}{p\beta+\beta^{\ast}}\cdot I}\right). Now by virtue of 𝐱{{\mathbf{x}}} being a Gaussian and using the triangle inequality for the subGaussian norm, we know that the random variable 𝐱𝐯{{\mathbf{x}}}^{\top}{{\mathbf{v}}} is (pβpβ+β𝚫2+1pβ+β)\left({\frac{p\beta}{p\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}+\frac{1}{\sqrt{p\beta+\beta^{\ast}}}}\right)-subGaussian. This, in turn implies that

(𝔼[(𝐱𝐯)p])1/pp(pβpβ+β𝚫2+1pβ+β)p(pβpβ+β+1pβ+β)\left({{\mathbb{E}}\left[{{({{\mathbf{x}}}^{\top}{{\mathbf{v}}})^{p}}}\right]}\right)^{1/p}\leq\sqrt{p}\cdot\left({\frac{p\beta}{p\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}+\frac{1}{\sqrt{p\beta+\beta^{\ast}}}}\right)\leq\sqrt{p}\cdot\left({\frac{p\sqrt{\beta}}{p\beta+\beta^{\ast}}+\frac{1}{\sqrt{p\beta+\beta^{\ast}}}}\right)

Thus, we have

1p(𝔼[|X|p])1/p1p(βpβ+β)d/pexp(ββ2(pβ+β)𝚫22)(pβpβ+β+1pβ+β)(A)\displaystyle\frac{1}{p}\left({{\mathbb{E}}\left[{{\left|{X}\right|^{p}}}\right]}\right)^{1/p}\leq\frac{1}{\sqrt{p}}\cdot\underbrace{\left({\sqrt{\frac{\beta^{\ast}}{p\beta+\beta^{\ast}}}}\right)^{d/p}\exp\left({-\frac{\beta\beta^{\ast}}{2(p\beta+\beta^{\ast})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\left({\frac{p\sqrt{\beta}}{p\beta+\beta^{\ast}}+\frac{1}{\sqrt{p\beta+\beta^{\ast}}}}\right)}_{(A)}

Now, (A)(A) is a bounded function whereas 1p\frac{1}{\sqrt{p}} is a decreasing function. Thus, 1p(𝔼[|X|p])1/p\frac{1}{p}\left({{\mathbb{E}}\left[{{\left|{X}\right|^{p}}}\right]}\right)^{1/p} is a decreasing function of pp and hence achieves its maximum value in the range p1p\geq 1 at p=1p=1 itself. Noting that (ββ+β+1β+β)2β+d\left({\frac{\sqrt{\beta}}{\beta+\beta^{\ast}}+\frac{1}{\sqrt{\beta+\beta^{\ast}}}}\right)\leq\frac{2}{\sqrt{\beta+d}} finishes the proof. ∎

Appendix G Robust Regression

For this proof we use the notation X=[𝐱1,,𝐱n]d×n,𝐲=[y1,,yn]n,𝐛=[b1,,bn]nX=[{{\mathbf{x}}}^{1},\ldots,{{\mathbf{x}}}^{n}]\in{\mathbb{R}}^{d\times n},{{\mathbf{y}}}=[y_{1},\ldots,y_{n}]\in{\mathbb{R}}^{n},{{\mathbf{b}}}=[b_{1},\ldots,b_{n}]\in{\mathbb{R}}^{n}. For any vector 𝐯m{{\mathbf{v}}}\in{\mathbb{R}}^{m} and any set T[m]T\subseteq[m], 𝐯T{{\mathbf{v}}}_{T} denotes the vector with all coordinates other than those in the set TT zeroed out. Similarly, for any matrix Ak×m,ATA\in{\mathbb{R}}^{k\times m},A_{T} denotes the matrix with all columns other than those in the set TT zeroed out. We will let G,BG,B respectively denote the set of “good” uncorrupted points and “bad” corrupted points. We will abuse notation to let G=(1α)nG=(1-\alpha)\cdot n and B=αnB=\alpha\cdot n respectively denote the number of good and bad points too.

Theorem 14 (Theorem 2 restated – Partially Adaptive Adversary).

For data generated in the robust regression model as described in §4, suppose corruptions are introduced by a partially adaptive adversary i.e. the locations of the corruptions (the set BB) are not decided adversarially but the corruptions themselves are decided jointly, adversarially and may be unbounded, then SVAM-RR enjoys a breakdown point of 0.18660.1866, i.e. it ensures a bounded 𝒪(1){\cal O}\left({{1}}\right) error even if k=αnk=\alpha\cdot n corruptions are introduced where the value of α\alpha can go up to at least 0.18660.1866. More generally, for corruption rates α0.1866\alpha\leq 0.1866, there always exist values of the scale increment ξ>1\xi>1 s.t. with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)), the LWSC/LWLC conditions are satisfied for the Q~β\tilde{Q}_{\beta} function corresponding to the robust least squares model for β\beta values at least as large as βmax=𝒪(βmin{1α2/3,ndlog(n)})\beta_{\max}={\cal O}\left({{\beta^{\ast}\min\left\{{\frac{1}{\alpha^{2/3}},\sqrt{\frac{n}{d\log(n)}}}\right\}}}\right).
Hybrid Corruption Model: If initialized with 𝐰^1,β1\hat{{\mathbf{w}}}^{1},\beta^{1} s.t. β1𝐰^1𝐰221\beta_{1}\cdot\left\|{\hat{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1, SVAM-RR assures

𝐰^T𝐰22𝒪(1βmax{α2/3,dlog(n)n})\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq{\cal O}\left({{\frac{1}{\beta^{\ast}}\max\left\{{\alpha^{2/3},\sqrt{\frac{d\log(n)}{n}}}\right\}}}\right)

within T𝒪(lognβ1)T\leq{\cal O}\left({{\log\frac{n}{\beta^{1}}}}\right) iterations for the hybrid corruption model where even points uncorrupted by the adversary receive Gaussian noise with variance 1β\frac{1}{\beta^{\ast}} i.e. yi=𝐰,𝐱i+ϵiy_{i}=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle+\epsilon_{i} for iGi\in G where ϵi𝒩(0,1β)\epsilon_{i}\sim{\mathcal{N}}\left({0,\frac{1}{\beta^{\ast}}}\right).
Pure Corruption Model: For the pure corruption model where uncorrupted points receive no Gaussian noise i.e. yi=𝐰,𝐱iy_{i}=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle for iGi\in G, SVAM-RR assures exact model recovery. Specifically, for any ϵ>0\epsilon>0, SVAM-RR assures

𝐰^T𝐰22ϵ\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\epsilon

within T𝒪(log1ϵβ1)T\leq{\cal O}\left({{\log\frac{1}{\epsilon\beta^{1}}}}\right) iterations.

Proof.

For any two models 𝐯,𝐰{{\mathbf{v}}},{{\mathbf{w}}}, the Q~β\tilde{Q}_{\beta} function for robust least squares has the following form

Q~β(𝐯|𝐰)=i=1nsi(𝐯,𝐱iyi)2,\tilde{Q}_{\beta}({{\mathbf{v}}}\,|\,{{\mathbf{w}}})=\sum_{i=1}^{n}s_{i}\cdot\left({\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}^{i}}\right\rangle-y_{i}}\right)^{2},

where siexp(β2(yi𝐱i,𝐰)2)s_{i}\leftarrow\exp\left({-\frac{\beta}{2}(y_{i}-\left\langle{{{\mathbf{x}}}^{i}},{{{\mathbf{w}}}}\right\rangle)^{2}}\right). We first outline the proof below.

Proof Outline. This proof has three key elements

  1. 1.

We will establish the LWSC and LWLC properties for any fixed value of β>0\beta>0 with probability 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)). As promised in the statement of Theorem 2, we will execute SVAM-RR for no more than 𝒪(logn){\cal O}\left({{\log n}}\right) iterations; taking a naive union bound over these iterations would offer a confidence level of 1lognexp(Ω(d))1-\log n\exp(-\Omega\left({{d}}\right)). However, this can be improved by noticing that the confidence levels offered by the LWSC/LWLC results are actually of the form 1exp(Ω(nζ2dlogn))1-\exp(-\Omega\left({{n\zeta^{2}-d\log n}}\right)). Thus, a union over 𝒪(logn){\cal O}\left({{\log n}}\right) such events will at best deteriorate the confidence bounds to 1lognexp(Ω(nζ2dlogn))=1exp(Ω(nζ2dlognloglogn))1-\log n\exp(-\Omega\left({{n\zeta^{2}-d\log n}}\right))=1-\exp(-\Omega\left({{n\zeta^{2}-d\log n-\log\log n}}\right)), which is still 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) for the values of ζ\zeta we shall set.

  2. 2.

The key to this proof is to maintain the invariant βt𝐰^t𝐰221\beta_{t}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1. Recall that initialization ensures β1𝐰^1𝐰221\beta_{1}\cdot{\left\|{\hat{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}}\leq 1 to start things off. §3 gives details on how to initialize in practice. This establishes the base case of an inductive argument. Next, inductively assuming that βt𝐰^t𝐰221\beta_{t}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1 for an iteration tt, we will establish that 𝐰^t+1𝐰22Λβtλβt(A)βt\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}\leq\frac{(A)}{\sqrt{\beta_{t}}} where (A)(A) will be an application-specific expression derived below.

  3. 3.

We will then ensure that (A)<1(A)<1, say (A)=1/ξ(A)=1/\sqrt{\xi} for some ξ>1\xi>1, whenever the number of corruptions is below the breakdown point. This ensures 𝐰^t+1𝐰221ξβt\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\frac{1}{{\xi\beta_{t}}}, in other words, βt+1𝐰^t+1𝐰221{\beta_{t+1}}\cdot{\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}}\leq 1 for βt+1=ξβt\beta_{t+1}=\xi\cdot\beta_{t} so that the invariant is preserved. However, notice that the above step simultaneously ensures that 2Λβtλβt1ξβt\frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}\leq\frac{1}{\sqrt{\xi\beta_{t}}}. This ensures that a valid value of the scale increment ξ\xi can always be found as long as βtβmax\beta_{t}\leq\beta_{\max}. Specifically, we will be able to assure the existence of a scale increment ξ>1\xi>1 satisfying the conditions of Theorem 1 w.r.t. the LWSC/LWLC results only for β<𝒪(βmin{1α2/3,ndlog(n)})\beta<{\cal O}\left({{\beta^{\ast}\min\left\{{\frac{1}{\alpha^{2/3}},\sqrt{\frac{n}{d\log(n)}}}\right\}}}\right).

We now present the proof. Lemmata 15,16 establish the LWSC/LWLC properties for the Q~β\tilde{Q}_{\beta} function for robust least squares regression. Let 𝚫:=𝐰^t𝐰\text{\boldmath$\mathbf{\Delta}$}:=\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}. By Lemma 8, with probability at least 1exp(Ω(nd))1-\exp(-\Omega\left({{n-d}}\right)), we have XB2=λmax(XBXB)1.01B\left\|{X_{B}}\right\|_{2}=\sqrt{\lambda_{\max}(X_{B}X_{B}^{\top})}\leq\sqrt{1.01B}. The proof of Lemma 16 tells us that with the same probability, we have

S𝐛2B2π[(β𝚫21.01)1/3+(1e)1/3]32B2π[(κ21.01)1/3+(1e)1/3]32.\left\|{S{{\mathbf{b}}}}\right\|_{2}\leq\sqrt{\frac{B}{2\pi}}[(\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}1.01)^{1/3}+(\frac{1}{e})^{1/3}]^{\frac{3}{2}}\leq\sqrt{\frac{B}{2\pi}}[(\kappa^{2}1.01)^{1/3}+(\frac{1}{e})^{1/3}]^{\frac{3}{2}}.

On the other hand, the proof of Lemma 16 also tells us that with probability at least 1exp(Ω(nν2dlog1νdlog(n)))1-\exp\left({-\Omega\left({{n\nu^{2}-d\log\frac{1}{\nu}-d\log(n)}}\right)}\right) we have,

XSϵ2=XGSGϵGG(1+ν)κ22πββ+β1g~.\left\|{XS\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}=\left\|{X_{G}S_{G}\text{\boldmath$\mathbf{\epsilon}$}_{G}}\right\|\leq G(1+\nu)\sqrt{\frac{\kappa^{2}}{2\pi}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\tilde{g}}.

By Lemma 15, with probability at least 1exp(Ω(nζ2dlog1ζdlog(n)))1-\exp\left({-\Omega\left({{n\zeta^{2}-d\log\frac{1}{\zeta}-d\log(n)}}\right)}\right), we have λmin(XSX)λmin(XGSGXG)β2π(1ζ)1g~G\lambda_{\min}(XSX^{\top})\geq\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})\geq\sqrt{\frac{\beta}{2\pi}}(1-\zeta)\frac{1}{\tilde{g}}\cdot G. This gives us,

𝐰^t+1𝐰2B1.012π[(1.01κ2)1/3+(1e)1/3]32+G(1+ν)κ22πββ+β1g~β2π(1ζ)1g~G\displaystyle\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\frac{B\sqrt{\frac{1.01}{2\pi}}[(1.01\kappa^{2})^{1/3}+(\frac{1}{e})^{1/3}]^{\frac{3}{2}}+G(1+\nu)\sqrt{\frac{\kappa^{2}}{2\pi}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\tilde{g}}}{\sqrt{\frac{\beta}{2\pi}}(1-\zeta)\frac{1}{\tilde{g}}\cdot G} =κβ11ζ(α1αg~1.01[(1.01)1/3+(eκ2)1/3]32+(1+ν)ββ+β)\displaystyle=\frac{\kappa}{\sqrt{\beta}}\cdot\frac{1}{1-\zeta}\left({\frac{\alpha}{1-\alpha}\tilde{g}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}+(1+\nu)\frac{\beta}{\beta+\beta^{\ast}}}\right) =κβ11ζ(α1αβ+ββ(1+βββ+β𝚫22)3/21.01[(1.01)1/3+(eκ2)1/3]32+(1+ν)ββ+β)\displaystyle=\frac{\kappa}{\sqrt{\beta}}\cdot\frac{1}{1-\zeta}\left({\frac{\alpha}{1-\alpha}\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}+(1+\nu)\frac{\beta}{\beta+\beta^{\ast}}}\right) κβ11ζ(α1α1+ββ(1+βκ2β+β)3/21.01[(1.01)1/3+(eκ2)1/3]32+(1+ν)ββ+β)\displaystyle\leq\frac{\kappa}{\sqrt{\beta}}\cdot\frac{1}{1-\zeta}\left({\frac{\alpha}{1-\alpha}\sqrt{1+\frac{\beta}{\beta^{\ast}}}{\left({1+\frac{\beta^{\ast}\kappa^{2}}{\beta+\beta^{\ast}}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}+(1+\nu)\frac{\beta}{\beta+\beta^{\ast}}}\right) κβ11ζ(α1α1+ββ(1+κ2)3/21.01[(1.01)1/3+(eκ2)1/3]32+(1+ν)ββ+β)(A)\displaystyle\leq\frac{\kappa}{\sqrt{\beta}}\cdot\underbrace{\frac{1}{1-\zeta}\left({\frac{\alpha}{1-\alpha}\sqrt{1+\frac{\beta}{\beta^{\ast}}}{\left({1+\kappa^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}+(1+\nu)\frac{\beta}{\beta+\beta^{\ast}}}\right)}_{(A)}

where in the last two steps, we used that β𝚫2κ\sqrt{\beta}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq\kappa and ββ+β1\frac{\beta^{\ast}}{\beta+\beta^{\ast}}\leq 1. We shall set κ=0.47\kappa=0.47 below. Now, in order to satisfy the conditions of Theorem 1, we need to assure the existence of a scale increment ξ>1\xi>1 that assures a linear rate of convergence. Since SVAM-RR always maintains the invariant βt𝐰^t𝐰221{\beta_{t}}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq 1 and κ<1\kappa<1, assuring the existence of a scale increment ξ>1\xi>1 is easily seen to be equivalent to showing that (A)<1(A)<1. This is done below and gives us our breakdown point.

  1. 1.

    Breakdown Point: In order to ensure a linear rate of convergence, we need only ensure (A)<1(A)<1. If we set ν,ζ\nu,\zeta to small constants (which we can always do for large enough nn) and also set ββ\frac{\beta}{\beta^{\ast}} to a small constant (which still allows us to offer an error 𝚫2𝒪(1β)\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq{\cal O}\left({{\frac{1}{\sqrt{\beta^{\ast}}}}}\right)), then we can get (A)<1(A)<1 if

    α1α(1+κ2)3/21.01[(1.01)1/3+(eκ2)1/3]32<1\frac{\alpha}{1-\alpha}{\left({1+\kappa^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}<1

    Setting κ=0.47\kappa=0.47 gives us a breakdown point of α0.1866\alpha\leq 0.1866.

  2. 2.

    Consistency (Hybrid Corruption): For vanishing corruption i.e. α0\alpha\rightarrow 0, we can instead show a much stronger, consistent estimation guarantee. Suppose we promise that we would always set ζ,ν1n\zeta,\nu\geq\frac{1}{n}. Then, the results in Lemmata 15 and 16 hold with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) even if we set ζ,ν=Ω(dlog(n)n)\zeta,\nu=\Omega\left({{\sqrt{\frac{d\log(n)}{n}}}}\right). Now, recall that we need to set (A)<1(A)<1 to expect a linear rate of convergence. Setting ζ=ν\zeta=\nu to simplify notation (and also since both can be set to similar values without sacrificing 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) confidence guarantees), using α0.5\alpha\leq 0.5, and using the shorthand ρ:=1+ββ\rho:=1+\frac{\beta}{\beta^{\ast}} gives us the requirement

    (A)1\displaystyle(A)\leq 1\Leftrightarrow
    11ζ(α1α1+ββ(1+κ2)3/21.01[(1.01)1/3+(eκ2)1/3]32+(1+ν)ββ+β)<1\displaystyle\frac{1}{1-\zeta}\left({\frac{\alpha}{1-\alpha}\sqrt{1+\frac{\beta}{\beta^{\ast}}}{\left({1+\kappa^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}+(1+\nu)\frac{\beta}{\beta+\beta^{\ast}}}\right)<1
    11ν(9αρ+(1+ν)ρ1ρ)<1\displaystyle\Leftrightarrow\frac{1}{1-\nu}\left({9\alpha\sqrt{\rho}+(1+\nu)\frac{\rho-1}{\rho}}\right)<1

    The last requirement can be fulfilled for ββmin{Ω(1α2/3),Ω(ndlog(n))}\beta\leq\beta^{\ast}\cdot\min\left\{{\Omega\left({{\frac{1}{\alpha^{2/3}}}}\right),\Omega\left({{\sqrt{\frac{n}{d\log(n)}}}}\right)}\right\} which assures us of an error guarantee of 𝚫22𝒪(1βmax{α2/3,dlog(n)n})\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq{\cal O}\left({{\frac{1}{\beta^{\ast}}\max\left\{{\alpha^{2/3},\sqrt{\frac{d\log(n)}{n}}}\right\}}}\right) and finishes the proof for hybrid corruption case. Note that for no corruptions i.e. α=0\alpha=0, we do recover 𝚫22𝒪(dlog(n)n)\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq{\cal O}\left({{\sqrt{\frac{d\log(n)}{n}}}}\right).

  3. 3.

    Consistency (Pure Corruption): The pure corruption case corresponds to β=\beta^{\ast}=\infty. In this case, the above analysis shows that LWSC/LWLC properties continue to hold for arbitrarily large values of β\beta that assures arbitrarily accurate recovery of 𝐰{{\mathbf{w}}}^{\ast}. Given the linear rate of convergence and the fact that SVAM-RR maintains the invariant βt𝐰^t𝐰21\sqrt{\beta_{t}}\cdot\left\|{\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq 1, it is clear that for any ϵ>0\epsilon>0, a model recovery error of 𝐰^T𝐰22ϵ\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\epsilon is assured within T𝒪(log1ϵβ1)T\leq{\cal O}\left({{\log\frac{1}{\epsilon\beta^{1}}}}\right) iterations.

Lemma 15 (LWSC for Robust Least Squares Regression).

For any 0βn0\leq\beta\leq n, the Q~β\tilde{Q}_{\beta}-function for robust regression satisfies the LWSC property with constant λβGcε(1ζ)\lambda_{\beta}\geq Gc_{\varepsilon}(1-\zeta) with probability at least 1exp(d)1-\exp(-d) for any ζΩ(dlog(n)n)\zeta\geq\Omega\left({{\sqrt{\frac{d\log(n)}{n}}}}\right). In particular, for standard Gaussian covariates and Gaussian noise with variance 1β\frac{1}{\beta^{\ast}}, we can take cε1g~c_{\varepsilon}\geq\frac{1}{\tilde{g}} where g~=β+ββ(1+βββ+β𝚫22)3/2\tilde{g}=\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}.

Proof.

It is easy to see that 2Q~β(𝐰^|𝐰)=XSX\nabla^{2}\tilde{Q}_{\beta}(\hat{{\mathbf{w}}}\,|\,{{\mathbf{w}}})=XSX^{\top} for any 𝐰^2(𝐰,1β)\hat{{\mathbf{w}}}\in{\mathcal{B}}_{2}\left({{{\mathbf{w}}}^{\ast},\sqrt{\frac{1}{\beta}}}\right). Let 𝐱𝒟,ϵ𝒟ε{{\mathbf{x}}}\sim{\mathcal{D}},\epsilon\sim{\mathcal{D}}_{\varepsilon} and let y=𝐰,𝐱+ϵy=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}}\right\rangle+\epsilon be the response of an uncorrupted data point and 𝐰2(𝐰,κβ){{\mathbf{w}}}\in{\mathcal{B}}_{2}\left({{{\mathbf{w}}}^{\ast},\frac{\kappa}{\sqrt{\beta}}}\right) be any fixed model. Then if we let 𝚫:=𝐰𝐰\text{\boldmath$\mathbf{\Delta}$}:={{\mathbf{w}}}^{\ast}-{{\mathbf{w}}}, then weight 𝐬i=β2πexp(β2(𝚫,𝐱i+ϵi)2){{\mathbf{s}}}_{i}=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle+\epsilon_{i})^{2}).

For any fixed 𝐯Sd1{{\mathbf{v}}}\in S^{d-1}, we have:

𝔼[𝐯XGSGXG𝐯]=G𝔼[si𝐱i,𝐯2]\displaystyle{\mathbb{E}}\left[{{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}}}\right]=G\cdot{\mathbb{E}}\left[{{s_{i}\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}}}\right] =Gβ2π𝔼𝐱i𝒟,ϵi𝒟ϵ[𝐱i,𝐯2exp(β2(𝚫,𝐱i+ϵi)2)]\displaystyle=G\sqrt{\frac{\beta}{2\pi}}\cdot\underset{{{\mathbf{x}}}_{i}\sim{\mathcal{D}},\epsilon_{i}\sim{{\mathcal{D}}}_{\epsilon}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}(\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}_{i}}\right\rangle+\epsilon_{i})^{2})}}\right]
Gβ2πc(β,𝚫,σ)\displaystyle\geq G\sqrt{\frac{\beta}{2\pi}}\cdot c(\beta,\text{\boldmath$\mathbf{\Delta}$},\sigma)

where,

cε:=c(β,𝚫,σ):=inf𝐯Sd1{𝔼𝐱𝒟,ϵ𝒟ϵ[𝐱,𝐯2exp(β2(𝚫,𝐱+ϵ)2)]}c_{\varepsilon}:=c(\beta,\text{\boldmath$\mathbf{\Delta}$},\sigma):=\inf_{{{\mathbf{v}}}\in S^{d-1}}\left\{{\underset{{{\mathbf{x}}}\sim{\mathcal{D}},\epsilon\sim{{\mathcal{D}}}_{\epsilon}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}(\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle+\epsilon)^{2})}}\right]}\right\}

Similar to Lemma 17 we have:

[λmin(XGSGXG)<(1ζ2)cεGβ2π]29dexp[mnζ2cε2128R4]{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<\left({1-\frac{\zeta}{2}}\right)\,c_{\varepsilon}\,G\sqrt{\frac{\beta}{2\pi}}}}\right]\leq 2\cdot 9^{d}\exp\left[{-\frac{mn\zeta^{2}c_{\varepsilon}^{2}}{128R^{4}}}\right]

Note that Lemma 18 continues to hold in this setting. Proceeding as in the proof of Lemma 20 to set up a τ\tau-net over 2(𝐰,κβ){\mathcal{B}}_{2}\left({{{\mathbf{w}}}^{\ast},\frac{\kappa}{\sqrt{\beta}}}\right) and taking a union bound over this net finishes the proof.

We now simplify cε:=c(β,𝚫,σ)c_{\varepsilon}:=c(\beta,\text{\boldmath$\mathbf{\Delta}$},\sigma) for various distributions:

Centered Isotropic Gaussian

For the special case of 𝒟=𝒩(𝟎,Id){\mathcal{D}}={\mathcal{N}}({\mathbf{0}},I_{d}), using rotational symmetry, we can w.l.o.g. take 𝚫=(Δ1,0,0,,0)\text{\boldmath$\mathbf{\Delta}$}=(\Delta_{1},0,0,\ldots,0) and 𝐯=(v1,v2,0,0,,0),v12+v22=1{{\mathbf{v}}}=(v_{1},v_{2},0,0,\ldots,0),\,v_{1}^{2}+v_{2}^{2}=1. Thus, if x1,x2𝒩(0,1)x_{1},x_{2}\sim{\mathcal{N}}(0,1) i.i.d. then

𝔼𝐱𝒟,ϵ𝒟ϵ[𝐱,𝐯2exp(β2(𝚫,𝐱+ϵ)2)]\displaystyle\underset{{{\mathbf{x}}}\sim{\mathcal{D}},\epsilon\sim{{\mathcal{D}}}_{\epsilon}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}(\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle+\epsilon)^{2})}}\right]
=𝔼x1,x2𝒩(0,1),ϵ𝒩(0,σ2)[(v12x12+v22x22+2v1v2x1x2)exp(β2(Δ1x1+ϵ)2)]\displaystyle=\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1),\epsilon\sim{\mathcal{N}}(0,\sigma^{2})}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2}x_{2}^{2}+2v_{1}v_{2}x_{1}x_{2})\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right]
=𝔼x1,x2,ϵ[(v12x12+v22x22)exp(β2(Δ1x1+ϵ)2)][as,𝔼[x2]=0]\displaystyle=\underset{x_{1},x_{2},\epsilon}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2}x_{2}^{2})\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right]\quad[\text{as},\underset{}{{\mathbb{E}}}\left[{{x_{2}}}\right]=0]
=𝔼x1,ϵ[(v12x12+v22)exp(β2(Δ1x1+ϵ)2)][as,𝔼[x22]=1]\displaystyle=\underset{x_{1},\epsilon}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2})\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right]\quad[\text{as},\underset{}{{\mathbb{E}}}\left[{{x_{2}^{2}}}\right]=1]
=v12𝔼x1,ϵ[x12exp(β2(Δ1x1+ϵ)2)]+v22𝔼x1,ϵ[exp(β2(Δ1x1+ϵ)2)]\displaystyle=v_{1}^{2}\underset{x_{1},\epsilon}{{\mathbb{E}}}\left[{{x_{1}^{2}\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right]+v_{2}^{2}\underset{x_{1},\epsilon}{{\mathbb{E}}}\left[{{\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right]

Using,

12πσ2exp(β2(Δ1x1+ϵ)2ϵ22σ2)dϵ=11+βσ2exp(βΔ12x122(1+βσ2))\displaystyle\frac{1}{\sqrt{2\pi\sigma^{2}}}\int\limits_{-\infty}^{\infty}\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2}-\frac{\epsilon^{2}}{2\sigma^{2}})d\epsilon\quad=\frac{1}{\sqrt{1+\beta\sigma^{2}}}\exp(-\frac{\beta\Delta_{1}^{2}x_{1}^{2}}{2(1+\beta\sigma^{2})})

We have,

𝔼x1,ϵ[x12exp(β2(Δ1x1+ϵ)2)]\displaystyle\underset{x_{1},\epsilon}{{\mathbb{E}}}\left[{{x_{1}^{2}\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right] =12π[x12exp(x122)12πσ2exp(β2(Δ1x1+ϵ)2ϵ22σ2)𝑑ϵ]𝑑x1\displaystyle=\frac{1}{\sqrt{2\pi}}\int\limits_{-\infty}^{\infty}\left[x_{1}^{2}\exp(-\frac{x_{1}^{2}}{2})\frac{1}{\sqrt{2\pi\sigma^{2}}}\int\limits_{-\infty}^{\infty}\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2}-\frac{\epsilon^{2}}{2\sigma^{2}})d\epsilon\right]\,dx_{1} =11+βσ212πx12exp(x122βΔ12x122(1+βσ2))𝑑x1\displaystyle=\frac{1}{\sqrt{1+\beta\sigma^{2}}}\frac{1}{\sqrt{2\pi}}\int\limits_{-\infty}^{\infty}x_{1}^{2}\exp(-\frac{x_{1}^{2}}{2}-\frac{\beta\Delta_{1}^{2}x_{1}^{2}}{2(1+\beta\sigma^{2})})\,dx_{1} =11+βσ2(1+βσ21+β(σ2+Δ12))32\displaystyle=\frac{1}{\sqrt{1+\beta\sigma^{2}}}\left(\frac{1+\beta\sigma^{2}}{1+\beta(\sigma^{2}+\Delta_{1}^{2})}\right)^{\frac{3}{2}}

and,

𝔼x1,ϵ[exp(β2(Δ1x1+ϵ)2)]\displaystyle\underset{x_{1},\epsilon}{{\mathbb{E}}}\left[{{\exp(-\frac{\beta}{2}(\Delta_{1}x_{1}+\epsilon)^{2})}}\right] =11+βσ2(1+βσ21+β(σ2+Δ12))12\displaystyle=\frac{1}{\sqrt{1+\beta\sigma^{2}}}\left(\frac{1+\beta\sigma^{2}}{1+\beta(\sigma^{2}+\Delta_{1}^{2})}\right)^{\frac{1}{2}}

This gives us

cε\displaystyle c_{\varepsilon} =inf(v1,v2)S1{v121+βσ2(1+βσ21+β(σ2+Δ12))32+v221+βσ2(1+βσ21+β(σ2+Δ12))12}\displaystyle=\inf_{(v_{1},v_{2})\in S^{1}}\left\{{\frac{v_{1}^{2}}{\sqrt{1+\beta\sigma^{2}}}\left(\frac{1+\beta\sigma^{2}}{1+\beta(\sigma^{2}+\Delta_{1}^{2})}\right)^{\frac{3}{2}}+\frac{v_{2}^{2}}{\sqrt{1+\beta\sigma^{2}}}\left(\frac{1+\beta\sigma^{2}}{1+\beta(\sigma^{2}+\Delta_{1}^{2})}\right)^{\frac{1}{2}}}\right\}
=1+βσ2(1+βσ2+β𝚫2)32[using, v12+v22=1 and 𝚫2=Δ12]\displaystyle=\frac{1+\beta\sigma^{2}}{(1+\beta\sigma^{2}+\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2})^{\frac{3}{2}}}\qquad[\text{using, }v_{1}^{2}+v_{2}^{2}=1\text{ and }\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}=\Delta_{1}^{2}]
=β+ββ(β+ββ+β𝚫2)32=1β+ββ(1+βββ+β𝚫22)3/2=1g~\displaystyle=\frac{\frac{\beta^{\ast}+\beta}{\beta^{\ast}}}{\left({\frac{\beta^{\ast}+\beta}{\beta^{\ast}}+\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}}\right)^{\frac{3}{2}}}=\frac{1}{\sqrt{\frac{\beta^{\ast}+\beta}{\beta^{\ast}}}\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}=\frac{1}{\tilde{g}}

which finishes the proof.

Lemma 16 (LWLC for Robust Least Squares Regression).

For any 0βn0\leq\beta\leq n, the Q~β\tilde{Q}_{\beta}-function for robust regression satisfies the LWLC property with constant ΛβG(1+ν)V+1.01B[(β𝚫21.012π)1/3+(12πe)1/3]3/2\Lambda_{\beta}\leq G(1+\nu)V+1.01B[(\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}1.01}{2\pi})^{1/3}+(\frac{1}{2\pi e})^{1/3}]^{3/2} with probability at least 1exp(d)1-\exp(-d) for any νΩ(dβlog(n)n(β+d))\nu\geq\Omega\left({{\sqrt{\frac{d\beta\log(n)}{n(\beta+d)}}}}\right), where V=κ22πββ+β1g~V=\sqrt{\frac{\kappa^{2}}{2\pi}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\tilde{g}} and g~=β+ββ(1+βββ+β𝚫22)3/2\tilde{g}=\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}.

Proof.

It is easy to see that Q~β(𝐰|𝐰)=XGSGϵG+XBSB𝐛\nabla\tilde{Q}_{\beta}({{\mathbf{w}}}^{\ast}\,|\,{{\mathbf{w}}})=X_{G}S_{G}\text{\boldmath$\mathbf{\epsilon}$}_{G}+X_{B}S_{B}{{\mathbf{b}}}. We bound these separately below.

Weights on Bad Points.

Suppose we denote 𝚫:=𝐰𝐰\text{\boldmath$\mathbf{\Delta}$}:={{\mathbf{w}}}-{{\mathbf{w}}}^{\ast} and let S=diag(𝐬t)S=\text{diag}({{\mathbf{s}}}^{t}) be the weights assigned by the algorithm, then the analysis below shows that we must have

XS𝐛21.01B[(β𝚫21.012π)1/3+(12πe)1/3]3/2\left\|{XS{{\mathbf{b}}}}\right\|_{2}\leq 1.01B[(\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}1.01}{2\pi})^{1/3}+(\frac{1}{2\pi e})^{1/3}]^{3/2}

We will bound S𝐛2\left\|{S{{\mathbf{b}}}}\right\|_{2} below and use Lemma 8 to get the above bound. Let bib_{i} denote the corruption on the data point 𝐱i{{\mathbf{x}}}_{i}, i.e. yi=𝐰,𝐱i+biy_{i}=\langle{{\mathbf{w}}}^{\ast},{{\mathbf{x}}}_{i}\rangle+b_{i}. The proof proceeds via a simple case analysis:

Case 1: |bi|k|𝚫,𝐱i|\left|{b_{i}}\right|\leq k\left|{\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle}\right|

In this case we simply bound (sibi)2β2πbi2β2πk2𝚫,𝐱i2(s_{i}b_{i})^{2}\leq\frac{\beta}{2\pi}b_{i}^{2}\leq\frac{\beta}{2\pi}k^{2}\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle^{2}.

Case 2: |bi|>k|𝚫,𝐱i|\left|{b_{i}}\right|>k\left|{\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle}\right|

In this case we have \left|{\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle}\right|<\frac{\left|{b_{i}}\right|}{k}, which gives

|bi𝚫,𝐱i||bi||𝚫,𝐱i|>|bi|k1k\left|{b_{i}-\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle}\right|\geq\left|{b_{i}}\right|-\left|{\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle}\right|>\left|{b_{i}}\right|\frac{k-1}{k}

Therefore,

bi2si2\displaystyle b_{i}^{2}s_{i}^{2} =bi2𝒩(yi|𝐰,𝐱i,1β)2\displaystyle=b_{i}^{2}\mathcal{N}(y_{i}|\langle{{\mathbf{w}}},\mathbf{x}_{i}\rangle,\frac{1}{\beta})^{2}
=bi2β2πexp(β2(yi𝐰,𝐱i)2)2\displaystyle=b_{i}^{2}\frac{\beta}{2\pi}\exp(-\frac{\beta}{2}(y_{i}-\langle{{\mathbf{w}}},{{\mathbf{x}}}_{i}\rangle)^{2})^{2}
=bi2β2πexp(β(bi𝚫,𝐱i)2)\displaystyle=b_{i}^{2}\frac{\beta}{2\pi}\exp(-\beta(b_{i}-\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle)^{2})
bi2β2πexp(βbi2(k1)2k2)for, k1\displaystyle\leq b_{i}^{2}\frac{\beta}{2\pi}\exp(-\beta b_{i}^{2}\frac{(k-1)^{2}}{k^{2}})\qquad\text{for, }k\geq 1
=k22π(k1)2zexp(z)for, z=βbi2(k1)2k2\displaystyle=\frac{k^{2}}{2\pi(k-1)^{2}}z\exp(-z)\qquad\text{for, }z=\beta b_{i}^{2}\frac{(k-1)^{2}}{k^{2}}
k22πe(k1)2as, maxz{zexp(z)}=1e\displaystyle\leq\frac{k^{2}}{2\pi e(k-1)^{2}}\qquad\text{as, }\max_{z}\{z\exp(-z)\}=\frac{1}{e}

Combining cases 1 and 2,

S𝐛22\displaystyle\left\|{S{{\mathbf{b}}}}\right\|_{2}^{2} =iB(sibi)2iBmax{β2πk2𝚫,𝐱i2,k22πe(k1)2}where, k1\displaystyle=\sum_{i\in B}(s_{i}b_{i})^{2}\leq\sum_{i\in B}\max\{\frac{\beta}{2\pi}k^{2}\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle^{2},\frac{k^{2}}{2\pi e(k-1)^{2}}\}\qquad\text{where, }k\geq 1
β2πk2iB𝚫,𝐱i2+B2πek2(k1)2\displaystyle\leq\frac{\beta}{2\pi}k^{2}\sum_{i\in B}\langle\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{x}}}_{i}\rangle^{2}+\frac{B}{2\pi e}\frac{k^{2}}{(k-1)^{2}}
β𝚫2λmax(XBXBT)2πk2+B2πek2(k1)2\displaystyle\leq\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\lambda_{max}(X_{B}X_{B}^{T})}{2\pi}k^{2}+\frac{B}{2\pi e}\frac{k^{2}}{(k-1)^{2}}
=qk2+pk2(k1)2where, p=B2πe,q=β𝚫2λmax(XBXBT)2π\displaystyle=qk^{2}+p\frac{k^{2}}{(k-1)^{2}}\qquad\text{where, }p=\frac{B}{2\pi e},\quad q=\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\lambda_{max}(X_{B}X_{B}^{T})}{2\pi}

Let,

\displaystyle g(k)=qk^{2}+\frac{pk^{2}}{(k-1)^{2}};\qquad g^{\prime}(k)=2qk+\frac{2pk}{(k-1)^{2}}-\frac{2pk^{2}}{(k-1)^{3}}=0
k=1+(pq)1/3minkg(k)=(q1/3+p1/3)3\displaystyle\implies k=1+(\frac{p}{q})^{1/3}\implies\min_{k}g(k)=(q^{1/3}+p^{1/3})^{3}

This gives us

S𝐛22\displaystyle\left\|{S{{\mathbf{b}}}}\right\|_{2}^{2} [(β𝚫2λmax(XBXBT)2π)1/3+(B2πe)1/3]3\displaystyle\leq[(\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\lambda_{max}(X_{B}X_{B}^{T})}{2\pi})^{1/3}+(\frac{B}{2\pi e})^{1/3}]^{3}
B[(β𝚫21.012π)1/3+(12πe)1/3]3\displaystyle\leq B[(\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}1.01}{2\pi})^{1/3}+(\frac{1}{2\pi e})^{1/3}]^{3}
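
The optimization over the case-splitting threshold k used above admits a direct numerical check. The short sketch below (assuming numpy, with arbitrary illustrative values of p and q) confirms that k=1+(p/q)^{1/3} minimizes g(k) and that the minimum value equals (q^{1/3}+p^{1/3})^{3}.

```python
# Sanity check for min_{k>1} q*k^2 + p*k^2/(k-1)^2 = (q^(1/3) + p^(1/3))^3.
import numpy as np

p, q = 0.7, 2.3                        # arbitrary positive constants
k_grid = np.linspace(1.001, 20, 200000)
g = q * k_grid**2 + p * k_grid**2 / (k_grid - 1)**2

k_star = 1 + (p / q)**(1/3)            # claimed minimizer
g_star = (q**(1/3) + p**(1/3))**3      # claimed minimum value

print(k_grid[np.argmin(g)], k_star)    # the minimizers should match
print(g.min(), g_star)                 # the minimum values should match
```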

Bounding the Weights on Good Points.

Suppose the noise values are sampled from {\mathcal{D}}_{\varepsilon}={\mathcal{N}}(0,\frac{1}{\beta^{\ast}}) and the set of corrupted points B is chosen independently of the {{\mathbf{x}}}_{i}. Then for any \beta>0 and S=\text{diag}({{\mathbf{s}}}), with {{\mathbf{s}}} computed w.r.t. the model {{\mathbf{w}}} at variance \frac{1}{\beta}, the following analysis shows that

[XGSGϵG2>G(1+ν)V](12RXVν)3dexp(mβν2V2n32),{\mathbb{P}}\left[{{\left\|{X_{G}S_{G}\text{\boldmath$\mathbf{\epsilon}$}_{G}}\right\|_{2}>G(1+\nu)V}}\right]\leq\left({\frac{12R_{X}}{V\nu}}\right)^{3d}\exp\left({-\frac{m\beta^{\ast}\nu^{2}V^{2}\cdot n}{32}}\right),

where V=κ22πββ+β1g~V=\sqrt{\frac{\kappa^{2}}{2\pi}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\tilde{g}} and g~=β+ββ(1+βββ+β𝚫22)3/2\tilde{g}=\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}. Suppose ϵ𝒩(0,1β)\epsilon\sim{\mathcal{N}}(0,\frac{1}{\beta^{\ast}}) and 𝐱𝒩(𝟎,I){{\mathbf{x}}}\sim{\mathcal{N}}({\mathbf{0}},I). Then, for any fixed error vector 𝚫2(κβ)\text{\boldmath$\mathbf{\Delta}$}\in{\mathcal{B}}_{2}\left({\frac{\kappa}{\sqrt{\beta}}}\right), if we set s=β2πexp(β2(ϵ𝚫𝐱)2)s=\sqrt{\frac{\beta}{2\pi}}\exp\left({-\frac{\beta}{2}(\epsilon-\text{\boldmath$\mathbf{\Delta}$}\cdot{{\mathbf{x}}})^{2}}\right), then we can analyze the vector 𝔼[ϵs𝐱]{\mathbb{E}}\left[{{\epsilon s\cdot{{\mathbf{x}}}}}\right] as

𝔼[ϵs𝐱]=𝔼𝐱[𝔼ϵ[ϵs]𝐱]=β2π𝔼𝐱[𝔼ϵ[exp(β2(ϵ𝚫𝐱)2)ϵ]𝐱]\displaystyle{\mathbb{E}}\left[{{\epsilon s\cdot{{\mathbf{x}}}}}\right]=\underset{{{\mathbf{x}}}}{{\mathbb{E}}}\left[{{\underset{\epsilon}{{\mathbb{E}}}\left[{{\epsilon s}}\right]\cdot{{\mathbf{x}}}}}\right]=\sqrt{\frac{\beta}{2\pi}}\underset{{{\mathbf{x}}}}{{\mathbb{E}}}\left[{{\underset{\epsilon}{{\mathbb{E}}}\left[{{\exp\left({-\frac{\beta}{2}(\epsilon-\text{\boldmath$\mathbf{\Delta}$}^{\top}{{\mathbf{x}}})^{2}}\right)\epsilon}}\right]\cdot{{\mathbf{x}}}}}\right]
=β2πββ+βββ+β𝔼𝐱[(exp(ββ2(β+β)(𝚫𝐱)2)𝚫𝐱)𝐱]𝐩\displaystyle=\sqrt{\frac{\beta}{2\pi}}\sqrt{\frac{\beta^{\ast}}{\beta+\beta^{\ast}}}\frac{\beta}{\beta+\beta^{\ast}}\underbrace{\underset{{{\mathbf{x}}}}{{\mathbb{E}}}\left[{{\left({\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}(\text{\boldmath$\mathbf{\Delta}$}^{\top}{{\mathbf{x}}})^{2}}\right)\text{\boldmath$\mathbf{\Delta}$}^{\top}{{\mathbf{x}}}}\right)\cdot{{\mathbf{x}}}}}\right]}_{{\mathbf{p}}}

where in the last step, we used Lemma 9 in one dimension. Now, by rotational symmetry of the Gaussian distribution and the unbiased and independent nature of its coordinates, we can assume w.l.o.g. that the error vector is of the form \text{\boldmath$\mathbf{\Delta}$}=(\delta,0,\ldots,0) and conclude, as we did in the Gaussian mixture model analysis, that the vector {{\mathbf{p}}} in this situation must also have only its first coordinate nonzero. Thus, we have

|𝐩1|=|𝔼[exp(ββ2(β+β)δ2x2)δx2]|,\left|{{{\mathbf{p}}}_{1}}\right|=\left|{{\mathbb{E}}\left[{{\exp\left({-\frac{\beta\beta^{\ast}}{2(\beta+\beta^{\ast})}\delta^{2}x^{2}}\right)\delta x^{2}}}\right]}\right|,

where x\sim{\mathcal{N}}(0,1) since we had {{\mathbf{x}}}\sim{\mathcal{N}}({\mathbf{0}},I). Applying Lemma 9 in a single dimension yet again, along with the Cauchy-Schwarz inequality, gives us, for any unit vector {{\mathbf{v}}},

𝔼[ϵs𝐱𝐯]β2πββ+βββ+β1(1+βββ+β𝚫22)3/2𝚫2κ22πββ+β1g~,{\mathbb{E}}\left[{{\epsilon s\cdot{{\mathbf{x}}}^{\top}{{\mathbf{v}}}}}\right]\leq\sqrt{\frac{\beta}{2\pi}}\sqrt{\frac{\beta^{\ast}}{\beta+\beta^{\ast}}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq\sqrt{\frac{\kappa^{2}}{2\pi}}\frac{\beta}{\beta+\beta^{\ast}}\frac{1}{\tilde{g}},

where g~=β+ββ(1+βββ+β𝚫22)3/2\tilde{g}=\sqrt{\frac{\beta+\beta^{\ast}}{\beta^{\ast}}}{\left({1+\frac{\beta\beta^{\ast}}{\beta+\beta^{\ast}}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)^{3/2}} and in the last step we used β𝚫22κ2\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq\kappa^{2}. The above, when combined with uniform convergence arguments over appropriate nets over the unit vectors 𝐯{{\mathbf{v}}} and the error vector 𝚫\mathbf{\Delta} gives us the claimed result.

Lemma 17.

With the same preconditions as in Lemma 19, we have

[λmin(XGSGXG)<(1ζ)cGβ2π][λmax(XGSGXG)>(1+ζ)Gβ2π]}29dexp[mnc2ζ232R4],\left.\begin{array}[]{r}{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<(1-\zeta)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\\ \vspace*{-2ex}\hfil\\ {\mathbb{P}}\left[{{\lambda_{\max}(X_{G}S_{G}X_{G}^{\top})>(1+\zeta)\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\end{array}\right\}\leq 2\cdot 9^{d}\exp\left[{-\frac{mnc^{2}\zeta^{2}}{32R^{4}}}\right],

where R is the subGaussian constant of the distribution {\mathcal{D}} that generated the covariate vectors {{\mathbf{x}}}^{i}. When {\mathcal{D}} is the standard Gaussian i.e. {\mathcal{N}}({\mathbf{0}},I), we have R\leq 1. For {\mathcal{N}}({\mathbf{0}},\frac{1}{\beta^{\ast}}\cdot I), we have R\leq\sqrt{\frac{1}{\beta^{\ast}}}. In the above, c is a constant that depends only on the distribution {\mathcal{D}} and is bounded for various distributions in Lemma 19.

Proof.

Let A\in{\mathbb{R}}^{d\times d} be a square symmetric matrix. For \delta>0, we have:

AcI2δ𝐯Sd1,|𝐯A𝐯c|δcδλmin(A)λmax(A)c+δ\displaystyle\left\|{A-c\cdot I}\right\|_{2}\leq\delta\iff\forall{{\mathbf{v}}}\in S^{d-1},\left|{{{\mathbf{v}}}^{\top}A{{\mathbf{v}}}-c}\right|\leq\delta\iff c-\delta\leq\lambda_{\min}(A)\leq\lambda_{\max}(A)\leq c+\delta

Also, for any square symmetric matrix F\in{\mathbb{R}}^{d\times d} and {\mathcal{N}}_{\epsilon} being an \epsilon-net over S^{d-1},

F2(12ϵ)1sup𝐯𝒩ϵ|𝐯F𝐯|\displaystyle\left\|{F}\right\|_{2}\leq(1-2\epsilon)^{-1}\sup_{{{\mathbf{v}}}\in{\mathcal{N}}_{\epsilon}}\left|{{{\mathbf{v}}}^{\top}F{{\mathbf{v}}}}\right|

Taking F=AcIF=A-c\cdot I and ϵ=1/4\epsilon=1/4, we have

AcI22sup𝐯𝒩1/4|𝐯A𝐯c|\displaystyle\left\|{A-c\cdot I}\right\|_{2}\leq 2\sup_{{{\mathbf{v}}}\in{\mathcal{N}}_{1/4}}\left|{{{\mathbf{v}}}^{\top}A{{\mathbf{v}}}-c}\right| (2)

Let Z_{i}:=\sqrt{s_{i}}\cdot\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle with {{\mathbf{x}}}_{i}\sim{\mathcal{D}}. Then for any fixed {{\mathbf{v}}}\in S^{d-1}, we have

Ziψ2=supp1(𝔼[|Zi|p])1/pp(β2π)1/4supp1(𝔼[|𝐱i,𝐯|p])1/pp=(β2π)1/4R\left\|{Z_{i}}\right\|_{\psi_{2}}=\sup_{p\geq 1}\frac{\left({{\mathbb{E}}\left[{{\left|{Z_{i}}\right|^{p}}}\right]}\right)^{1/p}}{\sqrt{p}}\leq(\frac{\beta}{2\pi})^{1/4}\cdot\sup_{p\geq 1}\frac{\left({{\mathbb{E}}\left[{{\left|{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle}\right|^{p}}}\right]}\right)^{1/p}}{\sqrt{p}}=(\frac{\beta}{2\pi})^{1/4}R

where we use the fact that 𝐱i,𝐯Ψ2R\left\|{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle}\right\|_{\Psi_{2}}\leq R since 𝒟{\mathcal{D}} is RR-sub-Gaussian and si(β2π)1/4\sqrt{s_{i}}\leq(\frac{\beta}{2\pi})^{1/4}.

Also, since Z_{i} is R(\frac{\beta}{2\pi})^{1/4}-sub-Gaussian, Z_{i}^{2} is R^{2}\sqrt{\frac{\beta}{2\pi}}-sub-exponential and, by centering, Z_{i}^{2}-{\mathbb{E}}Z_{i}^{2} is 2R^{2}\sqrt{\frac{\beta}{2\pi}}-sub-exponential. For a single good data point, Lemma 19 gives \mu:={\mathbb{E}}Z_{i}^{2}\in[c\sqrt{\frac{\beta}{2\pi}},\sqrt{\frac{\beta}{2\pi}}].

[|𝐯XGSGXG𝐯Gμ|εGβ2π]\displaystyle{\mathbb{P}}\left[{{\left|{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}-G\mu}\right|\geq\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]
=[|iG(Zi2μ)|εGβ2π]\displaystyle={\mathbb{P}}\left[{{\left|{\sum_{i\in G}(Z^{2}_{i}-\mu)}\right|\geq\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]
2exp[mmin{(εGβ2π)2G(2R2β2π)2,εGβ2π2R2β2π}][Theorem 2.8.2]\displaystyle\leq 2\exp\left[{-m\cdot\min\left\{{\frac{\left(\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}\right)^{2}}{G\left(2R^{2}\sqrt{\frac{\beta}{2\pi}}\right)^{2}},\frac{\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}}{2R^{2}\sqrt{\frac{\beta}{2\pi}}}}\right\}}\right]\text{[Theorem 2.8.2]}
2exp[mnε28R4]\displaystyle\leq 2\exp\left[{-\frac{mn\varepsilon^{2}}{8R^{4}}}\right]

where m>0m>0 is a universal constant and in the last step we used Gn/2G\geq n/2 and w.l.o.g. we assumed that ε2R2\varepsilon\leq 2R^{2}. Taking a union bound over all 9d9^{d} elements of 𝒩1/4{\mathcal{N}}_{1/4}, we get

[XGSGXGGμI2εGβ2π]\displaystyle{\mathbb{P}}\left[{{\left\|{X_{G}S_{G}X_{G}^{\top}-G\mu\cdot I}\right\|_{2}\geq\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right] [2sup𝐯𝒩1/4|𝐯XGSGXG𝐯Gμ|εGβ2π]using 2\displaystyle\leq{\mathbb{P}}\left[{{2\sup_{{{\mathbf{v}}}\in{\mathcal{N}}_{1/4}}\left|{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}-G\mu}\right|\geq\varepsilon\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\text{using \ref{net}}
29dexp[mnε232R4]\displaystyle\leq 2\cdot 9^{d}\exp\left[{-\frac{mn\varepsilon^{2}}{32R^{4}}}\right]

Setting ε=ζc\varepsilon=\zeta c and noticing that μ[cβ2π,β2π]\mu\in[c\sqrt{\frac{\beta}{2\pi}},\sqrt{\frac{\beta}{2\pi}}] by Lemma 19 finishes the proof. ∎
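
The concentration asserted by this lemma can also be observed empirically. The following minimal simulation (assuming numpy; all parameter values are arbitrary and the good points are taken noiseless, as in Lemma 19) computes the extreme eigenvalues of X_{G}S_{G}X_{G}^{\top} and compares them to the G\sqrt{\frac{\beta}{2\pi}} scale appearing in the bounds.

```python
# Empirical look at the extreme eigenvalues of X_G S_G X_G^T (good points only).
import numpy as np

rng = np.random.default_rng(0)
n, d, beta = 20000, 10, 4.0
w_star = rng.normal(size=d)
delta = rng.normal(size=d)
delta = 0.2 * delta / np.linalg.norm(delta)   # so that sqrt(beta)*||Delta|| < 0.47
w = w_star + delta                            # a nearby model estimate

X = rng.normal(size=(d, n))                   # columns are covariates x_i ~ N(0, I)
y = X.T @ w_star                              # noiseless good points, as in Lemma 19

s = np.sqrt(beta/(2*np.pi)) * np.exp(-0.5*beta*(y - X.T @ w)**2)
M = (X * s) @ X.T                             # X_G S_G X_G^T

eig = np.linalg.eigvalsh(M)
scale = n * np.sqrt(beta/(2*np.pi))           # the G*sqrt(beta/(2*pi)) scale, G = n here
print(eig.min()/scale, eig.max()/scale)       # both ratios should lie strictly inside (0, 1)
```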

Lemma 18.

Consider two models 𝐰1,𝐰2d{{\mathbf{w}}}^{1},{{\mathbf{w}}}^{2}\in{\mathbb{R}}^{d} such that 𝐰1𝐰22τ\left\|{{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{2}}\right\|_{2}\leq\tau and let 𝐬1,𝐬2{{\mathbf{s}}}^{1},{{\mathbf{s}}}^{2} denote the corresponding weight vectors, i.e. sij=β2πexp(β2(yi𝐰j,𝐱i)2),j=1,2s_{i}^{j}=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(y_{i}-\langle{{\mathbf{w}}}^{j},{{\mathbf{x}}}_{i}\rangle)^{2}),\,j=1,2. Also let S1=diag(𝐬1)S^{1}=\text{diag}({{\mathbf{s}}}^{1}) and S2=diag(𝐬2)S^{2}=\text{diag}({{\mathbf{s}}}^{2}). Then for any X=[𝐱1,,𝐱n]d×nX=[{{\mathbf{x}}}_{1},\ldots,{{\mathbf{x}}}_{n}]\in{\mathbb{R}}^{d\times n} such that 𝐱i2RX\left\|{{{\mathbf{x}}}_{i}}\right\|_{2}\leq R_{X} for all ii,

|λmin(XS1X)λmin(XS2X)|nτβRX32πe,\left|{\lambda_{\min}(XS^{1}X^{\top})-\lambda_{\min}(XS^{2}X^{\top})}\right|\leq\frac{n\tau\beta R_{X}^{3}}{\sqrt{2\pi e}},

where RXR_{X} is the maximum length in a set of nn vectors, each sampled from a dd-dimensional Gaussian (see Lemma 7).

Proof.

Let s_{i}^{j}=f(r_{i}^{j})=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(y_{i}-r_{i}^{j})^{2}) where r_{i}^{j}=\langle{{\mathbf{w}}}^{j},{{\mathbf{x}}}_{i}\rangle,\,j=1,2.

Since f:{\mathbb{R}}\rightarrow{\mathbb{R}} is everywhere differentiable with a bounded derivative, it is L-Lipschitz continuous with L=\sup\limits_{r}\left|{f^{\prime}(r)}\right|:

f(r)\displaystyle f^{\prime}(r) =β2πexp(β2(yir)2)β(yir)\displaystyle=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(y_{i}-r)^{2})\beta(y_{i}-r)
=β2πtexp(t2)2β\displaystyle=\sqrt{\frac{\beta}{2\pi}}t\exp(-t^{2})\sqrt{2\beta} where,t=β2(yir)\displaystyle where,t=\sqrt{\frac{\beta}{2}}(y_{i}-r)
βπ12e\displaystyle\leq\frac{\beta}{\sqrt{\pi}}\frac{1}{\sqrt{2e}} texp(t2)12e\displaystyle t\exp(-t^{2})\leq\frac{1}{\sqrt{2e}}
=β2πe\displaystyle=\frac{\beta}{\sqrt{2\pi e}}

Hence, \frac{\left|{f(r_{i}^{1})-f(r_{i}^{2})}\right|}{\left|{r_{i}^{1}-r_{i}^{2}}\right|}\leq\frac{\beta}{\sqrt{2\pi e}}, i.e. \left|{s_{i}^{1}-s_{i}^{2}}\right|\leq\frac{\beta}{\sqrt{2\pi e}}\left|{\langle{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{2},{{\mathbf{x}}}_{i}\rangle}\right|\leq\frac{\beta\tau R_{X}}{\sqrt{2\pi e}}. This gives us \left\|{{{\mathbf{s}}}^{1}-{{\mathbf{s}}}^{2}}\right\|_{1}\leq\frac{n\tau\beta R_{X}}{\sqrt{2\pi e}}. Now, letting S^{1}=\text{diag}({{\mathbf{s}}}^{1}) and S^{2}=\text{diag}({{\mathbf{s}}}^{2}), for any unit vector {{\mathbf{v}}}\in S^{d-1}, denoting R_{X}:=\max_{i\in[n]}\left\|{{{\mathbf{x}}}_{i}}\right\|_{2} we have

|𝐯XS1X𝐯𝐯XS2X𝐯|\displaystyle\left|{{{\mathbf{v}}}^{\top}XS^{1}X^{\top}{{\mathbf{v}}}-{{\mathbf{v}}}^{\top}XS^{2}X^{\top}{{\mathbf{v}}}}\right| =|i=1n(𝐬i1𝐬i2)𝐱i,𝐯2|\displaystyle=\left|{\sum_{i=1}^{n}\left({{{\mathbf{s}}}^{1}_{i}-{{\mathbf{s}}}^{2}_{i}}\right)\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}}\right|
𝐬1𝐬21maxi[n]𝐱i,𝐯2\displaystyle\leq\left\|{{{\mathbf{s}}}^{1}-{{\mathbf{s}}}^{2}}\right\|_{1}\cdot\max_{i\in[n]}\ \left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}
𝐬1𝐬21RX2\displaystyle\leq\left\|{{{\mathbf{s}}}^{1}-{{\mathbf{s}}}^{2}}\right\|_{1}\cdot R_{X}^{2}
nτβRX32πe.\displaystyle\leq\frac{n\tau\beta R_{X}^{3}}{\sqrt{2\pi e}}.

This proves that \left\|{XS^{1}X^{\top}-XS^{2}X^{\top}}\right\|_{2}\leq\frac{n\tau\beta R_{X}^{3}}{\sqrt{2\pi e}}. Since the smallest eigenvalue of a symmetric matrix is 1-Lipschitz with respect to the spectral norm (Weyl's inequality), the claimed bound on \left|{\lambda_{\min}(XS^{1}X^{\top})-\lambda_{\min}(XS^{2}X^{\top})}\right| follows. ∎
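
The Lipschitz constant derived above can be checked numerically in a few lines; the sketch below assumes numpy, and \beta and y_{i} take arbitrary illustrative values.

```python
# Check that sup_r |f'(r)| = beta/sqrt(2*pi*e) for f(r) = sqrt(beta/(2*pi)) * exp(-beta/2 (y - r)^2).
import numpy as np

beta, y = 3.0, 1.2                              # arbitrary values
r = np.linspace(y - 10, y + 10, 200_001)
f = np.sqrt(beta/(2*np.pi)) * np.exp(-0.5*beta*(y - r)**2)

slope = np.abs(np.diff(f) / np.diff(r))         # finite-difference estimate of |f'(r)|
print(slope.max(), beta/np.sqrt(2*np.pi*np.e))  # the two values should agree closely
```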

Lemma 19.

Let X=[{{\mathbf{x}}}_{1},\ldots,{{\mathbf{x}}}_{n}]\in{\mathbb{R}}^{d\times n} be generated from an isotropic R-sub-Gaussian distribution {\mathcal{D}}. For any fixed model {{\mathbf{w}}} and \beta>0, let s_{i}=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(y_{i}-\langle{{\mathbf{w}}},{{\mathbf{x}}}_{i}\rangle)^{2}) be the weight of the data point {{\mathbf{x}}}_{i} and let S_{G}=\text{diag}({{\mathbf{s}}}_{G}). Then there exists a constant c>0 that depends only on {\mathcal{D}} such that for any fixed unit vector {{\mathbf{v}}}\in S^{d-1},

cGβ2π𝔼[𝐯XGSGXG𝐯]Gβ2π.c\cdot G\sqrt{\frac{\beta}{2\pi}}\leq{\mathbb{E}}\left[{{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}}}\right]\leq G\sqrt{\frac{\beta}{2\pi}}.
Proof.

Let {{\mathbf{x}}}_{i}\sim{\mathcal{D}} and note that for good points y_{i}=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle. Let \text{\boldmath$\mathbf{\Delta}$}:={{\mathbf{w}}}^{\ast}-{{\mathbf{w}}}. Since s_{i}\leq\sqrt{\frac{\beta}{2\pi}}, we have by linearity of expectation,

𝔼[𝐯XGSGXG𝐯]=𝔼[iGsi𝐱i,𝐯2]=G𝔼[si𝐱i,𝐯2]Gβ2π𝔼[𝐱i,𝐯2]=Gβ2π,{\mathbb{E}}\left[{{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}}}\right]={\mathbb{E}}\left[{{\sum_{i\in G}s_{i}\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}}}\right]=G\cdot{\mathbb{E}}\left[{{s_{i}\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}}}\right]\leq G\sqrt{\frac{\beta}{2\pi}}\cdot{\mathbb{E}}\left[{{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}}}\right]=G\sqrt{\frac{\beta}{2\pi}},

since 𝒟{\mathcal{D}} is isotropic.

For the lower bound, we may write,

𝔼[𝐯XGSGXG𝐯]=G𝔼[𝐱i,𝐯2si]\displaystyle{\mathbb{E}}\left[{{{{\mathbf{v}}}^{\top}X_{G}S_{G}X_{G}^{\top}{{\mathbf{v}}}}}\right]=G\cdot{\mathbb{E}}\left[{{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}\cdot s_{i}}}\right] =Gβ2π𝔼[𝐱i,𝐯2exp(β2𝚫,𝐱i2)]\displaystyle=G\sqrt{\frac{\beta}{2\pi}}\cdot{\mathbb{E}}\left[{{\left\langle{{{\mathbf{x}}}_{i}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}_{i}}\right\rangle^{2})}}\right]
Gβ2πc(β,𝚫)\displaystyle\geq G\sqrt{\frac{\beta}{2\pi}}\cdot c(\beta,\text{\boldmath$\mathbf{\Delta}$})

where, for any distribution 𝒟{\mathcal{D}} over d{\mathbb{R}}^{d}, we define the constant cc as

c(β,𝚫):=inf𝐯Sd1{𝔼𝐱𝒟[𝐱,𝐯2exp(β2𝚫,𝐱2)]}c(\beta,\text{\boldmath$\mathbf{\Delta}$}):=\inf_{{{\mathbf{v}}}\in S^{d-1}}\left\{{\underset{{{\mathbf{x}}}\sim{\mathcal{D}}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle^{2})}}\right]}\right\}
Centered Isotropic Gaussian

For the special case of 𝒟=𝒩(𝟎,Id){\mathcal{D}}={\mathcal{N}}({\mathbf{0}},I_{d}), using rotational symmetry, we can w.l.o.g. take 𝚫=(Δ1,0,0,,0)\text{\boldmath$\mathbf{\Delta}$}=(\Delta_{1},0,0,\ldots,0) and 𝐯=(v1,v2,0,0,,0){{\mathbf{v}}}=(v_{1},v_{2},0,0,\ldots,0). Thus, if x1,x2𝒩(0,1)x_{1},x_{2}\sim{\mathcal{N}}(0,1) i.i.d. then

\displaystyle\underset{{{\mathbf{x}}}\sim{\mathcal{D}}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}},{{{\mathbf{v}}}}\right\rangle^{2}\exp(-\frac{\beta}{2}\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle^{2})}}\right] =\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2}x_{2}^{2}+2v_{1}v_{2}x_{1}x_{2})\exp(-\frac{\beta}{2}\Delta_{1}^{2}x_{1}^{2})}}\right]
\displaystyle=\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2}x_{2}^{2})\exp(-\frac{\beta}{2}\Delta_{1}^{2}x_{1}^{2})}}\right]\quad[\text{as }{\mathbb{E}}\left[{{x_{2}}}\right]=0]
\displaystyle=\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2})\exp(-\frac{\beta}{2}\Delta_{1}^{2}x_{1}^{2})}}\right]\quad[\text{as }{\mathbb{E}}\left[{{x_{2}^{2}}}\right]=1]
\displaystyle=v_{1}^{2}\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{x_{1}^{2}\exp(-\frac{\beta}{2}\Delta_{1}^{2}x_{1}^{2})}}\right]+v_{2}^{2}\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{\exp(-\frac{\beta}{2}\Delta_{1}^{2}x_{1}^{2})}}\right]
\displaystyle=\frac{v_{1}^{2}}{(1+\beta\Delta_{1}^{2})^{3/2}}+\frac{v_{2}^{2}}{(1+\beta\Delta_{1}^{2})^{1/2}}

This gives us

c(β,𝚫)\displaystyle c(\beta,\text{\boldmath$\mathbf{\Delta}$}) =inf(v1,v2)S1{v12(1+β𝚫2)3/2+v22(1+β𝚫2)1/2}1(1+β𝚫2)3/2\displaystyle=\inf_{(v_{1},v_{2})\in S^{1}}\left\{{\frac{v_{1}^{2}}{(1+\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2})^{3/2}}+\frac{v_{2}^{2}}{(1+\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2})^{1/2}}}\right\}\geq\frac{1}{(1+\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2})^{3/2}}
\displaystyle\geq\frac{1}{(1+\kappa^{2})^{3/2}}\qquad[\text{using }\sqrt{\beta}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|<\kappa]
Centered Non-isotropic Gaussian

For the case 𝒟=𝒩(𝟎,Σ){\mathcal{D}}={\mathcal{N}}({\mathbf{0}},\Sigma), we have 𝐱𝒟=Σ1/2.𝒩(𝟎,Id){{\mathbf{x}}}\sim{\mathcal{D}}=\Sigma^{1/2}.{\mathcal{N}}({\mathbf{0}},I_{d}). Thus for any fixed unit vector 𝐯{{\mathbf{v}}}, we have 𝐯,𝐱𝐯~,𝐳\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}}\right\rangle\sim\left\langle{\tilde{{{\mathbf{v}}}}},{{{\mathbf{z}}}}\right\rangle where 𝐯~=Σ1/2𝐯\tilde{{{\mathbf{v}}}}=\Sigma^{-1/2}{{\mathbf{v}}} and 𝐳𝒩(𝟎,Id){{\mathbf{z}}}\sim{\mathcal{N}}({\mathbf{0}},I_{d}). We also have 𝐯~2[1Λ,1λ]\left\|{\tilde{{{\mathbf{v}}}}}\right\|_{2}\in\left[{\frac{1}{\sqrt{\Lambda}},\frac{1}{\sqrt{\lambda}}}\right], where Λ=λmax(Σ)\Lambda=\lambda_{max}(\Sigma) and λ=λmin(Σ)\lambda=\lambda_{min}(\Sigma). Now for any fixed vectors 𝚫,𝐯\text{\boldmath$\mathbf{\Delta}$},{{\mathbf{v}}} we first perform rotations so that we have 𝚫~=(Δ,0,0,,0)\tilde{\text{\boldmath$\mathbf{\Delta}$}}=(\Delta,0,0,\cdots,0) and 𝐯~=(v1,v2,0,0,,0)𝔹(𝟎,r)\tilde{{{\mathbf{v}}}}=(v_{1},v_{2},0,0,\cdots,0)\in\mathbb{B}({\mathbf{0}},r) where Δ[𝚫Λ,𝚫λ]\Delta\in\left[{\frac{\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|}{\sqrt{\Lambda}},\frac{\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|}{\sqrt{\lambda}}}\right] and r[1Λ,1λ]r\in\left[{\frac{1}{\sqrt{\Lambda}},\frac{1}{\sqrt{\lambda}}}\right]. This gives us c(β,𝚫)infv1,v2f(v1,v2)c(\beta,\text{\boldmath$\mathbf{\Delta}$})\geq\inf_{v_{1},v_{2}}f(v_{1},v_{2}) where,

f(v1,v2)\displaystyle f(v_{1},v_{2}) =𝔼x1,x2𝒩(0,1)[(v12x12+v22x22+2v1v2x1x2)exp(β2Δ2x12)]\displaystyle=\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{(v_{1}^{2}x_{1}^{2}+v_{2}^{2}x_{2}^{2}+2v_{1}v_{2}x_{1}x_{2})\exp(-\frac{\beta}{2}\Delta^{2}x_{1}^{2})}}\right]
=v12(1+βΔ2)3/2+v22(1+βΔ2)1/2\displaystyle=\frac{v_{1}^{2}}{(1+\beta\Delta^{2})^{3/2}}+\frac{v_{2}^{2}}{(1+\beta\Delta^{2})^{1/2}}

similar to the isotropic counterpart, giving the following:

c(β,𝚫)\displaystyle c(\beta,\text{\boldmath$\mathbf{\Delta}$}) =inf(v1,v2)𝔹(𝟎,r){v12(1+βΔ2)3/2+v22(1+βΔ2)1/2}1Λ(1+βλ𝚫2)3/2\displaystyle=\inf_{(v_{1},v_{2})\in\mathbb{B}({\mathbf{0}},r)}\left\{{\frac{v_{1}^{2}}{(1+\beta\Delta^{2})^{3/2}}+\frac{v_{2}^{2}}{(1+\beta\Delta^{2})^{1/2}}}\right\}\geq\frac{1}{\Lambda(1+\frac{\beta}{\lambda}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2})^{3/2}}

As the above term needs to be bounded away from 0, we require \lambda=\lambda_{min}(\Sigma) to be bounded reasonably away from 0.

Non-centered Isotropic Gaussian

Suppose the covariates are generated from a distribution {\mathcal{D}}={\mathcal{N}}(\text{\boldmath$\mathbf{\mu}$},I_{d}). As earlier, by rotational symmetry, we can take \text{\boldmath$\mathbf{\Delta}$}=(\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|,0,0,\ldots,0), {{\mathbf{v}}}=(v_{1},v_{2},0,\ldots,0), \text{\boldmath$\mathbf{\mu}$}=(\mu_{1},\mu_{2},\mu_{3},0,0,\ldots,0). Assume \left\|{\text{\boldmath$\mathbf{\mu}$}}\right\|_{2}=\rho. Letting \left\langle{\text{\boldmath$\mathbf{\mu}$}},{{{\mathbf{v}}}}\right\rangle=:p\leq\rho and x_{1},x_{2},x_{3}\sim{\mathcal{N}}(0,1) i.i.d. gives c(\beta,\text{\boldmath$\mathbf{\Delta}$})\geq\inf_{v_{1},v_{2}}f(v_{1},v_{2}), where the independence of x_{1},x_{2},x_{3} and the facts that {\mathbb{E}}\left[{{x_{2}}}\right]=0 and {\mathbb{E}}\left[{{x_{2}^{2}}}\right]=1 give us

\displaystyle f(v_{1},v_{2}) =\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{(p+v_{1}x_{1}+v_{2}x_{2})^{2}\exp\left({-\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}}{2}(x_{1}+\mu_{1})^{2}}\right)}}\right]
\displaystyle=\underset{x_{1},x_{2}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{((p+v_{1}x_{1})^{2}+v_{2}^{2}x_{2}^{2}+2(p+v_{1}x_{1})v_{2}x_{2})\exp\left({-\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}}{2}(x_{1}+\mu_{1})^{2}}\right)}}\right]
\displaystyle=\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{((p+v_{1}x_{1})^{2}+v_{2}^{2})\exp\left({-\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}}{2}(x_{1}+\mu_{1})^{2}}\right)}}\right]

Now, since (v1,v2)S1(v_{1},v_{2})\in S^{1} we have the following two cases:

Case 1: v2212v_{2}^{2}\geq\frac{1}{2}. In this case

f(v1,v2)\displaystyle f(v_{1},v_{2}) 12𝔼x1𝒩(0,1)[exp(β𝚫22(x1+μ1)2)]\displaystyle\geq\frac{1}{2}\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{\exp\left({-\frac{\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}}{2}(x_{1}+\mu_{1})^{2}}\right)}}\right]
=122πexp(c2(x+μ1)2x22)𝑑x ,where c=β𝚫2\displaystyle=\frac{1}{2\sqrt{2\pi}}\int\limits_{-\infty}^{\infty}\exp\left({-\frac{c}{2}(x+\mu_{1})^{2}-\frac{x^{2}}{2}}\right)dx\text{ ,where }c=\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}
if μ1>0\displaystyle\text{if }\mu_{1}>0
122πexp(c+12(x+μ1)2)𝑑x=12c+112κ2+1=0.45\displaystyle\geq\frac{1}{2\sqrt{2\pi}}\int\limits_{-\infty}^{\infty}\exp\left({-\frac{c+1}{2}(x+\mu_{1})^{2}}\right)dx=\frac{1}{2\sqrt{c+1}}\geq\frac{1}{2\sqrt{\kappa^{2}+1}}=0.45
else if μ1<0\displaystyle\text{else if }\mu_{1}<0
122πexp(c+12x2)𝑑x=12c+112κ2+1=0.45\displaystyle\geq\frac{1}{2\sqrt{2\pi}}\int\limits_{-\infty}^{\infty}\exp\left({-\frac{c+1}{2}x^{2}}\right)dx=\frac{1}{2\sqrt{c+1}}\geq\frac{1}{2\sqrt{\kappa^{2}+1}}=0.45

for c=β𝚫2κ2c=\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\leq\kappa^{2} and κ=0.47\kappa=0.47.

Case 2: v1212v_{1}^{2}\geq\frac{1}{2}. In this case, if x122ρx_{1}\geq 2\sqrt{2}\rho, then |v1x1+p|ρ\left|{v_{1}x_{1}+p}\right|\geq\rho and also |v1x1+p|x122\left|{v_{1}x_{1}+p}\right|\geq\frac{x_{1}}{2\sqrt{2}}. So we can write (v1x1+p)2ρx122(v_{1}x_{1}+p)^{2}\geq\frac{\rho x_{1}}{2\sqrt{2}}. Also |x1+μ1|2x1\left|{x_{1}+\mu_{1}}\right|\leq 2x_{1}. Hence,

f(v1,v2)\displaystyle f(v_{1},v_{2}) ρ22𝔼x1𝒩(0,1)[x1exp(c2(2x1)2)𝕀{x122ρ}]\displaystyle\geq\frac{\rho}{2\sqrt{2}}\underset{x_{1}\sim{\mathcal{N}}(0,1)}{{\mathbb{E}}}\left[{{x_{1}\exp\left({-\frac{c}{2}(2x_{1})^{2}}\right){\mathbb{I}}\left\{{{x_{1}\geq 2\sqrt{2}\rho}}\right\}}}\right]
=ρ4π22ρxexp(c2(2x)2x22)𝑑x\displaystyle=\frac{\rho}{4\sqrt{\pi}}\int\limits_{2\sqrt{2}\rho}^{\infty}x\exp\left({-\frac{c}{2}(2x)^{2}-\frac{x^{2}}{2}}\right)dx
=ρ4π22ρxexp((12+2c)x2)𝑑x\displaystyle=\frac{\rho}{4\sqrt{\pi}}\int\limits_{2\sqrt{2}\rho}^{\infty}x\exp\left({-(\frac{1}{2}+2c)x^{2}}\right)dx
=ρ4π(1+4c)22ρ(1+4c)x.exp(1+4c2x2)dx\displaystyle=\frac{\rho}{4\sqrt{\pi}(1+4c)}\int\limits_{2\sqrt{2}\rho}^{\infty}(1+4c)x.\exp\left({-\frac{1+4c}{2}x^{2}}\right)dx
=ρ4π(1+4c)4(1+4c)ρ2exp(z)𝑑z\displaystyle=\frac{\rho}{4\sqrt{\pi}(1+4c)}\int\limits_{4(1+4c)\rho^{2}}^{\infty}exp(-z)dz
=ρ4π(1+4c)e4(1+4c)ρ2\displaystyle=\frac{\rho}{4\sqrt{\pi}(1+4c)}e^{-4(1+4c)\rho^{2}}
ρ4π(1+4κ2)e4(1+4κ2)ρ2ρ14e8ρ2\displaystyle\geq\frac{\rho}{4\sqrt{\pi}(1+4\kappa^{2})}e^{-4(1+4\kappa^{2})\rho^{2}}\geq\frac{\rho}{14}e^{-8\rho^{2}}

using the fact that c=\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\leq\kappa^{2} and \kappa=0.47. We see from the above that we need to avoid large values of \rho. One way to do so is to center the covariates, i.e. use \tilde{x_{i}}=x_{i}-\hat{\mu} where \hat{\mu}:=\frac{1}{n}\sum_{i=1}^{n}x_{i}. This would approximately center the covariates and ensure an effective value of \rho\approx{\cal O}\left({{\sqrt{\frac{d}{n}}}}\right).
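
The effect of the centering trick is easy to see in simulation. The sketch below (assuming numpy; the choices of \mu, n and d are arbitrary) shows that after subtracting the empirical mean, the residual shift, i.e. the effective value of \rho, is on the order of \sqrt{d/n}.

```python
# After centering with the empirical mean, the residual shift (effective rho) is ~ sqrt(d/n).
import numpy as np

rng = np.random.default_rng(1)
n, d = 5000, 20
mu = 3.0 * rng.normal(size=d)            # an arbitrary, fairly large mean
X = mu + rng.normal(size=(n, d))         # rows are covariates drawn from N(mu, I)

mu_hat = X.mean(axis=0)                  # empirical mean used for centering
X_centered = X - mu_hat                  # the centering trick: x_i - mu_hat
print(np.linalg.norm(mu), "->", np.linalg.norm(mu_hat - mu), "~", np.sqrt(d / n))
```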

Bounded Distribution

Suppose our covariate distribution has bounded support, that is, supp({\mathcal{D}})\subset\mathcal{B}_{2}(\rho) for some \rho>0. Assume \rho>1 w.l.o.g. Also, using the centering trick above, assume that \underset{{{\mathbf{x}}}\sim{\mathcal{D}}}{{\mathbb{E}}}\left[{{{{\mathbf{x}}}}}\right]={\mathbf{0}}. Then we have \left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle\leq\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|\rho which implies \exp\left({-\frac{\beta}{2}\left\langle{\text{\boldmath$\mathbf{\Delta}$}},{{{\mathbf{x}}}}\right\rangle^{2}}\right)\geq\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\rho^{2}}\right). Let \Sigma denote the covariance of the distribution {\mathcal{D}} and let \lambda:=\lambda_{min}(\Sigma) denote its smallest eigenvalue. This gives us c\geq e^{-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|^{2}\rho^{2}}\underset{{{\mathbf{x}}}\sim{\mathcal{D}}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{x}}}},{{{\mathbf{v}}}}\right\rangle^{2}}}\right]\geq e^{-\kappa^{2}\rho^{2}/2}\cdot\lambda=e^{-0.11\rho^{2}}\cdot\lambda, using \kappa=0.47.

This finishes the proof. ∎
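
For the centered isotropic case analyzed above, the lower bound on c(\beta,\text{\boldmath$\mathbf{\Delta}$}) can also be cross-checked numerically. The sketch below (assuming numpy and scipy; \beta and \Delta_{1} are arbitrary values with \sqrt{\beta}\Delta_{1}<\kappa) evaluates the expectation over a grid of unit vectors v=(\cos\theta,\sin\theta) and compares the smallest value against (1+\beta\Delta_{1}^{2})^{-3/2}.

```python
# Numerical check of c(beta, Delta) >= (1 + beta*||Delta||^2)^(-3/2) for D = N(0, I).
import numpy as np
from scipy import integrate

beta, Delta1 = 2.0, 0.3                     # arbitrary values with sqrt(beta)*Delta1 < 0.47

def gauss_moment(power):
    # E[x^power * exp(-beta/2 * Delta1^2 * x^2)] for x ~ N(0, 1), via quadrature
    f = lambda x: x**power * np.exp(-0.5*beta*Delta1**2*x**2) * \
                  np.exp(-x**2/2) / np.sqrt(2*np.pi)
    return integrate.quad(f, -10, 10)[0]

m2, m0 = gauss_moment(2), gauss_moment(0)   # closed forms: (1+b)^(-3/2) and (1+b)^(-1/2)
thetas = np.linspace(0, 2*np.pi, 1000)
vals = np.cos(thetas)**2 * m2 + np.sin(thetas)**2 * m0
print(vals.min(), (1 + beta*Delta1**2)**(-1.5))   # the two numbers should agree closely
```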

Lemma 20.

Let X=\left[{{{\mathbf{x}}}_{1},\ldots,{{\mathbf{x}}}_{n}}\right] be generated from an isotropic R-sub-Gaussian distribution {\mathcal{D}} and let G denote the set of uncorrupted points. Then there exists a distribution-specific constant c such that

[λmin(XGSGXG)<(1ζ)cGβ2π][λmax(XGSGXG)>(1+ζ)Gβ2π]}exp(Ω(nζ2dlog1ζdlog(n))).\left.\begin{array}[]{r}{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<(1-\zeta)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\\ \vspace*{-2ex}\hfil\\ {\mathbb{P}}\left[{{\lambda_{\max}(X_{G}S_{G}X_{G}^{\top})>(1+\zeta)\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\end{array}\right\}\leq\exp\left({-\Omega\left({{n\zeta^{2}-d\log\frac{1}{\zeta}-d\log(n)}}\right)}\right).
Proof.

The bound on the largest eigenvalue follows directly from the fact that all weights are upper bounded by \sqrt{\frac{\beta}{2\pi}}, hence X_{G}S_{G}X_{G}^{\top}\preceq\sqrt{\frac{\beta}{2\pi}}\cdot X_{G}X_{G}^{\top}, and from applying Lemma 8. For the bound on the smallest eigenvalue, notice that Lemma 17 shows us that for any fixed variance \frac{1}{\beta}, we have

[λmin(XGSGXG)<(1ζ2)cGβ2π]29dexp[mnζ2c2128R4]{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<\left({1-\frac{\zeta}{2}}\right)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\leq 2\cdot 9^{d}\exp\left[{-\frac{mn\zeta^{2}c^{2}}{128R^{4}}}\right]

Given R_{X}:=\max_{i\in[n]}\left\|{{{\mathbf{x}}}_{i}}\right\|_{2}, Lemma 18 shows us that if {{\mathbf{w}}}^{1},{{\mathbf{w}}}^{2} are two models at the variance level \frac{1}{\beta} such that \left\|{{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{2}}\right\|_{2}\leq\tau, then the following holds almost surely.

|λmin(XGSG1XG)λmin(XGSG2XG)|GτβRX32πe\left|{\lambda_{\min}(X_{G}S^{1}_{G}X_{G}^{\top})-\lambda_{\min}(X_{G}S^{2}_{G}X_{G}^{\top})}\right|\leq\frac{G\tau\beta R_{X}^{3}}{\sqrt{2\pi e}}

This prompts us to initiate a uniform convergence argument by setting up a τ\tau-net over 2(𝐰,2πβ){\mathcal{B}}_{2}\left({{{\mathbf{w}}}^{\ast},\sqrt{\frac{2\pi}{\beta}}}\right) for τ=ζc2RX3eβ\tau=\frac{\zeta c}{2R_{X}^{3}}\sqrt{\frac{e}{\beta}}. Note that such a net has at most (6RX3ζc2πe)d\left({\frac{6R_{X}^{3}}{\zeta c}\sqrt{\frac{2\pi}{e}}}\right)^{d} elements by applying standard covering number bounds for the Euclidean ball [23, Corollary 4.2.13]. Taking a union bound over this net gives us

\displaystyle{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<(1-\zeta)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right] \displaystyle\leq 2\cdot\left({\frac{54R_{X}^{3}}{\zeta c}\sqrt{\frac{2\pi}{e}}}\right)^{d}\exp\left[{-\frac{mn\zeta^{2}c^{2}}{128R^{4}}}\right]
exp(Ω(nζ2dlog1ζdlog(n))),\displaystyle\leq\exp\left({-\Omega\left({{n\zeta^{2}-d\log\frac{1}{\zeta}-d\log(n)}}\right)}\right),

where in the last step we used Lemma 7 to bound RX=𝒪(Rn)R_{X}={\cal O}\left({{R\sqrt{n}}}\right) with probability at least 1exp(Ω(n))1-\exp(-\Omega\left({{n}}\right)). ∎

G.1 Robust Least-squares Regression with a Fully Adaptive Adversary

To handle a fully adaptive adversary, we need mild modifications to the notions of LWSC and LWLC given in Definition 1, so that the adversary is now allowed to choose the locations of the corruptions arbitrarily. For the sake of simplicity, we present these re-definitions and subsequent arguments in the context of robust least-squares regression, but similar extensions hold for the other GLM tasks as well. Let us introduce the shorthand \bar{\alpha}=1-\alpha and let {\mathcal{T}}_{\bar{\alpha}}=\{T\subset[n]:\left|{T}\right|=\bar{\alpha}\cdot n\} denote the collection of all possible subsets of \bar{\alpha}\cdot n data points.

Definition 2 (LWSC/LWLC against Fully Adaptive Adversaries).

Suppose we are given an exponential family likelihood distribution [|]{\mathbb{P}}\left[{{\cdot\,|\,\cdot}}\right] and data points {(𝐱i,yi)}i=1n\left\{{({{\mathbf{x}}}^{i},y_{i})}\right\}_{i=1}^{n} of which an α>0\alpha>0 fraction has been corrupted by a fully adaptive adversary and β>0\beta>0 is any positive real value. Then, we say that the adaptive λ~β\tilde{\lambda}_{\beta}-local weighted strong convexity property is satisfied if for any model 𝐰{{\mathbf{w}}}, we have

minG𝒯α¯λmin(XGSGXG)λ~β,\min_{G\in{\mathcal{T}}_{\bar{\alpha}}}\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})\geq\tilde{\lambda}_{\beta},

where SS is a diagonal matrix containing the data point scores sis_{i} assigned by the model 𝐰{{\mathbf{w}}} (see Definition 1 for a definition of the scores). Similarly, we say that the adaptive Λ~β\tilde{\Lambda}_{\beta}-weighted strong smoothness properties are satisfied if for any true model 𝐰{{\mathbf{w}}}^{\ast} and any model 𝐰2(𝐰,1β){{\mathbf{w}}}\in{\mathcal{B}}_{2}\left({{{\mathbf{w}}}^{\ast},\sqrt{\frac{1}{\beta}}}\right), we have

maxG𝒯α¯XBSB𝐛2Λ~β,\max_{G\in{\mathcal{T}}_{\bar{\alpha}}}\left\|{X_{B}S_{B}{{\mathbf{b}}}}\right\|_{2}\leq\tilde{\Lambda}_{\beta},

where we used the shorthand B=[n]\GB=[n]\backslash G and SS continues to be the diagonal matrix containing the data point scores sis_{i} assigned by the model 𝐰{{\mathbf{w}}}.

Note that for the setting of robust least-squares regression, for any two models 𝐰^,𝐰\hat{{\mathbf{w}}},{{\mathbf{w}}} we have λmin(2Q~β(𝐰^|𝐰))=λmin(XSX)λmin(XGSGXG)\lambda_{\min}(\nabla^{2}\tilde{Q}_{\beta}(\hat{{\mathbf{w}}}|{{\mathbf{w}}}))=\lambda_{\min}(XSX^{\top})\geq\lambda_{\min}(X_{G}S_{G}X_{G}^{\top}) (since the scores sis_{i} are non-negative) which motivates the above re-definition of adaptive LWSC. For the same setting we also have Q~β(𝐰|𝐰)=XGSGϵG+XBSB𝐛\nabla\tilde{Q}_{\beta}({{\mathbf{w}}}^{\ast}\,|\,{{\mathbf{w}}})=X_{G}S_{G}\text{\boldmath$\mathbf{\epsilon}$}_{G}+X_{B}S_{B}{{\mathbf{b}}} (with the shorthand B=[n]\GB=[n]\backslash G) i.e. Q~β(𝐰|𝐰)2XGSGϵG2+XBSB𝐛2\left\|{\nabla\tilde{Q}_{\beta}({{\mathbf{w}}}^{\ast}\,|\,{{\mathbf{w}}})}\right\|_{2}\leq\left\|{X_{G}S_{G}\text{\boldmath$\mathbf{\epsilon}$}_{G}}\right\|_{2}+\left\|{X_{B}S_{B}{{\mathbf{b}}}}\right\|_{2} by triangle inequality. However, for sake of simplicity we will be analyzing the noiseless setting i.e. ϵ=𝟎\text{\boldmath$\mathbf{\epsilon}$}={\mathbf{0}} which motivates the above re-definition of adaptive LWLC. These re-definitions can be readily adapted to settings with hybrid noise i.e. when β<\beta^{\ast}<\infty as well.

We now show that for the same settings as considered for the proof of Theorem 2, the adaptive LWSC and LWLC properties are also satisfied, albeit with worse constants due to the application of union bounds over all possible “good” sets in the collection {\mathcal{T}}_{\bar{\alpha}} that the adversary could have left untouched by its corruptions.

Lemma 20 gives us, for any fixed set G𝒯α¯G\in{\mathcal{T}}_{\bar{\alpha}}

[λmin(XGSGXG)<(1ζ)cGβ2π]exp(Ω(nζ2dlog1ζdlog(n)))\displaystyle{\mathbb{P}}\left[{{\lambda_{\min}(X_{G}S_{G}X_{G}^{\top})<(1-\zeta)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\leq\exp\left({-\Omega\left({{n\zeta^{2}-d\log\frac{1}{\zeta}-d\log(n)}}\right)}\right)

By taking a union bound over all sets G\in{\mathcal{T}}_{\bar{\alpha}} and observing that

(nk)=(nnk)(nenk)nk=(eα)αn=exp(αn(1logα)),{n\choose k}={n\choose n-k}\leq\left({\frac{ne}{n-k}}\right)^{n-k}=\left({\frac{e}{\alpha}}\right)^{\alpha n}=\exp(\alpha n(1-\log\alpha)),

we have

[λ~β<(1ζ)cGβ2π]exp(Ω(n(ζ2+αlogαα)dlog1ζdlog(n))){\mathbb{P}}\left[{{\tilde{\lambda}_{\beta}<(1-\zeta)c\cdot G\sqrt{\frac{\beta}{2\pi}}}}\right]\leq\exp\left({-\Omega\left({{n(\zeta^{2}+\alpha\log\alpha-\alpha)-d\log\frac{1}{\zeta}-d\log(n)}}\right)}\right)

which requires setting \zeta\geq\Omega(\sqrt{\frac{d\log n}{n}+\alpha-\alpha\log\alpha}) to obtain a confidence of 1-\exp(-\Omega\left({{d}}\right)). This establishes the adaptive LWSC guarantee with a confidence level similar to the one we had for the partially adaptive case. We now establish the adaptive LWLC guarantee.

We notice that Lemma 8 can be extended to the “adaptive” setting using [1, Lemma 15] to show that with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)), we have

maxG𝒯α¯λmax(XBXB)\displaystyle\max_{G\in{\mathcal{T}}_{\bar{\alpha}}}\lambda_{\max}(X_{B}X_{B}^{\top}) αn(1+3e6logeα)+𝒪(nd)\displaystyle\leq\alpha n\left({1+3e\sqrt{6\log\frac{e}{\alpha}}}\right)+{\cal O}\left({{\sqrt{nd}}}\right)
B(1+3.01e6logeα)\displaystyle\leq B\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)

where we continue to use the notation B=[n]\backslash G and, for large enough n, we absorbed the \sqrt{nd} term into the first term by increasing the constant 3 to 3.01, since the second term asymptotically vanishes in comparison to the first term, which is linear in n. Now, following steps similar to those in the proof of Lemma 16 gives us

maxG𝒯α¯XBSB𝐛2\displaystyle\max_{G\in{\mathcal{T}}_{\bar{\alpha}}}\left\|{X_{B}S_{B}{{\mathbf{b}}}}\right\|_{2} XB2SB𝐛B2\displaystyle\leq\left\|{X_{B}}\right\|_{2}\left\|{S_{B}{{\mathbf{b}}}_{B}}\right\|_{2}
maxG𝒯α¯XB2[(κ2λmax(XBXBT)2π)1/3+(B2πe)1/3]32\displaystyle\leq\max_{G\in{\mathcal{T}}_{\bar{\alpha}}}\left\|{X_{B}}\right\|_{2}[(\frac{\kappa^{2}\lambda_{max}(X_{B}X_{B}^{T})}{2\pi})^{1/3}+(\frac{B}{2\pi e})^{1/3}]^{\frac{3}{2}}
maxG𝒯α¯λmax(XBXB)[(κ22π)1/3+(12πe)1/3]32\displaystyle\leq\max_{G\in{\mathcal{T}}_{\bar{\alpha}}}\lambda_{\max}(X_{B}X_{B}^{\top})[(\frac{\kappa^{2}}{2\pi})^{1/3}+(\frac{1}{2\pi e})^{1/3}]^{\frac{3}{2}}
B(1+3.01e6logeα)[(κ22π)1/3+(12πe)1/3]32,\displaystyle\leq B\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)[(\frac{\kappa^{2}}{2\pi})^{1/3}+(\frac{1}{2\pi e})^{1/3}]^{\frac{3}{2}},

where the second-last step uses the fact that our upper bound on \lambda_{\max}(X_{B}X_{B}^{\top}) is at least B. This establishes the adaptive LWLC property with confidence at least 1-\exp(-\Omega\left({{d}}\right)).

Theorem 21 (Theorem 3 restated – Fully Adaptive Adversary).

Suppose data is corrupted by a fully adaptive adversary that is able to decide the location of the corruptions as well as the corrupted labels using complete information of the true model, data features and clean labels, and SVAM-RR is initialized and executed as described in the statement of Theorem 2. Then SVAM-RR enjoys a breakdown point of α0.0036\alpha\leq 0.0036, i.e. it ensures model recovery even if k=αnk=\alpha\cdot n corruptions are introduced by a fully adaptive adversary where the value of α\alpha can go upto at least 0.00360.0036. More specifically, in the noiseless setting where β\beta^{\ast}\rightarrow\infty where clean data points do not experience any Gaussian noise i.e. ϵi=0\epsilon_{i}=0 and yi=𝐰,𝐱iy_{i}=\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}^{i}}\right\rangle for clean points, with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)), the LWSC/LWLC conditions are satisfied for all β(0,)\beta\in(0,\infty) i.e. βmax=\beta_{\max}=\infty. Consequently, for any ϵ>0\epsilon>0, within T𝒪(log1ϵβ1)T\leq{\cal O}\left({{\log\frac{1}{\epsilon\beta^{1}}}}\right) iterations, we have 𝐰^T𝐰22ϵ\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}^{2}\leq\epsilon.

Proof.

The above arguments establishing the adaptive LWSC and adaptive LWLC properties allow us to obtain the following result in a manner similar to that used in the proof of Theorem 14 (but in the noiseless setting)

𝐰^t+1𝐰22Λ~βλ~βB(1+3.01e6logeα)12π[(1.01κ2)1/3+(1e)1/3]32β2π(1ζ)1g~G\displaystyle\left\|{\hat{{\mathbf{w}}}^{t+1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\frac{2\tilde{\Lambda}_{\beta}}{\tilde{\lambda}_{\beta}}\leq\frac{B\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)\sqrt{\frac{1}{2\pi}}[(1.01\kappa^{2})^{1/3}+(\frac{1}{e})^{1/3}]^{\frac{3}{2}}}{\sqrt{\frac{\beta}{2\pi}}(1-\zeta)\frac{1}{\tilde{g}}\cdot G}
κβ11ζ(α(1+3.01e6logeα)1α1+ββ(1+κ2)3/21.01[(1.01)1/3+(eκ2)1/3]32)(A)\displaystyle\leq\frac{\kappa}{\sqrt{\beta}}\cdot\underbrace{\frac{1}{1-\zeta}\left({\frac{\alpha\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)}{1-\alpha}\sqrt{1+\frac{\beta}{\beta^{\ast}}}{\left({1+\kappa^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}}\right)}_{(A)}

Applying the limit β\beta^{\ast}\rightarrow\infty (since we are working in the pure corruption setting without any Gaussian noise on labels of the uncorrupted points) transforms the requirement (A)1(A)\leq 1 (which as before, assures us of the existence of a scale increment ξ>1\xi>1 satisfying the requirements of Theorem 1) to:

11ζ(α(1+3.01e6logeα)1α(1+κ2)3/21.01[(1.01)1/3+(eκ2)1/3]32)1\displaystyle\frac{1}{1-\zeta}\left({\frac{\alpha\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)}{1-\alpha}{\left({1+\kappa^{2}}\right)^{3/2}}\sqrt{1.01}\left[{(1.01)^{1/3}+\left({e\kappa^{2}}\right)^{-1/3}}\right]^{\frac{3}{2}}}\right)\leq 1

Setting κ=0.47\kappa=0.47 as done before further simplifies this requirement to

11ζ(α(1+3.01e6logeα)1α)14.38\frac{1}{1-\zeta}\left({\frac{\alpha\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)}{1-\alpha}}\right)\leq\frac{1}{4.38}

However, unlike earlier, where we could simply set \zeta to an arbitrarily small value for large enough n, we cannot do so now since, as noted earlier, we must set \zeta\geq\Omega(\sqrt{\frac{d\log n}{n}+\alpha-\alpha\log\alpha}) to obtain a confidence of 1-\exp(-\Omega\left({{d}}\right)) in the LWSC guarantee. Nevertheless, for large enough n we can still obtain \zeta\rightarrow\sqrt{\alpha-\alpha\log\alpha}, which transforms the requirement further to

11ααlogα(α(1+3.01e6logeα)1α)14.38\frac{1}{1-\sqrt{\alpha-\alpha\log\alpha}}\left({\frac{\alpha\left({1+3.01e\sqrt{6\log\frac{e}{\alpha}}}\right)}{1-\alpha}}\right)\leq\frac{1}{4.38}

which is satisfied for all values \alpha\leq 0.0036. This finishes the proof. ∎
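
The requirement above is a one-dimensional condition in \alpha and can be checked directly. The short sketch below (assuming numpy) evaluates the left-hand side and confirms that it falls below 1/4.38 at \alpha=0.0036 but not at slightly larger corruption rates.

```python
# Check the fully adaptive breakdown condition at alpha = 0.0036 (and just above it).
import numpy as np

def lhs(alpha):
    zeta = np.sqrt(alpha - alpha*np.log(alpha))       # the limiting value of zeta
    corr = alpha*(1 + 3.01*np.e*np.sqrt(6*np.log(np.e/alpha)))/(1 - alpha)
    return corr / (1 - zeta)

for alpha in (0.0036, 0.0040):
    print(alpha, lhs(alpha), 1/4.38)   # satisfied at 0.0036, violated at 0.0040
```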

Appendix H Robust Mean Estimation

We will let G,BG,B respectively denote the set of “good” uncorrupted points and “bad” corrupted points. We will abuse notation to let G=(1α)nG=(1-\alpha)\cdot n and B=αnB=\alpha\cdot n respectively denote the number of good and bad points too.

Theorem 22 (Theorem 5 restated).

For data generated in the robust mean estimation model as described in §4, suppose corruptions are introduced by a partially adaptive adversary i.e. the locations of the corruptions (the set BB) is not decided adversarially but the corruptions are decided jointly, adversarially and may be unbounded, then SVAM-ME enjoys a breakdown point of 0.26210.2621, i.e. it ensures a bounded 𝒪(1){\cal O}\left({{1}}\right) error even if k=αnk=\alpha\cdot n corruptions are introduced where the value of α\alpha can go upto at least 0.26210.2621. More generally, for corruption rates α0.2621\alpha\leq 0.2621, there always exists values of scale increment ξ>1\xi>1 s.t. with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)), LWSC/LWLC conditions are satisfied for the Q~β\tilde{Q}_{\beta} function corresponding to the robust mean estimation model for β\beta values at least as large as βmax=𝒪(βdmin{log1α,nd})\beta_{\max}={\cal O}\left({{\frac{\beta^{\ast}}{d}\min\left\{{\log\frac{1}{\alpha},\sqrt{nd}}\right\}}}\right). If initialized with 𝛍^1,β1\hat{\text{\boldmath$\mathbf{\mu}$}}^{1},\beta^{1} s.t. β1𝛍^1𝛍221{\beta_{1}}\cdot\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}\leq 1, SVAM-ME assures 𝛍^T𝛍22ϵ\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{T}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}\leq\epsilon for any ϵ𝒪(trace2(Σ)max{1ln(1/α),1nd})\epsilon\geq{\cal O}\left({{\text{trace}^{2}(\Sigma)\cdot\max\left\{{{\frac{1}{\ln(1/\alpha)}},\frac{1}{\sqrt{nd}}}\right\}}}\right) within T𝒪(lognβ1)T\leq{\cal O}\left({{\log\frac{n}{\beta^{1}}}}\right) iterations.

Proof.

For any two models 𝝁,𝜹\text{\boldmath$\mathbf{\mu}$},\text{\boldmath$\mathbf{\delta}$}, the Q~β\tilde{Q}_{\beta} function for robust mean estimation has the following form

Q~β(𝜹|𝝁)=i=1nsi𝐱i𝜹22,\tilde{Q}_{\beta}(\text{\boldmath$\mathbf{\delta}$}\,|\,\text{\boldmath$\mathbf{\mu}$})=\sum_{i=1}^{n}s_{i}\cdot\left\|{{{\mathbf{x}}}^{i}-\text{\boldmath$\mathbf{\delta}$}}\right\|_{2}^{2},

where siexp(β2𝐱i𝝁22)s_{i}\leftarrow\exp\left({-\frac{\beta}{2}\left\|{{{\mathbf{x}}}^{i}-\text{\boldmath$\mathbf{\mu}$}}\right\|_{2}^{2}}\right). We first outline the proof below.

Proof Outline. This proof has four key elements

  1.

    We will first establish this result for Σ=1βI\Sigma=\frac{1}{\beta^{\ast}}\cdot I for β=d\beta^{\ast}=d, then generalize the result for arbitrary β>0\beta^{\ast}>0. Note that for β=d\beta^{\ast}=d, we have trace(Σ)=1\text{trace}(\Sigma)=1.

  2.

    To establish the LWSC and LWLC properties, we will first consider, as before, a fixed value of \beta>0 for which the properties will be shown to hold with probability 1-\exp(-\Omega\left({{d}}\right)). As promised in the statement of Theorem 5, we will execute SVAM-ME for no more than {\cal O}\left({{\log n}}\right) iterations, so taking a naive union bound would offer a confidence level of 1-\log n\exp(-\Omega\left({{d}}\right)). However, this can be improved by noticing that the confidence levels offered by the LWSC/LWLC results are actually of the form 1-\exp(-\Omega\left({{n\zeta^{2}-d\log n}}\right)). Thus, a union over {\cal O}\left({{\log n}}\right) such events will at worst deteriorate the confidence bounds to 1-\log n\exp(-\Omega\left({{n\zeta^{2}-d\log n}}\right))=1-\exp(-\Omega\left({{n\zeta^{2}-d\log n-\log\log n}}\right)) which is still 1-\exp(-\Omega\left({{d}}\right)) for the values of \zeta we shall set.

  3.

    The key to this proof is to maintain the invariant βt𝝁^t𝝁21\sqrt{\beta_{t}}\cdot\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{t}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}\leq 1. Recall that initialization ensures β1𝝁^1𝝁221{\beta_{1}}\cdot{\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}}\leq 1 to start things off. §3 gives details on how to initialize in practice. This establishes the base case of an inductive argument. Next, inductively assuming that βt𝝁^t𝝁221{\beta_{t}}\cdot{\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{t}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}}\leq 1 for an iteration tt, we will establish that 𝝁^t+1𝝁22Λβtλβt(A)βt\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{t+1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}\leq\frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}\leq\frac{(A)}{\sqrt{\beta}_{t}} where (A)(A) will be an application-specific expression derived below.

  4.

    We will then ensure that (A)<1, say (A)=1/\sqrt{\xi} for some \xi>1, whenever the number of corruptions is below the breakdown point. This ensures \left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{t+1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}\leq\frac{1}{{\xi\beta_{t}}}, in other words, {\beta_{t+1}}\cdot{\left\|{\hat{\text{\boldmath$\mathbf{\mu}$}}^{t+1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}}\right\|_{2}^{2}}\leq 1 for \beta_{t+1}=\xi\cdot\beta_{t} so that the invariant is preserved. However, notice that the above step simultaneously ensures that \frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}\leq\frac{1}{\sqrt{\xi\beta_{t}}}. This ensures that a valid value of the scale increment \xi can always be found till \beta_{t}\leq\beta_{\max}. Specifically, we will be able to assure the existence of a scale increment \xi>1 satisfying the conditions of Theorem 1 w.r.t. the LWSC/LWLC results only till \beta<{\cal O}\left({{\frac{\beta^{\ast}}{d}\min\left\{{\log\frac{1}{\alpha},\sqrt{nd}}\right\}}}\right). (A minimal sketch of the update being analyzed appears right after this outline.)
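
Before the formal argument, the following minimal sketch (assuming numpy; the data-generation choices are arbitrary and the code is illustrative rather than the reference implementation) shows the update being analyzed: Gaussian weights at the current scale \beta_{t}, a weighted-mean step minimizing \tilde{Q}_{\beta_{t}}, and the scale increment \beta_{t+1}=\xi\beta_{t} that maintains the invariant from steps 3 and 4 above.

```python
# Illustrative SVAM-ME iteration (a sketch, not the reference implementation):
# Gaussian weights at scale beta_t, weighted-mean update, then beta_{t+1} = xi * beta_t.
import numpy as np

rng = np.random.default_rng(2)
n, d, alpha = 1000, 5, 0.1
mu_star = rng.normal(size=d)
X = mu_star + rng.normal(size=(n, d)) / np.sqrt(d)   # good points: covariance I/d, i.e. beta_star = d
X[: int(alpha * n)] += 10.0                          # corrupt an alpha fraction of the points

mu_hat, beta, xi = X.mean(axis=0), 0.01, 1.3          # crude initialization with beta_1*||Delta||^2 <= 1
print("initial error:", np.linalg.norm(mu_hat - mu_star))
for _ in range(20):
    s = np.exp(-0.5 * beta * ((X - mu_hat) ** 2).sum(axis=1))   # weights at the current scale beta
    mu_hat = (s[:, None] * X).sum(axis=0) / s.sum()             # weighted mean: minimizer of Q_beta-tilde
    beta *= xi                                                  # scale increment
print("final error:", np.linalg.norm(mu_hat - mu_star))
```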

We now present the proof. Lemmata 23 and 24 establish the LWSC/LWLC properties for the \tilde{Q}_{\beta} function for robust mean estimation. Let \text{\boldmath$\mathbf{\Delta}$}=\hat{\text{\boldmath$\mathbf{\mu}$}}^{t}-\text{\boldmath$\mathbf{\mu}$}^{\ast} and \text{\boldmath$\mathbf{\Delta}$}^{+}=\hat{\text{\boldmath$\mathbf{\mu}$}}^{t+1}-\text{\boldmath$\mathbf{\mu}$}^{\ast}. To simplify the notation, we will analyze below the updates made with weights scaled up by the constant \tilde{c}=\left({\sqrt{\frac{\beta+{\beta^{\ast}}}{\beta^{\ast}}}}\right)^{d}\exp\left({\frac{\beta{\beta^{\ast}}}{2(\beta+{\beta^{\ast}})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right) as defined in Lemma 9. This in no way affects the execution of the algorithm since this scaling factor appears in both the numerator and the denominator of the update terms and simply cancels away. However, it will simplify our presentation. We also, w.l.o.g., first analyze the case of \beta^{\ast}=d and then scale the space to retrieve a result for general \beta^{\ast}.

Using the bound \tilde{c}\leq\exp(\beta+0.5) from Lemma 26, Lemmata 23 and 24 give us

𝚫+2iBsi𝐛i2+iGsiϵi2i=1nsiBc~κβ+𝐦2+νGβG(1ζ)\displaystyle\left\|{\text{\boldmath$\mathbf{\Delta}$}^{+}}\right\|_{2}\leq\frac{\left\|{\sum_{i\in B}s_{i}{{\mathbf{b}}}^{i}}\right\|_{2}+\left\|{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}}{\sum_{i=1}^{n}s_{i}}\leq\frac{\frac{B\tilde{c}\kappa}{\sqrt{\beta}}+\left\|{{{\mathbf{m}}}}\right\|_{2}+\frac{\nu G}{\sqrt{\beta}}}{G(1-\zeta)}
1βκc~B+Gββ+d+νGG(1ζ)\displaystyle\leq\sqrt{\frac{1}{\beta}}\cdot\frac{\kappa\tilde{c}B+G\frac{\beta}{\beta+{d}}+\nu G}{G(1-\zeta)}
1β(11ζ[κc~BG+ββ+d+ν])(A),\displaystyle\leq\sqrt{\frac{1}{\beta}}\cdot\underbrace{\left({\frac{1}{1-\zeta}\left[{\kappa\tilde{c}\frac{B}{G}+\frac{\beta}{\beta+{d}}+\nu}\right]}\right)}_{(A)},

where κ=1+12\kappa=1+\sqrt{\frac{1}{2}}. As was the case of robust least squares regression in the proof of Theorem 14, to assure the existence of a scale increment ξ>1\xi>1 satisfying the requirements of Theorem 1 and hence a linear rate of convergence, all we require is to ensure (A)(A) has a value of the form 1ξ<1\frac{1}{\xi}<1 where ξ>1\xi>1. Now, noting that RXnR_{X}\leq\sqrt{n}, and promising that we will never set βn\beta\geq n as well as never set ν,ζ1n\nu,\zeta\leq\frac{1}{n}, we note that we need to set ζΩ(dlog(n)n)\zeta\geq\Omega\left({{\sqrt{\frac{d\log(n)}{n}}}}\right) as well as νΩ(dβlog(n)n(β+d))\nu\geq\Omega\left({{\sqrt{\frac{d\beta\log(n)}{n(\beta+d)}}}}\right), in order to ensure a confidence of 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) in the tail bounds we have established.

  1.

    Breakdown Point: As observed above, with large enough n, we can set \zeta,\nu to be small, yet positive, constants. For large enough d, if we set \beta={\cal O}\left({{1}}\right) to be a small enough constant then we have \frac{\beta}{\beta+d}\rightarrow 0, as well as \tilde{c}\leq\exp(0.5+\beta)\approx\sqrt{e}. This means we need only ensure \left({1+\sqrt{\frac{1}{2}}}\right)\sqrt{e}\frac{\alpha}{1-\alpha}\leq 1. The above is satisfied for all \alpha\leq 0.2621, which establishes a breakdown point of 26.21\% (see the numerical check following this proof). Note that even at this breakdown point, we still set r=\beta=\Omega\left({{1}}\right), and thus can still assure \left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq\sqrt{\frac{1}{\beta}}={\cal O}\left({{1}}\right).

  2.

    Consistency: To analyze the consistency properties of the algorithm, we recall that, ignoring universal constants, to obtain a linear rate of convergence we need to ensure

    11ζ[c~α+ββ+d+ν]<1\frac{1}{1-\zeta}\left[{\tilde{c}\alpha+\frac{\beta}{\beta+{d}}+\nu}\right]<1

    which can be rewritten as βd+11ζ+ν+c~α\frac{\beta}{d}+1\leq\frac{1}{\zeta+\nu+\tilde{c}\alpha}. Recall from above that we need to set ζΩ(dlog(n)n)\zeta\geq\Omega\left({{\sqrt{\frac{d\log(n)}{n}}}}\right) as well as νΩ(dβlog(n)n(β+d))\nu\geq\Omega\left({{\sqrt{\frac{d\beta\log(n)}{n(\beta+d)}}}}\right). Setting them at these lower bounds, using c~eexp(β)\tilde{c}\leq\sqrt{e}\exp(\beta), ignoring universal constants (since we are only interested in the asymptotic behavior of the algorithm) and some simple manipulations, we can show that, for all ndn\geq d, we can allow values of β\beta as large as

    βmin{𝒪(log1α),𝒪(nd)}\beta\leq\min\left\{{{\cal O}\left({{\log\frac{1}{\alpha}}}\right),{\cal O}\left({{\sqrt{nd}}}\right)}\right\}

    Note that the above assures us, when α=0\alpha=0 i.e. corruptions are absent, an error of 𝚫221β1nd0\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq\frac{1}{\beta}\leq\frac{1}{\sqrt{nd}}\rightarrow 0 as nn\rightarrow\infty. Thus, the method is consistent when corruptions are absent.

Scaling the space back up by a factor of dβ\sqrt{\frac{d}{\beta^{\ast}}} gives us the desired result. ∎
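
The breakdown constant claimed in step 1 of the above proof comes from the scalar condition \left({1+\sqrt{\frac{1}{2}}}\right)\sqrt{e}\frac{\alpha}{1-\alpha}\leq 1; the short check below (assuming numpy) confirms that it holds at \alpha=0.2621 and fails just above.

```python
# Check the robust mean estimation breakdown condition (1 + 1/sqrt(2)) * sqrt(e) * alpha/(1-alpha) <= 1.
import numpy as np

const = (1 + 1/np.sqrt(2)) * np.sqrt(np.e)
for alpha in (0.2621, 0.2622):
    print(alpha, const * alpha / (1 - alpha))   # <= 1 at 0.2621, > 1 just above
```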

Lemma 23 (LWSC for Robust Mean Estimation).

For any 0βn0\leq\beta\leq n, the Q~β\tilde{Q}_{\beta}-function for robust mean estimation satisfies the LWSC property with constant λβG(1ζ)\lambda_{\beta}\geq G(1-\zeta) with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) for any ζΩ(dlog(n)n)\zeta\geq\Omega\left({{\sqrt{\frac{d\log(n)}{n}}}}\right).

Proof.

It is easy to see that 2Q~β(𝝁^|𝝁)=(i=1nsi)I\nabla^{2}\tilde{Q}_{\beta}(\hat{\text{\boldmath$\mathbf{\mu}$}}\,|\,\text{\boldmath$\mathbf{\mu}$})=(\sum_{i=1}^{n}s_{i})\cdot I for any 𝝁^2(𝝁,1β)\hat{\text{\boldmath$\mathbf{\mu}$}}\in{\mathcal{B}}_{2}\left({\text{\boldmath$\mathbf{\mu}$}^{\ast},\sqrt{\frac{1}{\beta}}}\right). Applying Lemma 9 gives us

𝔼[c~exp(β2ϵ𝚫22)]=1,{\mathbb{E}}\left[{{\tilde{c}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)}}\right]=1,

where \tilde{c}=\left({\sqrt{\frac{\beta+{\beta^{\ast}}}{\beta^{\ast}}}}\right)^{d}\exp\left({\frac{\beta{\beta^{\ast}}}{2(\beta+{\beta^{\ast}})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right) is defined in Lemma 9. The analysis in the proof of Lemma 13, on the other hand, tells us that the random variable s=\tilde{c}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right) has a sub-Gaussian constant at most unity. Applying Hoeffding's inequality for sub-Gaussian variables and noticing G\geq n/2 gives us

[iGsiG(1ζ2)]exp(mζ2n8),{\mathbb{P}}\left[{{\sum_{i\in G}s_{i}\leq G\left({1-\frac{\zeta}{2}}\right)}}\right]\leq\exp\left({-\frac{m\zeta^{2}n}{8}}\right),

where m>0m>0 is a universal constant. Again notice that the above result holds for a fixed error vector 𝚫\mathbf{\Delta}. Suppose 𝚫1,𝚫2(1β)\text{\boldmath$\mathbf{\Delta}$}_{1},\text{\boldmath$\mathbf{\Delta}$}_{2}\in{\mathcal{B}}\left({\frac{1}{\sqrt{\beta}}}\right) are two error vectors such that 𝚫1𝚫22τ\left\|{\text{\boldmath$\mathbf{\Delta}$}_{1}-\text{\boldmath$\mathbf{\Delta}$}_{2}}\right\|_{2}\leq\tau. Then, denoting si1,si2s^{1}_{i},s^{2}_{i} as the weights assigned by these two error vectors, for all τ23βRX\tau\leq\frac{2}{3\sqrt{\beta}R_{X}}, by applying Lemma 25, we get

|iGsi1iGsi2|3Gc~τβRX,\left|{\sum_{i\in G}s^{1}_{i}-\sum_{i\in G}s^{2}_{i}}\right|\leq 3G\tilde{c}\tau\sqrt{\beta}R_{X},

where RXR_{X} is the maximum length in a set of nn vectors, each sampled from a dd-dimensional Gaussian (see Lemma 7). Applying a union bound over a τ\tau-net over 2(1β){\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right) with τ=ζ6βc~RX\tau=\frac{\zeta}{6\sqrt{\beta}\tilde{c}R_{X}} gives us

[𝚫2(1β):iGsiG(1ζ)](12RXc~βζ)dexp(mζ2n8),{\mathbb{P}}\left[{{\exists\text{\boldmath$\mathbf{\Delta}$}\in{\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right):\sum_{i\in G}s_{i}\leq G(1-\zeta)}}\right]\leq\left({\frac{12R_{X}\tilde{c}\sqrt{\beta}}{\zeta}}\right)^{d}\exp\left({-\frac{m\zeta^{2}n}{8}}\right),

Promising that we will always set β<nd\beta<\sqrt{\frac{n}{d}} and noting that Lemma 26 gives us c~eexp(β)\tilde{c}\leq\sqrt{e}\exp(\beta) for β=d\beta^{\ast}=d and noting that Lemma 7 gives us RXnR_{X}\leq\sqrt{n} finishes the proof. ∎

Lemma 24 (LWLC for Robust Mean Estimation).

For any 0βn0\leq\beta\leq n, the Q~β\tilde{Q}_{\beta}-function for robust mean estimation satisfies the LWLC property with constant ΛβG(1+ν)+Bc~κβ\Lambda_{\beta}\leq G(1+\nu)+\frac{B\tilde{c}\kappa}{\sqrt{\beta}} with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) for any νΩ(dβlog(n)n(β+d))\nu\geq\Omega\left({{\sqrt{\frac{d\beta\log(n)}{n(\beta+d)}}}}\right).

Proof.

It is easy to see that Q~β(𝝁|𝝁)=iGsiϵi+iBsi𝐛i\nabla\tilde{Q}_{\beta}(\text{\boldmath$\mathbf{\mu}$}^{\ast}\,|\,\text{\boldmath$\mathbf{\mu}$})=\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}_{i}+\sum_{i\in B}s_{i}{{\mathbf{b}}}^{i}. We bound these separately below. Recall that we are working with weights that are scaled by a factor of c~\tilde{c}, where c~=(β+ββ)dexp(ββ2(β+β)𝚫22)\tilde{c}=\left({\sqrt{\frac{\beta+{\beta^{\ast}}}{\beta^{\ast}}}}\right)^{d}\exp\left({\frac{\beta{\beta^{\ast}}}{2(\beta+{\beta^{\ast}})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right) is defined in Lemma 9.

Bad Points.

We have si=c~exp(β2𝐛i𝚫22)s_{i}=\tilde{c}\exp(-\frac{\beta}{2}\left\|{{{\mathbf{b}}}^{i}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}) for iBi\in B. Let κ=1+12\kappa=1+\sqrt{\frac{1}{2}}. This gives us two cases

  1.

    \left\|{{{\mathbf{b}}}^{i}}\right\|_{2}\leq\kappa\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}: in this case we use s_{i}\leq\tilde{c} and thus s_{i}\cdot\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}\leq\tilde{c}\kappa\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\leq\frac{\tilde{c}\kappa}{\sqrt{\beta}}

  2.

    \left\|{{{\mathbf{b}}}^{i}}\right\|_{2}>\kappa\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}: in this case we have \left\|{{{\mathbf{b}}}^{i}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}\geq\left({1-\frac{1}{\kappa}}\right)\left\|{{{\mathbf{b}}}^{i}}\right\|_{2} and thus we also have s_{i}\leq\tilde{c}\exp(-\frac{\beta}{2}\left({1-\frac{1}{\kappa}}\right)^{2}\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}^{2}), which gives us s_{i}\cdot\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}\leq\frac{\tilde{c}\kappa}{\sqrt{\beta}} upon using the fact that x\cdot\exp(-x^{2})<\frac{1}{2} for all x\geq 0 (this step is spelled out right after the case analysis).

The above tells us, by an application of the triangle inequality, that iBsi𝐛i2Bc~κβ\left\|{\sum_{i\in B}s_{i}{{\mathbf{b}}}^{i}}\right\|_{2}\leq\frac{B\tilde{c}\kappa}{\sqrt{\beta}}.
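For completeness, the substitution behind the second case above can be spelled out as follows: with x:=\sqrt{\frac{\beta}{2}}\left({1-\frac{1}{\kappa}}\right)\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}, we have

s_{i}\cdot\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}\leq\tilde{c}\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}\exp\left({-\frac{\beta}{2}\left({1-\frac{1}{\kappa}}\right)^{2}\left\|{{{\mathbf{b}}}^{i}}\right\|_{2}^{2}}\right)=\frac{\tilde{c}}{1-\frac{1}{\kappa}}\sqrt{\frac{2}{\beta}}\cdot x\exp(-x^{2})\leq\frac{\tilde{c}}{\left({1-\frac{1}{\kappa}}\right)\sqrt{2\beta}}=\frac{\tilde{c}\kappa}{\sqrt{\beta}},

since x\exp(-x^{2})\leq\frac{1}{2} for all x\geq 0 and \frac{1}{\sqrt{2}\left({1-\frac{1}{\kappa}}\right)}=\frac{1}{\sqrt{2}\left({\sqrt{2}-1}\right)}=1+\frac{1}{\sqrt{2}}=\kappa.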

Good Points.

We have si=c~exp(β2ϵi𝚫22)s_{i}=\tilde{c}\exp(-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}^{i}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}) for iGi\in G. Thus, Lemma 9 gives us

𝔼[iGsiϵi]=G𝔼[𝐱]=Gββ+d𝚫=:𝐦,\displaystyle{\mathbb{E}}\left[{{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}}\right]=G\cdot{\mathbb{E}}\left[{{{{\mathbf{x}}}}}\right]=G\cdot\frac{\beta}{\beta+{d}}\text{\boldmath$\mathbf{\Delta}$}=:{{\mathbf{m}}},

where 𝐱𝒩(ββ+d𝚫,1β+dI){{\mathbf{x}}}\sim{\mathcal{N}}\left({\frac{\beta}{\beta+{d}}\text{\boldmath$\mathbf{\Delta}$},\frac{1}{\beta+{d}}\cdot I}\right). Note that since β𝚫221\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq 1, we have

𝐦2Gββ+d\left\|{{{\mathbf{m}}}}\right\|_{2}\leq G\cdot\frac{\sqrt{\beta}}{\beta+{d}}

Using Lemma 13 and the linearity of the subexponential norm tells us that the subexponential norm of the random variable s\cdot\text{\boldmath$\mathbf{\epsilon}$}^{\top}{{\mathbf{v}}}=\tilde{c}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\text{\boldmath$\mathbf{\epsilon}$}^{\top}{{\mathbf{v}}}, for a fixed unit vector {{\mathbf{v}}}, is at most \frac{2}{\sqrt{\beta+d}} (where \tilde{c} is as defined in Lemma 9). Applying Bernstein's inequality for subexponential variables gives us

[iGsiϵi𝐯>𝐦𝐯+t]exp(mmin{t2(β+d)G,tβ+d}),{\mathbb{P}}\left[{{\sum_{i\in G}s_{i}{\text{\boldmath$\mathbf{\epsilon}$}^{i}}^{\top}{{\mathbf{v}}}>{{\mathbf{m}}}\cdot{{\mathbf{v}}}+t}}\right]\leq\exp\left({-m\cdot\min\left\{{\frac{t^{2}(\beta+d)}{G},t\sqrt{\beta+d}}\right\}}\right),

for some universal constant m>0m>0. Now, if 𝐯1,𝐯2Sd1{{\mathbf{v}}}^{1},{{\mathbf{v}}}^{2}\in S^{d-1} are two unit vectors such that 𝐯1𝐯22τ\left\|{{{\mathbf{v}}}^{1}-{{\mathbf{v}}}^{2}}\right\|_{2}\leq\tau, we have

|iGsiϵi𝐯1iGsiϵi𝐯2|iGsiϵi2τGRXτ,\left|{\sum_{i\in G}s_{i}{\text{\boldmath$\mathbf{\epsilon}$}^{i}}^{\top}{{\mathbf{v}}}^{1}-\sum_{i\in G}s_{i}{\text{\boldmath$\mathbf{\epsilon}$}^{i}}^{\top}{{\mathbf{v}}}^{2}}\right|\leq\left\|{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}\cdot\tau\leq GR_{X}\tau,

where RXR_{X} is the maximum length in a set of nn vectors, each sampled from a dd-dimensional Gaussian (see Lemma 7) and where in the last step we used the triangle inequality and the bounds si1s_{i}\leq 1 and ϵi2RX\left\|{\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}\leq R_{X} for all ii. Thus, taking a union bound over a τ\tau-net over the surface of the unit sphere Sd1S^{d-1} gives us

[𝐯Sd1,iGsiϵi𝐯>𝐦𝐯+t+GRXτ](2τ)dexp(mmin{t2(β+d)G,tβ+d}),{\mathbb{P}}\left[{{\exists{{\mathbf{v}}}\in S^{d-1},\sum_{i\in G}s_{i}{\text{\boldmath$\mathbf{\epsilon}$}^{i}}^{\top}{{\mathbf{v}}}>{{\mathbf{m}}}\cdot{{\mathbf{v}}}+t+GR_{X}\tau}}\right]\leq\left({\frac{2}{\tau}}\right)^{d}\exp\left({-m\cdot\min\left\{{\frac{t^{2}(\beta+d)}{G},t\sqrt{\beta+d}}\right\}}\right),

The above can be seen as simply an affirmation that iGsiϵi2𝐦2+t+GRXτ\left\|{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}\leq\left\|{{{\mathbf{m}}}}\right\|_{2}+t+GR_{X}\tau with high probability. Setting t=νG4βt=\frac{\nu G}{4\sqrt{\beta}} and τ=ν4RX\tau=\frac{\nu}{4R_{X}}, and noticing Gn/2G\geq n/2 gives us, upon promising that we always set ν1\nu\leq 1,

[iGsiϵi2>𝐦2+νG2β](8RXν)dexp(mν2n8β+dβ).{\mathbb{P}}\left[{{\left\|{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}>\left\|{{{\mathbf{m}}}}\right\|_{2}+\frac{\nu G}{2\sqrt{\beta}}}}\right]\leq\left({\frac{8R_{X}}{\nu}}\right)^{d}\exp\left({-\frac{m\nu^{2}n}{8}\cdot\frac{\beta+d}{\beta}}\right).

Now notice that this result holds for a fixed error vector \text{\boldmath$\mathbf{\Delta}$}\in{\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right). Suppose now that we have two vectors \text{\boldmath$\mathbf{\Delta}$}^{1},\text{\boldmath$\mathbf{\Delta}$}^{2}\in{\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right) such that \left\|{\text{\boldmath$\mathbf{\Delta}$}^{1}-\text{\boldmath$\mathbf{\Delta}$}^{2}}\right\|_{2}\leq\tau. If we let s^{1}_{i} and s^{2}_{i} denote the weights with respect to these two error vectors, then Lemma 25 tells us that, for any \tau, we must have

iG(si1si2)ϵi23GRXc~τ(βRX+2β).\left\|{\sum_{i\in G}(s^{1}_{i}-s^{2}_{i})\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}\leq 3GR_{X}\tilde{c}\tau\left({\beta R_{X}+2\sqrt{\beta}}\right).

Taking a union bound over a τ\tau-net over 2(1β){\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right) for τ=ν6c~RX(βRX+2β)β\tau=\frac{\nu}{6\tilde{c}R_{X}\left({\beta R_{X}+2\sqrt{\beta}}\right)\sqrt{\beta}} gives us

[𝚫2(1β):iGsiϵi2>𝐦2+νGβ](24RX2β2c~ν)d(8RXν)dexp(mν2n8β+dβ).{\mathbb{P}}\left[{{\exists\text{\boldmath$\mathbf{\Delta}$}\in{\mathcal{B}}_{2}\left({\sqrt{\frac{1}{\beta}}}\right):\left\|{\sum_{i\in G}s_{i}\text{\boldmath$\mathbf{\epsilon}$}^{i}}\right\|_{2}>\left\|{{{\mathbf{m}}}}\right\|_{2}+\frac{\nu G}{\sqrt{\beta}}}}\right]\leq\left({\frac{24R_{X}^{2}\beta^{2}\tilde{c}}{\nu}}\right)^{d}\left({\frac{8R_{X}}{\nu}}\right)^{d}\exp\left({-\frac{m\nu^{2}n}{8}\cdot\frac{\beta+d}{\beta}}\right).

This finishes the proof upon simple modifications and promising that we will always set β<nd\beta<\sqrt{\frac{n}{d}} and noting that Lemma 26 gives us c~eexp(β)\tilde{c}\leq\sqrt{e}\exp(\beta) for β=d\beta^{\ast}=d. ∎

Lemma 25.

Suppose \text{\boldmath$\mathbf{\Delta}$}^{1},\text{\boldmath$\mathbf{\Delta}$}^{2}\in{\mathcal{B}}_{2}\left({{\mathbf{0}},\frac{1}{\sqrt{\beta}}}\right) are two error vectors such that \left\|{\text{\boldmath$\mathbf{\Delta}$}^{1}-\text{\boldmath$\mathbf{\Delta}$}^{2}}\right\|_{2}\leq\tau and, for some vector \text{\boldmath$\mathbf{\epsilon}$}, define s^{i}=\tilde{c}\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}^{i}}\right\|_{2}^{2}}\right) for i=1,2. Then we must have

|s1s2|3c~τ(βϵ2+2β)\left|{s^{1}-s^{2}}\right|\leq 3\tilde{c}\tau\left({\beta\left\|{\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}+2\sqrt{\beta}}\right)
Proof.

Since exp(x)\exp(-x) is a 11-Lipschitz function in the region x0x\geq 0, we have

|exp(β2ϵ𝚫122)exp(β2ϵ𝚫222)||β2ϵ𝚫122+β2ϵ𝚫222|\displaystyle\left|{\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}^{1}}\right\|_{2}^{2}}\right)-\exp\left({-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}^{2}}\right\|_{2}^{2}}\right)}\right|\leq\left|{-\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}^{1}}\right\|_{2}^{2}+\frac{\beta}{2}\left\|{\text{\boldmath$\mathbf{\epsilon}$}-\text{\boldmath$\mathbf{\Delta}$}^{2}}\right\|_{2}^{2}}\right|
β2|(𝚫1+𝚫2+2ϵ)(𝚫1𝚫2)|3τ(βϵ2+2β),\displaystyle\leq\frac{\beta}{2}\left|{(\text{\boldmath$\mathbf{\Delta}$}^{1}+\text{\boldmath$\mathbf{\Delta}$}^{2}+2\text{\boldmath$\mathbf{\epsilon}$})^{\top}(\text{\boldmath$\mathbf{\Delta}$}^{1}-\text{\boldmath$\mathbf{\Delta}$}^{2})}\right|\leq 3\tau\left({\beta\left\|{\text{\boldmath$\mathbf{\epsilon}$}}\right\|_{2}+2\sqrt{\beta}}\right),

where in the last step we applied the Cauchy-Schwarz inequality and used \sqrt{\beta}\left\|{\text{\boldmath$\mathbf{\Delta}$}^{i}}\right\|_{2}\leq 1 for i=1,2. ∎

Lemma 26.

For β=d\beta^{\ast}=d, we have c~eexp(β)\tilde{c}\leq\sqrt{e}\exp(\beta) where c~=(β+ββ)dexp(ββ2(β+β)𝚫22)\tilde{c}=\left({\sqrt{\frac{\beta+{\beta^{\ast}}}{\beta^{\ast}}}}\right)^{d}\exp\left({\frac{\beta{\beta^{\ast}}}{2(\beta+{\beta^{\ast}})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right) is defined in Lemma 9.

Proof.

We have

\left({\sqrt{\frac{\beta+d}{d}}}\right)^{d}=\left({1+\frac{\beta}{d}}\right)^{\frac{d}{2}}\leq\exp\left({\frac{\beta}{2}}\right).

We also have, using β𝚫221\beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq 1,

exp(βd2(β+d)𝚫22)exp(d2(β+d)),\exp\left({\frac{\beta{d}}{2(\beta+{d})}\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}}\right)\leq\exp\left({\frac{{d}}{2(\beta+{d})}}\right),

Thus, we have

\tilde{c}\leq\exp\left({\frac{\beta}{2}+\frac{d}{2(\beta+d)}}\right)\leq\exp\left({\frac{\beta}{2}+\frac{1}{2}}\right)\leq\exp(\beta+0.5),
which finishes the proof. ∎
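As an illustrative numerical check of this bound (not part of the paper's code), one can verify on random instances that \tilde{c}\leq\sqrt{e}\exp(\beta) whenever \beta^{\ast}=d and \beta\left\|{\text{\boldmath$\mathbf{\Delta}$}}\right\|_{2}^{2}\leq 1:

```python
# Check c~ = ((beta + d)/d)^(d/2) * exp(beta*d*||Delta||^2 / (2*(beta + d))) <= exp(beta + 0.5)
# on random instances with beta* = d and beta * ||Delta||^2 <= 1.
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    d = int(rng.integers(1, 200))
    beta = rng.uniform(0.01, 50.0)
    delta_sq = rng.uniform(0.0, 1.0 / beta)   # ||Delta||_2^2 <= 1/beta
    c_tilde = ((beta + d) / d) ** (d / 2) * np.exp(beta * d * delta_sq / (2 * (beta + d)))
    assert c_tilde <= np.exp(beta + 0.5) + 1e-9
print("Lemma 26 bound verified on random instances")
```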

Appendix I Gamma Regression

For this proof we use the notation X=[𝐱1,,𝐱n]d×n,𝐲=[y1,,yn]+n,𝐛=[b1,,bn]+nX=[{{\mathbf{x}}}^{1},\ldots,{{\mathbf{x}}}^{n}]\in{\mathbb{R}}^{d\times n},{{\mathbf{y}}}=[y_{1},\ldots,y_{n}]\in{\mathbb{R}}_{+}^{n},{{\mathbf{b}}}=[b_{1},\ldots,b_{n}]\in{\mathbb{R}}_{+}^{n}. Recall that the labels for gamma distribution are always non-negative and, as specified in Section 4, the corruptions are multiplicative in this case. We will let G,BG,B respectively denote the set of “good” uncorrupted points and “bad” corrupted points. We will abuse notation to let G=(1α)nG=(1-\alpha)\cdot n and B=αnB=\alpha\cdot n respectively denote the number of good and bad points too.

To simplify the analysis, we assume that the data features {{\mathbf{x}}}^{i} are sampled uniformly from the surface of the unit sphere in d dimensions, i.e. S^{d-1}. We will also assume that, for clean points, labels are generated at the mode of the Gamma distribution, i.e. y_{i}=\exp(-\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}^{i}}\right\rangle)(1-\phi) (the no-noise model). For corrupted points, the labels are \tilde{y}_{i}=y_{i}\cdot b_{i} where b_{i}>0 but is otherwise arbitrary, or even unbounded. To simplify the presentation of the results, we will additionally assume \left\|{{{\mathbf{w}}}^{\ast}}\right\|_{2}=1 and \left\|{{{\mathbf{w}}}}\right\|_{2}\leq R.

For any vector 𝐯m{{\mathbf{v}}}\in{\mathbb{R}}^{m} and any set T[m]T\subseteq[m], 𝐯T{{\mathbf{v}}}_{T} denotes the vector with all coordinates other than those in the set TT zeroed out. Similarly, for any matrix Ak×m,ATA\in{\mathbb{R}}^{k\times m},A_{T} denotes the matrix with all columns other than those in the set TT zeroed out.

I.1 Variance Reduction with the Gamma distribution

Since the variance reduction step is a bit more involved with the Gamma distribution, we give a detailed derivation here. Consider the gamma distribution:

𝒢(yi;ηi,ϕ)\displaystyle{\cal G}(y_{i};\eta_{i},\phi) =1yiΓ(1ϕ)(yiηiϕ)1ϕexp(yiηiϕ)\displaystyle=\frac{1}{y_{i}\Gamma(\frac{1}{\phi})}\left(\frac{y_{i}\eta_{i}}{\phi}\right)^{\frac{1}{\phi}}\exp(-\frac{y_{i}\eta_{i}}{\phi})
=exp(yiηilnηiϕ+(1ϕ1)lnyi1ϕlnϕlnΓ(1ϕ))\displaystyle=\exp\left(\frac{y_{i}\eta_{i}-\ln\eta_{i}}{-\phi}+(\frac{1}{\phi}-1)\ln y_{i}-\frac{1}{\phi}\ln\phi-\ln\Gamma(\frac{1}{\phi})\right)

where the natural parameter is \eta_{i}=\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle). The mode-preserving, variance-reduced distribution is then:

𝒢(yi;ηi,ϕ,βt)\displaystyle{\cal G}(y_{i};\eta_{i},\phi,\beta_{t}) =exp(βt(yiηilnηiϕ+(1ϕ1)lnyi1ϕlnϕlnΓ(1ϕ))W(ηi,βt,ϕ))\displaystyle=\exp\left(\beta_{t}\left(\frac{y_{i}\eta_{i}-\ln\eta_{i}}{-\phi}+(\frac{1}{\phi}-1)\ln y_{i}-\frac{1}{\phi}\ln\phi-\ln\Gamma(\frac{1}{\phi})\right)-W(\eta_{i},\beta_{t},\phi)\right)

writing 1ϕ~βt:=βtϕβt+1\frac{1}{\tilde{\phi}_{\beta_{t}}}:=\frac{\beta_{t}}{\phi}-\beta_{t}+1 we have,

W(ηi,βt,ϕ)\displaystyle W(\eta_{i},\beta_{t},\phi) =ln(0exp(βt(yiηilnηiϕ+(1ϕ1)lnyi1ϕlnϕlnΓ(1ϕ)))𝑑y)\displaystyle=\ln\left(\int_{0}^{\infty}\exp\left(\beta_{t}\left(\frac{y_{i}\eta_{i}-\ln\eta_{i}}{-\phi}+(\frac{1}{\phi}-1)\ln y_{i}-\frac{1}{\phi}\ln\phi-\ln\Gamma(\frac{1}{\phi})\right)\right)dy\right)
=βtϕlnηiϕβtlnΓ(1ϕ)+1ϕ~βtlnϕβtηi+lnΓ(1ϕ~βt)\displaystyle=\frac{\beta_{t}}{\phi}\ln\frac{\eta_{i}}{\phi}-\beta_{t}\ln\Gamma(\frac{1}{\phi})+\frac{1}{\tilde{\phi}_{\beta_{t}}}\ln\frac{\phi}{\beta_{t}\eta_{i}}+\ln\Gamma\left(\frac{1}{\tilde{\phi}_{\beta_{t}}}\right)

So that,

𝒢(yi;ηi,ϕ,βt)\displaystyle{\cal G}(y_{i};\eta_{i},\phi,\beta_{t}) =exp(βtyiηiϕ+(1ϕ~t1)lnyi1ϕ~tlnϕβtηilnΓ(1ϕ~t))\displaystyle=\exp\left(-\frac{\beta_{t}y_{i}\eta_{i}}{\phi}+(\frac{1}{\tilde{\phi}_{t}}-1)\ln y_{i}-\frac{1}{\tilde{\phi}_{t}}\ln\frac{\phi}{\beta_{t}\eta_{i}}-\ln\Gamma(\frac{1}{\tilde{\phi}_{t}})\right)
=exp(yiη~iϕ~βt+(1ϕ~t1)lnyi1ϕ~βtlnϕ~βtη~ilnΓ(1ϕ~t))setting, η~i:=ηiβtϕ~tϕ\displaystyle=\exp\left(-\frac{y_{i}\tilde{\eta}_{i}}{\tilde{\phi}_{\beta_{t}}}+(\frac{1}{\tilde{\phi}_{t}}-1)\ln y_{i}-\frac{1}{\tilde{\phi}_{\beta_{t}}}\ln\frac{\tilde{\phi}_{\beta_{t}}}{\tilde{\eta}_{i}}-\ln\Gamma(\frac{1}{\tilde{\phi}_{t}})\right)\quad\text{setting, }\tilde{\eta}_{i}:=\frac{\eta_{i}\beta_{t}\tilde{\phi}_{t}}{\phi}
=𝒢(yi;η~i,ϕ~t)\displaystyle={\mathcal{G}}(y_{i};\tilde{\eta}_{i},\tilde{\phi}_{t})

Hence, to perform variance reduction, the following parameter update suffices (a numerical check of this identity is given below):

η~i=ηiβtϕ~tϕ=ηiβtϕ+βtϕβt;ϕ~βt=ϕβtϕβt+ϕ\displaystyle\tilde{\eta}_{i}=\frac{\eta_{i}\beta_{t}\tilde{\phi}_{t}}{\phi}=\eta_{i}\frac{\beta_{t}}{\phi+\beta_{t}-\phi\beta_{t}};\quad\tilde{\phi}_{\beta_{t}}=\frac{\phi}{\beta_{t}-\phi\beta_{t}+\phi}
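The algebra above can be sanity-checked numerically: raising the Gamma density to the power \beta_{t} and renormalizing should again yield a Gamma density with the tilted parameters (\tilde{\eta}_{i},\tilde{\phi}_{\beta_{t}}). The snippet below (an illustrative check using the parametrization of the density given above; it is not code from the SVAM repository) verifies that the ratio of the two unnormalized densities is constant in y.

```python
# Numerical check that G(y; eta, phi)^beta is proportional to G(y; eta_t, phi_t)
# with eta_t, phi_t given by the variance-reduction update above.
import numpy as np
from scipy.stats import gamma

eta, phi, beta = 1.7, 0.5, 3.0            # arbitrary test values with 0 < phi < 1

def g(y, eta_, phi_):
    """Gamma density in the (eta, phi) parametrization: shape 1/phi, scale phi/eta."""
    return gamma.pdf(y, a=1.0 / phi_, scale=phi_ / eta_)

phi_t = phi / (beta - phi * beta + phi)   # 1/phi_t = beta/phi - beta + 1
eta_t = eta * beta * phi_t / phi          # tilted natural parameter

ys = np.linspace(0.1, 5.0, 50)
ratio = g(ys, eta, phi) ** beta / g(ys, eta_t, phi_t)
# The ratio is constant in y; it equals the normalization constant exp(W).
assert np.allclose(ratio, ratio[0])
print("constant ratio:", ratio[0])
```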

We will give the proof for the setting where clean points suffer no noise. Let (\hat{\eta},\hat{\phi}) be the no-noise parameters. Assuming 0<\phi<1, we have \hat{\eta}_{i}=\lim\limits_{\beta\rightarrow\infty}\frac{\eta_{i}\beta}{\phi+\beta-\phi\beta}=\frac{\eta_{i}}{1-\phi}=\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{1-\phi} and \hat{\phi}=\lim\limits_{\beta\rightarrow\infty}\frac{\phi}{\phi+\beta-\phi\beta}=0. This gives

mode(𝒢(yi;η^i,ϕ^))=1ϕ^η^i=1ϕηi=mode(𝒢(yi;ηi,ϕ))\displaystyle mode({\mathcal{G}}(y_{i};\hat{\eta}_{i},\hat{\phi}))=\frac{1-\hat{\phi}}{\hat{\eta}_{i}}=\frac{1-\phi}{\eta_{i}}=mode({\mathcal{G}}(y_{i};\eta_{i},\phi))

Let b_{i}\in[0,\infty) denote the multiplicative corruption, with uncorrupted points having b_{i}=1.
We have,

yi=mode(𝒢(yi;η^i,ϕ^))bi=biexp(𝐰,𝐱i)(1ϕ)\displaystyle y_{i}=mode({\mathcal{G}}(y_{i};\hat{\eta}_{i},\hat{\phi}))\cdot b_{i}=b_{i}\exp(-\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)(1-\phi)

Let c:=\frac{1}{\phi}-1 and \Delta^{t}:=\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}. Then the weights take the form

si\displaystyle s_{i} =𝒢(yi;η~i,ϕ~t)=exp(𝐰,𝐱i)(1ϕ)Γ(1ϕ~t)(bicβtexp(𝚫t,𝐱i))1ϕ~texp(bicβtexp(𝚫t,𝐱i))\displaystyle={\mathcal{G}}(y_{i};\tilde{\eta}_{i},\tilde{\phi}_{t})=\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{t}})}\left(b_{i}c\beta_{t}\exp(\left\langle{\text{\boldmath$\mathbf{\Delta}$}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)\right)^{\frac{1}{\tilde{\phi}_{t}}}\exp(-b_{i}c\beta_{t}\exp(\left\langle{\text{\boldmath$\mathbf{\Delta}$}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle))

We may write,

\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}\,|\,\hat{{\mathbf{w}}}^{t})=-\log\prod\limits_{i=1}^{n}{\mathcal{G}}(y_{i};\frac{\eta_{i}}{1-\phi},\hat{\phi})^{s_{i}}=\sum\limits_{i=1}^{n}s_{i}\left(\frac{b_{i}\exp(\left\langle{{{\mathbf{w}}}-{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)-\left\langle{{{\mathbf{w}}}},{{{\mathbf{x}}}_{i}}\right\rangle+\ln(1-\phi)}{\hat{\phi}}-(\frac{1}{\hat{\phi}}-1)\ln y_{i}+\frac{1}{\hat{\phi}}\ln\hat{\phi}+\ln\Gamma(\frac{1}{\hat{\phi}})\right)

so that

\nabla_{{{\mathbf{w}}}}\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}\,|\,\hat{{\mathbf{w}}}^{t})=\frac{1}{\hat{\phi}}\sum\limits_{i=1}^{n}\left(b_{i}\exp(\left\langle{{{\mathbf{w}}}-{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)-1\right)s_{i}{{\mathbf{x}}}_{i}
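In implementation terms, the weighted update suggested by this derivation might look roughly as follows. This is only a minimal sketch and not the repository's implementation: it computes the variance-altered weights s_{i} from the tilted Gamma density and then takes a single gradient step on the s-weighted Gamma negative log-likelihood written in terms of the observed labels y_{i}; the step size, the normalization by the total weight, and the use of a gradient step in place of an exact weighted MLE are illustrative assumptions.

```python
# Sketch of one SVAM-Gamma style update (illustrative assumptions noted above).
import numpy as np
from scipy.stats import gamma

def svam_gamma_step(X, y, w_hat, beta_t, phi=0.5, lr=0.1):
    """X: (n, d) features, y: (n,) positive labels, w_hat: (d,) current estimate."""
    phi_t = phi / (beta_t - phi * beta_t + phi)   # tilted dispersion phi~_{beta_t}
    eta = np.exp(X @ w_hat)                       # eta_i = exp(<w_hat, x_i>)
    eta_t = eta * beta_t * phi_t / phi            # tilted natural parameter eta~_i
    # Weights s_i = G(y_i; eta~_i, phi~_{beta_t}): Gamma with shape 1/phi~, scale phi~/eta~.
    s = gamma.pdf(y, a=1.0 / phi_t, scale=phi_t / eta_t)
    # Gradient at w_hat of sum_i s_i * (y_i * exp(<w, x_i>) - <w, x_i>) / phi,
    # the s-weighted Gamma negative log-likelihood (up to terms independent of w).
    grad = X.T @ (s * (y * eta - 1.0)) / phi
    return w_hat - lr * grad / max(s.sum(), 1e-12)   # weight-normalized step (illustrative)
```

A full SVAM iteration would additionally update the scale as \beta_{t+1}=\xi\beta_{t} for some scale increment \xi>1, as in Theorem 27 below.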

Theorem 27 (Theorem 4 restated).

For data generated in the robust gamma regression model as described in §4, suppose corruptions are introduced by a partially adaptive adversary, i.e. the locations of the corruptions (the set B) are not decided adversarially but the corruptions themselves are decided jointly, adversarially, and may be unbounded. Then SVAM-Gamma enjoys a breakdown point of \frac{0.002}{\sqrt{d}}, i.e. it ensures a bounded {\cal O}\left({1}\right) error even if k=\alpha\cdot n corruptions are introduced, where the value of \alpha can go up to at least \frac{0.002}{\sqrt{d}}. More generally, for corruption rates \alpha\leq\frac{0.002}{\sqrt{d}}, there always exist values of the scale increment \xi>1 such that, with probability at least 1-\exp(-\Omega\left({d}\right)), the LWSC/LWLC conditions are satisfied for the \tilde{Q}_{\beta} function corresponding to the robust gamma regression model for \beta values at least as large as \beta_{\max}={\cal O}\left({1/\left({\exp\left({{\cal O}\left({\alpha\sqrt{d}}\right)}\right)-1}\right)}\right). Specifically, if initialized at \hat{{\mathbf{w}}}^{1},\beta_{1}\geq 1 such that \beta_{1}\cdot\left({\exp\left({\left\|{\hat{{\mathbf{w}}}^{1}-{{\mathbf{w}}}^{\ast}}\right\|_{2}}\right)-1}\right)^{2}\leq\frac{\phi}{1-\phi}, then for any \epsilon\geq{\cal O}\left({\alpha\sqrt{d}}\right), SVAM-Gamma assures

𝐰^T𝐰2ϵ\left\|{\hat{{\mathbf{w}}}^{T}-{{\mathbf{w}}}^{\ast}}\right\|_{2}\leq\epsilon

within T𝒪(log1ϵ)T\leq{\cal O}\left({{\log\frac{1}{\epsilon}}}\right) iterations.

Proof.

We first outline the proof below. We note that Theorem 4 is obtained by setting ϕ=0.5\phi=0.5 in the above statement.

Proof Outline.

The proof outline is similar to the one followed for robust least squares regression and robust mean estimation in Theorems 14 and 22, but adapted to suit the alternate parametrization and invariant used by SVAM-Gamma for gamma regression. Lemmata 28, 29 below establish the LWSC and LWLC properties with high probability for \beta_{t}\geq 1. Let \Delta^{t}=\hat{{\mathbf{w}}}^{t}-{{\mathbf{w}}}^{\ast}. Unlike in mean estimation and robust regression, where we maintained the invariant \beta_{t}\cdot\left\|{\Delta^{t}}\right\|_{2}^{2}\leq 1, in this setting we will instead maintain the invariant c\beta_{t}(\exp(\left\|{\Delta^{t}}\right\|)-1)\leq\frac{\phi}{1-\phi}. This is because gamma regression uses a non-standard canonical parameter to enable support only over non-negative labels. Note however that this altered invariant still ensures that as \beta_{t}\rightarrow\infty, we also have \left\|{\Delta^{t}}\right\|_{2}\rightarrow 0. Since we will always use \beta_{t}\geq 1 (since we set \beta_{1}=1), we correspondingly require \left\|{\Delta^{1}}\right\|_{2}\leq\ln(\frac{1}{c}+1) during initialization, as mentioned in the statement of Theorem 4, where c=\frac{1}{\phi}-1. Since we will analyze the special case \phi=0.5 for the sake of simplicity, we get c=1.

Lemmata 28, 29 establish the LWSC/LWLC properties for the \tilde{Q}_{\beta} function for robust gamma regression. Given the above proof outline, applying Lemmata 28, 29 gives us (taking \left\|{{{\mathbf{w}}}^{\ast}}\right\|_{2}=1 and \left\|{{{\mathbf{w}}}}\right\|_{2}\leq R)

\left\|{\Delta^{t+1}}\right\|_{2}\leq\frac{2\Lambda_{\beta_{t}}}{\lambda_{\beta_{t}}}\leq\frac{2\frac{m(\beta_{t})B}{\hat{\phi}}\sqrt{\frac{2}{d}}}{(1-\zeta)\mu_{c}}
2Bϕ^2d1+cβt(cβt)2exp(1)(1ϕ)Γ(1ϕ~βt)(cβt+2)cβt+2exp(cβt2)(1ζ)exp(1ϕ~βtln(1ϕ~βt1))ϕ^(1ϕ)Γ(1ϕ~βt)Gdexp(R(cβt+1)ln(1+1cβt)1cβt)\displaystyle\leq\frac{\frac{2B}{\hat{\phi}}\sqrt{\frac{2}{d}}\frac{1+c\beta_{t}}{(c\beta_{t})^{2}}\frac{\exp(1)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(c\beta_{t}+2\right)^{c\beta_{t}+2}\exp(-c\beta_{t}-2)}{(1-\zeta)\frac{\exp(\frac{1}{\tilde{\phi}_{\beta_{t}}}\ln(\frac{1}{\tilde{\phi}_{\beta_{t}}}-1))}{\hat{\phi}(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\frac{G}{d}\exp\left(-R-(c\beta_{t}+1)\ln(1+\frac{1}{c\beta_{t}})-1-c\beta_{t}\right)}
22dBG(1ζ)(1+2cβt)2exp(R+3)\displaystyle\leq\frac{2\sqrt{2d}B}{G(1-\zeta)}(1+\frac{2}{c\beta_{t}})^{2}\exp(R+3)

For the breakdown point calculation, we set \zeta=0.5 and R=1, and require

22dBG(1ζ)(1+2cβt)2exp(R+3)ln(1ξβt+1)\displaystyle\frac{2\sqrt{2d}B}{G(1-\zeta)}(1+\frac{2}{c\beta_{t}})^{2}\exp(R+3)\leq\ln(\frac{1}{\xi\beta_{t}}+1)

to get \alpha\leq\frac{0.002}{\sqrt{d}}, which finishes the proof. ∎

Lemma 28 (LWSC for Robust Gamma Regression).

For any 0βtn0\leq\beta_{t}\leq n, the Q~β\tilde{Q}_{\beta}-function for gamma regression satisfies the LWSC property with constant λβ(1ζ)μc\lambda_{\beta}\geq(1-\zeta)\mu_{c} with probability at least 1exp(Ω(d))1-\exp(-\Omega\left({{d}}\right)) where μc\mu_{c} is defined in the proof.

Proof.

Since \frac{1}{\tilde{\phi}_{\beta_{t}}}=c\beta_{t}+1, it is easy to see that at the good points we have

2Q~G(𝐰|𝐰^t)\displaystyle\nabla^{2}\tilde{Q}_{G}({{\mathbf{w}}}\,|\,\hat{{\mathbf{w}}}^{t}) =1ϕ^iGexp(𝐰𝐰,𝐱i)si𝐱i𝐱i\displaystyle=\frac{1}{\hat{\phi}}\sum\limits_{i\in G}\exp(\left\langle{{{\mathbf{w}}}-{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)s_{i}{{\mathbf{x}}}_{i}{{\mathbf{x}}}_{i}^{\top}
=κ(ϕ,ϕ^,ϕ~βt)iGexp(𝐰,𝐱i(1ϕ~βt1)exp(Δt,𝐱i)+1ϕ~βtΔt,𝐱i)𝐱i𝐱i,\displaystyle=\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\sum\limits_{i\in G}\exp\left(\left\langle{{{\mathbf{w}}}},{{{\mathbf{x}}}_{i}}\right\rangle-(\frac{1}{\tilde{\phi}_{\beta_{t}}}-1)\exp(\left\langle{\Delta^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)+\frac{1}{\tilde{\phi}_{\beta_{t}}}\left\langle{\Delta^{t}},{{{\mathbf{x}}}_{i}}\right\rangle\right){{\mathbf{x}}}_{i}{{\mathbf{x}}}_{i}^{\top},

where κ(ϕ,ϕ^,ϕ~βt)=exp(1ϕ~βtln(1ϕ~βt1))ϕ^(1ϕ)Γ(1ϕ~βt)\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})=\frac{\exp(\frac{1}{\tilde{\phi}_{\beta_{t}}}\ln(\frac{1}{\tilde{\phi}_{\beta_{t}}}-1))}{\hat{\phi}(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}.

Let us write,

φ(𝐱1,..,𝐱i,..𝐱n):=𝐯𝐰2Q~G(𝐰|𝐰^t)𝐯=κ(ϕ,ϕ^,ϕ~βt)[iGg(𝐱i,𝐰,Δt)𝐯,𝐱i2]\displaystyle\varphi({{\mathbf{x}}}_{1},..,{{\mathbf{x}}}_{i},..{{\mathbf{x}}}_{n}):={{\mathbf{v}}}^{\top}\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}){{\mathbf{v}}}=\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\left[\sum\limits_{i\in G}g({{\mathbf{x}}}_{i},{{\mathbf{w}}},\Delta^{t})\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}_{i}}\right\rangle^{2}\right]

with,

g(𝐱i,𝐰,Δt):=\displaystyle g({{\mathbf{x}}}_{i},{{\mathbf{w}}},\Delta^{t}):= exp(𝐰,𝐱i(1ϕ~βt1)exp(Δt,𝐱i)+1ϕ~βtΔt,𝐱i)\displaystyle\exp\left(\left\langle{{{\mathbf{w}}}},{{{\mathbf{x}}}_{i}}\right\rangle-(\frac{1}{\tilde{\phi}_{\beta_{t}}}-1)\exp(\left\langle{\Delta^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)+\frac{1}{\tilde{\phi}_{\beta_{t}}}\left\langle{\Delta^{t}},{{{\mathbf{x}}}_{i}}\right\rangle\right)
\displaystyle\geq exp(R(cβt+1)Δtcβtexp(Δt))using, Δt,𝐱iΔt\displaystyle\exp\left(-R-(c\beta_{t}+1)\left\|{\Delta^{t}}\right\|-c\beta_{t}\exp(\left\|{\Delta^{t}}\right\|)\right)\quad\text{using, }\left\langle{\Delta^{t}},{{{\mathbf{x}}}_{i}}\right\rangle\geq-\left\|{\Delta^{t}}\right\|
\displaystyle\geq exp(R(cβt+1)ln(1+1cβt)1cβt)=:gmin\displaystyle\exp\left(-R-(c\beta_{t}+1)\ln(1+\frac{1}{c\beta_{t}})-1-c\beta_{t}\right)=:g_{min}

and,

g(𝐱i,𝐰,Δt)\displaystyle g({{\mathbf{x}}}_{i},{{\mathbf{w}}},\Delta^{t})\leq exp(R+(cβt+1)Δtcβtexp(Δt))\displaystyle\exp\left(R+(c\beta_{t}+1)\left\|{\Delta^{t}}\right\|-c\beta_{t}\exp(-\left\|{\Delta_{t}}\right\|)\right)
\displaystyle\leq exp(R+(cβt+1)ln(1+1cβt)(cβt)21+cβt)=:gmax\displaystyle\exp\left(R+(c\beta_{t}+1)\ln(1+\frac{1}{c\beta_{t}})-\frac{(c\beta_{t})^{2}}{1+c\beta_{t}}\right)=:g_{\max}

We have,

μ:=𝔼𝐱i𝒮d1[φ(𝐱1,..,𝐱i,..𝐱n)]Gκ(ϕ,ϕ^,ϕ~βt)gmin𝔼𝐱i𝒮d1[𝐯,𝐱i2]=Gκ(ϕ,ϕ^,ϕ~βt)gmind=:μc\displaystyle\mu:=\underset{{{\mathbf{x}}}_{i}\sim{\cal S}^{d-1}}{{\mathbb{E}}}\left[{{\varphi({{\mathbf{x}}}_{1},..,{{\mathbf{x}}}_{i},..{{\mathbf{x}}}_{n})}}\right]\geq G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\min}\underset{{{\mathbf{x}}}_{i}\sim{\cal S}^{d-1}}{{\mathbb{E}}}\left[{{\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}_{i}}\right\rangle^{2}}}\right]=\frac{G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\min}}{d}=:\mu_{c}
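The last step uses {\mathbb{E}}_{{{\mathbf{x}}}\sim{\cal S}^{d-1}}\left[{\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}}\right\rangle^{2}}\right]=\frac{1}{d} for any unit vector {{\mathbf{v}}}, which can be confirmed with a quick Monte Carlo simulation (illustrative only):

```python
# Monte Carlo check that E[<v, x>^2] = 1/d for x uniform on the unit sphere S^{d-1}.
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200_000
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # normalized Gaussians are uniform on S^{d-1}
v = np.zeros(d)
v[0] = 1.0                                      # any fixed unit vector works by symmetry
print(np.mean((x @ v) ** 2), 1.0 / d)           # both approximately 0.0625
```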

To get a high-probability lower bound on the LWSC constant \lambda_{\beta}, we apply McDiarmid's inequality. Observing that

|φ(𝐱1,..,𝐱i,..,𝐱n)φ(𝐱1,..,𝐱i,..,𝐱n)|\displaystyle\left|{\varphi({{\mathbf{x}}}_{1},..,{{\mathbf{x}}}_{i},..,{{\mathbf{x}}}_{n})-\varphi({{\mathbf{x}}}_{1},..,{{\mathbf{x}}}_{i}^{\prime},..,{{\mathbf{x}}}_{n})}\right|
=κ(ϕ,ϕ^,ϕ~βt)|g(𝐱i,𝐰,Δt)𝐯,𝐱i2g(𝐱i,𝐰,Δt)𝐯,𝐱i2|\displaystyle=\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\left|g({{\mathbf{x}}}_{i},{{\mathbf{w}}},\Delta^{t})\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}_{i}}\right\rangle^{2}-g({{\mathbf{x}}}_{i}^{\prime},{{\mathbf{w}}},\Delta^{t})\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}_{i}^{\prime}}\right\rangle^{2}\right|
κ(ϕ,ϕ^,ϕ~βt)(g(𝐱i,𝐰,Δt)+g(𝐱i,𝐰,Δt))\displaystyle\leq\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})(g({{\mathbf{x}}}_{i},{{\mathbf{w}}},\Delta^{t})+g({{\mathbf{x}}}_{i}^{\prime},{{\mathbf{w}}},\Delta^{t}))
2κ(ϕ,ϕ^,ϕ~βt)gmax\displaystyle\leq 2\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}

we may write, for any fixed {{\mathbf{v}}}\in{\cal S}^{d-1} and {{\mathbf{w}}}\in{\mathbb{R}}^{d},

(|φ(𝐱1,..,𝐱i,..,𝐱n)μ|t)2exp(2t22Gκ(ϕ,ϕ^,ϕ~βt)gmax)\displaystyle{\mathbb{P}}(\left|{\varphi({{\mathbf{x}}}_{1},..,{{\mathbf{x}}}_{i},..,{{\mathbf{x}}}_{n})-\mu}\right|\geq t)\leq 2\exp\left(\frac{-2t^{2}}{2G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}}\right)

For any square symmetric matrix F\in\mathbb{R}^{d\times d}, we have \left\|{F}\right\|_{2}\leq\frac{1}{1-2\epsilon}\sup\limits_{{{\mathbf{v}}}\in{\cal N}_{\epsilon}}\left|{{{\mathbf{v}}}^{\top}F{{\mathbf{v}}}}\right|. Taking F=\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})-\mu I_{d} gives \left\|{\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})-\mu I_{d}}\right\|_{2}\leq 2\sup\limits_{{{\mathbf{v}}}\in{\cal N}_{\frac{1}{4}}}\left|{{{\mathbf{v}}}^{\top}\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}){{\mathbf{v}}}-\mu}\right|.

Taking a union bound over a \frac{1}{4}-net of the unit sphere S^{d-1} gives

(𝐰2Q~G(𝐰|𝐰^t)μId2t)\displaystyle{\mathbb{P}}\left(\left\|{\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})-\mu I_{d}}\right\|_{2}\geq t\right) (2sup𝐯𝒩14|𝐯𝐰2Q~G(𝐰|𝐰^t)𝐯μ|t)\displaystyle\leq{\mathbb{P}}\left(2\sup\limits_{{{\mathbf{v}}}\in{\cal N}_{\frac{1}{4}}}\left|{{{\mathbf{v}}}^{\top}\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}){{\mathbf{v}}}-\mu}\right|\geq t\right)
29dexp(t24Gκ(ϕ,ϕ^,ϕ~βt)gmax)\displaystyle\leq 2\cdot 9^{d}\exp\left(\frac{-t^{2}}{4G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}}\right)

Since, λβt:=λmin(𝐰2Q~(𝐰|𝐰^t))λmin(𝐰2Q~G(𝐰|𝐰^t))\lambda_{\beta_{t}}:=\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))\geq\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})) and μμc\mu\geq\mu_{c}, we have,

(λβtμct)(λmin(𝐰2Q~G(𝐰|𝐰^t))μct)(λmin(𝐰2Q~G(𝐰|𝐰^t))μt)(|λmin(𝐰2Q~G(𝐰|𝐰^t))μ|t)(𝐰2Q~G(𝐰|𝐰^t)μId2t)29dexp(t24Gκ(ϕ,ϕ^,ϕ~βt)gmax){\mathbb{P}}(\lambda_{\beta_{t}}-\mu_{c}\leq-t)\leq{\mathbb{P}}(\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))-\mu_{c}\leq-t)\leq{\mathbb{P}}(\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))-\mu\leq-t)\leq{\mathbb{P}}(\left|{\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))-\mu}\right|\geq t)\leq{\mathbb{P}}(\left\|{\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})-\mu I_{d}}\right\|_{2}\geq t)\leq 2\cdot 9^{d}\exp\left(\frac{-t^{2}}{4G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}}\right)

Putting t=\frac{\zeta\mu_{c}}{2} with 0<\zeta<1, for any fixed {{\mathbf{w}}}\in\mathbb{R}^{d} we have:

(λmin(𝐰2Q~G(𝐰|𝐰^t))(1ζ2)μc)29dexp(ζ2μc216Gκ(ϕ,ϕ^,ϕ~βt)gmax)\displaystyle{\mathbb{P}}\left(\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))\leq(1-\frac{\zeta}{2})\mu_{c}\right)\leq 2\cdot 9^{d}\exp\left(\frac{-\zeta^{2}\mu_{c}^{2}}{16G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}}\right)

In order to take a union bound over {{\mathbf{w}}}, let \tau=\left\|{\hat{{\mathbf{w}}}^{1}-\hat{{\mathbf{w}}}^{2}}\right\|_{2}; using \exp(x)\leq 1+2x for 0\leq x\leq 1, we observe

g(𝐱i,𝐰^1,Δt)g(𝐱i,𝐰^2,Δt)=g(𝐱i,𝐰^2,Δt)(exp(𝐰^1𝐰^2,𝐱i)1)2τg(𝐱i,𝐰^2,Δt)2τgmaxg({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{1},\Delta^{t})-g({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{2},\Delta^{t})=g({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{2},\Delta^{t})(\exp\left(\left\langle{\hat{{\mathbf{w}}}^{1}-\hat{{\mathbf{w}}}^{2}},{{{\mathbf{x}}}_{i}}\right\rangle\right)-1)\leq 2\tau g({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{2},\Delta^{t})\leq 2\tau g_{\max}

So that,

|λmin(𝐰^12Q~G(𝐰|𝐰^t))λmin(𝐰^22Q~G(𝐰|𝐰^t))|sup𝐯2=1|𝐯(𝐰^12Q~G(𝐰|𝐰^t)𝐰^22Q~G(𝐰|𝐰^t))𝐯|\displaystyle\left|{\lambda_{\min}(\nabla_{\hat{{\mathbf{w}}}^{1}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))-\lambda_{\min}(\nabla_{\hat{{\mathbf{w}}}^{2}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))}\right|\leq\sup_{\left\|{{{\mathbf{v}}}}\right\|_{2}=1}\left|{{{\mathbf{v}}}^{\top}(\nabla_{\hat{{\mathbf{w}}}^{1}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})-\nabla_{\hat{{\mathbf{w}}}^{2}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t})){{\mathbf{v}}}}\right|
=κ(ϕ,ϕ^,ϕ~βt)sup𝐯2=1|[iG(g(𝐱i,𝐰^1,Δt)g(𝐱i,𝐰^2,Δt))𝐯,𝐱i2]|2κ(ϕ,ϕ^,ϕ~βt)τGgmax\displaystyle=\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\sup_{\left\|{{{\mathbf{v}}}}\right\|_{2}=1}\left|{\left[\sum\limits_{i\in G}\left(g({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{1},\Delta^{t})-g({{\mathbf{x}}}_{i},\hat{{\mathbf{w}}}^{2},\Delta^{t})\right)\left\langle{{{\mathbf{v}}}},{{{\mathbf{x}}}_{i}}\right\rangle^{2}\right]}\right|\leq 2\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\tau Gg_{\max}

In order to set the \tau-net, we would require (1-\zeta)\mu_{c}\leq(1-\frac{\zeta}{2})\mu_{c}-2\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})\tau Gg_{\max}, i.e.

τζμc4κ(ϕ,ϕ^,ϕ~βt)Ggmax=ζGκ(ϕ,ϕ^,ϕ~βt)dgmin4κ(ϕ,ϕ^,ϕ~βt)Ggmax=ζgmin4dgmax\displaystyle\tau\leq\frac{\zeta\mu_{c}}{4\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})Gg_{\max}}=\frac{\zeta\frac{G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})}{d}g_{\min}}{4\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})Gg_{\max}}=\frac{\zeta g_{\min}}{4dg_{\max}}

Taking the covering number of {\cal B}({{\mathbf{w}}}^{\ast},R), which is at most (\frac{3R}{\tau})^{d}=(\frac{12dRg_{\max}}{\zeta g_{\min}})^{d}, and observing

gmaxgmin(1+1cβt)2exp(2R+3+cβt1+cβt)\frac{g_{\max}}{g_{\min}}\leq(1+\frac{1}{c\beta_{t}})^{2}\exp\left(2R+3+\frac{c\beta_{t}}{1+c\beta_{t}}\right), κ(ϕ,ϕ^,ϕ~βt)gmin2gmaxβtϕϕ^(1+1cβt)3exp(3R5cβt1+cβt)\frac{\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\min}^{2}}{g_{\max}}\geq\frac{\beta_{t}}{\phi\hat{\phi}}(1+\frac{1}{c\beta_{t}})^{-3}\exp\left(-3R-5-\frac{c\beta_{t}}{1+c\beta_{t}}\right)

We may write,

(𝐰(0,R):λmin(𝐰2Q~G(𝐰|𝐰^t))(1ζ)μc)29d(12dRgmaxζgmin)dexp(ζ2μc216Gκ(ϕ,ϕ^,ϕ~βt)gmax)\displaystyle{\mathbb{P}}\left(\exists{{\mathbf{w}}}\in{\cal B}(0,R):\lambda_{\min}(\nabla_{{{\mathbf{w}}}}^{2}\tilde{Q}_{G}({{\mathbf{w}}}|\hat{{\mathbf{w}}}^{t}))\leq(1-\zeta)\mu_{c}\right)\leq 2\cdot 9^{d}\cdot(\frac{12dRg_{\max}}{\zeta g_{\min}})^{d}\exp\left(\frac{-\zeta^{2}\mu_{c}^{2}}{16G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\max}}\right) =2exp(ζ2Gκ(ϕ,ϕ^,ϕ~βt)gmin216d2gmax+dln(108dRgmaxζgmin))=exp(Ω(ζ2Gβtd2ϕϕ^dlndζ))\displaystyle=2\cdot\exp\left(\frac{-\zeta^{2}G\kappa(\phi,\hat{\phi},\tilde{\phi}_{\beta_{t}})g_{\min}^{2}}{16d^{2}g_{\max}}+d\ln(\frac{108dRg_{\max}}{\zeta g_{\min}})\right)=\exp\left(-\Omega(\frac{\zeta^{2}G\beta_{t}}{d^{2}\phi\hat{\phi}}-\frac{d\ln d}{\zeta})\right)

This finishes the proof. ∎

Lemma 29 (LWLC for Robust Gamma Regression).

For any 0\leq\beta\leq n, the \tilde{Q}_{\beta}-function for robust gamma regression satisfies the LWLC property with constant \Lambda_{\beta}\leq\frac{m(\beta_{t})B}{\hat{\phi}}\sqrt{\frac{2}{d}} with probability at least 1-\exp(-\Omega\left({d}\right)), where m(\beta_{t}) is defined in the proof.

Proof.

It is easy to see that Q~βt(𝐰|𝐰)=1ϕ^iB(bi1)si𝐱i\nabla\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,{{\mathbf{w}}})=\frac{1}{{\hat{\phi}}}\sum\limits_{i\in B}(b_{i}-1)s_{i}{{\mathbf{x}}}_{i}. Since bi=1b_{i}=1 for good points in the no-noise setting we have assumed, good points do not contribute to the gradient at all. Thus, we get

Q~βt(𝐰|𝐰)2=1ϕ^XB𝐭21ϕ^XB2𝐭2,\left\|{\nabla\tilde{Q}_{\beta_{t}}({{\mathbf{w}}}^{\ast}\,|\,{{\mathbf{w}}})}\right\|_{2}=\frac{1}{\hat{\phi}}\left\|{X_{B}{{\mathbf{t}}}}\right\|_{2}\leq\frac{1}{\hat{\phi}}\left\|{X_{B}}\right\|_{2}\left\|{{{\mathbf{t}}}}\right\|_{2},

where {{\mathbf{t}}}=[t_{i}]_{i\in B} and t_{i}=(b_{i}-1)\cdot s_{i}. Now, since the {{\mathbf{x}}}_{i} lie on {\cal S}^{d-1} and the locations of the corruptions were chosen randomly without looking at the data, with probability at least 1-\exp(-\Omega\left({d}\right)) we have \left\|{X_{B}}\right\|_{2}\leq\sqrt{\frac{B}{d}}. Thus, we are left to bound \left\|{{{\mathbf{t}}}}\right\|_{2}. We have t_{i}^{2}=(b_{i}-1)^{2}s_{i}^{2}\leq(b_{i}^{2}+1)s_{i}^{2} since b_{i}\geq 0.

Let z_{i}:=c\beta_{t}\exp(\left\langle{\text{\boldmath$\mathbf{\Delta}$}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle); we then have \frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{z_{i}}=\frac{1}{c\beta_{t}}\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{\exp(\left\langle{\text{\boldmath$\mathbf{\Delta}$}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)}\leq\frac{1}{c\beta_{t}}\exp(\left\|{{{\mathbf{w}}}^{\ast}-\Delta^{t}}\right\|_{2})\leq\frac{1+c\beta_{t}}{(c\beta_{t})^{2}}\exp(1), so that

bisi\displaystyle b_{i}s_{i} =biexp(𝐰,𝐱i)(1ϕ)Γ(1ϕ~βt)(bizi)1ϕ~βtexp(bizi)=1ziexp(𝐰,𝐱i)(1ϕ)Γ(1ϕ~βt)(bizi)1ϕ~βt+1exp(bizi)\displaystyle=b_{i}\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(b_{i}z_{i}\right)^{\frac{1}{\tilde{\phi}_{\beta_{t}}}}\exp(-b_{i}z_{i})=\frac{1}{z_{i}}\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(b_{i}z_{i}\right)^{\frac{1}{\tilde{\phi}_{\beta_{t}}}+1}\exp(-b_{i}z_{i})
1ziexp(𝐰,𝐱i)(1ϕ)Γ(1ϕ~βt)(1ϕ~βt+1)1ϕ~βt+1exp(1ϕ~βt1)since, xaexp(x)aaexp(a) for, x>0\displaystyle\leq\frac{1}{z_{i}}\frac{\exp(\left\langle{{{\mathbf{w}}}^{\ast}},{{{\mathbf{x}}}_{i}}\right\rangle)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(\frac{1}{\tilde{\phi}_{\beta_{t}}}+1\right)^{\frac{1}{\tilde{\phi}_{\beta_{t}}}+1}\exp(-\frac{1}{\tilde{\phi}_{\beta_{t}}}-1)\quad\text{since, }x^{a}\exp(-x)\leq a^{a}\exp(-a)\text{ for, }x>0
1+cβt(cβt)2exp(1)(1ϕ)Γ(1ϕ~βt)(cβt+2)cβt+2exp(cβt2)=:m(βt)\displaystyle\leq\frac{1+c\beta_{t}}{(c\beta_{t})^{2}}\frac{\exp(1)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(c\beta_{t}+2\right)^{c\beta_{t}+2}\exp(-c\beta_{t}-2)=:m(\beta_{t})

Also, using \exp(\left\langle{\hat{{\mathbf{w}}}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)\leq\exp(\left\|{\hat{{\mathbf{w}}}^{t}}\right\|_{2})\leq\exp(\left\|{{{\mathbf{w}}}^{\ast}+\Delta^{t}}\right\|_{2})\leq(1+\frac{1}{c\beta_{t}})\exp(1) and the fact that the Gamma density is maximized at its mode,

s_{i}=f(y_{i};\tilde{\eta}_{i},\tilde{\phi}_{\beta_{t}})\leq f(\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\eta}_{i}};\tilde{\eta}_{i},\tilde{\phi}_{\beta_{t}})=\frac{1}{\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\eta}_{i}}\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\eta}_{i}}\frac{\tilde{\eta}_{i}}{\tilde{\phi}_{\beta_{t}}}\right)^{\frac{1}{\tilde{\phi}_{\beta_{t}}}}\exp(-\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\eta}_{i}}\frac{\tilde{\eta}_{i}}{\tilde{\phi}_{\beta_{t}}})
=exp(𝐰^t,𝐱i)(1ϕ)Γ(1ϕ~βt)(1ϕ~βtϕ~βt)1ϕ~βtexp(1ϕ~βtϕ~βt)\displaystyle=\frac{\exp(\left\langle{\hat{{\mathbf{w}}}^{t}},{{{\mathbf{x}}}_{i}}\right\rangle)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\phi}_{\beta_{t}}}\right)^{\frac{1}{\tilde{\phi}_{\beta_{t}}}}\exp(-\frac{1-\tilde{\phi}_{\beta_{t}}}{\tilde{\phi}_{\beta_{t}}})
\leq(1+\frac{1}{c\beta_{t}})\frac{\exp(1)}{(1-\phi)\Gamma(\frac{1}{\tilde{\phi}_{\beta_{t}}})}\left(c\beta_{t}\right)^{c\beta_{t}+1}\exp(-c\beta_{t})
m(βt)\displaystyle\leq m(\beta_{t})

which gives us

Λβt1ϕ^XB2𝐭21ϕ^XB2iB(bisi)2+si2m(βt)Bϕ^2d\displaystyle\Lambda_{\beta_{t}}\leq\frac{1}{\hat{\phi}}\left\|{X_{B}}\right\|_{2}\left\|{{{\mathbf{t}}}}\right\|_{2}\leq\frac{1}{\hat{\phi}}\left\|{X_{B}}\right\|_{2}\sqrt{\sum\limits_{i\in B}(b_{i}s_{i})^{2}+s_{i}^{2}}\leq\frac{m(\beta_{t})B}{\hat{\phi}}\sqrt{\frac{2}{d}}