Corruption-tolerant Algorithms for Generalized Linear Models
Abstract
This paper presents SVAM (Sequential Variance-Altered MLE), a unified framework for learning generalized linear models under adversarial label corruption in training data. SVAM extends to tasks such as least squares regression, logistic regression, and gamma regression, whereas many existing works on learning with label corruptions focus only on least squares regression. SVAM is based on a novel variance reduction technique that may be of independent interest and works by iteratively solving weighted MLEs over variance-altered versions of the GLM objective. SVAM offers provable model recovery guarantees superior to the state-of-the-art for robust regression even when a constant fraction of training labels are adversarially corrupted. SVAM also empirically outperforms several existing problem-specific techniques for robust regression and classification. Code for SVAM is available at https://github.com/purushottamkar/svam/
1 Introduction
Generalized linear models (GLMs) [17] are effective models for a variety of discrete and continuous label spaces, allowing the prediction of binary or count-valued labels (logistic, Poisson regression) as well as real-valued labels (gamma, least-squares regression). Inference in a GLM involves two steps: given a feature vector and model parameters, a canonical parameter is first computed as a linear function of the features, and the label is then sampled from the corresponding exponential family distribution,
where the exponential family is specified by a function specific to the GLM together with a normalization term known as the log-partition function. It is common to use a non-canonical link, e.g., for the gamma distribution. GLMs also admit vector-valued labels by replacing the scalar product between covariates and model with an inner product involving a canonical parameter vector and a covariate matrix.
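The following minimal sketch (Python/NumPy; all names such as `sample_glm_label` are illustrative and not taken from the paper's code) shows the two-step GLM sampling process described above for three common family members, using standard link functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_glm_label(x, w, family="gaussian"):
    """Two-step GLM sampling: compute the canonical parameter, then draw a label."""
    theta = x @ w                        # canonical parameter from covariates and model
    if family == "gaussian":             # least-squares regression, unit noise variance
        return rng.normal(loc=theta, scale=1.0)
    if family == "poisson":              # count-valued labels via the log link
        return rng.poisson(lam=np.exp(theta))
    if family == "logistic":             # binary labels in {0, 1}
        return rng.binomial(n=1, p=1.0 / (1.0 + np.exp(-theta)))
    raise ValueError(f"unknown family: {family}")

d = 5
w_star = rng.normal(size=d)              # unknown "true" model
x = rng.normal(size=d)                   # one feature vector
print(sample_glm_label(x, w_star, "gaussian"), sample_glm_label(x, w_star, "poisson"))
```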
Problem Description: Given data generated using a known GLM but unknown model parameters, statistically efficient techniques exist to recover a consistent estimate of the model [14]. However, these techniques break down if several observed labels are corrupted, not just by random statistical noise but by adversarially generated structured noise. Suppose some labels are corrupted, i.e., for some data points, the actual labels generated by the GLM are replaced by the adversary with corrupted ones. Can we still recover the true model? Note that the learning algorithm is unaware of which points are corrupted.
Breakdown Point: The largest fraction of corruptions that a learning algorithm can tolerate while still offering an estimate of the true model with bounded error is known as its breakdown point. This paper proposes the SVAM algorithm, which can tolerate a constant fraction of corrupted labels.
Adversary Models: Contamination of the training labels by an adversary can misguide the learning algorithm into selecting model parameters of the adversary’s choice. An adversary has to choose (1) which labels to corrupt and (2) what corrupted labels to put there. Adversary models emerge based on what information the adversary can consult while making these choices. The oblivious adversary must make both these choices with no access to the original data or true model and thus, can only corrupt a random/fixed subset of labels by sampling from some predetermined noise distribution. This is also known as the Huber noise model. On the other hand, a fully adaptive adversary has full access to the original data and true model while making both choices. Finally, the partially adaptive adversary must choose the corruption locations without knowledge of original data or true model but has full access to these while deciding the corrupted labels. See Appendix B for details.
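To make the distinction between adversary models concrete, the hedged sketch below (Python; function names are illustrative) simulates corruption under an oblivious and a partially adaptive adversary. A fully adaptive adversary would additionally get to pick the corrupted indices after inspecting the clean data and the true model.

```python
import numpy as np

rng = np.random.default_rng(1)

def oblivious_corruption(y, alpha):
    """Huber model: random locations, labels drawn from a fixed noise distribution."""
    y = y.copy()
    k = int(alpha * len(y))
    idx = rng.choice(len(y), size=k, replace=False)   # chosen with no access to X, w*
    y[idx] = rng.standard_cauchy(size=k)              # pre-decided heavy-tailed noise
    return y, idx

def partially_adaptive_corruption(X, y, w_star, alpha, w_adv):
    """Locations chosen blindly, but corrupted values may use X, y and w*."""
    y = y.copy()
    k = int(alpha * len(y))
    idx = rng.choice(len(y), size=k, replace=False)   # still chosen blindly
    # w_star is available to the adversary at this stage but unused in this simple sketch
    y[idx] = X[idx] @ w_adv                           # corrupted labels may use the data
    return y, idx

n, d, alpha = 200, 5, 0.2
X = rng.normal(size=(n, d))
w_star, w_adv = rng.normal(size=d), rng.normal(size=d)
y = X @ w_star
y_obl, _ = oblivious_corruption(y, alpha)
y_par, _ = partially_adaptive_corruption(X, y, w_star, alpha, w_adv)
```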
Contributions: This paper describes the SVAM (Sequential Variance-Altered MLE) framework that offers:
1. robust estimation with explicit breakdown points against partially and fully adaptive adversaries for robust least-squares regression and mean estimation, as well as for robust gamma regression. Prior works do not offer any breakdown point for gamma regression.
2. exact recovery of the true model against a fully-adaptive adversary for the case of least squares regression,
3. the use of a variance reduction technique (see §3.1) in robust learning, which is novel to the best of our knowledge,
4. extensive empirical evaluation demonstrating that despite being a generic framework, SVAM is competitive with or outperforms algorithms specifically designed to solve problems such as least-squares and logistic regression.
2 Related Works
In the interest of space, we review aspects of literature most related to SVAM and refer to others [6, 15] for a detailed review.
Robust GLM learning has been studied in a variety of settings. [2] considered an oblivious adversary (Huber's noise model) but offered only a limited breakdown point. [25] solve robust GLM estimation via a sequence of M-estimation problems; however, they require the magnitude of the corruptions to be upper-bounded by a constant, and their approach solves regularized problems using projected gradient descent, which converges slowly. In contrast, SVAM offers a linear rate of convergence, tolerates a constant fraction of corruptions, and can handle corruptions of unbounded magnitude introduced by a partially or fully adaptive adversary.
Specific GLMs such as robust least-squares regression have received focused attention. Here the model is a linear response to which a sparse corruption vector is added, its nonzero entries denoting the adversarial corruptions. A variant of this studies a hybrid noise model in which the uncorrupted labels additionally receive Gaussian noise. [18, 24] solve an ℓ1-minimization problem, which is slow in practice (see §5). [1] use hard thresholding techniques to estimate the subset of uncorrupted points while [15] modify the IRLS algorithm to do so. However, [1, 15] are unable to offer consistent model estimates in the hybrid noise model even as the corruption rate goes to zero, which is surprising since a vanishing corruption rate means almost all the data is clean. In contrast, SVAM offers consistent model recovery in the hybrid noise model against a fully adaptive adversary as the corruption rate vanishes. [21] also offer consistent recovery with explicit breakdown points but assume an oblivious adversary.
Robust classification with flipped labels has been explored using robust surrogate loss functions [16] and ranking-based [8, 19] techniques. These works do not offer breakdown points but offer empirical comparisons.
Robust mean estimation entails recovering an estimate of the mean of a multivariate Gaussian given samples, a fraction of which are corrupted [11]. The estimation error is known to be lower bounded for this problem no matter how many samples are available [5]. [6] use convex programming techniques to obtain strong error guarantees at the cost of a large runtime, while [3] improve the running time to nearly linear. The recent work of [4] uses an IRLS-style approach that internally relies on expensive SDP calls but offers high breakdown points. SVAM offers a recovery error comparable to that of existing works in the relevant regime. Moreover, SVAM is much faster and simpler to implement in practice.
Meta-algorithms such as robust gradient techniques, median-of-means [12], tilted ERM [13] and the maximum correntropy criterion [9] have also been studied. SEVER [7] uses the gradient covariance matrix to filter out outliers along its largest eigenspace while RGD [20] uses robust gradient estimates to perform robust first-order optimization directly. While convenient to execute, these methods may require larger training sets, e.g., SEVER requires more samples for robust least-squares regression than SVAM does. In terms of recovery guarantees, for least-squares regression without Gaussian noise, SVAM and other methods [1, 15] offer exact recovery of the true model so long as the fraction of corrupted points is less than the breakdown point, while SEVER's error remains bounded away from zero. RGD only considers an oblivious/Huber adversary while SVAM can tolerate partially/fully adaptive adversaries. SEVER does not report an explicit breakdown point, and while RGD does offer one (see Thm 2 in their paper), SVAM offers an explicit, constant breakdown point. SVAM also offers faster convergence than existing methods such as SEVER and RGD.
3 The SVAM Algorithm
A popular approach in robust learning is to assign weights to data points, hoping that large weights would be given to uncorrupted points and low weights to corrupted ones, followed by weighted likelihood maximization. Often the weights are updated, and the process is repeated. [2] use Huber-style weighting functions used in Mallows-type M-estimators, [15] use truncated inverse residuals, and [22] use Mahalanobis-distance-based weights.
SVAM notes that the label likelihood offers a natural measure of how likely a point is to be uncorrupted. Given a model estimate w^t at iteration t, the weight s_i^t can be assigned to the i-th point as the likelihood its observed label y_i receives under w^t (for gamma/Poisson regression, this likelihood is computed using the non-canonical link used for these problems). This gives us a weighted MLE, solving which gives us the next model iterate as

w^{t+1} = argmax_w  sum_i  s_i^t * log P[ y_i | x_i, w ].        (1)
However, as §5 will show, this strategy does not perform well. If the initial model is far from the true model, it may produce imprecise weights that are large for the corrupted points. For example, if the adversary introduces corruptions using some adversarial model and we happen to initialize close to that adversarial model, then it is the corrupted points that would get large weights initially, which may cause the algorithm to converge to the adversarial model itself.
Key Idea: It is thus better to avoid drastic decisions, say assigning a data point the maximum possible weight in the initial stages no matter how clean it appears. SVAM implements this intuition by setting weights using a label likelihood distribution with very large variance initially. This ensures that no data point (not even the uncorrupted ones) gets a large weight (cf. the uniform distribution, which has large variance and does not assign a high density to any point). As SVAM progresses towards the true model, it starts using likelihood distributions with progressively lower variance. Note that this allows data points (hopefully the uncorrupted ones) to get larger weights (cf. the Dirac delta distribution, which has vanishing variance and assigns high density to isolated points).
3.1 Mode-preserving Variance-altering Likelihood Transformations
Table 1: Mode-preserving, variance-altered versions of popular likelihood distributions. For each of the univariate Gaussian, multivariate Gaussian, Bernoulli and gamma distributions, the table lists the standard form of the mass/density function, its variance-altered form, the resulting variance, and the asymptotic form as the variance is driven to zero.
To implement the above strategy, SVAM (Algorithm 1) needs techniques to alter the variance of a likelihood distribution at will. Note that the likelihood values of the altered distributions must be efficiently computable, as they will be used as weights, i.e., merely being able to sample from the distribution is not enough. Moreover, the transformation must be order-preserving: if the original distribution assigns a higher likelihood to one label than to another for some parameter value, then the transformed distribution must do so as well. If this is not true, then SVAM could exhibit anomalous behavior.
The Transformation: If the likelihood is an exponential family distribution with a canonical parameter and log-partition function, then raising the density to a positive power (the scale) and renormalizing yields a variance-altered density from the same family. This transformation is order- and mode-preserving since raising to a positive power is an increasing map. The variance of this generalized likelihood distribution [17] shrinks as the scale grows and tends to zero in the limit. Table 1 lists a few popular distributions, their variance-altered versions, and their asymptotic versions as the variance vanishes.
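To make the transformation concrete, the sketch below (Python; a minimal illustration assuming the Gaussian row of Table 1, where the variance-altered Gaussian at scale `beta` has variance 1/`beta`) computes weights at two different scales: a large-variance (small scale) setting gives all points similar, small weights, while a small-variance (large scale) setting lets points close to the current model's prediction dominate.

```python
import numpy as np

def gaussian_weights(residuals, beta):
    """Weights from a variance-altered Gaussian likelihood with variance 1/beta.

    Small beta -> large variance -> nearly uniform, small weights;
    large beta -> small variance -> clean-looking points dominate.
    """
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * residuals ** 2)

residuals = np.array([0.1, 0.2, 5.0, 0.05])    # the third point looks corrupted
print(gaussian_weights(residuals, beta=0.01))   # high variance: all weights ~0.04, nearly equal
print(gaussian_weights(residuals, beta=25.0))   # low variance: the corrupted point's weight ~0
```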
We note that [10] also study variance-altering transformations for learning hidden Markov models, topic models, etc. However, their transformations are unsuitable for use in SVAM for a few reasons:
1. SVAM’s transformed distributions are always available in closed form whereas those of [10] are not necessarily available in closed form.
2. SVAM’s transformations are order-preserving while [10] offer mean-preserving transformations that are not assured to be order-preserving.
The Algorithm: As presented in Algorithm 1, SVAM repeatedly constructs weighted MLEs whose weights are computed from variance-altered likelihoods at the current scale for all data points, and solves them to obtain new model estimates.
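The overall framework can be summarized by the following generic sketch (Python; a minimal rendition of Algorithm 1 under the assumption that the scale is increased by a constant multiplicative increment each iteration; the weight and weighted-MLE routines are supplied per GLM and their names are illustrative).

```python
import numpy as np

def svam(X, y, weight_fn, weighted_mle_fn, w_init, beta_init, xi, n_iters):
    """Generic SVAM loop: variance-altered weights -> weighted MLE -> increase scale.

    weight_fn(X, y, w, beta)       -> per-point weights from the beta-altered likelihood
    weighted_mle_fn(X, y, weights) -> model maximizing the weighted log-likelihood
    xi > 1 is the (multiplicative) scale increment.
    """
    w, beta = w_init, beta_init
    for _ in range(n_iters):
        s = weight_fn(X, y, w, beta)         # likelihoods under a variance-altered model
        w = weighted_mle_fn(X, y, s)         # solve the weighted MLE
        beta = xi * beta                     # lower the likelihood variance for the next step
    return w
```

The task-specific variants SVAM-RR, SVAM-ME, SVAM-Gamma and SVAM-LR discussed in §4 would then correspond to different choices of the weight and weighted-MLE routines.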
We pause to note that the approach in [15], although similar at first glance to Eq. (1), applies only to least-squares regression as it relies on notions of residuals that are missing from other GLMs. In contrast, SVAM works for all GLMs, e.g., least-squares/logistic/gamma regression, and offers stronger theoretical guarantees.
Theorem 1 shows that SVAM enjoys a linear rate of convergence. However, we first define notions of Local Weighted Strong Convexity and Local Weighted Lipschitz Continuity, which are stated over a Euclidean ball of a certain radius centered at the true model.
Definition 1 (LWSC/LWLC).
Given data and an exponential family distribution, the weighted MLE objective is said to satisfy Local Weighted Strong Convexity (LWSC) and Local Weighted Lipschitz Continuity (LWLC) at a given scale if, for any true model and any model within a ball of the corresponding radius around it, the following hold:
1. (LWSC) the objective is strongly convex over the ball, and
2. (LWLC) the objective is Lipschitz continuous over the ball,
with constants that may depend on the scale.
The above requires the weighted MLE objective to be strongly convex and Lipschitz continuous in a ball around the true model whose radius shrinks as the scale increases, i.e., as the variance is reduced, the neighborhood in which these properties are required also shrinks. We will show that likelihood functions corresponding to GLMs, e.g., least-squares and gamma regression, satisfy these properties for appropriate ranges of the scale, even in the presence of corrupted samples.
Theorem 1 (SVAM convergence).
If the data and likelihood distribution satisfy the LWSC/LWLC properties at every scale in the relevant range, and if SVAM is initialized at a model and scale that satisfy the invariant required by the analysis, then for any target error ε > 0 and a small-enough scale increment, SVAM ensures a model estimate within distance ε of the true model after O(log(1/ε)) iterations.
It is useful to take a moment to analyze this result. If the LWSC/LWLC properties hold for larger values of the scale, SVAM is able to offer smaller model recovery errors. Take least-squares regression with hybrid noise (see §4) as an example: the proofs will show that the LWSC/LWLC properties are assured up to a sufficiently large scale (see §4), so that with proper initialization of the model and scale (discussed below), SVAM ensures a small recovery error within a logarithmic number of steps. This guarantee holds so long as SVAM is offered sufficiently many training samples.




Initialization: SVAM needs to be invoked with an initial model and scale that satisfy the requirements of Thm 1, along with a small enough scale increment. If we initialize at the origin, then Theorem 1's requirement reduces to finding a small enough initial scale. Thus, SVAM needs to tune two scalars, the initial scale and the scale increment, to take small enough values, which it does as described below. In practice, SVAM offered stable performance for a wide range of these hyperparameters (see Fig 1).
Hyperparameter Tuning: SVAM’s two hyperparameters were tuned using a held-out validation set. As the validation data could also contain corruptions, validation error was calculated by rejecting a fraction of validation points with the highest prediction error. The true corruption rate was provided to competitor algorithms as a handicap but not to SVAM; instead, the rejection fraction itself was treated as a (third) hyperparameter for SVAM.
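A possible implementation of this trimmed validation criterion is sketched below (Python; the rejection fraction `trim` plays the role of the third hyperparameter and is itself tuned; the function name is illustrative).

```python
import numpy as np

def trimmed_validation_error(w, X_val, y_val, trim):
    """Mean squared prediction error after discarding the `trim` fraction of
    validation points with the highest error (these may be corrupted)."""
    errs = (y_val - X_val @ w) ** 2
    keep = int(np.ceil((1.0 - trim) * len(errs)))
    return np.mean(np.sort(errs)[:keep])

# A line search over (initial scale, scale increment, trim) would pick the triple
# minimizing trimmed_validation_error on the held-out data.
```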
4 Robust GLM Applications with SVAM
This section adapts SVAM to robust least-squares/gamma/logistic regression and robust mean estimation, and establishes breakdown points and LWSC/LWLC guarantees for the respective weighted MLE objectives (see Defn 1). We refer the reader to §1 for definitions of partially/fully adaptive adversaries.
Robust Least Squares Regression. We have data points whose covariates are sampled from a sub-Gaussian distribution. We consider the hybrid corruption setting where the “good” data points receive labels generated by the true model with additive Gaussian noise, while the remaining “bad” points receive adversarially corrupted labels that can be unbounded in magnitude. We also consider the pure corruption setting where clean points receive no Gaussian noise (this corresponds to setting the noise variance to zero). SVAM-RR (Alg. 2) adapts SVAM to the task of robust regression.
Theorem 2 (Partially Adaptive Adversary).
For hybrid corruptions by a partially adaptive adversary with a corruption rate below the stated breakdown point, there exist scale increments such that, with high probability, the LWSC/LWLC properties are satisfied for the weighted MLE objective of robust least-squares regression up to sufficiently large scales. If suitably initialized, SVAM-RR assures a bounded model recovery error within a logarithmic number of iterations. For pure corruptions by a partially adaptive adversary, the admissible scale is unbounded and thus, for any target error ε > 0, SVAM-RR assures error at most ε within O(log(1/ε)) iterations.
Note that in the pure corruption setting, SVAM assures exact recovery of the true model simply by running the algorithm long enough. This is not a contradiction since, in this case, the LWSC/LWLC properties can be shown to hold for all values of the scale, as there is effectively no Gaussian noise. Thm 2 holds against a partially adaptive adversary but can be extended to a fully adaptive adversary as well, at the cost of a worse breakdown point (see Thm 3 below). Note that SVAM continues to assure exact recovery of the true model.
Theorem 3 (Fully Adaptive Adversary).
For pure corruptions by a fully adaptive adversary with a corruption rate below the stated breakdown point, LWSC/LWLC are satisfied for all scales and, for any target error ε > 0, SVAM-RR assures error at most ε within O(log(1/ε)) iterations if initialized as described in the statement of Theorem 2.
Establishing LWSC/LWLC: In the appendices, Lemmata 15 and 16 establish the LWSC/LWLC properties for robust least squares regression while Theorems 14 and 21 establish the breakdown points and the existence of suitable scale increments. Handling a fully adaptive adversary requires mild modifications to the notions of LWSC/LWLC, details of which are presented in Appendix G.1.
Model Recovery and Breakdown Point: For pure corruption, SVAM-RR offers exact model recovery against partially and fully adaptive adversaries, as it assures arbitrarily small error if executed long enough. For hybrid corruption, where even “clean” points receive Gaussian noise, SVAM-RR assures an error that vanishes as the corruption rate goes to zero, i.e., consistent recovery. This significantly improves previous results by [1, 15], which offer a non-vanishing error even as the corruption rate goes to zero and the sample size grows. Note that SVAM-RR has a superior breakdown point (allowing up to an 18% corruption rate) against an oblivious adversary. The breakdown point deteriorates as expected against a fully adaptive adversary (still allowing up to a 0.36% corruption rate). We now present analyses for other GLM problems.
Algorithm 2 (SVAM-RR: Robust Least Squares Regression), Algorithm 3 (SVAM-ME: Robust Mean Estimation), Algorithm 4 (SVAM-Gamma: Robust Gamma Regression, using the variance-altered gamma likelihood of Table 1) and Algorithm 5 (SVAM-LR: Robust Classification) share the same structure: each takes the data, an initial scale, an initial model and a scale increment as input and, in every iteration, computes variance-altered likelihood weights for all data points at the current scale, solves the resulting weighted MLE to obtain the next model estimate, and increases the scale; the final model estimate is returned.
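As a concrete instance, the sketch below (Python/NumPy) is a plausible rendition of SVAM-RR, assuming Gaussian weights at the current scale and a closed-form weighted least-squares solve in each iteration; it is an illustration under these assumptions, not the authors' reference implementation.

```python
import numpy as np

def svam_rr(X, y, beta_init=0.01, xi=2.0, n_iters=30):
    """Robust least-squares regression via SVAM-style variance-altered reweighting."""
    n, d = X.shape
    w = np.zeros(d)                                    # initialize at the origin
    beta = beta_init                                   # small scale = large likelihood variance
    for _ in range(n_iters):
        r = y - X @ w                                  # residuals under the current model
        s = np.exp(-0.5 * beta * r ** 2)               # variance-altered Gaussian weights
        XtS = X.T * s                                  # columns of X^T scaled by the weights
        w = np.linalg.solve(XtS @ X + 1e-8 * np.eye(d), XtS @ y)  # weighted least squares
        beta *= xi                                     # reduce the likelihood variance
    return w
```

The closed-form weighted least-squares step is what makes each iteration cheap; the small ridge term is only for numerical stability of the solve.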
Robust Gamma Regression. The data generation and corruption model for gamma regression are slightly different given that the gamma distribution has support only over the positive reals. First, the canonical parameter is computed from the covariates and the true model via the non-canonical link, using which a clean label is generated. To simplify the analysis, we make standard normalization assumptions on the covariates and the model. For the “good” points, labels are generated with no noise, i.e., the no-noise model. For the remaining “bad” points, the label is corrupted multiplicatively by a positive real number that is otherwise arbitrary and can be unbounded. A multiplicative corruption makes more sense since the final label must remain positive. SVAM-Gamma (Algorithm 4) adapts SVAM to robust gamma regression. Due to the alternate canonical parameter used in gamma regression, the initialization requirement also needs to be modified accordingly. However, the hyperparameter tuning strategy discussed in §3 continues to apply.
Theorem 4.
For data corrupted by a partially adaptive adversary with a corruption rate below the stated breakdown point, there exist scale increments such that, with high probability, the LWSC/LWLC conditions are satisfied for the weighted MLE objective of gamma regression up to sufficiently large scales. If suitably initialized, SVAM-Gamma assures convergence to within a small region around the true model within a logarithmic number of steps.
Model recovery, Consistency, Breakdown pt. It is notable that prior results in the literature do not offer any breakdown point results for gamma regression. We find that Thm 4 places requirements on the initial model and scale that effectively bound how far the initialization may be from the true model. This is in contrast to Thms 2 and 3, which allow any initial model so long as the initial scale and scale increment are sufficiently small. Moreover, SVAM-Gamma guarantees convergence only to a small region around the true model, whereas Thms 2 and 3 assure exact recovery. However, these do not seem to be artifacts of the proof technique. In experiments, SVAM-Gamma did not offer vanishingly small recovery errors and did indeed struggle when initialized too far from the true model. It may be the case that there exist lower bounds preventing exact recovery for gamma regression, similar to mean estimation.
Robust Mean Estimation. We have data points of which the set of "good" points is generated from a spherical multivariate Gaussian centered at the true mean. The rest form the set of "bad" points that are corrupted by an adversary, and the corruptions can be unbounded. SVAM-ME (Algorithm 3) adapts SVAM to the robust mean estimation problem.
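A minimal sketch of the corresponding mean-estimation variant follows (Python; it assumes spherical-Gaussian weights at the current scale followed by a weighted-average update, as an illustration rather than the paper's exact algorithm).

```python
import numpy as np

def svam_me(X, beta_init=0.01, xi=2.0, n_iters=30):
    """Robust mean estimation: reweighted means with shrinking likelihood variance."""
    mu = X.mean(axis=0)                                   # initialize at the sample mean
    beta = beta_init
    for _ in range(n_iters):
        sq_dist = np.sum((X - mu) ** 2, axis=1)           # squared distance to current estimate
        s = np.exp(-0.5 * beta * sq_dist)                 # variance-altered spherical weights
        mu = (s[:, None] * X).sum(axis=0) / (s.sum() + 1e-12)   # weighted mean
        beta *= xi                                        # reduce the likelihood variance
    return mu
```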
Theorem 5.
For data corrupted by a partially adaptive adversary with a corruption rate below the stated breakdown point, there exists a scale increment such that, with high probability, the LWSC/LWLC conditions are satisfied for the weighted MLE objective up to sufficiently large scales. If suitably initialized, SVAM-ME assures the stated estimation error within a logarithmic number of iterations.
Model recovery, Consistency, Breakdown pt. Note that for any constant corruption rate, the estimation error does not go to zero as the number of samples grows. As mentioned in §2, some estimation error is unavoidable no matter how many samples are available. Thus, the best we can hope for is an estimation error that vanishes as the corruption rate goes to zero and the sample size grows, and the error in Theorem 5 does indeed go to zero in this setting. Also, note that the error depends only on the trace of the covariance matrix of the clean points, and thus for suitably scaled covariances the result offers an estimation error independent of the dimension. SVAM-ME offers a large breakdown point (allowing up to a 26% corruption rate).
Establishing LWSC/LWLC for Gamma Regression and Mean Estimation: In the appendices, Lemmata 28 and 29 and Lemmata 23 and 24 establish the LWSC/LWLC properties of the weighted MLE objectives for gamma regression and mean estimation respectively, while Theorems 27 and 22 establish the breakdown points and the existence of suitable scale increments.
Robust Classification. In this case, binary labels are generated from the logistic likelihood and the bad points get their labels flipped. SVAM-LR (Algorithm 5) adapts SVAM to robust logistic regression.
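A sketch of the classification variant is given below (Python). It takes the weights to be the scale-powered logistic likelihood of each observed label and performs a few Newton steps on the weighted logistic log-likelihood; this is an illustrative rendition under these assumptions, not the reference code, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def weighted_logistic_mle(X, y01, s, w, newton_steps=10, ridge=1e-6):
    """A few Newton steps maximizing the weighted logistic log-likelihood."""
    for _ in range(newton_steps):
        p = sigmoid(X @ w)
        grad = X.T @ (s * (y01 - p)) - ridge * w
        H = (X.T * (s * p * (1 - p))) @ X + ridge * np.eye(X.shape[1])
        w = w + np.linalg.solve(H, grad)
    return w

def svam_lr(X, y01, beta_init=0.05, xi=1.5, n_iters=20):
    """Robust logistic regression via variance-altered reweighting (labels in {0, 1})."""
    n, d = X.shape
    w, beta = np.zeros(d), beta_init
    for _ in range(n_iters):
        p = sigmoid(X @ w)
        lik = np.where(y01 == 1, p, 1 - p)     # likelihood of each observed label
        s = lik ** beta                        # scale-powered (variance-altered) weights
        w = weighted_logistic_mle(X, y01, s, w)
        beta *= xi                             # sharpen the weights over iterations
    return w
```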
5 Experiments
We used a 64-bit machine with Intel® Core™ i7-6500U CPU @ 2.50GHz, 4 cores, 16 GB RAM, Ubuntu 16.04 OS.
Benchmarks. SVAM was benchmarked against several baselines: (a) VAM: SVAM executed with a fixed value of the scale (i.e., with no scale increment) to investigate the benefits of varying the scale, (b) MLE: likelihood maximization on all points (clean + corrupted) without any weights assigned to data points, which checks for the benefits of performing weighted MLE, (c) Oracle: an execution of the MLE only on the clean points; this is the gold standard in robust learning and offers the best possible outcome. In addition, several problem-specific competitors were also considered. For robust regression, STIR [15], TORRENT [1], SEVER [7], RGD [20], and the classical robust M-estimator with Tukey's bisquare loss were included. Note that TORRENT already outperforms ℓ1-based methods while achieving better or competitive recovery errors (see [1, Fig 2(b)]). Since SVAM-RR was faster than TORRENT itself, ℓ1-regularized methods such as [18, 24] were not considered. For robust mean estimation, popular baselines such as the coordinate-wise median and the geometric median were taken. For robust classification, the rank-pruning method RP-LR [19] and the method from [16] were used.
Experimental Setting and Reproducibility. Due to lack of space, details of the experimental setup, data generation, how adversaries were simulated, etc. are presented in Appendix C. SVAM also offered superior robustness compared to competitors against a wide range of ways to simulate adversarial corruption (see Appendix D for details). Code for SVAM is available at https://github.com/purushottamkar/svam/.






5.1 Experimental Observations
Robust Regression. Fig 2(a) shows that SVAM-RR, SEVER, RGD, STIR, and TORRENT are competitive and achieve oracle-level error. However, SVAM-RR can be twice as fast in terms of execution time. Since TORRENT itself outperforms ℓ1-based methods while achieving better or competitive recovery errors (see Fig 2(b) in [1]), we do not compare against such methods. SVAM-RR is several times faster than classical robust M-estimators such as Tukey's bisquare loss. Also, no single fixed value of the scale can offer the performance of SVAM, as is indicated by the poor performance of VAM. Fig 4 in the appendix shows that this is true even if very large or very small values of the scale are used with VAM. We note that SEVER chooses a threshold in each iteration to eliminate specific points as corrupted. This threshold is chosen randomly (possibly for ease of proof) but causes SEVER to offer sluggish convergence. Thus, we also report the performance of a modification SEVER-M that was given an unfair advantage by revealing to it the actual number of corrupted points (SVAM was not given this information). This sped up SEVER but SVAM continued to outperform SEVER-M. Fig 3 in the appendix reports repeated runs of the experiment in which SVAM continues to lead.
Robust Logistic and Gamma Regression. Fig 2(c,d) reports results of SVAM on robust gamma and logistic regression problems. The figures show that executing VAM with a fixed value of the scale cannot replace the gradual variation of the scale performed by SVAM. Additionally, for robust classification, SVAM-LR achieves an error an order of magnitude smaller than all competitors except the oracle. SVAM also outperforms the RP-LR [19] and [16] algorithms that were specifically designed for robust classification. A horizontal dashed line is used to indicate the final performance of algorithms for which iteration-wise performance was unavailable.
Robust Mean Estimation. Fig 2(b) reports results on robust mean estimation problems. SVAM outperforms VAM with any fixed value of the scale as well as the naive sample mean (the MLE in this case). The popular coordinate-wise median and geometric median approaches were fast but offered poor results. SVAM, on the other hand, achieved oracle-level error by assigning proper weights to all data points.
Sensitivity to Hyperparameter Tuning. In Figs 1(a,b), SVAM-RR was offered hyperparameters from a wide range of values to study how it responded to mis-specified hyperparameters. SVAM offered stable convergence for a wide range of values, indicating that it is resilient to minor mis-specifications of its hyperparameters.
Sensitivity to Dimension and Corruption. Figs 1(c,d) compare the error offered by various algorithms in recovering the true model for robust least-squares regression when the fraction of corrupted points and the feature dimension were varied. All values are averaged over several experiments with fixed hyperparameters. STIR and bisquare are sensitive to corruption while SEVER is sensitive to both corruption and dimension. RGD is not visible in the figures as its error exceeded the figure boundaries. Experiments for Fig 1(c) fixed the dimension and varied the corruption fraction while Fig 1(d) fixed the corruption fraction and varied the dimension. Figs 1(c,d) show that SVAM-RR can tolerate large fractions of the data getting corrupted and is not sensitive to the feature dimension.
Testing SVAM for Global Convergence. To test the effect of initialization, in Fig 2(e), corruptions were introduced using an adversarial model, i.e., for corrupted points, labels were set using that adversarial model. SVAM-RR was initialized at 1000 randomly chosen models, at the origin, as well as at the adversarial model itself. WORST-1000 (resp. AVG-1000) indicates the worst (resp. average) performance SVAM had over the 1000 initializations. Fig 2(f) further emphasizes this using a toy 2D problem: SVAM was initialized at all points on a grid, and an initialization was called a success if SVAM achieved a small recovery error within eight or fewer iterations. In all these experiments SVAM rapidly converged to the true model irrespective of the model initialization.
Acknowledgements
The authors thank the anonymous reviewers of this paper for suggesting illustrative experiments and pointing to relevant literature. B.M. is supported by the Technology Innovation Institute and MBZUAI joint project (NO. TII/ARRC/2073/2021): Energy-based Probing for Spiking Neural Networks. D.D. is supported by the Research-I Foundation at IIT Kanpur and acknowledges support from the Visvesvaraya PhD Scheme for Electronics & IT (FELLOW/2016-17/MLA/194). P.K. thanks Microsoft Research India and Tower Research for research grants.
References
- Bhatia et al. [2015] K. Bhatia, P. Jain, and P. Kar. Robust Regression via Hard Thresholding. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), 2015.
- Cantoni and Ronchetti [2001] E. Cantoni and E. Ronchetti. Robust Inference for Generalized Linear Models. Journal of the American Statistical Association, 96(455):1022–1030, 2001.
- Cheng et al. [2019] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2755–2771. SIAM, 2019.
- Dalalyan and Minasyan [2022] A. S. Dalalyan and A. Minasyan. All-In-One Robust Estimator of the Gaussian Mean. Annals of Statistics, 50(2):1193–1219, 2022.
- Diakonikolas and Kane [2019] I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
- Diakonikolas et al. [2019a] I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart. Robust Estimators in High-Dimensions Without the Computational Intractability. SIAM Journal on Computing, 48(2):742–864, 2019a.
- Diakonikolas et al. [2019b] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A Robust Meta-Algorithm for Stochastic Optimization. In 36th International Conference on Machine Learning (ICML), 2019b.
- Feng et al. [2014] J. Feng, H. Xu, S. Mannor, and S. Yan. Robust Logistic Regression and Classification. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
- Feng et al. [2015] Y. Feng, X. Huang, L. Shi, Y. Yang, and J. A. Suykens. Learning with the Maximum Correntropy Criterion Induced Losses for Regression. Journal of Machine Learning Research, 16(30):993–1034, 2015.
- Jiang et al. [2012] K. Jiang, B. Kulis, and M. Jordan. Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012.
- Lai et al. [2016] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.
- Lecué and Lerasle [2020] G. Lecué and M. Lerasle. Robust machine learning by median-of-means: Theory and practice. Annals of Statistics, 48(2):906–931, 2020.
- Li et al. [2021] T. Li, A. Beirami, M. Sanjabi, and V. Smith. Tilted Empirical Risk Minimization. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021.
- McCullagh and Nelder [1989] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, 1989.
- Mukhoty et al. [2019] B. Mukhoty, G. Gopakumar, P. Jain, and P. Kar. Globally-convergent Iteratively Reweighted Least Squares for Robust Regression Problems. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
- Natarajan et al. [2013] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.
- Nelder and Wedderburn [1972] J. A. Nelder and R. W. M. Wedderburn. Generalized Linear Models. Journal of the Royal Statistical Society. Series A, 135(3):370–384, 1972.
- Nguyen and Tran [2013] N. H. Nguyen and T. D. Tran. Exact Recoverability From Dense Corrupted Observations via ℓ1-Minimization. IEEE Transactions on Information Theory, 59(4):2017–2035, 2013.
- Northcutt et al. [2017] C. G. Northcutt, T. Wu, and I. L. Chuang. Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017. URL http://auai.org/uai2017/proceedings/papers/35.pdf.
- Prasad et al. [2018] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar. Robust Estimation via Robust Gradient Estimation. arXiv:1802.06485 [stat.ML], 2018.
- Suggala et al. [2019] A. S. Suggala, K. Bhatia, P. Ravikumar, and P. Jain. Adaptive Hard Thresholding for Near-optimal Consistent Robust Regression. In 32nd Conference on Learning Theory (COLT), 2019.
- Valdora and Yohai [2014] M. Valdora and V. J. Yohai. Robust estimators for generalized linear models. Journal of Statistical Planning and Inference, 146:31–48, 2014.
- Vershynin [2018] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
- Wright and Ma [2010] J. Wright and Y. Ma. Dense Error Correction via ℓ1-Minimization. IEEE Transactions on Information Theory, 56(7):3540–3560, 2010.
- Yang et al. [2013] E. Yang, A. Tewari, and P. Ravikumar. On Robust Estimation of High Dimensional Generalized Linear Models. In 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2013.
Appendix A Summary of Assumptions
This paper presented SVAM, a framework for robust GLM problems based on a novel variance reduced reweighted MLE technique that can be readily adapted to arbitrary GLM problems such as robust least squares/logistic/gamma regression and mean estimation. Here, we summarize the theoretical and empirical assumptions made by the SVAM framework for easy inspection.
Experimental Assumptions. SVAM requires minimal assumptions to be executed in practice and only requires two scalar hyperparameters to be tuned properly. SVAM is robust to minor misspecifications of its hyperparameters (see Figure 1) and the hyperparameter tuning described in §3 works well in practice, allowing SVAM to offer superior or competitive empirical performance when compared to state-of-the-art task-specific estimation techniques, e.g., TORRENT for robust regression, classical and popular techniques such as Tukey's bisquare or the geometric median for robust mean estimation, as well as recent robust gradient-based techniques such as SEVER and RGD.
Theoretical Assumptions. SVAM establishes explicit breakdown points in several interesting scenarios against both partially and fully adaptive adversaries. To do so, SVAM assumes a realizable setting, e.g., in least-squares regression, labels for the clean points are assumed to be generated as the linear response of the gold model plus Gaussian noise. Of course, on the bad points, the (partially/fully adaptive) adversary is free to introduce corruptions jointly in any manner. For least-squares regression, the covariates/feature vectors are assumed to be sampled from some sub-Gaussian distribution, which includes arbitrary bounded distributions as well as multivariate Gaussian distributions (both standard and non-standard); see Appendix G for details. For the gamma regression and mean estimation settings, the covariates are assumed to be sampled from a spherical multivariate Gaussian. Other assumptions are listed in the statements of the theorems in §4.
Appendix B Adversary Models
We explain the various adversary models in more detail here. Several models popular in literature give various degrees of control to the adversary. This section offers a more relaxed discussion of some prominent adversary models along with examples of applications in which they arise.
(Oblivious) Huber Adversary. Corruption locations are chosen randomly and the corrupted labels at those locations are sampled i.i.d. from some pre-decided noise distribution. Next, data features and the true model are selected and clean labels are generated according to the GLM for all non-corrupted points.
Partially Adaptive Adversary. The adversary first chooses the corruption locations (e.g., some fixed choice or randomly). Then data features and the true model are selected and clean labels are generated according to the GLM. Next, the adversary is presented with the full collection of data features, clean labels and the true model, and is allowed to use this information to generate corrupted labels for the points marked for corruption.
Fully Adaptive Adversary. First, data features and the true model are selected and clean labels are generated according to the GLM. Then the adversary is presented with the full collection of data features, clean labels and the true model, and is allowed to use this information to select which points to corrupt as well as to generate corrupted labels for those points.
Discussion. The fully adaptive adversary can choose corruption locations and the corrupted labels with complete information of the true model, the clean labels and the feature vectors, and is the most powerful. The partially adaptive adversary can decide the corruptions after inspecting this information but cannot control the corruption locations. This can model, e.g., an adversary that corrupts user data by installing malware on their systems: the adversary cannot force malware onto a system of their choice but can manipulate data coming from already compromised systems at will. The Huber adversary is the least powerful, with corruption locations that are random and corrupted labels that are sampled randomly and cannot depend on the data or the true model. Although weak, this adversary can nevertheless model sensor noise, e.g., pixels in a CCD array that misfire with a certain probability. As noted earlier, SVAM results are shown against both fully and partially adaptive adversaries.
Appendix C Experimental Setup
Experiments were carried out on a 64-bit machine with Intel® Core™ i7-6500U CPU @ 2.50GHz, 4 cores, 16 GB RAM and Ubuntu 16.04 OS. Statistics such as the dataset size, feature dimension and corruption fraction are mentioned above each figure. 20% of the training data was used as a held-out validation set using which the hyperparameters were tuned via line search. Figure 1 indicates that SVAM is not sensitive to these settings and a good value can be found using line search.
Synthetic Data Generation. Synthetic datasets were used in the experiments to demonstrate recovery of the true model parameters. All regression covariates/features were generated from a standard normal distribution. Clean responses in the least-squares regression setting were generated without additional Gaussian noise while corrupted responses were generated for a fixed fraction of the data points. For least-squares regression, clean responses were generated from the linear model, while for logistic regression, clean binary labels were sampled from the logistic likelihood. For gamma regression, clean responses were generated using a likelihood distribution with vanishing variance, achieved by using a variance-altered likelihood distribution with a suitably large scale (see Table 1 for the likelihood expressions). Clean data points for mean estimation were sampled from a spherical Gaussian centered at the true mean. When not stated otherwise, corrupted labels were generated using an adversarial model. Specifically, to simulate the adversary, an adversarial model (for least-squares/logistic/gamma regression) or an adversarial mean (for mean estimation) was sampled and labels for data points chosen for corruption were generated using this adversarial model instead of the true model. For example, corrupted labels for least-squares regression were generated from the adversarial model at all locations chosen for corruption. Please also refer to Appendix D for an extensive study of several other ways of simulating the adversary and initialization schemes.
Simulating the Adversary. Corruptions were introduced using an adversarial model. Specifically, an adversarial model was chosen, and for bad points, the adversary generated a label using this model, which overwrote the true label. For least-squares/logistic/gamma regression, the true and adversarial models were independently chosen to be random unit vectors. For robust mean estimation (Fig 2(b)), the true and adversarial means were chosen as random Gaussian vectors of differing lengths. Except for Fig 2(e,f), in which the setting is different, SVAM variants were always initialized at the adversarial model itself to test the ability of SVAM to converge to the true model no matter what the initialization.
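The data generation just described can be reproduced with a sketch like the following (Python; an illustrative reconstruction of the described setup, with names such as `w_adv` and `make_corrupted_regression_data` being hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_corrupted_regression_data(n, d, alpha):
    """Least-squares data with an alpha fraction of adversarial-model corruptions."""
    X = rng.normal(size=(n, d))                                      # standard normal covariates
    w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)    # true model: random unit vector
    w_adv = rng.normal(size=d); w_adv /= np.linalg.norm(w_adv)       # adversarial model: random unit vector
    y = X @ w_star                                                   # clean responses, no extra noise
    bad = rng.choice(n, size=int(alpha * n), replace=False)          # locations chosen for corruption
    y[bad] = X[bad] @ w_adv                                          # corrupted responses
    return X, y, w_star, w_adv, bad

X, y, w_star, w_adv, bad = make_corrupted_regression_data(n=1000, d=20, alpha=0.2)
# In the experiments, SVAM variants were initialized at w_adv itself to stress-test
# convergence to w_star from an adversarial starting point.
```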
Other Corruption Models. SVAM was found to offer superior robustness as compared to competitor algorithms against a wide range of ways to simulate adversarial corruption, including powerful ones that use expensive leverage score computations to decide corruptions. Fig 5 in Appendix D reports results of experiments with a variety of adversaries.














Appendix D Tolerance to Different Adversaries and Adversarial Initialization Schemes
Figure 3 repeats the experiment of Fig. 2(a) 10 times, showing that the convergence results are consistent across several runs of the experiment. In this comparison, we include the two closest competitors, STIR and TORRENT, along with SVAM, and omit other competitors to avoid clutter in the figure. The experiment shows that the performance of the methods does not vary wildly across runs and SVAM's convergence continues to remain faster than that of its competitors.
Figure 4 considers the VAM method with several values of the scale ranging from very small to very large. It is apparent that no single fixed value of the scale is able to offer satisfactory results, indicating that dynamically changing the scale, as done by SVAM, is required for better convergence.
Figure 5 demonstrates recovery of the true model under different ways of simulating the adversary and initializing the algorithms. We consider corruption models that differ in how the locations for corruption are chosen and in how the chosen points are corrupted. We demonstrate recovery when points are chosen for corruption in three ways: i) at random, ii) using the magnitude of the response, and iii) using the leverage statistic score. The leverage score of a point increases as it lies farther from the mean of the data points; the diagonal elements of the projection (hat) matrix give the respective leverage scores and the data points with the largest leverage scores are chosen for corruption. After selecting the data points to be corrupted, we either i) flip the sign of the response, ii) set the response to a constant value B, or iii) use an adversarial model to generate the corrupted response. The initialization offered was also varied in two ways: i) random initialization, and ii) adversarial initialization, where SVAM was initialized at the same adversarial model using which corruptions were introduced.
Experiments were performed by varying the location of corruption in {random, absolute magnitude, leverage score}, the corruption type in {adversarial, sign change, set constant} and the initialization in {random, adversarial}. The experimental setting is given in the title of each figure; all experiments share the same dataset size, feature dimension and corruption fraction. It can be observed that, in general, SVAM demonstrates superior performance irrespective of the adversarial model and initialization.
It can also be observed that SVAM converges after a single iteration when corruptions are introduced by setting responses to a constant (second column), whereas Tukey's bisquare method does not converge well when absolute magnitude is used to select the locations of corruption (second row), except for constant-response corruptions.
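For reference, a minimal sketch of the leverage-score-based corruption-location rule described above (Python; the hat-matrix diagonal gives the leverage scores, and the function names are illustrative):

```python
import numpy as np

def leverage_scores(X):
    """Diagonal of the projection (hat) matrix H = X (X^T X)^{-1} X^T."""
    G_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, G_inv, X)    # h_ii = x_i^T (X^T X)^{-1} x_i

def pick_corruption_locations_by_leverage(X, k):
    """Indices of the k points with the largest leverage scores."""
    return np.argsort(leverage_scores(X))[-k:]
```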
Appendix E Proof of Theorem 1
Theorem 6 (SVAM convergence - Restated).
Suppose the data and likelihood distribution satisfy the LWSC and LWLC properties for all scales in the relevant range. Then if SVAM is initialized at a point and initial scale satisfying the invariant required by the analysis, then for any target error ε > 0 and a small-enough scale increment, SVAM ensures a model estimate within distance ε of the true model within O(log(1/ε)) iterations.
Proof.
The key to this proof is to maintain an invariant linking the model error at each iteration to the current scale. Note that the initialization is chosen precisely to ensure this at the beginning of the execution of the algorithm, which acts as the base case for an inductive argument. For the inductive case, consider an iteration at which the invariant holds. LWSC ensures strong convexity of the weighted MLE objective around the true model, giving a lower bound on the objective gap in terms of the distance of the new iterate from the true model.
Since the new iterate minimizes the weighted MLE objective, its objective value is no larger than that of the true model. Elementary manipulations and the Cauchy-Schwarz inequality now bound the distance of the new iterate from the true model in terms of the LWSC/LWLC constants.
Now, we additionally ensure that the scale increment is chosen small enough (while still increasing the scale) that this bound re-establishes the invariant at the increased scale. Combining this with the above result gives us
the inductive step, which lets us continue the argument. Note that this process can continue until the scale reaches the largest value for which the LWSC/LWLC properties are assured. Moreover, since the scale goes up by a constant factor at each step and the invariant ties the model error to the scale, a linear rate of convergence is assured, which finishes the proof. The existence of a suitable scale increment satisfying the above requirements is established in a case-wise manner by Theorems 14, 21, 22 and 27. We also note that, as discussed in §4 and elaborated in Appendix I, since gamma regression requires an alternate parameterization owing to its need to support only non-negative labels, the invariant used in the convergence bound for SVAM-Gamma is also slightly altered, as mentioned in Thm 4. ∎
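For readability, the strong-convexity-plus-Cauchy-Schwarz step alluded to above can be sketched as follows, written with illustrative symbols ($Q$ for the weighted MLE objective, $\lambda_\beta, \Lambda_\beta$ for the LWSC/LWLC constants at scale $\beta$) and under the assumption, made here only for illustration, that LWLC is used to bound the gradient norm of $Q$ at $w^*$:

```latex
\begin{align*}
\frac{\lambda_\beta}{2}\,\lVert \hat{w}^{t+1} - w^* \rVert_2^2
  &\le Q(\hat{w}^{t+1}) - Q(w^*) - \langle \nabla Q(w^*),\, \hat{w}^{t+1} - w^* \rangle
     && \text{(LWSC: strong convexity around } w^*\text{)}\\
  &\le - \langle \nabla Q(w^*),\, \hat{w}^{t+1} - w^* \rangle
     && \text{(} \hat{w}^{t+1} \text{ minimizes } Q \text{)}\\
  &\le \lVert \nabla Q(w^*) \rVert_2 \, \lVert \hat{w}^{t+1} - w^* \rVert_2
     && \text{(Cauchy-Schwarz)}\\
\Rightarrow\quad
\lVert \hat{w}^{t+1} - w^* \rVert_2
  &\le \frac{2\,\lVert \nabla Q(w^*) \rVert_2}{\lambda_\beta}
   \;\le\; \frac{2\,\Lambda_\beta}{\lambda_\beta}
     && \text{(LWLC bounds the gradient norm).}
\end{align*}
```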
Appendix F Some Helpful Results
Below we present a few helpful results.
Lemma 7.
Suppose and denote . Then we have with probability at least .
Proof.
Follows from standard arguments. ∎
Lemma 8.
For covariate vectors generated from an isotropic sub-Gaussian distribution, for any fixed set and , with probability at least ,
where the constant inside depends only on the sub-Gaussian distribution and universal constants.
Proof.
Taken from [1]. ∎
Lemma 9.
If , then for any function , any and any , we have
where and .
Proof.
We have
which finishes the proof upon using . ∎
Lemma 10.
If , then for any function , any and any , we have
where and .
Proof.
We have
which finishes the proof. ∎
Lemma 11.
If , then for any constant and fixed vectors such that and , we have
Proof.
We begin by analyzing the vector itself. Note that due to the rotational symmetry of the Gaussian distribution, we can, w.l.o.g., assume that . This means that . Thus, the coordinate of the vector i.e. . Thus, by the independence and unbiasedness of the coordinates of a Gaussian vector, we have for all . So all we are left to analyze is . We have
since and we used Lemma 10 in the last step since that lemma is independent of the dimensionality of the Gaussian vector. Applying the Cauchy-Schwartz inequality then gives us the result. ∎
Lemma 12.
Let , then we have
Proof.
By completing squares we have
Now, we consider two cases
-
1.
Case 1 : In this case whereas . Thus, we have in this region. Using standard Gaussian integrals we conclude that the region contributes at most (since ) to both integrals.
-
2.
Case 2 : In this case we simply bound and use standard bounds on the complementary error function to conclude that the contribution of the region to both integrals is at most .
∎
Lemma 13.
Suppose and are fixed vectors such that . Then the random variable has a subexponential constant at most
Proof.
The subexponential constant of a random variable is defined as the value . In contrast, the subGaussian constant of a random variable is defined as the value . Using Lemma 9 gives us
where . Now by virtue of being a Gaussian and using the triangle inequality for the subGaussian norm, we know that the random variable is -subGaussian. This, in turn implies that
Thus, we have
Now, is a bounded function whereas is a decreasing function. Thus, is a decreasing function of and hence achieves its maximum value in the range at itself. Noting that finishes the proof. ∎
Appendix G Robust Regression
For this proof we use the notation . For any vector and any set , denotes the vector with all coordinates other than those in the set zeroed out. Similarly, for any matrix denotes the matrix with all columns other than those in the set zeroed out. We will let respectively denote the set of “good” uncorrupted points and “bad” corrupted points. We will abuse notation to let and respectively denote the number of good and bad points too.
Theorem 14 (Theorem 2 restated – Partially Adaptive Adversary).
For data generated in the robust regression model as described in §4, suppose corruptions are introduced by a partially adaptive adversary, i.e., the locations of the corruptions (the set ) are not decided adversarially but the corruptions are decided jointly, adversarially and may be unbounded; then SVAM-RR enjoys a breakdown point of , i.e., it ensures a bounded error even if corruptions are introduced where the value of can go up to at least . More generally, for corruption rates , there always exist values of the scale increment s.t. with probability at least , LWSC/LWLC conditions are satisfied for the function corresponding to the robust least squares model for values at least as large as .
Hybrid Corruption Model: If initialized with s.t. , SVAM-RR assures
within iterations for the hybrid corruption model where even points uncorrupted by the adversary receive Gaussian noise with variance i.e. for where .
Pure Corruption Model: For the pure corruption model where uncorrupted points receive no Gaussian noise i.e. for , SVAM-RR assures exact model recovery. Specifically, for any , SVAM-RR assures
within iterations.
Proof.
For any two models , the function for robust least squares has the following form
where . We first outline the proof below.
Proof Outline. This proof has three key elements
-
1.
We will establish the LWSC and LWLC properties for any fixed value of with probability . As promised in the statement of Theorem 2, we will execute SVAM-RR for no more than iterations; taking a naive union bound would offer a confidence level of . However, this can be improved by noticing that the confidence levels offered by the LWSC/LWLC results are actually of the form . Thus, a union over such events will at best deteriorate the confidence bounds to which is still for the values of we shall set.
-
2.
The key to this proof is to maintain the invariant . Recall that initialization ensures to start things off. §3 gives details on how to initialize in practice. This establishes the base case of an inductive argument. Next, inductively assuming that for an iteration , we will establish that where will be an application-specific expression derived below.
-
3.
We will then ensure that , say for some , whenever the number of corruptions is below the breakdown point. This ensures , in other words, for so that the invariant is preserved. However, notice that the above step simultaneously ensures that . This ensures that a valid value of the scale increment can always be found till . Specifically, we will be able to assure the existence of a scale increment satisfying the conditions of Theorem 1 w.r.t. the LWSC/LWLC results only till .
We now present the proof. Lemmata 15,16 establish the LWSC/LWLC properties for the function for robust least squares regression. Let . By Lemma 8, with probability at least , we have . The proof of Lemma 16 tells us that with the same probability, we have
On the other hand, the proof of Lemma 16 also tells us that with probability at least we have,
By Lemma 15, with probability at least , we have . This gives us,
where in the last two steps, we used that and . We shall set below. Now, in order to satisfy the conditions of Theorem 1, we need to assure the existence of a scale increment that assures a linear rate of convergence. Since SVAM-RR always maintains the invariant and , assuring the existence of a scale increment is easily seen to be equivalent to showing that . This is done below and gives us our breakdown point.
-
1.
Breakdown Point: In order to ensure a linear rate of convergence, we need only ensure . If we set to small constants (which we can always do for large enough ) and also set to a small constant (which still allows us to offer an error ), then we can get if
Setting gives us a breakdown point of .
-
2.
Consistency (Hybrid Corruption): For vanishing corruption i.e. , we can instead show a much stronger, consistent estimation guarantee. Suppose we promise that we would always set . Then, the results in Lemmata 15 and 16 hold with probability at least even if we set . Now, recall that we need to set to expect a linear rate of convergence. Setting to simplify notation (and also since both can be set to similar values without sacrificing confidence guarantees), using , and using the shorthand gives us the requirement
The last requirement can be fulfilled for which assures us of an error guarantee of and finishes the proof for hybrid corruption case. Note that for no corruptions i.e. , we do recover .
-
3.
Consistency (Pure Corruption): The pure corruption case corresponds to . In this case, the above analysis shows that LWSC/LWLC properties continue to hold for arbitrarily large values of that assures arbitrarily accurate recovery of . Given the linear rate of convergence and the fact that SVAM-RR maintains the invariant , it is clear that for any , a model recovery error of is assured within iterations.
∎
Lemma 15 (LWSC for Robust Least Squares Regression).
For any , the -function for robust regression satisfies the LWSC property with constant with probability at least for any . In particular, for standard Gaussian covariates and Gaussian noise with variance , we can take where .
Proof.
It is easy to see that for any . Let and let be the response of an uncorrupted data point and be any fixed model. Then if we let , then weight .
For any fixed , we have:
where,
Similar to Lemma 17 we have:
Note that Lemma 18 continues to hold in this setting. Proceeding as in the proof of Lemma 20 to set up a -net over and taking a union bound over this net finishes the proof.
We now simplify for various distributions:
- Centered Isotropic Gaussian
-
For the special case of , using rotational symmetry, we can w.l.o.g. take and . Thus, if i.i.d. then
Using,
We have,
and,
This gives us
which finishes the proof.
∎
Lemma 16 (LWLC for Robust Least Squares Regression).
For any , the -function for robust regression satisfies the LWLC property with constant with probability at least for any , where and .
Proof.
It is easy to see that . We bound these separately below.
Weights on Bad Points.
Suppose we denote and let be the weights assigned by the algorithm, then the analysis below shows that we must have
We will bound below and use Lemma 8 to get the above bound. Let denote the corruption on the data point , i.e. . The proof proceeds via a simple case analysis:
- Case 1:
-
In this case we simply bound .
- Case 2:
-
or, or,
So that,
Combining case 1 and 2,
Let,
This gives us
Bounding the Weights on Good Points.
Given noise values sampled from and corrupted points chosen independently of , then for any and , computed w.r.t. the model at variance , the following analysis shows that
where and . Suppose and . Then, for any fixed error vector , if we set , then we can analyze the vector as
where in the last step, we used Lemma 9 in one dimension. Now, by the rotational symmetry of the Gaussian distribution and the unbiased and independent nature of its coordinates, we can assume w.l.o.g. that the error vector is of the form and conclude, as we did in the Gaussian mixture model analysis, that the vector in this situation also must have only its first coordinate nonzero. Thus, we have, for
where since we had . Applying Lemma 9 in a single dimension yet again and the Cauchy-Schwarz inequality gives us, for any unit vector ,
where and in the last step we used . The above, when combined with uniform convergence arguments over appropriate nets over the unit vectors and the error vector gives us the claimed result.
∎
Lemma 17.
With the same preconditions as in Lemma 19, we have
where is the subGaussian constant of the distribution that generated the covariate vectors . When is the standard Gaussian i.e. , we have . For , we have . In the above, is a constant that depends only on the distribution and is bounded for various distributions in Lemmata 19.
Proof.
Let be a square symmetric matrix, for , we have:
Also, for any square symmetric matrix and being net over
Taking and , we have
(2) |
let and then for any fixed , we have
where we use the fact that since is -sub-Gaussian and .
Also, is -sub-Gaussian, implies is sub-exponential, as well as is -sub-exponential, using centering. And for a single good data point Lemma 19 gives .
where is a universal constant and in the last step we used and w.l.o.g. we assumed that . Taking a union bound over all elements of , we get
Setting and noticing that by Lemma 19 finishes the proof. ∎
Lemma 18.
Consider two models such that and let denote the corresponding weight vectors, i.e. . Also let and . Then for any such that for all ,
where is the maximum length in a set of vectors, each sampled from a -dimensional Gaussian (see Lemma 7).
Proof.
Let, where,
Since is an everywhere differentiable function, it is -Lipschitz continuous, where
Hence, or . This gives us . Now, if we let and , then for any unit vector , denoting we have
This proves that .
∎
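The Lipschitz argument in Lemma 18 can be made concrete for the Gaussian-likelihood score s(r) = exp(−βr²/2), which is our assumption about the form of SVAM-RR's weights: its derivative −βr·exp(−βr²/2) is largest in magnitude at |r| = 1/√β, giving a Lipschitz constant of √(β/e). A small numerical check:

```python
import numpy as np

beta = 2.5
r = np.linspace(-10.0, 10.0, 400001)
deriv = -beta * r * np.exp(-0.5 * beta * r ** 2)   # d/dr of exp(-beta r^2 / 2)
print("max |derivative|:", np.abs(deriv).max())
print("sqrt(beta / e)  :", np.sqrt(beta / np.e))   # attained at |r| = 1/sqrt(beta)
```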
Lemma 19.
Let be generated from an isotropic -sub-Gaussian distribution . For any fixed model and , let be the weight of the data point , . Then there exists a constant that depends only on such that, for any fixed unit vector ,
Proof.
Let and for good points . Let . As , we have, by linearity of expectation,
since is isotropic.
For the lower bound, we may write,
where, for any distribution over , we define the constant as
- Centered Isotropic Gaussian
-
For the special case of , using rotational symmetry, we can w.l.o.g. take and . Thus, if i.i.d. then
This gives us
- Centered Non-isotropic Gaussian
-
For the case , we have . Thus for any fixed unit vector , we have where and . We also have , where and . Now for any fixed vectors we first perform rotations so that we have and where and . This gives us where,
similar to that of the isotropic counterpart, giving the following:
As the above term needs to be bounded away from 0, we require i.e. reasonably away from 0.
- Non-centered isotropic Gaussian
-
Suppose the covariates are generated from a distribution . As earlier, by rotational symmetry, we can take , , . Assume . Letting and i.i.d. gives independence of , and the fact that and gives us,
Now, since we have the following two cases:
Case 1: . In this case
for and .
Case 2: . In this case, if , then and also . So we can write . Also . Hence,
using the facts and . We see from the above that we need to avoid large values. One way to do so is to center the covariates, i.e., use where . This would approximately center the covariates and ensure an effective value of .
- Bounded Distribution
-
Suppose our covariate distribution has bounded support, that is, we have for some . Assume w.l.o.g. . Also, using the centering trick above, assume that . Then we have , which implies . Let denote the covariance of the distribution and let denote its smallest eigenvalue. Using the value of , this gives us
This finishes the proof. ∎
Lemma 20.
Let be generated from an isotropic -sub-Gaussian distribution and let denote the set of uncorrupted points. Then there exists a distribution-specific constant such that
Proof.
The bound on the largest eigenvalue follows directly from the fact that all weights are upper bounded by , hence , together with an application of Lemma 8. For the bound on the smallest eigenvalue, notice that Lemma 17 shows us that for any fixed variance , we have
Given , Lemma 18 shows us that if are two models at the stage variance such that , then the following holds almost surely.
This prompts us to initiate a uniform convergence argument by setting up a -net over for . Note that such a net has at most elements by applying standard covering number bounds for the Euclidean ball [23, Corollary 4.2.13]. Taking a union bound over this net gives us
where in the last step we used Lemma 7 to bound with probability at least . ∎
G.1 Robust Least-squares Regression with a Fully Adaptive Adversary
To handle a fully adaptive adversary, we need mild modifications to the notions of LWSC and LWLC given in Definition 1, so that the adversary is now allowed to choose the locations of corruption arbitrarily. For the sake of simplicity, we present these re-definitions and subsequent arguments in the context of robust least-squares regression, but similar extensions hold for the other GLM tasks as well. Let us denote the useful shorthands , and let denote the collection of all possible subsets of data points.
Definition 2 (LWSC/LWLC against Fully Adaptive Adversaries).
Suppose we are given an exponential family likelihood distribution and data points of which an fraction has been corrupted by a fully adaptive adversary and is any positive real value. Then, we say that the adaptive -local weighted strong convexity property is satisfied if for any model , we have
where is a diagonal matrix containing the data point scores assigned by the model (see Definition 1 for a definition of the scores). Similarly, we say that the adaptive -weighted strong smoothness properties are satisfied if for any true model and any model , we have
where we used the shorthand and continues to be the diagonal matrix containing the data point scores assigned by the model .
Note that for the setting of robust least-squares regression, for any two models we have (since the scores are non-negative), which motivates the above re-definition of adaptive LWSC. For the same setting we also have (with the shorthand ), i.e., by the triangle inequality. However, for the sake of simplicity we will analyze the noiseless setting, i.e. , which motivates the above re-definition of adaptive LWLC. These re-definitions can be readily adapted to settings with hybrid noise, i.e. when , as well.
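The re-definition above asks the weighted strong convexity bound to hold uniformly over every subset of size (1−α)n that the adversary could have left clean. On a toy instance this can be checked by brute force, as in the sketch below; the instance size, corruption rate and the exp(−βr²/2) scores are placeholder assumptions used only to illustrate the "minimum over all good sets" quantity.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha, beta = 12, 2, 0.25, 2.0      # tiny instance so subset enumeration is feasible
k_good = n - int(alpha * n)               # size of the untouched "good" set
X = rng.normal(size=(n, d))
r = rng.normal(size=n)                    # residuals w.r.t. some candidate model
s = np.exp(-0.5 * beta * r ** 2)          # per-point scores

# Adaptive LWSC: smallest weighted-Hessian eigenvalue, minimised over all
# candidate good sets of size (1 - alpha) n that the adversary could have spared.
worst = min(
    np.linalg.eigvalsh(X[list(S)].T @ (s[list(S), None] * X[list(S)]))[0]
    for S in itertools.combinations(range(n), k_good)
)
print("adaptive LWSC constant on this toy instance:", worst)
```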
We now show that for the same settings as considered for the proof of Theorem 2, the adaptive LWSC and LWLC properties are also satisfied, albeit with worse constants due to the application of union bounds over all possible “good” sets in the collection that the adversary could have left untouched to its corruption.
Lemma 20 gives us, for any fixed set
By taking a union bound over all sets and observing that
we have
which requires setting to obtain a confidence of . This establishes the adaptive LWSC guarantee with a confidence level similar to the one we had for the partially adaptive case. We now establish the adaptive LWLC guarantee.
We notice that Lemma 8 can be extended to the “adaptive” setting using [1, Lemma 15] to show that with probability at least , we have
where we continue to use the notation and for large enough , absorbed the term into the first term by increasing the constant 3 to 3.01 since the second term asymptotically vanishes in comparison to the first term that is linear in . Now, following steps similar to those in the proof of Lemma 16 gives us
where the second-last step uses the fact that our upper bound on is at least . This establishes the adaptive LWLC property with confidence at least .
Theorem 21 (Theorem 3 restated – Fully Adaptive Adversary).
Suppose data is corrupted by a fully adaptive adversary that is able to decide the locations of the corruptions as well as the corrupted labels using complete information of the true model, data features and clean labels, and SVAM-RR is initialized and executed as described in the statement of Theorem 2. Then SVAM-RR enjoys a breakdown point of , i.e. it ensures model recovery even if corruptions are introduced by a fully adaptive adversary, where the value of can go up to at least . More specifically, in the noiseless setting where clean data points do not experience any Gaussian noise, i.e. and for clean points, with probability at least , the LWSC/LWLC conditions are satisfied for all i.e. . Consequently, for any , within iterations, we have .
Proof.
The above arguments establishing the adaptive LWSC and adaptive LWLC properties allow us to obtain the following result in a manner similar to that used in the proof of Theorem 14 (but in the noiseless setting)
Applying the limit (since we are working in the pure corruption setting without any Gaussian noise on labels of the uncorrupted points) transforms the requirement (which as before, assures us of the existence of a scale increment satisfying the requirements of Theorem 1) to:
Setting as done before further simplifies this requirement to
However, unlike earlier where we could simply set to an arbitrarily small value for large enough , we cannot do so now since as noted earlier, we must set to obtain a confidence of in the LWSC guarantee. However, for large enough we can still obtain which transforms the requirement further to
which is satisfied at all values of or smaller. This finishes the proof. ∎
Appendix H Robust Mean Estimation
We will let respectively denote the set of “good” uncorrupted points and “bad” corrupted points. We will abuse notation to let and respectively denote the number of good and bad points too.
Theorem 22 (Theorem 5 restated).
For data generated in the robust mean estimation model as described in §4, suppose corruptions are introduced by a partially adaptive adversary, i.e. the locations of the corruptions (the set ) are not decided adversarially but the corruptions themselves are decided jointly, adversarially, and may be unbounded. Then SVAM-ME enjoys a breakdown point of , i.e. it ensures a bounded error even if corruptions are introduced, where the value of can go up to at least . More generally, for corruption rates , there always exist values of the scale increment s.t. with probability at least , the LWSC/LWLC conditions are satisfied for the function corresponding to the robust mean estimation model for values at least as large as . If initialized with s.t. , SVAM-ME assures for any within iterations.
Proof.
For any two models , the function for robust mean estimation has the following form
where . We first outline the proof below.
Proof Outline. This proof has four key elements
-
1.
We will first establish this result for , then generalize the result for arbitrary . Note that for , we have .
-
2.
To establish the LWSC and LWLC properties, we will first consider, as before, a fixed value of for which the properties will be shown to hold with probability . As promised in the statement of Theorem 5, we will execute SVAM-ME for no more than iterations; taking a naive union bound would offer a confidence level of . However, this can be improved by noticing that the confidence levels offered by the LWSC/LWLC results are actually of the form . Thus, a union over such events will at best deteriorate the confidence bounds to , which is still for the values of we shall set.
-
3.
The key to this proof is to maintain the invariant . Recall that initialization ensures to start things off. §3 gives details on how to initialize in practice. This establishes the base case of an inductive argument. Next, inductively assuming that for an iteration , we will establish that where will be an application-specific expression derived below.
-
4.
We will then ensure that , say for some , whenever the number of corruptions is below the breakdown point. This ensures , in other words, for , so that the invariant is preserved. However, notice that the above step simultaneously ensures that . This ensures that a valid value of the scale increment can always be found till . Specifically, we will be able to assure the existence of a scale increment satisfying the conditions of Theorem 1 w.r.t. the LWSC/LWLC results only till .
We now present the proof. Lemmata 23, 24 establish the LWSC/LWLC properties for the function for robust mean estimation. Let and . To simplify the notation, we will analyze below the updates made with weights scaled up by the constant as defined in Lemma 9. This in no way affects the execution of the algorithm, since this scaling factor appears in both the numerator and the denominator of the update terms and simply cancels away. However, it will simplify our presentation. We also, w.l.o.g., first analyze the case of and then scale the space to retrieve a result for general .
Using the bound from Lemma 26, together with Lemmata 23 and 24, gives us
where . As was the case for robust least squares regression in the proof of Theorem 14, to assure the existence of a scale increment satisfying the requirements of Theorem 1, and hence a linear rate of convergence, all we require is to ensure that has a value of the form where . Now, noting that , and promising that we will never set as well as never set , we note that we need to set as well as , in order to ensure a confidence of in the tail bounds we have established.
-
1.
Breakdown Point: as observed above, with large enough , we can set to be small, yet positive, constants. For large enough , if we set to be a small enough constant then we have , as well as . This means we need only ensure . The above is satisfied for all which establishes a breakdown point of . Note that even at this breakdown point, since we still set , and thus, can still assure .
-
2.
Consistency: To analyze the consistency properties of the algorithm, we recall that, ignoring universal constants, to obtain a linear rate of convergence we need to ensure
which can be rewritten as . Recall from above that we need to set as well as . Setting them at these lower bounds, using , ignoring universal constants (since we are only interested in the asymptotic behavior of the algorithm) and some simple manipulations, we can show that, for all , we can allow values of as large as
Note that the above assures us, when i.e. corruptions are absent, an error of as . Thus, the method is consistent when corruptions are absent.
Scaling the space back up by a factor of gives us the desired result. ∎
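For concreteness, the sketch below implements the SVAM-ME iteration analysed in this theorem: a weighted mean with Gaussian-likelihood weights at the current scale β, with β increased geometrically by a scale increment across iterations. The constants (sample size, dimension, corruption rate, initial β and increment ξ) are placeholders for illustration, not the values prescribed by the theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, alpha = 1000, 5, 0.15
mu_star = np.ones(d)
X = mu_star + rng.normal(size=(n, d))                 # clean points ~ N(mu*, I)
bad = rng.choice(n, size=int(alpha * n), replace=False)
X[bad] = 25.0 + 5.0 * rng.normal(size=(bad.size, d))  # gross corruptions

def svam_me(X, beta0=0.01, xi=1.5, iters=12):
    mu = X.mean(axis=0)                               # crude initialisation
    beta = beta0
    for _ in range(iters):
        s = np.exp(-0.5 * beta * np.sum((X - mu) ** 2, axis=1))  # per-point scores
        mu = (s[:, None] * X).sum(axis=0) / s.sum()              # weighted MLE = weighted mean
        beta *= xi                                               # variance reduction step
    return mu

mu_hat = svam_me(X)
print("error of SVAM-ME estimate :", np.linalg.norm(mu_hat - mu_star))
print("error of plain sample mean:", np.linalg.norm(X.mean(axis=0) - mu_star))
```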
Lemma 23 (LWSC for Robust Mean Estimation).
For any , the -function for robust mean estimation satisfies the LWSC property with constant with probability at least for any .
Proof.
It is easy to see that for any . Applying Lemma 9 gives us
where is defined in Lemma 9. The analysis in the proof of Lemma 13, on the other hand, tells us that the random variable has a subGaussian constant at most unity. Applying Hoeffding's inequality for subGaussian variables and noticing gives us
where is a universal constant. Again notice that the above result holds for a fixed error vector . Suppose are two error vectors such that . Then, denoting as the weights assigned by these two error vectors, for all , by applying Lemma 25, we get
where is the maximum length in a set of vectors, each sampled from a -dimensional Gaussian (see Lemma 7). Applying a union bound over a -net over with gives us
Promising that we will always set and noting that Lemma 26 gives us for and noting that Lemma 7 gives us finishes the proof. ∎
Lemma 24 (LWLC for Robust Mean Estimation).
For any , the -function for robust mean estimation satisfies the LWLC property with constant with probability at least for any .
Proof.
It is easy to see that . We bound these separately below. Recall that we are working with weights that are scaled by a factor of , where is defined in Lemma 9.
Bad Points.
We have for . Let . This gives us two cases
-
1.
: in this case we use and thus,
-
2.
: in this case we have and thus we also have which gives us upon using the fact that for all .
The above tells us, by an application of the triangle inequality, that .
Good Points.
We have for . Thus, Lemma 9 gives us
where . Note that since , we have
Using Lemma 13 and the linearity of the subexponential norm tells us that the subexponential norm of the random variable , for a fixed unit vector , is at most (where is defined in Lemma 9). Applying Bernstein's inequality for subexponential variables gives us, for some universal constant ,
for some universal constant . Now, if are two unit vectors such that , we have
where is the maximum length in a set of vectors, each sampled from a -dimensional Gaussian (see Lemma 7) and where in the last step we used the triangle inequality and the bounds and for all . Thus, taking a union bound over a -net over the surface of the unit sphere gives us
The above can be seen as simply an affirmation that with high probability. Setting and , and noticing gives us, upon promising that we always set ,
Now notice that this result holds for a fixed error vector . Suppose now that we have two vectors such that . If we let and denote the weights with respect to these two error vectors, then Lemma 25 tells us that, for any , we must have
Taking a union bound over a -net over for gives us
This finishes the proof upon simple modifications and promising that we will always set and noting that Lemma 26 gives us for . ∎
Lemma 25.
Suppose are two error vectors such that and, for some vector , we define . Then, for all , we must have
Proof.
Since is a -Lipschitz function in the region , we have
where in the last step we applied the Cauchy-Schwarz inequality and used for . ∎
Lemma 26.
For , we have where is defined in Lemma 9.
Proof.
We have
We also have, using ,
by using . Thus, we have
∎
Appendix I Gamma Regression
For this proof we use the notation . Recall that the labels for the gamma distribution are always non-negative and, as specified in Section 4, the corruptions are multiplicative in this case. We will let respectively denote the set of “good” uncorrupted points and “bad” corrupted points. We will abuse notation to let and respectively denote the number of good and bad points too.
To simplify the analysis, we assume that the data features are sampled uniformly from the surface of the unit sphere in dimensions, i.e. . We will also assume that for clean points, labels were generated from the mode of the Gamma distribution as , i.e. in the no-noise model. For corrupted points, labels are where but otherwise arbitrary or even unbounded. To simplify the presentation of the results, we will additionally assume and .
For any vector and any set , denotes the vector with all coordinates other than those in the set zeroed out. Similarly, for any matrix denotes the matrix with all columns other than those in the set zeroed out.
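For reference, here is a small sketch of the data model just described. It assumes, purely for illustration, an exponential link so that each clean label sits at the mode exp(⟨x_i, w*⟩) of its gamma distribution; the link and the corruption multipliers used below are our placeholder assumptions, not the paper's exact parametrization.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, alpha = 500, 8, 0.1

w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)

# Covariates sampled uniformly from the surface of the unit sphere in d dimensions.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# No-noise labels: each clean label placed at the (assumed) mode exp(<x_i, w*>).
y = np.exp(X @ w_star)

# Multiplicative, potentially unbounded, corruptions on a random alpha fraction of points.
bad = rng.choice(n, size=int(alpha * n), replace=False)
y[bad] *= rng.uniform(5.0, 500.0, size=bad.size)

good = np.setdiff1d(np.arange(n), bad)
print("clean label range    :", y[good].min(), y[good].max())
print("corrupted label range:", y[bad].min(), y[bad].max())
```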
I.1 Variance Reduction with the Gamma distribution
Since the variance reduction step is a bit more involved with the Gamma distribution, we give a detailed derivation here. Consider the gamma distribution:
where the natural parameter is . The mode-preserving variance-reduced distribution would be:
writing we have,
So that,
Hence, to perform variance reduction, the following parameter update is sufficient:
We will give the proof for the setting where clean points suffer no noise. Let be the no-noise parameters. Assuming , we have and . This gives
Let denote the multiplicative corruption, with uncorrupted points having .
We have,
Let and
We may write,
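To sanity-check the mode-preserving variance-reduction step derived in this subsection, note that raising a Gamma(k, θ) density (shape-scale parametrization, our assumption for this illustration) to a power β and renormalising yields another gamma density with shape β(k−1)+1 and scale θ/β. This keeps the mode (k−1)θ fixed while shrinking the variance roughly as 1/β. The sketch below verifies this numerically.

```python
import numpy as np
from scipy import stats

k, theta, beta = 5.0, 2.0, 4.0          # original shape/scale and the variance-reduction power

# Variance-altered distribution: density proportional to the original density raised to beta.
k_beta = beta * (k - 1.0) + 1.0
theta_beta = theta / beta

orig = stats.gamma(a=k, scale=theta)
altered = stats.gamma(a=k_beta, scale=theta_beta)

mode = lambda a, s: (a - 1.0) * s       # mode of Gamma(a, s) for a > 1
print("modes     :", mode(k, theta), mode(k_beta, theta_beta))   # identical
print("variances :", orig.var(), altered.var())                  # altered variance is much smaller
```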
Theorem 27 (Theorem 4 restated).
For data generated in the robust gamma regression model as described in §4, suppose corruptions are introduced by a partially adaptive adversary, i.e. the locations of the corruptions (the set ) are not decided adversarially but the corruptions themselves are decided jointly, adversarially, and may be unbounded. Then SVAM-Gamma enjoys a breakdown point of , i.e. it ensures a bounded error even if corruptions are introduced, where the value of can go up to at least . More generally, for corruption rates , there always exist values of the scale increment such that with probability at least , the LWSC/LWLC conditions are satisfied for the function corresponding to the robust gamma regression model for values at least as large as . Specifically, if initialized at such that , for any SVAM-Gamma assures
within iterations.
Proof.
We first outline the proof below. We note that Theorem 4 is obtained by setting in the above statement.
Proof Outline.
The proof outline is similar to the one followed for robust least squares regression and robust mean estimation in Theorems 14 and 22, but adapted to suit the alternate parametrization and invariant used by SVAM-Gamma for gamma regression. Lemmata 28, 29 below establish the LWSC and LWLC properties with high probability for . Let . Unlike in mean estimation and robust regression, where we maintained the invariant , in this setting we will instead maintain the invariant . This is because gamma regression uses a non-standard canonical parameter to enable support only over non-negative labels. Note however that this altered invariant still ensures that as , we also have . Since we will always use (since we set ), we correspondingly require during initialization, as mentioned in the statement of Theorem 4, where . Since we will analyze the special case for the sake of simplicity, we get .
Lemmata 28,29 establish the LWSC/LWLC properties for the function for robust gamma regression. Given the above proof outline and applying Lemmata 28,29 gives us (taking and )
For the breakdown point calculation we set and ,
to get
∎
Lemma 28 (LWSC for Robust Gamma Regression).
For any , the -function for gamma regression satisfies the LWSC property with constant with probability at least where is defined in the proof.
Proof.
It is easy to see that, having , we have at the good points,
where .
Let us write,
with,
and,
We have,
To get a high-probability lower bound on the LWSC constant, we would like to apply McDiarmid's inequality. Having , we may write, for any fixed and ,
For any square symmetric matrix , we have . Taking gives
Taking a union bound over a -net of gives
Since and , we have,
Putting with , for any fixed we have:
In order to take a union bound over , we observe the following: let and use for
So that,
In order to set up the -net, we would require
Taking the covering number of and observing,
We may write,
This finishes the proof. ∎
Lemma 29 (LWLC for Robust Gamma Regression).
For any , the -function for robust gamma regression satisfies the LWLC property with constant with probability at least , where is defined in the proof.
Proof.
It is easy to see that . Since for good points in the no-noise setting we have assumed, good points do not contribute to the gradient at all. Thus, we get
where and . Now since with probability at least , we have (since the locations of the corruptions were chosen randomly without looking at data). Thus, we are left to bound . We have .
Let , we have,
Using and the density at the mode of the Gamma distribution,
which gives us
∎