Robust Sub-Gaussian Principal Component Analysis
and Width-Independent Schatten Packing
1 Introduction
We study two natural, but seemingly unrelated, problems in high-dimensional robust statistics and continuous optimization, respectively. As we will see, these problems have an intimate connection.
Problem 1: Robust sub-Gaussian principal component analysis. We consider the following statistical task, which we call robust sub-Gaussian principal component analysis (PCA). Given samples from a sub-Gaussian distribution (see Section 2 for a formal definition) with covariance , an ε fraction of which are arbitrarily corrupted, the task asks to output a unit vector with for tolerance (throughout, we use to denote the Schatten- norm; cf. Section 2 for more details). Ergo, the goal is to robustly return a -approximate top eigenvector of the covariance of the sub-Gaussian distribution . This is the natural extension of PCA to the robust statistics setting.
There has been a flurry of recent work on efficient algorithms for robust statistical tasks, e.g. covariance estimation and PCA. From an information-theoretic perspective, sub-Gaussian concentration suffices for robust covariance estimation. Nonetheless, to date all polynomial-time algorithms achieving nontrivial guarantees on covariance estimation of a sub-Gaussian distribution (including PCA specifically) in the presence of adversarial noise require additional algebraic structure. For instance, sum-of-squares certifiably bounded moments have been leveraged for polynomial-time covariance estimation [HL18, KSS18]; however, this is a stronger assumption than sub-Gaussianity.
In many applications (see discussion in [DKK+17]), the end goal of covariance estimation is PCA. Thus, a natural question which relaxes robust covariance estimation is: can we robustly estimate the top eigenvector of the covariance , assuming only sub-Gaussian concentration? Our work answers this question affirmatively via two incomparable algorithms. The first achieves in polynomial time; the second achieves in nearly-linear time under a mild gap assumption on . Moreover, both methods have nearly-optimal sample complexity.
Problem 2: Width-independent Schatten packing. We consider a natural generalization of packing semidefinite programs (SDPs) which we call Schatten packing. Given symmetric positive semidefinite matrices {A_i}_{i ∈ [n]} and a norm parameter p, a Schatten packing SDP asks to solve the optimization problem
min_{w ∈ Δ^n} ‖ ∑_{i ∈ [n]} w_i A_i ‖_p.    (1)
Here, ‖·‖_p is the Schatten-p norm of a matrix and Δ^n is the probability simplex (see Section 2). When p = ∞, (1) is the well-studied (standard) packing SDP objective [JY11, ALO16, PTZ16], which asks to find the most spectrally bounded convex combination of packing matrices. For smaller p, the objective encourages combinations more (spectrally) uniformly distributed over directions.
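To make the objective concrete, the following minimal numpy sketch evaluates the Schatten packing objective for a synthetic instance; the matrices, weights, and the choice p = 3 are placeholders for illustration only.

```python
import numpy as np

def schatten_norm(M, p):
    """Schatten-p norm of a symmetric matrix: the l_p norm of its eigenvalues."""
    eigvals = np.linalg.eigvalsh(M)
    return np.linalg.norm(eigvals, ord=p)

def packing_objective(matrices, weights, p):
    """Objective of the Schatten packing SDP: ||sum_i w_i A_i||_p."""
    combined = sum(w * A for w, A in zip(weights, matrices))
    return schatten_norm(combined, p)

# Synthetic example: random PSD matrices and a uniform point in the simplex.
rng = np.random.default_rng(0)
d, n = 5, 8
mats = []
for _ in range(n):
    B = rng.standard_normal((d, d))
    mats.append(B @ B.T)            # PSD by construction
w = np.full(n, 1.0 / n)             # uniform weights over the simplex
print(packing_objective(mats, w, p=3))
```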
The specialization of (1) to diagonal matrices is a smooth generalization of packing linear programs, previously studied in the context of fair resource allocation [MSZ16, DFO18]. For the p = ∞ case of (1), packing SDPs have the desirable property of admitting “width-independent” approximation algorithms by exploiting positivity structure. Specifically, width-independent solvers obtain multiplicative approximations with runtimes independent of, or logarithmically dependent on, size parameters of the problem. This is a strengthening of the additive notions of approximation typically used for approximate semidefinite programming. Our work gives the first width-independent solver for Schatten packing.
1.1 Previous work
Learning with adversarial outliers. The study of estimators robust to a small fraction of adversarial outliers dates back to foundational work, e.g. [Hub64, Tuk75]. Following more recent work [LRV16, DKK+19], there has been significant interest in efficient, robust algorithms for statistical tasks in high-dimensional settings. We focus on methods robustly estimating covariance properties here, and defer a thorough discussion of the (extensive) robust statistics literature to [Ste18, Li18, DK19].
There has been quite a bit of work in understanding and giving guarantees for robust covariance estimation where the uncorrupted distribution is exactly Gaussian [DKK+17, DKK+18, DKK+19, CDGW19]. These algorithms strongly use relationships between higher-order moments of Gaussian distributions via Isserlis’ theorem. Departing from the Gaussian setting, work of [LRV16] showed that if the distribution is an affine transformation of a 4-wise independent distribution, robust covariance estimation is possible. This was extended by [KSS18], which also assumed nontrivial structure in the moments of the distribution, namely that sub-Gaussianity was certifiable via the sum-of-squares proof system. To the best of our knowledge it has remained open to give nontrivial guarantees for robust estimation of any covariance properties under minimal assumptions, i.e. sub-Gaussian concentration.
All aforementioned algorithms also yield guarantees for robust PCA, by applying a top eigenvector method to the learned covariance. However, performing robust PCA via the intermediate covariance estimation step is lossy, both statistically and computationally. From a statistical perspective, samples are necessary to learn the covariance of a -dimensional Gaussian in Frobenius norm (and for known efficient algorithms for spectral norm error [DKS17]); in contrast, samples suffice for (non-robust) PCA. Computationally, even when the underlying distribution is exactly Gaussian, the best-known covariance estimation algorithms run in time ; algorithms working in more general settings based on the sum-of-squares approach require much more time. In contrast, the power method for PCA in a matrix takes time (we say if for some constant ). Motivated by this, our work initiates the direct study of robust PCA, which is often independently interesting in applications.
We remark there is another problem termed “robust PCA” in the literature, e.g. [CLMW11], under a different generative model. We defer a detailed discussion to [DKK+17], which experimentally shows that algorithms from that line of work do not transfer well to our corruption model.
Width-independent iterative methods. Semidefinite programming (SDP) and its linear programming specialization are fundamental computational tasks, with myriad applications in learning, operations research, and computer science. Though general-purpose polynomial time algorithms exist for SDPs ([NN94]), in practical settings in high dimensions, approximations depending linearly on input size and polynomially on error are sometimes desirable. To this end, approximation algorithms based on entropic mirror descent have been intensely studied [WK06, AK16, GHM15, AL17, CDST19], obtaining additive approximations to the objective with runtimes depending polynomially on , where is the “width”, the largest spectral norm of a constraint.
For structured SDPs, stronger guarantees can be obtained in terms of width. Specifically, several algorithms developed for packing SDPs ((1) with p = ∞) yield -multiplicative approximations to the objective, with logarithmic dependence on width [JY11, PTZ16, ALO16, JLL+20]. As upper bounds the objective value in this setting, in the worst case the runtimes of width-dependent solvers yielding -additive approximations have dependences similar to those of their width-independent counterparts. Width-independent solvers simultaneously yield stronger multiplicative bounds at all scales of objective value, making them desirable in suitable applications. In particular, packing SDPs have found great utility in robust statistics algorithm design [CG18, CDG19, CDGW19, DL19].
Beyond packing, width-independent guarantees in the SDP literature are few and far between; to our knowledge, other than the covering and mixed solvers of [JLL+20], ours is the first such guarantee for a broader family of objectives. (In concurrent and independent work, [CMY20] develops width-independent solvers for Ky Fan packing objectives, a different notion of generalization than the Schatten packing objectives we consider.) Our method complements analogous extensions in the width-dependent setting, e.g. [ALO15], as well as width-independent solvers for packing linear programs [MSZ16, DFO18]. We highlight the fair packing solvers of [MSZ16, DFO18], motivated by problems in equitable resource allocation, which further solved packing variants for . We find analogous problems in semidefinite settings interesting, and defer them to future work.
1.2 Our results
Robust sub-Gaussian principal component analysis. We give two algorithms for robust sub-Gaussian PCA (we follow the distribution and corruption model described in Assumption 1). Both are nearly sample-optimal, run in polynomial time, and assume only sub-Gaussianity. The first is via a simple filtering approach, summarized here and developed in Section 3.
Theorem 1.
Our second algorithm is more efficient under mild conditions, but yields a worse approximation for . Specifically, if there are few eigenvalues of larger than , our algorithm runs in nearly-linear time. Note that if there are many eigenvalues above this threshold, then the PCA problem itself is not very well-posed; our algorithm is very efficient in the interesting setting where the approximate top eigenvector is identifiable. We state our main algorithmic guarantee here, and defer details to Section 5.
Theorem 2.
We remark that samples are necessary for a -approximation to the top eigenvector of via uncorrupted samples from , so our first method is sample-optimal, as is our second up to a factor.
Width-independent Schatten packing. Our second method crucially requires an efficient solver for Schatten packing SDPs. We demonstrate that Schatten packing, i.e. (1) for arbitrary , admits width-independent solvers. We state an informal guarantee, and defer details to Section 4.
Theorem 3.
Let , and . There is an algorithm taking iterations, returning a multiplicative approximation to the problem (1). For odd , each iteration can be implemented in time nearly-linear in the number of nonzeros amongst all .
2 Preliminaries
General notation. denotes the set . Applied to a vector, is the norm; applied to a symmetric matrix, is the Schatten- norm, i.e. the norm of the spectrum. The dual norm of is for ; when , . is the -dimensional simplex (subset of positive orthant with -norm ) and we define to be the truncated simplex:
(2)
Matrices. is symmetric matrices, and is the positive semidefinite subset. is the identity of appropriate dimension. , , and Tr are the largest and smallest eigenvalues and trace of a symmetric matrix. For , and we use the Loewner order , ( iff ). The seminorm of is .
Fact 1.
For , with compatible dimension, . For , .
Fact 2.
We have the following characterization of the Schatten- norm: for , and ,
For , the satisfying is , so has spectrum .
Distributions. We denote drawing vector from distribution by , and the covariance of is . We say scalar distribution is -sub-Gaussian if and
Multivariate has sub-Gaussian proxy if its restriction to any unit is -sub-Gaussian, i.e.
(3)
We consider the following standard model for gross corruption with respect to a distribution .
Assumption 1 (Corruption model, see [DKK+19]).
Let be a mean-zero distribution on with covariance and sub-Gaussian proxy for a constant . Denote by index set with a set of (uncorrupted) samples . An adversary arbitrarily replaces points in ; we denote the new index set by , where is the (unknown) set of points added by an adversary, and is the set of points from that were not changed.
As we only estimate covariance properties, the assumption that is mean-zero only loses constants in problem parameters, by pairing samples and subtracting them (cf. [DKK+19], Section 4.5.1).
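As a loose illustration of this corruption model (not a restatement of Assumption 1, whose parameters are given above), the following sketch draws mean-zero Gaussian samples and lets an "adversary" replace an ε fraction of them with an arbitrary choice; the planted covariance and the adversarial strategy here are illustrative assumptions.

```python
import numpy as np

def corrupt_samples(n, d, eps, rng):
    """Draw n mean-zero Gaussian samples, then replace an eps fraction arbitrarily."""
    # Ground-truth covariance with a planted top direction (illustrative choice).
    Sigma = np.eye(d)
    Sigma[0, 0] = 5.0
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # uncorrupted samples
    k = int(eps * n)                                           # number of replaced points
    bad_idx = rng.choice(n, size=k, replace=False)             # indices handed to the adversary
    # One arbitrary adversarial strategy: plant a spurious direction e_1 + e_2.
    spike = np.zeros(d)
    spike[0] = spike[1] = 10.0
    X[bad_idx] = spike
    return X, bad_idx, Sigma

rng = np.random.default_rng(1)
X, bad, Sigma = corrupt_samples(n=2000, d=10, eps=0.05, rng=rng)
```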
3 Robust sub-Gaussian PCA via filtering
In this section, we sketch the proof of Theorem 1, which gives guarantees on our filtering algorithm for robust sub-Gaussian PCA. This algorithm obtains stronger statistical guarantees than Theorem 2, at the cost of super-linear runtime; the algorithm is given as Algorithm 6. Our analysis stems largely from concentration facts about sub-Gaussian distributions, as well as the following (folklore) fact regarding estimation of variance along any particular direction.
Lemma 1.
In other words, we show that using corrupted samples, we can efficiently estimate a -multiplicative approximation of the variance of in any unit direction (Corollary 5 gives a slightly stronger guarantee, namely that reusing samples does not break dependencies of ). This proof is deferred to Appendix B for completeness. Algorithm 6 combines this key insight with a soft filtering approach, suggested by the following known structural fact found in previous work (e.g. Lemma A.1 of [DHL19], see also [SCV17, Ste18]).
Lemma 2.
Let , be sets of nonnegative reals, and . Define , for all . Consider any disjoint partition , of with . Then, .
Our Algorithm 6, , takes as input a set of corrupted samples following Assumption 1 and the corruption parameter . At a high level, it initializes a uniform weight vector , and iteratively operates as follows (we denote by the empirical covariance ); a schematic code sketch follows the list.

1. Compute an approximate top eigenvector of via power iteration.
2. Compute .
3. If , then terminate and return .
4. Else:
(a) Sort indices by , with smallest.
(b) Let be the smallest set for which , and apply the downweighting procedure of Lemma 2 to this subset of indices.
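The following is a schematic sketch of this filtering loop. The termination slack and the size of the downweighted set are hypothetical placeholder constants (Algorithm 6 fixes them precisely), and `robust_var(X, u, eps)` stands in for the robust one-dimensional variance estimate of Lemma 1; the sketch illustrates control flow rather than the exact algorithm.

```python
import numpy as np

def top_eigvec(M, iters=50, rng=None):
    """Approximate top eigenvector of a symmetric PSD matrix via power iteration."""
    rng = rng or np.random.default_rng(0)
    v = rng.standard_normal(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

def soft_filter_pca(X, eps, robust_var, max_iters=100, slack=1.2):
    """Schematic soft-filtering loop for robust PCA (constants are illustrative)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                        # uniform initial weights
    u = None
    for _ in range(max_iters):
        cov = (X * w[:, None]).T @ X               # weighted empirical covariance
        u = top_eigvec(cov)                        # step 1: approximate top eigenvector
        scores = (X @ u) ** 2                      # quadratic forms <x_i, u>^2
        sigma_u = robust_var(X, u, eps)            # step 2: robust variance in direction u
        if w @ scores <= slack * sigma_u:          # step 3: estimates agree, so terminate
            return u
        # Step 4: downweight the points with the largest scores (Lemma 2-style update).
        order = np.argsort(scores)[::-1]
        m = max(1, int(2 * eps * n))               # hypothetical size of the filtered set
        top = order[:m]
        w[top] *= 1.0 - scores[top] / max(scores[top].max(), 1e-12)
    return u
```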
The analysis of Algorithm 6 then proceeds in two stages.
Monotonicity of downweighting.
We show the invariant criteria for Lemma 2 (namely, that for the set in every iteration, there is more spectral mass on bad points than good) holds inductively for our algorithm. Specifically, lack of termination implies puts significant mass on bad directions, which combined with concentration of good directions yields the invariant. The details of this argument can be found as Lemma 13.
Roughly uniform weightings imply approximation quality.
As Lemma 2 then applies, the procedure always removes more mass from bad points than good, and thus can only remove at most mass in total by the corruption model. Thus, the weights are always roughly uniform (in ), which by standard concentration facts (see Appendix A) implies that the quality of the approximate top eigenvector is good. Moreover, the iteration count is bounded by roughly because whenever the algorithm does not terminate, enough mass is removed from large spectral directions. Combining this with the termination criteria implies that when a vector is returned, it is a close approximation to the top direction of . Details can be found as Lemma 15 and in the proof of Theorem 1.
4 Schatten packing
4.1 Mirror descent interpretation of [MRWZ16]
We begin by reinterpreting the [MRWZ16] solver, which achieves the state-of-the-art parallel runtime for packing LPs (the [MRWZ16] solver also generalizes to covering and mixed objectives; we focus on packing in this work). An packing LP algorithm solves the following decision problem. (Packing linear programs are sometimes expressed as the optimization problem , similarly to (1); these problems are equivalent up to a standard binary search, see e.g. the discussion in [JLL+20].)
Problem 1 ( packing linear program).
Given entrywise nonnegative , either find primal solution with or dual solution with .
The following result is shown in [MRWZ16].
Our interpretation of the analysis of [MRWZ16] combines two ingredients: a potential argument and mirror descent, where the latter yields a dual feasible point if did not grow sufficiently.
Potential argument. The potential used by [MRWZ16] is , well-known to be a -additive approximation of . As soon as or reaches the scale , by nonnegativity this becomes a multiplicative guarantee, motivating the setting of threshold . To prove the potential is monotone, [MRWZ16] uses step size and a Taylor approximation; combining with the termination condition yields the desired claim.
Mirror descent. To certify that grows sufficiently (e.g. the method terminates in few iterations, else dual feasibility holds), we interpret the step as approximate entropic mirror descent. Specifically, we track the quantity , and show that if has not grown sufficiently, then it must be bounded for every , certifying dual feasibility. Formally, for any sequence and , we show
The last inequality followed from being an upwards truncation. If is bounded (else, we have primal feasibility), we show the entire above expression is bounded for any . Thus, by setting and choosing to be each coordinate indicator, it follows that the average of all is coordinatewise at least , and solves Problem 1 as a dual solution.
Our is chosen as the (truncated) gradient of the function used in the potential analysis, so its form allows us to interpret dual feasibility (e.g. has norm 1 and is a valid dual point). Our analysis follows the pattern of standard mirror descent, complemented by side information which says that the lack of a primal solution can transform a regret guarantee into a feasibility bound. We apply this framework to analyze variants of Problem 1, via different potentials; nonetheless, our proofs are quite straightforward upon adopting this perspective, and we believe it may yield new insights for instances with positivity structure.
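For concreteness, here is a heavily simplified numpy sketch of one step of an update of this flavor for the standard packing setting. The smoothing parameter, step size, and truncation rule are illustrative choices standing in for the precise settings of [MRWZ16] and Algorithm 1; the sketch is meant only to show where the potential and the truncated gradient enter.

```python
import numpy as np

def smax(v, mu):
    """Soft-max potential: overestimates max(v) by at most mu * log(len(v))."""
    m = v.max()
    return m + mu * np.log(np.sum(np.exp((v - m) / mu)))

def packing_lp_step(A, x, eps):
    """One illustrative multiplicative-weights step for an l_infty packing problem.

    Parameter choices here are hypothetical; they mimic, but do not reproduce, [MRWZ16].
    """
    m, n = A.shape
    mu = eps / (4.0 * np.log(m))            # smoothing scale: smax is an eps/4-additive proxy
    z = A @ x / mu
    y = np.exp(z - z.max())
    y /= y.sum()                            # gradient of smax(Ax): a point in the simplex
    g = A.T @ y                             # load of each coordinate under the dual point y
    v = np.clip(1.0 - g, 0.0, None)         # truncated gradient ("upwards truncation")
    x_new = x + eps * x * v                 # multiplicative step, larger where load is small
    return x_new, y
```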
4.2 -norm packing linear programs
In this section, we give a complete self-contained example of the framework proposed in Section 4.1, for approximately solving norm packing linear programs. Specifically, we now consider the generalization of Problem 1 to norms; throughout, is the dual norm.
Problem 2 ( packing linear program).
Given entrywise nonnegative , either find primal solution with or dual solution with .
For , Problem 2 recovers Problem 1 up to constants as multiplicatively approximates by . We now state our method for solving Problem 2 as Algorithm 2.
Other than changing parameters, the only difference from Algorithm 1 is that is a point with unit norm induced by the gradient of our potential . We state our main potential fact; its proof, a straightforward Taylor expansion of , is deferred to Appendix C for brevity.
Lemma 3.
In all iterations of Algorithm 2, defining , .
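For intuition on the gradient-induced dual point mentioned above, the following few lines (a sketch, with a nonnegative vector z standing in for Ax and p > 1) verify numerically that y = (z/‖z‖_p)^{p−1} has unit dual norm and attains ⟨y, z⟩ = ‖z‖_p, matching the dual characterization in Fact 2.

```python
import numpy as np

p = 3.0
q = p / (p - 1.0)                                 # dual exponent: 1/p + 1/q = 1
z = np.array([0.5, 1.0, 2.0, 4.0])                # stand-in for the nonnegative vector Ax
zp = np.linalg.norm(z, ord=p)
y = (z / zp) ** (p - 1.0)                          # gradient-induced dual point
print(np.linalg.norm(y, ord=q))                    # = 1, so y is a valid dual point
print(y @ z, zp)                                   # <y, z> = ||z||_p (Holder equality case)
```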
We now prove our main result, which leverages the potential bound following the framework of Section 4.1. In the proof, we assume that entries of are bounded by ; this does not incur more loss than a constant multiple of in the guarantees, and a proof can be found as Lemma 16.
Proof.
The runtime follows from Line 7 (each iteration cost is dominated by multiplication through ), so we prove correctness. Define potential as in Lemma 3, and note that as ,
The second inequality followed from our assumption on entry sizes (Lemma 16). If Algorithm 2 breaks out of the while loop of Line 4, we have by Lemma 3 that for returned on Line 11,
Thus, primal feasibility is always correct. We now prove correctness of dual feasibility. First, let be the Kullback-Leibler divergence from to , for , . Define the normalized points in each iteration. Expanding definitions,
(4)
The only inequality used the bounds, for ,
Telescoping (4) over all iterations, and using for all since is uniform, we have that whenever Line 4 is not satisfied before the check on Line 7 (i.e. ),
(5)
The last inequality used by assumption, and . Next, since each entrywise, defining ,
(6)
Combining (5) and (6), and rearranging, yields by definition of ,
The last claim follows by setting to be each coordinate-sparse simplex vector. Finally, since , to show that is a correct dual solution to Problem 2 it suffices to show . This follows as is an average of the , convexity of norms, and that for all ,
∎
4.3 Schatten-norm packing semidefinite programs
We generalize Algorithm 2 to solve Schatten packing semidefinite programs, which we now define.
Problem 3.
Given , either find primal solution with or dual solution , with for all .
We assume that is an odd integer for simplicity (this suffices for our applications), and leave the cases when is even or noninteger as interesting future work. The potential used in the analysis and an overall guarantee are stated here; the proofs are deferred to Appendix C. They are simple modifications of Lemma 3 and Theorem 4, using trace inequalities (similar to those in [JLL+20]) in place of scalar inequalities, as well as efficient approximation of the quantities in Line 5 via the standard technique of Johnson-Lindenstrauss projections.
Lemma 4.
In all iterations of Algorithm 3, defining , .
4.4 Schatten packing with a constraint
We remark that the framework outlined in Section 4.1 is flexible enough to handle mixed-norm packing problems. Specifically, developments in Section 5 require the following guarantee.
Proposition 2.
Our method, found in Appendix C, approximately solves (7) by first applying a standard binary search to place on the right scale, for which it suffices to solve an approximate decision problem. Then, we apply a truncated mirror descent procedure on the potential , and prove correctness for solving the decision problem following the framework we outlined in Section 4.1.
5 Robust sub-Gaussian PCA in nearly-linear time
We give our nearly-linear time robust PCA method, leveraging developments of Section 4. Throughout, we will be operating under Assumption 1, for some corruption parameter with ; suffices. We now develop tools to prove Theorem 2.
Algorithm 4 uses three subroutines: our earlier method (Lemma 1), an application of our earlier Proposition 2 to approximate the solution to
(8)
and a method for computing approximate eigenvectors by [MM15] (discussed in Appendix D).
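At a very high level, Algorithm 4 composes these three subroutines. The following schematic sketch shows the data flow only; `solve_mixed_packing`, `top_eigvecs`, and `robust_variance` are hypothetical stand-ins for the solver of Proposition 2, the eigenvector routine of Proposition 3, and the estimator of Lemma 1, and the number of candidate directions k and all thresholds are placeholders.

```python
import numpy as np

def robust_pca_outline(X, eps, k, solve_mixed_packing, top_eigvecs, robust_variance):
    """Schematic data flow of the nearly-linear time robust PCA method.

    All three callables are hypothetical stand-ins for the paper's subroutines.
    """
    n, d = X.shape
    # Reweight samples by (approximately) solving the mixed packing SDP (8).
    w = solve_mixed_packing(X, eps)                  # weights in the truncated simplex
    Y = (X * w[:, None]).T @ X                       # weighted empirical second moment
    # Compute k orthonormal approximate top eigenvectors of Y (Proposition 3).
    Q = top_eigvecs(lambda V: Y @ V, d, k)
    # Robustly score each candidate direction and return the best one (Lemma 1).
    scores = [robust_variance(X, Q[:, j], eps) for j in range(k)]
    return Q[:, int(np.argmax(scores))]
```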
Proposition 3.
There is an algorithm (Algorithm 1, [MM15]), parameterized by , tolerance , , and , which outputs orthonormal with the guarantee
(9)
Here, is the largest eigenvalue of . The total time required by the method is .
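For intuition, the following is a simplified simultaneous power iteration sketch returning k orthonormal approximate top eigenvectors; it is a stand-in for, and converges more slowly than, the randomized block Krylov method of [MM15] referenced in Proposition 3.

```python
import numpy as np

def approx_top_eigvecs(matvec, d, k, iters, rng=None):
    """Simultaneous power iteration: orthonormal approximations to the top-k eigenspace.

    `matvec` applies the (implicit) PSD matrix to a d x k block; this is a simplified
    stand-in for the randomized block Krylov method of [MM15].
    """
    rng = rng or np.random.default_rng(0)
    Q = np.linalg.qr(rng.standard_normal((d, k)))[0]   # random orthonormal start block
    for _ in range(iters):
        Q = np.linalg.qr(matvec(Q))[0]                 # power step + re-orthonormalization
    return Q

# Example usage on an explicit PSD matrix M (in the paper, matvecs are done implicitly).
rng = np.random.default_rng(2)
B = rng.standard_normal((50, 50))
M = B @ B.T
Q = approx_top_eigvecs(lambda V: M @ V, d=50, k=5, iters=30, rng=rng)
print(np.diag(Q.T @ M @ Q))                            # approximate top-5 eigenvalues
```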
Algorithm 4 is computationally bottlenecked by the application of Proposition 2 on Line 2 and the call to on Line 4, from which the runtime guarantee of Theorem 2 follows straightforwardly. To demonstrate correctness, we first certify the quality of the solution to (8).
Lemma 5.
Let . With probability , the uniform distribution over attains value for objective (8), where for a universal constant .
The proof of this is similar to results in e.g. [DKK+19, Li18], and combines concentration guarantees with a union bound over all possible corruption sets . This implies the following immediately, upon applying the guarantees of Proposition 2.
Corollary 1.
Let be the output of Line 2 of . Then, we have , and under the guarantee of Lemma 5.
Let be the output of the solver. Recall that . Additionally, define
(10)
Notice in particular that , and that all these matrices are PSD. We next prove the second, crucial fact, which says that is a good approximator to in Loewner ordering:
Lemma 6.
Let . With probability at least , .
The proof combines the strategy in Lemma 5 with the guarantee of the SDP solver. Perhaps surprisingly, Corollary 1 and Lemma 6 are the only two properties about that our final analysis of Theorem 2 will need. In particular, we have the following key geometric proposition, which carefully combines trace inequalities to argue that the corrupted points cannot create too many new large eigendirections.
Proposition 4.
Proof.
For concreteness, we will define the parameters
For these choices of , , we will use the following (loose) approximations for sufficiently small :
(12)
Suppose for contradiction that all for . By applying the guarantee of Corollary 1 and Fact 2, it follows that
(13)
Let be the largest index such that , and note that . We define
That is, is the restriction of to its top eigendirections. Then,
(14)
In the second line, we used Lemma 6 twice, as well as the trace inequality Lemma 7 with and . Combining (13) with (14), and expanding the definition of , yields
(15)
The third line followed from the spectral bound of Lemma 6, and the fourth followed from the fact that , eigendecompose , as well as the assumption for all . Letting , and using both approximations in (12),
(16)
Next, we bound the last term of (15). By using ,
(17)
The second line used by definition of , Lemma 8 (twice), and . Combining (17) and (16) and plugging into (15),
By the choice of and (i.e. ), we attain a contradiction. ∎
In the proof of Proposition 4, we used the following facts.
Lemma 7.
Let be symmetric matrices and a positive integer. Then we have
Proof.
For any ,
The first step used the Extended Lieb-Thirring trace inequality for , (see e.g. Lemma 2.1, [ALO16]), and the second . Finally, induction on yields the claim. ∎
Lemma 8.
For all , .
Proof.
By the Courant-Fischer minimax characterization of eigenvalues,
However, we also have (Lemma 6), yielding the conclusion. ∎
The guarantees of Proposition 4 were geared towards exact eigenvectors of the matrix . We now modify the analysis to tolerate inexactness in the eigenvector computation, in line with the processing of Line 5 of our Algorithm 4. This yields our final claim in Theorem 2.
Corollary 2.
Proof.
Assume for contradiction that all have . We outline modifications to the proof of Proposition 4. Specifically, we redefine the matrix by
Because is a projection matrix, it is clear . Therefore, by combining the derivations (13) and (14), it remains true that
We now bound these two terms in an analogous way from Proposition 4, with negligible loss; combining these bounds will again yield a contradiction. First, we have the lower bound
Here, the last inequality applied the assumption (9) with respect to . Next, we upper bound
The first line used , the second used the definition of , the third used our assumption , and the last used (9) with respect to . Finally, the remaining derivation (17) is tolerant to additional factors of , yielding the same conclusion up to constants. ∎
Finally, we prove Theorem 2 by combining the tools developed thus far.
Proof of Theorem 2.
Correctness of the algorithm is immediate from Corollary 2 and the guarantees of . Concretely, Corollary 2 guarantees that one of the vectors we produce will be a -approximate top eigenvector (say some index ), and will only lose a negligible fraction of this quality (see Lemma 1); the best returned eigenvector as measured by can only improve the guarantee. Finally, the failure probability follows by combining the guarantees of Lemmas 1, 5, and 6.
We now discuss runtime. The complexity of lines 2, 4, and 5, as guaranteed by Proposition 2, Proposition 3, and Lemma 1 are respectively (recalling )
Throughout we use that we can compute matrix-vector products with an arbitrary linear combination of the in time ; it is easy to check that in all runtime guarantees, nnz can be replaced by this computational cost. Combining these bounds yields the final conclusion. ∎
Acknowledgments
We thank Swati Padmanabhan and Aaron Sidford for helpful discussions.
References
- [AK16] Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. J. ACM, 63(2):12:1–12:35, 2016.
- [AL17] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: Faster online learning of eigenvectors and faster MMWU. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 116–125, 2017.
- [ALO15] Zeyuan Allen-Zhu, Zhenyu Liao, and Lorenzo Orecchia. Spectral sparsification and regret minimization beyond matrix multiplicative updates. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 237–245, 2015.
- [ALO16] Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 1824–1831, 2016.
- [CDG19] Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2755–2771, 2019.
- [CDGW19] Yu Cheng, Ilias Diakonikolas, Rong Ge, and David Woodruff. Faster algorithms for high-dimensional robust covariance estimation. arXiv preprint arXiv:1906.04661, 2019.
- [CDST19] Yair Carmon, John C. Duchi, Aaron Sidford, and Kevin Tian. A rank-1 sketch for matrix multiplicative weights. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pages 589–623, 2019.
- [CG18] Yu Cheng and Rong Ge. Non-convex matrix completion against a semi-random adversary. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1362–1394, 2018.
- [CLMW11] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
- [CMY20] Yeshwanth Cherapanamjeri, Sidhanth Mohanty, and Morris Yau. List decodable mean estimation in nearly linear time. CoRR, abs/2005.09796, 2020.
- [DFO18] Jelena Diakonikolas, Maryam Fazel, and Lorenzo Orecchia. Width-independence beyond linear objectives: Distributed fair packing and covering algorithms. CoRR, abs/1808.02517, 2018.
- [DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 2003.
- [DHL19] Yihe Dong, Samuel Hopkins, and Jerry Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. In Advances in Neural Information Processing Systems, pages 6065–6075, 2019.
- [DK19] Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
- [DKK+17] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 999–1008. JMLR. org, 2017.
- [DKK+18] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2683–2702. SIAM, 2018.
- [DKK+19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput., 48(2):742–864, 2019.
- [DKS17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84. IEEE, 2017.
- [DL19] Jules Depersin and Guillaume Lecué. Robust subgaussian estimation of a mean vector in nearly linear time. CoRR, abs/1906.03058, 2019.
- [GHM15] Dan Garber, Elad Hazan, and Tengyu Ma. Online learning of eigenvectors. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 560–568, 2015.
- [HL18] Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1021–1034, 2018.
- [Hub64] Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
- [JLL+20] Arun Jambulapati, Yin Tat Lee, Jerry Li, Swati Padmanabhan, and Kevin Tian. Positive semidefinite programming: Mixed, parallel, and width-independent. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020.
- [JY11] Rahul Jain and Penghui Yao. A parallel approximation algorithm for positive semidefinite programming. In IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22-25, 2011, pages 463–471, 2011.
- [KSS18] Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035–1046, 2018.
- [Li18] Jerry Zheng Li. Principled approaches to robust machine learning and beyond. PhD thesis, Massachusetts Institute of Technology, 2018.
- [LRV16] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.
- [MM15] Cameron Musco and Christopher Musco. Randomized block krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1396–1404, 2015.
- [MRWZ16] Michael W. Mahoney, Satish Rao, Di Wang, and Peng Zhang. Approximating the solution to mixed packing and covering LPs in parallel Õ(ε^{-3}) time. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 52:1–52:14, 2016.
- [MSZ16] Jelena Marasevic, Clifford Stein, and Gil Zussman. A fast distributed stateless algorithm for alpha-fair packing problems. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 54:1–54:15, 2016.
- [NN94] Yurii Nesterov and Arkadi Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994.
- [PTZ16] Richard Peng, Kanat Tangwongsan, and Peng Zhang. Faster and simpler width-independent parallel algorithms for positive semidefinite programming. CoRR, abs/1201.5135, 2016.
- [RH17] Philippe Rigollet and Jan-Christian Hütter. High-Dimensional Statistics. 2017.
- [SCV17] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. arXiv preprint arXiv:1703.04940, 2017.
- [Ste18] Jacob Steinhardt. Robust Learning: Information Theory and Algorithms. PhD thesis, Stanford University, 2018.
- [Tuk75] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.
- [Ver16] Roman Vershynin. High-Dimensional Probability, An Introduction with Applications in Data Science. 2016.
- [WK06] Manfred K. Warmuth and Dima Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 1481–1488, 2006.
Appendix A Concentration
A.1 Sub-Gaussian concentration
We use the following concentration facts about sub-Gaussian distributions, which follow from standard techniques, and give an application bounding Schatten-norm deviations.
Lemma 9.
Under Assumption 1, there are universal constants , such that
Proof.
By observing (3), it is clear that the random vector for has covariance and sub-Gaussian proxy . For any fixed unit vector , by Lemma 1.12 of [RH17], the random variable is sub-exponential with parameter , so by Bernstein’s inequality (Theorem 1.13, [RH17]), defining for each ,
For shorthand define , and let be a maximal -net of the unit ball (as measured in distance). By Lemma 1.18 of [RH17], , so by a union bound,
Next, by a standard application of the triangle inequality (see e.g. Exercise 4.3.3, [Ver16])
with probability at least for appropriate , . The conclusion follows since its statement is scale invariant, so it suffices to show as we have that
∎
Corollary 3.
Let . Under Assumption 1, there are universal constants , with
A.2 Concentration under weightings in
We consider concentration of the empirical covariance under weightings which are not far from uniform, in spectral and Schatten senses.
Lemma 10.
Under Assumption 1, let , , and for a sufficiently large constant. Then for a universal constant ,
Proof.
Because the vertices of are uniform over sets with (see e.g. Section 4.1, [DKK+19]), by convexity of the Schatten- norm it suffices to prove
For any fixed , and recalling , we can decompose this sum as
(18)
By applying Corollary 3, it follows by setting and our choice of that
(19)
Moreover, for any fixed , setting where is a sufficiently large constant, so that for sufficiently small , ,
(20)
Here, we used that . Finally, union bounding over all possible sets implies that with probability at least , the following events hold:
Combining these bounds in the context of (18) after applying the triangle inequality, we have with probability at least for all the desired conclusion,
∎
Corollary 4.
Under Assumption 1, let for a sufficiently large constant. For universal and all , with probability at least ,
Proof.
Consider any unit vector . By similar arguments as in (19), (20), and applying a union bound over all with , with probability at least , it follows from Lemma 9 that
(21)
(22)
Therefore, again using the formula (18) and the triangle inequality yields the desired conclusion for all directions , which is equivalent to the spectral bound of the lemma statement. ∎
Appendix B Deferred proofs from Section 3
B.1 Robust univariate variance estimation
In this section, we prove Lemma 1, which allows us to robustly estimate the quadratic form of a vector in the covariance of a sub-Gaussian distribution from corrupted samples. Algorithm 5 is folklore, and intuitively very simple; it projects all samples onto , throws away the fraction of points with largest magnitude in this direction, and takes the mean of the remaining set.
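A minimal sketch of this truncation-based estimator follows; the fraction of discarded points is an illustrative constant multiple of ε (Algorithm 5 fixes the exact constant), and mean-zero samples are assumed as in Assumption 1.

```python
import numpy as np

def robust_1d_variance(X, u, eps):
    """Estimate the variance of <x, u> from eps-corrupted samples (X is n x d, u a unit vector).

    Following the description above: project onto u, discard the samples with the largest
    magnitude in this direction, and average the squared projections of the rest. The
    factor 2 in the discarded fraction is an illustrative choice, not the paper's constant.
    """
    proj = X @ u
    n = len(proj)
    keep = int(np.ceil((1.0 - 2.0 * eps) * n))        # drop the ~2*eps*n largest magnitudes
    order = np.argsort(np.abs(proj))                   # sort by |<x_i, u>|
    kept = proj[order[:keep]]
    return np.mean(kept ** 2)
```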
We require the following helper fact.
Fact 3.
Let be a sub-exponential random variable with parameter at most (we say mean-zero is sub-exponential with parameter if , ), and let . Then, for any event with , .
Proof.
We have by Hölder’s inequality that for any with ,
The second inequality is Lemma 1.10 [RH17]. Setting yields the result. ∎
See 1
Proof.
The runtime claim is immediate; we now turn our attention to correctness. We follow notation of Assumption 1, and in a slight abuse of notation, also define for . First, for , then is sub-exponential with parameter at most (Lemma 1.12, [RH17]). By Bernstein’s inequality, we have that if , then for all ,
(23)
Using this in a standard Chernoff bound, we have that with probability ,
(24)
Let , and let be distributed as , where . We observe is also sub-exponential with parameter , and that by Fact 3,
(25)
Define the interval and let be the set of points in that survive the truncation procedure, so that . Given event (24), for all , since there are at most points in outside , and . We decompose the deviation as follows:
(26)
Here we overloaded to mean that lies in the interval , and conditioned on lying entirely in . We bound each of these terms individually. First, for all , conditioning on (24) (i.e. all ), is an independent sample from . Thus, by (25) and Bernstein’s inequality,
(27)
with (conditional) probability at least . By a union bound, both events occur with probability at least ; condition on this for the remainder of the proof. Under this assumption, we control the other three terms of (26). Observe that , , and . Further, by definition of , every summand is at most . Thus,
(28)
(29)
(30)
Combining (27), (28), (29), and (30) in derivation (26) and dividing by yields the claim. ∎
Finally, we also give an alternative set of conditions under which we can certify correctness of . Specifically, this assumption will be useful in lifting independence assumptions between and our samples in repeated calls within Algorithm 6.
Assumption 2.
Under Assumption 1, let the following conditions hold for universal constant :
(31)
(32)
Note that (32) is a factor weaker in its guarantee than Corollary 4, and is over weights in a different set . Standard sub-Gaussian concentration (i.e. an unweighted variant of Corollary 4) and modifying the proof of Corollary 4 to take the constraint set and normalizing over vertex sets of size yield the following conclusion.
Lemma 11.
Let for a sufficiently large constant. Assumption 2 holds with probability at least .
We give a variant of Lemma 1 with slightly stronger guarantees for ; specifically, it holds for all simultaneously for a fixed set of samples satisfying Assumption 2.
Corollary 5.
Proof.
We discuss how to modify the derivations from Lemma 1 appropriately in the absence of applications of Bernstein’s inequality. First, note that appropriately combining (31) and (32) in a derivation such as (18) yields the following bound (deterministically under Assumption 2):
(33)
Now, consider the decomposition (26). We claim first that similarly to (28), (29), (30) we can bound each summand in the latter three terms by ; to prove this, it suffices to show that at least one filtered attains this bound, as then by definition of the algorithm, each non-filtered will as well. Note that a fraction between and of points in is filtered (since there are only points from ). The assumption (32) then implies precisely the desired bound on some filtered by placing uniform mass on filtered points from , and applying pigeonhole. So, all non-filtered are bounded by , yielding analogous statements to (28), (29), (30).
B.2 Preliminaries
For convenience, we give the following preliminaries before embarking on our proof of Theorem 1 and giving guarantees on Algorithm 6. First, we state a set of assumptions which augments Assumption 2 with one additional condition, used in bounding the iteration count of our algorithm.
Assumption 3.
Standard sub-Gaussian concentration inequalities and a union bound, combined with our earlier claim Lemma 11, then yield the following guarantee.
Lemma 12.
Let for a sufficiently large constant. Assumption 3 holds with probability at least .
B.3 Analysis of
For this section, for any nonnegative weights , define . We now state our algorithm, . At all iterations , it maintains a current nonnegative weight vector (initialized to be the uniform distribution on ), preserving the following invariants for all :
(35)
We assume that in Line 8, we also have , as we can assume at least one point is corrupted, i.e. (else standard algorithms suffice for our setting), so adding an additional can only change the sum by . We first prove that the invariants (35) are preserved; at a high level, we simply demonstrate that Lemma 2 holds via concentration on and lack of termination.
Lemma 13.
Proof.
The first part of (35) is immediate by observing the update in Line 9, so we show the second. We drop subscripts and superscripts for conciseness and focus on a single iteration . Let , and . By Lemma 2, it suffices to demonstrate that
(36)
First, , so by definition of index , we have . Define if , and otherwise, and observe . By modifying constants appropriately from (32), it follows from definition of that
(37)
On the other hand, by (33) we know that the total quadratic form over is bounded as
(38)
Here, we applied the observation that the normalized restricted to are in (e.g. using Lemma 14 inductively). However, since we did not terminate (Line 5), we must have by being a top eigenvector and Corollary 5 (we defer discussions of inexactness to Theorem 1) that
To obtain the last conclusion, we used (38). Finally, note that for all ,
by rearranging (37). This implies that
Thus, the desired inequality (36) follows from combining the above derivations, e.g. using (37) and
∎
Lemma 13 yields for all that by telescoping. Note that we can only remove at most mass from in total, as . For shorthand, denote the normalized weights . Then, the following is immediate by .
Using Lemma 14, we show that the output has the desired quality of being a large eigenvector.
Proof.
We assume for now that is an exact top eigenvector, and discuss inexactness while proving Theorem 1. By (33) and Lemma 14, as then the normalized restriction of to is in ,
We used the Courant-Fischer characterization of eigenvalues, and that is a top eigenvector of . Moreover, by termination conditions and Corollary 5 (correctness of ),
Combining these two bounds and rescaling yields the conclusion. ∎
Proof.
First, we will operate under Assumption 3, which holds with probability at least . It is clear that the analyses of Lemmas 13 and 15 hold with multiplicative approximations of the top eigenvector computation, which the power method provides with high probability. Thus, each iteration takes time , and we union bound over the number of iterations. We now give an iteration bound: in any iteration where we do not terminate, Lemma 2 implies
Here, the second line used Assumption 3, the third used that the are in sorted order, and the last used the definition of as well as the derivations of Lemma 15 (specifically, that spectrally dominates for roughly uniform ). The conclusion follows since there can be at most iterations, as the algorithm terminates when a fraction of the mass is removed, giving the overall runtime claim. ∎
Appendix C Deferred proofs from Section 4
C.1 Proofs from Section 4.2
Since our notion of approximation is multiplicative, we can assume without more than constant loss that has bounded entries. This observation is standard, and formalized in the following lemma.
Lemma 16 (Entrywise bounds on ).
Feasibility of Problem 2 is unaffected (up to constants in ) by removing columns of with entries larger than .
Proof.
If for any entry, then , else is already larger than . Ignoring all such entries of and rescaling can only change the objective by a factor. ∎
See 3
Proof.
Fix an iteration . Define , and note ; henceforth in this proof, we will drop subscripts when clear. Observe that
As , entrywise. Via for , it follows that
By direct manipulation of the above quantity, and recalling we defined ,
Using , i.e. , we thus obtain
Cauchy-Schwarz yields that , . Substituting into the above,
(39)
Finally, to bound this latter quantity, since , we observe that for all either or , in which case
Thus, plugging this bound into (39) entrywise,
Rearranging yields the desired claim. ∎
C.2 Proofs from Section 4.3
Our analysis of Algorithm 3 will use the following helper fact.
Lemma 17 (Spectral bounds on ).
Feasibility of Problem 3 is unaffected (up to constants in ) by removing matrices with an eigenvalue larger than .
Proof.
The proof is identical to that of Lemma 16; we also require the additional fact that the Schatten norm is monotone in the Loewner order, forcing the constraint . ∎
We remark that we can perform this preprocessing procedure via power iteration on each .
See 4
Proof.
Drop and define . For simplicity, define the matrices
We recall the Lieb-Thirring inequality . Applying this, we have
As , we have . Applying the bounds for , where we use that commutes with all , it follows that
Definitions of , , , and preservation of positiveness under Schur complements imply
Thus, . Applying this and recalling ,
By , taking roots we thus have
Finally, the conclusion follows as in Lemma 3; by linearity of trace and ,
Here, we used the inequality for all nonzero ,
∎
See 5
Proof.
The proof is analogous to that of Theorem 4; we sketch the main differences here. By applying Lemma 17 and monotonicity of Schatten norms in the Loewner order, we again have , implying correctness whenever the algorithm terminates on Line 4. Correctness of dual certification again follows from lack of termination and the choice of , as well as setting to indicate each coordinate. Finally, the returned matrix in Line 8 is correct by convexity of the Schatten- norm, and the fact that all have unit Schatten- norm.
We now discuss issues regarding computing in Line 5 of the algorithm, the bottleneck step; these techniques are standard in the approximate SDP literature, and we defer a more formal discussion to e.g. [JLL+20]. First, note that each coordinate of requires us to compute
(40)
We estimate the two quantities in the above expression each to multiplicative error with high probability. Union bounding over iterations, and modifying Lemma 4 to use the potential , the analysis remains valid up to constants in with this multiplicative approximation quality. We now discuss our approximation strategies.
For shorthand, denote . To estimate the denominator of (40), it suffices to multiplicatively approximate within a factor, as raising to the power can only improve this. To do so, we use the well-known fact (e.g. [DG03]) that letting be a matrix with independent entries , for , with probability ,
to a factor. To read this from the standard Johnson-Lindenstrauss guarantee, it suffices to factorize and use that the norm of each row of the square root is preserved with high probability under multiplication by , and then apply the cyclic property of the trace. Similarly, for each , we can approximate the numerators via
We can simultaneously compute all such quantities by first applying matrix-vector multiplications through to each row of , and then computing all quadratic forms. In total, the computational cost per iteration of all approximations is as desired. ∎
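As a concrete stand-in for this sketching step (the estimator below is a Hutchinson/Johnson-Lindenstrauss style trace estimator in the same spirit, not the exact factorization described above), one can estimate Tr(Y^p) using only k matrix-vector products with Y per power:

```python
import numpy as np

def sketch_trace_power(Y, p, k, rng=None):
    """Randomized estimate of Tr(Y^p) for symmetric Y via a k-row sketch.

    Uses only matrix-vector products with Y, never forming Y^p explicitly; the
    estimator is unbiased since E[g g^T] = I for Rademacher rows g.
    """
    rng = rng or np.random.default_rng(0)
    d = Y.shape[0]
    G = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)   # rows g_j / sqrt(k)
    V = G.T                                                   # d x k block of test vectors
    for _ in range(p):
        V = Y @ V                                             # V becomes Y^p G^T
    return np.trace(G @ V)                                    # (1/k) sum_j g_j^T Y^p g_j
```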
C.3 Proof of Proposition 2
In this section, following our prior developments, we prove the following claim.
See 2
C.3.1 Reduction to a decision problem
Given access to an oracle for the following approximate decision problem, we can implement an efficient binary search for estimating OPT. Specifically, letting the range of OPT be , we can subdivide the range into multiplicative intervals of width , and then perform a binary search using our decision oracle. This incurs a multiplicative overhead in the setting of Proposition 2 (see Appendix A, [JLL+20], for a more formal treatment).
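As a generic illustration of this reduction (the oracle, the bracketing range [lo, hi], and the tolerance are abstract placeholders), a multiplicative binary search over OPT can be organized as follows.

```python
import math

def multiplicative_binary_search(decide, lo, hi, eps):
    """Locate OPT within a (1 + eps) multiplicative factor using a decision oracle.

    `decide(t)` should return True iff the decision problem is feasible at scale t,
    and OPT is assumed to lie in [lo, hi]. The bisection runs over exponents, so
    roughly log(log(hi / lo) / eps) oracle calls suffice.
    """
    num_intervals = max(1, math.ceil(math.log(hi / lo) / math.log1p(eps)))
    left, right = 0, num_intervals
    while right - left > 1:
        mid = (left + right) // 2
        if decide(lo * (1.0 + eps) ** mid):
            right = mid
        else:
            left = mid
    return lo * (1.0 + eps) ** right

# Example: find the smallest t with t >= 7.3, up to a 1% multiplicative factor.
print(multiplicative_binary_search(lambda t: t >= 7.3, lo=1.0, hi=100.0, eps=0.01))
```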
Problem 4.
Given , either find primal solution with , , or conclude no satisfies , .
C.3.2 Preliminaries
We use the shorthand , and , so and are interchangeable up to factors. In other words, Problem 4 asks to certify whether there exists with
(41)
up to multiplicative tolerance on either side. Consider the potential function
(42)
It is clear that the first term of approximates the left hand side of (41) up to an additive factor, so if any of , , or reaches the scale and is bounded by , we can safely terminate and conclude primal feasibility for Problem 4. Next, we compute
(43)
The following helper lemma will be useful in concluding dual infeasibility of Problem 4.
Lemma 18.
Proof.
From the definitions in (43), it is clear that , where , are the dual norms of , respectively. Moreover, by the definition of , we have for all ,
This follows from the dual definition of the norm (see Fact 2). Now, note that for some nonnegative , summing to 1, using the above claim and (43),
as desired (here, we used positivity of all relevant quantities). ∎
C.3.3 Potential monotonicity
We prove a monotonicity property regarding the potential in (42).
Lemma 19.
Let satisfy , , let entrywise, and let , where . Then, .
Proof.
Denote for simplicity the threshold and the step vector . First, by prior calculations in Lemma 3 and Lemma 4, it follows that
Next, note that by entrywise and lack of termination (i.e. the threshold ),
Therefore, by for ,
(44)
Moreover, by applying Cauchy-Schwarz and the threshold once more,
(45)
Combining (44) and (45) (and applying similar reasoning to the term ), we conclude
Recall the inequality for nonnegative . Expanding the definition of and (cf. (42)), and plugging in the above bounds, we conclude that
As before, we show that this sum is entrywise nonpositive. For any with , we have
as desired, where we used that . This yields the conclusion . ∎
C.3.4 Algorithm and analysis
Proof of Proposition 2.
Correctness of the reduction to deciding Problem 4 follows from the discussion in Section C.3.1. Moreover, by the given Algorithm 7, it is clear (following e.g. the preprocessing of Lemma 17) that throughout the algorithm, so whenever the algorithm terminates we have primal feasibility. It suffices to prove that whenever the problem admits with
then the algorithm terminates on Line 5 in iterations. Analogously to Theorem 4, we have
Next, since is an upwards truncation of , applying Lemma 18 implies that
The conclusion follows by the definition of , as desired. Finally, the iteration complexity follows analogously to the discussion in Theorem 5’s proof, where the only expensive cost is estimating coordinates of the component of every iteration. ∎
Finally, we remark that by opening up the dual certificates , of our mirror descent analysis, we can in fact implement a stronger version of the decision Problem 4 which returns a feasible dual certificate whenever the primal problem is infeasible. We omit this extension for brevity, as it is unnecessary for our applications, but it is analogous to the analysis of Theorem 5.
Appendix D Deferred proofs from Section 5
D.1 Proof of Proposition 3
See 3
Proof.
We claim that Algorithm 1 of [MM15], applied to the matrix with a careful choice of exponent, yields this guarantee. Specifically, we choose , both of which satisfy the criteria in their main theorem, such that the iterates produced by simultaneous power iteration with exponent and with exponent are identical; it suffices to choose a multiple of . Thus, we can also apply their guarantees to and apply a union bound. Notice that their Algorithm 1 also contains some postprocessing to ensure that they obtain singular values in the right space, which is unnecessary for us, as our matrices are Hermitian. ∎
D.2 Proof of Lemma 5
See 5
Proof.
Lemma 10 implies that letting be the uniform distribution over the uncorrupted samples amongst , we have with probability at least , and denoting ,
Therefore, the mixed - packing semidefinite program
is feasible. This completes the proof. ∎
D.3 Proof of Lemma 6
See 6
Proof.
We follow the notation of (10). First, by the guarantees of Corollary 1,
Therefore, again applying Corollary 1, for all ,
We conclude that the set of weights belongs to . By applying Corollary 4 to these weights and adjusting the definition of by a constant, we conclude with probability at least
The conclusion follows by multiplying through by , and using the definition . ∎