
Robust Sub-Gaussian Principal Component Analysis
and Width-Independent Schatten Packing

Arun Jambulapati (Stanford University, {jmblpati, kjtian}@stanford.edu)    Jerry Li (Microsoft Research, [email protected])    Kevin Tian (Stanford University)

1 Introduction

We study two natural, but seemingly unrelated, problems in high dimensional robust statistics and continuous optimization respectively. As we will see, these problems have an intimate connection.

Problem 1: Robust sub-Gaussian principal component analysis. We consider the following statistical task, which we call robust sub-Gaussian principal component analysis (PCA). Given samples $X_1,\ldots,X_n$ from a sub-Gaussian distribution $\mathcal{D}$ (see Section 2 for a formal definition) with covariance $\bm{\Sigma}$, an $\epsilon$ fraction of which are arbitrarily corrupted, the task asks to output a unit vector $u$ with $u^\top\bm{\Sigma}u \geq (1-\gamma)\left\lVert\bm{\Sigma}\right\rVert_\infty$ for tolerance $\gamma$ (throughout, $\left\lVert\mathbf{M}\right\rVert_p$ denotes the Schatten $p$-norm; cf. Section 2 for more details). Ergo, the goal is to robustly return a $(1-\gamma)$-approximate top eigenvector of the covariance of sub-Gaussian $\mathcal{D}$. This is the natural extension of PCA to the robust statistics setting.

There has been a flurry of recent work on efficient algorithms for robust statistical tasks, e.g. covariance estimation and PCA. From an information-theoretic perspective, sub-Gaussian concentration suffices for robust covariance estimation. Nonetheless, to date all polynomial-time algorithms achieving nontrivial guarantees on covariance estimation of a sub-Gaussian distribution (including PCA specifically) in the presence of adversarial noise require additional algebraic structure. For instance, sum-of-squares certifiably bounded moments have been leveraged in polynomial time covariance estimation [HL18, KSS18]; however, this is a stronger assumption than sub-Gaussianity.

In many applications (see discussion in [DKK+17]), the end goal of covariance estimation is PCA. Thus, a natural question which relaxes robust covariance estimation is: can we robustly estimate the top eigenvector of the covariance $\bm{\Sigma}$, assuming only sub-Gaussian concentration? Our work answers this question affirmatively via two incomparable algorithms. The first achieves $\gamma = O(\epsilon\log\epsilon^{-1})$ in polynomial time; the second achieves $\gamma = O(\sqrt{\epsilon\log\epsilon^{-1}\log d})$ in nearly-linear time under a mild gap assumption on $\bm{\Sigma}$. Moreover, both methods have nearly-optimal sample complexity.

Problem 2: Width-independent Schatten packing. We consider a natural generalization of packing semidefinite programs (SDPs) which we call Schatten packing. Given symmetric positive semidefinite $\mathbf{A}_1,\ldots,\mathbf{A}_n$ and parameter $p\geq 1$, a Schatten packing SDP asks to solve the optimization problem

\min\left\lVert\sum_{i\in[n]}w_i\mathbf{A}_i\right\rVert_p \text{ subject to } w\in\Delta^n. \quad (1)

Here, $\left\lVert\mathbf{M}\right\rVert_p$ is the Schatten-$p$ norm of matrix $\mathbf{M}$ and $\Delta^n$ is the probability simplex (see Section 2). When $p=\infty$, (1) is the well-studied (standard) packing SDP objective [JY11, ALO16, PTZ16], which asks to find the most spectrally bounded convex combination of packing matrices. For smaller $p$, the objective encourages combinations more (spectrally) uniformly distributed over directions.
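To make the objective concrete, the following minimal numpy sketch (our own illustrative code; the toy instance is arbitrary) evaluates the Schatten packing objective (1) at a given point of the simplex:

import numpy as np

def schatten_norm(M, p):
    # Schatten-p norm: the l_p norm of the spectrum of a symmetric matrix.
    return np.linalg.norm(np.linalg.eigvalsh(M), ord=p)

def packing_objective(As, w, p):
    # Objective of (1): || sum_i w_i A_i ||_p for w in the simplex.
    return schatten_norm(sum(wi * Ai for wi, Ai in zip(w, As)), p)

rng = np.random.default_rng(0)
As = [(lambda B: B @ B.T)(rng.standard_normal((4, 4))) for _ in range(3)]  # n=3 PSD matrices, d=4
print(packing_objective(As, np.ones(3) / 3, p=3))  # uniform weights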

The specialization of (1) to diagonal matrices is a smooth generalization of packing linear programs, previously studied in the context of fair resource allocation [MSZ16, DFO18]. For the $\ell_\infty$ case of (1), packing SDPs have the desirable property of admitting "width-independent" approximation algorithms by exploiting positivity structure. Specifically, width-independent solvers obtain multiplicative approximations with runtimes independent of, or logarithmically dependent on, size parameters of the problem. This is a strengthening of the additive notions of approximation typically used for approximate semidefinite programming. Our work gives the first width-independent solver for Schatten packing.

1.1 Previous work

Learning with adversarial outliers. The study of estimators robust to a small fraction of adversarial outliers dates back to foundational work, e.g. [Hub64, Tuk75]. Following more recent work [LRV16, DKK+19], there has been significant interest in efficient, robust algorithms for statistical tasks in high-dimensional settings. We focus on methods robustly estimating covariance properties here, and defer a thorough discussion of the (extensive) robust statistics literature to [Ste18, Li18, DK19].

There has been quite a bit of work in understanding and giving guarantees for robust covariance estimation where the uncorrupted distribution is exactly Gaussian [DKK+17, DKK+18, DKK+19, CDGW19]. These algorithms rely strongly on relationships between higher-order moments of Gaussian distributions via Isserlis' theorem. Departing from the Gaussian setting, work of [LRV16] showed that if the distribution is an affine transformation of a 4-wise independent distribution, robust covariance estimation is possible. This was extended by [KSS18], which also assumed nontrivial structure in the moments of the distribution, namely that sub-Gaussianity was certifiable via the sum-of-squares proof system. To the best of our knowledge, it has remained open to give nontrivial guarantees for robust estimation of any covariance properties under minimal assumptions, i.e. sub-Gaussian concentration.

All aforementioned algorithms also yield guarantees for robust PCA, by applying a top eigenvector method to the learned covariance. However, performing robust PCA via the intermediate covariance estimation step is lossy, both statistically and computationally. From a statistical perspective, $\Omega(d^2)$ samples are necessary to learn the covariance of a $d$-dimensional Gaussian in Frobenius norm (and for known efficient algorithms attaining spectral norm error [DKS17]); in contrast, $O(d)$ samples suffice for (non-robust) PCA. Computationally, even when the underlying distribution is exactly Gaussian, the best-known covariance estimation algorithms run in time $\Omega(d^{3.25})$; algorithms working in more general settings based on the sum-of-squares approach require much more time. In contrast, the power method for PCA in a $d\times d$ matrix takes time $\widetilde{O}(d^2)$ (we say $g = \widetilde{O}(f)$ if $g = O(f\log^c f)$ for some constant $c > 0$). Motivated by this, our work initiates the direct study of robust PCA, which is often independently interesting in applications.

We remark that there is another problem termed "robust PCA" in the literature, e.g. [CLMW11], under a different generative model. We defer a detailed discussion to [DKK+17], which experimentally shows that algorithms from that line of work do not transfer well to our corruption model.

Width-independent iterative methods. Semidefinite programming (SDP) and its linear programming specialization are fundamental computational tasks, with myriad applications in learning, operations research, and computer science. Though general-purpose polynomial time algorithms exist for SDPs ([NN94]), in practical settings in high dimensions, approximations depending linearly on input size and polynomially on error $\epsilon$ are sometimes desirable. To this end, approximation algorithms based on entropic mirror descent have been intensely studied [WK06, AK16, GHM15, AL17, CDST19], obtaining $\epsilon$-additive approximations to the objective with runtimes depending polynomially on $\rho/\epsilon$, where the "width" $\rho$ is the largest spectral norm of a constraint.

For structured SDPs, stronger guarantees can be obtained in terms of width. Specifically, several algorithms developed for packing SDPs ((1) with $p=\infty$) yield $(1+\epsilon)$-multiplicative approximations to the objective, with logarithmic dependence on width [JY11, PTZ16, ALO16, JLL+20]. As $\rho$ upper bounds the objective value in this setting, in the worst case the runtimes of width-dependent solvers yielding $\epsilon\rho$-additive approximations have similar dependences as their width-independent counterparts. Width-independent solvers simultaneously yield stronger multiplicative bounds at all scales of objective value, making them desirable in suitable applications. In particular, $\ell_\infty$ packing SDPs have found great utility in robust statistics algorithm design [CG18, CDG19, CDGW19, DL19].

Beyond $\ell_\infty$ packing, width-independent guarantees in the SDP literature are few and far between; to our knowledge, other than the covering and mixed solvers of [JLL+20], ours is the first such guarantee for a broader family of objectives. (In concurrent and independent work, [CMY20] develops width-independent solvers for Ky-Fan packing objectives, a different generalization than the Schatten packing objectives we consider.) Our method complements analogous $\ell_p$ extensions in the width-dependent setting, e.g. [ALO15], as well as width-independent solvers for $\ell_p$ packing linear programs [MSZ16, DFO18]. We highlight the fair packing solvers of [MSZ16, DFO18], motivated by problems in equitable resource allocation, which further solved $\ell_p$ packing variants for $p\not\in[1,\infty)$. We find analogous problems in semidefinite settings interesting, and defer them to future work.

1.2 Our results

Robust sub-Gaussian principal component analysis. We give two algorithms for robust sub-Gaussian PCA (following the distribution and corruption model described in Assumption 1). Both are near-sample-optimal, polynomial-time, and assume only sub-Gaussianity. The first is via a simple filtering approach, summarized here and developed in Section 3.

Theorem 1.

Under Assumption 1, let $\delta\in[0,1]$ and $n = \Omega\left(\frac{d+\log\delta^{-1}}{(\epsilon\log\epsilon^{-1})^2}\right)$. Algorithm 6 runs in time $O(\frac{nd^2}{\epsilon}\log\frac{n}{\delta\epsilon}\log\frac{n}{\delta})$, and outputs $u$ with $u^\top\bm{\Sigma}u > (1 - C^\star\epsilon\log\epsilon^{-1})\left\lVert\bm{\Sigma}\right\rVert_\infty$, for $C^\star$ a fixed multiple of the parameter $c$ in Assumption 1, with probability at least $1-\delta$.

Our second algorithm is more efficient under mild conditions, but yields a worse approximation $1-\gamma$ for $\gamma = O(\sqrt{\epsilon\log\epsilon^{-1}\log d})$. Specifically, if there are few eigenvalues of $\bm{\Sigma}$ larger than $(1-\gamma)\left\lVert\bm{\Sigma}\right\rVert_\infty$, our algorithm runs in nearly-linear time. Note that if there are many eigenvalues above this threshold, then the PCA problem itself is not very well-posed; our algorithm is very efficient in the interesting setting where the approximate top eigenvector is identifiable. We state our main algorithmic guarantee here, and defer details to Section 5.

Theorem 2.

Under Assumption 1, let $\delta\in[0,1]$, $n = \Omega\left(\frac{d+\log\delta^{-1}}{(\epsilon\log\epsilon^{-1})^2}\right)$, and $\gamma = C\sqrt{\epsilon\log\epsilon^{-1}\log d}$, for $C$ a fixed multiple of the parameter $c$ from Assumption 1, and let $t\in[d]$ satisfy $\bm{\Sigma}_{t+1} < (1-\gamma)\left\lVert\bm{\Sigma}\right\rVert_\infty$, where $\bm{\Sigma}_{t+1}$ denotes the $(t+1)^{\text{st}}$ largest eigenvalue of $\bm{\Sigma}$. Algorithm 4 outputs a unit vector $u\in\mathbb{R}^d$ with $u^\top\bm{\Sigma}u \geq (1-\gamma)\left\lVert\bm{\Sigma}\right\rVert_\infty$ in time $\widetilde{O}(\frac{nd}{\epsilon^{4.5}} + \frac{ndt}{\epsilon^{1.5}})$.

We remark that $\Omega(d\epsilon^{-2})$ samples are necessary for a $(1-\epsilon)$-approximation to the top eigenvector of $\bm{\Sigma}$ even via uncorrupted samples from $\mathcal{N}(0,\bm{\Sigma})$, so our first method is sample-optimal, as is our second up to a $\widetilde{O}(\epsilon^{-1})$ factor.

Width-independent Schatten packing. Our second method crucially requires an efficient solver for Schatten packing SDPs. We demonstrate that Schatten packing, i.e. (1) for arbitrary $p$, admits width-independent solvers. We state an informal guarantee here, and defer details to Section 4.

Theorem 3.

Let $\{\mathbf{A}_i\}_{i\in[n]}\in\mathbb{S}_{\geq 0}^d$ and $\epsilon > 0$. There is an algorithm taking $O(\frac{p\log(nd/\epsilon)}{\epsilon})$ iterations, returning a $1+\epsilon$ multiplicative approximation to the problem (1). For odd $p$, each iteration can be implemented in time nearly-linear in the number of nonzeros amongst all $\{\mathbf{A}_i\}_{i\in[n]}$.

2 Preliminaries

General notation. $[n]$ denotes the set of integers $1\leq i\leq n$. Applied to a vector, $\left\lVert\cdot\right\rVert_p$ is the $\ell_p$ norm; applied to a symmetric matrix, $\left\lVert\cdot\right\rVert_p$ is the Schatten-$p$ norm, i.e. the $\ell_p$ norm of the spectrum. The dual norm of $\ell_p$ is $\ell_q$ for $q = \frac{p}{p-1}$; when $p=\infty$, $q=1$. $\Delta^n$ is the $n$-dimensional simplex (the subset of the positive orthant with $\ell_1$-norm $1$), and we define $\mathfrak{S}^n_\varepsilon\subseteq\Delta^n$ to be the truncated simplex:

\mathfrak{S}^n_\varepsilon := \left\{ w\in\mathbb{R}^n_{\geq 0} \;\middle|\; \left\lVert w\right\rVert_1 = 1,\; w \leq \frac{1}{n(1-\varepsilon)} \text{ entrywise} \right\}. \quad (2)

Matrices. $\mathbb{S}^d$ is the set of $d\times d$ symmetric matrices, and $\mathbb{S}_{\geq 0}^d$ is the positive semidefinite subset. $\mathbf{I}$ is the identity of appropriate dimension. $\lambda_{\max}$, $\lambda_{\min}$, and $\textup{Tr}$ are the largest eigenvalue, smallest eigenvalue, and trace of a symmetric matrix. For $\mathbf{M},\mathbf{N}\in\mathbb{S}^d$, $\left\langle\mathbf{M},\mathbf{N}\right\rangle := \textup{Tr}(\mathbf{M}\mathbf{N})$, and we use the Loewner order $\preceq$ ($\mathbf{M}\preceq\mathbf{N}$ iff $\mathbf{N}-\mathbf{M}\in\mathbb{S}_{\geq 0}^d$). The seminorm induced by $\mathbf{M}\succeq 0$ is $\left\lVert v\right\rVert_{\mathbf{M}} := \sqrt{v^\top\mathbf{M}v}$.

Fact 1.

For $\mathbf{A}$, $\mathbf{B}$ with compatible dimensions, $\textup{Tr}(\mathbf{A}\mathbf{B}) = \textup{Tr}(\mathbf{B}\mathbf{A})$. For $\mathbf{M},\mathbf{N}\in\mathbb{S}_{\geq 0}^d$, $\left\langle\mathbf{M},\mathbf{N}\right\rangle \geq 0$.

Fact 2.

We have the following characterization of the Schatten-$p$ norm: for $\mathbf{M}\in\mathbb{S}^d$ and $q = \frac{p}{p-1}$,

\left\lVert\mathbf{M}\right\rVert_p = \sup_{\mathbf{N}\in\mathbb{S}^d,\;\left\lVert\mathbf{N}\right\rVert_q = 1} \left\langle\mathbf{N},\mathbf{M}\right\rangle.

For $\mathbf{M} = \sum_{i\in[d]}\lambda_i v_i v_i^\top$, the supremum is attained by $\mathbf{N} = \frac{\sum_{i\in[d]}\pm\lambda_i^{p-1}v_i v_i^\top}{\left\lVert\mathbf{M}\right\rVert_p^{p-1}}$, so $\mathbf{N}\mathbf{M}$ has spectrum $\frac{|\lambda_i|^p}{\left\lVert\mathbf{M}\right\rVert_p^{p-1}}$.
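As a numerical sanity check of Fact 2 (our own toy instance, not part of the paper's development), the stated $\mathbf{N}$ indeed has unit Schatten-$q$ norm and attains $\left\langle\mathbf{N},\mathbf{M}\right\rangle = \left\lVert\mathbf{M}\right\rVert_p$:

import numpy as np

rng = np.random.default_rng(1)
d, p = 5, 3
q = p / (p - 1)
B = rng.standard_normal((d, d))
M = B + B.T                                  # symmetric test matrix
lam, V = np.linalg.eigh(M)
s_p = np.linalg.norm(lam, ord=p)             # ||M||_p
N = (V * (np.sign(lam) * np.abs(lam) ** (p - 1))) @ V.T / s_p ** (p - 1)
print(np.linalg.norm(np.linalg.eigvalsh(N), ord=q))  # ~1: N has unit Schatten-q norm
print(np.trace(N @ M), s_p)                          # both ~ ||M||_p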

Distributions. We denote drawing vector $X$ from distribution $\mathcal{D}$ by $X\sim\mathcal{D}$, and the covariance $\bm{\Sigma}$ of $\mathcal{D}$ is $\mathbb{E}_{X\sim\mathcal{D}}[XX^\top]$. We say scalar distribution $\mathcal{D}$ is $\gamma^2$-sub-Gaussian if $\mathbb{E}_{X\sim\mathcal{D}}[X] = 0$ and

\mathbb{E}_{X\sim\mathcal{D}}\left[\exp(tX)\right] \leq \exp\left(\frac{t^2\gamma^2}{2}\right) \;\forall t\in\mathbb{R}.

Multivariate $\mathcal{D}$ has sub-Gaussian proxy $\bm{\Gamma}$ if its restriction to any unit direction $v$ is $\left\lVert v\right\rVert_{\bm{\Gamma}}^2$-sub-Gaussian, i.e.

\mathbb{E}_{X\sim\mathcal{D}}\left[\exp(tX^\top v)\right] \leq \exp\left(\frac{t^2\left\lVert v\right\rVert_{\bm{\Gamma}}^2}{2}\right) \text{ for all } \left\lVert v\right\rVert_2 = 1,\; t\in\mathbb{R}. \quad (3)

We consider the following standard model for gross corruption with respect to a distribution $\mathcal{D}$.

Assumption 1 (Corruption model, see [DKK+19]).

Let $\mathcal{D}$ be a mean-zero distribution on $\mathbb{R}^d$ with covariance $\bm{\Sigma}$ and sub-Gaussian proxy $\bm{\Gamma}\preceq c\bm{\Sigma}$ for a constant $c$. Denote by $G'$, with $|G'| = n$, the index set of a set of (uncorrupted) samples $\{X_i\}_{i\in G'}\sim\mathcal{D}$. An adversary arbitrarily replaces $\epsilon n$ points in $G'$; we denote the new index set by $[n] = B\cup G$, where $B$ is the (unknown) set of points added by the adversary, and $G\subseteq G'$ is the set of points from $G'$ that were not changed.

As we only estimate covariance properties, the assumption that $\mathcal{D}$ is mean-zero loses only constants in the problem parameters, by pairing samples and subtracting them (cf. [DKK+19], Section 4.5.1).

3 Robust sub-Gaussian PCA via filtering

In this section, we sketch the proof of Theorem 1, which gives guarantees on our filtering algorithm for robust sub-Gaussian PCA. This algorithm obtains stronger statistical guarantees than Theorem 2, at the cost of super-linear runtime; the algorithm is given as Algorithm 6. Our analysis stems largely from concentration facts about sub-Gaussian distributions, as well as the following (folklore) fact regarding estimation of variance along any particular direction.

Lemma 1.

Under Assumption 1, let $\delta\in[0,1]$, $n = \Omega\left(\frac{\log\delta^{-1}}{(\epsilon\log\epsilon^{-1})^2}\right)$, and let $u\in\mathbb{R}^d$ be a fixed unit vector. Algorithm 5, $\mathsf{1DRobustVariance}$, takes input $\{X_i\}_{i\in[n]}$, $u$, and $\epsilon$, and outputs $\sigma_u^2$ with $|u^\top\bm{\Sigma}u - \sigma_u^2| < C u^\top\bm{\Sigma}u\cdot\epsilon\log\epsilon^{-1}$ with probability at least $1-\delta$, and runs in time $O(nd + n\log n)$, for $C$ a fixed multiple of the parameter $c$ in Assumption 1.

In other words, we show that using corrupted samples, we can efficiently estimate a $1+O(\epsilon\log\epsilon^{-1})$-multiplicative approximation of the variance of $\mathcal{D}$ in any unit direction (Corollary 5 gives a slightly stronger guarantee: reusing samples does not break dependencies on $u$). The proof is deferred to Appendix B for completeness. Algorithm 6 combines this key insight with a soft filtering approach, suggested by the following known structural fact from previous work (e.g. Lemma A.1 of [DHL19]; see also [SCV17, Ste18]).
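For intuition, a hedged sketch of the trimming idea behind one-dimensional robust variance estimation follows (this is our own illustrative simplification with an assumed trimming fraction, not the exact Algorithm 5 of Appendix B):

import numpy as np

def robust_variance_1d(X, u, eps):
    # Project the (corrupted) samples onto direction u and discard the largest
    # O(eps) fraction of squared projections before averaging.
    a = np.sort((X @ u) ** 2)           # O(nd + n log n), matching Lemma 1
    keep = int(len(a) * (1 - 2 * eps))  # the trimming fraction here is an assumption
    return a[:keep].mean()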

Lemma 2.

Let $\{a_i\}_{i\in[m]}$, $\{w_i\}_{i\in[m]}$ be sets of nonnegative reals, and $a_{\max} = \max_{i\in[m]} a_i$. Define $w'_i = \left(1 - \frac{a_i}{a_{\max}}\right)w_i$ for all $i\in[m]$. Consider any disjoint partition $I_B$, $I_G$ of $[m]$ with $\sum_{i\in I_B} w_i a_i > \sum_{i\in I_G} w_i a_i$. Then, $\sum_{i\in I_B}(w_i - w'_i) > \frac{1}{2a_{\max}}\sum_{i\in[m]} w_i a_i > \sum_{i\in I_G}(w_i - w'_i)$.
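In code, the downweighting step of Lemma 2 is a single line; a minimal sketch (function name ours):

import numpy as np

def downweight(w, a):
    # Lemma 2 update: w'_i = (1 - a_i / a_max) w_i. When bad indices carry more
    # than half the score mass sum_i w_i a_i, this provably removes more weight
    # from bad points than from good points.
    return w * (1.0 - a / a.max())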

Our Algorithm 6, $\mathsf{PCAFilter}$, takes as input a set of corrupted samples $\{X_i\}_{i\in[n]}$ following Assumption 1 and the corruption parameter $\epsilon$. At a high level, it initializes a uniform weight vector $w^{(0)}$, and iteratively operates as follows (we denote by $\mathbf{M}(w)$ the empirical covariance $\sum_{i\in[n]} w_i X_i X_i^\top$; a code sketch of the loop follows the list).

1. $u_t \leftarrow$ approximate top eigenvector of $\mathbf{M}(w^{(t-1)})$ via power iteration.

2. Compute $\sigma_t^2 \leftarrow \mathsf{1DRobustVariance}(\{X_i\}_{i\in[n]}, u_t, \epsilon)$.

3. If $\sigma_t^2 > (1 - O(\epsilon\log\epsilon^{-1}))\cdot u_t^\top\mathbf{M}(w^{(t-1)})u_t$, then terminate and return $u_t$.

4. Else:

   (a) Sort indices $i\in[n]$ by $a_i \leftarrow \left\langle u_t, X_i\right\rangle^2$, with $a_1$ smallest.

   (b) Let $\ell\leq i\leq n$ be the smallest suffix for which $\sum_{i=\ell}^{n} w_i \geq 2\epsilon$, and apply the downweighting procedure of Lemma 2 to this subset of indices.
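A hedged numpy sketch of this loop, reusing robust_variance_1d and downweight from the snippets above (the termination constant and iteration cap are placeholders, not the tuned parameters of Algorithm 6):

import numpy as np

def pca_filter(X, eps, iters=100):
    n, d = X.shape
    w = np.ones(n) / n
    for _ in range(iters):
        M = (X * w[:, None]).T @ X                     # M(w) = sum_i w_i X_i X_i^T
        u = np.linalg.eigh(M)[1][:, -1]                # top eigenvector (power iteration in Algorithm 6)
        sigma2 = robust_variance_1d(X, u, eps)
        if sigma2 > (1 - 10 * eps * np.log(1 / eps)) * (u @ M @ u):  # placeholder constant
            return u
        a = (X @ u) ** 2
        order = np.argsort(a)                          # sort so a_1 is smallest
        cum = np.cumsum(w[order])
        tail = order[cum >= w.sum() - 2 * eps]         # smallest suffix holding >= 2*eps weight
        w[tail] = downweight(w[tail], a[tail])
    return u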

The analysis of Algorithm 6 then proceeds in two stages.

Monotonicity of downweighting.

We show the invariant criterion of Lemma 2 (namely, that for the set $\ell\leq i\leq n$ in every iteration, there is more spectral mass on bad points than good) holds inductively for our algorithm. Specifically, lack of termination implies $\mathbf{M}(w^{(t-1)})$ puts significant mass on bad directions, which combined with concentration of good directions yields the invariant. The details of this argument can be found as Lemma 13.

Roughly uniform weightings imply approximation quality.

As Lemma 2 then applies, the procedure always removes more mass from bad points than good, and thus can remove at most $2\epsilon$ mass in total under the corruption model. Thus, the weights $w^{(t)}$ are always roughly uniform (in $\mathfrak{S}_{O(\epsilon)}^n$), which by standard concentration facts (see Appendix A) implies the quality of the approximate top eigenvector is good. Moreover, the iteration count is bounded by roughly $d$, because whenever the algorithm does not terminate, enough mass is removed from large spectral directions. Combining with the termination criterion implies that when a vector is returned, it is a close approximation to the top direction of $\bm{\Sigma}$. Details can be found as Lemma 15 and in the proof of Theorem 1.

4 Schatten packing

4.1 Mirror descent interpretation of [MRWZ16]

We begin by reinterpreting the [MRWZ16] solver, which achieves the state-of-the-art parallel runtime for packing LPs (the [MRWZ16] solver also generalizes to covering and mixed objectives; we focus on packing in this work). An ($\ell_\infty$) packing LP algorithm solves the following decision problem. (Packing linear programs are sometimes expressed as the optimization problem $\max_{x\geq 0,\,\mathbf{A}x\leq\mathbf{1}}\left\lVert x\right\rVert_1$, similarly to (1); these problems are equivalent up to a standard binary search, see e.g. the discussion in [JLL+20].)

Problem 1 ($\ell_\infty$ packing linear program).

Given entrywise nonnegative $\mathbf{A}\in\mathbb{R}_{\geq 0}^{d\times n}$, either find a primal solution $x\in\Delta^n$ with $\left\lVert\mathbf{A}x\right\rVert_\infty \leq 1+\epsilon$ or a dual solution $y\in\Delta^d$ with $\mathbf{A}^\top y \geq (1-\epsilon)\mathbf{1}$.

Algorithm 1 $\mathsf{PackingLP}(\mathbf{A},\epsilon)$
1:  Input: $\mathbf{A}\in\mathbb{R}^{d\times n}_{\geq 0}$, $\epsilon\in[0,\frac{1}{2}]$
2:  $K\leftarrow\frac{3\log d}{\epsilon}$, $\eta\leftarrow K^{-1}$, $T\leftarrow\frac{4\log(d)\log(nd/\epsilon)}{\epsilon^2}$
3:  $[w_0]_i\leftarrow\frac{\epsilon}{n^2 d}$ for all $i\in[n]$, $z\leftarrow\mathbf{0}$, $t\leftarrow 0$
4:  while $\mathbf{A}w_t\leq K\mathbf{1}$ and $\left\lVert w_t\right\rVert_1\leq K$ do
5:     $v_t\leftarrow\frac{\exp(\mathbf{A}w_t)}{\left\lVert\exp(\mathbf{A}w_t)\right\rVert_1}$
6:     $g_t\leftarrow\max(0,\mathbf{1}-\mathbf{A}^\top v_t)$ entrywise
7:     $w_{t+1}\leftarrow w_t\circ(1+\eta g_t)$, $z\leftarrow z+v_t$, $t\leftarrow t+1$
8:     if $t\geq T$ then
9:        return $y\leftarrow\frac{1}{T}z$
10:     end if
11:  end while
12:  return $x\leftarrow\frac{w_t}{\left\lVert w_t\right\rVert_1}$
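For concreteness, a direct numpy transcription of Algorithm 1 (a sketch only: dense numpy stands in for the sparse operations behind the stated nnz(A) runtime):

import numpy as np

def packing_lp(A, eps):
    d, n = A.shape
    K = 3 * np.log(d) / eps
    eta = 1.0 / K
    T = int(np.ceil(4 * np.log(d) * np.log(n * d / eps) / eps ** 2))
    w = np.full(n, eps / (n ** 2 * d))
    z = np.zeros(d)
    t = 0
    while np.all(A @ w <= K) and w.sum() <= K:
        v = np.exp(A @ w)
        v /= v.sum()                         # v_t: softmax of A w_t (Line 5)
        g = np.maximum(0.0, 1.0 - A.T @ v)   # truncated gradient (Line 6)
        w *= 1.0 + eta * g                   # entropic mirror descent step (Line 7)
        z += v
        t += 1
        if t >= T:
            return "dual", z / T             # y: coordinatewise A^T y >= (1 - eps) 1
    return "primal", w / w.sum()             # x: ||A x||_inf <= 1 + eps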

The following result is shown in [MRWZ16].

Proposition 1.

$\mathsf{PackingLP}$ (Algorithm 1) solves Problem 1 in $O(\textup{nnz}(\mathbf{A})\cdot\frac{\log(d)\log(nd/\epsilon)}{\epsilon^2})$ time.

Our interpretation of the analysis of [MRWZ16] combines two ingredients: a potential argument, and a mirror descent argument which yields a dual feasible point if $\left\lVert w_t\right\rVert_1$ did not grow sufficiently.

Potential argument. The potential used by [MRWZ16] is $\log(\sum_{j\in[d]}\exp([\mathbf{A}w_t]_j)) - \left\lVert w_t\right\rVert_1$, well-known to be an $O(\log d)$-additive approximation of $\left\lVert\mathbf{A}w_t\right\rVert_\infty - \left\lVert w_t\right\rVert_1$. As soon as $\left\lVert\mathbf{A}w_t\right\rVert_\infty$ or $\left\lVert w_t\right\rVert_1$ reaches the scale $O(\frac{\log d}{\epsilon})$, by nonnegativity this becomes a multiplicative guarantee, motivating the setting of the threshold $K$. To prove the potential is monotone, [MRWZ16] uses step size $K^{-1}$ and a Taylor approximation; combining with the termination condition yields the desired claim.

Mirror descent. To certify that $w_t$ grows sufficiently (i.e. the method terminates in few iterations, else dual feasibility holds), we interpret the step $w_{t+1}\leftarrow w_t\circ(1+\eta g_t)$ as approximate entropic mirror descent. Specifically, we track the quantity $\sum_{0\leq t<T}\left\langle\eta g_t, u\right\rangle$, and show that if $\left\lVert w_t\right\rVert_1$ has not grown sufficiently, then it must be bounded for every $u\in\Delta^n$, certifying dual feasibility. Formally, for any $g_t$ sequence and $u\in\Delta^n$, we show

O(\log(nd/\epsilon)) + \log\left(\frac{\left\lVert w_T\right\rVert_1}{\left\lVert w_0\right\rVert_1}\right) \geq \sum_{0\leq t<T}\left\langle\eta g_t, u\right\rangle \geq \eta\sum_{0\leq t<T}\left\langle\mathbf{1}-\mathbf{A}^\top v_t, u\right\rangle.

The last inequality follows since $g_t$ is an upwards truncation of $\mathbf{1}-\mathbf{A}^\top v_t$. If $\left\lVert w_T\right\rVert_1$ is bounded (else, we have primal feasibility), we show the entire above expression is bounded by $O(\log\frac{nd}{\epsilon})$ for any $u$. Thus, by setting $T = O(\frac{\log(nd/\epsilon)}{\eta\epsilon})$ and choosing $u$ to be each coordinate indicator, it follows that the average of all $v_t$ is coordinatewise at least $1-\epsilon$, and solves Problem 1 as a dual solution.

Our $g_t$ is chosen as the (truncated) gradient of the function used in the potential analysis, so its form allows us to interpret dual feasibility (e.g. $v_t$ has $\ell_1$ norm $1$ and is a valid dual point). Our analysis patterns standard mirror descent, complemented by side information which says that lack of a primal solution can transform a regret guarantee into a feasibility bound. We apply this framework to analyze $\ell_p$ variants of Problem 1, via different potentials; nonetheless, our proofs are quite straightforward upon adopting this perspective, and we believe it may yield new insights for instances with positivity structure.

4.2 $\ell_p$-norm packing linear programs

In this section, we give a complete, self-contained example of the framework proposed in Section 4.1, for approximately solving $\ell_p$-norm packing linear programs. Specifically, we now consider the generalization of Problem 1 to $\ell_p$ norms; throughout, $q = \frac{p}{p-1}$ is the dual norm.

Problem 2 ($\ell_p$ packing linear program).

Given entrywise nonnegative $\mathbf{A}\in\mathbb{R}_{\geq 0}^{d\times n}$, either find a primal solution $x\in\Delta^n$ with $\left\lVert\mathbf{A}x\right\rVert_p \leq 1+\epsilon$ or a dual solution $y\in\mathbb{R}^d_{\geq 0}$, $\left\lVert y\right\rVert_q = 1$, with $\mathbf{A}^\top y \geq (1-\epsilon)\mathbf{1}$.

For $p = \frac{\log d}{\epsilon}$, Problem 2 recovers Problem 1 up to constants, as $\ell_p$ multiplicatively approximates $\ell_\infty$ within $1+\epsilon$. We now state our method for solving Problem 2 as Algorithm 2.

Algorithm 2 $\mathsf{PNormPacking}(\mathbf{A},\epsilon,p)$
1:  Input: $\mathbf{A}\in\mathbb{R}_{\geq 0}^{d\times n}$, $\epsilon\in[0,\frac{1}{2}]$, $p\geq 2$
2:  $\eta\leftarrow p^{-1}$, $T\leftarrow\frac{4p\log(nd/\epsilon)}{\epsilon}$
3:  $[w_0]_i\leftarrow\frac{\epsilon}{n^2 d}$ for all $i\in[n]$, $z\leftarrow\mathbf{0}$, $t\leftarrow 0$
4:  while $\left\lVert w_t\right\rVert_1\leq\epsilon^{-1}$ do
5:     $g_t\leftarrow\max(0,\mathbf{1}-\mathbf{A}^\top(v_t)^{p-1})$ entrywise, for $v_t\leftarrow\frac{\mathbf{A}w_t}{\left\lVert\mathbf{A}w_t\right\rVert_p}$
6:     $w_{t+1}\leftarrow w_t\circ(1+\eta g_t)$, $z\leftarrow z+(v_t)^{p-1}$, $t\leftarrow t+1$
7:     if $t\geq T$ then
8:        return $y = \frac{z}{\left\lVert z\right\rVert_q}$
9:     end if
10:  end while
11:  return $x = \frac{w_t}{\left\lVert w_t\right\rVert_1}$

Other than changing parameters, the only difference from Algorithm 1 is that $v$ is a point with unit $\ell_q$ norm induced by the gradient of our potential $\Phi_t$. We state our main potential fact, whose proof is based on straightforwardly Taylor expanding $\left\lVert\cdot\right\rVert_p$, and is deferred to Appendix C for brevity.
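In code, only the iterate computation changes relative to the packing_lp sketch above; a sketch of the modified steps on Lines 5-6 of Algorithm 2:

import numpy as np

def pnorm_packing_step(A, w, p, eta):
    # One iteration of Algorithm 2: v_t is A w_t normalized in l_p, and the
    # truncated gradient uses (v_t)^{p-1}, a point of unit l_q norm.
    Aw = A @ w
    v = Aw / np.linalg.norm(Aw, ord=p)
    g = np.maximum(0.0, 1.0 - A.T @ v ** (p - 1))
    return w * (1.0 + eta * g), v ** (p - 1)  # new weights, dual accumulator increment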

Lemma 3.

In all iterations $t$ of Algorithm 2, defining $\Phi_t := \left\lVert\mathbf{A}w_t\right\rVert_p - \left\lVert w_t\right\rVert_1$, we have $\Phi_{t+1}\leq\Phi_t$.

We now prove our main result, which leverages the potential bound following the framework of Section 4.1. In the proof, we assume that the entries of $\mathbf{A}$ are bounded by $n\epsilon^{-1}$; this incurs no more loss than a constant multiple of $\epsilon$ in the guarantees, and a proof can be found as Lemma 16.

Theorem 4.

Algorithm 2 runs in time $O(\textup{nnz}(\mathbf{A})\cdot\frac{p\log(nd/\epsilon)}{\epsilon})$. Further, its output solves Problem 2.

Proof.

The runtime follows from the iteration bound of Line 7 (each iteration's cost is dominated by multiplication through $\mathbf{A}$), so we prove correctness. Define the potential $\Phi_t$ as in Lemma 3, and note that as $w_0 = \frac{\epsilon}{n^2 d}\mathbf{1}$,

\Phi_0 \leq \left\lVert\mathbf{A}w_0\right\rVert_p \leq \frac{1}{n}\left\lVert\mathbf{1}\right\rVert_p \leq 1.

The second inequality followed from our assumption on the entry sizes of $\mathbf{A}$ (Lemma 16). If Algorithm 2 breaks out of the while loop of Line 4, we have by Lemma 3 that for $x$ returned on Line 11,

\left\lVert\mathbf{A}w_t\right\rVert_p - \left\lVert w_t\right\rVert_1 \leq 1 \implies \left\lVert\mathbf{A}x\right\rVert_p \leq \frac{1+\left\lVert w_t\right\rVert_1}{\left\lVert w_t\right\rVert_1} \leq 1+\epsilon.

Thus, primal feasibility is always correct. We now prove correctness of dual feasibility. First, let $V_x(u) = \sum_{i\in[n]} u_i\log(\frac{u_i}{x_i})$ be the Kullback-Leibler divergence from $x$ to $u$, for $x, u\in\Delta^n$. Define the normalized points $x_t = \frac{w_t}{\left\lVert w_t\right\rVert_1}$ in each iteration. Expanding definitions,

V_{x_{t+1}}(u) - V_{x_t}(u) = \sum_{i\in[n]} u_i\log\frac{[x_t]_i}{[x_{t+1}]_i} = \sum_{i\in[n]} u_i\left(\log\left(\frac{\left\lVert w_{t+1}\right\rVert_1}{\left\lVert w_t\right\rVert_1}\right) + \log\left(\frac{1}{1+\eta[g_t]_i}\right)\right) \quad (4)
\leq -\eta(1-\eta)\left\langle g_t, u\right\rangle + \log\left(\frac{\left\lVert w_{t+1}\right\rVert_1}{\left\lVert w_t\right\rVert_1}\right).

The only inequality used the following bounds, valid for $g,\eta\in[0,1]$ (the first since $\log(1+\cdot)$ is concave and vanishes at $0$, so $\log(1+\eta g)\geq g\log(1+\eta)$; the second since $\log(1+\eta)\geq\eta-\frac{\eta^2}{2}\geq\eta(1-\eta)$):

\log\left(\frac{1}{1+\eta g}\right) \leq g\log\left(\frac{1}{1+\eta}\right) \leq -\eta(1-\eta)g.

Telescoping (4) over all $T$ iterations, and using $V_{x_0}(u)\leq\log n$ for all $u\in\Delta^n$ since $x_0$ is uniform, we have that whenever Line 4 is not satisfied before the check on Line 7 (i.e. $t\geq T$),

\eta(1-\eta)\sum_{0\leq t<T}\left\langle g_t, u\right\rangle \leq \log\left(\frac{\left\lVert w_T\right\rVert_1}{\left\lVert w_0\right\rVert_1}\right) + V_{x_0}(u) \leq \log\left(\frac{nd}{\epsilon^2}\right) + \log n \leq 2\log\left(\frac{nd}{\epsilon}\right). \quad (5)

The last inequality used $\left\lVert w_T\right\rVert_1\leq\epsilon^{-1}$ by assumption, and $\left\lVert w_0\right\rVert_1 = \frac{\epsilon}{nd}$. Next, since each $g_t \geq \mathbf{1}-\mathbf{A}^\top(v_t)^{p-1}$ entrywise, defining $\bar{z} = \frac{z}{T}$,

\sum_{0\leq t<T}\left\langle g_t, u\right\rangle \geq \sum_{0\leq t<T}\left\langle\mathbf{1}-\mathbf{A}^\top(v_t)^{p-1}, u\right\rangle = T\left\langle\mathbf{1}-\mathbf{A}^\top\bar{z}, u\right\rangle, \text{ for all } u\in\Delta^n. \quad (6)

Combining (5) and (6) and rearranging yields, by the definition of $T$,

\left\langle\mathbf{1}-\mathbf{A}^\top\bar{z}, u\right\rangle \leq \frac{2\log(\frac{nd}{\epsilon})}{T\eta(1-\eta)} \leq \frac{4p\log(\frac{nd}{\epsilon})}{T} \leq \epsilon \implies \mathbf{A}^\top\bar{z} \geq 1-\epsilon \text{ entrywise.}

The last claim follows by setting $u$ to be each coordinate-sparse simplex vector. Finally, since $\frac{\bar{z}}{\left\lVert\bar{z}\right\rVert_q} = \frac{z}{\left\lVert z\right\rVert_q}$, to show that $y$ is a correct dual solution to Problem 2 it suffices to show $\left\lVert\bar{z}\right\rVert_q \leq 1$. This follows since $\bar{z}$ is an average of the $(v_t)^{p-1}$, by convexity of $\ell_q$ norms, and since for all $t$,

\left\lVert(v_t)^{p-1}\right\rVert_q^q = \sum_{j\in[d]}[v_t]_j^p = \sum_{j\in[d]}\frac{[\mathbf{A}w_t]_j^p}{\left\lVert\mathbf{A}w_t\right\rVert_p^p} = 1.

4.3 Schatten-norm packing semidefinite programs

We generalize Algorithm 2 to solve Schatten packing semidefinite programs, which we now define.

Problem 3.

Given $\{\mathbf{A}_i\}_{i\in[n]}\in\mathbb{S}_{\geq 0}^d$, either find a primal solution $x\in\Delta^n$ with $\left\lVert\sum_{i\in[n]}x_i\mathbf{A}_i\right\rVert_p \leq 1+\epsilon$ or a dual solution $\mathbf{Y}\in\mathbb{S}_{\geq 0}^d$, $\left\lVert\mathbf{Y}\right\rVert_q = 1$, with $\left\langle\mathbf{A}_i,\mathbf{Y}\right\rangle \geq 1-\epsilon$ for all $i\in[n]$.

Algorithm 3 $\mathsf{SchattenPacking}(\{\mathbf{A}_i\}_{i\in[n]},\epsilon,p)$
1:  Input: $\{\mathbf{A}_i\}_{i\in[n]}\in\mathbb{S}_{\geq 0}^d$, $\epsilon\in[0,\frac{1}{2}]$, $p\geq 2$
2:  $\eta\leftarrow p^{-1}$, $T\leftarrow\frac{4p\log(nd/\epsilon)}{\epsilon}$
3:  $[w_0]_i\leftarrow\frac{\epsilon}{n^2 d}$ for all $i\in[n]$, $\mathbf{Z}\leftarrow\mathbf{0}$, $t\leftarrow 0$
4:  while $\left\lVert w_t\right\rVert_1\leq\epsilon^{-1}$ do
5:     $g_t\leftarrow\max\left(0,\mathbf{1}-\left\langle\mathbf{A}_i,\mathbf{V}_t^{p-1}\right\rangle\right)$ entrywise, for $\mathbf{V}_t\leftarrow\frac{\sum_{i\in[n]}[w_t]_i\mathbf{A}_i}{\left\lVert\sum_{i\in[n]}[w_t]_i\mathbf{A}_i\right\rVert_p}$
6:     $w_{t+1}\leftarrow w_t\circ(1+\eta g_t)$, $\mathbf{Z}\leftarrow\mathbf{Z}+(\mathbf{V}_t)^{p-1}$, $t\leftarrow t+1$
7:     if $t\geq T$ then
8:        return $\mathbf{Y} = \frac{\mathbf{Z}}{\left\lVert\mathbf{Z}\right\rVert_q}$
9:     end if
10:  end while
11:  return $x = \frac{w_t}{\left\lVert w_t\right\rVert_1}$

We assume that $p$ is an odd integer for simplicity (sufficient for our applications), and leave for interesting future work the cases when $p$ is even or noninteger. The potential used in the analysis and the overall guarantee are stated here, with proofs deferred to Appendix C. The proofs are simple modifications of Lemma 3 and Theorem 4 using trace inequalities (similar to those in [JLL+20]) in place of scalar inequalities, as well as efficient approximation of the quantities in Line 5 via the standard technique of Johnson-Lindenstrauss projections.
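For intuition, the only new computational object relative to Algorithm 2 is the matrix $\mathbf{V}_t^{p-1}$; the following dense-eigendecomposition sketch of Line 5 is our own illustration (the solver of Theorem 5 instead approximates the inner products $\left\langle\mathbf{A}_i,\mathbf{V}_t^{p-1}\right\rangle$ with Johnson-Lindenstrauss sketches, avoiding eigendecompositions):

import numpy as np

def schatten_step(As, w, p):
    # Line 5 of Algorithm 3 via a dense eigendecomposition: form V_t^{p-1}
    # and the truncated gradient g_t.
    M = sum(wi * Ai for wi, Ai in zip(w, As))        # sum_i [w_t]_i A_i (PSD)
    lam, U = np.linalg.eigh(M)
    lam_n = lam / np.linalg.norm(lam, ord=p)         # spectrum of V_t
    Vp = (U * lam_n ** (p - 1)) @ U.T                # V_t^{p-1}
    g = np.maximum(0.0, 1.0 - np.array([np.tensordot(Ai, Vp) for Ai in As]))
    return g, Vp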

Lemma 4.

In all iterations $t$ of Algorithm 3, defining $\Phi_t := \left\lVert\sum_{i\in[n]}[w_t]_i\mathbf{A}_i\right\rVert_p - \left\lVert w_t\right\rVert_1$, we have $\Phi_{t+1}\leq\Phi_t$.

Theorem 5.

Let $p$ be odd. Algorithm 3 runs in $O(\frac{p\log(nd/\epsilon)}{\epsilon})$ iterations, and its output solves Problem 3. Each iteration is implementable in time $O(\textup{nnz}\cdot\frac{p\log(nd/\epsilon)}{\epsilon^2})$, where nnz is the number of nonzero entries amongst all $\{\mathbf{A}_i\}_{i\in[n]}$, losing $O(\epsilon)$ in the quality of Problem 3 with probability $1-\textup{poly}((nd/\epsilon)^{-1})$.

4.4 Schatten packing with an $\ell_\infty$ constraint

We remark that the framework outlined in Section 4.1 is flexible enough to handle mixed-norm packing problems. Specifically, developments in Section 5 require the following guarantee.

Proposition 2.

Following Theorem 5's notation, let $p$ be odd, $\{\mathbf{A}_i\}_{i\in[n]}\in\mathbb{S}_{\geq 0}^d$, $0 < \epsilon = O(\alpha)$, and

\min_{x\in\Delta^n,\;\left\lVert x\right\rVert_\infty\leq\frac{1+\alpha}{n}} \left\lVert\mathcal{A}(x)\right\rVert_p = \textup{OPT} \quad (7)

for $\mathcal{A}(x) := \sum_{i\in[n]}x_i\mathbf{A}_i$. Given an estimate of OPT exponentially bounded in $\frac{nd}{\epsilon}$, there is a procedure calling Algorithm 7 $O(\log\frac{nd}{\epsilon})$ times which yields $x\in\Delta^n$ with $\left\lVert x\right\rVert_\infty\leq\frac{(1+\alpha)(1+\epsilon)}{n}$ and $\left\lVert\mathcal{A}(x)\right\rVert_p\leq(1+\epsilon)\textup{OPT}$. Algorithm 7 runs in $O(\frac{\log(nd/\epsilon)\log n}{\epsilon^2})$ iterations, each implementable in time $O(\textup{nnz}\cdot\frac{p\log(nd/\epsilon)}{\epsilon^2})$.

Our method, found in Appendix C, approximately solves (7) by first applying a standard binary search to place $\mathcal{A}(x)$ on the right scale, for which it suffices to solve an approximate decision problem. Then, we apply a truncated mirror descent procedure on the potential $\Phi(w) = \log\left(\exp(\left\lVert\mathcal{A}(w)\right\rVert_p) + \exp(\frac{n}{1+\alpha}\left\lVert w\right\rVert_\infty)\right) - \left\lVert w\right\rVert_1$, and prove correctness for solving the decision problem following the framework outlined in Section 4.1.

5 Robust sub-Gaussian PCA in nearly-linear time

We give our nearly-linear time robust PCA method, leveraging the developments of Section 4. Throughout, we operate under Assumption 1, for some corruption parameter $\epsilon$ with $\epsilon\log\epsilon^{-1}\log d = O(1)$; $\epsilon = O(\frac{1}{\log d\log\log d})$ suffices. We now develop tools to prove Theorem 2.

Algorithm 4 uses three subroutines: our earlier $\mathsf{1DRobustVariance}$ method (Lemma 1), an application of our earlier Proposition 2 to approximate the solution to

\min_{w\in\mathfrak{S}_\epsilon^n}\left\lVert\sum_{i\in[n]}w_i X_i X_i^\top\right\rVert_p, \text{ for } p = \Theta\left(\sqrt{\frac{\log d}{\epsilon\log\epsilon^{-1}}}\right), \quad (8)

and a method for computing approximate eigenvectors by [MM15] (discussed in Appendix D).

Proposition 3.

There is an algorithm $\mathsf{Power}$ (Algorithm 1, [MM15]), parameterized by $t\in[d]$, tolerance $\tilde{\epsilon} > 0$, $p\geq 1$, and $\mathbf{A}\in\mathbb{S}_{\geq 0}^d$, which outputs orthonormal $\{z_j\}_{j\in[t]}$ with the guarantee

\left|z_j^\top\mathbf{A}^p z_j - \lambda_j^p(\mathbf{A})\right| \leq \tilde{\epsilon}\lambda_j^p(\mathbf{A}) \;\text{ and }\; \left|z_j^\top\mathbf{A}^{p-1}z_j - \lambda_j^{p-1}(\mathbf{A})\right| \leq \tilde{\epsilon}\lambda_j^{p-1}(\mathbf{A}), \text{ for all } j\in[t]. \quad (9)

Here, $\lambda_j(\mathbf{A})$ is the $j^{\text{th}}$ largest eigenvalue of $\mathbf{A}$. The total time required by the method is $O(\textup{nnz}(\mathbf{A})\cdot\frac{tp\log d}{\tilde{\epsilon}})$.
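For intuition, orthonormal approximate top eigenvectors of the flavor required by (9) can be produced by block power iteration; a hedged numpy sketch follows (this is not the [MM15] algorithm, whose analysis gives the stated accuracy and runtime):

import numpy as np

def approx_top_eigs(A, t, iters=50, seed=0):
    # Orthonormal z_1, ..., z_t approximately spanning the top eigenspace of A,
    # via block power iteration with QR re-orthonormalization.
    rng = np.random.default_rng(seed)
    Z = np.linalg.qr(rng.standard_normal((A.shape[0], t)))[0]
    for _ in range(iters):
        Z = np.linalg.qr(A @ Z)[0]
    return Z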

Algorithm 4 $\mathsf{RobustPCA}(\{X_i\}_{i\in[n]},\epsilon,t)$
1:  Input: $\{X_i\}_{i\in[n]}$, $\epsilon = O(\frac{1}{\log d\log\log d})$, $t\in[d]$ with $\bm{\Sigma}_{t+1}\leq(1-\gamma)\left\lVert\bm{\Sigma}\right\rVert_\infty$ for $\gamma$ as in Theorem 2
2:  $w\leftarrow\mathsf{BoxedSchattenPacking}$ (Proposition 2) on $\{\mathbf{A}_i = X_iX_i^\top\}_{i\in[n]}$, $\alpha\leftarrow\epsilon$, $p$ as in (8)
3:  $\mathbf{M} = \sum_{i\in[n]}w_i X_i X_i^\top$
4:  $\{z_j\}_{j\in[t]} = \mathsf{Power}(t,\epsilon,p,\mathbf{M})$
5:  $\alpha_j\leftarrow\mathsf{1DRobustVariance}(\{X_i\}_{i\in[n]}, y_j, \epsilon)$ for all $j\in[t]$, where $y_j := \mathbf{M}^{\frac{p-1}{2}}z_j/\lVert\mathbf{M}^{\frac{p-1}{2}}z_j\rVert_2$
6:  return $y_{j^*}$ for $j^* = \textup{argmax}_{j\in[t]}\alpha_j$
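Gluing the pieces together, a high-level numpy sketch of Algorithm 4's data flow (solve_boxed_packing is a stand-in for the solver of Proposition 2, and robust_variance_1d/approx_top_eigs are the earlier sketches; constants are placeholders):

import numpy as np

def robust_pca(X, eps, t, solve_boxed_packing):
    n, d = X.shape
    p = max(3, int(np.sqrt(np.log(d) / (eps * np.log(1 / eps)))) | 1)     # odd p, as in (8)
    w = solve_boxed_packing([np.outer(x, x) for x in X], alpha=eps, p=p)  # Line 2
    M = (X * w[:, None]).T @ X                                            # Line 3
    Z = approx_top_eigs(M, t)                                             # Line 4 (sketch above)
    Y = np.linalg.matrix_power(M, (p - 1) // 2) @ Z                       # M^{(p-1)/2} z_j
    Y = Y / np.linalg.norm(Y, axis=0)                                     # normalized y_j, Line 5
    scores = [robust_variance_1d(X, Y[:, j], eps) for j in range(t)]
    return Y[:, int(np.argmax(scores))]                                   # Line 6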

Algorithm 4 is computationally bottlenecked by the application of Proposition 2 on Line 2 and the call to $\mathsf{Power}$ on Line 4, from which the runtime guarantee of Theorem 2 follows straightforwardly. To demonstrate correctness, we first certify the quality of the solution to (8).

Lemma 5.

Let $n = \Omega\left(\frac{d+\log\delta^{-1}}{(\epsilon\log\epsilon^{-1})^2}\right)$. With probability $1-\frac{\delta}{2}$, the uniform distribution over $G$ attains objective value at most $(1+\frac{\tilde{\epsilon}}{2})\left\lVert\bm{\Sigma}\right\rVert_p$ for objective (8), where $\tilde{\epsilon} = C'\epsilon\log\epsilon^{-1}$ for a universal constant $C' > 0$.

The proof of this is similar to results in e.g. [DKK+19, Li18], and combines concentration guarantees with a union bound over all possible corruption sets $B$. This implies the following immediately, upon applying the guarantees of Proposition 2.

Corollary 1.

Let $w$ be the output of Line 2 of $\mathsf{RobustPCA}$. Then we have $\left\lVert w\right\rVert_\infty \leq \frac{1}{(1-2\epsilon)n}$, and $\left\lVert\sum_{i\in[n]}w_i X_i X_i^\top\right\rVert_p \leq (1+\tilde{\epsilon})\left\lVert\bm{\Sigma}\right\rVert_p$ under the guarantee of Lemma 5.

Let $w$ be the output of the solver, and recall that $\mathbf{M} = \sum_{i=1}^n w_i X_i X_i^\top$. Additionally, define

\mathbf{M}_G := \sum_{i\in G} w_i X_i X_i^\top,\; w_G := \sum_{i\in G} w_i,\; \mathbf{M}_B := \sum_{i\in B} w_i X_i X_i^\top,\; w_B := \sum_{i\in B} w_i. \quad (10)

Notice in particular that $\mathbf{M} = \mathbf{M}_G + \mathbf{M}_B$, and that all these matrices are PSD. We next prove the second, crucial fact, which says that $\mathbf{M}_G$ is a good approximation of $\bm{\Sigma}$ in the Loewner ordering:

Lemma 6.

Let $n = \Omega\left(\frac{d+\log\delta^{-1}}{(\epsilon\log\epsilon^{-1})^2}\right)$. With probability at least $1-\frac{\delta}{2}$, $(1+\tilde{\epsilon})\bm{\Sigma} \succeq \mathbf{M}_G \succeq (1-\tilde{\epsilon})\bm{\Sigma}$.

The proof combines the strategy of Lemma 5 with the guarantee of the SDP solver. Perhaps surprisingly, Corollary 1 and Lemma 6 are the only two properties of $\mathbf{M}$ that our final analysis of Theorem 2 needs. In particular, we have the following key geometric proposition, which carefully combines trace inequalities to argue that the corrupted points in $\mathbf{M}_B$ cannot create too many new large eigendirections.

Proposition 4.

Let $\mathbf{M} = \mathbf{M}_G + \mathbf{M}_B$ be such that $\left\lVert\mathbf{M}\right\rVert_p \leq (1+\tilde{\epsilon})\left\lVert\bm{\Sigma}\right\rVert_p$, $\mathbf{M}_G\succeq 0$ and $\mathbf{M}_B\succeq 0$, and such that $(1+\tilde{\epsilon})\bm{\Sigma} \succeq \mathbf{M}_G \succeq (1-\tilde{\epsilon})\bm{\Sigma}$. Following the notation of Algorithm 4, let

\mathbf{M} = \sum_{j\in[d]}\lambda_j v_j v_j^\top,\; \bm{\Sigma} = \sum_{j\in[d]}\sigma_j u_j u_j^\top \quad (11)

be sorted eigendecompositions of $\mathbf{M}$ and $\bm{\Sigma}$, so $\lambda_1\geq\ldots\geq\lambda_d$ and $\sigma_1\geq\ldots\geq\sigma_d$. Let $\gamma$ be as in Theorem 2, and assume $\sigma_{t+1} < (1-\gamma)\sigma_1$. Then,

\max_{j\in[t]} v_j^\top\bm{\Sigma}v_j \geq (1-\gamma)\left\lVert\bm{\Sigma}\right\rVert_\infty.
Proof.

For concreteness, we will define the parameters

p = \frac{2}{7}\sqrt{\frac{\log(3d)}{\tilde{\epsilon}}},\; \gamma = 14\sqrt{\tilde{\epsilon}\log(3d)} = 49p\tilde{\epsilon}.

For these choices of $p$ and $\gamma$, we will use the following (loose) approximations, valid for sufficiently small $\tilde{\epsilon}$:

\left(1-\frac{\gamma}{4}\right)^p = \left(1-\frac{\gamma}{4}\right)^{\frac{4}{\gamma}\log(3d)} \leq \frac{1}{3d},\quad (1+\tilde{\epsilon})^p - (1-\tilde{\epsilon})^p \leq \exp(p\tilde{\epsilon}) - (1-p\tilde{\epsilon}) \leq 3p\tilde{\epsilon}. \quad (12)

Suppose for contradiction that $v_j^\top\bm{\Sigma}v_j < (1-\gamma)\sigma_1$ for all $j\in[t]$. By applying the guarantee of Corollary 1 and Fact 2, it follows that

\left\langle\mathbf{M},\mathbf{M}^{p-1}\right\rangle = \left\lVert\mathbf{M}\right\rVert_p^p \leq (1+\tilde{\epsilon})^p\left\lVert\bm{\Sigma}\right\rVert_p^p. \quad (13)

Let $s\in[d]$ be the largest index such that $\sigma_s > \left(1-\frac{\gamma}{4}\right)\sigma_1$, and note that $s\leq t$. We define

\mathbf{N} := \sum_{j\in[s]}\lambda_j^{p-1}v_j v_j^\top \preceq \mathbf{M}^{p-1}.

That is, $\mathbf{N}$ is the restriction of $\mathbf{M}^{p-1}$ to its top $s$ eigendirections. Then,

\left\langle\mathbf{M},\mathbf{M}^{p-1}\right\rangle = \left\langle\mathbf{M}_B,\mathbf{M}^{p-1}\right\rangle + \left\langle\mathbf{M}_G,\mathbf{M}^{p-1}\right\rangle \quad (14)
\geq \left\langle\mathbf{M}_B,\mathbf{M}^{p-1}\right\rangle + \left\langle(1-\tilde{\epsilon})\bm{\Sigma},\mathbf{M}^{p-1}\right\rangle \geq \left\langle\mathbf{M}_B,\mathbf{N}\right\rangle + (1-\tilde{\epsilon})^p\left\lVert\bm{\Sigma}\right\rVert_p^p.

In the second line, we used Lemma 6 twice, as well as the trace inequality Lemma 7 with $\mathbf{A} = \mathbf{M}$ and $\mathbf{B} = (1-\tilde{\epsilon})\bm{\Sigma}$. Combining (13) with (14), and expanding the definition of $\mathbf{M}_B$, yields

\left((1+\tilde{\epsilon})^p - (1-\tilde{\epsilon})^p\right)\left\lVert\bm{\Sigma}\right\rVert_p^p \geq \left\langle\mathbf{M}_B,\mathbf{N}\right\rangle = \left\langle\mathbf{M}_B,\sum_{j\in[s]}\lambda_j^{p-1}v_j v_j^\top\right\rangle \quad (15)
= \left\langle\mathbf{M},\sum_{j\in[s]}\lambda_j^{p-1}v_j v_j^\top\right\rangle - \left\langle\mathbf{M}_G,\sum_{j\in[s]}\lambda_j^{p-1}v_j v_j^\top\right\rangle
\geq \left\langle\mathbf{M},\sum_{j\in[s]}\lambda_j^{p-1}v_j v_j^\top\right\rangle - (1+\tilde{\epsilon})\left\langle\bm{\Sigma},\sum_{j\in[s]}\lambda_j^{p-1}v_j v_j^\top\right\rangle
= \sum_{j\in[s]}\left(\lambda_j^p - (1+\tilde{\epsilon})\lambda_j^{p-1}v_j^\top\bm{\Sigma}v_j\right) \geq \sum_{j\in[s]}\left(\lambda_j^p - \lambda_j^{p-1}(1+\tilde{\epsilon})(1-\gamma)\sigma_1\right).

The third line followed from the spectral bound $\mathbf{M}_G \preceq (1+\tilde{\epsilon})\bm{\Sigma}$ of Lemma 6, and the fourth followed from the fact that $\{\lambda_j\}_{j\in[d]}$, $\{v_j\}_{j\in[d]}$ eigendecompose $\mathbf{M}$, as well as the assumption $v_j^\top\bm{\Sigma}v_j \leq (1-\gamma)\sigma_1$ for all $j\in[t]$. Letting $S := \sum_{j\in[s]}\sigma_j^p$, and using both approximations in (12),

\left\lVert\bm{\Sigma}\right\rVert_p^p \leq \sum_{j\in[s]}\sigma_j^p + \left(1-\frac{\gamma}{4}\right)^p(d-s)\sigma_1^p \leq \frac{4}{3}S \implies \left((1+\tilde{\epsilon})^p - (1-\tilde{\epsilon})^p\right)\left\lVert\bm{\Sigma}\right\rVert_p^p \leq 4p\tilde{\epsilon}S. \quad (16)

Next, we bound the last term of (15). Using $(1+\tilde{\epsilon})(1-\gamma) \leq 1-\frac{\gamma}{2}$,

\sum_{j\in[s]}\left(\lambda_j^p - \lambda_j^{p-1}(1+\tilde{\epsilon})(1-\gamma)\sigma_1\right) \geq \sum_{j\in[s]}\lambda_j^{p-1}\left(\lambda_j - \left(1-\frac{\gamma}{2}\right)\sigma_1\right) \quad (17)
\geq \frac{\gamma}{6}\sum_{j\in[s]}\lambda_j^{p-1}\sigma_1 \geq \frac{\gamma}{6}(1-\tilde{\epsilon})^{p-1}\sum_{j\in[s]}\sigma_j^p \geq \frac{\gamma}{12}S.

The second line used $\lambda_j - (1-\frac{\gamma}{2})\sigma_1 \geq (1-\tilde{\epsilon})\sigma_j - (1-\frac{\gamma}{2})\sigma_1 \geq \frac{\gamma}{6}\sigma_1$ by the definition of $s$ and Lemma 8 (twice), and $(1-\tilde{\epsilon})^{p-1} \geq \frac{1}{2}$. Combining (17) and (16) and plugging into (15),

4p\tilde{\epsilon}S \geq \frac{\gamma}{12}S \implies 48p\tilde{\epsilon} \geq \gamma.

By the choice of $\gamma$ and $p$ (i.e. $\gamma = 49p\tilde{\epsilon}$), we attain a contradiction. ∎

In the proof of Proposition 4, we used the following facts.

Lemma 7.

Let $\mathbf{A} \succeq \mathbf{B} \succeq 0$ be symmetric matrices and $p$ a positive integer. Then we have

\textup{Tr}\left(\mathbf{A}^{p-1}\mathbf{B}\right) \geq \textup{Tr}\left(\mathbf{B}^p\right).
Proof.

For any $1\leq k\leq p-1$,

\textup{Tr}\left(\mathbf{A}^k\mathbf{B}^{p-k}\right) \geq \textup{Tr}\left(\mathbf{A}^{k-1}\mathbf{B}^{\frac{p-k}{2}}\mathbf{A}\mathbf{B}^{\frac{p-k}{2}}\right) \geq \textup{Tr}\left(\mathbf{A}^{k-1}\mathbf{B}^{\frac{p-k}{2}}\mathbf{B}\mathbf{B}^{\frac{p-k}{2}}\right) = \textup{Tr}\left(\mathbf{A}^{k-1}\mathbf{B}^{p-k+1}\right).

The first step used the extended Lieb-Thirring trace inequality $\textup{Tr}(\mathbf{M}\mathbf{N}^2) \geq \textup{Tr}(\mathbf{M}^\alpha\mathbf{N}\mathbf{M}^{1-\alpha}\mathbf{N})$ for $\alpha\in[0,1]$, $\mathbf{M},\mathbf{N}\in\mathbb{S}_{\geq 0}^d$ (see e.g. Lemma 2.1, [ALO16]), and the second used $\mathbf{A}\succeq\mathbf{B}$. Finally, induction on $k$ yields the claim. ∎
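A quick numerical check of Lemma 7 on a random instance (illustrative only):

import numpy as np

rng = np.random.default_rng(2)
d, p = 6, 5
C = rng.standard_normal((d, d)); B = C @ C.T       # B is PSD
D = rng.standard_normal((d, d)); A = B + D @ D.T   # A >= B >= 0 in the Loewner order
lhs = np.trace(np.linalg.matrix_power(A, p - 1) @ B)
rhs = np.trace(np.linalg.matrix_power(B, p))
print(lhs >= rhs)  # True, as Lemma 7 asserts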

Lemma 8.

For all $j\in[d]$, $\lambda_j \geq (1-\tilde{\epsilon})\sigma_j$.

Proof.

By the Courant-Fischer minimax characterization of eigenvalues,

\lambda_j \geq \min_{k\in[j]} u_k^\top\mathbf{M}u_k.

However, we also have $\mathbf{M} \succeq \mathbf{M}_G \succeq (1-\tilde{\epsilon})\bm{\Sigma}$ (Lemma 6), yielding the conclusion. ∎

The guarantees of Proposition 4 were geared towards exact eigenvectors of the matrix $\mathbf{M}$. We now modify the analysis to tolerate inexactness in the eigenvector computation, in line with the processing on Line 5 of our Algorithm 4. This yields our final claim in Theorem 2.

Corollary 2.

In the setting of Proposition 4, and letting $\{z_j\}_{j\in[t]}$ satisfy (9), set for all $j\in[t]$

y_j := \frac{\mathbf{M}^{\frac{p-1}{2}}z_j}{\left\lVert\mathbf{M}^{\frac{p-1}{2}}z_j\right\rVert_2}.

Then with probability at least $1-\delta$,

\max_{j\in[t]} y_j^\top\bm{\Sigma}y_j \geq (1-\gamma)\left\lVert\bm{\Sigma}\right\rVert_\infty.
Proof.

Assume for contradiction that all $y_j$ have $y_j^\top\bm{\Sigma}y_j \leq (1-\gamma)\sigma_1$. We outline the modifications to the proof of Proposition 4. Specifically, we redefine the matrix $\mathbf{N}$ by

\mathbf{N} := \mathbf{M}^{\frac{p-1}{2}}\left(\sum_{j\in[s]}z_j z_j^\top\right)\mathbf{M}^{\frac{p-1}{2}}.

Because $\sum_{j\in[s]}z_j z_j^\top$ is a projection matrix, it is clear that $\mathbf{N} \preceq \mathbf{M}^{p-1}$. Therefore, by combining the derivations (13) and (14), it remains true that

\left((1+\tilde{\epsilon})^p - (1-\tilde{\epsilon})^p\right)\left\lVert\bm{\Sigma}\right\rVert_p^p \geq \left\langle\mathbf{M}_B,\mathbf{N}\right\rangle = \left\langle\mathbf{M},\mathbf{N}\right\rangle - \left\langle\mathbf{M}_G,\mathbf{N}\right\rangle.

We now bound these two terms analogously to Proposition 4, with negligible loss; combining these bounds will again yield a contradiction. First, we have the lower bound

\left\langle\mathbf{M},\sum_{j\in[s]}\mathbf{M}^{\frac{p-1}{2}}z_j z_j^\top\mathbf{M}^{\frac{p-1}{2}}\right\rangle = \sum_{j\in[s]}z_j^\top\mathbf{M}^p z_j \geq (1-\tilde{\epsilon})\sum_{j\in[s]}\lambda_j^p.

Here, the last inequality applied the assumption (9) with respect to $\mathbf{M}^p$. Next, we upper bound

\left\langle\mathbf{M}_G,\sum_{j\in[s]}\mathbf{M}^{\frac{p-1}{2}}z_j z_j^\top\mathbf{M}^{\frac{p-1}{2}}\right\rangle \leq (1+\tilde{\epsilon})\left\langle\bm{\Sigma},\sum_{j\in[s]}\mathbf{M}^{\frac{p-1}{2}}z_j z_j^\top\mathbf{M}^{\frac{p-1}{2}}\right\rangle
= (1+\tilde{\epsilon})\sum_{j\in[s]}\left\lVert\mathbf{M}^{\frac{p-1}{2}}z_j\right\rVert_2^2\, y_j^\top\bm{\Sigma}y_j
\leq (1+\tilde{\epsilon})(1-\gamma)\sigma_1\sum_{j\in[s]}z_j^\top\mathbf{M}^{p-1}z_j
\leq (1-\gamma)(1+\tilde{\epsilon})^2\sigma_1\sum_{j\in[s]}\lambda_j^{p-1}.

The first line used $\mathbf{M}_G \preceq (1+\tilde{\epsilon})\bm{\Sigma}$, the second used the definition of $y_j$, the third used our assumption $y_j^\top\bm{\Sigma}y_j \leq (1-\gamma)\sigma_1$, and the last used (9) with respect to $\mathbf{M}^{p-1}$. Finally, the remaining derivation (17) is tolerant to additional factors of $1+\tilde{\epsilon}$, yielding the same conclusion up to constants. ∎

Finally, we prove Theorem 2 by combining the tools developed thus far.

Proof of Theorem 2.

Correctness of the algorithm is immediate from Corollary 2 and the guarantees of $\mathsf{1DRobustVariance}$. Concretely, Corollary 2 guarantees that one of the vectors we produce will be a $(1-\gamma)$-approximate top eigenvector (say at some index $j\in[t]$), and $\mathsf{1DRobustVariance}$ will only lose a negligible fraction $O(\epsilon\log\epsilon^{-1})$ of this quality (see Lemma 1); the best returned eigenvector as measured by $\mathsf{1DRobustVariance}$ can only improve the guarantee. Finally, the failure probability follows by combining the guarantees of Lemmas 1, 5, and 6.

We now discuss runtime. The complexities of Lines 2, 4, and 5, as guaranteed by Proposition 2, Proposition 3, and Lemma 1 respectively, are (recalling $p = \widetilde{O}(\epsilon^{-0.5})$)

\widetilde{O}\left(\frac{nd}{\epsilon^{4.5}}\right),\; \widetilde{O}\left(\frac{ndt}{\epsilon^{1.5}}\right),\; \widetilde{O}(ndt).

Throughout, we use that we can compute matrix-vector products with an arbitrary linear combination of the $X_i X_i^\top$ in time $O(nd)$; it is easy to check that in all runtime guarantees, nnz can be replaced by this computational cost. Combining these bounds yields the final conclusion. ∎

Acknowledgments

We thank Swati Padmanabhan and Aaron Sidford for helpful discussions.

References

  • [AK16] Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. J. ACM, 63(2):12:1–12:35, 2016.
  • [AL17] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: Faster online learning of eigenvectors and faster MMWU. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 116–125, 2017.
  • [ALO15] Zeyuan Allen-Zhu, Zhenyu Liao, and Lorenzo Orecchia. Spectral sparsification and regret minimization beyond matrix multiplicative updates. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 237–245, 2015.
  • [ALO16] Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 1824–1831, 2016.
  • [CDG19] Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2755–2771, 2019.
  • [CDGW19] Yu Cheng, Ilias Diakonikolas, Rong Ge, and David Woodruff. Faster algorithms for high-dimensional robust covariance estimation. arXiv preprint arXiv:1906.04661, 2019.
  • [CDST19] Yair Carmon, John C. Duchi, Aaron Sidford, and Kevin Tian. A rank-1 sketch for matrix multiplicative weights. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pages 589–623, 2019.
  • [CG18] Yu Cheng and Rong Ge. Non-convex matrix completion against a semi-random adversary. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1362–1394, 2018.
  • [CLMW11] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
  • [CMY20] Yeshwanth Cherapanamjeri, Sidhanth Mohanty, and Morris Yau. List decodable mean estimation in nearly linear time. CoRR, abs/2005.09796, 2020.
  • [DFO18] Jelena Diakonikolas, Maryam Fazel, and Lorenzo Orecchia. Width-independence beyond linear objectives: Distributed fair packing and covering algorithms. CoRR, abs/1808.02517, 2018.
  • [DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 2003.
  • [DHL19] Yihe Dong, Samuel Hopkins, and Jerry Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. In Advances in Neural Information Processing Systems, pages 6065–6075, 2019.
  • [DK19] Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
  • [DKK+17] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 999–1008. JMLR.org, 2017.
  • [DKK+18] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2683–2702. SIAM, 2018.
  • [DKK+19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput., 48(2):742–864, 2019.
  • [DKS17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84. IEEE, 2017.
  • [DL19] Jules Depersin and Guillaume Lecué. Robust subgaussian estimation of a mean vector in nearly linear time. CoRR, abs/1906.03058, 2019.
  • [GHM15] Dan Garber, Elad Hazan, and Tengyu Ma. Online learning of eigenvectors. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 560–568, 2015.
  • [HL18] Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1021–1034, 2018.
  • [Hub64] Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
  • [JLL+20] Arun Jambulapati, Yin Tat Lee, Jerry Li, Swati Padmanabhan, and Kevin Tian. Positive semidefinite programming: Mixed, parallel, and width-independent. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020.
  • [JY11] Rahul Jain and Penghui Yao. A parallel approximation algorithm for positive semidefinite programming. In IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22-25, 2011, pages 463–471, 2011.
  • [KSS18] Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035–1046, 2018.
  • [Li18] Jerry Zheng Li. Principled approaches to robust machine learning and beyond. PhD thesis, Massachusetts Institute of Technology, 2018.
  • [LRV16] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.
  • [MM15] Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1396–1404, 2015.
  • [MRWZ16] Michael W. Mahoney, Satish Rao, Di Wang, and Peng Zhang. Approximating the solution to mixed packing and covering LPs in parallel \widetilde{O}(\epsilon^{-3}) time. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 52:1–52:14, 2016.
  • [MSZ16] Jelena Marasevic, Clifford Stein, and Gil Zussman. A fast distributed stateless algorithm for \alpha-fair packing problems. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 54:1–54:15, 2016.
  • [NN94] Yurii Nesterov and Arkadi Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994.
  • [PTZ16] Richard Peng, Kanat Tangwongsan, and Peng Zhang. Faster and simpler width-independent parallel algorithms for positive semidefinite programming. CoRR, abs/1201.5135, 2016.
  • [RH17] Philippe Rigollet and Jan-Christian Hütter. High-Dimensional Statistics. 2017.
  • [SCV17] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. arXiv preprint arXiv:1703.04940, 2017.
  • [Ste18] Jacob Steinhardt. Robust Learning: Information Theory and Algorithms. PhD thesis, Stanford University, 2018.
  • [Tuk75] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.
  • [Ver16] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. 2016.
  • [WK06] Manfred K. Warmuth and Dima Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 1481–1488, 2006.

Appendix A Concentration

A.1 Sub-Gaussian concentration

We use the following concentration facts on sub-Gaussian distributions following from standard techniques, and give an application bounding Schatten-norm deviations.

Lemma 9.

Under Assumption 1, there are universal constants C1C_{1}, C2C_{2} such that

Pr[supvdv2=1|v(1niGXiXi𝚺)vtv𝚺v>0]exp(C1dC2nmin(t,t2)).\Pr\left[\sup_{\begin{subarray}{c}v\in\mathbb{R}^{d}\\ \left\lVert v\right\rVert_{2}=1\end{subarray}}\left|v^{\top}\left(\frac{1}{n}\sum_{i\in G^{\prime}}X_{i}X_{i}^{\top}-\bm{\Sigma}\right)v\right|-tv^{\top}\bm{\Sigma}v>0\right]\leq\exp\left(C_{1}d-C_{2}n\min(t,t^{2})\right).
Proof.

By observing (3), it is clear that the random vector X~=𝚺12X\widetilde{X}=\bm{\Sigma}^{-\frac{1}{2}}X for X𝒟X\sim\mathcal{D} has covariance 𝐈\mathbf{I} and sub-Gaussian proxy c𝐈c\mathbf{I}. For any fixed unit vector uu, by Lemma 1.12 of [RH17], the random variable (uX~)21(u^{\top}\widetilde{X})^{2}-1 is sub-exponential with parameter 16c16c, so by Bernstein’s inequality (Theorem 1.13, [RH17]), defining X~i=𝚺12Xi\widetilde{X}_{i}=\bm{\Sigma}^{-\frac{1}{2}}X_{i} for each Xi𝒟X_{i}\sim\mathcal{D},

Pr[|u(1niGX~iX~i𝐈)u|>t2]exp(n211c2min(t,t2)).\Pr\left[\left|u^{\top}\left(\frac{1}{n}\sum_{i\in G^{\prime}}\widetilde{X}_{i}\widetilde{X}_{i}^{\top}-\mathbf{I}\right)u\right|>\frac{t}{2}\right]\leq\exp\left(-\frac{n}{2^{11}c^{2}}\min(t,t^{2})\right).

For shorthand define 𝐌:=1niGX~iX~i\mathbf{M}:=\tfrac{1}{n}\sum_{i\in G^{\prime}}\widetilde{X}_{i}\widetilde{X}_{i}^{\top}, and let 𝒩\mathcal{N} be a maximal 14\tfrac{1}{4}-net of the unit ball (as measured in 2\ell_{2} distance). By Lemma 1.18 of [RH17], |𝒩|12d|\mathcal{N}|\leq 12^{d}, so by a union bound,

Pr[supu𝒩|u(𝐌𝐈)u|>t2]exp(3dn211c2min(t,t2)).\Pr\left[\sup_{u\in\mathcal{N}}\left|u^{\top}(\mathbf{M}-\mathbf{I})u\right|>\frac{t}{2}\right]\leq\exp\left(3d-\frac{n}{2^{11}c^{2}}\min(t,t^{2})\right).

Next, by a standard application of the triangle inequality (see e.g. Exercise 4.3.3, [Ver16])

supvdv2=1|v(𝐌𝐈)v|2supu𝒩|u(𝐌𝐈)u|t\sup_{\begin{subarray}{c}v\in\mathbb{R}^{d}\\ \left\lVert v\right\rVert_{2}=1\end{subarray}}\left|v^{\top}(\mathbf{M}-\mathbf{I})v\right|\leq 2\sup_{u\in\mathcal{N}}\left|u^{\top}(\mathbf{M}-\mathbf{I})u\right|\leq t

with probability at least 1-\exp\left(C_{1}d-C_{2}n\min(t,t^{2})\right) for appropriate C_{1}, C_{2}. The conclusion follows since the claimed statement is scale-invariant in v, so it suffices to show, as we have just done, that

Pr[supvdv𝚺=1|v(1niGXiXi𝚺)vtv𝚺v>0]exp(C1dC2nmin(t,t2)).\Pr\left[\sup_{\begin{subarray}{c}v\in\mathbb{R}^{d}\\ \left\lVert v\right\rVert_{\bm{\Sigma}}=1\end{subarray}}\left|v^{\top}\left(\frac{1}{n}\sum_{i\in G^{\prime}}X_{i}X_{i}^{\top}-\bm{\Sigma}\right)v\right|-tv^{\top}\bm{\Sigma}v>0\right]\leq\exp\left(C_{1}d-C_{2}n\min(t,t^{2})\right).

Corollary 3.

Let p2p\geq 2. Under Assumption 1, there are universal constants C1C_{1}, C2C_{2} with

Pr[1niGXiXi𝚺p>t𝚺p]exp(C1dC2nmin(t,t2)).\Pr\left[\left\lVert\frac{1}{n}\sum_{i\in G^{\prime}}X_{i}X_{i}^{\top}-\bm{\Sigma}\right\rVert_{p}>t\left\lVert\bm{\Sigma}\right\rVert_{p}\right]\leq\exp\left(C_{1}d-C_{2}n\min(t,t^{2})\right).
Proof.

Suppose the event in Lemma 9 does not hold, which happens with probability at least 1exp(C1dC2nmin(t,t2))1-\exp(C_{1}d-C_{2}n\min(t,t^{2})). Define for shorthand 𝐌:=1niGXiXi𝚺\mathbf{M}:=\tfrac{1}{n}\sum_{i\in G^{\prime}}X_{i}X_{i}^{\top}-\bm{\Sigma} and let its spectral decomposition be j[d]λjvjvj\sum_{j\in[d]}\lambda_{j}v_{j}v_{j}^{\top}. By the triangle inequality and Fact 2,

𝐌p\displaystyle\left\lVert\mathbf{M}\right\rVert_{p} j[d]|λj|p1𝐌pp1|vj(1niGXiXi𝚺)vj|\displaystyle\leq\sum_{j\in[d]}\frac{|\lambda_{j}|^{p-1}}{\left\lVert\mathbf{M}\right\rVert_{p}^{p-1}}\left|v_{j}^{\top}\left(\frac{1}{n}\sum_{i\in G^{\prime}}X_{i}X_{i}^{\top}-\bm{\Sigma}\right)v_{j}\right|
tj[d]|λj|p1𝐌pp1vj𝚺vj=tj[d]|λj|p1𝐌pp1vjvj,𝚺t𝚺p.\displaystyle\leq t\sum_{j\in[d]}\frac{|\lambda_{j}|^{p-1}}{\left\lVert\mathbf{M}\right\rVert_{p}^{p-1}}v_{j}^{\top}\bm{\Sigma}v_{j}=t\left\langle\sum_{j\in[d]}\frac{|\lambda_{j}|^{p-1}}{\left\lVert\mathbf{M}\right\rVert_{p}^{p-1}}v_{j}v_{j}^{\top},\bm{\Sigma}\right\rangle\leq t\left\lVert\bm{\Sigma}\right\rVert_{p}.

In the last inequality, we used that \sum_{j\in[d]}\tfrac{|\lambda_{j}|^{p-1}}{\left\lVert\mathbf{M}\right\rVert_{p}^{p-1}}v_{j}v_{j}^{\top} has unit Schatten-q norm, where q=\tfrac{p}{p-1} is the dual index to p, and applied Fact 2. ∎
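Concretely, the unit-norm computation underlying this last step (spelled out here for completeness, using (p-1)q=p) is

\left\lVert\sum_{j\in[d]}\frac{|\lambda_{j}|^{p-1}}{\left\lVert\mathbf{M}\right\rVert_{p}^{p-1}}v_{j}v_{j}^{\top}\right\rVert_{q}^{q}=\sum_{j\in[d]}\frac{|\lambda_{j}|^{(p-1)q}}{\left\lVert\mathbf{M}\right\rVert_{p}^{(p-1)q}}=\frac{\sum_{j\in[d]}|\lambda_{j}|^{p}}{\left\lVert\mathbf{M}\right\rVert_{p}^{p}}=1.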

A.2 Concentration under weightings in 𝔖ϵn\mathfrak{S}_{\epsilon}^{n}

We consider concentration of the empirical covariance under weightings which are not far from uniform, in spectral and Schatten senses.

Lemma 10.

Under Assumption 1, let δ[0,1]\delta\in[0,1], p2p\geq 2, and n=Ω(d+logδ1(ϵlogϵ1)2)n=\Omega\left(\tfrac{d+\log\delta^{-1}}{(\epsilon\log\epsilon^{-1})^{2}}\right) for a sufficiently large constant. Then for a universal constant C3C_{3},

Pr[w𝔖ϵn|iGwiXiXi𝚺p>C3ϵlogϵ1𝚺p]δ2.\Pr\left[\exists w\in\mathfrak{S}_{\epsilon}^{n}\;\Bigg{|}\;\left\lVert\sum_{i\in G^{\prime}}w_{i}X_{i}X_{i}^{\top}-\bm{\Sigma}\right\rVert_{p}>C_{3}\cdot\epsilon\log\epsilon^{-1}\left\lVert\bm{\Sigma}\right\rVert_{p}\right]\leq\frac{\delta}{2}.
Proof.

Because the vertices of \mathfrak{S}_{\epsilon}^{n} are the uniform distributions over sets S\subseteq G^{\prime} with |S|=(1-\epsilon)n (see e.g. Section 4.1 of [DKK+19]), by convexity of the Schatten-p norm it suffices to prove

Pr[S with |S|=(1ϵ)n|1(1ϵ)niSXiXi𝚺p>C3ϵlogϵ1𝚺p]δ4.\Pr\left[\exists S\text{ with }|S|=(1-\epsilon)n\;\Bigg{|}\;\left\lVert\frac{1}{(1-\epsilon)n}\sum_{i\in S}X_{i}X_{i}^{\top}-\bm{\Sigma}\right\rVert_{p}>C_{3}\cdot\epsilon\log\epsilon^{-1}\left\lVert\bm{\Sigma}\right\rVert_{p}\right]\leq\frac{\delta}{4}.

For any fixed SS, and recalling |Sc|=ϵn|S^{c}|=\epsilon n, we can decompose this sum as

1(1ϵ)niSXiXi=11ϵ(1niGXiXi)ϵ1ϵ(1|Sc|iScXiXi).\frac{1}{(1-\epsilon)n}\sum_{i\in S}X_{i}X_{i}^{\top}=\frac{1}{1-\epsilon}\left(\frac{1}{n}\sum_{i\in G^{\prime}}X_{i}X_{i}^{\top}\right)-\frac{\epsilon}{1-\epsilon}\left(\frac{1}{|S^{c}|}\sum_{i\in S^{c}}X_{i}X_{i}^{\top}\right). (18)

Applying Corollary 3 with t=\tfrac{1-\epsilon}{2}\cdot\epsilon\log\epsilon^{-1}, it follows from our choice of n that

Pr[1niGXiXi𝚺p>1ϵ2ϵlogϵ1𝚺p]δ4.\Pr\left[\left\lVert\frac{1}{n}\sum_{i\in G^{\prime}}X_{i}X_{i}^{\top}-\bm{\Sigma}\right\rVert_{p}>\frac{1-\epsilon}{2}\cdot\epsilon\log\epsilon^{-1}\left\lVert\bm{\Sigma}\right\rVert_{p}\right]\leq\frac{\delta}{4}. (19)

Moreover, for any fixed S^{c}, set t=\tfrac{1-\epsilon}{2}\cdot C_{3}\log\epsilon^{-1}, where C_{3} is a sufficiently large constant; for sufficiently small \epsilon this choice satisfies t=\min(t,t^{2}), so that

Pr[1ϵniScXiXi𝚺p>1ϵ2C3logϵ1𝚺p]\displaystyle\Pr\left[\left\lVert\frac{1}{\epsilon n}\sum_{i\in S^{c}}X_{i}X_{i}^{\top}-\bm{\Sigma}\right\rVert_{p}>\frac{1-\epsilon}{2}\cdot C_{3}\cdot\log\epsilon^{-1}\left\lVert\bm{\Sigma}\right\rVert_{p}\right] exp(C1dC2ϵnt)\displaystyle\leq\exp\left(C_{1}d-C_{2}\epsilon nt\right) (20)
exp((logδ1+nϵlogϵ1))\displaystyle\leq\exp\left(-\left(\log\delta^{-1}+n\epsilon\log\epsilon^{-1}\right)\right)
δ4(nϵn).\displaystyle\leq\frac{\delta}{4\binom{n}{\epsilon n}}.

Here, we used that \log\binom{n}{\epsilon n}=O\left(n\epsilon\log\epsilon^{-1}\right). Finally, a union bound over all possible sets S^{c} implies that with probability at least 1-\tfrac{\delta}{2}, the following events hold:

1niGXiXi𝚺p<1ϵ2ϵlogϵ1𝚺p,\displaystyle\left\lVert\frac{1}{n}\sum_{i\in G^{\prime}}X_{i}X_{i}^{\top}-\bm{\Sigma}\right\rVert_{p}<\frac{1-\epsilon}{2}\cdot\epsilon\log\epsilon^{-1}\left\lVert\bm{\Sigma}\right\rVert_{p},
1|Sc|iScXiXi𝚺p<1ϵ2C3logϵ1𝚺p for all S with |S|=(1ϵ)n.\displaystyle\left\lVert\frac{1}{|S^{c}|}\sum_{i\in S^{c}}X_{i}X_{i}^{\top}-\bm{\Sigma}\right\rVert_{p}<\frac{1-\epsilon}{2}\cdot C_{3}\cdot\log\epsilon^{-1}\left\lVert\bm{\Sigma}\right\rVert_{p}\text{ for all }S\text{ with }|S|=(1-\epsilon)n.

Combining these bounds in the context of (18) after applying the triangle inequality, we have with probability at least 1δ21-\tfrac{\delta}{2} for all SS the desired conclusion,

1(1ϵ)niSXiXi𝚺p<C3ϵlogϵ1𝚺p.\left\lVert\frac{1}{(1-\epsilon)n}\sum_{i\in S}X_{i}X_{i}^{\top}-\bm{\Sigma}\right\rVert_{p}<C_{3}\cdot\epsilon\log\epsilon^{-1}\left\lVert\bm{\Sigma}\right\rVert_{p}.

Corollary 4.

Under Assumption 1, let n=\Omega\left(\tfrac{d+\log\delta^{-1}}{(\epsilon\log\epsilon^{-1})^{2}}\right) for a sufficiently large constant. Then with probability at least 1-\tfrac{\delta}{2}, for a universal constant C_{3} and all w\in\mathfrak{S}_{\epsilon}^{n} simultaneously,

C3ϵlogϵ1𝚺iGwiXiXi𝚺C3ϵlogϵ1𝚺.C_{3}\cdot\epsilon\log\epsilon^{-1}\bm{\Sigma}\succeq\sum_{i\in G^{\prime}}w_{i}X_{i}X_{i}^{\top}-\bm{\Sigma}\succeq-C_{3}\cdot\epsilon\log\epsilon^{-1}\bm{\Sigma}.
Proof.

Consider any unit vector vdv\in\mathbb{R}^{d}. By similar arguments as in (19), (20), and applying a union bound over all SS with |S|=(1ϵ)n|S|=(1-\epsilon)n, with probability at least 1δ21-\tfrac{\delta}{2}, it follows from Lemma 9 that

|v(1niGXiXi𝚺)v|<1ϵ2ϵlogϵ1v𝚺v,\displaystyle\left|v^{\top}\left(\frac{1}{n}\sum_{i\in G^{\prime}}X_{i}X_{i}^{\top}-\bm{\Sigma}\right)v\right|<\frac{1-\epsilon}{2}\cdot\epsilon\log\epsilon^{-1}v^{\top}\bm{\Sigma}v, (21)
|v(1|Sc|iScXiXi𝚺)v|<1ϵ2C3logϵ1v𝚺v.\displaystyle\left|v^{\top}\left(\frac{1}{|S^{c}|}\sum_{i\in S^{c}}X_{i}X_{i}^{\top}-\bm{\Sigma}\right)v\right|<\frac{1-\epsilon}{2}\cdot C_{3}\cdot\log\epsilon^{-1}v^{\top}\bm{\Sigma}v\;. (22)

Therefore, again using the formula (18) and the triangle inequality yields the desired conclusion for all directions vv, which is equivalent to the spectral bound of the lemma statement. ∎

Appendix B Deferred proofs from Section 3

B.1 Robust univariate variance estimation

In this section, we prove Lemma 1, which allows us to robustly estimate the quadratic form of a vector in the covariance of a sub-Gaussian distribution from corrupted samples. Algorithm 5 is folklore and intuitively simple: it projects all samples onto u, throws away the 2\epsilon fraction of points with largest magnitude in this direction, and takes the mean of the remaining squared projections.

Algorithm 5 Univariate variance estimation: 𝟣𝖣𝖱𝗈𝖻𝗎𝗌𝗍𝖵𝖺𝗋𝗂𝖺𝗇𝖼𝖾({Xi}i[n],ϵ,u)\mathsf{1DRobustVariance}(\{X_{i}\}_{i\in[n]},\epsilon,u)
  Input: {Xi}i[n]\{X_{i}\}_{i\in[n]}, ϵ>0\epsilon>0, and a unit vector uu
  Let ai=Xi,u2a_{i}=\left\langle X_{i},u\right\rangle^{2} for i=1,,ni=1,\ldots,n
  Sort the aia_{i} in increasing order. WLOG assume a1a2ana_{1}\leq a_{2}\leq\ldots\leq a_{n}.
  return  σu2=1(12ϵ)ni=1(12ϵ)nai\sigma_{u}^{2}=\frac{1}{(1-2\epsilon)n}\sum_{i=1}^{(1-2\epsilon)n}a_{i}
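For concreteness, the following is a minimal NumPy sketch of Algorithm 5; the function name and array conventions are ours, not the paper's.

import numpy as np

def one_d_robust_variance(X, eps, u):
    # X: (n, d) array whose rows are the samples X_i (an eps fraction arbitrary).
    # u: unit vector in R^d. Returns the trimmed estimate sigma_u^2 of u^T Sigma u.
    a = np.sort((X @ u) ** 2)          # a_i = <X_i, u>^2, in increasing order
    m = int((1 - 2 * eps) * len(a))    # discard the 2*eps largest values
    return a[:m].mean()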

We require the following helper fact.

Fact 3.

Let Z be a sub-exponential random variable with parameter at most \lambda (we say mean-zero Z is sub-exponential with parameter \lambda if \forall|s|\leq\lambda^{-1}, \mathbb{E}[\exp(sZ)]\leq\exp(\tfrac{s^{2}\lambda^{2}}{2})), and let \epsilon\in[0,1]. Then, for any event E with \Pr[Z\in E]\leq\epsilon, we have |\mathbb{E}\left[Z\cdot\mathbf{1}[Z\in E]\right]|\leq 2\lambda\epsilon\log\epsilon^{-1}.

Proof.

We have by Hölder’s inequality that for any p,q1p,q\geq 1 with p1+q1=1p^{-1}+q^{-1}=1,

|𝔼[Z𝟏[ZE]]|𝔼[|Z|p]1/pϵ1/q2λpϵ1/q.\left|\mathbb{E}\left[Z\cdot\mathbf{1}[Z\in E]\right]\right|\leq\mathbb{E}[|Z|^{p}]^{1/p}\cdot\epsilon^{1/q}\leq 2\lambda p\cdot\epsilon^{1/q}.

The second inequality is Lemma 1.10 of [RH17]. Setting p=\log\epsilon^{-1} yields the result. ∎

See 1

Proof.

The runtime claim is immediate; we now turn our attention to correctness. We follow the notation of Assumption 1, and in a slight abuse of notation, also define a_{i}=\left\langle X_{i},u\right\rangle^{2} for i\in G^{\prime}. First, for X\sim\mathcal{D}, the random variable \left\langle u,X\right\rangle^{2}-u^{\top}\bm{\Sigma}u is sub-exponential with parameter at most 16cu^{\top}\bm{\Sigma}u (Lemma 1.12, [RH17]). By Bernstein’s inequality, for X\sim\mathcal{D} and all t\geq 1,

Pr[X,u2>32ctu𝚺u]exp(t).\Pr\left[\left\langle X,u\right\rangle^{2}>32ctu^{\top}\bm{\Sigma}u\right]\leq\exp(-t)\;. (23)

Using this in a standard Chernoff bound, we have that with probability 1δ21-\tfrac{\delta}{2},

|{iG:ai>64clogϵ1u𝚺u}|nϵ.\frac{\left|\{i\in G^{\prime}:a_{i}>64c\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u\}\right|}{n}\leq\epsilon\;. (24)

Let T=64clogϵ1u𝚺uT=64c\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u, and let YY be distributed as (u,X2u𝚺u)𝟏[u,X2T](\left\langle u,X\right\rangle^{2}-u^{\top}\bm{\Sigma}u)\cdot\mathbf{1}[\left\langle u,X\right\rangle^{2}\leq T], where X𝒟X\sim\mathcal{D}. We observe Y𝔼[Y]Y-\mathbb{E}[Y] is also sub-exponential with parameter 16cu𝚺u16cu^{\top}\bm{\Sigma}u, and that by Fact 3,

|𝔼[Y]|32cu𝚺uϵlogϵ1.|\mathbb{E}[Y]|\leq 32cu^{\top}\bm{\Sigma}u\epsilon\log\epsilon^{-1}. (25)

Define the interval I=[0,T]I=[0,T] and let SS be the set of points in [n][n] that survive the truncation procedure, so that σu2=1|S|iSai\sigma^{2}_{u}=\tfrac{1}{|S|}\sum_{i\in S}a_{i}. Given event (24), aiIa_{i}\in I for all iSi\in S, since there are at most ϵn\epsilon n points in GG outside II, and |B|ϵn|B|\leq\epsilon n. We decompose the deviation as follows:

iSai|S|u𝚺u\displaystyle\sum_{i\in S}a_{i}-|S|u^{\top}\bm{\Sigma}u =iGS(aiu𝚺u)+iBS(aiu𝚺u)\displaystyle=\sum_{i\in G\cap S}(a_{i}-u^{\top}\bm{\Sigma}u)+\sum_{i\in B\cap S}(a_{i}-u^{\top}\bm{\Sigma}u) (26)
=iGI(aiu𝚺u)+iBS(aiu𝚺u)\displaystyle=\sum_{i\in G^{\prime}\cap I}(a_{i}-u^{\top}\bm{\Sigma}u)+\sum_{i\in B\cap S}(a_{i}-u^{\top}\bm{\Sigma}u)
i(GG)I(aiu𝚺u)i(GI)S(aiu𝚺u).\displaystyle-\sum_{i\in(G^{\prime}\setminus G)\cap I}(a_{i}-u^{\top}\bm{\Sigma}u)-\sum_{i\in(G\cap I)\setminus S}(a_{i}-u^{\top}\bm{\Sigma}u).

Here we overload i\in I to mean that a_{i} lies in the interval I, and condition on S lying entirely in I. We bound each of these terms individually. First, for all i\in G^{\prime}\cap I, conditioned on (24) (i.e. all a_{i}\in I), a_{i}-u^{\top}\bm{\Sigma}u is distributed as an independent copy of Y. Thus, by (25) and Bernstein’s inequality,

|1|GI|iGI(aiu𝚺u)|\displaystyle\left|\frac{1}{|G^{\prime}\cap I|}\sum_{i\in G^{\prime}\cap I}(a_{i}-u^{\top}\bm{\Sigma}u)\right| |1|GI|iGI(aiu𝚺u)𝔼[Y]|+32cu𝚺uϵlogϵ1\displaystyle\leq\left|\frac{1}{|G^{\prime}\cap I|}\sum_{i\in G^{\prime}\cap I}(a_{i}-u^{\top}\bm{\Sigma}u)-\mathbb{E}[Y]\right|+32cu^{\top}\bm{\Sigma}u\epsilon\log\epsilon^{-1} (27)
64cu𝚺uϵlogϵ1,\displaystyle\leq 64c\cdot u^{\top}\bm{\Sigma}u\epsilon\log\epsilon^{-1},

with (conditional) probability at least 1δ21-\tfrac{\delta}{2}. By a union bound, both events occur with probability at least 1δ1-\delta; condition on this for the remainder of the proof. Under this assumption, we control the other three terms of (26). Observe that |BS|ϵn|B\cap S|\leq\epsilon n, |(GG)I|ϵn|(G^{\prime}\setminus G)\cap I|\leq\epsilon n, and |(GI)S|ϵn|(G\cap I)\setminus S|\leq\epsilon n. Further, by definition of II, every summand is at most 64clogϵ1u𝚺u64c\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u. Thus,

|iBS(aiu𝚺u)|\displaystyle\left|\sum_{i\in B\cap S}(a_{i}-u^{\top}\bm{\Sigma}u)\right| 64cϵnlogϵ1u𝚺u,\displaystyle\leq 64c\epsilon n\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u, (28)
|i(GG)I(aiu𝚺u)|\displaystyle\left|\sum_{i\in(G^{\prime}\setminus G)\cap I}(a_{i}-u^{\top}\bm{\Sigma}u)\right| 64cϵnlogϵ1u𝚺u,\displaystyle\leq 64c\epsilon n\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u, (29)
\displaystyle\left|\sum_{i\in(G\cap I)\setminus S}(a_{i}-u^{\top}\bm{\Sigma}u)\right| \displaystyle\leq 64c\epsilon n\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u. (30)

Combining (27), (28), (29), and (30) in derivation (26) and dividing by |S||S| yields the claim. ∎

Finally, we also give an alternative set of conditions under which we can certify correctness of \mathsf{1DRobustVariance}. Specifically, this assumption will be useful in lifting independence assumptions between u and our samples \{X_{i}\}_{i\in[n]} in repeated calls within Algorithm 6.

Assumption 2.

Under Assumption 1, let the following conditions hold for universal constant C4C_{4}:

C4ϵlogϵ1𝚺\displaystyle C_{4}\epsilon\log\epsilon^{-1}\cdot\bm{\Sigma} 1niGXiXi𝚺C4ϵlogϵ1𝚺,\displaystyle\succeq\frac{1}{n}\sum_{i\in G^{\prime}}X_{i}X_{i}^{\top}-\bm{\Sigma}\succeq-C_{4}\epsilon\log\epsilon^{-1}\cdot\bm{\Sigma}, (31)
C4logϵ1𝚺\displaystyle C_{4}\log\epsilon^{-1}\cdot\bm{\Sigma} iGwi(XiXi𝚺)C4logϵ1𝚺for all w𝔖1ϵn.\displaystyle\succeq\sum_{i\in G^{\prime}}w_{i}\left(X_{i}X_{i}^{\top}-\bm{\Sigma}\right)\succeq-C_{4}\log\epsilon^{-1}\cdot\bm{\Sigma}\;\mbox{for all $w\in\mathfrak{S}_{1-\epsilon}^{n}$}. (32)

Note that (32) is a factor \epsilon weaker in its guarantee than Corollary 4, and is over weights in a different set \mathfrak{S}_{1-\epsilon}^{n}. Standard sub-Gaussian concentration (i.e. an unweighted variant of Corollary 4), together with a modification of the proof of Corollary 4 that takes the constraint set \mathfrak{S}_{1-\epsilon}^{n} and normalizes over vertex sets of size \epsilon n, yields the following conclusion.

Lemma 11.

Let n=Ω(d+logδ1(ϵlogϵ1)2)n=\Omega\left(\tfrac{d+\log\delta^{-1}}{(\epsilon\log\epsilon^{-1})^{2}}\right) for a sufficiently large constant. Assumption 2 holds with probability at least 1δ21-\tfrac{\delta}{2}.

We give a variant of Lemma 1 with slightly stronger guarantees for 𝟣𝖣𝖱𝗈𝖻𝗎𝗌𝗍𝖵𝖺𝗋𝗂𝖺𝗇𝖼𝖾\mathsf{1DRobustVariance}; specifically, it holds for all uu simultaneously for a fixed set of samples satisfying Assumption 2.

Corollary 5.

Under Assumption 2, Algorithm 5 outputs σu2\sigma_{u}^{2} with |u𝚺uσu2|<Cu𝚺uϵlogϵ1|u^{\top}\bm{\Sigma}u-\sigma_{u}^{2}|<Cu^{\top}\bm{\Sigma}u\cdot\epsilon\log\epsilon^{-1}, for CC a fixed multiple of the parameter cc in Assumption 1, and runs in time O(nd+nlogn)O(nd+n\log n).

Proof.

We discuss how to modify the derivations from Lemma 1 appropriately in the absence of applications of Bernstein’s inequality. First, note that appropriately combining (31) and (32) in a derivation such as (18) yields the following bound (deterministically under Assumption 2):

C4ϵlogϵ1𝚺iGwi(XiXi𝚺)C4ϵlogϵ1𝚺for all w𝔖3ϵn.C_{4}\epsilon\log\epsilon^{-1}\cdot\bm{\Sigma}\succeq\sum_{i\in G^{\prime}}w_{i}\left(X_{i}X_{i}^{\top}-\bm{\Sigma}\right)\succeq-C_{4}\epsilon\log\epsilon^{-1}\bm{\Sigma}\;\mbox{for all $w\in\mathfrak{S}_{3\epsilon}^{n}$}. (33)

Now, consider the decomposition (26). We claim first that similarly to (28), (29), (30) we can bound each summand in the latter three terms by O(u𝚺ulogϵ1)O(u^{\top}\bm{\Sigma}u\log\epsilon^{-1}); to prove this, it suffices to show that at least one filtered aia_{i} attains this bound, as then by definition of the algorithm, each non-filtered aia_{i} will as well. Note that a fraction between ϵ\epsilon and 2ϵ2\epsilon of points in GGG\subset G^{\prime} is filtered (since there are only ϵn\epsilon n points from BB). The assumption (32) then implies precisely the desired bound on some filtered aia_{i} by placing uniform mass on filtered points from GG, and applying pigeonhole. So, all non-filtered aia_{i} are bounded by O(u𝚺ulogϵ1)O(u^{\top}\bm{\Sigma}u\log\epsilon^{-1}), yielding analogous statements to (28), (29), (30).

Finally, an analogous derivation to (27) follows via an application of the bound (33), where we place uniform mass on the set GIG^{\prime}\cap I and adjust constants appropriately, since the above argument shows that under the assumption (32), we have that at most 2ϵn2\epsilon n indices iGi\in G^{\prime} have aiIa_{i}\not\in I. ∎

B.2 Preliminaries

For convenience, we give the following preliminaries before embarking on our proof of Theorem 1 and giving guarantees on Algorithm 6. First, we state a set of assumptions which augments Assumption 2 with one additional condition, used in bounding the iteration count of our algorithm.

Assumption 3.

Under Assumption 1, let Assumption 2 hold, as well as the following additional condition for the same universal constant C4C_{4}:

Xi22C4lognδTr(𝚺)for all iG.\left\lVert X_{i}\right\rVert_{2}^{2}\leq C_{4}\log\frac{n}{\delta}\cdot\textup{Tr}(\bm{\Sigma})\;\mbox{for all $i\in G$}. (34)

Standard sub-Gaussian concentration inequalities and a union bound, combined with our earlier claim Lemma 11, then yield the following guarantee.

Lemma 12.

Let n=Ω(d+logδ1(ϵlogϵ1)2)n=\Omega\left(\tfrac{d+\log\delta^{-1}}{(\epsilon\log\epsilon^{-1})^{2}}\right) for a sufficiently large constant. Assumption 3 holds with probability at least 1δ1-\delta.

B.3 Analysis of 𝖯𝖢𝖠𝖥𝗂𝗅𝗍𝖾𝗋\mathsf{PCAFilter}

For this section, for any nonnegative weights ww, define 𝐌(w):=i[n]wiXiXi\mathbf{M}(w):=\sum_{i\in[n]}w_{i}X_{i}X_{i}^{\top}. We now state our algorithm, 𝖯𝖢𝖠𝖥𝗂𝗅𝗍𝖾𝗋\mathsf{PCAFilter}. At all iterations tt, it maintains a current nonnegative weight vector w(t)w^{(t)} (initialized to be the uniform distribution on [n][n]), preserving the following invariants for all tt:

wi(t1)wi(t) for all i[n],iBwi(t1)wi(t)iGwi(t1)wi(t).\displaystyle w_{i}^{(t-1)}\geq w_{i}^{(t)}\text{ for all }i\in[n],\;\sum_{i\in B}w^{(t-1)}_{i}-w^{(t)}_{i}\geq\sum_{i\in G}w^{(t-1)}_{i}-w^{(t)}_{i}. (35)

We now state our method as Algorithm 6; note that the update to w(t)w^{(t)} is of the form in Lemma 2.

Algorithm 6 𝖯𝖢𝖠𝖥𝗂𝗅𝗍𝖾𝗋({Xi}i[n],ϵ)\mathsf{PCAFilter}(\{X_{i}\}_{i\in[n]},\epsilon)
1:  Remove all points i[n]i\in[n] with Xi22>clog(nδ)Tr(𝚺)\left\lVert X_{i}\right\rVert_{2}^{2}>c\log(\tfrac{n}{\delta})\cdot\textup{Tr}(\bm{\Sigma})
2:  wi(0)1nw^{(0)}_{i}\leftarrow\tfrac{1}{n} for all i[n]i\in[n], t1t\leftarrow 1
3:  u1u_{1}\leftarrow approximate top eigenvector of 𝐌(w(0))\mathbf{M}(w^{(0)})
4:  σ12𝟣𝖣𝖱𝗈𝖻𝗎𝗌𝗍𝖵𝖺𝗋𝗂𝖺𝗇𝖼𝖾({Xi}i[n],ϵ,u1)\sigma_{1}^{2}\leftarrow\mathsf{1DRobustVariance}(\{X_{i}\}_{i\in[n]},\epsilon,u_{1})
5:  while ut𝐌(w(t1))ut>(1+5C5ϵlogϵ1)σt2u_{t}^{\top}\mathbf{M}(w^{(t-1)})u_{t}>(1+5C_{5}\epsilon\log\epsilon^{-1})\sigma_{t}^{2}, where C5=max(C,C4)C_{5}=\max(C,C_{4}) from constants in Assumption 2, Corollary 5 do
6:     aiut,Xi2a_{i}\leftarrow\left\langle u_{t},X_{i}\right\rangle^{2} for all i[n]i\in[n]
7:     Sort (permute) the indices [n][n] so the aia_{i} are in increasing order (with a1a_{1} smallest, ana_{n} largest)
8:     Let \ell be the largest index with i=nwi2ϵ\sum_{i=\ell}^{n}w_{i}\geq 2\epsilon
9:     Define
wi(t){(1aian)wi(t1)inwi(t1)i<w_{i}^{(t)}\leftarrow\begin{cases}\left(1-\frac{a_{i}}{a_{n}}\right)w_{i}^{(t-1)}&\ell\leq i\leq n\\ w_{i}^{(t-1)}&i<\ell\end{cases}
10:     u_{t+1}\leftarrow approximate top eigenvector of \mathbf{M}(w^{(t)})
11:     \sigma_{t+1}^{2}\leftarrow\mathsf{1DRobustVariance}(\{X_{i}\}_{i\in[n]},\epsilon,u_{t+1})
12:     tt+1t\leftarrow t+1
13:  end while
14:  return  utu_{t}

We assume that in Line 8, we also have \sum_{i=\ell}^{n}w_{i}\leq 3\epsilon: we may assume at least one point is corrupted, i.e. \epsilon\geq\tfrac{1}{n} (else standard algorithms suffice for our setting), so including one additional w_{i} changes the sum by at most \tfrac{1}{n}\leq\epsilon. We first prove the invariants (35) are preserved; at a high level, we simply demonstrate that Lemma 2 holds via concentration on G and lack of termination.
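Before the analysis, we record a hedged NumPy sketch of the main loop of Algorithm 6. The Line 1 preprocessing is omitted, an exact eigensolver stands in for approximate power iteration, the constant C5 is illustrative, and one_d_robust_variance is the sketch following Algorithm 5.

import numpy as np

def pca_filter(X, eps, C5=10.0):
    # X: (n, d) array of eps-corrupted samples. Terminates by the iteration
    # bound proven in Theorem 1; constants here are placeholders.
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    threshold = 1.0 + 5.0 * C5 * eps * np.log(1.0 / eps)
    while True:
        M = (X * w[:, None]).T @ X                  # M(w) = sum_i w_i X_i X_i^T
        u = np.linalg.eigh(M)[1][:, -1]             # top eigenvector (exact here)
        sigma2 = one_d_robust_variance(X, eps, u)
        if u @ M @ u <= threshold * sigma2:         # termination test (Line 5)
            return u
        scores = (X @ u) ** 2                       # a_i = <u, X_i>^2
        order = np.argsort(scores)                  # sort scores increasingly
        X, w, scores = X[order], w[order], scores[order]
        suffix = np.cumsum(w[::-1])[::-1]           # suffix[l] = sum_{i >= l} w_i
        ell = np.where(suffix >= 2 * eps)[0].max()  # largest l with >= 2*eps mass
        w[ell:] *= 1.0 - scores[ell:] / scores[-1]  # soft downweighting (Line 9)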

Lemma 13.

Under Assumption 2, for any iteration tt of Algorithm 6, suppose (35) held for all iterations tt1t^{\prime}\leq t-1. Then, (35) holds at iteration tt.

Proof.

The first part of (35) is immediate by observing the update in Line 9, so we show the second. We drop subscripts and superscripts for conciseness and focus on a single iteration tt. Let IB={,,n}BI_{B}=\{\ell,\ldots,n\}\cap B, and IG={,,n}GI_{G}=\{\ell,\ldots,n\}\cap G. By Lemma 2, it suffices to demonstrate that

iIBwiai>iIGwiai.\sum_{i\in I_{B}}w_{i}a_{i}>\sum_{i\in I_{G}}w_{i}a_{i}. (36)

First, iIBwiϵ\sum_{i\in I_{B}}w_{i}\leq\epsilon, so by definition of index \ell, we have ϵiIGwi2ϵ\epsilon\leq\sum_{i\in I_{G}}w_{i}\leq 2\epsilon. Define w~i=wiiIGwi\widetilde{w}_{i}=\tfrac{w_{i}}{\sum_{i\in I_{G}}w_{i}} if iIGi\in I_{G}, and 0 otherwise, and observe w~𝔖12ϵn\widetilde{w}\in\mathfrak{S}_{1-2\epsilon}^{n}. By modifying constants appropriately from (32), it follows from definition of ai=uXiXiua_{i}=u^{\top}X_{i}X_{i}^{\top}u that

iIGwiai(iIGwi)C4logϵ1u𝚺u2C4ϵlogϵ1u𝚺u.\sum_{i\in I_{G}}w_{i}a_{i}\leq\left(\sum_{i\in I_{G}}w_{i}\right)\cdot C_{4}\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u\leq 2C_{4}\epsilon\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u. (37)

On the other hand, by (33) we know that the total quadratic form over GG is bounded as

iGwiai<(iGwi)(1+C4ϵlogϵ1)u𝚺u<(1+C4ϵlogϵ1)u𝚺u.\sum_{i\in G}w_{i}a_{i}<\left(\sum_{i\in G}w_{i}\right)\left(1+C_{4}\epsilon\log\epsilon^{-1}\right)u^{\top}\bm{\Sigma}u<\left(1+C_{4}\epsilon\log\epsilon^{-1}\right)u^{\top}\bm{\Sigma}u. (38)

Here, we applied the observation that the normalized w_{i} restricted to G are in \mathfrak{S}^{n}_{3\epsilon} (e.g. using Lemma 14 inductively). However, since we did not terminate (Line 5), we must have by u_{t} being a top eigenvector and Corollary 5 (we defer discussions of inexactness to Theorem 1) that

i[n]wiai(1+5C5ϵlogϵ1)σt2(1+4C4ϵlogϵ1)u𝚺u\displaystyle\sum_{i\in[n]}w_{i}a_{i}\geq(1+5C_{5}\epsilon\log\epsilon^{-1})\sigma_{t}^{2}\geq(1+4C_{4}\epsilon\log\epsilon^{-1})\cdot u^{\top}\bm{\Sigma}u
iBwiai>3C4ϵlogϵ1u𝚺u.\displaystyle\implies\sum_{i\in B}w_{i}a_{i}>3C_{4}\epsilon\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u.

To obtain the last conclusion, we used (38). Finally, note that for all iBIBi\in B\setminus I_{B},

aiaiIGw~iaiC4logϵ1u𝚺ua_{i}\leq a_{\ell}\leq\sum_{i\in I_{G}}\widetilde{w}_{i}a_{i}\leq C_{4}\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u

by rearranging (37). This implies that

iBIBwiai(iBIBwi)C4logϵ1u𝚺uC4ϵlogϵ1u𝚺u.\sum_{i\in B\setminus I_{B}}w_{i}a_{i}\leq\left(\sum_{i\in B\setminus I_{B}}w_{i}\right)\cdot C_{4}\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u\leq C_{4}\epsilon\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u.

Thus, the desired inequality (36) follows from combining the above derivations, e.g. using (37) and

iIBwiai=iBwiaiiBIBwiai>2C4ϵlogϵ1u𝚺u.\sum_{i\in I_{B}}w_{i}a_{i}=\sum_{i\in B}w_{i}a_{i}-\sum_{i\in B\setminus I_{B}}w_{i}a_{i}>2C_{4}\epsilon\log\epsilon^{-1}\cdot u^{\top}\bm{\Sigma}u.

Lemma 13 yields for all tt that iBwi(0)wi(t)iGwi(0)wi(t)\sum_{i\in B}w^{(0)}_{i}-w^{(t)}_{i}\geq\sum_{i\in G}w^{(0)}_{i}-w^{(t)}_{i} by telescoping. Note that we can only remove at most 2ϵ2\epsilon mass from ww total, as iBwi(0)wi(t)ϵ\sum_{i\in B}w^{(0)}_{i}-w^{(t)}_{i}\leq\epsilon. Denote for shorthand normalized weights v(t):=w(t)w(t)1v^{(t)}:=\tfrac{w^{(t)}}{\left\lVert w^{(t)}\right\rVert_{1}}. Then, the following is immediate by w(t)112ϵ\left\lVert w^{(t)}\right\rVert_{1}\geq 1-2\epsilon.

Lemma 14.

Under Assumption 2, in all iterations tt of Algorithm 6, v(t)𝔖2ϵnv^{(t)}\in\mathfrak{S}_{2\epsilon}^{n}.

Using Lemma 14, we show that the output has the desired quality of being a large eigenvector.

Lemma 15.

Under Assumption 2, let the output of Algorithm 6 be uTu_{T}. Then for a universal constant CC^{\star}, uT𝚺uT(1Cϵlogϵ1)𝚺u_{T}^{\top}\bm{\Sigma}u_{T}\geq(1-C^{\star}\epsilon\log\epsilon^{-1})\|\bm{\Sigma}\|_{\infty}.

Proof.

We assume for now that uTu_{T} is an exact top eigenvector, and discuss inexactness while proving Theorem 1. By (33) and Lemma 14, as then the normalized restriction of w(T)w^{(T)} to GG is in 𝔖3ϵn\mathfrak{S}_{3\epsilon}^{n},

𝐌(w(T))iGwi(T)XiXi(12C4ϵlogϵ1)𝚺\displaystyle\mathbf{M}(w^{(T)})\succeq\sum_{i\in G}w^{(T)}_{i}X_{i}X_{i}^{\top}\succeq\left(1-2C_{4}\epsilon\log\epsilon^{-1}\right)\bm{\Sigma}
uT𝐌(w(T))uT(12C4ϵlogϵ1)𝚺.\displaystyle\implies u_{T}^{\top}\mathbf{M}(w^{(T)})u_{T}\geq\left(1-2C_{4}\epsilon\log\epsilon^{-1}\right)\left\lVert\bm{\Sigma}\right\rVert_{\infty}.

We used the Courant-Fischer characterization of eigenvalues, and that uTu_{T} is a top eigenvector of 𝐌(w(T))\mathbf{M}(w^{(T)}). Moreover, by termination conditions and Corollary 5 (correctness of 𝟣𝖣𝖱𝗈𝖻𝗎𝗌𝗍𝖵𝖺𝗋𝗂𝖺𝗇𝖼𝖾\mathsf{1DRobustVariance}),

(1+Cϵlogϵ1)uT𝚺uTσT2(1+5C5ϵlogϵ1)1uT𝐌(w(T))uT.(1+C\epsilon\log\epsilon^{-1})u_{T}^{\top}\bm{\Sigma}u_{T}\geq\sigma_{T}^{2}\geq(1+5C_{5}\epsilon\log\epsilon^{-1})^{-1}u_{T}^{\top}\mathbf{M}(w^{(T)})u_{T}.

Combining these two bounds and rescaling yields the conclusion. ∎

Finally, we prove our main guarantee about Algorithm 6. See 1

Proof.

First, we will operate under Assumption 3, which holds with probability at least 1-\delta. It is clear that the analyses of Lemmas 13 and 15 hold with 1-\Theta(\epsilon\log\epsilon^{-1})-multiplicative approximations of the top eigenvector computation, which the power method achieves with high probability. Thus, each iteration takes time O\left(\frac{nd}{\epsilon}\log\frac{n}{\delta\epsilon}\right), where we will union bound over the number of iterations. We now give an iteration bound: in any iteration where we do not terminate, Lemma 2 implies

i=1nwi(t1)wi(t)\displaystyle\sum_{i=1}^{n}w^{(t-1)}_{i}-w^{(t)}_{i} 12maxi[n]ut,Xi2i=nwiai\displaystyle\geq\frac{1}{2\max_{i\in[n]}\left\langle u_{t},X_{i}\right\rangle^{2}}\sum_{i=\ell}^{n}w_{i}a_{i}
12C4lognδTr(𝚺)i=nwiai\displaystyle\geq\frac{1}{2C_{4}\log\frac{n}{\delta}\cdot\textup{Tr}(\bm{\Sigma})}\sum_{i=\ell}^{n}w_{i}a_{i}
12C4lognδTr(𝚺)(i=nwii[n]wi)i[n]wiai\displaystyle\geq\frac{1}{2C_{4}\log\frac{n}{\delta}\cdot\textup{Tr}(\bm{\Sigma})}\left(\frac{\sum_{i=\ell}^{n}w_{i}}{\sum_{i\in[n]}w_{i}}\right)\sum_{i\in[n]}w_{i}a_{i}
=Ω(ϵ𝚺lognδTr(𝚺))=Ω(ϵdlognδ).\displaystyle=\Omega\left(\epsilon\cdot\frac{\left\lVert\bm{\Sigma}\right\rVert_{\infty}}{\log\frac{n}{\delta}\cdot\textup{Tr}(\bm{\Sigma})}\right)=\Omega\left(\frac{\epsilon}{d\log\frac{n}{\delta}}\right).

Here, the second line used Assumption 3, the third used that the aia_{i} are in sorted order, and the last used the definition of \ell as well as the derivations of Lemma 15 (specifically, that 𝐌(w)\mathbf{M}(w) spectrally dominates (1O(ϵlogϵ1))𝚺(1-O(\epsilon\log\epsilon^{-1}))\bm{\Sigma} for roughly uniform ww). The conclusion follows since there can be at most O(dlognδ)O(d\log\tfrac{n}{\delta}) iterations, as the algorithm terminates when a 2ϵ2\epsilon fraction of the mass is removed, giving the overall runtime claim. ∎

Appendix C Deferred proofs from Section 4

C.1 Proofs from Section 4.2

Since our notion of approximation is multiplicative, we can assume without more than constant loss that 𝐀\mathbf{A} has bounded entries. This observation is standard, and formalized in the following lemma.

Lemma 16 (Entrywise bounds on 𝐀\mathbf{A}).

Feasibility of Problem 2 is unaffected (up to constants in ϵ\epsilon) by removing columns of 𝐀\mathbf{A} with entries larger than nϵ1n\epsilon^{-1}.

Proof.

If \mathbf{A}_{ji}>n\epsilon^{-1} for some entry, then any feasible x must have x_{i}\leq\tfrac{\epsilon(1+\epsilon)}{n}, since otherwise \left\lVert\mathbf{A}x\right\rVert_{p} is already larger than 1+\epsilon. Ignoring all such entries of x and rescaling can only change the objective by a 1+O(\epsilon) factor. ∎

See 3

Proof.

Fix an iteration tt. Define δ=ηgt\delta=\eta g_{t}, and note wt+1=wt+δwtw_{t+1}=w_{t}+\delta\circ w_{t}; henceforth in this proof, we will drop subscripts tt when clear. Observe that

𝐀wt+1p=𝐀((1+δ)w)p=(j[d][𝐀w]jp(1+[𝐀(δwt)]j[𝐀wt]j)p)1/p.\left\lVert\mathbf{A}w_{t+1}\right\rVert_{p}=\left\lVert\mathbf{A}((1+\delta)\circ w)\right\rVert_{p}=\left(\sum_{j\in[d]}[\mathbf{A}w]_{j}^{p}\left(1+\frac{[\mathbf{A}(\delta\circ w_{t})]_{j}}{[\mathbf{A}w_{t}]_{j}}\right)^{p}\right)^{1/p}.

As g𝟏δp1𝟏g\leq\mathbf{1}\implies\delta\leq p^{-1}\mathbf{1}, 𝐀(δwt)𝐀wtp1\frac{\mathbf{A}(\delta\circ w_{t})}{\mathbf{A}w_{t}}\leq p^{-1} entrywise. Via (1+x)pexp(px)1+px+p2x2(1+x)^{p}\leq\exp(px)\leq 1+px+p^{2}x^{2} for xp1x\leq p^{-1}, it follows that

𝐀((1+δ)w)p(j[d][𝐀w]jp(1+p[𝐀(δw)]j[𝐀w]j+(p[𝐀(δw)]j[𝐀w]j)2))1/p.\left\lVert\mathbf{A}((1+\delta)\circ w)\right\rVert_{p}\leq\left(\sum_{j\in[d]}[\mathbf{A}w]_{j}^{p}\left(1+\frac{p[\mathbf{A}(\delta\circ w)]_{j}}{[\mathbf{A}w]_{j}}+\left(\frac{p[\mathbf{A}(\delta\circ w)]_{j}}{[\mathbf{A}w]_{j}}\right)^{2}\right)\right)^{1/p}.

By direct manipulation of the above quantity, and recalling we defined v=𝐀w𝐀wpv=\tfrac{\mathbf{A}w}{\left\lVert\mathbf{A}w\right\rVert_{p}},

(j[d]([𝐀w]jp+p[𝐀w]jp1[𝐀(δw)]j+p2[𝐀w]jp2[𝐀(δw)]j2))1/p\displaystyle\left(\sum_{j\in[d]}\left([\mathbf{A}w]_{j}^{p}+p[\mathbf{A}w]_{j}^{p-1}[\mathbf{A}(\delta\circ w)]_{j}+p^{2}[\mathbf{A}w]_{j}^{p-2}[\mathbf{A}(\delta\circ w)]_{j}^{2}\right)\right)^{1/p}
=(𝐀wppj[d](vjp+pvjp1[𝐀(δw)]j𝐀wp+p2vjp2([𝐀(δw)]j𝐀wp)2))1/p\displaystyle=\left(\left\lVert\mathbf{A}w\right\rVert_{p}^{p}\sum_{j\in[d]}\left(v_{j}^{p}+pv_{j}^{p-1}\frac{[\mathbf{A}(\delta\circ w)]_{j}}{\left\lVert\mathbf{A}w\right\rVert_{p}}+p^{2}v_{j}^{p-2}\left(\frac{[\mathbf{A}(\delta\circ w)]_{j}}{\left\lVert\mathbf{A}w\right\rVert_{p}}\right)^{2}\right)\right)^{1/p}
\displaystyle=\left\lVert\mathbf{A}w\right\rVert_{p}\left(1+\sum_{j\in[d]}\left(pv_{j}^{p-1}\frac{[\mathbf{A}(\delta\circ w)]_{j}}{\left\lVert\mathbf{A}w\right\rVert_{p}}+p^{2}v_{j}^{p-2}\left(\frac{[\mathbf{A}(\delta\circ w)]_{j}}{\left\lVert\mathbf{A}w\right\rVert_{p}}\right)^{2}\right)\right)^{1/p}.

Using (1+x)p>1+px(1+x)^{p}>1+px, i.e. (1+px)1/p<1+x(1+px)^{1/p}<1+x, we thus obtain

𝐀((1+δ)w)p𝐀wp+vp1,𝐀(δw)+pvp1,(𝐀(δw))2𝐀w.\left\lVert\mathbf{A}((1+\delta)\circ w)\right\rVert_{p}\leq\left\lVert\mathbf{A}w\right\rVert_{p}+\left\langle v^{p-1},\mathbf{A}(\delta\circ w)\right\rangle+p\left\langle v^{p-1},\frac{(\mathbf{A}(\delta\circ w))^{2}}{\mathbf{A}w}\right\rangle.

Cauchy-Schwarz yields that [𝐀(δw)]j2[𝐀(δ2w)]j[𝐀w]j[\mathbf{A}(\delta\circ w)]_{j}^{2}\leq[\mathbf{A}(\delta^{2}\circ w)]_{j}[\mathbf{A}w]_{j}, j[d]\forall j\in[d]. Substituting into the above,

𝐀((1+δ)w)p\displaystyle\left\lVert\mathbf{A}((1+\delta)\circ w)\right\rVert_{p} 𝐀wp+vp1,𝐀(δw)+pvp1,𝐀(δ2w)\displaystyle\leq\left\lVert\mathbf{A}w\right\rVert_{p}+\left\langle v^{p-1},\mathbf{A}(\delta\circ w)\right\rangle+p\left\langle v^{p-1},\mathbf{A}(\delta^{2}\circ w)\right\rangle (39)
=𝐀wp+j[d][𝐀vp1]jδjwj(1+pδj).\displaystyle=\left\lVert\mathbf{A}w\right\rVert_{p}+\sum_{j\in[d]}\left[\mathbf{A}^{\top}v^{p-1}\right]_{j}\delta_{j}w_{j}(1+p\delta_{j}).

Finally, to bound this latter quantity, since δ=ηg\delta=\eta g, we observe that for all jj either δj=0\delta_{j}=0 or 1+pδj=1+gj=2[𝐀vp1]j1+p\delta_{j}=1+g_{j}=2-[\mathbf{A}^{\top}v^{p-1}]_{j}, in which case

[𝐀vp1]j(1+pδj)=[𝐀vp1]j(2[𝐀vp1]j)1.\left[\mathbf{A}^{\top}v^{p-1}\right]_{j}(1+p\delta_{j})=\left[\mathbf{A}^{\top}v^{p-1}\right]_{j}\left(2-\left[\mathbf{A}^{\top}v^{p-1}\right]_{j}\right)\leq 1.

Thus, plugging this bound into (39) entrywise,

𝐀((1+δ)w)p𝐀wpj[d]δjwj[𝐀vp1]j(1+pδj)j[d]δjwj=wt+11wt1.\left\lVert\mathbf{A}((1+\delta)\circ w)\right\rVert_{p}-\left\lVert\mathbf{A}w\right\rVert_{p}\leq\sum_{j\in[d]}\delta_{j}w_{j}\left[\mathbf{A}^{\top}v^{p-1}\right]_{j}(1+p\delta_{j})\leq\sum_{j\in[d]}\delta_{j}w_{j}=\left\lVert w_{t+1}\right\rVert_{1}-\left\lVert w_{t}\right\rVert_{1}.

Rearranging yields the desired claim. ∎
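To make the update concrete, here is a single step of this scalar scheme in NumPy, with \eta=p^{-1}; this is a sketch under our own naming conventions, not the full algorithm with its termination conditions.

import numpy as np

def lp_packing_step(A, w, p):
    # A: (d, n) nonnegative matrix; w: current nonnegative iterate in R^n.
    Aw = A @ w
    v = Aw / np.linalg.norm(Aw, ord=p)                # v = Aw / ||Aw||_p
    g = np.maximum(0.0, 1.0 - A.T @ (v ** (p - 1)))   # truncated gradient, g <= 1
    return w * (1.0 + g / p)                          # w <- (1 + eta*g) o w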

C.2 Proofs from Section 4.3

Our analysis of Algorithm 3 will use the following helper fact.

Lemma 17 (Spectral bounds on {𝐀i}i[n]\{\mathbf{A}_{i}\}_{i\in[n]}).

Feasibility of Problem 3 is unaffected (up to constants in ϵ\epsilon) by removing matrices 𝐀i\mathbf{A}_{i} with an eigenvalue larger than nϵ1n\epsilon^{-1}.

Proof.

The proof is identical to Lemma 16; we also require the additional fact that the Schatten norm p\left\lVert\cdot\right\rVert_{p} is monotone in the Loewner order, forcing the constraint xiϵ(1+ϵ)nx_{i}\leq\tfrac{\epsilon(1+\epsilon)}{n}. ∎

We remark that we can perform this preprocessing procedure via power iteration on each 𝐀i\mathbf{A}_{i}.
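For instance, a sketch of this preprocessing follows; the iteration count and function name are illustrative.

import numpy as np

def top_eigenvalue(A, iters=64, seed=0):
    # Power-iteration estimate of the largest eigenvalue of a PSD matrix A.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    return x @ (A @ x)                 # Rayleigh quotient at the final iterate

One then keeps only those \mathbf{A}_{i} whose estimated top eigenvalue is at most n\epsilon^{-1}.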

See 4

Proof.

Drop tt and define δ=ηg\delta=\eta g. For simplicity, define the matrices

𝐌0:=i[n]wi𝐀i,𝐌1:=i[n]δiwi𝐀i,𝐌2:=i[n]δi2wi𝐀i.\mathbf{M}_{0}:=\sum_{i\in[n]}w_{i}\mathbf{A}_{i},\;\mathbf{M}_{1}:=\sum_{i\in[n]}\delta_{i}w_{i}\mathbf{A}_{i},\;\mathbf{M}_{2}:=\sum_{i\in[n]}\delta_{i}^{2}w_{i}\mathbf{A}_{i}.

We recall the Lieb-Thirring inequality Tr((𝐀𝐁𝐀)p)Tr(𝐀2p𝐁p)\textup{Tr}((\mathbf{A}\mathbf{B}\mathbf{A})^{p})\leq\textup{Tr}(\mathbf{A}^{2p}\mathbf{B}^{p}). Applying this, we have

𝐌0+𝐌1pp=Tr((𝐌0+𝐌1)p)Tr(𝐌0p(𝐈+𝐌012𝐌1𝐌012)p).\left\lVert\mathbf{M}_{0}+\mathbf{M}_{1}\right\rVert_{p}^{p}=\textup{Tr}\left(\left(\mathbf{M}_{0}+\mathbf{M}_{1}\right)^{p}\right)\leq\textup{Tr}\left(\mathbf{M}_{0}^{p}\left(\mathbf{I}+\mathbf{M}_{0}^{-\frac{1}{2}}\mathbf{M}_{1}\mathbf{M}_{0}^{-\frac{1}{2}}\right)^{p}\right).

As g𝟏g\leq\mathbf{1}, we have 𝐌012𝐌1𝐌012p1𝐈\mathbf{M}_{0}^{-\frac{1}{2}}\mathbf{M}_{1}\mathbf{M}_{0}^{-\frac{1}{2}}\preceq p^{-1}\mathbf{I}. Applying the bounds (𝐈+𝐌)pexp(p𝐌)𝐈+p𝐌+p2𝐌2(\mathbf{I}+\mathbf{M})^{p}\preceq\exp(p\mathbf{M})\preceq\mathbf{I}+p\mathbf{M}+p^{2}\mathbf{M}^{2} for 𝐌=𝐌012𝐌1𝐌012\mathbf{M}=\mathbf{M}_{0}^{-\frac{1}{2}}\mathbf{M}_{1}\mathbf{M}_{0}^{-\frac{1}{2}}, where we use that 𝐈\mathbf{I} commutes with all 𝐌\mathbf{M}, it follows that

𝐌0+𝐌1ppTr(𝐌0p+p𝐌0p1𝐌1+p2𝐌0p1𝐌1𝐌01𝐌1).\left\lVert\mathbf{M}_{0}+\mathbf{M}_{1}\right\rVert_{p}^{p}\leq\textup{Tr}\left(\mathbf{M}_{0}^{p}+p\mathbf{M}_{0}^{p-1}\mathbf{M}_{1}+p^{2}\mathbf{M}_{0}^{p-1}\mathbf{M}_{1}\mathbf{M}_{0}^{-1}\mathbf{M}_{1}\right).

Definitions of \mathbf{M}_{0}, \mathbf{M}_{1}, \mathbf{M}_{2}, and preservation of positive semidefiniteness under Schur complements, imply

(𝐌0𝐌1𝐌1𝐌2)0𝐌2𝐌1𝐌01𝐌10.\begin{pmatrix}\mathbf{M}_{0}&\mathbf{M}_{1}\\ \mathbf{M}_{1}&\mathbf{M}_{2}\end{pmatrix}\succeq 0\implies\mathbf{M}_{2}-\mathbf{M}_{1}\mathbf{M}_{0}^{-1}\mathbf{M}_{1}\succeq 0.

Thus, 𝐌1𝐌01𝐌1𝐌2\mathbf{M}_{1}\mathbf{M}_{0}^{-1}\mathbf{M}_{1}\preceq\mathbf{M}_{2}. Applying this and recalling 𝐕=𝐌0𝐌0p\mathbf{V}=\tfrac{\mathbf{M}_{0}}{\left\lVert\mathbf{M}_{0}\right\rVert_{p}},

𝐌0+𝐌1pp\displaystyle\left\lVert\mathbf{M}_{0}+\mathbf{M}_{1}\right\rVert_{p}^{p} Tr(𝐌0p+p𝐌0p1𝐌1+p2𝐌0p1𝐌2)\displaystyle\leq\textup{Tr}\left(\mathbf{M}_{0}^{p}+p\mathbf{M}_{0}^{p-1}\mathbf{M}_{1}+p^{2}\mathbf{M}_{0}^{p-1}\mathbf{M}_{2}\right)
=𝐌0pp(1+p𝐕p1,𝐌1𝐌0p+p𝐌2𝐌0p).\displaystyle=\left\lVert\mathbf{M}_{0}\right\rVert_{p}^{p}\left(1+p\left\langle\mathbf{V}^{p-1},\frac{\mathbf{M}_{1}}{\left\lVert\mathbf{M}_{0}\right\rVert_{p}}+\frac{p\mathbf{M}_{2}}{\left\lVert\mathbf{M}_{0}\right\rVert_{p}}\right\rangle\right).

By (1+px)1/p<1+x(1+px)^{1/p}<1+x, taking pthp^{th} roots we thus have

𝐌0+𝐌1p𝐌0p+𝐕p1,𝐌1+p𝐌2.\left\lVert\mathbf{M}_{0}+\mathbf{M}_{1}\right\rVert_{p}\leq\left\lVert\mathbf{M}_{0}\right\rVert_{p}+\left\langle\mathbf{V}^{p-1},\mathbf{M}_{1}+p\mathbf{M}_{2}\right\rangle.

Finally, the conclusion follows as in Lemma 3; by linearity of trace and g=pδg=p\delta,

𝐕p1,𝐌1+p𝐌2=i[n]𝐕p1,𝐀iδiwi(1+pδi)i[n]δiwi.\left\langle\mathbf{V}^{p-1},\mathbf{M}_{1}+p\mathbf{M}_{2}\right\rangle=\sum_{i\in[n]}\left\langle\mathbf{V}^{p-1},\mathbf{A}_{i}\right\rangle\delta_{i}w_{i}(1+p\delta_{i})\leq\sum_{i\in[n]}\delta_{i}w_{i}.

Here, we used the inequality for all nonzero gig_{i},

𝐕p1,𝐀i(1+pδi)=𝐕p1,𝐀i(2𝐕p1,𝐀i)1.\left\langle\mathbf{V}^{p-1},\mathbf{A}_{i}\right\rangle(1+p\delta_{i})=\left\langle\mathbf{V}^{p-1},\mathbf{A}_{i}\right\rangle\left(2-\left\langle\mathbf{V}^{p-1},\mathbf{A}_{i}\right\rangle\right)\leq 1.
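A NumPy sketch of the corresponding matrix step, again with \eta=p^{-1}, appears below; exact matrix powers are used for illustration, whereas the algorithm substitutes the randomized approximations discussed in the next proof.

import numpy as np

def schatten_packing_step(mats, w, p):
    # mats: list of (d, d) PSD matrices A_i; w: nonnegative iterate in R^n.
    M = sum(wi * Ai for wi, Ai in zip(w, mats))
    lam, U = np.linalg.eigh(M)                         # eigenvalues are >= 0
    norm_p = (lam ** p).sum() ** (1.0 / p)             # ||M||_p for PSD M
    V_pm1 = (U * (lam / norm_p) ** (p - 1)) @ U.T      # V^{p-1}
    inner = np.array([np.trace(Ai @ V_pm1) for Ai in mats])
    g = np.maximum(0.0, 1.0 - inner)                   # truncated gradient
    return w * (1.0 + g / p)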

See 5

Proof.

The proof is analogous to that of Theorem 4; we sketch the main differences here. By applying Lemma 17 and monotonicity of Schatten norms in the Loewner order, we again have \Phi_{0}\leq 1, implying correctness whenever the algorithm terminates on Line 4. Correctness of dual certification again follows from lack of termination and the choice of T, as well as setting u to each coordinate indicator vector. Finally, the returned matrix in Line 8 is correct by convexity of the Schatten-q norm, and the fact that all \mathbf{V}_{t}^{p-1} have unit Schatten-q norm.

We now discuss issues regarding computing g_{t} in Line 5 of the algorithm, the bottleneck step; these techniques are standard in the approximate SDP literature, and we defer a more formal discussion to e.g. [JLL+20]. First, note that each coordinate of g_{t} requires us to compute

\frac{1}{\left\lVert\sum_{j\in[n]}[w_{t}]_{j}\mathbf{A}_{j}\right\rVert_{p}^{p-1}}\cdot\left\langle\mathbf{A}_{i},\left(\sum_{j\in[n]}[w_{t}]_{j}\mathbf{A}_{j}\right)^{p-1}\right\rangle. (40)

We estimate the two quantities in the above expression each to 1+ϵ1+\epsilon multiplicative error with high probability. Union bounding over iterations, and modifying Lemma 4 to use the potential i[n][wt]i𝐀ip(1+O(ϵ))wt1\left\lVert\sum_{i\in[n]}[w_{t}]_{i}\mathbf{A}_{i}\right\rVert_{p}-(1+O(\epsilon))\left\lVert w_{t}\right\rVert_{1}, the analysis remains valid up to constants in ϵ\epsilon with this multiplicative approximation quality. We now discuss our approximation strategies.

For shorthand, denote 𝐌=i[n][wt]i𝐀i\mathbf{M}=\sum_{i\in[n]}[w_{t}]_{i}\mathbf{A}_{i}. To estimate the denominator of (40), it suffices to multiplicatively approximate 𝐌pp=Tr[𝐌p]\left\lVert\mathbf{M}\right\rVert_{p}^{p}=\textup{Tr}[\mathbf{M}^{p}] within a 1+ϵ1+\epsilon factor, as raising to the p1p\tfrac{p-1}{p} power can only improve this. To do so, we use the well-known fact (e.g. [DG03]) that letting 𝐐\mathbf{Q} be a k×dk\times d matrix with independent entries 𝒩(0,1k)\sim\mathcal{N}(0,\tfrac{1}{k}), for k=O(log(ndϵ)ϵ2)k=O(\tfrac{\log(\frac{nd}{\epsilon})}{\epsilon^{2}}), with probability 1poly((ndϵ)1)1-\textup{poly}((\tfrac{nd}{\epsilon})^{-1}),

Tr[𝐌p][k]𝐐:𝐌p𝐐:\textup{Tr}[\mathbf{M}^{p}]\approx\sum_{\ell\in[k]}\mathbf{Q}_{\ell:}^{\top}\mathbf{M}^{p}\mathbf{Q}_{\ell:}

to a 1+\epsilon factor. To read this from the standard Johnson-Lindenstrauss guarantee, it suffices to factorize \mathbf{M}^{p} and use that the \ell_{2} norm of each row of the square root is preserved with high probability under multiplication by \mathbf{Q}, and then apply cyclicity of trace. Similarly, for each i\in[n], we can approximate the numerators via

Tr(𝐐𝐌p12𝐀i𝐌p12𝐐).\textup{Tr}\left(\mathbf{Q}\mathbf{M}^{\frac{p-1}{2}}\mathbf{A}_{i}\mathbf{M}^{\frac{p-1}{2}}\mathbf{Q}^{\top}\right).

We can simultaneously compute all such quantities by first applying O(p)O(p) matrix-vector multiplications through 𝐌\mathbf{M} to each row of 𝐐\mathbf{Q}, and then computing all quadratic forms. In total, the computational cost per iteration of all approximations is O(nnzplog(ndϵ)ϵ2)O(\textup{nnz}\cdot\tfrac{p\log(\frac{nd}{\epsilon})}{\epsilon^{2}}) as desired. ∎
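The estimators above admit a short NumPy sketch. For illustration we assume p is an odd integer, so that \tfrac{p-1}{2} is integral, and we form \mathbf{M} densely for brevity, whereas the analysis instead routes O(p) matrix-vector products through the \mathbf{A}_{i}.

import numpy as np

def jl_schatten_estimates(mats, w, p, k, seed=0):
    # Approximates Tr(M^p) and every <A_i, M^{p-1}> for M = sum_i w_i A_i,
    # via a k x d Gaussian sketch Q; k ~ log(nd/eps)/eps^2 rows suffice.
    rng = np.random.default_rng(seed)
    d = mats[0].shape[0]
    M = sum(wi * Ai for wi, Ai in zip(w, mats))
    Q = rng.standard_normal((k, d)) / np.sqrt(k)       # JL sketch matrix
    QM = Q.copy()
    for _ in range((p - 1) // 2):                      # rows of Q M^{(p-1)/2}
        QM = QM @ M
    trace_p = np.einsum('ij,jk,ik->', QM, M, QM)       # ~ Tr(M^p)
    inner = [np.einsum('ij,jk,ik->', QM, Ai, QM)       # ~ <A_i, M^{p-1}>
             for Ai in mats]
    return trace_p, inner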

C.3 Proof of Proposition 2

In this section, following our prior developments, we prove the following claim.

See 2

C.3.1 Reduction to a decision problem

Given access to an oracle for the following approximate decision problem, we can implement an efficient binary search for estimating OPT. Specifically, letting the range of OPT be (\mu_{\text{lower}},\mu_{\text{upper}}), we can subdivide the range into O(\tfrac{1}{\epsilon}\log\tfrac{\mu_{\text{upper}}}{\mu_{\text{lower}}}) multiplicative intervals of width 1+\epsilon, and then binary search over them using our decision oracle. This incurs a multiplicative \log(\tfrac{nd}{\epsilon}) overhead in the setting of Proposition 2 (see Appendix A of [JLL+20] for a more formal treatment).
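A sketch of this reduction follows; the oracle interface decide(mu) is hypothetical, and we assume it answers consistently with the two-sided guarantee of the decision problem below, so that feasibility is monotone in the scale mu.

import math

def estimate_opt(decide, mu_lower, mu_upper, eps):
    # Binary search over a multiplicative (1+eps)-grid for the smallest
    # feasible scale; returns a 1+O(eps) multiplicative estimate of OPT.
    grid = math.ceil(math.log(mu_upper / mu_lower) / math.log1p(eps))
    lo, hi = 0, grid                         # index j encodes mu_lower*(1+eps)^j
    while lo < hi:
        mid = (lo + hi) // 2
        if decide(mu_lower * (1.0 + eps) ** mid):
            hi = mid                         # feasible: search smaller scales
        else:
            lo = mid + 1
    return mu_lower * (1.0 + eps) ** lo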

Problem 4.

Given {𝐀i}i[n]𝕊0d\{\mathbf{A}_{i}\}_{i\in[n]}\in\mathbb{S}_{\geq 0}^{d}, either find primal solution xΔnx\in\Delta^{n} with 𝒜(x)p1+ϵ\left\lVert\mathcal{A}(x)\right\rVert_{p}\leq 1+\epsilon, x(1+ϵ)(1+α)n\left\lVert x\right\rVert_{\infty}\leq\tfrac{(1+\epsilon)(1+\alpha)}{n}, or conclude no xΔnx\in\Delta^{n} satisfies 𝒜(x)p1ϵ\left\lVert\mathcal{A}(x)\right\rVert_{p}\leq 1-\epsilon, x(1ϵ)(1+α)n\left\lVert x\right\rVert_{\infty}\leq\tfrac{(1-\epsilon)(1+\alpha)}{n}.

The hard constraint x1+αn\left\lVert x\right\rVert_{\infty}\leq\tfrac{1+\alpha}{n} in the definition (7) can be adjusted by constant factors to admit the \ell_{\infty} bound in Problem 4, since we assumed ϵ=O(α)\epsilon=O(\alpha) is sufficiently small.

C.3.2 Preliminaries

We use the shorthand 𝐒:=n1+α𝐈\mathbf{S}:=\tfrac{n}{1+\alpha}\mathbf{I}, and p:=lognϵp^{\prime}:=\tfrac{\log n}{\epsilon}, so p\ell_{p^{\prime}} and \ell_{\infty} are interchangeable up to 1+O(ϵ)1+O(\epsilon) factors. In other words, Problem 4 asks to certify whether there exists xΔnx\in\Delta^{n} with

max(𝒜(x)p,𝐒xp)1,\max\left(\left\lVert\mathcal{A}(x)\right\rVert_{p},\;\left\lVert\mathbf{S}x\right\rVert_{p^{\prime}}\right)\leq 1, (41)

up to multiplicative 1+ϵ1+\epsilon tolerance on either side. Consider the potential function

Φ(w):=log(exp(𝒜(w)p)+exp(𝐒wp))w1.\Phi(w):=\log\left(\exp\left(\left\lVert\mathcal{A}(w)\right\rVert_{p}\right)+\exp\left(\left\lVert\mathbf{S}w\right\rVert_{p^{\prime}}\right)\right)-\left\lVert w\right\rVert_{1}. (42)

It is clear that the first term of \Phi(w) approximates the left hand side of (41) up to a \log 2 additive factor, so if any of \left\lVert\mathcal{A}(w)\right\rVert_{p}, \left\lVert\mathbf{S}w\right\rVert_{p^{\prime}}, or \left\lVert w\right\rVert_{1} reaches the scale 3\epsilon^{-1} while \Phi(w) remains bounded by 1, we can safely terminate and conclude primal feasibility for Problem 4. Next, we compute

iΦ(w)=1exp(𝒜(w)p)𝐀i,𝐘(w)+exp(𝐒wp)[𝐒z(w)]iexp(𝒜(w)p)+exp(𝐒wp) for all i[n],\displaystyle\nabla_{i}\Phi(w)=1-\frac{\exp\left(\left\lVert\mathcal{A}(w)\right\rVert_{p}\right)\left\langle\mathbf{A}_{i},\mathbf{Y}(w)\right\rangle+\exp\left(\left\lVert\mathbf{S}w\right\rVert_{p^{\prime}}\right)[\mathbf{S}z(w)]_{i}}{\exp\left(\left\lVert\mathcal{A}(w)\right\rVert_{p}\right)+\exp\left(\left\lVert\mathbf{S}w\right\rVert_{p^{\prime}}\right)}\text{ for all }i\in[n], (43)
\displaystyle\text{where }\mathbf{Y}(w):=\left(\frac{\mathcal{A}(w)}{\left\lVert\mathcal{A}(w)\right\rVert_{p}}\right)^{p-1},\;z(w):=\left(\frac{\mathbf{S}w}{\left\lVert\mathbf{S}w\right\rVert_{p^{\prime}}}\right)^{p^{\prime}-1}.
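For reference, a direct NumPy evaluation of the vector in (43), which the algorithm thresholds at zero to form its step direction, is sketched below; exact matrix powers and PSD inputs are assumed here, whereas an implementation substitutes the randomized estimates discussed previously.

import numpy as np

def grad_phi(mats, w, s_diag, p, p_prime):
    # mats: list of (d, d) PSD matrices A_i; w: nonnegative weights in R^n;
    # s_diag: diagonal of S = (n / (1 + alpha)) I, as a length-n vector.
    M = sum(wi * Ai for wi, Ai in zip(w, mats))        # A(w)
    lam, U = np.linalg.eigh(M)
    norm_p = (lam ** p).sum() ** (1.0 / p)             # ||A(w)||_p
    Y = (U * (lam / norm_p) ** (p - 1)) @ U.T          # Y(w)
    Sw = s_diag * w
    norm_pp = np.linalg.norm(Sw, ord=p_prime)          # ||S w||_{p'}
    z = (Sw / norm_pp) ** (p_prime - 1)                # z(w)
    inner = np.array([np.trace(Ai @ Y) for Ai in mats])
    alpha = 1.0 / (1.0 + np.exp(norm_pp - norm_p))     # stable softmax weight
    return 1.0 - (alpha * inner + (1.0 - alpha) * s_diag * z)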

The following helper lemma will be useful in concluding dual infeasibility of Problem 4.

Lemma 18.

In the setting of Problem 4, suppose there exists xΔnx^{*}\in\Delta^{n} with

𝒜(x)p1ϵ,𝐒xp1ϵ.\left\lVert\mathcal{A}(x^{*})\right\rVert_{p}\leq 1-\epsilon,\;\left\lVert\mathbf{S}x^{*}\right\rVert_{p^{\prime}}\leq 1-\epsilon.

Then, for any ww,

Φ(w),xϵ.\left\langle\nabla\Phi(w),x^{*}\right\rangle\geq\epsilon.
Proof.

From the definitions in (43), it is clear that 𝐘(w)q=z(w)q=1\left\lVert\mathbf{Y}(w)\right\rVert_{q}=\left\lVert z(w)\right\rVert_{q^{\prime}}=1, where qq, qq^{\prime} are the dual norms of pp, pp^{\prime} respectively. Moreover, by the definition of xx^{*}, we have for all 𝐘q=zq=1\left\lVert\mathbf{Y}\right\rVert_{q}=\left\lVert z\right\rVert_{q^{\prime}}=1,

\left\langle\mathbf{Y},\mathcal{A}(x^{*})\right\rangle\leq 1-\epsilon,\;\left\langle z,\mathbf{S}x^{*}\right\rangle\leq 1-\epsilon.

This follows from the dual definition of the p\ell_{p} norm (see Fact 2). Now, note that for some nonnegative α(w)\alpha(w), β(w)\beta(w) summing to 1, using the above claim and (43),

Φ(w),x=1(α(w)𝐘(w),𝒜(x)+β(w)z(w),𝐒x)ϵ,\left\langle\nabla\Phi(w),x^{*}\right\rangle=1-\left(\alpha(w)\left\langle\mathbf{Y}(w),\mathcal{A}(x^{*})\right\rangle+\beta(w)\left\langle z(w),\mathbf{S}x^{*}\right\rangle\right)\geq\epsilon,

as desired (here, we used positivity of all relevant quantities). ∎

C.3.3 Potential monotonicity

We prove a monotonicity property regarding the potential Φ\Phi in (42).

Lemma 19.

Let w0nw\in\mathbb{R}^{n}_{\geq 0} satisfy 𝒜(w)p3ϵ1\left\lVert\mathcal{A}(w)\right\rVert_{p}\leq 3\epsilon^{-1}, 𝐒wp3ϵ1\left\lVert\mathbf{S}w\right\rVert_{p^{\prime}}\leq 3\epsilon^{-1}, let g=max(0,Φ(w))g=\max(0,\nabla\Phi(w)) entrywise, and let w=(1+ηg)ww^{\prime}=(1+\eta g)\circ w, where η=(4p)1\eta=(4p^{\prime})^{-1}. Then, Φ(w)Φ(w)\Phi(w^{\prime})\leq\Phi(w).

Proof.

Denote for simplicity the threshold K=3ϵ1K=3\epsilon^{-1} and the step vector δ=ηg\delta=\eta g. First, by prior calculations in Lemma 3 and Lemma 4, it follows that

𝒜(w)p𝒜(w)p+Δ𝒜,𝐒wp𝐒wp+Δ𝐒,\displaystyle\left\lVert\mathcal{A}(w^{\prime})\right\rVert_{p}\leq\left\lVert\mathcal{A}(w)\right\rVert_{p}+\Delta_{\mathcal{A}},\;\left\lVert\mathbf{S}w^{\prime}\right\rVert_{p^{\prime}}\leq\left\lVert\mathbf{S}w\right\rVert_{p^{\prime}}+\Delta_{\mathbf{S}},
where Δ𝒜:=i[n]𝐀i,𝐘(w)δiwi(1+pδi),Δ𝐒:=i[n][𝐒z(w)]iδiwi(1+pδi).\displaystyle\text{where }\Delta_{\mathcal{A}}:=\sum_{i\in[n]}\left\langle\mathbf{A}_{i},\mathbf{Y}(w)\right\rangle\delta_{i}w_{i}(1+p\delta_{i}),\;\Delta_{\mathbf{S}}:=\sum_{i\in[n]}[\mathbf{S}z(w)]_{i}\delta_{i}w_{i}(1+p^{\prime}\delta_{i}).

Next, note that by δη\delta\leq\eta entrywise and lack of termination (i.e. the threshold KK),

Δ𝒜(1+pη)η𝐘(w),𝒜(w)2η𝒜(w)p1.\displaystyle\Delta_{\mathcal{A}}\leq(1+p\eta)\eta\left\langle\mathbf{Y}(w),\mathcal{A}(w)\right\rangle\leq 2\eta\left\lVert\mathcal{A}(w)\right\rVert_{p}\leq 1.

Therefore, by exp(x)1+x+x2\exp(x)\leq 1+x+x^{2} for x1x\leq 1,

exp(𝒜(w)p)exp(𝒜(w)p)(1+Δ𝒜+Δ𝒜2).\exp\left(\left\lVert\mathcal{A}(w^{\prime})\right\rVert_{p}\right)\leq\exp\left(\left\lVert\mathcal{A}(w)\right\rVert_{p}\right)\left(1+\Delta_{\mathcal{A}}+\Delta_{\mathcal{A}}^{2}\right). (44)

Moreover, by applying Cauchy-Schwarz and the threshold $\left\lVert\mathcal{A}(w)\right\rVert_{p}\leq K$ once more,

\Delta_{\mathcal{A}}^{2}\leq(1+p\eta)^{2}\left(\sum_{i\in[n]}\left\langle\mathbf{A}_{i},\mathbf{Y}(w)\right\rangle\delta_{i}w_{i}\right)^{2} \quad (45)
\leq 2\left(\sum_{i\in[n]}\left\langle\mathbf{A}_{i},\mathbf{Y}(w)\right\rangle\delta_{i}^{2}w_{i}\right)\left\langle\mathbf{Y}(w),\mathcal{A}(w)\right\rangle\leq 2K\left(\sum_{i\in[n]}\left\langle\mathbf{A}_{i},\mathbf{Y}(w)\right\rangle\delta_{i}^{2}w_{i}\right).

Combining (44) and (45) yields $\Delta_{\mathcal{A}}+\Delta_{\mathcal{A}}^{2}\leq\sum_{i\in[n]}\left\langle\mathbf{A}_{i},\mathbf{Y}(w)\right\rangle\delta_{i}w_{i}(1+(p+2K)\delta_{i})$; applying similar reasoning to the term $\Delta_{\mathbf{S}}$, we conclude

\exp\left(\left\lVert\mathcal{A}(w^{\prime})\right\rVert_{p}\right)\leq\exp\left(\left\lVert\mathcal{A}(w)\right\rVert_{p}\right)\left(1+\sum_{i\in[n]}\left\langle\mathbf{A}_{i},\mathbf{Y}(w)\right\rangle\delta_{i}w_{i}(1+(p+2K)\delta_{i})\right),
\exp\left(\left\lVert\mathbf{S}w^{\prime}\right\rVert_{p^{\prime}}\right)\leq\exp\left(\left\lVert\mathbf{S}w\right\rVert_{p^{\prime}}\right)\left(1+\sum_{i\in[n]}[\mathbf{S}z(w)]_{i}\delta_{i}w_{i}(1+(p^{\prime}+2K)\delta_{i})\right).

Recall the inequality $\log(1+x)\leq x$ for nonnegative $x$. Expanding the definitions of $\Phi$ and $\nabla\Phi$ (cf. (42)), and plugging in the above bounds, we conclude that

\Phi(w^{\prime})-\Phi(w)=\log\left(\frac{\exp\left(\left\lVert\mathcal{A}(w^{\prime})\right\rVert_{p}\right)+\exp\left(\left\lVert\mathbf{S}w^{\prime}\right\rVert_{p^{\prime}}\right)}{\exp\left(\left\lVert\mathcal{A}(w)\right\rVert_{p}\right)+\exp\left(\left\lVert\mathbf{S}w\right\rVert_{p^{\prime}}\right)}\right)-\left\langle\delta,w\right\rangle
\leq\sum_{i\in[n]}(1-\nabla_{i}\Phi(w))\delta_{i}w_{i}(1+(p^{\prime}+2K)\delta_{i})-\left\langle\delta,w\right\rangle
=\sum_{i\in[n]}\left((1-\nabla_{i}\Phi(w))(1+(p^{\prime}+2K)\delta_{i})-1\right)\delta_{i}w_{i}.

As before, we show that this sum is entrywise nonpositive. For any $i\in[n]$ with $\delta_{i}\neq 0$, we have $\delta_{i}=\eta\nabla_{i}\Phi(w)$, and hence

(1-\nabla_{i}\Phi(w))(1+(p^{\prime}+2K)\delta_{i})-1=(1-\nabla_{i}\Phi(w))(1+(p^{\prime}+2K)\eta\nabla_{i}\Phi(w))-1
\leq(1-\nabla_{i}\Phi(w))(1+\nabla_{i}\Phi(w))-1=-\left(\nabla_{i}\Phi(w)\right)^{2}\leq 0,

as desired, where we used $\eta^{-1}\geq p^{\prime}+2K$ (by the choices of $\eta$, $p^{\prime}$, and $K$). This yields the conclusion $\Phi(w^{\prime})\leq\Phi(w)$. ∎

C.3.4 Algorithm and analysis

Finally, we state Algorithm 7 and prove Proposition 2.

Algorithm 7 $\mathsf{BoxedSchattenPacking}(\{\mathbf{A}_{i}\}_{i\in[n]},\epsilon,p,\alpha)$
1:  Input: $\{\mathbf{A}_{i}\}_{i\in[n]}\in\mathbb{S}_{\geq 0}^{d}$, $\epsilon\in[0,\tfrac{1}{2}]$, $p\geq 2$, $\alpha\in[0,n-1]$
2:  $p^{\prime}\leftarrow\tfrac{\log n}{\epsilon}$, $\mathbf{S}\leftarrow\tfrac{n}{1+\alpha}\mathbf{I}$
3:  $\eta\leftarrow(4p^{\prime})^{-1}$, $K\leftarrow 3\epsilon^{-1}$, $T\leftarrow\frac{6\log(nd/\epsilon)}{\eta\epsilon}$
4:  $[w_{0}]_{i}\leftarrow\tfrac{\epsilon}{n^{2}d}$ for all $i\in[n]$, $t\leftarrow 0$
5:  while $\left\lVert\mathcal{A}(w_{t})\right\rVert_{p},\left\lVert\mathbf{S}w_{t}\right\rVert_{p^{\prime}},\left\lVert w_{t}\right\rVert_{1}\leq K$ do
6:     $g_{t}\leftarrow\max(0,\nabla\Phi(w_{t}))$ entrywise, where $\nabla\Phi$ is defined via (42) and computed entrywise as in (43)
7:     $w_{t+1}\leftarrow w_{t}\circ(1+\eta g_{t})$, $t\leftarrow t+1$
8:     if $t\geq T$ then
9:        return Infeasible
10:     end if
11:  end while
12:  return $x=\frac{w_{t}}{\left\lVert w_{t}\right\rVert_{1}}$
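For concreteness, here is a hedged Python sketch of Algorithm 7, reusing the lp_norm and schatten_grad_quantities helpers from after (43). It is a reference implementation of the logic only: it recomputes exact eigendecompositions every iteration, whereas the runtime claimed in Proposition 2 relies on approximate estimation of the $\mathcal{A}$ component of $\nabla\Phi(w_{t})$.

```python
def boxed_schatten_packing(A, eps, p, alpha):
    """Sketch of Algorithm 7. A: (n, d, d) array of PSD matrices.
    Returns a point of the simplex, or None in place of 'Infeasible'."""
    n, d = A.shape[0], A.shape[1]
    pp = np.log(n) / eps                             # Line 2: p' = log(n)/eps
    S = (n / (1.0 + alpha)) * np.eye(n)              # Line 2: S = (n/(1+alpha)) I
    eta = 1.0 / (4.0 * pp)                           # Line 3: eta = (4p')^{-1}
    K = 3.0 / eps
    T = int(np.ceil(6.0 * np.log(n * d / eps) / (eta * eps)))
    w = np.full(n, eps / (n ** 2 * d))               # Line 4: w_0
    for _ in range(T):                               # Lines 8-10 fold into this bound
        Aw = np.einsum('i,ijk->jk', w, A)
        sp = lp_norm(np.clip(np.linalg.eigvalsh(Aw), 0.0, None), p)
        if sp > K or lp_norm(S @ w, pp) > K or w.sum() > K:
            return w / w.sum()                       # Line 12: while-condition broke
        _, _, grad = schatten_grad_quantities(A, S, w, p, pp)
        w = w * (1.0 + eta * np.maximum(0.0, grad))  # Lines 6-7: truncated MWU step
    return None                                      # Line 9: Infeasible
```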
Proof of Proposition 2.

Correctness of the reduction to deciding Problem 4 follows from the discussion in Section C.3.1. Moreover, by inspection of Algorithm 7, it is clear (following e.g. the preprocessing of Lemma 17) that $\Phi(w_{t})\leq 1$ throughout the algorithm, so whenever the algorithm terminates we have primal feasibility. It suffices to prove that whenever the problem admits $x^{*}$ with

\left\lVert\mathcal{A}(x^{*})\right\rVert_{p}\leq 1-\epsilon,\;\left\lVert\mathbf{S}x^{*}\right\rVert_{p^{\prime}}\leq 1-\epsilon,

then the algorithm terminates on Line 5 within $T$ iterations. Analogously to Theorem 4, we have

\eta(1-\eta)\sum_{0\leq t<T}\left\langle g_{t},x^{*}\right\rangle\leq\log n-\log\left\lVert w_{0}\right\rVert_{1}+\log\left\lVert w_{T}\right\rVert_{1}\leq 2\log\left(\frac{nd}{\epsilon}\right)+\log\left\lVert w_{T}\right\rVert_{1}.

Next, since $g_{t}$ is an upwards truncation of $\nabla\Phi(w_{t})$ and $x^{*}$ is entrywise nonnegative, Lemma 18 implies $\left\langle g_{t},x^{*}\right\rangle\geq\epsilon$ for every $t$; combined with the above display and $\eta\leq\tfrac{1}{2}$, this yields

\left\lVert w_{T}\right\rVert_{1}\geq\exp\left(\frac{\eta\epsilon T}{2}-2\log\left(\frac{nd}{\epsilon}\right)\right).

The conclusion follows by the definition of $T$: since $\tfrac{\eta\epsilon T}{2}=3\log(\tfrac{nd}{\epsilon})$, failing to terminate within $T$ iterations would force $\left\lVert w_{T}\right\rVert_{1}\geq\tfrac{nd}{\epsilon}>K$, contradicting that the check on Line 5 never triggered. Finally, the iteration complexity follows analogously to the discussion in the proof of Theorem 5, where the only expensive per-iteration cost is estimating coordinates of the $\mathcal{A}$ component of $\nabla\Phi(w_{t})$. ∎

Finally, we remark that by opening up the dual certificates $\mathbf{Y}(w)$, $z(w)$ of our mirror descent analysis, we can in fact implement a stronger version of the decision Problem 4 which returns a feasible dual certificate whenever the primal problem is infeasible. We omit this extension for brevity, as it is unnecessary for our applications, but it is analogous to the analysis of Theorem 5.

Appendix D Deferred proofs from Section 5

D.1 Proof of Proposition 3

See Proposition 3.

Proof.

We claim that Algorithm 1 in [MM15], applied to the matrix $\mathbf{A}^{p}$ with a careful choice of exponent $q$ in their Algorithm 1, yields this guarantee. Specifically, we choose $q_{1},q_{2}$, both of which satisfy the criteria in their main theorem, such that the iterates produced by simultaneous power iteration on $\mathbf{A}^{p}$ with exponent $q_{1}$ and on $\mathbf{A}^{p-1}$ with exponent $q_{2}$ are identical; it suffices to choose $q$ a multiple of $p(p-1)$, with $q_{1}=q/p$ and $q_{2}=q/(p-1)$. Thus, we can also apply their guarantees to $\mathbf{A}^{p-1}$ and apply a union bound. Notice that their Algorithm 1 also contains some postprocessing to ensure that the obtained singular vectors lie in the right space, which is unnecessary for us, as our matrices are Hermitian. ∎
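The exponent choice is easy to check numerically; the snippet below (our own toy illustration) confirms that for $p=3$ and $q=6$, the choices $q_{1}=q/p=2$ and $q_{2}=q/(p-1)=3$ yield identical powered matrices, since $(\mathbf{A}^{p})^{q_{1}}=(\mathbf{A}^{p-1})^{q_{2}}=\mathbf{A}^{q}$.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T                                   # a random PSD (hence Hermitian) matrix
lhs = np.linalg.matrix_power(np.linalg.matrix_power(A, 3), 2)  # (A^p)^{q1}: p = 3, q1 = 2
rhs = np.linalg.matrix_power(np.linalg.matrix_power(A, 2), 3)  # (A^{p-1})^{q2}: q2 = 3
assert np.allclose(lhs, rhs)                  # both equal A^6
```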

D.2 Proof of Lemma 5

See Lemma 5.

Proof.

Lemma 10 implies that, letting $w^{*}$ be the uniform distribution over the uncorrupted samples amongst $X_{1},\ldots,X_{n}$, and denoting $\tilde{\epsilon}:=2C_{3}\cdot\epsilon\log\epsilon^{-1}$, we have with probability at least $1-\tfrac{\delta}{2}$ that

\left\lVert\sum_{i\in[n]}w^{*}_{i}X_{i}X_{i}^{\top}\right\rVert_{p}\leq\left(1+\frac{\tilde{\epsilon}}{2}\right)\left\lVert\bm{\Sigma}\right\rVert_{p}.

Therefore, the mixed $\ell_{\infty}$-$\ell_{p}$ packing semidefinite program

\exists w\in\Delta^{n}\text{ with }\left\lVert w\right\rVert_{\infty}\leq\frac{1}{(1-\epsilon)n},\;\left\lVert\sum_{i\in[n]}w_{i}X_{i}X_{i}^{\top}\right\rVert_{p}\leq\left(1+\frac{\tilde{\epsilon}}{2}\right)\left\lVert\bm{\Sigma}\right\rVert_{p}

is feasible, with $w^{*}$ as a witness: since $w^{*}$ is uniform over at least $(1-\epsilon)n$ samples, $\left\lVert w^{*}\right\rVert_{\infty}\leq\frac{1}{(1-\epsilon)n}$. This completes the proof. ∎

D.3 Proof of Lemma 6

See Lemma 6.

Proof.

We follow the notation of (10). First, by the guarantees of Corollary 1,

w_{G}=1-w_{B}\geq 1-\frac{\epsilon n}{(1-2\epsilon)n}=\frac{1-3\epsilon}{1-2\epsilon}\geq 1-2\epsilon.

Therefore, again applying Corollary 1, for all $i\in G$,

\frac{w_{i}}{w_{G}}\leq\frac{1}{(1-2\epsilon)n}\cdot\frac{1-2\epsilon}{1-3\epsilon}=\frac{1}{(1-3\epsilon)n}.

We conclude that the set of weights $\{\tfrac{w_{i}}{w_{G}}\}_{i\in G}$ belongs to $\mathfrak{S}_{3\epsilon}^{(1-\epsilon)n}$. By applying Corollary 4 to these weights and adjusting the definition of $C_{3}$ by a constant, we conclude that with probability at least $1-\tfrac{\delta}{2}$,

\left(1+C_{3}\cdot\epsilon\log\epsilon^{-1}\right)\bm{\Sigma}\succeq\sum_{i\in G}\frac{w_{i}}{w_{G}}X_{i}X_{i}^{\top}\succeq\left(1-C_{3}\cdot\epsilon\log\epsilon^{-1}\right)\bm{\Sigma}.

The conclusion follows by multiplying through by $w_{G}$ and using the definition $\tilde{\epsilon}=2C_{3}\cdot\epsilon\log\epsilon^{-1}$. ∎