

Fast and Sample-Efficient
Federated Low Rank Matrix Recovery
from column-wise Linear and Quadratic Projections

Seyedehsara (Sara) Nayer and Namrata Vaswani
Dept. of Electrical and Computer Engineering, Iowa State University, USA.
Email: [email protected]
Abstract

We study the following lesser-known low rank (LR) recovery problem: recover an n×qn\times q rank-rr matrix, 𝑿=[𝒙1,𝒙2,,𝒙q]{\bm{X}}^{*}=[\bm{x}^{*}_{1},\bm{x}^{*}_{2},...,\bm{x}^{*}_{q}], with rmin(n,q)r\ll\min(n,q), from mm independent linear projections of each of its qq columns, i.e., from 𝒚k:=𝑨k𝒙k,k[q]\bm{y}_{k}:=\bm{A}_{k}\bm{x}^{*}_{k},k\in[q], when 𝒚k\bm{y}_{k} is an mm-length vector with m<nm<n. The matrices 𝑨k\bm{A}_{k} are known and mutually independent for different kk. We introduce a novel gradient descent (GD) based solution called AltGD-Min. We show that, if the 𝑨k\bm{A}_{k}s are i.i.d. with i.i.d. Gaussian entries, and if the right singular vectors of 𝑿{\bm{X}}^{*} satisfy the incoherence assumption, then ϵ\epsilon-accurate recovery of 𝑿{\bm{X}}^{*} is possible with order (n+q)r2log(1/ϵ)(n+q)r^{2}\log(1/\epsilon) total samples and order mqnrlog(1/ϵ)mqnr\log(1/\epsilon) time. Compared with existing work, this is the fastest solution. For ϵ<1/r1/4\epsilon<1/r^{1/4}, it also has the best sample complexity. A simple extension of AltGD-Min also provably solves LR Phase Retrieval, which is a magnitude-only generalization of the above problem.

AltGD-Min factorizes the unknown 𝑿{\bm{X}} as 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B} where 𝑼{\bm{U}} has rr columns and 𝑩\bm{B} has rr rows. It alternates between a (projected) GD step for updating 𝑼{\bm{U}}, and a minimization step for updating 𝑩\bm{B}. Each iteration of AltGD-Min is as fast as an iteration of regular projected GD because the minimization over 𝑩\bm{B} decouples column-wise. At the same time, we can prove exponential error decay for it, which we are unable to do for projected GD. Finally, it can also be efficiently federated with a communication cost of only nrnr per node, instead of nqnq for projected GD.

I Introduction

This work develops a sample-efficient, fast, and communication-efficient gradient descent (GD) solution, called AltGD-Min, for provably recovering a low-rank (LR) matrix from a set of mutually independent linear projections of each of its columns. The communication-efficiency considers a federated setting. This problem, which we henceforth refer to as “Low Rank column-wise Compressive Sensing (LRcCS)”, is precisely defined below. Unlike the other well-studied LR problems – multivariate regression (MVR) [1], LR matrix sensing [2] and LR matrix completion (LRMC) [3, 2] – LRcCS has received little attention so far in terms of approaches with provable guarantees. There are only two existing provably correct solutions. (1) Its generalization LR phase retrieval (LRPR), was studied in our recent work [4, 5, 6] where we developed a provably correct alternating minimization (AltMin) solution. Since LRPR is a generalization, the algorithm also solves LRcCS. (2) In parallel work, [7] developed and analyzed a convex relaxation (mixed-norm minimization) for LRcCS. Both solutions are much slower than GD-based methods, and, in most practical settings, also have worse sample complexity.

LRcCS occurs in accelerated LR dynamic MRI [8, 9, 10], and in distributed/federated sketching [11, 12, 7]. We explain these in Sec. I-D. We show the speed and performance advantage of AltGD-Min for dynamic MRI in [13].

I-A Problem Setting, Notation, and Assumption

Problem definition. The goal is to recover an n×qn\times q rank-rr matrix 𝑿=[𝒙1,𝒙2,,𝒙q]{\bm{X}}^{*}=[\bm{x}^{*}_{1},\bm{x}^{*}_{2},\dots,\bm{x}^{*}_{q}] from mm linear projections (sketches) of each of its qq columns, i.e. from

𝒚k:=𝑨k𝒙k,k[q]\displaystyle\bm{y}_{k}:=\bm{A}_{k}\bm{x}^{*}_{k},\ k\in[q] (1)

where each 𝒚k\bm{y}_{k} is an mm-length vector, [q]:={1,2,,q}[q]:=\{1,2,\dots,q\}, and the measurement/sketching matrices 𝑨k\bm{A}_{k} are mutually independent and known. The setting of interest is low-rank (LR), rmin(n,q)r\ll\min(n,q), and undersampled measurements, m<nm<n. Our guarantees assume that each 𝑨k\bm{A}_{k} is random-Gaussian: each entry of it is independent and identically distributed (i.i.d.) standard Gaussian.

We also study the magnitude-only measurements’ setting, LRPR [4, 5, 6]. This involves recovering 𝑿{\bm{X}}^{*} from

𝒚(mag)k:=|𝑨k𝒙k|,k[q].{\bm{y}_{(mag)}}_{k}:=|\bm{A}_{k}\bm{x}^{*}_{k}|,\ k\in[q].

Here |𝒛||\bm{z}| takes the entry-wise absolute value of entries of the vector 𝒛\bm{z}.

Notation. Everywhere, .F\|.\|_{F} denotes the Frobenius norm, .\|.\| without a subscript denotes the (induced) l2l_{2} norm (often called the operator norm or spectral norm), 𝑴max\|\bm{M}\|_{\max} is the maximum magnitude entry of the matrix 𝑴\bm{M}, denotes matrix or vector transpose, and |𝒛||\bm{z}| for a vector 𝒛\bm{z} denotes element-wise absolute values. 𝑰n\bm{I}_{n} (or sometimes just 𝑰\bm{I}) denotes the n×nn\times n identity matrix. We use 𝒆k\bm{e}_{k} to denote the kk-th canonical basis vector, i.e., the kk-th column of 𝑰\bm{I}. For any matrix 𝒁{\bm{{Z}}}, 𝒛k\bm{z}_{k} denotes its kk-th column.

We say 𝑼{\bm{U}} is a basis matrix if it contains orthonormal columns. For basis matrices 𝑼1,𝑼2{\bm{U}}_{1},{\bm{U}}_{2}, we use

SD(𝑼1,𝑼2):=(𝑰𝑼1𝑼1)𝑼2F\mathrm{SD}({\bm{U}}_{1},{\bm{U}}_{2}):=\|(\bm{I}-{\bm{U}}_{1}{\bm{U}}_{1}{}^{\top}){\bm{U}}_{2}\|_{F}

as the Subspace Distance (SD) measure. For two rr-dimensional subspaces, this is the l2l_{2} norm of the sines of the rr principal angles between span(𝑼1)\mathrm{span}({\bm{U}}_{1}) and span(𝑼2)\mathrm{span}({\bm{U}}_{2}). SD(𝑼1,𝑼2)\mathrm{SD}({\bm{U}}_{1},{\bm{U}}_{2}) is symmetric when 𝑼1,𝑼2{\bm{U}}_{1},{\bm{U}}_{2} are both n×rn\times r basis matrices. Notice here we are using the Frobenius SD, unlike many recent works including our older work [5] that use the induced 2-norm one. This is done because it enables us to prove the desired guarantees easily. We reuse the letters c,Cc,C to denote different numerical constants in each use with the convention that c<1c<1 and C1C\geq 1. The notation aΩ(b)a\in\Omega(b) means aCba\geq Cb while aO(b)a\in O(b) means aCba\leq Cb. We use 𝟙statement\mathbbm{1}_{\text{statement}} to denote an indicator function that takes the value 1 if statement is true and zero otherwise.
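
As a concrete illustration, a minimal numpy sketch of this Frobenius subspace distance (the function name is ours) could be:

```python
import numpy as np

def subspace_dist(U1, U2):
    """Frobenius subspace distance SD(U1, U2) = ||(I - U1 U1^T) U2||_F.
    U1, U2: n x r basis matrices (orthonormal columns)."""
    residual = U2 - U1 @ (U1.T @ U2)  # (I - U1 U1^T) U2, without forming I explicitly
    return np.linalg.norm(residual, 'fro')
```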

For a vector 𝒘\bm{w}, we sometimes use 𝒘(k)\bm{w}(k) to denote the kk-th entry of 𝒘\bm{w}. For a vector 𝒘\bm{w} and a scalar α\alpha, 𝟙(𝒘α)\mathbbm{1}(\bm{w}\leq\alpha) returns a vector of 1s and 0s of the same length as 𝒘\bm{w}, with 1s where (𝒘(k)α)(\bm{w}(k)\leq\alpha) and zero everywhere else. We use \circ to denote the Hadamard product. Thus 𝒛:=𝒘𝟙(𝒘α)\bm{z}:=\bm{w}\circ\mathbbm{1}(\bm{w}\leq\alpha) zeroes out entries of 𝒘\bm{w} larger than α\alpha, while keeping the smaller ones as is.

For 𝑿{\bm{X}}^{*} which is a rank-rr matrix, we let

𝑿=SVD𝑼𝚺𝑽𝑩:=𝑼𝑩{\bm{X}}^{*}\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}\underbrace{{\bm{\Sigma}^{*}}{\bm{V}^{*}}{}}_{\bm{B}^{*}}:={\bm{U}}^{*}{}\bm{B}^{*}

denote its reduced (rank rr) SVD, i.e., 𝑼{\bm{U}}^{*}{} and 𝑽{\bm{V}^{*}}^{\top} are matrices with orthonormal columns (basis matrices), 𝑼{\bm{U}}^{*}{} is n×rn\times r and 𝑽{\bm{V}^{*}} is r×qr\times q, and 𝚺{\bm{\Sigma}^{*}} is an r×rr\times r diagonal matrix with non-negative entries. We use κ:=σmax/σmin\kappa:={\sigma_{\max}^{*}}/{\sigma_{\min}^{*}} to denote the condition number of 𝚺{\bm{\Sigma}^{*}}. This is not the condition number of 𝑿{\bm{X}}^{*} (whose minimum singular value is zero). We let 𝑩:=𝚺𝑽\bm{B}^{*}:={\bm{\Sigma}^{*}}\bm{V}^{*}{}{} and we use 𝒃k\bm{b}^{*}_{k} to denote its kk-th column.

We use the phrase ϵ\epsilon-accurate recovery to refer to SD(𝑼,𝑼)ϵ\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})\leq\epsilon or 𝑿𝑿Fϵ𝑿F\|{\bm{X}}-{\bm{X}}^{*}\|_{F}\leq\epsilon\|{\bm{X}}^{*}\|_{F} or both.

Assumption. Another way to understand (1) is as follows: each scalar measurement 𝒚ki\bm{y}_{ki} (ii-th entry of 𝒚k\bm{y}_{k}) satisfies

𝒚ki:=𝒂ki,𝒙k,i[m],k[q]\bm{y}_{ki}:=\langle\bm{a}_{ki},\bm{x}^{*}_{k}\rangle,\ i\in[m],\ k\in[q]

with 𝒂ki\bm{a}_{ki}{}^{\top} being the ii-th row of 𝑨k\bm{A}_{k}. Observe that the measurements are not global, i.e., no 𝒚ki\bm{y}_{ki} is a function of the entire matrix 𝑿{\bm{X}}^{*}. They are global for each column (𝒚ki\bm{y}_{ki} is a function of column 𝒙k\bm{x}^{*}_{k}) but not across the different columns. We thus need an assumption that enables correct interpolation across the different columns. The following assumption, which is a slightly weaker version of incoherence (w.r.t. the canonical basis) of right singular vectors suffices for this purpose.

Assumption 1.1 ((Weakened) Right Singular Vectors’ Incoherence).

Assume that

maxk𝒃kσmaxμr/q.\max_{k}\|\bm{b}^{*}_{k}\|\leq{\sigma_{\max}^{*}}\mu\sqrt{r/q}.

for a constant μ1\mu\geq 1 (μ\mu does not grow with n,q,rn,q,r). Since 𝐱k=𝐛k\|\bm{x}^{*}_{k}\|=\|\bm{b}^{*}_{k}\|, this implies that maxk𝐱kσmaxμr/q\max_{k}\|\bm{x}^{*}_{k}\|\leq{\sigma_{\max}^{*}}\mu\sqrt{r/q}. Also, since σminr𝐗F{\sigma_{\min}^{*}}\sqrt{r}\leq\|{\bm{X}}^{*}\|_{F}, this also implies that maxk𝐱kκμ𝐗F/q\max_{k}\|\bm{x}^{*}_{k}\|\leq\kappa\mu{\|{\bm{X}}^{*}\|_{F}}/{\sqrt{q}}.

Right singular vectors incoherence is the assumption maxk𝒗kμr/q\max_{k}\|\bm{v}^{*}_{k}\|\leq\mu\sqrt{r/q}. Since 𝒃k=𝚺𝒗k\bm{b}^{*}_{k}={\bm{\Sigma}^{*}}\bm{v}^{*}_{k}, this implies that the above holds. Incoherence of both left and right singular vectors was introduced for guaranteeing correct “interpolation” for the LRMC problem [3, 2].

I-B Existing Work

Existing solutions for LRcCS and LRPR. Since it is always possible to obtain magnitude-only measurements 𝒚(mag)k{\bm{y}_{(mag)}}_{k} from linear ones 𝒚k\bm{y}_{k} as 𝒚(mag)k=|𝒚k|{\bm{y}_{(mag)}}_{k}=|\bm{y}_{k}|, a solution to LRPR also automatically solves LRcCS under the same assumptions. Hence the AltMin algorithm for LRPR from [4, 5] is the first provably correct solution for LRcCS. Of course, since LRcCS is an easier problem than LRPR, we expect a direct solution to LRcCS to need weaker assumptions. As we show in this paper, this is indeed true. A more recent work [7] studied the noisy version of LRcCS and developed a convex relaxation (mixed norm minimization) to provably solve it. Its time complexity is not discussed in the paper; however, it is well known that solvers for convex programs are much slower when compared to direct iterative algorithms: they either require a number of iterations proportional to 1/ϵ1/\sqrt{\epsilon} or the per-iteration cost has cubic dependence on the problem size (here ((n+q)r)3((n+q)r)^{3}) [2]. Thus, if qnq\leq n, its time complexity is O(mqnrmin(1/ϵ,n3r3))O(mqnr\cdot\min(1/\sqrt{\epsilon},n^{3}r^{3})). In [6], we provided the best possible guarantee for the AltMin algorithm for solving LRPR, and hence LRcCS. We discuss these results in detail in Sec. II-D and summarize them in Table I.

Other well-studied LR recovery problems. The multivariate regression (MVR) problem, studied in [1], is our problem with 𝑨k=𝑨\bm{A}_{k}=\bm{A}. However this is a very different setting than ours because, with 𝑨k=𝑨\bm{A}_{k}=\bm{A}, the different 𝒚k\bm{y}_{k}’s are no longer mutually independent. As a result, one cannot exploit law of large numbers’ arguments over all mqmq scalar measurements 𝒚ki\bm{y}_{ki}. Consequently, the required value of mm can never be less than nn. The result of [1] shows that mm of order (n+q)r(n+q)r is both necessary and sufficient. LRMS involves recovering 𝑿{\bm{X}}^{*} from 𝒚i=𝑨i,𝑿,i=1,2,,mq\bm{y}_{i}=\langle\bm{A}_{i},{\bm{X}}^{*}\rangle,\ i=1,2,\dots,mq with 𝑨i\bm{A}_{i} being dense matrices, typically i.i.d. Gaussian [2]. Thus all measurements are i.i.d. and global: each contains information about the entire quantity-of-interest, here 𝑿{\bm{X}}^{*}. Because of this, for LRMS, one can prove a LR Restricted Isometry Property (RIP) that simplifies the rest of the analysis. This is what makes it very different from, and easier than, our problem.

LRMC, which involves recovering 𝑿{\bm{X}}^{*} from a subset of its observed entries, is the most closely related problem to ours since it also involves recovery from non-global measurements. The typical model assumed is that each matrix entry is observed with probability pp independent of others [3, 2]. Setting unobserved entries to zero, this can be written as 𝒚jk=δjk𝑿jk\bm{y}_{jk}=\delta_{jk}{\bm{X}}^{*}_{jk} with δjkiidBernoulli(p)\delta_{jk}\stackrel{{\scriptstyle\mathrm{iid}}}{{\thicksim}}Bernoulli(p). LRMC measurements are both row-wise and column-wise local. To allow correct interpolation across both rows and columns, it needs the incoherence assumption on both its left and right singular vectors. For our problem, the measurements are global for each column, but not across the different columns. For this reason, only right singular vectors’ incoherence is needed. In fact, because of the nature of our measurements, even if left incoherence were assumed, it would not help. This asymmetry in our measurement model and the fact that our measurements are unbounded (each 𝐲ki\bm{y}_{ki} is a Gaussian r.v) are two key differences between LRMC and LRcCS that prevent us from borrowing LRMC proof techniques for our work. Here symmetric means: if we replace 𝑿{\bm{X}}^{*} by its transpose, the probability distribution of the set of measurements does not change. Bounded means that the measurements’ magnitude has a uniform bound. This bound is 𝑿max\|{\bm{X}}^{*}\|_{\max} for LRMC measurements.

Non-convex (iterative, not convex relaxation based) LRMC algorithms with the best sample complexity are GD-based. There are two common approaches for designing GD algorithms in the LR recovery literature, and in particular for LRMC. The first is to use standard projected GD on 𝑿{\bm{X}} (projGD-X), also referred to as Iterative Hard Thresholding: at each iteration, perform one step of GD for minimizing the squared loss cost function, f~(𝑿)\tilde{f}({\bm{X}}), w.r.t. 𝑿{\bm{X}}, followed by projecting the resulting matrix onto the space of rank rr matrices (by SVD). This was studied in [14, 15] for solving LRMC. This is shown to converge geometrically with a constant GD step size, while needing only Ω((n+q)r2log2nlog2(1/ϵ))\Omega((n+q)r^{2}\log^{2}n\log^{2}(1/\epsilon)) samples on average.

The second is to let 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B} where 𝑼{\bm{U}} is n×rn\times r and 𝑩\bm{B} is r×qr\times q and perform alternating GD for the cost function f(𝑼,𝑩):=f~(𝑼𝑩)f({\bm{U}},\bm{B}):=\tilde{f}({\bm{U}}\bm{B}), i.e., update 𝑩\bm{B} with one step of GD for minimizing f(𝑼,𝑩)f({\bm{U}},\bm{B}) while keeping 𝑼{\bm{U}} fixed at its previous value, and then do the same for 𝑼{\bm{U}} with 𝑩\bm{B} fixed, and repeat. Since the 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B} factorization is not unique, i.e., 𝑿=𝑼𝑹1𝑹𝑩{\bm{X}}={\bm{U}}{\bm{R}}^{-1}{\bm{R}}\bm{B} for any invertible r×rr\times r matrix 𝑹{\bm{R}}, this approach can result in the norm of one of 𝑼{\bm{U}} or 𝑩\bm{B} growing in an unbounded fashion, while that of the other decreases at the same rate, causing numerical problems. A typical approach to resolve this issue, and one that was used for LRMC [16, 17], is to change the cost function to minimize to f(𝑼,𝑩)+λf2(𝑼,𝑩)f({\bm{U}},\bm{B})+\lambda f_{2}({\bm{U}},\bm{B}) where f2(𝑼,𝑩):=𝑼𝑼𝑩𝑩Ff_{2}({\bm{U}},\bm{B}):=\|{\bm{U}}^{\top}{\bm{U}}-\bm{B}\bm{B}^{\top}\|_{F} is the “norm-balancing term” (helps ensure that norms of 𝑼{\bm{U}} and 𝑩\bm{B} remain similar). We henceforth refer to this approach as altGDnormbal. The sample complexity bound for this approach is similar to that for projGD-X. But, it needs a GD step size of order 1/r1/r or smaller [16, 17]; making it rr-times slower than projGD-X.

Method               | Sample Comp. (mq ≳)             | Time Comp.                       | Communic. Comp. per node (predicted) | Holds for all 𝑿*? | Column-wise error bound?
---------------------|---------------------------------|----------------------------------|--------------------------------------|-------------------|-------------------------
Convex [7]           | nr (1/ϵ^4)                      | linear-time ⋅ min(1/√ϵ, n^3 r^3) | not clear                            | yes               | no
AltMin [4, 5]        | nr^4 log(1/ϵ)                   | linear-time ⋅ r log^2(1/ϵ)       | nr log(1/ϵ) ⋅ r log^2(1/ϵ)           | no                |
AltMin [6]           | nr^2 (r + log(1/ϵ))             | linear-time ⋅ r log^2(1/ϵ)       | nr log(1/ϵ) ⋅ r log^2(1/ϵ)           | no                | yes
AltGD-Min (proposed) | nr^2 log(1/ϵ)                   | linear-time ⋅ r log(1/ϵ)         | nr ⋅ r log(1/ϵ)                      | no                | yes

Best sample-complexity LRMC algorithms among those that do not solve a convex relaxation:

ProjGD-X [15]        | max(n,q) r^2 log^2 n log^2(1/ϵ) | linear-time ⋅ r log(1/ϵ)         | nq **                                |                   |
AltGDnormbal [16]    | max(n,q) r^2 log n              | linear-time ⋅ r^2 log(1/ϵ)       | max(n,q) r                           |                   |

**The communication complexity of ProjGD-X would be nqnq because the gradient w.r.t. 𝑿{\bm{X}} computed at each node will need to be transmitted by the nodes to the center. The gradient w.r.t. 𝑿{\bm{X}} is not low rank (LR), and hence one cannot transmit just its rank rr SVD.

TABLE I: Existing work versus our work. For brevity, this table assumes qnq\leq n and treats κ,μ\kappa,\mu as numerical constants. All approaches also need mmax(r,logq,logn)m\geq\max(r,\log q,\log n). Column-wise error bound exists means maxk𝒙k𝒙k/𝒙kϵ\max_{k}\|\bm{x}^{*}_{k}-\bm{x}_{k}\|/\|\bm{x}^{*}_{k}\|\leq\epsilon holds in addition to a similar bound on matrix Frobenius norm error. Linear-time is the time needed to read all algorithm inputs. For LRcCS, this is 𝒚k,𝑨k\bm{y}_{k},\bm{A}_{k} for all k[q]k\in[q] and thus linear-time is order mnqmnq. For LRMC, this is the set of observed entries and their locations and thus linear-time is order mqmq. None of the other algorithms have been studied in the federated context and hence the communication complexity (Comm. Comp.) listed in the fourth column is based on our understanding of how one would federate the algorithm. Notice that AltGD-Min has the best time and communication complexities; and for ϵ<1/r1/4\epsilon<1/r^{1/4}, it also has the best sample complexity.

I-C Contributions and Novelty

Contribution to solving LRcCS and LRPR. (1) This work develops a novel GD-based solution to LRcCS, called AltGD-Min, that is fast and communication-efficient. We show that, with high probability (w.h.p.), AltGD-Min obtains an ϵ\epsilon-accurate estimate in order κ2log(1/ϵ)\kappa^{2}\log(1/\epsilon) iterations, as long as Assumption 1.1 holds, the matrices 𝑨k\bm{A}_{k} are i.i.d., with each containing i.i.d. standard Gaussian entries, mqΩ(κ6μ2(n+q)r2log(1/ϵ))mq\in\Omega(\kappa^{6}\mu^{2}(n+q)r^{2}\log(1/\epsilon)), and mΩ(max(logq,logn)log(1/ϵ))m\in\Omega(\max(\log q,\log n)\log(1/\epsilon)). Its time complexity is O(mqnrκ2log(1/ϵ))O(mqnr\cdot\kappa^{2}\log(1/\epsilon)) and its communication complexity per node is O(nrκ2log(1/ϵ))O(nr\cdot\kappa^{2}\log(1/\epsilon)). We provide a comparison of our guarantee with those of other works in Table I. This table also summarizes the guarantees for the two most sample-efficient LRMC solutions: projGD-X and altGDnormbal. The former is also the fastest LRMC solution, while the latter is the most communication-efficient. As mentioned earlier, LRMC is the most similar problem to ours that has been extensively studied. Notice that our sample complexity matches that of the best results for LRMC algorithms that do not solve a convex relaxation. (2) We show that a simple extension of AltGD-Min also provides the fastest provable solution to LRPR, as long as the above assumptions hold and mqΩ(κ6μ2nr2(r+log(1/ϵ)))mq\in\Omega(\kappa^{6}\mu^{2}nr^{2}(r+\log(1/\epsilon))). Its time complexity is the same too.

Contributions / Novelty of algorithm design and proof techniques. As explained earlier, there are three commonly used provably correct iterative algorithms for LR recovery problems – altMin, projGD-X, and altGD (altGDnormbal to be precise). AltMin is slower than GD-based methods because, for updating both 𝑼{\bm{U}} and 𝑩\bm{B}, it requires solving a minimization problem keeping the other variable fixed. For our specific asymmetric problem, the min step for 𝑼{\bm{U}} is the slow one. ProjGD-X and altGDnormbal are faster, but it is not clear how to analyze them for LRcCS under the desired sample complexity.¹ Our novel altGD-min approach however resolves both issues: it is as fast as projGD-X and it can be analyzed. Moreover, its communication complexity for a federated implementation (and its memory complexity) is only nrnr per node per iteration, instead of nqnq for projGD-X. As can be seen from Table I, treating κ,μ\kappa,\mu as numerical constants, it has the best sample-, time-, and communication/memory- complexity among all approaches for LRcCS and all fast (iterative) approaches for LRMC as well. Because of this, an AltGD-Min type algorithm may also be of interest for solving LRMC in a fast, sample-efficient and communication-efficient fashion. In fact, it can also be useful for other bilinear inverse problems such as blind deconvolution.

¹In order to show that a GD-based algorithm converges, one needs to be able to bound the norm of the gradient and show that it goes to zero with iterations. When studying both projGD-X and altGDnormbal, for different reasons, the estimates of the different columns are coupled. Consequently, it is not possible to get a tight enough bound on maxk𝒙k𝒙k\max_{k}\|\bm{x}^{*}_{k}-\bm{x}_{k}\|. But, due to the form of the LRcCS measurement model, such a bound is needed to get a tight enough bound on the 2-norm of the gradient of the cost function, and show that it decreases sufficiently at each iteration, under the desired sample complexity. Moreover, in case of projGD-X, even if one could somehow get the desired bound, it would not suffice because the summands will still be too heavy tailed. This point is explained in detail in Appendix A.

AltGDmin algorithm. The main idea is as follows. Express 𝑿{\bm{X}} as 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B} and alternatively update 𝑼{\bm{U}} and 𝑩\bm{B} as follows: (a) keeping 𝑩\bm{B} fixed at its previous value, update 𝑼{\bm{U}} by a GD step for it for the cost function f(𝑼,𝑩)f({\bm{U}},\bm{B}) followed by projecting the output onto the space of matrices with orthonormal columns; and (b) keeping 𝑼{\bm{U}} fixed at its previous value, update 𝑩\bm{B} by minimizing f(𝑼,𝑩)f({\bm{U}},\bm{B}) over it. Because of the column-wise decoupled form of our measurement model, step (b) is as fast as the GD step and thus the per-iteration time complexity of AltGD-Min is equal to that of any other GD method such as projGD-X or altGDnormbal. This decoupling (which means that, given 𝑼{\bm{U}}, 𝒃k\bm{b}_{k} only depends on 𝒙k\bm{x}^{*}_{k}, and not on the other columns of 𝑿{\bm{X}}^{*}) also allows us to get the desired tight-enough bound on maxk𝒃k𝑼𝒙k\max_{k}\|\bm{b}_{k}-{\bm{U}}^{\top}\bm{x}^{*}_{k}\| and hence on maxk𝒙k𝒙k\max_{k}\|\bm{x}_{k}-\bm{x}^{*}_{k}\|. This, and the fact that we use the gradient w.r.t. 𝑼{\bm{U}} in our algorithm, means that the summands in the gradient, and in other error bound terms, are nice-enough sub-exponential random variables (r.v.s): sub-exponential r.v.s whose maximum sub-exponential norm is small enough (is proportional to (r/q)(r/q)), so that the summation can be bounded w.h.p. under the desired sample complexity.

AltGDmin analysis. When we analyzed the AltMin approach for LRPR [5, 6], we could directly modify proof techniques from AltMin for LRMC [2] for getting a bound on SD(𝑼,𝑼)\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{}) in terms of the bound on this distance from the previous iteration. We cannot do this for AltGD-Min because the algorithm itself is different from the two GD approaches studied for solving LRMC. We instead analyze AltGD-Min by a novel use of the fundamental theorem of calculus [18] that, along with other linear algebra tricks, helps us get a bound on SD(𝑼,𝑼)\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{}) which has the desired property: the terms in it are sums of nice-enough sub-exponentials. See Lemma 3.4 and its proof. The use of this result is motivated by its use in [19], and many earlier works, where it is used in a standard way: to bound the Euclidean distance, 𝒙𝒙\|\bm{x}-\bm{x}^{*}\|, for standard GD to solve the PR problem for recovering a single vector 𝒙\bm{x}^{*}. Thus, at the true solution 𝒙=𝒙\bm{x}=\bm{x}^{*}, the gradient of the cost function was zero. In our case, there are two differences: (i) we need to bound the subspace distance error, and (ii) our algorithm is not standard GD, and this means that Uf(𝑼𝑼𝑼,𝑩)0\nabla_{U}f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B})\neq 0. We explain our approach in Sec. III-B.

AltGDmin initialization. The standard LR spectral initialization approach cannot be used because its summands are sub-exponential r.v.s that are not nice-enough. We give a detailed explanation in Appendix A. We address this issue by borrowing the truncation idea from the PR literature [20, 21, 5]. But, in our case, truncation is applied to a non-symmetric matrix. Thus the sandwiching arguments developed for symmetric matrices in [20], and modified in [21, 5], cannot be borrowed. We need a different argument which is used for proving Lemma B.2 and is briefly explained in Sec. III-D.

I-D Applications

The LRcCS and LRPR problems occur in projection imaging applications involving sets of images, e.g., dynamic MRI [8, 9, 10], federated LR sketching [11, 7], and dynamic Fourier ptychography (LRPR) [22]. In MRI, Fourier projections of the region of interest, e.g., a cross-section of the brain or the heart, are acquired one coefficient at a time, making the scanning (data acquisition) quite slow. Hence, reduced sample complexity enables accelerated scanning. Since medical image sequences are usually slow changing, the LR model is a valid assumption for a time sequence [8, 9, 10]. In our notation, 𝒙k\bm{x}^{*}_{k} is the vectorized version of the kk-th image of the sequence and there are a total of qq images. The matrices 𝑨k\bm{A}_{k} are random Fourier, i.e., 𝑨k=𝑯k𝑭\bm{A}_{k}=\bm{H}_{k}{\bm{F}} where 𝑭{\bm{F}} is the n×nn\times n matrix that models computation of the 2D discrete Fourier transform as a matrix-vector operation, and 𝑯k\bm{H}_{k} is an m×nm\times n random sampling “mask” matrix that models the frequency selection. In [13], we have shown the power of AltGD-Min for fast undersampled dynamic MRI of medical image sequences. It is both much faster, and in most cases, also provides better reconstructions, than many existing solutions from the MRI literature.

Large scale usage of smartphones results in large amounts of geographically distributed data, e.g., images. There is a need to compress/sketch this data before storing it. Sketch refers to a compression approach where the compression end is low complexity, usually simple linear projections [11, 7]. Consider the setting where different subsets of columns of 𝑿{\bm{X}}^{*} (each column corresponds to one vectorized image) are available at each of the ρq\rho\leq q nodes. The goal is to sketch them so that they can be correctly recovered using a federated algorithm. We can store the sketches 𝒚k:=𝑨k𝒙k\bm{y}_{k}:=\bm{A}_{k}\bm{x}^{*}_{k} with 𝑨k\bm{A}_{k}’s being i.i.d. Gaussian. This way we store a total of only mqmq scalars, with mqmq of order roughly just (n+q)r2(n+q)r^{2}. Traditional LR sketching approaches, e.g., [23], are designed for centralized settings and will not be efficient in a distributed setting.

I-E Organization

In Sec. II, we develop AltGD-Min, give its guarantee for solving LRcCS, and compare it with existing results. We state and prove the two theorems that help prove our main result in Sec. III. This section also contains brief proof outlines before the actual proofs. The lemmas used in these proofs are proved in Sec. IV. The extension for solving LRPR is developed, and its guarantee is stated and proved, in Sec. V. We discuss the limitations of our results in Sec. VI. Simulation experiments are provided in Sec. VII. We conclude in Sec. VIII.

II The Proposed AltGD-Min Algorithm and Guarantee

II-A The AltGD-Min algorithm

We would like to design a fast GD algorithm to find the matrix 𝑿{\bm{X}} that minimizes the squared-loss cost function f~(𝑿):=k=1q𝒚k𝑨k𝒙k2.\tilde{f}({\bm{X}}):=\sum_{k=1}^{q}\|\bm{y}_{k}-\bm{A}_{k}\bm{x}_{k}\|^{2}. For reasons described earlier, we decompose 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B} and develop an alternating GD-min (AltGD-Min) approach for the squared loss function,

f(𝑼,𝑩):=f~(𝑼𝑩)=k𝒚k𝑨k𝑼𝒃k2.f({\bm{U}},\bm{B}):=\tilde{f}({\bm{U}}\bm{B})=\sum_{k}\|\bm{y}_{k}-\bm{A}_{k}{\bm{U}}\bm{b}_{k}\|^{2}.

Starting with a careful initialization for 𝑼{\bm{U}} explained below, AltGD-Min proceeds as follows. At each new iteration,

  • Min-B: update 𝑩\bm{B} by solving 𝑩argmin𝑩~f(𝑼,𝑩~)\bm{B}\leftarrow\arg\min_{\tilde{\bm{B}}}f({\bm{U}},\tilde{\bm{B}}). Since 𝒃k\bm{b}_{k} only occurs in the kk-th summand of f(𝑼,𝑩)f({\bm{U}},\bm{B}), this decouples to a much simpler column-wise least squares (LS) problem: 𝒃kargmin𝒃~k𝒚k𝑨k𝑼𝒃~k2\bm{b}_{k}\leftarrow\arg\min_{\tilde{\bm{b}}_{k}}\|\bm{y}_{k}-\bm{A}_{k}{\bm{U}}\tilde{\bm{b}}_{k}\|^{2}. This is solved in closed form as 𝒃k=(𝑨k𝑼)𝒚k\bm{b}_{k}=(\bm{A}_{k}{\bm{U}})^{\dagger}\bm{y}_{k} for each kk; here 𝑴:=(𝑴𝑴)1𝑴\bm{M}^{\dagger}:=(\bm{M}^{\top}\bm{M})^{-1}\bm{M}^{\top}.

  • ProjGD-U: update 𝑼{\bm{U}} by one GD step for it, 𝑼^+𝑼ηUf(𝑼,𝑩)\hat{\bm{U}}^{+}\leftarrow{\bm{U}}-\eta\nabla_{U}f({\bm{U}},\bm{B}), followed by projecting 𝑼^+\hat{\bm{U}}^{+} onto the space of matrices with orthonormal columns to get the updated 𝑼+{\bm{U}}^{+}. We get 𝑼+{\bm{U}}^{+} by QR decomposition: 𝑼^+=QR𝑼+𝑹+\hat{\bm{U}}^{+}\overset{\mathrm{QR}}{=}{\bm{U}}^{+}{\bm{R}}^{+}.

Notice that, because of the decoupling for 𝑩\bm{B}, the min step only involves solving qq rr-dimensional Least Squares (LS) problems, in addition to also first computing the matrices, 𝑨k𝑼\bm{A}_{k}{\bm{U}}. Computing the matrices needs time of order mnrmnr, and solving one LS problem needs time of order mr2mr^{2}. Thus, the LS step needs time O(qmax(mnr,mr2))=O(mqnr)O(q\max(mnr,mr^{2}))=O(mqnr) since rnr\leq n. This is equal to the time needed to compute the gradient w.r.t. 𝑼{\bm{U}}; and thus, the per-iteration cost of AltGD-Min is only O(mqnr)O(mqnr). The QR decomposition of an n×rn\times r matrix takes time only nr2nr^{2}.
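
To make the two steps above concrete, the following is a minimal numpy sketch of one AltGD-Min iteration (single node, no sample-splitting, dense Gaussian 𝑨k\bm{A}_{k}s; all function and variable names are ours and this is only an illustrative sketch, not the implementation used in our experiments):

```python
import numpy as np

def altgdmin_iteration(U, y_list, A_list, eta):
    """One AltGD-Min iteration: closed-form LS update of B (Min-B step),
    then a GD step for U followed by re-orthonormalization via QR (ProjGD-U step).
    U: n x r matrix with orthonormal columns; y_list[k], A_list[k]: y_k, A_k."""
    n, r = U.shape
    m = A_list[0].shape[0]
    grad = np.zeros((n, r))
    B_cols = []
    for y_k, A_k in zip(y_list, A_list):
        AU = A_k @ U                                     # m x r
        b_k, *_ = np.linalg.lstsq(AU, y_k, rcond=None)   # decoupled column-wise LS
        B_cols.append(b_k)
        residual_k = AU @ b_k - y_k                      # A_k U b_k - y_k
        grad += np.outer(A_k.T @ residual_k, b_k)        # k-th summand of grad_U f
    U_hat = U - (eta / m) * grad                         # GD step for U
    U_new, _ = np.linalg.qr(U_hat)                       # projection onto orthonormal columns
    return U_new, np.column_stack(B_cols)                # updated U and r x q matrix B
```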

Since f(𝑼,𝑩)f({\bm{U}},\bm{B}) is not a convex function of the unknowns {𝑼,𝑩}\{{\bm{U}},\bm{B}\}, a careful initialization is needed. Borrowing the spectral initialization idea from LRMC and LRMS solutions, we should initialize 𝑼0{\bm{U}}_{0} by computing the top rr singular vectors of

𝑿0,full=1m[(𝑨1𝒚1),(𝑨2𝒚2),,(𝑨k𝒚k),(𝑨q𝒚q)]{\bm{X}}_{0,full}=\frac{1}{m}[(\bm{A}_{1}^{\top}\bm{y}_{1}),(\bm{A}_{2}^{\top}\bm{y}_{2}),\dots,(\bm{A}_{k}^{\top}\bm{y}_{k}),\dots(\bm{A}_{q}^{\top}\bm{y}_{q})]

Clearly the expected value of the kk-th column of this matrix equals 𝒙k\bm{x}^{*}_{k} and thus 𝔼[𝑿0,full]=𝑿\mathbb{E}[{\bm{X}}_{0,full}]={\bm{X}}^{*}. But, as we explain next, it is not clear how to prove that this matrix concentrates around 𝑿{\bm{X}}^{*}. Observe that it can also be written as

𝑿0,full:=1mk=1qi=1m𝒂ki𝒚ki𝒆k\displaystyle{\bm{X}}_{0,full}:=\frac{1}{m}\sum_{k=1}^{q}\sum_{i=1}^{m}\bm{a}_{ki}\bm{y}_{ki}\bm{e}_{k}{}^{\top}

Its summands are independent sub-exponential r.v.s with maximum sub-exponential norm maxk𝒙kμr/qσmax\max_{k}\|\bm{x}^{*}_{k}\|\leq\mu\sqrt{r/q}{\sigma_{\max}^{*}}. This is too large and does not allow us to bound 𝑿0,full𝑿\|{\bm{X}}_{0,full}-{\bm{X}}^{*}\| under the desired sample complexity; see Appendix A. To resolve this issue, we borrow the truncation idea from earlier work on PR [20, 5] and initialize 𝑼0{\bm{U}}_{0} as the top rr left singular vectors of

𝑿0\displaystyle{\bm{X}}_{0} :=\displaystyle:= 1mk=1qi=1m𝒂ki𝒚ki𝒆k𝟙{𝒚ki2α}\displaystyle\frac{1}{m}\sum_{k=1}^{q}\sum_{i=1}^{m}\bm{a}_{ki}\bm{y}_{ki}\bm{e}_{k}{}^{\top}\mathbbm{1}_{\left\{\bm{y}_{ki}^{2}\leq\alpha\right\}} (2)
=\displaystyle= 1mk=1q𝑨k𝒚k,trunc(α)𝒆k\displaystyle\frac{1}{m}\sum_{k=1}^{q}\bm{A}_{k}^{\top}\bm{y}_{k,trunc}(\alpha)\bm{e}_{k}^{\top}

where α:=C~ki(𝒚ki)2mq\alpha:=\tilde{C}\frac{\sum_{ki}(\bm{y}_{ki})^{2}}{mq} and 𝒚k,trunc(α):=𝒚k𝟙(|𝒚k|α)\bm{y}_{k,trunc}(\alpha):=\bm{y}_{k}\circ\mathbbm{1}(|\bm{y}_{k}|\leq\sqrt{\alpha}). We set C~\tilde{C} in our main result. Observe that we are summing over only those i,ki,k for which 𝒚ki2\bm{y}_{ki}^{2} is not too large (is not much larger than its empirically computed average value). This truncation filters out the too large (outlier-like) measurements and sums over the rest. Theoretically, this converts the summands into sub-Gaussian r.v.s which have lighter tails than the un-truncated ones. This allows us to prove the desired concentration bound. Different from the above setting, in [20, 5], truncation was applied to symmetric positive definite matrices and was used to convert summands that were heavier-tailed than sub-exponential to sub-exponential.
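
A minimal numpy sketch of this truncated spectral initialization (2) is given below (names are ours; the multiplier C_tilde is set as discussed in Sec. II-A1):

```python
import numpy as np

def truncated_init(y_list, A_list, r, C_tilde=9.0):
    """U_0 = top-r left singular vectors of the truncated matrix X_0 in (2)."""
    m, n = A_list[0].shape
    q = len(y_list)
    alpha = C_tilde * sum(np.sum(y ** 2) for y in y_list) / (m * q)
    X0 = np.zeros((n, q))
    for k, (y_k, A_k) in enumerate(zip(y_list, A_list)):
        y_trunc = y_k * (np.abs(y_k) <= np.sqrt(alpha))  # zero out outlier-like entries
        X0[:, k] = A_k.T @ y_trunc / m
    U0, _, _ = np.linalg.svd(X0, full_matrices=False)
    return U0[:, :r], X0
```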

We summarize the complete algorithm in Algorithm 1. This uses sample-splitting which is a commonly used approach in the LR recovery literature [2, 14, 15] as well as in other compressive sensing settings. It helps ensure that the measurement matrices in each iteration for updating 𝑼{\bm{U}} and 𝑩\bm{B} are independent of all previous iterates. This allows one to use concentration bounds for sums of independent r.v.s. We provide a detailed discussion in Sec. VI-A.

II-A1 Practical algorithm and setting algorithm parameters

First, when we implement the algorithm, we use Algorithm 1 with the full set of measurements for all the steps (no sample-splitting). The algorithm has 4 parameters: η\eta, TT, C~\tilde{C} and the rank rr. According to the theorem below, we should set η=c/σmax2\eta=c/{\sigma_{\max}^{*}}^{2} with c<0.5c<0.5. But σmax{\sigma_{\max}^{*}} is not known. The initialization matrix 𝑿0{\bm{X}}_{0} provides an approximation to 𝑿{\bm{X}}^{*} and hence we can set η=c/𝑿02\eta=c/\|{\bm{X}}_{0}\|^{2}. Consider C~\tilde{C}. The theorem requires setting C~=9κ2μ2\tilde{C}=9\kappa^{2}\mu^{2}; however, κ,μ\kappa,\mu are functions of 𝑿{\bm{X}}^{*} which is unknown. Using the definition of μ\mu from Assumption 1.1, we can replace κ2μ2\kappa^{2}\mu^{2} by an estimate of its lower bound: qmaxk𝒙k2^/𝑿F2^q\cdot\max_{k}\widehat{\|\bm{x}^{*}_{k}\|^{2}}/\widehat{\|{\bm{X}}^{*}\|_{F}^{2}} with 𝒙k2^=(1/m)i𝒚ki2\widehat{\|\bm{x}^{*}_{k}\|^{2}}=(1/m)\sum_{i}\bm{y}_{ki}^{2} and 𝑿F2^=(1/m)ki𝒚ki2\widehat{\|{\bm{X}}^{*}\|_{F}^{2}}=(1/m)\sum_{k}\sum_{i}\bm{y}_{ki}^{2}. To set the total number of algorithm iterations TT, we can use a large maximum value along with breaking the loop if a stopping criterion is satisfied. A common stopping criterion for GD is to stop when the iterates do not change much. One way to do this is to stop when SD(𝑼t,𝑼t1)0.01r\mathrm{SD}({\bm{U}}_{t},{\bm{U}}_{t-1})\leq 0.01\sqrt{r} for the last few iterations.
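
As an example, the data-driven surrogates above could be computed as follows (a sketch with our own names; C~\tilde{C} uses only the measurements and is computed before forming 𝑿0{\bm{X}}_{0}, while η\eta is computed after):

```python
import numpy as np

def estimate_C_tilde(y_list, m):
    """Surrogate for C_tilde = 9*kappa^2*mu^2: replace kappa^2*mu^2 by the estimate
    q * max_k ||x*_k||^2 / ||X*||_F^2 computed from the measurements (Sec. II-A1)."""
    col_energy = np.array([np.sum(y ** 2) / m for y in y_list])  # estimates of ||x*_k||^2
    return 9.0 * len(y_list) * col_energy.max() / col_energy.sum()

def estimate_eta(X0, c=0.5):
    """Step size eta = c / ||X0||^2 (spectral norm), using X0 as a proxy for sigma_max^*."""
    return c / np.linalg.norm(X0, 2) ** 2
```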

As explained in [13], we can use the following constraints to set the rank. We need our choice of rank, r^\hat{r}, to be sufficiently small compared to min(n,q)\min(n,q) for the algorithm to take advantage of the LR assumption. Moreover, for the LS step for updating 𝒃k\bm{b}_{k}’s (which are rr-length vectors) to work well (for its error to be small), we also need it to be small compared with mm. One approach that is used often is to use the “b%b\% energy threshold” on singular values. Thus, one good heuristic that respects the above constraints is to compute the “b%b\% energy threshold” of the first min(n,q,m)/10\min(n,q,m)/10 singular values, i.e., compute r^\hat{r} as the smallest value of rr for which

j=1rσj(𝑿0)2(b/100)j=1min(n,q,m)/10σj(𝑿0)2\sum_{j=1}^{{r}}\sigma_{j}({\bm{X}}_{0})^{2}\geq(b/100)\cdot\sum_{j=1}^{\min(n,q,m)/10}\sigma_{j}({\bm{X}}_{0})^{2}

for a b100b\leq 100. In our MRI experiments in [13], we used b=85b=85. We also realized from the experiments that the algorithm is not very sensitive to this value as long as r^min(n,q,m)\hat{r}\ll\min(n,q,m).
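
A possible numpy sketch of this rank-selection heuristic (names are ours; b=85b=85 was the value used in [13]):

```python
import numpy as np

def select_rank(X0, m, b=85):
    """Smallest r whose singular values capture b% of the energy of the first
    min(n, q, m)/10 singular values of X0 (the heuristic described above)."""
    n, q = X0.shape
    sv = np.linalg.svd(X0, compute_uv=False)
    k_max = max(1, min(n, q, m) // 10)
    energy = np.cumsum(sv[:k_max] ** 2)
    return int(np.searchsorted(energy, (b / 100.0) * energy[-1]) + 1)
```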

II-A2 Federating the algorithm

Suppose that our sketches 𝒚k\bm{y}_{k} are geographically distributed across a set of LL nodes. Each node {\ell} stores a subset, denoted 𝒮\mathcal{S}_{\ell}, of the 𝒚k\bm{y}_{k}s with |𝒮|=q|\mathcal{S}_{\ell}|=q_{\ell}. These subsets are mutually disjoint so that q=q\sum_{\ell}q_{\ell}=q. Typically LqL\ll q. Privacy constraints dictate that we cannot share the 𝒚k\bm{y}_{k}s with the central server; although summaries computed using the 𝒚k\bm{y}_{k}s can be shared at each algorithm iteration. This will be done as follows. Consider the GDmin steps of Algorithm 1 first. Line 13 (Update 𝒃k\bm{b}_{k}s, 𝒙k\bm{x}_{k}s) is done locally at the node that stores the corresponding 𝒚k\bm{y}_{k}. For line 14 (Gradient w.r.t 𝑼{\bm{U}}), the partial sums over k𝒮k\in\mathcal{S}_{\ell} are computed at node {\ell} and transmitted to the center which adds all the partial sums to obtain 𝑼f(𝑼,𝑩)\nabla_{\bm{U}}f({\bm{U}},\bm{B}). Line 15 (GD step) and line 16 (projection via QR) are done at the center. The updated 𝑼{\bm{U}} is then broadcast to all the nodes for use in the next iteration. The per node time complexity of this algorithm is thus mnrqmnrq_{\ell} at each iteration. The center only performs additions and a QR decomposition (an order nr2nr^{2} operation) in each iteration. Thus, the time complexity of the federated solution is only mnr(maxq)Tmnr(\max_{\ell}q_{\ell})T per node.

The initialization step can be federated by using the Power Method (PM) [24, 25] to compute the top rr eigenvectors of 𝑿0𝑿0{\bm{X}}_{0}{\bm{X}}_{0}{}^{\top}. Any PM guarantee helps ensure that its output is close in subspace distance to the span of the top rr eigenvectors of 𝑿0𝑿0{\bm{X}}_{0}{\bm{X}}_{0}{}^{\top} after a sufficient number of iterations. The communication complexity of the federated implementation is thus just nrnr per node per iteration (need to share the partial gradient sums). Observe also that the information shared with the center is not sufficient to recover 𝑿{\bm{X}}^{*} centrally. It is only sufficient to recover span(𝑼)\mathrm{span}({\bm{U}}^{*}{}). The recovery of the columns of 𝑩\bm{B}, 𝒃k\bm{b}^{*}_{k}, is entirely done locally at the node where the corresponding 𝒚k\bm{y}_{k} is stored, thus ensuring privacy.
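
The following numpy sketch illustrates this node/center split for one GDmin iteration (names are ours; the transmission of the n×rn\times r partial gradient sums to the center and the broadcast of the updated 𝑼{\bm{U}} are left implicit):

```python
import numpy as np

def node_partial_gradient(U, y_local, A_local):
    """Run at node l: update the locally stored b_k's by LS and return the
    partial gradient sum over k in S_l (only this n x r matrix is sent to the center)."""
    grad_l = np.zeros(U.shape)
    for y_k, A_k in zip(y_local, A_local):
        AU = A_k @ U
        b_k, *_ = np.linalg.lstsq(AU, y_k, rcond=None)      # b_k (and x_k = U b_k) stay local
        grad_l += np.outer(A_k.T @ (AU @ b_k - y_k), b_k)
    return grad_l

def center_update(U, partial_grads, eta, m):
    """Run at the center: add the partial sums, take the GD step, re-orthonormalize."""
    grad = sum(partial_grads)
    U_new, _ = np.linalg.qr(U - (eta / m) * grad)
    return U_new  # broadcast back to all nodes
```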

Algorithm 1 The AltGD-Min algorithm. Let 𝑴:=(𝑴𝑴)1𝑴\bm{M}^{\dagger}:=(\bm{M}^{\top}\bm{M})^{-1}\bm{M}^{\top}.
1:Input: 𝒚k,𝑨k,k[q]\bm{y}_{k},\bm{A}_{k},k\in[q]
2:Parameters: Multiplier in specifying α\alpha for init step, C~\tilde{C}; GD step size, η\eta; Number of iterations, TT
3:Sample-split: Partition the measurements and measurement matrices into 2T+12T+1 equal-sized disjoint sets: one set for initialization and 2T2T sets for the iterations. Denote these by 𝒚k(τ),𝑨k(τ),τ=0,1,2T\bm{y}_{k}^{(\tau)},\bm{A}_{k}^{(\tau)},\tau=0,1,\dots 2T.
4:Initialization:
5:Using 𝒚k𝒚k(0),𝑨k𝑨k(0)\bm{y}_{k}\equiv\bm{y}_{k}^{(0)},\bm{A}_{k}\equiv\bm{A}_{k}^{(0)}, set
6:α=C~1mqki|𝒚ki|2\alpha=\tilde{C}\frac{1}{mq}\sum_{ki}\big{|}\bm{y}_{ki}\big{|}^{2},
7:𝒚k,trunc(α):=𝒚k𝟙{|𝒚k|α}\bm{y}_{k,trunc}(\alpha):=\bm{y}_{k}\circ\mathbbm{1}\{|\bm{y}_{k}|\leq\sqrt{\alpha}\}
8:𝑿0:=(1/m)k[q]𝑨k𝒚k,trunc(α)𝒆k\displaystyle{\bm{X}}_{0}:=(1/m)\sum_{k\in[q]}\bm{A}_{k}^{\top}\bm{y}_{k,trunc}(\alpha)\bm{e}_{k}^{\top}
9:Set 𝑼0{\bm{U}}_{0}\leftarrow top-rr-singular-vectors of 𝑿0{\bm{X}}_{0}
10:GDmin iterations:
11:for t=1t=1 to TT do
12:     Let 𝑼𝑼t1{\bm{U}}\leftarrow{\bm{U}}_{t-1}.
13:     Update bk,xk\bm{b}_{k},\bm{x}_{k}: For each k[q]k\in[q], set (𝒃k)t(𝑨k(t)𝑼)𝒚k(t)(\bm{b}_{k})_{t}\leftarrow(\bm{A}_{k}^{(t)}{\bm{U}})^{\dagger}\bm{y}_{k}^{(t)} and set (𝒙k)t𝑼(𝒃k)t(\bm{x}_{k})_{t}\leftarrow{\bm{U}}(\bm{b}_{k})_{t}
14:     Gradient w.r.t. U{\bm{U}}: With 𝒚k𝒚k(T+t),𝑨k𝑨k(T+t)\bm{y}_{k}\equiv\bm{y}_{k}^{(T+t)},\bm{A}_{k}\equiv\bm{A}_{k}^{(T+t)}, compute 𝑼f(𝑼,𝑩t)=k𝑨k(𝑨k𝑼(𝒃k)t𝒚k)(𝒃k)t\nabla_{\bm{U}}f({\bm{U}},\bm{B}_{t})=\sum_{k}\bm{A}_{k}^{\top}(\bm{A}_{k}{\bm{U}}(\bm{b}_{k})_{t}-\bm{y}_{k})(\bm{b}_{k})_{t}^{\top}
15:     GD step: Set 𝑼^+𝑼(η/m)𝑼f(𝑼,𝑩t)\displaystyle\hat{\bm{U}}^{+}\leftarrow{\bm{U}}-(\eta/m)\nabla_{\bm{U}}f({\bm{U}},\bm{B}_{t}).
16:     Projection step: Compute 𝑼^+=QR𝑼+𝑹+\hat{\bm{U}}^{+}\overset{\mathrm{QR}}{=}{\bm{U}}^{+}{\bm{R}}^{+}.
17:     Set 𝑼t𝑼+{\bm{U}}_{t}\leftarrow{\bm{U}}^{+}.
18:end for

II-B Main Result

We can prove the following result.

Theorem 2.1.

Consider Algorithm 1. Let mtm_{t} denote the number of samples used in iteration tt. Set C~=9κ2μ2\tilde{C}=9\kappa^{2}\mu^{2}, η=c/σmax2\eta=c/{\sigma_{\max}^{*}}^{2} with a c0.5c\leq 0.5, and T=Cκ2log(1/ϵ)T=C\kappa^{2}\log(1/\epsilon). Assume that Assumption 1.1 holds and that the 𝐀k\bm{A}_{k}s are i.i.d. and each contains i.i.d. standard Gaussian entries. If

m0qCκ6μ2(n+q)r2,m_{0}q\geq C\kappa^{6}\mu^{2}(n+q)r^{2},

and mtm_{t} for t1t\geq 1 satisfies

mtqCκ4μ2(n+q)r2logκ and mtCmax(r,logq,logn)m_{t}q\geq C\kappa^{4}\mu^{2}(n+q)r^{2}\log\kappa\text{ and }m_{t}\geq C\max(r,\log q,\log n)

then, with probability (w.p.) at least 1tn101-tn^{-10}, for all t0t\geq 0,

SD(𝑼t,𝑼)(1(ησmax2)0.4κ2)tδ0\mathrm{SD}({\bm{U}}_{t},{\bm{U}}^{*}{})\leq\left(1-\frac{(\eta{\sigma_{\max}^{*}}^{2})0.4}{\kappa^{2}}\right)^{t}\delta_{0}

with δ0=0.09/κ2.\delta_{0}=0.09/\kappa^{2}. Thus, with T=Cκ2log(1/ϵ)T=C\kappa^{2}\log(1/\epsilon) and η=0.5/σmax2\eta=0.5/{\sigma_{\max}^{*}}^{2}, w.p. at least 1(T+1)n101-(T+1)n^{-10},

SD(𝑼T,𝑼)ϵ,(𝒙k)T𝒙kϵ𝒙k, for all k[q],\displaystyle\mathrm{SD}({\bm{U}}_{T},{\bm{U}}^{*}{})\leq\epsilon,\ {\|(\bm{x}_{k})_{T}-\bm{x}^{*}_{k}\|}\leq\epsilon{\|\bm{x}^{*}_{k}\|},\text{ for all $k\in[q]$, }
𝑿T𝑿F1.4ϵ𝑿\displaystyle\|{\bm{X}}_{T}-{\bm{X}}^{*}\|_{F}\leq 1.4\epsilon\|{\bm{X}}^{*}\|

Sample complexity The sample complexity (total number of samples needed to achieve ϵ\epsilon-accurate recovery) is mtot=τ=0Tmτm0+Tmint1mtm_{\mathrm{tot}}=\sum_{\tau=0}^{T}m_{\tau}\geq m_{0}+T\min_{t\geq 1}m_{t}. From the above result, this needs to satisfy mtotqCκ6μ2(n+q)r2log(1/ϵ)log(κ)m_{tot}q\geq C\kappa^{6}\mu^{2}(n+q)r^{2}\log(1/\epsilon)\log(\kappa) and mtot>Cκ2max(r,logq,logn)log(1/ϵ)m_{tot}>C\kappa^{2}\max(r,\log q,\log n)\log(1/\epsilon).

Time complexity Let mmtm\equiv m_{t}. The initialization step needs time mqnmqn for computing 𝑿0{\bm{X}}_{0}; and time of order nqrnqr times the number of iterations used in the rr-SVD step. Since we only need a δ0\delta_{0}-accurate initial estimate of span(𝑼)\mathrm{span}({\bm{U}}^{*}{}), with δ0=c/κ2\delta_{0}=c/\kappa^{2}, order log(κ)\log(\kappa) number of iterations suffice for this SVD step. Thus the complexity is O(nq(m+r)logκ)=O(mqnlogκ)O(nq(m+r)\cdot\log\kappa)=O(mqn\cdot\log\kappa) since mrm\geq r. One gradient computation needs time O(mqnr)O(mqnr). The QR decomposition needs time of order nr2nr^{2}. The update of columns of 𝑩\bm{B} by LS also needs time O(mqnr)O(mqnr) (explained earlier). As we prove above, we need to repeat these steps T=O(κ2log(1/ϵ))T=O(\kappa^{2}\log(1/\epsilon)) times. Thus the total time complexity is O(mqnlogκ+max(mqnr,nr2,mqnr)T)=O(κ2mqnrlog(1/ϵ)logκ)O(mqn\log\kappa+\max(mqnr,nr^{2},mqnr)\cdot T)=O(\kappa^{2}mqnr\log(1/\epsilon)\log\kappa).

Communication complexity The communication complexity per node per iteration for a federated implementation is just order nrnr. Thus, the total is O(nrκ2log(1/ϵ))O(nr\cdot\kappa^{2}\log(1/\epsilon)).

Thus, we have the following corollary.

Corollary 2.2 (AltGD-Min).

In the setting of Theorem 2.1, if Assumption 1.1 holds, and if

mtotqCκ6μ2(n+q)r2log(1/ϵ)log(κ)m_{tot}q\geq C\kappa^{6}\mu^{2}(n+q)r^{2}\log(1/\epsilon)\log(\kappa)

and mtot>Cκ2max(r,logq,logn)log(1/ϵ)m_{tot}>C\kappa^{2}\max(r,\log q,\log n)\log(1/\epsilon), then, w.p. at least 1(Cκ2log(1/ϵ))n101-(C\kappa^{2}\log(1/\epsilon))n^{-10}, 𝐗𝐗F1.4ϵ𝐗\|{\bm{X}}-{\bm{X}}^{*}\|_{F}\leq 1.4\epsilon\|{\bm{X}}^{*}\| and 𝐱k𝐱kϵ𝐱k{\|\bm{x}_{k}-\bm{x}^{*}_{k}\|}\leq\epsilon{\|\bm{x}^{*}_{k}\|} for all k[q]k\in[q]. The time complexity is Cκ2mqnrlog(1/ϵ)logκC\kappa^{2}mqnr\log(1/\epsilon)\log\kappa and the communication complexity is O(nrκ2log(1/ϵ))O(nr\cdot\kappa^{2}\log(1/\epsilon)).

Observe that the above results show that after T=Cκ2log(1/ϵ)T=C\kappa^{2}\log(1/\epsilon) iterations, SD(𝑼T,𝑼)ϵ\mathrm{SD}({\bm{U}}_{T},{\bm{U}}^{*}{})\leq\epsilon, 𝒙k𝒙kϵ𝒙k{\|\bm{x}_{k}-\bm{x}^{*}_{k}\|}\leq\epsilon{\|\bm{x}^{*}_{k}\|}, and 𝑿T𝑿F1.4ϵ𝑿\|{\bm{X}}_{T}-{\bm{X}}^{*}\|_{F}\leq 1.4\epsilon\|{\bm{X}}^{*}\|. The RHS in the third bound does indeed contain 𝑿\|{\bm{X}}^{*}\| (the induced 2-norm). This is correct because, SD(.,.)\mathrm{SD}(.,.) is a Frobenius norm subspace distance. We explain this in Sec. III-B.

II-C Discussion and comparison with the best LRMC results

An algorithm is called linear time if its time complexity is the same order as the time needed to load all input data. In our case, this is O(mqn)O(mqn). Treating κ\kappa as a constant, the AltGD-Min complexity is worse than linear-time by a factor of only rlog(1/ϵ)r\log(1/\epsilon). As can be seen from Table I, the same is also true for the fastest LRMC solution, projGD-X [15]. For LRMC, linear time is O(mq)O(mq). To the best of our knowledge, this is the case for the fastest algorithms for all LR problems.

Consider the sample complexity. The degrees of freedom (number of unknowns) of a rank-rr n×qn\times q matrix are (n+q)r(n+q)r. A sample complexity of Ω((n+q)r)\Omega((n+q)r) samples (or, sometimes this times log factors) is called “optimal”. Thus, ignoring the log factors, our sample complexity of mtotq(n+q)r2m_{tot}q\gtrsim(n+q)r^{2} is sub-optimal only by a factor of rr. As can also be seen from Table I, this suboptimality matches that of the best results for LRMC solutions that are not convex relaxation based [15, 16, 17]. The need for exploiting incoherence while obtaining the high probability bounds on the recovery error terms is what introduces the extra factor of rr for both LRMC and LRcCS. LRMC has been extensively studied for over a decade and there does not seem to be a way to obtain an (order-) optimal sample complexity guarantee for it except when studying convex relaxation solutions (which are much slower).

In addition, we also need mmax(r,logq,logn)m\gtrsim\max(r,\log q,\log n). This is redundant except for very large q,nq,n. This is needed because the recovery of each column of 𝑩\bm{B}^{*} is a decoupled rr-dimensional LS problem. We analyze this step in Lemma 3.3; notice that the bound on the recovery error of column kk holds w.p. at least 1exp(rcm)1-\exp(r-cm). By union bound, it holds for all qq columns w.p. at least 1qexp(rcm)=1exp(logq+rcm)1-q\exp(r-cm)=1-\exp(\log q+r-cm). This probability is at least 1n10=1exp(10logn)1-n^{-10}=1-\exp(-10\log n) if mmax(r,logq,logn)m\gtrsim\max(r,\log q,\log n).

II-D Detailed comparison with existing LRcCS results

There are two existing solutions for LRcCS – AltMin [4, 5, 6] and the convex relaxation (mixed norm minimization) [7]. Mixed norm is defined as 𝑿mixed:=inf{𝑼,𝑽:𝑼𝑽=𝑿}𝑼Fmaxk[q]𝒗k\|{\bm{X}}\|_{mixed}:=\inf_{\{{\bm{U}},\bm{V}:{\bm{U}}\bm{V}={\bm{X}}\}}\|{\bm{U}}\|_{F}\max_{k\in[q]}\|\bm{v}_{k}\|, where 𝑼{\bm{U}} is n×rn\times r and 𝑽:=[𝒗1,𝒗2,𝒗q]\bm{V}:=[\bm{v}_{1},\bm{v}_{2},\dots\bm{v}_{q}] is an r×qr\times q matrix. In our notation, for the noise-free case (σ=0\sigma=0), their main result states the following.

Proposition 2.3 (Convex relaxation (mixed norm min) in the σ=0\sigma=0 (noise-free) setting [7]).

Consider a matrix 𝐗{𝐗:maxk𝐱k2α2,𝐗mixedRαr}{\bm{X}}^{*}\in\{{\bm{X}}^{*}:\max_{k}\|\bm{x}^{*}_{k}\|^{2}\leq\alpha^{2},\|{\bm{X}}^{*}\|_{mixed}\leq R\leq\alpha\sqrt{r}\}. Then, w.p. 1exp(c2nR2/α2)1-\exp(-c_{2}nR^{2}/\alpha^{2}), 𝐗𝐗F2𝐗F2c1α2𝐗F2/q(n+q)rlog6nmtotq\frac{\|{\bm{X}}-{\bm{X}}^{*}\|_{F}^{2}}{\|{\bm{X}}^{*}\|_{F}^{2}}\leq c_{1}\frac{\alpha^{2}}{\|{\bm{X}}^{*}\|_{F}^{2}/q}\sqrt{\frac{(n+q)r\log^{6}n}{m_{tot}q}} Under our Assumption 1.1, maxk𝐱k2μ2(r/q)σmax2=(μ2κ2)(r/q)σmin2(κ2μ2)𝐗F2/q\max_{k}\|\bm{x}^{*}_{k}\|^{2}\leq\mu^{2}(r/q){\sigma_{\max}^{*}}^{2}=(\mu^{2}\kappa^{2})(r/q){\sigma_{\min}^{*}}^{2}\leq(\kappa^{2}\mu^{2})\|{\bm{X}}^{*}\|_{F}^{2}/q, i.e. α2𝐗F2/q=(κ2μ2)\frac{\alpha^{2}}{\|{\bm{X}}^{*}\|_{F}^{2}/q}=(\kappa^{2}\mu^{2}). Thus, the above result can also be stated as:

For all matrices 𝐗{\bm{X}}^{*} that satisfy Assumption 1.1 and for which 𝐗mixedrκμ𝐗F/q\|{\bm{X}}^{*}\|_{mixed}\leq\sqrt{r}\cdot\kappa\mu\|{\bm{X}}^{*}\|_{F}/\sqrt{q}, if

mtotqC1κ4μ4(n+q)rlog6n1ϵ4,m_{tot}q\geq C_{1}\kappa^{4}\mu^{4}(n+q)r\log^{6}n\cdot\frac{1}{\epsilon^{4}},

then, w.p. at least 1exp(c2n)1-\exp(-c_{2}n), 𝐗𝐗Fϵ𝐗F{\|{\bm{X}}-{\bm{X}}^{*}\|_{F}}\leq\epsilon{\|{\bm{X}}^{*}\|_{F}}. The time complexity is Cmqnrmin(1ϵ,n3r3)Cmqnr\min(\frac{1}{\sqrt{\epsilon}},n^{3}r^{3}) (explained earlier in Sec. I-B).

Notice that both the sample and the time complexity of the convex solution depend on powers of 1/ϵ1/\sqrt{\epsilon}: the sample complexity grows as 1/ϵ41/\epsilon^{4} while the time complexity grows as 1/ϵ1/\sqrt{\epsilon}. However, its sample complexity has an order-optimal dependence on rr. For AltGD-Min, both sample and time complexities depend only logarithmically on ϵ\epsilon, as log(1/ϵ)\log(1/\epsilon). But its sample complexity depends sub-optimally on rr: it grows as r2r^{2}. In summary, the time complexity of the convex solution is always much worse, while its sample complexity is worse when a solution with accuracy level ϵ<1/r1/4\epsilon<1/{r}^{1/4} is needed. A second point to mention is that our result for AltGD-Min provides a column-wise error bound (bounds 𝒙k𝒙k/𝒙k\|\bm{x}^{*}_{k}-\bm{x}_{k}\|/\|\bm{x}^{*}_{k}\|). The convex result only provides a bound on the Frobenius norm of the entire matrix. Thus it is possible that some columns have much larger recovery error than others. This can be problematic in applications such as dynamic MRI where each column corresponds to one signal/image of a time sequence and where the goal is to ensure accurate-enough recovery of all columns. On the other hand, the advantage of the convex guarantee is that it holds w.h.p. for all matrices 𝑿{\bm{X}}^{*} in the specified set, whereas our result only holds w.h.p. for a matrix 𝑿{\bm{X}}^{*} satisfying Assumption 1.1. The reason for these last two points and the reason that we cannot avoid using sample-splitting is the same: the update of 𝑩\bm{B} is a column-wise LS problem. We explain the reasoning carefully in Sec. VI-A where we discuss the limitations of our approach. A second advantage of the convex result is that it directly studies the noisy version of the LRcCS problem. This should be possible for AltGD-Min too; we postpone it to future work.

The best result for AltMin is from [6], it states the following.

Proposition 2.4 (AltMin [6]).

Under Assumption 1.1, if

mtotqCκ8μ2nr2(r+log(1/ϵ)) and mtot>max(r,logq,logn),m_{tot}q\geq C\kappa^{8}\mu^{2}nr^{2}(r+\log(1/\epsilon))\text{ and }m_{tot}>\max(r,\log q,\log n),

then, w.p. at least 1(log(1/ϵ))n101-(\log(1/\epsilon))n^{-10}, 𝐗𝐗ϵ𝐗\|{\bm{X}}-{\bm{X}}^{*}\|\leq\epsilon\|{\bm{X}}^{*}\| and 𝐱k𝐱kϵ𝐱k{\|\bm{x}_{k}-\bm{x}^{*}_{k}\|}\leq\epsilon{\|\bm{x}^{*}_{k}\|} for all k[q]k\in[q]. The time complexity is Cmqnrlog2(1/ϵ)Cmqnr\log^{2}(1/\epsilon).

Treating κ\kappa as a numerical constant, compared with the above result for AltMin, the sample complexity of AltGD-Min is either better by a factor of rr or is as good. It is better when r>log(1/ϵ)r>\log(1/\epsilon). Also, the time complexity is always better by a factor log(1/ϵ)\log(1/\epsilon). As a function of κ\kappa, the AltGD-Min sample complexity is better by a factor of κ2\kappa^{2}, but its time is worse by a factor of κ2\kappa^{2} compared to that of AltMin. The reason is that its error decays as (1c/κ2)t(1-c/\kappa^{2})^{t}. For AltMin the error decays as ctc^{t}. Experimentally, GD is usually much faster than AltMin because the constants in its time complexity are also lower.

III Proving Theorem 2.1

III-A Two key results for proving Theorem 2.1 and its proof

Theorem 2.1 is an almost immediate consequence of the following two results.

Theorem 3.1 (Initialization).

Pick a δ0<0.1\delta_{0}<0.1. If mqCκ4μ2(n+q)r2/δ02mq\geq C\kappa^{4}\mu^{2}(n+q)r^{2}/\delta_{0}^{2} , then w.p. at least 1exp(c(n+q))1-\exp(-c(n+q)),

SD(𝑼,𝑼0)δ0.\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{0})\leq\delta_{0}.
Proof.

See Sec. III-E (simpler proof with sample-splitting for α\alpha) or Appendix B (proof without sample-splitting). Proof outline is given in Sec. III-D. ∎

Theorem 3.2 (GD Descent).

If, at each iteration tt, mqCκ4μ2(n+q)r2logκmq\geq C\kappa^{4}\mu^{2}(n+q)r^{2}\log\kappa and m>Cmax(logq,logn)m>C\max(\log q,\log n); if SD(𝐔,𝐔0)δ0=c/κ2\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{0})\leq\delta_{0}=c/\kappa^{2} for a c0.1/1.1c\leq 0.1/1.1; and if η0.5/σmax2\eta\leq 0.5/{\sigma_{\max}^{*}}^{2}, then w.p. at least 1(t+1)n101-(t+1)n^{-10},

SD(𝑼,𝑼t+1)δt+1:=(1(ησmax2)0.4κ2)t+1δ0.\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t+1})\leq\delta_{t+1}:=\left(1-(\eta{\sigma_{\max}^{*}}^{2})\tfrac{0.4}{\kappa^{2}}\right)^{t+1}\delta_{0}.

If η=0.5/σmax2\eta=0.5/{\sigma_{\max}^{*}}^{2}, this simplifies to SD(𝐔,𝐔t+1)(10.2/κ2)t+1δ0\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t+1})\leq(1-0.2/\kappa^{2})^{t+1}\delta_{0}.

Also, with the above probability,

(1/m)Uf(𝑼t,𝑩t+1)1.6δtσmax2.\|(1/m)\nabla_{U}f({\bm{U}}_{t},\bm{B}_{t+1})\|\leq 1.6\delta_{t}{\sigma_{\max}^{*}}^{2}.

with δt\delta_{t} defined in the SD(𝐔,𝐔t+1)\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t+1}) bound above.

Since δt\delta_{t} decays exponentially with tt, the same is also true for the gradient norm at iteration tt, (1/m)Uf(𝑼t,𝑩t+1)\|(1/m)\nabla_{U}f({\bm{U}}_{t},\bm{B}_{t+1})\|.

Proof.

See Sec. III-C. Proof outline is given in Sec. III-B. ∎

Proof of Theorem 2.1.

The SD(.)\mathrm{SD}(.) bound is an immediate consequence of Theorems 3.1 and 3.2. To apply Theorem 3.2, we need δ0=c/κ2\delta_{0}=c/\kappa^{2}. By Theorem 3.1, if mqCκ6μ2(n+q)r2mq\geq C\kappa^{6}\mu^{2}(n+q)r^{2}, then, w.p. at least 1n101-n^{-10}, SD(𝑼,𝑼0)δ0=c/κ2\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{0})\leq\delta_{0}=c/\kappa^{2}. With this, if, at each iteration, mqCκ4μ2(n+q)r2logκmq\geq C\kappa^{4}\mu^{2}(n+q)r^{2}\log\kappa and mCmax(logq,logn)m\geq C\max(\log q,\log n), then by Theorem 3.2, w.p. at least 1(t+1)n101-(t+1)n^{-10}, the stated bound on SD(𝑼,𝑼t+1)\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t+1}) holds. By setting T=Cκ2log(1/ϵ)T=C\kappa^{2}\log(1/\epsilon) in this, we can guarantee (1c1κ2)Tϵ\left(1-\tfrac{c_{1}}{\kappa^{2}}\right)^{T}\leq\epsilon. This proves the SD(𝑼T,𝑼)\mathrm{SD}({\bm{U}}_{T},{\bm{U}}^{*}{}) bound. The bounds on 𝒙k𝒙k\|\bm{x}_{k}-\bm{x}^{*}_{k}\| and 𝑿𝑿F\|{\bm{X}}-{\bm{X}}^{*}\|_{F} follow by Lemma 3.3 given in Sec. III-C.∎

III-B Proof outline (and novelty) for Theorem 3.2

For proving exponential error decay, we need to show the following: at iteration tt, if SD(𝑼,𝑼)δt\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})\leq\delta_{t} with δt<δ0=c/κ2\delta_{t}<\delta_{0}=c/\kappa^{2}, then SD(𝑼+,𝑼)cδt\mathrm{SD}({\bm{U}}^{+},{\bm{U}}^{*}{})\leq c\delta_{t} for a c<1c<1. We explain how to do this next. Suppose that, at iteration tt, SD(𝑼,𝑼)δt<δ0=0.1/κ2\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})\leq\delta_{t}<\delta_{0}=0.1/\kappa^{2}.

Analyzing the minimization step for updating B\bm{B} (Lemma 3.3). Recall from Algorithm 1 that 𝒃k=(𝑨k𝑼)𝒚k\bm{b}_{k}=(\bm{A}_{k}{\bm{U}})^{\dagger}\bm{y}_{k}, 𝒙k=𝑼𝒃k\bm{x}_{k}={\bm{U}}\bm{b}_{k}, and 𝒙k=𝑼𝒃k\bm{x}^{*}_{k}={\bm{U}}^{*}{}\bm{b}^{*}_{k}. Using standard results from [26], we can show that the estimates 𝒃k\bm{b}_{k} satisfy 𝒃k𝑼𝒙k0.4(𝑰𝑼𝑼)𝑼𝒃k\|\bm{b}_{k}-{\bm{U}}^{\top}\bm{x}^{*}_{k}\|\leq 0.4\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|. This then implies that (i) 𝒃k\bm{b}_{k}’s are incoherent, i.e., 𝒃k1.1μσmaxr/q\|\bm{b}_{k}\|\leq 1.1\mu{\sigma_{\max}^{*}}\sqrt{r/q}; and (ii) 𝒙k𝒙k1.4(𝑰𝑼𝑼)𝑼𝒃k1.4δtmaxk𝒙k\|\bm{x}_{k}-\bm{x}^{*}_{k}\|\leq 1.4\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|\leq 1.4\delta_{t}\max_{k}\|\bm{x}^{*}_{k}\|, i.e., we can get the desired column-wise error bound. Also (iii) 𝑿𝑿F1.4δtσmax\|{\bm{X}}-{\bm{X}}^{*}\|_{F}\leq 1.4\delta_{t}{\sigma_{\max}^{*}} (notice this bound does not contain rr). We get this as follows:

𝑿𝑿F\displaystyle\|{\bm{X}}-{\bm{X}}^{*}\|_{F} =k𝒙k𝒙k2\displaystyle=\sqrt{\sum_{k}\|\bm{x}_{k}-\bm{x}^{*}_{k}\|^{2}}
1.42k(𝑰𝑼𝑼)𝑼𝒃k2\displaystyle\leq\sqrt{1.4^{2}\sum_{k}\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|^{2}}
=1.4(𝑰𝑼𝑼)𝑼𝑩F\displaystyle=1.4\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{B}^{*}\|_{F}
1.4(𝑰𝑼𝑼)𝑼Fσmax\displaystyle\leq 1.4\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\|_{F}{\sigma_{\max}^{*}}
=1.4SD(𝑼,𝑼)σmax1.4δtσmax\displaystyle=1.4\,\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}})\,{\sigma_{\max}^{*}}\leq 1.4\delta_{t}{\sigma_{\max}^{*}}

Similarly, 𝑩𝑼𝑿F0.4δtσmax.\|\bm{B}-{\bm{U}}^{\top}{\bm{X}}^{*}\|_{F}\leq 0.4\delta_{t}{\sigma_{\max}^{*}}. (iv) Using Weyl’s inequality and δt<0.1/κ2\delta_{t}<0.1/\kappa^{2}, this then implies that σmax(𝑩)1.1σmax\sigma_{\max}(\bm{B})\leq 1.1{\sigma_{\max}^{*}} and σmin(𝑩)0.9σmin\sigma_{\min}(\bm{B})\geq 0.9{\sigma_{\min}^{*}}.
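To make this step concrete, the following is a minimal NumPy sketch (not the authors' implementation; the names update_B, U, A_list, y_list are illustrative) of the column-wise LS update 𝒃k = (𝑨k𝑼)†𝒚k and 𝒙k = 𝑼𝒃k from Algorithm 1. Each solve is only rr-dimensional, which is why the step decouples column-wise and is cheap.

```python
import numpy as np

def update_B(U, A_list, y_list):
    """Column-wise LS step: b_k = (A_k U)^dagger y_k and x_k = U b_k.

    U: n x r with orthonormal columns; A_list[k]: m x n; y_list[k]: length-m vector.
    """
    r, q = U.shape[1], len(A_list)
    B = np.zeros((r, q))
    for k in range(q):
        # each least-squares problem is only r-dimensional and uses only (A_k, y_k),
        # which is also what makes this step easy to federate
        B[:, k], *_ = np.linalg.lstsq(A_list[k] @ U, y_list[k], rcond=None)
    X = U @ B  # current estimate of X*
    return B, X
```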

Bounding SD(U+,U)\mathrm{SD}({\bm{U}}^{+},{\bm{U}}^{*}{}) by a novel use of fundamental theorem of calculus (Lemma 3.4). Recall from Algorithm 1 that 𝑼^+=𝑼^(η/m)Uf(𝑼,𝑩)\hat{\bm{U}}^{+}=\hat{\bm{U}}-(\eta/m)\nabla_{U}f({\bm{U}},\bm{B}) and 𝑼^+=QR𝑼+𝑹+\hat{\bm{U}}^{+}\overset{\mathrm{QR}}{=}{\bm{U}}^{+}{\bm{R}}^{+}. We bound SD(𝑼+,𝑼)\mathrm{SD}({\bm{U}}^{+},{\bm{U}}^{*}{}) using the fundamental theorem of calculus [18, Chapter XIII, Theorem 4.2],[19], summarized in Theorem 4.2. The use of this result is motivated by its use in [19], and many earlier works, where it is used in a standard way: to bound the Euclidean norm error 𝒙𝒙\|\bm{x}-\bm{x}^{*}\| for standard GD to solve the PR problem for recovering a single vector 𝒙\bm{x}^{*}. Thus, at the true solution 𝒙=𝒙\bm{x}=\bm{x}^{*}, the gradient of the cost function was zero. In our case, there are two differences: (i) we need to bound the subspace distance error, and (ii) our algorithm is not standard GD; in particular, this means that Uf(𝑼𝑼𝑼,𝑩)0\nabla_{U}f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B})\neq 0.

To deal with (i) and (ii), we proceed as follows. We first bound (𝑰𝑼𝑼)𝑼^+F\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\hat{\bm{U}}^{+}\|_{F}. To do this, we apply Theorem 4.2 on vectorized Uf(𝑼,𝑩)\nabla_{U}f({\bm{U}},\bm{B}) with the pivot being vectorized Uf(𝑼𝑼𝑼,𝑩)\nabla_{U}f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B}), and use this in the equation for 𝑼^+\hat{\bm{U}}^{+}. Next, we project both sides of this expression orthogonal to 𝑼{\bm{U}}^{*}{} followed by some careful linear algebra. Notice here that Uf(𝑼𝑼𝑼,𝑩)0\nabla_{U}f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B})\neq 0, because 𝑩𝑩\bm{B}\neq\bm{B}^{*}. Because of this, we get an extra term, Term2:=(𝑰𝑼𝑼)Uf(𝑼𝑼𝑼,𝑩)\mathrm{Term2}:=(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\nabla_{U}f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B}), in our bound other than the usual term containing the Hessian. We are able to bound it by ϵδtσmax2\epsilon\delta_{t}{\sigma_{\max}^{*}}^{2} for any constant small enough ϵ\epsilon, by realizing that 𝔼[Term2]=0\mathbb{E}[\mathrm{Term2}]=0 (conditioned on past measurements), and that its summands are nice-enough subexponentials. Next, we bound SD(𝑼,𝑼+)\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}^{+}) by using

SD(𝑼,𝑼+)\displaystyle\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}^{+}) (𝑰𝑼𝑼)𝑼^+F(𝑹+)1\displaystyle\leq\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\hat{\bm{U}}^{+}\|_{F}\|({\bm{R}}^{+})^{-1}\|
=(𝑰𝑼𝑼)𝑼^+Fσmin(𝑼^+)\displaystyle=\frac{\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\hat{\bm{U}}^{+}\|_{F}}{\sigma_{\min}(\hat{\bm{U}}^{+})}

and σmin(𝑼^+)=σmin(𝑼(η/m)𝑼f(𝑼,𝑩))1(η/m)𝑼f(𝑼,𝑩)\sigma_{\min}(\hat{\bm{U}}^{+})=\sigma_{\min}({\bm{U}}-(\eta/m)\nabla_{\bm{U}}f({\bm{U}},\bm{B}))\geq 1-(\eta/m)\|\nabla_{\bm{U}}f({\bm{U}},\bm{B})\|.
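For concreteness, a minimal sketch of the 𝑼-update analyzed above (one GD step followed by QR) is given next; the names are illustrative and this is not the authors' code. The gradient accumulated below is ∇_U f(𝑼,𝑩) = Σ_ki (𝒂ki⊤𝑼𝒃k − 𝒚ki)𝒂ki𝒃k⊤, as in Definition 4.4.

```python
import numpy as np

def update_U(U, B, A_list, y_list, eta):
    """One AltGD-Min U-update: U_hat = U - (eta/m) * grad_U f(U, B), then QR."""
    m = y_list[0].shape[0]
    grad = np.zeros_like(U)
    for k in range(len(A_list)):
        resid = A_list[k] @ (U @ B[:, k]) - y_list[k]     # A_k U b_k - y_k (length m)
        grad += A_list[k].T @ np.outer(resid, B[:, k])    # sum_i (a_ki^T U b_k - y_ki) a_ki b_k^T
    U_hat = U - (eta / m) * grad                          # plain GD step on U
    U_new, _ = np.linalg.qr(U_hat)                        # QR retraction: U_hat = U^+ R^+
    return U_new
```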

Bounding the terms in the SD(U,U+)\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}^{+}) bound (Lemma 3.5). Consider 𝑼f(𝑼,𝑩)\|\nabla_{\bm{U}}f({\bm{U}},\bm{B})\|. Using Lemma 3.3, it can be shown that, for unit vectors 𝒘,𝒛\bm{w},\bm{z}, the maximum sub-exponential norm of any summand of 𝒘𝑼f(𝑼,𝑩)𝒛\bm{w}^{\top}\nabla_{\bm{U}}f({\bm{U}},\bm{B})\bm{z} is bounded by 𝒙k𝒙k𝒃k1.1μ2σmax2δt(r/q)\|\bm{x}_{k}-\bm{x}^{*}_{k}\|\cdot\|\bm{b}_{k}\|\leq 1.1\mu^{2}{\sigma_{\max}^{*}}^{2}\delta_{t}(r/q). Observe that we get this (sufficiently small) bound because of the extra 𝒃k\bm{b}_{k}^{\top} term in the summands of 𝑼f(𝑼,𝑩)\nabla_{\bm{U}}f({\bm{U}},\bm{B}) compared to those in 𝑿f~(𝑿)\nabla_{\bm{X}}\tilde{f}({\bm{X}}). This, along with using the sub-exponential Bernstein inequality [26] followed by a standard epsilon-net argument, and bounding 𝔼[Uf]\|\mathbb{E}[\nabla_{U}f]\| using 𝔼[Uf]=m(𝑿𝑿)𝑩mδtσmax2\|\mathbb{E}[\nabla_{U}f]\|=\|m({\bm{X}}-{\bm{X}}^{*})\bm{B}^{\top}\|\leq m\delta_{t}{\sigma_{\max}^{*}}^{2} (by Lemma 3.3), helps guarantee that Uf2mδtσmax2\|\nabla_{U}f\|\lesssim 2m\delta_{t}{\sigma_{\max}^{*}}^{2} w.h.p. as long as mq(n+q)r2mq\gtrsim(n+q)r^{2}. We bound Term2F\|\mathrm{Term2}\|_{F} using similar ideas and the key fact that 𝔼[Term2]=0\mathbb{E}[\mathrm{Term2}]=0. This is true because of sample-splitting. We upper and lower bound the eigenvalues of the Hessian, Hess\mathrm{Hess}, using similar ideas and the following: for a unit vector 𝒘\bm{w} of length nrnr and its rearranged unit Frobenius norm matrix 𝑾{\bm{W}} of size n×rn\times r, 𝔼[𝒘Hess𝒘]=𝔼[ki(𝒂ki𝑾𝒃k)2]=m𝑾𝑩F2\mathbb{E}[\bm{w}^{\top}\mathrm{Hess}\ \bm{w}]=\mathbb{E}[\sum_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k})^{2}]=m\|{\bm{W}}\bm{B}\|_{F}^{2}. Using the bounds on σi(𝑩)\sigma_{i}(\bm{B}) from Lemma 3.3, this can be upper and lower bounded.

III-C Lemmas for proving the GD iterations' Theorem 3.2, and its proof

Let 𝑼𝑼t{\bm{U}}\equiv{\bm{U}}_{t}, 𝑩𝑩t+1\bm{B}\equiv\bm{B}_{t+1}. The proof uses the following three lemmas.

Lemma 3.3 (Error bound on 𝑩\bm{B} and its implications).

Let U𝐔tU\equiv{\bm{U}}_{t}, 𝐁𝐁t+1\bm{B}\equiv\bm{B}_{t+1}, and

𝒈k:=𝑼𝒙k.\bm{g}_{k}:={\bm{U}}^{\top}\bm{x}^{*}_{k}.

Assume that SD(𝐔,𝐔t)δt\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t})\leq\delta_{t} with δt<δ0=c/κ2\delta_{t}<\delta_{0}=c/\kappa^{2} (this bound on δt\delta_{t} is needed for the second part of this lemma). Then, w.p. 1qexp(rcm)\geq 1-q\exp(r-cm),

  1.
    𝒈k𝒃k\displaystyle\|\bm{g}_{k}-\bm{b}_{k}\| 0.4(𝑰n𝑼𝑼)𝑼𝒃k\displaystyle\leq 0.4\|\left(\bm{I}_{n}-{\bm{U}}{\bm{U}}^{\top}\right){\bm{U}}^{*}{}\bm{b}^{*}_{k}\| (3)
  2.

    This in turn implies all of the following.

    (a)

      𝒙k𝒙k1.4(𝑰𝑼𝑼)𝑼𝒃k\|\bm{x}_{k}-\bm{x}^{*}_{k}\|\leq 1.4\|\left(\bm{I}-{\bm{U}}{\bm{U}}^{\top}\right){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|

    (b)

      𝑮𝑩F0.4δtσmax\|\bm{G}-\bm{B}\|_{F}\leq 0.4\delta_{t}{\sigma_{\max}^{*}} and 𝑿𝑿F1.16δtσmax\|{\bm{X}}^{*}-{\bm{X}}\|_{F}\leq\sqrt{1.16}\delta_{t}{\sigma_{\max}^{*}},

    (c)

      𝒈k𝒃k0.4δt𝒃k\|\bm{g}_{k}-\bm{b}_{k}\|\leq 0.4\delta_{t}\|\bm{b}^{*}_{k}\| and 𝒙k𝒙k1.4δt𝒙k\|\bm{x}_{k}-\bm{x}^{*}_{k}\|\leq 1.4\delta_{t}\|\bm{x}^{*}_{k}\|,

    (d)

      𝑼𝑼𝒃k𝒃k2.4δt𝒃k\|{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}\|\leq 2.4\delta_{t}\|\bm{b}^{*}_{k}\|,

    (e)

      𝒃k1.1μσmaxr/q\|\bm{b}_{k}\|\leq 1.1\mu{\sigma_{\max}^{*}}\sqrt{r/q}.

    (f)

      σmin(𝑩)0.9σmin\sigma_{\min}(\bm{B})\geq 0.9{\sigma_{\min}^{*}} and σmax(𝑩)1.1σmax\sigma_{\max}(\bm{B})\leq 1.1{\sigma_{\max}^{*}}.

Proof.

See Sec. IV-D. ∎

Lemma 3.4.

Let 𝐔𝐔t{\bm{U}}\equiv{\bm{U}}_{t}, 𝐁𝐁t+1\bm{B}\equiv\bm{B}_{t+1}. Let \otimes denote the Kronecker product. We have

SD(𝑼t+1,𝑼)\displaystyle\mathrm{SD}({\bm{U}}_{t+1},{\bm{U}}^{*}{})
𝑰nr(η/m)HessSD(𝑼,𝑼)+(η/m)Term2F1(η/m)GradU,\displaystyle\qquad\leq\dfrac{\|\bm{I}_{nr}-(\eta/m)\mathrm{Hess}\|\cdot\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}})+(\eta/m)\|\mathrm{Term2}\|_{F}}{1-(\eta/m)\|\mathrm{GradU}\|},

where,

GradU\displaystyle\mathrm{GradU} :=𝑼f(𝑼,𝑩)=ki(𝒚ki𝒂ki𝑼𝒃k)𝒂ki𝒃k\displaystyle:=\nabla_{\bm{U}}f({\bm{U}},\bm{B})=\sum_{ki}(\bm{y}_{ki}-\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{b}_{k})\bm{a}_{ki}\bm{b}_{k}{}^{\top}
Term2\displaystyle\mathrm{Term2} :=(𝑰𝑼𝑼)𝑼f((𝑼𝑼𝑼),𝑩)\displaystyle:=(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\nabla_{\bm{U}}f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}}),\bm{B})
=(𝑰𝑼𝑼)ki(𝒚ki𝒂ki𝑼𝑼𝑼𝒃k)𝒂ki𝒃k\displaystyle=(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\sum_{ki}(\bm{y}_{ki}-\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}}\bm{b}_{k})\bm{a}_{ki}\bm{b}_{k}{}^{\top}
Hess\displaystyle\mathrm{Hess} :=ki(𝒂ki𝒃k)(𝒂ki𝒃k)\displaystyle:=\sum_{ki}(\bm{a}_{ki}\otimes\bm{b}_{k})(\bm{a}_{ki}\otimes\bm{b}_{k})^{\top}
Proof.

See Sec. IV-B. ∎

Lemma 3.5.

Assume SD(𝐔,𝐔)δt<δ0=c/κ2\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}})\leq\delta_{t}<\delta_{0}=c/\kappa^{2}. Then,

  1.

    w.p. at least 1exp((n+r)cmqϵ12/rμ2)exp(logq+rcm)1-\exp((n+r)-cmq\epsilon_{1}^{2}/r\mu^{2})-\exp(\log q+r-cm),

    GradU1.5(1.1+ϵ1)mδtσmax2;\|\mathrm{GradU}\|\leq 1.5(1.1+\epsilon_{1})m\delta_{t}{\sigma_{\max}^{*}}^{2};
  2.

    w.p. at least 1exp(nrcmqϵ22/rμ2)exp(logq+rcm)1-\exp(nr-cmq\epsilon_{2}^{2}/r\mu^{2})-\exp(\log q+r-cm),

    Term2F1.1mϵ2δtσmax2;\|\mathrm{Term2}\|_{F}\leq 1.1m\epsilon_{2}\delta_{t}{\sigma_{\max}^{*}}^{2};
  3.

    w.p. at least 1exp(nrlogκcmqϵ32/rκ4μ2)exp(logq+rcm)1-\exp(nr\log\kappa-cmq\epsilon_{3}^{2}/r\kappa^{4}\mu^{2})-\exp(\log q+r-cm),

    m(0.651.2ϵ3)σmin2\displaystyle m(0.65-1.2\epsilon_{3}){\sigma_{\min}^{*}}^{2} λmin(Hess)\displaystyle\leq\lambda_{\min}(\mathrm{Hess})
    λmax(Hess)m(1.1+ϵ3)σmax2.\displaystyle\leq\lambda_{\max}(\mathrm{Hess})\leq m(1.1+\epsilon_{3}){\sigma_{\max}^{*}}^{2}.
Proof.

See Sec. IV-C. ∎

Proof of Theorem 3.2.

The proof follows by induction. Base case for t=0t=0 is true by assumption. Induction assumption: Assume that, w.p. at least 1tn101-tn^{-10}, SD(𝑼,𝑼t)δt\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t})\leq\delta_{t} with δtδ0=c0/κ2\delta_{t}\leq\delta_{0}=c_{0}/\kappa^{2}.

Set ϵ1=0.1\epsilon_{1}=0.1, ϵ3=0.01\epsilon_{3}=0.01, ϵ2=0.01/(1.1κ2)\epsilon_{2}=0.01/(1.1\kappa^{2}), and c0=0.1/(1.5(1.1+0.1))c_{0}=0.1/(1.5(1.1+0.1)).

The upper bound on λmax(Hess)\lambda_{\max}(\mathrm{Hess}) and using η0.5/σmax2\eta\leq 0.5/{\sigma_{\max}^{*}}^{2} implies that λmin(𝑰nr(η/m)Hess)=1(η/m)λmax(Hess)10.5(1.1+0.01)mσmax2mσmax2>10.555>0\lambda_{\min}(\bm{I}_{nr}-(\eta/m)\mathrm{Hess})=1-(\eta/m)\lambda_{\max}(\mathrm{Hess})\geq 1-\frac{0.5(1.1+0.01)m{\sigma_{\max}^{*}}^{2}}{m{\sigma_{\max}^{*}}^{2}}>1-0.555>0 i.e. 𝑰nr(η/m)Hess\bm{I}_{nr}-(\eta/m)\mathrm{Hess} is positive definite. Thus, 𝑰nr(η/m)Hess=λmax(𝑰nr(η/m)Hess)=1(η/m)λmin(Hess)1(η/m)m(0.651.2ϵ3)σmin21(ησmax2)0.63/κ2.\|\bm{I}_{nr}-(\eta/m)\mathrm{Hess}\|=\lambda_{\max}(\bm{I}_{nr}-(\eta/m)\mathrm{Hess})=1-(\eta/m)\lambda_{\min}(\mathrm{Hess})\leq 1-(\eta/m)m(0.65-1.2\epsilon_{3}){\sigma_{\min}^{*}}^{2}\leq 1-(\eta{\sigma_{\max}^{*}}^{2})0.63/\kappa^{2}.

By Lemma 3.4, Lemma 3.5, and the above, w.p. at least 1tn10exp((n+q)cmq/rμ2)exp(nrcmq/rκ4μ2)exp(nrlogκcmq/rκ4μ2)exp(logq+rcm)1-tn^{-10}-\exp((n+q)-cmq/r\mu^{2})-\exp(nr-cmq/r\kappa^{4}\mu^{2})-\exp(nr\log\kappa-cmq/r\kappa^{4}\mu^{2})-\exp(\log q+r-cm),

SD(𝑼,𝑼t+1)\displaystyle\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t+1})
(1(ησmax2)0.63/κ2)δt+(η/m)1.1mϵ2σmax2δt1(η/m)1.5(1.1+ϵ1)mδtσmax2\displaystyle\leq\dfrac{(1-(\eta{\sigma_{\max}^{*}}^{2})0.63/\kappa^{2})\cdot\delta_{t}+(\eta/m)1.1m\epsilon_{2}{\sigma_{\max}^{*}}^{2}\delta_{t}}{1-(\eta/m)1.5(1.1+\epsilon_{1})m\delta_{t}{\sigma_{\max}^{*}}^{2}}
(1(ησmax2)0.63/κ2+(ησmax2)0.01/κ21(ησmax2)0.1/κ2)δt\displaystyle\leq\left(\frac{1-(\eta{\sigma_{\max}^{*}}^{2})0.63/\kappa^{2}+(\eta{\sigma_{\max}^{*}}^{2})0.01/\kappa^{2}}{1-(\eta{\sigma_{\max}^{*}}^{2})0.1/\kappa^{2}}\right)\delta_{t}
(1(ησmax2)0.42κ2)δt\displaystyle\leq\left(1-(\eta{\sigma_{\max}^{*}}^{2})\frac{0.42}{\kappa^{2}}\right)\delta_{t}

The second inequality substituted the values of ϵj\epsilon_{j}’s and used δt<δ0=0.1/(1.5(1.1+0.1)κ2)\delta_{t}<\delta_{0}=0.1/(1.5(1.1+0.1)\kappa^{2}) for its denominator term. The third inequality used (1(ησmax2)0.1/κ2)1(1+(ησmax2)0.2/κ2)(1-(\eta{\sigma_{\max}^{*}}^{2})0.1/\kappa^{2})^{-1}\leq(1+(\eta{\sigma_{\max}^{*}}^{2})0.2/\kappa^{2}) (for 0<x<10<x<1, 1/(1x)1+2x1/(1-x)\leq 1+2x).

By plugging in the epsilon values in the probability, the above holds w.p. 1tn100.2exp((n+q)cmq/rμ2)0.2exp(nrcmq/rμ2κ4)0.2exp(nrlogκcmq/rμ2κ4)exp(logq+rcm)\geq 1-tn^{-10}-0.2\exp((n+q)-cmq/r\mu^{2})-0.2\exp(nr-cmq/r\mu^{2}\kappa^{4})-0.2\exp(nr\log\kappa-cmq/r\mu^{2}\kappa^{4})-\exp(\log q+r-cm) . If mqCκ4(n+q)r2logκmq\geq C\kappa^{4}(n+q)r^{2}\log\kappa and mCmax(r,logq,logn)m\geq C\max(r,\log q,\log n) for a CC large enough, then, this probability is 1tn100.2exp(c(n+q))0.4exp(cnr)n10>1(t+1)n10\geq 1-tn^{-10}-0.2\exp(-c(n+q))-0.4\exp(-cnr)-n^{-10}>1-(t+1)n^{-10}. ∎

III-D Proof outline (and novelty) for Initialization Theorem 3.1

Recall that we compute 𝑼0{\bm{U}}_{0} as the top rr left singular vectors of 𝑿0{\bm{X}}_{0} defined in (2) and that this is a truncated version of 𝑿0,full{\bm{X}}_{0,full}. As noted there, we cannot use 𝑿0,full{\bm{X}}_{0,full} because its summands are not nice-enough sub-exponentials. Truncation converts the summands into sub-Gaussian r.v.s. For these, we can use the sub-Gaussian Hoeffding inequality [26, Chap 2], which needs a small enough bound on only the squared sum of the sub-Gaussian norms of the mqmq summands, and not on their maximum value (as needed by the sub-exponential Bernstein inequality). This is an easier requirement that gets satisfied for our problem. Of course, truncation also means that the summands of 𝑿0{\bm{X}}_{0} are not mutually independent (each summand depends on the truncation threshold α\alpha which is computed using all measurements 𝒚ki\bm{y}_{ki}) and that 𝔼[𝑿0]𝑿\mathbb{E}[{\bm{X}}_{0}]\neq{\bm{X}}^{*}. There are two ways to resolve this issue. The first and simpler approach, but one that assumes more sample-splitting, is given below in Sec. III-E. It assumes that α\alpha is computed using a different, independent, set of measurements than those used to define the rest of 𝑿0{\bm{X}}_{0}. With this, 𝔼[𝑿0|α]=𝑿𝑫(α)\mathbb{E}[{\bm{X}}_{0}|\alpha]={\bm{X}}^{*}{\bm{D}}(\alpha), where 𝑫{\bm{D}} is a diagonal matrix defined below in Lemma 3.6 and the summands are independent conditioned on α\alpha. Thus, we can apply Wedin’s sinΘ\sin\Theta theorem [27, 28] (given in Theorem 4.1) on 𝑿0{\bm{X}}_{0} and 𝔼[𝑿0|α]\mathbb{E}[{\bm{X}}_{0}|\alpha] to bound SD(𝑼0,𝑼)\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{}), followed by sub-Gaussian Hoeffding and a standard epsilon-net argument, to bound the terms in this bound.

To avoid sample-splitting for α\alpha, we need to significantly modify the sandwiching arguments from [20, 5] for our setting. This is done in Appendix B. In the previous works, sandwiching was used for a symmetric positive definite (p.d.) matrix. Here we need such an argument for a non-symmetric matrix. Briefly, we do this as follows. We define a matrix 𝑿+{\bm{X}}_{+} that is such that the span of top rr left singular vectors of its expected value equals that of 𝑼{\bm{U}}^{*}{} and that can be shown to be close to 𝑿0{\bm{X}}_{0}. 𝑿+{\bm{X}}_{+} is 𝑿0{\bm{X}}_{0} with α\alpha replaced by C~(1+ϵ)𝑿F2/q\tilde{C}(1+\epsilon)\|{\bm{X}}^{*}\|_{F}^{2}/q. We bound 𝑿0𝔼[𝑿+]\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}]\| by bounding 𝑿+𝑿0\|{\bm{X}}_{+}-{\bm{X}}_{0}\| and 𝑿+𝔼[𝑿+]\|{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\|. Bounding the latter is simple. Bounding 𝑿+𝑿0\|{\bm{X}}_{+}-{\bm{X}}_{0}\| requires bounding 𝒘(𝑿+𝑿0)𝒛\bm{w}^{\top}({\bm{X}}_{+}-{\bm{X}}_{0})\bm{z} for unit vectors 𝒘,𝒛\bm{w},\bm{z} and this is not straightforward because its summands are not mutually independent. To deal with this, we first bound each summand by its absolute value, and then bound the indicator function term to get a new one that is non-random so that the summands of this new term are mutually independent. But, its summands are no longer zero mean (because of taking the absolute values), and hence more work is needed to get the desired small enough bound on the expected value of this term.
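To make the initialization concrete, here is a minimal sketch. It assumes (consistent with the description of 𝑿0 above, though the precise definition (2) is not restated here) that the k-th column of 𝑿0 is (1/m)Σ_i 𝒚ki𝟙{𝒚ki² ≤ α}𝒂ki, with α = C̃Σ_ki𝒚ki²/(mq); the value of C̃ and all names are illustrative, not the authors' code.

```python
import numpy as np

def initialize_U(A_list, y_list, r, C_tilde=9.0):
    """Truncated spectral init: U_0 = top-r left singular vectors of the truncated matrix X_0."""
    n, q = A_list[0].shape[1], len(A_list)
    m = y_list[0].shape[0]
    # truncation threshold alpha; the simpler proof of Sec. III-E assumes it is computed
    # from an independent set of measurements, here the same ones are reused for brevity
    alpha = C_tilde * sum(float(np.sum(y ** 2)) for y in y_list) / (m * q)
    X0 = np.zeros((n, q))
    for k in range(q):
        y_trunc = y_list[k] * (y_list[k] ** 2 <= alpha)   # zero out the too-large measurements
        X0[:, k] = A_list[k].T @ y_trunc / m              # (1/m) sum_i y_ki,trunc * a_ki
    U0, _, _ = np.linalg.svd(X0, full_matrices=False)
    return U0[:, :r]
```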

III-E Simpler proof of Theorem 3.1 that assumes independent measurements used for computing α\alpha

For the simpler proof given here, assume that we use a different independent set of measurements for computing α\alpha than those used for the rest of 𝑿0{\bm{X}}_{0}, i.e., let

α=C~ki(𝒚kinrmX)2mq\alpha=\tilde{C}\frac{\sum_{ki}(\bm{y}_{ki}^{nrmX})^{2}}{mq}

with 𝒚kinrmX\bm{y}_{ki}^{nrmX} independent of {𝑨k(0),𝒚k(0)}\{\bm{A}_{k}^{(0)},\bm{y}_{k}^{(0)}\}. With this change, it is possible to compute 𝔼[𝑿0|α]\mathbb{E}[{\bm{X}}_{0}|\alpha] easily. This change does not affect the sample complexity order and so it does not change our theorem statement. The proof follows by combining the two lemmas and facts given next.

Lemma 3.6.

Conditioned on α\alpha, we have the following conclusions.

  1.

    Let ζ\zeta be a scalar standard Gaussian r.v. Define

    βk(α):=𝔼[ζ2𝟙{𝒙k2ζ2α}].\beta_{k}(\alpha):=\mathbb{E}[\zeta^{2}\mathbbm{1}_{\{\|\bm{x}^{*}_{k}\|^{2}\zeta^{2}\leq\alpha\}}].

    Then,

    𝔼[𝑿0|α]=𝑿𝑫(α),\displaystyle\mathbb{E}[{\bm{X}}_{0}|\alpha]={\bm{X}}^{*}{\bm{D}}(\alpha),
    where 𝑫(α):=diagonal(βk(α),k[q])\displaystyle\text{ where }{\bm{D}}(\alpha):=diagonal(\beta_{k}(\alpha),k\in[q]) (4)

    i.e. 𝑫(α){\bm{D}}(\alpha) is a diagonal matrix of size q×qq\times q with diagonal entries βk\beta_{k} defined above.

  2.

    Let 𝔼[𝑿0|α]=𝑿𝑫(α)=SVD𝑼𝚺ˇ𝑽ˇ\mathbb{E}[{\bm{X}}_{0}|\alpha]={\bm{X}}^{*}{\bm{D}}(\alpha)\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}\check{\bm{\Sigma}^{*}}\check{\bm{V}} be its rr-SVD. Then,

    SD(𝑼0,𝑼)\displaystyle\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\leq
    2max((𝑿0𝔼[𝑿0|α])𝑼F,(𝑿0𝔼[𝑿0|α])𝑽ˇF)σminminkβk(α)𝑿0𝔼[𝑿0|α]\displaystyle\dfrac{\sqrt{2}\max\left(\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha])^{\top}{\bm{U}}^{*}{}\|_{F},\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha])\check{\bm{V}}{}^{\top}\|_{F}\right)}{{\sigma_{\min}^{*}}\min_{k}\beta_{k}(\alpha)-\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\|} (5)

    as long as the denominator is positive.

Proof.

See Sec. IV-F. ∎

Define the set \mathcal{E} as follows

:={C~(1ϵ1)𝑿F2qαC~(1+ϵ1)𝑿F2q}.\displaystyle\mathcal{E}:=\left\{\tilde{C}(1-\epsilon_{1})\frac{\|{\bm{X}}^{*}\|_{F}^{2}}{q}\leq\alpha\leq\tilde{C}(1+\epsilon_{1})\frac{\|{\bm{X}}^{*}\|_{F}^{2}}{q}\right\}. (6)

The following fact is an immediate consequence of the sub-exponential Bernstein inequality applied to bound |αC~𝑿F2/q||\alpha-\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}/q|.

Fact 3.7.

Pr(α)1exp(c~mqϵ12):=1pα\Pr(\alpha\in\mathcal{E})\geq 1-\exp(-\tilde{c}mq\epsilon_{1}^{2}):=1-p_{\alpha}. Here c~=c/C~=c/κ2μ2.\tilde{c}=c/\tilde{C}=c/\kappa^{2}\mu^{2}.

The next lemma bounds the terms of Lemma 3.6.

Lemma 3.8.

Fix 0<ϵ1<10<\epsilon_{1}<1. Then,

  1.

    w.p. at least 1exp[(n+q)cϵ12mq/μ2κ2]1-\exp\left[(n+q)-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right], conditioned on α\alpha, for an α\alpha\in\mathcal{E},

    𝑿0𝔼[𝑿0|α]1.1ϵ1𝑿F\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\|\leq 1.1\epsilon_{1}\|{\bm{X}}^{*}\|_{F}
  2.

    w.p. at least 1exp[qrcϵ12mq/μ2κ2]1-\exp\left[qr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right], conditioned on α\alpha, for an α\alpha\in\mathcal{E},

    (𝑿0𝔼[𝑿0|α])𝑼F1.1ϵ1𝑿F\|\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\|_{F}\leq 1.1\epsilon_{1}\|{\bm{X}}^{*}\|_{F}
  3.

    w.p. at least 1exp[nrcϵ12mq/μ2κ2]1-\exp\left[nr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right], conditioned on α\alpha, for an α\alpha\in\mathcal{E},

    (𝑿0𝔼[𝑿0|α])𝑽ˇF1.1ϵ1𝑿F.\|\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top}\|_{F}\leq 1.1\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.
Proof.

See Sec. IV-G. ∎

We also need the following fact.

Fact 3.9.

For any ϵ10.1\epsilon_{1}\leq 0.1, mink𝔼[ζ2𝟙{|ζ|C~1ϵ1𝐗Fq𝐱k}]0.92.\min_{k}\mathbb{E}\left[\zeta^{2}\mathbbm{1}_{\left\{|\zeta|\leq\tilde{C}\frac{\sqrt{1-\epsilon_{1}}\|{\bm{X}}^{*}\|_{F}}{\sqrt{q}\|\bm{x}^{*}_{k}\|}\right\}}\right]\geq 0.92.
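As a numerical illustration of Fact 3.9 (a sketch only: the constant C̃ and the incoherence parameters, which determine the actual truncation level, are not instantiated here), the truncated second moment has the closed form E[ζ²𝟙{|ζ|≤τ}] = erf(τ/√2) − τ√(2/π)e^{−τ²/2}; it is about 0.92 at τ = 2.6 and increases towards 1 as τ grows.

```python
import math

def truncated_second_moment(tau):
    """E[zeta^2 * 1{|zeta| <= tau}] for a standard Gaussian zeta (closed form)."""
    return math.erf(tau / math.sqrt(2)) - tau * math.sqrt(2 / math.pi) * math.exp(-tau ** 2 / 2)

for tau in (2.0, 2.6, 3.0, 4.0):
    print(tau, round(truncated_second_moment(tau), 4))
# prints roughly 0.74, 0.92, 0.97, 1.00: the moment exceeds 0.92 once tau is about 2.6 or larger
```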

Proof of Theorem 3.1.

Set ϵ1=0.4δ0/rκ\epsilon_{1}=0.4\delta_{0}/\sqrt{r}\kappa. Define p0=2exp((n+q)cmqδ02/rκ2)+2exp(nrcmqδ02/rκ2)+2exp(qrcmqδ02/rκ2).p_{0}=2\exp((n+q)-cmq\delta_{0}^{2}/r\kappa^{2})+2\exp(nr-cmq\delta_{0}^{2}/r\kappa^{2})+2\exp(qr-cmq\delta_{0}^{2}/r\kappa^{2}). Recall that Pr(α)1pα\Pr(\alpha\in\mathcal{E})\geq 1-p_{\alpha} with pα=exp(c~mqϵ12)=exp(cmqδ02/rμ2κ2).p_{\alpha}=\exp(-\tilde{c}mq\epsilon_{1}^{2})=\exp(-cmq\delta_{0}^{2}/r\mu^{2}\kappa^{2}).

Using Lemma 3.8, conditioned on α\alpha, for an α\alpha\in\mathcal{E},

  • w.p. at least 1p01-p_{0}, 𝑿0𝔼[𝑿0|α]1.1ϵ1𝑿F=0.44δ0σmin,\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\|\leq 1.1\epsilon_{1}\|{\bm{X}}^{*}\|_{F}=0.44\delta_{0}{\sigma_{\min}^{*}}, and max((𝑿0𝔼[𝑿0|α])𝑼F,(𝑿0𝔼[𝑿0|α])𝑽ˇF)0.44δ0σmin\max\left(\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha])^{\top}{\bm{U}}^{*}{}\|_{F},\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha])\check{\bm{V}}^{\top}\|_{F}\right)\leq 0.44\delta_{0}{\sigma_{\min}^{*}}

  • minkβk(α)mink𝔼[ζ2𝟙{|ζ|C~1ϵ1𝑿Fq𝒙k}]0.9\min_{k}\beta_{k}(\alpha)\geq\min_{k}\mathbb{E}\left[\zeta^{2}\mathbbm{1}_{\{|\zeta|\leq\tilde{C}\frac{\sqrt{1-\epsilon_{1}}\|{\bm{X}}^{*}\|_{F}}{\sqrt{q}\|\bm{x}^{*}_{k}\|}\}}\right]\geq 0.9. The first inequality is an immediate consequence of α\alpha\in\mathcal{E} and the second follows by Fact 3.9.

Plugging the above bounds into (5) of Lemma 3.6, conditioned on α\alpha, for any α\alpha\in\mathcal{E}, w.p. at least 1p01-p_{0}, SD(𝑼0,𝑼)20.44δ00.90.44δ0<δ0\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\leq\frac{\sqrt{2}\cdot 0.44\delta_{0}}{0.9-0.44\delta_{0}}<\delta_{0} since δ0<0.1\delta_{0}<0.1. In other words,

Pr(SD(𝑼0,𝑼)δ0|α)p0for any α.\displaystyle\Pr\left(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0}|\alpha\right)\leq p_{0}\ \text{for any $\alpha\in\mathcal{E}$}. (7)

Since (i) Pr(SD(𝑼0,𝑼)δ0)Pr(SD(𝑼0,𝑼)δ0 and α)+Pr(α),\Pr(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0})\leq\Pr(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0}\text{ and }\alpha\in\mathcal{E})+\Pr(\alpha\notin\mathcal{E}), and (ii) Pr(SD(𝑼0,𝑼)δ0 and α)Pr(α)maxαPr(SD(𝑼0,𝑼)δ0|α),\Pr(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0}\text{ and }\alpha\in\mathcal{E})\leq\Pr(\alpha\in\mathcal{E})\max_{\alpha\in\mathcal{E}}\Pr(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0}|\alpha), thus, using Fact 3.7 and (7), we can conclude that

Pr(SD(𝑼0,𝑼)δ0)p0(1pα)+pαp0+pα\Pr\left(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0}\right)\leq p_{0}(1-p_{\alpha})+p_{\alpha}\leq p_{0}+p_{\alpha}

Thus, for a δ0<0.1\delta_{0}<0.1, SD(𝑼0,𝑼)<δ0\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})<\delta_{0} w.p. at least 1p0pα=12exp((n+q)cmqδ02/rκ2)2exp(nrcmqδ02/rκ2)2exp(qrcmqδ02/rκ2)exp(cmqδ02/rμ2κ4)1-p_{0}-p_{\alpha}=1-2\exp((n+q)-cmq\delta_{0}^{2}/r\kappa^{2})-2\exp(nr-cmq\delta_{0}^{2}/r\kappa^{2})-2\exp(qr-cmq\delta_{0}^{2}/r\kappa^{2})-\exp(-cmq\delta_{0}^{2}/r\mu^{2}\kappa^{4}). This is 15exp(c(n+q))\geq 1-5\exp(-c(n+q)) if mq>Cκ2μ2(n+q)r2/δ02mq>C\kappa^{2}\mu^{2}(n+q)r^{2}/\delta_{0}^{2}. This finishes our proof. ∎

IV Proofs of all the lemmas

IV-A Basic tools used

Our proofs use the following results and definitions:

Theorem 4.1 (Wedin sinΘ\sin\Theta theorem for Frobenius norm subspace distance [27, 28, Theorem 2.3.1]).

For two n1×n2n_{1}\times n_{2} matrices 𝐌\bm{M}^{*}, 𝐌\bm{M}, let 𝐔,𝐔{\bm{U}}^{*}{},{\bm{U}} denote the matrices containing their top rr singular vectors and let 𝐕,𝐕\bm{V}^{*}{}^{\top},\bm{V}^{\top} be the matrices of their right singular vectors (recall from the problem definition that we defined the SVD with the right matrix transposed). Let σr,σr+1\sigma^{*}_{r},\sigma^{*}_{r+1} denote the rr-th and (r+1)(r+1)-th singular values of 𝐌\bm{M}^{*}. If 𝐌𝐌σrσr+1\|\bm{M}-\bm{M}^{*}\|\leq\sigma^{*}_{r}-\sigma^{*}_{r+1}, then

SD(𝑼,𝑼)\displaystyle\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})
2max((𝑴𝑴)𝑼F,(𝑴𝑴)𝑽F)σrσr+1𝑴𝑴\displaystyle\qquad\leq\frac{\sqrt{2}\max(\|(\bm{M}-\bm{M}^{*})^{\top}{\bm{U}}^{*}{}\|_{F},\|(\bm{M}-\bm{M}^{*})\bm{V}^{*}{}^{\top}\|_{F})}{\sigma^{*}_{r}-\sigma^{*}_{r+1}-\|\bm{M}-\bm{M}^{*}\|}
Theorem 4.2 (Fundamental theorem of calculus [18, Chapter XIII, Theorem 4.2], [19]).

For two vectors 𝐳0,𝐳d\bm{z}_{0},\bm{z}^{*}\in\Re^{d}, and a differentiable vector function g(𝐳)d2g(\bm{z})\in\Re^{d_{2}},

g(𝒛0)g(𝒛)=(τ=01g(𝒛(τ))𝑑τ)(𝒛0𝒛),g(\bm{z}_{0})-g(\bm{z}^{*})=\left(\int_{\tau=0}^{1}\nabla g(\bm{z}(\tau))d\tau\right)(\bm{z}_{0}-\bm{z}^{*}),

where

𝒛(τ)=𝒛+τ(𝒛0𝒛).\bm{z}(\tau)=\bm{z}^{*}+\tau(\bm{z}_{0}-\bm{z}^{*}).

Observe that 𝐳g(𝐳)\nabla_{\bm{z}}g(\bm{z}) is a d2×dd_{2}\times d matrix.

Definition 4.3.

For any n×rn\times r matrix 𝐙{\bm{{Z}}}, let 𝐙vec{{\bm{{Z}}}_{vec}} denote the nrnr length vector formed by arranging all rr columns of 𝐙{\bm{{Z}}} one below the other. Thus, for nn-length and rr-length vectors 𝐚\bm{a} and 𝐛\bm{b},

  • (𝒂𝒃)vec=𝒂𝒃(\bm{a}\bm{b}^{\top})_{vec}=\bm{a}\otimes\bm{b} with \otimes being the Kronecker product;

  • 𝒂𝑼𝒃=trace(𝒂𝑼𝒃)=trace(𝒃𝒂𝑼)=(𝒂𝒃),𝑼=𝒂𝒃,𝑼vec\bm{a}^{\top}{\bm{U}}\bm{b}=\mathrm{trace}(\bm{a}^{\top}{\bm{U}}\bm{b})=\mathrm{trace}(\bm{b}\bm{a}^{\top}{\bm{U}})=\langle(\bm{a}\bm{b}^{\top}),{\bm{U}}\rangle=\langle\bm{a}\otimes\bm{b},{\bm{U}}_{vec}\rangle;

f(𝑼vec,𝑩)=ki((𝒂ki𝒃k)𝑼vec𝒚ki)2f({{\bm{U}}_{vec}},\bm{B})=\sum_{ki}((\bm{a}_{ki}\otimes\bm{b}_{k})^{\top}{{\bm{U}}_{vec}}-\bm{y}_{ki})^{2} and

(𝑼f(𝑼,𝑩))vec=𝑼vecf(𝑼vec,𝑩)\displaystyle(\nabla_{{\bm{U}}}f({\bm{U}},\bm{B}))_{vec}=\nabla_{{\bm{U}}_{vec}}f({\bm{U}}_{vec},\bm{B}) (8)
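A small numerical sanity check of these identities (a sketch; note that the pairing of ⊗ with the vectorization must be consistent: below we use NumPy's kron together with row-major flattening, under which ⟨𝒂⊗𝒃, 𝑼vec⟩ equals 𝒂⊤𝑼𝒃 as used here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5, 3
a, b = rng.standard_normal(n), rng.standard_normal(r)
U = rng.standard_normal((n, r))

vec = lambda M: M.ravel()                # flattening convention paired with np.kron below
lhs = a @ U @ b                          # a^T U b
rhs = np.kron(a, b) @ vec(U)             # <a (x) b, U_vec>
assert np.allclose(lhs, rhs)

# the same pairing underlies Hess = sum_ki (a_ki (x) b_k)(a_ki (x) b_k)^T:
w = vec(U) / np.linalg.norm(vec(U))
W = w.reshape(n, r)
assert np.allclose((np.kron(a, b) @ w) ** 2, (a @ W @ b) ** 2)  # one summand of w^T Hess w
print("vec/Kronecker identities verified")
```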
Definition 4.4.

At various places, f(𝐔,𝐁)\nabla f({\bm{U}},\bm{B}) is short for 𝐔f(𝐔,𝐁)=ki𝐚ki𝐛k(𝐚ki𝐔𝐛k𝐲ki)\nabla_{\bm{U}}f({\bm{U}},\bm{B})=\sum_{ki}\bm{a}_{ki}\bm{b}_{k}{}^{\top}(\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{y}_{ki}) and similarly f(𝐔vec,𝐁)\nabla f({{\bm{U}}_{vec}},\bm{B}) is short for 𝐔vecf(𝐔vec,𝐁)=ki(𝐚ki𝐛k)((𝐚ki𝐛k)𝐔vec𝐲ki)\nabla_{{\bm{U}}_{vec}}f({{\bm{U}}_{vec}},\bm{B})=\sum_{ki}(\bm{a}_{ki}\otimes\bm{b}_{k})((\bm{a}_{ki}\otimes\bm{b}_{k})^{\top}{{\bm{U}}_{vec}}-\bm{y}_{ki}).

Definition 4.5.

For any vector 𝐰\bm{w}, we use 𝐰(k)\bm{w}(k) to denote its kk-th entry.

Definition 4.6.

Everywhere we use 𝒮nr\mathcal{S}_{nr} to denote both the set of matrices {𝐖n×r:𝐖F=1}\{{\bm{W}}\in\Re^{n\times r}:\|{\bm{W}}\|_{F}=1\} and the set of these matrices vectorized {𝐰nr:𝐰=1}\{\bm{w}\in\Re^{nr}:\|\bm{w}\|=1\}. We also switch between the two sometimes. In the entire writing below, 𝐰=𝐖vec\bm{w}={\bm{W}}_{vec}.

All the high probability bounds for initialization use the sub-Gaussian Hoeffding inequality, while those for the GD lemmas use the sub-exponential Bernstein inequality; both are from [26]. In addition, these lemmas use the following results to extend, via an epsilon-net argument, a bound holding for a fixed unit norm 𝑾{\bm{W}} (or 𝒘\bm{w}) to all unit norm 𝑾{\bm{W}}s (or 𝒘\bm{w}s).

Proposition 4.7 (Epsilon-netting for bounding max𝒘𝒮n,𝒛𝒮r|𝒘𝑴𝒛|\max_{\bm{w}\in\mathcal{S}_{n},\bm{z}\in\mathcal{S}_{r}}|\bm{w}^{\top}\bm{M}\bm{z}|).

For an n×rn\times r matrix 𝐌\bm{M} and fixed vectors 𝐰,𝐳\bm{w},\bm{z} with 𝐰𝒮n\bm{w}\in\mathcal{S}_{n} and 𝐳𝒮r\bm{z}\in\mathcal{S}_{r}, suppose that |𝐰𝐌𝐳|b0|\bm{w}^{\top}\bm{M}\bm{z}|\leq b_{0} w.p. at least 1p01-p_{0}. Consider ϵnet\epsilon_{net}-nets 𝒮¯n\bar{\mathcal{S}}_{n}, 𝒮¯r\bar{\mathcal{S}}_{r} covering 𝒮n\mathcal{S}_{n} and 𝒮r\mathcal{S}_{r}. Then, w.p. at least 1(1+2/ϵnet)n+rp01-(1+2/\epsilon_{net})^{n+r}p_{0},

  • max𝒘𝒮¯n,𝒛𝒮¯r|𝒘𝑴𝒛|b0\max_{\bm{w}\in\bar{\mathcal{S}}_{n},\bm{z}\in\bar{\mathcal{S}}_{r}}|\bm{w}^{\top}\bm{M}\bm{z}|\leq b_{0} and

  • max𝒘𝒮n,𝒛𝒮r|𝒘𝑴𝒛|112ϵnetϵnet2b0\max_{\bm{w}\in\mathcal{S}_{n},\bm{z}\in\mathcal{S}_{r}}|\bm{w}^{\top}\bm{M}\bm{z}|\leq\frac{1}{1-2\epsilon_{net}-\epsilon_{net}^{2}}b_{0}.

Using ϵnet=1/8\epsilon_{net}=1/8, this implies the following simpler conclusion:
W.p. at least 117n+rp0=1exp((log17)(n+r))p01-17^{n+r}p_{0}=1-\exp((\log 17)(n+r))\cdot p_{0}, max𝐰𝒮n,𝐳𝒮r|𝐰𝐌𝐳|1.4b0\max_{\bm{w}\in\mathcal{S}_{n},\bm{z}\in\mathcal{S}_{r}}|\bm{w}^{\top}\bm{M}\bm{z}|\leq 1.4b_{0}.

Proof.

The proof follows that of Lemma 4.4.1 of [26]. ∎

Proposition 4.8 (Epsilon-netting for bounding max𝑾𝒮nr𝑴,𝑾\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle\bm{M},{\bm{W}}\rangle).

For an n×rn\times r matrix 𝐌\bm{M} and a fixed n×rn\times r matrix 𝐖𝒮nr{\bm{W}}\in\mathcal{S}_{nr} (unit Frobenius norm matrix), suppose that 𝐌,𝐖b0\langle\bm{M},{\bm{W}}\rangle\leq b_{0} w.p. at least 1p01-p_{0}. Consider an ϵnet\epsilon_{net} net covering 𝒮nr\mathcal{S}_{nr}, 𝒮¯nr\bar{\mathcal{S}}_{nr}. Then w.p. at least 1(1+2/ϵnet)nrp01-(1+2/\epsilon_{net})^{nr}p_{0},

  • max𝑾𝒮¯nr𝑴,𝑾b0\max_{{\bm{W}}\in\bar{\mathcal{S}}_{nr}}\langle\bm{M},{\bm{W}}\rangle\leq b_{0} and

  • max𝑾𝒮nr𝑴,𝑾11ϵnetb0\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle\bm{M},{\bm{W}}\rangle\leq\frac{1}{1-\epsilon_{net}}b_{0}.

Using ϵnet=1/8\epsilon_{net}=1/8, this implies the following simpler conclusion:
w.p. at least 117nrp0=1exp((log17)(nr))p01-17^{nr}p_{0}=1-\exp((\log 17)(nr))\cdot p_{0}, max𝐖𝒮nr𝐌,𝐖1.2b0\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle\bm{M},{\bm{W}}\rangle\leq 1.2b_{0}.

Proof.

The proof follows exactly as that of Exercise 4.4.3 of [26]. ∎

Proposition 4.9 (Epsilon-netting for upper and lower bounding ki𝑴ki,𝑾2\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2} over all 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr}).

For n×rn\times r matrices 𝐌ki\bm{M}_{ki} and a fixed 𝐖𝒮nr{\bm{W}}\in\mathcal{S}_{nr}, suppose that, w.p. at least 1p01-p_{0},

b1ki𝑴ki,𝑾2b2b_{1}\leq\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}\leq b_{2}

Consider an ϵnet\epsilon_{net} net covering 𝒮nr\mathcal{S}_{nr}, 𝒮¯nr\bar{\mathcal{S}}_{nr}. Then, w.p. at least 1(1+2/ϵnet)nrp01-(1+2/\epsilon_{net})^{nr}p_{0},

max𝑾𝒮nrki𝑴ki,𝑾211ϵnet22ϵnetb2\max_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}\leq\frac{1}{1-\epsilon_{net}^{2}-2\epsilon_{net}}b_{2}

and

min𝑾𝒮nrki𝑴ki,𝑾2b12ϵnet11ϵnet22ϵnetb2\min_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}\geq b_{1}-2\epsilon_{net}\cdot\frac{1}{1-\epsilon_{net}^{2}-2\epsilon_{net}}b_{2}

Picking ϵnet=b1/(8b2)\epsilon_{net}=b_{1}/(8b_{2}) guarantees that the above lower bound is non-negative. In particular, it implies the following:
w.p. at least 1(24b2/b1)nrp0=1exp(Cnrlog(b2/b1))p01-(24b_{2}/b_{1})^{nr}p_{0}=1-\exp(Cnr\log(b_{2}/b_{1}))\cdot p_{0}, 0.8b1min𝐖𝒮nrki𝐌ki,𝐖2max𝐖𝒮nrki𝐌ki,𝐖21.4b20.8b_{1}\leq\min_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}\leq\max_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}\leq 1.4b_{2}

Proof.

By union bound, for all 𝑾¯𝒮¯nr\bar{\bm{W}}\in\bar{\mathcal{S}}_{nr}, b1ki𝑴ki,𝑾¯2b2b_{1}\leq\sum_{ki}\langle\bm{M}_{ki},\bar{\bm{W}}\rangle^{2}\leq b_{2} holds w.p. at least 1(1+2/ϵnet)nrp01-(1+2/\epsilon_{net})^{nr}p_{0}.

Proof for the upper bound: Let γ=max𝑾𝒮nrki𝑴ki,𝑾2\gamma^{*}=\max_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}. Writing 𝑾=𝑾¯+(𝑾𝑾¯){\bm{W}}=\bar{\bm{W}}+({\bm{W}}-\bar{\bm{W}}) where 𝑾¯\bar{\bm{W}} is the closest point to 𝑾{\bm{W}} on 𝒮¯nr\bar{\mathcal{S}}_{nr}, we have ki𝑴ki,𝑾2=ki𝑴ki,𝑾¯2+ki𝑴ki,(𝑾𝑾¯)2+2ki𝑴ki,𝑾¯𝑴ki,(𝑾𝑾¯)\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}=\sum_{ki}\langle\bm{M}_{ki},\bar{\bm{W}}\rangle^{2}+\sum_{ki}\langle\bm{M}_{ki},({\bm{W}}-\bar{\bm{W}})\rangle^{2}+2\sum_{ki}\langle\bm{M}_{ki},\bar{\bm{W}}\rangle\langle\bm{M}_{ki},({\bm{W}}-\bar{\bm{W}})\rangle and (𝑾𝑾¯)Fϵnet\|({\bm{W}}-\bar{\bm{W}})\|_{F}\leq\epsilon_{net}.

Rewriting (𝑾𝑾¯)=(𝑾𝑾¯)F(𝑾𝑾¯)/(𝑾𝑾¯)F({\bm{W}}-\bar{\bm{W}})=\|({\bm{W}}-\bar{\bm{W}})\|_{F}\cdot({\bm{W}}-\bar{\bm{W}})/\|({\bm{W}}-\bar{\bm{W}})\|_{F}, and using the facts that (𝑾𝑾¯)/(𝑾𝑾¯)F𝒮nr({\bm{W}}-\bar{\bm{W}})/\|({\bm{W}}-\bar{\bm{W}})\|_{F}\in\mathcal{S}_{nr} and (𝑾𝑾¯)Fϵnet\|({\bm{W}}-\bar{\bm{W}})\|_{F}\leq\epsilon_{net}, and using Cauchy-Schwarz for the third term in the above expression, we have

γb2+ϵnet2γ+2γϵnet2γ=b2+ϵnet2γ+2ϵnetγ\gamma^{*}\leq b_{2}+\epsilon_{net}^{2}\gamma^{*}+2\sqrt{\gamma^{*}\cdot\epsilon_{net}^{2}\gamma^{*}}=b_{2}+\epsilon_{net}^{2}\gamma^{*}+2\epsilon_{net}\gamma^{*}

Thus, γ1/(1ϵnet22ϵnet)b2\gamma^{*}\leq 1/(1-\epsilon_{net}^{2}-2\epsilon_{net})\cdot b_{2}.

Proof for the lower bound: Let β=min𝑾𝒮nrki𝑴ki,𝑾2\beta^{*}=\min_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}. Proceeding as above, we have

βb12γϵnet2γ=b12ϵnetγ\beta^{*}\geq b_{1}-2\sqrt{\gamma^{*}\cdot\epsilon_{net}^{2}\gamma^{*}}=b_{1}-2\epsilon_{net}\gamma^{*}

IV-B Proving GD iterations’ lemmas: Proof of Lemma 3.4 (algebra lemma)

Recall that 𝑼vec{{\bm{U}}_{vec}} denotes the vectorized 𝑼{\bm{U}}. We use this so that we can apply the simple vector version of the fundamental theorem of calculus [18, Chapter XIII, Theorem 4.2],[19, Lemma 2 proof] (given in Theorem 4.2) on the nrnr length vector f(𝑼vec,𝑩)\nabla f({{\bm{U}}_{vec}},\bm{B}), and so that the Hessian can be expressed as an nr×nrnr\times nr matrix.

We apply Theorem 4.2 with 𝒛0𝑼vec\bm{z}_{0}\equiv{{\bm{U}}_{vec}}, 𝒛(𝑼𝑼𝑼)vec\bm{z}^{*}\equiv({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}})_{vec}, and g(𝒛)=f(𝒛,𝑩)g(\bm{z})=\nabla f(\bm{z},\bm{B}). Thus d=d2=nrd=d_{2}=nr and g(𝒛)\nabla g(\bm{z}) is the Hessian of f(𝒛,𝑩)f(\bm{z},\bm{B}) computed at 𝒛\bm{z}. Let 𝑼(τ):=𝑼𝑼𝑼+τ(𝑼𝑼𝑼𝑼){\bm{U}}(\tau):={\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}+\tau({\bm{U}}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}). Applying the theorem,

f(𝑼vec,𝑩)f((𝑼𝑼𝑼)vec,𝑩)\displaystyle\nabla f({\bm{U}}_{vec},\bm{B})-\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B})
=(τ=01𝑼vec2f(𝑼(τ)vec,𝑩)𝑑τ)(𝑼vec(𝑼𝑼𝑼)vec)\displaystyle=(\int_{\tau=0}^{1}\nabla_{{{\bm{U}}_{vec}}}^{2}f({\bm{U}}(\tau)_{vec},\bm{B})d\tau)({\bm{U}}_{vec}-({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec}) (9)

where

𝑼vec2f(𝑼(τ)vec,𝑩)=ki(𝒂ki𝒃k)(𝒂ki𝒃k):=Hess\displaystyle\nabla_{{{\bm{U}}_{vec}}}^{2}f({\bm{U}}(\tau)_{vec},\bm{B})=\sum_{ki}(\bm{a}_{ki}\otimes\bm{b}_{k})(\bm{a}_{ki}\otimes\bm{b}_{k})^{\top}:=\ \mathrm{Hess}\ (10)

This is an nr×nrnr\times nr matrix. Because the cost function is quadratic, the Hessian is constant w.r.t. τ\tau. Henceforth, we refer to it as Hess\ \mathrm{Hess}\ . With this, the above simplifies to

f(𝑼vec,𝑩)f((𝑼𝑼𝑼)vec,𝑩)\displaystyle\nabla f({\bm{U}}_{vec},\bm{B})-\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B})
=Hess(𝑼vec(𝑼𝑼𝑼)vec)=Hess(𝑷𝑼)vec\displaystyle=\ \mathrm{Hess}\ ({\bm{U}}_{vec}-({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec})=\mathrm{Hess}\ (\bm{P}{\bm{U}})_{vec} (11)

with

𝑷:=𝑰𝑼𝑼\bm{P}:=\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}

denoting the n×nn\times n projection matrix to project orthogonal to 𝑼{\bm{U}}^{*}{}. This proof is motivated by a similar approach used in [19, Lemma 2 proof] to analyze GD for standard PR. However, there the application was much simpler because f(.)f(.) was a function of one variable and at the true solution the gradient was zero, i.e., f(𝒙)=𝟎\nabla f(\bm{x}^{*})=\bm{0}. In our case f(𝑼𝑼𝑼,𝑩)𝟎\nabla f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B})\neq\bm{0} because 𝑩𝑩\bm{B}\neq\bm{B}^{*}. But we can show that 𝔼[(𝑰𝑼𝑼)f(𝑼𝑼𝑼,𝑩)]=𝟎\mathbb{E}[(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\nabla f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B})]=\bm{0} and this helps us get the final desired result.
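Since f(·,𝑩) is quadratic in 𝑼vec, the Hessian in (10) indeed does not depend on τ and (11) is an exact identity. A minimal numerical check of this (illustrative names; a single column k for brevity; gradient and Hessian scaled as in Definition 4.4 and (10)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 6, 2, 4
a = rng.standard_normal((m, n))   # rows are a_i^T (one column k only, for brevity)
b = rng.standard_normal(r)        # b_k
y = rng.standard_normal(m)        # y_ki

def grad(u_vec):
    # sum_i (a_i (x) b) ((a_i (x) b)^T u_vec - y_i), cf. Definition 4.4
    return sum(np.kron(a[i], b) * (np.kron(a[i], b) @ u_vec - y[i]) for i in range(m))

Hess = sum(np.outer(np.kron(a[i], b), np.kron(a[i], b)) for i in range(m))

u1, u2 = rng.standard_normal(n * r), rng.standard_normal(n * r)
# eq. (11): exact because the cost is quadratic, so the Hessian is constant
assert np.allclose(grad(u1) - grad(u2), Hess @ (u1 - u2))
print("gradient difference equals Hess times the parameter difference")
```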

From Algorithm 1, recall that 𝑼^+=𝑼(η/m)f(𝑼,𝑩)\hat{\bm{U}}^{+}={\bm{U}}-(\eta/m)\nabla f({\bm{U}},\bm{B}). Vectorizing this equation, and using (11), we get

(𝑼^+)vec\displaystyle(\hat{\bm{U}}^{+})_{vec} =𝑼vec(η/m)f(𝑼vec,𝑩)\displaystyle={\bm{U}}_{vec}-(\eta/m)\nabla f({\bm{U}}_{vec},\bm{B})
=𝑼vec(η/m)Hess(𝑷𝑼)vec\displaystyle={\bm{U}}_{vec}-(\eta/m)\ \mathrm{Hess}\ (\bm{P}{\bm{U}})_{vec}
(η/m)f((𝑼𝑼𝑼)vec,𝑩)\displaystyle\qquad-(\eta/m)\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B}) (12)

We can prove our final result by using (8) and the following simple facts:

  1.

    For an n×nn\times n matrix 𝑴\bm{M}, let big(𝑴):=𝑰r𝑴\mathrm{big}(\bm{M}):=\bm{I}_{r}\otimes\bm{M} be an nr×nrnr\times nr block diagonal matrix with 𝑴\bm{M} in the diagonal blocks. For any n×rn\times r matrix 𝒁{\bm{{Z}}},

    big(𝑴)𝒁vec=(𝑴𝒁)vec\displaystyle\mathrm{big}(\bm{M}){{\bm{{Z}}}_{vec}}=(\bm{M}{\bm{{Z}}})_{vec} (13)
  2.

    Since 𝑷\bm{P} is idempotent, 𝑷=𝑷2\bm{P}=\bm{P}^{2}. Also, because of its block diagonal structure, big(𝑴2)=(big(𝑴))2\mathrm{big}(\bm{M}^{2})=(\mathrm{big}(\bm{M}))^{2}. Thus,

    big(𝑷)\displaystyle\mathrm{big}(\bm{P}) =big(𝑷2)=(big(𝑷))2=big(𝑷)𝑰nrbig(𝑷)\displaystyle=\mathrm{big}(\bm{P}^{2})=(\mathrm{big}(\bm{P}))^{2}=\mathrm{big}(\bm{P})\bm{I}_{nr}\mathrm{big}(\bm{P}) (14)

Left multiplying both sides of (12) by big(𝑷)\mathrm{big}(\bm{P}), and using (13), (14), and (8),

big(𝑷)(𝑼^+)vec=big(𝑷)𝑼vec(η/m)big(𝑷)Hess(𝑷𝑼)vec\displaystyle\mathrm{big}(\bm{P})(\hat{\bm{U}}^{+})_{vec}=\mathrm{big}(\bm{P}){\bm{U}}_{vec}-(\eta/m)\mathrm{big}(\bm{P})\ \mathrm{Hess}\ (\bm{P}{\bm{U}})_{vec}
(η/m)big(𝑷)f((𝑼𝑼𝑼)vec,𝑩)\displaystyle\qquad-(\eta/m)\mathrm{big}(\bm{P})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B})
=big(𝑷)𝑰nrbig(𝑷)𝑼vec(η/m)big(𝑷)Hessbig(𝑷)𝑼vec\displaystyle=\mathrm{big}(\bm{P})\bm{I}_{nr}\mathrm{big}(\bm{P}){\bm{U}}_{vec}-(\eta/m)\mathrm{big}(\bm{P})\ \mathrm{Hess}\ \mathrm{big}(\bm{P}){\bm{U}}_{vec}
(η/m)big(𝑷)f((𝑼𝑼𝑼)vec,𝑩)\displaystyle\qquad-(\eta/m)\mathrm{big}(\bm{P})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B})
=big(𝑷)(𝑰nr(η/m)Hess)big(𝑷)𝑼vec\displaystyle=\mathrm{big}(\bm{P})(\bm{I}_{nr}-(\eta/m)\ \mathrm{Hess})\mathrm{big}(\bm{P}){\bm{U}}_{vec}
(η/m)big(𝑷)f((𝑼𝑼𝑼)vec,𝑩).\displaystyle\qquad-(\eta/m)\mathrm{big}(\bm{P})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B}).

Thus, using big(𝑷)=𝑷=1\|\mathrm{big}(\bm{P})\|=\|\bm{P}\|=1, (13), and (8),

(𝑷𝑼^+)vec\displaystyle\|(\bm{P}\hat{\bm{U}}^{+})_{vec}\| 𝑰nr(η/m)Hess(𝑷𝑼)vec\displaystyle\leq\|\bm{I}_{nr}-(\eta/m)\ \mathrm{Hess}\ \|\ \|(\bm{P}{\bm{U}})_{vec}\|
+(η/m)(f((𝑼𝑼𝑼),𝑩))vec\displaystyle+(\eta/m)\|(\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B}))_{vec}\| (15)

Converting the vectors to matrices, using 𝑴vec=𝑴F||\bm{M}_{vec}||=||\bm{M}||_{F}, and substituting for 𝑷\bm{P},

(𝑰𝑼𝑼)𝑼^+F\displaystyle\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\hat{\bm{U}}^{+}\|_{F}
𝑰nr(η/m)Hess(𝑰𝑼𝑼)𝑼F\displaystyle\leq\|\bm{I}_{nr}-(\eta/m)\ \mathrm{Hess}\ \|\ \|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{U}}\|_{F}
+(η/m)(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩)F\displaystyle\qquad+(\eta/m)\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B})\|_{F}

Since 𝑼^+=QR𝑼+𝑹+\hat{\bm{U}}^{+}\overset{\mathrm{QR}}{=}{\bm{U}}^{+}{\bm{R}}^{+} and since 𝑴1𝑴2F𝑴1F𝑴2\|\bm{M}_{1}\bm{M}_{2}\|_{F}\leq\|\bm{M}_{1}\|_{F}\|\bm{M}_{2}\|, this means that

SD(𝑼,𝑼+)(𝑰𝑼𝑼)𝑼^+F(𝑹+)1.\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}^{+})\leq\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\hat{\bm{U}}^{+}\|_{F}\|({\bm{R}}^{+})^{-1}\|.

Since (𝑹+)1=1/σmin(𝑹+)=1/σmin(𝑼^+)\|({\bm{R}}^{+})^{-1}\|=1/\sigma_{\min}({\bm{R}}^{+})=1/\sigma_{\min}(\hat{\bm{U}}^{+}), using 𝑼^+=𝑼(η/m)f(𝑼,𝑩)\hat{\bm{U}}^{+}={\bm{U}}-(\eta/m)\nabla f({\bm{U}},\bm{B}),

(𝑹+)1\displaystyle\|({\bm{R}}^{+})^{-1}\| =1σmin(𝑼(η/m)f(𝑼,𝑩))\displaystyle=\frac{1}{\sigma_{\min}({\bm{U}}-(\eta/m)\nabla f({\bm{U}},\bm{B}))}
11(η/m)f(𝑼,𝑩)\displaystyle\leq\frac{1}{1-(\eta/m)\|\nabla f({\bm{U}},\bm{B})\|}

where we used σmin(𝑼(η/m)f(𝑼,𝑩))σmin(𝑼)(η/m)f(𝑼,𝑩)=1(η/m)f(𝑼,𝑩)\sigma_{\min}({\bm{U}}-(\eta/m)\nabla f({\bm{U}},\bm{B}))\geq\sigma_{\min}({\bm{U}})-(\eta/m)\|\nabla f({\bm{U}},\bm{B})\|=1-(\eta/m)\|\nabla f({\bm{U}},\bm{B})\| for the last inequality. Combining the last three equations above proves our lemma.

IV-C Proof of GD iterations’ lemmas: Proof of Lemma 3.5

IV-C1 Upper and lower bounding the Hessian eigenvalues (the Hess\mathrm{Hess} term)

First, assume the event under which the conclusions of Lemma 3.3 hold.

Recall from (10) that Hess:=𝑼~vec2f(𝑼~vec;𝑩)=ki(𝒂ki𝒃k)(𝒂ki𝒃k).\ \mathrm{Hess}\ :=\nabla_{\tilde{\bm{U}}_{vec}}^{2}f(\tilde{\bm{U}}_{vec};\bm{B})=\sum_{ki}(\bm{a}_{ki}\otimes\bm{b}_{k})(\bm{a}_{ki}\otimes\bm{b}_{k}){}^{\top}. Since Hess\ \mathrm{Hess}\ is a positive semi-definite matrix, λmin(Hess)=min𝒘𝒮nr𝒘Hess𝒘\lambda_{\min}\left(\ \mathrm{Hess}\ \right)=\min_{\bm{w}\in\mathcal{S}_{nr}}\bm{w}{}^{\top}\ \mathrm{Hess}\ \ \bm{w} and λmax(Hess)=max𝒘𝒮nr𝒘Hess𝒘.\lambda_{\max}\left(\ \mathrm{Hess}\ \right)=\max_{\bm{w}\in\mathcal{S}_{nr}}\bm{w}{}^{\top}\ \mathrm{Hess}\ \ \bm{w}. For a fixed 𝒘𝒮nr\bm{w}\in\mathcal{S}_{nr},

𝒘Hess𝒘=ki(𝒂ki𝑾𝒃k)2\bm{w}{}^{\top}\ \mathrm{Hess}\ \ \bm{w}=\sum_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k})^{2}

where 𝑾{\bm{W}} is an n×rn\times r matrix with 𝑾F=1\|{\bm{W}}\|_{F}=1. Clearly (𝒂ki𝑾𝒃k)2(\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k})^{2} are mutually independent sub-exponential random variables (r.v.) with sub-exponential norm Kki𝑾𝒃k2K_{ki}\leq\|{\bm{W}}\bm{b}_{k}\|^{2}. Also, 𝔼[(𝒂ki𝑾𝒃k)2]=𝑾𝒃k2\mathbb{E}[(\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k})^{2}]=\|{\bm{W}}\bm{b}_{k}\|^{2} and thus 𝔼[ki(𝒂ki𝑾𝒃k)2]=m𝑾𝑩F2\mathbb{E}[\sum_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k})^{2}]=m\|{\bm{W}}\bm{B}\|_{F}^{2}. Applying the sub-exponential Bernstein inequality, Theorem 2.8.1 of [26], for a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr} yields

Pr{|ki|𝒂ki𝑾𝒃k|2m𝑾𝑩F2|t}\displaystyle\Pr\left\{\Big{|}\sum_{ki}\big{|}\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k}\big{|}^{2}-m\|{\bm{W}}\bm{B}\|_{F}^{2}\Big{|}\geq t\right\}
exp[cmin(t2kiKki2,tmaxkiKki)].\displaystyle\qquad\leq\exp\left[-c\min\left(\frac{t^{2}}{\sum_{ki}K_{ki}^{2}},~{}\frac{t}{\max_{ki}K_{ki}}\right)\right].

We set t=ϵ3mσmin2t=\epsilon_{3}m{\sigma_{\min}^{*}}^{2}. By Lemma 3.3, 𝒃k21.1μ2σmax2(r/q)=1.1κ2μ2σmin2(r/q)\|\bm{b}_{k}\|^{2}\leq 1.1\mu^{2}{\sigma_{\max}^{*}}^{2}(r/q)=1.1\kappa^{2}\mu^{2}{\sigma_{\min}^{*}}^{2}(r/q). Thus,

t2kiKki2\displaystyle\frac{t^{2}}{\sum_{ki}K_{ki}^{2}} ϵ32m2σmin4ki𝑾𝒃k4ϵ32mσmin4maxk𝒃k2k𝑾𝒃k2\displaystyle\geq\frac{\epsilon_{3}^{2}m^{2}{\sigma_{\min}^{*}}^{4}}{\sum_{ki}\|{\bm{W}}\bm{b}_{k}\|^{4}}\geq\frac{\epsilon_{3}^{2}m{\sigma_{\min}^{*}}^{4}}{\max_{k}\|\bm{b}_{k}\|^{2}\sum_{k}\|{\bm{W}}\bm{b}_{k}\|^{2}}
ϵ32mσmin4μ2σmax2(r/q)1.1σmax2=cϵ32mq/rμ2κ4\displaystyle\geq\frac{\epsilon_{3}^{2}m{\sigma_{\min}^{*}}^{4}}{\mu^{2}{\sigma_{\max}^{*}}^{2}(r/q)\cdot 1.1{\sigma_{\max}^{*}}^{2}}=c\epsilon_{3}^{2}mq/r\mu^{2}\kappa^{4}

Here we used k𝑾𝒃k2=𝑾𝑩F2𝑾F2𝑩21.12σmax2\sum_{k}\|{\bm{W}}\bm{b}_{k}\|^{2}=\|{\bm{W}}\bm{B}\|_{F}^{2}\leq\|{\bm{W}}\|_{F}^{2}\|\bm{B}\|^{2}\leq 1.1^{2}{\sigma_{\max}^{*}}^{2}, which follows from the bound on 𝑩\|\bm{B}\| from Lemma 3.3. Also,

tmaxkiKki\displaystyle\frac{t}{\max_{ki}K_{ki}} ϵ3mσmin2maxki𝑾𝒃k2ϵ3mσmin21.1μ2σmax2(r/q)\displaystyle\geq\frac{\epsilon_{3}m{\sigma_{\min}^{*}}^{2}}{\max_{ki}\|{\bm{W}}\bm{b}_{k}\|^{2}}\geq\frac{\epsilon_{3}m{\sigma_{\min}^{*}}^{2}}{1.1\mu^{2}{\sigma_{\max}^{*}}^{2}(r/q)}
=cϵ3mq/rμ2κ2.\displaystyle=c\epsilon_{3}mq/r\mu^{2}\kappa^{2}.

Therefore, for a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr}, w.p. 1exp[cϵ32mq/rμ2κ4]1-\exp\left[-c\epsilon_{3}^{2}mq/r\mu^{2}\kappa^{4}\right] we have

|ki|𝒂ki𝑾𝒃k|2m𝑾𝑩F2|ϵ3mσmin2.\displaystyle\Big{|}\sum_{ki}\big{|}\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k}\big{|}^{2}-m\|{\bm{W}}\bm{B}\|_{F}^{2}\Big{|}\leq\epsilon_{3}m{\sigma_{\min}^{*}}^{2}. (16)

and hence, by Lemma 3.3, w.p. 1exp[cϵ32mq/rμ2κ4]1-\exp\left[-c\epsilon_{3}^{2}mq/r\mu^{2}\kappa^{4}\right],

ki|𝒂ki𝑾𝒃k|2m𝑾𝑩F2+ϵ3mσmin2\displaystyle\sum_{ki}\big{|}\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k}\big{|}^{2}\leq m\|{\bm{W}}\bm{B}\|_{F}^{2}+\epsilon_{3}m{\sigma_{\min}^{*}}^{2}
m𝑩2+ϵ3mσmin2m(1.1+ϵ3/κ2)σmax2.\displaystyle\qquad\leq m\|\bm{B}\|^{2}+\epsilon_{3}m{\sigma_{\min}^{*}}^{2}\leq m(1.1+\epsilon_{3}/\kappa^{2}){\sigma_{\max}^{*}}^{2}. (17)

and

ki|𝒂ki𝑾𝒃k|2m𝑾𝑩F2ϵ3mσmin2\displaystyle\sum_{ki}\big{|}\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k}\big{|}^{2}\geq m\|{\bm{W}}\bm{B}\|_{F}^{2}-\epsilon_{3}m{\sigma_{\min}^{*}}^{2}
0.9mσmin2ϵ3mσmin2=m(0.9ϵ3)σmin2.\displaystyle\qquad\geq 0.9m{\sigma_{\min}^{*}}^{2}-\epsilon_{3}m{\sigma_{\min}^{*}}^{2}=m(0.9-\epsilon_{3}){\sigma_{\min}^{*}}^{2}. (18)

To extend these bounds to all 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr}, we apply Proposition 4.9 with b1m(0.9ϵ3)σmin2b_{1}\equiv m(0.9-\epsilon_{3}){\sigma_{\min}^{*}}^{2} and b2m(1.1+ϵ3/κ2)σmax2b_{2}\equiv m(1.1+\epsilon_{3}/\kappa^{2}){\sigma_{\max}^{*}}^{2}. This implies that, on the event that the claims of Lemma 3.3 hold, w.p. at least 1exp(nrlogκcmqϵ32/rμ2κ4)1-\exp(nr\log\kappa-cmq\epsilon_{3}^{2}/r\mu^{2}\kappa^{4}),

m(0.71.2ϵ3)σmin2\displaystyle m(0.7-1.2\epsilon_{3}){\sigma_{\min}^{*}}^{2} λmin(Hess)\displaystyle\leq\lambda_{\min}(\ \mathrm{Hess}\ )
λmax(Hess)m(1.1+ϵ3)σmax2\displaystyle\leq\lambda_{\max}(\ \mathrm{Hess}\ )\leq m(1.1+\epsilon_{3}){\sigma_{\max}^{*}}^{2}

Using the probability from Lemma 3.3, the above bound holds w.p. at least 1exp(nrlogκcmqϵ32/rμ2κ4)exp(logq+rcm)1-\exp(nr\log\kappa-cmq\epsilon_{3}^{2}/r\mu^{2}\kappa^{4})-\exp(\log q+r-cm).

IV-C2 Bounding the GradU Term

We have f(𝑼,𝑩)=max𝒛𝒮n,𝒘𝒮r𝒛f(𝑼,𝑩)𝒘.\|\nabla f({\bm{U}},\bm{B})\|=\max_{\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{r}}\bm{z}{}^{\top}\nabla f({\bm{U}},\bm{B})\bm{w}. For a fixed 𝒛𝒮n,𝒘𝒮r\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{r} we have

𝒛(f(𝑼,𝑩)𝔼[f(𝑼,𝑩)])𝒘\displaystyle\bm{z}{}^{\top}\left(\nabla f({\bm{U}},\bm{B})-\mathbb{E}[\nabla f({\bm{U}},\bm{B})]\right)\bm{w}
=ki[(𝒂ki𝑼𝒃k𝒚ki)(𝒂ki𝒛)(𝒘𝒃k)𝔼[.]]\displaystyle\qquad=\sum_{ki}\left[\left(\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{y}_{ki}\right)\left(\bm{a}_{ki}{}^{\top}\bm{z}\right)\left(\bm{w}{}^{\top}\bm{b}_{k}\right)-\mathbb{E}[.]\right]

where 𝔼[.]\mathbb{E}[.] is the expected value of the first term. Clearly, the summands are independent sub-exponential r.v.s with norm KkiC𝒙k𝒙k𝒃kK_{ki}\leq C\|\bm{x}_{k}-\bm{x}^{*}_{k}\|\|\bm{b}_{k}\|. We apply the sub-exponential Bernstein inequality, Theorem 2.8.1 of [26], with t=ϵ1δtmσmax2t=\epsilon_{1}\delta_{t}m{\sigma_{\max}^{*}}^{2}. To apply this, we use bounds on 𝒃k\|\bm{b}_{k}\|, 𝑿𝑿F\|{\bm{X}}^{*}-{\bm{X}}\|_{F} and 𝒙k𝒙k\|\bm{x}_{k}-\bm{x}^{*}_{k}\| from Lemma 3.3 to show that

t2kiKki2\displaystyle\frac{t^{2}}{\sum_{ki}K_{ki}^{2}} cϵ12δt2m2σmax4mmaxk𝒃k2k𝒙k𝒙k2\displaystyle\geq c\frac{\epsilon_{1}^{2}\delta_{t}^{2}m^{2}{\sigma_{\max}^{*}}^{4}}{m\max_{k}\|\bm{b}_{k}\|^{2}\sum_{k}\|\bm{x}_{k}-\bm{x}^{*}_{k}\|^{2}}
cϵ12δt2mσmax4Cμ2σmax2(r/q)𝑿𝑿F2\displaystyle\geq c\frac{\epsilon_{1}^{2}\delta_{t}^{2}m{\sigma_{\max}^{*}}^{4}}{C\mu^{2}{\sigma_{\max}^{*}}^{2}(r/q)\|{\bm{X}}-{\bm{X}}^{*}\|_{F}^{2}}
cϵ12δt2mqσmax4Cμ2σmax2rδt2σmax2=cϵ12mqrμ2.\displaystyle\geq c\frac{\epsilon_{1}^{2}\delta_{t}^{2}mq{\sigma_{\max}^{*}}^{4}}{C\mu^{2}{\sigma_{\max}^{*}}^{2}r\delta_{t}^{2}{\sigma_{\max}^{*}}^{2}}=c\epsilon_{1}^{2}\frac{mq}{r\mu^{2}}.

and

tmaxkiKkicϵ1δtmσmax2Cδtσmax2μ2(r/q)cϵ1mqrμ2.\frac{t}{\max_{ki}K_{ki}}\geq c\frac{\epsilon_{1}\delta_{t}m{\sigma_{\max}^{*}}^{2}}{C\delta_{t}{\sigma_{\max}^{*}}^{2}\mu^{2}(r/q)}\geq c\epsilon_{1}\frac{mq}{r\mu^{2}}.

Therefore, for a fixed 𝒛𝒮n,𝒘𝒮r\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{r} w.p. 1exp(cϵ12mq/rμ2)1-\exp(-c\epsilon_{1}^{2}mq/r\mu^{2}),

|𝒛(f(𝑼,𝑩)𝔼[f(𝑼,𝑩)])𝒘|\displaystyle\big{|}\bm{z}{}^{\top}\left(\nabla f({\bm{U}},\bm{B})-\mathbb{E}[\nabla f({\bm{U}},\bm{B})]\right)\bm{w}\big{|} ϵ1δtmσmax2\displaystyle\leq\epsilon_{1}\delta_{t}m{\sigma_{\max}^{*}}^{2}

Since f(𝑼,𝑩)=ki𝒂ki𝒂ki(𝒙k𝒙k)𝒃k\nabla f({\bm{U}},\bm{B})=\sum_{ki}\bm{a}_{ki}\bm{a}_{ki}{}^{\top}(\bm{x}_{k}-\bm{x}^{*}_{k})\bm{b}_{k}{}^{\top},

𝔼[f(𝑼,𝑩)]=mk(𝒙k𝒙k)𝒃k=m(𝑿𝑿)𝑩.\mathbb{E}[\nabla f({\bm{U}},\bm{B})]=m\sum_{k}(\bm{x}_{k}-\bm{x}^{*}_{k})\bm{b}_{k}{}^{\top}=m\left({\bm{X}}-{\bm{X}}^{*}\right)\bm{B}{}^{\top}.

Using the bounds on 𝑿𝑿F\|{\bm{X}}^{*}-{\bm{X}}\|_{F} and 𝑩\|\bm{B}\| from Lemma 3.3,

𝔼[f(𝑼,𝑩)]\displaystyle\|\mathbb{E}[\nabla f({\bm{U}},\bm{B})]\| =m(𝑿𝑿)𝑩\displaystyle=m\|({\bm{X}}-{\bm{X}}^{*})\bm{B}{}^{\top}\|
m𝑿𝑿𝑩\displaystyle\leq m\|{\bm{X}}-{\bm{X}}^{*}\|~{}\|\bm{B}\|
m𝑿𝑿F𝑩\displaystyle\leq m\|{\bm{X}}-{\bm{X}}^{*}\|_{F}~{}\|\bm{B}\|
1.1mδtσmax2\displaystyle\leq 1.1m\delta_{t}{\sigma_{\max}^{*}}^{2}

Hence, for a fixed 𝒛𝒮n,𝒘𝒮r\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{r} w.p. 1exp[cϵ12mq/rμ2]1-\exp\left[-c\epsilon_{1}^{2}mq/r\mu^{2}\right] we have

|𝒛f(𝑼,𝑩)𝒘|(1.1+ϵ1)mδtσmax2.|\bm{z}^{\top}\nabla f({\bm{U}},\bm{B})\bm{w}|\leq(1.1+\epsilon_{1})m\delta_{t}{\sigma_{\max}^{*}}^{2}.

Applying Proposition 4.7, this implies that, w.p. 1exp((n+r)(log17)cϵ12mq/rμ2)1-\exp((n+r)(\log 17)-c\epsilon_{1}^{2}mq/r\mu^{2}), max𝒛𝒮n,𝒘𝒮r𝒛f(𝑼,𝑩)𝒘1.4(1.1+ϵ1)mδtσmax2.\max_{\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{r}}\bm{z}{}^{\top}\nabla f({\bm{U}},\bm{B})\bm{w}\leq 1.4(1.1+\epsilon_{1})m\delta_{t}{\sigma_{\max}^{*}}^{2}.

IV-C3 Bounding Term2

First, since Term2=(𝑰𝑼𝑼)ki𝒂ki(𝒂ki𝑼(𝑼𝑼𝒃k𝒃k))𝒃k\mathrm{Term2}=(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\sum_{ki}\bm{a}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}({\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}))\bm{b}_{k}{}{}^{\top}, and 𝔼[𝒂ki𝒂ki]=𝑰\mathbb{E}[\bm{a}_{ki}\bm{a}_{ki}{}^{\top}]=\bm{I},

𝔼[Term2]=0\mathbb{E}[\mathrm{Term2}]=0

We have

(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩)F\displaystyle\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B})\|_{F}
=max𝑾𝒮nr(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩),𝑾\displaystyle\qquad=\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B}),~{}{\bm{W}}\rangle

For a fixed n×rn\times r matrix 𝑾{\bm{W}} with unit Frobenius norm,

(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩),𝑾\displaystyle\langle(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B}),~{}{\bm{W}}\rangle
=ki(𝒂ki𝑼(𝑼𝑼𝒃k𝒃k))(𝒂ki(𝑰𝑼𝑼)𝑾𝒃k)\displaystyle\qquad=\sum_{ki}\left(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}({\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k})\right)\left(\bm{a}_{ki}{}^{\top}(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{W}}\bm{b}_{k}\right)

Observe that the summands are independent, zero mean, sub-exponential r.v.s with sub-exponential norm KkiC𝑼𝑼𝒃k𝒃k(𝑰𝑼𝑼)𝑾𝒃k𝑼𝑼𝒃k𝒃k𝑾𝒃kK_{ki}\leq C\|{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}\|\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{W}}\bm{b}_{k}\|\leq\|{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}\|\|{\bm{W}}\bm{b}_{k}\|. We can now apply the sub-exponential Bernstein inequality Theorem 2.8.1 of [26]. Let t=ϵ2δtmσmax2t=\epsilon_{2}\delta_{t}m{\sigma_{\max}^{*}}^{2}. Using the bound on 𝑼𝑼𝒃k𝒃k\|{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}\| from Lemma 3.3 followed by Assumption 1.1 (right incoherence), and also the bound on 𝑩\|\bm{B}\| from Lemma 3.3,

t2kiKki2\displaystyle\frac{t^{2}}{\sum_{ki}K^{2}_{ki}} ϵ22δt2m2σmax4δt2σmax2μ2(r/q)ki𝑾𝒃k2\displaystyle\geq\frac{\epsilon_{2}^{2}\delta_{t}^{2}m^{2}{\sigma_{\max}^{*}}^{4}}{\delta_{t}^{2}{\sigma_{\max}^{*}}^{2}\mu^{2}(r/q)\sum_{ki}\|{\bm{W}}\bm{b}_{k}\|^{2}}
ϵ22m2σmax2Cμ2(r/q)m𝑾𝑩F2ϵ22m2σmax2μ2(r/q)mσmax2\displaystyle\geq\frac{\epsilon_{2}^{2}m^{2}{\sigma_{\max}^{*}}^{2}}{C\mu^{2}(r/q)m\|{\bm{W}}\bm{B}\|_{F}^{2}}\geq\frac{\epsilon_{2}^{2}m^{2}{\sigma_{\max}^{*}}^{2}}{\mu^{2}(r/q)m{\sigma_{\max}^{*}}^{2}}
cϵ22mq/rμ2,\displaystyle\geq c\epsilon_{2}^{2}mq/r\mu^{2},

and

tmaxkiKkiϵ2δtmσmax2Cδtκ2μ2σmax2(r/q)cϵ2mq/(rκ2μ2).\frac{t}{\max_{ki}K_{ki}}\geq\frac{\epsilon_{2}\delta_{t}m{\sigma_{\max}^{*}}^{2}}{C\delta_{t}\kappa^{2}\mu^{2}{\sigma_{\max}^{*}}^{2}(r/q)}\geq c\epsilon_{2}mq/(r\kappa^{2}\mu^{2}).

Thus, by the sub-exponential Bernstein inequality, for a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr}, w.p. 1exp(cϵ22mq/rκ2μ2)1-\exp(-c\epsilon_{2}^{2}mq/r\kappa^{2}\mu^{2}),

(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩),𝑾ϵ2δtmσmax2.\langle(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B}),~{}{\bm{W}}\rangle\leq\epsilon_{2}\delta_{t}m{\sigma_{\max}^{*}}^{2}.

Applying Proposition 4.8, w.p. at least 1exp(nrcϵ22mq/rκ2μ2)1-\exp(nr-c\epsilon_{2}^{2}mq/r\kappa^{2}\mu^{2}), max𝑾𝒮nr(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩),𝑾1.2ϵ2δtmσmax2.\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B}),{\bm{W}}\rangle\leq 1.2\epsilon_{2}\delta_{t}m{\sigma_{\max}^{*}}^{2}.

IV-D Proof of GD iterations’ lemmas: Proof of Lemma 3.3, all parts other than the first part

Recall that 𝒈k=𝑼𝒙k=𝑼𝑼𝒃k\bm{g}_{k}={\bm{U}}^{\top}\bm{x}^{*}_{k}={\bm{U}}^{\top}{\bm{U}}^{*}{}\bm{b}^{*}_{k}, and 𝑮=𝑼𝑼𝑩\bm{G}={\bm{U}}^{\top}{\bm{U}}^{*}{}\bm{B}^{*}.

Using the SD\mathrm{SD} bound and the first part, 𝒈k𝒃k0.4δt𝒃k\|\bm{g}_{k}-\bm{b}_{k}\|\leq 0.4\delta_{t}\|\bm{b}^{*}_{k}\|.

Since 𝒙k𝒙k=𝑼𝒈k+(𝑰𝑼𝑼)𝒙k𝑼𝒃k=𝑼(𝒈k𝒃k)+(𝑰𝑼𝑼)𝒙k\bm{x}^{*}_{k}-\bm{x}_{k}={\bm{U}}\bm{g}_{k}+(\bm{I}-{\bm{U}}{\bm{U}}^{\top})\bm{x}^{*}_{k}-{\bm{U}}\bm{b}_{k}={\bm{U}}(\bm{g}_{k}-\bm{b}_{k})+(\bm{I}-{\bm{U}}{\bm{U}}^{\top})\bm{x}^{*}_{k}, using (3),

𝒙k𝒙k\displaystyle\|\bm{x}^{*}_{k}-\bm{x}_{k}\| 𝒈k𝒃k+(𝑰𝑼𝑼)𝑼𝒃k1.4δt𝒃k.\displaystyle\leq\|\bm{g}_{k}-\bm{b}_{k}\|+\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|\leq 1.4\delta_{t}\|\bm{b}^{*}_{k}\|.

𝑼𝑼𝒃k𝒃k=𝑼𝑼𝑼𝒃k𝑼𝒃k=𝑼𝒃k(𝑰𝑼𝑼)𝑼𝒃k𝑼𝒃k=𝒙k(𝑰𝑼𝑼)𝑼𝒃k𝒙k𝒙k𝒙k+(𝑰𝑼𝑼)𝑼𝒃k2.4δt𝒃k\|{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}\|=\|{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-{\bm{U}}^{*}{}\bm{b}^{*}_{k}\|=\|{\bm{U}}\bm{b}_{k}-(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{U}}\bm{b}_{k}-{\bm{U}}^{*}{}\bm{b}^{*}_{k}\|=\|\bm{x}_{k}-(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{U}}\bm{b}_{k}-\bm{x}^{*}_{k}\|\leq\|\bm{x}_{k}-\bm{x}^{*}_{k}\|+\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{U}}\bm{b}_{k}\|\leq 2.4\delta_{t}\|\bm{b}^{*}_{k}\|

Bounding 𝑮𝑩F\|\bm{G}-\bm{B}\|_{F} and 𝑿𝑿F\|{\bm{X}}^{*}-{\bm{X}}\|_{F}: Since k𝑴𝒃k2=𝑴𝑩F2𝑴F2𝑩2=𝑴F2σmax2\sum_{k}\|\bm{M}\bm{b}^{*}_{k}\|^{2}=\|\bm{M}\bm{B}^{*}\|_{F}^{2}\leq\|\bm{M}\|_{F}^{2}\|\bm{B}^{*}\|^{2}=\|\bm{M}\|_{F}^{2}{\sigma_{\max}^{*}}^{2}, we can use the first bound from (3) to conclude that

𝑮𝑩F2\displaystyle\|\bm{G}-\bm{B}\|_{F}^{2} =k𝒈k𝒃k2\displaystyle=\sum_{k}\|\bm{g}_{k}-\bm{b}_{k}\|^{2}
0.42k(𝑰𝑼𝑼)𝑼𝒃k2\displaystyle\leq 0.4^{2}\sum_{k}\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|^{2}
=0.42(𝑰𝑼𝑼)𝑼𝑩F20.42δt2σmax2\displaystyle=0.4^{2}\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{B}^{*}\|_{F}^{2}\leq 0.4^{2}\delta_{t}^{2}{\sigma_{\max}^{*}}^{2}

and, similarly,

𝑿𝑿F2\displaystyle\|{\bm{X}}^{*}-{\bm{X}}\|_{F}^{2} k𝒈k𝒃k2+k(𝑰𝑼𝑼)𝑼𝒃k2\displaystyle\leq\sum_{k}\|\bm{g}_{k}-\bm{b}_{k}\|^{2}+\sum_{k}\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|^{2}
(0.42+12)δt2σmax2\displaystyle\leq(0.4^{2}+1^{2})\delta_{t}^{2}{\sigma_{\max}^{*}}^{2}

Incoherence of 𝒃k\bm{b}_{k}’s: Using the bound on 𝒃k𝒈k\|\bm{b}_{k}-\bm{g}_{k}\|, and using 𝒈k𝒃k\|\bm{g}_{k}\|\leq\|\bm{b}^{*}_{k}\| and the right incoherence assumption,

𝒃k\displaystyle\|\bm{b}_{k}\| =(𝒃k𝒈k+𝒈k)(1+0.4δt)𝒃k1.04μσmaxr/q.\displaystyle=\|(\bm{b}_{k}-\bm{g}_{k}+\bm{g}_{k})\|\leq(1+0.4\delta_{t})\|\bm{b}^{*}_{k}\|\leq 1.04\mu{\sigma_{\max}^{*}}\sqrt{r/q}.

Lower and Upper Bounds on σi(𝑩)\sigma_{i}(\bm{B}): Using the bound on 𝑮𝑩F\|\bm{G}-\bm{B}\|_{F} and using SD(𝑼,𝑼)δt<c/κ\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})\leq\delta_{t}<c/\kappa,

σmin(𝑩)\displaystyle\sigma_{\min}(\bm{B}) σmin(𝑮)𝑮𝑩\displaystyle\geq\sigma_{\min}(\bm{G})-\|\bm{G}-\bm{B}\|
σmin(𝑼𝑼)σmin(𝑩)𝑮𝑩F\displaystyle\geq\sigma_{\min}({\bm{U}}^{\top}{\bm{U}}^{*}{})\sigma_{\min}(\bm{B}^{*})-\|\bm{G}-\bm{B}\|_{F}
1𝑼𝑼2σmin0.4δtσmax\displaystyle\geq\sqrt{1-\|{\bm{U}}^{*}{}_{\perp}{}^{\top}{\bm{U}}\|^{2}}{\sigma_{\min}^{*}}-0.4\delta_{t}{\sigma_{\max}^{*}}
1δt2σmin0.4δtσmax0.9σmin\displaystyle\geq\sqrt{1-\delta_{t}^{2}}{\sigma_{\min}^{*}}-0.4\delta_{t}{\sigma_{\max}^{*}}\geq 0.9{\sigma_{\min}^{*}}

since we assumed δtδ0<0.1/κ\delta_{t}\leq\delta_{0}<0.1/\kappa. Similarly,

𝑩=σmax(𝑩)\displaystyle\|\bm{B}\|=\sigma_{\max}(\bm{B}) σmax(𝑼𝑼)σmax(𝑩)+𝑮𝑩F\displaystyle\leq\sigma_{\max}({\bm{U}}^{\top}{\bm{U}}^{*}{})\sigma_{\max}(\bm{B}^{*})+\|\bm{G}-\bm{B}\|_{F}
σmax+0.4δtσmax1.1σmax\displaystyle\leq{\sigma_{\max}^{*}}+0.4\delta_{t}{\sigma_{\max}^{*}}\leq 1.1{\sigma_{\max}^{*}}

IV-E Proof of GD iterations’ lemmas: Proof of Lemma 3.3, first part

We bound 𝒈k𝒃k\|\bm{g}_{k}-\bm{b}_{k}\| here. Recall that 𝒈k=𝑼𝒙k\bm{g}_{k}={\bm{U}}^{\top}\bm{x}^{*}_{k}. Since 𝒚k=𝑨k𝒙k=𝑨k𝑼𝑼𝒙k+𝑨k(𝑰𝑼𝑼)𝒙k\bm{y}_{k}=\bm{A}_{k}\bm{x}^{*}_{k}=\bm{A}_{k}{\bm{U}}{\bm{U}}{}^{\top}\bm{x}^{*}_{k}+\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}, we have

𝒃k\displaystyle\bm{b}_{k} =(𝑼𝑨k𝑨k𝑼)1(𝑼𝑨k)𝑨k𝑼𝑼𝒙k\displaystyle=\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top})\bm{A}_{k}{\bm{U}}{\bm{U}}{}^{\top}\bm{x}^{*}_{k}
+(𝑼𝑨k𝑨k𝑼)1(𝑼𝑨k)𝑨k(𝑰𝑼𝑼)𝒙k,\displaystyle\qquad+\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top})\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k},
=(𝑼𝑨k𝑨k𝑼)1(𝑼𝑨k𝑨k𝑼)𝑼𝒙k\displaystyle=\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right){\bm{U}}{}^{\top}\bm{x}^{*}_{k}
+(𝑼𝑨k𝑨k𝑼)1(𝑼𝑨k)𝑨k(𝑰𝑼𝑼)𝒙k,\displaystyle\qquad+\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top})\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k},
=𝒈k+(𝑼𝑨k𝑨k𝑼)1(𝑼𝑨k)𝑨k(𝑰𝑼𝑼)𝒙k.\displaystyle=\bm{g}_{k}+\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top})\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}.

Thus,

𝒃k𝒈k\displaystyle\|\bm{b}_{k}-\bm{g}_{k}\| (𝑼𝑨k𝑨k𝑼)1\displaystyle\leq\|\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}\|
×𝑼𝑨k𝑨k(𝑰𝑼𝑼)𝒙k.\displaystyle\qquad\times~{}\|{\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\|. (19)
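As a quick numerical check of the decomposition derived above (a sketch with illustrative names, taking m ≥ r so that 𝑨k𝑼 has full column rank): the LS estimate (𝑨k𝑼)†𝒚k equals 𝒈k plus the term whose norm is bounded in (19).

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, m = 8, 2, 20
U, _ = np.linalg.qr(rng.standard_normal((n, r)))   # basis matrix with orthonormal columns
A = rng.standard_normal((m, n))                    # A_k
x_star = rng.standard_normal(n)                    # x*_k
y = A @ x_star                                     # y_k = A_k x*_k

b_ls, *_ = np.linalg.lstsq(A @ U, y, rcond=None)   # b_k = (A_k U)^dagger y_k
g = U.T @ x_star                                   # g_k = U^T x*_k
M = np.linalg.inv(U.T @ A.T @ A @ U) @ (U.T @ A.T)
b_decomp = g + M @ A @ (np.eye(n) - U @ U.T) @ x_star
assert np.allclose(b_ls, b_decomp)
print("LS estimate = g_k + (U^T A_k^T A_k U)^{-1} U^T A_k^T A_k (I - U U^T) x*_k")
```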

Using standard results from [26], one can show the following:

  1.

    W.p. 1qexp(rcm)\geq 1-q\exp\left(r-cm\right), for all k[q]k\in[q], min𝒘𝒮ri|𝒂ki𝑼𝒘|20.7m\min_{\bm{w}\in\mathcal{S}_{r}}\sum_{i}\big{|}\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{w}\big{|}^{2}\geq 0.7m and so

    (𝑼𝑨k𝑨k𝑼)1\displaystyle\|\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}\| =1σmin(𝑼𝑨k𝑨k𝑼)\displaystyle=\frac{1}{\sigma_{\min}\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)}
    =1min𝒘𝒮ri𝑼𝒂ki,𝒘2\displaystyle=\frac{1}{\min_{\bm{w}\in\mathcal{S}_{r}}\sum_{i}\langle{\bm{U}}^{\top}\bm{a}_{ki},\bm{w}\rangle^{2}}
    10.7m\displaystyle\leq\frac{1}{0.7m}
  2.

    W.p. at least 1qexp(rcm)1-q\exp(r-cm),  for all k[q]\text{ for all }k\in[q],

    𝑼𝑨k𝑨k(𝑰𝑼𝑼)𝒙k0.15m(𝑰𝑼𝑼)𝒙k\|{\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\|\leq 0.15m\|(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\|

Combining the above two bounds and (19), w.p. at least 12exp(logq+rcm)1-2\exp(\log q+r-cm),  for all k[q]\text{ for all }k\in[q],

𝒈k𝒃k0.4(𝑰n𝑼𝑼)𝑼𝒃k.\|\bm{g}_{k}-\bm{b}_{k}\|\leq 0.4\|\left(\bm{I}_{n}-{\bm{U}}{\bm{U}}^{\top}\right){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|.

This completes the proof. We explain next how to get the above two bounds.

The first bound above follows by a restatement of Theorem 4.6.1 of [26]. Or, it follows more directly by using 𝔼[i|𝒂ki𝑼𝒘|2]=m\mathbb{E}[\sum_{i}\big{|}\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{w}\big{|}^{2}]=m, applying the sub-exponential Bernstein inequality [29, Theorem 2.8.1] to bound the deviation from this mean, and then applying Proposition 4.9 with n1,rrn\equiv 1,r\equiv r (epsilon net argument).

The second bound is obtained as follows. Notice that

𝑼𝑨k𝑨k(𝑰𝑼𝑼)𝒙k\displaystyle\|{\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\|
=max𝒘𝒮r𝒘𝑼𝑨k𝑨k(𝑰𝑼𝑼)𝒙k\displaystyle\qquad=\max_{\bm{w}\in\mathcal{S}_{r}}\bm{w}{}^{\top}{\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}
=max𝒘𝒮ri(𝒂ki𝑼𝒘)(𝒂ki(𝑰𝑼𝑼)𝒙k)\displaystyle\qquad=\max_{\bm{w}\in\mathcal{S}_{r}}\sum_{i}(\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{w})(\bm{a}_{ki}{}^{\top}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k})

Clearly 𝔼[𝑼𝑨k𝑨k(𝑰𝑼𝑼)𝒙k]=𝑼(𝑰𝑼𝑼)𝒙k=0\mathbb{E}\left[{\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\right]={\bm{U}}{}^{\top}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}=0. Moreover, the summands are products of sub-Gaussian r.v.s and are thus sub-exponential. Also, the different summands are mutually independent and zero mean. Applying sub-exponential Bernstein with t=ϵ0m(𝑰𝑼𝑼)𝒙kt=\epsilon_{0}m\|(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\| for a fixed 𝒘𝒮r\bm{w}\in\mathcal{S}_{r},

|i(𝒂ki𝑼𝒘)(𝒂ki(𝑰𝑼𝑼)𝒙k)|ϵ0m(𝑰𝑼𝑼)𝒙k|\sum_{i}(\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{w})(\bm{a}_{ki}{}^{\top}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k})|\leq\epsilon_{0}m\|(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\|

w.p. at least 1exp(cϵ02m)1-\exp(-c\epsilon_{0}^{2}m). Setting ϵ0=0.1\epsilon_{0}=0.1, this implies that the above is bounded by 0.1m(𝑰𝑼𝑼)𝒙k0.1m\|(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\| w.p. at least 1exp(cm)1-\exp(-cm). By Proposition 4.8 with n1,rrn\equiv 1,r\equiv r, the above is bounded by 0.12m(𝑰𝑼𝑼)𝒙k0.12m\|(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\| for all 𝒘𝒮r\bm{w}\in\mathcal{S}_{r} w.p. at least 1exp(rcm)1-\exp(r-cm). Using a union bound over all qq columns, the bound holds for all qq columns w.p. at least 1qexp(rcm)1-q\exp(r-cm).
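
As an illustrative numerical check of this bound (it plays no role in the proof), the following minimal Python sketch, with hypothetical dimensions, forms \bm{b}_{k}=(\bm{A}_{k}{\bm{U}})^{\dagger}\bm{y}_{k} for a {\bm{U}} close to {\bm{U}}^{*}{} and compares \|\bm{b}_{k}-\bm{g}_{k}\| with \|(\bm{I}-{\bm{U}}{\bm{U}}^{\top})\bm{x}^{*}_{k}\|:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 200, 3, 60                                   # hypothetical sizes
U_star, _ = np.linalg.qr(rng.standard_normal((n, r)))  # true subspace U*
x_star = U_star @ rng.standard_normal(r)               # one column x*_k = U* b*_k
U, _ = np.linalg.qr(U_star + 0.05 * rng.standard_normal((n, r)))  # U close to U*

A = rng.standard_normal((m, n))                        # one Gaussian A_k
y = A @ x_star                                         # y_k = A_k x*_k
b = np.linalg.lstsq(A @ U, y, rcond=None)[0]           # b_k = (A_k U)^dagger y_k
g = U.T @ x_star                                       # g_k = U^T x*_k

lhs = np.linalg.norm(b - g)
rhs = np.linalg.norm(x_star - U @ (U.T @ x_star))      # ||(I - U U^T) x*_k||
print(lhs, 0.4 * rhs)  # typically lhs is well below 0.4 * rhs when m >> r
```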

IV-F Proof of Initialization lemmas/facts: Proof of Lemma 3.6

To see why (4) holds, it suffices to show that \mathbb{E}[({\bm{X}}_{0})_{k}|\alpha]=\bm{x}^{*}_{k}\beta_{k}(\alpha) for each k. The easiest way to see this is to express \bm{x}^{*}_{k}=\|\bm{x}^{*}_{k}\|{\bm{Q}}_{k}\bm{e}_{1}, where {\bm{Q}}_{k} is an n\times n unitary matrix with first column \bm{x}^{*}_{k}/\|\bm{x}^{*}_{k}\|, and to use the fact that \tilde{\bm{a}}_{ki}:={\bm{Q}}_{k}^{\top}\bm{a}_{ki} has the same distribution as \bm{a}_{ki}; both are {\cal{N}}(0,\bm{I}_{n}). Using {\bm{Q}}_{k}{\bm{Q}}_{k}^{\top}=\bm{I}, ({\bm{X}}_{0})_{k}=(1/m)\sum_{i}{\bm{Q}}_{k}{\bm{Q}}_{k}^{\top}\bm{a}_{ki}\bm{a}_{ki}^{\top}\|\bm{x}^{*}_{k}\|{\bm{Q}}_{k}\bm{e}_{1}\mathbbm{1}_{\{\|\bm{x}^{*}_{k}\||\bm{a}_{ki}^{\top}{\bm{Q}}_{k}\bm{e}_{1}|\leq\sqrt{\alpha}\}}=(1/m)\sum_{i}{\bm{Q}}_{k}\|\bm{x}^{*}_{k}\|\tilde{\bm{a}}_{ki}\tilde{\bm{a}}_{ki}(1)\mathbbm{1}_{\{|\tilde{\bm{a}}_{ki}(1)|\leq\sqrt{\alpha}/\|\bm{x}^{*}_{k}\|\}}. Thus \mathbb{E}[({\bm{X}}_{0})_{k}|\alpha]=(1/m)\,m\,{\bm{Q}}_{k}\|\bm{x}^{*}_{k}\|\bm{e}_{1}\mathbb{E}[\zeta^{2}\mathbbm{1}_{\{|\zeta|\leq\sqrt{\alpha}/\|\bm{x}^{*}_{k}\|\}}]=\bm{x}^{*}_{k}\beta_{k}(\alpha). This follows because \mathbb{E}[\bm{a}\,\bm{a}(1)\mathbbm{1}_{\{|\bm{a}(1)|<\beta\}}]=\bm{e}_{1}\mathbb{E}[\bm{a}(1)^{2}\mathbbm{1}_{\{|\bm{a}(1)|<\beta\}}] for \bm{a}\sim{\cal{N}}(0,\bm{I}_{n}).

Recall that C~=9κ2μ2\tilde{C}=9\kappa^{2}\mu^{2} and c~=c/C~\tilde{c}=c/\tilde{C} for a c<1c<1. Recall also that 𝑿=SVD𝑼𝚺𝑽{\bm{X}}^{*}\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}{\bm{\Sigma}^{*}}{\bm{V}^{*}} and 𝔼[𝑿0|α]=SVD𝑼𝚺ˇ𝑽ˇ\mathbb{E}[{\bm{X}}_{0}|\alpha]\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}\check{\bm{\Sigma}^{*}}\check{\bm{V}}. Thus, using (4), 𝚺ˇ=𝚺𝑽𝑫𝑽ˇ\check{\bm{\Sigma}^{*}}={\bm{\Sigma}^{*}}{\bm{V}^{*}}{\bm{D}}\check{\bm{V}}{}^{\top}. Hence,

σr(𝔼[𝑿0|α])\displaystyle\sigma_{r}(\mathbb{E}[{\bm{X}}_{0}|\alpha]) =σmin(𝚺ˇ)\displaystyle=\sigma_{\min}(\check{\bm{\Sigma}^{*}})
=σmin(𝚺𝑽𝑫𝑽ˇ)\displaystyle=\sigma_{\min}({\bm{\Sigma}^{*}}{\bm{V}^{*}}{\bm{D}}\check{\bm{V}}{}^{\top})
σmin(𝚺)σmin(𝑽)σmin(𝑫)σmin(𝑽ˇ)\displaystyle\geq\sigma_{\min}({\bm{\Sigma}^{*}})\sigma_{\min}({\bm{V}^{*}})\sigma_{\min}({\bm{D}})\sigma_{\min}(\check{\bm{V}}{}^{\top})
=σmin1(minkβk(α))1\displaystyle={\sigma_{\min}^{*}}\cdot 1\cdot(\min_{k}\beta_{k}(\alpha))\cdot 1

Also, \sigma_{r+1}(\mathbb{E}[{\bm{X}}_{0}|\alpha])=0 since it is a rank-r matrix. Thus, using Wedin's \sin\Theta theorem for the Frobenius norm subspace distance \mathrm{SD} [27, 28, Theorem 2.3.1, second row] (specified in Theorem 4.1 above), applied with \bm{M}\equiv{\bm{X}}_{0} and \bm{M}^{*}\equiv\mathbb{E}[{\bm{X}}_{0}|\alpha], we get (2).

IV-G Proof of Initialization lemmas and facts: Proof of Lemma 3.8

Proof of first part of Lemma 3.8.

The proof involves an application of the sub-Gaussian Hoeffding inequality, Theorem 2.6.2 of [26], followed by an epsilon-net argument. The application of sub-Gaussian Hoeffding uses conditioning on \alpha, for \alpha\in\mathcal{E}. For \alpha\in\mathcal{E}, \sqrt{\alpha}\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}/\sqrt{q}, and this helps get a simple probability bound. Since \alpha is independent of all the \bm{a}_{ki},\bm{y}_{ki}'s used in defining {\bm{X}}_{0}, the conditioning does not change anything else in our proof. For example, the different summands are mutually independent even conditioned on it.

We have,

𝑿0𝔼[𝑿0|α]=max𝒛𝒮n,𝒘𝒮q𝑿0𝔼[𝑿0|α],𝒛𝒘.\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\|=\max_{\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}}\langle{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha],~{}\bm{z}\bm{w}{}^{\top}\rangle.

For a fixed 𝒛𝒮n,𝒘𝒮q\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}, we have

𝑿0𝔼[𝑿0|α],𝒛𝒘\displaystyle\langle{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha],~{}\bm{z}\bm{w}{}^{\top}\rangle
=1mki𝒘(k)𝒚ki(𝒂ki𝒛)𝟙{|𝒚ki|2α}\displaystyle\qquad=\frac{1}{m}\sum_{ki}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\mathbbm{1}_{\{|\bm{y}_{ki}|^{2}\leq\alpha\}}
𝔼[𝒘(k)𝒚ki(𝒂ki𝒛)𝟙{|𝒚ki|2α}].\displaystyle\qquad-\mathbb{E}\left[\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\mathbbm{1}_{\{|\bm{y}_{ki}|^{2}\leq\alpha\}}\right].

The summands are mutually independent, zero mean sub-Gaussian r.v.s with sub-Gaussian norm K_{ki}\leq C|\bm{w}(k)|\sqrt{\alpha}/m. For \alpha\in\mathcal{E}, \sqrt{\alpha}\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}/\sqrt{q}, and so K_{ki}\leq C|\bm{w}(k)|\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}/(m\sqrt{q}). Let t=\epsilon_{1}\|{\bm{X}}^{*}\|_{F}. Then, for any \alpha\in\mathcal{E},

t2kiKki2ϵ12𝑿F2kiC~(1+ϵ1)𝒘(k)2𝑿F2/m2qϵ12mqCμ2κ2\frac{t^{2}}{\sum_{ki}K_{ki}^{2}}\geq\frac{\epsilon_{1}^{2}\|{\bm{X}}^{*}\|_{F}^{2}}{\sum_{ki}\tilde{C}(1+\epsilon_{1})\bm{w}(k)^{2}\|{\bm{X}}^{*}\|_{F}^{2}/m^{2}q}\geq\frac{\epsilon_{1}^{2}mq}{C\mu^{2}\kappa^{2}}

since k𝒘(k)2=𝒘2=1\sum_{k}\bm{w}(k)^{2}=\|\bm{w}\|^{2}=1. Thus, for a fixed 𝒛𝒮n,𝒘𝒮q\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}, by sub-Gaussian Hoeffding, we conclude that, conditioned on α\alpha, for any α\alpha\in\mathcal{E}, w.p. at least 1exp[cϵ12mq/μ2κ2]1-\exp\left[-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right],

𝑿0𝔼[𝑿0|α],𝒛𝒘Cϵ1𝑿F.\langle{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha],~{}\bm{z}\bm{w}{}^{\top}\rangle\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

The rest of the proof follows by a standard epsilon net argument summarized in Proposition 4.7. Applying it, conditioned on α\alpha, for any α\alpha\in\mathcal{E}, w.p. at least 1exp[(n+q)cϵ12mq/μ2κ2]1-\exp\left[(n+q)-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right], max𝒛𝒮n,𝒘𝒮q𝑿0𝔼[𝑿0|α],𝒛𝒘1.4Cϵ1𝑿F.\max_{\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}}\langle{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha],~{}\bm{z}\bm{w}{}^{\top}\rangle\leq 1.4C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

Proof of second part of Lemma 3.8.

We have

(𝑿0𝔼[𝑿0|α])𝑼F=max𝑾𝒮qr𝑾,(𝑿0𝔼[𝑿0|α])𝑼\|\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\|_{F}=\max_{{\bm{W}}\in\mathcal{S}_{qr}}\langle{\bm{W}},\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\rangle

For a fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr},

𝑾,(𝑿0𝔼[𝑿0|α])𝑼\displaystyle\langle{\bm{W}},\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\rangle
=trace(𝑾(𝑿0𝔼[𝑿0|α])𝑼)\displaystyle\qquad=\mathrm{trace}\left({\bm{W}}{}^{\top}\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\right)
=1mki(𝒚ki(𝒂ki𝑼𝒘k)𝟙{|𝒚ki|2α}𝔼[.])\displaystyle\qquad=\frac{1}{m}\sum_{ki}\left(\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\alpha\right\}}-\mathbb{E}[.]\right)

Conditioned on α\alpha, for an α\alpha\in\mathcal{E}, the summands are independent zero mean sub-Gaussian r.v.s with subGaussian norm Kkiα𝒘k/mC~(1+ϵ1)𝑿F𝒘k/mqK_{ki}\leq\sqrt{\alpha}\|\bm{w}_{k}\|/m\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}\|\bm{w}_{k}\|/m\sqrt{q}. Thus,

kiKki2mC~(1+ϵ1)𝑾F2𝑿F2/m2q=C~𝑿F2/mq\sum_{ki}K_{ki}^{2}\leq m\tilde{C}(1+\epsilon_{1})\|{\bm{W}}\|_{F}^{2}\|{\bm{X}}^{*}\|_{F}^{2}/m^{2}q=\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}/mq

Applying the sub-Gaussian Hoeffding inequality Theorem 2.6.2 of [26], for a fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr}, conditioned on α\alpha, for an α\alpha\in\mathcal{E}, w.p. 1exp[ϵ12mq/Cμ2κ2]1-\exp\left[-\epsilon_{1}^{2}mq/C\mu^{2}\kappa^{2}\right],

trace(𝑾(𝑿0𝔼[𝑿0|α])𝑼)ϵ1𝑿F.\mathrm{trace}\left({\bm{W}}{}^{\top}\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\right)\leq\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

The rest of the proof follows by a standard epsilon net argument summarized in Proposition 4.8. Applying Proposition 4.8, conditioned on α\alpha, for an α\alpha\in\mathcal{E}, w.p. at least 1exp[qrcϵ12mq/μ2κ2]1-\exp\left[qr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right], max𝑾𝒮qrtrace(𝑾(𝑿0𝔼[𝑿0|α])𝑼)<1.2ϵ1𝑿F\max_{{\bm{W}}\in\mathcal{S}_{qr}}\mathrm{trace}\left({\bm{W}}{}^{\top}\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\right)<1.2\epsilon_{1}\|{\bm{X}}^{*}\|_{F}. ∎

Proof of third part of Lemma 3.8.

We have

(𝑿0𝔼[𝑿0|α])𝑽ˇF=max𝑾𝒮nr(𝑿0𝔼[𝑿0|α])𝑽ˇ,𝑾.\|\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top}\|_{F}=\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle.

For a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr} we have,

(𝑿0𝔼[𝑿0|α])𝑽ˇ,𝑾\displaystyle\langle\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle
=1mki(𝒚ki(𝒂ki𝑾𝒗ˇk)𝟙{|𝒚ki|2α}𝔼[.])\displaystyle\qquad=\frac{1}{m}\sum_{ki}\left(\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\alpha\right\}}-\mathbb{E}[.]\right)

where \mathbb{E}[.] is the expected value of the first term. Conditioned on \alpha, for an \alpha\in\mathcal{E}, the summands are independent, zero mean, sub-Gaussian r.v.s with sub-Gaussian norm K_{ki}\leq C\sqrt{\alpha}\|{\bm{W}}\check{\bm{v}}_{k}\|/m\leq C\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}\|{\bm{W}}\check{\bm{v}}_{k}\|/(m\sqrt{q}). Let t=\epsilon_{1}\|{\bm{X}}^{*}\|_{F}. Using \check{\bm{V}}\check{\bm{V}}{}^{\top}=\bm{I} (the rows of \check{\bm{V}} are orthonormal right singular vectors of \mathbb{E}[{\bm{X}}_{0}|\alpha]), and thus \|{\bm{W}}\check{\bm{V}}\|_{F}^{2}=1,

\frac{t^{2}}{\sum_{ki}K_{ki}^{2}}\geq\frac{m^{2}\epsilon_{1}^{2}\|{\bm{X}}^{*}\|_{F}^{2}}{\sum_{ki}\tilde{C}(1+\epsilon_{1})\|{\bm{X}}^{*}\|_{F}^{2}\|{\bm{W}}\check{\bm{v}}_{k}\|^{2}/q}=\frac{mq\epsilon_{1}^{2}}{C\mu^{2}\kappa^{2}}.

Thus, by applying the sub-Gaussian Hoeffding inequality, Theorem 2.6.2 of [26], with this t, conditioned on \alpha, for an \alpha\in\mathcal{E}, w.p. at least 1-\exp\left[-c\epsilon_{1}^{2}mq/(\mu^{2}\kappa^{2})\right],

\langle\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

Applying Proposition 4.8, conditioned on α\alpha, for an α\alpha\in\mathcal{E}, w.p. at least 1exp[nrcϵ12mq/(μ2κ2)]1-\exp\left[nr-c\epsilon_{1}^{2}mq/(\mu^{2}\kappa^{2})\right], max𝑾𝒮nr(𝑿0𝔼[𝑿0|α])𝑽ˇ,𝑾1.2Cϵ1𝑿F.\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle\leq 1.2C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

IV-H Proof of Initialization lemmas and facts: Proof of Facts

Proof of Fact 3.7.

Apply sub-exponential Bernstein. ∎

Proof of Fact 3.9.

Let \gamma_{k}=\frac{\sqrt{\tilde{C}(1-\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}}{\sqrt{q}\|\bm{x}^{*}_{k}\|}. Since \tilde{C}=9\mu^{2}\kappa^{2} and \|\bm{x}^{*}_{k}\|^{2}\leq\mu^{2}\kappa^{2}\|{\bm{X}}^{*}\|_{F}^{2}/q (Assumption 1.1), it follows that

γk3.\gamma_{k}\geq 3.

Now,

\displaystyle\mathbb{E}\left[\zeta^{2}\mathbbm{1}_{\left\{|\zeta|\leq\gamma_{k}\right\}}\right]=\displaystyle 1-\mathbb{E}\left[\zeta^{2}\mathbbm{1}_{\left\{|\zeta|\geq\gamma_{k}\right\}}\right]
\displaystyle\geq\displaystyle 1-\frac{2}{\sqrt{2\pi}}\int_{3}^{\infty}z^{2}\exp(-z^{2}/2)dz
\displaystyle\geq\displaystyle 1-\frac{6e^{-9/4}}{\sqrt{2\pi}}\int_{3}^{\infty}z\exp(-z^{2}/4)dz
\displaystyle=1-\frac{12e^{-9/2}}{\sqrt{2\pi}}\geq 0.94\geq 0.92.

The first inequality used \gamma_{k}\geq 3. The second used the fact that z\exp(-z^{2}/4)\leq 3e^{-9/4} for all z\geq 3 (the map z\mapsto z\exp(-z^{2}/4) is decreasing for z\geq\sqrt{2}). The last equality used \int_{3}^{\infty}z\exp(-z^{2}/4)dz=2e^{-9/4}. ∎
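
As a quick numerical sanity check of this fact (it is not part of the proof), the following Python sketch evaluates \mathbb{E}[\zeta^{2}\mathbbm{1}_{\{|\zeta|\leq\gamma\}}] at \gamma=3 both in closed form and by Monte Carlo; the exact value is about 0.97, comfortably above the 0.92 used above:

```python
import numpy as np
from scipy.stats import norm

gamma = 3.0
# Closed form via integration by parts:
# E[z^2 1{|z| <= g}] = 1 - 2*(g*phi(g) + Q(g)), phi = N(0,1) pdf, Q = upper tail prob.
exact = 1 - 2 * (gamma * norm.pdf(gamma) + norm.sf(gamma))

rng = np.random.default_rng(0)
z = rng.standard_normal(10**7)
mc = np.mean(z**2 * (np.abs(z) <= gamma))

print(exact, mc)  # both are approximately 0.9707 >= 0.92
```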

In all the proofs above, notice that the only property of \check{\bm{V}} that we used is that its rows are (orthonormal) right singular vectors, so that \check{\bm{V}}\check{\bm{V}}^{\top}=\bm{I} and hence \sigma_{r}(\check{\bm{V}})=\sigma_{1}(\check{\bm{V}})=1. We never required incoherence for it.

Algorithm 2 The AltGD-Min-LRPR algorithm.
1:Input: 𝒚(mag)k,𝑨k,k[q]{\bm{y}_{(mag)}}_{k},\bm{A}_{k},k\in[q]
2:Parameters: GD step size, η\eta; Number of iterations, TT
3:Sample-split: Partition the measurements and measurement matrices into 2T+12T+1 equal-sized disjoint sets: one set for initialization and 2T2T sets for the iterations. Denote these by 𝒚(mag)k(τ),𝑨k(τ),τ=0,1,2T{{\bm{y}_{(mag)}}_{k}}^{(\tau)},\bm{A}_{k}^{(\tau)},\tau=0,1,\dots 2T.
4:Initialization:
5:Compute 𝑼0{\bm{U}}_{0} as the top rr singular vectors of 𝒀U:=1mqki(𝒚(mag)ki)2𝒂ki𝒂ki𝟙{(𝒚(mag)ki)2C~1mqki(𝒚(mag)ki)2}\bm{Y}_{U}:=\frac{1}{mq}\sum_{ki}({\bm{y}_{(mag)}}_{ki})^{2}\bm{a}_{ki}\bm{a}_{ki}^{\top}\mathbbm{1}_{\left\{({\bm{y}_{(mag)}}_{ki})^{2}\leq\tilde{C}\frac{1}{mq}\sum_{ki}({\bm{y}_{(mag)}}_{ki})^{2}\right\}}. with 𝒚(mag)ki𝒚(mag)ki(0),𝒂ki𝒂ki(0){\bm{y}_{(mag)}}_{ki}\equiv{{\bm{y}_{(mag)}}_{ki}}^{(0)},\bm{a}_{ki}\equiv\bm{a}_{ki}^{(0)}.
6:GDmin Iterations:
7:for t=1t=1 to TT do
8:     Let 𝑼𝑼t1{\bm{U}}\leftarrow{\bm{U}}_{t-1}.
9:     Update bk,xk\bm{b}_{k},\bm{x}_{k}: For each k[q]k\in[q], set (𝒃k)tRWF(𝒚(mag)k(t),(𝑼𝑨k(t)),TRWF,t)(\bm{b}_{k})_{t}\leftarrow\mathrm{RWF}({{\bm{y}_{(mag)}}_{k}}^{(t)},({\bm{U}}^{\top}\bm{A}_{k}^{(t)}),T_{RWF,t}). Set (𝒙k)t𝑼(𝒃k)t(\bm{x}_{k})_{t}\leftarrow{\bm{U}}(\bm{b}_{k})_{t}
10:     Estimate gradient w.r.t. U{\bm{U}}: With 𝒚(mag)ki𝒚(mag)ki(T+t),𝒂ki𝒂ki(T+t){\bm{y}_{(mag)}}_{ki}\equiv{{\bm{y}_{(mag)}}_{ki}}^{(T+t)},\bm{a}_{ki}\equiv\bm{a}_{ki}^{(T+t)},
  • compute 𝒚^ki:=𝒚(mag)ki𝒄^ki\hat{\bm{y}}_{ki}:={\bm{y}_{(mag)}}_{ki}\hat{\bm{c}}_{ki} with 𝒄^ki=phase(𝒂ki𝒙k)\hat{\bm{c}}_{ki}=phase(\bm{a}_{ki}{}^{\top}\bm{x}_{k}) and

  • compute \widehat{\mathrm{GradU}}=\sum_{ki}(\hat{\bm{y}}_{ki}-\bm{a}_{ki}{}^{\top}(\bm{x}_{k})_{t})\bm{a}_{ki}(\bm{b}_{k})_{t}{}^{\top}

11:     Set 𝑼^+𝑼(η/m)GradU^\displaystyle\hat{\bm{U}}^{+}\leftarrow{\bm{U}}-(\eta/m)\widehat{\mathrm{GradU}}
12:     Orthonormalize to get the new {\bm{U}}: Compute \hat{\bm{U}}^{+}\overset{\mathrm{QR}}{=}{\bm{U}}^{+}{\bm{R}}^{+}. Set {\bm{U}}_{t}\leftarrow{\bm{U}}^{+}.
13:end for

V Extension to Low Rank Phase Retrieval (LRPR)

In LRPR, recall that we measure {\bm{y}_{(mag)}}_{k}=|\bm{A}_{k}\bm{x}^{*}_{k}|. This problem commonly occurs in dynamic phaseless imaging applications such as Fourier ptychography. Because of the magnitude-only measurements, we can recover each column only up to a global phase uncertainty. We use \mathrm{dist}(\bm{x}^{*},\bm{x}):=\min_{\theta\in[-\pi,\pi]}\|\bm{x}^{*}-e^{-j\theta}\bm{x}\| to quantify this phase-invariant distance [30, 21]. Also, for a complex number z, we use \bar{z} to denote its conjugate and phase(z):=z/|z| to denote its phase.
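
Since this distance is used throughout this section, the following minimal Python sketch (an illustration, not the paper's code) computes it using the standard closed form: the minimizing \theta aligns the phase of \langle\bm{x}^{*},\bm{x}\rangle, so \mathrm{dist}^{2}(\bm{x}^{*},\bm{x})=\|\bm{x}^{*}\|^{2}+\|\bm{x}\|^{2}-2|\langle\bm{x}^{*},\bm{x}\rangle|.

```python
import numpy as np

def dist(x_star, x):
    """Phase-invariant distance min_theta ||x_star - exp(-1j*theta) * x||."""
    inner = np.vdot(x, x_star)          # x^H x_star
    d2 = (np.linalg.norm(x_star) ** 2 + np.linalg.norm(x) ** 2
          - 2 * np.abs(inner))
    return np.sqrt(max(d2, 0.0))        # clip tiny negative rounding errors
```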

V-A AltGD-Min-LRPR algorithm

With three simple changes that we explain next, the AltGD-Min approach also solves LRPR and provides the fastest existing solution for it. First, observe that because of the magnitude-only measurements, we cannot use 𝑿0{\bm{X}}_{0} with 𝒚ki\bm{y}_{ki} replaced by 𝒚(mag)ki{\bm{y}_{(mag)}}_{ki} for initialization. The reason is 𝔼[𝒂ki𝒚(mag)ki]=0\mathbb{E}[\bm{a}_{ki}{\bm{y}_{(mag)}}_{ki}]=0 and so 𝔼[𝒂ki𝒚(mag)ki𝟙𝒚(mag)kiα]=0\mathbb{E}[\bm{a}_{ki}{\bm{y}_{(mag)}}_{ki}\mathbbm{1}_{{\bm{y}_{(mag)}}_{ki}\leq\sqrt{\alpha}}]=0 too. In fact, because of this, it is not even possible to define a different matrix 𝑿{\bm{X}} whose expected value can be shown to be close to 𝑿{\bm{X}}^{*}. Instead, we have to use the initialization approach of [5]. This is given in line 5 of Algorithm 2. The matrix 𝒀U\bm{Y}_{U} is such that its expected value is close to 𝑿𝑿+c𝑰{\bm{X}}^{*}{\bm{X}}^{*}{}^{\top}+c\bm{I}. This fact is used to argue that its top rr singular vectors span a subspace that is close to that spanned by columns of 𝑼{\bm{U}}^{*}{}.

Next, consider the GDmin iterations. We use the following idea to deal with the magnitude-only measurements: 𝒚(mag)ki:=|𝒚ki|{\bm{y}_{(mag)}}_{ki}:=|\bm{y}_{ki}|. Let 𝒄ki:=phase(𝒂ki𝒙k)\bm{c}_{ki}:=\mathrm{phase}(\bm{a}_{ki}{}^{\top}\bm{x}^{*}_{k}). Then, clearly,

𝒚ki=𝒄ki𝒚(mag)ki\bm{y}_{ki}=\bm{c}_{ki}{\bm{y}_{(mag)}}_{ki}

and 𝒚(mag)ki=𝒄¯ki𝒚ki{\bm{y}_{(mag)}}_{ki}=\bar{\bm{c}}_{ki}\bm{y}_{ki}. We do not observe 𝒄ki\bm{c}_{ki}, but we can estimate it using 𝒙k\bm{x}_{k} which is an estimate of 𝒙k\bm{x}^{*}_{k}. Using the estimated phase, we can get an estimate 𝒚^ki\hat{\bm{y}}_{ki} of 𝒚ki\bm{y}_{ki}. We replace 𝑼f(𝑼,𝑩)\nabla_{\bm{U}}f({\bm{U}},\bm{B}) by its estimate which uses 𝒚^ki=𝒚(mag)ki𝒄^ki\hat{\bm{y}}_{ki}={\bm{y}_{(mag)}}_{ki}\hat{\bm{c}}_{ki}, with 𝒄^ki=phase(𝒂ki𝒙k)\hat{\bm{c}}_{ki}=phase(\bm{a}_{ki}{}^{\top}\bm{x}_{k}), to replace 𝒚ki\bm{y}_{ki}. See line 10 of Algorithm 2.

Lastly, because of the magnitude-only measurements, the update step for updating 𝒃k\bm{b}_{k}s is no longer an LS problem. We now need to solve an rr-dimensional standard PR problem: min𝒃𝒚(mag)k|𝑨k𝑼𝒃|2\min_{\bm{b}}\|{\bm{y}_{(mag)}}_{k}-|\bm{A}_{k}{\bm{U}}\bm{b}|\|^{2}. This can be solved using any of the order-optimal algorithms for standard PR, e.g., Truncated Wirtinger Flow (TWF) [20] or Reshaped WF (RWF) [21]. For concreteness, we assume that RWF is used. We should point out here that we only need to run TRWF,tT_{RWF,t} iterations of RWF at outer loop iteration tt, with TRWF,tT_{RWF,t} set below in our theorem (we set this to ensure that the error level of this step is of order δt\delta_{t}). The entire algorithm, AltGD-Min-LRPR, is summarized in Algorithm 2.
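
To make the overall procedure concrete, the following is a minimal, single-node Python sketch of Algorithm 2 under several simplifications that are ours, not the paper's: real-valued data, no sample-splitting, a heuristic step size, and a simple “estimate sign, then least squares” inner loop standing in for RWF [21]. Variable names, loop counts, and the data layout are illustrative.

```python
import numpy as np

def altgdmin_lrpr(y_mag, A, r, T=50, T_inner=10, C_tilde=9.0, eta_scale=0.9):
    """y_mag: (q, m) magnitudes |A_k x*_k|; A: (q, m, n) Gaussian matrices."""
    q, m, n = A.shape
    # Truncated spectral initialization (line 5 of Algorithm 2)
    alpha = C_tilde * np.mean(y_mag ** 2)
    Y_U = np.zeros((n, n))
    for k in range(q):
        w = (y_mag[k] ** 2) * (y_mag[k] ** 2 <= alpha)   # truncated weights
        Y_U += (A[k] * w[:, None]).T @ A[k]
    Y_U /= (m * q)
    _, eigvec = np.linalg.eigh(Y_U)
    U = eigvec[:, -r:]                                   # top-r eigenvectors
    # GDmin iterations
    for _ in range(T):
        B, X = np.zeros((r, q)), np.zeros((n, q))
        for k in range(q):
            M = A[k] @ U                                 # m x r sensing matrix for b_k
            b = np.linalg.lstsq(M, y_mag[k], rcond=None)[0]
            for _ in range(T_inner):                     # sign-then-LS stand-in for RWF
                b = np.linalg.lstsq(M, np.sign(M @ b) * y_mag[k], rcond=None)[0]
            B[:, k], X[:, k] = b, U @ b
        eta = eta_scale / (np.linalg.norm(B, 2) ** 2)    # heuristic proxy for c/sigma_max*^2
        Grad = np.zeros((n, r))
        for k in range(q):
            Ax = A[k] @ X[:, k]
            resid = Ax - np.sign(Ax) * y_mag[k]          # a_ki^T x_k - yhat_ki
            Grad += A[k].T @ resid[:, None] @ B[:, k][None, :]
        U, _ = np.linalg.qr(U - (eta / m) * Grad)        # GD step + re-orthonormalize
    return U, B, X
```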

V-B Main Result

We can prove the following result with simple changes to the proof of Theorem 2.1.

Theorem 5.1.

Consider Algorithm 2. Set η=c/σmax2\eta=c/{\sigma_{\max}^{*}}^{2}, C~=9κ2μ2\tilde{C}=9\kappa^{2}\mu^{2}, T=Cκ2log(1/ϵ)T=C\kappa^{2}\log(1/\epsilon), and TRWF,t=C(t+clogr)T_{RWF,t}=C(t+c\log r). Assume that Assumption 1.1 holds. If

mqCκ6μ2(n+q)r2(r+log(1/ϵ)logκ)mq\geq C\kappa^{6}\mu^{2}(n+q)r^{2}(r+\log(1/\epsilon)\log\kappa)

and mCmax(logq,logn)log(1/ϵ)m\geq C\max(\log q,\log n)\log(1/\epsilon), then, w.p. 1n101-n^{-10}, SD(𝐔,𝐔T)ϵ\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{T})\leq\epsilon, dist((𝐱k)T,𝐱k)ϵ𝐱k\mathrm{dist}((\bm{x}_{k})_{T},\bm{x}^{*}_{k})\leq\epsilon\|\bm{x}^{*}_{k}\| for all k[q]k\in[q], and kdist2((𝐱k)T,𝐱k)ϵ2σmax2.\sum_{k}\mathrm{dist}^{2}((\bm{x}_{k})_{T},\bm{x}^{*}_{k})\leq\epsilon^{2}{\sigma_{\max}^{*}}^{2}.

We prove this result in Sec. V-C. Notice that the \log(1/\epsilon) in the sample complexity of Theorem 2.1 is now replaced by (r+\log(1/\epsilon)). The reason is the different initialization approach, which needs nr^{3} samples instead of nr^{2}. This is needed because PR is a more difficult problem: we cannot define a matrix {\bm{X}}_{0} for it for which \mathbb{E}[{\bm{X}}_{0}] is close to {\bm{X}}^{*}.

Observe that AltGD-Min-LRPR has the same sample complexity as that for the AltMin solution from [6]. But its time complexity is better by a factor of log(1/ϵ)\log(1/\epsilon) making it the fastest solution for LRPR. Also, we should mention here that, for solutions to the two related problems – sparse PR (phaseless but global measurements) and LRMC (linear but non-global measurements) – that have been extensively studied for nearly a decade, the best sample complexity guarantees for iterative (and hence fast) algorithms are sub-optimal. The best sparse PR guarantee [31] requires mm to be of order s2s^{2} for the initialization step. Here ss is the sparsity level. LRPR has both phaseless and non-global measurements. This is why its initialization step needs two extra factors of rr compared to the optimal. Once initialized close enough to the true solution, it is well known that a PR problem behaves like a linear one. This is true for AltGD-Min-LRPR too.

Consider a comparison with using a standard PR approach to recover each column of {\bm{X}}^{*} individually. If TWF [20] or RWF [21] were used for this, it would require m\gtrsim n. In comparison, ignoring log factors, our solution for LRPR needs m\gtrsim(n/q)r^{3}. Thus, the use of AltGD-Min is a better idea when the rank r of the matrix {\bm{X}}^{*} is small enough so that q\gtrsim r^{3}.
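
For instance, with the values used in the LRPR experiment of Sec. VII, n=600, q=1000, r=4, and ignoring constants and log factors, the column-wise approach needs m of order n=600 per column, whereas (n/q)r^{3}=(600/1000)\cdot 64\approx 38.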

V-C Proof of Theorem 5.1

For the initialization, we use the bound from [5].

Lemma 5.2 ([5]).

Let SD2(𝐔0,𝐔)=(𝐈𝐔𝐔)𝐔0\mathrm{SD}_{2}({\bm{U}}_{0},{\bm{U}}^{*}{})=\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}){\bm{U}}_{0}\|. Pick a δinit<0.1\delta_{\mathrm{init}}<0.1. Then, w.p. at least 12exp(n(log17)cδinit2mqκ4r2)2exp(cδinit2mqκ4μ2r2)1-2\exp\left(n(\log 17)-c\frac{\delta_{\mathrm{init}}^{2}mq}{\kappa^{4}r^{2}}\right)-2\exp\left(-c\frac{\delta_{\mathrm{init}}^{2}mq}{\kappa^{4}\mu^{2}r^{2}}\right),

SD2(𝑼0,𝑼)δinit and so SD(𝑼0,𝑼)rδinit.\mathrm{SD}_{2}({\bm{U}}_{0},{\bm{U}}^{*}{})\leq\delta_{\mathrm{init}}\text{ and so }\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\leq\sqrt{r}\delta_{\mathrm{init}}.

For the iterations, without loss of generality, as also done in past works on PR, e.g., [30, 20, 21, 6], to make things simpler, we assume that, for each kk, 𝒙k\bm{x}^{*}_{k} is replaced by z¯𝒙k\bar{z}\bm{x}^{*}_{k} where z=phase(𝒙k,𝒙k)z=\mathrm{phase}(\langle\bm{x}^{*}_{k},\bm{x}_{k}\rangle). With this, dist(𝒙k,𝒙k)=𝒙k𝒙k\mathrm{dist}(\bm{x}^{*}_{k},\bm{x}_{k})=\|\bm{x}^{*}_{k}-\bm{x}_{k}\|.

We modify Lemma 3.4 using the following idea. Let 𝑼=𝑼t{\bm{U}}={\bm{U}}_{t} and 𝑩=𝑩t\bm{B}=\bm{B}_{t}. For LRPR, the GD step uses an approximate gradient w.r.t. the old cost function f(𝑼,𝑩)f({\bm{U}},\bm{B}). Let

Err:=GradU^GradU.\displaystyle\mathrm{Err}:=\widehat{\mathrm{GradU}}-\mathrm{GradU}.

Here \widehat{\mathrm{GradU}}=\sum_{ki}(\hat{\bm{y}}_{ki}-\bm{a}_{ki}{}^{\top}\bm{x}_{k})\bm{a}_{ki}\bm{b}_{k}{}^{\top} and \mathrm{GradU}=\nabla_{\bm{U}}f({\bm{U}},\bm{B})=\sum_{ki}(\bm{y}_{ki}-\bm{a}_{ki}{}^{\top}\bm{x}_{k})\bm{a}_{ki}\bm{b}_{k}{}^{\top} is the same as earlier. Thus,

Err\displaystyle\mathrm{Err} =ki(𝒚^ki𝒚ki)𝒂ki𝒃k\displaystyle=\sum_{ki}(\hat{\bm{y}}_{ki}-\bm{y}_{ki})\bm{a}_{ki}\bm{b}_{k}{}^{\top}
=ki(𝒄^ki𝒄ki)|𝒂ki𝒙k|𝒂ki𝒃k\displaystyle=\sum_{ki}(\hat{\bm{c}}_{ki}-\bm{c}_{ki})|\bm{a}_{ki}^{\top}\bm{x}^{*}_{k}|\bm{a}_{ki}\bm{b}_{k}{}^{\top}
=ki(𝒄^ki𝒄¯ki1)(𝒂ki𝒙k)𝒂ki𝒃k\displaystyle=\sum_{ki}(\hat{\bm{c}}_{ki}\bar{\bm{c}}_{ki}-1)(\bm{a}_{ki}^{\top}\bm{x}^{*}_{k})\bm{a}_{ki}\bm{b}_{k}{}^{\top}

Proceeding as in the proof of Lemma 3.4, and using (𝑰𝑼𝑼)ErrFErrF\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\mathrm{Err}\|_{F}\leq\|\mathrm{Err}\|_{F} and ErrErrF\|\mathrm{Err}\|\leq\|\mathrm{Err}\|_{F}, we can conclude the following

SD(𝑼,𝑼+)\displaystyle\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}^{+})\leq
\displaystyle\frac{\|(\bm{I}-(\eta/m)\mathrm{Hess})\|\cdot\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}})+(\eta/m)\|\mathrm{Term2}\|_{F}+(\eta/m)\|\mathrm{Err}\|_{F}}{1-(\eta/m)\|\mathrm{GradU}\|-(\eta/m)\|\mathrm{Err}\|_{F}}

where the expressions for \mathrm{GradU},\mathrm{Term2},\mathrm{Hess} are the same as before with one change: \bm{b}_{k} is now obtained by solving a noisy r-dimensional PR problem (instead of an LS problem) using RWF [21]. Thus, to complete the proof, (i) we need to bound

ErrF=max𝑾𝒮nrki(𝒄^ki𝒄¯ki1)(𝒂ki𝒙k)(𝒂ki𝑾𝒃k)\|\mathrm{Err}\|_{F}=\max_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}(\hat{\bm{c}}_{ki}\bar{\bm{c}}_{ki}-1)(\bm{a}_{ki}^{\top}\bm{x}^{*}_{k})(\bm{a}_{ki}^{\top}{\bm{W}}\bm{b}_{k})

and (ii) we need bounds on the three other terms that were also bounded earlier for the linear case.

The term \|\mathrm{Err}\|_{F} is bounded in Lemma 4 of [6]. We repeat the lemma below.

Lemma 5.3.

Assume that SD(𝐔t,𝐔)δt\mathrm{SD}({\bm{U}}_{t},{\bm{U}}^{*}{})\leq\delta_{t} with δt<c/κ2\delta_{t}<c/\kappa^{2}. Then, w.p. at least 12exp(nrlog(17)cmqϵ22μ2κr)exp(logq+rcm)1-2\exp\left(nr\log(17)-c\frac{mq\epsilon_{2}^{2}}{\mu^{2}\kappa r}\right)-\exp(\log q+r-cm),

ErrFCm(ϵ2+δt)δtσmax2\|\mathrm{Err}\|_{F}\leq Cm(\epsilon_{2}+\sqrt{\delta_{t}})\delta_{t}{\sigma_{\max}^{*}}^{2}

Consider the other three terms, \mathrm{GradU},\mathrm{Term2},\mathrm{Hess}. These were bounded in Lemma 3.5 for the linear case. The statement and proof of Lemma 3.5 remain the same as earlier because its proof only uses the bounds on \bm{b}_{k}, \bm{x}_{k} from Lemma 3.3. The statement of Lemma 3.3 also remains the same, with one change: we replace \|\bm{x}^{*}-\bm{x}\| by \mathrm{dist}(\bm{x}^{*},\bm{x}) and \|{\bm{X}}^{*}-{\bm{X}}\|_{F}^{2} by \sum_{k=1}^{q}\mathrm{dist}^{2}(\bm{x}^{*}_{k},\bm{x}_{k}), and similarly for \bm{b}^{*}_{k},\bm{g}_{k}. The first part of Lemma 3.3 now follows from the first part of [6, Lemma 3.3]. All the subparts of the second part of Lemma 3.3 follow exactly as given in its proof in Sec. IV-D.

VI Limitations of our results

Our results have three limitations: (i) the analyzed algorithm needs sample-splitting, even though, in numerical experiments, this is not needed; (ii) our bound holds w.h.p. for a single matrix {\bm{X}}^{*} satisfying Assumption 1.1 (and not for all such matrices); and (iii) to obtain exactly zero error, we need an infinite number of samples. We explain here the reasons why we are unable to address these issues. We should mention that, since all computers are finite precision, (iii) is entirely a theoretical curiosity. Also, many other results in the LR recovery literature, e.g., [2, 14, 15], have all these limitations.

VI-A Need for sample-splitting

In Algorithm 1, sample-splitting (line 3) helps ensure that the measurement matrices used in each iteration for updating each of {\bm{U}} and \bm{B} are independent of all previous iterates: we split our sample set into 2T+1 subsets; we use one subset for the initialization of {\bm{U}} and one subset each for the T iterations of updating \bm{B} and of updating {\bm{U}}. This helps prove the desired error decay bound by applying the sub-exponential Bernstein inequality [26], which requires the summands to be mutually independent. This becomes true in our case because, conditioned on past measurement matrices, the current set of \bm{a}_{ki}'s is independent of the last updated values of {\bm{U}},\bm{B}; and the \bm{a}_{ki}'s for different (i,k) are mutually independent by definition. Thus, under the conditioning, the summands are mutually independent. Since we prove convergence in order \log(1/\epsilon) iterations, this only adds a multiplicative factor of \log(1/\epsilon) in the sample complexity. Sample-splitting and the above overall idea are a standard approach used in many older works; in fact, sample-splitting is assumed for most of the LRMC guarantees for solutions that do not solve a convex relaxation (i.e., for iterative algorithms) [2, 14, 15]. An exception is [16].
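
To make the splitting concrete, here is a minimal Python sketch (with hypothetical array shapes) of partitioning the measurements into the 2T+1 disjoint sets described above:

```python
import numpy as np

def split_samples(A_full, y_full, T):
    """A_full: (q, m_tot, n), y_full: (q, m_tot), with m_tot = (2T+1)*m.
    Returns 2T+1 disjoint (A, y) pairs: one for the initialization and one
    for each of the 2T update steps (T for B, T for U)."""
    q, m_tot, n = A_full.shape
    m = m_tot // (2 * T + 1)
    subsets = []
    for tau in range(2 * T + 1):
        idx = slice(tau * m, (tau + 1) * m)   # disjoint index ranges
        subsets.append((A_full[:, idx, :], y_full[:, idx]))
    return subsets
```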

There are a few commonly used approaches to avoid sample splitting. (1) One is using the leave-one-out strategy as done in [19]. But this means that the sample complexity dependence on rr worsens: the LRMC sample complexity with this approach is (n+q)r3(n+q)r^{3} times log factors. Also, it is not clear how to develop this approach for alternating 𝑼,𝑩{\bm{U}},\bm{B} updates. (2) The second is to try to prove error decay for all matrices that are close enough to the true 𝑿{\bm{X}}^{*} and that satisfy the other assumptions of the guarantee. There are at least two different approaches to doing this. (2a) The first, which was used in [16], works for LRMC since its measurements are bounded and symmetric: the authors are able to utilize i.i.d. Bernoulli sampling and left and right singular vectors’ incoherence to prove key probabilistic bounds for all matrices of the form 𝑼𝑽{\bm{U}}\bm{V} with 𝑼,𝑽{\bm{U}},\bm{V} both being incoherent. This does not work in our case because our measurements are asymmetric and unbounded (which means for example that 𝒚ki\bm{y}_{ki} times its estimate is heavier-tailed than 𝒚ki\bm{y}_{ki}).

(2b) An alternative approach is the following overall idea, which has been successfully used for analyzing standard PR algorithms, e.g., see [20, 21], but does not always work for other problems. In our setting, this means the following: At iteration t+1t+1, suppose that the previous estimate 𝑼t{\bm{U}}_{t} satisfies SD(𝑼t,𝑼)δt\mathrm{SD}({\bm{U}}_{t},{\bm{U}}^{*}{})\leq\delta_{t}. We need to try to show that, for all 𝑼{\bm{U}} that are a subspace distance δt\delta_{t} away from the true subspace, the next iterate (which is a function of 𝑼{\bm{U}} and of the current 𝑨k,𝒚k\bm{A}_{k},\bm{y}_{k} for all kk) is a distance cδtc\delta_{t} away with a c<1c<1. To be precise, for all 𝑼𝒯:={𝑼:𝑼𝑼=𝑰 and SD(𝑼,𝑼)δt}{\bm{U}}\in\mathcal{T}:=\{{\bm{U}}:{\bm{U}}^{\top}{\bm{U}}=\bm{I}\text{ and }\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})\leq\delta_{t}\}, we need 𝑼+(𝑼)=orth(𝑼ηUf(𝑼,𝑩)){\bm{U}}^{+}({\bm{U}})=orth({\bm{U}}-\eta\nabla_{U}f({\bm{U}},\bm{B})) to satisfy SD(𝑼+,𝑼)cδt\mathrm{SD}({\bm{U}}^{+},{\bm{U}}^{*}{})\leq c\delta_{t} for a c<1c<1. Here orth(𝑴)orth(\bm{M}) is a matrix with orthonormal columns spanning the same subspace as those of 𝑴\bm{M}. Also recall that the columns of 𝑩\bm{B} are 𝒃k:=(𝑨k𝑼)𝒚k\bm{b}_{k}:=(\bm{A}_{k}{\bm{U}})^{\dagger}\bm{y}_{k} for all k[q]k\in[q]. One can show this for all 𝑼𝒯{\bm{U}}\in\mathcal{T} by covering 𝒯\mathcal{T} by a net containing a finite number of points that are such that any point in 𝒯\mathcal{T} is with a subspace distance 0.25δt0.25\delta_{t} of some point in the net, and first proving that this bound holds for all 𝑼{\bm{U}} in the net. The first step for proving such a bound is to bound the error in the estimates 𝒃k\bm{b}_{k} for all 𝑼{\bm{U}} in this net. Because of the decoupled column-wise recovery of the 𝒃k\bm{b}_{k}’s, for one 𝑼{\bm{U}} in this net, the bound on 𝒃k(𝑼)𝑼𝒙k\|\bm{b}_{k}({\bm{U}})-{\bm{U}}^{\top}\bm{x}^{*}_{k}\| holds w.p. 1qexp(rcm)\geq 1-q\exp(r-cm). This is proved in Lemma 3.3. If we want this bound to hold for all 𝑼{\bm{U}}’s in the net covering 𝒯\mathcal{T}, we will need a union bound over all points in the net. The smallest sized net to cover 𝒯\mathcal{T} with accuracy ϵnet=0.25δt\epsilon_{net}=0.25\delta_{t} has size upper bounded by CnrC^{nr} [26]. With using this, the probability lower bound becomes 1exp(nr+logq+rcm)1-\exp(nr+\log q+r-cm). For this to even just be non-negative, we need m>Cnrm>Cnr which is too large and makes our guarantee useless.

VI-B Why we cannot prove our result for all XX^{*}

The inability to obtain a useful union bound over a net of size C^{nr}, explained above, is also the reason why we cannot prove our result for all {\bm{X}}^{*} satisfying Assumption 1.1.

VI-C Why sample complexity depends on the desired final accuracy ϵ\epsilon

Observe from our result that the number of samples required to achieve a certain accuracy ϵ\epsilon grows as log(1/ϵ)\log(1/\epsilon). This means that, for the algorithm to achieve zero error, we need an infinite number of samples. We should mention that this problem is not unique to our result. It is often seen for results that use sample-splitting, e.g., [2, 15]. An exception is [14] for LRMC, where the following basic idea is used: one tries to show that after enough iterations, e.g., when the recovery error is ϵ0=1/n\epsilon_{0}=1/n or smaller, one can start reusing the same samples and still prove error decay. This is also the idea used in [19]. Briefly, the reason we are unable to circumvent this problem using a similar idea to that of [14] is that our algorithm is not a regular GD or projected GD method.

To use a similar idea in our setting, we would need to proceed as follows. We use independent samples until the error is below an ϵ0\epsilon_{0} that is small enough. Pick ϵ0=1/(κ2n2)\epsilon_{0}=1/(\kappa^{2}n^{2}). This happens after T(ϵ0)=Cκ2log(n)log(κ)T(\epsilon_{0})=C\kappa^{2}\log(n)\log(\kappa) iterations. Consider t=T+1t=T+1. At this time, δt=ϵ0=1/(κ2n2)\delta_{t}=\epsilon_{0}=1/(\kappa^{2}n^{2}). Thus, by Lemma 3.3, 𝒃k𝑼𝒙k(1/(κ2n2))𝒙k\|\bm{b}_{k}-{\bm{U}}^{\top}\bm{x}^{*}_{k}\|\lesssim(1/(\kappa^{2}n^{2}))\|\bm{x}^{*}_{k}\| and all the other bounds also hold with δt\delta_{t} replaced by ϵ0\epsilon_{0}. We try to show error decay by applying Lemma 3.4. For this to work, we need to be able to show all of the following without using independence between 𝑼,𝑩{\bm{U}},\bm{B} and the 𝑨k\bm{A}_{k}s: (i) upper and lower bound the eigenvalues of Hess=ki(𝒂ki𝒃k)(.)\mathrm{Hess}=\sum_{ki}(\bm{a}_{ki}\otimes\bm{b}_{k})(.)^{\top} as those proved earlier, (ii) bound 𝑼f(𝑼,𝑩)/m\|\nabla_{\bm{U}}f({\bm{U}},\bm{B})\|/m by c0σmin2c_{0}{\sigma_{\min}^{*}}^{2} for a small constant c0<1c_{0}<1 (in fact even in our main proof, such a bound is sufficient since this term only appears in the denominator), and (iii) bound Term2F/m\|\mathrm{Term2}\|_{F}/m by (c2/κ2)δtσmax2(c_{2}/\kappa^{2})\delta_{t}{\sigma_{\max}^{*}}^{2} with a c2c_{2} sufficiently less than one.

As we explain next, (i) and (ii) can be obtained easily, but (iii) cannot. We can obtain (i) by showing that Hess\mathrm{Hess} is close to Hess=ki(𝒂ki(𝑼𝑼𝒃k))(.)\mathrm{Hess}^{*}=\sum_{ki}(\bm{a}_{ki}\otimes({\bm{U}}^{\top}{\bm{U}}^{*}{}\bm{b}^{*}_{k}))(.)^{\top}; and Hess\mathrm{Hess}^{*} can be bounded almost exactly as done in our proof earlier since 𝑨k\bm{A}_{k}s are independent of 𝒙k\bm{x}^{*}_{k}s. The 𝑼{\bm{U}} in the expression for Hess\mathrm{Hess}^{*} does not matter because 𝑼𝑼{\bm{U}}^{\top}{\bm{U}}^{*}{} is an r×rr\times r rotation matrix and one can take a maximum over all rotation matrices. Using the loose bounds 𝒂ki5n\|\bm{a}_{ki}\|\leq 5\sqrt{n} w.h.p., one can show that HessHessmqmaxki[max𝑾𝒮nr|𝒂ki𝑾𝒈k|max𝑾𝒮nr|𝒂ki𝑾(𝒈k𝒃k)|]mqnμr/qσmaxnϵ0μr/qσmaxmμ2(r/n)σmin2\|\mathrm{Hess}^{*}-\mathrm{Hess}\|\leq mq\max_{ki}[\max_{{\bm{W}}\in\mathcal{S}_{nr}}|\bm{a}_{ki}^{\top}{\bm{W}}\bm{g}_{k}|\cdot\max_{{\bm{W}}\in\mathcal{S}_{nr}}|\bm{a}_{ki}^{\top}{\bm{W}}(\bm{g}_{k}-\bm{b}_{k})|]\lesssim mq\sqrt{n}\mu\sqrt{r/q}{\sigma_{\max}^{*}}\cdot\sqrt{n}\epsilon_{0}\mu\sqrt{r/q}{\sigma_{\max}^{*}}\leq m\mu^{2}(r/n){\sigma_{\min}^{*}}^{2}. Similarly, for (ii), ki𝒂ki𝒂ki(𝒙k𝒙k)𝒃kmqnnϵ0(μ2r/q)σmax2=m(μ2r/n)σmin2\sum_{ki}\|\bm{a}_{ki}\bm{a}_{ki}^{\top}(\bm{x}^{*}_{k}-\bm{x}_{k})\bm{b}_{k}^{\top}\|\lesssim mq\cdot\sqrt{n}\cdot\sqrt{n}\cdot\epsilon_{0}\cdot(\mu^{2}r/q){\sigma_{\max}^{*}}^{2}=m(\mu^{2}r/n){\sigma_{\min}^{*}}^{2}. Using (μ2r/n)1(\mu^{2}r/n)\ll 1, claims (i) and (ii) follow. However, proving (iii) seems to be impossible without using the fact that 𝔼[Term2]=0\mathbb{E}[\mathrm{Term2}]=0. But this expected value is zero only when 𝑨k\bm{A}_{k}s are independent of 𝑼,𝑩{\bm{U}},\bm{B}.

Possible ways to prove (iii). For bounding Term2\mathrm{Term2} for times t>T(ϵ0)t>T(\epsilon_{0}), we can try one of the following ideas. (1) Try to use Cauchy-Schwarz in a way that the projection orthogonal to 𝑼{\bm{U}}^{*}{} is used. There does not seem to be a way to make this work. (2) Try to use the leave-one-out strategy of [19] only for t>T(ϵ0)t>T(\epsilon_{0}).

Figure 1: Comparing the proposed algorithm with existing approaches for solving LRcCS. (a) m=80, n=q=600, r=4. (b) m=50, n=q=600, r=4. (c) m=30, n=q=600, r=4. (d) m=90, n=100, q=120, r=2.
Figure 2: Comparing the proposed algorithm with the existing approach for solving LRPR. We used n=600, q=1000, r=4 and m=250.

VII Numerical Experiments

Our first experiment compares AltGD-Min with the mixed norm minimization solution from [7] (mixed-norm-min) and with the AltMin algorithm [4, 5, 6] modified for the linear LRcCS problem (the PR step for updating the \bm{b}_{k}'s is replaced by a simple LS step). We implement this using two possible initializations: the initialization developed in [4, 5, 6] for LRPR (AltMinLin-LRPRinit), and the initialization approach developed in this work (AltMinLin-LRCSinit). For mixed-norm-min, we used the code downloaded from https://www.dropbox.com/sh/lywtzc0y9awpvgz/AABbjuiuLWPy_8y7C3GQKo8pa?dl=0, which is provided by the authors. For AltMin, we used the code from https://github.com/praneethmurthy/. We implemented AltGD-Min with \eta=0.4/\|{\bm{X}}_{0}\|^{2} and \tilde{C}=9. Also, we used one set of measurements for all its iterations.

For chosen values of n,q,rn,q,r and mm, we simulated the data as follows. We simulated 𝑼{\bm{U}}^{*}{} by orthogonalizing an n×rn\times r standard Gaussian matrix; and 𝒃k\bm{b}^{*}_{k}s were generated i.i.d. from 𝒩(0,𝑰r){\cal{N}}(0,\bm{I}_{r}). These were generated once. For each of 100 Monte Carlo runs, the measurement matrices 𝑨k\bm{A}_{k} contained i.i.d. standard Gaussian entries. We obtained 𝒚k=𝑨k𝑼𝒃k\bm{y}_{k}=\bm{A}_{k}{\bm{U}}^{*}{}\bm{b}^{*}_{k}, k[q]k\in[q]. For the LRPR experiment, we used 𝒚(mag)k=|𝒚k|{\bm{y}_{(mag)}}_{k}=|\bm{y}_{k}| as the measurements. We plot the empirical average of 𝑿𝑿F/𝑿F\|{\bm{X}}-{\bm{X}}^{*}\|_{F}/\|{\bm{X}}^{*}\|_{F} at each iteration tt on the y-axis (labeled “Error-X” in the plots) and the time taken by the algorithm until iteration tt on the x-axis.
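
For concreteness, a minimal Python sketch of this data-generation step (the Monte-Carlo averaging and the error/time logging used for the plots are omitted, and the seed is arbitrary):

```python
import numpy as np

def generate_lrccs_data(n, q, r, m, seed=0):
    rng = np.random.default_rng(seed)
    # U*: orthonormalize an n x r standard Gaussian matrix
    U_star, _ = np.linalg.qr(rng.standard_normal((n, r)))
    B_star = rng.standard_normal((r, q))            # b*_k ~ N(0, I_r)
    X_star = U_star @ B_star                        # n x q, rank r
    A = rng.standard_normal((q, m, n))              # i.i.d. Gaussian A_k
    Y = np.einsum('kij,jk->ki', A, X_star)          # y_k = A_k x*_k, stored as rows
    return U_star, B_star, X_star, A, Y, np.abs(Y)  # |y_k| for the LRPR experiment
```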

For our first experiment, shown in Fig. 1(a), we used n=600, q=600, r=4 and m=80. In this case, the mixed-norm-min error decays to about 2-5% but does not reduce any further. But, for our algorithm, AltGD-Min, and for both versions of AltMin, the error decays to 10^{-15}. Notice also that AltGD-Min is much faster than all the other approaches. For Fig. 1(b), we reduced m to m=50. Here a similar trend is observed, except that the error decays to only around 10^{-13} for AltGD-Min and 10^{-11} for the two AltMin approaches. Finally, for Fig. 1(c), we reduced m to m=30. In this case, only AltGD-Min and AltMin-LRCSinit work, while the mixed-norm-min and AltMin-LRPRinit errors do not decrease at all. The reason is that both of these need a higher sample complexity (see Table I). Lastly, we also tried an experiment with very large m: n=100, q=120, r=2 and m=0.9n=90, see Fig. 1(d). Even for such a large value of m (compared to n), observe that the mixed-norm-min error saturates at around 1-2%. The likely reason for this is that, in the guarantee for mixed-norm-min [7] (summarized for the noiseless case in Proposition 2.3 given earlier), even for m=n, the error is bounded by a multiplier (more than 1) times \sqrt{r/q}.

For the comparisons for the LRPR problem shown in Fig. 2, we need a much larger qq and mm since LRPR requires mqmq to scale as nr3nr^{3} both for initialization and for the GDmin iterations and the multiplying constants are also much larger for LRPR. We used n=600,q=1000,r=4n=600,q=1000,r=4 and m=250m=250. Notice that altGD-Min-LRPR is faster than AltMin-LRPR. We implemented altGD-Min-LRPR with η=0.9/𝑿02\eta=0.9/\|{\bm{X}}_{0}\|^{2}, C~=9\tilde{C}=9, and TRWF,t=max(5+t,40)T_{RWF,t}=\max(5+t,40) in the RWF code (code for [21], downloaded from the specified site). Also, here again, we used one set of measurements for all its iterations.

VIII Conclusions

This work developed a sample-efficient and fast gradient descent (GD) solution, called AltGD-Min, for provably recovering a low-rank (LR) matrix from mutually independent column-wise linear projections. This problem, which we refer to as “Low Rank column-wise Compressive Sensing (LRcCS)”, frequently occurs in accelerated LR dynamic MRI and in federated sketching. If used in a federated setting, AltGD-Min is also communication-efficient. Unlike the other well-studied LR recovery problems (matrix completion, sensing, and multivariate regression), the LRcCS problem has received little attention in the theoretical literature.

Appendix A Understanding why LRMC-style GD approaches cannot be easily analyzed for LRcCS

TABLE II: Understanding why LRMC style projected-GD on 𝑿{\bm{X}} does not work in our case.
LRMC Our Problem, LRcCS
f~(𝑿)\tilde{f}({\bm{X}}) k=1qj=1n(𝒚jkδjk𝑿jk)2\displaystyle\sum_{k=1}^{q}\sum_{j=1}^{n}(\bm{y}_{jk}-\delta_{jk}{\bm{X}}_{jk})^{2} k=1qi=1m(𝒚ki𝒂ki𝒙k)2\displaystyle\sum_{k=1}^{q}\sum_{i=1}^{m}(\bm{y}_{ki}-\bm{a}_{ki}^{\top}\bm{x}_{k})^{2}
δjkiidBernoulli(p)\delta_{jk}\stackrel{{\scriptstyle\mathrm{iid}}}{{\thicksim}}Bernoulli(p) 𝒂kiiid𝒩(0,𝑰n)\bm{a}_{ki}\stackrel{{\scriptstyle\mathrm{iid}}}{{\thicksim}}{\cal{N}}(0,\bm{I}_{n})
Xf~(𝑿)\nabla_{X}\tilde{f}({\bm{X}}) k=1qj=1nδjk(𝒚jkδjk𝑿jk)𝒆j𝒆k\displaystyle\sum_{k=1}^{q}\sum_{j=1}^{n}\delta_{jk}(\bm{y}_{jk}-\delta_{jk}{\bm{X}}_{jk})\bm{e}_{j}\bm{e}_{k}^{\top} k=1qi=1m(𝒚ki𝒂ki𝒙k)𝒂ki𝒆k\displaystyle\sum_{k=1}^{q}\sum_{i=1}^{m}(\bm{y}_{ki}-\bm{a}_{ki}^{\top}\bm{x}_{k})\bm{a}_{ki}\bm{e}_{k}^{\top}
=k=1qj=1nδjk(𝑿jk𝑿jk)𝒆j𝒆k\displaystyle=\sum_{k=1}^{q}\sum_{j=1}^{n}\delta_{jk}({\bm{X}}^{*}_{jk}-{\bm{X}}_{jk})\bm{e}_{j}\bm{e}_{k}^{\top} =k=1qi=1m𝒂ki(𝒙k𝒙k)𝒂ki𝒆k\displaystyle=\sum_{k=1}^{q}\sum_{i=1}^{m}\bm{a}_{ki}^{\top}(\bm{x}^{*}_{k}-\bm{x}_{k})\bm{a}_{ki}\bm{e}_{k}^{\top}
𝑯~:=𝑯ηf(𝑿)\tilde{\bm{H}}:=\bm{H}-\eta\nabla f({\bm{X}}) k=1qj=1n(1δjkp)𝑯jk𝒆j𝒆k\displaystyle\sum_{k=1}^{q}\sum_{j=1}^{n}(1-\frac{\delta_{jk}}{p})\bm{H}_{jk}\bm{e}_{j}\bm{e}_{k}{}^{\top} 1mk=1qi=1m(𝑰𝒂ki𝒂ki)𝒉k𝒆k\displaystyle\frac{1}{m}\sum_{k=1}^{q}\sum_{i=1}^{m}(\bm{I}-\bm{a}_{ki}\bm{a}_{ki}{}^{\top})\bm{h}_{k}\bm{e}_{k}{}^{\top}

A-A Gradient Descent

The iterates of a gradient descent (GD) algorithm converge when the gradient approaches zero. Thus, in order to show its convergence, one needs to be able to bound the norm of the gradient and show that it goes to zero with iterations. In order to show fast enough convergence (reach ϵ\epsilon error in order log(1/ϵ)\log(1/\epsilon) iterations), one further needs to show that this bound on the gradient norm decreases sufficiently with each iteration. Consider projGD-X which was studied in [15] for solving LRMC. ProjGD-X iterations involve computing 𝑿+𝒫r(𝑿𝑿f~(𝑿)){\bm{X}}^{+}\leftarrow\mathcal{P}_{r}({\bm{X}}-\nabla_{\bm{X}}\tilde{f}({\bm{X}})), here 𝒫r(𝑴)\mathcal{P}_{r}(\bm{M}) projects its argument onto the space of rank-rr matrices. To bound 𝑿f~(𝑿)\|\nabla_{\bm{X}}\tilde{f}({\bm{X}})\|, we need to bound |𝒘𝑿f~(𝑿)𝒛||\bm{w}^{\top}\nabla_{\bm{X}}\tilde{f}({\bm{X}})\bm{z}| for any unit norm vectors 𝒘,𝒛\bm{w},\bm{z}. We show the cost function f~(𝑿)\tilde{f}({\bm{X}}) and its gradient for both LRMC and LRcCS in Table II. Observe that, for LRcCS, 𝒘𝑿f~(𝑿)𝒛\bm{w}^{\top}\nabla_{\bm{X}}\tilde{f}({\bm{X}})\bm{z} is a sum of sub-exponential r.v.s with sub-exponential norms bounded by Ke=maxk𝒘𝒙k𝒙k|𝒛k|maxk𝒙k𝒙kK_{e}=\max_{k}\|\bm{w}\|\cdot\|\bm{x}^{*}_{k}-\bm{x}_{k}\|\cdot|\bm{z}_{k}|\leq\max_{k}\|\bm{x}^{*}_{k}-\bm{x}_{k}\|. Thus, in order to get a small enough bound on |𝒘𝑿f~(𝑿)𝒛||\bm{w}^{\top}\nabla_{\bm{X}}\tilde{f}({\bm{X}})\bm{z}| by applying the sub-exponential Bernstein inequality [26], we need a small enough bound on maxk𝒙k𝒙k\max_{k}\|\bm{x}^{*}_{k}-\bm{x}_{k}\| (column-wise error bound). It is not clear how to get this because the projection step introduces coupling between the different columns of the estimated matrix 𝑿{\bm{X}} 222 Let 𝑯:=𝑿𝑿\bm{H}:={\bm{X}}-{\bm{X}}^{*}, 𝑯~:=(𝑿ηf(𝑿))𝑿=𝑯ηf(𝑿)\tilde{\bm{H}}:=({\bm{X}}-\eta\nabla f({\bm{X}}))-{\bm{X}}^{*}=\bm{H}-\eta\nabla f({\bm{X}}), and 𝑯+=𝑿+𝑿=𝒫r(𝑿f(𝑿))𝑿=𝒫r(𝑿+𝑯~)𝑿\bm{H}^{+}={\bm{X}}^{+}-{\bm{X}}^{*}=\mathcal{P}_{r}({\bm{X}}-\nabla f({\bm{X}}))-{\bm{X}}^{*}=\mathcal{P}_{r}({\bm{X}}^{*}+\tilde{\bm{H}})-{\bm{X}}^{*}. To bound the LRMC projGD-X errors, one needs an entry-wise bound of the form 𝑯+maxδt𝑿max\|\bm{H}^{+}\|_{\max}\leq\delta_{t}\|{\bm{X}}^{*}\|_{\max} with δt\delta_{t} decaying exponentially. We show the expressions for 𝑯~\tilde{\bm{H}} in the table. For LRMC, notice that different summands of 𝑯~\tilde{\bm{H}} are mutually independent and each depends on only one entry of 𝑯\bm{H}. This fact is carefully exploited in [15, Lemma 1] and [14, Lemma 1]. By borrowing ideas from the literature on spectral statistics of Erdos-Renyi graphs [32], the authors are able to obtain expressions for higher powers of (𝑯~𝑯~)(\tilde{\bm{H}}\tilde{\bm{H}}^{\top}). These expressions help them get the desired bound under the desired sample complexity. For LRcCS, using the gradient expression, we need a bound on maxk𝒉k+\max_{k}\|\bm{h}^{+}_{k}\| in terms of 𝒉k\|\bm{h}_{k}\| in order to show its exponential decay. Since the different entries of 𝑯~\tilde{\bm{H}} are not mutually independent and not bounded, the LRMC proof approach cannot be borrowed. . Moreover, even if we could somehow get such a bound, in the best case, it would be proportional to δtmaxk𝒙k\delta_{t}\max_{k}\|\bm{x}^{*}_{k}\| with δt<1\delta_{t}<1 and decaying exponentially with tt. Using Assumption 1.1, this would then imply that Keδtmaxk𝒙kδtμr/qσmaxK_{e}\leq\delta_{t}\max_{k}\|\bm{x}^{*}_{k}\|\leq\delta_{t}\mu\sqrt{r/q}{\sigma_{\max}^{*}}. But, this is not small enough. 
We need it to be proportional to δt(r/q)\delta_{t}(r/q) in order to be able to bound the gradient norm under the desired sample complexity.

Consider altGDnormbal studied in [17, 16] for LRMC. In this case again, the desired column-wise error bound cannot be obtained because the update step for 𝑩\bm{B} involves GD w.r.t. f(𝑼,𝑩)+f2(𝑼,𝑩)f({\bm{U}},\bm{B})+f_{2}({\bm{U}},\bm{B}). The gradient w.r.t f2f_{2} (norm-balancing term) introduces coupling between the different columns of 𝑩\bm{B}, and hence, also between columns of 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B}. Thus, once again, it is not clear how to get a tight bound on maxk𝒙k𝒙k\max_{k}\|\bm{x}^{*}_{k}-\bm{x}_{k}\|.

For AltGD-Min, because the min step for updating 𝑩\bm{B} is a decoupled LS problem, it is possible to get the desired column-wise error bound. Secondly, because we use GD w.r.t 𝑼{\bm{U}}, there is an extra 𝒃k\bm{b}_{k}^{\top} term in the gradient summands. This makes the gradient (and its deviation from its expected value), a sum of nice-enough sub-exponential r.v.s as explained in Sec. III-B.

TABLE III: Why the LRMC initialization approach cannot be directly borrowed?
LRMC Our Problem, LRcCS
𝑿0,full={\bm{X}}_{0,full}= kjδjkp𝒚jk𝒆j𝒆k\displaystyle\sum_{k}\sum_{j}\frac{\delta_{jk}}{p}\bm{y}_{jk}\bm{e}_{j}\bm{e}_{k}{}^{\top} 1mki𝒂ki𝒚ki𝒆k\displaystyle\frac{1}{m}\sum_{k}\sum_{i}\bm{a}_{ki}\bm{y}_{ki}\bm{e}_{k}{}^{\top}
δjkiidBernoulli(p)\delta_{jk}\stackrel{{\scriptstyle\mathrm{iid}}}{{\thicksim}}Bernoulli(p) 𝒂kiiid𝒩(0,𝑰n)\bm{a}_{ki}\stackrel{{\scriptstyle\mathrm{iid}}}{{\thicksim}}{\cal{N}}(0,\bm{I}_{n})
𝑯0=𝑿0,full𝑿\bm{H}_{0}={\bm{X}}_{0,full}-{\bm{X}}^{*} k=1qj=1n(1δjkp)𝑿jk𝒆j𝒆k\displaystyle\sum_{k=1}^{q}\sum_{j=1}^{n}(1-\frac{\delta_{jk}}{p}){\bm{X}}^{*}_{jk}\bm{e}_{j}\bm{e}_{k}{}^{\top} 1mk=1qi=1m(𝑰𝒂ki𝒂ki)𝒙k𝒆k\displaystyle\frac{1}{m}\sum_{k=1}^{q}\sum_{i=1}^{m}(\bm{I}-\bm{a}_{ki}\bm{a}_{ki}{}^{\top})\bm{x}^{*}_{k}\bm{e}_{k}{}^{\top}
Summand bound: LRMC: each summand is nicely bounded by \mu^{2}{\sigma_{\max}^{*}}(r/\sqrt{nq}). LRcCS: the summands are unbounded; their max sub-expo. norm** is \mu{\sigma_{\max}^{*}}\sqrt{r/q}, which is too large (we need r/q).
Concen. ineq.: LRMC: matrix Bernstein [33], which gives the desired sample comp. LRcCS: sub-expo. Bernstein [26], which does not give the desired sample comp.

**: “max sub-expo. norm”: the maximum sub-exponential norm of (\bm{a}_{ki}{}^{\top}\bm{w})(\bm{a}_{ki}{}^{\top}\bm{x}^{*}_{k})(\bm{e}_{k}^{\top}\bm{z}) over any unit vectors \bm{w},\bm{z}.

A-B Initialization

The standard approach used for initializing iterative algorithms for LRMC (as well as other linear LRR problems) is to compute the top rr left singular vectors of the matrix 𝑿0,full{\bm{X}}_{0,full} that satisfies (𝑿0,full)vec=𝒜(𝒚all)({\bm{X}}_{0,full})_{vec}=\mathcal{A}^{\top}(\bm{y}_{all}), where 𝒚all\bm{y}_{all} is the mqmq-length vector of all measurements and 𝒜\mathcal{A} denotes the linear mapping from (𝑿)vec({\bm{X}}^{*})_{vec} to 𝒚all\bm{y}_{all}. In case of LRMC and LRcCS, this is computed is as given in Table III. It is not hard to see that, in both cases, 𝔼[𝑿0,full]=𝑿\mathbb{E}[{\bm{X}}_{0,full}]={\bm{X}}^{*}. To show that this approach works, one typically uses a sinΘ\sin\Theta theorem, e.g., Davis-Kahan or Wedin, to bound SD(𝑼,𝑼0)\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{0}) as a function of terms that depend on 𝑯0:=𝑿0,full𝑿\bm{H}_{0}:={\bm{X}}_{0,full}-{\bm{X}}^{*}. Thus a first requirement is to bound 𝑯0\|\bm{H}_{0}\|. For LRMC, this can be done easily since 𝑯0\bm{H}_{0} is a sum of the independent one-sparse random matrices shown in the table with each matrix containing an i.i.d. Bernoulli r.v. times 𝑿jk{\bm{X}}^{*}_{jk} (jkjk-th entry of 𝑿{\bm{X}}^{*}) as its nonzero entry. Using the left and right singular vectors’ incoherence (assumed in all LRMC guarantees), and 𝑿jk=𝒆j𝑿𝒆k{\bm{X}}^{*}_{jk}=\bm{e}_{j}^{\top}{\bm{X}}^{*}\bm{e}_{k}, one can argue that, for unit vectors 𝒘,𝒛\bm{w},\bm{z}, each summand of |𝒘𝑯0𝒛||\bm{w}^{\top}\bm{H}_{0}\bm{z}| is of order at most (1/p)σmaxr/nq(1/p){\sigma_{\max}^{*}}r/\sqrt{nq}. This bound, along with a bound on the “variance parameter" needed for applying matrix Bernstein [33],[26, Chap 5] helps show that 𝑯0cσmax\|\bm{H}_{0}\|\leq c{\sigma_{\max}^{*}} w.h.p., under the desired sample complexity bound. For LRcCS, the summands of 𝑿0,full{\bm{X}}_{0,full}, and hence of 𝑯0\bm{H}_{0}, are sub-exponential r.v.s. These can be bounded using the sub-exponential Bernstein inequality [26, Chap 2]. This requires a bound on the maximum sub-exponential norm of any summand. Denote this bound by KeK_{e}. In order to show that 𝑯0cσmax\|\bm{H}_{0}\|\leq c{\sigma_{\max}^{*}} w.h.p, under the desired sample complexity, we need KeK_{e} to be of order (r/q)(r/q) or smaller. However, for our summands, we can only guarantee Ke(1/m)maxk𝒙k(1/m)μr/qσmaxK_{e}\leq(1/m)\max_{k}\|\bm{x}^{*}_{k}\|\leq(1/m)\mu\sqrt{r/q}{\sigma_{\max}^{*}}. This is not small enough, i.e., the summands are not nice-enough subexponentials. It will require mq(n+q)rqmq\gtrsim(n+q)r\cdot\sqrt{q} which is too large.

Appendix B Proof of Initialization Theorem 3.1 without sample-splitting

Consider the initialization using {\bm{X}}_{0} defined in (2). We want to bound the initialization error without sample-splitting. This means that the threshold \alpha is not independent of the \bm{a}_{ki},\bm{y}_{ki} used in the expression for {\bm{X}}_{0}, and thus it is not clear how to compute its expected value even if we condition on \alpha. However, the following slightly more complicated approach can be used. Using Fact 3.7 and Assumption 1.1, it is possible to show that {\bm{X}}_{0} is close to a matrix {\bm{X}}_{+}(\epsilon_{1}), defined next, for which \mathbb{E}[{\bm{X}}_{+}] is easily computed: Let

α+:=C~(1+ϵ1)𝑿F2q\alpha_{+}:=\tilde{C}(1+\epsilon_{1})\frac{\|{\bm{X}}^{*}\|_{F}^{2}}{q}

and define

𝑿+(ϵ1)\displaystyle{\bm{X}}_{+}(\epsilon_{1}) :=1mki𝒂ki𝒚ki𝒆k𝟙{𝒚ki2α+}. Then,\displaystyle:=\frac{1}{m}\sum_{ki}\bm{a}_{ki}\bm{y}_{ki}\bm{e}_{k}{}^{\top}\mathbbm{1}_{\small\{\bm{y}_{ki}^{2}\leq\alpha_{+}\}}.\text{ Then, }
𝔼[𝑿+]\displaystyle\mathbb{E}[{\bm{X}}_{+}] =𝑿𝑫(ϵ1),\displaystyle={\bm{X}}^{*}{\bm{D}}(\epsilon_{1}),
𝑫\displaystyle{\bm{D}} :=diagonal(βk(ϵ1)),\displaystyle:=diagonal(\beta_{k}(\epsilon_{1})),
βk(ϵ1)\displaystyle\beta_{k}(\epsilon_{1}) :=𝔼[ζ2𝟙{ζ2α+𝒙k2}]\displaystyle:=\mathbb{E}\left[\zeta^{2}\mathbbm{1}_{\small\left\{\zeta^{2}\leq\frac{\alpha_{+}}{\|\bm{x}^{*}_{k}\|^{2}}\right\}}\right] (20)

with ζ\zeta being a scalar standard Gaussian. Thus 𝑿+{\bm{X}}_{+} is 𝑿0{\bm{X}}_{0} with the threshold α\alpha replaced by α+\alpha_{+} which is deterministic. Consequently 𝔼[𝑿+]\mathbb{E}[{\bm{X}}_{+}] has a similar form too and is obtained as explained in the proof of Lemma 3.6 given in Sec. IV-F.
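
For concreteness, a minimal Python sketch of forming the truncated matrix {\bm{X}}_{0} (with the data-dependent threshold \alpha=\tilde{C}\frac{1}{mq}\sum_{ki}\bm{y}_{ki}^{2}) and its top r left singular vectors; this is an illustration with assumed array shapes, not the authors' code:

```python
import numpy as np

def truncated_spectral_init(Y, A, r, C_tilde=9.0):
    """Y: (q, m) linear measurements y_ki; A: (q, m, n) Gaussian matrices.
    Returns U_0 (top-r left singular vectors of X_0) and X_0 itself."""
    q, m, n = A.shape
    alpha = C_tilde * np.mean(Y ** 2)           # C_tilde * (1/(mq)) * sum_ki y_ki^2
    X0 = np.zeros((n, q))
    for k in range(q):
        trunc = Y[k] * (Y[k] ** 2 <= alpha)     # zero out measurements above threshold
        X0[:, k] = A[k].T @ trunc / m           # (1/m) sum_i a_ki y_ki 1{y_ki^2 <= alpha}
    U0, _, _ = np.linalg.svd(X0, full_matrices=False)
    return U0[:, :r], X0
```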

Next, recall that {\bm{X}}^{*}\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}{\bm{\Sigma}^{*}}{\bm{V}^{*}} and \tilde{C}=9\kappa^{2}\mu^{2}. Let \tilde{c}=c/\tilde{C} for a c<1. Clearly, \mathbb{E}[{\bm{X}}_{+}]={\bm{X}}^{*}{\bm{D}} is a rank-r matrix and the span of its top r left singular vectors equals \mathrm{span}({\bm{U}}^{*}{}). Let

𝔼[𝑿+]=𝑿𝑫=SVD𝑼𝚺ˇ𝑽ˇ\mathbb{E}[{\bm{X}}_{+}]={\bm{X}}^{*}{\bm{D}}\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}\check{\bm{\Sigma}^{*}}\check{\bm{V}}

be its rr-SVD (here 𝑽ˇ\check{\bm{V}} is an r×qr\times q matrix with its rows containing the rr right singular vectors). We thus have

σr(𝔼[𝑿+])\displaystyle\sigma_{r}(\mathbb{E}[{\bm{X}}_{+}]) =σmin(𝚺ˇ)=σmin(𝚺𝑽𝑫𝑽ˇ)\displaystyle=\sigma_{\min}(\check{\bm{\Sigma}^{*}})=\sigma_{\min}({\bm{\Sigma}^{*}}{\bm{V}^{*}}{\bm{D}}\check{\bm{V}}{}^{\top})
σmin(𝚺)σmin(𝑽)σmin(𝑫)σmin(𝑽ˇ)\displaystyle\geq\sigma_{\min}({\bm{\Sigma}^{*}})\sigma_{\min}({\bm{V}^{*}})\sigma_{\min}({\bm{D}})\sigma_{\min}(\check{\bm{V}}{}^{\top})
=σmin1(minkβk)1\displaystyle={\sigma_{\min}^{*}}\cdot 1\cdot(\min_{k}\beta_{k})\cdot 1

Fact 3.9 given earlier shows that (minkβk)0.9(\min_{k}\beta_{k})\geq 0.9 and thus,

σr(𝔼[𝑿+])0.9σmin\sigma_{r}(\mathbb{E}[{\bm{X}}_{+}])\geq 0.9{\sigma_{\min}^{*}}

Also, σr+1(𝔼[𝑿+])=0\sigma_{r+1}(\mathbb{E}[{\bm{X}}_{+}])=0 since it is a rank rr matrix. Thus, using Wedin’s sinΘ\sin\Theta theorem for SD\mathrm{SD} (summarized in Theorem 4.1) applied with 𝑴𝑿0\bm{M}\equiv{\bm{X}}_{0}, 𝑴𝔼[𝑿+]\bm{M}^{*}\equiv\mathbb{E}[{\bm{X}}_{+}] gives

SD(𝑼0,𝑼)\displaystyle\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})
2max((𝑿0𝔼[𝑿+])𝑼F,(𝑿0𝔼[𝑿+])𝑽ˇF)0.9σmin𝑿0𝔼[𝑿+]\displaystyle\leq\dfrac{\sqrt{2}\max\left(\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])^{\top}{\bm{U}}^{*}{}\|_{F},\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])\check{\bm{V}}{}^{\top}\|_{F}\right)}{0.9{\sigma_{\min}^{*}}-\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}]\|} (21)

In the next three subsections, we prove a set of six lemmas that help bound the three terms in the expression above. The main new ideas, over the proof given earlier in Sec. III-E, are in the proof of the first lemma, Lemma B.2, given below, and in the proof of Claim B.1 that is used there.
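Before stating these lemmas, the following minimal numerical sketch (Python/NumPy) may help fix ideas: it simulates the truncated spectral initialization and evaluates a subspace-distance proxy ‖(𝑰−𝑼0𝑼0⊤)𝑼*‖F of the kind that (21) controls. All dimensions and the value used for C̃ below are illustrative choices made only for this sketch, and the thresholding follows the form of 𝑿0 implied by the expression for 𝑿+ above; it is not a prescription from the analysis.

```python
import numpy as np

# Illustrative sizes (chosen only for this sketch): n x q rank-r matrix, m measurements per column.
n, q, r, m = 100, 200, 3, 50
rng = np.random.default_rng(0)

# Ground truth X* = U* B with orthonormal U*.
Ustar, _ = np.linalg.qr(rng.standard_normal((n, r)))
B = rng.standard_normal((r, q))
Xstar = Ustar @ B

# Column-wise Gaussian measurements: y_ki = a_ki^T x*_k.
A = rng.standard_normal((q, m, n))            # A[k] plays the role of A_k
Y = np.einsum('kij,jk->ik', A, Xstar)         # Y[i, k] = a_ki^T x*_k

# Data-dependent truncation threshold alpha = C_tilde * (1/(mq)) * sum_ki y_ki^2;
# C_tilde stands in for 9 kappa^2 mu^2 and is simply set to a constant here.
C_tilde = 9.0
alpha = C_tilde * np.mean(Y**2)

# X0 = (1/m) sum_i a_ki y_ki 1{y_ki^2 <= alpha}, built column by column.
X0 = np.zeros((n, q))
for k in range(q):
    keep = Y[:, k] ** 2 <= alpha
    X0[:, k] = A[k][keep].T @ Y[keep, k] / m

# U0 = top-r left singular vectors of X0; subspace-distance proxy ||(I - U0 U0^T) U*||_F.
U0 = np.linalg.svd(X0, full_matrices=False)[0][:, :r]
SD = np.linalg.norm(Ustar - U0 @ (U0.T @ Ustar))
print(f"SD(U0, U*) ~ {SD:.3f}")
```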

Claim B.1.

Let 𝐱n\bm{x}^{*}\in\Re^{n}, 𝐳n\bm{z}\in\Re^{n} be two deterministic vectors and let α\alpha be a deterministic scalar. Let 𝐚𝒩(0,𝐈n)\bm{a}\sim{\cal{N}}(0,\bm{I}_{n}) be a standard Gaussian vector and define 𝐲:=𝐚𝐱\bm{y}:=\bm{a}^{\top}\bm{x}^{*}. For any 0<ϵ<10<\epsilon<1,

𝔼[|𝒚(𝒂𝒛)|𝟙{𝒚2[1±ϵ]α}]Cϵ𝒛α.\mathbb{E}\left[|\bm{y}(\bm{a}{}^{\top}\bm{z})|\mathbbm{1}_{\{\bm{y}^{2}\in[1\pm\epsilon]\alpha\}}\right]\leq C\epsilon\|\bm{z}\|\sqrt{\alpha}.

Combining Lemmas B.3 and B.2 and using Fact 3.7, and setting ϵ1=cδ0/rκ\epsilon_{1}=c\delta_{0}/\sqrt{r}\kappa, we conclude that, w.p. at least
12exp((n+q)c~ϵ12mq)exp(c~mqϵ12)12exp((n+q)c~mqδ02/rκ2)exp(c~mqδ02/rκ2)1-2\exp((n+q)-\tilde{c}\epsilon_{1}^{2}mq)-\exp(-\tilde{c}mq\epsilon_{1}^{2})\geq 1-2\exp((n+q)-\tilde{c}mq\delta_{0}^{2}/r\kappa^{2})-\exp(-\tilde{c}mq\delta_{0}^{2}/r\kappa^{2}),

𝑿0𝔼[𝑿+]ϵ1𝑿Fcδ0σmin\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}]\|\lesssim\epsilon_{1}\|{\bm{X}}^{*}\|_{F}\lesssim c\delta_{0}{\sigma_{\min}^{*}}
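The last step above is just the substitution ϵ1 = cδ0/(√r κ) together with the standard fact ‖𝑿*‖F ≤ √r σmax* = √r κ σmin* (since 𝑿* has rank rr and κ = σmax*/σmin*); spelled out,

\epsilon_{1}\|{\bm{X}}^{*}\|_{F}\leq\frac{c\delta_{0}}{\sqrt{r}\kappa}\cdot\sqrt{r}\,{\sigma_{\max}^{*}}=\frac{c\delta_{0}}{\sqrt{r}\kappa}\cdot\sqrt{r}\,\kappa\,{\sigma_{\min}^{*}}=c\delta_{0}{\sigma_{\min}^{*}}.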

By combining Lemmas B.4, B.5, B.6, and B.7 and using Fact 3.7, and setting ϵ1=cδ0/rκ\epsilon_{1}=c\delta_{0}/\sqrt{r}\kappa, we conclude that, w.p. at least
12exp(nrc~mqδ02/rκ2)2exp(qrc~mqδ02/rκ2)exp(c~mqδ02/rκ2)1-2\exp(nr-\tilde{c}mq\delta_{0}^{2}/r\kappa^{2})-2\exp(qr-\tilde{c}mq\delta_{0}^{2}/r\kappa^{2})-\exp(-\tilde{c}mq\delta_{0}^{2}/r\kappa^{2}),

max((𝑿0𝔼[𝑿+])𝑼F,(𝑿0𝔼[𝑿+])𝑽ˇF)cδ0σmin\max\left(\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])^{\top}{\bm{U}}^{*}{}\|_{F},\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])\check{\bm{V}}^{\top}\|_{F}\right)\lesssim c\delta_{0}{\sigma_{\min}^{*}}

Plugging these into (21) proves Theorem 3.1.

B-A Bounding the denominator term

By triangle inequality, 𝑿0𝔼[𝑿+]𝑿+𝔼[𝑿+]+𝑿0𝑿+.\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}]\|\leq\|{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\|+\|{\bm{X}}_{0}-{\bm{X}}_{+}\|. The next two lemmas bound these two terms. The lemmas assume the claim of Fact 3.7 holds, i.e., that 1mqki𝒚ki2[1±ϵ1]C~𝑿F2/q\frac{1}{mq}\sum_{ki}\bm{y}_{ki}^{2}\in[1\pm\epsilon_{1}]\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}/q where C~=9μ2κ2\tilde{C}=9\mu^{2}\kappa^{2}.

Lemma B.2.

Assume that 1mqki𝐲ki2[1±ϵ1]C~𝐗F2/q\frac{1}{mq}\sum_{ki}\bm{y}_{ki}^{2}\in[1\pm\epsilon_{1}]\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}/q (claim of Fact 3.7 holds). Then, w.p. 1exp(C(n+q)ϵ12mq/μ2κ2)1-\exp(C(n+q)-\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}),

𝑿0𝑿+Cϵ1μκ𝑿F.\|{\bm{X}}_{0}-{\bm{X}}_{+}\|\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.2.

We have

𝑿+𝑿0\displaystyle\|{\bm{X}}_{+}-{\bm{X}}_{0}\| =max𝒛𝒮n,𝒘𝒮q𝒛(𝑿+𝑿0)𝒘\displaystyle=\max_{\bm{z}\in\mathcal{S}^{n},~{}\bm{w}\in\mathcal{S}^{q}}\bm{z}{}^{\top}\left({\bm{X}}_{+}-{\bm{X}}_{0}\right)\bm{w}
=max𝒛𝒮n,𝒘𝒮q1mki𝒘(k)𝒚ki(𝒂ki𝒛)\displaystyle=\max_{\bm{z}\in\mathcal{S}^{n},~{}\bm{w}\in\mathcal{S}^{q}}\frac{1}{m}\sum_{ki}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})
×𝟙{C~mqki𝒚ki2𝒚ki2C~(1+ϵ1)q𝑿F2}.\displaystyle\qquad\times\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}\bm{y}_{ki}^{2}\leq\bm{y}_{ki}^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}.

For the last expression above, we have used the assumption ki𝒚ki2/mC~(1+ϵ1)𝑿F2\sum_{ki}\bm{y}_{ki}^{2}/m\leq\tilde{C}(1+\epsilon_{1})\|{\bm{X}}^{*}\|_{F}^{2}. Consider the RHS for a fixed unit-norm 𝒛\bm{z} and 𝒘\bm{w}. The lower threshold of the indicator function is itself an r.v. To convert it into a deterministic bound, we use the following sequence of bounding steps: in order to use our assumption that ki𝒚ki2/m(1ϵ1)C~𝑿F2\sum_{ki}\bm{y}_{ki}^{2}/m\geq(1-\epsilon_{1})\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}, we first need to bound the summands by their absolute values. This is done as follows:

|𝒛(𝑿+𝑿0)𝒘|\displaystyle|\bm{z}{}^{\top}\left({\bm{X}}_{+}-{\bm{X}}_{0}\right)\bm{w}| 1mki|𝒘(k)𝒚ki(𝒂ki𝒛)|\displaystyle\leq\frac{1}{m}\sum_{ki}\big{|}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\big{|}
×𝟙{C~mqki𝒚ki2|𝒚ki|2C~(1+ϵ1)q𝑿F2},\displaystyle\qquad\times\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}\bm{y}_{ki}^{2}\leq|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}},
1mki|𝒘(k)𝒚ki(𝒂ki𝒛)|\displaystyle\leq\frac{1}{m}\sum_{ki}\big{|}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\big{|}
×𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2},\displaystyle\qquad\times\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}},

where in the last line we used our assumption that ki𝒚ki2/m(1ϵ1)C~𝑿F2\sum_{ki}\bm{y}_{ki}^{2}/m\geq(1-\epsilon_{1})\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}. This final expression is a sum of mutually independent sub-Gaussian r.v.s with subGaussian norm KkiC|𝒘(k)|C~(1+ϵ1)𝑿F/qC~|𝒘(k)|𝑿F/qK_{ki}\leq C|\bm{w}(k)|\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}/\sqrt{q}\leq\sqrt{\tilde{C}}|\bm{w}(k)|\|{\bm{X}}^{*}\|_{F}/\sqrt{q}. Thus, by applying the sub-Gaussian Hoeffding inequality, Theorem 2.6.2 of [26],

Pr{|ki|𝒘(k)𝒚ki(𝒂ki𝒛)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}\displaystyle\Pr\left\{\Big{|}\sum_{ki}\big{|}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right.
𝔼[ki|𝒘(k)𝒚ki(𝒂ki𝒛)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}]|t}\displaystyle-\left.\mathbb{E}\left[\sum_{ki}\big{|}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]\Big{|}\geq t\right\}
2exp[ct2kiKki2].\displaystyle\qquad\leq 2\exp\left[-c\frac{t^{2}}{\sum_{ki}K_{ki}^{2}}\right].

By setting t=ϵ1m𝑿Ft=\epsilon_{1}m\|{\bm{X}}^{*}\|_{F},

t2kiKki2m2qϵ12𝑿F2kiC~𝑿F2|𝒘(k)|2=ϵ12mqC~.\frac{t^{2}}{\sum_{ki}K_{ki}^{2}}\geq\frac{m^{2}q\epsilon_{1}^{2}\|{\bm{X}}^{*}\|_{F}^{2}}{\sum_{ki}\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}|\bm{w}(k)|^{2}}=\frac{\epsilon_{1}^{2}mq}{\tilde{C}}.

Since C~=9μ2κ2\tilde{C}=9\mu^{2}\kappa^{2}, w.p. 1exp(cϵ12mq/μ2κ2)1-\exp(-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}), for a fixed 𝒛\bm{z} and 𝒘\bm{w},

𝒛(𝑿0𝑿+)𝒘ϵ1𝑿F+𝔼[1mki|𝒘(k)𝒚ki(𝒂ki𝒛)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}].\bm{z}{}^{\top}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\bm{w}\leq\epsilon_{1}\|{\bm{X}}^{*}\|_{F}+\mathbb{E}\left[\frac{1}{m}\sum_{ki}\big{|}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right].

By using Claim B.1 and |𝒘(k)|𝒛=|𝒘(k)||\bm{w}(k)|\|\bm{z}\|=|\bm{w}(k)| we have

𝔼[1mki|𝒚ki(𝒂ki𝒛)𝒘(k)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}]\displaystyle\mathbb{E}\left[\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\bm{w}(k)\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]
C~(1+ϵ1)ϵ1𝑿Fk|𝒘(k)|/qCϵ1μκ𝑿F,\displaystyle\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\epsilon_{1}\|{\bm{X}}^{*}\|_{F}\sum_{k}\big{|}\bm{w}(k)\big{|}/\sqrt{q}\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F},

where in the last inequality we used Cauchy-Schwarz to show that k|𝒘(k)|/qk|𝒘(k)|2k(1/q)=1\sum_{k}\big{|}\bm{w}(k)\big{|}/\sqrt{q}\leq\sqrt{\sum_{k}\big{|}\bm{w}(k)\big{|}^{2}\sum_{k}(1/q)}=1; equivalently, 𝒘1/q𝒘=1\|\bm{w}\|_{1}/\sqrt{q}\leq\|\bm{w}\|=1. Also, we used C~=Cκμ\sqrt{\tilde{C}}=C\kappa\mu.

Thus, w.p. 1exp(cϵ12mq/μ2κ2)1-\exp(-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}), for a fixed 𝒛\bm{z} and 𝒘\bm{w}, 𝒛(𝑿0𝑿+)𝒘Cϵ1μκ𝑿F\bm{z}{}^{\top}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\bm{w}\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F}.

By Proposition 4.8, max𝒛𝒮n,𝒘𝒮q𝒛(𝑿0𝑿+)𝒘1.4Cϵ1μκ𝑿F\max_{\bm{z}\in\mathcal{S}^{n},~{}\bm{w}\in\mathcal{S}^{q}}\bm{z}{}^{\top}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\bm{w}\leq 1.4C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F} w.p. at least 1exp((n+q)log(17)cϵ12mq/μ2κ2)1-\exp((n+q)\log(17)-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}). ∎
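For completeness, here is the union-bound arithmetic behind this last step, written as a sketch under the assumption (suggested by the log(17) factor above) that Proposition 4.8 uses 1/8-nets on 𝒮n and 𝒮q:

\left(1+\frac{2}{\epsilon_{net}}\right)^{n}\cdot\left(1+\frac{2}{\epsilon_{net}}\right)^{q}=17^{\,n+q}=\exp\left((n+q)\log 17\right),\qquad\epsilon_{net}=1/8,

so a union bound multiplies the per-pair failure probability exp(−cϵ1²mq/μ²κ²) by the net size, giving the stated exponent, while the factor 1.4 ≥ 1/(1−2ϵnet) = 4/3 accounts for passing from the maximum over the nets to the maximum over the unit spheres.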

Lemma B.3.

Consider 𝐗+{\bm{X}}_{+}. Fix 0<ϵ1<10<\epsilon_{1}<1. Then, w.p. 1exp[C(n+q)cϵ12mq/μ2κ2]1-\exp\left[C(n+q)-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]

𝑿+𝔼[𝑿+]Cϵ1𝑿F.\|{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\|\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.3.

The proof involves an application of the sub-Gaussian Hoeffding inequality followed by an epsilon-net argument, both almost the same as those used in the proof of Lemma B.2 given above. We have,

𝑿+𝔼[𝑿+]=max𝒛𝒮n,𝒘𝒮q𝑿+𝔼[𝑿+],𝒛𝒘.\|{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\|=\max_{\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}}\langle{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}],~{}\bm{z}\bm{w}{}^{\top}\rangle.

For a fixed 𝒛𝒮n,𝒘𝒮q\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}, we have

𝑿+𝔼[𝑿+],𝒛𝒘\displaystyle\langle{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}],~{}\bm{z}\bm{w}{}^{\top}\rangle
=1mki(𝒘(k)𝒚ki(𝒂ki𝒛)𝟙{|𝒚ki|2C~(1+ϵ1)q𝑿F2}\displaystyle=\frac{1}{m}\sum_{ki}\left(\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right.
𝔼[𝒘(k)𝒚ki(𝒂ki𝒛)𝟙{|𝒚ki|2C~(1+ϵ1)q𝑿F2}]).\displaystyle\qquad-\left.\mathbb{E}\left[\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]\right).

The summands are mutually independent, zero mean sub-Gaussian r.v.s with norm KkiC|𝒘(k)|C~(1+ϵ1)𝑿F/qK_{ki}\leq C|\bm{w}(k)|\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}/\sqrt{q}. We will again apply the sub-Gaussian Hoeffding inequality Theorem 2.6.2 of [26]. Let t=ϵ1m𝑿Ft=\epsilon_{1}m\|{\bm{X}}^{*}\|_{F}. Then

\frac{t^{2}}{\sum_{ki}K_{ki}^{2}}\geq\frac{\epsilon_{1}^{2}m^{2}\|{\bm{X}}^{*}\|_{F}^{2}}{\sum_{ki}\tilde{C}(1+\epsilon_{1})|\bm{w}(k)|^{2}\|{\bm{X}}^{*}\|_{F}^{2}/q}\geq\frac{\epsilon_{1}^{2}mq}{C\mu^{2}\kappa^{2}}

Thus, for a fixed 𝒛𝒮n,𝒘𝒮q\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}, by sub-Gaussian Hoeffding, we conclude that, w.p. at least 1exp[cϵ12mq/μ2κ2]1-\exp\left[-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right],

𝑿+𝔼[𝑿+],𝒛𝒘Cϵ1𝑿F.\langle{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}],~{}\bm{z}\bm{w}{}^{\top}\rangle\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

By Proposition 4.7, the above bound holds w.p. at least 1exp[(n+q)cϵ12mq/μ2κ2]1-\exp\left[(n+q)-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]. ∎

B-B Bounding the 𝐕ˇ\check{\bm{V}} numerator term

We bound (𝑿0𝔼[𝑿+])𝑽ˇF\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])\check{\bm{V}}^{\top}\|_{F} in this section. By the triangle inequality, it is bounded by (𝑿0𝑿+)𝑽ˇF+(𝑿+𝔼[𝑿+])𝑽ˇF\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\check{\bm{V}}{}^{\top}\|_{F}+\|\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right)\check{\bm{V}}{}^{\top}\|_{F}.

Lemma B.4.

Assume that 1mki𝐲ki2[1±ϵ1]𝐗F2\frac{1}{m}\sum_{ki}\bm{y}_{ki}^{2}\in[1\pm\epsilon_{1}]\|{\bm{X}}^{*}\|_{F}^{2}. Then, w.p. 1exp[nrcϵ12mq/μ2κ2]1-\exp\left[nr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right],

(𝑿0𝑿+)𝑽ˇFCϵ1μκ𝑿F.\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\check{\bm{V}}{}^{\top}\|_{F}\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.4.

The initial part of the proof is very similar to that of Lemma B.2. We have, (𝑿0𝑿+)𝑽ˇF=max𝑾𝒮nr𝑾,(𝑿0𝑿+)𝑽ˇ.\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\check{\bm{V}}{}^{\top}\|_{F}=\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle{\bm{W}},~{}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\check{\bm{V}}{}^{\top}\rangle. For a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr},

𝑾,(𝑿0𝑿+)𝑽ˇ\displaystyle\langle{\bm{W}},~{}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\check{\bm{V}}{}^{\top}\rangle
=1mki𝒚ki(𝒂ki𝑾𝒗ˇk)𝟙{C~mqki𝒚ki2|𝒚ki|2C~(1+ϵ1)q𝑿F2}\displaystyle=\frac{1}{m}\sum_{ki}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}\bm{y}_{ki}^{2}\leq|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}

Proceeding as in the proof of Lemma B.2,

1mki𝒚ki(𝒂ki𝑾𝒗ˇk)𝟙{C~mqki𝒚ki2|𝒚ki|2C~(1+ϵ1)q𝑿F2}\displaystyle\frac{1}{m}\sum_{ki}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}\bm{y}_{ki}^{2}\leq|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}
1mki|𝒚ki(𝒂ki𝑾𝒗ˇk)|𝟙{C~mqki𝒚ki2|𝒚ki|2C~(1+ϵ1)q𝑿F2},\displaystyle\leq\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\big{|}\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}\bm{y}_{ki}^{2}\leq|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}},
1mki|𝒚ki||(𝒂ki𝑾𝒗ˇk)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}.\displaystyle\leq\frac{1}{m}\sum_{ki}|\bm{y}_{ki}||(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})|\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}.

The summands are mutually independent sub-Gaussian r.v.s with norm KkiCC~(1+ϵ1)𝑾𝒗ˇk𝑿F/qK_{ki}\leq C\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{W}}\check{\bm{v}}_{k}\|\|{\bm{X}}^{*}\|_{F}/\sqrt{q}. Thus, we can apply the sub-Gaussian Hoeffding inequality Theorem 2.6.2 of [26]. Set t=ϵ1m𝑿Ft=\epsilon_{1}m\|{\bm{X}}^{*}\|_{F}. Then we have

t2kiKki2ϵ12m2𝑿F2(ki𝑾𝒗ˇk2)C~(1+ϵ1)𝑿F2/qϵ12mqCμ2κ2,\frac{t^{2}}{\sum_{ki}K^{2}_{ki}}\geq\frac{\epsilon_{1}^{2}m^{2}\|{\bm{X}}^{*}\|_{F}^{2}}{(\sum_{ki}\|{\bm{W}}\check{\bm{v}}_{k}\|^{2})\tilde{C}(1+\epsilon_{1})\|{\bm{X}}^{*}\|_{F}^{2}/q}\geq\frac{\epsilon_{1}^{2}mq}{C\mu^{2}\kappa^{2}},

where we used the fact that 𝑽ˇ𝑽ˇ=𝑰\check{\bm{V}}\check{\bm{V}}{}{}^{\top}=\bm{I} (the rows of 𝑽ˇ\check{\bm{V}} are right singular vectors) and thus \|{\bm{W}}\check{\bm{V}}\|_{F}^{2}=\mathrm{trace}({\bm{W}}\check{\bm{V}}\check{\bm{V}}{}^{\top}{\bm{W}}{}^{\top})=\|{\bm{W}}\|_{F}^{2}=1. Applying sub-Gaussian Hoeffding, we can conclude that, w.p. 1exp[cϵ12mq/μ2κ2]1-\exp\left[-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right],

1mki|𝒚ki(𝒂ki𝑾𝒗ˇk)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}\displaystyle\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}
ϵ1𝑿F\displaystyle\leq\epsilon_{1}\|{\bm{X}}^{*}\|_{F}
+1mki𝔼[|𝒚ki(𝒂ki𝑾𝒗ˇk)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}].\displaystyle\qquad+\frac{1}{m}\sum_{ki}\mathbb{E}\left[\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right].

We use Claim B.1 to bound the expectation term. Using this claim with αC~(1+ϵ1)𝑿F2/q\alpha\equiv\tilde{C}(1+\epsilon_{1})\|{\bm{X}}^{*}\|_{F}^{2}/q and 𝒛𝑾𝒗ˇk\bm{z}\equiv{\bm{W}}\check{\bm{v}}_{k},

1mki𝔼[|𝒚ki(𝒂ki𝑾𝒗ˇk)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}]\displaystyle\frac{1}{m}\sum_{ki}\mathbb{E}\left[\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]
1mkiC~(1+ϵ1)ϵ1𝑿F𝑾𝒗ˇk/qCϵ1μκ𝑿F.\displaystyle\leq\frac{1}{m}\sum_{ki}\sqrt{\tilde{C}(1+\epsilon_{1})}\epsilon_{1}\|{\bm{X}}^{*}\|_{F}\|{\bm{W}}\check{\bm{v}}_{k}\|/\sqrt{q}\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F}.

where the last inequality used Cauchy-Schwarz on k𝑾𝒗ˇk/q\sum_{k}\|{\bm{W}}\check{\bm{v}}_{k}\|/\sqrt{q} to conclude that k𝑾𝒗ˇk(1/q)k𝑾𝒗ˇk2k(1/q)=𝑾𝑽ˇF21=1\sum_{k}\|{\bm{W}}\check{\bm{v}}_{k}\|(1/\sqrt{q})\leq\sqrt{\sum_{k}\|{\bm{W}}\check{\bm{v}}_{k}\|^{2}\sum_{k}(1/q)}=\sqrt{\|{\bm{W}}\check{\bm{V}}\|_{F}^{2}\cdot 1}=1 since 𝑾𝑽ˇF=1\|{\bm{W}}\check{\bm{V}}\|_{F}=1.

By Proposition 4.8, the above bound holds for all 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr}, w.p. at least 1exp[nrlog(1+2/ϵnet)cϵ12mq/μ2κ2]1-\exp\left[nr\log(1+2/\epsilon_{net})-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]. ∎

Lemma B.5.

Consider 0<ϵ1<10<\epsilon_{1}<1. Then, w.p. 1exp[nrϵ12mq/μ2κ2]1-\exp\left[nr-\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]

(𝑿+𝔼[𝑿+])𝑽ˇFCϵ1𝑿F.\|\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right)\check{\bm{V}}{}^{\top}\|_{F}\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.5.

The proof is quite similar to the previous one. For a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr} we have,

(𝑿+𝔼[𝑿+])𝑽ˇ,𝑾\displaystyle\langle\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle
=1mki(𝒚ki(𝒂ki𝑾𝒗ˇk)𝟙{|𝒚ki|2C~(1+ϵ1)q𝑿F2}𝔼[.])\displaystyle=\frac{1}{m}\sum_{ki}\left(\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}-\mathbb{E}[.]\right)

where 𝔼[.]\mathbb{E}[.] is the expected value of the first term. The summands are independent, zero mean, sub-Gaussian r.v.s with sub-Gaussian norm KkiCC~(1+ϵ1)𝑿F𝑾𝒗ˇk/qK_{ki}\leq C\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}\|{\bm{W}}\check{\bm{v}}_{k}\|/\sqrt{q}. Thus, by applying the sub-Gaussian Hoeffding inequality, Theorem 2.6.2 of [26], with t=ϵ1m𝑿Ft=\epsilon_{1}m\|{\bm{X}}^{*}\|_{F}, and using 𝑾𝑽ˇF=1\|{\bm{W}}\check{\bm{V}}\|_{F}=1, we can conclude that, w.p. 1exp[ϵ12mq/(Cμ2κ2)]1-\exp\left[-\epsilon_{1}^{2}mq/(C\mu^{2}\kappa^{2})\right],

(𝑿+𝔼[𝑿+])𝑽ˇ,𝑾Cϵ1𝑿F.\langle\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

By Proposition 4.8, the above bound holds for all 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr} w.p. 1exp[nrϵ12mq/(Cμ2κ2)]1-\exp\left[nr-\epsilon_{1}^{2}mq/(C\mu^{2}\kappa^{2})\right]. ∎

B-C Bounding the 𝑼{\bm{U}}^{*}{} numerator term

We bound (𝑿0𝔼[𝑿+])𝑼F\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])^{\top}{\bm{U}}^{*}{}\|_{F} here. By triangle inequality, it is bounded by (𝑿0𝑿+)𝑼F+(𝑿+𝔼[𝑿+])𝑼F\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right){}^{\top}{\bm{U}}^{*}{}\|_{F}+\|\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right){}^{\top}{\bm{U}}^{*}{}\|_{F}.

Lemma B.6.

Assume that 1mqki𝐲ki2[1±ϵ1]𝐗F2/q\frac{1}{mq}\sum_{ki}\bm{y}_{ki}^{2}\in[1\pm\epsilon_{1}]\|{\bm{X}}^{*}\|_{F}^{2}/q. Then, w.p. 1exp[qrcϵ12mq/μ2κ2]1-\exp\left[qr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]

(𝑿0𝑿+)𝑼FCϵ1μκ𝑿F.\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right){}^{\top}{\bm{U}}^{*}{}\|_{F}\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.6.

The proof is similar to that of Lemmas B.2 and B.4. We have, (𝑿0𝑿+)𝑼F=max𝑾𝒮qr𝑾,(𝑿0𝑿+)𝑼.\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right){}^{\top}{\bm{U}}^{*}{}\|_{F}=\max_{{\bm{W}}\in\mathcal{S}_{qr}}\langle{\bm{W}},~{}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right){}^{\top}{\bm{U}}^{*}{}\rangle. For a fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr}, using the same approach as in Lemma B.2, and letting 𝒘k\bm{w}_{k} be the kk-th column of the r×qr\times q matrix 𝑾{\bm{W}},

𝑾,(𝑿0𝑿+)𝑼\displaystyle\langle{\bm{W}},~{}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right){}^{\top}{\bm{U}}^{*}{}\rangle
1mki|𝒚ki(𝒂ki𝑼𝒘k)|𝟙{C~mqki|𝒚ki|2|𝒚ki|2C~(1+ϵ1)q𝑿F2},\displaystyle\qquad\leq\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\big{|}\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}|\bm{y}_{ki}|^{2}\leq|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}},
1mki|𝒚ki(𝒂ki𝑼𝒘k)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}.\displaystyle\qquad\leq\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}.

The summands are now mutually independent sub-Gaussian r.v.s with norm KkiC~(1+ϵ1)𝒘k𝑿F/qK_{ki}\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\|\bm{w}_{k}\|\|{\bm{X}}^{*}\|_{F}/\sqrt{q}. Thus, we can apply the sub-Gaussian Hoeffding inequality Theorem 2.6.2 of [26], to conclude that, for a fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr}, w.p. 1exp[cϵ12mq/μ2κ2]1-\exp\left[-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right],

1mki|𝒚ki(𝒂ki𝑼𝒘k)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}\displaystyle\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}
ϵ1𝑿F+1mki𝔼[|𝒚ki(𝒂ki𝑼𝒘k)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}]\displaystyle\leq\epsilon_{1}\|{\bm{X}}^{*}\|_{F}+\frac{1}{m}\sum_{ki}\mathbb{E}\left[\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]

By Claim B.1, and using k𝒘k/qk𝒘k2k1/q=1\sum_{k}\|\bm{w}_{k}\|/\sqrt{q}\leq\sqrt{\sum_{k}\|\bm{w}_{k}\|^{2}}~{}\sqrt{\sum_{k}1/q}=1,

1mki𝔼[|𝒚ki(𝒂ki𝑼𝒘k)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}]\displaystyle\frac{1}{m}\sum_{ki}\mathbb{E}\left[\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]
1mkiϵ1𝒘kC~(1+ϵ1)/q𝑿F,\displaystyle\leq\frac{1}{m}\sum_{ki}\epsilon_{1}\|\bm{w}_{k}\|\sqrt{\tilde{C}(1+\epsilon_{1})/q}\|{\bm{X}}^{*}\|_{F},
Cϵ1μκ𝑿F,\displaystyle\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F},

By Proposition 4.8 (epsilon net argument), the bound holds for all unit norm 𝑾{\bm{W}} w.p. 1exp[qrcϵ12mq/μ2κ2]1-\exp\left[qr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]. ∎

Lemma B.7.

Consider 0<ϵ1<10<\epsilon_{1}<1. Then, w.p. 1exp[qrϵ12mq/(Cμ2κ2)]1-\exp\left[qr-\epsilon_{1}^{2}mq/(C\mu^{2}\kappa^{2})\right]

(𝑿+𝔼[𝑿+])𝑼FCϵ1𝑿F.\|\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right){}^{\top}{\bm{U}}^{*}{}\|_{F}\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.7.

For fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr},

trace(𝑾(𝑿+𝔼[𝑿+])𝑼)\displaystyle\mathrm{trace}\left({\bm{W}}{}^{\top}\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right){}^{\top}{\bm{U}}^{*}{}\right)
=1mki(𝒚ki(𝒂ki𝑼𝒘k)𝟙{|𝒚ki|2C~(1+ϵ1)q𝑿F2}\displaystyle\qquad=\frac{1}{m}\sum_{ki}\left(\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right.
𝔼[𝒚ki(𝒂ki𝑼𝒘k)𝟙{|𝒚ki|2C~(1+ϵ1)q𝑿F2}])\displaystyle\qquad-\left.\mathbb{E}\left[\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]\right)

The summands are independent zero mean sub-Gaussian r.v.s with sub-Gaussian norm KkiC~(1+ϵ1)𝑿F𝒘k/qK_{ki}\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}\|\bm{w}_{k}\|/\sqrt{q}. Thus, by applying the sub-Gaussian Hoeffding inequality, Theorem 2.6.2 of [26], with t=ϵ1m𝑿Ft=\epsilon_{1}m\|{\bm{X}}^{*}\|_{F}, we can conclude that, for a fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr}, w.p. 1exp[ϵ12mq/Cμ2κ2]1-\exp\left[-\epsilon_{1}^{2}mq/C\mu^{2}\kappa^{2}\right],

trace(𝑾(𝑿+𝔼[𝑿+])𝑼)ϵ1𝑿F.\mathrm{trace}\left({\bm{W}}{}^{\top}\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right){}^{\top}{\bm{U}}^{*}{}\right)\leq\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

By Proposition 4.8 (epsilon net argument), the bound holds for all unit norm 𝑾{\bm{W}} w.p. 1exp[qrϵ12mq/Cμ2κ2]1-\exp\left[qr-\epsilon_{1}^{2}mq/C\mu^{2}\kappa^{2}\right]. ∎

B-D Proof of Claim B.1

Proof.

We can write 𝒙=𝒙𝑸𝒆1\bm{x}^{*}=\|\bm{x}^{*}\|{\bm{Q}}\bm{e}_{1} where 𝑸{\bm{Q}} is a unitary matrix with first column proportional to 𝒙\bm{x}^{*}. We need to bound

𝔼[𝒙|(𝒂𝑸𝒆1)(𝒂𝑸𝑸𝒛)|𝟙{𝒙2|𝒂𝑸e1|2[1±ϵ]α}]\displaystyle\mathbb{E}[\|\bm{x}^{*}\|\cdot|(\bm{a}^{\top}{\bm{Q}}\bm{e}_{1})(\bm{a}^{\top}{\bm{Q}}{\bm{Q}}^{\top}\bm{z})|\mathbbm{1}_{\{\|\bm{x}^{*}\|^{2}|\bm{a}^{\top}{\bm{Q}}e_{1}|^{2}\in[1\pm\epsilon]\alpha}\}]
=𝒙𝒛𝔼[|𝒂~(1)𝒂~𝒛¯Q|𝟙{|𝒂~(1)|2[1±ϵ]β2}]\displaystyle=\|\bm{x}^{*}\|\cdot\|\bm{z}\|\cdot\mathbb{E}[|\tilde{\bm{a}}(1)\tilde{\bm{a}}^{\top}\bar{\bm{z}}_{Q}|\mathbbm{1}_{\{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}\}]

where 𝒛¯Q:=𝑸𝒛/𝒛\bar{\bm{z}}_{Q}:={\bm{Q}}^{\top}\bm{z}/\|\bm{z}\|, 𝒂~:=𝑸𝒂\tilde{\bm{a}}:={\bm{Q}}^{\top}\bm{a} and β:=α/𝒙\beta:=\sqrt{\alpha}/\|\bm{x}^{*}\|. Since 𝑸{\bm{Q}} is unitary and 𝒂\bm{a} is Gaussian, 𝒂~\tilde{\bm{a}} has the same distribution as 𝒂\bm{a}. Let 𝒂~(1)\tilde{\bm{a}}(1) be its first entry and 𝒂~(rest)\tilde{\bm{a}}(\mathrm{rest}) be the (n1)(n-1)-length vector containing its remaining entries; define 𝒛¯Q(1)\bar{\bm{z}}_{Q}(1) and 𝒛¯Q(rest)\bar{\bm{z}}_{Q}(\mathrm{rest}) similarly. Then, 𝒂~𝒛¯Q=𝒂~(1)𝒛¯Q(1)+𝒂~(rest)𝒛¯Q(rest)\tilde{\bm{a}}^{\top}\bar{\bm{z}}_{Q}=\tilde{\bm{a}}(1)\cdot\bar{\bm{z}}_{Q}(1)+\tilde{\bm{a}}(\mathrm{rest})^{\top}\bar{\bm{z}}_{Q}(\mathrm{rest}). Since 𝒂~(1)\tilde{\bm{a}}(1) and 𝒂~(rest)\tilde{\bm{a}}(\mathrm{rest}) are independent,

𝔼[|𝒂~(1)𝒂~𝒛¯Q|𝟙|𝒂~(1)|2[1±ϵ]β2]\displaystyle\mathbb{E}[|\tilde{\bm{a}}(1)\tilde{\bm{a}}^{\top}\bar{\bm{z}}_{Q}|\mathbbm{1}_{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}]
|𝒛¯Q(1)|𝔼[|𝒂~(1)2|𝟙|𝒂~(1)|2[1±ϵ]β2]\displaystyle\leq|\bar{\bm{z}}_{Q}(1)|\mathbb{E}[|\tilde{\bm{a}}(1)^{2}|\mathbbm{1}_{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}]
+𝔼[|𝒂~(rest)𝒛¯Q(rest)|]𝔼[|𝒂~(1)|𝟙|𝒂~(1)|2[1±ϵ]β2]\displaystyle\qquad+\mathbb{E}[|\tilde{\bm{a}}(\mathrm{rest})^{\top}\bar{\bm{z}}_{Q}(\mathrm{rest})|]\ \mathbb{E}[|\tilde{\bm{a}}(1)|\mathbbm{1}_{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}]
\displaystyle\leq 𝔼[|𝒂~(1)2|𝟙|𝒂~(1)|2[1±ϵ]β2]+2𝔼[|𝒂~(1)|𝟙|𝒂~(1)|2[1±ϵ]β2]\displaystyle\mathbb{E}[|\tilde{\bm{a}}(1)^{2}|\mathbbm{1}_{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}]+2\mathbb{E}[|\tilde{\bm{a}}(1)|\mathbbm{1}_{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}]
\displaystyle\leq ϵβ+2ϵβ=3ϵβ=Cϵα𝒙.\displaystyle\epsilon\beta+2\epsilon\beta=3\epsilon\beta=C\epsilon\frac{\sqrt{\alpha}}{\|\bm{x}^{*}\|}.

The second inequality used the facts that (i) |𝒛¯Q(1)|𝒛¯Q=1|\bar{\bm{z}}_{Q}(1)|\leq\|\bar{\bm{z}}_{Q}\|=1 by definition and (ii) ζ:=𝒂~(rest)𝒛¯Q(rest)\zeta:=\tilde{\bm{a}}(\mathrm{rest})^{\top}\bar{\bm{z}}_{Q}(\mathrm{rest}) is a zero-mean Gaussian r.v. with variance \|\bar{\bm{z}}_{Q}(\mathrm{rest})\|^{2}\leq 1, and so 𝔼[|ζ|]2\mathbb{E}[|\zeta|]\leq 2. The third one relies on the following two bounds:

  1.
    𝔼[|𝒂(1)|2𝟙{|𝒂(1)|2[1±ϵ]β2}]\displaystyle\mathbb{E}\left[|\bm{a}(1)|^{2}\mathbbm{1}_{\left\{|\bm{a}(1)|^{2}\in[1\pm\epsilon]\beta^{2}\right\}}\right]
    =22π1ϵβ1+ϵβz2exp(z2/2)𝑑z,\displaystyle=\frac{2}{\sqrt{2\pi}}\int_{\sqrt{1-\epsilon}\beta}^{\sqrt{1+\epsilon}\beta}z^{2}\exp(-z^{2}/2)dz,
    \displaystyle\leq\frac{4e^{-1}}{\sqrt{2\pi}}\int_{\sqrt{1-\epsilon}\beta}^{\sqrt{1+\epsilon}\beta}dz\leq\frac{4e^{-1}}{\sqrt{2\pi}}\sqrt{2}\,\epsilon\beta\leq\epsilon\beta

    where we used the facts that z^{2}\exp(-z^{2}/2)\leq 2e^{-1} for all z\in\Re, and \sqrt{1+\epsilon}-\sqrt{1-\epsilon}=2\epsilon/(\sqrt{1+\epsilon}+\sqrt{1-\epsilon})\leq\sqrt{2}\,\epsilon for 0<\epsilon<1.

  2.

    Similarly, we can show that

    𝔼[|𝒂(1)|𝟙{|𝒂(1)|2[1±ϵ]β2}]\displaystyle\mathbb{E}\left[|\bm{a}(1)|\mathbbm{1}_{\{|\bm{a}(1)|^{2}\in[1\pm\epsilon]\beta^{2}\}}\right]
    =22π1ϵβ1+ϵβzexp(z2/2)𝑑z,\displaystyle=\frac{2}{\sqrt{2\pi}}\int_{\sqrt{1-\epsilon}\beta}^{\sqrt{1+\epsilon}\beta}z\exp(-z^{2}/2)dz,
    \displaystyle\leq\frac{2e^{-1/2}}{\sqrt{2\pi}}\int_{\sqrt{1-\epsilon}\beta}^{\sqrt{1+\epsilon}\beta}dz\leq\frac{2e^{-1/2}}{\sqrt{2\pi}}\sqrt{2}\,\epsilon\beta\leq\epsilon\beta

The claim follows by combining the two equations given above. ∎
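As a quick sanity check on Claim B.1, the following Monte Carlo sketch (Python/NumPy) estimates the left hand side 𝔼[|y(a⊤z)|𝟙{y²∈[1±ϵ]α}] and compares it against ϵ‖z‖√α. The dimension, the vectors x* and z, the threshold α, the value of ϵ, and the sample size below are all arbitrary illustrative choices; the claim asserts only that the printed ratio remains bounded by an absolute constant.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x_star = rng.standard_normal(n)               # a fixed (deterministic) x*
z = rng.standard_normal(n)                    # a fixed (deterministic) z
alpha = 2.0 * np.linalg.norm(x_star) ** 2     # an illustrative threshold level
eps = 0.05

# Monte Carlo estimate of E[ |y (a^T z)| 1{ y^2 in [1 +/- eps] alpha } ] with y = a^T x*.
N = 200_000
a = rng.standard_normal((N, n))
y = a @ x_star
az = a @ z
mask = (y ** 2 >= (1 - eps) * alpha) & (y ** 2 <= (1 + eps) * alpha)
lhs = np.mean(np.abs(y * az) * mask)

rhs = eps * np.linalg.norm(z) * np.sqrt(alpha)
print(f"LHS ~ {lhs:.4f},  eps*||z||*sqrt(alpha) = {rhs:.4f},  ratio ~ {lhs / rhs:.2f}")
```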

References

  • [1] S. Negahban, M. J. Wainwright et al., “Estimation of (near) low-rank matrices with noise and high-dimensional scaling,” The Annals of Statistics, vol. 39, no. 2, pp. 1069–1097, 2011.
  • [2] P. Netrapalli, P. Jain, and S. Sanghavi, “Low-rank matrix completion using alternating minimization,” in Annual ACM Symp. on Th. of Comp. (STOC), 2013.
  • [3] E. J. Candes and B. Recht, “Exact matrix completion via convex optimization,” Found. of Comput. Math, no. 9, pp. 717–772, 2008.
  • [4] S. Nayer, P. Narayanamurthy, and N. Vaswani, “Phaseless PCA: Low-rank matrix recovery from column-wise phaseless measurements,” in Intl. Conf. Machine Learning (ICML), 2019.
  • [5] ——, “Provable low rank phase retrieval,” IEEE Trans. Info. Th., March 2020.
  • [6] S. Nayer and N. Vaswani, “Sample-efficient low rank phase retrieval,” IEEE Trans. Info. Th., 2021.
  • [7] R. S. Srinivasa, K. Lee, M. Junge, and J. Romberg, “Decentralized sketching of low rank matrices,” in Neur. Info. Proc. Sys. (NeurIPS), 2019, pp. 10 101–10 110.
  • [8] Z.-P. Liang, “Spatiotemporal imaging with partially separable functions,” in 4th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 2007, pp. 988–991.
  • [9] S. G. Lingala, Y. Hu, E. DiBella, and M. Jacob, “Accelerated dynamic mri exploiting sparsity and low-rank structure: kt slr,” IEEE Transactions on Medical Imaging, vol. 30, no. 5, pp. 1042–1054, 2011.
  • [10] J. Yao, Z. Xu, X. Huang, and J. Huang, “An efficient algorithm for dynamic mri using low-rank and total variation regularizations,” Medical Image Analysis, vol. 44, pp. 14–27, 2018.
  • [11] F. P. Anaraki and S. Hughes, “Memory and computation efficient pca via very sparse random projections,” in Intl. Conf. Machine Learning (ICML), 2014, pp. 1341–1349.
  • [12] A. Krishnamurthy, M. Azizyan, and A. Singh, “Subspace learning from extremely compressed measurements,” arXiv preprint arXiv:1404.0751, 2014.
  • [13] S. Babu, S. Nayer, S. G. Lingala, and N. Vaswani, “Fast low rank compressive sensing for accelerated dynamic mri,” in IEEE Intl. Conf. Acoustics, Speech, Sig. Proc. (ICASSP), 2022, to appear.
  • [14] P. Jain and P. Netrapalli, “Fast exact matrix completion with finite samples,” in Conf. on Learning Theory, 2015, pp. 1007–1034.
  • [15] Y. Cherapanamjeri, K. Gupta, and P. Jain, “Nearly-optimal robust matrix completion,” ICML, 2016.
  • [16] X. Yi, D. Park, Y. Chen, and C. Caramanis, “Fast algorithms for robust pca via gradient descent,” in Neur. Info. Proc. Sys. (NeurIPS), 2016.
  • [17] Q. Zheng and J. Lafferty, “Convergence analysis for rectangular matrix completion using burer-monteiro factorization and gradient descent,” arXiv preprint arXiv:1605.07051, 2016.
  • [18] S. Lang, Real and Functional Analysis.   Springer-Verlag, New York 10:11–13, 1993.
  • [19] C. Ma, K. Wang, Y. Chi, and Y. Chen, “Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution,” in Intl. Conf. Machine Learning (ICML), 2018.
  • [20] Y. Chen and E. Candes, “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” in Neur. Info. Proc. Sys. (NeurIPS), 2015, pp. 739–747.
  • [21] H. Zhang, Y. Zhou, Y. Liang, and Y. Chi, “A nonconvex approach for phase retrieval: Reshaped wirtinger flow and incremental algorithms,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 5164–5198, 2017.
  • [22] G. Jagatap, Z. Chen, S. Nayer, C. Hegde, and N. Vaswani, “Sample efficient fourier ptychography for structured data,” IEEE Trans. Comput. Imaging, vol. 6, pp. 344–357, 2020.
  • [23] Y. Chen, Y. Chi, and A. J. Goldsmith, “Exact and stable covariance estimation from quadratic sampling via convex programming,” IEEE Transactions on Information Theory, vol. 61, no. 7, pp. 4034–4059, 2015.
  • [24] G. H. Golub and C. F. Van Loan, “Matrix computations,” The Johns Hopkins University Press, Baltimore, USA, 1989.
  • [25] M. Hardt and E. Price, “The noisy power method: A meta algorithm with applications,” in Neur. Info. Proc. Sys. (NeurIPS), 2014, pp. 2861–2869.
  • [26] R. Vershynin, High-dimensional probability: An introduction with applications in data science.   Cambridge University Press, 2018, vol. 47.
  • [27] P.-Å. Wedin, “Perturbation bounds in connection with singular value decomposition,” BIT Numerical Mathematics, vol. 12, no. 1, pp. 99–111, 1972.
  • [28] Y. Chen, Y. Chi, J. Fan, and C. Ma, “Spectral methods for data science: A statistical perspective,” arXiv preprint arXiv:2012.08496, 2020.
  • [29] R. Vershynin, Introduction to the non-asymptotic analysis of random matrices.   Cambridge Univ. Press, Cambridge, 2012.
  • [30] P. Netrapalli, P. Jain, and S. Sanghavi, “Phase retrieval using alternating minimization,” in Neur. Info. Proc. Sys. (NeurIPS), 2013, pp. 2796–2804.
  • [31] T. Cai, X. Li, and Z. Ma, “Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow,” The Annals of Statistics, vol. 44, no. 5, pp. 2221–2251, 2016.
  • [32] L. Erdos, A. Knowles, H. Yau, and J. Yin, “Spectral statistics of erdos–rényi graphs i: Local semicircle law,” The Annals of Probability, vol. 41, no. 3B, pp. 2279–2375, 2013.
  • [33] J. A. Tropp, “User-friendly tail bounds for sums of random matrices,” Found. Comput. Math., vol. 12, no. 4, 2012.

Author Biographies

Seyedehsara Nayer (Email: [email protected]) recently completed her Ph.D. in ECE at Iowa State University. She has an M.S. from Sharif University in Iran. She works as a Senior Engineer at ASML in Santa Clara, CA. Her research interests span various aspects of information science, with a focus on Signal Processing and Statistical Machine Learning.

Namrata Vaswani (Email: [email protected]) received a B.Tech from IIT-Delhi in India in 1999 and a Ph.D. from the University of Maryland, College Park in 2004, both in Electrical Engineering. Since Fall 2005, she has been with Iowa State University, where she is currently the Anderlik Professor of Electrical and Computer Engineering. Her research interests lie in data science, with a particular focus on Statistical Machine Learning and Signal Processing. She has served two terms as an Associate Editor for the IEEE Transactions on Signal Processing; as a lead guest-editor for a 2018 Proceedings of the IEEE Special Issue (Rethinking PCA for modern datasets); and as an Area Editor for the IEEE Signal Processing Magazine (2018-2020). Vaswani is a recipient of the Iowa State Early Career Engineering Faculty Research Award (2014), the Iowa State University Mid-Career Achievement in Research Award (2019) and the University of Maryland’s ECE Distinguished Alumni Award (2019). She also received the 2014 IEEE Signal Processing Society Best Paper Award for her 2010 IEEE Transactions on Signal Processing paper co-authored with her student Wei Lu on “Modified-CS: Modifying compressive sensing for problems with partially known support”. She is an IEEE Fellow (class of 2019).