

Fast and Sample-Efficient
Federated Low Rank Matrix Recovery
from column-wise Linear and Quadratic Projections

Seyedehsara (Sara) Nayer and Namrata Vaswani
Dept. of Electrical and Computer Engineering, Iowa State University, USA.
Email: [email protected]
Abstract

We study the following lesser-known low rank (LR) recovery problem: recover an n×qn\times q rank-rr matrix, 𝑿=[𝒙1,𝒙2,,𝒙q]{\bm{X}}^{*}=[\bm{x}^{*}_{1},\bm{x}^{*}_{2},...,\bm{x}^{*}_{q}], with rmin(n,q)r\ll\min(n,q), from mm independent linear projections of each of its qq columns, i.e., from 𝒚k:=𝑨k𝒙k,k[q]\bm{y}_{k}:=\bm{A}_{k}\bm{x}^{*}_{k},k\in[q], when 𝒚k\bm{y}_{k} is an mm-length vector with m<nm<n. The matrices 𝑨k\bm{A}_{k} are known and mutually independent for different kk. We introduce a novel gradient descent (GD) based solution called AltGD-Min. We show that, if the 𝑨k\bm{A}_{k}s are i.i.d. with i.i.d. Gaussian entries, and if the right singular vectors of 𝑿{\bm{X}}^{*} satisfy the incoherence assumption, then ϵ\epsilon-accurate recovery of 𝑿{\bm{X}}^{*} is possible with order (n+q)r2log(1/ϵ)(n+q)r^{2}\log(1/\epsilon) total samples and order mqnrlog(1/ϵ)mqnr\log(1/\epsilon) time. Compared with existing work, this is the fastest solution. For ϵ<1/r1/4\epsilon<1/r^{1/4}, it also has the best sample complexity. A simple extension of AltGD-Min also provably solves LR Phase Retrieval, which is a magnitude-only generalization of the above problem.

AltGD-Min factorizes the unknown 𝑿{\bm{X}} as 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B} where 𝑼{\bm{U}} has rr columns and 𝑩\bm{B} has rr rows. It alternates between a (projected) GD step for updating 𝑼{\bm{U}}, and a minimization step for updating 𝑩\bm{B}. Each iteration of AltGD-Min is as fast as an iteration of regular projected GD because the minimization over 𝑩\bm{B} decouples column-wise. At the same time, we can prove exponential error decay for it, which we are unable to do for projected GD. Finally, it can also be efficiently federated with a communication cost of only nrnr per node, instead of nqnq for projected GD.

I Introduction

This work develops a sample-efficient, fast, and communication-efficient gradient descent (GD) solution, called AltGD-Min, for provably recovering a low-rank (LR) matrix from a set of mutually independent linear projections of each of its columns. The communication-efficiency considers a federated setting. This problem, which we henceforth refer to as “Low Rank column-wise Compressive Sensing (LRcCS)”, is precisely defined below. Unlike the other well-studied LR problems – multivariate regression (MVR) [1], LR matrix sensing [2] and LR matrix completion (LRMC) [3, 2] – LRcCS has received little attention so far in terms of approaches with provable guarantees. There are only two existing provably correct solutions. (1) Its generalization LR phase retrieval (LRPR), was studied in our recent work [4, 5, 6] where we developed a provably correct alternating minimization (AltMin) solution. Since LRPR is a generalization, the algorithm also solves LRcCS. (2) In parallel work, [7] developed and analyzed a convex relaxation (mixed-norm minimization) for LRcCS. Both solutions are much slower than GD-based methods, and, in most practical settings, also have worse sample complexity.

LRcCS occurs in accelerated LR dynamic MRI [8, 9, 10], and in distributed/federated sketching [11, 12, 7]. We explain these in Sec. I-D. We show the speed and performance advantage of AltGD-Min for dynamic MRI in [13].

I-A Problem Setting, Notation, and Assumption

Problem definition. The goal is to recover an n×qn\times q rank-rr matrix 𝑿=[𝒙1,𝒙2,,𝒙q]{\bm{X}}^{*}=[\bm{x}^{*}_{1},\bm{x}^{*}_{2},\dots,\bm{x}^{*}_{q}] from mm linear projections (sketches) of each of its qq columns, i.e. from

𝒚k:=𝑨k𝒙k,k[q]\displaystyle\bm{y}_{k}:=\bm{A}_{k}\bm{x}^{*}_{k},\ k\in[q] (1)

where each 𝒚k\bm{y}_{k} is an mm-length vector, [q]:={1,2,,q}[q]:=\{1,2,\dots,q\}, and the measurement/sketching matrices 𝑨k\bm{A}_{k} are mutually independent and known. The setting of interest is low-rank (LR), rmin(n,q)r\ll\min(n,q), and undersampled measurements, m<nm<n. Our guarantees assume that each 𝑨k\bm{A}_{k} is random-Gaussian: each entry of it is independent and identically distributed (i.i.d.) standard Gaussian.

We also study the magnitude-only measurements’ setting, LRPR [4, 5, 6]. This involves recovering 𝑿{\bm{X}}^{*} from

𝒚(mag)k:=|𝑨k𝒙k|,k[q].{\bm{y}_{(mag)}}_{k}:=|\bm{A}_{k}\bm{x}^{*}_{k}|,\ k\in[q].

Here |𝒛||\bm{z}| takes the entry-wise absolute value of entries of the vector 𝒛\bm{z}.

Notation. Everywhere, .F\|.\|_{F} denotes the Frobenius norm, .\|.\| without a subscript denotes the (induced) l2l_{2} norm (often called the operator norm or spectral norm), 𝑴max\|\bm{M}\|_{\max} is the maximum magnitude entry of the matrix 𝑴\bm{M}, denotes matrix or vector transpose, and |𝒛||\bm{z}| for a vector 𝒛\bm{z} denotes element-wise absolute values. 𝑰n\bm{I}_{n} (or sometimes just 𝑰\bm{I}) denotes the n×nn\times n identity matrix. We use 𝒆k\bm{e}_{k} to denote the kk-th canonical basis vector, i.e., the kk-th column of 𝑰\bm{I}. For any matrix 𝒁{\bm{{Z}}}, 𝒛k\bm{z}_{k} denotes its kk-th column.

We say 𝑼{\bm{U}} is a basis matrix if it contains orthonormal columns. For basis matrices 𝑼1,𝑼2{\bm{U}}_{1},{\bm{U}}_{2}, we use

SD(𝑼1,𝑼2):=(𝑰𝑼1𝑼1)𝑼2F\mathrm{SD}({\bm{U}}_{1},{\bm{U}}_{2}):=\|(\bm{I}-{\bm{U}}_{1}{\bm{U}}_{1}{}^{\top}){\bm{U}}_{2}\|_{F}

as the Subspace Distance (SD) measure. For two rr-dimensional subspaces, this is the l2l_{2} norm of the sines of the rr principal angles between span(𝑼1)\mathrm{span}({\bm{U}}_{1}) and span(𝑼2)\mathrm{span}({\bm{U}}_{2}). SD(𝑼1,𝑼2)\mathrm{SD}({\bm{U}}_{1},{\bm{U}}_{2}) is symmetric when 𝑼1,𝑼2{\bm{U}}_{1},{\bm{U}}_{2} are both n×rn\times r basis matrices. Notice here we are using the Frobenius SD, unlike many recent works including our older work [5] that use the induced 2-norm one. This is done because it enables us to prove the desired guarantees easily. We reuse the letters c,Cc,C to denote different numerical constants in each use with the convention that c<1c<1 and C1C\geq 1. The notation aΩ(b)a\in\Omega(b) means aCba\geq Cb while aO(b)a\in O(b) means aCba\leq Cb. We use 𝟙statement\mathbbm{1}_{\text{statement}} to denote an indicator function that takes the value 1 if statement is true and zero otherwise.
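
As a concrete illustration, a minimal numpy sketch of this Frobenius subspace distance (the function name is ours) could be:

```python
import numpy as np

def subspace_dist(U1, U2):
    """Frobenius subspace distance SD(U1, U2) = ||(I - U1 U1^T) U2||_F.
    U1, U2: n x r basis matrices (orthonormal columns)."""
    residual = U2 - U1 @ (U1.T @ U2)  # (I - U1 U1^T) U2, without forming I explicitly
    return np.linalg.norm(residual, 'fro')
```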

For a vector 𝒘\bm{w}, we sometimes use 𝒘(k)\bm{w}(k) to denote the kk-th entry of 𝒘\bm{w}. For a vector 𝒘\bm{w} and a scalar α\alpha, 𝟙(𝒘α)\mathbbm{1}(\bm{w}\leq\alpha) returns a vector of 1s and 0s of the same length as 𝒘\bm{w}, with 1s where (𝒘(k)α)(\bm{w}(k)\leq\alpha) and zero everywhere else. We use \circ to denote the Hadamard product. Thus 𝒛:=𝒘𝟙(𝒘α)\bm{z}:=\bm{w}\circ\mathbbm{1}(\bm{w}\leq\alpha) zeroes out entries of 𝒘\bm{w} larger than α\alpha, while keeping the smaller ones as is.

For 𝑿{\bm{X}}^{*} which is a rank-rr matrix, we let

𝑿=SVD𝑼𝚺𝑽𝑩:=𝑼𝑩{\bm{X}}^{*}\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}\underbrace{{\bm{\Sigma}^{*}}{\bm{V}^{*}}{}}_{\bm{B}^{*}}:={\bm{U}}^{*}{}\bm{B}^{*}

denote its reduced (rank rr) SVD, i.e., 𝑼{\bm{U}}^{*}{} and 𝑽{\bm{V}^{*}}^{\top} are matrices with orthonormal columns (basis matrices), 𝑼{\bm{U}}^{*}{} is n×rn\times r and 𝑽{\bm{V}^{*}} is r×qr\times q, and 𝚺{\bm{\Sigma}^{*}} is an r×rr\times r diagonal matrix with non-negative entries. We use κ:=σmax/σmin\kappa:={\sigma_{\max}^{*}}/{\sigma_{\min}^{*}} to denote the condition number of 𝚺{\bm{\Sigma}^{*}}. This is not the condition number of 𝑿{\bm{X}}^{*} (whose minimum singular value is zero). We let 𝑩:=𝚺𝑽\bm{B}^{*}:={\bm{\Sigma}^{*}}\bm{V}^{*}{}{} and we use 𝒃k\bm{b}^{*}_{k} to denote its kk-th column.

We use the phrase ϵ\epsilon-accurate recovery to refer to SD(𝑼,𝑼)ϵ\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})\leq\epsilon or 𝑿𝑿Fϵ𝑿F\|{\bm{X}}-{\bm{X}}^{*}\|_{F}\leq\epsilon\|{\bm{X}}^{*}\|_{F} or both.

Assumption. Another way to understand (1) is as follows: each scalar measurement 𝒚ki\bm{y}_{ki} (ii-th entry of 𝒚k\bm{y}_{k}) satisfies

𝒚ki:=𝒂ki,𝒙k,i[m],k[q]\bm{y}_{ki}:=\langle\bm{a}_{ki},\bm{x}^{*}_{k}\rangle,\ i\in[m],\ k\in[q]

with 𝒂ki\bm{a}_{ki}{}^{\top} being the ii-th row of 𝑨k\bm{A}_{k}. Observe that the measurements are not global, i.e., no 𝒚ki\bm{y}_{ki} is a function of the entire matrix 𝑿{\bm{X}}^{*}. They are global for each column (𝒚ki\bm{y}_{ki} is a function of column 𝒙k\bm{x}^{*}_{k}) but not across the different columns. We thus need an assumption that enables correct interpolation across the different columns. The following assumption, which is a slightly weaker version of incoherence (w.r.t. the canonical basis) of right singular vectors suffices for this purpose.

Assumption 1.1 ((Weakened) Right Singular Vectors’ Incoherence).

Assume that

maxk𝒃kσmaxμr/q.\max_{k}\|\bm{b}^{*}_{k}\|\leq{\sigma_{\max}^{*}}\mu\sqrt{r/q}.

for a constant μ1\mu\geq 1 (μ\mu does not grow with n,q,rn,q,r). Since 𝐱k=𝐛k\|\bm{x}^{*}_{k}\|=\|\bm{b}^{*}_{k}\|, this implies that maxk𝐱kσmaxμr/q\max_{k}\|\bm{x}^{*}_{k}\|\leq{\sigma_{\max}^{*}}\mu\sqrt{r/q}. Also, since σminr𝐗F{\sigma_{\min}^{*}}\sqrt{r}\leq\|{\bm{X}}^{*}\|_{F}, this also implies that maxk𝐱kκμ𝐗F/q\max_{k}\|\bm{x}^{*}_{k}\|\leq\kappa\mu{\|{\bm{X}}^{*}\|_{F}}/{\sqrt{q}}.

Right singular vectors incoherence is the assumption maxk𝒗kμr/q\max_{k}\|\bm{v}^{*}_{k}\|\leq\mu\sqrt{r/q}. Since 𝒃k=𝚺𝒗k\bm{b}^{*}_{k}={\bm{\Sigma}^{*}}\bm{v}^{*}_{k}, this implies that the above holds. Incoherence of both left and right singular vectors was introduced for guaranteeing correct “interpolation” for the LRMC problem [3, 2].

I-B Existing Work

Existing solutions for LRcCS and LRPR. Since it is always possible to obtain magnitude-only measurements 𝒚(mag)k{\bm{y}_{(mag)}}_{k} from linear ones 𝒚k\bm{y}_{k} as 𝒚(mag)k=|𝒚k|{\bm{y}_{(mag)}}_{k}=|\bm{y}_{k}|, a solution to LRPR also automatically solves LRcCS under the same assumptions. Hence the AltMin algorithm for LRPR from [4, 5] is the first provably correct solution for LRcCS. Of course, since LRcCS is an easier problem than LRPR, we expect a direct solution to LRcCS to need weaker assumptions. As we show in this paper, this is indeed true. A more recent work [7] studied the noisy version of LRcCS and developed a convex relaxation (mixed norm minimization) to provably solve it. Its time complexity is not discussed in the paper; however, it is well known that solvers for convex programs are much slower when compared to direct iterative algorithms: they either require a number of iterations proportional to 1/ϵ1/\sqrt{\epsilon} or the per-iteration cost has cubic dependence on the problem size (here ((n+q)r)3((n+q)r)^{3}) [2]. Thus, if qnq\leq n, its time complexity is O(mqnrmin(1/ϵ,n3r3))O(mqnr\cdot\min(1/\sqrt{\epsilon},n^{3}r^{3})). In [6], we provided the best possible guarantee for the AltMin algorithm for solving LRPR, and hence LRcCS. We discuss these results in detail in Sec. II-D and summarize them in Table I.

Other well-studied LR recovery problems. The multivariate regression (MVR) problem, studied in [1], is our problem with 𝑨k=𝑨\bm{A}_{k}=\bm{A}. However this is a very different setting than ours because, with 𝑨k=𝑨\bm{A}_{k}=\bm{A}, the different 𝒚k\bm{y}_{k}’s are no longer mutually independent. As a result, one cannot exploit law of large numbers’ arguments over all mqmq scalar measurements 𝒚ki\bm{y}_{ki}. Consequently, the required value of mm can never be less than nn. The result of [1] shows that mm of order (n+q)r(n+q)r is both necessary and sufficient. LRMS involves recovering 𝑿{\bm{X}}^{*} from 𝒚i=𝑨i,𝑿,i=1,2,,mq\bm{y}_{i}=\langle\bm{A}_{i},{\bm{X}}^{*}\rangle,\ i=1,2,\dots,mq with 𝑨i\bm{A}_{i} being dense matrices, typically i.i.d. Gaussian [2]. Thus all measurements are i.i.d. and global: each contains information about the entire quantity-of-interest, here 𝑿{\bm{X}}^{*}. Because of this, for LRMS, one can prove a LR Restricted Isometry Property (RIP) that simplifies the rest of the analysis. This is what makes it very different from, and easier than, our problem.

LRMC, which involves recovering 𝑿{\bm{X}}^{*} from a subset of its observed entries, is the most closely related problem to ours since it also involves recovery from non-global measurements. The typical model assumed is that each matrix entry is observed with probability pp independent of others [3, 2]. Setting unobserved entries to zero, this can be written as 𝒚jk=δjk𝑿jk\bm{y}_{jk}=\delta_{jk}{\bm{X}}^{*}_{jk} with δjkiidBernoulli(p)\delta_{jk}\stackrel{{\scriptstyle\mathrm{iid}}}{{\thicksim}}Bernoulli(p). LRMC measurements are both row-wise and column-wise local. To allow correct interpolation across both rows and columns, it needs the incoherence assumption on both its left and right singular vectors. For our problem, the measurements are global for each column, but not across the different columns. For this reason, only right singular vectors’ incoherence is needed. In fact, because of the nature of our measurements, even if left incoherence were assumed, it would not help. This asymmetry in our measurement model and the fact that our measurements are unbounded (each 𝐲ki\bm{y}_{ki} is a Gaussian r.v) are two key differences between LRMC and LRcCS that prevent us from borrowing LRMC proof techniques for our work. Here symmetric means: if we replace 𝑿{\bm{X}}^{*} by its transpose, the probability distribution of the set of measurements does not change. Bounded means that the measurements’ magnitude has a uniform bound. This bound is 𝑿max\|{\bm{X}}^{*}\|_{\max} for LRMC measurements.

Non-convex (iterative, not convex relaxation based) LRMC algorithms with the best sample complexity are GD-based. There are two common approaches for designing GD algorithms in the LR recovery literature, and in particular for LRMC. The first is to use standard projected GD on 𝑿{\bm{X}} (projGD-X), also referred to as Iterative Hard Thresholding: at each iteration, perform one step of GD for minimizing the squared loss cost function, f~(𝑿)\tilde{f}({\bm{X}}), w.r.t. 𝑿{\bm{X}}, followed by projecting the resulting matrix onto the space of rank rr matrices (by SVD). This was studied in [14, 15] for solving LRMC. This is shown to converge geometrically with a constant GD step size, while needing only Ω((n+q)r2log2nlog2(1/ϵ))\Omega((n+q)r^{2}\log^{2}n\log^{2}(1/\epsilon)) samples on average.

The second is to let 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B} where 𝑼{\bm{U}} is n×rn\times r and 𝑩\bm{B} is r×qr\times q and perform alternating GD for the cost function f(𝑼,𝑩):=f~(𝑼𝑩)f({\bm{U}},\bm{B}):=\tilde{f}({\bm{U}}\bm{B}), i.e., update 𝑩\bm{B} with one step of GD for minimizing f(𝑼,𝑩)f({\bm{U}},\bm{B}) while keeping 𝑼{\bm{U}} fixed at its previous value, and then do the same for 𝑼{\bm{U}} with 𝑩\bm{B} fixed, and repeat. Since the 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B} factorization is not unique, i.e., 𝑿=𝑼𝑹1𝑹𝑩{\bm{X}}={\bm{U}}{\bm{R}}^{-1}{\bm{R}}\bm{B} for any invertible r×rr\times r matrix 𝑹{\bm{R}}, this approach can result in the norm of one of 𝑼{\bm{U}} or 𝑩\bm{B} growing in an unbounded fashion, while that of the other decreases at the same rate, causing numerical problems. A typical approach to resolve this issue, and one that was used for LRMC [16, 17], is to change the cost function to minimize to f(𝑼,𝑩)+λf2(𝑼,𝑩)f({\bm{U}},\bm{B})+\lambda f_{2}({\bm{U}},\bm{B}) where f2(𝑼,𝑩):=𝑼𝑼𝑩𝑩Ff_{2}({\bm{U}},\bm{B}):=\|{\bm{U}}^{\top}{\bm{U}}-\bm{B}\bm{B}^{\top}\|_{F} is the “norm-balancing term” (helps ensure that norms of 𝑼{\bm{U}} and 𝑩\bm{B} remain similar). We henceforth refer to this approach as altGDnormbal. The sample complexity bound for this approach is similar to that for projGD-X. But, it needs a GD step size of order 1/r1/r or smaller [16, 17]; making it rr-times slower than projGD-X.

Method               | Sample Comp. (mq ≳)             | Time Comp.                       | Communic. Comp. per node (predicted) | Holds for all 𝑿*? | Column-wise error bound?
---------------------|---------------------------------|----------------------------------|--------------------------------------|-------------------|-------------------------
Convex [7]           | nr (1/ϵ^4)                      | linear-time ⋅ min(1/√ϵ, n^3 r^3) | not clear                            | yes               | no
AltMin [4, 5]        | nr^4 log(1/ϵ)                   | linear-time ⋅ r log^2(1/ϵ)       | nr log(1/ϵ) ⋅ r log^2(1/ϵ)           | no                |
AltMin [6]           | nr^2 (r + log(1/ϵ))             | linear-time ⋅ r log^2(1/ϵ)       | nr log(1/ϵ) ⋅ r log^2(1/ϵ)           | no                | yes
AltGD-Min (proposed) | nr^2 log(1/ϵ)                   | linear-time ⋅ r log(1/ϵ)         | nr ⋅ r log(1/ϵ)                      | no                | yes

Best sample-complexity LRMC algorithms among those that do not solve a convex relaxation:

ProjGD-X [15]        | max(n,q) r^2 log^2 n log^2(1/ϵ) | linear-time ⋅ r log(1/ϵ)         | nq **                                |                   |
AltGDnormbal [16]    | max(n,q) r^2 log n              | linear-time ⋅ r^2 log(1/ϵ)       | max(n,q) r                           |                   |

**The communication complexity of ProjGD-X would be nqnq because the gradient w.r.t. 𝑿{\bm{X}} computed at each node will need to be transmitted by the nodes to the center. The gradient w.r.t. 𝑿{\bm{X}} is not low rank (LR), and hence one cannot transmit just its rank rr SVD.

TABLE I: Existing work versus our work. For brevity, this table assumes qnq\leq n and treats κ,μ\kappa,\mu as numerical constants. All approaches also need mmax(r,logq,logn)m\geq\max(r,\log q,\log n). Column-wise error bound exists means maxk𝒙k𝒙k/𝒙kϵ\max_{k}\|\bm{x}^{*}_{k}-\bm{x}_{k}\|/\|\bm{x}^{*}_{k}\|\leq\epsilon holds in addition to a similar bound on matrix Frobenius norm error. Linear-time is the time needed to read all algorithm inputs. For LRcCS, this is 𝒚k,𝑨k\bm{y}_{k},\bm{A}_{k} for all k[q]k\in[q] and thus linear-time is order mnqmnq. For LRMC, this is the set of observed entries and their locations and thus linear-time is order mqmq. None of the other algorithms have been studied in the federated context and hence the communication complexity (Comm. Comp.) listed in the fourth column is based on our understanding of how one would federate the algorithm. Notice that AltGD-Min has the best time and communication complexities; and for ϵ<1/r1/4\epsilon<1/r^{1/4}, it also has the best sample complexity.

I-C Contributions and Novelty

Contribution to solving LRcCS and LRPR. (1) This work develops a novel GD-based solution to LRcCS, called AltGD-Min, that is fast and communication-efficient. We show that, with high probability (w.h.p.), AltGD-Min obtains an ϵ\epsilon-accurate estimate in order κ2log(1/ϵ)\kappa^{2}\log(1/\epsilon) iterations, as long as Assumption 1.1 holds, the matrices 𝑨k\bm{A}_{k} are i.i.d., with each containing i.i.d. standard Gaussian entries, mqΩ(κ6μ2(n+q)r2log(1/ϵ))mq\in\Omega(\kappa^{6}\mu^{2}(n+q)r^{2}\log(1/\epsilon)), and mΩ(max(logq,logn)log(1/ϵ))m\in\Omega(\max(\log q,\log n)\log(1/\epsilon)). Its time complexity is O(mqnrκ2log(1/ϵ))O(mqnr\cdot\kappa^{2}\log(1/\epsilon)) and its communication complexity per node is O(nrκ2log(1/ϵ))O(nr\cdot\kappa^{2}\log(1/\epsilon)). We provide a comparison of our guarantee with those of other works in Table I. This table also summarizes the guarantees for the two most sample-efficient LRMC solutions: projGD-X and altGDnormbal. The former is also the fastest LRMC solution, while the latter is the most communication-efficient. As mentioned earlier, LRMC is the most similar problem to ours that has been extensively studied. Notice that our sample complexity matches that of the best results for LRMC algorithms that do not solve a convex relaxation. (2) We show that a simple extension of AltGD-Min also provides the fastest provable solution to LRPR, as long as the above assumptions hold and mqΩ(κ6μ2nr2(r+log(1/ϵ)))mq\in\Omega(\kappa^{6}\mu^{2}nr^{2}(r+\log(1/\epsilon))). Its time complexity is the same too.

Contributions / Novelty of algorithm design and proof techniques. As explained earlier, there are three commonly used provably correct iterative algorithms for LR recovery problems – altMin, projGD-X, and altGD (altGDnormbal to be precise). AltMin is slower than GD-based methods because, for updating both 𝑼{\bm{U}} and 𝑩\bm{B}, it requires solving a minimization problem keeping the other variable fixed. For our specific asymmetric problem, the min step for 𝑼{\bm{U}} is the slow one. ProjGD-X and altGDnormbal are faster, but it is not clear how to analyze them for LRcCS under the desired sample complexity.¹ Our novel altGD-min approach however resolves both issues: it is as fast as projGD-X and it can be analyzed. Moreover, its communication complexity for a federated implementation (and its memory complexity) is only nrnr per node per iteration, instead of nqnq for projGD-X. As can be seen from Table I, treating κ,μ\kappa,\mu as numerical constants, it has the best sample-, time-, and communication/memory- complexity among all approaches for LRcCS and all fast (iterative) approaches for LRMC as well. Because of this, an AltGD-Min type algorithm may also be of interest for solving LRMC in a fast, sample-efficient and communication-efficient fashion. In fact, it can also be useful for other bilinear inverse problems such as blind deconvolution.

¹In order to show that a GD-based algorithm converges, one needs to be able to bound the norm of the gradient and show that it goes to zero with iterations. When studying both projGD-X and altGDnormbal, for different reasons, the estimates of the different columns are coupled. Consequently, it is not possible to get a tight enough bound on maxk𝒙k𝒙k\max_{k}\|\bm{x}^{*}_{k}-\bm{x}_{k}\|. But, due to the form of the LRcCS measurement model, such a bound is needed to get a tight enough bound on the 2-norm of the gradient of the cost function, and show that it decreases sufficiently at each iteration, under the desired sample complexity. Moreover, in case of projGD-X, even if one could somehow get the desired bound, it would not suffice because the summands will still be too heavy tailed. This point is explained in detail in Appendix A.

AltGDmin algorithm. The main idea is as follows. Express 𝑿{\bm{X}} as 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B} and alternatively update 𝑼{\bm{U}} and 𝑩\bm{B} as follows: (a) keeping 𝑩\bm{B} fixed at its previous value, update 𝑼{\bm{U}} by a GD step for it for the cost function f(𝑼,𝑩)f({\bm{U}},\bm{B}) followed by projecting the output onto the space of matrices with orthonormal columns; and (b) keeping 𝑼{\bm{U}} fixed at its previous value, update 𝑩\bm{B} by minimizing f(𝑼,𝑩)f({\bm{U}},\bm{B}) over it. Because of the column-wise decoupled form of our measurement model, step (b) is as fast as the GD step and thus the per-iteration time complexity of AltGD-Min is equal to that of any other GD method such as projGD-X or altGDnormbal. This decoupling (which means that, given 𝑼{\bm{U}}, 𝒃k\bm{b}_{k} only depends on 𝒙k\bm{x}^{*}_{k}, and not on the other columns of 𝑿{\bm{X}}^{*}) also allows us to get the desired tight-enough bound on maxk𝒃k𝑼𝒙k\max_{k}\|\bm{b}_{k}-{\bm{U}}^{\top}\bm{x}^{*}_{k}\| and hence on maxk𝒙k𝒙k\max_{k}\|\bm{x}_{k}-\bm{x}^{*}_{k}\|. This, and the fact that we use the gradient w.r.t. 𝑼{\bm{U}} in our algorithm, means that the summands in the gradient, and in other error bound terms, are nice-enough sub-exponential random variables (r.v.s): sub-exponential r.v.s whose maximum sub-exponential norm is small enough (is proportional to (r/q)(r/q)), so that the summation can be bounded w.h.p. under the desired sample complexity.

AltGDmin analysis. When we analyzed the AltMin approach for LRPR [5, 6], we could directly modify proof techniques from AltMin for LRMC [2] for getting a bound on SD(𝑼,𝑼)\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{}) in terms of the bound on this distance from the previous iteration. We cannot do this for AltGD-Min because the algorithm itself is different from the two GD approaches studied for solving LRMC. We instead analyze AltGD-Min by a novel use of the fundamental theorem of calculus [18] that, along with other linear algebra tricks, helps us get a bound on SD(𝑼,𝑼)\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{}) which has the desired property: the terms in it are sums of nice-enough sub-exponentials. See Lemma 3.4 and its proof. The use of this result is motivated by its use in [19], and many earlier works, where it is used in a standard way: to bound the Euclidean distance, 𝒙𝒙\|\bm{x}-\bm{x}^{*}\|, for standard GD to solve the PR problem for recovering a single vector 𝒙\bm{x}^{*}. Thus, at the true solution 𝒙=𝒙\bm{x}=\bm{x}^{*}, the gradient of the cost function was zero. In our case, there are two differences: (i) we need to bound the subspace distance error, and (ii) our algorithm is not standard GD, and this means that Uf(𝑼𝑼𝑼,𝑩)0\nabla_{U}f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B})\neq 0. We explain our approach in Sec. III-B.

AltGDmin initialization. The standard LR spectral initialization approach cannot be used because its summands are sub-exponential r.v.s that are not nice-enough. We give a detailed explanation in Appendix A. We address this issue by borrowing the truncation idea from the PR literature [20, 21, 5]. But, in our case, truncation is applied to a non-symmetric matrix. Thus the sandwiching arguments developed for symmetric matrices in [20], and modified in [21, 5], cannot be borrowed. We need a different argument which is used for proving Lemma B.2 and is briefly explained in Sec. III-D.

I-D Applications

The LRcCS and LRPR problems occur in projection imaging applications involving sets of images, e.g., dynamic MRI [8, 9, 10], federated LR sketching [11, 7], and dynamic Fourier ptychography (LRPR) [22]. In MRI, Fourier projections of the region of interest, e.g., a cross-section of the brain or the heart, are acquired one coefficient at a time, making the scanning (data acquisition) quite slow. Hence, reduced sample complexity enables accelerated scanning. Since medical image sequences are usually slow changing, the LR model is a valid assumption for a time sequence [8, 9, 10]. In our notation, 𝒙k\bm{x}^{*}_{k} is the vectorized version of the kk-th image of the sequence and there are a total of qq images. The matrices 𝑨k\bm{A}_{k} are random Fourier, i.e., 𝑨k=𝑯k𝑭\bm{A}_{k}=\bm{H}_{k}{\bm{F}} where 𝑭{\bm{F}} is the n×nn\times n matrix that models computation of the 2D discrete Fourier transform as a matrix-vector operation, and 𝑯k\bm{H}_{k} is an m×nm\times n random sampling “mask” matrix that models the frequency selection. In [13], we have shown the power of AltGD-Min for fast undersampled dynamic MRI of medical image sequences. It is both much faster, and in most cases, also provides better reconstructions, than many existing solutions from the MRI literature.

Large scale usage of smartphones results in large amounts of geographically distributed data, e.g., images. There is a need to compress/sketch this data before storing it. Sketch refers to a compression approach where the compression end is low complexity, usually simple linear projections [11, 7]. Consider the setting where different subsets of columns of 𝑿{\bm{X}}^{*} (each column corresponds to one vectorized image) are available at each of the ρq\rho\leq q nodes. The goal is to sketch them so that they can be correctly recovered using a federated algorithm. We can store the sketches 𝒚k:=𝑨k𝒙k\bm{y}_{k}:=\bm{A}_{k}\bm{x}^{*}_{k} with 𝑨k\bm{A}_{k}’s being i.i.d. Gaussian. This way we store a total of only mqmq scalars, with mqmq of order roughly just (n+q)r2(n+q)r^{2}. Traditional LR sketching approaches, e.g., [23], are designed for centralized settings and will not be efficient in a distributed setting.

I-E Organization

In Sec. II, we develop AltGD-Min, give its guarantee for solving LRcCS, and compare it with existing results. We state and prove the two theorems that help prove our main result in Sec. III. This section also contains brief proof outlines before the actual proofs. The lemmas used in these proofs are proved in Sec. IV. The extension for solving LRPR is developed, and its guarantee is stated and proved, in Sec. V. We discuss the limitations of our results in Sec. VI. Simulation experiments are provided in Sec. VII. We conclude in Sec. VIII.

II The Proposed AltGD-Min Algorithm and Guarantee

II-A The AltGD-Min algorithm

We would like to design a fast GD algorithm to find the matrix 𝑿{\bm{X}} that minimizes the squared-loss cost function f~(𝑿):=k=1q𝒚k𝑨k𝒙k2.\tilde{f}({\bm{X}}):=\sum_{k=1}^{q}\|\bm{y}_{k}-\bm{A}_{k}\bm{x}_{k}\|^{2}. For reasons described earlier, we decompose 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B} and develop an alternating GD-min (AltGD-Min) approach for the squared loss function,

f(𝑼,𝑩):=f~(𝑼𝑩)=k𝒚k𝑨k𝑼𝒃k2.f({\bm{U}},\bm{B}):=\tilde{f}({\bm{U}}\bm{B})=\sum_{k}\|\bm{y}_{k}-\bm{A}_{k}{\bm{U}}\bm{b}_{k}\|^{2}.

Starting with a careful initialization for 𝑼{\bm{U}} explained below, AltGD-Min proceeds as follows. At each new iteration,

  • Min-B: update 𝑩\bm{B} by solving 𝑩argmin𝑩~f(𝑼,𝑩~)\bm{B}\leftarrow\arg\min_{\tilde{\bm{B}}}f({\bm{U}},\tilde{\bm{B}}). Since 𝒃k\bm{b}_{k} only occurs in the kk-th summand of f(𝑼,𝑩)f({\bm{U}},\bm{B}), this decouples to a much simpler column-wise least squares (LS) problem: 𝒃kargmin𝒃~k𝒚k𝑨k𝑼𝒃~k2\bm{b}_{k}\leftarrow\arg\min_{\tilde{\bm{b}}_{k}}\|\bm{y}_{k}-\bm{A}_{k}{\bm{U}}\tilde{\bm{b}}_{k}\|^{2}. This is solved in closed form as 𝒃k=(𝑨k𝑼)𝒚k\bm{b}_{k}=(\bm{A}_{k}{\bm{U}})^{\dagger}\bm{y}_{k} for each kk; here 𝑴:=(𝑴𝑴)1𝑴\bm{M}^{\dagger}:=(\bm{M}^{\top}\bm{M})^{-1}\bm{M}^{\top}.

  • ProjGD-U: update 𝑼{\bm{U}} by one GD step for it, 𝑼^+𝑼ηUf(𝑼,𝑩)\hat{\bm{U}}^{+}\leftarrow{\bm{U}}-\eta\nabla_{U}f({\bm{U}},\bm{B}), followed by projecting 𝑼^+\hat{\bm{U}}^{+} onto the space of matrices with orthonormal columns to get the updated 𝑼+{\bm{U}}^{+}. We get 𝑼+{\bm{U}}^{+} by QR decomposition: 𝑼^+=QR𝑼+𝑹+\hat{\bm{U}}^{+}\overset{\mathrm{QR}}{=}{\bm{U}}^{+}{\bm{R}}^{+}.

Notice that, because of the decoupling for 𝑩\bm{B}, the min step only involves solving qq rr-dimensional Least Squares (LS) problems, in addition to also first computing the matrices, 𝑨k𝑼\bm{A}_{k}{\bm{U}}. Computing the matrices needs time of order mnrmnr, and solving one LS problem needs time of order mr2mr^{2}. Thus, the LS step needs time O(qmax(mnr,mr2))=O(mqnr)O(q\max(mnr,mr^{2}))=O(mqnr) since rnr\leq n. This is equal to the time needed to compute the gradient w.r.t. 𝑼{\bm{U}}; and thus, the per-iteration cost of AltGD-Min is only O(mqnr)O(mqnr). The QR decomposition of an n×rn\times r matrix takes time only nr2nr^{2}.
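
To make the two steps above concrete, the following is a minimal numpy sketch of one AltGD-Min iteration (single node, no sample-splitting, dense Gaussian 𝑨k\bm{A}_{k}s; all function and variable names are ours and this is only an illustrative sketch, not the implementation used in our experiments):

```python
import numpy as np

def altgdmin_iteration(U, y_list, A_list, eta):
    """One AltGD-Min iteration: closed-form LS update of B (Min-B step),
    then a GD step for U followed by re-orthonormalization via QR (ProjGD-U step).
    U: n x r matrix with orthonormal columns; y_list[k], A_list[k]: y_k, A_k."""
    n, r = U.shape
    m = A_list[0].shape[0]
    grad = np.zeros((n, r))
    B_cols = []
    for y_k, A_k in zip(y_list, A_list):
        AU = A_k @ U                                     # m x r
        b_k, *_ = np.linalg.lstsq(AU, y_k, rcond=None)   # decoupled column-wise LS
        B_cols.append(b_k)
        residual_k = AU @ b_k - y_k                      # A_k U b_k - y_k
        grad += np.outer(A_k.T @ residual_k, b_k)        # k-th summand of grad_U f
    U_hat = U - (eta / m) * grad                         # GD step for U
    U_new, _ = np.linalg.qr(U_hat)                       # projection onto orthonormal columns
    return U_new, np.column_stack(B_cols)                # updated U and r x q matrix B
```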

Since f(𝑼,𝑩)f({\bm{U}},\bm{B}) is not a convex function of the unknowns {𝑼,𝑩}\{{\bm{U}},\bm{B}\}, a careful initialization is needed. Borrowing the spectral initialization idea from LRMC and LRMS solutions, we should initialize 𝑼0{\bm{U}}_{0} by computing the top rr singular vectors of

𝑿0,full=1m[(𝑨1𝒚1),(𝑨2𝒚2),,(𝑨k𝒚k),(𝑨q𝒚q)]{\bm{X}}_{0,full}=\frac{1}{m}[(\bm{A}_{1}^{\top}\bm{y}_{1}),(\bm{A}_{2}^{\top}\bm{y}_{2}),\dots,(\bm{A}_{k}^{\top}\bm{y}_{k}),\dots(\bm{A}_{q}^{\top}\bm{y}_{q})]

Clearly the expected value of the kk-th column of this matrix equals 𝒙k\bm{x}^{*}_{k} and thus 𝔼[𝑿0,full]=𝑿\mathbb{E}[{\bm{X}}_{0,full}]={\bm{X}}^{*}. But, as we explain next, it is not clear how to prove that this matrix concentrates around 𝑿{\bm{X}}^{*}. Observe that it can also be written as

𝑿0,full:=1mk=1qi=1m𝒂ki𝒚ki𝒆k\displaystyle{\bm{X}}_{0,full}:=\frac{1}{m}\sum_{k=1}^{q}\sum_{i=1}^{m}\bm{a}_{ki}\bm{y}_{ki}\bm{e}_{k}{}^{\top}

Its summands are independent sub-exponential r.v.s with maximum sub-exponential norm maxk𝒙kμr/qσmax\max_{k}\|\bm{x}^{*}_{k}\|\leq\mu\sqrt{r/q}{\sigma_{\max}^{*}}. This is too large and does not allow us to bound 𝑿0,full𝑿\|{\bm{X}}_{0,full}-{\bm{X}}^{*}\| under the desired sample complexity; see Appendix A. To resolve this issue, we borrow the truncation idea from earlier work on PR [20, 5] and initialize 𝑼0{\bm{U}}_{0} as the top rr left singular vectors of

𝑿0\displaystyle{\bm{X}}_{0} :=\displaystyle:= 1mk=1qi=1m𝒂ki𝒚ki𝒆k𝟙{𝒚ki2α}\displaystyle\frac{1}{m}\sum_{k=1}^{q}\sum_{i=1}^{m}\bm{a}_{ki}\bm{y}_{ki}\bm{e}_{k}{}^{\top}\mathbbm{1}_{\left\{\bm{y}_{ki}^{2}\leq\alpha\right\}} (2)
=\displaystyle= 1mk=1q𝑨k𝒚k,trunc(α)𝒆k\displaystyle\frac{1}{m}\sum_{k=1}^{q}\bm{A}_{k}^{\top}\bm{y}_{k,trunc}(\alpha)\bm{e}_{k}^{\top}

where α:=C~ki(𝒚ki)2mq\alpha:=\tilde{C}\frac{\sum_{ki}(\bm{y}_{ki})^{2}}{mq} and 𝒚k,trunc(α):=𝒚k𝟙(|𝒚k|α)\bm{y}_{k,trunc}(\alpha):=\bm{y}_{k}\circ\mathbbm{1}(|\bm{y}_{k}|\leq\sqrt{\alpha}). We set C~\tilde{C} in our main result. Observe that we are summing over only those i,ki,k for which 𝒚ki2\bm{y}_{ki}^{2} is not too large (is not much larger than its empirically computed average value). This truncation filters out the too large (outlier-like) measurements and sums over the rest. Theoretically, this converts the summands into sub-Gaussian r.v.s which have lighter tails than the un-truncated ones. This allows us to prove the desired concentration bound. Different from the above setting, in [20, 5], truncation was applied to symmetric positive definite matrices and was used to convert summands that were heavier-tailed than sub-exponential to sub-exponential.
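
A minimal numpy sketch of this truncated spectral initialization (2) is given below (names are ours; the multiplier C_tilde is set as discussed in Sec. II-A1):

```python
import numpy as np

def truncated_init(y_list, A_list, r, C_tilde=9.0):
    """U_0 = top-r left singular vectors of the truncated matrix X_0 in (2)."""
    m, n = A_list[0].shape
    q = len(y_list)
    alpha = C_tilde * sum(np.sum(y ** 2) for y in y_list) / (m * q)
    X0 = np.zeros((n, q))
    for k, (y_k, A_k) in enumerate(zip(y_list, A_list)):
        y_trunc = y_k * (np.abs(y_k) <= np.sqrt(alpha))  # zero out outlier-like entries
        X0[:, k] = A_k.T @ y_trunc / m
    U0, _, _ = np.linalg.svd(X0, full_matrices=False)
    return U0[:, :r], X0
```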

We summarize the complete algorithm in Algorithm 1. This uses sample-splitting which is a commonly used approach in the LR recovery literature [2, 14, 15] as well as in other compressive sensing settings. It helps ensure that the measurement matrices in each iteration for updating 𝑼{\bm{U}} and 𝑩\bm{B} are independent of all previous iterates. This allows one to use concentration bounds for sums of independent r.v.s. We provide a detailed discussion in Sec. VI-A.

II-A1 Practical algorithm and setting algorithm parameters

First, when we implement the algorithm, we use Algorithm 1 with the full set of measurements for all the steps (no sample-splitting). The algorithm has 4 parameters: η\eta, TT, C~\tilde{C} and the rank rr. According to the theorem below, we should set η=c/σmax2\eta=c/{\sigma_{\max}^{*}}^{2} with c<0.5c<0.5. But σmax{\sigma_{\max}^{*}} is not known. The initialization matrix 𝑿0{\bm{X}}_{0} provides an approximation to 𝑿{\bm{X}}^{*} and hence we can set η=c/𝑿02\eta=c/\|{\bm{X}}_{0}\|^{2}. Consider C~\tilde{C}. The theorem requires setting C~=9κ2μ2\tilde{C}=9\kappa^{2}\mu^{2}; however, κ,μ\kappa,\mu are functions of 𝑿{\bm{X}}^{*} which is unknown. Using the definition of μ\mu from Assumption 1.1, we can replace κ2μ2\kappa^{2}\mu^{2} by an estimate of its lower bound: qmaxk𝒙k2^/𝑿F2^q\cdot\max_{k}\widehat{\|\bm{x}^{*}_{k}\|^{2}}/\widehat{\|{\bm{X}}^{*}\|_{F}^{2}} with 𝒙k2^=(1/m)i𝒚ki2\widehat{\|\bm{x}^{*}_{k}\|^{2}}=(1/m)\sum_{i}\bm{y}_{ki}^{2} and 𝑿F2^=(1/m)ki𝒚ki2\widehat{\|{\bm{X}}^{*}\|_{F}^{2}}=(1/m)\sum_{k}\sum_{i}\bm{y}_{ki}^{2}. To set the total number of algorithm iterations TT, we can use a large maximum value along with breaking the loop if a stopping criterion is satisfied. A common stopping criterion for GD is to stop when the iterates do not change much. One way to do this is to stop when SD(𝑼t,𝑼t1)0.01r\mathrm{SD}({\bm{U}}_{t},{\bm{U}}_{t-1})\leq 0.01\sqrt{r} for the last few iterations.
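
As an example, the data-driven surrogates above could be computed as follows (a sketch with our own names; C~\tilde{C} uses only the measurements and is computed before forming 𝑿0{\bm{X}}_{0}, while η\eta is computed after):

```python
import numpy as np

def estimate_C_tilde(y_list, m):
    """Surrogate for C_tilde = 9*kappa^2*mu^2: replace kappa^2*mu^2 by the estimate
    q * max_k ||x*_k||^2 / ||X*||_F^2 computed from the measurements (Sec. II-A1)."""
    col_energy = np.array([np.sum(y ** 2) / m for y in y_list])  # estimates of ||x*_k||^2
    return 9.0 * len(y_list) * col_energy.max() / col_energy.sum()

def estimate_eta(X0, c=0.5):
    """Step size eta = c / ||X0||^2 (spectral norm), using X0 as a proxy for sigma_max^*."""
    return c / np.linalg.norm(X0, 2) ** 2
```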

As explained in [13], we can use the following constraints to set the rank. We need our choice of rank, r^\hat{r}, to be sufficiently small compared to min(n,q)\min(n,q) for the algorithm to take advantage of the LR assumption. Moreover, for the LS step for updating 𝒃k\bm{b}_{k}’s (which are rr-length vectors) to work well (for its error to be small), we also need it to be small compared with mm. One approach that is used often is to use the “b%b\% energy threshold” on singular values. Thus, one good heuristic that respects the above constraints is to compute the “b%b\% energy threshold” of the first min(n,q,m)/10\min(n,q,m)/10 singular values, i.e., compute r^\hat{r} as the smallest value of rr for which

j=1rσj(𝑿0)2(b/100)j=1min(n,q,m)/10σj(𝑿0)2\sum_{j=1}^{{r}}\sigma_{j}({\bm{X}}_{0})^{2}\geq(b/100)\cdot\sum_{j=1}^{\min(n,q,m)/10}\sigma_{j}({\bm{X}}_{0})^{2}

for a b100b\leq 100. In our MRI experiments in [13], we used b=85b=85. We also realized from the experiments that the algorithm is not very sensitive to this value as long as r^min(n,q,m)\hat{r}\ll\min(n,q,m).
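
A possible numpy sketch of this rank-selection heuristic (names are ours; b=85b=85 was the value used in [13]):

```python
import numpy as np

def select_rank(X0, m, b=85):
    """Smallest r whose singular values capture b% of the energy of the first
    min(n, q, m)/10 singular values of X0 (the heuristic described above)."""
    n, q = X0.shape
    sv = np.linalg.svd(X0, compute_uv=False)
    k_max = max(1, min(n, q, m) // 10)
    energy = np.cumsum(sv[:k_max] ** 2)
    return int(np.searchsorted(energy, (b / 100.0) * energy[-1]) + 1)
```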

II-A2 Federating the algorithm

Suppose that our sketches 𝒚k\bm{y}_{k} are geographically distributed across a set of LL nodes. Each node {\ell} stores a subset, denoted 𝒮\mathcal{S}_{\ell}, of the 𝒚k\bm{y}_{k}s with |𝒮|=q|\mathcal{S}_{\ell}|=q_{\ell}. These subsets are mutually disjoint so that q=q\sum_{\ell}q_{\ell}=q. Typically LqL\ll q. Privacy constraints dictate that we cannot share the 𝒚k\bm{y}_{k}s with the central server; although summaries computed using the 𝒚k\bm{y}_{k}s can be shared at each algorithm iteration. This will be done as follows. Consider the GDmin steps of Algorithm 1 first. Line 13 (Update 𝒃k\bm{b}_{k}s, 𝒙k\bm{x}_{k}s) is done locally at the node that stores the corresponding 𝒚k\bm{y}_{k}. For line 14 (Gradient w.r.t 𝑼{\bm{U}}), the partial sums over k𝒮k\in\mathcal{S}_{\ell} are computed at node {\ell} and transmitted to the center which adds all the partial sums to obtain 𝑼f(𝑼,𝑩)\nabla_{\bm{U}}f({\bm{U}},\bm{B}). Line 15 (GD step) and line 16 (projection via QR) are done at the center. The updated 𝑼{\bm{U}} is then broadcast to all the nodes for use in the next iteration. The per node time complexity of this algorithm is thus mnrqmnrq_{\ell} at each iteration. The center only performs additions and a QR decomposition (an order nr2nr^{2} operation) in each iteration. Thus, the time complexity of the federated solution is only mnr(maxq)Tmnr(\max_{\ell}q_{\ell})T per node.

The initialization step can be federated by using the Power Method (PM) [24, 25] to compute the top rr eigenvectors of 𝑿0𝑿0{\bm{X}}_{0}{\bm{X}}_{0}{}^{\top}. Any PM guarantee helps ensure that its output is close in subspace distance to the span of the top rr eigenvectors of 𝑿0𝑿0{\bm{X}}_{0}{\bm{X}}_{0}{}^{\top} after a sufficient number of iterations. The communication complexity of the federated implementation is thus just nrnr per node per iteration (need to share the partial gradient sums). Observe also that the information shared with the center is not sufficient to recover 𝑿{\bm{X}}^{*} centrally. It is only sufficient to recover span(𝑼)\mathrm{span}({\bm{U}}^{*}{}). The recovery of the columns of 𝑩\bm{B}, 𝒃k\bm{b}^{*}_{k}, is entirely done locally at the node where the corresponding 𝒚k\bm{y}_{k} is stored, thus ensuring privacy.
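
The following numpy sketch illustrates this node/center split for one GDmin iteration (names are ours; the transmission of the n×rn\times r partial gradient sums to the center and the broadcast of the updated 𝑼{\bm{U}} are left implicit):

```python
import numpy as np

def node_partial_gradient(U, y_local, A_local):
    """Run at node l: update the locally stored b_k's by LS and return the
    partial gradient sum over k in S_l (only this n x r matrix is sent to the center)."""
    grad_l = np.zeros(U.shape)
    for y_k, A_k in zip(y_local, A_local):
        AU = A_k @ U
        b_k, *_ = np.linalg.lstsq(AU, y_k, rcond=None)      # b_k (and x_k = U b_k) stay local
        grad_l += np.outer(A_k.T @ (AU @ b_k - y_k), b_k)
    return grad_l

def center_update(U, partial_grads, eta, m):
    """Run at the center: add the partial sums, take the GD step, re-orthonormalize."""
    grad = sum(partial_grads)
    U_new, _ = np.linalg.qr(U - (eta / m) * grad)
    return U_new  # broadcast back to all nodes
```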

Algorithm 1 The AltGD-Min algorithm. Let 𝑴:=(𝑴𝑴)1𝑴\bm{M}^{\dagger}:=(\bm{M}^{\top}\bm{M})^{-1}\bm{M}^{\top}.
1:Input: 𝒚k,𝑨k,k[q]\bm{y}_{k},\bm{A}_{k},k\in[q]
2:Parameters: Multiplier in specifying α\alpha for init step, C~\tilde{C}; GD step size, η\eta; Number of iterations, TT
3:Sample-split: Partition the measurements and measurement matrices into 2T+12T+1 equal-sized disjoint sets: one set for initialization and 2T2T sets for the iterations. Denote these by 𝒚k(τ),𝑨k(τ),τ=0,1,2T\bm{y}_{k}^{(\tau)},\bm{A}_{k}^{(\tau)},\tau=0,1,\dots 2T.
4:Initialization:
5:Using 𝒚k𝒚k(0),𝑨k𝑨k(0)\bm{y}_{k}\equiv\bm{y}_{k}^{(0)},\bm{A}_{k}\equiv\bm{A}_{k}^{(0)}, set
6:α=C~1mqki|𝒚ki|2\alpha=\tilde{C}\frac{1}{mq}\sum_{ki}\big{|}\bm{y}_{ki}\big{|}^{2},
7:𝒚k,trunc(α):=𝒚k𝟙{|𝒚k|α}\bm{y}_{k,trunc}(\alpha):=\bm{y}_{k}\circ\mathbbm{1}\{|\bm{y}_{k}|\leq\sqrt{\alpha}\}
8:𝑿0:=(1/m)k[q]𝑨k𝒚k,trunc(α)𝒆k\displaystyle{\bm{X}}_{0}:=(1/m)\sum_{k\in[q]}\bm{A}_{k}^{\top}\bm{y}_{k,trunc}(\alpha)\bm{e}_{k}^{\top}
9:Set 𝑼0{\bm{U}}_{0}\leftarrow top-rr-singular-vectors of 𝑿0{\bm{X}}_{0}
10:GDmin iterations:
11:for t=1t=1 to TT do
12:     Let 𝑼𝑼t1{\bm{U}}\leftarrow{\bm{U}}_{t-1}.
13:     Update bk,xk\bm{b}_{k},\bm{x}_{k}: For each k[q]k\in[q], set (𝒃k)t(𝑨k(t)𝑼)𝒚k(t)(\bm{b}_{k})_{t}\leftarrow(\bm{A}_{k}^{(t)}{\bm{U}})^{\dagger}\bm{y}_{k}^{(t)} and set (𝒙k)t𝑼(𝒃k)t(\bm{x}_{k})_{t}\leftarrow{\bm{U}}(\bm{b}_{k})_{t}
14:     Gradient w.r.t. U{\bm{U}}: With 𝒚k𝒚k(T+t),𝑨k𝑨k(T+t)\bm{y}_{k}\equiv\bm{y}_{k}^{(T+t)},\bm{A}_{k}\equiv\bm{A}_{k}^{(T+t)}, compute 𝑼f(𝑼,𝑩t)=k𝑨k(𝑨k𝑼(𝒃k)t𝒚k)(𝒃k)t\nabla_{\bm{U}}f({\bm{U}},\bm{B}_{t})=\sum_{k}\bm{A}_{k}^{\top}(\bm{A}_{k}{\bm{U}}(\bm{b}_{k})_{t}-\bm{y}_{k})(\bm{b}_{k})_{t}^{\top}
15:     GD step: Set 𝑼^+𝑼(η/m)𝑼f(𝑼,𝑩t)\displaystyle\hat{\bm{U}}^{+}\leftarrow{\bm{U}}-(\eta/m)\nabla_{\bm{U}}f({\bm{U}},\bm{B}_{t}).
16:     Projection step: Compute 𝑼^+=QR𝑼+𝑹+\hat{\bm{U}}^{+}\overset{\mathrm{QR}}{=}{\bm{U}}^{+}{\bm{R}}^{+}.
17:     Set 𝑼t𝑼+{\bm{U}}_{t}\leftarrow{\bm{U}}^{+}.
18:end for

II-B Main Result

We can prove the following result.

Theorem 2.1.

Consider Algorithm 1. Let mtm_{t} denote the number of samples used in iteration tt. Set C~=9κ2μ2\tilde{C}=9\kappa^{2}\mu^{2}, η=c/σmax2\eta=c/{\sigma_{\max}^{*}}^{2} with a c0.5c\leq 0.5, and T=Cκ2log(1/ϵ)T=C\kappa^{2}\log(1/\epsilon). Assume that Assumption 1.1 holds and that the 𝐀k\bm{A}_{k}s are i.i.d. and each contains i.i.d. standard Gaussian entries. If

m0qCκ6μ2(n+q)r2,m_{0}q\geq C\kappa^{6}\mu^{2}(n+q)r^{2},

and mtm_{t} for t1t\geq 1 satisfies

mtqCκ4μ2(n+q)r2logκ and mtCmax(r,logq,logn)m_{t}q\geq C\kappa^{4}\mu^{2}(n+q)r^{2}\log\kappa\text{ and }m_{t}\geq C\max(r,\log q,\log n)

then, with probability (w.p.) at least 1tn101-tn^{-10}, for all t0t\geq 0,

SD(𝑼t,𝑼)(1(ησmax2)0.4κ2)tδ0\mathrm{SD}({\bm{U}}_{t},{\bm{U}}^{*}{})\leq\left(1-\frac{(\eta{\sigma_{\max}^{*}}^{2})0.4}{\kappa^{2}}\right)^{t}\delta_{0}

with δ0=0.09/κ2.\delta_{0}=0.09/\kappa^{2}. Thus, with T=Cκ2log(1/ϵ)T=C\kappa^{2}\log(1/\epsilon) and η=0.5/σmax2\eta=0.5/{\sigma_{\max}^{*}}^{2}, w.p. at least 1(T+1)n101-(T+1)n^{-10},

SD(𝑼T,𝑼)ϵ,(𝒙k)T𝒙kϵ𝒙k, for all k[q],\displaystyle\mathrm{SD}({\bm{U}}_{T},{\bm{U}}^{*}{})\leq\epsilon,\ {\|(\bm{x}_{k})_{T}-\bm{x}^{*}_{k}\|}\leq\epsilon{\|\bm{x}^{*}_{k}\|},\text{ for all $k\in[q]$, }
𝑿T𝑿F1.4ϵ𝑿\displaystyle\|{\bm{X}}_{T}-{\bm{X}}^{*}\|_{F}\leq 1.4\epsilon\|{\bm{X}}^{*}\|

Sample complexity The sample complexity (total number of samples needed to achieve ϵ\epsilon-accurate recovery) is mtot=τ=0Tmτm0+Tmint1mtm_{\mathrm{tot}}=\sum_{\tau=0}^{T}m_{\tau}\geq m_{0}+T\min_{t\geq 1}m_{t}. From the above result, this needs to satisfy mtotqCκ6μ2(n+q)r2log(1/ϵ)log(κ)m_{tot}q\geq C\kappa^{6}\mu^{2}(n+q)r^{2}\log(1/\epsilon)\log(\kappa) and mtot>Cκ2max(r,logq,logn)log(1/ϵ)m_{tot}>C\kappa^{2}\max(r,\log q,\log n)\log(1/\epsilon).

Time complexity Let mmtm\equiv m_{t}. The initialization step needs time mqnmqn for computing 𝑿0{\bm{X}}_{0}; and time of order nqrnqr times the number of iterations used in the rr-SVD step. Since we only need a δ0\delta_{0}-accurate initial estimate of span(𝑼)\mathrm{span}({\bm{U}}^{*}{}), with δ0=c/κ2\delta_{0}=c/\kappa^{2}, order log(κ)\log(\kappa) number of iterations suffice for this SVD step. Thus the complexity is O(nq(m+r)logκ)=O(mqnlogκ)O(nq(m+r)\cdot\log\kappa)=O(mqn\cdot\log\kappa) since mrm\geq r. One gradient computation needs time O(mqnr)O(mqnr). The QR decomposition needs time of order nr2nr^{2}. The update of columns of 𝑩\bm{B} by LS also needs time O(mqnr)O(mqnr) (explained earlier). As we prove above, we need to repeat these steps T=O(κ2log(1/ϵ))T=O(\kappa^{2}\log(1/\epsilon)) times. Thus the total time complexity is O(mqnlogκ+max(mqnr,nr2,mqnr)T)=O(κ2mqnrlog(1/ϵ)logκ)O(mqn\log\kappa+\max(mqnr,nr^{2},mqnr)\cdot T)=O(\kappa^{2}mqnr\log(1/\epsilon)\log\kappa).

Communication complexity The communication complexity per node per iteration for a federated implementation is just order nrnr. Thus, the total is O(nrκ2log(1/ϵ))O(nr\cdot\kappa^{2}\log(1/\epsilon)).

Thus, we have the following corollary.

Corollary 2.2 (AltGD-Min).

In the setting of Theorem 2.1, if Assumption 1.1 holds, and if

mtotqCκ6μ2(n+q)r2log(1/ϵ)log(κ)m_{tot}q\geq C\kappa^{6}\mu^{2}(n+q)r^{2}\log(1/\epsilon)\log(\kappa)

and mtot>Cκ2max(r,logq,logn)log(1/ϵ)m_{tot}>C\kappa^{2}\max(r,\log q,\log n)\log(1/\epsilon), then, w.p. at least 1(Cκ2log(1/ϵ))n101-(C\kappa^{2}\log(1/\epsilon))n^{-10}, 𝐗𝐗F1.4ϵ𝐗\|{\bm{X}}-{\bm{X}}^{*}\|_{F}\leq 1.4\epsilon\|{\bm{X}}^{*}\| and 𝐱k𝐱kϵ𝐱k{\|\bm{x}_{k}-\bm{x}^{*}_{k}\|}\leq\epsilon{\|\bm{x}^{*}_{k}\|} for all k[q]k\in[q]. The time complexity is Cκ2mqnrlog(1/ϵ)logκC\kappa^{2}mqnr\log(1/\epsilon)\log\kappa and the communication complexity is O(nrκ2log(1/ϵ))O(nr\cdot\kappa^{2}\log(1/\epsilon)).

Observe that the above results show that after T=Cκ2log(1/ϵ)T=C\kappa^{2}\log(1/\epsilon) iterations, SD(𝑼T,𝑼)ϵ\mathrm{SD}({\bm{U}}_{T},{\bm{U}}^{*}{})\leq\epsilon, 𝒙k𝒙kϵ𝒙k{\|\bm{x}_{k}-\bm{x}^{*}_{k}\|}\leq\epsilon{\|\bm{x}^{*}_{k}\|}, and 𝑿T𝑿F1.4ϵ𝑿\|{\bm{X}}_{T}-{\bm{X}}^{*}\|_{F}\leq 1.4\epsilon\|{\bm{X}}^{*}\|. The RHS in the third bound does indeed contain 𝑿\|{\bm{X}}^{*}\| (the induced 2-norm). This is correct because, SD(.,.)\mathrm{SD}(.,.) is a Frobenius norm subspace distance. We explain this in Sec. III-B.

II-C Discussion and comparison with the best LRMC results

An algorithm is called linear time if its time complexity is the same order as the time needed to load all input data. In our case, this is O(mqn)O(mqn). Treating κ\kappa as a constant, the AltGD-Min complexity is worse than linear-time by a factor of only rlog(1/ϵ)r\log(1/\epsilon). As can be seen from Table I, the same is also true for the fastest LRMC solution, projGD-X [15]. For LRMC, linear time is O(mq)O(mq). To the best of our knowledge, this is the case for the fastest algorithms for all LR problems.

Consider the sample complexity. The degrees of freedom (number of unknowns) of a rank-rr n×qn\times q matrix are (n+q)r(n+q)r. A sample complexity of Ω((n+q)r)\Omega((n+q)r) samples (or, sometimes this times log factors) is called “optimal”. Thus, ignoring the log factors, our sample complexity of mtotq(n+q)r2m_{tot}q\gtrsim(n+q)r^{2} is sub-optimal only by a factor of rr. As can also be seen from Table I, this suboptimality matches that of the best results for LRMC solutions that are not convex relaxation based [15, 16, 17]. The need for exploiting incoherence while obtaining the high probability bounds on the recovery error terms is what introduces the extra factor of rr for both LRMC and LRcCS. LRMC has been extensively studied for over a decade and there does not seem to be a way to obtain an (order-) optimal sample complexity guarantee for it except when studying convex relaxation solutions (which are much slower).

In addition, we also need mmax(r,logq,logn)m\gtrsim\max(r,\log q,\log n). This is redundant except for very large q,nq,n. This is needed because the recovery of each column of 𝑩\bm{B}^{*} is a decoupled rr-dimensional LS problem. We analyze this step in Lemma 3.3; notice that the bound on the recovery error of column kk holds w.p. at least 1exp(rcm)1-\exp(r-cm). By union bound, it holds for all qq columns w.p. at least 1qexp(rcm)=1exp(logq+rcm)1-q\exp(r-cm)=1-\exp(\log q+r-cm). This probability is at least 1n10=1exp(10logn)1-n^{-10}=1-\exp(-10\log n) if mmax(r,logq,logn)m\gtrsim\max(r,\log q,\log n).

II-D Detailed comparison with existing LRcCS results

There are two existing solutions for LRcCS – AltMin [4, 5, 6] and the convex relaxation (mixed norm minimization) [7]. Mixed norm is defined as 𝑿mixed:=inf{𝑼,𝑽:𝑼𝑽=𝑿}𝑼Fmaxk[q]𝒗k\|{\bm{X}}\|_{mixed}:=\inf_{\{{\bm{U}},\bm{V}:{\bm{U}}\bm{V}={\bm{X}}\}}\|{\bm{U}}\|_{F}\max_{k\in[q]}\|\bm{v}_{k}\|, where 𝑼{\bm{U}} is n×rn\times r and 𝑽:=[𝒗1,𝒗2,𝒗q]\bm{V}:=[\bm{v}_{1},\bm{v}_{2},\dots\bm{v}_{q}] is an r×qr\times q matrix. In our notation, for the noise-free case (σ=0\sigma=0), their main result states the following.

Proposition 2.3 (Convex relaxation (mixed norm min) in the σ=0\sigma=0 (noise-free) setting [7]).

Consider a matrix 𝐗{𝐗:maxk𝐱k2α2,𝐗mixedRαr}{\bm{X}}^{*}\in\{{\bm{X}}^{*}:\max_{k}\|\bm{x}^{*}_{k}\|^{2}\leq\alpha^{2},\|{\bm{X}}^{*}\|_{mixed}\leq R\leq\alpha\sqrt{r}\}. Then, w.p. 1exp(c2nR2/α2)1-\exp(-c_{2}nR^{2}/\alpha^{2}), 𝐗𝐗F2𝐗F2c1α2𝐗F2/q(n+q)rlog6nmtotq\frac{\|{\bm{X}}-{\bm{X}}^{*}\|_{F}^{2}}{\|{\bm{X}}^{*}\|_{F}^{2}}\leq c_{1}\frac{\alpha^{2}}{\|{\bm{X}}^{*}\|_{F}^{2}/q}\sqrt{\frac{(n+q)r\log^{6}n}{m_{tot}q}} Under our Assumption 1.1, maxk𝐱k2μ2(r/q)σmax2=(μ2κ2)(r/q)σmin2(κ2μ2)𝐗F2/q\max_{k}\|\bm{x}^{*}_{k}\|^{2}\leq\mu^{2}(r/q){\sigma_{\max}^{*}}^{2}=(\mu^{2}\kappa^{2})(r/q){\sigma_{\min}^{*}}^{2}\leq(\kappa^{2}\mu^{2})\|{\bm{X}}^{*}\|_{F}^{2}/q, i.e. α2𝐗F2/q=(κ2μ2)\frac{\alpha^{2}}{\|{\bm{X}}^{*}\|_{F}^{2}/q}=(\kappa^{2}\mu^{2}). Thus, the above result can also be stated as:

For all matrices 𝐗{\bm{X}}^{*} that satisfy Assumption 1.1 and for which 𝐗mixedrκμ𝐗F/q\|{\bm{X}}^{*}\|_{mixed}\leq\sqrt{r}\cdot\kappa\mu\|{\bm{X}}^{*}\|_{F}/\sqrt{q}, if

mtotqC1κ4μ4(n+q)rlog6n1ϵ4,m_{tot}q\geq C_{1}\kappa^{4}\mu^{4}(n+q)r\log^{6}n\cdot\frac{1}{\epsilon^{4}},

then, w.p. at least 1exp(c2n)1-\exp(-c_{2}n), 𝐗𝐗Fϵ𝐗F{\|{\bm{X}}-{\bm{X}}^{*}\|_{F}}\leq\epsilon{\|{\bm{X}}^{*}\|_{F}}. The time complexity is Cmqnrmin(1ϵ,n3r3)Cmqnr\min(\frac{1}{\sqrt{\epsilon}},n^{3}r^{3}) (explained earlier in Sec. I-B).

Notice that both the sample and the time complexity of the convex solution depend on powers of 1/ϵ1/\sqrt{\epsilon}: the sample complexity grows as 1/ϵ41/\epsilon^{4} while the time complexity grows as 1/ϵ1/\sqrt{\epsilon}. However, its sample complexity has an order-optimal dependence on rr. For AltGD-Min, both sample and time complexities depend only logarithmically on ϵ\epsilon, as log(1/ϵ)\log(1/\epsilon). But its sample complexity depends sub-optimally on rr: it grows as r2r^{2}. In summary, the time complexity of the convex solution is always much worse, while its sample complexity is worse when a solution with accuracy level ϵ<1/r1/4\epsilon<1/{r}^{1/4} is needed. A second point to mention is that our result for AltGD-Min provides a column-wise error bound (bounds 𝒙k𝒙k/𝒙k\|\bm{x}^{*}_{k}-\bm{x}_{k}\|/\|\bm{x}^{*}_{k}\|). The convex result only provides a bound on the Frobenius norm of the entire matrix. Thus it is possible that some columns have much larger recovery error than others. This can be problematic in applications such as dynamic MRI where each column corresponds to one signal/image of a time sequence and where the goal is to ensure accurate-enough recovery of all columns. On the other hand, the advantage of the convex guarantee is that it holds w.h.p. for all matrices 𝑿{\bm{X}}^{*} in the specified set, whereas our result only holds w.h.p. for a matrix 𝑿{\bm{X}}^{*} satisfying Assumption 1.1. The reason for these last two points and the reason that we cannot avoid using sample-splitting is the same: the update of 𝑩\bm{B} is a column-wise LS problem. We explain the reasoning carefully in Sec. VI-A where we discuss the limitations of our approach. A second advantage of the convex result is that it directly studies the noisy version of the LRcCS problem. This should be possible for AltGD-Min too; we postpone it to future work.

The best result for AltMin is from [6], it states the following.

Proposition 2.4 (AltMin [6]).

Under Assumption 1.1, if

mtotqCκ8μ2nr2(r+log(1/ϵ)) and mtot>max(r,logq,logn),m_{tot}q\geq C\kappa^{8}\mu^{2}nr^{2}(r+\log(1/\epsilon))\text{ and }m_{tot}>\max(r,\log q,\log n),

then, w.p. at least 1(log(1/ϵ))n101-(\log(1/\epsilon))n^{-10}, 𝐗𝐗ϵ𝐗\|{\bm{X}}-{\bm{X}}^{*}\|\leq\epsilon\|{\bm{X}}^{*}\| and 𝐱k𝐱kϵ𝐱k{\|\bm{x}_{k}-\bm{x}^{*}_{k}\|}\leq\epsilon{\|\bm{x}^{*}_{k}\|} for all k[q]k\in[q]. The time complexity is Cmqnrlog2(1/ϵ)Cmqnr\log^{2}(1/\epsilon).

Treating κ\kappa as a numerical constant, compared with the above result for AltMin, the sample complexity of AltGD-Min is either better by a factor of rr or is as good. It is better when r>log(1/ϵ)r>\log(1/\epsilon). Also, the time complexity is always better by a factor log(1/ϵ)\log(1/\epsilon). As a function of κ\kappa, the AltGD-Min sample complexity is better by a factor of κ2\kappa^{2}, but its time is worse by a factor of κ2\kappa^{2} compared to that of AltMin. The reason is that its error decays as (1c/κ2)t(1-c/\kappa^{2})^{t}. For AltMin the error decays as ctc^{t}. Experimentally, GD is usually much faster than AltMin because the constants in its time complexity are also lower.

III Proving Theorem 2.1

III-A Two key results for proving Theorem 2.1 and its proof

Theorem 2.1 is an almost immediate consequence of the following two results.

Theorem 3.1 (Initialization).

Pick a δ0<0.1\delta_{0}<0.1. If mqCκ4μ2(n+q)r2/δ02mq\geq C\kappa^{4}\mu^{2}(n+q)r^{2}/\delta_{0}^{2} , then w.p. at least 1exp(c(n+q))1-\exp(-c(n+q)),

SD(𝑼,𝑼0)δ0.\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{0})\leq\delta_{0}.
Proof.

See Sec. III-E (simpler proof with sample-splitting for α\alpha) or Appendix B (proof without sample-splitting). Proof outline is given in Sec. III-D. ∎

Theorem 3.2 (GD Descent).

If, at each iteration tt, mqCκ4μ2(n+q)r2logκmq\geq C\kappa^{4}\mu^{2}(n+q)r^{2}\log\kappa and m>Cmax(logq,logn)m>C\max(\log q,\log n); if SD(𝐔,𝐔0)δ0=c/κ2\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{0})\leq\delta_{0}=c/\kappa^{2} for a c0.1/1.1c\leq 0.1/1.1; and if η0.5/σmax2\eta\leq 0.5/{\sigma_{\max}^{*}}^{2}, then w.p. at least 1(t+1)n101-(t+1)n^{-10},

SD(𝑼,𝑼t+1)δt+1:=(1(ησmax2)0.4κ2)t+1δ0.\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t+1})\leq\delta_{t+1}:=\left(1-(\eta{\sigma_{\max}^{*}}^{2})\tfrac{0.4}{\kappa^{2}}\right)^{t+1}\delta_{0}.

If η=0.5/σmax2\eta=0.5/{\sigma_{\max}^{*}}^{2}, this simplifies to SD(𝐔,𝐔t+1)(10.2/κ2)t+1δ0\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t+1})\leq(1-0.2/\kappa^{2})^{t+1}\delta_{0}.

Also, with the above probability,

(1/m)Uf(𝑼t,𝑩t+1)1.6δtσmax2.\|(1/m)\nabla_{U}f({\bm{U}}_{t},\bm{B}_{t+1})\|\leq 1.6\delta_{t}{\sigma_{\max}^{*}}^{2}.

with δt\delta_{t} defined in the SD(𝐔,𝐔t+1)\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t+1}) bound above.

Since δt\delta_{t} decays exponentially with tt, the same is also true for the gradient norm at iteration tt, (1/m)Uf(𝑼t,𝑩t+1)\|(1/m)\nabla_{U}f({\bm{U}}_{t},\bm{B}_{t+1})\|.

Proof.

See Sec. III-C. Proof outline is given in Sec. III-B. ∎

Proof of Theorem 2.1.

The SD(.)\mathrm{SD}(.) bound is an immediate consequence of Theorems 3.1 and 3.2. To apply Theorem 3.2, we need δ0=c/κ2\delta_{0}=c/\kappa^{2}. By Theorem 3.1, if mqCκ6μ2(n+q)r2mq\geq C\kappa^{6}\mu^{2}(n+q)r^{2}, then, w.p. at least 1n101-n^{-10}, SD(𝑼,𝑼0)δ0=c/κ2\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{0})\leq\delta_{0}=c/\kappa^{2}. With this, if, at each iteration, mqCκ4μ2(n+q)r2logκmq\geq C\kappa^{4}\mu^{2}(n+q)r^{2}\log\kappa and mCmax(logq,logn)m\geq C\max(\log q,\log n), then by Theorem 3.2, w.p. at least 1(t+1)n101-(t+1)n^{-10}, the stated bound on SD(𝑼,𝑼t+1)\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t+1}) holds. By setting T=Cκ2log(1/ϵ)T=C\kappa^{2}\log(1/\epsilon) in this, we can guarantee (1c1κ2)Tϵ\left(1-\tfrac{c_{1}}{\kappa^{2}}\right)^{T}\leq\epsilon. This proves the SD(𝑼T,𝑼)\mathrm{SD}({\bm{U}}_{T},{\bm{U}}^{*}{}) bound. The bounds on 𝒙k𝒙k\|\bm{x}_{k}-\bm{x}^{*}_{k}\| and 𝑿𝑿F\|{\bm{X}}-{\bm{X}}^{*}\|_{F} follow by Lemma 3.3 given in Sec. III-C.∎

III-B Proof outline (and novelty) for Theorem 3.2

For proving exponential error decay, we need to show the following: at iteration tt, if SD(𝑼,𝑼)δt\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})\leq\delta_{t} with δt<δ0=c/κ2\delta_{t}<\delta_{0}=c/\kappa^{2}, then SD(𝑼+,𝑼)cδt\mathrm{SD}({\bm{U}}^{+},{\bm{U}}^{*}{})\leq c\delta_{t} for a c<1c<1. We explain how to do this next. Suppose that, at iteration tt, SD(𝑼,𝑼)δt<δ0=0.1/κ2\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})\leq\delta_{t}<\delta_{0}=0.1/\kappa^{2}.

Analyzing the minimization step for updating B\bm{B} (Lemma 3.3). Recall from Algorithm 1 that 𝒃k=(𝑨k𝑼)𝒚k\bm{b}_{k}=(\bm{A}_{k}{\bm{U}})^{\dagger}\bm{y}_{k}, 𝒙k=𝑼𝒃k\bm{x}_{k}={\bm{U}}\bm{b}_{k}, and 𝒙k=𝑼𝒃k\bm{x}^{*}_{k}={\bm{U}}^{*}{}\bm{b}^{*}_{k}. Using standard results from [26], we can show that the estimates 𝒃k\bm{b}_{k} satisfy 𝒃k𝑼𝒙k0.4(𝑰𝑼𝑼)𝑼𝒃k\|\bm{b}_{k}-{\bm{U}}^{\top}\bm{x}^{*}_{k}\|\leq 0.4\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|. This then implies that (i) 𝒃k\bm{b}_{k}’s are incoherent, i.e., 𝒃k1.1μσmaxr/q\|\bm{b}_{k}\|\leq 1.1\mu{\sigma_{\max}^{*}}\sqrt{r/q}; and (ii) 𝒙k𝒙k1.4(𝑰𝑼𝑼)𝑼𝒃k1.4δtmaxk𝒙k\|\bm{x}_{k}-\bm{x}^{*}_{k}\|\leq 1.4\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|\leq 1.4\delta_{t}\max_{k}\|\bm{x}^{*}_{k}\|, i.e., we can get the desired column-wise error bound. Also (iii) 𝑿𝑿F1.4δtσmax\|{\bm{X}}-{\bm{X}}^{*}\|_{F}\leq 1.4\delta_{t}{\sigma_{\max}^{*}} (notice this bound does not contain rr). We get this as follows:

𝑿𝑿F\displaystyle\|{\bm{X}}-{\bm{X}}^{*}\|_{F} =k𝒙k𝒙k2\displaystyle=\sqrt{\sum_{k}\|\bm{x}_{k}-\bm{x}^{*}_{k}\|^{2}}
1.42k(𝑰𝑼𝑼)𝑼𝒃k2\displaystyle\leq\sqrt{1.4^{2}\sum_{k}\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|^{2}}
=1.4(𝑰𝑼𝑼)𝑼𝑩F\displaystyle=1.4\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{B}^{*}\|_{F}
1.4(𝑰𝑼𝑼)𝑼Fσmax\displaystyle\leq 1.4\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\|_{F}{\sigma_{\max}^{*}}
=1.4SD(𝑼,𝑼)σmax1.4δtσmax\displaystyle=1.4\,\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}})\,{\sigma_{\max}^{*}}\leq 1.4\delta_{t}{\sigma_{\max}^{*}}

Similarly, 𝑩𝑼𝑿F0.4δtσmax.\|\bm{B}-{\bm{U}}^{\top}{\bm{X}}^{*}\|_{F}\leq 0.4\delta_{t}{\sigma_{\max}^{*}}. (iv) Using Weyl’s inequality and δt<0.1/κ2\delta_{t}<0.1/\kappa^{2}, this then implies that σmax(𝑩)1.1σmax\sigma_{\max}(\bm{B})\leq 1.1{\sigma_{\max}^{*}} and σmin(𝑩)0.9σmin\sigma_{\min}(\bm{B})\geq 0.9{\sigma_{\min}^{*}}.
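To make this step concrete, the following is a minimal NumPy sketch (not the authors' implementation; the names update_B, U, A_list, y_list are illustrative) of the column-wise LS update 𝒃k = (𝑨k𝑼)†𝒚k and 𝒙k = 𝑼𝒃k from Algorithm 1. Each solve is only rr-dimensional, which is why the step decouples column-wise and is cheap.

```python
import numpy as np

def update_B(U, A_list, y_list):
    """Column-wise LS step: b_k = (A_k U)^dagger y_k and x_k = U b_k.

    U: n x r with orthonormal columns; A_list[k]: m x n; y_list[k]: length-m vector.
    """
    r, q = U.shape[1], len(A_list)
    B = np.zeros((r, q))
    for k in range(q):
        # each least-squares problem is only r-dimensional and uses only (A_k, y_k),
        # which is also what makes this step easy to federate
        B[:, k], *_ = np.linalg.lstsq(A_list[k] @ U, y_list[k], rcond=None)
    X = U @ B  # current estimate of X*
    return B, X
```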

Bounding SD(U+,U)\mathrm{SD}({\bm{U}}^{+},{\bm{U}}^{*}{}) by a novel use of fundamental theorem of calculus (Lemma 3.4). Recall from Algorithm 1 that 𝑼^+=𝑼^(η/m)Uf(𝑼,𝑩)\hat{\bm{U}}^{+}=\hat{\bm{U}}-(\eta/m)\nabla_{U}f({\bm{U}},\bm{B}) and 𝑼^+=QR𝑼+𝑹+\hat{\bm{U}}^{+}\overset{\mathrm{QR}}{=}{\bm{U}}^{+}{\bm{R}}^{+}. We bound SD(𝑼+,𝑼)\mathrm{SD}({\bm{U}}^{+},{\bm{U}}^{*}{}) using the fundamental theorem of calculus [18, Chapter XIII, Theorem 4.2],[19], summarized in Theorem 4.2. The use of this result is motivated by its use in [19], and many earlier works, where it is used in a standard way: to bound the Euclidean norm error 𝒙𝒙\|\bm{x}-\bm{x}^{*}\| for standard GD to solve the PR problem for recovering a single vector 𝒙\bm{x}^{*}. Thus, at the true solution 𝒙=𝒙\bm{x}=\bm{x}^{*}, the gradient of the cost function was zero. In our case, there are two differences: (i) we need to bound the subspace distance error, and (ii) our algorithm is not standard GD; in particular, this means that Uf(𝑼𝑼𝑼,𝑩)0\nabla_{U}f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B})\neq 0.

To deal with (i) and (ii), we proceed as follows. We first bound (𝑰𝑼𝑼)𝑼^+F\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\hat{\bm{U}}^{+}\|_{F}. To do this, we apply Theorem 4.2 on vectorized Uf(𝑼,𝑩)\nabla_{U}f({\bm{U}},\bm{B}) with the pivot being vectorized Uf(𝑼𝑼𝑼,𝑩)\nabla_{U}f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B}), and use this in the equation for 𝑼^+\hat{\bm{U}}^{+}. Next, we project both sides of this expression orthogonal to 𝑼{\bm{U}}^{*}{} followed by some careful linear algebra. Notice here that Uf(𝑼𝑼𝑼,𝑩)0\nabla_{U}f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B})\neq 0, because 𝑩𝑩\bm{B}\neq\bm{B}^{*}. Because of this, we get an extra term, Term2:=(𝑰𝑼𝑼)Uf(𝑼𝑼𝑼,𝑩)\mathrm{Term2}:=(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\nabla_{U}f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B}), in our bound other than the usual term containing the Hessian. We are able to bound it by ϵδtσmax2\epsilon\delta_{t}{\sigma_{\max}^{*}}^{2} for any constant small enough ϵ\epsilon, by realizing that 𝔼[Term2]=0\mathbb{E}[\mathrm{Term2}]=0 (conditioned on past measurements), and that its summands are nice-enough subexponentials. Next, we bound SD(𝑼,𝑼+)\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}^{+}) by using

SD(𝑼,𝑼+)\displaystyle\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}^{+}) (𝑰𝑼𝑼)𝑼^+F(𝑹+)1\displaystyle\leq\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\hat{\bm{U}}^{+}\|_{F}\|({\bm{R}}^{+})^{-1}\|
=(𝑰𝑼𝑼)𝑼^+Fσmin(𝑼^+)\displaystyle=\frac{\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\hat{\bm{U}}^{+}\|_{F}}{\sigma_{\min}(\hat{\bm{U}}^{+})}

and σmin(𝑼^+)=σmin(𝑼(η/m)𝑼f(𝑼,𝑩))1(η/m)𝑼f(𝑼,𝑩)\sigma_{\min}(\hat{\bm{U}}^{+})=\sigma_{\min}({\bm{U}}-(\eta/m)\nabla_{\bm{U}}f({\bm{U}},\bm{B}))\geq 1-(\eta/m)\|\nabla_{\bm{U}}f({\bm{U}},\bm{B})\|.
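For concreteness, a minimal sketch of the 𝑼-update analyzed above (one GD step followed by QR) is given next; the names are illustrative and this is not the authors' code. The gradient accumulated below is ∇_U f(𝑼,𝑩) = Σ_ki (𝒂ki⊤𝑼𝒃k − 𝒚ki)𝒂ki𝒃k⊤, as in Definition 4.4.

```python
import numpy as np

def update_U(U, B, A_list, y_list, eta):
    """One AltGD-Min U-update: U_hat = U - (eta/m) * grad_U f(U, B), then QR."""
    m = y_list[0].shape[0]
    grad = np.zeros_like(U)
    for k in range(len(A_list)):
        resid = A_list[k] @ (U @ B[:, k]) - y_list[k]     # A_k U b_k - y_k (length m)
        grad += A_list[k].T @ np.outer(resid, B[:, k])    # sum_i (a_ki^T U b_k - y_ki) a_ki b_k^T
    U_hat = U - (eta / m) * grad                          # plain GD step on U
    U_new, _ = np.linalg.qr(U_hat)                        # QR retraction: U_hat = U^+ R^+
    return U_new
```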

Bounding the terms in the SD(U,U+)\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}^{+}) bound (Lemma 3.5). Consider 𝑼f(𝑼,𝑩)\|\nabla_{\bm{U}}f({\bm{U}},\bm{B})\|. Using Lemma 3.3, it can be shown that, for unit vectors 𝒘,𝒛\bm{w},\bm{z}, the maximum sub-exponential norm of any summand of 𝒘𝑼f(𝑼,𝑩)𝒛\bm{w}^{\top}\nabla_{\bm{U}}f({\bm{U}},\bm{B})\bm{z} is bounded by 𝒙k𝒙k𝒃k1.1μ2σmax2δt(r/q)\|\bm{x}_{k}-\bm{x}^{*}_{k}\|\cdot\|\bm{b}_{k}\|\leq 1.1\mu^{2}{\sigma_{\max}^{*}}^{2}\delta_{t}(r/q). Observe that we get this (sufficiently small) bound because of the extra 𝒃k\bm{b}_{k}^{\top} term in the summands of 𝑼f(𝑼,𝑩)\nabla_{\bm{U}}f({\bm{U}},\bm{B}) compared to those in 𝑿f~(𝑿)\nabla_{\bm{X}}\tilde{f}({\bm{X}}). This, along with using the sub-exponential Bernstein inequality [26] followed by a standard epsilon-net argument, and bounding 𝔼[Uf]\|\mathbb{E}[\nabla_{U}f]\| using 𝔼[Uf]=m(𝑿𝑿)𝑩mδtσmax2\|\mathbb{E}[\nabla_{U}f]\|=\|m({\bm{X}}-{\bm{X}}^{*})\bm{B}^{\top}\|\leq m\delta_{t}{\sigma_{\max}^{*}}^{2} (by Lemma 3.3), helps guarantee that Uf2mδtσmax2\|\nabla_{U}f\|\lesssim 2m\delta_{t}{\sigma_{\max}^{*}}^{2} w.h.p. as long as mq(n+q)r2mq\gtrsim(n+q)r^{2}. We bound Term2F\|\mathrm{Term2}\|_{F} using similar ideas and the key fact that 𝔼[Term2]=0\mathbb{E}[\mathrm{Term2}]=0. This is true because of sample-splitting. We upper and lower bound the eigenvalues of the Hessian, Hess\mathrm{Hess}, using similar ideas and the following: for a unit vector 𝒘\bm{w} of length nrnr and its rearranged unit Frobenius norm matrix 𝑾{\bm{W}} of size n×rn\times r, 𝔼[𝒘Hess𝒘]=𝔼[ki(𝒂ki𝑾𝒃k)2]=m𝑾𝑩F2\mathbb{E}[\bm{w}^{\top}\mathrm{Hess}\ \bm{w}]=\mathbb{E}[\sum_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k})^{2}]=m\|{\bm{W}}\bm{B}\|_{F}^{2}. Using the bounds on σi(𝑩)\sigma_{i}(\bm{B}) from Lemma 3.3, this can be upper and lower bounded.

III-C Lemmas for proving the GD iterations' Theorem 3.2, and its proof

Let 𝑼𝑼t{\bm{U}}\equiv{\bm{U}}_{t}, 𝑩𝑩t+1\bm{B}\equiv\bm{B}_{t+1}. The proof uses the following three lemmas.

Lemma 3.3 (Error bound on 𝑩\bm{B} and its implications).

Let U𝐔tU\equiv{\bm{U}}_{t}, 𝐁𝐁t+1\bm{B}\equiv\bm{B}_{t+1}, and

𝒈k:=𝑼𝒙k.\bm{g}_{k}:={\bm{U}}^{\top}\bm{x}^{*}_{k}.

Assume that SD(𝐔,𝐔t)δt\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t})\leq\delta_{t} with δt<δ0=c/κ2\delta_{t}<\delta_{0}=c/\kappa^{2} (this bound on δt\delta_{t} is needed for the second part of this lemma). Then, w.p. 1qexp(rcm)\geq 1-q\exp(r-cm),

  1.
    𝒈k𝒃k\displaystyle\|\bm{g}_{k}-\bm{b}_{k}\| 0.4(𝑰n𝑼𝑼)𝑼𝒃k\displaystyle\leq 0.4\|\left(\bm{I}_{n}-{\bm{U}}{\bm{U}}^{\top}\right){\bm{U}}^{*}{}\bm{b}^{*}_{k}\| (3)
  2.

    This in turn implies all of the following.

    (a)

      𝒙k𝒙k1.4(𝑰𝑼𝑼)𝑼𝒃k\|\bm{x}_{k}-\bm{x}^{*}_{k}\|\leq 1.4\|\left(\bm{I}-{\bm{U}}{\bm{U}}^{\top}\right){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|

    (b)

      𝑮𝑩F0.4δtσmax\|\bm{G}-\bm{B}\|_{F}\leq 0.4\delta_{t}{\sigma_{\max}^{*}} and 𝑿𝑿F1.16δtσmax\|{\bm{X}}^{*}-{\bm{X}}\|_{F}\leq\sqrt{1.16}\delta_{t}{\sigma_{\max}^{*}},

    (c)

      𝒈k𝒃k0.4δt𝒃k\|\bm{g}_{k}-\bm{b}_{k}\|\leq 0.4\delta_{t}\|\bm{b}^{*}_{k}\| and 𝒙k𝒙k1.4δt𝒙k\|\bm{x}_{k}-\bm{x}^{*}_{k}\|\leq 1.4\delta_{t}\|\bm{x}^{*}_{k}\|,

    (d)

      𝑼𝑼𝒃k𝒃k2.4δt𝒃k\|{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}\|\leq 2.4\delta_{t}\|\bm{b}^{*}_{k}\|,

    (e)

      𝒃k1.1μσmaxr/q\|\bm{b}_{k}\|\leq 1.1\mu{\sigma_{\max}^{*}}\sqrt{r/q}.

    (f)

      σmin(𝑩)0.9σmin\sigma_{\min}(\bm{B})\geq 0.9{\sigma_{\min}^{*}} and σmax(𝑩)1.1σmax\sigma_{\max}(\bm{B})\leq 1.1{\sigma_{\max}^{*}}.

Proof.

See Sec. IV-D. ∎

Lemma 3.4.

Let 𝐔𝐔t{\bm{U}}\equiv{\bm{U}}_{t}, 𝐁𝐁t+1\bm{B}\equiv\bm{B}_{t+1}. Let \otimes denote the Kronecker product. We have

SD(𝑼t+1,𝑼)\displaystyle\mathrm{SD}({\bm{U}}_{t+1},{\bm{U}}^{*}{})
𝑰nr(η/m)HessSD(𝑼,𝑼)+(η/m)Term2F1(η/m)GradU,\displaystyle\qquad\leq\dfrac{\|\bm{I}_{nr}-(\eta/m)\mathrm{Hess}\|\cdot\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}})+(\eta/m)\|\mathrm{Term2}\|_{F}}{1-(\eta/m)\|\mathrm{GradU}\|},

where,

GradU\displaystyle\mathrm{GradU} :=𝑼f(𝑼,𝑩)=ki(𝒚ki𝒂ki𝑼𝒃k)𝒂ki𝒃k\displaystyle:=\nabla_{\bm{U}}f({\bm{U}},\bm{B})=\sum_{ki}(\bm{y}_{ki}-\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{b}_{k})\bm{a}_{ki}\bm{b}_{k}{}^{\top}
Term2\displaystyle\mathrm{Term2} :=(𝑰𝑼𝑼)𝑼f((𝑼𝑼𝑼),𝑩)\displaystyle:=(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\nabla_{\bm{U}}f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}}),\bm{B})
=(𝑰𝑼𝑼)ki(𝒚ki𝒂ki𝑼𝑼𝑼𝒃k)𝒂ki𝒃k\displaystyle=(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\sum_{ki}(\bm{y}_{ki}-\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}}\bm{b}_{k})\bm{a}_{ki}\bm{b}_{k}{}^{\top}
Hess\displaystyle\mathrm{Hess} :=ki(𝒂ki𝒃k)(𝒂ki𝒃k)\displaystyle:=\sum_{ki}(\bm{a}_{ki}\otimes\bm{b}_{k})(\bm{a}_{ki}\otimes\bm{b}_{k})^{\top}
Proof.

See Sec. IV-B. ∎

Lemma 3.5.

Assume SD(𝐔,𝐔)δt<δ0=c/κ2\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}})\leq\delta_{t}<\delta_{0}=c/\kappa^{2}. Then,

  1.

    w.p. at least 1exp((n+r)cmqϵ12/rμ2)exp(logq+rcm)1-\exp((n+r)-cmq\epsilon_{1}^{2}/r\mu^{2})-\exp(\log q+r-cm),

    GradU1.5(1.1+ϵ1)mδtσmax2;\|\mathrm{GradU}\|\leq 1.5(1.1+\epsilon_{1})m\delta_{t}{\sigma_{\max}^{*}}^{2};
  2.

    w.p. at least 1exp(nrcmqϵ22/rμ2)exp(logq+rcm)1-\exp(nr-cmq\epsilon_{2}^{2}/r\mu^{2})-\exp(\log q+r-cm),

    Term2F1.1mϵ2δtσmax2;\|\mathrm{Term2}\|_{F}\leq 1.1m\epsilon_{2}\delta_{t}{\sigma_{\max}^{*}}^{2};
  3.

    w.p. at least 1exp(nrlogκcmqϵ32/rκ4μ2)exp(logq+rcm)1-\exp(nr\log\kappa-cmq\epsilon_{3}^{2}/r\kappa^{4}\mu^{2})-\exp(\log q+r-cm),

    m(0.651.2ϵ3)σmin2\displaystyle m(0.65-1.2\epsilon_{3}){\sigma_{\min}^{*}}^{2} λmin(Hess)\displaystyle\leq\lambda_{\min}(\mathrm{Hess})
    λmax(Hess)m(1.1+ϵ3)σmax2.\displaystyle\leq\lambda_{\max}(\mathrm{Hess})\leq m(1.1+\epsilon_{3}){\sigma_{\max}^{*}}^{2}.
Proof.

See Sec. IV-C. ∎

Proof of Theorem 3.2.

The proof follows by induction. Base case for t=0t=0 is true by assumption. Induction assumption: Assume that, w.p. at least 1tn101-tn^{-10}, SD(𝑼,𝑼t)δt\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t})\leq\delta_{t} with δtδ0=c0/κ2\delta_{t}\leq\delta_{0}=c_{0}/\kappa^{2}.

Set ϵ1=0.1\epsilon_{1}=0.1, ϵ3=0.01\epsilon_{3}=0.01, ϵ2=0.01/(1.1κ2)\epsilon_{2}=0.01/(1.1\kappa^{2}), and c0=0.1/(1.5(1.1+0.1))c_{0}=0.1/(1.5(1.1+0.1)).

The upper bound on λmax(Hess)\lambda_{\max}(\mathrm{Hess}) and using η0.5/σmax2\eta\leq 0.5/{\sigma_{\max}^{*}}^{2} implies that λmin(𝑰nr(η/m)Hess)=1(η/m)λmax(Hess)10.5(1.1+0.01)mσmax2mσmax2>10.555>0\lambda_{\min}(\bm{I}_{nr}-(\eta/m)\mathrm{Hess})=1-(\eta/m)\lambda_{\max}(\mathrm{Hess})\geq 1-\frac{0.5(1.1+0.01)m{\sigma_{\max}^{*}}^{2}}{m{\sigma_{\max}^{*}}^{2}}>1-0.555>0 i.e. 𝑰nr(η/m)Hess\bm{I}_{nr}-(\eta/m)\mathrm{Hess} is positive definite. Thus, 𝑰nr(η/m)Hess=λmax(𝑰nr(η/m)Hess)=1(η/m)λmin(Hess)1(η/m)m(0.651.2ϵ3)σmin21(ησmax2)0.63/κ2.\|\bm{I}_{nr}-(\eta/m)\mathrm{Hess}\|=\lambda_{\max}(\bm{I}_{nr}-(\eta/m)\mathrm{Hess})=1-(\eta/m)\lambda_{\min}(\mathrm{Hess})\leq 1-(\eta/m)m(0.65-1.2\epsilon_{3}){\sigma_{\min}^{*}}^{2}\leq 1-(\eta{\sigma_{\max}^{*}}^{2})0.63/\kappa^{2}.

By Lemma 3.4, Lemma 3.5, and the above, w.p. at least 1tn10exp((n+q)cmq/rμ2)exp(nrcmq/rκ4μ2)exp(nrlogκcmq/rκ4μ2)exp(logq+rcm)1-tn^{-10}-\exp((n+q)-cmq/r\mu^{2})-\exp(nr-cmq/r\kappa^{4}\mu^{2})-\exp(nr\log\kappa-cmq/r\kappa^{4}\mu^{2})-\exp(\log q+r-cm),

SD(𝑼,𝑼t+1)\displaystyle\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{t+1})
(1(ησmax2)0.63/κ2)δt+(η/m)1.1mϵ2σmax2δt1(η/m)1.5(1.1+ϵ1)mδtσmax2\displaystyle\leq\dfrac{(1-(\eta{\sigma_{\max}^{*}}^{2})0.63/\kappa^{2})\cdot\delta_{t}+(\eta/m)1.1m\epsilon_{2}{\sigma_{\max}^{*}}^{2}\delta_{t}}{1-(\eta/m)1.5(1.1+\epsilon_{1})m\delta_{t}{\sigma_{\max}^{*}}^{2}}
(1(ησmax2)0.63/κ2+(ησmax2)0.01/κ21(ησmax2)0.1/κ2)δt\displaystyle\leq\left(\frac{1-(\eta{\sigma_{\max}^{*}}^{2})0.63/\kappa^{2}+(\eta{\sigma_{\max}^{*}}^{2})0.01/\kappa^{2}}{1-(\eta{\sigma_{\max}^{*}}^{2})0.1/\kappa^{2}}\right)\delta_{t}
(1(ησmax2)0.42κ2)δt\displaystyle\leq\left(1-(\eta{\sigma_{\max}^{*}}^{2})\frac{0.42}{\kappa^{2}}\right)\delta_{t}

The second inequality substituted the values of ϵj\epsilon_{j}’s and used δt<δ0=0.1/(1.5(1.1+0.1)κ2)\delta_{t}<\delta_{0}=0.1/(1.5(1.1+0.1)\kappa^{2}) for its denominator term. The third inequality used (1(ησmax2)0.1/κ2)1(1+(ησmax2)0.2/κ2)(1-(\eta{\sigma_{\max}^{*}}^{2})0.1/\kappa^{2})^{-1}\leq(1+(\eta{\sigma_{\max}^{*}}^{2})0.2/\kappa^{2}) (for 0<x<10<x<1, 1/(1x)1+2x1/(1-x)\leq 1+2x).

By plugging in the epsilon values in the probability, the above holds w.p. 1tn100.2exp((n+q)cmq/rμ2)0.2exp(nrcmq/rμ2κ4)0.2exp(nrlogκcmq/rμ2κ4)exp(logq+rcm)\geq 1-tn^{-10}-0.2\exp((n+q)-cmq/r\mu^{2})-0.2\exp(nr-cmq/r\mu^{2}\kappa^{4})-0.2\exp(nr\log\kappa-cmq/r\mu^{2}\kappa^{4})-\exp(\log q+r-cm) . If mqCκ4(n+q)r2logκmq\geq C\kappa^{4}(n+q)r^{2}\log\kappa and mCmax(r,logq,logn)m\geq C\max(r,\log q,\log n) for a CC large enough, then, this probability is 1tn100.2exp(c(n+q))0.4exp(cnr)n10>1(t+1)n10\geq 1-tn^{-10}-0.2\exp(-c(n+q))-0.4\exp(-cnr)-n^{-10}>1-(t+1)n^{-10}. ∎

III-D Proof outline (and novelty) for Initialization Theorem 3.1

Recall that we compute 𝑼0{\bm{U}}_{0} as the top rr left singular vectors of 𝑿0{\bm{X}}_{0} defined in (2) and that this is a truncated version of 𝑿0,full{\bm{X}}_{0,full}. As noted there, we cannot use 𝑿0,full{\bm{X}}_{0,full} because its summands are not nice-enough sub-exponentials. Truncation converts the summands into sub-Gaussian r.v.s. For these, we can use the sub-Gaussian Hoeffding inequality [26, Chap 2], which needs a small enough bound on only the squared sum of the sub-Gaussian norms of the mqmq summands, and not on their maximum value (as needed by the sub-exponential Bernstein inequality). This is an easier requirement that gets satisfied for our problem. Of course, truncation also means that the summands of 𝑿0{\bm{X}}_{0} are not mutually independent (each summand depends on the truncation threshold α\alpha which is computed using all measurements 𝒚ki\bm{y}_{ki}) and that 𝔼[𝑿0]𝑿\mathbb{E}[{\bm{X}}_{0}]\neq{\bm{X}}^{*}. There are two ways to resolve this issue. The first and simpler approach, but one that assumes more sample-splitting, is given below in Sec. III-E. It assumes that α\alpha is computed using a different, independent, set of measurements than those used to define the rest of 𝑿0{\bm{X}}_{0}. With this, 𝔼[𝑿0|α]=𝑿𝑫(α)\mathbb{E}[{\bm{X}}_{0}|\alpha]={\bm{X}}^{*}{\bm{D}}(\alpha), where 𝑫{\bm{D}} is a diagonal matrix defined below in Lemma 3.6 and the summands are independent conditioned on α\alpha. Thus, we can apply Wedin’s sinΘ\sin\Theta theorem [27, 28] (given in Theorem 4.1) on 𝑿0{\bm{X}}_{0} and 𝔼[𝑿0|α]\mathbb{E}[{\bm{X}}_{0}|\alpha] to bound SD(𝑼0,𝑼)\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{}), followed by sub-Gaussian Hoeffding and a standard epsilon-net argument, to bound the terms in this bound.

To avoid sample-splitting for α\alpha, we need to significantly modify the sandwiching arguments from [20, 5] for our setting. This is done in Appendix B. In the previous works, sandwiching was used for a symmetric positive definite (p.d.) matrix. Here we need such an argument for a non-symmetric matrix. Briefly, we do this as follows. We define a matrix 𝑿+{\bm{X}}_{+} that is such that the span of top rr left singular vectors of its expected value equals that of 𝑼{\bm{U}}^{*}{} and that can be shown to be close to 𝑿0{\bm{X}}_{0}. 𝑿+{\bm{X}}_{+} is 𝑿0{\bm{X}}_{0} with α\alpha replaced by C~(1+ϵ)𝑿F2/q\tilde{C}(1+\epsilon)\|{\bm{X}}^{*}\|_{F}^{2}/q. We bound 𝑿0𝔼[𝑿+]\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}]\| by bounding 𝑿+𝑿0\|{\bm{X}}_{+}-{\bm{X}}_{0}\| and 𝑿+𝔼[𝑿+]\|{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\|. Bounding the latter is simple. Bounding 𝑿+𝑿0\|{\bm{X}}_{+}-{\bm{X}}_{0}\| requires bounding 𝒘(𝑿+𝑿0)𝒛\bm{w}^{\top}({\bm{X}}_{+}-{\bm{X}}_{0})\bm{z} for unit vectors 𝒘,𝒛\bm{w},\bm{z} and this is not straightforward because its summands are not mutually independent. To deal with this, we first bound each summand by its absolute value, and then bound the indicator function term to get a new one that is non-random so that the summands of this new term are mutually independent. But, its summands are no longer zero mean (because of taking the absolute values), and hence more work is needed to get the desired small enough bound on the expected value of this term.
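To make the initialization concrete, here is a minimal sketch. It assumes (consistent with the description of 𝑿0 above, though the precise definition (2) is not restated here) that the k-th column of 𝑿0 is (1/m)Σ_i 𝒚ki𝟙{𝒚ki² ≤ α}𝒂ki, with α = C̃Σ_ki𝒚ki²/(mq); the value of C̃ and all names are illustrative, not the authors' code.

```python
import numpy as np

def initialize_U(A_list, y_list, r, C_tilde=9.0):
    """Truncated spectral init: U_0 = top-r left singular vectors of the truncated matrix X_0."""
    n, q = A_list[0].shape[1], len(A_list)
    m = y_list[0].shape[0]
    # truncation threshold alpha; the simpler proof of Sec. III-E assumes it is computed
    # from an independent set of measurements, here the same ones are reused for brevity
    alpha = C_tilde * sum(float(np.sum(y ** 2)) for y in y_list) / (m * q)
    X0 = np.zeros((n, q))
    for k in range(q):
        y_trunc = y_list[k] * (y_list[k] ** 2 <= alpha)   # zero out the too-large measurements
        X0[:, k] = A_list[k].T @ y_trunc / m              # (1/m) sum_i y_ki,trunc * a_ki
    U0, _, _ = np.linalg.svd(X0, full_matrices=False)
    return U0[:, :r]
```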

III-E Simpler proof of Theorem 3.1 that assumes independent measurements used for computing α\alpha

For the simpler proof given here, assume that we use a different independent set of measurements for computing α\alpha than those used for the rest of 𝑿0{\bm{X}}_{0}, i.e., let

α=C~ki(𝒚kinrmX)2mq\alpha=\tilde{C}\frac{\sum_{ki}(\bm{y}_{ki}^{nrmX})^{2}}{mq}

with 𝒚kinrmX\bm{y}_{ki}^{nrmX} independent of {𝑨k(0),𝒚k(0)}\{\bm{A}_{k}^{(0)},\bm{y}_{k}^{(0)}\}. With this change, it is possible to compute 𝔼[𝑿0|α]\mathbb{E}[{\bm{X}}_{0}|\alpha] easily. This change does not affect the sample complexity order and so it does not change our theorem statement. The proof follows by combining the two lemmas and facts given next.

Lemma 3.6.

Conditioned on α\alpha, we have the following conclusions.

  1.

    Let ζ\zeta be a scalar standard Gaussian r.v. Define

    βk(α):=𝔼[ζ2𝟙{𝒙k2ζ2α}].\beta_{k}(\alpha):=\mathbb{E}[\zeta^{2}\mathbbm{1}_{\{\|\bm{x}^{*}_{k}\|^{2}\zeta^{2}\leq\alpha\}}].

    Then,

    𝔼[𝑿0|α]=𝑿𝑫(α),\displaystyle\mathbb{E}[{\bm{X}}_{0}|\alpha]={\bm{X}}^{*}{\bm{D}}(\alpha),
    where 𝑫(α):=diagonal(βk(α),k[q])\displaystyle\text{ where }{\bm{D}}(\alpha):=diagonal(\beta_{k}(\alpha),k\in[q]) (4)

    i.e. 𝑫(α){\bm{D}}(\alpha) is a diagonal matrix of size q×qq\times q with diagonal entries βk\beta_{k} defined above.

  2.

    Let 𝔼[𝑿0|α]=𝑿𝑫(α)=SVD𝑼𝚺ˇ𝑽ˇ\mathbb{E}[{\bm{X}}_{0}|\alpha]={\bm{X}}^{*}{\bm{D}}(\alpha)\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}\check{\bm{\Sigma}^{*}}\check{\bm{V}} be its rr-SVD. Then,

    SD(𝑼0,𝑼)\displaystyle\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\leq
    2max((𝑿0𝔼[𝑿0|α])𝑼F,(𝑿0𝔼[𝑿0|α])𝑽ˇF)σminminkβk(α)𝑿0𝔼[𝑿0|α]\displaystyle\dfrac{\sqrt{2}\max\left(\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha])^{\top}{\bm{U}}^{*}{}\|_{F},\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha])\check{\bm{V}}{}^{\top}\|_{F}\right)}{{\sigma_{\min}^{*}}\min_{k}\beta_{k}(\alpha)-\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\|} (5)

    as long as the denominator is positive.

Proof.

See Sec. IV-F. ∎

Define the set \mathcal{E} as follows

:={C~(1ϵ1)𝑿F2qαC~(1+ϵ1)𝑿F2q}.\displaystyle\mathcal{E}:=\left\{\tilde{C}(1-\epsilon_{1})\frac{\|{\bm{X}}^{*}\|_{F}^{2}}{q}\leq\alpha\leq\tilde{C}(1+\epsilon_{1})\frac{\|{\bm{X}}^{*}\|_{F}^{2}}{q}\right\}. (6)

The following fact is an immediate consequence of the sub-exponential Bernstein inequality applied to bound |αC~𝑿F2/q||\alpha-\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}/q|.

Fact 3.7.

Pr(α)1exp(c~mqϵ12):=1pα\Pr(\alpha\in\mathcal{E})\geq 1-\exp(-\tilde{c}mq\epsilon_{1}^{2}):=1-p_{\alpha}. Here c~=c/C~=c/κ2μ2.\tilde{c}=c/\tilde{C}=c/\kappa^{2}\mu^{2}.

The next lemma bounds the terms of Lemma 3.6.

Lemma 3.8.

Fix 0<ϵ1<10<\epsilon_{1}<1. Then,

  1.

    w.p. at least 1exp[(n+q)cϵ12mq/μ2κ2]1-\exp\left[(n+q)-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right], conditioned on α\alpha, for an α\alpha\in\mathcal{E},

    𝑿0𝔼[𝑿0|α]1.1ϵ1𝑿F\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\|\leq 1.1\epsilon_{1}\|{\bm{X}}^{*}\|_{F}
  2.

    w.p. at least 1exp[qrcϵ12mq/μ2κ2]1-\exp\left[qr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right], conditioned on α\alpha, for an α\alpha\in\mathcal{E},

    (𝑿0𝔼[𝑿0|α])𝑼F1.1ϵ1𝑿F\|\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\|_{F}\leq 1.1\epsilon_{1}\|{\bm{X}}^{*}\|_{F}
  3.

    w.p. at least 1exp[nrcϵ12mq/μ2κ2]1-\exp\left[nr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right], conditioned on α\alpha, for an α\alpha\in\mathcal{E},

    (𝑿0𝔼[𝑿0|α])𝑽ˇF1.1ϵ1𝑿F.\|\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top}\|_{F}\leq 1.1\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.
Proof.

See Sec. IV-G. ∎

We also need the following fact.

Fact 3.9.

For any ϵ10.1\epsilon_{1}\leq 0.1, mink𝔼[ζ2𝟙{|ζ|C~1ϵ1𝐗Fq𝐱k}]0.92.\min_{k}\mathbb{E}\left[\zeta^{2}\mathbbm{1}_{\left\{|\zeta|\leq\tilde{C}\frac{\sqrt{1-\epsilon_{1}}\|{\bm{X}}^{*}\|_{F}}{\sqrt{q}\|\bm{x}^{*}_{k}\|}\right\}}\right]\geq 0.92.
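As a numerical illustration of Fact 3.9 (a sketch only: the constant C̃ and the incoherence parameters, which determine the actual truncation level, are not instantiated here), the truncated second moment has the closed form E[ζ²𝟙{|ζ|≤τ}] = erf(τ/√2) − τ√(2/π)e^{−τ²/2}; it is about 0.92 at τ = 2.6 and increases towards 1 as τ grows.

```python
import math

def truncated_second_moment(tau):
    """E[zeta^2 * 1{|zeta| <= tau}] for a standard Gaussian zeta (closed form)."""
    return math.erf(tau / math.sqrt(2)) - tau * math.sqrt(2 / math.pi) * math.exp(-tau ** 2 / 2)

for tau in (2.0, 2.6, 3.0, 4.0):
    print(tau, round(truncated_second_moment(tau), 4))
# prints roughly 0.74, 0.92, 0.97, 1.00: the moment exceeds 0.92 once tau is about 2.6 or larger
```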

Proof of Theorem 3.1.

Set ϵ1=0.4δ0/rκ\epsilon_{1}=0.4\delta_{0}/\sqrt{r}\kappa. Define p0=2exp((n+q)cmqδ02/rκ2)+2exp(nrcmqδ02/rκ2)+2exp(qrcmqδ02/rκ2).p_{0}=2\exp((n+q)-cmq\delta_{0}^{2}/r\kappa^{2})+2\exp(nr-cmq\delta_{0}^{2}/r\kappa^{2})+2\exp(qr-cmq\delta_{0}^{2}/r\kappa^{2}). Recall that Pr(α)1pα\Pr(\alpha\in\mathcal{E})\geq 1-p_{\alpha} with pα=exp(c~mqϵ12)=exp(cmqδ02/rμ2κ2).p_{\alpha}=\exp(-\tilde{c}mq\epsilon_{1}^{2})=\exp(-cmq\delta_{0}^{2}/r\mu^{2}\kappa^{2}).

Using Lemma 3.8, conditioned on α\alpha, for an α\alpha\in\mathcal{E},

  • w.p. at least 1p01-p_{0}, 𝑿0𝔼[𝑿0|α]1.1ϵ1𝑿F=0.44δ0σmin,\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\|\leq 1.1\epsilon_{1}\|{\bm{X}}^{*}\|_{F}=0.44\delta_{0}{\sigma_{\min}^{*}}, and max((𝑿0𝔼[𝑿0|α])𝑼F,(𝑿0𝔼[𝑿0|α])𝑽ˇF)0.44δ0σmin\max\left(\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha])^{\top}{\bm{U}}^{*}{}\|_{F},\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha])\check{\bm{V}}^{\top}\|_{F}\right)\leq 0.44\delta_{0}{\sigma_{\min}^{*}}

  • minkβk(α)mink𝔼[ζ2𝟙{|ζ|C~1ϵ1𝑿Fq𝒙k}]0.9\min_{k}\beta_{k}(\alpha)\geq\min_{k}\mathbb{E}\left[\zeta^{2}\mathbbm{1}_{\{|\zeta|\leq\tilde{C}\frac{\sqrt{1-\epsilon_{1}}\|{\bm{X}}^{*}\|_{F}}{\sqrt{q}\|\bm{x}^{*}_{k}\|}\}}\right]\geq 0.9. The first inequality is an immediate consequence of α\alpha\in\mathcal{E} and the second follows by Fact 3.9.

Plugging the above bounds into (5) of Lemma 3.6, conditioned on α\alpha, for any α\alpha\in\mathcal{E}, w.p. at least 1p01-p_{0}, SD(𝑼0,𝑼)20.44δ00.90.44δ0<δ0\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\leq\frac{\sqrt{2}\cdot 0.44\delta_{0}}{0.9-0.44\delta_{0}}<\delta_{0} since δ0<0.1\delta_{0}<0.1. In other words,

Pr(SD(𝑼0,𝑼)δ0|α)p0for any α.\displaystyle\Pr\left(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0}|\alpha\right)\leq p_{0}\ \text{for any $\alpha\in\mathcal{E}$}. (7)

Since (i) Pr(SD(𝑼0,𝑼)δ0)Pr(SD(𝑼0,𝑼)δ0 and α)+Pr(α),\Pr(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0})\leq\Pr(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0}\text{ and }\alpha\in\mathcal{E})+\Pr(\alpha\notin\mathcal{E}), and (ii) Pr(SD(𝑼0,𝑼)δ0 and α)Pr(α)maxαPr(SD(𝑼0,𝑼)δ0|α),\Pr(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0}\text{ and }\alpha\in\mathcal{E})\leq\Pr(\alpha\in\mathcal{E})\max_{\alpha\in\mathcal{E}}\Pr(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0}|\alpha), thus, using Fact 3.7 and (7), we can conclude that

Pr(SD(𝑼0,𝑼)δ0)p0(1pα)+pαp0+pα\Pr\left(\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\geq\delta_{0}\right)\leq p_{0}(1-p_{\alpha})+p_{\alpha}\leq p_{0}+p_{\alpha}

Thus, for a δ0<0.1\delta_{0}<0.1, SD(𝑼0,𝑼)<δ0\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})<\delta_{0} w.p. at least 1p0pα=12exp((n+q)cmqδ02/rκ2)2exp(nrcmqδ02/rκ2)2exp(qrcmqδ02/rκ2)exp(cmqδ02/rμ2κ4)1-p_{0}-p_{\alpha}=1-2\exp((n+q)-cmq\delta_{0}^{2}/r\kappa^{2})-2\exp(nr-cmq\delta_{0}^{2}/r\kappa^{2})-2\exp(qr-cmq\delta_{0}^{2}/r\kappa^{2})-\exp(-cmq\delta_{0}^{2}/r\mu^{2}\kappa^{4}). This is 15exp(c(n+q))\geq 1-5\exp(-c(n+q)) if mq>Cκ2μ2(n+q)r2/δ02mq>C\kappa^{2}\mu^{2}(n+q)r^{2}/\delta_{0}^{2}. This finishes our proof. ∎

IV Proofs of all the lemmas

IV-A Basic tools used

Our proofs use the following results and definitions:

Theorem 4.1 (Wedin sinΘ\sin\Theta theorem for Frobenius norm subspace distance [27, 28, Theorem 2.3.1]).

For two n1×n2n_{1}\times n_{2} matrices 𝐌\bm{M}^{*}, 𝐌\bm{M}, let 𝐔,𝐔{\bm{U}}^{*}{},{\bm{U}} denote the matrices containing their top rr singular vectors and let 𝐕,𝐕\bm{V}^{*}{}^{\top},\bm{V}^{\top} be the matrices of their right singular vectors (recall from the problem definition that we defined the SVD with the right matrix transposed). Let σr,σr+1\sigma^{*}_{r},\sigma^{*}_{r+1} denote the rr-th and (r+1)(r+1)-th singular values of 𝐌\bm{M}^{*}. If 𝐌𝐌σrσr+1\|\bm{M}-\bm{M}^{*}\|\leq\sigma^{*}_{r}-\sigma^{*}_{r+1}, then

SD(𝑼,𝑼)\displaystyle\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})
2max((𝑴𝑴)𝑼F,(𝑴𝑴)𝑽F)σrσr+1𝑴𝑴\displaystyle\qquad\leq\frac{\sqrt{2}\max(\|(\bm{M}-\bm{M}^{*})^{\top}{\bm{U}}^{*}{}\|_{F},\|(\bm{M}-\bm{M}^{*})\bm{V}^{*}{}^{\top}\|_{F})}{\sigma^{*}_{r}-\sigma^{*}_{r+1}-\|\bm{M}-\bm{M}^{*}\|}
Theorem 4.2 (Fundamental theorem of calculus [18, Chapter XIII, Theorem 4.2], [19]).

For two vectors 𝐳0,𝐳d\bm{z}_{0},\bm{z}^{*}\in\Re^{d}, and a differentiable vector function g(𝐳)d2g(\bm{z})\in\Re^{d_{2}},

g(𝒛0)g(𝒛)=(τ=01g(𝒛(τ))𝑑τ)(𝒛0𝒛),g(\bm{z}_{0})-g(\bm{z}^{*})=\left(\int_{\tau=0}^{1}\nabla g(\bm{z}(\tau))d\tau\right)(\bm{z}_{0}-\bm{z}^{*}),

where

𝒛(τ)=𝒛+τ(𝒛0𝒛).\bm{z}(\tau)=\bm{z}^{*}+\tau(\bm{z}_{0}-\bm{z}^{*}).

Observe that 𝐳g(𝐳)\nabla_{\bm{z}}g(\bm{z}) is a d2×dd_{2}\times d matrix.

Definition 4.3.

For any n×rn\times r matrix 𝐙{\bm{{Z}}}, let 𝐙vec{{\bm{{Z}}}_{vec}} denote the nrnr length vector formed by arranging all rr columns of 𝐙{\bm{{Z}}} one below the other. Thus, for nn-length and rr-length vectors 𝐚\bm{a} and 𝐛\bm{b},

  • (𝒂𝒃)vec=𝒂𝒃(\bm{a}\bm{b}^{\top})_{vec}=\bm{a}\otimes\bm{b} with \otimes being the Kronecker product;

  • 𝒂𝑼𝒃=trace(𝒂𝑼𝒃)=trace(𝒃𝒂𝑼)=(𝒂𝒃),𝑼=𝒂𝒃,𝑼vec\bm{a}^{\top}{\bm{U}}\bm{b}=\mathrm{trace}(\bm{a}^{\top}{\bm{U}}\bm{b})=\mathrm{trace}(\bm{b}\bm{a}^{\top}{\bm{U}})=\langle(\bm{a}\bm{b}^{\top}),{\bm{U}}\rangle=\langle\bm{a}\otimes\bm{b},{\bm{U}}_{vec}\rangle;

f(𝑼vec,𝑩)=ki((𝒂ki𝒃k)𝑼vec𝒚ki)2f({{\bm{U}}_{vec}},\bm{B})=\sum_{ki}((\bm{a}_{ki}\otimes\bm{b}_{k})^{\top}{{\bm{U}}_{vec}}-\bm{y}_{ki})^{2} and

(𝑼f(𝑼,𝑩))vec=𝑼vecf(𝑼vec,𝑩)\displaystyle(\nabla_{{\bm{U}}}f({\bm{U}},\bm{B}))_{vec}=\nabla_{{\bm{U}}_{vec}}f({\bm{U}}_{vec},\bm{B}) (8)
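A small numerical sanity check of these identities (a sketch; note that the pairing of ⊗ with the vectorization must be consistent: below we use NumPy's kron together with row-major flattening, under which ⟨𝒂⊗𝒃, 𝑼vec⟩ equals 𝒂⊤𝑼𝒃 as used here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5, 3
a, b = rng.standard_normal(n), rng.standard_normal(r)
U = rng.standard_normal((n, r))

vec = lambda M: M.ravel()                # flattening convention paired with np.kron below
lhs = a @ U @ b                          # a^T U b
rhs = np.kron(a, b) @ vec(U)             # <a (x) b, U_vec>
assert np.allclose(lhs, rhs)

# the same pairing underlies Hess = sum_ki (a_ki (x) b_k)(a_ki (x) b_k)^T:
w = vec(U) / np.linalg.norm(vec(U))
W = w.reshape(n, r)
assert np.allclose((np.kron(a, b) @ w) ** 2, (a @ W @ b) ** 2)  # one summand of w^T Hess w
print("vec/Kronecker identities verified")
```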
Definition 4.4.

At various places, f(𝐔,𝐁)\nabla f({\bm{U}},\bm{B}) is short for 𝐔f(𝐔,𝐁)=ki𝐚ki𝐛k(𝐚ki𝐔𝐛k𝐲ki)\nabla_{\bm{U}}f({\bm{U}},\bm{B})=\sum_{ki}\bm{a}_{ki}\bm{b}_{k}{}^{\top}(\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{y}_{ki}) and similarly f(𝐔vec,𝐁)\nabla f({{\bm{U}}_{vec}},\bm{B}) is short for 𝐔vecf(𝐔vec,𝐁)=ki(𝐚ki𝐛k)((𝐚ki𝐛k)𝐔vec𝐲ki)\nabla_{{\bm{U}}_{vec}}f({{\bm{U}}_{vec}},\bm{B})=\sum_{ki}(\bm{a}_{ki}\otimes\bm{b}_{k})((\bm{a}_{ki}\otimes\bm{b}_{k})^{\top}{{\bm{U}}_{vec}}-\bm{y}_{ki}).

Definition 4.5.

For any vector 𝐰\bm{w}, we use 𝐰(k)\bm{w}(k) to denote its kk-th entry.

Definition 4.6.

Everywhere we use 𝒮nr\mathcal{S}_{nr} to denote both the set of matrices {𝐖n×r:𝐖F=1}\{{\bm{W}}\in\Re^{n\times r}:\|{\bm{W}}\|_{F}=1\} and the set of these matrices vectorized {𝐰nr:𝐰=1}\{\bm{w}\in\Re^{nr}:\|\bm{w}\|=1\}. We also switch between the two sometimes. In the entire writing below, 𝐰=𝐖vec\bm{w}={\bm{W}}_{vec}.

All the high probability bounds for initialization use the sub-Gaussian Hoeffding inequality, while those for the GD lemmas use the sub-exponential Bernstein inequality; both are from [26]. In addition, these lemmas use the following results to extend, via an epsilon-net argument, a bound holding for a fixed unit norm 𝑾{\bm{W}} (or 𝒘\bm{w}) to all unit norm 𝑾{\bm{W}}s (or 𝒘\bm{w}s).

Proposition 4.7 (Epsilon-netting for bounding max𝒘𝒮n,𝒛𝒮r|𝒘𝑴𝒛|\max_{\bm{w}\in\mathcal{S}_{n},\bm{z}\in\mathcal{S}_{r}}|\bm{w}^{\top}\bm{M}\bm{z}|).

For an n×rn\times r matrix 𝐌\bm{M} and fixed vectors 𝐰,𝐳\bm{w},\bm{z} with 𝐰𝒮n\bm{w}\in\mathcal{S}_{n} and 𝐳𝒮r\bm{z}\in\mathcal{S}_{r}, suppose that |𝐰𝐌𝐳|b0|\bm{w}^{\top}\bm{M}\bm{z}|\leq b_{0} w.p. at least 1p01-p_{0}. Consider ϵnet\epsilon_{net}-nets 𝒮¯n\bar{\mathcal{S}}_{n}, 𝒮¯r\bar{\mathcal{S}}_{r} covering 𝒮n\mathcal{S}_{n} and 𝒮r\mathcal{S}_{r}. Then, w.p. at least 1(1+2/ϵnet)n+rp01-(1+2/\epsilon_{net})^{n+r}p_{0},

  • max𝒘𝒮¯n,𝒛𝒮¯r|𝒘𝑴𝒛|b0\max_{\bm{w}\in\bar{\mathcal{S}}_{n},\bm{z}\in\bar{\mathcal{S}}_{r}}|\bm{w}^{\top}\bm{M}\bm{z}|\leq b_{0} and

  • max𝒘𝒮n,𝒛𝒮r|𝒘𝑴𝒛|112ϵnetϵnet2b0\max_{\bm{w}\in\mathcal{S}_{n},\bm{z}\in\mathcal{S}_{r}}|\bm{w}^{\top}\bm{M}\bm{z}|\leq\frac{1}{1-2\epsilon_{net}-\epsilon_{net}^{2}}b_{0}.

Using ϵnet=1/8\epsilon_{net}=1/8, this implies the following simpler conclusion:
W.p. at least 117n+rp0=1exp((log17)(n+r))p01-17^{n+r}p_{0}=1-\exp((\log 17)(n+r))\cdot p_{0}, max𝐰𝒮n,𝐳𝒮r|𝐰𝐌𝐳|1.4b0\max_{\bm{w}\in\mathcal{S}_{n},\bm{z}\in\mathcal{S}_{r}}|\bm{w}^{\top}\bm{M}\bm{z}|\leq 1.4b_{0}.

Proof.

The proof follows that of Lemma 4.4.1 of [26]. ∎

Proposition 4.8 (Epsilon-netting for bounding max𝑾𝒮nr𝑴,𝑾\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle\bm{M},{\bm{W}}\rangle).

For an n×rn\times r matrix 𝐌\bm{M} and a fixed n×rn\times r matrix 𝐖𝒮nr{\bm{W}}\in\mathcal{S}_{nr} (unit Frobenius norm matrix), suppose that 𝐌,𝐖b0\langle\bm{M},{\bm{W}}\rangle\leq b_{0} w.p. at least 1p01-p_{0}. Consider an ϵnet\epsilon_{net} net covering 𝒮nr\mathcal{S}_{nr}, 𝒮¯nr\bar{\mathcal{S}}_{nr}. Then w.p. at least 1(1+2/ϵnet)nrp01-(1+2/\epsilon_{net})^{nr}p_{0},

  • max𝑾𝒮¯nr𝑴,𝑾b0\max_{{\bm{W}}\in\bar{\mathcal{S}}_{nr}}\langle\bm{M},{\bm{W}}\rangle\leq b_{0} and

  • max𝑾𝒮nr𝑴,𝑾11ϵnetb0\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle\bm{M},{\bm{W}}\rangle\leq\frac{1}{1-\epsilon_{net}}b_{0}.

Using ϵnet=1/8\epsilon_{net}=1/8, this implies the following simpler conclusion:
w.p. at least 117nrp0=1exp((log17)(nr))p01-17^{nr}p_{0}=1-\exp((\log 17)(nr))\cdot p_{0}, max𝐖𝒮nr𝐌,𝐖1.2b0\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle\bm{M},{\bm{W}}\rangle\leq 1.2b_{0}.

Proof.

The proof follows exactly as that of Exercise 4.4.3 of [26]. ∎

Proposition 4.9 (Epsilon-netting for upper and lower bounding ki𝑴ki,𝑾2\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2} over all 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr}).

For n×rn\times r matrices 𝐌ki\bm{M}_{ki} and a fixed 𝐖𝒮nr{\bm{W}}\in\mathcal{S}_{nr}, suppose that, w.p. at least 1p01-p_{0},

b1ki𝑴ki,𝑾2b2b_{1}\leq\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}\leq b_{2}

Consider an ϵnet\epsilon_{net} net covering 𝒮nr\mathcal{S}_{nr}, 𝒮¯nr\bar{\mathcal{S}}_{nr}. Then, w.p. at least 1(1+2/ϵnet)nrp01-(1+2/\epsilon_{net})^{nr}p_{0},

max𝑾𝒮nrki𝑴ki,𝑾211ϵnet22ϵnetb2\max_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}\leq\frac{1}{1-\epsilon_{net}^{2}-2\epsilon_{net}}b_{2}

and

min𝑾𝒮nrki𝑴ki,𝑾2b12ϵnet11ϵnet22ϵnetb2\min_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}\geq b_{1}-2\epsilon_{net}\cdot\frac{1}{1-\epsilon_{net}^{2}-2\epsilon_{net}}b_{2}

Picking ϵnet=b1/(8b2)\epsilon_{net}=b_{1}/(8b_{2}) guarantees that the above lower bound is non-negative. In particular, it implies the following:
w.p. at least 1(24b2/b1)nrp0=1exp(Cnrlog(b2/b1))p01-(24b_{2}/b_{1})^{nr}p_{0}=1-\exp(Cnr\log(b_{2}/b_{1}))\cdot p_{0}, 0.8b1min𝐖𝒮nrki𝐌ki,𝐖2max𝐖𝒮nrki𝐌ki,𝐖21.4b20.8b_{1}\leq\min_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}\leq\max_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}\leq 1.4b_{2}

Proof.

By union bound, for all 𝑾¯𝒮¯nr\bar{\bm{W}}\in\bar{\mathcal{S}}_{nr}, b1ki𝑴ki,𝑾¯2b2b_{1}\leq\sum_{ki}\langle\bm{M}_{ki},\bar{\bm{W}}\rangle^{2}\leq b_{2} holds w.p. at least 1(1+2/ϵnet)nrp01-(1+2/\epsilon_{net})^{nr}p_{0}.

Proof for the upper bound: Let γ=max𝑾𝒮nrki𝑴ki,𝑾2\gamma^{*}=\max_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}. Writing 𝑾=𝑾¯+(𝑾𝑾¯){\bm{W}}=\bar{\bm{W}}+({\bm{W}}-\bar{\bm{W}}) where 𝑾¯\bar{\bm{W}} is the closest point to 𝑾{\bm{W}} on 𝒮¯nr\bar{\mathcal{S}}_{nr}, we have ki𝑴ki,𝑾2=ki𝑴ki,𝑾¯2+ki𝑴ki,(𝑾𝑾¯)2+2ki𝑴ki,𝑾¯𝑴ki,(𝑾𝑾¯)\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}=\sum_{ki}\langle\bm{M}_{ki},\bar{\bm{W}}\rangle^{2}+\sum_{ki}\langle\bm{M}_{ki},({\bm{W}}-\bar{\bm{W}})\rangle^{2}+2\sum_{ki}\langle\bm{M}_{ki},\bar{\bm{W}}\rangle\langle\bm{M}_{ki},({\bm{W}}-\bar{\bm{W}})\rangle and (𝑾𝑾¯)Fϵnet\|({\bm{W}}-\bar{\bm{W}})\|_{F}\leq\epsilon_{net}.

Rewriting (𝑾𝑾¯)=(𝑾𝑾¯)F(𝑾𝑾¯)/(𝑾𝑾¯)F({\bm{W}}-\bar{\bm{W}})=\|({\bm{W}}-\bar{\bm{W}})\|_{F}\cdot({\bm{W}}-\bar{\bm{W}})/\|({\bm{W}}-\bar{\bm{W}})\|_{F}, and using the facts that (𝑾𝑾¯)/(𝑾𝑾¯)F𝒮nr({\bm{W}}-\bar{\bm{W}})/\|({\bm{W}}-\bar{\bm{W}})\|_{F}\in\mathcal{S}_{nr} and (𝑾𝑾¯)Fϵnet\|({\bm{W}}-\bar{\bm{W}})\|_{F}\leq\epsilon_{net}, and using Cauchy-Schwarz for the third term in the above expression, we have

γb2+ϵnet2γ+2γϵnet2γ=b2+ϵnet2γ+2ϵnetγ\gamma^{*}\leq b_{2}+\epsilon_{net}^{2}\gamma^{*}+2\sqrt{\gamma^{*}\cdot\epsilon_{net}^{2}\gamma^{*}}=b_{2}+\epsilon_{net}^{2}\gamma^{*}+2\epsilon_{net}\gamma^{*}

Thus, γ1/(1ϵnet22ϵnet)b2\gamma^{*}\leq 1/(1-\epsilon_{net}^{2}-2\epsilon_{net})\cdot b_{2}.

Proof for the lower bound: Let β=min𝑾𝒮nrki𝑴ki,𝑾2\beta^{*}=\min_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}\langle\bm{M}_{ki},{\bm{W}}\rangle^{2}. Proceeding as above, we have

βb12γϵnet2γ=b12ϵnetγ\beta^{*}\geq b_{1}-2\sqrt{\gamma^{*}\cdot\epsilon_{net}^{2}\gamma^{*}}=b_{1}-2\epsilon_{net}\gamma^{*}

IV-B Proving GD iterations’ lemmas: Proof of Lemma 3.4 (algebra lemma)

Recall that 𝑼vec{{\bm{U}}_{vec}} denotes the vectorized 𝑼{\bm{U}}. We use this so that we can apply the simple vector version of the fundamental theorem of calculus [18, Chapter XIII, Theorem 4.2],[19, Lemma 2 proof] (given in Theorem 4.2) on the nrnr length vector f(𝑼vec,𝑩)\nabla f({{\bm{U}}_{vec}},\bm{B}), and so that the Hessian can be expressed as an nr×nrnr\times nr matrix.

We apply Theorem 4.2 with 𝒛0𝑼vec\bm{z}_{0}\equiv{{\bm{U}}_{vec}}, 𝒛(𝑼𝑼𝑼)vec\bm{z}^{*}\equiv({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}})_{vec}, and g(𝒛)=f(𝒛,𝑩)g(\bm{z})=\nabla f(\bm{z},\bm{B}). Thus d=d2=nrd=d_{2}=nr and g(𝒛)\nabla g(\bm{z}) is the Hessian of f(𝒛,𝑩)f(\bm{z},\bm{B}) computed at 𝒛\bm{z}. Let 𝑼(τ):=𝑼𝑼𝑼+τ(𝑼𝑼𝑼𝑼){\bm{U}}(\tau):={\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}+\tau({\bm{U}}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}). Applying the theorem,

f(𝑼vec,𝑩)f((𝑼𝑼𝑼)vec,𝑩)\displaystyle\nabla f({\bm{U}}_{vec},\bm{B})-\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B})
=(τ=01𝑼vec2f(𝑼(τ)vec,𝑩)𝑑τ)(𝑼vec(𝑼𝑼𝑼)vec)\displaystyle=(\int_{\tau=0}^{1}\nabla_{{{\bm{U}}_{vec}}}^{2}f({\bm{U}}(\tau)_{vec},\bm{B})d\tau)({\bm{U}}_{vec}-({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec}) (9)

where

𝑼vec2f(𝑼(τ)vec,𝑩)=ki(𝒂ki𝒃k)(𝒂ki𝒃k):=Hess\displaystyle\nabla_{{{\bm{U}}_{vec}}}^{2}f({\bm{U}}(\tau)_{vec},\bm{B})=\sum_{ki}(\bm{a}_{ki}\otimes\bm{b}_{k})(\bm{a}_{ki}\otimes\bm{b}_{k})^{\top}:=\ \mathrm{Hess}\ (10)

This is an nr×nrnr\times nr matrix. Because the cost function is quadratic, the Hessian is constant w.r.t. τ\tau. Henceforth, we refer to it as Hess\ \mathrm{Hess}\ . With this, the above simplifies to

f(𝑼vec,𝑩)f((𝑼𝑼𝑼)vec,𝑩)\displaystyle\nabla f({\bm{U}}_{vec},\bm{B})-\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B})
=Hess(𝑼vec(𝑼𝑼𝑼)vec)=Hess(𝑷𝑼)vec\displaystyle=\ \mathrm{Hess}\ ({\bm{U}}_{vec}-({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec})=\mathrm{Hess}\ (\bm{P}{\bm{U}})_{vec} (11)

with

𝑷:=𝑰𝑼𝑼\bm{P}:=\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}

denoting the n×nn\times n projection matrix to project orthogonal to 𝑼{\bm{U}}^{*}{}. This proof is motivated by a similar approach used in [19, Lemma 2 proof] to analyze GD for standard PR. However, there the application was much simpler because f(.)f(.) was a function of one variable and at the true solution the gradient was zero, i.e., f(𝒙)=𝟎\nabla f(\bm{x}^{*})=\bm{0}. In our case f(𝑼𝑼𝑼,𝑩)𝟎\nabla f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B})\neq\bm{0} because 𝑩𝑩\bm{B}\neq\bm{B}^{*}. But we can show that 𝔼[(𝑰𝑼𝑼)f(𝑼𝑼𝑼,𝑩)]=𝟎\mathbb{E}[(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\nabla f({\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}{\bm{U}},\bm{B})]=\bm{0} and this helps us get the final desired result.
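Since f(·,𝑩) is quadratic in 𝑼vec, the Hessian in (10) indeed does not depend on τ and (11) is an exact identity. A minimal numerical check of this (illustrative names; a single column k for brevity; gradient and Hessian scaled as in Definition 4.4 and (10)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 6, 2, 4
a = rng.standard_normal((m, n))   # rows are a_i^T (one column k only, for brevity)
b = rng.standard_normal(r)        # b_k
y = rng.standard_normal(m)        # y_ki

def grad(u_vec):
    # sum_i (a_i (x) b) ((a_i (x) b)^T u_vec - y_i), cf. Definition 4.4
    return sum(np.kron(a[i], b) * (np.kron(a[i], b) @ u_vec - y[i]) for i in range(m))

Hess = sum(np.outer(np.kron(a[i], b), np.kron(a[i], b)) for i in range(m))

u1, u2 = rng.standard_normal(n * r), rng.standard_normal(n * r)
# eq. (11): exact because the cost is quadratic, so the Hessian is constant
assert np.allclose(grad(u1) - grad(u2), Hess @ (u1 - u2))
print("gradient difference equals Hess times the parameter difference")
```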

From Algorithm 1, recall that 𝑼^+=𝑼(η/m)f(𝑼,𝑩)\hat{\bm{U}}^{+}={\bm{U}}-(\eta/m)\nabla f({\bm{U}},\bm{B}). Vectorizing this equation, and using (11), we get

(𝑼^+)vec\displaystyle(\hat{\bm{U}}^{+})_{vec} =𝑼vec(η/m)f(𝑼vec,𝑩)\displaystyle={\bm{U}}_{vec}-(\eta/m)\nabla f({\bm{U}}_{vec},\bm{B})
=𝑼vec(η/m)Hess(𝑷𝑼)vec\displaystyle={\bm{U}}_{vec}-(\eta/m)\ \mathrm{Hess}\ (\bm{P}{\bm{U}})_{vec}
(η/m)f((𝑼𝑼𝑼)vec,𝑩)\displaystyle\qquad-(\eta/m)\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B}) (12)

We can prove our final result by using (8) and the following simple facts:

  1.

    For an n×nn\times n matrix 𝑴\bm{M}, let big(𝑴):=𝑰r𝑴\mathrm{big}(\bm{M}):=\bm{I}_{r}\otimes\bm{M} be an nr×nrnr\times nr block diagonal matrix with 𝑴\bm{M} in the diagonal blocks. For any n×rn\times r matrix 𝒁{\bm{{Z}}},

    big(𝑴)𝒁vec=(𝑴𝒁)vec\displaystyle\mathrm{big}(\bm{M}){{\bm{{Z}}}_{vec}}=(\bm{M}{\bm{{Z}}})_{vec} (13)
  2.

    Since 𝑷\bm{P} is idempotent, 𝑷=𝑷2\bm{P}=\bm{P}^{2}. Also, because of its block diagonal structure, big(𝑴2)=(big(𝑴))2\mathrm{big}(\bm{M}^{2})=(\mathrm{big}(\bm{M}))^{2}. Thus,

    big(𝑷)\displaystyle\mathrm{big}(\bm{P}) =big(𝑷2)=(big(𝑷))2=big(𝑷)𝑰nrbig(𝑷)\displaystyle=\mathrm{big}(\bm{P}^{2})=(\mathrm{big}(\bm{P}))^{2}=\mathrm{big}(\bm{P})\bm{I}_{nr}\mathrm{big}(\bm{P}) (14)

Left multiplying both sides of (12) by big(𝑷)\mathrm{big}(\bm{P}), and using (13), (14), and (8),

big(𝑷)(𝑼^+)vec=big(𝑷)𝑼vec(η/m)big(𝑷)Hess(𝑷𝑼)vec\displaystyle\mathrm{big}(\bm{P})(\hat{\bm{U}}^{+})_{vec}=\mathrm{big}(\bm{P}){\bm{U}}_{vec}-(\eta/m)\mathrm{big}(\bm{P})\ \mathrm{Hess}\ (\bm{P}{\bm{U}})_{vec}
(η/m)big(𝑷)f((𝑼𝑼𝑼)vec,𝑩)\displaystyle\qquad-(\eta/m)\mathrm{big}(\bm{P})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B})
=big(𝑷)𝑰nrbig(𝑷)𝑼vec(η/m)big(𝑷)Hessbig(𝑷)𝑼vec\displaystyle=\mathrm{big}(\bm{P})\bm{I}_{nr}\mathrm{big}(\bm{P}){\bm{U}}_{vec}-(\eta/m)\mathrm{big}(\bm{P})\ \mathrm{Hess}\ \mathrm{big}(\bm{P}){\bm{U}}_{vec}
(η/m)big(𝑷)f((𝑼𝑼𝑼)vec,𝑩)\displaystyle\qquad-(\eta/m)\mathrm{big}(\bm{P})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B})
=big(𝑷)(𝑰nr(η/m)Hess)big(𝑷)𝑼vec\displaystyle=\mathrm{big}(\bm{P})(\bm{I}_{nr}-(\eta/m)\ \mathrm{Hess})\mathrm{big}(\bm{P}){\bm{U}}_{vec}
(η/m)big(𝑷)f((𝑼𝑼𝑼)vec,𝑩).\displaystyle\qquad-(\eta/m)\mathrm{big}(\bm{P})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}})_{vec},\bm{B}).

Thus, using big(𝑷)=𝑷=1\|\mathrm{big}(\bm{P})\|=\|\bm{P}\|=1, (13), and (8),

(𝑷𝑼^+)vec\displaystyle\|(\bm{P}\hat{\bm{U}}^{+})_{vec}\| 𝑰nr(η/m)Hess(𝑷𝑼)vec\displaystyle\leq\|\bm{I}_{nr}-(\eta/m)\ \mathrm{Hess}\ \|\ \|(\bm{P}{\bm{U}})_{vec}\|
+(η/m)(f((𝑼𝑼𝑼),𝑩))vec\displaystyle+(\eta/m)\|(\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B}))_{vec}\| (15)

Converting the vectors to matrices, using 𝑴vec=𝑴F||\bm{M}_{vec}||=||\bm{M}||_{F}, and substituting for 𝑷\bm{P},

(𝑰𝑼𝑼)𝑼^+F\displaystyle\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\hat{\bm{U}}^{+}\|_{F}
𝑰nr(η/m)Hess(𝑰𝑼𝑼)𝑼F\displaystyle\leq\|\bm{I}_{nr}-(\eta/m)\ \mathrm{Hess}\ \|\ \|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{U}}\|_{F}
+(η/m)(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩)F\displaystyle\qquad+(\eta/m)\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B})\|_{F}

Since 𝑼^+=QR𝑼+𝑹+\hat{\bm{U}}^{+}\overset{\mathrm{QR}}{=}{\bm{U}}^{+}{\bm{R}}^{+} and since 𝑴1𝑴2F𝑴1F𝑴2\|\bm{M}_{1}\bm{M}_{2}\|_{F}\leq\|\bm{M}_{1}\|_{F}\|\bm{M}_{2}\|, this means that

SD(𝑼,𝑼+)(𝑰𝑼𝑼)𝑼^+F(𝑹+)1.\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}^{+})\leq\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\hat{\bm{U}}^{+}\|_{F}\|({\bm{R}}^{+})^{-1}\|.

Since (𝑹+)1=1/σmin(𝑹+)=1/σmin(𝑼^+)\|({\bm{R}}^{+})^{-1}\|=1/\sigma_{\min}({\bm{R}}^{+})=1/\sigma_{\min}(\hat{\bm{U}}^{+}), using 𝑼^+=𝑼(η/m)f(𝑼,𝑩)\hat{\bm{U}}^{+}={\bm{U}}-(\eta/m)\nabla f({\bm{U}},\bm{B}),

(𝑹+)1\displaystyle\|({\bm{R}}^{+})^{-1}\| =1σmin(𝑼(η/m)f(𝑼,𝑩))\displaystyle=\frac{1}{\sigma_{\min}({\bm{U}}-(\eta/m)\nabla f({\bm{U}},\bm{B}))}
11(η/m)f(𝑼,𝑩)\displaystyle\leq\frac{1}{1-(\eta/m)\|\nabla f({\bm{U}},\bm{B})\|}

where we used σmin(𝑼(η/m)f(𝑼,𝑩))σmin(𝑼)(η/m)f(𝑼,𝑩)=1(η/m)f(𝑼,𝑩)\sigma_{\min}({\bm{U}}-(\eta/m)\nabla f({\bm{U}},\bm{B}))\geq\sigma_{\min}({\bm{U}})-(\eta/m)\|\nabla f({\bm{U}},\bm{B})\|=1-(\eta/m)\|\nabla f({\bm{U}},\bm{B})\| for the last inequality. Combining the last three equations above proves our lemma.

IV-C Proof of GD iterations’ lemmas: Proof of Lemma 3.5

IV-C1 Upper and lower bounding the Hessian eigenvalues (the Hess\mathrm{Hess} term)

First, assume the event under which the conclusions of Lemma 3.3 hold.

Recall from (10) that Hess:=𝑼~vec2f(𝑼~vec;𝑩)=ki(𝒂ki𝒃k)(𝒂ki𝒃k).\ \mathrm{Hess}\ :=\nabla_{\tilde{\bm{U}}_{vec}}^{2}f(\tilde{\bm{U}}_{vec};\bm{B})=\sum_{ki}(\bm{a}_{ki}\otimes\bm{b}_{k})(\bm{a}_{ki}\otimes\bm{b}_{k}){}^{\top}. Since Hess\ \mathrm{Hess}\ is a positive semi-definite matrix, λmin(Hess)=min𝒘𝒮nr𝒘Hess𝒘\lambda_{\min}\left(\ \mathrm{Hess}\ \right)=\min_{\bm{w}\in\mathcal{S}_{nr}}\bm{w}{}^{\top}\ \mathrm{Hess}\ \ \bm{w} and λmax(Hess)=max𝒘𝒮nr𝒘Hess𝒘.\lambda_{\max}\left(\ \mathrm{Hess}\ \right)=\max_{\bm{w}\in\mathcal{S}_{nr}}\bm{w}{}^{\top}\ \mathrm{Hess}\ \ \bm{w}. For a fixed 𝒘𝒮nr\bm{w}\in\mathcal{S}_{nr},

𝒘Hess𝒘=ki(𝒂ki𝑾𝒃k)2\bm{w}{}^{\top}\ \mathrm{Hess}\ \ \bm{w}=\sum_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k})^{2}

where 𝑾{\bm{W}} is an n×rn\times r matrix with 𝑾F=1\|{\bm{W}}\|_{F}=1. Clearly (𝒂ki𝑾𝒃k)2(\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k})^{2} are mutually independent sub-exponential random variables (r.v.) with sub-exponential norm Kki𝑾𝒃k2K_{ki}\leq\|{\bm{W}}\bm{b}_{k}\|^{2}. Also, 𝔼[(𝒂ki𝑾𝒃k)2]=𝑾𝒃k2\mathbb{E}[(\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k})^{2}]=\|{\bm{W}}\bm{b}_{k}\|^{2} and thus 𝔼[ki(𝒂ki𝑾𝒃k)2]=m𝑾𝑩F2\mathbb{E}[\sum_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k})^{2}]=m\|{\bm{W}}\bm{B}\|_{F}^{2}. Applying the sub-exponential Bernstein inequality, Theorem 2.8.1 of [26], for a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr} yields

Pr{|ki|𝒂ki𝑾𝒃k|2m𝑾𝑩F2|t}\displaystyle\Pr\left\{\Big{|}\sum_{ki}\big{|}\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k}\big{|}^{2}-m\|{\bm{W}}\bm{B}\|_{F}^{2}\Big{|}\geq t\right\}
exp[cmin(t2kiKki2,tmaxkiKki)].\displaystyle\qquad\leq\exp\left[-c\min\left(\frac{t^{2}}{\sum_{ki}K_{ki}^{2}},~{}\frac{t}{\max_{ki}K_{ki}}\right)\right].

We set t=ϵ3mσmin2t=\epsilon_{3}m{\sigma_{\min}^{*}}^{2}. By Lemma 3.3, 𝒃k21.1μ2σmax2(r/q)=1.1κ2μ2σmin2(r/q)\|\bm{b}_{k}\|^{2}\leq 1.1\mu^{2}{\sigma_{\max}^{*}}^{2}(r/q)=1.1\kappa^{2}\mu^{2}{\sigma_{\min}^{*}}^{2}(r/q). Thus,

t2kiKki2\displaystyle\frac{t^{2}}{\sum_{ki}K_{ki}^{2}} ϵ32m2σmin4ki𝑾𝒃k4ϵ32mσmin4maxk𝒃k2k𝑾𝒃k2\displaystyle\geq\frac{\epsilon_{3}^{2}m^{2}{\sigma_{\min}^{*}}^{4}}{\sum_{ki}\|{\bm{W}}\bm{b}_{k}\|^{4}}\geq\frac{\epsilon_{3}^{2}m{\sigma_{\min}^{*}}^{4}}{\max_{k}\|\bm{b}_{k}\|^{2}\sum_{k}\|{\bm{W}}\bm{b}_{k}\|^{2}}
ϵ32mσmin4μ2σmax2(r/q)1.1σmax2=cϵ32mq/rμ2κ4\displaystyle\geq\frac{\epsilon_{3}^{2}m{\sigma_{\min}^{*}}^{4}}{\mu^{2}{\sigma_{\max}^{*}}^{2}(r/q)\cdot 1.1{\sigma_{\max}^{*}}^{2}}=c\epsilon_{3}^{2}mq/r\mu^{2}\kappa^{4}

Here we used k𝑾𝒃k2=𝑾𝑩F2𝑾F2𝑩21.12σmax2\sum_{k}\|{\bm{W}}\bm{b}_{k}\|^{2}=\|{\bm{W}}\bm{B}\|_{F}^{2}\leq\|{\bm{W}}\|_{F}^{2}\|\bm{B}\|^{2}\leq 1.1^{2}{\sigma_{\max}^{*}}^{2}, which follows from the bound on 𝑩\|\bm{B}\| from Lemma 3.3. Also,

tmaxkiKki\displaystyle\frac{t}{\max_{ki}K_{ki}} ϵ3mσmin2maxki𝑾𝒃k2ϵ3mσmin21.1μ2σmax2(r/q)\displaystyle\geq\frac{\epsilon_{3}m{\sigma_{\min}^{*}}^{2}}{\max_{ki}\|{\bm{W}}\bm{b}_{k}\|^{2}}\geq\frac{\epsilon_{3}m{\sigma_{\min}^{*}}^{2}}{1.1\mu^{2}{\sigma_{\max}^{*}}^{2}(r/q)}
=cϵ3mq/rμ2κ2.\displaystyle=c\epsilon_{3}mq/r\mu^{2}\kappa^{2}.

Therefore, for a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr}, w.p. 1exp[cϵ32mq/rμ2κ4]1-\exp\left[-c\epsilon_{3}^{2}mq/r\mu^{2}\kappa^{4}\right] we have

|ki|𝒂ki𝑾𝒃k|2m𝑾𝑩F2|ϵ3mσmin2.\displaystyle\Big{|}\sum_{ki}\big{|}\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k}\big{|}^{2}-m\|{\bm{W}}\bm{B}\|_{F}^{2}\Big{|}\leq\epsilon_{3}m{\sigma_{\min}^{*}}^{2}. (16)

and hence, by Lemma 3.3, w.p. 1exp[cϵ32mq/rμ2κ4]1-\exp\left[-c\epsilon_{3}^{2}mq/r\mu^{2}\kappa^{4}\right],

ki|𝒂ki𝑾𝒃k|2m𝑾𝑩F2+ϵ3mσmin2\displaystyle\sum_{ki}\big{|}\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k}\big{|}^{2}\leq m\|{\bm{W}}\bm{B}\|_{F}^{2}+\epsilon_{3}m{\sigma_{\min}^{*}}^{2}
m𝑩2+ϵ3mσmin2m(1.1+ϵ3/κ2)σmax2.\displaystyle\qquad\leq m\|\bm{B}\|^{2}+\epsilon_{3}m{\sigma_{\min}^{*}}^{2}\leq m(1.1+\epsilon_{3}/\kappa^{2}){\sigma_{\max}^{*}}^{2}. (17)

and

ki|𝒂ki𝑾𝒃k|2m𝑾𝑩F2ϵ3mσmin2\displaystyle\sum_{ki}\big{|}\bm{a}_{ki}{}^{\top}{\bm{W}}\bm{b}_{k}\big{|}^{2}\geq m\|{\bm{W}}\bm{B}\|_{F}^{2}-\epsilon_{3}m{\sigma_{\min}^{*}}^{2}
0.9mσmin2ϵ3mσmin2=m(0.9ϵ3)σmin2.\displaystyle\qquad\geq 0.9m{\sigma_{\min}^{*}}^{2}-\epsilon_{3}m{\sigma_{\min}^{*}}^{2}=m(0.9-\epsilon_{3}){\sigma_{\min}^{*}}^{2}. (18)

To extend these bounds to all 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr}, we apply Proposition 4.9 with b1m(0.9ϵ3)σmin2b_{1}\equiv m(0.9-\epsilon_{3}){\sigma_{\min}^{*}}^{2} and b2m(1.1+ϵ3/κ2)σmax2b_{2}\equiv m(1.1+\epsilon_{3}/\kappa^{2}){\sigma_{\max}^{*}}^{2}. This implies that, on the event that the claims of Lemma 3.3 hold, w.p. at least 1exp(nrlogκcmqϵ32/rμ2κ4)1-\exp(nr\log\kappa-cmq\epsilon_{3}^{2}/r\mu^{2}\kappa^{4}),

m(0.71.2ϵ3)σmin2\displaystyle m(0.7-1.2\epsilon_{3}){\sigma_{\min}^{*}}^{2} λmin(Hess)\displaystyle\leq\lambda_{\min}(\ \mathrm{Hess}\ )
λmax(Hess)m(1.1+ϵ3)σmax2\displaystyle\leq\lambda_{\max}(\ \mathrm{Hess}\ )\leq m(1.1+\epsilon_{3}){\sigma_{\max}^{*}}^{2}

Using the probability from Lemma 3.3, the above bound holds w.p. at least 1exp(nrlogκcmqϵ32/rμ2κ4)exp(logq+rcm)1-\exp(nr\log\kappa-cmq\epsilon_{3}^{2}/r\mu^{2}\kappa^{4})-\exp(\log q+r-cm).

IV-C2 Bounding the GradU Term

We have f(𝑼,𝑩)=max𝒛𝒮n,𝒘𝒮r𝒛f(𝑼,𝑩)𝒘.\|\nabla f({\bm{U}},\bm{B})\|=\max_{\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{r}}\bm{z}{}^{\top}\nabla f({\bm{U}},\bm{B})\bm{w}. For a fixed 𝒛𝒮n,𝒘𝒮r\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{r} we have

𝒛(f(𝑼,𝑩)𝔼[f(𝑼,𝑩)])𝒘\displaystyle\bm{z}{}^{\top}\left(\nabla f({\bm{U}},\bm{B})-\mathbb{E}[\nabla f({\bm{U}},\bm{B})]\right)\bm{w}
=ki[(𝒂ki𝑼𝒃k𝒚ki)(𝒂ki𝒛)(𝒘𝒃k)𝔼[.]]\displaystyle\qquad=\sum_{ki}\left[\left(\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{y}_{ki}\right)\left(\bm{a}_{ki}{}^{\top}\bm{z}\right)\left(\bm{w}{}^{\top}\bm{b}_{k}\right)-\mathbb{E}[.]\right]

where 𝔼[.]\mathbb{E}[.] is the expected value of the first term. Clearly, the summands are independent sub-exponential r.v.s with norm KkiC𝒙k𝒙k𝒃kK_{ki}\leq C\|\bm{x}_{k}-\bm{x}^{*}_{k}\|\|\bm{b}_{k}\|. We apply the sub-exponential Bernstein inequality, Theorem 2.8.1 of [26], with t=ϵ1δtmσmax2t=\epsilon_{1}\delta_{t}m{\sigma_{\max}^{*}}^{2}. To apply this, we use bounds on 𝒃k\|\bm{b}_{k}\|, 𝑿𝑿F\|{\bm{X}}^{*}-{\bm{X}}\|_{F} and 𝒙k𝒙k\|\bm{x}_{k}-\bm{x}^{*}_{k}\| from Lemma 3.3 to show that

t2kiKki2\displaystyle\frac{t^{2}}{\sum_{ki}K_{ki}^{2}} cϵ12δt2m2σmax4mmaxk𝒃k2k𝒙k𝒙k2\displaystyle\geq c\frac{\epsilon_{1}^{2}\delta_{t}^{2}m^{2}{\sigma_{\max}^{*}}^{4}}{m\max_{k}\|\bm{b}_{k}\|^{2}\sum_{k}\|\bm{x}_{k}-\bm{x}^{*}_{k}\|^{2}}
cϵ12δt2mσmax4Cμ2σmax2(r/q)𝑿𝑿F2\displaystyle\geq c\frac{\epsilon_{1}^{2}\delta_{t}^{2}m{\sigma_{\max}^{*}}^{4}}{C\mu^{2}{\sigma_{\max}^{*}}^{2}(r/q)\|{\bm{X}}-{\bm{X}}^{*}\|_{F}^{2}}
cϵ12δt2mqσmax4Cμ2σmax2rδt2σmax2=cϵ12mqrμ2.\displaystyle\geq c\frac{\epsilon_{1}^{2}\delta_{t}^{2}mq{\sigma_{\max}^{*}}^{4}}{C\mu^{2}{\sigma_{\max}^{*}}^{2}r\delta_{t}^{2}{\sigma_{\max}^{*}}^{2}}=c\epsilon_{1}^{2}\frac{mq}{r\mu^{2}}.

and

tmaxkiKkicϵ1δtmσmax2Cδtσmax2μ2(r/q)cϵ1mqrμ2.\frac{t}{\max_{ki}K_{ki}}\geq c\frac{\epsilon_{1}\delta_{t}m{\sigma_{\max}^{*}}^{2}}{C\delta_{t}{\sigma_{\max}^{*}}^{2}\mu^{2}(r/q)}\geq c\epsilon_{1}\frac{mq}{r\mu^{2}}.

Therefore, for a fixed 𝒛𝒮n,𝒘𝒮r\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{r} w.p. 1exp(cϵ12mq/rμ2)1-\exp(-c\epsilon_{1}^{2}mq/r\mu^{2}),

|𝒛(f(𝑼,𝑩)𝔼[f(𝑼,𝑩)])𝒘|\displaystyle\big{|}\bm{z}{}^{\top}\left(\nabla f({\bm{U}},\bm{B})-\mathbb{E}[\nabla f({\bm{U}},\bm{B})]\right)\bm{w}\big{|} ϵ1δtmσmax2\displaystyle\leq\epsilon_{1}\delta_{t}m{\sigma_{\max}^{*}}^{2}

Since f(𝑼,𝑩)=ki𝒂ki𝒂ki(𝒙k𝒙k)𝒃k\nabla f({\bm{U}},\bm{B})=\sum_{ki}\bm{a}_{ki}\bm{a}_{ki}{}^{\top}(\bm{x}_{k}-\bm{x}^{*}_{k})\bm{b}_{k}{}^{\top},

𝔼[f(𝑼,𝑩)]=mk(𝒙k𝒙k)𝒃k=m(𝑿𝑿)𝑩.\mathbb{E}[\nabla f({\bm{U}},\bm{B})]=m\sum_{k}(\bm{x}_{k}-\bm{x}^{*}_{k})\bm{b}_{k}{}^{\top}=m\left({\bm{X}}-{\bm{X}}^{*}\right)\bm{B}{}^{\top}.

Using the bounds on 𝑿𝑿F\|{\bm{X}}^{*}-{\bm{X}}\|_{F} and 𝑩\|\bm{B}\| from Lemma 3.3,

𝔼[f(𝑼,𝑩)]\displaystyle\|\mathbb{E}[\nabla f({\bm{U}},\bm{B})]\| =m(𝑿𝑿)𝑩\displaystyle=m\|({\bm{X}}-{\bm{X}}^{*})\bm{B}{}^{\top}\|
m𝑿𝑿𝑩\displaystyle\leq m\|{\bm{X}}-{\bm{X}}^{*}\|~{}\|\bm{B}\|
m𝑿𝑿F𝑩\displaystyle\leq m\|{\bm{X}}-{\bm{X}}^{*}\|_{F}~{}\|\bm{B}\|
1.1mδtσmax2\displaystyle\leq 1.1m\delta_{t}{\sigma_{\max}^{*}}^{2}

Hence, for a fixed 𝒛𝒮n,𝒘𝒮r\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{r} w.p. 1exp[cϵ12mq/rμ2]1-\exp\left[-c\epsilon_{1}^{2}mq/r\mu^{2}\right] we have

|𝒛f(𝑼,𝑩)𝒘|(1.1+ϵ1)mδtσmax2.|\bm{z}^{\top}\nabla f({\bm{U}},\bm{B})\bm{w}|\leq(1.1+\epsilon_{1})m\delta_{t}{\sigma_{\max}^{*}}^{2}.

Applying Proposition 4.7, this implies that, w.p. 1exp((n+r)(log17)cϵ12mq/rμ2)1-\exp((n+r)(\log 17)-c\epsilon_{1}^{2}mq/r\mu^{2}), max𝒛𝒮n,𝒘𝒮r𝒛f(𝑼,𝑩)𝒘1.4(1.1+ϵ1)mδtσmax2.\max_{\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{r}}\bm{z}{}^{\top}\nabla f({\bm{U}},\bm{B})\bm{w}\leq 1.4(1.1+\epsilon_{1})m\delta_{t}{\sigma_{\max}^{*}}^{2}.

IV-C3 Bounding Term2

First, since Term2=(𝑰𝑼𝑼)ki𝒂ki(𝒂ki𝑼(𝑼𝑼𝒃k𝒃k))𝒃k\mathrm{Term2}=(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\sum_{ki}\bm{a}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}({\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}))\bm{b}_{k}{}{}^{\top}, and 𝔼[𝒂ki𝒂ki]=𝑰\mathbb{E}[\bm{a}_{ki}\bm{a}_{ki}{}^{\top}]=\bm{I},

𝔼[Term2]=0\mathbb{E}[\mathrm{Term2}]=0

We have

(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩)F\displaystyle\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B})\|_{F}
=max𝑾𝒮nr(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩),𝑾\displaystyle\qquad=\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B}),~{}{\bm{W}}\rangle

For a fixed n×rn\times r matrix 𝑾{\bm{W}} with unit Frobenius norm,

(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩),𝑾\displaystyle\langle(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B}),~{}{\bm{W}}\rangle
=ki(𝒂ki𝑼(𝑼𝑼𝒃k𝒃k))(𝒂ki(𝑰𝑼𝑼)𝑾𝒃k)\displaystyle\qquad=\sum_{ki}\left(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}({\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k})\right)\left(\bm{a}_{ki}{}^{\top}(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{W}}\bm{b}_{k}\right)

Observe that the summands are independent, zero mean, sub-exponential r.v.s with sub-exponential norm KkiC𝑼𝑼𝒃k𝒃k(𝑰𝑼𝑼)𝑾𝒃k𝑼𝑼𝒃k𝒃k𝑾𝒃kK_{ki}\leq C\|{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}\|\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{W}}\bm{b}_{k}\|\leq\|{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}\|\|{\bm{W}}\bm{b}_{k}\|. We can now apply the sub-exponential Bernstein inequality Theorem 2.8.1 of [26]. Let t=ϵ2δtmσmax2t=\epsilon_{2}\delta_{t}m{\sigma_{\max}^{*}}^{2}. Using the bound on 𝑼𝑼𝒃k𝒃k\|{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}\| from Lemma 3.3 followed by Assumption 1.1 (right incoherence), and also the bound on 𝑩\|\bm{B}\| from Lemma 3.3,

t2kiKki2\displaystyle\frac{t^{2}}{\sum_{ki}K^{2}_{ki}} ϵ22δt2m2σmax4δt2σmax2μ2(r/q)ki𝑾𝒃k2\displaystyle\geq\frac{\epsilon_{2}^{2}\delta_{t}^{2}m^{2}{\sigma_{\max}^{*}}^{4}}{\delta_{t}^{2}{\sigma_{\max}^{*}}^{2}\mu^{2}(r/q)\sum_{ki}\|{\bm{W}}\bm{b}_{k}\|^{2}}
ϵ22m2σmax2Cμ2(r/q)m𝑾𝑩F2ϵ22m2σmax2μ2(r/q)mσmax2\displaystyle\geq\frac{\epsilon_{2}^{2}m^{2}{\sigma_{\max}^{*}}^{2}}{C\mu^{2}(r/q)m\|{\bm{W}}\bm{B}\|_{F}^{2}}\geq\frac{\epsilon_{2}^{2}m^{2}{\sigma_{\max}^{*}}^{2}}{\mu^{2}(r/q)m{\sigma_{\max}^{*}}^{2}}
cϵ22mq/rμ2,\displaystyle\geq c\epsilon_{2}^{2}mq/r\mu^{2},

and

tmaxkiKkiϵ2δtmσmax2Cδtκ2μ2σmax2(r/q)cϵ2mq/(rκ2μ2).\frac{t}{\max_{ki}K_{ki}}\geq\frac{\epsilon_{2}\delta_{t}m{\sigma_{\max}^{*}}^{2}}{C\delta_{t}\kappa^{2}\mu^{2}{\sigma_{\max}^{*}}^{2}(r/q)}\geq c\epsilon_{2}mq/(r\kappa^{2}\mu^{2}).

Thus, by the sub-exponential Bernstein inequality, for a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr}, w.p. 1exp(cϵ22mq/rκ2μ2)1-\exp(-c\epsilon_{2}^{2}mq/r\kappa^{2}\mu^{2}),

(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩),𝑾ϵ2δtmσmax2.\langle(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B}),~{}{\bm{W}}\rangle\leq\epsilon_{2}\delta_{t}m{\sigma_{\max}^{*}}^{2}.

Applying Proposition 4.8, w.p. at least 1exp(nrcϵ22mq/rκ2μ2)1-\exp(nr-c\epsilon_{2}^{2}mq/r\kappa^{2}\mu^{2}), max𝑾𝒮nr(𝑰𝑼𝑼)f((𝑼𝑼𝑼),𝑩),𝑾1.2ϵ2δtmσmax2.\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top})\nabla f(({\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}),\bm{B}),{\bm{W}}\rangle\leq 1.2\epsilon_{2}\delta_{t}m{\sigma_{\max}^{*}}^{2}.

IV-D Proof of GD iterations’ lemmas: Proof of Lemma 3.3, all parts other than the first part

Recall that 𝒈k=𝑼𝒙k=𝑼𝑼𝒃k\bm{g}_{k}={\bm{U}}^{\top}\bm{x}^{*}_{k}={\bm{U}}^{\top}{\bm{U}}^{*}{}\bm{b}^{*}_{k}, and 𝑮=𝑼𝑼𝑩\bm{G}={\bm{U}}^{\top}{\bm{U}}^{*}{}\bm{B}^{*}.

Using the SD\mathrm{SD} bound and the first part, 𝒈k𝒃k0.4δt𝒃k\|\bm{g}_{k}-\bm{b}_{k}\|\leq 0.4\delta_{t}\|\bm{b}^{*}_{k}\|.

Since 𝒙k𝒙k=𝑼𝒈k+(𝑰𝑼𝑼)𝒙k𝑼𝒃k=𝑼(𝒈k𝒃k)+(𝑰𝑼𝑼)𝒙k\bm{x}^{*}_{k}-\bm{x}_{k}={\bm{U}}\bm{g}_{k}+(\bm{I}-{\bm{U}}{\bm{U}}^{\top})\bm{x}^{*}_{k}-{\bm{U}}\bm{b}_{k}={\bm{U}}(\bm{g}_{k}-\bm{b}_{k})+(\bm{I}-{\bm{U}}{\bm{U}}^{\top})\bm{x}^{*}_{k}, using (3),

𝒙k𝒙k\displaystyle\|\bm{x}^{*}_{k}-\bm{x}_{k}\| 𝒈k𝒃k+(𝑰𝑼𝑼)𝑼𝒃k1.4δt𝒃k.\displaystyle\leq\|\bm{g}_{k}-\bm{b}_{k}\|+\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|\leq 1.4\delta_{t}\|\bm{b}^{*}_{k}\|.

𝑼𝑼𝒃k𝒃k=𝑼𝑼𝑼𝒃k𝑼𝒃k=𝑼𝒃k(𝑰𝑼𝑼)𝑼𝒃k𝑼𝒃k=𝒙k(𝑰𝑼𝑼)𝑼𝒃k𝒙k𝒙k𝒙k+(𝑰𝑼𝑼)𝑼𝒃k2.4δt𝒃k\|{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-\bm{b}^{*}_{k}\|=\|{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}{\bm{U}}\bm{b}_{k}-{\bm{U}}^{*}{}\bm{b}^{*}_{k}\|=\|{\bm{U}}\bm{b}_{k}-(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{U}}\bm{b}_{k}-{\bm{U}}^{*}{}\bm{b}^{*}_{k}\|=\|\bm{x}_{k}-(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{U}}\bm{b}_{k}-\bm{x}^{*}_{k}\|\leq\|\bm{x}_{k}-\bm{x}^{*}_{k}\|+\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}{}^{\top}){\bm{U}}\bm{b}_{k}\|\leq 2.4\delta_{t}\|\bm{b}^{*}_{k}\|

Bounding 𝑮𝑩F\|\bm{G}-\bm{B}\|_{F} and 𝑿𝑿F\|{\bm{X}}^{*}-{\bm{X}}\|_{F}: Since k𝑴𝒃k2=𝑴𝑩F2𝑴F2𝑩2=𝑴F2σmax2\sum_{k}\|\bm{M}\bm{b}^{*}_{k}\|^{2}=\|\bm{M}\bm{B}^{*}\|_{F}^{2}\leq\|\bm{M}\|_{F}^{2}\|\bm{B}^{*}\|^{2}=\|\bm{M}\|_{F}^{2}{\sigma_{\max}^{*}}^{2}, we can use the first bound from (3) to conclude that

𝑮𝑩F2\displaystyle\|\bm{G}-\bm{B}\|_{F}^{2} =k𝒈k𝒃k2\displaystyle=\sum_{k}\|\bm{g}_{k}-\bm{b}_{k}\|^{2}
0.42k(𝑰𝑼𝑼)𝑼𝒃k2\displaystyle\leq 0.4^{2}\sum_{k}\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|^{2}
=0.42(𝑰𝑼𝑼)𝑼𝑩F20.42δt2σmax2\displaystyle=0.4^{2}\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{B}^{*}\|_{F}^{2}\leq 0.4^{2}\delta_{t}^{2}{\sigma_{\max}^{*}}^{2}

and, similarly,

𝑿𝑿F2\displaystyle\|{\bm{X}}^{*}-{\bm{X}}\|_{F}^{2} k𝒈k𝒃k2+k(𝑰𝑼𝑼)𝑼𝒃k2\displaystyle\leq\sum_{k}\|\bm{g}_{k}-\bm{b}_{k}\|^{2}+\sum_{k}\|(\bm{I}-{\bm{U}}{\bm{U}}^{\top}){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|^{2}
(0.42+12)δt2σmax2\displaystyle\leq(0.4^{2}+1^{2})\delta_{t}^{2}{\sigma_{\max}^{*}}^{2}

Incoherence of 𝒃k\bm{b}_{k}’s: Using the bound on 𝒃k𝒈k\|\bm{b}_{k}-\bm{g}_{k}\|, and using 𝒈k𝒃k\|\bm{g}_{k}\|\leq\|\bm{b}^{*}_{k}\| and the right incoherence assumption,

𝒃k\displaystyle\|\bm{b}_{k}\| =(𝒃k𝒈k+𝒈k)(1+0.4δt)𝒃k1.04μσmaxr/q.\displaystyle=\|(\bm{b}_{k}-\bm{g}_{k}+\bm{g}_{k})\|\leq(1+0.4\delta_{t})\|\bm{b}^{*}_{k}\|\leq 1.04\mu{\sigma_{\max}^{*}}\sqrt{r/q}.

Lower and Upper Bounds on σi(𝑩)\sigma_{i}(\bm{B}): Using the bound on 𝑮𝑩F\|\bm{G}-\bm{B}\|_{F} and using SD(𝑼,𝑼)δt<c/κ\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})\leq\delta_{t}<c/\kappa,

σmin(𝑩)\displaystyle\sigma_{\min}(\bm{B}) σmin(𝑮)𝑮𝑩\displaystyle\geq\sigma_{\min}(\bm{G})-\|\bm{G}-\bm{B}\|
σmin(𝑼𝑼)σmin(𝑩)𝑮𝑩F\displaystyle\geq\sigma_{\min}({\bm{U}}^{\top}{\bm{U}}^{*}{})\sigma_{\min}(\bm{B}^{*})-\|\bm{G}-\bm{B}\|_{F}
1𝑼𝑼2σmin0.4δtσmax\displaystyle\geq\sqrt{1-\|{\bm{U}}^{*}{}_{\perp}{}^{\top}{\bm{U}}\|^{2}}{\sigma_{\min}^{*}}-0.4\delta_{t}{\sigma_{\max}^{*}}
1δt2σmin0.4δtσmax0.9σmin\displaystyle\geq\sqrt{1-\delta_{t}^{2}}{\sigma_{\min}^{*}}-0.4\delta_{t}{\sigma_{\max}^{*}}\geq 0.9{\sigma_{\min}^{*}}

since we assumed δtδ0<0.1/κ\delta_{t}\leq\delta_{0}<0.1/\kappa. Similarly,

𝑩=σmax(𝑩)\displaystyle\|\bm{B}\|=\sigma_{\max}(\bm{B}) σmax(𝑼𝑼)σmax(𝑩)+𝑮𝑩F\displaystyle\leq\sigma_{\max}({\bm{U}}^{\top}{\bm{U}}^{*}{})\sigma_{\max}(\bm{B}^{*})+\|\bm{G}-\bm{B}\|_{F}
σmax+0.4δtσmax1.1σmax\displaystyle\leq{\sigma_{\max}^{*}}+0.4\delta_{t}{\sigma_{\max}^{*}}\leq 1.1{\sigma_{\max}^{*}}

IV-E Proof of GD iterations’ lemmas: Proof of Lemma 3.3, first part

We bound 𝒈k𝒃k\|\bm{g}_{k}-\bm{b}_{k}\| here. Recall that 𝒈k=𝑼𝒙k\bm{g}_{k}={\bm{U}}^{\top}\bm{x}^{*}_{k}. Since 𝒚k=𝑨k𝒙k=𝑨k𝑼𝑼𝒙k+𝑨k(𝑰𝑼𝑼)𝒙k\bm{y}_{k}=\bm{A}_{k}\bm{x}^{*}_{k}=\bm{A}_{k}{\bm{U}}{\bm{U}}{}^{\top}\bm{x}^{*}_{k}+\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}, we have

𝒃k\displaystyle\bm{b}_{k} =(𝑼𝑨k𝑨k𝑼)1(𝑼𝑨k)𝑨k𝑼𝑼𝒙k\displaystyle=\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top})\bm{A}_{k}{\bm{U}}{\bm{U}}{}^{\top}\bm{x}^{*}_{k}
+(𝑼𝑨k𝑨k𝑼)1(𝑼𝑨k)𝑨k(𝑰𝑼𝑼)𝒙k,\displaystyle\qquad+\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top})\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k},
=(𝑼𝑨k𝑨k𝑼)1(𝑼𝑨k𝑨k𝑼)𝑼𝒙k\displaystyle=\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right){\bm{U}}{}^{\top}\bm{x}^{*}_{k}
+(𝑼𝑨k𝑨k𝑼)1(𝑼𝑨k)𝑨k(𝑰𝑼𝑼)𝒙k,\displaystyle\qquad+\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top})\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k},
=𝒈k+(𝑼𝑨k𝑨k𝑼)1(𝑼𝑨k)𝑨k(𝑰𝑼𝑼)𝒙k.\displaystyle=\bm{g}_{k}+\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top})\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}.

Thus,

𝒃k𝒈k\displaystyle\|\bm{b}_{k}-\bm{g}_{k}\| (𝑼𝑨k𝑨k𝑼)1\displaystyle\leq\|\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}\|
×𝑼𝑨k𝑨k(𝑰𝑼𝑼)𝒙k.\displaystyle\qquad\times~{}\|{\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\|. (19)
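As a quick numerical check of the decomposition derived above (a sketch with illustrative names, taking m ≥ r so that 𝑨k𝑼 has full column rank): the LS estimate (𝑨k𝑼)†𝒚k equals 𝒈k plus the term whose norm is bounded in (19).

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, m = 8, 2, 20
U, _ = np.linalg.qr(rng.standard_normal((n, r)))   # basis matrix with orthonormal columns
A = rng.standard_normal((m, n))                    # A_k
x_star = rng.standard_normal(n)                    # x*_k
y = A @ x_star                                     # y_k = A_k x*_k

b_ls, *_ = np.linalg.lstsq(A @ U, y, rcond=None)   # b_k = (A_k U)^dagger y_k
g = U.T @ x_star                                   # g_k = U^T x*_k
M = np.linalg.inv(U.T @ A.T @ A @ U) @ (U.T @ A.T)
b_decomp = g + M @ A @ (np.eye(n) - U @ U.T) @ x_star
assert np.allclose(b_ls, b_decomp)
print("LS estimate = g_k + (U^T A_k^T A_k U)^{-1} U^T A_k^T A_k (I - U U^T) x*_k")
```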

Using standard results from [26], one can show the following:

  1.

    W.p. 1qexp(rcm)\geq 1-q\exp\left(r-cm\right), for all k[q]k\in[q], min𝒘𝒮ri|𝒂ki𝑼𝒘|20.7m\min_{\bm{w}\in\mathcal{S}_{r}}\sum_{i}\big{|}\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{w}\big{|}^{2}\geq 0.7m and so

    (𝑼𝑨k𝑨k𝑼)1\displaystyle\|\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)^{-1}\| =1σmin(𝑼𝑨k𝑨k𝑼)\displaystyle=\frac{1}{\sigma_{\min}\left({\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}{\bm{U}}\right)}
    =1min𝒘𝒮ri𝑼𝒂ki,𝒘2\displaystyle=\frac{1}{\min_{\bm{w}\in\mathcal{S}_{r}}\sum_{i}\langle{\bm{U}}^{\top}\bm{a}_{ki},\bm{w}\rangle^{2}}
    10.7m\displaystyle\leq\frac{1}{0.7m}
  2.

    W.p. at least 1qexp(rcm)1-q\exp(r-cm),  for all k[q]\text{ for all }k\in[q],

    𝑼𝑨k𝑨k(𝑰𝑼𝑼)𝒙k0.15m(𝑰𝑼𝑼)𝒙k\|{\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\|\leq 0.15m\|(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\|

Combining the above two bounds and (19), w.p. at least 12exp(logq+rcm)1-2\exp(\log q+r-cm),  for all k[q]\text{ for all }k\in[q],

𝒈k𝒃k0.4(𝑰n𝑼𝑼)𝑼𝒃k.\|\bm{g}_{k}-\bm{b}_{k}\|\leq 0.4\|\left(\bm{I}_{n}-{\bm{U}}{\bm{U}}^{\top}\right){\bm{U}}^{*}{}\bm{b}^{*}_{k}\|.

This completes the proof. We explain next how to get the above two bounds.

The first bound above follows by a restatement of Theorem 4.6.1 of [26]. Or, it follows more directly by using 𝔼[i|𝒂ki𝑼𝒘|2]=m\mathbb{E}[\sum_{i}\big{|}\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{w}\big{|}^{2}]=m, applying the sub-exponential Bernstein inequality [29, Theorem 2.8.1] to bound the deviation from this mean, and then applying Proposition 4.9 with n1,rrn\equiv 1,r\equiv r (epsilon net argument).

The second bound is obtained as follows. Notice that

𝑼𝑨k𝑨k(𝑰𝑼𝑼)𝒙k\displaystyle\|{\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\|
=max𝒘𝒮r𝒘𝑼𝑨k𝑨k(𝑰𝑼𝑼)𝒙k\displaystyle\qquad=\max_{\bm{w}\in\mathcal{S}_{r}}\bm{w}{}^{\top}{\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}
=max𝒘𝒮ri(𝒂ki𝑼𝒘)(𝒂ki(𝑰𝑼𝑼)𝒙k)\displaystyle\qquad=\max_{\bm{w}\in\mathcal{S}_{r}}\sum_{i}(\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{w})(\bm{a}_{ki}{}^{\top}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k})

Clearly 𝔼[𝑼𝑨k𝑨k(𝑰𝑼𝑼)𝒙k]=𝑼(𝑰𝑼𝑼)𝒙k=0\mathbb{E}\left[{\bm{U}}{}^{\top}\bm{A}_{k}{}^{\top}\bm{A}_{k}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\right]={\bm{U}}{}^{\top}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}=0. Moreover, the summands are products of sub-Gaussian r.v.s and are thus sub-exponential. Also, the different summands are mutually independent and zero mean. Applying sub-exponential Bernstein with t=ϵ0m(𝑰𝑼𝑼)𝒙kt=\epsilon_{0}m\|(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\| for a fixed 𝒘𝒮r\bm{w}\in\mathcal{S}_{r},

|i(𝒂ki𝑼𝒘)(𝒂ki(𝑰𝑼𝑼)𝒙k)|ϵ0m(𝑰𝑼𝑼)𝒙k|\sum_{i}(\bm{a}_{ki}{}^{\top}{\bm{U}}\bm{w})(\bm{a}_{ki}{}^{\top}(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k})|\leq\epsilon_{0}m\|(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\|

w.p. at least 1exp(cϵ02m)1-\exp(-c\epsilon_{0}^{2}m). Setting ϵ0=0.1\epsilon_{0}=0.1, this implies that the above is bounded by 0.1m(𝑰𝑼𝑼)𝒙k0.1m\|(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\| w.p. at least 1exp(cm)1-\exp(-cm). By Proposition 4.8 with n1,rrn\equiv 1,r\equiv r, the above is bounded by 0.12m(𝑰𝑼𝑼)𝒙k0.12m\|(\bm{I}-{\bm{U}}{\bm{U}}{}^{\top})\bm{x}^{*}_{k}\| for all 𝒘𝒮r\bm{w}\in\mathcal{S}_{r} w.p. at least 1exp(rcm)1-\exp(r-cm). Using a union bound over all qq columns, the bound holds for all qq columns w.p. at least 1qexp(rcm)1-q\exp(r-cm).
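
As an illustrative numerical check of this bound (it plays no role in the proof), the following minimal Python sketch, with hypothetical dimensions, forms \bm{b}_{k}=(\bm{A}_{k}{\bm{U}})^{\dagger}\bm{y}_{k} for a {\bm{U}} close to {\bm{U}}^{*}{} and compares \|\bm{b}_{k}-\bm{g}_{k}\| with \|(\bm{I}-{\bm{U}}{\bm{U}}^{\top})\bm{x}^{*}_{k}\|:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 200, 3, 60                                   # hypothetical sizes
U_star, _ = np.linalg.qr(rng.standard_normal((n, r)))  # true subspace U*
x_star = U_star @ rng.standard_normal(r)               # one column x*_k = U* b*_k
U, _ = np.linalg.qr(U_star + 0.05 * rng.standard_normal((n, r)))  # U close to U*

A = rng.standard_normal((m, n))                        # one Gaussian A_k
y = A @ x_star                                         # y_k = A_k x*_k
b = np.linalg.lstsq(A @ U, y, rcond=None)[0]           # b_k = (A_k U)^dagger y_k
g = U.T @ x_star                                       # g_k = U^T x*_k

lhs = np.linalg.norm(b - g)
rhs = np.linalg.norm(x_star - U @ (U.T @ x_star))      # ||(I - U U^T) x*_k||
print(lhs, 0.4 * rhs)  # typically lhs is well below 0.4 * rhs when m >> r
```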

IV-F Proof of Initialization lemmas/facts: Proof of Lemma 3.6

To see why (4) holds, it suffices to show that \mathbb{E}[({\bm{X}}_{0})_{k}|\alpha]=\bm{x}^{*}_{k}\beta_{k}(\alpha) for each k. The easiest way to see this is to express \bm{x}^{*}_{k}=\|\bm{x}^{*}_{k}\|{\bm{Q}}_{k}\bm{e}_{1}, where {\bm{Q}}_{k} is an n\times n unitary matrix with first column \bm{x}^{*}_{k}/\|\bm{x}^{*}_{k}\|, and to use the fact that \tilde{\bm{a}}_{ki}:={\bm{Q}}_{k}^{\top}\bm{a}_{ki} has the same distribution as \bm{a}_{ki}; both are {\cal{N}}(0,\bm{I}_{n}). Using {\bm{Q}}_{k}{\bm{Q}}_{k}^{\top}=\bm{I}, ({\bm{X}}_{0})_{k}=(1/m)\sum_{i}{\bm{Q}}_{k}{\bm{Q}}_{k}^{\top}\bm{a}_{ki}\bm{a}_{ki}^{\top}\|\bm{x}^{*}_{k}\|{\bm{Q}}_{k}\bm{e}_{1}\mathbbm{1}_{\{\|\bm{x}^{*}_{k}\||\bm{a}_{ki}^{\top}{\bm{Q}}_{k}\bm{e}_{1}|\leq\sqrt{\alpha}\}}=(1/m)\sum_{i}{\bm{Q}}_{k}\|\bm{x}^{*}_{k}\|\tilde{\bm{a}}_{ki}\tilde{\bm{a}}_{ki}(1)\mathbbm{1}_{\{|\tilde{\bm{a}}_{ki}(1)|\leq\sqrt{\alpha}/\|\bm{x}^{*}_{k}\|\}}. Thus \mathbb{E}[({\bm{X}}_{0})_{k}|\alpha]=(1/m)\,m\,{\bm{Q}}_{k}\|\bm{x}^{*}_{k}\|\bm{e}_{1}\mathbb{E}[\zeta^{2}\mathbbm{1}_{\{|\zeta|\leq\sqrt{\alpha}/\|\bm{x}^{*}_{k}\|\}}]=\bm{x}^{*}_{k}\beta_{k}(\alpha). This follows because \mathbb{E}[\bm{a}\,\bm{a}(1)\mathbbm{1}_{\{|\bm{a}(1)|<\beta\}}]=\bm{e}_{1}\mathbb{E}[\bm{a}(1)^{2}\mathbbm{1}_{\{|\bm{a}(1)|<\beta\}}] for \bm{a}\sim{\cal{N}}(0,\bm{I}_{n}).

Recall that C~=9κ2μ2\tilde{C}=9\kappa^{2}\mu^{2} and c~=c/C~\tilde{c}=c/\tilde{C} for a c<1c<1. Recall also that 𝑿=SVD𝑼𝚺𝑽{\bm{X}}^{*}\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}{\bm{\Sigma}^{*}}{\bm{V}^{*}} and 𝔼[𝑿0|α]=SVD𝑼𝚺ˇ𝑽ˇ\mathbb{E}[{\bm{X}}_{0}|\alpha]\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}\check{\bm{\Sigma}^{*}}\check{\bm{V}}. Thus, using (4), 𝚺ˇ=𝚺𝑽𝑫𝑽ˇ\check{\bm{\Sigma}^{*}}={\bm{\Sigma}^{*}}{\bm{V}^{*}}{\bm{D}}\check{\bm{V}}{}^{\top}. Hence,

σr(𝔼[𝑿0|α])\displaystyle\sigma_{r}(\mathbb{E}[{\bm{X}}_{0}|\alpha]) =σmin(𝚺ˇ)\displaystyle=\sigma_{\min}(\check{\bm{\Sigma}^{*}})
=σmin(𝚺𝑽𝑫𝑽ˇ)\displaystyle=\sigma_{\min}({\bm{\Sigma}^{*}}{\bm{V}^{*}}{\bm{D}}\check{\bm{V}}{}^{\top})
σmin(𝚺)σmin(𝑽)σmin(𝑫)σmin(𝑽ˇ)\displaystyle\geq\sigma_{\min}({\bm{\Sigma}^{*}})\sigma_{\min}({\bm{V}^{*}})\sigma_{\min}({\bm{D}})\sigma_{\min}(\check{\bm{V}}{}^{\top})
=σmin1(minkβk(α))1\displaystyle={\sigma_{\min}^{*}}\cdot 1\cdot(\min_{k}\beta_{k}(\alpha))\cdot 1

Also, \sigma_{r+1}(\mathbb{E}[{\bm{X}}_{0}|\alpha])=0 since it is a rank-r matrix. Thus, using Wedin's \sin\Theta theorem for the Frobenius norm subspace distance \mathrm{SD} [27, 28, Theorem 2.3.1, second row] (specified in Theorem 4.1 above), applied with \bm{M}\equiv{\bm{X}}_{0} and \bm{M}^{*}\equiv\mathbb{E}[{\bm{X}}_{0}|\alpha], we get (2).

IV-G Proof of Initialization lemmas and facts: Proof of Lemma 3.8

Proof of first part of Lemma 3.8.

The proof involves an application of the sub-Gaussian Hoeffding inequality, Theorem 2.6.2 of [26], followed by an epsilon-net argument. The application of sub-Gaussian Hoeffding uses conditioning on \alpha, for \alpha\in\mathcal{E}. For \alpha\in\mathcal{E}, \sqrt{\alpha}\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}/\sqrt{q}, and this helps get a simple probability bound. Since \alpha is independent of all the \bm{a}_{ki},\bm{y}_{ki}'s used in defining {\bm{X}}_{0}, the conditioning does not change anything else in our proof. For example, the different summands are mutually independent even conditioned on it.

We have,

𝑿0𝔼[𝑿0|α]=max𝒛𝒮n,𝒘𝒮q𝑿0𝔼[𝑿0|α],𝒛𝒘.\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\|=\max_{\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}}\langle{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha],~{}\bm{z}\bm{w}{}^{\top}\rangle.

For a fixed 𝒛𝒮n,𝒘𝒮q\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}, we have

𝑿0𝔼[𝑿0|α],𝒛𝒘\displaystyle\langle{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha],~{}\bm{z}\bm{w}{}^{\top}\rangle
=1mki𝒘(k)𝒚ki(𝒂ki𝒛)𝟙{|𝒚ki|2α}\displaystyle\qquad=\frac{1}{m}\sum_{ki}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\mathbbm{1}_{\{|\bm{y}_{ki}|^{2}\leq\alpha\}}
𝔼[𝒘(k)𝒚ki(𝒂ki𝒛)𝟙{|𝒚ki|2α}].\displaystyle\qquad-\mathbb{E}\left[\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\mathbbm{1}_{\{|\bm{y}_{ki}|^{2}\leq\alpha\}}\right].

The summands are mutually independent, zero mean sub-Gaussian r.v.s with sub-Gaussian norm K_{ki}\leq C|\bm{w}(k)|\sqrt{\alpha}/m. For \alpha\in\mathcal{E}, \sqrt{\alpha}\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}/\sqrt{q}, and so K_{ki}\leq C|\bm{w}(k)|\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}/(m\sqrt{q}). Let t=\epsilon_{1}\|{\bm{X}}^{*}\|_{F}. Then, for any \alpha\in\mathcal{E},

t2kiKki2ϵ12𝑿F2kiC~(1+ϵ1)𝒘(k)2𝑿F2/m2qϵ12mqCμ2κ2\frac{t^{2}}{\sum_{ki}K_{ki}^{2}}\geq\frac{\epsilon_{1}^{2}\|{\bm{X}}^{*}\|_{F}^{2}}{\sum_{ki}\tilde{C}(1+\epsilon_{1})\bm{w}(k)^{2}\|{\bm{X}}^{*}\|_{F}^{2}/m^{2}q}\geq\frac{\epsilon_{1}^{2}mq}{C\mu^{2}\kappa^{2}}

since k𝒘(k)2=𝒘2=1\sum_{k}\bm{w}(k)^{2}=\|\bm{w}\|^{2}=1. Thus, for a fixed 𝒛𝒮n,𝒘𝒮q\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}, by sub-Gaussian Hoeffding, we conclude that, conditioned on α\alpha, for any α\alpha\in\mathcal{E}, w.p. at least 1exp[cϵ12mq/μ2κ2]1-\exp\left[-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right],

𝑿0𝔼[𝑿0|α],𝒛𝒘Cϵ1𝑿F.\langle{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha],~{}\bm{z}\bm{w}{}^{\top}\rangle\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

The rest of the proof follows by a standard epsilon net argument summarized in Proposition 4.7. Applying it, conditioned on α\alpha, for any α\alpha\in\mathcal{E}, w.p. at least 1exp[(n+q)cϵ12mq/μ2κ2]1-\exp\left[(n+q)-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right], max𝒛𝒮n,𝒘𝒮q𝑿0𝔼[𝑿0|α],𝒛𝒘1.4Cϵ1𝑿F.\max_{\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}}\langle{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha],~{}\bm{z}\bm{w}{}^{\top}\rangle\leq 1.4C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

Proof of second part of Lemma 3.8.

We have

(𝑿0𝔼[𝑿0|α])𝑼F=max𝑾𝒮qr𝑾,(𝑿0𝔼[𝑿0|α])𝑼\|\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\|_{F}=\max_{{\bm{W}}\in\mathcal{S}_{qr}}\langle{\bm{W}},\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\rangle

For a fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr},

𝑾,(𝑿0𝔼[𝑿0|α])𝑼\displaystyle\langle{\bm{W}},\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\rangle
=trace(𝑾(𝑿0𝔼[𝑿0|α])𝑼)\displaystyle\qquad=\mathrm{trace}\left({\bm{W}}{}^{\top}\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\right)
=1mki(𝒚ki(𝒂ki𝑼𝒘k)𝟙{|𝒚ki|2α}𝔼[.])\displaystyle\qquad=\frac{1}{m}\sum_{ki}\left(\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\alpha\right\}}-\mathbb{E}[.]\right)

Conditioned on α\alpha, for an α\alpha\in\mathcal{E}, the summands are independent zero mean sub-Gaussian r.v.s with subGaussian norm Kkiα𝒘k/mC~(1+ϵ1)𝑿F𝒘k/mqK_{ki}\leq\sqrt{\alpha}\|\bm{w}_{k}\|/m\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}\|\bm{w}_{k}\|/m\sqrt{q}. Thus,

kiKki2mC~(1+ϵ1)𝑾F2𝑿F2/m2q=C~𝑿F2/mq\sum_{ki}K_{ki}^{2}\leq m\tilde{C}(1+\epsilon_{1})\|{\bm{W}}\|_{F}^{2}\|{\bm{X}}^{*}\|_{F}^{2}/m^{2}q=\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}/mq

Applying the sub-Gaussian Hoeffding inequality Theorem 2.6.2 of [26], for a fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr}, conditioned on α\alpha, for an α\alpha\in\mathcal{E}, w.p. 1exp[ϵ12mq/Cμ2κ2]1-\exp\left[-\epsilon_{1}^{2}mq/C\mu^{2}\kappa^{2}\right],

trace(𝑾(𝑿0𝔼[𝑿0|α])𝑼)ϵ1𝑿F.\mathrm{trace}\left({\bm{W}}{}^{\top}\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\right)\leq\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

The rest of the proof follows by a standard epsilon net argument summarized in Proposition 4.8. Applying Proposition 4.8, conditioned on α\alpha, for an α\alpha\in\mathcal{E}, w.p. at least 1exp[qrcϵ12mq/μ2κ2]1-\exp\left[qr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right], max𝑾𝒮qrtrace(𝑾(𝑿0𝔼[𝑿0|α])𝑼)<1.2ϵ1𝑿F\max_{{\bm{W}}\in\mathcal{S}_{qr}}\mathrm{trace}\left({\bm{W}}{}^{\top}\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right){}^{\top}{\bm{U}}^{*}{}\right)<1.2\epsilon_{1}\|{\bm{X}}^{*}\|_{F}. ∎

Proof of third part of Lemma 3.8.

We have

(𝑿0𝔼[𝑿0|α])𝑽ˇF=max𝑾𝒮nr(𝑿0𝔼[𝑿0|α])𝑽ˇ,𝑾.\|\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top}\|_{F}=\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle.

For a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr} we have,

(𝑿0𝔼[𝑿0|α])𝑽ˇ,𝑾\displaystyle\langle\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle
=1mki(𝒚ki(𝒂ki𝑾𝒗ˇk)𝟙{|𝒚ki|2α}𝔼[.])\displaystyle\qquad=\frac{1}{m}\sum_{ki}\left(\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\alpha\right\}}-\mathbb{E}[.]\right)

where \mathbb{E}[.] is the expected value of the first term. Conditioned on \alpha, for an \alpha\in\mathcal{E}, the summands are independent, zero mean, sub-Gaussian r.v.s with sub-Gaussian norm K_{ki}\leq C\sqrt{\alpha}\|{\bm{W}}\check{\bm{v}}_{k}\|/m\leq C\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}\|{\bm{W}}\check{\bm{v}}_{k}\|/(m\sqrt{q}). Let t=\epsilon_{1}\|{\bm{X}}^{*}\|_{F}. Using \check{\bm{V}}\check{\bm{V}}{}^{\top}=\bm{I} (the rows of \check{\bm{V}} are orthonormal right singular vectors of \mathbb{E}[{\bm{X}}_{0}|\alpha]), and thus \|{\bm{W}}\check{\bm{V}}\|_{F}^{2}=1,

\frac{t^{2}}{\sum_{ki}K_{ki}^{2}}\geq\frac{m^{2}\epsilon_{1}^{2}\|{\bm{X}}^{*}\|_{F}^{2}}{\sum_{ki}\tilde{C}(1+\epsilon_{1})\|{\bm{X}}^{*}\|_{F}^{2}\|{\bm{W}}\check{\bm{v}}_{k}\|^{2}/q}=\frac{mq\epsilon_{1}^{2}}{C\mu^{2}\kappa^{2}}.

Thus, by applying the sub-Gaussian Hoeffding inequality, Theorem 2.6.2 of [26], with this t, conditioned on \alpha, for an \alpha\in\mathcal{E}, w.p. at least 1-\exp\left[-c\epsilon_{1}^{2}mq/(\mu^{2}\kappa^{2})\right],

\langle\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

Applying Proposition 4.8, conditioned on α\alpha, for an α\alpha\in\mathcal{E}, w.p. at least 1exp[nrcϵ12mq/(μ2κ2)]1-\exp\left[nr-c\epsilon_{1}^{2}mq/(\mu^{2}\kappa^{2})\right], max𝑾𝒮nr(𝑿0𝔼[𝑿0|α])𝑽ˇ,𝑾1.2Cϵ1𝑿F.\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle\left({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{0}|\alpha]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle\leq 1.2C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

IV-H Proof of Initialization lemmas and facts: Proof of Facts

Proof of Fact 3.7.

Apply sub-exponential Bernstein. ∎

Proof of Fact 3.9.

Let \gamma_{k}=\frac{\sqrt{\tilde{C}(1-\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}}{\sqrt{q}\|\bm{x}^{*}_{k}\|}. Since \tilde{C}=9\mu^{2}\kappa^{2} and \|\bm{x}^{*}_{k}\|^{2}\leq\mu^{2}\kappa^{2}\|{\bm{X}}^{*}\|_{F}^{2}/q (Assumption 1.1), it follows that

γk3.\gamma_{k}\geq 3.

Now,

\displaystyle\mathbb{E}\left[\zeta^{2}\mathbbm{1}_{\left\{|\zeta|\leq\gamma_{k}\right\}}\right]=\displaystyle 1-\mathbb{E}\left[\zeta^{2}\mathbbm{1}_{\left\{|\zeta|\geq\gamma_{k}\right\}}\right]
\displaystyle\geq\displaystyle 1-\frac{2}{\sqrt{2\pi}}\int_{3}^{\infty}z^{2}\exp(-z^{2}/2)dz
\displaystyle\geq\displaystyle 1-\frac{6e^{-9/4}}{\sqrt{2\pi}}\int_{3}^{\infty}z\exp(-z^{2}/4)dz
\displaystyle=1-\frac{12e^{-9/2}}{\sqrt{2\pi}}\geq 0.94\geq 0.92.

The first inequality used \gamma_{k}\geq 3. The second used the fact that z\exp(-z^{2}/4)\leq 3e^{-9/4} for all z\geq 3 (the map z\mapsto z\exp(-z^{2}/4) is decreasing for z\geq\sqrt{2}). The last equality used \int_{3}^{\infty}z\exp(-z^{2}/4)dz=2e^{-9/4}. ∎
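
As a quick numerical sanity check of this fact (it is not part of the proof), the following Python sketch evaluates \mathbb{E}[\zeta^{2}\mathbbm{1}_{\{|\zeta|\leq\gamma\}}] at \gamma=3 both in closed form and by Monte Carlo; the exact value is about 0.97, comfortably above the 0.92 used above:

```python
import numpy as np
from scipy.stats import norm

gamma = 3.0
# Closed form via integration by parts:
# E[z^2 1{|z| <= g}] = 1 - 2*(g*phi(g) + Q(g)), phi = N(0,1) pdf, Q = upper tail prob.
exact = 1 - 2 * (gamma * norm.pdf(gamma) + norm.sf(gamma))

rng = np.random.default_rng(0)
z = rng.standard_normal(10**7)
mc = np.mean(z**2 * (np.abs(z) <= gamma))

print(exact, mc)  # both are approximately 0.9707 >= 0.92
```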

In all the proofs above, notice that the only property of \check{\bm{V}} that we used is that its rows are (orthonormal) right singular vectors, so that \check{\bm{V}}\check{\bm{V}}^{\top}=\bm{I} and hence \sigma_{r}(\check{\bm{V}})=\sigma_{1}(\check{\bm{V}})=1. We never required incoherence for it.

Algorithm 2 The AltGD-Min-LRPR algorithm.
1:Input: 𝒚(mag)k,𝑨k,k[q]{\bm{y}_{(mag)}}_{k},\bm{A}_{k},k\in[q]
2:Parameters: GD step size, η\eta; Number of iterations, TT
3:Sample-split: Partition the measurements and measurement matrices into 2T+12T+1 equal-sized disjoint sets: one set for initialization and 2T2T sets for the iterations. Denote these by 𝒚(mag)k(τ),𝑨k(τ),τ=0,1,2T{{\bm{y}_{(mag)}}_{k}}^{(\tau)},\bm{A}_{k}^{(\tau)},\tau=0,1,\dots 2T.
4:Initialization:
5:Compute 𝑼0{\bm{U}}_{0} as the top rr singular vectors of 𝒀U:=1mqki(𝒚(mag)ki)2𝒂ki𝒂ki𝟙{(𝒚(mag)ki)2C~1mqki(𝒚(mag)ki)2}\bm{Y}_{U}:=\frac{1}{mq}\sum_{ki}({\bm{y}_{(mag)}}_{ki})^{2}\bm{a}_{ki}\bm{a}_{ki}^{\top}\mathbbm{1}_{\left\{({\bm{y}_{(mag)}}_{ki})^{2}\leq\tilde{C}\frac{1}{mq}\sum_{ki}({\bm{y}_{(mag)}}_{ki})^{2}\right\}}. with 𝒚(mag)ki𝒚(mag)ki(0),𝒂ki𝒂ki(0){\bm{y}_{(mag)}}_{ki}\equiv{{\bm{y}_{(mag)}}_{ki}}^{(0)},\bm{a}_{ki}\equiv\bm{a}_{ki}^{(0)}.
6:GDmin Iterations:
7:for t=1t=1 to TT do
8:     Let 𝑼𝑼t1{\bm{U}}\leftarrow{\bm{U}}_{t-1}.
9:     Update bk,xk\bm{b}_{k},\bm{x}_{k}: For each k[q]k\in[q], set (𝒃k)tRWF(𝒚(mag)k(t),(𝑼𝑨k(t)),TRWF,t)(\bm{b}_{k})_{t}\leftarrow\mathrm{RWF}({{\bm{y}_{(mag)}}_{k}}^{(t)},({\bm{U}}^{\top}\bm{A}_{k}^{(t)}),T_{RWF,t}). Set (𝒙k)t𝑼(𝒃k)t(\bm{x}_{k})_{t}\leftarrow{\bm{U}}(\bm{b}_{k})_{t}
10:     Estimate gradient w.r.t. U{\bm{U}}: With 𝒚(mag)ki𝒚(mag)ki(T+t),𝒂ki𝒂ki(T+t){\bm{y}_{(mag)}}_{ki}\equiv{{\bm{y}_{(mag)}}_{ki}}^{(T+t)},\bm{a}_{ki}\equiv\bm{a}_{ki}^{(T+t)},
  • compute 𝒚^ki:=𝒚(mag)ki𝒄^ki\hat{\bm{y}}_{ki}:={\bm{y}_{(mag)}}_{ki}\hat{\bm{c}}_{ki} with 𝒄^ki=phase(𝒂ki𝒙k)\hat{\bm{c}}_{ki}=phase(\bm{a}_{ki}{}^{\top}\bm{x}_{k}) and

  • compute \widehat{\mathrm{GradU}}=\sum_{ki}(\hat{\bm{y}}_{ki}-\bm{a}_{ki}{}^{\top}(\bm{x}_{k})_{t})\bm{a}_{ki}(\bm{b}_{k})_{t}{}^{\top}

11:     Set 𝑼^+𝑼(η/m)GradU^\displaystyle\hat{\bm{U}}^{+}\leftarrow{\bm{U}}-(\eta/m)\widehat{\mathrm{GradU}}
12:     Orthonormalize to get the new {\bm{U}}: Compute \hat{\bm{U}}^{+}\overset{\mathrm{QR}}{=}{\bm{U}}^{+}{\bm{R}}^{+}. Set {\bm{U}}_{t}\leftarrow{\bm{U}}^{+}.
13:end for

V Extension to Low Rank Phase Retrieval (LRPR)

In LRPR, recall that we measure {\bm{y}_{(mag)}}_{k}=|\bm{A}_{k}\bm{x}^{*}_{k}|. This problem commonly occurs in dynamic phaseless imaging applications such as Fourier ptychography. Because of the magnitude-only measurements, we can recover each column only up to a global phase uncertainty. We use \mathrm{dist}(\bm{x}^{*},\bm{x}):=\min_{\theta\in[-\pi,\pi]}\|\bm{x}^{*}-e^{-j\theta}\bm{x}\| to quantify this phase-invariant distance [30, 21]. Also, for a complex number z, we use \bar{z} to denote its conjugate and phase(z):=z/|z| to denote its phase.
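
Since this distance is used throughout this section, the following minimal Python sketch (an illustration, not the paper's code) computes it using the standard closed form: the minimizing \theta aligns the phase of \langle\bm{x}^{*},\bm{x}\rangle, so \mathrm{dist}^{2}(\bm{x}^{*},\bm{x})=\|\bm{x}^{*}\|^{2}+\|\bm{x}\|^{2}-2|\langle\bm{x}^{*},\bm{x}\rangle|.

```python
import numpy as np

def dist(x_star, x):
    """Phase-invariant distance min_theta ||x_star - exp(-1j*theta) * x||."""
    inner = np.vdot(x, x_star)          # x^H x_star
    d2 = (np.linalg.norm(x_star) ** 2 + np.linalg.norm(x) ** 2
          - 2 * np.abs(inner))
    return np.sqrt(max(d2, 0.0))        # clip tiny negative rounding errors
```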

V-A AltGD-Min-LRPR algorithm

With three simple changes that we explain next, the AltGD-Min approach also solves LRPR and provides the fastest existing solution for it. First, observe that because of the magnitude-only measurements, we cannot use 𝑿0{\bm{X}}_{0} with 𝒚ki\bm{y}_{ki} replaced by 𝒚(mag)ki{\bm{y}_{(mag)}}_{ki} for initialization. The reason is 𝔼[𝒂ki𝒚(mag)ki]=0\mathbb{E}[\bm{a}_{ki}{\bm{y}_{(mag)}}_{ki}]=0 and so 𝔼[𝒂ki𝒚(mag)ki𝟙𝒚(mag)kiα]=0\mathbb{E}[\bm{a}_{ki}{\bm{y}_{(mag)}}_{ki}\mathbbm{1}_{{\bm{y}_{(mag)}}_{ki}\leq\sqrt{\alpha}}]=0 too. In fact, because of this, it is not even possible to define a different matrix 𝑿{\bm{X}} whose expected value can be shown to be close to 𝑿{\bm{X}}^{*}. Instead, we have to use the initialization approach of [5]. This is given in line 5 of Algorithm 2. The matrix 𝒀U\bm{Y}_{U} is such that its expected value is close to 𝑿𝑿+c𝑰{\bm{X}}^{*}{\bm{X}}^{*}{}^{\top}+c\bm{I}. This fact is used to argue that its top rr singular vectors span a subspace that is close to that spanned by columns of 𝑼{\bm{U}}^{*}{}.

Next, consider the GDmin iterations. We use the following idea to deal with the magnitude-only measurements: 𝒚(mag)ki:=|𝒚ki|{\bm{y}_{(mag)}}_{ki}:=|\bm{y}_{ki}|. Let 𝒄ki:=phase(𝒂ki𝒙k)\bm{c}_{ki}:=\mathrm{phase}(\bm{a}_{ki}{}^{\top}\bm{x}^{*}_{k}). Then, clearly,

𝒚ki=𝒄ki𝒚(mag)ki\bm{y}_{ki}=\bm{c}_{ki}{\bm{y}_{(mag)}}_{ki}

and 𝒚(mag)ki=𝒄¯ki𝒚ki{\bm{y}_{(mag)}}_{ki}=\bar{\bm{c}}_{ki}\bm{y}_{ki}. We do not observe 𝒄ki\bm{c}_{ki}, but we can estimate it using 𝒙k\bm{x}_{k} which is an estimate of 𝒙k\bm{x}^{*}_{k}. Using the estimated phase, we can get an estimate 𝒚^ki\hat{\bm{y}}_{ki} of 𝒚ki\bm{y}_{ki}. We replace 𝑼f(𝑼,𝑩)\nabla_{\bm{U}}f({\bm{U}},\bm{B}) by its estimate which uses 𝒚^ki=𝒚(mag)ki𝒄^ki\hat{\bm{y}}_{ki}={\bm{y}_{(mag)}}_{ki}\hat{\bm{c}}_{ki}, with 𝒄^ki=phase(𝒂ki𝒙k)\hat{\bm{c}}_{ki}=phase(\bm{a}_{ki}{}^{\top}\bm{x}_{k}), to replace 𝒚ki\bm{y}_{ki}. See line 10 of Algorithm 2.

Lastly, because of the magnitude-only measurements, the update step for updating 𝒃k\bm{b}_{k}s is no longer an LS problem. We now need to solve an rr-dimensional standard PR problem: min𝒃𝒚(mag)k|𝑨k𝑼𝒃|2\min_{\bm{b}}\|{\bm{y}_{(mag)}}_{k}-|\bm{A}_{k}{\bm{U}}\bm{b}|\|^{2}. This can be solved using any of the order-optimal algorithms for standard PR, e.g., Truncated Wirtinger Flow (TWF) [20] or Reshaped WF (RWF) [21]. For concreteness, we assume that RWF is used. We should point out here that we only need to run TRWF,tT_{RWF,t} iterations of RWF at outer loop iteration tt, with TRWF,tT_{RWF,t} set below in our theorem (we set this to ensure that the error level of this step is of order δt\delta_{t}). The entire algorithm, AltGD-Min-LRPR, is summarized in Algorithm 2.
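
To make the overall procedure concrete, the following is a minimal, single-node Python sketch of Algorithm 2 under several simplifications that are ours, not the paper's: real-valued data, no sample-splitting, a heuristic step size, and a simple “estimate sign, then least squares” inner loop standing in for RWF [21]. Variable names, loop counts, and the data layout are illustrative.

```python
import numpy as np

def altgdmin_lrpr(y_mag, A, r, T=50, T_inner=10, C_tilde=9.0, eta_scale=0.9):
    """y_mag: (q, m) magnitudes |A_k x*_k|; A: (q, m, n) Gaussian matrices."""
    q, m, n = A.shape
    # Truncated spectral initialization (line 5 of Algorithm 2)
    alpha = C_tilde * np.mean(y_mag ** 2)
    Y_U = np.zeros((n, n))
    for k in range(q):
        w = (y_mag[k] ** 2) * (y_mag[k] ** 2 <= alpha)   # truncated weights
        Y_U += (A[k] * w[:, None]).T @ A[k]
    Y_U /= (m * q)
    _, eigvec = np.linalg.eigh(Y_U)
    U = eigvec[:, -r:]                                   # top-r eigenvectors
    # GDmin iterations
    for _ in range(T):
        B, X = np.zeros((r, q)), np.zeros((n, q))
        for k in range(q):
            M = A[k] @ U                                 # m x r sensing matrix for b_k
            b = np.linalg.lstsq(M, y_mag[k], rcond=None)[0]
            for _ in range(T_inner):                     # sign-then-LS stand-in for RWF
                b = np.linalg.lstsq(M, np.sign(M @ b) * y_mag[k], rcond=None)[0]
            B[:, k], X[:, k] = b, U @ b
        eta = eta_scale / (np.linalg.norm(B, 2) ** 2)    # heuristic proxy for c/sigma_max*^2
        Grad = np.zeros((n, r))
        for k in range(q):
            Ax = A[k] @ X[:, k]
            resid = Ax - np.sign(Ax) * y_mag[k]          # a_ki^T x_k - yhat_ki
            Grad += A[k].T @ resid[:, None] @ B[:, k][None, :]
        U, _ = np.linalg.qr(U - (eta / m) * Grad)        # GD step + re-orthonormalize
    return U, B, X
```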

V-B Main Result

We can prove the following result with simple changes to the proof of Theorem 2.1.

Theorem 5.1.

Consider Algorithm 2. Set η=c/σmax2\eta=c/{\sigma_{\max}^{*}}^{2}, C~=9κ2μ2\tilde{C}=9\kappa^{2}\mu^{2}, T=Cκ2log(1/ϵ)T=C\kappa^{2}\log(1/\epsilon), and TRWF,t=C(t+clogr)T_{RWF,t}=C(t+c\log r). Assume that Assumption 1.1 holds. If

mqCκ6μ2(n+q)r2(r+log(1/ϵ)logκ)mq\geq C\kappa^{6}\mu^{2}(n+q)r^{2}(r+\log(1/\epsilon)\log\kappa)

and mCmax(logq,logn)log(1/ϵ)m\geq C\max(\log q,\log n)\log(1/\epsilon), then, w.p. 1n101-n^{-10}, SD(𝐔,𝐔T)ϵ\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{T})\leq\epsilon, dist((𝐱k)T,𝐱k)ϵ𝐱k\mathrm{dist}((\bm{x}_{k})_{T},\bm{x}^{*}_{k})\leq\epsilon\|\bm{x}^{*}_{k}\| for all k[q]k\in[q], and kdist2((𝐱k)T,𝐱k)ϵ2σmax2.\sum_{k}\mathrm{dist}^{2}((\bm{x}_{k})_{T},\bm{x}^{*}_{k})\leq\epsilon^{2}{\sigma_{\max}^{*}}^{2}.

We prove this result in Sec. V-C. Notice that the \log(1/\epsilon) in the sample complexity of Theorem 2.1 is now replaced by (r+\log(1/\epsilon)). The reason is the different initialization approach, which needs nr^{3} samples instead of nr^{2}. This is needed because PR is a more difficult problem: we cannot define a matrix {\bm{X}}_{0} for it for which \mathbb{E}[{\bm{X}}_{0}] is close to {\bm{X}}^{*}.

Observe that AltGD-Min-LRPR has the same sample complexity as that for the AltMin solution from [6]. But its time complexity is better by a factor of log(1/ϵ)\log(1/\epsilon) making it the fastest solution for LRPR. Also, we should mention here that, for solutions to the two related problems – sparse PR (phaseless but global measurements) and LRMC (linear but non-global measurements) – that have been extensively studied for nearly a decade, the best sample complexity guarantees for iterative (and hence fast) algorithms are sub-optimal. The best sparse PR guarantee [31] requires mm to be of order s2s^{2} for the initialization step. Here ss is the sparsity level. LRPR has both phaseless and non-global measurements. This is why its initialization step needs two extra factors of rr compared to the optimal. Once initialized close enough to the true solution, it is well known that a PR problem behaves like a linear one. This is true for AltGD-Min-LRPR too.

Consider a comparison with using a standard PR approach to recover each column of {\bm{X}}^{*} individually. If TWF [20] or RWF [21] were used for this, it would require m\gtrsim n. In comparison, ignoring log factors, our solution for LRPR needs m\gtrsim(n/q)r^{3}. Thus, the use of AltGD-Min is a better idea when the rank r of the matrix {\bm{X}}^{*} is small enough so that q\gtrsim r^{3}.
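
For instance, with the values used in the LRPR experiment of Sec. VII, n=600, q=1000, r=4, and ignoring constants and log factors, the column-wise approach needs m of order n=600 per column, whereas (n/q)r^{3}=(600/1000)\cdot 64\approx 38.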

V-C Proof of Theorem 5.1

For the initialization, we use the bound from [5].

Lemma 5.2 ([5]).

Let SD2(𝐔0,𝐔)=(𝐈𝐔𝐔)𝐔0\mathrm{SD}_{2}({\bm{U}}_{0},{\bm{U}}^{*}{})=\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top}){\bm{U}}_{0}\|. Pick a δinit<0.1\delta_{\mathrm{init}}<0.1. Then, w.p. at least 12exp(n(log17)cδinit2mqκ4r2)2exp(cδinit2mqκ4μ2r2)1-2\exp\left(n(\log 17)-c\frac{\delta_{\mathrm{init}}^{2}mq}{\kappa^{4}r^{2}}\right)-2\exp\left(-c\frac{\delta_{\mathrm{init}}^{2}mq}{\kappa^{4}\mu^{2}r^{2}}\right),

SD2(𝑼0,𝑼)δinit and so SD(𝑼0,𝑼)rδinit.\mathrm{SD}_{2}({\bm{U}}_{0},{\bm{U}}^{*}{})\leq\delta_{\mathrm{init}}\text{ and so }\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})\leq\sqrt{r}\delta_{\mathrm{init}}.

For the iterations, without loss of generality, as also done in past works on PR, e.g., [30, 20, 21, 6], to make things simpler, we assume that, for each kk, 𝒙k\bm{x}^{*}_{k} is replaced by z¯𝒙k\bar{z}\bm{x}^{*}_{k} where z=phase(𝒙k,𝒙k)z=\mathrm{phase}(\langle\bm{x}^{*}_{k},\bm{x}_{k}\rangle). With this, dist(𝒙k,𝒙k)=𝒙k𝒙k\mathrm{dist}(\bm{x}^{*}_{k},\bm{x}_{k})=\|\bm{x}^{*}_{k}-\bm{x}_{k}\|.

We modify Lemma 3.4 using the following idea. Let 𝑼=𝑼t{\bm{U}}={\bm{U}}_{t} and 𝑩=𝑩t\bm{B}=\bm{B}_{t}. For LRPR, the GD step uses an approximate gradient w.r.t. the old cost function f(𝑼,𝑩)f({\bm{U}},\bm{B}). Let

Err:=GradU^GradU.\displaystyle\mathrm{Err}:=\widehat{\mathrm{GradU}}-\mathrm{GradU}.

Here \widehat{\mathrm{GradU}}=\sum_{ki}(\hat{\bm{y}}_{ki}-\bm{a}_{ki}{}^{\top}\bm{x}_{k})\bm{a}_{ki}\bm{b}_{k}{}^{\top} and \mathrm{GradU}=\nabla_{\bm{U}}f({\bm{U}},\bm{B})=\sum_{ki}(\bm{y}_{ki}-\bm{a}_{ki}{}^{\top}\bm{x}_{k})\bm{a}_{ki}\bm{b}_{k}{}^{\top} is the same as earlier. Thus,

Err\displaystyle\mathrm{Err} =ki(𝒚^ki𝒚ki)𝒂ki𝒃k\displaystyle=\sum_{ki}(\hat{\bm{y}}_{ki}-\bm{y}_{ki})\bm{a}_{ki}\bm{b}_{k}{}^{\top}
=ki(𝒄^ki𝒄ki)|𝒂ki𝒙k|𝒂ki𝒃k\displaystyle=\sum_{ki}(\hat{\bm{c}}_{ki}-\bm{c}_{ki})|\bm{a}_{ki}^{\top}\bm{x}^{*}_{k}|\bm{a}_{ki}\bm{b}_{k}{}^{\top}
=ki(𝒄^ki𝒄¯ki1)(𝒂ki𝒙k)𝒂ki𝒃k\displaystyle=\sum_{ki}(\hat{\bm{c}}_{ki}\bar{\bm{c}}_{ki}-1)(\bm{a}_{ki}^{\top}\bm{x}^{*}_{k})\bm{a}_{ki}\bm{b}_{k}{}^{\top}

Proceeding as in the proof of Lemma 3.4, and using (𝑰𝑼𝑼)ErrFErrF\|(\bm{I}-{\bm{U}}^{*}{}{\bm{U}}^{*}{}^{\top})\mathrm{Err}\|_{F}\leq\|\mathrm{Err}\|_{F} and ErrErrF\|\mathrm{Err}\|\leq\|\mathrm{Err}\|_{F}, we can conclude the following

SD(𝑼,𝑼+)\displaystyle\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}^{+})\leq
\displaystyle\frac{\|(\bm{I}-(\eta/m)\mathrm{Hess})\|\cdot\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}})+(\eta/m)\|\mathrm{Term2}\|_{F}+(\eta/m)\|\mathrm{Err}\|_{F}}{1-(\eta/m)\|\mathrm{GradU}\|-(\eta/m)\|\mathrm{Err}\|_{F}}

where the expressions for \mathrm{GradU},\mathrm{Term2},\mathrm{Hess} are the same as before with one change: \bm{b}_{k} is now obtained by solving a noisy r-dimensional PR problem (instead of an LS problem) using RWF [21]. Thus, to complete the proof, (i) we need to bound

ErrF=max𝑾𝒮nrki(𝒄^ki𝒄¯ki1)(𝒂ki𝒙k)(𝒂ki𝑾𝒃k)\|\mathrm{Err}\|_{F}=\max_{{\bm{W}}\in\mathcal{S}_{nr}}\sum_{ki}(\hat{\bm{c}}_{ki}\bar{\bm{c}}_{ki}-1)(\bm{a}_{ki}^{\top}\bm{x}^{*}_{k})(\bm{a}_{ki}^{\top}{\bm{W}}\bm{b}_{k})

and (ii) we need bounds on the three other terms that were also bounded earlier for the linear case.

The term \|\mathrm{Err}\|_{F} is bounded in Lemma 4 of [6]. We repeat the lemma below.

Lemma 5.3.

Assume that SD(𝐔t,𝐔)δt\mathrm{SD}({\bm{U}}_{t},{\bm{U}}^{*}{})\leq\delta_{t} with δt<c/κ2\delta_{t}<c/\kappa^{2}. Then, w.p. at least 12exp(nrlog(17)cmqϵ22μ2κr)exp(logq+rcm)1-2\exp\left(nr\log(17)-c\frac{mq\epsilon_{2}^{2}}{\mu^{2}\kappa r}\right)-\exp(\log q+r-cm),

ErrFCm(ϵ2+δt)δtσmax2\|\mathrm{Err}\|_{F}\leq Cm(\epsilon_{2}+\sqrt{\delta_{t}})\delta_{t}{\sigma_{\max}^{*}}^{2}

Consider the other three terms, \mathrm{GradU},\mathrm{Term2},\mathrm{Hess}. These were bounded in Lemma 3.5 for the linear case. The statement and proof of Lemma 3.5 remain the same as earlier because its proof only uses the bounds on \bm{b}_{k}, \bm{x}_{k} from Lemma 3.3. The statement of Lemma 3.3 also remains the same, with one change: we replace \|\bm{x}^{*}-\bm{x}\| by \mathrm{dist}(\bm{x}^{*},\bm{x}) and \|{\bm{X}}^{*}-{\bm{X}}\|_{F}^{2} by \sum_{k=1}^{q}\mathrm{dist}^{2}(\bm{x}^{*}_{k},\bm{x}_{k}), and similarly for \bm{b}^{*}_{k},\bm{g}_{k}. The first part of Lemma 3.3 now follows from the first part of [6, Lemma 3.3]. All the subparts of the second part of Lemma 3.3 follow exactly as given in its proof in Sec. IV-D.

VI Limitations of our results

Our results have three limitations: (i) the analyzed algorithm needs sample-splitting, even though, in numerical experiments, this is not needed; (ii) our bound holds w.h.p. for a single matrix {\bm{X}}^{*} satisfying Assumption 1.1 (and not for all such matrices); and (iii) to obtain exactly zero error, we need an infinite number of samples. We explain here the reasons why we are unable to address these issues. We should mention that, since all computers are finite precision, (iii) is entirely a theoretical curiosity. Also, many other results in the LR recovery literature, e.g., [2, 14, 15], have all these limitations.

VI-A Need for sample-splitting

In Algorithm 1, sample-splitting (line 3) helps ensure that the measurement matrices used in each iteration for updating each of {\bm{U}} and \bm{B} are independent of all previous iterates: we split our sample set into 2T+1 subsets; we use one subset for the initialization of {\bm{U}} and one subset each for the T iterations of updating \bm{B} and of updating {\bm{U}}. This helps prove the desired error decay bound by applying the sub-exponential Bernstein inequality [26], which requires the summands to be mutually independent. This becomes true in our case because, conditioned on past measurement matrices, the current set of \bm{a}_{ki}'s is independent of the last updated values of {\bm{U}},\bm{B}; and the \bm{a}_{ki}'s for different (i,k) are mutually independent by definition. Thus, under the conditioning, the summands are mutually independent. Since we prove convergence in order \log(1/\epsilon) iterations, this only adds a multiplicative factor of \log(1/\epsilon) in the sample complexity. Sample-splitting and the above overall idea are a standard approach used in many older works; in fact, sample-splitting is assumed for most of the LRMC guarantees for solutions that do not solve a convex relaxation (i.e., for iterative algorithms) [2, 14, 15]. An exception is [16].
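
To make the splitting concrete, here is a minimal Python sketch (with hypothetical array shapes) of partitioning the measurements into the 2T+1 disjoint sets described above:

```python
import numpy as np

def split_samples(A_full, y_full, T):
    """A_full: (q, m_tot, n), y_full: (q, m_tot), with m_tot = (2T+1)*m.
    Returns 2T+1 disjoint (A, y) pairs: one for the initialization and one
    for each of the 2T update steps (T for B, T for U)."""
    q, m_tot, n = A_full.shape
    m = m_tot // (2 * T + 1)
    subsets = []
    for tau in range(2 * T + 1):
        idx = slice(tau * m, (tau + 1) * m)   # disjoint index ranges
        subsets.append((A_full[:, idx, :], y_full[:, idx]))
    return subsets
```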

There are a few commonly used approaches to avoid sample splitting. (1) One is using the leave-one-out strategy as done in [19]. But this means that the sample complexity dependence on rr worsens: the LRMC sample complexity with this approach is (n+q)r3(n+q)r^{3} times log factors. Also, it is not clear how to develop this approach for alternating 𝑼,𝑩{\bm{U}},\bm{B} updates. (2) The second is to try to prove error decay for all matrices that are close enough to the true 𝑿{\bm{X}}^{*} and that satisfy the other assumptions of the guarantee. There are at least two different approaches to doing this. (2a) The first, which was used in [16], works for LRMC since its measurements are bounded and symmetric: the authors are able to utilize i.i.d. Bernoulli sampling and left and right singular vectors’ incoherence to prove key probabilistic bounds for all matrices of the form 𝑼𝑽{\bm{U}}\bm{V} with 𝑼,𝑽{\bm{U}},\bm{V} both being incoherent. This does not work in our case because our measurements are asymmetric and unbounded (which means for example that 𝒚ki\bm{y}_{ki} times its estimate is heavier-tailed than 𝒚ki\bm{y}_{ki}).

(2b) An alternative approach is the following overall idea, which has been successfully used for analyzing standard PR algorithms, e.g., see [20, 21], but does not always work for other problems. In our setting, this means the following: At iteration t+1t+1, suppose that the previous estimate 𝑼t{\bm{U}}_{t} satisfies SD(𝑼t,𝑼)δt\mathrm{SD}({\bm{U}}_{t},{\bm{U}}^{*}{})\leq\delta_{t}. We need to try to show that, for all 𝑼{\bm{U}} that are a subspace distance δt\delta_{t} away from the true subspace, the next iterate (which is a function of 𝑼{\bm{U}} and of the current 𝑨k,𝒚k\bm{A}_{k},\bm{y}_{k} for all kk) is a distance cδtc\delta_{t} away with a c<1c<1. To be precise, for all 𝑼𝒯:={𝑼:𝑼𝑼=𝑰 and SD(𝑼,𝑼)δt}{\bm{U}}\in\mathcal{T}:=\{{\bm{U}}:{\bm{U}}^{\top}{\bm{U}}=\bm{I}\text{ and }\mathrm{SD}({\bm{U}},{\bm{U}}^{*}{})\leq\delta_{t}\}, we need 𝑼+(𝑼)=orth(𝑼ηUf(𝑼,𝑩)){\bm{U}}^{+}({\bm{U}})=orth({\bm{U}}-\eta\nabla_{U}f({\bm{U}},\bm{B})) to satisfy SD(𝑼+,𝑼)cδt\mathrm{SD}({\bm{U}}^{+},{\bm{U}}^{*}{})\leq c\delta_{t} for a c<1c<1. Here orth(𝑴)orth(\bm{M}) is a matrix with orthonormal columns spanning the same subspace as those of 𝑴\bm{M}. Also recall that the columns of 𝑩\bm{B} are 𝒃k:=(𝑨k𝑼)𝒚k\bm{b}_{k}:=(\bm{A}_{k}{\bm{U}})^{\dagger}\bm{y}_{k} for all k[q]k\in[q]. One can show this for all 𝑼𝒯{\bm{U}}\in\mathcal{T} by covering 𝒯\mathcal{T} by a net containing a finite number of points that are such that any point in 𝒯\mathcal{T} is with a subspace distance 0.25δt0.25\delta_{t} of some point in the net, and first proving that this bound holds for all 𝑼{\bm{U}} in the net. The first step for proving such a bound is to bound the error in the estimates 𝒃k\bm{b}_{k} for all 𝑼{\bm{U}} in this net. Because of the decoupled column-wise recovery of the 𝒃k\bm{b}_{k}’s, for one 𝑼{\bm{U}} in this net, the bound on 𝒃k(𝑼)𝑼𝒙k\|\bm{b}_{k}({\bm{U}})-{\bm{U}}^{\top}\bm{x}^{*}_{k}\| holds w.p. 1qexp(rcm)\geq 1-q\exp(r-cm). This is proved in Lemma 3.3. If we want this bound to hold for all 𝑼{\bm{U}}’s in the net covering 𝒯\mathcal{T}, we will need a union bound over all points in the net. The smallest sized net to cover 𝒯\mathcal{T} with accuracy ϵnet=0.25δt\epsilon_{net}=0.25\delta_{t} has size upper bounded by CnrC^{nr} [26]. With using this, the probability lower bound becomes 1exp(nr+logq+rcm)1-\exp(nr+\log q+r-cm). For this to even just be non-negative, we need m>Cnrm>Cnr which is too large and makes our guarantee useless.

VI-B Why we cannot prove our result for all XX^{*}

The inability to obtain a useful union bound over a net of size C^{nr}, explained above, is also the reason why we cannot prove our result for all {\bm{X}}^{*} satisfying Assumption 1.1.

VI-C Why sample complexity depends on the desired final accuracy ϵ\epsilon

Observe from our result that the number of samples required to achieve a certain accuracy ϵ\epsilon grows as log(1/ϵ)\log(1/\epsilon). This means that, for the algorithm to achieve zero error, we need an infinite number of samples. We should mention that this problem is not unique to our result. It is often seen for results that use sample-splitting, e.g., [2, 15]. An exception is [14] for LRMC, where the following basic idea is used: one tries to show that after enough iterations, e.g., when the recovery error is ϵ0=1/n\epsilon_{0}=1/n or smaller, one can start reusing the same samples and still prove error decay. This is also the idea used in [19]. Briefly, the reason we are unable to circumvent this problem using a similar idea to that of [14] is that our algorithm is not a regular GD or projected GD method.

To use a similar idea in our setting, we would need to proceed as follows. We use independent samples until the error is below an ϵ0\epsilon_{0} that is small enough. Pick ϵ0=1/(κ2n2)\epsilon_{0}=1/(\kappa^{2}n^{2}). This happens after T(ϵ0)=Cκ2log(n)log(κ)T(\epsilon_{0})=C\kappa^{2}\log(n)\log(\kappa) iterations. Consider t=T+1t=T+1. At this time, δt=ϵ0=1/(κ2n2)\delta_{t}=\epsilon_{0}=1/(\kappa^{2}n^{2}). Thus, by Lemma 3.3, 𝒃k𝑼𝒙k(1/(κ2n2))𝒙k\|\bm{b}_{k}-{\bm{U}}^{\top}\bm{x}^{*}_{k}\|\lesssim(1/(\kappa^{2}n^{2}))\|\bm{x}^{*}_{k}\| and all the other bounds also hold with δt\delta_{t} replaced by ϵ0\epsilon_{0}. We try to show error decay by applying Lemma 3.4. For this to work, we need to be able to show all of the following without using independence between 𝑼,𝑩{\bm{U}},\bm{B} and the 𝑨k\bm{A}_{k}s: (i) upper and lower bound the eigenvalues of Hess=ki(𝒂ki𝒃k)(.)\mathrm{Hess}=\sum_{ki}(\bm{a}_{ki}\otimes\bm{b}_{k})(.)^{\top} as those proved earlier, (ii) bound 𝑼f(𝑼,𝑩)/m\|\nabla_{\bm{U}}f({\bm{U}},\bm{B})\|/m by c0σmin2c_{0}{\sigma_{\min}^{*}}^{2} for a small constant c0<1c_{0}<1 (in fact even in our main proof, such a bound is sufficient since this term only appears in the denominator), and (iii) bound Term2F/m\|\mathrm{Term2}\|_{F}/m by (c2/κ2)δtσmax2(c_{2}/\kappa^{2})\delta_{t}{\sigma_{\max}^{*}}^{2} with a c2c_{2} sufficiently less than one.

As we explain next, (i) and (ii) can be obtained easily, but (iii) cannot. We can obtain (i) by showing that Hess\mathrm{Hess} is close to Hess=ki(𝒂ki(𝑼𝑼𝒃k))(.)\mathrm{Hess}^{*}=\sum_{ki}(\bm{a}_{ki}\otimes({\bm{U}}^{\top}{\bm{U}}^{*}{}\bm{b}^{*}_{k}))(.)^{\top}; and Hess\mathrm{Hess}^{*} can be bounded almost exactly as done in our proof earlier since 𝑨k\bm{A}_{k}s are independent of 𝒙k\bm{x}^{*}_{k}s. The 𝑼{\bm{U}} in the expression for Hess\mathrm{Hess}^{*} does not matter because 𝑼𝑼{\bm{U}}^{\top}{\bm{U}}^{*}{} is an r×rr\times r rotation matrix and one can take a maximum over all rotation matrices. Using the loose bounds 𝒂ki5n\|\bm{a}_{ki}\|\leq 5\sqrt{n} w.h.p., one can show that HessHessmqmaxki[max𝑾𝒮nr|𝒂ki𝑾𝒈k|max𝑾𝒮nr|𝒂ki𝑾(𝒈k𝒃k)|]mqnμr/qσmaxnϵ0μr/qσmaxmμ2(r/n)σmin2\|\mathrm{Hess}^{*}-\mathrm{Hess}\|\leq mq\max_{ki}[\max_{{\bm{W}}\in\mathcal{S}_{nr}}|\bm{a}_{ki}^{\top}{\bm{W}}\bm{g}_{k}|\cdot\max_{{\bm{W}}\in\mathcal{S}_{nr}}|\bm{a}_{ki}^{\top}{\bm{W}}(\bm{g}_{k}-\bm{b}_{k})|]\lesssim mq\sqrt{n}\mu\sqrt{r/q}{\sigma_{\max}^{*}}\cdot\sqrt{n}\epsilon_{0}\mu\sqrt{r/q}{\sigma_{\max}^{*}}\leq m\mu^{2}(r/n){\sigma_{\min}^{*}}^{2}. Similarly, for (ii), ki𝒂ki𝒂ki(𝒙k𝒙k)𝒃kmqnnϵ0(μ2r/q)σmax2=m(μ2r/n)σmin2\sum_{ki}\|\bm{a}_{ki}\bm{a}_{ki}^{\top}(\bm{x}^{*}_{k}-\bm{x}_{k})\bm{b}_{k}^{\top}\|\lesssim mq\cdot\sqrt{n}\cdot\sqrt{n}\cdot\epsilon_{0}\cdot(\mu^{2}r/q){\sigma_{\max}^{*}}^{2}=m(\mu^{2}r/n){\sigma_{\min}^{*}}^{2}. Using (μ2r/n)1(\mu^{2}r/n)\ll 1, claims (i) and (ii) follow. However, proving (iii) seems to be impossible without using the fact that 𝔼[Term2]=0\mathbb{E}[\mathrm{Term2}]=0. But this expected value is zero only when 𝑨k\bm{A}_{k}s are independent of 𝑼,𝑩{\bm{U}},\bm{B}.

Possible ways to prove (iii). For bounding Term2\mathrm{Term2} for times t>T(ϵ0)t>T(\epsilon_{0}), we can try one of the following ideas. (1) Try to use Cauchy-Schwarz in a way that the projection orthogonal to 𝑼{\bm{U}}^{*}{} is used. There does not seem to be a way to make this work. (2) Try to use the leave-one-out strategy of [19] only for t>T(ϵ0)t>T(\epsilon_{0}).

Figure 1: Comparing the proposed algorithm with existing approaches for solving LRcCS. (a) m=80, n=q=600, r=4. (b) m=50, n=q=600, r=4. (c) m=30, n=q=600, r=4. (d) m=90, n=100, q=120, r=2.
Figure 2: Comparing the proposed algorithm with the existing approach for solving LRPR. We used n=600, q=1000, r=4 and m=250.

VII Numerical Experiments

Our first experiment compares AltGD-Min with the mixed norm minimization solution from [7] (mixed-norm-min) and with the AltMin algorithm [4, 5, 6] modified for the linear LRcCS problem (the PR step for updating the \bm{b}_{k}'s is replaced by a simple LS step). We implement this using two possible initializations: the initialization developed in [4, 5, 6] for LRPR (AltMinLin-LRPRinit), and the initialization approach developed in this work (AltMinLin-LRCSinit). For mixed-norm-min, we used the code downloaded from https://www.dropbox.com/sh/lywtzc0y9awpvgz/AABbjuiuLWPy_8y7C3GQKo8pa?dl=0, which is provided by the authors. For AltMin, we used the code from https://github.com/praneethmurthy/. We implemented AltGD-Min with \eta=0.4/\|{\bm{X}}_{0}\|^{2} and \tilde{C}=9. Also, we used one set of measurements for all its iterations.

For chosen values of n,q,rn,q,r and mm, we simulated the data as follows. We simulated 𝑼{\bm{U}}^{*}{} by orthogonalizing an n×rn\times r standard Gaussian matrix; and 𝒃k\bm{b}^{*}_{k}s were generated i.i.d. from 𝒩(0,𝑰r){\cal{N}}(0,\bm{I}_{r}). These were generated once. For each of 100 Monte Carlo runs, the measurement matrices 𝑨k\bm{A}_{k} contained i.i.d. standard Gaussian entries. We obtained 𝒚k=𝑨k𝑼𝒃k\bm{y}_{k}=\bm{A}_{k}{\bm{U}}^{*}{}\bm{b}^{*}_{k}, k[q]k\in[q]. For the LRPR experiment, we used 𝒚(mag)k=|𝒚k|{\bm{y}_{(mag)}}_{k}=|\bm{y}_{k}| as the measurements. We plot the empirical average of 𝑿𝑿F/𝑿F\|{\bm{X}}-{\bm{X}}^{*}\|_{F}/\|{\bm{X}}^{*}\|_{F} at each iteration tt on the y-axis (labeled “Error-X” in the plots) and the time taken by the algorithm until iteration tt on the x-axis.
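
For concreteness, a minimal Python sketch of this data-generation step (the Monte-Carlo averaging and the error/time logging used for the plots are omitted, and the seed is arbitrary):

```python
import numpy as np

def generate_lrccs_data(n, q, r, m, seed=0):
    rng = np.random.default_rng(seed)
    # U*: orthonormalize an n x r standard Gaussian matrix
    U_star, _ = np.linalg.qr(rng.standard_normal((n, r)))
    B_star = rng.standard_normal((r, q))            # b*_k ~ N(0, I_r)
    X_star = U_star @ B_star                        # n x q, rank r
    A = rng.standard_normal((q, m, n))              # i.i.d. Gaussian A_k
    Y = np.einsum('kij,jk->ki', A, X_star)          # y_k = A_k x*_k, stored as rows
    return U_star, B_star, X_star, A, Y, np.abs(Y)  # |y_k| for the LRPR experiment
```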

For our first experiment, shown in Fig. 1(a), we used n=600, q=600, r=4 and m=80. In this case, the mixed-norm-min error decays to about 2-5% but does not reduce any further. But, for our algorithm, AltGD-Min, and for both versions of AltMin, the error decays to 10^{-15}. Notice also that AltGD-Min is much faster than all the other approaches. For Fig. 1(b), we reduced m to m=50. Here a similar trend is observed, except that the error decays to only around 10^{-13} for AltGD-Min and 10^{-11} for the two AltMin approaches. Finally, for Fig. 1(c), we reduced m to m=30. In this case, only AltGD-Min and AltMin-LRCSinit work, while the mixed-norm-min and AltMin-LRPRinit errors do not decrease at all. The reason is that both of these need a higher sample complexity (see Table I). Lastly, we also tried an experiment with very large m: n=100, q=120, r=2 and m=0.9n=90, see Fig. 1(d). Even for such a large value of m (compared to n), observe that the mixed-norm-min error saturates at around 1-2%. The likely reason for this is that, in the guarantee for mixed-norm-min [7] (summarized for the noiseless case in Proposition 2.3 given earlier), even for m=n, the error is bounded by a multiplier (more than 1) times \sqrt{r/q}.

For the comparisons for the LRPR problem shown in Fig. 2, we need a much larger qq and mm since LRPR requires mqmq to scale as nr3nr^{3} both for initialization and for the GDmin iterations and the multiplying constants are also much larger for LRPR. We used n=600,q=1000,r=4n=600,q=1000,r=4 and m=250m=250. Notice that altGD-Min-LRPR is faster than AltMin-LRPR. We implemented altGD-Min-LRPR with η=0.9/𝑿02\eta=0.9/\|{\bm{X}}_{0}\|^{2}, C~=9\tilde{C}=9, and TRWF,t=max(5+t,40)T_{RWF,t}=\max(5+t,40) in the RWF code (code for [21], downloaded from the specified site). Also, here again, we used one set of measurements for all its iterations.

VIII Conclusions

This work developed a sample-efficient and fast gradient descent (GD) solution, called AltGD-Min, for provably recovering a low-rank (LR) matrix from mutually independent column-wise linear projections. This problem, which we refer to as “Low Rank column-wise Compressive Sensing (LRcCS)”, frequently occurs in accelerated LR dynamic MRI and in federated sketching. If used in a federated setting, AltGD-Min is also communication-efficient. Unlike the other well-studied LR recovery problems (matrix completion, sensing, and multivariate regression), the LRcCS problem has received little attention in the theoretical literature.

Appendix A Understanding why LRMC-style GD approaches cannot be easily analyzed for LRcCS

TABLE II: Understanding why LRMC style projected-GD on 𝑿{\bm{X}} does not work in our case.
LRMC Our Problem, LRcCS
f~(𝑿)\tilde{f}({\bm{X}}) k=1qj=1n(𝒚jkδjk𝑿jk)2\displaystyle\sum_{k=1}^{q}\sum_{j=1}^{n}(\bm{y}_{jk}-\delta_{jk}{\bm{X}}_{jk})^{2} k=1qi=1m(𝒚ki𝒂ki𝒙k)2\displaystyle\sum_{k=1}^{q}\sum_{i=1}^{m}(\bm{y}_{ki}-\bm{a}_{ki}^{\top}\bm{x}_{k})^{2}
δjkiidBernoulli(p)\delta_{jk}\stackrel{{\scriptstyle\mathrm{iid}}}{{\thicksim}}Bernoulli(p) 𝒂kiiid𝒩(0,𝑰n)\bm{a}_{ki}\stackrel{{\scriptstyle\mathrm{iid}}}{{\thicksim}}{\cal{N}}(0,\bm{I}_{n})
Xf~(𝑿)\nabla_{X}\tilde{f}({\bm{X}}) k=1qj=1nδjk(𝒚jkδjk𝑿jk)𝒆j𝒆k\displaystyle\sum_{k=1}^{q}\sum_{j=1}^{n}\delta_{jk}(\bm{y}_{jk}-\delta_{jk}{\bm{X}}_{jk})\bm{e}_{j}\bm{e}_{k}^{\top} k=1qi=1m(𝒚ki𝒂ki𝒙k)𝒂ki𝒆k\displaystyle\sum_{k=1}^{q}\sum_{i=1}^{m}(\bm{y}_{ki}-\bm{a}_{ki}^{\top}\bm{x}_{k})\bm{a}_{ki}\bm{e}_{k}^{\top}
=k=1qj=1nδjk(𝑿jk𝑿jk)𝒆j𝒆k\displaystyle=\sum_{k=1}^{q}\sum_{j=1}^{n}\delta_{jk}({\bm{X}}^{*}_{jk}-{\bm{X}}_{jk})\bm{e}_{j}\bm{e}_{k}^{\top} =k=1qi=1m𝒂ki(𝒙k𝒙k)𝒂ki𝒆k\displaystyle=\sum_{k=1}^{q}\sum_{i=1}^{m}\bm{a}_{ki}^{\top}(\bm{x}^{*}_{k}-\bm{x}_{k})\bm{a}_{ki}\bm{e}_{k}^{\top}
𝑯~:=𝑯ηf(𝑿)\tilde{\bm{H}}:=\bm{H}-\eta\nabla f({\bm{X}}) k=1qj=1n(1δjkp)𝑯jk𝒆j𝒆k\displaystyle\sum_{k=1}^{q}\sum_{j=1}^{n}(1-\frac{\delta_{jk}}{p})\bm{H}_{jk}\bm{e}_{j}\bm{e}_{k}{}^{\top} 1mk=1qi=1m(𝑰𝒂ki𝒂ki)𝒉k𝒆k\displaystyle\frac{1}{m}\sum_{k=1}^{q}\sum_{i=1}^{m}(\bm{I}-\bm{a}_{ki}\bm{a}_{ki}{}^{\top})\bm{h}_{k}\bm{e}_{k}{}^{\top}

A-A Gradient Descent

The iterates of a gradient descent (GD) algorithm converge when the gradient approaches zero. Thus, in order to show its convergence, one needs to be able to bound the norm of the gradient and show that it goes to zero with iterations. In order to show fast enough convergence (reach ϵ\epsilon error in order log(1/ϵ)\log(1/\epsilon) iterations), one further needs to show that this bound on the gradient norm decreases sufficiently with each iteration. Consider projGD-X which was studied in [15] for solving LRMC. ProjGD-X iterations involve computing 𝑿+𝒫r(𝑿𝑿f~(𝑿)){\bm{X}}^{+}\leftarrow\mathcal{P}_{r}({\bm{X}}-\nabla_{\bm{X}}\tilde{f}({\bm{X}})), here 𝒫r(𝑴)\mathcal{P}_{r}(\bm{M}) projects its argument onto the space of rank-rr matrices. To bound 𝑿f~(𝑿)\|\nabla_{\bm{X}}\tilde{f}({\bm{X}})\|, we need to bound |𝒘𝑿f~(𝑿)𝒛||\bm{w}^{\top}\nabla_{\bm{X}}\tilde{f}({\bm{X}})\bm{z}| for any unit norm vectors 𝒘,𝒛\bm{w},\bm{z}. We show the cost function f~(𝑿)\tilde{f}({\bm{X}}) and its gradient for both LRMC and LRcCS in Table II. Observe that, for LRcCS, 𝒘𝑿f~(𝑿)𝒛\bm{w}^{\top}\nabla_{\bm{X}}\tilde{f}({\bm{X}})\bm{z} is a sum of sub-exponential r.v.s with sub-exponential norms bounded by Ke=maxk𝒘𝒙k𝒙k|𝒛k|maxk𝒙k𝒙kK_{e}=\max_{k}\|\bm{w}\|\cdot\|\bm{x}^{*}_{k}-\bm{x}_{k}\|\cdot|\bm{z}_{k}|\leq\max_{k}\|\bm{x}^{*}_{k}-\bm{x}_{k}\|. Thus, in order to get a small enough bound on |𝒘𝑿f~(𝑿)𝒛||\bm{w}^{\top}\nabla_{\bm{X}}\tilde{f}({\bm{X}})\bm{z}| by applying the sub-exponential Bernstein inequality [26], we need a small enough bound on maxk𝒙k𝒙k\max_{k}\|\bm{x}^{*}_{k}-\bm{x}_{k}\| (column-wise error bound). It is not clear how to get this because the projection step introduces coupling between the different columns of the estimated matrix 𝑿{\bm{X}} 222 Let 𝑯:=𝑿𝑿\bm{H}:={\bm{X}}-{\bm{X}}^{*}, 𝑯~:=(𝑿ηf(𝑿))𝑿=𝑯ηf(𝑿)\tilde{\bm{H}}:=({\bm{X}}-\eta\nabla f({\bm{X}}))-{\bm{X}}^{*}=\bm{H}-\eta\nabla f({\bm{X}}), and 𝑯+=𝑿+𝑿=𝒫r(𝑿f(𝑿))𝑿=𝒫r(𝑿+𝑯~)𝑿\bm{H}^{+}={\bm{X}}^{+}-{\bm{X}}^{*}=\mathcal{P}_{r}({\bm{X}}-\nabla f({\bm{X}}))-{\bm{X}}^{*}=\mathcal{P}_{r}({\bm{X}}^{*}+\tilde{\bm{H}})-{\bm{X}}^{*}. To bound the LRMC projGD-X errors, one needs an entry-wise bound of the form 𝑯+maxδt𝑿max\|\bm{H}^{+}\|_{\max}\leq\delta_{t}\|{\bm{X}}^{*}\|_{\max} with δt\delta_{t} decaying exponentially. We show the expressions for 𝑯~\tilde{\bm{H}} in the table. For LRMC, notice that different summands of 𝑯~\tilde{\bm{H}} are mutually independent and each depends on only one entry of 𝑯\bm{H}. This fact is carefully exploited in [15, Lemma 1] and [14, Lemma 1]. By borrowing ideas from the literature on spectral statistics of Erdos-Renyi graphs [32], the authors are able to obtain expressions for higher powers of (𝑯~𝑯~)(\tilde{\bm{H}}\tilde{\bm{H}}^{\top}). These expressions help them get the desired bound under the desired sample complexity. For LRcCS, using the gradient expression, we need a bound on maxk𝒉k+\max_{k}\|\bm{h}^{+}_{k}\| in terms of 𝒉k\|\bm{h}_{k}\| in order to show its exponential decay. Since the different entries of 𝑯~\tilde{\bm{H}} are not mutually independent and not bounded, the LRMC proof approach cannot be borrowed. . Moreover, even if we could somehow get such a bound, in the best case, it would be proportional to δtmaxk𝒙k\delta_{t}\max_{k}\|\bm{x}^{*}_{k}\| with δt<1\delta_{t}<1 and decaying exponentially with tt. Using Assumption 1.1, this would then imply that Keδtmaxk𝒙kδtμr/qσmaxK_{e}\leq\delta_{t}\max_{k}\|\bm{x}^{*}_{k}\|\leq\delta_{t}\mu\sqrt{r/q}{\sigma_{\max}^{*}}. But, this is not small enough. 
We need it to be proportional to δt(r/q)\delta_{t}(r/q) in order to be able to bound the gradient norm under the desired sample complexity.

Consider altGDnormbal studied in [17, 16] for LRMC. In this case again, the desired column-wise error bound cannot be obtained because the update step for 𝑩\bm{B} involves GD w.r.t. f(𝑼,𝑩)+f2(𝑼,𝑩)f({\bm{U}},\bm{B})+f_{2}({\bm{U}},\bm{B}). The gradient w.r.t f2f_{2} (norm-balancing term) introduces coupling between the different columns of 𝑩\bm{B}, and hence, also between columns of 𝑿=𝑼𝑩{\bm{X}}={\bm{U}}\bm{B}. Thus, once again, it is not clear how to get a tight bound on maxk𝒙k𝒙k\max_{k}\|\bm{x}^{*}_{k}-\bm{x}_{k}\|.

For AltGD-Min, because the min step for updating 𝑩\bm{B} is a decoupled LS problem, it is possible to get the desired column-wise error bound. Secondly, because we use GD w.r.t 𝑼{\bm{U}}, there is an extra 𝒃k\bm{b}_{k}^{\top} term in the gradient summands. This makes the gradient (and its deviation from its expected value), a sum of nice-enough sub-exponential r.v.s as explained in Sec. III-B.

TABLE III: Why the LRMC initialization approach cannot be directly borrowed?
LRMC Our Problem, LRcCS
𝑿0,full={\bm{X}}_{0,full}= kjδjkp𝒚jk𝒆j𝒆k\displaystyle\sum_{k}\sum_{j}\frac{\delta_{jk}}{p}\bm{y}_{jk}\bm{e}_{j}\bm{e}_{k}{}^{\top} 1mki𝒂ki𝒚ki𝒆k\displaystyle\frac{1}{m}\sum_{k}\sum_{i}\bm{a}_{ki}\bm{y}_{ki}\bm{e}_{k}{}^{\top}
δjkiidBernoulli(p)\delta_{jk}\stackrel{{\scriptstyle\mathrm{iid}}}{{\thicksim}}Bernoulli(p) 𝒂kiiid𝒩(0,𝑰n)\bm{a}_{ki}\stackrel{{\scriptstyle\mathrm{iid}}}{{\thicksim}}{\cal{N}}(0,\bm{I}_{n})
𝑯0=𝑿0,full𝑿\bm{H}_{0}={\bm{X}}_{0,full}-{\bm{X}}^{*} k=1qj=1n(1δjkp)𝑿jk𝒆j𝒆k\displaystyle\sum_{k=1}^{q}\sum_{j=1}^{n}(1-\frac{\delta_{jk}}{p}){\bm{X}}^{*}_{jk}\bm{e}_{j}\bm{e}_{k}{}^{\top} 1mk=1qi=1m(𝑰𝒂ki𝒂ki)𝒙k𝒆k\displaystyle\frac{1}{m}\sum_{k=1}^{q}\sum_{i=1}^{m}(\bm{I}-\bm{a}_{ki}\bm{a}_{ki}{}^{\top})\bm{x}^{*}_{k}\bm{e}_{k}{}^{\top}
Summand bound: LRMC: each summand is nicely bounded by \mu^{2}{\sigma_{\max}^{*}}(r/\sqrt{nq}). LRcCS: the summands are unbounded; their max sub-expo. norm** is \mu{\sigma_{\max}^{*}}\sqrt{r/q}, which is too large (we need r/q).
Concen. ineq.: LRMC: matrix Bernstein [33], which gives the desired sample comp. LRcCS: sub-expo. Bernstein [26], which does not give the desired sample comp.

**: “max sub-expo. norm”: the maximum sub-exponential norm of (\bm{a}_{ki}{}^{\top}\bm{w})(\bm{a}_{ki}{}^{\top}\bm{x}^{*}_{k})(\bm{e}_{k}^{\top}\bm{z}) over any unit vectors \bm{w},\bm{z}.

A-B Initialization

The standard approach used for initializing iterative algorithms for LRMC (as well as other linear LRR problems) is to compute the top rr left singular vectors of the matrix 𝑿0,full{\bm{X}}_{0,full} that satisfies (𝑿0,full)vec=𝒜(𝒚all)({\bm{X}}_{0,full})_{vec}=\mathcal{A}^{\top}(\bm{y}_{all}), where 𝒚all\bm{y}_{all} is the mqmq-length vector of all measurements and 𝒜\mathcal{A} denotes the linear mapping from (𝑿)vec({\bm{X}}^{*})_{vec} to 𝒚all\bm{y}_{all}. In case of LRMC and LRcCS, this is computed is as given in Table III. It is not hard to see that, in both cases, 𝔼[𝑿0,full]=𝑿\mathbb{E}[{\bm{X}}_{0,full}]={\bm{X}}^{*}. To show that this approach works, one typically uses a sinΘ\sin\Theta theorem, e.g., Davis-Kahan or Wedin, to bound SD(𝑼,𝑼0)\mathrm{SD}({\bm{U}}^{*}{},{\bm{U}}_{0}) as a function of terms that depend on 𝑯0:=𝑿0,full𝑿\bm{H}_{0}:={\bm{X}}_{0,full}-{\bm{X}}^{*}. Thus a first requirement is to bound 𝑯0\|\bm{H}_{0}\|. For LRMC, this can be done easily since 𝑯0\bm{H}_{0} is a sum of the independent one-sparse random matrices shown in the table with each matrix containing an i.i.d. Bernoulli r.v. times 𝑿jk{\bm{X}}^{*}_{jk} (jkjk-th entry of 𝑿{\bm{X}}^{*}) as its nonzero entry. Using the left and right singular vectors’ incoherence (assumed in all LRMC guarantees), and 𝑿jk=𝒆j𝑿𝒆k{\bm{X}}^{*}_{jk}=\bm{e}_{j}^{\top}{\bm{X}}^{*}\bm{e}_{k}, one can argue that, for unit vectors 𝒘,𝒛\bm{w},\bm{z}, each summand of |𝒘𝑯0𝒛||\bm{w}^{\top}\bm{H}_{0}\bm{z}| is of order at most (1/p)σmaxr/nq(1/p){\sigma_{\max}^{*}}r/\sqrt{nq}. This bound, along with a bound on the “variance parameter" needed for applying matrix Bernstein [33],[26, Chap 5] helps show that 𝑯0cσmax\|\bm{H}_{0}\|\leq c{\sigma_{\max}^{*}} w.h.p., under the desired sample complexity bound. For LRcCS, the summands of 𝑿0,full{\bm{X}}_{0,full}, and hence of 𝑯0\bm{H}_{0}, are sub-exponential r.v.s. These can be bounded using the sub-exponential Bernstein inequality [26, Chap 2]. This requires a bound on the maximum sub-exponential norm of any summand. Denote this bound by KeK_{e}. In order to show that 𝑯0cσmax\|\bm{H}_{0}\|\leq c{\sigma_{\max}^{*}} w.h.p, under the desired sample complexity, we need KeK_{e} to be of order (r/q)(r/q) or smaller. However, for our summands, we can only guarantee Ke(1/m)maxk𝒙k(1/m)μr/qσmaxK_{e}\leq(1/m)\max_{k}\|\bm{x}^{*}_{k}\|\leq(1/m)\mu\sqrt{r/q}{\sigma_{\max}^{*}}. This is not small enough, i.e., the summands are not nice-enough subexponentials. It will require mq(n+q)rqmq\gtrsim(n+q)r\cdot\sqrt{q} which is too large.

Appendix B Proof of Initialization Theorem 3.1 without sample-splitting

Consider the initialization using {\bm{X}}_{0} defined in (2). We want to bound the initialization error without sample-splitting. This means that the threshold \alpha is not independent of the \bm{a}_{ki},\bm{y}_{ki} used in the expression for {\bm{X}}_{0}, and thus it is not clear how to compute its expected value even if we condition on \alpha. However, the following slightly more complicated approach can be used. Using Fact 3.7 and Assumption 1.1, it is possible to show that {\bm{X}}_{0} is close to a matrix {\bm{X}}_{+}(\epsilon_{1}), defined next, for which \mathbb{E}[{\bm{X}}_{+}] is easily computed: Let

α+:=C~(1+ϵ1)𝑿F2q\alpha_{+}:=\tilde{C}(1+\epsilon_{1})\frac{\|{\bm{X}}^{*}\|_{F}^{2}}{q}

and define

𝑿+(ϵ1)\displaystyle{\bm{X}}_{+}(\epsilon_{1}) :=1mki𝒂ki𝒚ki𝒆k𝟙{𝒚ki2α+}. Then,\displaystyle:=\frac{1}{m}\sum_{ki}\bm{a}_{ki}\bm{y}_{ki}\bm{e}_{k}{}^{\top}\mathbbm{1}_{\small\{\bm{y}_{ki}^{2}\leq\alpha_{+}\}}.\text{ Then, }
𝔼[𝑿+]\displaystyle\mathbb{E}[{\bm{X}}_{+}] =𝑿𝑫(ϵ1),\displaystyle={\bm{X}}^{*}{\bm{D}}(\epsilon_{1}),
𝑫\displaystyle{\bm{D}} :=diagonal(βk(ϵ1)),\displaystyle:=diagonal(\beta_{k}(\epsilon_{1})),
βk(ϵ1)\displaystyle\beta_{k}(\epsilon_{1}) :=𝔼[ζ2𝟙{ζ2α+𝒙k2}]\displaystyle:=\mathbb{E}\left[\zeta^{2}\mathbbm{1}_{\small\left\{\zeta^{2}\leq\frac{\alpha_{+}}{\|\bm{x}^{*}_{k}\|^{2}}\right\}}\right] (20)

with ζ\zeta being a scalar standard Gaussian. Thus 𝑿+{\bm{X}}_{+} is 𝑿0{\bm{X}}_{0} with the threshold α\alpha replaced by α+\alpha_{+} which is deterministic. Consequently 𝔼[𝑿+]\mathbb{E}[{\bm{X}}_{+}] has a similar form too and is obtained as explained in the proof of Lemma 3.6 given in Sec. IV-F.
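
For concreteness, a minimal Python sketch of forming the truncated matrix {\bm{X}}_{0} (with the data-dependent threshold \alpha=\tilde{C}\frac{1}{mq}\sum_{ki}\bm{y}_{ki}^{2}) and its top r left singular vectors; this is an illustration with assumed array shapes, not the authors' code:

```python
import numpy as np

def truncated_spectral_init(Y, A, r, C_tilde=9.0):
    """Y: (q, m) linear measurements y_ki; A: (q, m, n) Gaussian matrices.
    Returns U_0 (top-r left singular vectors of X_0) and X_0 itself."""
    q, m, n = A.shape
    alpha = C_tilde * np.mean(Y ** 2)           # C_tilde * (1/(mq)) * sum_ki y_ki^2
    X0 = np.zeros((n, q))
    for k in range(q):
        trunc = Y[k] * (Y[k] ** 2 <= alpha)     # zero out measurements above threshold
        X0[:, k] = A[k].T @ trunc / m           # (1/m) sum_i a_ki y_ki 1{y_ki^2 <= alpha}
    U0, _, _ = np.linalg.svd(X0, full_matrices=False)
    return U0[:, :r], X0
```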

Next, recall that {\bm{X}}^{*}\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}{\bm{\Sigma}^{*}}{\bm{V}^{*}} and \tilde{C}=9\kappa^{2}\mu^{2}. Let \tilde{c}=c/\tilde{C} for a c<1. Clearly, \mathbb{E}[{\bm{X}}_{+}]={\bm{X}}^{*}{\bm{D}} is a rank-r matrix and the span of its top r left singular vectors equals \mathrm{span}({\bm{U}}^{*}{}). Let

𝔼[𝑿+]=𝑿𝑫=SVD𝑼𝚺ˇ𝑽ˇ\mathbb{E}[{\bm{X}}_{+}]={\bm{X}}^{*}{\bm{D}}\overset{\mathrm{SVD}}{=}{\bm{U}}^{*}{}\check{\bm{\Sigma}^{*}}\check{\bm{V}}

be its rr-SVD (here 𝑽ˇ\check{\bm{V}} is an r×qr\times q matrix with its rows containing the rr right singular vectors). We thus have

σr(𝔼[𝑿+])\displaystyle\sigma_{r}(\mathbb{E}[{\bm{X}}_{+}]) =σmin(𝚺ˇ)=σmin(𝚺𝑽𝑫𝑽ˇ)\displaystyle=\sigma_{\min}(\check{\bm{\Sigma}^{*}})=\sigma_{\min}({\bm{\Sigma}^{*}}{\bm{V}^{*}}{\bm{D}}\check{\bm{V}}{}^{\top})
σmin(𝚺)σmin(𝑽)σmin(𝑫)σmin(𝑽ˇ)\displaystyle\geq\sigma_{\min}({\bm{\Sigma}^{*}})\sigma_{\min}({\bm{V}^{*}})\sigma_{\min}({\bm{D}})\sigma_{\min}(\check{\bm{V}}{}^{\top})
=σmin1(minkβk)1\displaystyle={\sigma_{\min}^{*}}\cdot 1\cdot(\min_{k}\beta_{k})\cdot 1

Fact 3.9 given earlier shows that (minkβk)0.9(\min_{k}\beta_{k})\geq 0.9 and thus,

σr(𝔼[𝑿+])0.9σmin\sigma_{r}(\mathbb{E}[{\bm{X}}_{+}])\geq 0.9{\sigma_{\min}^{*}}

Also, σr+1(𝔼[𝑿+])=0\sigma_{r+1}(\mathbb{E}[{\bm{X}}_{+}])=0 since it is a rank rr matrix. Thus, using Wedin’s sinΘ\sin\Theta theorem for SD\mathrm{SD} (summarized in Theorem 4.1) applied with 𝑴𝑿0\bm{M}\equiv{\bm{X}}_{0}, 𝑴𝔼[𝑿+]\bm{M}^{*}\equiv\mathbb{E}[{\bm{X}}_{+}] gives

SD(𝑼0,𝑼)\displaystyle\mathrm{SD}({\bm{U}}_{0},{\bm{U}}^{*}{})
2max((𝑿0𝔼[𝑿+])𝑼F,(𝑿0𝔼[𝑿+])𝑽ˇF)0.9σmin𝑿0𝔼[𝑿+]\displaystyle\leq\dfrac{\sqrt{2}\max\left(\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])^{\top}{\bm{U}}^{*}{}\|_{F},\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])\check{\bm{V}}{}^{\top}\|_{F}\right)}{0.9{\sigma_{\min}^{*}}-\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}]\|} (21)

In the next three subsections, we prove a set of six lemmas that help bound the three terms in the expression above. The main new ideas, over the proof given earlier in Sec. III-E, are in the proof of the first lemma, Lemma B.2, given below, and in the proof of Claim B.1 that is used there.
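Before stating these lemmas, the following minimal numerical sketch (Python/NumPy) may help fix ideas: it simulates the truncated spectral initialization and evaluates a subspace-distance proxy ‖(𝑰−𝑼0𝑼0⊤)𝑼*‖F of the kind that (21) controls. All dimensions and the value used for C̃ below are illustrative choices made only for this sketch, and the thresholding follows the form of 𝑿0 implied by the expression for 𝑿+ above; it is not a prescription from the analysis.

```python
import numpy as np

# Illustrative sizes (chosen only for this sketch): n x q rank-r matrix, m measurements per column.
n, q, r, m = 100, 200, 3, 50
rng = np.random.default_rng(0)

# Ground truth X* = U* B with orthonormal U*.
Ustar, _ = np.linalg.qr(rng.standard_normal((n, r)))
B = rng.standard_normal((r, q))
Xstar = Ustar @ B

# Column-wise Gaussian measurements: y_ki = a_ki^T x*_k.
A = rng.standard_normal((q, m, n))            # A[k] plays the role of A_k
Y = np.einsum('kij,jk->ik', A, Xstar)         # Y[i, k] = a_ki^T x*_k

# Data-dependent truncation threshold alpha = C_tilde * (1/(mq)) * sum_ki y_ki^2;
# C_tilde stands in for 9 kappa^2 mu^2 and is simply set to a constant here.
C_tilde = 9.0
alpha = C_tilde * np.mean(Y**2)

# X0 = (1/m) sum_i a_ki y_ki 1{y_ki^2 <= alpha}, built column by column.
X0 = np.zeros((n, q))
for k in range(q):
    keep = Y[:, k] ** 2 <= alpha
    X0[:, k] = A[k][keep].T @ Y[keep, k] / m

# U0 = top-r left singular vectors of X0; subspace-distance proxy ||(I - U0 U0^T) U*||_F.
U0 = np.linalg.svd(X0, full_matrices=False)[0][:, :r]
SD = np.linalg.norm(Ustar - U0 @ (U0.T @ Ustar))
print(f"SD(U0, U*) ~ {SD:.3f}")
```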

Claim B.1.

Let 𝐱n\bm{x}^{*}\in\Re^{n}, 𝐳n\bm{z}\in\Re^{n} be two deterministic vectors and let α\alpha be a deterministic scalar. Let 𝐚𝒩(0,𝐈n)\bm{a}\sim{\cal{N}}(0,\bm{I}_{n}) be a standard Gaussian vector and define 𝐲:=𝐚𝐱\bm{y}:=\bm{a}^{\top}\bm{x}^{*}. For any 0<ϵ<10<\epsilon<1,

𝔼[|𝒚(𝒂𝒛)|𝟙{𝒚2[1±ϵ]α}]Cϵ𝒛α.\mathbb{E}\left[|\bm{y}(\bm{a}{}^{\top}\bm{z})|\mathbbm{1}_{\{\bm{y}^{2}\in[1\pm\epsilon]\alpha\}}\right]\leq C\epsilon\|\bm{z}\|\sqrt{\alpha}.

Combining Lemmas B.3 and B.2 and using Fact 3.7, and setting ϵ1=cδ0/rκ\epsilon_{1}=c\delta_{0}/\sqrt{r}\kappa, we conclude that, w.p. at least
12exp((n+q)c~ϵ12mq)exp(c~mqϵ12)12exp((n+q)c~mqδ02/rκ2)exp(c~mqδ02/rκ2)1-2\exp((n+q)-\tilde{c}\epsilon_{1}^{2}mq)-\exp(-\tilde{c}mq\epsilon_{1}^{2})\geq 1-2\exp((n+q)-\tilde{c}mq\delta_{0}^{2}/r\kappa^{2})-\exp(-\tilde{c}mq\delta_{0}^{2}/r\kappa^{2}),

𝑿0𝔼[𝑿+]ϵ1𝑿Fcδ0σmin\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}]\|\lesssim\epsilon_{1}\|{\bm{X}}^{*}\|_{F}\lesssim c\delta_{0}{\sigma_{\min}^{*}}
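The last step above is just the substitution ϵ1 = cδ0/(√r κ) together with the standard fact ‖𝑿*‖F ≤ √r σmax* = √r κ σmin* (since 𝑿* has rank rr and κ = σmax*/σmin*); spelled out,

\epsilon_{1}\|{\bm{X}}^{*}\|_{F}\leq\frac{c\delta_{0}}{\sqrt{r}\kappa}\cdot\sqrt{r}\,{\sigma_{\max}^{*}}=\frac{c\delta_{0}}{\sqrt{r}\kappa}\cdot\sqrt{r}\,\kappa\,{\sigma_{\min}^{*}}=c\delta_{0}{\sigma_{\min}^{*}}.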

By combining Lemmas B.4, B.5, B.6, and B.7 and using Fact 3.7, and setting ϵ1=cδ0/rκ\epsilon_{1}=c\delta_{0}/\sqrt{r}\kappa, we conclude that, w.p. at least
12exp(nrc~mqδ02/rκ2)2exp(qrc~mqδ02/rκ2)exp(c~mqδ02/rκ2)1-2\exp(nr-\tilde{c}mq\delta_{0}^{2}/r\kappa^{2})-2\exp(qr-\tilde{c}mq\delta_{0}^{2}/r\kappa^{2})-\exp(-\tilde{c}mq\delta_{0}^{2}/r\kappa^{2}),

max((𝑿0𝔼[𝑿+])𝑼F,(𝑿0𝔼[𝑿+])𝑽ˇF)cδ0σmin\max\left(\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])^{\top}{\bm{U}}^{*}{}\|_{F},\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])\check{\bm{V}}^{\top}\|_{F}\right)\lesssim c\delta_{0}{\sigma_{\min}^{*}}

Plugging these into (21) proves Theorem 3.1.

B-A Bounding the denominator term

By triangle inequality, 𝑿0𝔼[𝑿+]𝑿+𝔼[𝑿+]+𝑿0𝑿+.\|{\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}]\|\leq\|{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\|+\|{\bm{X}}_{0}-{\bm{X}}_{+}\|. The next two lemmas bound these two terms. The lemmas assume the claim of Fact 3.7 holds, i.e., that 1mqki𝒚ki2[1±ϵ1]C~𝑿F2/q\frac{1}{mq}\sum_{ki}\bm{y}_{ki}^{2}\in[1\pm\epsilon_{1}]\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}/q where C~=9μ2κ2\tilde{C}=9\mu^{2}\kappa^{2}.

Lemma B.2.

Assume that 1mqki𝐲ki2[1±ϵ1]C~𝐗F2/q\frac{1}{mq}\sum_{ki}\bm{y}_{ki}^{2}\in[1\pm\epsilon_{1}]\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}/q (claim of Fact 3.7 holds). Then, w.p. 1exp(C(n+q)ϵ12mq/μ2κ2)1-\exp(C(n+q)-\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}),

𝑿0𝑿+Cϵ1μκ𝑿F.\|{\bm{X}}_{0}-{\bm{X}}_{+}\|\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.2.

We have

𝑿+𝑿0\displaystyle\|{\bm{X}}_{+}-{\bm{X}}_{0}\| =max𝒛𝒮n,𝒘𝒮q𝒛(𝑿+𝑿0)𝒘\displaystyle=\max_{\bm{z}\in\mathcal{S}^{n},~{}\bm{w}\in\mathcal{S}^{q}}\bm{z}{}^{\top}\left({\bm{X}}_{+}-{\bm{X}}_{0}\right)\bm{w}
=max𝒛𝒮n,𝒘𝒮q1mki𝒘(k)𝒚ki(𝒂ki𝒛)\displaystyle=\max_{\bm{z}\in\mathcal{S}^{n},~{}\bm{w}\in\mathcal{S}^{q}}\frac{1}{m}\sum_{ki}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})
×𝟙{C~mqki𝒚ki2𝒚ki2C~(1+ϵ1)q𝑿F2}.\displaystyle\qquad\times\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}\bm{y}_{ki}^{2}\leq\bm{y}_{ki}^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}.

For the last expression above, we have used the assumption ki𝒚ki2/mC~(1+ϵ1)𝑿F2\sum_{ki}\bm{y}_{ki}^{2}/m\leq\tilde{C}(1+\epsilon_{1})\|{\bm{X}}^{*}\|_{F}^{2}. Consider the RHS for a fixed unit-norm 𝒛\bm{z} and 𝒘\bm{w}. The lower threshold of the indicator function is itself an r.v. To convert it into a deterministic bound, we use the following sequence of bounding steps: in order to use our assumption that ki𝒚ki2/m(1ϵ1)C~𝑿F2\sum_{ki}\bm{y}_{ki}^{2}/m\geq(1-\epsilon_{1})\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}, we first need to bound the summands by their absolute values. This is done as follows:

|𝒛(𝑿+𝑿0)𝒘|\displaystyle|\bm{z}{}^{\top}\left({\bm{X}}_{+}-{\bm{X}}_{0}\right)\bm{w}| 1mki|𝒘(k)𝒚ki(𝒂ki𝒛)|\displaystyle\leq\frac{1}{m}\sum_{ki}\big{|}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\big{|}
×𝟙{C~mqki𝒚ki2|𝒚ki|2C~(1+ϵ1)q𝑿F2},\displaystyle\qquad\times\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}\bm{y}_{ki}^{2}\leq|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}},
1mki|𝒘(k)𝒚ki(𝒂ki𝒛)|\displaystyle\leq\frac{1}{m}\sum_{ki}\big{|}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\big{|}
×𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2},\displaystyle\qquad\times\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}},

where in the last line we used our assumption that ki𝒚ki2/m(1ϵ1)C~𝑿F2\sum_{ki}\bm{y}_{ki}^{2}/m\geq(1-\epsilon_{1})\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}. This final expression is a sum of mutually independent sub-Gaussian r.v.s with subGaussian norm KkiC|𝒘(k)|C~(1+ϵ1)𝑿F/qC~|𝒘(k)|𝑿F/qK_{ki}\leq C|\bm{w}(k)|\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}/\sqrt{q}\leq\sqrt{\tilde{C}}|\bm{w}(k)|\|{\bm{X}}^{*}\|_{F}/\sqrt{q}. Thus, by applying the sub-Gaussian Hoeffding inequality, Theorem 2.6.2 of [26],

Pr{|ki|𝒘(k)𝒚ki(𝒂ki𝒛)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}\displaystyle\Pr\left\{\Big{|}\sum_{ki}\big{|}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right.
𝔼[ki|𝒘(k)𝒚ki(𝒂ki𝒛)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}]|t}\displaystyle-\left.\mathbb{E}\left[\sum_{ki}\big{|}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]\Big{|}\geq t\right\}
2exp[ct2kiKki2].\displaystyle\qquad\leq 2\exp\left[-c\frac{t^{2}}{\sum_{ki}K_{ki}^{2}}\right].

By setting t=ϵ1m𝑿Ft=\epsilon_{1}m\|{\bm{X}}^{*}\|_{F},

t2kiKki2m2qϵ12𝑿F2kiC~𝑿F2|𝒘(k)|2=ϵ12mqC~.\frac{t^{2}}{\sum_{ki}K_{ki}^{2}}\geq\frac{m^{2}q\epsilon_{1}^{2}\|{\bm{X}}^{*}\|_{F}^{2}}{\sum_{ki}\tilde{C}\|{\bm{X}}^{*}\|_{F}^{2}|\bm{w}(k)|^{2}}=\frac{\epsilon_{1}^{2}mq}{\tilde{C}}.

Since C~=9μ2κ2\tilde{C}=9\mu^{2}\kappa^{2}, w.p. 1exp(cϵ12mq/μ2κ2)1-\exp(-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}), for a fixed 𝒛\bm{z} and 𝒘\bm{w},

𝒛(𝑿0𝑿+)𝒘ϵ1𝑿F+𝔼[1mki|𝒘(k)𝒚ki(𝒂ki𝒛)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}].\bm{z}{}^{\top}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\bm{w}\leq\epsilon_{1}\|{\bm{X}}^{*}\|_{F}+\mathbb{E}\left[\frac{1}{m}\sum_{ki}\big{|}\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right].

By using Claim B.1 and |𝒘(k)|𝒛=|𝒘(k)||\bm{w}(k)|\|\bm{z}\|=|\bm{w}(k)| we have

𝔼[1mki|𝒚ki(𝒂ki𝒛)𝒘(k)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}]\displaystyle\mathbb{E}\left[\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\bm{w}(k)\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]
C~(1+ϵ1)ϵ1𝑿Fk|𝒘(k)|/qCϵ1μκ𝑿F,\displaystyle\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\epsilon_{1}\|{\bm{X}}^{*}\|_{F}\sum_{k}\big{|}\bm{w}(k)\big{|}/\sqrt{q}\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F},

where in the last inequality we used Cauchy-Schwarz to show that k|𝒘(k)|/qk|𝒘(k)|2k(1/q)=1\sum_{k}\big{|}\bm{w}(k)\big{|}/\sqrt{q}\leq\sqrt{\sum_{k}\big{|}\bm{w}(k)\big{|}^{2}\sum_{k}(1/q)}=1; equivalently, 𝒘1/q𝒘=1\|\bm{w}\|_{1}/\sqrt{q}\leq\|\bm{w}\|=1. Also, we used C~=Cκμ\sqrt{\tilde{C}}=C\kappa\mu.

Thus, w.p. 1exp(cϵ12mq/μ2κ2)1-\exp(-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}), for a fixed 𝒛\bm{z} and 𝒘\bm{w}, 𝒛(𝑿0𝑿+)𝒘Cϵ1μκ𝑿F\bm{z}{}^{\top}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\bm{w}\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F}.

By Proposition 4.8, max𝒛𝒮n,𝒘𝒮q𝒛(𝑿0𝑿+)𝒘1.4Cϵ1μκ𝑿F\max_{\bm{z}\in\mathcal{S}^{n},~{}\bm{w}\in\mathcal{S}^{q}}\bm{z}{}^{\top}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\bm{w}\leq 1.4C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F} w.p. at least 1exp((n+q)log(17)cϵ12mq/μ2κ2)1-\exp((n+q)\log(17)-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}). ∎
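For completeness, here is the union-bound arithmetic behind this last step, written as a sketch under the assumption (suggested by the log(17) factor above) that Proposition 4.8 uses 1/8-nets on 𝒮n and 𝒮q:

\left(1+\frac{2}{\epsilon_{net}}\right)^{n}\cdot\left(1+\frac{2}{\epsilon_{net}}\right)^{q}=17^{\,n+q}=\exp\left((n+q)\log 17\right),\qquad\epsilon_{net}=1/8,

so a union bound multiplies the per-pair failure probability exp(−cϵ1²mq/μ²κ²) by the net size, giving the stated exponent, while the factor 1.4 ≥ 1/(1−2ϵnet) = 4/3 accounts for passing from the maximum over the nets to the maximum over the unit spheres.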

Lemma B.3.

Consider 𝐗+{\bm{X}}_{+}. Fix 0<ϵ1<10<\epsilon_{1}<1. Then, w.p. 1exp[C(n+q)cϵ12mq/μ2κ2]1-\exp\left[C(n+q)-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]

𝑿+𝔼[𝑿+]Cϵ1𝑿F.\|{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\|\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.3.

The proof involves an application of the sub-Gaussian Hoeffding inequality followed by an epsilon-net argument, both almost the same as those used in the proof of Lemma B.2 given above. We have,

𝑿+𝔼[𝑿+]=max𝒛𝒮n,𝒘𝒮q𝑿+𝔼[𝑿+],𝒛𝒘.\|{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\|=\max_{\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}}\langle{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}],~{}\bm{z}\bm{w}{}^{\top}\rangle.

For a fixed 𝒛𝒮n,𝒘𝒮q\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}, we have

𝑿+𝔼[𝑿+],𝒛𝒘\displaystyle\langle{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}],~{}\bm{z}\bm{w}{}^{\top}\rangle
=1mki(𝒘(k)𝒚ki(𝒂ki𝒛)𝟙{|𝒚ki|2C~(1+ϵ1)q𝑿F2}\displaystyle=\frac{1}{m}\sum_{ki}\left(\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right.
𝔼[𝒘(k)𝒚ki(𝒂ki𝒛)𝟙{|𝒚ki|2C~(1+ϵ1)q𝑿F2}]).\displaystyle\qquad-\left.\mathbb{E}\left[\bm{w}(k)\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}\bm{z})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]\right).

The summands are mutually independent, zero mean sub-Gaussian r.v.s with norm KkiC|𝒘(k)|C~(1+ϵ1)𝑿F/qK_{ki}\leq C|\bm{w}(k)|\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}/\sqrt{q}. We will again apply the sub-Gaussian Hoeffding inequality Theorem 2.6.2 of [26]. Let t=ϵ1m𝑿Ft=\epsilon_{1}m\|{\bm{X}}^{*}\|_{F}. Then

\frac{t^{2}}{\sum_{ki}K_{ki}^{2}}\geq\frac{\epsilon_{1}^{2}m^{2}\|{\bm{X}}^{*}\|_{F}^{2}}{\sum_{ki}\tilde{C}(1+\epsilon_{1})|\bm{w}(k)|^{2}\|{\bm{X}}^{*}\|_{F}^{2}/q}\geq\frac{\epsilon_{1}^{2}mq}{C\mu^{2}\kappa^{2}}

Thus, for a fixed 𝒛𝒮n,𝒘𝒮q\bm{z}\in\mathcal{S}_{n},\bm{w}\in\mathcal{S}_{q}, by sub-Gaussian Hoeffding, we conclude that, w.p. at least 1exp[cϵ12mq/μ2κ2]1-\exp\left[-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right],

𝑿+𝔼[𝑿+],𝒛𝒘Cϵ1𝑿F.\langle{\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}],~{}\bm{z}\bm{w}{}^{\top}\rangle\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

By Proposition 4.7, the above bound holds w.p. at least 1exp[(n+q)cϵ12mq/μ2κ2]1-\exp\left[(n+q)-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]. ∎

B-B Bounding the 𝐕ˇ\check{\bm{V}} numerator term

We bound (𝑿0𝔼[𝑿+])𝑽ˇF\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])\check{\bm{V}}^{\top}\|_{F} in this section. By the triangle inequality, it is bounded by (𝑿0𝑿+)𝑽ˇF+(𝑿+𝔼[𝑿+])𝑽ˇF\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\check{\bm{V}}{}^{\top}\|_{F}+\|\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right)\check{\bm{V}}{}^{\top}\|_{F}.

Lemma B.4.

Assume that 1mki𝐲ki2[1±ϵ1]𝐗F2\frac{1}{m}\sum_{ki}\bm{y}_{ki}^{2}\in[1\pm\epsilon_{1}]\|{\bm{X}}^{*}\|_{F}^{2}. Then, w.p. 1exp[nrcϵ12mq/μ2κ2]1-\exp\left[nr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right],

(𝑿0𝑿+)𝑽ˇFCϵ1μκ𝑿F.\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\check{\bm{V}}{}^{\top}\|_{F}\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.4.

The initial part of the proof is very similar to that of Lemma B.2. We have, (𝑿0𝑿+)𝑽ˇF=max𝑾𝒮nr𝑾,(𝑿0𝑿+)𝑽ˇ.\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\check{\bm{V}}{}^{\top}\|_{F}=\max_{{\bm{W}}\in\mathcal{S}_{nr}}\langle{\bm{W}},~{}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\check{\bm{V}}{}^{\top}\rangle. For a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr},

𝑾,(𝑿0𝑿+)𝑽ˇ\displaystyle\langle{\bm{W}},~{}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right)\check{\bm{V}}{}^{\top}\rangle
=1mki𝒚ki(𝒂ki𝑾𝒗ˇk)𝟙{C~mqki𝒚ki2|𝒚ki|2C~(1+ϵ1)q𝑿F2}\displaystyle=\frac{1}{m}\sum_{ki}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}\bm{y}_{ki}^{2}\leq|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}

Proceeding as in the proof of Lemma B.2,

1mki𝒚ki(𝒂ki𝑾𝒗ˇk)𝟙{C~mqki𝒚ki2|𝒚ki|2C~(1+ϵ1)q𝑿F2}\displaystyle\frac{1}{m}\sum_{ki}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}\bm{y}_{ki}^{2}\leq|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}
1mki|𝒚ki(𝒂ki𝑾𝒗ˇk)|𝟙{C~mqki𝒚ki2|𝒚ki|2C~(1+ϵ1)q𝑿F2},\displaystyle\leq\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\big{|}\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}\bm{y}_{ki}^{2}\leq|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}},
1mki|𝒚ki||(𝒂ki𝑾𝒗ˇk)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}.\displaystyle\leq\frac{1}{m}\sum_{ki}|\bm{y}_{ki}||(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})|\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}.

The summands are mutually independent sub-Gaussian r.v.s with norm KkiCC~(1+ϵ1)𝑾𝒗ˇk𝑿F/qK_{ki}\leq C\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{W}}\check{\bm{v}}_{k}\|\|{\bm{X}}^{*}\|_{F}/\sqrt{q}. Thus, we can apply the sub-Gaussian Hoeffding inequality Theorem 2.6.2 of [26]. Set t=ϵ1m𝑿Ft=\epsilon_{1}m\|{\bm{X}}^{*}\|_{F}. Then we have

t2kiKki2ϵ12m2𝑿F2(ki𝑾𝒗ˇk2)C~(1+ϵ1)𝑿F2/qϵ12mqCμ2κ2,\frac{t^{2}}{\sum_{ki}K^{2}_{ki}}\geq\frac{\epsilon_{1}^{2}m^{2}\|{\bm{X}}^{*}\|_{F}^{2}}{(\sum_{ki}\|{\bm{W}}\check{\bm{v}}_{k}\|^{2})\tilde{C}(1+\epsilon_{1})\|{\bm{X}}^{*}\|_{F}^{2}/q}\geq\frac{\epsilon_{1}^{2}mq}{C\mu^{2}\kappa^{2}},

where we used the fact that 𝑽ˇ𝑽ˇ=𝑰\check{\bm{V}}\check{\bm{V}}{}{}^{\top}=\bm{I} (the rows of 𝑽ˇ\check{\bm{V}} are right singular vectors) and thus \|{\bm{W}}\check{\bm{V}}\|_{F}^{2}=\mathrm{trace}({\bm{W}}\check{\bm{V}}\check{\bm{V}}{}^{\top}{\bm{W}}{}^{\top})=\|{\bm{W}}\|_{F}^{2}=1. Applying sub-Gaussian Hoeffding, we can conclude that, w.p. 1exp[cϵ12mq/μ2κ2]1-\exp\left[-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right],

1mki|𝒚ki(𝒂ki𝑾𝒗ˇk)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}\displaystyle\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}
ϵ1𝑿F\displaystyle\leq\epsilon_{1}\|{\bm{X}}^{*}\|_{F}
+1mki𝔼[|𝒚ki(𝒂ki𝑾𝒗ˇk)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}].\displaystyle\qquad+\frac{1}{m}\sum_{ki}\mathbb{E}\left[\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right].

We use Claim B.1 to bound the expectation term. Using this claim with αC~(1+ϵ1)𝑿F2/q\alpha\equiv\tilde{C}(1+\epsilon_{1})\|{\bm{X}}^{*}\|_{F}^{2}/q and 𝒛𝑾𝒗ˇk\bm{z}\equiv{\bm{W}}\check{\bm{v}}_{k},

1mki𝔼[|𝒚ki(𝒂ki𝑾𝒗ˇk)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}]\displaystyle\frac{1}{m}\sum_{ki}\mathbb{E}\left[\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]
1mkiC~(1+ϵ1)ϵ1𝑿F𝑾𝒗ˇk/qCϵ1μκ𝑿F.\displaystyle\leq\frac{1}{m}\sum_{ki}\sqrt{\tilde{C}(1+\epsilon_{1})}\epsilon_{1}\|{\bm{X}}^{*}\|_{F}\|{\bm{W}}\check{\bm{v}}_{k}\|/\sqrt{q}\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F}.

where the last inequality used Cauchy-Schwarz on k𝑾𝒗ˇk/q\sum_{k}\|{\bm{W}}\check{\bm{v}}_{k}\|/\sqrt{q} to conclude that k𝑾𝒗ˇk(1/q)k𝑾𝒗ˇk2k(1/q)=𝑾𝑽ˇF21=1\sum_{k}\|{\bm{W}}\check{\bm{v}}_{k}\|(1/\sqrt{q})\leq\sqrt{\sum_{k}\|{\bm{W}}\check{\bm{v}}_{k}\|^{2}\sum_{k}(1/q)}=\sqrt{\|{\bm{W}}\check{\bm{V}}\|_{F}^{2}\cdot 1}=1 since 𝑾𝑽ˇF=1\|{\bm{W}}\check{\bm{V}}\|_{F}=1.

By Proposition 4.8, the above bound holds for all 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr}, w.p. at least 1exp[nrlog(1+2/ϵnet)cϵ12mq/μ2κ2]1-\exp\left[nr\log(1+2/\epsilon_{net})-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]. ∎

Lemma B.5.

Consider 0<ϵ1<10<\epsilon_{1}<1. Then, w.p. 1exp[nrϵ12mq/μ2κ2]1-\exp\left[nr-\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]

(𝑿+𝔼[𝑿+])𝑽ˇFCϵ1𝑿F.\|\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right)\check{\bm{V}}{}^{\top}\|_{F}\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.5.

The proof is quite similar to the previous one. For a fixed 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr} we have,

(𝑿+𝔼[𝑿+])𝑽ˇ,𝑾\displaystyle\langle\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle
=1mki(𝒚ki(𝒂ki𝑾𝒗ˇk)𝟙{|𝒚ki|2C~(1+ϵ1)q𝑿F2}𝔼[.])\displaystyle=\frac{1}{m}\sum_{ki}\left(\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{W}}\check{\bm{v}}_{k})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}-\mathbb{E}[.]\right)

where 𝔼[.]\mathbb{E}[.] is the expected value of the first term. The summands are independent, zero mean, sub-Gaussian r.v.s with sub-Gaussian norm KkiCC~(1+ϵ1)𝑿F𝑾𝒗ˇk/qK_{ki}\leq C\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}\|{\bm{W}}\check{\bm{v}}_{k}\|/\sqrt{q}. Thus, by applying the sub-Gaussian Hoeffding inequality, Theorem 2.6.2 of [26], with t=ϵ1m𝑿Ft=\epsilon_{1}m\|{\bm{X}}^{*}\|_{F}, and using 𝑾𝑽ˇF=1\|{\bm{W}}\check{\bm{V}}\|_{F}=1, we can conclude that, w.p. 1exp[ϵ12mq/(Cμ2κ2)]1-\exp\left[-\epsilon_{1}^{2}mq/(C\mu^{2}\kappa^{2})\right],

(𝑿+𝔼[𝑿+])𝑽ˇ,𝑾Cϵ1𝑿F.\langle\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right)\check{\bm{V}}{}^{\top},~{}{\bm{W}}\rangle\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

By Proposition 4.8, the above bound holds for all 𝑾𝒮nr{\bm{W}}\in\mathcal{S}_{nr} w.p. 1exp[nrϵ12mq/(Cμ2κ2)]1-\exp\left[nr-\epsilon_{1}^{2}mq/(C\mu^{2}\kappa^{2})\right]. ∎

B-C Bounding the 𝑼{\bm{U}}^{*}{} numerator term

We bound (𝑿0𝔼[𝑿+])𝑼F\|({\bm{X}}_{0}-\mathbb{E}[{\bm{X}}_{+}])^{\top}{\bm{U}}^{*}{}\|_{F} here. By triangle inequality, it is bounded by (𝑿0𝑿+)𝑼F+(𝑿+𝔼[𝑿+])𝑼F\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right){}^{\top}{\bm{U}}^{*}{}\|_{F}+\|\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right){}^{\top}{\bm{U}}^{*}{}\|_{F}.

Lemma B.6.

Assume that 1mqki𝐲ki2[1±ϵ1]𝐗F2/q\frac{1}{mq}\sum_{ki}\bm{y}_{ki}^{2}\in[1\pm\epsilon_{1}]\|{\bm{X}}^{*}\|_{F}^{2}/q. Then, w.p. 1exp[qrcϵ12mq/μ2κ2]1-\exp\left[qr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]

(𝑿0𝑿+)𝑼FCϵ1μκ𝑿F.\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right){}^{\top}{\bm{U}}^{*}{}\|_{F}\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.6.

The proof is similar to that of Lemmas B.2 and B.4. We have, (𝑿0𝑿+)𝑼F=max𝑾𝒮qr𝑾,(𝑿0𝑿+)𝑼.\|\left({\bm{X}}_{0}-{\bm{X}}_{+}\right){}^{\top}{\bm{U}}^{*}{}\|_{F}=\max_{{\bm{W}}\in\mathcal{S}_{qr}}\langle{\bm{W}},~{}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right){}^{\top}{\bm{U}}^{*}{}\rangle. For a fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr}, using the same approach as in Lemma B.2, and letting 𝒘k\bm{w}_{k} be the kk-th column of the r×qr\times q matrix 𝑾{\bm{W}},

𝑾,(𝑿0𝑿+)𝑼\displaystyle\langle{\bm{W}},~{}\left({\bm{X}}_{0}-{\bm{X}}_{+}\right){}^{\top}{\bm{U}}^{*}{}\rangle
1mki|𝒚ki(𝒂ki𝑼𝒘k)|𝟙{C~mqki|𝒚ki|2|𝒚ki|2C~(1+ϵ1)q𝑿F2},\displaystyle\qquad\leq\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\big{|}\mathbbm{1}_{\left\{\frac{\tilde{C}}{mq}\sum_{ki}|\bm{y}_{ki}|^{2}\leq|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}},
1mki|𝒚ki(𝒂ki𝑼𝒘k)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}.\displaystyle\qquad\leq\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}.

The summands are now mutually independent sub-Gaussian r.v.s with norm KkiC~(1+ϵ1)𝒘k𝑿F/qK_{ki}\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\|\bm{w}_{k}\|\|{\bm{X}}^{*}\|_{F}/\sqrt{q}. Thus, we can apply the sub-Gaussian Hoeffding inequality Theorem 2.6.2 of [26], to conclude that, for a fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr}, w.p. 1exp[cϵ12mq/μ2κ2]1-\exp\left[-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right],

1mki|𝒚ki(𝒂ki𝑼𝒘k)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}\displaystyle\frac{1}{m}\sum_{ki}\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}
ϵ1𝑿F+1mki𝔼[|𝒚ki(𝒂ki𝑼𝒘k)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}]\displaystyle\leq\epsilon_{1}\|{\bm{X}}^{*}\|_{F}+\frac{1}{m}\sum_{ki}\mathbb{E}\left[\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]

By Claim B.1, and using k𝒘k/qk𝒘k2k1/q=1\sum_{k}\|\bm{w}_{k}\|/\sqrt{q}\leq\sqrt{\sum_{k}\|\bm{w}_{k}\|^{2}}~{}\sqrt{\sum_{k}1/q}=1,

1mki𝔼[|𝒚ki(𝒂ki𝑼𝒘k)|𝟙{|𝒚ki|2[1±ϵ1]C~q𝑿F2}]\displaystyle\frac{1}{m}\sum_{ki}\mathbb{E}\left[\big{|}\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\big{|}\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\in[1\pm\epsilon_{1}]\frac{\tilde{C}}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]
1mkiϵ1𝒘kC~(1+ϵ1)/q𝑿F,\displaystyle\leq\frac{1}{m}\sum_{ki}\epsilon_{1}\|\bm{w}_{k}\|\sqrt{\tilde{C}(1+\epsilon_{1})/q}\|{\bm{X}}^{*}\|_{F},
Cϵ1μκ𝑿F,\displaystyle\leq C\epsilon_{1}\mu\kappa\|{\bm{X}}^{*}\|_{F},

By Proposition 4.8 (epsilon net argument), the bound holds for all unit norm 𝑾{\bm{W}} w.p. 1exp[qrcϵ12mq/μ2κ2]1-\exp\left[qr-c\epsilon_{1}^{2}mq/\mu^{2}\kappa^{2}\right]. ∎

Lemma B.7.

Consider 0<ϵ1<10<\epsilon_{1}<1. Then, w.p. 1exp[qrϵ12mq/(Cμ2κ2)]1-\exp\left[qr-\epsilon_{1}^{2}mq/(C\mu^{2}\kappa^{2})\right]

(𝑿+𝔼[𝑿+])𝑼FCϵ1𝑿F.\|\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right){}^{\top}{\bm{U}}^{*}{}\|_{F}\leq C\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.
Proof of Lemma B.7.

For fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr},

trace(𝑾(𝑿+𝔼[𝑿+])𝑼)\displaystyle\mathrm{trace}\left({\bm{W}}{}^{\top}\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right){}^{\top}{\bm{U}}^{*}{}\right)
=1mki(𝒚ki(𝒂ki𝑼𝒘k)𝟙{|𝒚ki|2C~(1+ϵ1)q𝑿F2}\displaystyle\qquad=\frac{1}{m}\sum_{ki}\left(\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right.
𝔼[𝒚ki(𝒂ki𝑼𝒘k)𝟙{|𝒚ki|2C~(1+ϵ1)q𝑿F2}])\displaystyle\qquad-\left.\mathbb{E}\left[\bm{y}_{ki}(\bm{a}_{ki}{}^{\top}{\bm{U}}^{*}{}\bm{w}_{k})\mathbbm{1}_{\left\{|\bm{y}_{ki}|^{2}\leq\frac{\tilde{C}(1+\epsilon_{1})}{q}\|{\bm{X}}^{*}\|_{F}^{2}\right\}}\right]\right)

The summands are independent zero mean sub-Gaussian r.v.s with sub-Gaussian norm KkiC~(1+ϵ1)𝑿F𝒘k/qK_{ki}\leq\sqrt{\tilde{C}(1+\epsilon_{1})}\|{\bm{X}}^{*}\|_{F}\|\bm{w}_{k}\|/\sqrt{q}. Thus, by applying the sub-Gaussian Hoeffding inequality, Theorem 2.6.2 of [26], with t=ϵ1m𝑿Ft=\epsilon_{1}m\|{\bm{X}}^{*}\|_{F}, we can conclude that, for a fixed 𝑾𝒮qr{\bm{W}}\in\mathcal{S}_{qr}, w.p. 1exp[ϵ12mq/Cμ2κ2]1-\exp\left[-\epsilon_{1}^{2}mq/C\mu^{2}\kappa^{2}\right],

trace(𝑾(𝑿+𝔼[𝑿+])𝑼)ϵ1𝑿F.\mathrm{trace}\left({\bm{W}}{}^{\top}\left({\bm{X}}_{+}-\mathbb{E}[{\bm{X}}_{+}]\right){}^{\top}{\bm{U}}^{*}{}\right)\leq\epsilon_{1}\|{\bm{X}}^{*}\|_{F}.

By Proposition 4.8 (epsilon net argument), the bound holds for all unit norm 𝑾{\bm{W}} w.p. 1exp[qrϵ12mq/Cμ2κ2]1-\exp\left[qr-\epsilon_{1}^{2}mq/C\mu^{2}\kappa^{2}\right]. ∎

B-D Proof of Claim B.1

Proof.

We can write 𝒙=𝒙𝑸𝒆1\bm{x}^{*}=\|\bm{x}^{*}\|{\bm{Q}}\bm{e}_{1} where 𝑸{\bm{Q}} is a unitary matrix with first column proportional to 𝒙\bm{x}^{*}. We need to bound

𝔼[𝒙|(𝒂𝑸𝒆1)(𝒂𝑸𝑸𝒛)|𝟙{𝒙2|𝒂𝑸e1|2[1±ϵ]α}]\displaystyle\mathbb{E}[\|\bm{x}^{*}\|\cdot|(\bm{a}^{\top}{\bm{Q}}\bm{e}_{1})(\bm{a}^{\top}{\bm{Q}}{\bm{Q}}^{\top}\bm{z})|\mathbbm{1}_{\{\|\bm{x}^{*}\|^{2}|\bm{a}^{\top}{\bm{Q}}e_{1}|^{2}\in[1\pm\epsilon]\alpha}\}]
=𝒙𝒛𝔼[|𝒂~(1)𝒂~𝒛¯Q|𝟙{|𝒂~(1)|2[1±ϵ]β2}]\displaystyle=\|\bm{x}^{*}\|\cdot\|\bm{z}\|\cdot\mathbb{E}[|\tilde{\bm{a}}(1)\tilde{\bm{a}}^{\top}\bar{\bm{z}}_{Q}|\mathbbm{1}_{\{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}\}]

where 𝒛¯Q:=𝑸𝒛/𝒛\bar{\bm{z}}_{Q}:={\bm{Q}}^{\top}\bm{z}/\|\bm{z}\|, 𝒂~:=𝑸𝒂\tilde{\bm{a}}:={\bm{Q}}^{\top}\bm{a} and β:=α/𝒙\beta:=\sqrt{\alpha}/\|\bm{x}^{*}\|. Since 𝑸{\bm{Q}} is unitary and 𝒂\bm{a} is Gaussian, 𝒂~\tilde{\bm{a}} has the same distribution as 𝒂\bm{a}. Let 𝒂~(1)\tilde{\bm{a}}(1) be its first entry and 𝒂~(rest)\tilde{\bm{a}}(\mathrm{rest}) be the (n1)(n-1)-length vector containing its remaining entries; define 𝒛¯Q(1)\bar{\bm{z}}_{Q}(1) and 𝒛¯Q(rest)\bar{\bm{z}}_{Q}(\mathrm{rest}) similarly. Then, 𝒂~𝒛¯Q=𝒂~(1)𝒛¯Q(1)+𝒂~(rest)𝒛¯Q(rest)\tilde{\bm{a}}^{\top}\bar{\bm{z}}_{Q}=\tilde{\bm{a}}(1)\cdot\bar{\bm{z}}_{Q}(1)+\tilde{\bm{a}}(\mathrm{rest})^{\top}\bar{\bm{z}}_{Q}(\mathrm{rest}). Since 𝒂~(1)\tilde{\bm{a}}(1) and 𝒂~(rest)\tilde{\bm{a}}(\mathrm{rest}) are independent,

𝔼[|𝒂~(1)𝒂~𝒛¯Q|𝟙|𝒂~(1)|2[1±ϵ]β2]\displaystyle\mathbb{E}[|\tilde{\bm{a}}(1)\tilde{\bm{a}}^{\top}\bar{\bm{z}}_{Q}|\mathbbm{1}_{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}]
|𝒛¯Q(1)|𝔼[|𝒂~(1)2|𝟙|𝒂~(1)|2[1±ϵ]β2]\displaystyle\leq|\bar{\bm{z}}_{Q}(1)|\mathbb{E}[|\tilde{\bm{a}}(1)^{2}|\mathbbm{1}_{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}]
+𝔼[|𝒂~(rest)𝒛¯Q(rest)|]𝔼[|𝒂~(1)|𝟙|𝒂~(1)|2[1±ϵ]β2]\displaystyle\qquad+\mathbb{E}[|\tilde{\bm{a}}(\mathrm{rest})^{\top}\bar{\bm{z}}_{Q}(\mathrm{rest})|]\ \mathbb{E}[|\tilde{\bm{a}}(1)|\mathbbm{1}_{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}]
\displaystyle\leq 𝔼[|𝒂~(1)2|𝟙|𝒂~(1)|2[1±ϵ]β2]+2𝔼[|𝒂~(1)|𝟙|𝒂~(1)|2[1±ϵ]β2]\displaystyle\mathbb{E}[|\tilde{\bm{a}}(1)^{2}|\mathbbm{1}_{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}]+2\mathbb{E}[|\tilde{\bm{a}}(1)|\mathbbm{1}_{|\tilde{\bm{a}}(1)|^{2}\in[1\pm\epsilon]\beta^{2}}]
\displaystyle\leq ϵβ+2ϵβ=3ϵβ=Cϵα𝒙.\displaystyle\epsilon\beta+2\epsilon\beta=3\epsilon\beta=C\epsilon\frac{\sqrt{\alpha}}{\|\bm{x}^{*}\|}.

The second inequality used the facts that (i) |𝒛¯Q(1)|𝒛¯Q=1|\bar{\bm{z}}_{Q}(1)|\leq\|\bar{\bm{z}}_{Q}\|=1 by definition and (ii) ζ:=𝒂~(rest)𝒛¯Q(rest)\zeta:=\tilde{\bm{a}}(\mathrm{rest})^{\top}\bar{\bm{z}}_{Q}(\mathrm{rest}) is a zero-mean Gaussian r.v. with variance \|\bar{\bm{z}}_{Q}(\mathrm{rest})\|^{2}\leq 1, and so 𝔼[|ζ|]2\mathbb{E}[|\zeta|]\leq 2. The third one relies on the following two bounds:

  1.
    𝔼[|𝒂(1)|2𝟙{|𝒂(1)|2[1±ϵ]β2}]\displaystyle\mathbb{E}\left[|\bm{a}(1)|^{2}\mathbbm{1}_{\left\{|\bm{a}(1)|^{2}\in[1\pm\epsilon]\beta^{2}\right\}}\right]
    =22π1ϵβ1+ϵβz2exp(z2/2)𝑑z,\displaystyle=\frac{2}{\sqrt{2\pi}}\int_{\sqrt{1-\epsilon}\beta}^{\sqrt{1+\epsilon}\beta}z^{2}\exp(-z^{2}/2)dz,
    \displaystyle\leq\frac{4e^{-1}}{\sqrt{2\pi}}\int_{\sqrt{1-\epsilon}\beta}^{\sqrt{1+\epsilon}\beta}dz\leq\frac{4e^{-1}}{\sqrt{2\pi}}\sqrt{2}\,\epsilon\beta\leq\epsilon\beta

    where we used the facts that z^{2}\exp(-z^{2}/2)\leq 2e^{-1} for all z\in\Re, and \sqrt{1+\epsilon}-\sqrt{1-\epsilon}=2\epsilon/(\sqrt{1+\epsilon}+\sqrt{1-\epsilon})\leq\sqrt{2}\,\epsilon for 0<\epsilon<1.

  2.

    Similarly, we can show that

    𝔼[|𝒂(1)|𝟙{|𝒂(1)|2[1±ϵ]β2}]\displaystyle\mathbb{E}\left[|\bm{a}(1)|\mathbbm{1}_{\{|\bm{a}(1)|^{2}\in[1\pm\epsilon]\beta^{2}\}}\right]
    =22π1ϵβ1+ϵβzexp(z2/2)𝑑z,\displaystyle=\frac{2}{\sqrt{2\pi}}\int_{\sqrt{1-\epsilon}\beta}^{\sqrt{1+\epsilon}\beta}z\exp(-z^{2}/2)dz,
    \displaystyle\leq\frac{2e^{-1/2}}{\sqrt{2\pi}}\int_{\sqrt{1-\epsilon}\beta}^{\sqrt{1+\epsilon}\beta}dz\leq\frac{2e^{-1/2}}{\sqrt{2\pi}}\sqrt{2}\,\epsilon\beta\leq\epsilon\beta

The claim follows by combining the two equations given above. ∎
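As a quick sanity check on Claim B.1, the following Monte Carlo sketch (Python/NumPy) estimates the left hand side 𝔼[|y(a⊤z)|𝟙{y²∈[1±ϵ]α}] and compares it against ϵ‖z‖√α. The dimension, the vectors x* and z, the threshold α, the value of ϵ, and the sample size below are all arbitrary illustrative choices; the claim asserts only that the printed ratio remains bounded by an absolute constant.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x_star = rng.standard_normal(n)               # a fixed (deterministic) x*
z = rng.standard_normal(n)                    # a fixed (deterministic) z
alpha = 2.0 * np.linalg.norm(x_star) ** 2     # an illustrative threshold level
eps = 0.05

# Monte Carlo estimate of E[ |y (a^T z)| 1{ y^2 in [1 +/- eps] alpha } ] with y = a^T x*.
N = 200_000
a = rng.standard_normal((N, n))
y = a @ x_star
az = a @ z
mask = (y ** 2 >= (1 - eps) * alpha) & (y ** 2 <= (1 + eps) * alpha)
lhs = np.mean(np.abs(y * az) * mask)

rhs = eps * np.linalg.norm(z) * np.sqrt(alpha)
print(f"LHS ~ {lhs:.4f},  eps*||z||*sqrt(alpha) = {rhs:.4f},  ratio ~ {lhs / rhs:.2f}")
```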

References

  • [1] S. Negahban, M. J. Wainwright et al., “Estimation of (near) low-rank matrices with noise and high-dimensional scaling,” The Annals of Statistics, vol. 39, no. 2, pp. 1069–1097, 2011.
  • [2] P. Netrapalli, P. Jain, and S. Sanghavi, “Low-rank matrix completion using alternating minimization,” in Annual ACM Symp. on Th. of Comp. (STOC), 2013.
  • [3] E. J. Candes and B. Recht, “Exact matrix completion via convex optimization,” Found. of Comput. Math, no. 9, pp. 717–772, 2008.
  • [4] S. Nayer, P. Narayanamurthy, and N. Vaswani, “Phaseless PCA: Low-rank matrix recovery from column-wise phaseless measurements,” in Intl. Conf. Machine Learning (ICML), 2019.
  • [5] ——, “Provable low rank phase retrieval,” IEEE Trans. Info. Th., March 2020.
  • [6] S. Nayer and N. Vaswani, “Sample-efficient low rank phase retrieval,” IEEE Trans. Info. Th., 2021.
  • [7] R. S. Srinivasa, K. Lee, M. Junge, and J. Romberg, “Decentralized sketching of low rank matrices,” in Neur. Info. Proc. Sys. (NeurIPS), 2019, pp. 10 101–10 110.
  • [8] Z.-P. Liang, “Spatiotemporal imaging with partially separable functions,” in 4th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 2007, pp. 988–991.
  • [9] S. G. Lingala, Y. Hu, E. DiBella, and M. Jacob, “Accelerated dynamic mri exploiting sparsity and low-rank structure: kt slr,” IEEE Transactions on Medical Imaging, vol. 30, no. 5, pp. 1042–1054, 2011.
  • [10] J. Yao, Z. Xu, X. Huang, and J. Huang, “An efficient algorithm for dynamic mri using low-rank and total variation regularizations,” Medical Image Analysis, vol. 44, pp. 14–27, 2018.
  • [11] F. P. Anaraki and S. Hughes, “Memory and computation efficient pca via very sparse random projections,” in Intl. Conf. Machine Learning (ICML), 2014, pp. 1341–1349.
  • [12] A. Krishnamurthy, M. Azizyan, and A. Singh, “Subspace learning from extremely compressed measurements,” arXiv preprint arXiv:1404.0751, 2014.
  • [13] S. Babu, S. Nayer, S. G. Lingala, and N. Vaswani, “Fast low rank compressive sensing for accelerated dynamic mri,” in IEEE Intl. Conf. Acoustics, Speech, Sig. Proc. (ICASSP), 2022, to appear.
  • [14] P. Jain and P. Netrapalli, “Fast exact matrix completion with finite samples,” in Conf. on Learning Theory, 2015, pp. 1007–1034.
  • [15] Y. Cherapanamjeri, K. Gupta, and P. Jain, “Nearly-optimal robust matrix completion,” ICML, 2016.
  • [16] X. Yi, D. Park, Y. Chen, and C. Caramanis, “Fast algorithms for robust pca via gradient descent,” in Neur. Info. Proc. Sys. (NeurIPS), 2016.
  • [17] Q. Zheng and J. Lafferty, “Convergence analysis for rectangular matrix completion using burer-monteiro factorization and gradient descent,” arXiv preprint arXiv:1605.07051, 2016.
  • [18] S. Lang, Real and Functional Analysis.   Springer-Verlag, New York 10:11–13, 1993.
  • [19] C. Ma, K. Wang, Y. Chi, and Y. Chen, “Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution,” in Intl. Conf. Machine Learning (ICML), 2018.
  • [20] Y. Chen and E. Candes, “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” in Neur. Info. Proc. Sys. (NeurIPS), 2015, pp. 739–747.
  • [21] H. Zhang, Y. Zhou, Y. Liang, and Y. Chi, “A nonconvex approach for phase retrieval: Reshaped wirtinger flow and incremental algorithms,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 5164–5198, 2017.
  • [22] G. Jagatap, Z. Chen, S. Nayer, C. Hegde, and N. Vaswani, “Sample efficient fourier ptychography for structured data,” IEEE Trans. Comput. Imaging, vol. 6, pp. 344–357, 2020.
  • [23] Y. Chen, Y. Chi, and A. J. Goldsmith, “Exact and stable covariance estimation from quadratic sampling via convex programming,” IEEE Transactions on Information Theory, vol. 61, no. 7, pp. 4034–4059, 2015.
  • [24] G. H. Golub and C. F. Van Loan, “Matrix computations,” The Johns Hopkins University Press, Baltimore, USA, 1989.
  • [25] M. Hardt and E. Price, “The noisy power method: A meta algorithm with applications,” in Neur. Info. Proc. Sys. (NeurIPS), 2014, pp. 2861–2869.
  • [26] R. Vershynin, High-dimensional probability: An introduction with applications in data science.   Cambridge University Press, 2018, vol. 47.
  • [27] P.-Å. Wedin, “Perturbation bounds in connection with singular value decomposition,” BIT Numerical Mathematics, vol. 12, no. 1, pp. 99–111, 1972.
  • [28] Y. Chen, Y. Chi, J. Fan, and C. Ma, “Spectral methods for data science: A statistical perspective,” arXiv preprint arXiv:2012.08496, 2020.
  • [29] R. Vershynin, Introduction to the non-asymptotic analysis of random matrices.   Cambridge Univ. Press, Cambridge, 2012.
  • [30] P. Netrapalli, P. Jain, and S. Sanghavi, “Phase retrieval using alternating minimization,” in Neur. Info. Proc. Sys. (NeurIPS), 2013, pp. 2796–2804.
  • [31] T. Cai, X. Li, and Z. Ma, “Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow,” The Annals of Statistics, vol. 44, no. 5, pp. 2221–2251, 2016.
  • [32] L. Erdos, A. Knowles, H. Yau, and J. Yin, “Spectral statistics of erdos–rényi graphs i: Local semicircle law,” The Annals of Probability, vol. 41, no. 3B, pp. 2279–2375, 2013.
  • [33] J. A. Tropp, “User-friendly tail bounds for sums of random matrices,” Found. Comput. Math., vol. 12, no. 4, 2012.

Author Biographies

Seyedehsara Nayer (Email: [email protected]) recently completed her Ph.D. in ECE at Iowa State University. She has an M.S. from Sharif University in Iran. She works as a Senior Engineer at ASML in Santa Clara, CA. Her research interests span various aspects of information science, with a focus on Signal Processing and Statistical Machine Learning.

Namrata Vaswani (Email: [email protected]) received a B.Tech from IIT-Delhi in India in 1999 and a Ph.D. from the University of Maryland, College Park in 2004, both in Electrical Engineering. Since Fall 2005, she has been with Iowa State University, where she is currently the Anderlik Professor of Electrical and Computer Engineering. Her research interests lie in data science, with a particular focus on Statistical Machine Learning and Signal Processing. She has served two terms as an Associate Editor for the IEEE Transactions on Signal Processing; as a lead guest-editor for a 2018 Proceedings of the IEEE Special Issue (Rethinking PCA for modern datasets); and as an Area Editor for the IEEE Signal Processing Magazine (2018-2020). Vaswani is a recipient of the Iowa State Early Career Engineering Faculty Research Award (2014), the Iowa State University Mid-Career Achievement in Research Award (2019) and the University of Maryland’s ECE Distinguished Alumni Award (2019). She also received the 2014 IEEE Signal Processing Society Best Paper Award for her 2010 IEEE Transactions on Signal Processing paper co-authored with her student Wei Lu on “Modified-CS: Modifying compressive sensing for problems with partially known support”. She is an IEEE Fellow (class of 2019).