
Algorithmic Regularization in Model-free Overparametrized Asymmetric Matrix Factorization

Liwei Jiang (School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14850, USA; [email protected])   Yudong Chen (Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706, USA; [email protected])   Lijun Ding (Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI 53706, USA; [email protected])
Abstract

We study the asymmetric matrix factorization problem under a natural nonconvex formulation with arbitrary overparametrization. We consider the model-free setting, with minimal assumptions on the rank or singular values of the observed matrix, for which the global optima provably overfit. We show that vanilla gradient descent with small random initialization sequentially recovers the principal components of the observed matrix. Consequently, when equipped with proper early stopping, gradient descent produces the best low-rank approximation of the observed matrix without explicit regularization. We provide a sharp characterization of the relationship between the approximation error, iteration complexity, initialization size, and stepsize. Our complexity bound is almost dimension-free and depends logarithmically on the approximation error, with significantly more lenient requirements on the stepsize and initialization than in prior work. Our theoretical results accurately predict the behavior of gradient descent, showing good agreement with numerical experiments.

1 Introduction

Let X\in\mathbb{R}^{m\times n} be an arbitrary given matrix. In this paper, we study the following nonconvex objective function for asymmetric matrix factorization

f(F,G):=\frac{1}{2}\|FG^{\top}-X\|_{{\tiny\text{F}}}^{2}\qquad(\mathcal{M})

and the associated vanilla gradient descent dynamic:

\text{initialize}\;(F_{0},G_{0});\;\;\text{run iteration}\;(F_{t+1},G_{t+1})=(F_{t},G_{t})-\eta\nabla f(F_{t},G_{t}),\qquad(\mathcal{GD}-\mathcal{M})

where (F,G)\in\mathbb{R}^{m\times k}\times\mathbb{R}^{n\times k} are the factor variables, k\geq 1 is a user-specified rank parameter, \eta>0 is the stepsize, and \nabla f(F,G)=((FG^{\top}-X)G,(FG^{\top}-X)^{\top}F).
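For concreteness, here is a minimal NumPy sketch of the dynamic (\mathcal{GD}-\mathcal{M}); the function names, default stepsize, and initialization scale below are illustrative choices of ours, not the theoretical prescriptions (which are given in Section 4).

```python
import numpy as np

def grad_f(F, G, X):
    # Gradient of f(F, G) = 0.5 * ||F G^T - X||_F^2 with respect to (F, G).
    R = F @ G.T - X
    return R @ G, R.T @ F

def gd_factorization(X, k, eta=0.1, init_scale=1e-6, iters=1000, seed=0):
    """Vanilla gradient descent (GD-M) from a small random initialization.
    eta and init_scale are illustrative; see Theorem 4.2 for theoretical choices."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = init_scale * rng.standard_normal((m, k))
    G = init_scale * rng.standard_normal((n, k))
    for _ in range(iters):
        dF, dG = grad_f(F, G, X)
        F, G = F - eta * dF, G - eta * dG
    return F, G
```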

In many statistical and machine learning settings [FYY20], the observed matrix XX takes the form X=X+E,X=X_{\natural}+E, where XX_{\natural} is an unknown low-rank matrix to be estimated, and EE is the additive error/noise. Gradient descent applied to the objective (\mathcal{M}) is a natural approach for computing an estimate of XX_{\natural}. Such an estimate can in turn be used as an approximate solution in more complicated nonconvex matrix estimation problems, such as matrix sensing [RFP10], matrix completion [CT10] and even nonlinear problems like the Single Index Model [FYY20]. In fact, the gradient descent procedure (𝒢𝒟\mathcal{GD}-\mathcal{M}) is often used, explicitly or implicitly, as a subroutine in more sophisticated algorithms for these problems. As such, characterizing the dynamics of (𝒢𝒟\mathcal{GD}-\mathcal{M}) provides deep intuition for more general problems and is considered an important first step for understanding various aspects of (linear) neural networks [DHL18, YD21].

Despite the apparent simplicity of the dynamic (\mathcal{GD}-\mathcal{M}), our understanding of its behavior remains limited, especially for general settings of X and k. Results in existing work often apply only to symmetric, exactly low-rank matrices X or to specific choices of k. Many of them impose strong assumptions on the initialization (F_{0},G_{0}), provide only asymptotic analysis, or have order-wise suboptimal error and iteration complexity bounds. We discuss these existing results and the associated challenges in greater detail later.

Of particular interest is the overparametrization regime, which is common in modern machine learning paradigms [HCB+19, KBZ+20, TL19]. In the context of the objective (\mathcal{M}), overparametrization means choosing the rank parameter k to be larger than what is statistically necessary, e.g., k=\min\{m,n\}\gg\mathrm{rank}(X_{\natural}). Doing so, however, necessarily leads to overfitting in general. Indeed, with k=\min\{m,n\}, any global optimum of (\mathcal{M}) is simply (a full factorization of) X itself and overfits the noise/error in X, therefore failing to provide a useful estimate of X_{\natural}. Moreover, as can be seen from the numerical results in Figure 1, gradient descent (\mathcal{GD}-\mathcal{M}) with random initialization asymptotically converges to such a global minimum, with a vanishing training error \|FG^{\top}-X\|_{{\tiny\text{F}}} (dashed lines) but a large test error \|FG^{\top}-X_{\natural}\|_{{\tiny\text{F}}} (solid lines).

Figure 1: Evolution of the training error \|F_{t}G_{t}^{\top}-X\|_{{\tiny\text{F}}} and testing error \|F_{t}G_{t}^{\top}-X_{\natural}\|_{{\tiny\text{F}}} of (\mathcal{GD}-\mathcal{M}) with k=200 under the model X=X_{\natural}+E, where X_{\natural}\in\mathbb{R}^{250\times 200} has rank r=5 and \|X_{\natural}\|_{{\tiny\text{F}}}=1, and the noise matrix E has i.i.d. zero-mean Gaussian entries and expected operator norm \mathbb{E}\|E\|\approx 2\times 10^{-3}. The green lines use small initialization (\|F_{0}\|_{{\tiny\text{F}}},\|G_{0}\|_{{\tiny\text{F}}}\ll\|X_{\natural}\|_{{\tiny\text{F}}}), and the blue lines use moderate initialization (\|F_{0}\|_{{\tiny\text{F}}},\|G_{0}\|_{{\tiny\text{F}}}\asymp\|X_{\natural}\|_{{\tiny\text{F}}}). The red line represents the optimal statistical error \sqrt{r}\|E\|. The left panel zooms in on the first 600 iterations.

A careful inspection of Figure 1, however, reveals an interesting phenomenon: gradient descent with small random initialization achieves a small (and near optimal) test error before it eventually overfits; on the other hand, this behavior is not observed with moderate initialization. We note that similar phenomena have been empirically observed in many other statistical and machine learning problems, where vanilla gradient descent coupled with small random initialization (SRI) and early stopping (ES) has good generalization performance, even with overparametrization [WGL+20, GMMM20, Pre98, WLZ+21, LMZ18, SS21]. This observation motivates us to theoretically characterize, both qualitatively and quantitatively, the behavior of gradient descent (𝒢𝒟\mathcal{GD}-\mathcal{M}) and the algorithmic regularization effect of SRI and ES.

Our results

We relate the dynamics of (\mathcal{GD}-\mathcal{M}) to the best low-rank approximations of X, defined as X_{r}=\operatornamewithlimits{argmin}_{\mathrm{rank}(Y)\leq r}\|Y-X\|_{{\tiny\text{F}}}^{2} for r=1,2,\ldots. Our main results, Theorems 4.1 and 4.2, establish the following:

The iteration (𝒢𝒟\mathcal{GD}-\mathcal{M}) with SRI sequentially approaches the principal components of XX, and proper ES produces the best low-rank approximation of XX.

Specifically, we show that for each r[1,k]r\in[1,k], there exists a (range of) stopping time tt such that (𝒢𝒟\mathcal{GD}-\mathcal{M}) terminated at iteration tt produces an approximate XrX_{r}. Moreover, we provide a sharp characterization of the scaling relationship between the approximation error, iteration complexity, initialization size and stepsize. This quantitative characterization agrees well with numerical experiments.

It is known that under many statistical models where XX is an observed noisy version of some structured matrix XX_{\natural}, the matrix XrX_{r} with an appropriate rr is a statistically optimal estimator of XX_{\natural} [Cha15, FYY20]. Our results thus imply that gradient descent with SRI and ES learns such an optimal estimate, even with overparametrization and no explicit regularization. In fact, our results are more general, applicable to any observed matrix XX and not tied to the existence of a ground truth XX_{\natural}.

We emphasize that we do not claim that vanilla gradient descent is a more efficient way for computing XrX_{r} for a given rr than standard numerical procedures (e.g., via singular value decomposition). Rather, our focus is to show, rigorously and quantitatively, that overparametrized gradient descent, a common and fundamental algorithmic paradigm, has an implicit regularization effect characterized by a deep connection to the computation of best low-rank approximation.

Analysis and challenges

Our analysis elucidates the mechanism that gives rise to the algorithmic regularization effect of small initialization and early stopping: starting from a small initial F0G0F_{0}G_{0}^{\top}, the singular values of the iterate FtGtF_{t}G_{t}^{\top} approach those of XX at geometrically different rates, hence FtGtF_{t}G_{t}^{\top} approximates X1,X2,X_{1},X_{2},\ldots sequentially (we elaborate in Section 3). While the intuition is simple, a rigorous and sharp analysis is highly non-trivial, due to the following challenges:

(i) Model-free setting. Most existing work assumes that XX is (exactly or approximately) low-rank with a sufficiently large singular value gap δ\delta between the rr-th and (r+1)(r+1)-th singular values of XX [LMZ18, ZKHC21, YD21, FYY20, SS21]. We allow for an arbitrary nonzero δ\delta and characterize its impact on the stopping time and final error. In this setting, the “signal” XrX_{r} may have magnitude arbitrarily close to that of the “noise” XXrX-X_{r} even in operator norm.

(ii) Asymmetry. Since the objective (\mathcal{M}) is invariant under the rescaling (F,G)(cF,c1G)(F,G)\to(cF,c^{-1}G), the magnitudes of FF and GG may be highly imbalanced, in which case the gradient has a large Lipschitz constant. This issue is well recognized to be a primary difficulty in analyzing the gradient dynamics (𝒢𝒟\mathcal{GD}-\mathcal{M}) even without overparametrization [DHL18]. Most previous works either restrict to the symmetric positive semidefinite formulation [LMZ18, ZKHC21, SS21, ZFZ21, MF21, DJC+21, Zha21], or add an explicit regularization term of the form FFGGF2\|F^{\top}F-G^{\top}G\|_{{\tiny\text{F}}}^{2} [TBS+16, ZL16].

(iii) Trajectory analysis and cold start. As the desired XrX_{r} is not a local minimizer of (\mathcal{M}), our analysis is inherently trajectory-based and initialization-specific.

Random initialization leads to a cold start: the initial iterates FtGtF_{t}G_{t}^{\top} are far from and nearly orthogonal to XrX_{r} when the dimensions (m,n)(m,n) are large. More precisely, assuming the SVD Xr=UΣVX_{r}=U\Sigma V^{\top}, one may measure the relative signal strength of (Ft,Gt)(F_{t},G_{t}) by

\texttt{sig}(F_{t},G_{t}):=\frac{\min\{\sigma_{r}(U^{\top}F_{t}),\sigma_{r}(V^{\top}G_{t})\}}{\max\{\sigma_{1}((I-UU^{\top})F_{t}),\sigma_{1}((I-VV^{\top})G_{t})\}},\qquad(1.1)

which is the ratio between the projections of F_{t} and G_{t} onto the column/row space of X_{r} and the projections onto the complementary space. Most existing work requires \texttt{sig}(F_{0},G_{0}) to be larger than a universal constant [FYY20, LMZ18], which does not hold when k,r=o(m).¹ (¹For F_{0},G_{0} with i.i.d. standard Gaussian entries, we have w.h.p. \texttt{sig}(F_{0},G_{0})\lesssim\frac{\sqrt{k}+\sqrt{r-1}}{|\sqrt{m-r}-\sqrt{k}|} for m\asymp n.)
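As an illustration of the cold-start issue, the following sketch evaluates \texttt{sig}(F,G) in (1.1); the helper name and the use of NumPy's SVD are our own choices, and U, V are assumed to have orthonormal columns spanning the top-r left/right singular subspaces.

```python
import numpy as np

def sig(F, G, U, V):
    """Relative signal strength (1.1). U (m x r) and V (n x r) are assumed to
    have orthonormal columns spanning the top-r left/right singular subspaces."""
    r = U.shape[1]
    num = min(np.linalg.svd(U.T @ F, compute_uv=False)[r - 1],
              np.linalg.svd(V.T @ G, compute_uv=False)[r - 1])
    den = max(np.linalg.norm(F - U @ (U.T @ F), 2),
              np.linalg.norm(G - V @ (V.T @ G), 2))
    return num / den

# Example: for a Gaussian initialization with m = n = 1000, k = 20, r = 5,
# sig(F0, G0) is typically well below 1, consistent with footnote 1.
rng = np.random.default_rng(0)
m, n, k, r = 1000, 1000, 20, 5
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
F0, G0 = rng.standard_normal((m, k)), rng.standard_normal((n, k))
print(sig(F0, G0, U, V))   # roughly of order (sqrt(k)+sqrt(r))/sqrt(m)
```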

(iv) General rank overparametrization. Our result holds for any choice of kk with krk\geq r. The work in [YD21, MLC21] only considers the exact-parametrization setting k=rk=r. The work [FYY20] assumes the specific choice k=2m+2nk=2m+2n; as mentioned in the previous paragraph, the setting with a smaller kk involves additional challenges due to cold-start.

2 Related work

The literature on gradient descent for matrix factorization is vast; see [CC18, CLC19] for a survey. Most prior work focuses on the exact parametrization setting k=rk=r (where rr is the target rank or the rank of a ground truth matrix XX_{\natural}) and requires an explicit regularizer FFGGF2\|F^{\top}F-G^{\top}G\|_{{\tiny\text{F}}}^{2}. More recent work in [DJC+21, DHL18, FYY20, LMZ18, MLC21, MF21, SS21, YD21, ZFZ21, Zha21, ZKHC21], discussed in the last section, studies (overparametrized) matrix factorization and implicit regularization. Below we discuss recent results that are most related to ours.

The work [YD21] also provides recovery guarantees for vanilla gradient descent (𝒢𝒟\mathcal{GD}-\mathcal{M}) with random small initialization. Their result only applies to the setting where the matrix XX has exactly rank rr and k=rk=r, i.e., with exact parametrization. Moreover, their choice of stepsize is quite conservative and consequently their iteration complexity scales proportionally with dimension m+nm+n. In comparison, we allow for significantly larger stepsizes and establish almost dimension-free iteration complexity bounds.

The work [FYY20] considers a wide range of statistical problems with a symmetric ground truth matrix XX_{\natural} and shows that XX_{\natural} can be recovered with near optimal statistical errors using gradient descent for (\mathcal{M}) with FGFG^{\top} replaced by FFGGFF^{\top}-GG^{\top}. While one may translate their results to the asymmetric setting via a dilation argument, doing so requires the specific rank parametrization k=2m+2nk=2m+2n. This restriction allows for a decoupling of the dynamics of different singular values, which is essential to their analysis. While this decoupled setting provides intuition for the general setting (as we elaborate in Section 3), the same analysis no longer applies for other values of kk, e.g., k=2m+2n1k=2m+2n-1, for which the decoupling ceases to hold. Moreover, a smaller value of kk leads to the cold start issue, as discussed in footnote 1.

The work in [CGMR20] studies the deep matrix factorization problem. While on a high level their results deliver a message similar to our work, namely that gradient descent sequentially approaches the principal components of X, the technical details differ significantly. In particular, their results only apply to symmetric X and guarantee recovery of the positive semidefinite part of X. Their analysis relies crucially on the assumption k=m=n, a specific identity initialization scheme, and the resulting decoupled dynamics, which do not hold in the general setting as discussed above. A major contribution of our work lies in handling the entanglement of singular values resulting from general overparametrization, asymmetry, and random initialization.

In Section C, we provide additional discussion on related work.

3 Intuitions and the symmetric setting

In this section, we illustrate the behavior of gradient descent with small random initialization in the setting with a symmetric XX, and explain the challenges for generalizing to asymmetric XX.

Consider a simple example where m=n=k=2, and X={\rm diag}(\lambda_{1},\lambda_{2}) is a positive semidefinite diagonal matrix that is approximately rank-1, i.e., \lambda_{2}\leq\frac{1}{10}\lambda_{1}. We consider the natural symmetric objective f(F)=\frac{1}{4}\|FF^{\top}-X\|_{{\tiny\text{F}}}^{2} and the associated gradient descent dynamic F_{t+1}=F_{t}-\eta(F_{t}F_{t}^{\top}-X)F_{t},² with initialization F_{0}=\rho\sqrt{\lambda_{1}}I for some small \rho>0. (²This is equivalent to (\mathcal{GD}-\mathcal{M}) with initialization F_{0}=G_{0}.)

It is easy to see that FtF_{t} is diagonal for all t0t\geq 0, and the ii-th diagonal element of FtF_{t}, denoted by fi,tf_{i,t}, is updated as

f_{i,t+1}=f_{i,t}(1+\eta\lambda_{i}-\eta f_{i,t}^{2}),\quad i=1,2.\qquad(3.1)

Thus, the dynamics of the two eigenvalues decouple and can be analyzed separately. In particular, simple algebra shows that (i) when f1,t<λ12f_{1,t}<\sqrt{\frac{\lambda_{1}}{2}}, f1,tf_{1,t} increases geometrically by a factor of 1+ηλ1ηf1,t21+ηλ121+\eta\lambda_{1}-\eta f_{1,t}^{2}\geq 1+\frac{\eta\lambda_{1}}{2}, i.e., |f1,t+1|(1+ηλ12)|f1,t||f_{1,t+1}|\geq(1+\frac{\eta\lambda_{1}}{2})|f_{1,t}|, and (ii) when f1,tλ12f_{1,t}\geq\sqrt{\frac{\lambda_{1}}{2}}, f1,tf_{1,t} converges to λ1\sqrt{\lambda_{1}} geometrically with a factor of (1ηλ12)(1-\frac{\eta\lambda_{1}}{2}), i.e., |f1,t+1λ1|(1ηλ12)|f1,tλ1|.|f_{1,t+1}-\sqrt{\lambda_{1}}|\leq(1-\frac{\eta\lambda_{1}}{2})|f_{1,t}-\sqrt{\lambda_{1}}|. In a similar fashion, f2,tf_{2,t} converges to the second eigenvalue λ2\sqrt{\lambda_{2}}. It follows that the gradient descent iterate FtFtF_{t}F_{t}^{\top} converges to the observed matrix XX as tt goes to infinity.

What makes a difference, however, is that f2,tf_{2,t} converges at an exponentially slower rate than f1,tf_{1,t}. In particular, assuming the stepsize η\eta is sufficiently small, we can show that f2,tf_{2,t} is always nonnegative and satisfies f2,t+1(1+ηλ110)f2,tf_{2,t+1}\leq(1+\frac{\eta\lambda_{1}}{10})f_{2,t}. Note that the growth factor 1+ηλ1101+\frac{\eta\lambda_{1}}{10} is smaller than 1+ηλ121+\frac{\eta\lambda_{1}}{2}, the growth factor for f1,tf_{1,t}. We conclude that

With small initialization, larger eigenvalues converge (exponentially) faster. (3.2)

Thanks to the property (3.2), the gradient descent trajectory approaches the principal components of X one by one. In particular, if the initial size \rho is sufficiently small, then (3.2) implies the existence of a time window during which f_{1,t} is close to \sqrt{\lambda_{1}} while f_{2,t} remains close to its initial value (i.e., close to 0). If we terminate at a time t within this window, then the gradient descent output satisfies F_{t}F_{t}^{\top}\approx X_{1}. If we continue the iteration, then f_{2,t} eventually grows away from 0 and converges quickly to \sqrt{\lambda_{2}}, yielding F_{t}F_{t}^{\top}\approx X_{2}.
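A few lines of NumPy suffice to see this separation of time scales in the scalar dynamics (3.1); the parameter values below (\lambda_{1}=1, \lambda_{2}=0.05, \eta=0.1, initialization 10^{-6}) are illustrative choices of ours.

```python
import numpy as np

# Simulate the decoupled scalar dynamics (3.1); here lambda2 <= lambda1 / 10.
lam = np.array([1.0, 0.05])
eta = 0.1
f = np.full(2, 1e-6)                 # small initialization f_{i,0}
for t in range(1, 501):
    f = f * (1 + eta * lam - eta * f**2)
    if t in (100, 200, 300, 400, 500):
        print(t, np.round(f**2, 4))  # f_1^2 reaches lambda_1 long before f_2^2 leaves 0
```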

In Section C, we generalize the above argument to any m×mm\times m symmetric positive semi-definite XX via a diagonalization argument, which shows that FtFtF_{t}F_{t}^{\top} approaches X1,X2,X3,X_{1},X_{2},X_{3},\ldots sequentially. However, this simple derivation breaks down for general rectangular FtF_{t} and XX, since the dynamics of eigenvalues no longer decouple as in (3.1); rather, they have complicated dependence on each other, as can be seen in the proof of our main theorems. One of the main contributions of our proof is to rigorously establish the property (3.2) despite this complicated dependence.

4 Main results and analysis

In this section, we present our main theorems on the trajectory of gradient descent under small random initialization and early stopping. We also outline the analysis, deferring the full proof to the appendices.

4.1 Main theorems

Recall that X\in\mathbb{R}^{m\times n} is a general rectangular matrix; denote its singular values by \sigma_{1}\geq\cdots\geq\sigma_{\min\{m,n\}}\geq 0. Fix r\in[1,\min\{k,m,n\}] and define the r-th condition number \kappa_{r}:=\frac{\sigma_{1}}{\sigma_{r}}. Suppose that the gradient descent dynamic (\mathcal{GD}-\mathcal{M}) is initialized with (F_{0},G_{0})=\frac{\rho}{3\sqrt{m+n+k}}(\tilde{F}_{0},\tilde{G}_{0}), where \rho>0 is a size parameter and \tilde{F}_{0}\in\mathbb{R}^{m\times k},\tilde{G}_{0}\in\mathbb{R}^{n\times k} have i.i.d. N(0,\sigma_{1}) entries. The operator norm is denoted by \|\cdot\|.
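The following sketch implements this initialization; the function name is ours, and N(0,\sigma_{1}) is read as a Gaussian with variance \sigma_{1}.

```python
import numpy as np

def small_random_init(m, n, k, sigma1, rho, rng=None):
    """(F0, G0) = rho / (3*sqrt(m+n+k)) * (F~0, G~0), where F~0, G~0 have
    i.i.d. N(0, sigma1) entries (i.e., variance sigma1)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = rho / (3.0 * np.sqrt(m + n + k))
    F0 = scale * rng.normal(0.0, np.sqrt(sigma1), size=(m, k))
    G0 = scale * rng.normal(0.0, np.sqrt(sigma1), size=(n, k))
    return F0, G0
```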

Below we state two theorems under one of the following assumptions:

Assumption A.

The first r+1r+1 singular values are distinct, i.e., σ1>>σr>σr+1\sigma_{1}>\cdots>\sigma_{r}>\sigma_{r+1}.

Assumption B.

The rr-th and (r+1)(r+1)-th singular values are distinct, i.e., σr>σr+1.\sigma_{r}>\sigma_{r+1}.

Both theorems are high probability statements (w.r.t. random initialization); we refer to Theorem A.1 in the appendix for the precise value of the probability.

Our first theorem shows that under Assumption A, gradient descent with small random initialization approaches the principal components X1,X2,,XrX_{1},X_{2},\ldots,X_{r} sequentially.

Theorem 4.1.

Suppose that Assumption A holds and let \underline{\delta} be any number in (0,1) such that \underline{\delta}\leq\min_{1\leq s\leq r}\{\frac{\sigma_{s}-\sigma_{s+1}}{\sigma_{s}}\}. Fix any tolerance \epsilon\leq\frac{1}{m+n+k}. Then, there exist numerical constants c,c^{\prime} and a sequence of iteration indices

T^{(1)}\leq T^{(2)}\leq\cdots\leq T^{(r)}\leq\frac{c^{\prime}}{\underline{\delta}\eta\sigma_{r}}\log\left(\frac{\kappa_{r}}{\underline{\delta}\epsilon}\right)

such that with high probability, the iterates of gradient descent (\mathcal{GD}-\mathcal{M}) with stepsize \eta\leq c\min\{\underline{\delta},1-\underline{\delta}\}\frac{\sigma_{r}^{2}}{\sigma_{1}^{3}} and initialization size \rho\leq\left(\frac{c\underline{\delta}\epsilon}{\kappa_{r}}\right)^{\frac{1}{c\underline{\delta}}} satisfy

\|F_{T^{(s)}}G_{T^{(s)}}^{\top}-X_{s}\|\leq\epsilon\sigma_{1},\quad\forall s=1,2,\ldots,r.

Our next, more precise theorem applies under Assumption B and shows that there is in fact a range of iterations at which gradient descent approximates X_{r}. Note that the first theorem can be derived by applying the second theorem to each r=1,2,\ldots.

Theorem 4.2.

Suppose that Assumption B holds and let \underline{\delta} be any number in (0,1) such that \underline{\delta}\leq\delta:=\frac{\sigma_{r}-\sigma_{r+1}}{\sigma_{r}}. Fix any tolerance \epsilon\leq\frac{1}{m+n+k}. The following holds with high probability for some numerical constants c,c^{\prime}. Consider gradient descent (\mathcal{GD}-\mathcal{M}) with stepsize \eta\leq c\min\{\underline{\delta},1-\underline{\delta}\}\frac{\sigma_{r}^{2}}{\sigma_{1}^{3}} and initialization size \rho\leq\left(\frac{c\underline{\delta}\epsilon}{\kappa_{r}}\right)^{\frac{1}{c\underline{\delta}}}. Let T=\Big\lfloor\frac{\log(\rho^{\frac{\underline{\delta}}{2(2-\underline{\delta})}}/\rho)}{\log(1+(1-\underline{\delta})\eta\sigma_{r})}\Big\rfloor, which is O\left(\frac{1}{\underline{\delta}\eta\sigma_{r}}\log\left(\frac{\kappa_{r}}{\underline{\delta}\epsilon}\right)\right) for \rho=\left(\frac{c\underline{\delta}\epsilon}{\kappa_{r}}\right)^{\frac{1}{c\underline{\delta}}}. Then, for all t such that (1-c^{\prime}\underline{\delta})T\leq t\leq T, we have

\|F_{t}G_{t}^{\top}-X_{r}\|\leq\epsilon\sigma_{1}.\qquad(4.1)

We highlight that Theorem 4.2 applies to any observed matrix X with a nonzero singular value gap \delta,³ which can be arbitrarily small; we refer to this as the model-free setting. (³Otherwise X_{r} is not uniquely defined.) Moreover, Theorem 4.2 quantifies the relationship between the various problem parameters: if X has a relative singular value gap \delta, then gradient descent with initialization size \rho\lesssim\epsilon^{\frac{1}{\delta}} and early stopping at iteration O\left(\frac{1}{\delta}\log\frac{1}{\epsilon}\right) outputs X_{r} up to an \epsilon error (we ignore logarithmic terms).
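To make the dependence concrete, the helper below evaluates the parameter choices suggested by Theorem 4.2 with the unspecified numerical constants c, c' simply set to 1; this is an illustration of the scaling, not a tuned recipe.

```python
import numpy as np

def schedule_from_theorem_4_2(sigma1, sigmar, delta, eps, c=1.0):
    """Stepsize bound, initialization size, and stopping index T suggested by
    Theorem 4.2, with the numerical constants replaced by c (an assumption)."""
    kappa = sigma1 / sigmar
    eta = c * min(delta, 1.0 - delta) * sigmar**2 / sigma1**3
    rho = (c * delta * eps / kappa) ** (1.0 / (c * delta))
    T = int(np.log(rho ** (delta / (2 * (2 - delta))) / rho)
            / np.log(1 + (1 - delta) * eta * sigmar))
    return eta, rho, T

# Example: sigma1 = 1, sigma_r = 0.5, gap delta = 0.2, target error eps = 1e-3.
print(schedule_from_theorem_4_2(1.0, 0.5, 0.2, 1e-3))
```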

We make two remarks regarding the tightness of the above parameter dependence.

  • Final error, initialization size, and eigen gap: The final error ϵ\epsilon can be made arbitrarily small as long as the initialization size ρ\rho is sufficiently small. Moreover, as verified by our numerical experiments in Section 5, the scaling ρϵ1δ\rho\lesssim\epsilon^{\frac{1}{\delta}} predicted by Theorem 4.2 is quite accurate.

  • Iteration complexity and stepsize: The number of iterations needed for an ϵ\epsilon error scales as log(1/ϵ)\log(1/\epsilon), which is akin to a geometric/linear convergence rate. Moreover, our stepsize and iteration complexity are independent of the dimension m,nm,n (up to log factors), both of which improve upon existing results in [YD21], which requires a significantly smaller stepsize and hence an iteration complexity proportional to (m+n)2(m+n)^{2}. Again, our dimension-independent scaling agrees well with the numerical results in Section 5.

4.2 Proof Sketch for Theorem 4.2

We sketch the main ideas for proving Theorem 4.2 and discuss our main technical innovations. Our proof is inspired by the work [YD21], which studies the setting with low-rank XX and exact parametrization k=r=rank(X)k=r=\text{rank}(X).

We start by simplifying the problem using the singular value decomposition (SVD) X=ΦΣXΨX=\Phi\Sigma_{X}\Psi^{\top} of XX, where Φm×m,ΣXm×n\Phi\in\mathbb{R}^{m\times m},\Sigma_{X}\in\mathbb{R}^{m\times n} and Ψn×n\Psi\in\mathbb{R}^{n\times n}. By replacing FF, GG with ΦF\Phi^{\top}F, ΨG\Psi^{\top}G, respectively, we may assume without loss of generality that XX is diagonal. The distribution of the initial iterate (F0,G0)(F_{0},G_{0}) remains the same thanks to the rotational invariance of Gaussian. Hence, the gradient descent update (𝒢𝒟\mathcal{GD}-\mathcal{M}) becomes

F_{+}=F+\eta(\Sigma_{X}-FG^{\top})G,\quad\text{and}\quad G_{+}=G+\eta(\Sigma_{X}-FG^{\top})^{\top}F,\qquad(4.2)

where the subscript + indicates the next iterate and will be used throughout the rest of the paper. Let U be the upper r\times k submatrix of F and J be the lower (m-r)\times k submatrix of F. Similarly, let V be the upper r\times k submatrix of G and K be the lower (n-r)\times k submatrix of G. Also let \Sigma={\rm diag}(\sigma_{1},\ldots,\sigma_{r}) be the upper left r\times r submatrix of \Sigma_{X}, and let \tilde{\Sigma}\in\mathbb{R}^{(m-r)\times(n-r)} be a diagonal matrix with \sigma_{r+1},\ldots,\sigma_{\min\{m,n\}} on the diagonal. The gradient descent update (4.2) induces the following update for the “signal” matrices U,V and the “error” matrices J,K:

\begin{cases}U^{+}=U+\eta\Sigma V-\eta U(V^{\top}V+K^{\top}K)\\ V^{+}=V+\eta\Sigma U-\eta V(U^{\top}U+J^{\top}J)\\ J^{+}=J+\eta\tilde{\Sigma}K-\eta J(V^{\top}V+K^{\top}K)\\ K^{+}=K+\eta\tilde{\Sigma}^{\top}J-\eta K(U^{\top}U+J^{\top}J).\end{cases}\qquad(4.3)
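The block form (4.3) can be checked numerically; the sketch below (with small illustrative dimensions of our own choosing) compares one step of (4.2) for F with the corresponding block update.

```python
import numpy as np

# Sanity check (illustrative sizes): one step of (4.2) for F agrees with the
# block update (4.3) for U and J; the check for G, V, K is analogous.
rng = np.random.default_rng(1)
m, n, k, r, eta = 6, 5, 4, 2, 0.01
Sigma_X = np.zeros((m, n))
np.fill_diagonal(Sigma_X, [3.0, 2.0, 0.5, 0.2, 0.1])          # singular values of X
F, G = 0.1 * rng.standard_normal((m, k)), 0.1 * rng.standard_normal((n, k))
U, J, V, K = F[:r], F[r:], G[:r], G[r:]
Sigma, Sigma_tilde = Sigma_X[:r, :r], Sigma_X[r:, r:]

F_plus = F + eta * (Sigma_X - F @ G.T) @ G                    # update (4.2)
U_plus = U + eta * Sigma @ V - eta * U @ (V.T @ V + K.T @ K)  # update (4.3)
J_plus = J + eta * Sigma_tilde @ K - eta * J @ (V.T @ V + K.T @ K)
print(np.allclose(F_plus, np.vstack([U_plus, J_plus])))       # True
```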

We may bound the difference FG^{\top}-X_{r} as follows:

\|FG^{\top}-X_{r}\|\leq\|UV^{\top}-\Sigma\|+\|UK^{\top}\|+\|JV^{\top}\|+\|JK^{\top}\|.

Hence, it suffices to show that the signal term UVUV^{\top} converges to Σ\Sigma and the error term (J,K)(J,K) remains small. To account for the potential imbalance of UU and VV, we introduce the following quantities using the symmetrization idea in [YD21]:

A=\frac{U+V}{2},\quad B=\frac{U-V}{2},\quad P=\Sigma-AA^{\top}+BB^{\top},\quad\text{and}\quad Q=AB^{\top}-BA^{\top}.\qquad(4.4)

(Similarly define A_{t},B_{t},P_{t},Q_{t} based on the t-th iterates.) Here A is the symmetrized part of the signal terms U,V, and B represents the imbalance between them. Since \Sigma-UV^{\top}=P+Q, the quantities P and Q capture how far the signal term is from the true signal \Sigma. Let T be the iteration index defined in Theorem 4.2. Our proof studies three phases of lengths T_{1},T_{2},T_{3} within these T iterations, where T_{1}+T_{2}+T_{3}\leq T. The proof consists of three steps:

  • Step 1. We use induction on tt to show that the error term (Jt,Kt)(J_{t},K_{t}) and the imbalance term BtB_{t} remain small throughout the TT iterations.

  • Step 2. We characterize the evolution of the smallest singular value \sigma_{r}(A_{t}). After the first T_{1} iterations, the value \sigma_{r}(A_{t}) dominates the errors. Then, after T_{2} more iterations, \sigma_{r}(A_{t}) grows above the threshold 0.8\sqrt{\sigma_{r}}, at which point the signal term UV^{\top} has magnitude close to that of the true signal \Sigma.

  • Step 3. After σr(At)\sigma_{r}(A_{t}) becomes sufficiently large, we show that Pt\|P_{t}\| decreases geometrically. After T3T_{3} more iterations, Pt\|P_{t}\| has the same magnitude as the error terms Jt,KtJ_{t},K_{t}. Theorem 4.2 then follows by bounding FtGtXr\|F_{t}G_{t}^{\top}-X_{r}\| in terms of Pt,Qt,Jt,KtP_{t},Q_{t},J_{t},K_{t}.
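As a quick check on these definitions, the snippet below computes A, B, P, Q from (4.4) for random U, V and verifies the identity \Sigma-UV^{\top}=P+Q used above; the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
r, k = 3, 8
Sigma = np.diag([3.0, 2.0, 1.0])
U, V = rng.standard_normal((r, k)), rng.standard_normal((r, k))
A, B = (U + V) / 2, (U - V) / 2
P = Sigma - A @ A.T + B @ B.T
Q = A @ B.T - B @ A.T
print(np.allclose(Sigma - U @ V.T, P + Q))   # True: Sigma - U V^T = P + Q
```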

Our analysis of PtP_{t} and BtB_{t} departs from that in [YD21], which requires the stepsize η\eta to depend on both problem dimension and initialization size, resulting in an iteration complexity that has polynomial dependence on the dimension. The better dependence in our Theorem 4.2 is achieved using the following new techniques:

  • 1.

    Unlike [YD21] which bounds Bt,Jt,Kt{B_{t}},{J_{t}},{K_{t}} by quantities independent of tt, we control them using geometrically increasing series, which are tighter and more accurately capture the dynamics of Bt,Jt,Kt{B_{t}},{J_{t}},{K_{t}} across tt.

  • 2.

    Our analysis decouples the choices of the stepsize \eta and the initialization size \rho, allowing them to be chosen independently. We do so by making the crucial observation that the desired lower bound on \lambda_{r}(P_{t}) depends only on \sigma_{r} and the singular value gap \delta. As a result, we can take a very small initialization size \rho (in both theory and experiments), since the iteration complexity depends only logarithmically on \rho. In contrast, the desired lower bound on \lambda_{r}(P_{t}) in [YD21] depends on the initialization size \rho, and in turn their stepsize \eta has a more stringent dependence on \rho.

  • 3.

    The analysis in [YD21] cannot be easily generalized to the overparametrized setting, since in this case the error terms Jt,Kt\|J_{t}\|,\|K_{t}\| no longer decay geometrically when they are within a small neighborhood of zero—to our best knowledge, tightly characterizing this local convergence behavior in the overparametrized setting remains an open problem. We circumvent this difficulty by using a smaller initialization size ρ\rho, which is made possible by the tighter analysis outlined in the last bullet point.

5 Experiments

We present numerical experiments that corroborate our theoretical findings on gradient descent with small initialization and early stopping. In Section 5.1, we provide numerical results that demonstrate the dynamics and algorithmic regularization of gradient descent. In Section 5.2, we numerically verify our theoretical prediction on the scaling relationship between the initialization size, stepsize, iteration complexity and final error.

5.1 Dynamics of gradient descent

We generate a random rank-10 matrix X\in\mathbb{R}^{m\times n} with m=250,n=200 and \|X\|_{{\tiny\text{F}}}=1, and run gradient descent (\mathcal{GD}-\mathcal{M}) with initialization size \rho=10^{-6}, stepsize \eta=0.5 and rank parameter k=200. As shown in Figure 2a, the top singular values \sigma_{i}(F_{t}G_{t}^{\top}) of the iterate grow towards those of X sequentially, at geometrically different rates. Consequently, for each r=1,2,\ldots, the iterate F_{t}G_{t}^{\top} is close to the best rank-r approximation X_{r} when \sigma_{i}(F_{t}G_{t}^{\top})\approx\sigma_{i}(X) for 1\leq i\leq r and \sigma_{i}(F_{t}G_{t}^{\top})\approx 0 for i\geq r+1; early stopping of gradient descent at this time outputs X_{r}.
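A scaled-down version of this experiment can be reproduced in a few lines; the dimensions and iteration counts below are reduced from those in the paper for speed and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r_true, k, rho, eta = 100, 80, 10, 80, 1e-6, 0.5
X = rng.standard_normal((m, r_true)) @ rng.standard_normal((r_true, n))
X /= np.linalg.norm(X, 'fro')                       # rank-10, ||X||_F = 1
F = rho * rng.standard_normal((m, k))
G = rho * rng.standard_normal((n, k))
for t in range(1, 501):
    R = F @ G.T - X
    F, G = F - eta * R @ G, G - eta * R.T @ F
    if t % 100 == 0:
        s = np.linalg.svd(F @ G.T, compute_uv=False)
        # larger singular values of F G^T approach those of X earlier
        print(t, np.round(s[:5], 3))
```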

(a) The dashed lines track the evolution of singular values of the iterate FtGtF_{t}G_{t}^{\top}. The solid lines record the Frobenius norm distance between FtGtF_{t}G_{t}^{\top} and X1,,X4X_{1},\ldots,X_{4}, the leading components of XX.
(b) Convergence rate of FtGtXF\|F_{t}G_{t}^{\top}-X\|_{{\tiny\text{F}}} for gradient descent with small (ρ{104,106,108}XF\rho\in\{10^{-4},10^{-6},10^{-8}\}\|X\|_{{\tiny\text{F}}}) and moderate (ρ=XF\rho=\|X\|_{{\tiny\text{F}}}) initialization.
Figure 2: Dynamics of gradient descent with small random initialization and early stopping.

In Figure 2b, we compare the convergence rates of gradient descent with small (ρ{102,104,106}\rho\in\{10^{-2},10^{-4},10^{-6}\}) and moderate (ρ=1\rho=1) initialization sizes. We see that using a small ρ\rho results in fast convergence to a small error level; moreover, the convergence rate is geometric-like (before saturation), which is consistent with the log(1/ϵ)\log(1/\epsilon) iteration complexity predicted by Theorem 4.2. Therefore, compared to moderate initialization, small initialization has both computational benefit (in terms of iteration complexity) and a statistical regularization effect (when coupled with early stopping, as demonstrated in Figure 1).

(a) Relationship between ϵ\epsilon and ρ\rho
(b) Relationship between ϵ\epsilon and δ\delta
(c) Relationship between T0T_{0} and δ\delta
(d) ϵ\epsilon vs. (m,n)(m,n) for fixed η=0.25\eta=0.25
Figure 3: Scaling relationship between relative Frobenius norm error ϵ\epsilon, initialization size ρ\rho, relative singular value gap δ\delta, iteration complexity T0T_{0} and stepsize η\eta.

5.2 Parameter dependence

Theorem 4.2 predicts that if XX has relative singular value gap δ=σrσr+1σr\delta=\frac{\sigma_{r}-\sigma_{r+1}}{\sigma_{r}}, then gradient descent outputs XrX_{r} with a relative error FtGtXrF/XrF=ϵ\|F_{t}G_{t}^{\top}-X_{r}\|_{{\tiny\text{F}}}/\|X_{r}\|_{{\tiny\text{F}}}=\epsilon in T0=O(1δlog1ϵ)T_{0}=O\left(\frac{1}{\delta}\log\frac{1}{\epsilon}\right) iterations when using an initialization size ρϵ1δ\rho\asymp\epsilon^{\frac{1}{\delta}} and a stepsize nearly independent of the dimension of XX. We numerically verify these scaling relationships.

In all experiments, we fix r=3, stepsize \eta=0.25 and rank parametrization k=10, and generate a matrix X\in\mathbb{R}^{m\times n} with \sigma_{3}=1. We vary the dimension m=n, the relative singular value gap \delta and the condition number \kappa=\frac{\sigma_{1}}{\sigma_{r}}, and record the smallest relative error \epsilon attained within 500 iterations as well as the iteration index T_{0} at which this error is attained. The results are averaged over ten runs with randomly generated X and shown in Figures 3a-3d.

The results in Figure 3a verify the relation \log\epsilon\propto\log\rho for fixed \delta. We believe the flat part of the curves (for \rho<10^{-15}) is due to numerical precision limits. Figure 3b verifies \log\epsilon\propto-\delta for fixed \rho. Figure 3c verifies T_{0}\propto\frac{1}{\delta}\log(\frac{1}{\epsilon}). In all these plots, the curves for different dimensions m=n overlap, which is consistent with the (near) dimension-independent results in Theorem 4.2. Finally, Figure 3d shows that with a single fixed stepsize \eta=0.25, the error \epsilon is largely independent of the dimension m=n for different values of (\rho,\delta,\kappa). This supports the prediction of Theorem 4.2 that the stepsize can be chosen independently of the dimension.

6 Discussion

In this paper, we characterize the dynamics of overparametrized vanilla gradient descent for asymmetric matrix factorization. We show that with sufficiently small random initialization and proper early stopping, gradient descent produces an iterate arbitrarily close to the best rank-rr approximation XrX_{r} for any rkr\leq k so long as the singular values σr\sigma_{r} and σr+1\sigma_{r+1} are distinct. Our theoretical results quantify the dependency between various problem parameters and match well with numerical experiments. Interesting future directions include extension to the matrix sensing/completion problems with asymmetric matrices, as well as understanding and capitalizing on the algorithmic regularization effect of overparametrized gradient descent in more complicated, nonlinear statistical models.

Acknowledgement

Y. Chen is partially supported by National Science Foundation grants CCF-1704828 and CCF-2047910. L. Ding is supported by National Science Foundation grant CCF-2023166.

References

  • [CC18] Yudong Chen and Yuejie Chi. Harnessing structures in big data via guaranteed low-rank matrix estimation: Recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal Processing Magazine, 35(4):14–31, 2018.
  • [CGMR20] Hung-Hsu Chou, Carsten Gieshoff, Johannes Maly, and Holger Rauhut. Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank. arXiv preprint arXiv:2011.13772, 2020.
  • [Cha15] Sourav Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015.
  • [CLC19] Yuejie Chi, Yue M Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269, 2019.
  • [CT10] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
  • [DHL18] Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint arXiv:1806.00900, 2018.
  • [DJC+21] Lijun Ding, Liwei Jiang, Yudong Chen, Qing Qu, and Zhihui Zhu. Rank overspecified robust matrix recovery: Subgradient method and exact recovery. arXiv preprint arXiv:2109.11154, 2021.
  • [FYY20] Jianqing Fan, Zhuoran Yang, and Mengxin Yu. Understanding implicit regularization in over-parameterized nonlinear statistical model. arXiv preprint arXiv:2007.08322, 2020.
  • [GMMM20] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems, 33:14820–14830, 2020.
  • [HCB+19] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32:103–112, 2019.
  • [KBZ+20] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 491–507. Springer, 2020.
  • [LMZ18] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47. PMLR, 2018.
  • [MF21] Jianhao Ma and Salar Fattahi. Sign-RIP: A robust restricted isometry property for low-rank matrix recovery. arXiv preprint arXiv:2102.02969, 2021.
  • [MLC21] Cong Ma, Yuanxin Li, and Yuejie Chi. Beyond Procrustes: Balancing-free gradient descent for asymmetric low-rank matrix sensing. IEEE Transactions on Signal Processing, 69:867–877, 2021.
  • [Pre98] Lutz Prechelt. Early stopping-but when? In Neural Networks: Tricks of the trade, pages 55–69. Springer, 1998.
  • [RFP10] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
  • [RV09] Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 62(12):1707–1739, 2009.
  • [SS21] Dominik Stöger and Mahdi Soltanolkotabi. Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. arXiv preprint arXiv:2106.15013, 2021.
  • [TBS+16] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via Procrustes flow. In International Conference on Machine Learning, pages 964–973. PMLR, 2016.
  • [TL19] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
  • [Wai19] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • [WGL+20] Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635–3673. PMLR, 2020.
  • [WLZ+21] Hengkang Wang, Taihui Li, Zhong Zhuang, Tiancong Chen, Hengyue Liang, and Ju Sun. Early stopping for deep image prior. arXiv preprint arXiv:2112.06074, 2021.
  • [YD21] Tian Ye and Simon S Du. Global convergence of gradient descent for asymmetric low-rank matrix factorization. arXiv preprint arXiv:2106.14289, 2021.
  • [ZFZ21] Jialun Zhang, Salar Fattahi, and Richard Zhang. Preconditioned gradient descent for over-parameterized nonconvex matrix factorization. Advances in Neural Information Processing Systems, 34, 2021.
  • [Zha21] Richard Y Zhang. Sharp global guarantees for nonconvex low-rank matrix recovery in the overparameterized regime. arXiv preprint arXiv:2104.10790, 2021.
  • [ZKHC21] Jiacheng Zhuo, Jeongyeol Kwon, Nhat Ho, and Constantine Caramanis. On the computational and statistical complexity of over-parameterized matrix sensing. arXiv preprint arXiv:2102.02756, 2021.
  • [ZL16] Qinqing Zheng and John Lafferty. Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051, 2016.

Organization of Appendix

We prove our main Theorem 4.2 in Appendix A, deferring some intermediate steps and technical lemmas to Appendices B and C, respectively.

Appendix A Proof of Theorem 4.2

For ease of presentation, we work with an upper bound \gamma on the singular value ratio, \gamma:=1-\underline{\delta}\geq\frac{\sigma_{r+1}}{\sigma_{r}}, where we recall that \underline{\delta}\leq\frac{\sigma_{r}-\sigma_{r+1}}{\sigma_{r}} is a lower bound on the relative singular value gap. Theorem 4.2 is a simplified version of the following more precise theorem. The numerical constants below are not optimized.

Theorem A.1.

Fix any rkr\leq k. Suppose σr+1<σr\sigma_{r+1}<\sigma_{r}. Pick any γ(0,1)\gamma\in(0,1) such that σr+1σrγ\frac{\sigma_{r+1}}{\sigma_{r}}\leq\gamma. Pick any stepsize ηmin{γσr2600σ13,(1γ)σr20σ12}\eta\leq\min\left\{\frac{\gamma\sigma_{r}^{2}}{600\sigma_{1}^{3}},\frac{(1-\gamma)\sigma_{r}}{20\sigma_{1}^{2}}\right\}. For any cρ<1c_{\rho}<1, pick any initialization size ρ\rho satisfying

ρmin{13,1γ24,cρσ112(m+n+k)1γ24σr}180γ(1+γ)(32γ)(1γ)\rho\leq\min\left\{\frac{1}{3},\frac{1-\gamma}{24},\frac{c_{\rho}\sqrt{\sigma_{1}}}{12(m+n+k)\sqrt{\frac{1-\gamma}{24}}\sqrt{\sigma_{r}}}\right\}^{\frac{180\gamma(1+\gamma)}{(3-2\gamma)(1-\gamma)}}

and

ρmin{((1γ)cρσr1200(m+n+k)rσ1)2(1+γ)1γ,(γσr21600rσ12)1+γ1γ,γσr2r16σ1m+n+k}.\rho\leq\min\left\{\left(\frac{(1-\gamma)c_{\rho}\sigma_{r}}{1200(m+n+k)r\sigma_{1}}\right)^{\frac{2(1+\gamma)}{1-\gamma}},\left(\frac{\gamma\sigma_{r}^{2}}{1600r\sigma_{1}^{2}}\right)^{\frac{1+\gamma}{1-\gamma}},\frac{\gamma\sigma_{r}\sqrt{2r}}{16\sigma_{1}\sqrt{m+n+k}}\right\}.

Define

T1\displaystyle T_{1} :=log(12(m+n+k)1γ24σrcρρσ1)log(1+1+γ2ησr)+1,T2:=log(241γ)log(1+0.1ησr)+1,\displaystyle:=\left\lfloor\frac{\log\left(\frac{12(m+n+k)\sqrt{\frac{1-\gamma}{24}}\sqrt{\sigma_{r}}}{c_{\rho}\rho\sqrt{\sigma_{1}}}\right)}{\log(1+\frac{1+\gamma}{2}\eta\sigma_{r})}\right\rfloor+1,\qquad T_{2}:=\left\lfloor\frac{\log\left(\sqrt{\frac{24}{1-\gamma}}\right)}{\log(1+0.1\eta\sigma_{r})}\right\rfloor+1, (A.1)
T3\displaystyle T_{3} :=log(ρ1γ2(1+γ)/3)log(132ησr)+1,T=log(ρ1γ2(1+γ)/ρ)log(1+γησr).\displaystyle:=\left\lfloor\frac{\log\left(\rho^{\frac{1-\gamma}{2(1+\gamma)}}/3\right)}{\log\left(1-\frac{3}{2}\eta\sigma_{r}\right)}\right\rfloor+1,\qquad T=\left\lfloor\frac{\log(\rho^{\frac{1-\gamma}{2(1+\gamma)}}/\rho)}{\log(1+\gamma\eta\sigma_{r})}\right\rfloor.

Define T0:=T1+T2+T3T_{0}:=T_{1}+T_{2}+T_{3}. Then we have

T0T1(32γ)(1γ)6(3γ+1).\displaystyle\frac{T_{0}}{T}\leq 1-\frac{(3-2\gamma)(1-\gamma)}{6(3\gamma+1)}. (A.2)

Moreover, there exists a universal constant CC such that with probability at least 1(Ccρ)kr+1Cexp(k/C)1-(Cc_{\rho})^{k-r+1}-C\exp(-k/C), for all T0tTT_{0}\leq t\leq T, we have

FtGtXr8ρδ2(2δ)σ1+4ρδ2(2δ)2rσ1.\|F_{t}G_{t}^{\top}-X_{r}\|\leq 8\rho^{\frac{\delta}{2(2-\delta)}}\sigma_{1}+4\rho^{\frac{\delta}{2(2-\delta)}}\sqrt{2r}\sigma_{1}. (A.3)
Proof.

The inequality (A.2) follows from the auxiliary Lemma C.9, which can be proved by a mechanical though tedious calculation.

To prove the bound (A.3), we adopt the notations and strategy given in Section 4.2, where we argue that it suffices to prove the bound in the case with X=ΣXX=\Sigma_{X}. Recall the quantities A,BA,B and PP defined in (4.4). We can obtain the update of these quantities based on (4.3):

A+\displaystyle A_{+} =A+ηPAη(ABBA)BηAKK+JJ2ηBKKJJ2;\displaystyle=A+\eta PA-\eta(AB^{\top}-BA^{\top})B-\eta A\frac{K^{\top}K+J^{\top}J}{2}-\eta B\frac{K^{\top}K-J^{\top}J}{2}; (A.4)
B+\displaystyle B_{+} =BηPB+η(ABBA)AηAKKJJ2ηBKK+JJ2\displaystyle=B-\eta PB+\eta(AB^{\top}-BA^{\top})A-\eta A\frac{K^{\top}K-J^{\top}J}{2}-\eta B\frac{K^{\top}K+J^{\top}J}{2} (A.5)

and

P+\displaystyle P_{+} =PηP(ΣP)η(ΣP)P+η2(PPPPΣP)2ηBBP2ηPBB\displaystyle=P-\eta P(\Sigma-P)-\eta(\Sigma-P)P+\eta^{2}(PPP-P\Sigma P)-2\eta BB^{\top}P-2\eta PBB^{\top} (A.6)
η(A+ηPA)CηC(A+ηPA)η2CC\displaystyle\qquad-\eta(A+\eta PA)C^{\top}-\eta C(A+\eta PA)^{\top}-\eta^{2}CC^{\top}
+η(B+ηPB)D+ηD(B+ηPB)+η2DD,\displaystyle\qquad+\eta(B+\eta PB)D^{\top}+\eta D(B+\eta PB)^{\top}+\eta^{2}DD^{\top},

where

C\displaystyle C =ABB+BABAKK+JJ2BKKJJ2,\displaystyle=-AB^{\top}B+BA^{\top}B-A\frac{K^{\top}K+J^{\top}J}{2}-B\frac{K^{\top}K-J^{\top}J}{2},
D\displaystyle D =ABABAAAKKJJ2BKK+JJ2.\displaystyle=AB^{\top}A-BA^{\top}A-A\frac{K^{\top}K-J^{\top}J}{2}-B\frac{K^{\top}K+J^{\top}J}{2}.

Note that

FtGtXr=(UtJt)(VtKt)(Σ000)=(UtVtΣUtKtJtVtJtKt).\displaystyle F_{t}G_{t}^{\top}-X_{r}=\begin{pmatrix}U_{t}\\ J_{t}\end{pmatrix}\begin{pmatrix}V_{t}^{\top}&K_{t}^{\top}\end{pmatrix}-\begin{pmatrix}\Sigma&0\\ 0&0\end{pmatrix}=\begin{pmatrix}U_{t}V_{t}^{\top}-\Sigma&U_{t}K_{t}^{\top}\\ J_{t}V_{t}^{\top}&J_{t}K_{t}^{\top}\end{pmatrix}.

Therefore, we have the bound

FtGtXrUtVtΣ+UtKt+JtVt+JtKt.\displaystyle\|F_{t}G_{t}^{\top}-X_{r}\|\leq\|U_{t}V_{t}^{\top}-\Sigma\|+\|U_{t}K_{t}^{\top}\|+\|J_{t}V_{t}^{\top}\|+\|J_{t}K_{t}^{\top}\|.

By Proposition B.2 and Proposition B.5 given in Section B to follow, it holds with high probability that for any T1+T2+T3tTT_{1}+T_{2}+T_{3}\leq t\leq T,

UtKt3ρ1γ2(1+γ)σ1,JtVt3ρ1γ2(1+γ)σ1,JtKtρ1γ(1+γ)σ1,\|U_{t}K_{t}^{\top}\|\leq 3\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sigma_{1},\qquad\|J_{t}V_{t}^{\top}\|\leq 3\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sigma_{1},\qquad\|J_{t}K_{t}^{\top}\|\leq\rho^{\frac{1-\gamma}{(1+\gamma)}}\sigma_{1},

and

UtVtΣ\displaystyle\|U_{t}V_{t}^{\top}-\Sigma\| =Pt+QtPt+Qtρ1γ2(1+γ)σ1+4ρ1γ2(1+γ)2rσ1.\displaystyle=\|P_{t}+Q_{t}\|\leq\|P_{t}\|+\|Q_{t}\|\leq\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sigma_{1}+4\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sqrt{2r}\sigma_{1}.

By combining pieces, we obtain that

FtGtXr8ρ1γ2(1+γ)σ1+4ρ1γ2(1+γ)2rσ1.\|F_{t}G_{t}^{\top}-X_{r}\|\leq 8\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sigma_{1}+4\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sqrt{2r}\sigma_{1}.

This completes the proof of Theorem A.1. ∎

Appendix B Induction steps for proving Theorem 4.2

Proposition B.1 (Base Case).

Suppose that F~0m×k\tilde{F}_{0}\in\mathbb{R}^{m\times k} and G~0n×k\tilde{G}_{0}\in\mathbb{R}^{n\times k} have i.i.d. N(0,σ1)N(0,\sigma_{1}) entries. For any fixed ρσr2σ1\rho\leq\sqrt{\frac{\sigma_{r}}{2\sigma_{1}}}, we take F0=ρ3m+n+kF~0F_{0}=\frac{\rho}{3\sqrt{m+n+k}}\tilde{F}_{0} and G0=ρ3m+n+kG~0G_{0}=\frac{\rho}{3\sqrt{m+n+k}}\tilde{G}_{0}. Then with probability at least 1(Ccρ)kr+1Cexp(k/C)1-(Cc_{\rho})^{k-r+1}-C\exp(-k/C), we have

  1. \max\{\left\|U_{0}\right\|,\left\|V_{0}\right\|,\left\|J_{0}\right\|,\left\|K_{0}\right\|\}\leq\rho\sqrt{\sigma_{1}}.

  2. \left\|U_{0}^{\top}U_{0}+J_{0}^{\top}J_{0}\right\|\leq 7\sigma_{1}.

  3. \left\|V_{0}^{\top}V_{0}+K_{0}^{\top}K_{0}\right\|\leq 7\sigma_{1}.

  4. \sigma_{r}(A_{0})\geq\frac{c_{\rho}\rho\sqrt{\sigma_{1}}}{12(m+n+k)}.

  5. \lambda_{r}(P_{0})\geq 0.

  6. \|B_{0}\|_{{\tiny\text{F}}}\leq\rho\sqrt{2r}\sqrt{\sigma_{1}}.

Proof.

The first four Items follow from Lemma C.3. For Item 5, by definition of P0P_{0},

λr(P0)σrA0A0B0B0σr2ρ2σ1\displaystyle\lambda_{r}(P_{0})\geq\sigma_{r}-\|A_{0}A_{0}^{\top}\|-\|B_{0}B_{0}^{\top}\|\geq\sigma_{r}-2\rho^{2}\sigma_{1} 0.\displaystyle\geq 0.

For Item 6, by the fact that rank(B0)2r\mathrm{rank}(B_{0})\leq 2r, we have B0F2rB0ρ2rσ1.\|B_{0}\|_{{\tiny\text{F}}}\leq\sqrt{2r}\left\|B_{0}\right\|\leq\rho\sqrt{2r}\sqrt{\sigma_{1}}.

In the sequel, we condition on the high probability event that the conclusion of Proposition B.1 holds. The rest of the analysis is deterministic.

Proposition B.2 (Inductive Step).

Suppose that the stepsize satisfies \eta\leq\frac{\gamma\sigma_{r}^{2}}{600\sigma_{1}^{3}}, the initialization size satisfies

\rho\leq\min\left\{\left(\frac{(1-\gamma)c_{\rho}\sigma_{r}}{1200(m+n+k)r\sigma_{1}}\right)^{\frac{2(1+\gamma)}{1-\gamma}},\left(\frac{\gamma\sigma_{r}^{2}}{1600r\sigma_{1}^{2}}\right)^{\frac{1+\gamma}{1-\gamma}},\frac{\gamma\sigma_{r}\sqrt{2r}}{16\sigma_{1}\sqrt{m+n+k}}\right\},

and the following holds for all 0\leq t\leq s<T with T\leq\left\lfloor\frac{\log(\rho^{\frac{1-\gamma}{2(1+\gamma)}}/\rho)}{\log(1+\eta\gamma\sigma_{r})}\right\rfloor:

  1. \|U_{t}^{\top}U_{t}+J_{t}^{\top}J_{t}\|\leq 7\sigma_{1}.

  2. \|V_{t}^{\top}V_{t}+K_{t}^{\top}K_{t}\|\leq 7\sigma_{1}.

  3. \max\{\left\|J_{t}\right\|,\left\|K_{t}\right\|\}\leq(1+\gamma\eta\sigma_{r})\max\{\left\|J_{t-1}\right\|,\left\|K_{t-1}\right\|\}\leq\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sqrt{\sigma_{1}}, if t\geq 1.

  4. \lambda_{r}(P_{t})\geq\min\{(1-\eta\sigma_{r})^{2}\lambda_{r}(P_{t-1})-30\eta^{2}\sigma_{1}^{3}-\frac{\gamma\eta\sigma_{r}^{2}}{20},\,-30\eta^{2}\sigma_{1}^{3}-\frac{\gamma\eta\sigma_{r}^{2}}{20}\}\geq-\frac{\gamma\sigma_{r}}{10}, if t\geq 1.

  5. If t\geq 1, we have \|B_{t}\|_{{\tiny\text{F}}}\leq\max\left\{(1+\gamma\eta\sigma_{r})\|B_{t-1}\|_{{\tiny\text{F}}},\frac{16\sigma_{1}\sqrt{\sigma_{1}}\sqrt{m+n+k}(1+\gamma\eta\sigma_{r})^{2t}\rho^{2}}{\gamma\sigma_{r}}\right\} and \|B_{t}\|_{{\tiny\text{F}}}\leq(1+\gamma\eta\sigma_{r})^{t}\rho\sqrt{2r}\sqrt{\sigma_{1}}\leq\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sqrt{2r}\sqrt{\sigma_{1}}\leq\sqrt{\sigma_{1}}.

Then the above items also hold for t=s+1t=s+1. Consequently, by Proposition B.1 and induction, they hold for all 0tT0\leq t\leq T.

Proof.

Assuming Items 1-5 hold for all t\in[0,s], we prove that each item holds for t=s+1.

Item 3. By (4.3) and the induction hypothesis that \left\|V_{s}^{\top}V_{s}+K_{s}^{\top}K_{s}\right\|\leq 7\sigma_{1}, we have

Js+1\displaystyle\left\|J_{s+1}\right\| JsηJs(VsVs+KsKs)+ηΣ~Ks\displaystyle\leq\left\|J_{s}-\eta J_{s}(V_{s}^{\top}V_{s}+K_{s}^{\top}K_{s})\right\|+\left\|\eta\tilde{\Sigma}K_{s}\right\|
Js+γησrKs\displaystyle\leq\left\|J_{s}\right\|+\gamma\eta\sigma_{r}\left\|K_{s}\right\|
(1+γησr)max{Js,Ks}.\displaystyle\leq(1+\gamma\eta\sigma_{r})\max\{\left\|J_{s}\right\|,\left\|K_{s}\right\|\}.

By the same argument, we can show that Ks+1(1+γησr)max{Js,Ks}\left\|K_{s+1}\right\|\leq(1+\gamma\eta\sigma_{r})\max\{\left\|J_{s}\right\|,\left\|K_{s}\right\|\}, whence

max{Js+1,Ks+1}(1+γησr)max{Js,Ks}.\max\{\left\|J_{s+1}\right\|,\left\|K_{s+1}\right\|\}\leq(1+\gamma\eta\sigma_{r})\max\{\left\|J_{s}\right\|,\left\|K_{s}\right\|\}.

By applying the above inequality inductively, we have

max{Js+1,Ks+1}\displaystyle\max\{\left\|J_{s+1}\right\|,\left\|K_{s+1}\right\|\} (1+γησr)s+1max{J0,K0}\displaystyle\leq(1+\gamma\eta\sigma_{r})^{s+1}\max\{\|J_{0}\|,\|K_{0}\|\}
(1+γησr)Tmax{J0,K0}\displaystyle\leq(1+\gamma\eta\sigma_{r})^{T}\max\{\|J_{0}\|,\|K_{0}\|\}
ρ1γ2(1+γ)σ1.\displaystyle\leq\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sqrt{\sigma_{1}}.

Item 4. By induction hypothesis and definition of PsP_{s}, we have λr(ΣAsAs+BsBs)γσr10\lambda_{r}(\Sigma-A_{s}A_{s}^{\top}+B_{s}B_{s}^{\top})\geq-\frac{\gamma\sigma_{r}}{10}. On the other hand, by induction hypothesis on Item 5, we have Bsσ1\|B_{s}\|\leq\sqrt{\sigma_{1}}. Therefore, we have AsAs2σ1\|A_{s}A_{s}^{\top}\|\leq 2\sigma_{1} and

Psmax{AsAs,Σ+BsBs}2σ1.\left\|P_{s}\right\|\leq\max\{\left\|A_{s}A_{s}^{\top}\right\|,\left\|\Sigma+B_{s}B_{s}^{\top}\right\|\}\leq 2\sigma_{1}.

By the updating formula of PP in (A.6), we can write

Ps+1=(Iη(ΣPs))Ps(Iη(ΣPs))+Es,\displaystyle P_{s+1}=(I-\eta(\Sigma-P_{s}))P_{s}(I-\eta(\Sigma-P_{s}))+E_{s}, (B.1)

where

Es\displaystyle E_{s} =2η2ΣPsΣ+η2ΣPs2+η2Ps2Σ2ηBsBsPs2ηPsBsBsη(As+ηPsAs)Cs\displaystyle=-2\eta^{2}\Sigma P_{s}\Sigma+\eta^{2}\Sigma P_{s}^{2}+\eta^{2}P_{s}^{2}\Sigma-2\eta B_{s}B_{s}^{\top}P_{s}-2\eta P_{s}B_{s}B_{s}^{\top}-\eta(A_{s}+\eta P_{s}A_{s})C_{s}^{\top}
ηCs(As+ηPsAs)η2CsCs+η(Bs+ηPsBs)Ds+ηDs(Bs+ηPsBs)\displaystyle\quad-\eta C_{s}(A_{s}+\eta P_{s}A_{s})^{\top}-\eta^{2}C_{s}C_{s}^{\top}+\eta(B_{s}+\eta P_{s}B_{s})D_{s}^{\top}+\eta D_{s}(B_{s}+\eta P_{s}B_{s})^{\top}
+η2DsDs.\displaystyle\quad+\eta^{2}D_{s}D_{s}^{\top}.

By As2σ1,Bsσ1\|A_{s}\|\leq\sqrt{2\sigma_{1}},\|B_{s}\|\leq\sqrt{\sigma_{1}} and triangle inequality, we have

Cs\displaystyle\|C_{s}\| AsBsBs+BsAsBs+AsKsKs+JsJs2+BsKsKsJsJs2\displaystyle\leq\|A_{s}B_{s}^{\top}B_{s}\|+\|B_{s}A_{s}^{\top}B_{s}\|+\|A_{s}\frac{K_{s}^{\top}K_{s}+J_{s}^{\top}J_{s}}{2}\|+\|B_{s}\frac{K_{s}^{\top}K_{s}-J_{s}^{\top}J_{s}}{2}\|
12ρ1γ1+γrσ1σ1.\displaystyle\leq 12\rho^{\frac{1-\gamma}{1+\gamma}}r\sigma_{1}\sqrt{\sigma_{1}}.

Similarly, we have Ds7ρ1γ2(1+γ)2rσ1σ1\|D_{s}\|\leq 7\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sqrt{2r}\sigma_{1}\sqrt{\sigma_{1}}. By the bounds above, triangle inequality, and the upper bound on ρ\rho, we have

Es12η2σ13+80ηρ1γ1+γrσ1212η2σ13+γησr220.\|E_{s}\|\leq 12\eta^{2}\sigma_{1}^{3}+80\eta\rho^{\frac{1-\gamma}{1+\gamma}}r\sigma_{1}^{2}\leq 12\eta^{2}\sigma_{1}^{3}+\frac{\gamma\eta\sigma_{r}^{2}}{20}.

Combining the above estimates on Es\left\|E_{s}\right\| and σ1(P)\sigma_{1}(P), and the assumption on η\eta, we see that we can apply Lemma C.10 to Ps+1P_{s+1} and obtain

λr(Ps+1)min{(1ησr)2λr(Ps)30η2σ13γησr220,30η2σ13γησr220}.\displaystyle\lambda_{r}(P_{s+1})\geq\min\{(1-\eta\sigma_{r})^{2}\lambda_{r}(P_{s})-30\eta^{2}\sigma_{1}^{3}-\frac{\gamma\eta\sigma_{r}^{2}}{20},-30\eta^{2}\sigma_{1}^{3}-\frac{\gamma\eta\sigma_{r}^{2}}{20}\}.

By the fact that λr(P0)0\lambda_{r}(P_{0})\geq 0, we have λr(Ps+1)(30η2σ13+γησr220)i=0s(1ησr)2i\lambda_{r}(P_{s+1})\geq-(30\eta^{2}\sigma_{1}^{3}+\frac{\gamma\eta\sigma_{r}^{2}}{20})\sum_{i=0}^{s}(1-\eta\sigma_{r})^{2i}. Hence,

λr(Ps+1)\displaystyle\lambda_{r}(P_{s+1}) (30η2σ13+γησr220)i=0(1ησr)2i30η2σ13+γησr220ησrγσr10,\displaystyle\geq-(30\eta^{2}\sigma_{1}^{3}+\frac{\gamma\eta\sigma_{r}^{2}}{20})\sum_{i=0}^{\infty}(1-\eta\sigma_{r})^{2i}\geq-\frac{30\eta^{2}\sigma_{1}^{3}+\frac{\gamma\eta\sigma_{r}^{2}}{20}}{\eta\sigma_{r}}\geq-\frac{\gamma\sigma_{r}}{10},

where the last inequality follows from ηγσr2600σ13\eta\leq\frac{\gamma\sigma_{r}^{2}}{600\sigma_{1}^{3}}.

Item 5. Note that

Bs+1F2\displaystyle\|B_{s+1}\|_{F}^{2}
=BsF22ηBsBs,PsηAsBsBsAsF2\displaystyle=\|B_{s}\|_{F}^{2}-2\eta\left\langle B_{s}B_{s}^{\top},P_{s}\right\rangle-\eta\|A_{s}B_{s}^{\top}-B_{s}A_{s}^{\top}\|_{{\tiny\text{F}}}^{2}
+ηAsBs,KsKsJsJsηBsBs,KsKs+JsJs2\displaystyle\quad+\eta\left\langle A_{s}^{\top}B_{s},K_{s}^{\top}K_{s}-J_{s}^{\top}J_{s}\right\rangle-\eta\left\langle B_{s}^{\top}B_{s},\frac{K_{s}^{\top}K_{s}+J_{s}^{\top}J_{s}}{2}\right\rangle
+η2PsBs+(AsBsBsAs)AsAsKsKsJsJs2BsKsKs+JsJs2F2\displaystyle\quad+\eta^{2}\|-P_{s}B_{s}+(A_{s}B_{s}^{\top}-B_{s}A_{s}^{\top})A_{s}-A_{s}\frac{K_{s}^{\top}K_{s}-J_{s}^{\top}J_{s}}{2}-B_{s}\frac{K_{s}^{\top}K_{s}+J_{s}^{\top}J_{s}}{2}\|_{{\tiny\text{F}}}^{2}
BsF22ηλr(Ps)BsF2+ηBsBsFKsKs+JsJsF\displaystyle\leq\|B_{s}\|_{{\tiny\text{F}}}^{2}-2\eta\lambda_{r}(P_{s})\|B_{s}\|_{{\tiny\text{F}}}^{2}+\eta\|B_{s}^{\top}B_{s}\|_{{\tiny\text{F}}}\|K_{s}^{\top}K_{s}+J_{s}^{\top}J_{s}\|_{{\tiny\text{F}}}
+ηAsBsFKsKsJsJsF\displaystyle\quad+\eta\|A_{s}^{\top}B_{s}\|_{{\tiny\text{F}}}\|K_{s}^{\top}K_{s}-J_{s}^{\top}J_{s}\|_{{\tiny\text{F}}}
+η2PsBs+(AsBsBsAs)AsAsKsKsJsJs2BsKsKs+JsJs2F2,\displaystyle\quad+\eta^{2}\|-P_{s}B_{s}+(A_{s}B_{s}^{\top}-B_{s}A_{s}^{\top})A_{s}-A_{s}\frac{K_{s}^{\top}K_{s}-J_{s}^{\top}J_{s}}{2}-B_{s}\frac{K_{s}^{\top}K_{s}+J_{s}^{\top}J_{s}}{2}\|_{{\tiny\text{F}}}^{2},

where the equality follows from (A.5) and brute force, and the inequality follows from Lemma C.8, Cauchy-Schwarz inequality and definition of PsP_{s}. By induction hypothesis, we have Ksρ1γ2(1+γ)σ1\|K_{s}\|\leq\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sqrt{\sigma_{1}} and Jsρ1γ2(1+γ)σ1\|J_{s}\|\leq\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sqrt{\sigma_{1}}. Moreover, the rank of KsKs±JsJsK_{s}^{\top}K_{s}\pm J_{s}^{\top}J_{s} is at most m+n+km+n+k. Hence,

KsKs±JsJsFm+n+kKsKs±JsJs2m+n+kρ1γ1+γσ1γσr10.\|K_{s}^{\top}K_{s}\pm J_{s}^{\top}J_{s}\|_{{\tiny\text{F}}}\leq\sqrt{m+n+k}\left\|K_{s}^{\top}K_{s}\pm J_{s}^{\top}J_{s}\right\|\leq 2\sqrt{m+n+k}\rho^{\frac{1-\gamma}{1+\gamma}}\sigma_{1}\leq{\frac{\gamma\sigma_{r}}{10}}.

Furthermore, we have

η2PsBs+(AsBsBsAs)AsAsKsKsJsJs2BsKsKs+JsJs2F2\displaystyle\eta^{2}\|-P_{s}B_{s}+(A_{s}B_{s}^{\top}-B_{s}A_{s}^{\top})A_{s}-A_{s}\frac{K_{s}^{\top}K_{s}-J_{s}^{\top}J_{s}}{2}-B_{s}\frac{K_{s}^{\top}K_{s}+J_{s}^{\top}J_{s}}{2}\|_{{\tiny\text{F}}}^{2}
4η2PsBsF2+4η2(AsBsBsAs)AsF2+4η2AsKsKsJsJs2F2\displaystyle\leq 4\eta^{2}\|P_{s}B_{s}\|_{{\tiny\text{F}}}^{2}+4\eta^{2}\|(A_{s}B_{s}^{\top}-B_{s}A_{s}^{\top})A_{s}\|_{{\tiny\text{F}}}^{2}+4\eta^{2}\|A_{s}\frac{K_{s}^{\top}K_{s}-J_{s}^{\top}J_{s}}{2}\|_{{\tiny\text{F}}}^{2}
+4η2BsKsKs+JsJs2F2\displaystyle\quad+4\eta^{2}\|B_{s}\frac{K_{s}^{\top}K_{s}+J_{s}^{\top}J_{s}}{2}\|_{{\tiny\text{F}}}^{2}
48η2σ12BsF2+8η2σ13(m+n+k)(1+γησr)4sρ4+4η2σ12(1+γησr)4sρ4BsF2\displaystyle\leq 48\eta^{2}\sigma_{1}^{2}\|B_{s}\|_{{\tiny\text{F}}}^{2}+8\eta^{2}\sigma_{1}^{3}(m+n+k)(1+\gamma\eta\sigma_{r})^{4s}\rho^{4}+4\eta^{2}\sigma_{1}^{2}(1+\gamma\eta\sigma_{r})^{4s}\rho^{4}\|B_{s}\|_{{\tiny\text{F}}}^{2}
(48η2σ12+4η2σ12(1+γησr)4sρ4)BsF2+8η2σ13(m+n+k)(1+γησr)4sρ4\displaystyle\leq(48\eta^{2}\sigma_{1}^{2}+4\eta^{2}\sigma_{1}^{2}(1+\gamma\eta\sigma_{r})^{4s}\rho^{4})\|B_{s}\|_{{\tiny\text{F}}}^{2}+8\eta^{2}\sigma_{1}^{3}(m+n+k)(1+\gamma\eta\sigma_{r})^{4s}\rho^{4}
γησr10BsF2+8η2σ13(m+n+k)(1+γησr)4sρ4,\displaystyle\leq\frac{\gamma\eta\sigma_{r}}{10}\|B_{s}\|_{{\tiny\text{F}}}^{2}+8\eta^{2}\sigma_{1}^{3}(m+n+k)(1+\gamma\eta\sigma_{r})^{4s}\rho^{4},

where the second inequality follows from the induction hypothesis and the last inequality follows from the bound ηγσr2600σ13\eta\leq\frac{\gamma\sigma_{r}^{2}}{600\sigma_{1}^{3}}. Combining with Item 4, we have

Bs+1F2\displaystyle\|B_{s+1}\|_{{\tiny\text{F}}}^{2} (1+γησr2)BsF2+4ησ1σ1m+n+k(1+γησr)2sρ2BsF\displaystyle\leq\left(1+\frac{\gamma\eta\sigma_{r}}{2}\right)\|B_{s}\|_{{\tiny\text{F}}}^{2}+4\eta\sigma_{1}\sqrt{\sigma_{1}}\sqrt{m+n+k}(1+\gamma\eta\sigma_{r})^{2s}\rho^{2}\|B_{s}\|_{{\tiny\text{F}}}
+8η2σ13(m+n+k)(1+γησr)4sρ4.\displaystyle\quad+8\eta^{2}\sigma_{1}^{3}(m+n+k)(1+\gamma\eta\sigma_{r})^{4s}\rho^{4}. (B.2)

Consider two cases. Case (i): BsF>16σ1σ1m+n+k(1+γησr)2sρ2γσr\|B_{s}\|_{{\tiny\text{F}}}>\frac{16\sigma_{1}\sqrt{\sigma_{1}}\sqrt{m+n+k}(1+\gamma\eta\sigma_{r})^{2s}\rho^{2}}{\gamma\sigma_{r}}. By (B.2), we have

Bs+1F2\displaystyle\|B_{s+1}\|_{{\tiny\text{F}}}^{2} (1+γησr2)BsF2+γησr4BsF2+γ2η2σr2BsF232(1+γησr)2BsF2.\displaystyle\leq\left(1+\frac{\gamma\eta\sigma_{r}}{2}\right)\|B_{s}\|_{{\tiny\text{F}}}^{2}+\frac{\gamma\eta\sigma_{r}}{4}\|B_{s}\|_{{\tiny\text{F}}}^{2}+\frac{\gamma^{2}\eta^{2}\sigma_{r}^{2}\|B_{s}\|_{{\tiny\text{F}}}^{2}}{32}\leq(1+\gamma\eta\sigma_{r})^{2}\|B_{s}\|_{{\tiny\text{F}}}^{2}.

Hence, Bs+1F(1+γησr)BsF.\|B_{s+1}\|_{{\tiny\text{F}}}\leq\left(1+\gamma{\eta\sigma_{r}}{}\right)\|B_{s}\|_{{\tiny\text{F}}}. Case (ii): BsF16σ1σ1m+n+k(1+γησr)2sρ2γσr\|B_{s}\|_{{\tiny\text{F}}}\leq\frac{16\sigma_{1}\sqrt{\sigma_{1}}\sqrt{m+n+k}(1+\gamma\eta\sigma_{r})^{2s}\rho^{2}}{\gamma\sigma_{r}}. By (B.2), we have

Bs+1F2\displaystyle\|B_{s+1}\|_{{\tiny\text{F}}}^{2} (1+γησr2)162σ13(m+n+k)(1+γησr)4sρ4γ2σr2\displaystyle\leq\left(1+\frac{\gamma\eta\sigma_{r}}{2}\right)\frac{16^{2}\sigma_{1}^{3}(m+n+k)(1+\gamma\eta\sigma_{r})^{4s}\rho^{4}}{\gamma^{2}\sigma_{r}^{2}}
+4γησr16162σ13(m+n+k)(1+γησr)4sρ4γ2σr2\displaystyle\quad+\frac{4\gamma\eta\sigma_{r}}{16}\frac{16^{2}\sigma_{1}^{3}(m+n+k)(1+\gamma\eta\sigma_{r})^{4s}\rho^{4}}{\gamma^{2}\sigma_{r}^{2}}
+8γ2η2σr2162162σ13(m+n+k)(1+γησr)4sρ4γ2σr2\displaystyle\quad+\frac{8\gamma^{2}\eta^{2}\sigma_{r}^{2}}{16^{2}}\frac{16^{2}\sigma_{1}^{3}(m+n+k)(1+\gamma\eta\sigma_{r})^{4s}\rho^{4}}{\gamma^{2}\sigma_{r}^{2}}
162σ13(m+n+k)(1+γησr)4(s+1)ρ4γ2σr2.\displaystyle\leq\frac{16^{2}\sigma_{1}^{3}(m+n+k)(1+\gamma\eta\sigma_{r})^{4(s+1)}\rho^{4}}{\gamma^{2}\sigma_{r}^{2}}.

As a result, Bs+1Fmax{(1+γησr)BsF,16σ1σ1m+n+k(1+γησr)2(s+1)ρ2γσr}.\|B_{s+1}\|_{{\tiny\text{F}}}\leq\max\left\{(1+\gamma\eta\sigma_{r})\|B_{s}\|_{{\tiny\text{F}}},\frac{16\sigma_{1}\sqrt{\sigma_{1}}\sqrt{m+n+k}(1+\gamma\eta\sigma_{r})^{2(s+1)}\rho^{2}}{\gamma\sigma_{r}}\right\}.

For the rest of Item 5, we note that for any 0<ts+10<t\leq s+1,

max{BtF,16σ1σ1m+n+k(1+γησr)2tρ2γσr}\displaystyle\max\left\{\|B_{t}\|_{{\tiny\text{F}}},\frac{16\sigma_{1}\sqrt{\sigma_{1}}\sqrt{m+n+k}(1+\gamma\eta\sigma_{r})^{2t}\rho^{2}}{\gamma\sigma_{r}}\right\}
(1+γησr)max{Bt1F,16σ1σ1m+n+k(1+γησr)2(t1)ρ2γσr}\displaystyle\leq(1+\gamma\eta\sigma_{r})\max\left\{\|B_{t-1}\|_{{\tiny\text{F}}},\frac{16\sigma_{1}\sqrt{\sigma_{1}}\sqrt{m+n+k}(1+\gamma\eta\sigma_{r})^{2(t-1)}\rho^{2}}{\gamma\sigma_{r}}\right\}
(1+γησr)tmax{B0F,16σ1σ1m+n+kρ2γσr}\displaystyle\leq(1+\gamma\eta\sigma_{r})^{t}\max\left\{\|B_{0}\|_{{\tiny\text{F}}},\frac{16\sigma_{1}\sqrt{\sigma_{1}}\sqrt{m+n+k}\rho^{2}}{\gamma\sigma_{r}}\right\}
(1+γησr)tρ2rσ1,\displaystyle\leq(1+\gamma\eta\sigma_{r})^{t}\rho\sqrt{2r}\sqrt{\sigma_{1}},

where the last inequality follows from the base case that B0Fρ2rσ1\|B_{0}\|_{F}\leq\rho\sqrt{2r}\sqrt{\sigma_{1}} and the assumption that ργσr2r16σ1m+n+k\rho\leq\frac{\gamma\sigma_{r}\sqrt{2r}}{16\sigma_{1}\sqrt{m+n+k}}. As a result,

Bs+1F(1+γησr)s+1ρ2rσ1(1+γησr)Tρ2rσ1ρ1γ2(1+γ)2rσ1.\|B_{s+1}\|_{{\tiny\text{F}}}\leq(1+\gamma\eta\sigma_{r})^{s+1}\rho\sqrt{2r}\sqrt{\sigma_{1}}\leq(1+\gamma\eta\sigma_{r})^{T}\rho\sqrt{2r}\sqrt{\sigma_{1}}\leq\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sqrt{2r}{\sqrt{\sigma_{1}}}.

Items 1 and 2. By the proof of Item 4, we have

λr(ΣAs+1As+1+Bs+1Bs+1)=λr(Ps+1)γσr10.\lambda_{r}(\Sigma-A_{s+1}A_{s+1}^{\top}+B_{s+1}B_{s+1}^{\top})=\lambda_{r}(P_{s+1})\geq-\frac{\gamma\sigma_{r}}{10}.

Therefore, we have As+1As+1As+12σ1\|A_{s+1}\|\leq\sqrt{\left\|A_{s+1}A_{s+1}^{\top}\right\|}\leq\sqrt{2\sigma_{1}}. On the other hand, by the proof of Item 5, we have Bs+1σ1\|B_{s+1}\|\leq\sqrt{\sigma_{1}}. Therefore, we have

Us+1Us+1+Vs+1Vs+1=2(As+1As+1+Bs+1Bs+1)6σ1.\|U_{s+1}^{\top}U_{s+1}+V_{s+1}^{\top}V_{s+1}\|=\|2(A_{s+1}^{\top}A_{s+1}+B_{s+1}^{\top}B_{s+1})\|\leq 6\sigma_{1}.

This implies that Us+1Us+16σ1\|U_{s+1}^{\top}U_{s+1}\|\leq 6\sigma_{1} and Vs+1Vs+16σ1\|V_{s+1}^{\top}V_{s+1}\|\leq 6\sigma_{1}. The result follows from triangle inequality and Item 3. ∎

Next, we show that σr(At)\sigma_{r}(A_{t}) increases geometrically until it is at least 1γ24σr\sqrt{\frac{1-\gamma}{24}\sigma_{r}}.

Proposition B.3.

Suppose that the conditions of Proposition B.2 hold. In addition, we assume η(1γ)σr20σ12\eta\leq\frac{(1-\gamma)\sigma_{r}}{20\sigma_{1}^{2}}. Then for any 0tT0\leq t\leq T, we have

σr(At)min{(1+γ+12ησr)tσr(A0),1γ24σr}.\sigma_{r}(A_{t})\geq\min\bigg{\{}\left(1+\frac{\gamma+1}{2}\eta\sigma_{r}\right)^{t}\sigma_{r}(A_{0}),\sqrt{\frac{1-\gamma}{24}\sigma_{r}}\bigg{\}}.

In particular, for T1=log(12(m+n+k)1γ24σrcρρσ1)log(1+1+γ2ησr)+1T_{1}\!=\!\bigg{\lfloor}\frac{\log\!\Big{(}\frac{12(m+n+k)\sqrt{\frac{1-\gamma}{24}\sigma_{r}}}{c_{\rho}\rho\sqrt{\sigma_{1}}}\Big{)}}{\log(1+\frac{1+\gamma}{2}\eta\sigma_{r})}\bigg{\rfloor}\!+1 and all t[T1,T]t\!\in[T_{1},T], we have σr(At)1γ24σr\sigma_{r}(A_{t})\geq\sqrt{\frac{1-\gamma}{24}\sigma_{r}}.

Proof.

We prove it by induction. Clearly, the inequality holds for t=0t=0. Suppose the result holds for 0t<T0\leq t<T. By the updating formula of AA in (A.4), we have

At+1=At+η(ΣAtAt)At+ηEt,\displaystyle A_{t+1}=A_{t}+\eta(\Sigma-A_{t}A_{t}^{\top})A_{t}+\eta E_{t}, (B.3)

where Et=BtBtAt(AtBtBtAt)BtAtKtKt+JtJt2BtKtKtJtJt2.E_{t}=B_{t}B_{t}^{\top}A_{t}-(A_{t}B_{t}^{\top}-B_{t}A_{t}^{\top})B_{t}-A_{t}\frac{K_{t}^{\top}K_{t}+J_{t}^{\top}J_{t}}{2}-B_{t}\frac{K_{t}^{\top}K_{t}-J_{t}^{\top}J_{t}}{2}. Note that

Et\displaystyle\|E_{t}\| 16(1+γησr)2tρ2rσ1σ1\displaystyle\leq 16(1+\gamma\eta\sigma_{r})^{2t}\rho^{2}r\sigma_{1}\sqrt{\sigma_{1}} (B.4)
192(1+γησr)tρ(m+n+k)rσ1cρ(1+γ+12ησr)tcρρσ112(m+n+k)\displaystyle\leq\frac{192(1+\gamma\eta\sigma_{r})^{t}\rho(m+n+k)r\sigma_{1}}{c_{\rho}}\cdot\bigg{(}1+\frac{\gamma+1}{2}\eta\sigma_{r}\bigg{)}^{t}\frac{c_{\rho}\rho\sqrt{\sigma_{1}}}{12(m+n+k)}
192ρ1γ2(1+γ)(m+n+k)rσ1cρ(1+γ+12ησr)tcρρσ112(m+n+k)\displaystyle\leq\frac{192\rho^{\frac{1-\gamma}{2(1+\gamma)}}(m+n+k)r\sigma_{1}}{c_{\rho}}\cdot\bigg{(}1+\frac{\gamma+1}{2}\eta\sigma_{r}\bigg{)}^{t}\frac{c_{\rho}\rho\sqrt{\sigma_{1}}}{12(m+n+k)}
1γ6σr(1+γ+12ησr)tσr(A0),\displaystyle\leq\frac{1-\gamma}{6}\sigma_{r}\Bigg{(}1+\frac{\gamma+1}{2}\eta\sigma_{r}\bigg{)}^{t}\sigma_{r}(A_{0}),

where the first inequality follows from Proposition B.2 and triangle inequality, the second inequality from γγ+12\gamma\leq\frac{\gamma+1}{2}, the third inequality from tTlog(ρ1γ2(1+γ)/ρ)log(1+ηγσr)t\leq T\leq\Big{\lfloor}\frac{\log\big{(}\rho^{\frac{1-\gamma}{2(1+\gamma)}}/\rho\big{)}}{\log(1+\eta\gamma\sigma_{r})}\Big{\rfloor}, and the last inequality from the assumption that ρ((1γ)cρσr1200(m+n+k)rσ1)2(1+γ)1γ\rho\leq\Big{(}\frac{(1-\gamma)c_{\rho}\sigma_{r}}{1200(m+n+k)r\sigma_{1}}\Big{)}^{\frac{2(1+\gamma)}{1-\gamma}}. On the other hand,

Et\displaystyle\|E_{t}\| 16(1+γησr)2tρ2rσ1σ116ρ1γ(1+γ)rσ1σ11γ6σr1γ24σr,\displaystyle\leq 16(1+\gamma\eta\sigma_{r})^{2t}\rho^{2}r\sigma_{1}\sqrt{\sigma_{1}}\leq 16\rho^{\frac{1-\gamma}{(1+\gamma)}}r\sigma_{1}\sqrt{\sigma_{1}}\leq\frac{1-\gamma}{6}\sigma_{r}\sqrt{\frac{1-\gamma}{24}\sigma_{r}},

where the last inequality follows from the assumption that ρ((1γ)cρσr1200(m+n+k)rσ1)2(1+γ)1γ\rho\leq\Big{(}\frac{(1-\gamma)c_{\rho}\sigma_{r}}{1200(m+n+k)r\sigma_{1}}\Big{)}^{\frac{2(1+\gamma)}{1-\gamma}}. Combining both bounds on Et\|E_{t}\|, we have

Et1γ6σrmin{(1+γ+12ησr)tσr(A0),1γ24σr}1γ6σrσr(At).\|E_{t}\|\leq\frac{1-\gamma}{6}\sigma_{r}\min\left\{\left(1+\frac{\gamma+1}{2}\eta\sigma_{r}\right)^{t}\sigma_{r}(A_{0}),\sqrt{\frac{1-\gamma}{24}\sigma_{r}}\right\}\leq\frac{1-\gamma}{6}\sigma_{r}\sigma_{r}(A_{t}).

Therefore, it holds that

σr(At+1)\displaystyle\sigma_{r}(A_{t+1}) σr(At+η(ΣAtAt)At)ηEt\displaystyle\geq\sigma_{r}(A_{t}+\eta(\Sigma-A_{t}A_{t}^{\top})A_{t})-\eta\left\|E_{t}\right\|
(14η2σ12)(1+ησr)σr(At)(1ησr2(At))1γ6ησrσr(At)\displaystyle\geq(1-4\eta^{2}\sigma_{1}^{2})(1+\eta\sigma_{r})\sigma_{r}(A_{t})(1-\eta\sigma_{r}^{2}(A_{t}))-\frac{1-\gamma}{6}\eta\sigma_{r}\sigma_{r}(A_{t})
(1+γ+34ησr)σr(At)(1ησr2(At))1γ6ησrσr(At),\displaystyle\geq\left(1+\frac{\gamma+3}{4}\eta\sigma_{r}\right)\sigma_{r}(A_{t})(1-\eta\sigma_{r}^{2}(A_{t}))-\frac{1-\gamma}{6}\eta\sigma_{r}\sigma_{r}(A_{t}),

where the second inequality follows from Lemma C.7 and the last inequality follows from the assumption that ηmin{γσr2600σ13,(1γ)σr20σ12}\eta\leq\min\left\{\frac{\gamma\sigma_{r}^{2}}{600\sigma_{1}^{3}},\frac{(1-\gamma)\sigma_{r}}{20\sigma_{1}^{2}}\right\}. We consider two cases.

  1.

    σr(At)1γ16σr\sigma_{r}(A_{t})\geq\sqrt{\frac{1-\gamma}{16}\sigma_{r}}. We have

    \displaystyle\sigma_{r}(A_{t+1})\geq(1-\eta\sigma_{1}^{2}(A_{t})-\frac{1-\gamma}{6}\eta\sigma_{r})\sigma_{r}(A_{t})\geq(1-\eta\sigma_{1}^{2}(A_{t})-\frac{1-\gamma}{6}\eta\sigma_{r})\sqrt{\frac{1-\gamma}{16}\sigma_{r}}\geq\sqrt{\frac{1-\gamma}{24}\sigma_{r}},

    where the last inequality follows from σ1(At)2σ1\sigma_{1}(A_{t})\leq\sqrt{2\sigma_{1}} and η1100σ1\eta\leq\frac{1}{100\sigma_{1}}.

  2.

    (1+γ+12ησr)tσr(A0)σr(At)<1γ16σr.\left(1+\frac{\gamma+1}{2}\eta\sigma_{r}\right)^{t}\sigma_{r}(A_{0})\leq\sigma_{r}(A_{t})<\sqrt{\frac{1-\gamma}{16}\sigma_{r}}. Note that

    σr(At+1)\displaystyle\sigma_{r}(A_{t+1}) (1+γ+34ησr1γ16ησr1γ16η2σr2)σr(At)1γ6ησrσr(At)\displaystyle\geq(1+\frac{\gamma+3}{4}\eta\sigma_{r}-\frac{1-\gamma}{16}\eta\sigma_{r}-\frac{1-\gamma}{16}\eta^{2}\sigma_{r}^{2})\sigma_{r}(A_{t})-\frac{1-\gamma}{6}\eta\sigma_{r}\sigma_{r}(A_{t})
    \displaystyle\geq(1+\frac{\gamma+1}{2}\eta\sigma_{r})\sigma_{r}(A_{t})
    min{(1+γ+12ησr)t+1σr(A0),1γ24σr}.\displaystyle\geq\min\left\{\left(1+\frac{\gamma+1}{2}\eta\sigma_{r}\right)^{t+1}\sigma_{r}(A_{0}),\sqrt{\frac{1-\gamma}{24}\sigma_{r}}\right\}.

Combining these two cases completes the induction step. ∎

After σr(At)\sigma_{r}(A_{t}) exceeds 1γ24σr\sqrt{\frac{1-\gamma}{24}\sigma_{r}}, we show that σr(At)\sigma_{r}(A_{t}) increases to 0.8σr0.8\sqrt{\sigma_{r}} at a slower rate.

Proposition B.4.

Suppose that the conditions of Proposition B.2 hold. Let T1T_{1} be the same as in Proposition B.3. Then for any T1tTT_{1}\leq t\leq T, we have

σr(At)min{(1+0.1ησr)tT11γ24σr,0.8σr}.\sigma_{r}(A_{t})\geq\min\bigg{\{}\left(1+0.1\eta\sigma_{r}\right)^{t-T_{1}}\sqrt{\frac{1-\gamma}{24}\sigma_{r}},0.8\sqrt{\sigma_{r}}\bigg{\}}.

In particular, for T2=log(241γ)log(1+0.1ησr)+1T_{2}=\bigg{\lfloor}\frac{\log\left(\sqrt{\frac{24}{1-\gamma}}\right)}{\log(1+0.1\eta\sigma_{r})}\bigg{\rfloor}+1 and all T1+T2tTT_{1}+T_{2}\leq t\leq T, we have σr(At)0.8σr\sigma_{r}(A_{t})\geq 0.8\sqrt{\sigma_{r}}.

Proof.

We prove it by induction. By Proposition B.3, the inequality holds for t=T1t=T_{1}. Suppose the result holds for T1t<TT_{1}\leq t<T. By the same argument as in the proof of Proposition B.3, (B.3) and (B.4) hold. It follows that

Et16(1+γησr)2tρ2rσ1σ116ρ1γ1+γrσ1σ10.01σr1γ24σr0.01σrσr(At).\|E_{t}\|\leq 16(1+\gamma\eta\sigma_{r})^{2t}\rho^{2}r\sigma_{1}\sqrt{\sigma_{1}}\leq 16\rho^{\frac{1-\gamma}{1+\gamma}}r\sigma_{1}\sqrt{\sigma_{1}}\leq 0.01\sigma_{r}\sqrt{\frac{1-\gamma}{24}\sigma_{r}}\leq 0.01\sigma_{r}\sigma_{r}(A_{t}). (B.5)

Applying Lemma C.7, we have

σr(At+1)\displaystyle\sigma_{r}(A_{t+1}) σr(At+η(ΣAtAt)At)ηEt\displaystyle\geq\sigma_{r}(A_{t}+\eta(\Sigma-A_{t}A_{t}^{\top})A_{t})-\eta\|E_{t}\|
(14η2σ12)(1+ησr)σr(At)(1ησr2(At))0.01ησrσr(At)\displaystyle\geq(1-4\eta^{2}\sigma_{1}^{2})(1+\eta\sigma_{r})\sigma_{r}(A_{t})(1-\eta\sigma_{r}^{2}(A_{t}))-0.01\eta\sigma_{r}\sigma_{r}(A_{t})
(1+0.99ησr)σr(At)(1ησr2(At))0.01ησrσr(At)\displaystyle\geq(1+0.99\eta\sigma_{r})\sigma_{r}(A_{t})(1-\eta\sigma_{r}^{2}(A_{t}))-0.01\eta\sigma_{r}\sigma_{r}(A_{t})

We consider two cases.

  1.

    σr(At)0.9σr\sigma_{r}(A_{t})\geq 0.9\sqrt{\sigma_{r}}. We have

    σr(At+1)(1ησ12(At)0.01ησr)σr(At)0.8σr,\displaystyle\sigma_{r}(A_{t+1})\geq(1-\eta\sigma_{1}^{2}(A_{t})-0.01\eta\sigma_{r})\sigma_{r}(A_{t})\geq 0.8\sqrt{\sigma_{r}},

    where the last inequality follows from σ1(At)2σ1\sigma_{1}(A_{t})\leq\sqrt{2\sigma_{1}} and η1100σ1\eta\leq\frac{1}{100\sigma_{1}}.

  2.

    (1+0.1ησr)tT11γ24σrσr(At)<0.9σr.\left(1+0.1\eta\sigma_{r}\right)^{t-T_{1}}\sqrt{\frac{1-\gamma}{24}\sigma_{r}}\leq\sigma_{r}(A_{t})<0.9\sqrt{\sigma_{r}}. We have

    σr(At+1)\displaystyle\sigma_{r}(A_{t+1}) (1+0.99ησr)(10.81ησr)σr(At)0.01ησrσr(At)\displaystyle\geq(1+0.99\eta\sigma_{r})(1-0.81\eta\sigma_{r})\sigma_{r}(A_{t})-0.01\eta\sigma_{r}\sigma_{r}(A_{t})
    (1+0.1ησr)σr(At)\displaystyle\geq(1+0.1\eta\sigma_{r})\sigma_{r}(A_{t})
    min{(1+0.1ησr)t+1T11γ24σr,0.8σr}.\displaystyle\geq\min\bigg{\{}\left(1+0.1\eta\sigma_{r}\right)^{t+1-T_{1}}\sqrt{\frac{1-\gamma}{24}\sigma_{r}},0.8\sqrt{\sigma_{r}}\bigg{\}}.

Combining the two cases completes the induction step. ∎

After σr(At)\sigma_{r}(A_{t}) exceeds 0.8σr0.8\sqrt{\sigma_{r}}, we show that Pt\|P_{t}\| decreases geometrically.

Proposition B.5.

Let T1T_{1} and T2T_{2} be defined as in Proposition B.3 and Proposition B.4. Suppose that the conditions of Proposition B.2 hold. For any T1+T2<tTT_{1}+T_{2}<t\leq T, we have

Pt2(13ησr2)tT1T2σ1+80ρ1γ1+γrσ12σr.\|P_{t}\|\leq 2\left(1-\frac{3\eta\sigma_{r}}{2}\right)^{t-T_{1}-T_{2}}\sigma_{1}+\frac{80\rho^{\frac{1-\gamma}{1+\gamma}}r\sigma_{1}^{2}}{\sigma_{r}}.

Let T3=log(ρ1γ2(1+γ)/3)log(132ησr)+1T_{3}=\bigg{\lfloor}\frac{\log\big{(}\rho^{\frac{1-\gamma}{2(1+\gamma)}}/3\big{)}}{\log\left(1-\frac{3}{2}\eta\sigma_{r}\right)}\bigg{\rfloor}+1. Then for any T1+T2+T3tTT_{1}+T_{2}+T_{3}\leq t\leq T, we have Ptρ1γ2(1+γ)σ1\|P_{t}\|\leq\rho^{\frac{1-\gamma}{2(1+\gamma)}}\sigma_{1}.

Proof.

Recall the updating formula (B.1). By the same argument as in the proof of Item 4 of Proposition B.2, for T1+T2t<TT_{1}+T_{2}\leq t<T, we have the bound Et6η2σ12Pt+80ηρ1γ1+γrσ12\|E_{t}\|\leq 6\eta^{2}\sigma_{1}^{2}\|P_{t}\|+80\eta\rho^{\frac{1-\gamma}{1+\gamma}}r\sigma_{1}^{2}. On the other hand,

(Iη(ΣPt))Pt(Iη(ΣPt))\displaystyle(I-\eta(\Sigma-P_{t}))P_{t}(I-\eta(\Sigma-P_{t})) =(IηAtAt+ηBtBt)Pt(IηAtAt+ηBtBt).\displaystyle=(I-\eta A_{t}A_{t}^{\top}+\eta B_{t}B_{t}^{\top})P_{t}(I-\eta A_{t}A_{t}^{\top}+\eta B_{t}B_{t}^{\top}).

By Proposition B.2 and Proposition B.4, we have σr(At)0.8σr\sigma_{r}(A_{t})\geq 0.8\sqrt{\sigma_{r}} and BtBtF0.1σr\|B_{t}\|\leq\|B_{t}\|_{{\tiny\text{F}}}\leq 0.1\sqrt{\sigma_{r}}. It follows that

IηAtAt+ηBtBt10.8ησr+0.01ησr10.79ησr,\|I-\eta A_{t}A_{t}^{\top}+\eta B_{t}B_{t}^{\top}\|\leq 1-0.8\eta\sigma_{r}+0.01\eta\sigma_{r}\leq 1-0.79\eta\sigma_{r},

hence

(Iη(ΣPt))Pt(Iη(ΣPt))(10.79ησr)2Pt.\|(I-\eta(\Sigma-P_{t}))P_{t}(I-\eta(\Sigma-P_{t}))\|\leq(1-0.79\eta\sigma_{r})^{2}\left\|P_{t}\right\|.

As a result, we have

Pt+1\displaystyle\|P_{t+1}\| (10.79ησr)2Pt+6η2σ12Pt+80ηρ1γ1+γrσ12\displaystyle\leq(1-0.79\eta\sigma_{r})^{2}\left\|P_{t}\right\|+6\eta^{2}\sigma_{1}^{2}\|P_{t}\|+80\eta\rho^{\frac{1-\gamma}{1+\gamma}}r\sigma_{1}^{2}
(13ησr2)Pt+80ηρ1γ1+γrσ12.\displaystyle\leq(1-\frac{3\eta\sigma_{r}}{2})\|P_{t}\|+80\eta\rho^{\frac{1-\gamma}{1+\gamma}}r\sigma_{1}^{2}.

Applying the above inequality inductively, we have

Pt+1\displaystyle\|P_{t+1}\| (13ησr2)t+1T1T2PT1+T2+i=0(13ησr2)i80ηρ1γ1+γrσ12\displaystyle\leq\left(1-\frac{3\eta\sigma_{r}}{2}\right)^{t+1-T_{1}-T_{2}}\|P_{T_{1}+T_{2}}\|+\sum_{i=0}^{\infty}\left(1-\frac{3\eta\sigma_{r}}{2}\right)^{i}\cdot 80\eta\rho^{\frac{1-\gamma}{1+\gamma}}r\sigma_{1}^{2}
2(13ησr2)t+1T1T2σ1+80ρ1γ1+γrσ12σr.\displaystyle\leq 2\left(1-\frac{3\eta\sigma_{r}}{2}\right)^{t+1-T_{1}-T_{2}}\sigma_{1}+\frac{80\rho^{\frac{1-\gamma}{1+\gamma}}r\sigma_{1}^{2}}{\sigma_{r}}.

The result follows if we shift the index of PP by one. The statement for T3T_{3} follows from direct calculation and the upper bound on ρ\rho. ∎
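The following Python snippet is a small, informal numerical companion to the analysis above. It is only a sketch: the dimensions, spectrum, noise level, stepsize, and initialization scale below are chosen purely for illustration and are not the constants appearing in the propositions. It runs the dynamic (\mathcal{GD}-\mathcal{M}) from a small random initialization and prints the Frobenius distance between FtGtF_{t}G_{t}^{\top} and the best rank-rr approximation XrX_{r} of the observed matrix.

import numpy as np

rng = np.random.default_rng(0)
m, n, k, r = 40, 30, 30, 3
eta, rho = 0.05, 1e-6

# observed matrix: a rank-r signal plus a small dense perturbation (all values illustrative)
U = np.linalg.qr(rng.normal(size=(m, r)))[0]
V = np.linalg.qr(rng.normal(size=(n, r)))[0]
X = U @ np.diag([3.0, 2.0, 1.0]) @ V.T + 0.01 * rng.normal(size=(m, n))

# best rank-r approximation X_r of X via the SVD
Pu, s, Qt = np.linalg.svd(X, full_matrices=False)
Xr = Pu[:, :r] @ np.diag(s[:r]) @ Qt[:r, :]

# vanilla gradient descent for the factorization objective, from a small random initialization
F = rho * rng.normal(size=(m, k))
G = rho * rng.normal(size=(n, k))
for t in range(1, 5001):
    R = F @ G.T - X
    F, G = F - eta * R @ G, G - eta * R.T @ F   # simultaneous update of both factors
    if t % 1000 == 0:
        print(t, np.linalg.norm(F @ G.T - Xr, 'fro'))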

Appendix C Auxiliary lemmas and additional literature review

C.1 Auxiliary lemmas

In this section, we state several technical lemmas that are used in the proof of Theorem 4.2. These lemmas are either direct consequences of standard results or follow from simple algebraic calculations.

Lemma C.1 ([RV09, Theorem 1.1]).

Let AA be an r×kr\times k matrix with rkr\leq k and i.i.d. Gaussian entries with distribution N(0,1)N(0,1). Then for every cρ>0c_{\rho}>0 we have with probability at least 1(Ccρ)kr+1exp(k/C)1-(Cc_{\rho})^{k-r+1}-\exp(-k/C) that

σmin(A)cρ(kr1)cρ2k,\sigma_{\min}(A)\geq c_{\rho}\left(\sqrt{k}-\sqrt{r-1}\right)\geq\frac{c_{\rho}}{2\sqrt{k}},

where CC is a universal constant.

Lemma C.2 ([Wai19, Example 2.32, Exercise 5.14]).

Let WW be an n×Nn\times N matrix with i.i.d. Gaussian entries with distribution N(0,1)N(0,1). Then

σmax(W)3n+N\sigma_{\max}(W)\leq 3\sqrt{n+N}

with probability at least 12en+N21-2e^{-\frac{n+N}{2}}.

Lemma C.3 (Initialization Quality).

Suppose that we sample F~0m×k,G~0n×k\tilde{F}_{0}\in\mathbb{R}^{m\times k},\tilde{G}_{0}\in\mathbb{R}^{n\times k} with i.i.d. N(0,σ1)N(0,\sigma_{1}) entries. For any fixed ρ>0\rho>0, if we take F0=ρ3m+n+kF~0F_{0}=\frac{\rho}{3\sqrt{m+n+k}}\tilde{F}_{0} and G0=ρ3m+n+kG~0G_{0}=\frac{\rho}{3\sqrt{m+n+k}}\tilde{G}_{0}, then with probability at least 1(Ccρ)kr+1Cexp(k/C)1-(Cc_{\rho})^{k-r+1}-C\exp(-k/C), we have

  1.

    F0ρσ1\left\|F_{0}\right\|\leq\rho\sqrt{\sigma_{1}}.

  2.

    G0ρσ1\left\|G_{0}\right\|\leq\rho\sqrt{\sigma_{1}}.

  3.

    σr(A0)cρρσ112(m+n+k)\sigma_{r}(A_{0})\geq\frac{c_{\rho}\rho\sqrt{\sigma_{1}}}{12(m+n+k)}.

Proof.

By Lemma C.2, with probability at least 12em+k22en+k21-2e^{-\frac{m+k}{2}}-2e^{-\frac{n+k}{2}}, we have F~03m+kσ1\|\tilde{F}_{0}\|\leq 3\sqrt{m+k}\sqrt{\sigma_{1}} and G~03n+kσ1\|\tilde{G}_{0}\|\leq 3\sqrt{n+k}\sqrt{\sigma_{1}}. Hence, we have F0ρσ1\|F_{0}\|\leq\rho\sqrt{\sigma_{1}} and G0ρσ1\|G_{0}\|\leq\rho\sqrt{\sigma_{1}}. On the other hand, A0=U0+V02A_{0}=\frac{U_{0}+V_{0}}{2} has i.i.d. N(0,σ12)N(0,\frac{\sigma_{1}}{2}) entries. By Lemma C.1, with probability at least 1(Ccρ)kr+1exp(k/C)1-(Cc_{\rho})^{k-r+1}-\exp(-k/C), we have

σr(A0)=ρ3m+n+kσr(A~0)cρρσ112(m+n+k).\sigma_{r}(A_{0})=\frac{\rho}{3\sqrt{m+n+k}}\sigma_{r}(\tilde{A}_{0})\geq\frac{c_{\rho}\rho\sqrt{\sigma_{1}}}{12(m+n+k)}.

Combining the above, all items hold with probability at least

1(Ccρ)kr+1exp(k/C)2em+k22en+k2.1-(Cc_{\rho})^{k-r+1}-\exp(-k/C)-2e^{-\frac{m+k}{2}}-2e^{-\frac{n+k}{2}}.

By increasing CC if necessary, we obtain the desired result. ∎
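As a quick numerical sanity check of the scaling in Lemma C.3 (a sketch only; the dimensions and constants below are arbitrary, and Item 3 is omitted because it involves A0A_{0}, which is defined in Appendix A), one can draw the Gaussian initialization in Python and verify Items 1 and 2 empirically.

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 60, 50, 20
sigma1, rho = 2.0, 1e-3
scale = rho / (3.0 * np.sqrt(m + n + k))

violations = 0
for _ in range(200):
    # tilde_F0, tilde_G0 have i.i.d. N(0, sigma1) entries; F0, G0 are the rescaled versions
    F0 = scale * rng.normal(0.0, np.sqrt(sigma1), size=(m, k))
    G0 = scale * rng.normal(0.0, np.sqrt(sigma1), size=(n, k))
    if np.linalg.norm(F0, 2) > rho * np.sqrt(sigma1) or np.linalg.norm(G0, 2) > rho * np.sqrt(sigma1):
        violations += 1
print("trials violating Items 1 and 2:", violations)  # expected to be 0 in the vast majority of runs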

Lemma C.4.

Let SS be an r×kr\times k matrix with rkr\leq k. If S13η\|S\|\leq\sqrt{\frac{1}{3\eta}}, then the largest and smallest singular values of S(IηSS)S(I-\eta S^{\top}S) are σ1(S)ησ13(S)\sigma_{1}(S)-\eta\sigma_{1}^{3}(S) and σr(S)ησr3(S)\sigma_{r}(S)-\eta\sigma_{r}^{3}(S), respectively.

Proof.

Let USΣSVSU_{S}\Sigma_{S}V_{S}^{\top} be the SVD of SS. Simple algebra shows that

S(IηSS)=US(ΣSηΣS3)VS.S\left(I-\eta S^{\top}S\right)=U_{S}\left(\Sigma_{S}-\eta\Sigma_{S}^{3}\right)V_{S}^{\top}. (C.1)

This is exactly the SVD of S(IηSS)S\left(I-\eta S^{\top}S\right). Let g(x)=xηx3g(x)=x-\eta x^{3}. By taking the derivative, we see that gg is monotonically increasing on the interval [13η,13η][-\sqrt{\frac{1}{3\eta}},\sqrt{\frac{1}{3\eta}}]. Since the singular values of S(IηSS)S(I-\eta S^{\top}S) are exactly the singular values of SS mapped by gg, the result follows. ∎
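A short numerical check of Lemma C.4 (a sketch; the sizes and the value of η\eta below are arbitrary) confirms that the singular values of S(IηSS)S(I-\eta S^{\top}S) are exactly the singular values of SS mapped by gg.

import numpy as np

rng = np.random.default_rng(0)
r, k, eta = 4, 7, 0.05
S = rng.normal(size=(r, k))
S = 0.9 * S / (np.sqrt(3 * eta) * np.linalg.norm(S, 2))          # enforce ||S|| < sqrt(1/(3*eta))

sig = np.linalg.svd(S, compute_uv=False)                         # singular values of S
mapped = sig - eta * sig**3                                      # g(sigma_i) with g(x) = x - eta x^3
left = np.linalg.svd(S @ (np.eye(k) - eta * S.T @ S), compute_uv=False)
print(np.allclose(left, mapped))                                 # expected True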

Lemma C.5.

Let SS be an r×rr\times r matrix such that S<1\left\|S\right\|<1. Then I+SI+S is invertible and (I+S)111S.\left\|(I+S)^{-1}\right\|\leq\frac{1}{1-\left\|S\right\|}.

Proof.

Since S<1\left\|S\right\|<1, the matrix T=i=0(1)iSiT=\sum_{i=0}^{\infty}(-1)^{i}S^{i} is well defined and indeed TT is the inverse of I+SI+S. By continuity, subadditivity and submultiplicativity of operator norm,

(I+S)1=Ti=0Sii=0Si=11S.\displaystyle\left\|(I+S)^{-1}\right\|=\left\|T\right\|\leq\sum_{i=0}^{\infty}\left\|S^{i}\right\|\leq\sum_{i=0}^{\infty}\left\|S\right\|^{i}=\frac{1}{1-\left\|S\right\|}. (C.2)

Lemma C.6.

Let SS be an r×rr\times r matrix and TT be an r×kr\times k matrix. Then σr(ST)Sσr(T).\sigma_{r}(ST)\leq\left\|S\right\|\sigma_{r}(T).

Proof.

For any r×kr\times k matrix RR, the variational expression of rr-th singular value is

σr(R)=supsubspace Mkdim(M)=rinfxMx0Rxx.\sigma_{r}(R)=\sup_{\begin{subarray}{c}\text{subspace }M\subset\mathbb{R}^{k}\\ \dim(M)=r\end{subarray}}\inf_{\begin{subarray}{c}x\in M\\ x\neq 0\end{subarray}}\frac{\left\|Rx\right\|}{\left\|x\right\|}. (C.3)

Applying this variational result twice, we obtain

σr(ST)\displaystyle\sigma_{r}(ST) =supsubspace Mkdim(M)=rinfxMx0STxx\displaystyle=\sup_{\begin{subarray}{c}\text{subspace }M\subset\mathbb{R}^{k}\\ \dim(M)=r\end{subarray}}\inf_{\begin{subarray}{c}x\in M\\ x\neq 0\end{subarray}}\frac{\left\|STx\right\|}{\left\|x\right\|} (C.4)
supsubspace Mkdim(M)=rinfxMx0STxx\displaystyle\leq\sup_{\begin{subarray}{c}\text{subspace }M\subset\mathbb{R}^{k}\\ \dim(M)=r\end{subarray}}\inf_{\begin{subarray}{c}x\in M\\ x\neq 0\end{subarray}}\frac{\left\|S\right\|\left\|Tx\right\|}{\left\|x\right\|} (C.5)
=Sσr(T).\displaystyle=\left\|S\right\|\sigma_{r}(T). (C.6)

Lemma C.7.

Let SS be an r×kr\times k matrix with rkr\leq k. Suppose that σr(S)>0\sigma_{r}(S)>0. Let Λ=diag{σ1,,σr}r×r\Lambda={\rm diag}\{\sigma_{1},\ldots,\sigma_{r}\}\in\mathbb{R}^{r\times r} be a diagonal matrix. Suppose that ηΛSS<12\eta\|\Lambda-SS^{\top}\|<\frac{1}{2}, S13η\|S\|\leq\sqrt{\frac{1}{3\eta}}, 2η2ΛSS<12\eta^{2}\left\|\Lambda SS^{\top}\right\|<1, and S+=S+η(ΛSS)S.S_{+}=S+\eta(\Lambda-SS^{\top})S. Then we have σr(S+)(12η2ΛSS)(1+ησr)σr(S)(1ησr2(S))\sigma_{r}(S_{+})\geq(1-2\eta^{2}\left\|\Lambda SS^{\top}\right\|)(1+\eta\sigma_{r})\sigma_{r}(S)(1-\eta\sigma_{r}^{2}(S)).

Proof.

Since ηΛSS<1\eta\|\Lambda-SS^{\top}\|<1, matrix I+η(ΛSS)I+\eta(\Lambda-SS^{\top}) is invertible. Hence, we can write

S=(I+η(ΛSS))1S+.S=(I+\eta(\Lambda-SS^{\top}))^{-1}S_{+}.

On the other hand, by definition of S+S_{+}, we have

S+\displaystyle S_{+} =(I+ηΛ)S(IηSS)+η2ΛSSS\displaystyle=(I+\eta\Lambda)S(I-\eta S^{\top}S)+\eta^{2}\Lambda SS^{\top}S
=(I+ηΛ)S(IηSS)+η2ΛSS(I+η(ΛSS))1S+.\displaystyle=(I+\eta\Lambda)S(I-\eta S^{\top}S)+\eta^{2}\Lambda SS^{\top}(I+\eta(\Lambda-SS^{\top}))^{-1}S_{+}.

Therefore,

(Iη2ΛSS(I+η(ΛSS))1)S+=(I+ηΛ)S(IηSS)(I-\eta^{2}\Lambda SS^{\top}(I+\eta(\Lambda-SS^{\top}))^{-1})S_{+}=(I+\eta\Lambda)S(I-\eta S^{\top}S)

By Lemma C.4, we have

σr((I+ηΛ)S(IηSS))\displaystyle\sigma_{r}((I+\eta\Lambda)S(I-\eta S^{\top}S)) (1+ησr)σr(S(IηSS))\displaystyle\geq(1+\eta\sigma_{r})\sigma_{r}(S(I-\eta S^{\top}S))
=(1+ησr)σr(S)(1ησr(S)2).\displaystyle=(1+\eta\sigma_{r})\sigma_{r}(S)(1-\eta\sigma_{r}(S)^{2}).

On the other hand, by Lemma C.5,

σr((Iη2ΛSS(I+η(ΛSS))1)S+)\displaystyle\sigma_{r}((I-\eta^{2}\Lambda SS^{\top}(I+\eta(\Lambda-SS^{\top}))^{-1})S_{+})
Iη2ΛSS(I+η(ΛSS))1σr(S+)\displaystyle\leq\|I-\eta^{2}\Lambda SS^{\top}(I+\eta(\Lambda-SS^{\top}))^{-1}\|\sigma_{r}(S_{+})
(1+η2ΛSS1ηΛSS)σr(S+)\displaystyle\leq\Big{(}1+\eta^{2}\frac{\left\|\Lambda SS^{\top}\right\|}{1-\eta\left\|\Lambda-SS^{\top}\right\|}\Big{)}\sigma_{r}(S_{+})
(1+2η2ΛSS)σr(S+)\displaystyle\leq\Big{(}1+2\eta^{2}\left\|\Lambda SS^{\top}\right\|\Big{)}\sigma_{r}(S_{+})

Combining, we have

σr(S+)\displaystyle\sigma_{r}(S_{+}) (1+ησr)σr(S)(1ησr2(S))1+2η2ΛSS\displaystyle\geq\frac{(1+\eta\sigma_{r})\sigma_{r}(S)(1-\eta\sigma_{r}^{2}(S))}{1+2\eta^{2}\left\|\Lambda SS^{\top}\right\|}
(12η2ΛSS)(1+ησr)σr(S)(1ησr2(S)).\displaystyle\geq\Big{(}1-2\eta^{2}\left\|\Lambda SS^{\top}\right\|\Big{)}(1+\eta\sigma_{r})\sigma_{r}(S)(1-\eta\sigma_{r}^{2}(S)).
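The conclusion of Lemma C.7 can also be checked numerically; the snippet below is only a sketch, with the sizes, the spectrum of Λ\Lambda, and the value of η\eta chosen so that all of the stated conditions hold.

import numpy as np

rng = np.random.default_rng(0)
r, k, eta = 3, 6, 0.01
Lam = np.diag(np.sort(rng.uniform(0.5, 2.0, size=r))[::-1])   # Lambda = diag(sigma_1 >= ... >= sigma_r)
S = rng.normal(size=(r, k))
S = 0.5 * S / np.linalg.norm(S, 2)                            # small ||S||, so the conditions of the lemma hold

sig = np.linalg.svd(S, compute_uv=False)
S_plus = S + eta * (Lam - S @ S.T) @ S
lhs = np.linalg.svd(S_plus, compute_uv=False)[-1]             # sigma_r(S_+)
rhs = (1 - 2 * eta**2 * np.linalg.norm(Lam @ S @ S.T, 2)) * (1 + eta * Lam[-1, -1]) * sig[-1] * (1 - eta * sig[-1]**2)
print(lhs >= rhs)                                             # expected True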

Lemma C.8.

Let Br×kB\in\mathbb{R}^{r\times k} be a real matrix and Pr×rP\in\mathbb{R}^{r\times r} a symmetric matrix. We have BB,Pλr(P)BF2.\left\langle BB^{\top},P\right\rangle\geq\lambda_{r}(P)\|B\|_{F}^{2}.

Proof.

Let the eigenvalue decomposition of PP be P=UΛUP=U^{\top}\Lambda U, where Ur×rU\in\mathbb{R}^{r\times r} is an orthogonal matrix and Λ=diag{λ1,,λr}\Lambda={\rm diag}\{\lambda_{1},\ldots,\lambda_{r}\} consists of eigenvalues of PP. Let B=UBB^{\prime}=UB, we have

BB,P\displaystyle\left\langle BB^{\top},P\right\rangle =trace(BBP)\displaystyle=\text{trace}(BB^{\top}P)
=trace(BBUΛU)\displaystyle=\text{trace}(BB^{\top}U^{\top}\Lambda U)
=trace(BΛB)\displaystyle=\text{trace}(B^{\prime\top}\Lambda B^{\prime})
trace(B(ΛλrI)B)+λrtrace(BIB)\displaystyle\geq\text{trace}(B^{\prime\top}(\Lambda-\lambda_{r}I)B^{\prime})+\lambda_{r}\text{trace}(B^{\prime\top}IB^{\prime})
λrtrace(BIB)\displaystyle\geq\lambda_{r}\text{trace}(B^{\prime\top}IB^{\prime})
=λrBF2\displaystyle=\lambda_{r}\|B^{\prime}\|_{F}^{2}
=λrBF2,\displaystyle=\lambda_{r}\|B\|_{F}^{2},

where the last inequality follows from the fact that B(ΛλrI)BB^{\prime\top}(\Lambda-\lambda_{r}I)B^{\prime} is positive semi-definite. ∎
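Lemma C.8 admits an equally short numerical check (a sketch; the dimensions below are arbitrary, and PP is a random symmetric, possibly indefinite matrix).

import numpy as np

rng = np.random.default_rng(0)
r, k = 5, 8
B = rng.normal(size=(r, k))
M = rng.normal(size=(r, r))
P = (M + M.T) / 2                                             # symmetric, possibly indefinite
lhs = np.trace(B @ B.T @ P)                                   # <B B^T, P>
rhs = np.linalg.eigvalsh(P)[0] * np.linalg.norm(B, 'fro')**2  # lambda_r(P) * ||B||_F^2
print(lhs >= rhs)                                             # expected True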

Lemma C.9.

Under the assumption of Theorem A.1, we have

T1+T2+T3T1(32γ)(1γ)6(3γ+1).\frac{T_{1}+T_{2}+T_{3}}{T}\leq 1-\frac{(3-2\gamma)(1-\gamma)}{6(3\gamma+1)}.
Proof.

For simplicity, we omit the flooring and ceiling operations and assume

T1=log(12(m+n+k)1γ24σrcρρσ1)log(1+1+γ2ησr),T2=log(241γ)log(1+0.1ησr),T_{1}=\frac{\log\bigg{(}\frac{12(m+n+k)\sqrt{\frac{1-\gamma}{24}}\sqrt{\sigma_{r}}}{c_{\rho}\rho\sqrt{\sigma_{1}}}\bigg{)}}{\log(1+\frac{1+\gamma}{2}\eta\sigma_{r})},\qquad T_{2}=\frac{\log\left(\sqrt{\frac{24}{1-\gamma}}\right)}{\log(1+0.1\eta\sigma_{r})},

and

T3=log(ρ1γ2(1+γ)/3)log(132ησr),T=log(ρ1γ2(1+γ)/ρ)log(1+ηγσr)=3γ+12(1+γ)log(1ρ)log(1+ηγσr).T_{3}=\frac{\log\left(\rho^{\frac{1-\gamma}{2(1+\gamma)}}/3\right)}{\log\left(1-\frac{3}{2}\eta\sigma_{r}\right)},\qquad T=\frac{\log(\rho^{\frac{1-\gamma}{2(1+\gamma)}}/\rho)}{\log(1+\eta\gamma\sigma_{r})}=\frac{\frac{3\gamma+1}{2(1+\gamma)}\log\left(\frac{1}{\rho}\right)}{\log(1+\eta\gamma\sigma_{r})}.

Using the bound

ρmin{13,1γ24,cρσ112(m+n+k)1γ24σr}180γ(1+γ)(32γ)(1γ),\rho\leq\min\Bigg{\{}\frac{1}{3},\frac{1-\gamma}{24},\frac{c_{\rho}\sqrt{\sigma_{1}}}{12(m+n+k)\sqrt{\frac{1-\gamma}{24}}\sqrt{\sigma_{r}}}\Bigg{\}}^{\frac{180\gamma(1+\gamma)}{(3-2\gamma)(1-\gamma)}},

we have

log(12(m+n+k)1γ24σrcρσ1)(32γ)(1γ)72γlog(1ρ);\displaystyle\log\bigg{(}\frac{12(m+n+k)\sqrt{\frac{1-\gamma}{24}}\sqrt{\sigma_{r}}}{c_{\rho}\sqrt{\sigma_{1}}}\bigg{)}\leq\frac{(3-2\gamma)(1-\gamma)}{72\gamma}\log\left(\frac{1}{\rho}\right); (C.7)
log(241γ)(32γ)(1γ)360γ(1+γ)log(1ρ);\displaystyle\log\left(\sqrt{\frac{24}{1-\gamma}}\right)\leq\frac{(3-2\gamma)(1-\gamma)}{360\gamma(1+\gamma)}\log\left(\frac{1}{\rho}\right); (C.8)

and

log(3)(32γ)(1γ)24γ(1+γ)log(1ρ).\displaystyle\log(3)\leq\frac{(3-2\gamma)(1-\gamma)}{24\gamma(1+\gamma)}\log\left(\frac{1}{\rho}\right). (C.9)

We will prove the following three inequalities:

  1.

    T1T4γ3γ+1+(32γ)(1γ)18(3γ+1)\frac{T_{1}}{T}\leq\frac{4\gamma}{3\gamma+1}+\frac{(3-2\gamma)(1-\gamma)}{18(3\gamma+1)};

  2.

    T2T(32γ)(1γ)18(3γ+1)\frac{T_{2}}{T}\leq\frac{(3-2\gamma)(1-\gamma)}{18(3\gamma+1)};

  3.

    T3T2γ(1γ)3(3γ+1)+(32γ)(1γ)18(3γ+1).\frac{T_{3}}{T}\leq\frac{2\gamma(1-\gamma)}{3(3\gamma+1)}+\frac{(3-2\gamma)(1-\gamma)}{18(3\gamma+1)}.

The result will follow if we add these inequalities together. To prove the above inequalities, we make frequent use of the following two facts from calculus: (i) for any x(1,1)x\in(-1,1), x1+xlog(1+x)x\frac{x}{1+x}\leq\log(1+x)\leq x; (ii) for any 0<a<b0<a<b, log(1+aησr)log(1+bησr)ab\frac{\log(1+a\eta\sigma_{r})}{\log(1+b\eta\sigma_{r})}\leq\frac{a}{b}.

First, we have

T1T\displaystyle\frac{T_{1}}{T} =log(1+γησr)log(1+1+γ2ησr)log(12(m+n+k)1γ24σrcρρσ1)3γ+12(1+γ)log(1ρ)\displaystyle=\frac{\log(1+\gamma\eta\sigma_{r})}{\log(1+\frac{1+\gamma}{2}\eta\sigma_{r})}\frac{\log\bigg{(}\frac{12(m+n+k)\sqrt{\frac{1-\gamma}{24}}\sqrt{\sigma_{r}}}{c_{\rho}\rho\sqrt{\sigma_{1}}}\bigg{)}}{\frac{3\gamma+1}{2(1+\gamma)}\log\left(\frac{1}{\rho}\right)}
2γ1+γlog(12(m+n+k)1γ24σrcρρσ1)3γ+12(1+γ)log(1ρ)\displaystyle\leq\frac{2\gamma}{1+\gamma}\frac{\log\bigg{(}\frac{12(m+n+k)\sqrt{\frac{1-\gamma}{24}}\sqrt{\sigma_{r}}}{c_{\rho}\rho\sqrt{\sigma_{1}}}\bigg{)}}{\frac{3\gamma+1}{2(1+\gamma)}\log\left(\frac{1}{\rho}\right)}
4γ3γ+1(1+(32γ)(1γ)72γ)\displaystyle\leq\frac{4\gamma}{3\gamma+1}\bigg{(}1+\frac{(3-2\gamma)(1-\gamma)}{72\gamma}\bigg{)}
=4γ3γ+1+(32γ)(1γ)18(3γ+1),\displaystyle=\frac{4\gamma}{3\gamma+1}+\frac{(3-2\gamma)(1-\gamma)}{18(3\gamma+1)},

where the first inequality follows from fact 2, and the second inequality follows from (C.7). Next, we have

T2T\displaystyle\frac{T_{2}}{T} =log(1+γησr)log(1+0.1ησr)log(241γ)3γ+12(1+γ)log(1ρ)\displaystyle=\frac{\log(1+\gamma\eta\sigma_{r})}{\log(1+0.1\eta\sigma_{r})}\frac{\log\left(\sqrt{\frac{24}{1-\gamma}}\right)}{\frac{3\gamma+1}{2(1+\gamma)}\log\left(\frac{1}{\rho}\right)}
20γ(1+γ)3γ+1log(241γ)log(1ρ)\displaystyle\leq\frac{20\gamma(1+\gamma)}{3\gamma+1}\frac{\log\left(\sqrt{\frac{24}{1-\gamma}}\right)}{\log\left(\frac{1}{\rho}\right)}
(32γ)(1γ)18(3γ+1),\displaystyle\leq\frac{(3-2\gamma)(1-\gamma)}{18(3\gamma+1)},

where the first inequality follows from fact 2 and the second inequality follows from (C.8). Finally, we have

T3T\displaystyle\frac{T_{3}}{T} =log(1+ηγσr)log(132ησr)log(3/ρ1γ2(1+γ))3γ+12(1+γ)log(1ρ)\displaystyle=\frac{\log(1+\eta\gamma\sigma_{r})}{-\log\left(1-\frac{3}{2}\eta\sigma_{r}\right)}\frac{\log\left(3/\rho^{\frac{1-\gamma}{2(1+\gamma)}}\right)}{\frac{3\gamma+1}{2(1+\gamma)}\log\left(\frac{1}{\rho}\right)}
4γ(1+γ)3(1+3γ)log(3)+1γ2(1+γ)log(1ρ)log(1ρ)\displaystyle\leq\frac{4\gamma(1+\gamma)}{3(1+3\gamma)}\frac{\log(3)+\frac{1-\gamma}{2(1+\gamma)}\log\left(\frac{1}{\rho}\right)}{\log\left(\frac{1}{\rho}\right)}
2γ(1γ)3(3γ+1)+(32γ)(1γ)18(3γ+1),\displaystyle\leq\frac{2\gamma(1-\gamma)}{3(3\gamma+1)}+\frac{(3-2\gamma)(1-\gamma)}{18(3\gamma+1)},

where the first inequality follows from fact 1 and the second inequality follows from (C.9). Therefore, the result follows. ∎

Lemma C.10 ([YD21, Lemma 3.3]).

Suppose P,Σr×rP,\Sigma\in\mathbb{R}^{r\times r} are two symmetric matrices, η>0\eta>0, and P=(Iη(ΣP))P(Iη(ΣP))P^{\prime}=(I-\eta(\Sigma-P))P(I-\eta(\Sigma-P)). Suppose σ1(P)2σ1\sigma_{1}(P)\leq 2\sigma_{1} and σrIΣσ1I\sigma_{r}I\preceq\Sigma\preceq\sigma_{1}I. Then, for all β(0,1)\beta\in(0,1) and ηβ8σ1\eta\leq\frac{\beta}{8\sigma_{1}}, it holds that

λr(P){(1ησr)2λr(P)8+6β1βη2σ13if λr(P)<00if λr(P)0.\displaystyle\lambda_{r}(P^{\prime})\geq\begin{cases}(1-\eta\sigma_{r})^{2}\lambda_{r}(P)-\frac{8+6\beta}{1-\beta}\eta^{2}\sigma_{1}^{3}&\text{if $\lambda_{r}(P)<0$}\\ 0&\text{if $\lambda_{r}(P)\geq 0$}.\end{cases}
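Since Lemma C.10 is quoted from [YD21], the snippet below only serves as a quick numerical sanity check (a sketch; the dimension, the values of σr\sigma_{r}, σ1\sigma_{1}, β\beta, and the random matrices are illustrative).

import numpy as np

rng = np.random.default_rng(0)
r = 5
sigma_r, sigma_1, beta = 0.5, 2.0, 0.5
eta = beta / (8 * sigma_1)

Sigma = np.diag(rng.uniform(sigma_r, sigma_1, size=r))        # sigma_r I <= Sigma <= sigma_1 I
M = rng.normal(size=(r, r))
P = (M + M.T) / 2
P = 2 * sigma_1 * P / np.linalg.norm(P, 2)                    # enforce sigma_1(P) <= 2 sigma_1

P_new = (np.eye(r) - eta * (Sigma - P)) @ P @ (np.eye(r) - eta * (Sigma - P))
lam_r = np.linalg.eigvalsh(P)[0]
bound = (1 - eta * sigma_r)**2 * lam_r - (8 + 6 * beta) / (1 - beta) * eta**2 * sigma_1**3 if lam_r < 0 else 0.0
print(np.linalg.eigvalsh(P_new)[0] >= bound)                  # expected True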

C.2 Additional literature review

The literature on gradient descent for matrix factorization is vast; see [CC18, CLC19] for a comprehensive survey of the literature, most of which focuses on the exact parametrization case k=rk=r (where rr is the target rank or the rank of a certain ground truth matrix XX_{\natural}) with the regularizer FFGGF2\|F^{\top}F-G^{\top}G\|_{{\tiny\text{F}}}^{2} that balances the magnitudes of FF and GG. Below we review recent progress on overparametrization for matrix factorization without additional regularizers. A summary of the results can be found in Table 1.

Asymmetric Range of kk Gap rr-SVD Speed Cold Start
[LMZ18] k=mk=m 1 fast
[ZKHC21] k=𝒪(r)k=\mathcal{O}(r) 0.99\geq 0.99 n.a.
[SS21] krk\geq r 1 fast
[YD21] k=rk=r 1 slow
[FYY20] k=2m+2nk=2m+2n 12\geq\frac{1}{2} n.a.
Our work krk\geq r >0>0 fast
Table 1: A comparison of the settings and results from existing work on overparametrized matrix factorization. The column Asymmetric summarizes whether the result applies to a general asymmetric matrix XX. The column Range of kk shows the value of the rank parametrization kk to which the result applies. The column Gap shows the requirement on the relative singular value gap δ=σrσr+1σr\delta=\frac{\sigma_{r}-\sigma_{r+1}}{\sigma_{r}}; note that δ=1\delta=1 means XX is exactly rank-rr. The column rr-SVD shows whether the analysis proves that gradient descent (𝒢𝒟\mathcal{GD}-\mathcal{M}) converges arbitrarily close to XrX_{r} (✓) or with an error bounded away from zero (✗). In the column Speed, we label the work as fast if it shows that gradient descent converges to XrX_{r} in a number of iterations that is logarithmic in the inverse of the final error 1/ϵ1/\epsilon and the dimension m,nm,n of XX; we label it as slow if the iteration complexity is polynomial in 1/ϵ,m,n1/\epsilon,m,n; we write n.a. if the error cannot be made arbitrarily small. The last column Cold Start shows whether the result allows the initial relative signal (1.1) to be small (✓) or requires the signal to be larger than a universal constant (✗).

Matrix sensing with positive semidefinite matrices

A majority of existing theoretical work on overparametrization (k>r)(k>r) focuses on the matrix sensing problem: (approximately) recovering a positive semidefinite (psd) and rank-rr ground truth matrix XX_{\natural} from some linear measurement b=𝒜(X)+eb=\mathcal{A}(X_{\natural})+e, where ee is the noise vector and the linear constraint map 𝒜\mathcal{A} satisfies the so-called Restricted Isometry Property [RFP10, Definition 3.1] (when 𝒜\mathcal{A} is the identity map, then this problem becomes the matrix factorization problem). These works analyze the gradient descent dynamics applied to the problem min𝒜(FF)b2\min\|\mathcal{A}(FF^{\top})-b\|^{2}. In the noiseless (e=0e=0) setting with an exactly rank-rr XX_{\natural}, the pioneer work [LMZ18] and subsequent improvement in [SS21] show gradient descent recovers XX_{\natural} using random small initialization and arbitrary rank overparametrization kk. In the noisy and approximately rank-rr setting, the work in [ZKHC21] shows that for arbitrary rank overparametrization, spectral initialization followed by gradient descent approximately recovers XX_{\natural} with a sublinear rate of convergence. However, their error bound for the gradient descent output with respect to XX_{\natural} scales with the overparametrization kk, i.e., the algorithm overfits the noise under overparametrization. In particular, with k=mk=m (mm being the dimension of XX_{\natural}), this error could be worse than that of the trivial estimator 0. A similar limitation, that the error and/or sample complexity scales with kk, also appears in earlier work on landscape analysis [Zha21] as well as the recent work on preconditioned gradient descent [ZFZ21] and subgradient methods [MF21, DJC+21]. Existing results along this line all focus on positive semidefinite ground truth XX_{\natural} whose eigengap between the rr-th and (r+1)(r+1)-th eigenvalues is significant. In comparison, our results apply to a general asymmetric XX with arbitrary singular values, and our error bound depends only on the initialization size and stopping time but not kk.

Matrix factorization and general asymmetric XX

The work [YD21] also provides recovery guarantees for vanilla gradient descent (𝒢𝒟\mathcal{GD}-\mathcal{M}) with random small initialization. Their result only applies to the setting where the matrix XX has exactly rank rr and k=rk=r, i.e., with exact parametrization. Moreover, their choice of stepsize is conservative and consequently their iteration complexity scales proportionally with the matrix dimension m+nm+n. In comparison, we allow for significantly larger stepsizes and establish almost dimension-free iteration complexity bounds. To achieve an ϵ\epsilon accuracy, the result in [YD21] requires O((m+n)2σ14r4σr4log(σrϵ))O\left(\frac{(m+n)^{2}\sigma_{1}^{4}r^{4}}{\sigma_{r}^{4}}\log\left(\frac{\sigma_{r}}{\epsilon}\right)\right) iterations, while our main theorem only requires O(σ13σr3log(σrϵ))O\left(\frac{\sigma_{1}^{3}}{\sigma_{r}^{3}}\log\left(\frac{\sigma_{r}}{\epsilon}\right)\right) iterations. The work in [FYY20] considers a wide range of statistical problems with a symmetric ground truth matrix XX_{\natural}, and shows that XX_{\natural} can be recovered with near optimal statistical errors using gradient descent for the objective function (\mathcal{M}) with FGFG^{\top} replaced by FFGGFF^{\top}-GG^{\top}. While one may translate their results to the asymmetric setting via a dilation argument, it is feasible to do so only under the specific rank parametrization k=2m+2nk=2m+2n. This strong restriction on kk allows for a decoupling of the dynamics of different singular values, which is essential to their analysis. While this decoupled setting provides intuition for the general setting (as we elaborate in Section 3), the same analysis no longer applies for other values of kk, e.g., k=2m+2n1k=2m+2n-1, in which case the singular values do not decouple. Moreover, a smaller value of kk leads to the cold start issue, as discussed in footnote 1.

Deep matrix factorization

The work in [CGMR20] studies the deep matrix factorization problem of factorizing a given matrix XX into a product of multiple matrices. While on a high level their results deliver a message similar to our work—namely, gradient descent sequentially approaches the principal components of XX—the technical details differ significantly. In particular, their results only apply to symmetric XX and guarantee recovery of the positive semidefinite part of XX. Their analysis relies crucially on the assumption k=m=nk=m=n, a specific identity initialization scheme and the resulting decoupled dynamics, which do not hold in the general setting as discussed above. A major contribution of our work lies in handling the entanglement of singular values resulting from general overparametrization, asymmetry, and random initialization.

C.3 The general symmetric setting

In this section, we show that the arguments in Section 3 can be generalized to the setting where the observed matrix XX is a general symmetric positive semidefinite (p.s.d.) matrix.

In particular, we illustrate the behavior of gradient descent with small random initialization in the setting with (i) m=n=km=n=k, and (ii) Xm×mX\in\mathbb{R}^{m\times m} is a p.s.d. matrix with eigen decomposition X=UΣUX=U\Sigma U^{\top}, where Σ=diag(λ1,λ2,,λm)\Sigma={\rm diag}(\lambda_{1},\lambda_{2},\ldots,\lambda_{m}). Moreover, we assume that the rr-th eigengap exists, i.e. λr+1γλr\lambda_{r+1}\leq\gamma\lambda_{r} for some γ[0,1)\gamma\in[0,1). The following argument works for any γ<1\gamma<1; for ease of presentation, we assume γ=110\gamma=\frac{1}{10}. We consider the natural objective function f(F)=14FFXF2f(F)=\frac{1}{4}\|FF^{\top}-X\|_{{\tiny\text{F}}}^{2} and the associated gradient descent dynamic Ft+1=Ftη(FtFtX)FtF_{t+1}=F_{t}-\eta(F_{t}F_{t}^{\top}-X)F_{t}, with initialization F0=ρλ1IF_{0}=\rho\sqrt{\lambda_{1}}I for some small ρ>0\rho>0 (one can obtain the same dynamic by using the initialization F0=G0F_{0}=G_{0} in (𝒢𝒟\mathcal{GD}-\mathcal{M})). Let F~t:=UFtU\tilde{F}_{t}:=U^{\top}F_{t}U. To understand how FtFtF_{t}F_{t}^{\top} approaches XrX_{r}, we can equivalently study how F~tF~t\tilde{F}_{t}\tilde{F}_{t}^{\top} approaches Σr\Sigma_{r}. By simple algebra, we have

F~t+1=F~tη(F~tF~tΣ)F~t.\displaystyle\tilde{F}_{t+1}=\tilde{F}_{t}-\eta(\tilde{F}_{t}\tilde{F}_{t}^{\top}-\Sigma)\tilde{F}_{t}. (C.10)

Since F~0=F0\tilde{F}_{0}=F_{0} is diagonal, it is easy to see that F~t\tilde{F}_{t} is diagonal for all t0t\geq 0, and the ii-th diagonal element in F~t\tilde{F}_{t}, denoted by fi,tf_{i,t}, is updated as

fi,t+1=fi,t(1+ηλiηfi,t2).f_{i,t+1}=f_{i,t}(1+\eta\lambda_{i}-\eta f_{i,t}^{2}). (C.11)

Thus, the dynamics of all the eigenvalues decouple and can be analyzed separately. In particular, simple algebra shows that for 1ir1\leq i\leq r, (i) when fi,t<λi2f_{i,t}<\sqrt{\frac{\lambda_{i}}{2}}, fi,tf_{i,t} increases geometrically by a factor of 1+ηλiηfi,t21+ηλr21+\eta\lambda_{i}-\eta f_{i,t}^{2}\geq 1+\frac{\eta\lambda_{r}}{2}, i.e., fi,t+1(1+ηλr2)fi,tf_{i,t+1}\geq\left(1+\frac{\eta\lambda_{r}}{2}\right)f_{i,t}, and (ii) once fi,tλi2f_{i,t}\geq\sqrt{\frac{\lambda_{i}}{2}} holds, fi,tf_{i,t} converges to λi\sqrt{\lambda_{i}} geometrically because

|fi,t+1λi|\displaystyle|f_{i,t+1}-\sqrt{\lambda_{i}}| =|fi,tλi||1ηfi,t(fi,t+λi)|\displaystyle=|f_{i,t}-\sqrt{\lambda_{i}}||1-\eta f_{i,t}(f_{i,t}+\sqrt{\lambda_{i}})|
(1ηλi2)|fi,tλi|(1ηλr2)|fi,tλi|.\displaystyle\leq(1-\frac{\eta\lambda_{i}}{2})|f_{i,t}-\sqrt{\lambda_{i}}|\leq(1-\frac{\eta\lambda_{r}}{2})|f_{i,t}-\sqrt{\lambda_{i}}|.

In summary, the first rr diagonal elements of F~t\tilde{F}_{t} will first increase geometrically by a factor of (at least) 1+ηλr21+\frac{\eta\lambda_{r}}{2}, and then converge to λi\sqrt{\lambda_{i}} geometrically by a factor of 1ηλr21-\frac{\eta\lambda_{r}}{2}.

What makes a difference, however, is that for ir+1i\geq r+1, fi,tf_{i,t} converges at an exponentially slower rate than the first rr diagonal elements. In particular, assuming the stepsize η\eta is sufficiently small, for ir+1i\geq r+1, we can show that fi,tf_{i,t} is always nonnegative and satisfies

fi,t+1\displaystyle f_{i,t+1} =fi,t(1+ηλiηfi,t2)\displaystyle=f_{i,t}(1+\eta\lambda_{i}-\eta f_{i,t}^{2}) (C.12)
(1+ηλi)fi,t(1+ηλr10)fi,t.\displaystyle\leq(1+\eta\lambda_{i})f_{i,t}\leq(1+\frac{\eta\lambda_{r}}{10})f_{i,t}.

Note that the growth factor 1+ηλr101+\frac{\eta\lambda_{r}}{10} is smaller than 1+ηλr21+\frac{\eta\lambda_{r}}{2}, the growth factor for fi,t,ir.f_{i,t},i\leq r. Consequently, we conclude that larger singular values converge (exponentially) faster. This property implies that FtFtF_{t}F_{t}^{\top} approaches X1,X2,X3X_{1},X_{2},X_{3}\ldots sequentially for a positive semidefinite XX with distinct singular values, since we can repeat the above argument for each r=1,2,3,r=1,2,3,\ldots
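The sequential behavior described above is easy to visualize numerically. The following sketch iterates the decoupled update (C.11) for an illustrative spectrum (the eigenvalues, stepsize, and initialization size below are arbitrary choices); the squared diagonal entries approach the corresponding eigenvalues, with the larger ones fitted first.

import numpy as np

lam = np.array([4.0, 2.0, 1.0, 0.1, 0.05])      # lambda_1 >= ... >= lambda_m (illustrative)
eta, rho = 0.01, 1e-4
f = rho * np.sqrt(lam[0]) * np.ones_like(lam)    # f_{i,0} = rho * sqrt(lambda_1), matching F_0 = rho sqrt(lambda_1) I

for t in range(1, 6401):
    f = f * (1 + eta * lam - eta * f**2)         # the decoupled update (C.11), applied coordinate-wise
    if t in (200, 400, 800, 1600, 3200, 6400):
        print(t, np.round(f**2, 3))              # f_{i,t}^2 approaches lambda_i, larger entries first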