
Robust Recovery via Implicit Bias of Discrepant Learning Rates for Double Over-parameterization

Chong You           Zhihui Zhu         Qing Qu         Yi Ma

Department of EECS, University of California, Berkeley
Department of Electrical and Computer Engineering, University of Denver
Center for Data Science, New York University
Abstract

Recent advances have shown that the implicit bias of gradient descent on over-parameterized models enables the recovery of low-rank matrices from linear measurements, even with no prior knowledge of the intrinsic rank. In contrast, for robust low-rank matrix recovery from grossly corrupted measurements, over-parameterization leads to overfitting when prior knowledge of both the intrinsic rank and the sparsity of the corruption is unavailable. This paper shows that with a double over-parameterization for both the low-rank matrix and the sparse corruption, gradient descent with discrepant learning rates provably recovers the underlying matrix even without prior knowledge of either the rank of the matrix or the sparsity of the corruption. We further extend our approach to the robust recovery of natural images by over-parameterizing images with deep convolutional networks. Experiments show that our method handles different test images and varying corruption levels with a single learning pipeline, where the network width and termination conditions do not need to be adjusted on a case-by-case basis. Underlying the success is again the implicit bias of discrepant learning rates on different over-parameterized parameters, which may bear on broader applications.

The first two authors contributed equally to this work.

1 Introduction

Learning over-parameterized models, which have more parameters than the problem's intrinsic dimension, is becoming a crucial topic in machine learning [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. While classical learning theory suggests that over-parameterized models tend to overfit [12], recent advances have shown that an optimization algorithm may produce an implicit bias that regularizes the solution with desired properties. This type of result has led to new insights and a better understanding of gradient descent for solving several fundamental problems, including logistic regression on linearly separable data [1], compressive sensing [2, 3], sparse phase retrieval [4], nonlinear least-squares [5], low-rank (deep) matrix factorization [6, 7, 8, 9], and deep linear neural networks [10, 11], among others.

Inspired by these recent advances [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], in this work we present a new type of practical method for robust recovery of structured signals via model over-parameterization. In particular, we aim to learn an unknown signal \mathbf{X}_{\star}\in\mathbb{R}^{n\times n} from its grossly corrupted linear measurements

𝐲=𝒜(𝐗)+𝐬,\displaystyle\mathbf{y}\;=\;\mathcal{A}(\mathbf{X}_{\star})\;+\;\mathbf{s}_{\star}, (1)

where \mathcal{A}(\cdot):\mathbb{R}^{n\times n}\mapsto\mathbb{R}^{m} is a linear operator and \mathbf{s}_{\star}\in\mathbb{R}^{m} is a (sparse) corruption vector. Variants of this problem appear ubiquitously in signal processing and machine learning [13, 14, 15, 16, 17, 18].

Robust recovery of low-rank matrices. Low-rank matrix recovery has broad applications in face recognition [13] (where self-shadowing, specularity, or saturation in brightness can be modeled as outliers), video surveillance [13] (where foreground objects are usually modeled as outliers), and beyond. A classical method for low-rank matrix recovery is nuclear norm minimization, where the nuclear norm (the tightest convex envelope of the matrix rank) is defined as the sum of singular values. Such a method is provably correct under certain incoherence conditions [13, 19]. However, minimizing the nuclear norm involves expensive computation of the singular value decomposition (SVD) of large matrices [20] (when n is large), which prohibits its application to problems of practical size.

The computational challenge has been addressed by the recent development of matrix factorization (MF) methods [21, 17]. Such methods parameterize the signal \mathbf{X}\in\mathbb{R}^{n\times n} via the factorization \mathbf{X}=\mathbf{U}\mathbf{U}^{\top} [22] and solve the associated nonconvex optimization problem with respect to (w.r.t.) \mathbf{U}\in\mathbb{R}^{n\times r}, where r is the rank of \mathbf{X}_{\star}. Since the problem is of much smaller size when r\ll n, this leads to more scalable methods that still enjoy guarantees of correctness [23, 24, 25, 17]. However, the effectiveness of such exact-parameterization methods hinges on exact knowledge of the rank of \mathbf{X}_{\star} and the percentage of corrupted entries in \mathbf{s}_{\star} (see Figure 1(a)), which are usually not available a priori in practice.

[Figure 1: two panels, (a) Robust Matrix Recovery and (b) Robust Image Recovery.]
Figure 1: Learning curves for robust recovery of low-rank matrices (a) and natural images (b). (a) The classical matrix factorization (MF) method with an \ell_1 penalty requires exact parameterization (left, blue); otherwise, over-parameterization leads to overfitting without early termination (left, black). (b) The previous DIP method with an \ell_1 penalty requires tuning the network width (width = 64, right, blue) or early termination (width = 192, right, black). For both problems, our method with double over-parameterization achieves superior performance and requires neither early termination nor precise parameterization (red curves).

Robust recovery of natural images. Robust recovery of natural images is often considered a challenging task due to the lack of a universal mathematical model for natural images. While sparse and low-rank based methods have been demonstrated to be effective for years [26, 27, 28, 29, 30, 31], the state-of-the-art performance is obtained by priors learned with deep convolutional neural networks. Such methods operate by end-to-end training of neural networks on pairs of corrupted/clean images [32, 33, 34], and often cannot effectively handle test cases whose corruption type and noise level differ from those of the training data.

Recently, this challenge has been partially addressed by the so-called deep image prior (DIP) [35], which has shown impressive results on many image processing tasks. The method is based on parameterizing an image 𝐗\mathbf{X} by a deep network 𝐗=ϕ(𝜽)\mathbf{X}=\phi(\bm{\theta}), which turns out to be a flexible prior for modeling the underlying distributions of a variety of natural images. The network ϕ(𝜽)\phi(\bm{\theta}) has a U-shaped architecture and may be viewed as a multi-layer, nonlinear extension of the low-rank matrix factorization 𝐔𝐔\mathbf{U}\mathbf{U}^{\top}. Hence, it may not come as a surprise that DIP also inherits the drawbacks of the exact-parameterization approach for low-rank matrix recovery. Namely, it requires either a meticulous choice of network width [36] or early stopping of the learning process [37] (see Figure 1(b)).

Overview of our methods and contributions. In this work, we show that the challenges associated with exact-parameterization methods can be simply and effectively dealt with via over-parameterization and discrepant learning rates. Our work is inspired by the recent results [6, 7] for low-rank matrix recovery with \mathbf{s}_{\star}=\mathbf{0}, which state that gradient descent on

min𝐔n×r12𝒜(𝐔𝐔)𝐲22\min_{\mathbf{U}\in\mathbb{R}^{n\times r^{\prime}}}\frac{1}{2}\left\|\mathcal{A}(\mathbf{U}\mathbf{U}^{\top})-\mathbf{y}\right\|_{2}^{2} (2)

converges to a low-rank regularized solution with \operatorname{rank}(\mathbf{U})=\operatorname{rank}(\mathbf{X}_{\star}) even when \mathbf{U}\in\mathbb{R}^{n\times r^{\prime}} over-parameterizes \mathbf{X}_{\star} with r^{\prime}=n\gg r. In the presence of sparse corruption \mathbf{s}_{\star}, it is tempting to replace the \ell_2-norm in (2) with an \ell_1-norm (a commonly used approach for robust optimization problems, such as robust regression [38], robust subspace learning [39, 40, 41], robust phase retrieval [16, 42], robust matrix recovery [15, 18], and many more) and solve

min𝐔n×r𝒜(𝐔𝐔)𝐲1.\min_{\mathbf{U}\in\mathbb{R}^{n\times r^{\prime}}}\;\left\|\mathcal{A}(\mathbf{U}\mathbf{U}^{\top})-\mathbf{y}\right\|_{1}. (3)

Unfortunately, such a naive approach easily overfits the corruptions with r=nr^{\prime}=n (see Figure 1(a)).

In this work, we introduce a double over-parameterization method for robust recovery. Our method is based on over-parameterizing both the low-rank matrix 𝐗\mathbf{X}_{\star} and the outliers 𝐬\mathbf{s}_{\star}, and leveraging implicit algorithmic bias to find the correct solution (𝐗,𝐬)(\mathbf{X}_{\star},\mathbf{s}_{\star}). The benefit of such an approach w.r.t. the state of the art is summarized as follows (see also Table 1):

  • More scalable. Our method is based on gradient descent only and does not require computing an SVD per iteration as in convex approaches [13]. Hence it is significantly more scalable.

  • Prior knowledge free. Our method requires no prior knowledge of either the rank of \mathbf{X}_{\star} or the sparsity of \mathbf{s}_{\star}, and is free of parameter tuning. This is in contrast to existing nonconvex approaches [15, 24, 18], for which the true rank and/or sparsity are required a priori.

  • Provably correct. Under conditions similar to those of the convex approach, our method converges to the ground truth (\mathbf{X}_{\star},\mathbf{s}_{\star}) asymptotically.

Table 1: Comparison of different approaches to matrix recovery.
                         Convex (Eq. (8))   Nonconvex (Eq. (3))   Ours (Eq. (4))
  Provably correct?             ✓                   ✓                   ✓
  Prior knowledge free?         ✓                                       ✓
  Scalable?                                         ✓                   ✓

Underlying the success of our method is the notion of implicit bias of discrepant learning rates. The idea is that the algorithmic low-rank and sparse regularizations need to be balanced in order to identify the underlying rank and sparsity. In the absence of an explicit regularization parameter to tune, we show in Section 2.3 that the desired balance can be obtained by using different learning rates for different optimization parameters. Such a property may be of separate interest and bear on a broader range of problems.

Finally, we demonstrate the broad applicability of our ideas on the task of the robust recovery of natural images. We only need to replace the over-parameterization 𝐔𝐔\mathbf{U}\mathbf{U}^{\top} for low-rank matrices by a U-shaped network ϕ(𝜽)\phi(\bm{\theta}) for natural images (as adopted in DIP [35]) and solve the optimization problem by gradient descent. This produces a powerful and easy-to-use method with favorable properties when compared to the original DIP (see Figure 1(b)). To be precise, our method handles different test images and varying corruption levels with a single learning pipeline, in which network width and termination conditions do not need to be adjusted on a case-by-case basis.

2 Main Results and Algorithms

2.1 A Double Over-Parameterization Formulation

As formulated in (1), we first consider the problem of recovering a rank-r (r\ll n) positive semidefinite matrix \mathbf{X}_{\star}\in\mathbb{R}^{n\times n} (by a lifting trick such as [43, 44], our method can be extended to handle arbitrary rectangular matrices) from its corrupted linear measurements \mathbf{y}=\mathcal{A}(\mathbf{X}_{\star})+\mathbf{s}_{\star}, where \mathbf{s}_{\star}\in\mathbb{R}^{m} is a vector of sparse corruptions. Recent advances on algorithmic implicit bias for optimizing over-parameterized models [6, 7, 8, 9, 2, 3] motivate us to consider a nonlinear least-squares formulation for robust matrix recovery, with the double over-parameterization \mathbf{X}=\mathbf{U}\mathbf{U}^{\top} and \mathbf{s}=\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h}:

min𝐔n×r,{𝐠,𝐡}mf(𝐔,𝐠,𝐡):=14𝒜(𝐔𝐔)+(𝐠𝐠𝐡𝐡)𝐲22,\displaystyle\min_{\mathbf{U}\in\mathbb{R}^{n\times r^{\prime}},\{\mathbf{g},\mathbf{h}\}\subseteq\mathbb{R}^{m}}\;f(\mathbf{U},\mathbf{g},\mathbf{h})\;:=\;\frac{1}{4}\left\|\mathcal{A}\left(\mathbf{U}\mathbf{U}^{\top}\right)+\left(\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h}\right)-\mathbf{y}\right\|_{2}^{2}, (4)

where the dimensional parameter r^{\prime}\geq r. In practice, the choice of r^{\prime} depends on how much prior information we have about \mathbf{X}_{\star}: r^{\prime} can either be taken as an estimated upper bound on r, or set to r^{\prime}=n when no prior knowledge is available. In the following, we provide more background on the choice of our formulation (4).

  • Implicit low-rank prior via matrix factorization. For the vanilla low-rank matrix recovery problem with no outliers (i.e., \mathbf{s}_{\star}=\mathbf{0}), the low-rank matrix \mathbf{X}_{\star} can be recovered via the over-parameterization \mathbf{X}=\mathbf{U}\mathbf{U}^{\top} [6, 7, 8, 9]. In particular, the work [6] showed that with small initialization and infinitesimal learning rate, gradient descent on (2) converges to the minimum nuclear norm solution under certain commutativity conditions on \mathcal{A}(\cdot).

  • Implicit sparse prior via Hadamard multiplication. For the classical sparse recovery problem [45, 46] which aims to recover a sparse 𝐬m\mathbf{s}_{\star}\in\mathbb{R}^{m} from its linear measurement 𝐛=𝐀𝐬\mathbf{b}=\mathbf{A}\mathbf{s}_{\star}, recent work [2, 3] showed that it can also be dealt with via over-parameterization 𝐬=𝐠𝐠𝐡𝐡\mathbf{s}=\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h}. In particular, the work [2] showed that with small initialization and infinitesimal learning rate, gradient descent on

    min{𝐠,𝐡}m𝐀(𝐠𝐠𝐡𝐡)𝐛22\min_{\{\mathbf{g},\mathbf{h}\}\subseteq\mathbb{R}^{m}}\;\left\|\mathbf{A}(\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h})-\mathbf{b}\right\|_{2}^{2} (5)

    correctly recovers the sparse \mathbf{s}_{\star} when \mathbf{A} satisfies certain restricted isometry properties [47].
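To make the Hadamard over-parameterization concrete, below is a minimal NumPy sketch of gradient descent on (5). The step size, iteration count, and initialization scale are illustrative assumptions rather than values prescribed by [2].

```python
import numpy as np

def sparse_recovery_hadamard(A, b, tau=1e-3, n_iters=50000, init_scale=1e-4):
    """Gradient descent on (5), with s over-parameterized as g∘g - h∘h."""
    m = A.shape[1]
    g = init_scale * np.ones(m)          # small initialization, as the theory requires
    h = init_scale * np.ones(m)
    for _ in range(n_iters):
        r = A @ (g * g - h * h) - b      # residual of the current estimate of s
        grad = A.T @ r                   # shared factor in the gradients w.r.t. g and h
        g = g - tau * 4.0 * grad * g     # grad_g ||A(g∘g-h∘h)-b||^2 =  4 (A^T r) ∘ g
        h = h + tau * 4.0 * grad * h     # grad_h ||A(g∘g-h∘h)-b||^2 = -4 (A^T r) ∘ h
    return g * g - h * h                 # recovered (approximately sparse) s
```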

The benefit of the double over-parameterization formulation in (4), as we shall see, is that it allows specific algorithms to automatically identify both the intrinsic rank of 𝐗\mathbf{X}_{\star} and the sparsity level of 𝐬\mathbf{s}_{\star} without any prior knowledge.

2.2 Algorithmic Regularizations via Gradient Descent

Obviously, over-parameterization leads to under-determined problems that can have infinitely many solutions, so not all solutions of (4) correspond to the desired (\mathbf{X}_{\star},\mathbf{s}_{\star}). For example, for any given \mathbf{U}, one can always construct a pair (\mathbf{g},\mathbf{h}) such that (\mathbf{U},\mathbf{g},\mathbf{h}) achieves the global minimum value f=0 in (4). Nonetheless, as we show in this work, the gradient descent iteration on (4),

𝐔k+1=𝐔kτ𝐔f(𝐔k,𝐠k,𝐡k)=𝐔kτ𝒜(𝐫k)𝐔k,[𝐠k+1𝐡k+1]=[𝐠k𝐡k]ατ[𝐠f(𝐔k,𝐠k,𝐡k)𝐡f(𝐔k,𝐠k,𝐡k)]=[𝐠k𝐡k]ατ[𝐫k𝐠k𝐫k𝐡k],\displaystyle\begin{split}\mathbf{U}_{k+1}\;&=\;\mathbf{U}_{k}\;-\;\tau\cdot\nabla_{\mathbf{U}}f(\mathbf{U}_{k},\mathbf{g}_{k},\mathbf{h}_{k})\;=\;\mathbf{U}_{k}\;-\;\tau\cdot\mathcal{A}^{*}\left(\mathbf{r}_{k}\right)\mathbf{U}_{k},\\ \begin{bmatrix}\mathbf{g}_{k+1}\\ \mathbf{h}_{k+1}\end{bmatrix}\;&=\;\begin{bmatrix}\mathbf{g}_{k}\\ \mathbf{h}_{k}\end{bmatrix}\;-\;\alpha\cdot\tau\cdot\begin{bmatrix}\nabla_{\mathbf{g}}f(\mathbf{U}_{k},\mathbf{g}_{k},\mathbf{h}_{k})\\ \nabla_{\mathbf{h}}f(\mathbf{U}_{k},\mathbf{g}_{k},\mathbf{h}_{k})\end{bmatrix}\;=\;\begin{bmatrix}\mathbf{g}_{k}\\ \mathbf{h}_{k}\end{bmatrix}\;-\;\alpha\cdot\tau\cdot\begin{bmatrix}\mathbf{r}_{k}\circ\mathbf{g}_{k}\\ -\mathbf{r}_{k}\circ\mathbf{h}_{k}\end{bmatrix},\end{split} (6)

with properly chosen learning rates (\tau,\;\alpha\cdot\tau) enforces an implicit bias on the solution path, so that it automatically identifies the desired, regularized solution (\mathbf{X}_{\star},\mathbf{s}_{\star}). Here, in (6), \mathcal{A}^{*}(\cdot):\mathbb{R}^{m}\mapsto\mathbb{R}^{n\times n} is the conjugate (adjoint) operator of \mathcal{A}(\cdot) and \mathbf{r}_{k}=\mathcal{A}(\mathbf{U}_{k}\mathbf{U}_{k}^{\top})+\mathbf{g}_{k}\circ\mathbf{g}_{k}-\mathbf{h}_{k}\circ\mathbf{h}_{k}-\mathbf{y}.
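As a concrete illustration, the following minimal NumPy sketch implements the updates (6) specialized to the robust PCA setting of Section 4.1, where \mathcal{A} (and hence \mathcal{A}^{*}) is the identity. Initialization and step sizes follow Section 4.1, but this is a sketch under those assumptions, not the exact experimental code.

```python
import numpy as np

def robust_pca_double_overparam(Y, alpha, tau=1e-4, n_iters=20000,
                                init_scale=1e-4, seed=0):
    """Gradient updates (6) for robust PCA: A is the identity, so A*(r_k) = R_k."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    U = init_scale * rng.standard_normal((n, n))  # X = U U^T, over-parameterized with r' = n
    G = init_scale * rng.standard_normal((n, n))  # sparse part S = G∘G - H∘H
    H = G.copy()                                  # H = G so that S starts at exactly zero
    for _ in range(n_iters):
        R = U @ U.T + G * G - H * H - Y           # residual r_k
        U = U - tau * R @ U                       # U-step with learning rate tau
        G = G - alpha * tau * R * G               # (g, h)-steps with learning rate alpha * tau
        H = H + alpha * tau * R * H
    return U @ U.T, G * G - H * H                 # estimates of (X_star, S_star)

# e.g., X_hat, S_hat = robust_pca_double_overparam(Y, alpha=np.sqrt(Y.shape[0]))
```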

It should be noted that the scalar \alpha>0, which controls the ratio of the learning rates for \mathbf{U} and (\mathbf{g},\mathbf{h}), plays a crucial role in the quality of the solution that the iterates in (6) converge to (see Figure 2(a)). We will discuss this in more detail in the next subsection (Section 2.3).

Convergence to low-rank & sparse solutions. Based on our discussion in Section 2.1, it is expected that the gradient descent (6) converges to a solution (𝐔,𝐠,𝐡)(\mathbf{U},\mathbf{g},\mathbf{h}) such that

𝐗=𝐔𝐔and𝐬=𝐠𝐠𝐡𝐡\displaystyle\mathbf{X}\;=\;\mathbf{U}\mathbf{U}^{\top}\qquad\text{and}\qquad\mathbf{s}\;=\;\mathbf{g}\circ\mathbf{g}\;-\;\mathbf{h}\circ\mathbf{h} (7)

have minimum nuclear norm and \ell_1 norm, respectively. More specifically, we expect that the solution (\mathbf{X},\mathbf{s}) in (7) of the nonlinear least squares (4) also serves as a global solution to the convex problem

min𝐗n×n,𝐬m𝐗+λ𝐬1,s.t.𝒜(𝐗)+𝐬=𝐲,𝐗𝟎,\displaystyle\min_{\mathbf{X}\in\mathbb{R}^{n\times n},\;\mathbf{s}\in\mathbb{R}^{m}}\;\left\|\mathbf{X}\right\|_{*}\;+\;\lambda\cdot\left\|\mathbf{s}\right\|_{1},\quad\operatorname*{s.t.\ }\;\;\mathcal{A}(\mathbf{X})+\mathbf{s}\;=\;\mathbf{y},\;\;\mathbf{X}\succeq\mathbf{0}, (8)

for which we give a more rigorous statement in Section 3 under a particular setting. However, it should be noted that obtaining the global solution of (8) does not necessarily mean that we find the desired solution (\mathbf{X}_{\star},\mathbf{s}_{\star}): the penalty \lambda>0 in (8), which balances the low-rank and sparse regularizations, plays a crucial role in the quality of the solution to (8). For instance, when \mathcal{A}(\cdot) is the identity operator, under proper conditions on (\mathbf{X}_{\star},\mathbf{s}_{\star}), the work [13] showed that the global solution of (8) is exactly the target solution (\mathbf{X}_{\star},\mathbf{s}_{\star}) only when \lambda=1/\sqrt{n}. Therefore, choosing the right \lambda is critical for successful recovery of (\mathbf{X}_{\star},\mathbf{s}_{\star}).
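For reference, the convex program (8) in the robust PCA setting (\mathcal{A} the identity) can be prototyped with a generic convex modeling tool. The sketch below uses CVXPY with the SCS solver, which is an illustrative choice on our part and not the APG solver of [54].

```python
import numpy as np
import cvxpy as cp

def convex_robust_pca(Y, lam=None):
    """Solve (8) with A = identity: min ||X||_* + lam*||S||_1  s.t.  X + S = Y, X PSD."""
    n = Y.shape[0]
    lam = 1.0 / np.sqrt(n) if lam is None else lam       # lambda = 1/sqrt(n), as in [13]
    X = cp.Variable((n, n), PSD=True)
    S = cp.Variable((n, n))
    objective = cp.Minimize(cp.norm(X, 'nuc') + lam * cp.sum(cp.abs(S)))
    prob = cp.Problem(objective, [X + S == Y])
    prob.solve(solver=cp.SCS)
    return X.value, S.value
```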

2.3 Implicit Bias with Discrepant Learning Rates

As noted above, a remaining challenge is to control the implicit regularization of gradient descent (6) so that the algorithm converges to the solution of (8) with the desired value \lambda=1/\sqrt{n}. Without explicit regularization terms in our new objective (4), this might at first glance seem impossible. Nonetheless, we show that it can be achieved simply by adapting the ratio \alpha of the learning rates for \mathbf{U} and (\mathbf{g},\mathbf{h}) in the gradient step (6), which is one of the key contributions of this work and may be of broader interest. More specifically, this phenomenon can be captured by the following observation.

Observation.

With small enough learning rate τ\tau and initialization (𝐔0,𝐠0,𝐡0)(\mathbf{U}_{0},\mathbf{g}_{0},\mathbf{h}_{0}) close enough to the origin, gradient descent (6) converges to a solution of (8) with λ=1/α\lambda=1/\alpha.

In comparison to classical optimization theory [48], where learning rates only affect the convergence rate of an algorithm but not the quality of the solution, our observation here is surprisingly different: using discrepant learning rates on different over-parameterized variables actually affects the specific solution that the algorithm converges to. (Our result also differs from [49, 50, 51], which analyze the effect of initial large learning rates.) In the following, we provide some intuition for why this happens and discuss its implications. We leave a more rigorous mathematical treatment to Section 3.

Intuitions. The relation λ=1/α\lambda=1/\alpha implies that a large learning rate for one particular optimization variable in (6) leads to a small penalty on its implicit regularization counterpart in (8). From a high-level perspective, this happens because a larger learning rate allows the optimization variable to move faster away from its initial point (which is close to the origin), resulting in a weaker regularization effect (which penalizes the distance of the variable to the origin) on its solution path.

Implications. The implicit bias of discrepant learning rates provides a new and powerful way of controlling the regularization effects in over-parameterized models. For robust matrix recovery in particular, it reveals an equivalence between our method (4) and the convex method (8), with a one-to-one correspondence between the learning rate ratio \alpha and the regularization parameter \lambda. By picking \alpha=\sqrt{n}, we may directly quote results from [13] and conclude that our method correctly recovers \mathbf{X}_{\star} at information-theoretically optimal sampling complexity and sparsity levels (see Figure 2). Note that this is achieved with no prior knowledge of the rank of \mathbf{X}_{\star} or the sparsity of \mathbf{s}_{\star}. Next, we show that such benefits extend beyond robust low-rank matrix recovery.

2.4 Extension to Robust Recovery of Natural Images

Finally, we address the problem of robust recovery of a natural image \mathbf{X}_{\star}\in\mathbb{R}^{C\times H\times W} from its corrupted observation \mathbf{y}=\mathbf{X}_{\star}+\mathbf{s}_{\star}, where the structure of \mathbf{X}_{\star} cannot be fully captured by a low-rank model. Here, H and W are the height and width of the image, and C denotes the number of channels (C=1 for a grayscale image and C=3 for a color image with RGB channels). Inspired by the work [35] showing the effectiveness of an image prior given by a deep convolutional network \mathbf{X}=\phi(\bm{\theta}) of a particular architecture, where \bm{\theta}\in\mathbb{R}^{c} denotes the network parameters, we use the following formulation for image recovery:

min𝜽c,{𝐠,𝐡}C×H×Wf(𝜽,𝐠,𝐡)=14ϕ(𝜽)+(𝐠𝐠𝐡𝐡)𝐲F2.\displaystyle\min_{\bm{\theta}\in\mathbb{R}^{c},\;\{\mathbf{g},\mathbf{h}\}\subseteq\mathbb{R}^{C\times H\times W}}\;f(\bm{\theta},\mathbf{g},\mathbf{h})\;=\;\frac{1}{4}\left\|\phi(\bm{\theta})+\left(\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h}\right)-\mathbf{y}\right\|_{F}^{2}. (9)

As we will empirically demonstrate in Section 4.2, gradient descent on (9)

𝜽k+1=𝜽kτ𝜽f(𝜽k,𝐠k,𝐡k),[𝐠k+1𝐡k+1]=[𝐠k𝐡k]ατ[𝐠f(𝜽k,𝐠k,𝐡k)𝐡f(𝜽k,𝐠k,𝐡k)],\displaystyle\begin{split}\bm{\theta}_{k+1}\;&=\;\bm{\theta}_{k}\;-\;\tau\cdot\nabla_{\bm{\theta}}f(\bm{\theta}_{k},\mathbf{g}_{k},\mathbf{h}_{k}),\\ \begin{bmatrix}\mathbf{g}_{k+1}\\ \mathbf{h}_{k+1}\end{bmatrix}\;&=\;\begin{bmatrix}\mathbf{g}_{k}\\ \mathbf{h}_{k}\end{bmatrix}\;-\;\alpha\cdot\tau\cdot\begin{bmatrix}\nabla_{\mathbf{g}}f(\bm{\theta}_{k},\mathbf{g}_{k},\mathbf{h}_{k})\\ \nabla_{\mathbf{h}}f(\bm{\theta}_{k},\mathbf{g}_{k},\mathbf{h}_{k})\end{bmatrix},\end{split} (10)

with a balanced learning rate ratio \alpha also enforces implicit regularization on the solution path toward the desired solution. It should be noted that this happens even though the over-parameterization \mathbf{X}=\phi(\bm{\theta}) is a highly nonlinear network (in contrast to the shallow linear network \mathbf{X}=\mathbf{U}\mathbf{U}^{\top} [6] or deep linear networks [8]), which raises several intriguing theoretical questions for further investigation.
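In practice, the discrepant learning rates in (10) map directly onto optimizer parameter groups. Below is a minimal PyTorch sketch; the tiny convolutional network is only a stand-in for the U-shaped DIP architecture of [35], and the fixed random input, image size, step sizes, and iteration count are illustrative assumptions (Section 4.2 uses \tau=1 and \alpha=500 with the full DIP network).

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the U-shaped network phi(theta); not the architecture of [35].
phi = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.LeakyReLU(), nn.BatchNorm2d(64),
    nn.Conv2d(64, 3, 3, padding=1),
)

y = torch.rand(1, 3, 256, 256)                    # corrupted observation (placeholder data)
z = torch.randn(1, 3, 256, 256)                   # fixed random input to the network, as in DIP
g = torch.full_like(y, 1e-5).requires_grad_()     # over-parameterized corruption: s = g∘g - h∘h
h = torch.full_like(y, 1e-5).requires_grad_()

tau, alpha = 1e-4, 500.0                          # illustrative step sizes for this toy setup
opt = torch.optim.SGD([
    {"params": phi.parameters(), "lr": tau},      # learning rate tau for theta
    {"params": [g, h], "lr": alpha * tau},        # learning rate alpha * tau for (g, h)
])

for _ in range(1000):
    opt.zero_grad()
    loss = 0.25 * (phi(z) + g * g - h * h - y).pow(2).sum()  # objective (9)
    loss.backward()
    opt.step()
```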

3 Convergence Analysis of Gradient Flow Dynamics

We provide a dynamical analysis certifying our observation in Section 2.3. Similar to [6, 8], we consider a special case where the measurement matrices \{\mathbf{A}_{i}\}_{i=1}^{m} associated with \mathcal{A} commute (any \mathcal{A}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m} can be written as \mathcal{A}(\mathbf{Z})=[\langle\mathbf{A}_{1},\mathbf{Z}\rangle,\ldots,\langle\mathbf{A}_{m},\mathbf{Z}\rangle] for some \{\mathbf{A}_{i}\in\mathbb{R}^{n\times n}\}_{i=1}^{m}), and investigate the trajectories of the discrete gradient iterates \mathbf{U}_{k},\mathbf{g}_{k},\mathbf{h}_{k} in (6) with infinitesimal learning rate \tau (i.e., \tau\rightarrow 0). In other words, we study their continuous-dynamics counterparts \mathbf{U}_{t}(\gamma), \mathbf{g}_{t}(\gamma), and \mathbf{h}_{t}(\gamma), which are parameterized by t\in[0,+\infty) and initialized at t=0 with

\mathbf{U}_{0}(\gamma)\;=\;\gamma\mathbf{I},\quad\mathbf{g}_{0}(\gamma)\;=\;\gamma\mathbf{1},\quad\mathbf{h}_{0}(\gamma)\;=\;\gamma\mathbf{1}, (11)

where γ>0\gamma>0. Thus, analogous to (6), the behaviors of the continuous gradient flows can be captured via the following differential equations

𝐔˙t(γ)\displaystyle\dot{\mathbf{U}}_{t}(\gamma)\; =limτ0(𝐔t+τ(γ)𝐔t(γ))/τ=𝒜(𝐫t(γ))𝐔t(γ),\displaystyle=\;\lim_{\tau\rightarrow 0}\big{(}\mathbf{U}_{t+\tau}(\gamma)-\mathbf{U}_{t}(\gamma)\big{)}/{\tau}\;=\;-\mathcal{A}^{*}\left(\mathbf{r}_{t}(\gamma)\right)\mathbf{U}_{t}(\gamma), (12)
[𝐠˙t(γ)𝐡˙t(γ)]\displaystyle\begin{bmatrix}\dot{\mathbf{g}}_{t}(\gamma)\\ \dot{\mathbf{h}}_{t}(\gamma)\end{bmatrix}\; =limτ0([𝐠t+τ(γ)𝐡t+τ(γ)][𝐠t(γ)𝐡t(γ)])/τ=α[𝐫t(γ)𝐠t(γ)𝐫t(γ)𝐡t(γ),],\displaystyle=\;\lim_{\tau\rightarrow 0}\left(\begin{bmatrix}\mathbf{g}_{t+\tau}(\gamma)\\ \mathbf{h}_{t+\tau}(\gamma)\end{bmatrix}-\begin{bmatrix}\mathbf{g}_{t}(\gamma)\\ \mathbf{h}_{t}(\gamma)\end{bmatrix}\right)\;/\;{\tau}\;=\;-\;\alpha\cdot\begin{bmatrix}\mathbf{r}_{t}(\gamma)\circ\mathbf{g}_{t}(\gamma)\\ -\mathbf{r}_{t}(\gamma)\circ\mathbf{h}_{t}(\gamma),\end{bmatrix}, (13)

with 𝐫t(γ)=𝒜(𝐔t(γ)𝐔t(γ))+𝐠t(γ)𝐠t(γ)𝐡t(γ)𝐡t(γ)𝐲\mathbf{r}_{t}(\gamma)\;=\;\mathcal{A}(\mathbf{U}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma))+\mathbf{g}_{t}(\gamma)\circ\mathbf{g}_{t}(\gamma)-\mathbf{h}_{t}(\gamma)\circ\mathbf{h}_{t}(\gamma)-\mathbf{y}. Let 𝐗t(γ)=𝐔t(γ)𝐔t(γ)\mathbf{X}_{t}(\gamma)=\mathbf{U}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma), then we can derive the gradient flow for 𝐗t(γ)\mathbf{X}_{t}(\gamma) via chain rule

𝐗˙t(γ)=𝐔˙t(γ)𝐔t(γ)+𝐔t(γ)𝐔˙t(γ)=𝒜(𝐫t(γ))𝐗t(γ)𝐗t(γ)𝒜(𝐫t(γ)).\displaystyle\dot{\mathbf{X}}_{t}(\gamma)\;=\;\dot{\mathbf{U}}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma)\;+\;\mathbf{U}_{t}(\gamma)\dot{\mathbf{U}}_{t}^{\top}(\gamma)\;=\;-\mathcal{A}^{*}(\mathbf{r}_{t}(\gamma))\mathbf{X}_{t}(\gamma)-\mathbf{X}_{t}(\gamma)\mathcal{A}^{*}(\mathbf{r}_{t}(\gamma)). (14)

For any γ>0\gamma>0, suppose the limits of 𝐗t(γ)\mathbf{X}_{t}(\gamma), 𝐠t(γ)\mathbf{g}_{t}(\gamma), and 𝐡t(γ)\mathbf{h}_{t}(\gamma) as t+t\rightarrow+\infty exist and denote by

𝐗(γ):=limt+𝐗t(γ),𝐠(γ):=limt+𝐠t(γ),𝐡(γ):=limt+𝐡t(γ).\displaystyle\mathbf{X}_{\infty}\left(\gamma\right)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{X}_{t}(\gamma),\quad\mathbf{g}_{\infty}(\gamma)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{g}_{t}(\gamma),\quad\mathbf{h}_{\infty}(\gamma)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{h}_{t}(\gamma). (15)

Then when the initialization is infinitesimally small with γ0\gamma\rightarrow 0, we show that the following holds.

Theorem 1.

Assume that the measurement matrices \{\mathbf{A}_{i}\}_{i=1}^{m} commute, i.e., \mathbf{A}_{i}\mathbf{A}_{j}=\mathbf{A}_{j}\mathbf{A}_{i} for all 1\leq i\not=j\leq m, and that the gradient flows of \mathbf{U}_{t}(\gamma), \mathbf{g}_{t}(\gamma), and \mathbf{h}_{t}(\gamma) satisfy (12) and (13) and are initialized by (11). Let \mathbf{X}_{\infty}(\gamma), \mathbf{g}_{\infty}(\gamma), and \mathbf{h}_{\infty}(\gamma) be the limit points as defined in (15). Suppose that our initialization is infinitesimally small such that

𝐗^:=limγ0𝐗(γ),𝐠^:=limγ0𝐠(γ),𝐡^:=limγ0𝐡(γ)\displaystyle\widehat{\mathbf{X}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{X}_{\infty}(\gamma),\quad\widehat{\mathbf{g}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma),\quad\widehat{\mathbf{h}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{h}_{\infty}(\gamma)

exist and (𝐗^,𝐠^,𝐡^)(\widehat{\mathbf{X}},\widehat{\mathbf{g}},\widehat{\mathbf{h}}) is a global optimal solution to (4) with

𝒜(𝐗^)+𝐬^=𝐲and𝐬^=𝐠^𝐠^𝐡^𝐡^.\displaystyle\mathcal{A}(\widehat{\mathbf{X}})+\widehat{\mathbf{s}}\;=\;\mathbf{y}\quad\text{and}\quad\widehat{\mathbf{s}}\;=\;\widehat{\mathbf{g}}\circ\widehat{\mathbf{g}}-\widehat{\mathbf{h}}\circ\widehat{\mathbf{h}}.

Then we have 𝐠^𝐡^=𝟎\widehat{\mathbf{g}}\circ\widehat{\mathbf{h}}=\mathbf{0}, and (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) is also a global optimal solution to (8), with λ=α1\lambda=\alpha^{-1}.

A detailed proof of Theorem 1 is deferred to Appendix B. Our result shows that among the infinitely many global solutions to (4), gradient descent biases towards the one with minimum nuclear norm (for \mathbf{X}) and minimum \ell_1 norm (for \mathbf{s}), with their relative weight controlled by \alpha. In the following, we discuss the rationale behind the assumptions of Theorem 1.

  • Since gradient descent almost surely converges to a local minimum [52] and any local minimum is likely to be a global minimum in low-rank matrix optimization with over-parameterization (i.e., 𝐔n×n\mathbf{U}\in\mathbb{R}^{n\times n}) [53], the assumption that (𝐗^,𝐠^,𝐡^)(\widehat{\mathbf{X}},\widehat{\mathbf{g}},\widehat{\mathbf{h}}) is a global optimal solution is generally satisfied.

  • The commutativity assumption is commonly adopted in the analysis of over-parameterized low-rank matrix recovery and is empirically observed to be non-essential [6, 8]. A recent work [7] provides a more sophisticated analysis of the discrete dynamics under a restricted isometry assumption, where the commutativity assumption is not needed. We believe such analysis can be extended to our double over-parameterization setting as well, but leave it as future work.

4 Experiments

In this section, we provide experimental evidence for the implicit bias of the learning rate discussed in Section 2.3. Through experiments for low-rank matrix recovery, Section 4.1 shows that the learning rate ratio α\alpha affects the solution that gradient descent converges to, and that an optimal α\alpha does not depend on the rank of matrix and sparsity of corruption. Furthermore, Section 4.2 shows the effectiveness of implicit bias of learning rate in alleviating overfitting for robust image recovery, and demonstrates that our method produces better recovery quality when compared to DIP for varying test images and corruption levels, all with a single model and set of learning parameters.

[Figure 2: three panels, (a) Effect of learning rate ratio \alpha; (b) Convex method [54]; (c) Our method.]
Figure 2: Numerical results for robust PCA (with n=50). (a) Relative reconstruction error of the output (\widehat{\mathbf{X}},\widehat{\mathbf{S}}) of our method with varying step size ratio \alpha. Here, we set r=5 and p=30\%. (b, c) Probability of successful recovery over 10 trials with varying r (y-axis) and p (x-axis). Here, we fixed \alpha=\sqrt{n}. White indicates successful recovery in all trials, while black means total failure.

4.1 Robust Recovery of Low-rank Matrices

We conduct experiments on a particular case of the low-rank matrix recovery problem, namely the robust PCA problem, in which the operator \mathcal{A} is the identity. More specifically, the goal is to recover a low-rank matrix \mathbf{X}_{\star} and a sparse matrix \mathbf{S}_{\star} from the mixture \mathbf{Y}=\mathbf{X}_{\star}+\mathbf{S}_{\star}, possibly without prior knowledge of the rank r of \mathbf{X}_{\star} and the percentage of nonzero entries p of \mathbf{S}_{\star}. For any given r, we generate \mathbf{X}_{\star}\in\mathbb{R}^{n\times n} by setting \mathbf{X}_{\star}=\mathbf{U}_{\star}\mathbf{U}_{\star}^{\top}, where \mathbf{U}_{\star} is an n\times r matrix with i.i.d. entries drawn from the standard Gaussian distribution. For any given p, we generate \mathbf{S}_{\star}\in\mathbb{R}^{n\times n} by sampling n^{2}\cdot p locations uniformly at random from the matrix and setting those entries i.i.d. from a zero-mean Gaussian distribution with standard deviation 10. We use n=50.
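A short NumPy sketch of the synthetic data generation just described (the random seed handling is an assumption), which can be paired with the update sketch given after (6):

```python
import numpy as np

def make_robust_pca_instance(n=50, r=5, p=0.30, seed=0):
    """Generate Y = X_star + S_star as described above."""
    rng = np.random.default_rng(seed)
    U_star = rng.standard_normal((n, r))
    X_star = U_star @ U_star.T                                    # rank-r PSD ground truth
    S_star = np.zeros((n, n))
    idx = rng.choice(n * n, size=int(n * n * p), replace=False)   # p fraction of the entries
    S_star.flat[idx] = 10.0 * rng.standard_normal(idx.size)       # corruption with std 10
    return X_star + S_star, X_star, S_star
```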

We apply our double over-parameterization method (4) to the robust PCA problem. Specifically, we initialize \mathbf{U} and \mathbf{g} by drawing i.i.d. entries from a zero-mean Gaussian distribution with standard deviation 10^{-4}, and initialize \mathbf{h} to be the same as \mathbf{g}. The learning rates for \mathbf{U} and for \{\mathbf{g},\mathbf{h}\} are set to \tau and \alpha\cdot\tau, respectively, with \tau=10^{-4} in all experiments.

Effect of learning rate ratio \alpha. We set r=5 and p=30\%, perform 2\times 10^{4} gradient descent steps, and compute the reconstruction error of the output (\widehat{\mathbf{X}},\widehat{\mathbf{S}}) of our algorithm relative to the ground truth (\mathbf{X}_{\star},\mathbf{S}_{\star}). Figure 2(a) illustrates the performance for varying values of \alpha. We observe that \alpha affects the solution that the algorithm converges to, which verifies that the learning rate ratio has an implicit regularization effect. Moreover, the best performance is obtained at \alpha=\sqrt{n}, in accordance with our theoretical result in Theorem 1.

Effect of rank r and outlier ratio p, and phase transition. We now fix \alpha=\sqrt{n} and study the ability of our method to recover matrices of varying rank r under varying percentages of corruption p. For each pair (r,p), we apply our method and declare the recovery successful if the relative reconstruction error \|\widehat{\mathbf{X}}-\mathbf{X}_{\star}\|_{F}/\|\mathbf{X}_{\star}\|_{F} is less than 0.1. Figure 2(c) displays the fraction of successful recoveries in 10 Monte Carlo trials. It shows that a single value of the parameter \alpha leads to correct recovery for a wide range of r and p. Figure 2(b) shows the fraction of successful recoveries for the convex method (8), solved with the Accelerated Proximal Gradient solver [54] (code available at http://people.eecs.berkeley.edu/~yima/matrix-rank/sample_code.html).

[Figure 3 panels. Columns: (m) Input, (n) Groundtruth, (o) DIP, (p) DIP-\ell_1, (q) Ours. PSNR values for the four test images: row 1: (a) DIP 16.46, (b) DIP-\ell_1 29.07, (c) Ours 31.60; row 2: (d) DIP 16.63, (e) DIP-\ell_1 29.13, (f) Ours 30.76; row 3: (g) DIP 18.56, (h) DIP-\ell_1 20.37, (i) Ours 21.38; row 4: (j) DIP 16.57, (k) DIP-\ell_1 31.87, (l) Ours 32.47.]
Figure 3: Robust image recovery under salt-and-pepper corruption. The PSNR of each result is listed with the corresponding panel. For our method, all cases use the same network width, learning rate, and termination condition. For DIP and DIP-\ell_1, case-dependent early stopping is used, which is essential for their good performance. Despite that, our method achieves the highest PSNRs and the best visual quality.

4.2 Robust Recovery of Natural Images

Following [35], we evaluate the performance of our method for robust image recovery on images from a standard dataset (http://www.cs.tut.fi/~foi/GCF-BM3D/index.html#ref_results). Corruption is synthesized by adding salt-and-pepper noise, where a fraction p of randomly chosen pixels is replaced with either 1 or 0 (each with probability 50\%).

The network \phi(\bm{\theta}) in our method (9) is the same as the denoising network in [35], which has a U-shaped architecture with skip connections. Each layer of the network contains a convolutional layer, a LeakyReLU nonlinearity, and a batch normalization layer. We also follow [35] in initializing the network parameters \bm{\theta} with Kaiming initialization. Meanwhile, we initialize \mathbf{g} and \mathbf{h} with i.i.d. entries from a zero-mean Gaussian distribution with standard deviation 10^{-5}. We use learning rate \tau=1 and set \alpha=500 for all experiments below.

No need to tune model width or terminate early. We compare our method with a variant of DIP that we call DIP-\ell_1, which solves \min_{\bm{\theta}}\ell(\phi(\bm{\theta})-\mathbf{y}) with \ell(\cdot) being \left\|\cdot\right\|_{1}. As shown in Figure 1(b), DIP-\ell_1 requires either choosing an appropriate network width or early termination to avoid overfitting. Note that neither can be carried out in practice without the true (clean) image. In contrast, our method does not require tuning the network width or early termination: its performance continues to improve as training proceeds, until it stabilizes.

Handling different images and varying corruption levels. The benefit mentioned above enables our method to handle different images and varying corruption levels without the need to tune the network width, the termination condition, or any other learning parameters. In our experiments, we fix the number of channels in each convolutional layer to 128, run 150,000 gradient descent iterations, and display the output image. The results are shown in Figure 3 (for four different test images) and Figure 4 (for four corruption levels; see Appendix C). For DIP and DIP-\ell_1, we report the result with the highest PSNR over the learning process (i.e., we perform the best possible early termination for evaluation purposes; note that this cannot be done in practice). Despite that, our method obtains better recovery quality in terms of PSNR in all cases. We display the learning curves for these experiments in Figure 5(a) and Figure 5(b) (see Appendix C), which show that our method does not overfit in any case, while DIP-\ell_1 requires case-dependent early termination.

5 Conclusion

In this work, we have shown both theoretically and empirically that the benefits of the implicit bias of gradient descent extend to the over-parameterization of two low-dimensional structures. The key to this success is the choice of discrepant learning rates, which properly regulate the optimization path so that it converges to the desired optimal solution. Such a framework frees us from the need for prior knowledge of the structures and from the scalability limitations of previous approaches, and has led to state-of-the-art recovery results for both low-rank matrices and natural images. We hope this work encourages future investigation of whether the same framework and ideas generalize to mixtures of multiple and broader families of structures.

References

  • [1] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro, “The implicit bias of gradient descent on separable data,” The Journal of Machine Learning Research, vol. 19, no. 1, pp. 2822–2878, 2018.
  • [2] T. Vaskevicius, V. Kanade, and P. Rebeschini, “Implicit regularization for optimal sparse recovery,” in Advances in Neural Information Processing Systems, pp. 2968–2979, 2019.
  • [3] P. Zhao, Y. Yang, and Q.-C. He, “Implicit regularization via hadamard product over-parametrization in high-dimensional linear regression,” arXiv preprint arXiv:1903.09367, 2019.
  • [4] F. Wu and P. Rebeschini, “Hadamard wirtinger flow for sparse phase retrieval,” arXiv preprint arXiv:2006.01065, 2020.
  • [5] S. Oymak and M. Soltanolkotabi, “Overparameterized nonlinear learning: Gradient descent takes the shortest path?,” in International Conference on Machine Learning, pp. 4951–4960, 2019.
  • [6] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro, “Implicit regularization in matrix factorization,” in Advances in Neural Information Processing Systems, pp. 6151–6159, 2017.
  • [7] Y. Li, T. Ma, and H. Zhang, “Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations,” in Conference On Learning Theory, pp. 2–47, 2018.
  • [8] S. Arora, N. Cohen, W. Hu, and Y. Luo, “Implicit regularization in deep matrix factorization,” in Advances in Neural Information Processing Systems, pp. 7411–7422, 2019.
  • [9] M. A. Belabbas, “On implicit regularization: Morse functions and applications to matrix factorization,” arXiv preprint arXiv:2001.04264, 2020.
  • [10] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro, “Implicit bias of gradient descent on linear convolutional networks,” in Advances in Neural Information Processing Systems, pp. 9461–9471, 2018.
  • [11] G. Gidel, F. Bach, and S. Lacoste-Julien, “Implicit regularization of discrete gradient dynamics in linear neural networks,” in Advances in Neural Information Processing Systems, pp. 3196–3206, 2019.
  • [12] E. Alpaydin, Introduction to machine learning. 2020.
  • [13] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?,” Journal of the ACM (JACM), vol. 58, no. 3, pp. 1–37, 2011.
  • [14] D. S. Weller, A. Pnueli, G. Divon, O. Radzyner, Y. C. Eldar, and J. A. Fessler, “Undersampled phase retrieval with outliers,” IEEE transactions on computational imaging, vol. 1, no. 4, pp. 247–258, 2015.
  • [15] Y. Li, Y. Sun, and Y. Chi, “Low-rank positive semidefinite matrix recovery from corrupted rank-one measurements,” IEEE Transactions on Signal Processing, vol. 65, no. 2, pp. 397–408, 2016.
  • [16] J. C. Duchi and F. Ruan, “Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval,” Information and Inference: A Journal of the IMA, vol. 8, no. 3, pp. 471–529, 2019.
  • [17] Y. Chi, Y. M. Lu, and Y. Chen, “Nonconvex optimization meets low-rank matrix factorization: An overview,” IEEE Transactions on Signal Processing, vol. 67, no. 20, pp. 5239–5269, 2019.
  • [18] X. Li, Z. Zhu, A. Man-Cho So, and R. Vidal, “Nonconvex robust low-rank matrix recovery,” SIAM Journal on Optimization, vol. 30, no. 1, pp. 660–686, 2020.
  • [19] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
  • [20] Z. Lin, M. Chen, and Y. Ma, “The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices,” arXiv preprint arXiv:1009.5055, 2010.
  • [21] P. Jain and P. Kar, “Non-convex optimization for machine learning,” Foundations and Trends® in Machine Learning, vol. 10, no. 3-4, pp. 142–336, 2017.
  • [22] S. Burer and R. D. Monteiro, “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” Mathematical Programming, vol. 95, no. 2, pp. 329–357, 2003.
  • [23] S. Bhojanapalli, B. Neyshabur, and N. Srebro, “Global optimality of local search for low rank matrix recovery,” in Advances in Neural Information Processing Systems, pp. 3873–3881, 2016.
  • [24] R. Ge, J. D. Lee, and T. Ma, “Matrix completion has no spurious local minimum,” in Advances in Neural Information Processing Systems, pp. 2973–2981, 2016.
  • [25] Z. Zhu, Q. Li, G. Tang, and M. B. Wakin, “Global optimality in low-rank matrix optimization,” IEEE Transactions on Signal Processing, vol. 66, no. 13, pp. 3614–3628, 2018.
  • [26] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Non-local sparse models for image restoration,” in 2009 IEEE 12th international conference on computer vision, pp. 2272–2279, IEEE, 2009.
  • [27] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-d transform-domain collaborative filtering,” IEEE Transactions on image processing, vol. 16, no. 8, pp. 2080–2095, 2007.
  • [28] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Bm3d image denoising with shape-adaptive principal component analysis,” 2009.
  • [29] S. Gu, L. Zhang, W. Zuo, and X. Feng, “Weighted nuclear norm minimization with application to image denoising,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2862–2869, 2014.
  • [30] J. Xu, L. Zhang, and D. Zhang, “A trilateral weighted sparse coding scheme for real-world image denoising,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 20–36, 2018.
  • [31] B. Lecouat, J. Ponce, and J. Mairal, “Fully trainable and interpretable non-local sparse models for image restoration,” 2020.
  • [32] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [33] Z. Yue, H. Yong, Q. Zhao, D. Meng, and L. Zhang, “Variational denoising network: Toward blind noise modeling and removal,” in Advances in Neural Information Processing Systems, pp. 1688–1699, 2019.
  • [34] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Cycleisp: Real image restoration via improved data synthesis,” arXiv preprint arXiv:2003.07761, 2020.
  • [35] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.
  • [36] R. Heckel and P. Hand, “Deep decoder: Concise image representations from untrained non-convolutional networks,” in International Conference on Learning Representations, 2019.
  • [37] R. Heckel and M. Soltanolkotabi, “Denoising and regularization via exploiting the structural bias of convolutional generators,” arXiv preprint arXiv:1910.14634, 2019.
  • [38] P. J. Rousseeuw and A. M. Leroy, Robust regression and outlier detection, vol. 589. John wiley & sons, 2005.
  • [39] Z. Zhu, Y. Wang, D. Robinson, D. Naiman, R. Vidal, and M. Tsakiris, “Dual principal component pursuit: Improved analysis and efficient algorithms,” in Advances in Neural Information Processing Systems 31, pp. 2171–2181, 2018.
  • [40] Q. Qu, Z. Zhu, X. Li, M. C. Tsakiris, J. Wright, and R. Vidal, “Finding the sparsest vectors in a subspace: Theory, algorithms, and applications,” 2020.
  • [41] H. Jiang, D. P. Robinson, R. Vidal, and C. You, “A nonconvex formulation for low rank subspace clustering: algorithms and convergence analysis,” Computational Optimization and Applications, vol. 70, no. 2, pp. 395–418, 2018.
  • [42] D. Davis, D. Drusvyatskiy, and C. Paquette, “The nonsmooth landscape of phase retrieval,” arXiv preprint arXiv:1711.03247, 2017.
  • [43] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht, “Low-rank solutions of linear matrix equations via procrustes flow,” in International Conference on Machine Learning, pp. 964–973, 2016.
  • [44] D. Park, A. Kyrillidis, C. Carmanis, and S. Sanghavi, “Non-square matrix sensing without spurious local minima via the burer-monteiro approach,” in Artificial Intelligence and Statistics, pp. 65–74, 2017.
  • [45] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on information theory, vol. 52, no. 2, pp. 489–509, 2006.
  • [46] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE signal processing magazine, vol. 25, no. 2, pp. 21–30, 2008.
  • [47] E. J. Candès, “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus Mathematique, vol. 346, no. 9-10, pp. 589–592, 2008.
  • [48] J. Nocedal and S. Wright, Numerical optimization. Springer Science & Business Media, 2006.
  • [49] Y. Li, C. Wei, and T. Ma, “Towards explaining the regularization effect of initial large learning rate in training neural networks,” in Advances in Neural Information Processing Systems, pp. 11669–11680, 2019.
  • [50] K. You, M. Long, J. Wang, and M. Jordan, “How does learning rate decay help modern neural networks,” arXiv preprint arXiv:1908.01878, 2019.
  • [51] P. Nakkiran, “Learning rate annealing can provably help generalization, even for convex problems,” arXiv preprint arXiv:2005.07360, 2020.
  • [52] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gradient descent only converges to minimizers,” in Conference on learning theory, pp. 1246–1257, 2016.
  • [53] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre, “Low-rank optimization on the cone of positive semidefinite matrices,” SIAM Journal on Optimization, vol. 20, no. 5, pp. 2327–2351, 2010.
  • [54] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma, “Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix,” Coordinated Science Laboratory Report no. UILU-ENG-09-2214, DC-246, 2009.
  • [55] S. Oymak and M. Soltanolkotabi, “Overparameterized nonlinear learning: Gradient descent takes the shortest path?,” arXiv preprint arXiv:1812.10004, 2018.

Appendix A Implicit Bias of Discrepant Learning Rates in Linear Regression

In this part of the appendix, we use a classical result for the underdetermined linear regression problem to build intuition for the implicit bias of gradient descent in our formulation of robust learning problems. The high-level message we aim to deliver through this simple example is that

  • Gradient descent implicitly biases towards solutions with minimum \ell_2-norm.

  • Discrepant learning rates lead to solutions with minimum weighted \ell_2-norm.

Underdetermined linear regression.

Given observation 𝐛n1\mathbf{b}\in\mathbb{R}^{n_{1}} and wide data matrix 𝐖n1×n2\mathbf{W}\in\mathbb{R}^{n_{1}\times n_{2}} (n2>n1n_{2}>n_{1}), we want to find 𝜽\bm{\theta} which is a solution to

min𝜽n2φ(𝜽)=12𝐛𝐖𝜽22.\displaystyle\min_{\bm{\theta}\in\mathbb{R}^{n_{2}}}\varphi(\bm{\theta})\;=\;\frac{1}{2}\left\|\mathbf{b}-\mathbf{W}\bm{\theta}\right\|_{2}^{2}. (16)

For n_{2}>n_{1} and full row-rank \mathbf{W}, the underdetermined problem (16) obviously has infinitely many solutions, which form the set

𝒮:={𝜽ln+𝐧𝜽ln=𝐖𝐛,𝐧𝒩(𝐖)},\displaystyle\mathcal{S}\;:=\;\left\{\;\bm{\theta}_{ln}+\mathbf{n}\;\mid\;\bm{\theta}_{ln}=\mathbf{W}^{\dagger}\mathbf{b},\quad\mathbf{n}\in\mathcal{N}(\mathbf{W})\;\right\},

where 𝐖:=𝐖(𝐖𝐖)1\mathbf{W}^{\dagger}:=\mathbf{W}^{\top}\left(\mathbf{W}\mathbf{W}^{\top}\right)^{-1} denotes the pseudo-inverse of 𝐖\mathbf{W}, and

𝒩(𝐖):={𝐧𝐖𝐧=𝟎},(𝐖):={𝐳𝐳=𝐖𝐯}\displaystyle\mathcal{N}(\mathbf{W}):=\left\{\mathbf{n}\mid\mathbf{W}\mathbf{n}=\mathbf{0}\right\},\quad\mathcal{R}(\mathbf{W}):=\left\{\mathbf{z}\mid\mathbf{z}=\mathbf{W}^{\top}\mathbf{v}\right\}

are the null space and row space of \mathbf{W}, respectively. Simple derivation shows that \bm{\theta}_{ln} is the least \ell_2-norm solution to (16), i.e., the minimizer of

min𝜽n212𝜽22,s.t.𝐖𝜽=𝐛.\displaystyle\min_{\bm{\theta}\in\mathbb{R}^{n_{2}}}\;\frac{1}{2}\left\|\bm{\theta}\right\|_{2}^{2},\quad\text{s.t.}\quad\mathbf{W}\bm{\theta}\;=\;\mathbf{b}.

Gradient descent biases towards 𝜽ln\bm{\theta}_{ln}.

Starting from any initialization 𝜽0\bm{\theta}_{0}, gradient descent

𝜽k+1=𝜽kτls𝐖(𝐖𝜽k𝐛)\displaystyle\bm{\theta}_{k+1}\;=\;\bm{\theta}_{k}-\tau_{ls}\cdot\mathbf{W}^{\top}\left(\mathbf{W}\bm{\theta}_{k}-\mathbf{b}\right) (17)

with a sufficiently small learning rate \tau_{ls} always finds one of the global solutions of (16). (This is because \mathbf{W}\bm{\theta}_{k+1}-\mathbf{b}=\left(\mathbf{I}-\tau_{ls}\mathbf{W}\mathbf{W}^{\top}\right)\left(\mathbf{W}\bm{\theta}_{k}-\mathbf{b}\right); if we choose \tau_{ls}<\left\|\mathbf{W}\mathbf{W}^{\top}\right\|^{-1}, then \left\|\mathbf{W}\bm{\theta}_{k}-\mathbf{b}\right\| converges to 0 geometrically.) Furthermore, it is now well understood [55] that whenever the initialization \bm{\theta}_{0} has zero component in \mathcal{N}(\mathbf{W}) (i.e., \mathcal{P}_{\mathcal{N}(\mathbf{W})}\left(\bm{\theta}_{0}\right)=\mathbf{0}), the limit \bm{\theta}_{\infty} of the iterates (17) is exactly the minimum \ell_2-norm solution \bm{\theta}_{ln}. This happens because, once initialized in \mathcal{R}(\mathbf{W}), gradient descent (17) keeps the iterates within \mathcal{R}(\mathbf{W}), so that

\mathcal{P}_{\mathcal{R}(\mathbf{W})}\left(\bm{\theta}_{\infty}\right)\;=\;\bm{\theta}_{ln},\quad\mathcal{P}_{\mathcal{N}(\mathbf{W})}\left(\bm{\theta}_{\infty}\right)\;=\;\mathcal{P}_{\mathcal{N}(\mathbf{W})}\left(\bm{\theta}_{0}\right)=\mathbf{0}.

As we can see, a particular algorithm enforces specific regularization on the final solution.
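The claim above can be checked numerically with a few lines of NumPy; the problem sizes, step size, and iteration count below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 20, 50                                  # underdetermined: n2 > n1
W = rng.standard_normal((n1, n2))
b = rng.standard_normal(n1)

theta = np.zeros(n2)                             # zero component in the null space of W
tau_ls = 0.9 / np.linalg.norm(W @ W.T, 2)        # step size below ||W W^T||^{-1}
for _ in range(20000):
    theta -= tau_ls * W.T @ (W @ theta - b)      # gradient step (17)

theta_ln = W.T @ np.linalg.solve(W @ W.T, b)     # pseudo-inverse solution W^dagger b
print(np.linalg.norm(theta - theta_ln))          # close to 0: GD finds the least-norm solution
```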

Implicit bias of discrepant learning rates. The gradient update in (17) uses the same learning rate τls\tau_{ls} for all coordinates of 𝜽\bm{\theta}. If we use different learning rates for each coordinate (i.e., 𝚲\mathbf{\Lambda} is a diagonal matrix with positive diagonals)

𝜽k+1=𝜽kτls𝚲𝐖(𝐖𝜽k𝐛),\displaystyle\bm{\theta}_{k+1}\;=\;\bm{\theta}_{k}-\tau_{ls}\cdot\mathbf{\Lambda}\cdot\mathbf{W}^{\top}\left(\mathbf{W}\bm{\theta}_{k}-\mathbf{b}\right), (18)

then, by a similar argument, we conclude that the gradient update in (18) converges to the solution of the weighted problem

min𝜽n212𝚲1/2𝜽22,s.t.𝐖𝜽=𝐛.\displaystyle\min_{\bm{\theta}\in\mathbb{R}^{n_{2}}}\;\frac{1}{2}\left\|\mathbf{\Lambda}^{-1/2}\bm{\theta}\right\|_{2}^{2},\quad\text{s.t.}\quad\mathbf{W}\bm{\theta}\;=\;\mathbf{b}. (19)
Remark 1.

Let σi\sigma_{i} be the ii-th diagonal of 𝚲\mathbf{\Lambda} and θi\theta_{i} be the ii-th element of 𝜽\bm{\theta}. Then in (18), σiτls\sigma_{i}\tau_{ls} is the learning rate for the variable θi\theta_{i}, which varies for different variables. In words, the relation between (18) and (19) implies that for one particular optimization variable (e.g., θi\theta_{i}) a large learning rate σiτls\sigma_{i}\tau_{ls} in (18) leads to a small implicit regularization effect in (19). From a high-level perspective, this happens because a larger learning rate allows the optimization variable to move faster away from its initial point, resulting in a weaker regularization effect (which penalizes the distance of the variable to the initialization) on its solution path.

An alternative explanation of this is through a change of variable 𝜽=𝚲1/2𝜽~{\bm{\theta}}=\bm{\Lambda}^{1/2}\widetilde{\bm{\theta}}. Suppose we minimize

min𝜽~n2φ~(𝜽~)=12𝐛𝐖𝚲1/2𝜽~22\displaystyle\min_{\widetilde{\bm{\theta}}\in\mathbb{R}^{n_{2}}}\widetilde{\varphi}(\widetilde{\bm{\theta}})\;=\;\frac{1}{2}\left\|\mathbf{b}-\mathbf{W}\bm{\Lambda}^{1/2}\widetilde{\bm{\theta}}\right\|_{2}^{2} (20)

via standard gradient descent with a single learning rate

𝜽~k+1=𝜽~kτls𝚲1/2𝐖(𝐖𝚲1/2𝜽~k𝐛).\displaystyle\widetilde{\bm{\theta}}_{k+1}\;=\;\widetilde{\bm{\theta}}_{k}-\tau_{ls}\cdot\mathbf{\Lambda}^{1/2}\cdot\mathbf{W}^{\top}\left(\mathbf{W}\bm{\Lambda}^{1/2}\widetilde{\bm{\theta}}_{k}-\mathbf{b}\right). (21)

Thus, once initialized in \mathcal{R}(\mathbf{W}\bm{\Lambda}^{1/2}), gradient descent (21) converges to the least \ell_2-norm solution of (20), i.e., the solution of the following problem

min𝜽~n212𝜽~22,s.t.𝐖𝚲1/2𝜽~=𝐛.\displaystyle\min_{\widetilde{\bm{\theta}}\in\mathbb{R}^{n_{2}}}\;\frac{1}{2}\left\|\widetilde{\bm{\theta}}\right\|_{2}^{2},\quad\text{s.t.}\quad\mathbf{W}\bm{\Lambda}^{1/2}\widetilde{\bm{\theta}}\;=\;\mathbf{b}. (22)

Finally, plugging \widetilde{\bm{\theta}}=\bm{\Lambda}^{-1/2}{\bm{\theta}} into (21) and (22) gives (18) and (19), respectively, indicating that the gradient update (18) induces an implicit weighted regularization towards the solution of (19).
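The same numerical check extends to the discrepant-learning-rate update (18): with a diagonal \bm{\Lambda}, gradient descent from zero initialization matches the closed-form solution \bm{\Lambda}\mathbf{W}^{\top}(\mathbf{W}\bm{\Lambda}\mathbf{W}^{\top})^{-1}\mathbf{b} of the weighted problem (19). The sizes, weights, and step size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 20, 50
W = rng.standard_normal((n1, n2))
b = rng.standard_normal(n1)
lam_diag = rng.uniform(0.5, 5.0, size=n2)        # per-coordinate learning-rate multipliers
Lam = np.diag(lam_diag)

theta = np.zeros(n2)
tau_ls = 0.9 / (lam_diag.max() * np.linalg.norm(W @ W.T, 2))   # conservative step size
for _ in range(200000):
    theta -= tau_ls * Lam @ (W.T @ (W @ theta - b))            # update (18)

theta_w = Lam @ W.T @ np.linalg.solve(W @ Lam @ W.T, b)        # closed-form solution of (19)
print(np.linalg.norm(theta - theta_w))                         # close to 0
```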

Appendix B Proof of Theorem 1

In this part of the appendix, we prove our main technical result (i.e., Theorem 1) in Section 3. To make this part self-contained, we restate our result as follows.

Theorem 2.

Assume that the measurement matrices 𝐀1,𝐀2,,𝐀m\mathbf{A}_{1},\mathbf{A}_{2},\ldots,\mathbf{A}_{m} commute, i.e.

𝐀i𝐀j=𝐀j𝐀i, 1ijm,\displaystyle\mathbf{A}_{i}\mathbf{A}_{j}\;=\;\mathbf{A}_{j}\mathbf{A}_{i},\quad\forall\;1\leq i\not=j\leq m,

and the gradient flows of 𝐔t(γ)\mathbf{U}_{t}(\gamma), 𝐠t(γ)\mathbf{g}_{t}(\gamma), and 𝐡t(γ)\mathbf{h}_{t}(\gamma) satisfy

𝐔˙t(γ)\displaystyle\dot{\mathbf{U}}_{t}(\gamma)\; =limτ0𝐔t+τ(γ)𝐔t(γ)τ=𝒜(𝐫t(γ))𝐔t(γ),\displaystyle=\;\lim_{\tau\rightarrow 0}\frac{\mathbf{U}_{t+\tau}(\gamma)-\mathbf{U}_{t}(\gamma)}{\tau}\;=\;-\mathcal{A}^{*}\left(\mathbf{r}_{t}(\gamma)\right)\mathbf{U}_{t}(\gamma), (23)
[𝐠˙t(γ)𝐡˙t(γ)]\displaystyle\begin{bmatrix}\dot{\mathbf{g}}_{t}(\gamma)\\ \dot{\mathbf{h}}_{t}(\gamma)\end{bmatrix}\; =limτ0([𝐠t+τ(γ)𝐡t+τ(γ)][𝐠t(γ)𝐡t(γ)])/τ=α[𝐫t(γ)𝐠t(γ)𝐫t(γ)𝐡t(γ),],\displaystyle=\;\lim_{\tau\rightarrow 0}\left(\begin{bmatrix}\mathbf{g}_{t+\tau}(\gamma)\\ \mathbf{h}_{t+\tau}(\gamma)\end{bmatrix}-\begin{bmatrix}\mathbf{g}_{t}(\gamma)\\ \mathbf{h}_{t}(\gamma)\end{bmatrix}\right)\;/\;{\tau}\;=\;-\;\alpha\cdot\begin{bmatrix}\mathbf{r}_{t}(\gamma)\circ\mathbf{g}_{t}(\gamma)\\ -\mathbf{r}_{t}(\gamma)\circ\mathbf{h}_{t}(\gamma),\end{bmatrix}, (24)

with 𝐫t(γ)=𝒜(𝐔t(γ)𝐔t(γ))+𝐠t(γ)𝐠t(γ)𝐡t(γ)𝐡t(γ)𝐲\mathbf{r}_{t}(\gamma)\;=\;\mathcal{A}(\mathbf{U}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma))+\mathbf{g}_{t}(\gamma)\circ\mathbf{g}_{t}(\gamma)-\mathbf{h}_{t}(\gamma)\circ\mathbf{h}_{t}(\gamma)-\mathbf{y}, and they are initialized by

\displaystyle\mathbf{U}_{0}(\gamma)\;=\;\gamma\mathbf{I},\quad\mathbf{g}_{0}(\gamma)\;=\;\gamma\mathbf{1},\quad\mathbf{h}_{0}(\gamma)\;=\;\gamma\mathbf{1}.

Let 𝐗t(γ)=𝐔t(γ)𝐔t(γ)\mathbf{X}_{t}(\gamma)=\mathbf{U}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma), and let 𝐗(γ)\mathbf{X}_{\infty}\left(\gamma\right), 𝐠(γ)\mathbf{g}_{\infty}(\gamma), and 𝐡(γ)\mathbf{h}_{\infty}(\gamma) be the limit points defined as

𝐗(γ):=limt+𝐗t(γ),𝐠(γ):=limt+𝐠t(γ),𝐡(γ):=limt+𝐡t(γ).\displaystyle\mathbf{X}_{\infty}\left(\gamma\right)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{X}_{t}(\gamma),\quad\mathbf{g}_{\infty}(\gamma)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{g}_{t}(\gamma),\quad\mathbf{h}_{\infty}(\gamma)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{h}_{t}(\gamma). (25)

Suppose that our initialization is infinitesimally small such that

𝐗^:=limγ0𝐗(γ),𝐠^:=limγ0𝐠(γ),𝐡^:=limγ0𝐡(γ)\displaystyle\widehat{\mathbf{X}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{X}_{\infty}(\gamma),\quad\widehat{\mathbf{g}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma),\quad\widehat{\mathbf{h}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{h}_{\infty}(\gamma)

exist and (𝐗^,𝐠^,𝐡^)(\widehat{\mathbf{X}},\widehat{\mathbf{g}},\widehat{\mathbf{h}}) is a global optimal solution to

min𝐔n×r,{𝐠,𝐡}mf(𝐔,𝐠,𝐡):=14𝒜(𝐔𝐔)+(𝐠𝐠𝐡𝐡)𝐲22,\displaystyle\min_{\mathbf{U}\in\mathbb{R}^{n\times r^{\prime}},\{\mathbf{g},\mathbf{h}\}\subseteq\mathbb{R}^{m}}\;f(\mathbf{U},\mathbf{g},\mathbf{h})\;:=\;\frac{1}{4}\left\|\mathcal{A}\left(\mathbf{U}\mathbf{U}^{\top}\right)+\left(\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h}\right)-\mathbf{y}\right\|_{2}^{2}, (26)

with 𝒜(𝐗^)+𝐬^=𝐲\mathcal{A}(\widehat{\mathbf{X}})+\widehat{\mathbf{s}}\;=\;\mathbf{y} and 𝐬^=𝐠^𝐠^𝐡^𝐡^\widehat{\mathbf{s}}\;=\;\widehat{\mathbf{g}}\circ\widehat{\mathbf{g}}-\widehat{\mathbf{h}}\circ\widehat{\mathbf{h}}. Then we have 𝐠^𝐡^=𝟎\widehat{\mathbf{g}}\circ\widehat{\mathbf{h}}=\mathbf{0}, and (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) is also a global optimal solution to

min𝐗n×n,𝐬m𝐗+λ𝐬1,s.t.𝒜(𝐗)+𝐬=𝐲,𝐗𝟎.\displaystyle\min_{\mathbf{X}\in\mathbb{R}^{n\times n},\mathbf{s}\in\mathbb{R}^{m}}\;\left\|\mathbf{X}\right\|_{*}\;+\;\lambda\cdot\left\|\mathbf{s}\right\|_{1},\quad\text{s.t.}\quad\mathcal{A}(\mathbf{X})+\mathbf{s}\;=\;\mathbf{y},\;\;\mathbf{X}\succeq\mathbf{0}. (27)

with λ=α1\lambda=\alpha^{-1} and α>0\alpha>0 being the balancing parameter in (24).

Proof.

From (23), we can derive the gradient flow for 𝐗t(γ)\mathbf{X}_{t}(\gamma) via the product rule

𝐗˙t(γ)=𝐔˙t(γ)𝐔t(γ)+𝐔t(γ)𝐔˙t(γ)=𝒜(𝐫t(γ))𝐗t(γ)𝐗t(γ)𝒜(𝐫t(γ)).\displaystyle\dot{\mathbf{X}}_{t}(\gamma)\;=\;\dot{\mathbf{U}}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma)\;+\;\mathbf{U}_{t}(\gamma)\dot{\mathbf{U}}_{t}^{\top}(\gamma)\;=\;-\mathcal{A}^{*}(\mathbf{r}_{t}(\gamma))\mathbf{X}_{t}(\gamma)-\mathbf{X}_{t}(\gamma)\mathcal{A}^{*}(\mathbf{r}_{t}(\gamma)). (28)

We want to show that when the initialization is infinitesimally small (i.e., γ0\gamma\rightarrow 0), the limit points as t+t\rightarrow+\infty of the gradient flows 𝐗t(γ)=𝐔t(γ)𝐔t(γ)\mathbf{X}_{t}(\gamma)=\mathbf{U}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma) and 𝐬t(γ)=𝐠t(γ)𝐠t(γ)𝐡t(γ)𝐡t(γ)\mathbf{s}_{t}(\gamma)=\mathbf{g}_{t}(\gamma)\circ\mathbf{g}_{t}(\gamma)-\mathbf{h}_{t}(\gamma)\circ\mathbf{h}_{t}(\gamma) form a global optimal solution to (27). Towards this goal, let us first look at the optimality condition for (27). By Lemma 1, the pair (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) with

𝐗^=limγ0𝐗(γ),𝐬^=𝐠^𝐠^𝐡^𝐡^with𝐠^:=limγ0𝐠(γ),𝐡^:=limγ0𝐡(γ)\displaystyle\widehat{\mathbf{X}}\;=\;\lim_{\gamma\rightarrow 0}\mathbf{X}_{\infty}(\gamma),\quad\widehat{\mathbf{s}}\;=\;\widehat{\mathbf{g}}\circ\widehat{\mathbf{g}}-\widehat{\mathbf{h}}\circ\widehat{\mathbf{h}}\quad\text{with}\quad\widehat{\mathbf{g}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma),\quad\widehat{\mathbf{h}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{h}_{\infty}(\gamma)

is a global optimal solution to (27) provided that there exists a dual certificate 𝝂\bm{\nu} such that

𝒜(𝐗^)+𝐬^=𝐲,(𝐈𝒜(𝝂))𝐗^= 0,𝒜(𝝂)𝐈,𝝂λsign(𝐬^),𝐗^𝟎,\displaystyle\mathcal{A}(\widehat{\mathbf{X}})+\widehat{\mathbf{s}}\;=\;\mathbf{y},\quad\left(\mathbf{I}-\mathcal{A}^{*}(\bm{\nu})\right)\cdot\widehat{\mathbf{X}}\;=\;\mathbf{0},\quad\mathcal{A}^{*}(\bm{\nu})\;\preceq\;\mathbf{I},\quad\bm{\nu}\in\lambda\cdot\operatorname{sign}(\widehat{\mathbf{s}}),\quad\widehat{\mathbf{X}}\succeq\mathbf{0},

where sign(𝐬^)\operatorname{sign}(\widehat{\mathbf{s}}), defined in (32), is the entrywise subdifferential of 1\left\|\cdot\right\|_{1}. Thus, it suffices to construct a dual certificate 𝝂\bm{\nu} for which (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) satisfies the conditions above.

Since (𝐗^,𝐠^,𝐡^)(\widehat{\mathbf{X}},\widehat{\mathbf{g}},\widehat{\mathbf{h}}) is a global optimal solution to (26), we automatically have 𝒜(𝐗^)+𝐬^=𝐲\mathcal{A}(\widehat{\mathbf{X}})+\widehat{\mathbf{s}}\;=\;\mathbf{y} and 𝐗^𝟎\widehat{\mathbf{X}}\succeq\mathbf{0}. On the other hand, given that {𝐀i}i=1m\left\{\mathbf{A}_{i}\right\}_{i=1}^{m} commute and (28) and (24) hold for 𝐗t\mathbf{X}_{t}, 𝐠t\mathbf{g}_{t} and 𝐡t\mathbf{h}_{t}, by Lemma 2, we know that

𝐗t(γ)\displaystyle\mathbf{X}_{t}(\gamma)\; =exp(𝒜(𝝃t(γ)))𝐗0(γ)exp(𝒜(𝝃t(γ))),\displaystyle=\;\exp\left(\mathcal{A}^{*}(\bm{\xi}_{t}(\gamma))\right)\cdot\mathbf{X}_{0}(\gamma)\cdot\exp\left(\mathcal{A}^{*}(\bm{\xi}_{t}(\gamma))\right), (29)
𝐠t(γ)\displaystyle\mathbf{g}_{t}(\gamma)\; =𝐠0(γ)exp(α𝝃t(γ)),𝐡t(γ)=𝐡0(γ)exp(α𝝃t(γ)),\displaystyle=\;\mathbf{g}_{0}(\gamma)\circ\exp\left(\alpha\bm{\xi}_{t}(\gamma)\right),\quad\mathbf{h}_{t}(\gamma)\;=\;\mathbf{h}_{0}(\gamma)\circ\exp\left(-\alpha\bm{\xi}_{t}(\gamma)\right), (30)

where 𝝃t(γ)=0t𝐫τ(γ)𝑑τ\bm{\xi}_{t}(\gamma)=-\int_{0}^{t}\mathbf{r}_{\tau}(\gamma)d\tau. Let 𝝃(γ):=limt+𝝃t(γ)\bm{\xi}_{\infty}(\gamma):=\lim_{t\rightarrow+\infty}\bm{\xi}_{t}(\gamma). By Lemma 3 and Lemma 4, we can construct

𝝂(γ)=𝝃(γ)log(1/γ),\displaystyle\bm{\nu}(\gamma)\;=\;\frac{\bm{\xi}_{\infty}(\gamma)}{\log\left(1/\gamma\right)},

such that

limγ0𝒜(𝝂(γ))𝐈,limγ0[𝐈𝒜(𝝂(γ))]𝐗^= 0,\displaystyle\lim_{\gamma\rightarrow 0}\mathcal{A}^{*}\left(\bm{\nu}(\gamma)\right)\;\preceq\;\mathbf{I},\quad\lim_{\gamma\rightarrow 0}\left[\mathbf{I}-\mathcal{A}^{*}\left(\bm{\nu}(\gamma)\right)\right]\cdot\widehat{\mathbf{X}}\;=\;\mathbf{0},

and

\displaystyle\lim_{\gamma\rightarrow 0}\;\bm{\nu}(\gamma)\;\in\;\alpha^{-1}\cdot\operatorname{sign}(\widehat{\mathbf{s}}),\quad\lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma)\circ\mathbf{h}_{\infty}(\gamma)\;=\;\mathbf{0}.

This establishes the existence of a dual certificate (the limit of 𝝂(γ)\bm{\nu}(\gamma) as γ0\gamma\rightarrow 0) for which the optimality conditions hold at (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}). Hence, (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) is also a global optimal solution to (27). ∎
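Before turning to the supporting lemmas, we note that the statement above can be sanity-checked numerically. The following sketch (with illustrative parameters; this is not the code used for the experiments in this paper) discretizes the flows (23)-(24) by forward Euler with commuting symmetric measurements and a small initialization γ, and compares the limit of (\mathbf{X}_{t},\mathbf{s}_{t}) against a solution of the convex program (27) computed with the generic solver CVXPY; under the constraint \mathbf{X}\succeq\mathbf{0}, the nuclear norm in (27) equals the trace.

import numpy as np
import cvxpy as cp

# Sketch (illustrative parameters): forward-Euler discretization of the flows (23)-(24)
# with commuting symmetric measurements A_i = Omega diag(d_i) Omega^T and small init gamma,
# compared against the convex program (27) with lambda = 1/alpha.
rng = np.random.default_rng(0)
n, m, r_true = 8, 40, 2
gamma, alpha, eta, iters = 1e-3, 1.0, 5e-3, 40000

Omega, _ = np.linalg.qr(rng.standard_normal((n, n)))         # shared eigenbasis
D = rng.standard_normal((m, n)) / np.sqrt(n)                 # eigenvalues of each A_i
A_mats = np.stack([Omega @ np.diag(d) @ Omega.T for d in D])
A_op = lambda X: np.einsum('ijk,jk->i', A_mats, X)           # (A(X))_i = <A_i, X>
A_adj = lambda r: np.einsum('i,ijk->jk', r, A_mats)          # A^*(r) = sum_i r_i A_i

B = rng.standard_normal((n, r_true))                         # low-rank PSD ground truth
X_star, s_star = B @ B.T, np.zeros(m)
s_star[rng.choice(m, 5, replace=False)] = 3.0 * rng.standard_normal(5)   # sparse corruption
y = A_op(X_star) + s_star

U, g, h = gamma * np.eye(n), gamma * np.ones(m), gamma * np.ones(m)      # small initialization
for _ in range(iters):
    r = A_op(U @ U.T) + g * g - h * h - y
    U, g, h = U - eta * A_adj(r) @ U, g - eta * alpha * r * g, h + eta * alpha * r * h
X_gd, s_gd = U @ U.T, g * g - h * h

# Convex program (27); for X >= 0 the nuclear norm equals the trace.
X_cvx, s_cvx = cp.Variable((n, n), PSD=True), cp.Variable(m)
constraints = [cp.hstack([cp.trace(A_mats[i] @ X_cvx) for i in range(m)]) + s_cvx == y]
cp.Problem(cp.Minimize(cp.trace(X_cvx) + (1 / alpha) * cp.norm1(s_cvx)), constraints).solve()

objective = lambda X, s: np.trace(X) + (1 / alpha) * np.abs(s).sum()
print('max |g o h| (equals gamma^2 in continuous time):', np.abs(g * h).max())
print('objective value of the gradient-descent limit  :', objective(X_gd, s_gd))
print('objective value of the convex program (27)     :', objective(X_cvx.value, s_cvx.value))

For small γ and step size the two objective values should roughly agree; the match is only approximate since the theorem concerns the exact flow and the double limit t→+∞, γ→0.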

Lemma 1.

A pair (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) is a global optimal solution to (27) if there exists a dual certificate 𝛎m\bm{\nu}\in\mathbb{R}^{m} such that

𝒜(𝐗^)+𝐬^=𝐲,(𝐈𝒜(𝝂))𝐗^= 0,𝝂λsign(𝐬^),𝐈𝒜(𝝂),𝐗^𝟎,\displaystyle\mathcal{A}(\widehat{\mathbf{X}})+\widehat{\mathbf{s}}\;=\;\mathbf{y},\quad\left(\mathbf{I}-\mathcal{A}^{*}(\bm{\nu})\right)\cdot\widehat{\mathbf{X}}\;=\;\mathbf{0},\quad\bm{\nu}\in\lambda\cdot\operatorname{sign}(\widehat{\mathbf{s}}),\quad\mathbf{I}\succeq\mathcal{A}^{*}(\bm{\nu}),\;\widehat{\mathbf{X}}\succeq\mathbf{0}, (31)

where sign(𝐬)\operatorname{sign}(\mathbf{s}) is the subdifferential of 𝐬1\left\|\mathbf{s}\right\|_{1} with each entry

sign(s):={s/|s|s0,[1,1]s=0.\displaystyle\operatorname{sign}(s)\;:=\;\begin{cases}s/\left|s\right|&s\not=0,\\ [-1,1]&s=0.\end{cases} (32)
Proof.

Since the constraint \mathbf{X}\succeq\mathbf{0} is enforced, the objective \left\|\mathbf{X}\right\|_{*} equals \operatorname{trace}(\mathbf{X}), so the Lagrangian function of the problem can be written as

(𝐗,𝐬,𝝂,𝚪)=trace(𝐗)+λ𝐬1+𝝂(𝐲𝒜(𝐗)𝐬)𝐗,𝚪,\displaystyle\mathcal{L}\left(\mathbf{X},\mathbf{s},\bm{\nu},\mathbf{\Gamma}\right)\;=\;\operatorname{trace}\left(\mathbf{X}\right)+\lambda\left\|\mathbf{s}\right\|_{1}+\bm{\nu}^{\top}\left(\mathbf{y}-\mathcal{A}(\mathbf{X})-\mathbf{s}\right)-\left\langle\mathbf{X},\mathbf{\Gamma}\right\rangle,

with 𝝂m\bm{\nu}\in\mathbb{R}^{m} and 𝚪n×n\mathbf{\Gamma}\in\mathbb{R}^{n\times n} being the dual variables, where 𝚪𝟎\mathbf{\Gamma}\succeq\mathbf{0}. Thus, we can derive the Karush-Kuhn-Tucker (KKT) optimality condition for (27) as

𝟎:\displaystyle\mathbf{0}\in\partial\mathcal{L}\;: 𝐈𝚪𝒜(𝝂)= 0,𝝂λ𝐬1=λsign(𝐬),\displaystyle\quad\mathbf{I}-\mathbf{\Gamma}-\mathcal{A}^{*}(\bm{\nu})\;=\;\mathbf{0},\quad\bm{\nu}\in\lambda\cdot\partial\left\|\mathbf{s}\right\|_{1}=\lambda\cdot\operatorname{sign}\left(\mathbf{s}\right),
feasibility : 𝒜(𝐗)+𝐬=𝐲,𝐗𝟎,𝚪𝟎,\displaystyle\quad\mathcal{A}(\mathbf{X})+\mathbf{s}\;=\;\mathbf{y},\;\mathbf{X}\succeq\mathbf{0},\;\mathbf{\Gamma}\succeq\mathbf{0},
complementary slackness : 𝚪𝐗= 0,\displaystyle\quad\mathbf{\Gamma}\cdot\mathbf{X}\;=\;\mathbf{0},

where ()\partial(\cdot) denotes the subdifferential operator and sign(𝐬)\operatorname{sign}(\mathbf{s}) is the entrywise subdifferential of 𝐬1\left\|\mathbf{s}\right\|_{1} as defined in (32).

Since (27) is convex, the KKT conditions above are sufficient for global optimality. Thus, (𝐗^,𝐬^)\left(\widehat{\mathbf{X}},\widehat{\mathbf{s}}\right) is a global optimal solution to (27) as long as there exists a dual certificate 𝝂\bm{\nu} such that (31) holds, where 𝚪\mathbf{\Gamma} has been eliminated by plugging in 𝚪=𝐈𝒜(𝝂)\mathbf{\Gamma}=\mathbf{I}-\mathcal{A}^{*}(\bm{\nu}). ∎

Lemma 2.

Suppose that {𝐀i}i=1m\left\{\mathbf{A}_{i}\right\}_{i=1}^{m} commute, and that (28) and (24) hold for 𝐗t\mathbf{X}_{t}, 𝐠t\mathbf{g}_{t} and 𝐡t\mathbf{h}_{t}. Then

𝐗t\displaystyle\mathbf{X}_{t}\; =exp(𝒜(𝝃t))𝐗0exp(𝒜(𝝃t))\displaystyle=\;\exp\left(\mathcal{A}^{*}(\bm{\xi}_{t})\right)\cdot\mathbf{X}_{0}\cdot\exp\left(\mathcal{A}^{*}(\bm{\xi}_{t})\right) (33)
𝐠t\displaystyle\mathbf{g}_{t}\; =𝐠0exp(α𝝃t),𝐡t=𝐡0exp(α𝝃t),\displaystyle=\;\mathbf{g}_{0}\circ\exp\left(\alpha\bm{\xi}_{t}\right),\quad\mathbf{h}_{t}\;=\;\mathbf{h}_{0}\circ\exp\left(-\alpha\bm{\xi}_{t}\right), (34)

where 𝛏t=0t𝐫τ𝑑τ\bm{\xi}_{t}=-\int_{0}^{t}\mathbf{r}_{\tau}d\tau.

Proof.

From (24), we know that

d𝐠tdt=α𝐫t𝐠t,\displaystyle\frac{d\mathbf{g}_{t}}{dt}\;=\;-\alpha\mathbf{r}_{t}\circ\mathbf{g}_{t},

where the differentiation d𝐠tdt\frac{d\mathbf{g}_{t}}{dt} is entrywise for 𝐠t\mathbf{g}_{t}. Thus, we have

0td𝐠τ𝐠τ=α0t𝐫τ𝑑τlog𝐠tlog𝐠0=α𝝃t𝐠t=𝐠0exp(α𝝃t),\displaystyle\int_{0}^{t}\frac{d\mathbf{g}_{\tau}}{\mathbf{g}_{\tau}}\;=\;-\alpha\int_{0}^{t}\mathbf{r}_{\tau}d\tau\quad\Longrightarrow\quad\log\mathbf{g}_{t}-\log\mathbf{g}_{0}\;=\;\alpha\bm{\xi}_{t}\quad\Longrightarrow\quad\mathbf{g}_{t}\;=\;\mathbf{g}_{0}\circ\exp\left(\alpha\bm{\xi}_{t}\right),

where all operations are entrywise. The expression 𝐡t=𝐡0exp(α𝝃t)\mathbf{h}_{t}\;=\;\mathbf{h}_{0}\circ\exp\left(-\alpha\bm{\xi}_{t}\right) follows similarly.

Equation (33) follows by an analogous argument from (28) and the fact that {𝐀i}i=1m\left\{\mathbf{A}_{i}\right\}_{i=1}^{m} commute: the matrices 𝒜(𝐫τ)\mathcal{A}^{*}(\mathbf{r}_{\tau}) for different τ\tau then commute with one another, so the matrix flow (28) integrates to the two-sided exponential in (33). ∎
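As a quick numerical illustration of (34) (a self-contained sketch that uses a made-up residual trajectory chosen only for this illustration, not the residual generated by the actual flow), forward-Euler integration of the entrywise ODE closely tracks the closed form \mathbf{g}_{0}\circ\exp(\alpha\bm{\xi}_{t}):

import numpy as np

# Sanity check of the closed form g_t = g_0 o exp(alpha * xi_t), xi_t = -int_0^t r_tau d tau,
# using an arbitrary smooth residual r(t) (assumed only for this illustration).
rng = np.random.default_rng(1)
m, alpha, dt, T = 5, 2.0, 1e-4, 3.0
c, d = rng.standard_normal(m), rng.standard_normal(m)
r = lambda t: c * np.sin(t) + d                   # made-up residual trajectory r_t

g0 = rng.uniform(0.5, 1.5, size=m)
g = g0.copy()
for k in range(int(T / dt)):                      # forward Euler for g_dot = -alpha * r_t o g_t
    g = g - dt * alpha * r(k * dt) * g

xi_T = -(c * (1.0 - np.cos(T)) + d * T)           # xi_T = -int_0^T r_tau d tau in closed form
g_closed = g0 * np.exp(alpha * xi_T)              # equation (34)
print('max relative error:', np.max(np.abs(g - g_closed) / np.abs(g_closed)))

The discrepancy is of order dt, as expected for forward Euler.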

Lemma 3.

Under the settings of Theorem 2 and Lemma 2, for any γ>0\gamma>0 there exists

𝝂(γ)=𝝃(γ)log(1/γ),\displaystyle\bm{\nu}(\gamma)\;=\;\frac{\bm{\xi}_{\infty}(\gamma)}{\log\left(1/\gamma\right)}, (35)

such that

limγ0𝒜(𝝂(γ))𝐈,limγ0[𝐈𝒜(𝝂(γ))]𝐗^= 0,\displaystyle\lim_{\gamma\rightarrow 0}\mathcal{A}^{*}\left(\bm{\nu}(\gamma)\right)\;\preceq\;\mathbf{I},\quad\lim_{\gamma\rightarrow 0}\left[\mathbf{I}-\mathcal{A}^{*}\left(\bm{\nu}(\gamma)\right)\right]\cdot\widehat{\mathbf{X}}\;=\;\mathbf{0},

where \bm{\xi}_{\infty}(\gamma)=\lim_{t\rightarrow+\infty}\bm{\xi}_{t}(\gamma) with \bm{\xi}_{t}(\gamma)=-\int_{0}^{t}\mathbf{r}_{\tau}(\gamma)d\tau.

Proof.

Given 𝐔0=γ𝐈\mathbf{U}_{0}=\gamma\mathbf{I}, we have 𝐗0=𝐔0𝐔0=γ2𝐈\mathbf{X}_{0}=\mathbf{U}_{0}\mathbf{U}_{0}^{\top}=\gamma^{2}\mathbf{I}. By the expression for 𝐗t\mathbf{X}_{t} in (29), we have

𝐗(γ)=γ2exp(2𝒜(𝝃(γ)))\displaystyle\mathbf{X}_{\infty}(\gamma)\;=\;\gamma^{2}\cdot\exp\left(2\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right) (36)

where 𝝃(γ)=limt𝝃t(γ)\bm{\xi}_{\infty}(\gamma)=\lim_{t\rightarrow\infty}\bm{\xi}_{t}(\gamma). Because {𝐀i}i=1m\left\{\mathbf{A}_{i}\right\}_{i=1}^{m} are symmetric and they commute, we know that they are simultaneously diagonalizable by an orthonormal basis 𝛀=[𝝎1,,𝝎n]n×n\mathbf{\Omega}=[\bm{\omega}_{1},\ldots,\bm{\omega}_{n}]\in\mathbb{R}^{n\times n}, i.e.,

\displaystyle\mathbf{\Omega}^{\top}\mathbf{A}_{i}\mathbf{\Omega}=\mathbf{\Lambda}_{i},\quad\mathbf{\Lambda}_{i}\text{ diagonal},\quad\forall\;i=1,2,\ldots,m,

and so is 𝒜(𝐛)\mathcal{A}^{*}(\mathbf{b}) for any 𝐛m\mathbf{b}\in\mathbb{R}^{m}. Therefore, we have

λk(𝐗(γ))=γ2exp(2λk(𝒜(𝝃(γ))))=exp(2λk(𝒜(𝝃(γ)))+2logγ),\displaystyle\lambda_{k}\left(\mathbf{X}_{\infty}(\gamma)\right)\;=\;\gamma^{2}\cdot\exp\left(2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)\right)\;=\;\exp\left(2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)+2\log\gamma\right), (37)

where λk()\lambda_{k}(\cdot) denotes the eigenvalue associated with the kk-th basis vector 𝝎k\bm{\omega}_{k}. Moreover, 𝐗(γ)\mathbf{X}_{\infty}(\gamma) and its limit 𝐗^\widehat{\mathbf{X}} share the same eigen-basis 𝛀\mathbf{\Omega}. Since 𝐗(γ)\mathbf{X}_{\infty}(\gamma) converges to 𝐗^\widehat{\mathbf{X}} when γ0\gamma\rightarrow 0, the eigenvalues converge as well:

λk(𝐗(γ))λk(𝐗^),k= 1,2,,n,\displaystyle\lambda_{k}\left(\mathbf{X}_{\infty}(\gamma)\right)\;\rightarrow\;\lambda_{k}(\widehat{\mathbf{X}}),\quad\forall\;k\;=\;1,2,\ldots,n, (38)

as γ0\gamma\rightarrow 0.

Case 1: λk(𝐗^)>0\lambda_{k}(\widehat{\mathbf{X}})>0.

For any kk such that λk(𝐗^)>0\lambda_{k}(\widehat{\mathbf{X}})>0, from (37) and (38), we have

exp(2λk(𝒜(𝝃(γ)))+2logγ)λk(𝐗^),\displaystyle\exp\left(2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)+2\log\gamma\right)\;\rightarrow\;\lambda_{k}(\widehat{\mathbf{X}}),

so that

2λk(𝒜(𝝃(γ)))+2logγlogλk(𝐗^) 0,\displaystyle 2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)+2\log\gamma-\log\lambda_{k}(\widehat{\mathbf{X}})\;\rightarrow\;0,

which further implies that

λk(𝒜(𝝃(γ)log(1/γ)))1logλk(𝐗^)2log(1/γ) 0.\displaystyle\lambda_{k}\left(\mathcal{A}^{*}\left(\frac{\bm{\xi}_{\infty}(\gamma)}{\log(1/\gamma)}\right)\right)-1-\frac{\log\lambda_{k}(\widehat{\mathbf{X}})}{2\log(1/\gamma)}\;\rightarrow\;0.

Hence, constructing 𝝂(γ)\bm{\nu}(\gamma) as in (35), we conclude that

limγ0λk(𝒜(𝝂(γ)))= 1,\displaystyle\lim_{\gamma\rightarrow 0}\;\lambda_{k}\left(\mathcal{A}^{*}(\bm{\nu}(\gamma))\right)\;=\;1, (39)

for any kk such that λk(𝐗^)>0\lambda_{k}(\widehat{\mathbf{X}})>0.

Case 2: λk(𝐗^)=0\lambda_{k}(\widehat{\mathbf{X}})=0.

On the other hand, for any kk such that λk(𝐗^)=0\lambda_{k}(\widehat{\mathbf{X}})=0, similarly from (37) and (38), we have

exp(2λk(𝒜(𝝃(γ)))+2logγ) 0,\displaystyle\exp\left(2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)+2\log\gamma\right)\;\rightarrow\;0,

when γ0\gamma\rightarrow 0. Thus, for any small ϵ(0,1)\epsilon\in(0,1), there exists some γ0(0,1)\gamma_{0}\in(0,1) such that for all γ<γ0\gamma<\gamma_{0},

exp(2λk(𝒜(𝝃(γ)))+2logγ)<ϵ,\displaystyle\exp\left(2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)+2\log\gamma\right)\;<\;\epsilon,

which implies that

λk(𝒜(𝝃(γ)log(1/γ))) 1<logϵ2log(1/γ)< 0.\displaystyle\lambda_{k}\left(\mathcal{A}^{*}\left(\frac{\bm{\xi}_{\infty}(\gamma)}{\log(1/\gamma)}\right)\right)\;-\;1\;<\;\frac{\log\epsilon}{2\log\left(1/\gamma\right)}\;<\;0.

Thus, given the construction of 𝝂(γ)\bm{\nu}(\gamma) in (35), we have

λk(𝒜(𝝂(γ)))< 1,γ<γ0,\displaystyle\lambda_{k}\left(\mathcal{A}^{*}(\bm{\nu}(\gamma))\right)\;<\;1,\quad\forall\;\gamma\;<\;\gamma_{0},

which, taking γ0\gamma\rightarrow 0, further implies that for any kk with λk(𝐗^)=0\lambda_{k}(\widehat{\mathbf{X}})=0, we have

\displaystyle\lim_{\gamma\rightarrow 0}\lambda_{k}\left(\mathcal{A}^{*}(\bm{\nu}(\gamma))\right)\;\leq\;1. (40)

Putting things together.

Combining our analysis in (39) and (40), we obtain

limγ0𝒜(𝝂(γ))𝐈.\displaystyle\lim_{\gamma\rightarrow 0}\mathcal{A}^{*}(\bm{\nu}(\gamma))\;\preceq\;\mathbf{I}.

On the other hand, as shown above, there exists an orthogonal matrix 𝛀n×n\mathbf{\Omega}\in\mathbb{R}^{n\times n} that simultaneously diagonalizes 𝒜(𝝂(γ))\mathcal{A}^{*}(\bm{\nu}(\gamma)) and 𝐗^\widehat{\mathbf{X}}. Thus, we have

[𝐈𝒜(𝝂(γ))]𝐗^=𝛀(𝐈𝚲𝒜(𝝂(γ)))𝚲𝐗^𝛀,\displaystyle\left[\mathbf{I}-\mathcal{A}^{*}(\bm{\nu}(\gamma))\right]\cdot\widehat{\mathbf{X}}\;=\;\mathbf{\Omega}\cdot\left(\mathbf{I}-\mathbf{\Lambda}_{\mathcal{A}^{*}(\bm{\nu}(\gamma))}\right)\cdot\mathbf{\Lambda}_{\widehat{\mathbf{X}}}\cdot\mathbf{\Omega}^{\top},

where 𝚲𝒜(𝝂(γ))\mathbf{\Lambda}_{\mathcal{A}^{*}(\bm{\nu}(\gamma))} and 𝚲𝐗^\mathbf{\Lambda}_{\widehat{\mathbf{X}}} are diagonal matrices, with entries being the eigenvalues of 𝒜(𝝂(γ))\mathcal{A}^{*}(\bm{\nu}(\gamma)) and 𝐗^\widehat{\mathbf{X}}, respectively. From our analysis for Case 1 and Case 2, we know that limγ0(𝐈𝚲𝒜(𝝂(γ)))𝚲𝐗^=𝟎\lim_{\gamma\rightarrow 0}\left(\mathbf{I}-\mathbf{\Lambda}_{\mathcal{A}^{*}(\bm{\nu}(\gamma))}\right)\cdot\mathbf{\Lambda}_{\widehat{\mathbf{X}}}=\mathbf{0}. Therefore, we also have

limγ0[𝐈𝒜(𝝂(γ))]𝐗^= 0,\displaystyle\lim_{\gamma\rightarrow 0}\;\left[\mathbf{I}-\mathcal{A}^{*}(\bm{\nu}(\gamma))\right]\cdot\widehat{\mathbf{X}}\;=\;\mathbf{0},

as desired. ∎
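The simultaneous diagonalization used repeatedly above is easy to check numerically. The sketch below (random commuting matrices of illustrative size) builds \mathbf{A}_{i}=\mathbf{\Omega}\mathbf{\Lambda}_{i}\mathbf{\Omega}^{\top} with a shared orthonormal basis and verifies that \mathcal{A}^{*}(\bm{\nu})=\sum_{i}\nu_{i}\mathbf{A}_{i} is diagonalized by the same basis, with \lambda_{k}(\mathcal{A}^{*}(\bm{\nu}))=\sum_{i}\nu_{i}\lambda_{k}(\mathbf{A}_{i}):

import numpy as np

# Sketch: commuting symmetric matrices A_i share an eigenbasis Omega, so A^*(nu) = sum_i nu_i A_i
# is diagonalized by Omega and its k-th eigenvalue is sum_i nu_i * lambda_k(A_i).
rng = np.random.default_rng(2)
n, m = 6, 10
Omega, _ = np.linalg.qr(rng.standard_normal((n, n)))     # shared orthonormal eigenbasis
D = rng.standard_normal((m, n))                          # row i holds the eigenvalues of A_i
A = np.stack([Omega @ np.diag(d) @ Omega.T for d in D])

commutator = max(np.abs(A[i] @ A[j] - A[j] @ A[i]).max() for i in range(m) for j in range(m))
nu = rng.standard_normal(m)
M = Omega.T @ np.einsum('i,ijk->jk', nu, A) @ Omega      # Omega^T A^*(nu) Omega, should be diagonal
print('max pairwise commutator entry            :', commutator)
print('max off-diagonal of Omega^T A^*(nu) Omega:', np.abs(M - np.diag(np.diag(M))).max())
print('eigenvalues vs. sum_i nu_i lambda_k(A_i) :', np.abs(np.diag(M) - nu @ D).max())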

Lemma 4.

Under the settings of Theorem 2 and Lemma 2, for any γ>0\gamma>0 there exists

𝝂(γ)=𝝃(γ)log(1/γ),\displaystyle\bm{\nu}(\gamma)\;=\;\frac{\bm{\xi}_{\infty}(\gamma)}{\log\left(1/\gamma\right)}, (41)

such that

\displaystyle\lim_{\gamma\rightarrow 0}\;\bm{\nu}(\gamma)\;\in\;\alpha^{-1}\cdot\operatorname{sign}(\widehat{\mathbf{s}}),\quad\lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma)\circ\mathbf{h}_{\infty}(\gamma)\;=\;\mathbf{0}, (42)

where \widehat{\mathbf{s}}=\widehat{\mathbf{g}}\circ\widehat{\mathbf{g}}-\widehat{\mathbf{h}}\circ\widehat{\mathbf{h}}, and \bm{\xi}_{\infty}(\gamma)=\lim_{t\rightarrow+\infty}\bm{\xi}_{t}(\gamma) with \bm{\xi}_{t}(\gamma)=-\int_{0}^{t}\mathbf{r}_{\tau}(\gamma)d\tau.

Proof.

Let gi(γ)g_{\infty}^{i}(\gamma) and hi(γ)h_{\infty}^{i}(\gamma) be the iith coordinate of 𝐠(γ)\mathbf{g}_{\infty}(\gamma) and 𝐡(γ)\mathbf{h}_{\infty}(\gamma) defined in (25), respectively. It follows from (30) that

gi(γ)=γexp(αξi(γ)),hi(γ)=γexp(αξi(γ)),i= 1,2,,m.\displaystyle g_{\infty}^{i}(\gamma)\;=\;\gamma\cdot\exp\left(\alpha\cdot\xi_{\infty}^{i}(\gamma)\right),\quad h_{\infty}^{i}(\gamma)\;=\;\gamma\cdot\exp\left(-\alpha\cdot\xi_{\infty}^{i}(\gamma)\right),\quad\forall\;i\;=\;1,2,\ldots,m.

When γ0\gamma\rightarrow 0, we have

gi(γ)hi(γ)=γ2 0,i= 1,2,,m,\displaystyle g_{\infty}^{i}(\gamma)\cdot h_{\infty}^{i}(\gamma)\;=\;\gamma^{2}\;\rightarrow\;0,\quad\forall\;i\;=\;1,2,\ldots,m,

so that \lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma)\circ\mathbf{h}_{\infty}(\gamma)\;=\;\mathbf{0}. This also implies that, for every i=1,2,\ldots,m, at least one of \widehat{g}_{i} and \widehat{h}_{i} is zero.

On the other hand, let us define

\displaystyle\mathbf{s}_{\infty}(\gamma)\;=\;\mathbf{g}_{\infty}(\gamma)\circ\mathbf{g}_{\infty}(\gamma)\;-\;\mathbf{h}_{\infty}(\gamma)\circ\mathbf{h}_{\infty}(\gamma),

and let si(γ)s_{\infty}^{i}(\gamma) be the iith coordinate of 𝐬(γ)\mathbf{s}_{\infty}(\gamma) with

si(γ)=γ2exp(2αξi(γ))γ2exp(2αξi(γ)).\displaystyle s_{\infty}^{i}(\gamma)\;=\;\gamma^{2}\cdot\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\;-\;\gamma^{2}\cdot\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right). (43)

Correspondingly, we know that \widehat{\mathbf{s}}=\lim_{\gamma\rightarrow 0}\mathbf{s}_{\infty}(\gamma); let \widehat{s}_{i} be the iith coordinate of 𝐬^=𝐠^𝐠^𝐡^𝐡^\widehat{\mathbf{s}}=\widehat{\mathbf{g}}\circ\widehat{\mathbf{g}}-\widehat{\mathbf{h}}\circ\widehat{\mathbf{h}}. In the following, we leverage these facts to show that our construction of 𝝂(γ)\bm{\nu}(\gamma) satisfies (42). We classify the entries s^i\widehat{s}_{i} of 𝐬^\widehat{\mathbf{s}} (i=1,2,,m)(i=1,2,\ldots,m) into three cases and analyze them as follows.

  • Case 1: s^i>0\widehat{s}_{i}>0. Since limγ0si(γ)=s^i>0\lim_{\gamma\rightarrow 0}s_{\infty}^{i}(\gamma)=\widehat{s}_{i}>0, from (43) we must have ξi(γ)+\xi_{\infty}^{i}(\gamma)\rightarrow+\infty when γ0\gamma\rightarrow 0, so that exp(2αξi(γ))+\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow+\infty and exp(2αξi(γ))0\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow 0. Therefore, when γ0\gamma\rightarrow 0, we have

    γ2exp(2αξi(γ))s^i\displaystyle\gamma^{2}\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\;\rightarrow\;\widehat{s}_{i}\quad 2αξi(γ)2log(1/γ)logs^i 0,\displaystyle\Longrightarrow\quad 2\alpha\cdot\xi_{\infty}^{i}(\gamma)-2\log\left(1/\gamma\right)-\log\widehat{s}_{i}\;\rightarrow\;0,
    νi(γ)=ξi(γ)log(1/γ)1α.(givenlog(1/γ)+)\displaystyle\Longrightarrow\quad\nu_{i}(\gamma)\;=\;\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)}\;\rightarrow\;\frac{1}{\alpha}.\;\;(\text{given}\;\;\log\left(1/\gamma\right)\rightarrow+\infty)
  • Case 2: s^i<0\widehat{s}_{i}<0. Since limγ0si(γ)=s^i<0\lim_{\gamma\rightarrow 0}s_{\infty}^{i}(\gamma)=\widehat{s}_{i}<0, from (43) we must have ξi(γ)\xi_{\infty}^{i}(\gamma)\rightarrow-\infty when γ0\gamma\rightarrow 0, so that exp(2αξi(γ))0\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow 0 and exp(2αξi(γ))+\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow+\infty. Therefore, when γ0\gamma\rightarrow 0, we have

    \displaystyle-\gamma^{2}\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\;\rightarrow\;\widehat{s}_{i}\quad\Longrightarrow\quad-2\alpha\cdot\xi_{\infty}^{i}(\gamma)-2\log\left(1/\gamma\right)-\log\left(-\widehat{s}_{i}\right)\;\rightarrow\;0,
    νi(γ)=ξi(γ)log(1/γ)1α.\displaystyle\Longrightarrow\quad\nu_{i}(\gamma)\;=\;\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)}\;\rightarrow\;-\frac{1}{\alpha}.
  • Case 3: s^i=0\widehat{s}_{i}=0. Since limγ0si(γ)=s^i=0\lim_{\gamma\rightarrow 0}s_{\infty}^{i}(\gamma)=\widehat{s}_{i}=0, from (43) we must have γ2exp(2αξi(γ))0\gamma^{2}\cdot\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow 0 and γ2exp(2αξi(γ))0\gamma^{2}\cdot\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow 0, when γ0\gamma\rightarrow 0. Therefore, for any small ϵ(0,1)\epsilon\in(0,1), there exists some γ0>0\gamma_{0}>0, such that for all γ(0,γ0)\gamma\in(0,\gamma_{0}), we have

    γ2max{exp(2αξi(γ)),exp(2αξi(γ))}ϵ\displaystyle\gamma^{2}\cdot\max\left\{\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right),\;\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\right\}\;\leq\;\epsilon
    \displaystyle\Longrightarrow\quad 2αmax{ξi(γ)log(1/γ),ξi(γ)log(1/γ)}2<logϵlog(1/γ)< 0,\displaystyle 2\alpha\cdot\max\left\{\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)},-\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)}\right\}-2\;<\;\frac{\log\epsilon}{\log\left(1/\gamma\right)}\;<\;0,

    which further implies that

    |νi(γ)|=max{ξi(γ)log(1/γ),ξi(γ)log(1/γ)}<1α.\displaystyle\left|\nu_{i}(\gamma)\right|\;=\;\max\left\{\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)},-\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)}\right\}\;<\;\frac{1}{\alpha}.

Therefore, combining the results in the three cases above we obtain that

\displaystyle\lim_{\gamma\rightarrow 0}\nu_{i}(\gamma)\;\in\;\frac{1}{\alpha}\operatorname{sign}(\widehat{s}_{i})\;=\;\begin{cases}\frac{\widehat{s}_{i}}{\alpha\left|\widehat{s}_{i}\right|}&\widehat{s}_{i}\not=0,\\ [-1/\alpha,1/\alpha]&\widehat{s}_{i}=0,\end{cases}

so that (42) holds. ∎
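The limits in Cases 1-3 can also be seen with a tiny numeric sketch (a simplification made only for illustration: we fix a target value \widehat{s}_{i} and invert (43), which reads s=2\gamma^{2}\sinh(2\alpha\xi), for \xi_{\infty}^{i}(\gamma), rather than running the actual flow). The quantity \nu_{i}(\gamma)=\xi_{\infty}^{i}(\gamma)/\log(1/\gamma) then approaches \operatorname{sign}(\widehat{s}_{i})/\alpha as \gamma\rightarrow 0:

import numpy as np

# Illustration of Cases 1-3: fix a target s_hat_i, invert (43) for xi, and observe that
# nu_i(gamma) = xi / log(1/gamma) approaches sign(s_hat_i)/alpha as gamma -> 0.
alpha = 2.0
for s_hat in (3.0, -0.7, 0.0):
    for gamma in (1e-2, 1e-4, 1e-8, 1e-16):
        xi = np.arcsinh(s_hat / (2.0 * gamma ** 2)) / (2.0 * alpha)   # from s = 2*gamma^2*sinh(2*alpha*xi)
        nu = xi / np.log(1.0 / gamma)
        print(f's_hat = {s_hat:+.1f}, gamma = {gamma:.0e}, nu_i(gamma) = {nu:+.4f} (1/alpha = {1 / alpha:.2f})')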

Appendix C Extra Experiments

Due to limited space in the main body, here we provide additional results for our experiments on robust image recovery presented in Section 4.2.

Varying corruption levels. In Figure 4, we display results of our method for robust image recovery with varying levels of salt-and-pepper corruption.

Learning curves for different test images and varying corruption levels. In Figure 5(a) and Figure 5(b), we provide learning curves for the results in Figure 3 and Figure 4, respectively. While DIP-1\ell_{1} requires case-dependent early termination to obtain its best performance, our method does not overfit and requires no early termination. Overall, our method achieves both higher PSNR and better visual quality, especially when the corruption level is high.

[Figure 4: image panels omitted. Rows correspond to 10%, 30%, 50%, and 70% salt-and-pepper noise; columns are labeled (m) Input, (n) DIP, (o) DIP-L1, (p) Ours. Overlaid panel PSNRs: (a) 26.19, (b) 36.15, (c) 34.91; (d) 20.15, (e) 33.64, (f) 34.39; (g) 16.46, (h) 29.06, (i) 31.60; (j) 13.91, (k) 24.80, (l) 26.99.]
Figure 4: Robust image recovery with 10%,30%,50%10\%,30\%,50\%, and 70%70\% salt-and-pepper noise. PSNR of the results is overlaid at the bottom of the images. For our method, all cases use the same network width, learning rate, and termination condition. For DIP and DIP-1\ell_{1}, case-dependent early stopping is used which is essential for their good performance. Despite that, our method achieves the highest PSNRs and best visual quality.

[Figure 5: image panels omitted; (a) Different test images, (b) Varying corruption levels.]
Figure 5: Learning curves for robust image recovery with different test images (a) and varying corruption levels (b). DIP-1\ell_{1} requires case-dependent early stopping while our method does not.