
Robust Recovery via Implicit Bias of Discrepant Learning Rates for Double Over-parameterization

Chong You           Zhihui Zhu         Qing Qu         Yi Ma

Department of EECS, University of California, Berkeley
Department of Electrical and Computer Engineering, University of Denver
Center for Data Science, New York University
Abstract

Recent advances have shown that the implicit bias of gradient descent on over-parameterized models enables the recovery of low-rank matrices from linear measurements, even with no prior knowledge of the intrinsic rank. In contrast, for robust low-rank matrix recovery from grossly corrupted measurements, over-parameterization leads to overfitting when prior knowledge of both the intrinsic rank and the sparsity of the corruption is unavailable. This paper shows that with a double over-parameterization for both the low-rank matrix and the sparse corruption, gradient descent with discrepant learning rates provably recovers the underlying matrix even without prior knowledge of either the rank of the matrix or the sparsity of the corruption. We further extend our approach to the robust recovery of natural images by over-parameterizing images with deep convolutional networks. Experiments show that our method handles different test images and varying corruption levels with a single learning pipeline, where the network width and termination conditions do not need to be adjusted on a case-by-case basis. Underlying the success is again the implicit bias of discrepant learning rates on different over-parameterized parameters, which may bear on broader applications.

The first two authors contributed equally to this work.

1 Introduction

Learning over-parameterized models, which have more parameters than the problem's intrinsic dimension, is becoming a crucial topic in machine learning [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. While classical learning theory suggests that over-parameterized models tend to overfit [12], recent advances have shown that an optimization algorithm may produce an implicit bias that regularizes the solution with desired properties. This type of result has led to new insights and a better understanding of gradient descent for solving several fundamental problems, including logistic regression on linearly separable data [1], compressive sensing [2, 3], sparse phase retrieval [4], nonlinear least-squares [5], low-rank (deep) matrix factorization [6, 7, 8, 9], and deep linear neural networks [10, 11], among others.

Inspired by these recent advances [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], in this work we present a new type of practical method for robust recovery of structured signals via model over-parameterization. In particular, we aim to learn an unknown signal \mathbf{X}_{\star}\in\mathbb{R}^{n\times n} from its grossly corrupted linear measurements

𝐲=𝒜(𝐗)+𝐬,\displaystyle\mathbf{y}\;=\;\mathcal{A}(\mathbf{X}_{\star})\;+\;\mathbf{s}_{\star}, (1)

where \mathcal{A}(\cdot):\mathbb{R}^{n\times n}\mapsto\mathbb{R}^{m} is a linear operator and \mathbf{s}_{\star}\in\mathbb{R}^{m} is a (sparse) corruption vector. Variants of this problem appear ubiquitously in signal processing and machine learning [13, 14, 15, 16, 17, 18].

Robust recovery of low-rank matrices. Low-rank matrix recovery has broad applications in face recognition [13] (where self-shadowing, specularity, or saturation in brightness can be modeled as outliers), video surveillance [13] (where foreground objects are usually modeled as outliers), and beyond. A classical method for low-rank matrix recovery is nuclear norm minimization, where the nuclear norm (the tightest convex envelope of the matrix rank) is defined as the sum of singular values. Such a method is provably correct under certain incoherence conditions [13, 19]. However, minimizing the nuclear norm involves expensive computation of the singular value decomposition (SVD) of large matrices [20] (when n is large), which prohibits its application to problems of practical size.

The computational challenge has been addressed by the recent development of matrix factorization (MF) methods [21, 17]. Such methods parameterize the signal \mathbf{X}\in\mathbb{R}^{n\times n} via the factorization \mathbf{X}=\mathbf{U}\mathbf{U}^{\top} [22] and solve the associated nonconvex optimization problem with respect to (w.r.t.) \mathbf{U}\in\mathbb{R}^{n\times r}, where r is the rank of \mathbf{X}_{\star}. Since the problem is of much smaller size when r\ll n, this leads to more scalable methods that still enjoy guarantees of correctness [23, 24, 25, 17]. However, the effectiveness of such exact-parameterization methods hinges on exact knowledge of the rank of \mathbf{X}_{\star} and the percentage of corrupted entries in \mathbf{s}_{\star} (see Figure 1(a)), which are usually not available a priori in practice.

[Figure 1: two panels, (a) Robust Matrix Recovery and (b) Robust Image Recovery.]
Figure 1: Learning curves for robust recovery of low-rank matrices (a) and natural images (b). (a) The classical matrix factorization (MF) method with an \ell_1 penalty requires exact parameterization (left, blue); otherwise, over-parameterization leads to overfitting without early termination (left, black). (b) The previous DIP method with an \ell_1 penalty requires tuning the network width (width = 64, right, blue) or early termination (width = 192, right, black). For both problems, our method with double over-parameterization achieves superior performance and requires neither early termination nor precise parameterization (red curves).

Robust recovery of natural images. Robust recovery of natural images is often considered a challenging task due to the lack of a universal mathematical model for natural images. While sparse and low-rank based methods have been demonstrated to be effective for years [26, 27, 28, 29, 30, 31], the state-of-the-art performance is obtained by priors learned with deep convolutional neural networks. Such methods operate by end-to-end training of neural networks on pairs of corrupted/clean images [32, 33, 34], and often cannot effectively handle test cases whose corruption type and noise level differ from those of the training data.

Recently, this challenge has been partially addressed by the so-called deep image prior (DIP) [35], which has shown impressive results on many image processing tasks. The method is based on parameterizing an image 𝐗\mathbf{X} by a deep network 𝐗=ϕ(𝜽)\mathbf{X}=\phi(\bm{\theta}), which turns out to be a flexible prior for modeling the underlying distributions of a variety of natural images. The network ϕ(𝜽)\phi(\bm{\theta}) has a U-shaped architecture and may be viewed as a multi-layer, nonlinear extension of the low-rank matrix factorization 𝐔𝐔\mathbf{U}\mathbf{U}^{\top}. Hence, it may not come as a surprise that DIP also inherits the drawbacks of the exact-parameterization approach for low-rank matrix recovery. Namely, it requires either a meticulous choice of network width [36] or early stopping of the learning process [37] (see Figure 1(b)).

Overview of our methods and contributions. In this work, we show that the challenges associated with exact-parameterization methods can be simply and effectively dealt with via over-parameterization and discrepant learning rates. Our work is inspired by the recent results [6, 7] for low-rank matrix recovery with \mathbf{s}_{\star}=\mathbf{0}, which state that gradient descent on

min𝐔n×r12𝒜(𝐔𝐔)𝐲22\min_{\mathbf{U}\in\mathbb{R}^{n\times r^{\prime}}}\frac{1}{2}\left\|\mathcal{A}(\mathbf{U}\mathbf{U}^{\top})-\mathbf{y}\right\|_{2}^{2} (2)

converges to a low-rank regularized solution with \operatorname{rank}(\mathbf{U})=\operatorname{rank}(\mathbf{X}_{\star}) even when \mathbf{U}\in\mathbb{R}^{n\times r^{\prime}} over-parameterizes \mathbf{X}_{\star} with r^{\prime}=n\gg r. In the presence of sparse corruption \mathbf{s}_{\star}, it is tempting to replace the \ell_2-norm in (2) with an \ell_1-norm (a commonly used approach for robust optimization problems, such as robust regression [38], robust subspace learning [39, 40, 41], robust phase retrieval [16, 42], robust matrix recovery [15, 18], and many more) and solve

min𝐔n×r𝒜(𝐔𝐔)𝐲1.\min_{\mathbf{U}\in\mathbb{R}^{n\times r^{\prime}}}\;\left\|\mathcal{A}(\mathbf{U}\mathbf{U}^{\top})-\mathbf{y}\right\|_{1}. (3)

Unfortunately, such a naive approach easily overfits the corruptions with r=nr^{\prime}=n (see Figure 1(a)).

In this work, we introduce a double over-parameterization method for robust recovery. Our method is based on over-parameterizing both the low-rank matrix 𝐗\mathbf{X}_{\star} and the outliers 𝐬\mathbf{s}_{\star}, and leveraging implicit algorithmic bias to find the correct solution (𝐗,𝐬)(\mathbf{X}_{\star},\mathbf{s}_{\star}). The benefit of such an approach w.r.t. the state of the art is summarized as follows (see also Table 1):

  • More scalable. Our method is based on gradient descent only and does not require computing an SVD per iteration as in convex approaches [13]. Hence it is significantly more scalable.

  • Prior knowledge free. Our method requires no prior knowledge of either the rank of \mathbf{X}_{\star} or the sparsity of \mathbf{s}_{\star}, and is free of parameter tuning. This is in contrast to existing nonconvex approaches [15, 24, 18], for which the true rank and/or sparsity are required a priori.

  • Provably correct. Under conditions similar to those of the convex approach, our method converges to the ground truth (\mathbf{X}_{\star},\mathbf{s}_{\star}) asymptotically.

Table 1: Comparison of different approaches to matrix recovery.
                         Convex (Eq. (8))   Nonconvex (Eq. (3))   Ours (Eq. (4))
  Provably correct?             ✓                   ✓                   ✓
  Prior knowledge free?         ✓                                       ✓
  Scalable?                                         ✓                   ✓

Underlying the success of our method is the notion of implicit bias of discrepant learning rates. The idea is that the algorithmic low-rank and sparse regularizations need to be balanced in order to identify the underlying rank and sparsity. In the absence of an explicit regularization parameter to tune, we show in Section 2.3 that the desired balance can be obtained by using different learning rates for different optimization parameters. Such a property may be of separate interest and bear on a broader range of problems.

Finally, we demonstrate the broad applicability of our ideas on the task of the robust recovery of natural images. We only need to replace the over-parameterization 𝐔𝐔\mathbf{U}\mathbf{U}^{\top} for low-rank matrices by a U-shaped network ϕ(𝜽)\phi(\bm{\theta}) for natural images (as adopted in DIP [35]) and solve the optimization problem by gradient descent. This produces a powerful and easy-to-use method with favorable properties when compared to the original DIP (see Figure 1(b)). To be precise, our method handles different test images and varying corruption levels with a single learning pipeline, in which network width and termination conditions do not need to be adjusted on a case-by-case basis.

2 Main Results and Algorithms

2.1 A Double Over-Parameterization Formulation

As formulated in (1), we first consider the problem of recovering a rank-r (r\ll n) positive semidefinite matrix \mathbf{X}_{\star}\in\mathbb{R}^{n\times n} (by a lifting trick such as [43, 44], our method can be extended to handle arbitrary rectangular matrices) from its corrupted linear measurements \mathbf{y}=\mathcal{A}(\mathbf{X}_{\star})+\mathbf{s}_{\star}, where \mathbf{s}_{\star}\in\mathbb{R}^{m} is a vector of sparse corruptions. Recent advances on algorithmic implicit bias for optimizing over-parameterized models [6, 7, 8, 9, 2, 3] motivate us to consider a nonlinear least-squares formulation for robust matrix recovery, with the double over-parameterization \mathbf{X}=\mathbf{U}\mathbf{U}^{\top} and \mathbf{s}=\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h}:

min𝐔n×r,{𝐠,𝐡}mf(𝐔,𝐠,𝐡):=14𝒜(𝐔𝐔)+(𝐠𝐠𝐡𝐡)𝐲22,\displaystyle\min_{\mathbf{U}\in\mathbb{R}^{n\times r^{\prime}},\{\mathbf{g},\mathbf{h}\}\subseteq\mathbb{R}^{m}}\;f(\mathbf{U},\mathbf{g},\mathbf{h})\;:=\;\frac{1}{4}\left\|\mathcal{A}\left(\mathbf{U}\mathbf{U}^{\top}\right)+\left(\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h}\right)-\mathbf{y}\right\|_{2}^{2}, (4)

where the dimensional parameter r^{\prime}\geq r. In practice, the choice of r^{\prime} depends on how much prior information we have about \mathbf{X}_{\star}: r^{\prime} can either be taken as an estimated upper bound on r, or set to r^{\prime}=n when no prior knowledge is available. In the following, we provide more background on the choice of our formulation (4).

  • Implicit low-rank prior via matrix factorization. For the vanilla low-rank matrix recovery problem with no outliers (i.e., \mathbf{s}_{\star}=\mathbf{0}), the low-rank matrix \mathbf{X}_{\star} can be recovered via the over-parameterization \mathbf{X}=\mathbf{U}\mathbf{U}^{\top} [6, 7, 8, 9]. In particular, the work [6] showed that with small initialization and infinitesimal learning rate, gradient descent on (2) converges to the minimum nuclear norm solution under certain commutativity conditions on \mathcal{A}(\cdot).

  • Implicit sparse prior via Hadamard multiplication. For the classical sparse recovery problem [45, 46] which aims to recover a sparse 𝐬m\mathbf{s}_{\star}\in\mathbb{R}^{m} from its linear measurement 𝐛=𝐀𝐬\mathbf{b}=\mathbf{A}\mathbf{s}_{\star}, recent work [2, 3] showed that it can also be dealt with via over-parameterization 𝐬=𝐠𝐠𝐡𝐡\mathbf{s}=\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h}. In particular, the work [2] showed that with small initialization and infinitesimal learning rate, gradient descent on

    min{𝐠,𝐡}m𝐀(𝐠𝐠𝐡𝐡)𝐛22\min_{\{\mathbf{g},\mathbf{h}\}\subseteq\mathbb{R}^{m}}\;\left\|\mathbf{A}(\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h})-\mathbf{b}\right\|_{2}^{2} (5)

    correctly recovers the sparse \mathbf{s}_{\star} when \mathbf{A} satisfies certain restricted isometry properties [47].
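To make the Hadamard over-parameterization concrete, below is a minimal NumPy sketch of gradient descent on (5). The step size, iteration count, and initialization scale are illustrative assumptions rather than values prescribed by [2].

```python
import numpy as np

def sparse_recovery_hadamard(A, b, tau=1e-3, n_iters=50000, init_scale=1e-4):
    """Gradient descent on (5), with s over-parameterized as g∘g - h∘h."""
    m = A.shape[1]
    g = init_scale * np.ones(m)          # small initialization, as the theory requires
    h = init_scale * np.ones(m)
    for _ in range(n_iters):
        r = A @ (g * g - h * h) - b      # residual of the current estimate of s
        grad = A.T @ r                   # shared factor in the gradients w.r.t. g and h
        g = g - tau * 4.0 * grad * g     # grad_g ||A(g∘g-h∘h)-b||^2 =  4 (A^T r) ∘ g
        h = h + tau * 4.0 * grad * h     # grad_h ||A(g∘g-h∘h)-b||^2 = -4 (A^T r) ∘ h
    return g * g - h * h                 # recovered (approximately sparse) s
```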

The benefit of the double over-parameterization formulation in (4), as we shall see, is that it allows specific algorithms to automatically identify both the intrinsic rank of 𝐗\mathbf{X}_{\star} and the sparsity level of 𝐬\mathbf{s}_{\star} without any prior knowledge.

2.2 Algorithmic Regularizations via Gradient Descent

Obviously, over-parameterization leads to under-determined problems that can have infinitely many solutions, so not all solutions of (4) correspond to the desired (\mathbf{X}_{\star},\mathbf{s}_{\star}). For example, for any given \mathbf{U}, one can always construct a pair (\mathbf{g},\mathbf{h}) such that (\mathbf{U},\mathbf{g},\mathbf{h}) achieves the global minimum value f=0 in (4). Nonetheless, as we show in this work, the gradient descent iteration on (4),

𝐔k+1=𝐔kτ𝐔f(𝐔k,𝐠k,𝐡k)=𝐔kτ𝒜(𝐫k)𝐔k,[𝐠k+1𝐡k+1]=[𝐠k𝐡k]ατ[𝐠f(𝐔k,𝐠k,𝐡k)𝐡f(𝐔k,𝐠k,𝐡k)]=[𝐠k𝐡k]ατ[𝐫k𝐠k𝐫k𝐡k],\displaystyle\begin{split}\mathbf{U}_{k+1}\;&=\;\mathbf{U}_{k}\;-\;\tau\cdot\nabla_{\mathbf{U}}f(\mathbf{U}_{k},\mathbf{g}_{k},\mathbf{h}_{k})\;=\;\mathbf{U}_{k}\;-\;\tau\cdot\mathcal{A}^{*}\left(\mathbf{r}_{k}\right)\mathbf{U}_{k},\\ \begin{bmatrix}\mathbf{g}_{k+1}\\ \mathbf{h}_{k+1}\end{bmatrix}\;&=\;\begin{bmatrix}\mathbf{g}_{k}\\ \mathbf{h}_{k}\end{bmatrix}\;-\;\alpha\cdot\tau\cdot\begin{bmatrix}\nabla_{\mathbf{g}}f(\mathbf{U}_{k},\mathbf{g}_{k},\mathbf{h}_{k})\\ \nabla_{\mathbf{h}}f(\mathbf{U}_{k},\mathbf{g}_{k},\mathbf{h}_{k})\end{bmatrix}\;=\;\begin{bmatrix}\mathbf{g}_{k}\\ \mathbf{h}_{k}\end{bmatrix}\;-\;\alpha\cdot\tau\cdot\begin{bmatrix}\mathbf{r}_{k}\circ\mathbf{g}_{k}\\ -\mathbf{r}_{k}\circ\mathbf{h}_{k}\end{bmatrix},\end{split} (6)

with properly chosen learning rates (\tau,\;\alpha\cdot\tau) enforces an implicit bias on the solution path, so that it automatically identifies the desired, regularized solution (\mathbf{X}_{\star},\mathbf{s}_{\star}). Here, in (6), \mathcal{A}^{*}(\cdot):\mathbb{R}^{m}\mapsto\mathbb{R}^{n\times n} is the conjugate (adjoint) operator of \mathcal{A}(\cdot) and \mathbf{r}_{k}=\mathcal{A}(\mathbf{U}_{k}\mathbf{U}_{k}^{\top})+\mathbf{g}_{k}\circ\mathbf{g}_{k}-\mathbf{h}_{k}\circ\mathbf{h}_{k}-\mathbf{y}.
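As a concrete illustration, the following minimal NumPy sketch implements the updates (6) specialized to the robust PCA setting of Section 4.1, where \mathcal{A} (and hence \mathcal{A}^{*}) is the identity. Initialization and step sizes follow Section 4.1, but this is a sketch under those assumptions, not the exact experimental code.

```python
import numpy as np

def robust_pca_double_overparam(Y, alpha, tau=1e-4, n_iters=20000,
                                init_scale=1e-4, seed=0):
    """Gradient updates (6) for robust PCA: A is the identity, so A*(r_k) = R_k."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    U = init_scale * rng.standard_normal((n, n))  # X = U U^T, over-parameterized with r' = n
    G = init_scale * rng.standard_normal((n, n))  # sparse part S = G∘G - H∘H
    H = G.copy()                                  # H = G so that S starts at exactly zero
    for _ in range(n_iters):
        R = U @ U.T + G * G - H * H - Y           # residual r_k
        U = U - tau * R @ U                       # U-step with learning rate tau
        G = G - alpha * tau * R * G               # (g, h)-steps with learning rate alpha * tau
        H = H + alpha * tau * R * H
    return U @ U.T, G * G - H * H                 # estimates of (X_star, S_star)

# e.g., X_hat, S_hat = robust_pca_double_overparam(Y, alpha=np.sqrt(Y.shape[0]))
```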

It should be noted that the scalar \alpha>0, which controls the ratio of the learning rates for \mathbf{U} and (\mathbf{g},\mathbf{h}), plays a crucial role in the quality of the solution that the iterates in (6) converge to (see Figure 2(a)). We will discuss this in more detail in the next subsection (Section 2.3).

Convergence to low-rank & sparse solutions. Based on our discussion in Section 2.1, it is expected that the gradient descent (6) converges to a solution (𝐔,𝐠,𝐡)(\mathbf{U},\mathbf{g},\mathbf{h}) such that

𝐗=𝐔𝐔and𝐬=𝐠𝐠𝐡𝐡\displaystyle\mathbf{X}\;=\;\mathbf{U}\mathbf{U}^{\top}\qquad\text{and}\qquad\mathbf{s}\;=\;\mathbf{g}\circ\mathbf{g}\;-\;\mathbf{h}\circ\mathbf{h} (7)

have minimum nuclear norm and \ell_1 norm, respectively. More specifically, we expect that the solution (\mathbf{X},\mathbf{s}) in (7) of the nonlinear least squares (4) also serves as a global solution to the convex problem

min𝐗n×n,𝐬m𝐗+λ𝐬1,s.t.𝒜(𝐗)+𝐬=𝐲,𝐗𝟎,\displaystyle\min_{\mathbf{X}\in\mathbb{R}^{n\times n},\;\mathbf{s}\in\mathbb{R}^{m}}\;\left\|\mathbf{X}\right\|_{*}\;+\;\lambda\cdot\left\|\mathbf{s}\right\|_{1},\quad\operatorname*{s.t.\ }\;\;\mathcal{A}(\mathbf{X})+\mathbf{s}\;=\;\mathbf{y},\;\;\mathbf{X}\succeq\mathbf{0}, (8)

for which we give a more rigorous statement in Section 3 under a particular setting. However, it should be noted that obtaining the global solution of (8) does not necessarily mean that we find the desired solution (\mathbf{X}_{\star},\mathbf{s}_{\star}): the penalty \lambda>0 in (8), which balances the low-rank and sparse regularizations, plays a crucial role in the quality of the solution to (8). For instance, when \mathcal{A}(\cdot) is the identity operator, under proper conditions on (\mathbf{X}_{\star},\mathbf{s}_{\star}), the work [13] showed that the global solution of (8) is exactly the target solution (\mathbf{X}_{\star},\mathbf{s}_{\star}) only when \lambda=1/\sqrt{n}. Therefore, choosing the right \lambda is critical for successful recovery of (\mathbf{X}_{\star},\mathbf{s}_{\star}).
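For reference, the convex program (8) in the robust PCA setting (\mathcal{A} the identity) can be prototyped with a generic convex modeling tool. The sketch below uses CVXPY with the SCS solver, which is an illustrative choice on our part and not the APG solver of [54].

```python
import numpy as np
import cvxpy as cp

def convex_robust_pca(Y, lam=None):
    """Solve (8) with A = identity: min ||X||_* + lam*||S||_1  s.t.  X + S = Y, X PSD."""
    n = Y.shape[0]
    lam = 1.0 / np.sqrt(n) if lam is None else lam       # lambda = 1/sqrt(n), as in [13]
    X = cp.Variable((n, n), PSD=True)
    S = cp.Variable((n, n))
    objective = cp.Minimize(cp.norm(X, 'nuc') + lam * cp.sum(cp.abs(S)))
    prob = cp.Problem(objective, [X + S == Y])
    prob.solve(solver=cp.SCS)
    return X.value, S.value
```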

2.3 Implicit Bias with Discrepant Learning Rates

As noted above, a remaining challenge is to control the implicit regularization of gradient descent (6) so that the algorithm converges to the solution of (8) with the desired value \lambda=1/\sqrt{n}. Without explicit regularization terms in our new objective (4), this might at first glance seem impossible. Nonetheless, we show that it can be achieved simply by adapting the ratio \alpha of the learning rates for \mathbf{U} and (\mathbf{g},\mathbf{h}) in the gradient step (6), which is one of the key contributions of this work and may be of broader interest. More specifically, this phenomenon can be captured by the following observation.

Observation.

With small enough learning rate τ\tau and initialization (𝐔0,𝐠0,𝐡0)(\mathbf{U}_{0},\mathbf{g}_{0},\mathbf{h}_{0}) close enough to the origin, gradient descent (6) converges to a solution of (8) with λ=1/α\lambda=1/\alpha.

In comparison to classical optimization theory [48], where learning rates only affect the convergence rate of an algorithm but not the quality of the solution, our observation here is surprisingly different: using discrepant learning rates on different over-parameterized variables actually affects the specific solution that the algorithm converges to. (Our result also differs from [49, 50, 51], which analyze the effect of initial large learning rates.) In the following, we provide some intuition for why this happens and discuss its implications. We leave a more rigorous mathematical treatment to Section 3.

Intuitions. The relation λ=1/α\lambda=1/\alpha implies that a large learning rate for one particular optimization variable in (6) leads to a small penalty on its implicit regularization counterpart in (8). From a high-level perspective, this happens because a larger learning rate allows the optimization variable to move faster away from its initial point (which is close to the origin), resulting in a weaker regularization effect (which penalizes the distance of the variable to the origin) on its solution path.

Implications. The implicit bias of discrepant learning rates provides a new and powerful way of controlling the regularization effects in over-parameterized models. For robust matrix recovery in particular, it reveals an equivalence between our method (4) and the convex method (8), with a one-to-one correspondence between the learning rate ratio \alpha and the regularization parameter \lambda. By picking \alpha=\sqrt{n}, we may directly quote results from [13] and conclude that our method correctly recovers \mathbf{X}_{\star} at information-theoretically optimal sampling complexity and sparsity levels (see Figure 2). Note that this is achieved with no prior knowledge of the rank of \mathbf{X}_{\star} or the sparsity of \mathbf{s}_{\star}. Next, we show that such benefits extend beyond robust low-rank matrix recovery.

2.4 Extension to Robust Recovery of Natural Images

Finally, we address the problem of robust recovery of a natural image \mathbf{X}_{\star}\in\mathbb{R}^{C\times H\times W} from its corrupted observation \mathbf{y}=\mathbf{X}_{\star}+\mathbf{s}_{\star}, where the structure of \mathbf{X}_{\star} cannot be fully captured by a low-rank model. Here, H and W are the height and width of the image, and C denotes the number of channels (C=1 for a grayscale image and C=3 for a color image with RGB channels). Inspired by the work [35] showing the effectiveness of an image prior given by a deep convolutional network \mathbf{X}=\phi(\bm{\theta}) of a particular architecture, where \bm{\theta}\in\mathbb{R}^{c} denotes the network parameters, we use the following formulation for image recovery:

min𝜽c,{𝐠,𝐡}C×H×Wf(𝜽,𝐠,𝐡)=14ϕ(𝜽)+(𝐠𝐠𝐡𝐡)𝐲F2.\displaystyle\min_{\bm{\theta}\in\mathbb{R}^{c},\;\{\mathbf{g},\mathbf{h}\}\subseteq\mathbb{R}^{C\times H\times W}}\;f(\bm{\theta},\mathbf{g},\mathbf{h})\;=\;\frac{1}{4}\left\|\phi(\bm{\theta})+\left(\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h}\right)-\mathbf{y}\right\|_{F}^{2}. (9)

As we will empirically demonstrate in Section 4.2, gradient descent on (9)

𝜽k+1=𝜽kτ𝜽f(𝜽k,𝐠k,𝐡k),[𝐠k+1𝐡k+1]=[𝐠k𝐡k]ατ[𝐠f(𝜽k,𝐠k,𝐡k)𝐡f(𝜽k,𝐠k,𝐡k)],\displaystyle\begin{split}\bm{\theta}_{k+1}\;&=\;\bm{\theta}_{k}\;-\;\tau\cdot\nabla_{\bm{\theta}}f(\bm{\theta}_{k},\mathbf{g}_{k},\mathbf{h}_{k}),\\ \begin{bmatrix}\mathbf{g}_{k+1}\\ \mathbf{h}_{k+1}\end{bmatrix}\;&=\;\begin{bmatrix}\mathbf{g}_{k}\\ \mathbf{h}_{k}\end{bmatrix}\;-\;\alpha\cdot\tau\cdot\begin{bmatrix}\nabla_{\mathbf{g}}f(\bm{\theta}_{k},\mathbf{g}_{k},\mathbf{h}_{k})\\ \nabla_{\mathbf{h}}f(\bm{\theta}_{k},\mathbf{g}_{k},\mathbf{h}_{k})\end{bmatrix},\end{split} (10)

with a balanced learning rate ratio \alpha also enforces implicit regularization on the solution path toward the desired solution. It should be noted that this happens even though the over-parameterization \mathbf{X}=\phi(\bm{\theta}) is a highly nonlinear network (in contrast to the shallow linear network \mathbf{X}=\mathbf{U}\mathbf{U}^{\top} [6] or deep linear networks [8]), which raises several intriguing theoretical questions for further investigation.
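In practice, the discrepant learning rates in (10) map directly onto optimizer parameter groups. Below is a minimal PyTorch sketch; the tiny convolutional network is only a stand-in for the U-shaped DIP architecture of [35], and the fixed random input, image size, step sizes, and iteration count are illustrative assumptions (Section 4.2 uses \tau=1 and \alpha=500 with the full DIP network).

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the U-shaped network phi(theta); not the architecture of [35].
phi = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.LeakyReLU(), nn.BatchNorm2d(64),
    nn.Conv2d(64, 3, 3, padding=1),
)

y = torch.rand(1, 3, 256, 256)                    # corrupted observation (placeholder data)
z = torch.randn(1, 3, 256, 256)                   # fixed random input to the network, as in DIP
g = torch.full_like(y, 1e-5).requires_grad_()     # over-parameterized corruption: s = g∘g - h∘h
h = torch.full_like(y, 1e-5).requires_grad_()

tau, alpha = 1e-4, 500.0                          # illustrative step sizes for this toy setup
opt = torch.optim.SGD([
    {"params": phi.parameters(), "lr": tau},      # learning rate tau for theta
    {"params": [g, h], "lr": alpha * tau},        # learning rate alpha * tau for (g, h)
])

for _ in range(1000):
    opt.zero_grad()
    loss = 0.25 * (phi(z) + g * g - h * h - y).pow(2).sum()  # objective (9)
    loss.backward()
    opt.step()
```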

3 Convergence Analysis of Gradient Flow Dynamics

We provide a dynamical analysis certifying our observation in Section 2.3. Similar to [6, 8], we consider a special case where the measurement matrices \{\mathbf{A}_{i}\}_{i=1}^{m} associated with \mathcal{A} commute (any \mathcal{A}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m} can be written as \mathcal{A}(\mathbf{Z})=[\langle\mathbf{A}_{1},\mathbf{Z}\rangle,\ldots,\langle\mathbf{A}_{m},\mathbf{Z}\rangle] for some \{\mathbf{A}_{i}\in\mathbb{R}^{n\times n}\}_{i=1}^{m}), and investigate the trajectories of the discrete gradient iterates \mathbf{U}_{k},\mathbf{g}_{k},\mathbf{h}_{k} in (6) with infinitesimal learning rate \tau (i.e., \tau\rightarrow 0). In other words, we study their continuous-dynamics counterparts \mathbf{U}_{t}(\gamma), \mathbf{g}_{t}(\gamma), and \mathbf{h}_{t}(\gamma), which are parameterized by t\in[0,+\infty) and initialized at t=0 with

\mathbf{U}_{0}(\gamma)\;=\;\gamma\mathbf{I},\quad\mathbf{g}_{0}(\gamma)\;=\;\gamma\mathbf{1},\quad\mathbf{h}_{0}(\gamma)\;=\;\gamma\mathbf{1}, (11)

where γ>0\gamma>0. Thus, analogous to (6), the behaviors of the continuous gradient flows can be captured via the following differential equations

𝐔˙t(γ)\displaystyle\dot{\mathbf{U}}_{t}(\gamma)\; =limτ0(𝐔t+τ(γ)𝐔t(γ))/τ=𝒜(𝐫t(γ))𝐔t(γ),\displaystyle=\;\lim_{\tau\rightarrow 0}\big{(}\mathbf{U}_{t+\tau}(\gamma)-\mathbf{U}_{t}(\gamma)\big{)}/{\tau}\;=\;-\mathcal{A}^{*}\left(\mathbf{r}_{t}(\gamma)\right)\mathbf{U}_{t}(\gamma), (12)
[𝐠˙t(γ)𝐡˙t(γ)]\displaystyle\begin{bmatrix}\dot{\mathbf{g}}_{t}(\gamma)\\ \dot{\mathbf{h}}_{t}(\gamma)\end{bmatrix}\; =limτ0([𝐠t+τ(γ)𝐡t+τ(γ)][𝐠t(γ)𝐡t(γ)])/τ=α[𝐫t(γ)𝐠t(γ)𝐫t(γ)𝐡t(γ),],\displaystyle=\;\lim_{\tau\rightarrow 0}\left(\begin{bmatrix}\mathbf{g}_{t+\tau}(\gamma)\\ \mathbf{h}_{t+\tau}(\gamma)\end{bmatrix}-\begin{bmatrix}\mathbf{g}_{t}(\gamma)\\ \mathbf{h}_{t}(\gamma)\end{bmatrix}\right)\;/\;{\tau}\;=\;-\;\alpha\cdot\begin{bmatrix}\mathbf{r}_{t}(\gamma)\circ\mathbf{g}_{t}(\gamma)\\ -\mathbf{r}_{t}(\gamma)\circ\mathbf{h}_{t}(\gamma),\end{bmatrix}, (13)

with 𝐫t(γ)=𝒜(𝐔t(γ)𝐔t(γ))+𝐠t(γ)𝐠t(γ)𝐡t(γ)𝐡t(γ)𝐲\mathbf{r}_{t}(\gamma)\;=\;\mathcal{A}(\mathbf{U}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma))+\mathbf{g}_{t}(\gamma)\circ\mathbf{g}_{t}(\gamma)-\mathbf{h}_{t}(\gamma)\circ\mathbf{h}_{t}(\gamma)-\mathbf{y}. Let 𝐗t(γ)=𝐔t(γ)𝐔t(γ)\mathbf{X}_{t}(\gamma)=\mathbf{U}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma), then we can derive the gradient flow for 𝐗t(γ)\mathbf{X}_{t}(\gamma) via chain rule

𝐗˙t(γ)=𝐔˙t(γ)𝐔t(γ)+𝐔t(γ)𝐔˙t(γ)=𝒜(𝐫t(γ))𝐗t(γ)𝐗t(γ)𝒜(𝐫t(γ)).\displaystyle\dot{\mathbf{X}}_{t}(\gamma)\;=\;\dot{\mathbf{U}}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma)\;+\;\mathbf{U}_{t}(\gamma)\dot{\mathbf{U}}_{t}^{\top}(\gamma)\;=\;-\mathcal{A}^{*}(\mathbf{r}_{t}(\gamma))\mathbf{X}_{t}(\gamma)-\mathbf{X}_{t}(\gamma)\mathcal{A}^{*}(\mathbf{r}_{t}(\gamma)). (14)

For any γ>0\gamma>0, suppose the limits of 𝐗t(γ)\mathbf{X}_{t}(\gamma), 𝐠t(γ)\mathbf{g}_{t}(\gamma), and 𝐡t(γ)\mathbf{h}_{t}(\gamma) as t+t\rightarrow+\infty exist and denote by

𝐗(γ):=limt+𝐗t(γ),𝐠(γ):=limt+𝐠t(γ),𝐡(γ):=limt+𝐡t(γ).\displaystyle\mathbf{X}_{\infty}\left(\gamma\right)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{X}_{t}(\gamma),\quad\mathbf{g}_{\infty}(\gamma)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{g}_{t}(\gamma),\quad\mathbf{h}_{\infty}(\gamma)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{h}_{t}(\gamma). (15)

Then when the initialization is infinitesimally small with γ0\gamma\rightarrow 0, we show that the following holds.

Theorem 1.

Assume that the measurement matrices \{\mathbf{A}_{i}\}_{i=1}^{m} commute, i.e., \mathbf{A}_{i}\mathbf{A}_{j}=\mathbf{A}_{j}\mathbf{A}_{i} for all 1\leq i\not=j\leq m, and that the gradient flows of \mathbf{U}_{t}(\gamma), \mathbf{g}_{t}(\gamma), and \mathbf{h}_{t}(\gamma) satisfy (12) and (13) and are initialized by (11). Let \mathbf{X}_{\infty}(\gamma), \mathbf{g}_{\infty}(\gamma), and \mathbf{h}_{\infty}(\gamma) be the limit points as defined in (15). Suppose that our initialization is infinitesimally small such that

𝐗^:=limγ0𝐗(γ),𝐠^:=limγ0𝐠(γ),𝐡^:=limγ0𝐡(γ)\displaystyle\widehat{\mathbf{X}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{X}_{\infty}(\gamma),\quad\widehat{\mathbf{g}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma),\quad\widehat{\mathbf{h}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{h}_{\infty}(\gamma)

exist and (𝐗^,𝐠^,𝐡^)(\widehat{\mathbf{X}},\widehat{\mathbf{g}},\widehat{\mathbf{h}}) is a global optimal solution to (4) with

𝒜(𝐗^)+𝐬^=𝐲and𝐬^=𝐠^𝐠^𝐡^𝐡^.\displaystyle\mathcal{A}(\widehat{\mathbf{X}})+\widehat{\mathbf{s}}\;=\;\mathbf{y}\quad\text{and}\quad\widehat{\mathbf{s}}\;=\;\widehat{\mathbf{g}}\circ\widehat{\mathbf{g}}-\widehat{\mathbf{h}}\circ\widehat{\mathbf{h}}.

Then we have 𝐠^𝐡^=𝟎\widehat{\mathbf{g}}\circ\widehat{\mathbf{h}}=\mathbf{0}, and (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) is also a global optimal solution to (8), with λ=α1\lambda=\alpha^{-1}.

A detailed proof of Theorem 1 is deferred to Appendix B. Our result shows that among the infinitely many global solutions to (4), gradient descent biases towards the one with minimum nuclear norm (for \mathbf{X}) and minimum \ell_1 norm (for \mathbf{s}), with their relative weight controlled by \alpha. In the following, we discuss the rationale behind the assumptions of Theorem 1.

  • Since gradient descent almost surely converges to a local minimum [52] and any local minimum is likely to be a global minimum in low-rank matrix optimization with over-parameterization (i.e., 𝐔n×n\mathbf{U}\in\mathbb{R}^{n\times n}) [53], the assumption that (𝐗^,𝐠^,𝐡^)(\widehat{\mathbf{X}},\widehat{\mathbf{g}},\widehat{\mathbf{h}}) is a global optimal solution is generally satisfied.

  • The commutativity assumption is commonly adopted in the analysis of over-parameterized low-rank matrix recovery and is empirically observed to be non-essential [6, 8]. A recent work [7] provides a more sophisticated analysis of the discrete dynamics under a restricted isometry assumption, where the commutativity assumption is not needed. We believe such analysis can be extended to our double over-parameterization setting as well, but leave it as future work.

4 Experiments

In this section, we provide experimental evidence for the implicit bias of the learning rate discussed in Section 2.3. Through experiments for low-rank matrix recovery, Section 4.1 shows that the learning rate ratio α\alpha affects the solution that gradient descent converges to, and that an optimal α\alpha does not depend on the rank of matrix and sparsity of corruption. Furthermore, Section 4.2 shows the effectiveness of implicit bias of learning rate in alleviating overfitting for robust image recovery, and demonstrates that our method produces better recovery quality when compared to DIP for varying test images and corruption levels, all with a single model and set of learning parameters.

[Figure 2: three panels, (a) Effect of learning rate ratio \alpha; (b) Convex method [54]; (c) Our method.]
Figure 2: Numerical results for robust PCA (with n=50). (a) Relative reconstruction error of the output (\widehat{\mathbf{X}},\widehat{\mathbf{S}}) of our method with varying step size ratio \alpha. Here, we set r=5 and p=30\%. (b, c) Probability of successful recovery over 10 trials with varying r (y-axis) and p (x-axis). Here, we fixed \alpha=\sqrt{n}. White indicates successful recovery in all trials, while black means total failure.

4.1 Robust Recovery of Low-rank Matrices

We conduct experiments on a particular case of the low-rank matrix recovery problem, namely the robust PCA problem, in which the operator \mathcal{A} is the identity. More specifically, the goal is to recover a low-rank matrix \mathbf{X}_{\star} and a sparse matrix \mathbf{S}_{\star} from the mixture \mathbf{Y}=\mathbf{X}_{\star}+\mathbf{S}_{\star}, possibly without prior knowledge of the rank r of \mathbf{X}_{\star} and the percentage of nonzero entries p of \mathbf{S}_{\star}. For any given r, we generate \mathbf{X}_{\star}\in\mathbb{R}^{n\times n} by setting \mathbf{X}_{\star}=\mathbf{U}_{\star}\mathbf{U}_{\star}^{\top}, where \mathbf{U}_{\star} is an n\times r matrix with i.i.d. entries drawn from the standard Gaussian distribution. For any given p, we generate \mathbf{S}_{\star}\in\mathbb{R}^{n\times n} by sampling n^{2}\cdot p locations uniformly at random from the matrix and setting those entries i.i.d. from a zero-mean Gaussian distribution with standard deviation 10. We use n=50.
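A short NumPy sketch of the synthetic data generation just described (the random seed handling is an assumption), which can be paired with the update sketch given after (6):

```python
import numpy as np

def make_robust_pca_instance(n=50, r=5, p=0.30, seed=0):
    """Generate Y = X_star + S_star as described above."""
    rng = np.random.default_rng(seed)
    U_star = rng.standard_normal((n, r))
    X_star = U_star @ U_star.T                                    # rank-r PSD ground truth
    S_star = np.zeros((n, n))
    idx = rng.choice(n * n, size=int(n * n * p), replace=False)   # p fraction of the entries
    S_star.flat[idx] = 10.0 * rng.standard_normal(idx.size)       # corruption with std 10
    return X_star + S_star, X_star, S_star
```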

We apply our double over-parameterization method (4) to the robust PCA problem. Specifically, we initialize \mathbf{U} and \mathbf{g} by drawing i.i.d. entries from a zero-mean Gaussian distribution with standard deviation 10^{-4}, and initialize \mathbf{h} to be the same as \mathbf{g}. The learning rates for \mathbf{U} and for \{\mathbf{g},\mathbf{h}\} are set to \tau and \alpha\cdot\tau, respectively, with \tau=10^{-4} in all experiments.

Effect of learning rate ratio \alpha. We set r=5 and p=30\%, perform 2\times 10^{4} gradient descent steps, and compute the reconstruction error of the output (\widehat{\mathbf{X}},\widehat{\mathbf{S}}) of our algorithm relative to the ground truth (\mathbf{X}_{\star},\mathbf{S}_{\star}). Figure 2(a) illustrates the performance for varying values of \alpha. We observe that \alpha affects the solution that the algorithm converges to, which verifies that the learning rate ratio has an implicit regularization effect. Moreover, the best performance is obtained at \alpha=\sqrt{n}, in accordance with our theoretical result in Theorem 1.

Effect of rank r and outlier ratio p, and phase transition. We now fix \alpha=\sqrt{n} and study the ability of our method to recover matrices of varying rank r under varying percentages of corruption p. For each pair (r,p), we apply our method and declare the recovery successful if the relative reconstruction error \|\widehat{\mathbf{X}}-\mathbf{X}_{\star}\|_{F}/\|\mathbf{X}_{\star}\|_{F} is less than 0.1. Figure 2(c) displays the fraction of successful recoveries in 10 Monte Carlo trials. It shows that a single value of the parameter \alpha leads to correct recovery for a wide range of r and p. Figure 2(b) shows the fraction of successful recoveries for the convex method (8), solved with the Accelerated Proximal Gradient solver [54] (code available at http://people.eecs.berkeley.edu/~yima/matrix-rank/sample_code.html).

[Figure 3 panels. Columns: (m) Input, (n) Groundtruth, (o) DIP, (p) DIP-\ell_1, (q) Ours. PSNR values for the four test images: row 1: (a) DIP 16.46, (b) DIP-\ell_1 29.07, (c) Ours 31.60; row 2: (d) DIP 16.63, (e) DIP-\ell_1 29.13, (f) Ours 30.76; row 3: (g) DIP 18.56, (h) DIP-\ell_1 20.37, (i) Ours 21.38; row 4: (j) DIP 16.57, (k) DIP-\ell_1 31.87, (l) Ours 32.47.]
Figure 3: Robust image recovery under salt-and-pepper corruption. The PSNR of each result is listed with the corresponding panel. For our method, all cases use the same network width, learning rate, and termination condition. For DIP and DIP-\ell_1, case-dependent early stopping is used, which is essential for their good performance. Despite that, our method achieves the highest PSNRs and the best visual quality.

4.2 Robust Recovery of Natural Images

Following [35], we evaluate the performance of our method for robust image recovery on images from a standard dataset (http://www.cs.tut.fi/~foi/GCF-BM3D/index.html#ref_results). Corruption is synthesized by adding salt-and-pepper noise, where a fraction p of randomly chosen pixels is replaced with either 1 or 0 (each with probability 50\%).

The network \phi(\bm{\theta}) in our method (9) is the same as the denoising network in [35], which has a U-shaped architecture with skip connections. Each layer of the network contains a convolutional layer, a LeakyReLU nonlinearity, and a batch normalization layer. We also follow [35] in initializing the network parameters \bm{\theta} with Kaiming initialization. Meanwhile, we initialize \mathbf{g} and \mathbf{h} with i.i.d. entries from a zero-mean Gaussian distribution with standard deviation 10^{-5}. We use learning rate \tau=1 and set \alpha=500 for all experiments below.

No need to tune model width or terminate early. We compare our method with a variant of DIP that we call DIP-\ell_1, which solves \min_{\bm{\theta}}\ell(\phi(\bm{\theta})-\mathbf{y}) with \ell(\cdot) being \left\|\cdot\right\|_{1}. As shown in Figure 1(b), DIP-\ell_1 requires either choosing an appropriate network width or early termination to avoid overfitting. Note that neither can be carried out in practice without the true (clean) image. In contrast, our method does not require tuning the network width or early termination: its performance continues to improve as training proceeds, until it stabilizes.

Handling different images and varying corruption levels. The benefit mentioned above enables our method to handle different images and varying corruption levels without the need to tune the network width, the termination condition, or any other learning parameters. In our experiments, we fix the number of channels in each convolutional layer to 128, run 150,000 gradient descent iterations, and display the output image. The results are shown in Figure 3 (for four different test images) and Figure 4 (for four corruption levels; see Appendix C). For DIP and DIP-\ell_1, we report the result with the highest PSNR over the learning process (i.e., we perform the best possible early termination for evaluation purposes; note that this cannot be done in practice). Despite that, our method obtains better recovery quality in terms of PSNR in all cases. We display the learning curves for these experiments in Figure 5(a) and Figure 5(b) (see Appendix C), which show that our method does not overfit in any case, while DIP-\ell_1 requires case-dependent early termination.

5 Conclusion

In this work, we have shown both theoretically and empirically that the benefits of the implicit bias of gradient descent extend to the over-parameterization of two low-dimensional structures. The key to this success is the choice of discrepant learning rates, which properly regulate the optimization path so that it converges to the desired optimal solution. Such a framework frees us from the need for prior knowledge of the structures and from the scalability limitations of previous approaches, and has led to state-of-the-art recovery results for both low-rank matrices and natural images. We hope this work encourages future investigation of whether the same framework and ideas generalize to mixtures of multiple and broader families of structures.

References

  • [1] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro, “The implicit bias of gradient descent on separable data,” The Journal of Machine Learning Research, vol. 19, no. 1, pp. 2822–2878, 2018.
  • [2] T. Vaskevicius, V. Kanade, and P. Rebeschini, “Implicit regularization for optimal sparse recovery,” in Advances in Neural Information Processing Systems, pp. 2968–2979, 2019.
  • [3] P. Zhao, Y. Yang, and Q.-C. He, “Implicit regularization via hadamard product over-parametrization in high-dimensional linear regression,” arXiv preprint arXiv:1903.09367, 2019.
  • [4] F. Wu and P. Rebeschini, “Hadamard wirtinger flow for sparse phase retrieval,” arXiv preprint arXiv:2006.01065, 2020.
  • [5] S. Oymak and M. Soltanolkotabi, “Overparameterized nonlinear learning: Gradient descent takes the shortest path?,” in International Conference on Machine Learning, pp. 4951–4960, 2019.
  • [6] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro, “Implicit regularization in matrix factorization,” in Advances in Neural Information Processing Systems, pp. 6151–6159, 2017.
  • [7] Y. Li, T. Ma, and H. Zhang, “Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations,” in Conference On Learning Theory, pp. 2–47, 2018.
  • [8] S. Arora, N. Cohen, W. Hu, and Y. Luo, “Implicit regularization in deep matrix factorization,” in Advances in Neural Information Processing Systems, pp. 7411–7422, 2019.
  • [9] M. A. Belabbas, “On implicit regularization: Morse functions and applications to matrix factorization,” arXiv preprint arXiv:2001.04264, 2020.
  • [10] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro, “Implicit bias of gradient descent on linear convolutional networks,” in Advances in Neural Information Processing Systems, pp. 9461–9471, 2018.
  • [11] G. Gidel, F. Bach, and S. Lacoste-Julien, “Implicit regularization of discrete gradient dynamics in linear neural networks,” in Advances in Neural Information Processing Systems, pp. 3196–3206, 2019.
  • [12] E. Alpaydin, Introduction to machine learning. 2020.
  • [13] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?,” Journal of the ACM (JACM), vol. 58, no. 3, pp. 1–37, 2011.
  • [14] D. S. Weller, A. Pnueli, G. Divon, O. Radzyner, Y. C. Eldar, and J. A. Fessler, “Undersampled phase retrieval with outliers,” IEEE transactions on computational imaging, vol. 1, no. 4, pp. 247–258, 2015.
  • [15] Y. Li, Y. Sun, and Y. Chi, “Low-rank positive semidefinite matrix recovery from corrupted rank-one measurements,” IEEE Transactions on Signal Processing, vol. 65, no. 2, pp. 397–408, 2016.
  • [16] J. C. Duchi and F. Ruan, “Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval,” Information and Inference: A Journal of the IMA, vol. 8, no. 3, pp. 471–529, 2019.
  • [17] Y. Chi, Y. M. Lu, and Y. Chen, “Nonconvex optimization meets low-rank matrix factorization: An overview,” IEEE Transactions on Signal Processing, vol. 67, no. 20, pp. 5239–5269, 2019.
  • [18] X. Li, Z. Zhu, A. Man-Cho So, and R. Vidal, “Nonconvex robust low-rank matrix recovery,” SIAM Journal on Optimization, vol. 30, no. 1, pp. 660–686, 2020.
  • [19] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
  • [20] Z. Lin, M. Chen, and Y. Ma, “The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices,” arXiv preprint arXiv:1009.5055, 2010.
  • [21] P. Jain and P. Kar, “Non-convex optimization for machine learning,” Foundations and Trends® in Machine Learning, vol. 10, no. 3-4, pp. 142–336, 2017.
  • [22] S. Burer and R. D. Monteiro, “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” Mathematical Programming, vol. 95, no. 2, pp. 329–357, 2003.
  • [23] S. Bhojanapalli, B. Neyshabur, and N. Srebro, “Global optimality of local search for low rank matrix recovery,” in Advances in Neural Information Processing Systems, pp. 3873–3881, 2016.
  • [24] R. Ge, J. D. Lee, and T. Ma, “Matrix completion has no spurious local minimum,” in Advances in Neural Information Processing Systems, pp. 2973–2981, 2016.
  • [25] Z. Zhu, Q. Li, G. Tang, and M. B. Wakin, “Global optimality in low-rank matrix optimization,” IEEE Transactions on Signal Processing, vol. 66, no. 13, pp. 3614–3628, 2018.
  • [26] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Non-local sparse models for image restoration,” in 2009 IEEE 12th international conference on computer vision, pp. 2272–2279, IEEE, 2009.
  • [27] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-d transform-domain collaborative filtering,” IEEE Transactions on image processing, vol. 16, no. 8, pp. 2080–2095, 2007.
  • [28] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Bm3d image denoising with shape-adaptive principal component analysis,” 2009.
  • [29] S. Gu, L. Zhang, W. Zuo, and X. Feng, “Weighted nuclear norm minimization with application to image denoising,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2862–2869, 2014.
  • [30] J. Xu, L. Zhang, and D. Zhang, “A trilateral weighted sparse coding scheme for real-world image denoising,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 20–36, 2018.
  • [31] B. Lecouat, J. Ponce, and J. Mairal, “Fully trainable and interpretable non-local sparse models for image restoration,” 2020.
  • [32] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [33] Z. Yue, H. Yong, Q. Zhao, D. Meng, and L. Zhang, “Variational denoising network: Toward blind noise modeling and removal,” in Advances in Neural Information Processing Systems, pp. 1688–1699, 2019.
  • [34] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Cycleisp: Real image restoration via improved data synthesis,” arXiv preprint arXiv:2003.07761, 2020.
  • [35] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.
  • [36] R. Heckel and P. Hand, “Deep decoder: Concise image representations from untrained non-convolutional networks,” in International Conference on Learning Representations, 2019.
  • [37] R. Heckel and M. Soltanolkotabi, “Denoising and regularization via exploiting the structural bias of convolutional generators,” arXiv preprint arXiv:1910.14634, 2019.
  • [38] P. J. Rousseeuw and A. M. Leroy, Robust regression and outlier detection, vol. 589. John wiley & sons, 2005.
  • [39] Z. Zhu, Y. Wang, D. Robinson, D. Naiman, R. Vidal, and M. Tsakiris, “Dual principal component pursuit: Improved analysis and efficient algorithms,” in Advances in Neural Information Processing Systems 31, pp. 2171–2181, 2018.
  • [40] Q. Qu, Z. Zhu, X. Li, M. C. Tsakiris, J. Wright, and R. Vidal, “Finding the sparsest vectors in a subspace: Theory, algorithms, and applications,” 2020.
  • [41] H. Jiang, D. P. Robinson, R. Vidal, and C. You, “A nonconvex formulation for low rank subspace clustering: algorithms and convergence analysis,” Computational Optimization and Applications, vol. 70, no. 2, pp. 395–418, 2018.
  • [42] D. Davis, D. Drusvyatskiy, and C. Paquette, “The nonsmooth landscape of phase retrieval,” arXiv preprint arXiv:1711.03247, 2017.
  • [43] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht, “Low-rank solutions of linear matrix equations via procrustes flow,” in International Conference on Machine Learning, pp. 964–973, 2016.
  • [44] D. Park, A. Kyrillidis, C. Carmanis, and S. Sanghavi, “Non-square matrix sensing without spurious local minima via the burer-monteiro approach,” in Artificial Intelligence and Statistics, pp. 65–74, 2017.
  • [45] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on information theory, vol. 52, no. 2, pp. 489–509, 2006.
  • [46] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE signal processing magazine, vol. 25, no. 2, pp. 21–30, 2008.
  • [47] E. J. Candès, “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus Mathematique, vol. 346, no. 9-10, pp. 589–592, 2008.
  • [48] J. Nocedal and S. Wright, Numerical optimization. Springer Science & Business Media, 2006.
  • [49] Y. Li, C. Wei, and T. Ma, “Towards explaining the regularization effect of initial large learning rate in training neural networks,” in Advances in Neural Information Processing Systems, pp. 11669–11680, 2019.
  • [50] K. You, M. Long, J. Wang, and M. Jordan, “How does learning rate decay help modern neural networks,” arXiv preprint arXiv:1908.01878, 2019.
  • [51] P. Nakkiran, “Learning rate annealing can provably help generalization, even for convex problems,” arXiv preprint arXiv:2005.07360, 2020.
  • [52] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gradient descent only converges to minimizers,” in Conference on learning theory, pp. 1246–1257, 2016.
  • [53] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre, “Low-rank optimization on the cone of positive semidefinite matrices,” SIAM Journal on Optimization, vol. 20, no. 5, pp. 2327–2351, 2010.
  • [54] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma, “Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix,” Coordinated Science Laboratory Report no. UILU-ENG-09-2214, DC-246, 2009.
  • [55] S. Oymak and M. Soltanolkotabi, “Overparameterized nonlinear learning: Gradient descent takes the shortest path?,” arXiv preprint arXiv:1812.10004, 2018.

Appendix A Implicit Bias of Discrepant Learning Rates in Linear Regression

In this part of the appendix, we use a classical result for the underdetermined linear regression problem to build intuition for the implicit bias of gradient descent in our formulation of robust learning problems. The high-level message we aim to deliver through this simple example is that

  • Gradient descent implicitly biases towards solutions with minimum \ell_2-norm.

  • Discrepant learning rates lead to solutions with minimum weighted \ell_2-norm.

Underdetermined linear regression.

Given observation 𝐛n1\mathbf{b}\in\mathbb{R}^{n_{1}} and wide data matrix 𝐖n1×n2\mathbf{W}\in\mathbb{R}^{n_{1}\times n_{2}} (n2>n1n_{2}>n_{1}), we want to find 𝜽\bm{\theta} which is a solution to

min𝜽n2φ(𝜽)=12𝐛𝐖𝜽22.\displaystyle\min_{\bm{\theta}\in\mathbb{R}^{n_{2}}}\varphi(\bm{\theta})\;=\;\frac{1}{2}\left\|\mathbf{b}-\mathbf{W}\bm{\theta}\right\|_{2}^{2}. (16)

For n_{2}>n_{1} and full row-rank \mathbf{W}, the underdetermined problem (16) obviously has infinitely many solutions, which form the set

𝒮:={𝜽ln+𝐧𝜽ln=𝐖𝐛,𝐧𝒩(𝐖)},\displaystyle\mathcal{S}\;:=\;\left\{\;\bm{\theta}_{ln}+\mathbf{n}\;\mid\;\bm{\theta}_{ln}=\mathbf{W}^{\dagger}\mathbf{b},\quad\mathbf{n}\in\mathcal{N}(\mathbf{W})\;\right\},

where 𝐖:=𝐖(𝐖𝐖)1\mathbf{W}^{\dagger}:=\mathbf{W}^{\top}\left(\mathbf{W}\mathbf{W}^{\top}\right)^{-1} denotes the pseudo-inverse of 𝐖\mathbf{W}, and

𝒩(𝐖):={𝐧𝐖𝐧=𝟎},(𝐖):={𝐳𝐳=𝐖𝐯}\displaystyle\mathcal{N}(\mathbf{W}):=\left\{\mathbf{n}\mid\mathbf{W}\mathbf{n}=\mathbf{0}\right\},\quad\mathcal{R}(\mathbf{W}):=\left\{\mathbf{z}\mid\mathbf{z}=\mathbf{W}^{\top}\mathbf{v}\right\}

are the null space and row space of \mathbf{W}, respectively. Simple derivation shows that \bm{\theta}_{ln} is the least \ell_2-norm solution to (16), i.e., the minimizer of

min𝜽n212𝜽22,s.t.𝐖𝜽=𝐛.\displaystyle\min_{\bm{\theta}\in\mathbb{R}^{n_{2}}}\;\frac{1}{2}\left\|\bm{\theta}\right\|_{2}^{2},\quad\text{s.t.}\quad\mathbf{W}\bm{\theta}\;=\;\mathbf{b}.

Gradient descent biases towards 𝜽ln\bm{\theta}_{ln}.

Starting from any initialization 𝜽0\bm{\theta}_{0}, gradient descent

𝜽k+1=𝜽kτls𝐖(𝐖𝜽k𝐛)\displaystyle\bm{\theta}_{k+1}\;=\;\bm{\theta}_{k}-\tau_{ls}\cdot\mathbf{W}^{\top}\left(\mathbf{W}\bm{\theta}_{k}-\mathbf{b}\right) (17)

with a sufficiently small learning rate \tau_{ls} always finds one of the global solutions of (16). (This is because \mathbf{W}\bm{\theta}_{k+1}-\mathbf{b}=\left(\mathbf{I}-\tau_{ls}\mathbf{W}\mathbf{W}^{\top}\right)\left(\mathbf{W}\bm{\theta}_{k}-\mathbf{b}\right); if we choose \tau_{ls}<\left\|\mathbf{W}\mathbf{W}^{\top}\right\|^{-1}, then \left\|\mathbf{W}\bm{\theta}_{k}-\mathbf{b}\right\| converges to 0 geometrically.) Furthermore, it is now well understood [55] that whenever the initialization \bm{\theta}_{0} has zero component in \mathcal{N}(\mathbf{W}) (i.e., \mathcal{P}_{\mathcal{N}(\mathbf{W})}\left(\bm{\theta}_{0}\right)=\mathbf{0}), the limit \bm{\theta}_{\infty} of the iterates (17) is exactly the minimum \ell_2-norm solution \bm{\theta}_{ln}. This happens because, once initialized in \mathcal{R}(\mathbf{W}), gradient descent (17) keeps the iterates within \mathcal{R}(\mathbf{W}), so that

\mathcal{P}_{\mathcal{R}(\mathbf{W})}\left(\bm{\theta}_{\infty}\right)\;=\;\bm{\theta}_{ln},\quad\mathcal{P}_{\mathcal{N}(\mathbf{W})}\left(\bm{\theta}_{\infty}\right)\;=\;\mathcal{P}_{\mathcal{N}(\mathbf{W})}\left(\bm{\theta}_{0}\right)=\mathbf{0}.

As we can see, a particular algorithm enforces specific regularization on the final solution.
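The claim above can be checked numerically with a few lines of NumPy; the problem sizes, step size, and iteration count below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 20, 50                                  # underdetermined: n2 > n1
W = rng.standard_normal((n1, n2))
b = rng.standard_normal(n1)

theta = np.zeros(n2)                             # zero component in the null space of W
tau_ls = 0.9 / np.linalg.norm(W @ W.T, 2)        # step size below ||W W^T||^{-1}
for _ in range(20000):
    theta -= tau_ls * W.T @ (W @ theta - b)      # gradient step (17)

theta_ln = W.T @ np.linalg.solve(W @ W.T, b)     # pseudo-inverse solution W^dagger b
print(np.linalg.norm(theta - theta_ln))          # close to 0: GD finds the least-norm solution
```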

Implicit bias of discrepant learning rates. The gradient update in (17) uses the same learning rate τls\tau_{ls} for all coordinates of 𝜽\bm{\theta}. If we use different learning rates for each coordinate (i.e., 𝚲\mathbf{\Lambda} is a diagonal matrix with positive diagonals)

𝜽k+1=𝜽kτls𝚲𝐖(𝐖𝜽k𝐛),\displaystyle\bm{\theta}_{k+1}\;=\;\bm{\theta}_{k}-\tau_{ls}\cdot\mathbf{\Lambda}\cdot\mathbf{W}^{\top}\left(\mathbf{W}\bm{\theta}_{k}-\mathbf{b}\right), (18)

then, by a similar argument, we conclude that the gradient update in (18) converges to the solution of the weighted problem

min𝜽n212𝚲1/2𝜽22,s.t.𝐖𝜽=𝐛.\displaystyle\min_{\bm{\theta}\in\mathbb{R}^{n_{2}}}\;\frac{1}{2}\left\|\mathbf{\Lambda}^{-1/2}\bm{\theta}\right\|_{2}^{2},\quad\text{s.t.}\quad\mathbf{W}\bm{\theta}\;=\;\mathbf{b}. (19)
Remark 1.

Let σi\sigma_{i} be the ii-th diagonal of 𝚲\mathbf{\Lambda} and θi\theta_{i} be the ii-th element of 𝜽\bm{\theta}. Then in (18), σiτls\sigma_{i}\tau_{ls} is the learning rate for the variable θi\theta_{i}, which varies for different variables. In words, the relation between (18) and (19) implies that for one particular optimization variable (e.g., θi\theta_{i}) a large learning rate σiτls\sigma_{i}\tau_{ls} in (18) leads to a small implicit regularization effect in (19). From a high-level perspective, this happens because a larger learning rate allows the optimization variable to move faster away from its initial point, resulting in a weaker regularization effect (which penalizes the distance of the variable to the initialization) on its solution path.

An alternative explanation of this is through a change of variable 𝜽=𝚲1/2𝜽~{\bm{\theta}}=\bm{\Lambda}^{1/2}\widetilde{\bm{\theta}}. Suppose we minimize

min𝜽~n2φ~(𝜽~)=12𝐛𝐖𝚲1/2𝜽~22\displaystyle\min_{\widetilde{\bm{\theta}}\in\mathbb{R}^{n_{2}}}\widetilde{\varphi}(\widetilde{\bm{\theta}})\;=\;\frac{1}{2}\left\|\mathbf{b}-\mathbf{W}\bm{\Lambda}^{1/2}\widetilde{\bm{\theta}}\right\|_{2}^{2} (20)

via standard gradient descent with a single learning rate

𝜽~k+1=𝜽~kτls𝚲1/2𝐖(𝐖𝚲1/2𝜽~k𝐛).\displaystyle\widetilde{\bm{\theta}}_{k+1}\;=\;\widetilde{\bm{\theta}}_{k}-\tau_{ls}\cdot\mathbf{\Lambda}^{1/2}\cdot\mathbf{W}^{\top}\left(\mathbf{W}\bm{\Lambda}^{1/2}\widetilde{\bm{\theta}}_{k}-\mathbf{b}\right). (21)

Thus, once initialized in \mathcal{R}(\mathbf{W}\bm{\Lambda}^{1/2}), gradient descent (21) converges to the least \ell_2-norm solution of (20), i.e., the solution of the following problem

min𝜽~n212𝜽~22,s.t.𝐖𝚲1/2𝜽~=𝐛.\displaystyle\min_{\widetilde{\bm{\theta}}\in\mathbb{R}^{n_{2}}}\;\frac{1}{2}\left\|\widetilde{\bm{\theta}}\right\|_{2}^{2},\quad\text{s.t.}\quad\mathbf{W}\bm{\Lambda}^{1/2}\widetilde{\bm{\theta}}\;=\;\mathbf{b}. (22)

Finally, plugging \widetilde{\bm{\theta}}=\bm{\Lambda}^{-1/2}{\bm{\theta}} into (21) and (22) gives (18) and (19), respectively, indicating that the gradient update (18) induces an implicit weighted regularization towards the solution of (19).
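The same numerical check extends to the discrepant-learning-rate update (18): with a diagonal \bm{\Lambda}, gradient descent from zero initialization matches the closed-form solution \bm{\Lambda}\mathbf{W}^{\top}(\mathbf{W}\bm{\Lambda}\mathbf{W}^{\top})^{-1}\mathbf{b} of the weighted problem (19). The sizes, weights, and step size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 20, 50
W = rng.standard_normal((n1, n2))
b = rng.standard_normal(n1)
lam_diag = rng.uniform(0.5, 5.0, size=n2)        # per-coordinate learning-rate multipliers
Lam = np.diag(lam_diag)

theta = np.zeros(n2)
tau_ls = 0.9 / (lam_diag.max() * np.linalg.norm(W @ W.T, 2))   # conservative step size
for _ in range(200000):
    theta -= tau_ls * Lam @ (W.T @ (W @ theta - b))            # update (18)

theta_w = Lam @ W.T @ np.linalg.solve(W @ Lam @ W.T, b)        # closed-form solution of (19)
print(np.linalg.norm(theta - theta_w))                         # close to 0
```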

Appendix B Proof of Theorem 1

In this part of the appendix, we prove our main technical result (i.e., Theorem 1) in Section 3. To make this part self-contained, we restate our result as follows.

Theorem 2.

Assume that the measurement matrices 𝐀1,𝐀2,,𝐀m\mathbf{A}_{1},\mathbf{A}_{2},\ldots,\mathbf{A}_{m} commute, i.e.

𝐀i𝐀j=𝐀j𝐀i, 1ijm,\displaystyle\mathbf{A}_{i}\mathbf{A}_{j}\;=\;\mathbf{A}_{j}\mathbf{A}_{i},\quad\forall\;1\leq i\not=j\leq m,

and the gradient flows of 𝐔t(γ)\mathbf{U}_{t}(\gamma), 𝐠t(γ)\mathbf{g}_{t}(\gamma), and 𝐡t(γ)\mathbf{h}_{t}(\gamma) satisfy

𝐔˙t(γ)\displaystyle\dot{\mathbf{U}}_{t}(\gamma)\; =limτ0𝐔t+τ(γ)𝐔t(γ)τ=𝒜(𝐫t(γ))𝐔t(γ),\displaystyle=\;\lim_{\tau\rightarrow 0}\frac{\mathbf{U}_{t+\tau}(\gamma)-\mathbf{U}_{t}(\gamma)}{\tau}\;=\;-\mathcal{A}^{*}\left(\mathbf{r}_{t}(\gamma)\right)\mathbf{U}_{t}(\gamma), (23)
[𝐠˙t(γ)𝐡˙t(γ)]\displaystyle\begin{bmatrix}\dot{\mathbf{g}}_{t}(\gamma)\\ \dot{\mathbf{h}}_{t}(\gamma)\end{bmatrix}\; =limτ0([𝐠t+τ(γ)𝐡t+τ(γ)][𝐠t(γ)𝐡t(γ)])/τ=α[𝐫t(γ)𝐠t(γ)𝐫t(γ)𝐡t(γ),],\displaystyle=\;\lim_{\tau\rightarrow 0}\left(\begin{bmatrix}\mathbf{g}_{t+\tau}(\gamma)\\ \mathbf{h}_{t+\tau}(\gamma)\end{bmatrix}-\begin{bmatrix}\mathbf{g}_{t}(\gamma)\\ \mathbf{h}_{t}(\gamma)\end{bmatrix}\right)\;/\;{\tau}\;=\;-\;\alpha\cdot\begin{bmatrix}\mathbf{r}_{t}(\gamma)\circ\mathbf{g}_{t}(\gamma)\\ -\mathbf{r}_{t}(\gamma)\circ\mathbf{h}_{t}(\gamma),\end{bmatrix}, (24)

with 𝐫t(γ)=𝒜(𝐔t(γ)𝐔t(γ))+𝐠t(γ)𝐠t(γ)𝐡t(γ)𝐡t(γ)𝐲\mathbf{r}_{t}(\gamma)\;=\;\mathcal{A}(\mathbf{U}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma))+\mathbf{g}_{t}(\gamma)\circ\mathbf{g}_{t}(\gamma)-\mathbf{h}_{t}(\gamma)\circ\mathbf{h}_{t}(\gamma)-\mathbf{y}, and they are initialized by

\displaystyle\mathbf{U}_{0}(\gamma)\;=\;\gamma\mathbf{I},\quad\mathbf{g}_{0}(\gamma)\;=\;\gamma\mathbf{1},\quad\mathbf{h}_{0}(\gamma)\;=\;\gamma\mathbf{1}.

Let 𝐗t(γ)=𝐔t(γ)𝐔t(γ)\mathbf{X}_{t}(\gamma)=\mathbf{U}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma), and let 𝐗(γ)\mathbf{X}_{\infty}\left(\gamma\right), 𝐠(γ)\mathbf{g}_{\infty}(\gamma), and 𝐡(γ)\mathbf{h}_{\infty}(\gamma) be the limit points defined as

𝐗(γ):=limt+𝐗t(γ),𝐠(γ):=limt+𝐠t(γ),𝐡(γ):=limt+𝐡t(γ).\displaystyle\mathbf{X}_{\infty}\left(\gamma\right)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{X}_{t}(\gamma),\quad\mathbf{g}_{\infty}(\gamma)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{g}_{t}(\gamma),\quad\mathbf{h}_{\infty}(\gamma)\;:=\;\lim_{t\rightarrow+\infty}\mathbf{h}_{t}(\gamma). (25)

Suppose that our initialization is infinitesimally small such that

𝐗^:=limγ0𝐗(γ),𝐠^:=limγ0𝐠(γ),𝐡^:=limγ0𝐡(γ)\displaystyle\widehat{\mathbf{X}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{X}_{\infty}(\gamma),\quad\widehat{\mathbf{g}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma),\quad\widehat{\mathbf{h}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{h}_{\infty}(\gamma)

exist and (𝐗^,𝐠^,𝐡^)(\widehat{\mathbf{X}},\widehat{\mathbf{g}},\widehat{\mathbf{h}}) is a global optimal solution to

min𝐔n×r,{𝐠,𝐡}mf(𝐔,𝐠,𝐡):=14𝒜(𝐔𝐔)+(𝐠𝐠𝐡𝐡)𝐲22,\displaystyle\min_{\mathbf{U}\in\mathbb{R}^{n\times r^{\prime}},\{\mathbf{g},\mathbf{h}\}\subseteq\mathbb{R}^{m}}\;f(\mathbf{U},\mathbf{g},\mathbf{h})\;:=\;\frac{1}{4}\left\|\mathcal{A}\left(\mathbf{U}\mathbf{U}^{\top}\right)+\left(\mathbf{g}\circ\mathbf{g}-\mathbf{h}\circ\mathbf{h}\right)-\mathbf{y}\right\|_{2}^{2}, (26)

with 𝒜(𝐗^)+𝐬^=𝐲\mathcal{A}(\widehat{\mathbf{X}})+\widehat{\mathbf{s}}\;=\;\mathbf{y} and 𝐬^=𝐠^𝐠^𝐡^𝐡^\widehat{\mathbf{s}}\;=\;\widehat{\mathbf{g}}\circ\widehat{\mathbf{g}}-\widehat{\mathbf{h}}\circ\widehat{\mathbf{h}}. Then we have 𝐠^𝐡^=𝟎\widehat{\mathbf{g}}\circ\widehat{\mathbf{h}}=\mathbf{0}, and (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) is also a global optimal solution to

min𝐗n×n,𝐬m𝐗+λ𝐬1,s.t.𝒜(𝐗)+𝐬=𝐲,𝐗𝟎.\displaystyle\min_{\mathbf{X}\in\mathbb{R}^{n\times n},\mathbf{s}\in\mathbb{R}^{m}}\;\left\|\mathbf{X}\right\|_{*}\;+\;\lambda\cdot\left\|\mathbf{s}\right\|_{1},\quad\text{s.t.}\quad\mathcal{A}(\mathbf{X})+\mathbf{s}\;=\;\mathbf{y},\;\;\mathbf{X}\succeq\mathbf{0}. (27)

with λ=α1\lambda=\alpha^{-1} and α>0\alpha>0 being the balancing parameter in (24).

Proof.

From (23), we can derive the gradient flow for 𝐗t(γ)\mathbf{X}_{t}(\gamma) via the product rule

𝐗˙t(γ)=𝐔˙t(γ)𝐔t(γ)+𝐔t(γ)𝐔˙t(γ)=𝒜(𝐫t(γ))𝐗t(γ)𝐗t(γ)𝒜(𝐫t(γ)).\displaystyle\dot{\mathbf{X}}_{t}(\gamma)\;=\;\dot{\mathbf{U}}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma)\;+\;\mathbf{U}_{t}(\gamma)\dot{\mathbf{U}}_{t}^{\top}(\gamma)\;=\;-\mathcal{A}^{*}(\mathbf{r}_{t}(\gamma))\mathbf{X}_{t}(\gamma)-\mathbf{X}_{t}(\gamma)\mathcal{A}^{*}(\mathbf{r}_{t}(\gamma)). (28)

We want to show that when the initialization is infinitesimally small (i.e., γ0\gamma\rightarrow 0), the limit points as t+t\rightarrow+\infty of the gradient flows 𝐗t(γ)=𝐔t(γ)𝐔t(γ)\mathbf{X}_{t}(\gamma)=\mathbf{U}_{t}(\gamma)\mathbf{U}_{t}^{\top}(\gamma) and 𝐬t(γ)=𝐠t(γ)𝐠t(γ)𝐡t(γ)𝐡t(γ)\mathbf{s}_{t}(\gamma)=\mathbf{g}_{t}(\gamma)\circ\mathbf{g}_{t}(\gamma)-\mathbf{h}_{t}(\gamma)\circ\mathbf{h}_{t}(\gamma) form a global optimal solution to (27). Towards this goal, let us first look at the optimality condition for (27). By Lemma 1, the pair (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) with

𝐗^=limγ0𝐗(γ),𝐬^=𝐠^𝐠^𝐡^𝐡^with𝐠^:=limγ0𝐠(γ),𝐡^:=limγ0𝐡(γ)\displaystyle\widehat{\mathbf{X}}\;=\;\lim_{\gamma\rightarrow 0}\mathbf{X}_{\infty}(\gamma),\quad\widehat{\mathbf{s}}\;=\;\widehat{\mathbf{g}}\circ\widehat{\mathbf{g}}-\widehat{\mathbf{h}}\circ\widehat{\mathbf{h}}\quad\text{with}\quad\widehat{\mathbf{g}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma),\quad\widehat{\mathbf{h}}\;:=\;\lim_{\gamma\rightarrow 0}\mathbf{h}_{\infty}(\gamma)

is a global optimal solution to (27) provided that there exists a dual certificate 𝝂\bm{\nu} such that

𝒜(𝐗^)+𝐬^=𝐲,(𝐈𝒜(𝝂))𝐗^= 0,𝒜(𝝂)𝐈,𝝂λsign(𝐬^),𝐗^𝟎,\displaystyle\mathcal{A}(\widehat{\mathbf{X}})+\widehat{\mathbf{s}}\;=\;\mathbf{y},\quad\left(\mathbf{I}-\mathcal{A}^{*}(\bm{\nu})\right)\cdot\widehat{\mathbf{X}}\;=\;\mathbf{0},\quad\mathcal{A}^{*}(\bm{\nu})\;\preceq\;\mathbf{I},\quad\bm{\nu}\in\lambda\cdot\operatorname{sign}(\widehat{\mathbf{s}}),\quad\widehat{\mathbf{X}}\succeq\mathbf{0},

where sign(𝐬^)\operatorname{sign}(\widehat{\mathbf{s}}), defined in (32), is the entrywise subdifferential of 1\left\|\cdot\right\|_{1}. Thus, it suffices to construct a dual certificate 𝝂\bm{\nu} for which (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) satisfies the conditions above.

Since (𝐗^,𝐠^,𝐡^)(\widehat{\mathbf{X}},\widehat{\mathbf{g}},\widehat{\mathbf{h}}) is a global optimal solution to (26), we automatically have 𝒜(𝐗^)+𝐬^=𝐲\mathcal{A}(\widehat{\mathbf{X}})+\widehat{\mathbf{s}}\;=\;\mathbf{y} and 𝐗^𝟎\widehat{\mathbf{X}}\succeq\mathbf{0}. On the other hand, given that {𝐀i}i=1m\left\{\mathbf{A}_{i}\right\}_{i=1}^{m} commute and (28) and (24) hold for 𝐗t\mathbf{X}_{t}, 𝐠t\mathbf{g}_{t} and 𝐡t\mathbf{h}_{t}, by Lemma 2, we know that

𝐗t(γ)\displaystyle\mathbf{X}_{t}(\gamma)\; =exp(𝒜(𝝃t(γ)))𝐗0(γ)exp(𝒜(𝝃t(γ))),\displaystyle=\;\exp\left(\mathcal{A}^{*}(\bm{\xi}_{t}(\gamma))\right)\cdot\mathbf{X}_{0}(\gamma)\cdot\exp\left(\mathcal{A}^{*}(\bm{\xi}_{t}(\gamma))\right), (29)
𝐠t(γ)\displaystyle\mathbf{g}_{t}(\gamma)\; =𝐠0(γ)exp(α𝝃t(γ)),𝐡t(γ)=𝐡0(γ)exp(α𝝃t(γ)),\displaystyle=\;\mathbf{g}_{0}(\gamma)\circ\exp\left(\alpha\bm{\xi}_{t}(\gamma)\right),\quad\mathbf{h}_{t}(\gamma)\;=\;\mathbf{h}_{0}(\gamma)\circ\exp\left(-\alpha\bm{\xi}_{t}(\gamma)\right), (30)

where 𝝃t(γ)=0t𝐫τ(γ)𝑑τ\bm{\xi}_{t}(\gamma)=-\int_{0}^{t}\mathbf{r}_{\tau}(\gamma)d\tau. Let 𝝃(γ):=limt+𝝃t(γ)\bm{\xi}_{\infty}(\gamma):=\lim_{t\rightarrow+\infty}\bm{\xi}_{t}(\gamma). By Lemma 3 and Lemma 4, we can construct

𝝂(γ)=𝝃(γ)log(1/γ),\displaystyle\bm{\nu}(\gamma)\;=\;\frac{\bm{\xi}_{\infty}(\gamma)}{\log\left(1/\gamma\right)},

such that

limγ0𝒜(𝝂(γ))𝐈,limγ0[𝐈𝒜(𝝂(γ))]𝐗^= 0,\displaystyle\lim_{\gamma\rightarrow 0}\mathcal{A}^{*}\left(\bm{\nu}(\gamma)\right)\;\preceq\;\mathbf{I},\quad\lim_{\gamma\rightarrow 0}\left[\mathbf{I}-\mathcal{A}^{*}\left(\bm{\nu}(\gamma)\right)\right]\cdot\widehat{\mathbf{X}}\;=\;\mathbf{0},

and

\displaystyle\lim_{\gamma\rightarrow 0}\;\bm{\nu}(\gamma)\;\in\;\alpha^{-1}\cdot\operatorname{sign}(\widehat{\mathbf{s}}),\quad\lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma)\circ\mathbf{h}_{\infty}(\gamma)\;=\;\mathbf{0}.

This establishes the existence of a dual certificate (the limit of 𝝂(γ)\bm{\nu}(\gamma) as γ0\gamma\rightarrow 0) for which the optimality conditions hold at (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}). Hence, (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) is also a global optimal solution to (27). ∎
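Before turning to the supporting lemmas, we note that the statement above can be sanity-checked numerically. The following sketch (with illustrative parameters; this is not the code used for the experiments in this paper) discretizes the flows (23)-(24) by forward Euler with commuting symmetric measurements and a small initialization γ, and compares the limit of (\mathbf{X}_{t},\mathbf{s}_{t}) against a solution of the convex program (27) computed with the generic solver CVXPY; under the constraint \mathbf{X}\succeq\mathbf{0}, the nuclear norm in (27) equals the trace.

import numpy as np
import cvxpy as cp

# Sketch (illustrative parameters): forward-Euler discretization of the flows (23)-(24)
# with commuting symmetric measurements A_i = Omega diag(d_i) Omega^T and small init gamma,
# compared against the convex program (27) with lambda = 1/alpha.
rng = np.random.default_rng(0)
n, m, r_true = 8, 40, 2
gamma, alpha, eta, iters = 1e-3, 1.0, 5e-3, 40000

Omega, _ = np.linalg.qr(rng.standard_normal((n, n)))         # shared eigenbasis
D = rng.standard_normal((m, n)) / np.sqrt(n)                 # eigenvalues of each A_i
A_mats = np.stack([Omega @ np.diag(d) @ Omega.T for d in D])
A_op = lambda X: np.einsum('ijk,jk->i', A_mats, X)           # (A(X))_i = <A_i, X>
A_adj = lambda r: np.einsum('i,ijk->jk', r, A_mats)          # A^*(r) = sum_i r_i A_i

B = rng.standard_normal((n, r_true))                         # low-rank PSD ground truth
X_star, s_star = B @ B.T, np.zeros(m)
s_star[rng.choice(m, 5, replace=False)] = 3.0 * rng.standard_normal(5)   # sparse corruption
y = A_op(X_star) + s_star

U, g, h = gamma * np.eye(n), gamma * np.ones(m), gamma * np.ones(m)      # small initialization
for _ in range(iters):
    r = A_op(U @ U.T) + g * g - h * h - y
    U, g, h = U - eta * A_adj(r) @ U, g - eta * alpha * r * g, h + eta * alpha * r * h
X_gd, s_gd = U @ U.T, g * g - h * h

# Convex program (27); for X >= 0 the nuclear norm equals the trace.
X_cvx, s_cvx = cp.Variable((n, n), PSD=True), cp.Variable(m)
constraints = [cp.hstack([cp.trace(A_mats[i] @ X_cvx) for i in range(m)]) + s_cvx == y]
cp.Problem(cp.Minimize(cp.trace(X_cvx) + (1 / alpha) * cp.norm1(s_cvx)), constraints).solve()

objective = lambda X, s: np.trace(X) + (1 / alpha) * np.abs(s).sum()
print('max |g o h| (equals gamma^2 in continuous time):', np.abs(g * h).max())
print('objective value of the gradient-descent limit  :', objective(X_gd, s_gd))
print('objective value of the convex program (27)     :', objective(X_cvx.value, s_cvx.value))

For small γ and step size the two objective values should roughly agree; the match is only approximate since the theorem concerns the exact flow and the double limit t→+∞, γ→0.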

Lemma 1.

A pair (𝐗^,𝐬^)(\widehat{\mathbf{X}},\widehat{\mathbf{s}}) is a global optimal solution to (27) if there exists a dual certificate 𝛎m\bm{\nu}\in\mathbb{R}^{m} such that

𝒜(𝐗^)+𝐬^=𝐲,(𝐈𝒜(𝝂))𝐗^= 0,𝝂λsign(𝐬^),𝐈𝒜(𝝂),𝐗^𝟎,\displaystyle\mathcal{A}(\widehat{\mathbf{X}})+\widehat{\mathbf{s}}\;=\;\mathbf{y},\quad\left(\mathbf{I}-\mathcal{A}^{*}(\bm{\nu})\right)\cdot\widehat{\mathbf{X}}\;=\;\mathbf{0},\quad\bm{\nu}\in\lambda\cdot\operatorname{sign}(\widehat{\mathbf{s}}),\quad\mathbf{I}\succeq\mathcal{A}^{*}(\bm{\nu}),\;\widehat{\mathbf{X}}\succeq\mathbf{0}, (31)

where sign(𝐬)\operatorname{sign}(\mathbf{s}) is the subdifferential of 𝐬1\left\|\mathbf{s}\right\|_{1} with each entry

sign(s):={s/|s|s0,[1,1]s=0.\displaystyle\operatorname{sign}(s)\;:=\;\begin{cases}s/\left|s\right|&s\not=0,\\ [-1,1]&s=0.\end{cases} (32)
Proof.

Since the constraint \mathbf{X}\succeq\mathbf{0} is enforced, the objective \left\|\mathbf{X}\right\|_{*} equals \operatorname{trace}(\mathbf{X}), so the Lagrangian function of the problem can be written as

(𝐗,𝐬,𝝂,𝚪)=trace(𝐗)+λ𝐬1+𝝂(𝐲𝒜(𝐗)𝐬)𝐗,𝚪,\displaystyle\mathcal{L}\left(\mathbf{X},\mathbf{s},\bm{\nu},\mathbf{\Gamma}\right)\;=\;\operatorname{trace}\left(\mathbf{X}\right)+\lambda\left\|\mathbf{s}\right\|_{1}+\bm{\nu}^{\top}\left(\mathbf{y}-\mathcal{A}(\mathbf{X})-\mathbf{s}\right)-\left\langle\mathbf{X},\mathbf{\Gamma}\right\rangle,

with 𝝂m\bm{\nu}\in\mathbb{R}^{m} and 𝚪n×n\mathbf{\Gamma}\in\mathbb{R}^{n\times n} being the dual variables, where 𝚪𝟎\mathbf{\Gamma}\succeq\mathbf{0}. Thus, we can derive the Karush-Kuhn-Tucker (KKT) optimality condition for (27) as

𝟎:\displaystyle\mathbf{0}\in\partial\mathcal{L}\;: 𝐈𝚪𝒜(𝝂)= 0,𝝂λ𝐬1=λsign(𝐬),\displaystyle\quad\mathbf{I}-\mathbf{\Gamma}-\mathcal{A}^{*}(\bm{\nu})\;=\;\mathbf{0},\quad\bm{\nu}\in\lambda\cdot\partial\left\|\mathbf{s}\right\|_{1}=\lambda\cdot\operatorname{sign}\left(\mathbf{s}\right),
feasibility : 𝒜(𝐗)+𝐬=𝐲,𝐗𝟎,𝚪𝟎,\displaystyle\quad\mathcal{A}(\mathbf{X})+\mathbf{s}\;=\;\mathbf{y},\;\mathbf{X}\succeq\mathbf{0},\;\mathbf{\Gamma}\succeq\mathbf{0},
complementary slackness : 𝚪𝐗= 0,\displaystyle\quad\mathbf{\Gamma}\cdot\mathbf{X}\;=\;\mathbf{0},

where ()\partial(\cdot) denotes the subdifferential operator and sign(𝐬)\operatorname{sign}(\mathbf{s}) is the entrywise subdifferential of 𝐬1\left\|\mathbf{s}\right\|_{1} as defined in (32).

Since (27) is convex, the KKT conditions above are sufficient for global optimality. Thus, (𝐗^,𝐬^)\left(\widehat{\mathbf{X}},\widehat{\mathbf{s}}\right) is a global optimal solution to (27) as long as there exists a dual certificate 𝝂\bm{\nu} such that (31) holds, where 𝚪\mathbf{\Gamma} has been eliminated by plugging in 𝚪=𝐈𝒜(𝝂)\mathbf{\Gamma}=\mathbf{I}-\mathcal{A}^{*}(\bm{\nu}). ∎

Lemma 2.

Suppose that {𝐀i}i=1m\left\{\mathbf{A}_{i}\right\}_{i=1}^{m} commute, and that (28) and (24) hold for 𝐗t\mathbf{X}_{t}, 𝐠t\mathbf{g}_{t} and 𝐡t\mathbf{h}_{t}. Then

𝐗t\displaystyle\mathbf{X}_{t}\; =exp(𝒜(𝝃t))𝐗0exp(𝒜(𝝃t))\displaystyle=\;\exp\left(\mathcal{A}^{*}(\bm{\xi}_{t})\right)\cdot\mathbf{X}_{0}\cdot\exp\left(\mathcal{A}^{*}(\bm{\xi}_{t})\right) (33)
𝐠t\displaystyle\mathbf{g}_{t}\; =𝐠0exp(α𝝃t),𝐡t=𝐡0exp(α𝝃t),\displaystyle=\;\mathbf{g}_{0}\circ\exp\left(\alpha\bm{\xi}_{t}\right),\quad\mathbf{h}_{t}\;=\;\mathbf{h}_{0}\circ\exp\left(-\alpha\bm{\xi}_{t}\right), (34)

where 𝛏t=0t𝐫τ𝑑τ\bm{\xi}_{t}=-\int_{0}^{t}\mathbf{r}_{\tau}d\tau.

Proof.

From (24), we know that

d𝐠tdt=α𝐫t𝐠t,\displaystyle\frac{d\mathbf{g}_{t}}{dt}\;=\;-\alpha\mathbf{r}_{t}\circ\mathbf{g}_{t},

where the differentiation d𝐠tdt\frac{d\mathbf{g}_{t}}{dt} is entrywise for 𝐠t\mathbf{g}_{t}. Thus, we have

0td𝐠τ𝐠τ=α0t𝐫τ𝑑τlog𝐠tlog𝐠0=α𝝃t𝐠t=𝐠0exp(α𝝃t),\displaystyle\int_{0}^{t}\frac{d\mathbf{g}_{\tau}}{\mathbf{g}_{\tau}}\;=\;-\alpha\int_{0}^{t}\mathbf{r}_{\tau}d\tau\quad\Longrightarrow\quad\log\mathbf{g}_{t}-\log\mathbf{g}_{0}\;=\;\alpha\bm{\xi}_{t}\quad\Longrightarrow\quad\mathbf{g}_{t}\;=\;\mathbf{g}_{0}\circ\exp\left(\alpha\bm{\xi}_{t}\right),

where all operations are entrywise. The expression 𝐡t=𝐡0exp(α𝝃t)\mathbf{h}_{t}\;=\;\mathbf{h}_{0}\circ\exp\left(-\alpha\bm{\xi}_{t}\right) follows similarly.

Equation (33) follows by an analogous argument from (28) and the fact that {𝐀i}i=1m\left\{\mathbf{A}_{i}\right\}_{i=1}^{m} commute: the matrices 𝒜(𝐫τ)\mathcal{A}^{*}(\mathbf{r}_{\tau}) for different τ\tau then commute with one another, so the matrix flow (28) integrates to the two-sided exponential in (33). ∎
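As a quick numerical illustration of (34) (a self-contained sketch that uses a made-up residual trajectory chosen only for this illustration, not the residual generated by the actual flow), forward-Euler integration of the entrywise ODE closely tracks the closed form \mathbf{g}_{0}\circ\exp(\alpha\bm{\xi}_{t}):

import numpy as np

# Sanity check of the closed form g_t = g_0 o exp(alpha * xi_t), xi_t = -int_0^t r_tau d tau,
# using an arbitrary smooth residual r(t) (assumed only for this illustration).
rng = np.random.default_rng(1)
m, alpha, dt, T = 5, 2.0, 1e-4, 3.0
c, d = rng.standard_normal(m), rng.standard_normal(m)
r = lambda t: c * np.sin(t) + d                   # made-up residual trajectory r_t

g0 = rng.uniform(0.5, 1.5, size=m)
g = g0.copy()
for k in range(int(T / dt)):                      # forward Euler for g_dot = -alpha * r_t o g_t
    g = g - dt * alpha * r(k * dt) * g

xi_T = -(c * (1.0 - np.cos(T)) + d * T)           # xi_T = -int_0^T r_tau d tau in closed form
g_closed = g0 * np.exp(alpha * xi_T)              # equation (34)
print('max relative error:', np.max(np.abs(g - g_closed) / np.abs(g_closed)))

The discrepancy is of order dt, as expected for forward Euler.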

Lemma 3.

Under the settings of Theorem 2 and Lemma 2, for any γ>0\gamma>0 there exists

𝝂(γ)=𝝃(γ)log(1/γ),\displaystyle\bm{\nu}(\gamma)\;=\;\frac{\bm{\xi}_{\infty}(\gamma)}{\log\left(1/\gamma\right)}, (35)

such that

limγ0𝒜(𝝂(γ))𝐈,limγ0[𝐈𝒜(𝝂(γ))]𝐗^= 0,\displaystyle\lim_{\gamma\rightarrow 0}\mathcal{A}^{*}\left(\bm{\nu}(\gamma)\right)\;\preceq\;\mathbf{I},\quad\lim_{\gamma\rightarrow 0}\left[\mathbf{I}-\mathcal{A}^{*}\left(\bm{\nu}(\gamma)\right)\right]\cdot\widehat{\mathbf{X}}\;=\;\mathbf{0},

where \bm{\xi}_{\infty}(\gamma)=\lim_{t\rightarrow+\infty}\bm{\xi}_{t}(\gamma) with \bm{\xi}_{t}(\gamma)=-\int_{0}^{t}\mathbf{r}_{\tau}(\gamma)d\tau.

Proof.

Given 𝐔0=γ𝐈\mathbf{U}_{0}=\gamma\mathbf{I}, we have 𝐗0=𝐔0𝐔0=γ2𝐈\mathbf{X}_{0}=\mathbf{U}_{0}\mathbf{U}_{0}^{\top}=\gamma^{2}\mathbf{I}. By the expression for 𝐗t\mathbf{X}_{t} in (29), we have

𝐗(γ)=γ2exp(2𝒜(𝝃(γ)))\displaystyle\mathbf{X}_{\infty}(\gamma)\;=\;\gamma^{2}\cdot\exp\left(2\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right) (36)

where 𝝃(γ)=limt𝝃t(γ)\bm{\xi}_{\infty}(\gamma)=\lim_{t\rightarrow\infty}\bm{\xi}_{t}(\gamma). Because {𝐀i}i=1m\left\{\mathbf{A}_{i}\right\}_{i=1}^{m} are symmetric and they commute, we know that they are simultaneously diagonalizable by an orthonormal basis 𝛀=[𝝎1,,𝝎n]n×n\mathbf{\Omega}=[\bm{\omega}_{1},\ldots,\bm{\omega}_{n}]\in\mathbb{R}^{n\times n}, i.e.,

\displaystyle\mathbf{\Omega}^{\top}\mathbf{A}_{i}\mathbf{\Omega}=\mathbf{\Lambda}_{i},\quad\mathbf{\Lambda}_{i}\text{ diagonal},\quad\forall\;i=1,2,\ldots,m,

and so is 𝒜(𝐛)\mathcal{A}^{*}(\mathbf{b}) for any 𝐛m\mathbf{b}\in\mathbb{R}^{m}. Therefore, we have

λk(𝐗(γ))=γ2exp(2λk(𝒜(𝝃(γ))))=exp(2λk(𝒜(𝝃(γ)))+2logγ),\displaystyle\lambda_{k}\left(\mathbf{X}_{\infty}(\gamma)\right)\;=\;\gamma^{2}\cdot\exp\left(2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)\right)\;=\;\exp\left(2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)+2\log\gamma\right), (37)

where λk()\lambda_{k}(\cdot) denotes the eigenvalue associated with the kk-th basis vector 𝝎k\bm{\omega}_{k}. Moreover, 𝐗(γ)\mathbf{X}_{\infty}(\gamma) and its limit 𝐗^\widehat{\mathbf{X}} share the same eigen-basis 𝛀\mathbf{\Omega}. Since 𝐗(γ)\mathbf{X}_{\infty}(\gamma) converges to 𝐗^\widehat{\mathbf{X}} when γ0\gamma\rightarrow 0, the eigenvalues converge as well:

λk(𝐗(γ))λk(𝐗^),k= 1,2,,n,\displaystyle\lambda_{k}\left(\mathbf{X}_{\infty}(\gamma)\right)\;\rightarrow\;\lambda_{k}(\widehat{\mathbf{X}}),\quad\forall\;k\;=\;1,2,\ldots,n, (38)

as γ0\gamma\rightarrow 0.

Case 1: λk(𝐗^)>0\lambda_{k}(\widehat{\mathbf{X}})>0.

For any kk such that λk(𝐗^)>0\lambda_{k}(\widehat{\mathbf{X}})>0, from (37) and (38), we have

exp(2λk(𝒜(𝝃(γ)))+2logγ)λk(𝐗^),\displaystyle\exp\left(2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)+2\log\gamma\right)\;\rightarrow\;\lambda_{k}(\widehat{\mathbf{X}}),

so that

2λk(𝒜(𝝃(γ)))+2logγlogλk(𝐗^) 0,\displaystyle 2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)+2\log\gamma-\log\lambda_{k}(\widehat{\mathbf{X}})\;\rightarrow\;0,

which further implies that

λk(𝒜(𝝃(γ)log(1/γ)))1logλk(𝐗^)2log(1/γ) 0.\displaystyle\lambda_{k}\left(\mathcal{A}^{*}\left(\frac{\bm{\xi}_{\infty}(\gamma)}{\log(1/\gamma)}\right)\right)-1-\frac{\log\lambda_{k}(\widehat{\mathbf{X}})}{2\log(1/\gamma)}\;\rightarrow\;0.

Hence, constructing 𝝂(γ)\bm{\nu}(\gamma) as in (35), we conclude that

limγ0λk(𝒜(𝝂(γ)))= 1,\displaystyle\lim_{\gamma\rightarrow 0}\;\lambda_{k}\left(\mathcal{A}^{*}(\bm{\nu}(\gamma))\right)\;=\;1, (39)

for any kk such that λk(𝐗^)>0\lambda_{k}(\widehat{\mathbf{X}})>0.

Case 2: λk(𝐗^)=0\lambda_{k}(\widehat{\mathbf{X}})=0.

On the other hand, for any kk such that λk(𝐗^)=0\lambda_{k}(\widehat{\mathbf{X}})=0, similarly from (37) and (38), we have

exp(2λk(𝒜(𝝃(γ)))+2logγ) 0,\displaystyle\exp\left(2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)+2\log\gamma\right)\;\rightarrow\;0,

when γ0\gamma\rightarrow 0. Thus, for any small ϵ(0,1)\epsilon\in(0,1), there exists some γ0(0,1)\gamma_{0}\in(0,1) such that for all γ<γ0\gamma<\gamma_{0},

exp(2λk(𝒜(𝝃(γ)))+2logγ)<ϵ,\displaystyle\exp\left(2\lambda_{k}\left(\mathcal{A}^{*}\left(\bm{\xi}_{\infty}(\gamma)\right)\right)+2\log\gamma\right)\;<\;\epsilon,

which implies that

λk(𝒜(𝝃(γ)log(1/γ))) 1<logϵ2log(1/γ)< 0.\displaystyle\lambda_{k}\left(\mathcal{A}^{*}\left(\frac{\bm{\xi}_{\infty}(\gamma)}{\log(1/\gamma)}\right)\right)\;-\;1\;<\;\frac{\log\epsilon}{2\log\left(1/\gamma\right)}\;<\;0.

Thus, given the construction of 𝝂(γ)\bm{\nu}(\gamma) in (35), we have

λk(𝒜(𝝂(γ)))< 1,γ<γ0,\displaystyle\lambda_{k}\left(\mathcal{A}^{*}(\bm{\nu}(\gamma))\right)\;<\;1,\quad\forall\;\gamma\;<\;\gamma_{0},

which, taking γ0\gamma\rightarrow 0, further implies that for any kk with λk(𝐗^)=0\lambda_{k}(\widehat{\mathbf{X}})=0, we have

\displaystyle\lim_{\gamma\rightarrow 0}\lambda_{k}\left(\mathcal{A}^{*}(\bm{\nu}(\gamma))\right)\;\leq\;1. (40)

Putting things together.

Combining our analysis in (39) and (40), we obtain

limγ0𝒜(𝝂(γ))𝐈.\displaystyle\lim_{\gamma\rightarrow 0}\mathcal{A}^{*}(\bm{\nu}(\gamma))\;\preceq\;\mathbf{I}.

On the other hand, as shown above, there exists an orthogonal matrix 𝛀n×n\mathbf{\Omega}\in\mathbb{R}^{n\times n} that simultaneously diagonalizes 𝒜(𝝂(γ))\mathcal{A}^{*}(\bm{\nu}(\gamma)) and 𝐗^\widehat{\mathbf{X}}. Thus, we have

[𝐈𝒜(𝝂(γ))]𝐗^=𝛀(𝐈𝚲𝒜(𝝂(γ)))𝚲𝐗^𝛀,\displaystyle\left[\mathbf{I}-\mathcal{A}^{*}(\bm{\nu}(\gamma))\right]\cdot\widehat{\mathbf{X}}\;=\;\mathbf{\Omega}\cdot\left(\mathbf{I}-\mathbf{\Lambda}_{\mathcal{A}^{*}(\bm{\nu}(\gamma))}\right)\cdot\mathbf{\Lambda}_{\widehat{\mathbf{X}}}\cdot\mathbf{\Omega}^{\top},

where 𝚲𝒜(𝝂(γ))\mathbf{\Lambda}_{\mathcal{A}^{*}(\bm{\nu}(\gamma))} and 𝚲𝐗^\mathbf{\Lambda}_{\widehat{\mathbf{X}}} are diagonal matrices, with entries being the eigenvalues of 𝒜(𝝂(γ))\mathcal{A}^{*}(\bm{\nu}(\gamma)) and 𝐗^\widehat{\mathbf{X}}, respectively. From our analysis for Case 1 and Case 2, we know that limγ0(𝐈𝚲𝒜(𝝂(γ)))𝚲𝐗^=𝟎\lim_{\gamma\rightarrow 0}\left(\mathbf{I}-\mathbf{\Lambda}_{\mathcal{A}^{*}(\bm{\nu}(\gamma))}\right)\cdot\mathbf{\Lambda}_{\widehat{\mathbf{X}}}=\mathbf{0}. Therefore, we also have

limγ0[𝐈𝒜(𝝂(γ))]𝐗^= 0,\displaystyle\lim_{\gamma\rightarrow 0}\;\left[\mathbf{I}-\mathcal{A}^{*}(\bm{\nu}(\gamma))\right]\cdot\widehat{\mathbf{X}}\;=\;\mathbf{0},

as desired. ∎
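The simultaneous diagonalization used repeatedly above is easy to check numerically. The sketch below (random commuting matrices of illustrative size) builds \mathbf{A}_{i}=\mathbf{\Omega}\mathbf{\Lambda}_{i}\mathbf{\Omega}^{\top} with a shared orthonormal basis and verifies that \mathcal{A}^{*}(\bm{\nu})=\sum_{i}\nu_{i}\mathbf{A}_{i} is diagonalized by the same basis, with \lambda_{k}(\mathcal{A}^{*}(\bm{\nu}))=\sum_{i}\nu_{i}\lambda_{k}(\mathbf{A}_{i}):

import numpy as np

# Sketch: commuting symmetric matrices A_i share an eigenbasis Omega, so A^*(nu) = sum_i nu_i A_i
# is diagonalized by Omega and its k-th eigenvalue is sum_i nu_i * lambda_k(A_i).
rng = np.random.default_rng(2)
n, m = 6, 10
Omega, _ = np.linalg.qr(rng.standard_normal((n, n)))     # shared orthonormal eigenbasis
D = rng.standard_normal((m, n))                          # row i holds the eigenvalues of A_i
A = np.stack([Omega @ np.diag(d) @ Omega.T for d in D])

commutator = max(np.abs(A[i] @ A[j] - A[j] @ A[i]).max() for i in range(m) for j in range(m))
nu = rng.standard_normal(m)
M = Omega.T @ np.einsum('i,ijk->jk', nu, A) @ Omega      # Omega^T A^*(nu) Omega, should be diagonal
print('max pairwise commutator entry            :', commutator)
print('max off-diagonal of Omega^T A^*(nu) Omega:', np.abs(M - np.diag(np.diag(M))).max())
print('eigenvalues vs. sum_i nu_i lambda_k(A_i) :', np.abs(np.diag(M) - nu @ D).max())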

Lemma 4.

Under the settings of Theorem 2 and Lemma 2, for any γ>0\gamma>0 there exists

𝝂(γ)=𝝃(γ)log(1/γ),\displaystyle\bm{\nu}(\gamma)\;=\;\frac{\bm{\xi}_{\infty}(\gamma)}{\log\left(1/\gamma\right)}, (41)

such that

\displaystyle\lim_{\gamma\rightarrow 0}\;\bm{\nu}(\gamma)\;\in\;\alpha^{-1}\cdot\operatorname{sign}(\widehat{\mathbf{s}}),\quad\lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma)\circ\mathbf{h}_{\infty}(\gamma)\;=\;\mathbf{0}, (42)

where \widehat{\mathbf{s}}=\widehat{\mathbf{g}}\circ\widehat{\mathbf{g}}-\widehat{\mathbf{h}}\circ\widehat{\mathbf{h}}, and \bm{\xi}_{\infty}(\gamma)=\lim_{t\rightarrow+\infty}\bm{\xi}_{t}(\gamma) with \bm{\xi}_{t}(\gamma)=-\int_{0}^{t}\mathbf{r}_{\tau}(\gamma)d\tau.

Proof.

Let gi(γ)g_{\infty}^{i}(\gamma) and hi(γ)h_{\infty}^{i}(\gamma) be the iith coordinate of 𝐠(γ)\mathbf{g}_{\infty}(\gamma) and 𝐡(γ)\mathbf{h}_{\infty}(\gamma) defined in (25), respectively. It follows from (30) that

gi(γ)=γexp(αξi(γ)),hi(γ)=γexp(αξi(γ)),i= 1,2,,m.\displaystyle g_{\infty}^{i}(\gamma)\;=\;\gamma\cdot\exp\left(\alpha\cdot\xi_{\infty}^{i}(\gamma)\right),\quad h_{\infty}^{i}(\gamma)\;=\;\gamma\cdot\exp\left(-\alpha\cdot\xi_{\infty}^{i}(\gamma)\right),\quad\forall\;i\;=\;1,2,\ldots,m.

When γ0\gamma\rightarrow 0, we have

gi(γ)hi(γ)=γ2 0,i= 1,2,,m,\displaystyle g_{\infty}^{i}(\gamma)\cdot h_{\infty}^{i}(\gamma)\;=\;\gamma^{2}\;\rightarrow\;0,\quad\forall\;i\;=\;1,2,\ldots,m,

so that \lim_{\gamma\rightarrow 0}\mathbf{g}_{\infty}(\gamma)\circ\mathbf{h}_{\infty}(\gamma)\;=\;\mathbf{0}. This also implies that, for every i=1,2,\ldots,m, at least one of \widehat{g}_{i} and \widehat{h}_{i} is zero.

On the other hand, let us define

\displaystyle\mathbf{s}_{\infty}(\gamma)\;=\;\mathbf{g}_{\infty}(\gamma)\circ\mathbf{g}_{\infty}(\gamma)\;-\;\mathbf{h}_{\infty}(\gamma)\circ\mathbf{h}_{\infty}(\gamma),

and let si(γ)s_{\infty}^{i}(\gamma) be the iith coordinate of 𝐬(γ)\mathbf{s}_{\infty}(\gamma) with

si(γ)=γ2exp(2αξi(γ))γ2exp(2αξi(γ)).\displaystyle s_{\infty}^{i}(\gamma)\;=\;\gamma^{2}\cdot\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\;-\;\gamma^{2}\cdot\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right). (43)

Correspondingly, we know that \widehat{\mathbf{s}}=\lim_{\gamma\rightarrow 0}\mathbf{s}_{\infty}(\gamma); let \widehat{s}_{i} be the iith coordinate of 𝐬^=𝐠^𝐠^𝐡^𝐡^\widehat{\mathbf{s}}=\widehat{\mathbf{g}}\circ\widehat{\mathbf{g}}-\widehat{\mathbf{h}}\circ\widehat{\mathbf{h}}. In the following, we leverage these facts to show that our construction of 𝝂(γ)\bm{\nu}(\gamma) satisfies (42). We classify the entries s^i\widehat{s}_{i} of 𝐬^\widehat{\mathbf{s}} (i=1,2,,m)(i=1,2,\ldots,m) into three cases and analyze them as follows.

  • Case 1: s^i>0\widehat{s}_{i}>0. Since limγ0si(γ)=s^i>0\lim_{\gamma\rightarrow 0}s_{\infty}^{i}(\gamma)=\widehat{s}_{i}>0, from (43) we must have ξi(γ)+\xi_{\infty}^{i}(\gamma)\rightarrow+\infty when γ0\gamma\rightarrow 0, so that exp(2αξi(γ))+\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow+\infty and exp(2αξi(γ))0\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow 0. Therefore, when γ0\gamma\rightarrow 0, we have

    γ2exp(2αξi(γ))s^i\displaystyle\gamma^{2}\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\;\rightarrow\;\widehat{s}_{i}\quad 2αξi(γ)2log(1/γ)logs^i 0,\displaystyle\Longrightarrow\quad 2\alpha\cdot\xi_{\infty}^{i}(\gamma)-2\log\left(1/\gamma\right)-\log\widehat{s}_{i}\;\rightarrow\;0,
    νi(γ)=ξi(γ)log(1/γ)1α.(givenlog(1/γ)+)\displaystyle\Longrightarrow\quad\nu_{i}(\gamma)\;=\;\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)}\;\rightarrow\;\frac{1}{\alpha}.\;\;(\text{given}\;\;\log\left(1/\gamma\right)\rightarrow+\infty)
  • Case 2: s^i<0\widehat{s}_{i}<0. Since limγ0si(γ)=s^i<0\lim_{\gamma\rightarrow 0}s_{\infty}^{i}(\gamma)=\widehat{s}_{i}<0, from (43) we must have ξi(γ)\xi_{\infty}^{i}(\gamma)\rightarrow-\infty when γ0\gamma\rightarrow 0, so that exp(2αξi(γ))0\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow 0 and exp(2αξi(γ))+\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow+\infty. Therefore, when γ0\gamma\rightarrow 0, we have

    \displaystyle-\gamma^{2}\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\;\rightarrow\;\widehat{s}_{i}\quad\Longrightarrow\quad-2\alpha\cdot\xi_{\infty}^{i}(\gamma)-2\log\left(1/\gamma\right)-\log\left(-\widehat{s}_{i}\right)\;\rightarrow\;0,
    νi(γ)=ξi(γ)log(1/γ)1α.\displaystyle\Longrightarrow\quad\nu_{i}(\gamma)\;=\;\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)}\;\rightarrow\;-\frac{1}{\alpha}.
  • Case 3: s^i=0\widehat{s}_{i}=0. Since limγ0si(γ)=s^i=0\lim_{\gamma\rightarrow 0}s_{\infty}^{i}(\gamma)=\widehat{s}_{i}=0, from (43) we must have γ2exp(2αξi(γ))0\gamma^{2}\cdot\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow 0 and γ2exp(2αξi(γ))0\gamma^{2}\cdot\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\rightarrow 0, when γ0\gamma\rightarrow 0. Therefore, for any small ϵ(0,1)\epsilon\in(0,1), there exists some γ0>0\gamma_{0}>0, such that for all γ(0,γ0)\gamma\in(0,\gamma_{0}), we have

    γ2max{exp(2αξi(γ)),exp(2αξi(γ))}ϵ\displaystyle\gamma^{2}\cdot\max\left\{\exp\left(2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right),\;\exp\left(-2\alpha\cdot\xi_{\infty}^{i}(\gamma)\right)\right\}\;\leq\;\epsilon
    \displaystyle\Longrightarrow\quad 2αmax{ξi(γ)log(1/γ),ξi(γ)log(1/γ)}2<logϵlog(1/γ)< 0,\displaystyle 2\alpha\cdot\max\left\{\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)},-\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)}\right\}-2\;<\;\frac{\log\epsilon}{\log\left(1/\gamma\right)}\;<\;0,

    which further implies that

    |νi(γ)|=max{ξi(γ)log(1/γ),ξi(γ)log(1/γ)}<1α.\displaystyle\left|\nu_{i}(\gamma)\right|\;=\;\max\left\{\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)},-\frac{\xi_{\infty}^{i}(\gamma)}{\log\left(1/\gamma\right)}\right\}\;<\;\frac{1}{\alpha}.

Therefore, combining the results in the three cases above we obtain that

\displaystyle\lim_{\gamma\rightarrow 0}\nu_{i}(\gamma)\;\in\;\frac{1}{\alpha}\operatorname{sign}(\widehat{s}_{i})\;=\;\begin{cases}\frac{\widehat{s}_{i}}{\alpha\left|\widehat{s}_{i}\right|}&\widehat{s}_{i}\not=0,\\ [-1/\alpha,1/\alpha]&\widehat{s}_{i}=0,\end{cases}

so that (42) holds. ∎
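The limits in Cases 1-3 can also be seen with a tiny numeric sketch (a simplification made only for illustration: we fix a target value \widehat{s}_{i} and invert (43), which reads s=2\gamma^{2}\sinh(2\alpha\xi), for \xi_{\infty}^{i}(\gamma), rather than running the actual flow). The quantity \nu_{i}(\gamma)=\xi_{\infty}^{i}(\gamma)/\log(1/\gamma) then approaches \operatorname{sign}(\widehat{s}_{i})/\alpha as \gamma\rightarrow 0:

import numpy as np

# Illustration of Cases 1-3: fix a target s_hat_i, invert (43) for xi, and observe that
# nu_i(gamma) = xi / log(1/gamma) approaches sign(s_hat_i)/alpha as gamma -> 0.
alpha = 2.0
for s_hat in (3.0, -0.7, 0.0):
    for gamma in (1e-2, 1e-4, 1e-8, 1e-16):
        xi = np.arcsinh(s_hat / (2.0 * gamma ** 2)) / (2.0 * alpha)   # from s = 2*gamma^2*sinh(2*alpha*xi)
        nu = xi / np.log(1.0 / gamma)
        print(f's_hat = {s_hat:+.1f}, gamma = {gamma:.0e}, nu_i(gamma) = {nu:+.4f} (1/alpha = {1 / alpha:.2f})')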

Appendix C Extra Experiments

Due to limited space in the main body, here we provide additional results for our experiments on robust image recovery presented in Section 4.2.

Varying corruption levels. In Figure 4, we display results of our method for robust image recovery with varying levels of salt-and-pepper corruption.

Learning curves for different test images and varying corruption levels. In Figure 5(a) and Figure 5(b), we provide learning curves for the results in Figure 3 and Figure 4, respectively. While DIP-1\ell_{1} requires case-dependent early termination to obtain its best performance, our method does not overfit and requires no early termination. Overall, our method achieves both higher PSNR and better visual quality, especially when the corruption level is high.

[Figure 4: image panels omitted. Rows correspond to 10%, 30%, 50%, and 70% salt-and-pepper noise; columns are labeled (m) Input, (n) DIP, (o) DIP-L1, (p) Ours. Overlaid panel PSNRs: (a) 26.19, (b) 36.15, (c) 34.91; (d) 20.15, (e) 33.64, (f) 34.39; (g) 16.46, (h) 29.06, (i) 31.60; (j) 13.91, (k) 24.80, (l) 26.99.]
Figure 4: Robust image recovery with 10%,30%,50%10\%,30\%,50\%, and 70%70\% salt-and-pepper noise. PSNR of the results is overlaid at the bottom of the images. For our method, all cases use the same network width, learning rate, and termination condition. For DIP and DIP-1\ell_{1}, case-dependent early stopping is used which is essential for their good performance. Despite that, our method achieves the highest PSNRs and best visual quality.

[Figure 5: image panels omitted; (a) Different test images, (b) Varying corruption levels.]
Figure 5: Learning curves for robust image recovery with different test images (a) and varying corruption levels (b). DIP-1\ell_{1} requires case-dependent early stopping while our method does not.