Robust Recovery via Implicit Bias of Discrepant Learning Rates for Double Over-parameterization
Abstract
Recent advances have shown that the implicit bias of gradient descent on over-parameterized models enables the recovery of low-rank matrices from linear measurements, even without prior knowledge of the intrinsic rank. In contrast, for robust low-rank matrix recovery from grossly corrupted measurements, over-parameterization leads to overfitting without prior knowledge of both the intrinsic rank and the sparsity of the corruption. This paper shows that with a double over-parameterization for both the low-rank matrix and the sparse corruption, gradient descent with discrepant learning rates provably recovers the underlying matrix even without prior knowledge of either the rank of the matrix or the sparsity of the corruption. We further extend our approach to the robust recovery of natural images by over-parameterizing images with deep convolutional networks. Experiments show that our method handles different test images and varying corruption levels with a single learning pipeline, where the network width and termination conditions do not need to be adjusted on a case-by-case basis. Underlying the success is again the implicit bias of discrepant learning rates on different over-parameterized parameters, which may bear on broader applications.
1 Introduction
Learning over-parameterized models, which have more parameters than the problem’s intrinsic dimension, is becoming a crucial topic in machine learning [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. While classical learning theories suggest that over-parameterized models tend to overfit [12], recent advances have shown that an optimization algorithm may produce an implicit bias that regularizes the solution with desired properties. This line of results has led to new insights and a better understanding of gradient descent for solving several fundamental problems, including logistic regression on linearly separable data [1], compressive sensing [2, 3], sparse phase retrieval [4], nonlinear least-squares [5], low-rank (deep) matrix factorization [6, 7, 8, 9], and deep linear neural networks [10, 11].
Inspired by these recent advances [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], in this work we present a new type of practical method for the robust recovery of structured signals via model over-parameterization. In particular, we aim to learn an unknown signal x⋆ from its grossly corrupted linear measurements

(1)    y = A(x⋆) + s⋆,

where A(·) is a linear measurement operator and s⋆ ∈ ℝ^m is a (sparse) corruption vector. Variants of this problem appear ubiquitously in signal processing and machine learning [13, 14, 15, 16, 17, 18].
Robust recovery of low-rank matrices. Low-rank matrix recovery has broad applications in face recognition [13] (where self-shadowing, specularity, or saturation in brightness can be modeled as outliers), video surveillance [13] (where foreground objects are usually modeled as outliers) and beyond. A classical method for low-rank matrix recovery is nuclear norm minimization, where the nuclear norm, the tightest convex envelope of the matrix rank, is defined as the sum of the singular values. Such a method is provably correct under certain incoherence conditions [13, 19]. However, minimizing the nuclear norm involves expensive computations of the singular value decomposition (SVD) of large matrices [20] (when n is large), which prohibits its application to problem sizes of practical interest.
The computational challenge has been addressed by the recent development of matrix factorization (MF) methods [21, 17]. Such methods parameterize the signal via the factorization X = UU^⊤ [22] and solve the associated nonconvex optimization problem with respect to (w.r.t.) U ∈ ℝ^{n×r}, where r is the rank of X⋆. Since the problem is of much smaller size when r ≪ n, this leads to more scalable methods that still enjoy correctness guarantees [23, 24, 25, 17]. However, the effectiveness of such exact-parameterization based methods hinges on exact knowledge of the rank of X⋆ and of the percentage of corrupted entries in s⋆ (see Figure 1(a)), which are usually not available a priori in practice.
Robust recovery of natural images. Robust recovery of natural images is often considered a challenging task due to the lack of a universal mathematical model for natural images. While sparse and low-rank based methods have been demonstrated to be effective for years [26, 27, 28, 29, 30, 31], the state-of-the-art performance is obtained by priors learned with deep convolutional neural networks. Such methods operate by end-to-end training of neural networks on pairs of corrupted/clean images [32, 33, 34], and often cannot effectively handle test cases whose corruption type and noise level differ from those of the training data.
Recently, this challenge has been partially addressed by the so-called deep image prior (DIP) [35], which has shown impressive results on many image processing tasks. The method parameterizes an image by a deep network φ(θ), with parameters θ, which turns out to be a flexible prior for modeling the underlying distributions of a variety of natural images. The network has a U-shaped architecture and may be viewed as a multi-layer, nonlinear extension of the low-rank matrix factorization X = UU^⊤. Hence, it may not come as a surprise that DIP also inherits the drawbacks of the exact-parameterization approach for low-rank matrix recovery: it requires either a meticulous choice of network width [36] or early stopping of the learning process [37] (see Figure 1(b)).
Overview of our methods and contributions. In this work, we show that the challenges associated with exact-parameterization methods can be simply and effectively dealt with via over-parameterization and discrepant learning rates. Our work is inspired by the recent results [6, 7] for low-rank matrix recovery without corruption (i.e., s⋆ = 0), which state that gradient descent on

(2)    min_{U ∈ ℝ^{n×r'}}  (1/4) ‖A(UU^⊤) − y‖_2^2

converges to a low-rank regularized solution with U_∞U_∞^⊤ ≈ X⋆ even when U over-parameterizes X⋆ with r' > r. In the presence of sparse corruption s⋆, it is tempting to replace the ℓ2-norm in (2) with an ℓ1-norm (a commonly used approach for robust optimization problems, such as robust regression [38], robust subspace learning [39, 40, 41], robust phase retrieval [16, 42], and robust matrix recovery [15, 18]) and solve

(3)    min_{U ∈ ℝ^{n×r'}}  ‖A(UU^⊤) − y‖_1.

Unfortunately, such a naive approach easily overfits the corruption when r' > r (see Figure 1(a)).
In this work, we introduce a double over-parameterization method for robust recovery. Our method over-parameterizes both the low-rank matrix, as X = UU^⊤, and the corruption, as s = g⊙g − h⊙h, and leverages the implicit algorithmic bias of gradient descent to find the correct solution (X⋆, s⋆). The benefits of this approach over the state of the art are summarized as follows (see also Table 1):
- More scalable. Our method is based on gradient descent only and does not require computing an SVD per iteration as in convex approaches [13]. Hence it is significantly more scalable.
- Prior knowledge free. Our method requires no prior knowledge of the rank of the underlying matrix or of the sparsity level of the corruption.
- Provably correct. Under conditions similar to those of the convex approach, our method converges to the ground truth asymptotically.
Table 1: Comparison of methods for robust low-rank matrix recovery.

| Methods | Convex (Eq. (8)) | Nonconvex (Eq. (3)) | Ours (Eq. (4)) |
|---|---|---|---|
| Provably correct? | ✓ | ✗ | ✓ |
| Prior knowledge free? | ✓ | ✗ | ✓ |
| Scalable? | ✗ | ✓ | ✓ |
Underlying the success of our method is the notion of the implicit bias of discrepant learning rates. The idea is that the algorithmic low-rank and sparse regularizations need to be balanced in order to identify the underlying rank and sparsity. Since there is no explicit regularization parameter to tune, we show in Section 2.3 that the desired balance can instead be obtained by using different learning rates for different optimization parameters. Such a property may be of separate interest and bear on a broader range of problems.
Finally, we demonstrate the broad applicability of our ideas on the task of the robust recovery of natural images. We only need to replace the over-parameterization UU^⊤ for low-rank matrices with a U-shaped network φ(θ) for natural images (as adopted in DIP [35]) and solve the resulting optimization problem by gradient descent. This produces a powerful and easy-to-use method with favorable properties when compared to the original DIP (see Figure 1(b)). To be precise, our method handles different test images and varying corruption levels with a single learning pipeline, in which the network width and termination conditions do not need to be adjusted on a case-by-case basis.
2 Main Results and Algorithms
2.1 A Double Over-Parameterization Formulation
As stated in (1), we first consider the problem of recovering a rank-r (r ≪ n) positive semidefinite matrix X⋆ ∈ ℝ^{n×n} from its corrupted linear measurements y = A(X⋆) + s⋆, where s⋆ ∈ ℝ^m is a vector of sparse corruptions (by a lifting trick such as [43, 44], our method can be extended to handle arbitrary rectangular matrices). Recent advances on algorithmic implicit bias for optimizing over-parameterized models [6, 7, 8, 9, 2, 3] motivate us to consider a nonlinear least squares formulation for robust matrix recovery, with double over-parameterization X = UU^⊤ and s = g⊙g − h⊙h:

(4)    min_{U ∈ ℝ^{n×r'}, g, h ∈ ℝ^m}  f(U, g, h) := (1/4) ‖A(UU^⊤) + g⊙g − h⊙h − y‖_2^2,

where ⊙ denotes the entrywise (Hadamard) product and the dimension r' of the parameter U ∈ ℝ^{n×r'} satisfies r' ≥ r. In practice, the choice of r' depends on how much prior information we have about r: r' can either be taken as an estimated upper bound on r, or simply taken as r' = n when no prior knowledge is available. In the following, we provide more background on the choice of our formulation (4).
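To make the formulation concrete, below is a minimal NumPy sketch of the doubly over-parameterized objective in (4) and its gradients. It assumes the measurement operator is given by symmetric matrices A_1, …, A_m; the helper names and the einsum-based implementation are our own illustrative choices, not code from the paper.

```python
import numpy as np

def make_operator(A_list):
    """Build A(X) = [<A_1, X>, ..., <A_m, X>] and its adjoint A*(e) = sum_i e_i A_i."""
    A_stack = np.stack(A_list)                       # shape (m, n, n); A_i assumed symmetric
    A_op  = lambda X: np.einsum('mij,ij->m', A_stack, X)
    A_adj = lambda e: np.einsum('m,mij->ij', e, A_stack)
    return A_op, A_adj

def objective(U, g, h, A_op, y):
    """f(U, g, h) = (1/4) || A(U U^T) + g*g - h*h - y ||_2^2, as in Eq. (4)."""
    e = A_op(U @ U.T) + g * g - h * h - y            # residual
    return 0.25 * e @ e

def gradients(U, g, h, A_op, A_adj, y):
    """For symmetric A_i: grad_U f = A*(e) U, grad_g f = e*g, grad_h f = -e*h."""
    e = A_op(U @ U.T) + g * g - h * h - y
    return A_adj(e) @ U, e * g, -e * h
```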
- Implicit low-rank prior via matrix factorization. For the vanilla low-rank matrix recovery problem with no outliers (i.e., s⋆ = 0), the low-rank matrix can be recovered via the over-parameterization X = UU^⊤ [6, 7, 8, 9]. In particular, the work [6] showed that with small initialization and an infinitesimal learning rate, gradient descent on (2) converges to the minimum nuclear norm solution under certain commutativity conditions on the measurement matrices.
- Implicit sparse prior via Hadamard multiplication. For the classical sparse recovery problem [45, 46], which aims to recover a sparse x⋆ ∈ ℝ^n from its linear measurements y = Ax⋆, recent work [2, 3] showed that the problem can also be dealt with via the over-parameterization x = g⊙g − h⊙h. In particular, the work [2] showed that with small initialization and an infinitesimal learning rate, gradient descent on

  (5)    min_{g, h ∈ ℝ^n}  (1/4) ‖A(g⊙g − h⊙h) − y‖_2^2

  correctly recovers the sparse x⋆ when A satisfies certain restricted isometry properties [47] (see also the sketch after this list).
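As an illustration of the Hadamard over-parameterization above, here is a small, self-contained gradient-descent sketch for (5). The problem sizes, step size, initialization scale and iteration count are illustrative assumptions, not values from [2].

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 80, 200, 5                       # measurements, dimension, sparsity (illustrative)
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_star = np.zeros(n)
x_star[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_star

beta, eta, iters = 1e-4, 0.05, 20000       # small initialization and step size
g = beta * np.ones(n)
h = beta * np.ones(n)
for _ in range(iters):
    r = A @ (g * g - h * h) - y            # residual
    v = A.T @ r
    g -= eta * v * g                       # grad_g of (1/4)||A(g*g - h*h) - y||^2 is (A^T r)*g
    h += eta * v * h                       # grad_h is -(A^T r)*h

x_hat = g * g - h * h
print("relative error:", np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star))
```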
The benefit of the double over-parameterization formulation in (4), as we shall see, is that it allows a specific algorithm to automatically identify both the intrinsic rank of X⋆ and the sparsity level of s⋆ without any prior knowledge.
2.2 Algorithmic Regularizations via Gradient Descent
Obviously, over-parameterization leads to an under-determined problem with infinitely many solutions, so not all solutions of (4) correspond to the desired (X⋆, s⋆). For example, for any given U, one can always construct a pair (g, h) such that f(U, g, h) attains the global minimum value of zero in (4). Nonetheless, as we show in this work, the gradient descent iteration on (4),

(6)    U_{k+1} = U_k − τ ∇_U f(U_k, g_k, h_k) = U_k − τ A*(e_k) U_k,
       g_{k+1} = g_k − ατ ∇_g f(U_k, g_k, h_k) = g_k − ατ e_k ⊙ g_k,
       h_{k+1} = h_k − ατ ∇_h f(U_k, g_k, h_k) = h_k + ατ e_k ⊙ h_k,

with properly chosen learning rates enforces an implicit bias on the solution path, so that it automatically identifies the desired, regularized solution (X⋆, s⋆). Here, in (6), A*(·) denotes the adjoint of the operator A(·), e_k := A(U_kU_k^⊤) + g_k⊙g_k − h_k⊙h_k − y is the residual at iteration k, and τ and ατ are the learning rates for U and {g, h}, respectively.

It should be noted that the scalar α, which controls the ratio between the learning rates of {g, h} and U, plays a crucial role in the quality of the solution that the iterates in (6) converge to (see Figure 2(a)). We will discuss this in more detail in the next subsection (i.e., Section 2.3).
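For concreteness, the iteration (6) can be packaged as a standalone routine with learning rate τ for U and ατ for (g, h). The function signature, the default values, and the assumption that the measurement matrices are symmetric are ours; in practice the step size must be chosen small relative to the norm of the operator.

```python
import numpy as np

def robust_recover(A_op, A_adj, y, n, tau=1e-3, alpha=10.0, beta=1e-4, iters=50000, rng=None):
    """Run the gradient iteration (6) on the objective (4).

    A_op(X) maps an (n, n) symmetric matrix to the m measurements; A_adj(e) is its
    adjoint. tau is the learning rate for U and alpha*tau the learning rate for
    (g, h). All default values are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = y.shape[0]
    U = beta * rng.standard_normal((n, n))       # r' = n: no prior knowledge of the rank
    g = beta * np.ones(m)
    h = beta * np.ones(m)
    for _ in range(iters):
        e = A_op(U @ U.T) + g * g - h * h - y    # residual e_k
        U -= tau * A_adj(e) @ U                  # learning rate tau for U
        g -= alpha * tau * e * g                 # learning rate alpha*tau for g
        h += alpha * tau * e * h                 # learning rate alpha*tau for h
    return U @ U.T, g * g - h * h                # (X, s) estimates
```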
Convergence to low-rank & sparse solutions. Based on our discussion in Section 2.1, it is expected that the gradient descent iteration (6) converges to a solution for which

(7)    X_∞ := lim_{k→∞} U_kU_k^⊤    and    s_∞ := lim_{k→∞} (g_k⊙g_k − h_k⊙h_k)

have the minimum nuclear norm and ℓ1 norm, respectively. More specifically, we expect the solution (X_∞, s_∞) in (7) of the nonlinear least squares problem (4) to also serve as a global solution to the convex problem

(8)    min_{X ⪰ 0, s}  ‖X‖_* + λ‖s‖_1    subject to    A(X) + s = y,

a claim we state more rigorously in Section 3 under a particular setting. However, it should be noted that obtaining the global solution of (8) does not necessarily mean that we find the desired solution (X⋆, s⋆): the penalty λ in (8), which balances the low-rank and the sparse regularizations, plays a crucial role in the quality of the solution to (8). For instance, when A is the identity operator, under proper conditions on (X⋆, s⋆), the work [13] showed that the global solution of (8) is exactly the target solution (X⋆, s⋆) only for a proper choice of λ (e.g., λ = 1/√n for an n × n matrix). Therefore, choosing the right λ is critical for a successful recovery of (X⋆, s⋆).
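When A is the identity operator (the robust PCA setting), the convex program (8) can also be solved directly with an off-the-shelf solver, which is useful as a reference for the gradient-descent output. The following CVXPY sketch uses tr(X) for the nuclear norm of a PSD variable; the problem sizes and the choice λ = 1/√n are illustrative and motivated by [13], and CVXPY itself is an assumed dependency.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r, p = 30, 2, 0.05
U_factor = rng.standard_normal((n, r)); X_star = U_factor @ U_factor.T
S_star = np.zeros((n, n))
idx = rng.choice(n * n, size=int(p * n * n), replace=False)
S_star.flat[idx] = rng.standard_normal(idx.size)
Y = X_star + S_star

lam = 1.0 / np.sqrt(n)                       # value suggested for robust PCA in [13]
X = cp.Variable((n, n), PSD=True)            # for X >= 0, ||X||_* equals trace(X)
S = cp.Variable((n, n))
problem = cp.Problem(cp.Minimize(cp.trace(X) + lam * cp.sum(cp.abs(S))),
                     [X + S == Y])
problem.solve()
print("relative error:", np.linalg.norm(X.value - X_star) / np.linalg.norm(X_star))
```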
2.3 Implicit Bias with Discrepant Learning Rates
As noted above, a remaining challenge is to control the implicit regularization of the gradient descent iteration (6) so that the algorithm converges to the solution of (8) with the desired value of λ. Since our new objective (4) contains no explicit regularization term, this might at first glance seem impossible. Nonetheless, we show that it can be achieved simply by adjusting the ratio α of the learning rates between {g, h} and U in the gradient step (6), which is one of our key contributions in this work and could also bear broader interest. More specifically, this phenomenon is captured by the following observation.
Observation.
Gradient descent (6) with learning rate τ for U and ατ for {g, h} converges to a solution of the convex program (8) whose regularization parameter λ is determined by the learning rate ratio α: the larger the learning rate on {g, h}, the smaller the effective penalty λ on ‖s‖_1, and vice versa.

In comparison to classical optimization theory [48], where learning rates affect only the convergence rate of an algorithm but not the quality of the solution, our observation here is surprisingly different: using discrepant learning rates on different over-parameterized variables actually affects the specific solution that the algorithm converges to (our result also differs from [49, 50, 51], which analyze the effect of an initially large learning rate). In the following, we provide some intuition for why this happens and discuss its implications. We leave a more rigorous mathematical treatment to Section 3.
Intuitions. The inverse relationship between the learning rate ratio and the penalty λ implies that a large learning rate for a particular optimization variable in (6) leads to a small penalty on its implicit regularization counterpart in (8). From a high-level perspective, this happens because a larger learning rate allows the optimization variable to move faster away from its initial point (which is close to the origin), resulting in a weaker regularization effect (which penalizes the distance of the variable from the origin) on its solution path.
Implications. The implicit bias of discrepant learning rates provides a new and powerful way of controlling the regularization effects in over-parameterized models. For robust matrix recovery in particular, it reveals an equivalence between our method in (4) and the convex method in (8), with a one-to-one correspondence between the learning rate ratio α and the regularization parameter λ. By picking α so that the induced λ matches the value prescribed in [13], we may directly quote results from [13] and conclude that our method correctly recovers (X⋆, s⋆) with information-theoretically optimal sampling complexity and sparsity levels (see Figure 2). Note that this is achieved with no prior knowledge of the rank of X⋆ or the sparsity of s⋆. Next, we show that such benefits have implications beyond robust low-rank matrix recovery.
2.4 Extension to Robust Recovery of Natural Images
Finally, we address the problem of robustly recovering a natural image x⋆ ∈ ℝ^{h×w×c} from its corrupted observation y = x⋆ + s⋆, where h and w are the height and width of the image and c denotes the number of channels (c = 1 for a grayscale image and c = 3 for a color image with RGB channels). Here the structure of x⋆ cannot be fully captured by a low-rank model. Inspired by the work [35], which demonstrated the effectiveness of an image prior given by a deep convolutional network φ(θ) of a particular architecture, where θ denotes the network parameters, we use the following formulation for image recovery:

(9)    min_{θ, g, h}  f(θ, g, h) := (1/4) ‖φ(θ) + g⊙g − h⊙h − y‖_F^2.
As we will empirically demonstrate in Section 4.2, gradient descent on (9)
(10)    θ_{k+1} = θ_k − τ ∇_θ f(θ_k, g_k, h_k),    g_{k+1} = g_k − ατ ∇_g f(θ_k, g_k, h_k),    h_{k+1} = h_k − ατ ∇_h f(θ_k, g_k, h_k),

with a balanced learning rate ratio α also enforces an implicit regularization on the solution path towards the desired solution. It should be noted that this happens even though the over-parameterization φ(θ) is a highly nonlinear network (in comparison with the shallow linear parameterization of [6] or the deep linear networks of [8]), which raises several intriguing theoretical questions for further investigation.
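A minimal PyTorch-style sketch of (9)-(10) is given below: a small convolutional generator stands in for the U-shaped network of [35], and the two learning rates are implemented with optimizer parameter groups. The toy architecture, the use of plain SGD, and all numerical values are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Toy stand-in for the U-shaped DIP network phi(theta) of [35]."""
    def __init__(self, c=3, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c, width, 3, padding=1), nn.LeakyReLU(0.1), nn.BatchNorm2d(width),
            nn.Conv2d(width, width, 3, padding=1), nn.LeakyReLU(0.1), nn.BatchNorm2d(width),
            nn.Conv2d(width, c, 3, padding=1), nn.Sigmoid())
    def forward(self, z):
        return self.net(z)

y = torch.rand(1, 3, 64, 64)                 # corrupted observation (placeholder data)
z = torch.randn(1, 3, 64, 64)                # fixed random input to the network
net = TinyGenerator()
g = nn.Parameter(1e-3 * torch.randn_like(y)) # over-parameterized corruption s = g*g - h*h
h = nn.Parameter(g.detach().clone())

tau, alpha = 1e-2, 10.0                      # illustrative learning rate and ratio
opt = torch.optim.SGD([
    {"params": net.parameters(), "lr": tau},
    {"params": [g, h], "lr": alpha * tau},
])

for _ in range(2000):
    opt.zero_grad()
    residual = net(z) + g * g - h * h - y    # phi(theta) + g.g - h.h - y, as in Eq. (9)
    loss = 0.25 * (residual ** 2).sum()
    loss.backward()
    opt.step()

x_hat = net(z).detach()                      # recovered image
s_hat = (g * g - h * h).detach()             # estimated corruption
```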
3 Convergence Analysis of Gradient Flow Dynamics
We provide a dynamical analysis certifying our observation in Section 2.3. Similar to [6, 8], we consider a special case where the measurement matrices {A_i} associated with A commute (any linear A can be written as A(X) = [⟨A_1, X⟩, …, ⟨A_m, X⟩]^⊤ for some matrices {A_i}), and investigate the trajectories of the discrete gradient iterates of U, g, and h in (6) with an infinitesimal learning rate (i.e., τ → 0). In other words, we study their continuous dynamics counterparts U(t), g(t), and h(t), which are parameterized by t ≥ 0 and initialized at t = 0 with
(11)    U(0) = β I_n,    g(0) = h(0) = β 1_m,

where β > 0 controls the initialization scale. Thus, analogous to (6), the behavior of the continuous gradient flows can be captured by the following differential equations
(12)    U̇(t) = −∇_U f(U(t), g(t), h(t)) = −A*(e(t)) U(t),
(13)    ġ(t) = −α ∇_g f(U(t), g(t), h(t)) = −α e(t) ⊙ g(t),    ḣ(t) = −α ∇_h f(U(t), g(t), h(t)) = α e(t) ⊙ h(t),

where e(t) := A(U(t)U(t)^⊤) + g(t)⊙g(t) − h(t)⊙h(t) − y denotes the residual. Let X(t) := U(t)U(t)^⊤; then we can derive the gradient flow for X(t) via the chain rule:

(14)    Ẋ(t) = U̇(t)U(t)^⊤ + U(t)U̇(t)^⊤ = −A*(e(t)) X(t) − X(t) A*(e(t)).

For any β > 0, suppose that the limits of X(t), g(t), and h(t) as t → ∞ exist, and denote them by

(15)    X_∞(β) := lim_{t→∞} X(t),    g_∞(β) := lim_{t→∞} g(t),    h_∞(β) := lim_{t→∞} h(t).

Then, when the initialization is infinitesimally small (i.e., β → 0), we show that the following holds.
Theorem 1.
Assume that the measurement matrices {A_i} commute, i.e., A_iA_j = A_jA_i for all 1 ≤ i, j ≤ m, and that the gradient flows of U(t), g(t), and h(t) satisfy (12) and (13) and are initialized by (11). Let X_∞(β), g_∞(β), and h_∞(β) be the limit points defined in (15). Suppose that our initialization is infinitesimally small, in the sense that the limits

X_∞ := lim_{β→0} X_∞(β),    g_∞ := lim_{β→0} g_∞(β),    h_∞ := lim_{β→0} h_∞(β)

exist and (X_∞, g_∞, h_∞) is a global optimal solution to (4) with

A(X_∞) + g_∞⊙g_∞ − h_∞⊙h_∞ = y.

Then we have X_∞ ⪰ 0, and (X_∞, s_∞) with s_∞ := g_∞⊙g_∞ − h_∞⊙h_∞ is also a global optimal solution to (8), with the regularization parameter λ determined by the learning rate ratio α.
A detailed proof of Theorem 1 is deferred to Appendix B. Our result shows that among the infinitely many global solutions to (4), gradient descent biases towards the one that minimizes the nuclear norm of X and the ℓ1 norm of s, with their relative weight controlled by α. In the following, we discuss the reasonableness of the assumptions of Theorem 1.
- Theorem 1 assumes that the gradient flow converges to a global minimizer of (4) with zero fitting error. A similar assumption is adopted in the analysis of the singly over-parameterized setting in [6], and it is consistent with what we observe empirically.
- The commutativity assumption is commonly adopted in the analysis of over-parameterized low-rank matrix recovery and is empirically observed to be non-essential [6, 8]. A recent work [7] provides a more sophisticated analysis of the discrete dynamics under a restricted isometry assumption, where the commutativity assumption is not needed. We believe that such an analysis can be extended to our double over-parameterization setting as well, but leave it as future work.
4 Experiments
In this section, we provide experimental evidence for the implicit bias of learning rates discussed in Section 2.3. Through experiments on low-rank matrix recovery, Section 4.1 shows that the learning rate ratio α affects the solution that gradient descent converges to, and that an optimal α does not depend on the rank of the matrix or the sparsity of the corruption. Furthermore, Section 4.2 shows the effectiveness of the implicit bias of learning rates in alleviating overfitting for robust image recovery, and demonstrates that our method produces better recovery quality than DIP for varying test images and corruption levels, all with a single model and a single set of learning parameters.
4.1 Robust Recovery of Low-rank Matrices
We conduct experiments on a particular case of the low-rank matrix recovery problem, namely the robust PCA problem, in which the operator A is the identity. More specifically, the goal is to recover a low-rank matrix X⋆ ∈ ℝ^{n×n} and a sparse matrix s⋆ from the mixture y = X⋆ + s⋆, possibly without prior knowledge of the rank of X⋆ and the percentage of nonzero entries of s⋆. For any given rank r, we generate X⋆ by setting X⋆ = ŨŨ^⊤, where Ũ is an n × r matrix with i.i.d. entries drawn from the standard Gaussian distribution. For any given corruption ratio p, we generate s⋆ by sampling p·n² locations uniformly at random from the n × n matrix and setting those entries by sampling i.i.d. from a zero-mean Gaussian distribution.
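The data generation just described can be sketched as follows; the matrix size, rank, corruption ratio, and the standard deviation of the corruption (left unspecified above) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 100, 5, 0.1                       # matrix size, rank, corruption ratio (illustrative)

# Low-rank part: X_star = U U^T with an n x r factor of i.i.d. standard Gaussian entries.
U_factor = rng.standard_normal((n, r))
X_star = U_factor @ U_factor.T

# Sparse part: p * n^2 uniformly random entries filled with zero-mean Gaussian values;
# matching their standard deviation to the entries of X_star is an assumption.
S_star = np.zeros((n, n))
idx = rng.choice(n * n, size=int(p * n * n), replace=False)
S_star.flat[idx] = X_star.std() * rng.standard_normal(idx.size)

Y = X_star + S_star                         # observed mixture y = X_star + s_star
```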
We apply our double over-parameterization method in (4) to the robust PCA problem. Specifically, we initialize U and g by drawing i.i.d. entries from a zero-mean Gaussian distribution with a small standard deviation, and initialize h to be the same as g (so that the initial estimate of the corruption is zero). The learning rates for U and for {g, h} are set to τ and ατ, respectively, with the same τ used for all experiments.
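For the identity operator, the gradient iteration (6) with the initialization just described reduces to the following sketch; the step size τ, ratio α, initialization scale and iteration count are illustrative, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 100, 5, 0.1
U_factor = rng.standard_normal((n, r)); X_star = U_factor @ U_factor.T
S_star = np.zeros((n, n))
idx = rng.choice(n * n, size=int(p * n * n), replace=False)
S_star.flat[idx] = X_star.std() * rng.standard_normal(idx.size)
Y = X_star + S_star

tau = 1e-2 / np.linalg.norm(Y, 2)           # small step relative to the spectral norm of Y
alpha, scale, iters = 10.0, 1e-3, 20000     # learning-rate ratio and initialization (illustrative)

U = scale * rng.standard_normal((n, n))     # r' = n: no prior knowledge of the rank
G = scale * rng.standard_normal((n, n))
H = G.copy()                                # initial corruption estimate G*G - H*H is zero

for _ in range(iters):
    E = U @ U.T + G * G - H * H - Y         # residual
    U -= 0.5 * tau * (E + E.T) @ U          # grad_U of (1/4)||E||_F^2 is 0.5 (E + E^T) U
    G -= alpha * tau * E * G                # grad_G is E * G
    H += alpha * tau * E * H                # grad_H is -E * H

X_hat, S_hat = U @ U.T, G * G - H * H
print("relative error:", np.linalg.norm(X_hat - X_star) / np.linalg.norm(X_star))
```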
Effect of learning rate ratio α. We fix the rank r and the corruption ratio p, perform gradient descent and compute the reconstruction error of the output of our algorithm relative to the ground truth (X⋆, s⋆). Figure 2(a) illustrates the performance for varying values of α. We observe that α affects the solution that the algorithm converges to, which verifies that the learning rate ratio has an implicit regularization effect. Moreover, the best performance is attained at the value of α predicted by our theoretical result in Theorem 1.
Effect of rank and outlier ratio, and phase transition. We now fix α and study the ability of our method to recover matrices of varying rank r under varying percentages of corruption p. For each pair (r, p), we apply our method and declare the recovery successful if the relative reconstruction error is below a small threshold. Figure 2(c) displays the fraction of successful recoveries over Monte Carlo trials. It shows that a single value of the parameter α leads to correct recovery for a wide range of r and p. Figure 2(b) shows the fraction of successful recoveries for the convex method in (8), computed with the Accelerated Proximal Gradient solver of [54] (code available at http://people.eecs.berkeley.edu/~yima/matrix-rank/sample_code.html).
4.2 Robust Recovery of Natural Images
Following [35], we evaluate the performance of our method for robust image recovery using images from a standard dataset (http://www.cs.tut.fi/~foi/GCF-BM3D/index.html#ref_results). Corruption of the images is synthesized by adding salt-and-pepper noise, where a given ratio of randomly chosen pixels are replaced with either 0 or 1 (each with probability 1/2).
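The corruption protocol can be sketched as follows; corrupting all channels of a chosen pixel jointly is an assumption here, and corrupting channels independently would be an equally plausible reading.

```python
import numpy as np

def salt_and_pepper(img, ratio, rng=None):
    """Replace a `ratio` fraction of pixels with 0 or 1, each with probability 1/2.

    `img` is a float array of shape (H, W) or (H, W, C) with values in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    h, w = img.shape[:2]
    n_corrupt = int(ratio * h * w)
    idx = rng.choice(h * w, size=n_corrupt, replace=False)
    rows, cols = np.unravel_index(idx, (h, w))
    values = rng.integers(0, 2, size=n_corrupt).astype(img.dtype)
    if img.ndim == 3:
        values = values[:, None]                 # same value for all channels of a pixel
    out[rows, cols] = values
    return out
```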
The network φ(θ) for our method in (9) is the same as the denoising network in [35], which has a U-shaped architecture with skip connections. Each layer of the network contains a convolutional layer, a LeakyReLU nonlinearity and a batch normalization layer. We also follow [35] for the initialization of the network parameters θ, which uses the Kaiming initialization. Meanwhile, we initialize g and h with i.i.d. entries from a zero-mean Gaussian distribution with a small standard deviation. We use the same learning rate τ and the same ratio α for all our experiments below.
No need to tune model width or terminate early. We compare our method with a variant of DIP, which we call DIP-ℓ1, obtained by fitting the network φ(θ) directly to the corrupted observation y under an ℓ1 loss. As shown in Figure 1(b), DIP-ℓ1 requires either choosing an appropriate network width or early termination to avoid overfitting. Note that neither of these can be carried out in practice without access to the true (clean) image. On the other hand, our method requires neither tuning of the network width nor early termination: its performance continues to improve as training proceeds, until it stabilizes.
Handling different images and varying corruption levels. The benefit mentioned above enables our method to handle different images and varying corruption levels without the need to tune the network width, the termination condition or any other learning parameters. In our experiments, we fix the number of channels in each convolutional layer, run a fixed number of gradient descent iterations and display the output image. The results are shown in Figure 3 (for four different test images) and Figure 4 (for four corruption levels; see Appendix C). For DIP and DIP-ℓ1, we report the result with the highest PSNR over the learning process (i.e., we perform the best possible early termination for evaluation purposes; note that this cannot be done in practice). Despite that, our method obtains better recovery quality in terms of PSNR in all cases. We display the learning curves for these experiments in Figure 5(a) and Figure 5(b) (see Appendix C), which show that our method does not overfit in any of the cases, while DIP-ℓ1 requires a case-dependent early termination.
5 Conclusion
In this work, we have shown both theoretically and empirically that the benefits of the implicit bias of gradient descent can be extended to the over-parameterization of two low-dimensional structures. The key to this success is the choice of discrepant learning rates, which properly regulate the optimization path so that it converges to the desired optimal solution. Such a framework frees us from the need for prior knowledge of the structures and from the lack of scalability of previous approaches. This has led to state-of-the-art recovery results for both low-rank matrices and natural images. We hope this work encourages future investigation of whether the same framework and ideas generalize to mixtures of multiple and broader families of structures.
References
- [1] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro, “The implicit bias of gradient descent on separable data,” The Journal of Machine Learning Research, vol. 19, no. 1, pp. 2822–2878, 2018.
- [2] T. Vaskevicius, V. Kanade, and P. Rebeschini, “Implicit regularization for optimal sparse recovery,” in Advances in Neural Information Processing Systems, pp. 2968–2979, 2019.
- [3] P. Zhao, Y. Yang, and Q.-C. He, “Implicit regularization via hadamard product over-parametrization in high-dimensional linear regression,” arXiv preprint arXiv:1903.09367, 2019.
- [4] F. Wu and P. Rebeschini, “Hadamard wirtinger flow for sparse phase retrieval,” arXiv preprint arXiv:2006.01065, 2020.
- [5] S. Oymak and M. Soltanolkotabi, “Overparameterized nonlinear learning: Gradient descent takes the shortest path?,” in International Conference on Machine Learning, pp. 4951–4960, 2019.
- [6] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro, “Implicit regularization in matrix factorization,” in Advances in Neural Information Processing Systems, pp. 6151–6159, 2017.
- [7] Y. Li, T. Ma, and H. Zhang, “Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations,” in Conference On Learning Theory, pp. 2–47, 2018.
- [8] S. Arora, N. Cohen, W. Hu, and Y. Luo, “Implicit regularization in deep matrix factorization,” in Advances in Neural Information Processing Systems, pp. 7411–7422, 2019.
- [9] M. A. Belabbas, “On implicit regularization: Morse functions and applications to matrix factorization,” arXiv preprint arXiv:2001.04264, 2020.
- [10] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro, “Implicit bias of gradient descent on linear convolutional networks,” in Advances in Neural Information Processing Systems, pp. 9461–9471, 2018.
- [11] G. Gidel, F. Bach, and S. Lacoste-Julien, “Implicit regularization of discrete gradient dynamics in linear neural networks,” in Advances in Neural Information Processing Systems, pp. 3196–3206, 2019.
- [12] E. Alpaydin, Introduction to machine learning. 2020.
- [13] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?,” Journal of the ACM (JACM), vol. 58, no. 3, pp. 1–37, 2011.
- [14] D. S. Weller, A. Pnueli, G. Divon, O. Radzyner, Y. C. Eldar, and J. A. Fessler, “Undersampled phase retrieval with outliers,” IEEE transactions on computational imaging, vol. 1, no. 4, pp. 247–258, 2015.
- [15] Y. Li, Y. Sun, and Y. Chi, “Low-rank positive semidefinite matrix recovery from corrupted rank-one measurements,” IEEE Transactions on Signal Processing, vol. 65, no. 2, pp. 397–408, 2016.
- [16] J. C. Duchi and F. Ruan, “Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval,” Information and Inference: A Journal of the IMA, vol. 8, no. 3, pp. 471–529, 2019.
- [17] Y. Chi, Y. M. Lu, and Y. Chen, “Nonconvex optimization meets low-rank matrix factorization: An overview,” IEEE Transactions on Signal Processing, vol. 67, no. 20, pp. 5239–5269, 2019.
- [18] X. Li, Z. Zhu, A. Man-Cho So, and R. Vidal, “Nonconvex robust low-rank matrix recovery,” SIAM Journal on Optimization, vol. 30, no. 1, pp. 660–686, 2020.
- [19] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
- [20] Z. Lin, M. Chen, and Y. Ma, “The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices,” arXiv preprint arXiv:1009.5055, 2010.
- [21] P. Jain and P. Kar, “Non-convex optimization for machine learning,” Foundations and Trends® in Machine Learning, vol. 10, no. 3-4, pp. 142–336, 2017.
- [22] S. Burer and R. D. Monteiro, “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” Mathematical Programming, vol. 95, no. 2, pp. 329–357, 2003.
- [23] S. Bhojanapalli, B. Neyshabur, and N. Srebro, “Global optimality of local search for low rank matrix recovery,” in Advances in Neural Information Processing Systems, pp. 3873–3881, 2016.
- [24] R. Ge, J. D. Lee, and T. Ma, “Matrix completion has no spurious local minimum,” in Advances in Neural Information Processing Systems, pp. 2973–2981, 2016.
- [25] Z. Zhu, Q. Li, G. Tang, and M. B. Wakin, “Global optimality in low-rank matrix optimization,” IEEE Transactions on Signal Processing, vol. 66, no. 13, pp. 3614–3628, 2018.
- [26] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Non-local sparse models for image restoration,” in 2009 IEEE 12th international conference on computer vision, pp. 2272–2279, IEEE, 2009.
- [27] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-d transform-domain collaborative filtering,” IEEE Transactions on image processing, vol. 16, no. 8, pp. 2080–2095, 2007.
- [28] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Bm3d image denoising with shape-adaptive principal component analysis,” 2009.
- [29] S. Gu, L. Zhang, W. Zuo, and X. Feng, “Weighted nuclear norm minimization with application to image denoising,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2862–2869, 2014.
- [30] J. Xu, L. Zhang, and D. Zhang, “A trilateral weighted sparse coding scheme for real-world image denoising,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 20–36, 2018.
- [31] B. Lecouat, J. Ponce, and J. Mairal, “Fully trainable and interpretable non-local sparse models for image restoration,” 2020.
- [32] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
- [33] Z. Yue, H. Yong, Q. Zhao, D. Meng, and L. Zhang, “Variational denoising network: Toward blind noise modeling and removal,” in Advances in Neural Information Processing Systems, pp. 1688–1699, 2019.
- [34] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Cycleisp: Real image restoration via improved data synthesis,” arXiv preprint arXiv:2003.07761, 2020.
- [35] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.
- [36] R. Heckel and P. Hand, “Deep decoder: Concise image representations from untrained non-convolutional networks,” in International Conference on Learning Representations, 2019.
- [37] R. Heckel and M. Soltanolkotabi, “Denoising and regularization via exploiting the structural bias of convolutional generators,” arXiv preprint arXiv:1910.14634, 2019.
- [38] P. J. Rousseeuw and A. M. Leroy, Robust regression and outlier detection, vol. 589. John wiley & sons, 2005.
- [39] Z. Zhu, Y. Wang, D. Robinson, D. Naiman, R. Vidal, and M. Tsakiris, “Dual principal component pursuit: Improved analysis and efficient algorithms,” in Advances in Neural Information Processing Systems 31, pp. 2171–2181, 2018.
- [40] Q. Qu, Z. Zhu, X. Li, M. C. Tsakiris, J. Wright, and R. Vidal, “Finding the sparsest vectors in a subspace: Theory, algorithms, and applications,” 2020.
- [41] H. Jiang, D. P. Robinson, R. Vidal, and C. You, “A nonconvex formulation for low rank subspace clustering: algorithms and convergence analysis,” Computational Optimization and Applications, vol. 70, no. 2, pp. 395–418, 2018.
- [42] D. Davis, D. Drusvyatskiy, and C. Paquette, “The nonsmooth landscape of phase retrieval,” arXiv preprint arXiv:1711.03247, 2017.
- [43] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht, “Low-rank solutions of linear matrix equations via procrustes flow,” in International Conference on Machine Learning, pp. 964–973, 2016.
- [44] D. Park, A. Kyrillidis, C. Carmanis, and S. Sanghavi, “Non-square matrix sensing without spurious local minima via the burer-monteiro approach,” in Artificial Intelligence and Statistics, pp. 65–74, 2017.
- [45] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on information theory, vol. 52, no. 2, pp. 489–509, 2006.
- [46] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE signal processing magazine, vol. 25, no. 2, pp. 21–30, 2008.
- [47] E. J. Candès, “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus Mathematique, vol. 346, no. 9-10, pp. 589–592, 2008.
- [48] J. Nocedal and S. Wright, Numerical optimization. Springer Science & Business Media, 2006.
- [49] Y. Li, C. Wei, and T. Ma, “Towards explaining the regularization effect of initial large learning rate in training neural networks,” in Advances in Neural Information Processing Systems, pp. 11669–11680, 2019.
- [50] K. You, M. Long, J. Wang, and M. Jordan, “How does learning rate decay help modern neural networks,” arXiv preprint arXiv:1908.01878, 2019.
- [51] P. Nakkiran, “Learning rate annealing can provably help generalization, even for convex problems,” arXiv preprint arXiv:2005.07360, 2020.
- [52] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gradient descent only converges to minimizers,” in Conference on learning theory, pp. 1246–1257, 2016.
- [53] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre, “Low-rank optimization on the cone of positive semidefinite matrices,” SIAM Journal on Optimization, vol. 20, no. 5, pp. 2327–2351, 2010.
- [54] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma, “Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix,” Coordinated Science Laboratory Report no. UILU-ENG-09-2214, DC-246, 2009.
- [55] S. Oymak and M. Soltanolkotabi, “Overparameterized nonlinear learning: Gradient descent takes the shortest path?,” arXiv preprint arXiv:1812.10004, 2018.
Appendix A Implicit Bias of Discrepant Learning Rates in Linear Regression
In this part of the appendix, we use a classical result for the underdetermined linear regression problem to build intuition for the implicit bias of gradient descent in our formulation of robust learning problems. The high-level message we aim to deliver through this simple example is the following:
- Gradient descent implicitly biases towards solutions with minimum ℓ2-norm.
- Discrepant learning rates lead to solutions with minimum weighted ℓ2-norm.
Underdetermined linear regression.
Given observations y ∈ ℝ^m and a wide data matrix A ∈ ℝ^{m×n} (m < n), we want to find an x ∈ ℝ^n that solves

(16)    Ax = y.

For m < n and a full row-rank A, the underdetermined problem (16) obviously has infinitely many solutions, which form the set

{x† + v : v ∈ null(A)},    where x† := A†y,

where A† denotes the pseudo-inverse of A, and

null(A) := {v ∈ ℝ^n : Av = 0}    and    row(A) := range(A^⊤)

are the null space and row space of A, respectively. A simple derivation shows that x† is the particular least ℓ2-norm solution to (16), i.e., it minimizes

‖x‖_2    subject to    Ax = y.
Gradient descent biases towards x†.
Starting from any initialization x_0, gradient descent

(17)    x_{k+1} = x_k − τ A^⊤(Ax_k − y)

with a sufficiently small learning rate always finds one of the global solutions of (16); this is because Ax_{k+1} − y = (I − τAA^⊤)(Ax_k − y), so if we choose τ < 2/σ_max(A)^2, then the residual Ax_k − y converges to zero geometrically (AA^⊤ is positive definite since A has full row rank). Furthermore, it is now well understood [55] that whenever the initialization has zero component in null(A) (i.e., x_0 ∈ row(A)), the iterates in (17) implicitly bias towards the minimum ℓ2-norm solution x†. This happens because, once initialized in row(A), gradient descent (17) keeps all iterates within row(A), so that the limit point x_∞ satisfies

x_∞ ∈ row(A)    and    Ax_∞ = y,

and x† is the unique point with these two properties.
As we can see, a particular algorithm enforces specific regularization on the final solution.
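A quick numerical check of this claim, assuming nothing beyond NumPy; the problem sizes, step size and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 50                                   # underdetermined: more unknowns than equations
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

x = np.zeros(n)                                 # x_0 = 0 lies in row(A)
tau = 1.0 / np.linalg.norm(A, 2) ** 2           # small enough step size
for _ in range(5000):
    x -= tau * A.T @ (A @ x - y)                # gradient descent on (1/2)||Ax - y||^2

x_dagger = np.linalg.pinv(A) @ y                # minimum l2-norm solution
print(np.allclose(x, x_dagger, atol=1e-6))      # should print True for this step size
```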
Implicit bias of discrepant learning rates. The gradient update in (17) uses the same learning rate for all coordinates of x. If we use a different learning rate for each coordinate, i.e.,

(18)    x_{k+1} = x_k − τ Λ A^⊤(Ax_k − y),

where Λ is a diagonal matrix with positive diagonal entries, then by following a similar argument we conclude that the gradient update in (18) converges to the weighted-ℓ2 regularized solution of

(19)    min_x  ‖Λ^{−1/2} x‖_2^2    subject to    Ax = y.
Remark 1.
Let λ_i be the i-th diagonal entry of Λ and x_i the i-th element of x. Then in (18), τλ_i is the learning rate for the variable x_i, which varies across variables. In words, the relation between (18) and (19) implies that a large learning rate for one particular optimization variable (e.g., x_i) in (18) leads to a small implicit regularization effect (a small weight 1/λ_i on x_i^2) in (19). From a high-level perspective, this happens because a larger learning rate allows the optimization variable to move faster away from its initial point, resulting in a weaker regularization effect (which penalizes the distance of the variable from its initialization) on its solution path.
An alternative explanation is through the change of variables x̃ := Λ^{−1/2} x. Suppose we minimize

(20)    f̃(x̃) := (1/2) ‖A Λ^{1/2} x̃ − y‖_2^2

via standard gradient descent with a single learning rate τ:

(21)    x̃_{k+1} = x̃_k − τ Λ^{1/2} A^⊤ (A Λ^{1/2} x̃_k − y).

Thus, once initialized in row(AΛ^{1/2}) (e.g., at x̃_0 = 0), gradient descent (21) converges to the least ℓ2-norm solution of (20), i.e., the solution of

(22)    min_{x̃}  ‖x̃‖_2^2    subject to    A Λ^{1/2} x̃ = y.

Finally, plugging x̃ = Λ^{−1/2} x into (21) and (22) gives (18) and (19), respectively, which again indicates that the gradient update (18) induces an implicit weighted-ℓ2 regularization towards the solution of (19).
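The correspondence between (18) and (19) can likewise be checked numerically; the diagonal weights below are arbitrary illustrative choices, and the closed-form solution of (19) used for comparison is x = ΛA^⊤(AΛA^⊤)^{-1}y.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 50
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
lam = rng.uniform(0.2, 5.0, size=n)             # positive diagonal of Lambda (per-coordinate rates)

x = np.zeros(n)
tau = 1.0 / (lam.max() * np.linalg.norm(A, 2) ** 2)
for _ in range(20000):
    x -= tau * lam * (A.T @ (A @ x - y))        # update (18): coordinate-wise learning rates

# Closed-form solution of (19): minimize ||Lambda^{-1/2} x||^2 subject to Ax = y.
x_weighted = lam[:, None] * A.T @ np.linalg.solve(A * lam @ A.T, y)
print(np.allclose(x, x_weighted, atol=1e-5))    # should print True
```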
Appendix B Proof of Theorem 1
In this part of the appendix, we prove our main technical result (i.e., Theorem 1) in Section 3. To make this part self-contained, we restate our result as follows.
Theorem 2.
Assume that the measurement matrices {A_i} commute, i.e., A_iA_j = A_jA_i for all 1 ≤ i, j ≤ m, and that the gradient flows of U(t), g(t), and h(t) satisfy

(23)    U̇(t) = −∇_U f(U(t), g(t), h(t)) = −A*(e(t)) U(t),
(24)    ġ(t) = −α ∇_g f(U(t), g(t), h(t)) = −α e(t) ⊙ g(t),    ḣ(t) = −α ∇_h f(U(t), g(t), h(t)) = α e(t) ⊙ h(t),

with e(t) := A(U(t)U(t)^⊤) + g(t)⊙g(t) − h(t)⊙h(t) − y, and that they are initialized by

U(0) = β I_n,    g(0) = h(0) = β 1_m,    for some β > 0.

Let X(t) := U(t)U(t)^⊤, and let X_∞(β), g_∞(β), and h_∞(β) be the limit points defined as

(25)    X_∞(β) := lim_{t→∞} X(t),    g_∞(β) := lim_{t→∞} g(t),    h_∞(β) := lim_{t→∞} h(t).

Suppose that our initialization is infinitesimally small, in the sense that the limits

X_∞ := lim_{β→0} X_∞(β),    g_∞ := lim_{β→0} g_∞(β),    h_∞ := lim_{β→0} h_∞(β)

exist and (X_∞, g_∞, h_∞) is a global optimal solution to

(26)    min_{U, g, h}  f(U, g, h) := (1/4) ‖A(UU^⊤) + g⊙g − h⊙h − y‖_2^2

with A(X_∞) + s_∞ = y and s_∞ := g_∞⊙g_∞ − h_∞⊙h_∞. Then we have X_∞ ⪰ 0, and (X_∞, s_∞) is also a global optimal solution to

(27)    min_{X ⪰ 0, s}  ‖X‖_* + λ‖s‖_1    subject to    A(X) + s = y,

with the regularization parameter λ determined by α, the balancing parameter in (24).
Proof.
From (23), we can derive the gradient flow for X(t) := U(t)U(t)^⊤ via the chain rule:

(28)    Ẋ(t) = −A*(e(t)) X(t) − X(t) A*(e(t)).
We want to show that when the initialization is infinitesimally small (i.e., β → 0), the limit points X_∞ and s_∞ of the gradient flows are optimal solutions for (27). Towards this goal, let us first look at the optimality condition for (27). From Lemma 1, we know that if (X, s) with A(X) + s = y and X ⪰ 0 is an optimal solution for (27), then there exists a dual certificate μ ∈ ℝ^m such that

A*(μ) ⪯ I,    ⟨I − A*(μ), X⟩ = 0,    and    μ ∈ λ ∂‖s‖_1,

where ∂‖s‖_1 is defined in (32), the subdifferential of ‖·‖_1 at s. Thus, it suffices to construct a dual certificate μ such that (X_∞, s_∞, μ) satisfies the conditions above.
Since (X_∞, g_∞, h_∞) is a global optimal solution to (26), we automatically have A(X_∞) + s_∞ = y; moreover, X_∞ ⪰ 0 since it is a limit of matrices of the form U(t)U(t)^⊤. On the other hand, given that the {A_i} commute and that (28) and (24) hold for X(t), g(t), and h(t), by Lemma 2 we know that

(29)    X(t) = β² exp(2 A*(ν(t))),
(30)    g(t) = β exp(α ν(t)),    h(t) = β exp(−α ν(t)),

where ν(t) := −∫₀^t e(τ) dτ and the exponential in (30) is applied entrywise. By Lemma 3 and Lemma 4, we can construct from the appropriately rescaled limit of ν(t) a dual certificate μ such that

A*(μ) ⪯ I    and    ⟨I − A*(μ), X_∞⟩ = 0,

and

μ ∈ λ ∂‖s_∞‖_1.

This shows the existence of a dual certificate μ such that the optimality condition of Lemma 1 holds for (X_∞, s_∞). Hence, (X_∞, s_∞) is also a global optimal solution to (27). ∎
Lemma 1.
If (X, s) is an optimal solution for (27), then there exists a dual certificate μ ∈ ℝ^m such that

(31)    A*(μ) ⪯ I,    ⟨I − A*(μ), X⟩ = 0,    and    μ ∈ λ ∂‖s‖_1,

where ∂‖s‖_1 is the subdifferential of ‖·‖_1 at s, with each entry

(32)    (∂‖s‖_1)_i = {sign(s_i)}  if s_i ≠ 0,    and    (∂‖s‖_1)_i = [−1, 1]  if s_i = 0.
Proof.
Using ‖X‖_* = tr(X) for X ⪰ 0, the Lagrangian function of problem (27) can be written as

L(X, s, μ, Γ) = tr(X) + λ‖s‖_1 + ⟨μ, y − A(X) − s⟩ − ⟨Γ, X⟩,

with μ ∈ ℝ^m and Γ being the dual variables, where Γ ⪰ 0 is associated with the constraint X ⪰ 0. Thus, we can derive the Karush-Kuhn-Tucker (KKT) optimality conditions for (27) as

stationarity :    I − A*(μ) − Γ = 0    and    μ ∈ λ ∂‖s‖_1,
feasibility :    A(X) + s = y,    X ⪰ 0,    Γ ⪰ 0,
complementary slackness :    ⟨Γ, X⟩ = 0,

where ∂ denotes the subdifferential operator and ∂‖s‖_1 is the subdifferential of ‖·‖_1 at s, with each entry given by (32). Thus, we know that (X, s) is a global solution to (27) as long as there exists a dual certificate μ such that (31) holds, where we eliminated Γ by plugging in Γ = I − A*(μ). ∎
Lemma 2.
Suppose that the measurement matrices {A_i} commute and that the gradient flows satisfy (23) and (24) with the initialization in Theorem 2. Then X(t) and (g(t), h(t)) admit the closed-form expressions (29) and (30), respectively.

Proof.
From (24), we know that

d/dt log g_i(t) = −α e_i(t)    and    d/dt log h_i(t) = α e_i(t),    i = 1, …, m,

where the differentiation is entrywise for g(t) and h(t). Thus, we have

g(t) = g(0) ⊙ exp(−α ∫₀^t e(τ) dτ) = β exp(α ν(t)),

where all the operations are entrywise. Similarly, the expression for h(t) in (30) holds, and the expression (29) for X(t) follows from (28), the initialization X(0) = β²I, and the commutativity of {A_i} by the same argument. ∎
Lemma 3.
Proof.
Given , we have . By the expression for in (29), we have
(36) |
where . Because are symmetric and they commute, we know that they are simultaneously diagonalizable by an orthonormal basis , i.e.,
and so is for any . Therefore, we have
(37) |
where denotes the -th eigenvalue with respect to the -th basis . Moreover and its limit have the same eigen-basis . Since we have converges to when , then we have the eigenvalues
(38) |
whenever .
Case 1: .
Case 2: .
Putting things together.
Combining our analysis in (39) and (40), we obtain
On the other hand, per our analysis, we know that there exists an orthogonal matrix , such that and can be simultaneously diagonalized. Thus, we have
where and are diagonal matrices, with entries being the eigenvalues of and , respectively. From our analysis for Case 1 and Case 2, we know that . Therefore, we also have
as desired. ∎
Lemma 4.
Proof.
Let and be the th coordinate of and defined in (25), respectively. It follows from (30) that
When , we have
so that . This also implies that either or for any .
On the other hand, let us define
and let be the th coordinate of with
(43) |
Correspondingly, we know that and let be the th coordinate of . In the following, we leverage these to show that our construction of satisfies (42). We classify the entries of into three cases and analyze them as follows.
- Case 1: . Since , from (43) we must have when , so that and . Therefore, when , we have
- Case 2: . Since , from (43) we must have when , so that and . Therefore, when , we have
- Case 3: . Since , from (43) we must have and , when . Therefore, for any small , there exists some , such that for all , we have
which further implies that
Therefore, combining the results in the three cases above we obtain that
so that (42) holds. ∎
Appendix C Extra Experiments
Due to limited space in the main body, we provide here additional results for our experiments on robust image recovery presented in Section 4.2.
Varying corruption levels. In Figure 4, we display results of our method for robust image recovery with varying levels of salt-and-pepper corruption.
Learning curves for different test images and varying corruption levels. In Figure 5(a) and Figure 5(b), we provide the learning curves for the results in Figure 3 and Figure 4, respectively. While DIP-ℓ1 requires a case-dependent early termination to obtain its best performance, our method does not overfit and does not require any early termination. Overall, our method achieves both better PSNR and better visual quality, especially when the noise level is high.