Comments on Efficient Singular Value Thresholding Computation
Abstract
We discuss how to evaluate the proximal operator of a convex and increasing function of the nuclear norm, which forms the key computational step in several first-order optimization algorithms such as (accelerated) proximal gradient descent and ADMM. Various special cases of the problem arise in low-rank matrix completion, dropout training in deep learning and high-order low-rank tensor recovery, but they have so far been solved on a case-by-case basis. We provide a unified and efficiently computable procedure for solving this problem.
1 Problem, Notation and Background
Proximal gradient descent and its accelerated variant (Beck and Teboulle, 2009; Combettes and Pesquet, 2011; Ma, 2012; Parikh and Boyd, 2014) provide an efficient way (with $O(1/k)$ and $O(1/k^2)$ convergence rates, respectively) to solve structured convex non-smooth optimization problems of the following form, which arise frequently in machine learning and structured signal recovery settings:
$$\min_{x}\;\; \ell(x) + g(x),$$
where $\ell$ is convex and smooth (i.e. continuously differentiable with Lipschitz continuous gradient) and $g$ is convex but non-smooth. The presence of $g$ often results from non-smooth but convex regularizers such as the $\ell_1$-norm or the nuclear norm, as two prominent examples. The computational bottleneck in each iteration of (accelerated) proximal gradient is evaluating the proximal operator (at a point $y$):
$$\operatorname{prox}_g(y) := \operatorname*{arg\,min}_{x}\;\; g(x) + \tfrac{1}{2}\|x - y\|_2^2,$$
which admits a unique solution due to strong convexity. Computationally, efficient (accelerated) proximal gradient is possible if and only if $\operatorname{prox}_g$ can be computed efficiently. Note that in addition to (accelerated) proximal gradient, evaluating proximal operators is also the key and computationally demanding step in methods such as ADMM (Luo, 2012; Yang and Yuan, 2013; Boyd et al., 2011) for large-scale equality-constrained optimization problems.
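To make the role of the proximal operator concrete, the following is a minimal sketch of a proximal gradient iteration; the fixed step size `eta`, the callables `grad_l` (gradient of $\ell$) and `prox_g` (evaluating $\operatorname{prox}_{\eta g}$), and the fixed iteration count are illustrative assumptions, not prescriptions from the text.

```python
def proximal_gradient(x0, grad_l, prox_g, eta, n_iters=100):
    """Minimize l(x) + g(x) via x_{t+1} = prox_{eta*g}(x_t - eta * grad_l(x_t))."""
    x = x0
    for _ in range(n_iters):
        # Each iteration makes exactly one call to the proximal operator,
        # which is the computational bottleneck discussed above.
        x = prox_g(x - eta * grad_l(x), eta)
    return x
```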
In this note, we focus on the class of non-smooth convex functions $g$ of the form $g(X) = f(\|X\|_*)$ for some convex and increasing $f : [0,\infty) \to \mathbb{R}$, with $\|X\|_*$ being the nuclear norm of $X$ (the sum of its singular values). Thus, our main question is how to efficiently compute the optimal solution $\hat{X} = \operatorname{prox}_g(Y)$, where:
$$\hat{X} = \operatorname*{arg\,min}_{X}\;\; f(\|X\|_*) + \tfrac{1}{2}\|X - Y\|_F^2. \tag{1.1}$$
It turns out a few special cases of this problem have played important roles in machine learning and signal processing. For instance, when $f$ is linear, this problem occurs in low-rank matrix completion and recovery problems (Cai et al., 2010); when $f$ is quadratic, this problem occurs in dropout training in deep learning (Cavazza et al., 2018); closely related instances occur in high-order low-rank tensor recovery (Zhang et al., 2014). Further, analytical and/or efficiently computable solutions have been derived in these special cases. For instance, consider a matrix $Y \in \mathbb{R}^{n_1 \times n_2}$ of rank $r$, whose singular value decomposition (SVD) is:
$$Y = U \Sigma V^\top, \qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad \sigma_1 \ge \cdots \ge \sigma_r > 0,$$
where $U$ and $V$ are $n_1 \times r$ and $n_2 \times r$ matrices respectively with orthonormal columns. For each $\tau \ge 0$, the soft thresholding (shrinkage) operator $\mathcal{D}_\tau$ is defined to be:
$$\mathcal{D}_\tau(Y) := U \operatorname{diag}\big((\sigma_i - \tau)_+\big) V^\top, \qquad (x)_+ := \max(x, 0).$$
Cai et al. (2010) show that soft thresholding provides an analytical solution when $f$ is linear, i.e. $f(x) = \tau x$.
Lemma 1.
[Cai et al., 2010] Given a matrix $Y$ and a $\tau \ge 0$, consider the function $h_\tau(X) := \tau\|X\|_* + \tfrac{1}{2}\|X - Y\|_F^2$. We have $\mathcal{D}_\tau(Y) = \operatorname*{arg\,min}_X h_\tau(X)$.
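For concreteness, here is a minimal NumPy sketch of the operator $\mathcal{D}_\tau$; the function name and the use of a dense SVD are illustrative choices, not part of the original presentation.

```python
import numpy as np

def soft_threshold(Y, tau):
    """Singular value soft thresholding: D_tau(Y) = U diag((sigma_i - tau)_+) V^T."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # Shrink each singular value by tau and clip at zero, then recombine.
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

By Lemma 1, `soft_threshold(Y, tau)` evaluates the proximal operator of $\tau\|\cdot\|_*$ at $Y$.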
However, although the proximal mappings in these different cases have been derived separately, there is a lack of a unified and efficiently computable scheme for solving the above problem for a general $f$ (one should not expect an analytical solution to exist for a general $f$). It turns out that soft singular value thresholding still works, and the threshold can be computed efficiently by a binary search over a system of one-dimensional equations that depend on the input data matrix $Y$ and $f$.
2 Main Results
We augment the list $\sigma_1 \ge \cdots \ge \sigma_r > 0$ of non-increasing singular values of $Y$ with $\sigma_{r+1} := 0$.
Lemma 2.
Let $f$ be differentiable and such that $f'$ is increasing and $f'(0) \ge 0$. Suppose the rank of $Y$ is $r$ and $\sigma_1 > f'(0)$. There exists a unique integer $k$, with $1 \le k \le r$, such that the solution $\tau$ to the following equation
$$f'\Big(\sum_{i=1}^{k}(\sigma_i - \tau)\Big) = \tau \tag{2.1}$$
satisfies the constraint
$$\sigma_{k+1} \le \tau < \sigma_k. \tag{2.2}$$
Proof.
We first show that if at least one such $k$ exists, then such a $k$ (and hence $\tau$) is unique. Consider the set
$$S := \big\{\, k \in \{1,\dots,r\} \;:\; \text{the solution } \tau_k \text{ of (2.1) satisfies (2.2)} \,\big\}$$
(for each fixed $k$, (2.1) has at most one solution, since its left side is non-increasing in $\tau$ while its right side is strictly increasing). Assume $S \neq \emptyset$, and let $k^*$ be the smallest element in $S$. Now we argue that no $l$, $l > k^*$, can be in $S$. Consider any $l$ with $l > k^*$. Suppose for contradiction $l \in S$. That is:
$$\tau_l = f'\Big(\sum_{i=1}^{l}(\sigma_i - \tau_l)\Big), \tag{2.3}$$
$$\sigma_{l+1} \le \tau_l < \sigma_l. \tag{2.4}$$
Expanding on the right side of (2.3), we have
$$\tau_l = f'\Big(\sum_{i=1}^{l}(\sigma_i - \tau_l)\Big) \;\ge\; f'\Big(\sum_{i=1}^{k^*}(\sigma_i - \tau_l) + (l - k^*)(\sigma_l - \tau_l)\Big) \;\ge\; f'\Big(\sum_{i=1}^{k^*}(\sigma_i - \tau_l)\Big) \;\ge\; f'\Big(\sum_{i=1}^{k^*}(\sigma_i - \tau_{k^*})\Big) \;=\; \tau_{k^*},$$
where the first inequality follows from the non-increasing values of the singular values ($\sigma_i \ge \sigma_l$ for $i \le l$) together with the monotonicity of $f'$, the second inequality follows from the assumption in (2.4) (which gives $\sigma_l - \tau_l > 0$) and that $f'$ is increasing, the last inequality follows from $\tau_l < \sigma_l \le \sigma_{k^*+1} \le \tau_{k^*}$, and the last equality follows from the definition of $\tau_{k^*}$.
Hence, it follows that $\tau_l \ge \tau_{k^*}$, leading to $\sigma_l > \tau_l \ge \tau_{k^*} \ge \sigma_{k^*+1} \ge \sigma_l$, hence a contradiction.
Next, we prove that $S$ is indeed not empty.

First, we note that by the properties of $f'$ (it is increasing with $f'(0) \ge 0$), a unique solution $\tau_k \ge 0$ exists for (2.1), for each $k$ satisfying $1 \le k \le r$: the left side of (2.1) is non-increasing in $\tau$, the right side is strictly increasing, and the left side dominates the right side at $\tau = 0$. We denote by $\tau_k$ the unique solution corresponding to each $k$. Hence, it suffices to show at least one $\tau_k$ satisfies (2.2).

Again by monotonicity of $f'$, it is easily seen that the assumption $\sigma_1 > f'(0)$ implies that $\tau_1 < \sigma_1$. Now suppose it also holds that $\tau_1 \ge \sigma_2$; then $k = 1$ satisfies (2.2) and we are done. Otherwise, we have $\tau_1 < \sigma_2$. Under this assumption, we claim that $\tau_2 < \sigma_2$ and $\tau_2 \ge \tau_1$. To prove $\tau_2 < \sigma_2$, assume for the sake of contradiction that $\tau_2 \ge \sigma_2$, leading to:
$$\tau_2 = f'\big((\sigma_1 - \tau_2) + (\sigma_2 - \tau_2)\big) \;\le\; f'(\sigma_1 - \tau_2) \;\le\; f'(\sigma_1 - \tau_1) \;=\; \tau_1,$$
where the first inequality follows from $\sigma_2 \le \tau_2$ and the second from $\tau_1 < \sigma_2 \le \tau_2$ (both using that $f'$ is increasing). Hence $\tau_2 \le \tau_1 < \sigma_2 \le \tau_2$, and we reach a contradiction, establishing that $\tau_2 < \sigma_2$.

The desired inequality $\tau_2 \ge \tau_1$ then follows since $\tau_1 < \sigma_2$ implies
$$f'\big((\sigma_1 - \tau_1) + (\sigma_2 - \tau_1)\big) \;\ge\; f'(\sigma_1 - \tau_1) \;=\; \tau_1$$
by monotonicity of $f'$, hence yielding $\tau_2 \ge \tau_1$ (the left side of (2.1) with $k = 2$ is non-increasing in $\tau$ while its right side is strictly increasing, so their crossing point cannot be smaller than $\tau_1$).
If $\tau_2 \ge \sigma_3$, then the claim is established. If not, we can repeat this process inductively. More formally, suppose we have just finished the $k$-th iteration (note that the induction basis is verified above) and we have $\tau_k < \sigma_k$. If it also holds that $\tau_k \ge \sigma_{k+1}$, then the claim follows. If not, then $\tau_k < \sigma_{k+1}$, and we show $\tau_{k+1} < \sigma_{k+1}$ and $\tau_{k+1} \ge \tau_k$. First, assume on the contrary that $\tau_{k+1} \ge \sigma_{k+1}$. Then
$$\tau_{k+1} = f'\Big(\sum_{i=1}^{k+1}(\sigma_i - \tau_{k+1})\Big) \;\le\; f'\Big(\sum_{i=1}^{k}(\sigma_i - \tau_{k+1})\Big) \;\le\; f'\Big(\sum_{i=1}^{k}(\sigma_i - \tau_k)\Big) \;=\; \tau_k \;<\; \sigma_{k+1} \;\le\; \tau_{k+1},$$
where the first inequality follows from $\sigma_{k+1} \le \tau_{k+1}$ and the second from $\tau_k < \sigma_{k+1} \le \tau_{k+1}$ (both using that $f'$ is increasing). Hence we reach a contradiction, establishing that $\tau_{k+1} < \sigma_{k+1}$.

Next, we note that $\tau_{k+1} \ge \tau_k$ follows since
$$f'\Big(\sum_{i=1}^{k+1}(\sigma_i - \tau_k)\Big) \;\ge\; f'\Big(\sum_{i=1}^{k}(\sigma_i - \tau_k)\Big) \;=\; \tau_k$$
(where the inequality follows due to $\sigma_{k+1} - \tau_k > 0$ and the monotonicity of $f'$), implying that the left side of (2.1), with $k+1$ in place of $k$ and evaluated at $\tau_k$, is at least $\tau_k$, which is equivalent to $\tau_{k+1} \ge \tau_k$.

Thus, we have an increasing sequence $\tau_1 \le \tau_2 \le \cdots$ with $\tau_k < \sigma_k$ at every step. If it holds that $\tau_k \ge \sigma_{k+1}$ at some iteration $k < r$, then such a $k$ certifies that $S$ is not empty. If $\tau_k \ge \sigma_{k+1}$ never holds for $k$ up to $r - 1$, then it must hold for $k = r$, since $\sigma_{r+1} = 0 \le \tau_r$, also certifying that $S$ is not empty. ∎
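As an added illustration of the system (2.1)–(2.2) (this particular choice of $f$ is an assumption made for the example, not part of the lemma), take the quadratic $f(x) = \lambda x^2$ with $\lambda > 0$, so that $f'(x) = 2\lambda x$. Then (2.1) is linear in $\tau$ and admits the closed-form solution
$$\tau_k = \frac{2\lambda \sum_{i=1}^{k} \sigma_i}{1 + 2\lambda k}$$
for each $k$, and Lemma 2 asserts that exactly one $k$ places this $\tau_k$ in the interval $[\sigma_{k+1}, \sigma_k)$.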
Remark 1.
In addition to asserting the unique existence of such a $k$, the proof suggests a natural binary search algorithm to find such a $k$ and the corresponding $\tau$. The algorithm is given in Algorithm 1. Note that the step “Compute $\tau_M$” can be easily done very efficiently by numerically solving the one-dimensional equation (2.1) with $k = M$, even though there may not be any analytical solution.
Proof (correctness of Algorithm 1).
From the first part of the proof of Lemma 2, we know that if $k^*$ is the unique $k$ guaranteed by Lemma 2, then for all $l > k^*$, we have $\tau_l \ge \sigma_l$ (the argument there uses only the upper bound $\tau_l < \sigma_l$ from (2.4)). Thus, if $\tau_M < \sigma_M$, then we know that $k^*$ cannot be less than $M$. That is, $k^*$ must be in the second half of the unsearched space. Conversely, if we hypothetically do a sequential search, then it follows immediately from the second part of the proof of Lemma 2 that before $k$ reaches $k^*$, $\tau_k < \sigma_{k+1}$ must hold. This establishes that if in the while loop we encounter $\tau_M \ge \sigma_{M+1}$, then it must be the case that $M \ge k^*$. That is, $k^*$ must lie in the first part of the unsearched space. It then follows that $k^*$ always lies between the current lower and upper endpoints of the search interval, establishing that the while loop will eventually halt, returning $k^*$ and $\tau_{k^*}$.
∎
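The following Python sketch follows the binary search described in Remark 1 together with the correctness argument above; it is not a verbatim transcription of Algorithm 1, and the helper names (`solve_tau`, `find_threshold`), the bisection-based scalar solver, the tolerance, and the handling of the degenerate case $f'(0) \ge \sigma_1$ are all assumptions made for illustration.

```python
import numpy as np

def solve_tau(fprime, sigma, k, tol=1e-10):
    """Numerically solve f'(sum_{i<=k} (sigma_i - tau)) = tau by bisection.

    The argument of f' is clipped at 0, which amounts to extending f'
    to negative arguments by the constant value f'(0).
    """
    s = float(np.sum(sigma[:k]))

    def psi(tau):
        # Left side minus right side of (2.1); strictly decreasing in tau.
        return fprime(max(s - k * tau, 0.0)) - tau

    lo, hi = 0.0, fprime(s) + tol          # psi(lo) >= 0 >= psi(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def find_threshold(fprime, sigma, tol=1e-10):
    """Binary search for the index k of Lemma 2 and the corresponding threshold tau.

    `sigma` holds the positive singular values of Y in non-increasing order
    (0-indexed), implicitly augmented with sigma_{r+1} = 0.
    """
    sigma = np.asarray(sigma, dtype=float)
    r = len(sigma)
    if fprime(0.0) >= sigma[0]:
        # Degenerate case outside Lemma 2: every singular value is thresholded away.
        return 0, fprime(0.0)
    lo, hi = 1, r
    while True:
        M = (lo + hi) // 2
        tau = solve_tau(fprime, sigma, M, tol)
        sigma_next = sigma[M] if M < r else 0.0   # sigma_{M+1}, with sigma_{r+1} = 0
        if sigma_next <= tau < sigma[M - 1]:      # constraint (2.2) holds: k* = M
            return M, tau
        elif tau >= sigma[M - 1]:                 # tau_M >= sigma_M, so k* < M
            hi = M - 1
        else:                                     # tau_M < sigma_{M+1}, so k* > M
            lo = M + 1
```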
Definition 1.
Given $Y$ and an $f$ as above, the generalized singular value thresholding operator $\mathcal{D}^f$ is defined to be:
$$\mathcal{D}^f(Y) := \mathcal{D}_\tau(Y) = U \operatorname{diag}\big((\sigma_i - \tau)_+\big) V^\top,$$
where $\tau$ is the threshold computed by Algorithm 1.
Lemma 2 guarantees that $\mathcal{D}^f$ is well-defined and Algorithm 1 guarantees that $\mathcal{D}^f$ is efficiently computable. Having defined $\mathcal{D}^f$, the main result is:
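A minimal sketch of $\mathcal{D}^f$ in code, reusing the hypothetical helpers `find_threshold` and `soft_threshold` from the earlier sketches and assuming $Y \neq 0$:

```python
import numpy as np

def generalized_svt(Y, fprime, tol=1e-10):
    """Generalized singular value thresholding: threshold search followed by shrinkage."""
    sigma = np.linalg.svd(Y, compute_uv=False)
    sigma = sigma[sigma > 0]                      # positive singular values, non-increasing
    _, tau = find_threshold(fprime, sigma, tol)   # threshold of Lemma 2 / Algorithm 1
    return soft_threshold(Y, tau)
```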
Theorem 1.
Let $f$ be any convex, increasing, differentiable function, with an increasing derivative $f'$ satisfying $f'(0) \ge 0$. Given a matrix $Y$ with $\sigma_1 > f'(0)$ and such an $f$, we have:
$$\mathcal{D}^f(Y) = \operatorname*{arg\,min}_{X}\;\; f(\|X\|_*) + \tfrac{1}{2}\|X - Y\|_F^2.$$
Proof.
To prove this theorem, we build on the techniques introduced in Cai et al. (2010).
The function $h(X) := f(\|X\|_*) + \tfrac{1}{2}\|X - Y\|_F^2$ is strictly convex, since it is the sum of a convex function and a strictly convex function. As a result, the minimizer of $h$ is unique and it suffices to show that $\mathcal{D}^f(Y) = \mathcal{D}_\tau(Y)$ is one minimizer.
Per the definition of a subgradient, $Z$ is a subgradient of a convex function $\varphi$ at $X_0$ if $\varphi(X) \ge \varphi(X_0) + \langle Z, X - X_0 \rangle$ for all $X$; $\partial\varphi(X_0)$ is commonly used to denote the set of subgradients of $\varphi$ at $X_0$. Recall that the set of subgradients of the nuclear norm function at $X$ is:
$$\partial\|X\|_* = \big\{\, U_1 V_1^\top + W \;:\; U_1^\top W = 0,\; W V_1 = 0,\; \|W\|_2 \le 1 \,\big\},$$
where the SVD of $X$ is $X = U_1 \Sigma_1 V_1^\top$ and $\|W\|_2$ is the top singular value of $W$.
First, it is easy to check that $\partial f(\|X\|_*) = f'(\|X\|_*)\,\partial\|X\|_*$ for any convex, increasing and differentiable $f$, by the composition rule for the subgradient. Hence, $f'(\|X\|_*)\big(U_1 V_1^\top + W\big) + X - Y$ is a subgradient for $h$ at $X$, for any $W$ satisfying $U_1^\top W = 0$, $W V_1 = 0$ and $\|W\|_2 \le 1$, where $X = U_1 \Sigma_1 V_1^\top$ as above. Moreover, if there exists such a $W$ and it holds that this subgradient is zero, or equivalently that
$$Y - X = f'(\|X\|_*)\big(U_1 V_1^\top + W\big), \tag{2.5}$$
then $X$ is a minimizer (hence the unique minimizer) to $h$.
We now establish that, with $X = \mathcal{D}_\tau(Y)$, where $\tau$ and $k$ are given by Lemma 2, Eq. (2.5) does hold with a $W$ satisfying the given constraints. First, we consider the case that $\tau > 0$.
By Lemma 2, $\tau$ satisfies the equation $f'\big(\sum_{i=1}^{k}(\sigma_i - \tau)\big) = \tau$ together with $\sigma_{k+1} \le \tau < \sigma_k$. Since $\sigma_i \le \tau$ for all $i > k$, $\mathcal{D}_\tau(Y)$'s last singular values ($\sigma_i$ with $i > k$) have been set to 0, leading to that $\|\mathcal{D}_\tau(Y)\|_* = \sum_{i=1}^{k}(\sigma_i - \tau)$. Therefore, $f'(\|\mathcal{D}_\tau(Y)\|_*) = \tau$.
Next, we partition $U$ and $V$ as follows:
$$U = [\, U_1 \;\; U_2 \,], \qquad V = [\, V_1 \;\; V_2 \,],$$
where $U_1$'s columns and $V_1$'s columns are the first $k$ left and right singular vectors respectively, associated with the first $k$ singular values (i.e. singular values larger than $\tau$), while $U_2$'s columns and $V_2$'s columns are the remaining left and right singular vectors respectively, associated with the remaining singular values (i.e. singular values less than or equal to $\tau$). We partition $\Sigma_1 := \operatorname{diag}(\sigma_1, \dots, \sigma_k)$ and $\Sigma_2 := \operatorname{diag}(\sigma_{k+1}, \dots, \sigma_r)$ accordingly.
Under this partition, it is easily seen that $\mathcal{D}_\tau(Y) = U_1(\Sigma_1 - \tau I)V_1^\top$. We then have
$$Y - \mathcal{D}_\tau(Y) = U_1\Sigma_1V_1^\top + U_2\Sigma_2V_2^\top - U_1(\Sigma_1 - \tau I)V_1^\top = \tau\big(U_1V_1^\top + \tau^{-1}U_2\Sigma_2V_2^\top\big).$$
Choose $W = \tau^{-1}U_2\Sigma_2V_2^\top$. By construction, $U_1^\top W = 0$ and $WV_1 = 0$, hence we have $Y - \mathcal{D}_\tau(Y) = \tau\big(U_1V_1^\top + W\big) = f'\big(\|\mathcal{D}_\tau(Y)\|_*\big)\big(U_1V_1^\top + W\big)$. In addition, since the diagonal entries of $\Sigma_2$ are at most $\tau$, we have $\|W\|_2 \le 1$. Hence $W$ thus chosen satisfies the constraints, hence establishing the claim in the case $\tau > 0$. Finally, if $\tau = 0$, then $\sigma_i \le \tau = 0$ for $i > k$ forces $k = r$, so $\mathcal{D}_\tau(Y) = Y$ and $f'(\|\mathcal{D}_\tau(Y)\|_*) = \tau = 0$; Eq. (2.5) then holds trivially with $W = 0$, completing the proof. ∎
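As a small numerical sanity check of Theorem 1 (not part of the original argument), the sketch below evaluates the proximal operator under the quadratic choice $f(x) = \lambda x^2$ via the hypothetical `generalized_svt` helper above and verifies that small random perturbations do not decrease the objective $f(\|X\|_*) + \tfrac{1}{2}\|X - Y\|_F^2$:

```python
import numpy as np

def objective(X, Y, lam):
    # f(||X||_*) + 0.5 * ||X - Y||_F^2 with f(x) = lam * x^2
    return lam * np.linalg.norm(X, 'nuc') ** 2 + 0.5 * np.linalg.norm(X - Y, 'fro') ** 2

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 5))
lam = 0.3

X_hat = generalized_svt(Y, fprime=lambda x: 2.0 * lam * x)   # f'(x) = 2*lam*x

base = objective(X_hat, Y, lam)
perturbed = min(objective(X_hat + 1e-3 * rng.standard_normal(Y.shape), Y, lam)
                for _ in range(100))
assert base <= perturbed + 1e-9   # no small random perturbation should do better
```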
References
- Beck and Teboulle, (2009) Beck, A. and Teboulle, M. (2009). Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications, pages 42–88.
- Boyd et al., (2011) Boyd, S. P., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
- Cai et al., (2010) Cai, J., Candès, E., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982.
- Cavazza et al., (2018) Cavazza, J., Morerio, P., Haeffele, B., Lane, C., Murino, V., and Vidal, R. (2018). Dropout as a low-rank regularizer for matrix factorization. In International Conference on Artificial Intelligence and Statistics, pages 435–444. PMLR.
- Combettes and Pesquet, (2011) Combettes, P. L. and Pesquet, J.-C. (2011). Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering, pages 185–212. Springer.
- Luo, (2012) Luo, Z. (2012). On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922.
- Ma, (2012) Ma, S. (2012). Alternating proximal gradient method for convex minimization. Preprint of Optimization Online.
- Parikh and Boyd, (2014) Parikh, N. and Boyd, S. (2014). Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239.
- Yang and Yuan, (2013) Yang, J. and Yuan, X. (2013). Linearized augmented lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation, 82(281):301–329.
- Zhang et al., (2014) Zhang, X., Zhou, Z., Wang, D., and Ma, Y. (2014). Hybrid singular value thresholding for tensor completion. In Twenty-Eighth AAAI Conference on Artificial Intelligence. Citeseer.