
Comments on Efficient Singular
Value Thresholding Computation

Zhengyuan Zhou, Stern School of Business, New York University. Email: [email protected].    Yi Ma, UC Berkeley EECS. Email: [email protected]
Abstract

We discuss how to evaluate the proximal operator of a convex and increasing function of the nuclear norm, which forms the key computational step in several first-order optimization algorithms such as (accelerated) proximal gradient descent and ADMM. Various special cases of the problem arise in low-rank matrix completion, dropout training in deep learning, and high-order low-rank tensor recovery, although they have so far been solved on a case-by-case basis. We provide a unified and efficiently computable procedure for solving this problem.

1 Problem, Notation and Background

Proximal gradient descent (and its accelerated variant) (Beck and Teboulle, 2009; Combettes and Pesquet, 2011; Ma, 2012; Parikh and Boyd, 2014) provide an efficient way (with $O(\frac{1}{T})$ and $O(\frac{1}{T^{2}})$ convergence rates, respectively) to solve structured convex non-smooth optimization problems of the following form, which arise frequently in machine learning and structured signal recovery settings:

\min_{x}F(x)=g(x)+h(x),

where $g(x)$ is convex and smooth (i.e., continuously differentiable with bounded gradients) and $h(x)$ is convex but non-smooth. The presence of $h$ often results from non-smooth but convex regularizers, with the $l_{1}$-norm and the nuclear norm as two prominent examples. The computational bottleneck in each iteration of (accelerated) proximal gradient is evaluating the proximal operator (at a point $w$):

\text{prox}_{h}[w]=\arg\min_{x}\{h(x)+\frac{1}{2}\|x-w\|_{2}^{2}\},

which admits a unique solution due to strong convexity. Computationally, efficient (accelerated) proximal gradient is possible when and only when $\text{prox}_{h}[w]$ can be computed efficiently. Note that in addition to (accelerated) proximal gradient, evaluating proximal operators is also the key and computationally demanding step in methods such as ADMM (Luo, 2012; Yang and Yuan, 2013; Boyd et al., 2011) for large-scale equality-constrained optimization problems.
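To make the role of the proximal operator concrete, the following is a minimal Python sketch of (unaccelerated) proximal gradient; the names grad_g and prox_h are placeholders for a user-supplied gradient oracle and proximal operator (our notation, not from the references), and soft_threshold is the familiar entrywise prox of the $l_{1}$-norm mentioned above.

  import numpy as np

  def soft_threshold(w, eta):
      # Entrywise prox of eta*||.||_1: sign(w) * max(|w| - eta, 0).
      return np.sign(w) * np.maximum(np.abs(w) - eta, 0.0)

  def proximal_gradient(x0, grad_g, prox_h, eta, n_iters=100):
      # Iterates x_{k+1} = prox_{eta*h}(x_k - eta * grad g(x_k)).
      x = x0
      for _ in range(n_iters):
          x = prox_h(x - eta * grad_g(x), eta)
      return x

For instance, taking grad_g(x) = A.T @ (A @ x - b) and prox_h = soft_threshold recovers the classical ISTA iteration for $\min_{x}\frac{1}{2}\|Ax-b\|_{2}^{2}+\|x\|_{1}$.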

In this note, we focus on a class of non-smooth convex functions $h:\mathbf{R}^{n_{1}\times n_{2}}\rightarrow\mathbf{R}$ of the form $h(X)=f(\|X\|_{*})$ for some convex and increasing $f:\mathbf{R}\rightarrow\mathbf{R}$, with $\|X\|_{*}$ being the nuclear norm of $X$. Thus, our main question is how to efficiently compute the optimal solution $X^{*}$, where:

X^{*}=\arg\min_{X}\{\tau f(\|X\|_{*})+\frac{1}{2}\|X-Y\|_{F}^{2}\}. (1.1)

It turns out that a few special cases of this problem have played important roles in machine learning and signal processing. For instance, when $f(x)=x$, this problem occurs in low-rank matrix completion and recovery (Cai et al., 2010); when $f(x)=x^{2}$, it occurs in dropout training in deep learning (Cavazza et al., 2018); when $f(x)=e^{x}$, it occurs in high-order low-rank tensor recovery (Zhang et al., 2014). Further, analytical and/or efficiently computable solutions have been derived in these special cases. For instance, consider a matrix $Y\in\mathbf{R}^{n_{1}\times n_{2}}$ of rank $r$, whose singular value decomposition (SVD) is:

Y=U\Sigma V^{*},\quad \Sigma=\mathrm{diag}(\{\sigma^{i}_{Y}\}_{1\leq i\leq r}),\quad \sigma^{1}_{Y}\geq\sigma^{2}_{Y}\geq\dots\geq\sigma^{r}_{Y}>0,

where $U$ and $V$ are $n_{1}\times r$ and $n_{2}\times r$ matrices, respectively, with orthonormal columns. For each $\tau>0$, the soft thresholding (shrinkage) operator $\mathcal{D}_{\tau}$ is defined to be:

\mathcal{D}_{\tau}(Y)=U\mathcal{D}_{\tau}(\Sigma)V^{*},\quad \mathcal{D}_{\tau}(\Sigma)=\mathrm{diag}((\sigma^{i}_{Y}-\tau)_{+}),\quad t_{+}=\max(0,t).

Cai et al. (2010) show that soft thresholding provides an analytical solution when $f(x)=x$.

Lemma 1.

[Cai et al. (2010)] Given a matrix $Y\in\mathbf{R}^{n_{1}\times n_{2}}$ and a $\tau\geq 0$, consider the function $h(X)=\tau\|X\|_{*}+\frac{1}{2}\|X-Y\|^{2}_{F}$, $X\in\mathbf{R}^{n_{1}\times n_{2}}$. We have $\arg\min_{X}h(X)=\mathcal{D}_{\tau}(Y)$.
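As a quick illustration of Lemma 1, the following sketch evaluates $\mathcal{D}_{\tau}(Y)$, and hence $\text{prox}_{\tau\|\cdot\|_{*}}[Y]$, via a thin SVD; the function name svd_soft_threshold is ours, not from Cai et al. (2010).

  import numpy as np

  def svd_soft_threshold(Y, tau):
      # D_tau(Y) = U diag((sigma_i - tau)_+) V^*: shrink every singular value by tau.
      U, s, Vt = np.linalg.svd(Y, full_matrices=False)
      return (U * np.maximum(s - tau, 0.0)) @ Vt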

However, although the proximal mappings in these different cases have been solved separately, there has been no unified and efficiently computable scheme for solving the above problem for a general $f$ (one should not expect an analytical solution to exist for a general $f$). It turns out that soft singular value thresholding still works, and the threshold can be computed efficiently by a binary search over a family of one-dimensional equations that depend on the input data matrix $Y$ and $f(\cdot)$.

2 Main Results

We augment the list of non-increasing singular values $\{\sigma^{i}_{Y}\}_{1\leq i\leq r}$ of $Y$ with $\sigma^{r+1}_{Y}=-\infty$.

Lemma 2.

Let $g:\mathbf{R}\rightarrow\mathbf{R}$ be such that $g(x)\geq 0$, $g$ is increasing, and $g(0)\leq 1$. Suppose the rank of $Y$ is $r$ and $\tau<\sigma^{1}_{Y}$. Then there exists a unique integer $j$, with $1\leq j\leq r$, such that the solution $t_{j}$ to the following equation

g(\sum_{i=1}^{j}\sigma^{i}_{Y}-jt_{j})=\frac{t_{j}}{\tau} (2.1)

satisfies the constraint

\sigma^{j+1}_{Y}\leq t_{j}<\sigma^{j}_{Y}. (2.2)
Proof.

We first show that if at least one such $j$ exists, then such a $j$ (and hence $t_{j}$) is unique. Consider the set $J=\{j\mid t_{j}\text{ satisfies (2.1) and (2.2)}\}$. Assume $J\neq\emptyset$ and let $j^{*}$ be the smallest element of $J$. We argue that no $j^{*}+k$ with $1\leq k\leq r-j^{*}$ can be in $J$. Fix any such $k$ and suppose, for contradiction, that $j^{*}+k\in J$. That is:

g(\sum_{i=1}^{j^{*}+k}\sigma^{i}_{Y}-(j^{*}+k)t_{j^{*}+k})=\frac{t_{j^{*}+k}}{\tau}, (2.3)
\sigma^{j^{*}+k+1}_{Y}\leq t_{j^{*}+k}<\sigma^{j^{*}+k}_{Y}. (2.4)

Expanding the left-hand side of (2.3), we have

g(\sum_{i=1}^{j^{*}+k}\sigma^{i}_{Y}-(j^{*}+k)t_{j^{*}+k})\geq g(\sum_{i=1}^{j^{*}}\sigma^{i}_{Y}+k\sigma^{j^{*}+k}_{Y}-(j^{*}+k)t_{j^{*}+k})
=g(\sum_{i=1}^{j^{*}}\sigma^{i}_{Y}-j^{*}t_{j^{*}+k}+k(\sigma^{j^{*}+k}_{Y}-t_{j^{*}+k}))>g(\sum_{i=1}^{j^{*}}\sigma^{i}_{Y}-j^{*}t_{j^{*}+k})
>g(\sum_{i=1}^{j^{*}}\sigma^{i}_{Y}-j^{*}t_{j^{*}})=\frac{t_{j^{*}}}{\tau},

where the first inequality follows since the singular values are non-increasing, the second inequality follows from the assumption in (2.4) and the fact that $g$ is increasing, the last inequality follows from $t_{j^{*}}\geq\sigma^{j^{*}+1}_{Y}\geq\sigma^{j^{*}+k}_{Y}>t_{j^{*}+k}$, and the last equality follows from the definition of $t_{j^{*}}$.

Hence, it follows that

\frac{t_{j^{*}+k}}{\tau}=g(\sum_{i=1}^{j^{*}+k}\sigma^{i}_{Y}-(j^{*}+k)t_{j^{*}+k})>\frac{t_{j^{*}}}{\tau},

leading to $t_{j^{*}+k}>t_{j^{*}}$, a contradiction.

Next, we prove that $J$ is indeed non-empty.

First, we note that by the properties of $g$, a unique solution $t_{j}$ ($>0$) exists for $g(\sum_{i=1}^{j}\sigma^{i}_{Y}-jt_{j})=\frac{t_{j}}{\tau}$ for each $j$ with $1\leq j\leq r$; we denote this unique solution by $t_{j}$. Hence, it suffices to show that at least one $t_{j}$ satisfies $\sigma^{j+1}_{Y}\leq t_{j}<\sigma^{j}_{Y}$.

Again by the monotonicity of $g$ and $g(0)\leq 1$, it is easily seen that $\tau<\sigma^{1}_{Y}$ implies $t_{1}<\sigma^{1}_{Y}$. Now, if it also holds that $\sigma^{2}_{Y}\leq t_{1}$, then we are done. Otherwise, $t_{1}<\sigma^{2}_{Y}$. Under this assumption, we claim that $t_{1}<t_{2}$ and $t_{2}<\sigma^{2}_{Y}$. To prove $t_{1}<t_{2}$, assume for the sake of contradiction that $t_{1}\geq t_{2}$, leading to:

\frac{t_{1}}{\tau}=g(\sigma^{1}_{Y}-t_{1})<g(\sigma^{1}_{Y}+\sigma^{2}_{Y}-2t_{1})\leq g(\sigma^{1}_{Y}+\sigma^{2}_{Y}-2t_{2})=\frac{t_{2}}{\tau},

where the first inequality follows from $t_{1}<\sigma^{2}_{Y}$ and the second from the assumption $t_{1}\geq t_{2}$. This yields $t_{1}<t_{2}$, a contradiction, thereby establishing that $t_{1}<t_{2}$.

The desired inequality $t_{2}<\sigma^{2}_{Y}$ then follows since

g(\sigma^{1}_{Y}-t_{1})=\frac{t_{1}}{\tau}<\frac{t_{2}}{\tau}<g(\sigma^{1}_{Y}+\sigma^{2}_{Y}-t_{1}-t_{2})

implies $\sigma^{1}_{Y}-t_{1}<\sigma^{1}_{Y}+\sigma^{2}_{Y}-t_{1}-t_{2}$ by the monotonicity of $g$, hence yielding $t_{2}<\sigma^{2}_{Y}$.

If $t_{2}\geq\sigma^{3}_{Y}$, then the claim is established. If not, we can repeat this process inductively. More formally, suppose we have just finished the $j$-th iteration (the induction basis $j=1$ is verified above) and we have $t_{j}<\sigma^{j}_{Y}$. If it also holds that $t_{j}\geq\sigma^{j+1}_{Y}$, then the claim follows. If not, then we show that $t_{j+1}>t_{j}$ and $t_{j+1}<\sigma^{j+1}_{Y}$. First, assume on the contrary that $t_{j+1}\leq t_{j}$; then

\frac{t_{j}}{\tau}=g(\sum_{i=1}^{j}\sigma^{i}_{Y}-jt_{j})<g(\sum_{i=1}^{j+1}\sigma^{i}_{Y}-(j+1)t_{j+1})=\frac{t_{j+1}}{\tau},

where the inequality follows from $t_{j}<\sigma^{j+1}_{Y}$ (the case under consideration) together with the assumption $t_{j+1}\leq t_{j}$. Hence we reach a contradiction, establishing that $t_{j}<t_{j+1}$.

Next, we note that $t_{j+1}<\sigma^{j+1}_{Y}$ follows since

g(\sum_{i=1}^{j}\sigma^{i}_{Y}-jt_{j})=\frac{t_{j}}{\tau}<\frac{t_{j+1}}{\tau}<g(\sum_{i=1}^{j+1}\sigma^{i}_{Y}-jt_{j}-t_{j+1}),

where the last inequality follows from $jt_{j}<jt_{j+1}$. This implies $\sum_{i=1}^{j}\sigma^{i}_{Y}-jt_{j}<\sum_{i=1}^{j+1}\sigma^{i}_{Y}-jt_{j}-t_{j+1}$, which is equivalent to $t_{j+1}<\sigma^{j+1}_{Y}$.

Thus, we obtain a strictly increasing sequence $\{t_{j}\}$ with $t_{j}<\sigma^{j}_{Y}$. If $\sigma^{j+1}_{Y}\leq t_{j}<\sigma^{j}_{Y}$ holds at some iteration $j$, then such a $j$ certifies that $J$ is not empty. If $\sigma^{j+1}_{Y}\leq t_{j}<\sigma^{j}_{Y}$ never holds for $j$ up to $r-1$, then it must hold for $j=r$, since $-\infty=\sigma^{r+1}_{Y}\leq t_{r}<\sigma^{r}_{Y}$, again certifying that $J$ is not empty. ∎

Remark 1.

In addition to asserting the existence of a unique such $j^{*}$, the proof suggests a natural binary search algorithm for finding $j^{*}$ and the corresponding $t_{j^{*}}$; it is given in Algorithm 1. Note that the step “Compute $t_{j}$” can be done very efficiently by numerically solving $g(\sum_{i=1}^{j}\sigma^{i}_{Y}-jt_{j})=\frac{t_{j}}{\tau}$, even though there may not be an analytical solution.
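For instance, when $f(x)=x^{2}$ (so $g(x)=f^{\prime}(x)=2x$), equation (2.1) is linear in $t_{j}$ and yields the closed form $t_{j}=\frac{2\tau\sum_{i=1}^{j}\sigma^{i}_{Y}}{1+2j\tau}$; for a general $g$ one can fall back on a bracketing root finder, since the left-hand side of (2.1) is non-increasing in $t_{j}$ while the right-hand side increases from $0$, so the two sides must cross on $[0,\ \tau g(\sum_{i=1}^{j}\sigma^{i}_{Y})]$. The sketch below is one such implementation (the helper name solve_threshold and the use of scipy.optimize.brentq are our choices, not part of the paper).

  import numpy as np
  from scipy.optimize import brentq

  def solve_threshold(g, sigma, j, tau):
      # Solve g(sum_{i<=j} sigma_i - j*t) = t/tau for t >= 0,
      # assuming g >= 0, increasing, and g(0) <= 1, as in Lemma 2.
      s = float(np.sum(sigma[:j]))
      upper = tau * g(s)          # at t = upper the right-hand side already dominates
      if upper <= 0.0:
          return 0.0
      return brentq(lambda t: g(s - j * t) - t / tau, 0.0, upper)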

Lemma 3.

Algorithm 1 correctly computes the unique $j$ and $t_{j}$ guaranteed by Lemma 2.

Proof.

From the first part of the proof of Lemma 2, we know that if $j^{*}$ is the unique $j$ guaranteed by Lemma 2, then $t_{k}\geq\sigma^{k}_{Y}$ for all $k>j^{*}$. Thus, if $t_{M}<\sigma^{M}_{Y}$, then $M\leq j^{*}$; if moreover $\sigma^{M+1}_{Y}>t_{M}$, then $M\notin J$, so in fact $M<j^{*}$ and $j^{*}$ must lie in the right half $[M+1,R]$ of the unsearched space. Conversely, the second (inductive) part of the proof of Lemma 2 shows that $t_{j}<\sigma^{j}_{Y}$ for every $j\leq j^{*}$. Hence, if in the while loop we encounter $t_{M}\geq\sigma^{M}_{Y}$, then it must be that $M>j^{*}$, and $j^{*}$ lies in the left half $[L,M-1]$ of the unsearched space. In either case $j^{*}$ always lies between $L$ and $R$ while the search interval strictly shrinks, so the while loop eventually halts, returning $j^{*}$ and $t_{j^{*}}$. ∎

Algorithm 1 Generalized Singular Value Threshold Computation
  Input: $\tau$, $g(\cdot)$, and the singular values $\{\sigma^{i}_{Y}\}_{1\leq i\leq r}$ of $Y$ (with $\sigma^{r+1}_{Y}=-\infty$)
  Output: $j^{*}$ and $t_{j^{*}}$
  if $\tau\geq\sigma^{1}_{Y}$, Return $0$ and $\sigma^{1}_{Y}$
  Initialize $L=1$, $R=r$
  while $L\leq R$ do
     $M=\lceil\frac{L+R}{2}\rceil$, and compute the solution $t_{M}$ to the equation $g(\sum_{i=1}^{M}\sigma^{i}_{Y}-Mt_{M})=\frac{t_{M}}{\tau}$
     if $t_{M}<\sigma^{M}_{Y}$ then
        if $\sigma^{M+1}_{Y}\leq t_{M}$, Return $M$ and $t_{M}$
        else $L=M+1$, Continue
     else $R=M-1$, Continue
     end if
  end while
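For concreteness, here is a Python sketch of Algorithm 1; the function name generalized_threshold and the use of scipy.optimize.brentq for the “Compute $t_{M}$” step are our choices, and the singular values are assumed to be supplied in non-increasing order.

  import numpy as np
  from scipy.optimize import brentq

  def generalized_threshold(sigma, tau, g):
      # Algorithm 1: binary search for (j*, t_{j*}).
      # sigma: singular values of Y in non-increasing order; g as in Lemma 2 (g = f').
      r = len(sigma)
      if tau >= sigma[0]:
          return 0, sigma[0]

      def solve_t(j):
          # Solve g(sum_{i<=j} sigma_i - j*t) = t/tau on [0, tau*g(sum_{i<=j} sigma_i)].
          s = float(np.sum(sigma[:j]))
          upper = tau * g(s)
          return brentq(lambda t: g(s - j * t) - t / tau, 0.0, upper) if upper > 0 else 0.0

      L, R = 1, r
      while L <= R:
          M = (L + R + 1) // 2                # ceil((L + R) / 2)
          t_M = solve_t(M)
          if t_M < sigma[M - 1]:              # t_M < sigma^M_Y
              next_sv = sigma[M] if M < r else -np.inf   # sigma^{M+1}_Y, with sigma^{r+1}_Y = -inf
              if next_sv <= t_M:
                  return M, t_M               # M = j*
              L = M + 1                       # j* lies in the right half
          else:
              R = M - 1                       # j* lies in the left half
      raise ValueError("assumptions of Lemma 2 appear to be violated")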
Definition 1.

Given $\tau>0$, the generalized singular value thresholding operator $\mathcal{H}_{\tau}$ is defined to be:

\mathcal{H}_{\tau}(Y)=U\mathcal{D}_{t_{j^{*}}}(\Sigma)V^{*},\quad Y=U\Sigma V^{*}\in\mathbf{R}^{n_{1}\times n_{2}},

where $t_{j^{*}}$ is the threshold computed by Algorithm 1.

Lemma 2 guarantees that $\mathcal{H}_{\tau}$ is well-defined, and Algorithm 1 guarantees that $\mathcal{H}_{\tau}$ is efficiently computable. Having defined $\mathcal{H}_{\tau}$, the main result is:
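Putting the pieces together, $\mathcal{H}_{\tau}$ can be sketched as follows; generalized_svt is our name, generalized_threshold refers to the Algorithm 1 sketch above, and the assembly simply mirrors $\mathcal{D}_{\tau}$ with the data-dependent threshold $t_{j^{*}}$.

  import numpy as np
  # generalized_threshold: the Algorithm 1 sketch given above.

  def generalized_svt(Y, tau, g):
      # H_tau(Y) = U D_{t_{j*}}(Sigma) V^*, with the threshold t_{j*} from Algorithm 1.
      U, s, Vt = np.linalg.svd(Y, full_matrices=False)
      r = int(np.sum(s > 0))    # rank of Y; in floating point one would use a small tolerance
      j_star, t = generalized_threshold(s[:r], tau, g)
      return (U[:, :r] * np.maximum(s[:r] - t, 0.0)) @ Vt[:r, :]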

Theorem 1.

Let $f:\mathbf{R}\rightarrow\mathbf{R}$ be any convex, increasing, differentiable function with an increasing derivative satisfying $f^{\prime}(0)\leq 1$. Given a $\tau>0$ and a $Y\in\mathbf{R}^{n_{1}\times n_{2}}$, we have:

\mathcal{H}_{\tau}(Y)=\arg\min_{X}\{\tau f(\|X\|_{*})+\frac{1}{2}\|X-Y\|^{2}_{F}\}.
Proof.

To prove this theorem, we build on the techniques introduced in Cai et al. (2010).

The function $h(X)=\tau f(\|X\|_{*})+\frac{1}{2}\|X-Y\|^{2}_{F}$ is strictly convex, since it is the sum of a convex function and a strictly convex function. As a result, the minimizer $\hat{X}$ of $h(X)$ is unique, and it suffices to show that $\mathcal{H}_{\tau}(Y)$ is a minimizer.

Per the definition of a subgradient, $S$ is a subgradient of a convex function $f$ at $X_{0}$ if $f(X)\geq f(X_{0})+\langle S,X-X_{0}\rangle$ for all $X$; the set of subgradients of $f$ at $X_{0}$ is denoted $\partial f(X_{0})$. Recall that the set $\partial\|X_{0}\|_{*}$ of subgradients of the nuclear norm at $X_{0}$ is

\partial\|X_{0}\|_{*}=\{U_{X_{0}}V^{*}_{X_{0}}+W\mid W\in\mathbf{R}^{n_{1}\times n_{2}},\ U^{*}_{X_{0}}W=0,\ WV_{X_{0}}=0,\ \|W\|_{2}\leq 1\},

where the SVD of $X_{0}$ is $U_{X_{0}}\Sigma_{X_{0}}V^{*}_{X_{0}}$ and $\|W\|_{2}$ is the top singular value of $W$.

First, it is easy to check, by the composition rule for subgradients, that $f^{\prime}(\|X_{0}\|_{*})(U_{X_{0}}V^{*}_{X_{0}}+W)\in\partial(f(\|X_{0}\|_{*}))$ for any $W\in\mathbf{R}^{n_{1}\times n_{2}}$ with $U^{*}_{X_{0}}W=0$, $WV_{X_{0}}=0$, $\|W\|_{2}\leq 1$. Hence, $\tau g(\|\hat{X}\|_{*})(U_{\hat{X}}V^{*}_{\hat{X}}+W)+\hat{X}-Y$ is a subgradient of $h$ at $\hat{X}$ for any $W$ satisfying $U^{*}_{\hat{X}}W=0$, $WV_{\hat{X}}=0$, $\|W\|_{2}\leq 1$, where $g=f^{\prime}$ and $g$ satisfies the assumptions of Lemma 2. Moreover, if there exists such a $W$ for which $0=\tau g(\|\hat{X}\|_{*})(U_{\hat{X}}V^{*}_{\hat{X}}+W)+\hat{X}-Y$, or equivalently

Y-\hat{X}=\tau g(\|\hat{X}\|_{*})(U_{\hat{X}}V^{*}_{\hat{X}}+W), (2.5)

then $\hat{X}$ is a minimizer (hence the unique minimizer) of $h(X)$.

We now establish that, with $\hat{X}=\mathcal{H}_{\tau}(Y)$, Eq. (2.5) does hold with a $W$ satisfying the given constraints. First, we consider the case $\tau<\sigma^{1}_{Y}$.

By Lemma 2, since $t_{j^{*}}$ satisfies the equation $\frac{t_{j^{*}}}{\tau}=g(\sum_{i=1}^{j^{*}}\sigma^{i}_{Y}-j^{*}t_{j^{*}})$, we have $t_{j^{*}}=\tau g(\sum_{i=1}^{j^{*}}(\sigma^{i}_{Y}-t_{j^{*}}))$. Since $\sigma^{j^{*}+1}_{Y}\leq t_{j^{*}}<\sigma^{j^{*}}_{Y}$, $Y$'s last $r-j^{*}$ singular values ($\sigma^{k}_{Y}$ with $k\geq j^{*}+1$) are set to $0$ by the thresholding, so that $\|\hat{X}\|_{*}=\sum_{i=1}^{j^{*}}(\sigma^{i}_{Y}-t_{j^{*}})$. Therefore, $t_{j^{*}}=\tau g(\|\hat{X}\|_{*})$.

Next, we partition $Y$ as follows:

Y=U_{a}\Sigma_{a}V_{a}^{*}+U_{b}\Sigma_{b}V_{b}^{*},

where $U_{a}$'s columns and $V_{a}$'s columns are the first $j^{*}$ left and right singular vectors, respectively, associated with the first $j^{*}$ singular values (i.e., singular values larger than $t_{j^{*}}$), while $U_{b}$'s columns and $V_{b}$'s columns are the remaining $r-j^{*}$ left and right singular vectors, respectively, associated with the remaining $r-j^{*}$ singular values (i.e., singular values less than or equal to $t_{j^{*}}$).

Under this partition, it is easily seen that $\hat{X}=U_{a}(\Sigma_{a}-t_{j^{*}}\mathbf{I})V_{a}^{*}$. We then have

Y-\hat{X}=t_{j^{*}}U_{a}V_{a}^{*}+U_{b}\Sigma_{b}V_{b}^{*}
=\tau g(\|\hat{X}\|_{*})U_{a}V_{a}^{*}+U_{b}\Sigma_{b}V_{b}^{*}
=\tau g(\|\hat{X}\|_{*})(U_{a}V_{a}^{*}+\frac{1}{t_{j^{*}}}U_{b}\Sigma_{b}V_{b}^{*}).

Choose $W=\frac{1}{t_{j^{*}}}U_{b}\Sigma_{b}V_{b}^{*}$. By construction, $U_{a}^{*}U_{b}=0$ and $V_{b}^{*}V_{a}=0$, hence $U^{*}_{\hat{X}}W=0$ and $WV_{\hat{X}}=0$. In addition, since $t_{j^{*}}\geq\sigma^{j^{*}+1}_{Y}$, we have $\|W\|_{2}=\frac{\sigma^{j^{*}+1}_{Y}}{t_{j^{*}}}\leq 1$. The $W$ thus chosen therefore satisfies the constraints, establishing the claim.

Now, if $\tau\geq\sigma^{1}_{Y}$, then Algorithm 1 returns $j^{*}=0$ and the threshold $t_{j^{*}}=\sigma^{1}_{Y}$, so that $\hat{X}=\mathcal{H}_{\tau}(Y)=0$. Choosing $W=\tau^{-1}Y$, it is easily seen that $W$ satisfies the constraints. Verification of Eq. (2.5) is immediate when $\hat{X}=0$ and $W=\tau^{-1}Y$. ∎
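As a sanity check (ours, not part of the paper), for $f(x)=x$ we have $g\equiv 1$, equation (2.1) gives $t_{j}=\tau$ for every $j$, and $\mathcal{H}_{\tau}$ should reduce to the soft thresholding operator $\mathcal{D}_{\tau}$ of Lemma 1; for $f(x)=x^{2}$ the computed threshold should match the closed form noted following Remark 1. The snippet below exercises the sketches defined earlier (generalized_svt, generalized_threshold, and svd_soft_threshold are those hypothetical helpers).

  import numpy as np

  rng = np.random.default_rng(0)
  Y = rng.standard_normal((50, 40))
  tau = 1.5

  # f(x) = x  =>  g = f' = 1: H_tau should coincide with D_tau (Lemma 1).
  diff = generalized_svt(Y, tau, g=lambda x: 1.0) - svd_soft_threshold(Y, tau)
  print(np.linalg.norm(diff))                     # expected to be numerically zero

  # f(x) = x^2  =>  g(x) = 2x: the threshold should match 2*tau*sum_{i<=j*} sigma_i / (1 + 2*j*tau).
  s = np.linalg.svd(Y, compute_uv=False)
  j_star, t = generalized_threshold(s, tau, g=lambda x: 2.0 * x)
  print(t, 2 * tau * np.sum(s[:j_star]) / (1 + 2 * j_star * tau))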

References

  • Beck, A. and Teboulle, M. (2009). Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications, pages 42–88.
  • Boyd, S. P., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
  • Cai, J., Candès, E., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982.
  • Cavazza, J., Morerio, P., Haeffele, B., Lane, C., Murino, V., and Vidal, R. (2018). Dropout as a low-rank regularizer for matrix factorization. In International Conference on Artificial Intelligence and Statistics, pages 435–444. PMLR.
  • Combettes, P. L. and Pesquet, J.-C. (2011). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer.
  • Luo, Z. (2012). On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922.
  • Ma, S. (2012). Alternating proximal gradient method for convex minimization. Preprint, Optimization Online.
  • Parikh, N. and Boyd, S. (2014). Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239.
  • Yang, J. and Yuan, X. (2013). Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation, 82(281):301–329.
  • Zhang, X., Zhou, Z., Wang, D., and Ma, Y. (2014). Hybrid singular value thresholding for tensor completion. In Twenty-Eighth AAAI Conference on Artificial Intelligence.