
Orthogonally weighted $\ell_{2,1}$ regularization for rank-aware joint sparse recovery: algorithm and analysis

Armenak Petrosyan (School of Mathematics, Georgia Institute of Technology, 686 Cherry St NW, Atlanta, GA 30332, USA; apetrosyan3 (at) gatech.edu), Konstantin Pieper (pieperk (at) ornl.gov), and Hoang Tran (Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA; tranha (at) ornl.gov)
Abstract.

We propose and analyze an efficient algorithm for solving the joint sparse recovery problem using a new regularization-based method, named orthogonally weighted $\ell_{2,1}$ ($\mathit{ow}\ell_{2,1}$), which is specifically designed to take into account the rank of the solution matrix. This method has applications in feature extraction, matrix column selection, and dictionary learning, and it is distinct from the commonly used $\ell_{2,1}$ regularization and other existing regularization-based approaches because it can exploit the full rank of the row-sparse solution matrix, a key feature in many applications. We provide a proof of the method's rank-awareness, establish the existence of solutions to the proposed optimization problem, and develop an efficient algorithm for solving it, whose convergence is analyzed. We also present numerical experiments to illustrate the theory and demonstrate the effectiveness of our method on real-life problems.

Armenak Petrosyan’s research was partially funded by a Georgia Tech-ORNL SEED grant. Armenak Petrosyan also thanks Christopher Heil for continued support during his employment at Georgia Institute of Technology. Konstantin Pieper acknowledges support through the Compression Methods for Streaming Scientific Data project. Hoang Tran acknowledges support from the Scientific Discovery through Advanced Computing (SciDAC) program through the FASTMath Institute under Contract No. DE-AC02-05CH11231. This manuscript has been authored by UT-Battelle, LLC, under Contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

1. Introduction

The joint sparse recovery or multiple measurement vector (MMV) problem aims to find a solution to the matrix equation $AX=Y$, where $A\in\mathbb{R}^{M\times N}$ and $Y\in\mathbb{R}^{M\times K}$ are given, and $X\in\mathbb{R}^{N\times K}$ is an unknown $s$-row sparse matrix, meaning that at most $s$ rows of $X$ are non-zero.

The task of finding the solution of $AX=Y$ with the fewest non-zero rows in the noiseless case can be formulated as an optimization problem using the mixed $\ell_{2,0}$ norm:

(1) \min_{Z\in\mathbb{R}^{N\times K}}\ell_{2,0}(Z),\ \ \text{subject to}\ \ AZ=Y,

where $\ell_{2,0}(Z)$ is the number of non-zero rows in $Z$. The main drawback of this formulation is that the optimization problem is computationally intractable: it amounts to an exhaustive combinatorial search and is NP-hard. To overcome this limitation, the $\ell_{2,1}$ norm regularization method is commonly used to solve the joint sparse recovery problem [51]:

(2) \min_{Z\in\mathbb{R}^{N\times K}}\ell_{2,1}(Z),\ \ \text{subject to}\ \ AZ=Y,

where $\ell_{2,1}(Z)=\lVert Z\rVert_{2,1}=\sum_{n=1}^{N}\sqrt{z_{n}z_{n}^{T}}$ and $z_{n}$ is the $n$-th row of $Z$. This method generalizes the $\ell_{1}$ sparsifying penalty to the multiple vector setting and is computationally feasible. However, it has the drawback of being rank-blind, meaning it does not take into account the rank of the $s$-row sparse matrix [16]. In real-world scenarios, the data typically has maximal essential rank, equal to its essential sparsity.

In [42], it was argued that a rank-aware regularizer cannot be continuous on all of $\mathbb{R}^{N\times K}$ and, thus, cannot be convex there. A new method for the recovery of jointly sparse vectors was proposed in the form of the (non-convex) optimization problem

(3) \min_{Z\in\mathbb{R}^{N\times K}}\mathit{ow}\ell_{2,1}(Z)\text{ subject to }AZ=Y,

where the regularizer, which we here call the orthogonally weighted $\ell_{2,1}$ (or $\mathit{ow}\ell_{2,1}$), is defined as

\mathit{ow}\ell_{2,1}(Z)=\ell_{2,1}\left(Z(Z^{T}Z)^{\dagger/2}\right)=\sum_{n=1}^{N}\sqrt{z_{n}(Z^{T}Z)^{\dagger}z_{n}^{T}},

with $(\cdot)^{\dagger}$ denoting the Moore-Penrose pseudo-inverse and $(\cdot)^{\dagger/2}=((\cdot)^{\dagger})^{1/2}$ its square root for a given symmetric matrix. If the minimization is restricted to matrices of full rank, we obtain the functional $\mathit{ow}\ell_{2,1}(Z)=\ell_{2,1}(Z(Z^{T}Z)^{-1/2})$. The papers [36, 42] also discussed a strategy to reduce the joint sparse problem (1) to a simpler setting where the measurement matrix has full column rank. This transformation can be achieved through techniques such as the singular value decomposition (SVD) or other factorization methods. In particular, if $Y=Y^{\prime}Q$ where $Y^{\prime}\in\mathbb{R}^{M\times r}$ with ${\rm rank\,}(Y^{\prime})=r$ and $Q\in\mathbb{R}^{r\times K}$ with $QQ^{T}=I\in\mathbb{R}^{r\times r}$, then for any solution $Z^{\prime}$ of the optimization problem

\min_{Z\in\mathbb{R}^{N\times r},\,{\rm rank\,}(Z)=r}\;\ell_{2,0}(Z)\text{ subject to }AZ=Y^{\prime},

the matrix $\bar{Z}=Z^{\prime}Q$ is a solution of the original problem (1). This strategy is independent of the chosen regularization functional and can be used for the noisy problem as well to reduce it to an approximate full column rank problem, and the resulting reconstruction error can be controlled; see [42] for more details.
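For concreteness, the reduction $Y=Y'Q$ can be carried out with a thin SVD of the data matrix. The following is a minimal sketch of our own (not code from the paper), assuming NumPy; the function name reduce_data and the rank tolerance tol are illustrative choices.

```python
import numpy as np

def reduce_data(Y, tol=1e-10):
    """Factor Y = Y' @ Q with Q having orthonormal rows (Q @ Q.T = I_r).

    Y' = U_r * S_r has full column rank r = rank(Y) (up to the tolerance),
    so the joint sparse problem can be solved with the smaller data Y'.
    """
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    r = int(np.sum(S > tol * S[0]))          # numerical rank of Y
    Y_prime = U[:, :r] * S[:r]               # M x r, full column rank
    Q = Vt[:r, :]                            # r x K, Q @ Q.T = I_r
    return Y_prime, Q

# Any reconstruction Z' obtained for the data Y' then yields Z_bar = Z' @ Q for the data Y.
```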

However, [42] neither provided an affirmative conclusion on rank-awareness, nor a general algorithm for the implementation of $\mathit{ow}\ell_{2,1}$ regularization. The algorithm proposed in [42] relies on being able to determine the rank of the solution directly from the data and does not directly address the more practical case of noisy data discussed below. In this paper, we aim to address these two important open questions by providing: i) a proof of the rank-awareness of the $\mathit{ow}\ell_{2,1}$ regularization, making it the first regularization-based method possessing this property; and ii) a computationally efficient algorithm for solving the optimization problem, supported by rigorous convergence analysis.

The name of the functional derives from the following observation. If $Z=U\Sigma V^{T}$ is the compact singular value decomposition of $Z$, i.e., with $U\in\mathbb{R}^{N\times r}$, $V\in\mathbb{R}^{K\times r}$ having orthonormal columns, $r={\rm rank\,}(Z)$, and $\Sigma\in\mathbb{R}^{r\times r}$ diagonal and s.p.d., then $Z(Z^{T}Z)^{\dagger/2}=UV^{T}$; thus, the functional renormalizes $Z$ to its orthogonal component before applying the $\ell_{2,1}$ norm. Also,

\mathit{ow}\ell_{2,1}(Z)=\ell_{2,1}(UV^{T})=\ell_{2,1}(U),

in other words, $\mathit{ow}\ell_{2,1}(Z)$ is the $\ell_{2,1}$ norm of an orthonormal basis of the column space of $Z$ and its value is independent of the choice of the basis. This and many other properties of the $\mathit{ow}\ell_{2,1}$ functional, such as its local Lipschitz continuity, are discussed in detail in Appendix A. Consequently, when $Z$ is $s$-row sparse and of rank $s$, it can be seen that $\mathit{ow}\ell_{2,1}(Z)=\ell_{2,0}(Z)$; in other words, it interpolates the values of $\ell_{2,0}$ on such matrices. However, unlike the latter, it is feasible to develop an efficient algorithm for solving the optimization problem with the $\mathit{ow}\ell_{2,1}$ regularizer, as we shall see in this work.
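As an illustration (our own sketch, not the paper's numerical code), the identity $\mathit{ow}\ell_{2,1}(Z)=\ell_{2,1}(U)$ suggests evaluating the functional via a compact SVD; the helper ow_l21 and the rank tolerance tol below are assumptions of this sketch. The checks confirm the single-column case (where the functional reduces to $\ell_{1}/\ell_{2}$) and the value $s$ on $s$-row sparse matrices of rank $s$.

```python
import numpy as np

def ow_l21(Z, tol=1e-10):
    """Evaluate ow_l21(Z) = l_{2,1}(U), with U from the compact SVD Z = U S V^T."""
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    if S.size == 0 or S[0] == 0.0:
        return 0.0                       # ow_l21(0) = 0 by convention
    r = int(np.sum(S > tol * S[0]))      # numerical rank of Z
    return float(np.linalg.norm(U[:, :r], axis=1).sum())  # sum of row norms of U

# Sanity checks (illustrative):
rng = np.random.default_rng(0)
z = rng.standard_normal((5, 1))
assert np.isclose(ow_l21(z), np.abs(z).sum() / np.linalg.norm(z))  # K = 1: l1/l2

X = np.zeros((6, 2)); X[[1, 4], :] = rng.standard_normal((2, 2))
print(ow_l21(X))   # ~ 2.0 for a 2-row sparse matrix of rank 2
```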

In the presence of measurement noise, we consider the regularized formulation of (3)

(4) \min_{Z\in\mathbb{R}^{N\times K}}\left(\mathit{ow}\ell_{2,1}(Z)+\frac{1}{2\alpha}\|AZ-Y\|_{\textnormal{Fro}}^{2}\right).

The term $1/(2\alpha)\|AZ-Y\|_{\textnormal{Fro}}^{2}$ acts as a penalty for deviating from the measurements, where $\|\cdot\|_{\textnormal{Fro}}$ denotes the Frobenius norm of a matrix. The parameter $\alpha$ controls the strength of the regularization, and $1/(2\alpha)$ can be interpreted as a Lagrange multiplier for the constrained formulation

(5) \min_{Z\in\mathbb{R}^{N\times K}}\mathit{ow}\ell_{2,1}(Z)\quad\text{subject to }\|AZ-Y\|_{\textnormal{Fro}}\leq\delta.

To simplify notation, we will primarily consider the regularized formulation of the problem as given in (4).

1.1. Notational conventions

Here we introduce various notational conventions aimed at enhancing the clarity and readability of the content. The true value of the matrix of interest is represented by $X\in\mathbb{R}^{N\times K}$. The optimal solution of a given regularization problem is represented by $\bar{Z}\in\mathbb{R}^{N\times K}$. The rows of a given matrix $Z$ are denoted by $z_{n}\in\mathbb{R}^{K}$ for $n=1,\dots,N$. For a given symmetric positive semi-definite matrix $W$, the $W$-(semi-)norm of a vector $z\in\mathbb{R}^{K}$ is represented by $\|z\|_{W}=\sqrt{z^{T}Wz}=\lVert W^{1/2}z\rVert_{2}$, and the Euclidean norm is denoted by $\|z\|_{2}$. The $\ell_{W,p}$ norm for $p\in[1,\infty)$ of $Z$ is represented by

\ell_{W,p}(Z)=\lVert Z\rVert_{W,p}=\left(\sum_{n=1}^{N}\lVert z_{n}\rVert^{p}_{W}\right)^{1/p}

and for $p=\infty$ the sum is replaced by a maximum, as usual. Note that for $p=2$ we obtain $\lVert Z\rVert_{W,2}=\sqrt{\operatorname{tr}(ZWZ^{T})}$, and the special case for the Euclidean norm is also denoted as the Frobenius norm $\lVert\cdot\rVert_{2,2}=\lVert\cdot\rVert_{\textnormal{Fro}}$. The associated inner product on the space of matrices is denoted by $\langle Z,Z^{\prime}\rangle=\operatorname{tr}(Z^{T}Z^{\prime})$. Finally, by $\lVert\cdot\rVert_{\textnormal{op}}$ we denote the operator or induced matrix norm with respect to the Euclidean inner product. For a given matrix $M$, we denote by $M^{\dagger}$ its Moore-Penrose pseudo-inverse, and for a given symmetric non-negative matrix $M$ by $M^{\dagger/2}=(M^{\dagger})^{1/2}$ the square root of the pseudo-inverse. The main functional of interest is represented by $\mathit{ow}\ell_{2,1}(Z)=\ell_{2,1}(Z(Z^{T}Z)^{\dagger/2})=\lVert Z\rVert_{(Z^{T}Z)^{\dagger},1}$. Note that $U=Z(Z^{T}Z)^{\dagger/2}$ is related to the matrix sign function; see [27, Chapter 5], with the difference being that we use singular value calculus, as opposed to the standard (eigenvalue-based) matrix calculus. In the case of scalars or symmetric matrices the two definitions are identical.
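To make the weighted norms concrete, here is a minimal sketch of our own (assuming NumPy and a symmetric positive semi-definite W; the helper names w_norm and l_wp are not from the paper) of the $W$-norm and the $\ell_{W,p}$ norm used throughout.

```python
import numpy as np

def w_norm(z, W):
    """W-(semi-)norm of a vector z: sqrt(z^T W z)."""
    return float(np.sqrt(z @ W @ z))

def l_wp(Z, W, p=1):
    """l_{W,p} norm of Z: the p-norm of the vector of W-norms of the rows of Z."""
    row_norms = np.sqrt(np.einsum('nk,kl,nl->n', Z, W, Z))
    return float(np.linalg.norm(row_norms, ord=p))

# With W = I and p = 2 this recovers the Frobenius norm:
Z = np.arange(6.0).reshape(3, 2)
assert np.isclose(l_wp(Z, np.eye(2), p=2), np.linalg.norm(Z, 'fro'))
```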

1.2. Organization

The material is organized as follows: Section 2 discusses the applications of joint sparse recovery and related works. Section 3 proves the rank-awareness of $\mathit{ow}\ell_{2,1}$ regularization. Section 4 establishes conditions for the existence of global minimizers of (3) and (4); the question of how well the solution of the regularized problem approximates the true value remains open. In Section 5, we provide a family of regularizers, parameterized by an additional parameter $\gamma$, that connects the $\mathit{ow}\ell_{2,1}$ regularizer at $\gamma=0$ and the convex $\ell_{2,1}$ regularizer at $\gamma=1$, and establish first-order conditions for the generalized problem. This includes optimality conditions for the $\mathit{ow}\ell_{2,1}$ regularized problem in the situation where $Z^{T}Z$ has full rank. For other cases, we provide a convergence result for $\gamma\to 0$ and rely on the optimality conditions for the "relaxed problem" for $\gamma>0$. Section 6 proposes and analyzes a numerical algorithm for solving the optimization problem that is based either on a direct minimization of (4) with a variable metric proximal gradient method or on a combination of this with a continuation strategy in $\gamma$. Section 7 conducts experiments to demonstrate the performance of the algorithm. All proofs concerning various necessary properties of the regularizer $\mathit{ow}\ell_{2,1}(Z)$ have been moved to the appendix to improve readability.

2. Literature review

$\ell_{1}/\ell_{2}$ regularization. Our method represents a generalization of $\ell_{1}/\ell_{2}$ regularization, which uses the ratio of the $\ell_{1}$ and $\ell_{2}$ norms to model sparsity [28, 21]. Indeed, in the special case of single vector recovery (i.e., for $K=1$), the problems (3) and (4) are equivalent to the $\ell_{1}/\ell_{2}$ regularized problem due to

(6) \mathit{ow}\ell_{2,1}(Z)=\lVert Z(Z^{T}Z)^{\dagger/2}\rVert_{1}=\begin{cases}\lVert Z\rVert_{1}/\lVert Z\rVert_{2}&\text{for }Z\neq 0,\\ 0&\text{for }Z=0,\end{cases}\qquad\text{if }Z\in\mathbb{R}^{N\times 1}.

Theoretical analysis for $\ell_{1}/\ell_{2}$ has been developed in [56] for non-negative signals, in [43] with a local optimality condition, and in [46], where an analytic solution for the proximal operator is characterized, among others. Our proof of the existence of the global solution to (3) and (4) here is built on a generalization of the spherical section property, which has been studied in [57, 55] for the global optimality condition of $\ell_{1}/\ell_{2}$. The $\ell_{1}/\ell_{2}$ functional has been shown to enhance sparsity of the solutions and to outperform $\ell_{1}$ on many test problems such as nonnegative matrix factorization and blind deconvolution [31, 29]. However, the existing uniform recovery guarantees for $\ell_{1}/\ell_{2}$ remain too strict in the general, noisy setting (i.e., more restrictive than well-known $\ell_{1}$ recovery conditions such as the null space property), despite many recent theoretical advances.

Joint sparse recovery. The joint sparse recovery problem, also known as multiple measurement vectors, distributed compressed sensing or simultaneous sparse approximation, has garnered a lot of attention in the literature. Mathematically, this problem arises when we want to reconstruct multiple vectors which have the same sparsity support using the same set of linear combinations for each. While one can certainly recover each vector independently, approximating them all at once should give superior results because we can take advantage of additional information about their shared support. To realize this potential, however, a joint sparse recovery algorithm is desired to be rank aware, that is, its performance improves with increasing rank of the solution matrix.

Existing approaches for joint sparse recovery can be roughly categorized into greedy methods [14, 48, 25] and algorithms based on mixed norm optimization [14, 47, 23, 19, 50]. Pioneering joint sparse methods are often straightforward extensions of algorithms for single vector recovery, whose recovery conditions are based on standard compressed sensing properties and typically turn out to be the same as those obtained from a single measurement problem. In such cases, the theoretical results do not show any advantage of the joint sparse approach. One of the first methods exploiting rank information is a discrete version of the MUSIC (MUltiple SIgnal Classification) algorithm [22], where recovery guarantees better than those from the single vector problem are proved when the solution matrix has full column rank. Extensions of this work to non-full-rank cases have been developed in [33] (subspace-augmented MUSIC), [30] (compressive MUSIC) and [54] (semi-supervised MUSIC). Also inspired by the MUSIC technique, [16] proposes the rank-aware orthogonal recursive matching pursuit (RA-ORMP) based on a greedy selection strategy, and further demonstrates the rank awareness and rank blindness of a large class of joint sparse recovery algorithms. That work proves that some of the most popular algorithms, including orthogonal matching pursuit and mixed $\ell_{q,1}$ norm minimization for any $q\geq 1$, are effectively rank blind. More recent rank-aware methods include Simultaneous Normalized Iterative Hard Thresholding (SNIHT) [6] and Rank Aware Row Thresholding (RART) [7]. Compared to the greedy approach, the integration of rank information into regularization-based algorithms seems an open problem. To date, all existing rank-aware algorithms belong to the greedy class.

Applications. The joint sparse recovery problem finds applications in various fields, including data analytics and imaging [34, 24], uncertainty quantification [17], and parameter identification of dynamical systems [44]. Here, we will illustrate one particular area of our interest in detail, which is the Column Subset Selection Problem (CSSP), also known as the feature selection or dictionary selection problem. Suppose we have a set of data columns $Y\in\mathbb{R}^{M\times K}$ and a dictionary of features $A\in\mathbb{R}^{M\times N}$ (here $N$ represents the number of features in the dictionary and $M$ is the size of the features), and we want to extract a subset of the columns of $A$ such that $Y$ can be represented in terms of only $s$ columns of $A$. This problem exactly translates to the joint sparse recovery problem $Y=AX$, where $X$ is the matrix of coefficients and the non-zero rows correspond to the columns that have been selected. In the feature selection problem, the data is also the matrix $A$ (i.e., $Y=A$) and the goal is to select the most representative columns (features) of the matrix $A$. Thus, we aim to solve the problem

(7) \min_{Z}\|A-AZ\|_{\textnormal{Fro}}\;\text{subject to}\;\|Z\|_{2,0}\leq s.

This problem has garnered extensive attention in machine learning, where it is often relaxed using the mixed $\ell_{2,1}$ norm [34, 38, 35, 37]. This has also led to various matrix decomposition techniques such as the CUR decomposition [26, 8]. Demonstrations of our $\mathit{ow}\ell_{2,1}$ regularization on this problem on some bioinformatics datasets will be shown in Section 7.

Proximal algorithms. To our knowledge, this is the first paper that provides an iterative solution method for the problem (4), aside from the initial work [42]. The methods we outline in Section 6 are based on standard proximal gradient methods for least squares regression with convex regularizers [3, 9], and for more general non-smooth, convex optimization that encompasses joint sparse regression [11, 23, 12, 45, 18]. In order to incorporate the nonconvex, non-smooth aspects of the $\mathit{ow}\ell_{2,1}$ regularizer, we employ a first-order convex approximation through a model function in every step of the method. This allows us to write the new iteration with the help of a proximal mapping that is simple to calculate. Convex model functions are also used in related methods for nonsmooth nonconvex optimization [40, 39] and in trust-region methods [13]. Our method employs a specialized variable metric to achieve simple and efficient iterations, which requires a dedicated analysis and leads to a global convergence result.

As already pointed out, for the scalar case $K=1$ we obtain the $\ell_{1}/\ell_{2}$ regularizer, for which optimization methods have been proposed in [56, 43, 53, 55, 46, 57]. For $K=1$, the proximal method in this manuscript is similar to the one proposed in [53, Algorithm 2] for the equality constrained formulation (3), and subsequently analyzed and generalized to include the noisy constrained formulation (5) in [57]. In this paper we focus on the noisy regularized formulation (4), which can also be used to approach the solutions of the equality constrained problem for $\alpha\to 0$ and the solutions of the noisy constrained problem for $\alpha\to\alpha_{\delta}$; see Section 6.3. While it is easy to specialize the method proposed in this manuscript to $K=1$ (see Remark 6.5), the generalization to larger $K$ is not obvious and requires additional effort.

Finally, since the $\mathit{ow}\ell_{2,1}$ functional is discontinuous across different ranks, we cannot rely on gradient descent alone unless we know or can guess the optimal rank ahead of time [42]. As a remedy, we propose an optional Lipschitz-continuous relaxation of the problem for a parameter $\gamma>0$, which we show converges to the original problem in the sense of $\Gamma$-convergence [15].

3. Rank-awareness

Denote, for any $r\leq K$,

\mathbb{S}(N,K,r)=\{Z\in\mathbb{R}^{N\times K}:{\rm rank\,}(Z)=r\}.

When the columns of $Z$ are linearly independent ($r=K$), we abuse the notation and write $\mathbb{S}(N,r)=\mathbb{S}(N,K,r)$. It is worth noting that the set $\mathbb{S}(N,r)$ is often referred to as the (non-compact) Stiefel manifold, and it is a key object in differential geometry and optimization, with numerous applications in machine learning and data analysis.

The following theorem [32, 16] provides a necessary and sufficient condition for the exact recovery of sparse matrices from the $\ell_{2,1}$ minimization problem (2):

Theorem 3.1.

Given a matrix $A\in\mathbb{R}^{M\times N}$, every $s$-row sparse matrix $X\in\mathbb{R}^{N\times K}$ can be exactly recovered from (2) if and only if $A$ satisfies the null space property (NSP):

\|x_{I}\|_{1}<\|x_{I^{c}}\|_{1},\ \forall x\in\ker(A)\setminus\{\mathbf{0}\},\ \forall I\subset\{1,\dots,N\},\text{ with }|I|\leq s,

where $x_{I}$ is the vector obtained by nullifying all the entries of $x$ except the ones with indices in $I$. Furthermore, if there exists an $x\in\ker(A)$ such that

\|x_{I}\|_{1}>\|x_{I^{c}}\|_{1},\;|I|\leq s,

then, for any $1\leq r\leq\min\{s,K\}$, there is an $s$-row sparse matrix $X\in\mathbb{S}(N,K,r)$ which cannot be recovered from (2).

The following theorem (see [42] for details) highlights the difference in behavior between the $\ell_{2,1}$ norm and the $\ell_{2,0}$ norm regularizers: the $\ell_{2,1}$ norm regularizer is not rank-aware, while the $\ell_{2,0}$ norm regularizer is. The smallest number of columns of a matrix $A$ that are linearly dependent is referred to as the spark of $A$ and is denoted by $\textnormal{spark\,}(A)$.

Theorem 3.2.

The following statements are equivalent:

  (1) $s\leq(\textnormal{spark\,}(A)-1+r)/2$;

  (2) for every $s$-row sparse matrix $X\in\mathbb{S}(N,K,r)$, $X$ is the unique solution of $AZ=Y$ with $Y=AX$ in the set $\mathbb{S}(N,K,r)$;

  (3) for every $s$-row sparse matrix $X\in\mathbb{S}(N,K,r)$, $X$ is the unique solution of the minimization problem (1) where $Y=AX$.

In the above theorem, the least restrictive condition occurs when the rank of the row-sparse matrix $X$ is equal to its sparsity, i.e., $r=s$. In fact, in practical applications, the row-sparse matrix $X$ in the joint sparse recovery problem is often of maximum rank, i.e., it is (at least approximately) of rank $s$ as well as $s$-row sparse. In such a scenario, the condition specified in Theorem 3.2 simplifies to $s<\textnormal{spark\,}(A)$. Hence, the assumption $s<\textnormal{spark\,}(A)$ is made consistently throughout the manuscript.

Theorem 3.3.

Given a matrix $A\in\mathbb{R}^{M\times N}$ with the property $s<\textnormal{spark\,}(A)$, let $X\in\mathbb{S}(N,K,s)$ represent an $s$-row sparse matrix, and define $Y=AX$. Then $X$ is the unique solution to (3).

Proof.

From Prop. 1 in [42], since $s<\textnormal{spark\,}(A)$, we have ${\rm rank\,}(AX)={\rm rank\,}(X)$. Let $Z\in\mathbb{R}^{N\times K}$ be such that $AX=AZ$; then ${\rm rank\,}(Z)\geq{\rm rank\,}(AZ)={\rm rank\,}(X)=s$. Combining this with Lemma A.2, $\mathit{ow}\ell_{2,1}(Z)\geq{\rm rank\,}(Z)\geq s=\mathit{ow}\ell_{2,1}(X)$, with equality holding if and only if $Z$ is $s$-row sparse of rank $s$, in which case $Z=X$ by Theorem 3.2. ∎

Corollary 3.4.

The $\mathit{ow}\ell_{2,1}$ regularization in (3) is rank-aware.

Proof.

According to Theorem 3.2, not all $s$-row sparse matrices $X\in\mathbb{S}(N,K,1)$ can be uniquely recovered from the equation $AZ=Y$ with $Y=AX$ when $\textnormal{spark\,}(A)/2<s<\textnormal{spark\,}(A)$ and $\textnormal{spark\,}(A)>2$. However, as per Theorem 3.3, all $s$-row sparse matrices $X\in\mathbb{S}(N,K,s)$ can be uniquely recovered using the method described in (3). ∎

For a given matrix $A$, if $\textnormal{spark\,}(A)/2<s<\textnormal{spark\,}(A)$ and $r<s$, it is not necessarily guaranteed that the solution to the joint sparse recovery problem exists or is unique. In this context, Section 4 introduces a condition, named the spherical section property, to ensure the existence of a global solution.

Earlier, we discussed that when the rank is equal to one, $\mathit{ow}\ell_{2,1}$ reduces to $\ell_{1}/\ell_{2}$. Below we consider examples in which $\ell_{1}/\ell_{2}$ fails to recover a single vector, but if we add another vector along with the corresponding measurements, then we successfully recover both.

Example 1.

Consider a matrix $A\in\mathbb{R}^{2\times 3}$ and $X\in\mathbb{R}^{3\times 1}$, where

A=\begin{pmatrix}c&1&0\\ c&0&1\end{pmatrix},\quad X=\begin{pmatrix}0\\ 1\\ 1\end{pmatrix}\quad\text{ and }\quad c\geq 0.

Then the measurement matrix $Y$ is given by $Y=AX=(1,1)^{T}\in\mathbb{R}^{2}$. A solution of $AZ=Y$ has the form $Z=(\tau,1-\tau c,1-\tau c)^{T}$, $\tau\in\mathbb{R}$. For such $Z$ we have

\mathit{ow}\ell_{2,1}(Z)=\frac{\lVert Z\rVert_{1}}{\lVert Z\rVert_{2}}=\frac{|\tau|+2|1-\tau c|}{\sqrt{\tau^{2}+2(1-\tau c)^{2}}}.

For $c=0$, the functional has a local minimum with value $\sqrt{2}$ at $\tau=0$, but the global infimum is equal to one and is approached as $\lvert\tau\rvert\to\infty$. However, in that case $\textnormal{spark\,}(A)=1$, which is not sufficient for recovery of 2-sparse vectors with any method, since there exists a 1-sparse vector in the kernel. For $c>0$ we have $\textnormal{spark\,}(A)=3$ and $\mathit{ow}\ell_{2,1}(Z)$ attains its minimum at $\tau=1/c$, different from the true solution at $\tau=0$ (note, however, that the sparsest solution is at $\tau=1/c$, and the true solution $X$ is not the sparsest here). Yet, by adding another column to the matrix $X$ to make it 2-row sparse with rank two,

X=\begin{pmatrix}0&0\\ 1&1\\ 1&-1\end{pmatrix}\quad\text{and the observation }Y=AX=\begin{pmatrix}1&1\\ 1&-1\end{pmatrix},

the problem (3) has a unique solution that is equal to $X$. This follows from the fact that $r=s=2$ and $s<\textnormal{spark\,}(A)=3$: by applying Theorem 3.3, we conclude that the true solution is recovered.
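Example 1 can be checked numerically; the script below is our own illustration (the helper ow_l21 and the chosen value c = 0.5 are assumptions of the sketch). It confirms that for $K=1$ the minimizer over the solution family sits at $\tau=1/c$ rather than at the true $\tau=0$, while for the rank-two data the true $X$ has a strictly smaller value than other solutions.

```python
import numpy as np

def ow_l21(Z, tol=1e-10):
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    r = int(np.sum(S > tol * S[0])) if S[0] > 0 else 0
    return float(np.linalg.norm(U[:, :r], axis=1).sum())

c = 0.5
A = np.array([[c, 1.0, 0.0], [c, 0.0, 1.0]])

# K = 1: all solutions of A z = (1,1)^T are z(tau) = (tau, 1 - c*tau, 1 - c*tau)
taus = np.linspace(-5.0, 5.0, 2001)
vals = [ow_l21(np.array([[t], [1 - c*t], [1 - c*t]])) for t in taus]
print("K=1 minimizer: tau =", taus[int(np.argmin(vals))], "(1/c =", 1/c, ", true tau = 0)")

# K = 2: all solutions of A Z = Y are X + k d^T with k spanning ker(A)
X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, -1.0]])
k = np.array([1.0, -c, -c])                      # kernel direction of A
print("ow_l21(X) =", ow_l21(X))                  # = 2 (2-row sparse, rank 2)
for d in (np.array([0.3, -0.7]), np.array([2.0, 1.0])):
    print("ow_l21(X + k d^T) =", ow_l21(X + np.outer(k, d)))  # strictly larger
```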

The next example, in a similar spirit, demonstrates that the functional can even fail to have a minimizer for large $\textnormal{spark\,}(A)$, if the rank is smaller than the sparsity.

Example 2.

For $c>0$, let

A=\begin{pmatrix}c&1&0&0&0\\ c&0&1&0&0\\ 0&0&0&1&0\\ 0&0&0&0&1\end{pmatrix},\quad X=\begin{pmatrix}0\\ 0\\ 0\\ 1\\ 1\end{pmatrix}\quad\text{ and }\quad Y=AX=\begin{pmatrix}0\\ 0\\ 1\\ 1\end{pmatrix}.

Any solution of $AZ=Y$ has the form $Z=(\tau,-\tau c,-\tau c,1,1)^{T}$, $\tau\in\mathbb{R}$. For such $Z$, we have

\mathit{ow}\ell_{2,1}(Z)=\frac{\lVert Z\rVert_{1}}{\lVert Z\rVert_{2}}=\frac{2+(1+2c)|\tau|}{\sqrt{2+(1+2c^{2})\tau^{2}}}.

The critical points in terms of $\tau$ are $\pm(1+2c)/(1+2c^{2})$, $\pm\infty$, and $0$, with a local minimum at the latter. However, the local minimum at zero is a global one only for $c>1/4$, due to $\lim_{|\tau|\to\infty}\mathit{ow}\ell_{2,1}(Z)=(1+2c)/\sqrt{1+2c^{2}}>\sqrt{2}=\mathit{ow}\ell_{2,1}(Z)\rvert_{\tau=0}$, while for $c<1/4$ the infimum is approached as $\tau\to\pm\infty$.

As before, if we add another linearly independent column with the same sparsity pattern to $X$, say the column $X_{2}=(0,0,0,1,-1)^{T}$, then ${\rm rank\,}(X)=2$, and with Theorem 3.3 and $s=2<\textnormal{spark\,}(A)=3$ the problem (3) has a unique solution equal to $X$.

4. Existence of a global minimizer

Here we turn to the question of showing that the global minimizers of the noise-free problem (3) and the noisy unconstrained problem (4) indeed exist under certain conditions on $A$. They do not have to exist in general.

To illustrate this fact, let us consider for the moment the case $K=1$, where the functional $\mathit{ow}\ell_{2,1}$ is identical to the $\ell_{1}/\ell_{2}$ functional (6). For the existence of the global minimum of the cost function with the $\ell_{1}/\ell_{2}$ regularizer, it is often required that $A$ satisfies the $s$-spherical section property; an example can be found in [57, Theorem 3.4]. In this section we aim to extend these results to the joint sparse recovery problem with the $\mathit{ow}\ell_{2,1}$ regularizer. The failure of minimizers to exist without appropriate assumptions has been illustrated in the previous section.

Definition 4.1.

The rank-$r$ singularity set of the matrix $A$ is defined as

\Omega_{r}(A)=\{Z\in\mathbb{R}^{N\times K}:\;{\rm rank\,}(Z)\geq r,\,{\rm rank\,}(AZ)<{\rm rank\,}(Z)\}.

This set contains all matrices of rank at least $r$ for which some nontrivial linear combination of the columns lies in the kernel of $A$. Notice that the rank-$1$ singularity set for $K=1$ is the nontrivial part of the kernel: for $K=1$ we have $\Omega_{1}(A)=\ker(A)\setminus\{0\}$. This justifies the following definition as an extension of the $s$-spherical section property (cf. [52, 57]).

Definition 4.2 ($(r,s)$-spherical section property).

Let $A\in\mathbb{R}^{M\times N}$. We say that $A$ satisfies the $(r,s)$-spherical section property, or $(r,s)$-SSP, if, for all $Z\in\Omega_{r}(A)$,

\mathit{ow}\ell_{2,1}(Z)>\sqrt{rs}.

Lemma 4.3.

The following infimum is attainable:

\iota_{r}(A)=\inf_{Z\in\Omega_{r}(A)}\mathit{ow}\ell_{2,1}(Z).

Moreover, for $r<\textnormal{spark\,}(A)$, $\iota_{r}(A)>r$.

Proof.

Let $Z\in\Omega_{r}(A)$ and consider its compact singular value decomposition $Z=U\Sigma V^{T}$, where $U\in\mathbb{R}^{N\times r^{\prime}}$, $V\in\mathbb{R}^{K\times r^{\prime}}$ with $U^{T}U=V^{T}V=I_{r^{\prime}}$, and $r^{\prime}\geq r$ is the rank of $Z$. Here $\Sigma\in\mathbb{R}^{r^{\prime}\times r^{\prime}}$ is the diagonal matrix of nonzero singular values. Note that $Z(Z^{T}Z)^{\dagger/2}=UV^{T}$, therefore

{\rm rank\,}(AU)={\rm rank\,}(AUV^{T})={\rm rank\,}(AZ(Z^{T}Z)^{\dagger/2})\leq{\rm rank\,}AZ<{\rm rank\,}Z={\rm rank\,}UV^{T},

and thus $U\in\Omega_{r^{\prime}}(A)$. We define the set

\hat{\Omega}_{r^{\prime}}(A)=\Omega_{r^{\prime}}(A)\cap\left\{U\in\mathbb{R}^{N\times r^{\prime}}:\;U^{T}U=I_{r^{\prime}}\right\}=\left\{U\in\mathbb{R}^{N\times r^{\prime}}:\;U^{T}U=I_{r^{\prime}}\text{ and }{\rm rank\,}AU<r^{\prime}\right\},

where $I_{r^{\prime}}$ is the $r^{\prime}\times r^{\prime}$ identity matrix.

By the previous argument, we know that for any $Z\in\Omega_{r}(A)$ there exists a $U\in\hat{\Omega}_{r^{\prime}}(A)$ with $r^{\prime}\geq r$ such that

\mathit{ow}\ell_{2,1}(Z)=\mathit{ow}\ell_{2,1}(UV^{T})=\mathit{ow}\ell_{2,1}(U)=\lVert U\rVert_{2,1}.

Thus, we have

\inf_{Z\in\Omega_{r}(A)}\mathit{ow}\ell_{2,1}(Z)=\inf_{r^{\prime}\geq r;\,U\in\hat{\Omega}_{r^{\prime}}(A)}\lVert U\rVert_{2,1}=\min_{r^{\prime}\geq r}\inf_{U\in\hat{\Omega}_{r^{\prime}}(A)}\lVert U\rVert_{2,1}.

Moreover, the set $\hat{\Omega}_{r^{\prime}}(A)$ is closed and bounded, and thus there exists a $U^{*}\in\hat{\Omega}_{r^{\prime}}(A)\subset\Omega_{r}(A)$ for some $r^{\prime}\geq r$ such that

\inf_{Z\in\Omega_{r}(A)}\mathit{ow}\ell_{2,1}(Z)=\mathit{ow}\ell_{2,1}(U^{*})=\lVert U^{*}\rVert_{2,1}.

Now, let $Z\in\Omega_{r}(A)$. From Lemma A.2, $\mathit{ow}\ell_{2,1}(Z)\geq{\rm rank\,}(Z)\geq r$. Moreover, from the same lemma, $\mathit{ow}\ell_{2,1}(Z)=r$ if and only if $\|Z\|_{2,0}={\rm rank\,}(Z)=r$. But ${\rm rank\,}(AZ)<{\rm rank\,}(Z)$, so there exists a non-zero linear combination of the columns of $Z$ that lies in the kernel of $A$. Such a vector would have to be $r$-sparse, being a linear combination of $r$-sparse vectors with the same sparsity pattern. But there cannot be any $r$-sparse vector in the kernel of $A$, since otherwise the conditions of Theorem 3.2 would be violated (we could construct an $r$-sparse rank-$r$ matrix that cannot be recovered from the measurements). Thus, for any $Z\in\Omega_{r}(A)$, $\mathit{ow}\ell_{2,1}(Z)>r$. Since the infimum is attained, we conclude that $\iota_{r}(A)>r$. ∎

Example 3.

To illustrate the advantage of larger $r$ in a specific instance, consider $N=3$, $K=2$. Let

A=\begin{pmatrix}c&1&0\\ c&0&1\end{pmatrix}

for $c\geq 0$. Notice that

Z=\begin{pmatrix}1&1\\ -c&-c\\ -c&-c\end{pmatrix}\in\Omega_{1}(A)

and

\mathit{ow}\ell_{2,1}(Z)=\frac{1+2c}{\sqrt{1+2c^{2}}}\leq\sqrt{2}

when $c\leq 1/4$. Here we used the fact that $\mathit{ow}\ell_{2,1}(Z)$ is the $\ell_{2,1}$ norm of the matrix formed by any orthonormal basis of the column space of $Z$; so, in this case, it is simply the $\ell_{1}/\ell_{2}$ functional applied to the first column. Therefore, the matrix $A$ does not fulfill the $(1,2)$-SSP when $c\leq 1/4$. However, when $c>0$, $2<\textnormal{spark\,}(A)=3$ and thus the $(2,2)$-SSP is satisfied due to Lemma 4.3. When $c=0$, the $(2,2)$-SSP does not hold, which is expected. For

Z=\begin{pmatrix}1&0\\ 0&1\\ 0&0\end{pmatrix}\in\Omega_{2}(A),

we have $\mathit{ow}\ell_{2,1}(Z)=2$.

4.1. The noiseless constrained problem

Lemma 4.4.

Let $X\in\mathbb{S}(N,K,r)$ with $\|X\|_{2,0}=s$ and $Y=AX$, where $s<\textnormal{spark\,}(A)$. Assume there exists an unbounded minimizing sequence $\{Z_{k}\}$ of (3). Then

\iota_{r}(A)\leq\min_{Z\in\mathbb{R}^{N\times K},\;AZ=Y}\mathit{ow}\ell_{2,1}(Z).
Proof.

Let $\{Z_{k}\}$ be an unbounded minimizing sequence. From the proof of Proposition 1 in [42], $s<\textnormal{spark\,}(A)$ implies that $A$ preserves the rank of any $s$-row sparse matrix. Thus ${\rm rank\,}(Z_{k})\geq{\rm rank\,}(AZ_{k})={\rm rank\,}(Y)={\rm rank\,}(X)=r$.

Without loss of generality (by switching to a subsequence), we may assume that all the $Z_{k}$ have the same rank. Writing the compact SVD of $Z_{k}$ as $Z_{k}=U_{k}\Sigma_{k}V^{T}_{k}$, we have $Z_{k}(Z_{k}^{T}Z_{k})^{\dagger/2}=U_{k}V^{T}_{k}$. Moreover, along a subsequence of $k$ the largest singular value of $\Sigma_{k}$ diverges and, by switching to another subsequence, the corresponding left and right singular factors $U_{k}$ and $V_{k}$ converge due to compactness. Let $\hat{Z}$ be the limit of the resulting subsequence of $Z_{k}(Z_{k}^{T}Z_{k})^{\dagger/2}$. Then $\hat{Z}$ has the same rank as the matrices in the subsequence (from the fact that $U_{k}$ and $V_{k}$ have orthonormal columns) and, since ${\rm rank\,}(Z_{k})\geq r$, we deduce that ${\rm rank\,}(\hat{Z})\geq r$.

Next, we show that ${\rm rank\,}(A\hat{Z})<{\rm rank\,}(\hat{Z})$. Let $\hat{v}^{1}$ be the right singular vector and $\hat{u}^{1}$ the left singular vector corresponding to the largest singular value $\sigma^{1}$ of $\hat{Z}$. Note that

\hat{Z}(\hat{v}^{1})^{T}=\lim_{k\to\infty}Z_{k}(v_{k}^{1})^{T}(\sigma^{1}_{k})^{-1},

where $v_{k}^{1}$ is the right singular vector and $u_{k}^{1}$ the left singular vector corresponding to the largest singular value $\sigma_{k}^{1}$ of $Z_{k}$. On the other hand,

A\hat{Z}(\hat{v}^{1})^{T}=\lim_{k\to\infty}AZ_{k}(v^{1}_{k})^{T}(\sigma^{1}_{k})^{-1}=0,

since $AZ_{k}(v^{1}_{k})^{T}$ is bounded. Note that $A\hat{Z}=A(\hat{Z}-\hat{u}^{1}(\hat{v}^{1})^{T})$ and ${\rm rank\,}(\hat{Z}-\hat{u}^{1}(\hat{v}^{1})^{T})<{\rm rank\,}(\hat{Z})$. Therefore $\hat{Z}\in\Omega_{r}(A)$. Using Proposition A.3,

\lim_{k\to\infty}\mathit{ow}\ell_{2,1}(Z_{k})\geq\mathit{ow}\ell_{2,1}(\hat{Z})\geq\iota_{r}(A).\qed
Proposition 4.5.

Let $X\in\mathbb{S}(N,K,r)$ with $\|X\|_{2,0}=s$ and $Y=AX$. Assume $s<\textnormal{spark\,}(A)$ and that $A$ satisfies the $(r,s)$-SSP condition or, equivalently,

r\leq s<\min\{\iota^{2}_{r}(A)/r,\textnormal{spark\,}(A)\}.

Then any (globally) minimizing sequence of (3) is bounded and the set of optimal solutions of (3) is nonempty.

Proof.

Let

\nu=\inf_{Z\in\mathbb{R}^{N\times K},\;AZ=Y}\mathit{ow}\ell_{2,1}(Z).

Note that $\mathit{ow}\ell_{2,1}(X)\geq\nu$. We have $\|X(X^{T}X)^{\dagger/2}\|^{2}_{\textnormal{Fro}}=r$ (the singular values of $X(X^{T}X)^{\dagger/2}$ are all equal to 1), therefore

\|X\|_{2,0}^{1/2}=\|X(X^{T}X)^{\dagger/2}\|_{2,0}^{1/2}\geq\frac{\|X(X^{T}X)^{\dagger/2}\|_{2,1}}{\|X(X^{T}X)^{\dagger/2}\|_{\textnormal{Fro}}}=\frac{\mathit{ow}\ell_{2,1}(X)}{\sqrt{r}}\geq\frac{\nu}{\sqrt{r}}.

For every $Z\in\Omega_{r}(A)$, we have

\mathit{ow}\ell_{2,1}(Z)\geq\iota_{r}(A)>\sqrt{rs}=\sqrt{r}\|X\|_{2,0}^{1/2}\geq\nu.

This, combined with Lemma 4.4, implies that there exists no unbounded minimizing sequence $\{Z_{k}\}$ of (3). Without loss of generality, by passing to a subsequence if necessary, we can assume $Z_{k}\to\bar{Z}\in\mathbb{R}^{N\times K}$. From Corollary A.3, $\liminf_{k\to\infty}\mathit{ow}\ell_{2,1}(Z_{k})\geq\mathit{ow}\ell_{2,1}(\bar{Z})$; therefore, $\bar{Z}$ is a global minimizer of (3), completing the proof. ∎

Corollary 4.6.

Let $Y=AX$, where $X$ is an $s$-row sparse matrix of rank $s$ with $s<\textnormal{spark\,}(A)$. Then any (globally) minimizing sequence of (3) is bounded and the set of optimal solutions of (3) is nonempty.

Proof.

We have $r=s$. From Lemma 4.3, we have $\iota_{s}(A)>s$, so the conditions of Proposition 4.5 are satisfied. ∎

4.2. The noisy regularized problem

Next, we turn to the question of showing that the minimizers of the problem (4) indeed exist under certain conditions on $A$.

Lemma 4.7.

Let $X\in\mathbb{S}(N,K,r)$ with $\|X\|_{2,0}=s$ and $\lVert Y-AX\rVert_{\textnormal{Fro}}\leq\delta$, where $s<\textnormal{spark\,}(A)$. Assume $\delta$ and $\alpha$ are sufficiently small, and assume there exists an unbounded minimizing sequence $\{Z_{k}\}$ of (4). Then

\iota_{r}(A)\leq\min_{Z\in\mathbb{R}^{N\times K}}\left(\mathit{ow}\ell_{2,1}(Z)+\frac{1}{2\alpha}\|AZ-Y\|_{\textnormal{Fro}}^{2}\right).
Proof.

Let $\{Z_{k}\}$ be a minimizing sequence for the right-hand side. For $k$ large enough,

\mathit{ow}\ell_{2,1}(Z_{k})+\frac{1}{2\alpha}\lVert AZ_{k}-Y\rVert^{2}_{\textnormal{Fro}}\leq 2\left(\mathit{ow}\ell_{2,1}(X)+\frac{1}{2\alpha}\lVert AX-Y\rVert^{2}_{\textnormal{Fro}}\right),

thus

\frac{1}{2}\lVert AZ_{k}-Y\rVert^{2}_{\textnormal{Fro}}\leq 2\left(\alpha\,\mathit{ow}\ell_{2,1}(X)+\frac{1}{2}\lVert Y-AX\rVert^{2}_{\textnormal{Fro}}\right)-\alpha\,\mathit{ow}\ell_{2,1}(Z_{k})\leq 2\alpha\sqrt{r}\sqrt{s}+\delta^{2}.

This implies that ${\rm rank\,}(Z_{k})\geq r$ for $\delta$, $\alpha$ small enough. Indeed, $s<\textnormal{spark\,}(A)$ implies that $A$ preserves the rank of any $s$-row sparse matrix (see Proposition 1 in [42]). If $\delta$ is small enough, then ${\rm rank\,}(Y)\geq{\rm rank\,}(AX)=r$. And when both $\alpha$ and $\delta$ are small enough, ${\rm rank\,}(AZ_{k})\geq{\rm rank\,}(Y)\geq r$. On the other hand, ${\rm rank\,}(Z_{k})\geq{\rm rank\,}(AZ_{k})\geq r$.

As in the proof of Lemma 4.4, without loss of generality, all the $Z_{k}$ have the same rank and $Z_{k}(Z_{k}^{T}Z_{k})^{\dagger/2}$ converge to a $\hat{Z}\in\Omega_{r}(A)$. Therefore,

\lim_{k\to\infty}\left(\mathit{ow}\ell_{2,1}(Z_{k})+\frac{1}{2\alpha}\|AZ_{k}-Y\|_{\textnormal{Fro}}^{2}\right)\geq\liminf_{k\to\infty}\mathit{ow}\ell_{2,1}(Z_{k})\geq\mathit{ow}\ell_{2,1}(\hat{Z})\geq\iota_{r}(A).\qed
Proposition 4.8.

Let $X\in\mathbb{S}(N,K,r)$ with $\|X\|_{2,0}=s$ and $\lVert Y-AX\rVert_{\textnormal{Fro}}\leq\delta$. Assume $s<\textnormal{spark\,}(A)$ and that $A$ satisfies the $(r,s)$-SSP or, equivalently,

r\leq s<\min\{\iota^{2}_{r}(A)/r,\textnormal{spark\,}(A)\}.

Then, for sufficiently small $\delta^{2}/\alpha$ and $\alpha$, any (globally) minimizing sequence of (4) is bounded and the set of optimal solutions of (4) is nonempty.

Proof.

Let $\nu$ denote the infimum in (4). Since $\mathit{ow}\ell_{2,1}(X)+\frac{1}{2\alpha}\|AX-Y\|_{\textnormal{Fro}}^{2}\geq\nu$, we have that $\mathit{ow}\ell_{2,1}(X)\geq\nu-\frac{\delta^{2}}{2\alpha}$. We have $\|X(X^{T}X)^{\dagger/2}\|^{2}_{\textnormal{Fro}}=r$, therefore

\|X\|_{2,0}^{1/2}=\|X(X^{T}X)^{\dagger/2}\|_{2,0}^{1/2}\geq\frac{\|X(X^{T}X)^{\dagger/2}\|_{2,1}}{\|X(X^{T}X)^{\dagger/2}\|_{\textnormal{Fro}}}=\frac{\mathit{ow}\ell_{2,1}(X)}{\sqrt{r}}\geq\frac{1}{\sqrt{r}}\left(\nu-\frac{\delta^{2}}{2\alpha}\right).

Thus, for every $Z\in\Omega_{r}(A)$, using the $(r,s)$-SSP,

\mathit{ow}\ell_{2,1}(Z)\geq\iota_{r}(A)>\sqrt{rs}=\sqrt{r}\,\|X\|_{2,0}^{1/2}\geq\nu-\frac{\delta^{2}}{2\alpha},

and hence $\iota_{r}(A)>\nu$ when $\delta^{2}/\alpha$ is sufficiently small. This, combined with Lemma 4.7, implies that there exists no unbounded minimizing sequence $\{Z_{k}\}$ of (4). Without loss of generality, by passing to a subsequence if necessary, we can assume $Z_{k}\to\bar{Z}\in\mathbb{R}^{N\times K}$. Then $\|AZ_{k}-Y\|_{\textnormal{Fro}}^{2}\to\|A\bar{Z}-Y\|_{\textnormal{Fro}}^{2}$ and, from Corollary A.3, $\liminf_{k\to\infty}\mathit{ow}\ell_{2,1}(Z_{k})\geq\mathit{ow}\ell_{2,1}(\bar{Z})$. Therefore, $\bar{Z}$ is a global minimizer of (4), completing the proof. ∎

Corollary 4.9.

Let $Y=AX$, where $X$ is an $s$-row sparse matrix of rank $s$ with $s<\textnormal{spark\,}(A)$. Then, for sufficiently small $\delta^{2}/\alpha$ and $\alpha$, any (globally) minimizing sequence of (4) is bounded and the set of optimal solutions of (4) is nonempty.

5. The approximate problem

The functional $\mathit{ow}\ell_{2,1}$ is highly discontinuous (recall that this is a necessary condition for rank-awareness), especially in the case of large $K$, and not directly amenable to computations, unless it is restricted to the submanifold $\mathbb{S}(N,K,r)\subset\mathbb{R}^{N\times K}$ of matrices of given rank $r$. In the noise-free case, we can usually deduce the appropriate rank from the data $Y$, since ${\rm rank\,}(X)={\rm rank\,}(Y)$ is a necessary condition for recovery. Using a decomposition $Y=Y^{\prime}Q$ with $QQ^{T}=I\in\mathbb{R}^{r\times r}$, we can then solve the problem with data $Y^{\prime}=YQ^{T}\in\mathbb{R}^{M\times r}$ for the solution $Z^{\prime}\in\mathbb{S}(N,r)\subset\mathbb{R}^{N\times r}$ and obtain a reconstruction of $X$ as $Z^{\prime}Q$; see [42].

In the noisy case, we do not know what the rank of $X$ should be from the data $Y=AX+W$, since usually we would expect the noise $W$ to have full rank. If we preprocess the data $Y$ via its SVD, we can still ensure that the data has full rank. However, in the presence of noise, a number of small singular values may potentially be associated to just noise, and an arbitrary cutoff for the small singular values can be difficult to determine in practice ahead of solving the problem. Additionally, even when we happen to guess the correct rank, an iterative solution algorithm can run into a singularity during one of the iterations. This challenge can be addressed by carefully selecting the step size; however, this may still yield an ill-conditioned matrix, potentially causing the updates to stagnate. A proper relaxation of the problem may take care of this difficulty.

For this and other reasons to be discussed later, we consider a generalization of the $\mathit{ow}\ell_{2,1}$ functional with a parameter $\gamma\in[0,1]$:

\Psi_{\gamma}(Z)=\lVert Z\left(\gamma I+(1-\gamma)Z^{T}Z\right)^{\dagger/2}\rVert_{2,1},

and a generalized version of problem (4):

(8) \min_{Z\in\mathbb{R}^{N\times K}}\left[\Psi_{\gamma}(Z)+\frac{1}{2\alpha}\lVert AZ-Y\rVert^{2}_{\textnormal{Fro}}\right].

The numbers $\gamma\in[0,1]$ and $\alpha>0$ are the hyper-parameters of this problem. In particular, $\gamma$ controls the convexity of the problem: $\gamma=1$ reduces it to the convex $\ell_{2,1}$ norm regularized problem, while $\gamma=0$ corresponds directly to the $\mathit{ow}\ell_{2,1}$ regularization. For $\gamma\in(0,1)$, we obtain a new "relaxed" optimization problem.
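For illustration (a sketch of our own, not the authors' code), $\Psi_{\gamma}$ can be evaluated directly from its definition: for $\gamma>0$ the matrix $\gamma I+(1-\gamma)Z^{T}Z$ is positive definite, and a symmetric eigendecomposition gives its (pseudo-)inverse square root. The helper name psi_gamma and the eigenvalue cutoff are assumptions of this sketch.

```python
import numpy as np

def psi_gamma(Z, gamma):
    """Psi_gamma(Z) = || Z (gamma I + (1-gamma) Z^T Z)^{dagger/2} ||_{2,1}."""
    K = Z.shape[1]
    M = gamma * np.eye(K) + (1.0 - gamma) * (Z.T @ Z)    # symmetric PSD (PD if gamma > 0)
    w, V = np.linalg.eigh(M)
    inv_sqrt = np.where(w > 1e-12, 1.0 / np.sqrt(np.maximum(w, 1e-12)), 0.0)
    M_pinv_sqrt = (V * inv_sqrt) @ V.T                    # M^{dagger/2}
    return float(np.linalg.norm(Z @ M_pinv_sqrt, axis=1).sum())

# gamma = 1 gives the l_{2,1} norm; gamma = 0 gives ow_l21 (for Z^T Z of full rank)
Z = np.random.default_rng(1).standard_normal((8, 3))
assert np.isclose(psi_gamma(Z, 1.0), np.linalg.norm(Z, axis=1).sum())
```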

As a first verification of the appropriateness of this relaxed problem, we will show that the (global) solutions of this problem converge to the (global) solutions of the problem in (4) as $\gamma\to 0$ (Proposition 5.1). Let

J_{\gamma}(Z)=\Psi_{\gamma}(Z)+\frac{1}{2\alpha}\|AZ-Y\|_{\textnormal{Fro}}^{2}

be the associated objective functional. As a corollary of Proposition A.5 on the $\Gamma$-convergence of $J_{\gamma}$, we obtain the following:

Proposition 5.1.

Any cluster point of minimizers $\bar{Z}_{\gamma}$ of $J_{\gamma}$ with $0<\gamma\to 0$ is a minimizer of (4).

An additional advantage of the relaxed problem, which will turn out to be useful for numerical optimization, is that its minimizers $\bar{Z}$ can always be characterized by first-order conditions with the help of the Clarke subdifferential, which is defined for any locally Lipschitz function; see [10, Chapter 10]. According to Proposition A.4 we have:

Proposition 5.2.

$J_{\gamma}(Z)$ is locally Lipschitz continuous for any $\gamma\in(0,1]$. For $\gamma=0$ it is only locally Lipschitz continuous on neighborhoods of points $Z$ where $Z^{T}Z$ is invertible (${\rm rank\,}(Z)=K$).

For the following, we introduce the matrix $W=(\gamma I+(1-\gamma)Z^{T}Z)^{\dagger}$ and note that $\Psi_{\gamma}(Z)=\|Z\|_{W,1}$. When either $\gamma>0$ or $Z^{T}Z$ is invertible, the matrix $M=\gamma I+(1-\gamma)Z^{T}Z$ is symmetric positive definite and thus $W=M^{-1}=\left(\gamma I+(1-\gamma)Z^{T}Z\right)^{-1}$.

The version of Fermat's rule for the Clarke subdifferential [10, Exercise 10.7] states that, at a local minimum $\bar{Z}$ of $J_{\gamma}$, it holds that

(9) 0\in\partial_{C}J_{\gamma}(\bar{Z})=\partial_{C}\Psi_{\gamma}(\bar{Z})+\frac{1}{\alpha}A^{T}(A\bar{Z}-Y).

The additive property does not always hold for the Clarke subdifferential, only an inclusion, but it does in this case due to [10, Exercise 10.16]. To characterize the subdifferential, we provide a convex approximation to $\Psi_{\gamma}$ at a given reference point. This is also known as a model function in nonconvex, nonsmooth optimization; cf., e.g., [40, 39].

Definition 5.3.

Let $\widehat{Z}$ be given with corresponding invertible $\widehat{M}=\gamma I+(1-\gamma)\widehat{Z}^{T}\widehat{Z}$ and $\widehat{W}=\widehat{M}^{-1}$. The convex model function is defined as

(10) m^{\Psi}_{\widehat{Z}}(Z):=\lVert Z\rVert_{\widehat{W},1}+\langle\widehat{Z}\widehat{\Lambda},Z-\widehat{Z}\rangle\quad\text{where }\widehat{\Lambda}=-\sum_{n=1}^{N}\frac{1-\gamma}{\lVert\widehat{z}_{n}\rVert_{\widehat{W}}}\widehat{W}\widehat{z}_{n}\widehat{z}_{n}^{T}\widehat{W}.

With a characterization of the Clarke subdifferential of $\Psi_{\gamma}(Z)$, the above optimality condition can be stated equivalently as follows:

Theorem 5.4.

Let $\bar{Z}$ be a local minimum of $J_{\gamma}$, where either $\gamma>0$ or $\bar{Z}^{T}\bar{Z}$ is invertible. Then, in terms of the Clarke subdifferential, the following first-order necessary conditions hold:

\bar{z}_{n}\left(\frac{1}{\lVert\bar{z}_{n}\rVert_{\bar{W}}}\bar{W}-\sum_{i=1}^{N}\frac{1-\gamma}{\lVert\bar{z}_{i}\rVert_{\bar{W}}}\bar{W}\bar{z}_{i}\bar{z}^{\top}_{i}\bar{W}\right)+\frac{1}{\alpha}(A^{T}(A\bar{Z}-Y))_{n}=0\quad\text{for }\bar{z}_{n}\neq 0,

\lVert(A^{T}(A\bar{Z}-Y))_{n}\bar{W}^{-1}\rVert_{\bar{W}}=\lVert(A^{T}(A\bar{Z}-Y))_{n}\rVert_{\bar{W}^{-1}}\leq\alpha\quad\text{for }\bar{z}_{n}=0,

where $\bar{z}_{n}$ is the $n$-th row of $\bar{Z}$ and $\bar{W}=(\gamma I+(1-\gamma)\bar{Z}^{T}\bar{Z})^{-1}$.

Proof.

From Theorem A.11, the Clarke subdifferential is given as the convex subdifferential

\partial_{C}\Psi_{\gamma}(\bar{Z})=\partial m^{\Psi}_{\bar{Z}}(\bar{Z})=\partial\lVert\bar{Z}\rVert_{\bar{W},1}+\bar{Z}\bar{\Lambda}

of the model function at $\bar{Z}$. The result now follows from (9) and the well-known explicit characterization of the convex subdifferential $\partial\lVert\cdot\rVert_{\bar{W},1}$ of the $\ell_{\bar{W},1}$-norm for a given weight $\bar{W}$. ∎

6. A proximal gradient based algorithm

We assume first that $\alpha>0$ and $\gamma\geq 0$ are given, and solve the problem (8). We first derive the idea of the algorithm using a proximal Lagrange method. Then we provide a convergence analysis using the model function (10), and finally we briefly discuss the adaptive selection of the parameters $\gamma$ and $\alpha$.

We note that in the case $\gamma=0$, the algorithm may run into cases where $M=Z^{T}Z$ is not invertible, which corresponds to a discontinuity in $\mathit{ow}\ell_{2,1}$. We will disregard that possibility for the moment and discuss mitigation strategies in the context of step-size selection. In general, if the sought-after solution of the problem does not have $\bar{M}=\bar{Z}^{T}\bar{Z}$ invertible, we will tacitly assume that $\gamma>0$ and rely on Proposition 5.1.

6.1. Proximal Lagrange derivation

To derive an algorithm, we first aim to isolate the nonsmooth and nonconvex parts of the functional and rewrite the problem (8) in the constrained form

(11) \min_{Z\in\mathbb{R}^{N\times K},\,W\in\mathbb{R}^{K\times K}}\|Z\|_{W,1}+\frac{1}{2\alpha}\|AZ-Y\|^{2}_{\textnormal{Fro}}\quad\text{subject to }W=(\gamma I+(1-\gamma)Z^{T}Z)^{-1},

where we have introduced a separate variable for the weight $W$. Note that, similar to the model function (10) postulated above, the functional now contains a weighted convex group sparse norm if $W$ is fixed. To solve the problem above, we consider the Lagrange function

\mathcal{L}(Z,W,\Lambda)=\sum_{i=1}^{N}\|z_{i}\|_{W}+\frac{1}{2\alpha}\|AZ-Y\|_{\textnormal{Fro}}^{2}+\frac{1}{2(1-\gamma)}\operatorname{tr}\left(\Lambda^{T}(\gamma I+(1-\gamma)Z^{T}Z-W^{-1})\right)=\sum_{i=1}^{N}\|z_{i}\|_{W}+\mathcal{M}(Z,W,\Lambda),

which is split into a part that is convex in $Z$ and smooth in $W$, and a second part $\mathcal{M}$ that collects the remaining differentiable parts of the Lagrange function $\mathcal{L}$. The factor $1/(2(1-\gamma))$ is added to provide a convenient scaling of the multiplier in the subsequent analysis.

Additionally, we consider the Hilbert space structure on $\mathbb{R}^{K}$ induced by the $W$ inner product of the form $\langle z,z^{\prime}\rangle_{W}=z^{T}Wz^{\prime}$, $z,z^{\prime}\in\mathbb{R}^{K}$. We note that it also induces an inner product $\langle Z^{\prime},Z\rangle_{W}=\operatorname{tr}(Z^{\prime}WZ^{T})$ for matrices $Z,Z^{\prime}\in\mathbb{R}^{N\times K}$. The idea essential to our strategy is to compute gradients with respect to $Z$ in this inner product. Given a function $F:\mathbb{R}^{N\times K}\to\mathbb{R}$, $\nabla^{W}F(Z)=\nabla F(Z)W^{-1}$ denotes the gradient of $F$ in the $W$-Hilbert space, i.e., it is the matrix for which the directional derivative can be written as

F^{\prime}(Z;\Delta Z)=\lim_{h\to 0}\frac{F(Z+h\Delta Z)-F(Z)}{h}=\langle\nabla^{W}F(Z),\Delta Z\rangle_{W}.

Also, we define the $W$-proximal operator as

(12) \operatorname{Prox}_{W,\sigma}(Z^{\prime})=\arg\min_{Z}\|Z\|_{W,1}+\frac{1}{2\sigma}\|Z-Z^{\prime}\|^{2}_{W,2}.

Since the choice of the inner product agrees with the vector norm used in the definition of $\lVert\cdot\rVert_{W,1}$, it can be checked that the closed-form solution is given by

\left[\operatorname{Prox}_{W,\sigma}(Z)\right]_{n}=z_{n}\max\left\{0,\;1-\frac{\sigma}{\|z_{n}\|_{W}}\right\}=z_{n}\left(1-\frac{\sigma}{\max\{\sigma,\|z_{n}\|_{W}\}}\right).

Note that this analytic representation would not be possible for the standard inner product.
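A minimal sketch of this $W$-proximal operator (our own illustration, applying the closed-form row-wise shrinkage above; the function name prox_w is not from the paper, and Z, W, sigma are assumed to be a matrix, a symmetric positive definite weight, and a step size):

```python
import numpy as np

def prox_w(Z, W, sigma):
    """Row-wise shrinkage: [Prox_{W,sigma}(Z)]_n = z_n * max(0, 1 - sigma / ||z_n||_W)."""
    row_w_norms = np.sqrt(np.einsum('nk,kl,nl->n', Z, W, Z))   # ||z_n||_W for each row
    scale = 1.0 - sigma / np.maximum(sigma, row_w_norms)       # equals 0 when ||z_n||_W <= sigma
    return Z * scale[:, None]
```

The shrinkage threshold is measured in the same $W$-norm that defines $\lVert\cdot\rVert_{W,1}$, which is exactly what makes this closed form possible.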

The main steps of the algorithm, given a previous iterate $Z_{k}$, $W_{k}$ ($k\geq 0$), are now derived as follows. First, we compute the update for the multiplier $\Lambda_{k}$ as a solution to the Lagrange stationarity condition $\nabla_{W}\mathcal{L}(Z_{k},W_{k},\Lambda_{k})=0$. We observe that

\nabla_{W}\mathcal{L}(Z_{k},W_{k},\Lambda)=\frac{1}{2}\sum_{\{\,n\colon z^{k}_{n}\neq 0\,\}}\frac{z^{k}_{n}(z^{k}_{n})^{T}}{\sqrt{(z^{k}_{n})^{T}W_{k}z^{k}_{n}}}+\frac{1}{2(1-\gamma)}W_{k}^{-1}\Lambda W_{k}^{-1},

and the equation can be uniquely solved for $\Lambda=\Lambda_{k}$ as given in step 4 of Algorithm 1 below; cf. Definition 5.3.

Second, with given $Z_{k}$, $W_{k}$ and $\Lambda_{k}$, we select a step size $\sigma_{k}>0$ and update $Z_{k+1}=\operatorname{Prox}_{W_{k},\sigma_{k}}(Z_{k}-\sigma_{k}G_{k})$, where $G_{k}=\nabla_{Z}\mathcal{M}(Z_{k},W_{k},\Lambda_{k})W_{k}^{-1}$ is the $W_{k}$-gradient of the smooth part of the Lagrange function, given as

\nabla_{Z}\mathcal{M}(Z,W,\Lambda)=\frac{1}{\alpha}A^{T}(AZ-Y)+Z\Lambda.

In Subsection 6.2 we introduce and discuss an Armijo line search procedure for finding the step size.

Finally, we need to update $W$, where we simply set $W_{k+1}=(\gamma I+(1-\gamma)Z_{k+1}^{T}Z_{k+1})^{-1}$, which solves the constraint exactly. Note that, due to this choice, $W_{k}$ can be eliminated as an explicit iteration variable and only $Z_{0}$ needs to be provided to initialize the algorithm.

Our proposed iterative scheme for solving the problem (11) is summarized in the following steps:

Algorithm 1
1: Initialize $Z_{0}$
2: while not converged do
3:     $W_{k}=(\gamma I+(1-\gamma)Z_{k}^{T}Z_{k})^{-1}$
4:     $\Lambda_{k}=-\sum_{\{n\colon z^{k}_{n}\neq 0\}}\frac{1-\gamma}{\sqrt{(z^{k}_{n})^{T}W_{k}z^{k}_{n}}}W_{k}z^{k}_{n}(z^{k}_{n})^{T}W_{k}$
5:     $G_{k}=(Z_{k}\Lambda_{k}+(1/\alpha)A^{T}(AZ_{k}-Y))W_{k}^{-1}$
6:     Choose the step size $\sigma_{k}$ with an Armijo line search and set $Z_{k+1}=\operatorname{Prox}_{W_{k},\sigma_{k}}(Z_{k}-\sigma_{k}G_{k})$
7:     $k=k+1$
8: end while

We note that for γ=1\gamma=1, we have Wk=IW_{k}=I, Λk=0\Lambda_{k}=0, and we obtain the classical proximal gradient method for group sparse least squares regression. Note also that this extension’s main additional effort is the computation of the inverse WkW_{k}, which is cheap for moderately sized KK compared to NN.
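For concreteness, one iteration of Algorithm 1 can be sketched in NumPy as follows, reusing the prox_W function from the sketch above. This is only an illustration under simplifying assumptions: it uses a fixed step size sigma instead of the Armijo search of Subsection 6.2, and all function and variable names are ours.

import numpy as np

def owl21_iteration(Z, A, Y, alpha, gamma, sigma):
    # One step of Algorithm 1 (fixed step size instead of the Armijo search).
    N, K = Z.shape
    M = gamma * np.eye(K) + (1.0 - gamma) * Z.T @ Z            # M_k = W_k^{-1}
    W = np.linalg.inv(M)                                       # step 3
    row_norms = np.sqrt(np.maximum(np.einsum('nk,kl,nl->n', Z, W, Z), 0.0))
    d = np.where(row_norms > 0, 1.0 / np.maximum(row_norms, 1e-300), 0.0)
    Lam = -(1.0 - gamma) * W @ Z.T @ (d[:, None] * Z) @ W      # step 4
    G = (Z @ Lam + A.T @ (A @ Z - Y) / alpha) @ M              # step 5
    return prox_W(Z - sigma * G, W, sigma)                     # step 6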

Remark 6.1.

For the case of large KK, the effort of direct computation of the inverse WkW_{k} can become prohibitive. If YY is approximately low rank, this issue can be alleviated by decomposing Y=YQY=Y^{\prime}Q, where YM×rY^{\prime}\in\mathbb{R}^{M\times r} with rKr\ll K and Qr×KQ\in\mathbb{R}^{r\times K} with QQT=Ir×rQQ^{T}=I\in\mathbb{R}^{r\times r}; then the optimization problem reduces to a cheaper one where the data matrix YY is replaced by YY^{\prime} (see Section 7.2 for examples). Generally, an efficient way to compute WW can be given for s=2,0(Z)<Ks=\ell_{2,0}(Z)<K and r=rank(Z)s<Kr={\rm rank\,}(Z)\leq s<K: First, we decompose Z=PI(Z)Z=PRQZ=P_{I(Z)}Z^{\prime}=PRQ, where PI(Z){0,1}N×sP_{I(Z)}\in\{0,1\}^{N\times s} is the selection of columns of the identity matrix corresponding to the support I(Z)I(Z) of ZZ, and the factors Rs×rR\in\mathbb{R}^{s\times r} and Qr×KQ\in\mathbb{R}^{r\times K} with QQ orthogonal can be computed, e.g., with a QR-decomposition QTRT=ZTPI(Z)=(Z)TQ^{T}R^{T}=Z^{T}P_{I(Z)}=(Z^{\prime})^{T}. Then, with M=γI+(1γ)QTRTRQM=\gamma I+(1-\gamma)Q^{T}R^{T}RQ we have W=M=γ1(IQTQ)+QT(γI+(1γ)RTR)1QW=M^{\dagger}=\gamma^{-1}(I-Q^{T}Q)+Q^{T}(\gamma I+(1-\gamma)R^{T}R)^{-1}Q. If we now modify the algorithm to not explicitly store WW, but only PP, QQ and RR, the complexity can be reduced if sKs\ll K. Note that the matrix Λk\Lambda_{k} is also in low-rank format as the sum of ss rank-one products. In this context, it is critical to ensure that sk=rank(Zk)s_{k}={\rm rank\,}(Z_{k}) remains small from initialization throughout the iterations. We postpone a detailed investigation of this to future work.
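The low-rank formula for WW in this remark can be checked with the following NumPy sketch, assuming γ>0\gamma>0 and that the number of nonzero rows does not exceed KK. Note that this sketch still assembles the K×KK\times K matrix explicitly for verification purposes; an efficient implementation would store only PP, QQ and RR and apply WW implicitly.

import numpy as np

def W_lowrank(Z, gamma):
    # W = gamma^{-1} (I - Q^T Q) + Q^T (gamma I + (1-gamma) R^T R)^{-1} Q
    K = Z.shape[1]
    support = np.linalg.norm(Z, axis=1) > 0          # I(Z)
    Zp = Z[support, :]                               # Z' in R^{s x K}
    Qt, Rt = np.linalg.qr(Zp.T)                      # (Z')^T = Q^T R^T (thin QR)
    Q, R = Qt.T, Rt.T                                # Q has orthonormal rows
    s = R.shape[0]
    inner = np.linalg.inv(gamma * np.eye(s) + (1.0 - gamma) * R.T @ R)
    return (np.eye(K) - Q.T @ Q) / gamma + Q.T @ inner @ Q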

6.2. Step size selection for descent

In this section, we demonstrate that the iteration steps in the proposed algorithm result in a decrease of the loss function and that, therefore, the algorithm is convergent if the step size σk\sigma_{k} is chosen appropriately.

Let Z^N×K\widehat{Z}\in\mathbb{R}^{N\times K} be given, which can be thought of as a given (old) iterate ZkZ_{k}. Assume γ(0,1]\gamma\in(0,1] or γ=0\gamma=0 and Z^TZ^\widehat{Z}^{T}\widehat{Z} is invertible. Based on the model function mZ^Ψm^{\Psi}_{\widehat{Z}} from Definition 5.3, the following function will serve as a model function for the objective function J=JγJ=J_{\gamma}:

mZ^,L(Z):\displaystyle m_{\widehat{Z},L}(Z): =mZ^Ψ(Z)+F^+1αAT(AZ^Y),ZZ^+L2ZZ^W^,22\displaystyle=m^{\Psi}_{\widehat{Z}}(Z)+\widehat{F}+\frac{1}{\alpha}\langle A^{T}(A\widehat{Z}-Y),Z-\widehat{Z}\rangle+\frac{L}{2}\lVert Z-\widehat{Z}\rVert_{\widehat{W},2}^{2}
=F^+ZW^,1+G^,ZZ^W^+L2ZZ^W^,22\displaystyle=\widehat{F}+\lVert Z\rVert_{\widehat{W},1}+\langle\widehat{G},Z-\widehat{Z}\rangle_{\widehat{W}}+\frac{L}{2}\lVert Z-\widehat{Z}\rVert_{\widehat{W},2}^{2}

where

F^=12αAZ^YFro2,G^=(Z^Λ^+(1/α)AT(AZ^Y))W^1.\widehat{F}=\frac{1}{2\alpha}\lVert A\widehat{Z}-Y\rVert_{\textnormal{Fro}}^{2},\quad\widehat{G}=(\widehat{Z}\widehat{\Lambda}+(1/\alpha)A^{T}(A\widehat{Z}-Y))\widehat{W}^{-1}.

We will use this model as an upper bound of JJ, where the constant LL has to account for the neglected terms in Ψγ\Psi_{\gamma} as well as in the quadratic fitting term. The new iterate Zk+1Z_{k+1} can then be interpreted as the minimizer of the model function for Z^=Zk\widehat{Z}=Z_{k} and L=1/σkL=1/\sigma_{k}. Due to the complicated structure of the nonlinearity in Ψγ\Psi_{\gamma}, we will not give LL analytically, but identify it computationally.

Proposition 6.2.

Let γ(0,1]\gamma\in(0,1] or Z^TZ^\widehat{Z}^{T}\widehat{Z} be invertible. Then mZ^,L(Z)m_{\widehat{Z},L}(Z) has the following properties

  1. (1)

    J(Z^)=mZ^,L(Z^)J(\widehat{Z})=m_{\widehat{Z},L}(\widehat{Z}),

  2. (2)

    J(Z)mZ^,L^(Z)J(Z)\leq m_{\widehat{Z},\widehat{L}}(Z) for all ZZ^W^,2Δ¯\lVert Z-\widehat{Z}\rVert_{\widehat{W},2}\leq\bar{\Delta} with some universal Δ¯>0\bar{\Delta}>0 and L^\widehat{L} only dependent on Z^op\lVert\widehat{Z}\rVert_{\textnormal{op}},

  3. (3)

    the convex function mZ^,Lm_{\widehat{Z},L} is minimized (uniquely) at

    Z+=ProxW^,1/L(Z^(1/L)G^).Z^{+}=\operatorname{Prox}_{\widehat{W},1/L}\left(\widehat{Z}-(1/L)\widehat{G}\right).
Proof.

Part 1 is straightforward. Part 2 follows with

J(Z)=\displaystyle J(Z)=\; Ψγ(Z)+12αAZYFro2\displaystyle\Psi_{\gamma}(Z)+\frac{1}{2\alpha}\lVert AZ-Y\rVert_{\textnormal{Fro}}^{2}
=\displaystyle=\; ZW^,1+Z^Λ^+(1/α)AT(AZ^Y),ZZ^+F^+12αA(ZZ^)Fro2+R(Z,Z^)\displaystyle\lVert Z\rVert_{\widehat{W},1}+\langle\widehat{Z}\widehat{\Lambda}+(1/\alpha)A^{T}(A\widehat{Z}-Y),Z-\widehat{Z}\rangle+\widehat{F}+\frac{1}{2\alpha}\lVert A(Z-\widehat{Z})\rVert_{\textnormal{Fro}}^{2}+R(Z,\widehat{Z})
=\displaystyle=\;m^{\Psi}_{\widehat{Z}}(Z)+\widehat{F}+\frac{1}{\alpha}\langle A^{T}(A\widehat{Z}-Y),Z-\widehat{Z}\rangle+R_{2}(Z,\widehat{Z})

and the estimate

|R2(Z,Z^)|L¯+L^A2ZZ^W^,22=L^2ZZ^W^,22,\lvert R_{2}(Z,\widehat{Z})\rvert\leq\frac{\bar{L}+\widehat{L}_{A}}{2}\|Z-\widehat{Z}\|^{2}_{\widehat{W},2}=\frac{\widehat{L}}{2}\|Z-\widehat{Z}\|^{2}_{\widehat{W},2},

with Theorem A.6 for some L¯>0\bar{L}>0 assuming ZZ^W^,2Δ¯\lVert Z-\widehat{Z}\rVert_{\widehat{W},2}\leq\bar{\Delta}, and the appropriate Lipschitz constant of the quadratic part is given as

L^A=1α(supZW^,21AZFro)2=Aop2W^1opα=Aop2(γ+(1γ)Z^op2)α.\widehat{L}_{A}=\frac{1}{\alpha}\left(\sup_{\lVert Z\rVert_{\widehat{W},2}\leq 1}\lVert AZ\rVert_{\textnormal{Fro}}\right)^{2}=\frac{\lVert A\rVert_{\textnormal{op}}^{2}\lVert\widehat{W}^{-1}\rVert_{\textnormal{op}}}{\alpha}=\frac{\lVert A\rVert_{\textnormal{op}}^{2}(\gamma+(1-\gamma)\lVert\widehat{Z}\rVert^{2}_{\textnormal{op}})}{\alpha}.

To prove part 3, notice that in the definition given by formula (12), if we replace ZZ^{\prime} by Z+=Z^(1/L)G^Z^{+}=\widehat{Z}-(1/L)\widehat{G} and disregard some terms not dependent on ZZ, we arrive at mZ^,L(Z)m_{\widehat{Z},L}(Z). ∎

Corollary 6.3.

If Z¯\bar{Z} is a local minimum of JJ then, for any L>0L>0, Z¯\bar{Z} is the unique minimizer of mZ¯,Lm_{\bar{Z},L}, and

Z¯=ProxW¯,1/L(Z¯(1/L)G¯),\bar{Z}=\operatorname{Prox}_{\bar{W},1/L}\left(\bar{Z}-(1/L)\bar{G}\right),

where G¯=(Z¯Λ¯+(1/α)AT(AZ¯Y))W¯1\bar{G}=(\bar{Z}\bar{\Lambda}+(1/\alpha)A^{T}(A\bar{Z}-Y))\bar{W}^{-1}.

Proof.

The first claim follows from parts 1 and 2 in Proposition 6.2. Then, the optimality conditions arise from part 3 in Proposition 6.2, by realizing that Z+=Z¯Z^{+}=\bar{Z}. ∎

Remark 6.4.

The condition in Corollary 6.3 is equivalent to the first order conditions in Theorem 5.4.

Remark 6.5.

Let us briefly discuss the case K=1K=1 and γ=0\gamma=0, which reduces to the 1/2\ell_{1}/\ell_{2} norm regularization. Here, W^=Z^22\widehat{W}=\lVert\widehat{Z}\rVert_{2}^{-2}, Λ^=Z^23Z^1\widehat{\Lambda}=-\lVert\widehat{Z}\rVert_{2}^{-3}\lVert\widehat{Z}\rVert_{1} and the model function up to the fixed additive constant mZ^,0Ψ(0)m^{\Psi}_{\widehat{Z},0}(0) and multiplicative constant 1/Z^21/\lVert\widehat{Z}\rVert_{2} is given as

ψZ^,L(Z)=1Z^2(mZ^,LΨ(Z)mZ^,0Ψ(0))=Z1Z^1Z^2Z^Z^2,Z+H^,Z+L2ZZ^22\psi_{\widehat{Z},L}(Z)=\frac{1}{\lVert\widehat{Z}\rVert_{2}}\left(m^{\Psi}_{\widehat{Z},L}(Z)-m^{\Psi}_{\widehat{Z},0}(0)\right)=\lVert Z\rVert_{1}-\Big{\langle}\frac{\lVert\widehat{Z}\rVert_{1}}{\lVert\widehat{Z}\rVert_{2}}\,\frac{\widehat{Z}}{\lVert\widehat{Z}\rVert_{2}},\;Z\Big{\rangle}+\langle\widehat{H},Z\rangle+\frac{L}{2}\lVert Z-\widehat{Z}\rVert^{2}_{2}

with H^=(Z^2/α)AT(AZ^Y)\widehat{H}=(\lVert\widehat{Z}\rVert_{2}/\alpha)A^{T}(A\widehat{Z}-Y). Replacing H^,Z\langle\widehat{H},Z\rangle with the indicator function of the constraint AZ=YAZ=Y, as is appropriate for solving formulation (3) compared to the regularized version (4), and setting Z^=Zk\widehat{Z}=Z_{k} and Zk+1=argminZψZk,L(Z)=argminZmZk,LΨ(Z)Z^{k+1}=\operatorname*{argmin}_{Z}\psi_{Z_{k},L}(Z)=\operatorname*{argmin}_{Z}m^{\Psi}_{Z_{k},L}(Z) we obtain exactly [53, Algorithm 2] (see also [57]).

We now return to the iterative procedure outlined in Algorithm 1, which generates the sequence ZkZ_{k}. We will show that the method is a descent method, i.e., J(Zk+1)J(Zk)J(Z_{k+1})\leq J(Z_{k}). Note that this implies that

(13) ZkWk,1+12αAZkYFro2J(Z0).\lVert Z_{k}\rVert_{W_{k},1}+\frac{1}{2\alpha}\lVert AZ_{k}-Y\rVert_{\textnormal{Fro}}^{2}\leq J(Z_{0}).

For the following analysis, we need to ensure that ZkZ_{k} remains bounded, which unfortunately does not directly follow from this inequality due to the fact that AA may have a nontrivial kernel and Ψγ\Psi_{\gamma} is bounded by \sqrt{NK/(1-\gamma)} (see Proposition A.4). Similar to Proposition 4.8, a uniform bound on ZkZ_{k} could be deduced under appropriate assumptions on α\alpha, δ\delta, Z0Z_{0} and AA using the (r,s)(r,s)-SSP. For simplicity, from now on we simply assume that the sequence ZkZ_{k} remains bounded:

(14) ZkopC0for all k1.\lVert Z_{k}\rVert_{\textnormal{op}}\leq C_{0}\quad\text{for all }k\geq 1.

This will allow us to bound certain constants uniformly, in particular Wk1op=(γ+(1γ)Zkop2)\lVert W_{k}^{-1}\rVert_{\textnormal{op}}=(\gamma+(1-\gamma)\lVert Z_{k}\rVert^{2}_{\textnormal{op}}).

In order to show that descent can be achieved in every iteration as in Proposition 6.2 with Z^=Zk\widehat{Z}=Z_{k}, we have to replace the unknown constant LL in mZk,L(Z)m_{Z_{k},L}(Z) with a step size σk1/L^\sigma_{k}\sim 1/\widehat{L}. In practice, since this constant depends on the linearization point Z^=Zk\widehat{Z}=Z_{k} and the size of the update Zk+1ZkZ_{k+1}-Z_{k}, we select it by an adaptive Armijo line search procedure described below. Given a candidate step size σ>0\sigma>0, we define the candidate next iterate as

(15) Zσ+=ProxWk,σ(ZkσGk)Z^{+}_{\sigma}=\operatorname{Prox}_{W_{k},\sigma}\left(Z_{k}-\sigma G_{k}\right)

where GkG_{k} is defined as before as Gk=(ZkΛk+(1/α)AT(AZkY))Wk1G_{k}=(Z_{k}\Lambda_{k}+(1/\alpha)A^{T}(AZ_{k}-Y))W_{k}^{-1}. We define the “predicted decrease” associated to a candidate stepsize σ>0\sigma>0 as

\displaystyle\operatorname{pred}_{\sigma}(Z_{k})=\lVert Z^{+}_{\sigma}\rVert_{W_{k},1}-\lVert Z_{k}\rVert_{W_{k},1}+\langle G_{k},Z^{+}_{\sigma}-Z_{k}\rangle_{W_{k}}.

The predicted decrease serves as an optimistic estimate of how much descent in the functional can be expected in the current iteration, and can be used as an indicator for stationarity of ZkZ_{k}, as characterized by the next proposition.

Proposition 6.6.

For any σ>0\sigma>0 it holds

predσ(Zk)=mZk,1/σ(Zσ+)mZk,1/σ(Zk)12σZσ+ZkWk,220\operatorname{pred}_{\sigma}(Z_{k})=m_{Z_{k},1/\sigma}(Z^{+}_{\sigma})-m_{Z_{k},1/\sigma}(Z_{k})-\frac{1}{2\sigma}\lVert Z^{+}_{\sigma}-Z_{k}\rVert^{2}_{W_{k},2}\leq 0

and, thus,

(16) 12σZσ+ZkWk,22predσ(Zk).\frac{1}{2\sigma}\lVert Z^{+}_{\sigma}-Z_{k}\rVert^{2}_{W_{k},2}\leq-\operatorname{pred}_{\sigma}(Z_{k}).
Proof.

The alternate expression for predσ(Zk)\operatorname{pred}_{\sigma}(Z_{k}) is easy to verify. The inequality predσ(Zk)0\operatorname{pred}_{\sigma}(Z_{k})\leq 0 holds since mZk,1/σ(Zσ+)mZk,1/σ(Zk)m_{Z_{k},1/\sigma}(Z^{+}_{\sigma})\leq m_{Z_{k},1/\sigma}(Z_{k}) due to part 3 in Proposition 6.2. ∎

To determine an appropriate step size, we provide a (large) initial guess σ1=σmax\sigma_{-1}=\sigma_{\max} and, in each iteration, take the largest σ{βjmin{σmax,σk1/β}|j=0,1,2,}\sigma\in\{\,\beta^{j}\min\{\,\sigma_{\max},\;\sigma_{k-1}/\beta\,\}\;|\;j=0,1,2,\ldots\,\} for some line-search parameter β(0,1)\beta\in(0,1) that achieves a fraction of the predicted decrease, i.e., such that it holds

(17) Mσ+=γI+(1γ)(Zσ+)TZσ+ is invertible and J(Zσ+)J(Zk)κpredσ(Zk),M^{+}_{\sigma}=\gamma I+(1-\gamma)(Z^{+}_{\sigma})^{T}Z^{+}_{\sigma}\text{ is invertible and }J(Z^{+}_{\sigma})-J(Z_{k})\leq\kappa\operatorname{pred}_{\sigma}(Z_{k}),

where κ(0,1)\kappa\in(0,1) is an algorithmic constant. By Proposition 6.7 below, the condition in (17) is satisfied for all sufficiently small σ\sigma, so the Armijo line search terminates with an admissible step size σk\sigma_{k}.
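The backtracking search for (17) can be sketched as follows in NumPy (all names are ours: J is a user-supplied routine evaluating the objective JγJ_{\gamma}, prox_W is the shrinkage sketched earlier, sigma_start would be chosen as min{σmax,σk1/β}\min\{\sigma_{\max},\sigma_{k-1}/\beta\}, and the WW-weighted inner product is used in the predicted decrease).

import numpy as np

def row_norms_W(V, W):
    return np.sqrt(np.maximum(np.einsum('nk,kl,nl->n', V, W, V), 0.0))

def armijo_update(Z, W, G, J, gamma, sigma_start, beta=0.5, kappa=0.5):
    K = Z.shape[1]
    Jk = J(Z)
    sigma = sigma_start
    while True:
        Zp = prox_W(Z - sigma * G, W, sigma)
        pred = (row_norms_W(Zp, W).sum() - row_norms_W(Z, W).sum()
                + np.sum((G @ W) * (Zp - Z)))                  # <G, Zp - Z>_W
        Mp = gamma * np.eye(K) + (1.0 - gamma) * Zp.T @ Zp
        invertible = gamma > 0 or np.linalg.matrix_rank(Mp) == K
        if invertible and J(Zp) - Jk <= kappa * pred:          # condition (17)
            return Zp, sigma
        sigma *= beta                                          # backtrack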

Proposition 6.7.

Given the bounds (13) and (14), there exists a σ¯\bar{\sigma} such that (17) is fulfilled for any σσ¯\sigma\leq\bar{\sigma}.

Proof.

In order to apply part 2 of Proposition 6.2, we first need a bound on the difference Zσ+ZkZ^{+}_{\sigma}-Z_{k}. By a standard estimate for the Prox-operator and the triangle inequality,

Zσ+ZkWk,2σGkWk,2σ(ZkΛkWk1,2+1αAT(AZkY)Wk1,2).\lVert Z^{+}_{\sigma}-Z_{k}\rVert_{W_{k},2}\leq\sigma\lVert G_{k}\rVert_{W_{k},2}\leq\sigma\left(\lVert Z_{k}\Lambda_{k}\rVert_{W_{k}^{-1},2}+\frac{1}{\alpha}\lVert A^{T}(AZ_{k}-Y)\rVert_{W_{k}^{-1},2}\right).

Furthermore as in Remark A.7,

ZkΛkWk1,2ZkWk1/2opWk1/2ΛkWk1/2Fro(1γ)1/2ZkWk,1\lVert Z_{k}\Lambda_{k}\rVert_{W_{k}^{-1},2}\leq\lVert Z_{k}W_{k}^{1/2}\rVert_{\textnormal{op}}\lVert W_{k}^{-1/2}\Lambda_{k}W_{k}^{-1/2}\rVert_{\textnormal{Fro}}\leq(1-\gamma)^{1/2}\lVert Z_{k}\rVert_{W_{k},1}

and

1αAT(AZkY)Wk1,2LAAZkYFro.\frac{1}{\alpha}\lVert A^{T}(AZ_{k}-Y)\rVert_{W_{k}^{-1},2}\leq\sqrt{L_{A}}\lVert AZ_{k}-Y\rVert_{\textnormal{Fro}}.

By (13) and (14), both of these quantities are bounded by a universal constant, and thus there is a g¯\bar{g} such that Zσ+ZkWk,2σg¯\lVert Z^{+}_{\sigma}-Z_{k}\rVert_{W_{k},2}\leq\sigma\bar{g}. Thus, for σ\sigma small enough, by Proposition 6.2 we know that there exists a constant L0L_{0} (dependent only on ZkopC0\lVert Z_{k}\rVert_{\textnormal{op}}\leq C_{0}), such that

J(Zσ+)J(Zk)\displaystyle J(Z^{+}_{\sigma})-J(Z_{k}) mZk,L0(Zσ+)J(Zk)\displaystyle\leq m_{Z_{k},{L}_{0}}(Z^{+}_{\sigma})-J(Z_{k})
=mZk,L0(Zσ+)mZk,L0(Zk)\displaystyle=m_{Z_{k},{L}_{0}}(Z^{+}_{\sigma})-m_{Z_{k},{L}_{0}}(Z_{k})
=predσ(Zk)+L02Zσ+ZkWk,22\displaystyle=\operatorname{pred}_{\sigma}(Z_{k})+\frac{{L}_{0}}{2}\lVert Z^{+}_{\sigma}-Z_{k}\rVert^{2}_{W_{k},2}
\displaystyle=\kappa\operatorname{pred}_{\sigma}(Z_{k})+(1-\kappa)\left[\operatorname{pred}_{\sigma}(Z_{k})+\frac{1}{2\sigma}\lVert Z^{+}_{\sigma}-Z_{k}\rVert^{2}_{W_{k},2}\right]+\frac{\sigma{L}_{0}-(1-\kappa)}{2\sigma}\lVert Z^{+}_{\sigma}-Z_{k}\rVert^{2}_{W_{k},2}.

The second term is nonpositive due to Proposition 6.6, and the third term is nonpositive for \sigma\leq(1-\kappa)/{L}_{0}. ∎

Proposition 6.8.

Consider ZkZ_{k} started from a given Z0Z_{0} (with either γ>0\gamma>0 or Z0TZ0Z_{0}^{T}Z_{0} invertible) by Zk+1=Zσk+Z_{k+1}=Z^{+}_{\sigma_{k}} from (15) and σk\sigma_{k} chosen according to (17). Finally, assume that (14) is fulfilled. Then the following holds:

  1. (1)

    The descent in every iteration can be estimated by

    J(Zk+1)J(Zk)κpredσk(Zk)0.\displaystyle J(Z_{k+1})-J(Z_{k})\leq\kappa\operatorname{pred}_{\sigma_{k}}(Z_{k})\leq 0.
  2. (2)

    infkσk>0\inf_{k}\sigma_{k}>0.

  3. (3)

    For any accumulation point ZZ^{\prime} of sequence ZkZ_{k} (with either γ>0\gamma>0 or (Z)TZ(Z^{\prime})^{T}Z^{\prime} invertible) the first order necessary condition is fulfilled.

Proof.

We apply Proposition 6.7 for part 1. From this, we derive that limkJ(Zk)=J¯\lim_{k\to\infty}J(Z_{k})=\bar{J} for some J¯J(Z0)\bar{J}\leq J(Z_{0}), since this is a decreasing bounded sequence and also (13) is fulfilled. For part 2, we note that by construction of the Armijo stepsize and Proposition 6.7, we know that σmaxσkmin{σmax,βσ¯}\sigma_{\max}\geq\sigma_{k}\geq\min\{\sigma_{\max},\beta\bar{\sigma}\}. Moreover,

\displaystyle\infty>J(Z_{0})-\bar{J} =\sum_{k=0}^{\infty}\left(J(Z_{k})-J(Z_{k+1})\right)
\displaystyle\geq-\kappa\sum_{k=0}^{\infty}\operatorname{pred}_{\sigma_{k}}(Z_{k})\geq\frac{\kappa}{2}\left(\inf_{k}\sigma_{k}\right)\sum_{k=0}^{\infty}\left\lVert\frac{1}{\sigma_{k}}(Z^{+}_{\sigma_{k}}-Z_{k})\right\rVert^{2}_{W_{k},2}.

Thus, we obtain

1σk(Zσk+Zk)Wk1/20.\frac{1}{\sigma_{k}}(Z^{+}_{\sigma_{k}}-Z^{k})W_{k}^{1/2}\to 0.

Let now ZZ^{\prime} be any accumulation point of ZkZ_{k} and σ>0\sigma^{\prime}>0 an accumulation point of σk\sigma_{k} along the same subsequence and W=(M)1W^{\prime}=(M^{\prime})^{-1} for invertible M=γI+(1γ)(Z)TZM^{\prime}=\gamma I+(1-\gamma)(Z^{\prime})^{T}Z^{\prime}. Along this subsequence we have Zσk+Zk0Z^{+}_{\sigma_{k}}-Z^{k}\to 0 due to uniform invertibility of all WkW_{k} and σk\sigma_{k} and thus

Z^{+}_{\sigma_{k}}-Z_{k}=\operatorname{Prox}_{W_{k},\sigma_{k}}\left(Z_{k}-\sigma_{k}G_{k}\right)-Z_{k}\to\operatorname{Prox}_{W^{\prime},\sigma^{\prime}}\left(Z^{\prime}-\sigma^{\prime}G^{\prime}\right)-Z^{\prime}=0,

where G=(ZΛ+(1/α)AT(AZY))MG^{\prime}=(Z^{\prime}\Lambda^{\prime}+(1/\alpha)A^{T}(AZ^{\prime}-Y))M^{\prime} by continuity of all expressions (using Proposition A.9 and Proposition A.10). Thus, with Remark 6.4 the stationarity conditions are fulfilled. ∎

Remark 6.9.

According to Proposition 6.7, the predicted decrease provides both an upper and a lower bound on the decrease after each successful iteration. Scaling it by the step size to avoid a bias due to small steps, we obtain an appropriate stopping criterion:

(18) 1σkpredσk(Zk)rTOLJk,-\frac{1}{\sigma_{k}}\operatorname{pred}_{\sigma_{k}}(Z_{k})\leq r_{\mathrm{TOL}}J_{k},

where rTOL<1r_{\mathrm{TOL}}<1 is a relative tolerance, and JkJ_{k} some reference value of the functional, e.g., Jk=J(Zk)J_{k}=J(Z_{k}). Moreover, due to (16) this provides an estimate on the weighted Prox-residual rProxk=σk1(Zσk+Zk)Wk1/2r^{k}_{\operatorname{Prox}}=\sigma_{k}^{-1}(Z^{+}_{\sigma_{k}}-Z^{k})W_{k}^{1/2}, which characterizes stationarity.
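In code, this test is a one-liner on top of quantities already available from the line search; for instance, continuing the notation of the sketches above (a hypothetical helper, not part of the released code):

def stop_criterion(pred, sigma, J_ref, r_tol=1e-6):
    # Stopping test (18): scaled predicted decrease below a relative tolerance.
    return -pred / sigma <= r_tol * J_ref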

6.3. Adaptive choice of the regularization parameter

In practice, the value α\alpha has to be chosen appropriately, to balance a good fit of the data with the desire for robustness to noise and sparsity. This is studied extensively in regularization theory for inverse problems; see, e.g. [20]. In this manuscript, we rely on the (Morozov) discrepancy principle, which presumes that a noise level δ>0\delta>0 is known, and α\alpha is adjusted until an (approximate) solution ZαZ_{\alpha} of (11) fulfills

(19) τ1δAZαYFroτ2δ\tau_{1}\,\delta\leq\lVert AZ_{\alpha}-Y\rVert_{\textnormal{Fro}}\leq\tau_{2}\,\delta

for some 0<τ11<τ20<\tau_{1}\leq 1<\tau_{2}. To achieve this, we use the following procedure:

  1. (1)

    Solve (11) by several steps of Algorithm 1 for given fixed α\alpha until (18) is fulfilled.

  2. (2)

    If (19) is violated, update α\alpha (decrease if the fit is too large, increase otherwise) and go back to step 1.

Let us compare this to the constrained formulation (5): First, we take into account that the inequality constraint AZYFroδ\lVert AZ-Y\rVert_{\textnormal{Fro}}\leq\delta is generally fulfilled with equality at optimal solutions, since getting a better fit usually requires increasing the regularization term. This is indeed the case for γ>0\gamma>0, but not necessarily for γ=0\gamma=0, and we will discuss this further below. Second, 1/(2α)1/(2\alpha) can be interpreted as a Lagrange multiplier for this constraint. Thus, (19) amounts to replacing the hard constraint AZYFro=δ\lVert AZ-Y\rVert_{\textnormal{Fro}}=\delta by a fuzzy version, which is known not to affect recovery guarantees for convex problems, since δ\delta is usually not known exactly anyway. For efficiency purposes, experience from convex regularization suggests initializing α\alpha large (to obtain a highly sparse approximation initially), and only decreasing it if (19) is violated, and we employ this as well. We leave a detailed analysis and algorithmic refinements to future work.
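A minimal sketch of this outer loop is given below. Here, solve_fixed_alpha stands for running Algorithm 1 for fixed α\alpha until the stopping test (18) holds; this callable, the function names, and the simple multiplicative update of α\alpha are our illustrative choices, not a prescription from the method itself.

import numpy as np

def discrepancy_loop(solve_fixed_alpha, A, Y, delta, Z0, alpha0, gamma,
                     tau1=0.9, tau2=1.1, factor=2.0, max_outer=30):
    # solve_fixed_alpha(A, Y, Z, alpha, gamma): user-supplied inner solver.
    alpha, Z = alpha0, Z0
    for _ in range(max_outer):
        Z = solve_fixed_alpha(A, Y, Z, alpha, gamma)     # warm started
        misfit = np.linalg.norm(A @ Z - Y, 'fro')
        if misfit > tau2 * delta:
            alpha /= factor            # fit too loose: decrease alpha
        elif misfit < tau1 * delta:
            alpha *= factor            # fit tighter than needed: increase alpha
        else:
            break                      # (19) holds
    return Z, alpha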

6.3.1. Discrepancy principle for the 𝑜𝑤2,1\mathit{ow}\ell_{2,1} functional

The choice γ=0\gamma=0 does not always fit the discrepancy principle (19), and in some cases we can only enforce the upper bound. This is due to the fact that, unless γ>0\gamma>0, a change in the variable ZZ that is allowed to increase the discrepancy does not necessarily make it possible to reduce the regularization term Ψγ\Psi_{\gamma} by moving the reconstruction closer to zero. Note that

Ψγ(τZ)Ψγ(Z)for 0τ<1\Psi_{\gamma}(\tau Z)\leq\Psi_{\gamma}(Z)\quad\text{for }0\leq\tau<1

with equality for γ=0\gamma=0 and strict inequality for γ>0\gamma>0 and Z0Z\neq 0. The latter property allows us to show directly that any solution of (5) will always fulfill AZYFro=δ\lVert AZ-Y\rVert_{\textnormal{Fro}}=\delta for γ>0\gamma>0.

To illustrate the issue for γ=0\gamma=0 in more detail, consider the case of matrices with rank(Z)=2,0(Z)=K{\rm rank\,}(Z)=\ell_{2,0}(Z)=K, and the optimality conditions from Theorem 5.4. In that case, for the set of vectors znz_{n} with zn0z_{n}\neq 0, the vectors u_{n}=W^{1/2}z_{n}=(Z^{T}Z)^{-1/2}z_{n} form an orthonormal basis of K\mathbb{R}^{K} and

1znWWi:zi01ziWWziziW=W1/2(1un2Ii:zi01ui2uiui)W1/2=0.\frac{1}{\lVert z_{n}\rVert_{W}}W-\sum_{i\colon z_{i}\neq 0}\frac{1}{\lVert z_{i}\rVert_{W}}Wz_{i}z^{\top}_{i}W=W^{1/2}\left(\frac{1}{\lVert u_{n}\rVert_{2}}I-\sum_{i\colon z_{i}\neq 0}\frac{1}{\lVert u_{i}\rVert_{2}}u_{i}u^{\top}_{i}\right)W^{1/2}=0.

Thus, by the first condition in Theorem 5.4, any optimal solution of (4) fulfills (AT(AZ¯Y))n=0(A^{T}(A\bar{Z}-Y))_{n}=0 for all nn with zn0z_{n}\neq 0. For ZZ restricted to its sparsity pattern I(Z)={i|zi0}I(Z)=\{\,i\;|\;z_{i}\neq 0\,\}, |I(Z)|=K\lvert I(Z)\rvert=K, such that Z=PI(Z)ZZ=P_{I(Z)}Z^{\prime} with ZK×KZ^{\prime}\in\mathbb{R}^{K\times K} of full rank and PI(Z){0,1}N×KP_{I(Z)}\in\{0,1\}^{N\times K} a selection of columns of the identity matrix, we can suppose that PI(Z)TATAPI(Z)P_{I(Z)}^{T}A^{T}AP_{I(Z)} is invertible (which is necessary for recovery). The set of equations derived above can be written as PI(Z)TAT(API(Z)ZY)=0P_{I(Z)}^{T}A^{T}(AP_{I(Z)}Z^{\prime}-Y)=0, which can always be uniquely solved.

In particular, consider a situation with KK-sparse full rank X=PI(X)XX=P_{I(X)}X^{\prime} and noisy data with δX=AXY\delta_{X}=\lVert AX-Y\rVert. Now, consider that part of the noise can be fitted on the same support as XX, and that in general there exists a KK-sparse full rank ZZ^{\prime} such that δ=API(X)ZYFro<δX\delta^{\prime}=\lVert AP_{I(X)}Z^{\prime}-Y\rVert_{\textnormal{Fro}}<\delta_{X}. In fact, let ZZ^{\prime} be the minimizer of the discrepancy δ\delta^{\prime} on the sparsity pattern of XX. Note that this minimizer exactly fulfills PI(X)TAT(API(X)ZY)=0P_{I(X)}^{T}A^{T}(AP_{I(X)}Z^{\prime}-Y)=0. Then, Z¯=PI(X)Z\bar{Z}=P_{I(X)}Z^{\prime} clearly is a solution of (5) for all δ>δ\delta>\delta^{\prime}. Let α\alpha^{\prime} be the maximum of (AT(API(X)ZY))nZTZ\lVert(A^{T}(AP_{I(X)}Z^{\prime}-Y))_{n}\rVert_{Z^{T}Z} over all n=1,2,Nn=1,2\ldots,N. Clearly, for all αα\alpha\geq\alpha^{\prime}, the Z¯=PI(X)Z\bar{Z}=P_{I(X)}Z^{\prime} also fulfills the optimality conditions from Theorem 5.4. In this situation, Algorithm 1, which exclusively operates on rank-KK matrices for γ=0\gamma=0, will tend to converge towards Z¯=PI(X)Z\bar{Z}=P_{I(X)}Z^{\prime}, regardless of the value of αα\alpha\geq\alpha^{\prime}, since the only way to decrease the regularization term below Ψ(Z¯)=K\Psi(\bar{Z})=K is to decrease the rank. Here, if δδX\delta^{\prime}\ll\delta_{X}, the lower bound in condition (19) can not be satisfied by increasing α\alpha. In fact, such a situation always occurs for M=KM=K, where API(X)Z=YAP_{I(X)}Z^{\prime}=Y can be uniquely solved for ZZ^{\prime} and δ=α=0\delta^{\prime}=\alpha^{\prime}=0.

To guard against this situation, we add an additional rule that allows termination of the algorithm if the upper bound in (19) is valid, α\alpha has been previously increased, and no change in ZαZ_{\alpha} has occurred.

6.4. Continuation for relaxation parameter

In the previous parts of this section γ\gamma was considered fixed. Applying the described algorithms directly for γ=0\gamma=0, and solving the original problem (4) has several down-sides:

  • Due to the high degree of nonconvexity of the regularizer, the solution will depend on the initialization. This, aside from causing problems with identifying a global minimum, also requires the initial point to be of full rank, which is not efficient when KK is large.

  • If the optimal solution has rankZ¯<K{\rm rank\,}{\bar{Z}}<K, the method will not be able to identify it for γ=0\gamma=0, since Algorithm 1 produces full rank iterates only and the 𝑜𝑤2,1\mathit{ow}\ell_{2,1}-functional is discontinuous across different ranks. This has to be remedied by guessing the appropriate rank ahead of time, and reducing KK to this rank by a data reduction as discussed at the beginning of section 5.

In this section, we briefly sketch a strategy that starts with γ=1\gamma=1 and then successively reduces γ\gamma. Here, we initially solve the convex problem with 𝑜𝑤2,1=Ψ0\mathit{ow}\ell_{2,1}=\Psi_{0} replaced by 2,1=Ψ1\ell_{2,1}=\Psi_{1} and then follow the path of solutions Z¯α,γ\bar{Z}_{\alpha,\gamma} for Ψγ\Psi_{\gamma} to finally obtain Z¯αδ,0\bar{Z}_{\alpha_{\delta},0} with αδ\alpha_{\delta} fulfilling (19).

As a proof of concept, we employ a simple strategy based on a fixed sequence of values γl=γ~l\gamma_{l}=\tilde{\gamma}^{l} for some 0<γ~<10<\tilde{\gamma}<1 and l=0,,lmaxl=0,\ldots,l_{\max}. We start with l=0l=0 and use the method from section 6.3 to solve the convex problem together with the discrepancy principle. Then, we increment ll and continue iterating Algorithm 1 until the termination conditions from the previous section are fulfilled again. We leave refinements of this naive approach (cf., e.g., [1]) to future work.
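For illustration, this continuation can be sketched as follows, building on the discrepancy_loop helper sketched above; the parameter values and names are only examples of ours.

def gamma_continuation(solve_fixed_alpha, A, Y, delta, Z0, alpha0,
                       gamma_tilde=0.5, l_max=10):
    Z, alpha = Z0, alpha0
    for l in range(l_max + 1):
        gamma = gamma_tilde ** l                     # gamma_0 = 1, then decrease
        Z, alpha = discrepancy_loop(solve_fixed_alpha, A, Y, delta, Z, alpha, gamma)
    # final pass at gamma = 0, i.e., the ow-l21 functional itself
    Z, alpha = discrepancy_loop(solve_fixed_alpha, A, Y, delta, Z, alpha, 0.0)
    return Z, alpha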

7. Experiments

7.1. Synthetic data

We evaluate the performance of our method against other recent MMV techniques, including subspace-augmented MUSIC (SA-MUSIC) [33], simultaneous normalized iterative hard thresholding (SNIHT) [6], and rank-aware orthogonal recursive matching pursuit (RA-ORMP) [16]. The code for these baseline methods was obtained from the repository https://github.com/AdriBesson/joint_sparse_algorithms. We also include in our evaluation the spg_mmv method, which solves the standard 2,1\ell_{2,1} problem in constrained form and whose code is obtained from the SPGL1 package [49, 51]. Our code and data for reproducing the 𝑜𝑤2,1\mathit{ow}\ell_{2,1} results are available at https://github.com/a-petr/OWL.

We use an experimental setting similar to that in [5]. We consider a Gaussian random measurement matrix AM×NA\in\mathbb{R}^{M\times N}, with Aij𝒩(0,1/M){A}_{ij}\sim\mathcal{N}(0,1/\sqrt{M}) and NN fixed to 128128. The signal matrix XN×K{X}\in\mathbb{R}^{N\times K} is built as a random matrix with K=s=30K=s=30, where ss is the row sparsity. We conduct 4040 random trials of the algorithms for each experiment and compute the rate of successful support recovery. We compare our method with these baselines in two tests: fixed rank (r=10r=10) with the number of measurements ranging between 3030 and 9090, and fixed number of measurements (M=51M=51) with the rank varying between 11 and 3030.

We first consider the noiseless scenario, i.e., the observation Y=AXY=AX. In Figure 1, our proposed approach is second to RA-ORMP in recovery probability in both the fixed number of measurements (left) and fixed rank (right) tests, followed by SA-MUSIC and SNIHT. The experiment shows that these four methods are rank-aware, because increasing the rank of the signal matrix leads to an improvement of the recovery rate, while 2,1\ell_{2,1} regularization with the spg_mmv solver does not benefit from increasing the rank. For 𝑜𝑤2,1\mathit{ow}\ell_{2,1} regularization, the recovery rate grows quickly once the rank of the signal matrix exceeds 3 and reaches 100%100\% at rank 1212. SA-MUSIC and SNIHT struggle to identify the correct support of the solutions, but when they are able to do so, they achieve a reconstruction error smaller than that of 𝑜𝑤2,1\mathit{ow}\ell_{2,1} (Figure 2).

Figure 1. Recovery probability of our approach in comparison with other joint sparse recovery techniques for varying matrix rank (left) and varying number of measurements (right) for noiseless reconstruction.
Figure 2. Root-mean-square error of our approach in comparison with other joint sparse recovery techniques for varying matrix rank (left) and varying number of measurements (right) for noiseless reconstruction.

Next, we consider a more practical scenario where the observation YY contains some measurement noise, i.e., Y=AX+ηY=AX+\eta. We assume η\eta has normally distributed entries with standard deviation 0.1/MK{0.1}/{\sqrt{MK}}, so that the expectation of η\|\eta\| is 0.10.1. In Figure 3, we observe that 𝑜𝑤2,1\mathit{ow}\ell_{2,1} achieves the best recovery rate in both the fixed number of measurements (left) and fixed rank (right) tests. Our method reaches a 100%100\% recovery rate at signal rank 1818, while SA-MUSIC only achieves this at full signal rank and SNIHT cannot reach a recovery rate higher than 40%40\%. Interestingly, the best overall baseline in the noiseless scenario, RA-ORMP, suffers heavily in the presence of noise. With fixed rank and a varying number of measurements, 𝑜𝑤2,1\mathit{ow}\ell_{2,1} also outperforms all others by a substantial margin. Our method has the best root-mean-square error in the small-rank and small-number-of-measurements scenarios, where its recovery rate is superior (Figure 4). However, when the other methods catch up and can approximate the signal support well, they provide a better reconstruction error, similarly to what we have seen in the noiseless case.

To conclude, the numerical tests highlight that 𝑜𝑤2,1\mathit{ow}\ell_{2,1} can efficiently exploit the rank of the signal matrix and achieve a very competitive recovery rate, with mild requirements on the number of measurements and signal rank. They also reveal that the recovery rate and signal error metrics may not always be in alignment; thus, one should consider the recovery goals, measurement budget, and signal noise to select a suitable method. Our approach excels in identifying the exact support of the row-sparse signals and has the top performance when the signal matrices have low rank, making it a strong choice when those are the priority.

Figure 3. Recovery probability of our approach in comparison with other joint sparse recovery techniques for varying matrix rank (left) and varying number of measurements (right) for noisy reconstruction.
Figure 4. Root-mean-square error of our approach in comparison with other joint sparse recovery techniques for varying matrix rank (left) and varying number of measurements (right) for noisy reconstruction.

7.2. Feature selection on biological data.

We evaluate the performance of our method on the feature selection problem for two publicly available biomedical informatics data sets. The first data set WISCONSIN is a collection of data about breast cancer tumors that were diagnosed as either malignant or benign [4]. The data set contains 714 samples, with each sample representing a breast cancer tumor. Each tumor is described by 30 features, including information about the size, shape, and texture of the tumor. These features were computed from a digitized image of a fine needle aspirate of a breast mass. The second data set LUNG_DISCRETE is a collection of microarray data about lung carcinomas, containing 147147 samples in 77 classes, where each sample consists of 325 gene expressions [41]. In the feature selection problem, we aim to find an optimal subset of features (gene expressions) that can accurately represent the whole data set. This problem is a special instance of the MMV problem, where the observation matrix YY is identical to the measurement matrix AA (see Section 2). Here, we apply 𝑜𝑤2,1\mathit{ow}\ell_{2,1} and compare it with other MMV techniques (SA-MUSIC, RA-ORMP and spg_mmv) in reconstructing the full data set for different allowable numbers of features. SNIHT does not produce competitive results for this test and is therefore omitted from the comparison. For SA-MUSIC and RA-ORMP, the allowable number of features is enforced directly, while for the two regularization-based methods, we set an error tolerance a priori and solve for the smallest set of features that can reconstruct the full data set within this tolerance. We rank the feature importance via the 2\ell_{2} norm of the respective row of the solution matrix and further prune out the spurious features whose importance scores are very small, in particular, less than 106×(maximum importance score)10^{-6}\times\text{(maximum importance score)}. Finally, after the features are selected, we reconstruct the full data from this set by solving a regular least squares problem, and report the root-mean-square error between the approximated and ground truth data.

Note that the feature selection problem has K=NK=N and thus often involves a large KK. To accelerate the K×KK\times K matrix inversion step in our algorithm, we employ an SVD technique to decompose A=PΣQA=P\Sigma Q and solve the reduced problem

minZN×K(𝑜𝑤2,1(Z)+12αAZAFro2).\displaystyle\min_{Z\in\mathbb{R}^{N\times K^{\prime}}}\left(\mathit{ow}\ell_{2,1}(Z)+\frac{1}{2\alpha}\|AZ-A^{\prime}\|_{\textnormal{Fro}}^{2}\right).

Here, A=PΣM×KA^{\prime}=P\Sigma^{\prime}\in\mathbb{R}^{M\times K^{\prime}} is a low rank approximation of PΣP\Sigma, formed by dropping the columns of Σ\Sigma corresponding to singular values that are below a specified threshold. The solution now has dimension N×KN\times K^{\prime}, and our algorithm involves the matrix inversion of size K×KK^{\prime}\times K^{\prime} instead, which is much faster if KKK^{\prime}\ll K. In the LUNG_DISCRETE problem, even though AA has dimension 147×325{147\times 325} (i.e., K=325K=325), its effective rank is approximately 7272 (there are only 72 singular values of AA exceeding 101210^{-12}), thus K=72K^{\prime}=72. We emphasize that this is only the most conservative choice, and a more aggressive thresholding strategy is possible. In fact, here we further reduce KK^{\prime} by linking the singular value cutoff with the predefined reconstruction error tolerance, as the low rank approximation AA^{\prime} can afford an error of the same scale as this tolerance. The final row-sparse solution is easily obtained by right multiplying the reduced solution with QQ. It is worth noting that the above strategy is general for feature selection problems because the data matrices of these problems are inherently low-rank.
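For illustration, this data reduction can be sketched in NumPy as follows; the function name and the simple cutoff handling are our choices.

import numpy as np

def reduce_data(A, cutoff=1e-12):
    # SVD-based reduction: keep singular values above `cutoff`,
    # solve against A' = P_r Sigma_r, and map back with Q_r afterwards.
    P, S, Q = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(S > cutoff))      # effective rank K'
    A_red = P[:, :r] * S[:r]         # A' in R^{M x K'}
    Q_red = Q[:r, :]                 # K' x K
    return A_red, Q_red

# After solving the reduced problem for Z_red (N x K'), the row-sparse
# solution of the original problem is recovered as Z = Z_red @ Q_red.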

For the WISCONSIN data set (Figure 5, left), we observe that 𝑜𝑤2,1\mathit{ow}\ell_{2,1} performs much better than the competitors in identifying the representative features. The reconstruction error of 𝑜𝑤2,1\mathit{ow}\ell_{2,1} drops rapidly when the number of features is small and is approximately 20%20\% lower than those of the next best baselines. By contrast, the other methods generally require five more features to capture the data set as well as 𝑜𝑤2,1\mathit{ow}\ell_{2,1}. The performances of 𝑜𝑤2,1\mathit{ow}\ell_{2,1} and SA-MUSIC eventually converge around 25 features, where the reconstruction error diminishes to 0. This verifies that the feature set contains some degree of redundancy, and that the data set can still be accurately represented with some features removed. For the LUNG_DISCRETE data set (Figure 5, right), the performances of 𝑜𝑤2,1\mathit{ow}\ell_{2,1}, RA-ORMP and SA-MUSIC are close and better than that of spg_mmv when the number of features is small. However, all the tested methods can select roughly 7272 optimal features that fully represent the data, despite the extensive feature set comprising 325 gene expressions.

Figure 5. Approximation of the full WISCONSIN (left) and LUNG_DISCRETE datasets (right) from a limited number of features, identified by 𝑜𝑤2,1\mathit{ow}\ell_{2,1} as opposed to other methods, when a varying number of features is selected.

References

  • [1] E. L. Allgower and K. Georg, Numerical continuation methods: an introduction, vol. 13, Springer Science & Business Media, 2012.
  • [2] F. Andersson, M. Carlsson, and K.-M. Perfekt, Operator-Lipschitz estimates for the singular value functional calculus, Proceedings of the American Mathematical Society, 144 (2015), p. 1867–1875.
  • [3] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
  • [4] K. P. Bennett and O. L. Mangasarian, Robust linear programming discrimination of two linearly inseparable sets, Optimization Methods & Software, 1 (1992), pp. 23–34.
  • [5] A. Besson, D. Perdios, Y. Wiaux, and J.-P. Thiran, Joint sparsity with partially known support and application to ultrasound imaging, IEEE Signal Processing Letters, 26 (2019), pp. 84–88.
  • [6] J. D. Blanchard, M. Cermak, D. Hanle, and Y. Jing, Greedy algorithms for joint sparse recovery, IEEE Transactions on Signal Processing, 62 (2014), pp. 1694–1704.
  • [7] J. D. Blanchard, C. Leedy, and Y. Wu, On rank awareness, thresholding, and MUSIC for joint sparse recovery, Applied and Computational Harmonic Analysis, 48 (2020), pp. 482–495.
  • [8] H. Cai, K. Hamm, L. Huang, and D. Needell, Robust CUR decomposition: Theory and imaging applications, SIAM Journal on Imaging Sciences, 14 (2021), pp. 1472–1503.
  • [9] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, J Math Imaging Vis, 40 (2011), pp. 120–145.
  • [10] F. Clarke, Functional analysis, calculus of variations and optimal control, vol. 264, Springer, 2013.
  • [11] P. L. Combettes, Solving monotone inclusions via compositions of nonexpansive averaged operators, Optimization, 53 (2004), pp. 475–504.
  • [12] P. L. Combettes and J.-C. Pesquet, Proximal thresholding algorithm for minimization over orthonormal bases, SIAM Journal on Optimization, 18 (2008), pp. 1351–1376.
  • [13] A. R. Conn, N. I. M. Gould, and P. L. Toint, Trust Region Methods, Society for Industrial and Applied Mathematics, 2000.
  • [14] S. Cotter, B. Rao, K. Engan, and K. Kreutz-Delgado, Sparse solutions to linear inverse problems with multiple measurement vectors, IEEE Transactions on Signal Processing, 53 (2005), pp. 2477–2488.
  • [15] G. Dal Maso, An introduction to Γ\Gamma-convergence, vol. 8, Springer Science & Business Media, 2012.
  • [16] M. E. Davies and Y. C. Eldar, Rank awareness in joint sparse recovery, IEEE Transactions on Information Theory, 58 (2012), pp. 1135–1146.
  • [17] N. Dexter, H. Tran, and C. Webster, A mixed regularization approach for sparse simultaneous approximation of parameterized PDEs, ESAIM: M2AN, 53 (2019), pp. 2025–2045.
  • [18]  , On the strong convergence of forward-backward splitting in reconstructing jointly sparse signals, Set-Valued Var. Anal, 30 (2022), pp. 543–557.
  • [19] Y. Eldar and H. Rauhut, Average Case Analysis of Multichannel Sparse Recovery Using Convex Relaxation, IEEE Transactions on Information Theory, 56 (2010), pp. 505–519.
  • [20] H. W. Engl and R. Ramlau, Regularization of Inverse Problems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2015, pp. 1233–1241.
  • [21] E. Esser, Y. Lou, and J. Xin, A method for finding structured sparse solutions to nonnegative least squares problems with applications, SIAM Journal on Imaging Sciences, 6 (2013), pp. 2010–2046.
  • [22] P. Feng and Y. Bresler, Spectrum-blind minimum-rate sampling and reconstruction of multiband signals, in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 3, 1996, pp. 1688–1691 vol. 3.
  • [23] M. Fornasier and H. Rauhut, Recovery algorithms for vector-valued data with joint sparsity constraints, SIAM Journal on Numerical Analysis, 46 (2008), pp. 577–613.
  • [24] M. Golbabaee and P. Vandergheynst, Hyperspectral image compressed sensing via low-rank and joint-sparse matrix recovery, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 2741–2744.
  • [25] R. Gribonval, H. Rauhut, K. Schnass, and P. Vandergheynst, Atoms of All Channels, Unite! Average Case Analysis of Multi-Channel Sparse Recovery Using Greedy Algorithms, Journal of Fourier Analysis and Applications, 14 (2008), pp. 655–687.
  • [26] K. Hamm and L. Huang, Perspectives on CUR decompositions, Applied and Computational Harmonic Analysis, 48 (2020), pp. 1088–1099.
  • [27] N. J. Higham, Functions of matrices: theory and computation, SIAM, 2008.
  • [28] P. O. Hoyer, Non-negative matrix factorization with sparseness constraints., Journal of machine learning research, 5 (2004).
  • [29] H. Ji, J. Li, Z. Shen, and K. Wang, Image deconvolution using a characterization of sharp images in wavelet domain, Applied and Computational Harmonic Analysis, 32 (2012), pp. 295–304.
  • [30] J. M. Kim, O. K. Lee, and J. C. Ye, Compressive MUSIC: Revisiting the link between compressive sensing and array signal processing, IEEE Transactions on Information Theory, 58 (2012), pp. 278–301.
  • [31] D. Krishnan, T. Tay, and R. Fergus, Blind deconvolution using a normalized sparsity measure, in CVPR 2011, 2011, pp. 233–240.
  • [32] M.-J. Lai and Y. Liu, The null space property for sparse recovery from multiple measurement vectors, Applied and Computational Harmonic Analysis, 30 (2011), pp. 402–406.
  • [33] K. Lee, Y. Bresler, and M. Junge, Subspace methods for joint sparse recovery, IEEE Transactions on Information Theory, 58 (2012), pp. 3613–3641.
  • [34] J. Liu, S. Ji, and J. Ye, Multi-task feature learning via efficient l2,1l_{2,1}-norm minimization, in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, Arlington, Virginia, USA, 2009, AUAI Press, p. 339–348.
  • [35] M. Lu, Embedded feature selection accounting for unknown data heterogeneity, Expert Systems with Applications, 119 (2019), pp. 350–361.
  • [36] M. Mishali and Y. C. Eldar, Reduce and boost: Recovering arbitrary sets of jointly sparse vectors, IEEE Transactions on Signal Processing, 56 (2008), pp. 4692–4702.
  • [37] D. Nardone, A. Ciaramella, and A. Staiano, A sparse-modeling based approach for class specific feature selection, PeerJ Computer Science, 5 (2019), p. e237.
  • [38] F. Nie, H. Huang, X. Cai, and C. Ding, Efficient and robust feature selection via joint 2,1\ell_{2,1}-norms minimization, Advances in neural information processing systems, 23 (2010).
  • [39] D. Noll, Cutting plane oracles to minimize non-smooth non-convex functions, Set-Valued and Variational Analysis, 18 (2010), pp. 531–568.
  • [40] D. Noll, O. Prot, and A. Rondepierre, A proximity control algorithm to minimize nonsmooth and nonconvex functions, Pacific Journal of Optimization, 4 (2008), pp. 569–602.
  • [41] H. Peng, F. Long, and C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (2005), pp. 1226–1238.
  • [42] A. Petrosyan, H. Tran, and C. Webster, Reconstruction of jointly sparse vectors via manifold optimization, Applied Numerical Mathematics, 144 (2019), pp. 140–150.
  • [43] Y. Rahimi, C. Wang, H. Dong, and Y. Lou, A scale-invariant approach for sparse signal recovery, SIAM Journal on Scientific Computing, 41 (2019), pp. A3649–A3672.
  • [44] H. Schaeffer, G. Tran, and R. A. Ward, Learning dynamical systems and bifurcation via group sparsity, arXiv: Numerical Analysis, (2017).
  • [45] Z. Tan, P. Yang, and A. Nehorai, Joint sparse recovery method for compressed sensing with structured dictionary mismatches, IEEE Transactions on Signal Processing, 62 (2014), pp. 4997–5008.
  • [46] M. Tao, Minimization of l1l_{1} over l2l_{2} for sparse signal recovery with convergence guarantee, SIAM Journal on Scientific Computing, 44 (2022), pp. A770–A797.
  • [47] J. A. Tropp, Algorithms for simultaneous sparse approximation. part ii: Convex relaxation, Signal Processing, 86 (2006), pp. 589–602. Sparse Approximations in Signal and Image Processing.
  • [48] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, Algorithms for simultaneous sparse approximation. part i: Greedy pursuit, Signal Processing, 86 (2006), pp. 572–588. Sparse Approximations in Signal and Image Processing.
  • [49] E. van den Berg and M. P. Friedlander, Probing the pareto frontier for basis pursuit solutions, Siam journal on scientific computing, 31 (2009), pp. 890–912.
  • [50]  , Theoretical and empirical results for recovery from multiple measurements, IEEE Transactions on Information Theory, 56 (2010), pp. 2516–2527.
  • [51]  , Sparse optimization with least-squares constraints, SIAM Journal on Optimization, 21 (2011), pp. 1201–1229.
  • [52] S. A. Vavasis, Derivation of compressive sensing theorems from the spherical section property, University of Waterloo, CO, 769 (2009).
  • [53] C. Wang, M. Yan, Y. Rahimi, and Y. Lou, Accelerated schemes for the l1/l2l_{1}/l_{2} minimization, IEEE Transactions on Signal Processing, 68 (2020), pp. 2660–2669.
  • [54] Z. Wen, B. Hou, and L. Jiao, Joint sparse recovery with semisupervised MUSIC, IEEE Signal Processing Letters, 24 (2017), pp. 629–633.
  • [55] Y. Xu, A. Narayan, H. Tran, and C. G. Webster, Analysis of the ratio of 1\ell_{1} and 2\ell_{2} norms in compressed sensing, Applied and Computational Harmonic Analysis, 55 (2021), pp. 486–511.
  • [56] P. Yin, E. Esser, and J. Xin, Ratio and difference of 1\ell_{1} and 2\ell_{2} norms and sparse representation with coherent dictionaries, Commun. Inf. Syst., 14 (2014), pp. 87–109.
  • [57] L. Zeng, P. Yu, and T. K. Pong, Analysis and algorithms for some compressed sensing models based on 1/2\ell_{1}/\ell_{2} minimization, SIAM Journal on Optimization, 31 (2021), pp. 1576–1603.

Appendix A Properties of the loss function

Lemma A.1.

Suppose ZS(N,K,r)Z\in S(N,K,r) and Z=UΣVTZ=U\Sigma V^{T} is the compact singular value decomposition of ZZ, then

𝑜𝑤2,1(Z)=U2,1\mathit{ow}\ell_{2,1}(Z)=\|U\|_{2,1}
Proof.

We have (ZTZ)/2=VΣ1VT(Z^{T}Z)^{\dagger/2}=V\Sigma^{-1}V^{T}, thus Z(ZTZ)/2=UVT.Z(Z^{T}Z)^{\dagger/2}=UV^{T}. Observe that

Z(ZTZ)/22,1=UVT2,1=U2,1.\|Z(Z^{T}Z)^{\dagger/2}\|_{2,1}=\|UV^{T}\|_{2,1}=\|U\|_{2,1}.\qed
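As a quick numerical sanity check of this identity, one may compare both sides in NumPy; the following is an illustrative script of ours, not part of the paper's experiments.

import numpy as np

def owl21(Z, tol=1e-12):
    # ow-l21 via the compact SVD: ||U||_{2,1} over the numerical rank.
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    r = int(np.sum(S > tol * S.max()))
    return np.linalg.norm(U[:, :r], axis=1).sum()

rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 5))      # rank 3
E, V = np.linalg.eigh(Z.T @ Z)
inv_sqrt = np.where(E > 1e-12 * E.max(), 1.0 / np.sqrt(np.maximum(E, 1e-300)), 0.0)
ZW = Z @ (V * inv_sqrt) @ V.T                                      # Z (Z^T Z)^{+/2}
assert abs(np.linalg.norm(ZW, axis=1).sum() - owl21(Z)) < 1e-6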
Lemma A.2.

Let ZN×KZ\in\mathbb{R}^{N\times K}, then

rank(Z)𝑜𝑤2,1(Z)rank(Z)Z2,0.{\rm rank\,}(Z)\leq\mathit{ow}\ell_{2,1}(Z)\leq\sqrt{{\rm rank\,}(Z)\cdot\|Z\|_{2,0}}.

Moreover, we have equality on the left-hand side if and only if ZZ is rank(Z){\rm rank\,}(Z)-row sparse.

Proof.

Assume rank(Z)=r{\rm rank\,}(Z)=r. Let U=(u1,,uN)TU=(u_{1},\dots,u_{N})^{T} be as in the previous lemma where u1,,uNu_{1},\dots,u_{N} are the rows of UU. Notice that u12,,uN21\|u_{1}\|_{2},\dots,\|u_{N}\|_{2}\leq 1 and

U2,22=u122++uN22=r\|U\|^{2}_{2,2}=\|u_{1}\|_{2}^{2}+\cdots+\|u_{N}\|_{2}^{2}=r

therefore

𝑜𝑤2,1(Z)=u12++uN2r.\mathit{ow}\ell_{2,1}(Z)=\|u_{1}\|_{2}+\cdots+\|u_{N}\|_{2}\geq r.

So 𝑜𝑤2,1(Z)\mathit{ow}\ell_{2,1}(Z) attains the lower bound, 𝑜𝑤2,1(Z)=r\mathit{ow}\ell_{2,1}(Z)=r, only when ui22=ui2\|u_{i}\|^{2}_{2}=\|u_{i}\|_{2} for each uiu_{i}, hence ui2=0\|u_{i}\|_{2}=0 or 11. Consequently, 𝑜𝑤2,1(Z)\mathit{ow}\ell_{2,1}(Z) takes its smallest value for ZN×KZ\in\mathbb{R}^{N\times K} exactly on the rank(Z){\rm rank\,}(Z)-row sparse matrices.

To prove the upper bound, note that only s=Z2,0s=\|Z\|_{2,0} of the u1,,uNu_{1},\dots,u_{N} are different from 0; thus, by Hölder’s inequality,

𝑜𝑤2,1(Z)=U2,1sU2,2=sr.\mathit{ow}\ell_{2,1}(Z)=\|U\|_{2,1}\leq\sqrt{s}\cdot\|U\|_{2,2}=\sqrt{s}\sqrt{r}.\qed
Proposition A.3.

𝑜𝑤2,1(Z)\mathit{ow}\ell_{2,1}(Z) is lower semi-continuous.

Proof.

The proof of this proposition is a direct corollary of Proposition A.5 and the fact that the Γ\Gamma-limits are always lower semicontinuous [15, Proposition 6.8]. ∎

Let Z=UΣVTZ=U\Sigma V^{T} be a singular value decomposition of ZN×KZ\in\mathbb{R}^{N\times K}. It can be the full, thin, or compact version; this does not affect the following notation, but we will use the compact SVD unless stated otherwise. For a function w:[0,)[0,)w:[0,\infty)\to[0,\infty) with w(0)=0w(0)=0, define

w(Z):=Uw(Σ)VTw(Z):=Uw(\Sigma)V^{T}

where the function is applied entry-wise to the diagonal of the matrix Σ\Sigma containing the singular values. This is known in the literature as the “generalized” or “singular value” matrix calculus (see, e.g., [2]).

Define the family of functions

wγ(σ)=σγ+(1γ)σ2for γ(0,1],w_{\gamma}(\sigma)=\frac{\sigma}{\sqrt{\gamma+(1-\gamma)\sigma^{2}}}\quad\text{for }\gamma\in(0,1],

and the limit function for γ0\gamma\to 0 by

w0(σ)={0for σ=0,1for σ>0.w_{0}(\sigma)=\begin{cases}0&\text{for }\sigma=0,\\ 1&\text{for }\sigma>0.\end{cases}

With this notation, we can write

w0(Z)=Z(ZTZ)/2,andwγ(Z)=Z(γI+(1γ)ZTZ)1/2for 0<γ1.w_{0}(Z)=Z(Z^{T}Z)^{\dagger/2},\quad\text{and}\quad w_{\gamma}(Z)=Z(\gamma I+(1-\gamma)Z^{T}Z)^{-1/2}\quad\text{for }0<\gamma\leq 1.

Thus, we can also write

Ψγ(Z)=wγ(Z)2,1=Uwγ(Σ)2,1,for 0γ1\Psi_{\gamma}(Z)=\lVert w_{\gamma}(Z)\rVert_{2,1}=\lVert Uw_{\gamma}(\Sigma)\rVert_{2,1},\quad\text{for }0\leq\gamma\leq 1

where the last equality follows from the orthogonality of VV.
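In code, this singular value calculus is straightforward to evaluate; the following NumPy sketch is ours, and the thresholding used for w0w_{0} is a numerical choice rather than part of the definition.

import numpy as np

def w_gamma(Z, gamma, tol=1e-12):
    # w_gamma(Z) = U w_gamma(Sigma) V^T, applied to the singular values.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    if gamma > 0:
        w = S / np.sqrt(gamma + (1.0 - gamma) * S**2)
    else:
        w = (S > tol * max(S.max(), 1.0)).astype(float)   # w_0: indicator of sigma > 0
    return (U * w) @ Vt

def Psi_gamma(Z, gamma):
    # Psi_gamma(Z) = ||w_gamma(Z)||_{2,1}
    return np.linalg.norm(w_gamma(Z, gamma), axis=1).sum()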

Proposition A.4.
  1. (1)

    Ψ1(Z)=2,1(Z)\Psi_{1}(Z)=\ell_{2,1}(Z) and Ψ0(Z)=𝑜𝑤2,1(Z)\Psi_{0}(Z)=\mathit{ow}\ell_{2,1}(Z).

  2. (2)

    γΨγ(Z)2,1(Z)\sqrt{\gamma}\Psi_{\gamma}(Z)\leq\ell_{2,1}(Z) and 1γΨγ(Z)𝑜𝑤2,1(Z)KN\sqrt{1-\gamma}\Psi_{\gamma}(Z)\leq\mathit{ow}\ell_{2,1}(Z)\leq\sqrt{KN}

  3. (3)

    Ψγ\Psi_{\gamma} is Lipschitz continuous for any γ(0,1]\gamma\in(0,1]:

    |Ψγ(Z)Ψγ(Z^)|N/γZZ^Fro.\displaystyle\lvert\Psi_{\gamma}(Z)-\Psi_{\gamma}(\widehat{Z})\rvert\leq\sqrt{N/\gamma}\,\lVert Z-\widehat{Z}\rVert_{\textnormal{Fro}}.
  4. (4)

    When Z^TZ^\widehat{Z}^{T}\widehat{Z} is invertible, 𝑜𝑤2,1(Z)\mathit{ow}\ell_{2,1}(Z) is locally Lipschitz continuous near Z^\widehat{Z} with

    |\mathit{ow}\ell_{2,1}(Z)-\mathit{ow}\ell_{2,1}(\widehat{Z})|\leq\sqrt{N}\max\{\lVert(Z^{T}Z)^{-1/2}\rVert_{\textnormal{op}},\lVert(\widehat{Z}^{T}\widehat{Z})^{-1/2}\rVert_{\textnormal{op}}\}\|Z-\widehat{Z}\|_{\textnormal{Fro}}.
Proof.

The proofs of 1 and 2 are straightforward. Property 3 follows from Ψγ(Z)=wγ(Z)2,1\Psi_{\gamma}(Z)=\lVert w_{\gamma}(Z)\rVert_{2,1}, the Lipschitz continuity of the norm with constant one, and [2, Theorem 1.1]:

wγ(Z)2,1wγ(Z^)2,1wγ(Z)wγ(Z^)2,1Nwγ(Z)wγ(Z^)FroN/γZZ^Fro\lVert w_{\gamma}(Z)\rVert_{2,1}-\lVert w_{\gamma}(\widehat{Z})\rVert_{2,1}\leq\lVert w_{\gamma}(Z)-w_{\gamma}(\widehat{Z})\rVert_{2,1}\leq\sqrt{N}\lVert w_{\gamma}(Z)-w_{\gamma}(\widehat{Z})\rVert_{\textnormal{Fro}}\leq\sqrt{N/\gamma}\lVert Z-\widehat{Z}\rVert_{\textnormal{Fro}}

due to |wγ(σ)wγ(σ)|γ1/2|σσ|\lvert w_{\gamma}(\sigma)-w_{\gamma}(\sigma^{\prime})\rvert\leq\gamma^{-1/2}\lvert\sigma-\sigma^{\prime}\rvert for all σ,σ\sigma,\sigma^{\prime}.

For property 4, let ZZ and Z^\widehat{Z} have singular values bounded from below by σ~>0\tilde{\sigma}>0, i.e., we set 1/σ~=max{(ZTZ)1/2op,(Z^TZ^)1/2op}1/\tilde{\sigma}=\max\{\lVert(Z^{T}Z)^{-1/2}\rVert_{\textnormal{op}},\lVert(\widehat{Z}^{T}\widehat{Z})^{-1/2}\rVert_{\textnormal{op}}\}. Define

w~0(σ)={σ/σ~for σ<σ~1else.\tilde{w}_{0}(\sigma)=\begin{cases}\sigma/\tilde{\sigma}&\text{for }\sigma<\tilde{\sigma}\\ 1&\text{else.}\end{cases}

Then, it holds

\displaystyle\lvert\mathit{ow}\ell_{2,1}(Z)-\mathit{ow}\ell_{2,1}(\widehat{Z})\rvert\displaystyle=\lvert\lVert\tilde{w}_{0}(Z)\rVert_{2,1}-\lVert\tilde{w}_{0}(\widehat{Z})\rVert_{2,1}\rvert
w~0(Z)w~0(Z^)2,1\displaystyle\leq\lVert\tilde{w}_{0}(Z)-\tilde{w}_{0}(\widehat{Z})\rVert_{2,1}
Nw~0(Z)w~0(Z^)2,2\displaystyle\leq\sqrt{N}\lVert\tilde{w}_{0}(Z)-\tilde{w}_{0}(\widehat{Z})\rVert_{2,2}
N/σ~ZZ^2,2,\displaystyle\leq\sqrt{N}/\tilde{\sigma}\lVert Z-\widehat{Z}\rVert_{2,2},

where w~0(Z)\tilde{w}_{0}(Z) is the singular value calculus of the function w~0(σ)\tilde{w}_{0}(\sigma) and we applied [2, Theorem 1.1]. ∎

Proposition A.5.

Ψγ\Psi_{\gamma} Γ\Gamma-converges to Ψ\Psi for γ0\gamma\to 0: For any ZkZZ^{k}\to Z and γk0\gamma_{k}\to 0 it holds

lim infkΨγk(Zk)𝑜𝑤2,1(Z)\liminf_{k\to\infty}\Psi_{\gamma_{k}}(Z^{k})\geq\mathit{ow}\ell_{2,1}(Z)

and for any γk0\gamma_{k}\to 0 and ZN×KZ\in\mathbb{R}^{N\times K} there exists a sequence ZkZZ^{k}\to Z such that

lim supkΨγk(Zk)𝑜𝑤2,1(Z).\limsup_{k\to\infty}\Psi_{\gamma_{k}}(Z^{k})\leq\mathit{ow}\ell_{2,1}(Z).
Proof.

To show the first condition, denote the limit matrix ZZ and its full singular value decomposition by

Z=UΣV,where V,ΣK×K,UN×K.Z=U\Sigma V,\quad\text{where }V,\Sigma\in\mathbb{R}^{K\times K},U\in\mathbb{R}^{N\times K}.

Moreover, let a compatible SVD of ZkZ_{k} be given by

Z_{k}=U_{k}\Sigma_{k}V_{k},\quad\text{where }V_{k},\Sigma_{k}\in\mathbb{R}^{K\times K},\;U_{k}\in\mathbb{R}^{N\times K},

such that ΣkΣ\Sigma_{k}\to\Sigma for kk\to\infty (e.g., by sorting the singular values by magnitude). Then, we can write

Z(ZTZ)/2=Uw0(Σ)V=w0(Z),and Zk(γkI+(1γk)ZkTZk)1/2=Ukwγk(Σk)Vk=wγk(Zk).Z(Z^{T}Z)^{\dagger/2}=Uw_{0}(\Sigma)V=w_{0}(Z),\quad\text{and }Z_{k}(\gamma_{k}I+(1-\gamma_{k})Z_{k}^{T}Z_{k})^{-1/2}=U_{k}w_{\gamma_{k}}(\Sigma_{k})V_{k}=w_{\gamma_{k}}(Z_{k}).

Additionally let 0<σ^<min{σi|σidiag(Σ),σi>0}0<\hat{\sigma}<\min\{\,\sigma_{i}\;|\;\sigma_{i}\in\operatorname{diag}(\Sigma),\sigma_{i}>0\,\} be a lower bound on the smallest nonzero singular value of Σ\Sigma and introduce the function

w~γ(σ)={(σ/σ^)wγ(σ^)for σσ^wγ(σ)else.\tilde{w}_{\gamma}(\sigma)=\begin{cases}(\sigma/\hat{\sigma})w_{\gamma}(\hat{\sigma})&\text{for }\sigma\leq\hat{\sigma}\\ w_{\gamma}(\sigma)&\text{else.}\end{cases}

Clearly it holds wγw~γw_{\gamma}\geq\tilde{w}_{\gamma} (by concavity of wγw_{\gamma}), w~γ\tilde{w}_{\gamma} is uniformly Lipschitz-continuous (independently of γ\gamma), and w~γ(σ)w~0(σ)\tilde{w}_{\gamma}(\sigma)\to\tilde{w}_{0}(\sigma). By elementary arguments, it holds for any γ\gamma and Σ\Sigma that

wγ(Σ)w~γ(Σ).w_{\gamma}(\Sigma)\geq\tilde{w}_{\gamma}(\Sigma).

Squaring this estimate leads for any vv to the estimate

vwγ(Σ)22=vTwγ(Σ)2vvTw~γ(Σ)2v=vw~γ(Σ)22.\lVert vw_{\gamma}(\Sigma)\rVert^{2}_{2}=v^{T}w_{\gamma}(\Sigma)^{2}v\geq v^{T}\tilde{w}_{\gamma}(\Sigma)^{2}v=\lVert v\tilde{w}_{\gamma}(\Sigma)\rVert^{2}_{2}.

Thus, it follows that

lim infkΨγk(Zk)=lim infkUkwγk(Σk)2,1lim infkUkw~γk(Σk)2,1=lim infkw~γk(Zk)2,1.\liminf_{k\to\infty}\Psi_{\gamma_{k}}(Z_{k})=\liminf_{k\to\infty}\lVert U_{k}w_{\gamma_{k}}(\Sigma_{k})\rVert_{2,1}\geq\liminf_{k\to\infty}\lVert U_{k}\tilde{w}_{\gamma_{k}}(\Sigma_{k})\rVert_{2,1}=\liminf_{k\to\infty}\lVert\tilde{w}_{\gamma_{k}}(Z_{k})\rVert_{2,1}.

Here we used the fact that

Ukw~γk(Σk)2,1=Ukw~γk(Σk)Vk2,1=w~γk(Zk)2,1.\lVert U_{k}\tilde{w}_{\gamma_{k}}(\Sigma_{k})\rVert_{2,1}=\lVert U_{k}\tilde{w}_{\gamma_{k}}(\Sigma_{k})V_{k}\rVert_{2,1}=\lVert\tilde{w}_{\gamma_{k}}(Z_{k})\rVert_{2,1}.

To pass to the limit, we split w~γk(Zk)\tilde{w}_{\gamma_{k}}(Z_{k}) into two error terms and the limit:

w~γk(Zk)=(w~γk(Zk)w~0(Zk))+(w~0(Zk)w~0(Z))+w~0(Z).\tilde{w}_{\gamma_{k}}(Z_{k})=(\tilde{w}_{\gamma_{k}}(Z_{k})-\tilde{w}_{0}(Z_{k}))+(\tilde{w}_{0}(Z_{k})-\tilde{w}_{0}(Z))+\tilde{w}_{0}(Z).

The first term can be estimated through a uniform estimate of w~γk()w~0()\tilde{w}_{\gamma_{k}}(\cdot)-\tilde{w}_{0}(\cdot) on the compact interval [0,supk,iσik][0,\sup_{k,i}\sigma^{k}_{i}], using the fact that UkU_{k} and VkV_{k} are uniformly bounded in any matrix norm. The second term can be estimated by the Lipschitz constant 1/σ^1/\hat{\sigma} of w~0\tilde{w}_{0} and the error ZkZZ_{k}-Z, using [2, Theorem 1.1]. Thus, indeed

lim infkΨγk(Zk)lim infkw~γk(Zk)2,1=w~0(Z)2,1=𝑜𝑤2,1(Z),\liminf_{k\to\infty}\Psi_{\gamma_{k}}(Z_{k})\geq\liminf_{k\to\infty}\lVert\tilde{w}_{\gamma_{k}}(Z_{k})\rVert_{2,1}=\lVert\tilde{w}_{0}(Z)\rVert_{2,1}=\mathit{ow}\ell_{2,1}(Z),

which shows the first assertion. For the second assertion, we can choose the recovery sequence Zk=ZZ_{k}=Z and obtain

Ψγk(Z)=Uwγk(Σ)2,1Uw0(Σ)2,1=𝑜𝑤2,1(Z),\Psi_{\gamma_{k}}(Z)=\lVert Uw_{\gamma_{k}}(\Sigma)\rVert_{2,1}\to\lVert Uw_{0}(\Sigma)\rVert_{2,1}=\mathit{ow}\ell_{2,1}(Z),

for kk\to\infty. ∎

In the following, we derive an analysis of the method based on splitting the nonconvex nonsmooth term Ψγ\Psi_{\gamma} into a convex nonsmooth part and a nonconvex quadratic remainder at a given fixed point Z^\widehat{Z} (which can be thought of as an approximate local minimizer or the current iterate ZkZ_{k}). For this, define the functions

M[Z]=γI+(1γ)ZTZand W[Z]=M[Z]M[Z]=\gamma I+(1-\gamma)Z^{T}Z\quad\text{and }W[Z]=M[Z]^{\dagger}

and fix Z^N×K\widehat{Z}\in\mathbb{R}^{N\times K} with associated M^=M[Z^]\widehat{M}=M[\widehat{Z}] and W^=W[Z^]\widehat{W}=W[\widehat{Z}], and assume that M^\widehat{M} is symmetric positive definite. The last requirement is fulfilled either for γ>0\gamma>0 or if Z^\widehat{Z} has full rank, i.e., Z^𝕊(N,K)\widehat{Z}\in\mathbb{S}(N,K). Then it also holds that W^=M^1\widehat{W}=\widehat{M}^{-1}. We recall the notation ZW,2=tr(ZWZT)=ZW1/2Fro.\|Z\|_{W,2}=\sqrt{\operatorname{tr}(ZWZ^{T})}=\lVert ZW^{1/2}\rVert_{\textnormal{Fro}}.

Theorem A.6.

Let \widehat{M}=M[\widehat{Z}] be invertible. Then the functional \Psi_{\gamma}(Z) is the sum of a convex part and a nonconvex quadratic remainder:

(20) Ψγ(Z)=ZW[Z],1=ZW^,1+Z^Λ^,ZZ^+R(Z,Z^),\Psi_{\gamma}(Z)=\lVert Z\rVert_{W[Z],1}=\lVert Z\rVert_{\widehat{W},1}+\langle\widehat{Z}\widehat{\Lambda},Z-\widehat{Z}\rangle+R(Z,\widehat{Z}),

where

Λ^=n=1N1γz^nW^W^z^nz^nTW^.\widehat{\Lambda}=-\sum_{n=1}^{N}\frac{1-\gamma}{\lVert\widehat{z}_{n}\rVert_{\widehat{W}}}\widehat{W}\widehat{z}_{n}\widehat{z}_{n}^{T}\widehat{W}.

In fact, there exist constants L¯\bar{L} and Δ¯\bar{\Delta}, such that for all ZZ with ZZ^W^,2Δ¯\lVert Z-\widehat{Z}\rVert_{\widehat{W},2}\leq\bar{\Delta}, M[Z]M[Z] is also invertible, and there holds

(21) |R(Z,Z^)|L¯2ZZ^W^,22.\lvert R(Z,\widehat{Z})\rvert\leq\frac{\bar{L}}{2}\lVert Z-\widehat{Z}\rVert^{2}_{\widehat{W},2}.
Remark A.7.

We note that the terms in the definition of \widehat{\Lambda} with \widehat{z}_{n}=0 are not well defined. They are evaluated as zero by convention, which is consistent with the limit \widehat{z}_{n}\to 0 of each expression under the sum. Note also that -\widehat{\Lambda} is a symmetric positive semidefinite matrix with -\operatorname{tr}(\widehat{W}^{-1/2}\widehat{\Lambda}\widehat{W}^{-1/2})=(1-\gamma)\lVert\widehat{Z}\rVert_{\widehat{W},1}. Thus, the linear term in the expansion (20) can be estimated as

Z^Λ^,ZZ^(1γ)Z^W^,1Z^W^1/2opZZ^W^,2(1γ)1/2Z^W^,1ZZ^W^,2.\langle\widehat{Z}\widehat{\Lambda},Z-\widehat{Z}\rangle\leq(1-\gamma)\lVert\widehat{Z}\rVert_{\widehat{W},1}\|\widehat{Z}\widehat{W}^{1/2}\|_{\textnormal{op}}\lVert Z-\widehat{Z}\rVert_{\widehat{W},2}\leq(1-\gamma)^{1/2}\lVert\widehat{Z}\rVert_{\widehat{W},1}\lVert Z-\widehat{Z}\rVert_{\widehat{W},2}.

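The expansion (20) together with the quadratic bound (21) can be probed numerically: along a line Z=\widehat{Z}+tD, the ratio R(Z,\widehat{Z})/\lVert Z-\widehat{Z}\rVert^{2}_{\widehat{W},2} should remain bounded as t\to 0. The following sketch (ours; the compact helper names are hypothetical) performs this check for a random direction D.

```python
# Sketch: probe the expansion (20) and the quadratic remainder bound (21).
# Helper names (W_of, norm_W1, norm_W2, Lambda_hat) are ours, not the paper's.
import numpy as np

W_of = lambda Z, g: np.linalg.inv(g * np.eye(Z.shape[1]) + (1 - g) * Z.T @ Z)   # W[Z] = M[Z]^{-1}
norm_W1 = lambda Z, W: np.sqrt(np.einsum('nk,kl,nl->n', Z, W, Z)).sum()          # ||Z||_{W,1}
norm_W2 = lambda Z, W: np.sqrt(np.trace(Z @ W @ Z.T))                            # ||Z||_{W,2}

def Lambda_hat(Zhat, gamma):
    W = W_of(Zhat, gamma)
    L = np.zeros((Zhat.shape[1],) * 2)
    for zn in Zhat:                                   # rows zhat_n of Zhat
        nrm = np.sqrt(zn @ W @ zn)
        if nrm > 0:                                   # rows with zhat_n = 0 contribute zero
            L -= (1 - gamma) / nrm * np.outer(W @ zn, W @ zn)
    return L

rng = np.random.default_rng(1)
gamma, Zhat, D = 0.1, rng.standard_normal((6, 3)), rng.standard_normal((6, 3))
What, Lam = W_of(Zhat, gamma), Lambda_hat(Zhat, gamma)
for t in [1e-1, 1e-2, 1e-3]:
    Z = Zhat + t * D
    R = norm_W1(Z, W_of(Z, gamma)) - norm_W1(Z, What) - np.sum((Zhat @ Lam) * (Z - Zhat))
    print(t, R / norm_W2(Z - Zhat, What) ** 2)        # ratio stays bounded as t -> 0
```
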
To prove this result, we first derive some continuity and differentiability properties of WW.

Proposition A.8.

For every ZZ and W=W[Z]W=W[Z] there holds W1/2ZTop=ZW1/2op(1γ)1/2\|W^{1/2}Z^{T}\|_{\textnormal{op}}=\|ZW^{1/2}\|_{\textnormal{op}}\leq(1-\gamma)^{-1/2}.

Proof.

With the singular value calculus we have ZW1/2=wγ(Z)=Uwγ(Σ)VZW^{1/2}=w_{\gamma}(Z)=Uw_{\gamma}(\Sigma)V and wγ()(1γ)1/2w_{\gamma}(\cdot)\leq(1-\gamma)^{-1/2}. ∎
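
A quick numerical illustration of this bound (a sketch with hypothetical names; the symmetric square root of W[Z] is formed through an eigendecomposition):

```python
# Sketch: check the bound ||Z W[Z]^{1/2}||_op <= (1-gamma)^{-1/2} of Proposition A.8.
import numpy as np

rng = np.random.default_rng(4)
gamma, Z = 0.2, rng.standard_normal((7, 4))
W = np.linalg.inv(gamma * np.eye(4) + (1 - gamma) * Z.T @ Z)    # W = W[Z]
evals, evecs = np.linalg.eigh(W)
W_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T              # symmetric square root W^{1/2}
print(np.linalg.norm(Z @ W_half, 2), (1 - gamma) ** -0.5)       # operator norm vs. the bound
```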

Proposition A.9.

For W=W[Z]W=W[Z], W^=W[Z^]\widehat{W}=W[\widehat{Z}] such that both are invertible there holds

(22) |zT(WW^)z|\displaystyle\lvert z^{T}(W-\widehat{W})z\rvert CWzW^2(ZZ^)W^1/2opfor all zK,\displaystyle\leq C_{W}\lVert z\rVert^{2}_{\widehat{W}}\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}}\quad\text{for all }z\in\mathbb{R}^{K},

where CW=(1γ)1/2W^1Wop(2+(ZZ^)W^1/2op)C_{W}=(1-\gamma)^{1/2}\lVert\widehat{W}^{-1}W\rVert_{\textnormal{op}}(2+\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}}). Moreover with the perturbation of WW defined by

DW=2(1γ)W^Z^T(ZZ^)W^D_{W}=-2(1-\gamma)\widehat{W}\widehat{Z}^{T}(Z-\widehat{Z})\widehat{W}

and CW=(1γ)(1+W^1Wop(2+(ZZ^)W^1/2op)2)C^{\prime}_{W}=(1-\gamma)\left(1+\lVert\widehat{W}^{-1}W\rVert_{\textnormal{op}}(2+\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}})^{2}\right) there holds

(23) |zT(WW^DW)z|\displaystyle\lvert z^{T}(W-\widehat{W}-D_{W})z\rvert CWzW^2(ZZ^)W^1/2op2for all zK.\displaystyle\leq C^{\prime}_{W}\lVert z\rVert^{2}_{\widehat{W}}\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert^{2}_{\textnormal{op}}\quad\text{for all }z\in\mathbb{R}^{K}.
Proof.

Denote M=M[Z]=γI+(1γ)ZTZM=M[Z]=\gamma I+(1-\gamma)Z^{T}Z. We have that

(24) WW^\displaystyle W-\widehat{W} =M1M^1=M^1(M^M)M1=W^(M^M)W\displaystyle=M^{-1}-\widehat{M}^{-1}=\widehat{M}^{-1}(\widehat{M}-M)M^{-1}=\widehat{W}(\widehat{M}-M)W
(25) =W^(MM^)W^+W^(MM^)W^(MM^)W.\displaystyle=-\widehat{W}(M-\widehat{M})\widehat{W}+\widehat{W}(M-\widehat{M})\widehat{W}(M-\widehat{M})W.

Turning now to the terms with differences of MM, note that

MM^\displaystyle M-\widehat{M} =(1γ)(ZTZZ^TZ^)\displaystyle=(1-\gamma)(Z^{T}Z-\widehat{Z}^{T}\widehat{Z})
=(1γ)(ZT(ZZ^)+(ZZ^)TZ^)\displaystyle=(1-\gamma)(Z^{T}(Z-\widehat{Z})+(Z-\widehat{Z})^{T}\widehat{Z})
(26) =(1γ)(Z^T(ZZ^)+(ZZ^)TZ^+(ZZ^)T(ZZ^)).\displaystyle=(1-\gamma)(\widehat{Z}^{T}(Z-\widehat{Z})+(Z-\widehat{Z})^{T}\widehat{Z}+(Z-\widehat{Z})^{T}(Z-\widehat{Z})).

Thus we have with Proposition A.8 that

(27) (MM^)W^op\displaystyle\|(M-\widehat{M})\widehat{W}\|_{\textnormal{op}} (1γ)(Z^T(ZZ^)W^op+(ZZ^)TZ^W^op+(ZZ^)T(ZZ^)W^op)\displaystyle\leq(1-\gamma)(\|\widehat{Z}^{T}(Z-\widehat{Z})\widehat{W}\|_{\textnormal{op}}+\|(Z-\widehat{Z})^{T}\widehat{Z}\widehat{W}\|_{\textnormal{op}}+\|(Z-\widehat{Z})^{T}(Z-\widehat{Z})\widehat{W}\|_{\textnormal{op}})
(1γ)(2W^1/2Z^Top+(ZZ^)W^1/2op)(ZZ^)W^1/2op\displaystyle\leq(1-\gamma)(2\|\widehat{W}^{1/2}\widehat{Z}^{T}\|_{\textnormal{op}}+\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}})\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}}
(2(1γ)1/2+(1γ)(ZZ^)W^1/2op)(ZZ^)W^1/2op.\displaystyle\leq(2(1-\gamma)^{1/2}+(1-\gamma)\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}})\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}}.

Combining this with (24) we already obtain (22) due to

|zT(WW^)z|\displaystyle\lvert z^{T}(W-\widehat{W})z\rvert =|zTW^(MM^)Wz|zW^2W^1/2(MM^)WW^1/2op\displaystyle=\lvert z^{T}\widehat{W}(M-\widehat{M})Wz\rvert\leq\lVert z\rVert^{2}_{\widehat{W}}\lVert\widehat{W}^{1/2}(M-\widehat{M})W\widehat{W}^{-1/2}\rVert_{\textnormal{op}}
zW^2(MM^)W^opW^1Wop.\displaystyle\leq\lVert z\rVert^{2}_{\widehat{W}}\lVert(M-\widehat{M})\widehat{W}\rVert_{\textnormal{op}}\lVert\widehat{W}^{-1}W\rVert_{\textnormal{op}}.

It remains to derive (23). We start by noticing that

z^{T}D_{W}z=(1/2)z^{T}(D_{W}+D_{W}^{T})z=-(1-\gamma)z^{T}\widehat{W}\left(\widehat{Z}^{T}(Z-\widehat{Z})+(Z-\widehat{Z})^{T}\widehat{Z}\right)\widehat{W}z.

Using this together with (25) and (26), we obtain

z^{T}(W-\widehat{W}-D_{W})z=-(1-\gamma)\,z^{T}\widehat{W}(Z-\widehat{Z})^{T}(Z-\widehat{Z})\widehat{W}z+z^{T}\widehat{W}(M-\widehat{M})\widehat{W}(M-\widehat{M})Wz.

Taking the absolute value and applying the triangle inequality, we estimate the first term by

|zTW^(ZZ^)T(ZZ^)W^z|zW^2(ZZ^)W^1/2op2\lvert z^{T}\widehat{W}(Z-\widehat{Z})^{T}(Z-\widehat{Z})\widehat{W}z\rvert\leq\lVert z\rVert^{2}_{\widehat{W}}\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}}^{2}

Concerning the second term, note that

|zTW^(MM^)W^(MM^)Wz|\displaystyle\lvert z^{T}\widehat{W}(M-\widehat{M})\widehat{W}(M-\widehat{M})Wz\rvert zW^2W^1/2(MM^)W^(MM^)WW^1/2op\displaystyle\leq\lVert z\rVert^{2}_{\widehat{W}}\|\widehat{W}^{1/2}(M-\widehat{M})\widehat{W}(M-\widehat{M})W\widehat{W}^{-1/2}\|_{\textnormal{op}}
=zW^2(MM^)W^(MM^)W^W^1Wop\displaystyle=\lVert z\rVert^{2}_{\widehat{W}}\|(M-\widehat{M})\widehat{W}(M-\widehat{M})\widehat{W}\widehat{W}^{-1}W\|_{\textnormal{op}}
zW^2W^1Wop(MM^)W^op2.\displaystyle\leq\lVert z\rVert^{2}_{\widehat{W}}\|\widehat{W}^{-1}W\|_{\textnormal{op}}\|(M-\widehat{M})\widehat{W}\|^{2}_{\textnormal{op}}.

Together with (27), this yields the second estimate (23). ∎
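
Both estimates can be illustrated numerically: along Z=\widehat{Z}+tD, the quadratic form z^{T}(W-\widehat{W})z decays like t, while subtracting the perturbation D_{W} leaves a remainder that decays like t^{2}. A sketch (ours, with hypothetical helper names):

```python
# Sketch: along Z = Zhat + t*D, the error z^T(W - What)z is O(t) while
# z^T(W - What - D_W)z is O(t^2), as quantified in Proposition A.9.
import numpy as np

W_of = lambda Z, g: np.linalg.inv(g * np.eye(Z.shape[1]) + (1 - g) * Z.T @ Z)

rng = np.random.default_rng(5)
gamma = 0.2
Zhat, D, z = rng.standard_normal((6, 3)), rng.standard_normal((6, 3)), rng.standard_normal(3)
What = W_of(Zhat, gamma)
for t in [1e-1, 1e-2, 1e-3]:
    Z = Zhat + t * D
    W = W_of(Z, gamma)
    DW = -2 * (1 - gamma) * What @ Zhat.T @ (Z - Zhat) @ What    # D_W from Proposition A.9
    print(t, z @ (W - What) @ z, z @ (W - What - DW) @ z)        # O(t) versus O(t^2)
```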

Proposition A.10.

The mapping

KzzzTzK×K\mathbb{R}^{K}\ni z\mapsto\frac{zz^{T}}{\lVert z\rVert}\in\mathbb{R}^{K\times K}

is Lipschitz-continuous with constant three in the Frobenius norm, where the mapping is extended by zero at z=0.

Proof.

First, we consider the case where z=0z=0. In that case

zzTzzzTzFro=zzTzFro=z2=zz2\left\lVert\frac{z^{\prime}z^{\prime T}}{\lVert z^{\prime}\rVert}-\frac{zz^{T}}{\lVert z\rVert}\right\rVert_{\textnormal{Fro}}=\left\lVert\frac{z^{\prime}z^{\prime T}}{\lVert z^{\prime}\rVert}\right\rVert_{\textnormal{Fro}}=\lVert z^{\prime}\rVert_{2}=\lVert z^{\prime}-z\rVert_{2}

If both z and z^{\prime} are nonzero, then

zzTzzzTzFro=1zz(zz)zTz+zzT(zz)+z(zTzT)zFro3zz2\left\lVert\frac{z^{\prime}z^{\prime T}}{\lVert z^{\prime}\rVert}-\frac{zz^{T}}{\lVert z\rVert}\right\rVert_{\textnormal{Fro}}=\frac{1}{\lVert z^{\prime}\rVert\lVert z\rVert}\left\lVert(z^{\prime}-z)z^{\prime T}\lVert z\rVert+zz^{\prime T}(\lVert z\rVert-\lVert z^{\prime}\rVert)+z(z^{\prime T}-z^{T})\lVert z^{\prime}\rVert\right\rVert_{\textnormal{Fro}}\leq 3\lVert z^{\prime}-z\rVert_{2}\qed
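
The constant can also be probed empirically; in the following sketch (ours), the observed Frobenius-norm difference quotients of the mapping stay below the bound of three.

```python
# Sketch: random probe of the Lipschitz constant of z -> z z^T / ||z|| in the
# Frobenius norm; the observed difference quotients stay below the bound of three.
import numpy as np

def outer_map(z):
    n = np.linalg.norm(z)
    return np.outer(z, z) / n if n > 0 else np.zeros((z.size, z.size))

rng = np.random.default_rng(2)
worst = 0.0
for _ in range(10000):
    z, zp = rng.standard_normal(4), rng.standard_normal(4)
    worst = max(worst, np.linalg.norm(outer_map(zp) - outer_map(z), 'fro') / np.linalg.norm(zp - z))
print(worst)   # empirically below 3
```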

We are ready to prove the main result.

Proof of Theorem A.6.

We start with the following identity, valid for a,\widehat{a}>0:

aa^=2aa^2a^2a^=aa^(aa^)22a^=aa^2a^(aa^)22a^(a+a^)2.\sqrt{a}-\sqrt{\widehat{a}}=\frac{2\sqrt{a}\sqrt{\widehat{a}}-2\widehat{a}}{2\sqrt{\widehat{a}}}=\frac{a-\widehat{a}-\big{(}\sqrt{a}-\sqrt{\widehat{a}}\big{)}^{2}}{2\sqrt{\widehat{a}}}=\frac{a-\widehat{a}}{2\sqrt{\widehat{a}}}-\frac{(a-\widehat{a})^{2}}{2\sqrt{\widehat{a}}\big{(}\sqrt{a}+\sqrt{\widehat{a}}\big{)}^{2}}.

Now denote W=W[Z] and \widehat{W}=W[\widehat{Z}], where both are assumed invertible. Applying the identity with a=z_{n}^{T}Wz_{n} and \widehat{a}=z_{n}^{T}\widehat{W}z_{n} for z_{n}\neq 0, we have

ZW,1ZW^,1\displaystyle\|Z\|_{W,1}-\|Z\|_{\widehat{W},1} =n:zn0znTWznznTW^zn\displaystyle=\sum_{n\colon z_{n}\neq 0}\sqrt{z_{n}^{T}Wz_{n}}-\sqrt{z_{n}^{T}\widehat{W}z_{n}}
(28) =n:zn0znT(WW^)zn2znW^(znT(WW^)zn)22znW^(znW+znW^)2.\displaystyle=\sum_{n\colon z_{n}\neq 0}\frac{z_{n}^{T}(W-\widehat{W})z_{n}}{2\|z_{n}\|_{\widehat{W}}}-\frac{(z_{n}^{T}(W-\widehat{W})z_{n})^{2}}{2\|z_{n}\|_{\widehat{W}}(\|z_{n}\|_{W}+\|z_{n}\|_{\widehat{W}})^{2}}.

We define the intermediate value

Λ=n=1N1γznW^W^znznTW^\Lambda=-\sum_{n=1}^{N}\frac{1-\gamma}{\lVert z_{n}\rVert_{\widehat{W}}}\widehat{W}z_{n}z_{n}^{T}\widehat{W}

and note that, by the cyclic property of the trace,

Z^Λ,(ZZ^)=tr(ΛZ^T(ZZ^))=n=1N2(1γ)znTW^Z^T(ZZ^)W^zn2znW^=n=1NznTDWzn2znW^,\langle\widehat{Z}\Lambda,(Z-\widehat{Z})\rangle=\operatorname{tr}\left(\Lambda\widehat{Z}^{T}(Z-\widehat{Z})\right)=-\sum_{n=1}^{N}\frac{2(1-\gamma)z_{n}^{T}\widehat{W}\widehat{Z}^{T}(Z-\widehat{Z})\widehat{W}z_{n}}{2\|z_{n}\|_{\widehat{W}}}=\sum_{n=1}^{N}\frac{z_{n}^{T}D_{W}z_{n}}{2\|z_{n}\|_{\widehat{W}}},\\

where D_{W} is as in Proposition A.9. Subtracting this identity from (28), it follows that

ZW,1ZW^,1Z^Λ,(ZZ^)=R0(Z,Z^).\|Z\|_{W,1}-\|Z\|_{\widehat{W},1}-\langle\widehat{Z}\Lambda,(Z-\widehat{Z})\rangle=R_{0}(Z,\widehat{Z}).

where

R0(Z,Z^)=n:zn0znT(WW^DW)zn2znW^(znT(WW^)zn)22znW^(znW+znW^)2R_{0}(Z,\widehat{Z})=\sum_{n\colon z_{n}\neq 0}\frac{z_{n}^{T}(W-\widehat{W}-D_{W})z_{n}}{2\|z_{n}\|_{\widehat{W}}}-\frac{(z_{n}^{T}(W-\widehat{W})z_{n})^{2}}{2\|z_{n}\|_{\widehat{W}}(\|z_{n}\|_{W}+\|z_{n}\|_{\widehat{W}})^{2}}

By the triangle inequality and Proposition A.9 we obtain

\lvert R_{0}(Z,\widehat{Z})\rvert\leq\sum_{n\colon z_{n}\neq 0}\frac{\lVert z_{n}\rVert_{\widehat{W}}}{2}\left(C^{\prime}_{W}+C^{2}_{W}\right)\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert^{2}_{\textnormal{op}}\leq\frac{L_{1}}{2}\lVert Z-\widehat{Z}\rVert^{2}_{\widehat{W},2}

where

L1=(CW+CW2)ZW^,1.L_{1}=\left(C^{\prime}_{W}+C^{2}_{W}\right)\|Z\|_{\widehat{W},1}.

To obtain the final result, it remains to estimate the difference of Λ\Lambda and Λ^\widehat{\Lambda}. We start with

\widehat{W}^{-1/2}(\Lambda-\widehat{\Lambda})\widehat{W}^{-1/2}=-(1-\gamma)\sum_{n=1}^{N}\left(\frac{1}{\lVert z_{n}\rVert_{\widehat{W}}}\widehat{W}^{1/2}z_{n}z_{n}^{T}\widehat{W}^{1/2}-\frac{1}{\lVert\widehat{z}_{n}\rVert_{\widehat{W}}}\widehat{W}^{1/2}\widehat{z}_{n}\widehat{z}_{n}^{T}\widehat{W}^{1/2}\right).

By the triangle inequality and Proposition A.10 we have

(ΛΛ^)W^1Fro=W^1/2(ΛΛ^)W^1/2Fro3(1γ)n=1Nznz^nW^\lVert(\Lambda-\widehat{\Lambda})\widehat{W}^{-1}\rVert_{\textnormal{Fro}}=\lVert\widehat{W}^{-1/2}(\Lambda-\widehat{\Lambda})\widehat{W}^{-1/2}\rVert_{\textnormal{Fro}}\leq 3(1-\gamma)\sum_{n=1}^{N}\lVert z_{n}-\widehat{z}_{n}\rVert_{\widehat{W}}

and thus

|Z^(ΛΛ^),(ZZ^)|=\displaystyle\lvert\langle\widehat{Z}(\Lambda-\widehat{\Lambda}),(Z-\widehat{Z})\rangle\rvert= |tr((ΛΛ^)Z^T(ZZ^))|\displaystyle\left\lvert\operatorname{tr}\left((\Lambda-\widehat{\Lambda})\widehat{Z}^{T}(Z-\widehat{Z})\right)\right\rvert
\displaystyle\leq\; (ΛΛ^)W^1FroW^Z^T(ZZ^)Fro\displaystyle\lVert(\Lambda-\widehat{\Lambda})\widehat{W}^{-1}\rVert_{\textnormal{Fro}}\|\widehat{W}\widehat{Z}^{T}(Z-\widehat{Z})\|_{\textnormal{Fro}}
\displaystyle\leq\; 3(1γ)ZZ^W^,1W^1/2Z^TopZZ^W^,2\displaystyle 3(1-\gamma)\lVert Z-\widehat{Z}\rVert_{\widehat{W},1}\lVert\widehat{W}^{1/2}\widehat{Z}^{T}\rVert_{\textnormal{op}}\lVert Z-\widehat{Z}\rVert_{\widehat{W},2}
\displaystyle\leq\; 3N(1γ)1/2ZZ^W^,22,\displaystyle 3\sqrt{N}(1-\gamma)^{1/2}\lVert Z-\widehat{Z}\rVert^{2}_{\widehat{W},2},

where we have used again Proposition A.8 and the inequality W^,1NW^,2\lVert\cdot\rVert_{\widehat{W},1}\leq\sqrt{N}\lVert\cdot\rVert_{\widehat{W},2}.

Combining the previous estimates yields the desired expansion (20) of \Psi_{\gamma} together with the remainder estimate (21) and the constant

L=6N(1γ)1/2+L1.L=6\sqrt{N}(1-\gamma)^{1/2}+L_{1}.

The constant LL depends on (ZZ^)W^1/2opZZ^W^,2\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}}\leq\lVert Z-\widehat{Z}\rVert_{\widehat{W},2} and W^1Wop\lVert\widehat{W}^{-1}W\rVert_{\textnormal{op}} through Proposition A.9, and on ZW^,1\lVert Z\rVert_{\widehat{W},1}. Moreover, the latter two quantities can also be estimated in terms of ZZ^W^,2\lVert Z-\widehat{Z}\rVert_{\widehat{W},2}. First, we have

W^1Wop=I+(M^M)W^W^1Wop1+(M^M)W^opW^1Wop\lVert\widehat{W}^{-1}W\rVert_{\textnormal{op}}=\lVert I+(\widehat{M}-M)\widehat{W}\widehat{W}^{-1}W\rVert_{\textnormal{op}}\leq 1+\lVert(\widehat{M}-M)\widehat{W}\rVert_{\textnormal{op}}\lVert\widehat{W}^{-1}W\rVert_{\textnormal{op}}

and substituting e_{M}=\lVert(\widehat{M}-M)\widehat{W}\rVert_{\textnormal{op}} and rearranging, we obtain (1-e_{M})\lVert\widehat{W}^{-1}W\rVert_{\textnormal{op}}\leq 1. Provided e_{M}<1, which holds whenever Z is sufficiently close to \widehat{Z}, this gives \lVert\widehat{W}^{-1}W\rVert_{\textnormal{op}}\leq 1/(1-e_{M})=1+e_{M}/(1-e_{M}). Now, we can apply (27): e_{M}\leq(2+\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}})\lVert(Z-\widehat{Z})\widehat{W}^{1/2}\rVert_{\textnormal{op}}. Second, we have

ZW^,1Z^W^,1+ZZ^W^,1\lVert Z\rVert_{\widehat{W},1}\leq\lVert\widehat{Z}\rVert_{\widehat{W},1}+\lVert Z-\widehat{Z}\rVert_{\widehat{W},1}

Thus, taking into account the inequalities \lVert\cdot\rVert_{\widehat{W},1}\leq\sqrt{N}\lVert\cdot\rVert_{\widehat{W},2}\leq\sqrt{NK}\lVert\cdot\,\widehat{W}^{1/2}\rVert_{\textnormal{op}} and Proposition A.8, it can be seen that, for \lVert Z-\widehat{Z}\rVert_{\widehat{W},2} smaller than a universal constant \bar{\Delta}, the constant L can be bounded by a universal constant \bar{L}. ∎

Theorem A.6 allows us to express the Clarke subdifferential of \Psi_{\gamma} through the convex subdifferential of the model function for \Psi_{\gamma}, which, motivated by the previous theorem, is defined as

(29) mZ^Ψ(Z)=ZW^,1+Z^Λ^,ZZ^.m^{\Psi}_{\widehat{Z}}(Z)=\lVert Z\rVert_{\widehat{W},1}+\langle\widehat{Z}\widehat{\Lambda},Z-\widehat{Z}\rangle.
Theorem A.11.

For all Z^\widehat{Z} with M^=M[Z^]\widehat{M}=M[\widehat{Z}] invertible we have

CΨγ(Z^)=mZ^Ψ(Z^).\partial_{C}\Psi_{\gamma}(\widehat{Z})=\partial m^{\Psi}_{\widehat{Z}}(\widehat{Z}).
Proof.

Assume first that Z is given with z_{n}\neq 0 for all n. Then, \Psi_{\gamma} is Fréchet differentiable with

[Ψγ(Z)]n=znznWW(1γ)zni=1N1ziWWziTziWwhere W=W[Z],[\nabla\Psi_{\gamma}(Z)]_{n}=\frac{z_{n}}{\lVert z_{n}\rVert_{W}}W-(1-\gamma)z_{n}\sum_{i=1}^{N}\frac{1}{\lVert z_{i}\rVert_{W}}Wz_{i}^{T}z_{i}W\quad\text{where }W=W[Z],

which is obtained using the differentiability of ZW,1\lVert Z\rVert_{W,1} for fixed WW and the chain rule. On the other hand mZ^Ψ(Z)m^{\Psi}_{\widehat{Z}}(Z) is also differentiable under this assumption with

[mZ^Ψ(Z)]n=znznW^W^(1γ)z^ni=1N1z^iW^W^z^iTz^iW^[\nabla m^{\Psi}_{\widehat{Z}}(Z)]_{n}=\frac{z_{n}}{\lVert z_{n}\rVert_{\widehat{W}}}\widehat{W}-(1-\gamma)\widehat{z}_{n}\sum_{i=1}^{N}\frac{1}{\lVert\widehat{z}_{i}\rVert_{\widehat{W}}}\widehat{W}\widehat{z}_{i}^{T}\widehat{z}_{i}\widehat{W}

Furthermore for any sequence ZkZ^Z_{k}\to\widehat{Z} with the above property it holds

(30) limkmZ^Ψ(Zk)=limkΨγ(Zk),\lim_{k}\nabla m^{\Psi}_{\widehat{Z}}(Z_{k})=\lim_{k}\nabla\Psi_{\gamma}(Z_{k}),

assuming that either of the two limits exists, i.e., that z^{k}_{n}/\lVert z^{k}_{n}\rVert_{\widehat{W}}\to s_{n} for all n, where z^{k}_{n} denotes the n-th row of Z_{k}. Now, by the characterization of the Clarke subdifferential ([10, Theorem 10.27]) we know that

CΨγ(Z^)=convLimitsZZ^:zn0nΨγ(Z)\partial_{C}\Psi_{\gamma}(\widehat{Z})=\operatorname{conv}\operatorname*{Limits}_{Z\to\widehat{Z}:z_{n}\neq 0\,\forall n}\nabla\Psi_{\gamma}(Z)

and also that

mZ^Ψ(Z^)=CmZ^Ψ(Z^)=convLimitsZZ^:zn0nmZ^Ψ(Z).\partial m^{\Psi}_{\widehat{Z}}(\widehat{Z})=\partial_{C}m^{\Psi}_{\widehat{Z}}(\widehat{Z})=\operatorname{conv}\operatorname*{Limits}_{Z\to\widehat{Z}:z_{n}\neq 0\,\forall n}\nabla m^{\Psi}_{\widehat{Z}}(Z).

Thus, by (30), the two subdifferentials coincide. ∎
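
As a consistency check of the gradient formula used in this proof, it can be compared with central finite differences of \Psi_{\gamma}(Z)=\lVert Z\rVert_{W[Z],1} at a point where all rows are nonzero (a sketch; the helper names W_of, norm_W1, psi, grad_psi are ours):

```python
# Sketch: finite-difference check of the gradient formula for Psi_gamma at a point
# with all rows nonzero. Helper names are ours, not the paper's.
import numpy as np

W_of = lambda Z, g: np.linalg.inv(g * np.eye(Z.shape[1]) + (1 - g) * Z.T @ Z)
norm_W1 = lambda Z, W: np.sqrt(np.einsum('nk,kl,nl->n', Z, W, Z)).sum()
psi = lambda Z, g: norm_W1(Z, W_of(Z, g))                        # Psi_gamma(Z) = ||Z||_{W[Z],1}

def grad_psi(Z, gamma):
    W = W_of(Z, gamma)
    rn = np.sqrt(np.einsum('nk,kl,nl->n', Z, W, Z))              # ||z_n||_W
    Lam = -sum((1 - gamma) / r * np.outer(W @ z, W @ z) for z, r in zip(Z, rn))
    return (Z / rn[:, None]) @ W + Z @ Lam                       # rows: z_n W/||z_n||_W + z_n Lam

rng = np.random.default_rng(3)
gamma, Z = 0.3, rng.standard_normal((5, 3))
G, G_fd, eps = grad_psi(Z, gamma), np.zeros((5, 3)), 1e-6
for i in range(5):
    for j in range(3):
        E = np.zeros((5, 3)); E[i, j] = eps
        G_fd[i, j] = (psi(Z + E, gamma) - psi(Z - E, gamma)) / (2 * eps)
print(np.max(np.abs(G - G_fd)))   # small up to finite-difference error
```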