Orthogonally weighted $\ell_{2,1}$ regularization for rank-aware joint sparse recovery: algorithm and analysis
Abstract.
We propose and analyze an efficient algorithm for solving the joint sparse recovery problem using a new regularization-based method, named orthogonally weighted $\ell_{2,1}$ ($ow\ell_{2,1}$), which is specifically designed to take into account the rank of the solution matrix. This method has applications in feature extraction, matrix column selection, and dictionary learning, and it is distinct from the commonly used $\ell_{2,1}$ regularization and other existing regularization-based approaches because it can exploit the full rank of the row-sparse solution matrix, a key feature in many applications. We provide a proof of the method’s rank-awareness, establish the existence of solutions to the proposed optimization problem, and develop an efficient algorithm for solving it, whose convergence is analyzed. We also present numerical experiments to illustrate the theory and demonstrate the effectiveness of our method on real-life problems.
1. Introduction
The joint sparse recovery or multiple measurement vector (MMV) problem aims to find a solution to the matrix equation $AX = Y$, where $A \in \mathbb{R}^{m \times N}$ and $Y \in \mathbb{R}^{m \times K}$ are given, and $X \in \mathbb{R}^{N \times K}$ is an unknown $s$-row sparse matrix, meaning that at most $s$ rows of $X$ are non-zero.
The task of finding the solution of $AX = Y$ with the fewest non-zero rows in the noiseless case can be formulated as an optimization problem using mixed norms:
(1) $\min_{X} \|X\|_{2,0} \quad \text{subject to} \quad AX = Y,$
where $\|X\|_{2,0}$ is the number of non-zero rows in $X$. However, the main drawback of this method is that the optimization problem is computationally infeasible, as it results in an exhaustive search that is an NP-hard problem. To overcome this limitation, the $\ell_{2,1}$ norm regularization method is commonly used to solve the joint sparse recovery problem [51]:
(2) $\min_{X} \|X\|_{2,1} \quad \text{subject to} \quad AX = Y,$
where $\|X\|_{2,1} = \sum_{i} \|x_i\|_2$, and $x_i$ is the $i$-th row of $X$. This method generalizes the $\ell_1$ sparsifying penalty to the multiple vector setting and is computationally feasible. However, it has the drawback of being rank-blind, meaning it does not take into account the rank of the $s$-row sparse matrix $X$ [16]. In real-world scenarios, data typically has a maximum essential rank, equal to its essential sparsity.
In [42], it was argued that a rank-aware regularizer cannot be continuous and, thus, cannot be convex on the whole matrix space. A new method for recovery of jointly sparse vectors in the form of a (non-convex) optimization problem was proposed:
(3) $\min_{X} \, \Omega(X) \quad \text{subject to} \quad AX = Y,$
where the regularizer $\Omega$, which we here call the orthogonally weighted $\ell_{2,1}$ or $ow\ell_{2,1}$, is defined as
$\Omega(X) = \big\| X (X^\top X)^{\dagger/2} \big\|_{2,1},$
with $\dagger$ denoting the Moore-Penrose pseudo-inverse and $B^{\dagger/2} = (B^{\dagger})^{1/2}$ its square root for a given symmetric matrix $B$. If the minimization is restricted to matrices of full rank, we obtain the functional proposed in [42]. The papers [36, 42] also discussed a strategy to reduce the joint sparse problem (1) to a simpler setting where the measurement matrix has full column-rank. This transformation can be achieved through techniques such as the Singular Value Decomposition (SVD) or other factorization methods. In particular, if $Y = \hat{Y} T$, where $\hat{Y}$ has full column rank and $T$ has full row rank, then for any solution $\hat{X}$ of the corresponding optimization problem with data $\hat{Y}$ in place of $Y$,
the matrix $\hat{X} T$ is a solution of the original problem (1). This strategy is independent of the chosen regularization functional and can be used for the noisy problem as well to reduce it to an approximate full column rank problem, and the resulting reconstruction error can be controlled; see [42] for more details.
However, [42] neither provided an affirmative conclusion on the rank-awareness of $ow\ell_{2,1}$ regularization, nor a general algorithm for its implementation. The algorithm proposed in [42] relies on being able to determine the rank of the solution directly from the data and does not directly address the more practical case of noisy data discussed below. In this paper, we aim to address these two important open questions by providing: i) a proof of the rank-awareness of the $ow\ell_{2,1}$ regularization, making it the first regularization-based method possessing this property; and ii) a computationally efficient algorithm for solving the optimization problem, supported by rigorous convergence analysis.
The naming of the functional is derived from the following. Note that, if $X = U \Sigma V^\top$ is the compact singular value decomposition of $X$, i.e., $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal and s.p.d., then $\Omega(X) = \|U V^\top\|_{2,1}$; thus it renormalizes $X$ to its orthogonal component before applying the $\ell_{2,1}$ norm. Also
$\Omega(X) = \|U V^\top\|_{2,1} = \|U\|_{2,1};$
in other words, $\Omega(X)$ is the $\ell_{2,1}$ norm of an orthonormal basis of the column space of $X$ and its value is independent of the choice of the basis. This and many other properties of the functional, like its local Lipschitz continuity, are discussed in detail in Appendix A. Consequently, when $X$ is of $s$-row sparsity and of rank $s$, it can be seen that $\Omega(X) = s = \|X\|_{2,0}$; in other words, it interpolates the values of $\|\cdot\|_{2,0}$ on such matrices. However, unlike the latter, it is feasible to develop an efficient algorithm for solving the optimization problem with the $ow\ell_{2,1}$ regularizer, as we shall see in this effort.
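To make the definition concrete, the following minimal sketch (our own illustration; the function names `l21` and `ow_l21` are ours, not from the paper) evaluates the functional numerically via a compact SVD and checks the basis-independence property stated above.

```python
import numpy as np

def l21(X):
    """Sum of the Euclidean norms of the rows of X."""
    return np.sum(np.linalg.norm(X, axis=1))

def ow_l21(X, tol=1e-12):
    """Orthogonally weighted l_{2,1}: the l_{2,1} norm of U V^T, where X = U S V^T is the compact SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))           # numerical rank (s[0] is the largest singular value)
    return l21(U[:, :r] @ Vt[:r, :])          # equals l21(U[:, :r]) since Vt[:r, :] has orthonormal rows

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4)) @ rng.standard_normal((4, 6))   # a rank-4, 50 x 6 matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.isclose(ow_l21(X), l21(U[:, :4])))                      # l_{2,1} norm of an orthonormal basis
M = rng.standard_normal((6, 6)) + 6 * np.eye(6)                  # a (well-conditioned) invertible matrix
print(np.isclose(ow_l21(X), ow_l21(X @ M)))                      # invariant under change of column basis
```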
In the presence of measurement noise, we consider the regularized formulation of (3):
(4) $\min_{X} \; \tfrac{1}{2}\|AX - Y\|_F^2 + \lambda\, \Omega(X).$
The term $\tfrac{1}{2}\|AX - Y\|_F^2$ acts as a penalty for deviating from the measurements, where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The quantity $\lambda > 0$ controls the strength of the regularization or can be interpreted as a Lagrange multiplier for the constrained formulation
(5) $\min_{X} \; \Omega(X) \quad \text{subject to} \quad \|AX - Y\|_F \le \delta.$
To simplify notation, we will primarily consider the regularized formulation of the problem as given in (4).
1.1. Notational conventions
Here we introduce various notation conventions aimed at enhancing the clarity and readability of the content. The true value of the matrix of interest is represented by . The optimal solution of a given regularization problem is represented by . The rows of a given matrix are denoted as for . For a given symmetric positive matrix , the -(semi-)norm of a vector is represented by , and the Euclidean norm is denoted by . The norm for of is represented by
and for $q = \infty$ the sum is replaced by a maximum, as usual. Note that for $p = q = 2$ we obtain the Euclidean norm of the vectorized matrix, which is also denoted as the Frobenius norm $\|X\|_F$. The associated inner product on the space of matrices is denoted by $\langle \cdot, \cdot \rangle_F$. Finally, by $\|\cdot\|$ we denote the operator or induced matrix norm with respect to the Euclidean inner product. For a given matrix, we denote by $\dagger$ its Moore-Penrose pseudo-inverse, and for a given symmetric non-negative matrix $B$ by $B^{\dagger/2}$ the square root of the pseudo-inverse. The main functional of interest is represented by $\Omega$. Note that $X \mapsto X (X^\top X)^{\dagger/2}$ is related to the matrix sign function; see [27, Chapter 5], with the difference being that we use singular value calculus, as opposed to the standard (eigenvalue based) matrix calculus. In the case of scalars or symmetric matrices the two definitions are identical.
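As an illustration of the pseudo-inverse square root used throughout (a sketch of ours, not part of the paper's text; the function name `pinv_sqrt` is our own), it can be computed from an eigendecomposition by inverting only the nonzero eigenvalues:

```python
import numpy as np

def pinv_sqrt(B, tol=1e-12):
    """Square root of the Moore-Penrose pseudo-inverse of a symmetric positive semi-definite B."""
    w, V = np.linalg.eigh(B)                       # B = V diag(w) V^T
    w = np.clip(w, 0.0, None)
    inv_sqrt_w = np.zeros_like(w)
    mask = w > tol * max(w.max(), 1.0)
    inv_sqrt_w[mask] = 1.0 / np.sqrt(w[mask])      # invert only the nonzero eigenvalues
    return (V * inv_sqrt_w) @ V.T

print(pinv_sqrt(np.diag([4.0, 9.0, 0.0])))         # diag(1/2, 1/3, 0)
```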
1.2. Organization
The material is organized as follows: Section 2 discusses the applications of joint sparse recovery and related works. Section 3 proves the rank-awareness of the $ow\ell_{2,1}$ regularization. Section 4 establishes conditions for the existence of global minimizers of (3) and (4). The question of how well the solution of the regularized problem approximates the true value remains open. In Section 5, we provide a family of regularizers depending on an additional parameter $\varepsilon$ that connects the $ow\ell_{2,1}$ regularizer at $\varepsilon = 0$ and the convex $\ell_{2,1}$ regularizer at $\varepsilon = 1$, and establish first-order conditions for the generalized problem. This includes optimality conditions for the regularized problem in the situation where the solution matrix has full rank. For other cases, we provide a convergence result for $\varepsilon \to 0$, and rely on the optimality conditions for the “relaxed problem” for $\varepsilon > 0$. Section 6 proposes and analyzes a numerical algorithm for solving the optimization problem that is based either on a direct minimization of (4) with a variable metric proximal gradient method or on a combination of this with a continuation strategy in $\varepsilon$. Section 7 conducts experiments to demonstrate the performance of the algorithm. All proofs concerning various necessary properties of the regularizer have been moved to the appendix to improve readability.
2. Literature review
$\ell_1/\ell_2$ regularization. Our method represents a generalization of $\ell_1/\ell_2$, which uses the ratio of the $\ell_1$ and $\ell_2$ norms to model the sparsity [28, 21]. Indeed, in the special case of single vector recovery (i.e., for $K = 1$), the problems (3) and (4) are equivalent to the $\ell_1/\ell_2$ regularized problem due to
(6) $\Omega(x) = \|x\|_1 / \|x\|_2 \quad \text{for } x \neq 0.$
Theoretical analysis for $\ell_1/\ell_2$ has been developed in [56] for non-negative signals, [43] with a local optimality condition, and [46], where an analytic solution for the proximal operator is characterized, among others. Our proof of the existence of the global solution to (3) and (4) here is built on a generalization of the spherical section property, which has been studied in [57, 55] for the global optimality condition of $\ell_1/\ell_2$. The $\ell_1/\ell_2$ functional has been shown to enhance sparsity of the solutions and outperform $\ell_1$ on many test problems such as nonnegative matrix factorization and blind deconvolution [31, 29]. However, the existing uniform recovery guarantees for $\ell_1/\ell_2$ remain too strict in the general, noisy setting (i.e., more restrictive than well-known recovery conditions such as the null space property), despite many recent theoretical advances.
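For intuition, here is a quick numerical check (our own sketch, not from the paper) that the orthogonally weighted functional of a single nonzero column equals the $\ell_1/\ell_2$ ratio in (6):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(20)
X = x[:, None]                                      # single-column matrix, i.e., K = 1
U, s, Vt = np.linalg.svd(X, full_matrices=False)    # U is x normalized to unit Euclidean norm (up to sign)
ow_value = np.sum(np.linalg.norm(U @ Vt, axis=1))   # orthogonally weighted l_{2,1} of X
print(np.isclose(ow_value, np.linalg.norm(x, 1) / np.linalg.norm(x, 2)))   # True
```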
Joint sparse recovery. The joint sparse recovery problem, also known as multiple measurement vectors, distributed compressed sensing or simultaneous sparse approximation, has garnered a lot of attention in the literature. Mathematically, this problem arises when we want to reconstruct multiple vectors which have the same sparsity support using the same set of linear combinations for each. While one can certainly recover each vector independently, approximating them all at once should give superior results because we can take advantage of additional information about their shared support. To realize this potential, however, a joint sparse recovery algorithm is desired to be rank aware, that is, its performance improves with increasing rank of the solution matrix.
Existing approaches for joint sparse recovery can be roughly categorized into greedy methods [14, 48, 25], and algorithms based on mixed norm optimization [14, 47, 23, 19, 50]. Pioneering joint sparse methods are often straightforward extensions of algorithms for single vector recovery, whose recovery conditions are based on standard compressed sensing properties and typically turn out to be the same as those obtained from a single measurement problem. In such cases, the theoretical results do not show any advantage of the joint sparse approach. One of the first methods exploiting rank information is a discrete version of the MUSIC (MUltiple SIgnal Classification) algorithm [22], where recovery guarantees better than those from the single vector problem are proved when the solution matrix has full column-rank. Extensions of this work to non-full-rank cases have been developed in [33] (subspace-augmented MUSIC), [30] (compressive MUSIC) and [54] (semi-supervised MUSIC). Also inspired by the MUSIC technique, [16] proposes the rank-aware orthogonal recursive matching pursuit (RA-ORMP) based on a greedy selection strategy, and further demonstrates the rank awareness and rank blindness for a large class of joint sparse recovery algorithms. That work proves that some of the most popular algorithms, including orthogonal matching pursuit and mixed norm minimization, are effectively rank blind. More recent rank-aware methods include Simultaneous Normalized Iterative Hard Thresholding (SNIHT) [6] and Rank Aware Row Thresholding (RART) [7]. Compared to the greedy approach, the integration of rank information into regularization-based algorithms seems an open problem. To date, all existing rank-aware algorithms belong to the greedy class.
Applications. The joint sparse recovery problem finds applications in various fields, including data analytics and imaging [34, 24], uncertainty quantification [17], and parameter identification of dynamical systems [44]. Here, we will illustrate one particular area of our interest in detail, which is the Column Subset Selection Problem (CSSP), also known as the feature selection or dictionary selection problem. Suppose we have a set of data columns $Y$ and a dictionary of features $A$ (here $N$ represents the number of features in the dictionary and $m$ is the size of the features), and we want to extract a subset of $s$ columns of $A$ such that $Y$ can be represented in terms of only those $s$ columns of $A$. This problem exactly translates to the joint sparse recovery problem $AX = Y$, where $X$ is the matrix of coefficients and the non-zero rows of $X$ correspond to the columns that have been selected. In the feature selection problem, the data is also the dictionary (i.e., $A = Y$) and the goal is to select the most representative columns (features) of the data matrix. Thus, we aim to solve the problem
(7) $\min_{X} \|X\|_{2,0} \quad \text{subject to} \quad Y X = Y.$
This problem has garnered extensive attention in machine learning, where it was often relaxed using the mixed $\ell_{2,1}$ norm [34, 38, 35, 37]. This has also led to various matrix decomposition techniques such as the CUR decomposition [26, 8]. Demonstrations of our $ow\ell_{2,1}$ regularization on this problem on some bioinformatics datasets will be shown in Section 7.
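To make the feature-selection instance concrete, the following sketch (ours; `solve_mmv` is a hypothetical placeholder for any row-sparse MMV solver, e.g., one for problem (4)) shows how the data matrix serves as both dictionary and observations and how features are ranked afterwards:

```python
import numpy as np

def select_features(Y, solve_mmv, num_features):
    """Rank the columns of the data matrix Y by solving the self-representation problem Y X = Y."""
    X = solve_mmv(Y, Y)                              # dictionary A = Y, observations = Y; X is row-sparse
    scores = np.linalg.norm(X, axis=1)               # importance of each feature = norm of its coefficient row
    return np.argsort(scores)[::-1][:num_features]   # indices of the most representative columns of Y
```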
Proximal algorithms. To our knowledge, this is the first paper that provides an iterative solution method for the problem (4) aside from the initial work [42]. The methods we outline in Section 6 will be based on standard proximal gradient based methods for least squares regression with convex regularizers [3, 9], and for more general non-smooth, convex optimization that encompasses joint sparse regression [11, 23, 12, 45, 18]. In order to incorporate the nonconvex, non-smooth aspects of the $ow\ell_{2,1}$ regularizer, we employ a first order convex approximation through a model function in every step of the method. This allows us to write the new iteration with the help of a proximal mapping that is simple to calculate. Convex model functions are also used in related methods for nonsmooth nonconvex optimization [40, 39] or Trust-Region methods [13]. Our method employs a specialized variable metric to achieve simple and efficient iterations, which requires a dedicated analysis and leads to a global convergence result.
As already pointed out before, for the scalar case $K = 1$ we obtain the $\ell_1/\ell_2$ regularizer, and optimization methods for it have been proposed in [56, 43, 53, 55, 46, 57]. For $K = 1$, the proximal method in this manuscript is similar to the one proposed in [53, Algorithm 2] for the equality constrained formulation (3) and subsequently analyzed and generalized to include the noisy constrained formulation (5) in [57]. In this paper, we focus on the noisy regularized formulation (4), which can also be used to approach the solutions of the equality constrained problem and of the noisy constrained problem by an appropriate choice of the regularization parameter; see Section 6.3. While it is easy to specialize the method proposed in this manuscript to $K = 1$ (see Remark 6.5), the generalization to larger $K$ is not obvious and requires additional effort.
Finally, since the $ow\ell_{2,1}$ functional is discontinuous across different ranks, we cannot rely on gradient descent alone unless we know or can guess the optimal rank ahead of time [42]. As a remedy, we propose an optional Lipschitz-continuous relaxation of the problem depending on a parameter $\varepsilon > 0$, which we show converges to the original problem in the sense of $\Gamma$-convergence [15].
3. Rank-awareness
Denote for any
When the columns of are linearly independent (), we abuse the notation and write . It is worth noting that the set of matrices with orthonormal columns is often referred to as the Stiefel manifold; it is a key object in differential geometry and optimization, with numerous applications in machine learning and data analysis.
The following theorem [32, 16] provides a necessary and sufficient condition for the exact recovery of sparse matrices from the minimization problem (2):
Theorem 3.1.
Given a matrix $A$, every $s$-row sparse matrix can be exactly recovered from (2) if and only if $A$ satisfies the null space property (NSP):
$\|v_S\|_1 < \|v_{S^c}\|_1 \quad \text{for all } v \in \ker(A) \setminus \{0\} \text{ and all index sets } S \text{ with } |S| \le s,$
where $v_S$ is the vector that we get after nullifying all the entries of $v$ except the ones with indexes in $S$. Furthermore, if there exists an $h \in \ker(A) \setminus \{0\}$ such that
$\|h_S\|_1 \ge \|h_{S^c}\|_1 \quad \text{for some index set } S \text{ with } |S| \le s,$
then, for any $K$, there is an $s$-row sparse matrix which cannot be recovered from (2).
The following theorem (see [42] for details) highlights the difference in behavior between the $\ell_{2,1}$ norm regularizer and the $\ell_{2,0}$ penalty in (1): the former is not rank-aware, while the latter is rank-aware. The smallest number of linearly dependent columns in a matrix $A$ is referred to as the spark of $A$ and is denoted by $\operatorname{spark}(A)$.
Theorem 3.2.
The following statements are equivalent:
-
(1)
,
-
(2)
For every -row sparse matrix , is the unique solution of in the set .
-
(3)
For every -row sparse matrix , is the unique solution of the minimization problem (1) where .
In the above theorem, the least restrictive condition occurs when the rank of the row-sparse matrix is equal to its sparsity, i.e., $\operatorname{rank}(X) = s$. In fact, in practical applications, the row-sparse matrix in the joint sparse recovery problem is often of maximum rank, i.e., it is $s$-row sparse and of rank $s$. In such a scenario, the condition specified in Theorem 3.2 simplifies to $\operatorname{spark}(A) > s + 1$. Hence, this assumption is consistently made throughout the manuscript.
Theorem 3.3.
Given a matrix $A$ with the property $\operatorname{spark}(A) > s + 1$, let $\bar{X}$ represent an $s$-row sparse matrix of rank $s$, and define $Y = A\bar{X}$. Then $\bar{X}$ is the unique solution to (3).
Proof.
Corollary 3.4.
The $ow\ell_{2,1}$ regularization in (3) is rank aware.
Proof.
For a given matrix , if and , it is not necessarily guaranteed that the solution to the joint sparse recovery problem exists or is unique. In this context, Section 4 introduces a condition named spherical section property to ensure the existence of the global solution.
Earlier, we discussed that when the rank is equal to one, the $ow\ell_{2,1}$ functional reduces to $\ell_1/\ell_2$. Below we consider examples in which $\ell_1/\ell_2$ fails to recover a single vector, but if we add another vector along with the corresponding measurements, then we successfully recover both.
Example 1.
Consider a matrix and , where
Then the measurement matrix is given by . A solution of has the form . For such we have
For , the functional assumes a local minimum with value at zero, but the global infimum is given by one for . However, then which is not sufficient for recovery of 2-sparse vectors with any method, since there exists a 1-sparse vector in the kernel. For we have and attains minimum at , different from the true solution at (note, however, that the sparsest solution is at , and the true solution is not the sparsest here). Yet, by adding another column to the matrix to make it 2-row sparse with rank two,
the problem (3) has a unique solution that is equal to . This follows from the fact that and . By applying Theorem 3.3, we can conclude that the true solution is recovered.
The next example, in a similar spirit, demonstrates that the functional can even fail to have a minimizer for large , if the rank is smaller than the sparsity.
Example 2.
For a , let
Any solution of has the form . For such , we have
The critical points in terms of are , , and with local minimum at the latter. However, the local minimum at zero is only a global one for , due to , and for the infimum is approached for .
4. Existence of a global minimizer
Here we turn to the question of showing that the global minimizers of the noise free (3) and the noisy unconstrained (4) problems indeed exist under certain conditions on . They do not have to exist in general.
To illustrate this fact, let us consider for the moment the case $K = 1$, where the $ow\ell_{2,1}$ functional is identical to the $\ell_1/\ell_2$ functional (6). For the existence of the global minimum of the cost function with the $\ell_1/\ell_2$ regularizer, it is often required that the measurement matrix satisfies the $s$-spherical section property. An example can be found in [57, Theorem 3.4]. In this section we aim to extend these results to the joint sparse recovery problem with the $ow\ell_{2,1}$ regularizer. The failure of minimizers to exist without appropriate assumptions has been illustrated in the previous section.
Definition 4.1.
The rank- singularity set of the matrix is defined as
This set contains all matrices of rank where some linear combination of the columns is in the kernel of . Notice that the rank- singularity set for is the nontrivial part of the kernel: For we have . This justifies the following definition as an extension of the -spherical section property (cf. [52, 57]).
Definition 4.2 (-spherical section property).
Let matrix . We say that satisfies the -spherical section property or -SSP if, for all ,
Lemma 4.3.
The following infimum is attainable
Moreover, for , .
Proof.
Let and consider compact singular value decomposition, where , with where is the rank of . Here is the diagonal matrix of nonzero singular values. Note that , therefore
and thus . We define the set
where is the identity matrix.
By the previous argument, we know that for any there exists a with such that
Thus, we have
Moreover, the set is closed and bounded, and thus there exists a for some such that
Now, let . From Lemma A.2, . Moreover, from the same lemma, if and only if . But , therefore there exists a non-zero linear combination of the columns of that is in the kernel of . Such vector will have to be -sparse as a linear combination of -sparse vectors with the same sparsity pattern. But there cannot be any -sparse vector in the kernel of otherwise the conditions of Theorem 3.2 will be violated (we can construct an -sparse rank matrix that cannot be recovered from the measurements). Thus, for any , . Since the infimum is attainable, we have that . ∎
Example 3.
To illustrate the advantage of larger in a specific instance, consider . Let
for . Notice that,
and
when . Here we used the fact that is the norm of the matrix with any orthonormal basis in the space of the columns of the matrix . So, in this case, it is simply the functional on the first column. Therefore, the matrix does not fulfill the -SSP when . However, when , thus the -SSP is satisfied due to Lemma 4.3. When , then it is not -SSP, which was expected. For
we have .
4.1. The noiseless constrained problem
Lemma 4.4.
Let with and , where . Assume there exists an unbounded minimizing sequence of (3). Then
Proof.
Let be minimizing sequence. From the proof of Proposition 1 in [42], implies that preserves the rank of any -row sparse matrix. Thus .
Without loss of generality (by switching to a subsequence), we may assume that all the have the same rank. Assuming the compact SVD of is , we have Moreover, along a subsequence of the largest singular value of diverges and the corresponding left and right singular factors and converge due to compactness by switching to another subsequence. Let be the limit of the resulting subsequence of . has the same rank as the matrices in the subsequence (from the fact that and are orthogonal) and since , we deduce that .
Next, we show that . Let be the right singular vector and be the left singular vector corresponding to the largest singular value of . Note that
where is the right singular vector and be the left singular vector corresponding to the largest singular value of . On the other hand
since is bounded. Note that and . Therefore . Using Proposition A.3,
Proposition 4.5.
Proof.
Let
Note that . We have (the singular values of are all equal to 1), therefore,
For every , we have
This, combined with Lemma 4.4, implies that there exists no unbounded minimizing sequence of (4). Without loss of generality, by passing to a subsequence if necessary, we can assume . From Corollary A.3, therefore, is the global minimum of (3), completing the proof. ∎
Corollary 4.6.
4.2. The noisy regularized problem
Next, we turn to the question of showing that the minimizers of the problem (4) indeed exist under certain conditions on .
Lemma 4.7.
Let with and , where . Assume and are sufficiently small and assume there exists an unbounded minimizing sequence of (4). Then
Proof.
Let be minimizing sequence for the right side. For large enough,
thus
This implies that for , small enough. Indeed, implies that preserves the rank of any -row-sparse matrix (see Proposition 1 in [42]). If is small enough then . And when both and are small enough, . On the other hand .
Like in the proof of Lemma 4.4, without loss of generality, all the have the same rank and converge to a . Therefore,
Proposition 4.8.
Let with and . Assume, and satisfies the -SSP or, equivalently,
Then, for sufficiently small and , any (globally) minimizing sequence of (4) is bounded and the set of optimal solutions of (4) is nonempty.
Proof.
Since , we have that . We have , therefore,
Thus, we have
when is sufficiently small. This, combined with Lemma 4.4, implies that there exists no unbounded minimizing sequence of (4). Without loss of generality, by passing to a subsequence if necessary, we can assume . Then and, from Corollary A.3, . Therefore, is the global minimum of (4), completing the proof. ∎
5. The approximate problem
The $ow\ell_{2,1}$ functional is highly discontinuous (recall that this is a necessary condition for rank-awareness), especially in the case of large $K$, and not directly amenable to computations, unless it is restricted to the submanifold of matrices of given rank. In the noise free case, we can usually deduce the appropriate rank from the data $Y$, since $\operatorname{rank}(Y) = \operatorname{rank}(X)$ is a necessary condition for recovery. Using a decomposition $Y = \hat{Y} T$ with $\hat{Y}$ of full column rank, we can then solve the problem with data $\hat{Y}$ for the solution $\hat{X}$ and obtain a reconstruction of $X$ as $\hat{X} T$; see [42].
In the noisy case, we do not know from the data what the rank of $X$ should be, since usually we would expect the noise to have full rank. If we preprocess the data via its SVD, we can still ensure that the data has full rank. However, in the presence of noise, a number of small singular values may potentially be associated to just noise, and an arbitrary cutoff for the small singular values can be difficult to determine in practice ahead of solving the problem. Additionally, even when we happen to guess the correct rank, an iterative solution algorithm can run into a singularity during one of the iterations. This challenge can be addressed by carefully selecting the step-size; however, this may still yield an ill-conditioned matrix, potentially causing the updates to stagnate. A proper relaxation of the problem may take care of this difficulty.
For this and other reasons to be discussed later, we consider a generalization $\Omega_\varepsilon$ of the $ow\ell_{2,1}$ functional, depending on a parameter $\varepsilon \in [0, 1]$, and a generalized version of problem (4):
(8) $\min_{X} \; \tfrac{1}{2}\|AX - Y\|_F^2 + \lambda\, \Omega_\varepsilon(X).$
The numbers $\lambda > 0$ and $\varepsilon$ are the hyper-parameters of this problem. In particular, $\varepsilon$ controls the convexity of the problem, where $\varepsilon = 1$ reduces it to the convex $\ell_{2,1}$ norm regularized problem and $\varepsilon = 0$ corresponds directly to the $ow\ell_{2,1}$ regularization. For $\varepsilon \in (0, 1)$, we obtain a new “relaxed” optimization problem.
As a first verification of the appropriateness of this relaxed problem, we will show that the (global) solutions of this problem converge to (global) solutions of the problem in (4) as $\varepsilon \to 0$ (Proposition 5.1). Let
$J_\varepsilon(X) = \tfrac{1}{2}\|AX - Y\|_F^2 + \lambda\, \Omega_\varepsilon(X)$
be the associated objective functional. As a corollary of Proposition A.5 on the convergence of $\Omega_\varepsilon$ we obtain the following:
Proposition 5.1.
Any cluster point of minimizers of $J_\varepsilon$ as $\varepsilon \to 0$ is a minimizer of (4).
An additional advantage of the relaxed problem, which will turn out to be useful for numerical optimization, is that its minimizers can always be characterized by first-order conditions with the help of the Clarke subdifferential. It is defined for any locally Lipschitz function; see [10, Chapter 10]. According to Proposition A.4 we have:
Proposition 5.2.
$\Omega_\varepsilon$ is locally Lipschitz continuous for any $\varepsilon > 0$. For $\varepsilon = 0$ it is only locally Lipschitz continuous on neighborhoods of points $X$ where $X^\top X$ is invertible.
For the following, we introduce the matrix and note that . When either or is invertible, the matrix is symmetric positive definite and thus .
The version of Fermat’s rule for the Clarke subdifferential states that [10, Exercise 10.7], at a local minimum of (8), it holds
(9) $0 \in A^\top(AX - Y) + \lambda\, \partial_C \Omega_\varepsilon(X).$
The additive property does not always hold for the Clarke subdifferential, only an inclusion, but it does in this case due to [10, Exercise 10.16]. To characterize the subdifferential, we provide a convex approximation to $\Omega_\varepsilon$ at a given reference point. This is also known as a model function in nonconvex, nonsmooth optimization; cf., e.g., [40, 39].
Definition 5.3.
Let be given with corresponding invertible and . The convex model function is defined as
(10) |
With a characterization of the Clarke subdifferential of the above optimality condition can be equivalently stated as follows:
Theorem 5.4.
Let $X$ be a local minimum of (8), where either $\varepsilon > 0$ or $X^\top X$ is invertible. Then, in terms of the Clarke subdifferential, the following first order necessary conditions hold:
where is the -th row of and .
6. A proximal gradient based algorithm
We assume first that $\lambda$ and $\varepsilon$ are given, and solve the problem (8). We first derive the idea of the algorithm using a proximal Lagrange method. Then, we will provide a convergence analysis using the model function (10), and then we will briefly discuss the adaptive selection of the parameters $\lambda$ and $\varepsilon$.
We note that in the case $\varepsilon = 0$, the algorithm may run into cases where $X^\top X$ is not invertible, which corresponds to a discontinuity in $\Omega$. We will disregard that possibility for the moment and discuss mitigation strategies in the context of stepsize selection. In general, if the sought-after solution of the problem does not have an invertible $X^\top X$, we will tacitly assume that $\varepsilon > 0$ and rely on Proposition 5.1.
6.1. Proximal Lagrange derivation
To derive an algorithm, we first aim to isolate the nonsmooth and nonconvex parts of the functional and rewrite the problem (8) in the constrained form
(11) |
where we have introduced a separate variable for the weight . Note that, similar to the model function (10) postulated above, the functional now contains a weighted convex group sparse norm, if is fixed. To solve the problem above, we consider the Lagrange function
which is split into a part that is convex in and smooth in and a second part that collects the remaining differentiable parts of the Lagrange function . The factor is added to provide a convenient scaling of the multiplier in the subsequent analysis.
Additionally, we consider the Hilbert space on induced by the inner product of the form , . We note that it also induces an inner product for matrices . The idea essential to our strategy is to compute gradients with respect to in this inner product. Given a function , denotes the gradient of in the -Hilbert space, i.e. it is the matrix for which the directional derivative can be written as
Also, we define the -proximal operator as
(12) |
Since the choice of the inner product agrees with the vector norm used in the definition of , it can be checked that the closed form solution is given by
Note that this analytic representation would not be possible for the standard inner product.
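As an illustration (our own sketch, assuming that the weighted group-sparse prox reduces to row-wise shrinkage in the corresponding $W$-norm, as indicated above; the exact threshold used in Algorithm 1 is determined by (12)), such a closed form can be implemented as follows:

```python
import numpy as np

def prox_weighted_l21(X, W, tau):
    """Row-wise shrinkage in the W-norm: each row x_i is scaled by max(0, 1 - tau / ||x_i||_W),
    where ||x||_W = sqrt(x W x^T) and W is symmetric positive definite."""
    row_norms = np.sqrt(np.einsum('ij,jk,ik->i', X, W, X))    # ||x_i||_W for every row of X
    scale = np.zeros_like(row_norms)
    nz = row_norms > 0
    scale[nz] = np.maximum(0.0, 1.0 - tau / row_norms[nz])
    return X * scale[:, None]
```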
The main steps of the algorithm, given a previous iterate, are now derived as follows: First, we compute the update for the multiplier as a solution to the Lagrange stationarity condition. We observe that
and the equation can be uniquely solved for the multiplier as given in step 4 of Algorithm 1 below; cf. Definition 5.3.
Second, with given , and , we select a stepsize and update , where is the -gradient of the smooth part of the Lagrange function given as
In Subsection 6.2 we introduce and discuss an Armijo line search procedure for finding the step size.
Finally, we need to update , where we simply set , which solves exactly the constraint. Note that due to this choice, can be eliminated as an explicit iteration variable and only needs to be provided to initialize the algorithm.
Our proposed iterative scheme for solving problem (11) is summarized in the following steps:
We note that for $\varepsilon = 1$ the weight reduces to the identity, and we obtain the classical proximal gradient method for group sparse ($\ell_{2,1}$ regularized) least squares regression. Note also that this extension's main additional effort is the computation of the inverse of a $K \times K$ matrix, which is cheap for moderately sized $K$ compared to the problem dimension $N$.
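To convey the overall structure of one iteration, here is a simplified sketch (entirely our own; it is not a transcription of Algorithm 1, whose weight and multiplier updates are given in the text — in particular, the specific metric $W = (X^\top X + \varepsilon I)^{-1}$ below is an assumption made for illustration only):

```python
import numpy as np

def ow_prox_grad_step(X, A, Y, lam, tau, eps=1e-3):
    """One illustrative variable-metric proximal gradient step for (8).
    Assumptions (ours, for illustration): metric W = (X^T X + eps*I)^{-1}, gradient of the
    data fit taken in the W-weighted row inner product, followed by row-wise W-norm shrinkage."""
    K = X.shape[1]
    M = X.T @ X + eps * np.eye(K)                 # K x K matrix; cheap to invert for moderate K
    W = np.linalg.inv(M)
    G = A.T @ (A @ X - Y) @ M                     # W-gradient of 0.5*||AX - Y||_F^2, i.e., A^T(AX - Y) W^{-1}
    Z = X - tau * G                               # gradient step in the variable metric
    row_norms = np.sqrt(np.einsum('ij,jk,ik->i', Z, W, Z))
    scale = np.maximum(0.0, 1.0 - tau * lam / np.maximum(row_norms, 1e-15))
    return Z * scale[:, None]                     # weighted row-wise shrinkage (group soft-thresholding)
```

The sketch highlights why the per-iteration cost is dominated by a $K \times K$ inverse plus a standard proximal gradient step.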
Remark 6.1.
For the case of large $K$, the effort of direct computation of the inverse can become prohibitive. If $Y$ is approximately low rank, this issue can be alleviated by decomposing $Y = \hat{Y} T$, where $\hat{Y}$ has full column rank and $T$ has full row rank; then the optimization problem reduces to a cheaper one where the data matrix is replaced by $\hat{Y}$ (see Section 7.2 for examples). Generally, an efficient way to compute the inverse can be given when the row sparsity and the rank of the iterate are small: First, we decompose the iterate with the help of the selection of columns of the identity matrix corresponding to its support, and the remaining factors (one of them orthogonal) can be computed, e.g., with a QR-decomposition. If we now modify the algorithm to not explicitly store the full iterate, but only its factors, the complexity can be reduced. Note that the weight matrix is also in low-rank format as the sum of rank-one products. In this context, it is critical to ensure that the support size remains small from initialization throughout the iterations. We postpone a detailed investigation of this to future work.
6.2. Step size selection for descent
In this section, we demonstrate that the iteration steps in the proposed algorithm result in a decrease of the loss function and that, therefore, the algorithm is convergent if the step size is chosen appropriately.
Let be given, which can be thought of as a given (old) iterate . Assume or and is invertible. Based on the model function from Definition 5.3, the following function will serve as a model function for the objective function :
where
We will use this model as an upper bound of the objective, where the constant has to account for the neglected terms in the regularizer as well as in the quadratic fitting term. The new iterate can then be interpreted as the minimizer of the model function for an appropriate choice of the step size and constant. Due to the complicated structure of the nonlinearity in the regularizer we will not give this constant analytically, but identify it computationally.
Proposition 6.2.
Let or be invertible. Then has the following properties
-
(1)
,
-
(2)
for all with some universal and only dependent on ,
-
(3)
the convex function is minimized (uniquely) at
Proof.
Corollary 6.3.
If is a local minimum of then, for any , is the unique minimizer of , and
where .
Proof.
Remark 6.4.
Remark 6.5.
Let us briefly discuss the case $K = 1$ and $\varepsilon = 0$, which reduces to the $\ell_1/\ell_2$ regularization. Here, the weight is a scalar, and the model function, up to a fixed additive constant and a multiplicative constant, is given as
with . Replacing with the indicator function of the constraint , as is appropriate for solving formulation (3) compared to the regularized version (4), and setting and we obtain exactly [53, Algorithm 2] (see also [57]).
We now return to the iterative procedure outlined in Algorithm 1, which generates the sequence . We will show that the method is a descent method, i.e., . Note that this implies that
(13) |
For the following analysis, we need to ensure that the iterates remain bounded, which unfortunately does not directly follow from this inequality due to the fact that $A$ may have a nontrivial kernel and the regularizer is bounded (see Proposition A.4). Similar to Proposition 4.8, a uniform bound could be deduced under appropriate assumptions on the problem data and parameters using the $s$-SSP. For simplicity, from now on we simply assume that the sequence remains bounded:
(14) |
This will allow us to bound certain constants uniformly, in particular .
In order to show that descent can be fulfilled in every iteration as in Proposition 6.2, we have to replace the unknown constant there with a step size. In practice, since this constant depends on the linearization point and the size of the update, we select it by an adaptive Armijo line-search procedure described below. Given a candidate step size, we define the candidate next iteration as
(15) |
where is defined as before as . We define the “predicted decrease” associated to a candidate stepsize as
The predicted decrease serves as an optimistic estimate for how much descent in the functional can be expected in the current iteration, and can be used as an indicator for stationarity of the iterate, as characterized by the next proposition.
Proposition 6.6.
For any it holds
and, thus,
(16) |
Proof.
To determine an appropriate step size, we provide a (large) initial guess and, in each iteration, take the largest candidate step size, for some line-search reduction parameter, that achieves a fraction of the predicted decrease, i.e., such that it holds
(17) |
where is an algorithmic constant. By the next result, for some chosen by the Armijo line search the condition in (17) will be satisfied.
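A generic backtracking loop enforcing a condition of the form (17) might look as follows (our sketch; `trial_step` stands for the update (15), `predicted_decrease` for the quantity defined above, and the default parameter values are illustrative):

```python
def armijo_step(J, X, trial_step, predicted_decrease, tau0=1.0, beta=0.5, sigma=0.5, max_tries=30):
    """Backtracking line search: shrink the candidate step size until the actual decrease of the
    objective J reaches the fraction sigma of the predicted decrease (a condition of the form (17))."""
    tau = tau0
    for _ in range(max_tries):
        X_new = trial_step(X, tau)                        # candidate iterate, cf. (15)
        if J(X) - J(X_new) >= sigma * predicted_decrease(X, tau):
            return X_new, tau
        tau *= beta                                       # reduce the step size and try again
    return X, 0.0                                         # no acceptable step found
```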
Proposition 6.7.
Proof.
In order to apply part 2 of Proposition 6.2, we first need a bound on the difference . By a standard estimate for the Prox-operator and the triangle inequality,
Furthermore as in Remark A.7,
and
By (13) and (14), both of these quantities are bounded by a universal constant, and thus there is a such that . Thus, for small enough, by Proposition 6.2 we know that there exists a constant (dependent only on ), such that
The second term is negative due to Proposition 6.6. The third term is negative for ∎
Proposition 6.8.
Consider started from a given (with either or invertible) by from (15) and chosen according to (17). Finally, assume that (14) is fulfilled. Then the following holds:
-
(1)
The descent in every iteration can be estimated by
-
(2)
.
-
(3)
For any accumulation point of sequence (with either or invertible) the first order necessary condition is fulfilled.
Proof.
We apply Proposition 6.7 for part 1. From this, we derive that for some , since this is a decreasing bounded sequence and also (13) is fulfilled. For part 2, we note that by construction of the Armijo stepsize and Proposition 6.7, we know that . Moreover,
Thus, we obtain
Let now be any accumulation point of and an accumulation point of along the same subsequence and for invertible . Along this subsequence we have due to uniform invertibility of all and and thus
where by continuity of all expressions (using Proposition A.9 and Proposition A.10). Thus, with Remark 6.4 the stationarity conditions are fulfilled. ∎
Remark 6.9.
According to Proposition 6.7, the predicted decrease provides both an upper and a lower bound on the decrease after each successful iteration. Scaling it by the stepsize to avoid bias (due to small steps), an appropriate stopping criterion is given by
(18) |
where is a relative tolerance, and some reference value of the functional, e.g., . Moreover, due to (16) this provides an estimate on the weighted Prox-residual , which characterizes stationarity.
6.3. Adaptive choice of the regularization parameter
In practice, the value of $\lambda$ has to be chosen appropriately, to balance a good fit of the data with the desire for robustness to noise and sparsity. This is studied extensively in regularization theory for inverse problems; see, e.g., [20]. In this manuscript, we rely on the (Morozov) discrepancy principle, which presumes that a noise level $\delta$ is known, and $\lambda$ is adjusted until an (approximate) solution of (11) fulfills
(19) $\delta \le \|AX - Y\|_F \le c\,\delta$
for some $c > 1$. To achieve this, we use the following procedure:
- (1) For the current value of $\lambda$, compute an (approximate) solution of (11) with Algorithm 1, warm-started at the previously computed solution.
-
(2)
If (19) is violated, update $\lambda$ (decrease $\lambda$ if the fit is too large, increase it otherwise) and go back to step 1.
Let us compare this to the constrained formulation (5): First, we take into account that the inequality constraint is generally fulfilled with equality at optimal solutions, since getting a better fit usually requires increasing the regularization term. This is indeed the case for $\varepsilon = 1$, but not necessarily for $\varepsilon = 0$, and we will discuss this further below. Second, $\lambda$ can be interpreted as a Lagrange multiplier for this constraint. Thus, the difference to (5) consists in replacing the hard constraint by the fuzzy version (19), which is known not to affect recovery guarantees for convex problems, since usually $\delta$ is not known exactly anyway. For efficiency purposes, experience from convex regularization suggests initializing $\lambda$ large (to obtain a highly sparse approximation initially), and only decreasing it if (19) is violated, and we employ this as well. We leave a detailed analysis and algorithmic refinements to future work.
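A possible realization of the procedure above (our own sketch; `solve_reg` is a placeholder for Algorithm 1 with a given regularization parameter, and the update factor and bracket around the noise level are illustrative choices):

```python
import numpy as np

def discrepancy_loop(solve_reg, A, Y, delta, lam0=1.0, factor=2.0,
                     lower=1.0, upper=1.5, max_outer=30, X0=None):
    """Adjust lambda until the residual ||AX - Y||_F lies in [lower*delta, upper*delta]."""
    lam, X = lam0, X0
    for _ in range(max_outer):
        X = solve_reg(A, Y, lam, X_init=X)        # Algorithm 1 with the current lambda, warm-started
        res = np.linalg.norm(A @ X - Y)
        if res > upper * delta:                   # fit too loose: weaken the regularization
            lam /= factor
        elif res < lower * delta:                 # fit too tight: strengthen the regularization
            lam *= factor
        else:
            break
    return X, lam
```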
6.3.1. Discrepancy principle for the $ow\ell_{2,1}$ functional
The choice $\varepsilon = 0$ does not always fit the discrepancy principle (19), and in some cases we can only enforce the upper bound. This is due to the fact that, unless $\varepsilon > 0$, a change in the variable that is allowed to increase the discrepancy does not necessarily allow reducing the regularization term by moving the reconstruction closer to zero. Note that
with equality for and strict inequality for and . The latter property allows to directly show that any solution of (5) will always fulfill for .
To illustrate the issue for in more detail, consider the case of matrices with , and the optimality conditions from Theorem 5.4. In that case, for the set of vectors with the vectors form an orthogonal basis of and
Thus, by the first condition in Theorem 5.4, any optimal solution of (4) fulfills for all with . For restricted to its sparsity pattern , , such that with of full rank and a selection of columns of the identity matrix, we can suppose that is invertible (which is necessary for recovery). The set of equations derived above can be written as , which can always be uniquely solved.
In particular, consider a situation with -sparse full rank and noisy data with . Now, consider that part of the noise can be fitted on the same support as , and that in general there exists a -sparse full rank such that . In fact, let be the minimizer of the discrepancy on the sparsity pattern of . Note that this minimizer exactly fulfills . Then, clearly is a solution of (5) for all . Let be the maximum of over all . Clearly, for all , the also fulfills the optimality conditions from Theorem 5.4. In this situation, Algorithm 1, which exclusively operates on rank- matrices for , will tend to converge towards , regardless of the value of , since the only way to decrease the regularization term below is to decrease the rank. Here, if , the lower bound in condition (19) can not be satisfied by increasing . In fact, such a situation always occurs for , where can be uniquely solved for and .
To guard against this situation, we add an additional rule that allows termination of the algorithm if the upper bound in (19) is valid, has been previously increased, and no change in has occurred.
6.4. Continuation for relaxation parameter
In the previous parts of this section, $\varepsilon$ was considered fixed. Applying the described algorithms directly for $\varepsilon = 0$ and solving the original problem (4) has several downsides:
-
•
Due to the high degree of nonconvexity of the regularizer, the solution will depend on the initialization. This, aside from causing problems with identifying a global minimum, also requires the initial point to be of full rank, which is not efficient when $K$ is large.
-
•
If the optimal solution does not have full rank, the method will not be able to identify it for $\varepsilon = 0$, since Algorithm 1 produces full rank iterates only and the $ow\ell_{2,1}$ functional is discontinuous across different ranks. This has to be remedied by guessing the appropriate rank ahead of time, and reducing to this rank by a data reduction as discussed at the beginning of Section 5.
In this section, we briefly sketch a strategy that starts with $\varepsilon = 1$ and then successively reduces $\varepsilon$. Here, we initially solve the convex problem with $\Omega_\varepsilon$ replaced by the $\ell_{2,1}$ norm and then follow the path of solutions for decreasing $\varepsilon$ to finally obtain a solution fulfilling (19).
As a proof of concept, we employ a simple strategy, using a fixed decreasing sequence of values $\varepsilon_k$, $k = 0, 1, \dots$. We start with $\varepsilon_0 = 1$ and use the method from Section 6.3 to solve the convex problem and the discrepancy principle. Then, we increment $k$, and continue iterating Algorithm 1 until the termination conditions from the previous section are fulfilled again. We leave refinements of this naive approach (cf., e.g., [1]) to future work.
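The continuation can be sketched as follows (ours; `run_inner` stands for the inner solver with discrepancy-based choice of the regularization parameter as in Section 6.3, and the geometric schedule is an illustrative choice):

```python
def continuation_in_eps(run_inner, A, Y, delta, num_steps=10, q=0.5):
    """Solve the relaxed problem for a decreasing sequence of relaxation parameters,
    warm-starting each solve at the previous solution."""
    X, lam, eps = None, None, 1.0                 # start from the convex l_{2,1} problem
    for _ in range(num_steps):
        X, lam = run_inner(A, Y, delta, eps, X_init=X, lam_init=lam)
        eps *= q                                  # geometric reduction of the relaxation parameter
    return X
```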
7. Experiments
7.1. Synthetic data
We evaluate the performance of our method against other recent MMV techniques, including subspace-augmented MUSIC (SA-MUSIC) [33], simultaneous normalized iterative hard thresholding (SNIHT) [6], and rank-aware orthogonal recursive matching pursuit (RA-ORMP) [16]. The codes of these baseline methods were obtained from the repository https://github.com/AdriBesson/joint_sparse_algorithms. We also include in our evaluation the spg_mmv method, which solves the standard $\ell_{2,1}$ problem in constrained form and whose code is acquired from the SPGL1 package [49, 51]. Our codes and data for reproducing results are available at https://github.com/a-petr/OWL.
We use experiment settings similar to those in [5]. We consider a Gaussian random measurement matrix , with and being fixed to . The signal matrix is built as a random matrix, with , where is the row sparsity. We conduct random trials of the algorithms for each experiment and compute the rate of successful support recovery. We compare our method with these counterparts in two tests: fixed rank () for a number of measurements ranging between and , and fixed number of measurements () for a rank varying between and .
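For reference, a sketch of the synthetic setup described above (ours; the interface and normalization are illustrative, and the concrete dimensions used in the experiments are those reported in the text and the accompanying repository):

```python
import numpy as np

def make_mmv_instance(m, N, K, s, r, noise_std=0.0, seed=0):
    """Gaussian measurement matrix and an s-row sparse signal matrix whose rank is r <= min(s, K)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, N)) / np.sqrt(m)
    support = rng.choice(N, size=s, replace=False)
    X = np.zeros((N, K))
    X[support] = rng.standard_normal((s, r)) @ rng.standard_normal((r, K))   # rank-r factor model
    Y = A @ X + noise_std * rng.standard_normal((m, K))
    return A, X, Y, support
```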
We first consider the noiseless scenario, i.e., observation $Y = AX$. In Figure 1, our proposed approach is second to RA-ORMP in recovery probability in both the fixed number of measurements (left) and fixed rank (right) tests, followed by SA-MUSIC and SNIHT. The experiment shows that these four methods are rank-aware, because increasing the rank of the signal matrix leads to an improvement of the recovery rate, while $\ell_{2,1}$ regularization with the spg_mmv solver does not benefit from the change of the rank. For $ow\ell_{2,1}$ regularization, the recovery rate grows fast when the rank of the signal matrix is larger than 3, reaching full recovery at the highest ranks. SA-MUSIC and SNIHT struggle to identify the correct support of the solutions, but when they are able to do so, they achieve a reconstruction error smaller than that of $ow\ell_{2,1}$ (Figure 2).
[Figure 1: probability of exact support recovery in the noiseless case; left: fixed number of measurements with varying signal rank, right: fixed rank with varying number of measurements.]
[Figure 2: reconstruction errors in the noiseless case.]
Next, we consider a more practical scenario where the observation contains some measurement noise, i.e., $Y = AX + E$. We assume the noise is normally distributed with a prescribed standard deviation, chosen so that the expected noise level is fixed. In Figure 3, we observe that $ow\ell_{2,1}$ achieves the best performance in recovery rate in both the fixed number of measurements (left) and fixed rank (right) tests. Our method reaches a high recovery rate already at moderate signal rank, while SA-MUSIC only achieves this at full signal rank and SNIHT cannot reach a comparable recovery rate. Interestingly, the best overall baseline in the noiseless scenario, RA-ORMP, suffers heavily from the presence of the noise. With fixed rank and varied number of measurements, $ow\ell_{2,1}$ also outperforms all others by a substantial margin. Our method has the best performance in root-mean-square error in the small rank and small number of measurements scenarios, where its recovery rate is superior (Figure 4). However, when the other methods catch up and can approximate the signal support well, they provide better reconstruction error, similar to what we have seen in the noiseless case.
To conclude, the numerical tests highlight that $ow\ell_{2,1}$ can efficiently exploit the rank of the signal matrix and achieve a very competitive recovery rate, with mild requirements on the number of measurements and signal rank. They also reveal that the success rate and signal error metrics may not always be in alignment; thus, one should consider the recovery goals, measurement budget, and signal noise to select a suitable method. Our approach excels in identifying the exact support of the row-sparse signals, and has the top performance when the signal matrices have low rank, making it a strong choice when those are the priority.
[Figure 3: probability of exact support recovery in the noisy case; left: fixed number of measurements, right: fixed rank.]
[Figure 4: root-mean-square reconstruction errors in the noisy case.]
7.2. Feature selection on biological data.
We evaluate the performance of our method on the feature selection problem for two publicly available biomedical informatics data sets. The first data set, WISCONSIN, is a collection of data about breast cancer tumors that were diagnosed as either malignant or benign [4]. The data set contains 714 samples, with each sample representing a breast cancer tumor. Each tumor is described by 30 features, including information about the size, shape, and texture of the tumor, etc. These features were computed from a digitized image of a fine needle aspirate of a breast mass. The second data set, LUNG_DISCRETE, is a collection of microarray data about lung carcinomas, containing samples in several classes, where each sample consists of 325 gene expressions [41]. In the feature selection problem, we aim to find an optimal subset of features (gene expressions) that can accurately represent the whole data set. This problem is a special instance of the MMV problem, where the observation matrix is identical with the measurement matrix (see Section 2). Here, we apply $ow\ell_{2,1}$ and compare it with other MMV techniques (SA-MUSIC, RA-ORMP and spg_mmv) in reconstructing the full data set, given different allowable numbers of features. SNIHT does not produce competitive results for this test, so it will be omitted from the comparison. For SA-MUSIC and RA-ORMP, the allowable number of features is enforced directly, while for the two regularization-based methods, we set an error tolerance a priori and solve for the smallest set of features that can reconstruct the full data set within this tolerance. We rank the feature importance via the norm of the respective row of the solution matrix and further prune out the spurious features whose importance scores are very small (below a small threshold). Finally, after the features are selected, we reconstruct the full data from this set by solving a regular least squares problem, and report the root-mean-square error between the approximated and ground truth data.
Note that the feature selection problem has $A = Y$ and thus often involves a large $K$. To accelerate the matrix inversion step in our algorithm, we employ an SVD technique to decompose the data and solve a reduced problem.
Here, the reduced data matrix is a low rank approximation of $Y$, formed by dropping the columns corresponding to singular values that are below a specified threshold. The reduced solution now has fewer columns, and our algorithm involves a matrix inversion of correspondingly smaller size, which is much faster when the effective rank is small compared to $K$. In the LUNG_DISCRETE problem, even though $Y$ has 325 columns, its effective rank is much smaller (there are only 72 singular values exceeding the threshold), and the reduction is substantial. We emphasize that this is only the most conservative choice, and a more aggressive thresholding strategy is possible. In fact, here we further reduce the size by linking the singular value cutoff with the predefined reconstruction error tolerance, as the low rank approximation can afford an error of the same scale as this tolerance. The final row-sparse solution is easily obtained by right multiplying the reduced solution with the second SVD factor. It is worth noting that the above strategy is general for feature selection problems, because the data matrices of these problems are inherently low-rank.
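The reduction can be sketched as follows (ours; `solve_reduced` is a placeholder for the MMV solver applied to the reduced data, and the cutoff rule is illustrative):

```python
import numpy as np

def reduce_and_solve(Y, solve_reduced, sv_cutoff):
    """Replace the data matrix by a low-rank factor before solving, then lift the solution back."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    r = int(np.sum(s > sv_cutoff))                # effective rank of the data
    Y_r = U[:, :r] * s[:r]                        # reduced data matrix with r columns
    T = Vt[:r, :]                                 # Y is approximately Y_r @ T
    X_r = solve_reduced(Y, Y_r)                   # solve the smaller MMV problem (dictionary = Y, data = Y_r)
    return X_r @ T                                # row-sparse solution for the full data matrix
```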
For the WISCONSIN data set (Figure 5, left), we observe that $ow\ell_{2,1}$ performs much better than the competitors in identifying the representative features. The reconstruction error of $ow\ell_{2,1}$ drops rapidly when the number of features is small and is markedly lower than those of the next best baselines. By contrast, the other methods generally require five more features to capture the data set as well as $ow\ell_{2,1}$. The performances of $ow\ell_{2,1}$ and SA-MUSIC eventually converge around 25 features, where the reconstruction error becomes negligible. This verifies that the feature set contains some degree of redundancy, and that the data set can still be accurately represented with some features removed. For the LUNG_DISCRETE data set (Figure 5, right), the performances of $ow\ell_{2,1}$, RA-ORMP and SA-MUSIC are close and better than spg_mmv for a small number of features. However, all the tested methods can select roughly optimal features that fully represent the data, despite the extensive feature set comprising 325 gene expressions.
[Figure 5: reconstruction error versus number of selected features; left: WISCONSIN data set, right: LUNG_DISCRETE data set.]
References
- [1] E. L. Allgower and K. Georg, Numerical continuation methods: an introduction, vol. 13, Springer Science & Business Media, 2012.
- [2] F. Andersson, M. Carlsson, and K.-M. Perfekt, Operator-Lipschitz estimates for the singular value functional calculus, Proceedings of the American Mathematical Society, 144 (2015), p. 1867–1875.
- [3] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
- [4] K. P. Bennett and O. L. Mangasarian, Robust linear programming discrimination of two linearly inseparable sets, Optimization Methods & Software, 1 (1992), pp. 23–34.
- [5] A. Besson, D. Perdios, Y. Wiaux, and J.-P. Thiran, Joint sparsity with partially known support and application to ultrasound imaging, IEEE Signal Processing Letters, 26 (2019), pp. 84–88.
- [6] J. D. Blanchard, M. Cermak, D. Hanle, and Y. Jing, Greedy algorithms for joint sparse recovery, IEEE Transactions on Signal Processing, 62 (2014), pp. 1694–1704.
- [7] J. D. Blanchard, C. Leedy, and Y. Wu, On rank awareness, thresholding, and MUSIC for joint sparse recovery, Applied and Computational Harmonic Analysis, 48 (2020), pp. 482–495.
- [8] H. Cai, K. Hamm, L. Huang, and D. Needell, Robust CUR decomposition: Theory and imaging applications, SIAM Journal on Imaging Sciences, 14 (2021), pp. 1472–1503.
- [9] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, J Math Imaging Vis, 40 (2011), pp. 120–145.
- [10] F. Clarke, Functional analysis, calculus of variations and optimal control, vol. 264, Springer, 2013.
- [11] P. L. Combettes, Solving monotone inclusions via compositions of nonexpansive averaged operators, Optimization, 53 (2004), pp. 475–504.
- [12] P. L. Combettes and J.-C. Pesquet, Proximal thresholding algorithm for minimization over orthonormal bases, SIAM Journal on Optimization, 18 (2008), pp. 1351–1376.
- [13] A. R. Conn, N. I. M. Gould, and P. L. Toint, Trust Region Methods, Society for Industrial and Applied Mathematics, 2000.
- [14] S. Cotter, B. Rao, K. Engan, and K. Kreutz-Delgado, Sparse solutions to linear inverse problems with multiple measurement vectors, IEEE Transactions on Signal Processing, 53 (2005), pp. 2477–2488.
- [15] G. Dal Maso, An introduction to $\Gamma$-convergence, vol. 8, Springer Science & Business Media, 2012.
- [16] M. E. Davies and Y. C. Eldar, Rank awareness in joint sparse recovery, IEEE Transactions on Information Theory, 58 (2012), pp. 1135–1146.
- [17] N. Dexter, H. Tran, and C. Webster, A mixed $\ell_1$ regularization approach for sparse simultaneous approximation of parameterized PDEs, ESAIM: M2AN, 53 (2019), pp. 2025–2045.
- [18] , On the strong convergence of forward-backward splitting in reconstructing jointly sparse signals, Set-Valued Var. Anal, 30 (2022), pp. 543–557.
- [19] Y. Eldar and H. Rauhut, Average Case Analysis of Multichannel Sparse Recovery Using Convex Relaxation, IEEE Transactions on Information Theory, 56 (2010), pp. 505–519.
- [20] H. W. Engl and R. Ramlau, Regularization of Inverse Problems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2015, pp. 1233–1241.
- [21] E. Esser, Y. Lou, and J. Xin, A method for finding structured sparse solutions to nonnegative least squares problems with applications, SIAM Journal on Imaging Sciences, 6 (2013), pp. 2010–2046.
- [22] P. Feng and Y. Bresler, Spectrum-blind minimum-rate sampling and reconstruction of multiband signals, in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 3, 1996, pp. 1688–1691 vol. 3.
- [23] M. Fornasier and H. Rauhut, Recovery algorithms for vector-valued data with joint sparsity constraints, SIAM Journal on Numerical Analysis, 46 (2008), pp. 577–613.
- [24] M. Golbabaee and P. Vandergheynst, Hyperspectral image compressed sensing via low-rank and joint-sparse matrix recovery, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 2741–2744.
- [25] R. Gribonval, H. Rauhut, K. Schnass, and P. Vandergheynst, Atoms of All Channels, Unite! Average Case Analysis of Multi-Channel Sparse Recovery Using Greedy Algorithms, Journal of Fourier Analysis and Applications, 14 (2008), pp. 655–687.
- [26] K. Hamm and L. Huang, Perspectives on CUR decompositions, Applied and Computational Harmonic Analysis, 48 (2020), pp. 1088–1099.
- [27] N. J. Higham, Functions of matrices: theory and computation, SIAM, 2008.
- [28] P. O. Hoyer, Non-negative matrix factorization with sparseness constraints., Journal of machine learning research, 5 (2004).
- [29] H. Ji, J. Li, Z. Shen, and K. Wang, Image deconvolution using a characterization of sharp images in wavelet domain, Applied and Computational Harmonic Analysis, 32 (2012), pp. 295–304.
- [30] J. M. Kim, O. K. Lee, and J. C. Ye, Compressive MUSIC: Revisiting the link between compressive sensing and array signal processing, IEEE Transactions on Information Theory, 58 (2012), pp. 278–301.
- [31] D. Krishnan, T. Tay, and R. Fergus, Blind deconvolution using a normalized sparsity measure, in CVPR 2011, 2011, pp. 233–240.
- [32] M.-J. Lai and Y. Liu, The null space property for sparse recovery from multiple measurement vectors, Applied and Computational Harmonic Analysis, 30 (2011), pp. 402–406.
- [33] K. Lee, Y. Bresler, and M. Junge, Subspace methods for joint sparse recovery, IEEE Transactions on Information Theory, 58 (2012), pp. 3613–3641.
- [34] J. Liu, S. Ji, and J. Ye, Multi-task feature learning via efficient $\ell_{2,1}$-norm minimization, in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, Arlington, Virginia, USA, 2009, AUAI Press, pp. 339–348.
- [35] M. Lu, Embedded feature selection accounting for unknown data heterogeneity, Expert Systems with Applications, 119 (2019), pp. 350–361.
- [36] M. Mishali and Y. C. Eldar, Reduce and boost: Recovering arbitrary sets of jointly sparse vectors, IEEE Transactions on Signal Processing, 56 (2008), pp. 4692–4702.
- [37] D. Nardone, A. Ciaramella, and A. Staiano, A sparse-modeling based approach for class specific feature selection, PeerJ Computer Science, 5 (2019), p. e237.
- [38] F. Nie, H. Huang, X. Cai, and C. Ding, Efficient and robust feature selection via joint ℓ2,1-norms minimization, Advances in Neural Information Processing Systems, 23 (2010).
- [39] D. Noll, Cutting plane oracles to minimize non-smooth non-convex functions, Set-Valued and Variational Analysis, 18 (2010), pp. 531–568.
- [40] D. Noll, O. Prot, and A. Rondepierre, A proximity control algorithm to minimize nonsmooth and nonconvex functions, Pacific Journal of Optimization, 4 (2008), pp. 569–602.
- [41] H. Peng, F. Long, and C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (2005), pp. 1226–1238.
- [42] A. Petrosyan, H. Tran, and C. Webster, Reconstruction of jointly sparse vectors via manifold optimization, Applied Numerical Mathematics, 144 (2019), pp. 140–150.
- [43] Y. Rahimi, C. Wang, H. Dong, and Y. Lou, A scale-invariant approach for sparse signal recovery, SIAM Journal on Scientific Computing, 41 (2019), pp. A3649–A3672.
- [44] H. Schaeffer, G. Tran, and R. A. Ward, Learning dynamical systems and bifurcation via group sparsity, arXiv: Numerical Analysis, (2017).
- [45] Z. Tan, P. Yang, and A. Nehorai, Joint sparse recovery method for compressed sensing with structured dictionary mismatches, IEEE Transactions on Signal Processing, 62 (2014), pp. 4997–5008.
- [46] M. Tao, Minimization of over for sparse signal recovery with convergence guarantee, SIAM Journal on Scientific Computing, 44 (2022), pp. A770–A797.
- [47] J. A. Tropp, Algorithms for simultaneous sparse approximation. part ii: Convex relaxation, Signal Processing, 86 (2006), pp. 589–602. Sparse Approximations in Signal and Image Processing.
- [48] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, Algorithms for simultaneous sparse approximation. part i: Greedy pursuit, Signal Processing, 86 (2006), pp. 572–588. Sparse Approximations in Signal and Image Processing.
- [49] E. van den Berg and M. P. Friedlander, Probing the pareto frontier for basis pursuit solutions, Siam journal on scientific computing, 31 (2009), pp. 890–912.
- [50] E. van den Berg and M. P. Friedlander, Theoretical and empirical results for recovery from multiple measurements, IEEE Transactions on Information Theory, 56 (2010), pp. 2516–2527.
- [51] E. van den Berg and M. P. Friedlander, Sparse optimization with least-squares constraints, SIAM Journal on Optimization, 21 (2011), pp. 1201–1229.
- [52] S. A. Vavasis, Derivation of compressive sensing theorems from the spherical section property, University of Waterloo, CO, 769 (2009).
- [53] C. Wang, M. Yan, Y. Rahimi, and Y. Lou, Accelerated schemes for the L1/L2 minimization, IEEE Transactions on Signal Processing, 68 (2020), pp. 2660–2669.
- [54] Z. Wen, B. Hou, and L. Jiao, Joint sparse recovery with semisupervised MUSIC, IEEE Signal Processing Letters, 24 (2017), pp. 629–633.
- [55] Y. Xu, A. Narayan, H. Tran, and C. G. Webster, Analysis of the ratio of ℓ1 and ℓ2 norms in compressed sensing, Applied and Computational Harmonic Analysis, 55 (2021), pp. 486–511.
- [56] P. Yin, E. Esser, and J. Xin, Ratio and difference of ℓ1 and ℓ2 norms and sparse representation with coherent dictionaries, Commun. Inf. Syst., 14 (2014), pp. 87–109.
- [57] L. Zeng, P. Yu, and T. K. Pong, Analysis and algorithms for some compressed sensing models based on L1/L2 minimization, SIAM Journal on Optimization, 31 (2021), pp. 1576–1603.
Appendix A. Properties of the loss function
Lemma A.1.
Suppose and is the compact singular value decomposition of , then
Proof.
We have , thus . Observe that
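To make the identity of Lemma A.1 concrete, the following minimal Python/NumPy sketch compares a direct evaluation of the orthogonally weighted functional with the SVD-based expression. It assumes the functional is the row-wise ℓ2,1 norm of X(XᵀX)^†¹ᐟ², which, for the compact SVD X = UΣVᵀ, reduces to the ℓ2,1 norm of UVᵀ; the helper names are ours and not taken from the paper.

```python
import numpy as np

def ow_direct(X, tol=1e-12):
    # Direct evaluation of || X (X^T X)^{+1/2} ||_{2,1}, where (.)^{+1/2} denotes the
    # square root of the Moore-Penrose pseudo-inverse of the symmetric matrix X^T X.
    w, Q = np.linalg.eigh(X.T @ X)
    inv_sqrt = np.where(w > tol, 1.0 / np.sqrt(np.maximum(w, tol)), 0.0)
    W = Q @ np.diag(inv_sqrt) @ Q.T            # (X^T X)^{+1/2}
    return np.sum(np.linalg.norm(X @ W, axis=1))

def ow_svd(X, tol=1e-12):
    # SVD-based expression suggested by Lemma A.1: || U V^T ||_{2,1} for the
    # compact SVD X = U Sigma V^T (only the nonzero singular values are kept).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol))
    return np.sum(np.linalg.norm(U[:, :r] @ Vt[:r, :], axis=1))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 5))  # a rank-3 example
print(ow_direct(X), ow_svd(X))  # the two values agree up to rounding
```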
Lemma A.2.
Let , then
Moreover, we have equality on the left-hand side if and only if is -row sparse.
Proof.
Assume . Let be as in the previous lemma where are the rows of . Notice that and
therefore
So takes its minimum value when and that happens only when for each , hence or . Consequently, takes its smallest value for on the -row sparse matrices.
To prove the upper bound, note that only of the are different from , thus from Hölder’s inequality
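As a numerical sanity check of the bounds in Lemma A.2, one can evaluate the functional on a random row-sparse matrix and compare it with the rank and the number of nonzero rows. The sketch below reuses the assumed SVD-based form from the previous sketch; the particular shape of the bounds printed at the end (the rank as a lower bound and the geometric mean of the rank and the row support size as the Hölder upper bound) is our reading of the proof, not a verbatim statement of the lemma.

```python
import numpy as np

def ow_svd(X, tol=1e-12):
    # As in the previous sketch, but also returning the numerical rank.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol))
    return np.sum(np.linalg.norm(U[:, :r] @ Vt[:r, :], axis=1)), r

rng = np.random.default_rng(1)
n, m, r, s = 50, 20, 4, 7                       # sizes, target rank, nonzero rows
X = np.zeros((n, m))
support = rng.choice(n, size=s, replace=False)  # row support of the sparse matrix
X[support] = rng.standard_normal((s, r)) @ rng.standard_normal((r, m))

value, rank = ow_svd(X)
# Assumed ordering: rank(X) <= value <= sqrt(rank(X) * s), with equality on the
# left when the number of nonzero rows equals the rank.
print(rank, value, np.sqrt(rank * s))
```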
Proposition A.3.
is lower semi-continuous.
Proof.
Let be a singular value decomposition of . It can be the full, thin, or compact version; this choice does not affect the following notation, but we will use the compact SVD unless stated otherwise. For a function with , define
where the matrix function is applied element-wise to the diagonal matrix containing the singular values. This construction is known in the literature as the “generalized” or “singular value” matrix calculus (see, e.g., [2]).
Define the family of functions
and the limit function for by
With this notation, we can write
Thus, we can also write
where the last equality follows from the orthogonality of .
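The “singular value” calculus used in this proof, i.e., a scalar function applied entry-wise to the singular values while the singular vectors are kept, can be sketched as follows (the helper name is ours):

```python
import numpy as np

def singular_value_calc(f, X):
    # Apply the scalar function f to each singular value and reassemble:
    # f_sigma(X) = U f(Sigma) V^T for the thin SVD X = U Sigma V^T.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(f(s)) @ Vt

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))

# With the identity we recover X; with the indicator of the nonzero singular
# values we obtain the orthogonal factor U V^T appearing in Lemma A.1.
print(np.linalg.norm(singular_value_calc(lambda s: s, X) - X))
print(singular_value_calc(lambda s: (s > 1e-12).astype(float), X))
```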
Proposition A.4.
-
(1)
and .
-
(2)
and
-
(3)
is Lipschitz continuous for any :
-
(4)
When is invertible, is locally Lipschitz continuous near with
Proof.
The proofs of 1 and 2 are straightforward. Property 3 follows from , the Lipschitz continuity of the norm with constant one, and [2, Theorem 1.1]
due to for all .
Let and with singular values bounded from below by , i.e., we set . Define
Then, it holds
where is the singular value calculus of the function and we applied [2, Theorem 1.1]. ∎
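A key ingredient in Properties 3 and 4 is the Frobenius-norm Lipschitz estimate for the singular value calculus cited from [2, Theorem 1.1], namely that a Lipschitz scalar function vanishing at zero induces a map that is Lipschitz in the Frobenius norm with the same constant. The check below assumes this reading of the cited theorem and uses the smooth surrogate f(s) = s/√(s² + ε), which is our choice for illustration and not necessarily the family used in the paper.

```python
import numpy as np

def singular_value_calc(f, X):
    # f_sigma(X) = U f(Sigma) V^T (see the sketch after Proposition A.3).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(f(s)) @ Vt

eps = 1e-2
f = lambda s: s / np.sqrt(s**2 + eps)  # smooth surrogate with f(0) = 0
lip = 1.0 / np.sqrt(eps)               # its Lipschitz constant (maximal slope at s = 0)

rng = np.random.default_rng(3)
for _ in range(5):
    X = rng.standard_normal((7, 5))
    Y = X + 0.1 * rng.standard_normal((7, 5))
    lhs = np.linalg.norm(singular_value_calc(f, X) - singular_value_calc(f, Y), "fro")
    rhs = lip * np.linalg.norm(X - Y, "fro")
    print(lhs <= rhs, lhs, rhs)        # the Frobenius-norm Lipschitz bound holds
```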
Proposition A.5.
Γ-converges to for : For any and it holds
and for any and there exists a sequence such that
Proof.
To show the first condition, denote the limit matrix and its full singular value decomposition by
Moreover, let a compatible SVD of be given by
such that for (e.g., by sorting the singular values by magnitude). Then, we can write
Additionally let be a lower bound on the smallest nonzero singular value of and introduce the function
Clearly it holds (by concavity of ), is uniformly Lipschitz-continuous (independently of ), and . By elementary arguments, it holds for any and that
Squaring this estimate leads for any to the estimate
Thus, it follows that
Using the fact that
To pass to the limit we split the error into two terms:
The first term can be estimated through a uniform estimate of on the compact interval and using the fact that and are uniformly bounded in any matrix norm. The second term can be estimated by the Lipschitz constant of equal to and the error of using [2, Theorem 1.1]. Thus, indeed
which shows the first assertion. For the second assertion, we can choose the recovery sequence and obtain
for . ∎
In the following, we derive an analysis of the method based on splitting the nonconvex nonsmooth term into a convex nonsmooth part and a nonconvex quadratic remainder at a given fixed point (which can be thought of as an approximate local minimizer or the current iterate ). To this end, define the functions
and fix with associated and and assume that is symmetric positive definite. The last requirement is fulfilled for either or if has full rank, i.e., . Then it also holds . We recall the notation
Theorem A.6.
Let be invertible. Then, the functional is equal to the sum of a convex part and a nonconvex quadratic remainder:
(20) |
where
In fact, there exist constants and , such that for all with , is also invertible, and there holds
(21) |
Remark A.7.
We note that the terms in the definition of with are not well defined; they are evaluated as zero by convention, which is consistent with the limit of each expression in the sum. Note also that is a symmetric nonnegative matrix with . Thus, the linear term in the expansion (20) can be estimated as
To prove this result, we first derive some continuity and differentiability properties of .
Proposition A.8.
For every and there holds .
Proof.
With the singular value calculus we have and . ∎
Proposition A.9.
For , such that both are invertible there holds
(22) |
where . Moreover with the perturbation of defined by
and there holds
(23) |
Proof.
Denote . We have that
(24) |
(25) |
Turning now to the terms with differences of , note that
(26) |
Thus we have with Proposition A.8 that
(27) |
Combining this with (24) we already obtain (22) due to
Proposition A.10.
The mapping
is Lipschitz-continuous with constant three in the Frobenius norm.
Proof.
First, we consider the case where . In that case
In the nonzero case
We are ready to prove the main result.
Proof of Theorem A.6.
Start with the identity for
Thus, denote and , where both are invertible, and using and for we have
(28) |
We define the intermediate value
and note that by cyclic permutation
where is as in Proposition A.9. Subtracting this from both sides of (28), it follows
where
By the triangle inequality and Proposition A.9 we obtain
where
To obtain the final result, it remains to estimate the difference of and . We start with
By the triangle inequality and Proposition A.10 we have
and thus
where we have used again Proposition A.8 and the inequality .
The previous estimates combined lead to the desired expansion (20) with the remainder estimate (21) and the constant
The constant depends on and through Proposition A.9, and on . Moreover, the latter two quantities can also be estimated in terms of . First, we have
and substituting and rearranging we obtain , i.e., . Now, we can apply (27): . Second, we have
Thus, taking into account the inequalities and Proposition A.8, it can be seen that for smaller than a universal constant , the constant can be bounded by some universal constant . ∎
Theorem A.6 allows us to compute the Clarke subdifferential explicitly by identifying it with the convex subdifferential of the model function for , defined, as motivated by the previous theorem, by
(29) |
Theorem A.11.
For all with invertible we have
Proof.
Assume first that is given with for all . Then, is Fréchet differentiable with
which is obtained using the differentiability of for fixed and the chain rule. On the other hand is also differentiable under this assumption with
Furthermore, for any sequence with the above property, it holds
(30) |
assuming that either of the limits exists, i.e., for all . Now, by the characterization of the Clarke subdifferential ([10, Theorem 10.27]) we know that
and also that
Thus, by (30), both differentials agree. ∎