
Natural Thresholding Algorithms for Signal Recovery with Sparsity

Yun-Bin Zhao and Zhi-Quan Luo

This work was supported by the National Natural Science Foundation of China (NSFC) under grants #12071307 and #61571384. Part of the work was done while the first author worked at the School of Mathematics, University of Birmingham, Birmingham B15 2TT, United Kingdom. Yun-Bin Zhao is with the Shenzhen Research Institute of Big Data, Chinese University of Hong Kong, Shenzhen, Guangdong Province, China (Email: [email protected]). Zhi-Quan Luo is with the Shenzhen Research Institute of Big Data, Chinese University of Hong Kong, Shenzhen, Guangdong Province, China (Email: [email protected]).
Abstract

The algorithms based on the technique of optimal $k$-thresholding (OT) were recently proposed for signal recovery, and they are very different from the traditional family of hard thresholding methods. However, the computational cost for OT-based algorithms remains high at the current stage of their development. This stimulates the development of the so-called natural thresholding (NT) algorithm and its variants in this paper. The family of NT algorithms is developed through the first-order approximation of the so-called regularized optimal $k$-thresholding model, and thus the computational cost for this family of algorithms is significantly lower than that of the OT-based algorithms. The guaranteed performance of NT-type algorithms for signal recovery from noisy measurements is shown under the restricted isometry property and concavity of the objective function of the regularized optimal $k$-thresholding model. Empirical results indicate that the NT-type algorithms are robust and very comparable to several mainstream algorithms for sparse signal recovery.

Index Terms:
Signal recovery, compressed sensing, thresholding algorithms, restricted isometry property, concave minimization, regularization method

I Introduction

Under the sparsity assumption, the problem of signal recovery from nonadaptive linear measurements can be formulated as a sparse optimization problem which may take different forms, depending on such factors as recovery environment, signal structure, and the prior information available for the signal [1, 2]. A fundamental mathematical model for sparsity-based signal recovery can be described as follows. Let $A$ be an $m\times n$ sensing matrix with $m\ll n$, and let $y:=Ax+\nu\in\mathbb{R}^{m}$ be the measurements for the signal $x\in\mathbb{R}^{n}$, where $\nu\in\mathbb{R}^{m}$ are the measurement errors. When $x$ is $k$-sparse or $k$-compressible, the recovery of $x$ can be formulated as the following sparsity-constrained optimization (SCO) problem:

\min_{z\in\mathbb{R}^{n}}\{\|Az-y\|_{2}^{2}:~\|z\|_{0}\leq k\}, (1)

where $k$ is a given integer number reflecting the number of significant entries of the signal to recover, and $\|z\|_{0}$ denotes the number of nonzero entries of the vector $z$. In this paper, we say that $z$ is $k$-sparse if $\|z\|_{0}\leq k$, and $z$ is $k$-compressible if it can be approximated by a $k$-sparse vector. The SCO problem plays a vital role in the development of algorithms for compressed-sensing-based signal recovery (see, e.g., [1, 2, 3, 4]). Essentially, the model (1) is to find the best $k$-term approximation of the target signal, which best fits the acquired measurements. Similar models also arise in numerous areas such as statistical learning [5, 6, 7], wireless communication [8, 9], low-rank matrix recovery [10, 11, 12, 13, 14], linear inverse and optimization problems [15, 16, 17, 18].

While the SCO and related problems can be solved via convex optimization, nonconvex optimization and orthogonal matching pursuit (OMP) (see, e.g., [1, 2, 3, 4]), the thresholding methods with low computational complexity are particularly suited for solving the SCO model. Thus we focus our attention on the study of thresholding methods in this paper. The earlier thresholding methods mainly adopt the soft thresholding operator [19, 20, 21, 22]. The recent ones (e.g., [3, 23, 24, 25, 26, 27]) mainly use the hard thresholding operator, denoted by ${\cal H}_{k}(\cdot)$, which retains the $k$ largest magnitudes of a vector and sets the other entries to zeros. The latest developments of thresholding methods, including acceleration, matrix forms, dual thresholding and Newton-step-based modifications, can be found in such references as [7, 14, 28, 29, 30, 31, 32, 33]. However, performing hard thresholding on an incompressible iterate, generated by an iterative method, does not ensure the reduction of the objective value of (1) at every iteration. Motivated by this observation, the so-called optimal $k$-thresholding (OT) method was first introduced in [34], which allows the vector thresholding and objective reduction to be performed simultaneously. Based on the OT concept, the relaxed optimal $k$-thresholding (ROT) algorithm and its variants were also developed in [34]. A further analysis of these algorithms was performed in [35], and the algorithm using both the OT technique and a truncated gradient descent direction was studied in [37]. Compared to existing hard-thresholding algorithms, the strength of OT-type methods lies in their observed robustness for sparse signal recovery and steadiness in objective reduction in the course of iterations. However, the OT algorithm needs to solve a quadratic optimization problem at every iteration to obtain the optimal or nearly optimal $k$-thresholding of a vector, and thus the computation is expensive, especially for high-dimensional signal recovery.

This paper pursues a new algorithmic development which is significantly different from the existing ones in [34] and [35]. The main idea of the paper is to introduce an auxiliary regularized counterpart of the so-called binary OT model, which will be defined by (3) later in this paper. Based on such a regularized model, we utilize the traditional linearization technique (i.e., the first-order approximation method) to develop a new framework of algorithms which admits a low computational cost at every iteration, compared to general convex optimization methods. The thresholding technique adopted in this framework is referred to as natural thresholding (NT). Such thresholding is determined by the $k$ smallest entries of the gradient of the so-called regularized objective of the OT or ROT model, which will be recalled in the next section. The NT algorithm takes advantage of the 'nice structure' of the polytope with $k$-sparse binary extreme points. The main work of the paper includes:

  • (i)

    Develop a novel class of NT-type methods for signal recovery;

  • (ii)

    Rigorously prove the guaranteed performance of NT-type algorithms under some conditions.

We also carry out experiments on random problem instances to show that the NT-type algorithms are fast algorithms, and their performance is very comparable to that of several mainstream compressed sensing algorithms.

The paper is structured as follows. The NT-type algorithms are proposed in Section II. The theoretical analysis of the algorithms is performed in Section III, and a further discussion of regularization functions and parameters is given in Section IV. The simulation results are reported in Section V.

Notation: All vectors are column vectors unless otherwise stated. For a vector $z$ and a matrix $A$, $z^{T}$ and $A^{T}$ denote their transposes. We use $\textrm{supp}(z)=\{i:z_{i}\not=0\}$ to denote the support of the vector $z$, i.e., the index set of nonzero entries of $z$. We use $L_{k}(z)$ to denote the index set of the $k$ largest magnitudes of $z$, and ${\cal S}_{k}(z)$ to denote the index set of the $k$ smallest entries of $z$. When $L_{k}(z)$ (${\cal S}_{k}(z)$) is not uniquely determined, i.e., when there are two magnitudes (entries) being equal, we select the smallest entry indices to avoid ambiguity in index selection. Given $S\subseteq\{1,2,\dots,n\}$, we denote by $\overline{S}=\{1,2,\dots,n\}\backslash S$ the complement set and by $|S|$ the cardinality of $S$. Given $x\in\mathbb{R}^{n}$, the vector $x_{S}\in\mathbb{R}^{n}$ is obtained from $x$ by retaining the entries supported on $S$ and setting the other entries to zeros. Throughout the paper, $\|\cdot\|_{2}$ and $\|\cdot\|_{1}$ denote the $\ell_{2}$-norm and $\ell_{1}$-norm of a vector, respectively. For vectors $x$ and $z$, $x\otimes z$ is the Hadamard product (entry-wise product) of the two vectors. Also, we summarize the main acronyms of algorithms in the following table.

Acronyms Algorithms
IHT Iterative hard thresholding
HTP Hard thresholding pursuit
SP Subspace pursuit
CoSaMP Compressive sampling matching pursuit
OMP Orthogonal matching pursuit
OT Optimal $k$-thresholding
OTP Optimal $k$-thresholding pursuit
ROTP Relaxed optimal $k$-thresholding pursuit
NT Natural thresholding
NTP Natural thresholding pursuit
NTq Natural thresholding with inner iterations
NTPq Natural thresholding pursuit with inner iterations
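A minimal sketch (assumption: NumPy only; names are illustrative, not from the paper) of the index-selection operators defined in the Notation paragraph: $L_k$ returns the indices of the $k$ largest magnitudes, ${\cal S}_k$ the indices of the $k$ smallest entries, and ${\cal H}_k$ is the hard thresholding operator, with ties broken by taking the smallest indices as stated above.

```python
import numpy as np

def L_k(z, k):
    """Indices of the k largest magnitudes of z (smallest indices on ties)."""
    order = np.lexsort((np.arange(len(z)), -np.abs(z)))  # primary: |z| descending, ties: index ascending
    return np.sort(order[:k])

def S_k(z, k):
    """Indices of the k smallest entries of z (smallest indices on ties)."""
    order = np.lexsort((np.arange(len(z)), z))           # primary: z ascending, ties: index ascending
    return np.sort(order[:k])

def H_k(z, k):
    """Hard thresholding: keep the k largest magnitudes, zero out the rest."""
    out = np.zeros_like(z)
    idx = L_k(z, k)
    out[idx] = z[idx]
    return out

if __name__ == "__main__":
    z = np.array([0.1, -2.0, 0.5, 2.0, -0.3])
    print(L_k(z, 2), S_k(z, 2), H_k(z, 2))
```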

II Natural Thresholding Algorithms

Let us first recall the iterative hard thresholding (IHT), hard thresholding pursuit (HTP) and the newly developed optimal $k$-thresholding (OT) algorithm. These algorithms provide a basis for the development of our new algorithms called natural thresholding.

II-A Basic hard thresholding algorithms: IHT and HTP

Given $(y,A)$, the gradient of the error metric $\psi(x)=(1/2)\|y-Ax\|_{2}^{2}$ is given by $\nabla\psi(x)=-A^{T}(y-Ax)$. At the iterate $x^{(p)}$, let $u^{(p)}$ be the vector generated by the classic gradient descent method, i.e.,

u^{(p)}:=x^{(p)}+\lambda_{p}A^{T}(y-Ax^{(p)}), (2)

where $\lambda_{p}$ is a steplength. Using the operator ${\cal H}_{k}$ to generate an iterate satisfying the constraint of the problem (1) leads to the well-known IHT algorithm [15, 25, 26] and its enhanced version HTP [27]. A good feature of IHT and HTP is that they are easier to implement than other algorithms such as convex optimization [38, 39, 40, 41], orthogonal matching pursuit (OMP) [42, 43, 44], compressive sampling matching pursuit (CoSaMP) [45], and subspace pursuit (SP) [46].

IHT algorithm. Given the problem data $(A,y,k)$ and an initial point $x^{(0)}$, perform the following iteration until a stopping criterion is met: At $x^{(p)}$, choose $\lambda_{p}>0$ and set

x^{(p+1)}={\cal H}_{k}(u^{(p)}),

where $u^{(p)}$ is given by (2).

HTP algorithm. Given the problem data $(A,y,k)$ and an initial point $x^{(0)}$, perform the following iteration until a stopping criterion is met: At $x^{(p)}$, choose $\lambda_{p}>0$ and let $u^{(p)}$ be given by (2). Then set

x^{+}={\cal H}_{k}(u^{(p)}),
x^{(p+1)}=\arg\min_{z}\{\|y-Az\|_{2}^{2}:~\textrm{supp}(z)\subseteq\textrm{supp}(x^{+})\}.
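A hedged sketch (NumPy only) of one IHT and one HTP pass as described above; the fixed unit steplength and the least-squares debiasing over the selected support are my reading of the two schemes, and ties in the thresholding are not broken by the smallest-index rule here for brevity.

```python
import numpy as np

def hard_threshold(z, k):
    """Keep the k largest magnitudes of z, zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argpartition(np.abs(z), -k)[-k:]
    out[idx] = z[idx]
    return out

def iht_step(A, y, x, k, lam=1.0):
    u = x + lam * A.T @ (y - A @ x)          # gradient step (2)
    return hard_threshold(u, k)               # x^{(p+1)} = H_k(u^{(p)})

def htp_step(A, y, x, k, lam=1.0):
    x_plus = iht_step(A, y, x, k, lam)
    S = np.flatnonzero(x_plus)                # support of H_k(u^{(p)})
    x_new = np.zeros_like(x)
    sol, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)  # orthogonal projection onto supp(x^+)
    x_new[S] = sol
    return x_new
```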

Clearly, the $k$ terms of $u^{(p)}$ selected by ${\cal H}_{k}$ are not necessarily the best $k$ terms in the sense that they might not be the $k$ terms that best fit the linear measurements. As pointed out in [34], performing the hard thresholding of $u^{(p)}$ is independent of the error metric $\psi$. Thus it may happen that the value of the error metric at the point $x^{+}={\cal H}_{k}(u^{(p)})$ is larger than that at the current iterate $x^{(p)}$, i.e., $\|y-Ax^{+}\|_{2}>\|y-Ax^{(p)}\|_{2}$. This observation motivates the recent development of optimal $k$-thresholding algorithms in [34, 35].

II-B Optimal $k$-thresholding algorithms

As defined in [34], the optimal $k$-thresholding of a vector $u$, denoted by $Z_{k}^{\#}(u)$, is given by $Z_{k}^{\#}(u)=u\otimes w^{*}$, the Hadamard product of $u$ and $w^{*}$, where $w^{*}$ is the solution to the problem

\min_{w}\left\{\|y-A(u\otimes w)\|^{2}_{2}:~\textrm{e}^{T}w=k,~w\in\{0,1\}^{n}\right\},

where e is the vector of ones, and the vector $w$ is $k$-sparse and binary with exactly $k$ entries equal to 1. Given such a binary vector $w$, the Hadamard product of $u$ and $w$ retains only $k$ entries of $u$ and sets the other entries to zeros. Based on the operator $Z_{k}^{\#}(\cdot)$, the conceptual optimal $k$-thresholding (OT) method and the optimal $k$-thresholding pursuit (OTP) were proposed in [34]. The two methods share the same first step and differ in the second step, at which OTP performs an orthogonal projection.

OT and OTP algorithms. Given the problem data $(A,y,k)$ and a starting point $x^{(0)}$, perform the following steps until a stopping criterion is satisfied:

  • S1

    Let $u^{(p)}$ be given as in (2) and let $w^{*}$ be the solution to the problem

    \min_{w}\{\|y-A(u^{(p)}\otimes w)\|^{2}_{2}:~\textrm{e}^{T}w=k,~w\in\{0,1\}^{n}\}. (3)
  • S2

    Generate $x^{(p+1)}$ as follows:

  • For the OT algorithm, $x^{(p+1)}=u^{(p)}\otimes w^{*}$.

  • For the OTP algorithm, let $\Omega=\textrm{supp}(u^{(p)}\otimes w^{*})$ and

    x^{(p+1)}=\arg\min_{z}\{\|y-Az\|_{2}^{2}:~\textrm{supp}(z)\subseteq\Omega\}.

We would rather treat OT and OTP as conceptual algorithms due to the fact that directly solving the binary problem (3) might be expensive when the dimension of the problem is high. A more practical way is to solve a convex relaxation of (3), which can be solved by existing interior-point methods. This consideration leads to the relaxed optimal $k$-thresholding (ROT) and relaxed optimal $k$-thresholding pursuit (ROTP) algorithms introduced in [34, 35].

ROT and ROTP algorithms. Given the problem data $(A,y,k)$ and a starting point $x^{(0)}$, perform the steps below until a stopping criterion is satisfied:

  • S1

    Let $u^{(p)}$ be given as in (2) and solve

    \min_{w}\{\|y-A(u^{(p)}\otimes w)\|^{2}_{2}:~\textrm{e}^{T}w=k,~0\leq w\leq\textrm{e}\}. (4)

    Let $w^{(p)}$ be the solution to this problem.

  • S2

    Set $\widehat{x}={\cal H}_{k}(u^{(p)}\otimes w^{(p)})$ and generate $x^{(p+1)}$ as follows:

  • For the ROT algorithm, $x^{(p+1)}=\widehat{x}$.

  • For the ROTP algorithm, let

    x^{(p+1)}=\arg\min_{z}\{\|y-Az\|_{2}^{2}:~\textrm{supp}(z)\subseteq\textrm{supp}(\widehat{x})\}.

While the optimization problem (4) can be solved efficiently by existing convex optimization solvers, the computational cost remains high for large-scale signal recovery problems, compared with algorithms which involve only orthogonal projection, hard thresholding or linear programming. This stimulates the natural thresholding (NT) method described in the next subsection.

II-C Natural thresholding algorithms: A new framework

The model (4) is a convex relaxation of (3). However, such a relaxation might be too loose, and thus its solution is usually not binary. To overcome this drawback of (4), we may introduce a certain regularization term into the objective of (4) so that the regularized model approximates the model (3) appropriately and the chance of the solution of the regularized model being binary is enhanced. We now describe such a regularized model, which is essential for the development of the new algorithms in this paper. In fact, given the vector $u^{(p)}$, consider the model of the form:

\min_{w}\quad \|y-A(u^{(p)}\otimes w)\|^{2}_{2}+\alpha\phi(w)
{\rm s.t.}\quad \textrm{e}^{T}w=k,~0\leq w\leq\textrm{e}, (5)

where $\alpha>0$ is a parameter and $\phi(w)$ is referred to as a binary regularization function, which is formally defined as follows.

Definition II.1

$\phi$ is said to be a binary regularization if it satisfies the following two properties: i) $\phi$ is a positive and continuously differentiable function over an open neighborhood of the region $D:=\{w\in\mathbb{R}^{n}:0\leq w\leq\textrm{e}\}$; ii) $\phi(w)$ attains its minimum value over $D$ at any binary vector $w\in\{0,1\}^{n}$ and only at such binary vectors.

Example II.2

Note that given any positive number $0<\sigma<1$, the function $\tau(t):=(t+\sigma)(1+\sigma-t)$ is positive and continuously differentiable over the interval $-\sigma<t<1+\sigma$, which is an open neighborhood of the interval $[0,1]$. $\tau(t)$ attains its minimum value $\tau_{\min}:=\sigma(1+\sigma)$ at $t=0$ or $1$, and $\tau(t)>\tau_{\min}$ for any $0<t<1$. For simplicity, we set $\sigma=1/2$ and only consider the specific function $\tau(t)=(t+\frac{1}{2})(\frac{3}{2}-t)$. By using this univariate function, we may construct a binary regularization function in $\mathbb{R}^{n}$ as follows:

\phi(w)=\sum_{i=1}^{n}\tau(w_{i})=\sum_{i=1}^{n}(w_{i}+\frac{1}{2})(\frac{3}{2}-w_{i}). (6)

It is evident that the function (6) satisfies the properties specified in Definition II.1. More examples of binary regularization functions will be given later in this section.
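A quick numeric check (NumPy only; a sketch, not part of the paper's code) that the regularization (6) attains its minimum value $n\sigma(1+\sigma)=3n/4$ (with $\sigma=1/2$) at 0-1 vectors and is strictly larger at interior points.

```python
import numpy as np

def phi(w):
    """Binary regularization (6) with sigma = 1/2."""
    return np.sum((w + 0.5) * (1.5 - w))

n = 5
w_binary = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
w_interior = np.full(n, 0.5)
print(phi(w_binary), n * 0.75)          # equal: the minimum is attained at a 0-1 vector
print(phi(w_interior) > phi(w_binary))  # True: strictly larger inside (0, 1)^n
```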

Given a vector $u^{(p)}$, we use $f$ and $g_{\alpha}$ to denote the functions

f(w)=\|y-A(u^{(p)}\otimes w)\|_{2}^{2},
g_{\alpha}(w)=f(w)+\alpha\phi(w). (7)

The function $g_{\alpha}(w)$ is the objective of the problem (5). The parameter $\alpha$ is referred to as the regularization parameter. The solution of (5) depends on the parameter $\alpha$. Thus for any given $\alpha>0$, we use $w(\alpha)$ to denote the solution of (5). $\phi(w)$ can be seen as the penalty for the variable $w$ violating the 0-1 restriction. From the classic optimization theory for penalty functions (see, e.g., [47]), it is not difficult to see that as $\alpha\to\infty$, the value of $\phi(w(\alpha))$ will decrease to its minimum over the feasible set of (5). Such a minimum is attained only at 0-1 vectors according to the definition of binary regularization. This implies that $w(\alpha)$ tends to a 0-1 point in $D$ as $\alpha\to\infty$, and any accumulation point of $w(\alpha)$ must be a solution to the problem (3). We summarize this fact in the following lemma, whose proof is given in Appendix A for completeness.

Lemma II.3

Let $\phi(w)$ in (5) be a binary regularization and $w(\alpha)$ be a solution of (5). Consider any strictly increasing and positive sequence $\{\alpha_{\ell}\}$ satisfying $\alpha_{\ell}\to\infty$. Then the sequence $\{\phi(w(\alpha_{\ell}))\}$ is nonincreasing and converges to the minimum value of $\phi$ over $D$, and the sequence $\{f(w(\alpha_{\ell}))\}$ is nondecreasing and converges to the optimal value of (3). Moreover, any accumulation point of $\{w(\alpha_{\ell})\}$ is the optimal solution of (3).

The above lemma shows that the model (5) can approximate the model (3) to any expected level of accuracy provided that the parameter $\alpha$ is taken large enough. From this perspective, the model (5) as an approximation of (3) is better than the convex relaxation (4). By definition, a binary regularization is nonconvex. As shown by (6), it is very easy to construct a binary regularization which is concave. As a result, the objective function $g_{\alpha}$ of (5) can be naturally made concave by properly choosing the parameter $\alpha$. Thus for the convenience of discussion, we assume that $g_{\alpha}$ is concave throughout the remainder of this section and Section III. The concavity of the function $g_{\alpha}(w)$ can be guaranteed if $\phi$ and $\alpha$ are chosen properly. Note that if $\phi$ is twice continuously differentiable, then the Hessian of $g_{\alpha}$ is given as

\nabla^{2}g_{\alpha}(w)=\nabla^{2}f(w)+\alpha\nabla^{2}\phi(w),

from which it can be seen that when $\nabla^{2}\phi(w)$ is negative definite, $\nabla^{2}g_{\alpha}(w)$ will be negative semi-definite (and thus $g_{\alpha}$ is concave) provided that $\alpha$ is large enough.

Take (6) as an example. It is easy to verify that the Hessian of (6) is $\nabla^{2}\phi(w)=-2I$, where $I$ is the identity matrix. Thus when (6) is used, the Hessian of $g_{\alpha}(w)$ is given as

\nabla^{2}g_{\alpha}(w)=2(U^{(p)}A^{T}AU^{(p)}-\alpha I),

where $U^{(p)}=\textrm{diag}(u^{(p)})$. Note that $\nabla^{2}g_{\alpha}(w)$ is negative semi-definite over the whole space provided that $\alpha I\succeq U^{(p)}A^{T}AU^{(p)}$, which is guaranteed when $\alpha$ is not smaller than the largest eigenvalue of the matrix $U^{(p)}A^{T}AU^{(p)}$, i.e., $\alpha\geq\alpha^{*}:=\lambda_{\max}(U^{(p)}A^{T}AU^{(p)})$. Therefore, the function $g_{\alpha}(w)$ is always concave over $\mathbb{R}^{n}$ provided that $\alpha\geq\alpha^{*}$.
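A small sketch (NumPy only; sizes and the stand-in vector $u^{(p)}$ are illustrative assumptions) verifying the concavity threshold just described: with the regularization (6), the Hessian $2(U^{(p)}A^{T}AU^{(p)}-\alpha I)$ is negative semi-definite once $\alpha=\lambda_{\max}(U^{(p)}A^{T}AU^{(p)})$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)
u = rng.standard_normal(n)                   # stands in for u^{(p)}
U = np.diag(u)

M = U @ A.T @ A @ U
alpha_star = np.linalg.eigvalsh(M)[-1]       # largest eigenvalue of U A^T A U
H = 2.0 * (M - alpha_star * np.eye(n))       # Hessian of g_alpha at alpha = alpha*
print(np.max(np.linalg.eigvalsh(H)) <= 1e-8) # True: negative semi-definite
```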

Note that (3) is to find the optimal $k$-thresholding of $u^{(p)}$. The above discussion indicates that performing the optimal $k$-thresholding of a vector can be approximately achieved by solving the problem (5), which, from the above discussion, can be made a concave minimization problem; such a problem remains NP-hard in general [49, 50]. The fundamental idea of the traditional MM method for solving a nonconvex optimization problem is to replace the 'difficult' objective function of the problem with a certain relatively easy surrogate convex function that dominates the original objective function (see, e.g., [51, 52]). For a concave objective, the simplest surrogate function is its first-order approximation at any given point. Thus the first-order approximation of the concave function can be used to develop an algorithm with significantly low computational cost.

Specifically, suppose that $w^{-}$ is the current iterate. At $w^{-}$, by the concavity of $g_{\alpha}(w)$ over an open neighborhood of $D$, denoted by $O$, one has

g_{\alpha}(w)\leq g_{\alpha}(w^{-})+\left[\nabla g_{\alpha}(w^{-})\right]^{T}(w-w^{-}) (8)

for any $w\in O$, where $\nabla g_{\alpha}(w^{-})$ denotes the gradient of $g_{\alpha}$ at $w^{-}$. Specifically, if $\phi$ is given by (6), the gradient of $g_{\alpha}$ at $w^{-}$ is given as

\nabla g_{\alpha}(w^{-})=-2U^{(p)}A^{T}(y-AU^{(p)}w^{-})+\alpha(\textbf{e}-2w^{-}).

The relation (8) shows that the first-order approximation of $g_{\alpha}(w)$ dominates $g_{\alpha}(w)$ on the whole open set $O$. By the idea of MM methods, the problem (5), i.e., $\min\{g_{\alpha}(w):w\in P\}$, where

P=\{w\in\mathbb{R}^{n}:\textrm{e}^{T}w=k,~0\leq w\leq\textrm{e}\},

can be approximated by its surrogate counterpart which, in this case, is the linear programming (LP) problem

\min\left\{g_{\alpha}(w^{-})+\left[\nabla g_{\alpha}(w^{-})\right]^{T}(w-w^{-}):~w\in P\right\}.

Removing the constant terms in the objective above, the above LP problem is reduced to

\min\left\{\left[\nabla g_{\alpha}(w^{-})\right]^{T}w:~w\in P\right\}. (9)

Let $w^{+}$ be the solution to this problem. We may replace $w^{-}$ with $w^{+}$ and repeat the above process until a certain stopping criterion is met. This is the basic idea for the development of a new algorithm.

We now point out that there is no need to solve the LP problem (9) by an algorithm, since an optimal solution of (9) can be obtained explicitly due to the nature of the polytope $P$. In fact, according to classic LP theory, the problem (9) must have an optimal solution, denoted by $w^{+}$, which is an extreme point of the polytope $P$. As shown in [34], any extreme point of $P$ is $k$-sparse and binary. Hence the optimal value $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}$ is actually the sum of $k$ entries of the vector $\nabla g_{\alpha}(w^{-})$. Clearly, the smallest value of the function $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w$ over the polytope $P$ is the sum of the $k$ smallest entries of $\nabla g_{\alpha}(w^{-})$. Thus the optimal solution $w^{+}$ of the LP problem (9) can be given explicitly as follows: it is determined by the support of the $k$ smallest entries of $\nabla g_{\alpha}(w^{-})$, i.e., $w^{+}_{i}=1$ if $i\in{\cal S}_{k}(\nabla g_{\alpha}(w^{-}))$ and $w^{+}_{i}=0$ otherwise, for $i=1,\dots,n$, where ${\cal S}_{k}(\cdot)$ is the index set of the $k$ smallest entries of a vector. We now describe the algorithms, which are referred to as the natural thresholding (NT) algorithm and the natural thresholding pursuit (NTP) in this paper.
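A minimal sketch (NumPy only; the brute-force comparison is added here purely for illustration) of this explicit solution: placing ones at the $k$ smallest entries of the gradient attains the LP optimum over $P$, which we confirm on a tiny instance by enumerating all $k$-sparse 0-1 extreme points.

```python
import numpy as np
from itertools import combinations

def natural_threshold(grad, k):
    """Closed-form solution (11) of the LP (9): ones at the k smallest gradient entries."""
    w_plus = np.zeros_like(grad)
    idx = np.argsort(grad, kind="stable")[:k]   # S_k: k smallest entries, ties by smallest index
    w_plus[idx] = 1.0
    return w_plus

rng = np.random.default_rng(1)
n, k = 8, 3
grad = rng.standard_normal(n)                   # stands in for grad g_alpha(w^-)
w_plus = natural_threshold(grad, k)

best = min(grad[list(S)].sum() for S in combinations(range(n), k))
print(np.isclose(grad @ w_plus, best))          # True: the closed form attains the LP optimum
```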

NT and NTP algorithms.

Input $(A,y,k)$ and an initial point $x^{(0)}$. Repeat the following steps until a certain stopping criterion is satisfied:

  • S1

    At $x^{(p)}$, choose $\lambda_{p}>0$ and let $u^{(p)}$ be given by (2). Let $w^{-}$ be the $k$-sparse vector given as

    w^{-}_{i}=\left\{\begin{array}{ll}1&\textrm{ if }i\in L_{k}(u^{(p)})\\ 0&\textrm{ otherwise}\end{array}\right.,~i=1,\dots,n. (10)

    Compute the gradient $\nabla g_{\alpha}(w^{-})$, where $g_{\alpha}(\cdot)$ is defined by (7). Then set $w^{+}$ as

    w^{+}_{i}=\left\{\begin{array}{ll}1&\textrm{ if }i\in{\cal S}_{k}(\nabla g_{\alpha}(w^{-}))\\ 0&\textrm{ otherwise}\end{array}\right.,~i=1,\dots,n. (11)
  • S2

    Generate $x^{(p+1)}$ according to the following procedure:

  • For the NT algorithm, $x^{(p+1)}=u^{(p)}\otimes w^{+}$.

  • For the NTP algorithm, set $S^{(p+1)}=\textrm{supp}(u^{(p)}\otimes w^{+})$ and

    x^{(p+1)}=\arg\min_{z}\{\|y-Az\|_{2}^{2}:~\textrm{supp}(z)\subseteq S^{(p+1)}\}.

The only difference between the two algorithms is that NT does not perform an orthogonal projection, whereas NTP does. At the beginning of every iteration of NT and NTP, the initial point $w^{-}$ can be any given $k$-sparse and binary vector. In particular, we may use the binary vector (10) determined by $L_{k}(u^{(p)})$ as the initial point.
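A hedged end-to-end sketch of NTP with the regularization (6) (NumPy only). The fixed steplength lam, the parameter alpha, the stopping test and the function name ntp are assumptions for illustration; in the paper these quantities are tuned (see Sections IV and V).

```python
import numpy as np

def ntp(A, y, k, lam=1.0, alpha=5.0, max_iter=150, tol=1e-10):
    """One possible NTP loop using the binary regularization (6)."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(max_iter):
        u = x + lam * A.T @ (y - A @ x)                        # gradient step (2)
        # w^-: 0-1 vector supported on the k largest magnitudes of u, cf. (10)
        w_minus = np.zeros(n)
        w_minus[np.argsort(-np.abs(u), kind="stable")[:k]] = 1.0
        # gradient of g_alpha at w^- for the regularization (6)
        grad = -2.0 * u * (A.T @ (y - A @ (u * w_minus))) + alpha * (1.0 - 2.0 * w_minus)
        # w^+: ones at the k smallest entries of the gradient, cf. (11)
        w_plus = np.zeros(n)
        w_plus[np.argsort(grad, kind="stable")[:k]] = 1.0
        # NTP step: orthogonal projection onto the support of u ⊗ w^+
        S = np.flatnonzero(u * w_plus)
        x_new = np.zeros(n)
        if S.size > 0:
            sol, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)
            x_new[S] = sol
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x)):
            return x_new
        x = x_new
    return x
```

Dropping the least-squares projection and returning $u\otimes w^{+}$ instead would give the NT variant of the same loop.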

Remark II.4

Given a $k$-sparse and binary vector $w^{-}$, let $w^{+}$ be given as in (11), which is an explicit solution of the LP problem (9). By the optimality of $w^{+}$, one always has

\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}\leq\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{-}, (12)

which together with the concavity of $g_{\alpha}$ implies that $f(w^{+})\leq f(w^{-})$. (This fact will be used in the proof of Lemma III.2 later.) In fact, the concavity of $g_{\alpha}(w)$ and (12) imply that

g_{\alpha}(w^{+})\leq g_{\alpha}(w^{-})+\nabla g_{\alpha}(w^{-})^{T}(w^{+}-w^{-})\leq g_{\alpha}(w^{-}),

i.e.,

f(w^{+})+\alpha\phi(w^{+})\leq f(w^{-})+\alpha\phi(w^{-}).

Since $w^{+}$ and $w^{-}$ are 0-1 vectors, at which $\phi$ achieves its minimum value over $D$, we have $\phi(w^{+})=\phi(w^{-})$. Thus the above inequality reduces to $f(w^{+})\leq f(w^{-})$. This means the iterate $w^{+}$ generated by NT or NTP is never worse than the one generated by IHT or HTP from the perspective of the error metric $f(w)$. If $w^{-}$ is not optimal to (9), then $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}<\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{-}$, and hence $f(w^{+})<f(w^{-})$. Thus in this case the algorithm finds $k$ terms of $u^{(p)}$ that are better than the $k$ terms selected by hard thresholding.

The NT and NTP algorithms perform the linearization of $g_{\alpha}$ only once in every iteration. We may perform the linearization more than once within their step S1, i.e., replace the role of $w^{-}$ with $w^{+}$ and repeat step S1 of NT more than once so that a point better than $w^{+}$ might be found. Such iterations within S1 are referred to as inner iterations. Repeating the inner iterations up to $q\geq 1$ times yields the algorithms NTq and NTPq.

NTq and NTPq algorithms.

Input $(A,y,k)$, an initial point $x^{(0)}$, and an integer number $q\geq 1$. Repeat the steps below until a stopping criterion is met:

  • S1

    At $x^{(p)}$, choose $\lambda_{p}>0$ and let $u^{(p)}$ be given by (2). Let $w^{-}$ be the $k$-sparse vector given by (10). Then perform the following inner iterations:

    • For $J=1:q$

      • Compute $\nabla g_{\alpha}(w^{-})$, where $g_{\alpha}(\cdot)$ is defined by (7). Then set $w^{+}$ as in (11).

        • If $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}=\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{-}$ then

          terminate the inner iterations, and go to S2

        • else

          $w^{-}:=w^{+}$

        • endif

      End

  • S2

    Generate $x^{(p+1)}$ according to the following procedure:

  • For the NTq algorithm, set $x^{(p+1)}=u^{(p)}\otimes w^{+}$.

  • For the NTPq algorithm, let $S^{(p+1)}=\textrm{supp}(u^{(p)}\otimes w^{+})$ and

    x^{(p+1)}=\arg\min_{z}\{\|y-Az\|_{2}^{2}:~\textrm{supp}(z)\subseteq S^{(p+1)}\}.
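A short sketch (NumPy only) of the inner loop of step S1 above: linearize $g_\alpha$ at $w^-$, jump to the best extreme point $w^+$, and stop early once the linearized objective no longer decreases. Here grad_g is a stand-in for a routine evaluating $\nabla g_\alpha(w)$ at the current $u^{(p)}$ (for instance, the expression used in the NTP sketch in Section II-C).

```python
import numpy as np

def inner_iterations(grad_g, w_minus, k, q):
    """Inner iterations of NTq/NTPq; grad_g(w) returns grad g_alpha(w)."""
    w_plus = w_minus
    for _ in range(q):
        grad = grad_g(w_minus)
        w_plus = np.zeros_like(w_minus)
        w_plus[np.argsort(grad, kind="stable")[:k]] = 1.0   # update (11)
        if np.isclose(grad @ w_plus, grad @ w_minus):       # stationary point: stop
            break
        w_minus = w_plus                                    # otherwise re-linearize at w^+
    return w_plus
```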
Remark II.5

NT, NTP, NTq and NTPq are referred to as the NT-type algorithms in this paper. When $q=1$, NTq and NTPq reduce to NT and NTP, respectively. If $q$ is a large number, according to the linearization theory for concave minimization (e.g., [49, 53]), the inner iterations of NTq and NTPq will terminate at a stationary point of the concave minimization problem, in which case $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}=\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{-}$. If we expect such a point, we may set $q=\infty$ and use $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}=\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{-}$ as the termination criterion for the inner iterations. The corresponding algorithms can be referred to as NT$_\infty$ and NTP$_\infty$, which are omitted here.

Remark II.6

As pointed out in [35], performing one iteration of ROTP requires about $O(m^{3}+mn+n^{3.5}L)$ flops, where $L$ is the size of the problem data encoded in binary. The complexity of NT-type algorithms is much lower than that of ROTP. In fact, at every iteration, the NT-type algorithms need to compute the vectors $u^{(p)}$ and $\nabla g_{\alpha}(w^{-})$, which require about $O(mn)$ operations, and to perform $L_{k}$ and ${\cal S}_{k}$ on a vector, which require about $O(n\log k)$ flops by a sorting approach, where $k<m\ll n$. The NTP and NTPq algorithms also need to perform the orthogonal projection $\min\{\|y-Az\|_{2}^{2}:\textrm{supp}(z)\subset\Omega\}$, where $|\Omega|=l\leq m$, which requires about $m^{3}$ flops (see [35] for details). Thus the complexities of NT and NTP are about $O(mn)$ and $O(m^{3}+mn)$ flops per iteration, respectively. If the inner iterations are executed $q$ times, then NTq and NTPq require about $O(qmn)$ and $O(m^{3}+qmn)$ flops.

II-D Concave regularization function

Before we analyze the guaranteed performance of NT-type algorithms, it is necessary to construct more examples of concave binary regularization functions. We still construct such functions based on the kernel univariate function $\tau(t)=(t+1/2)(3/2-t)$, which is concave and positive over an open neighborhood of the interval $[0,1]$. Assume $g$ is another concave and strictly increasing univariate function over the interval $(0,\infty)$. Then the composite function $h(t)=g(\tau(t))$ must be a concave function over an open neighborhood of the interval $[0,1]$ due to the fact that

h^{\prime\prime}(t)=g^{\prime\prime}(\tau(t))(\tau^{\prime}(t))^{2}+g^{\prime}(\tau(t))\tau^{\prime\prime}(t)\leq 0,

where the inequality follows from the facts $g^{\prime\prime}(\tau(t))\leq 0$, $\tau^{\prime\prime}(t)\leq 0$ and $g^{\prime}(\tau(t))\geq 0$. We may use such a function to construct a binary regularization as follows:

\phi(w)=\sum_{i=1}^{n}h(w_{i})=\sum_{i=1}^{n}g(\tau(w_{i})). (13)

Functions constructed in this way can be used in NT-type algorithms. We now give two such specific functions.

  • Let $g(t):=\log(1+t)$, which is concave and strictly increasing over the interval $(0,\infty)$. Then

    \phi(w)=\sum_{i=1}^{n}\log\left(1+(w_{i}+1/2)(3/2-w_{i})\right) (14)

    is a concave binary regularization over an open neighborhood of the region $D$.

  • Let $g(t):=\frac{t}{1+t}$, which is concave and strictly increasing over the interval $(0,\infty)$. Then

    \phi(w)=\sum_{i=1}^{n}\frac{(w_{i}+1/2)(3/2-w_{i})}{1+(w_{i}+1/2)(3/2-w_{i})} (15)

    is a concave binary regularization over an open neighborhood of the region $D$.

Similar to $f(w)$, which is related to $u^{(p)}$, the regularization $\phi$ can also be made dependent on $u^{(p)}$. Such a regularization will be discussed in more detail in Section IV.
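A minimal sketch (NumPy only) of the two composite regularizations (14) and (15) built from $\tau(t)=(t+1/2)(3/2-t)$, together with their gradients, which are what the NT-type algorithms actually need when forming $\nabla g_{\alpha}$.

```python
import numpy as np

def tau(w):          # kernel concave function, positive on a neighborhood of [0, 1]
    return (w + 0.5) * (1.5 - w)

def dtau(w):         # tau'(t) = 1 - 2 t
    return 1.0 - 2.0 * w

def phi_log(w):      # regularization (14): sum of log(1 + tau(w_i))
    return np.sum(np.log1p(tau(w)))

def grad_phi_log(w):
    return dtau(w) / (1.0 + tau(w))

def phi_ratio(w):    # regularization (15): sum of tau(w_i) / (1 + tau(w_i))
    t = tau(w)
    return np.sum(t / (1.0 + t))

def grad_phi_ratio(w):
    t = tau(w)
    return dtau(w) / (1.0 + t) ** 2
```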

III Guaranteed performance and stability

In this section, we perform a theoretical analysis of the NT-type algorithms described in the previous section. This analysis is made under the assumptions of the concavity of $g_{\alpha}(w)$ and the restricted isometry property (RIP), which was initially introduced by Candès and Tao [39]. An $m\times n$ matrix $A$, where $m\ll n$, is said to satisfy the RIP of order $K$ if $\delta_{K}<1$, where $\delta_{K}$ is the smallest nonnegative number $\delta$ such that

(1-\delta)\|z\|_{2}^{2}\leq\|Az\|^{2}_{2}\leq(1+\delta)\|z\|_{2}^{2}

for any $K$-sparse vector $z\in\mathbb{R}^{n}$. The number $\delta_{K}$ is called the restricted isometry constant (RIC) of $A$. The following property of $\delta_{K}$ has been widely used in the analysis of compressed sensing algorithms.
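A rough empirical probe (NumPy only; a sketch, not a way to compute the exact RIC) of the RIP inequality above: sampling random $K$-sparse vectors and recording the largest deviation of $\|Az\|_2^2/\|z\|_2^2$ from 1 gives only a lower estimate of $\delta_K$, since the exact constant would require examining every support of size $K$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K, trials = 200, 400, 10, 2000
A = rng.standard_normal((m, n)) / np.sqrt(m)    # columns roughly unit norm

delta_lb = 0.0
for _ in range(trials):
    S = rng.choice(n, size=K, replace=False)    # random support of size K
    z = np.zeros(n)
    z[S] = rng.standard_normal(K)
    ratio = np.linalg.norm(A @ z) ** 2 / np.linalg.norm(z) ** 2
    delta_lb = max(delta_lb, abs(ratio - 1.0))
print("empirical lower bound on delta_K:", round(delta_lb, 3))
```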

Lemma III.1

[39, 3] Let $u\in\mathbb{R}^{n}$ and $\Lambda\subseteq\{1,2,\dots,n\}$. If $|\Lambda\cup\textrm{supp}(u)|\leq t$, then $\|\left[(I-A^{T}A)u\right]_{\Lambda}\|_{2}\leq\delta_{t}\|u\|_{2}$.

It is worth stressing that the RIP is not the only assumption under which compressed sensing algorithms can be analyzed. Other assumptions such as the mutual coherence [1, 3], the null space property [3], and the range space property of $A^{T}$ [4, 54] can also be used to analyze the theoretical performance of a compressed sensing algorithm. We now prove that the iterates generated by the NT-type algorithms with a concave binary regularization are guaranteed to recover the sparse signal under the RIP assumption. We begin with the following lemma.

Lemma III.2

Let $g_{\alpha}(w)$ be a concave function, let $w^{-}$ be the initial point of step S1 in the NTq and NTPq algorithms where $q\geq 1$, and let $w^{+}$ be the final output of step S1. Then

\|y-A(u^{(p)}\otimes w^{+})\|_{2}\leq\|y-A(u^{(p)}\otimes w^{-})\|_{2}. (16)

Proof. The inner iterations within step S1 of NTq and NTPq start from the initial point $w^{(0)}:=w^{-}$. The first inner iteration performs the update (11) to yield the next inner iterate $w^{(1)}$. As pointed out in Remark II.4, since $g_{\alpha}(w)$ is concave and $(w^{(0)},w^{(1)})$ are 0-1 vectors, one must have $f(w^{(1)})\leq f(w^{(0)})$. Then the inner iteration starts from $w^{(1)}$ to generate the second iterate $w^{(2)}$. Again, by the same argument as in Remark II.4, one has $f(w^{(2)})\leq f(w^{(1)})$. Repeating this inner process $q$ times, the final output $w^{(q)}$ of step S1 satisfies

f(w^{(q)})\leq f(w^{(q-1)})\leq\cdots\leq f(w^{(1)})\leq f(w^{(0)}).

Denoting by $w^{+}(=w^{(q)})$ the final output of S1, the above inequality immediately implies the desired relation (16). $\Box$

The next lemma is very useful in our analysis.

Lemma III.3

[35] For any vector $z\in\mathbb{R}^{n}$ and any $k$-sparse vector $h\in\mathbb{R}^{n}$, one has

\|(h-{\cal H}_{k}(z))_{S\backslash S^{*}}\|_{2}\leq\|(h-z)_{S\backslash S^{*}}\|_{2}+\|(h-z)_{S^{*}\backslash S}\|_{2},

where $S=\textrm{supp}(h)$ and $S^{*}=\textrm{supp}({\cal H}_{k}(z))$.

We now prove the guaranteed performance of the NT-type algorithms under the RIP and the concavity of $g_{\alpha}(w)$. It is sufficient to show the main result in noisy settings; the result for noiseless cases follows immediately as a special case. A practical signal is usually not exactly $k$-sparse. Thus we are interested in recovering the key information of the signal, i.e., the largest $k$ magnitudes of the signal. For simplicity, we only analyze the algorithms with steplength $\lambda_{p}\equiv 1$. The performance of the proposed algorithms with $\lambda_{p}\not=1$ may be established by a more complicated analysis which requires significant further investigation. In fact, for the case $\lambda_{p}\not=1$, the term $I-\lambda_{p}A^{T}A$ would appear in (40) in the proof of Theorem III.4 below (see Appendix B for details). Thus, to bound the term in (40), the classic Lemma III.1 cannot be used, and some new technical results involving $\lambda_{p}\not=1$ must be established first. We leave such a possible generalization as future research work.

Theorem III.4

Let $y:=Ax+\nu$ be the acquired measurements for the signal $x$, where $\nu$ are the measurement errors. Denote $S=L_{k}(x)$. Suppose that the RIC of $A$ satisfies

\delta_{3k}<\gamma^{*}\approx 0.4712, (17)

where $\gamma^{*}$ is the unique real root of the univariate equation $\omega\gamma^{3}+\omega\gamma^{2}+\gamma=1$ in the interval $[0,1]$, where $\omega:=(\sqrt{5}+1)/2$. Suppose that the sequence $\{x^{(p)}:p\geq 1\}$ is generated by the algorithm NT or NTq with $\lambda_{p}\equiv 1$ and $g_{\alpha}(\cdot)$ being concave. Then the sequence $\{x^{(p)}\}$ satisfies the following relation:

\|x^{(p+1)}-x_{S}\|_{2}\leq\eta\|x^{(p)}-x_{S}\|_{2}+C_{1}\|A^{T}\nu^{\prime}\|_{2}+C_{2}\|\nu^{\prime}\|_{2}, (18)

where

C_{1}=\omega\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}},~C_{2}=\frac{2}{\sqrt{1-\delta_{2k}}},~\nu^{\prime}=Ax_{\overline{S}}+\nu, (19)

and $\eta$ is a constant given by

\eta=\omega\delta_{3k}\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}} (20)

which is smaller than 1 under the condition (17).

The proof of Theorem III.4 is given in Appendix B. It should be pointed out that to ensure the concavity of $g_{\alpha}(\cdot)$, one may use a concave binary regularization and choose $\alpha$ to be large enough (see Section IV for more discussion). The only difference between NTq and NTPq is that the latter uses the orthogonal projection to generate $x^{(p+1)}$ from $x^{(p)}$. The following property of the orthogonal projection is widely used in the analysis of compressed sensing algorithms.

Lemma III.5

[34, 35] Given the pair $(y,A)$ with $\nu^{\prime}=y-Ax^{*}$, where $x^{*}$ is a $k$-sparse vector, let $u\in\mathbb{R}^{n}$ be an arbitrary $k$-sparse vector, and let $z^{*}$ be the solution to the orthogonal projection

z^{*}=\arg\min_{z}\left\{\|y-Az\|_{2}^{2}:~\operatorname{supp}(z)\subseteq\operatorname{supp}(u)\right\}.

Then $z^{*}$ satisfies

\left\|z^{*}-x^{*}\right\|_{2}\leq\frac{1}{\sqrt{1-\delta_{2k}^{2}}}\|x^{*}-u\|_{2}+\frac{\sqrt{1+\delta_{k}}}{1-\delta_{2k}}\|\nu^{\prime}\|_{2}.

Combining Theorem III.4 and Lemma III.5 leads to the main result for NTP and NTP$_q$.

Theorem III.6

Let $y:=Ax+\nu$ be the measurements of the signal $x$ with measurement errors $\nu$. If the RIC of $A$ satisfies

\delta_{3k}<\frac{2}{\sqrt{5}+3}, (21)

then the sequence $\{x^{(p)}:p\geq 1\}$, generated by the algorithm NTP or NTPq with $\lambda_{p}\equiv 1$ and $g_{\alpha}(\cdot)$ being concave, satisfies the following relation:

\|x^{(p+1)}-x_{S}\|_{2}\leq\rho\|x^{(p)}-x_{S}\|_{2}+\widetilde{C}_{1}\|A^{T}\nu^{\prime}\|_{2}+\widetilde{C}_{2}\|\nu^{\prime}\|_{2}, (22)

where $\nu^{\prime}=Ax_{\overline{S}}+\nu$, $\rho$ is a constant given by $\rho=\omega\delta_{3k}/(1-\delta_{2k})$, which is smaller than 1 under (21), and $\widetilde{C}_{1}$, $\widetilde{C}_{2}$ are the constants given by

\widetilde{C}_{1}=\frac{C_{1}}{\sqrt{1-\delta_{2k}^{2}}},~\widetilde{C}_{2}=\frac{C_{2}}{\sqrt{1-\delta_{2k}^{2}}}+\frac{\sqrt{1+\delta_{k}}}{1-\delta_{2k}}, (23)

in which $C_{1},C_{2}$ are the constants given in Theorem III.4.

Proof. By the structure of NTP and NTP$_q$, $w^{+}$ is the output of S1, and $\hat{x}=u^{(p)}\otimes w^{+}$ is $k$-sparse. The iterate $x^{(p+1)}$ is the solution to the orthogonal projection

\min_{z}\left\{\|y-Az\|_{2}^{2}:~\operatorname{supp}(z)\subseteq\operatorname{supp}(\hat{x})\right\}.

Let $S=L_{k}(x)$. By treating $u=\hat{x}$ and $x^{*}=x_{S}$ as well as $\nu^{\prime}=y-Ax_{S}=Ax_{\overline{S}}+\nu$, it follows from Lemma III.5 that

\|x^{(p+1)}-x_{S}\|_{2}\leq\frac{\|x_{S}-\hat{x}\|_{2}}{\sqrt{1-\delta_{2k}^{2}}}+\frac{\sqrt{1+\delta_{k}}}{1-\delta_{2k}}\|\nu^{\prime}\|_{2}.

The intermediate point $\hat{x}$ generated at the $p$th iteration of NTP (NTPq) is exactly the iterate generated at the $p$th iteration of NT (NTq). By Theorem III.4, we have

\|x_{S}-\hat{x}\|_{2}\leq\eta\|x_{S}-x^{(p)}\|_{2}+C_{1}\|A^{T}\nu^{\prime}\|_{2}+C_{2}\|\nu^{\prime}\|_{2},

where $\eta$, $C_{1},C_{2}$ are given in Theorem III.4. Combining the above two relations, we immediately obtain

\|x^{(p+1)}-x_{S}\|_{2}\leq\rho\|x_{S}-x^{(p)}\|_{2}+\widetilde{C}_{1}\|A^{T}\nu^{\prime}\|_{2}+\widetilde{C}_{2}\|\nu^{\prime}\|_{2},

where

\rho=\frac{\eta}{\sqrt{1-\delta_{2k}^{2}}}=\frac{\omega\delta_{3k}}{\sqrt{1-\delta_{2k}^{2}}}\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}}=\frac{\omega\delta_{3k}}{1-\delta_{2k}},

and

\widetilde{C}_{1}=\frac{C_{1}}{\sqrt{1-\delta_{2k}^{2}}},~\widetilde{C}_{2}=\frac{C_{2}}{\sqrt{1-\delta_{2k}^{2}}}+\frac{\sqrt{1+\delta_{k}}}{1-\delta_{2k}}.

Since $\delta_{2k}\leq\delta_{3k}$, one has $\rho\leq\frac{\omega\delta_{3k}}{1-\delta_{3k}}$. Thus $\rho<1$ is guaranteed if $\frac{\omega\delta_{3k}}{1-\delta_{3k}}<1$, which is ensured under the condition $\delta_{3k}<\frac{1}{1+\omega}=\frac{2}{3+\sqrt{5}}$. Thus the bound (22) is valid. $\Box$

When the measurements are accurate and the signal is $k$-sparse, we immediately have the following corollary.

Corollary III.7

Suppose that the signal $x$ is $k$-sparse and $y:=Ax$. Then the following two statements are valid:

(i) If the RIC of $A$ satisfies (17), then the sequence $\{x^{(p)}:p\geq 1\}$, generated by NT or NTq with $\lambda_{p}\equiv 1$ and concave $g_{\alpha}(\cdot)$, converges to $x$.

(ii) If the RIC of $A$ satisfies (21), then the sequence $\{x^{(p)}:p\geq 1\}$, generated by NTP or NTPq with $\lambda_{p}\equiv 1$ and concave $g_{\alpha}(\cdot)$, converges to $x$.

The main results established in this section indicate that the significant information of the signal can be recovered by the NT-type algorithms under the RIP and the concavity of $g_{\alpha}(\cdot)$. Corollary III.7 further claims that under the same conditions, the $k$-sparse signal can be exactly recovered when the measurements are accurate.

The above results also imply that the NT-type algorithms are stable for signal recovery. To establish such a stability result, let us first state the next lemma, which follows immediately from Lemma 6.23 in [3].

Lemma III.8

[3] Let $k$ be an even number. Suppose that $A\in\mathbb{R}^{m\times n}$ has the restricted isometry constant $\delta_{k}<1$. Given $\tau>0$, $\xi\geq 0$ and $\nu\in\mathbb{R}^{m}$, if two vectors $x,x^{\prime}\in\mathbb{R}^{n}$ satisfy $\|x^{\prime}\|_{0}\leq k$ and $\|x_{S}-x^{\prime}\|_{2}\leq\tau\|Ax_{\overline{S}}+\nu\|_{2}+\xi$, where $S=L_{k}(x)$, then one has

\|x-x^{\prime}\|_{2}\leq\frac{1+2\sqrt{2}\tau}{\sqrt{k/2}}\sigma_{k/2}(x)_{1}+2\tau\|\nu\|_{2}+2\xi,

where

\sigma_{k/2}(x)_{1}:=\min_{u}\{\|x-u\|_{1}:\|u\|_{0}\leq k/2\}.

In fact, by setting $\kappa=2$, $T=S:=L_{k}(x)$ and considering the $\ell_{2}$-norm bound, Lemma 6.23 in [3] immediately reduces to the above lemma. The following stability result for NT-type algorithms follows from Theorems III.4 and III.6 and Lemma III.8.
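A tiny sketch (NumPy only; the helper name is illustrative) of the best $s$-term approximation error in the $\ell_1$ norm, $\sigma_{s}(x)_{1}$, which appears in the stability bounds below: the minimum is attained by keeping the $s$ largest magnitudes of $x$, so the error is the $\ell_1$ norm of the remaining entries.

```python
import numpy as np

def sigma_s_l1(x, s):
    """sigma_s(x)_1 = min{ ||x - u||_1 : ||u||_0 <= s }."""
    mags = np.sort(np.abs(x))            # ascending magnitudes
    return mags[:-s].sum() if s > 0 else mags.sum()

x = np.array([3.0, -0.2, 0.05, 1.5, 0.0, -0.1])
print(sigma_s_l1(x, 2))                  # 0.0 + 0.05 + 0.1 + 0.2 = 0.35
```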

Theorem III.9

Let $y:=Ax+\nu$ be the measurements of the signal $x$, where $\nu$ are the measurement errors. Then the following two statements are valid:

(i) Suppose that the RIC of $A$ satisfies (17) and $k$ is an even integer number. Then the sequence $\{x^{(p)}:p\geq 1\}$, generated by NT or NTq with $\lambda_{p}\equiv 1$ and concave $g_{\alpha}(\cdot)$, satisfies

\|x-x^{(p)}\|_{2}\leq\frac{1+2\sqrt{2}C/(1-\eta)}{\sqrt{k/2}}\sigma_{k/2}(x)_{1}+\frac{2C}{1-\eta}\|\nu\|_{2}+2\eta^{p}(\|x^{(0)}\|_{2}+\|x\|_{2}), (24)

where $C=C_{1}\sigma_{\max}(A)+C_{2}$, $\sigma_{\max}(A)$ is the largest singular value of $A$, and the constants $C_{1},C_{2},\eta$ are given in Theorem III.4.

(ii) Suppose that the RIC of $A$ satisfies (21) and $k$ is an even integer number. Then the sequence $\{x^{(p)}:p\geq 1\}$, generated by NTP or NTPq with $\lambda_{p}\equiv 1$ and concave $g_{\alpha}(\cdot)$, satisfies

\|x-x^{(p)}\|_{2}\leq\frac{1+2\sqrt{2}\widetilde{C}/(1-\rho)}{\sqrt{k/2}}\sigma_{k/2}(x)_{1}+\frac{2\widetilde{C}}{1-\rho}\|\nu\|_{2}+2\rho^{p}(\|x^{(0)}\|_{2}+\|x\|_{2}), (25)

where $\widetilde{C}=\widetilde{C}_{1}\sigma_{\max}(A)+\widetilde{C}_{2}$ and the constants $\widetilde{C}_{1},\widetilde{C}_{2},\rho$ are given in Theorem III.6.

Proof. (i) Under (17), Theorem III.4 claims that the sequence $\{x^{(p)}:p\geq 1\}$, generated by NT and NTq with $\lambda_{p}\equiv 1$ and concave $g_{\alpha}(\cdot)$, satisfies (18). Note that $\|A^{T}\nu^{\prime}\|_{2}\leq\sigma_{\max}(A)\|\nu^{\prime}\|_{2}$, where $\sigma_{\max}(A)$ is the largest singular value of $A$. Thus (18) implies that

\|x^{(p+1)}-x_{S}\|_{2}\leq\eta\|x^{(p)}-x_{S}\|_{2}+C\|\nu^{\prime}\|_{2},

where $C=C_{1}\sigma_{\max}(A)+C_{2}$ and $\nu^{\prime}=Ax_{\overline{S}}+\nu$, and hence

\|x^{(p)}-x_{S}\|_{2}\leq\eta^{p}\|x^{(0)}-x_{S}\|_{2}+C\left(\sum_{i=0}^{p-1}\eta^{i}\right)\|\nu^{\prime}\|_{2}\leq\frac{C}{1-\eta}\|\nu^{\prime}\|_{2}+\eta^{p}(\|x^{(0)}\|_{2}+\|x\|_{2}),

where the last inequality follows from the facts $\sum_{i=0}^{p-1}\eta^{i}<1/(1-\eta)$ and $\|x^{(0)}-x_{S}\|_{2}\leq\|x^{(0)}\|_{2}+\|x\|_{2}$. By Lemma III.8 with $x^{\prime}=x^{(p)}$, $\tau=\frac{C}{1-\eta}$ and $\xi=\eta^{p}(\|x^{(0)}\|_{2}+\|x\|_{2})$, we immediately obtain the bound (24).

(ii) By an analysis similar to item (i) above, we can also show the bound (25). In fact, under (21), it follows from (22) in Theorem III.6 that

\|x^{(p+1)}-x_{S}\|_{2}\leq\rho\|x^{(p)}-x_{S}\|_{2}+\widetilde{C}\|\nu^{\prime}\|_{2},

where $\widetilde{C}=\widetilde{C}_{1}\sigma_{\max}(A)+\widetilde{C}_{2}$ and $\nu^{\prime}=Ax_{\overline{S}}+\nu$. The inequality above implies that

\|x^{(p)}-x_{S}\|_{2}\leq\frac{\widetilde{C}}{1-\rho}\|\nu^{\prime}\|_{2}+\rho^{p}(\|x^{(0)}\|_{2}+\|x\|_{2}),

which together with Lemma III.8 implies the bound (25). $\Box$

If $x$ is $t$-sparse (or $t$-compressible) where $t\leq k/2$, then $\sigma_{k/2}(x)_{1}\equiv 0$ (or $\sigma_{k/2}(x)_{1}\approx 0$), and (24) and (25) imply that the signal $x$ can be stably recovered by the algorithms provided that the measurements are accurate enough and the algorithms perform enough iterations.

IV Further discussion on regularization and parameter

Clearly, the choice of the regularization $\phi$ and the parameter $\alpha$ will directly affect the numerical behavior of the algorithms. The NT-type algorithms are developed from the first-order approximation of the function $g_{\alpha}(w)$. To ensure the quality of such an approximation, the gap between $g_{\alpha}(w)$ and its first-order approximation over the set $P$ should be made as small as possible. Thus we should choose $\phi$ and $\alpha$ such that the second-order term in the Taylor expansion of $g_{\alpha}(w)$ is dominated as much as possible by its first-order term. One idea is to strengthen the first-order approximation by suppressing the second-order term through controlling the eigenvalues of the Hessian matrix of $g_{\alpha}(w)$. This might be partially achieved by properly selecting $\alpha$. Moreover, the concavity of $g_{\alpha}(w)$ can also be maintained by a proper selection of $\alpha$, as shown by the next proposition.

Proposition IV.1

(i) Using the function (6), $g_{\alpha}(w)$ is concave for any given $\alpha\geq\alpha^{*}:=\lambda_{\max}(U^{(p)}A^{T}AU^{(p)})$.

(ii) Using the function (14), $g_{\alpha}(w)$ is concave for any given parameter $\alpha\geq\alpha^{*}:=2\lambda_{\max}(U^{(p)}A^{T}AU^{(p)})$.

(iii) Using the function (15), $g_{\alpha}(w)$ is concave for any given parameter $\alpha\geq\alpha^{*}:=4\lambda_{\max}(U^{(p)}A^{T}AU^{(p)})$.

The proof of the proposition is given in Appendix C. Note that the functions (6), (14) and (15) are independent of $u^{(p)}$. It should be pointed out that all regularization functions given in Section II can also be modified to include the vector $u^{(p)}$. For instance, the function (6) can be modified to involve the vector $u^{(p)}$ as follows:

\phi(w)=\sum_{i=1}^{n}(u^{(p)}_{i})^{2}(w_{i}+\frac{1}{2})(\frac{3}{2}-w_{i})=(w+\frac{1}{2}\textrm{e})^{T}(U^{(p)})^{2}(\frac{3}{2}\textrm{e}-w), (26)

where $U^{(p)}=\textrm{diag}(u^{(p)})$. When $u^{(p)}_{i}\not=0$ for $i=1,\dots,n$, this function is positive over an open neighborhood of $D$, and it remains a binary regularization as specified by Definition II.1. We also have the following observation.

Proposition IV.2

When all entries of $u^{(p)}$ are nonzero, if $\phi$ is given by (26), then $g_{\alpha}(w)$ is concave over an open neighborhood of $D$ for any given $\alpha\geq\alpha^{*}:=\lambda_{\max}(A^{T}A)$.

Proof. The Hessian of $g_{\alpha}(w)$ is given as

\nabla^{2}g_{\alpha}(w)=2U^{(p)}(A^{T}A-\alpha I)U^{(p)}.

The result is obvious since $\nabla^{2}g_{\alpha}(w)\preceq 0$ if and only if $\alpha\geq\alpha^{*}:=\lambda_{\max}(A^{T}A)$. $\Box$
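A quick numeric check (NumPy only; sizes and the stand-in $u^{(p)}$ are illustrative) of Proposition IV.2: with the $u^{(p)}$-weighted regularization (26), the Hessian $2U^{(p)}(A^{T}A-\alpha I)U^{(p)}$ is negative semi-definite once $\alpha=\lambda_{\max}(A^{T}A)$, independently of $u^{(p)}$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 40
A = rng.standard_normal((m, n)) / np.sqrt(m)
u = rng.standard_normal(n)                    # stands in for u^{(p)} (nonzero with probability 1)
U = np.diag(u)

alpha = np.linalg.eigvalsh(A.T @ A)[-1]       # alpha* = lambda_max(A^T A)
H = 2.0 * U @ (A.T @ A - alpha * np.eye(n)) @ U
print(np.max(np.linalg.eigvalsh(H)) <= 1e-8)  # True: negative semi-definite
```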

While the concavity of $g_{\alpha}(w)$ is assumed in the analysis of the algorithms in Section III, this assumption is largely for the convenience of analysis and might not be necessary if we adopt another way of analysis. In fact, the key idea behind the derivation of the NT algorithms is the first-order approximation of the function $g_{\alpha}(w)$. Such a derivation does not require the concavity of $g_{\alpha}(w)$: even if $g_{\alpha}(w)$ is not concave, the first-order approximation of $g_{\alpha}(w)$ still yields the same version of the NT-type algorithms. Also, the concavity of $g_{\alpha}(w)$ may not be required from a numerical point of view. The value of $\alpha$ may be simply prescribed beforehand or be updated in the course of iterations according to some rule.

The NT-type algorithms involve only matrix-vector products, sorting and orthogonal projections, and hence their computational cost is very similar to that of HTP, SP, CoSaMP and OMP. The computational cost of these algorithms is much lower than that of convex optimization methods [38, 39, 40, 41] and optimal $k$-thresholding methods [34, 35]. The empirical results in the next section indicate that with appropriate choices of the steplength $\lambda_{p}$ and the parameter $\alpha$, the NT-type algorithms can efficiently reconstruct sparse signals.

V Numerical Validation

Based on random problem instances of signal recovery, we provide some preliminary experimental results to demonstrate the performance of the NT-type algorithms. For the given size and class of random matrices, we first use simulations to find a suitable steplength $\lambda_{p}$ and parameter $\alpha$ for the algorithms. Then we compare the performance of the NT-type algorithms with SP, CoSaMP, OMP and HTP. The dimension of the synthetic sparse signal is 8000, and the size of the random matrices used in our experiments is $1000\times 8000$. All matrices are Gaussian matrices with normalized columns, whose entries are independently and identically distributed (iid) random variables following the standard normal distribution. The nonzeros of the sparse signals are also iid random variables following the standard normal distribution, and their positions are uniformly distributed. Unless otherwise specified, we use the following stopping criterion for signal recovery:

\|x^{(p)}-x\|_{2}/\|x\|_{2}\leq 10^{-5}, (27)

where $x$ denotes the target signal to recover and $x^{(p)}$ denotes the iterate produced by an algorithm. When $x^{(p)}$ satisfies (27), the recovery of $x$ is counted as a success, and otherwise as a failure. All results demonstrated in this section are obtained by using the function (26) and adopting the initial point $x^{(0)}=0$ for the algorithms. The maximum number of iterations for the iterative algorithms is set to 150. If an algorithm cannot meet the stopping criterion (27) within 150 iterations, the algorithm is treated as unsuccessful for the underlying recovery problem. Since NTP and NTPq use the orthogonal projection at every step, they usually outperform NT and NTq in signal recovery. Thus we only demonstrate the performance of NTP and NTPq in this section.

V-A Choice of regularization parameter: $\alpha$

The parameter $\alpha$ in the NT-type algorithms can take any value in $(0,\infty)$. To see how the value of $\alpha$ affects the performance of the algorithms, we compare the success frequencies of the algorithms for signal recovery with several different values of $\alpha$. The sparsity levels of the signals range from 100 to 450 with stepsize 5, and the measurements are taken as $y:=Ax$ for every realized random pair $(A,x)$. Our initial simulations indicate that the number of inner iterations of NTPq does not remarkably affect its overall performance. Thus we mainly consider the algorithm with $q=1$, i.e., the NTP algorithm, in this experiment. For each given sparsity level $k$, 100 random trials are used to estimate the success rates of NTP, and the algorithm is performed for up to a total of 150 iterations. Fig. 1 shows the results for NTP with several different values of $\alpha$. The steplength is fixed as $\lambda_{p}\equiv 2$ (this stepsize is suggested by simulations; see the next subsection for details). The results in Fig. 1 indicate that for Gaussian matrices of size $1000\times 8000$, $\alpha=5$ is a reasonable choice for NTP. However, we should keep in mind that such a choice of the parameter depends on the size/class of problem instances and should not be generalized to other sizes/classes of problems.
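A hedged sketch (NumPy only) of one cell of this success-rate experiment: a column-normalized Gaussian matrix, a random $k$-sparse signal, exact measurements $y:=Ax$, and the recovery judged by criterion (27). Here ntp refers to the illustrative routine sketched after the NT/NTP description in Section II-C (an assumption, not the authors' code), and lam=2, alpha=5 follow the choices discussed in Sections V-A and V-B.

```python
import numpy as np

def recovery_trial(m, n, k, lam=2.0, alpha=5.0, rng=None):
    """Run one random recovery trial; returns True on success under criterion (27)."""
    rng = rng or np.random.default_rng()
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0)                    # normalize columns
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)    # uniformly distributed positions
    x[support] = rng.standard_normal(k)               # iid standard normal nonzeros
    y = A @ x
    x_hat = ntp(A, y, k, lam=lam, alpha=alpha, max_iter=150)   # ntp: sketch from Section II-C
    return np.linalg.norm(x_hat - x) / np.linalg.norm(x) <= 1e-5

# Example (commented out; heavy for the paper's sizes):
# success_rate = np.mean([recovery_trial(1000, 8000, 200) for _ in range(100)])
```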

Figure 1: Success rates of NTP with $\alpha=2,3,5,8$.

V-B Influence of iterative steplength: $\lambda_{p}$

In traditional function minimization, moving from the current iterate in the direction of the negative gradient with a small steplength guarantees the descent of the function value. Thus the steplength in traditional optimization methods is usually smaller than 1. However, when solving the SCO problem (1), this property of the steplength is lost, since the main purpose of an iterative algorithm for SCO is to identify the correct support of the solution rather than to seek an immediate descent of the objective value. Thus when solving the SCO, the steplength of an iterative algorithm may not necessarily be smaller than 1. Indeed, a large amount of simulations indicate that efficient steplengths larger than 1 are commonly observed when solving an SCO problem. For $1000\times 8000$ Gaussian sensing matrices, we carried out experiments to show how the steplength might affect the performance of NTP. The random pairs $(A,x)$ are generated exactly as in the previous subsection, and we set $\alpha=5$ for NTP in our experiments. For every given sparsity level and steplength, 100 trials were used to estimate the success rate of the algorithm. The results for $\lambda_{p}=0.1,0.5,1,1.5,2$ and $2.5$ are summarized in Fig. 2, which indicates that $\lambda_{p}=2$ is relatively good among the values tested.

Figure 2: Success rates of NTP with different λp.

V-C Number of iterations

The random instances of SCO problems are generated in the same way as in the previous subsections. As suggested by the above experiments, we set α ≡ 5 and λp ≡ 2 in NTP. We test the recovery ability of NTP by running the algorithm with several different numbers of iterations: 20, 40, 80, 120 and 150. For every sparsity level k ranging from 100 to 450 with stepsize 5, we use 100 random trials to estimate the success rate of the algorithm. The results are summarized in Fig. 3, in which IT denotes the total number of iterations performed. The results indicate that the overall performance of NTP improves as the total number of iterations increases. However, once the algorithm has performed enough iterations, say IT > 100, further iterations do not remarkably improve its performance. This experiment indicates that, for the class of SCO problems tested, 150 iterations are generally enough for the algorithm to work well. Thus, in the remaining experiments, we set 150 as the maximum number of iterations, and if an algorithm cannot solve the problem within 150 iterations, it is counted as a failure for that problem.

Figure 3: Comparison of success rates of NTP performed with different numbers of iterations.

V-D Runtime Comparison

Given m = 1000 and n = 8000, we consider a series of vectors with different sparsity levels: k = βm, where β (called the oversampling ratio) takes 11 different values between 0.05 and 0.25, namely β = 0.05 + 0.02i for i = 0, 1, …, 10. For every given k = βm, we use NTP, NTP5, OMP, HTP, CoSaMP and SP to recover the k-sparse vector from the measurements y := Ax. For every given β, the average time required to recover the sparse vector is calculated over 50 random trials. The results are shown in Fig. 4, from which it can be seen that the average time required by the NTP algorithm is lower than that of OMP and NTP5, and the average time required by NTP5 to recover the signals is very similar to that of OMP. We also observe that when β is relatively small, the runtime of the traditional CoSaMP, SP and HTP is lower than that of NTP and NTP5.
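The timing protocol can be sketched as follows; the solver implementations (NTP, NTP5, OMP, HTP, CoSaMP, SP) are assumed to be supplied elsewhere as callables with a common signature and are not reproduced here.

import time
import numpy as np

def average_runtimes(solvers, m=1000, n=8000, trials=50, seed=0):
    # `solvers` maps names to callables solve(A, y, k); wall-clock time per
    # recovery is averaged over `trials` random instances for each ratio beta.
    rng = np.random.default_rng(seed)
    betas = 0.05 + 0.02 * np.arange(11)               # oversampling ratios 0.05, 0.07, ..., 0.25
    avg = {name: [] for name in solvers}
    for beta in betas:
        k = int(round(beta * m))
        total = {name: 0.0 for name in solvers}
        for _ in range(trials):
            A = rng.standard_normal((m, n)) / np.sqrt(m)
            x = np.zeros(n)
            x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
            y = A @ x
            for name, solve in solvers.items():
                t0 = time.perf_counter()
                solve(A, y, k)
                total[name] += time.perf_counter() - t0
        for name in solvers:
            avg[name].append(total[name] / trials)
    return betas, avg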

Figure 4: Runtime comparison of several algorithms.

V-E Comparison of success frequencies

The performance of NTP, NTPq and HTP depends on the choice of the steplength λp, and the performance of NTP and NTPq also depends on the choice of α. We set α = 5 and λp ≡ 2 in NTP and NTP5, and we use the same steplength λp ≡ 2 in HTP. For every random pair (A, x), where x is a sparse vector, the accurate measurements are given by y := Ax. The success frequencies of NTP, NTP5, HTP, OMP, SP and CoSaMP are estimated over 100 random examples of SCO problems. All iterative algorithms are performed for a total of 150 iterations, except for OMP which, by its structure, performs a total of k iterations, where k is the sparsity level of the sparse signal. The results are demonstrated in Fig. 5, from which one can see that NTP and NTP5 perform similarly and both outperform the traditional HTP, OMP, SP and CoSaMP. However, OMP, SP and CoSaMP have their own advantage over HTP and the NT-type algorithms in the sense that they do not need any parameter or steplength to solve the SCO problems.

Figure 5: Comparison of success frequencies in noiseless situations. 100 random examples were attempted and the recovery criterion (27) was adopted.

We also compare the algorithms in the setting of noisy measurements. In this setting, the recovery criterion is set as

x(p)x2/x2103,\|x^{(p)}-x\|_{2}/\|x\|_{2}\leq 10^{-3}, (28)

and the inaccurate measurements are given as y := Ax + ξv, where v is a normalized Gaussian noise vector and ξ > 0 reflects the noise level. The success rates of the algorithms are estimated over 100 random examples, and the algorithms are performed for the same number of iterations as in the noiseless case; α = 5 and λp ≡ 2 are still used in this experiment. Setting ξ = 0.001, we obtain the results shown in Fig. 6, which indicates that NTP and NTP5 remain efficient and can compete with several other algorithms in sparse signal recovery. Changing the noise level from ξ = 0.001 to ξ = 0.01 and repeating the experiment, we obtain the results demonstrated in Fig. 7, which indicate that such a change in noise level significantly affects the performance of CoSaMP, but does not remarkably affect the performance of the other algorithms. From Fig. 6 and Fig. 7, it can be observed that NTP and NTP5 are robust in signal recovery when the measurements are slightly inaccurate.
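The noisy measurements can be generated as in the following sketch, where "normalized Gaussian noise vector" is read as a Gaussian vector scaled to unit 2-norm (an assumption on the normalization).

import numpy as np

def noisy_measurements(A, x, xi, rng=None):
    # y = A x + xi * v, with v a Gaussian vector normalized to unit 2-norm
    rng = np.random.default_rng() if rng is None else rng
    g = rng.standard_normal(A.shape[0])
    return A @ x + xi * (g / np.linalg.norm(g))

Here xi = 0.001 and xi = 0.01 correspond to the two noise levels used for Figs. 6 and 7.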

Figure 6: Comparison of success rates in noisy situations: ξ=0.001.\xi=0.001. 100 random examples were attempted and the recovery criterion (28) was adopted.
Figure 7: Comparison of success rates in noisy situations: ξ=0.01.\xi=0.01. 100 random examples were attempted and the recovery criterion (28) was adopted.

VI Conclusions

The natural thresholding (NT) algorithms for sparse signal recovery have been developed in this paper. This class of algorithms was derived through the first-order approximation of the regularized optimization model for optimal kk-thresholding. The concept of concave binary regularization plays a vital role in this development. The guaranteed performance of the NT-type algorithms in sparse signal recovery has been shown under the restricted isometry property together with the concavity of the objective function of the regularized optimization model. Simulations via random examples show that the proposed algorithms can compete with the traditional HTP, OMP, CoSaMP and SP algorithms.

Concavity is imposed to establish the main results (Theorems III.4 and III.6) of this paper. A worthwhile direction for future work is the analysis of NT-type algorithms under other nonconvex conditions. Another is the development of algorithms with adaptive steplength λp and parameter α, updated during the iterations according to a suitable rule, so that the efficiency and performance of the NT-type algorithms can be further enhanced.

Appendix A: Proof of Lemma II.3

Proof. Let ϕ(w)\phi(w) be a binary regularization function (see Definition II.1). Let w(α)w(\alpha) denote a solution to the problem (II-C) with parameter α>0.\alpha>0. Let {α}\{\alpha_{\ell}\} be any positive sequence which is strictly increasing and tends to \infty as .\ell\to\infty. For notational convenience, denote by w:=w(α)w_{\ell}:=w(\alpha_{\ell}) the optimal solution to (II-C) with α:=α.\alpha:=\alpha_{\ell}. Thus by the optimality of ww_{\ell} and w1w_{\ell-1}, one has

gα(w)gα(w1),gα1(w1)gα1(w)g_{\alpha_{\ell}}(w_{\ell})\leq g_{\alpha_{\ell}}(w_{\ell-1}),~{}~{}g_{\alpha_{\ell-1}}(w_{\ell-1})\leq g_{\alpha_{\ell-1}}(w_{\ell})

That is,

f(w)+αϕ(w)f(w1)+αϕ(w1),f(w_{\ell})+\alpha_{\ell}\phi(w_{\ell})\leq f(w_{\ell-1})+\alpha_{\ell}\phi(w_{\ell-1}),
f(w1)+α1ϕ(w1)f(w)+α1ϕ(w).f(w_{\ell-1})+\alpha_{\ell-1}\phi(w_{\ell-1})\leq f(w_{\ell})+\alpha_{\ell-1}\phi(w_{\ell}). (29)

Note that α_{ℓ-1} < α_ℓ. Adding the two inequalities and cancelling terms yields (α_ℓ − α_{ℓ-1})(ϕ(w_ℓ) − ϕ(w_{ℓ-1})) ≤ 0, and hence

ϕ(w)ϕ(w1),\phi(w_{\ell})\leq\phi(w_{\ell-1}),

i.e., the sequence {ϕ(w_ℓ)} is nonincreasing. Thus ϕ(w_ℓ) → ϕ^* as ℓ → ∞, where ϕ^* is a nonnegative number since ϕ(w) is nonnegative over the set P ⊆ D. Denote by ϕ_min the minimum of ϕ(w) over P. By the definition of binary regularization, ϕ_min is attained at, and only at, k-sparse binary vectors in P. It immediately follows from ϕ(w_ℓ) ≥ ϕ_min that ϕ^* ≥ ϕ_min. We now further show that ϕ^* = ϕ_min. Let w^* be any given k-sparse binary vector in P. Then, by the optimality of w_ℓ again, we have

f(w)+αϕ(w)f(w)+αϕ(w)=f(w)+αϕminf(w_{\ell})+\alpha_{\ell}\phi(w_{\ell})\leq f(w^{*})+\alpha_{\ell}\phi(w^{*})=f(w^{*})+\alpha_{\ell}\phi_{\min}

which together with the fact f(w)0f(w_{\ell})\geq 0 implies that

α(ϕ(w)ϕmin)f(w)\alpha_{\ell}(\phi(w_{\ell})-\phi_{\min})\leq f(w^{*})

Therefore, one has

0ϕϕminϕ(w)ϕminf(w)/α,0\leq\phi^{*}-\phi_{\min}\leq\phi(w_{\ell})-\phi_{\min}\leq f(w^{*})/\alpha_{\ell},

which together with α_ℓ → ∞ implies that ϕ^* = ϕ_min. This, together with the convergence of {ϕ(w_ℓ)} and the continuity of ϕ, implies that any accumulation point of {w_ℓ} must be a k-sparse binary vector (since the minimum of ϕ over P is attained only at k-sparse binary vectors). We now further show that the sequence {f(w_ℓ)} is nondecreasing and converges to the optimal value of (3), and that any accumulation point of {w_ℓ} is an optimal solution to (3). In fact, by (29), we have

f(w1)+α1ϕ(w1)f(w)+α1ϕ(w),f(w_{\ell-1})+\alpha_{\ell-1}\phi(w_{\ell-1})\leq f(w_{\ell})+\alpha_{\ell-1}\phi(w_{\ell}),

which implies that

f(w1)f(w)α1(ϕ(w)ϕ(w1))0f(w_{\ell-1})-f(w_{\ell})\leq\alpha_{\ell-1}(\phi(w_{\ell})-\phi(w_{\ell-1}))\leq 0

Thus f(w_{ℓ-1}) ≤ f(w_ℓ), which means that the sequence {f(w_ℓ)} is nondecreasing. Let ŵ be any optimal solution of (3); then, by the optimality of w_{ℓ-1}, one has

f(w1)+α1ϕ(w1)f(w^)+α1ϕ(w^),f(w_{\ell-1})+\alpha_{\ell-1}\phi(w_{\ell-1})\leq f(\widehat{w})+\alpha_{\ell-1}\phi(\widehat{w}),

where ϕ(w^)\phi(\widehat{w}) must be the minimum value of ϕ\phi over DD since w^\widehat{w} is kk-sparse and binary. Thus the above inequality implies that

f(w1)f(w^)+α1(ϕ(w^)ϕ(w1))f(w^),f(w_{\ell-1})\leq f(\widehat{w})+\alpha_{\ell-1}(\phi(\widehat{w})-\phi(w_{\ell-1}))\leq f(\widehat{w}),

so the nondecreasing sequence {f(w)}\{f(w_{\ell})\} is bounded above, and hence

f(w_ℓ) → f^* ≤ f(ŵ)  as  ℓ → ∞.

Let w̄ be any accumulation point of {w_ℓ} (as shown above, w̄ is k-sparse and binary). It follows from the above relation that f^* = f(w̄) ≤ f(ŵ). Since the right-hand side is the minimum value of (3), we deduce that f(w̄) = f(ŵ). Hence any accumulation point of {w_ℓ} is an optimal solution to (3). \Box

Appendix B: Proof of Theorem III.4

Proof: Let x^{(p)} be the current iterate generated by NT or NTq, and let u^{(p)} be given by (2). The inner iterations of NT and NTq start from the vector w^{-} which satisfies w^{-} ⊗ u^{(p)} = ℋ_k(u^{(p)}). The NT algorithm performs only one inner iteration, while NTq performs q inner iterations. It follows from Lemma III.2 that the output w^{+} of the inner iterations of the NT or NTq algorithm satisfies

yA(u(p)w+)2yA(u(p)w)2.\|y-A(u^{(p)}\otimes w^{+})\|_{2}\leq\|y-A(u^{(p)}\otimes w^{-})\|_{2}.

Note that u(p)w=k(u(p))u^{(p)}\otimes w^{-}={\cal H}_{k}(u^{(p)}) and x(p+1)=u(p)w+x^{(p+1)}=u^{(p)}\otimes w^{+}. The above inequality is written as

yAx(p+1)2yAk(u(p))2.\|y-Ax^{(p+1)}\|_{2}\leq\|y-A{\cal H}_{k}(u^{(p)})\|_{2}. (30)

We also note that

y=Ax+ν=AxS+(AxS¯+ν)=AxS+ν,y=Ax+\nu=Ax_{S}+(Ax_{\overline{S}}+\nu)=Ax_{S}+\nu^{\prime}, (31)

where ν′ = Ax_{S̄} + ν. Since x_S − x^{(p+1)} is (2k)-sparse, by (31), the triangle inequality, and the definition of δ_{2k}, one has

yAx(p+1)2\displaystyle\|y-Ax^{(p+1)}\|_{2} =A(xSx(p+1))+ν2\displaystyle=\|A(x_{S}-x^{(p+1)})+\nu^{\prime}\|_{2}
A(xSx(p+1))2ν2\displaystyle\geq\|A(x_{S}-x^{(p+1)})\|_{2}-\|\nu^{\prime}\|_{2}
1δ2kxSx(p+1)2ν2.\displaystyle\geq\sqrt{1-\delta_{2k}}\|x_{S}-x^{(p+1)}\|_{2}-\|\nu^{\prime}\|_{2}. (32)

Similarly, since xSk(u(p))x_{S}-{\cal H}_{k}(u^{(p)}) is (2k)(2k)-sparse, we have

y\displaystyle\|y Ak(u(p))2\displaystyle-A{\cal H}_{k}(u^{(p)})\|_{2}
=A(xSk(u(p)))+ν2\displaystyle=\|A(x_{S}-{\cal H}_{k}(u^{(p)}))+\nu^{\prime}\|_{2}
A(xSk(u(p)))2+ν2\displaystyle\leq\|A(x_{S}-{\cal H}_{k}(u^{(p)}))\|_{2}+\|\nu^{\prime}\|_{2}
1+δ2kxSk(u(p))2+ν2\displaystyle\leq\sqrt{1+\delta_{2k}}\|x_{S}-{\cal H}_{k}(u^{(p)})\|_{2}+\|\nu^{\prime}\|_{2} (33)

Merging (30), (32) and (33) yields

xS\displaystyle\|x_{S} x(p+1)2\displaystyle-x^{(p+1)}\|_{2}
1+δ2k1δ2kxSk(u(p))2+2ν21δ2k.\displaystyle\leq\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}}\|x_{S}-{\cal H}_{k}(u^{(p)})\|_{2}+\frac{2\|\nu^{\prime}\|_{2}}{\sqrt{1-\delta_{2k}}}. (34)

Let S=supp(k(u(p))).S^{*}=\textrm{supp}({\cal H}_{k}(u^{(p)})). Note that the set SSS^{*}\cup S, where S=Lk(x),S=L_{k}(x), can be decomposed into three disjoint sets S\S,S\SS^{*}\backslash S,S\backslash S^{*} and SS.S\cap S^{*}. Thus

(xS\displaystyle\|(x_{S}- u(p))SS22=(xSu(p))S\S22\displaystyle u^{(p)})_{S^{*}\cup S}\|_{2}^{2}=\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}
+(xSu(p))S\S22+(xSu(p))SS22.\displaystyle+\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S\cap S^{*}}\|_{2}^{2}. (35)

Without loss of generality, we assume that ‖(x_S − u^{(p)})_{S^*∪S}‖_2^2 ≠ 0. Under this assumption, at least one of the three terms on the right-hand side of (35) is nonzero; without loss of generality, we assume that the first term is nonzero, i.e., ‖(x_S − u^{(p)})_{S^*∖S}‖_2^2 ≠ 0. Then there exist numbers μ_1 ≥ 0 and μ_2 ≥ 0 such that

(xSu(p))S\S2\displaystyle\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2} =μ1(xSu(p))S\S2,\displaystyle=\mu_{1}\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}, (36)
(xSu(p))SS2\displaystyle\|(x_{S}-u^{(p)})_{S\cap S^{*}}\|_{2} =μ2(xSu(p))S\S2.\displaystyle=\mu_{2}\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}. (37)

If ‖(x_S − u^{(p)})_{S^*∪S}‖_2^2 = 0, then all terms on the right-hand side of (35) are zero, and hence (36) and (37) remain valid in this case. Substituting (36) and (37) into (35) yields

(xSu(p))SS22=(1+μ12+μ22)(xSu(p))S\S22.\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}=(1+\mu_{1}^{2}+\mu_{2}^{2})\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}. (38)

By the definition of SS and SS^{*}, we have

xS\displaystyle\|x_{S}- k(u(p))22=(xSk(u(p)))SS22\displaystyle{\cal H}_{k}(u^{(p)})\|_{2}^{2}=\|(x_{S}-{\cal H}_{k}(u^{(p)}))_{S^{*}\cup S}\|_{2}^{2}
=\displaystyle= (xSk(u(p)))S\S22+(xSk(u(p)))SS22\displaystyle\|(x_{S}-{\cal H}_{k}(u^{(p)}))_{S^{*}\backslash S}\|_{2}^{2}+\|(x_{S}-{\cal H}_{k}(u^{(p)}))_{S^{*}\cap S}\|_{2}^{2}
+(xSk(u(p)))S\S22\displaystyle+\|(x_{S}-{\cal H}_{k}(u^{(p)}))_{S\backslash S^{*}}\|_{2}^{2}
=\displaystyle= (xSu(p))S\S22+(xSu(p))SS22\displaystyle\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S^{*}\cap S}\|_{2}^{2}
+(xSk(u(p)))S\S22.\displaystyle+\|(x_{S}-{\cal H}_{k}(u^{(p)}))_{S\backslash S^{*}}\|_{2}^{2}.

Thus by Lemma III.3, we have

xS\displaystyle\|x_{S}- k(u(p))22\displaystyle{\cal H}_{k}(u^{(p)})\|_{2}^{2}
\displaystyle\leq (xSu(p))S\S22+(xSu(p))SS22\displaystyle\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S^{*}\cap S}\|_{2}^{2}
+((xSu(p))S\S2+(xSu(p))S\S2)2\displaystyle+\left(\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}+\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2}\right)^{2}
=\displaystyle= (xSu(p))S\S22+(xSu(p))SS22\displaystyle\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S^{*}\cap S}\|_{2}^{2}
+(xSu(p))S\S22+(xSu(p))S\S22\displaystyle+\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2}^{2}
+2(xSu(p))S\S2(xSu(p))S\S2\displaystyle+2\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2}
=\displaystyle= (xSu(p))SS22+(xSu(p))S\S22\displaystyle\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}
+2(xSu(p))S\S2(xSu(p))S\S2.\displaystyle+2\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2}.

By (36) and (38), the inequality above can be written as

xSk(u(p))22\displaystyle\|x_{S}-{\cal H}_{k}(u^{(p)})\|_{2}^{2}
(xSu(p))SS22+(1+2μ1)(xSu(p))S\S22\displaystyle\leq\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}+(1+2\mu_{1})\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}
=(xSu(p))SS22+1+2μ11+μ12+μ22(xSu(p))SS22\displaystyle=\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}+\frac{1+2\mu_{1}}{1+\mu_{1}^{2}+\mu_{2}^{2}}\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}
=(1+μ1)2+1+μ221+μ12+μ22(xSu(p))SS22\displaystyle=\frac{(1+\mu_{1})^{2}+1+\mu_{2}^{2}}{1+\mu_{1}^{2}+\mu_{2}^{2}}\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}
maxμ10(maxμ20(1+μ1)2+1+μ221+μ12+μ22)(xSu(p))SS22\displaystyle\leq\max_{\mu_{1}\geq 0}\left(\max_{\mu_{2}\geq 0}\frac{(1+\mu_{1})^{2}+1+\mu_{2}^{2}}{1+\mu_{1}^{2}+\mu_{2}^{2}}\right)\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}
=μ(xSu(p))SS22,\displaystyle=\mu^{*}\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}, (39)

where

μ\displaystyle\mu^{*} =maxμ10(maxμ20(1+μ1)2+1+μ221+μ12+μ22)\displaystyle=\max_{\mu_{1}\geq 0}\left(\max_{\mu_{2}\geq 0}\frac{(1+\mu_{1})^{2}+1+\mu_{2}^{2}}{1+\mu_{1}^{2}+\mu_{2}^{2}}\right)
=maxμ10(1+μ1)2+11+μ12=5+555=(5+12)2.\displaystyle=\max_{\mu_{1}\geq 0}\frac{(1+\mu_{1})^{2}+1}{1+\mu_{1}^{2}}=\frac{5+\sqrt{5}}{5-\sqrt{5}}=\left(\frac{\sqrt{5}+1}{2}\right)^{2}.

Indeed, for any fixed μ_1, the ratio in the inner maximization is decreasing in μ_2^2 (its numerator exceeds its denominator by 1 + 2μ_1 ≥ 1), so the maximum with respect to μ_2 is achieved at μ_2 = 0; moreover, the derivative of ((1+μ_1)^2+1)/(1+μ_1^2) with respect to μ_1 is 2(1 − μ_1 − μ_1^2)/(1+μ_1^2)^2, so the maximum with respect to μ_1 is achieved at μ_1 = (√5−1)/2, the positive root of 1 − μ_1 − μ_1^2 = 0. By (31) and Lemma III.1, we obtain that

(xSu(p))SS2\displaystyle\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}
=(xSx(p)AT(yAx(p)))SS2\displaystyle=\|(x_{S}-x^{(p)}-A^{T}(y-Ax^{(p)}))_{S^{*}\cup S}\|_{2}
=[(IATA)(xSx(p))]SS+[ATν]SS2\displaystyle=\|[(I-A^{T}A)(x_{S}-x^{(p)})]_{S^{*}\cup S}+[A^{T}\nu^{\prime}]_{S^{*}\cup S}\|_{2}
[(IATA)(xSx(p))]SS2+ATν2\displaystyle\leq\|[(I-A^{T}A)(x_{S}-x^{(p)})]_{S^{*}\cup S}\|_{2}+\|A^{T}\nu^{\prime}\|_{2}
δ3kxSx(p)2+ATν2.\displaystyle\leq\delta_{3k}\|x_{S}-x^{(p)}\|_{2}+\|A^{T}\nu^{\prime}\|_{2}. (40)

The last inequality follows from Lemma III.1 with |(S^*∪S) ∪ supp(x_S − x^{(p)})| ≤ 3k. Combining (39) and (40) yields

xSk(u(p))2\displaystyle\|x_{S}-{\cal H}_{k}(u^{(p)})\|_{2} 5+12(δ3kxSx(p)2+ATν2).\displaystyle\leq\frac{\sqrt{5}+1}{2}(\delta_{3k}\|x_{S}-x^{(p)}\|_{2}+\|A^{T}\nu^{\prime}\|_{2}).

Denote ω := (√5+1)/2. Merging the above relation with (34), we obtain

xSx(p+1)2\displaystyle\|x_{S}-x^{(p+1)}\|_{2} ωδ3k1+δ2k1δ2kxSx(p)2\displaystyle\leq\omega\delta_{3k}\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}}\|x_{S}-x^{(p)}\|_{2}
+ωATν21+δ2k1δ2k+2ν21δ2k\displaystyle~{}~{}+\omega\|A^{T}\nu^{\prime}\|_{2}\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}}+\frac{2\|\nu^{\prime}\|_{2}}{\sqrt{1-\delta_{2k}}}
=ηxSx(p)2+C1ATν2+C2ν2,\displaystyle=\eta\|x_{S}-x^{(p)}\|_{2}+C_{1}\|A^{T}\nu^{\prime}\|_{2}+C_{2}\|\nu^{\prime}\|_{2},

where C1,C_{1}, C2C_{2} and η\eta are given by (19) and (20), respectively. Note that δ2kδ3k\delta_{2k}\leq\delta_{3k} which implies that 1+δ2k1δ2k1+δ3k1δ3k.\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}}\leq\sqrt{\frac{1+\delta_{3k}}{1-\delta_{3k}}}. Thus to ensure η<1,\eta<1, it is sufficient to require that ωδ3k1+δ3k1δ3k<1\omega\delta_{3k}\sqrt{\frac{1+\delta_{3k}}{1-\delta_{3k}}}<1 which, by squaring and rearranging terms, is written equivalently as

h(δ3k):=ωδ3k3+ωδ3k2+δ3k1<0.h(\delta_{3k}):=\omega\delta_{3k}^{3}+\omega\delta_{3k}^{2}+\delta_{3k}-1<0.

It is easy to check that the above inequality is guaranteed if δ_{3k} < γ^* ≈ 0.4712, where γ^* is the unique real root of the univariate equation h(t) = ωt^3 + ωt^2 + t − 1 = 0 in the interval [0,1]. In fact, h(t) is strictly increasing over [0,1] and h(0) = −1 < 0, which implies that h(t) < 0 for any 0 ≤ t < γ^*. Thus η < 1 is guaranteed if δ_{3k} < γ^*. \Box
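Remark: the constant γ^* can be checked numerically. The following sketch locates, by bisection, the root in [0,1] of the cubic h as stated above.

import math

omega = (math.sqrt(5) + 1) / 2

def h(t):
    # h(t) = omega*t^3 + omega*t^2 + t - 1, strictly increasing on [0, 1]
    return omega * t ** 3 + omega * t ** 2 + t - 1

lo, hi = 0.0, 1.0            # h(0) = -1 < 0 and h(1) = 2*omega > 0
for _ in range(60):          # plain bisection
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)

print((lo + hi) / 2)         # prints the root 0.4712..., matching gamma* above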

Appendix C: Proof of Proposition IV.1

Proof. (i) This has been shown in Section II-C.

(ii) Consider the regularization (14). The second-order derivative of log(1 + (w_i + 1/2)(3/2 − w_i)) with respect to w_i is (−6 + 2β_i)/(1 + β_i)^2, where β_i = (w_i + 1/2)(3/2 − w_i). Clearly, 0 < β_i ≤ 1 over the interval −1/2 < w_i < 3/2, which is an open neighborhood of the closed interval 0 ≤ w_i ≤ 1. It is then easy to verify that the supremum of (−6 + 2β_i)/(1 + β_i)^2 over this open interval is −1. Therefore,

2gα(w)\displaystyle\nabla^{2}g_{\alpha}(w) =2U(p)ATAU(p)+αdiag(6+2βi(1+βi)2,i=1,,n)\displaystyle=2U^{(p)}A^{T}AU^{(p)}+\alpha\textrm{diag}\left(\frac{-6+2\beta_{i}}{(1+\beta_{i})^{2}},i=1,\dots,n\right)
=2U(p)ATAU(p)αI\displaystyle=2U^{(p)}A^{T}AU^{(p)}-\alpha I
+α[I+diag(6+2βi(1+βi)2,i=1,,n)]\displaystyle~{}~{}~{}+\alpha\left[I+\textrm{diag}\left(\frac{-6+2\beta_{i}}{(1+\beta_{i})^{2}},i=1,\dots,n\right)\right]
2U(p)ATAU(p)αI,\displaystyle\preceq 2U^{(p)}A^{T}AU^{(p)}-\alpha I,

where the last relation follows from the fact

I+diag(6+2βi(1+βi)2,i=1,,n)0.I+\textrm{diag}\left(\frac{-6+2\beta_{i}}{(1+\beta_{i})^{2}},i=1,\dots,n\right)\preceq 0.

Thus if αα:=2λmax(U(p)ATAU(p)),\alpha\geq\alpha^{*}:=2\lambda_{\max}(U^{(p)}A^{T}AU^{(p)}), then 2gα(w)0\nabla^{2}g_{\alpha}(w)\preceq 0 and hence the function gα(w)g_{\alpha}(w) is concave.
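As a numerical illustration (not part of the proof), the sketch below checks that with α = α^* the Hessian displayed above is negative semidefinite at a point of [0,1]^n; the diagonal matrix standing in for U^{(p)} is arbitrary here, since only the eigenvalue bound on U^{(p)}A^TAU^{(p)} enters the argument.

import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)
U = np.diag(rng.standard_normal(n))        # arbitrary diagonal stand-in for U^{(p)}
M = 2 * U @ A.T @ A @ U
alpha = np.linalg.eigvalsh(M)[-1]          # alpha* = 2*lambda_max(U^{(p)} A^T A U^{(p)})

w = rng.random(n)                          # any point with 0 <= w_i <= 1
beta = (w + 0.5) * (1.5 - w)               # beta_i = (w_i + 1/2)(3/2 - w_i), so 0 < beta_i <= 1
H = M + alpha * np.diag((-6 + 2 * beta) / (1 + beta) ** 2)
print(np.linalg.eigvalsh(H).max() <= 1e-8) # True: the Hessian is negative semidefinite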

(iii) This can be shown by an analysis similar to that of (ii); the details are omitted. \Box
