
Natural Thresholding Algorithms for Signal Recovery with Sparsity

Yun-Bin Zhao and Zhi-Quan Luo

This work was supported by the National Natural Science Foundation of China (NSFC) under grants #12071307 and #61571384. Part of the work was done while the first author worked at the School of Mathematics, University of Birmingham, Birmingham B15 2TT, United Kingdom. Yun-Bin Zhao is with the Shenzhen Research Institute of Big Data, Chinese University of Hong Kong, Shenzhen, Guangdong Province, China (Email: [email protected]). Zhi-Quan Luo is with the Shenzhen Research Institute of Big Data, Chinese University of Hong Kong, Shenzhen, Guangdong Province, China (Email: [email protected]).
Abstract

The algorithms based on the technique of optimal $k$-thresholding (OT) were recently proposed for signal recovery, and they are very different from the traditional family of hard thresholding methods. However, the computational cost for OT-based algorithms remains high at the current stage of their development. This stimulates the development of the so-called natural thresholding (NT) algorithm and its variants in this paper. The family of NT algorithms is developed through the first-order approximation of the so-called regularized optimal $k$-thresholding model, and thus the computational cost for this family of algorithms is significantly lower than that of the OT-based algorithms. The guaranteed performance of NT-type algorithms for signal recovery from noisy measurements is shown under the restricted isometry property and concavity of the objective function of the regularized optimal $k$-thresholding model. Empirical results indicate that the NT-type algorithms are robust and very comparable to several mainstream algorithms for sparse signal recovery.

Index Terms:
Signal recovery, compressed sensing, thresholding algorithms, restricted isometry property, concave minimization, regularization method

I Introduction

Under the sparsity assumption, the problem of signal recovery from nonadaptive linear measurements can be formulated as a sparse optimization problem which may take different forms, depending on such factors as recovery environment, signal structure, and the prior information available for the signal [1, 2]. A fundamental mathematical model for sparsity-based signal recovery can be described as follows. Let $A$ be an $m\times n$ sensing matrix with $m\ll n$, and let $y:=Ax+\nu\in\mathbb{R}^{m}$ be the measurements for the signal $x\in\mathbb{R}^{n}$, where $\nu\in\mathbb{R}^{m}$ are the measurement errors. When $x$ is $k$-sparse or $k$-compressible, the recovery of $x$ can be formulated as the following sparsity-constrained optimization (SCO) problem:

\min_{z\in\mathbb{R}^{n}}\{\|Az-y\|_{2}^{2}:~\|z\|_{0}\leq k\}, (1)

where $k$ is a given integer number reflecting the number of significant entries of the signal to recover, and $\|z\|_{0}$ denotes the number of nonzero entries of the vector $z$. In this paper, we say that $z$ is $k$-sparse if $\|z\|_{0}\leq k$, and $z$ is $k$-compressible if it can be approximated by a $k$-sparse vector. The SCO problem plays a vital role in the development of algorithms for compressed-sensing-based signal recovery (see, e.g., [1, 2, 3, 4]). Essentially, the model (1) is to find the best $k$-term approximation of the target signal, which best fits the acquired measurements. Similar models also arise in numerous areas such as statistical learning [5, 6, 7], wireless communication [8, 9], low-rank matrix recovery [10, 11, 12, 13, 14], linear inverse and optimization problems [15, 16, 17, 18].

While the SCO and related problems can be solved via convex optimization, nonconvex optimization and orthogonal matching pursuit (OMP) (see, e.g., [1, 2, 3, 4]), the thresholding methods with low computational complexity are particularly suited for solving the SCO model. Thus we focus our attention on the study of thresholding methods in this paper. The earlier thresholding methods mainly adopt the soft thresholding operator [19, 20, 21, 22]. The recent ones (e.g., [3, 23, 24, 25, 26, 27]) mainly use the hard thresholding operator, denoted by ${\cal H}_{k}(\cdot)$, which retains the $k$ largest magnitudes of a vector and sets the other entries to zeros. The latest developments of thresholding methods, including acceleration, matrix forms, dual thresholding and Newton-step-based modifications, can be found in such references as [7, 14, 28, 29, 30, 31, 32, 33]. However, performing hard thresholding on an incompressible iterate, generated by an iterative method, does not ensure the reduction of the objective value of (1) at every iteration. Motivated by this observation, the so-called optimal $k$-thresholding (OT) method was first introduced in [34], which allows the vector thresholding and objective reduction to be performed simultaneously. Based on the OT concept, the relaxed optimal $k$-thresholding (ROT) algorithm and its variants were also developed in [34]. A further analysis of these algorithms was performed in [35], and the algorithm using both the OT technique and a truncated gradient descent direction was studied in [37]. Compared to existing hard-thresholding algorithms, the strength of OT-type methods lies in their observed robustness for sparse signal recovery and steadiness in objective reduction in the course of iterations. However, the OT algorithm needs to solve a quadratic optimization problem at every iteration to obtain the optimal or nearly optimal $k$-thresholding of a vector, and thus the computation is expensive, especially for high-dimensional signal recovery.

This paper pursues a new algorithmic development which is significantly different from the existing ones in [34] and [35]. The main idea of the paper is to introduce an auxiliary regularized counterpart of the so-called binary OT model, which will be defined by (3) later in this paper. Based on such a regularized model, we utilize the traditional linearization technique (i.e., the first-order approximation method) to develop a new framework of algorithms which admits a low computational cost at every iteration, compared to general convex optimization methods. The thresholding technique adopted in this framework is referred to as natural thresholding (NT). Such thresholding is determined by the $k$ smallest entries of the gradient of the so-called regularized objective of the OT or ROT model, which will be recalled in the next section. The NT algorithm takes advantage of the 'nice structure' of the polytope with $k$-sparse binary extreme points. The main work of the paper includes:

  • (i)

    Develop a novel class of NT-type methods for signal recovery;

  • (ii)

    Rigorously prove the guaranteed performance of NT-type algorithms under some conditions.

We also carry out experiments on random problem instances to show that the NT-type algorithms are fast algorithms, and their performance is very comparable to that of several mainstream compressed sensing algorithms.

The paper is structured as follows. The NT-type algorithms are proposed in Section II. The theoretical analysis of the algorithms is performed in Section III, and a further discussion of regularization functions and parameters is given in Section IV. The simulation results are reported in Section V.

Notation: All vectors are column vectors unless otherwise stated. For a vector $z$ and a matrix $A$, $z^{T}$ and $A^{T}$ denote their transposes. We use $\textrm{supp}(z)=\{i:z_{i}\not=0\}$ to denote the support of the vector $z$, i.e., the index set of nonzero entries of $z$. We use $L_{k}(z)$ to denote the index set of the $k$ largest magnitudes of $z$, and ${\cal S}_{k}(z)$ to denote the index set of the $k$ smallest entries of $z$. When $L_{k}(z)$ (${\cal S}_{k}(z)$) is not uniquely determined, i.e., when there are two magnitudes (entries) being equal, we select the smallest entry indices to avoid ambiguity in index selection. Given $S\subseteq\{1,2,\dots,n\}$, we denote by $\overline{S}=\{1,2,\dots,n\}\backslash S$ the complement set and by $|S|$ the cardinality of $S$. Given $x\in\mathbb{R}^{n}$, the vector $x_{S}\in\mathbb{R}^{n}$ is obtained from $x$ by retaining the entries supported on $S$ and setting the other entries to zeros. Throughout the paper, $\|\cdot\|_{2}$ and $\|\cdot\|_{1}$ denote the $\ell_{2}$-norm and $\ell_{1}$-norm of a vector, respectively. For vectors $x$ and $z$, $x\otimes z$ is the Hadamard product (entry-wise product) of the two vectors. Also, we summarize the main acronyms of algorithms in the following table.

Acronyms Algorithms
IHT Iterative hard thresholding
HTP Hard thresholding pursuit
SP Subspace pursuit
CoSaMP Compressive sampling matching pursuit
OMP Orthogonal matching pursuit
OT Optimal $k$-thresholding
OTP Optimal $k$-thresholding pursuit
ROTP Relaxed optimal $k$-thresholding pursuit
NT Natural thresholding
NTP Natural thresholding pursuit
NTq Natural thresholding with inner iterations
NTPq Natural thresholding pursuit with inner iterations
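A minimal sketch (assumption: NumPy only; names are illustrative, not from the paper) of the index-selection operators defined in the Notation paragraph: $L_k$ returns the indices of the $k$ largest magnitudes, ${\cal S}_k$ the indices of the $k$ smallest entries, and ${\cal H}_k$ is the hard thresholding operator, with ties broken by taking the smallest indices as stated above.

```python
import numpy as np

def L_k(z, k):
    """Indices of the k largest magnitudes of z (smallest indices on ties)."""
    order = np.lexsort((np.arange(len(z)), -np.abs(z)))  # primary: |z| descending, ties: index ascending
    return np.sort(order[:k])

def S_k(z, k):
    """Indices of the k smallest entries of z (smallest indices on ties)."""
    order = np.lexsort((np.arange(len(z)), z))           # primary: z ascending, ties: index ascending
    return np.sort(order[:k])

def H_k(z, k):
    """Hard thresholding: keep the k largest magnitudes, zero out the rest."""
    out = np.zeros_like(z)
    idx = L_k(z, k)
    out[idx] = z[idx]
    return out

if __name__ == "__main__":
    z = np.array([0.1, -2.0, 0.5, 2.0, -0.3])
    print(L_k(z, 2), S_k(z, 2), H_k(z, 2))
```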

II Natural Thresholding Algorithms

Let us first recall the iterative hard thresholding (IHT), hard thresholding pursuit (HTP) and the newly developed optimal $k$-thresholding (OT) algorithm. These algorithms provide a basis for the development of our new algorithms called natural thresholding.

II-A Basic hard thresholding algorithms: IHT and HTP

Given $(y,A)$, the gradient of the error metric $\psi(x)=(1/2)\|y-Ax\|_{2}^{2}$ is given by $\nabla\psi(x)=-A^{T}(y-Ax)$. At the iterate $x^{(p)}$, let $u^{(p)}$ be the vector generated by the classic gradient descent method, i.e.,

u^{(p)}:=x^{(p)}+\lambda_{p}A^{T}(y-Ax^{(p)}), (2)

where $\lambda_{p}$ is a steplength. Using the operator ${\cal H}_{k}$ to generate an iterate satisfying the constraint of the problem (1) leads to the well-known IHT algorithm [15, 25, 26] and its enhanced version HTP [27]. A good feature of IHT and HTP is that they are easier to implement than other algorithms such as convex optimization [38, 39, 40, 41], orthogonal matching pursuit (OMP) [42, 43, 44], compressive sampling matching pursuit (CoSaMP) [45], and subspace pursuit (SP) [46].

IHT algorithm. Given the problem data $(A,y,k)$ and an initial point $x^{(0)}$, perform the following iteration until a stopping criterion is met: At $x^{(p)}$, choose $\lambda_{p}>0$ and set

x^{(p+1)}={\cal H}_{k}(u^{(p)}),

where $u^{(p)}$ is given by (2).

HTP algorithm. Given the problem data $(A,y,k)$ and an initial point $x^{(0)}$, perform the following iteration until a stopping criterion is met: At $x^{(p)}$, choose $\lambda_{p}>0$ and let $u^{(p)}$ be given by (2). Then set

x^{+}={\cal H}_{k}(u^{(p)}),
x^{(p+1)}=\arg\min_{z}\{\|y-Az\|_{2}^{2}:~\textrm{supp}(z)\subseteq\textrm{supp}(x^{+})\}.
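A hedged sketch (NumPy only) of one IHT and one HTP pass as described above; the fixed unit steplength and the least-squares debiasing over the selected support are my reading of the two schemes, and ties in the thresholding are not broken by the smallest-index rule here for brevity.

```python
import numpy as np

def hard_threshold(z, k):
    """Keep the k largest magnitudes of z, zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argpartition(np.abs(z), -k)[-k:]
    out[idx] = z[idx]
    return out

def iht_step(A, y, x, k, lam=1.0):
    u = x + lam * A.T @ (y - A @ x)          # gradient step (2)
    return hard_threshold(u, k)               # x^{(p+1)} = H_k(u^{(p)})

def htp_step(A, y, x, k, lam=1.0):
    x_plus = iht_step(A, y, x, k, lam)
    S = np.flatnonzero(x_plus)                # support of H_k(u^{(p)})
    x_new = np.zeros_like(x)
    sol, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)  # orthogonal projection onto supp(x^+)
    x_new[S] = sol
    return x_new
```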

Clearly, the $k$ terms of $u^{(p)}$ selected by ${\cal H}_{k}$ are not necessarily the best $k$ terms in the sense that they might not be the $k$ terms that best fit the linear measurements. As pointed out in [34], performing the hard thresholding of $u^{(p)}$ is independent of the error metric $\psi$. Thus it may happen that the value of the error metric at the point $x^{+}={\cal H}_{k}(u^{(p)})$ is larger than that at the current iterate $x^{(p)}$, i.e., $\|y-Ax^{+}\|_{2}>\|y-Ax^{(p)}\|_{2}$. This observation motivates the recent development of optimal $k$-thresholding algorithms in [34, 35].

II-B Optimal $k$-thresholding algorithms

As defined in [34], the optimal $k$-thresholding of a vector $u$, denoted by $Z_{k}^{\#}(u)$, is given by $Z_{k}^{\#}(u)=u\otimes w^{*}$, the Hadamard product of $u$ and $w^{*}$, where $w^{*}$ is the solution to the problem

\min_{w}\left\{\|y-A(u\otimes w)\|^{2}_{2}:~\textrm{e}^{T}w=k,~w\in\{0,1\}^{n}\right\},

where e is the vector of ones, and the vector $w$ is $k$-sparse and binary with exactly $k$ entries equal to 1. Given such a binary vector $w$, the Hadamard product of $u$ and $w$ retains only $k$ entries of $u$ and sets the other entries to zeros. Based on the operator $Z_{k}^{\#}(\cdot)$, the conceptual optimal $k$-thresholding (OT) method and the optimal $k$-thresholding pursuit (OTP) were proposed in [34]. The two methods share the same first step and differ in the second step, at which OTP performs an orthogonal projection.

OT and OTP algorithms. Given the problem data $(A,y,k)$ and a starting point $x^{(0)}$, perform the following steps until a stopping criterion is satisfied:

  • S1

    Let $u^{(p)}$ be given as in (2) and let $w^{*}$ be the solution to the problem

    \min_{w}\{\|y-A(u^{(p)}\otimes w)\|^{2}_{2}:~\textrm{e}^{T}w=k,~w\in\{0,1\}^{n}\}. (3)
  • S2

    Generate $x^{(p+1)}$ as follows:

  • For the OT algorithm, $x^{(p+1)}=u^{(p)}\otimes w^{*}$.

  • For the OTP algorithm, let $\Omega=\textrm{supp}(u^{(p)}\otimes w^{*})$ and

    x^{(p+1)}=\arg\min_{z}\{\|y-Az\|_{2}^{2}:~\textrm{supp}(z)\subseteq\Omega\}.

We would rather treat OT and OTP as conceptual algorithms due to the fact that directly solving the binary problem (3) might be expensive when the dimension of the problem is high. A more practical way is to solve a convex relaxation of (3), which can be solved by existing interior-point methods. This consideration leads to the relaxed optimal $k$-thresholding (ROT) and relaxed optimal $k$-thresholding pursuit (ROTP) algorithms introduced in [34, 35].

ROT and ROTP algorithms. Given the problem data $(A,y,k)$ and a starting point $x^{(0)}$, perform the steps below until a stopping criterion is satisfied:

  • S1

    Let $u^{(p)}$ be given as in (2) and solve

    \min_{w}\{\|y-A(u^{(p)}\otimes w)\|^{2}_{2}:~\textrm{e}^{T}w=k,~0\leq w\leq\textrm{e}\}. (4)

    Let $w^{(p)}$ be the solution to this problem.

  • S2

    Set $\widehat{x}={\cal H}_{k}(u^{(p)}\otimes w^{(p)})$ and generate $x^{(p+1)}$ as follows:

  • For the ROT algorithm, $x^{(p+1)}=\widehat{x}$.

  • For the ROTP algorithm, let

    x^{(p+1)}=\arg\min_{z}\{\|y-Az\|_{2}^{2}:~\textrm{supp}(z)\subseteq\textrm{supp}(\widehat{x})\}.

While the optimization problem (4) can be solved efficiently by existing convex optimization solvers, the computational cost remains high for large-scale signal recovery problems, compared with algorithms which involve only orthogonal projection, hard thresholding or linear programming. This stimulates the natural thresholding (NT) method described in the next subsection.

II-C Natural thresholding algorithms: A new framework

The model (4) is a convex relaxation of (3). However, such a relaxation might be too loose, and thus its solution is usually not binary. To overcome this drawback of (4), we may introduce a certain regularization term into the objective of (4) so that the regularized model approximates the model (3) appropriately and the chance of the solution of the regularized model being binary is enhanced. We now describe such a regularized model, which is essential for the development of the new algorithms in this paper. In fact, given the vector $u^{(p)}$, consider the model of the form:

\min_{w}\quad \|y-A(u^{(p)}\otimes w)\|^{2}_{2}+\alpha\phi(w)
{\rm s.t.}\quad \textrm{e}^{T}w=k,~0\leq w\leq\textrm{e}, (5)

where $\alpha>0$ is a parameter and $\phi(w)$ is referred to as a binary regularization function, which is formally defined as follows.

Definition II.1

$\phi$ is said to be a binary regularization if it satisfies the following two properties: i) $\phi$ is a positive and continuously differentiable function over an open neighborhood of the region $D:=\{w\in\mathbb{R}^{n}:0\leq w\leq\textrm{e}\}$; ii) $\phi(w)$ attains its minimum value over $D$ at any binary vector $w\in\{0,1\}^{n}$ and only at such binary vectors.

Example II.2

Note that given any positive number $0<\sigma<1$, the function $\tau(t):=(t+\sigma)(1+\sigma-t)$ is positive and continuously differentiable over the interval $-\sigma<t<1+\sigma$, which is an open neighborhood of the interval $[0,1]$. $\tau(t)$ attains its minimum value $\tau_{\min}:=\sigma(1+\sigma)$ at $t=0$ or $1$, and $\tau(t)>\tau_{\min}$ for any $0<t<1$. For simplicity, we set $\sigma=1/2$ and only consider the specific function $\tau(t)=(t+\frac{1}{2})(\frac{3}{2}-t)$. By using this univariate function, we may construct a binary regularization function in $\mathbb{R}^{n}$ as follows:

\phi(w)=\sum_{i=1}^{n}\tau(w_{i})=\sum_{i=1}^{n}(w_{i}+\frac{1}{2})(\frac{3}{2}-w_{i}). (6)

It is evident that the function (6) satisfies the properties specified in Definition II.1. More examples of binary regularization functions will be given later in this section.
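A quick numeric check (NumPy only; a sketch, not part of the paper's code) that the regularization (6) attains its minimum value $n\sigma(1+\sigma)=3n/4$ (with $\sigma=1/2$) at 0-1 vectors and is strictly larger at interior points.

```python
import numpy as np

def phi(w):
    """Binary regularization (6) with sigma = 1/2."""
    return np.sum((w + 0.5) * (1.5 - w))

n = 5
w_binary = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
w_interior = np.full(n, 0.5)
print(phi(w_binary), n * 0.75)          # equal: the minimum is attained at a 0-1 vector
print(phi(w_interior) > phi(w_binary))  # True: strictly larger inside (0, 1)^n
```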

Given a vector $u^{(p)}$, we use $f$ and $g_{\alpha}$ to denote the functions

f(w)=\|y-A(u^{(p)}\otimes w)\|_{2}^{2},
g_{\alpha}(w)=f(w)+\alpha\phi(w). (7)

The function $g_{\alpha}(w)$ is the objective of the problem (5). The parameter $\alpha$ is referred to as the regularization parameter. The solution of (5) depends on the parameter $\alpha$. Thus for any given $\alpha>0$, we use $w(\alpha)$ to denote the solution of (5). $\phi(w)$ can be seen as the penalty for the variable $w$ violating the 0-1 restriction. From the classic optimization theory for penalty functions (see, e.g., [47]), it is not difficult to see that as $\alpha\to\infty$, the value of $\phi(w(\alpha))$ will decrease to its minimum over the feasible set of (5). Such a minimum is attained only at 0-1 vectors according to the definition of binary regularization. This implies that $w(\alpha)$ tends to a 0-1 point in $D$ as $\alpha\to\infty$, and any accumulation point of $w(\alpha)$ must be a solution to the problem (3). We summarize this fact in the following lemma, whose proof is given in Appendix A for completeness.

Lemma II.3

Let $\phi(w)$ in (5) be a binary regularization and $w(\alpha)$ be a solution of (5). Consider any strictly increasing and positive sequence $\{\alpha_{\ell}\}$ satisfying $\alpha_{\ell}\to\infty$. Then the sequence $\{\phi(w(\alpha_{\ell}))\}$ is nonincreasing and converges to the minimum value of $\phi$ over $D$, and the sequence $\{f(w(\alpha_{\ell}))\}$ is nondecreasing and converges to the optimal value of (3). Moreover, any accumulation point of $\{w(\alpha_{\ell})\}$ is the optimal solution of (3).

The above lemma shows that the model (5) can approximate the model (3) to any expected level of accuracy provided that the parameter $\alpha$ is taken large enough. From this perspective, the model (5) as an approximation of (3) is better than the convex relaxation (4). By definition, a binary regularization is nonconvex. As shown by (6), it is very easy to construct a binary regularization which is concave. As a result, the objective function $g_{\alpha}$ of (5) can be naturally made concave by properly choosing the parameter $\alpha$. Thus for the convenience of discussion, we assume that $g_{\alpha}$ is concave throughout the remainder of this section and Section III. The concavity of the function $g_{\alpha}(w)$ can be guaranteed if $\phi$ and $\alpha$ are chosen properly. Note that if $\phi$ is twice continuously differentiable, then the Hessian of $g_{\alpha}$ is given as

\nabla^{2}g_{\alpha}(w)=\nabla^{2}f(w)+\alpha\nabla^{2}\phi(w),

from which it can be seen that when $\nabla^{2}\phi(w)$ is negative definite, $\nabla^{2}g_{\alpha}(w)$ will be negative semi-definite (and thus $g_{\alpha}$ is concave) provided that $\alpha$ is large enough.

Take (6) as an example. It is easy to verify that the Hessian of (6) is $\nabla^{2}\phi(w)=-2I$, where $I$ is the identity matrix. Thus when (6) is used, the Hessian of $g_{\alpha}(w)$ is given as

\nabla^{2}g_{\alpha}(w)=2(U^{(p)}A^{T}AU^{(p)}-\alpha I),

where $U^{(p)}=\textrm{diag}(u^{(p)})$. Note that $\nabla^{2}g_{\alpha}(w)$ is negative semi-definite over the whole space provided that $\alpha I\succeq U^{(p)}A^{T}AU^{(p)}$, which is guaranteed when $\alpha$ is not smaller than the largest eigenvalue of the matrix $U^{(p)}A^{T}AU^{(p)}$, i.e., $\alpha\geq\alpha^{*}:=\lambda_{\max}(U^{(p)}A^{T}AU^{(p)})$. Therefore, the function $g_{\alpha}(w)$ is always concave over $\mathbb{R}^{n}$ provided that $\alpha\geq\alpha^{*}$.
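A small sketch (NumPy only; sizes and the stand-in vector $u^{(p)}$ are illustrative assumptions) verifying the concavity threshold just described: with the regularization (6), the Hessian $2(U^{(p)}A^{T}AU^{(p)}-\alpha I)$ is negative semi-definite once $\alpha=\lambda_{\max}(U^{(p)}A^{T}AU^{(p)})$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)
u = rng.standard_normal(n)                   # stands in for u^{(p)}
U = np.diag(u)

M = U @ A.T @ A @ U
alpha_star = np.linalg.eigvalsh(M)[-1]       # largest eigenvalue of U A^T A U
H = 2.0 * (M - alpha_star * np.eye(n))       # Hessian of g_alpha at alpha = alpha*
print(np.max(np.linalg.eigvalsh(H)) <= 1e-8) # True: negative semi-definite
```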

Note that (3) is to find the optimal $k$-thresholding of $u^{(p)}$. The above discussion indicates that performing the optimal $k$-thresholding of a vector can be approximately achieved by solving the problem (5), which, from the above discussion, can be made a concave minimization problem; such a problem remains NP-hard in general [49, 50]. The fundamental idea of the traditional MM method for solving a nonconvex optimization problem is to replace the 'difficult' objective function of the problem with a certain relatively easy surrogate convex function that dominates the original objective function (see, e.g., [51, 52]). For a concave objective, the simplest surrogate function is its first-order approximation at any given point. Thus the first-order approximation of the concave function can be used to develop an algorithm with significantly low computational cost.

Specifically, suppose that $w^{-}$ is the current iterate. At $w^{-}$, by the concavity of $g_{\alpha}(w)$ over an open neighborhood of $D$, denoted by $O$, one has

g_{\alpha}(w)\leq g_{\alpha}(w^{-})+\left[\nabla g_{\alpha}(w^{-})\right]^{T}(w-w^{-}) (8)

for any $w\in O$, where $\nabla g_{\alpha}(w^{-})$ denotes the gradient of $g_{\alpha}$ at $w^{-}$. Specifically, if $\phi$ is given by (6), the gradient of $g_{\alpha}$ at $w^{-}$ is given as

\nabla g_{\alpha}(w^{-})=-2U^{(p)}A^{T}(y-AU^{(p)}w^{-})+\alpha(\textbf{e}-2w^{-}).

The relation (8) shows that the first-order approximation of $g_{\alpha}(w)$ dominates $g_{\alpha}(w)$ on the whole open set $O$. By the idea of MM methods, the problem (5), i.e., $\min\{g_{\alpha}(w):w\in P\}$, where

P=\{w\in\mathbb{R}^{n}:\textrm{e}^{T}w=k,~0\leq w\leq\textrm{e}\},

can be approximated by its surrogate counterpart which, in this case, is the linear programming (LP) problem

\min\left\{g_{\alpha}(w^{-})+\left[\nabla g_{\alpha}(w^{-})\right]^{T}(w-w^{-}):~w\in P\right\}.

Removing the constant terms in the objective above, the above LP problem is reduced to

\min\left\{\left[\nabla g_{\alpha}(w^{-})\right]^{T}w:~w\in P\right\}. (9)

Let $w^{+}$ be the solution to this problem. We may replace $w^{-}$ with $w^{+}$ and repeat the above process until a certain stopping criterion is met. This is the basic idea for the development of a new algorithm.

We now point out that there is no need to solve the LP problem (9) by an algorithm, since an optimal solution of (9) can be obtained explicitly due to the nature of the polytope $P$. In fact, according to classic LP theory, the problem (9) must have an optimal solution, denoted by $w^{+}$, which is an extreme point of the polytope $P$. As shown in [34], any extreme point of $P$ is $k$-sparse and binary. Hence the optimal value $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}$ is actually the sum of $k$ entries of the vector $\nabla g_{\alpha}(w^{-})$. Clearly, the smallest value of the function $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w$ over the polytope $P$ is the sum of the $k$ smallest entries of $\nabla g_{\alpha}(w^{-})$. Thus the optimal solution $w^{+}$ of the LP problem (9) can be given explicitly as follows: it is determined by the support of the $k$ smallest entries of $\nabla g_{\alpha}(w^{-})$, i.e., $w^{+}_{i}=1$ if $i\in{\cal S}_{k}(\nabla g_{\alpha}(w^{-}))$ and $w^{+}_{i}=0$ otherwise, for $i=1,\dots,n$, where ${\cal S}_{k}(\cdot)$ is the index set of the $k$ smallest entries of a vector. We now describe the algorithms, which are referred to as the natural thresholding (NT) algorithm and the natural thresholding pursuit (NTP) in this paper.
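A minimal sketch (NumPy only; the brute-force comparison is added here purely for illustration) of this explicit solution: placing ones at the $k$ smallest entries of the gradient attains the LP optimum over $P$, which we confirm on a tiny instance by enumerating all $k$-sparse 0-1 extreme points.

```python
import numpy as np
from itertools import combinations

def natural_threshold(grad, k):
    """Closed-form solution (11) of the LP (9): ones at the k smallest gradient entries."""
    w_plus = np.zeros_like(grad)
    idx = np.argsort(grad, kind="stable")[:k]   # S_k: k smallest entries, ties by smallest index
    w_plus[idx] = 1.0
    return w_plus

rng = np.random.default_rng(1)
n, k = 8, 3
grad = rng.standard_normal(n)                   # stands in for grad g_alpha(w^-)
w_plus = natural_threshold(grad, k)

best = min(grad[list(S)].sum() for S in combinations(range(n), k))
print(np.isclose(grad @ w_plus, best))          # True: the closed form attains the LP optimum
```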

NT and NTP algorithms.

Input $(A,y,k)$ and an initial point $x^{(0)}$. Repeat the following steps until a certain stopping criterion is satisfied:

  • S1

    At $x^{(p)}$, choose $\lambda_{p}>0$ and let $u^{(p)}$ be given by (2). Let $w^{-}$ be the $k$-sparse vector given as

    w^{-}_{i}=\left\{\begin{array}{ll}1&\textrm{ if }i\in L_{k}(u^{(p)})\\ 0&\textrm{ otherwise}\end{array}\right.,~i=1,\dots,n. (10)

    Compute the gradient $\nabla g_{\alpha}(w^{-})$, where $g_{\alpha}(\cdot)$ is defined by (7). Then set $w^{+}$ as

    w^{+}_{i}=\left\{\begin{array}{ll}1&\textrm{ if }i\in{\cal S}_{k}(\nabla g_{\alpha}(w^{-}))\\ 0&\textrm{ otherwise}\end{array}\right.,~i=1,\dots,n. (11)
  • S2

    Generate $x^{(p+1)}$ according to the following procedure:

  • For the NT algorithm, $x^{(p+1)}=u^{(p)}\otimes w^{+}$.

  • For the NTP algorithm, set $S^{(p+1)}=\textrm{supp}(u^{(p)}\otimes w^{+})$ and

    x^{(p+1)}=\arg\min_{z}\{\|y-Az\|_{2}^{2}:~\textrm{supp}(z)\subseteq S^{(p+1)}\}.

The only difference between the two algorithms is that NT does not perform an orthogonal projection, whereas NTP does. At the beginning of every iteration of NT and NTP, the initial point $w^{-}$ can be any given $k$-sparse and binary vector. In particular, we may use the binary vector (10) determined by $L_{k}(u^{(p)})$ as the initial point.
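A hedged end-to-end sketch of NTP with the regularization (6) (NumPy only). The fixed steplength lam, the parameter alpha, the stopping test and the function name ntp are assumptions for illustration; in the paper these quantities are tuned (see Sections IV and V).

```python
import numpy as np

def ntp(A, y, k, lam=1.0, alpha=5.0, max_iter=150, tol=1e-10):
    """One possible NTP loop using the binary regularization (6)."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(max_iter):
        u = x + lam * A.T @ (y - A @ x)                        # gradient step (2)
        # w^-: 0-1 vector supported on the k largest magnitudes of u, cf. (10)
        w_minus = np.zeros(n)
        w_minus[np.argsort(-np.abs(u), kind="stable")[:k]] = 1.0
        # gradient of g_alpha at w^- for the regularization (6)
        grad = -2.0 * u * (A.T @ (y - A @ (u * w_minus))) + alpha * (1.0 - 2.0 * w_minus)
        # w^+: ones at the k smallest entries of the gradient, cf. (11)
        w_plus = np.zeros(n)
        w_plus[np.argsort(grad, kind="stable")[:k]] = 1.0
        # NTP step: orthogonal projection onto the support of u ⊗ w^+
        S = np.flatnonzero(u * w_plus)
        x_new = np.zeros(n)
        if S.size > 0:
            sol, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)
            x_new[S] = sol
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x)):
            return x_new
        x = x_new
    return x
```

Dropping the least-squares projection and returning $u\otimes w^{+}$ instead would give the NT variant of the same loop.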

Remark II.4

Given a $k$-sparse and binary vector $w^{-}$, let $w^{+}$ be given as in (11), which is an explicit solution of the LP problem (9). By the optimality of $w^{+}$, one always has

\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}\leq\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{-}, (12)

which together with the concavity of $g_{\alpha}$ implies that $f(w^{+})\leq f(w^{-})$. (This fact will be used in the proof of Lemma III.2 later.) In fact, the concavity of $g_{\alpha}(w)$ and (12) imply that

g_{\alpha}(w^{+})\leq g_{\alpha}(w^{-})+\nabla g_{\alpha}(w^{-})^{T}(w^{+}-w^{-})\leq g_{\alpha}(w^{-}),

i.e.,

f(w^{+})+\alpha\phi(w^{+})\leq f(w^{-})+\alpha\phi(w^{-}).

Since $w^{+}$ and $w^{-}$ are 0-1 vectors, at which $\phi$ achieves its minimum value over $D$, we have $\phi(w^{+})=\phi(w^{-})$. Thus the above inequality reduces to $f(w^{+})\leq f(w^{-})$. This means the iterate $w^{+}$ generated by NT or NTP is never worse than the one generated by IHT or HTP from the perspective of the error metric $f(w)$. If $w^{-}$ is not optimal to (9), then $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}<\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{-}$, and hence $f(w^{+})<f(w^{-})$. Thus in this case the algorithm finds $k$ terms of $u^{(p)}$ that are better than the $k$ terms selected by hard thresholding.

The NT and NTP algorithms perform the linearization of $g_{\alpha}$ only once in every iteration. We may perform the linearization more than once within their step S1, i.e., replace the role of $w^{-}$ with $w^{+}$ and repeat step S1 of NT more than once so that a point better than $w^{+}$ might be found. Such iterations within S1 are referred to as inner iterations. Repeating the inner iterations up to $q\geq 1$ times yields the algorithms NTq and NTPq.

NTq and NTPq algorithms.

Input $(A,y,k)$, an initial point $x^{(0)}$, and an integer number $q\geq 1$. Repeat the steps below until a stopping criterion is met:

  • S1

    At $x^{(p)}$, choose $\lambda_{p}>0$ and let $u^{(p)}$ be given by (2). Let $w^{-}$ be the $k$-sparse vector given by (10). Then perform the following inner iterations:

    • For $J=1:q$

      • Compute $\nabla g_{\alpha}(w^{-})$, where $g_{\alpha}(\cdot)$ is defined by (7). Then set $w^{+}$ as in (11).

        • If $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}=\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{-}$ then

          terminate the inner iterations, and go to S2

        • else

          $w^{-}:=w^{+}$

        • endif

      End

  • S2

    Generate $x^{(p+1)}$ according to the following procedure:

  • For the NTq algorithm, set $x^{(p+1)}=u^{(p)}\otimes w^{+}$.

  • For the NTPq algorithm, let $S^{(p+1)}=\textrm{supp}(u^{(p)}\otimes w^{+})$ and

    x^{(p+1)}=\arg\min_{z}\{\|y-Az\|_{2}^{2}:~\textrm{supp}(z)\subseteq S^{(p+1)}\}.
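A short sketch (NumPy only) of the inner loop of step S1 above: linearize $g_\alpha$ at $w^-$, jump to the best extreme point $w^+$, and stop early once the linearized objective no longer decreases. Here grad_g is a stand-in for a routine evaluating $\nabla g_\alpha(w)$ at the current $u^{(p)}$ (for instance, the expression used in the NTP sketch in Section II-C).

```python
import numpy as np

def inner_iterations(grad_g, w_minus, k, q):
    """Inner iterations of NTq/NTPq; grad_g(w) returns grad g_alpha(w)."""
    w_plus = w_minus
    for _ in range(q):
        grad = grad_g(w_minus)
        w_plus = np.zeros_like(w_minus)
        w_plus[np.argsort(grad, kind="stable")[:k]] = 1.0   # update (11)
        if np.isclose(grad @ w_plus, grad @ w_minus):       # stationary point: stop
            break
        w_minus = w_plus                                    # otherwise re-linearize at w^+
    return w_plus
```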
Remark II.5

NT, NTP, NTq and NTPq are referred to as the NT-type algorithms in this paper. When $q=1$, NTq and NTPq reduce to NT and NTP, respectively. If $q$ is a large number, according to the linearization theory for concave minimization (e.g., [49, 53]), the inner iterations of NTq and NTPq will terminate at a stationary point of the concave minimization problem, in which case $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}=\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{-}$. If we expect such a point, we may set $q=\infty$ and use $\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{+}=\left[\nabla g_{\alpha}(w^{-})\right]^{T}w^{-}$ as the termination criterion for the inner iterations. The corresponding algorithms can be referred to as NT$_\infty$ and NTP$_\infty$, which are omitted here.

Remark II.6

As pointed out in [35], performing one iteration of ROTP requires about $O(m^{3}+mn+n^{3.5}L)$ flops, where $L$ is the size of the problem data encoded in binary. The complexity of NT-type algorithms is much lower than that of ROTP. In fact, at every iteration, the NT-type algorithms need to compute the vectors $u^{(p)}$ and $\nabla g_{\alpha}(w^{-})$, which require about $O(mn)$ operations, and to perform $L_{k}$ and ${\cal S}_{k}$ on a vector, which require about $O(n\log k)$ flops by a sorting approach, where $k<m\ll n$. The NTP and NTPq algorithms also need to perform the orthogonal projection $\min\{\|y-Az\|_{2}^{2}:\textrm{supp}(z)\subset\Omega\}$, where $|\Omega|=l\leq m$, which requires about $m^{3}$ flops (see [35] for details). Thus the complexities of NT and NTP are about $O(mn)$ and $O(m^{3}+mn)$ flops per iteration, respectively. If the inner iterations are executed $q$ times, then NTq and NTPq require about $O(qmn)$ and $O(m^{3}+qmn)$ flops.

II-D Concave regularization function

Before we analyze the guaranteed performance of NT-type algorithms, it is necessary to construct more examples of concave binary regularization functions. We still construct such functions based on the kernel univariate function $\tau(t)=(t+1/2)(3/2-t)$, which is concave and positive over an open neighborhood of the interval $[0,1]$. Assume $g$ is another concave and strictly increasing univariate function over the interval $(0,\infty)$. Then the composite function $h(t)=g(\tau(t))$ must be a concave function over an open neighborhood of the interval $[0,1]$ due to the fact that

h^{\prime\prime}(t)=g^{\prime\prime}(\tau(t))(\tau^{\prime}(t))^{2}+g^{\prime}(\tau(t))\tau^{\prime\prime}(t)\leq 0,

where the inequality follows from the facts $g^{\prime\prime}(\tau(t))\leq 0$, $\tau^{\prime\prime}(t)\leq 0$ and $g^{\prime}(\tau(t))\geq 0$. We may use such a function to construct a binary regularization as follows:

\phi(w)=\sum_{i=1}^{n}h(w_{i})=\sum_{i=1}^{n}g(\tau(w_{i})). (13)

Functions constructed in this way can be used in NT-type algorithms. We now give two such specific functions.

  • Let $g(t):=\log(1+t)$, which is concave and strictly increasing over the interval $(0,\infty)$. Then

    \phi(w)=\sum_{i=1}^{n}\log\left(1+(w_{i}+1/2)(3/2-w_{i})\right) (14)

    is a concave binary regularization over an open neighborhood of the region $D$.

  • Let $g(t):=\frac{t}{1+t}$, which is concave and strictly increasing over the interval $(0,\infty)$. Then

    \phi(w)=\sum_{i=1}^{n}\frac{(w_{i}+1/2)(3/2-w_{i})}{1+(w_{i}+1/2)(3/2-w_{i})} (15)

    is a concave binary regularization over an open neighborhood of the region $D$.

Similar to $f(w)$, which is related to $u^{(p)}$, the regularization $\phi$ can also be made dependent on $u^{(p)}$. Such a regularization will be discussed in more detail in Section IV.
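A minimal sketch (NumPy only) of the two composite regularizations (14) and (15) built from $\tau(t)=(t+1/2)(3/2-t)$, together with their gradients, which are what the NT-type algorithms actually need when forming $\nabla g_{\alpha}$.

```python
import numpy as np

def tau(w):          # kernel concave function, positive on a neighborhood of [0, 1]
    return (w + 0.5) * (1.5 - w)

def dtau(w):         # tau'(t) = 1 - 2 t
    return 1.0 - 2.0 * w

def phi_log(w):      # regularization (14): sum of log(1 + tau(w_i))
    return np.sum(np.log1p(tau(w)))

def grad_phi_log(w):
    return dtau(w) / (1.0 + tau(w))

def phi_ratio(w):    # regularization (15): sum of tau(w_i) / (1 + tau(w_i))
    t = tau(w)
    return np.sum(t / (1.0 + t))

def grad_phi_ratio(w):
    t = tau(w)
    return dtau(w) / (1.0 + t) ** 2
```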

III Guaranteed performance and stability

In this section, we perform a theoretical analysis of the NT-type algorithms described in the previous section. This analysis is made under the assumptions of the concavity of $g_{\alpha}(w)$ and the restricted isometry property (RIP), which was initially introduced by Candès and Tao [39]. An $m\times n$ matrix $A$, where $m\ll n$, is said to satisfy the RIP of order $K$ if $\delta_{K}<1$, where $\delta_{K}$ is the smallest nonnegative number $\delta$ such that

(1-\delta)\|z\|_{2}^{2}\leq\|Az\|^{2}_{2}\leq(1+\delta)\|z\|_{2}^{2}

for any $K$-sparse vector $z\in\mathbb{R}^{n}$. The number $\delta_{K}$ is called the restricted isometry constant (RIC) of $A$. The following property of $\delta_{K}$ has been widely used in the analysis of compressed sensing algorithms.
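A rough empirical probe (NumPy only; a sketch, not a way to compute the exact RIC) of the RIP inequality above: sampling random $K$-sparse vectors and recording the largest deviation of $\|Az\|_2^2/\|z\|_2^2$ from 1 gives only a lower estimate of $\delta_K$, since the exact constant would require examining every support of size $K$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K, trials = 200, 400, 10, 2000
A = rng.standard_normal((m, n)) / np.sqrt(m)    # columns roughly unit norm

delta_lb = 0.0
for _ in range(trials):
    S = rng.choice(n, size=K, replace=False)    # random support of size K
    z = np.zeros(n)
    z[S] = rng.standard_normal(K)
    ratio = np.linalg.norm(A @ z) ** 2 / np.linalg.norm(z) ** 2
    delta_lb = max(delta_lb, abs(ratio - 1.0))
print("empirical lower bound on delta_K:", round(delta_lb, 3))
```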

Lemma III.1

[39, 3] Let $u\in\mathbb{R}^{n}$ and $\Lambda\subseteq\{1,2,\dots,n\}$. If $|\Lambda\cup\textrm{supp}(u)|\leq t$, then $\|\left[(I-A^{T}A)u\right]_{\Lambda}\|_{2}\leq\delta_{t}\|u\|_{2}$.

It is worth stressing that the RIP is not the only assumption under which compressed sensing algorithms can be analyzed. Other assumptions such as the mutual coherence [1, 3], the null space property [3], and the range space property of $A^{T}$ [4, 54] can also be used to analyze the theoretical performance of a compressed sensing algorithm. We now prove that the iterates generated by the NT-type algorithms with a concave binary regularization are guaranteed to recover the sparse signal under the RIP assumption. We begin with the following lemma.

Lemma III.2

Let $g_{\alpha}(w)$ be a concave function, let $w^{-}$ be the initial point of step S1 in the NTq and NTPq algorithms where $q\geq 1$, and let $w^{+}$ be the final output of step S1. Then

\|y-A(u^{(p)}\otimes w^{+})\|_{2}\leq\|y-A(u^{(p)}\otimes w^{-})\|_{2}. (16)

Proof. The inner iterations within step S1 of NTq and NTPq start from the initial point $w^{(0)}:=w^{-}$. The first inner iteration performs the update (11) to yield the next inner iterate $w^{(1)}$. As pointed out in Remark II.4, since $g_{\alpha}(w)$ is concave and $(w^{(0)},w^{(1)})$ are 0-1 vectors, one must have $f(w^{(1)})\leq f(w^{(0)})$. Then the inner iteration starts from $w^{(1)}$ to generate the second iterate $w^{(2)}$. Again, by the same argument as in Remark II.4, one has $f(w^{(2)})\leq f(w^{(1)})$. Repeating this inner process $q$ times, the final output $w^{(q)}$ of step S1 satisfies

f(w^{(q)})\leq f(w^{(q-1)})\leq\cdots\leq f(w^{(1)})\leq f(w^{(0)}).

Denoting by $w^{+}(=w^{(q)})$ the final output of S1, the above inequality immediately implies the desired relation (16). $\Box$

The next lemma is very useful in our analysis.

Lemma III.3

[35] For any vector $z\in\mathbb{R}^{n}$ and any $k$-sparse vector $h\in\mathbb{R}^{n}$, one has

\|(h-{\cal H}_{k}(z))_{S\backslash S^{*}}\|_{2}\leq\|(h-z)_{S\backslash S^{*}}\|_{2}+\|(h-z)_{S^{*}\backslash S}\|_{2},

where $S=\textrm{supp}(h)$ and $S^{*}=\textrm{supp}({\cal H}_{k}(z))$.

We now prove the guaranteed performance of the NT-type algorithms under the RIP and the concavity of $g_{\alpha}(w)$. It is sufficient to show the main result in noisy settings; the result for noiseless cases follows immediately as a special case. A practical signal is usually not exactly $k$-sparse. Thus we are interested in recovering the key information of the signal, i.e., the largest $k$ magnitudes of the signal. For simplicity, we only analyze the algorithms with steplength $\lambda_{p}\equiv 1$. The performance of the proposed algorithms with $\lambda_{p}\not=1$ may be established by a more complicated analysis which requires significant further investigation. In fact, for the case $\lambda_{p}\not=1$, the term $I-\lambda_{p}A^{T}A$ would appear in (40) in the proof of Theorem III.4 below (see Appendix B for details). Thus, to bound the term in (40), the classic Lemma III.1 cannot be used, and some new technical results involving $\lambda_{p}\not=1$ must be established first. We leave such a possible generalization as future research work.

Theorem III.4

Let $y:=Ax+\nu$ be the acquired measurements for the signal $x$, where $\nu$ are the measurement errors. Denote $S=L_{k}(x)$. Suppose that the RIC of $A$ satisfies

\delta_{3k}<\gamma^{*}\approx 0.4712, (17)

where $\gamma^{*}$ is the unique real root of the univariate equation $\omega\gamma^{3}+\omega\gamma^{2}+\gamma=1$ in the interval $[0,1]$, where $\omega:=(\sqrt{5}+1)/2$. Suppose that the sequence $\{x^{(p)}:p\geq 1\}$ is generated by the algorithm NT or NTq with $\lambda_{p}\equiv 1$ and $g_{\alpha}(\cdot)$ being concave. Then the sequence $\{x^{(p)}\}$ satisfies the following relation:

\|x^{(p+1)}-x_{S}\|_{2}\leq\eta\|x^{(p)}-x_{S}\|_{2}+C_{1}\|A^{T}\nu^{\prime}\|_{2}+C_{2}\|\nu^{\prime}\|_{2}, (18)

where

C_{1}=\omega\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}},~C_{2}=\frac{2}{\sqrt{1-\delta_{2k}}},~\nu^{\prime}=Ax_{\overline{S}}+\nu, (19)

and $\eta$ is a constant given by

\eta=\omega\delta_{3k}\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}} (20)

which is smaller than 1 under the condition (17).

The proof of Theorem III.4 is given in Appendix B. It should be pointed out that to ensure the concavity of $g_{\alpha}(\cdot)$, one may use a concave binary regularization and choose $\alpha$ to be large enough (see Section IV for more discussion). The only difference between NTq and NTPq is that the latter uses the orthogonal projection to generate $x^{(p+1)}$ from $x^{(p)}$. The following property of the orthogonal projection is widely used in the analysis of compressed sensing algorithms.

Lemma III.5

[34, 35] Given the pair $(y,A)$ with $\nu^{\prime}=y-Ax^{*}$, where $x^{*}$ is a $k$-sparse vector, let $u\in\mathbb{R}^{n}$ be an arbitrary $k$-sparse vector, and let $z^{*}$ be the solution to the orthogonal projection

z^{*}=\arg\min_{z}\left\{\|y-Az\|_{2}^{2}:~\operatorname{supp}(z)\subseteq\operatorname{supp}(u)\right\}.

Then $z^{*}$ satisfies

\left\|z^{*}-x^{*}\right\|_{2}\leq\frac{1}{\sqrt{1-\delta_{2k}^{2}}}\|x^{*}-u\|_{2}+\frac{\sqrt{1+\delta_{k}}}{1-\delta_{2k}}\|\nu^{\prime}\|_{2}.

Combining Theorem III.4 and Lemma III.5 leads to the main result for NTP and NTP$_q$.

Theorem III.6

Let $y:=Ax+\nu$ be the measurements of the signal $x$ with measurement errors $\nu$. If the RIC of $A$ satisfies

\delta_{3k}<\frac{2}{\sqrt{5}+3}, (21)

then the sequence $\{x^{(p)}:p\geq 1\}$, generated by the algorithm NTP or NTPq with $\lambda_{p}\equiv 1$ and $g_{\alpha}(\cdot)$ being concave, satisfies the following relation:

\|x^{(p+1)}-x_{S}\|_{2}\leq\rho\|x^{(p)}-x_{S}\|_{2}+\widetilde{C}_{1}\|A^{T}\nu^{\prime}\|_{2}+\widetilde{C}_{2}\|\nu^{\prime}\|_{2}, (22)

where $\nu^{\prime}=Ax_{\overline{S}}+\nu$, $\rho$ is a constant given by $\rho=\omega\delta_{3k}/(1-\delta_{2k})$, which is smaller than 1 under (21), and $\widetilde{C}_{1}$, $\widetilde{C}_{2}$ are the constants given by

\widetilde{C}_{1}=\frac{C_{1}}{\sqrt{1-\delta_{2k}^{2}}},~\widetilde{C}_{2}=\frac{C_{2}}{\sqrt{1-\delta_{2k}^{2}}}+\frac{\sqrt{1+\delta_{k}}}{1-\delta_{2k}}, (23)

in which $C_{1},C_{2}$ are the constants given in Theorem III.4.

Proof. By the structure of NTP and NTP$_q$, $w^{+}$ is the output of S1, and $\hat{x}=u^{(p)}\otimes w^{+}$ is $k$-sparse. The iterate $x^{(p+1)}$ is the solution to the orthogonal projection

\min_{z}\left\{\|y-Az\|_{2}^{2}:~\operatorname{supp}(z)\subseteq\operatorname{supp}(\hat{x})\right\}.

Let $S=L_{k}(x)$. By treating $u=\hat{x}$ and $x^{*}=x_{S}$ as well as $\nu^{\prime}=y-Ax_{S}=Ax_{\overline{S}}+\nu$, it follows from Lemma III.5 that

\|x^{(p+1)}-x_{S}\|_{2}\leq\frac{\|x_{S}-\hat{x}\|_{2}}{\sqrt{1-\delta_{2k}^{2}}}+\frac{\sqrt{1+\delta_{k}}}{1-\delta_{2k}}\|\nu^{\prime}\|_{2}.

The intermediate point $\hat{x}$ generated at the $p$th iteration of NTP (NTPq) is exactly the iterate generated at the $p$th iteration of NT (NTq). By Theorem III.4, we have

\|x_{S}-\hat{x}\|_{2}\leq\eta\|x_{S}-x^{(p)}\|_{2}+C_{1}\|A^{T}\nu^{\prime}\|_{2}+C_{2}\|\nu^{\prime}\|_{2},

where $\eta$, $C_{1},C_{2}$ are given in Theorem III.4. Combining the above two relations, we immediately obtain

\|x^{(p+1)}-x_{S}\|_{2}\leq\rho\|x_{S}-x^{(p)}\|_{2}+\widetilde{C}_{1}\|A^{T}\nu^{\prime}\|_{2}+\widetilde{C}_{2}\|\nu^{\prime}\|_{2},

where

\rho=\frac{\eta}{\sqrt{1-\delta_{2k}^{2}}}=\frac{\omega\delta_{3k}}{\sqrt{1-\delta_{2k}^{2}}}\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}}=\frac{\omega\delta_{3k}}{1-\delta_{2k}},

and

\widetilde{C}_{1}=\frac{C_{1}}{\sqrt{1-\delta_{2k}^{2}}},~\widetilde{C}_{2}=\frac{C_{2}}{\sqrt{1-\delta_{2k}^{2}}}+\frac{\sqrt{1+\delta_{k}}}{1-\delta_{2k}}.

Since $\delta_{2k}\leq\delta_{3k}$, one has $\rho\leq\frac{\omega\delta_{3k}}{1-\delta_{3k}}$. Thus $\rho<1$ is guaranteed if $\frac{\omega\delta_{3k}}{1-\delta_{3k}}<1$, which is ensured under the condition $\delta_{3k}<\frac{1}{1+\omega}=\frac{2}{3+\sqrt{5}}$. Thus the bound (22) is valid. $\Box$

When the measurements are accurate and the signal is $k$-sparse, we immediately have the following corollary.

Corollary III.7

Suppose that the signal $x$ is $k$-sparse and $y:=Ax$. Then the following two statements are valid:

(i) If the RIC of $A$ satisfies (17), then the sequence $\{x^{(p)}:p\geq 1\}$, generated by NT or NTq with $\lambda_{p}\equiv 1$ and concave $g_{\alpha}(\cdot)$, converges to $x$.

(ii) If the RIC of $A$ satisfies (21), then the sequence $\{x^{(p)}:p\geq 1\}$, generated by NTP or NTPq with $\lambda_{p}\equiv 1$ and concave $g_{\alpha}(\cdot)$, converges to $x$.

The main results established in this section indicate that the significant information of the signal can be recovered by the NT-type algorithms under the RIP and the concavity of $g_{\alpha}(\cdot)$. Corollary III.7 further claims that under the same conditions, the $k$-sparse signal can be exactly recovered when the measurements are accurate.

The above results also imply that the NT-type algorithms are stable for signal recovery. To establish such a stability result, let us first state the next lemma, which follows immediately from Lemma 6.23 in [3].

Lemma III.8

[3] Let $k$ be an even number. Suppose that $A\in\mathbb{R}^{m\times n}$ has the restricted isometry constant $\delta_{k}<1$. Given $\tau>0$, $\xi\geq 0$ and $\nu\in\mathbb{R}^{m}$, if two vectors $x,x^{\prime}\in\mathbb{R}^{n}$ satisfy $\|x^{\prime}\|_{0}\leq k$ and $\|x_{S}-x^{\prime}\|_{2}\leq\tau\|Ax_{\overline{S}}+\nu\|_{2}+\xi$, where $S=L_{k}(x)$, then one has

\|x-x^{\prime}\|_{2}\leq\frac{1+2\sqrt{2}\tau}{\sqrt{k/2}}\sigma_{k/2}(x)_{1}+2\tau\|\nu\|_{2}+2\xi,

where

\sigma_{k/2}(x)_{1}:=\min_{u}\{\|x-u\|_{1}:\|u\|_{0}\leq k/2\}.

In fact, by setting $\kappa=2$, $T=S:=L_{k}(x)$ and considering the $\ell_{2}$-norm bound, Lemma 6.23 in [3] immediately reduces to the above lemma. The following stability result for NT-type algorithms follows from Theorems III.4 and III.6 and Lemma III.8.
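A tiny sketch (NumPy only; the helper name is illustrative) of the best $s$-term approximation error in the $\ell_1$ norm, $\sigma_{s}(x)_{1}$, which appears in the stability bounds below: the minimum is attained by keeping the $s$ largest magnitudes of $x$, so the error is the $\ell_1$ norm of the remaining entries.

```python
import numpy as np

def sigma_s_l1(x, s):
    """sigma_s(x)_1 = min{ ||x - u||_1 : ||u||_0 <= s }."""
    mags = np.sort(np.abs(x))            # ascending magnitudes
    return mags[:-s].sum() if s > 0 else mags.sum()

x = np.array([3.0, -0.2, 0.05, 1.5, 0.0, -0.1])
print(sigma_s_l1(x, 2))                  # 0.0 + 0.05 + 0.1 + 0.2 = 0.35
```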

Theorem III.9

Let $y:=Ax+\nu$ be the measurements of the signal $x$, where $\nu$ are the measurement errors. Then the following two statements are valid:

(i) Suppose that the RIC of $A$ satisfies (17) and $k$ is an even integer number. Then the sequence $\{x^{(p)}:p\geq 1\}$, generated by NT or NTq with $\lambda_{p}\equiv 1$ and concave $g_{\alpha}(\cdot)$, satisfies

\|x-x^{(p)}\|_{2}\leq\frac{1+2\sqrt{2}C/(1-\eta)}{\sqrt{k/2}}\sigma_{k/2}(x)_{1}+\frac{2C}{1-\eta}\|\nu\|_{2}+2\eta^{p}(\|x^{(0)}\|_{2}+\|x\|_{2}), (24)

where $C=C_{1}\sigma_{\max}(A)+C_{2}$, $\sigma_{\max}(A)$ is the largest singular value of $A$, and the constants $C_{1},C_{2},\eta$ are given in Theorem III.4.

(ii) Suppose that the RIC of $A$ satisfies (21) and $k$ is an even integer number. Then the sequence $\{x^{(p)}:p\geq 1\}$, generated by NTP or NTPq with $\lambda_{p}\equiv 1$ and concave $g_{\alpha}(\cdot)$, satisfies

\|x-x^{(p)}\|_{2}\leq\frac{1+2\sqrt{2}\widetilde{C}/(1-\rho)}{\sqrt{k/2}}\sigma_{k/2}(x)_{1}+\frac{2\widetilde{C}}{1-\rho}\|\nu\|_{2}+2\rho^{p}(\|x^{(0)}\|_{2}+\|x\|_{2}), (25)

where $\widetilde{C}=\widetilde{C}_{1}\sigma_{\max}(A)+\widetilde{C}_{2}$ and the constants $\widetilde{C}_{1},\widetilde{C}_{2},\rho$ are given in Theorem III.6.

Proof. (i) Under (17), Theorem III.4 claims that the sequence $\{x^{(p)}:p\geq 1\}$, generated by NT and NTq with $\lambda_{p}\equiv 1$ and concave $g_{\alpha}(\cdot)$, satisfies (18). Note that $\|A^{T}\nu^{\prime}\|_{2}\leq\sigma_{\max}(A)\|\nu^{\prime}\|_{2}$, where $\sigma_{\max}(A)$ is the largest singular value of $A$. Thus (18) implies that

\|x^{(p+1)}-x_{S}\|_{2}\leq\eta\|x^{(p)}-x_{S}\|_{2}+C\|\nu^{\prime}\|_{2},

where $C=C_{1}\sigma_{\max}(A)+C_{2}$ and $\nu^{\prime}=Ax_{\overline{S}}+\nu$, and hence

\|x^{(p)}-x_{S}\|_{2}\leq\eta^{p}\|x^{(0)}-x_{S}\|_{2}+C\left(\sum_{i=0}^{p-1}\eta^{i}\right)\|\nu^{\prime}\|_{2}\leq\frac{C}{1-\eta}\|\nu^{\prime}\|_{2}+\eta^{p}(\|x^{(0)}\|_{2}+\|x\|_{2}),

where the last inequality follows from the facts $\sum_{i=0}^{p-1}\eta^{i}<1/(1-\eta)$ and $\|x^{(0)}-x_{S}\|_{2}\leq\|x^{(0)}\|_{2}+\|x\|_{2}$. By Lemma III.8 with $x^{\prime}=x^{(p)}$, $\tau=\frac{C}{1-\eta}$ and $\xi=\eta^{p}(\|x^{(0)}\|_{2}+\|x\|_{2})$, we immediately obtain the bound (24).

(ii) By an analysis similar to item (i) above, we can also show the bound (25). In fact, under (21), it follows from (22) in Theorem III.6 that

\|x^{(p+1)}-x_{S}\|_{2}\leq\rho\|x^{(p)}-x_{S}\|_{2}+\widetilde{C}\|\nu^{\prime}\|_{2},

where $\widetilde{C}=\widetilde{C}_{1}\sigma_{\max}(A)+\widetilde{C}_{2}$ and $\nu^{\prime}=Ax_{\overline{S}}+\nu$. The inequality above implies that

\|x^{(p)}-x_{S}\|_{2}\leq\frac{\widetilde{C}}{1-\rho}\|\nu^{\prime}\|_{2}+\rho^{p}(\|x^{(0)}\|_{2}+\|x\|_{2}),

which together with Lemma III.8 implies the bound (25). $\Box$

If $x$ is $t$-sparse (or $t$-compressible) where $t\leq k/2$, then $\sigma_{k/2}(x)_{1}\equiv 0$ (or $\sigma_{k/2}(x)_{1}\approx 0$), and (24) and (25) imply that the signal $x$ can be stably recovered by the algorithms provided that the measurements are accurate enough and the algorithms perform enough iterations.

IV Further discussion on regularization and parameter

Clearly, the choice of the regularization $\phi$ and the parameter $\alpha$ will directly affect the numerical behavior of the algorithms. The NT-type algorithms are developed from the first-order approximation of the function $g_{\alpha}(w)$. To ensure the quality of such an approximation, the gap between $g_{\alpha}(w)$ and its first-order approximation over the set $P$ should be made as small as possible. Thus we should choose $\phi$ and $\alpha$ such that the second-order term in the Taylor expansion of $g_{\alpha}(w)$ is dominated as much as possible by its first-order term. One idea is to strengthen the first-order approximation by suppressing the second-order term through controlling the eigenvalues of the Hessian matrix of $g_{\alpha}(w)$. This might be partially achieved by properly selecting $\alpha$. Moreover, the concavity of $g_{\alpha}(w)$ can also be maintained by a proper selection of $\alpha$, as shown by the next proposition.

Proposition IV.1

(i) Using the function (6), $g_{\alpha}(w)$ is concave for any given $\alpha\geq\alpha^{*}:=\lambda_{\max}(U^{(p)}A^{T}AU^{(p)})$.

(ii) Using the function (14), $g_{\alpha}(w)$ is concave for any given parameter $\alpha\geq\alpha^{*}:=2\lambda_{\max}(U^{(p)}A^{T}AU^{(p)})$.

(iii) Using the function (15), $g_{\alpha}(w)$ is concave for any given parameter $\alpha\geq\alpha^{*}:=4\lambda_{\max}(U^{(p)}A^{T}AU^{(p)})$.

The proof of the proposition is given in Appendix C. Note that the functions (6), (14) and (15) are independent of $u^{(p)}$. It should be pointed out that all regularization functions given in Section II can also be modified to include the vector $u^{(p)}$. For instance, the function (6) can be modified to involve the vector $u^{(p)}$ as follows:

\phi(w)=\sum_{i=1}^{n}(u^{(p)}_{i})^{2}(w_{i}+\frac{1}{2})(\frac{3}{2}-w_{i})=(w+\frac{1}{2}\textrm{e})^{T}(U^{(p)})^{2}(\frac{3}{2}\textrm{e}-w), (26)

where $U^{(p)}=\textrm{diag}(u^{(p)})$. When $u^{(p)}_{i}\not=0$ for $i=1,\dots,n$, this function is positive over an open neighborhood of $D$, and it remains a binary regularization as specified by Definition II.1. We also have the following observation.

Proposition IV.2

When all entries of $u^{(p)}$ are nonzero, if $\phi$ is given by (26), then $g_{\alpha}(w)$ is concave over an open neighborhood of $D$ for any given $\alpha\geq\alpha^{*}:=\lambda_{\max}(A^{T}A)$.

Proof. The Hessian of $g_{\alpha}(w)$ is given as

\nabla^{2}g_{\alpha}(w)=2U^{(p)}(A^{T}A-\alpha I)U^{(p)}.

The result is obvious since $\nabla^{2}g_{\alpha}(w)\preceq 0$ if and only if $\alpha\geq\alpha^{*}:=\lambda_{\max}(A^{T}A)$. $\Box$
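A quick numeric check (NumPy only; sizes and the stand-in $u^{(p)}$ are illustrative) of Proposition IV.2: with the $u^{(p)}$-weighted regularization (26), the Hessian $2U^{(p)}(A^{T}A-\alpha I)U^{(p)}$ is negative semi-definite once $\alpha=\lambda_{\max}(A^{T}A)$, independently of $u^{(p)}$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 40
A = rng.standard_normal((m, n)) / np.sqrt(m)
u = rng.standard_normal(n)                    # stands in for u^{(p)} (nonzero with probability 1)
U = np.diag(u)

alpha = np.linalg.eigvalsh(A.T @ A)[-1]       # alpha* = lambda_max(A^T A)
H = 2.0 * U @ (A.T @ A - alpha * np.eye(n)) @ U
print(np.max(np.linalg.eigvalsh(H)) <= 1e-8)  # True: negative semi-definite
```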

While the concavity of $g_{\alpha}(w)$ is assumed in the analysis of the algorithms in Section III, this assumption is largely for the convenience of analysis and might not be necessary if we adopt another way of analysis. In fact, the key idea behind the derivation of the NT algorithms is the first-order approximation of the function $g_{\alpha}(w)$. Such a derivation does not require the concavity of $g_{\alpha}(w)$: even if $g_{\alpha}(w)$ is not concave, the first-order approximation of $g_{\alpha}(w)$ still yields the same version of the NT-type algorithms. Also, the concavity of $g_{\alpha}(w)$ may not be required from a numerical point of view. The value of $\alpha$ may be simply prescribed beforehand or be updated in the course of iterations according to some rule.

The NT-type algorithms involve only matrix-vector products, sorting and orthogonal projections, and hence their computational cost is very similar to that of HTP, SP, CoSaMP and OMP. The computational cost of these algorithms is much lower than that of convex optimization methods [38, 39, 40, 41] and optimal $k$-thresholding methods [34, 35]. The empirical results in the next section indicate that with appropriate choices of the steplength $\lambda_{p}$ and the parameter $\alpha$, the NT-type algorithms can efficiently reconstruct sparse signals.

V Numerical Validation

Based on random problem instances of signal recovery, we provide some preliminary experimental results to demonstrate the performance of the NT-type algorithms. For the given size and class of random matrices, we first use simulations to find a suitable steplength $\lambda_{p}$ and parameter $\alpha$ for the algorithms. Then we compare the performance of the NT-type algorithms with SP, CoSaMP, OMP and HTP. The dimension of the synthetic sparse signal is 8000, and the size of the random matrices used in our experiments is $1000\times 8000$. All matrices are Gaussian matrices with normalized columns, whose entries are independently and identically distributed (iid) random variables following the standard normal distribution. The nonzeros of the sparse signals are also iid random variables following the standard normal distribution, and their positions are uniformly distributed. Unless otherwise specified, we use the following stopping criterion for signal recovery:

\|x^{(p)}-x\|_{2}/\|x\|_{2}\leq 10^{-5}, (27)

where $x$ denotes the target signal to recover and $x^{(p)}$ denotes the iterate produced by an algorithm. When $x^{(p)}$ satisfies (27), the recovery of $x$ is counted as a success, and otherwise as a failure. All results demonstrated in this section are obtained by using the function (26) and adopting the initial point $x^{(0)}=0$ for the algorithms. The maximum number of iterations for the iterative algorithms is set to 150. If an algorithm cannot meet the stopping criterion (27) within 150 iterations, the algorithm is treated as unsuccessful for the underlying recovery problem. Since NTP and NTPq use the orthogonal projection at every step, they usually outperform NT and NTq in signal recovery. Thus we only demonstrate the performance of NTP and NTPq in this section.

V-A Choice of regularization parameter: $\alpha$

The parameter $\alpha$ in the NT-type algorithms can take any value in $(0,\infty)$. To see how the value of $\alpha$ affects the performance of the algorithms, we compare the success frequencies of the algorithms for signal recovery with several different values of $\alpha$. The sparsity levels of the signals range from 100 to 450 with stepsize 5, and the measurements are taken as $y:=Ax$ for every realized random pair $(A,x)$. Our initial simulations indicate that the number of inner iterations of NTPq does not remarkably affect its overall performance. Thus we mainly consider the algorithm with $q=1$, i.e., the NTP algorithm, in this experiment. For each given sparsity level $k$, 100 random trials are used to estimate the success rates of NTP, and the algorithm is performed for up to a total of 150 iterations. Fig. 1 shows the results for NTP with several different values of $\alpha$. The steplength is fixed as $\lambda_{p}\equiv 2$ (this stepsize is suggested by simulations; see the next subsection for details). The results in Fig. 1 indicate that for Gaussian matrices of size $1000\times 8000$, $\alpha=5$ is a reasonable choice for NTP. However, we should keep in mind that such a choice of the parameter depends on the size/class of problem instances and should not be generalized to other sizes/classes of problems.
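A hedged sketch (NumPy only) of one cell of this success-rate experiment: a column-normalized Gaussian matrix, a random $k$-sparse signal, exact measurements $y:=Ax$, and the recovery judged by criterion (27). Here ntp refers to the illustrative routine sketched after the NT/NTP description in Section II-C (an assumption, not the authors' code), and lam=2, alpha=5 follow the choices discussed in Sections V-A and V-B.

```python
import numpy as np

def recovery_trial(m, n, k, lam=2.0, alpha=5.0, rng=None):
    """Run one random recovery trial; returns True on success under criterion (27)."""
    rng = rng or np.random.default_rng()
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0)                    # normalize columns
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)    # uniformly distributed positions
    x[support] = rng.standard_normal(k)               # iid standard normal nonzeros
    y = A @ x
    x_hat = ntp(A, y, k, lam=lam, alpha=alpha, max_iter=150)   # ntp: sketch from Section II-C
    return np.linalg.norm(x_hat - x) / np.linalg.norm(x) <= 1e-5

# Example (commented out; heavy for the paper's sizes):
# success_rate = np.mean([recovery_trial(1000, 8000, 200) for _ in range(100)])
```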

Figure 1: Success rates of NTP with $\alpha=2,3,5,8$.

V-B Influence of iterative steplength: $\lambda_{p}$

In traditional function minimization, moving from the current iterate in the direction of the negative gradient with a small steplength guarantees the descent of the function value. Thus the steplength in traditional optimization methods is usually smaller than 1. However, when solving the SCO problem (1), this property of the steplength is lost, since the main purpose of an iterative algorithm for SCO is to identify the correct support of the solution rather than to seek an immediate descent of the objective value. Thus when solving the SCO, the steplength of an iterative algorithm may not necessarily be smaller than 1. Indeed, a large amount of simulations indicate that efficient steplengths larger than 1 are commonly observed when solving an SCO problem. For $1000\times 8000$ Gaussian sensing matrices, we carried out experiments to show how the steplength might affect the performance of NTP. The random pairs $(A,x)$ are generated exactly as in the previous subsection, and we set $\alpha=5$ for NTP in our experiments. For every given sparsity level and steplength, 100 trials were used to estimate the success rate of the algorithm. The results for $\lambda_{p}=0.1,0.5,1,1.5,2$ and $2.5$ are summarized in Fig. 2, which indicates that $\lambda_{p}=2$ is relatively good among the values tested.

Figure 2: Success rates of NTP with different λp.

V-C Number of iterations

The random instances of SCO problems are generated in the same way as in the previous subsections. As suggested by the above experiments, we set α ≡ 5 and λp ≡ 2 in NTP. We test the recovery ability of NTP by running the algorithm with several different numbers of iterations: 20, 40, 80, 120 and 150. For every sparsity level k ranging from 100 to 450 with stepsize 5, we use 100 random trials to estimate the success rate of the algorithm. The results are summarized in Fig. 3, in which IT denotes the total number of iterations performed. The results indicate that the overall performance of NTP improves as the total number of iterations increases. However, once the algorithm has performed enough iterations, say IT > 100, further iterations do not remarkably improve its performance. This experiment indicates that, for the class of SCO problems tested, 150 iterations are generally enough for the algorithm to work well. Thus, in the remaining experiments, we set 150 as the maximum number of iterations, and if an algorithm cannot solve the problem within 150 iterations, it is counted as a failure for that problem.

Figure 3: Comparison of success rates of NTP performed with different numbers of iterations.

V-D Runtime Comparison

Given m = 1000 and n = 8000, we consider a series of vectors with different sparsity levels: k = βm, where β (called the oversampling ratio) takes 11 different values between 0.05 and 0.25, namely β = 0.05 + 0.02i for i = 0, 1, …, 10. For every given k = βm, we use NTP, NTP5, OMP, HTP, CoSaMP and SP to recover the k-sparse vector from the measurements y := Ax. For every given β, the average time required to recover the sparse vector is calculated over 50 random trials. The results are shown in Fig. 4, from which it can be seen that the average time required by the NTP algorithm is lower than that of OMP and NTP5, and the average time required by NTP5 to recover the signals is very similar to that of OMP. We also observe that when β is relatively small, the runtime of the traditional CoSaMP, SP and HTP is lower than that of NTP and NTP5.
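The timing protocol can be sketched as follows; the solver implementations (NTP, NTP5, OMP, HTP, CoSaMP, SP) are assumed to be supplied elsewhere as callables with a common signature and are not reproduced here.

import time
import numpy as np

def average_runtimes(solvers, m=1000, n=8000, trials=50, seed=0):
    # `solvers` maps names to callables solve(A, y, k); wall-clock time per
    # recovery is averaged over `trials` random instances for each ratio beta.
    rng = np.random.default_rng(seed)
    betas = 0.05 + 0.02 * np.arange(11)               # oversampling ratios 0.05, 0.07, ..., 0.25
    avg = {name: [] for name in solvers}
    for beta in betas:
        k = int(round(beta * m))
        total = {name: 0.0 for name in solvers}
        for _ in range(trials):
            A = rng.standard_normal((m, n)) / np.sqrt(m)
            x = np.zeros(n)
            x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
            y = A @ x
            for name, solve in solvers.items():
                t0 = time.perf_counter()
                solve(A, y, k)
                total[name] += time.perf_counter() - t0
        for name in solvers:
            avg[name].append(total[name] / trials)
    return betas, avg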

Figure 4: Runtime comparison of several algorithms.

V-E Comparison of success frequencies

The performance of NTP, NTPq and HTP depends on the choice of the steplength λp, and the performance of NTP and NTPq also depends on the choice of α. We set α = 5 and λp ≡ 2 in NTP and NTP5, and we use the same steplength λp ≡ 2 in HTP. For every random pair (A, x), where x is a sparse vector, the accurate measurements are given by y := Ax. The success frequencies of NTP, NTP5, HTP, OMP, SP and CoSaMP are estimated over 100 random examples of SCO problems. All iterative algorithms are performed for a total of 150 iterations, except for OMP which, by its structure, performs a total of k iterations, where k is the sparsity level of the sparse signal. The results are demonstrated in Fig. 5, from which one can see that NTP and NTP5 perform similarly and both outperform the traditional HTP, OMP, SP and CoSaMP. However, OMP, SP and CoSaMP have their own advantage over HTP and the NT-type algorithms in the sense that they do not need any parameter or steplength to solve the SCO problems.

Figure 5: Comparison of success frequencies in noiseless situations. 100 random examples were attempted and the recovery criterion (27) was adopted.

We also compare the algorithms in the setting of noisy measurements. In this setting, the recovery criterion is set as

x(p)x2/x2103,\|x^{(p)}-x\|_{2}/\|x\|_{2}\leq 10^{-3}, (28)

and the inaccurate measurements are given as y := Ax + ξv, where v is a normalized Gaussian noise vector and ξ > 0 reflects the noise level. The success rates of the algorithms are estimated over 100 random examples, and the algorithms are performed for the same number of iterations as in the noiseless case; α = 5 and λp ≡ 2 are still used in this experiment. Setting ξ = 0.001, we obtain the results shown in Fig. 6, which indicates that NTP and NTP5 remain efficient and can compete with several other algorithms in sparse signal recovery. Changing the noise level from ξ = 0.001 to ξ = 0.01 and repeating the experiment, we obtain the results demonstrated in Fig. 7, which indicate that such a change in noise level significantly affects the performance of CoSaMP, but does not remarkably affect the performance of the other algorithms. From Fig. 6 and Fig. 7, it can be observed that NTP and NTP5 are robust in signal recovery when the measurements are slightly inaccurate.
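The noisy measurements can be generated as in the following sketch, where "normalized Gaussian noise vector" is read as a Gaussian vector scaled to unit 2-norm (an assumption on the normalization).

import numpy as np

def noisy_measurements(A, x, xi, rng=None):
    # y = A x + xi * v, with v a Gaussian vector normalized to unit 2-norm
    rng = np.random.default_rng() if rng is None else rng
    g = rng.standard_normal(A.shape[0])
    return A @ x + xi * (g / np.linalg.norm(g))

Here xi = 0.001 and xi = 0.01 correspond to the two noise levels used for Figs. 6 and 7.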

Figure 6: Comparison of success rates in noisy situations: ξ=0.001.\xi=0.001. 100 random examples were attempted and the recovery criterion (28) was adopted.
Figure 7: Comparison of success rates in noisy situations: ξ=0.01.\xi=0.01. 100 random examples were attempted and the recovery criterion (28) was adopted.

VI Conclusions

The natural thresholding (NT) algorithms for sparse signal recovery have been developed in this paper. This class of algorithms was derived through the first-order approximation of the regularized optimization model for optimal kk-thresholding. The concept of concave binary regularization plays a vital role in this development. The guaranteed performance of the NT-type algorithms in sparse signal recovery has been shown under the restricted isometry property together with the concavity of the objective function of the regularized optimization model. Simulations via random examples show that the proposed algorithms can compete with the traditional HTP, OMP, CoSaMP and SP algorithms.

Concavity is imposed to establish the main results (Theorems III.4 and III.6) of this paper. A worthwhile direction for future work is the analysis of NT-type algorithms under other nonconvex conditions. Another is the development of algorithms with adaptive steplength λp and parameter α, updated during the iterations according to a suitable rule, so that the efficiency and performance of the NT-type algorithms can be further enhanced.

Appendix A: Proof of Lemma II.3

Proof. Let ϕ(w)\phi(w) be a binary regularization function (see Definition II.1). Let w(α)w(\alpha) denote a solution to the problem (II-C) with parameter α>0.\alpha>0. Let {α}\{\alpha_{\ell}\} be any positive sequence which is strictly increasing and tends to \infty as .\ell\to\infty. For notational convenience, denote by w:=w(α)w_{\ell}:=w(\alpha_{\ell}) the optimal solution to (II-C) with α:=α.\alpha:=\alpha_{\ell}. Thus by the optimality of ww_{\ell} and w1w_{\ell-1}, one has

gα(w)gα(w1),gα1(w1)gα1(w)g_{\alpha_{\ell}}(w_{\ell})\leq g_{\alpha_{\ell}}(w_{\ell-1}),~{}~{}g_{\alpha_{\ell-1}}(w_{\ell-1})\leq g_{\alpha_{\ell-1}}(w_{\ell})

That is,

f(w)+αϕ(w)f(w1)+αϕ(w1),f(w_{\ell})+\alpha_{\ell}\phi(w_{\ell})\leq f(w_{\ell-1})+\alpha_{\ell}\phi(w_{\ell-1}),
f(w1)+α1ϕ(w1)f(w)+α1ϕ(w).f(w_{\ell-1})+\alpha_{\ell-1}\phi(w_{\ell-1})\leq f(w_{\ell})+\alpha_{\ell-1}\phi(w_{\ell}). (29)

Note that α_{ℓ-1} < α_ℓ. Adding the two inequalities and cancelling terms yields (α_ℓ − α_{ℓ-1})(ϕ(w_ℓ) − ϕ(w_{ℓ-1})) ≤ 0, and hence

ϕ(w)ϕ(w1),\phi(w_{\ell})\leq\phi(w_{\ell-1}),

i.e., the sequence {ϕ(w_ℓ)} is nonincreasing. Thus ϕ(w_ℓ) → ϕ^* as ℓ → ∞, where ϕ^* is a nonnegative number since ϕ(w) is nonnegative over the set P ⊆ D. Denote by ϕ_min the minimum of ϕ(w) over P. By the definition of binary regularization, ϕ_min is attained at, and only at, k-sparse binary vectors in P. It immediately follows from ϕ(w_ℓ) ≥ ϕ_min that ϕ^* ≥ ϕ_min. We now further show that ϕ^* = ϕ_min. Let w^* be any given k-sparse binary vector in P. Then, by the optimality of w_ℓ again, we have

f(w)+αϕ(w)f(w)+αϕ(w)=f(w)+αϕminf(w_{\ell})+\alpha_{\ell}\phi(w_{\ell})\leq f(w^{*})+\alpha_{\ell}\phi(w^{*})=f(w^{*})+\alpha_{\ell}\phi_{\min}

which together with the fact f(w)0f(w_{\ell})\geq 0 implies that

α(ϕ(w)ϕmin)f(w)\alpha_{\ell}(\phi(w_{\ell})-\phi_{\min})\leq f(w^{*})

Therefore, one has

0ϕϕminϕ(w)ϕminf(w)/α,0\leq\phi^{*}-\phi_{\min}\leq\phi(w_{\ell})-\phi_{\min}\leq f(w^{*})/\alpha_{\ell},

which together with α_ℓ → ∞ implies that ϕ^* = ϕ_min. This, together with the convergence of {ϕ(w_ℓ)} and the continuity of ϕ, implies that any accumulation point of {w_ℓ} must be a k-sparse binary vector (since the minimum of ϕ over P is attained only at k-sparse binary vectors). We now further show that the sequence {f(w_ℓ)} is nondecreasing and converges to the optimal value of (3), and that any accumulation point of {w_ℓ} is an optimal solution to (3). In fact, by (29), we have

f(w1)+α1ϕ(w1)f(w)+α1ϕ(w),f(w_{\ell-1})+\alpha_{\ell-1}\phi(w_{\ell-1})\leq f(w_{\ell})+\alpha_{\ell-1}\phi(w_{\ell}),

which implies that

f(w1)f(w)α1(ϕ(w)ϕ(w1))0f(w_{\ell-1})-f(w_{\ell})\leq\alpha_{\ell-1}(\phi(w_{\ell})-\phi(w_{\ell-1}))\leq 0

Thus f(w_{ℓ-1}) ≤ f(w_ℓ), which means that the sequence {f(w_ℓ)} is nondecreasing. Let ŵ be any optimal solution of (3); then, by the optimality of w_{ℓ-1}, one has

f(w1)+α1ϕ(w1)f(w^)+α1ϕ(w^),f(w_{\ell-1})+\alpha_{\ell-1}\phi(w_{\ell-1})\leq f(\widehat{w})+\alpha_{\ell-1}\phi(\widehat{w}),

where ϕ(w^)\phi(\widehat{w}) must be the minimum value of ϕ\phi over DD since w^\widehat{w} is kk-sparse and binary. Thus the above inequality implies that

f(w1)f(w^)+α1(ϕ(w^)ϕ(w1))f(w^),f(w_{\ell-1})\leq f(\widehat{w})+\alpha_{\ell-1}(\phi(\widehat{w})-\phi(w_{\ell-1}))\leq f(\widehat{w}),

so the nondecreasing sequence {f(w)}\{f(w_{\ell})\} is bounded above, and hence

f(w_ℓ) → f^* ≤ f(ŵ)  as  ℓ → ∞.

Let w̄ be any accumulation point of {w_ℓ} (as shown above, w̄ is k-sparse and binary). It follows from the above relation that f^* = f(w̄) ≤ f(ŵ). Since the right-hand side is the minimum value of (3), we deduce that f(w̄) = f(ŵ). Hence any accumulation point of {w_ℓ} is an optimal solution to (3). \Box

Appendix B: Proof of Theorem III.4

Proof: Let x^{(p)} be the current iterate generated by NT or NTq, and let u^{(p)} be given by (2). The inner iterations of NT and NTq start from the vector w^{-} which satisfies w^{-} ⊗ u^{(p)} = ℋ_k(u^{(p)}). The NT algorithm performs only one inner iteration, while NTq performs q inner iterations. It follows from Lemma III.2 that the output w^{+} of the inner iterations of the NT or NTq algorithm satisfies

yA(u(p)w+)2yA(u(p)w)2.\|y-A(u^{(p)}\otimes w^{+})\|_{2}\leq\|y-A(u^{(p)}\otimes w^{-})\|_{2}.

Note that u(p)w=k(u(p))u^{(p)}\otimes w^{-}={\cal H}_{k}(u^{(p)}) and x(p+1)=u(p)w+x^{(p+1)}=u^{(p)}\otimes w^{+}. The above inequality is written as

yAx(p+1)2yAk(u(p))2.\|y-Ax^{(p+1)}\|_{2}\leq\|y-A{\cal H}_{k}(u^{(p)})\|_{2}. (30)

We also note that

y=Ax+ν=AxS+(AxS¯+ν)=AxS+ν,y=Ax+\nu=Ax_{S}+(Ax_{\overline{S}}+\nu)=Ax_{S}+\nu^{\prime}, (31)

where ν′ = Ax_{S̄} + ν. Since x_S − x^{(p+1)} is (2k)-sparse, by (31), the triangle inequality, and the definition of δ_{2k}, one has

yAx(p+1)2\displaystyle\|y-Ax^{(p+1)}\|_{2} =A(xSx(p+1))+ν2\displaystyle=\|A(x_{S}-x^{(p+1)})+\nu^{\prime}\|_{2}
A(xSx(p+1))2ν2\displaystyle\geq\|A(x_{S}-x^{(p+1)})\|_{2}-\|\nu^{\prime}\|_{2}
1δ2kxSx(p+1)2ν2.\displaystyle\geq\sqrt{1-\delta_{2k}}\|x_{S}-x^{(p+1)}\|_{2}-\|\nu^{\prime}\|_{2}. (32)

Similarly, since xSk(u(p))x_{S}-{\cal H}_{k}(u^{(p)}) is (2k)(2k)-sparse, we have

y\displaystyle\|y Ak(u(p))2\displaystyle-A{\cal H}_{k}(u^{(p)})\|_{2}
=A(xSk(u(p)))+ν2\displaystyle=\|A(x_{S}-{\cal H}_{k}(u^{(p)}))+\nu^{\prime}\|_{2}
A(xSk(u(p)))2+ν2\displaystyle\leq\|A(x_{S}-{\cal H}_{k}(u^{(p)}))\|_{2}+\|\nu^{\prime}\|_{2}
1+δ2kxSk(u(p))2+ν2\displaystyle\leq\sqrt{1+\delta_{2k}}\|x_{S}-{\cal H}_{k}(u^{(p)})\|_{2}+\|\nu^{\prime}\|_{2} (33)

Merging (30), (32) and (33) yields

xS\displaystyle\|x_{S} x(p+1)2\displaystyle-x^{(p+1)}\|_{2}
1+δ2k1δ2kxSk(u(p))2+2ν21δ2k.\displaystyle\leq\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}}\|x_{S}-{\cal H}_{k}(u^{(p)})\|_{2}+\frac{2\|\nu^{\prime}\|_{2}}{\sqrt{1-\delta_{2k}}}. (34)

Let S=supp(k(u(p))).S^{*}=\textrm{supp}({\cal H}_{k}(u^{(p)})). Note that the set SSS^{*}\cup S, where S=Lk(x),S=L_{k}(x), can be decomposed into three disjoint sets S\S,S\SS^{*}\backslash S,S\backslash S^{*} and SS.S\cap S^{*}. Thus

(xS\displaystyle\|(x_{S}- u(p))SS22=(xSu(p))S\S22\displaystyle u^{(p)})_{S^{*}\cup S}\|_{2}^{2}=\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}
+(xSu(p))S\S22+(xSu(p))SS22.\displaystyle+\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S\cap S^{*}}\|_{2}^{2}. (35)

Without loss of generality, we assume that ‖(x_S − u^{(p)})_{S^*∪S}‖_2^2 ≠ 0. Under this assumption, at least one of the three terms on the right-hand side of (35) is nonzero; without loss of generality, we assume that the first term is nonzero, i.e., ‖(x_S − u^{(p)})_{S^*∖S}‖_2^2 ≠ 0. Then there exist numbers μ_1 ≥ 0 and μ_2 ≥ 0 such that

(xSu(p))S\S2\displaystyle\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2} =μ1(xSu(p))S\S2,\displaystyle=\mu_{1}\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}, (36)
(xSu(p))SS2\displaystyle\|(x_{S}-u^{(p)})_{S\cap S^{*}}\|_{2} =μ2(xSu(p))S\S2.\displaystyle=\mu_{2}\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}. (37)

If ‖(x_S − u^{(p)})_{S^*∪S}‖_2^2 = 0, then all terms on the right-hand side of (35) are zero, and hence (36) and (37) remain valid in this case. Substituting (36) and (37) into (35) yields

(xSu(p))SS22=(1+μ12+μ22)(xSu(p))S\S22.\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}=(1+\mu_{1}^{2}+\mu_{2}^{2})\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}. (38)

By the definition of SS and SS^{*}, we have

xS\displaystyle\|x_{S}- k(u(p))22=(xSk(u(p)))SS22\displaystyle{\cal H}_{k}(u^{(p)})\|_{2}^{2}=\|(x_{S}-{\cal H}_{k}(u^{(p)}))_{S^{*}\cup S}\|_{2}^{2}
=\displaystyle= (xSk(u(p)))S\S22+(xSk(u(p)))SS22\displaystyle\|(x_{S}-{\cal H}_{k}(u^{(p)}))_{S^{*}\backslash S}\|_{2}^{2}+\|(x_{S}-{\cal H}_{k}(u^{(p)}))_{S^{*}\cap S}\|_{2}^{2}
+(xSk(u(p)))S\S22\displaystyle+\|(x_{S}-{\cal H}_{k}(u^{(p)}))_{S\backslash S^{*}}\|_{2}^{2}
=\displaystyle= (xSu(p))S\S22+(xSu(p))SS22\displaystyle\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S^{*}\cap S}\|_{2}^{2}
+(xSk(u(p)))S\S22.\displaystyle+\|(x_{S}-{\cal H}_{k}(u^{(p)}))_{S\backslash S^{*}}\|_{2}^{2}.

Thus by Lemma III.3, we have

xS\displaystyle\|x_{S}- k(u(p))22\displaystyle{\cal H}_{k}(u^{(p)})\|_{2}^{2}
\displaystyle\leq (xSu(p))S\S22+(xSu(p))SS22\displaystyle\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S^{*}\cap S}\|_{2}^{2}
+((xSu(p))S\S2+(xSu(p))S\S2)2\displaystyle+\left(\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}+\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2}\right)^{2}
=\displaystyle= (xSu(p))S\S22+(xSu(p))SS22\displaystyle\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S^{*}\cap S}\|_{2}^{2}
+(xSu(p))S\S22+(xSu(p))S\S22\displaystyle+\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2}^{2}
+2(xSu(p))S\S2(xSu(p))S\S2\displaystyle+2\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2}
=\displaystyle= (xSu(p))SS22+(xSu(p))S\S22\displaystyle\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}+\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}
+2(xSu(p))S\S2(xSu(p))S\S2.\displaystyle+2\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}\|(x_{S}-u^{(p)})_{S\backslash S^{*}}\|_{2}.

By (36) and (38), the inequality above can be written as

xSk(u(p))22\displaystyle\|x_{S}-{\cal H}_{k}(u^{(p)})\|_{2}^{2}
(xSu(p))SS22+(1+2μ1)(xSu(p))S\S22\displaystyle\leq\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}+(1+2\mu_{1})\|(x_{S}-u^{(p)})_{S^{*}\backslash S}\|_{2}^{2}
=(xSu(p))SS22+1+2μ11+μ12+μ22(xSu(p))SS22\displaystyle=\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}+\frac{1+2\mu_{1}}{1+\mu_{1}^{2}+\mu_{2}^{2}}\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}
=(1+μ1)2+1+μ221+μ12+μ22(xSu(p))SS22\displaystyle=\frac{(1+\mu_{1})^{2}+1+\mu_{2}^{2}}{1+\mu_{1}^{2}+\mu_{2}^{2}}\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}
maxμ10(maxμ20(1+μ1)2+1+μ221+μ12+μ22)(xSu(p))SS22\displaystyle\leq\max_{\mu_{1}\geq 0}\left(\max_{\mu_{2}\geq 0}\frac{(1+\mu_{1})^{2}+1+\mu_{2}^{2}}{1+\mu_{1}^{2}+\mu_{2}^{2}}\right)\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}
=μ(xSu(p))SS22,\displaystyle=\mu^{*}\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}^{2}, (39)

where

μ\displaystyle\mu^{*} =maxμ10(maxμ20(1+μ1)2+1+μ221+μ12+μ22)\displaystyle=\max_{\mu_{1}\geq 0}\left(\max_{\mu_{2}\geq 0}\frac{(1+\mu_{1})^{2}+1+\mu_{2}^{2}}{1+\mu_{1}^{2}+\mu_{2}^{2}}\right)
=maxμ10(1+μ1)2+11+μ12=5+555=(5+12)2.\displaystyle=\max_{\mu_{1}\geq 0}\frac{(1+\mu_{1})^{2}+1}{1+\mu_{1}^{2}}=\frac{5+\sqrt{5}}{5-\sqrt{5}}=\left(\frac{\sqrt{5}+1}{2}\right)^{2}.

Indeed, for any fixed μ_1, the ratio in the inner maximization is decreasing in μ_2^2 (its numerator exceeds its denominator by 1 + 2μ_1 ≥ 1), so the maximum with respect to μ_2 is achieved at μ_2 = 0; moreover, the derivative of ((1+μ_1)^2+1)/(1+μ_1^2) with respect to μ_1 is 2(1 − μ_1 − μ_1^2)/(1+μ_1^2)^2, so the maximum with respect to μ_1 is achieved at μ_1 = (√5−1)/2, the positive root of 1 − μ_1 − μ_1^2 = 0. By (31) and Lemma III.1, we obtain that

(xSu(p))SS2\displaystyle\|(x_{S}-u^{(p)})_{S^{*}\cup S}\|_{2}
=(xSx(p)AT(yAx(p)))SS2\displaystyle=\|(x_{S}-x^{(p)}-A^{T}(y-Ax^{(p)}))_{S^{*}\cup S}\|_{2}
=[(IATA)(xSx(p))]SS+[ATν]SS2\displaystyle=\|[(I-A^{T}A)(x_{S}-x^{(p)})]_{S^{*}\cup S}+[A^{T}\nu^{\prime}]_{S^{*}\cup S}\|_{2}
[(IATA)(xSx(p))]SS2+ATν2\displaystyle\leq\|[(I-A^{T}A)(x_{S}-x^{(p)})]_{S^{*}\cup S}\|_{2}+\|A^{T}\nu^{\prime}\|_{2}
δ3kxSx(p)2+ATν2.\displaystyle\leq\delta_{3k}\|x_{S}-x^{(p)}\|_{2}+\|A^{T}\nu^{\prime}\|_{2}. (40)

The last inequality follows from Lemma III.1 with |(S^*∪S) ∪ supp(x_S − x^{(p)})| ≤ 3k. Combining (39) and (40) yields

xSk(u(p))2\displaystyle\|x_{S}-{\cal H}_{k}(u^{(p)})\|_{2} 5+12(δ3kxSx(p)2+ATν2).\displaystyle\leq\frac{\sqrt{5}+1}{2}(\delta_{3k}\|x_{S}-x^{(p)}\|_{2}+\|A^{T}\nu^{\prime}\|_{2}).

Denote ω := (√5+1)/2. Merging the above relation with (34), we obtain

xSx(p+1)2\displaystyle\|x_{S}-x^{(p+1)}\|_{2} ωδ3k1+δ2k1δ2kxSx(p)2\displaystyle\leq\omega\delta_{3k}\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}}\|x_{S}-x^{(p)}\|_{2}
+ωATν21+δ2k1δ2k+2ν21δ2k\displaystyle~{}~{}+\omega\|A^{T}\nu^{\prime}\|_{2}\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}}+\frac{2\|\nu^{\prime}\|_{2}}{\sqrt{1-\delta_{2k}}}
=ηxSx(p)2+C1ATν2+C2ν2,\displaystyle=\eta\|x_{S}-x^{(p)}\|_{2}+C_{1}\|A^{T}\nu^{\prime}\|_{2}+C_{2}\|\nu^{\prime}\|_{2},

where C1,C_{1}, C2C_{2} and η\eta are given by (19) and (20), respectively. Note that δ2kδ3k\delta_{2k}\leq\delta_{3k} which implies that 1+δ2k1δ2k1+δ3k1δ3k.\sqrt{\frac{1+\delta_{2k}}{1-\delta_{2k}}}\leq\sqrt{\frac{1+\delta_{3k}}{1-\delta_{3k}}}. Thus to ensure η<1,\eta<1, it is sufficient to require that ωδ3k1+δ3k1δ3k<1\omega\delta_{3k}\sqrt{\frac{1+\delta_{3k}}{1-\delta_{3k}}}<1 which, by squaring and rearranging terms, is written equivalently as

h(δ3k):=ωδ3k3+ωδ3k2+δ3k1<0.h(\delta_{3k}):=\omega\delta_{3k}^{3}+\omega\delta_{3k}^{2}+\delta_{3k}-1<0.

It is easy to check that the above inequality is guaranteed if δ_{3k} < γ^* ≈ 0.4712, where γ^* is the unique real root of the univariate equation h(t) = ωt^3 + ωt^2 + t − 1 = 0 in the interval [0,1]. In fact, h(t) is strictly increasing over [0,1] and h(0) = −1 < 0, which implies that h(t) < 0 for any 0 ≤ t < γ^*. Thus η < 1 is guaranteed if δ_{3k} < γ^*. \Box
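Remark: the constant γ^* can be checked numerically. The following sketch locates, by bisection, the root in [0,1] of the cubic h as stated above.

import math

omega = (math.sqrt(5) + 1) / 2

def h(t):
    # h(t) = omega*t^3 + omega*t^2 + t - 1, strictly increasing on [0, 1]
    return omega * t ** 3 + omega * t ** 2 + t - 1

lo, hi = 0.0, 1.0            # h(0) = -1 < 0 and h(1) = 2*omega > 0
for _ in range(60):          # plain bisection
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)

print((lo + hi) / 2)         # prints the root 0.4712..., matching gamma* above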

Appendix C: Proof of Proposition IV.1

Proof. (i) This has been shown in Section II-C.

(ii) Consider the regularization (14). The second-order derivative of log(1 + (w_i + 1/2)(3/2 − w_i)) with respect to w_i is (−6 + 2β_i)/(1 + β_i)^2, where β_i = (w_i + 1/2)(3/2 − w_i). Clearly, 0 < β_i ≤ 1 over the interval −1/2 < w_i < 3/2, which is an open neighborhood of the closed interval 0 ≤ w_i ≤ 1. It is then easy to verify that the supremum of (−6 + 2β_i)/(1 + β_i)^2 over this open interval is −1. Therefore,

2gα(w)\displaystyle\nabla^{2}g_{\alpha}(w) =2U(p)ATAU(p)+αdiag(6+2βi(1+βi)2,i=1,,n)\displaystyle=2U^{(p)}A^{T}AU^{(p)}+\alpha\textrm{diag}\left(\frac{-6+2\beta_{i}}{(1+\beta_{i})^{2}},i=1,\dots,n\right)
=2U(p)ATAU(p)αI\displaystyle=2U^{(p)}A^{T}AU^{(p)}-\alpha I
+α[I+diag(6+2βi(1+βi)2,i=1,,n)]\displaystyle~{}~{}~{}+\alpha\left[I+\textrm{diag}\left(\frac{-6+2\beta_{i}}{(1+\beta_{i})^{2}},i=1,\dots,n\right)\right]
2U(p)ATAU(p)αI,\displaystyle\preceq 2U^{(p)}A^{T}AU^{(p)}-\alpha I,

where the last relation follows from the fact

I+diag(6+2βi(1+βi)2,i=1,,n)0.I+\textrm{diag}\left(\frac{-6+2\beta_{i}}{(1+\beta_{i})^{2}},i=1,\dots,n\right)\preceq 0.

Thus if αα:=2λmax(U(p)ATAU(p)),\alpha\geq\alpha^{*}:=2\lambda_{\max}(U^{(p)}A^{T}AU^{(p)}), then 2gα(w)0\nabla^{2}g_{\alpha}(w)\preceq 0 and hence the function gα(w)g_{\alpha}(w) is concave.
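As a numerical illustration (not part of the proof), the sketch below checks that with α = α^* the Hessian displayed above is negative semidefinite at a point of [0,1]^n; the diagonal matrix standing in for U^{(p)} is arbitrary here, since only the eigenvalue bound on U^{(p)}A^TAU^{(p)} enters the argument.

import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)
U = np.diag(rng.standard_normal(n))        # arbitrary diagonal stand-in for U^{(p)}
M = 2 * U @ A.T @ A @ U
alpha = np.linalg.eigvalsh(M)[-1]          # alpha* = 2*lambda_max(U^{(p)} A^T A U^{(p)})

w = rng.random(n)                          # any point with 0 <= w_i <= 1
beta = (w + 0.5) * (1.5 - w)               # beta_i = (w_i + 1/2)(3/2 - w_i), so 0 < beta_i <= 1
H = M + alpha * np.diag((-6 + 2 * beta) / (1 + beta) ** 2)
print(np.linalg.eigvalsh(H).max() <= 1e-8) # True: the Hessian is negative semidefinite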

(iii) This can be shown by an analysis similar to that of (ii); the details are omitted. \Box
