
A Unified Framework for Constructing Nonconvex Regularizations

Zhiyong Zhou The author is with the Department of Statistics, Zhejiang University City College, 310015, Hangzhou, China (e-mail: [email protected]).
Abstract

Over the past decades, many individual nonconvex methods have been proposed to achieve better sparse recovery performance in various scenarios. However, how to construct a valid nonconvex regularization function remains open in practice. In this paper, we fill this gap by presenting a unified framework for constructing nonconvex regularizations based on the probability density function. In addition, a new nonconvex sparse recovery method constructed via the Weibull distribution is studied.

Index Terms:
Nonconvex regularization; Probability density function; Cumulative distribution function; Iteratively reweighted algorithms.

I Introduction

Sparse recovery has attracted tremendous research interest in various areas, including statistical learning [1] and compressive sensing [2]. Its goal is to recover an unknown sparse signal \mathbf{x}\in\mathbb{R}^{N} from under-determined noisy measurements \mathbf{y}=A\mathbf{x}+\boldsymbol{\varepsilon}\in\mathbb{R}^{m}, where m\ll N and \boldsymbol{\varepsilon} is the noise vector. As is well known, this sparse signal of interest can be well recovered by solving the Lasso problem [3] \min_{\mathbf{x}\in\mathbb{R}^{N}}\frac{1}{2}\lVert\mathbf{y}-A\mathbf{x}\rVert_{2}^{2}+\lambda\lVert\mathbf{x}\rVert_{1} (where \lambda>0 is a tuning parameter) or the constrained \ell_{1}-minimization [4]. However, since the \ell_{1} norm is only a loose approximation of the \ell_{0} norm, the Lasso is biased and does not have the oracle property [5].

To overcome this issue, a large number of nonconvex methods have been proposed to better approximate the 0\ell_{0} norm and promote sparsity. Generally, a nonconvex regularization method possesses a form of

\min_{\mathbf{x}\in\mathbb{R}^{N}}\frac{1}{2}\lVert\mathbf{y}-A\mathbf{x}\rVert_{2}^{2}+\lambda J_{\theta}(\mathbf{x}), \quad (1)

where J_{\theta}(\cdot) denotes a nonconvex regularization or penalty function depending on a parameter or vector of parameters \theta. In particular, a separable regularization function has the form \sum_{j=1}^{N}F_{\theta}(|x_{j}|). This separable framework includes many popular nonconvex methods as special cases, such as \ell_{p}\,(0<p<1) [6, 7], Capped-L1 [8], transformed \ell_{1} (TL1) [9], smoothly clipped absolute deviation (SCAD) [5], minimax concave penalty (MCP) [10], the three-order polynomial (TOP) method [11], the exponential-type penalty (ETP) [12, 13], the error function (ERF) method [14], and the very recent generalized error function (GERF) method [15], to name a few. It was called F-minimization in [16], where its exact and robust reconstruction conditions were investigated under the assumption that F_{\theta}(\cdot) satisfies some desirable properties such as subadditivity.

These nonconvex regularization methods enjoy attractive theoretical properties and have achieved great success in various scenarios. In practice, however, it is still unclear how to construct a valid and effective nonconvex regularization function; more specifically, how to construct the function F_{\theta}(\cdot) in the separable nonconvex regularization term remains unanswered. In this paper, we set out to fill this gap between theory and practice by proposing a unified framework for constructing nonconvex regularizations based on the probability density function. The framework is fairly general, containing many existing nonconvex regularization methods as special cases. More importantly, many new nonconvex regularization functions can be constructed within this framework.

The paper is organized as follows. In Section II, we present the unified framework. In Section III, we discuss the theoretical analysis results. In Section IV, we study a new nonconvex method based on the Weibull penalty. Finally, conclusions are included in Section V.

II A Unified Framework

We propose to construct the following nonconvex regularization term for sparse recovery:

J_{\theta}(\mathbf{x})=\sum_{j=1}^{N}F_{\theta}(|x_{j}|), \quad (2)

where F_{\theta}(x)=\int_{0}^{x}f_{\theta}(\tau)\,d\tau is the cumulative distribution function (CDF) of a probability density function (PDF) f_{\theta}(\cdot) defined on [0,\infty).
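To make the construction concrete, the following minimal Python sketch (an illustration assuming NumPy and SciPy are available; the helper name make_penalty and the parameter values are ours) builds J_{\theta} directly from the CDF of a distribution supported on [0,\infty) and evaluates it for the Exponential, Rayleigh and Weibull distributions discussed below.

    import numpy as np
    from scipy import stats

    def make_penalty(cdf):
        """Build J_theta(x) = sum_j F_theta(|x_j|) from a CDF F_theta on [0, inf)."""
        return lambda x: float(np.sum(cdf(np.abs(x))))

    # Illustrative choices of F_theta (parameters are arbitrary):
    J_exp = make_penalty(stats.expon(scale=1.0).cdf)             # Exponential, sigma = 1
    J_ray = make_penalty(stats.rayleigh(scale=1.0).cdf)          # Rayleigh, sigma = 1
    J_wbl = make_penalty(stats.weibull_min(1.5, scale=1.0).cdf)  # Weibull, k = 1.5, sigma = 1

    x = np.array([3.0, -0.5, 0.0, 0.01])
    print(J_exp(x), J_ray(x), J_wbl(x))   # each value lies between 0 and ||x||_0 = 3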

In what follows, we first give some examples of distributions supported on [0,\infty) (or on a bounded interval) and their corresponding nonconvex sparse recovery methods. As Table I shows, we can identify a corresponding probability distribution for almost all of the existing popular separable nonconvex methods. In addition, new nonconvex methods can be constructed from the listed distributions, for instance from the Weibull distribution and the Chi distribution.

Distribution | PDF | Method
Dirac delta | \delta(x) | \ell_{0}-minimization
Uniform | \frac{1}{\gamma},\ x\in[0,\gamma] | Capped-L1
Piecewise linear | \frac{2}{\lambda(\gamma+1)}\left(1\wedge\big(1-\frac{x-\lambda}{\lambda(\gamma-1)}\big)_{+}\right),\ \gamma>1 | SCAD
Piecewise linear | \frac{2}{\lambda\gamma}\big(1-\frac{x}{\lambda\gamma}\big)_{+},\ \gamma>0 | MCP
U-quadratic | \alpha(x-\beta)^{2},\ x\in[a,b] | TOP
Exponential | \frac{1}{\sigma}e^{-x/\sigma},\ \sigma>0 | ETP
Rayleigh | \frac{x}{\sigma^{2}}e^{-x^{2}/(2\sigma^{2})},\ \sigma>0 |
Weibull | \frac{k}{\sigma}(x/\sigma)^{k-1}e^{-(x/\sigma)^{k}},\ k>0,\sigma>0 |
Chi | \frac{x^{k-1}e^{-x^{2}/2}}{2^{k/2-1}\Gamma(k/2)},\ k>0 |
Generalized Gamma | \frac{(p/a^{d})x^{d-1}e^{-(x/a)^{p}}}{\Gamma(d/p)},\ a>0,d>0,p>0 | GERF (d=1)
Generalized Beta Prime | \frac{p(x/q)^{\alpha p-1}(1+(x/q)^{p})^{-\alpha-\beta}}{qB(\alpha,\beta)},\ p,q,\alpha,\beta>0 | TL1 (\alpha=\beta=p=1)
TABLE I: Examples of nonconvex methods and their corresponding probability distributions defined on [0,\infty). We did not fill in the corresponding methods for some distributions because they appear to be new.

If a probability distribution is supported on the whole real line, then we can instead use its folded version supported on [0,\infty) to construct the corresponding regularization function. Some examples are listed as follows:

  • Folded Normal distribution: f_{\sigma}(x)=\frac{2}{\sqrt{2\pi}\sigma}e^{-\frac{x^{2}}{2\sigma^{2}}} with F_{\sigma}(x)=\mathrm{erf}\big(\frac{x}{\sqrt{2}\sigma}\big). In this case, it corresponds to the ERF method proposed in [14] (a brief numerical check of this identity is given after the list).

  • Folded Student’s t-distribution: f_{\nu}(x)=\frac{2\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})}\big(1+\frac{x^{2}}{\nu}\big)^{-\frac{\nu+1}{2}}, where \nu is the degrees of freedom and \Gamma(\cdot) is the gamma function. In the special case \nu=1, it reduces to the Folded Cauchy distribution with f(x)=\frac{2}{\pi(1+x^{2})} and F(x)=\frac{2}{\pi}\arctan(x), which corresponds to the “arctan” method proposed in [17].

  • The Folded Laplace distribution is exactly the Exponential distribution.
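As a quick numerical check of the first item (a sketch assuming SciPy, where the folded normal centered at zero is available as halfnorm), one can verify that the folded normal CDF coincides with \mathrm{erf}(x/(\sqrt{2}\sigma)):

    import numpy as np
    from scipy import stats, special

    sigma = 1.0
    x = np.linspace(0.0, 5.0, 201)

    F_folded = stats.halfnorm(scale=sigma).cdf(x)      # CDF of |Z| with Z ~ N(0, sigma^2)
    F_erf = special.erf(x / (np.sqrt(2.0) * sigma))    # erf(x / (sqrt(2) * sigma))

    print(np.max(np.abs(F_folded - F_erf)))            # ~1e-16: the two expressions agree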

We display plots of the PDFs and CDFs of some well-known distributions in Figure 1. As shown, all the CDFs are continuous, non-decreasing, and satisfy F_{\theta}(0)=0 and F_{\theta}(+\infty)=1. The PDFs of the Exponential, the Folded Normal and the Uniform distributions are non-increasing, so their CDFs are concave. In contrast, the PDFs of the Rayleigh and the Weibull with k=1.5 are unimodal with modes away from zero, so their CDFs are not concave. These properties are closely related to the theoretical analysis and the solving algorithms for the corresponding nonconvex regularization methods.

Figure 1: The PDFs and CDFs of some well-known distributions including the Exponential (\sigma=1), the Folded Normal (\sigma=1), the Uniform ([0,2]), the Rayleigh (\sigma=1) and the Weibull (k=1.5,\sigma=1).

Next, we provide some useful properties for the regularization functions J_{\theta}(\cdot) constructed based on the probability density functions. We start with the following proposition, which shows the distribution of J_{\theta}(\mathbf{x}) when the entries of \mathbf{x} are assumed to be random. Its proof is straightforward and so is omitted.

Proposition 1

If we assume that the entries |x_{j}| are random and independent and identically distributed (i.i.d.) with PDF f_{\theta}(\cdot), and let p_{j}=F_{\theta}(|x_{j}|) for each j\in[N], then the penalty J_{\theta}(\mathbf{x})=\sum_{j=1}^{N}p_{j} with p_{j}\sim U[0,1] is the sum of N i.i.d. U[0,1] random variables, which has an Irwin–Hall (IH) distribution. Moreover, when N is large, J_{\theta}(\mathbf{x}) has a normal limiting distribution with mean N/2 and variance N/12 by the central limit theorem.
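A small Monte Carlo experiment (a sketch assuming NumPy; the choice of the Weibull distribution and of its parameters is arbitrary) illustrates Proposition 1: the empirical mean and variance of J_{\theta}(\mathbf{x}) match N/2 and N/12.

    import numpy as np

    rng = np.random.default_rng(0)
    N, trials = 50, 100_000
    k, sigma = 1.5, 1.0

    # Draw |x_j| i.i.d. from Weibull(k, sigma) and evaluate J_theta(x) = sum_j F_theta(|x_j|).
    abs_x = sigma * rng.weibull(k, size=(trials, N))
    J = np.sum(1.0 - np.exp(-(abs_x / sigma) ** k), axis=1)   # Weibull CDF at |x_j|

    print(J.mean(), J.var())   # close to N/2 = 25 and N/12 ~ 4.17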

In addition, the constructed regularization or penalty term J_{\theta}(\cdot):\mathbb{R}^{N}\rightarrow[0,N] can also be viewed as a sparsity measure, which possesses the following appealing properties:

  • J_{\theta}(\cdot) is continuous in \mathbb{R}^{N}, so that it is stable with respect to small perturbations of the signal.

  • J_{\theta}(\mathbf{x})=0 if and only if \mathbf{x}=\mathbf{0}.

  • J_{\theta}(\mathbf{x})=J_{\theta}(-\mathbf{x}) for any \mathbf{x}\in\mathbb{R}^{N}.

  • 0\leq J_{\theta}(\mathbf{x})\leq\lVert\mathbf{x}\rVert_{0}\leq N.

  • If |\mathbf{x}|\preceq|\mathbf{y}| (i.e., |x_{j}|\leq|y_{j}| for all j\in[N]), then J_{\theta}(\mathbf{x})\leq J_{\theta}(\mathbf{y}), due to the non-decreasing property of the CDF F_{\theta}(\cdot).

  • If f_{\theta}(\cdot) is non-increasing (i.e., f_{\theta}^{\prime}(\cdot)\leq 0), then both F_{\theta}(\cdot) and J_{\theta}(\cdot) are concave. In this case, J_{\theta}(\cdot) is also subadditive, that is, J_{\theta}(\mathbf{x}+\mathbf{y})\leq J_{\theta}(\mathbf{x})+J_{\theta}(\mathbf{y}) for any \mathbf{x},\mathbf{y}\in\mathbb{R}^{N}, with equality when \mathbf{x} and \mathbf{y} have disjoint supports.

To illustrate the penalty function J_{\theta}(\cdot) as a sparsity measure, Figure 2 shows the sparsity levels computed via J_{\theta}(\cdot) constructed from the Exponential, the Rayleigh and the Weibull distributions for a compressible signal of length 50 whose entries decay as j^{-2} for j\in\{1,2,\cdots,50\}. As we can see, for a compressible signal with very small but non-zero entries, all the J_{\theta}(\cdot) with proper values of \theta provide better sparsity measures than the traditional \ell_{0} norm, which equals 50 here.

Figure 2: J_{\theta}(\mathbf{x}) for a compressible signal \mathbf{x} while varying \theta, where \theta=1/\sigma for the Exponential distribution, \theta=\sigma for the Rayleigh distribution and \theta=\sigma for the Weibull distribution with k=1.5. We generate \mathbf{x}\in\mathbb{R}^{N} with entries decaying as j^{-2} for j\in\{1,2,\cdots,50\}.

III Theoretical Analysis

In this section, we discuss the recovery analysis results for the following constrained noiseless J_{\theta}-minimization:

\min_{\mathbf{z}\in\mathbb{R}^{N}} J_{\theta}(\mathbf{z}) \quad \text{subject to} \quad A\mathbf{z}=A\mathbf{x}, \quad (3)

where J_{\theta}(\cdot) is constructed as in (2). Throughout this section, we assume that the function J_{\theta}(\cdot) is subadditive.

We start with the definition of a generalized version of the null space property (NSP) [18], which guarantees exact sparse recovery for our proposed J_{\theta}-minimization (3).

Definition 1

We say a matrix A\in\mathbb{R}^{m\times N} satisfies a generalized null space property (gNSP) relative to J_{\theta}(\cdot) and S\subset[N] if

J_{\theta}(\mathbf{v}_{S})<J_{\theta}(\mathbf{v}_{S^{c}}) \quad \text{for all } \mathbf{v}\in\mathrm{Ker}(A)\setminus\{0\}. \quad (4)

It satisfies the gNSP of order s relative to J_{\theta}(\cdot) if it satisfies the gNSP relative to J_{\theta}(\cdot) and any S\subset[N] with |S|\leq s.

Under the subadditivity of J_{\theta}(\cdot), it is straightforward to obtain the following necessary and sufficient condition for exact sparse recovery.

Theorem 1

For any given measurement matrix A\in\mathbb{R}^{m\times N}, every s-sparse vector \mathbf{x}\in\mathbb{R}^{N} is the unique solution of the problem (3) if and only if A satisfies the gNSP of order s relative to J_{\theta}(\cdot).

Moreover, we are able to show that, by choosing a proper \theta, one can obtain a solution arbitrarily close to the unique solution of \ell_{0}-minimization via (3), based on the \Delta_{q}-spherical section property given in [15]. The proofs of the following results are very similar to those given in [15] and so are not reproduced here. The results established in the present paper extend those given in [15] for the GERF method to any subadditive J_{\theta}-minimization method constructed as in (2).

Definition 2

([15]) For any q\in(1,\infty], the measurement matrix A is said to possess the \Delta_{q}-spherical section property if \Delta_{q}(A)\leq(\lVert\mathbf{v}\rVert_{1}/\lVert\mathbf{v}\rVert_{q})^{\frac{q}{q-1}} holds for all \mathbf{v}\in\mathrm{Ker}(A)\setminus\{0\}. In other words, \Delta_{q}(A)=\inf_{\mathbf{v}\in\mathrm{Ker}(A)\setminus\{0\}}\left(\frac{\lVert\mathbf{v}\rVert_{1}}{\lVert\mathbf{v}\rVert_{q}}\right)^{\frac{q}{q-1}}.

Proposition 2

Assume A\in\mathbb{R}^{m\times N} has the \Delta_{q}-spherical section property for some q\in(1,\infty] and \lVert\mathbf{x}\rVert_{0}=s<2^{\frac{q}{1-q}}\Delta_{q}(A). Then for any \hat{\mathbf{x}}\in\{\mathbf{z}\in\mathbb{R}^{N}:A\mathbf{z}=A\mathbf{x}\} and \alpha_{\theta}=F_{\theta}^{-1}(1-\frac{1}{N}), we have

\lVert\hat{\mathbf{x}}-\mathbf{x}\rVert_{q}\leq\frac{N\alpha_{\theta}}{\Delta_{q}(A)^{1-1/q}-(\lceil\Delta_{q}(A)-1\rceil)^{1-1/q}}, \quad (5)

whenever J_{\theta}(\hat{\mathbf{x}})\leq\lceil\Delta_{q}(A)-1\rceil-s.

Proposition 3

If \hat{\mathbf{x}} is the solution of (3) and the conditions in Proposition 2 hold, then it satisfies

J_{\theta}(\hat{\mathbf{x}})\leq\lceil\Delta_{q}(A)-1\rceil-s.
Theorem 2

Let \hat{\mathbf{x}} be the solution of (3) and assume the conditions in Proposition 2 hold. Then \hat{\mathbf{x}} approaches the unique solution \mathbf{x} of \ell_{0}-minimization as \alpha_{\theta}=F_{\theta}^{-1}(1-\frac{1}{N}) approaches 0.

IV Weibull Penalty

In this section, we study a new sparse recovery method constructed via the Weibull distribution, whose PDF is f(x;k,\sigma)=\frac{k}{\sigma}(x/\sigma)^{k-1}e^{-(x/\sigma)^{k}},\ x\geq 0, where k>0 is the shape parameter and \sigma>0 is the scale parameter, and whose CDF is F(x;k,\sigma)=1-e^{-(x/\sigma)^{k}}. Hereafter, we denote the Weibull penalty (WBP) by

WBP(\mathbf{x}) = \sum_{j=1}^{N}\int_{0}^{|x_{j}|}\frac{k}{\sigma}(\tau/\sigma)^{k-1}e^{-(\tau/\sigma)^{k}}\,d\tau = \sum_{j=1}^{N}\left(1-e^{-|x_{j}/\sigma|^{k}}\right). \quad (6)

We hide the dependence on the parameters in WBP(\mathbf{x}) to keep the notation simple. The parameters k and \sigma control the degree of concavity of the penalty function WBP(\mathbf{x}). We refer to this new sparse recovery method based on the Weibull penalty as “WBP”. It is very flexible and includes the ETP method in [12, 13] as the special case k=1.

We can then obtain the following proposition, which characterizes the asymptotic behavior of WBP(\mathbf{x}).

Proposition 4

For any nonzero \mathbf{x}\in\mathbb{R}^{N} and k>0,

  (a) \sigma^{k}\,WBP(\mathbf{x})\rightarrow\lVert\mathbf{x}\rVert_{k}^{k} as \sigma\rightarrow+\infty.

  (b) WBP(\mathbf{x})\rightarrow\lVert\mathbf{x}\rVert_{0} as \sigma\rightarrow 0^{+}.

Proof. As for (a), for x>0 it holds that

\lim_{\sigma\rightarrow+\infty}\frac{\sigma^{k}\int_{0}^{x}\frac{k}{\sigma^{k}}\tau^{k-1}e^{-(\tau/\sigma)^{k}}\,d\tau}{x^{k}} = \lim_{\sigma\rightarrow+\infty}\frac{\int_{0}^{x/\sigma}ku^{k-1}e^{-u^{k}}\,du}{(x/\sigma)^{k}} = \lim_{t\rightarrow 0}\frac{\int_{0}^{t}ku^{k-1}e^{-u^{k}}\,du}{t^{k}}=1,

where we substitute u=\tau/\sigma and let t=x/\sigma. When x=0, both \sigma^{k}(1-e^{-|x/\sigma|^{k}}) and |x|^{k} equal 0.

To prove (b), we have 1-e^{-|0/\sigma|^{k}}=0, and for every x>0,

\lim_{\sigma\rightarrow 0^{+}}\int_{0}^{x}\frac{k}{\sigma}(\tau/\sigma)^{k-1}e^{-(\tau/\sigma)^{k}}\,d\tau = \lim_{\sigma\rightarrow 0^{+}}\int_{0}^{x/\sigma}ku^{k-1}e^{-u^{k}}\,du = \int_{0}^{+\infty}ku^{k-1}e^{-u^{k}}\,du=1.

To illustrate this proposition, Figure 3 plots the objective functions of the Weibull penalty WBP(\cdot) in \mathbb{R} for various values of k and \sigma. All the functions are scaled to pass through the point (1,1) for better comparison. As expected, up to a constant factor, the Weibull penalty with shape parameter k interpolates between the \ell_{0} and \ell_{k} penalties: it approaches the \ell_{0} penalty as \sigma approaches 0, and the \ell_{k} penalty as \sigma approaches +\infty.

Figure 3: The objective functions of WBP(\cdot) in \mathbb{R} for various values of k and \sigma, compared to the \ell_{1} and \ell_{1/2} penalties; k=0.1,0.5,1 (left panel) and k=2 (right panel), with \sigma varying in \{0.1,1,100\}.

IV-A Algorithms

Since f^{\prime}(x;k,\sigma)=\left(\frac{k(k-1)}{\sigma^{k}}x^{k-2}-\frac{k^{2}}{\sigma^{2k}}x^{2(k-1)}\right)e^{-(x/\sigma)^{k}}, the CDF F(x;k,\sigma) is concave on \mathbb{R}_{+} whenever k\leq 1. Therefore, according to [19], an iteratively reweighted \ell_{1} (IRL1) algorithm (summarized in Algorithm 1) can be employed to solve the following Weibull-penalized problem:

\min_{\mathbf{x}\in\mathbb{R}^{N}}\frac{1}{2}\lVert\mathbf{y}-A\mathbf{x}\rVert_{2}^{2}+\lambda\, WBP(\mathbf{x}). \quad (7)
Algorithm 1 IRL1 for solving (7)
1:  Initialization: Set w_{j}^{(0)}=1 for j\in[N], and l=0.
2:  Iteration: Repeat until the stopping rule is met,
\mathbf{x}^{(l+1)}=\mathop{\arg\min}_{\mathbf{x}\in\mathbb{R}^{N}}\frac{1}{2}\lVert\mathbf{y}-A\mathbf{x}\rVert_{2}^{2}+\lambda\sum_{j=1}^{N}w_{j}^{(l)}|x_{j}|, \quad (8)
where w_{j}^{(l)}=\frac{k}{\sigma^{k}}|x_{j}^{(l)}|^{k-1}e^{-|x_{j}^{(l)}/\sigma|^{k}} for j\in[N].
3:  Update iteration: l=l+1.
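A compact Python sketch of Algorithm 1 is given below. It is an illustration under several assumptions: NumPy is the only dependency, the weighted-\ell_{1} subproblem (8) is solved by a plain proximal-gradient (ISTA) loop rather than the ADMM solver used in our experiments, a small \epsilon is added to |x_{j}^{(l)}| to keep the weights finite when k<1, and the function names are ours.

    import numpy as np

    def weighted_l1_ista(A, y, w, lam, x0, n_iter=500):
        """Approximately solve (8): min_x 0.5 * ||y - A x||_2^2 + lam * sum_j w_j |x_j|."""
        L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of the smooth part
        x = x0.copy()
        for _ in range(n_iter):
            z = x - A.T @ (A @ x - y) / L           # gradient step on the quadratic term
            x = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)  # soft-thresholding
        return x

    def irl1_wbp(A, y, k, sigma, lam, n_outer=20, eps=1e-8):
        """IRL1 (Algorithm 1) for the Weibull-penalized problem (7); intended for k <= 1."""
        N = A.shape[1]
        x, w = np.zeros(N), np.ones(N)
        for _ in range(n_outer):
            x = weighted_l1_ista(A, y, w, lam, x)
            a = np.abs(x) + eps                     # smoothing to avoid 0^(k-1) when k < 1
            w = (k / sigma ** k) * a ** (k - 1) * np.exp(-(a / sigma) ** k)
        return x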

For the case k>1, F(x;k,\sigma) is no longer concave on \mathbb{R}_{+}, so IRL1 is not directly applicable. However, note that WBP(\cdot) with k>1 belongs to the class \mathcal{F}_{cc} (additively separable, convex on the convexity region and concave on the concavity region) discussed in [19], so an iteratively reweighted tight convex algorithm (see Algorithm 4 in [19]) can be adopted here. Moreover, we can rewrite WBP(\cdot) as a difference of convex functions (DC). For any k\geq 1, the identity 1-e^{-|x/\sigma|^{k}}=\frac{1}{\sigma^{k}}|x|^{k}-\frac{1}{\sigma^{k}}\int_{0}^{|x|}k\tau^{k-1}(1-e^{-(\tau/\sigma)^{k}})\,d\tau yields the following DC decomposition:

WBP(\mathbf{x})=\frac{1}{\sigma^{k}}\lVert\mathbf{x}\rVert_{k}^{k}-\frac{1}{\sigma^{k}}\sum_{j=1}^{N}\int_{0}^{|x_{j}|}k\tau^{k-1}(1-e^{-(\tau/\sigma)^{k}})\,d\tau.

As a result, the problem (7) can be expressed as a DC program:

\min_{\mathbf{x}\in\mathbb{R}^{N}}\frac{1}{2}\lVert\mathbf{y}-A\mathbf{x}\rVert_{2}^{2}+\lambda\sum_{j=1}^{N}(1-e^{-|x_{j}/\sigma|^{k}}) = \min_{\mathbf{x}\in\mathbb{R}^{N}}\underbrace{\frac{1}{2}\lVert\mathbf{y}-A\mathbf{x}\rVert_{2}^{2}+\frac{\lambda}{\sigma^{k}}\lVert\mathbf{x}\rVert_{k}^{k}}_{g(\mathbf{x})}-\underbrace{\frac{\lambda}{\sigma^{k}}\sum_{j=1}^{N}\int_{0}^{|x_{j}|}k\tau^{k-1}(1-e^{-(\tau/\sigma)^{k}})\,d\tau}_{h(\mathbf{x})},

which can be solved by the Difference of Convex functions Algorithm (DCA) [20]. A detailed discussion is beyond the scope of this paper and is left for future work.
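For completeness, a standard DCA iteration for this decomposition (a generic sketch rather than a scheme developed in this paper) linearizes the concave part -h(\mathbf{x}) at the current iterate \mathbf{x}^{(l)} and solves the resulting convex subproblem:

\mathbf{x}^{(l+1)}\in\mathop{\arg\min}_{\mathbf{x}\in\mathbb{R}^{N}}\; g(\mathbf{x})-\langle\nabla h(\mathbf{x}^{(l)}),\mathbf{x}\rangle, \quad\text{where}\quad \frac{\partial h}{\partial x_{j}}(\mathbf{x}^{(l)})=\frac{\lambda k}{\sigma^{k}}|x_{j}^{(l)}|^{k-1}\left(1-e^{-(|x_{j}^{(l)}|/\sigma)^{k}}\right)\mathrm{sign}(x_{j}^{(l)}).

For k\geq 1, h is differentiable (with gradient 0 at x_{j}^{(l)}=0), so this linearization is well defined.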

IV-B A Simulation Study

Finally, we carry out a simple simulation study of the newly constructed WBP method given in (7) with k\in(0,1] for sparse recovery. We solve it using the IRL1 algorithm, with an alternating direction method of multipliers (ADMM) algorithm [21] used to solve the subproblem (8). The true signal is of length 256 and simulated as s-sparse with s in the set \{6,8,\cdots,32\}. The measurement matrix A is a 64\times 256 Gaussian random matrix. For each k and \sigma, we replicate the noiseless experiments 100 times with \lambda=10^{-7}. A trial is recorded as a success if the relative error \lVert\hat{\mathbf{x}}-\mathbf{x}\rVert_{2}/\lVert\mathbf{x}\rVert_{2}\leq 10^{-3}. We report the success rate over the 100 replicates for various values of the parameters k and \sigma, while varying the sparsity level s.
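For reference, one replicate of this noiseless experiment can be sketched as follows (assuming NumPy and the irl1_wbp sketch from Section IV-A; the column scaling of A by 1/\sqrt{m} and the inner solver are our illustrative choices, whereas the reported results were obtained with the ADMM subproblem solver):

    import numpy as np

    rng = np.random.default_rng(1)
    m, N, s = 64, 256, 20
    k, sigma, lam = 0.5, 1.0, 1e-7

    A = rng.standard_normal((m, N)) / np.sqrt(m)     # Gaussian measurement matrix
    x_true = np.zeros(N)
    support = rng.choice(N, size=s, replace=False)
    x_true[support] = rng.standard_normal(s)         # s-sparse ground truth
    y = A @ x_true                                   # noiseless measurements

    x_hat = irl1_wbp(A, y, k, sigma, lam)            # sketch from Section IV-A
    rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
    print(rel_err <= 1e-3)                           # success criterion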

As we can observe from Figure 4, the proposed WBP method outperforms \ell_{1}-minimization (denoted L_{1}) for all tested values of k and \sigma. Overall, the tested values of \sigma (i.e., \sigma=0.01,1,10,100) have very limited effect on the reconstruction performance when k is less than 1 (i.e., k=0.01,0.2,0.5,0.8), but a much bigger effect for the case k=1 (where \sigma=1 is clearly the best choice). When \sigma is fixed to 0.01, 1 or 10, k=1 performs better than the other values of k. However, when \sigma takes a large value (e.g., \sigma=100), the WBP with k=1 behaves almost the same as \ell_{1}-minimization and performs worse than the WBP with other values of k.

Figure 4: Reconstruction performance of the WBP method with different choices of k for Gaussian random measurements when \sigma is fixed to 0.01, 1, 10, 100. Panels: (a) \sigma=0.01, (b) \sigma=1, (c) \sigma=10, (d) \sigma=100.

V Conclusion

In this paper, we provide a unified framework for constructing nonconvex separable penalties from probability distributions. Theoretical recovery results in terms of the null space property and the spherical section property are presented. In addition, a new nonconvex method based on the Weibull penalty is studied for sparse recovery.

Acknowledgment

We would like to acknowledge support from the Zhejiang Provincial Natural Science Foundation of China under Grant No. LQ21A010003.

References

  • [1] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical learning with sparsity: the lasso and generalizations.   CRC press, 2015.
  • [2] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing.   Birkhäuser Basel, 2013, vol. 1.
  • [3] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [4] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
  • [5] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
  • [6] R. Chartrand and W. Yin, “Iteratively reweighted algorithms for compressive sensing,” in Acoustics, speech and signal processing, 2008. ICASSP 2008. IEEE international conference on.   IEEE, 2008, pp. 3869–3872.
  • [7] S. Foucart and M.-J. Lai, “Sparsest solutions of underdetermined linear systems via \ell_{q}-minimization for 0<q\leq 1,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 395–407, 2009.
  • [8] T. Zhang, “Analysis of multi-stage convex relaxation for sparse regularization.” Journal of Machine Learning Research, vol. 11, no. 3, 2010.
  • [9] S. Zhang and J. Xin, “Minimization of transformed L_{1} penalty: theory, difference of convex function algorithm, and robust application in compressed sensing,” Mathematical Programming, vol. 169, no. 1, pp. 307–336, 2018.
  • [10] C.-H. Zhang, “Nearly unbiased variable selection under minimax concave penalty,” Annals of Statistics, vol. 38, no. 2, pp. 894–942, 2010.
  • [11] Y. Yu and J. Peng, “A unified smooth framework for nonconvex penalized least squares problems,” 2020.
  • [12] C. Gao, N. Wang, Q. Yu, and Z. Zhang, “A feasible nonconvex relaxation approach to feature selection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 25, no. 1, 2011.
  • [13] M. Malek-Mohammadi, A. Koochakzadeh, M. Babaie-Zadeh, M. Jansson, and C. R. Rojas, “Successive concave sparsity approximation for compressed sensing,” IEEE Transactions on Signal Processing, vol. 64, no. 21, pp. 5657–5671, 2016.
  • [14] W. Guo, Y. Lou, J. Qin, and M. Yan, “A novel regularization based on the error function for sparse recovery,” arXiv preprint arXiv:2007.02784, 2020.
  • [15] Z. Zhou, “Sparse recovery based on the generalized error function,” arXiv preprint arXiv:2105.13189, 2021.
  • [16] J. Liu, J. Jin, and Y. Gu, “Robustness of sparse recovery via F-minimization: A topological viewpoint,” IEEE Transactions on Information Theory, vol. 61, no. 7, pp. 3996–4014, 2015.
  • [17] E. J. Candes, M. B. Wakin, S. P. Boyd et al., “Enhancing sparsity by reweighted \ell_{1} minimization,” Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 877–905, 2008.
  • [18] A. Cohen, W. Dahmen, and R. DeVore, “Compressed sensing and best k-term approximation,” Journal of the American Mathematical Society, vol. 22, no. 1, pp. 211–231, 2009.
  • [19] P. Ochs, A. Dosovitskiy, T. Brox, and T. Pock, “On iteratively reweighted algorithms for nonsmooth nonconvex optimization in computer vision,” SIAM Journal on Imaging Sciences, vol. 8, no. 1, pp. 331–372, 2015.
  • [20] P. D. Tao and L. T. H. An, “Convex analysis approach to dc programming: theory, algorithms and applications,” Acta Mathematica Vietnamica, vol. 22, no. 1, pp. 289–355, 1997.
  • [21] S. Boyd, N. Parikh, and E. Chu, Distributed optimization and statistical learning via the alternating direction method of multipliers.   Now Publishers Inc, 2011.