
High-Dimensional Penalized Bernstein Support Vector Machines

Rachid Kharoubi   
Department of Mathematics, Université du Québec À Montréal
Abdallah Mkhadri
Department of Mathematics, Faculty of Sciences Semlalia, Cadi Ayyad University
and
Karim Oualkacha
Department of Mathematics, Université du Québec À Montréal
Abstract

The support vector machine (SVM) is a powerful classifier used for binary classification to improve prediction accuracy. However, the non-differentiability of the SVM hinge loss function can lead to computational difficulties in high-dimensional settings. To overcome this problem, we rely on Bernstein polynomials and propose a new smoothed version of the SVM hinge loss called the Bernstein support vector machine (BernSVM), which is suitable for the high-dimensional regime $p \gg n$. As the BernSVM objective loss function is of class $C^{2}$, we propose two efficient algorithms for computing the solution of the penalized BernSVM. The first algorithm is based on coordinate descent with the majorization-minimization (MM) principle, and the second is an IRLS-type (iteratively re-weighted least squares) algorithm. Under standard assumptions, we derive a cone condition and a restricted strong convexity to establish an upper bound for the weighted Lasso BernSVM estimator. Using a local linear approximation, we extend the latter result to the penalized BernSVM with the non-convex penalties SCAD and MCP. Our bound holds with high probability and achieves a rate of order $\sqrt{s\log(p)/n}$, where $s$ is the number of active features. Simulation studies are conducted to compare the prediction accuracy of BernSVM with that of its competitors and to compare the performance of the two algorithms in terms of computational timing and estimation error. The use of the proposed method is illustrated through the analysis of three large-scale real data examples.


Keywords: SVM, Classification, Bernstein polynomial, Variable Selection, Non-asymptotic Error Bound.

1 Introduction

The SVM (Cortes and Vapnik, 1995) is a powerful classifier used for binary classification to improve prediction accuracy. Its motivation comes from geometric considerations: it seeks a hyperplane that separates two classes of data points by the largest margin (i.e., the minimal distance of the observations to the hyperplane). The classification rule depends only on a subset of observations, called support vectors, which lie along the lines indicating the width of the margin. It consists of classifying a test observation based on which side of the maximal margin hyperplane it lies. However, the SVM can be less efficient in high-dimensional settings with a large number of variables of which only a few are relevant for classification. Nowadays, problems with large-scale sparse data are common in applications such as finance, document classification, image analysis and gene expression analysis. Many sparse regularized SVM approaches have been proposed to control the sparsity of the solution and to achieve both variable selection and classification. To list a few: SVM with the $\ell_{1}$ penalty (Bradley and Mangasarian, 1998; Zhu et al., 2003), SVM with the elastic net (Wang et al., 2006), SVM with the SCAD penalty, and SVM with the combination of the SCAD and $\ell_{2}$ penalties (Becker et al., 2011).

The objective function of these regularized SVM approaches is not differentiable at 1, so standard optimization techniques cannot be applied directly. To overcome this problem, recent alternatives have been developed based on the idea of approximating the hinge loss with a modified smooth function that is differentiable. Then, the majorization-minimization (MM) principle together with a coordinate descent algorithm can be employed efficiently to update each component of the parameter vector of the SVM model. This idea was recently implemented in the generalized coordinate descent algorithm (gCDA) of (Yang and Zou, 2013) and in the SVM cluster correlation-network of (Kharoubi et al., 2019). Both works considered the Huber loss function as a smooth approximation of the hinge loss. They used the fact that the Huber loss is differentiable with a Lipschitz first derivative to build an upper-bound quadratic surrogate for the Huber coordinate-wise objective function, and then solved the corresponding problem by a generalized coordinate descent algorithm using the MM principle to ensure the descent property. Other smooth approximations of the hinge loss can be found in the $\ell_{2}$-loss linear SVM of (Chang and Chen, 2008) and (Lee and Mangasarian, 2001). Another algorithm that is known to be computationally efficient and might be adapted to solve the penalized SVM is the iteratively re-weighted least squares (IRLS-type) algorithm, which solves regularized logistic regression (Friedman et al., 2010). However, the objective functions of the aforementioned methods are not twice differentiable, and thus efficient IRLS-type algorithms cannot be designed to solve such approximate sparse SVM problems.

On the other hand, some works on the theoretical guarantees of the sparse SVM (or $\ell_{1}$ SVM) have recently been proposed; they essentially rely on the assumption that the Hessian of the theoretical (expected) SVM hinge loss is well-defined and continuous, in order to use a Taylor expansion or the definition of continuity. For example, (Koo et al., 2008) studied the theoretical SVM loss function to overcome the non-differentiability of the hinge loss. Then, appropriate assumptions on the distribution of the data $(\bm{x}_{i},y_{i})$, where $\bm{x}_{i}\in\mathbb{R}^{p}$ and $y_{i}\in\{-1,1\}$, $i=1,\ldots,n$, were considered to ensure the existence and continuity of the Hessian with respect to the coefficient vector of the SVM regression model. (Koo et al., 2008) established asymptotic properties of the coefficients in low dimension ($p<n$). (Zhang et al., 2016) used the same assumptions to establish variable selection consistency in high dimensions when the true coefficient vector is sparse. (Dedieu, 2019) exploited the same assumptions on the theoretical hinge loss to develop an upper bound on the estimation error of the sparse SVM in high dimensions.

Motivated by the computational difficulty of the SVM hinge loss and the discontinuity of the second derivative of the Huber hinge loss, we propose the BernSVM loss function, which is based on Bernstein polynomial approximation. Its second derivative is well-defined, continuous and bounded, which simplifies the implementation of a generalized coordinate descent algorithm with a strict descent property and of an IRLS-type algorithm to solve the penalized SVM. By construction, the proposed BernSVM loss function has suitable properties that allow its Hessian to satisfy a restricted strong convexity condition (Negahban et al., 2009). This is essential for the non-asymptotic theoretical guarantees developed in this work. More precisely, and contrary to the works cited earlier, we work with the empirical loss function to establish an upper bound on the $\ell_{2}$ norm of the estimation error of the BernSVM weighted Lasso estimator. Similar to (Peng et al., 2016), we achieve an estimation error rate of order $\sqrt{s\log(p)/n}$.

This paper is organized as follows. In Section 2, we give more details about the construction of the BernSVM loss function using Bernstein polynomials, and some properties of this function are outlined. Then we describe, in the same section, the two algorithms used to compute the solution path of the penalized BernSVM: the first is a generalized coordinate descent algorithm and the second is an IRLS-type algorithm. In Section 3, we study the theoretical properties of the penalized BernSVM with the weighted Lasso penalty. In particular, we derive an upper bound on the $\ell_{2}$ norm of the estimation error. Then, we extend this result to the penalized BernSVM with a non-convex penalty (SCAD or MCP), using the local linear approximation (LLA) algorithm. The empirical performance of BernSVM, based on simulation studies and applications to real data sets, is detailed in Section 4. Finally, a discussion and some conclusions are given in Section 5.

2 The penalized BernSVM

In this section we consider the penalized BernSVM in high dimension for binary classification with convex and non-convex penalties. We first present the BernSVM loss function, a fourth-degree polynomial spline, as a smooth approximation of the hinge loss function. Then, we present two efficient algorithms for computing the solution path of the penalized BernSVM problem: the first is based on a coordinate descent scheme and the second is an IRLS-type algorithm.

Assume we observe a training data set of $n$ pairs, $\{(y_{1},\mathbf{x}_{1}),\ldots,(y_{n},\mathbf{x}_{n})\}$, where $\mathbf{x}_{i}\in\mathbb{R}^{p}$ is a vector of predictors and $y_{i}\in\{-1,1\}$, $i=1,\ldots,n$, denotes the class label.

2.1 The fourth degree approximation spline

We consider the problem of smoothing the SVM loss function $v: t\mapsto(1-t)_{+}$. The idea is to fix some $\delta>0$ and to construct the simplest polynomial spline

B_{\delta}(t)=\begin{cases} v(t), & \text{if } |t-1|>\delta,\\ g_{\delta}(t), & \text{if } |t-1|\leq\delta, \end{cases} \qquad (1)

where $g_{\delta}$ is a decreasing, convex polynomial such that $B_{\delta}$ is twice continuously differentiable. In particular, $B_{\delta}(t)\rightarrow v(t)$ as $\delta\rightarrow 0$, for all $t\in\mathbb{R}$. These conditions can be rewritten in terms of the polynomial $g_{\delta}$ alone as

  • C1. $g_{\delta}(1-\delta)=\delta$, $g_{\delta}^{\prime}(1-\delta)=-1$, and $g_{\delta}^{\prime\prime}(1-\delta)=0$;

  • C2. $g_{\delta}(1+\delta)=0$, $g_{\delta}^{\prime}(1+\delta)=0$, and $g_{\delta}^{\prime\prime}(1+\delta)=0$;

  • C3. $g_{\delta}^{\prime\prime}\geq 0$.

Consider the affine transformation $t\mapsto(t-1+\delta)/(2\delta)$, which maps the interval $[1-\delta,\,1+\delta]$ onto $[0,\,1]$, and let

q_{\delta}(x)=g_{\delta}(2\delta x+1-\delta), \qquad x\in[0,\,1].

In terms of $q_{\delta}$, the three conditions above become

  • C1'. $q_{\delta}(0)=\delta$, $q_{\delta}^{\prime}(0)=-2\delta$, and $q_{\delta}^{\prime\prime}(0)=0$;

  • C2'. $q_{\delta}(1)=0$, $q_{\delta}^{\prime}(1)=0$, and $q_{\delta}^{\prime\prime}(1)=0$;

  • C3'. $q_{\delta}^{\prime\prime}\geq 0$.

These conditions can be dealt with in a simple manner in terms of a Bernstein basis.

2.1.1 The BernSVM loss function

The members of the Bernstein basis of degree $m$, $m\geq 0$, are the polynomials

b_{k,m}(x)=\binom{m}{k}x^{k}(1-x)^{m-k}, \qquad x\in[0,\,1],

for $0\leq k\leq m$. The coefficients of a polynomial $P$ in the Bernstein basis of degree $m$ are denoted by $\{c(k,m;P):k=0,\ldots,m\}$, so that

P(x)=\sum_{k=0}^{m}c(k,m;P)\,b_{k,m}(x), \qquad x\in[0,\,1].

Let $\triangle$ be the forward difference operator. Here, when applied to a function $h$ of two arguments, $(k,m)\mapsto h(k,m)$, it is understood that $\triangle$ operates on the first argument:

\triangle h(k,m)=h(k+1,m)-h(k,m) \quad\text{and}\quad \triangle^{2}h(k,m)=h(k+2,m)-2h(k+1,m)+h(k,m), \quad\text{for all } (k,m).

A basic fact related to the Bernstein basis is the following: for $m\geq 1$ and $0\leq k\leq m$, we have

b_{k,m}^{\prime}=m(b_{k-1,m-1}-b_{k,m-1})=-m\triangle b_{k-1,m-1},

with the convention $b_{j,m}=0$ for $j\notin\{0,\ldots,m\}$.

A useful consequence shows how derivatives of polynomials represented in the Bernstein basis act on the coefficients. Indeed, we have

  • (i) $c(k,m-1;P^{\prime})=m\triangle c(k,m;P)$, $0\leq k\leq m-1$, $m\geq 1$, so that

    P^{\prime}(x)=m\sum_{k=0}^{m-1}\triangle c(k,m;P)\,b_{k,m-1}(x), \qquad x\in[0,\,1];

  • (ii) $c(k,m-2;P^{\prime\prime})=m(m-1)\triangle^{2}c(k,m;P)$, $0\leq k\leq m-2$, $m\geq 2$, so that

    P^{\prime\prime}(x)=m(m-1)\sum_{k=0}^{m-2}\triangle^{2}c(k,m;P)\,b_{k,m-2}(x), \qquad x\in[0,\,1].

It is then easy to see that no cubic spline $B_{\delta}$ can satisfy the constraints. If such a spline existed, its second derivative would be a polynomial spline of degree at most one that vanishes at the endpoints $1-\delta$ and $1+\delta$. By continuity of the second derivative, it would therefore vanish everywhere, so the first derivative would be a degree zero polynomial on the entire axis, which must equal $-1$ by the first condition. This contradicts the second condition.

Theorem 1.

The polynomial

g_{\delta}(t)=\frac{1}{8\delta^{3}}\left\{\frac{(1-t+\delta)^{4}}{2}-(1-t-\delta)(1-t+\delta)^{3}\right\}, \qquad t\in[1-\delta,\,1+\delta], \qquad (2)

is the only polynomial of degree $m=4$ satisfying the three conditions C1, C2 and C3.

The proof is deferred to Appendix A.
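As a quick sanity check, the closed form (2) and the derivative expressions given in Proposition 1 below can be verified numerically. The following base R snippet is a minimal sketch of ours (not code from the paper) that evaluates $g_\delta$, $g_\delta^{\prime}$ and $g_\delta^{\prime\prime}$ and checks conditions C1-C3.

```r
## Numeric check (ours) that the degree-4 polynomial of Theorem 1 satisfies C1-C3.
g  <- function(t, d) ((1 - t + d)^4 / 2 - (1 - t - d) * (1 - t + d)^3) / (8 * d^3)
g1 <- function(t, d) (1 - t + d)^2 * (1 - t - 2 * d) / (4 * d^3)   # first derivative
g2 <- function(t, d) 3 * (d^2 - (1 - t)^2) / (4 * d^3)             # second derivative

d <- 0.5
## C1: g(1-d) = d, g'(1-d) = -1, g''(1-d) = 0 (all three differences should be ~0)
c(g(1 - d, d) - d, g1(1 - d, d) + 1, g2(1 - d, d))
## C2: g(1+d) = 0, g'(1+d) = 0, g''(1+d) = 0
c(g(1 + d, d), g1(1 + d, d), g2(1 + d, d))
## C3: g'' >= 0 on [1-d, 1+d]
all(g2(seq(1 - d, 1 + d, length.out = 200), d) >= 0)
```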

The following proposition summarizes the properties of the first and second derivatives of the BernSVM loss function $B_{\delta}(\cdot)$ defined in equation (1).

Proposition 1.

For all $t\in\mathbb{R}$, we have $|B_{\delta}^{\prime}(t)|\leq 1$ and $0\leq B_{\delta}^{\prime\prime}(t)\leq\frac{3}{4\delta}$.

Proof.

The loss function $B_{\delta}(\cdot)$ is twice continuously differentiable. Its first derivative is given by

B_{\delta}^{\prime}(t)=\begin{cases} v^{\prime}(t), & \text{if } |t-1|>\delta,\\ g_{\delta}^{\prime}(t), & \text{if } |t-1|\leq\delta, \end{cases} \qquad (3)

where $g_{\delta}^{\prime}(t)=\frac{(1-t+\delta)^{2}(1-t-2\delta)}{4\delta^{3}}$ and $v^{\prime}(t)=-\mathbf{1}_{\{1-t>\delta\}}$. Its second derivative is given by

B_{\delta}^{\prime\prime}(t)=\begin{cases} 0, & \text{if } |t-1|>\delta,\\ g_{\delta}^{\prime\prime}(t), & \text{if } |t-1|\leq\delta, \end{cases} \qquad (4)

where $g_{\delta}^{\prime\prime}(t)=\frac{3}{4\delta^{3}}[\delta^{2}-(1-t)^{2}]$. For $|t-1|\leq\delta$ we have $0\leq(1-t)^{2}\leq\delta^{2}$, hence $0\leq\delta^{2}-(1-t)^{2}\leq\delta^{2}$. Thus, $0\leq g_{\delta}^{\prime\prime}(t)\leq\frac{3}{4\delta}$ and $g_{\delta}^{\prime}(t)$ is an increasing function. Since $g_{\delta}^{\prime}(1-\delta)=-1$ and $g_{\delta}^{\prime}(1+\delta)=0$, it follows that $-1\leq g_{\delta}^{\prime}(t)\leq 0$. Thus, for all $t\in\mathbb{R}$, $|B_{\delta}^{\prime}(t)|\leq 1$.  $\blacksquare$

2.1.2 Graphical comparison of the three loss functions

In this section, we examine some properties of the BernSVM loss function through a graphical illustration. We also compare its behavior with the standard SVM loss function $v(t)$ and the Huber hinge loss considered in HHSVM (Yang and Zou, 2013), which is given by

\phi_{c}(t)=\begin{cases} 0, & t\geq 1,\\ \frac{(1-t)^{2}}{2\delta}, & 1-\delta<t\leq 1,\\ 1-t-\frac{\delta}{2}, & t\leq 1-\delta. \end{cases}

Figure 1: (a) The Huber loss (red) with $\delta=0.01$, the BernSVM loss (blue) with $\delta=0.01$ and the SVM loss (black); (b) the Huber loss (red) with $\delta=2$, the BernSVM loss (blue) with $\delta=2$ and the SVM loss (black).

Figure 1 shows this graphical illustration. When $\delta$ is small (i.e., close to zero), the three loss functions are almost identical (Panel (a)). When $\delta\geq 1$ (Panel (b)), both the HHSVM and BernSVM loss functions have the same shape as the SVM loss; however, the BernSVM loss approximates the SVM loss more closely. Moreover, the BernSVM loss function is twice differentiable everywhere.
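For illustration, the following base R sketch (ours, not the authors' code) evaluates the SVM hinge loss $v(t)$, the Huber hinge loss $\phi_c(t)$ and the BernSVM loss $B_\delta(t)$, and can be used to reproduce plots in the spirit of Figure 1.

```r
## Sketch (base R) of the three losses compared in Figure 1; not the authors' code.
hinge <- function(t) pmax(1 - t, 0)                                  # SVM loss v(t)
huber <- function(t, d) ifelse(t >= 1, 0,
                        ifelse(t > 1 - d, (1 - t)^2 / (2 * d), 1 - t - d / 2))
bern  <- function(t, d) ifelse(abs(t - 1) > d, hinge(t),
                        ((1 - t + d)^4 / 2 - (1 - t - d) * (1 - t + d)^3) / (8 * d^3))

t <- seq(-1, 3, length.out = 400)
plot(t, hinge(t), type = "l", ylab = "loss", main = "delta = 2")     # SVM loss (black)
lines(t, huber(t, 2), col = "red")                                   # Huber hinge loss
lines(t, bern(t, 2),  col = "blue")                                  # BernSVM loss
```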

2.2 Algorithms for solving penalized BernSVM

In this section, we propose the penalized BernSVM as an alternative to the penalized SVM and the penalized HHSVM for binary classification in high-dimensional settings. We assume $n$ pairs of training data $(\bm{x}_{i},y_{i})$, for $i=1,\ldots,n$, with predictors $\bm{x}_{i}\in\mathbb{R}^{p}$ and $y_{i}\in\{-1,1\}$. We assume that the predictors are standardized:

\frac{1}{n}\sum_{i=1}^{n}x_{ij}=0, \qquad \frac{1}{n}\sum_{i=1}^{n}x_{ij}^{2}=1.

We propose to solve the penalized BernSVM problem given by

\min_{(\beta_{0},\bm{\beta})}\ \frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}(\beta_{0}+\bm{x}_{i}^{\top}\bm{\beta}))+P_{\lambda_{1},\lambda_{2}}(\bm{\beta}), \qquad (5)

where the loss function $B_{\delta}(\cdot)$ is defined in (1) and

P_{\lambda_{1},\lambda_{2}}(\bm{\beta})=\sum_{j=1}^{p}P_{\lambda_{1}}(|\beta_{j}|)+\frac{\lambda_{2}}{2}\|\bm{\beta}\|_{2}^{2}

is a penalty function added to obtain sparse coefficient estimates. The penalty $P_{\lambda_{1}}(|\beta_{j}|)$ can take different forms (a code sketch of these penalty choices is given after the list below):

  • If $P_{\lambda_{1}}(|\beta_{j}|)=\lambda_{1}|\beta_{j}|$, then $P_{\lambda_{1},\lambda_{2}}(\bm{\beta})$ is the elastic net penalty (EN);

  • If $P_{\lambda_{1}}(|\beta_{j}|)=\lambda_{1}w_{j}|\beta_{j}|$, where the weight $w_{j}>0$ for all $j$, then $P_{\lambda_{1},\lambda_{2}}(\bm{\beta})$ is the adaptive Lasso + $L_{2}$ norm penalty (AEN);

  • If one sets

    P_{\lambda_{1}}(|\beta_{j}|)=\lambda_{1}|\beta_{j}|\mathbb{1}_{|\beta_{j}|\leq\lambda_{1}}-\frac{|\beta_{j}|^{2}-2a\lambda_{1}|\beta_{j}|+\lambda_{1}^{2}}{2(a-1)}\mathbb{1}_{\lambda_{1}<|\beta_{j}|\leq a\lambda_{1}}+\frac{(a+1)\lambda_{1}^{2}}{2}\mathbb{1}_{|\beta_{j}|>a\lambda_{1}},

    with $a>2$, then $P_{\lambda_{1},\lambda_{2}}(\bm{\beta})$ is the SCAD + $L_{2}$ norm penalty (SCADEN);

  • If one sets

    P_{\lambda_{1}}(|\beta_{j}|)=\lambda_{1}\Big(|\beta_{j}|-\frac{|\beta_{j}|^{2}}{2\lambda_{1}a}\Big)\mathbb{1}_{|\beta_{j}|<\lambda_{1}a}+\frac{\lambda_{1}^{2}a}{2}\mathbb{1}_{|\beta_{j}|\geq\lambda_{1}a},

    with $a>1$, then $P_{\lambda_{1},\lambda_{2}}(\bm{\beta})$ is the MCP + $L_{2}$ norm penalty (MCPEN).
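The base R sketch below (ours, assuming only the penalty expressions listed above) evaluates the four penalty choices; lam stands for $\lambda_1$ and a is the SCAD/MCP concavity parameter.

```r
## Sketch (base R, not the authors' code) of the penalty choices P_lambda1 above.
p_lasso  <- function(b, lam) lam * abs(b)                            # Lasso part of EN
p_alasso <- function(b, lam, w) lam * w * abs(b)                     # adaptive Lasso part of AEN
p_scad <- function(b, lam, a = 3.7) {
  ab <- abs(b)
  ifelse(ab <= lam, lam * ab,
  ifelse(ab <= a * lam, -(ab^2 - 2 * a * lam * ab + lam^2) / (2 * (a - 1)),
         (a + 1) * lam^2 / 2))
}
p_mcp <- function(b, lam, a = 3) {
  ab <- abs(b)
  ifelse(ab < a * lam, lam * (ab - ab^2 / (2 * a * lam)), a * lam^2 / 2)
}
## Full penalty of (5) for the EN case: sum_j P_lambda1(|beta_j|) + (lambda2/2)*||beta||_2^2
pen_en <- function(beta, lam1, lam2) sum(p_lasso(beta, lam1)) + lam2 / 2 * sum(beta^2)
```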

Before going through the details of the proposed algorithms for solving the BernSVM optimization problem, we would like to emphasize that, for small $\delta$, the minimizer of the penalized BernSVM in (5) is a good approximation to the minimizer of the penalized SVM. This is outlined in the following proposition.

Proposition 2.

Let the penalized SVM loss function be defined as follows

R(\bm{\beta},\beta_{0})=\frac{1}{n}\sum_{i=1}^{n}v(y_{i}(\beta_{0}+\bm{x}_{i}^{\top}\bm{\beta}))+P_{\lambda_{1},\lambda_{2}}(\bm{\beta}),

and let the penalized BernSVM loss function be given as

R(\bm{\beta},\beta_{0}|\delta)=\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}(\beta_{0}+\bm{x}_{i}^{\top}\bm{\beta}))+P_{\lambda_{1},\lambda_{2}}(\bm{\beta}).

Then, we have

\inf_{(\bm{\beta},\beta_{0})}R(\bm{\beta},\beta_{0})\leq\inf_{(\bm{\beta},\beta_{0})}R(\bm{\beta},\beta_{0}|\delta)\leq\inf_{(\bm{\beta},\beta_{0})}R(\bm{\beta},\beta_{0})+\delta.
Proof.

We have

B_{\delta}(t)=v(t)\mathbf{1}_{(|t-1|>\delta)}(t)+g_{\delta}(t)\mathbf{1}_{(|t-1|\leq\delta)}(t),

where $\mathbf{1}_{A}(t)=1$ if $t\in A$ and $0$ otherwise. Then, one can write

B_{\delta}(t)-v(t)=\begin{cases} 0, & \text{if } |t-1|>\delta,\\ g_{\delta}(t)\geq 0, & \text{if } |t-1|\leq\delta, \end{cases}

which means that $v(t)\leq B_{\delta}(t)$. We also have that $g_{\delta}(t)$ is decreasing for any $t\in[1-\delta,1+\delta]$, because $g_{\delta}^{\prime}(t)\leq 0$. Thus, one has $g_{\delta}(t)\leq g_{\delta}(1-\delta)=\delta$, which implies

v(t)+\delta-B_{\delta}(t)=\begin{cases} \delta\geq 0, & \text{if } |t-1|>\delta,\\ \delta-g_{\delta}(t)\geq 0, & \text{if } |t-1|\leq\delta. \end{cases}

We conclude that

v(t)\leq B_{\delta}(t)\leq v(t)+\delta.

This means that

R(\bm{\beta},\beta_{0})\leq R(\bm{\beta},\beta_{0}|\delta)\leq R(\bm{\beta},\beta_{0})+\delta,

which leads to

\inf_{(\bm{\beta},\beta_{0})}R(\bm{\beta},\beta_{0})\leq\inf_{(\bm{\beta},\beta_{0})}R(\bm{\beta},\beta_{0}|\delta)\leq\inf_{(\bm{\beta},\beta_{0})}R(\bm{\beta},\beta_{0})+\delta. \quad\blacksquare
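The sandwich inequality $v(t)\leq B_\delta(t)\leq v(t)+\delta$ at the heart of this proof can also be checked numerically; the snippet below is a small illustration of ours, not part of the paper.

```r
## Quick numeric check (ours) of v(t) <= B_delta(t) <= v(t) + delta.
v <- function(t) pmax(1 - t, 0)
B <- function(t, d) ifelse(abs(t - 1) > d, v(t),
         ((1 - t + d)^4 / 2 - (1 - t - d) * (1 - t + d)^3) / (8 * d^3))
d <- 0.3; t <- seq(-2, 4, length.out = 1000)
all(B(t, d) >= v(t)) && all(B(t, d) <= v(t) + d)   # should be TRUE
```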

In the remainder of Section 2, we propose two competing algorithms for solving (5) with convex penalties. The first one is a generalized coordinate descent algorithm based on the MM principle, while the second one is an IRLS-type algorithm. Moreover, we briefly describe a third algorithm, based on local linear approximation, for solving (5) with non-convex penalties.

2.2.1 The coordinate descent algorithm for penalized BernSVM

In this section, we present a generalized coordinate descent (GCD) algorithm to solve the penalized BernSVM, called BSVM-GCD, similar to the GCD approach proposed by (Yang and Zou, 2013). Because the second derivative of the BernSVM objective function is bounded, we approximate the coordinate-wise objective function by a surrogate function and use coordinate descent to solve the optimization problem in (5). The proposed approach is described next.

Notice, first, that the coordinate objective function of problem (5) can be written as follows

F(\beta_{j}|\beta_{0},\tilde{\beta}_{j}):=\frac{1}{n}\sum_{i=1}^{n}B_{\delta}\{r_{i}+y_{i}x_{ij}(\beta_{j}-\tilde{\beta}_{j})\}+P_{\lambda_{1},\lambda_{2}}(\beta_{j}), \qquad (6)

where $r_{i}=y_{i}(\tilde{\beta}_{0}+\bm{x}_{i}^{\top}\tilde{\bm{\beta}})$, and $\tilde{\beta}_{0}$ and $\tilde{\bm{\beta}}$ are the current estimates of $\beta_{0}$ and $\bm{\beta}$, respectively. As the loss function $B_{\delta}(\cdot)$ is convex, twice differentiable, and has a second derivative bounded by $L=3/(4\delta)$, one can rely on the mean value theorem to approximate the objective function in (6) by a (surrogate) quadratic function $Q_{\delta}(\cdot)$, and then use the MM principle to solve the new problem for each $\beta_{j}$, holding $\tilde{\beta}_{0}$ and $\beta_{k}=\tilde{\beta}_{k}$, for $k\neq j$, fixed at the current iteration:

\min_{\beta_{j}}\ Q_{\delta}(\beta_{j}|\tilde{\beta}_{0},\tilde{\beta}_{j}), \qquad (7)

where

Q_{\delta}(\beta_{j}|\beta_{0},\tilde{\beta}_{j})=\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(r_{i})+\frac{1}{n}\sum_{i=1}^{n}B_{\delta}^{\prime}(r_{i})y_{i}x_{ij}\,(\beta_{j}-\tilde{\beta}_{j})+\frac{L}{2}(\beta_{j}-\tilde{\beta}_{j})^{2}+P_{\lambda_{1},\lambda_{2}}(\beta_{j}). \qquad (8)

The next proposition gives the explicit solution of (7) for the different forms of $P_{\lambda_{1}}(\beta_{j})$.

Proposition 3.

Let $Q_{\delta}(\beta_{j}|\tilde{\beta}_{0},\tilde{\beta}_{j})$ be the surrogate loss function defined in (8), and let $P_{\lambda_{1}}(\cdot)$ be the $L_{1}$-norm or the adaptive Lasso penalty. The closed-form solution of the minimizer of (7) is given as follows:

\hat{\beta}_{j}^{EN}=\frac{S(Z_{j},\lambda_{1})}{\omega}, \qquad (9)

\hat{\beta}_{j}^{AEN}=\frac{S(Z_{j},\lambda_{1}w_{j})}{\omega}, \qquad (10)

where $S(a,b)=(|a|-b)_{+}\mathrm{sign}(a)$, $Z_{j}=-\frac{1}{n}\sum_{i=1}^{n}B_{\delta}^{\prime}(r_{i})y_{i}x_{ij}+L\tilde{\beta}_{j}$, $\omega=\lambda_{2}+L$ and $L=3/(4\delta)$.

The proof is outlined in Appendix A.
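For illustration, the closed-form update of Proposition 3 can be written in a few lines of base R; the sketch below is ours (the helper names soft, dbern and update_j are hypothetical) and computes one elastic-net coordinate update.

```r
## Hedged sketch (ours) of the closed-form coordinate update of Proposition 3.
soft  <- function(a, b) sign(a) * pmax(abs(a) - b, 0)                # S(a, b)
dbern <- function(t, d)                                              # B'_delta(t)
  ifelse(t < 1 - d, -1,
  ifelse(t > 1 + d, 0, (1 - t + d)^2 * (1 - t - 2 * d) / (4 * d^3)))

## One update of beta_j given current residuals r_i = y_i*(beta0 + x_i' beta)
update_j <- function(r, y, xj, beta_j, d, lam1, lam2) {
  L  <- 3 / (4 * d)
  Zj <- -mean(dbern(r, d) * y * xj) + L * beta_j                     # Z_j of Proposition 3
  soft(Zj, lam1) / (lam2 + L)                                        # elastic-net update (9)
}
```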

Algorithm 1 (BSVM-GCD), given next, details the steps for minimizing the objective function in (8) using both coordinate descent and the MM principle; a code sketch follows the algorithm listing.

  1. Initialize $\tilde{\beta}_{0}$ and $\tilde{\bm{\beta}}$;

  2. Iterate the following updates until convergence:

    (a) For $j=1,\ldots,p$, update $\tilde{\beta}_{j}$:
      * compute $r_{i}=y_{i}(\tilde{\beta}_{0}+\mathbf{x}_{i}^{\top}\tilde{\bm{\beta}})$;
      * set $\tilde{\beta}_{j}^{\mathrm{new}}\longleftarrow\hat{\beta}_{j}$, where $\hat{\beta}_{j}$ is given by (9) for the elastic net or (10) for the adaptive elastic net;

    (b) Update $\tilde{\beta}_{0}$:
      * compute $r_{i}=y_{i}(\tilde{\beta}_{0}+\mathbf{x}_{i}^{\top}\tilde{\bm{\beta}})$;
      * set $\tilde{\beta}_{0}^{\mathrm{new}}\longleftarrow\tilde{\beta}_{0}-L^{-1}\frac{1}{n}\sum_{i=1}^{n}B_{\delta}^{\prime}(r_{i})y_{i}$.

Algorithm 1: The BSVM-GCD algorithm to solve the penalized BernSVM with the elastic net and adaptive elastic net penalties.
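The following base R function is a minimal, illustrative sketch of Algorithm 1 for the elastic-net penalty, assuming standardized columns of X and labels in {-1, 1}; it is not the packaged implementation and omits refinements such as warm starts and active-set cycling.

```r
## Minimal sketch (ours) of Algorithm 1 (BSVM-GCD) with the elastic-net penalty.
bsvm_gcd <- function(X, y, lam1, lam2, d = 0.01, maxit = 200, tol = 1e-6) {
  n <- nrow(X); p <- ncol(X)
  L <- (1 + 1e-6) * 3 / (4 * d)          # relaxed bound L~, giving strict descent
  b0 <- 0; beta <- rep(0, p)
  soft  <- function(a, b) sign(a) * pmax(abs(a) - b, 0)
  dbern <- function(t) ifelse(t < 1 - d, -1,
                       ifelse(t > 1 + d, 0,
                              (1 - t + d)^2 * (1 - t - 2 * d) / (4 * d^3)))
  for (it in 1:maxit) {
    old <- c(b0, beta)
    for (j in 1:p) {                     # step 2(a): coordinate-wise updates
      r  <- y * (b0 + X %*% beta)
      Zj <- -mean(dbern(r) * y * X[, j]) + L * beta[j]
      beta[j] <- soft(Zj, lam1) / (lam2 + L)
    }
    r  <- y * (b0 + X %*% beta)          # step 2(b): intercept update
    b0 <- b0 - mean(dbern(r) * y) / L
    if (max(abs(c(b0, beta) - old)) < tol) break
  }
  list(b0 = b0, beta = beta)
}
```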

The upper bound of the second derivative of the BernSVM loss function is attained at $t=1$, so that we only have $F(t|\beta_{0},\tilde{\beta}_{j})\leq Q(t|\beta_{0},\tilde{\beta}_{j})$. To obtain a strict inequality, one can relax the upper bound $L$ and use $\tilde{L}=(1+\epsilon)L$, with $\epsilon=10^{-6}$. Hence, Algorithm 1 is implemented with $\tilde{L}$ in place of $L$. This leads to the strict descent property of the BSVM-GCD algorithm, which is given in the next proposition.

Proposition 4.

The strict descent property of the BSVM-GCD approach is obtained using the upper bound $\tilde{L}$ and is given by

F(\beta_{j}|\beta_{0},\tilde{\beta}_{j})=Q(\beta_{j}|\beta_{0},\tilde{\beta}_{j}), \quad\text{if}\quad \beta_{j}=\tilde{\beta}_{j},
F(\beta_{j}|\beta_{0},\tilde{\beta}_{j})<Q(\beta_{j}|\beta_{0},\tilde{\beta}_{j}), \quad\text{if}\quad \beta_{j}\neq\tilde{\beta}_{j}.
Proof.

Using a second-order Taylor expansion and the fact that the BernSVM loss function is twice differentiable, we have

\frac{1}{n}\sum_{i=1}^{n}B_{\delta}\{r_{i}+y_{i}x_{ij}(\beta_{j}-\tilde{\beta}_{j})\} = \frac{1}{n}\sum_{i=1}^{n}B_{\delta}(r_{i})+\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{ij}B_{\delta}^{\prime}(r_{i})(\beta_{j}-\tilde{\beta}_{j})+\frac{1}{2n}\sum_{i=1}^{n}y_{i}^{2}x_{ij}^{2}B_{\delta}^{\prime\prime}(r_{i}^{*})(\beta_{j}-\tilde{\beta}_{j})^{2} \leq \frac{1}{n}\sum_{i=1}^{n}B_{\delta}(r_{i})+\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{ij}B_{\delta}^{\prime}(r_{i})(\beta_{j}-\tilde{\beta}_{j})+\frac{L}{2}(\beta_{j}-\tilde{\beta}_{j})^{2},

for some $r_{i}^{*}$ between $r_{i}$ and $r_{i}+y_{i}x_{ij}(\beta_{j}-\tilde{\beta}_{j})$. The last inequality holds because $y_{i}^{2}=1$, $\sum_{i=1}^{n}x_{ij}^{2}=n$, and $B_{\delta}^{\prime\prime}(\cdot)\leq L$ from Proposition 1. Hence, with $\tilde{L}$ in place of $L$, we have $F(\beta_{j}|\beta_{0},\tilde{\beta}_{j})=Q(\beta_{j}|\beta_{0},\tilde{\beta}_{j})$ if $\beta_{j}=\tilde{\beta}_{j}$, and $F(\beta_{j}|\beta_{0},\tilde{\beta}_{j})<Q(\beta_{j}|\beta_{0},\tilde{\beta}_{j})$ for $\beta_{j}\neq\tilde{\beta}_{j}$. ∎

This proposition shows that the objective function $F(\cdot)$ decreases after each majorization update, which means that for $\tilde{\beta}^{new}_{j}\neq\tilde{\beta}_{j}$ we have

F(\tilde{\beta}^{new}_{j}|\beta_{0},\tilde{\beta}_{j})<F(\tilde{\beta}_{j}|\beta_{0},\tilde{\beta}_{j}).

This result shows that the BSVM-GCD algorithm enjoys the strict descent property.

2.2.2 The IRLS algorithm for penalized BernSVM

In this section, we present an IRLS-type algorithm combined with coordinate descent, termed BSVM-IRLS, to solve the penalized BernSVM in (5). In their approach, (Friedman et al., 2010) used IRLS to solve regularized logistic regression because the second derivative exists (glmnet R package). On the other hand, (Yang and Zou, 2013) used a GCD algorithm to solve the regularized HHSVM because the Huber loss does not have a second derivative everywhere (gcdnet R package). Our BernSVM loss function, in contrast, is very smooth, and one can rely on this advantage to solve (5) using the IRLS trick combined with coordinate descent. The BSVM-IRLS approach is described below.

Firstly, we consider the non-penalized BernSVM with loss function

\ell(\beta_{0},\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}(\beta_{0}+\bm{x}_{i}^{\top}\bm{\beta}))=\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}\bm{X}_{i}^{\top}\bm{b}), \qquad (11)

where $\bm{b}=(\beta_{0},\bm{\beta}^{\top})^{\top}$ and $\bm{X}_{i}=(1,\bm{x}_{i}^{\top})^{\top}$. To solve

\frac{\partial\ell}{\partial\bm{b}}(\beta_{0},\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}y_{i}\bm{X}_{i}B^{\prime}_{\delta}(y_{i}\bm{X}_{i}^{\top}\bm{b})=0,

one can use the Newton-Raphson algorithm, because the loss function has a continuous second derivative given by

\frac{\partial^{2}\ell}{\partial\bm{b}\partial\bm{b}^{\top}}(\beta_{0},\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}B^{\prime\prime}_{\delta}(y_{i}\bm{X}_{i}^{\top}\bm{b})\bm{X}_{i}^{\top}.

The update of the coefficients is then given by

\bm{b}^{new}=\tilde{\bm{b}}-\Big(\frac{\partial^{2}\ell}{\partial\bm{b}\partial\bm{b}^{\top}}(\tilde{\bm{b}})\Big)^{-1}\frac{\partial\ell}{\partial\bm{b}}(\tilde{\bm{b}}).

Let $u_{i}=y_{i}B^{\prime}_{\delta}(y_{i}\bm{X}_{i}^{\top}\tilde{\bm{b}})$, $\phi_{ii}=B^{\prime\prime}_{\delta}(y_{i}\bm{X}_{i}^{\top}\tilde{\bm{b}})$ and $\bm{\Phi}=\mathrm{diag}(\phi_{ii})$, for $i=1,\ldots,n$. Then, the Newton-Raphson update can be written as follows:

\bm{b}^{new}=\tilde{\bm{b}}-(\bm{X}^{\top}\bm{\Phi}\bm{X})^{-1}\bm{X}^{\top}\bm{u}=(\bm{X}^{\top}\bm{\Phi}\bm{X})^{-1}\bm{X}^{\top}\bm{\Phi}[\bm{X}\tilde{\bm{b}}-\bm{\Phi}^{-1}\bm{u}]=(\bm{X}^{\top}\bm{\Phi}\bm{X})^{-1}\bm{X}^{\top}\bm{\Phi}\bm{z},

where $\bm{z}=\bm{X}\tilde{\bm{b}}-\bm{\Phi}^{-1}\bm{u}$. Thus, $\bm{b}^{new}$ is the solution of a weighted least squares regression problem with working response $\bm{z}$ and loss function

\ell(\beta_{0},\bm{\beta})\approx\frac{1}{2n}\sum_{i=1}^{n}\phi_{ii}(z_{i}-\bm{X}_{i}^{\top}\bm{b})^{2}.

This problem can be solved by an IRLS algorithm when $p<n$. In high-dimensional settings, the penalized BernSVM defined in (5) can be solved by combining IRLS with cyclic coordinate descent. Thus, for the adaptive elastic net penalty, the solution is given by

\hat{\beta}^{AEN}_{j}=\frac{S\Big(\frac{1}{n}\sum_{i=1}^{n}\phi_{ii}x_{ij}r_{i}+\frac{1}{n}\sum_{i=1}^{n}\phi_{ii}x_{ij}^{2}\,\tilde{\beta}_{j},\ w_{j}\lambda_{1}\Big)}{\lambda_{2}+\frac{1}{n}\sum_{i=1}^{n}\phi_{ii}x_{ij}^{2}}\quad\text{and}\quad\hat{\beta}^{AEN}_{0}=\frac{\sum_{i=1}^{n}\phi_{ii}r_{i}}{\sum_{i=1}^{n}\phi_{ii}}, \qquad (12)

where $r_{i}=z_{i}-\tilde{\beta}_{0}-\bm{x}_{i}^{\top}\tilde{\bm{\beta}}$ and $(\tilde{\beta}_{0},\tilde{\bm{\beta}})$ is the estimate of $(\beta_{0},\bm{\beta})$ obtained at the current iteration. Of note, to deal with zero weights (i.e., $\phi_{ii}=0$ for some $i$), one can replace them by a small constant $\xi=0.001$ or use a modified Newton-Raphson (NR) algorithm and replace all the weights by an upper bound of the second derivative. In our work, we adopted the modified NR algorithm and set $\phi_{ii}=L$ for $i=1,\ldots,n$, where $L=\frac{3}{4\delta}$ is the upper bound of the second derivative of the proposed loss function $B_{\delta}(\cdot)$ given by (1). A code sketch of this scheme is given after Algorithm 2 below.

  1. Initialize $\tilde{\bm{b}}=(\tilde{\beta}_{0},\tilde{\bm{\beta}}^{\top})^{\top}$ and let $L=\frac{3}{4\delta}$;

  2. Iterate the following updates until convergence:

    (a) Compute $\bm{u}$;

    (b) Compute the working response $\bm{z}=\bm{X}\tilde{\bm{b}}-L^{-1}\bm{u}$;

    (c) Use cyclic coordinate descent to solve the penalized weighted least squares problem

      \hat{\bm{b}}=(\hat{\beta}_{0},\hat{\bm{\beta}}^{\top})^{\top}=\underset{(\beta_{0},\bm{\beta})}{\arg\min}\ \frac{1}{2n}\sum_{i=1}^{n}\phi_{ii}(z_{i}-\bm{X}_{i}^{\top}\bm{b})^{2}+P_{\lambda_{1},\lambda_{2}}(\bm{\beta}),

      where $(\hat{\beta}_{0},\hat{\beta}_{j})$, $j=1,\ldots,p$, are given by (12);

    (d) Set $\tilde{\bm{b}}=\hat{\bm{b}}$.

Algorithm 2: The BSVM-IRLS algorithm to solve the penalized BernSVM with the elastic net penalty.
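The sketch below (ours, not the packaged implementation) illustrates Algorithm 2 with the modified Newton-Raphson weights $\phi_{ii}=L$ and the elastic net penalty, under the same assumptions as the BSVM-GCD sketch (standardized X, labels in {-1, 1}).

```r
## Minimal sketch (ours) of Algorithm 2 (BSVM-IRLS) with phi_ii = L = 3/(4*delta).
bsvm_irls <- function(X, y, lam1, lam2, d = 0.01, maxit = 200, tol = 1e-6) {
  n <- nrow(X); p <- ncol(X)
  L <- 3 / (4 * d)
  b0 <- 0; beta <- rep(0, p)
  soft  <- function(a, b) sign(a) * pmax(abs(a) - b, 0)
  dbern <- function(t) ifelse(t < 1 - d, -1,
                       ifelse(t > 1 + d, 0,
                              (1 - t + d)^2 * (1 - t - 2 * d) / (4 * d^3)))
  for (it in 1:maxit) {
    old <- c(b0, beta)
    eta <- b0 + drop(X %*% beta)
    u   <- y * dbern(y * eta)            # u_i = y_i B'_delta(y_i X_i' b~)
    z   <- eta - u / L                   # working response (step 2(b))
    for (j in 1:p) {                     # step 2(c): one cycle of coordinate descent
      r  <- z - b0 - drop(X %*% beta)    # full residual of the weighted LS problem
      Zj <- L * mean(X[, j] * r) + L * beta[j]
      beta[j] <- soft(Zj, lam1) / (lam2 + L)   # update (12) with w_j = 1, phi_ii = L
    }
    b0 <- mean(z - drop(X %*% beta))     # weighted LS intercept (constant weights)
    if (max(abs(c(b0, beta) - old)) < tol) break
  }
  list(b0 = b0, beta = beta)
}
```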

2.2.3 Local linear approximation algorithm for BernSVM with non-convex penalties

We consider in this section the penalized BernSVM with the non-convex SCAD or MCP penalty, and we propose to solve the resulting optimization problem using the local linear approximation (LLA) of (Zou and Li, 2008).

Without loss of generality, we assume in this section that $\lambda_{2}=0$. The LLA approach is an appropriate approximation of the SCAD or MCP penalty. It is based on a first-order Taylor expansion of the MCP or SCAD penalty function around $|\tilde{\beta}_{j}|$, and adapts both penalties so that they can be solved as an iterative adaptive Lasso penalty, with the weights updated/estimated at each iteration. In our context, this means that the penalized BernSVM with the SCAD or MCP penalty can be solved iteratively as follows:

(\tilde{\beta}_{0},\tilde{\bm{\beta}})^{new}=\arg\min_{(\beta_{0},\bm{\beta})}\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}(\beta_{0}+\bm{x}_{i}^{\top}\bm{\beta}))+\lambda_{1}\|\hat{\bm{W}}\bm{\beta}\|_{1}, \qquad (13)

where $\hat{\bm{W}}=\mathrm{diag}\{\hat{w}_{j}\}$, with $\hat{w}_{j}=P^{\prime}_{\lambda_{1}}(|\tilde{\beta}_{j}|)/\lambda_{1}$, $j=1,\ldots,p$, the current weights, and $P^{\prime}_{\lambda_{1}}(\cdot)$ is the first derivative of the SCAD or MCP penalty, given respectively by

P^{\prime}_{\lambda_{1}}(t)=\lambda_{1}\mathbb{1}_{\{t\leq\lambda_{1}\}}+\frac{(a\lambda_{1}-t)_{+}}{a-1}\mathbb{1}_{\{t>\lambda_{1}\}}

and

P^{\prime}_{\lambda_{1}}(t)=\Big(\lambda_{1}-\frac{t}{a}\Big)_{+}.

To sum up, the penalized BernSVM with the LLA penalty is an iterative algorithm, which at each iteration solves a weighted-Lasso problem via BSVM-IRLS (or BSVM-GCD). The steps of the LLA algorithm for the penalized BernSVM are described in Algorithm 3 below, followed by a short code sketch.

  1. Initialize $\tilde{\beta}_{0}$ and $\tilde{\bm{\beta}}$ and compute $\hat{w}_{j}^{0}=P^{\prime}_{\lambda_{1}}(|\tilde{\beta}_{j}|)/\lambda_{1}$ for $j=1,\ldots,p$;

  2. Iterate the following updates until convergence:

    (a) For $j=0,1,\ldots,p$, update $\beta_{0}$ and $\beta_{j}$ by solving the problem in (13) using either the BSVM-IRLS or the BSVM-GCD algorithm;

    (b) Set $\hat{w}_{j}^{\mathrm{new}}=P^{\prime}_{\lambda_{1}}(|\tilde{\beta}_{j}^{\mathrm{new}}|)/\lambda_{1}$.

Algorithm 3: The local linear approximation algorithm to solve the penalized BernSVM with the SCAD or MCP penalty.
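The sketch below (ours) illustrates Algorithm 3 with the SCAD weight update; the helper names (scad_deriv, mcp_deriv, bsvm_wlasso, bsvm_lla) are hypothetical, and the inner solver is a compact weighted version of the BSVM-GCD sketch with $\lambda_2=0$.

```r
## Sketch (ours) of the LLA iterations of Algorithm 3 with a weighted-Lasso inner solver.
scad_deriv <- function(t, lam, a = 3.7)
  lam * (t <= lam) + pmax(a * lam - t, 0) / (a - 1) * (t > lam)      # SCAD P'_lambda1
mcp_deriv  <- function(t, lam, a = 3) pmax(lam - t / a, 0)           # MCP  P'_lambda1

bsvm_wlasso <- function(X, y, lam1, w, d = 0.01, maxit = 100) {
  n <- nrow(X); p <- ncol(X); L <- (1 + 1e-6) * 3 / (4 * d)
  b0 <- 0; beta <- rep(0, p)
  soft  <- function(a, b) sign(a) * pmax(abs(a) - b, 0)
  dbern <- function(t) ifelse(t < 1 - d, -1, ifelse(t > 1 + d, 0,
                       (1 - t + d)^2 * (1 - t - 2 * d) / (4 * d^3)))
  for (it in 1:maxit) {
    for (j in 1:p) {
      r <- y * (b0 + drop(X %*% beta))
      ## lambda2 = 0 here, so the denominator reduces to L
      beta[j] <- soft(-mean(dbern(r) * y * X[, j]) + L * beta[j], lam1 * w[j]) / L
    }
    r  <- y * (b0 + drop(X %*% beta))
    b0 <- b0 - mean(dbern(r) * y) / L
  }
  list(b0 = b0, beta = beta)
}

bsvm_lla <- function(X, y, lam1, d = 0.01, steps = 3) {
  w   <- rep(1, ncol(X))                       # start from the Lasso solution
  fit <- bsvm_wlasso(X, y, lam1, w, d)
  for (k in 1:steps) {                         # LLA iterations: re-weight, then re-solve
    w   <- scad_deriv(abs(fit$beta), lam1) / lam1
    fit <- bsvm_wlasso(X, y, lam1, w, d)
  }
  fit
}
```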

3 Theoretical guarantees for the penalized BernSVM estimator

In this section, we develop an upper bound on the $\ell_{2}$ norm of the estimation error $\|\hat{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}$ for the penalized BernSVM with the weighted Lasso ($\lambda_{2}=0$) and the SCAD or MCP penalties. Here, $\hat{\bm{\beta}}$ is the minimizer of the corresponding penalized BernSVM and $\bm{\beta}^{\star}$ is the minimizer of the population version of (5) without the penalty term; i.e., $\bm{\beta}^{\star}$ is the minimizer of $\mathbb{E}[B_{\delta}(y\bm{x}^{\top}\bm{\beta})]$. For the sake of simplicity, the intercept is omitted in the theoretical development.

3.1 An upper bound of the estimation error for weighted lasso penalty

In this section, we derive an upper bound on the estimation error $\|\hat{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}$ for the penalized BernSVM with the weighted Lasso penalty, using the properties of the BernSVM loss function.

The weighted penalized BernSVM optimization problem is defined as follows

\arg\min_{\bm{\beta}\in\mathbb{R}^{p}}L(\bm{\beta}):=\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}\bm{x}_{i}^{\top}\bm{\beta})+\lambda_{1}\|\bm{W}\bm{\beta}\|_{1}, \qquad (14)

where $\bm{W}=\mathrm{diag}(\bm{w})$ is a diagonal matrix of known weights, with $w_{j}\geq 0$ for $j=1,\ldots,p$. The (unweighted) Lasso is a special case of (14), with $w_{j}=1$, $j\in\{1,\ldots,p\}$.
Let $\hat{\bm{\beta}}$ be a minimizer of (14), and assume that the population-level minimizer $\bm{\beta}^{\star}$ defined above is sparse and unique. Define $S\subset\{1,2,\ldots,p\}$ as the set of non-zero elements of $\bm{\beta}^{\star}$, with $|S|=s$, and let $S^{c}$ be the complement of $S$.
In order to obtain an upper bound on the estimation error, the BernSVM loss function needs to satisfy a strong convexity assumption. This means that the minimum eigenvalue of its Hessian matrix, defined as $\bm{H}=\bm{X}^{\top}\bm{D}\bm{X}/n$ with $\bm{D}=\mathrm{diag}\{B_{\delta}^{\prime\prime}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})\}$, $i=1,\ldots,n$, is bounded below by a constant $\kappa>0$. In high-dimensional settings, with $p>n$, $\bm{H}$ has rank at most $n$; therefore, the strong convexity assumption on the Hessian matrix cannot hold. To overcome this problem, one can instead require a restricted strong convexity assumption on a restricted set $\mathcal{C}\subset\mathbb{R}^{p}$ (Negahban et al., 2009), which is defined next.

Definition 1 (Restricted Strong Convexity (RSC)).

A design matrix $\bm{X}$ satisfies the RSC condition on $\mathcal{C}$ with $\kappa>0$ if

\frac{\|\bm{X}\bm{h}\|_{2}^{2}}{n}\geq\kappa\|\bm{h}\|_{2}^{2}, \qquad \forall\bm{h}\in\mathcal{C}.

We also state the following assumptions:

  • A1: Let

    \mathcal{C}(S)=\{\bm{h}\in\mathbb{R}^{p}:\|\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}\|_{1}\leq\gamma\|\bm{h}_{S}\|_{1}\},

    where $\gamma>0$ and $\bm{h}_{\mathcal{A}}=(h_{j}:j\in\mathcal{A})$, with $\mathcal{A}\subset\{1,\ldots,p\}$, is a sub-vector of $\bm{h}$. We assume that the design matrix satisfies, for all $\bm{h}\in\mathcal{C}$, the condition

    \frac{\|\bm{X}\bm{h}\|_{2}^{2}}{n}\geq\kappa_{1}\|\bm{h}\|_{2}^{2}-\kappa_{2}\frac{\log(p)}{n}\|\bm{h}\|^{2}_{1},

    with probability $1-\exp(-c_{0}n)$, for some $c_{0}>0$, where the $\bm{x}_{i}$, $i=1,\ldots,n$, are i.i.d. Gaussian or sub-Gaussian.

  • A2: There exists a ball $\bm{B}(\bm{x}_{0},r_{0})$, centered at a point $\bm{x}_{0}$ with radius $r_{0}$, such that $B^{\prime\prime}_{\delta}\circ f(\bm{x}_{0})>0$, with $f(\bm{x}_{0})=y\bm{x}^{\top}_{0}\bm{\beta}^{\star}$, and for any $\bm{x}\in\bm{B}(\bm{x}_{0},r_{0})$ we have $B^{\prime\prime}_{\delta}\circ f(\bm{x})>\kappa_{3}$, for some constant $\kappa_{3}>0$.

  • A3: The columns of the design matrix $\bm{X}$ are bounded: $\|\bm{x}_{j}\|_{2}\leq M$ for all $j\in\{1,\ldots,p\}$.

  • A4: The densities of $\bm{X}$ given $\bm{y}=1$ and $\bm{y}=-1$ are continuous and have at least a first moment.

(Raskutti et al., 2010) show that Assumption A1 is satisfied with high probability for Gaussian random matrices, and (Rudelson and Zhou, 2012) extend this result to sub-Gaussian random matrices. The existence of $\bm{x}_{0}$ in Assumption A2 follows from the behavior of the second derivative $B^{\prime\prime}_{\delta}(\cdot)$ (see Proposition 1). In fact, we have $B^{\prime\prime}_{\delta}(1)=\frac{3}{4\delta}>0$; then, if we let $f(\bm{x}_{0})=1$, we can choose $\bm{x}_{0}\in f^{-1}(\{1\})\subset\mathbb{R}^{p}$. Hence, by continuity of $B^{\prime\prime}_{\delta}\circ f(\cdot)$, we can choose $\kappa_{3}=\frac{3}{4\delta}-\epsilon>0$ for some small $\epsilon>0$. Note that this assumption is similar to Assumptions A2 and A4 of (Koo et al., 2008), which ensure the positive-definiteness of the Hessian around $\bm{\beta}^{\star}$. Assumption A3 is needed to ensure that the first derivative of the loss function is a bounded random variable around $\bm{\beta}^{\star}$. Moreover, Assumption A4 is needed to ensure the existence of the first derivative of the population version of the loss function. Assumption A4 is similar to A1 in (Koo et al., 2008); however, Assumption A2 of (Koo et al., 2008) requires at least two moments to ensure the continuity of the Hessian around $\bm{\beta}^{\star}$.

The next lemma establishes that the Hessian matrix $\bm{H}$ satisfies the RSC condition with high probability.

Lemma 1.

Under Assumptions A1 and A2, if

n>\frac{2\kappa_{2}s\log(p)(1+\gamma\|\bm{W}_{S^{c}}^{-1}\|_{\infty})^{2}}{\kappa_{1}},

then the Hessian matrix $\bm{H}$ satisfies the RSC condition on $\mathcal{C}$ with $\kappa=\frac{\kappa_{3}\kappa_{1}}{2}$, with probability $1-\exp(-c_{0}n)$, for some universal constant $c_{0}>0$.

The proof of Lemma 1 is given in Appendix B.

The following theorem establishes an upper bound on the estimation error of the penalized BernSVM with the adaptive Lasso.

Theorem 2.

Assume that Assumptions A1-A3 are met and that $\lambda_{1}\geq\frac{M(1+\gamma\|\bm{W}_{\bm{S}^{c}}^{-1}\|_{\infty})}{\sqrt{n}(\gamma-\|\bm{W}_{\bm{S}}\|_{\infty})}$ for any $\gamma>\|\bm{W}_{\bm{S}}\|_{\infty}$. Then we have $\hat{\bm{\beta}}-\bm{\beta}^{\star}\in\mathcal{C}$, and the coefficient estimator of the BernSVM with the adaptive Lasso satisfies

\|\hat{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}\leq\frac{\big(\frac{M}{\sqrt{n}}+\|\bm{W}_{\bm{S}}\|_{\infty}\lambda_{1}\big)\sqrt{s}}{\kappa}. \qquad (15)

The proof of Theorem 2 is given in Appendix B. In Theorem 2, the upper bound depends on $\lambda_{1}$. Therefore, a choice of $\lambda_{1}$ is necessary to capture the near-oracle property of the estimator (Peng et al., 2016). Note that in Theorem 2 we used the fact that the first derivative of the loss function is bounded. In the next lemma, we use a probabilistic upper bound of the first derivative of the BernSVM loss function, with the aim of giving a better choice of $\lambda_{1}$ and guaranteeing a rate of $\mathbf{O}_{P}(\sqrt{s\log(p)/n})$ for the error $\|\hat{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}$.

Lemma 2.

Let $z_{j}=-\frac{1}{n}\sum_{i=1}^{n}B_{\delta}^{\prime}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}x_{ij}$ and $z^{\star}=\max_{j=1,\ldots,p}|z_{j}|$.

The variables $z_{j}$ are sub-Gaussian with variance $\frac{M^{2}}{n}$ and, for all $t>0$, $z^{\star}$ satisfies

P(z^{\star}>t)\leq 2p\exp\Big(\frac{-t^{2}n}{2M^{2}}\Big).

The proof of this lemma is given in Appendix C. The following theorem gives the rate of convergence of the error $\|\hat{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}$.

Theorem 3.

Let $w_{max}=\|\bm{W}_{S}\|_{\infty}$ and $w_{min}=\|\bm{W}^{-1}_{S^{c}}\|_{\infty}$, and let $\hat{\bm{\beta}}$ be the solution of the BernSVM with the adaptive Lasso. For any $\gamma>w_{max}$, we assume that the event $P_{1}=\{z^{\star}\leq\frac{(\gamma-w_{max})\lambda_{1}}{1+\gamma w_{min}}\}$ holds. Let $\lambda_{1}=\sqrt{2}M\frac{1+\gamma w_{min}}{\gamma-w_{max}}\sqrt{\frac{\log(2p/\xi)}{n}}$ for a small $\xi$. Then, under Assumptions A1-A4, we have

\|\hat{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}=\bm{O}_{P}(\sqrt{s\log(p)/n}).

The proof of this theorem is similar to the proof of Theorem 2. It is detailed in Appendix C.

Remark 1.

The event $P_{1}$, defined in Theorem 3, is realized with high probability $1-\xi$. This can be verified by taking $t=\sqrt{2}M\sqrt{\frac{\log(2p/\xi)}{n}}$ and applying Lemma 2, which leads to

P\Big(z^{\star}>\frac{(\gamma-w_{max})\lambda_{1}}{1+\gamma w_{min}}\Big)=P(z^{\star}>t)\leq 2p\exp\Big(\frac{-2M^{2}\log(2p/\xi)}{2M^{2}}\Big)=\xi.
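For concreteness, the tuning value $\lambda_1$ prescribed by Theorem 3 is easy to compute; the small snippet below is an illustration of ours (not from the paper), shown for the unweighted Lasso case $w_{max}=w_{min}=1$ and assuming standardized columns, so that $M=\sqrt{n}$.

```r
## Illustrative computation (ours) of lambda_1 from Theorem 3.
lambda1_theory <- function(n, p, M, gamma = 3, xi = 0.05, wmin = 1, wmax = 1)
  sqrt(2) * M * (1 + gamma * wmin) / (gamma - wmax) * sqrt(log(2 * p / xi) / n)

## Example: n = 100, p = 5000, standardized columns so ||x_j||_2 = sqrt(n)
lambda1_theory(n = 100, p = 5000, M = sqrt(100))
```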

In the next section we will derive an upper bound for BernSVM with weighted Lasso, where the weights are estimated.

3.2 An upper bound of the estimation error for weighted Lasso with estimated weights

In this section, let $w_{j}$ be unknown positive weights and $\hat{w}_{j}$ their corresponding estimates, for $j=1,\ldots,p$, which are assumed to be non-negative. Let $\hat{\bm{\beta}}$ be a minimizer of the weighted Lasso with weights $\hat{w}_{j}$, defined as follows:

\hat{\bm{\beta}}=\arg\min_{\bm{\beta}\in\mathbb{R}^{p}}\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}\bm{x}_{i}^{\top}\bm{\beta})+\lambda_{1}\sum_{j=1}^{p}\hat{w}_{j}|\beta_{j}|.

To derive an upper bound on $\|\hat{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}$, we suppose that the following event $\Omega_{0}$ holds (Huang and Zhang, 2012):

\Omega_{0}=\{\hat{w}_{j}\leq w_{j},\ \forall j\in\bm{S}\}\cap\{w_{j}\leq\hat{w}_{j},\ \forall j\in\bm{S}^{c}\}.

The next theorem gives an upper bound on the estimation error of the BernSVM with weighted Lasso, where the weights are estimated.

Theorem 4.

Under the same conditions as in Theorem 3 and with $\lambda_{1}=\sqrt{2}M\frac{1+\gamma w_{min}}{\gamma-w_{max}}\sqrt{\frac{\log(2p/\xi)}{n}}$, we have

\|\hat{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}=\bm{O}_{P}(\sqrt{s\log(p)/n}),

on the event $\Omega_{0}\cap P_{1}$.

The proof of this theorem is outlined in Appendix D.

Remark 2.

We have shown in Remark 1 that the event $P_{1}$ is realized with high probability $1-\xi$. In Theorem 4, we need the event $\Omega_{0}\cap P_{1}$ to be realized with high probability as well. Indeed, $P(\Omega_{0}\cup P_{1})=P(\Omega_{0})+P(P_{1})-P(\Omega_{0}\cap P_{1})\leq 1$ leads to $P(\Omega_{0}\cap P_{1})\geq P(\Omega_{0})-\xi$.

In the next section, we extend Theorem 3 to non-convex penalties.

3.3 An extension of the upper bound for non-convex penalties

We study here the penalized BernSVM estimator with the non-convex SCAD and MCP penalties, and we establish an upper bound on its estimation error under some regularity conditions.
Recall that Algorithm 3 describes the steps to compute the solution of the BernSVM with non-convex penalties, as defined in Equation (13). The weights $\hat{w}_{j}$, $j=1,\ldots,p$, are computed iteratively using the first derivative of the SCAD or MCP penalty. To tackle such a problem, we first search for an upper bound on $\|\hat{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}$, where $\hat{\bm{\beta}}$ is an estimator defined by Equation (13). The next theorem states this result.

Theorem 5.

Assume the same conditions as in Theorem 3 concerning the event $P_{1}$. Let $\tilde{\bm{\beta}}$ be an initial estimator of $\bm{\beta}$ in Algorithm 3, and let the weights in (13) be given by $\hat{w}_{j}=P^{\prime}_{\lambda_{1}}(|\tilde{\beta}_{j}|)/\lambda_{1}$ for $j=1,\ldots,p$. Then, for any $\gamma>w_{max}$, we have

\|\hat{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}\leq\frac{1}{\kappa}\Big\{\frac{(\gamma-w_{max})\lambda_{1}}{1+\gamma w_{min}}+\|P^{\prime}_{\lambda_{1}}(|\bm{\beta}^{\star}_{S}|)\|_{2}+\frac{1}{a-1}\|\tilde{\bm{\beta}}-\bm{\beta}^{\star}\|_{2}\Big\}. \qquad (16)

The proof of Theorem 5 is given in Appendix E. We are now able to derive an upper bound on $\|\hat{\bm{\beta}}^{(l)}-\bm{\beta}^{\star}\|_{2}$, where $\hat{\bm{\beta}}^{(l)}$ is the estimator at the $l$-th iteration of Algorithm 3.

Theorem 6.

Assume the same conditions as in Theorems 3 and 4. Let $\tilde{\bm{\beta}}$ be the minimizer of the BernSVM Lasso defined by Equation (14), with $w_{j}=1$ for $j=1,\ldots,p$, and let $\hat{\bm{\beta}}^{(l)}$ be the estimator at the $l$-th iteration of Algorithm 3. Let $\lambda_{1}=\sqrt{2}M\frac{1+\gamma w_{min}}{\gamma-w_{max}}\sqrt{\frac{\log(2p/\xi)}{n}}$. Then, we have

\|\hat{\bm{\beta}}^{(l)}-\bm{\beta}^{\star}\|_{2}=\bm{O}_{P}(\sqrt{s\log(p)/n}).

The proof of Theorem 6 is straightforward and can be outlined as follows. Since $P_{\lambda_{1}}(t)$ is concave in $t$, its derivative satisfies $P^{\prime}_{\lambda_{1}}(t)\leq P^{\prime}_{\lambda_{1}}(0)=\lambda_{1}$, so that $\|P^{\prime}_{\lambda_{1}}(|\bm{\beta}^{\star}_{S}|)\|_{2}\leq\sqrt{s}\lambda_{1}$, which means that $\|P^{\prime}_{\lambda_{1}}(|\bm{\beta}^{\star}_{S}|)\|_{2}=\bm{O}(\sqrt{s\log(p)/n})$. Then, by induction on Equation (16), at each iteration we have $\|\hat{\bm{\beta}}^{(l)}-\bm{\beta}^{\star}\|_{2}=\bm{O}_{P}(\sqrt{s\log(p)/n})$.

4 Simulation and empirical studies

We study the empirical behavior of BernSVM and its competitors in terms of computational time, and we evaluate the estimation accuracy of the methods through different measures of performance, via simulations and applications to real genetic data sets.

4.1 Simulation study

In this section, we first compare the computation times of three algorithms: BSVM-GCD, BSVM-IRLS and HHSVM. In a second step, we examine the finite-sample performance of eight regularized SVM methods with different penalties, in terms of five measures of performance.

4.1.1 Simulation study designs

The data sets are generated following the three scenarios described below.

  1. Scenario 1: In this scenario, the columns of $\bm{X}\in\mathbb{R}^{n\times p}$, with $n=100$ and $p=5000$, are generated from a multivariate normal distribution with mean $\bm{0}$ and correlation matrix $\bm{\Sigma}$ with a compound symmetry structure, where the correlation between all covariates is fixed to $\rho=0.5$. The coefficients are set as follows:

    \beta^{\star}_{j}=(-1)^{j}\exp(-(2j-1)/20)

    for $j=1,2,\ldots,50$, and $\beta^{\star}_{j}=0$ for $j=51,\ldots,p$. Firstly, we generate

    z_{i}=\bm{x}_{i}^{\top}\bm{\beta}^{\star}+\epsilon_{i},\qquad i=1,\ldots,n,

    where $\epsilon_{i}\sim N(0,\sigma^{2})$. The variance $\sigma^{2}$ is chosen such that the signal-to-noise ratio, $SNR=\bm{\beta}^{\star\top}\bm{\Sigma}\bm{\beta}^{\star}/\sigma^{2}$, equals 3. Secondly, the response variable $y_{i}$ is obtained by passing $z_{i}$ through an inverse logit to obtain the probability $pr_{i}=P(y_{i}=1|\bm{x}_{i})=1/(1+\exp(-z_{i}))$; then $y_{i}$ is generated from a binomial distribution with probability $pr_{i}$ (a data-generating sketch for this design is given after this list). Choosing $\delta=2,1,0.5,0.1,0.01$, we compare the computation times of the three algorithms mentioned above using the Lasso penalty ($\lambda_{2}=0$). The whole process is repeated 100 times and the average run time is reported in Table 1.

  2. Scenario 2: In this scenario, we aim to study the impact of the correlation magnitude on the run time. As in Scenario 1, we generate a data set $(\bm{x}_{i},y_{i})$, $i=1,\ldots,n$, with $n=100$ and $\bm{x}_{i}\in\mathbb{R}^{p}$, $p=1000$. We consider different values of the correlation coefficient $\rho$. The coefficients are generated as follows: $\beta^{\star}_{j}\sim U(0.9,1.1)$ for $1\leq j\leq 25$, $\beta^{\star}_{j}=-1$ for $51\leq j\leq 75$, and $\beta^{\star}_{j}=0$ otherwise. Here, $s=50$ is the number of significant coefficients. Choosing in this scenario $\delta=0.01,0.5,2$ and $\rho=0.2,0.5,0.75,0.95$, we compare the computation times of the three algorithms using the Lasso penalty ($\lambda_{2}=0$). The whole process is repeated 100 times and the average run time is reported in Table 2.

  3. Scenario 3: This scenario was suggested by (Christidis et al., 2021). It is considered here to compare the performance of sparse classification methods (BernSVM, sparse logistic regression, HHSVM and SparseSVM) with the Lasso and elastic net penalties. We generate our data from a logistic model

    z_{i}=\log\Big(\frac{p_{i}}{1-p_{i}}\Big)=\beta_{0}+\bm{x}^{\top}_{S,i}\bm{\beta}_{S},\qquad i=1,\ldots,n,

    where $\bm{X}\in\mathbb{R}^{n\times p}$ with $n=50$ and $p=800$, and $S$ is the set of active variables. The columns of $\bm{X}$ are generated from a multivariate normal with mean $\bm{0}$ and correlation matrix $\bm{\Sigma}$. We take $\Sigma_{ij}=\rho$ for $i\neq j$ and $\Sigma_{ii}=1$, $1\leq i,j\leq p$, which means that all the variables are equally correlated, with $\rho=0.2,0.5,0.8$, so the correlation matrix differs across settings. The number of active predictors is taken equal to $s=[p\xi]$, with $\xi=0.05,0.3$. The coefficients of the active variables are generated as $(-1)^{v}u$, where $v\sim\mathrm{Bernoulli}(0.3)$ and $u$ is uniformly distributed on $(0,0.5)$. The response variable $y_{i}$, $i=1,\ldots,n$, with $y_{i}\in\{-1,1\}$, is generated by passing $z_{i}$ through an inverse logit. The value of the intercept $\beta_{0}$ is fixed such that the conditional probability $P(y_{i}=1|\bm{x}_{i})=0.3$.

    In this scenario, for each Monte Carlo replication, we simulate two data sets: 1) a training data set of $n=50$ observations, which is used by the different methods to perform 10-fold cross-validation in order to choose the best model for each method; 2) a test data set of $n_{test}=200$ observations, which is used to evaluate the methods' performance. The latter is based on the following five measures:

    • The misclassification rate $MR$ is defined by

      MR=\frac{\sum_{i=1}^{n_{test}}[({y_{test}}_{i}-\hat{y}_{i})/2]^{2}}{n_{test}};

    • The sensitivity $SE$ is defined as

      SE=1-\frac{\sum_{i\in A^{1}}[({y_{test}}_{i}-\hat{y}_{i})/2]^{2}}{n_{test}},

      where $A^{1}=\{i:{y_{test}}_{i}=1\}$;

    • The specificity $SP$ is defined as

      SP=1-\frac{\sum_{i\in A^{-1}}[({y_{test}}_{i}-\hat{y}_{i})/2]^{2}}{n_{test}},

      where $A^{-1}=\{i:{y_{test}}_{i}=-1\}$;

    • The precision $PR$ is defined by

      PR=\frac{\#\{j:{\beta}^{\star}_{j}\neq 0,\,\hat{\beta}_{j}\neq 0\}}{\#\{j:\hat{\beta}_{j}\neq 0\}};

    • The recall $RC$ is defined by

      RC=\frac{\#\{j:{\beta}^{\star}_{j}\neq 0,\,\hat{\beta}_{j}\neq 0\}}{\#\{j:{\beta}^{\star}_{j}\neq 0\}},

      where $\hat{y}_{i}$ is the $i$-th predicted value of the response variable, $\beta^{\star}_{j}$ is the $j$-th coordinate of the true coefficient vector and $\hat{\beta}_{j}$ is the $j$-th coordinate of the estimated coefficients. We note that better performance of a given method in terms of SE (or SP, PR, RC) is indicated by a larger value.
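The following base R sketch (ours, not the authors' simulation code) illustrates the Scenario 1 data-generating process and, for reference, the misclassification rate, precision and recall measures as we read them above.

```r
## Sketch (ours) of the Scenario 1 design: equicorrelated Gaussian columns,
## geometrically decaying coefficients, noise variance set from the target SNR,
## and a logistic link for the labels.
gen_scenario1 <- function(n = 100, p = 5000, rho = 0.5, snr = 3) {
  z0 <- rnorm(n)                                     # shared factor => compound symmetry
  X  <- sqrt(rho) * z0 + sqrt(1 - rho) * matrix(rnorm(n * p), n, p)
  b  <- c((-1)^(1:50) * exp(-(2 * (1:50) - 1) / 20), rep(0, p - 50))
  bSb    <- (1 - rho) * sum(b^2) + rho * sum(b)^2    # beta' Sigma beta under this Sigma
  sigma2 <- bSb / snr                                # so that SNR = beta'Sigma beta / sigma^2
  z  <- drop(X %*% b) + rnorm(n, sd = sqrt(sigma2))
  pr <- 1 / (1 + exp(-z))                            # inverse logit
  y  <- ifelse(rbinom(n, 1, pr) == 1, 1, -1)
  list(X = X, y = y, beta = b)
}

## Performance measures of Section 4.1.1, written as we read them (labels in {-1, 1}).
misclass  <- function(y_test, y_hat) mean(((y_test - y_hat) / 2)^2)
precision <- function(beta_true, beta_hat)
  sum(beta_true != 0 & beta_hat != 0) / sum(beta_hat != 0)
recall    <- function(beta_true, beta_hat)
  sum(beta_true != 0 & beta_hat != 0) / sum(beta_true != 0)
```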

The eight sparse classification methods compared in Scenario 3 are as follows:

  • BernSVM-Lasso: BSVM-IRLS with the Lasso penalty;

  • BernSVM-EN: BSVM-IRLS with the elastic net penalty;

  • Logistic-Lasso: the binomial model with the Lasso, computed using the glmnet R package (Friedman et al., 2010);

  • Logistic-EN: the binomial model with the elastic net, computed using the glmnet R package (Friedman et al., 2010);

  • HHSVML: HHSVM with the Lasso, computed using the gcdnet R package (Yang and Zou, 2013);

  • HHSVMEN: HHSVM with the elastic net, computed using the gcdnet R package (Yang and Zou, 2013);

  • SparseSVM-Lasso: the sparse SVM with the Lasso, computed using the sparseSVM R package (Yi and Huang, 2017);

  • SparseSVM-EN: the sparse SVM with the elastic net, computed using the sparseSVM R package (Yi and Huang, 2017).

Of note, we have used two algorithms to solve the penalized BernSVM optimization problem: the BSVM-GCD algorithm, based on the MM principle combined with coordinate descent, and the BSVM-IRLS algorithm, based on an IRLS scheme combined with coordinate descent.
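As an illustration of how the glmnet-based competitors are fitted in practice, a hedged R sketch is given below (Friedman et al.,, 2010); the HHSVM and SparseSVM fits are obtained analogously with the gcdnet and sparseSVM packages, and the BernSVM fits with our BSVM package (the exact arguments of those packages are not reproduced here and should be checked in their documentation). The choice alpha = 0.5 for the elastic net mixing is ours, for illustration only.

```r
## Fitting Logistic-Lasso and Logistic-EN with 10-fold cross-validation;
## X is the n x p design matrix and y the -1/1 labels (as generated above).
library(glmnet)

y01 <- ifelse(y == 1, 1, 0)            # glmnet's binomial family expects a 0/1 response

cv_lasso <- cv.glmnet(X, y01, family = "binomial", alpha = 1,   nfolds = 10)  # Logistic-Lasso
cv_en    <- cv.glmnet(X, y01, family = "binomial", alpha = 0.5, nfolds = 10)  # Logistic-EN

## Class predictions, mapped back to {-1, 1}; here applied to X for illustration,
## whereas in the simulations a separate test design matrix is used.
y_hat <- ifelse(predict(cv_lasso, X, s = "lambda.min", type = "class") == "1", 1, -1)
```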

4.1.2 Simulation results

δ      BSVM-GCD   BSVM-IRLS   HHSVM
0.01   7.50       3.62        5.84
0.1    2.09       1.19        1.40
0.5    1.55       0.53        0.86
1      1.60       0.61        0.77
2      2.70       0.81        0.85

Table 1: The run times (in seconds) for BSVM-GCD, BSVM-IRLS and HHSVM for δ=0.01,0.1,0.5,1,2 and λ2=0 in Scenario 1. The best time in each row is achieved by BSVM-IRLS.
Figure 2: Comparison of the computation time of the three algorithms as a function of the parameter δ\delta, in Scenario 1

Summing up all the scenarios considered here, we have the following observations:

\bullet Table 1 summarizes the average run time of the HHSVM and the penalized BernSVM with lasso penalty in Scenario 1. Figure 2 provides the computation time curves of the three algorithms, averaged over 100 runs, as a function of δ in the same scenario. Our proposed BSVM-IRLS approach obtains the best computation time, followed by HHSVM, for any δ value, while BSVM-GCD produces the highest computation time for each δ value. These results are well summarized in Figure 2 by the shapes of the algorithms’ computation time curves, where the curve for HHSVM is intermediate between the other two.

\bullet Table 2 summarizes the average run time of the HHSVM and the penalized BernSVM with lasso penalty in Scenario 2, for different values of δ and ρ. The results are similar to Scenario 1: BSVM-IRLS provides the best run time, followed by HHSVM, for any value of δ and ρ. It is interesting to notice here that the three algorithms produce their best times at the same value δ=0.5 for each value of ρ.

ρ      δ      BSVM-GCD   BSVM-IRLS   HHSVM
0.2    0.01   4.13       2.08        2.75
0.2    0.5    1.06       0.39        0.60
0.2    2      1.86       0.55        0.65
0.5    0.01   7.70       3.52        5.50
0.5    0.5    1.56       0.53        0.90
0.5    2      2.91       0.85        0.96
0.75   0.01   11.25      4.76        7.41
0.75   0.5    2.31       0.90        1.38
0.75   2      4.26       1.20        1.37
0.9    0.01   14.69      5.84        11.31
0.9    0.5    2.41       1.31        1.37
0.9    2      5.41       1.77        1.87

Table 2: Timings (in seconds) for BSVM-GCD, BSVM-IRLS and HHSVM for δ=0.01,0.5,2 and λ2=0, for different values of ρ=0.2,0.5,0.75,0.9, in Scenario 2. The best time in each row is achieved by BSVM-IRLS.

\bullet Table 3 reports the average of the five performance measures of our methods and their competitors in Scenario 3, for ξ=0.05,0.3 and ρ=0.2,0.5,0.8. The results obtained for Scenario 3 can be summarized as follows:

  • For a fixed number of active variables, increasing ρ leads to a decrease of about 5-14% in MR, a growth of about 21-40% in SE, and a slight increase in RC for the methods using the EN penalty (except SparseSVM-EN), while all methods are relatively stable in terms of SP and PR with respect to any increase in ρ and/or ξ.

  • On the other hand, for any fixed value of ρ, increasing ξ decreases MR by about 8-16%, increases SE by about 8-40%, and slightly increases RC for the methods based on the EN penalty only, with a large increase of about 11-28% for BernSVM-EN and HHSVMEN.

  • We observe that the methods are relatively comparable in terms of MR, with a slight advantage to those using the EN penalty, for any pair (ρ,ξ). For any fixed ρ, the methods based on standard penalized logistic regression and SVM provide better performance than the others in terms of SE for any value of ξ; this gap with the methods based on the smooth approximation of the hinge loss function decreases as ρ increases for a fixed value of ξ. In addition, the latter methods are still relatively better in terms of SP, while the methods with the Lasso penalty (except SparseSVM) and logistic regression with the EN penalty are better in terms of RC for any pair (ρ,ξ). Finally, the methods are relatively comparable in terms of PR.

Therefore, it can be concluded that our approach produces satisfactory results that are relatively similar to those of HHSVM.

4.2 Empirical study

To illustrate the effectiveness of our BernSVM approach, we also compare the classification accuracy, sensitivity and specificity of the methods considered above on three large-scale real data sets. The first data set consists of the DNA methylation measurements studied in (Kharoubi et al.,, 2019); the other two are standard data sets commonly used to evaluate classifiers' performance in the literature, namely the prostate cancer (Singh et al.,, 2002) and the leukemia (Golub et al.,, 1999) data sets.
The DNA methylation measurements are collected around the BLK gene located on chromosome 8 to detect differentially methylated regions (DMRs, i.e., genomic regions with significantly different methylation levels between two groups of samples, e.g., cases versus controls). The data set contains DNA methylation levels of 40 samples of three cell types: B cells (8 samples), T cells (19 samples), and monocytes (13 samples). Methylation levels are measured at p=5896 CpG sites (predictors). Since methylation levels around the BLK gene are known to differ between B cells and the T- and monocyte-cell types (Kharoubi et al.,, 2019), we code the cell types as a binary response y\in\{-1,1\}, where y=1 corresponds to B cells and y=-1 corresponds to T cells and monocytes. The second data set is available in the publicly available R package spls. The prostate data consist of n=102 subjects, 52 patients with a prostate tumor and 50 control subjects. The data contain the expression of p=6033 genes across the subjects. We code the response variable as y\in\{-1,1\}, where y=1 corresponds to subjects with a prostate tumor and y=-1 to normal subjects. The third data set contains n=72 observations and p=6817 predictors and comes from a study of gene expression in two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high-density oligonucleotide arrays. Subjects with AML are coded as the y=-1 class and subjects with ALL as the y=1 class. In total, we have 25 subjects in class 1 and 47 subjects in class -1. All data sets were subjected to a pre-screening procedure using two-sample t-tests (Storey and Tibshirani,, 2003), and only 1000 predictors were retained.
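The screening step can be sketched in R as follows; the rule of keeping the 1000 predictors with the largest absolute two-sample t-statistics is our reading of the pre-screening procedure described above, and the function and argument names are ours.

```r
## Pre-screening: keep the top n_keep predictors by absolute two-sample t-statistic.
## X: n x p matrix of expression/methylation levels; y: -1/1 class labels.
screen_top_t <- function(X, y, n_keep = 1000) {
  tstat <- apply(X, 2, function(x) t.test(x[y == 1], x[y == -1])$statistic)
  keep  <- order(abs(tstat), decreasing = TRUE)[seq_len(min(n_keep, ncol(X)))]
  X[, keep, drop = FALSE]
}
```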

For the Elastic Net regularization, we take \lambda_{2}=\alpha=3/4. The whole process is repeated 100 times; for each data set, we train the methods on 40% of the observations, using a 10-fold cross-validation to choose the regularization parameters of each method, and we use the remaining 60% of the observations to test the methods and to compute the performance measures. The value of δ is fixed at 2 for BernSVM and HHSVM.

We see from Table 4 that, on the prostate data, the methods with the EN penalty provide better results in terms of MR and SP (except SparseSVM), while in terms of SE all methods provide comparable results, with a slight advantage to logistic regression with the Lasso penalty. For the leukemia data, the methods with the EN penalty (except SparseSVM) dominate for all three performance measures. For the methylation data, the methods with the EN penalty produce the best results in terms of MR, while SparseSVM-EN, followed by Logistic-EN, dominates in terms of SE. Moreover, all methods are relatively comparable in terms of SP, except SparseSVM-Lasso.

                    ξ=0.05                          ξ=0.3
Method              MR    SE    SP    RC    PR      MR    SE    SP    RC    PR
ρ = 0.2
BernSVM-Lasso       0.32  0.14  0.96  0.04  0.06    0.22  0.39  0.95  0.04  0.33
BernSVM-EN          0.30  0.23  0.94  0.44  0.05    0.163 0.556 0.96  0.576 0.304
Logistic-Lasso      0.30  0.34  0.89  0.03  0.10    0.19  0.52  0.93  0.03  0.35
Logistic-EN         0.29  0.34  0.90  0.05  0.094   0.17  0.58  0.94  0.05  0.34
HHSVML              0.32  0.13  0.95  0.04  0.09    0.22  0.37  0.96  0.04  0.35
HHSVMEN             0.31  0.18  0.95  0.40  0.06    0.17  0.52  0.97  0.57  0.30
SparseSVM-Lasso     0.29  0.34  0.89  0.19  0.06    0.16  0.61  0.94  0.25  0.30
SparseSVM-EN        0.29  0.34  0.9   0.20  0.06    0.16  0.63  0.94  0.26  0.31
ρ = 0.5
BernSVM-Lasso       0.25  0.35  0.94  0.03  0.07    0.12  0.72  0.95  0.04  0.31
BernSVM-EN          0.23  0.38  0.96  0.43  0.05    0.08  0.8   0.96  0.54  0.30
Logistic-Lasso      0.22  0.54  0.90  0.03  0.07    0.11  0.78  0.94  0.02  0.31
Logistic-EN         0.21  0.55  0.91  0.04  0.07    0.09  0.81  0.95  0.05  0.32
HHSVML              0.26  0.30  0.95  0.03  0.08    0.13  0.68  0.95  0.03  0.31
HHSVMEN             0.24  0.34  0.96  0.35  0.06    0.08  0.79  0.97  0.56  0.31
SparseSVM-Lasso     0.22  0.52  0.90  0.22  0.06    0.11  0.75  0.94  0.22  0.32
SparseSVM-EN        0.23  0.52  0.90  0.23  0.06    0.11  0.73  0.96  0.22  0.31
ρ = 0.8
BernSVM-Lasso       0.19  0.54  0.95  0.02  0.06    0.08  0.81  0.97  0.03  0.31
BernSVM-EN          0.17  0.57  0.95  0.43  0.05    0.05  0.88  0.98  0.55  0.31
Logistic-Lasso      0.17  0.67  0.91  0.01  0.06    0.07  0.87  0.96  0.02  0.31
Logistic-EN         0.16  0.68  0.92  0.03  0.06    0.05  0.89  0.97  0.05  0.32
HHSVML              0.20  0.50  0.95  0.01  0.05    0.08  0.81  0.97  0.02  0.32
HHSVMEN             0.19  0.50  0.96  0.31  0.06    0.05  0.88  0.98  0.59  0.31
SparseSVM-Lasso     0.19  0.60  0.91  0.30  0.06    0.11  0.78  0.94  0.33  0.32
SparseSVM-EN        0.19  0.61  0.90  0.32  0.06    0.10  0.78  0.95  0.34  0.31

Table 3: Average of the performance measures (MR, SE, SP, RC, PR) of our methods and their competitors in Scenario 3, for ξ=0.05,0.3 and ρ=0.2,0.5,0.8.
                    Prostate            Leukemia            Methylation
Method              MR    SE    SP      MR    SE    SP      MR    SE    SP
BernSVM-Lasso       0.08  0.91  0.93    0.07  0.89  0.94    0.13  0.28  0.98
BernSVM-EN          0.08  0.9   0.94    0.04  0.95  0.96    0.08  0.68  0.97
Logistic-Lasso      0.09  0.92  0.91    0.06  0.94  0.94    0.1   0.57  0.96
Logistic-EN         0.08  0.91  0.92    0.04  0.95  0.96    0.07  0.74  0.97
HHSVML              0.09  0.9   0.92    0.08  0.89  0.93    0.12  0.35  0.98
HHSVMEN             0.08  0.9   0.94    0.05  0.95  0.95    0.09  0.55  0.97
SparseSVM-Lasso     0.12  0.91  0.85    0.08  0.92  0.92    0.18  0.39  0.91
SparseSVM-EN        0.12  0.91  0.86    0.09  0.94  0.9     0.07  0.78  0.97

Table 4: Average of the three performance measures (MR, SE, SP) on the three real data sets.

5 Discussion

In this work, we investigated binary classification in the high-dimensional setting using the SVM classifier. We proposed a new smooth loss function, called BernSVM, to approximate the SVM hinge loss function. The BernSVM loss function has nice properties, which allow for an efficient implementation of the penalized SVM and help to derive appropriate theoretical results for the model parameter estimators. We proposed two algorithms to solve the penalized BernSVM. The first one, termed BSVM-GCD, combines the coordinate descent algorithm and the MM principle. The second one, called BSVM-IRLS, uses an IRLS-type algorithm to solve the underlying optimization problem. We also derived non-asymptotic results for the penalized BernSVM estimator with weighted Lasso, with known weights, and extended this result to unknown weights. Furthermore, we showed that the estimation error of penalized BernSVM with weighted Lasso achieves a rate of order \sqrt{s\log(p)/n}. We compared our approach with its competitors through a simulation study, and the results showed that our method outperforms its competitors in terms of computational time while maintaining good performance in terms of prediction and variable selection. The proposed approach also gave accurate results when analyzing three high-dimensional real data sets. The penalized BernSVM is implemented in an R package BSVM, which is publicly available on GitHub (link).
Finally, given the attractive properties of the BernSVM approach for approximating the penalized SVM efficiently, one can adapt the proposed method to group penalties, such as group-Lasso/SCAD/MCP. Moreover, an interesting penalty has recently been developed (Christidis et al.,, 2021), which fits linear regression models that split the set of covariates into groups using a new diversity penalty. The authors also established interesting properties of their proposed estimator. Adapting the diversity penalty to the penalized SVM using the proposed BernSVM approach would be an interesting avenue to investigate. This is left for future work.

6 Appendix

6.1 Appendix 𝑨\bm{A}

Proof.

(Theorem 1)

Recall the equivalent problem where

qδ(x)=gδ(2δx+1δ)x[0, 1],q_{\delta}(x)=g_{\delta}(2\delta x+1-\delta)\,\ x\in\ [0,\ 1],

must satisfy the three alternative Conditions C1', C2' and C3'. Now write

qδ(x)=k=04c(k, 4;qδ)bk,4(x),x[0, 1],q_{\delta}(x)=\sum_{k=0}^{4}c(k,\ 4;q_{\delta})b_{k,4}(x)\ ,\ x\in\ [0,\ 1],

and notice that Condition C1' implies c(0,4;q_{\delta})=\delta, c(1,4;q_{\delta})=\delta/2 and c(2,4;q_{\delta})=0; this can be seen from the properties of the Bernstein basis above. On the other hand, Condition C2' implies that c(3,4;q_{\delta})=0=c(4,4;q_{\delta}). Therefore, we have

qδ(x)=δb0,4(x)+(δ/2)b1,4(x)=δ(1x)4+2δx(1x)3,x[0, 1].q_{\delta}(x)=\delta b_{0,4}(x)+(\delta/2)b_{1,4}(x)=\delta(1-x)^{4}+2\delta x(1-x)^{3},\ x\in\ [0,\ 1].

Notice that Conditions C1’ and C2’ uniquely determine qδq_{\delta}. It remains to show that it is convex. We have

c(0,2;q_{\delta})=12\triangle^{2}c(0,4;q_{\delta})=12\big(c(2,4;q_{\delta})-2c(1,4;q_{\delta})+c(0,4;q_{\delta})\big)=12(0-\delta+\delta)=0,\quad c(1,2;q_{\delta})=12\triangle^{2}c(1,4;q_{\delta})=12\big(c(3,4;q_{\delta})-2c(2,4;q_{\delta})+c(1,4;q_{\delta})\big)=12(\delta/2)=6\delta,

and

c(2,2;q_{\delta})=12\triangle^{2}c(2,4;q_{\delta})=12\big(c(4,4;q_{\delta})-2c(3,4;q_{\delta})+c(2,4;q_{\delta})\big)=0.

This shows that qδ′′(x)>0q_{\delta}^{{}^{\prime\prime}}(x)>0 for all x(0,1)x\in(0,1) and in particular, the Condition C3’ is satisfied.

In order to compute the derivatives of gδg_{\delta}, we simply use the Bernstein basis properties (again) together with the chain rule as follows

g_{\delta}^{{}^{\prime}}(t)=\frac{1}{2\delta}\,q_{\delta}^{{}^{\prime}}\Big(\frac{t-1+\delta}{2\delta}\Big),

and

g_{\delta}^{{}^{\prime\prime}}(t)=\frac{1}{4\delta^{2}}\,q_{\delta}^{{}^{\prime\prime}}\Big(\frac{t-1+\delta}{2\delta}\Big),

with

qδ(x)=δb0,4(x)+(δ/2)b1,4(x)=δ(1x)4+2δx(1x)3,x[0, 1].q_{\delta}(x)=\delta b_{0,4}(x)+(\delta/2)b_{1,4}(x)=\delta(1-x)^{4}+2\delta x(1-x)^{3},\ x\in\ [0,\ 1].

Thus, one has

qδ(x)=2δ{b0,3(x)+b1,3(x)}=2δ{(1x)3+3x(1x)2},q_{\delta}^{{}^{\prime}}(x)=-2\delta\{b_{0,3}(x)+b_{1,3}(x)\}=-2\delta\{(1-x)^{3}+3x(1-x)^{2}\},

and

qδ′′(x)=6δb1,2(x)=12δx(1x).q_{\delta}^{{}^{\prime\prime}}(x)=6\delta b_{1,2}(x)=12\delta x(1-x).\quad\blacksquare
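As a sanity check on this construction, the following short R snippet verifies numerically that the polynomial piece matches the hinge value, slope and curvature at t=1-δ, vanishes smoothly at t=1+δ, and is convex on the interval. The interpretation of Conditions C1', C2' and C3' as matching the hinge on the left and the zero function on the right is our reading of the proof; the exact definition of the BernSVM loss is given earlier in the paper.

```r
## Numerical check of q_delta and g_delta derived in the proof of Theorem 1.
delta <- 0.5

q   <- function(x) delta * (1 - x)^4 + 2 * delta * x * (1 - x)^3   # q_delta
qp  <- function(x) -2 * delta * ((1 - x)^3 + 3 * x * (1 - x)^2)    # q_delta'
qpp <- function(x) 12 * delta * x * (1 - x)                        # q_delta''

## g_delta and its derivatives on [1 - delta, 1 + delta], via the chain rule
g   <- function(t) q((t - 1 + delta) / (2 * delta))
gp  <- function(t) qp((t - 1 + delta) / (2 * delta)) / (2 * delta)
gpp <- function(t) qpp((t - 1 + delta) / (2 * delta)) / (4 * delta^2)

## Left boundary: matches the hinge value delta, slope -1 and curvature 0
c(g(1 - delta) - delta, gp(1 - delta) + 1, gpp(1 - delta))   # all ~ 0

## Right boundary: value, slope and curvature all vanish
c(g(1 + delta), gp(1 + delta), gpp(1 + delta))               # all ~ 0

## Convexity of the middle piece
all(gpp(seq(1 - delta, 1 + delta, length.out = 200)) >= 0)   # TRUE
```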

Proof.

(Proposition 3)

For the AEN penalty Pλ1,λ2(βj)P_{\lambda_{1},\lambda_{2}}(\beta_{j}), the surrogate function in (8) can be written, for all j=1,,pj=1,\ldots,p, as follows

Qδ(βj|β0,β~j)=i=1nBδ(ri)n+i=1nBδ(ri)yixijn(βjβ~j)+L2(βjβ~j)2+λ1wj|βj|+λ22βj2.Q_{\delta}(\beta_{j}|\beta_{0},\tilde{\beta}_{j})=\frac{\sum_{i=1}^{n}B_{\delta}(r_{i})}{n}+\frac{\sum_{i=1}^{n}B_{\delta}^{{}^{\prime}}(r_{i})y_{i}x_{ij}}{n}(\beta_{j}-\tilde{\beta}_{j})+\frac{L}{2}(\beta_{j}-\tilde{\beta}_{j})^{2}+\lambda_{1}w_{j}|\beta_{j}|+\frac{\lambda_{2}}{2}\beta_{j}^{2}.

Its first derivative is given by

Qδ(βj|β0,β~j)βj=i=1nBδ(ri)yixijn+L(βjβ~j)+λ1wjsign(βj)+λ2βj.\frac{\partial{Q_{\delta}(\beta_{j}|\beta_{0},\tilde{\beta}_{j})}}{\partial{\beta_{j}}}=\frac{\sum_{i=1}^{n}B_{\delta}^{{}^{\prime}}(r_{i})y_{i}x_{ij}}{n}+L(\beta_{j}-\tilde{\beta}_{j})+\lambda_{1}w_{j}\textit{sign}(\beta_{j})+\lambda_{2}\beta_{j}.

Then, \frac{\partial{Q_{\delta}(\beta_{j}|\beta_{0},\tilde{\beta}_{j})}}{\partial{\beta_{j}}}=0 implies that \omega\beta_{j}=Z_{j}-w_{j}\textit{sign}(\beta_{j}). Hence, the equation in (10) is obtained using the soft-threshold function S(a,b). Equation (9) is a special case of Equation (10) obtained when the weights w_{j}=1 for all j=1,\ldots,p.
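A sketch (in R) of the resulting coordinate-wise update is given below. The quantities L (the majorization constant), \omega and Z_{j} are defined earlier in the paper; here we use the explicit form obtained by solving the stationarity equation above, namely \beta_{j}=S\big(L\tilde{\beta}_{j}-\frac{1}{n}\sum_{i}B_{\delta}^{\prime}(r_{i})y_{i}x_{ij},\,\lambda_{1}w_{j}\big)/(L+\lambda_{2}), which is our reading of the update; function and variable names are ours.

```r
## Soft-threshold operator S(a, b) = sign(a) * max(|a| - b, 0)
soft_threshold <- function(a, b) sign(a) * pmax(abs(a) - b, 0)

## One coordinate update for beta_j under the adaptive elastic-net penalty.
## Bprime: first derivative of the (BernSVM) loss; r_i = y_i * (beta0 + x_i' beta).
update_beta_j <- function(j, X, y, beta0, beta, L, lambda1, lambda2, w, Bprime) {
  n <- nrow(X)
  r <- as.vector(y * (beta0 + X %*% beta))          # current margins r_i
  grad_j <- sum(Bprime(r) * y * X[, j]) / n         # (1/n) sum_i B'(r_i) y_i x_ij
  z_j <- L * beta[j] - grad_j                       # working quantity for coordinate j
  soft_threshold(z_j, lambda1 * w[j]) / (L + lambda2)
}
```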

6.2 Appendix 𝑩\bm{B}

Proof.

(Lemma 1)

Under assumption 𝑨2\bm{A}2 on 𝑩(𝒙0,r0)\bm{B}(\bm{x}_{0},r_{0}), we have

𝑯=𝑿𝑫𝑿nκ3𝑿𝑿n,\bm{H}=\frac{\bm{X}^{\top}\bm{D}\bm{X}}{n}\succeq\frac{\kappa_{3}{\bm{X}^{\top}\bm{X}}}{n},

because \bm{D}\succeq\kappa_{3}\bm{I}, where \bm{M}\succeq\bm{N} means that \bm{M}-\bm{N} is positive semi-definite. Since the true parameter is sparse, by the Cauchy-Schwarz inequality we have

𝒉S1=j𝑺|hj|.1(j𝑺|hj|2)1/2(j𝑺1)1/2=s𝒉S2s𝒉2.||\bm{h}_{S}||_{1}=\sum_{j\in\bm{S}}|h_{j}|.1\leq(\sum_{j\in\bm{S}}|h_{j}|^{2})^{1/2}(\sum_{j\in\bm{S}}1)^{1/2}=\sqrt{s}||\bm{h}_{S}||_{2}\leq\sqrt{s}||\bm{h}||_{2}.

We have also 𝒉𝒞\bm{h}\in\mathcal{C}, so we have 𝑾Sc𝒉Sc1γ𝒉S1||\bm{W}_{S^{c}}\bm{h}_{S^{c}}||_{1}\leq\gamma||\bm{h}_{S}||_{1}. Thus, one has

𝒉1\displaystyle||\bm{h}||_{1} =\displaystyle= 𝒉S1+𝒉Sc1\displaystyle||\bm{h}_{S}||_{1}+||\bm{h}_{S^{c}}||_{1}
=\displaystyle= 𝒉S1+𝑾Sc1𝑾Sc𝒉Sc1\displaystyle||\bm{h}_{S}||_{1}+||\bm{W}_{S^{c}}^{-1}\bm{W}_{S^{c}}\bm{h}_{S^{c}}||_{1}
\displaystyle\leq 𝒉S1+𝑾Sc1𝑾Sc𝒉Sc1\displaystyle||\bm{h}_{S}||_{1}+||\bm{W}_{S^{c}}^{-1}||_{\infty}||\bm{W}_{S^{c}}\bm{h}_{S^{c}}||_{1}
\displaystyle\leq 𝒉S1+γ𝑾Sc1𝒉S1\displaystyle||\bm{h}_{S}||_{1}+\gamma||\bm{W}_{S^{c}}^{-1}||_{\infty}||\bm{h}_{S}||_{1}
=\displaystyle= (1+γ𝑾Sc1)𝒉S1\displaystyle(1+\gamma||\bm{W}_{S^{c}}^{-1}||_{\infty})||\bm{h}_{S}||_{1}
\displaystyle\leq s(1+γ𝑾Sc1)𝒉2.\displaystyle\sqrt{s}(1+\gamma||\bm{W}_{S^{c}}^{-1}||_{\infty})||\bm{h}||_{2}.

Then, one has

κ2log(p)n𝒉12κ2slog(p)(1+γ𝑾Sc1)2n𝒉22.-\frac{\kappa_{2}\log(p)}{n}||\bm{h}||_{1}^{2}\geq-\frac{\kappa_{2}s\log(p)(1+\gamma||\bm{W}_{S^{c}}^{-1}||_{\infty})^{2}}{n}||\bm{h}||_{2}^{2}.

Under Assumption A1, we have also

κ3𝑿𝒉22n\displaystyle\frac{\kappa_{3}||\bm{X}\bm{h}||_{2}^{2}}{n} \displaystyle\geq κ3κ1𝒉22κ3κ2log(p)n𝒉12\displaystyle\kappa_{3}\kappa_{1}||\bm{h}||_{2}^{2}-\kappa_{3}\kappa_{2}\frac{\log(p)}{n}||\bm{h}||^{2}_{1}
\displaystyle\geq κ3𝒉22(κ1κ2slog(p)(1+γ𝑾Sc1)2n).\displaystyle\kappa_{3}||\bm{h}||_{2}^{2}(\kappa_{1}-\frac{\kappa_{2}s\log(p)(1+\gamma||\bm{W}_{S^{c}}^{-1}||_{\infty})^{2}}{n}).

Then, if we take

n>2κ2slog(p)(1+γ𝑾Sc1)2κ1,n>\frac{2\kappa_{2}s\log(p)(1+\gamma||\bm{W}_{S^{c}}^{-1}||_{\infty})^{2}}{\kappa_{1}},

we obtain

κ1κ2slog(p)(1+γ𝑾Sc1)2n>κ1κ1/2=κ1/2.\kappa_{1}-\frac{\kappa_{2}s\log(p)(1+\gamma||\bm{W}_{S^{c}}^{-1}||_{\infty})^{2}}{n}>\kappa_{1}-\kappa_{1}/2=\kappa_{1}/2.

Thus, we have

κ3𝑿𝒉22nκ3κ12𝒉22,\frac{\kappa_{3}||\bm{X}\bm{h}||_{2}^{2}}{n}\geq\frac{\kappa_{3}\kappa_{1}}{2}||\bm{h}||_{2}^{2},

which means that the Hessian matrix 𝑯\bm{H} satisfies the RSC with κ=κ3κ12\kappa=\frac{\kappa_{3}\kappa_{1}}{2}.  \blacksquare

Proof.

(Theorem 2)

Let 𝒉=𝜷^𝜷\bm{h}=\hat{\bm{\beta}}-\bm{\beta}^{\star}. Firstly, we prove that the Lasso BernSVM error satisfies the cone constraint given by

𝒞(S)={𝒉p:𝑾𝑺c𝒉𝑺c1γ𝒉S1}.\mathcal{C}(S)=\{\bm{h}\in\mathbb{R}^{p}:||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}\leq\gamma||\bm{h}_{S}||_{1}\}.

We have L(\hat{\bm{\beta}})\leq L(\bm{\beta}^{\star}), which implies that L(\bm{h}+\bm{\beta}^{\star})\leq L(\bm{\beta}^{\star}). Thus,

1ni=1nBδ(yi𝒙i(𝒉+𝜷))+λ1||𝑾(𝒉+𝜷)||11ni=1nBδ(yi𝒙i𝜷))+λ1||𝑾𝜷||1,\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}\bm{x}_{i}^{\top}(\bm{h}+\bm{\beta}^{\star}))+\lambda_{1}||\bm{W}(\bm{h}+\bm{\beta}^{\star})||_{1}\leq\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star}))+\lambda_{1}||\bm{W}\bm{\beta}^{\star}||_{1},

which implies that

1ni=1nBδ(yi𝒙i(𝒉+𝜷))1ni=1nBδ(yi𝒙i𝜷))λ1(||𝑾𝜷||1||𝑾(𝒉+𝜷)||1).\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}\bm{x}_{i}^{\top}(\bm{h}+\bm{\beta}^{\star}))-\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star}))\leq\lambda_{1}(||\bm{W}\bm{\beta}^{\star}||_{1}-||\bm{W}(\bm{h}+\bm{\beta}^{\star})||_{1}). (17)

As the BernSVM loss function is twice differentiable, we can apply a second order Taylor expansion

1ni=1nBδ(yi𝒙i(𝒉+𝜷))1ni=1nBδ(yi𝒙i𝜷))=1ni=1nBδ(yi𝒙i𝜷)yi𝒙i𝒉+𝒉𝑯𝒉2.\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}\bm{x}_{i}^{\top}(\bm{h}+\bm{\beta}^{\star}))-\frac{1}{n}\sum_{i=1}^{n}B_{\delta}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star}))=\frac{1}{n}\sum_{i=1}^{n}B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}\bm{x}^{\top}_{i}\bm{h}+\frac{\bm{h}^{\top}\bm{H}\bm{h}}{2}. (18)

We can see that

1ni=1nBδ(yi𝒙i𝜷)yi𝒙i𝒉\displaystyle\frac{1}{n}\sum_{i=1}^{n}-B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}\bm{x}^{\top}_{i}\bm{h} \displaystyle\leq 1ni=1n|Bδ(yi𝒙i𝜷)yi𝒙i𝒉|\displaystyle\frac{1}{n}\sum_{i=1}^{n}|B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}\bm{x}^{\top}_{i}\bm{h}|
\displaystyle\leq 1ni=1n|𝒙i𝒉|because|Bδ(yi𝒙i𝜷)yi|=|Bδ(yi𝒙i𝜷)|1\displaystyle\frac{1}{n}\sum_{i=1}^{n}|\bm{x}^{\top}_{i}\bm{h}|\quad\text{because}\quad|B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}|=|B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})|\leq 1
=\displaystyle= 1ni=1n|j=1pxijhj|\displaystyle\frac{1}{n}\sum_{i=1}^{n}|\sum_{j=1}^{p}x_{ij}h_{j}|
\displaystyle\leq j=1p(1ni=1n|xij|)|hj|\displaystyle\sum_{j=1}^{p}(\frac{1}{n}\sum_{i=1}^{n}|x_{ij}|)|h_{j}|
\displaystyle\leq j=1p(1ni=1n|xij|2)|hj|\displaystyle\sum_{j=1}^{p}(\frac{1}{\sqrt{n}}\sqrt{\sum_{i=1}^{n}|x_{ij}|^{2}})|h_{j}|
\displaystyle\leq Mn𝒉1,because𝒙j2M,j.\displaystyle\frac{M}{\sqrt{n}}||\bm{h}||_{1},\quad\text{because}\quad||\bm{x}_{j}||_{2}\leq M,\quad\forall j.

Then, (17) and (18) give

𝒉𝑯𝒉2\displaystyle\frac{\bm{h}^{\top}\bm{H}\bm{h}}{2} \displaystyle\leq 1ni=1nBδ(yi𝒙i𝜷)yi𝒙i𝒉+λ1(𝑾𝜷1𝑾(𝒉+𝜷)1),\displaystyle\frac{1}{n}\sum_{i=1}^{n}-B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}\bm{x}^{\top}_{i}\bm{h}+\lambda_{1}(||\bm{W}\bm{\beta}^{\star}||_{1}-||\bm{W}(\bm{h}+\bm{\beta}^{\star})||_{1}),
\displaystyle\leq Mn𝒉1+λ1(𝑾𝜷1𝑾(𝒉+𝜷)1).\displaystyle\frac{M}{\sqrt{n}}||\bm{h}||_{1}+\lambda_{1}(||\bm{W}\bm{\beta}^{\star}||_{1}-||\bm{W}(\bm{h}+\bm{\beta}^{\star})||_{1}).

Moreover, we have

𝑾𝜷1𝑾(𝒉+𝜷)1\displaystyle||\bm{W}\bm{\beta}^{\star}||_{1}-||\bm{W}(\bm{h}+\bm{\beta}^{\star})||_{1} =\displaystyle= 𝑾𝑺𝜷𝑺1𝑾𝑺(𝒉𝑺+𝜷𝑺)1𝑾𝑺c𝒉𝑺c1,\displaystyle||\bm{W}_{\bm{S}}\bm{\beta}^{\star}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}}(\bm{h}_{\bm{S}}+\bm{\beta}^{\star}_{\bm{S}})||_{1}-||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1},
\displaystyle\leq 𝑾𝑺𝒉𝑺1𝑾𝑺c𝒉𝑺c1,\displaystyle||\bm{W}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1},

where the first equality uses \bm{\beta}^{\star}_{\bm{S}^{c}}=\bm{0}, and the inequality follows from the fact that

𝑾𝑺𝜷𝑺1𝑾𝑺(𝒉𝑺+𝜷𝑺)1|𝑾𝑺𝜷𝑺1𝑾𝑺(𝒉𝑺+𝜷𝑺)1|𝑾𝑺𝒉𝑺1.||\bm{W}_{\bm{S}}\bm{\beta}^{\star}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}}(\bm{h}_{\bm{S}}+\bm{\beta}^{\star}_{\bm{S}})||_{1}\leq|||\bm{W}_{\bm{S}}\bm{\beta}^{\star}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}}(\bm{h}_{\bm{S}}+\bm{\beta}^{\star}_{\bm{S}})||_{1}|\leq||\bm{W}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}.

Thus, we obtain

0\displaystyle 0 <\displaystyle< 𝒉𝑯𝒉2\displaystyle\frac{\bm{h}^{\top}\bm{H}\bm{h}}{2}
\displaystyle\leq Mn𝒉1+λ1(𝑾𝑺𝒉𝑺1𝑾𝑺c𝒉𝑺c1).\displaystyle\frac{M}{\sqrt{n}}||\bm{h}||_{1}+\lambda_{1}(||\bm{W}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}).

Because the right-hand side of the inequality is then positive, we have, with probability 1-\exp(-c_{0}n), that

0\displaystyle 0 \displaystyle\leq Mn𝒉1+λ1(𝑾𝑺𝒉𝑺1𝑾𝑺c𝒉𝑺c1)\displaystyle\frac{M}{\sqrt{n}}||\bm{h}||_{1}+\lambda_{1}(||\bm{W}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1})
=\displaystyle= Mn𝒉𝑺1+Mn𝒉𝑺c1+λ1(𝑾𝑺𝒉𝑺1𝑾𝑺c𝒉𝑺c1)\displaystyle\frac{M}{\sqrt{n}}||\bm{h}_{\bm{S}}||_{1}+\frac{M}{\sqrt{n}}||\bm{h}_{\bm{S}^{c}}||_{1}+\lambda_{1}(||\bm{W}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1})
=\displaystyle= Mn𝒉𝑺1+Mn𝑾𝑺c1𝑾𝑺c𝒉𝑺c1+λ1(𝑾𝑺𝒉𝑺1𝑾𝑺c𝒉𝑺c1)\displaystyle\frac{M}{\sqrt{n}}||\bm{h}_{\bm{S}}||_{1}+\frac{M}{\sqrt{n}}||\bm{W}_{\bm{S}^{c}}^{-1}\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}+\lambda_{1}(||\bm{W}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1})
\displaystyle\leq Mn𝒉𝑺1+Mn𝑾𝑺c1𝑾𝑺c𝒉𝑺c1+λ1𝑾𝑺𝒉𝑺1λ1𝑾𝑺c𝒉𝑺c1\displaystyle\frac{M}{\sqrt{n}}||\bm{h}_{\bm{S}}||_{1}+\frac{M}{\sqrt{n}}||\bm{W}_{\bm{S}^{c}}^{-1}||_{\infty}||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}+\lambda_{1}||\bm{W}_{\bm{S}}||_{\infty}||\bm{h}_{\bm{S}}||_{1}-\lambda_{1}||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}
=\displaystyle= (Mn+λ1𝑾𝑺)𝒉𝑺1(λ1Mn𝑾𝑺c1)𝑾𝑺c𝒉𝑺c1.\displaystyle(\frac{M}{\sqrt{n}}+\lambda_{1}||\bm{W}_{\bm{S}}||_{\infty})||\bm{h}_{\bm{S}}||_{1}-(\lambda_{1}-\frac{M}{\sqrt{n}}||\bm{W}_{\bm{S}^{c}}^{-1}||_{\infty})||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}.

This implies that the Lasso BernSVM error satisfies the cone constraint given by

𝒞(S)={𝒉p:𝑾𝑺c𝒉𝑺c1γ𝒉S1},\mathcal{C}(S)=\{\bm{h}\in\mathbb{R}^{p}:||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}\leq\gamma||\bm{h}_{S}||_{1}\},

because λ1M(1+γ𝑾𝑺c1)n(γ𝑾𝑺)\lambda_{1}\geq\frac{M(1+\gamma||\bm{W}_{\bm{S}^{c}}^{-1}||_{\infty})}{\sqrt{n}(\gamma-||\bm{W}_{\bm{S}}||_{\infty})} for any γ>𝑾𝑺\gamma>||\bm{W}_{\bm{S}}||_{\infty}.

Thus, we have that the error 𝒉\bm{h} belongs to the set 𝒞(S)\mathcal{C}(S).

Then, we derive the upper bound of the error 𝒉\bm{h}. From the inequality above and the definition of RSC we have

\kappa||\bm{h}||_{2}^{2} \leq (\frac{M}{\sqrt{n}}+\lambda_{1}||\bm{W}_{\bm{S}}||_{\infty})||\bm{h}_{S}||_{1}-(\lambda_{1}-\frac{M}{\sqrt{n}}||\bm{W}_{\bm{S}^{c}}^{-1}||_{\infty})||\bm{W}_{\bm{S}^{c}}\bm{h}_{S^{c}}||_{1}
\leq (\frac{M}{\sqrt{n}}+\lambda_{1}||\bm{W}_{\bm{S}}||_{\infty})||\bm{h}_{S}||_{1},
\leq (\frac{M}{\sqrt{n}}+\lambda_{1}||\bm{W}_{\bm{S}}||_{\infty})\sqrt{s}||\bm{h}||_{2}.

The second inequality comes from the fact that (\lambda_{1}-\frac{M}{\sqrt{n}}||\bm{W}_{\bm{S}^{c}}^{-1}||_{\infty})>0. Thus, we obtain an upper bound of the error \bm{h}=\hat{\bm{\beta}}-\bm{\beta}^{\star} given by

||𝒉||2(Mn+𝑾𝑺λ1)sκ.||\bm{h}||_{2}\leq\frac{(\frac{M}{\sqrt{n}}+||\bm{W}_{\bm{S}}||_{\infty}\lambda_{1})\sqrt{s}}{\kappa}.\quad\blacksquare

Proof.

(Theorem 3) In Theorem 2, we used A2 and the fact that the BernSVM first derivative is bounded by 1 in order to upper bound the random variables z_{j}=\frac{1}{n}\sum_{i=1}^{n}-B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}x_{ij}, j=1,\ldots,p. Lemma 2 shows that the variables z_{j} are sub-Gaussian. Thus, we can see that

1ni=1nBδ(yi𝒙i𝜷)yi𝒙i𝒉z𝒉1,\frac{1}{n}\sum_{i=1}^{n}-B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}\bm{x}^{\top}_{i}\bm{h}\leq z^{\star}||\bm{h}||_{1},

where z=maxjzjz^{\star}=\max_{j}z_{j}. Following the same steps to prove Theorem 2, we obtain

𝒉2(z+𝑾𝑺λ1)sκ.||\bm{h}||_{2}\leq\frac{(z^{\star}+||\bm{W}_{\bm{S}}||_{\infty}\lambda_{1})\sqrt{s}}{\kappa}.

We also have that the event P_{1} is satisfied with high probability; then, we have

𝒉2(z+wmaxλ1)sκγ(1+wminwmax)λ1sκ.||\bm{h}||_{2}\leq\frac{(z^{\star}+w_{\max}\lambda_{1})\sqrt{s}}{\kappa}\leq\frac{\gamma(1+w_{\min}w_{\max})\lambda_{1}\sqrt{s}}{\kappa}.

We also have that λ1=𝑶(log(p)/n)\lambda_{1}=\bm{O}(\sqrt{\log(p)/n}). Finally, we obtain

||𝒉||2=||𝜷^𝜷||2=𝑶P(slog(p)/n).||\bm{h}||_{2}=||\hat{\bm{\beta}}-\bm{\beta}^{\star}||_{2}=\bm{O}_{P}(\sqrt{s\log(p)/n}).\quad\blacksquare

6.3 Appendix 𝑪\bm{C}

Proof.

(Lemma 2)

To prove Lemma 2, we provide some results about the sub-Gaussian random variables

Lemma 3.

Let z_{1},z_{2},\ldots,z_{p} be p real random variables, not necessarily independent, with z_{j}\sim subG(\sigma^{2}). Then, for all t>0,

(maxj=1,,p|zj|>t)2pexp(t2/2σ2).\mathbb{P}(max_{j=1,...,p}|z_{j}|>t)\leq 2p\exp(-t^{2}/2\sigma^{2}).
Lemma 4 (Hoeffding’s Lemma 1963).

Let z be a random variable such that \mathbb{E}[z]=0 and z\in[a,b] almost surely. Then, for any t\in\mathbb{R}, it holds

𝔼[etz]et2(ba)28.\mathbb{E}[e^{tz}]\leq e^{\frac{t^{2}(b-a)^{2}}{8}}.

In particular, zsubG((ba)24)z\sim subG(\frac{(b-a)^{2}}{4}).

Let zj=1ni=1nBδ(yi𝒙i𝜷)yixijz_{j}=\frac{1}{n}\sum_{i=1}^{n}-B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}x_{ij}.

We will show that z_{j} is bounded with mean zero, so that Lemma 4 applies and z_{j} is sub-Gaussian with variance proxy \frac{M^{2}}{n}.

We can see that

|zj|=|1ni=1nBδ(yi𝒙i𝜷)yixij|\displaystyle|z_{j}|=|\frac{1}{n}\sum_{i=1}^{n}-B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}x_{ij}| \displaystyle\leq 1ni=1n|Bδ(yi𝒙i𝜷)||yi||xij|\displaystyle\frac{1}{n}\sum_{i=1}^{n}|B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})||y_{i}||x_{ij}|
\displaystyle\leq 1ni=1n|xij|\displaystyle\frac{1}{n}\sum_{i=1}^{n}|x_{ij}|
\displaystyle\leq 1ni=1nxij2\displaystyle\frac{1}{\sqrt{n}}\sqrt{\sum_{i=1}^{n}x_{ij}^{2}}
\displaystyle\leq Mn.\displaystyle\frac{M}{\sqrt{n}}.

The third inequality follows from the Cauchy-Schwarz inequality and the fourth from Assumption \bm{A}3.

We conclude that the variables z_{j} are bounded, with |z_{j}|\leq M/\sqrt{n}. Moreover, using the fact that \bm{\beta}^{\star} minimizes the population version of the loss function and Assumption \bm{A}4, we have \mathbb{E}[z_{j}]=0. Hence, by Lemma 4, the z_{j} are sub-Gaussian with variance proxy \frac{M^{2}}{n}. \quad\blacksquare

6.4 Appendix 𝑫\bm{D}

Proof.

(Theorem 4)

We can see that

1ni=1nBδ(yi𝒙i𝜷)yi𝒙i𝒉z𝒉1,\frac{1}{n}\sum_{i=1}^{n}-B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}\bm{x}^{\top}_{i}\bm{h}\leq z^{\star}||\bm{h}||_{1},

where z=maxj1ni=1nBδ(yi𝒙i𝜷)yixijz^{\star}=\max_{j}\frac{1}{n}\sum_{i=1}^{n}-B_{\delta}^{{}^{\prime}}(y_{i}\bm{x}_{i}^{\top}\bm{\beta}^{\star})y_{i}x_{ij}.

Then, following the same argument as in the proof of Theorem 2, we have

0\displaystyle 0 <\displaystyle< 𝒉𝑯𝒉2\displaystyle\frac{\bm{h}^{\top}\bm{H}\bm{h}}{2}
\displaystyle\leq z𝒉1+λ1(𝑾^𝑺𝒉𝑺1𝑾^𝑺c𝒉𝑺c1)\displaystyle z^{\star}||\bm{h}||_{1}+\lambda_{1}(||\hat{\bm{W}}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\hat{\bm{W}}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1})
\displaystyle\leq z𝒉1+λ1(𝑾𝑺𝒉𝑺1𝑾𝑺c𝒉𝑺c1).\displaystyle z^{\star}||\bm{h}||_{1}+\lambda_{1}(||\bm{W}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}).

The right-hand side of the inequality is therefore positive, and the second inequality holds because the event \Omega_{0} is realized. Thus, we have, with probability 1-\exp(-c_{0}n), that

0\displaystyle 0 \displaystyle\leq z𝒉1+λ1(𝑾𝑺𝒉𝑺1𝑾𝑺c𝒉𝑺c1),\displaystyle z^{\star}||\bm{h}||_{1}+\lambda_{1}(||\bm{W}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}),
=\displaystyle= z𝒉𝑺1+z𝒉𝑺c1+λ1(𝑾𝑺𝒉𝑺1𝑾𝑺c𝒉𝑺c1)\displaystyle z^{\star}||\bm{h}_{\bm{S}}||_{1}+z^{\star}||\bm{h}_{\bm{S}^{c}}||_{1}+\lambda_{1}(||\bm{W}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1})
=\displaystyle= z𝒉𝑺1+z𝑾𝑺c1𝑾𝑺c𝒉𝑺c1+λ1(𝑾𝑺𝒉𝑺1𝑾𝑺c𝒉𝑺c1)\displaystyle z^{\star}||\bm{h}_{\bm{S}}||_{1}+z^{\star}||\bm{W}_{\bm{S}^{c}}^{-1}\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}+\lambda_{1}(||\bm{W}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1})
\displaystyle\leq z^{\star}||\bm{h}_{\bm{S}}||_{1}+z^{\star}w_{min}||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}+\lambda_{1}w_{max}||\bm{h}_{\bm{S}}||_{1}-\lambda_{1}||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}
=\displaystyle= (z+λ1wmax)𝒉𝑺1(λ1zwmin)𝑾𝑺c𝒉𝑺c1.\displaystyle(z^{\star}+\lambda_{1}w_{max})||\bm{h}_{\bm{S}}||_{1}-(\lambda_{1}-z^{\star}w_{min})||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}.

which implies that the Lasso BernSVM error satisfies the cone constraint given by

𝒞(S)={𝒉p:𝑾𝑺c𝒉𝑺c1γ𝒉S1},\mathcal{C}(S)=\{\bm{h}\in\mathbb{R}^{p}:||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}\leq\gamma||\bm{h}_{S}||_{1}\},

because the event P1P_{1} is realized with high probability, which means that

γ>(z+λ1wmax)(λ1zwmin).\gamma>\frac{(z^{\star}+\lambda_{1}w_{max})}{(\lambda_{1}-z^{\star}w_{min})}.

Thus, we have that the error 𝒉\bm{h} belongs to the set 𝒞(S)\mathcal{C}(S).

Let \alpha=\max\{w_{max},w_{min}^{-1}\}. Moreover, from the inequality above and the RSC definition, we have

κ𝒉22\displaystyle\kappa||\bm{h}||_{2}^{2} \displaystyle\leq (z+λ1wmax)𝒉𝑺1(λ1zwmin)𝑾𝑺c𝒉𝑺c1\displaystyle(z^{\star}+\lambda_{1}w_{max})||\bm{h}_{\bm{S}}||_{1}-(\lambda_{1}-z^{\star}w_{min})||\bm{W}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1}
\displaystyle\leq (wmax+wmin1)λ1𝒉S1,\displaystyle(w_{max}+w_{min}^{-1})\lambda_{1}||\bm{h}_{S}||_{1},
\displaystyle\leq 2αλ1s𝒉2.\displaystyle 2\alpha\lambda_{1}\sqrt{s}||\bm{h}||_{2}.

Then, we obtain that

𝒉22αλ1sκ.||\bm{h}||_{2}\leq\frac{2\alpha\lambda_{1}\sqrt{s}}{\kappa}.

Thus,

||𝜷^𝜷||2=𝑶P(slog(p)/n).||\hat{\bm{\beta}}-\bm{\beta}^{\star}||_{2}=\bm{O}_{P}(\sqrt{s\log(p)/n}).\quad\blacksquare

6.5 Appendix 𝑬\bm{E}

Proof.

(Theorem 5)

We have

κ𝒉22\displaystyle\kappa||\bm{h}||^{2}_{2} <\displaystyle< 𝒉𝑯𝒉2\displaystyle\frac{\bm{h}^{\top}\bm{H}\bm{h}}{2}
\displaystyle\leq z𝒉2+λ1(𝑾^𝑺𝒉𝑺1𝑾^𝑺c𝒉𝑺c1)\displaystyle z^{\star}||\bm{h}||_{2}+\lambda_{1}(||\hat{\bm{W}}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}-||\hat{\bm{W}}_{\bm{S}^{c}}\bm{h}_{\bm{S}^{c}}||_{1})
\displaystyle\leq z𝒉2+λ1𝑾^𝑺𝒉𝑺1.\displaystyle z^{\star}||\bm{h}||_{2}+\lambda_{1}||\hat{\bm{W}}_{\bm{S}}\bm{h}_{\bm{S}}||_{1}.\star

For all j\in S, we have \lambda_{1}\hat{w}_{j}=P^{{}^{\prime}}_{\lambda_{1}}(|\tilde{\beta}_{j}|). We also have, from a Taylor expansion, that

Pλ1(|βj~|)=Pλ1(|βj|)+Pλ1′′(|βj|)(β~jβj)Pλ1(|βj|)+1a1(β~jβj).P^{{}^{\prime}}_{\lambda_{1}}(|\tilde{\beta_{j}}|)=P^{{}^{\prime}}_{\lambda_{1}}(|\beta^{\star}_{j}|)+P^{{}^{\prime\prime}}_{\lambda_{1}}(|\beta^{\star}_{j}|)(\tilde{\beta}_{j}-\beta^{\star}_{j})\leq P^{{}^{\prime}}_{\lambda_{1}}(|\beta^{\star}_{j}|)+\frac{1}{a-1}(\tilde{\beta}_{j}-\beta^{\star}_{j}).

Thus from (\star), we obtain

κ𝒉22\displaystyle\kappa||\bm{h}||^{2}_{2} <\displaystyle< 𝒉2(z+Pλ1(|𝜷S|)2+1a1𝜷~𝜷2).\displaystyle||\bm{h}||_{2}(z^{\star}+||P^{{}^{\prime}}_{\lambda_{1}}(|\bm{\beta}^{\star}_{S}|)||_{2}+\frac{1}{a-1}||\tilde{\bm{\beta}}-\bm{\beta}^{\star}||_{2}).

Therefore,

||𝒉||21κ((γwmax)λ11+γwmin+||Pλ1(|𝜷S|)||2+1a1||𝜷~𝜷||2).||\bm{h}||_{2}\leq\frac{1}{\kappa}(\frac{(\gamma-w_{max})\lambda_{1}}{1+\gamma w_{min}}+||P^{{}^{\prime}}_{\lambda_{1}}(|\bm{\beta}^{\star}_{S}|)||_{2}+\frac{1}{a-1}||\tilde{\bm{\beta}}-\bm{\beta}^{\star}||_{2}).\quad\blacksquare

References

  • Becker et al., (2011) Becker, N., Toedt, G., Lichter, P., and Benner, A. (2011). Elastic scad as a novel penalization method for svm classification tasks in high-dimensional data. BMC bioinformatics, 12(1):1–13.
  • Bradley and Mangasarian, (1998) Bradley, P. S. and Mangasarian, O. L. (1998). Feature selection via concave minimization and support vector machines. In ICML, volume 98, pages 82–90. Citeseer.
  • Chang and Chen, (2008) Chang, H. H. and Chen, S. W. (2008). The impact of online store environment cues on purchase intention: Trust and perceived risk as a mediator. Online information review.
  • Christidis et al., (2021) Christidis, A.-A., Van Aelst, S., and Zamar, R. (2021). Data-driven diverse logistic regression ensembles. arXiv preprint arXiv:2102.08591.
  • Cortes and Vapnik, (1995) Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3):273–297.
  • Dedieu, (2019) Dedieu, A. (2019). Error bounds for sparse classifiers in high-dimensions. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 48–56. PMLR.
  • Friedman et al., (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1.
  • Golub et al., (1999) Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. science, 286(5439):531–537.
  • Huang and Zhang, (2012) Huang, J. and Zhang, C.-H. (2012). Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications. The Journal of Machine Learning Research, 13(1):1839–1864.
  • Kharoubi et al., (2019) Kharoubi, R., Oualkacha, K., and Mkhadri, A. (2019). The cluster correlation-network support vector machine for high-dimensional binary classification. Journal of Statistical Computation and Simulation, 89(6):1020–1043.
  • Koo et al., (2008) Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008). A bahadur representation of the linear support vector machine. The Journal of Machine Learning Research, 9:1343–1368.
  • Lee and Mangasarian, (2001) Lee, Y.-J. and Mangasarian, O. L. (2001). Rsvm: Reduced support vector machines. In Proceedings of the 2001 SIAM International Conference on Data Mining, pages 1–17. SIAM.
  • Negahban et al., (2009) Negahban, S., Yu, B., Wainwright, M. J., and Ravikumar, P. (2009). A unified framework for high-dimensional analysis of mm-estimators with decomposable regularizers. Advances in neural information processing systems, 22.
  • Peng et al., (2016) Peng, B., Wang, L., and Wu, Y. (2016). An error bound for l1-norm support vector machine coefficients in ultra-high dimension. The Journal of Machine Learning Research, 17(1):8279–8304.
  • Raskutti et al., (2010) Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue properties for correlated gaussian designs. The Journal of Machine Learning Research, 11:2241–2259.
  • Rudelson and Zhou, (2012) Rudelson, M. and Zhou, S. (2012). Reconstruction from anisotropic random measurements. In Conference on Learning Theory, pages 10–1. JMLR Workshop and Conference Proceedings.
  • Singh et al., (2002) Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P., et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2):203–209.
  • Storey and Tibshirani, (2003) Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445.
  • Wang et al., (2006) Wang, L., Zhu, J., and Zou, H. (2006). The doubly regularized support vector machine. Statistica Sinica, pages 589–615.
  • Yang and Zou, (2013) Yang, Y. and Zou, H. (2013). An efficient algorithm for computing the hhsvm and its generalizations. Journal of Computational and Graphical Statistics, 22(2):396–415.
  • Yi and Huang, (2017) Yi, C. and Huang, J. (2017). Semismooth newton coordinate descent algorithm for elastic-net penalized huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3):547–557.
  • Zhang et al., (2016) Zhang, X., Wu, Y., Wang, L., and Li, R. (2016). Variable selection for support vector machines in moderately high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):53–76.
  • Zhu et al., (2003) Zhu, J., Rosset, S., Tibshirani, R., and Hastie, T. (2003). 1-norm support vector machines. Advances in neural information processing systems, 16.
  • Zou and Li, (2008) Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of statistics, 36(4):1509–1533.