An efficient projection neural network for $\ell_{1}$ -regularized logistic regression

Majid Mohammadi, Amir Ahooye Atashin, Damian A. Tamburri Majid Mohammadi and Damian Tamburri are with the Jheronimus Academy of Data Science, Eindhoven University, The Netherlands. Amir Ahooye Atashin is with the Department of Computer Engineering, Ferdowsi University of Mashhad, Iran.

Abstract

$\ell_{1}$ regularization has been used for logistic regression to circumvent the overfitting and use the estimated sparse coefficient for feature selection. However, the challenge of such a regularization is that the $\ell_{1}$ norm is not differentiable, making the standard algorithms for convex optimization not applicable to this problem. This paper presents a simple projection neural network for $\ell_{1}$ -regularized logistics regression. In contrast to many available solvers in the literature, the proposed neural network does not require any extra auxiliary variable nor any smooth approximation, and its complexity is almost identical to that of the gradient descent for logistic regression without $\ell_{1}$ regularization, thanks to the projection operator. We also investigate the convergence of the proposed neural network by using the Lyapunov theory and show that it converges to a solution of the problem with any arbitrary initial value. The proposed neural solution significantly outperforms state-of-the-art methods with respect to the execution time and is competitive in terms of accuracy and AUROC.

Index Terms:

logistic regression,

\ell_{1}

regularization, Lyapunov, global convergence.

I Introduction

Logistic regression is one of the most popular classification techniques with an unconstrained smooth convex problem, which can be solved by the standard algorithms for convex optimization. One of the major problems in classifiers, including logistic regression, is overfitting, which can be mitigated by using a regularization term. $\ell_{1}$ -regularized logistic regression is one of the most popular regularized method, whose minimization problem, for a data set with $n$ samples $\{X,y\}$ where $X\in R^{n\times d}$ and $y\in\{0,1\}^{n}$ , is:

\displaystyle\min_{w}L(X,w)+\lambda\|w\|_{1},

(1)

where

	$\displaystyle L(X;w)=-\frac{1}{m}\sum_{i=1}^{n}\bigg{(}y_{i}log(\hat{y}_{i})+(1-y_{i})log(1-\hat{y}_{i})\bigg{)},$		(2)
	$\displaystyle\hat{y}_{i}=1/(1+exp(-w^{T}x_{i})),$		(3)

and $x_{i}\in R^{d}$ is the $i^{th}$ sample in the data set. Another option for regularization is $\ell_{2}$ norm, but $\ell_{1}$ regularization can be used for feature selection [1], and has shown to outperform $\ell_{2}$ norm when the number of samples is smaller than the number of features [2]. Despite the usefulness of the $\ell_{1}$ regularization, the drawback of minimization (1) is its non-smoothness, placing a serious obstacle to applying the standard algorithms for smooth convex optimization.

There are several approaches to handle the non-smoothness of problem (1). A popular approach is to use cyclic coordinate descent without computing directly the derivative of minimization (1), which is computationally expensive since computing the objective function of logistic regression entails the costly computation of the sigmoid function [3]. Another approach is to use two auxiliary variables and rewrite $w$ as the difference of two non-negative variables [4]. This approach, while removing the non-smooth $\ell_{1}$ norm, increases the dimension of the problem by a factor of two, and subsequently, increases the computational costs. There are also other solvers as well, where the $\ell_{1}$ norm is replaced by a similar smooth function, and the associated problem is solved using standard techniques for convex programming [5]. See also [6] and [7] for more information on methods and their comparison.

The projection neural networks have shown promising performance in solving optimization problems with intricate constraints [8, 9]. The neural solution for optimization problems enables us to implement the structure of the network by using VLSI (very large scale integration) and optical technologies [10], making it applicable to the cases where real-time processing is demanded. Besides, the dynamic system of the recurrent neural networks can be represented by ordinary differential equations (ODEs), so they can be implemented on digital computers as well. An important benefit of neural solutions is their ability to globally converge to an exact solution of the given problem, increasing their suitability for real-world problems. As a result, they have been applied to numerous real-world applications [11, 12, 13, 14, 15, 16, 17]. Xia and Wang [11] developed a one-layer neural network for support vector classification, which is proved to converge globally exponentially to the optimal solution of the original constrained problem. A simpler neural solution with one-layer structure is developed in [12] and it’s global and exponentially convergence is investigated. For regression, multiple compact neural network models are developed [13, 14, 15], all of which are proved to be globally convergent to a solution. Projection neural networks are applied to other practical problems, such as image restoration [16], image fusion [17], robot control [18], non-negative matrix factorization [19], to name just a few.

A class of projection neural networks is for non-smooth optimization problems, allowing us to use them for solving minimization (1) [20, 21]. Qin et al. [20] proposed a two-layer projection neural network for non-smooth minimization with equality constraints and showed that the neural solution is globally converge to the optimal solution of the given minimization. A simpler one-layer neural model is developed in [21] for non-smooth problems with equality and inequality constraints whose global convergence was investigated under mild conditions. While these neural models are applicable to solving problem (1), they are not as efficient as models for smooth optimization problems, and their convergence typically entails some conditions which might not be met in real-world applications. In addition, the non-smoothness in problem (1) stems from the $\ell_{1}$ norm regularization term whose subgradient has a specific form. Therefore, chances are that a more efficient neural model, compared to those for general non-smooth problems, can be developed with guaranteed convergence.

In this paper, we developed a projection neural network for solving problem (1). Projection neural networks are typically used for constrained problems, where a projection operator is employed to basically make an obtained solution resides inside a feasible region with respect to a set of constraints. While problem (1) is unconstrained, this article shows that the proposed neural model converges to the solution of problem (1). The convergence is proved by taking advantage of the Lyapunov theory, where it is demonstrated that the proposed neural solution is stable, and converges to an optimal of problem (1) with any arbitrary initial point. The proposed method is computationally simple and its complexity is almost identical to the logistic regression without regularization, allowing us to provide an efficient and scalable implementation for large-scale problems. Extensive experiments on many data sets show that the proposed neural network is significantly fast, and its performance is competitive with state-of-the-art solvers in terms of accuracy and AUROC (area under the receiver operating characteristic curve). The contributions of this paper can be summarized as follows:

•

The subgradient of the $\ell_{1}$ norm is adequately dealt with by using a projection operator. As such, a projection neural network is developed, which is shown to converge to a solution to problem (1). In addition, the complexity of the proposed method is similar to the gradient descent for logistic regression without any regularization.
•

We prove the convergence of the proposed neural method by using the Lyapunov theory for dynamic systems, and will be further shown that the convergence is not reliant on the initialization, in contrast to other methods such as interior-point, where the initialization matters.
•

The simplicity of the method allows us to provide an efficient implementation. In particular, our efficient implementation is amenable to parallel computing and execution on the graphics processing unit (GPU), which significantly outperforms other solutions for minimization (1).
•

The Python implementation of the proposed neural network is publicly and freely available¹¹1 https://github.com/Majeed7/L1LR.

This paper is structured as follows. Section II presents the proposed neural solution for minimization in (1). Section III presents the stability of the proposed neural solution, while the experiments regarding the proposed neural network on multiple synthetic and real data are discussed in Section IV. Finally, the paper is concluded in Section V.

II The Neurodynamic Model

In this section, we develop a simple projection neural network for problem (1), and it is then extended to take into account a bias term in minimization (1).

II-A Basic Model

To develop a projection neural network, the Karush–Kuhn–Tucker (KKT) conditions of optimality for problem (1) are written as:

\displaystyle\nabla L(X,w)+q=0,

(4)

where $q=(q_{1},q_{2},...,q_{d})$ are the subgradient of $\lambda\|w\|_{1}$ defined as:

\displaystyle q_{i}=\partial{(\lambda\|w\|_{1})}_{i}\left\{\begin{array}[]{lc}=\lambda&w_{i}>0\\ \in[-\lambda,\lambda]&w_{i}=0\\ =-\lambda&w_{i}<0.\end{array}\right.

(9)

Using the projection operator, the subgradient in (9) can be written as [22]:

\displaystyle q=P_{\Omega}(w+q)

(10)

where $\Omega=\{z|-\lambda\leq z\leq\lambda\}$ , $P_{\Omega}(z)=[P_{\Omega}(z_{1}),P_{\Omega}(z_{2}),...,P_{\Omega}(z_{d})]$ , and

\displaystyle P_{\Omega}(z_{i})=\left\{\begin{array}[]{cc}\lambda&z_{i}>\lambda\\ z_{i}&|z_{i}|\leq\lambda\\ -\lambda&z_{i}<-\lambda.\end{array}\right.

(14)

By replacing equation (4) in the projection equation in (10), we arrive at the recurrent neural network whose dynamic equation is given by

	$\displaystyle\frac{dw}{dt}$	$\displaystyle=-\alpha\bigg{(}\nabla L(X,w)+P_{\Omega}\big{(}w-\nabla L(X,w)\big{)}\bigg{)}$		(15)
		$\displaystyle=\alpha\bigg{(}-\nabla L(X,w)+P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}\bigg{)},$		(16)

where $\alpha$ is a positive constant and the last equality is derived since $P_{\Omega}(.)$ is an odd function, i.e., $-P_{\Omega}(x)=P_{\Omega}(-x)$ . The equation in (15) resembles the projection neural network with a one-layer architecture as shown in Figure 1. According to this figure, we need $2d$ projection (half of which are the sigmoid function for computing $\nabla_{w_{i}}$ ), $2d$ summation, $d$ integrators, and some multipliers.

Refer to caption — Figure 1: The architecture of the proposed neural solution.

Remark 1

If there is not $\ell_{1}$ regularization in problem (1), then the following dynamic system representing a recurrent neural network can be used to solve the conventional logistic regression problem:

\displaystyle\frac{dw}{dt}

\displaystyle=-\alpha\nabla L(X,w).

(17)

If we discretize this continuous dynamic system, then it will be tantamount to the gradient projection method, which is arguably the most popular optimization algorithm for logistic regression. By juxtaposing equations (15) and (17), it is readily seen that the proposed projection neural network in (15) has the same number of variable as the gradient descent method for logistic regression without regularization, as it is in equation (17). Thus, the proposed neural method does scale up the number of variables even with the $\ell_{1}$ regularization. The only extra computational overload is $2d$ summations/subtractions for computing the right-hand side in (15).

II-B Bias term computation

There is typically a bias term $b$ in the standard logistic regression as well, where $\hat{y}$ in problem (1) is written as:

\displaystyle\hat{y}_{i}=1/\bigg{(}1+exp(-w^{T}x_{i}-b)\bigg{)}.

(18)

To take into account the bias term, one typical is to define $\hat{w}=(w,b)^{T}$ and $\hat{x}_{i}=(x_{i}^{T},1)^{T}$ , so the solution to problem (1) with $\hat{w}$ and $\hat{x}_{i}$ would also yield the bias term $b$ . However, this problem is slightly different since the absolute value of the bias term also exists in the objective function (which is not sought to be minimized in the standard $\ell_{1}$ -regularized logistic regression). A more proper way is to compute the derivative of the objective function in (1) with respect to $b$ , which does not include the $\ell_{1}$ norm. Therefore, the proposed method is turned into:

\frac{d}{dt}\begin{pmatrix}w\\ b\end{pmatrix}=-\alpha\begin{pmatrix}\nabla_{w}L(X,w,b)+P_{\Omega}\big{(}w-\nabla_{w}L(X,w,b)\big{)}\cr\nabla_{b}L(X,w,b)\end{pmatrix}

(19)

where $\nabla_{w}L(X,w,b)$ and $\nabla_{b}L(X,w,b)$ are the derivatives of the logistic loss function with respect to $w$ and $b$ , respectively.

III Convergence Analysis

In this section, the convergence of the proposed projection neural network in (15) is investigated by using the Lyapunov theory. We first provide some necessary definitions and then study the convergence of the dynamic system in (15).

Definition 2

A vector $w$ is the equilibrium point of the system (15) if and only if $\nabla L(w,x)=P_{\Omega}(w-\nabla L(w,x))$ .

Definition 3

The dynamic system in (15) is said to be stable at $w^{*}$ with the initial value $w(0)$ if, for all $\epsilon>0$ , there exists a $\delta(\epsilon)>0$ such that $\|w(k)-w^{*}\|\leq\epsilon$ if $\|w(0)-w^{*}\|<\delta(\epsilon)$ .

Lemma 4 ([22])

The projection operator used in equation (15) has the following properties:

•

$(v-P_{\Omega}(v))^{T}(P_{\Omega}(v)-u)\geq 0,\quad\forall v\in R^{d},u\in\Omega$ .
•

$\|P_{\Omega}(x)-P_{\Omega}(y)\|\leq\|x-y\|,\quad\forall u,v\in R^{d}.$

We now state the main result for the convergence of the proposed method.

Theorem 5

For any given initial value $w^{0}$ , the dynamic system in (15) is stable in the sense of Lyapunov and globally converges to equilibrium for any $\alpha>0$ .

Proof:

We first derived two inequalities that are essential for the proof of the convergence of the dynamic system in (15). Let $u=\nabla L(X,w^{*})$ and $v=\nabla L(X,w)-w$ in the first inequality of Lemma 4, it follows:

	$\displaystyle\bigg{(}\nabla L(X,w)-w-P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}\bigg{)}^{T}$		(20)
	$\displaystyle\bigg{(}P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}-L(X,w^{*})\bigg{)}\geq 0.$		(21)

In addition, according to the variational inequality [22], the solution to the projection equation in (10) is tantamount to $w^{*}$ being held true in the following inequality:

	$\displaystyle(w-q^{})^{T}(-w^{})\geq 0,\quad\forall w\in\Omega\quad$		(22)
	$\displaystyle\Rightarrow\big{(}w+\nabla L(X,w^{})\big{)}^{T}(-w^{})\geq 0$		(23)

Letting $w=-P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}\in\Omega$ , the above inequality becomes:

\displaystyle\bigg{(}P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}-\nabla L(X,w^{*})\bigg{)}^{T}w^{*}\geq 0

(24)

Summing up the inequalities in (20) and (24), we arrive at:

	$\displaystyle\bigg{(}P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}-\nabla L(X,w^{*})\bigg{)}^{T}$
	$\displaystyle\qquad\bigg{(}\nabla L(X,w)-w-P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}+w^{*}\bigg{)}\geq 0,$

which results in:

		$\displaystyle\bigg{(}w-w^{}+\nabla L(X,w)-\nabla L(X,w^{})\bigg{)}^{T}$		(25)
		$\displaystyle\qquad\bigg{(}P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}-\nabla L(X,w)\bigg{)}$		(26)
		$\displaystyle\qquad\leq-\big{(}w-w^{}\big{)}\big{(}\nabla L(X,w)-\nabla L(X,w^{})\big{)}-$		(27)
		$\displaystyle\quad\qquad\\|P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}-\nabla L(X,w)\\|^{2}.$		(28)

Now consider the following Lyapunov function [23]:

$\displaystyle V(w)$	$\displaystyle=\int_{0}^{1}\alpha^{-1}(w-w^{*})^{T}$	(29)
	$\displaystyle\quad\bigg{(}w^{}+s(w-w^{})+\nabla L\big{(}X,w^{}+s(w-w^{})\big{)}\bigg{)}ds$	(30)
	$\displaystyle-\alpha^{-1}(w-w^{})^{T}\big{(}w^{}+\nabla L(X,w^{*})\big{)}$	(31)

where $w^{*}$ is the equilibrium of (15). Since $\nabla^{2}L(X,w)$ is symmetric and positive semi-definite, $V(w)$ is continuously differentiable and convex in $R^{d}$ . Also, the gradient of $V$ is defined as [23]:

\displaystyle\alpha\nabla V(w)=w-w^{*}+\nabla L(X,w)+\nabla L(X,w^{*}).

(32)

Based on this Lyapunov function, we have:

		$\displaystyle\frac{\partial V}{\partial t}=\bigg{(}\frac{\partial V}{\partial u}\bigg{)}^{T}\frac{\partial w}{\partial t}$		(33)
		$\displaystyle=\bigg{(}w-w^{}+\nabla L(X,w)+\nabla L(X,w^{})\bigg{)}^{T}$		(34)
		$\displaystyle\quad\qquad\bigg{(}P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}-\nabla L(X,w)\bigg{)}$		(35)
		$\displaystyle\leq-\big{(}w-w^{}\big{)}\big{(}\nabla L(X,w)-\nabla L(X,w^{})\big{)}-$		(36)
		$\displaystyle\qquad\\|P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}-\nabla L(X,w)\\|^{2}\leq 0,$		(37)

where the last two inequalities are derived based on (25) and the fact that for the convex function $L(X,w)$ , we have:

\displaystyle(w-w^{*})^{T}\bigg{(}\nabla L(X,w)-L(X,w^{*})\bigg{)}\geq 0.

(38)

The equation in (33) shows that the dynamic system in (15) is stable in the Lyapunov sense. According to the LaSalle’s invariant set theorem [24], the trajectories of the dynamic system in (15) converge to the largest invariant set $\Phi$ defined as:

\displaystyle\Phi=\{w|\partial V(q)/\partial t=0\}.

(39)

To show the global convergence of the proposed method, we need to show $\partial V(w)/\partial t=0$ if and only if $\partial w/\partial t=0$ . If $\partial w/\partial t=0$ , then

\displaystyle\frac{\partial V}{\partial t}=\bigg{(}\frac{\partial V}{\partial u}\bigg{)}^{T}\frac{\partial w}{\partial t}=0.

Conversely, if $\partial V(w)/\partial t=0$ , we have, based on (33),

	$\displaystyle\big{(}w-w^{}\big{)}\big{(}\nabla L(X,w)-\nabla L(X,w^{})\big{)}+$		(40)
	$\displaystyle\quad\\|P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}-\nabla L(X,w)\\|^{2}=0.$		(41)

Since both parts of the above equation are non-negative, it is required that both of the parts become zero, which follows:

\displaystyle P_{\Omega}\big{(}\nabla L(X,w)-w\big{)}-\nabla L(X,w)=0\Rightarrow\partial w/\partial t=0.

(42)

Therefore, the dynamic system in (15) converges globally to the solution of problem (1), irrespective of the initial value, and the proof is complete.

∎

IV Experiments

In this section, we first investigate the convergence behavior of the proposed neural network as well as the relationship between the sparsity of solutions and the regularization parameter. We then compare the performance of the proposed neural solution with several existing $\ell_{1}$ -regularized logistic regression methods in terms of their execution time, accuracy, receiver operating characteristic curve (ROC), and the area under ROC (AUROC) on several data sets.

IV-A Convergence Analysis

We first inspect empirically the convergence of the proposed neural network by applying it to the splice data set with different initial values. In particular, we conduct the experiment by using zero, one, and random initialization. Figure 2 shows the trajectories of different elements in the weight vector $w$ throughout the optimization procedure. Different elements of $w$ are shown by the same color in the three plots. According to this figure, the trajectories of the proposed neural solution converge to the same solution, regardless of the initial values, which corroborates the global convergence of the proposed network.

In addition, we investigate the convergence of the proposed method by plotting the objective function value of problem (1) in different iterations for various values of $\lambda$ , as is shown in Figure 3(a). We also opt for random initialization values to further complicate the convergence of the neural solution. Figure 3(a) illustrates that the objective function value steadily decreases in different iterations and with different values of $\lambda$ until it reaches a minimum. This would also corroborate the global convergence of the proposed neural network given different values for initialization and $\lambda$ .

TABLE I: The number of features, training, and test samples of the data sets used in the experiments.

data set	# of features	# of training samples	# of test samples
splice	60	1000	2175
madelon	500	2000	600
liver-disorders	5	145	200
ijcnn1	22	49990	91701
a1a	123	1605	30956
a9a	123	32561	16281
leukemia	7129	38	34
gisette	5000	6000	1000

TABLE II: The methods’ execution time (in seconds) over eight data sets.

Method’s name	splice	madelon	liver-disorders	ijcnn1	a1a	a9a	leukemia	gisette
Gauss-Seidel	4.13	27.67	0.05	7.79	2.90	58.42	21.90	977.58
Shooting	2.41	20.64	0.17	1.53	1.95	43.73	14.37	698.52
Gauss-Southwell	3.55	10.89	0.12	1.68	2.38	59.25	20.11	846.80
Grafting	1.08	20.64	0.25	2.61	2.31	50.82	21.63	649.00
SubGradient	0.08	1.46	0.08	0.65	0.10	0.25	61.06	18.57
epsL1	0.04	1.07	0.06	0.36	0.18	5.68	25.80	401.36
Log-Barrier	49.75	2.66	0.17	0.87	0.16	2.20	457.35	398.98
SmoothL1	0.14	4.64	0.01	1.47	2.19	7.21	144.02	2214.89
SQP	0.06	3.40	0.19	1.20	0.62	1.06	1591.70	1585.94
ProjectionL1	0.04	5.20	0.04	1.51	2.51	45.25	1951.45	6483.31
InteriorPoint	0.22	12.34	0.09	0.53	0.42	2.67	11677.12	27362.16
Orthant-Wise	0.05	20.09	0.06	1.83	3.16	46.15	143543.25	54686.79
Pattern-Search	0.16	4.33	0.21	1.34	0.57	6.03	29.52	700.61
sklearn	0.01	4.63	0.01	0.53	0.62	57.32	0.04	5.94
Proposed method	0.22	0.79	0.01	0.17	0.40	1.79	0.14	1.5

TABLE III: Accuracy of the methods over eight data sets.

Method’s name	splice	madelon	liver-disorders	ijcnn1	a1a	a9a	leukemia	gisette
Gauss-Seidel	85.47	59.33	59.5	89.28	82.94	83.19	91.18	84.8
Shooting	85.47	52.50	59.0	90.50	83.45	82.46	64.71	68
Gauss-Southwell	85.47	58.17	59.0	90.44	83.90	85.01	85.29	95.2
Grafting	85.47	57.83	59.0	91.00	83.84	84.99	82.35	94.4
SubGradient	85.47	56.67	59.0	9.50	24.05	23.62	58.82	50.0
epsL1	85.47	56.67	59.0	91.34	83.84	84.99	76.47	98.0
Log-Barrier	85.47	56.33	59.0	91.34	83.84	84.98	85.29	97.9
SmoothL1	85.47	56.33	59.0	91.20	83.84	79.63	61.76	95.7
SQP	85.47	56.33	59.0	91.34	83.84	84.99	85.29	97.9
ProjectionL1	85.47	56.33	59.0	91.33	83.84	84.98	85.29	98.1
InteriorPoint	85.47	56.33	59.0	91.34	83.84	84.99	85.29	97.9
Orthant-Wise	85.47	56.33	59.0	91.33	83.84	84.97	85.29	98.3
Pattern-Search	85.47	56.33	59.0	91.34	83.84	84.99	88.24	98.2
sklearn	85.47	56.67	59.00	88.23	83.83	84.99	88.23	97.9
Proposed method	85.47	59.0	61.50	91.90	83.62	85.03	94.11	98.0

IV-B Sparsity

We now look into the sparsity of the solutions given by the proposed method by subjecting the madelon data set to the proposed method. Figure 3(b) plots the value of $\lambda$ against the $\ell_{1}$ norm of the solution. It is readily seen from this figure that the sparsity increases as expected when the value of $\lambda$ is set to a higher value. This figure supports the sparsity induced by the proposed method, which is controlled by the parameter $\lambda$ .

We also plot the coefficient vector $w$ for different values of $\lambda$ by using a heat map. Figure 4 shows the heat map of $w$ for different values of $\lambda$ over ijcnn1 and splice data sets. The heat map shows that the vast majority of elements in $w$ take on zero or infinitesimal values, shown by greenish color, while there are some few elements with non-zero and bigger values, shown as yellowish colors. As expected, if the values of $\lambda$ exceed a threshold, then the resulting $w$ will become all zero, as shown at the top row of the two plots in Figure 4.

IV-C Comparison of real data sets

We now compare the performance of the proposed neural solution with other state-of-the-art solvers on several real data sets. The comparison is made in terms of execution time, classifier accuracy, ROC, as well as AUROC. We first introduce the methods and data sets being used.

Methods

For the comparison study, we select several methods, each with a different approach to handle the $\ell_{1}$ norm. The methods are Gauss-Seidel [25], Shooting [26], Gauss-Southwell [27], Grafting [28], Subgradient [7], epsL1 [29], Log-Barrier [7], SmoothL1 [5], SQP [29], ProjectionL1 [7], InteriorPoint [4], Orthant-wise [7], and Pattern-Search [7], all implemented in MATLAB and are freely available [30]. We also use the implementation for logistic regression in scikit-learn, and refer to it as sklearn [31]. The core of scikit-learn methods have been implemented in $C/C\!+\!+$ , and the implementation of logistic regression takes advantage of an inherent feature and subset selection, which makes it significantly faster.

Data sets and experimental setup

We subject the proposed method and other methods to eight data sets: leukemia, live-disorders, madelon, splice, gisette, ijcnn1, a1a, and a9a. The data sets have their own training and test partitions, which are used for the training and testing of all the methods. The information regarding these data sets is tabulated in Table I.

Execution time comparison

We first compare the methods in terms of the time they consumed to solve problem (1). For doing so, we set the stopping criterion of all the methods to $10^{-6}$ . Table II shows the execution time of the methods to solve problem (1) for different data sets. For those of data sets with a limited number of features and training samples, all the methods behave efficiently and converge to an optimal in a relatively fast manner, e.g., the performance of methods over the data sets splice, madelon, liver-disorders, and a1a. Even for a data set with a large number of the training samples like ijcnn1, the performance of the solvers in terms of their execution time is quite competitive, mainly because the dimensions of the solvers are typically linear in the number of features, which is $22$ for the $ijcnn1$ data set. But when the number of features and training samples increases together, many of the solvers fail to produce a solution in a reasonable time frame. For example, it takes more than three and seven hours for the InteriorPoint method to generate a result on the leukemia and gisette data sets, respectively. Or it takes even more for the Orthant-Wise method to generate results for the same data sets. Among other methods, SubGradient has a very acceptable performance in terms of execution time in generating acceptable results, but we also need to investigate how good its result is regarding accuracy and AUROC. The proposed neural solution, on the other hand, shows a significantly better performance compared to other methods in terms of execution time and demonstrates a fast performance regardless of the size of the data set. In particular, it is especially competitive with sklearn despite its efficient implementation (which includes feature and subset selection in $C/C\!+\!+$ ). A surprising point is also on the execution time of sklearn for the a9a data set, which shows that a significant increase in both feature and training samples number would adversely affect the performance of this solver. The proposed method demonstrates a stable performance on different data sets with a different number of features and/or training samples.

Accuracy comparison

While the execution time of the proposed method is quite superior to other solvers, it is also essential to compare its performance in terms of accuracy. Table III shows the accuracy of different methods on different data sets²²2Some of the numbers are the same in this table, simply because we round the number to two decimal digits and that causes us to ignore the infinitesimal differences between the accuracy of two solvers.. According to this table, the accuracy of the proposed method is better than or equivalent to other solvers on five data sets: splice, liver-disorders, ijcnn1, a9a, and leukemia. For the remaining three data sets, the accuracy of the proposed method is quite competitive with the best-performing solvers. In particular, the accuracy difference between the proposed neural solution and best-performing solvers for the three data sets (i.e., Gauss-Seidel for madelon, Gauss-Southwell for a1a, and Orthant-Wise for gisette) is less than $1\%$ , making the performance of the proposed neural solution very competitive with the state-of-the-art solvers in terms of accuracy. At the same, these methods require plenty of time to generate a result. For example, Orthant-wise takes more than 15 hours to generate results on the gisette data set, while its accuracy is only $0.3\%$ higher than that of the proposed neural solution.

ROC curve and AUROC comparison

We now compare the proposed neural solution in terms of ROC curve and AUROC. Figure 5 plots the ROC curves of the methods over eight data sets, the legend of which shows the AUROC of each method as well. As in the ROC curve, a deviation from diagonal shows a better performance, this figure also supports that the performance of the proposed neural solution is competitive with other solvers. In particular, the proposed neural network is the best-performing method on a9a and ijcnn1. In addition, its difference with the top-performing method on a1a, splice, and gisette data sets are very infinitesimal with a difference in AUROC less than $0.004$ . However, Gauss-Seidel has shown better performance on the madelon data set with a higher margin. This experiment also upholds the reasonable performance of the proposed neural network compared to the state-of-the-art solvers. Therefore, the conclusion can be drawn that the proposed neural solution can provide a reliable solution to the logistic regression with $\ell_{1}$ regularization, while it is significantly fast and is scalable for large-scale problems.

V Conclusion

This paper presented a simple yet efficient projection neural network to solve $\ell_{1}$ -regularized logistic regression. The proposed solver utilizes the projection operator, which is typically used in constrained optimization, for the unconstrained but non-smooth problem. The global convergence of the proposed method is guaranteed by using the Lyapunov theory, and its competitive performance with other state-of-the-art solvers was demonstrated on several real data sets.

The proposed neural solution was developed without using any auxiliary variable, so the dimension of the solver is the same as the dimension of the original problem. This means that the complexity of the proposed neural network for $\ell_{1}$ -regularized logistic regression is the same as that without regularization, making the resulting projection neural network very efficient in terms of execution time and memory consumption.

For the future, similar projection neural solutions can be tailored for other non-smooth problems, such as Lasso, where only the logistic loss function is replaced by the least square. Another extension of this method can be used to solve the support vector machine problem in primal, by incorporating the subgradient of the maximum function by a projection operator.

Another area of interest is the training of the multi-layer feedforward neural networks, where $\ell_{1}$ regularization is of the highest interest, especially in preventing overfitting. The result of this research can be further used, probably with some modification, in order to develop more efficient training algorithms for feedforward neural networks when they regularize the weights with an $\ell_{1}$ penalty.

Methods for tuning the learning rate $\alpha$ in the projection neural solution should be considered, since a proper way of tuning this parameter significantly affects the convergence behavior. One solution is to use model-agnostic meta-learning, which has been successfully applied to various learning models.

References

[1] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[2] A. Y. Ng, “Feature selection, l 1 vs. l 2 regularization, and rotational invariance,” in Proceedings of the twenty-first international conference on Machine learning, 2004, p. 78.
[3] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, “Parallel coordinate descent for l1-regularized loss minimization,” arXiv preprint arXiv:1105.5379, 2011.
[4] K. Koh, S.-J. Kim, and S. Boyd, “An interior-point method for large-scale l1-regularized logistic regression,” Journal of Machine learning research, vol. 8, no. Jul, pp. 1519–1555, 2007.
[5] M. Schmidt, G. Fung, and R. Rosales, “Fast optimization methods for l1 regularization: A comparative study and two new approaches,” in European Conference on Machine Learning. Springer, 2007, pp. 286–297.
[6] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, “A comparison of optimization methods and software for large-scale l1-regularized linear classification,” Journal of Machine Learning Research, vol. 11, no. Nov, pp. 3183–3234, 2010.
[7] M. Schmidt, G. Fung, and R. Rosales, “Fast optimization methods for l1 regularization: A comparative study and two new approaches,” in European Conference on Machine Learning. Springer, 2007, pp. 286–297.
[8] Y. Xia, H. Leung, and J. Wang, “A projection neural network and its application to constrained optimization problems,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 49, no. 4, pp. 447–458, 2002.
[9] L. Jin, S. Li, B. Hu, and M. Liu, “A survey on projection neural networks and their applications,” Applied Soft Computing, vol. 76, pp. 533–544, 2019.
[10] Y. Lu, D. Li, Z. Xu, and Y. Xi, “Convergence analysis and digital implementation of a discrete-time neural network for model predictive control.” IEEE Trans. Industrial Electronics, vol. 61, no. 12, pp. 7035–7045, 2014.
[11] Y. Xia and J. Wang, “A one-layer recurrent neural network for support vector machine learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 34, no. 2, pp. 1261–1269, 2004.
[12] M. Mohammadi, S. H. Mousavi, and S. Effati, “Generalized variant support vector machine,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2019.
[13] Y. Xia, C. Sun, and W. X. Zheng, “Discrete-time neural network for fast solving large linear $l\_$ { $1$ } estimation problems and its application to image restoration,” IEEE transactions on neural networks and learning systems, vol. 23, no. 5, pp. 812–820, 2012.
[14] Y. Xia and J. Wang, “Robust regression estimation based on low-dimensional recurrent neural networks,” IEEE Transactions on Neural Networks and Learning Systems, no. 99, pp. 1–12, 2018.
[15] Y. Xia, H. Leung, N. Xie, and E. Bossé, “A new regression estimator with neural network realization,” IEEE transactions on signal processing, vol. 53, no. 2, pp. 672–685, 2005.
[16] Y. Xia, C. Sun, and W. X. Zheng, “Discrete-time neural network for fast solving large linear $l\_$ { $1$ } estimation problems and its application to image restoration,” IEEE transactions on neural networks and learning systems, vol. 23, no. 5, pp. 812–820, 2012.
[17] Y. Xia and M. S. Kamel, “Novel cooperative neural fusion algorithms for image restoration and image fusion,” IEEE Transactions on Image Processing, vol. 16, no. 2, pp. 367–381, 2007.
[18] Y. Xia and J. Wang, “A dual neural network for kinematic control of redundant robot manipulators,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 31, no. 1, pp. 147–154, 2001.
[19] J. Fan and J. Wang, “A collective neurodynamic optimization approach to nonnegative matrix factorization,” IEEE transactions on neural networks and learning systems, vol. 28, no. 10, pp. 2344–2356, 2017.
[20] S. Qin and X. Xue, “A two-layer recurrent neural network for nonsmooth convex optimization problems,” IEEE transactions on neural networks and learning systems, vol. 26, no. 6, pp. 1149–1160, 2015.
[21] L. Cheng, Z.-G. Hou, Y. Lin, M. Tan, W. C. Zhang, and F.-X. Wu, “Recurrent neural network for non-smooth convex optimization problems with application to the identification of genetic regulatory networks,” IEEE Transactions on Neural Networks, vol. 22, no. 5, pp. 714–726, 2011.
[22] D. Kinderlehrer and G. Stampacchia, An introduction to variational inequalities and their applications. Siam, 1980, vol. 31.
[23] R. K. Miller and A. N. Michel, Ordinary differential equations. Academic Press, 1982.
[24] J. P. La Salle, The stability of dynamical systems. SIAM, 1976.
[25] S. K. Shevade and S. S. Keerthi, “A simple and efficient algorithm for gene selection using sparse logistic regression,” Bioinformatics, vol. 19, no. 17, pp. 2246–2253, 2003.
[26] W. J. Fu, “Penalized regressions: the bridge versus the lasso,” Journal of computational and graphical statistics, vol. 7, no. 3, pp. 397–416, 1998.
[27] J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, and H. Koepke, “Coordinate descent converges faster with the gauss-southwell rule than random selection,” in International Conference on Machine Learning, 2015, pp. 1632–1641.
[28] S. Perkins, K. Lacker, and J. Theiler, “Grafting: Fast, incremental feature selection by gradient descent in function space,” Journal of machine learning research, vol. 3, no. Mar, pp. 1333–1356, 2003.
[29] S.-I. Lee, H. Lee, P. Abbeel, and A. Y. Ng, “Efficient l~ 1 regularized logistic regression,” in AAAI, vol. 6, 2006, pp. 401–408.
[30] M. Schmidt, “Graphical model structure learning with l1-regularization,” University of British Columbia, 2010.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011.

An efficient projection neural network for ℓ1\ell_{1}-regularized logistic regression