
Soft-SVM Regression For Binary Classification

Man Huang
Department of Mathematics & Statistics
Boston University
Boston, MA 02215
[email protected]
Luis Carvalho
Department of Mathematics & Statistics
Boston University
Boston, MA 02215
[email protected]
Abstract

The binomial deviance and the SVM hinge loss functions are two of the most widely used loss functions in machine learning. While there are many similarities between them, they also have their own strengths when dealing with different types of data. In this work, we introduce a new exponential family based on a convex relaxation of the hinge loss function using softness and class-separation parameters. This new family, denoted Soft-SVM, allows us to prescribe a generalized linear model that effectively bridges between logistic regression and SVM classification. This new model is interpretable and avoids data separability issues, attaining good fitting and predictive performance by automatically adjusting for data label separability via the softness parameter. These results are confirmed empirically through simulations and case studies as we compare regularized logistic, SVM, and Soft-SVM regressions and conclude that the proposed model performs well in terms of both classification and prediction errors.

1 Introduction

Binary classification has a long history in supervised learning, arising in multiple applications to practically all domains of science and motivating the development of many methods, including logistic regression, k-nearest neighbors, decision trees, and support vector machines [5]. Logistic regression, in particular, provides a useful statistical framework for this class of problems, prescribing a parametric formulation in terms of features and yielding interpretable results [7]. It suffers, however, from “complete separation” data issues, that is, when the classes are separated by a hyperplane in feature space. Support vector machines (SVMs), in contrast, aim at finding such a separating hyperplane that maximizes its margin to a subset of observations, the support vectors [9, 8]. While SVMs are robust to complete separation issues, they are not directly interpretable [6]; for instance, logistic regression provides class probabilities, but SVMs rely on post-processing to compute them, often using Platt scaling, that is, adopting a logistic transformation.

Both SVM and logistic regression, as usual in statistical learning methods, can be expressed as optimization problems with a data fitting loss function and a model complexity penalty loss. In Section 2 we adopt this formulation to establish a new loss function, controlled by a convex relaxation “softness” parameter, that encompasses both methods. Adopting then a generalized linear model formulation, we further explore this new loss to propose a new exponential family, Soft-SVM. Given that this new regularized regression setup features two extra parameters, we expect it to be more flexible and robust, in effect bridging the performance of SVM and logistic regressions and addressing their drawbacks. We empirically validate this assessment in Section 3 with a simulation study and case studies. We show, in particular, that the new method performs well, comparably to the best of SVM and logistic regressions, and at a computational cost similar to the smaller cost of logistic regression.

2 The Soft-SVM Family

Figure 1: Cumulant (left) and mean (right) functions for logistic, SVM, and Soft-SVM(5, 0.8).

Suppose we have n observations with f features x_i and labels y_i ∈ {0, 1}, for i = 1, …, n. In logistic regression we aim to find a set of coefficients β_0 and β that maximizes, with θ_i ≐ β_0 + x_i^⊤β,

\ell(\theta;\mathbf{y})=\sum_{i=1}^{n}y_{i}\theta_{i}-b(\theta_{i}), \qquad (1)

the log-likelihood (up to a constant). The parameters θ_i are canonical for the Bernoulli family and thus coincide with the linear predictor η_i = β_0 + x_i^⊤β; the function b is the Bernoulli cumulant function, given by b(θ) = log(1 + e^θ). Similarly, in SVM regression, the goal is to maximize, again with respect to β_0 and β, the negative cost

\sum_{i=1}^{n}-[1-(2y_{i}-1)\eta_{i}]_{+}-\frac{\lambda}{2}\|\beta\|_{2}^{2}=\sum_{i=1}^{n}y_{i}[(1+\eta_{i})_{+}-(1-\eta_{i})_{+}]-(1+\eta_{i})_{+}-\frac{\lambda}{2}\|\beta\|_{2}^{2}, \qquad (2)

where x_+ = max{0, x} is the plus function and λ is a penalty parameter. The data fitting loss per observation is the hinge loss [1 − (2y_i − 1)η_i]_+. The quadratic complexity penalty here arises from a Lagrange multiplier version of the original max-margin support vector formulation [5]. The negative loss in (2) can then be similarly expressed as a log-likelihood in (1) by setting the canonical parameters and SVM cumulant function s as

\theta_{i}=(1+\eta_{i})_{+}-(1-\eta_{i})_{+}\quad\text{and}\quad s(\theta)=(1+\eta)_{+}=\frac{1}{2}[(\theta+2)_{+}+(\theta-2)_{+}].
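A case-by-case check over the three regimes of η confirms the identity for s, using the corresponding value of θ in each case:

s(\theta)=\frac{1}{2}[(\theta+2)_{+}+(\theta-2)_{+}]=\left\{\begin{array}{lcl}\frac{1}{2}[0+0]=0=(1+\eta)_{+},&\mbox{if}&\eta\leq-1\ (\theta=\eta-1),\\ \frac{1}{2}[(2\eta+2)+0]=1+\eta=(1+\eta)_{+},&\mbox{if}&|\eta|<1\ (\theta=2\eta),\\ \frac{1}{2}[(\eta+3)+(\eta-1)]=1+\eta=(1+\eta)_{+},&\mbox{if}&\eta\geq 1\ (\theta=\eta+1).\end{array}\right.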

This, however, does not correspond to a valid exponential family specification since neither function is differentiable. To address this issue and define a new exponential family, let us focus on the cumulant function. We first adopt the convex relaxation p_κ(x) = log(1 + exp{κx})/κ of the plus function, the soft-plus with softness parameter κ. Note that p_κ(x) → x_+ as κ → ∞. Motivated by the property that b(θ) = θ + b(−θ) for both the logistic and SVM cumulants, we define the Soft-SVM canonical parameters and cumulant as

\theta_{i}=p_{\kappa}(\delta+\eta_{i})-p_{\kappa}(\delta-\eta_{i})\doteq f_{\kappa,\delta}(\eta_{i})\quad\text{and}\quad b_{\kappa,\delta}(\theta)=\frac{1}{2}[p_{\kappa}(\theta+2\delta)+p_{\kappa}(\theta-2\delta)], \qquad (3)

where δ is a separation parameter in the cumulant. The parameter δ works together with the softness parameter κ to effectively bridge between logistic and SVM regressions: when κ = 1 and δ = 0 we recover logistic regression, while as κ → ∞ and δ → 1 we obtain SVM regression. This way, b_{1,0} ≡ b and b_{κ,δ} → s pointwise as κ → ∞ and δ → 1. The Soft-SVM penalized log-likelihood is then

\ell_{\kappa,\delta}(\theta;\mathbf{y})\doteq\sum_{i=1}^{n}y_{i}\theta_{i}-b_{\kappa,\delta}(\theta_{i})-\frac{\lambda}{2}\|\beta\|_{2}^{2}=\sum_{i=1}^{n}y_{i}\theta_{i}-b_{\kappa,\delta}(\theta_{i})-\frac{\lambda}{2}\begin{bmatrix}\beta_{0}\\ \beta\end{bmatrix}^{\top}\begin{bmatrix}0&\\ &I_{f}\end{bmatrix}\begin{bmatrix}\beta_{0}\\ \beta\end{bmatrix},

where we make clear that the bias β_0 is not penalized by using the precision matrix 0 ⊕ I_f.
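To make the construction concrete, the following minimal Python sketch (function names are ours, and the naive soft-plus is used for readability) implements the soft-plus, the canonical map f_{κ,δ}, and the cumulant b_{κ,δ}, checking the logistic case κ = 1, δ = 0 and the SVM limit:

import numpy as np

def softplus(x, kappa):
    """Soft-plus p_kappa(x) = log(1 + exp(kappa*x)) / kappa (naive; see Section 2.3 for a stable version)."""
    return np.log1p(np.exp(kappa * np.asarray(x, dtype=float))) / kappa

def f(eta, kappa, delta):
    """Canonical parameter theta = p_kappa(delta + eta) - p_kappa(delta - eta)."""
    return softplus(delta + eta, kappa) - softplus(delta - eta, kappa)

def b(theta, kappa, delta):
    """Soft-SVM cumulant b_{kappa,delta}(theta)."""
    return 0.5 * (softplus(theta + 2 * delta, kappa) + softplus(theta - 2 * delta, kappa))

theta = np.linspace(-4.0, 4.0, 9)
eta = np.linspace(-3.0, 3.0, 7)
# kappa = 1, delta = 0: Bernoulli cumulant and identity canonical map (logistic regression)
assert np.allclose(b(theta, 1.0, 0.0), np.log1p(np.exp(theta)))
assert np.allclose(f(eta, 1.0, 0.0), eta)
# large kappa, delta = 1: approaches the SVM cumulant s(theta) = [(theta+2)_+ + (theta-2)_+] / 2
s = 0.5 * (np.maximum(theta + 2.0, 0.0) + np.maximum(theta - 2.0, 0.0))
print(np.max(np.abs(b(theta, 50.0, 1.0) - s)))  # small: the soft-plus error is at most log(2)/kappa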

While the separation parameter δ has a clear interpretation from the cumulant function, we can also adopt an equivalent alternative parameterization in terms of a scaled separation parameter α ≐ κδ, since the shifted soft-plus can be written as

p_{\kappa}(x+z\delta)=\frac{1}{\kappa}\log\big(1+e^{\kappa(x+z\delta)}\big)=\frac{1}{\kappa}\log\big(1+e^{\kappa x+z\alpha}\big).

This alternative parameterization is more useful when we discuss the variance function in Section 2.2.

The mean function corresponding to b_{κ,δ} is its first derivative,

\mu(\theta)=b_{\kappa,\delta}^{\prime}(\theta)=\frac{1}{2}[\text{expit}\{\kappa(\theta+2\delta)\}+\text{expit}\{\kappa(\theta-2\delta)\}]. \qquad (4)

With v(x) = expit(x)(1 − expit(x)) = e^x/(1 + e^x)², the variance function is

V_{\kappa,\delta}(\mu)=b_{\kappa,\delta}^{\prime\prime}\big([b^{\prime}_{\kappa,\delta}]^{-1}(\mu)\big)=\frac{\kappa}{2}\big[v\big(\kappa([b^{\prime}_{\kappa,\delta}]^{-1}(\mu)+2\delta)\big)+v\big(\kappa([b^{\prime}_{\kappa,\delta}]^{-1}(\mu)-2\delta)\big)\big], \qquad (5)

where the inverse mean function [b'_{κ,δ}]^{-1} is given in the next section. Figure 2 plots the variance function for several values of κ, with the corresponding δ tied to κ through δ = 1 − 1/κ.
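For reference, a small sketch of the mean and variance functions in (4) and (5), written directly in terms of the canonical parameter θ so that the inverse mean function of the next section is not needed (expit is the logistic function; function names are ours):

import numpy as np
from scipy.special import expit  # expit(x) = 1 / (1 + exp(-x))

def mean_fn(theta, kappa, delta):
    """Mean function (4): mu(theta) = [expit(kappa*(theta + 2*delta)) + expit(kappa*(theta - 2*delta))] / 2."""
    return 0.5 * (expit(kappa * (theta + 2 * delta)) + expit(kappa * (theta - 2 * delta)))

def var_fn(theta, kappa, delta):
    """Variance (5) evaluated at mu = b'(theta): (kappa/2) * [v(kappa*(theta + 2*delta)) + v(kappa*(theta - 2*delta))]."""
    v = lambda x: expit(x) * (1.0 - expit(x))
    return 0.5 * kappa * (v(kappa * (theta + 2 * delta)) + v(kappa * (theta - 2 * delta)))

theta = np.linspace(-3.0, 3.0, 7)
print(mean_fn(theta, 5.0, 0.8))  # S-shaped, flat near 0.5 between the two soft kinks
print(var_fn(theta, 5.0, 0.8))   # bimodal in theta, with peaks near theta = -2*delta and theta = 2*delta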

Figure 2: Variance functions for κ ∈ {1, 1.5, 2, 5} and, correspondingly, δ = 1 − 1/κ ∈ {0, 1/3, 1/2, 4/5}.

Finally, to relate the canonical parameters θ_i to the features, the link function g is obtained from the composition f_{κ,δ}^{-1} ∘ [b'_{κ,δ}]^{-1} (see Section 2.1 for details). Figure 1 compares the cumulant and mean functions for Bernoulli/logistic regression, SVM, and Soft-SVM with κ = 5 and δ = 0.8. Note how the Soft-SVM functions match the SVM behavior for more extreme values of the canonical parameter θ and the logistic regression behavior for moderate values of θ, achieving a good compromise as a function of the softness κ.

2.1 Soft-SVM Regression

Given our characterization of the Soft-SVM exponential family distribution, performing Soft-SVM regression is equivalent to fitting a regularized generalized linear model. To this end, we follow a cyclic gradient ascent procedure, alternating between an iteratively re-weighted least squares (IRLS) step for Fisher scoring [7] to update the coefficients β given the softness κ, and then updating κ and α given the current values of β via a Newton-Raphson scoring step.

For the IRLS step we need the link g and the variance function V in (5). The link is given by the composition g = f_{κ,δ}^{-1} ∘ [b'_{κ,δ}]^{-1}, where

[b^{\prime}_{\kappa,\delta}]^{-1}(\mu)=\frac{1}{\kappa}\Bigg\{\frac{1}{2}\log\Big(\frac{\mu}{1-\mu}\Big)+\text{sign}(\mu-\tfrac{1}{2})\,\text{asinh}\big(e^{h_{\kappa,\delta}(\mu)}\big)\Bigg\},

with

h_{\kappa,\delta}(\mu)=\log\cosh(2\kappa\delta)+\log\frac{|\mu-1/2|}{\sqrt{\mu(1-\mu)}},

and

f^{-1}_{\kappa,\delta}(\theta)=\frac{\theta}{2}+\frac{\text{sign}(\theta)}{\kappa}\,\text{asinh}\big(e^{H_{\kappa,\delta}(\theta)}\big)\quad\text{with}\quad H_{\kappa,\delta}(\theta)=-\kappa\delta+\log\sinh\Big(\frac{\kappa|\theta|}{2}\Big).

Note that in all these expressions δ always appears through the product κδ, so it is immediate to use the alternative parameter α instead of δ.
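For concreteness, a hedged sketch of the link computation in the (κ, α) parameterization, composing the two inverse maps above; it uses the naive formulas for readability (the numerically safe versions are in Section 2.3), and the function names are ours:

import numpy as np
from scipy.special import expit

def b_prime_inv(mu, kappa, alpha):
    """Inverse mean function [b']^{-1}(mu), for mu != 1/2 (naive formula)."""
    h = np.log(np.cosh(2.0 * alpha)) + np.log(np.abs(mu - 0.5) / np.sqrt(mu * (1.0 - mu)))
    return (0.5 * np.log(mu / (1.0 - mu)) + np.sign(mu - 0.5) * np.arcsinh(np.exp(h))) / kappa

def f_inv(theta, kappa, alpha):
    """Inverse canonical map f^{-1}(theta), for theta != 0."""
    H = -alpha + np.log(np.sinh(kappa * np.abs(theta) / 2.0))
    return theta / 2.0 + np.sign(theta) * np.arcsinh(np.exp(H)) / kappa

def link(mu, kappa, alpha):
    """Link g(mu) = eta: invert the mean function, then invert the canonical map."""
    return f_inv(b_prime_inv(mu, kappa, alpha), kappa, alpha)

# round-trip check against the forward maps theta = f(eta) and mu = b'(theta)
kappa, alpha = 2.0, 1.0                                        # so delta = alpha / kappa = 0.5
eta = np.array([-1.5, -0.3, 0.7, 2.0])
sp = lambda x: np.log1p(np.exp(kappa * x)) / kappa             # naive soft-plus
theta = sp(alpha / kappa + eta) - sp(alpha / kappa - eta)      # f_{kappa,delta}(eta)
mu = 0.5 * (expit(kappa * theta + 2.0 * alpha) + expit(kappa * theta - 2.0 * alpha))
assert np.allclose(link(mu, kappa, alpha), eta)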

Our algorithm is summarized in Algorithm 1. For notational simplicity, we represent both bias and coefficients in a single vector β and store all features in a design matrix X, including a column of ones for the bias (intercept).

▷ Input: features in an n-by-(f+1) design matrix X (including the intercept column) and labels y.
▷ Parameters: penalty λ, convergence tolerance ϵ, and small perturbation 0 < ν < 1.
▷ Output: estimated regression coefficients β̂ (including the bias β̂_0), shape parameter α̂, and softness parameter κ̂.
▷ Initialization:
Set α^(0) = 1, κ^(0) = 1 and, for i = 1, …, n, μ_i^(0) = (y_i + ν)/(1 + 2ν), a ν-perturbed version of y, and η_i^(0) = g_{κ^(0)}(μ_i^(0));
▷ Cyclic scoring iteration:
for t = 0, 1, … (until convergence) do
       ▷ κ-step:
       Update κ^(t+1) = κ^(t) − [∂²ℓ_{κ,α}/∂κ²]^{-1}(∂ℓ_{κ,α}/∂κ) via Newton's method;
       ▷ α-step:
       Update α^(t+1) = α^(t) − Σ_{i=1}^{n}[y_i f'_{κ,α} − b'_{κ,α}] / Σ_{i=1}^{n}[y_i f''_{κ,α} − b''_{κ,α}] via Newton's method;
       ▷ β-step:
       Compute weights w_i = y_i f''_{κ,α}(η_i) − f''_{κ,α}(η_i) b'_{κ,α}(θ_i) − (f'_{κ,α}(η_i))² b''_{κ,α}(θ_i) and set W^(t) = Diag_{i=1,…,n}{w_i^(t)};
       Update β^(t+1) by solving (X^⊤W^(t)X + λ(0 ⊕ I_f)) β^(t+1) = X^⊤W^(t)η^(t) − X^⊤Diag_{i=1,…,n}{f'_{κ,α}(η_i)}(y − μ(β^(t)));
       Set η^(t+1) = Xβ^(t+1), with μ(β^(t)) = b'_{κ,α}(θ) and θ = [θ_1, θ_2, …, θ_n]^⊤;
       if |ℓ_{(κ,α)^(t+1)}(η^(t+1); y) − ℓ_{(κ,α)^(t)}(η^(t); y)| / |ℓ_{(κ,α)^(t)}(η^(t); y)| < ϵ then
              break
return β^(t), α^(t), and κ^(t).
Algorithm 1: Soft-SVM Regression
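For illustration, a minimal sketch of the β-step solve, assuming callbacks f, f1, f2, b1, b2 for f_{κ,α} and the derivatives f′, f″, b′, b″; it uses the equivalent positive-weight (negated Hessian) form of the penalized Newton update, so the signs differ superficially from the display in the algorithm:

import numpy as np

def beta_step(X, y, eta, lam, f, f1, f2, b1, b2):
    """One penalized Newton/IRLS update for beta (a sketch, not the authors' implementation).

    X is the n-by-(f+1) design matrix with a leading column of ones (unpenalized intercept);
    f, f1, f2 evaluate f_{kappa,alpha} and its first two derivatives at eta, and b1, b2
    evaluate the first and second derivatives of b_{kappa,alpha} at theta.
    """
    theta = f(eta)
    mu = b1(theta)
    # negated per-observation Hessian terms: -(y*f'' - f''*b' - (f')^2*b''), nonnegative near the optimum
    w = -(y * f2(eta) - f2(eta) * mu - f1(eta) ** 2 * b2(theta))
    P = np.eye(X.shape[1])
    P[0, 0] = 0.0                                    # 0 (+) I_f: do not penalize the intercept
    lhs = X.T @ np.diag(w) @ X + lam * P
    rhs = X.T @ (w * eta) + X.T @ (f1(eta) * (y - mu))
    return np.linalg.solve(lhs, rhs)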

2.2 Soft-SVM Variance Function

Since h_{κ,δ}(μ) = h_α(μ) = log cosh(2α) + log(|μ − 1/2|/√(μ(1 − μ))), we have that

\kappa\big([b^{\prime}_{\kappa,\delta}]^{-1}(\mu)\pm 2\delta\big)=\frac{1}{2}\log\Big(\frac{\mu}{1-\mu}\Big)+\text{sign}(\mu-\tfrac{1}{2})\,\text{asinh}\big(e^{h_{\alpha}(\mu)}\big)\pm 2\alpha

and so the variance function in (5) can be written as V_{κ,α}(μ) = κ · r_α(μ) where, with u_α(μ) ≐ logit(μ)/2 + sign(μ − 1/2) asinh(exp{h_α(μ)}),

r_{\alpha}(\mu)=\frac{1}{2}\Big\{v\big(u_{\alpha}(\mu)+2\alpha\big)+v\big(u_{\alpha}(\mu)-2\alpha\big)\Big\}.

This way, the parameters κ and α have distinct effects on the variance function: the softness parameter κ acts as a dispersion, capturing the scale of V(μ), while the scaled separation α controls the shape of V(μ).

Figure 3: Top row has variance-mean plots when the scale parameter κ is fixed at 1 and α ∈ {0, α*, 1, 2, 5}, where α* ≈ 0.66 is the critical value at which the peaks start separating. Bottom row has histograms for the corresponding distributions of mean values.

Figure 3 shows variance function plots for the ESL mixture dataset discussed in Section 2.4 when the penalty parameter λ and the scale parameter κ are both fixed at 1 and only the shape parameter α is allowed to increase from 0 to 5. We can see that as α increases, the overall variance decreases and the curve gradually splits from a unimodal into a bimodal shape. In practice, we observe that as α increases to achieve higher mode separation, most observations are pushed to have fitted μ values around 0.5 and variance weights close to zero, caught in between the variance peaks.

We can then classify observations with respect to their mean and variance values: points with high variance weights have a large influence on the fit and thus on the decision boundary, behaving like “soft” support vectors; points with low variance weights and means close to 0.5 are close to the boundary but do not influence it, belonging thus to a “dead zone” for the fit; finally, points with low variance weights and large max{μ, 1 − μ} are inliers and also have low influence on the fit. This classification is illustrated in Figure 4.
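A hedged sketch of this three-way classification; the cut-offs below (variance weight of at least 1, as in Figure 4, and mean within 0.1 of 0.5) are illustrative choices rather than values prescribed by the method:

import numpy as np

def classify_points(mu_hat, var_hat, var_cut=1.0, mid_band=0.1):
    """Label observations as 'soft support vector', 'dead zone', or 'inlier' (illustrative thresholds)."""
    return np.where(var_hat >= var_cut, "soft support vector",
                    np.where(np.abs(mu_hat - 0.5) <= mid_band, "dead zone", "inlier"))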

Figure 4: Classification results from soft-SVM when the scale parameter κ is fixed at 1 and α equals 0, 1, 1.2, 1.5, 2, and 5. The colors represent a gradient from blue (μ̂_i = 0) to orange (μ̂_i = 1). Filled dots have V(μ̂_i) ≥ 1 while hollow dots have V(μ̂_i) < 1.

2.3 Handling numerical issues

As κ increases we should anticipate numerical issues since b_κ and f_κ become non-differentiable up to machine precision. For this reason, we need to adopt numerically stable versions of asinh(e^x), log cosh(x), and log sinh(x), needed by g_κ and thus also by V_κ above.

The first step is a numerically stable version of the Bernoulli cumulant; with ϵ the machine precision, we adopt

\text{log1pe}(x)\doteq\log(1+e^{x})=\left\{\begin{array}{rcl}x,&\mbox{if}&x>-\log\epsilon\\ x+\log(1+e^{-x}),&\mbox{if}&0<x\leq-\log\epsilon\\ \log(1+e^{x}),&\mbox{if}&\log\epsilon\leq x\leq 0\\ 0,&\mbox{if}&x<\log\epsilon.\end{array}\right.

This way, the soft-plus can be reliably computed as p_κ(x) = log1pe(κx)/κ. Again exploiting domain partitions and the parity of cosh, we can define

\log\cosh(x)=\left\{\begin{array}{rcl}|x|-\log 2,&\mbox{if}&|x|>-\log\sqrt{\epsilon}\\ |x|-\log 2+\text{log1pe}(-2|x|),&\mbox{if}&|x|\leq-\log\sqrt{\epsilon}.\end{array}\right.

We branch twice for asinh(e^x): we first define

\gamma(x)=\left\{\begin{array}{rcl}1,&\mbox{if}&|x|>-\log\sqrt{\epsilon}\\ \sqrt{1+e^{-2|x|}},&\mbox{if}&|x|\leq-\log\sqrt{\epsilon},\end{array}\right.

and then

\text{asinh}(e^{x})=\left\{\begin{array}{rcl}x+\log(1+\gamma(x)),&\mbox{if}&x>0\\ \log(e^{x}+\gamma(x)),&\mbox{if}&x\leq 0.\end{array}\right.

Finally, we write

\log\sinh(x)=\left\{\begin{array}{rcl}\log\frac{1}{2}+x+\log(1-e^{-2x}),&\mbox{if}&x>0\\ \log\frac{1}{2}-x+\log(1-e^{2x}),&\mbox{if}&x\leq 0\end{array}\right.

to make it stable whether x is greater than or smaller than 0.
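The stable branches above translate directly into vectorized code; a minimal sketch (function names are ours):

import numpy as np

EPS = np.finfo(float).eps  # machine precision

def log1pe(x):
    """Stable log(1 + e^x), branching on the magnitude of x as in the text."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    big = x > -np.log(EPS)                                  # e^{-x} below machine precision
    pos = (x > 0) & ~big
    out[big] = x[big]
    out[pos] = x[pos] + np.log1p(np.exp(-x[pos]))
    out[~big & ~pos] = np.log1p(np.exp(x[~big & ~pos]))     # covers the last two branches
    return out

def softplus(x, kappa):
    """p_kappa(x) = log1pe(kappa * x) / kappa."""
    return log1pe(kappa * np.asarray(x, dtype=float)) / kappa

def logcosh(x):
    """Stable log cosh(x) = |x| - log 2 + log1pe(-2|x|), using the parity of cosh."""
    x = np.abs(np.asarray(x, dtype=float))
    return x - np.log(2.0) + log1pe(-2.0 * x)

def asinh_exp(x):
    """Stable asinh(e^x) = log(e^x + sqrt(1 + e^{2x}))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    gamma = np.where(np.abs(x) > -0.5 * np.log(EPS), 1.0,
                     np.sqrt(1.0 + np.exp(-2.0 * np.abs(x))))
    out = np.empty_like(x)
    pos = x > 0
    out[pos] = x[pos] + np.log1p(gamma[pos])
    out[~pos] = np.log(np.exp(x[~pos]) + gamma[~pos])
    return out

def logsinh(x):
    """Stable log sinh(|x|) = -log 2 + |x| + log(1 - e^{-2|x|}), for x != 0."""
    x = np.abs(np.asarray(x, dtype=float))
    return -np.log(2.0) + x + np.log1p(-np.exp(-2.0 * x))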

2.4 Practical example: ESL mixture

To examine how Soft-SVM regression performs for classification in comparison with SVM and logistic regression, we carry out a detailed example using the simulated mixture data in [5], consisting of two features, X_1 and X_2. To estimate the complexity penalty λ we use ten-fold cross-validation with 20 replications for all methods. The results are summarized in Figure 5. Because κ̂ is not much larger than one, the fitted curve resembles a logistic curve; however, the softness effect can be seen in the right panel as points within the soft margin have estimated probabilities close to 0.5, with fitted values quickly increasing as points move farther away from the decision boundary. To classify, we simply take a consensus rule, ŷ_i = I(μ̂_i > 0.5), with I the indicator function.

Figure 5: Soft-SVM regression on the ESL mixture data: fitted values ŷ (left) and decision boundary in feature space (right). Points are shaped by predicted classification, colored by observed labels, and have transparency proportional to |ŷ_i − 0.5|. Dashed lines mark δ = α/κ in the fitted value plot and the soft margin M = δ/‖β̂‖_2 in the feature space plot.

Figure 6 shows the effect of regularization on the cross-validation performance of Soft-SVM regression, as measured by the Matthews correlation coefficient (MCC). We adopt MCC as a metric because it provides a more balanced evaluation of all possible outcomes [3]. As can be seen in the right panel, which compares MCC across Soft-SVM, SVM, and logistic regressions, the performances are quite comparable; Soft-SVM regression seems to have a slight advantage on average, however, and logistic regression shows more variable performance.

Figure 6: Cross-validation results (replication averages) for Soft-SVM using MCC (left) and performance comparisons to SVM and logistic regression (right).

3 Experiments

3.1 Simulation study

Prior works have shown that the performance of SVM can be severely affected when applied to unbalanced datasets since the margin obtained is biased towards the minority class [1, 2]. In this section, we conduct a simulation study to investigate these effects on Soft-SVM regression.

We consider a simple data simulation setup with two features. There are n observations, partitioned into two sets with sizes n_1 = ⌊ρn⌋ and n_2 = n − n_1, for an imbalance parameter 0 < ρ ≤ 0.5. We expect SVM to perform worse as ρ gets smaller and the sets become more unbalanced. Observations in the first set have labels y_i = 0 and x_i ∼ N(μ_1, σI_2), while the n_2 observations in the second set have y_i = 1 and x_i ∼ N(μ_2, σI_2). We set μ_1 = (√2, 1) and μ_2 = (0, 1 + √2), so that the decision boundary is X_2 = X_1 + 1 and the distance from either μ_1 or μ_2 to the boundary is 1. The parameter σ acts as a measure of separability: for σ < 1 we should see stronger performance from the classification methods, but as σ becomes too small we expect complete separability to challenge logistic regression.
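A sketch of one replication of this design, under our reading that σ enters N(μ, σI_2) as the covariance scale:

import numpy as np

def simulate(n=100, rho=0.25, sigma=1.0, rng=None):
    """Two-Gaussian design: n1 = floor(rho*n) points with label 0 around mu1 and n2 = n - n1
    points with label 1 around mu2, both with covariance sigma * I_2."""
    rng = np.random.default_rng() if rng is None else rng
    n1 = int(np.floor(rho * n))
    n2 = n - n1
    mu1 = np.array([np.sqrt(2.0), 1.0])          # distance 1 from the boundary X2 = X1 + 1
    mu2 = np.array([0.0, 1.0 + np.sqrt(2.0)])    # distance 1 on the other side
    X = np.vstack([rng.multivariate_normal(mu1, sigma * np.eye(2), size=n1),
                   rng.multivariate_normal(mu2, sigma * np.eye(2), size=n2)])
    y = np.concatenate([np.zeros(n1), np.ones(n2)])
    return X, y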

For the simulation study we take n = 100 and simulate 50 replications using a factorial design with ρ ∈ {0.125, 0.25, 0.5} and σ ∈ {0.5, 1, 1.5}. For the SVM and Soft-SVM classification methods we estimate the complexity penalty λ using ten-fold cross-validation, and we use standard, non-regularized logistic regression. We report performance using MCC and summarize the results in Figure 7.

Figure 7: Simulation study results comparing MCC for Soft-SVM, SVM, and logistic regressions for ρ ∈ {0.125, 0.25, 0.5} and σ ∈ {0.5, 1, 1.5}.

The performance across methods is comparable, uniformly degrading as we boost confounding by decreasing ρ\rho and increasing σ\sigma, but differs in more pathological corner cases. The combination of small ρ\rho values (higher imbalance) and large σ\sigma values (less separability) results in a poorer performance from SVM, as previously reported [2]. Small σ\sigma values, on the other hand, result in lack of convergence and inflated coefficient estimates in logistic regression fits, even though predicted classification is not much affected. Soft-SVM classification, however, avoids both issues, exhibiting stable estimation and good performance in all cases.

3.2 Case studies

To evaluate the empirical performance of Soft-SVM regression and classification, we select nine well-known datasets from the UCI machine learning repository [4] with varied numbers of observations and features. Table 1 lists the datasets and their attributes. Not all datasets are originally meant for binary classification; the Abalone dataset, for instance, has three classes (male, female, and infant), so we only take male and female observations and exclude infants. The red and white wine quality datasets originally have ten quality levels, labeled from low to high. We dichotomize these levels into a low quality class (quality levels between 1 and 5, inclusive) and a high quality class (quality levels 6 to 10). All remaining datasets have two classes.
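A hedged sketch of these label constructions; the file names and column names ("Sex", "quality") are assumptions about the UCI files, not taken from the paper:

import pandas as pd

# Abalone: keep only male and female observations (the 0/1 coding of the two classes is arbitrary here)
abalone = pd.read_csv("abalone.csv")
abalone = abalone[abalone["Sex"].isin(["M", "F"])]
abalone["y"] = (abalone["Sex"] == "M").astype(int)

# Wine quality: dichotomize into low (1-5) and high (6-10) quality classes
wine = pd.read_csv("winequality-red.csv", sep=";")
wine["y"] = (wine["quality"] >= 6).astype(int)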

Table 1: Selected case study datasets from the UCI repository.

UCI dataset                      # of observations (n)   # of features (f)
Abalone (only F & M classes)     4177                    8
Australian credit                690                     14
Breast cancer                    569                     30
Haberman                         306                     3
Heart disease                    303                     13
Liver disorder                   345                     6
Pima Indians diabetes            768                     8
Red wine quality (amended)       1599                    11
White wine quality (amended)     4898                    11

As in the simulation study, we assess performance using MCC from 50 ten-fold cross-validation replications. For each replication, we estimate the penalty λ for each classification method, again using ten-fold cross-validation. Figure 8 shows the results.
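A sketch of this evaluation loop, with the inner cross-validation for λ omitted; fit and predict stand in for any of the three classifiers, and the scikit-learn helpers are one convenient way to build the folds and compute MCC:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import matthews_corrcoef

def repeated_cv_mcc(fit, predict, X, y, n_reps=50, n_folds=10, seed=0):
    """Average MCC over n_reps replications of stratified n_folds-fold cross-validation."""
    scores = []
    for rep in range(n_reps):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + rep)
        for train, test in cv.split(X, y):
            model = fit(X[train], y[train])      # penalty selection would nest another CV here
            scores.append(matthews_corrcoef(y[test], predict(model, X[test])))
    return float(np.mean(scores))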

Figure 8: Performance comparison of MCC for Soft-SVM, SVM, and logistic regressions on nine UCI datasets over 50 ten-fold cross-validation replications.

In general, SVM classification performs worse on “tough” datasets with low MCC performance, such as Abalone and Haberman, probably due to poor selection of penalty parameters. On the other hand, SVM compares favorably to Soft-SVM and logistic regressions on the Heart dataset, being more robust when many features seem to be uninformative of classification labels. Overall, Soft-SVM regression performs comparably to the best of SVM and logistic regression even on hard datasets, as was observed in the simulation study. Interestingly, it has a clear advantage over both SVM and logistic regressions in cases where a few features are very informative, such as the Abalone and Australian credit datasets. It often has an edge over logistic regression on over-dispersed datasets such as the Breast cancer and White wine datasets. It seems that Soft-SVM regression can better accommodate these cases via higher dispersion at moderate values of softness, i.e. κ < 2, or via better use of regularization for increased robustness and generalizability.

4 Conclusion

We have presented a new binary classification and regression method based on a new exponential family distribution, Soft-SVM. Soft-SVM arises from extending a convex relaxation of the SVM hinge loss with a softness parameter, in effect achieving a smooth transformation between the binomial deviance and SVM hinge losses. Soft-SVM regression then entails fitting a generalized linear model, and so it is comparable to logistic regression in terms of computational complexity, having just a small overhead to fit the softness parameter. Moreover, since many points might have low variance weights by not being “soft” support vectors—either being dead-zoners or inliers, by our classification in Section 2.2—they can be selected out of the fit to achieve significant computational savings.

More importantly, as we have shown in a simulation study and multiple case studies, Soft-SVM classification performs comparably to the best of SVM and logistic regressions, often outperforming them, even if slightly, on feature-rich datasets. This advantage in performance can be attributed to the higher robustness afforded by the softness and separation parameters, enabling the model to accommodate over-dispersed datasets well and to adequately address separability issues via regularization. By bridging the SVM hinge and binomial deviance loss functions, it seems that Soft-SVM regression is able to combine robustness and interpretability, the advantages of SVM and logistic regression, respectively, freeing the practitioner from the burden of selecting one or the other.

For future work, we intend to develop data sampling procedures and adopt a Bayesian formulation to then sample posterior estimates of coefficients, softness and separation parameters. Another natural direction of work is to incorporate kernels, as it is common in SVM regression. Finally, we also expect to extend this formulation to multinomial classification [10], hoping again to bring some robustness from SVM while keeping interpretability.

References

  • Bernau et al. [2014] Christoph Bernau, Markus Riester, Anne-Laure Boulesteix, Giovanni Parmigiani, Curtis Huttenhower, Levi Waldron, and Lorenzo Trippa. Cross-study validation for the assessment of prediction algorithms. Bioinformatics, 30(12):i105–i112, 2014.
  • Cervantes et al. [2020] Jair Cervantes, Farid Garcia-Lamont, Lisbeth Rodríguez-Mazahua, and Asdrubal Lopez. A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing, 408:189–215, 2020.
  • Chicco and Jurman [2020] Davide Chicco and Giuseppe Jurman. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1):1–13, 2020.
  • Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition, 2009.
  • Lin [2002] Yi Lin. Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, 6(3):259–275, 2002.
  • McCullagh and Nelder [1989] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, second edition, 1989.
  • Moguerza et al. [2006] Javier M Moguerza, Alberto Muñoz, et al. Support vector machines with applications. Statistical Science, 21(3):322–336, 2006.
  • Vapnik [2013] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 2013.
  • Zhu and Hastie [2005] Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14(1):185–205, 2005.

Appendix A Appendix

A.1 [b'_κ]^{-1}(μ)

\begin{aligned}
[b^{\prime}_{\kappa}]^{-1}(\mu) &= \frac{1}{\kappa}\log\Bigg\{\frac{-(2\mu-1)(e^{-2\kappa\delta(\kappa)}+e^{2\kappa\delta(\kappa)})}{2(2\mu-2)}+\sqrt{\bigg[\frac{-(2\mu-1)(e^{-2\kappa\delta(\kappa)}+e^{2\kappa\delta(\kappa)})}{2(2\mu-2)}\bigg]^{2}+\frac{\mu}{1-\mu}}\Bigg\} \\
&= \frac{1}{\kappa}\log\Bigg\{\sqrt{\frac{\mu}{1-\mu}}\Bigg(\frac{(\mu-\frac{1}{2})\frac{1}{2}(e^{-2\kappa\delta(\kappa)}+e^{2\kappa\delta(\kappa)})}{\sqrt{\mu(1-\mu)}}+\sqrt{\bigg[\frac{(\mu-\frac{1}{2})\frac{1}{2}(e^{-2\kappa\delta(\kappa)}+e^{2\kappa\delta(\kappa)})}{\sqrt{\mu(1-\mu)}}\bigg]^{2}+1}\Bigg)\Bigg\} \\
&= \left\{\begin{array}{rcl}\frac{1}{\kappa}\big[\frac{1}{2}\log\frac{\mu}{1-\mu}+\log\big(e^{h_{\kappa}(\mu)}+\sqrt{e^{2h_{\kappa}(\mu)}+1}\big)\big],&\mbox{if}&\mu>\frac{1}{2}\\ \frac{1}{\kappa}\big[\frac{1}{2}\log\frac{\mu}{1-\mu}+\log\big(-e^{h_{\kappa}(\mu)}+\sqrt{e^{2h_{\kappa}(\mu)}+1}\big)\big],&\mbox{if}&\mu\leq\frac{1}{2}\end{array}\right. \\
&= \frac{1}{\kappa}\Big\{\frac{1}{2}\log\Big(\frac{\mu}{1-\mu}\Big)+\text{sign}(\mu-\tfrac{1}{2})\,\text{asinh}\big(e^{h_{\kappa}(\mu)}\big)\Big\},
\end{aligned}

with

h_{\kappa}(\mu)=\log\Bigg\{\frac{|\mu-\frac{1}{2}|}{\sqrt{\mu(1-\mu)}}\cdot\frac{1}{2}\big(e^{-2\kappa\delta(\kappa)}+e^{2\kappa\delta(\kappa)}\big)\Bigg\}=\log\cosh(2\kappa\delta(\kappa))+\log\frac{|\mu-1/2|}{\sqrt{\mu(1-\mu)}}.

A.2 f_κ^{-1}(θ)

\begin{aligned}
f^{-1}_{\kappa}(\theta) &= \frac{1}{\kappa}\log\Bigg\{\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}}+\sqrt{e^{\kappa\theta}+\Big(\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}}\Big)^{2}}\Bigg\} \\
&= \frac{1}{\kappa}\log\Bigg\{(e^{\kappa\theta})^{\frac{1}{2}}\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}(e^{\kappa\theta})^{\frac{1}{2}}}+(e^{\kappa\theta})^{\frac{1}{2}}\sqrt{1+\Big(\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}(e^{\kappa\theta})^{\frac{1}{2}}}\Big)^{2}}\Bigg\} \\
&= \frac{1}{\kappa}\log\big\{(e^{\kappa\theta})^{\frac{1}{2}}\big\}+\frac{1}{\kappa}\log\Bigg\{\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}(e^{\kappa\theta})^{\frac{1}{2}}}+\sqrt{1+\Big(\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}(e^{\kappa\theta})^{\frac{1}{2}}}\Big)^{2}}\Bigg\} \\
&= \left\{\begin{array}{rcl}\frac{\theta}{2}+\frac{1}{\kappa}\log\big(e^{H_{\kappa}(\theta)}+\sqrt{e^{2H_{\kappa}(\theta)}+1}\big),&\mbox{if}&\theta>0\\ \frac{\theta}{2}+\frac{1}{\kappa}\log\big(-e^{H_{\kappa}(\theta)}+\sqrt{e^{2H_{\kappa}(\theta)}+1}\big),&\mbox{if}&\theta\leq 0\end{array}\right. \\
&= \frac{\theta}{2}+\frac{\text{sign}(\theta)}{\kappa}\,\text{asinh}\big(e^{H_{\kappa}(\theta)}\big),
\end{aligned}

where

H_{\kappa}(\theta)=\log\Bigg\{e^{-\kappa\delta(\kappa)}\cdot\frac{|e^{\kappa\theta}-1|}{2(e^{\kappa\theta})^{\frac{1}{2}}}\Bigg\}=-\kappa\delta(\kappa)+\log\sinh\Big(\frac{\kappa|\theta|}{2}\Big),

and

\log\sinh(x)=\log\Big(\frac{e^{2x}-1}{2e^{x}}\Big)=\left\{\begin{array}{rcl}\log\frac{1}{2}+x+\log(1-e^{-2x}),&\mbox{if}&x>0\\ \log\frac{1}{2}-x+\log(1-e^{2x}),&\mbox{if}&x\leq 0.\end{array}\right.

A.3 V_κ(μ)

V_{\kappa}(\mu)=b_{\kappa}^{\prime\prime}(\theta)=b_{\kappa}^{\prime\prime}\big([b^{\prime}_{\kappa}]^{-1}(\mu)\big),

\begin{aligned}
b_{\kappa}^{\prime\prime}(\theta) &= \frac{\partial^{2}}{\partial\theta^{2}}\Bigg\{\frac{1}{2}\Bigg[\frac{\log(1+e^{\kappa(\theta-2\delta(\kappa))})}{\kappa}+\frac{\log(1+e^{\kappa(\theta+2\delta(\kappa))})}{\kappa}\Bigg]\Bigg\} \\
&= \frac{\partial}{\partial\theta}\Bigg\{\frac{1}{2}\Bigg[\frac{e^{\kappa(\theta-2\delta(\kappa))}}{1+e^{\kappa(\theta-2\delta(\kappa))}}+\frac{e^{\kappa(\theta+2\delta(\kappa))}}{1+e^{\kappa(\theta+2\delta(\kappa))}}\Bigg]\Bigg\} \\
&= \frac{1}{2}\Bigg[\frac{\kappa e^{\kappa(\theta-2\delta(\kappa))}}{(1+e^{\kappa(\theta-2\delta(\kappa))})^{2}}+\frac{\kappa e^{\kappa(\theta+2\delta(\kappa))}}{(1+e^{\kappa(\theta+2\delta(\kappa))})^{2}}\Bigg] \\
&= \frac{\kappa}{2}\Big\{v\big[\kappa(\theta-2\delta(\kappa))\big]+v\big[\kappa(\theta+2\delta(\kappa))\big]\Big\},
\end{aligned}

with \delta(\kappa)=1-\frac{1}{\kappa} and v(x)=\frac{e^{x}}{(1+e^{x})^{2}}.

A.4 β-step

\ell=\sum_{i=1}^{n}y_{i}\theta_{i}-b_{\kappa}(\theta_{i}),\qquad\theta_{i}=f_{\kappa}(\eta_{i}),\qquad\eta_{i}=X_{i}^{\top}\beta=\beta_{0}+x_{i1}\beta_{1}+\ldots+x_{ip}\beta_{p},
U(\beta_{j})=\frac{\partial\ell}{\partial\beta_{j}}=\sum_{i=1}^{n}\frac{\partial\big(y_{i}\theta_{i}-b_{\kappa}(\theta_{i})\big)}{\partial\theta_{i}}\cdot\frac{\partial\theta_{i}}{\partial\beta_{j}}=\sum_{i=1}^{n}\big(y_{i}-b^{\prime}_{\kappa}(\theta_{i})\big)f^{\prime}_{\kappa}(\eta_{i})X_{ij}=\sum_{i=1}^{n}f^{\prime}_{\kappa}(\eta_{i})X_{ij}\big(y_{i}-b^{\prime}_{\kappa}(\theta_{i})\big),\qquad j=0,1,\ldots,p,

so that, stacking the entries,

U(\beta)=\frac{\partial\ell}{\partial\beta}=\begin{bmatrix}U(\beta_{0})\\ U(\beta_{1})\\ \vdots\\ U(\beta_{p})\end{bmatrix}=X^{\top}\text{Diag}_{i=1,\ldots,n}\{f_{\kappa}^{\prime}(\eta_{i})\}\big(\mathbf{y}-\bm{\mu}(\bm{\theta})\big),\quad\text{where}\quad\bm{\mu}(\bm{\theta})=\begin{bmatrix}b^{\prime}_{\kappa}(\theta_{1})\\ b^{\prime}_{\kappa}(\theta_{2})\\ \vdots\\ b^{\prime}_{\kappa}(\theta_{n})\end{bmatrix},

and

H(\beta)=\frac{\partial U(\beta)}{\partial\beta^{\top}}.

For each row j, j = 0, 1, 2, …, p,

\begin{aligned}
\frac{\partial}{\partial\beta^{\top}}\sum_{i=1}^{n}f^{\prime}_{\kappa}(\eta_{i})\,X_{ij}\,\big(y_{i}-b^{\prime}_{\kappa}(\theta_{i})\big) &= \frac{\partial}{\partial\beta^{\top}}\sum_{i=1}^{n}f^{\prime}_{\kappa}(\eta_{i})X_{ij}y_{i}-\frac{\partial}{\partial\beta^{\top}}\sum_{i=1}^{n}f^{\prime}_{\kappa}(\eta_{i})X_{ij}b^{\prime}_{\kappa}(\theta_{i}) \\
&= \Big[\textstyle\sum_{i=1}^{n}X_{ij}\big(y_{i}f^{\prime\prime}_{\kappa}(\eta_{i})-f^{\prime\prime}_{\kappa}(\eta_{i})b^{\prime}_{\kappa}(\theta_{i})-(f^{\prime}_{\kappa}(\eta_{i}))^{2}b^{\prime\prime}_{\kappa}(\theta_{i})\big)X_{ik}\Big]_{k=0,1,\ldots,p}.
\end{aligned}

Therefore, collecting the rows,

H(\beta)=X^{\top}\text{Diag}_{i=1,\ldots,n}\{w_{i}\}X=X^{\top}W(\beta)X,\qquad w_{i}\doteq y_{i}f^{\prime\prime}_{\kappa}(\eta_{i})-f^{\prime\prime}_{\kappa}(\eta_{i})b^{\prime}_{\kappa}(\theta_{i})-(f^{\prime}_{\kappa}(\eta_{i}))^{2}b^{\prime\prime}_{\kappa}(\theta_{i}),\quad W(\beta)\doteq\text{Diag}_{i=1,\ldots,n}\{w_{i}\}.

Since by Newton’s method we have:

\beta^{(t+1)}\approx\beta^{(t)}-[H^{(t)}]^{-1}U(\beta^{(t)})\;\Rightarrow\;H^{(t)}\beta^{(t+1)}\approx H^{(t)}\beta^{(t)}-U(\beta^{(t)}),

and by adding the penalty term and plugging in the values of H and U, we can update β^(t+1) by solving the following matrix equation:

\big(X^{\top}W^{(t)}X+\lambda(0\oplus I_{f})\big)\beta^{(t+1)}=X^{\top}W^{(t)}\eta^{(t)}-X^{\top}\text{Diag}_{i=1,\ldots,n}\{f_{\kappa}^{\prime}(\eta_{i})\}\big(\mathbf{y}-\bm{\mu}(\beta^{(t)})\big),\qquad\eta^{(t)}=X\beta^{(t)}.