
Soft-SVM Regression For Binary Classification

Man Huang
Department of Mathematics & Statistics
Boston University
Boston, MA 02215
[email protected]
Luis Carvalho
Department of Mathematics & Statistics
Boston University
Boston, MA 02215
[email protected]
Abstract

The binomial deviance and the SVM hinge loss functions are two of the most widely used loss functions in machine learning. While there are many similarities between them, they also have their own strengths when dealing with different types of data. In this work, we introduce a new exponential family based on a convex relaxation of the hinge loss function using softness and class-separation parameters. This new family, denoted Soft-SVM, allows us to prescribe a generalized linear model that effectively bridges between logistic regression and SVM classification. This new model is interpretable and avoids data separability issues, attaining good fitting and predictive performance by automatically adjusting for data label separability via the softness parameter. These results are confirmed empirically through simulations and case studies as we compare regularized logistic, SVM, and Soft-SVM regressions and conclude that the proposed model performs well in terms of both classification and prediction errors.

1 Introduction

Binary classification has a long history in supervised learning, arising in multiple applications to practically all domains of science and motivating the development of many methods, including logistic regression, k-nearest neighbors, decision trees, and support vector machines [5]. Logistic regression, in particular, provides a useful statistical framework for this class of problems, prescribing a parametric formulation in terms of features and yielding interpretable results [7]. It suffers, however, from “complete separation” data issues, that is, when the classes are separated by a hyperplane in feature space. Support vector machines (SVMs), in contrast, aim at finding such a separating hyperplane that maximizes its margin to a subset of observations, the support vectors [9, 8]. While SVMs are robust to complete separation issues, they are not directly interpretable [6]; for instance, logistic regression provides class probabilities, but SVMs rely on post-processing to compute them, often using Platt scaling, that is, adopting a logistic transformation.

Both SVM and logistic regression, as usual in statistical learning methods, can be expressed as optimization problems with a data fitting loss function and a model complexity penalty loss. In Section 2 we adopt this formulation to establish a new loss function, controlled by a convex relaxation “softness” parameter, that encompasses both methods. Adopting then a generalized linear model formulation, we further explore this new loss to propose a new exponential family, Soft-SVM. Given that this new regularized regression setup features two extra parameters, we expect it to be more flexible and robust, in effect bridging the performance of SVM and logistic regressions and addressing their drawbacks. We empirically validate this assessment in Section 3 with a simulation study and case studies. We show, in particular, that the new method performs well, comparably to the best of SVM and logistic regressions, and at a computational cost similar to the smaller cost of logistic regression.

2 The Soft-SVM Family

Figure 1: Cumulant (left) and mean (right) functions for logistic, SVM, and Soft-SVM(5, 0.8).

Suppose we have n observations with f features x_i and labels y_i ∈ {0, 1}, for i = 1, …, n. In logistic regression we aim to find a set of coefficients β_0 and β that maximizes, with θ_i ≐ β_0 + x_i^⊤β,

\ell(\theta;\mathbf{y})=\sum_{i=1}^{n}y_{i}\theta_{i}-b(\theta_{i}), \qquad (1)

the log-likelihood (up to a constant). The parameters θ_i are canonical for the Bernoulli family and thus coincide with the linear predictor η_i = β_0 + x_i^⊤β; the function b is the Bernoulli cumulant function, given by b(θ) = log(1 + e^θ). Similarly, in SVM regression, the goal is to maximize, again with respect to β_0 and β, the negative cost

\sum_{i=1}^{n}-[1-(2y_{i}-1)\eta_{i}]_{+}-\frac{\lambda}{2}\|\beta\|_{2}^{2}=\sum_{i=1}^{n}y_{i}[(1+\eta_{i})_{+}-(1-\eta_{i})_{+}]-(1+\eta_{i})_{+}-\frac{\lambda}{2}\|\beta\|_{2}^{2}, \qquad (2)

where x_+ = max{0, x} is the plus function and λ is a penalty parameter. The data fitting loss per observation is the hinge loss [1 − (2y_i − 1)η_i]_+. The quadratic complexity penalty here arises from a Lagrange multiplier version of the original max-margin support vector formulation [5]. The negative loss in (2) can then be similarly expressed as a log-likelihood in (1) by setting the canonical parameters and SVM cumulant function s as

\theta_{i}=(1+\eta_{i})_{+}-(1-\eta_{i})_{+}\quad\text{and}\quad s(\theta)=(1+\eta)_{+}=\frac{1}{2}[(\theta+2)_{+}+(\theta-2)_{+}].
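A case-by-case check over the three regimes of η confirms the identity for s, using the corresponding value of θ in each case:

s(\theta)=\frac{1}{2}[(\theta+2)_{+}+(\theta-2)_{+}]=\left\{\begin{array}{lcl}\frac{1}{2}[0+0]=0=(1+\eta)_{+},&\mbox{if}&\eta\leq-1\ (\theta=\eta-1),\\ \frac{1}{2}[(2\eta+2)+0]=1+\eta=(1+\eta)_{+},&\mbox{if}&|\eta|<1\ (\theta=2\eta),\\ \frac{1}{2}[(\eta+3)+(\eta-1)]=1+\eta=(1+\eta)_{+},&\mbox{if}&\eta\geq 1\ (\theta=\eta+1).\end{array}\right.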

This, however, does not correspond to a valid exponential family specification since neither function is differentiable. To address this issue and define a new exponential family, let us focus on the cumulant function. We first adopt the convex relaxation p_κ(x) = log(1 + exp{κx})/κ of the plus function, the soft-plus with softness parameter κ. Note that p_κ(x) → x_+ as κ → ∞. Motivated by the property that b(θ) = θ + b(−θ) for both the logistic and SVM cumulants, we define the Soft-SVM canonical parameters and cumulant as

\theta_{i}=p_{\kappa}(\delta+\eta_{i})-p_{\kappa}(\delta-\eta_{i})\doteq f_{\kappa,\delta}(\eta_{i})\quad\text{and}\quad b_{\kappa,\delta}(\theta)=\frac{1}{2}[p_{\kappa}(\theta+2\delta)+p_{\kappa}(\theta-2\delta)], \qquad (3)

where δ is a separation parameter in the cumulant. The parameter δ works together with the softness parameter κ to effectively bridge between logistic and SVM regressions: when κ = 1 and δ = 0 we recover logistic regression, while as κ → ∞ and δ → 1 we obtain SVM regression. This way, b_{1,0} ≡ b and b_{κ,δ} → s pointwise as κ → ∞ and δ → 1. The Soft-SVM penalized log-likelihood is then

\ell_{\kappa,\delta}(\theta;\mathbf{y})\doteq\sum_{i=1}^{n}y_{i}\theta_{i}-b_{\kappa,\delta}(\theta_{i})-\frac{\lambda}{2}\|\beta\|_{2}^{2}=\sum_{i=1}^{n}y_{i}\theta_{i}-b_{\kappa,\delta}(\theta_{i})-\frac{\lambda}{2}\begin{bmatrix}\beta_{0}\\ \beta\end{bmatrix}^{\top}\begin{bmatrix}0&\\ &I_{f}\end{bmatrix}\begin{bmatrix}\beta_{0}\\ \beta\end{bmatrix},

where we make clear that the bias β_0 is not penalized by using the precision matrix 0 ⊕ I_f.
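To make the construction concrete, the following minimal Python sketch (function names are ours, and the naive soft-plus is used for readability) implements the soft-plus, the canonical map f_{κ,δ}, and the cumulant b_{κ,δ}, checking the logistic case κ = 1, δ = 0 and the SVM limit:

import numpy as np

def softplus(x, kappa):
    """Soft-plus p_kappa(x) = log(1 + exp(kappa*x)) / kappa (naive; see Section 2.3 for a stable version)."""
    return np.log1p(np.exp(kappa * np.asarray(x, dtype=float))) / kappa

def f(eta, kappa, delta):
    """Canonical parameter theta = p_kappa(delta + eta) - p_kappa(delta - eta)."""
    return softplus(delta + eta, kappa) - softplus(delta - eta, kappa)

def b(theta, kappa, delta):
    """Soft-SVM cumulant b_{kappa,delta}(theta)."""
    return 0.5 * (softplus(theta + 2 * delta, kappa) + softplus(theta - 2 * delta, kappa))

theta = np.linspace(-4.0, 4.0, 9)
eta = np.linspace(-3.0, 3.0, 7)
# kappa = 1, delta = 0: Bernoulli cumulant and identity canonical map (logistic regression)
assert np.allclose(b(theta, 1.0, 0.0), np.log1p(np.exp(theta)))
assert np.allclose(f(eta, 1.0, 0.0), eta)
# large kappa, delta = 1: approaches the SVM cumulant s(theta) = [(theta+2)_+ + (theta-2)_+] / 2
s = 0.5 * (np.maximum(theta + 2.0, 0.0) + np.maximum(theta - 2.0, 0.0))
print(np.max(np.abs(b(theta, 50.0, 1.0) - s)))  # small: the soft-plus error is at most log(2)/kappa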

While the separation parameter δ has a clear interpretation from the cumulant function, we can also adopt an equivalent alternative parameterization in terms of a scaled separation parameter α ≐ κδ, since the shifted soft-plus can be written as

p_{\kappa}(x+z\delta)=\frac{1}{\kappa}\log\big(1+e^{\kappa(x+z\delta)}\big)=\frac{1}{\kappa}\log\big(1+e^{\kappa x+z\alpha}\big).

This alternative parameterization is more useful when we discuss the variance function in Section 2.2.

The mean function corresponding to b_{κ,δ} is its first derivative,

\mu(\theta)=b_{\kappa,\delta}^{\prime}(\theta)=\frac{1}{2}[\text{expit}\{\kappa(\theta+2\delta)\}+\text{expit}\{\kappa(\theta-2\delta)\}]. \qquad (4)

With v(x) = expit(x)(1 − expit(x)) = e^x/(1 + e^x)², the variance function is

V_{\kappa,\delta}(\mu)=b_{\kappa,\delta}^{\prime\prime}\big([b^{\prime}_{\kappa,\delta}]^{-1}(\mu)\big)=\frac{\kappa}{2}\big[v\big(\kappa([b^{\prime}_{\kappa,\delta}]^{-1}(\mu)+2\delta)\big)+v\big(\kappa([b^{\prime}_{\kappa,\delta}]^{-1}(\mu)-2\delta)\big)\big], \qquad (5)

where the inverse mean function [b'_{κ,δ}]^{-1} is given in the next section. Figure 2 plots the variance function for several values of κ, with the corresponding δ tied to κ through δ = 1 − 1/κ.
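For reference, a small sketch of the mean and variance functions in (4) and (5), written directly in terms of the canonical parameter θ so that the inverse mean function of the next section is not needed (expit is the logistic function; function names are ours):

import numpy as np
from scipy.special import expit  # expit(x) = 1 / (1 + exp(-x))

def mean_fn(theta, kappa, delta):
    """Mean function (4): mu(theta) = [expit(kappa*(theta + 2*delta)) + expit(kappa*(theta - 2*delta))] / 2."""
    return 0.5 * (expit(kappa * (theta + 2 * delta)) + expit(kappa * (theta - 2 * delta)))

def var_fn(theta, kappa, delta):
    """Variance (5) evaluated at mu = b'(theta): (kappa/2) * [v(kappa*(theta + 2*delta)) + v(kappa*(theta - 2*delta))]."""
    v = lambda x: expit(x) * (1.0 - expit(x))
    return 0.5 * kappa * (v(kappa * (theta + 2 * delta)) + v(kappa * (theta - 2 * delta)))

theta = np.linspace(-3.0, 3.0, 7)
print(mean_fn(theta, 5.0, 0.8))  # S-shaped, flat near 0.5 between the two soft kinks
print(var_fn(theta, 5.0, 0.8))   # bimodal in theta, with peaks near theta = -2*delta and theta = 2*delta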

Figure 2: Variance functions for κ ∈ {1, 1.5, 2, 5} and, correspondingly, δ = 1 − 1/κ ∈ {0, 1/3, 1/2, 4/5}.

Finally, to relate the canonical parameters θ_i to the features, the link function g is obtained from the composition f_{κ,δ}^{-1} ∘ [b'_{κ,δ}]^{-1} (see Section 2.1 for details). Figure 1 compares the cumulant and mean functions for Bernoulli/logistic regression, SVM, and Soft-SVM with κ = 5 and δ = 0.8. Note how the Soft-SVM functions match the SVM behavior for more extreme values of the canonical parameter θ and the logistic regression behavior for moderate values of θ, achieving a good compromise as a function of the softness κ.

2.1 Soft-SVM Regression

Given our characterization of the Soft-SVM exponential family distribution, performing Soft-SVM regression is equivalent to fitting a regularized generalized linear model. To this end, we follow a cyclic gradient ascent procedure, alternating between an iteratively re-weighted least squares (IRLS) step for Fisher scoring [7] to update the coefficients β given the softness κ, and then updating κ and α given the current values of β via a Newton-Raphson scoring step.

For the IRLS step we need the link g and the variance function V in (5). The link is given by the composition g = f_{κ,δ}^{-1} ∘ [b'_{κ,δ}]^{-1}, where

[b^{\prime}_{\kappa,\delta}]^{-1}(\mu)=\frac{1}{\kappa}\Bigg\{\frac{1}{2}\log\Big(\frac{\mu}{1-\mu}\Big)+\text{sign}(\mu-\tfrac{1}{2})\,\text{asinh}\big(e^{h_{\kappa,\delta}(\mu)}\big)\Bigg\},

with

h_{\kappa,\delta}(\mu)=\log\cosh(2\kappa\delta)+\log\frac{|\mu-1/2|}{\sqrt{\mu(1-\mu)}},

and

f^{-1}_{\kappa,\delta}(\theta)=\frac{\theta}{2}+\frac{\text{sign}(\theta)}{\kappa}\,\text{asinh}\big(e^{H_{\kappa,\delta}(\theta)}\big)\quad\text{with}\quad H_{\kappa,\delta}(\theta)=-\kappa\delta+\log\sinh\Big(\frac{\kappa|\theta|}{2}\Big).

Note that in all these expressions δ always appears through the product κδ, so it is immediate to use the alternative parameter α instead of δ.
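For concreteness, a hedged sketch of the link computation in the (κ, α) parameterization, composing the two inverse maps above; it uses the naive formulas for readability (the numerically safe versions are in Section 2.3), and the function names are ours:

import numpy as np
from scipy.special import expit

def b_prime_inv(mu, kappa, alpha):
    """Inverse mean function [b']^{-1}(mu), for mu != 1/2 (naive formula)."""
    h = np.log(np.cosh(2.0 * alpha)) + np.log(np.abs(mu - 0.5) / np.sqrt(mu * (1.0 - mu)))
    return (0.5 * np.log(mu / (1.0 - mu)) + np.sign(mu - 0.5) * np.arcsinh(np.exp(h))) / kappa

def f_inv(theta, kappa, alpha):
    """Inverse canonical map f^{-1}(theta), for theta != 0."""
    H = -alpha + np.log(np.sinh(kappa * np.abs(theta) / 2.0))
    return theta / 2.0 + np.sign(theta) * np.arcsinh(np.exp(H)) / kappa

def link(mu, kappa, alpha):
    """Link g(mu) = eta: invert the mean function, then invert the canonical map."""
    return f_inv(b_prime_inv(mu, kappa, alpha), kappa, alpha)

# round-trip check against the forward maps theta = f(eta) and mu = b'(theta)
kappa, alpha = 2.0, 1.0                                        # so delta = alpha / kappa = 0.5
eta = np.array([-1.5, -0.3, 0.7, 2.0])
sp = lambda x: np.log1p(np.exp(kappa * x)) / kappa             # naive soft-plus
theta = sp(alpha / kappa + eta) - sp(alpha / kappa - eta)      # f_{kappa,delta}(eta)
mu = 0.5 * (expit(kappa * theta + 2.0 * alpha) + expit(kappa * theta - 2.0 * alpha))
assert np.allclose(link(mu, kappa, alpha), eta)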

Our algorithm is summarized in Algorithm 1. For notational simplicity, we represent both bias and coefficients in a single vector β and store all features in a design matrix X, including a column of ones for the bias (intercept).

▷ Input: features in an n-by-(f+1) design matrix X (including the intercept column) and labels y.
▷ Parameters: penalty λ, convergence tolerance ϵ, and small perturbation 0 < ν < 1.
▷ Output: estimated regression coefficients β̂ (including the bias β̂_0), shape parameter α̂, and softness parameter κ̂.
▷ Initialization:
Set α^(0) = 1, κ^(0) = 1 and, for i = 1, …, n, μ_i^(0) = (y_i + ν)/(1 + 2ν), a ν-perturbed version of y, and η_i^(0) = g_{κ^(0)}(μ_i^(0));
▷ Cyclic scoring iteration:
for t = 0, 1, … (until convergence) do
       ▷ κ-step:
       Update κ^(t+1) = κ^(t) − [∂²ℓ_{κ,α}/∂κ²]^{-1}(∂ℓ_{κ,α}/∂κ) via Newton's method;
       ▷ α-step:
       Update α^(t+1) = α^(t) − Σ_{i=1}^{n}[y_i f'_{κ,α} − b'_{κ,α}] / Σ_{i=1}^{n}[y_i f''_{κ,α} − b''_{κ,α}] via Newton's method;
       ▷ β-step:
       Compute weights w_i = y_i f''_{κ,α}(η_i) − f''_{κ,α}(η_i) b'_{κ,α}(θ_i) − (f'_{κ,α}(η_i))² b''_{κ,α}(θ_i) and set W^(t) = Diag_{i=1,…,n}{w_i^(t)};
       Update β^(t+1) by solving (X^⊤W^(t)X + λ(0 ⊕ I_f)) β^(t+1) = X^⊤W^(t)η^(t) − X^⊤Diag_{i=1,…,n}{f'_{κ,α}(η_i)}(y − μ(β^(t)));
       Set η^(t+1) = Xβ^(t+1), with μ(β^(t)) = b'_{κ,α}(θ) and θ = [θ_1, θ_2, …, θ_n]^⊤;
       if |ℓ_{(κ,α)^(t+1)}(η^(t+1); y) − ℓ_{(κ,α)^(t)}(η^(t); y)| / |ℓ_{(κ,α)^(t)}(η^(t); y)| < ϵ then
              break
return β^(t), α^(t), and κ^(t).
Algorithm 1: Soft-SVM Regression
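For illustration, a minimal sketch of the β-step solve, assuming callbacks f, f1, f2, b1, b2 for f_{κ,α} and the derivatives f′, f″, b′, b″; it uses the equivalent positive-weight (negated Hessian) form of the penalized Newton update, so the signs differ superficially from the display in the algorithm:

import numpy as np

def beta_step(X, y, eta, lam, f, f1, f2, b1, b2):
    """One penalized Newton/IRLS update for beta (a sketch, not the authors' implementation).

    X is the n-by-(f+1) design matrix with a leading column of ones (unpenalized intercept);
    f, f1, f2 evaluate f_{kappa,alpha} and its first two derivatives at eta, and b1, b2
    evaluate the first and second derivatives of b_{kappa,alpha} at theta.
    """
    theta = f(eta)
    mu = b1(theta)
    # negated per-observation Hessian terms: -(y*f'' - f''*b' - (f')^2*b''), nonnegative near the optimum
    w = -(y * f2(eta) - f2(eta) * mu - f1(eta) ** 2 * b2(theta))
    P = np.eye(X.shape[1])
    P[0, 0] = 0.0                                    # 0 (+) I_f: do not penalize the intercept
    lhs = X.T @ np.diag(w) @ X + lam * P
    rhs = X.T @ (w * eta) + X.T @ (f1(eta) * (y - mu))
    return np.linalg.solve(lhs, rhs)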

2.2 Soft-SVM Variance Function

Since h_{κ,δ}(μ) = h_α(μ) = log cosh(2α) + log(|μ − 1/2|/√(μ(1 − μ))), we have that

\kappa\big([b^{\prime}_{\kappa,\delta}]^{-1}(\mu)\pm 2\delta\big)=\frac{1}{2}\log\Big(\frac{\mu}{1-\mu}\Big)+\text{sign}(\mu-\tfrac{1}{2})\,\text{asinh}\big(e^{h_{\alpha}(\mu)}\big)\pm 2\alpha

and so the variance function in (5) can be written as V_{κ,α}(μ) = κ · r_α(μ) where, with u_α(μ) ≐ logit(μ)/2 + sign(μ − 1/2) asinh(exp{h_α(μ)}),

r_{\alpha}(\mu)=\frac{1}{2}\Big\{v\big(u_{\alpha}(\mu)+2\alpha\big)+v\big(u_{\alpha}(\mu)-2\alpha\big)\Big\}.

This way, the parameters κ and α have distinct effects on the variance function: the softness parameter κ acts as a dispersion, capturing the scale of V(μ), while the scaled separation α controls the shape of V(μ).

Figure 3: Top row has variance-mean plots when the scale parameter κ is fixed at 1 and α ∈ {0, α*, 1, 2, 5}, where α* ≈ 0.66 is the critical value at which the peaks start separating. Bottom row has histograms for the corresponding distributions of mean values.

Figure 3 shows variance function plots for the ESL mixture dataset discussed in Section 2.4 when the penalty parameter λ and the scale parameter κ are both fixed at 1 and only the shape parameter α is allowed to increase from 0 to 5. We can see that as α increases, the overall variance decreases and the curve gradually splits from a unimodal into a bimodal shape. In practice, we observe that as α increases to achieve higher mode separation, most observations are pushed to have fitted μ values around 0.5 and variance weights close to zero, caught in between the variance peaks.

We can then classify observations with respect to their mean and variance values: points with high variance weights have a large influence on the fit and thus on the decision boundary, behaving like “soft” support vectors; points with low variance weights and means close to 0.5 are close to the boundary but do not influence it, belonging thus to a “dead zone” for the fit; finally, points with low variance weights and large max{μ, 1 − μ} are inliers and also have low influence on the fit. This classification is illustrated in Figure 4.
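A hedged sketch of this three-way classification; the cut-offs below (variance weight of at least 1, as in Figure 4, and mean within 0.1 of 0.5) are illustrative choices rather than values prescribed by the method:

import numpy as np

def classify_points(mu_hat, var_hat, var_cut=1.0, mid_band=0.1):
    """Label observations as 'soft support vector', 'dead zone', or 'inlier' (illustrative thresholds)."""
    return np.where(var_hat >= var_cut, "soft support vector",
                    np.where(np.abs(mu_hat - 0.5) <= mid_band, "dead zone", "inlier"))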

Figure 4: Classification results from soft-SVM when the scale parameter κ is fixed at 1 and α equals 0, 1, 1.2, 1.5, 2, and 5. The colors represent a gradient from blue (μ̂_i = 0) to orange (μ̂_i = 1). Filled dots have V(μ̂_i) ≥ 1 while hollow dots have V(μ̂_i) < 1.

2.3 Handling numerical issues

As κ increases we should anticipate numerical issues since b_κ and f_κ become non-differentiable up to machine precision. For this reason, we need to adopt numerically stable versions of asinh(e^x), log cosh(x), and log sinh(x), needed by g_κ and thus also by V_κ above.

The first step is a numerically stable version of the Bernoulli cumulant; with ϵ the machine precision, we adopt

\text{log1pe}(x)\doteq\log(1+e^{x})=\left\{\begin{array}{rcl}x,&\mbox{if}&x>-\log\epsilon\\ x+\log(1+e^{-x}),&\mbox{if}&0<x\leq-\log\epsilon\\ \log(1+e^{x}),&\mbox{if}&\log\epsilon\leq x\leq 0\\ 0,&\mbox{if}&x<\log\epsilon.\end{array}\right.

This way, the soft-plus can be reliably computed as p_κ(x) = log1pe(κx)/κ. Again exploiting domain partitions and the parity of cosh, we can define

\log\cosh(x)=\left\{\begin{array}{rcl}|x|-\log 2,&\mbox{if}&|x|>-\log\sqrt{\epsilon}\\ |x|-\log 2+\text{log1pe}(-2|x|),&\mbox{if}&|x|\leq-\log\sqrt{\epsilon}.\end{array}\right.

We branch twice for asinh(e^x): we first define

\gamma(x)=\left\{\begin{array}{rcl}1,&\mbox{if}&|x|>-\log\sqrt{\epsilon}\\ \sqrt{1+e^{-2|x|}},&\mbox{if}&|x|\leq-\log\sqrt{\epsilon},\end{array}\right.

and then

\text{asinh}(e^{x})=\left\{\begin{array}{rcl}x+\log(1+\gamma(x)),&\mbox{if}&x>0\\ \log(e^{x}+\gamma(x)),&\mbox{if}&x\leq 0.\end{array}\right.

Finally, we write

\log\sinh(x)=\left\{\begin{array}{rcl}\log\frac{1}{2}+x+\log(1-e^{-2x}),&\mbox{if}&x>0\\ \log\frac{1}{2}-x+\log(1-e^{2x}),&\mbox{if}&x\leq 0\end{array}\right.

to make it stable whether x is greater than or smaller than 0.
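The stable branches above translate directly into vectorized code; a minimal sketch (function names are ours):

import numpy as np

EPS = np.finfo(float).eps  # machine precision

def log1pe(x):
    """Stable log(1 + e^x), branching on the magnitude of x as in the text."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    big = x > -np.log(EPS)                                  # e^{-x} below machine precision
    pos = (x > 0) & ~big
    out[big] = x[big]
    out[pos] = x[pos] + np.log1p(np.exp(-x[pos]))
    out[~big & ~pos] = np.log1p(np.exp(x[~big & ~pos]))     # covers the last two branches
    return out

def softplus(x, kappa):
    """p_kappa(x) = log1pe(kappa * x) / kappa."""
    return log1pe(kappa * np.asarray(x, dtype=float)) / kappa

def logcosh(x):
    """Stable log cosh(x) = |x| - log 2 + log1pe(-2|x|), using the parity of cosh."""
    x = np.abs(np.asarray(x, dtype=float))
    return x - np.log(2.0) + log1pe(-2.0 * x)

def asinh_exp(x):
    """Stable asinh(e^x) = log(e^x + sqrt(1 + e^{2x}))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    gamma = np.where(np.abs(x) > -0.5 * np.log(EPS), 1.0,
                     np.sqrt(1.0 + np.exp(-2.0 * np.abs(x))))
    out = np.empty_like(x)
    pos = x > 0
    out[pos] = x[pos] + np.log1p(gamma[pos])
    out[~pos] = np.log(np.exp(x[~pos]) + gamma[~pos])
    return out

def logsinh(x):
    """Stable log sinh(|x|) = -log 2 + |x| + log(1 - e^{-2|x|}), for x != 0."""
    x = np.abs(np.asarray(x, dtype=float))
    return -np.log(2.0) + x + np.log1p(-np.exp(-2.0 * x))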

2.4 Practical example: ESL mixture

To examine how Soft-SVM regression performs for classification in comparison with SVM and logistic regression, we carry out a detailed example using the simulated mixture data in [5], consisting of two features, X_1 and X_2. To estimate the complexity penalty λ we use ten-fold cross-validation with 20 replications for all methods. The results are summarized in Figure 5. Because κ̂ is not much larger than one, the fitted curve resembles a logistic curve; however, the softness effect can be seen in the right panel as points within the soft margin have estimated probabilities close to 0.5, with fitted values quickly increasing as points move farther away from the decision boundary. To classify, we simply take a consensus rule, ŷ_i = I(μ̂_i > 0.5), with I the indicator function.

Figure 5: Soft-SVM regression on the ESL mixture data: fitted values ŷ (left) and decision boundary in feature space (right). Points are shaped by predicted classification, colored by observed labels, and have transparency proportional to |ŷ_i − 0.5|. Dashed lines mark δ = α/κ in the fitted value plot and the soft margin M = δ/‖β̂‖_2 in the feature space plot.

Figure 6 shows the effect of regularization on the cross-validation performance of Soft-SVM regression, as measured by the Matthews correlation coefficient (MCC). We adopt MCC as a metric because it provides a more balanced evaluation of all possible outcomes [3]. As can be seen in the right panel, which compares MCC across Soft-SVM, SVM, and logistic regressions, the performances are quite comparable; Soft-SVM regression seems to have a slight advantage on average, however, and logistic regression shows more variable performance.

Figure 6: Cross-validation results (replication averages) for Soft-SVM using MCC (left) and performance comparisons to SVM and logistic regression (right).

3 Experiments

3.1 Simulation study

Prior works have shown that the performance of SVM can be severely affected when applied to unbalanced datasets since the margin obtained is biased towards the minority class [1, 2]. In this section, we conduct a simulation study to investigate these effects on Soft-SVM regression.

We consider a simple data simulation setup with two features. There are n observations, partitioned into two sets with sizes n_1 = ⌊ρn⌋ and n_2 = n − n_1, for an imbalance parameter 0 < ρ ≤ 0.5. We expect SVM to perform worse as ρ gets smaller and the sets become more unbalanced. Observations in the first set have labels y_i = 0 and x_i ∼ N(μ_1, σI_2), while the n_2 observations in the second set have y_i = 1 and x_i ∼ N(μ_2, σI_2). We set μ_1 = (√2, 1) and μ_2 = (0, 1 + √2), so that the decision boundary is X_2 = X_1 + 1 and the distance from either μ_1 or μ_2 to the boundary is 1. The parameter σ acts as a measure of separability: for σ < 1 we should see stronger performance from the classification methods, but as σ becomes too small we expect complete separability to challenge logistic regression.
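A sketch of one replication of this design, under our reading that σ enters N(μ, σI_2) as the covariance scale:

import numpy as np

def simulate(n=100, rho=0.25, sigma=1.0, rng=None):
    """Two-Gaussian design: n1 = floor(rho*n) points with label 0 around mu1 and n2 = n - n1
    points with label 1 around mu2, both with covariance sigma * I_2."""
    rng = np.random.default_rng() if rng is None else rng
    n1 = int(np.floor(rho * n))
    n2 = n - n1
    mu1 = np.array([np.sqrt(2.0), 1.0])          # distance 1 from the boundary X2 = X1 + 1
    mu2 = np.array([0.0, 1.0 + np.sqrt(2.0)])    # distance 1 on the other side
    X = np.vstack([rng.multivariate_normal(mu1, sigma * np.eye(2), size=n1),
                   rng.multivariate_normal(mu2, sigma * np.eye(2), size=n2)])
    y = np.concatenate([np.zeros(n1), np.ones(n2)])
    return X, y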

For the simulation study we take n = 100 and simulate 50 replications using a factorial design with ρ ∈ {0.125, 0.25, 0.5} and σ ∈ {0.5, 1, 1.5}. For the SVM and Soft-SVM classification methods we estimate the complexity penalty λ using ten-fold cross-validation, and we use standard, non-regularized logistic regression. We report performance using MCC and summarize the results in Figure 7.

Figure 7: Simulation study results comparing MCC for Soft-SVM, SVM, and logistic regressions for ρ ∈ {0.125, 0.25, 0.5} and σ ∈ {0.5, 1, 1.5}.

The performance across methods is comparable, uniformly degrading as we boost confounding by decreasing ρ\rho and increasing σ\sigma, but differs in more pathological corner cases. The combination of small ρ\rho values (higher imbalance) and large σ\sigma values (less separability) results in a poorer performance from SVM, as previously reported [2]. Small σ\sigma values, on the other hand, result in lack of convergence and inflated coefficient estimates in logistic regression fits, even though predicted classification is not much affected. Soft-SVM classification, however, avoids both issues, exhibiting stable estimation and good performance in all cases.

3.2 Case studies

To evaluate the empirical performance of Soft-SVM regression and classification, we select nine well-known datasets from the UCI machine learning repository [4] with varied numbers of observations and features. Table 1 lists the datasets and their attributes. Not all datasets are originally meant for binary classification; the Abalone dataset, for instance, has three classes (male, female, and infant), so we only take male and female observations and exclude infants. The red and white wine quality datasets originally have ten quality levels, labeled from low to high. We dichotomize these levels into a low quality class (quality levels between 1 and 5, inclusive) and a high quality class (quality levels 6 to 10). All remaining datasets have two classes.
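A hedged sketch of these label constructions; the file names and column names ("Sex", "quality") are assumptions about the UCI files, not taken from the paper:

import pandas as pd

# Abalone: keep only male and female observations (the 0/1 coding of the two classes is arbitrary here)
abalone = pd.read_csv("abalone.csv")
abalone = abalone[abalone["Sex"].isin(["M", "F"])]
abalone["y"] = (abalone["Sex"] == "M").astype(int)

# Wine quality: dichotomize into low (1-5) and high (6-10) quality classes
wine = pd.read_csv("winequality-red.csv", sep=";")
wine["y"] = (wine["quality"] >= 6).astype(int)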

Table 1: Selected case study datasets from the UCI repository.

UCI dataset                      # of observations (n)   # of features (f)
Abalone (only F & M classes)     4177                    8
Australian credit                690                     14
Breast cancer                    569                     30
Haberman                         306                     3
Heart disease                    303                     13
Liver disorder                   345                     6
Pima Indians diabetes            768                     8
Red wine quality (amended)       1599                    11
White wine quality (amended)     4898                    11

As in the simulation study, we assess performance using MCC from 50 ten-fold cross-validation replications. For each replication, we estimate the penalty λ for each classification method, again using ten-fold cross-validation. Figure 8 shows the results.
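A sketch of this evaluation loop, with the inner cross-validation for λ omitted; fit and predict stand in for any of the three classifiers, and the scikit-learn helpers are one convenient way to build the folds and compute MCC:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import matthews_corrcoef

def repeated_cv_mcc(fit, predict, X, y, n_reps=50, n_folds=10, seed=0):
    """Average MCC over n_reps replications of stratified n_folds-fold cross-validation."""
    scores = []
    for rep in range(n_reps):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + rep)
        for train, test in cv.split(X, y):
            model = fit(X[train], y[train])      # penalty selection would nest another CV here
            scores.append(matthews_corrcoef(y[test], predict(model, X[test])))
    return float(np.mean(scores))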

Figure 8: Performance comparison of MCC for Soft-SVM, SVM, and logistic regressions on nine UCI datasets over 50 ten-fold cross-validation replications.

In general, SVM classification performs worse on “tough” datasets with low MCC performance, such as Abalone and Haberman, probably due to poor selection of penalty parameters. On the other hand, SVM compares favorably to Soft-SVM and logistic regressions on the Heart dataset, being more robust when many features seem to be uninformative of classification labels. Overall, Soft-SVM regression performs comparably to the best of SVM and logistic regression even on hard datasets, as was observed in the simulation study. Interestingly, it has a clear advantage over both SVM and logistic regressions in cases where a few features are very informative, such as the Abalone and Australian credit datasets. It often has an edge over logistic regression on over-dispersed datasets such as the Breast cancer and White wine datasets. It seems that Soft-SVM regression can better accommodate these cases via higher dispersion at moderate values of softness, i.e. κ < 2, or via better use of regularization for increased robustness and generalizability.

4 Conclusion

We have presented a new binary classification and regression method based on a new exponential family distribution, Soft-SVM. Soft-SVM arises from extending a convex relaxation of the SVM hinge loss with a softness parameter, in effect achieving a smooth transformation between the binomial deviance and SVM hinge losses. Soft-SVM regression then entails fitting a generalized linear model, and so it is comparable to logistic regression in terms of computational complexity, having just a small overhead to fit the softness parameter. Moreover, since many points might have low variance weights by not being “soft” support vectors—either being dead-zoners or inliers, by our classification in Section 2.2—they can be selected out of the fit to achieve significant computational savings.

More importantly, as we have shown in a simulation study and multiple case studies, Soft-SVM classification performs comparably to the best of SVM and logistic regressions, often outperforming them, even if slightly, on feature-rich datasets. This advantage in performance can be attributed to the higher robustness afforded by the softness and separation parameters, enabling the model to accommodate over-dispersed datasets well and to adequately address separability issues via regularization. By bridging the SVM hinge and binomial deviance loss functions, it seems that Soft-SVM regression is able to combine robustness and interpretability, the advantages of SVM and logistic regression, respectively, freeing the practitioner from the burden of selecting one or the other.

For future work, we intend to develop data sampling procedures and adopt a Bayesian formulation to then sample posterior estimates of coefficients, softness and separation parameters. Another natural direction of work is to incorporate kernels, as it is common in SVM regression. Finally, we also expect to extend this formulation to multinomial classification [10], hoping again to bring some robustness from SVM while keeping interpretability.

References

  • Bernau et al. [2014] Christoph Bernau, Markus Riester, Anne-Laure Boulesteix, Giovanni Parmigiani, Curtis Huttenhower, Levi Waldron, and Lorenzo Trippa. Cross-study validation for the assessment of prediction algorithms. Bioinformatics, 30(12):i105–i112, 2014.
  • Cervantes et al. [2020] Jair Cervantes, Farid Garcia-Lamont, Lisbeth Rodríguez-Mazahua, and Asdrubal Lopez. A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing, 408:189–215, 2020.
  • Chicco and Jurman [2020] Davide Chicco and Giuseppe Jurman. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1):1–13, 2020.
  • Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition, 2009.
  • Lin [2002] Yi Lin. Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, 6(3):259–275, 2002.
  • McCullagh and Nelder [1989] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, second edition, 1989.
  • Moguerza et al. [2006] Javier M Moguerza, Alberto Muñoz, et al. Support vector machines with applications. Statistical Science, 21(3):322–336, 2006.
  • Vapnik [2013] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 2013.
  • Zhu and Hastie [2005] Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14(1):185–205, 2005.

Appendix A Appendix

A.1 [b'_κ]^{-1}(μ)

\begin{aligned}
[b^{\prime}_{\kappa}]^{-1}(\mu) &= \frac{1}{\kappa}\log\Bigg\{\frac{-(2\mu-1)(e^{-2\kappa\delta(\kappa)}+e^{2\kappa\delta(\kappa)})}{2(2\mu-2)}+\sqrt{\bigg[\frac{-(2\mu-1)(e^{-2\kappa\delta(\kappa)}+e^{2\kappa\delta(\kappa)})}{2(2\mu-2)}\bigg]^{2}+\frac{\mu}{1-\mu}}\Bigg\} \\
&= \frac{1}{\kappa}\log\Bigg\{\sqrt{\frac{\mu}{1-\mu}}\Bigg(\frac{(\mu-\frac{1}{2})\frac{1}{2}(e^{-2\kappa\delta(\kappa)}+e^{2\kappa\delta(\kappa)})}{\sqrt{\mu(1-\mu)}}+\sqrt{\bigg[\frac{(\mu-\frac{1}{2})\frac{1}{2}(e^{-2\kappa\delta(\kappa)}+e^{2\kappa\delta(\kappa)})}{\sqrt{\mu(1-\mu)}}\bigg]^{2}+1}\Bigg)\Bigg\} \\
&= \left\{\begin{array}{rcl}\frac{1}{\kappa}\big[\frac{1}{2}\log\frac{\mu}{1-\mu}+\log\big(e^{h_{\kappa}(\mu)}+\sqrt{e^{2h_{\kappa}(\mu)}+1}\big)\big],&\mbox{if}&\mu>\frac{1}{2}\\ \frac{1}{\kappa}\big[\frac{1}{2}\log\frac{\mu}{1-\mu}+\log\big(-e^{h_{\kappa}(\mu)}+\sqrt{e^{2h_{\kappa}(\mu)}+1}\big)\big],&\mbox{if}&\mu\leq\frac{1}{2}\end{array}\right. \\
&= \frac{1}{\kappa}\Big\{\frac{1}{2}\log\Big(\frac{\mu}{1-\mu}\Big)+\text{sign}(\mu-\tfrac{1}{2})\,\text{asinh}\big(e^{h_{\kappa}(\mu)}\big)\Big\},
\end{aligned}

with

h_{\kappa}(\mu)=\log\Bigg\{\frac{|\mu-\frac{1}{2}|}{\sqrt{\mu(1-\mu)}}\cdot\frac{1}{2}\big(e^{-2\kappa\delta(\kappa)}+e^{2\kappa\delta(\kappa)}\big)\Bigg\}=\log\cosh(2\kappa\delta(\kappa))+\log\frac{|\mu-1/2|}{\sqrt{\mu(1-\mu)}}.

A.2 f_κ^{-1}(θ)

\begin{aligned}
f^{-1}_{\kappa}(\theta) &= \frac{1}{\kappa}\log\Bigg\{\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}}+\sqrt{e^{\kappa\theta}+\Big(\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}}\Big)^{2}}\Bigg\} \\
&= \frac{1}{\kappa}\log\Bigg\{(e^{\kappa\theta})^{\frac{1}{2}}\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}(e^{\kappa\theta})^{\frac{1}{2}}}+(e^{\kappa\theta})^{\frac{1}{2}}\sqrt{1+\Big(\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}(e^{\kappa\theta})^{\frac{1}{2}}}\Big)^{2}}\Bigg\} \\
&= \frac{1}{\kappa}\log\big\{(e^{\kappa\theta})^{\frac{1}{2}}\big\}+\frac{1}{\kappa}\log\Bigg\{\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}(e^{\kappa\theta})^{\frac{1}{2}}}+\sqrt{1+\Big(\frac{e^{\kappa\theta}-1}{2e^{\kappa\delta(\kappa)}(e^{\kappa\theta})^{\frac{1}{2}}}\Big)^{2}}\Bigg\} \\
&= \left\{\begin{array}{rcl}\frac{\theta}{2}+\frac{1}{\kappa}\log\big(e^{H_{\kappa}(\theta)}+\sqrt{e^{2H_{\kappa}(\theta)}+1}\big),&\mbox{if}&\theta>0\\ \frac{\theta}{2}+\frac{1}{\kappa}\log\big(-e^{H_{\kappa}(\theta)}+\sqrt{e^{2H_{\kappa}(\theta)}+1}\big),&\mbox{if}&\theta\leq 0\end{array}\right. \\
&= \frac{\theta}{2}+\frac{\text{sign}(\theta)}{\kappa}\,\text{asinh}\big(e^{H_{\kappa}(\theta)}\big),
\end{aligned}

where

H_{\kappa}(\theta)=\log\Bigg\{e^{-\kappa\delta(\kappa)}\cdot\frac{|e^{\kappa\theta}-1|}{2(e^{\kappa\theta})^{\frac{1}{2}}}\Bigg\}=-\kappa\delta(\kappa)+\log\sinh\Big(\frac{\kappa|\theta|}{2}\Big),

and

\log\sinh(x)=\log\Big(\frac{e^{2x}-1}{2e^{x}}\Big)=\left\{\begin{array}{rcl}\log\frac{1}{2}+x+\log(1-e^{-2x}),&\mbox{if}&x>0\\ \log\frac{1}{2}-x+\log(1-e^{2x}),&\mbox{if}&x\leq 0.\end{array}\right.

A.3 V_κ(μ)

V_{\kappa}(\mu)=b_{\kappa}^{\prime\prime}(\theta)=b_{\kappa}^{\prime\prime}\big([b^{\prime}_{\kappa}]^{-1}(\mu)\big),

\begin{aligned}
b_{\kappa}^{\prime\prime}(\theta) &= \frac{\partial^{2}}{\partial\theta^{2}}\Bigg\{\frac{1}{2}\Bigg[\frac{\log(1+e^{\kappa(\theta-2\delta(\kappa))})}{\kappa}+\frac{\log(1+e^{\kappa(\theta+2\delta(\kappa))})}{\kappa}\Bigg]\Bigg\} \\
&= \frac{\partial}{\partial\theta}\Bigg\{\frac{1}{2}\Bigg[\frac{e^{\kappa(\theta-2\delta(\kappa))}}{1+e^{\kappa(\theta-2\delta(\kappa))}}+\frac{e^{\kappa(\theta+2\delta(\kappa))}}{1+e^{\kappa(\theta+2\delta(\kappa))}}\Bigg]\Bigg\} \\
&= \frac{1}{2}\Bigg[\frac{\kappa e^{\kappa(\theta-2\delta(\kappa))}}{(1+e^{\kappa(\theta-2\delta(\kappa))})^{2}}+\frac{\kappa e^{\kappa(\theta+2\delta(\kappa))}}{(1+e^{\kappa(\theta+2\delta(\kappa))})^{2}}\Bigg] \\
&= \frac{\kappa}{2}\Big\{v\big[\kappa(\theta-2\delta(\kappa))\big]+v\big[\kappa(\theta+2\delta(\kappa))\big]\Big\},
\end{aligned}

with \delta(\kappa)=1-\frac{1}{\kappa} and v(x)=\frac{e^{x}}{(1+e^{x})^{2}}.

A.4 β-step

\ell=\sum_{i=1}^{n}y_{i}\theta_{i}-b_{\kappa}(\theta_{i}),\qquad\theta_{i}=f_{\kappa}(\eta_{i}),\qquad\eta_{i}=X_{i}^{\top}\beta=\beta_{0}+x_{i1}\beta_{1}+\ldots+x_{ip}\beta_{p},
U(\beta_{j})=\frac{\partial\ell}{\partial\beta_{j}}=\sum_{i=1}^{n}\frac{\partial\big(y_{i}\theta_{i}-b_{\kappa}(\theta_{i})\big)}{\partial\theta_{i}}\cdot\frac{\partial\theta_{i}}{\partial\beta_{j}}=\sum_{i=1}^{n}\big(y_{i}-b^{\prime}_{\kappa}(\theta_{i})\big)f^{\prime}_{\kappa}(\eta_{i})X_{ij}=\sum_{i=1}^{n}f^{\prime}_{\kappa}(\eta_{i})X_{ij}\big(y_{i}-b^{\prime}_{\kappa}(\theta_{i})\big),\qquad j=0,1,\ldots,p,

so that, stacking the entries,

U(\beta)=\frac{\partial\ell}{\partial\beta}=\begin{bmatrix}U(\beta_{0})\\ U(\beta_{1})\\ \vdots\\ U(\beta_{p})\end{bmatrix}=X^{\top}\text{Diag}_{i=1,\ldots,n}\{f_{\kappa}^{\prime}(\eta_{i})\}\big(\mathbf{y}-\bm{\mu}(\bm{\theta})\big),\quad\text{where}\quad\bm{\mu}(\bm{\theta})=\begin{bmatrix}b^{\prime}_{\kappa}(\theta_{1})\\ b^{\prime}_{\kappa}(\theta_{2})\\ \vdots\\ b^{\prime}_{\kappa}(\theta_{n})\end{bmatrix},

and

H(\beta)=\frac{\partial U(\beta)}{\partial\beta^{\top}}.

For each row j, j = 0, 1, 2, …, p,

\begin{aligned}
\frac{\partial}{\partial\beta^{\top}}\sum_{i=1}^{n}f^{\prime}_{\kappa}(\eta_{i})\,X_{ij}\,\big(y_{i}-b^{\prime}_{\kappa}(\theta_{i})\big) &= \frac{\partial}{\partial\beta^{\top}}\sum_{i=1}^{n}f^{\prime}_{\kappa}(\eta_{i})X_{ij}y_{i}-\frac{\partial}{\partial\beta^{\top}}\sum_{i=1}^{n}f^{\prime}_{\kappa}(\eta_{i})X_{ij}b^{\prime}_{\kappa}(\theta_{i}) \\
&= \Big[\textstyle\sum_{i=1}^{n}X_{ij}\big(y_{i}f^{\prime\prime}_{\kappa}(\eta_{i})-f^{\prime\prime}_{\kappa}(\eta_{i})b^{\prime}_{\kappa}(\theta_{i})-(f^{\prime}_{\kappa}(\eta_{i}))^{2}b^{\prime\prime}_{\kappa}(\theta_{i})\big)X_{ik}\Big]_{k=0,1,\ldots,p}.
\end{aligned}

Therefore, collecting the rows,

H(\beta)=X^{\top}\text{Diag}_{i=1,\ldots,n}\{w_{i}\}X=X^{\top}W(\beta)X,\qquad w_{i}\doteq y_{i}f^{\prime\prime}_{\kappa}(\eta_{i})-f^{\prime\prime}_{\kappa}(\eta_{i})b^{\prime}_{\kappa}(\theta_{i})-(f^{\prime}_{\kappa}(\eta_{i}))^{2}b^{\prime\prime}_{\kappa}(\theta_{i}),\quad W(\beta)\doteq\text{Diag}_{i=1,\ldots,n}\{w_{i}\}.

Since by Newton’s method we have:

\beta^{(t+1)}\approx\beta^{(t)}-[H^{(t)}]^{-1}U(\beta^{(t)})\;\Rightarrow\;H^{(t)}\beta^{(t+1)}\approx H^{(t)}\beta^{(t)}-U(\beta^{(t)}),

and by adding the penalty term and plugging in the values of H and U, we can update β^(t+1) by solving the following matrix equation:

\big(X^{\top}W^{(t)}X+\lambda(0\oplus I_{f})\big)\beta^{(t+1)}=X^{\top}W^{(t)}\eta^{(t)}-X^{\top}\text{Diag}_{i=1,\ldots,n}\{f_{\kappa}^{\prime}(\eta_{i})\}\big(\mathbf{y}-\bm{\mu}(\beta^{(t)})\big),\qquad\eta^{(t)}=X\beta^{(t)}.