
Probabilistic Classification Vector Machine for Multi-Class Classification

Shengfei Lyu
School of Computer Science and Technology
University of Science and Technology of China
Hefei, Anhui 230027
[email protected]

Xing Tian
School of Computer Science and Technology
University of Science and Technology of China
Hefei, Anhui 230027
[email protected]

Yang Li
School of Computer Science and Technology
University of Science and Technology of China
Hefei, Anhui 230027
[email protected]

Bingbing Jiang
School of Computer Science and Technology
University of Science and Technology of China
Hefei, Anhui 230027
[email protected]

Huanhuan Chen
School of Computer Science and Technology
University of Science and Technology of China
Hefei, Anhui 230027
[email protected]
Abstract

The probabilistic classification vector machine (PCVM) synthesizes the advantages of both the support vector machine and the relevant vector machine, delivering a sparse Bayesian solution to classification problems. However, the PCVM is currently only applicable to binary cases. Extending the PCVM to multi-class cases via heuristic voting strategies such as one-vs-rest or one-vs-one often results in a dilemma where classifiers make contradictory predictions, and those strategies might lose the benefits of probabilistic outputs. To overcome this problem, we extend the PCVM and propose a multi-class probabilistic classification vector machine (mPCVM). Two learning algorithms, i.e., one top-down algorithm and one bottom-up algorithm, have been implemented in the mPCVM. The top-down algorithm obtains the maximum a posteriori (MAP) point estimates of the parameters based on an expectation-maximization algorithm, and the bottom-up algorithm is an incremental paradigm by maximizing the marginal likelihood. The superior performance of the mPCVMs, especially when the investigated problem has a large number of classes, is extensively evaluated on synthetic and benchmark data sets.

1 Introduction

Classification is one of the fundamental problems in machine learning and has been widely studied in various applications. The basic form of classification predicts whether an instance belongs to a class or not, which is referred to as binary classification and is addressed by numerous well-developed classifiers. Among them, the support vector machine (SVM) [1] is arguably the most popular [2]. However, the SVM has several disadvantages [3, 4]: i) it does not produce probabilistic outputs, and ii) the number of support vectors grows linearly with the size of the training set, which makes the SVM costly when trained on large data sets.

To overcome the above disadvantages of the SVM, the relevance vector machine (RVM) [3] was proposed in a Bayesian automatic relevance determination framework [5], [6]. The RVM retains only a few of the original basis functions (called relevance vectors), whose corresponding weights are non-zero, through an appropriate formulation of hierarchical priors. To improve the training efficiency of the RVM, an accelerated strategy has been developed that maximizes the marginal likelihood via a principled and efficient sequential addition and deletion of candidate basis functions [7].

Unfortunately, the RVM might not adhere to the principle of the SVM that a positive sample should have a positive weight while a negative sample should have a negative weight. Consequently, the RVM can be unstable and not robust to kernel parameters for classification problems. To overcome this issue, Chen et al. proposed the probabilistic classification vector machine (PCVM) [4], which guarantees the consistency of signs between weights and class labels (for convenience of exposition, we assume the labels in the binary case are from {-1, +1} unless stated otherwise) by adopting a truncated Gaussian prior over weights. Analogous to the accelerated RVM, the efficient probabilistic classification vector machine [8], a fast version of the PCVM, has also been proposed.

Like binary classification, multi-class classification is widely used and worth exploring. A common line of research [9] regards multi-class classification as an extension of binary classification by means of two tricks, i.e., one-vs-rest and one-vs-one. The one-vs-rest strategy uses $C-1$ classifiers (where $C$ is the number of classes), each of which determines whether a sample belongs to a certain class or not. The one-vs-one strategy builds $C(C-1)/2$ classifiers, one for every pair of classes, and each sample is assigned to the most likely class by majority vote. These two strategies can extend binary classifiers to multi-class cases. For instance, the popular binary SVM has been extended to multi-class classification in the well-known toolbox LIBSVM [10], where the one-vs-one strategy is adopted. Similarly, the sparse Bayesian extreme learning machine (SBELM) [11] constructs a sparse version of the Bayesian extreme learning machine by reducing redundant hidden neurons, and addresses multi-class classification by pairwise coupling (another name for the one-vs-one strategy).

However, both strategies suffer from the problem of ambiguous regions [12, 13], where classifiers make contradictory predictions. In addition, neither strategy directly produces probabilistic outputs over classes, although extra work can provide them; for the one-vs-one strategy, for example, additional post-processing, such as solving a linear equality-constrained convex quadratic programming problem [11], is needed to obtain probabilistic outputs. Weston et al. pointed out that a more natural way to solve multi-class problems is to construct a decision function that considers all classes simultaneously [14]. Based on this idea, many works have studied the multi-class problem. A simple and straightforward algorithm is multinomial logistic regression (MLR) [15]. Similarly, based on least squares regression (LSR), discriminative LSR (DLSR) [16] solves classification problems by introducing a technique called $\varepsilon$-dragging and translates the one-vs-rest training rule into multi-class classification.

Unlike the SVM, which is inherently binary, the RVM can be extended to solve multi-class classification problems without the help of the two strategies above. The multi-class relevance vector machine (mRVM) [17] employs the multinomial probit likelihood [18] by calculating regressors for all classes. Two versions of the mRVM (the mRVM1 and the mRVM2) have been implemented in different ways: the mRVM2 employs a flat prior on the hyper-parameters that control the sparsity of the resulting model, while the mRVM1 is a multi-class extension of the marginal likelihood maximization procedure in [7].

However, the mRVMs still do not ensure the consistency of signs between weights and class labels. Therefore, the mRVMs can be unstable and not robust to kernel parameters. To ensure this consistency in multi-class cases, a multi-class classification principle is defined in Section 2.1.

To relieve the drawbacks of the mRVMs, we extend the applicability of the PCVM and propose a multi-class probabilistic classification vector machine (mPCVM). The mPCVM introduces two types of truncated Gaussian priors over the weights of the training samples; which type is used depends on whether a training sample belongs to the given class or not. Through these priors, the multi-class classification principle is enforced in the mPCVM.

In this paper, two learning algorithms are investigated: the top-down algorithm mPCVM1 and the bottom-up algorithm mPCVM2, an incremental version of maximizing the marginal likelihood.

The main contributions of this paper can be summarized as follows:

  1. A multi-class version of the PCVM is proposed for multi-class classification;

  2. Compared with the SVM, the mPCVM directly produces probabilistic outputs for all classes without any post-processing step;

  3. A multi-class classification principle is proposed for the mPCVM. It states that weights should be consistent with class labels in multi-class cases;

  4. Due to the sparseness-encouraging prior, the generated model is sparse and its computational complexity in the test stage is greatly reduced.

The rest of this paper is organized as follows. The preamble and related works of the PCVM in Section 2.1 are followed by an introduction of the prior knowledge on weights in Section 2.2. Section 2.3 presents the detailed expectation-maximization (EM) procedures for the mPCVM1. Then an incremental learning algorithm mPCVM2 is introduced in Section 2.4. The experimental results and analyses are given in Section 3. Finally, Section 4 concludes our paper and presents future work.

2 Multi-class Probabilistic Classification Vector Machine

This section first defines the mathematical notation. Scalars are denoted by lower-case letters, vectors by bold lower-case letters, and matrices by bold upper-case letters. For a matrix $\bm{Z}$ of size $N\times C$, we use $\bm{z}_{n}$ and $\bm{z}_{c}$ to denote the $n$-th row and the $c$-th column of $\bm{Z}$, respectively, and the scalar $z_{nc}$ to denote the $(n,c)$-th element of $\bm{Z}$.

Let $D={\{\bm{x}_{n},t_{n}\}}_{n=1}^{N}$ be a training set of $N$ samples, where the column vector $\bm{x}_{n}\in{\mathbb{R}}^{d}$ denotes a data point, $t_{n}\in\{1,2,\cdots,C\}$ denotes the label of the $n$-th sample, and $C$ denotes the number of classes. The multi-class classification objective is to learn a classifier $f(\bm{x})$ that takes a vector $\bm{x}$ as input and assigns the correct class label to it. A general form of $f(\bm{x})$ is given as

f(\bm{x};\bm{w},b)=\sum_{m=1}^{M}\phi_{m}(\bm{x})w_{m}+b, \qquad (1)

where the weight vector $\bm{w}=(w_{1},w_{2},\cdots,w_{M})^{\mathrm{T}}$ and the bias term $b$ are the parameters of the model, and $\bm{\phi}(\cdot)={(\phi_{1}(\cdot),\phi_{2}(\cdot),\cdots,\phi_{M}(\cdot))}^{\mathrm{T}}$ is a fixed nonlinear basis function vector that maps a data point $\bm{x}$ to an $M$-dimensional feature vector.

2.1 Model Preparation

The formulation of the PCVM for binary classification is as follows:

h(\bm{x};\bm{w},b)=\Psi(\bm{\phi}(\bm{x})^{\mathrm{T}}\bm{w}+b), \qquad (2)

where $\Psi(\cdot)$ is the Gaussian cumulative distribution function and $\bm{\phi}(\bm{x})=[\phi_{1}(\bm{x}),\phi_{2}(\bm{x}),\ldots,\phi_{N}(\bm{x})]^{\mathrm{T}}$ is the basis function vector. If $\bm{\phi}(\bm{x})^{\rm T}\bm{w}+b$ is greater than 0, the datum is more likely to be in class 2 than in class 1 (we use the labels {1, 2} here instead of {-1, +1} to be compatible with the multi-class case). The PCVM uses a truncated Gaussian prior to constrain the weights of basis functions from class 1 to be nonpositive and those from class 2 to be nonnegative, in order to be consistent with the SVM [19].

For multi-class cases, we assign a set of independent weights to each class and extend the idea of the PCVM: in the $c$-th weight column vector (denoted as $\bm{w}_{c}$, $c\in\{1,2,...,C\}$), the weights of basis functions are restricted to be nonnegative for training samples in class $c$ and nonpositive otherwise. The definition of the multi-class classification principle for the mPCVM is given below.

Definition 1. Given multi-class data $D={\{\bm{x}_{n},t_{n}\}}_{n=1}^{N}$ with class labels $\{1,2,...,C\}$, the multi-class classification principle for the mPCVM is that the weight $w_{nc}$ of a datum $(\bm{x}_{n},t_{n})$ is consistent with the class labels if

\begin{cases}w_{nc}\geq 0&\qquad\text{if}\quad t_{n}=c\\ w_{nc}\leq 0&\qquad\text{if}\quad t_{n}\neq c\end{cases}. \qquad (3)

With this principle, weights are consistent with class labels, as in binary classification. Further, a class-based potential $y_{nc}$ for a training datum $(\bm{x}_{n},t_{n})$ is defined as

y_{nc}\triangleq\bm{\phi}(\bm{x}_{n})^{\rm T}\bm{w}_{c}+b_{c}, \qquad (4)

where $b_{c}$ is the bias for the $c$-th class. Subsequently, for the data $(\bm{X},\bm{t})$, where $\bm{X}=(\bm{x}_{1},\bm{x}_{2},...,\bm{x}_{N})$ and $\bm{t}=(t_{1},t_{2},...,t_{N})^{\mathrm{T}}$, we have

\bm{Y}=\bm{\Phi}\bm{W}+\bm{1}\bm{b}^{\mathrm{T}}, \qquad (5)

where $\bm{W}=(\bm{w}_{1},\bm{w}_{2},...,\bm{w}_{C})$, $\bm{b}=(b_{1},b_{2},...,b_{C})^{\mathrm{T}}$, $\bm{\Phi}=(\bm{\phi}(\bm{x}_{1}),\bm{\phi}(\bm{x}_{2}),...,\bm{\phi}(\bm{x}_{N}))^{\rm T}$, and $\bm{1}$ is an all-ones column vector.
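To make Eq. 5 concrete, the following minimal sketch (in Python with NumPy, assuming a hypothetical RBF kernel and randomly drawn $\bm{W}$ and $\bm{b}$ rather than learned ones) computes the class-based potentials and picks the class with the largest potential, the prediction rule described in the next paragraph.

```python
# A minimal sketch of Eq. 5, Y = Phi W + 1 b^T, with a hypothetical RBF kernel
# and randomly drawn (not learned) W and b; prediction takes the largest potential.
import numpy as np

def rbf_kernel(X, Xtrain, theta=1.0):
    # Phi[n, m] = exp(-theta * ||x_n - x_m||^2)
    d2 = ((X[:, None, :] - Xtrain[None, :, :]) ** 2).sum(-1)
    return np.exp(-theta * d2)

rng = np.random.default_rng(0)
N, d, C = 20, 2, 3
Xtrain = rng.normal(size=(N, d))      # training points that define the basis functions
Xnew = rng.normal(size=(5, d))        # points to classify
W = rng.normal(size=(N, C))           # hypothetical weight matrix (N x C)
b = rng.normal(size=C)                # hypothetical bias vector (C,)

Phi = rbf_kernel(Xnew, Xtrain)        # each row is phi(x)^T for one new point
Y = Phi @ W + b                       # class-based potentials y_nc (Eq. 5)
print(Y.argmax(axis=1) + 1)           # predicted class = argmax_c y_nc (labels 1..C)
```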

For any $\bm{x}_{n}$, the predicted class is determined by $\mathop{\arg\max}_{c}\{y_{nc}\}$. Following the procedure in [20], the term $y_{nc}$ in Eq. 4 is assumed to be coupled with an auxiliary random variable, denoted as $z_{nc}$,

z_{nc}\triangleq y_{nc}+\varepsilon_{nc}=\bm{\phi}(\bm{x}_{n})^{\mathrm{T}}\bm{w}_{c}+b_{c}+\varepsilon_{nc}, \qquad (6)

where $\varepsilon_{nc}$ obeys a standard normal distribution (i.e., $\varepsilon_{nc}\sim\mathcal{N}(0,1)$). The joint distribution of $\bm{Z}$ is given as

p(\bm{Z}|\bm{W},\bm{b})=(2\pi)^{-\frac{N\times C}{2}}\exp\left\{-\frac{1}{2}\sum_{c=1}^{C}\|\bm{z}_{c}-\bm{y}_{c}\|^{2}\right\}. \qquad (7)

The connection of the noisy term $\bm{z}_{n}$ to the target $t_{n}$, as defined in multinomial probit regression [18], is $t_{n}=i$ if $z_{ni}>z_{nj},\forall j\neq i$. Consequently, the joint probability of $t_{n}=i$ and $\bm{z}_{n}$ is

p(t_{n}=i,\bm{z}_{n}|\bm{W},\bm{b})=\delta(z_{ni}>z_{nj},\forall j\neq i)\prod_{c=1}^{C}\mathcal{N}(z_{nc}|\bm{\phi}(\bm{x}_{n})^{\rm T}\bm{w}_{c}+b_{c},1), \qquad (8)

where $\delta(\cdot)$ is the indicator function, and we have

p(\bm{t},\bm{Z}|\bm{W},\bm{b})=\prod_{n=1}^{N}\delta(z_{nt_{n}}>z_{nj},\forall j\neq t_{n})\prod_{c=1}^{C}\mathcal{N}(z_{nc}|y_{nc},1). \qquad (9)

By marginalizing out the noisy potential $\bm{z}_{n}$, the multinomial probit is obtained as (more details in Appendix 5.2)

p(t_{n}=i|\bm{W},\bm{b})=\int\delta(z_{ni}>z_{nj},\forall j\neq i)\prod_{c=1}^{C}\mathcal{N}(z_{nc}|y_{nc},1)\,d\bm{z}_{n}=\mathbb{E}_{\varepsilon_{ni}}\left[\prod_{j\neq i}\Psi(\varepsilon_{ni}+y_{ni}-y_{nj})\right]. \qquad (10)

And we have

p(\bm{t}|\bm{W},\bm{b})=\prod_{n=1}^{N}\mathbb{E}_{\varepsilon_{nt_{n}}}\left[\prod_{j\neq t_{n}}\Psi(\varepsilon_{nt_{n}}+y_{nt_{n}}-y_{nj})\right]. \qquad (11)
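The expectation over $\varepsilon_{ni}$ in Eq. 10 has no closed form, but it is straightforward to approximate. The sketch below is an assumption-laden illustration (not the implementation used in this paper) that estimates the class probabilities of Eq. 10 by Monte Carlo sampling of $\varepsilon$, for a hypothetical potential vector $\bm{y}_{n}$.

```python
# Monte Carlo estimate of the multinomial probit in Eq. 10:
# p(t_n = i | W, b) = E_eps[ prod_{j != i} Psi(eps + y_ni - y_nj) ],
# where Psi is the standard normal CDF and eps ~ N(0, 1).
import numpy as np
from scipy.stats import norm

def probit_class_prob(y_n, i, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    diff = y_n[i] - np.delete(y_n, i)                    # y_ni - y_nj for all j != i
    return norm.cdf(eps[:, None] + diff[None, :]).prod(axis=1).mean()

y_n = np.array([0.3, 1.2, -0.5])                         # hypothetical potentials
probs = [probit_class_prob(y_n, i) for i in range(3)]
print(probs, sum(probs))                                 # the estimates sum to ~1
```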

2.2 Priors over weights

As discussed in Section 2.1, to ensure the multi-class classification principle, a left-truncated Gaussian prior [21] and a right-truncated Gaussian prior over the weights are chosen. Their distributions are formulated in Eq. 12 and illustrated in Fig. 1:

\mathcal{N}_{t}(w_{nc}|0,{\alpha}_{nc}^{-1})=2\mathcal{N}(w_{nc}|0,{\alpha}_{nc}^{-1})\,\delta(f_{nc}w_{nc}>0), \qquad (12)

where ${\alpha}_{nc}$ is the inverse variance, and $f_{nc}=1$ if $t_{n}=c$ and $f_{nc}=-1$ otherwise.

Then, the prior distribution over the weight matrix $\bm{W}$ is

p(\bm{W}|\bm{A})=\prod_{n=1}^{N}\prod_{c=1}^{C}\mathcal{N}_{t}(w_{nc}|0,{\alpha}_{nc}^{-1}), \qquad (13)

where $\bm{A}$ is the matrix of the $\alpha_{nc}$'s (i.e., $\bm{A}\in\mathbb{R}^{N\times C}$).
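As an illustration of the sign-constrained prior in Eqs. 12-13, the sketch below draws weights from the left- or right-truncated Gaussian using SciPy's truncnorm; the value of $\alpha_{nc}$ here is arbitrary and only for demonstration.

```python
# Sampling from the truncated Gaussian prior of Eq. 12:
# w_nc >= 0 (left-truncated at 0) if f_nc = +1, w_nc <= 0 if f_nc = -1.
import numpy as np
from scipy.stats import truncnorm

def sample_truncated_prior(f_nc, alpha_nc, size=1, seed=0):
    scale = 1.0 / np.sqrt(alpha_nc)          # standard deviation = alpha^{-1/2}
    lo, hi = (0.0, np.inf) if f_nc > 0 else (-np.inf, 0.0)
    return truncnorm.rvs(lo / scale, hi / scale, loc=0.0, scale=scale,
                         size=size, random_state=seed)

print(sample_truncated_prior(+1, alpha_nc=4.0, size=5))   # all nonnegative
print(sample_truncated_prior(-1, alpha_nc=4.0, size=5))   # all nonpositive
```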

Figure 1: The truncated Gaussian prior over the weight $w_{nc}$. Left: when $f_{nc}=-1$, $p(w|\alpha)$ is a nonpositive, right-truncated Gaussian prior. Right: when $f_{nc}=+1$, $p(w|\alpha)$ is a nonnegative, left-truncated Gaussian prior.

We impose no further restriction on the bias term $b_{c}$, so a standard zero-mean Gaussian prior is appropriate:

p(\bm{b}|\bm{\beta})=\prod_{c=1}^{C}\mathcal{N}(b_{c}|0,{\beta_{c}}^{-1}), \qquad (14)

where ${\beta}_{c}$ is the inverse variance and $\bm{\beta}$ is the column vector that stacks the ${\beta}_{c}$'s.

To follow the Bayesian framework and to encourage model sparsity, hyper-priors over $\bm{A}$ and $\bm{\beta}$ need to be defined. The truncated Gaussian belongs to the exponential family, and the conjugate distribution of its precision is the Gamma distribution (more details in Appendix 5.1). The conjugate prior is adopted in this paper because it is convenient and belongs to the analytically favorable class of subjective priors for the exponential family. Other priors, such as objective priors and empirical priors [22], are also possible. In the optimization procedure of our experiments, most of the weights converge to zero.

The forms of Gamma prior distributions are presented as

p(\bm{A})=\prod_{n=1}^{N}\prod_{c=1}^{C}\mathrm{Gamma}({\alpha}_{nc}|u_{1},v_{1}), \qquad (15)

and

p(\bm{\beta})=\prod_{c=1}^{C}\mathrm{Gamma}({\beta}_{c}|u_{2},v_{2}), \qquad (16)

where $(u_{1},v_{1})$ and $(u_{2},v_{2})$ are the hyper-parameters of the Gamma hyper-priors. With these assumptions in place, marginalizing with respect to ${\alpha}_{nc}$ yields the complete prior over the weight $w_{nc}$:

p(w_{nc}|u_{1},v_{1})=\int_{0}^{\infty}p(w_{nc}|{\alpha}_{nc})\,p({\alpha}_{nc}|u_{1},v_{1})\,d{\alpha}_{nc}=\delta(f_{nc}w_{nc}>0)\sqrt{\frac{2}{\pi}}\,\frac{v_{1}^{u_{1}}\Gamma(u_{1}+\frac{1}{2})}{\Gamma(u_{1})}\left(v_{1}+\frac{w_{nc}^{2}}{2}\right)^{-(u_{1}+\frac{1}{2})}. \qquad (17)

For the bias $b_{c}$, similarly to $w_{nc}$, its prior distribution is

p(b_{c}|u_{2},v_{2})=\int_{0}^{\infty}p(b_{c}|{\beta}_{c})\,p({\beta}_{c}|u_{2},v_{2})\,d{\beta}_{c}=\frac{v_{2}^{u_{2}}\Gamma(u_{2}+\frac{1}{2})}{\sqrt{2\pi}\,\Gamma(u_{2})}\left(v_{2}+\frac{b_{c}^{2}}{2}\right)^{-(u_{2}+\frac{1}{2})}. \qquad (18)

To make these priors non-informative, we fix $u_{1}$, $v_{1}$, $u_{2}$ and $v_{2}$ to small values [3, 23]. The plate graph of our proposed model is shown in Fig. 2.

Figure 2: Plate diagram of the model's random variables. $u_{1}$, $v_{1}$, $u_{2}$ and $v_{2}$ are the parameters of the Gamma distributions. $\bm{A}$ is the parameter of the truncated Gaussian distribution, and $\bm{\beta}$ is the parameter of the Gaussian distribution. $\bm{A}$ and $\bm{W}$ are $N\times C$ matrices. $\bm{\beta}$ and $\bm{b}$ are $C\times 1$ vectors. $\bm{K}$ is the $N\times N$ kernel matrix. $\bm{Z}$ is a matrix of size $N\times C$. $\bm{K}$ and $\bm{t}$ are shaded to indicate that they are observable.

2.3 A top-down algorithm: mPCVM1

This subsection derives the algorithm mPCVM1 by means of the expectation-maximization (EM) algorithm [24], a general algorithm for MAP estimation when observations are incomplete. In the following, we detail the expectation (E) step and the maximization (M) step of the mPCVM1.

The joint posterior probability of $\bm{W}$ and $\bm{b}$ is first estimated in the E-step. The posterior probability is expressed as

p(\bm{W},\bm{b}|\bm{Z},\bm{A},\bm{\beta})=\frac{p(\bm{Z}|\bm{W},\bm{b})\,p(\bm{W}|\bm{A})\,p(\bm{b}|\bm{\beta})}{p(\bm{Z}|\bm{A},\bm{\beta})}. \qquad (19)

Equivalently, the log-posterior is obtained as

\log p(\bm{W},\bm{b}|\bm{Z},\bm{A},\bm{\beta})\propto\log p(\bm{Z}|\bm{W},\bm{b})+\log p(\bm{W}|\bm{A})+\log p(\bm{b}|\bm{\beta})\propto-\bm{b}^{\rm T}\bm{B}\bm{b}+\sum_{c=1}^{C}\big\{\bm{z}_{c}^{\rm T}(\bm{\Phi}\bm{w}_{c}+b_{c}\bm{1})-\bm{w}_{c}^{\rm T}\bm{A}_{c}\bm{w}_{c}+(\bm{\Phi}\bm{w}_{c}+b_{c}\bm{1})^{\rm T}\bm{z}_{c}-(\bm{\Phi}\bm{w}_{c}+b_{c}\bm{1})^{\rm T}(\bm{\Phi}\bm{w}_{c}+b_{c}\bm{1})\big\}, \qquad (20)

where $\bm{z}_{c}$, $\bm{w}_{c}$ and $\bm{\alpha}_{c}$ are the $c$-th columns of $\bm{Z}$, $\bm{W}$ and $\bm{A}$, respectively, and $\bm{A}_{c}={\rm diag}(\bm{\alpha}_{c})$ and $\bm{B}={\rm diag}(\bm{\beta})$ are diagonal matrices.

Note that a normal zero-mean Gaussian prior over $\bm{W}$ is used in Eq. 20 for computational simplicity; the multi-class classification principle for the mPCVM is enforced in Eq. 24.

2.3.1 Expectation Step

In the E-step, we compute the expectation of the log-posterior, i.e., the Q function

Q(\bm{W},\bm{b}|\bm{W}^{old},\bm{b}^{old})\triangleq\mathbb{E}_{\bm{Z},\bm{A},\bm{\beta}}\left[\log p(\bm{W},\bm{b}|\bm{Z},\bm{A},\bm{\beta})\right].

Hence, we can obtain the Q function:

Q(\bm{W},\bm{b}|\bm{W}^{old},\bm{b}^{old})=-\bm{b}^{\rm T}\bm{\overline{B}}\bm{b}+\sum_{c=1}^{C}\big\{\bm{\overline{z}}_{c}^{\rm T}(\bm{\Phi}\bm{w}_{c}+b_{c}\bm{1})-\bm{w}_{c}^{\rm T}\bm{\overline{A}}_{c}\bm{w}_{c}+(\bm{\Phi}\bm{w}_{c}+b_{c}\bm{1})^{\rm T}\bm{\overline{z}}_{c}-(\bm{\Phi}\bm{w}_{c}+b_{c}\bm{1})^{\rm T}(\bm{\Phi}\bm{w}_{c}+b_{c}\bm{1})\big\}, \qquad (21)

where $\bm{\overline{Z}}=\mathbb{E}[\bm{Z}|\bm{t},\bm{W}^{old},\bm{b}^{old}]$, $\bm{\overline{A}}_{c}=\mathbb{E}[\bm{A}_{c}|\bm{t},\bm{W}^{old},\bm{b}^{old}]$, and $\bm{\overline{B}}=\mathbb{E}[\bm{B}|\bm{t},\bm{W}^{old},\bm{b}^{old}]$. Note that this formula omits the terms that do not involve $\bm{W}$ or $\bm{b}$.

The posterior expectation of $z_{nj}$ for all $j\neq t_{n}$ (assuming $t_{n}=i$) is obtained as

\overline{z}_{nj}=\int z_{nj}\,p(\bm{z}_{n}|t_{n},\bm{W}^{old},\bm{b}^{old})\,d\bm{z}_{n}=\int z_{nj}\,\frac{p(t_{n},\bm{z}_{n}|\bm{W}^{old},\bm{b}^{old})}{p(t_{n}|\bm{W}^{old},\bm{b}^{old})}\,d\bm{z}_{n}=y_{nj}-\frac{\mathbb{E}_{\varepsilon_{ni}}\Big[\mathcal{N}(\varepsilon_{ni}|y_{nj}-y_{ni},1)\prod_{k\neq i,j}\Psi(\varepsilon_{ni}+y_{ni}-y_{nk})\Big]}{\mathbb{E}_{\varepsilon_{ni}}\Big[\prod_{k\neq i}\Psi(\varepsilon_{ni}+y_{ni}-y_{nk})\Big]}, \qquad (22)

where $\bm{z}_{n}$ is the $n$-th row of $\bm{Z}$. The posterior expectation of $z_{ni}$ is

\overline{z}_{ni}=\int z_{ni}\,p(\bm{z}_{n}|t_{n},\bm{W}^{old},\bm{b}^{old})\,d\bm{z}_{n}=\int z_{ni}\,\frac{p(t_{n},\bm{z}_{n}|\bm{W}^{old},\bm{b}^{old})}{p(t_{n}|\bm{W}^{old},\bm{b}^{old})}\,d\bm{z}_{n}=y_{ni}+\sum_{j\neq i}(y_{nj}-\overline{z}_{nj}). \qquad (23)
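The expectations in Eqs. 22-23 can again be approximated by sampling $\varepsilon_{ni}$. The following sketch is our own illustration under the stated assumptions (not the paper's implementation); it computes the posterior means $\overline{z}_{nj}$ and $\overline{z}_{ni}$ for a hypothetical potential vector $\bm{y}_{n}$ and an observed label $t_{n}=i$.

```python
# Monte Carlo sketch of the E-step updates in Eq. 22-23 for one sample n with t_n = i.
import numpy as np
from scipy.stats import norm

def posterior_z_bar(y_n, i, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)                    # eps_ni ~ N(0, 1)
    C = len(y_n)
    cdf = norm.cdf(eps[:, None] + y_n[i] - y_n[None, :])    # Psi(eps + y_ni - y_nk)
    denom = np.delete(cdf, i, axis=1).prod(axis=1).mean()   # E[prod_{k != i} Psi(.)]
    z_bar = np.empty(C)
    for j in range(C):
        if j == i:
            continue
        pdf = norm.pdf(eps, loc=y_n[j] - y_n[i])            # N(eps | y_nj - y_ni, 1)
        keep = [k for k in range(C) if k not in (i, j)]
        num = (pdf * cdf[:, keep].prod(axis=1)).mean()
        z_bar[j] = y_n[j] - num / denom                     # Eq. 22
    z_bar[i] = y_n[i] + sum(y_n[j] - z_bar[j] for j in range(C) if j != i)  # Eq. 23
    return z_bar

print(posterior_z_bar(np.array([0.3, 1.2, -0.5]), i=1))
```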

We present the details in Appendix 5.4. Combining this with Section 2.2, the posterior expectation of $\alpha_{nc}$ is

\overline{\alpha}_{nc}=\int_{0}^{+\infty}\alpha_{nc}\,p(\alpha_{nc}|w_{nc},u_{1},v_{1})\,d{\alpha}_{nc}=\frac{\int_{0}^{+\infty}\alpha_{nc}\,p(w_{nc}|\alpha_{nc})\,p(\alpha_{nc}|u_{1},v_{1})\,d\alpha_{nc}}{p(w_{nc}|u_{1},v_{1})}=\frac{2u_{1}+1}{w_{nc}^{2}+2v_{1}}, \qquad (24)

which holds when $f_{nc}w_{nc}>0$; otherwise, i.e., when $w_{nc}=0$, we set $\overline{\alpha}_{nc}=\infty$.

Similarly, combining this with Section 2.2, the posterior expectation of $\beta_{c}$ is

\overline{\beta}_{c}=\int_{0}^{\infty}\beta_{c}\,p(\beta_{c}|b_{c},u_{2},v_{2})\,d\beta_{c}=\frac{\int_{0}^{\infty}\beta_{c}\,p(b_{c}|\beta_{c})\,p(\beta_{c}|u_{2},v_{2})\,d\beta_{c}}{p(b_{c}|u_{2},v_{2})}=\frac{2u_{2}+1}{b_{c}^{2}+2v_{2}}. \qquad (25)
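The closed-form expectations in Eqs. 24-25 are simple elementwise updates; a small sketch is given below, assuming the hyper-parameters are fixed to small values as stated above. Weights that violate the sign constraint of Definition 1, or that equal zero, receive an infinite $\overline{\alpha}_{nc}$ and are pruned.

```python
# Elementwise E-step expectations of the hyper-parameters, Eq. 24-25.
import numpy as np

u1 = v1 = u2 = v2 = 1e-4          # small values make the hyper-priors non-informative

def alpha_bar(W, F):
    # F[n, c] = +1 if t_n = c else -1 (the sign pattern of Definition 1)
    A = (2 * u1 + 1) / (W ** 2 + 2 * v1)
    A[F * W <= 0] = np.inf        # violated or zero weights get an infinite alpha
    return A

def beta_bar(b):
    return (2 * u2 + 1) / (b ** 2 + 2 * v2)
```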

2.3.2 Maximization Step

In the M-step, we compute the partial derivatives of $Q$ with respect to $\bm{w}_{c}$ and $b_{c}$:

\frac{\partial Q}{\partial\bm{w}_{c}}=2\bm{\Phi}^{\rm T}\bm{\overline{z}}_{c}-2\bm{\Phi}^{\rm T}\bm{\Phi}\bm{w}_{c}-2b_{c}\bm{\Phi}^{\rm T}\bm{1}-2\bm{\overline{A}}_{c}\bm{w}_{c}, \qquad (26)
\frac{\partial Q}{\partial b_{c}}=2\bm{1}^{\rm T}\bm{\overline{z}}_{c}-2\bm{1}^{\rm T}\bm{\Phi}\bm{w}_{c}-2Nb_{c}-2\overline{\beta}_{c}b_{c}. \qquad (27)

Although the joint maximization of $Q$ with respect to $\bm{W}$ and $\bm{b}$ is difficult, the updates of $\bm{W}$ and $\bm{b}$ can be derived by setting $\partial Q/\partial\bm{W}=0$ and $\partial Q/\partial\bm{b}=0$, respectively:

\bm{w}_{c}^{\rm new}=(\bm{\Phi}^{\rm T}\bm{\Phi}+{\rm diag}(\bm{\overline{\alpha}}_{c}))^{-1}(\bm{\Phi}^{\rm T}\bm{\overline{z}}_{c}-b_{c}\bm{\Phi}^{\rm T}\bm{1}), \qquad (28)
b_{c}^{\rm new}=\frac{\bm{1}^{\rm T}\bm{\overline{z}}_{c}-\bm{1}^{\rm T}\bm{\Phi}\bm{w}_{c}}{N+\overline{\beta}_{c}}. \qquad (29)
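For one class $c$, the M-step updates in Eqs. 28-29 amount to a regularized least-squares solve. The sketch below assumes the pruned basis functions (those with infinite $\overline{\alpha}$) have already been removed from the inputs, since infinite entries cannot enter the linear system directly.

```python
# M-step updates for one class c (Eq. 28-29), assuming Phi, z_bar_c, alpha_bar_c
# and beta_bar_c contain only the columns/entries kept after pruning.
import numpy as np

def m_step_class(Phi, z_bar_c, alpha_bar_c, beta_bar_c, b_c):
    N = Phi.shape[0]
    ones = np.ones(N)
    lhs = Phi.T @ Phi + np.diag(alpha_bar_c)              # Phi^T Phi + diag(alpha_bar_c)
    rhs = Phi.T @ z_bar_c - b_c * (Phi.T @ ones)
    w_c_new = np.linalg.solve(lhs, rhs)                   # Eq. 28
    b_c_new = (ones @ z_bar_c - ones @ (Phi @ w_c_new)) / (N + beta_bar_c)  # Eq. 29
    return w_c_new, b_c_new
```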

The pseudo code of the mPCVM1 is summarized in Algorithm 1, in which the majority of the weights in $\bm{W}$ are pruned during the iterations. Hence, the mPCVM1 is regarded as a top-down algorithm.

Algorithm 1 mPCVM1
  Input: training data $\bm{X}$, class labels $\bm{t}$, kernel parameter $\theta$.
  Output: mPCVM1 classifier, including $\bm{W}$ and $\bm{b}$.

  Compute the kernel matrix $\bm{\Phi}$ with the kernel function and the kernel parameter; randomly initialize $\bm{\overline{Z}}$, $\bm{\overline{A}}$ and $\bm{\overline{\beta}}$
  repeat
     \\ M step
     Update $\bm{W}$ and $\bm{b}$ by Eq. 28 and Eq. 29
     Prune elements of the weight matrix $\bm{W}$ whose corresponding elements of $\bm{\overline{A}}$ exceed a given threshold
     \\ E step
     Update $\bm{\overline{A}}$ and $\bm{\overline{\beta}}$ by Eq. 24 and Eq. 25
     Combine the E-step information and update $\bm{\overline{Z}}$ by Eq. 22 and Eq. 23
  until convergence

2.4 A bottom-up algorithm: mPCVM2

A constructive framework for fast marginal likelihood maximization is proposed in [7] and is extended to the mRVMs in [17], where each sample $n$ is assumed to have the same $\alpha_{n}$ for all classes. Following this framework, we propose the mPCVM2, which ensures the multi-class classification principle in the same way as the mPCVM1. Since a sample belonging to a particular class contributes unequally to the prediction of the different classes, we assume that each sample has a distinct $\alpha_{nc}$ for each class.

For conciseness, we only consider the weight matrix $\bm{W}$ and ignore the bias $\bm{b}$ by setting $\bm{b}=\bm{0}$. The posterior of $\bm{W}$ is given as

p(\bm{W}|\bm{Z},\bm{A})\propto p(\bm{Z}|\bm{W})\,p(\bm{W}|\bm{A})\propto\prod_{c=1}^{C}\mathcal{N}\big((\bm{\Phi}^{\rm T}\bm{\Phi}+\bm{A}_{c})^{-1}\bm{\Phi}^{\rm T}\bm{z}_{c},\,(\bm{\Phi}^{\rm T}\bm{\Phi}+\bm{A}_{c})^{-1}\big)\,\delta(\bm{F}_{c}\bm{w}_{c}>0), \qquad (30)

where $\bm{F}_{c}={\rm diag}(f_{1c},f_{2c},...,f_{Nc})$. The MAP estimate of $w_{nc}$ is then

w_{nc}=\delta(w_{nc}f_{nc}>0)\big((\bm{\Phi}^{\rm T}\bm{\Phi}+\bm{A}_{c})^{-1}\bm{\Phi}^{\rm T}\bm{z}_{c}\big)_{n}. \qquad (31)

In the mPCVM2, we first compute the marginal likelihood $p(\bm{Z}|\bm{A})=\int p(\bm{Z}|\bm{W})\,p(\bm{W}|\bm{A})\,d\bm{W}$ by integrating out $\bm{W}$. We formulate the marginal likelihood, or equivalently its logarithm $\mathcal{L}(\bm{A})$, as

\mathcal{L}(\bm{A})=\sum_{c=1}^{C}\log p(\bm{z}_{c}|\bm{\alpha}_{c})=\sum_{c=1}^{C}\log\int p(\bm{z}_{c}|\bm{w}_{c})\,p(\bm{w}_{c}|\bm{\alpha}_{c})\,d\bm{w}_{c}=-\frac{1}{2}\sum_{c=1}^{C}\big(N\log 2\pi+\log|\bm{\mathcal{C}}_{c}|+\bm{z}_{c}^{\rm T}\bm{\mathcal{C}}_{c}^{-1}\bm{z}_{c}\big), \qquad (32)

where $\bm{\mathcal{C}}_{c}=\bm{I}+\bm{\Phi}\,{\rm diag}(\bm{\alpha}_{c})^{-1}\bm{\Phi}^{\rm T}$ and $\bm{I}$ is the identity matrix. Note that we assume $w_{nc}$ follows a truncated Gaussian with precision $\alpha_{nc}$.

Then, following the procedure of [7], we decompose $\bm{\mathcal{C}}_{c}$ as

\bm{\mathcal{C}}_{c}=\alpha_{nc}^{-1}\bm{\phi}(\bm{x}_{n})\bm{\phi}(\bm{x}_{n})^{\rm T}+\sum_{i\neq n}^{N}\alpha_{ic}^{-1}\bm{\phi}(\bm{x}_{i})\bm{\phi}(\bm{x}_{i})^{\rm T}=\alpha_{nc}^{-1}\bm{\phi}(\bm{x}_{n})\bm{\phi}(\bm{x}_{n})^{\rm T}+\bm{\mathcal{C}}_{c,-n}, \qquad (33)

where $\bm{\mathcal{C}}_{c,-n}$ is $\bm{\mathcal{C}}_{c}$ without the contribution of $\bm{\phi}(\bm{x}_{n})$. Furthermore, the determinant and the inverse of $\bm{\mathcal{C}}_{c}$ can be decomposed as

|\bm{\mathcal{C}}_{c}|=|\bm{\mathcal{C}}_{c,-n}|\,\big|1+\alpha_{nc}^{-1}\bm{\phi}(\bm{x}_{n})^{\rm T}\bm{\mathcal{C}}_{c,-n}^{-1}\bm{\phi}(\bm{x}_{n})\big|, \qquad (34)
\bm{\mathcal{C}}_{c}^{-1}=\bm{\mathcal{C}}_{c,-n}^{-1}-\frac{\bm{\mathcal{C}}_{c,-n}^{-1}\bm{\phi}(\bm{x}_{n})\bm{\phi}(\bm{x}_{n})^{\rm T}\bm{\mathcal{C}}_{c,-n}^{-1}}{\alpha_{nc}+\bm{\phi}(\bm{x}_{n})^{\rm T}\bm{\mathcal{C}}_{c,-n}^{-1}\bm{\phi}(\bm{x}_{n})}. \qquad (35)

With these decompositions in place, $\mathcal{L}(\bm{A})$ can be rewritten as

\mathcal{L}(\bm{A})=\sum_{c=1}^{C}-\frac{1}{2}\bigg[N\log(2\pi)+\log|\bm{\mathcal{C}}_{c,-n}|+\bm{z}_{c}^{\rm T}\bm{\mathcal{C}}_{c,-n}^{-1}\bm{z}_{c}-\log\alpha_{nc}+\log\big(\alpha_{nc}+\bm{\phi}(\bm{x}_{n})^{\rm T}\bm{\mathcal{C}}_{c,-n}^{-1}\bm{\phi}(\bm{x}_{n})\big)-\frac{\big(\bm{\phi}(\bm{x}_{n})^{\rm T}\bm{\mathcal{C}}_{c,-n}^{-1}\bm{z}_{c}\big)^{2}}{\alpha_{nc}+\bm{\phi}(\bm{x}_{n})^{\rm T}\bm{\mathcal{C}}_{c,-n}^{-1}\bm{\phi}(\bm{x}_{n})}\bigg]=\sum_{c=1}^{C}\bigg[\mathcal{L}(\bm{A}_{c,-n})+\frac{1}{2}\Big(\log(\alpha_{nc})-\log(\alpha_{nc}+s_{nc})+\frac{q_{nc}^{2}}{\alpha_{nc}+s_{nc}}\Big)\bigg]=\sum_{c=1}^{C}\big[\mathcal{L}(\bm{A}_{c,-n})+l(\alpha_{nc})\big], \qquad (36)

where we have the “sparsity factor”

s_{nc}=\bm{\phi}(\bm{x}_{n})^{\rm T}\bm{\mathcal{C}}_{c,-n}^{-1}\bm{\phi}(\bm{x}_{n}), \qquad (37)

and the “quality factor”

q_{nc}=\bm{\phi}(\bm{x}_{n})^{\rm T}\bm{\mathcal{C}}_{c,-n}^{-1}\bm{z}_{c}. \qquad (38)

By decomposing $\mathcal{L}(\bm{A})$, we isolate the term $l(\alpha_{nc})$, which is the contribution of $\alpha_{nc}$ to the marginal likelihood. Analytically, $\mathcal{L}(\bm{A})$ has a unique maximum with respect to $\alpha_{nc}$:

\begin{cases}\alpha_{nc}=\frac{s_{nc}^{2}}{q_{nc}^{2}-s_{nc}}&\quad\text{if}\ q_{nc}^{2}>s_{nc}\\ \alpha_{nc}=\infty&\quad\text{if}\ q_{nc}^{2}\leq s_{nc}\end{cases}. \qquad (39)
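A direct (non-incremental) way to evaluate Eqs. 37-39 is sketched below; the full mPCVM2 maintains these quantities incrementally, but the sketch makes the update of a single $\alpha_{nc}$ explicit. We treat $\bm{\phi}(\bm{x}_{n})$ as the $n$-th column of the kernel matrix, which coincides with the $n$-th row for a symmetric kernel.

```python
# Sparsity factor (Eq. 37), quality factor (Eq. 38) and the alpha update (Eq. 39)
# for one basis function n and one class c, computed directly from C_{c,-n}.
import numpy as np

def update_alpha_nc(Phi, z_c, alpha_c, n):
    N = Phi.shape[0]
    use = np.isfinite(alpha_c)          # currently active basis functions
    use[n] = False                      # exclude phi(x_n): this builds C_{c,-n}
    C_minus = np.eye(N) + Phi[:, use] @ np.diag(1.0 / alpha_c[use]) @ Phi[:, use].T
    Cinv = np.linalg.inv(C_minus)
    phi_n = Phi[:, n]
    s = phi_n @ Cinv @ phi_n            # sparsity factor s_nc
    q = phi_n @ Cinv @ z_c              # quality factor q_nc
    return s ** 2 / (q ** 2 - s) if q ** 2 > s else np.inf
```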

The pseudo code of the mPCVM2 can be summarized in Algorithm 2.

Algorithm 2 mPCVM2
  Input: training data $\bm{X}$, class labels $\bm{t}$, kernel parameter $\theta$.
  Output: mPCVM2 classifier, including $\bm{W}$.

  Compute the kernel matrix $\bm{\Phi}$ with the kernel function and the kernel parameter; set $\bm{A}=\infty$ and epoch $=1$; randomly initialize $\bm{Z}$; $N$ is the number of training samples.
  repeat
     for $c=1$ to $C$ do
      Compute $s_{nc}$ for $n\in\{1,2,...,N\}$ by Eq. 37
      Compute $q_{nc}$ for $n\in\{1,2,...,N\}$ by Eq. 38
      if epoch = 1 then
       $n=\arg\max_{n}\{q_{nc}^{2}-s_{nc}\}$
      else
       if $\{n\,|\,\alpha_{nc}=\infty,\,q_{nc}^{2}>s_{nc}\}\neq\varnothing$ then
        $n=\arg\max_{n}\{q_{nc}^{2}-s_{nc}\,|\,\alpha_{nc}=\infty\}$
       else if $\{n\,|\,\alpha_{nc}<\infty,\,q_{nc}^{2}<s_{nc}\}\neq\varnothing$ then
        $n=\arg\min_{n}\{q_{nc}^{2}-s_{nc}\,|\,\alpha_{nc}<\infty\}$
       else
        randomly choose $n$ from $\{n\,|\,\alpha_{nc}<\infty\}$
       end if
      end if
      if $q_{nc}^{2}>s_{nc}$ and $\alpha_{nc}=\infty$ then
       Set $\alpha_{nc}$ by Eq. 39
      else if $q_{nc}^{2}>s_{nc}$ and $\alpha_{nc}<\infty$ then
       Recalculate $\alpha_{nc}$ by Eq. 39
      else if $q_{nc}^{2}\leq s_{nc}$ and $\alpha_{nc}<\infty$ then
       Set $\alpha_{nc}=\infty$ by Eq. 39
      end if
     end for
     Update $\bm{W}$ by Eq. 31
     Update $\bm{Z}$ by Eq. 22 and Eq. 23
     epoch = epoch + 1
  until convergence

Comparing the mPCVM2 with the mPCVM1, there are four main differences. First, they have different objective functions: the mPCVM1 maximizes a posterior estimate while the mPCVM2 maximizes the marginal likelihood. Second, they have different initial states: the mPCVM1 initially contains all vectors in the model while the mPCVM2 starts with only one vector per class. Third, they use very different update strategies: in each iteration, the mPCVM1 gradually deletes the vectors associated with large $\alpha$'s, while in the mPCVM2 new vectors can be added or existing vectors can be deleted. Finally, they treat basis functions differently: the mPCVM1 considers a vector only once, so basis functions that have been deleted cannot be added back, whereas the mPCVM2 may re-add a basis function that was removed earlier. Hence, the number of iterations of the mPCVM1 may be smaller than that of the mPCVM2. However, the distinguishing characteristic of the mPCVM2, namely that it can re-add vectors that were wrongly deleted in previous iterations, makes it more likely to escape local optima and achieve higher accuracy. In short, the mPCVM1 operates in a top-down fashion while the mPCVM2 operates bottom-up.

3 Experimental Studies

3.1 Synthetic Data Sets

This subsection presents experimental results of the mPCVM1, the mPCVM2, the mRVM1 and the mRVM2 on two synthetic data sets, i.e., Overlap and Overclass, to analyze their differences.

The data set Overlap is generated from several different 2-dimensional Gaussian distributions. It contains 3 classes but exhibits heavy overlaps. In this case, a nonlinear classifier is necessary. We compare four algorithms on this data set and mark the resulting class regions with different colors. The relevant vectors are marked with circles.

In Fig. 3, although the mRVM1 spots the leftmost pivotal relevant vector, which belongs to the blue class and lies in the narrow gap between two clusters in red, this blue key point contributes wrongly to the red class instead of the blue class, resulting in an erroneous classification boundary. We infer that the violation of the multi-class classification principle proposed in Section 2.1 leads to the degradation of the mRVMs, although the mRVM2 avoids this vulnerability. In contrast, the mPCVMs, which are designed to ensure this principle, obtain correct regions. Meanwhile, the leftmost pivotal blue point should contribute positively to the blue class, negatively to the red class, and not at all to the black class. However, this point is treated as a relevant vector for all classes in the mRVMs, and its influence on the black class could be uncontrollable noise in this case. In other words, a point should act as a relevant vector for only one or two classes rather than all of them, which becomes apparent when the number of classes is large. Finally, although the mRVMs have fewer relevant vectors (9 for the mRVM1 and 7 for the mRVM2) than the mPCVMs (21 for the mPCVM1 and 7 for the mPCVM2), in terms of non-zero weights the mPCVMs (21 and 7, respectively) are sparser than the mRVMs (27 and 21, respectively). We argue that the mPCVMs are superior to the mRVMs because they choose relevant vectors precisely for each class and avoid unnecessary noise.

Figure 3: Decision regions produced by (a) the mPCVM1, (b) the mPCVM2, (c) the mRVM1 and (d) the mRVM2 on Overlap. The panels illustrate the areas assigned by the classifiers to each class. Black points concentrate in the bottom right corner with a skewed ellipse shape. Red points appear in three groups split by blue points in the middle of the canvas. Red circles indicate relevant vectors. In the results, the classifiers claim that any point in the green region should be reckoned as the black class, any point in the purple region as the red class, and any point in the white region as the blue class. The four classifiers are compared with the same RBF kernel using the same parameter $\theta=1$.

The second synthetic data set, Overclass, has 127 points in ten classes. The challenge of this data set is the high class imbalance. The minority class (pink) has only 3 points while the majority class (gray) contains 28 points, more than 9 times as many as the minority class.

Table 1: The number of relevant vectors (with the number of non-zero weights in parentheses) for the mRVM1, the mRVM2, the mPCVM1 and the mPCVM2 on Overclass. The mRVM1 has no relevant vector in class 6 and class 10, which makes it blur these two classes.
Class mRVM1 mRVM2 mPCVM1 mPCVM2
class 1 3(30) 3(30) 8(8) 2(2)
class 2 3(30) 2(20) 8(8) 2(2)
class 3 2(20) 3(30) 7(7) 2(2)
class 4 5(50) 2(20) 8(8) 1(1)
class 5 5(50) 1(10) 7(7) 1(1)
class 6 0(0) 1(10) 8(8) 1(1)
class 7 4(40) 1(10) 7(7) 1(1)
class 8 3(30) 1(10) 7(7) 1(1)
class 9 3(30) 1(10) 8(8) 1(1)
class 10 0(0) 1(10) 8(8) 1(1)
Figure 4: The patches for the classes developed by (a) the mPCVM1, (b) the mPCVM2, (c) the mRVM1 and (d) the mRVM2 on Overclass. The four algorithms run with the same RBF kernel using the same parameter $\theta=1$. In this example, the mPCVM1 and the mPCVM2 yield the correct results, yet the mRVM1 and the mRVM2 fail: the mRVMs wholly ignore the class in pink, and the mRVM1 generates a narrow band for the class in brown.

We conducted an experiment on Overclass with the four algorithms. In Fig. 4, each algorithm dyes every region deemed to belong to a certain class with a separate color. In the current experimental setting, the mRVMs erroneously ignore the minority class and consider the pink class to be part of the crimson class. The area of the brown class is narrowed down to a band because its sample size is relatively small. In our algorithms, by contrast, each class contributes equally to the classification boundary; in other words, the classification is unlikely to be affected by the sample size of each class. This property gives the mPCVMs an advantage when they are applied to problems with a large number of classes and imbalanced class sizes.

Table 1 lists the number of relevant vectors and the number of non-zero weights for the four learning algorithms. Owing to the absence of relevant vectors belonging to the pink class (class 10) and the brown class (class 6), the mRVMs cannot separate out these two classes. The classification boundaries of the mRVM1 and the mRVM2 are almost linear, while those of the mPCVM1 and the mPCVM2 are mostly curved. The results show that the mRVM1, the mRVM2, the mPCVM1 and the mPCVM2 have 28, 16, 58 and 13 relevant vectors, respectively. Each relevant vector in the mRVMs has 10 non-zero weights since there are ten classes, so the mRVM1 and the mRVM2 have 280 and 160 non-zero weights, respectively, while the mPCVM1 and the mPCVM2 have 76 and 13 non-zero weights, respectively. Therefore, the mPCVMs are sparser than the mRVMs.

3.2 Benchmark Data Sets

Table 2: Summary of data sets
Data No.Train No.test Dim Class
Breast 546 137 9 2
Glass 171 43 9 6
Heart 235 59 16 5
Iris 120 30 4 3
Vowel 792 198 13 11
Wine 142 36 13 3
Wine (red) 1279 320 11 6
Wine (white) 3918 980 11 7

In order to evaluate the performance of the mPCVMs, we compare different algorithms on 8 benchmark data sets (https://archive.ics.uci.edu/ml/index.php). The information of these data sets is summarized in Table 2. We randomly partition every data set into a training set and a test set if no pre-partition exists. In addition, standardization is conducted dimension-wise on the training set and the test set. The comparison algorithms include the mRVM1 [25], the mRVM2 [25], the SVM [10], the DLSR [16] and the MLR. As the mPCVM1 and the mPCVM2 need the RBF kernel parameter $\theta$ to be set, we follow the method suggested in [26]. The hyper-parameter is tuned on the first five partitions and the best value is picked according to the mean accuracy over these five partitions. Then, the performance is measured on the remaining 45 partitions.

Table 3: Comparison of the mPCVM1, the mPCVM2, the mRVM1, the mRVM2, the SVM and the DLSR on 8 benchmark data sets, under the metrics of the ERR and the AUC. We carried out 45 runs on each data set, and report the averages as well as their standard deviation.
ERR Breast Glass Heart Iris Wine Wine (red) Wine (white) Vowel
mRVM1 2.952(1.329) 39.380(7.377) 34.501(5.009) 4.667(3.787) 3.765(3.706) 39.674(2.619) 44.386(1.648) 27.374(4.014)
mRVM2 3.001(1.649) 33.902(7.478) 35.104(5.605) 4.815(4.178) 2.963(2.865) 40.333(2.253) 43.000(1.287) 6.779(1.996)
SVM 3.277(1.251) 40.672(6.645) 33.484(5.541) 4.370(3.945) 2.160(2.285) 41.549(2.278) 47.220(1.811) 20.393(2.643)
DLSR 2.741(1.156) 47.493(7.657) 49.002(6.622) 9.259(6.738) 4.259(3.276) 50.188(2.313) 56.805(1.700) 55.578(3.268)
MLR 3.244(1.344) 38.191(6.183) 35.104(5.955) 4.519(7.425) 5.926(3.484) 40.271(2.046) 46.150(1.676) 32.907(2.511)
mPCVM1 2.725(1.189) 33.127(8.028) 33.070(5.194) 3.852(3.478) 2.839(2.546) 40.153(2.444) 42.871(1.580) 6.352(2.270)
mPCVM2 2.725(1.496) 30.439(6.499) 32.316(4.934) 3.185(3.175) 2.099(2.380) 38.250(3.113) 41.128(1.946) 3.592(1.446)
AUC Breast Glass Heart Iris Wine Wine(red) Wine(white) Vowel
mRVM1 98.355(1.093) 82.302(4.815) 82.126(4.538) 99.702(0.532) 99.717(0.505) 76.662(1.774) 72.666(1.126) 97.271(0.537)
mRVM2 98.571(0.908) 85.166(4.686) 80.369(4.811) 99.682(0.623) 99.784(0.352) 76.563(1.672) 74.883(1.161) 99.574(0.284)
SVM 99.587(0.315) 81.726(4.495) 84.431(4.715) 99.695(0.565) 99.936(0.134) 74.688(1.640) 69.311(1.264) 98.207(0.312)
DLSR 99.586(0.312) 79.888(5.519) 84.871(4.432) 93.760(3.678) 99.604(0.389) 74.825(1.708) 69.548(1.114) 86.986(0.887)
MLR 99.559(0.340) 81.604(4.753) 83.517(4.590) 99.014(1.602) 97.405(2.187) 76.242(1.690) 71.778(1.166) 95.886(0.512)
mPCVM1 98.394(0.886) 85.737(5.074) 84.016(4.778) 99.699(0.474) 99.761(0.415) 76.069(1.776) 74.320(1.178) 99.634(0.204)
mPCVM2 98.906(0.942) 86.899(4.759) 84.486(3.609) 99.777(0.477) 99.879(0.250) 77.025(2.132) 76.087(2.029) 99.709(0.235)

In our experiment, we select the ERR and the generalized AUC [27] as metrics to measure performance of these algorithms. The ERR reports classification error rates of algorithms. The generalized AUC complements the ERR in measuring the performance of algorithms in imbalanced cases. The overall performance of an algorithm is measured using the AUC metric:

\text{AUC}_{generalized}=\frac{2}{C(C-1)}\sum_{i<j}\hat{A}(i,j), \qquad (40)

where $\hat{A}(i,j)=[\hat{A}(i|j)+\hat{A}(j|i)]/2$ is the measure of separability between the $i$-th class and the $j$-th one, and $\hat{A}(i|j)$ denotes the probability that a sample from the $j$-th class is misclassified into the $i$-th one, with $i,j\in\{1,2,...,C\}$. When $C$ equals 2, $\hat{A}(1|2)=\hat{A}(2|1)$ and the generalized AUC is equivalent to the traditional AUC. Notationally, we write AUC instead of AUC$_{generalized}$ for succinctness.
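As a sketch of how the generalized AUC can be computed in practice, the snippet below averages pairwise quantities over all class pairs following Eq. 40. The per-pair quantity used here is the ROC AUC of each class's score, one common estimator in the spirit of [27]; the paper's exact estimator of $\hat{A}(i|j)$ may differ in detail.

```python
# Sketch of the generalized AUC of Eq. 40 from class scores (e.g., the potentials Y).
import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

def generalized_auc(y_true, scores, classes):
    # y_true: (n,) array of labels; scores: (n, C) array, one column per class
    C = len(classes)
    total = 0.0
    for i, j in combinations(range(C), 2):
        mask = np.isin(y_true, [classes[i], classes[j]])
        a_ij = roc_auc_score(y_true[mask] == classes[i], scores[mask, i])
        a_ji = roc_auc_score(y_true[mask] == classes[j], scores[mask, j])
        total += (a_ij + a_ji) / 2.0                     # A_hat(i, j)
    return 2.0 * total / (C * (C - 1))
```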

Table 3 reports the performance of these algorithms on the 8 benchmark data sets under the metrics of the ERR and the AUC. From the table, the mPCVMs perform well in terms of both metrics. For instance, under the ERR metric, the mPCVM1 outperforms the comparison algorithms on 6 out of 8 data sets and comes second in the other two cases, while the mPCVM2 surpasses all the other algorithms and achieves the best overall performance. On all data sets, the mPCVMs perform better than the mRVMs. This result is partly due to the remediation of the shortcoming of the mRVMs discussed previously. The mPCVM2 is consistently better than the mPCVM1, which demonstrates the superiority of the methodology that can dynamically add and delete relevant vectors.

3.3 Statistical Comparisons

To test the statistical significance of results listed in Table 3, we apply the Friedman test under the null hypothesis that there is no significant difference among algorithms. The alternative hypothesis states there is a statistical difference among the tested algorithms.

The Friedman test statistic is written as $Q=\frac{12N}{k(k+1)}\left[\sum_{j}R^{2}_{j}-\frac{k(k+1)^{2}}{4}\right]$, where $k$ is the number of algorithms, $N$ the number of data sets, and $R_{j}$ the average rank of the $j$-th algorithm. The p-value is then given by $P(\mathcal{X}^{2}_{k-1}\geq Q)$. If the p-value is less than 0.10, the null hypothesis of the Friedman test is rejected and the alternative hypothesis is accepted, indicating that there exists a statistical difference among these algorithms.
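The statistic above is easy to reproduce from a table of per-data-set results; a sketch follows, ranking lower error as better, which is an assumption about how directions and ties are handled.

```python
# Friedman test from an N x k table of errors (lower is better).
import numpy as np
from scipy.stats import rankdata, chi2

def friedman_test(err_table):
    N, k = err_table.shape
    ranks = np.vstack([rankdata(row) for row in err_table])   # rank 1 = best per data set
    R = ranks.mean(axis=0)                                    # average rank R_j
    Q = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    p_value = chi2.sf(Q, df=k - 1)                            # P(chi^2_{k-1} >= Q)
    return Q, p_value, R
```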

Table 4: The average ranks of the mPCVM1 and the baseline algorithms
Rank mRVM1 mRVM2 SVM DLSR MLR mPCVM1
ERR 3.375 3.375 3.75 5.375 3.875 1.25
AUC 3.375 3 3.25 4.5 4.375 2.625
Table 5: The average ranks of the mPCVM2 and the baseline algorithms
Rank mRVM1 mRVM2 SVM DLSR MLR mPCVM2
ERR 3.5 3.375 3.875 5.375 3.875 1
AUC 3.5 3.25 3.375 4.5 4.5 1.875

The average ranks of our proposed algorithms and baseline algorithms are summarized in Table 4 and Table 5 under the two metrics where the mPCVM1 and the mPCVM2 are tested separately.

The statistical tests on the mPCVM1 show that this newly proposed algorithm could improve the classification accuracy and that there is no significant difference with regard to the AUC. Meanwhile, the statistical tests on the mPCVM2 show an advantage with regard to the ERR and the AUC.

If the Friedman test is rejected under the ERR or the AUC metric, a post-hoc test is additionally conducted to quantify the difference between our proposed algorithms and the baseline algorithms.

The Bonferroni-Dunn test [28] is chosen as the post-hoc test to compare all algorithms with a control algorithm (the mPCVM1 or the mPCVM2). The difference between two algorithms is significant if their average ranks differ by at least the critical difference, which is written as:

CD=q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}, \qquad (41)

where $q_{\alpha}=2.326$ when the significance level $\alpha$ is set to 0.10 for 6 algorithms. The difference between the $i$-th algorithm and the $j$-th one is given by

(R_{j}-R_{i})\Big/\sqrt{\frac{k(k+1)}{6N}}. \qquad (42)
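The critical difference of Eq. 41 and the rank difference of Eq. 42 are one-liners; the sketch below reproduces the CD value of 2.176 reported in Tables 6 and 7 for $k=6$ algorithms, $N=8$ data sets and $q_{0.1}=2.326$.

```python
# Bonferroni-Dunn critical difference (Eq. 41) and rank difference (Eq. 42).
import numpy as np

def critical_difference(q_alpha, k, N):
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))

def rank_difference(R_j, R_i, k, N):
    return (R_j - R_i) / np.sqrt(k * (k + 1) / (6.0 * N))

print(critical_difference(2.326, 6, 8))   # ~2.176, matching Tables 6 and 7
```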

Table 6 lists the Bonferroni-Dunn test results of the mPCVM2. From the table, the differences between the mPCVM2 and all the other algorithms are greater than the critical difference under the ERR metric, indicating that each pairwise difference is significant. Therefore, the mPCVM2 performs significantly better than the mRVM1, the mRVM2, the SVM, the MLR and the DLSR. Under the AUC metric, the same argument holds for the mPCVM2 against the MLR and the DLSR. However, the differences of the mPCVM2 from the mRVM1, the mRVM2 and the SVM are marginally below the critical difference, which fails to support statistical significance at $\alpha=0.10$. Similarly, Table 7 shows that the mPCVM1 performs significantly better than the mRVM1, the mRVM2, the SVM, the MLR and the DLSR under the ERR metric.

Table 6: Friedman and Bonferroni-Dunn test results of the mPCVM2 and the baseline algorithms. The significance level is 0.1 and $q_{0.1}=2.326$. Significant results are marked in bold font.
Friedman Q Friedman p-value CD0.1 mRVM1 mRVM2 SVM DLSR MLR
ERR 23 0.000 2.176 2.673 2.272 3.074 4.811 3.207
AUC 12.714 0.026 2.176 2.138 1.871 1.871 3.074 3.074
Table 7: Friedman and Bonferroni-Dunn test results of the mPCVM1 and the baseline algorithms. The significance level is 0.1 and $q_{0.1}=2.326$. Significant results are marked in bold font.
Friedman Q Friedman p-value CD0.1 mRVM1 mRVM2 SVM DLSR MLR
ERR 20.143 0.001 2.176 2.272 2.272 2.673 4.410 2.806
AUC 6.857 0.232

3.4 Performance under Various Numbers of Classes

A notable observation is that the mPCVM2 improves only by a small margin over the runner-up on Breast (2 classes), whereas on Vowel (11 classes) the improvement is from 93.221% to 96.408% in accuracy. This is not surprising, as the mPCVM2 tends to deliver better accuracy on data sets with a larger number of classes.

To demonstrate the classification performance when the number of classes grows, we carried out an experiment on a multi-class data set, Letter Recognition [29], which contains 26 classes corresponding to the 26 capital letters. In the experiment, we gradually increased the number of classes and recorded the classification accuracies of all the tested algorithms. The experiment started with only two classes (600 samples from class A and 600 samples from class B). The samples were randomly permuted and split into a training set (400) and a testing set (800). Then, in each round, we added samples (200 for training and 400 for testing) from the next class (e.g., class C), retrained the models and recorded their performance. This process terminated when the first 10 classes, i.e., A-J, had been added. For each algorithm except the MLR, the best setting from Section 3.2 was reused. Each algorithm was run 45 times per round to obtain an averaged accuracy (since this data set is balanced, there is no need for the AUC).

The experimental results are illustrated in Fig. 5. From the figure, all the algorithms have similar results in the binary case. However, as the number of classes grows and the problem becomes increasingly complicated, the advantage of our proposed algorithms emerges. The mPCVMs (esp. the mPCVM2) manage to offer stable and robust results whilst the performance of the other algorithms degrades. The mRVMs do not follow the multi-class classification principle for the mPCVM, so their performance is unstable and they lose the competition against the mPCVMs. The SVM cannot directly solve multi-class classification, which could lead to its inferior performance.

3.5 Experiment on a Data Set with a Large Number of Classes

Following Section 3.4, we further increased the number of classes and carried out an experiment on Leaves Plant Species [30], which has 100 classes (labeled from 1 to 100), to evaluate the performance of our proposed algorithms and the comparison ones. In the experiment, all the tested algorithms were initially trained and evaluated on the first 20 classes (1-20) of the data. Then, the number of classes was increased by 20 per round: when the next 20 classes (e.g., 21-40) were added to the training set and the test set of the previous round, all the algorithms were retrained and evaluated. The classification accuracy of every algorithm was recorded in each round. For each algorithm except the MLR, the best setting from Section 3.2 was reused. Each algorithm was run 45 times per round to obtain an averaged accuracy.

Figure 5: Accuracies of the algorithms on Letter Recognition with different numbers of classes. The x-axis indicates the number of classes and the y-axis presents the accuracies of the algorithms. When the number of classes is small, all the algorithms have similar performance. However, when the number of classes increases, the advantage of our proposed mPCVMs (esp. the mPCVM2) over the comparison algorithms emerges.

Figure 6: Accuracies of the algorithms on Leaves Plant Species with different numbers of classes. The x-axis indicates the number of classes and the y-axis presents the accuracies of the algorithms. The mPCVM1 obtains results similar to the mRVM2 and the SVM. The mPCVM2 surpasses all the comparison algorithms over different class numbers. When the investigated problem contains a large number of classes, the mPCVMs (esp. the mPCVM2) are expected to obtain better performance than the selected benchmark algorithms.

The experimental results are illustrated in Fig. 6. The results show that the mPCVM1 performs similarly to the mRVM2 and the SVM, and that the mPCVM2 surpasses all the comparison algorithms over different class numbers. When the investigated problem contains a large number of classes, the mPCVMs (esp. the mPCVM2) are expected to obtain better performance than the selected benchmark algorithms. The mPCVM1 achieves performance comparable with the others since it ensures the multi-class classification principle. The mPCVM2 performs consistently better than the mPCVM1 owing to its flexibility in adding and deleting relevant vectors.

3.6 Algorithm Complexity

The mRVMs and the mPCVMs share the same computational complexity $\mathcal{O}(N^{2})$ and memory storage $\mathcal{O}(N^{2})$ for computing the kernel matrix, where $N$ is the number of training samples. In the following, the computational complexity and the memory storage of the optimization procedure are considered. Because of the inversion of an $N\times N$ kernel matrix, the mRVM2 has computational complexity $\mathcal{O}(N^{3})$ and memory storage $\mathcal{O}(N^{2})$. The mRVM1 has an incremental procedure, so its computational complexity decreases to $\mathcal{O}(M^{3})$ and its memory storage to $\mathcal{O}(MN)$, where $M$ is the number of relevant vectors and $M\ll N$. The algorithm mPCVM1 initially contains all $N$ basis functions for each class and then prunes them, which can lead to longer training time and larger memory usage; by analogy to the mRVM2, its computational complexity is $\mathcal{O}(CN^{3})$ and its memory storage $\mathcal{O}(CN^{2})$, where $C$ is the number of classes. Similar to the mRVM1, the mPCVM2 is an incremental algorithm, yet it has fewer non-zero weights. Supposing the mPCVM2 has $M_{i}$ relevant vectors for the $i$-th class, its computational complexity is $\mathcal{O}(CM_{j}^{3})$ and its memory storage $\mathcal{O}(CM_{j}N)$, where $M_{j}=\max(M_{i})$ for $i\in\{1,2,...,C\}$.

4 Conclusion and Future Work

In this paper, we propose a multi-class probabilistic classification vector machine (mPCVM), which extends the PCVM to multi-class classification. We introduce two learning algorithms to optimize the model: a top-down algorithm, the mPCVM1, which is based on an expectation-maximization algorithm to obtain the maximum a posteriori point estimates of the parameters, and a bottom-up algorithm, the mPCVM2, which is an incremental version of maximizing the marginal likelihood. The performance of the mPCVMs is extensively evaluated on 8 benchmark data sets. The experimental results show that, among the selected baseline algorithms, our algorithms have the best performance and that the mPCVM2 performs particularly well when the number of classes is large.

However, since the computational complexity is proportional to the number of classes, it is higher than that of most SVMs. This relatively higher computational complexity is offset by the superior performance and the benefit of producing probabilistic outputs. Future work includes reducing the computational complexity with the help of approximation and combining the idea of the mPCVM with ensemble learning [31].

5 Appendices

5.1 Conjugate Prior of Truncated Gaussian

The probability density function of the truncated Gaussian (Eq. 12) can be written in the form

p(w_{nc}|{\alpha}_{nc})=\delta(f_{nc}w_{nc}){\left(\frac{2{\alpha}_{nc}}{\pi}\right)}^{\frac{1}{2}}\exp\left(-\frac{{\alpha}_{nc}w_{nc}^{2}}{2}\right).

Comparing this with the general form of the exponential family,

p(\bm{x}|\bm{\eta})=h(\bm{x})g(\bm{\eta})\exp({\bm{\eta}}^{\rm T}\bm{u}(\bm{x})),

we can see that the truncated Gaussian belongs to the exponential family. For any member of the exponential family, there exists a conjugate prior that can be written in the form

p(\bm{\eta}|\bm{a},b)=f(\bm{a},b){g(\bm{\eta})}^{b}\exp(b{\bm{\eta}}^{\rm T}\bm{a}),

where f(\bm{a},b) is a normalization coefficient [12]. In this case, we obtain

p({\alpha}_{nc}|a,b)\propto{\alpha}_{nc}^{\frac{b}{2}}\exp(ab{\alpha}_{nc}).

This is exactly Gamma({\alpha}_{nc}|{\frac{b}{2}}+1,-ab). So the truncated Gaussian admits a Gamma distribution as the conjugate prior of its precision, provided that u is known (in this case, u=0).
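As a concrete illustration of this conjugacy (a minimal sketch; the hyperparameter names c0 and d0 and the test values are hypothetical and not taken from the paper), combining a Gamma prior over the precision with the truncated-Gaussian likelihood of a single weight yields another Gamma posterior, with the shape increased by 1/2 and the rate increased by w_{nc}^{2}/2:

```python
import numpy as np
from scipy.stats import gamma

def alpha_posterior(w_nc, c0, d0):
    """Posterior over the precision alpha_nc given one weight w_nc (on its allowed side).

    Likelihood: p(w_nc | alpha_nc) proportional to alpha_nc^{1/2} exp(-alpha_nc w_nc^2 / 2);
    prior: Gamma(alpha_nc | c0, d0) with shape c0 and rate d0 (hypothetical names).
    Conjugacy gives another Gamma, with shape c0 + 1/2 and rate d0 + w_nc^2 / 2.
    """
    return c0 + 0.5, d0 + 0.5 * w_nc**2

# numerical check against a normalized grid approximation of the posterior
w_nc, c0, d0 = 0.7, 2.0, 1.0
alphas = np.linspace(0.0, 30.0, 200_001)
dx = alphas[1] - alphas[0]
unnorm = np.sqrt(alphas) * np.exp(-0.5 * alphas * w_nc**2) * gamma.pdf(alphas, a=c0, scale=1.0 / d0)
grid_post = unnorm / (unnorm.sum() * dx)
shape, rate = alpha_posterior(w_nc, c0, d0)
print(np.max(np.abs(grid_post - gamma.pdf(alphas, a=shape, scale=1.0 / rate))))  # close to 0
```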

5.2 Multinomial Probit

The multinomial probit is

\displaystyle\begin{split}&p(t_{n}=i|\bm{W},\bm{b})\\
&=\int\delta(z_{ni}>z_{nj},\forall j\neq i)\prod\limits_{c=1}^{C}{\mathcal{N}(z_{nc}|y_{nc},1)}d\bm{z_{n}}\\
&=\int_{-\infty}^{+\infty}\mathcal{N}(z_{ni}|y_{ni},1)\prod\limits_{j\neq i}\big{(}\int_{-\infty}^{z_{ni}}{\mathcal{N}(z_{nj}|y_{nj},1)}d{z_{nj}}\big{)}d{z_{ni}}\\
&=\int_{-\infty}^{+\infty}\mathcal{N}(z_{ni}|y_{ni},1)\prod\limits_{j\neq i}\Psi(z_{ni}-y_{nj})d{z_{ni}}\\
&=\mathbb{E}_{\varepsilon_{ni}}\left[\prod_{j\neq i}\Psi(\varepsilon_{ni}+y_{ni}-y_{nj})\right].\end{split}
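The expectation in the last line lends itself to a simple Monte Carlo approximation: draw \varepsilon_{ni}\sim\mathcal{N}(0,1) and average the product of standard normal CDFs. The sketch below (illustrative only; the latent scores y_n and the sample sizes are hypothetical) also cross-checks the estimate against brute-force sampling of the latent variables z_n\sim\mathcal{N}(y_n,I):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def probit_class_probs(y_n, n_mc=200_000):
    """p(t_n = i | W, b) = E_eps[ prod_{j != i} Psi(eps_ni + y_ni - y_nj) ], by Monte Carlo."""
    C = len(y_n)
    eps = rng.standard_normal(n_mc)                     # eps_ni ~ N(0, 1)
    probs = np.empty(C)
    for i in range(C):
        diffs = eps[:, None] + y_n[i] - np.delete(y_n, i)[None, :]
        probs[i] = norm.cdf(diffs).prod(axis=1).mean()  # Psi is the standard normal CDF
    return probs

def probit_class_probs_bruteforce(y_n, n_mc=200_000):
    """Sample z_n ~ N(y_n, I) and count how often each component is the largest."""
    z = y_n[None, :] + rng.standard_normal((n_mc, len(y_n)))
    return np.bincount(z.argmax(axis=1), minlength=len(y_n)) / n_mc

y_n = np.array([0.3, -0.5, 1.1, 0.0])       # hypothetical latent scores y_nc for one sample
print(probit_class_probs(y_n))              # the two estimates should agree
print(probit_class_probs_bruteforce(y_n))   # up to Monte Carlo error
```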

5.3 A Useful Expectation Identity

Let g(\cdot) be a differentiable and bounded function defined on \mathbb{R}, and let \varepsilon be a random variable with \varepsilon\sim\mathcal{N}(0,1). Then

\displaystyle\begin{split}\mathbb{E}[\varepsilon g(\varepsilon)]&=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty}xg(x)e^{-\frac{x^{2}}{2}}dx\\
&=\frac{1}{\sqrt{2\pi}}\left(\int_{-\infty}^{0}xg(x)e^{-\frac{x^{2}}{2}}dx+\int_{0}^{+\infty}xg(x)e^{-\frac{x^{2}}{2}}dx\right)\\
&=\frac{1}{\sqrt{2\pi}}\left(\int^{-\infty}_{0}g(x)de^{-\frac{x^{2}}{2}}+\int^{0}_{+\infty}g(x)de^{-\frac{x^{2}}{2}}\right)\\
&=\frac{1}{\sqrt{2\pi}}\left(g(x)e^{-\frac{x^{2}}{2}}|^{-\infty}_{0}-\int_{0}^{-\infty}e^{-\frac{x^{2}}{2}}g^{\prime}(x)dx\right)\\
&\quad+\frac{1}{\sqrt{2\pi}}\left(g(x)e^{-\frac{x^{2}}{2}}|_{+\infty}^{0}-\int^{0}_{+\infty}e^{-\frac{x^{2}}{2}}g^{\prime}(x)dx\right).\end{split}

Since g(\cdot) is bounded, the boundary terms vanish at \pm\infty, so

\displaystyle\begin{split}\mathbb{E}[\varepsilon g(\varepsilon)]&=\frac{1}{\sqrt{2\pi}}\left(0-g(0)-\int_{0}^{-\infty}e^{-\frac{x^{2}}{2}}g^{\prime}(x)dx\right)\\
&\quad+\frac{1}{\sqrt{2\pi}}\left(g(0)-0-\int^{0}_{+\infty}e^{-\frac{x^{2}}{2}}g^{\prime}(x)dx\right)\\
&=\frac{1}{\sqrt{2\pi}}\left(\int_{-\infty}^{0}g^{\prime}(x)e^{-\frac{x^{2}}{2}}dx+\int_{0}^{+\infty}g^{\prime}(x)e^{-\frac{x^{2}}{2}}dx\right)\\
&=\mathbb{E}[g^{\prime}(\varepsilon)].\end{split}

In this article, g(\cdot) is \Psi(\cdot) or a product of \Psi(\cdot) terms; it is bounded by 1 and differentiable.
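Because \Psi^{\prime}=\mathcal{N}(\cdot|0,1), the identity is easy to check numerically. The following sketch (illustrative only; the shift a is a hypothetical stand-in for a difference such as y_{ni}-y_{nj}) compares Monte Carlo estimates of both sides for g(\varepsilon)=\Psi(\varepsilon+a):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
eps = rng.standard_normal(1_000_000)      # eps ~ N(0, 1)
a = 0.8                                   # hypothetical shift, e.g. y_ni - y_nj

lhs = np.mean(eps * norm.cdf(eps + a))    # E[ eps * g(eps) ]  with g(.) = Psi(. + a)
rhs = np.mean(norm.pdf(eps + a))          # E[ g'(eps) ],      since Psi' = N(.|0, 1)
print(lhs, rhs)                           # should agree up to Monte Carlo error
```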

5.4 Posterior Expectation

The posterior expectation of z_{nj} for all j\neq t_{n} (assume t_{n}=i) is

\displaystyle\begin{split}\overline{z}_{nj}&=\int z_{nj}p(\bm{z}_{n}|t_{n},\bm{W},\bm{b})d\bm{z}_{n}=\int z_{nj}\frac{p(t_{n},\bm{z}_{n}|\bm{W},\bm{b})}{p(t_{n}|\bm{W},\bm{b})}d\bm{z}_{n}\\
&=\frac{1}{p(t_{n}|\bm{W},\bm{b})}\int_{-\infty}^{+\infty}\int_{-\infty}^{z_{ni}}z_{nj}\mathcal{N}(z_{nj}|y_{nj},1)\\
&\qquad*\mathcal{N}(z_{ni}|y_{ni},1)\prod\limits_{k\neq i,j}\Psi(z_{ni}-y_{nk})dz_{nj}dz_{ni}\\
&=\frac{1}{p(t_{n}|\bm{W},\bm{b})}\int_{-\infty}^{+\infty}\int_{-\infty}^{z_{ni}}(z_{nj}-y_{nj}+y_{nj})\mathcal{N}(z_{nj}|y_{nj},1)\\
&\qquad*\mathcal{N}(z_{ni}|y_{ni},1)\prod\limits_{k\neq i,j}\Psi(z_{ni}-y_{nk})dz_{nj}dz_{ni}\\
&=\frac{1}{p(t_{n}|\bm{W},\bm{b})}\bigg{(}y_{nj}\mathbb{E}_{\varepsilon_{ni}}\Big{[}\prod_{k\neq i}\Psi(\varepsilon_{ni}+y_{ni}-y_{nk})\Big{]}\\
&\qquad-\int_{-\infty}^{+\infty}\mathcal{N}(z_{ni}|y_{ni},1)\mathcal{N}(z_{ni}|y_{nj},1)\prod\limits_{k\neq i,j}\Psi(z_{ni}-y_{nk})dz_{ni}\bigg{)}\\
&=y_{nj}-\frac{\mathbb{E}_{\varepsilon_{ni}}\bigg{[}\mathcal{N}(\varepsilon_{ni}|y_{nj}-y_{ni},1)\prod\limits_{k\neq i,j}\Psi(\varepsilon_{ni}+y_{ni}-y_{nk})\bigg{]}}{\mathbb{E}_{\varepsilon_{ni}}\bigg{[}\prod\limits_{k\neq i}\Psi(\varepsilon_{ni}+y_{ni}-y_{nk})\bigg{]}}.\end{split}

And the posterior expectation of z_{ni} (assume t_{n}=i) is

\displaystyle\begin{split}\overline{z}_{ni}&=\int z_{ni}p(\bm{z}_{n}|t_{n},\bm{W},\bm{b})d\bm{z}_{n}\\
&=\int z_{ni}\frac{p(t_{n},\bm{z}_{n}|\bm{W},\bm{b})}{p(t_{n}|\bm{W},\bm{b})}d\bm{z}_{n}\\
&=\frac{\int_{-\infty}^{+\infty}z_{ni}\mathcal{N}(z_{ni}|y_{ni},1)\prod\limits_{k\neq i}\Psi(z_{ni}-y_{nk})dz_{ni}}{p(t_{n}|\bm{W},\bm{b})}\\
&=\frac{\int_{-\infty}^{+\infty}(\varepsilon_{ni}+y_{ni})\mathcal{N}(\varepsilon_{ni}|0,1)\prod\limits_{k\neq i}\Psi(\varepsilon_{ni}+y_{ni}-y_{nk})d\varepsilon_{ni}}{p(t_{n}|\bm{W},\bm{b})}\\
&=y_{ni}+\frac{\mathbb{E}_{\varepsilon_{ni}}\bigg{[}\varepsilon_{ni}\prod\limits_{k\neq i}\Psi(\varepsilon_{ni}+y_{ni}-y_{nk})\bigg{]}}{p(t_{n}|\bm{W},\bm{b})}\\
&=y_{ni}+\frac{\sum\limits_{j\neq i}\mathbb{E}_{\varepsilon_{ni}}\bigg{[}\mathcal{N}(\varepsilon_{ni}|y_{nj}-y_{ni},1)\prod\limits_{k\neq i,j}\Psi(\varepsilon_{ni}+y_{ni}-y_{nk})\bigg{]}}{p(t_{n}|\bm{W},\bm{b})}\\
&=y_{ni}+\sum\limits_{j\neq i}(y_{nj}-\overline{z}_{nj}).\end{split}

The identity \mathbb{E}[\varepsilon g(\varepsilon)]=\mathbb{E}[g^{\prime}(\varepsilon)], which holds for any differentiable and bounded function g, is used in the step from the fifth to the sixth line of the above derivation. It is proved in Appendix 5.3.
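These closed-form expectations can be cross-checked by rejection sampling from the truncated posterior: draw z_n\sim\mathcal{N}(y_n,I), keep only the draws whose largest component is z_{ni}, and compare the empirical means with the analytic \overline{z}_{nj} and \overline{z}_{ni} above. The sketch below is illustrative only; y_n, the random seed, and the sample sizes are hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
y_n = np.array([1.2, 0.1, -0.4, 0.5])       # hypothetical latent means y_nc, with t_n = i = 0
i = 0
others = np.delete(np.arange(len(y_n)), i)  # indices j != i

# analytic posterior means, with the expectations over eps_ni ~ N(0, 1) estimated by Monte Carlo
eps = rng.standard_normal(500_000)
cdfs = norm.cdf(eps[:, None] + y_n[i] - y_n[others][None, :])  # Psi(eps + y_ni - y_nk), k != i
denom = cdfs.prod(axis=1).mean()                               # p(t_n = i | W, b)
z_bar = np.empty(len(y_n))
for col, j in enumerate(others):
    rest = np.delete(cdfs, col, axis=1).prod(axis=1)           # product over k != i, j
    numer = (norm.pdf(eps - (y_n[j] - y_n[i])) * rest).mean()  # N(eps | y_nj - y_ni, 1)
    z_bar[j] = y_n[j] - numer / denom
z_bar[i] = y_n[i] + np.sum(y_n[others] - z_bar[others])

# brute-force check: rejection sampling from N(y_n, I) conditioned on argmax being i
z = y_n[None, :] + rng.standard_normal((1_000_000, len(y_n)))
accepted = z[z.argmax(axis=1) == i]
print(z_bar)                     # analytic means
print(accepted.mean(axis=0))     # empirical means; should agree up to Monte Carlo error
```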

References

  • [1] V. N. Vapnik, Statistical Learning Theory.   Wiley-Interscience, 1998.
  • [2] O. N. Almasi, E. Akhtarshenas, and M. Rouhani, “An efficient model selection for SVM in real-world datasets using BGA and RGA,” Neural Network World, vol. 24, pp. 501–520, 2014.
  • [3] M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, no. 3, pp. 211–244, 2001.
  • [4] H. Chen, P. Tino, and X. Yao, “Probabilistic classification vector machines,” IEEE Transactions on Neural Networks, vol. 20, no. 6, pp. 901–914, 2009.
  • [5] D. J. C. MacKay, “Bayesian interpolation,” Neural Computation, vol. 4, no. 3, pp. 415–447, 1992.
  • [6] ——, “The evidence framework applied to classification networks,” Neural Computation, vol. 4, no. 5, pp. 720–736, 1992.
  • [7] M. E. Tipping and A. C. Faul, “Fast marginal likelihood maximisation for sparse Bayesian models,” in Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, 2003, pp. 3–6.
  • [8] H. Chen, P. Tiňo, and X. Yao, “Efficient probabilistic classification vector machine with incremental basis function selection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 2, pp. 356–369, 2014.
  • [9] A. Rocha and S. K. Goldenstein, “Multiclass from binary: Expanding one-versus-all, one-versus-one and ECOC-based approaches,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 2, pp. 289–302, 2014.
  • [10] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1–27:27, 2011.
  • [11] J. Luo, C.-M. Vong, and P.-K. Wong, “Sparse Bayesian extreme learning machine for multi-classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 836–843, 2014.
  • [12] C. M. Bishop, Pattern Recognition and Machine Learning.   Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
  • [13] B. Jiang, H. Chen, B. Yuan, and X. Yao, “Scalable graph-based semi-supervised learning through sparse Bayesian model,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 12, pp. 2758–2771, 2017.
  • [14] J. Weston and C. Watkins, “Support vector machines for multi-class pattern recognition,” in Proceedings of the 7th European Symposium on Artificial Neural Networks, 1999, pp. 219–224.
  • [15] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” The Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
  • [16] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, “Discriminative least squares regression for multiclass classification and feature selection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738–1754, 2012.
  • [17] T. Damoulas, Y. Ying, M. A. Girolami, and C. Campbell, “Inferring sparse kernel combinations and relevance vectors: An application to subcellular localization of proteins,” in Proceedings of the 7th International Conference on Machine Learning and Applications, 2008, pp. 577–582.
  • [18] J. H. Albert and S. Chib, “Bayesian analysis of binary and polychotomous response data,” Journal of the American Statistical Association, vol. 88, no. 422, pp. 669–679, 1993.
  • [19] B. Jiang, C. Li, M. D. Rijke, X. Yao, and H. Chen, “Probabilistic feature selection and classification vector machine,” ACM Transactions on Knowledge Discovery from Data, vol. 13, no. 2, 2019.
  • [20] M. A. Figueiredo, “Adaptive sparseness for supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1150–1159, 2003.
  • [21] H. Chen, P. Tiňo, and X. Yao, “Predictive ensemble pruning by expectation propagation,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 7, pp. 999–1013, 2009.
  • [22] M. J. Beal, Variational Algorithms for Approximate Bayesian Inference.   University of London, 2003.
  • [23] B. Jiang, X. Wu, K. Yu, and H. Chen, “Joint semi-supervised feature selection and classification through Bayesian approach,” in The 33rd AAAI Conference on Artificial Intelligence.   AAAI, 2019, pp. 3983–3990.
  • [24] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
  • [25] I. Psorakis, T. Damoulas, and M. A. Girolami, “Multiclass relevance vector machines: Sparsity and accuracy,” IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1588–1598, Oct. 2010.
  • [26] G. Rätsch, T. Onoda, and K. R. Müller, “Soft margins for AdaBoost,” Machine Learning, vol. 42, no. 3, pp. 287–320, 2001.
  • [27] D. J. Hand and R. J. Till, “A simple generalisation of the area under the ROC curve for multiple class classification problems,” Machine Learning, vol. 45, no. 2, pp. 171–186, 2001.
  • [28] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
  • [29] P. W. Frey and D. J. Slate, “Letter recognition using Holland-style adaptive classifiers,” Machine Learning, vol. 6, no. 2, pp. 161–182, 1991.
  • [30] C. Mallah, J. Cope, and J. Orwell, “Plant leaf classification using probabilistic integration of shape, texture and margin features,” Signal Processing, Pattern Recognition and Applications, vol. 5, no. 1, 2013.
  • [31] H. Chen and X. Yao, “Regularized negative correlation learning for neural network ensembles,” IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1962–1979, 2009.