
Learning with Asymmetric Kernels: Least Squares and Feature Interpretation

Mingzhen He, Fan He, Lei Shi, Xiaolin Huang,  and Johan A.K. Suykens,  M. He, F. He and X. Huang (corresponding author) are with Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, P.R. China (e-mail: {mingzhen_he; hf-inspire; xiaolinhuang}@sjtu.edu.cn).
L. Shi is with the Shanghai Key Laboratory for Contemporary Applied Mathematics, Fudan University, Shanghai 200433, China, and also with the School of Mathematical Sciences, Fudan University, Shanghai 200433, China (e-mail: [email protected]).
J. A. K. Suykens is with the Department of Electrical Engineering (ESAT-STADIUS), KU Leuven, B-3001 Leuven, Belgium (e-mail: [email protected]).
Abstract

Asymmetric kernels naturally exist in real life, e.g., for conditional probability and directed graphs. However, most existing kernel-based learning methods require kernels to be symmetric, which prevents the use of asymmetric kernels. This paper addresses asymmetric kernel-based learning in the framework of the least squares support vector machine, named AsK-LS, resulting in the first classification method that can directly utilize asymmetric kernels. We will show that AsK-LS can learn with asymmetric features, namely source and target features, while the kernel trick remains applicable, i.e., the source and target features exist but are not necessarily known. Besides, the computational burden of AsK-LS is as cheap as that of dealing with symmetric kernels. Experimental results on the Corel database, directed graphs, and the UCI database will show that, in cases where asymmetric information is crucial, the proposed AsK-LS can learn with asymmetric kernels and performs much better than existing kernel methods, which must perform symmetrization to accommodate asymmetric kernels.

Index Terms:
Asymmetric kernels, Least squares support vector machine, Kullback-Leibler kernel, Directed graphs.

1 Introduction

Kernel-based learning [1, 2, 3] is an important scheme in machine learning and has been widely used in classification [4, 5], regression [6], clustering [7], and many other tasks. Traditionally, a kernel that can be used in kernel-based learning should satisfy Mercer's condition [8], which imposes two well-known requirements on the kernel $\mathcal{K}(\cdot,\cdot):\mathbb{R}^{d}\times\mathbb{R}^{d}\mapsto\mathbb{R}$: for samples $\{\bm{x}_{i},y_{i}\}_{i=1}^{m}$, where $d$ and $m$ are the dimension and the number of data, the kernel matrix $\bm{K}$ with $\bm{K}_{ij}=\mathcal{K}(\bm{x}_{i},\bm{x}_{j})$ should be i) symmetric and ii) positive semi-definite (PSD). When the latter condition is relaxed, the flexibility is enhanced; such methods are called indefinite learning, for which some interesting results can be found in [9, 10, 11, 12, 13].

However, discussion on relaxing the symmetry condition is rare, although many asymmetric similarities exist in real life. For example, in directed graphs, the adjacency matrix is certainly asymmetric and thus the connection similarity is asymmetric, i.e., $d(\bm{x}_{i},\bm{x}_{j})\neq d(\bm{x}_{j},\bm{x}_{i})$: the path from the $i$-th node to the $j$-th node is not equal to that from the $j$-th node to the $i$-th node. Conditional probability, which has been widely used to measure directional similarity [14], is also asymmetric: $p(\bm{x}_{i}|\bm{x}_{j})\neq p(\bm{x}_{j}|\bm{x}_{i})$. Such asymmetric measurements cannot be used directly in current kernel-based learning. Let us consider the support vector machine (SVM, [4]):

\min_{\bm{\alpha}\in\mathbb{R}^{m}}\ \frac{1}{2}\bm{\alpha}^{\top}\bm{K}\bm{\alpha}-\bm{1}^{\top}\bm{\alpha},\quad \mathrm{s.t.}\ \bm{Y}^{\top}\bm{\alpha}=0,\ 0\leq\bm{\alpha}\leq C\bm{1}, \qquad (1)

where $C>0$ is a pre-given constant, $\bm{1}$ is an $m$-dimensional vector of all ones, $\bm{Y}=[y_{1},\cdots,y_{m}]^{\top}$, and $\bm{\alpha}$ is the dual variable vector. When $\bm{K}$ is asymmetric, it can, at least formally, be used directly in (1). However, by noticing that

\bm{\alpha}^{\top}\bm{K}\bm{\alpha}=\bm{\alpha}^{\top}\frac{\bm{K}^{\top}+\bm{K}}{2}\bm{\alpha},\quad\forall\bm{\alpha}\in\mathbb{R}^{n},\ \bm{K}\in\mathbb{R}^{n\times n},

one may find that only the symmetric part of an asymmetric kernel is learned by directly using it in the SVM.
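The following minimal NumPy check (our illustration, not part of the original paper) verifies this observation numerically: the quadratic form in (1), evaluated with an arbitrary asymmetric matrix, coincides with the one evaluated with its symmetric part, so the antisymmetric part of the kernel is invisible to the SVM dual.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
K = rng.standard_normal((m, m))      # an arbitrary asymmetric "kernel" matrix
alpha = rng.standard_normal(m)       # an arbitrary dual vector

K_sym = (K + K.T) / 2                # symmetric part of K
quad_asym = alpha @ K @ alpha        # quadratic form with the asymmetric matrix
quad_sym = alpha @ K_sym @ alpha     # quadratic form with its symmetric part

# The two values agree up to floating-point error, so the SVM objective (1)
# cannot distinguish K from (K + K^T)/2.
print(np.isclose(quad_asym, quad_sym))   # True
```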

Another popular kernel-based learning framework is the least squares support vector machine (LS-SVM, [15, 2]). Its dual form is the following linear system,

\left[\begin{array}{cc}0&\bm{Y}^{\top}\\ \bm{Y}&\frac{\bm{I}}{\gamma}+\bm{H}\end{array}\right]\left[\begin{array}{c}b\\ \bm{\alpha}\end{array}\right]=\left[\begin{array}{c}0\\ \bm{1}\end{array}\right],

where $\bm{I}$ is the identity matrix and $\bm{H}$ is given by $\bm{H}_{ij}=y_{i}\bm{K}_{ij}y_{j}$. An interesting point here is that using an asymmetric kernel in the LS-SVM does not reduce to its symmetrization, so asymmetric information can be learned. We can therefore develop asymmetric kernels in the LS-SVM framework in a straightforward way. The corresponding kernel trick, the feature interpretation, and the asymmetric information will be investigated in this paper. Notice that we do not claim that asymmetric kernels could not be applied in the SVM, but doing so is not straightforward and requires further investigation. Similarly, for symmetric but indefinite kernels, the solving method in the LS-SVM framework remains simple [11], while the SVM requires delicate design in formulation, theory, and algorithms [16, 12].

Figure 1: A simple illustration of asymmetric dissimilarity. To show the asymmetric information, the dissimilarity between sample C and the other samples is shown in sub-figures (b), (c), and (d) as an example. (a) There are seven samples on a directed graph, where red and green colors indicate two categories. The dissimilarity from one sample to another is defined as the shortest path from the former to the latter; for example, the dissimilarity from sample A to sample E is 2, and if a sample cannot be reached, the dissimilarity is $\infty$. (b) Symmetrization: the directed edges are replaced by undirected edges. (c) Source space: the dissimilarity from C to the other samples. (d) Target space: the dissimilarity from the other samples to C.

In this paper, a novel method called AsK-LS for learning with asymmetric kernels in the framework of least squares will be established. The most important discussion investigates the kernel trick and the feature interpretation, i.e., how the asymmetric information is extracted by AsK-LS. Generally, there are two features involved in the kernel trick. Borrowing the concept from directed graphs, which have two feature embeddings for source and target nodes [17, 18, 19], we call the features in AsK-LS the source feature and the target feature. These names distinguish the two features but do not mean that AsK-LS can only be used for directed graphs. For the singular value decomposition, asymmetric kernels have been introduced into the LS-SVM framework [20], where the two features are related to the columns and rows of the matrix, respectively.

In the rest of this paper, we will first discuss asymmetric kernels and illustrate that there are two different features embedded in an asymmetric kernel; see Section 2 for details. Then in Section 3 we will formulate AsK-LS, discuss its feature interpretation, and design the solving algorithm. In principle, an asymmetric kernel contains more information than the symmetric one. Thus, the proposed AsK-LS will demonstrate advantages when the asymmetric information is crucial, as numerically evaluated in Section 4. Section 5 will end this paper with a brief conclusion.

2 Asymmetric kernels

In classical kernel-based learning, the kernel $\mathcal{K}$ is symmetric, i.e., the kernel matrix $\bm{K}_{ij}=\mathcal{K}(\bm{x}_{i},\bm{x}_{j})$ for training samples is also symmetric. However, there are many asymmetric kernels. For example, in image classification tasks, the Kullback-Leibler (KL, [21]) divergence can be used to measure the dissimilarity between two probability distributions. The directed graph is another example, where the similarity between two nodes is essentially asymmetric.

Intuitively, an asymmetric $\bm{K}$ contains more information than a symmetric one. For symmetric kernels, the kernel trick (there are additional conditions for the existence of the kernel trick; see, e.g., [8]) means that there is a feature mapping $\phi$ such that $\mathcal{K}(\bm{u},\bm{v})=\left<\phi(\bm{u}),\phi(\bm{v})\right>$ for two samples $\bm{u}$ and $\bm{v}$. It is then expected that there are more features for asymmetric kernels. Fig. 1 illustrates a simple case of asymmetric dissimilarity: Fig. 1(b)-(d) shows the symmetrized dissimilarity together with the source and target views. From the source dissimilarity, one can extract a nonlinear feature mapping, denoted as $\phi_{s}(\bm{u})$. Meanwhile, a target nonlinear feature mapping, denoted as $\phi_{t}(\bm{u})$ and generally different from $\phi_{s}(\bm{u})$, can also be extracted. However, in existing kernel-based learning, only symmetric kernels are acceptable and hence one has to use a symmetric similarity, for example, $(\bm{K}^{\top}+\bm{K})/2$ or $\bm{K}^{\top}\bm{K}$, which indicates that those symmetrization methods may lose the asymmetric information.

Besides the KL divergence and directed graphs, there are other tasks where asymmetric kernels may be superior. In kernel density estimation problems, asymmetric kernels performed better than symmetric ones when the underlying random variables were bounded [22, 23, 24]. In Gaussian process regression tasks, Pintea et al. argued that it is helpful to set an individual kernel parameter for each data center, which enables each center to learn a proper kernel parameter in its neighborhood and results in an asymmetric kernel matrix [25]. In federated learning tasks [26], an asymmetric neural tangent kernel was established to address the issue that the gradient of the global machine is not determined directly by the local gradient directions.

Our aim in this paper is to propose a novel method that directly learns with asymmetric kernels and, correspondingly, with two feature mappings. For a long time, symmetrization has been the main way of dealing with asymmetric kernels. In an early paper [27], Tsuda made the asymmetric kernel matrix $\bm{S}$ symmetric by multiplying it with its transpose, obtaining a new symmetric matrix $\bm{Q}=\bm{S}^{\top}\bm{S}$. Munoz et al. utilized a pick-out method to convert the asymmetric kernel into a symmetric one [28]. Moreno et al. studied the KL divergence kernel $D(P,Q)$ in the SVM on multimedia data [21]. They defined $D(P,Q)={\rm KL}(P||Q)+{\rm KL}(Q||P)$ to satisfy the Mercer condition, but the asymmetric information disappeared. Koide and Yamashita proposed an asymmetric kernel method and applied it to Fisher's discriminant (AKFD) [29]. They claimed that an asymmetric kernel $\mathcal{K}(\bm{x},\bm{y})=\left<\phi_{1}(\bm{x}),\phi_{2}(\bm{y})\right>$ is generated by the inner product of two different feature mappings. In the AKFD, the decision function was assumed to be spanned by $\{\phi_{1}(\bm{x}_{i})\}_{i=1}^{m}$ and the input data were mapped by $\phi_{2}$. However, this assumption is very strict, and the situation where the decision function is spanned by $\{\phi_{2}(\bm{x}_{i})\}_{i=1}^{m}$ was not considered. Wu et al. proposed a hyper asymmetric kernel method to learn with asymmetric kernels between data from two different input spaces, such as a query space $\mathcal{X}$ and a document space $\mathcal{Y}$ [30], but the asymmetric kernel degenerates to a symmetric one when the two spaces are identical, i.e., $\mathcal{X}=\mathcal{Y}$. In summary, these works used symmetrization at the optimization level, which discards the asymmetric information and is thus undesirable for asymmetric kernel-based learning.

Interestingly, the matrix singular value decomposition (SVD) can be merged into the LS-SVM framework [20, 31]. The matrix to be decomposed can be asymmetric and even non-square, implying that the LS-SVM can tolerate asymmetric kernels. From the viewpoint of the LS-SVM setting, the matrix SVD is related to two feature maps acting on the column vectors and the row vectors of the matrix, respectively. For directed graphs, it is also possible to use the adjacency matrix, without labels, to extract embeddings as the source and target features, respectively [17, 18, 19]. These works, although in an unsupervised setting, demonstrate that asymmetric kernels can be studied directly rather than through symmetrization.

3 Asymmetric kernels in the LS-SVM

3.1 PSD and indefinite kernels in the LS-SVM

Given training samples $\{\bm{x}_{i},y_{i}\}_{i=1}^{m}$ with $\bm{x}\in\mathbb{R}^{d}$ and $y\in\{+1,-1\}$, a discriminant function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is constructed to classify the input samples. For linearly inseparable problems, a non-linear feature mapping $\phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{p}$ is needed, where $\mathbb{R}^{p}$ is a high-dimensional space.

The LS-SVM with a PSD kernel is formulated as the following optimization problem [15],

\begin{aligned}\min_{\bm{\omega},b,\bm{\xi}}\quad&\frac{1}{2}\bm{\omega}^{\top}\bm{\omega}+\frac{\gamma}{2}\sum_{i=1}^{m}\xi^{2}_{i}\\ \mathrm{s.t.}\quad& y_{i}(\bm{\omega}^{\top}\phi(\bm{x}_{i})+b)=1-\xi_{i},\quad\forall i\in\{1,2,\cdots,m\},\end{aligned} \qquad (2)

where the discriminant function $f$ is formulated as $f(\bm{x})=\bm{\omega}^{\top}\phi(\bm{x})+b$. When the kernel is generalized to a non-PSD one, the primal problem is as follows [11],

\begin{aligned}\min_{\bm{\omega}_{+},\bm{\omega}_{-},b,\bm{\xi}}\quad&\frac{1}{2}(\bm{\omega}_{+}^{\top}\bm{\omega}_{+}-\bm{\omega}_{-}^{\top}\bm{\omega}_{-})+\frac{\gamma}{2}\sum_{i=1}^{m}\xi_{i}^{2}\\ \mathrm{s.t.}\quad& y_{i}(\bm{\omega}_{+}^{\top}\phi_{+}(\bm{x}_{i})+\bm{\omega}_{-}^{\top}\phi_{-}(\bm{x}_{i})+b)=1-\xi_{i},\quad\forall i\in\{1,2,\cdots,m\},\end{aligned} \qquad (3)

where $\phi_{+}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{p_{1}}$ and $\phi_{-}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{p_{2}}$ are two non-linear feature mappings, and both $\mathbb{R}^{p_{1}}$ and $\mathbb{R}^{p_{2}}$ are potentially high-dimensional spaces. The discriminant function here is formulated as $f(\bm{x})=\bm{\omega}_{+}^{\top}\phi_{+}(\bm{x})+\bm{\omega}_{-}^{\top}\phi_{-}(\bm{x})+b$.

Although the feature interpretations of (2) and (3) are not the same, their dual problems share the same formulation as below,

\left[\begin{array}{cc}0&\bm{Y}^{\top}\\ \bm{Y}&\frac{\bm{I}}{\gamma}+\bm{H}\end{array}\right]\left[\begin{array}{c}b\\ \bm{\alpha}\end{array}\right]=\left[\begin{array}{c}0\\ \bm{1}\end{array}\right]. \qquad (4)

The kernel trick, which gives the feature interpretation of (4), is different for different types of kernels. If the kernel is PSD, then

\mathcal{K}(\bm{x}_{i},\bm{x}_{j})=\left<\phi(\bm{x}_{i}),\phi(\bm{x}_{j})\right>.

If the kernel is non-PSD but has a positive decomposition, for which the conceptual condition and a practical judgment can be found in [13], the kernel trick becomes

\begin{aligned}\mathcal{K}(\bm{x}_{i},\bm{x}_{j})&=\left<\phi_{+}(\bm{x}_{i}),\phi_{+}(\bm{x}_{j})\right>-\left<\phi_{-}(\bm{x}_{i}),\phi_{-}(\bm{x}_{j})\right>\\ &=\mathcal{K}_{+}(\bm{x}_{i},\bm{x}_{j})-\mathcal{K}_{-}(\bm{x}_{i},\bm{x}_{j}),\end{aligned}

where $\mathcal{K}_{+}$ and $\mathcal{K}_{-}$ are two PSD kernels.
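For reference, a minimal sketch (in Python/NumPy; our illustration, not code from the paper) of training and evaluating the LS-SVM via the dual linear system (4) is given below; the function and variable names are ours.

```python
import numpy as np

def lssvm_train(K, y, gamma):
    """Solve the LS-SVM dual system (4): K is the m x m kernel matrix on the
    training data, y holds labels in {-1, +1}, gamma is the regularization
    parameter. Returns the bias b and the dual variables alpha."""
    m = len(y)
    H = np.outer(y, y) * K                    # H_ij = y_i K_ij y_j
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = y                              # first row:    [0, Y^T]
    A[1:, 0] = y                              # first column: [Y; ...]
    A[1:, 1:] = np.eye(m) / gamma + H         # lower-right block: I/gamma + H
    rhs = np.concatenate(([0.0], np.ones(m)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                    # b, alpha

def lssvm_predict(K_test, y_train, b, alpha):
    """Evaluate f(z) = sum_i alpha_i y_i K(z, x_i) + b, with K_test[j, i] = K(z_j, x_i)."""
    return K_test @ (alpha * y_train) + b
```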

3.2 AsK-LS

When looking at the LS-SVM framework from the viewpoint of solving (4), there is no problem if $\bm{K}$ is asymmetric: the system is still well-defined and a solution can be readily obtained. The key problem is to analyze what we really learn when $\bm{K}$ is asymmetric.

First, we define a generalized kernel trick to present a kernel as an inner product of two mappings $\phi_{s}$ and $\phi_{t}$.

Definition 1.

A kernel trick for a kernel $\mathcal{K}:\mathbb{R}^{d_{s}}\times\mathbb{R}^{d_{t}}\rightarrow\mathbb{R}$ can be defined by the inner product of two different feature mappings as follows:

\mathcal{K}(\bm{u},\bm{v})=\left<\phi_{s}(\bm{u}),\phi_{t}(\bm{v})\right>,\quad\forall\bm{u}\in\mathbb{R}^{d_{s}},\ \bm{v}\in\mathbb{R}^{d_{t}},

where $\phi_{s}:\mathbb{R}^{d_{s}}\rightarrow\mathbb{R}^{p}$, $\phi_{t}:\mathbb{R}^{d_{t}}\rightarrow\mathbb{R}^{p}$, and $\mathbb{R}^{p}$ is a high-dimensional, even infinite-dimensional, space.

Different from the classical kernel trick, the above definition allows different $\phi_{s}$ and $\phi_{t}$, for which even the input dimensions $d_{s}$ and $d_{t}$ could differ. In this paper, we consider the case $d:=d_{s}=d_{t}$; then both $\mathcal{K}(\bm{u},\bm{v})$ and $\mathcal{K}(\bm{v},\bm{u})$ are well-defined and the kernel matrix for the training data is square but asymmetric. Definition 1 is compatible with the existing symmetric kernels, including PSD and indefinite ones.

  1. The symmetric and positive semi-definite kernel $\mathcal{K}(\bm{u},\bm{v})$ can be defined as follows:

     \mathcal{K}(\bm{u},\bm{v})=\left<\phi(\bm{u}),\phi(\bm{v})\right>,\quad\forall\bm{u},\bm{v}\in\mathbb{R}^{d},

     in the situation when the two feature mappings $\phi_{s}$ and $\phi_{t}$ are identical, $\phi:=\phi_{s}=\phi_{t}$ with $\phi_{s},\phi_{t}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{p}$.

  2. The symmetric and indefinite kernel $\mathcal{K}(\bm{u},\bm{v})$ can be defined as follows:

     \begin{aligned}\mathcal{K}(\bm{u},\bm{v})&=\mathcal{K}_{1}(\bm{u},\bm{v})-\mathcal{K}_{2}(\bm{u},\bm{v})\\ &=\left<\phi_{1}(\bm{u}),\phi_{1}(\bm{v})\right>-\left<\phi_{2}(\bm{u}),\phi_{2}(\bm{v})\right>\\ &=\left[\begin{array}{c}\phi_{1}(\bm{u})\\ \phi_{2}(\bm{u})\end{array}\right]^{\top}\left[\begin{array}{c}\phi_{1}(\bm{v})\\ -\phi_{2}(\bm{v})\end{array}\right]\\ &:=\left<\phi_{s}(\bm{u}),\phi_{t}(\bm{v})\right>,\quad\forall\bm{u},\bm{v}\in\mathbb{R}^{d},\end{aligned}

     in the situation when the two feature mappings $\phi_{s}$ and $\phi_{t}$ are not identical. Here $\mathcal{K}_{1}$ and $\mathcal{K}_{2}$ are two PSD kernels, and $\phi_{1}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{p_{1}}$ and $\phi_{2}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{p_{2}}$ are the high-dimensional feature mappings corresponding to $\mathcal{K}_{1}$ and $\mathcal{K}_{2}$, respectively.
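On a finite training set, the generalized kernel trick in Definition 1 always admits an explicit (non-unique) construction through the SVD of the kernel matrix, in the spirit of the discussion of [20] above. The following NumPy check is our own illustration, with variable names of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
K = rng.standard_normal((m, m))          # an arbitrary asymmetric kernel matrix

# SVD-based factorization K = U S V^T gives explicit feature maps on the m
# training points: phi_s(x_i) = sqrt(S) U^T e_i and phi_t(x_j) = sqrt(S) V^T e_j.
U, S, Vt = np.linalg.svd(K)
Phi_s = np.sqrt(S)[:, None] * U.T        # column i is phi_s(x_i)
Phi_t = np.sqrt(S)[:, None] * Vt         # column j is phi_t(x_j)

# Check the generalized kernel trick: K_ij = <phi_s(x_i), phi_t(x_j)>.
print(np.allclose(K, Phi_s.T @ Phi_t))   # True
```

This makes concrete why an asymmetric kernel carries two feature views rather than one.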

The kernel trick associated with an asymmetric kernel contains two different feature mappings. Using the concept from directed graphs, we call them source and target features, respectively. Then, for each sample, e.g., a node in a directed graph, we can extract two features from different views and classify the sample in the framework of the least squares support vector machine; the resulting method is hence called AsK-LS. The AsK-LS in the primal space takes the following form,

\begin{aligned}\min_{\bm{\omega},\bm{\nu},b_{1},b_{2},\bm{e},\bm{h}}\quad&\bm{\omega}^{\top}\bm{\nu}+\frac{\gamma}{2}\sum_{i=1}^{m}e_{i}^{2}+\frac{\gamma}{2}\sum_{i=1}^{m}h_{i}^{2}\\ \mathrm{s.t.}\quad& y_{i}(\bm{\omega}^{\top}\phi_{s}(\bm{x}_{i})+b_{1})=1-e_{i},\\ & y_{i}(\bm{\nu}^{\top}\phi_{t}(\bm{x}_{i})+b_{2})=1-h_{i},\quad\forall i\in\{1,2,\cdots,m\},\end{aligned} \qquad (5)

where $e_{i},h_{i}\in\mathbb{R}$ and $\gamma$ is the regularization coefficient of the misclassification errors. In AsK-LS (5), each sample plays different roles, which can have different meanings in different tasks. In matrix decomposition, the features could be given as column and row vectors, which could lead to an asymmetric kernel for unsupervised learning [20].

Now let us investigate the dual problem of (5), for which the kernel trick for an asymmetric kernel is crucial.

Proposition 1.

Let $b_{1}^{\star},b_{2}^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star}$ be the solution of the linear system below, where $\bm{H}_{ij}=y_{i}\phi_{s}(\bm{x}_{i})^{\top}\phi_{t}(\bm{x}_{j})y_{j}=y_{i}\mathcal{K}(\bm{x}_{i},\bm{x}_{j})y_{j}$ with an asymmetric kernel $\mathcal{K}$.

\left[\begin{array}{cccc}0&0&\bm{Y}^{\top}&0\\ 0&0&0&\bm{Y}^{\top}\\ \bm{Y}&0&\frac{\bm{I}}{\gamma}&\bm{H}\\ 0&\bm{Y}&\bm{H}^{\top}&\frac{\bm{I}}{\gamma}\end{array}\right]\left[\begin{array}{c}b_{1}\\ b_{2}\\ \bm{\alpha}\\ \bm{\beta}\end{array}\right]=\left[\begin{array}{c}0\\ 0\\ \bm{1}\\ \bm{1}\end{array}\right], \qquad (6)
  1. $\bm{\omega}^{\star}$ and $\bm{\nu}^{\star}$ are spanned by $\{\phi_{t}(\bm{x}_{i})\}_{i=1}^{m}$ and $\{\phi_{s}(\bm{x}_{i})\}_{i=1}^{m}$, respectively:

     \bm{\omega}^{\star}=\sum_{i=1}^{m}\beta_{i}^{\star}y_{i}\phi_{t}(\bm{x}_{i}),\quad\bm{\nu}^{\star}=\sum_{i=1}^{m}\alpha_{i}^{\star}y_{i}\phi_{s}(\bm{x}_{i}),

     where $(\bm{\omega}^{\star},\bm{\nu}^{\star})$ is a stationary point of the primal problem (5).

  2. The primal problem (5) results in two discriminant functions $f_{s}$ and $f_{t}$ as follows,

     \left\{\begin{aligned} &f_{s}(\bm{x})=\mathcal{K}(\bm{x},\bm{X})(\bm{\beta}^{\star}\odot\bm{Y})+b_{1}^{\star}\\ &f_{t}(\bm{x})=\mathcal{K}(\bm{X},\bm{x})(\bm{\alpha}^{\star}\odot\bm{Y})+b_{2}^{\star},\end{aligned}\right. \qquad (7)

     where $\bm{X}=\{\bm{x}_{i}\}_{i=1}^{m}$ is the training set, $\odot$ denotes the element-wise product, and

     \mathcal{K}(\bm{x},\bm{X})=[\mathcal{K}(\bm{x},\bm{x}_{1}),\cdots,\mathcal{K}(\bm{x},\bm{x}_{m})],\quad\mathcal{K}(\bm{X},\bm{x})=[\mathcal{K}(\bm{x}_{1},\bm{x}),\cdots,\mathcal{K}(\bm{x}_{m},\bm{x})].

Proof.

The Lagrangian function of the primal problem (5) is formulated as follows:

\begin{aligned}\mathcal{L}(\Theta;\bm{\alpha},\bm{\beta})=&\ \bm{\omega}^{\top}\bm{\nu}+\frac{\gamma}{2}\sum_{i=1}^{m}e_{i}^{2}+\frac{\gamma}{2}\sum_{i=1}^{m}h_{i}^{2}\\ &+\sum_{i=1}^{m}\alpha_{i}\left(1-e_{i}-y_{i}(\bm{\omega}^{\top}\phi_{s}(\bm{x}_{i})+b_{1})\right)\\ &+\sum_{i=1}^{m}\beta_{i}\left(1-h_{i}-y_{i}(\bm{\nu}^{\top}\phi_{t}(\bm{x}_{i})+b_{2})\right),\end{aligned} \qquad (8)

where $\Theta=\{\bm{\omega},\bm{\nu},b_{1},b_{2},\bm{e},\bm{h}\}$ is the parameter set. Then the conditions for a stationary point require the following equations:

\left\{\begin{aligned} &\frac{\partial\mathcal{L}}{\partial\bm{\nu}}=\bm{\omega}-\sum_{i=1}^{m}\beta_{i}y_{i}\phi_{t}(\bm{x}_{i})=0\\ &\frac{\partial\mathcal{L}}{\partial\bm{\omega}}=\bm{\nu}-\sum_{i=1}^{m}\alpha_{i}y_{i}\phi_{s}(\bm{x}_{i})=0\\ &\frac{\partial\mathcal{L}}{\partial b_{1}}=\sum_{i=1}^{m}\alpha_{i}y_{i}=0\\ &\frac{\partial\mathcal{L}}{\partial b_{2}}=\sum_{i=1}^{m}\beta_{i}y_{i}=0\\ &\frac{\partial\mathcal{L}}{\partial e_{i}}=\gamma e_{i}-\alpha_{i}=0\\ &\frac{\partial\mathcal{L}}{\partial h_{i}}=\gamma h_{i}-\beta_{i}=0\\ &\frac{\partial\mathcal{L}}{\partial\alpha_{i}}=1-e_{i}-y_{i}(\bm{\omega}^{\top}\phi_{s}(\bm{x}_{i})+b_{1})=0\\ &\frac{\partial\mathcal{L}}{\partial\beta_{i}}=1-h_{i}-y_{i}(\bm{\nu}^{\top}\phi_{t}(\bm{x}_{i})+b_{2})=0.\end{aligned}\right. \qquad (9)

The last two conditions can be converted into the following equations

\left\{\begin{aligned} &\frac{\partial\mathcal{L}}{\partial\alpha_{i}}=1-\frac{\alpha_{i}}{\gamma}-y_{i}b_{1}-y_{i}\sum_{j=1}^{m}\beta_{j}y_{j}\phi_{s}^{\top}(\bm{x}_{i})\phi_{t}(\bm{x}_{j})=0\\ &\frac{\partial\mathcal{L}}{\partial\beta_{i}}=1-\frac{\beta_{i}}{\gamma}-y_{i}b_{2}-y_{i}\sum_{j=1}^{m}\alpha_{j}y_{j}\phi_{t}^{\top}(\bm{x}_{i})\phi_{s}(\bm{x}_{j})=0.\end{aligned}\right. \qquad (10)

The equations (10) can be formulated as a linear system as follows,

\left[\begin{array}{cccc}0&0&\bm{Y}^{\top}&0\\ 0&0&0&\bm{Y}^{\top}\\ \bm{Y}&0&\frac{\bm{I}}{\gamma}&\bm{Z_{s}}\bm{Z_{t}}^{\top}\\ 0&\bm{Y}&\bm{Z_{t}}\bm{Z_{s}}^{\top}&\frac{\bm{I}}{\gamma}\end{array}\right]\left[\begin{array}{c}b_{1}\\ b_{2}\\ \bm{\alpha}\\ \bm{\beta}\end{array}\right]=\left[\begin{array}{c}0\\ 0\\ \bm{1}\\ \bm{1}\end{array}\right], \qquad (11)

where

\left\{\begin{aligned} &\bm{Z_{s}}=[y_{1}\phi_{s}^{\top}(\bm{x}_{1});\cdots;y_{m}\phi_{s}^{\top}(\bm{x}_{m})]\\ &\bm{Z_{t}}=[y_{1}\phi_{t}^{\top}(\bm{x}_{1});\cdots;y_{m}\phi_{t}^{\top}(\bm{x}_{m})].\end{aligned}\right.

According to Definition 1, an asymmetric kernel is defined as follows:

\mathcal{K}(\bm{x}_{i},\bm{x}_{j})=\left<\phi_{s}(\bm{x}_{i}),\phi_{t}(\bm{x}_{j})\right>.

Then, the linear system (11) can be reformulated as follows:

\left[\begin{array}{cccc}0&0&\bm{Y}^{\top}&0\\ 0&0&0&\bm{Y}^{\top}\\ \bm{Y}&0&\frac{\bm{I}}{\gamma}&\bm{H}\\ 0&\bm{Y}&\bm{H}^{\top}&\frac{\bm{I}}{\gamma}\end{array}\right]\left[\begin{array}{c}b_{1}\\ b_{2}\\ \bm{\alpha}\\ \bm{\beta}\end{array}\right]=\left[\begin{array}{c}0\\ 0\\ \bm{1}\\ \bm{1}\end{array}\right],

where $\bm{H}_{ij}=y_{i}\phi_{s}(\bm{x}_{i})^{\top}\phi_{t}(\bm{x}_{j})y_{j}=y_{i}\mathcal{K}(\bm{x}_{i},\bm{x}_{j})y_{j}$ with a given asymmetric kernel $\mathcal{K}$.

Suppose $b_{1}^{\star},b_{2}^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star}$ is the solution of (6). According to the partial derivative equations (9), a stationary point $(\bm{\omega}^{\star},\bm{\nu}^{\star})$ can be formulated as below,

\bm{\omega}^{\star}=\sum_{i=1}^{m}\beta_{i}^{\star}y_{i}\phi_{t}(\bm{x}_{i}),\quad\bm{\nu}^{\star}=\sum_{i=1}^{m}\alpha_{i}^{\star}y_{i}\phi_{s}(\bm{x}_{i}).

Since a stationary point of the primal problem (5) is obtained, two functions, which classify samples from the source and target points of view, respectively, can be formulated as follows:

\left\{\begin{aligned} &f_{s}(\bm{x})={\bm{\omega}^{\star}}^{\top}\phi_{s}(\bm{x})+b_{1}^{\star}=\mathcal{K}(\bm{x},\bm{X})(\bm{\beta}^{\star}\odot\bm{Y})+b_{1}^{\star}\\ &f_{t}(\bm{x})={\bm{\nu}^{\star}}^{\top}\phi_{t}(\bm{x})+b_{2}^{\star}=\mathcal{K}(\bm{X},\bm{x})(\bm{\alpha}^{\star}\odot\bm{Y})+b_{2}^{\star}.\end{aligned}\right.

The primal-dual relationship between (5) and (6) for asymmetric kernels is characterized by Proposition 1, which also includes symmetric kernels. In that case, both the primal and dual formulations reduce to the classical LS-SVM, as shown below.

Proposition 2.

When the kernel $\mathcal{K}$ in (5) is symmetric,

  1. the two functions $f_{s}$ and $f_{t}$ are identical;

  2. the two linear systems (6) and (4) are equivalent.

Proof.

According to Definition 1, symmetric kernels can also be defined in the asymmetric kernel framework. Thus, kernels in AsK-LS can also be positive semi-definite or even indefinite.

A solution to the primal problem (5) is given by the linear system (6), which can be reformulated as follows according to the matrix column transformation ($\rm CT$) and row transformation ($\rm RT$) formulas,

\left[\begin{array}{cccc}0&0&\bm{Y}^{\top}&0\\ 0&0&0&\bm{Y}^{\top}\\ \bm{Y}&0&\frac{\bm{I}}{\gamma}&\bm{H}\\ 0&\bm{Y}&\bm{H}^{\top}&\frac{\bm{I}}{\gamma}\end{array}\right]\left[\begin{array}{c}b_{1}\\ b_{2}\\ \bm{\alpha}\\ \bm{\beta}\end{array}\right]=\left[\begin{array}{c}0\\ 0\\ \bm{1}\\ \bm{1}\end{array}\right]=\left[\begin{array}{cccc}0&0&\bm{Y}^{\top}&0\\ 0&0&0&\bm{Y}^{\top}\\ \bm{Y}&0&\frac{\bm{I}}{\gamma}&\bm{H}^{\top}\\ 0&\bm{Y}&\bm{H}&\frac{\bm{I}}{\gamma}\end{array}\right]\left[\begin{array}{c}b_{2}\\ b_{1}\\ \bm{\beta}\\ \bm{\alpha}\end{array}\right].

Feeding in a symmetric kernel $\mathcal{K}$, $\bm{H}$ is then a symmetric matrix, $\bm{H}_{ij}=y_{i}\phi_{s}(\bm{x}_{i})^{\top}\phi_{t}(\bm{x}_{j})y_{j}=y_{i}\mathcal{K}(\bm{x}_{i},\bm{x}_{j})y_{j}$, i.e., $\bm{H}=\bm{H}^{\top}$. The following equation holds,

\left[\begin{array}{cccc}0&0&\bm{Y}^{\top}&0\\ 0&0&0&\bm{Y}^{\top}\\ \bm{Y}&0&\frac{\bm{I}}{\gamma}&\bm{H}\\ 0&\bm{Y}&\bm{H}^{\top}&\frac{\bm{I}}{\gamma}\end{array}\right]\left[\begin{array}{c}b_{1}\\ b_{2}\\ \bm{\alpha}\\ \bm{\beta}\end{array}\right]=\left[\begin{array}{cccc}0&0&\bm{Y}^{\top}&0\\ 0&0&0&\bm{Y}^{\top}\\ \bm{Y}&0&\frac{\bm{I}}{\gamma}&\bm{H}\\ 0&\bm{Y}&\bm{H}^{\top}&\frac{\bm{I}}{\gamma}\end{array}\right]\left[\begin{array}{c}b_{2}\\ b_{1}\\ \bm{\beta}\\ \bm{\alpha}\end{array}\right].

The above equation can be simplified by moving the right-hand term to the left:

\left[\begin{array}{cccc}0&0&\bm{Y}^{\top}&0\\ 0&0&0&\bm{Y}^{\top}\\ \bm{Y}&0&\frac{\bm{I}}{\gamma}&\bm{H}\\ 0&\bm{Y}&\bm{H}^{\top}&\frac{\bm{I}}{\gamma}\end{array}\right]\left[\begin{array}{c}b_{1}-b_{2}\\ b_{2}-b_{1}\\ \bm{\alpha}-\bm{\beta}\\ \bm{\beta}-\bm{\alpha}\end{array}\right]=\bm{0}
\stackrel{\rm RT}{\longrightarrow}
\left[\begin{array}{cccc}\bm{Y}&0&\frac{\bm{I}}{\gamma}&\bm{H}\\ 0&\bm{Y}&\bm{H}^{\top}&\frac{\bm{I}}{\gamma}\\ 0&0&\bm{Y}^{\top}&0\\ 0&0&0&\bm{Y}^{\top}\end{array}\right]\left[\begin{array}{c}b_{1}-b_{2}\\ b_{2}-b_{1}\\ \bm{\alpha}-\bm{\beta}\\ \bm{\beta}-\bm{\alpha}\end{array}\right]
\triangleq\bm{A}\cdot\left[\begin{array}{c}b_{1}-b_{2}\\ b_{2}-b_{1}\\ \bm{\alpha}-\bm{\beta}\\ \bm{\beta}-\bm{\alpha}\end{array}\right]=\bm{0}, \qquad (12)

where the matrix $\bm{A}\in\mathbb{R}^{(2m+2)\times(2m+2)}$ denotes the system matrix. It can be noticed that $\bm{A}$ has full rank, and equation (12) thus indicates that $b_{1}=b_{2}$ and $\bm{\alpha}=\bm{\beta}$. Then the two functions in (7) are identical, as below,

\begin{aligned}f_{s}(\bm{x})&=\mathcal{K}(\bm{x},\bm{X})(\bm{\beta}^{\star}\odot\bm{Y})+b_{1}^{\star}\\ &=\mathcal{K}(\bm{X},\bm{x})(\bm{\alpha}^{\star}\odot\bm{Y})+b_{2}^{\star}=f_{t}(\bm{x}).\end{aligned}

Since $b_{1}=b_{2}$, $\bm{\alpha}=\bm{\beta}$, and $\bm{H}$ is a symmetric matrix, the linear system (6) can be simplified into a lower-dimensional linear system as follows:

\left[\begin{array}{cc}0&\bm{Y}^{\top}\\ \bm{Y}&\frac{\bm{I}}{\gamma}+\bm{H}\end{array}\right]\left[\begin{array}{c}b_{1}\\ \bm{\alpha}\end{array}\right]=\left[\begin{array}{c}0\\ \bm{1}\end{array}\right],

which is equivalent to (4), completing the proof. ∎

Symmetric and asymmetric kernels thus lead to a unified linear system. When the data size is not very large, (6) is easy to solve. For large-scale problems, one can consider fixed-size methods [32, 33] or the Nyström approximation [34]; for the latter, necessary modifications are required. In the early work by Schmidt [35], an asymmetric kernel has a pair of adjoint eigenfunctions corresponding to each eigenvalue, so the modified Nyström approximation for asymmetric kernels leads to a standard singular value decomposition. When the kernel is asymmetric, we simultaneously obtain two functions $f_{s},f_{t}$. To fully use the asymmetric information, we need to merge them. Averaging is a simple way, and other ensemble methods such as stacked generalization [36] can also be used. Overall, we summarize the algorithm for AsK-LS in the following.

Algorithm 1 Learning with an asymmetric kernel in AsK-LS
Input: the asymmetric kernel $\mathcal{K}$, the regularization parameter $\gamma$, training data $(\bm{X},\bm{Y})$, and testing data $\bm{Z}$.
Output: the prediction $f(\bm{Z})$ of the testing data $\bm{Z}$.
1:  Calculate the asymmetric matrix $\bm{H}$, $\bm{H}_{ij}=y_{i}\mathcal{K}(\bm{x}_{i},\bm{x}_{j})y_{j}$, from the asymmetric kernel $\mathcal{K}$ and the training data $(\bm{X},\bm{Y})$.
2:  $[\bm{\alpha}^{\star},\bm{\beta}^{\star},b_{1}^{\star},b_{2}^{\star}]\leftarrow$ solve the linear system (6).
3:  Predict the testing data from the source and target views, i.e., $f_{s}(\bm{Z})$ and $f_{t}(\bm{Z})$.
4:  Merge the two classifiers $f_{s}(\bm{Z})$ and $f_{t}(\bm{Z})$ to obtain the final prediction $f(\bm{Z})$.
5:  return $f(\bm{Z})$.
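A compact sketch of Algorithm 1 in Python/NumPy is given below (our illustration under the formulation above, not the authors' released code); it assembles the linear system (6), solves it, evaluates the two discriminant functions (7), and merges them by simple averaging.

```python
import numpy as np

def ask_ls_train(K, y, gamma):
    """Build and solve the AsK-LS linear system (6).
    K: m x m (possibly asymmetric) kernel matrix with K[i, j] = kernel(x_i, x_j);
    y: labels in {-1, +1}; gamma: regularization parameter."""
    m = len(y)
    H = np.outer(y, y) * K                     # H_ij = y_i K(x_i, x_j) y_j
    A = np.zeros((2 * m + 2, 2 * m + 2))
    A[0, 2:m + 2] = y                          # [0  0  Y^T  0  ]
    A[1, m + 2:] = y                           # [0  0  0    Y^T]
    A[2:m + 2, 0] = y                          # [Y  0  I/g  H  ]
    A[2:m + 2, 2:m + 2] = np.eye(m) / gamma
    A[2:m + 2, m + 2:] = H
    A[m + 2:, 1] = y                           # [0  Y  H^T  I/g]
    A[m + 2:, 2:m + 2] = H.T
    A[m + 2:, m + 2:] = np.eye(m) / gamma
    rhs = np.concatenate(([0.0, 0.0], np.ones(2 * m)))
    sol = np.linalg.solve(A, rhs)
    b1, b2 = sol[0], sol[1]
    alpha, beta = sol[2:m + 2], sol[m + 2:]
    return b1, b2, alpha, beta

def ask_ls_predict(K_zx, K_xz, y, b1, b2, alpha, beta):
    """Evaluate the two discriminant functions (7) and average them.
    K_zx[j, i] = kernel(z_j, x_i)  -> used by f_s;
    K_xz[i, j] = kernel(x_i, z_j)  -> used by f_t."""
    f_s = K_zx @ (beta * y) + b1               # f_s(z) = K(z, X)(beta ⊙ Y) + b1
    f_t = K_xz.T @ (alpha * y) + b2            # f_t(z) = K(X, z)(alpha ⊙ Y) + b2
    return np.sign((f_s + f_t) / 2)            # simple averaging of the two views
```

Other merging rules, such as stacked generalization mentioned above, can replace the final averaging step without changing the training part.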

4 Numerical experiments

The aim of designing the AsK-LS is to learn with asymmetric kernels. As discussed before, asymmetric kernels can come from asymmetric metrics and directed graphs. The following experiments are not meant to claim that asymmetric kernels are better than symmetric kernels, which is surely not true since the choice of kernel is problem-dependent. Instead, we will show that when the asymmetric information is crucial, our AsK-LS can largely improve upon the existing kernel learning methods. The experiments are implemented in MATLAB on a PC with an Intel i7-10700K CPU (3.8GHz) and 32GB memory. All the reported results are averages over 10 trials.

4.1 Kullback-Leibler kernel

The Kullback-Leibler (KL) divergence measures how one probability distribution $P$ is different from another probability distribution $Q$:

{\rm KL}(P||Q)=\int_{-\infty}^{\infty}P(x)\log\frac{P(x)}{Q(x)}dx, \qquad (13)

which is asymmetric. KL divergence has been successfully used in many fields, e.g., VAE [37] and GAN [38, 39]. From KL divergence, one can evaluate the similarity between two probability distributions by the exponentiation as follows:

\mathcal{K}(s_{i},s_{j})=\exp(-a\cdot{\rm KL}(P_{i}||P_{j})), \qquad (14)

where $a\in\mathbb{R}^{+}$ is a hyperparameter controlling the scale of the kernel (14), $s$ denotes a sample, and $P_{i}$, $P_{j}$ are the probability distributions estimated from samples $s_{i}$ and $s_{j}$, respectively. Since the KL divergence is asymmetric, the kernel matrix is also asymmetric. Thus the AsK-LS can be utilized to directly learn with the asymmetric KL kernel rather than with its symmetrization [21].

We conduct image classification experiments on the Corel database [40]. The Corel database contains 80 concept groups, e.g., autumn, aviation, dog, and elephant, of which 10 classes are picked: beaches, bus, dinosaurs, elephants, flowers, foods, horses, monuments, snow mountains, and African people & villages. There are 100 images per class: 90 for training and 10 for testing. We follow the standard feature extraction method [21] to obtain a sequence of 64-dimensional discrete cosine transform feature vectors $\bm{X}=\{\bm{x}_{1},\bm{x}_{2},\cdots,\bm{x}_{l}\}$ for each image, where $l$ is the number of feature vectors.

In the experiment, we use a Gaussian mixture model (GMM) with 256 diagonal Gaussian components to estimate the probability distribution $P_{i}(\bm{X}_{i})$ of the image feature vector sequence $\bm{X}_{i}$. Since the KL divergence between two GMMs cannot be calculated in closed form by formulation (13), we use the Monte Carlo method to approximate it. The kernel value between two images is then given by the KL kernel similarity (14) of these two probability distributions.
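The following sketch is our illustration of this pipeline; it uses scikit-learn's GaussianMixture, which is not necessarily the toolchain of the paper, and the Monte Carlo sample size is a free choice of ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(features, n_components=256, seed=0):
    """Fit a diagonal-covariance GMM to one image's DCT feature-vector sequence."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    return gmm.fit(features)

def kl_monte_carlo(gmm_p, gmm_q, n_samples=5000):
    """Monte Carlo estimate of KL(P||Q) = E_P[log p(x) - log q(x)] for two GMMs."""
    x, _ = gmm_p.sample(n_samples)
    return np.mean(gmm_p.score_samples(x) - gmm_q.score_samples(x))

def kl_kernel(gmm_i, gmm_j, a=1.0):
    """Asymmetric KL kernel (14): exp(-a * KL(P_i || P_j))."""
    return np.exp(-a * kl_monte_carlo(gmm_i, gmm_j))
```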

The corresponding asymmetric KL divergence has been widely used in learning tasks. However, because the KL divergence violates the symmetry requirement, it has never been directly used in kernel-based learning. Now, with the AsK-LS, we can use it in classification tasks. As a comparison, the SVM and the LS-SVM are used with a symmetric KL kernel [21]. The parameters, i.e., the hyperparameter $a$ and the regularization parameter $\gamma$, are tuned by 10-fold cross-validation in all three models. The AsK-LS outputs two classifiers, and here we simply average them.

Image classification is a multi-class task, for which we utilize the one-vs-rest scheme; the average Micro-F1 and Macro-F1 scores are reported in Table I. The AsK-LS achieves better performance, showing that learning with the asymmetric information can help.

TABLE I: Micro-/Macro-F1 scores (mean±std) of different methods with the KL kernel between GMMs on the Corel dataset.
Methods SVM LS-SVM AsK-LS
Micro-F1 0.830±0 0.840±0 0.916±0.0048
Macro-F1 0.822±0 0.832±0 0.914±0.0046
TABLE II: Detailed information of the directed graph datasets used.
Datasets     Cora    Citeseer   Pubmed   AM-photo   AM-computer
#Classes     7       6          3        8          10
#Nodes       2078    3327       19717    7650       13752
#Edges       5429    4732       44338    143663     287209

4.2 Directed adjacency matrix

TABLE III: Micro-/Macro-F1 scores (mean±std) of different algorithms on the node classification task.
                         Embedding                      Graph
Datasets      F1 Score   SVM           LS-SVM           SVM           LS-SVM        AsK-LS
Cora          Micro      0.740±0.010   0.733±0.009      0.305±0.010   0.713±0.015   0.753±0.013
              Macro      0.730±0.012   0.724±0.011      0.077±0.004   0.708±0.016   0.748±0.013
Citeseer      Micro      0.497±0.013   0.507±0.010      0.393±0.042   0.596±0.011   0.605±0.012
              Macro      0.454±0.013   0.452±0.010      0.301±0.044   0.560±0.012   0.579±0.012
Pubmed        Micro      0.732±0.006   0.726±0.004      0.602±0.022   0.635±0.006   0.719±0.004
              Macro      0.680±0.006   0.669±0.005      0.451±0.018   0.503±0.006   0.698±0.005
AM-photo      Micro      0.858±0.005   0.834±0.006      0.319±0.022   0.799±0.008   0.885±0.005
              Macro      0.834±0.005   0.767±0.006      0.155±0.018   0.732±0.008   0.874±0.005
AM-computer   Micro      0.606±0.005   0.609±0.004      0.400±0.005   0.619±0.014   0.868±0.002
              Macro      0.429±0.004   0.454±0.003      0.107±0.003   0.443±0.016   0.846±0.005

Node classification with an asymmetric adjacency matrix is a task that requires learning from asymmetric metrics. In this task, the nodes of a directed graph $\mathcal{H}=\{\mathcal{N},\mathcal{E}\}$, with node set $\mathcal{N}$ and edge set $\mathcal{E}$, are classified. Its adjacency matrix, which describes the connections among nodes, is defined as follows,

\mathcal{A}_{ij}=\left\{\begin{aligned} &1\quad\textrm{if }j\rightarrow i\\ &0\quad\textrm{otherwise},\end{aligned}\right.

where $j\rightarrow i$ means that there is a link pointing from node $j$ to node $i$. This model has wide applications, and here we consider five directed graphs, namely Cora, Citeseer, and Pubmed [41], as well as AM-photo and AM-computer [42]. Details of the datasets can be found in Table II. For the first three widely used graphs, the nodes and edges represent documents and citations, respectively. For the latter two, the nodes represent goods, and an edge means that two goods are frequently bought together. Node classification is originally a multi-class task, for which we utilize the one-vs-rest scheme and focus on the Micro-F1 and Macro-F1 scores.

A directed graph is fully characterized by its adjacency matrix. However, in existing kernel methods one cannot directly use the adjacency matrix as a kernel. Therefore, the current mainstream is to first perform feature embedding [17, 18, 19] to extract the asymmetric information and then do classification based on the extracted features. In the experiment, we use a SOTA embedding method, NERD [19], which utilizes random walks to extract node embeddings $\phi_{s}(v)$ and $\phi_{t}(v)$ as the source and target features of each node $v\in\mathcal{N}$ on a directed graph. We combine them into a unified feature $\phi(v)=[\phi^{\top}_{s}(v),\phi^{\top}_{t}(v)]^{\top}$, which then defines a symmetric kernel that can be used in classical kernel-based learning methods, e.g., the SVM and the LS-SVM; the results are reported in the third and fourth columns of Table III.

Since the embedding method extracts the asymmetric information, its performance is much better than that of existing kernel methods applied to the symmetrized adjacency matrix $(\mathcal{A}+\mathcal{A}^{\top})/2$, whose Micro-/Macro-F1 scores are listed in the 5th and 6th columns of Table III.

Now, with the AsK-LS, we can directly learn with the adjacency matrix without feature embedding. Before sending the adjacency matrix to the AsK-LS, we pre-process it by its in-degree $d_{i}=\sum_{j=1}^{m}\mathcal{A}_{ij}$ (the same pre-processing is also applied for the SVM and the LS-SVM), as sketched below. The Micro- and Macro-F1 scores of the AsK-LS are reported in the last column of Table III. The performance is much better than that of the SVM and the LS-SVM with the symmetrized kernel, indicating that the asymmetric information is helpful. Compared with the SOTA embedding method, the AsK-LS is better or at least comparable, showing its effectiveness in extracting the asymmetric information.
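For concreteness, a small sketch is given below. It is our illustration only: the paper does not spell out the exact form of the in-degree pre-processing, so dividing row $i$ by $d_{i}$ is our assumption of one plausible choice.

```python
import numpy as np

def adjacency_matrix(edges, n_nodes):
    """Directed adjacency matrix with A[i, j] = 1 if there is an edge j -> i."""
    A = np.zeros((n_nodes, n_nodes))
    for src, dst in edges:            # an edge src -> dst points to node dst
        A[dst, src] = 1.0
    return A

def indegree_preprocess(A, eps=1e-12):
    """In-degree pre-processing (assumed form): divide row i by d_i = sum_j A[i, j]."""
    d = A.sum(axis=1, keepdims=True)
    return A / np.maximum(d, eps)

# The pre-processed, still asymmetric matrix can then be used directly as the
# kernel matrix on the graph nodes and fed to the AsK-LS system (6).
```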

4.3 Symmetric or asymmetric kernels

We have shown that the proposed AsK-LS can learn with asymmetric kernels. Since asymmetric kernels are more general than symmetric ones, learning with asymmetric kernels is promising for improvement, provided there is an efficient way to obtain a suitable kernel. In this paper, kernels are pre-given, so the performance is determined by the choice of kernel. In the previous sections, the asymmetric information, i.e., measuring differences by the KL divergence and by directed distances, is important, and thus the corresponding asymmetric kernels lead to good performance. If that is not the case, a specific asymmetric kernel is not necessarily better than a symmetric one, e.g., the RBF kernel.

We conduct classification experiments on several datasets from the UCI database [43], where 60% of the data are randomly picked for training and the rest for testing. Two asymmetric kernels, called the SNE kernel and the T kernel, are considered here. They have been used for dimension reduction [14] but not for classification, since previously there was no classification method that could learn with asymmetric kernels directly. Their formulations are given below:

  1. The SNE kernel with parameter $\sigma\in\mathbb{R}$:

     \mathcal{K}_{\rm SNE}(\bm{x},\bm{y})=\frac{\exp(-\|\bm{x}-\bm{y}\|_{2}^{2}/\sigma^{2})}{\sum_{\bm{z}\in\bm{X}}\exp(-\|\bm{x}-\bm{z}\|_{2}^{2}/\sigma^{2})},

  2. The T kernel:

     \mathcal{K}_{\rm T}(\bm{x},\bm{y})=\frac{(1+\|\bm{x}-\bm{y}\|^{2}_{2})^{-1}}{\sum_{\bm{z}\in\bm{X}}(1+\|\bm{x}-\bm{z}\|^{2}_{2})^{-1}},

where $\bm{X}$ stands for the training set. A sketch for computing these kernel matrices is given below.
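The following NumPy sketch is our own illustration of the two kernels above (the helper names are ours). The normalizing sum always runs over the training set, whichever pair of sample sets the kernel block is evaluated on, which is what makes the resulting matrices asymmetric.

```python
import numpy as np

def _sq_dists(A, B):
    """Pairwise squared Euclidean distances, shape (len(A), len(B))."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

def sne_kernel(U, V, X_train, sigma=1.0):
    """K[i, j] = K_SNE(u_i, v_j): Gaussian affinity of u_i to v_j, normalized by
    the total affinity of u_i to the training set X_train (hence asymmetric)."""
    num = np.exp(-_sq_dists(U, V) / sigma ** 2)
    den = np.exp(-_sq_dists(U, X_train) / sigma ** 2).sum(axis=1, keepdims=True)
    return num / den

def t_kernel(U, V, X_train):
    """K[i, j] = K_T(u_i, v_j) with Student-t affinities, normalized over X_train."""
    num = 1.0 / (1.0 + _sq_dists(U, V))
    den = (1.0 / (1.0 + _sq_dists(U, X_train))).sum(axis=1, keepdims=True)
    return num / den

# Training kernel matrix:       K = sne_kernel(X_train, X_train, X_train, sigma)
# Test-vs-train blocks for (7): K(z, X) -> sne_kernel(Z, X_train, X_train, sigma)
#                               K(X, z) -> sne_kernel(X_train, Z, X_train, sigma)
```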

TABLE IV: Classification accuracy (mean±std) of the LS-SVM and the AsK-LS with RBF, SNE, and T kernels on the UCI database.
              LS-SVM          AsK-LS
Datasets      RBF             SNE             T
heart         0.837±0.032     0.824±0.025     0.825±0.031
sonar         0.856±0.001     0.854±0.028     0.865±0.027
monks1        0.791±0.012     0.790±0.014     0.778±0.012
monks2        0.841±0.007     0.866±0.016     0.866±0.014
monks3        0.936±0.003     0.936±0.004     0.901±0.004
pima          0.738±0.026     0.749±0.026     0.752±0.021
australian    0.862±0.015     0.854±0.019     0.859±0.032
spambase      0.925±0.007     0.908±0.007     0.922±0.018

The classification accuracy of the AsK-LS with the two asymmetric kernels is reported in Table IV, together with the accuracy of the LS-SVM with the RBF kernel (all parameters are tuned by 10-fold cross-validation). Generally speaking, the best choice of kernel is problem-dependent, and one cannot assert in advance which kernel is good. But at least the proposed AsK-LS makes it possible to use asymmetric kernels. With delicate design or efficient learning, asymmetric kernels could lead to good performance.

5 Conclusion

In this paper, we investigate the least squares support vector machine with asymmetric kernels from theoretical and algorithmic aspects. The proposed AsK-LS is the first classification model that can learn with asymmetric kernels directly. The primal and dual representations of AsK-LS are given, showing the feature interpretation that two different functions can be simultaneously learned from the source and target views. In numerical experiments, when the asymmetric information is physically important, the AsK-LS with asymmetric kernels significantly outperforms the SVM and the LS-SVM, which can only deal with symmetric kernels.

The most significant contribution of this paper is to make asymmetric kernels useful in kernel-based learning. In methodology, the least squares framework is not the only way to accommodate asymmetric kernels; models based on other kernel methods, e.g., the support vector machine and logistic regression, are worth investigating. In theory, the function space associated with asymmetric kernels is interesting, going beyond the reproducing kernel Hilbert space for PSD kernels and the reproducing kernel Kreĭn space for indefinite kernels. In application, one can design asymmetric kernels for different tasks, especially those that involve directional relationships, including but not limited to directed graphs, distribution distances [44], causality analysis [45, 46], and optimal transport [47, 48].

Acknowledgments

This work was supported by National Natural Science Foundation of China (No. 61977046), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and Shanghai Science and Technology Research Program (20JC1412700 and 19JC1420101). The research leading to these results has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program / ERC Advanced Grant E-DUALITY (787960). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information. This work was supported in part by Research Council KU Leuven: Optimization frameworks for deep kernel machines C14/18/068; Flemish Government: FWO projects: GOA4917N (Deep Restricted Kernel Machines: Methods and Foundations), PhD/Postdoc grant. This research received funding from the Flemish Government (AI Research Program).

References

  • [1] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.   MIT press, 2002.
  • [2] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines.   World Scientific, 2002.
  • [3] V. Vapnik, The Nature of Statistical Learning Theory.   Springer Science & Business Media, 2013.
  • [4] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
  • [5] K. Soman, R. Loganathan, and V. Ajay, Machine Learning with SVM and other Kernel Methods.   PHI Learning Pvt. Ltd., 2009.
  • [6] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, V. Vapnik et al., “Support vector regression machines,” Advances in Neural Information Processing Systems, vol. 9, pp. 155–161, 1997.
  • [7] M. Girolami, “Mercer kernel-based clustering in feature space,” IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.
  • [8] H. Jeffreys, B. Jeffreys, and B. Swirles, Methods of Mathematical Physics.   Cambridge University Press, 1999.
  • [9] C. S. Ong, X. Mary, S. Canu, and A. J. Smola, “Learning with non-positive kernels,” in Proceedings of the 21st International Conference on Machine Learning, 2004, p. 81.
  • [10] B. Haasdonk, “Feature space interpretation of SVMs with indefinite kernels,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 4, pp. 482–492, 2005.
  • [11] X. Huang, A. Maier, J. Hornegger, and J. A. K. Suykens, “Indefinite kernels in least squares support vector machines and principal component analysis,” Applied and Computational Harmonic Analysis, vol. 43, no. 1, pp. 162–172, 2017.
  • [12] D. Oglic and T. Gärtner, “Scalable learning in reproducing kernel Kreĭn spaces,” in International Conference on Machine Learning.   PMLR, 2019, pp. 4912–4921.
  • [13] F. Liu, X. Huang, Y. Chen, and J. A. K. Suykens, “Fast learning in reproducing kernel Kreĭn spaces via signed measures,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2021, pp. 388–396.
  • [14] G. Hinton and S. T. Roweis, “Stochastic neighbor embedding,” in Advances in Neural Information Processing Systems, vol. 15.   Citeseer, 2002, pp. 833–840.
  • [15] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
  • [16] G. Loosli, S. Canu, and C. S. Ong, “Learning SVM in Kreĭn spaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1204–1216, 2015.
  • [17] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in Proceedings of the 22nd ACM SIGKDD International Cconference on Knowledge Discovery and Data Mining, 2016, pp. 1105–1114.
  • [18] C. Zhou, Y. Liu, X. Liu, Z. Liu, and J. Gao, “Scalable graph embedding for asymmetric proximity,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
  • [19] M. Khosla, J. Leonhardt, W. Nejdl, and A. Anand, “Node representation learning for directed graphs,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2019, pp. 395–411.
  • [20] J. A. K. Suykens, “SVD revisited: A new variational principle, compatible feature maps and nonlinear extensions,” Applied and Computational Harmonic Analysis, vol. 40, no. 3, pp. 600–609, 2016.
  • [21] P. J. Moreno, P. Ho, and N. Vasconcelos, “A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications.” in Advances in Neural Information Processing Systems, 2003, pp. 1385–1392.
  • [22] M. Mackenzie and A. K. Tieu, “Asymmetric kernel regression,” IEEE Transactions on Neural Networks, vol. 15, no. 2, pp. 276–282, 2004.
  • [23] C. Kuruwita, K. Kulasekera, and W. Padgett, “Density estimation using asymmetric kernels and bayes bandwidths with censored data,” Journal of Statistical Planning and Inference, vol. 140, no. 7, pp. 1765–1774, 2010.
  • [24] H. L. Koul and W. Song, “Large sample results for varying kernel regression estimates,” Journal of Nonparametric Statistics, vol. 25, no. 4, pp. 829–853, 2013.
  • [25] S. L. Pintea, J. C. van Gemert, and A. W. Smeulders, “Asymmetric kernel in Gaussian processes for learning target variance,” Pattern Recognition Letters, vol. 108, pp. 70–77, 2018.
  • [26] B. Huang, X. Li, Z. Song, and X. Yang, “FL-NTK: A neural tangent kernel-based framework for federated learning analysis,” in International Conference on Machine Learning.   PMLR, 2021, pp. 4423–4434.
  • [27] K. Tsuda, “Support vector classifier with asymmetric kernel functions,” in European Symposium on Artificial Neural Networks (ESANN), 1999, pp. 183–188.
  • [28] A. Munoz, I. M. de Diego, and J. M. Moguerza, “Support vector machine classifiers for asymmetric proximities,” in Artificial Neural Networks and Neural Information Processing—ICANN/ICONIP 2003.   Springer, 2003, pp. 217–224.
  • [29] N. Koide and Y. Yamashita, “Asymmetric kernel method and its application to Fisher’s discriminant,” in the 18th International Conference on Pattern Recognition, vol. 2, 2006, pp. 820–824.
  • [30] W. Wu, J. Xu, H. Li, and S. Oyama, “Asymmetric kernel learning,” Tech. Rep. MSR-TR-2010-85, June 2010. [Online]. Available: https://www.microsoft.com/en-us/research/publication/asymmetric-kernel-learning/
  • [31] J. A. K. Suykens, “Deep restricted kernel machines using conjugate feature duality,” Neural Computation, vol. 29, no. 8, pp. 2123–2163, 2017.
  • [32] M. Espinoza, J. A. K. Suykens, and B. De Moor, “Fixed-size least squares support vector machines: A large scale application in electrical load forecasting,” Computational Management Science, vol. 3, no. 2, pp. 113–129, 2006.
  • [33] K. De Brabanter, J. De Brabanter, J. A. K. Suykens, and B. De Moor, “Optimized fixed-size kernel models for large data sets,” Computational Statistics & Data Analysis, vol. 54, no. 6, pp. 1484–1504, 2010.
  • [34] C. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Proceedings of the 14th Annual Conference on Neural Information Processing Systems, 2001, pp. 682–688.
  • [35] G. W. Stewart, “On the early history of the singular value decomposition,” SIAM Review, vol. 35, no. 4, pp. 551–566, 1993.
  • [36] K. M. Ting and I. H. Witten, “Stacking bagged and dagged models,” in Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 367–375.
  • [37] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in the 2nd International Conference on Learning Representations, 2014.
  • [38] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
  • [39] J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, “A review on generative adversarial networks: Algorithms, theory, and applications,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2021.
  • [40] J. Z. Wang, J. Li, and G. Wiederhold, “Simplicity: Semantics-sensitive integrated matching for picture libraries,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 947–963, 2001.
  • [41] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI Magazine, vol. 29, no. 3, pp. 93–93, 2008.
  • [42] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, “Pitfalls of graph neural network evaluation,” arXiv preprint arXiv:1811.05868, 2018.
  • [43] C. Blake and C. Merz, “UCI repository of machine learning databases,” 1998.
  • [44] S. K. Zhou and R. Chellappa, “From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 6, pp. 917–929, 2006.
  • [45] K. Radinsky, S. Davidovich, and S. Markovitch, “Learning causality for news events prediction,” in Proceedings of the 21st international conference on World Wide Web, 2012, pp. 909–918.
  • [46] H. Xu, M. Farajtabar, and H. Zha, “Learning Granger causality for Hawkes processes,” in International Conference on Machine Learning.   PMLR, 2016, pp. 1717–1726.
  • [47] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, “Optimal transport for domain adaptation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1853–1865, 2016.
  • [48] B. Su and G. Hua, “Order-preserving optimal transport for distances between sequences,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 12, pp. 2961–2974, 2018.