Kernel Subspace and Feature Extraction
Abstract
We study kernel methods in machine learning from the perspective of feature subspaces. We establish a one-to-one correspondence between feature subspaces and kernels and propose an information-theoretic measure for kernels. In particular, we construct a kernel from the Hirschfeld–Gebelein–Rényi maximal correlation functions, coined the maximal correlation kernel, and demonstrate its information-theoretic optimality. We use the support vector machine (SVM) as an example to illustrate a connection between kernel methods and feature extraction approaches. We show that the kernel SVM with the maximal correlation kernel achieves the minimum prediction error. Finally, we interpret the Fisher kernel as a special maximal correlation kernel and establish its optimality.
I Introduction
One main objective of machine learning is to obtain useful information from often high-dimensional data. To this end, it is common practice to extract meaningful feature representations from the original data and then process the features [1]. Neural networks [2] and kernel methods [3, 4, 5, 6] are two of the most representative approaches for mapping data into a feature space. In neural networks, the features are represented as the outputs of hidden neurons in the network. In contrast, the feature mapping in kernel methods is determined by the chosen kernel; it is used only implicitly and is often infinite-dimensional. While kernel approaches require far fewer parameters and can achieve good empirical performance on certain tasks [7], the performance relies heavily on the choice of kernel. Despite many attempts to investigate kernel methods [8, 9, 6], a theoretical understanding of the mechanism behind them is still lacking, which restricts their application to complicated data.
On the other hand, feature extraction in deep neural networks has recently been studied through information-theoretic and statistical analyses [10, 11]. For example, it was shown in [10] that the feature extracted by deep neural networks coincides with the most informative feature, which is essentially related to the classical Hirschfeld–Gebelein–Rényi (HGR) maximal correlation problem [12, 13, 14]. Such theoretical characterizations provide a better understanding of existing algorithms and have proven useful in designing algorithms for multimodal learning tasks [15].
In this paper, our goal is to characterize kernel methods from the perspective of feature subspaces and reveal their connection with other learning approaches. We first introduce the kernel associated with each given feature subspace, which we coin the projection kernel, to establish a correspondence between kernel operations and geometric operations in feature subspaces. This connection allows us to study kernel methods by analyzing the corresponding feature subspaces. Specifically, we propose an information-theoretic measure for projection kernels and demonstrate that the information-theoretically optimal kernel can be constructed from the HGR maximal correlation functions, coined the maximal correlation kernel. We further demonstrate that the SVM with the maximal correlation kernel achieves the minimum prediction error, which justifies its optimality in learning tasks. Our analysis also reveals connections between SVM and other classification approaches, including neural networks. Finally, we interpret the Fisher kernel, a classical kernel induced from parameterized distribution families [16], as a special case of the maximal correlation kernel, thus demonstrating its optimality.
II Preliminaries and Notations
Throughout this paper, we use $X$ and $Y$ to denote two random variables with alphabets $\mathcal{X}$ and $\mathcal{Y}$, and denote their joint distribution and marginals as $P_{X,Y}$ and $P_X, P_Y$, respectively. We also use $\mathbb{E}[\cdot]$ to denote the expectation with respect to $P_{X,Y}$.
II-A Feature Space
We adopt the notation convention introduced in [15], and let $\mathcal{F}_{\mathcal{X}}$ denote the feature space formed by the (one-dimensional) features of $X$, with the geometry defined as follows. The inner product on $\mathcal{F}_{\mathcal{X}}$ is defined as $\langle f_1, f_2 \rangle \triangleq \mathbb{E}[f_1(X) f_2(X)]$ for $f_1, f_2 \in \mathcal{F}_{\mathcal{X}}$. This induces a norm $\|f\| \triangleq \sqrt{\langle f, f \rangle}$ for $f \in \mathcal{F}_{\mathcal{X}}$. Then, for given $f \in \mathcal{F}_{\mathcal{X}}$ and subspace $\mathcal{S}$ of $\mathcal{F}_{\mathcal{X}}$, we denote the projection of $f$ onto $\mathcal{S}$ as
(1) |
In addition, for a $k$-dimensional feature $f = (f_1, \dots, f_k)$, we use $\operatorname{span}(f)$ to denote the subspace spanned by all $k$ dimensions of $f$. We also use $\bar{f}$ to denote the centered $f$, i.e., $\bar{f}(x) \triangleq f(x) - \mathbb{E}[f(X)]$, and denote the covariance matrix of $f$ by $\Lambda_f$.
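As a concrete illustration of this geometry, the sketch below computes the inner product and the projection in (1) on a finite alphabet. It assumes the inner product takes the form $\mathbb{E}[f_1(X) f_2(X)]$ stated above; the marginal `px`, the basis, and the feature values are hypothetical toy choices.

```python
import numpy as np

# Toy alphabet and marginal P_X (illustrative values).
px = np.array([0.2, 0.3, 0.1, 0.4])            # P_X over a 4-symbol alphabet

def inner(f1, f2, px=px):
    """Assumed inner product <f1, f2> = E[f1(X) f2(X)] on the feature space."""
    return np.sum(px * f1 * f2)

def project(f, basis, px=px):
    """Project feature f onto span(basis) under the weighted inner product.

    basis: list of features (arrays over the alphabet), not necessarily orthonormal.
    Solves the normal equations G a = b, with G_ij = <b_i, b_j> and b_i = <b_i, f>.
    """
    B = np.stack(basis)                         # shape (k, |X|)
    G = (B * px) @ B.T                          # Gram matrix under the weighted inner product
    b = (B * px) @ f
    a = np.linalg.solve(G, b)
    return a @ B

f = np.array([1.0, -2.0, 0.5, 3.0])
basis = [np.ones(4), np.array([1.0, 0.0, -1.0, 2.0])]
f_hat = project(f, basis)
# Orthogonality check: the residual is orthogonal to every basis feature.
print([round(inner(f - f_hat, b), 10) for b in basis])
```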
II-B Kernel
Given $\mathcal{X}$, a symmetric function $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel on $\mathcal{X}$ if, for every finite subset $\{x_1, \dots, x_n\} \subset \mathcal{X}$, the $n \times n$ matrix $[k(x_i, x_j)]_{i,j=1}^{n}$ is positive semidefinite. For each kernel $k$, we define the associated functional operator as
(2) |
and we use to denote the correspondence between and . Furthermore, we define the centered kernel as
(3) |
where we have defined .
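The sketch below illustrates these definitions empirically on a sample: it checks positive semidefiniteness of a Gram matrix and applies the usual double-centering of the Gram matrix, which is one standard finite-sample realization of a centered kernel. Since the exact form of (3) is not reproduced above, the centering step should be read as an assumption; the data and the RBF kernel are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # 50 samples, 3-dimensional data

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

K = rbf_kernel(X, X)

# Positive semidefiniteness of the Gram matrix (the defining property of a kernel):
eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigvals.min())         # >= 0 up to numerical error

# Empirical centering (assumed finite-sample analogue of (3)):
n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
K_centered = H @ K @ H                          # subtract row, column, and grand means
```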
The following fact is the basis of the kernel trick in learning algorithms.
Fact 1
For each given kernel $k$, there exist an inner product space with inner product $\langle \cdot, \cdot \rangle$ and a mapping $\phi \colon \mathcal{X} \to$ that space, such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for all $x, x' \in \mathcal{X}$.
Remark 1
In addition, we introduce the kernelized discriminative model (KDM) as follows.
Definition 1 (Kernelized Discriminative Model)
For each kernel , we define its associated kernelized discriminative model as
(4) |
Then, we use to denote the maximum a posteriori (MAP) estimation induced from KDM , i.e.,
(5) |
The KDM can be regarded as a generalized probability distribution, since we have for all while can sometimes take negative values.
II-C Modal Decomposition, Maximal Correlation, and H-score
Proposition 1 (Modal Decomposition [11])
For given , there exists , such that
(6) |
where , and for all , where denotes the indicator function.
It can be shown that pairs are the most correlated function pairs of and , referred to as maximal correlation functions. We also denote , known as the HGR maximal correlation [12, 13, 14] of and , and define the -dimensional feature . In particular, when is binary, we have .
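For finite alphabets, the modal decomposition and the maximal correlation functions can be computed from an SVD of the matrix with entries $P_{X,Y}(x, y) / \sqrt{P_X(x) P_Y(y)}$: the top singular value is $1$, and the remaining singular triples give the coefficients and function pairs in (6). The sketch below follows this standard construction on a hypothetical joint distribution.

```python
import numpy as np

# Toy joint distribution P_{X,Y} over |X| = 4, |Y| = 2 (rows: x, columns: y).
Pxy = np.array([[0.10, 0.15],
                [0.20, 0.05],
                [0.05, 0.20],
                [0.15, 0.10]])
Px, Py = Pxy.sum(1), Pxy.sum(0)

# Canonical dependence matrix with entries P(x, y) / sqrt(P(x) P(y)).
B = Pxy / np.sqrt(np.outer(Px, Py))
U, s, Vt = np.linalg.svd(B)
k = len(s)

# The top singular value is 1 (singular vectors sqrt(Px), sqrt(Py)); the remaining
# triples give the coefficients sigma_i and maximal correlation functions f_i, g_i in (6).
sigma = s[1:k]
f = U[:, 1:k] / np.sqrt(Px)[:, None]            # f_i(x): zero mean, unit variance under P_X
g = Vt[1:k, :].T / np.sqrt(Py)[:, None]         # g_i(y): zero mean, unit variance under P_Y

print("HGR maximal correlation:", sigma[0])
print("E[f_1(X)] =", Px @ f[:, 0], " E[f_1(X)^2] =", Px @ f[:, 0]**2)
```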
It has been shown in [11] that the maximal correlation functions are the optimal features of in inferring or estimating . In general, given a -dimensional feature of , the effectiveness of in inferring or estimating can be measured by its H-score [10, 11], defined as
(7) |
where . It can be verified that for all and , we have
(8) |
where are as defined in (6).
II-D Binary Classification
We consider the binary classification problem of predicting a binary label $Y$ from the data variable $X$. For convenience, we assume $Y$ takes values from $\mathcal{Y} = \{-1, +1\}$.
Suppose the training dataset contains $n$ sample pairs $\{(x_i, y_i)\}_{i=1}^{n}$ of $(X, Y)$, and let $\hat{P}_{X,Y}$ denote the corresponding empirical distribution, i.e.,
(9) |
II-D1 Support Vector Machine
The support vector machine (SVM) solves binary classification tasks by finding the optimal hyperplane that separates two classes with maximum margin [7]. Given -dimensional feature mapping , the loss for SVM based on can be written as
(10) |
where are the parameters of the hyperplane, where is a hyperparameter of SVM, and where denotes the hinge loss, defined as with .
Moreover, let and denote the optimal parameters and the value of loss function, respectively. Then, the prediction of SVM is
(11) |
where denotes the sign function.
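Since the exact expression in (10) is not reproduced here, the sketch below implements one common primal form of the SVM objective, an average hinge loss plus a quadratic regularizer, together with the sign decision rule in (11). The placement of the regularization hyperparameter `lam` is an assumption and may differ from the paper's convention.

```python
import numpy as np

def hinge(t):
    """Hinge loss: max(0, 1 - t)."""
    return np.maximum(0.0, 1.0 - t)

def svm_primal_loss(w, b, F, y, lam=0.1):
    """A common primal SVM objective on a feature matrix F (n x k) and labels y in {-1, +1}:
    average hinge loss on the margins plus (lam / 2) * ||w||^2 (assumed form of (10))."""
    margins = y * (F @ w + b)
    return hinge(margins).mean() + 0.5 * lam * np.dot(w, w)

def svm_predict(w, b, F):
    """Decision rule (11): the sign of the affine score."""
    return np.sign(F @ w + b)
```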
Specifically, for a given kernel , the prediction of the corresponding kernel SVM is , where $\phi$ is any mapping given by Fact 1. (It is worth mentioning that the practical implementation of kernel SVM is typically done by solving a dual optimization problem without explicitly using $\phi$; see [17, Section 12] for detailed discussions.)
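In line with the remark above, kernel SVM is usually trained through the dual using Gram matrices rather than an explicit feature mapping. A minimal sketch using scikit-learn's precomputed-kernel interface is given below; the synthetic data and the RBF kernel are illustrative choices, not part of the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 5))
y_train = np.sign(X_train[:, 0] + 0.3 * rng.normal(size=100)).astype(int)
X_test = rng.normal(size=(20, 5))

def rbf(A, B, gamma=0.2):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

# Gram matrices: (train x train) for fitting, (test x train) for prediction.
K_train = rbf(X_train, X_train)
K_test = rbf(X_test, X_train)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y_train)
y_hat = clf.predict(K_test)
```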
II-D2 Logistic Regression and Neural Networks
Given -dimensional feature of , the discriminative model of logistic regression is , where are the weight and bias, respectively, and where is defined as .
Then, the loss of logistic regression is , and the optimal parameters are learned by minimizing the loss, i.e., . The resulting decision rule is
(12) |
Logistic regression is often used as the classification layer of multi-layer neural networks, where  and  correspond to the weights and the bias term, respectively. In this case, the feature mapping also takes a parameterized form, and its parameters are jointly learned with the weights and the bias.
III Projection Kernel and Informative Features
In this section, we introduce a one-to-one correspondence between kernels and feature subspaces, and then characterize the informativeness of kernels by investigating the features in the associated subspaces.
III-A Projection Kernel and Feature Subspace
We first introduce a family of kernels with a one-to-one correspondence to feature subspaces.
Definition 2 (Projection Kernel)
Let denote a -dimensional subspace of with a basis . We use to denote the projection kernel associated with , defined as , where we have defined and .
With slight abuse of notation, we also denote , the projection kernel associated with .
Note that is a valid kernel function, and the corresponding mapping in Fact 1 can be chosen as for any orthonormal basis of . It turns out that the functional operators associated with projection kernels are projection operators in the feature space, which we formalize as follows. A proof is provided in Appendix -A.
Property 1
Let denote the operator corresponding to subspace [cf. (2)], then we have for all .
Therefore, given a projection kernel , the associated subspace can be represented as , where is the associated operator. This establishes a one-to-one correspondence between projection kernels and feature subspaces.
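On a finite alphabet, Definition 2 and Property 1 can be checked numerically. The sketch below assumes the projection kernel takes the form $k(x, x') = f(x)^{\mathrm{T}} G^{-1} f(x')$ for a basis $f$ with Gram matrix $G$ (which reduces to a sum of products for an orthonormal basis), and that the operator in (2) acts as $(T h)(x) = \mathbb{E}[k(x, X') h(X')]$; both forms are assumptions consistent with the text. Under them, the operator is idempotent and fixes every feature in the subspace, as Property 1 states.

```python
import numpy as np

px = np.array([0.2, 0.3, 0.1, 0.4])                     # marginal P_X on a 4-symbol alphabet
F = np.array([[1.0, 0.5],                               # two (not necessarily orthonormal)
              [0.0, -1.0],                              # basis features, one row per x
              [2.0, 0.3],
              [-1.0, 1.0]])

# Projection kernel (assumed form): k(x, x') = f(x)^T G^{-1} f(x'),
# where G_ij = E[f_i(X) f_j(X)] is the Gram matrix of the basis.
G = F.T @ (px[:, None] * F)
K = F @ np.linalg.solve(G, F.T)

# Associated operator (assumed form of (2)): (T h)(x) = E[k(x, X') h(X')].
T = K * px[None, :]

# Property 1: T is the orthogonal projection onto span(F), hence idempotent...
print(np.allclose(T @ T, T))                            # True
# ...and it fixes every feature in the subspace.
print(np.allclose(T @ F, F))                            # True
```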
III-B H-score and Informative Features
The projection kernel provides a connection between feature subspaces and kernels, which allows us to characterize a subspace in terms of its corresponding kernel . Specifically, we can represent the H-score [cf. (7)] of a feature in terms of the projection kernel , formalized as follows. A proof is provided in Appendix -B.
Proposition 2
For all with , we have , where we have defined such that the joint distribution of and is
(13) |
With slight abuse of notation, we can use to denote the H-score corresponding to feature subspace . In particular, we have the following characterization of when is binary. A proof is provided in Appendix -C.
Proposition 3
Suppose is binary, and is the maximal correlation function of . Then, for each subspace of , we have
(14) |
From Proposition 3, depends only on the projection of onto , which is also the most informative feature in . In addition, note that since , is also the cosine value of the principal angle between and . Therefore, we can interpret the H-score as a measure of the principal angle between the optimal feature and the given subspace.
III-C Maximal Correlation Kernel
Note that from (8), is maximized when takes the maximal correlation function . Therefore, the subspace (and thus projection kernel ) is optimal in terms of the H-score measure. We will denote , referred to as the maximal correlation kernel.
Specifically, the KDM (cf. Definition 1) of maximal correlation kernel coincides with the underlying conditional distribution , demonstrated as follows. A proof is provided in Appendix -D.
Property 2
For all and , we have and , where denotes the MAP estimation, i.e.,
(15) |
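The identity underlying Property 2 can be verified numerically: by the modal decomposition (6), the conditional distribution satisfies $P_{Y|X}(y|x) = P_Y(y)\bigl(1 + \sum_i \sigma_i f_i^*(x) g_i^*(y)\bigr)$, so a model built from the maximal correlation functions reproduces $P_{Y|X}$ exactly. The sketch below checks this on the same toy joint distribution as before; the exact form of the KDM in (4) is not reproduced here, so only this underlying identity is verified.

```python
import numpy as np

# Verify P(y|x) = P(y) * (1 + sum_i sigma_i f_i*(x) g_i*(y)), with f*, g* from the SVD.
Pxy = np.array([[0.10, 0.15],
                [0.20, 0.05],
                [0.05, 0.20],
                [0.15, 0.10]])
Px, Py = Pxy.sum(1), Pxy.sum(0)
B = Pxy / np.sqrt(np.outer(Px, Py))
U, s, Vt = np.linalg.svd(B)
k = len(s)
sigma = s[1:k]
f = U[:, 1:k] / np.sqrt(Px)[:, None]
g = Vt[1:k, :].T / np.sqrt(Py)[:, None]

P_y_given_x = Py[None, :] * (1.0 + (f * sigma) @ g.T)    # shape (|X|, |Y|)
print(np.allclose(P_y_given_x, Pxy / Px[:, None]))       # True: exact recovery of P(Y|X)

# MAP prediction (15) from the reconstructed conditional distribution:
y_map = np.argmax(P_y_given_x, axis=1)
```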
As we will develop in the next section, the maximal correlation kernel also achieves the optimal performance in support vector machine.
IV Support Vector Machine Analysis
In this section, we investigate support vector machine, a representative kernel approach for binary classification. Let denote the training data and corresponding label taken from , with denoting the empirical distribution as defined in (9). Throughout this section, we will focus on the balanced dataset with
(16) |
It can be verified that in this case, the MAP estimation [cf. (15)] can be expressed in terms of maximal correlation function. A proof is provided in Appendix -E.
Property 3
Under assumption (16), we can express the MAP estimation as for all , where is the maximal correlation function of .
IV-A SVM on Given Features
We first consider the SVM algorithm applied on a given feature representation , which can also be regarded as the kernel SVM on kernel .
To begin, for each given feature and , let us define
Then we have the following characterization, a proof of which is provided in Appendix -F.
Theorem 1
For all given feature and , we have
(17) |
where we have defined and , with , and where .
Specifically, when , we have , which can be achieved by
(18) |
and the resulting SVM prediction is
(19)
(20)
From Theorem 1, when , the SVM decision does not depend on the value of . In the remainder, we will focus on the regime where , and drop the  in expressions whenever possible, e.g., we simply denote  by . As we will see shortly, SVM can still achieve the minimum prediction error in this regime by using a good feature mapping (or, equivalently, a good kernel).
From (20), the SVM prediction can be interpreted as a nearest-centroid classifier, where the decision is based on comparing the distances between and the class centroids , . In addition, from
we can interpret the SVM loss as measuring the distance between two class centroids.
Furthermore, when is a one-dimensional feature, we can rewrite (19) as
where . Therefore, the decision rule depends only on the projection of onto the subspace , which is also the most informative feature in the subspace (cf. Proposition 3). Later on, we will see a similar geometric illustration of kernel SVM.
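The nearest-centroid reading of (20) is easy to state in code. The sketch below is only an illustration of that interpretation, and it matches the SVM output only in the regime covered by Theorem 1.

```python
import numpy as np

def nearest_centroid_predict(F_train, y_train, F_test):
    """Nearest-centroid reading of (20): pick the class whose feature centroid is closer."""
    mu_pos = F_train[y_train == +1].mean(axis=0)
    mu_neg = F_train[y_train == -1].mean(axis=0)
    d_pos = np.sum((F_test - mu_pos) ** 2, axis=1)
    d_neg = np.sum((F_test - mu_neg) ** 2, axis=1)
    # Equivalently: sign((mu_pos - mu_neg) @ f(x) - (||mu_pos||^2 - ||mu_neg||^2) / 2),
    # i.e., a linear rule whose weight vector is the difference of the class centroids.
    return np.where(d_pos < d_neg, +1, -1)
```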
Moreover, we can establish a connection between SVM loss and the H-score measure, formalized as the following corollary. A proof is provided in Appendix -G.
Corollary 1
Suppose , then we have
where and denote the maximum and minimum positive eigenvalues of the covariance matrix , respectively. Specifically, if , then we have
As a result, for each normalized feature with covariance matrix , the SVM loss measures the informativeness of in inferring the label .
IV-B Kernel SVM
In practice, instead of applying SVM to a given or manually designed feature , it is more common to implement SVM directly with a kernel . Similar to Theorem 1, we have the following characterization, from which we can interpret the KDM as a probabilistic output of kernel SVM.
Theorem 2
Proof:
Let and denote the inner product space and mapping associated with kernel (cf. Fact 1), and let . Then, we have
(21) |
which can be rewritten as
where to obtain the second equality we have used the modal decomposition of (cf. Fact 2).
Hence, from Theorem 1 we obtain
It remains only to establish the equivalence between and KDM decision . To this end, note that from (4) and the balanced dataset assumption (16), we have
for all .
Hence, for all ,
which completes the proof. ∎
From Theorem 2, the final decision depends on only through the centered kernel . Moreover, comparing Theorem 2 with Property 3, the kernel SVM prediction differs from the MAP prediction only in applying the operator on . In particular, when the maximal correlation function is an eigenfunction of the corresponding operator , i.e., for some , the SVM prediction coincides with the MAP prediction, i.e., for all .
If we restrict our attention to projection kernels, the kernel SVM decision can further be interpreted as a projection operation onto the associated subspace. To see this, let denote a feature subspace of spanned by zero-mean features; then, from Theorem 1 and Proposition 3, the kernel SVM loss for is
which measures the principal angle between and . In addition, the decision rule can be expressed as
(22) |
From Proposition 3, is also the most informative feature in . Therefore, kernel SVM on is equivalent to first extracting the most informative feature in , and then using the extracted feature to make the decision.
IV-C Relationship to Other Classification Approaches
IV-C1 Maximum a Posteriori (MAP) Estimation
IV-C2 Logistic Regression and Neural Networks
We have interpreted SVM as extracting the most informative feature, where the informativeness is measured by the H-score. The analysis in [10] has shown that logistic regression is also equivalent to maximizing the H-score when and are weakly dependent. Indeed, we can show that SVM and logistic regression lead to the same prediction in the weak dependence regime, which we formalize as follows. A proof is provided in Appendix -H.
Proposition 4
Suppose for some . For SVM and logistic regression applied on feature with covariance , the optimal parameters satisfy
where is the hyperparameter in SVM. In addition, we have for sufficiently small.
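A finite-sample illustration of Proposition 4 (not a proof): generate features that are only weakly dependent on the label, fit a linear SVM and a logistic regression, and compare their predictions. The data-generating model and the value of `eps` below are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, k, eps = 2000, 3, 0.05                       # eps controls the (weak) label dependence
y = rng.choice([-1, 1], size=n)                 # balanced labels
F = rng.normal(size=(n, k)) + eps * y[:, None]  # features only weakly dependent on y

svm = LinearSVC(C=1.0, max_iter=10000).fit(F, y)
lr = LogisticRegression().fit(F, y)

# In this weak-dependence regime the two decision rules (nearly) coincide.
print("matching predictions:", np.mean(svm.predict(F) == lr.predict(F)))
```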
Remark 2
Since H-score can also be directly maximized by implementing the maximal correlation regression [18], a similar connection holds for SVM and maximal correlation regression.
V Fisher Kernel
Given a family of distributions supported on and parameterized by , suppose the score function exists. Then, the Fisher kernel is defined as the projection kernel associated with the score function , i.e., .
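As a minimal example, consider a scalar Gaussian location family $\mathcal{N}(\theta, 1)$, for which the score function is $s(x) = x - \theta$ and the Fisher information equals $1$. The sketch below forms the corresponding Fisher kernel as the (rank-one) projection kernel of the score feature; the normalization by the Fisher information follows the common convention of [16], and the evaluation points are illustrative.

```python
import numpy as np

# Fisher kernel for a scalar Gaussian location family p(x; theta) = N(theta, 1).
# Score function: s(x) = d/dtheta log p(x; theta) = x - theta; Fisher information I = 1.
theta0 = 0.5

def score(x, theta=theta0):
    return x - theta

def fisher_kernel(x, xp, fisher_info=1.0):
    """k(x, x') = s(x) I^{-1} s(x'): the projection kernel of the score feature."""
    return score(x) * (1.0 / fisher_info) * score(xp)

xs = np.array([-1.0, 0.0, 1.0, 2.0])
K = fisher_kernel(xs[:, None], xs[None, :])
print(np.linalg.eigvalsh(K))                    # rank one and PSD: a single score feature
```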
Specifically, we consider classification tasks where the joint distribution between the data variable and the label is a mixture of the parameterized forms. Suppose for each class , the data variable is generated from
(23) |
for some . Then we have the following result, a proof of which is provided in Appendix -I.
Theorem 3
Suppose for all , and let . Then for the joint distribution generated according to (23), we have
(24)
(25)
where denotes the centered , and where is the maximal correlation kernel defined on . In addition, the H-score of satisfies
(26) |
where denotes the mutual information between and .
VI Conclusion
In this paper, we study kernel methods from the perspective of feature subspaces and demonstrate a connection between kernel methods and informative feature extraction problems. Using SVM as an example, we illustrate the relationship between kernel methods and neural networks. These theoretical results can help guide practical kernel design and integrate kernel methods with feature-based learning approaches.
Acknowledgments
This work was supported in part by the National Science Foundation (NSF) under Award CNS-2002908 and the Office of Naval Research (ONR) under grant N00014-19-1-2621.
-A Proof of Property 1
Suppose is a -dimensional feature subspace. Let be an orthonormal basis of , i.e., we have , and define . Then we have and .
Therefore,
which implies that .
From the orthogonality principle, it suffices to prove that for all and . To this end, suppose for some . Then, we have
which completes the proof. ∎
-B Proof of Proposition 2
First, note that for each ,
Therefore, we have
which completes the proof. ∎
-C Proof of Proposition 3
We start with the first equality. Suppose has modal decomposition (cf. Proposition 1)
where satisfies and .
Then, we obtain
and the as defined in (13) can be expressed as
where the last equality follows from the fact that .
Note that since , we have
(27) |
In addition, let denote the operator associated with , then from Property 1 we have . In addition, from the orthogonality principle, we have and thus
(28) |
Hence, the first equality of (14) can be obtained from
where the first equality follows from Proposition 2, where the second equality follows from (27), and where the last equality follows from (28).
To obtain the second and third equalities of (14), it suffices to note that for all , we have
where the equalities follow from the orthogonality principle, and where the inequality follows from the definition of projection [cf. (1)]. In addition, it can be verified that the inequality holds with equality when .
Hence, for all , we have
which completes the proof. ∎
-D Proof of Property 2
It suffices to prove that . To this end, suppose satisfies the modal decomposition (6), and let , . Then, it can be verified that for all , which implies that .
Since , we have , for all .
Hence, for all , we have
which completes the proof. ∎
-E Proof of Property 3
Our proof will make use of the following fact.
Fact 2
If and , the modal decomposition of (cf. Proposition 1) can be written as
where is the maximal correlation coefficient, and is the maximal correlation function with .
-F Proof of Theorem 1
Our proof will make use of the following simple fact.
Fact 3
Given a random variable taking values from , let denote the minimum entry in , then we have
where .
Let , then we have
Therefore, from the upper bound in Fact 3, we have
(30) | |||
(31) |
where to obtain the last inequality, we have used the fact that
since for each , we have
Therefore
Finally, if , it can be readily verified that . As a result, the optimal solution is given by
and we have
Therefore, the SVM prediction is given by
-G Proof of Corollary 1
From Theorem 1, when , we have
Therefore, it suffices to prove that
(32) |
To this end, note that we have
where the last equality follows from the fact that for zero-mean and uniformly distributed on , we have for .
-H Proof of Proposition 4
First, note that logistic regression can be regarded as a special case of softmax regression with the correspondences , where and are the weights and bias for softmax regression, respectively.
In addition, from [10, Theorem 2], the centered weight and are
Therefore, we obtain
and
which implies that
-I Proof of Theorem 3
To begin, note that we have
which implies that
Therefore, we obtain
where we have used the fact that since for all .
Hence, we have
(34) |
References
- [1] D. Storcheus, A. Rostamizadeh, and S. Kumar, “A survey of modern questions and challenges in feature extraction,” in Feature Extraction: Modern Questions and Challenges. PMLR, 2015, pp. 1–18.
- [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müllers, “Fisher discriminant analysis with kernels,” in Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No. 98TH8468). IEEE, 1999, pp. 41–48.
- [4] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of Machine Learning Research, vol. 3, no. Jul, pp. 1–48, 2002.
- [5] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
- [6] B. Schölkopf, “The kernel trick for distances,” Advances in Neural Information Processing Systems, vol. 13, 2000.
- [7] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
- [8] J. Platt et al., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61–74, 1999.
- [9] J. Xu, R. Jenssen, A. Paiva, and I. Park, “A reproducing kernel Hilbert space framework for ITL,” in Information Theoretic Learning. Springer, 2010, pp. 351–384.
- [10] X. Xu, S.-L. Huang, L. Zheng, and G. W. Wornell, “An information theoretic interpretation to deep neural networks,” Entropy, vol. 24, no. 1, p. 135, 2022.
- [11] S.-L. Huang, A. Makur, G. W. Wornell, and L. Zheng, “On universal features for high-dimensional learning and inference,” arXiv preprint arXiv:1911.09105, 2019.
- [12] H. O. Hirschfeld, “A connection between correlation and contingency,” in Proceedings of the Cambridge Philosophical Society, vol. 31, no. 4, 1935, pp. 520–524.
- [13] H. Gebelein, “Das statistische problem der korrelation als variations-und eigenwertproblem und sein zusammenhang mit der ausgleichsrechnung,” ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, vol. 21, no. 6, pp. 364–379, 1941.
- [14] A. Rényi, “On measures of dependence,” Acta Mathematica Academiae Scientiarum Hungarica, vol. 10, no. 3-4, pp. 441–451, 1959.
- [15] X. Xu and L. Zheng, “Multivariate feature extraction,” in 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2022, pp. 1–8.
- [16] T. S. Jaakkola and D. Haussler, “Exploiting generative models in discriminative classifiers,” Advances in Neural Information Processing Systems, pp. 487–493, 1999.
- [17] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009, vol. 2.
- [18] X. Xu and S.-L. Huang, “Maximal correlation regression,” IEEE Access, vol. 8, pp. 26591–26601, 2020.
- [19] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.