
Kernel Subspace and Feature Extraction

Xiangxiang Xu and Lizhong Zheng Dept. EECS, MIT
Cambridge, MA 02139, USA
Email: {xuxx, lizhong}@mit.edu
Abstract

We study kernel methods in machine learning from the perspective of feature subspaces. We establish a one-to-one correspondence between feature subspaces and kernels and propose an information-theoretic measure for kernels. In particular, we construct a kernel from the Hirschfeld–Gebelein–Rényi maximal correlation functions, coined the maximal correlation kernel, and demonstrate its information-theoretic optimality. We use the support vector machine (SVM) as an example to illustrate a connection between kernel methods and feature extraction approaches. We show that the kernel SVM with the maximal correlation kernel achieves the minimum prediction error. Finally, we interpret the Fisher kernel as a special maximal correlation kernel and establish its optimality.

I Introduction

One main objective of machine learning is to obtain useful information from often high-dimensional data. To this end, it is common practice to extract meaningful feature representations from the original data and then process the features [1]. Neural networks [2] and kernel methods [3, 4, 5, 6] are two of the most representative approaches for mapping data into a feature space. In neural networks, the features are represented as the outputs of hidden neurons in the network. In contrast, the feature mapping in kernel methods is determined by the chosen kernel, which is applied implicitly and is often infinite dimensional. While kernel approaches require far fewer parameters and can obtain good empirical performance on certain tasks [7], the performance relies heavily on the choice of kernel. Despite many attempts to investigate kernel methods [8, 9, 6], there is still a lack of theoretical understanding of the mechanism behind them, which restricts their application to complicated data.

On the other hand, feature extraction in deep neural networks has recently been studied through information-theoretic and statistical analyses [10, 11]. For example, it was shown in [10] that the feature extracted by deep neural networks coincides with the most informative feature, which is essentially related to the classical Hirschfeld–Gebelein–Rényi (HGR) maximal correlation problem [12, 13, 14]. Such theoretical characterizations provide a better understanding of existing algorithms and have proven useful in designing algorithms for multimodal learning tasks [15].

In this paper, our goal is to characterize kernel methods from the perspective of feature subspaces and reveal their connections with other learning approaches. We first introduce the kernel associated with each given feature subspace, which we coin the projection kernel, to establish a correspondence between kernel operations and geometric operations in feature subspaces. This connection allows us to study kernel methods by analyzing the corresponding feature subspaces. Specifically, we propose an information-theoretic measure for projection kernels, and demonstrate that the information-theoretically optimal kernel can be constructed from the HGR maximal correlation functions, coined the maximal correlation kernel. We further demonstrate that the support vector machine (SVM) with the maximal correlation kernel obtains the minimum prediction error, which justifies its optimality in learning tasks. Our analysis also reveals connections between SVM and other classification approaches, including neural networks. Finally, we interpret the Fisher kernel, a classical kernel induced from parameterized distribution families [16], as a special case of maximal correlation kernels, thus demonstrating its optimality.

II Preliminaries and Notations

Throughout this paper, we use X, Y to denote two random variables with alphabets \mathcal{X}, \mathcal{Y}, and denote their joint distribution and marginals as P_{X,Y} and P_{X}, P_{Y}, respectively. We also use {\mathbb{E}}\left[\cdot\right] to denote the expectation with respect to P_{X,Y}.

II-A Feature Space

We adopt the notation convention introduced in [15], and let 𝒳{𝒳}\mathcal{F}_{\mathcal{X}}\triangleq\{\mathcal{X}\to\mathbb{R}\} denote the feature space formed by the (one-dimensional) features of XX, with the geometry defined as follows. The inner product ,𝒳\langle\cdot,\cdot\rangle_{\mathcal{F}_{\mathcal{X}}} on 𝒳\mathcal{F}_{\mathcal{X}} is defined as f1,f2𝒳𝔼PX[f1(X)f2(X)]\langle f_{1},f_{2}\rangle_{\mathcal{F}_{\mathcal{X}}}\triangleq{\mathbb{E}}_{P_{X}}\left[f_{1}(X)f_{2}(X)\right] for f1,f2𝒳f_{1},f_{2}\in\mathcal{F}_{\mathcal{X}}. This induces a norm 𝒳\|\,\cdot\,\|_{\mathcal{F}_{\mathcal{X}}} with f𝒳f,f𝒳\|f\|_{\mathcal{F}_{\mathcal{X}}}\triangleq\sqrt{\langle f,f\rangle_{\mathcal{F}_{\mathcal{X}}}} for f𝒳f\in\mathcal{F}_{\mathcal{X}}. Then, for given f𝒳f\in\mathcal{F}_{\mathcal{X}} and subspace 𝒢\mathcal{G} of 𝒳\mathcal{F}_{\mathcal{X}}, we denote the projection of ff onto 𝒢\mathcal{G} as

Π(f;𝒢)argminh𝒢hf𝒳.\displaystyle\Pi\left(f;{\mathcal{G}}\right)\triangleq\operatorname*{arg\,min}_{h\in\mathcal{G}}\|h-f\|_{\mathcal{F}_{\mathcal{X}}}. (1)

In addition, for a dd-dimensional feature f=(f1,,fd)T:𝒳df=(f_{1},\dots,f_{d})^{\mathrm{T}}\colon\mathcal{X}\to\mathbb{R}^{d}, we use span{f}span{f1,,fd}\operatorname{span}\{f\}\triangleq\operatorname{span}\{f_{1},\dots,f_{d}\} to denote the subspace spanned by all dimensions. We also use f~\tilde{f} to denote the centered ff, i.e., f~(x)f(x)𝔼PX[f(X)]\tilde{f}(x)\triangleq f(x)-{\mathbb{E}}_{P_{X}}\left[f(X)\right], and denote Λf𝔼PX[f(X)fT(X)]\Lambda_{f}\triangleq{\mathbb{E}}_{P_{X}}\left[f(X)f^{\mathrm{T}}(X)\right].

II-B Kernel

Given 𝒳\mathcal{X}, 𝓀:𝒳×𝒳\mathpzc{k}\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R} is a kernel on 𝒳\mathcal{X}, if for all finite subset 𝒳\mathcal{I}\subset\mathcal{X}, the |||\mathcal{I}| by |||\mathcal{I}| matrix [𝓀(𝓍,𝓍)]𝓍,𝓍[\mathpzc{k}(x,x^{\prime})]_{x\in\mathcal{I},x^{\prime}\in\mathcal{I}} is positive semidefinite. For each kernel 𝓀\mathpzc{k}, we define the associated functional operator τ:𝒳𝒳\tau\colon\mathcal{F}_{\mathcal{X}}\to\mathcal{F}_{\mathcal{X}} as

[τ(f)](x)𝔼PX[𝓀(𝒳,𝓍)𝒻(𝒳)],\displaystyle[\tau(f)](x)\triangleq{\mathbb{E}}_{P_{X}}\left[\mathpzc{k}(X,x)f(X)\right], (2)

and we use 𝓀τ\mathpzc{k}\leftrightarrow\tau to denote the correspondence between 𝓀\mathpzc{k} and τ\tau. Furthermore, we define the centered kernel 𝓀~:𝒳×𝒳\tilde{\mathpzc{k}}\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R} as

𝓀~(x,x)𝓀(𝓍,𝓍)𝓀¯(𝓍)𝓀¯(𝓍)+𝔼𝒫𝒳[𝓀¯(𝒳)],\displaystyle\tilde{\mathpzc{k}}(x,x^{\prime})\triangleq\mathpzc{k}(x,x^{\prime})-\bar{k}(x)-\bar{k}(x^{\prime})+{\mathbb{E}}_{P_{X}}\left[\bar{k}(X)\right], (3)

where we have defined k¯:(x𝔼PX[𝓀(𝒳,𝓍)])𝒳\bar{k}\colon(x\mapsto{\mathbb{E}}_{P_{X}}\left[\mathpzc{k}(X,x)\right])\in\mathcal{F}_{\mathcal{X}}.
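For concreteness, the centering in (3) can be carried out directly on a Gram matrix over a finite alphabet. The following numpy sketch (the routine and variable names are ours, purely for illustration) implements (3) and checks it on the rank-one kernel \mathpzc{k}(x,x')=f(x)f(x').

```python
import numpy as np

def center_kernel(K, p):
    """Center a kernel matrix K[i, j] = k(x_i, x_j) with respect to the
    marginal pmf p over the (finite) alphabet, following (3)."""
    K = np.asarray(K, dtype=float)
    p = np.asarray(p, dtype=float)
    kbar = K.T @ p               # kbar[j] = E_{P_X}[k(X, x_j)]
    mean = p @ kbar              # E_{P_X}[kbar(X)]
    return K - kbar[None, :] - kbar[:, None] + mean

# Check on a rank-one kernel k(x, x') = f(x) f(x'): centering the kernel
# amounts to centering the underlying feature (cf. Remark 1).
p = np.array([0.2, 0.5, 0.3])
f = np.array([1.0, -2.0, 0.5])
Kt = center_kernel(np.outer(f, f), p)
f_tilde = f - p @ f
assert np.allclose(Kt, np.outer(f_tilde, f_tilde))
```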

The following fact is the basis of the kernel trick in learning algorithms.

Fact 1

For each given kernel 𝓀\mathpzc{k}, there exist an inner product space 𝒱\mathcal{V} with the inner product ,𝒱\langle\cdot,\cdot\rangle_{\mathcal{V}}, and a mapping ν:𝒳𝒱\nu\colon\mathcal{X}\to\mathcal{V}, such that 𝓀(𝓍,𝓍)=ν(𝓍),ν(𝓍)𝒱\mathpzc{k}(x,x^{\prime})=\langle\nu(x),\nu(x^{\prime})\rangle_{\mathcal{V}}.

Remark 1

Suppose ν\nu is one mapping for 𝓀\mathpzc{k} satisfying Fact 1. Then for the centered kernel 𝓀~\tilde{\mathpzc{k}} [cf. (3)], we have 𝓀~(x,x)=ν~(x),ν~(x)𝒱\tilde{\mathpzc{k}}(x,x^{\prime})=\langle\tilde{\nu}(x),\tilde{\nu}(x^{\prime})\rangle_{\mathcal{V}}, where ν~(x)ν(x)𝔼PX[ν(X)]\tilde{\nu}(x)\triangleq\nu(x)-{\mathbb{E}}_{P_{X}}\left[\nu(X)\right].

In addition, we introduce the kernelized discriminative model (KDM) as follows.

Definition 1 (Kernelized Discriminative Model)

For each kernel 𝓀\mathpzc{k}, we define its associated kernelized discriminative model PY|X(𝓀)P^{(\mathpzc{k})}_{Y|X} as

PY|X(𝓀)(y|x)PY(y)(1+𝔼[𝓀~(X,x)|Y=y]).\displaystyle P^{(\mathpzc{k})}_{Y|X}(y|x)\triangleq P_{Y}(y)\left(1+{\mathbb{E}}\left[\tilde{\mathpzc{k}}(X,x)\middle|Y=y\right]\right). (4)

Then, we use y^(𝓀){\hat{y}}^{(\mathpzc{k})} to denote the maximum a posteriori (MAP) estimation induced from KDM PY|X(𝓀)P^{(\mathpzc{k})}_{Y|X}, i.e.,

y^(𝓀)(x)\displaystyle{\hat{y}}^{(\mathpzc{k})}(x) argmaxy𝒴PY|X(𝓀)(y|x).\displaystyle\triangleq\operatorname*{arg\,max}_{y\in\mathcal{Y}}P^{(\mathpzc{k})}_{Y|X}(y|x). (5)

The KDM can be regarded as a generalized probability distribution, since we have y𝒴PY|X(𝓀)(y|x)=1\sum_{y\in\mathcal{Y}}P^{(\mathpzc{k})}_{Y|X}(y|x)=1 for all x𝒳x\in\mathcal{X} while PY|X(𝓀)(y|x)P^{(\mathpzc{k})}_{Y|X}(y|x) can sometimes take negative values.
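For finite alphabets, the KDM (4) and the induced MAP rule (5) can be evaluated directly from a joint pmf and a kernel matrix. Below is a minimal numpy sketch under our own naming conventions, with the kernel centered inline as in (3).

```python
import numpy as np

def kdm(K, P_xy):
    """Kernelized discriminative model (4): returns a |X| x |Y| array whose
    entry [j, y] is P^{(k)}_{Y|X}(y | x_j).  K[i, j] = k(x_i, x_j) and
    P_xy[i, y] is the joint pmf of (X, Y) over finite alphabets."""
    K = np.asarray(K, dtype=float)
    P_xy = np.asarray(P_xy, dtype=float)
    p_x, p_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
    kbar = K.T @ p_x
    Kt = K - kbar[None, :] - kbar[:, None] + p_x @ kbar   # centered kernel (3)
    cond = (P_xy / p_y[None, :]).T @ Kt    # cond[y, j] = E[kt(X, x_j) | Y = y]
    return (p_y[:, None] * (1.0 + cond)).T

# Each row sums to one (for a symmetric kernel matrix), though entries may be
# negative; the MAP rule (5) picks the column index maximizing each row:
#   y_hat = kdm(K, P_xy).argmax(axis=1)
```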

II-C Modal Decomposition, Maximal Correlation, and H-score

We first introduce the modal decomposition of joint distribution PX,YP_{X,Y} [11, 15].

Proposition 1 (Modal Decomposition [11])

For given PX,YP_{X,Y}, there exists Kmin{|𝒳|,|𝒴|}1K\leq\min\{|\mathcal{X}|,|\mathcal{Y}|\}-1, such that

PX,Y(x,y)=PX(x)PY(y)(1+i=1Kσifi(x)gi(y)),\displaystyle P_{X,Y}(x,y)=P_{X}(x)P_{Y}(y)\left(1+\sum_{i=1}^{K}\sigma_{i}f^{*}_{i}(x)g^{*}_{i}(y)\right), (6)

where \sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{K}>0, and {\mathbb{E}}\left[f^{*}_{i}(X)f^{*}_{j}(X)\right]={\mathbb{E}}\left[g^{*}_{i}(Y)g^{*}_{j}(Y)\right]=\mathbb{1}_{\{i=j\}} for all 1\leq i,j\leq K, where \mathbb{1}_{\{\cdot\}} denotes the indicator function.

It can be shown that (fi,gi)(f_{i}^{*},g_{i}^{*}) pairs are the most correlated function pairs of XX and YY, referred to as maximal correlation functions. We also denote ϱσ1\varrho\triangleq\sigma_{1}, known as the HGR maximal correlation [12, 13, 14] of XX and YY, and define the KK-dimensional feature f(x)[f1(x),,fK(x)]Tf^{*}(x)\triangleq[f^{*}_{1}(x),\dots,f_{K}^{*}(x)]^{\mathrm{T}}. In particular, when YY is binary, we have f=f1𝒳f^{*}=f_{1}^{*}\in\mathcal{F}_{\mathcal{X}}.

It has been shown in [11] that the maximal correlation functions fi,i=1,,Kf_{i}^{*},i=1,\dots,K are the optimal features of XX in inferring or estimating YY. In general, given a dd-dimensional feature ff of XX, the effectiveness of ff in inferring or estimating YY can be measured by its H-score [10, 11], defined as

(f)12𝔼[𝔼[Λf12f~(X)|Y]2],\displaystyle\mathscr{H}(f)\triangleq\frac{1}{2}\cdot{\mathbb{E}}\left[\left\|{\mathbb{E}}\left[\Lambda_{f}^{-\frac{1}{2}}\tilde{f}(X)\middle|Y\right]\right\|^{2}\right], (7)

where f~(x)f(x)𝔼[f(X)]\tilde{f}(x)\triangleq f(x)-{\mathbb{E}}\left[f(X)\right]. It can be verified that for all dd and f:𝒳df\colon\mathcal{X}\to\mathbb{R}^{d}, we have

(f)(f)=12i=1Kσi2,\displaystyle\mathscr{H}(f)\leq\mathscr{H}(f^{*})=\frac{1}{2}\sum_{i=1}^{K}\sigma_{i}^{2}, (8)

where σ1,,σK\sigma_{1},\dots,\sigma_{K} are as defined in (6).
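Both the modal decomposition (6) and the H-score (7) admit direct finite-alphabet computations: \sigma_i, f^*_i, g^*_i can be read off from the SVD of the matrix B(x,y) = (P_{X,Y}(x,y) - P_X(x)P_Y(y)) / \sqrt{P_X(x)P_Y(y)}, which is one standard construction. The following numpy sketch (our own code, with illustrative names) also verifies (8) numerically.

```python
import numpy as np

def modal_decomposition(P_xy, tol=1e-10):
    """Modal decomposition (6): return (sigma, f_star, g_star), where the
    columns of f_star and g_star are the maximal correlation functions."""
    P_xy = np.asarray(P_xy, dtype=float)
    p_x, p_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
    B = (P_xy - np.outer(p_x, p_y)) / np.sqrt(np.outer(p_x, p_y))
    U, sigma, Vt = np.linalg.svd(B)
    K = int(np.sum(sigma > tol))
    f_star = U[:, :K] / np.sqrt(p_x)[:, None]
    g_star = Vt[:K].T / np.sqrt(p_y)[:, None]
    return sigma[:K], f_star, g_star

def h_score(F, P_xy):
    """H-score (7) of a feature matrix F, where F[x, :] = f(x)."""
    F, P_xy = np.asarray(F, dtype=float), np.asarray(P_xy, dtype=float)
    p_x, p_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
    F_tilde = F - p_x @ F                     # centered feature
    Lam = F.T @ (p_x[:, None] * F)            # Lambda_f = E[f(X) f(X)^T]
    M = (P_xy / p_y[None, :]).T @ F_tilde     # M[y, :] = E[f~(X) | Y = y]
    quad = np.einsum('ij,ij->i', M, np.linalg.solve(Lam, M.T).T)
    return 0.5 * np.sum(p_y * quad)

# Sanity check of (8): the maximal correlation functions attain 0.5 * sum(sigma^2).
P_xy = np.array([[0.20, 0.10], [0.05, 0.25], [0.25, 0.15]])
sigma, f_star, _ = modal_decomposition(P_xy)
assert np.isclose(h_score(f_star, P_xy), 0.5 * np.sum(sigma ** 2))
```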

II-D Binary Classification

We consider the binary classification problem which predicts binary label YY from the data variable XX. For convenience, we assume YY takes values from 𝒴{1,1}\mathcal{Y}\triangleq\{-1,1\}.

Suppose the training dataset contains nn sample pairs {(xi,yi)}i=1n\{(x_{i},y_{i})\}_{i=1}^{n} of (X,Y)(X,Y), and let PX,YP_{X,Y} denote the corresponding empirical distribution, i.e.,

PX,Y(x,y)1ni=1n𝟙{𝕩𝕚=𝕩,𝕪𝕚=𝕪}.\displaystyle P_{X,Y}(x,y)\triangleq\frac{1}{n}\sum_{i=1}^{n}\mathbbb{1}_{\{x_{i}=x,y_{i}=y\}}. (9)
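For the finite-alphabet sketches given earlier, it is convenient to arrange this empirical distribution as a |\mathcal{X}| \times |\mathcal{Y}| matrix; a minimal helper (our own naming) is given below.

```python
import numpy as np

def empirical_joint(pairs, x_alphabet, y_alphabet):
    """Empirical joint pmf (9), arranged as a |X| x |Y| matrix."""
    P = np.zeros((len(x_alphabet), len(y_alphabet)))
    for x, y in pairs:
        P[x_alphabet.index(x), y_alphabet.index(y)] += 1.0
    return P / len(pairs)

# e.g., empirical_joint([('a', 1), ('b', -1), ('a', -1)], ['a', 'b'], [-1, 1])
```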

II-D1 Support Vector Machine

The support vector machine (SVM) solves binary classification tasks by finding the optimal hyperplane that separates two classes with maximum margin [7]. Given dd-dimensional feature mapping f:𝒳df\colon\mathcal{X}\to\mathbb{R}^{d}, the loss for SVM based on ff can be written as

L𝖲𝖵𝖬(f,w,b;λ)\displaystyle L_{{\mathsf{SVM}}}(f,w,b;\lambda)
𝔼PX,Y[hinge(Y,w,f(X)+b)]+λ2w2,\displaystyle\quad\triangleq{\mathbb{E}}_{P_{X,Y}}\left[\ell_{\mathrm{hinge}}(Y,\langle w,f(X)\rangle+b)\right]+\frac{\lambda}{2}\cdot\|w\|^{2}, (10)

where w\in\mathbb{R}^{d} and b\in\mathbb{R} are the parameters of the hyperplane, where \lambda>0 is a hyperparameter of SVM, and where \ell_{\mathrm{hinge}}\colon\mathcal{Y}\times\mathbb{R}\to\mathbb{R} denotes the hinge loss, defined as \ell_{\mathrm{hinge}}(y,z)\triangleq(1-yz)^{+} with x^{+}\triangleq\max\{0,x\}.

Moreover, let (w𝖲𝖵𝖬,b𝖲𝖵𝖬)argminw,bL𝖲𝖵𝖬(f,w,b;λ)\displaystyle(w_{{\mathsf{SVM}}},b_{{\mathsf{SVM}}})\triangleq\operatorname*{arg\,min}_{w,b}L_{{\mathsf{SVM}}}(f,w,b;\lambda) and L𝖲𝖵𝖬(f;λ)L𝖲𝖵𝖬(f,w𝖲𝖵𝖬,b𝖲𝖵𝖬;λ)L_{{\mathsf{SVM}}}^{*}(f;\lambda)\triangleq L_{{\mathsf{SVM}}}(f,w_{{\mathsf{SVM}}},b_{{\mathsf{SVM}}};\lambda) denote the optimal parameters and the value of loss function, respectively. Then, the prediction of SVM is

y^𝖲𝖵𝖬(x;f,λ)sgn(w𝖲𝖵𝖬,f(x)+b𝖲𝖵𝖬),\displaystyle{\hat{y}}_{{\mathsf{SVM}}}(x;f,\lambda)\triangleq\operatorname{sgn}(\langle w_{{\mathsf{SVM}}},f(x)\rangle+b_{{\mathsf{SVM}}}), (11)

where sgn()\operatorname{sgn}(\cdot) denotes the sign function.

Specifically, for a given kernel \mathpzc{k}, the prediction of the corresponding kernel SVM is {\hat{y}}^{(\mathpzc{k})}_{{\mathsf{SVM}}}(x;\lambda)\triangleq{\hat{y}}_{{\mathsf{SVM}}}(x;\nu,\lambda), where \nu is any mapping given by Fact 1. (It is worth mentioning that the practical implementation of kernel SVM is typically carried out by solving a dual optimization problem, without explicitly using \nu; see [17, Section 12] for detailed discussions.)

II-D2 Logistic Regression and Neural Networks

Given dd-dimensional feature ff of XX, the discriminative model of logistic regression is P~Y|X(y|x;f,w,b)sigmoid(y(w,f(x)+b))\tilde{P}_{Y|X}(y|x;f,w,b)\triangleq\operatorname{sigmoid}(y\cdot(\langle w,f(x)\rangle+b)), where wd,bw\in\mathbb{R}^{d},b\in\mathbb{R} are the weight and bias, respectively, and where sigmoid()\operatorname{sigmoid}(\cdot) is defined as sigmoid(x)11+exp(x)\operatorname{sigmoid}(x)\triangleq\frac{1}{1+\exp(-x)}.

Then, the loss of logistic regression is L𝖫𝖱(f,w,b)𝔼[logP~Y|X(Y|X;f,w,b)]L_{{\mathsf{LR}}}(f,w,b)\triangleq-{\mathbb{E}}\left[\log\tilde{P}_{Y|X}(Y|X;f,w,b)\right], and the optimal parameters w𝖫𝖱,b𝖫𝖱w_{{\mathsf{LR}}},b_{{\mathsf{LR}}} are learned by minimizing the loss, i.e., (w𝖫𝖱,b𝖫𝖱)argminw,bL𝖫𝖱(f,w,b)\displaystyle(w_{{\mathsf{LR}}},b_{{\mathsf{LR}}})\triangleq\operatorname*{arg\,min}_{w,b}L_{{\mathsf{LR}}}(f,w,b). The resulting decision rule is

y^𝖫𝖱(x;f)\displaystyle{\hat{y}}_{{\mathsf{LR}}}(x;f) argmaxy𝒴P~Y|X(y|x;f,w𝖫𝖱,b𝖫𝖱)\displaystyle\triangleq\operatorname*{arg\,max}_{y\in\mathcal{Y}}\tilde{P}_{Y|X}(y|x;f,w_{{\mathsf{LR}}},b_{{\mathsf{LR}}})
=sgn(w𝖫𝖱,f(x)+b𝖫𝖱).\displaystyle=\operatorname{sgn}\left(\langle w_{{\mathsf{LR}}},f(x)\rangle+b_{{\mathsf{LR}}}\right). (12)

The logistic regression is often used as the classification layer for multi-layer neural networks, where ww and bb correspond to weights and the bias term, respectively. In this case, the feature mapping f()f(\cdot) also takes a parameterized form, and the parameters of f()f(\cdot) are jointly learned with ww and bb.

III Projection Kernel and Informative Features

In this section, we introduce a one-to-one correspondence between kernels and feature subspaces, and then characterize the informativeness of kernels by investigating the features in the associated subspaces.

III-A Projection Kernel and Feature Subspace

We first introduce a family of kernels with one-to-one correspondence to feature subspace.

Definition 2 (Projection Kernel)

Let 𝒢\mathcal{G} denote a dd-dimensional subspace of 𝒳\mathcal{F}_{\mathcal{X}} with a basis {f1,,fd}\{f_{1},\dots,f_{d}\}. We use 𝓀𝒢:𝒳×𝒳\mathpzc{k}_{{\;}\mathcal{G}}\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R} to denote the projection kernel associated with 𝒢\mathcal{G}, defined as 𝓀𝒢(𝓍,𝓍)𝒻T(𝓍)Λ𝒻1𝒻(𝓍)\mathpzc{k}_{{\;}\mathcal{G}}(x,x^{\prime})\triangleq f^{\mathrm{T}}(x)\Lambda_{f}^{-1}f(x^{\prime}), where we have defined f(f1,,fd)Tf\triangleq(f_{1},\dots,f_{d})^{\mathrm{T}} and Λf𝔼[f(X)fT(X)]\Lambda_{f}\triangleq{\mathbb{E}}\left[f(X)f^{\mathrm{T}}(X)\right].

With slight abuse of notation, we also denote \mathpzc{k}_{f}\triangleq\mathpzc{k}_{\operatorname{span}\{f\}}, the projection kernel associated with \operatorname{span}\{f\}.

Note that 𝓀𝒢\mathpzc{k}_{{\;}\mathcal{G}} is a valid kernel function, and the corresponding ν\nu mapping in Fact 1 can be chosen as ν(x)=[f1(x),,fd(x)]T\nu(x)=[f_{1}(x),\dots,f_{d}(x)]^{\mathrm{T}} for any orthonormal basis {f1,,fd}\{f_{1},\dots,f_{d}\} of 𝒢\mathcal{G}. It turns out that the functional operators associated with projection kernels are projection operators in the feature space, which we formalize as follows. A proof is provided in Appendix -A.
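For a finite alphabet, Definition 2 and Property 1 below can be made concrete with a few lines of numpy (a minimal sketch, with our own names): the projection kernel is the matrix F\Lambda_f^{-1}F^{\mathrm{T}} built from a feature matrix F with rows f(x), and the associated operator is idempotent.

```python
import numpy as np

def projection_kernel(F, p):
    """Projection kernel of Definition 2: k_G(x, x') = f(x)^T Lambda_f^{-1} f(x'),
    with Lambda_f = E_{P_X}[f(X) f(X)^T] and marginal pmf p."""
    F = np.asarray(F, dtype=float)
    p = np.asarray(p, dtype=float)
    Lam = F.T @ (p[:, None] * F)
    return F @ np.linalg.solve(Lam, F.T)

# Property 1 in matrix form: the operator tau(g)[x] = E_{P_X}[k(X, x) g(X)]
# acts on feature vectors as T = K^T diag(p), and T is a projection (T^2 = T).
p = np.array([0.25, 0.25, 0.25, 0.25])
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
K = projection_kernel(F, p)
T = K.T * p[None, :]
g = np.array([0.3, -1.2, 2.0, 0.7])
assert np.allclose(T @ (T @ g), T @ g)
```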

Property 1

Let τ𝓀𝒢\tau\leftrightarrow\mathpzc{k}_{{\;}\mathcal{G}} denote the operator corresponding to subspace 𝒢\mathcal{G} [cf. (2)], then we have τ(f)=Π(f;𝒢)\tau(f)=\Pi\left(f;{\mathcal{G}}\right) for all f𝒳f\in\mathcal{F}_{\mathcal{X}}.

Therefore, given a projection kernel 𝓀\mathpzc{k}, the associated subspace can be represented as {f𝒳:τ(f)=f}\{f\in\mathcal{F}_{\mathcal{X}}\colon\tau(f)=f\}, where τ𝓀\tau\leftrightarrow\mathpzc{k} is the associated operator. This establishes a one-to-one correspondence between projection kernels and feature subspaces.

III-B H-score and Informative Features

The projection kernel provides a connection between feature subspace and kernel, from which we can characterize subspace 𝒢\mathcal{G} in terms of the corresponding kernel 𝓀𝒢\mathpzc{k}_{{\;}\mathcal{G}}. Specifically, we can represent the H-score [cf. (7)] of a feature ff in terms of the projection kernel 𝓀𝒻\mathpzc{k}_{{\;}f}, formalized as follows. A proof is provided in Appendix -B.

Proposition 2

For all ff with span{f}=𝒢\operatorname{span}\{f\}=\mathcal{G}, we have (f)=12(𝔼PXX[𝓀𝒢(𝒳,𝒳)]𝔼PXPX[𝓀𝒢(𝒳,𝒳)])\mathscr{H}(f)=\frac{1}{2}\cdot\left({\mathbb{E}}_{P_{XX^{\prime}}}\left[\mathpzc{k}_{{\;}\mathcal{G}}(X,X^{\prime})\right]-{\mathbb{E}}_{P_{X}P_{X^{\prime}}}\left[\mathpzc{k}_{{\;}\mathcal{G}}(X,X^{\prime})\right]\right), where we have defined XX^{\prime} such that the joint distribution of XX and XX^{\prime} is

PXX(x,x)y𝒴PY(y)PX|Y=y(x)PX|Y=y(x).\displaystyle P_{XX^{\prime}}(x,x^{\prime})\triangleq\sum_{y\in\mathcal{Y}}P_{Y}(y)P_{X|Y=y}(x)P_{X|Y=y}(x^{\prime}). (13)

With slight abuse of notation, we can use (𝒢)\mathscr{H}(\mathcal{G}) to denote the H-score corresponding to feature subspace 𝒢\mathcal{G}. In particular, we have the following characterization of (𝒢)\mathscr{H}(\mathcal{G}) when YY is binary. A proof is provided in Appendix -C.

Proposition 3

Suppose YY is binary, and ff^{*} is the maximal correlation function of PX,YP_{X,Y}. Then, for each subspace 𝒢\mathcal{G} of 𝒳\mathcal{F}_{\mathcal{X}}, we have

(𝒢)\displaystyle\mathscr{H}(\mathcal{G}) =ϱ22Π(f;𝒢)𝒳2=maxf𝒢(f)=(Π(f;𝒢)).\displaystyle=\frac{\varrho^{2}}{2}\cdot\bigl{\|}\Pi\left(f^{*};{\mathcal{G}}\right)\bigr{\|}_{\mathcal{F}_{\mathcal{X}}}^{2}=\max_{f\in\mathcal{G}}\mathscr{H}(f)=\mathscr{H}(\Pi\left(f^{*};{\mathcal{G}}\right)). (14)

From Proposition 3, \mathscr{H}(\mathcal{G}) depends only on the projection of f^{*} onto \mathcal{G}, which is also the most informative feature in \mathcal{G}. In addition, since \|f^{*}\|_{\mathcal{F}_{\mathcal{X}}}=1, \bigl\|\Pi\left(f^{*};{\mathcal{G}}\right)\bigr\|_{\mathcal{F}_{\mathcal{X}}} is the cosine of the principal angle between f^{*} and \mathcal{G}. Therefore, we can interpret the H-score as a measure of the principal angle between the optimal feature f^{*} and the given subspace.
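The projection \Pi(f^{*};\mathcal{G}) in (14) is a weighted least-squares problem under the inner product on \mathcal{F}_{\mathcal{X}}; a minimal sketch of this computation is given below (our own helper, with illustrative names). Combined with the H-score routine sketched in Section II-C, it can be used to check the first equality in (14) numerically.

```python
import numpy as np

def project(f, F, p):
    """Projection (1) of the feature vector f onto the span of the columns of F,
    under the inner product <a, b> = E_{P_X}[a(X) b(X)] with marginal pmf p."""
    f, F, p = (np.asarray(a, dtype=float) for a in (f, F, p))
    w = np.sqrt(p)
    coef, *_ = np.linalg.lstsq(w[:, None] * F, w * f, rcond=None)
    return F @ coef

# For binary Y (cf. Proposition 3), with sigma, f_star, h_score as in Section II-C:
#   Pi = project(f_star[:, 0], F, p_x)
#   0.5 * sigma[0]**2 * (p_x @ Pi**2)  ~=  h_score(F, P_xy)   # first equality in (14)
```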

III-C Maximal Correlation Kernel

Note that from (8), (f)\mathscr{H}(f) is maximized when ff takes the maximal correlation function ff^{*}. Therefore, the subspace span{f}\operatorname{span}\{f^{*}\} (and thus projection kernel 𝓀𝒻\mathpzc{k}_{{\;}f^{*}}) is optimal in terms of the H-score measure. We will denote 𝓀𝓀𝒻\mathpzc{k}^{*}\triangleq\mathpzc{k}_{{\;}f^{*}}, referred to as the maximal correlation kernel.

Specifically, the KDM (cf. Definition 1) of maximal correlation kernel 𝓀\mathpzc{k}^{*} coincides with the underlying conditional distribution PY|XP_{Y|X}, demonstrated as follows. A proof is provided in Appendix -D.

Property 2

For all xx and yy, we have PY|X(y|x)=PY|X(𝓀)(y|x)P_{Y|X}(y|x)=P^{(\mathpzc{k}^{*})}_{Y|X}(y|x) and y^(𝓀)(x)=y^𝖬𝖠𝖯(x){\hat{y}}^{(\mathpzc{k}^{*})}(x)={\hat{y}}_{{\mathsf{MAP}}}(x), where y^𝖬𝖠𝖯{\hat{y}}_{\mathsf{MAP}} denotes the MAP estimation, i.e.,

y^𝖬𝖠𝖯(x)argmaxy𝒴PY|X(y|x).\displaystyle{\hat{y}}_{{\mathsf{MAP}}}(x)\triangleq\operatorname*{arg\,max}_{y\in\mathcal{Y}}P_{Y|X}(y|x). (15)

As we will develop in the next section, the maximal correlation kernel also achieves the optimal performance in support vector machine.

IV Support Vector Machine Analysis

In this section, we investigate support vector machine, a representative kernel approach for binary classification. Let (X,Y)(X,Y) denote the training data and corresponding label taken from 𝒴={1,1}\mathcal{Y}=\{-1,1\}, with PX,YP_{X,Y} denoting the empirical distribution as defined in (9). Throughout this section, we will focus on the balanced dataset with

PY(1)=PY(1)=12.\displaystyle P_{Y}(-1)=P_{Y}(1)=\frac{1}{2}. (16)

It can be verified that in this case, the MAP estimation [cf. (15)] can be expressed in terms of maximal correlation function. A proof is provided in Appendix -E.

Property 3

Under assumption (16), we can express the MAP estimation as y^𝖬𝖠𝖯(x)=sgn(f(x)){\hat{y}}_{{\mathsf{MAP}}}(x)=\operatorname{sgn}(f^{*}(x)) for all x𝒳x\in\mathcal{X}, where f𝒳f^{*}\in\mathcal{F}_{\mathcal{X}} is the maximal correlation function of PX,YP_{X,Y}.

IV-A SVM on Given Features

We first consider the SVM algorithm applied on a given feature representation f(X)df(X)\in\mathbb{R}^{d}, which can also be regarded as the kernel SVM on kernel 𝓀(𝓍,𝓍)=𝒻(𝓍),𝒻(𝓍)\mathpzc{k}(x,x^{\prime})=\langle f(x),f(x^{\prime})\rangle.

To begin, for each given feature ff and λ>0\lambda>0, let us define

L^(f;λ)112λ𝔼[f(X)Y]2.\displaystyle\hat{L}(f;\lambda)\triangleq 1-\frac{1}{2\lambda}\cdot\left\|{\mathbb{E}}\left[f(X)Y\right]\right\|^{2}.

Then we have the following characterization, a proof of which is provided in Appendix -F.

Theorem 1

For any given feature f and \lambda>0, we have

L^(f;λ)L𝖲𝖵𝖬(f;λ)L^(f;λ)+(λTλ1)+,\displaystyle\hat{L}(f;\lambda)\leq L_{{\mathsf{SVM}}}^{*}(f;\lambda)\leq\hat{L}(f;\lambda)+\left(\frac{\lambda_{\mathrm{T}}}{\lambda}-1\right)^{+}, (17)

where we have defined λTM𝔼[f(X)Y]\lambda_{\mathrm{T}}\triangleq M\cdot\left\|{\mathbb{E}}\left[f(X)Y\right]\right\| and Mmaxx𝒳f~(x)M\triangleq\max_{x\in\mathcal{X}}\bigl{\|}\tilde{f}(x)\bigr{\|}, with f~(x)f(x)𝔼[f(X)]\tilde{f}(x)\triangleq f(x)-{\mathbb{E}}\left[f(X)\right], and where x+max{0,x}x^{+}\triangleq\max\{0,x\}.

Specifically, when λλT\lambda\geq\lambda_{\mathrm{T}}, we have L𝖲𝖵𝖬(f;λ)=L^(f;λ)L_{{\mathsf{SVM}}}^{*}(f;\lambda)=\hat{L}(f;\lambda), which can be achieved by

w𝖲𝖵𝖬=1λ𝔼[f(X)Y],b𝖲𝖵𝖬=w𝖲𝖵𝖬,𝔼[f(X)],\displaystyle w_{{\mathsf{SVM}}}=\frac{1}{\lambda}\cdot{\mathbb{E}}\left[f(X)Y\right],\quad b_{{\mathsf{SVM}}}=-\langle w_{{\mathsf{SVM}}},{\mathbb{E}}\left[f(X)\right]\rangle, (18)

and the resulting SVM prediction is

y^𝖲𝖵𝖬(x;f,λ)\displaystyle{\hat{y}}_{{\mathsf{SVM}}}(x;f,\lambda) =sgn(𝔼[f~(X)Y],f~(x))\displaystyle=\operatorname{sgn}\left(\left\langle{\mathbb{E}}\bigl{[}\tilde{f}(X)Y\bigr{]},\tilde{f}(x)\right\rangle\right) (19)
=argminy𝒴f(x)𝔼[f(X)|Y=y].\displaystyle=\operatorname*{arg\,min}_{y\in\mathcal{Y}}\|f(x)-{\mathbb{E}}\left[f(X)|Y=y\right]\|. (20)

From Theorem 1, when \lambda\geq\lambda_{\mathrm{T}}, the SVM decision {\hat{y}}_{{\mathsf{SVM}}}(x;f,\lambda) does not depend on the value of \lambda. In the remainder, we will focus on the regime \lambda\geq\lambda_{\mathrm{T}}, and drop \lambda from expressions whenever possible, e.g., we simply denote {\hat{y}}_{{\mathsf{SVM}}}(x;f,\lambda) by {\hat{y}}_{{\mathsf{SVM}}}(x;f). As we will see shortly, SVM can still attain the minimum prediction error in this regime, by using a good feature mapping f (or equivalently, a good kernel).

From (20), the SVM prediction can be interpreted as a nearest centroid classifier, where the decision is based on comparing the distances between f(x) and the class centroids {\mathbb{E}}\left[f(X)|Y=y\right], y\in\mathcal{Y}. In addition, from

𝔼[f(X)Y]\displaystyle{\mathbb{E}}\left[f(X)Y\right] =𝔼[Y𝔼[f(X)|Y]]\displaystyle={\mathbb{E}}\left[Y\cdot{\mathbb{E}}\left[f(X)|Y\right]\right]
=12(𝔼[f(X)|Y=1]𝔼[f(X)|Y=1]),\displaystyle=\frac{1}{2}\left({\mathbb{E}}\left[f(X)|Y=1\right]-{\mathbb{E}}\left[f(X)|Y=-1\right]\right),

we can interpret the SVM loss L𝖲𝖵𝖬=L^L_{{\mathsf{SVM}}}^{*}=\hat{L} as measuring the distance between two class centroids.
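The closed-form solution (18) and the decision rules (19)–(20) are straightforward to evaluate on an empirical dataset. A minimal numpy sketch, assuming a balanced dataset (16) and the regime \lambda\geq\lambda_{\mathrm{T}} of Theorem 1 (the function and variable names are ours):

```python
import numpy as np

def svm_closed_form(F, y, lam):
    """Closed-form SVM parameters (18), valid when lam >= lambda_T (Theorem 1).
    F[i] = f(x_i) and y[i] in {-1, +1}, with classes assumed balanced as in (16)."""
    F, y = np.asarray(F, dtype=float), np.asarray(y, dtype=float)
    m = (F * y[:, None]).mean(axis=0)        # E[f(X) Y] under the empirical pmf (9)
    w = m / lam                              # w_SVM
    b = -w @ F.mean(axis=0)                  # b_SVM
    F_tilde = F - F.mean(axis=0)
    lam_T = np.max(np.linalg.norm(F_tilde, axis=1)) * np.linalg.norm(m)
    return w, b, lam_T

def svm_predict(F_new, w, b):
    """Decision rule (11)/(19); by (20), equivalently the nearest-centroid rule."""
    return np.sign(np.asarray(F_new, dtype=float) @ w + b)
```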

Furthermore, when f is a one-dimensional feature, we can rewrite (19) as

y^𝖲𝖵𝖬(x;f)=sgn(𝔼[f~(X)Y],f~(x))=sgn(f^(x)),\displaystyle{\hat{y}}_{{\mathsf{SVM}}}(x;f)=\operatorname{sgn}\left(\left\langle{\mathbb{E}}\bigl{[}\tilde{f}(X)Y\bigr{]},\tilde{f}(x)\right\rangle\right)=\operatorname{sgn}\left({\hat{f}}(x)\right),

where {\hat{f}}\triangleq\Pi\left(f^{*};{\operatorname{span}\{\tilde{f}\}}\right). Therefore, the decision rule depends only on the projection of f^{*} onto the subspace \operatorname{span}\{\tilde{f}\}, which is also the most informative feature in that subspace (cf. Proposition 3). Later, we will see a similar geometric interpretation of kernel SVM.

Moreover, we can establish a connection between SVM loss and the H-score measure, formalized as the following corollary. A proof is provided in Appendix -G.

Corollary 1

Suppose λλT\lambda\geq\lambda_{\mathrm{T}}, then we have

1rmaxλ(f~)L𝖲𝖵𝖬(f;λ)1rminλ(f~),\displaystyle 1-\frac{r_{\max}}{\lambda}\cdot\mathscr{H}(\tilde{f})\leq L_{{\mathsf{SVM}}}^{*}(f;\lambda)\leq 1-\frac{r_{\min}}{\lambda}\cdot\mathscr{H}(\tilde{f}),

where rmaxr_{\max} and rminr_{\min} denote the maximum and minimum positive eigenvalues of the covariance matrix Λf~\Lambda_{\tilde{f}}, respectively. Specifically, if Λf~=I\Lambda_{\tilde{f}}=I, then we have L𝖲𝖵𝖬(f;λ)=1λ1(f~).L_{{\mathsf{SVM}}}^{*}(f;\lambda)=1-\lambda^{-1}\cdot\mathscr{H}(\tilde{f}).

As a result, for each normalized feature ff with covariance matrix Λf~=I\Lambda_{\tilde{f}}=I, the SVM loss L𝖲𝖵𝖬L_{{\mathsf{SVM}}}^{*} measures the informativeness of ff in inferring the label YY.

IV-B Kernel SVM

In practice, instead of applying SVM to a given or manually designed feature f, it is more common to implement SVM directly with a kernel \mathpzc{k}. Similar to Theorem 1, we have the following characterization, from which we can interpret the KDM as a probabilistic output of kernel SVM.

Theorem 2

For each given kernel 𝓀\mathpzc{k}, there exists a constant λT>0\lambda_{\mathrm{T}}>0, such that when λλT\lambda\geq\lambda_{\mathrm{T}}, the SVM prediction is y^𝖲𝖵𝖬(𝓀)(x)=sgn([τ(f)](x)){\hat{y}}^{(\mathpzc{k})}_{{\mathsf{SVM}}}(x)=\operatorname{sgn}\left([\tau(f^{*})](x)\right), where τ𝓀~\tau\leftrightarrow\tilde{\mathpzc{k}} is the operator associated with centered kernel 𝓀~\tilde{\mathpzc{k}} [cf. (2) and (3)]. In addition, the SVM prediction coincides with the KDM prediction (cf. Definition 1) obtained from 𝓀\mathpzc{k}, i.e., we have y^𝖲𝖵𝖬(𝓀)(x)=y^(𝓀)(x){\hat{y}}^{(\mathpzc{k})}_{{\mathsf{SVM}}}(x)={\hat{y}}^{(\mathpzc{k})}(x) for all x𝒳x\in\mathcal{X}.

Proof:

Let 𝒱\mathcal{V} and ν:𝒳𝒱\nu\colon\mathcal{X}\to\mathcal{V} denote the inner product space and mapping associated with kernel 𝓀\mathpzc{k} (cf. Fact 1), and let ν~(x)ν(x)𝔼PX[ν(X)]\tilde{\nu}(x)\triangleq\nu(x)-{\mathbb{E}}_{P_{X}}\left[\nu(X)\right]. Then, we have

𝔼[ν~(X)Y],ν~(x)𝒱\displaystyle\langle{\mathbb{E}}\left[\tilde{\nu}(X)Y\right],\tilde{\nu}(x)\rangle_{\mathcal{V}} =𝔼[ν~(X),ν~(x)𝒱Y]\displaystyle={\mathbb{E}}\left[\langle\tilde{\nu}(X),\tilde{\nu}(x)\rangle_{\mathcal{V}}\cdot Y\right]
=𝔼[𝓀~(X,x)Y],\displaystyle={\mathbb{E}}\left[\tilde{\mathpzc{k}}(X,x)\cdot Y\right], (21)

which can be rewritten as

𝔼[𝓀~(X,x)Y]\displaystyle{\mathbb{E}}\left[\tilde{\mathpzc{k}}(X,x)\cdot Y\right]
=𝔼PX,Y[𝓀~(X,x)Y]\displaystyle\qquad={\mathbb{E}}_{P_{X,Y}}\left[\tilde{\mathpzc{k}}(X,x)\cdot Y\right]
=𝔼PXPY[𝓀~(X,x)Y(1+ϱf(X)Y)]\displaystyle\qquad={\mathbb{E}}_{P_{X}P_{Y}}\left[\tilde{\mathpzc{k}}(X,x)\cdot Y\cdot(1+\varrho\cdot f^{*}(X)\cdot Y)\right]
=𝔼[𝓀~(X,x)]𝔼[Y]+ϱ𝔼[𝓀~(X,x)f(X)]𝔼[Y2]\displaystyle\qquad={\mathbb{E}}\left[\tilde{\mathpzc{k}}(X,x)\right]\cdot{\mathbb{E}}\left[Y\right]+\varrho\cdot{\mathbb{E}}\left[\tilde{\mathpzc{k}}(X,x)f^{*}(X)\right]\cdot{\mathbb{E}}\left[Y^{2}\right]
=ϱ𝔼[𝓀~(X,x)f(X)]\displaystyle\qquad=\varrho\cdot{\mathbb{E}}\left[\tilde{\mathpzc{k}}(X,x)f^{*}(X)\right]
=ϱ[τ(f)](x),\displaystyle\qquad=\varrho\cdot[\tau(f^{*})](x),

where to obtain the second equality we have used the modal decomposition of PX,YP_{X,Y} (cf. Fact 2).

Hence, from Theorem 1 we obtain

y^𝖲𝖵𝖬(𝓀)(x)=y^𝖲𝖵𝖬(x;ν)\displaystyle{\hat{y}}^{(\mathpzc{k})}_{{\mathsf{SVM}}}(x)={\hat{y}}_{{\mathsf{SVM}}}(x;\nu) =sgn(𝔼[ν~(X)Y],ν~(x))\displaystyle=\operatorname{sgn}\left(\langle{\mathbb{E}}\left[\tilde{\nu}(X)Y\right],\tilde{\nu}(x)\rangle\right)
=sgn(𝔼[𝓀~(X,x)Y])\displaystyle=\operatorname{sgn}\left({\mathbb{E}}\left[\tilde{\mathpzc{k}}(X,x)\cdot Y\right]\right)
=sgn([τ(f)](x)).\displaystyle=\operatorname{sgn}([\tau(f^{*})](x)).

It remains only to establish the equivalence between y^𝖲𝖵𝖬(𝓀){\hat{y}}^{(\mathpzc{k})}_{{\mathsf{SVM}}} and KDM decision y^(𝓀){\hat{y}}^{(\mathpzc{k})}. To this end, note that from (4) and the balanced dataset assumption (16), we have

PY|X(𝓀)(y|x)\displaystyle P^{(\mathpzc{k})}_{Y|X}(y|x) =PY(y)(1+𝔼[𝓀~(X,x)|Y=y])\displaystyle=P_{Y}(y)\left(1+{\mathbb{E}}\left[\tilde{\mathpzc{k}}(X,x)\middle|Y=y\right]\right)
=12(1+y𝔼[𝓀~(X,x)Y])\displaystyle=\frac{1}{2}\left(1+y\cdot{\mathbb{E}}\left[\tilde{\mathpzc{k}}(X,x)Y\right]\right)

for all x𝒳,y𝒴x\in\mathcal{X},y\in\mathcal{Y}.

Hence, for all x𝒳x\in\mathcal{X},

y^(𝓀)(x)=argmaxy𝒴PY|X(𝓀)(y|x)\displaystyle{\hat{y}}^{(\mathpzc{k})}(x)=\operatorname*{arg\,max}_{y\in\mathcal{Y}}{P^{(\mathpzc{k})}_{Y|X}(y|x)} =sgn(𝔼[𝓀~(X,x)Y])\displaystyle=\operatorname{sgn}\left({\mathbb{E}}\left[\tilde{\mathpzc{k}}(X,x)Y\right]\right)
=y^𝖲𝖵𝖬(𝓀)(x),\displaystyle={\hat{y}}^{(\mathpzc{k})}_{{\mathsf{SVM}}}(x),

which completes the proof. ∎

From Theorem 2, the final decision {\hat{y}}^{(\mathpzc{k})}_{{\mathsf{SVM}}} depends on \mathpzc{k} only through the centered kernel \tilde{\mathpzc{k}}. Moreover, comparing Theorem 2 with Property 3, the kernel SVM prediction differs from MAP only in applying the operator \tau to f^{*}. In particular, when the maximal correlation function f^{*} is an eigenfunction of the corresponding operator \tau\leftrightarrow\tilde{\mathpzc{k}}, i.e., \tau(f^{*})=c\cdot f^{*} for some c>0, the SVM prediction coincides with the MAP prediction, i.e., {\hat{y}}^{(\mathpzc{k})}_{{\mathsf{SVM}}}(x)={\hat{y}}_{{\mathsf{MAP}}}(x) for all x\in\mathcal{X}.

If we restrict our attention to projection kernels, the kernel SVM decision can be further interpreted as a projection operation on the associated subspace. To see this, let 𝒢\mathcal{G} denote a feature subspace of 𝒳\mathcal{F}_{\mathcal{X}} spanned by zero-mean features, then from Theorem 1 and Proposition 3, the kernel SVM loss for 𝓀𝒢\mathpzc{k}_{{\;}\mathcal{G}} is

11λ(𝒢)=1ϱ22λΠ(f;𝒢)𝒳2,\displaystyle 1-\frac{1}{\lambda}\cdot\mathscr{H}(\mathcal{G})=1-\frac{\varrho^{2}}{2\lambda}\cdot\bigl{\|}\Pi\left(f^{*};{\mathcal{G}}\right)\bigr{\|}^{2}_{\mathcal{F}_{\mathcal{X}}},

which measures the principal angle between ff^{*} and 𝒢\mathcal{G}. In addition, the decision rule can be expressed as

y^𝖲𝖵𝖬(𝓀𝒢)(x)=sgn([Π(f;𝒢)](x)),\displaystyle{\hat{y}}^{(\mathpzc{k}_{{\;}\mathcal{G}})}_{{\mathsf{SVM}}}(x)=\operatorname{sgn}([\Pi\left(f^{*};{\mathcal{G}}\right)](x)), (22)

From Proposition 3, Π(f;𝒢)\Pi\left(f^{*};{\mathcal{G}}\right) is also the most informative feature in 𝒢\mathcal{G}. Therefore, kernel SVM on 𝓀𝒢\mathpzc{k}_{{\;}\mathcal{G}} is equivalent to first extracting the most informative feature in 𝒢\mathcal{G}, and then using the extracted feature to make decision.

IV-C Relationship to Other Classification Approaches

IV-C1 Maximum a Posteriori (MAP) Estimation

From (22), when the maximal correlation kernel 𝓀\mathpzc{k}^{*} is applied, the kernel SVM decision is sgn(f(x))\operatorname{sgn}(f^{*}(x)), which coincides with the MAP prediction (cf. Property 3). Since MAP achieves the minimum prediction error, kernel SVM on the maximal correlation kernel also obtains the minimum prediction error.

IV-C2 Logistic Regression and Neural Networks

We have interpreted SVM as extracting the most informative feature, where the informativeness is measured by the H-score. The analysis in [10] has shown that logistic regression is also equivalent to maximizing the H-score when X and Y are weakly dependent. Indeed, we can show that SVM and logistic regression lead to the same prediction in the weak dependence regime, which we formalize as follows. A proof is provided in Appendix -H.

Proposition 4

Suppose ϱ=O(ϵ)\varrho=O(\epsilon) for some ϵ>0\epsilon>0. For SVM and logistic regression applied on feature f:𝒳df\colon\mathcal{X}\to\mathbb{R}^{d} with covariance Λf~=Id\Lambda_{\tilde{f}}=I_{d}, the optimal parameters satisfy

w𝖫𝖱\displaystyle w_{{\mathsf{LR}}} =2λw𝖲𝖵𝖬+o(ϵ),\displaystyle=2\lambda\cdot w_{{\mathsf{SVM}}}+o(\epsilon),
b𝖫𝖱\displaystyle b_{{\mathsf{LR}}} =2λb𝖲𝖵𝖬+o(ϵ),\displaystyle=2\lambda\cdot b_{{\mathsf{SVM}}}+o(\epsilon),

where λ\lambda is the hyperparameter in SVM. In addition, we have y^𝖲𝖵𝖬(x;f)=y^𝖫𝖱(x;f){\hat{y}}_{{\mathsf{SVM}}}(x;f)={\hat{y}}_{{\mathsf{LR}}}(x;f) for ϵ\epsilon sufficiently small.

Remark 2

Since H-score can also be directly maximized by implementing the maximal correlation regression [18], a similar connection holds for SVM and maximal correlation regression.

V Fisher Kernel

We demonstrate that Fisher kernel [16, 19] can also be interpreted as a maximal correlation kernel.

Given a family of distributions π(;θ)\pi(\cdot;\theta) supported on 𝒳{\mathcal{X}} and parameterized by θm\theta\in\mathbb{R}^{m}, suppose the score function sθ(x)θlogπ(x;θ)s_{\theta}(x)\triangleq\frac{\partial}{\partial\theta}\log\pi(x;\theta) exists. Then, the Fisher kernel is defined as the projection kernel associated with the score function sθs_{\theta}, i.e., 𝓀𝓈θ\mathpzc{k}_{{\;}s_{\theta}}.
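As a minimal illustration (our own sketch; the one-parameter unit-variance Gaussian location family below is an illustrative choice, not part of the development here), the Fisher kernel can be formed by differentiating log π numerically and plugging the resulting score into the projection kernel of Definition 2, with the expectation taken under the empirical distribution of the data.

```python
import numpy as np

def score(x, logpi, theta0, eps=1e-5):
    """Numerical score s_theta(x) = d/dtheta log pi(x; theta) at theta0 (scalar theta)."""
    return (logpi(x, theta0 + eps) - logpi(x, theta0 - eps)) / (2 * eps)

def fisher_kernel(xs, logpi, theta0=0.0):
    """Fisher kernel as the projection kernel of the score:
    k_s(x, x') = s(x) Lambda_s^{-1} s(x'), with Lambda_s = E[s(X)^2]."""
    s = np.array([score(x, logpi, theta0) for x in xs])[:, None]
    Lam = float((s * s).mean())              # scalar Lambda_s for a 1-D parameter
    return (s @ s.T) / Lam

# Illustrative family: unit-variance Gaussian location model, log-density up to a
# constant; here s(x) = x, so the Fisher kernel reduces to x * x' / E[X^2].
logpi = lambda x, theta: -0.5 * (x - theta) ** 2
xs = np.array([-1.2, 0.4, 0.9, 2.1])
K_fisher = fisher_kernel(xs, logpi)
```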

Specifically, we consider classification tasks where the joint distribution of the data variable X and the label Y is a mixture of the parameterized distributions. Suppose for each class Y=y\in\mathcal{Y}, the data variable X is generated from

PX|Y(x|y)=π(x;θy)\displaystyle P_{X|Y}(x|y)=\pi(x;\theta_{y}) (23)

for some θym\theta_{y}\in\mathbb{R}^{m}. Then we have the following result, a proof of which is provided in Appendix -I.

Theorem 3

Suppose θy<ϵ\|\theta_{y}\|<\epsilon for all y𝒴y\in\mathcal{Y}, and let s(x)s0(x)s(x)\triangleq s_{0}(x). Then for the joint distribution PX,Y=PX|YPYP_{X,Y}=P_{X|Y}P_{Y} generated according to (23), we have

PX,Y(x,y)=PX(x)PY(y)(1+s(x),θ~y)+o(ϵ),\displaystyle P_{X,Y}(x,y)=P_{X}(x)P_{Y}(y)\left(1+\langle s(x),\tilde{\theta}_{y}\rangle\right)+o(\epsilon), (24)
𝓀𝓈(𝓍,𝓍)=𝓀(𝓍,𝓍)+(ϵ),\displaystyle\mathpzc{k}_{{\;}s}(x,x^{\prime})=\mathpzc{k}^{*}(x,x^{\prime})+o(\epsilon), (25)

where θ~yθy𝔼[θY]\tilde{\theta}_{y}\triangleq\theta_{y}-\mathbb{E}[\theta_{Y}] denotes the centered θy\theta_{y}, and where 𝓀\mathpzc{k}^{*} is the maximal correlation kernel defined on PX,YP_{X,Y}. In addition, the H-score of ss satisfies

(s)\displaystyle\mathscr{H}(s) =I(X;Y)+o(ϵ2),\displaystyle=I(X;Y)+o(\epsilon^{2}), (26)

where I(X;Y)I(X;Y) denotes the mutual information between XX and YY.

From (24), the score function ss is equal to the maximal correlation function ff^{*} of PX,YP_{X,Y} up to a linear transformation [cf. (6)], and we have PY|X(y|x)=PY|X(𝓀)(y|x)=PY|X(𝓀𝓈)(y|x)+o(ϵ)P_{Y|X}(y|x)=P^{(\mathpzc{k}^{*})}_{Y|X}(y|x)=P^{(\mathpzc{k}_{{\;}s})}_{Y|X}(y|x)+o(\epsilon). Therefore, Fisher kernel is the optimal kernel for tasks generated from (23).

VI Conclusion

In this paper, we study kernel methods from the perspective of feature subspace, where we demonstrate a connection between kernel methods and informative feature extraction problems. With SVM as an example, we illustrate the relationship between kernel methods and neural networks. The theoretical results can help guide practical kernel designs and incorporate kernel methods with feature-based learning approaches.

Acknowledgments

This work was supported in part by the National Science Foundation (NSF) under Award CNS-2002908 and the Office of Naval Research (ONR) under grant N00014-19-1-2621.

-A Proof of Property 1

Suppose 𝒢\mathcal{G} is a dd-dimensional feature subspace. Let {g1,,gd}\{g_{1},\dots,g_{d}\} be an orthonormal basis of 𝒢\mathcal{G}, i.e., we have gi,gj=𝟙{𝕚=𝕛}\langle g_{i},g_{j}\rangle=\mathbbb{1}_{\{i=j\}}, and define g(x)(g1(x),g2(x),,gd(x))Tg(x)\triangleq(g_{1}(x),g_{2}(x),\dots,g_{d}(x))^{\mathrm{T}}. Then we have Λg=Id\Lambda_{g}=I_{d} and 𝓀𝒢(𝓍,𝓍)=(𝓍),(𝓍)\mathpzc{k}_{{\;}\mathcal{G}}(x,x^{\prime})=\langle g(x),g(x^{\prime})\rangle.

Therefore,

[τ(f)](x)\displaystyle[\tau(f)](x) =𝔼[𝓀𝒢(𝒳,𝓍)𝒻(𝒳)]\displaystyle={\mathbb{E}}\left[\mathpzc{k}_{{\;}\mathcal{G}}(X,x)f(X)\right]
=𝔼[g(x),g(X)f(X)]\displaystyle={\mathbb{E}}\left[\langle g(x),g(X)\rangle\cdot f(X)\right]
=i=1d𝔼[f(X)gi(X)]gi(x)\displaystyle=\sum_{i=1}^{d}{\mathbb{E}}\left[f(X)\cdot g_{i}(X)\right]\cdot g_{i}(x)
=i=1df,gi𝒳gi(x),\displaystyle=\sum_{i=1}^{d}\langle f,g_{i}\rangle_{\mathcal{F}_{\mathcal{X}}}\cdot g_{i}(x),

which implies that τ(f)=i=1df,gi𝒳gi\tau(f)=\sum_{i=1}^{d}\langle f,g_{i}\rangle_{\mathcal{F}_{\mathcal{X}}}\cdot g_{i}.

From the orthogonality principle, it suffices to prove that fτ(f),g^𝒳=0\langle f-\tau(f),{\hat{g}}\rangle_{\mathcal{F}_{\mathcal{X}}}=0 for all f𝒳f\in\mathcal{F}_{\mathcal{X}} and g^𝒢{\hat{g}}\in\mathcal{G}. To this end, suppose g^=i=1dcigi{\hat{g}}=\sum_{i=1}^{d}c_{i}\cdot g_{i} for some c1,,cdc_{1},\dots,c_{d}\in\mathbb{R}. Then, we have

τ(f),g^𝒳\displaystyle\langle\tau(f),{\hat{g}}\rangle_{\mathcal{F}_{\mathcal{X}}} =i=1df,gi𝒳gi,j=1dcjgj𝒳\displaystyle=\left\langle\sum_{i=1}^{d}\langle f,g_{i}\rangle_{\mathcal{F}_{\mathcal{X}}}\cdot g_{i},\sum_{j=1}^{d}c_{j}\cdot g_{j}\right\rangle_{\mathcal{F}_{\mathcal{X}}}
=i=1dj=1df,gi𝒳cjgi,gj𝒳\displaystyle=\sum_{i=1}^{d}\sum_{j=1}^{d}\langle f,g_{i}\rangle_{\mathcal{F}_{\mathcal{X}}}\cdot c_{j}\cdot\left\langle g_{i},g_{j}\right\rangle_{\mathcal{F}_{\mathcal{X}}}
=i=1dj=1df,gi𝒳cj𝟙{𝕚=𝕛}\displaystyle=\sum_{i=1}^{d}\sum_{j=1}^{d}\langle f,g_{i}\rangle_{\mathcal{F}_{\mathcal{X}}}\cdot c_{j}\cdot\mathbbb{1}_{\{i=j\}}
=i=1dcif,gi𝒳\displaystyle=\sum_{i=1}^{d}c_{i}\cdot\langle f,g_{i}\rangle_{\mathcal{F}_{\mathcal{X}}}
=f,i=1dcigi𝒳\displaystyle=\left\langle f,\sum_{i=1}^{d}c_{i}\cdot g_{i}\right\rangle_{\mathcal{F}_{\mathcal{X}}}
=f,g^𝒳,\displaystyle=\langle f,{\hat{g}}\rangle_{\mathcal{F}_{\mathcal{X}}},

which completes the proof. ∎

-B Proof of Proposition 2

First, note that for each y𝒴y\in\mathcal{Y},

𝔼[f~(X)|Y=y]\displaystyle{\mathbb{E}}\left[\tilde{f}(X)\middle|Y=y\right] =𝔼[f(X)|Y=y]𝔼[f(X)]\displaystyle={\mathbb{E}}\left[f(X)\middle|Y=y\right]-{\mathbb{E}}\left[f(X)\right]
=x𝒳[PX|Y(x|y)PX(x)]f(x).\displaystyle=\sum_{x\in\mathcal{X}}[P_{X|Y}(x|y)-P_{X}(x)]\cdot f(x).

Therefore, we have

2(f)\displaystyle 2\cdot\mathscr{H}(f)
=𝔼[𝔼[Λf12f~(X)|Y]2]\displaystyle\quad={\mathbb{E}}\left[\left\|{\mathbb{E}}\left[\Lambda_{f}^{-\frac{1}{2}}\tilde{f}(X)\middle|Y\right]\right\|^{2}\right]
=y𝒴PY(y)(𝔼[f~(X)|Y=y])TΛf1𝔼[f~(X)|Y=y]\displaystyle\quad=\sum_{y\in\mathcal{Y}}P_{Y}(y)\cdot\left({\mathbb{E}}\left[\tilde{f}(X)\middle|Y=y\right]\right)^{\mathrm{T}}\Lambda_{f}^{-1}{\mathbb{E}}\left[\tilde{f}(X)\middle|Y=y\right]
=y𝒴PY(y)x𝒳x𝒳([PX|Y(x|y)PX(x)]\displaystyle\quad=\sum_{y\in\mathcal{Y}}P_{Y}(y)\cdot\sum_{x\in\mathcal{X}}\sum_{x^{\prime}\in\mathcal{X}}\biggl{(}[P_{X|Y}(x|y)-P_{X}(x)]
(fT(x)Λf1f(x))[PX|Y(x|y)PX(x)])\displaystyle\quad\qquad\cdot\left(f^{\mathrm{T}}(x)\Lambda_{f}^{-1}f(x^{\prime})\right)\cdot[P_{X|Y}(x^{\prime}|y)-P_{X}(x^{\prime})]\biggr{)}
=𝔼PXX[𝓀𝒢(𝒳,𝒳)]𝔼PXPX[𝓀𝒢(𝒳,𝒳)],\displaystyle\quad={\mathbb{E}}_{P_{XX^{\prime}}}\left[\mathpzc{k}_{{\;}\mathcal{G}}(X,X^{\prime})\right]-{\mathbb{E}}_{P_{X}P_{X^{\prime}}}\left[\mathpzc{k}_{{\;}\mathcal{G}}(X,X^{\prime})\right],

which completes the proof. ∎

-C Proof of Proposition 3

We start with the first equality. Suppose PX,YP_{X,Y} has modal decomposition (cf. Proposition 1)

PX,Y(x,y)=PX(x)PY(y)[1+ϱf(x)g(y)],\displaystyle P_{X,Y}(x,y)=P_{X}(x)P_{Y}(y)\cdot\left[1+\varrho\cdot f^{*}(x)\cdot g^{*}(y)\right],

where gg^{*} satisfies 𝔼[g(Y)]=0{\mathbb{E}}\left[g^{*}(Y)\right]=0 and 𝔼[(g(Y))2]=1{\mathbb{E}}\left[(g^{*}(Y))^{2}\right]=1.

Then, we obtain

PX|Y=y(x)=PX(x)[1+ϱf(x)g(y)],\displaystyle P_{X|Y=y}(x)=P_{X}(x)\cdot\left[1+\varrho\cdot f^{*}(x)\cdot g^{*}(y)\right],

and the PX,XP_{X,X^{\prime}} as defined in (13) can be expressed as

PXX(x,x)\displaystyle P_{XX^{\prime}}(x,x^{\prime})
=y𝒴PY(y)PX|Y=y(x)PX|Y=y(x)\displaystyle\qquad=\sum_{y\in\mathcal{Y}}P_{Y}(y)P_{X|Y=y}(x)P_{X|Y=y}(x^{\prime})
=PX(x)PX(x)[y𝒴PY(y)(1+ϱf(x)g(y))\displaystyle\qquad=P_{X}(x)P_{X}(x^{\prime})\cdot\Biggl{[}\sum_{y\in\mathcal{Y}}P_{Y}(y)\left(1+\varrho\cdot f^{*}(x)\cdot g^{*}(y)\right)
(1+ϱf(x)g(y))]\displaystyle\qquad\qquad\cdot(1+\varrho\cdot f^{*}(x^{\prime})\cdot g^{*}(y))\Biggl{]}
=PX(x)PX(x)[1+ϱ2f(x)f(x)],\displaystyle\qquad=P_{X}(x)P_{X}(x^{\prime})\cdot\left[1+\varrho^{2}\cdot f^{*}(x)\cdot f^{*}(x^{\prime})\right],

where the last equality follows from the facts that {\mathbb{E}}\left[g^{*}(Y)\right]=0 and {\mathbb{E}}\left[(g^{*}(Y))^{2}\right]=1.

Note that since PX=PXP_{X^{\prime}}=P_{X}, we have

PXX(x,x)PX(x)PX(x)\displaystyle P_{XX^{\prime}}(x,x^{\prime})-P_{X}(x)P_{X^{\prime}}(x^{\prime})
=PXX(x,x)PX(x)PX(x)\displaystyle\qquad=P_{XX^{\prime}}(x,x^{\prime})-P_{X}(x)P_{X}(x^{\prime})
=PX(x)PX(x)ϱ2f(x)f(x).\displaystyle\qquad=P_{X}(x)P_{X^{\prime}}(x^{\prime})\cdot\varrho^{2}\cdot f^{*}(x)\cdot f^{*}(x^{\prime}). (27)

Next, let \tau\leftrightarrow\mathpzc{k}_{\mathcal{G}} denote the operator associated with \mathpzc{k}_{\mathcal{G}}; then from Property 1 we have \tau(f^{*})=\Pi\left(f^{*};{\mathcal{G}}\right). In addition, from the orthogonality principle, we have \langle f^{*}-\tau(f^{*}),\tau(f^{*})\rangle_{\mathcal{F}_{\mathcal{X}}}=0 and thus

f,τ(f)𝒳\displaystyle\langle f^{*},\tau(f^{*})\rangle_{\mathcal{F}_{\mathcal{X}}} =τ(f),τ(f)𝒳+fτ(f),τ(f)𝒳\displaystyle=\langle\tau(f^{*}),\tau(f^{*})\rangle_{\mathcal{F}_{\mathcal{X}}}+\langle f^{*}-\tau(f^{*}),\tau(f^{*})\rangle_{\mathcal{F}_{\mathcal{X}}}
=τ(f),τ(f)𝒳\displaystyle=\langle\tau(f^{*}),\tau(f^{*})\rangle_{\mathcal{F}_{\mathcal{X}}}
=τ(f)𝒳2=Π(f;𝒢)𝒳2.\displaystyle=\bigl{\|}\tau(f^{*})\bigr{\|}^{2}_{\mathcal{F}_{\mathcal{X}}}=\bigl{\|}\Pi\left(f^{*};{\mathcal{G}}\right)\bigr{\|}^{2}_{\mathcal{F}_{\mathcal{X}}}. (28)

Hence, the first equality of (14) can be obtained from

(𝒢)\displaystyle\mathscr{H}\left(\mathcal{G}\right) =ϱ22(𝔼PXX[𝓀𝒢(𝒳,𝒳)]𝔼PXPX[𝓀𝒢(𝒳,𝒳)])\displaystyle=\frac{\varrho^{2}}{2}\cdot\left({\mathbb{E}}_{P_{XX^{\prime}}}\left[\mathpzc{k}_{{\;}\mathcal{G}}(X,X^{\prime})\right]-{\mathbb{E}}_{P_{X}P_{X^{\prime}}}\left[\mathpzc{k}_{{\;}\mathcal{G}}(X,X^{\prime})\right]\right)
=ϱ22𝔼PXPX[f(X)f(X)𝓀𝒢(𝒳,𝒳)]\displaystyle=\frac{\varrho^{2}}{2}\cdot{\mathbb{E}}_{P_{X}P_{X^{\prime}}}\left[f^{*}(X)\cdot f^{*}(X^{\prime})\cdot\mathpzc{k}_{{\;}\mathcal{G}}(X,X^{\prime})\right]
=ϱ22x𝒳PX(x)f(x)𝔼[f(X)𝓀𝒢(𝒳,𝓍)]\displaystyle=\frac{\varrho^{2}}{2}\cdot\sum_{x\in\mathcal{X}}P_{X}(x)\cdot f^{*}(x)\cdot{\mathbb{E}}\left[f^{*}(X)\cdot\mathpzc{k}_{{\;}\mathcal{G}}(X,x)\right]
=ϱ22x𝒳PX(x)f(x)[τ(f)](x)\displaystyle=\frac{\varrho^{2}}{2}\cdot\sum_{x\in\mathcal{X}}P_{X}(x)\cdot f^{*}(x)\cdot[\tau(f^{*})](x)
=ϱ22f,τ(f)𝒳\displaystyle=\frac{\varrho^{2}}{2}\cdot\langle f^{*},\tau(f^{*})\rangle_{\mathcal{F}_{\mathcal{X}}}
=ϱ22Π(f;𝒢)𝒳2,\displaystyle=\frac{\varrho^{2}}{2}\cdot\bigl{\|}\Pi\left(f^{*};{\mathcal{G}}\right)\bigr{\|}_{\mathcal{F}_{\mathcal{X}}}^{2},

where the first equality follows from Proposition 2, where the second equality follows from (27), and where the last equality follows from (28).

To obtain the second and third equalities of (14), it suffices to note that for all f𝒢f\in\mathcal{G}, we have

Π(f;span{f})𝒳2\displaystyle\bigl{\|}\Pi\left(f^{*};{\operatorname{span}\{f\}}\right)\bigr{\|}_{\mathcal{F}_{\mathcal{X}}}^{2}
=f𝒳2fΠ(f;span{f})𝒳2\displaystyle\qquad=\bigl{\|}f^{*}\bigr{\|}^{2}_{\mathcal{F}_{\mathcal{X}}}-\bigl{\|}f^{*}-\Pi\left(f^{*};{\operatorname{span}\{f\}}\right)\bigr{\|}_{\mathcal{F}_{\mathcal{X}}}^{2}
f𝒳2fΠ(f;𝒢)𝒳2\displaystyle\qquad\leq\bigl{\|}f^{*}\bigr{\|}^{2}_{\mathcal{F}_{\mathcal{X}}}-\bigl{\|}f^{*}-\Pi\left(f^{*};{\mathcal{G}}\right)\bigr{\|}_{\mathcal{F}_{\mathcal{X}}}^{2}
=Π(f;𝒢)𝒳2\displaystyle\qquad=\bigl{\|}\Pi\left(f^{*};{\mathcal{G}}\right)\bigr{\|}_{\mathcal{F}_{\mathcal{X}}}^{2}

where the equalities follow from the orthogonality principle, and where the inequality follows from the definition of projection [cf. (1)]. In addition, it can be verified that the inequality holds with equality when f=\Pi\left(f^{*};{\mathcal{G}}\right).

Hence, for all f𝒢f\in\mathcal{G}, we have

(f)(𝒢)\displaystyle\frac{\mathscr{H}(f)}{\mathscr{H}(\mathcal{G})} =(span{f})(𝒢)\displaystyle=\frac{\mathscr{H}(\operatorname{span}\{f\})}{\mathscr{H}(\mathcal{G})}
=Π(f;span{f})𝒳2Π(f;𝒢)𝒳21=(Π(f;𝒢))(𝒢),\displaystyle=\frac{\bigl{\|}\Pi\left(f^{*};{\operatorname{span}\{f\}}\right)\bigr{\|}_{\mathcal{F}_{\mathcal{X}}}^{2}}{\bigl{\|}\Pi\left(f^{*};{\mathcal{G}}\right)\bigr{\|}_{\mathcal{F}_{\mathcal{X}}}^{2}}\leq 1=\frac{\mathscr{H}(\Pi\left(f^{*};{\mathcal{G}}\right))}{\mathscr{H}(\mathcal{G})},

which completes the proof. ∎

-D Proof of Property 2

It suffices to prove that PY|X=PY|X(𝓀)P_{Y|X}=P^{(\mathpzc{k}^{*})}_{Y|X}. To this end, suppose PX,YP_{X,Y} satisfies the modal decomposition (6), and let f(x)(f1(x),,fK(x))T,g(y)(g1(y),,gK(y))Tf^{*}(x)\triangleq(f_{1}^{*}(x),\dots,f_{K}^{*}(x))^{\mathrm{T}},g^{*}(y)\triangleq(g_{1}^{*}(y),\dots,g_{K}^{*}(y))^{\mathrm{T}}, Σdiag(σ1,,σK)\Sigma\triangleq\operatorname{diag}(\sigma_{1},\dots,\sigma_{K}). Then, it can be verified that 𝔼[fi(X)|Y=y]=σigi(y){\mathbb{E}}\left[f_{i}^{*}(X)\middle|Y=y\right]=\sigma_{i}\cdot g_{i}^{*}(y) for all i=1,,Ki=1,\dots,K, which implies that 𝔼[f(X)|Y=y]=Σg(y){\mathbb{E}}\left[f^{*}(X)\middle|Y=y\right]=\Sigma\cdot g^{*}(y).

Since Λf=I\Lambda_{f^{*}}=I, we have 𝓀(𝓍,𝓍)=𝒻(𝓍),𝒻(𝓍)\mathpzc{k}^{*}(x,x^{\prime})=\langle f^{*}(x),f^{*}(x^{\prime})\rangle, for all x,x𝒳x,x^{\prime}\in\mathcal{X}.

Hence, for all x𝒳,y𝒴x\in\mathcal{X},y\in\mathcal{Y}, we have

PY|X(𝓀)(y|x)\displaystyle P^{(\mathpzc{k}^{*})}_{Y|X}(y|x) =PY(y)(1+𝔼[𝓀(𝒳,𝓍)|𝒴=𝓎])\displaystyle=P_{Y}(y)\cdot\left(1+{\mathbb{E}}\left[\mathpzc{k}^{*}(X,x)\middle|Y=y\right]\right)
=PY(y)(1+𝔼[f(X),f(x)|Y=y])\displaystyle=P_{Y}(y)\cdot\left(1+{\mathbb{E}}\left[\langle f^{*}(X),f^{*}(x)\rangle\middle|Y=y\right]\right)
=PY(y)(1+𝔼[f(X)|Y=y],f(x))\displaystyle=P_{Y}(y)\cdot\left(1+\langle{\mathbb{E}}\left[f^{*}(X)\middle|Y=y\right],f^{*}(x)\rangle\right)
=PY(y)(1+Σg(y),f(x))\displaystyle=P_{Y}(y)\cdot\left(1+\langle\Sigma\cdot g^{*}(y),f^{*}(x)\rangle\right)
=PY(y)(1+i=1Kσifi(x)gi(y))\displaystyle=P_{Y}(y)\cdot\left(1+\sum_{i=1}^{K}\sigma_{i}\cdot f_{i}^{*}(x)\cdot g_{i}^{*}(y)\right)
=PY|X(y|x),\displaystyle=P_{Y|X}(y|x),

which completes the proof. ∎

-E Proof of Property 3

Our proof will make use of the following fact.

Fact 2

If 𝒴={1,1}\mathcal{Y}=\{-1,1\} and PY(y)12P_{Y}(y)\equiv\frac{1}{2}, the modal decomposition of PX,YP_{X,Y} (cf. Proposition 1) can be written as

PX,Y(x,y)=PX(x)PY(y)(1+ϱf(x)y),\displaystyle P_{X,Y}(x,y)=P_{X}(x)P_{Y}(y)\left(1+\varrho\cdot f^{*}(x)\cdot y\right),

where ϱ\varrho is the maximal correlation coefficient, and ff^{*} is the maximal correlation function with 𝔼[(f(X))2]=1{\mathbb{E}}\left[(f^{*}(X))^{2}\right]=1.

From Fact 2, we have

PY|X(y|x)\displaystyle P_{Y|X}(y|x) =PY(y)(1+yϱf(x))\displaystyle=P_{Y}(y)\left(1+y\cdot\varrho\cdot f^{*}(x)\right)
=12(1+yϱf(x)),\displaystyle=\frac{1}{2}\cdot\left(1+y\cdot\varrho\cdot f^{*}(x)\right),

which implies that

y^𝖬𝖠𝖯(x)\displaystyle{\hat{y}}_{\mathsf{MAP}}(x) =argmaxy𝒴PY|X(y|x)\displaystyle=\operatorname*{arg\,max}_{y\in\mathcal{Y}}P_{Y|X}(y|x)
=argmaxy𝒴[yf(x)]=sgn(f(x)).\displaystyle=\operatorname*{arg\,max}_{y\in\mathcal{Y}}~{}\left[y\cdot f^{*}(x)\right]=\operatorname{sgn}(f^{*}(x)).

-F Proof of Theorem 1

Our proof will make use of the following simple fact.

Fact 3

Given a random variable ZZ taking values from 𝒵\mathcal{Z}, let zminz_{\min} denote the minimum entry in 𝒵\mathcal{Z}, then we have

𝔼[Z]𝔼[Z+]𝔼[Z]+(zmin),\displaystyle{\mathbb{E}}\left[Z\right]\leq{\mathbb{E}}\left[Z^{+}\right]\leq{\mathbb{E}}\left[Z\right]+(z_{\min})^{-},

where xmax{x,0}x^{-}\triangleq\max\{-x,0\}.

From Fact 3, we obtain

𝔼[hinge(Y,w,f(X)+b)]\displaystyle{\mathbb{E}}\left[\ell_{\mathrm{hinge}}(Y,\langle w,f(X)\rangle+b)\right]
=𝔼[(1Yw,f(X)Yb)+]\displaystyle\quad={\mathbb{E}}\left[\left(1-Y\cdot\left\langle w,f(X)\right\rangle-Y\cdot b\right)^{+}\right]
1w,𝔼[Yf(X)]𝔼[Y]b\displaystyle\quad\geq 1-\left\langle w,{\mathbb{E}}\left[Y\cdot f(X)\right]\right\rangle-{\mathbb{E}}\left[Y\right]\cdot b
=1w,𝔼[Yf(X)].\displaystyle\quad=1-\left\langle w,{\mathbb{E}}\left[Y\cdot f(X)\right]\right\rangle. (29)

Therefore, for all w,bw,b, we have

L𝖲𝖵𝖬(f,w,b;λ)\displaystyle L_{{\mathsf{SVM}}}(f,w,b;\lambda) =𝔼[hinge(Y,w,f(X)+b)]+λ2w2\displaystyle={\mathbb{E}}\left[\ell_{\mathrm{hinge}}(Y,\langle w,f(X)\rangle+b)\right]+\frac{\lambda}{2}\cdot\|w\|^{2}
1w,𝔼[Yf(X)]+λ2w2\displaystyle\geq 1-\left\langle w,{\mathbb{E}}\left[Y\cdot f(X)\right]\right\rangle+\frac{\lambda}{2}\cdot\|w\|^{2}
=112λ𝔼[f(X)Y]2\displaystyle=1-\frac{1}{2\lambda}\cdot\left\|{\mathbb{E}}\left[f(X)Y\right]\right\|^{2}
+λ2w1λ𝔼[Yf(X)]2\displaystyle\qquad+\frac{\lambda}{2}\cdot\left\|w-\frac{1}{\lambda}\cdot{\mathbb{E}}\left[Y\cdot f(X)\right]\right\|^{2}
112λ𝔼[f(X)Y]2\displaystyle\geq 1-\frac{1}{2\lambda}\cdot\left\|{\mathbb{E}}\left[f(X)Y\right]\right\|^{2}
=L^(f;λ).\displaystyle=\hat{L}(f;\lambda).

Hence, we have

L𝖲𝖵𝖬(f;λ)=L𝖲𝖵𝖬(f;w𝖲𝖵𝖬,b𝖲𝖵𝖬,λ)L^(f;λ).\displaystyle L_{{\mathsf{SVM}}}^{*}(f;\lambda)=L_{{\mathsf{SVM}}}(f;w_{{\mathsf{SVM}}},b_{{\mathsf{SVM}}},\lambda)\geq\hat{L}(f;\lambda).

Let w1λ𝔼[Yf(X)],bw,𝔼[f(X)]w^{\prime}\triangleq\frac{1}{\lambda}\cdot{\mathbb{E}}\left[Y\cdot f(X)\right],b^{\prime}\triangleq-\langle w^{\prime},{\mathbb{E}}\left[f(X)\right]\rangle, then we have

w,f(X)+b=w,f~(X).\displaystyle\langle w^{\prime},f(X)\rangle+b^{\prime}=\left\langle w^{\prime},\tilde{f}(X)\right\rangle.

Therefore, from the upper bound in Fact 3, we have

𝔼[hinge(Y,w,f(X)+b)]\displaystyle{\mathbb{E}}\left[\ell_{\mathrm{hinge}}(Y,\langle w^{\prime},f(X)\rangle+b^{\prime})\right]
1𝔼[Yw,f~(X)]\displaystyle\qquad\leq 1-{\mathbb{E}}\left[Y\cdot\left\langle w^{\prime},\tilde{f}(X)\right\rangle\right]
+(mini{1yiw,f~(xi)})\displaystyle\qquad\qquad+\left(\min_{i}\left\{1-y_{i}\cdot\left\langle w^{\prime},\tilde{f}(x_{i})\right\rangle\right\}\right)^{-}
1w,𝔼[Yf~(X)]+(1λTλ)\displaystyle\qquad\leq 1-\left\langle w^{\prime},{\mathbb{E}}\left[Y\cdot\tilde{f}(X)\right]\right\rangle+\left(1-\frac{\lambda_{\mathrm{T}}}{\lambda}\right)^{-} (30)
=1λw2+(1λTλ)\displaystyle\qquad=1-\lambda\left\|w^{\prime}\right\|^{2}+\left(1-\frac{\lambda_{\mathrm{T}}}{\lambda}\right)^{-} (31)

where to obtain the last inequality, we have used the fact that

mini{1yiw,f~(xi)}1λTλ\displaystyle\min_{i}\left\{1-y_{i}\cdot\left\langle w^{\prime},\tilde{f}(x_{i})\right\rangle\right\}\geq 1-\frac{\lambda_{\mathrm{T}}}{\lambda}

since for each ii, we have

1yiw,f~(xi)\displaystyle 1-y_{i}\cdot\left\langle w^{\prime},\tilde{f}(x_{i})\right\rangle 1|w,f~(xi)|\displaystyle\geq 1-\left|\left\langle w^{\prime},\tilde{f}(x_{i})\right\rangle\right|
1Mw\displaystyle\geq 1-M\cdot\|w^{\prime}\|
=1Mλ𝔼[Yf(X)]\displaystyle=1-\frac{M}{\lambda}\cdot\bigl{\|}{\mathbb{E}}\left[Y\cdot f(X)\right]\bigr{\|}
=1λTλ.\displaystyle=1-\frac{\lambda_{\mathrm{T}}}{\lambda}.

Therefore

L𝖲𝖵𝖬(f;λ)\displaystyle L_{{\mathsf{SVM}}}^{*}(f;\lambda) =minw,bL𝖲𝖵𝖬(f,w,b;λ)\displaystyle=\min_{w,b}L_{{\mathsf{SVM}}}(f,w,b;\lambda)
L𝖲𝖵𝖬(f,w,b;λ)\displaystyle\leq L_{{\mathsf{SVM}}}(f,w^{\prime},b^{\prime};\lambda)
1λw2+(1λTλ)+λ2w2\displaystyle\leq 1-\lambda\left\|w^{\prime}\right\|^{2}+\left(1-\frac{\lambda_{\mathrm{T}}}{\lambda}\right)^{-}+\frac{\lambda}{2}\left\|w^{\prime}\right\|^{2}
1λ2w2+(1λTλ)\displaystyle\leq 1-\frac{\lambda}{2}\left\|w^{\prime}\right\|^{2}+\left(1-\frac{\lambda_{\mathrm{T}}}{\lambda}\right)^{-}
=L^(f;λ)+(1λTλ)\displaystyle=\hat{L}(f;\lambda)+\left(1-\frac{\lambda_{\mathrm{T}}}{\lambda}\right)^{-}
=L^(f;λ)+(λTλ1)+.\displaystyle=\hat{L}(f;\lambda)+\left(\frac{\lambda_{\mathrm{T}}}{\lambda}-1\right)^{+}.

Finally, if λλT\lambda\geq\lambda_{T}, it can be readily verified that L𝖲𝖵𝖬(f;λ)=L^(f;λ)=L𝖲𝖵𝖬(f,w,b;λ)L_{{\mathsf{SVM}}}^{*}(f;\lambda)=\hat{L}(f;\lambda)=L_{{\mathsf{SVM}}}(f,w^{\prime},b^{\prime};\lambda). As a result, the optimal solution is given by

w𝖲𝖵𝖬=w=1λ𝔼[Yf(X)],\displaystyle w_{{\mathsf{SVM}}}=w^{\prime}=\frac{1}{\lambda}\cdot{\mathbb{E}}\left[Y\cdot f(X)\right],
\displaystyle b_{{\mathsf{SVM}}}=b^{\prime}=-\langle w^{\prime},{\mathbb{E}}\left[f(X)\right]\rangle=-\langle w_{{\mathsf{SVM}}},{\mathbb{E}}\left[f(X)\right]\rangle,

and we have

w𝖲𝖵𝖬,f(x)+b𝖲𝖵𝖬\displaystyle\left\langle w_{{\mathsf{SVM}}},f(x)\right\rangle+b_{{\mathsf{SVM}}} =w𝖲𝖵𝖬,f~(x)\displaystyle=\langle w_{{\mathsf{SVM}}},\tilde{f}(x)\rangle
=1λ𝔼[Yf(X)],f~(x)\displaystyle=\frac{1}{\lambda}\cdot\left\langle{\mathbb{E}}\left[Y\cdot f(X)\right],\tilde{f}(x)\right\rangle
=1λ𝔼[f~(X)Y],f~(x).\displaystyle=\frac{1}{\lambda}\cdot\left\langle{\mathbb{E}}\left[\tilde{f}(X)\cdot Y\right],\tilde{f}(x)\right\rangle.

Therefore, the SVM prediction is given by

y^𝖲𝖵𝖬(x;f,λ)\displaystyle{\hat{y}}_{\mathsf{SVM}}(x;f,\lambda) =sgn(w𝖲𝖵𝖬,f(x)+b𝖲𝖵𝖬)\displaystyle=\operatorname{sgn}(\left\langle w_{{\mathsf{SVM}}},f(x)\right\rangle+b_{{\mathsf{SVM}}})
=sgn(𝔼[f~(X)Y],f~(x)).\displaystyle=\operatorname{sgn}\left(\left\langle{\mathbb{E}}\left[\tilde{f}(X)Y\right],\tilde{f}(x)\right\rangle\right).

To obtain (20), note that for each x𝒳x\in\mathcal{X}, we have

\displaystyle\|f(x)-{\mathbb{E}}\left[f(X)|Y=-1\right]\|^{2}-\|f(x)-{\mathbb{E}}\left[f(X)|Y=1\right]\|^{2}
\displaystyle\qquad=\left\|\tilde{f}(x)-{\mathbb{E}}\left[\tilde{f}(X)\middle|Y=-1\right]\right\|^{2}-\left\|\tilde{f}(x)-{\mathbb{E}}\left[\tilde{f}(X)\middle|Y=1\right]\right\|^{2}
\displaystyle\qquad=2\cdot\left\langle\tilde{f}(x),{\mathbb{E}}\left[\tilde{f}(X)\middle|Y=1\right]-{\mathbb{E}}\left[\tilde{f}(X)\middle|Y=-1\right]\right\rangle+\left\|{\mathbb{E}}\left[\tilde{f}(X)\middle|Y=-1\right]\right\|^{2}-\left\|{\mathbb{E}}\left[\tilde{f}(X)\middle|Y=1\right]\right\|^{2}
\displaystyle\qquad=4\cdot\left\langle\tilde{f}(x),{\mathbb{E}}\left[\tilde{f}(X)Y\right]\right\rangle,

where we have used the facts that

𝔼[f~(X)Y]\displaystyle{\mathbb{E}}\left[\tilde{f}(X)Y\right] =𝔼[Y𝔼[f~(X)|Y]]\displaystyle={\mathbb{E}}\left[Y\cdot{\mathbb{E}}\left[\tilde{f}(X)|Y\right]\right]
=12(𝔼[f~(X)|Y=1]𝔼[f~(X)|Y=1]),\displaystyle=\frac{1}{2}\left({\mathbb{E}}\left[\tilde{f}(X)\middle|Y=1\right]-{\mathbb{E}}\left[\tilde{f}(X)\middle|Y=-1\right]\right),

and

0\displaystyle 0 =𝔼[f~(X)]\displaystyle={\mathbb{E}}\left[\tilde{f}(X)\right]
=𝔼[𝔼[f~(X)|Y]]\displaystyle={\mathbb{E}}\left[{\mathbb{E}}\left[\tilde{f}(X)|Y\right]\right]
=12(𝔼[f~(X)|Y=1]+𝔼[f~(X)|Y=1]).\displaystyle=\frac{1}{2}\left({\mathbb{E}}\left[\tilde{f}(X)\middle|Y=1\right]+{\mathbb{E}}\left[\tilde{f}(X)\middle|Y=-1\right]\right).

-G Proof of Corollary 1

From Theorem 1, when λλT\lambda\geq\lambda_{\mathrm{T}}, we have

L𝖲𝖵𝖬(f;λ)=L^(f;λ)=112λ𝔼[f(X)Y]2.\displaystyle L_{{\mathsf{SVM}}}^{*}(f;\lambda)=\hat{L}(f;\lambda)=1-\frac{1}{2\lambda}\cdot\left\|{\mathbb{E}}\left[f(X)\cdot Y\right]\right\|^{2}.

Therefore, it suffices to prove that

rmin(f~)12𝔼[f(X)Y]2rmax(f~).\displaystyle r_{\min}\cdot\mathscr{H}(\tilde{f})\leq\frac{1}{2}\cdot\left\|{\mathbb{E}}\left[f(X)Y\right]\right\|^{2}\leq r_{\max}\cdot\mathscr{H}(\tilde{f}). (32)

To this end, note that we have

\begin{align*}
\left\|\mathbb{E}\left[f(X)Y\right]\right\|^{2}=\left\|\mathbb{E}\left[\tilde{f}(X)Y\right]\right\|^{2}
&=\mathbb{E}\left[\left\|Y\cdot\mathbb{E}\left[\tilde{f}(X)Y\right]\right\|^{2}\right]\\
&=\mathbb{E}\left[\left\|\mathbb{E}\left[\tilde{f}(X)\middle|Y\right]\right\|^{2}\right],
\end{align*}

where the last equality follows from the fact that for zero-mean $f$ and $Y$ uniformly distributed on $\{1,-1\}$, we have $\mathbb{E}\left[f(X)|Y=y\right]=y\cdot\mathbb{E}\left[f(X)\cdot Y\right]$ for $y\in\mathcal{Y}$.

In addition, for all $v\in\operatorname{span}\left\{\tilde{f}(x)\colon x\in\mathcal{X}\right\}$, we have

\begin{align*}
r_{\min}\leq\frac{\|v\|^{2}}{\left\|\Lambda_{\tilde{f}}^{-\frac{1}{2}}v\right\|^{2}}\leq r_{\max}.
\end{align*}

Therefore, we obtain

\begin{align*}
r_{\min}\cdot\left\|\Lambda_{\tilde{f}}^{-\frac{1}{2}}\mathbb{E}\left[\tilde{f}(X)\middle|Y\right]\right\|^{2}
&\leq\left\|\mathbb{E}\left[\tilde{f}(X)\middle|Y\right]\right\|^{2}\\
&\leq r_{\max}\cdot\left\|\Lambda_{\tilde{f}}^{-\frac{1}{2}}\mathbb{E}\left[\tilde{f}(X)\middle|Y\right]\right\|^{2}.\tag{33}
\end{align*}

Taking the expectation of (33) over $P_{Y}$ yields (32). ∎
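The sandwich bound (32) can also be checked numerically. The Python sketch below assumes the H-score takes the form $\mathscr{H}(\tilde{f})=\frac{1}{2}\mathbb{E}\bigl[\|\Lambda_{\tilde{f}}^{-1/2}\mathbb{E}[\tilde{f}(X)|Y]\|^{2}\bigr]$ and that $r_{\min}$, $r_{\max}$ are the extreme eigenvalues of $\Lambda_{\tilde{f}}$ on the span of $\tilde{f}$, matching how these quantities are used in the proof; the feature map and distributions are arbitrary, with $Y$ uniform on $\{-1,1\}$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

# Toy finite-alphabet model (illustrative choices only).
n_x, d = 10, 3
f = rng.normal(size=(n_x, d))
p_x_given_y = rng.dirichlet(np.ones(n_x), size=2)   # rows: P_{X|Y=-1}, P_{X|Y=+1}
p_y = np.array([0.5, 0.5])
y_vals = np.array([-1.0, 1.0])

p_x = p_y @ p_x_given_y
f_tilde = f - p_x @ f                               # centered feature f~
cov = (f_tilde * p_x[:, None]).T @ f_tilde          # Lambda_{f~} = E[f~(X) f~(X)^T]
evals = np.linalg.eigvalsh(cov)
r_min, r_max = evals.min(), evals.max()             # assumed r_min, r_max

cond_means = p_x_given_y @ f_tilde                  # E[f~(X) | Y = y], shape (2, d)
e_fy = p_y @ (y_vals[:, None] * cond_means)         # E[f~(X) Y] = E[f(X) Y]

# Assumed H-score: (1/2) E_Y[ || Lambda^{-1/2} E[f~(X)|Y] ||^2 ]
h_score = 0.5 * p_y @ np.einsum('yi,ij,yj->y', cond_means,
                                np.linalg.inv(cov), cond_means)

mid = 0.5 * e_fy @ e_fy                             # (1/2) ||E[f(X) Y]||^2
print(r_min * h_score <= mid <= r_max * h_score)    # expected: True
\end{verbatim}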

H Proof of Proposition 4

First, note that logistic regression can be regarded as a special case of softmax regression with the correspondences $w_{\mathsf{LR}}=w(1)-w(-1)$ and $b_{\mathsf{LR}}=b(1)-b(-1)$, where $w(y)\in\mathbb{R}^{d}$ and $b(y)\in\mathbb{R}$ are the weights and bias of the softmax regression, respectively.

In addition, from [10, Theorem 2], the centered weight $\tilde{w}(y)\triangleq w(y)-\mathbb{E}\left[w(Y)\right]$ and the bias $b(y)$ are given by

\begin{align*}
\tilde{w}(y)&=\Lambda_{\tilde{f}}^{-1}\mathbb{E}\left[\tilde{f}(X)\middle|Y=y\right]+o(\epsilon)\\
&=\mathbb{E}\left[\tilde{f}(X)\middle|Y=y\right]+o(\epsilon),\\
b(y)&=-\langle\mathbb{E}\left[f(X)\right],\tilde{w}(y)\rangle+o(\epsilon).
\end{align*}

Therefore, we obtain

\begin{align*}
w_{\mathsf{LR}}&=w(1)-w(-1)\\
&=\tilde{w}(1)-\tilde{w}(-1)\\
&=\mathbb{E}\left[\tilde{f}(X)\middle|Y=1\right]-\mathbb{E}\left[\tilde{f}(X)\middle|Y=-1\right]+o(\epsilon)\\
&=2\cdot\mathbb{E}\left[Y\cdot\tilde{f}(X)\right]+o(\epsilon)\\
&=2\lambda\cdot w_{\mathsf{SVM}}+o(\epsilon)
\end{align*}

and

\begin{align*}
b_{\mathsf{LR}}&=b(1)-b(-1)\\
&=-\langle\mathbb{E}\left[f(X)\right],\tilde{w}(1)-\tilde{w}(-1)\rangle+o(\epsilon)\\
&=-\langle\mathbb{E}\left[f(X)\right],w_{\mathsf{LR}}\rangle+o(\epsilon)\\
&=-2\lambda\cdot\langle\mathbb{E}\left[f(X)\right],w_{\mathsf{SVM}}\rangle+o(\epsilon)\\
&=2\lambda\cdot b_{\mathsf{SVM}}+o(\epsilon),
\end{align*}

which implies that

\begin{align*}
\langle w_{\mathsf{LR}},f(x)\rangle+b_{\mathsf{LR}}=2\lambda\cdot\left(\langle w_{\mathsf{SVM}},f(x)\rangle+b_{\mathsf{SVM}}\right)+o(\epsilon).
\end{align*}

From (11) and (12), we have $\hat{y}_{\mathsf{SVM}}(x;f)=\hat{y}_{\mathsf{LR}}(x;f)$ for sufficiently small $\epsilon$. ∎
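For concreteness, the sketch below illustrates this correspondence numerically. It constructs a weakly dependent model $P_{X|Y=y}(x)=P_{X}(x)(1+\epsilon\,y\,g(x))$ with $Y$ uniform, whitens the feature so that $\mathbb{E}\left[f(X)\right]=0$ and $\Lambda_{f}=I$ (which the step $\Lambda_{\tilde{f}}^{-1}\mathbb{E}[\tilde{f}(X)|Y=y]=\mathbb{E}[\tilde{f}(X)|Y=y]+o(\epsilon)$ above implicitly requires), fits logistic regression at the population level by gradient descent, and compares $w_{\mathsf{LR}}$ with $2\lambda\cdot w_{\mathsf{SVM}}$ computed from the closed form derived earlier in this appendix. All model choices are arbitrary, and the agreement holds only up to the $o(\epsilon)$ terms.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

# Weakly dependent toy model with uniform Y (illustrative choices only).
n_x, d, eps, lam = 6, 2, 0.02, 1.0
p_x0 = rng.dirichlet(np.ones(n_x))
g = rng.normal(size=n_x)
g -= p_x0 @ g                                    # zero-mean perturbation direction
p_x_given_y = np.stack([p_x0 * (1 - eps * g), p_x0 * (1 + eps * g)])
p_y, y_vals = np.array([0.5, 0.5]), np.array([-1.0, 1.0])
p_x = p_y @ p_x_given_y                          # equals p_x0 here

# Center and whiten the feature so that E[f(X)] = 0 and Lambda_f = I.
f = rng.normal(size=(n_x, d))
f = f - p_x @ f
L = np.linalg.cholesky((f * p_x[:, None]).T @ f)
f = f @ np.linalg.inv(L).T

# Closed-form SVM solution derived above.
w_svm = (p_y @ (y_vals[:, None] * (p_x_given_y @ f))) / lam
b_svm = -w_svm @ (p_x @ f)                       # zero here since E[f(X)] = 0

# Population logistic regression: gradient descent on
# E[log(1 + exp(-Y(<w, f(X)> + b)))].
w_lr, b_lr = np.zeros(d), 0.0
for _ in range(5000):
    grad_w, grad_b = np.zeros(d), 0.0
    for i, y in enumerate(y_vals):
        margin = y * (f @ w_lr + b_lr)
        coef = p_y[i] * p_x_given_y[i] * (-y) / (1.0 + np.exp(margin))
        grad_w += coef @ f
        grad_b += coef.sum()
    w_lr -= grad_w
    b_lr -= grad_b

print(w_lr, 2 * lam * w_svm)   # should nearly coincide for small eps
print(b_lr, 2 * lam * b_svm)
\end{verbatim}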

I Proof of Theorem 3

To begin, note that we have

\begin{align*}
P_{X|Y}(x|y)&=\pi(x;\theta_{y})\\
&=\pi(x;0)+\left\langle\left.\frac{\partial}{\partial\theta}\pi(x;\theta)\right|_{\theta=0},\theta_{y}\right\rangle+o(\epsilon)\\
&=\pi(x;0)\left(1+\langle s(x),\theta_{y}\rangle\right)+o(\epsilon),
\end{align*}

which implies that

\begin{align*}
P_{X}(x)&=\sum_{y\in\mathcal{Y}}P_{X|Y}(x|y)P_{Y}(y)\\
&=\sum_{y\in\mathcal{Y}}\pi(x;\theta_{y})P_{Y}(y)\\
&=\pi(x;0)\left(1+\langle s(x),\mathbb{E}\left[\theta_{Y}\right]\rangle\right)+o(\epsilon).
\end{align*}

Therefore, we obtain

\begin{align*}
\frac{P_{X|Y}(x|y)}{P_{X}(x)}&=\frac{1+\langle s(x),\theta_{y}\rangle}{1+\langle s(x),\mathbb{E}\left[\theta_{Y}\right]\rangle}+o(\epsilon)\\
&=\left(1+\langle s(x),\theta_{y}\rangle\right)\cdot\left(1-\langle s(x),\mathbb{E}\left[\theta_{Y}\right]\rangle+o(\epsilon)\right)\\
&=1+\bigl\langle s(x),\tilde{\theta}_{y}\bigr\rangle+o(\epsilon),
\end{align*}

where we have used the fact that $\|\mathbb{E}\left[\theta_{Y}\right]\|=O(\epsilon)$ since $\|\theta_{y}\|=O(\epsilon)$ for all $y\in\mathcal{Y}$.

Hence, we have

\begin{align*}
P_{X,Y}(x,y)&=P_{X|Y}(x|y)P_{Y}(y)\\
&=P_{X}(x)P_{Y}(y)\left(1+\langle s(x),\tilde{\theta}_{y}\rangle\right)+o(\epsilon).\tag{34}
\end{align*}

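The first-order expansion (34) can be checked numerically. The sketch below uses a small exponential family $\pi(x;\theta)\propto\pi_{0}(x)\exp(\langle\theta,t(x)\rangle)$ on a finite alphabet, whose score at $\theta=0$ is $s(x)=t(x)-\mathbb{E}_{\pi_{0}}[t(X)]$; the alphabet size, the statistics $t$, and the directions of $\theta_{y}$ are arbitrary illustrative choices. The approximation error should decay roughly as $\epsilon^{2}$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)

# Finite-alphabet exponential family pi(x; theta) ~ pi0(x) exp(<theta, t(x)>).
n_x, m = 7, 2
pi0 = rng.dirichlet(np.ones(n_x))
t = rng.normal(size=(n_x, m))
s = t - pi0 @ t                         # score at theta = 0

def pi(theta):
    p = pi0 * np.exp(t @ theta)
    return p / p.sum()

p_y = np.array([0.3, 0.7])
v = rng.normal(size=(2, m))             # directions of theta_y

for eps in (0.1, 0.05, 0.025):
    theta = eps * v                     # theta_y = eps * v_y, so ||theta_y|| = O(eps)
    theta_tilde = theta - p_y @ theta   # centered parameters theta~_y
    p_xy = np.stack([pi(theta[i]) * p_y[i] for i in range(2)], axis=1)
    p_x = p_xy.sum(axis=1)
    approx = p_x[:, None] * p_y[None, :] * (1 + s @ theta_tilde.T)
    print(eps, np.abs(p_xy - approx).max())   # error shrinks roughly as eps^2
\end{verbatim}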
Without loss of generality, we assume $\Lambda_{s}$ is non-singular, since otherwise we can reparameterize $\theta$ as a vector of dimension less than $m$. Comparing (34) with (6), we have $K=m$. In addition, there exists $A\in\mathbb{R}^{m\times m}$ such that

\begin{align*}
s(x)=A\cdot f^{*}(x)+o(\epsilon)\quad\text{for all }x\in\mathcal{X}.\tag{35}
\end{align*}

Then we can readily verify (25) from the definition. Finally, (26) follows from (35) and the fact that

\begin{align*}
\mathscr{H}(f^{*})=\frac{1}{2}\sum_{i=1}^{K}\sigma_{i}^{2}=I(X;Y)+o(\epsilon^{2}),
\end{align*}

where the second equality follows from the modal decomposition of mutual information (see, e.g., [11, Lemma 16]). ∎
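The approximation in the last display can also be observed numerically. In the sketch below, the $\sigma_{i}$ are taken as the singular values of the matrix with entries $\bigl(P_{X,Y}(x,y)-P_{X}(x)P_{Y}(y)\bigr)/\sqrt{P_{X}(x)P_{Y}(y)}$, which is how the modal decomposition in [11] is commonly set up (an assumption of this sketch); the joint distribution is an arbitrary weak perturbation of a product distribution.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)

# Weakly dependent joint distribution on finite alphabets (illustrative).
n_x, n_y, eps = 8, 3, 0.05
p_x = rng.dirichlet(np.ones(n_x))
p_y = rng.dirichlet(np.ones(n_y))
phi = rng.normal(size=(n_x, n_y))
phi -= np.outer(phi @ p_y, np.ones(n_y))        # E_Y[phi(x, Y)] = 0 for each x
phi -= np.outer(np.ones(n_x), p_x @ phi)        # E_X[phi(X, y)] = 0 for each y
p_xy = np.outer(p_x, p_y) * (1 + eps * phi)     # valid joint for small eps

# Canonical dependence matrix and its singular values sigma_i.
B = (p_xy - np.outer(p_x, p_y)) / np.sqrt(np.outer(p_x, p_y))
sigma = np.linalg.svd(B, compute_uv=False)

mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))   # I(X;Y) in nats
print(0.5 * np.sum(sigma ** 2), mi)                     # close for small eps
\end{verbatim}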

References

  • [1] D. Storcheus, A. Rostamizadeh, and S. Kumar, “A survey of modern questions and challenges in feature extraction,” in Feature Extraction: Modern Questions and Challenges. PMLR, 2015, pp. 1–18.
  • [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müllers, “Fisher discriminant analysis with kernels,” in Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop. IEEE, 1999, pp. 41–48.
  • [4] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of Machine Learning Research, vol. 3, pp. 1–48, 2002.
  • [5] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
  • [6] B. Schölkopf, “The kernel trick for distances,” in Advances in Neural Information Processing Systems, vol. 13, 2000.
  • [7] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
  • [8] J. Platt et al., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61–74, 1999.
  • [9] J. Xu, R. Jenssen, A. Paiva, and I. Park, “A reproducing kernel Hilbert space framework for ITL,” in Information Theoretic Learning. Springer, 2010, pp. 351–384.
  • [10] X. Xu, S.-L. Huang, L. Zheng, and G. W. Wornell, “An information theoretic interpretation to deep neural networks,” Entropy, vol. 24, no. 1, p. 135, 2022.
  • [11] S.-L. Huang, A. Makur, G. W. Wornell, and L. Zheng, “On universal features for high-dimensional learning and inference,” arXiv preprint arXiv:1911.09105, 2019.
  • [12] H. O. Hirschfeld, “A connection between correlation and contingency,” Proceedings of the Cambridge Philosophical Society, vol. 31, no. 4, 1935, pp. 520–524.
  • [13] H. Gebelein, “Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung,” ZAMM – Zeitschrift für Angewandte Mathematik und Mechanik, vol. 21, no. 6, pp. 364–379, 1941.
  • [14] A. Rényi, “On measures of dependence,” Acta Mathematica Academiae Scientiarum Hungarica, vol. 10, no. 3–4, pp. 441–451, 1959.
  • [15] X. Xu and L. Zheng, “Multivariate feature extraction,” in 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2022, pp. 1–8.
  • [16] T. S. Jaakkola and D. Haussler, “Exploiting generative models in discriminative classifiers,” in Advances in Neural Information Processing Systems, 1999, pp. 487–493.
  • [17] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2009.
  • [18] X. Xu and S.-L. Huang, “Maximal correlation regression,” IEEE Access, vol. 8, pp. 26591–26601, 2020.
  • [19] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.