Kernel Subspace and Feature Extraction
Abstract
We study kernel methods in machine learning from the perspective of feature subspaces. We establish a one-to-one correspondence between feature subspaces and kernels and propose an information-theoretic measure for kernels. In particular, we construct a kernel from the Hirschfeld–Gebelein–Rényi maximal correlation functions, coined the maximal correlation kernel, and demonstrate its information-theoretic optimality. We use the support vector machine (SVM) as an example to illustrate a connection between kernel methods and feature extraction approaches. We show that the kernel SVM with the maximal correlation kernel achieves the minimum prediction error. Finally, we interpret the Fisher kernel as a special maximal correlation kernel and establish its optimality.
I Introduction
One main objective of machine learning is to obtain useful information from often high-dimensional data. To this end, it is common practice to extract meaningful feature representations from the original data and then process the features [1]. Neural networks [2] and kernel methods [3, 4, 5, 6] are two of the most representative approaches for mapping data into a feature space. In neural networks, the features are represented as the outputs of hidden neurons in the network. In contrast, the feature mapping in kernel methods is determined by the chosen kernel; it is used only implicitly and is often infinite-dimensional. While kernel approaches require far fewer parameters and can achieve good empirical performance on certain tasks [7], the performance relies heavily on the choice of kernel. Despite many attempts to investigate kernel methods [8, 9, 6], a theoretical understanding of the mechanism behind them is still lacking, which restricts their application to complicated data.
On the other hand, feature extraction in deep neural networks has recently been studied through information-theoretic and statistical analyses [10, 11]. For example, it was shown in [10] that the feature extracted by deep neural networks coincides with the most informative feature, which is essentially related to the classical Hirschfeld–Gebelein–Rényi (HGR) maximal correlation problem [12, 13, 14]. Such theoretical characterizations provide a better understanding of existing algorithms and have proven useful in designing algorithms for multimodal learning tasks [15].
In this paper, our goal is to characterize kernel methods from the perspective of feature subspaces and reveal their connection with other learning approaches. We first introduce the kernel associated with each given feature subspace, which we coin the projection kernel, to establish a correspondence between kernel operations and geometric operations in feature subspaces. This connection allows us to study kernel methods by analyzing the corresponding feature subspaces. Specifically, we propose an information-theoretic measure for projection kernels and demonstrate that the information-theoretically optimal kernel can be constructed from the HGR maximal correlation functions, coined the maximal correlation kernel. We further demonstrate that the SVM with the maximal correlation kernel achieves the minimum prediction error, which justifies its optimality in learning tasks. Our analysis also reveals connections between SVM and other classification approaches, including neural networks. Finally, we interpret the Fisher kernel, a classical kernel induced from parameterized distribution families [16], as a special case of the maximal correlation kernel, thus demonstrating its optimality.
II Preliminaries and Notations
Throughout this paper, we use $X$ and $Y$ to denote two random variables with alphabets $\mathcal{X}$ and $\mathcal{Y}$, and denote their joint distribution and marginals as $P_{X,Y}$ and $P_X, P_Y$, respectively. We also use $\mathbb{E}[\cdot]$ to denote the expectation with respect to $P_{X,Y}$.
II-A Feature Space
We adopt the notation convention introduced in [15], and let $\mathcal{F}_{\mathcal{X}}$ denote the feature space formed by the (one-dimensional) features of $X$, with the geometry defined as follows. The inner product on $\mathcal{F}_{\mathcal{X}}$ is defined as $\langle f_1, f_2 \rangle \triangleq \mathbb{E}[f_1(X) f_2(X)]$ for $f_1, f_2 \in \mathcal{F}_{\mathcal{X}}$. This induces a norm $\|f\| \triangleq \sqrt{\langle f, f \rangle}$ for $f \in \mathcal{F}_{\mathcal{X}}$. Then, for given $f \in \mathcal{F}_{\mathcal{X}}$ and subspace $\mathcal{S}$ of $\mathcal{F}_{\mathcal{X}}$, we denote the projection of $f$ onto $\mathcal{S}$ as
(1) |
In addition, for a $k$-dimensional feature $f = (f_1, \dots, f_k)$, we use $\operatorname{span}(f)$ to denote the subspace spanned by all $k$ dimensions of $f$. We also use $\bar{f}$ to denote the centered $f$, i.e., $\bar{f}(x) \triangleq f(x) - \mathbb{E}[f(X)]$, and denote the covariance matrix of $f$ by $\Lambda_f$.
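As a concrete illustration of this geometry, the sketch below computes the inner product and the projection in (1) on a finite alphabet. It assumes the inner product takes the form $\mathbb{E}[f_1(X) f_2(X)]$ stated above; the marginal `px`, the basis, and the feature values are hypothetical toy choices.

```python
import numpy as np

# Toy alphabet and marginal P_X (illustrative values).
px = np.array([0.2, 0.3, 0.1, 0.4])            # P_X over a 4-symbol alphabet

def inner(f1, f2, px=px):
    """Assumed inner product <f1, f2> = E[f1(X) f2(X)] on the feature space."""
    return np.sum(px * f1 * f2)

def project(f, basis, px=px):
    """Project feature f onto span(basis) under the weighted inner product.

    basis: list of features (arrays over the alphabet), not necessarily orthonormal.
    Solves the normal equations G a = b, with G_ij = <b_i, b_j> and b_i = <b_i, f>.
    """
    B = np.stack(basis)                         # shape (k, |X|)
    G = (B * px) @ B.T                          # Gram matrix under the weighted inner product
    b = (B * px) @ f
    a = np.linalg.solve(G, b)
    return a @ B

f = np.array([1.0, -2.0, 0.5, 3.0])
basis = [np.ones(4), np.array([1.0, 0.0, -1.0, 2.0])]
f_hat = project(f, basis)
# Orthogonality check: the residual is orthogonal to every basis feature.
print([round(inner(f - f_hat, b), 10) for b in basis])
```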
II-B Kernel
Given $\mathcal{X}$, a symmetric function $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel on $\mathcal{X}$ if, for every finite subset $\{x_1, \dots, x_n\} \subset \mathcal{X}$, the $n \times n$ matrix $[k(x_i, x_j)]_{i,j=1}^{n}$ is positive semidefinite. For each kernel $k$, we define the associated functional operator as
(2) |
and we use to denote the correspondence between and . Furthermore, we define the centered kernel as
(3) |
where we have defined .
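The sketch below illustrates these definitions empirically on a sample: it checks positive semidefiniteness of a Gram matrix and applies the usual double-centering of the Gram matrix, which is one standard finite-sample realization of a centered kernel. Since the exact form of (3) is not reproduced above, the centering step should be read as an assumption; the data and the RBF kernel are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # 50 samples, 3-dimensional data

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

K = rbf_kernel(X, X)

# Positive semidefiniteness of the Gram matrix (the defining property of a kernel):
eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigvals.min())         # >= 0 up to numerical error

# Empirical centering (assumed finite-sample analogue of (3)):
n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
K_centered = H @ K @ H                          # subtract row, column, and grand means
```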
The following fact is the basis of the kernel trick in learning algorithms.
Fact 1
For each given kernel $k$, there exist an inner product space with inner product $\langle \cdot, \cdot \rangle$ and a mapping $\phi \colon \mathcal{X} \to$ that space, such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for all $x, x' \in \mathcal{X}$.
Remark 1
In addition, we introduce the kernelized discriminative model (KDM) as follows.
Definition 1 (Kernelized Discriminative Model)
For each kernel , we define its associated kernelized discriminative model as
(4) |
Then, we use to denote the maximum a posteriori (MAP) estimation induced from KDM , i.e.,
(5) |
The KDM can be regarded as a generalized probability distribution, since we have for all while can sometimes take negative values.
II-C Modal Decomposition, Maximal Correlation, and H-score
Proposition 1 (Modal Decomposition [11])
For given , there exists , such that
(6) |
where , and for all , where denotes the indicator function.
It can be shown that pairs are the most correlated function pairs of and , referred to as maximal correlation functions. We also denote , known as the HGR maximal correlation [12, 13, 14] of and , and define the -dimensional feature . In particular, when is binary, we have .
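For finite alphabets, the modal decomposition and the maximal correlation functions can be computed from an SVD of the matrix with entries $P_{X,Y}(x, y) / \sqrt{P_X(x) P_Y(y)}$: the top singular value is $1$, and the remaining singular triples give the coefficients and function pairs in (6). The sketch below follows this standard construction on a hypothetical joint distribution.

```python
import numpy as np

# Toy joint distribution P_{X,Y} over |X| = 4, |Y| = 2 (rows: x, columns: y).
Pxy = np.array([[0.10, 0.15],
                [0.20, 0.05],
                [0.05, 0.20],
                [0.15, 0.10]])
Px, Py = Pxy.sum(1), Pxy.sum(0)

# Canonical dependence matrix with entries P(x, y) / sqrt(P(x) P(y)).
B = Pxy / np.sqrt(np.outer(Px, Py))
U, s, Vt = np.linalg.svd(B)
k = len(s)

# The top singular value is 1 (singular vectors sqrt(Px), sqrt(Py)); the remaining
# triples give the coefficients sigma_i and maximal correlation functions f_i, g_i in (6).
sigma = s[1:k]
f = U[:, 1:k] / np.sqrt(Px)[:, None]            # f_i(x): zero mean, unit variance under P_X
g = Vt[1:k, :].T / np.sqrt(Py)[:, None]         # g_i(y): zero mean, unit variance under P_Y

print("HGR maximal correlation:", sigma[0])
print("E[f_1(X)] =", Px @ f[:, 0], " E[f_1(X)^2] =", Px @ f[:, 0]**2)
```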
It has been shown in [11] that the maximal correlation functions are the optimal features of in inferring or estimating . In general, given a -dimensional feature of , the effectiveness of in inferring or estimating can be measured by its H-score [10, 11], defined as
(7) |
where . It can be verified that for all and , we have
(8) |
where are as defined in (6).
II-D Binary Classification
We consider the binary classification problem of predicting a binary label $Y$ from the data variable $X$. For convenience, we assume $Y$ takes values from $\mathcal{Y} = \{-1, +1\}$.
Suppose the training dataset contains $n$ sample pairs $\{(x_i, y_i)\}_{i=1}^{n}$ of $(X, Y)$, and let $\hat{P}_{X,Y}$ denote the corresponding empirical distribution, i.e.,
(9) |
II-D1 Support Vector Machine
The support vector machine (SVM) solves binary classification tasks by finding the optimal hyperplane that separates two classes with maximum margin [7]. Given -dimensional feature mapping , the loss for SVM based on can be written as
(10) |
where are the parameters of the hyperplane, where is a hyperparameter of SVM, and where denotes the hinge loss, defined as with .
Moreover, let and denote the optimal parameters and the value of loss function, respectively. Then, the prediction of SVM is
(11) |
where denotes the sign function.
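Since the exact expression in (10) is not reproduced here, the sketch below implements one common primal form of the SVM objective, an average hinge loss plus a quadratic regularizer, together with the sign decision rule in (11). The placement of the regularization hyperparameter `lam` is an assumption and may differ from the paper's convention.

```python
import numpy as np

def hinge(t):
    """Hinge loss: max(0, 1 - t)."""
    return np.maximum(0.0, 1.0 - t)

def svm_primal_loss(w, b, F, y, lam=0.1):
    """A common primal SVM objective on a feature matrix F (n x k) and labels y in {-1, +1}:
    average hinge loss on the margins plus (lam / 2) * ||w||^2 (assumed form of (10))."""
    margins = y * (F @ w + b)
    return hinge(margins).mean() + 0.5 * lam * np.dot(w, w)

def svm_predict(w, b, F):
    """Decision rule (11): the sign of the affine score."""
    return np.sign(F @ w + b)
```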
Specifically, for a given kernel , the prediction of the corresponding kernel SVM is , where $\phi$ is any mapping given by Fact 1. (It is worth mentioning that the practical implementation of kernel SVM is typically done by solving a dual optimization problem without explicitly using $\phi$; see [17, Section 12] for detailed discussions.)
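In line with the remark above, kernel SVM is usually trained through the dual using Gram matrices rather than an explicit feature mapping. A minimal sketch using scikit-learn's precomputed-kernel interface is given below; the synthetic data and the RBF kernel are illustrative choices, not part of the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 5))
y_train = np.sign(X_train[:, 0] + 0.3 * rng.normal(size=100)).astype(int)
X_test = rng.normal(size=(20, 5))

def rbf(A, B, gamma=0.2):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

# Gram matrices: (train x train) for fitting, (test x train) for prediction.
K_train = rbf(X_train, X_train)
K_test = rbf(X_test, X_train)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y_train)
y_hat = clf.predict(K_test)
```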
II-D2 Logistic Regression and Neural Networks
Given -dimensional feature of , the discriminative model of logistic regression is , where are the weight and bias, respectively, and where is defined as .
Then, the loss of logistic regression is , and the optimal parameters are learned by minimizing the loss, i.e., . The resulting decision rule is
(12) |
Logistic regression is often used as the classification layer of multi-layer neural networks, where  and  correspond to the weights and the bias term, respectively. In this case, the feature mapping also takes a parameterized form, and its parameters are jointly learned with the weights and the bias.
III Projection Kernel and Informative Features
In this section, we introduce a one-to-one correspondence between kernels and feature subspaces, and then characterize the informativeness of kernels by investigating the features in the associated subspaces.
III-A Projection Kernel and Feature Subspace
We first introduce a family of kernels with a one-to-one correspondence to feature subspaces.
Definition 2 (Projection Kernel)
Let denote a -dimensional subspace of with a basis . We use to denote the projection kernel associated with , defined as , where we have defined and .
With slight abuse of notation, we also denote , the projection kernel associated with .
Note that is a valid kernel function, and the corresponding mapping in Fact 1 can be chosen as for any orthonormal basis of . It turns out that the functional operators associated with projection kernels are projection operators in the feature space, which we formalize as follows. A proof is provided in Appendix -A.
Property 1
Let denote the operator corresponding to subspace [cf. (2)], then we have for all .
Therefore, given a projection kernel , the associated subspace can be represented as , where is the associated operator. This establishes a one-to-one correspondence between projection kernels and feature subspaces.
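On a finite alphabet, Definition 2 and Property 1 can be checked numerically. The sketch below assumes the projection kernel takes the form $k(x, x') = f(x)^{\mathrm{T}} G^{-1} f(x')$ for a basis $f$ with Gram matrix $G$ (which reduces to a sum of products for an orthonormal basis), and that the operator in (2) acts as $(T h)(x) = \mathbb{E}[k(x, X') h(X')]$; both forms are assumptions consistent with the text. Under them, the operator is idempotent and fixes every feature in the subspace, as Property 1 states.

```python
import numpy as np

px = np.array([0.2, 0.3, 0.1, 0.4])                     # marginal P_X on a 4-symbol alphabet
F = np.array([[1.0, 0.5],                               # two (not necessarily orthonormal)
              [0.0, -1.0],                              # basis features, one row per x
              [2.0, 0.3],
              [-1.0, 1.0]])

# Projection kernel (assumed form): k(x, x') = f(x)^T G^{-1} f(x'),
# where G_ij = E[f_i(X) f_j(X)] is the Gram matrix of the basis.
G = F.T @ (px[:, None] * F)
K = F @ np.linalg.solve(G, F.T)

# Associated operator (assumed form of (2)): (T h)(x) = E[k(x, X') h(X')].
T = K * px[None, :]

# Property 1: T is the orthogonal projection onto span(F), hence idempotent...
print(np.allclose(T @ T, T))                            # True
# ...and it fixes every feature in the subspace.
print(np.allclose(T @ F, F))                            # True
```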
III-B H-score and Informative Features
The projection kernel provides a connection between feature subspaces and kernels, which allows us to characterize a subspace in terms of its corresponding kernel . Specifically, we can represent the H-score [cf. (7)] of a feature in terms of the projection kernel , formalized as follows. A proof is provided in Appendix -B.
Proposition 2
For all with , we have , where we have defined such that the joint distribution of and is
(13) |
With slight abuse of notation, we can use to denote the H-score corresponding to feature subspace . In particular, we have the following characterization of when is binary. A proof is provided in Appendix -C.
Proposition 3
Suppose is binary, and is the maximal correlation function of . Then, for each subspace of , we have
(14) |
From Proposition 3, depends only on the projection of onto , which is also the most informative feature in . In addition, note that since , is also the cosine value of the principal angle between and . Therefore, we can interpret the H-score as a measure of the principal angle between the optimal feature and the given subspace.
III-C Maximal Correlation Kernel
Note that from (8), is maximized when takes the maximal correlation function . Therefore, the subspace (and thus projection kernel ) is optimal in terms of the H-score measure. We will denote , referred to as the maximal correlation kernel.
Specifically, the KDM (cf. Definition 1) of maximal correlation kernel coincides with the underlying conditional distribution , demonstrated as follows. A proof is provided in Appendix -D.
Property 2
For all and , we have and , where denotes the MAP estimation, i.e.,
(15) |
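The identity underlying Property 2 can be verified numerically: by the modal decomposition (6), the conditional distribution satisfies $P_{Y|X}(y|x) = P_Y(y)\bigl(1 + \sum_i \sigma_i f_i^*(x) g_i^*(y)\bigr)$, so a model built from the maximal correlation functions reproduces $P_{Y|X}$ exactly. The sketch below checks this on the same toy joint distribution as before; the exact form of the KDM in (4) is not reproduced here, so only this underlying identity is verified.

```python
import numpy as np

# Verify P(y|x) = P(y) * (1 + sum_i sigma_i f_i*(x) g_i*(y)), with f*, g* from the SVD.
Pxy = np.array([[0.10, 0.15],
                [0.20, 0.05],
                [0.05, 0.20],
                [0.15, 0.10]])
Px, Py = Pxy.sum(1), Pxy.sum(0)
B = Pxy / np.sqrt(np.outer(Px, Py))
U, s, Vt = np.linalg.svd(B)
k = len(s)
sigma = s[1:k]
f = U[:, 1:k] / np.sqrt(Px)[:, None]
g = Vt[1:k, :].T / np.sqrt(Py)[:, None]

P_y_given_x = Py[None, :] * (1.0 + (f * sigma) @ g.T)    # shape (|X|, |Y|)
print(np.allclose(P_y_given_x, Pxy / Px[:, None]))       # True: exact recovery of P(Y|X)

# MAP prediction (15) from the reconstructed conditional distribution:
y_map = np.argmax(P_y_given_x, axis=1)
```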
As we will develop in the next section, the maximal correlation kernel also achieves the optimal performance in support vector machine.
IV Support Vector Machine Analysis
In this section, we investigate support vector machine, a representative kernel approach for binary classification. Let denote the training data and corresponding label taken from , with denoting the empirical distribution as defined in (9). Throughout this section, we will focus on the balanced dataset with
(16) |
It can be verified that in this case, the MAP estimation [cf. (15)] can be expressed in terms of maximal correlation function. A proof is provided in Appendix -E.
Property 3
Under assumption (16), we can express the MAP estimation as for all , where is the maximal correlation function of .
IV-A SVM on Given Features
We first consider the SVM algorithm applied on a given feature representation , which can also be regarded as the kernel SVM on kernel .
To begin, for each given feature and , let us define
Then we have the following characterization, a proof of which is provided in Appendix -F.
Theorem 1
For all given feature and , we have
(17) |
where we have defined and , with , and where .
Specifically, when , we have , which can be achieved by
(18) |
and the resulting SVM prediction is
(19)
(20)
From Theorem 1, when , the SVM decision does not depend on the value of . In the remainder, we will focus on the regime where , and drop the  in expressions whenever possible, e.g., we simply denote  by . As we will see shortly, SVM can still achieve the minimum prediction error in this regime by using a good feature mapping (or, equivalently, a good kernel).
From (20), the SVM prediction can be interpreted as a nearest-centroid classifier, where the decision is based on comparing the distances between and the class centroids , . In addition, from
we can interpret the SVM loss as measuring the distance between two class centroids.
Furthermore, when is a one-dimensional feature, we can rewrite (19) as
where . Therefore, the decision rule depends only on the projection of onto the subspace , which is also the most informative feature in the subspace (cf. Proposition 3). Later on, we will see a similar geometric illustration of kernel SVM.
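The nearest-centroid reading of (20) is easy to state in code. The sketch below is only an illustration of that interpretation, and it matches the SVM output only in the regime covered by Theorem 1.

```python
import numpy as np

def nearest_centroid_predict(F_train, y_train, F_test):
    """Nearest-centroid reading of (20): pick the class whose feature centroid is closer."""
    mu_pos = F_train[y_train == +1].mean(axis=0)
    mu_neg = F_train[y_train == -1].mean(axis=0)
    d_pos = np.sum((F_test - mu_pos) ** 2, axis=1)
    d_neg = np.sum((F_test - mu_neg) ** 2, axis=1)
    # Equivalently: sign((mu_pos - mu_neg) @ f(x) - (||mu_pos||^2 - ||mu_neg||^2) / 2),
    # i.e., a linear rule whose weight vector is the difference of the class centroids.
    return np.where(d_pos < d_neg, +1, -1)
```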
Moreover, we can establish a connection between SVM loss and the H-score measure, formalized as the following corollary. A proof is provided in Appendix -G.
Corollary 1
Suppose , then we have
where and denote the maximum and minimum positive eigenvalues of the covariance matrix , respectively. Specifically, if , then we have
As a result, for each normalized feature with covariance matrix , the SVM loss measures the informativeness of in inferring the label .
IV-B Kernel SVM
In practice, instead of applying SVM to a given or manually designed feature , it is more common to implement SVM directly with a kernel . Similar to Theorem 1, we have the following characterization, from which we can interpret the KDM as a probabilistic output of kernel SVM.
Theorem 2
Proof:
Let and denote the inner product space and mapping associated with kernel (cf. Fact 1), and let . Then, we have
(21) |
which can be rewritten as
where to obtain the second equality we have used the modal decomposition of (cf. Fact 2).
Hence, from Theorem 1 we obtain
It remains only to establish the equivalence between and KDM decision . To this end, note that from (4) and the balanced dataset assumption (16), we have
for all .
Hence, for all ,
which completes the proof. ∎
From Theorem 2, the final decision depends on only through the centered kernel . Moreover, comparing Theorem 2 with Property 3, the kernel SVM prediction differs from the MAP prediction only in applying the operator on . In particular, when the maximal correlation function is an eigenfunction of the corresponding operator , i.e., for some , the SVM prediction coincides with the MAP prediction, i.e., for all .
If we restrict our attention to projection kernels, the kernel SVM decision can further be interpreted as a projection operation onto the associated subspace. To see this, let denote a feature subspace of spanned by zero-mean features; then, from Theorem 1 and Proposition 3, the kernel SVM loss for is
which measures the principal angle between and . In addition, the decision rule can be expressed as
(22) |
From Proposition 3, is also the most informative feature in . Therefore, kernel SVM on is equivalent to first extracting the most informative feature in , and then using the extracted feature to make the decision.
IV-C Relationship to Other Classification Approaches
IV-C1 Maximum a Posteriori (MAP) Estimation
IV-C2 Logistic Regression and Neural Networks
We have interpreted SVM as extracting the most informative feature, where the informativeness is measured by the H-score. The analysis in [10] has shown that logistic regression is also equivalent to maximizing the H-score when and are weakly dependent. Indeed, we can show that SVM and logistic regression lead to the same prediction in the weak dependence regime, which we formalize as follows. A proof is provided in Appendix -H.
Proposition 4
Suppose for some . For SVM and logistic regression applied on feature with covariance , the optimal parameters satisfy
where is the hyperparameter in SVM. In addition, we have for sufficiently small.
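A finite-sample illustration of Proposition 4 (not a proof): generate features that are only weakly dependent on the label, fit a linear SVM and a logistic regression, and compare their predictions. The data-generating model and the value of `eps` below are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, k, eps = 2000, 3, 0.05                       # eps controls the (weak) label dependence
y = rng.choice([-1, 1], size=n)                 # balanced labels
F = rng.normal(size=(n, k)) + eps * y[:, None]  # features only weakly dependent on y

svm = LinearSVC(C=1.0, max_iter=10000).fit(F, y)
lr = LogisticRegression().fit(F, y)

# In this weak-dependence regime the two decision rules (nearly) coincide.
print("matching predictions:", np.mean(svm.predict(F) == lr.predict(F)))
```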
Remark 2
Since H-score can also be directly maximized by implementing the maximal correlation regression [18], a similar connection holds for SVM and maximal correlation regression.
V Fisher Kernel
Given a family of distributions supported on and parameterized by , suppose the score function exists. Then, the Fisher kernel is defined as the projection kernel associated with the score function , i.e., .
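As a minimal example, consider a scalar Gaussian location family $\mathcal{N}(\theta, 1)$, for which the score function is $s(x) = x - \theta$ and the Fisher information equals $1$. The sketch below forms the corresponding Fisher kernel as the (rank-one) projection kernel of the score feature; the normalization by the Fisher information follows the common convention of [16], and the evaluation points are illustrative.

```python
import numpy as np

# Fisher kernel for a scalar Gaussian location family p(x; theta) = N(theta, 1).
# Score function: s(x) = d/dtheta log p(x; theta) = x - theta; Fisher information I = 1.
theta0 = 0.5

def score(x, theta=theta0):
    return x - theta

def fisher_kernel(x, xp, fisher_info=1.0):
    """k(x, x') = s(x) I^{-1} s(x'): the projection kernel of the score feature."""
    return score(x) * (1.0 / fisher_info) * score(xp)

xs = np.array([-1.0, 0.0, 1.0, 2.0])
K = fisher_kernel(xs[:, None], xs[None, :])
print(np.linalg.eigvalsh(K))                    # rank one and PSD: a single score feature
```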
Specifically, we consider classification tasks where the joint distribution between the data variable and the label is a mixture of the parameterized forms. Suppose for each class , the data variable is generated from
(23) |
for some . Then we have the following result, a proof of which is provided in Appendix -I.
Theorem 3
Suppose for all , and let . Then for the joint distribution generated according to (23), we have
(24)
(25)
where denotes the centered , and where is the maximal correlation kernel defined on . In addition, the H-score of satisfies
(26) |
where denotes the mutual information between and .
VI Conclusion
In this paper, we study kernel methods from the perspective of feature subspaces and demonstrate a connection between kernel methods and informative feature extraction problems. Using SVM as an example, we illustrate the relationship between kernel methods and neural networks. These theoretical results can help guide practical kernel design and integrate kernel methods with feature-based learning approaches.
Acknowledgments
This work was supported in part by the National Science Foundation (NSF) under Award CNS-2002908 and the Office of Naval Research (ONR) under grant N00014-19-1-2621.
-A Proof of Property 1
Suppose is a -dimensional feature subspace. Let be an orthonormal basis of , i.e., we have , and define . Then we have and .
Therefore,
which implies that .
From the orthogonality principle, it suffices to prove that for all and . To this end, suppose for some . Then, we have
which completes the proof. ∎
-B Proof of Proposition 2
First, note that for each ,
Therefore, we have
which completes the proof. ∎
-C Proof of Proposition 3
We start with the first equality. Suppose has modal decomposition (cf. Proposition 1)
where satisfies and .
Then, we obtain
and the as defined in (13) can be expressed as
where the last equality follows from the fact that .
Note that since , we have
(27) |
In addition, let denote the operator associated with , then from Property 1 we have . In addition, from the orthogonality principle, we have and thus
(28) |
Hence, the first equality of (14) can be obtained from
where the first equality follows from Proposition 2, where the second equality follows from (27), and where the last equality follows from (28).
To obtain the second and third equalities of (14), it suffices to note that for all , we have
where the equalities follow from the orthogonality principle, and where the inequality follows from the definition of projection [cf. (1)]. In addition, it can be verified that the inequality holds with equality when .
Hence, for all , we have
which completes the proof. ∎
-D Proof of Property 2
It suffices to prove that . To this end, suppose satisfies the modal decomposition (6), and let , . Then, it can be verified that for all , which implies that .
Since , we have , for all .
Hence, for all , we have
which completes the proof. ∎
-E Proof of Property 3
Our proof will make use of the following fact.
Fact 2
If and , the modal decomposition of (cf. Proposition 1) can be written as
where is the maximal correlation coefficient, and is the maximal correlation function with .
-F Proof of Theorem 1
Our proof will make use of the following simple fact.
Fact 3
Given a random variable taking values from , let denote the minimum entry in , then we have
where .
Let , then we have
Therefore, from the upper bound in Fact 3, we have
(30) | |||
(31) |
where to obtain the last inequality, we have used the fact that
since for each , we have
Therefore
Finally, if , it can be readily verified that . As a result, the optimal solution is given by
and we have
Therefore, the SVM prediction is given by
-G Proof of Corollary 1
From Theorem 1, when , we have
Therefore, it suffices to prove that
(32) |
To this end, note that we have
where the last equality follows from the fact that for zero-mean and uniformly distributed on , we have for .
-H Proof of Proposition 4
First, note that logistic regression can be regarded as a special case of softmax regression with the correspondences , where and are the weights and bias for softmax regression, respectively.
In addition, from [10, Theorem 2], the centered weight and are
Therefore, we obtain
and
which implies that
-I Proof of Theorem 3
To begin, note that we have
which implies that
Therefore, we obtain
where we have used the fact that since for all .
Hence, we have
(34) |
References
- [1] D. Storcheus, A. Rostamizadeh, and S. Kumar, “A survey of modern questions and challenges in feature extraction,” in Feature Extraction: Modern Questions and Challenges. PMLR, 2015, pp. 1–18.
- [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müllers, “Fisher discriminant analysis with kernels,” in Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No. 98TH8468). IEEE, 1999, pp. 41–48.
- [4] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of Machine Learning Research, vol. 3, no. Jul, pp. 1–48, 2002.
- [5] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
- [6] B. Schölkopf, “The kernel trick for distances,” Advances in Neural Information Processing Systems, vol. 13, 2000.
- [7] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
- [8] J. Platt et al., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61–74, 1999.
- [9] J. Xu, R. Jenssen, A. Paiva, and I. Park, “A reproducing kernel Hilbert space framework for ITL,” in Information Theoretic Learning. Springer, 2010, pp. 351–384.
- [10] X. Xu, S.-L. Huang, L. Zheng, and G. W. Wornell, “An information theoretic interpretation to deep neural networks,” Entropy, vol. 24, no. 1, p. 135, 2022.
- [11] S.-L. Huang, A. Makur, G. W. Wornell, and L. Zheng, “On universal features for high-dimensional learning and inference,” arXiv preprint arXiv:1911.09105, 2019.
- [12] H. O. Hirschfeld, “A connection between correlation and contingency,” in Proceedings of the Cambridge Philosophical Society, vol. 31, no. 4, 1935, pp. 520–524.
- [13] H. Gebelein, “Das statistische problem der korrelation als variations-und eigenwertproblem und sein zusammenhang mit der ausgleichsrechnung,” ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, vol. 21, no. 6, pp. 364–379, 1941.
- [14] A. Rényi, “On measures of dependence,” Acta Mathematica Academiae Scientiarum Hungarica, vol. 10, no. 3-4, pp. 441–451, 1959.
- [15] X. Xu and L. Zheng, “Multivariate feature extraction,” in 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2022, pp. 1–8.
- [16] T. S. Jaakkola and D. Haussler, “Exploiting generative models in discriminative classifiers,” Advances in Neural Information Processing Systems, pp. 487–493, 1999.
- [17] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009, vol. 2.
- [18] X. Xu and S.-L. Huang, “Maximal correlation regression,” IEEE Access, vol. 8, pp. 26591–26601, 2020.
- [19] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.