
On the Limitation of Kernel Dependence Maximization for Feature Selection

Keli Liu, Feng Ruan
Department of Statistics and Data Science, Northwestern University
Abstract

A simple and intuitive method for feature selection consists of choosing the feature subset that maximizes a nonparametric measure of dependence between the response and the features. A popular proposal from the literature uses the Hilbert-Schmidt Independence Criterion (HSIC) as the nonparametric dependence measure. The rationale behind this approach to feature selection is that important features will exhibit a high dependence with the response and their inclusion in the set of selected features will increase the HSIC. Through counterexamples, we demonstrate that this rationale is flawed and that feature selection via HSIC maximization can miss critical features.

1 Introduction

In statistics and machine learning, feature selection aims to isolate a subset of features that fully captures the relationships between the features and response variables. This critical step enhances model interpretability, reduces the computational burden of downstream tasks and improves the model’s generalization capabilities [HTF09].

Nonparametric dependence measures are key tools for feature selection. Specifically, a nonparametric dependence measure between a response variable and a feature subset quantifies the strength of dependence without assuming any parametric model for their relationship. Common examples include mutual information [Cov99], distance covariance [SR09], and kernel-based dependence measures such as the Hilbert-Schmidt Independence Criterion (HSIC) [GBSS05]. The HSIC, for a wide range of kernels, recognizes all modes of dependence between variables, not just linear dependence, and large values of HSIC correspond to “more dependence” (for the definition and the scope of HSIC we consider, see Section 1.1). The literature has proposed selecting features by maximizing the HSIC [SSG+07, SSG+12]. Given features $X\in\mathbb{R}^{p}$ and a response variable $Y\in\mathbb{R}$, at the population level, this approach selects a feature subset $X_{S_{*}}$, where $S_{*}$ is defined by

S_{*}=\mathop{\rm argmax}_{S:S\subseteq\{1,2,\ldots,p\}}{\rm HSIC}(X_{S};Y).  (1.1)

The measure ${\rm HSIC}(X_{S};Y)$ quantifies the nonparametric dependence between the subset of features $X_{S}$ and the response variable $Y$, with $X_{S}\in\mathbb{R}^{|S|}$ denoting the subvector formed from $X$ by including only the coordinates in the set $S$ (for a formal definition of ${\rm HSIC}(X_{S};Y)$, see Section 1.2). The intuition behind this approach is clear: the subset of features $X_{S_{*}}$, exhibiting maximum dependence with the response $Y$ (as measured by ${\rm HSIC}$), should hopefully include all features necessary for explaining $Y$.

Despite its widespread use [SBB+07, GT18, WDL21], this paper reveals a critical problem in the use of HSIC for feature selection, an issue unaddressed in the existing literature. Through a counterexample, we show that HSIC maximization can fail to select critical variables needed to fully explain the response, even at the population level. In our counterexample, the set of selected variables $S_{*}$, which is defined by maximizing HSIC through equation (1.1), does not guarantee full explanatory power:

\mathbb{E}[Y|X]\neq\mathbb{E}[Y|X_{S_{*}}]\quad\text{and thus}\quad\mathcal{L}(Y|X)\neq\mathcal{L}(Y|X_{S_{*}})  (1.2)

where $\mathbb{E}[\cdot\mid\cdot]$ and $\mathcal{L}(\cdot\mid\cdot)$ denote the conditional expectation and conditional distribution, respectively. The first inequality is in the $\mathcal{L}_{2}(\mathbb{P})$ sense.

Our counterexample shows that selecting features by maximizing HSIC can mistakenly exclude variables essential for fully explaining the response variable $Y$. This result, as far as we are aware, enhances the current understanding of HSIC’s statistical properties in feature selection (see Section 1.3 for a more detailed literature review). In particular, our result helps elucidate the distinction between two approaches in the literature to feature selection using kernel dependence measures. The first approach, termed “dependence maximization”, consists of finding a feature subset $S$ that maximizes the kernel dependence between $X_{S}$ and the response $Y$ [SSG+07]. The second approach, termed “conditional dependence minimization”, consists of finding a feature subset $S$ such that the kernel conditional dependence between the response $Y$ and the feature vector $X$ given $X_{S}$ is minimized [FBJ09]. While the first approach can be computationally cheaper, it has gone unrecognized that it does not guarantee selecting all variables necessary for explaining the response (see Section 1.3 for more discussion of these two approaches).

1.1 Problem Setup: Definition, Notation and Assumption

The Hilbert-Schmidt Independence Criterion (HSIC), introduced in [GBSS05], is a measure quantifying the dependence between two random variables $X$ and $Y$ using positive (semi)definite kernels. A positive semidefinite kernel $k$ on a set $\mathcal{X}$, simply called a kernel, is a symmetric function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ such that the sum $\sum_{i=1}^{n}\sum_{j=1}^{n}c_{i}c_{j}k(x_{i},x_{j})$ is nonnegative for any $n$ distinct points $x_{1},x_{2},\ldots,x_{n}\in\mathcal{X}$ and any real numbers $c_{1},c_{2},\ldots,c_{n}$. A kernel $k$ is called a positive definite kernel when this sum equals $0$ only if all $c_{i}$ are zero.

We are ready to give a formal definition of HSIC, following [GBSS05, SSG+12].

Definition 1.1 (HSIC).

Given a joint probability measure $(X,Y)\sim\mathbb{Q}$ over the product space $\mathcal{X}\times\mathcal{Y}$, along with two positive semidefinite kernels $k_{X}$, $k_{Y}$ on the sets $\mathcal{X}$ and $\mathcal{Y}$ respectively, the Hilbert-Schmidt Independence Criterion (HSIC) is defined as

\begin{split}{\rm HSIC}(X,Y)&=\mathbb{E}[k_{X}(X,X^{\prime})\cdot k_{Y}(Y,Y^{\prime})]+\mathbb{E}[k_{X}(X,X^{\prime})]\cdot\mathbb{E}[k_{Y}(Y,Y^{\prime})]\\ &\quad-2\,\mathbb{E}\left[\mathbb{E}[k_{X}(X,X^{\prime})|X^{\prime}]\cdot\mathbb{E}[k_{Y}(Y,Y^{\prime})|Y^{\prime}]\right].\end{split}  (1.3)

In the above, all expectations are taken over independent copies $(X,Y),(X^{\prime},Y^{\prime})\sim\mathbb{Q}$.
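Although our analysis is at the population level, it may help the reader to see how (1.3) translates into a finite-sample quantity. The following Python sketch (purely illustrative, not part of our analysis) simply replaces every expectation in (1.3) with an empirical average over sample pairs, assuming Gaussian kernels for both $k_{X}$ and $k_{Y}$; the function and variable names are our own.

    import numpy as np

    def gaussian_kernel(A, B):
        """k(a, b) = exp(-||a - b||_2^2), evaluated for all pairs of rows of A and B."""
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2)

    def hsic_plugin(X, Y):
        """Plug-in estimate of (1.3): each expectation becomes an empirical average."""
        K = gaussian_kernel(X, X)                                  # k_X(X_i, X_j)
        L = gaussian_kernel(Y.reshape(-1, 1), Y.reshape(-1, 1))    # k_Y(Y_i, Y_j)
        term1 = (K * L).mean()                                     # E[k_X k_Y]
        term2 = K.mean() * L.mean()                                # E[k_X] E[k_Y]
        term3 = (K.mean(axis=0) * L.mean(axis=0)).mean()           # E[E[k_X|X'] E[k_Y|Y']]
        return term1 + term2 - 2 * term3

    # toy usage: Y depends on X_1 only
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    Y = np.sign(X[:, 0]) + 0.1 * rng.normal(size=500)
    print(hsic_plugin(X, Y))   # positive; would be near zero if Y were independent of X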

The HSIC is always non-negative (${\rm HSIC}(X;Y)\geq 0$) for every probability distribution $\mathbb{Q}$ and all positive semidefinite kernels $k_{X}$ and $k_{Y}$ [GBSS05, SSG+12]. Clearly, ${\rm HSIC}(X;Y)=0$ whenever $X$ and $Y$ are independent. Additionally, when the kernels $k_{X}$ and $k_{Y}$ in Definition 1.1 are characteristic kernels in the sense of [FGSS07, Section 2.2] or [SFL11, Definition 6], then ${\rm HSIC}(X;Y)=0$ if and only if $X$ and $Y$ are independent (see [GBSS05, Theorem 4] and [SFL11, Section 2.3] for the statement). Characteristic kernels enable HSIC to capture all modes of dependence between $X$ and $Y$. Non-characteristic kernels can also be plugged into the definition of HSIC, equation (1.3), but they may fail to detect some forms of dependence between $X$ and $Y$, i.e., it is possible that ${\rm HSIC}(X;Y)=0$ even though $X$ and $Y$ are not independent.

Our counterexamples apply to a wide range of HSIC measures obtained through various choices of the kernels $k_{X}$ and $k_{Y}$. Two requirements on the kernels $k_{X}$ and $k_{Y}$ are needed throughout the paper. The first requires that $k_{X}$ be a radially symmetric positive definite kernel.

Assumption 1.

The kernel $k_{X}$ obeys, for every $x,x^{\prime}\in\mathbb{R}^{p}$:

k_{X}(x,x^{\prime})=\phi_{X}(\left\|{x-x^{\prime}}\right\|_{2}^{2}),

where, for some finite nonnegative measure $\mu$ that is not concentrated at $0$, the following holds for every $z\geq 0$:

\phi_{X}(z)=\int_{0}^{\infty}e^{-tz}\mu(dt).

The integral representation of $\phi_{X}$ in Assumption 1 is motivated by Schoenberg’s fundamental work [Sch38]. Given a real-valued function $\phi_{X}$, for the mapping $(x,x^{\prime})\mapsto\phi_{X}(\left\|{x-x^{\prime}}\right\|_{2}^{2})$ to qualify as a positive definite kernel on $\mathbb{R}^{m}$ for every dimension $m\in\mathbb{N}$, the representation $\phi_{X}(z)=\int_{0}^{\infty}e^{-tz}\mu(dt)$ must hold for some nonnegative finite measure $\mu$ not concentrated at zero [Sch38]. Two concrete examples satisfying Assumption 1 are the Gaussian kernel, where $\phi_{X}(z)=\exp(-z)$, and the Laplace kernel, where $\phi_{X}(z)=\exp(-\sqrt{z})$.

Assumption 1 ensures that the mapping $(x,x^{\prime})\mapsto\phi_{X}(\left\|{x-x^{\prime}}\right\|_{2}^{2})$ is a kernel on $\mathbb{R}^{m}$ for every dimension $m\in\mathbb{N}$. This property is useful when working with ${\rm HSIC}(X_{S};Y)$ because the dimension of the feature vector $X_{S}$ varies with the size of $S$ and need not equal $p$. As a side remark, the kernel $(x,x^{\prime})\mapsto\phi_{X}(\left\|{x-x^{\prime}}\right\|_{2}^{2})$ is a characteristic kernel on $\mathbb{R}^{m}$ for every dimension $m\in\mathbb{N}$ under Assumption 1 [SFL11, Proposition 5].
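As a small numerical illustration of Assumption 1 (purely illustrative, not part of our analysis), the sketch below forms Gaussian and Laplace kernel matrices from arbitrary points in $\mathbb{R}^{m}$ and checks that their smallest eigenvalues are nonnegative up to floating-point error; the dimension, sample size, and point distribution are arbitrary choices.

    import numpy as np

    def radial_kernel_matrix(X, phi):
        """Kernel matrix K_ij = phi(||x_i - x_j||_2^2) for the rows x_i of X."""
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return phi(sq)

    gaussian = lambda z: np.exp(-z)             # phi_X(z) = exp(-z)
    laplace = lambda z: np.exp(-np.sqrt(z))     # phi_X(z) = exp(-sqrt(z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))               # 200 arbitrary points in R^5
    for name, phi in [("Gaussian", gaussian), ("Laplace", laplace)]:
        K = radial_kernel_matrix(X, phi)
        print(name, "smallest eigenvalue:", np.linalg.eigvalsh(K).min())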

Our second requirement is that the kernel $k_{Y}$ is either a function of $|y-y^{\prime}|$ or a function of $yy^{\prime}$.

Assumption 2.

The kernel $k_{Y}$ satisfies one of the following conditions:

  • $k_{Y}(y,y^{\prime})=\phi_{Y}(|y-y^{\prime}|)$ with $\phi_{Y}(0)>\phi_{Y}(1)$, where $\phi_{Y}$ is a real-valued function.

  • $k_{Y}(y,y^{\prime})=\phi_{Y}(yy^{\prime})$ with $\phi_{Y}(1)>\phi_{Y}(-1)$, where $\phi_{Y}$ is a real-valued function.

For $k_{Y}(y,y^{\prime})=\phi_{Y}(|y-y^{\prime}|)$ to be a positive semidefinite kernel, $\phi_{Y}(0)\geq\phi_{Y}(1)$ must hold, since $\phi_{Y}(0)-\phi_{Y}(1)=\frac{1}{2}\sum_{i,j=1}^{2}c_{i}c_{j}k_{Y}(y_{i},y_{j})\geq 0$ for $c_{1}=1,c_{2}=-1,y_{1}=1,y_{2}=0$. Likewise, for $k_{Y}(y,y^{\prime})=\phi_{Y}(yy^{\prime})$ to be a positive semidefinite kernel, $\phi_{Y}(1)\geq\phi_{Y}(-1)$ must hold. Thus the inequalities in Assumption 2 are only slightly stricter than the requirement that $k_{Y}$ be a positive semidefinite kernel. Assumption 2 is very mild: it covers characteristic kernels such as the Gaussian kernel $k_{Y}(y,y^{\prime})=\exp(-(y-y^{\prime})^{2})$ and the Laplace kernel $k_{Y}(y,y^{\prime})=\exp(-|y-y^{\prime}|)$, as well as non-characteristic kernels such as the inner product kernel $k_{Y}(y,y^{\prime})=yy^{\prime}$.

Our counterexamples apply to any kernels $k_{X},k_{Y}$ obeying Assumptions 1 and 2. Thus, they apply to a wide range of HSIC measures, including those utilizing characteristic kernels $k_{X}$ and $k_{Y}$ to capture all types of dependence (say, when $k_{X}$ and $k_{Y}$ are both Gaussian kernels). In these cases, ${\rm HSIC}(X;Y)=0$ if and only if $X$ and $Y$ are independent.

1.2 Main Results

Throughout the remainder of the paper, we operate with two kernels $k_{X},k_{Y}$ on $\mathbb{R}^{p}$ and $\mathbb{R}$ that meet Assumptions 1 and 2, respectively. Our main results (counterexamples) for HSIC are applicable to any such kernels.

Given the covariates $X\in\mathbb{R}^{p}$ and the response $Y\in\mathbb{R}$, with $(X,Y)\sim\mathbb{P}$, in this paper we study the population-level statistical properties of the solution set $S_{*}$ of the following HSIC maximization procedure:

S_{*}=\mathop{\rm argmax}_{S:S\subseteq\{1,2,\ldots,p\}}{\rm HSIC}(X_{S};Y),  (1.4)

where ${\rm HSIC}(X_{S};Y)$ is defined through the kernels $k_{X;|S|}$ and $k_{Y}$. Here the kernel $k_{X;|S|}$ on $\mathbb{R}^{|S|}$ is induced by the original kernel $k_{X}$ on $\mathbb{R}^{p}$ in the following way (recall $\phi_{X}$ in the definition of $k_{X}$):

k_{X;|S|}(x,x^{\prime})=\phi_{X}(\left\|{x-x^{\prime}}\right\|_{2}^{2})\quad\text{for }x,x^{\prime}\in\mathbb{R}^{|S|}.

For transparency, following Definition 1.1 we can write down the definition:

\begin{split}{\rm HSIC}(X_{S},Y)&=\mathbb{E}[k_{X;|S|}(X_{S},X_{S}^{\prime})\cdot k_{Y}(Y,Y^{\prime})]+\mathbb{E}[k_{X;|S|}(X_{S},X_{S}^{\prime})]\cdot\mathbb{E}[k_{Y}(Y,Y^{\prime})]\\ &\quad-2\,\mathbb{E}\left[\mathbb{E}[k_{X;|S|}(X_{S},X_{S}^{\prime})|X^{\prime}]\cdot\mathbb{E}[k_{Y}(Y,Y^{\prime})|Y^{\prime}]\right],\end{split}  (1.5)

where $(X,Y)$, $(X^{\prime},Y^{\prime})$ denote independent draws from $\mathbb{P}$. This definition of ${\rm HSIC}(X_{S},Y)$ is natural and aligns with its proposed use for feature selection, e.g., [SBB+07].

Our main result shows that the feature subset $X_{S_{*}}$, chosen by maximizing ${\rm HSIC}$ as in equation (1.4), can fail to capture the explanatory relationship between $X$ and $Y$.

Theorem 1.1.

Assume the dimension $p\geq 2$. Given any kernel $k_{X}$ on $\mathbb{R}^{p}$ obeying Assumption 1 and any kernel $k_{Y}$ on $\mathbb{R}$ obeying Assumption 2, there exists a probability distribution $\mathbb{P}$ of $(X,Y)$ supported on $\mathbb{R}^{p}\times\mathbb{R}$ such that any maximizer $S_{*}$ defined via equation (1.4) fails to include all relevant features in the following sense:

\mathbb{E}[Y|X]\neq\mathbb{E}[Y|X_{S_{*}}]\quad\text{and thus}\quad\mathcal{L}(Y|X)\neq\mathcal{L}(Y|X_{S_{*}}).

In the above, the first inequality is in the $\mathcal{L}_{2}(\mathbb{P})$ sense.

Theorem 1.1 cautions against relying on HSIC maximization for feature selection. The reader might wonder whether it is the combinatorial structure in defining the maximizer $S_{*}$ that is responsible for the failure observed in Theorem 1.1. To clarify the nature of the problem, we explore a continuous version of the combinatorial objective in equation (1.4). Below, we use $\odot$ to denote coordinate-wise multiplication between vectors; for instance, given a vector $\beta\in\mathbb{R}^{p}$ and covariates $X\in\mathbb{R}^{p}$, the product $\beta\odot X$ yields a new vector in $\mathbb{R}^{p}$ with $i$th coordinate $(\beta\odot X)_{i}=\beta_{i}X_{i}$.

Now we consider HSIC maximization through a continuous optimization:

S_{*,{\rm cont}}=\mathop{\rm supp}(\beta_{*})\quad\text{where}\quad\beta_{*}=\mathop{\rm argmax}_{\beta:\left\|{\beta}\right\|_{\ell_{q}}\leq r}{\rm HSIC}(\beta\odot X;Y),  (1.6)

where ${\rm HSIC}(\beta\odot X,Y)$ measures the dependence between the weighted features $\beta\odot X\in\mathbb{R}^{p}$ and the response $Y\in\mathbb{R}$, utilizing the kernels $k_{X}$ and $k_{Y}$ on $\mathbb{R}^{p}$ and $\mathbb{R}$, respectively:

\begin{split}{\rm HSIC}(\beta\odot X,Y)&=\mathbb{E}[k_{X}(\beta\odot X,\beta\odot X^{\prime})\cdot k_{Y}(Y,Y^{\prime})]+\mathbb{E}[k_{X}(\beta\odot X,\beta\odot X^{\prime})]\cdot\mathbb{E}[k_{Y}(Y,Y^{\prime})]\\ &\quad-2\,\mathbb{E}\left[\mathbb{E}[k_{X}(\beta\odot X,\beta\odot X^{\prime})|X^{\prime}]\cdot\mathbb{E}[k_{Y}(Y,Y^{\prime})|Y^{\prime}]\right].\end{split}  (1.7)

In definition (1.6), we first find the weight vector $\beta_{*}\in\mathbb{R}^{p}$ that maximizes the HSIC dependence measure between the response $Y$ and the weighted covariates $\beta\odot X$, and then use the support set of $\beta_{*}$ (the set of indices corresponding to the non-zero coordinates of $\beta_{*}$) to define the feature set $S_{*,{\rm cont}}$. Note that imposing the norm constraint $\left\|{\beta}\right\|_{\ell_{q}}\leq r$ on the weights $\beta$, where $\left\|{\cdot}\right\|_{\ell_{q}}$ denotes the $\ell_{q}$ norm for $q\in[1,\infty]$ and $r<\infty$, ensures the existence of a well-defined maximizer $\beta_{*}$.
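To make the continuous formulation concrete, the short Python sketch below (purely illustrative, not part of our analysis and not a proposed algorithm) evaluates a plug-in estimate of (1.7) by rescaling the columns of the sample matrix by $\beta$, and then runs a crude random search over the constraint set for the case $q=\infty$; the Gaussian kernels and all function names are our own illustrative choices.

    import numpy as np

    def hsic_weighted(beta, X, Y):
        """Plug-in estimate of (1.7): HSIC(beta ⊙ X; Y) with Gaussian k_X and k_Y."""
        Xb = X * beta                               # coordinate-wise reweighting of features
        K = np.exp(-np.sum((Xb[:, None, :] - Xb[None, :, :]) ** 2, axis=-1))
        L = np.exp(-(Y[:, None] - Y[None, :]) ** 2)
        return (K * L).mean() + K.mean() * L.mean() \
            - 2 * (K.mean(axis=0) * L.mean(axis=0)).mean()

    def maximize_hsic_box(X, Y, r=1.0, n_draws=500, seed=0):
        """Crude random search for (1.6) with q = infinity; since k_X is radial,
        only |beta_i| matters, so it suffices to search the box [0, r]^p."""
        rng = np.random.default_rng(seed)
        cands = rng.uniform(0.0, r, size=(n_draws, X.shape[1]))
        vals = [hsic_weighted(b, X, Y) for b in cands]
        return cands[int(np.argmax(vals))]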

Theorem 1.2.

Assume the dimension $p\geq 2$. Given any kernel $k_{X}$ on $\mathbb{R}^{p}$ obeying Assumption 1, any kernel $k_{Y}$ on $\mathbb{R}$ obeying Assumption 2, and any $\ell_{q}$ norm $\left\|{\cdot}\right\|_{\ell_{q}}$ with $q\in[1,\infty]$ and radius $r<\infty$, there exists a probability distribution $\mathbb{P}$ of $(X,Y)$ supported on $\mathbb{R}^{p}\times\mathbb{R}$ such that any maximizer $\beta_{*}$ and the corresponding set $S_{*,{\rm cont}}$ defined via equation (1.6) fail to include all relevant features:

\mathbb{E}[Y|X]\neq\mathbb{E}[Y|X_{S_{*,{\rm cont}}}]\quad\text{and thus}\quad\mathcal{L}(Y|X)\neq\mathcal{L}(Y|X_{S_{*,{\rm cont}}}).

In the above, the first inequality is in the $\mathcal{L}_{2}(\mathbb{P})$ sense.

Theorem 1.1 and Theorem 1.2 show that care must be taken when using HSIC for feature selection. Notably, the requirement that the dimension $p\geq 2$ is essential for Theorem 1.1 and Theorem 1.2 to hold. This necessity follows from a positive result on HSIC maximization when using characteristic kernels $k_{X},k_{Y}$: the output feature set $X_{S_{*}}$ (or $X_{S_{*,{\rm cont}}}$) will be dependent on $Y$ unless $X$ and $Y$ are independent.

Proposition 1.

Suppose both kernels $k_{X}$ and $k_{Y}$ are characteristic on $\mathbb{R}^{p}$ and $\mathbb{R}$, respectively, in the sense of [SFL11, Definition 6]. Applying the discrete HSIC maximization approach (1.4) and the continuous HSIC maximization approach (1.6) (with $r>0$) will result in nonempty feature sets $S_{*}$ and $S_{*,{\rm cont}}$, with $X_{S_{*}}$ and $X_{S_{*,{\rm cont}}}$ respectively dependent on $Y$, unless $X$ and $Y$ are independent.

Proof    Suppose $X$ and $Y$ are not independent. Then ${\rm HSIC}(X;Y)>0$ because both kernels $k_{X}$ and $k_{Y}$ are characteristic. This follows from basic properties of ${\rm HSIC}$; see, e.g., [GBSS05, Theorem 4] and [SFL11, Section 2.3].

If we use the discrete maximization approach (1.4), then ${\rm HSIC}(X_{S_{*}};Y)\geq{\rm HSIC}(X;Y)>0$. This implies that $X_{S_{*}}$ and $Y$ are dependent because, by the definition of ${\rm HSIC}$, ${\rm HSIC}(X_{S_{*}};Y)=0$ if $X_{S_{*}}$ and $Y$ are independent. Similarly, if we use the continuous HSIC maximization approach (1.6), then ${\rm HSIC}(\beta_{*}\odot X,Y)>0$, which implies that $X_{S_{*,{\rm cont}}}$ and $Y$ cannot be independent. ∎

To conclude, while intuitively appealing, using HSIC maximization for feature selection may inadvertently exclude variables necessary for fully understanding the response $Y$.

1.3 Connections to Prior Literature

In the literature, there are two primary approaches to employing kernel-based nonparametric (conditional) dependence measures between variables for feature selection.

The first approach, directly relevant to our work, maximizes the dependence measure between the response variable $Y$ and the feature subset $X_{S}$. In this paper, we focus on the Hilbert-Schmidt Independence Criterion (HSIC), a popular kernel-based nonparametric dependence measure [GBSS05]. Unlike many other dependence measures, HSIC admits easy estimation from finite samples with a parametric $1/\sqrt{n}$ convergence rate, where $n$ is the sample size [GBSS05]. Since its introduction, HSIC has become a popular tool for feature selection. The HSIC maximization method, formulated as equation (1.4), was developed by [SSG+07] and has been applied in diverse fields such as bioinformatics [SBB+07, GT18]. Subsequent works [MDF10, SSG+12] proposed continuous relaxations of this formulation to facilitate computation. Our main findings, Theorem 1.1 and Theorem 1.2, provide concrete counterexamples advising caution when selecting features via HSIC maximization, as one may miss features needed for fully understanding the response $Y$.

The second approach minimizes the kernel conditional dependence between $Y$ and $X$ given the feature subset $X_{S}$ [FGSS07, FBJ09, CSWJ17, CLLR23]. Remarkably, this approach ensures that the selected feature subset $X_{S_{*}}$ contains all features necessary to fully explain the response $Y$, as guaranteed by the conditional independence $Y\perp X\mid X_{S_{*}}$. This theoretical guarantee is achieved in a nonparametric setup, without assuming specific parametric models for the data $(X,Y)$ [FBJ09]. However, computing the kernel conditional dependence measures required by this approach is more computationally demanding than computing the HSIC used in the maximization approach, due to the need for kernel matrix inversions [FGSS07].

Notably, our main findings elucidate a distinction between these two kernel-based feature selection strategies. The maximization approach, which we study in this paper, does not enjoy the same conditional independence guarantee as the minimization approach. Our counterexamples in Theorem 1.1 and Theorem 1.2 demonstrate that the HSIC maximization approach can fail to select critical variables needed to fully explain the response $Y$, highlighting a tradeoff between computational effort and the ability to recover all important variables.

To clarify, our research focuses on maximizing HSIC for feature selection as described in equation (1.4). It is important to note that there are alternative ways of employing HSIC in feature selection, such as HSIC-Lasso [YJS+14], which selects features through a least squares regression that applies $\ell_{1}$ penalties to the feature weights. HSIC-Lasso, though sharing a commonality in nomenclature with HSIC maximization, is fundamentally a regression method and differs from the dependence maximization strategy we explore here in equation (1.4). Our counterexamples do not extend to HSIC-Lasso.

Finally, consider the recent developments in feature selection for binary labels $Y\in\{\pm 1\}$, where a feature subset $S$ is selected by maximizing the Maximum Mean Discrepancy (MMD) between the conditional distributions $\mathcal{L}(X_{S}\mid Y=1)$ and $\mathcal{L}(X_{S}\mid Y=-1)$ over $S$ [WDX23, MKB+23]. Given the established connections between MMD and HSIC in the literature, e.g., [SSG+07, Theorem 3], our counterexamples in Theorem 1.1 and Theorem 1.2, developed under conditions with a binary response $Y$ (see the proofs in Section 2), have further implications for these MMD-based feature selection methods. We intend to report these implications and discuss methodological developments in a follow-up paper.

2 Proofs

In this section, we prove Theorem 1.1 and Theorem 1.2. Throughout, for a vector $X\in\mathbb{R}^{p}$, we denote the coordinates of $X$ under the standard Euclidean basis by $X_{1},X_{2},\ldots,X_{p}$. This notation also applies to other vectors in the proof.

The core of our proof focuses on the critical case where the dimension $p=2$ (Section 2.1). We subsequently extend these results to any dimension $p\geq 2$, which is straightforward (Section 2.2).

2.1 The case $p=2$

We establish Theorem 1.1 and Theorem 1.2 for the case $p=2$.

For our counterexample, we construct a family of distributions $\mathbb{P}_{\Delta}$ parameterized by a vector $\Delta$. Our proof proceeds by showing that there exists a vector $\Delta$ such that, if the data $(X,Y)$ follow the distribution $\mathbb{P}_{\Delta}$, then the conclusions of Theorem 1.1 and Theorem 1.2 hold.

2.1.1 Definition of $\mathbb{P}_{\Delta}$

The distribution $\mathbb{P}_{\Delta}$ is defined on the product space $\{\pm 1\}^{2}\times\{\pm 1\}$, with the covariates $X\in\{\pm 1\}^{2}$ and the response $Y\in\{\pm 1\}$. The parameter $\Delta=(\Delta_{1},\Delta_{2})\in[0,1]^{2}$.

Below we specify the details of the construction.

  • The response $Y\in\{\pm 1\}$ is balanced: $\mathbb{P}_{\Delta}(Y=+1)=\mathbb{P}_{\Delta}(Y=-1)=1/2$.

  • The covariate $X\in\{\pm 1\}^{2}$, with components $X=(X_{1},X_{2})$, obeys a conditional independence given $Y$: $X_{1}\perp X_{2}\mid Y$. Moreover, the conditional distribution of $X_{j}$ ($j=1,2$) given $Y$ obeys:

    \mathbb{P}_{\Delta}(X_{1}=\pm 1\mid Y)=\frac{1}{2}(1\pm\Delta_{1}Y),\qquad\mathbb{P}_{\Delta}(X_{2}=\pm 1\mid Y)=\frac{1}{2}(1\pm\Delta_{2}Y).

This construction determines the joint distribution of $(X_{1},X_{2},Y)$ as follows:

\mathbb{P}_{\Delta}(X_{1}=x_{1},X_{2}=x_{2},Y=y)=\frac{1}{8}(1+\Delta_{1}x_{1}y)(1+\Delta_{2}x_{2}y)\quad\text{for }x_{1},x_{2},y\in\{\pm 1\}.

Intuitively, $\Delta_{1}$ and $\Delta_{2}$ measure how much $Y$ depends on $X_{1}$ and $X_{2}$, respectively. For instance, $\Delta_{1}=0$ indicates that $X_{1}$ and $Y$ are independent, while $\Delta_{1}=1$ means $X_{1}$ perfectly predicts $Y$. Thus, when both $\Delta_{1},\Delta_{2}$ lie within $(0,1)$, this intuition suggests that both features $X_{1}$ and $X_{2}$ are needed in order to fully explain the response $Y$. Lemma 2.1 formally establishes this.
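For readers who wish to experiment numerically, the following Python sketch (purely illustrative, not part of the proof) draws i.i.d. samples from $\mathbb{P}_{\Delta}$ by first sampling $Y$ uniformly from $\{\pm 1\}$ and then sampling $X_{1},X_{2}$ conditionally independently as specified above; the concrete values $\Delta_{1}=0.9$, $\Delta_{2}=0.1$ are arbitrary illustrative choices.

    import numpy as np

    def sample_p_delta(n, delta1, delta2, seed=0):
        """Draw n i.i.d. samples (X, Y) from P_Delta with X in {±1}^2, Y in {±1}."""
        rng = np.random.default_rng(seed)
        Y = rng.choice([-1, 1], size=n)                      # P(Y=+1) = P(Y=-1) = 1/2
        # P(X_j = +1 | Y) = (1 + Delta_j * Y) / 2, with X_1 and X_2 independent given Y
        X1 = np.where(rng.random(n) < (1 + delta1 * Y) / 2, 1, -1)
        X2 = np.where(rng.random(n) < (1 + delta2 * Y) / 2, 1, -1)
        return np.column_stack([X1, X2]), Y

    # quick check: E[Y | X_1 = +1] should be close to Delta_1 (cf. equation (A.2))
    X, Y = sample_p_delta(200_000, 0.9, 0.1)
    print(Y[X[:, 0] == 1].mean())   # approximately 0.9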

We use $\mathbb{E}_{\Delta}$ to denote the expectation under $\mathbb{P}_{\Delta}$.

Lemma 2.1.

Suppose $\Delta_{1}\in(0,1)$, $\Delta_{2}\in(0,1)$. Then the following inequalities hold under $\mathcal{L}_{2}(\mathbb{P}_{\Delta})$:

\mathbb{E}_{\Delta}[Y|X]\neq\mathbb{E}_{\Delta}[Y|X_{1}],\qquad\mathbb{E}_{\Delta}[Y|X]\neq\mathbb{E}_{\Delta}[Y|X_{2}],\qquad\mathbb{E}_{\Delta}[Y|X]\neq\mathbb{E}_{\Delta}[Y].

The proof of Lemma 2.1 is based on routine calculations, and is deferred to Section A.1.

2.1.2 Evaluation of HSIC

We evaluate ${\rm HSIC}(\beta\odot X;Y)$ when $(X,Y)\sim\mathbb{P}_{\Delta}$ for every $\beta\in\mathbb{R}^{2}$.

Recall $k_{X}(x,x^{\prime})=\phi_{X}(\left\|{x-x^{\prime}}\right\|_{2}^{2})$. For $\beta\in\mathbb{R}^{2}$, we define the function

L_{\Delta}(\beta)=(\phi_{X}(0)-\phi_{X}(4\beta_{1}^{2}+4\beta_{2}^{2}))\cdot(\Delta_{1}^{2}+\Delta_{2}^{2})-(\phi_{X}(4\beta_{1}^{2})-\phi_{X}(4\beta_{2}^{2}))\cdot(\Delta_{1}^{2}-\Delta_{2}^{2}).  (2.1)

The function $\beta\mapsto L_{\Delta}(\beta)$ is proportional to the function $\beta\mapsto{\rm HSIC}(\beta\odot X;Y)$, differing only by a positive multiplicative constant. See Lemma 2.2 for the formal statement.

Lemma 2.2.

Let $(X,Y)\sim\mathbb{P}_{\Delta}$. Then for every $\beta=(\beta_{1},\beta_{2})\in\mathbb{R}^{2}$:

  • If $k_{Y}$ is a function of $yy^{\prime}$, say $k_{Y}(y,y^{\prime})=\phi_{Y}(yy^{\prime})$, then

    {\rm HSIC}(\beta\odot X;Y)=\frac{1}{8}\cdot(\phi_{Y}(1)-\phi_{Y}(-1))\cdot L_{\Delta}(\beta).

  • If $k_{Y}$ is a function of $|y-y^{\prime}|$, say $k_{Y}(y,y^{\prime})=\phi_{Y}(|y-y^{\prime}|)$, then

    {\rm HSIC}(\beta\odot X;Y)=\frac{1}{8}\cdot(\phi_{Y}(0)-\phi_{Y}(1))\cdot L_{\Delta}(\beta).

The proof of Lemma 2.2 follows from direct calculations, although some tricks can be employed to simplify the process. For the details of the proof, see Section A.2.

Lemma 2.2 reveals that the function $\beta\mapsto{\rm HSIC}(\beta\odot X;Y)$ equals $\beta\mapsto L_{\Delta}(\beta)$ scaled by a constant $a$, with $a>0$ guaranteed by Assumption 2. Therefore, maximizing $\beta\mapsto{\rm HSIC}(\beta\odot X;Y)$ corresponds directly to maximizing $\beta\mapsto L_{\Delta}(\beta)$ over any given set of constraints on $\beta$.
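As an optional numerical sanity check of Lemma 2.2 (purely illustrative, not part of the proof), the Python sketch below compares a plug-in HSIC estimate against the closed form $\frac{1}{8}(\phi_{Y}(1)-\phi_{Y}(-1))\cdot L_{\Delta}(\beta)$, assuming the Gaussian kernel $\phi_{X}(z)=\exp(-z)$ for $k_{X}$ and the inner product kernel $k_{Y}(y,y^{\prime})=yy^{\prime}$ (so the prefactor is $2/8=1/4$); the sample size, weight vector, and values of $\Delta$ are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    d1, d2, beta = 0.9, 0.1, np.array([1.0, 0.7])
    phi = lambda z: np.exp(-z)                     # Gaussian: phi_X(z) = exp(-z)

    # draw samples from P_Delta (same construction as in Section 2.1.1)
    n = 2000
    Y = rng.choice([-1, 1], size=n)
    X = np.column_stack([
        np.where(rng.random(n) < (1 + d1 * Y) / 2, 1, -1),
        np.where(rng.random(n) < (1 + d2 * Y) / 2, 1, -1),
    ])

    # plug-in estimate of HSIC(beta ⊙ X; Y) with k_Y(y, y') = y y'
    Xb = X * beta
    K = np.exp(-np.sum((Xb[:, None, :] - Xb[None, :, :]) ** 2, axis=-1))
    L = np.outer(Y, Y)
    hsic_hat = (K * L).mean() + K.mean() * L.mean() \
        - 2 * (K.mean(axis=0) * L.mean(axis=0)).mean()

    # closed form L_Delta(beta) from equation (2.1)
    L_delta = (phi(0) - phi(4 * beta[0]**2 + 4 * beta[1]**2)) * (d1**2 + d2**2) \
        - (phi(4 * beta[0]**2) - phi(4 * beta[1]**2)) * (d1**2 - d2**2)
    print(hsic_hat, L_delta / 4)    # the two numbers should roughly agree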

2.1.3 Properties of $L_{\Delta}$

Below we shall frequently assume $\Delta_{1}>\Delta_{2}>0$. In this setup, both features $X_{1}$ and $X_{2}$ are relevant, with $X_{1}$ called the dominant feature and $X_{2}$ called the weaker feature. We derive a collection of structural properties of $L_{\Delta}$ under this setup.

A quick observation is the symmetry of $L_{\Delta}$.

Lemma 2.3 (Symmetry).

For every $\beta\in\mathbb{R}^{2}$, $L_{\Delta}(\beta_{1},\beta_{2})=L_{\Delta}(|\beta_{1}|,|\beta_{2}|)$.

This symmetry property enables us to focus on how $L_{\Delta}$ behaves on

\mathbb{R}_{+}^{2}=\{(\beta_{1},\beta_{2}):\beta_{1}\geq 0,\beta_{2}\geq 0\}.

Lemma 2.4 then shows that increasing the coefficient $\beta_{1}\in\mathbb{R}_{+}$ on the dominant feature $X_{1}$ strictly increases the function value $L_{\Delta}(\beta)$.

Lemma 2.4 (Monotonicity in Dominant Feature).

Let $\Delta_{1}>\Delta_{2}>0$. Then for every $\beta_{2}\in\mathbb{R}_{+}$, the map

\beta_{1}\mapsto L_{\Delta}(\beta_{1},\beta_{2})

is strictly monotonically increasing for $\beta_{1}\in\mathbb{R}_{+}$.

Proof    The mapping $z\mapsto\phi_{X}(z)$ is strictly monotonically decreasing on $\mathbb{R}_{+}$, as

\phi_{X}(z)=\int_{0}^{\infty}e^{-tz}\mu(dt)

for some nonnegative finite measure $\mu$ that is not concentrated at zero. The strict monotonicity of $\beta_{1}\mapsto L_{\Delta}(\beta_{1},\beta_{2})$ for $\beta_{1}\in\mathbb{R}_{+}$ then follows from the definition of $L_{\Delta}$ (equation (2.1)) in Section 2.1.2. ∎

Lemma 2.5 further demonstrates that, on $\beta\in\mathbb{R}_{+}^{2}$, increasing $L_{\Delta}(\beta)$ can be effectively achieved by allocating higher coefficients to the dominant feature $X_{1}$ rather than the weaker feature $X_{2}$.

Lemma 2.5.

Let $\Delta_{1}>\Delta_{2}>0$. Then for every $\beta=(\beta_{1},\beta_{2})\in\mathbb{R}_{+}^{2}$ with $\beta_{1}>\beta_{2}$:

L_{\Delta}(\beta_{1},\beta_{2})>L_{\Delta}(\beta_{2},\beta_{1}).

Proof    Following the definition of $L_{\Delta}$ (equation (2.1)), we deduce that

L_{\Delta}(\beta_{1},\beta_{2})-L_{\Delta}(\beta_{2},\beta_{1})=-2(\phi_{X}(4\beta_{1}^{2})-\phi_{X}(4\beta_{2}^{2}))(\Delta_{1}^{2}-\Delta_{2}^{2}).

The result then follows since $\phi_{X}$ is strictly monotonically decreasing. ∎

Lemma 2.6 demonstrates a key and somewhat surprising property of $L_{\Delta}$ and the dependence measure $\beta\mapsto{\rm HSIC}(\beta\odot X;Y)$. It shows that if the dominant feature $X_{1}$ is weighted by a sufficiently large $\beta_{1}$ satisfying the condition in equation (2.2), then setting the weight $\beta_{2}$ of the weaker feature $X_{2}$ to zero results in an increase in $L_{\Delta}$. This implies that the dependence measure ${\rm HSIC}(\beta\odot X;Y)$, which is directly proportional to $L_{\Delta}$ as explained in Lemma 2.2, also increases.

Lemma 2.6 (Removing the Weaker Signal Increases $L_{\Delta}$).

Let $\Delta_{1}>\Delta_{2}>0$. Suppose $\beta_{1}$ satisfies

\frac{1}{|\mu|}\int_{0}^{\infty}e^{-4\beta_{1}^{2}t}\mu(dt)<\frac{\Delta_{1}^{2}-\Delta_{2}^{2}}{\Delta_{1}^{2}+\Delta_{2}^{2}}  (2.2)

where $|\mu|=\mu([0,\infty))$ denotes the total measure. Then for every $\beta_{2}>0$:

L_{\Delta}(\beta_{1},\beta_{2})<L_{\Delta}(\beta_{1},0).

Proof    Using the expression of $L_{\Delta}$ in equation (2.1), we obtain

L_{\Delta}(\beta_{1},0)-L_{\Delta}(\beta_{1},\beta_{2})=(\phi_{X}(4\beta_{1}^{2}+4\beta_{2}^{2})-\phi_{X}(4\beta_{1}^{2}))(\Delta_{1}^{2}+\Delta_{2}^{2})+(\phi_{X}(0)-\phi_{X}(4\beta_{2}^{2}))(\Delta_{1}^{2}-\Delta_{2}^{2}).

If $\beta_{2}>0$, then $\phi_{X}(0)>\phi_{X}(4\beta_{2}^{2})$ because $\phi_{X}$ is strictly monotonically decreasing on $\mathbb{R}_{+}$. Therefore, $L_{\Delta}(\beta_{1},\beta_{2})<L_{\Delta}(\beta_{1},0)$ if and only if

\frac{\phi_{X}(4\beta_{1}^{2})-\phi_{X}(4\beta_{1}^{2}+4\beta_{2}^{2})}{\phi_{X}(0)-\phi_{X}(4\beta_{2}^{2})}<\frac{\Delta_{1}^{2}-\Delta_{2}^{2}}{\Delta_{1}^{2}+\Delta_{2}^{2}}.  (2.3)

Below we show that condition (2.3) holds under the assumption (2.2).

To do so, a key step is to show the following upper bound on the ratio:

\frac{\phi_{X}(4\beta_{1}^{2})-\phi_{X}(4\beta_{1}^{2}+4\beta_{2}^{2})}{\phi_{X}(0)-\phi_{X}(4\beta_{2}^{2})}\leq\frac{1}{|\mu|}\int_{0}^{\infty}e^{-4\beta_{1}^{2}t}\mu(dt).  (2.4)

Note that

\begin{split}\phi_{X}(4\beta_{1}^{2})-\phi_{X}(4\beta_{1}^{2}+4\beta_{2}^{2})&=\int_{0}^{\infty}e^{-4\beta_{1}^{2}t}(1-e^{-4\beta_{2}^{2}t})\mu(dt),\\ \phi_{X}(0)-\phi_{X}(4\beta_{2}^{2})&=\int_{0}^{\infty}(1-e^{-4\beta_{2}^{2}t})\mu(dt).\end{split}

Since the function $t\mapsto e^{-4\beta_{1}^{2}t}$ is monotonically decreasing and the function $t\mapsto 1-e^{-4\beta_{2}^{2}t}$ is monotonically increasing, Chebyshev’s sum inequality becomes relevant for bounding the ratio.

Lemma 2.7 (Chebyshev’s sum inequality).

Let $f$ and $g$ be real-valued integrable functions over $[a,b]$, both nonincreasing, and let $\nu$ be a nonnegative measure on $[a,b]$ with $\nu([a,b])<\infty$. Then

\int_{a}^{b}f(x)g(x)\nu(dx)\geq\frac{1}{\nu([a,b])}\int_{a}^{b}f(x)\nu(dx)\int_{a}^{b}g(x)\nu(dx).

Specifically, choosing $f(t)=e^{-4\beta_{1}^{2}t}$ and $g(t)=-(1-e^{-4\beta_{2}^{2}t})$, both monotonically decreasing, and applying Chebyshev’s sum inequality to these two functions, we obtain

\phi_{X}(4\beta_{1}^{2})-\phi_{X}(4\beta_{1}^{2}+4\beta_{2}^{2})=-\int_{0}^{\infty}f(t)g(t)\mu(dt)\leq-\frac{1}{|\mu|}\int_{0}^{\infty}f(t)\mu(dt)\int_{0}^{\infty}g(t)\mu(dt),

and thereby:

\begin{split}\frac{\phi_{X}(4\beta_{1}^{2})-\phi_{X}(4\beta_{1}^{2}+4\beta_{2}^{2})}{\phi_{X}(0)-\phi_{X}(4\beta_{2}^{2})}&=\frac{-\int_{0}^{\infty}f(t)g(t)\mu(dt)}{-\int_{0}^{\infty}g(t)\mu(dt)}\\ &\leq\frac{1}{|\mu|}\int_{0}^{\infty}f(t)\mu(dt)=\frac{1}{|\mu|}\int_{0}^{\infty}e^{-4\beta_{1}^{2}t}\mu(dt).\end{split}  (2.5)

This proves equation (2.4).

Given equation (2.4) and the assumption in equation (2.2), we know that condition (2.3) is met. The desired conclusion $L_{\Delta}(\beta_{1},\beta_{2})<L_{\Delta}(\beta_{1},0)$ then follows. ∎

2.1.4 Finalizing Arguments

Below, we complete the proof of Theorem 1.1 and Theorem 1.2 for the case $p=2$.

Proof of Theorem 1.1

Let us choose $1>\Delta_{1}>\Delta_{2}>0$ such that

\frac{1}{|\mu|}\int_{0}^{\infty}e^{-4t}\mu(dt)<\frac{\Delta_{1}^{2}-\Delta_{2}^{2}}{\Delta_{1}^{2}+\Delta_{2}^{2}}.  (2.6)

This is achievable since the LHS is strictly smaller than $1$, as $\mu$ is not concentrated at zero. Now we consider the HSIC maximization procedure for the data $(X,Y)\sim\mathbb{P}_{\Delta}$, where $\mathbb{P}_{\Delta}$ is specified in Section 2.1.1:

S_{*}=\mathop{\rm argmax}_{S:S\subseteq\{1,2\}}{\rm HSIC}(X_{S};Y).

  • ($X_{1}$ is selected in the maximizer) Since $\Delta_{1}>\Delta_{2}$, by Lemma 2.4 we obtain $L_{\Delta}(1,0)>L_{\Delta}(0,0)$ and $L_{\Delta}(1,1)>L_{\Delta}(0,1)$. These two inequalities, by Lemma 2.2, are equivalent to

    \begin{split}{\rm HSIC}(X_{\{1\}};Y)&>{\rm HSIC}(X_{\emptyset};Y),\\ {\rm HSIC}(X_{\{1,2\}};Y)&>{\rm HSIC}(X_{\{2\}};Y).\end{split}

  • ($X_{2}$ is not selected if $X_{1}$ is selected) Since condition (2.6) is assumed, by Lemma 2.6 we obtain $L_{\Delta}(1,0)>L_{\Delta}(1,1)$. This inequality, by Lemma 2.2, is equivalent to

    {\rm HSIC}(X_{\{1\}};Y)>{\rm HSIC}(X_{\{1,2\}};Y).

Consequently, we deduce that $S_{*}=\{1\}$ maximizes the objective $S\mapsto{\rm HSIC}(X_{S};Y)$ for any $\Delta$ obeying equation (2.6) with $1>\Delta_{1}>\Delta_{2}>0$. Nonetheless, $\mathbb{E}_{\Delta}[Y|X]\neq\mathbb{E}_{\Delta}[Y|X_{S_{*}}]$ by Lemma 2.1. ∎
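For concreteness (an illustration, not part of the proof), take the Gaussian kernel $\phi_{X}(z)=\exp(-z)$, for which $\mu$ is a unit point mass at $t=1$ and the left-hand side of (2.6) equals $e^{-4}\approx 0.018$; the choice $\Delta_{1}=0.9$, $\Delta_{2}=0.1$ then satisfies (2.6). The short Python check below evaluates $L_{\Delta}$ at the four candidate subsets and confirms that the singleton $\{1\}$ wins, even though $X_{2}$ is a relevant feature.

    import numpy as np

    phi = lambda z: np.exp(-z)      # Gaussian kernel: phi_X(z) = exp(-z), so mu = delta_{t=1}
    d1, d2 = 0.9, 0.1               # satisfies (2.6): exp(-4) ≈ 0.018 < 0.80 / 0.82

    def L_delta(b1, b2):
        """Closed form (2.1); HSIC(X_S; Y) is a positive multiple of this (Lemma 2.2)."""
        return (phi(0) - phi(4*b1**2 + 4*b2**2)) * (d1**2 + d2**2) \
            - (phi(4*b1**2) - phi(4*b2**2)) * (d1**2 - d2**2)

    # subsets of {1, 2} encoded as indicator weights beta in {0, 1}^2
    for S, beta in {"{}": (0, 0), "{1}": (1, 0), "{2}": (0, 1), "{1,2}": (1, 1)}.items():
        print(S, round(L_delta(*beta), 4))
    # values are ordered {1} > {1,2} > {2} > {}, so S_* = {1} excludes the relevant X_2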

Proof of Theorem 1.2

Fix the norm $\left\|{\cdot}\right\|_{\ell_{q}}$ and $0<r<\infty$.

Let us define $\Phi(b)=\left\|{(b,b)}\right\|_{\ell_{q}}=2^{1/q}b$ for every $b\in\mathbb{R}_{+}$. Then $\Phi$ is strictly monotonically increasing on $\mathbb{R}_{+}$, and continuous. Let $b_{0}=\Phi^{-1}(r)$. Given this $b_{0}$, we then choose $1>\Delta_{1}>\Delta_{2}>0$ such that

\frac{1}{|\mu|}\int_{0}^{\infty}e^{-4b_{0}^{2}t}\mu(dt)<\frac{\Delta_{1}^{2}-\Delta_{2}^{2}}{\Delta_{1}^{2}+\Delta_{2}^{2}}.  (2.7)

This is achievable since the LHS is strictly smaller than $1$.

We consider the continuous version of the HSIC maximization procedure for the data $(X,Y)\sim\mathbb{P}_{\Delta}$:

S_{*,{\rm cont}}=\mathop{\rm supp}(\beta_{*})\quad\text{where}\quad\beta_{*}=\mathop{\rm argmax}_{\beta:\left\|{\beta}\right\|_{\ell_{q}}\leq r}{\rm HSIC}(\beta\odot X;Y).

By Lemma 2.2, $\beta_{*}$ must also be a maximizer of

\beta_{*}=\mathop{\rm argmax}_{\beta:\left\|{\beta}\right\|_{\ell_{q}}\leq r}L_{\Delta}(\beta).

Write $\beta_{*}=(\beta_{*,1},\beta_{*,2})$. We may assume $\beta_{*}\in\mathbb{R}_{+}^{2}$ by the symmetry property in Lemma 2.3. Note:

  • ($X_{1}$ is selected and $\beta_{*,1}\geq b_{0}$) Given that $\Delta_{1}>\Delta_{2}$, Lemma 2.4 implies that $L_{\Delta}(\beta_{1},\beta_{2})$ is strictly increasing in $\beta_{1}$ on $\mathbb{R}_{+}$. Thus, at the maximizer $\beta_{*}\in\mathbb{R}_{+}^{2}$, the constraint $\left\|{\beta_{*}}\right\|_{\ell_{q}}=r$ must hold, since otherwise we could strictly increase the objective $L_{\Delta}$ by increasing $\beta_{1}$. Additionally, since $\Delta_{1}>\Delta_{2}$, Lemma 2.5 further implies that $\beta_{*,1}\geq\beta_{*,2}$ at the maximizer.

    Hence, we obtain that $r=\left\|{(\beta_{*,1},\beta_{*,2})}\right\|_{\ell_{q}}\leq\left\|{(\beta_{*,1},\beta_{*,1})}\right\|_{\ell_{q}}=\Phi(\beta_{*,1})$. Since $\Phi$ is strictly monotonically increasing and continuous, we obtain that $\beta_{*,1}\geq\Phi^{-1}(r)=b_{0}$.

  • ($X_{2}$ is not selected, i.e., $\beta_{*,2}=0$) Since condition (2.7) is assumed and $\beta_{*,1}\geq b_{0}$, we derive

    \frac{1}{|\mu|}\int_{0}^{\infty}e^{-4\beta_{*,1}^{2}t}\mu(dt)<\frac{\Delta_{1}^{2}-\Delta_{2}^{2}}{\Delta_{1}^{2}+\Delta_{2}^{2}}.

    Lemma 2.6 then implies that $L_{\Delta}(\beta_{*,1},0)>L_{\Delta}(\beta_{*,1},\beta_{2})$ for every $\beta_{2}\neq 0$. Since $\beta_{*}$ is a maximizer, this further implies that $\beta_{*,2}=0$ must hold.

Consequently, we deduce that $S_{*,{\rm cont}}=\mathop{\rm supp}(\beta_{*})=\{1\}$. However, $\mathbb{E}_{\Delta}[Y|X]\neq\mathbb{E}_{\Delta}[Y|X_{S_{*,{\rm cont}}}]$ by Lemma 2.1. ∎
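For intuition (again an illustration, not part of the proof), a brute-force grid search over the constraint set reproduces this conclusion numerically under the same illustrative choices as before: the Gaussian kernel $\phi_{X}(z)=\exp(-z)$, $q=\infty$ and $r=1$ (so $b_{0}=1$), and $\Delta=(0.9,0.1)$, which together satisfy (2.7); by Lemma 2.3 it suffices to search the nonnegative box $[0,r]^{2}$.

    import numpy as np

    phi = lambda z: np.exp(-z)    # Gaussian kernel, so mu is a unit point mass at t = 1
    d1, d2, r = 0.9, 0.1, 1.0     # q = infinity, so the constraint set is a box (up to signs)

    def L_delta(b1, b2):
        return (phi(0) - phi(4*b1**2 + 4*b2**2)) * (d1**2 + d2**2) \
            - (phi(4*b1**2) - phi(4*b2**2)) * (d1**2 - d2**2)

    grid = np.linspace(0.0, r, 201)
    B1, B2 = np.meshgrid(grid, grid, indexing="ij")
    vals = L_delta(B1, B2)
    i, j = np.unravel_index(np.argmax(vals), vals.shape)
    print("maximizer beta* ≈", (grid[i], grid[j]))   # expect (1.0, 0.0): support = {1}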

2.2 General dimension $p\geq 2$

Previously, in Section 2.1, we established Theorem 1.1 and Theorem 1.2 when the dimension $p=2$. Specifically, we constructed a distribution of $(X,Y)$ supported on $\mathbb{R}^{2}\times\mathbb{R}$ such that the conclusions of Theorem 1.1 and Theorem 1.2 hold.

Now consider an arbitrary dimension $p\geq 2$. Suppose we are given two kernels $k_{X}$ and $k_{Y}$ on $\mathbb{R}^{p}$ and $\mathbb{R}$, respectively, which satisfy Assumptions 1 and 2. In particular, this means that $k_{X}(x,x^{\prime})=\phi_{X}(\left\|{x-x^{\prime}}\right\|_{2}^{2})$ for every $x,x^{\prime}\in\mathbb{R}^{p}$, with some function $\phi_{X}$ obeying the integral representation in Assumption 1. This kernel $k_{X}$ on $\mathbb{R}^{p}$ then induces a kernel $\bar{k}_{X}$ on $\mathbb{R}^{2}$ via $\bar{k}_{X}(x,x^{\prime})=\phi_{X}(\left\|{x-x^{\prime}}\right\|_{2}^{2})$ for all $x,x^{\prime}\in\mathbb{R}^{2}$. Because we have shown the validity of Theorem 1.1 and Theorem 1.2 for $p=2$, we can conclude that for the kernels $\bar{k}_{X}$ and $k_{Y}$, there exists a probability distribution $\bar{\mathbb{P}}$ for $(\bar{X},\bar{Y})$ over $\mathbb{R}^{2}\times\mathbb{R}$, where $\bar{X}\in\mathbb{R}^{2}$ and $\bar{Y}\in\mathbb{R}$, such that the conclusions of Theorem 1.1 and Theorem 1.2 hold under this distribution $\bar{\mathbb{P}}$.

Now we use $\bar{\mathbb{P}}$ to construct a distribution $\mathbb{P}$ on $\mathbb{R}^{p}\times\mathbb{R}$ such that Theorem 1.1 and Theorem 1.2 hold for the kernels $k_{X}$ and $k_{Y}$ on $\mathbb{R}^{p}$ and $\mathbb{R}$, respectively, under this distribution $\mathbb{P}$. Given random variables $(\bar{X},\bar{Y})$ on $\mathbb{R}^{2}\times\mathbb{R}$ following the distribution $\bar{\mathbb{P}}$, we construct random variables $(X,Y)$ on $\mathbb{R}^{p}\times\mathbb{R}$ as follows: $X_{1}=\bar{X}_{1}$, $X_{2}=\bar{X}_{2}$, $X_{3}=X_{4}=\ldots=X_{p}=0$, and $Y=\bar{Y}$. This construction simply extends the two-dimensional random vector $\bar{X}$ into a $p$-dimensional random vector $X$ by padding with zeros, which does not affect the evaluation of the kernel distances. Given that the conclusions of Theorem 1.1 and Theorem 1.2 hold under the distribution $\bar{\mathbb{P}}$, it follows that they extend to the distribution $\mathbb{P}$ we have constructed.

Appendix A Proofs of Technical Lemmas

A.1 Proof of Lemma 2.1

The proof is based on routine calculations. We first observe that, for $(x_{1},x_{2})\in\{\pm 1\}^{2}$:

\mathbb{P}_{\Delta}(X_{1}=x_{1},X_{2}=x_{2})=\sum_{y\in\{\pm 1\}}\mathbb{P}_{\Delta}(X_{1}=x_{1},X_{2}=x_{2},Y=y)=\frac{1}{4}(1+\Delta_{1}\Delta_{2}x_{1}x_{2}).

Thereby, we derive, for $(x_{1},x_{2})\in\{\pm 1\}^{2}$ and $y\in\{\pm 1\}$:

\mathbb{P}_{\Delta}(Y=y|X_{1}=x_{1},X_{2}=x_{2})=\frac{1}{2}+\frac{1}{2}\cdot\frac{y(\Delta_{1}x_{1}+\Delta_{2}x_{2})}{1+\Delta_{1}\Delta_{2}x_{1}x_{2}},

which immediately gives the conditional expectation of $Y$ given $X$:

\mathbb{E}_{\Delta}[Y|X]=\frac{\Delta_{1}X_{1}+\Delta_{2}X_{2}}{1+\Delta_{1}\Delta_{2}X_{1}X_{2}}.  (A.1)

Similarly, by Bayes' rule, we obtain, for $(x_{1},x_{2})\in\{\pm 1\}^{2}$ and $y\in\{\pm 1\}$:

\mathbb{P}_{\Delta}(Y=y|X_{1}=x_{1})=\frac{1}{2}(1+\Delta_{1}x_{1}y),\qquad\mathbb{P}_{\Delta}(Y=y|X_{2}=x_{2})=\frac{1}{2}(1+\Delta_{2}x_{2}y).

As a consequence, we deduce:

\mathbb{E}_{\Delta}[Y|X_{1}]=\Delta_{1}X_{1},\qquad\mathbb{E}_{\Delta}[Y|X_{2}]=\Delta_{2}X_{2},\qquad\mathbb{E}_{\Delta}[Y]=0.  (A.2)

Comparing the expressions in equations (A.1) and (A.2), we reach Lemma 2.1 as desired. ∎

A.2 Proof of Lemma 2.2

The proof can be carried out by mechanical calculations; here we mainly present the tricks we use to simplify the calculation process. Notably, because the proofs for the two cases of $k_{Y}$ are similar, below we only present the proof for the first case, where $k_{Y}(y,y^{\prime})=\phi_{Y}(yy^{\prime})$.

Recall the definition:

\begin{split}{\rm HSIC}(\beta\odot X,Y)&=\mathbb{E}_{\Delta}[k_{X}(\beta\odot X,\beta\odot X^{\prime})\cdot k_{Y}(Y,Y^{\prime})]+\mathbb{E}_{\Delta}[k_{X}(\beta\odot X,\beta\odot X^{\prime})]\cdot\mathbb{E}_{\Delta}[k_{Y}(Y,Y^{\prime})]\\ &\quad-2\,\mathbb{E}_{\Delta}\left[\mathbb{E}_{\Delta}[k_{X}(\beta\odot X,\beta\odot X^{\prime})|X^{\prime}]\cdot\mathbb{E}_{\Delta}[k_{Y}(Y,Y^{\prime})|Y^{\prime}]\right],\end{split}  (A.3)

where $(X,Y),(X^{\prime},Y^{\prime})$ are independent copies from $\mathbb{P}_{\Delta}$.

Define $\tilde{k}_{Y}(y,y^{\prime})=k_{Y}(y,y^{\prime})-c$, with $c\in\mathbb{R}$ to be determined later. Then for any $c\in\mathbb{R}$, the value of ${\rm HSIC}(\beta\odot X,Y)$ does not change if we replace $k_{Y}$ on the RHS of equation (A.3) by $\tilde{k}_{Y}$, i.e.,

\begin{split}{\rm HSIC}(\beta\odot X,Y)&=\mathbb{E}_{\Delta}[k_{X}(\beta\odot X,\beta\odot X^{\prime})\cdot\tilde{k}_{Y}(Y,Y^{\prime})]+\mathbb{E}_{\Delta}[k_{X}(\beta\odot X,\beta\odot X^{\prime})]\cdot\mathbb{E}_{\Delta}[\tilde{k}_{Y}(Y,Y^{\prime})]\\ &\quad-2\,\mathbb{E}_{\Delta}\left[\mathbb{E}_{\Delta}[k_{X}(\beta\odot X,\beta\odot X^{\prime})|X^{\prime}]\cdot\mathbb{E}_{\Delta}[\tilde{k}_{Y}(Y,Y^{\prime})|Y^{\prime}]\right].\end{split}  (A.4)

Here comes the simplification: we can choose $c\in\mathbb{R}$ such that $\mathbb{E}_{\Delta}[\tilde{k}_{Y}(Y,Y^{\prime})|Y^{\prime}]=0$. Note first:

\mathbb{E}_{\Delta}[\tilde{k}_{Y}(Y,Y^{\prime})|Y^{\prime}]=\mathbb{E}_{\Delta}[\phi_{Y}(YY^{\prime})\mid Y^{\prime}]-c=\frac{1}{2}(\phi_{Y}(1)+\phi_{Y}(-1))-c,

where the last identity uses the fact that $Y,Y^{\prime}$ are independent and uniformly distributed on $\{\pm 1\}$. Thus, if we choose $c=\frac{1}{2}(\phi_{Y}(1)+\phi_{Y}(-1))$, then two expectations vanish:

\mathbb{E}_{\Delta}[\tilde{k}_{Y}(Y,Y^{\prime})|Y^{\prime}]=0=\mathbb{E}_{\Delta}[\tilde{k}_{Y}(Y,Y^{\prime})].

By substituting this into equation (A.4), we obtain the identity

\begin{split}{\rm HSIC}(\beta\odot X,Y)&=\mathbb{E}_{\Delta}[k_{X}(\beta\odot X,\beta\odot X^{\prime})\cdot\tilde{k}_{Y}(Y,Y^{\prime})]\\ &=\mathbb{E}_{\Delta}[k_{X}(\beta\odot X,\beta\odot X^{\prime})\cdot YY^{\prime}]\cdot\frac{1}{2}(\phi_{Y}(1)-\phi_{Y}(-1)),\end{split}  (A.5)

where the second identity holds because $\tilde{k}_{Y}(y,y^{\prime})=\phi_{Y}(yy^{\prime})-\frac{1}{2}(\phi_{Y}(1)+\phi_{Y}(-1))=\frac{1}{2}(\phi_{Y}(1)-\phi_{Y}(-1))\,yy^{\prime}$ for every pair $y,y^{\prime}\in\{\pm 1\}$. This gives a much simpler expression than the original one in equation (A.3).

To further evaluate the RHS of equation (A.5), we compute the conditional expectation:

\begin{split}V(Y,Y^{\prime})&=\mathbb{E}_{\Delta}\left[k_{X}(\beta\odot X,\beta\odot X^{\prime})\mid Y,Y^{\prime}\right]\\ &=\mathbb{E}_{\Delta}\left[\phi_{X}(\|{\beta\odot(X-X^{\prime})}\|_{2}^{2})\mid Y,Y^{\prime}\right].\end{split}

Since $X_{1},X_{1}^{\prime}$ can only take values in $\{\pm 1\}$, $|X_{1}-X_{1}^{\prime}|$ can only be $0$ or $2$. Thus $|X_{1}-X_{1}^{\prime}|=2\cdot\mathbf{1}_{X_{1}\neq X_{1}^{\prime}}$. The same applies to $|X_{2}-X_{2}^{\prime}|$. As a result, we obtain the expression:

V(Y,Y^{\prime})=\mathbb{E}_{\Delta}\left[\phi_{X}(4\beta_{1}^{2}\mathbf{1}_{X_{1}\neq X_{1}^{\prime}}+4\beta_{2}^{2}\mathbf{1}_{X_{2}\neq X_{2}^{\prime}})\mid Y,Y^{\prime}\right].

Given $Y,Y^{\prime}$, the random variables $X_{1},X_{2},X_{1}^{\prime},X_{2}^{\prime}$ are mutually independent by the definition of $\mathbb{P}_{\Delta}$, and thus $\mathbf{1}_{X_{1}\neq X_{1}^{\prime}}$ is independent of $\mathbf{1}_{X_{2}\neq X_{2}^{\prime}}$. Furthermore, we have

\mathbb{P}_{\Delta}(X_{1}\neq X_{1}^{\prime}\mid Y,Y^{\prime})=\frac{1}{2}(1-\Delta_{1}^{2}YY^{\prime}),\qquad\mathbb{P}_{\Delta}(X_{2}\neq X_{2}^{\prime}\mid Y,Y^{\prime})=\frac{1}{2}(1-\Delta_{2}^{2}YY^{\prime}).

As a result, we obtain

\begin{split}V(Y,Y^{\prime})&=\frac{1}{4}\Big{[}\phi_{X}(4\beta_{1}^{2}+4\beta_{2}^{2})(1-\Delta_{1}^{2}YY^{\prime})(1-\Delta_{2}^{2}YY^{\prime})+\phi_{X}(4\beta_{1}^{2})(1-\Delta_{1}^{2}YY^{\prime})(1+\Delta_{2}^{2}YY^{\prime})\\ &\quad+\phi_{X}(4\beta_{2}^{2})(1+\Delta_{1}^{2}YY^{\prime})(1-\Delta_{2}^{2}YY^{\prime})+\phi_{X}(0)(1+\Delta_{1}^{2}YY^{\prime})(1+\Delta_{2}^{2}YY^{\prime})\Big{]}.\end{split}

Finally, we go back to equation (A.5). By recognizing the fact that

{\rm HSIC}(\beta\odot X,Y)=\mathbb{E}_{\Delta}[YY^{\prime}V(Y,Y^{\prime})]\cdot\frac{1}{2}(\phi_{Y}(1)-\phi_{Y}(-1)),

and the expectations

\begin{split}\mathbb{E}_{\Delta}[(1-\Delta_{1}^{2}YY^{\prime})(1-\Delta_{2}^{2}YY^{\prime})YY^{\prime}]&=-\Delta_{1}^{2}-\Delta_{2}^{2},\\ \mathbb{E}_{\Delta}[(1-\Delta_{1}^{2}YY^{\prime})(1+\Delta_{2}^{2}YY^{\prime})YY^{\prime}]&=-\Delta_{1}^{2}+\Delta_{2}^{2},\\ \mathbb{E}_{\Delta}[(1+\Delta_{1}^{2}YY^{\prime})(1-\Delta_{2}^{2}YY^{\prime})YY^{\prime}]&=+\Delta_{1}^{2}-\Delta_{2}^{2},\\ \mathbb{E}_{\Delta}[(1+\Delta_{1}^{2}YY^{\prime})(1+\Delta_{2}^{2}YY^{\prime})YY^{\prime}]&=+\Delta_{1}^{2}+\Delta_{2}^{2},\end{split}

we deduce the expression:

\begin{split}{\rm HSIC}(\beta\odot X,Y)&=\frac{1}{8}\cdot\Big{[}\phi_{X}(4\beta_{1}^{2}+4\beta_{2}^{2})(-\Delta_{1}^{2}-\Delta_{2}^{2})+\phi_{X}(4\beta_{1}^{2})(-\Delta_{1}^{2}+\Delta_{2}^{2})\\ &\qquad+\phi_{X}(4\beta_{2}^{2})(\Delta_{1}^{2}-\Delta_{2}^{2})+\phi_{X}(0)(\Delta_{1}^{2}+\Delta_{2}^{2})\Big{]}\cdot(\phi_{Y}(1)-\phi_{Y}(-1))\\ &=\frac{1}{8}\cdot(\phi_{Y}(1)-\phi_{Y}(-1))\cdot L_{\Delta}(\beta).\end{split}

This completes the proof.

References

  • [CLLR23] Yunlu Chen, Yang Li, Keli Liu, and Feng Ruan, Kernel learning in ridge regression “automatically” yields exact low rank solution, arXiv preprint arXiv:2310.11736 (2023).
  • [Cov99] Thomas M Cover, Elements of information theory, John Wiley & Sons, 1999.
  • [CSWJ17] Jianbo Chen, Mitchell Stern, Martin J Wainwright, and Michael I Jordan, Kernel feature selection via conditional covariance minimization, Advances in neural information processing systems 30 (2017).
  • [FBJ09] Kenji Fukumizu, Francis R Bach, and Michael I Jordan, Kernel dimension reduction in regression, The Annals of Statistics 37 (2009), no. 4, 1871–1905.
  • [FGSS07] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf, Kernel measures of conditional dependence., Advances in Neural Information Processing, vol. 20, 2007, pp. 489–496.
  • [GBSS05] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf, Measuring statistical dependence with Hilbert-Schmidt norms, International conference on algorithmic learning theory, Springer, 2005, pp. 63–77.
  • [GT18] S Geeitha and M Thangamani, Incorporating EBO-HSIC with SVM for gene selection associated with cervical cancer classification, Journal of medical systems 42 (2018), no. 11, 225.
  • [HTF09] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman, The elements of statistical learning: data mining, inference, and prediction, vol. 2, Springer, 2009.
  • [MDF10] Mahdokht Masaeli, Jennifer G Dy, and Glenn M Fung, From transformation-based dimensionality reduction to feature selection, Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 751–758.
  • [MKB+23] Kensuke Mitsuzawa, Motonobu Kanagawa, Stefano Bortoli, Margherita Grossi, and Paolo Papotti, Variable selection in maximum mean discrepancy for interpretable distribution comparison, arXiv preprint arXiv:2311.01537 (2023).
  • [SBB+07] Le Song, Justin Bedo, Karsten M Borgwardt, Arthur Gretton, and Alex Smola, Gene selection via the BAHSIC family of algorithms, Bioinformatics 23 (2007), no. 13, i490–i498.
  • [Sch38] Isaac J Schoenberg, Metric spaces and completely monotone functions, Annals of Mathematics (1938), 811–841.
  • [SFL11] Bharath K Sriperumbudur, Kenji Fukumizu, and Gert RG Lanckriet, Universality, characteristic kernels and RKHS embedding of measures., Journal of Machine Learning Research 12 (2011), no. 7.
  • [SR09] Gábor J Székely and Maria L Rizzo, Brownian distance covariance, The Annals of Applied Statistics (2009), 1236–1265.
  • [SSG+07] Le Song, Alex Smola, Arthur Gretton, Karsten M Borgwardt, and Justin Bedo, Supervised feature selection via dependence estimation, Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 823–830.
  • [SSG+12] Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt, Feature selection via dependence maximization, Journal of Machine Learning Research 13 (2012), no. 5.
  • [WDL21] Tinghua Wang, Xiaolu Dai, and Yuze Liu, Learning with Hilbert–Schmidt independence criterion: A review and new perspectives, Knowledge-based systems 234 (2021), 107567.
  • [WDX23] Jie Wang, Santanu S Dey, and Yao Xie, Variable selection for kernel two-sample tests, arXiv preprint arXiv:2302.07415 (2023).
  • [YJS+14] Makoto Yamada, Wittawat Jitkrittum, Leonid Sigal, Eric P Xing, and Masashi Sugiyama, High-dimensional feature selection by feature-wise kernelized lasso, Neural computation 26 (2014), no. 1, 185–207.