On the Limitation of Kernel Dependence Maximization for Feature Selection
Abstract
A simple and intuitive method for feature selection consists of choosing the feature subset that maximizes a nonparametric measure of dependence between the response and the features. A popular proposal from the literature uses the Hilbert-Schmidt Independence Criterion (HSIC) as the nonparametric dependence measure. The rationale behind this approach to feature selection is that important features will exhibit a high dependence with the response and their inclusion in the set of selected features will increase the HSIC. Through counterexamples, we demonstrate that this rationale is flawed and that feature selection via HSIC maximization can miss critical features.
1 Introduction
In statistics and machine learning, feature selection aims to isolate a subset of features that fully captures the relationships between the features and response variables. This critical step enhances model interpretability, reduces the computational burden of downstream tasks and improves the model’s generalization capabilities [HTF09].
Nonparametric dependence measures are key tools for feature selection. Specifically, a nonparametric dependence measure between a response variable and a feature subset quantifies the strength of dependence without assuming any parametric model for their relationship. Common examples include mutual information [Cov99], distance covariance [SR09], and kernel-based dependence measures such as the Hilbert-Schmidt Independence Criterion (HSIC) [GBSS05]. The HSIC, for a wide range of kernels, recognizes all modes of dependence between variables, not just linear dependence, and large values of HSIC correspond to “more dependence” (for the definition and the scope of HSIC we consider, see Section 1.1). The literature has proposed selecting features by maximizing the HSIC [SSG+07, SSG+12]. Given features $X = (X_1, \ldots, X_p)$ and response variable $Y$, at the population level, this approach selects a feature subset $S^* \subseteq \{1, \ldots, p\}$, where $S^*$ is defined by
$S^* \in \operatorname*{argmax}_{S \subseteq \{1, \ldots, p\}} \mathrm{HSIC}(X_S, Y).$   (1.1)
The measure $\mathrm{HSIC}(X_S, Y)$ quantifies the nonparametric dependence between the subset of features $X_S$ and the response variable $Y$, with $X_S$ denoting the subvector formed by including only the coordinates of $X$ in the set $S$ (for a formal definition of $\mathrm{HSIC}(X_S, Y)$, see Section 1.2). The intuition behind this approach is clear: the subset of features, $X_{S^*}$, exhibiting maximum dependence with the response $Y$ (as measured by HSIC) should hopefully include all features necessary for explaining $Y$.
Despite the widespread use of this approach [SBB+07, GT18, WDL21], this paper reveals a critical problem in the use of HSIC for feature selection, an issue unaddressed in the existing literature. Through a counterexample, we show that HSIC maximization can fail to select critical variables needed to fully explain the response, even at the population level. In our counterexample, the set of selected variables $S^*$, which is defined by maximizing HSIC through equation (1.1), does not guarantee full explanatory power:
$\mathbb{E}[Y \mid X_{S^*}] \neq \mathbb{E}[Y \mid X] \quad \text{and} \quad \mathbb{P}_{Y \mid X_{S^*}} \neq \mathbb{P}_{Y \mid X},$   (1.2)
where $\mathbb{E}[\,\cdot \mid \cdot\,]$ and $\mathbb{P}_{\cdot \mid \cdot}$ denote the conditional expectation and conditional distribution, respectively. The first inequality is in the $L_2(\mathbb{P})$ sense.
Our counterexample shows that selecting features by maximizing HSIC can mistakenly exclude variables essential for fully explaining the response variable $Y$. This result, as far as we are aware, enhances the current understanding of HSIC’s statistical properties in feature selection (see Section 1.3 for a more detailed literature review). In particular, our result helps elucidate the distinction between two approaches in the literature to feature selection using kernel dependence measures. The first approach, termed “dependence maximization”, consists of finding some feature subset $S$ to maximize the kernel dependence between $X_S$ and the response $Y$ [SSG+07]. The second approach, termed “conditional dependence minimization”, consists of finding some feature subset $S$ such that the kernel conditional dependence between the response $Y$ and the feature vector $X$ is minimized given $X_S$ [FBJ09]. While the first approach can be computationally cheaper, it has gone unrecognized that this first approach does not guarantee selecting all variables necessary for explaining the response (see Section 1.3 for more discussion of these two approaches).
1.1 Problem Setup: Definition, Notation and Assumption
The Hilbert-Schmidt Independence Criterion (HSIC), introduced initially in [GBSS05], is a measure quantifying the dependence between two random variables, $X$ and $Y$, using positive (semi)definite kernels. A positive semidefinite kernel $k$, simply called a kernel, on a set $\mathcal{Z}$ is a symmetric function $k: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ that requires the sum $\sum_{i=1}^n \sum_{j=1}^n c_i c_j k(z_i, z_j)$ to be nonnegative for any distinct points $z_1, \ldots, z_n \in \mathcal{Z}$ and any real numbers $c_1, \ldots, c_n$. A kernel is called a positive definite kernel when this sum equals $0$ if and only if all $c_i$ are zero.
Definition 1.1 (HSIC).
Given a joint probability measure $\mathbb{P}_{X,Y}$ over the product space $\mathcal{X} \times \mathcal{Y}$, along with two positive semidefinite kernels $k_{\mathcal X}$, $k_{\mathcal Y}$ on the sets $\mathcal{X}$ and $\mathcal{Y}$ respectively, the Hilbert-Schmidt Independence Criterion (HSIC) is defined as
$\mathrm{HSIC}(X, Y) := \mathbb{E}\big[k_{\mathcal X}(X, X')\, k_{\mathcal Y}(Y, Y')\big] + \mathbb{E}\big[k_{\mathcal X}(X, X')\big]\, \mathbb{E}\big[k_{\mathcal Y}(Y, Y')\big] - 2\, \mathbb{E}\big[k_{\mathcal X}(X, X')\, k_{\mathcal Y}(Y, Y'')\big].$   (1.3)
In the above, all expectations are taken over independent copies $(X, Y), (X', Y'), (X'', Y'') \sim \mathbb{P}_{X,Y}$.
The HSIC is always non-negative ($\mathrm{HSIC}(X, Y) \geq 0$) for every probability distribution $\mathbb{P}_{X,Y}$ and positive semidefinite kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$ [GBSS05, SSG+12]. Clearly, $\mathrm{HSIC}(X, Y) = 0$ whenever $X$ and $Y$ are independent. Additionally, when the kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$ in Definition 1.1 are characteristic kernels in the sense of [FGSS07, Section 2.2] or [SFL11, Definition 6], then $\mathrm{HSIC}(X, Y) = 0$ if and only if $X$ and $Y$ are independent (see [GBSS05, Theorem 4] and [SFL11, Section 2.3] for the statement). Characteristic kernels enable HSIC to capture all modes of dependence between $X$ and $Y$. Non-characteristic kernels can also be plugged into the definition of HSIC, equation (1.3), but they may fail to detect some forms of dependence between $X$ and $Y$, i.e., it is possible that $\mathrm{HSIC}(X, Y) = 0$ while $X$ and $Y$ are not independent.
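For readers who wish to experiment numerically, the population quantity in equation (1.3) is commonly replaced by the biased (V-statistic) estimator $\frac{1}{n^2}\operatorname{tr}(KHLH)$ of [GBSS05], where $K$ and $L$ are kernel Gram matrices and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$. The following minimal Python sketch implements this estimator; the Gaussian kernels, unit bandwidths, and function names are illustrative choices of ours, not prescriptions from the references.

import numpy as np

def gaussian_gram(Z, sigma=1.0):
    # Gram matrix of the Gaussian kernel exp(-||z - z'||_2^2 / (2 sigma^2)).
    Z = np.asarray(Z, dtype=float)
    Z = Z[:, None] if Z.ndim == 1 else Z
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    # Biased V-statistic estimate of HSIC: tr(K H L H) / n^2, with H = I - (1/n) 1 1^T.
    n = len(X)
    K, L = gaussian_gram(X, sigma_x), gaussian_gram(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n ** 2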
Our counterexamples apply to a wide range of HSIC measures obtained through various choices of kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$. Two requirements on the kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$ are needed throughout the paper. The first requires that $k_{\mathcal X}$ is a radially symmetric positive definite kernel on $\mathbb{R}^p$.
Assumption 1.
The kernel $k_{\mathcal X}$ obeys, for every $x, x' \in \mathbb{R}^p$:
$k_{\mathcal X}(x, x') = \phi\big(\|x - x'\|_2^2\big),$
where, for some finite nonnegative measure $\mu$ on $[0, \infty)$ that is not concentrated at $\{0\}$, the following holds for every $t \geq 0$:
$\phi(t) = \int_0^\infty e^{-\lambda t}\, \mu(d\lambda).$
The integral representation of $\phi$ in Assumption 1 is motivated by Schoenberg’s fundamental work [Sch38]. Given a real-valued function $\phi$, for the mapping $(x, x') \mapsto \phi(\|x - x'\|_2^2)$ to qualify as a positive definite kernel on $\mathbb{R}^m$ for every dimension $m$, the representation $\phi(t) = \int_0^\infty e^{-\lambda t}\, \mu(d\lambda)$ must hold for some nonnegative finite measure $\mu$ not concentrated at zero [Sch38]. Two concrete examples satisfying Assumption 1 are the Gaussian kernel $k_{\mathcal X}(x, x') = e^{-\|x - x'\|_2^2/(2\sigma^2)}$, where $\phi(t) = e^{-t/(2\sigma^2)}$, and the Laplace kernel $k_{\mathcal X}(x, x') = e^{-\|x - x'\|_2/\sigma}$, where $\phi(t) = e^{-\sqrt{t}/\sigma}$.
Assumption 1 ensures that the mapping $(x, x') \mapsto \phi(\|x - x'\|_2^2)$ is a kernel on $\mathbb{R}^m$ for every dimension $m \geq 1$. This property is useful when working with $\mathrm{HSIC}(X_S, Y)$ because the dimension of the feature vector $X_S$ varies with the size of $S$ and may not equal $p$. As a side remark, the kernel $k_{\mathcal X}$ is a characteristic kernel on $\mathbb{R}^m$ for every dimension $m$ under Assumption 1 [SFL11, Proposition 5].
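Both examples can be inspected numerically. The short sketch below (our own illustration under assumed unit bandwidths) builds the radial Gram matrix $\big(\phi(\|x_i - x_j\|_2^2)\big)_{ij}$ for the Gaussian and Laplace choices of $\phi$ on a random point set of arbitrary dimension and checks positive semidefiniteness via the smallest eigenvalue.

import numpy as np

phi_gauss = lambda t: np.exp(-t / 2.0)       # Gaussian: phi(t) = exp(-t/2)
phi_laplace = lambda t: np.exp(-np.sqrt(t))  # Laplace: phi(t) = exp(-sqrt(t))

def radial_gram(X, phi):
    # Gram matrix (phi(||x_i - x_j||_2^2))_{ij} for the rows x_i of X.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return phi(d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 7))                 # the dimension is arbitrary under Assumption 1
for phi in (phi_gauss, phi_laplace):
    print(np.linalg.eigvalsh(radial_gram(X, phi)).min() >= -1e-8)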
Our second requirement is that the kernel $k_{\mathcal Y}$ is either a function of the difference $y - y'$ or a function of the product $y \cdot y'$.
Assumption 2.
The kernel $k_{\mathcal Y}$ satisfies one of the following conditions:
• $k_{\mathcal Y}(y, y') = h(y - y')$ with $h(0) > h(2)$, where $h$ is a real-valued function.
• $k_{\mathcal Y}(y, y') = g(y \cdot y')$ with $g(1) > g(-1)$, where $g$ is a real-valued function.
For $k_{\mathcal Y}(y, y') = h(y - y')$ to be a positive semidefinite kernel, $h(0) \geq h(2)$ must hold: the Gram matrix of $k_{\mathcal Y}$ on the two points $\{-1, +1\}$ has diagonal entries $h(0)$ and off-diagonal entries $h(2) = h(-2)$, and its positive semidefiniteness forces $h(0) \geq |h(2)|$. Likewise, for $k_{\mathcal Y}(y, y') = g(y y')$ to be a positive semidefinite kernel, $g(1) \geq g(-1)$ must be true. Thus the strict inequalities in Assumption 2 are only slightly stricter than the requirement that $k_{\mathcal Y}$ is a positive semidefinite kernel. Assumption 2 is very mild since it includes both characteristic kernels, such as the Gaussian kernel and the Laplace kernel, as well as non-characteristic kernels, such as the inner product kernel $k_{\mathcal Y}(y, y') = y \cdot y'$.
Our counterexamples apply to any pair of kernels $k_{\mathcal X}, k_{\mathcal Y}$ obeying Assumptions 1 and 2. Thus, they apply to a wide range of HSIC measures, including those nonparametric dependence measures utilizing characteristic kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$ to capture all types of dependence (say, when $k_{\mathcal X}$ and $k_{\mathcal Y}$ are both Gaussian kernels). In these cases, $\mathrm{HSIC}(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
1.2 Main Results
Throughout the remainder of the paper, we operate with two kernels $k_{\mathcal X}$ on $\mathbb{R}^p$ and $k_{\mathcal Y}$ on $\mathbb{R}$ that meet Assumptions 1 and 2, respectively. Our main results (counterexamples) for HSIC are applicable to any such kernels.
Given the covariates $X \in \mathbb{R}^p$ and the response $Y \in \mathbb{R}$, with joint distribution $\mathbb{P}_{X,Y}$, in this paper we study the population-level statistical properties of the solution set of the following HSIC maximization procedure:
$S^* \in \operatorname*{argmax}_{S \subseteq \{1, \ldots, p\}} \mathrm{HSIC}(X_S, Y),$   (1.4)
where $\mathrm{HSIC}(X_S, Y)$ is defined through the kernels $k_S$ and $k_{\mathcal Y}$. Here the kernel $k_S$ on $\mathbb{R}^{|S|}$ is induced by the original kernel $k_{\mathcal X}$ on $\mathbb{R}^p$ in the following way (recall $\phi$ in the definition of $k_{\mathcal X}$): $k_S(x_S, x'_S) = \phi\big(\|x_S - x'_S\|_2^2\big)$ for all $x_S, x'_S \in \mathbb{R}^{|S|}$.
For transparency, following Definition 1.1 we can write down the definition:
$\mathrm{HSIC}(X_S, Y) = \mathbb{E}\big[k_S(X_S, X'_S)\, k_{\mathcal Y}(Y, Y')\big] + \mathbb{E}\big[k_S(X_S, X'_S)\big]\, \mathbb{E}\big[k_{\mathcal Y}(Y, Y')\big] - 2\, \mathbb{E}\big[k_S(X_S, X'_S)\, k_{\mathcal Y}(Y, Y'')\big],$   (1.5)
where $(X, Y), (X', Y'), (X'', Y'')$ denote independent draws from $\mathbb{P}_{X,Y}$. This definition of $\mathrm{HSIC}(X_S, Y)$ is natural and aligns with its proposed use for feature selection, e.g., [SBB+07].
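At the sample level, the procedure in equation (1.4) amounts to a brute-force search over subsets with a plug-in HSIC estimate. A minimal sketch is given below; it reuses the hypothetical hsic_biased helper sketched after Definition 1.1 and is intended only to make the combinatorial nature of (1.4) concrete.

from itertools import combinations
import numpy as np

def hsic_subset_max(X, Y, sigma_x=1.0, sigma_y=1.0):
    # Empirical analogue of (1.4): search all nonempty subsets S of {1, ..., p}
    # and return one maximizing the plug-in estimate of HSIC(X_S, Y).
    n, p = X.shape
    best_S, best_val = None, -np.inf
    for size in range(1, p + 1):
        for S in combinations(range(p), size):
            val = hsic_biased(X[:, list(S)], Y, sigma_x, sigma_y)
            if val > best_val:
                best_S, best_val = S, val
    return best_S, best_val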
Our main result shows that the feature subset $S^*$, chosen by maximizing $\mathrm{HSIC}(X_S, Y)$ as in equation (1.4), can fail to capture the explanatory relationship between $X$ and $Y$.
Theorem 1.1.
Assume the dimension $p \geq 2$. Given any kernel $k_{\mathcal X}$ on $\mathbb{R}^p$ obeying Assumption 1 and any kernel $k_{\mathcal Y}$ on $\mathbb{R}$ obeying Assumption 2, there exists a probability distribution of $(X, Y)$ supported on a finite subset of $\mathbb{R}^p \times \{\pm 1\}$ such that any maximizer $S^*$ defined via equation (1.4) fails to include all relevant features in the following sense:
$\mathbb{E}[Y \mid X_{S^*}] \neq \mathbb{E}[Y \mid X] \quad \text{and} \quad \mathbb{P}_{Y \mid X_{S^*}} \neq \mathbb{P}_{Y \mid X}.$
In the above, the first inequality is in the $L_2(\mathbb{P})$ sense.
Theorem 1.1 cautions against relying on HSIC maximization for feature selection. The reader might wonder whether it is the combinatorial structure in defining the maximizer that is responsible for the failure observed in Theorem 1.1. To clarify the nature of the problem, we explore a continuous version of the combinatorial objective in equation (1.4). Below, we use $\odot$ to represent coordinate-wise multiplication between vectors; for instance, given a vector $w \in \mathbb{R}^p$ and covariates $X \in \mathbb{R}^p$, the product $w \odot X$ yields a new vector in $\mathbb{R}^p$ with its $i$th coordinate $(w \odot X)_i = w_i X_i$.
Now we consider HSIC maximization through a continuous optimization:
$w^* \in \operatorname*{argmax}_{w \in \mathbb{R}^p:\ \|w\|_q \leq R} \mathrm{HSIC}(w \odot X, Y), \qquad \widehat{S} = \mathrm{supp}(w^*),$   (1.6)
where $\mathrm{HSIC}(w \odot X, Y)$ measures the dependence between the weighted feature $w \odot X$ and the response $Y$, utilizing the kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$ on $\mathbb{R}^p$ and $\mathbb{R}$, respectively:
$\mathrm{HSIC}(w \odot X, Y) = \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\, k_{\mathcal Y}(Y, Y')\big] + \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\big]\, \mathbb{E}\big[k_{\mathcal Y}(Y, Y')\big] - 2\, \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\, k_{\mathcal Y}(Y, Y'')\big].$   (1.7)
In the definition (1.6), we first find the weight vector $w^*$ that maximizes the HSIC dependence measure between the response $Y$ and the weighted covariates $w \odot X$, and then use the support set of $w^*$ (the set of indices corresponding to non-zero coordinates of $w^*$) to define the feature set $\widehat{S}$. Note that imposing a norm constraint on the weights, $\|w\|_q \leq R$ (where $\|\cdot\|_q$ denotes the $\ell_q$ norm for $q \in [1, \infty]$ and $R > 0$), ensures the existence of a well-defined maximizer $w^*$.
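In finite samples, the weighted objective $w \mapsto \widehat{\mathrm{HSIC}}(w \odot X, Y)$ can be handed to any off-the-shelf constrained optimizer. The sketch below uses an $\ell_2$-norm ball ($q = 2$) and SciPy's SLSQP routine purely for illustration; the optimizer, the choice of $q$ and $R$, and the reuse of the hypothetical hsic_biased helper are our assumptions, and the routine returns only a local solution of the nonconvex problem.

import numpy as np
from scipy.optimize import minimize

def hsic_weighted(w, X, Y, sigma_x=1.0, sigma_y=1.0):
    # Empirical HSIC between the weighted features w * X (coordinate-wise) and Y.
    return hsic_biased(X * w[None, :], Y, sigma_x, sigma_y)

def hsic_continuous_max(X, Y, R=1.0, tol=1e-6):
    # Finite-sample analogue of (1.6) with the l2-ball constraint ||w||_2 <= R.
    p = X.shape[1]
    w0 = np.full(p, R / np.sqrt(p))
    res = minimize(lambda w: -hsic_weighted(w, X, Y), w0, method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda w: R - np.linalg.norm(w)}])
    w_star = res.x
    S_hat = np.flatnonzero(np.abs(w_star) > tol)  # support set of the (local) maximizer
    return w_star, S_hat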
Theorem 1.2.
Assume the dimension $p \geq 2$. Given any kernel $k_{\mathcal X}$ on $\mathbb{R}^p$ obeying Assumption 1, any kernel $k_{\mathcal Y}$ on $\mathbb{R}$ obeying Assumption 2, and any norm $\|\cdot\|_q$ where $q \in [1, \infty]$ and radius $R > 0$, there exists a probability distribution of $(X, Y)$ supported on a finite subset of $\mathbb{R}^p \times \{\pm 1\}$ such that any maximizer $w^*$ and the corresponding set $\widehat S = \mathrm{supp}(w^*)$ defined via equation (1.6) fail to include all relevant features:
$\mathbb{E}[Y \mid X_{\widehat S}] \neq \mathbb{E}[Y \mid X] \quad \text{and} \quad \mathbb{P}_{Y \mid X_{\widehat S}} \neq \mathbb{P}_{Y \mid X}.$
In the above, the first inequality is in the $L_2(\mathbb{P})$ sense.
Theorem 1.1 and Theorem 1.2 show that care must be taken when using HSIC for feature selection. Notably, the requirement that the dimension $p \geq 2$ is essential for Theorem 1.1 and Theorem 1.2 to hold. This necessity follows from a positive result on HSIC maximization when using characteristic kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$: the output feature set $S^*$ (or $\widehat S$) will be such that $X_{S^*}$ (or $X_{\widehat S}$) is dependent with $Y$ unless $X$ and $Y$ are independent.
Proposition 1.
Suppose both kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$ are characteristic on $\mathbb{R}^p$ and $\mathbb{R}$ respectively in the sense of [SFL11, Definition 6]. Applying the discrete HSIC maximization approach (1.4) and the continuous HSIC maximization approach (1.6) (with $R > 0$) will result in nonempty feature sets $S^*$ and $\widehat S$, with $X_{S^*}$ and $X_{\widehat S}$ dependent on $Y$ respectively, unless $X$ and $Y$ are independent.
Proof  Suppose $X$ and $Y$ are not independent. Then $\mathrm{HSIC}(X, Y) > 0$ because both kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$ are characteristic kernels. This follows from basic properties of HSIC, see e.g., [GBSS05, Theorem 4] and [SFL11, Section 2.3]. If we use the discrete maximization approach (1.4), then $\mathrm{HSIC}(X_{S^*}, Y) \geq \mathrm{HSIC}(X, Y) > 0$, and in particular $S^*$ is nonempty since $\mathrm{HSIC}(X_\emptyset, Y) = 0$. This implies that $X_{S^*}$ and $Y$ are dependent because, by definition of HSIC, $\mathrm{HSIC}(X_{S^*}, Y) = 0$ if $X_{S^*}$ and $Y$ are independent. Similarly, if we use the continuous HSIC maximization approach (1.6), then $\mathrm{HSIC}(w^* \odot X, Y) \geq \mathrm{HSIC}(w_0 \odot X, Y) > 0$ for any feasible $w_0$ with all coordinates nonzero (such $w_0$ exists since $R > 0$, and $w_0 \odot X$ is an invertible transformation of $X$), which then implies $w^* \odot X$ and $Y$, and hence $X_{\widehat S}$ and $Y$, cannot be independent.
∎
To conclude, while intuitively appealing, using HSIC maximization for feature selection may inadvertently exclude variables necessary for fully understanding the response $Y$.
1.3 Connections to Prior Literature
In the literature, there are two primary approaches to employing kernel-based nonparametric (conditional) dependence measures between variables for feature selection.
The first approach, directly relevant to our work, is to maximize the dependence measure between the response variable $Y$ and the feature subset $X_S$. In this paper, we focus on the Hilbert-Schmidt Independence Criterion (HSIC), a popular kernel-based nonparametric dependence measure [GBSS05]. Unlike many other dependence measures, HSIC allows easy estimation from finite samples with parametric convergence rates $O(1/\sqrt{n})$, where $n$ is the sample size [GBSS05]. Since its introduction, HSIC has become a popular tool for feature selection. The HSIC maximization method, formulated as equation (1.4), was developed by [SSG+07] and has been applied in diverse fields such as bioinformatics [SBB+07, GT18]. Subsequent works [MDF10, SSG+12] proposed continuous relaxations of this formulation to facilitate computation. Our main findings, Theorem 1.1 and Theorem 1.2, provide concrete counterexamples advising caution when selecting features via HSIC maximization, as one may miss features needed for fully understanding the response $Y$.
The second approach minimizes the kernel conditional dependence between $Y$ and $X$ given the feature subset $X_S$ [FGSS07, FBJ09, CSWJ17, CLLR23]. Remarkably, this approach ensures that the selected feature subset $S$ contains all features necessary to fully explain the response $Y$, as guaranteed by the conditional independence $Y \perp X \mid X_S$. This theoretical guarantee is achieved under a nonparametric setup without assuming specific parametric distribution models for the data [FBJ09]. However, computing the kernel conditional dependence measures required by this approach is more computationally demanding than calculating the HSIC used in the maximization approach due to the need for kernel matrix inversions [FGSS07].
Notably, our main findings elucidate a distinction between these two kernel-based feature selection strategies. The maximization approach, which we study in this paper, does not enjoy the same conditional independence guarantee as the minimization approach. Our counterexamples in Theorem 1.1 and Theorem 1.2 demonstrate that the HSIC maximization approach can fail to select critical variables needed to fully explain the response $Y$, highlighting a tradeoff between computational effort and the ability to recover all important variables.
To clarify, our research focuses on maximizing HSIC for feature selection as described in equation (1.4). It is important to note that there are alternative methods of employing HSIC in feature selection, such as HSIC-Lasso described in [YJS+14], which involves selecting features through a least-squares regression that applies penalties to the feature weights. HSIC-Lasso, though sharing a commonality in nomenclature with HSIC, is fundamentally a regression method, which differs from the dependence maximization strategy we explore here in equation (1.4). Our counterexamples do not extend to HSIC-Lasso.
Finally, we mention recent developments in feature selection for binary labels $Y$, where a feature subset $S$ is selected by maximizing the Maximum Mean Discrepancy (MMD) between the conditional distributions $\mathbb{P}_{X_S \mid Y = 1}$ and $\mathbb{P}_{X_S \mid Y = -1}$ over subsets $S$ [WDX23, MKB+23]. Given the established connections between MMD and HSIC in the literature, e.g., [SSG+07, Theorem 3], our counterexamples in Theorem 1.1 and Theorem 1.2, developed under conditions with binary response $Y$ (see the proofs in Section 2), have further implications for these MMD-based feature selection methods. We intend to report these implications and discuss methodological developments in a follow-up paper.
2 Proofs
In this section, we prove Theorem 1.1 and Theorem 1.2. Throughout, for a vector $v \in \mathbb{R}^p$, we will denote the coordinates of $v$ under the standard Euclidean basis as $v_1, \ldots, v_p$. This notation will also apply similarly to other vectors in the proof.
The core of our proof focuses on the critical case where the dimension $p = 2$ (Section 2.1). We subsequently extend these results to any dimension $p > 2$, which is straightforward (Section 2.2).
2.1 The case $p = 2$
For our counterexample, we construct a family of distributions $\{\mathbb{P}_a\}$ parameterized by a vector $a = (a_1, a_2) \in [0, 1]^2$. Our proof proceeds by showing that there exists a vector $a$ such that if the data $(X, Y)$ follows the distribution $\mathbb{P}_a$, then the conclusions of Theorem 1.1 and Theorem 1.2 hold.
2.1.1 Definition of $\mathbb{P}_a$
The distribution $\mathbb{P}_a$ is defined on the product space $\{\pm 1\}^2 \times \{\pm 1\}$, with the covariates $X = (X_1, X_2) \in \{\pm 1\}^2$ and the response $Y \in \{\pm 1\}$. The parameter $a = (a_1, a_2) \in [0, 1]^2$.
Below we specify the details of the construction.
• The response $Y$ is balanced: $\mathbb{P}_a(Y = 1) = \mathbb{P}_a(Y = -1) = 1/2$.
• The covariate $X = (X_1, X_2)$, with components $X_1, X_2 \in \{\pm 1\}$, obeys a conditional independence given $Y$: $X_1 \perp X_2 \mid Y$. Moreover, the conditional distribution of $X_i$ ($i = 1, 2$) given $Y$ obeys:
$\mathbb{P}_a(X_i = y \mid Y = y) = \frac{1 + a_i}{2}, \qquad \mathbb{P}_a(X_i = -y \mid Y = y) = \frac{1 - a_i}{2}, \qquad y \in \{\pm 1\}.$
This construction determines the joint distribution of $(X, Y)$ as follows:
$\mathbb{P}_a(X_1 = x_1, X_2 = x_2, Y = y) = \frac{1}{2} \cdot \frac{1 + a_1 x_1 y}{2} \cdot \frac{1 + a_2 x_2 y}{2}, \qquad (x_1, x_2, y) \in \{\pm 1\}^3.$
Intuitively, $a_1$ and $a_2$ measure how much $Y$ depends on $X_1$ and $X_2$, respectively. For instance, $a_i = 0$ indicates that $X_i$ and $Y$ are independent, while $a_i = 1$ means $X_i$ perfectly predicts $Y$. Thus, when both $a_1, a_2$ lie within $(0, 1)$, this intuition suggests that both features $X_1$ and $X_2$ are needed in order to fully explain the response $Y$. Lemma 2.1 formally establishes this.
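Since the displayed probabilities above are our reconstruction of the omitted formulas, the following simulation sketch is purely illustrative: it draws samples from the reconstructed $\mathbb{P}_a$ and compares plug-in HSIC estimates for the dominant feature alone versus both features, reusing the hypothetical hsic_biased helper from Section 1.1. The parameter values and Gaussian kernels are arbitrary choices for experimentation.

import numpy as np

def sample_Pa(n, a1, a2, rng):
    # Draw (X, Y) from the reconstructed P_a: Y uniform on {-1, +1}, and
    # X1, X2 conditionally independent given Y with P(X_i = Y | Y) = (1 + a_i) / 2.
    Y = rng.choice([-1.0, 1.0], size=n)
    X1 = Y * np.where(rng.random(n) < (1 + a1) / 2, 1.0, -1.0)
    X2 = Y * np.where(rng.random(n) < (1 + a2) / 2, 1.0, -1.0)
    return np.column_stack([X1, X2]), Y

rng = np.random.default_rng(0)
X, Y = sample_Pa(20000, a1=0.95, a2=0.30, rng=rng)
print("estimated HSIC(X1, Y):      ", hsic_biased(X[:, :1], Y))
print("estimated HSIC((X1, X2), Y):", hsic_biased(X, Y))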
We use $\mathbb{E}_a$ to denote the expectation under $\mathbb{P}_a$.
Lemma 2.1.
Suppose $a_1, a_2 \in (0, 1)$. Then the following inequalities hold under $\mathbb{P}_a$:
$\mathbb{E}_a[Y \mid X_1] \neq \mathbb{E}_a[Y \mid X] \quad \text{and} \quad \mathbb{P}_{Y \mid X_1} \neq \mathbb{P}_{Y \mid X}.$
2.1.2 Evaluation of HSIC
We evaluate $\mathrm{HSIC}(w \odot X, Y)$ under $(X, Y) \sim \mathbb{P}_a$ for every $w = (w_1, w_2) \in \mathbb{R}^2$. (Taking $w \in \{0, 1\}^2$ recovers $\mathrm{HSIC}(X_S, Y)$ for $S = \{i : w_i = 1\}$, by the definition of the induced kernel $k_S$.)
Recall the function $\phi$ and the measure $\mu$ from Assumption 1. For $w = (w_1, w_2) \in \mathbb{R}^2$, we define the function
$Q_a(w_1, w_2) := (a_1^2 + a_2^2)\big[\phi(0) - \phi(4 w_1^2 + 4 w_2^2)\big] + (a_1^2 - a_2^2)\big[\phi(4 w_2^2) - \phi(4 w_1^2)\big].$   (2.1)
The function $w \mapsto Q_a(w_1, w_2)$ is proportional to the function $w \mapsto \mathrm{HSIC}(w \odot X, Y)$, differing only by a positive, constant scalar. See Lemma 2.2 for the formal statement.
Lemma 2.2.
Let $(X, Y) \sim \mathbb{P}_a$. Then for every $w = (w_1, w_2) \in \mathbb{R}^2$:
• If $k_{\mathcal Y}$ is a function of $y - y'$, say $k_{\mathcal Y}(y, y') = h(y - y')$, then
$\mathrm{HSIC}(w \odot X, Y) = \frac{h(0) - h(2)}{8}\, Q_a(w_1, w_2).$
• If $k_{\mathcal Y}$ is a function of $y \cdot y'$, say $k_{\mathcal Y}(y, y') = g(y y')$, then
$\mathrm{HSIC}(w \odot X, Y) = \frac{g(1) - g(-1)}{8}\, Q_a(w_1, w_2).$
2.1.3 Properties of $Q_a$
Below we shall frequently assume $1 > a_1 > a_2 > 0$. In this setup, both features $X_1$ and $X_2$ are relevant features, with $X_1$ being called the dominant feature, and $X_2$ being called the weaker feature. We shall derive a collection of structural properties of $Q_a$ under this setup.
A quick observation is the symmetry of $Q_a$.
Lemma 2.3 (Symmetry).
For every $(w_1, w_2) \in \mathbb{R}^2$, $Q_a(w_1, w_2) = Q_a(|w_1|, |w_2|)$.
This symmetry property enables us to focus our understanding on how $Q_a$ behaves on the quadrant $[0, \infty)^2$.
Lemma 2.4 then shows that increasing the coefficient $w_1$ on the dominant feature $X_1$ strictly increases the function value $Q_a(w_1, w_2)$.
Lemma 2.4 (Monotonicity in Dominant Feature).
Let $1 > a_1 > a_2 > 0$. Then for every fixed $w_2 \in \mathbb{R}$, the mapping $w_1 \mapsto Q_a(w_1, w_2)$
is strictly monotonically increasing for $w_1 \in [0, \infty)$.
Proof  The mapping $t \mapsto \phi(t)$ is strictly monotonically decreasing on $[0, \infty)$, as $\phi(t) = \int_0^\infty e^{-\lambda t}\, \mu(d\lambda)$ for some nonnegative measure $\mu$ not atomic at zero. The strict monotonicity of $w_1 \mapsto Q_a(w_1, w_2)$ when $w_1 \in [0, \infty)$ then follows from the definition of $Q_a$ (equation (2.1)) in Section 2.1.2.
∎
Lemma 2.5 further demonstrates that, on $[0, \infty)^2$, increasing $Q_a$ can be effectively achieved by allocating the higher coefficient to the dominant feature $X_1$ rather than the weaker feature $X_2$.
Lemma 2.5.
Let $1 > a_1 > a_2 > 0$. Then for every $(w_1, w_2)$ with $w_1 > w_2 \geq 0$: $Q_a(w_1, w_2) > Q_a(w_2, w_1)$.
Proof  Following the definition of $Q_a$ (equation (2.1)), we deduce that
$Q_a(w_1, w_2) - Q_a(w_2, w_1) = 2\,(a_1^2 - a_2^2)\big[\phi(4 w_2^2) - \phi(4 w_1^2)\big].$
The result then follows since $\phi$ is strictly monotonically decreasing.
∎
Lemma 2.6 demonstrates a key and somewhat surprising property of $Q_a$ and the dependence measure $\mathrm{HSIC}(w \odot X, Y)$. It shows that if the dominant feature $X_1$ is weighted by a sufficiently large $w_1$ that meets the condition in equation (2.2), then setting the weight of the weaker feature $X_2$ to zero results in an increase in $Q_a$. This implies that the dependence measure $\mathrm{HSIC}(w \odot X, Y)$, which is directly proportional to $Q_a$ as explained in Lemma 2.2, also increases.
Lemma 2.6 (Removing the Weaker Signal Increases $Q_a$).
Let $1 > a_1 > a_2 > 0$. Suppose $w_1 > 0$ satisfies
$\phi(4 w_1^2) \leq \frac{a_1^2 - a_2^2}{a_1^2 + a_2^2}\, \phi(0),$   (2.2)
where $\phi(0) = \mu([0, \infty))$ denotes the total measure. Then for every $w_2 \in \mathbb{R}$: $Q_a(w_1, 0) \geq Q_a(w_1, w_2)$.
Proof  Using the expression of $Q_a$ in equation (2.1) (see also Lemma 2.2), we obtain
$Q_a(w_1, 0) - Q_a(w_1, w_2) = (a_1^2 - a_2^2)\big[\phi(0) - \phi(4 w_2^2)\big] - (a_1^2 + a_2^2)\big[\phi(4 w_1^2) - \phi(4 w_1^2 + 4 w_2^2)\big].$
If $w_2 = 0$ the claim is trivial; if $w_2 \neq 0$ then $\phi(0) - \phi(4 w_2^2) > 0$ because $\phi$ is strictly monotonically decreasing on $[0, \infty)$. Therefore, $Q_a(w_1, 0) \geq Q_a(w_1, w_2)$ if and only if
$\frac{\phi(4 w_1^2) - \phi(4 w_1^2 + 4 w_2^2)}{\phi(0) - \phi(4 w_2^2)} \leq \frac{a_1^2 - a_2^2}{a_1^2 + a_2^2}.$   (2.3)
Below we show that the condition (2.3) holds under Assumption (2.2).
To do so, a key step is to show the following upper bound on the ratio:
$\frac{\phi(4 w_1^2) - \phi(4 w_1^2 + 4 w_2^2)}{\phi(0) - \phi(4 w_2^2)} \leq \frac{\phi(4 w_1^2)}{\phi(0)}.$   (2.4)
Note that
$\phi(4 w_1^2) - \phi(4 w_1^2 + 4 w_2^2) = \int_0^\infty e^{-4 w_1^2 \lambda}\big(1 - e^{-4 w_2^2 \lambda}\big)\, \mu(d\lambda), \qquad \phi(0) - \phi(4 w_2^2) = \int_0^\infty \big(1 - e^{-4 w_2^2 \lambda}\big)\, \mu(d\lambda).$
Since the function $\lambda \mapsto e^{-4 w_1^2 \lambda}$ is monotonically decreasing, and the function $\lambda \mapsto 1 - e^{-4 w_2^2 \lambda}$ is monotonically increasing, Chebyshev's sum inequality becomes relevant to bound the ratio.
Lemma 2.7 (Chebyshev’s sum inequality).
Let $f$ and $g$ be real-valued integrable functions over $[0, \infty)$, both nonincreasing. Suppose $\mu$ is a nonnegative finite measure on $[0, \infty)$ with $\mu([0, \infty)) > 0$. Then
$\mu([0, \infty)) \int_0^\infty f(\lambda)\, g(\lambda)\, \mu(d\lambda) \geq \left(\int_0^\infty f(\lambda)\, \mu(d\lambda)\right)\left(\int_0^\infty g(\lambda)\, \mu(d\lambda)\right).$
Specifically, if we choose $f(\lambda) = e^{-4 w_1^2 \lambda}$ and $g(\lambda) = e^{-4 w_2^2 \lambda}$, both functions monotonically decreasing, and apply Chebyshev's sum inequality to these two functions, we obtain
$\phi(0)\, \phi(4 w_1^2 + 4 w_2^2) \geq \phi(4 w_1^2)\, \phi(4 w_2^2),$
and thereby:
$\frac{\phi(4 w_1^2) - \phi(4 w_1^2 + 4 w_2^2)}{\phi(0) - \phi(4 w_2^2)} \leq \frac{\phi(4 w_1^2) - \phi(4 w_1^2)\,\phi(4 w_2^2)/\phi(0)}{\phi(0) - \phi(4 w_2^2)} = \frac{\phi(4 w_1^2)}{\phi(0)}.$   (2.5)
This proves equation (2.4). Combining (2.4) with Assumption (2.2) yields (2.3), which completes the proof. ∎
2.1.4 Finalizing Arguments
Proof of Theorem 1.1
Let us choose $a = (a_1, a_2)$ with $1 > a_1 > a_2 > 0$ in a way such that
$\frac{\phi(4)}{\phi(0)} < \frac{a_1^2 - a_2^2}{a_1^2 + a_2^2}.$   (2.6)
This is achievable since the LHS is strictly smaller than $1$, as $\mu$ is not an atomic measure at zero. Now we consider the HSIC maximization procedure (1.4) for the data $(X, Y) \sim \mathbb{P}_a$, where $\mathbb{P}_a$ is specified in Section 2.1.1:
• By Lemma 2.2 and the definition of the induced kernel $k_S$, for every subset $S \subseteq \{1, 2\}$ we have $\mathrm{HSIC}(X_S, Y) \propto Q_a(\mathbf{1}\{1 \in S\}, \mathbf{1}\{2 \in S\})$, with the same positive proportionality constant for all $S$.
• By Lemma 2.4 and Lemma 2.5, $Q_a(1, 0) > \max\{Q_a(0, 1), Q_a(0, 0)\}$; and by the proof of Lemma 2.6 (applied with $w_1 = 1$ and the strict inequality (2.6)), $Q_a(1, 0) > Q_a(1, 1)$.
Consequently, we deduce that $S^* = \{1\}$ uniquely maximizes the objective in (1.4) for any $a$ obeying equation (2.6) with $1 > a_1 > a_2 > 0$. Nonetheless, $\mathbb{E}_a[Y \mid X_{S^*}] \neq \mathbb{E}_a[Y \mid X]$ and $\mathbb{P}_{Y \mid X_{S^*}} \neq \mathbb{P}_{Y \mid X}$ by Lemma 2.1.
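Because the reconstructed $\mathbb{P}_a$ has finitely many atoms, the population quantity $\mathrm{HSIC}(X_S, Y)$ in equation (1.5) can also be evaluated exactly by enumeration, which gives a direct way to inspect the comparison used in the argument above. The sketch below does this for a Gaussian kernel on the features and the inner product kernel on $Y$; the distribution, kernels, and parameter values are all illustrative assumptions of ours.

import numpy as np
from itertools import product

def population_hsic(atoms, probs, kx, ky):
    # Exact population HSIC over a finite support, following Definition 1.1:
    # E[kx*ky] + E[kx]*E[ky] - 2*E[kx(X, X') ky(Y, Y'')] with independent copies.
    t1 = ex = ey = t3 = 0.0
    for (x1, y1), p1 in zip(atoms, probs):
        for (x2, y2), p2 in zip(atoms, probs):
            t1 += p1 * p2 * kx(x1, x2) * ky(y1, y2)
            ex += p1 * p2 * kx(x1, x2)
            ey += p1 * p2 * ky(y1, y2)
            for (x3, y3), p3 in zip(atoms, probs):
                t3 += p1 * p2 * p3 * kx(x1, x2) * ky(y1, y3)
    return t1 + ex * ey - 2.0 * t3

a1, a2 = 0.95, 0.30
atoms, probs = [], []
for x1, x2, y in product([-1, 1], repeat=3):
    # Reconstructed P_a: P(x1, x2, y) = (1/2) * (1 + a1 x1 y)/2 * (1 + a2 x2 y)/2.
    atoms.append((np.array([x1, x2], dtype=float), float(y)))
    probs.append(0.5 * (1 + a1 * x1 * y) / 2 * (1 + a2 * x2 * y) / 2)

gauss = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0)  # kernel on the features
inner = lambda y, yp: y * yp                              # inner product kernel on Y

for S in [(0,), (1,), (0, 1)]:
    val = population_hsic([(x[list(S)], y) for x, y in atoms], probs, gauss, inner)
    print("S =", S, "  HSIC(X_S, Y) =", round(val, 6))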
Proof of Theorem 1.2
Fix the norm $\|\cdot\|_q$ with $q \in [1, \infty]$ and the radius $R > 0$.
Let us define $\psi(t) := \|(t, t)\|_q$ for every $t \geq 0$. Then $\psi$ is strictly monotonically increasing on $[0, \infty)$, and continuous. Let $\bar w := \psi^{-1}(R) > 0$. Given this $\bar w$, we then choose $a = (a_1, a_2)$ with $1 > a_1 > a_2 > 0$ in a way such that
$\frac{\phi(4 \bar w^2)}{\phi(0)} < \frac{a_1^2 - a_2^2}{a_1^2 + a_2^2}.$   (2.7)
This is achievable since the LHS is strictly smaller than $1$.
We consider the continuous version of the HSIC maximization procedure for the data $(X, Y) \sim \mathbb{P}_a$:
$w^* \in \operatorname*{argmax}_{w \in \mathbb{R}^2:\ \|w\|_q \leq R} \mathrm{HSIC}(w \odot X, Y), \qquad \widehat S = \mathrm{supp}(w^*).$
By Lemma 2.2, $w^*$ must also be a maximizer of $w \mapsto Q_a(w_1, w_2)$ over the constraint set $\{w \in \mathbb{R}^2 : \|w\|_q \leq R\}$.
Write $w^* = (w_1^*, w_2^*)$. We may assume $w_1^* \geq 0$ and $w_2^* \geq 0$ by the symmetry property in Lemma 2.3. Note:
• ($X_1$ receives the larger weight and $w_1^* \geq \bar w$.) Given that $a_1 > a_2 > 0$, Lemma 2.4 implies that $Q_a(w_1, w_2)$ monotonically increases strictly with respect to $w_1$ on $[0, \infty)$. Thus, at the maximizer $w^*$, the constraint $\|w^*\|_q = R$ must hold, since otherwise we could strictly increase the objective by increasing $w_1^*$. Additionally, since $a_1 > a_2$, Lemma 2.5 further implies that $w_1^* \geq w_2^*$ at the maximizer.
Hence, we obtain that $\psi(w_1^*) = \|(w_1^*, w_1^*)\|_q \geq \|(w_1^*, w_2^*)\|_q = R$. Since $\psi$ is strictly monotonically increasing and continuous, we obtain that $w_1^* \geq \psi^{-1}(R) = \bar w$.
• ($X_2$ is not selected: $w_2^* = 0$.) Since $w_1^* \geq \bar w$, we have $\phi(4 (w_1^*)^2) \leq \phi(4 \bar w^2)$, so equation (2.7) implies that $w_1^*$ satisfies condition (2.2) with strict inequality. Following the proof of Lemma 2.6, this yields $Q_a(w_1^*, 0) > Q_a(w_1^*, w_2)$ for every $w_2 \neq 0$; since $(w_1^*, 0)$ is feasible, the maximizer must satisfy $w_2^* = 0$.
Consequently, we deduce that $\widehat S = \mathrm{supp}(w^*) = \{1\}$. However, $\mathbb{E}_a[Y \mid X_{\widehat S}] \neq \mathbb{E}_a[Y \mid X]$ and $\mathbb{P}_{Y \mid X_{\widehat S}} \neq \mathbb{P}_{Y \mid X}$ by Lemma 2.1.
2.2 General dimension $p$
Previously in Section 2.1, we have already established Theorem 1.1 and Theorem 1.2 when the dimension $p = 2$. Specifically, we have constructed a distribution $\mathbb{P}_a$ of $(X, Y)$ supported on $\{\pm 1\}^2 \times \{\pm 1\}$ such that the conclusions of Theorem 1.1 and Theorem 1.2 hold.
Now consider an arbitrary dimension $p > 2$. Suppose we are given two kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$ on $\mathbb{R}^p$ and $\mathbb{R}$, respectively, which satisfy Assumptions 1 and 2. In particular, this means that $k_{\mathcal X}(x, x') = \phi(\|x - x'\|_2^2)$ for every $x, x' \in \mathbb{R}^p$ with some function $\phi$ obeying the integral representation in Assumption 1. This kernel on $\mathbb{R}^p$ then induces a kernel $k_2$ on $\mathbb{R}^2$ as $k_2(z, z') = \phi(\|z - z'\|_2^2)$ for all $z, z' \in \mathbb{R}^2$. Because we have shown the validity of Theorem 1.1 and Theorem 1.2 for $p = 2$, we can conclude that for the kernels $k_2$ and $k_{\mathcal Y}$, there exists a probability distribution $\mathbb{Q}$ for $(Z, Y)$ over $\mathbb{R}^2 \times \mathbb{R}$, where $Z = (Z_1, Z_2)$ and $Y \in \{\pm 1\}$, such that the conclusions of Theorem 1.1 and Theorem 1.2 hold under this distribution $\mathbb{Q}$.
Now we use $\mathbb{Q}$ to construct a distribution $\mathbb{P}$ on $\mathbb{R}^p \times \mathbb{R}$ such that Theorem 1.1 and Theorem 1.2 hold for the kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$ on $\mathbb{R}^p$ and $\mathbb{R}$, respectively, under this distribution $\mathbb{P}$. Given random variables $(Z, Y)$ on $\mathbb{R}^2 \times \mathbb{R}$ following distribution $\mathbb{Q}$, we construct random variables $(X, Y)$ on $\mathbb{R}^p \times \mathbb{R}$ as follows: $X_1 = Z_1$, $X_2 = Z_2$, and $X_j = 0$ for $j = 3, \ldots, p$. This construction essentially extends a two-dimensional random vector into a $p$-dimensional random vector by simply padding with zeros, which do not affect the evaluation of the kernel distances. Given that the conclusions of Theorem 1.1 and Theorem 1.2 hold under the distribution $\mathbb{Q}$, it then follows that the conclusions of Theorem 1.1 and Theorem 1.2 extend to the distribution $\mathbb{P}$ we have constructed.
Appendix A Proofs of Technical Lemmas
A.1 Proof of Lemma 2.1
The proof is based on direct calculations. We recognize that, for $x_1 \in \{\pm 1\}$ and $y \in \{\pm 1\}$:
$\mathbb{P}_a(X_1 = x_1 \mid Y = y) = \frac{1 + a_1 x_1 y}{2}.$
Thereby, since $Y$ is balanced and hence $\mathbb{P}_a(X_1 = x_1) = 1/2$, we derive by Bayes rule, for $x_1 \in \{\pm 1\}$ and $y \in \{\pm 1\}$:
$\mathbb{P}_a(Y = y \mid X_1 = x_1) = \frac{1 + a_1 x_1 y}{2},$
which immediately gives the conditional expectation of $Y$ given $X_1$:
$\mathbb{E}_a[Y \mid X_1] = a_1 X_1.$   (A.1)
Similarly, by Bayes rule, we obtain, for $x = (x_1, x_2) \in \{\pm 1\}^2$ and $y \in \{\pm 1\}$:
$\mathbb{P}_a(Y = y \mid X = x) = \frac{(1 + a_1 x_1 y)(1 + a_2 x_2 y)}{2\,(1 + a_1 a_2 x_1 x_2)}.$
As a consequence, we deduce:
$\mathbb{E}_a[Y \mid X] = \frac{a_1 X_1 + a_2 X_2}{1 + a_1 a_2 X_1 X_2}.$   (A.2)
Since $a_1, a_2 \in (0, 1)$, the conditional expectations (A.1) and (A.2) differ on every atom of $\mathbb{P}_a$, and the conditional distributions $\mathbb{P}_{Y \mid X_1}$ and $\mathbb{P}_{Y \mid X}$ differ as well, which establishes Lemma 2.1.
A.2 Proof of Lemma 2.2
The proof can be done by mechanical calculations. Here we will mainly present the tricks we use to simplify the calculation process. Notably, because the proofs for the two cases of $k_{\mathcal Y}$ are similar, below we only present the proof for the first case where $k_{\mathcal Y}(y, y') = h(y - y')$.
Recall the definition:
$\mathrm{HSIC}(w \odot X, Y) = \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\, k_{\mathcal Y}(Y, Y')\big] + \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\big]\, \mathbb{E}\big[k_{\mathcal Y}(Y, Y')\big] - 2\, \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\, k_{\mathcal Y}(Y, Y'')\big],$   (A.3)
where $(X, Y), (X', Y'), (X'', Y'')$ are independent copies from $\mathbb{P}_a$.
Define $\tilde k_{\mathcal Y}(y, y') := k_{\mathcal Y}(y, y') - c$, with the constant $c \in \mathbb{R}$ to be determined later. Then for any $c$, the value of $\mathrm{HSIC}(w \odot X, Y)$ does not change if we replace $k_{\mathcal Y}$ in the RHS of equation (A.3) by $\tilde k_{\mathcal Y}$, i.e.,
$\mathrm{HSIC}(w \odot X, Y) = \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\, \tilde k_{\mathcal Y}(Y, Y')\big] + \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\big]\, \mathbb{E}\big[\tilde k_{\mathcal Y}(Y, Y')\big] - 2\, \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\, \tilde k_{\mathcal Y}(Y, Y'')\big].$   (A.4)
Here comes the simplification. We can choose $c$ such that $\mathbb{E}[\tilde k_{\mathcal Y}(Y, Y') \mid Y] = 0$. Note first:
$\mathbb{E}[k_{\mathcal Y}(Y, Y') \mid Y] = \frac{1}{2}\big(h(Y - 1) + h(Y + 1)\big) = \frac{h(0) + h(2)}{2},$
where the last identity uses the fact that $Y, Y'$ are independent and uniformly sampled from $\{\pm 1\}$ (together with $h(2) = h(-2)$, which follows from the symmetry of the kernel). Thus, if we choose $c = \frac{h(0) + h(2)}{2}$, then two expectations will vanish: $\mathbb{E}[\tilde k_{\mathcal Y}(Y, Y')] = 0$ and $\mathbb{E}[\tilde k_{\mathcal Y}(Y, Y'') \mid Y] = 0$.
By substituting this into equation (A.4), we obtain the identity
$\mathrm{HSIC}(w \odot X, Y) = \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\, \tilde k_{\mathcal Y}(Y, Y')\big] = \frac{h(0) - h(2)}{2}\, \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\, Y Y'\big],$   (A.5)
where the second identity holds because $\tilde k_{\mathcal Y}(y, y') = \frac{h(0) - h(2)}{2}\, y y'$ for every pair $y, y' \in \{\pm 1\}$. This gives a much simpler expression than the original one in equation (A.3).
To further evaluate the RHS of equation (A.5), we compute the conditional expectation:
$\mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X') \mid Y, Y'\big] = \mathbb{E}\big[\phi\big(w_1^2 (X_1 - X_1')^2 + w_2^2 (X_2 - X_2')^2\big) \mid Y, Y'\big].$
Since $X_1, X_1'$ can only take values in $\{\pm 1\}$, $(X_1 - X_1')^2$ can only be $0$ or $4$. Thus $w_1^2 (X_1 - X_1')^2 = 4 w_1^2\, \mathbf{1}\{X_1 \neq X_1'\}$. The same applies to $(X_2 - X_2')^2$. As a result, we obtain the expression:
$\mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X') \mid Y, Y'\big] = \mathbb{E}\big[\phi\big(4 w_1^2\, \mathbf{1}\{X_1 \neq X_1'\} + 4 w_2^2\, \mathbf{1}\{X_2 \neq X_2'\}\big) \mid Y, Y'\big].$
Given $(Y, Y')$, the random variables $X_1, X_2, X_1', X_2'$ are mutually independent by definition of $\mathbb{P}_a$, and thus $\mathbf{1}\{X_1 \neq X_1'\}$ is independent of $\mathbf{1}\{X_2 \neq X_2'\}$ conditionally on $(Y, Y')$. Furthermore, we have, for $i = 1, 2$,
$\mathbb{P}_a(X_i \neq X_i' \mid Y = Y') = \frac{1 - a_i^2}{2}, \qquad \mathbb{P}_a(X_i \neq X_i' \mid Y \neq Y') = \frac{1 + a_i^2}{2}.$
As a result, we obtain $\mathbb{E}[k_{\mathcal X}(w \odot X, w \odot X') \mid Y, Y']$ as an explicit mixture of $\phi(0)$, $\phi(4 w_1^2)$, $\phi(4 w_2^2)$, and $\phi(4 w_1^2 + 4 w_2^2)$, with mixture weights depending only on whether $Y = Y'$.
Finally, we go back to equation (A.5). By recognizing the fact that
$\mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X')\, Y Y'\big] = \frac{1}{2}\, \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X') \mid Y = Y'\big] - \frac{1}{2}\, \mathbb{E}\big[k_{\mathcal X}(w \odot X, w \odot X') \mid Y \neq Y'\big],$
and the conditional expectations computed above, we deduce the expression:
$\mathrm{HSIC}(w \odot X, Y) = \frac{h(0) - h(2)}{8}\Big\{(a_1^2 + a_2^2)\big[\phi(0) - \phi(4 w_1^2 + 4 w_2^2)\big] + (a_1^2 - a_2^2)\big[\phi(4 w_2^2) - \phi(4 w_1^2)\big]\Big\} = \frac{h(0) - h(2)}{8}\, Q_a(w_1, w_2).$
This completes the proof.
References
- [CLLR23] Yunlu Chen, Yang Li, Keli Liu, and Feng Ruan, Kernel learning in ridge regression “automatically” yields exact low rank solution, arXiv preprint arXiv:2310.11736 (2023).
- [Cov99] Thomas M Cover, Elements of information theory, John Wiley & Sons, 1999.
- [CSWJ17] Jianbo Chen, Mitchell Stern, Martin J Wainwright, and Michael I Jordan, Kernel feature selection via conditional covariance minimization, Advances in neural information processing systems 30 (2017).
- [FBJ09] Kenji Fukumizu, Francis R Bach, and Michael I Jordan, Kernel dimension reduction in regression, The Annals of Statistics 37 (2009), no. 4, 1871–1905.
- [FGSS07] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf, Kernel measures of conditional dependence, Advances in Neural Information Processing Systems, vol. 20, 2007, pp. 489–496.
- [GBSS05] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf, Measuring statistical dependence with Hilbert-Schmidt norms, International conference on algorithmic learning theory, Springer, 2005, pp. 63–77.
- [GT18] S Geeitha and M Thangamani, Incorporating EBO-HSIC with SVM for gene selection associated with cervical cancer classification, Journal of medical systems 42 (2018), no. 11, 225.
- [HTF09] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman, The elements of statistical learning: data mining, inference, and prediction, vol. 2, Springer, 2009.
- [MDF10] Mahdokht Masaeli, Jennifer G Dy, and Glenn M Fung, From transformation-based dimensionality reduction to feature selection, Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 751–758.
- [MKB+23] Kensuke Mitsuzawa, Motonobu Kanagawa, Stefano Bortoli, Margherita Grossi, and Paolo Papotti, Variable selection in maximum mean discrepancy for interpretable distribution comparison, arXiv preprint arXiv:2311.01537 (2023).
- [SBB+07] Le Song, Justin Bedo, Karsten M Borgwardt, Arthur Gretton, and Alex Smola, Gene selection via the BAHSIC family of algorithms, Bioinformatics 23 (2007), no. 13, i490–i498.
- [Sch38] Isaac J Schoenberg, Metric spaces and completely monotone functions, Annals of Mathematics (1938), 811–841.
- [SFL11] Bharath K Sriperumbudur, Kenji Fukumizu, and Gert RG Lanckriet, Universality, characteristic kernels and RKHS embedding of measures., Journal of Machine Learning Research 12 (2011), no. 7.
- [SR09] Gábor J Székely and Maria L Rizzo, Brownian distance covariance, The Annals of Applied Statistics (2009), 1236–1265.
- [SSG+07] Le Song, Alex Smola, Arthur Gretton, Karsten M Borgwardt, and Justin Bedo, Supervised feature selection via dependence estimation, Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 823–830.
- [SSG+12] Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt, Feature selection via dependence maximization, Journal of Machine Learning Research 13 (2012), no. 5.
- [WDL21] Tinghua Wang, Xiaolu Dai, and Yuze Liu, Learning with Hilbert–Schmidt independence criterion: A review and new perspectives, Knowledge-based systems 234 (2021), 107567.
- [WDX23] Jie Wang, Santanu S Dey, and Yao Xie, Variable selection for kernel two-sample tests, arXiv preprint arXiv:2302.07415 (2023).
- [YJS+14] Makoto Yamada, Wittawat Jitkrittum, Leonid Sigal, Eric P Xing, and Masashi Sugiyama, High-dimensional feature selection by feature-wise kernelized lasso, Neural computation 26 (2014), no. 1, 185–207.