
A Transparent and Nonlinear Method for Variable Selection

Keyao Wang [email protected] Huiwen Wang [email protected] Jichang Zhao [email protected] Lihong Wang [email protected] School of Economics and Management, Beihang University, Beijing, China Beijing Key Laboratory of Emergency Support Simulation Technologies of City Operations, Beijing, China Key Laboratory of Complex System Analysis, Management and Decision (Beihang University), Ministry of Education, China National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China
Abstract

Variable selection is a procedure for identifying the truly important predictors among the inputs. Complex nonlinear dependencies and strong coupling pose great challenges for variable selection in high-dimensional data. In addition, real-world applications place increasing demands on the interpretability of the selection process. A pragmatic approach should not only retain the most predictive covariates, but also provide ample and easy-to-understand grounds for removing certain covariates. In view of these requirements, this paper puts forward an approach for transparent and nonlinear variable selection. To transparently decouple the information within the input predictors, a three-step heuristic search is designed, via which the input predictors are grouped into four subsets: the relevant predictors to be selected, and the uninformative, redundant, and conditionally independent predictors to be removed. A nonlinear partial correlation coefficient is introduced to better identify predictors that have a nonlinear functional dependence with the response. The proposed method is model-free, and the selected subset can serve as competent input for commonly used predictive models. Experiments demonstrate the superior performance of the proposed method against state-of-the-art baselines in terms of prediction accuracy and model interpretability.

keywords:
Variable selection, High-dimensional, Interpretation, Nonlinear relevance

1 Introduction

Predictive modeling often encounters high-dimensional data (Yin et al., 2022; Hossny et al., 2020; Chaudhari & Thakkar, 2023; Lu et al., 2023), and selecting the truly important predictors is the key to achieving accurate and reliable predictions (Guyon & Elisseeff, 2003). However, variable selection often faces great challenges from the complex nonlinearity and strong coupling that are widespread in high-dimensional data (Hastie et al., 2001). Variable selection methods, e.g., feature screening (Fan & Lv, 2008; Li et al., 2012) and stepwise selection (Efroymson, 1960; Buhlmann et al., 2010), have been devoted to selecting the subset of predictors to which the response is most related, with the expectation of improving prediction accuracy and reducing computational cost. Although improving the transparency of variable selection has important implications for enhancing model interpretability (Murdoch et al., 2019; Rudin et al., 2022), it has seldom been addressed in high-dimensional prediction.

How to transparently decouple the information contained in the inputs is an important, yet easily overlooked, concern when designing variable selection methods. Collected on the basis of limited experience, the input set of predictors in high-dimensional data is usually “dirty” (Cai et al., 2018). Only a few predictors may be truly relevant for predicting the response, and the rest may lack information, convey the same information as other predictors, or carry information irrelevant to the response (Wan et al., 2022). To achieve transparent information decoupling, two issues need further discussion: (i) how to effectively select the truly important predictors, and (ii) how to transparently remove predictors that are of no use for prediction.

When selecting the predictors which are useful for predicting the response, most prevailing approaches for variable selection usually assume that the response and predictors follow some simple forms of functional dependencies, e.g., linear (Tibshirani, 1996), monotonic (Zhu et al., 2011), or additive (Marra & Wood, 2011). In reality, however, there are more diverse and complicated nonlinear forms of dependencies, e.g., nonmonotonic or even oscillatory functional dependencies between predictors and response (Chatterjee, 2021), and the interactions among predictors (Wan et al., 2021). Traditional approaches have difficulty in effectively identifying and selecting such complex nonlinear dependencies. The omission of some key nonlinear relevant predictors can greatly damage the accuracy of predictive modeling (Azadkia & Chatterjee, 2021).

When deleting the predictors that are of no use for predicting the response, prevailing approaches mainly rank the predictors by their correlations with the response and divide them into two categories, i.e., those relevant to the response and those independent of it (Song et al., 2017; Dessì & Pes, 2015). Most approaches assume that the inputs contain no uninformative predictors or that there is no multicollinearity among predictors (Fan et al., 2020). However, uninformative predictors and collinearity are widespread in real-world applications (Li et al., 2017a). It is difficult for classical approaches to categorize and remove the different types of predictors separately, which leads to an opaque selection process and diminishes their efficiency and interpretability.

To alleviate these challenges in selecting and deleting predictors, this article constructs a Transparent and Nonlinear Variable Selection (TNVS) method for high-dimensional data. The input predictors are divided into four nonoverlapping subsets to achieve Transparent Information Decoupling (TID), i.e., the relevant predictors to be selected, and the uninformative, redundant and conditionally independent predictors to be deleted. The transparent selection and deletion improve predictive accuracy and model interpretability. The main contributions are as follows.

1. Equipped with the recently proposed nonlinear partial correlation, TNVS is able to select predictors with a diversity of complex nonlinear relevance to the response, including nonmonotonic or oscillatory functional dependence between the response and predictors, and interactions among predictors.

2. Information entropy is adopted to filter out uninformative predictors, Gram-Schmidt orthogonalization is adopted to remove redundant predictors that are collinear with the relevant predictors, and the nonlinear partial correlation coefficient is adopted to remove conditionally independent predictors. In this way, the predictors are classified into different types, and the reasons for removing certain predictors are clearly indicated.

3. The effectiveness and interpretability of the proposed method are demonstrated against state-of-the-art baselines. The proposed method categorizes the predictors with high accuracy on high-dimensional nonlinear simulations, and the selected subset of predictors improves out-of-sample predictions on real datasets.

The remainder of the paper is organized as follows. Section 2 reviews some prevailing methods in related realms. Section 3 describes the concepts involved in information decoupling and the proxy measures. Section 4 presents the search framework of the proposed TNVS. Section 5 demonstrates the effectiveness and interpretability of the proposed method on simulation problems. The performance of the proposed method on real data applications is discussed in Section 6, including the predictive effectiveness of the selected subset, model interpretability, post hoc interpretability, and tuning parameter stability. Conclusions and future work are summarized in Section 7.

2 Related work

Since our purpose is to derive a transparent variable selection method for high-dimensional data with nonlinear functional dependencies, this section reviews the most relevant methods, including feature screening approaches based on correlations, feature screening approaches based on partial correlations, and stepwise selection methods based on partial correlations.

Correlation coefficients, and other statistical measures of the dependence between two variables, have been widely introduced into feature screening to identify nonlinear relevance between the response and a predictor in high-dimensional data. Some feature screening methods adopt nonlinear correlations, e.g., generalized correlation (Hall & Miller, 2009), distance correlation (Li et al., 2012), ball correlation (Pan et al., 2019), and projection correlation (Liu et al., 2022). Others attempt to enhance Pearson’s correlation (Zhu et al., 2011) and the leverage score of singular value decompositions (Zhong et al., 2021) through the slicing scheme and the inverse regression idea (Li, 1991). However, it is difficult for these methods to recruit predictors that have weak marginal utility but are jointly strongly correlated with the response. In this study, we adopt partial correlation rather than marginal correlation to better detect such interacting covariates.

Feature screening methods based on partial correlations (PCs) are able to avoid the false omission of some important predictors (Barut et al., 2016; Wang et al., 2018). Given a set of known key variables, these methods can detect and select the important predictors on which the response is strongly dependent through a one-pass screening and ordering algorithm. Meanwhile, the predictors of which the response is conditionally independent are screened out. However, given the newly enlarged selected predictor set, some removed predictors may become jointly correlated with the response. This indicates that if complex interactions exist among the predictors, the selection may still be insufficient to capture all the important predictors. In addition, such methods require prior knowledge of key variables, which is usually unobtainable in practical problems. This stimulates the design of stepwise methods for more sufficient selection and more general initialization.

Stepwise variable selection based on PCs can be a substitute when the predictors have complex interactions or there is little prior knowledge of the key variables, e.g., the PC-simple algorithm (Buhlmann et al., 2010) and the thresholded partial correlation (TPC) (Li et al., 2017b). Nevertheless, a limitation of PC-simple and TPC is that they are both built on the linear partial correlation coefficient, and thus they are valid only in linear regression models. This motivates us to introduce nonlinear PCs into stepwise methods, so that they can better identify the nonlinear conditional relevance between the response and predictors and be extended to more general settings.

A nonlinear partial correlation coefficient, namely the Conditional Dependence Coefficient (CODEC) (Azadkia & Chatterjee, 2021), is a recent significant development in this field. CODEC is a fully nonparametric measure of the nonlinear conditional dependence between two variables given a set of other variables. Based on the CODEC, a forward selection method was further presented, namely Feature Ordering by Conditional Independence (FOCI). FOCI can identify conditionally dependent predictors under a diversity of complicated associations. However, the unimportant predictors are evaluated in every iteration and not removed until the end of the search, which is time-consuming. Besides, the removed subset is treated as a whole, although it may include various types of predictors, which is hard to interpret. Our rationale is that by introducing multiple time-saving measures to remove the unimportant predictors during the search, the subset to be evaluated shrinks much faster than in FOCI, which increases efficiency. Moreover, the removed subset can be divided into distinct categories, which increases interpretability.

The related methods are summarized in Table 1. Prevailing feature screening methods based on correlations are listed, including Sure Independence Screening (SIS), Sure Independent Ranking and Screening (SIRS), Distance Correlation based Sure Independence Screening (DC-SIS), and the Weighted Leverage Score (WLS). Stepwise methods based on partial correlations are also considered, including PC-simple, TPC, and FOCI. Table 1 shows that no existing nonlinear method can both select nonmonotonic nonlinear relevance and remove uninformative and redundant predictors. A fully interpretable method that can transparently select and delete certain types of predictors is highly needed. These observations motivate us to present a transparent manner of decoupling complex information in high-dimensional data and to design an effective and interpretable scheme for variable selection. To our knowledge, this is the first report to design a variable selection method directed by transparent information decoupling.

Table 1: A selective list of the variable selection approaches related to the proposed method.
Category Method Nonlinear relevant Redundant Uninformative Interpretable
Correlation, feature screening SIS (Fan & Lv, 2008) × × ✓ Partially
SIRS (Zhu et al., 2011) Monotonic × ✓ Partially
DC-SIS (Li et al., 2012) Monotonic × ✓ Partially
WLS (Zhong et al., 2021) ✓ × ✓ Partially
Partial correlation, stepwise PC-simple (Buhlmann et al., 2010) × ✓ × Partially
TPC (Li et al., 2017b) × ✓ × Partially
FOCI (Azadkia & Chatterjee, 2021) ✓ ✓ (inefficient) × Partially
Proposed TNVS ✓ ✓ ✓ Fully

3 Information decoupling and the proxy measures

The proposed variable selection can be regarded as an information decoupling process on the input set of predictors. The input set is transparently divided into four disjoint subsets, i.e., the subset $\mathcal{S}$ that is relevant to the response, the uninformative subset $\mathcal{A}_1$, the redundant subset $\mathcal{A}_2$, and the conditionally independent subset $\mathcal{A}_3$, as shown in Fig. 1. In this section, the four subsets in information decoupling are defined, and their corresponding measures in nonlinear supervised learning are described. An example of the four types of predictors is given in Appendix A1.

Figure 1: The input predictors are transparently grouped into four disjoint parts. The index set of the inputs is $\mathcal{X}=\{1,2,\cdots,p\}$, which is divided into the relevant subset $\mathcal{S}$, the uninformative subset $\mathcal{A}_1$, the redundant subset $\mathcal{A}_2$, and the conditionally independent subset $\mathcal{A}_3$.

3.1 The relevant subset and its measure

Let $\mathbf{Y}=(y_1,\cdots,y_n)^{\prime}$ be the response, and $\mathbf{X}=(\mathbf{X}_1,\cdots,\mathbf{X}_p)$ be the $p$-dimensional input predictors, where $\mathbf{X}_j=(x_{1j},\cdots,x_{nj})^{\prime}$, $j=1,\cdots,p$, and $n$ is the sample size. Let $\mathcal{X}=\{1,2,\cdots,p\}$ be the index set of the input predictors. For any subset of indices $\mathcal{S}\subseteq\mathcal{X}$, let $\mathbf{X}_{\mathcal{S}}$ be the data matrix composed of all the $\mathbf{X}_j$ with $j\in\mathcal{S}$. For an $\mathcal{S}$, if there is a function $f(\cdot)$ such that $\mathbf{Y}=f(\mathbf{X}_{\mathcal{S}})+\varepsilon$, where $\varepsilon$ is a stochastic error, then $\mathcal{S}$ stands for the index set of the true model, which is named the relevant subset in this paper. In high-dimensional data, $p$ is close to or even larger than $n$. In such cases, a sparsity assumption usually holds that the dimension of $\mathbf{X}_{\mathcal{S}}$ is much less than $p$.

In real problems, $\mathbf{Y}$ and $\mathbf{X}_{\mathcal{S}}$ can be linearly correlated, or they can be associated through a variety of complex nonlinear functions. To date, handling complex nonlinear correlations remains a great challenge (Fan et al., 2020). The primary goal of the proposed variable selection method is to obtain the correlated subset $\mathcal{S}$ from the input set $\mathcal{X}$. For a given group of predictors $\mathbf{X}_{\mathcal{G}}$, $\mathcal{G}\subseteq\mathcal{X}$, and an unknown predictor $\mathbf{X}_j$, $j\in\mathcal{X}\backslash\mathcal{G}$, the Conditional Dependence Coefficient (CODEC) (Azadkia & Chatterjee, 2021) can measure the nonlinear conditional correlation between $\mathbf{Y}$ and $\mathbf{X}_j$ given $\mathbf{X}_{\mathcal{G}}$, as well as the interaction between $\mathbf{X}_j$ and $\mathbf{X}_{\mathcal{G}}$ in explaining $\mathbf{Y}$. One of the most important features of the CODEC is that it converges to a limit in $[0,1]$. Given $\mathbf{X}_{\mathcal{G}}$, the limit is $0$ if and only if $\mathbf{Y}$ and $\mathbf{X}_j$ are conditionally independent, and is $1$ if and only if $\mathbf{Y}$ is almost surely equal to a measurable function of $\mathbf{X}_j$. If $\mathcal{G}\neq\emptyset$, let $\mathcal{G}=\{g_1,\cdots,g_q\}$, where $q\geq 1$, and let $\mathbf{x}_i^{\mathcal{G}}=(x_{ig_1},\cdots,x_{ig_q})$ be the $i$-th observation of $\mathbf{X}_{\mathcal{G}}$, $i=1,\cdots,n$. CODEC is calculated as

\[ T_n=T_n(\mathbf{Y},\mathbf{X}_j\mid\mathbf{X}_{\mathcal{G}})=\frac{\sum_{h=1}^{n}\left(\min\{R_h,R_{M(h)}\}-\min\{R_h,R_{N(h)}\}\right)}{\sum_{h=1}^{n}\left(R_h-\min\{R_h,R_{N(h)}\}\right)},\quad \textrm{if}\ \mathcal{G}\neq\emptyset, \]  (1)

where $R_h$ denotes the rank of observation $y_h$, i.e., the number of $i$ such that $y_i\leq y_h$. $M(h)$ denotes the index $i$ such that $(\mathbf{x}_i^{\mathcal{G}},x_{ij})$ is the nearest neighbor of $(\mathbf{x}_h^{\mathcal{G}},x_{hj})$ with respect to the Euclidean metric on $\mathbb{R}^{q+1}$, $N(h)$ denotes the index $i$ such that $\mathbf{x}_i^{\mathcal{G}}$ is the nearest neighbor of $\mathbf{x}_h^{\mathcal{G}}$ in $\mathbb{R}^{q}$, and ties are broken uniformly at random for both $M(h)$ and $N(h)$. $R_{M(h)}$ denotes the rank of $y_{M(h)}$, i.e., the number of $i$ such that $y_i\leq y_{M(h)}$, and $R_{N(h)}$ denotes the rank of $y_{N(h)}$.

The CODEC can also measure the unconditional correlation between $\mathbf{Y}$ and a predictor $\mathbf{X}_j$, $j\in\mathcal{X}$, in the absence of any given predictors, i.e., when $\mathcal{G}=\emptyset$. In this case, the CODEC is interpreted as an unconditional dependence coefficient and is calculated as

\[ T_n=T_n(\mathbf{Y},\mathbf{X}_j)=\frac{\sum_{h=1}^{n}\left(n\min\{R_h,R_{M(h)}\}-L_h^{2}\right)}{\sum_{h=1}^{n}L_h\left(n-L_h\right)},\quad \textrm{if}\ \mathcal{G}=\emptyset, \]  (2)

where $R_h$ denotes the rank of $y_h$, $M(h)$ denotes the index $i$ such that $x_{ij}$ is the nearest neighbor of $x_{hj}$ (with ties broken uniformly at random), $R_{M(h)}$ denotes the rank of $y_{M(h)}$, and $L_h$ denotes the number of $i$ such that $y_i\geq y_h$.

The calculations above require continuous $\mathbf{Y}$, $\mathbf{X}_{\mathcal{G}}$ and $\mathbf{X}_j$, but the CODEC can also be applied to measure the correlations between discrete predictors if ties are broken at random. If the denominator of $T_n(\mathbf{Y},\mathbf{X}_j\mid\mathbf{X}_{\mathcal{G}})$ is $0$, the CODEC is undefined. In that case, $\mathbf{Y}$ is almost surely equal to a measurable function of $\mathbf{X}_{\mathcal{G}}$, and $\mathbf{X}_{\mathcal{G}}$ can be regarded as sufficient for predicting $\mathbf{Y}$.
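To make the estimator concrete, the following is a minimal Python sketch of Eqs. (1) and (2). It is an illustration under simplifying assumptions (nearest neighbors via a KD-tree, no random tie breaking, no exact duplicate rows), not the authors' released implementation, and the function names are ours.

```python
import numpy as np
from scipy.spatial import cKDTree


def _ranks(y):
    # R_h = number of i with y_i <= y_h
    return np.sum(y[None, :] <= y[:, None], axis=1)


def _nearest(points):
    # index of each row's nearest neighbor, excluding the row itself
    _, idx = cKDTree(points).query(points, k=2)
    return idx[:, 1]


def codec(y, xj, x_g=None):
    """T_n(Y, X_j | X_G) as in Eq. (1), or the unconditional T_n(Y, X_j) of Eq. (2)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    xj = np.asarray(xj, dtype=float).reshape(n, -1)
    r = _ranks(y)
    if x_g is None:
        l = np.sum(y[None, :] >= y[:, None], axis=1)   # L_h
        m = _nearest(xj)                               # NN of x_hj
        num = np.sum(n * np.minimum(r, r[m]) - l ** 2)
        den = np.sum(l * (n - l))
    else:
        x_g = np.asarray(x_g, dtype=float).reshape(n, -1)
        m = _nearest(np.hstack([x_g, xj]))             # NN of (x_h^G, x_hj) in R^{q+1}
        nn = _nearest(x_g)                             # NN of x_h^G in R^q
        num = np.sum(np.minimum(r, r[m]) - np.minimum(r, r[nn]))
        den = np.sum(r - np.minimum(r, r[nn]))
    return np.nan if den == 0 else num / den           # undefined when the denominator is 0
```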

3.2 The uninformative subset and its measure

High-dimensional data usually contain uninformative predictors because of restrictions in data collection. Removing these predictors has limited influence on explaining the response (Li et al., 2017a). For example, in the study of handwritten digits, pixels in the marginal areas may have little explanatory power and can be ignored in the construction of deep neural networks (Chen et al., 2021). Real datasets do not always contain uninformative predictors, since the predictors are usually carefully chosen based on expert experience before being collected to save storage space. However, if we have little prior knowledge and obtain a dataset with a large number of uninformative predictors, removing them beforehand can greatly improve the efficiency of variable selection.

In this paper, Shannon entropy is used as the measure to distinguish the uninformative predictors from the remaining predictors. Shannon entropy is a function of the probabilities of all possible values of a predictor and represents the expected amount of information contained in the predictor (Gray, 2011). Consider a discrete predictor $\mathbf{X}_j$ that takes a finite number of $c$ possible values $x_{j,k}\in\{x_{j,1},\cdots,x_{j,c}\}$ with corresponding probabilities $p_{j,k}\in\{p_{j,1},\cdots,p_{j,c}\}$. Its entropy $H(\mathbf{X}_j)$ is defined as

\[ H(\mathbf{X}_j)=-\sum_{k=1}^{c}p_{j,k}\ln{p_{j,k}}. \]  (3)

In general, $0\leq H(\mathbf{X}_j)\leq\ln{c}$. If the distribution of $\mathbf{X}_j$ is highly biased toward one of the possible values $x_{j,k}$, $H(\mathbf{X}_j)$ is low. In the extreme case $H(\mathbf{X}_j)=0$, $\mathbf{X}_j$ is defined as an uninformative predictor, i.e., the quantity of information contained in $\mathbf{X}_j$ is $0$. Shannon entropy can be applied only to discrete predictors, so data discretization is required beforehand for continuous predictors (Brown et al., 2012).
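As a simple illustration, the uninformative check can be sketched as follows; the equal-width binning used for continuous predictors is one possible discretization choice, not necessarily the one used in the paper, and the function names are ours.

```python
import numpy as np


def shannon_entropy(x, n_bins=20):
    """H(X_j) of Eq. (3); continuous predictors are first discretized by equal-width binning."""
    x = np.asarray(x)
    if np.issubdtype(x.dtype, np.floating):
        x = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins))
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def is_uninformative(x, alpha1=0.01):
    # UinS(j) = H(X_j); flag the predictor when its entropy falls below the threshold alpha_1
    return shannon_entropy(x) < alpha1
```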

3.3 The redundant subset and its measure

If a candidate predictor and a subset of relevant predictors are collinear, keeping both of them may affect the robustness of model estimation (Yu & Liu, 2004). Although there is a consensus on the adverse effects of multicollinearity (Fan & Lv, 2010), most variable selection methods manage to avoid discussing the issue. In this paper, such predictors are named redundant predictors, and are measured separately from other predictors to be deleted.

Gram-Schmidt Orthogonalization (GSO) is adopted in this paper to decompose and identify the information contained in the predictor set, and further to measure multicollinearity among predictors (Wang et al., 2020; Lyu et al., 2017). The Gram-Schmidt theorem in Euclidean space indicates that for an index subset of predictors $\mathcal{G}=\{g_1,\cdots,g_q\}\subseteq\mathcal{X}$ and the corresponding matrix $\mathbf{X}_{\mathcal{G}}=(\mathbf{X}_{g_1},\cdots,\mathbf{X}_{g_q})$, if $\mathbf{X}_{\mathcal{G}}$ is linearly independent, one can always construct an orthogonal basis $\mathbf{Z}_{\mathcal{G}}=(\mathbf{Z}_{g_1},\cdots,\mathbf{Z}_{g_q})$ via GSO, where $\mathbf{Z}_{\mathcal{G}}$ is a linear combination of $\mathbf{X}_{\mathcal{G}}$ and spans the same space as $\mathbf{X}_{\mathcal{G}}$.

For each $\mathbf{X}_j$, $j\in\mathcal{G}$, the orthogonalized variable $\mathbf{Z}_j$ is

\[ \mathbf{Z}_{g_1}=\mathbf{X}_{g_1},\qquad \mathbf{Z}_j=\mathbf{X}_j-\sum_{k=g_1}^{g_{j-1}}{\frac{\langle\mathbf{X}_j,\mathbf{Z}_k\rangle}{\|\mathbf{Z}_k\|_2^{2}}\mathbf{Z}_k},\quad \forall j=g_2,\cdots,g_q, \]  (4)

where $\langle\mathbf{X}_j,\mathbf{Z}_k\rangle$ is the inner product of $\mathbf{X}_j$ and $\mathbf{Z}_k$, and $\|\mathbf{Z}_k\|_2$ is the $L^{2}$ norm of $\mathbf{Z}_k$.

Let $q$ be the rank of $\mathbf{X}=(\mathbf{X}_1,\cdots,\mathbf{X}_p)$. Suppose $q<p$ and $\mathbf{X}_{\mathcal{G}}$ is linearly independent, where the index subset $\mathcal{G}\subseteq\mathcal{X}$; then an orthogonal basis $\mathbf{Z}=\mathbf{Z}_{\mathcal{G}}$ of $\mathbf{X}$ can be obtained using GSO. For any $j\in\mathcal{X}\backslash\mathcal{G}$, one can obtain the orthogonal variable $\mathbf{Z}_j$ of $\mathbf{X}_j$ with GSO based on $\mathbf{Z}$. If the variance of $\mathbf{Z}_j$ satisfies $Var(\mathbf{Z}_j)=0$, all the information of $\mathbf{X}_j$ is already contained in $\mathbf{X}_{\mathcal{G}}$, and accordingly, $\mathbf{X}_j$ is a redundant predictor, which is collinear with $\mathbf{X}_{\mathcal{G}}$.
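A minimal sketch of this redundancy check is given below, assuming the classical Gram-Schmidt recursion of Eq. (4); variable and function names are illustrative.

```python
import numpy as np


def gso_basis(X_S):
    """Orthogonalize the columns of X_S (assumed linearly independent) as in Eq. (4)."""
    Z = np.array(X_S, dtype=float, copy=True)
    for j in range(Z.shape[1]):
        for k in range(j):
            Z[:, j] -= (Z[:, j] @ Z[:, k]) / (Z[:, k] @ Z[:, k]) * Z[:, k]
    return Z


def redundancy_score(xj, Z):
    """RedS(j | S): variance of X_j after orthogonalization against the basis Z = Z_S."""
    z = np.array(xj, dtype=float, copy=True)
    for k in range(Z.shape[1]):
        z -= (z @ Z[:, k]) / (Z[:, k] @ Z[:, k]) * Z[:, k]
    return float(np.var(z))


def is_redundant(xj, Z, alpha3=0.01):
    # X_j is flagged as collinear with X_S when the residual variance falls below alpha_3
    return redundancy_score(xj, Z) < alpha3
```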

3.4 The conditionally independent subset and its measure

In addition to uninformative and redundant predictors, there is another group of predictors that needs to be deleted. Their most obvious feature is that $\mathbf{Y}$ is conditionally independent of them given the relevant subset $\mathcal{S}$. In this paper, these predictors are called conditionally independent predictors. Modeling with conditionally independent predictors does more harm than good: not only does it damage interpretability, it may also reduce prediction accuracy. The CODEC can be used to identify conditionally independent predictors. In particular, given $\mathcal{S}$, if $T_n(\mathbf{Y},\mathbf{X}_j\mid\mathbf{X}_{\mathcal{S}})$ is close to $0$ for a predictor $\mathbf{X}_j$, $j\in\mathcal{X}\backslash\mathcal{S}$, then $\mathbf{X}_j$ can be considered a conditionally independent predictor.

4 The Proposed Transparent Variable Selection for Nonlinear and High-dimensional Data

In this section, we propose a Transparent and Nonlinear Variable Selection (TNVS) approach for high-dimensional data. Denote by $\mathcal{V}\subseteq\mathcal{X}$ the index set of the candidate predictors, which contains the indices of all predictors that are promising for explaining the response. A three-step heuristic search is established to transparently separate the candidate set into the subset to be selected and those to be deleted. The indices of the selected predictors are reserved in the relevant subset $\mathcal{S}$, and those of the removed predictors are respectively categorized into the uninformative subset $\mathcal{A}_1$, the redundant subset $\mathcal{A}_2$, and the conditionally independent subset $\mathcal{A}_3$. A flowchart of the proposed TNVS is illustrated in Fig. 2.

Figure 2: A flowchart of the proposed TNVS.

If we have no prior knowledge of the key variables, the initial $\mathcal{V}$ is set as the index set of input predictors $\mathcal{X}$, and the initial $\mathcal{S}$, $\mathcal{A}_1$, $\mathcal{A}_2$, and $\mathcal{A}_3$ are set as empty sets. The heuristic search of TNVS contains three steps, i.e., prefiltering, forward selection, and batch deletion. In the prefiltering step, uninformative predictors containing little information are identified and removed. Forward selection and batch deletion are then iterated alternately. Every time a relevant predictor is selected, a deletion step is performed to remove redundant predictors that are collinear with all the selected predictors.

In the prefiltering step, uninformative predictors are distinguished from other predictors using an Uninformative Score (UinS), and the indices of these uninformative predictors are added to $\mathcal{A}_1$ and excluded from $\mathcal{V}$. In TNVS, the Shannon entropy of a predictor serves as the UinS. For an index $j\in\mathcal{V}$, if $\textrm{UinS}(j)=H(\mathbf{X}_j)<\alpha_1$, where $\alpha_1$ is the uninformative threshold, $\mathbf{X}_j$ is regarded as an uninformative predictor.

The forward selection step identifies the predictor most relevant to the response given $\mathcal{S}$, and places its index in $\mathcal{S}$ to form a new relevant subset. The degree of relevance between any $\mathbf{X}_j$ ($j\in\mathcal{V}$) and $\mathbf{Y}$ is determined by the Relevance Score (RelS). In TNVS, the CODEC is adopted as the RelS. Given $\mathcal{S}$, the RelS of any index $j$ is

\[ \textrm{RelS}(j,\mathbf{Y}\mid\mathcal{S})=\begin{cases} T_n(\mathbf{Y},\mathbf{X}_j), & \textrm{if}\ \mathcal{S}=\emptyset\\ T_n(\mathbf{Y},\mathbf{X}_j\mid\mathbf{X}_{\mathcal{S}}), & \textrm{otherwise.} \end{cases} \]  (5)

The index with the maximum RelS is selected from $\mathcal{V}$, and the corresponding predictor is the one most relevant to $\mathbf{Y}$ considering its interaction with $\mathbf{X}_{\mathcal{S}}$.

The batch deletion step identifies multicollinearity in $\mathcal{V}$ given the relevant subset $\mathcal{S}$. Every time forward selection is performed, all redundant predictors that are collinear with the new $\mathbf{X}_{\mathcal{S}}$ are detected among the candidates. Their indices are added to $\mathcal{A}_2$ and removed from $\mathcal{V}$. In TNVS, GSO is adopted to identify the redundant predictors given $\mathcal{S}$. First, the orthogonal basis $\mathbf{Z}_{\mathcal{S}}$ of $\mathbf{X}_{\mathcal{S}}$ is obtained. Then, GSO is performed on all candidate predictors $\mathbf{X}_j$ ($j\in\mathcal{V}$) based on $\mathbf{Z}_{\mathcal{S}}$ to determine which candidates are collinear with $\mathbf{X}_{\mathcal{S}}$. The Redundancy Score (RedS) of $j$ given $\mathcal{S}$ is the variance of the orthogonalized variable, $Var(\mathbf{Z}_j)$. If $\textrm{RedS}(j\mid\mathcal{S})<\alpha_3$, where $\alpha_3$ is the redundant threshold, $\mathbf{X}_j$ is regarded as a redundant predictor which is collinear with $\mathbf{X}_{\mathcal{S}}$.

Data: the matrix of predictors $\mathbf{X}=(\mathbf{X}_1,\cdots,\mathbf{X}_p)$, their index set $\mathcal{X}=\{1,\cdots,p\}$, the response vector $\mathbf{Y}$
Input: uninformative threshold $\alpha_1$, relevant threshold $\alpha_2$, redundant threshold $\alpha_3$, maximum model size $d_{\max}$ (optional)
Output: $\mathcal{S}$ to be selected, and $\mathcal{A}_1$, $\mathcal{A}_2$, and $\mathcal{A}_3$ to be removed
Set the initial candidate feature subset $\mathcal{V}=\mathcal{X}$, and the initial $\mathcal{S}$, $\mathcal{A}_1$, $\mathcal{A}_2$, $\mathcal{A}_3$ as $\emptyset$;
for all $j\in\mathcal{V}$ do   // Obtain the uninformative subset $\mathcal{A}_1$
    if $\mathrm{UinS}(j)<\alpha_1$ then $\mathcal{V}=\mathcal{V}\backslash\{j\}$; $\mathcal{A}_1=\mathcal{A}_1\cup\{j\}$;
end for
while $\mathcal{V}\neq\varnothing$ and $\lvert\mathcal{S}\rvert<d_{\max}$ (if provided) do
    if any $\mathrm{RelS}(j,\mathbf{Y}\mid\mathcal{S})$, $j\in\mathcal{V}$, is undefined then
        break
    else
        $k=\arg\max_{j\in\mathcal{V}}\{\mathrm{RelS}(j,\mathbf{Y}\mid\mathcal{S})\}$;   // Obtain the relevant subset $\mathcal{S}$
        if $\mathrm{RelS}(k,\mathbf{Y}\mid\mathcal{S})<\alpha_2$ then
            break
        else
            $\mathcal{V}=\mathcal{V}\backslash\{k\}$; $\mathcal{S}=\mathcal{S}\cup\{k\}$;
            Obtain $\mathbf{Z}_{\mathcal{S}}$;
        end if
        for all $j\in\mathcal{V}$ do   // Obtain the redundant subset $\mathcal{A}_2$
            if $\mathrm{RedS}(j\mid\mathcal{S})<\alpha_3$ then $\mathcal{V}=\mathcal{V}\backslash\{j\}$; $\mathcal{A}_2=\mathcal{A}_2\cup\{j\}$;
        end for
    end if
end while
$\mathcal{A}_3=\mathcal{A}_3\cup\mathcal{V}$;   // Obtain the conditionally independent subset $\mathcal{A}_3$
Algorithm 1: Pseudocode of the proposed TNVS.

The termination criterion of TNVS is that the cardinality of the relevant subset $\lvert\mathcal{S}\rvert$ reaches the predetermined upper bound $d_{\max}$, or that $\textrm{RelS}(j,\mathbf{Y}\mid\mathcal{S})<\alpha_2$ or is undefined for all $j\in\mathcal{V}$, where $\alpha_2$ is the relevant threshold. If the termination criterion is satisfied, all remaining elements in $\mathcal{V}$ belong to the conditionally independent subset $\mathcal{A}_3$. When TNVS is finished, the original index set of predictors $\mathcal{X}$ is divided into a relevant subset $\mathcal{S}$, an uninformative subset $\mathcal{A}_1$, a redundant subset $\mathcal{A}_2$, and a conditionally independent subset $\mathcal{A}_3$, with $\mathcal{S}\cup\mathcal{A}_1\cup\mathcal{A}_2\cup\mathcal{A}_3=\mathcal{X}$ and $\mathcal{S}\cap\mathcal{A}_1\cap\mathcal{A}_2\cap\mathcal{A}_3=\varnothing$. The pseudocode of TNVS is shown in Algorithm 1, and an example illustrating the procedure of TNVS is provided in Appendix A2.
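For readers who prefer an executable form, the following is a compact Python sketch of Algorithm 1. It assumes the codec, shannon_entropy, gso_basis, and redundancy_score helpers sketched in Section 3, and it illustrates the search logic rather than reproducing the authors' released R implementation.

```python
import numpy as np


def tnvs(X, Y, alpha1=0.01, alpha2=-0.01, alpha3=0.01, d_max=None):
    n, p = X.shape
    if d_max is None:
        d_max = int(np.ceil(n / np.log(n)))
    V = list(range(p))
    S, A1, A2, A3 = [], [], [], []

    # Prefiltering: move uninformative predictors to A1
    for j in list(V):
        if shannon_entropy(X[:, j]) < alpha1:
            V.remove(j); A1.append(j)

    # Alternate forward selection and batch deletion
    while V and len(S) < d_max:
        rels = {j: codec(Y, X[:, j], X[:, S] if S else None) for j in V}
        if any(np.isnan(v) for v in rels.values()):
            break                                # RelS undefined: X_S already suffices
        k = max(rels, key=rels.get)
        if rels[k] < alpha2:
            break                                # no remaining relevant predictor
        V.remove(k); S.append(k)                 # forward selection
        Z = gso_basis(X[:, S])
        for j in list(V):                        # batch deletion of collinear candidates
            if redundancy_score(X[:, j], Z) < alpha3:
                V.remove(j); A2.append(j)

    A3.extend(V)                                 # the remainder is conditionally independent
    return S, A1, A2, A3
```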

The computational complexity of TNVS is analyzed as follows. The computational cost of the prefiltering step is $O(np)$. The complexity of the forward selection step is $O(np\log{n})$, since it takes $O(n\log{n})$ to calculate the CODEC of a predictor and select the most relevant one (Azadkia & Chatterjee, 2021). For the batch deletion step, GSO requires $O(np)$ to screen all predictors. Forward selection and batch deletion are iterated $d$ times on average, where $d$ is the number of selected predictors. The total cost of TNVS is therefore $O(dpn\log{n})$. In the worst case, when $d$ is proportional to $p$, the time complexity is $O(np^{2}\log{n})$. If there are no fewer than $p$ processors, the predictors can be measured in parallel in each step, and the computational time of TNVS reduces to $O(dn\log{n})$ on average. As shown in the simulations, the proposed TNVS is efficient compared with the baselines even with limited processors.

The main advantages of the presented TNVS are summarized as follows. First, a transparent framework of variable selection is achieved using a three-step heuristic search, which groups the predictors into four mutually disjoint subsets. In every step of the framework, TNVS can provide reasonable interpretations for selecting or deleting certain predictors. Second, the recently proposed CODEC is introduced in TNVS to identify the complex nonlinear associations between the response and the predictors. TNVS can handle not only high-dimensional data in which the response depends on the predictors through monotonic and additive functions, but also data with oscillatory functional dependence and interactions. Moreover, TNVS is model-free. The selected subset of predictors has strong generality and can be used to construct many kinds of learning models.

5 Simulation studies

Extensive experiments on simulated datasets are conducted to illustrate the effectiveness and interpretability of TNVS. First, the variable selection results of TNVS are compared with several competitive baselines to show its validity and conciseness. Then, the subsets of predictors are further evaluated to demonstrate model interpretability.

5.1 The simulated datasets

We generate a regression simulation problem with both monotonic and nonmonotonic functional dependence, interactions, uninformative predictors, multicollinearity, and conditionally independent predictors. In the problem, 90% of the $p$-dimensional input predictors carry information, and the remaining 10% are uninformative. The first 90% of the predictors consist of 9 signals $\mathbf{X}_{t_1},\cdots,\mathbf{X}_{t_9}$, where $t_g=(g-1)\cdot p/10+1$, $g=1,\cdots,9$, and the rest are redundant predictors that are collinear with these 9 signals. The signals are independently sampled from the standard normal distribution $\mathcal{N}(0,1)$. Each redundant predictor is highly correlated with one of the signals, with $\mathbf{X}_{t_g+j}=\mathbf{X}_{t_g}+\lambda\epsilon_{t_g+j}$, $g=1,\cdots,9$, $j=1,\cdots,p/10$, where $\epsilon_{t_g+j}$ is a stochastic error that follows $\mathcal{N}(0,1)$, and $\lambda$ is set as $0.01$. The last 10% of the predictors are uninformative with only a few nonzero observations. The proportion of nonzero observations is set as $0.001$, and the nonzero observations are randomly generated from $\mathcal{N}(0,0.1^{2})$. The response $\mathbf{Y}$ is a nonlinear measurable function of some predictors:

\[ \mathbf{Y}=2\mathbf{X}_{t_1}\mathbf{X}_{t_2}+\cos(\pi\mathbf{X}_{t_3}\mathbf{X}_{t_4})+\varepsilon, \]  (6)

where $\varepsilon$ is a stochastic error that follows $\mathcal{N}(0,0.1^{2})$. $\mathbf{Y}$ is nonlinearly correlated with these four predictors and with any of their collinear predictors. $\mathbf{Y}$ has an oscillatory functional dependence on $\mathbf{X}_{t_3}$ and $\mathbf{X}_{t_4}$, and there are interactions between $\mathbf{X}_{t_1}$ and $\mathbf{X}_{t_2}$, and between $\mathbf{X}_{t_3}$ and $\mathbf{X}_{t_4}$. The problem has the following three settings, obtained by specifying $n$ and $p$; a sketch of the data-generating process is given after the list. For each setting, 10 datasets are generated.

  • Setting 1: $n=2\,000$, $p=1\,000$;

  • Setting 2: $n=2\,000$, $p=2\,000$;

  • Setting 3: $n=2\,000$, $p=5\,000$.
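The following sketch generates data following the description above; it assumes $p$ is a multiple of 10 and that each of the nine signal blocks holds $p/10$ columns (the signal plus its collinear copies), which is our reading of the indexing, so treat it as illustrative.

```python
import numpy as np


def simulate(n=2000, p=1000, lam=0.01, sparsity=0.001, seed=0):
    rng = np.random.default_rng(seed)
    b = p // 10                                   # block size p/10
    X = np.empty((n, p))
    for g in range(9):                            # nine signal blocks
        signal = rng.standard_normal(n)           # X_{t_g} ~ N(0, 1)
        X[:, g * b] = signal
        for j in range(1, b):                     # collinear copies X_{t_g + j}
            X[:, g * b + j] = signal + lam * rng.standard_normal(n)
    # last 10%: uninformative predictors with a small fraction of nonzero entries
    U = np.zeros((n, b))
    mask = rng.random((n, b)) < sparsity
    U[mask] = rng.normal(0.0, 0.1, size=mask.sum())
    X[:, 9 * b:] = U
    t1, t2, t3, t4 = 0, b, 2 * b, 3 * b           # indices of X_{t_1}, ..., X_{t_4}
    Y = (2 * X[:, t1] * X[:, t2]
         + np.cos(np.pi * X[:, t3] * X[:, t4])
         + rng.normal(0.0, 0.1, size=n))          # Eq. (6)
    return X, Y
```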

5.2 Baselines and parameter settings

In our experiments, TNVS is evaluated against six representative baselines, including two up-to-date model-free and nonlinear methods, namely FOCI (Azadkia & Chatterjee, 2021) and WLS (Zhong et al., 2021), two classic model-free and nonlinear feature screening methods, namely DC-SIS (Li et al., 2012) and SIRS (Zhu et al., 2011), and two stepwise methods based on linear partial correlation, namely PC-simple (Buhlmann et al., 2010) and TPC (Li et al., 2017b).

The parameter settings of the variable selection methods are as follows. For TNVS, the uninformative threshold $\alpha_1$, relevant threshold $\alpha_2$, and redundant threshold $\alpha_3$ are empirically set as $0.01$, $-0.01$, and $0.01$, as the default setting in line with previous research (Liu et al., 2018). $d_{\max}$ is set as $\lceil n/\log{n}\rceil$, the smallest positive integer no less than $n/\log{n}$. The termination criterion of FOCI is that the CODECs of all candidates are no larger than $0$, or that the number of selected predictors reaches $\lceil n/\log{n}\rceil$. The significance levels of PC-simple and TPC are set as $0.05$ (Buhlmann et al., 2010; Li et al., 2017b). The number of selected predictors of WLS, DC-SIS and SIRS is empirically set as $\lceil n/\log{n}\rceil$ in line with the literature. Considering the computing capability, we set the upper bound of the CPU time as $3\,600$ seconds for all stepwise methods, including TNVS, FOCI, PC-simple and TPC. In addition, predictors with zero variance are removed beforehand for FOCI. If a selected subset is empty, i.e., no predictor is considered to be correlated with the response, the mean of the response is used as the predictive value for regression problems, and a random category of the response is chosen as the predictive value for classification problems.

10-fold cross-validation is performed on each dataset, where the $n$ samples of the dataset are randomly divided into 10 equal folds. Each fold is used once for testing while the remaining $9n/10$ samples are used for training. The performance in each setting is the average result over all folds of all 10 corresponding datasets, i.e., each variable selection method is tested 100 times in each setting to evaluate its variable selection capability.

All the variable selection methods are implemented in R 4.0.3 for a fair comparison (our code is available at https://github.com/kywang95/TNVS). The experiments are performed using an Intel Core i7 3.4 GHz processor with 16 GB of RAM.

5.3 Effectiveness and efficiency of the proposed method on nonlinear simulations

The following four criteria are adopted to evaluate the effectiveness and efficiency of TNVS and the baselines on high-dimensional simulations with complex nonlinear functional dependencies; a short sketch of how $P_a$ and the coverage can be computed follows the list. The results of each simulation over 100 repetitions are summarized in Table 2.

1. $P_a$: the probability of discovering all truly important predictors over all repetitions. The closer $P_a$ is to $1$, the more robust the variable selection method is at retaining all true predictors. For the simulations, the true predictors are the relevant predictors and the redundant predictors that are collinear with them. For example, for Setting 1, the indices of the true predictors are any combination of four elements, one from each of the subsets $\{1,2,\cdots,100\}$, $\{101,102,\cdots,200\}$, $\{201,202,\cdots,300\}$, and $\{301,302,\cdots,400\}$, and $P_a$ is the frequency with which the selected subset contains at least one element from each of the four sets.

2. $\mathcal{M}$: the minimum model size required to include all true predictors, computed over the repetitions in which none of the true predictors are omitted. The closer $\mathcal{M}$ is to the true model size, the more concise the selected subset is. The average and the standard deviation of $\mathcal{M}$ are reported.

3. Coverage: the number of true predictors covered by the selected subset. Coverage is less than or equal to the number of true predictors. The average and standard deviation of the coverage over all repetitions are reported. In the simulated datasets, the closer the coverage is to 4, the fewer true predictors are omitted.

4. Time: the average and standard deviation of the running time (in seconds) of variable selection over 100 repetitions.
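As a small illustration, the sketch below reflects one reading of how $P_a$ and the coverage can be computed from a selection result (counting, for each of the four signal groups, whether at least one member is selected); true_groups holds the four ground-truth index sets, and the function names are ours.

```python
def coverage(selected, true_groups):
    """Number of ground-truth groups (a signal plus its collinear copies) hit by the selection."""
    return sum(any(j in group for j in selected) for group in true_groups)


def p_a(selections, true_groups):
    """Share of repetitions in which every ground-truth group is covered."""
    hits = [coverage(sel, true_groups) == len(true_groups) for sel in selections]
    return sum(hits) / len(hits)
```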

Table 2: $P_a$, the average and the standard deviation (in parentheses) of $\mathcal{M}$, coverage, and running time (in seconds) in 100 repetitions. The best results are presented in bold, and the second-best results are underlined.
Setting Index TNVS FOCI WLS SIRS DC-SIS PC-simple TPC
1 $P_a$ \underline{0.99} 0.98 \textbf{1.00} 0.00 0.02 0.00 0.00
  $\mathcal{M}$ \textbf{4.00} (0.00) \underline{4.01} (0.10) 26.28 (9.75) – 223.50 (10.61) – –
  Coverage \underline{3.98} (0.20) 3.96 (0.28) \textbf{4.00} (0.00) 1.11 (0.60) 3.02 (0.14) 0.50 (0.54) 0.00 (0.00)
  Time(s) 33.45 (1.98) 598.84 (127.05) 90.43 (0.72) 10.27 (4.25) 245.74 (3.42) 4.41 (0.62) 4.41 (0.16)
2 $P_a$ \underline{0.97} 0.93 \textbf{1.00} 0.00 0.00 0.00 0.00
  $\mathcal{M}$ \textbf{4.00} (0.00) \underline{4.01} (0.10) 28.59 (9.10) – – – –
  Coverage \underline{3.92} (0.49) 3.84 (0.61) \textbf{4.00} (0.00) 0.72 (0.59) 2.00 (0.00) 0.31 (0.53) 0.00 (0.00)
  Time(s) 68.69 (6.64) 2826.55 (503.19) 530.74 (3.89) 28.41 (7.62) 527.68 (19.85) 30.69 (55.14) 11.11 (0.27)
3 $P_a$ \underline{0.99} 0.97 \textbf{1.00} 0.00 0.00 0.00 0.00
  $\mathcal{M}$ \textbf{4.00} (0.00) \textbf{4.00} (0.00) 33.49 (8.87) – – – –
  Coverage \underline{3.98} (0.20) 3.94 (0.34) \textbf{4.00} (0.00) 0.74 (0.50) 1.02 (0.14) 0.28 (0.55) 0.00 (0.00)
  Time(s) 236.80 (9.97) 3670.26 (36.46) 1300.66 (3.73) 96.59 (1.57) 1321.23 (59.23) 4908.73 (2308.58) 50.60 (0.89)

Table 2 shows that our TNVS and the recently proposed FOCI and WLS outperform the other traditional methods, and TNVS is the most effective and efficient of the three. First, the close-to-1 $P_a$ in all settings indicates that these three methods retain all true predictors in most cases. Among them, TNVS has the smallest minimum model size, which means that the subset selected by TNVS is the most concise. In addition, the computational time of TNVS is the lowest of the three, which shows that it is the most efficient. The framework of TNVS is most similar to that of FOCI; comparing the two, TNVS has a smaller minimum model size and a larger coverage, indicating that TNVS is more accurate and reliable than FOCI.

Table 2 also shows that these nonlinear simulation problems are extremely difficult for the other four prevailing methods, i.e., SIRS, DC-SIS, PC-simple and TPC. Their $P_a$ values are close to 0, indicating that these methods omit some true predictors and are therefore invalid here. PC-simple and TPC are designed for linear regression models, and their close-to-0 coverages indicate a weak ability to identify nonlinearly relevant predictors. SIRS and DC-SIS can only identify monotonic nonlinear relevance and have difficulty removing multicollinearity, and their low coverages indicate that they cannot retain all important predictors.

The experiments on simulations demonstrate that every module of the proposed TNVS is effective and indispensable. First, compared with PC-simple and TPC, two stepwise methods based on linear partial correlation, TNVS adopts the nonlinear CODEC. The unsatisfactory $P_a$ of these two baselines and the close-to-1 $P_a$ of TNVS indicate that introducing the CODEC enables TNVS to effectively identify complex nonlinear associations, such as interacting predictors and oscillatory functional dependence with interactions. Second, FOCI is a stepwise method with the CODEC, but it does not separately dispose of the uninformative and redundant predictors. Compared with FOCI, TNVS avoids retaining large percentages of these unnecessary predictors, and achieves a larger $P_a$, a smaller model size, a larger coverage, and less computational time. This implies that prefiltering and batch deletion inherently improve the effectiveness and efficiency of TNVS.

5.4 Interpretability of the proposed method

We further demonstrate the model interpretability of the proposed variable selection method. Since the truly relevant, uninformative, redundant, and conditionally independent predictors in the simulations are known in advance, we can obtain the proportions of the four types of predictors in the selected subset for each method. Taking a closer look at the three competitive methods in Table 2, Table 3 shows the proportions of the ground-truth relevant ($\textrm{Rel}_{GT}$), uninformative ($\textrm{Uin}_{GT}$), redundant ($\textrm{Red}_{GT}$), and conditionally independent ($\textrm{Cind}_{GT}$) predictors in the selected subsets of TNVS, FOCI, and WLS. The average of each quantity over all 100 repetitions is summarized. The truly relevant predictors in the selected subset are true positives, whereas the other three types are false positives. Thus, $\textrm{Rel}_{GT}$ is the precision of each method. The larger $\textrm{Rel}_{GT}$ is, and the smaller $\textrm{Uin}_{GT}$, $\textrm{Red}_{GT}$, and $\textrm{Cind}_{GT}$ are, the more concise the selected subset is.

Table 3: Average proportions of four types of predictors in the selected subsets of predictors obtained by TNVS, FOCI, and WLS. The best results are presented in bold, and the second-best results are underlined.
Setting Index TNVS FOCI WLS
1 $\textrm{Rel}_{GT}$ \textbf{1.00} \underline{0.12} 0.02
  $\textrm{Uin}_{GT}$ \textbf{0.00} 0.83 \underline{0.03}
  $\textrm{Red}_{GT}$ \textbf{0.00} \underline{0.05} 0.42
  $\textrm{Cind}_{GT}$ \textbf{0.00} \textbf{0.00} 0.53
2 $\textrm{Rel}_{GT}$ \textbf{0.99} \underline{0.06} 0.02
  $\textrm{Uin}_{GT}$ \textbf{0.00} 0.92 \underline{0.05}
  $\textrm{Red}_{GT}$ \textbf{0.00} \underline{0.02} 0.42
  $\textrm{Cind}_{GT}$ \textbf{0.01} \underline{0.00} 0.51
3 $\textrm{Rel}_{GT}$ \textbf{1.00} \underline{0.10} 0.02
  $\textrm{Uin}_{GT}$ \textbf{0.00} 0.87 \underline{0.07}
  $\textrm{Red}_{GT}$ \textbf{0.00} \underline{0.03} 0.41
  $\textrm{Cind}_{GT}$ \textbf{0.00} \textbf{0.00} 0.50

Table 3 shows that TNVS has the largest proportion of $\textrm{Rel}_{GT}$ in its selected subsets, or equivalently, the largest precision. Nearly all the predictors selected by TNVS are relevant predictors, and uninformative, redundant, and conditionally independent predictors are rarely included. The subset selected by FOCI contains a large percentage of uninformative predictors and some redundant predictors. Although WLS selects all relevant predictors, they make up only a small part of its selected subset, and many unimportant predictors are falsely picked before the truly important ones. These findings demonstrate that TNVS enhances the capability of FOCI to handle the lack of information and multicollinearity, and achieves a better balance between accuracy and conciseness than WLS.

The most significant difference between TNVS and the other methods is that TNVS divides the predictors into four disjoint subsets, instead of only a selected and a deleted one. Table 4 shows the proportions of the ground-truth types of predictors in the four subsets obtained by TNVS (the relevant subset $\textrm{Rel}_{pred}$, the uninformative subset $\textrm{Uin}_{pred}$, the redundant subset $\textrm{Red}_{pred}$, and the conditionally independent subset $\textrm{Cind}_{pred}$). The average of each quantity over all 100 repetitions is summarized. The truly relevant predictors grouped into the selected relevant subset are true positives, whereas those grouped into the other three subsets are false negatives. Thus, the diagonal elements are the recall of TNVS. The larger the diagonal elements are, and the smaller the off-diagonal elements are, the more accurate the variable selection is.

Table 4: Average proportions of each ground-truth type of predictors in the four subsets obtained by TNVS ($\textrm{Rel}_{pred}$, $\textrm{Uin}_{pred}$, $\textrm{Red}_{pred}$, and $\textrm{Cind}_{pred}$).
Setting Index $\textrm{Rel}_{pred}$ $\textrm{Uin}_{pred}$ $\textrm{Red}_{pred}$ $\textrm{Cind}_{pred}$
1 $\textrm{Rel}_{GT}$ 0.995 0 0 0.005
  $\textrm{Uin}_{GT}$ 0 1 0 0
  $\textrm{Red}_{GT}$ 0 0 0.995 0.005
  $\textrm{Cind}_{GT}$ 0 0 0 1
2 $\textrm{Rel}_{GT}$ 0.980 0 0 0.020
  $\textrm{Uin}_{GT}$ 0 1 0 0
  $\textrm{Red}_{GT}$ 0 0 0.980 0.020
  $\textrm{Cind}_{GT}$ 2e-5 0 3.98e-3 0.996
3 $\textrm{Rel}_{GT}$ 0.995 0 0 0.005
  $\textrm{Uin}_{GT}$ 0 1 0 0
  $\textrm{Red}_{GT}$ 0 0 0.995 0.005
  $\textrm{Cind}_{GT}$ 0 0 0 1

Table 4 demonstrates the unique strength of TNVS. Besides the selected relevant subset, it further divides the predictors to be removed into the uninformative, redundant, and conditionally independent subsets. The diagonal elements show that the recall of TNVS is high on all four subsets. All uninformative predictors are correctly deleted. In rare cases, a few relevant predictors are falsely identified as conditionally independent, and this misclassification hinders TNVS from distinguishing redundancy from conditional independence. Overall, the evidence shows that when tackling simulations with complex nonlinear functional dependencies, uninformative predictors, and collinearity, TNVS provides credible explanations of why certain predictors are removed. In sum, TNVS outperforms the baselines on model interpretability.

6 Real data applications

6.1 Datasets

Four prevailing real datasets from various domains are chosen to demonstrate the performance of the proposed TNVS (Li et al., 2017a). The descriptive statistics of the datasets are listed in Table 5. In this paper, the original probes, features, or pixels are directly used as input predictors without any preprocessing such as feature extraction, since the aim of these experiments is to demonstrate the predictive capability and interpretability of the proposed variable selection method on high-dimensional data, not to design a distinct method for a specific domain such as face image recognition.

Table 5: The summary statistics of the real datasets employed to test the competence of the variable selection methods, including abbreviated names of these datasets, numbers of categories $c$, sample sizes $n$, numbers of input predictors $p$, and brief descriptions.
Name $c$ $n$ $p$ Description
arcene 2 200 10 000 mass spectrometry
isolet 6 300 617 voices of spoken letters
warpAR10P 10 240 2 400 face images
ORL 40 400 4 096 face images

The reasons for choosing these datasets are as follows. First, these biological or image datasets are high-dimensional, with a large number of predictors. The response and predictors are often nonlinearly correlated in such datasets, and the inputs usually contain a large number of uninformative, redundant, or conditionally independent predictors. Moreover, the face image datasets are preferred because the variable selection results can easily be visualized on the original images, and the literature on facial landmark detection (Köstinger et al., 2011) has provided some key reference points. By checking the relevance between the selected predictors and these key reference points, we can easily observe whether the selection results have evident post hoc interpretability. Two face datasets with different resolutions are selected to examine whether there are more redundant predictors in images with higher resolution, i.e., whether the lower-resolution images are distinct enough for classification. The datasets are described in detail below.

1. arcene is a binary classification dataset for discriminating between patients with and without cancer. The data are collected from mass spectrometry analysis of blood serum, and the predictors are continuous. The sample contains 88 observations with ovarian or prostate cancer and 112 observations from healthy patients. Each observation has 7 000 real features and 3 000 random probes. These random probes are permuted least-informative features, and some of them can probably be identified as uninformative predictors.

2. isolet is an audio dataset for predicting what letter name is spoken. The dataset contains 1 560 observations, which are audio recordings of 30 speakers. Each subject speaks the names of the 26 letters of the alphabet twice. Each observation contains 617 features. In this experiment, we generate a high-dimensional dataset with 300 observations from the raw isolet: the first 6 categories are considered, and 50 observations are randomly sampled from each category.

3. warpAR10P is a facial recognition dataset with 10 distinct subjects. For each subject, 13 images are captured under various lighting conditions, with different facial expressions and occlusions such as scarves and sunglasses. The size of each image is 60×40 pixels, with 256 gray levels per pixel.

4. ORL is another facial recognition dataset containing 400 face images collected from 40 individuals, and each image contains 64×64 pixels with 256 gray levels per pixel. These images exhibit multiple changes, such as various picture-taking times, illumination changes, different expressions, and facial detail changes.

Note that Z-score normalization is adopted for the input predictors in all datasets, i.e., $\tilde{\mathbf{X}}_j=(\mathbf{X}_j-\mu_j)/\sigma_j$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of predictor $\mathbf{X}_j$, respectively.

6.2 Learning models and parameter settings

The proposed TNVS is inherently model-free, and the selection process relies on no assumptions about the learning model. The effectiveness and generality of the selected subsets are therefore evaluated in terms of their performance with multiple learning models. Specifically, five prevailing predictive models commonly adopted in the literature are considered (Salesi et al., 2021; Wan et al., 2022), including Support Vector Regression (SVR) / Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), Light Gradient Boosting Machine (lightGBM), and Multi-layer Perceptron (MLP). To examine the predictive capabilities of the variable selection methods, each learning model is trained with the selected predictors, and the results are then compared. These experiments aim to demonstrate the generality of the variable selection methods rather than the prediction performance of well-tuned models. Thus, the configurations of the learning models are identical for all variable selection methods, to avoid favoring any individual method by pairing it with a more competitive model. Most parameters are kept at the default settings of scikit-learn (Buitinck et al., 2013); the only adjustment is that the maximum number of iterations is set to 5 000 for the MLP. All learning models are implemented in Python 3.9 using the application programming interface (API) of scikit-learn.
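Under these settings, the classifiers could be instantiated as in the hedged sketch below; the helper name build_models and the fixed random_state are illustrative assumptions, and lightGBM is accessed through its scikit-learn-compatible wrapper.

    # Sketch of the five learning models kept at library defaults, except for the
    # MLP iteration budget mentioned above; names and seeds are illustrative only.
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from lightgbm import LGBMClassifier  # scikit-learn-compatible API of lightGBM

    def build_models(random_state=0):
        return {
            "SVM": SVC(random_state=random_state),
            "RF": RandomForestClassifier(random_state=random_state),
            "DT": DecisionTreeClassifier(random_state=random_state),
            "lightGBM": LGBMClassifier(random_state=random_state),
            "MLP": MLPClassifier(max_iter=5000, random_state=random_state),
        }

Each variable selection method is then paired with every model in this dictionary, so that no method benefits from a better-tuned learner.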

For the real datasets, repeated 10-fold cross-validation is adopted to evaluate each combination of variable selection method and learning model. The 10-fold cross-validation is repeated 10 times on each dataset. All variable selection methods and learning models share the same cross-validation splits so that the comparisons are unbiased. For all splits, the class ratio in the training set is kept the same as in the full sample. The predictive results are evaluated with out-of-sample accuracy (ACC), recall, $F_{1}$, and the Cohen Kappa score (Kappa). ACC, recall, and $F_{1}$ lie within $[0, 1]$; the larger these indicators are, the better the prediction. Kappa lies within $[-1, 1]$; the closer the score is to 1, the better the prediction.
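A minimal sketch of this evaluation loop for one combination of selected predictors and learning model is given below; the macro averaging used for the multi-class recall and F1, and the function name evaluate, are assumptions made for illustration, and X_sel and y are assumed to be NumPy arrays.

    # Sketch of the repeated stratified 10-fold evaluation for one
    # (selection method, learning model) pair; X_sel holds the selected predictors.
    import numpy as np
    from sklearn.base import clone
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.metrics import accuracy_score, recall_score, f1_score, cohen_kappa_score

    def evaluate(model, X_sel, y, n_splits=10, n_repeats=10, seed=0):
        cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                     random_state=seed)
        scores = []
        for train_idx, test_idx in cv.split(X_sel, y):
            fitted = clone(model).fit(X_sel[train_idx], y[train_idx])
            pred = fitted.predict(X_sel[test_idx])
            scores.append((
                accuracy_score(y[test_idx], pred),
                recall_score(y[test_idx], pred, average="macro"),
                f1_score(y[test_idx], pred, average="macro"),
                cohen_kappa_score(y[test_idx], pred),
            ))
        return np.mean(scores, axis=0)  # mean ACC, recall, F1, Kappa over 100 folds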

6.3 Effectiveness on real datasets

Since the true models of the real datasets are unknown, the predictive results of the selected subsets are used to indirectly demonstrate the effectiveness of variable selection. The predictive results obtained with multiple combinations of variable selection methods and learning models are reported as follows. First, Fig. 3 shows boxplots of the four predictive indicators obtained with lightGBM on warpAR10P, which compare the predictive capability of the proposed TNVS with the baselines in this case. The predictive results of TNVS are better than those of the baselines, especially PC-simple and TPC. The remaining statistics of the average ACC, recall, $F_{1}$, and Kappa on the real datasets are listed in Table B1 of Appendix B. The results on the other datasets and learning models substantially agree with those in Fig. 3, which demonstrates that TNVS is robustly superior to the baselines.

Figure 3: Boxplots of the indicator values over 100 repetitions on warpAR10P, obtained by lightGBM trained with the predictors selected by different variable selection methods.

A statistical test is performed to further assess the predictive capability of the proposed TNVS. For every dataset, the Friedman test is applied to the statistics of the four indicators obtained by all learning models to determine the overall performance of each variable selection method. The Friedman test provides a mean rank for each method and indicates whether there are significant differences among the methods. A higher mean rank means that the method performs better than its competitors overall. Table 6 reports the results of the Friedman test, including the mean ranks of the methods, the order of the mean ranks, and the p-values. As seen, the predictors selected by TNVS rank within the top two among the seven methods on all real datasets while retaining fewer predictors than the feature screening methods, and the differences among the methods are significant, which further confirms the strength and robustness of TNVS in prediction enhancement.

Table 6: The Friedman test of statistical results obtained on real datasets. The best results are presented in bold, and the second-best results are underlined.
Dataset Indicator TNVS FOCI WLS SIRS DC-SIS PC-simple TPC p-value
arcene Mean rank \underline{5.80} \mathbf{6.20} 2.60 4.20 4.55 2.90 1.75 0.000
Order \underline{2} \mathbf{1} 6 4 3 5 7
isolet Mean rank \mathbf{6.55} 5 \underline{6.45} 3.00 4.00 2.00 1.00 0.000
Order \mathbf{1} 3 \underline{2} 5 4 6 7
warpAR10P Mean rank \mathbf{7.00} \underline{6.00} 4.20 3.65 4.15 1.00 2.00 0.000
Order \mathbf{1} \underline{2} 3 5 4 7 6
ORL Mean rank \underline{6.20} 5.20 \mathbf{6.60} 3.40 3.60 1.65 1.35 0.000
Order \underline{2} 3 \mathbf{1} 5 4 6 7
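The per-dataset Friedman statistics reported in Table 6 could be computed along the following lines; the 20 rows of the results matrix correspond to the (learning model, indicator) combinations, and the names results and method_names are illustrative assumptions.

    # Hedged sketch of the per-dataset Friedman test: rows of `results` are the
    # (learning model, indicator) combinations, columns are the 7 selection methods.
    import numpy as np
    from scipy.stats import friedmanchisquare, rankdata

    def friedman_summary(results, method_names):
        stat, p_value = friedmanchisquare(*results.T)         # one sample per method
        mean_ranks = rankdata(results, axis=1).mean(axis=0)   # higher value -> higher rank
        order = (-mean_ranks).argsort().argsort() + 1         # 1 = best mean rank
        return dict(zip(method_names, zip(mean_ranks, order))), p_value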

6.4 Interpretability on real datasets

TNVS transparently groups the predictors into four subsets, i.e., relevant, uninformative, redundant, and conditionally independent, which improves its model interpretability. The average cardinalities of the four subsets are counted to check whether these types of predictors exist in the real datasets. In addition, the average compression rate, defined as the proportion of relevant predictors among the inputs, is calculated. The results over 100 repetitions on each dataset are shown in Table 7.

Table 7: The average numbers of relevant, uninformative, redundant, and conditionally independent predictors obtained by TNVS on real datasets and their standard deviation (in parentheses), and the mean compression rates achieved by TNVS.
Dataset Relevant Compression rate Uninformative Redundant Conditionally independent
arcene 24.35 (10.53) 0.24% 45.72 (5.95) 95.91 (62.00) 9834.02 (71.71)
isolet 35.81 (11.55) 5.80% 2.00 (0.00) 7.21 (3.69) 571.98 (12.66)
warpAR10P 18.29 (7.25) 0.76% 0.00 (0.00) 5.10 (5.61) 2376.61 (11.25)
ORL 37.18 (21.72) 0.91% 0.00 (0.00) 27.95 (21.49) 4030.87 (38.24)

As shown in Table 7, the percentage of relevant predictors is lower than 6% in all datasets, and even lower than 1% in arcene, warpAR10P, and ORL, which indicates that TNVS compresses the input set of predictors to a large degree. Redundant predictors are identified in all datasets, which shows the usefulness of GSO. TNVS also identifies the uninformative predictors in arcene, as expected.

In addition to model interpretability, we also visualize the variable selection results on the ORL dataset as an example to demonstrate the post hoc interpretability of TNVS. Under the premise of the transparent search, the results of TNVS are intuitive. The literature on facial landmark detection has highlighted some key reference points on the face that help identify an individual; e.g., the Annotated Facial Landmarks in the Wild (AFLW) markup defines 21 facial feature points (Köstinger et al., 2011), including 12 points on the eyes and eyebrows, 3 on the mouth, 3 on the nose, 3 on each ear, and 1 on the chin. By checking whether the selected pixels cover most, if not all, of these key reference points, we can qualitatively observe whether the selection results capture the key predictors of these observations.

One run of the variable selection is marked on the original images of ORL. For TNVS, the relevant pixels are colored yellow and the redundant pixels red, while the unmarked pixels are conditionally independent, as shown in Fig. 4. For the other methods, the selected pixels are colored yellow, and the unselected pixels are left unmarked. For ORL, TNVS selects 62 relevant pixels and removes 66 redundant pixels and 3 968 conditionally independent pixels. FOCI selects 26 pixels, WLS, DC-SIS, and SIRS select 62 pixels, and PC-simple and TPC select 5 pixels. Compared with the predictive results of the baselines in Table B1 of Appendix B, TNVS achieves comparable or even better performance with a more concise subset of predictors.
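An overlay of this kind can be reproduced with a short script along the following lines; the 64×64 layout matches the ORL image size, while the function name and the flat pixel indices are assumptions made for illustration.

    # Illustrative sketch of the pixel overlay in Fig. 4, assuming a 64x64 grayscale
    # face image and flat pixel indices for the relevant/redundant subsets.
    import numpy as np
    import matplotlib.pyplot as plt

    def overlay(image_64x64, relevant_idx, redundant_idx):
        rgb = np.stack([image_64x64] * 3, axis=-1) / 255.0   # gray -> RGB in [0, 1]
        rows, cols = np.unravel_index(relevant_idx, image_64x64.shape)
        rgb[rows, cols] = [1.0, 1.0, 0.0]                    # relevant pixels in yellow
        rows, cols = np.unravel_index(redundant_idx, image_64x64.shape)
        rgb[rows, cols] = [1.0, 0.0, 0.0]                    # redundant pixels in red
        plt.imshow(rgb)
        plt.axis("off")
        plt.show()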

Figure 4: Variable selection results of TNVS and the baselines visualized on the original images of ORL. For TNVS, the selected relevant pixels are marked in yellow, the removed redundant pixels are marked in red, and the removed conditionally independent pixels are left unmarked. For the other methods, the selected pixels are marked in yellow, and the deleted ones are left unmarked.

As shown in Table B1 of Appendix B, TNVS and WLS are the two best methods on the ORL dataset. The pixels selected by both methods cover the facial points of AFLW, which mainly lie at the eye corners, eyebrow corners, nose tips, mouth corners, and chins. The accurate detection of key reference points intuitively explains the superior facial discrimination of TNVS and WLS. TNVS further reveals that the redundant pixels of ORL are concentrated mainly on the foreheads, indicating that pixels in these areas are linearly correlated with the relevant pixels. Most regions belong to the conditionally independent subset, indicating that only a very small number of pixels is adequate to explain the response to a certain extent. Pixels around the two eye corners are usually symmetric, and TNVS selects pixels on only one side, since given these areas, the pixels in the symmetric areas become conditionally independent. FOCI selects fewer pixels than TNVS, and the two methods select similar facial regions, but FOCI cannot identify the redundant pixels. SIRS and DC-SIS select a large number of redundant pixels on the forehead, and they miss some feature points at the eye corners, eyebrow corners, and nose tips. PC-simple and TPC omit many important facial feature points. Together with the results in Table B1 of Appendix B, this example shows that the predictors selected by TNVS can precisely predict the response, and that the retained or removed predictors can be explicitly interpreted.

6.5 Robustness in tuning the parameters

We further demonstrate the stability of TNVS as the three thresholds vary. These thresholds should be tuned within certain ranges: if they are too large, the predictors in each subset will violate their definitions, i.e., the uninformative subset may include informative predictors, and the redundant subset may include predictors that are not actually linear combinations of the selected ones. If the method were robust only within narrow ranges, it would be easily influenced by noise in real data. Here, we take a sufficiently large range for each threshold to show that the outperformance remains significant and that the proposed method is robust against hyperparameter variations.

First, the uninformative threshold is tuned within the range $\alpha_{1}\in\{0, 0.01, \ldots, 0.05\}$, and we evaluate the nonlinear correlations between the response and the uninformative predictors, $T_{n}(\mathbf{Y},\mathbf{X}_{\mathcal{A}_{1}})$, and the nonlinear partial correlations between the response and the uninformative predictors given the selected predictors, $T_{n}(\mathbf{Y},\mathbf{X}_{\mathcal{A}_{1}}\mid\mathbf{X}_{\mathcal{S}})$, on the four real datasets. If $T_{n}(\mathbf{Y},\mathbf{X}_{\mathcal{A}_{1}}\mid\mathbf{X}_{\mathcal{S}})$ is undefined, $\mathbf{Y}$ can be regarded as almost surely a measurable function of $\mathbf{X}_{\mathcal{S}}$, and we set $T_{n}(\mathbf{Y},\mathbf{X}_{\mathcal{A}_{1}}\mid\mathbf{X}_{\mathcal{S}})=0$. For warpAR10P and ORL, the number of uninformative predictors is 0. For arcene and isolet, the numbers of identified uninformative predictors and their correlations and partial correlations are shown in Table 8. The number of uninformative predictors in arcene is unchanged for $\alpha_{1}\in[0, 0.03]$, and that in isolet is unchanged for $\alpha_{1}\in[0, 0.05]$. The correlations and partial correlations between the response and the uninformative predictors are close to 0 in both datasets, which illustrates that the uninformative predictors have no positive effect on predicting the response. Identifying the uninformative subsets therefore does not depend heavily on the setting of the threshold $\alpha_{1}$, which shows the robustness of TNVS at the prefiltering step.
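A schematic sketch of this sweep is given below (Table 8 reports the resulting values); here prefilter and codec are hypothetical placeholders for the TNVS prefiltering/selection routine and the conditional dependence coefficient $T_{n}$, respectively, and are not functions of any existing library.

    # Hedged sketch of the alpha_1 sweep; `prefilter` and `codec` are hypothetical
    # placeholders for the TNVS routine and the coefficient T_n, respectively.
    import numpy as np

    def sweep_alpha1(X, y, prefilter, codec, alphas=(0.0, 0.01, 0.02, 0.03, 0.04, 0.05)):
        rows = []
        for a1 in alphas:
            uninf_idx, selected_idx = prefilter(X, y, alpha1=a1)       # indices of A_1 and S
            t_marginal = codec(y, X[:, uninf_idx])                     # T_n(Y, X_{A_1})
            t_partial = codec(y, X[:, uninf_idx], X[:, selected_idx])  # T_n(Y, X_{A_1} | X_S)
            if np.isnan(t_partial):                                    # undefined -> set to 0
                t_partial = 0.0
            rows.append((a1, len(uninf_idx), t_marginal, t_partial))
        return rows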

Table 8: Results on arcene and isolet as the uninformative threshold $\alpha_{1}$ changes, including the mean and standard deviation (in parentheses) of the number of uninformative predictors $\lvert\mathcal{A}_{1}\rvert$, the CODEC of the response and all the uninformative predictors $T_{n}(\mathbf{Y},\mathbf{X}_{\mathcal{A}_{1}})$, and the CODEC of the response and all the uninformative predictors given the selected subset $T_{n}(\mathbf{Y},\mathbf{X}_{\mathcal{A}_{1}}\mid\mathbf{X}_{\mathcal{S}})$.
Dataset Indicator $\alpha_{1}=0$ $\alpha_{1}=0.01$ $\alpha_{1}=0.02$ $\alpha_{1}=0.03$ $\alpha_{1}=0.04$ $\alpha_{1}=0.05$
arcene $\lvert\mathcal{A}_{1}\rvert$ 45.72 (5.95) 45.72 (5.95) 45.72 (5.95) 45.72 (5.95) 111.25 (7.34) 111.25 (7.34)
$T_{n}(\mathbf{Y},\mathbf{X}_{\mathcal{A}_{1}})$ 0.006 (0.118) -0.009 (0.088) -0.010 (0.111) 0.003 (0.107) -0.037 (0.105) -0.025 (0.097)
$T_{n}(\mathbf{Y},\mathbf{X}_{\mathcal{A}_{1}}\mid\mathbf{X}_{\mathcal{S}})$ 0.006 (0.058) 0.012 (0.066) 0.001 (0.029) -0.022 (0.207) -1.899 (2.888) -1.901 (2.894)
isolet $\lvert\mathcal{A}_{1}\rvert$ 2.00 (0.00) 2.00 (0.00) 2.00 (0.00) 2.00 (0.00) 2.00 (0.00) 2.00 (0.00)
$T_{n}(\mathbf{Y},\mathbf{X}_{\mathcal{A}_{1}})$ -0.037 (0.064) -0.034 (0.067) -0.027 (0.061) -0.035 (0.059) -0.025 (0.062) -0.036 (0.065)
$T_{n}(\mathbf{Y},\mathbf{X}_{\mathcal{A}_{1}}\mid\mathbf{X}_{\mathcal{S}})$ -2.822 (4.564) -2.822 (4.564) -2.822 (4.564) -2.822 (4.564) -2.822 (4.564) -2.822 (4.564)

We next tune the remaining two parameters, taking the relevant threshold $\alpha_{2}\in\{-0.05, -0.04, \ldots, 0\}$ and the redundant threshold $\alpha_{3}\in\{0, 0.01, \ldots, 0.05\}$. For each combination of $\alpha_{2}$ and $\alpha_{3}$, a 10-fold cross-validation is performed on the four real datasets to demonstrate the stability of TNVS as these two parameters vary. Fig. 5 shows the average cardinality of the subsets selected by TNVS under different values of $\alpha_{2}$ and $\alpha_{3}$. The number of selected predictors fluctuates only slightly as $\alpha_{2}$ and $\alpha_{3}$ change, and there is no obvious, unified trend across datasets. Fig. C1 of Appendix C further reports the prediction accuracy obtained with the predictors selected above. As expected, changes in both thresholds have little influence on the predictive results. In summary, the proposed TNVS is insensitive to changes in the thresholds within certain ranges.
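The grid check over $(\alpha_{2}, \alpha_{3})$ could be organized as in the sketch below; tnvs_select is a hypothetical stand-in for the proposed selector returning the indices of the relevant subset, and the random forest is only one example of the learners used here.

    # Hedged sketch of the (alpha_2, alpha_3) stability check; `tnvs_select` is a
    # hypothetical placeholder returning the indices of the selected predictors.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def threshold_grid(X, y, tnvs_select,
                       alpha2_grid=np.arange(-0.05, 0.01, 0.01),
                       alpha3_grid=np.arange(0.0, 0.06, 0.01)):
        records = []
        for a2 in alpha2_grid:
            for a3 in alpha3_grid:
                sel = tnvs_select(X, y, alpha2=a2, alpha3=a3)
                acc = cross_val_score(RandomForestClassifier(), X[:, sel], y,
                                      cv=10, scoring="accuracy").mean()
                records.append((a2, a3, len(sel), acc))
        return records  # subset cardinality and accuracy for each threshold pair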

Figure 5: The average number of predictors retained by TNVS under different parameter settings.

7 Conclusions and future work

In this paper, a Transparent and Nonlinear Variable Selection (TNVS) method was proposed for high-dimensional data. Transparent information decoupling was achieved with a three-step heuristic search, in which the predictors relevant to the response were selected, and the uninformative, collinear, and conditionally independent predictors were deleted. By introducing a recently proposed nonlinear partial correlation, TNVS was able to identify complex nonlinear functional dependencies, including not only monotonic and additive dependencies between the response and predictors, but also nonmonotonic or oscillatory dependencies and interactions among predictors. According to scores based on information entropy, Gram-Schmidt orthogonalization, and nonlinear partial correlation, the removed predictors were classified into the three unimportant subsets. The clear selection and deletion process enhanced the effectiveness and model interpretability of the proposed method.

We should note that the proposed method has limitations, and extensions could be developed. First, the three thresholds were designed empirically, although the selection remained stable within certain ranges of these thresholds. In the future, more principled procedures, such as nonparametric statistical tests (Shi et al., 2021), could be introduced to replace the empirical thresholds. In addition, although the proposed method made progress in identifying interacting predictors, it had difficulty identifying predictors that are marginally independent of the response but jointly correlated with it. Domain knowledge could be incorporated at the initial stage to better support the detection of such hard-to-identify interactions (Wu et al., 2022). Last but not least, although the three-step heuristic search adopted in the proposed method was more efficient than other stepwise methods, it could be time-consuming when a large number of predictors are relevant to the response. More advanced algorithms, such as genetic algorithms (Saibene & Gasparini, 2023) and other metaheuristics (Alcaraz et al., 2022; Pramanik et al., 2023), could be designed to improve the efficiency of TNVS in less sparse high-dimensional scenarios.

Acknowledgments

The work was supported by grants from the National Natural Science Foundation of China (Grant Nos. 72021001 and 71871006).

CRediT authorship contribution statement

Keyao Wang: Conceptualization, Methodology, Software, Writing - Original draft preparation. Huiwen Wang: Conceptualization, Resources, Supervision, Writing - Reviewing and Editing. Jichang Zhao: Validation, Writing - Reviewing and Editing. Lihong Wang: Supervision, Resources, Writing - Reviewing and Editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A

A simple example is provided to further demonstrate the measures and the three-step heuristic search of the proposed method for variable selection.

A1 Relevant, uninformative, multicollinear, and conditionally independent predictors

We generate a regression example with nonlinear relevance, uninformative predictors, multicollinearity, and conditionally independent predictors. The predictors and response are defined as follows:

  1. $X_{1}$, $X_{2}$, and $X_{3}$ are mutually independent predictors, all sampled from the standard normal distribution $\mathcal{N}(0, 1)$.

  2. $X_{4}=X_{1}+X_{2}$ and $X_{5}=X_{1}+X_{3}$.

  3. Only 0.1% of the observations of $X_{6}$ are nonzero, and these nonzero observations are randomly generated from $\mathcal{N}(0, 0.1^{2})$.

  4. $Y=X_{1}\cdot X_{2}$.

In this example, $X_{6}$ is an uninformative predictor, and its histogram is shown in Fig. A1a. $Y$ is nonlinearly correlated with $X_{1}$ and $X_{2}$, or with any combination of their collinear predictors, i.e., $X_{1}$ and $X_{4}$, or $X_{2}$ and $X_{4}$. There is an interaction between $X_{1}$ and $X_{2}$: given $X_{1}$, $Y$ is functionally dependent on $X_{2}$. Thus, if $X_{1}$ and $X_{2}$ are identified as relevant predictors, $X_{4}$ is a redundant predictor that is collinear with $X_{1}$ and $X_{2}$. Given the relevant predictors, $X_{3}$ and $X_{5}$ are conditionally independent predictors. The dependency between $Y$ and $X_{2}$ given $X_{1}\in[0.1, 0.2]$ is shown in Fig. A1b, and that between $Y$ and $X_{5}$ given $X_{1}\in[0.1, 0.2]$ is shown in Fig. A1c.
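For concreteness, the example can be generated with the following minimal sketch; the sample size n = 1000 and the random seed are arbitrary choices made only for illustration.

    # Minimal sketch generating the synthetic example described above.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    X1, X2, X3 = rng.standard_normal((3, n))           # mutually independent N(0, 1)
    X4 = X1 + X2                                       # collinear with X1 and X2
    X5 = X1 + X3                                       # collinear with X1 and X3
    X6 = np.zeros(n)                                   # uninformative: mostly zeros
    nonzero = rng.random(n) < 0.001                    # about 0.1% nonzero entries
    X6[nonzero] = rng.normal(0.0, 0.1, nonzero.sum())
    Y = X1 * X2                                        # nonlinear response with interaction
    X = np.column_stack([X1, X2, X3, X4, X5, X6])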

(a) Histogram of $X_{6}$. (b) Dependency of $Y$ and $X_{2}$ given $X_{1}$. (c) Dependency of $Y$ and $X_{5}$ given $X_{1}$.
Figure A1: An example to illustrate features of the uninformative predictor $X_{6}$, the relevant predictor $X_{2}$, and the conditionally independent predictor $X_{5}$.

A2 The procedure of the proposed variable selection method

The heuristic search of TNVS consists of prefiltering, forward selection, and batch deletion steps. Following the flowchart in Fig. 2, the procedure of the proposed method on this example is as follows (a schematic sketch is given after this paragraph). In the prefiltering step, the uninformative predictor, i.e., $X_{6}$ in the example, is identified and removed according to UniS. In the first iteration of forward selection, the predictor with the largest RelS is selected; assume it is $X_{1}$. A deletion step is then performed to remove redundant predictors that are collinear with $X_{1}$; here, none are identified. In the second iteration, assume $X_{2}$ is selected as the relevant predictor; then $X_{4}$ is identified as a redundant predictor, since it is a linear combination of $X_{1}$ and $X_{2}$. After two iterations, the RelS values of the remaining predictors $X_{3}$ and $X_{5}$ are both less than 0, so the search stops and these two predictors are labeled conditionally independent. The inputs are thus separated into the uninformative subset $\{X_{6}\}$, the relevant subset $\{X_{1}, X_{2}\}$, the redundant subset $\{X_{4}\}$, and the conditionally independent subset $\{X_{3}, X_{5}\}$.
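The sketch below mirrors this three-step search at a schematic level; the helpers unis, rels, and gso_residual are hypothetical placeholders for the entropy-based score, the nonlinear partial correlation score, and the Gram-Schmidt residual used by TNVS, not functions of any existing library, and the threshold conventions shown are assumptions for illustration.

    # Schematic sketch of the three-step heuristic search (prefiltering, forward
    # selection, batch deletion); unis/rels/gso_residual are hypothetical helpers.
    def tnvs_search(X, y, unis, rels, gso_residual, a1, a2, a3):
        candidates = list(range(X.shape[1]))
        uninformative = [j for j in candidates if unis(X[:, j]) <= a1]   # prefiltering
        candidates = [j for j in candidates if j not in uninformative]
        selected, redundant = [], []
        while candidates:
            scores = {j: rels(y, X[:, j], X[:, selected]) for j in candidates}
            best = max(scores, key=scores.get)
            if scores[best] <= a2:            # no remaining predictor is relevant
                break
            selected.append(best)             # forward selection
            candidates.remove(best)
            # batch deletion: drop predictors (nearly) linear in the selected ones
            newly_redundant = [j for j in candidates
                               if gso_residual(X[:, j], X[:, selected]) <= a3]
            redundant += newly_redundant
            candidates = [j for j in candidates if j not in newly_redundant]
        cond_independent = candidates         # whatever remains when the search stops
        return selected, uninformative, redundant, cond_independent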

Appendix B

Table B1: The mean and standard deviation (in parentheses) of the predictive results in real datasets. The best results are presented in bold, and the second-best results are underlined.
Dataset Model Indicator TNVS FOCI WLS SIRS DC-SIS PC-simple TPC
SVM ACC 0.6790 (0.1092) 0.6880 (0.1092) 0.6200 (0.1161) \mathbf{0.6955} (0.0980) \underline{0.6920} (0.1029) 0.6575 (0.0919) 0.6515 (0.1014)
recall 0.6820 (0.1099) 0.6924 (0.1099) 0.6007 (0.1222) \mathbf{0.6980} (0.0994) \underline{0.6928} (0.1038) 0.6512 (0.0925) 0.6649 (0.0990)
F1 0.6704 (0.1146) 0.6804 (0.1146) 0.5805 (0.1418) \mathbf{0.6901} (0.1007) \underline{0.6854} (0.1056) 0.6458 (0.0946) 0.6470 (0.1035)
Kappa 0.3579 (0.2151) 0.3778 (0.2151) 0.2034 (0.2474) \mathbf{0.3898} (0.1970) \underline{0.3810} (0.2050) 0.3025 (0.1858) 0.3191 (0.1952)
RF ACC \mathbf{0.7415} (0.0973) \underline{0.7350} (0.0973) 0.6885 (0.1310) 0.6705 (0.0990) 0.6790 (0.0990) 0.6910 (0.0957) 0.5870 (0.0939)
recall \mathbf{0.7383} (0.1019) \underline{0.7290} (0.1019) 0.6805 (0.1327) 0.6658 (0.1029) 0.6743 (0.1009) 0.6845 (0.0986) 0.6059 (0.0923)
F1 \mathbf{0.7348} (0.1041) \underline{0.7254} (0.1041) 0.6730 (0.1425) 0.6609 (0.1042) 0.6703 (0.1025) 0.6807 (0.1014) 0.5768 (0.0990)
Kappa \mathbf{0.4754} (0.2033) \underline{0.4579} (0.2033) 0.3623 (0.2673) 0.3301 (0.2049) 0.3476 (0.2018) 0.3690 (0.1972) 0.2020 (0.1777)
arcene DT ACC \underline{0.6945} (0.1093) \mathbf{0.7050} (0.1093) 0.6420 (0.1176) 0.6495 (0.1100) 0.6555 (0.0969) 0.6375 (0.0960) 0.5875 (0.0938)
recall \underline{0.6911} (0.1123) \mathbf{0.7000} (0.1123) 0.6347 (0.1200) 0.6444 (0.1105) 0.6485 (0.0979) 0.6344 (0.0991) 0.6038 (0.0932)
F1 \underline{0.6841} (0.1156) \mathbf{0.6946} (0.1156) 0.6275 (0.1246) 0.6412 (0.1121) 0.6453 (0.0999) 0.6282 (0.1007) 0.5785 (0.0980)
Kappa \underline{0.3808} (0.2246) \mathbf{0.3993} (0.2246) 0.2692 (0.2403) 0.2888 (0.2209) 0.2977 (0.1976) 0.2662 (0.1962) 0.1988 (0.1804)
lightGBM ACC \mathbf{0.7270} (0.1027) \underline{0.7185} (0.1027) 0.6915 (0.1183) 0.6695 (0.1027) 0.6875 (0.0941) 0.6800 (0.0982) 0.6345 (0.1145)
recall \mathbf{0.7249} (0.1049) \underline{0.7155} (0.1049) 0.6821 (0.1227) 0.6651 (0.1069) 0.6851 (0.0958) 0.6752 (0.1013) 0.6491 (0.1117)
F1 \mathbf{0.7197} (0.1066) \underline{0.7108} (0.1066) 0.6716 (0.1381) 0.6606 (0.1081) 0.6808 (0.0964) 0.6697 (0.1036) 0.6293 (0.1171)
Kappa \mathbf{0.4476} (0.2076) \underline{0.4291} (0.2076) 0.3651 (0.2463) 0.3281 (0.2128) 0.3676 (0.1913) 0.3490 (0.2016) 0.2884 (0.2187)
MLP ACC 0.7040 (0.1122) \mathbf{0.7210} (0.1122) 0.6395 (0.1332) \underline{0.7155} (0.1114) 0.6910 (0.1093) 0.6505 (0.0968) 0.6570 (0.1005)
recall 0.7022 (0.1130) \mathbf{0.7202} (0.1130) 0.6329 (0.1335) \underline{0.7110} (0.1131) 0.6867 (0.1132) 0.6467 (0.0979) 0.6692 (0.0990)
F1 0.6967 (0.1147) \mathbf{0.7150} (0.1147) 0.6248 (0.1407) \underline{0.7070} (0.1153) 0.6811 (0.1147) 0.6401 (0.0999) 0.6529 (0.1020)
Kappa 0.4007 (0.2237) \mathbf{0.4373} (0.2237) 0.2661 (0.2691) \underline{0.4216} (0.2258) 0.3712 (0.2241) 0.2921 (0.1947) 0.3278 (0.1953)
SVM ACC \underline{0.8723} (0.0966) 0.8203 (0.0966) \mathbf{0.8860} (0.0577) 0.4877 (0.0705) 0.4973 (0.0729) 0.4630 (0.0749) 0.4143 (0.0810)
recall \underline{0.8723} (0.0966) 0.8203 (0.0966) \mathbf{0.8860} (0.0577) 0.4877 (0.0705) 0.4973 (0.0729) 0.4630 (0.0749) 0.4143 (0.0810)
F1 \underline{0.8687} (0.1013) 0.8157 (0.1013) \mathbf{0.8846} (0.0583) 0.4688 (0.0676) 0.4766 (0.0677) 0.4454 (0.0782) 0.3990 (0.0797)
Kappa \underline{0.8468} (0.1159) 0.7844 (0.1159) \mathbf{0.8632} (0.0692) 0.3852 (0.0846) 0.3968 (0.0875) 0.3556 (0.0899) 0.2972 (0.0972)
RF ACC \mathbf{0.8733} (0.0965) 0.8297 (0.0965) \underline{0.8727} (0.0613) 0.4890 (0.0734) 0.4963 (0.0698) 0.4347 (0.0737) 0.3720 (0.0775)
recall \mathbf{0.8733} (0.0965) 0.8297 (0.0965) \underline{0.8727} (0.0613) 0.4890 (0.0734) 0.4963 (0.0698) 0.4347 (0.0737) 0.3720 (0.0775)
F1 \underline{0.8709} (0.0981) 0.8263 (0.0981) \mathbf{0.8712} (0.0625) 0.4736 (0.0756) 0.4782 (0.0713) 0.4216 (0.0748) 0.3594 (0.0759)
Kappa \mathbf{0.8480} (0.1158) 0.7956 (0.1158) \underline{0.8472} (0.0736) 0.3868 (0.0881) 0.3956 (0.0837) 0.3216 (0.0884) 0.2464 (0.0930)
isolet DT ACC \mathbf{0.8113} (0.1081) 0.7857 (0.1081) \underline{0.7893} (0.0829) 0.4270 (0.0712) 0.4353 (0.0813) 0.3960 (0.0785) 0.3547 (0.0772)
recall \mathbf{0.8113} (0.1081) 0.7857 (0.1081) \underline{0.7893} (0.0829) 0.4270 (0.0712) 0.4353 (0.0813) 0.3960 (0.0785) 0.3547 (0.0772)
F1 \mathbf{0.8076} (0.1105) 0.7815 (0.1105) \underline{0.7835} (0.0856) 0.4194 (0.0705) 0.4297 (0.0773) 0.3875 (0.0770) 0.3455 (0.0729)
Kappa \mathbf{0.7736} (0.1297) 0.7428 (0.1297) \underline{0.7472} (0.0994) 0.3124 (0.0854) 0.3224 (0.0976) 0.2752 (0.0942) 0.2256 (0.0926)
lightGBM ACC \mathbf{0.8807} (0.1108) 0.8197 (0.1108) \underline{0.8787} (0.0645) 0.4953 (0.0789) 0.5133 (0.0749) 0.4130 (0.0731) 0.3647 (0.0646)
recall \mathbf{0.8807} (0.1108) 0.8197 (0.1108) \underline{0.8787} (0.0645) 0.4953 (0.0789) 0.5133 (0.0749) 0.4130 (0.0731) 0.3647 (0.0646)
F1 \mathbf{0.8784} (0.1101) 0.8188 (0.1101) \underline{0.8777} (0.0648) 0.4845 (0.0794) 0.5012 (0.0765) 0.4031 (0.0713) 0.3563 (0.0644)
Kappa \mathbf{0.8568} (0.1330) 0.7836 (0.1330) \underline{0.8544} (0.0774) 0.3944 (0.0946) 0.4160 (0.0899) 0.2956 (0.0877) 0.2376 (0.0775)
MLP ACC \underline{0.8813} (0.0930) 0.8237 (0.0930) \mathbf{0.8980} (0.0556) 0.4730 (0.0808) 0.4930 (0.0795) 0.4227 (0.0752) 0.3883 (0.0763)
recall \underline{0.8813} (0.0930) 0.8237 (0.0930) \mathbf{0.8980} (0.0556) 0.4730 (0.0808) 0.4930 (0.0795) 0.4227 (0.0752) 0.3883 (0.0763)
F1 \underline{0.8791} (0.0965) 0.8206 (0.0965) \mathbf{0.8967} (0.0566) 0.4664 (0.0771) 0.4873 (0.0766) 0.4150 (0.0762) 0.3786 (0.0764)
Kappa \underline{0.8576} (0.1116) 0.7884 (0.1116) \mathbf{0.8776} (0.0667) 0.3676 (0.0969) 0.3916 (0.0954) 0.3072 (0.0902) 0.2660 (0.0916)
SVM ACC \mathbf{0.7015} (0.1371) \underline{0.6545} (0.1371) 0.5073 (0.1477) 0.3833 (0.1317) 0.3845 (0.1211) 0.3423 (0.1356) 0.3495 (0.1389)
recall \mathbf{0.7015} (0.1371) \underline{0.6545} (0.1371) 0.5073 (0.1477) 0.3833 (0.1317) 0.3845 (0.1211) 0.3423 (0.1356) 0.3495 (0.1389)
F1 \mathbf{0.6434} (0.1519) \underline{0.5902} (0.1519) 0.4524 (0.1464) 0.3108 (0.1334) 0.3075 (0.1190) 0.2778 (0.1226) 0.2840 (0.1218)
Kappa \mathbf{0.6683} (0.1523) \underline{0.6161} (0.1523) 0.4525 (0.1641) 0.3147 (0.1463) 0.3161 (0.1345) 0.2692 (0.1507) 0.2772 (0.1543)
RF ACC \mathbf{0.7920} (0.1282) \underline{0.7515} (0.1282) 0.6350 (0.1427) 0.6505 (0.1479) 0.6523 (0.1494) 0.3453 (0.1396) 0.3680 (0.1443)
recall \mathbf{0.7920} (0.1282) \underline{0.7515} (0.1282) 0.6350 (0.1427) 0.6505 (0.1479) 0.6523 (0.1494) 0.3453 (0.1396) 0.3680 (0.1443)
F1 \mathbf{0.7434} (0.1487) \underline{0.6968} (0.1487) 0.5727 (0.1510) 0.5853 (0.1562) 0.5869 (0.1546) 0.2933 (0.1283) 0.3088 (0.1281)
Kappa \mathbf{0.7689} (0.1424) \underline{0.7239} (0.1424) 0.5944 (0.1585) 0.6117 (0.1643) 0.6136 (0.1660) 0.2725 (0.1551) 0.2978 (0.1603)
warpAR10P DT ACC \mathbf{0.6863} (0.1420) \underline{0.6173} (0.1420) 0.5225 (0.1422) 0.4453 (0.1441) 0.4520 (0.1421) 0.2955 (0.1358) 0.3085 (0.1309)
recall \mathbf{0.6863} (0.1420) \underline{0.6173} (0.1420) 0.5225 (0.1422) 0.4453 (0.1441) 0.4520 (0.1421) 0.2955 (0.1358) 0.3085 (0.1309)
F1 \mathbf{0.6309} (0.1534) \underline{0.5527} (0.1534) 0.4582 (0.1372) 0.3856 (0.1346) 0.3864 (0.1331) 0.2473 (0.1181) 0.2576 (0.1186)
Kappa \mathbf{0.6514} (0.1578) \underline{0.5747} (0.1578) 0.4694 (0.1580) 0.3836 (0.1601) 0.3911 (0.1579) 0.2172 (0.1509) 0.2317 (0.1454)
lightGBM ACC \mathbf{0.7735} (0.1337) \underline{0.7115} (0.1337) 0.6273 (0.1412) 0.5685 (0.1452) 0.5655 (0.1427) 0.2993 (0.1379) 0.3363 (0.1414)
recall \mathbf{0.7735} (0.1337) \underline{0.7115} (0.1337) 0.6273 (0.1412) 0.5685 (0.1452) 0.5655 (0.1427) 0.2993 (0.1379) 0.3363 (0.1414)
F1 \mathbf{0.7266} (0.1494) \underline{0.6551} (0.1494) 0.5647 (0.1487) 0.5003 (0.1468) 0.4988 (0.1483) 0.2487 (0.1223) 0.2798 (0.1301)
Kappa \mathbf{0.7483} (0.1486) \underline{0.6794} (0.1486) 0.5858 (0.1568) 0.5206 (0.1614) 0.5172 (0.1586) 0.2214 (0.1532) 0.2625 (0.1571)
MLP ACC \mathbf{0.8248} (0.1187) \underline{0.8003} (0.1187) 0.6550 (0.1519) 0.6693 (0.1429) 0.6713 (0.1302) 0.3255 (0.1257) 0.3603 (0.1477)
recall \mathbf{0.8248} (0.1187) \underline{0.8003} (0.1187) 0.6550 (0.1519) 0.6693 (0.1429) 0.6713 (0.1302) 0.3255 (0.1257) 0.3603 (0.1477)
F1 \mathbf{0.7826} (0.1341) \underline{0.7591} (0.1341) 0.5934 (0.1639) 0.6142 (0.1506) 0.6157 (0.1349) 0.2731 (0.1163) 0.3007 (0.1363)
Kappa \mathbf{0.8053} (0.1319) \underline{0.7781} (0.1319) 0.6167 (0.1688) 0.6325 (0.1588) 0.6347 (0.1447) 0.2506 (0.1396) 0.2892 (0.1641)
SVM ACC \underline{0.7905} (0.1156) 0.7428 (0.1156) \mathbf{0.9465} (0.0346) 0.6730 (0.0657) 0.6638 (0.0629) 0.4408 (0.0829) 0.4433 (0.0839)
recall \underline{0.7905} (0.1156) 0.7428 (0.1156) \mathbf{0.9465} (0.0346) 0.6730 (0.0657) 0.6638 (0.0629) 0.4408 (0.0829) 0.4433 (0.0839)
F1 \underline{0.7486} (0.1286) 0.6931 (0.1286) \mathbf{0.9308} (0.0438) 0.6117 (0.0705) 0.6020 (0.0662) 0.3713 (0.0828) 0.3731 (0.0845)
Kappa \underline{0.7851} (0.1186) 0.7362 (0.1186) \mathbf{0.9451} (0.0355) 0.6646 (0.0674) 0.6551 (0.0645) 0.4264 (0.0850) 0.4290 (0.0860)
RF ACC \underline{0.8600} (0.0763) 0.8333 (0.0763) \mathbf{0.9238} (0.0450) 0.8250 (0.0547) 0.8213 (0.0569) 0.5043 (0.0763) 0.5088 (0.0867)
recall \underline{0.8600} (0.0763) 0.8333 (0.0763) \mathbf{0.9238} (0.0450) 0.8250 (0.0547) 0.8213 (0.0569) 0.5043 (0.0763) 0.5088 (0.0867)
F1 \underline{0.8244} (0.0886) 0.7924 (0.0886) \mathbf{0.9009} (0.0571) 0.7808 (0.0652) 0.7751 (0.0683) 0.4369 (0.0769) 0.4428 (0.0853)
Kappa \underline{0.8564} (0.0783) 0.8290 (0.0783) \mathbf{0.9218} (0.0462) 0.8205 (0.0561) 0.8167 (0.0584) 0.4915 (0.0783) 0.4962 (0.0889)
ORL DT ACC \mathbf{0.5815} (0.0751) \underline{0.5463} (0.0751) 0.4678 (0.0691) 0.4360 (0.0640) 0.4420 (0.0800) 0.3513 (0.0863) 0.3525 (0.0818)
recall \mathbf{0.5815} (0.0751) \underline{0.5463} (0.0751) 0.4678 (0.0691) 0.4360 (0.0640) 0.4420 (0.0800) 0.3513 (0.0863) 0.3525 (0.0818)
F1 \mathbf{0.5159} (0.0744) \underline{0.4771} (0.0744) 0.4022 (0.0706) 0.3695 (0.0630) 0.3760 (0.0784) 0.2935 (0.0834) 0.2927 (0.0772)
Kappa \mathbf{0.5708} (0.0770) \underline{0.5346} (0.0770) 0.4541 (0.0709) 0.4215 (0.0656) 0.4277 (0.0821) 0.3346 (0.0885) 0.3359 (0.0839)
lightGBM ACC \underline{0.7848} (0.0773) 0.7458 (0.0773) \mathbf{0.7968} (0.0574) 0.6133 (0.0721) 0.6163 (0.0708) 0.4508 (0.0851) 0.4483 (0.0821)
recall \underline{0.7848} (0.0773) 0.7458 (0.0773) \mathbf{0.7968} (0.0574) 0.6133 (0.0721) 0.6163 (0.0708) 0.4508 (0.0851) 0.4483 (0.0821)
F1 \underline{0.7363} (0.0867) 0.6896 (0.0867) \mathbf{0.7470} (0.0675) 0.5434 (0.0801) 0.5479 (0.0764) 0.3836 (0.0831) 0.3814 (0.0814)
Kappa \underline{0.7792} (0.0793) 0.7392 (0.0793) \mathbf{0.7915} (0.0588) 0.6033 (0.0739) 0.6064 (0.0726) 0.4367 (0.0873) 0.4341 (0.0842)
MLP ACC \underline{0.8420} (0.0736) 0.8055 (0.0736) \mathbf{0.8638} (0.0484) 0.7000 (0.0674) 0.6973 (0.0668) 0.4615 (0.0908) 0.4670 (0.0881)
recall \underline{0.8420} (0.0736) 0.8055 (0.0736) \mathbf{0.8638} (0.0484) 0.7000 (0.0674) 0.6973 (0.0668) 0.4615 (0.0908) 0.4670 (0.0881)
F1 \underline{0.8038} (0.0851) 0.7597 (0.0851) \mathbf{0.8286} (0.0583) 0.6398 (0.0727) 0.6372 (0.0723) 0.3928 (0.0882) 0.3992 (0.0847)
Kappa \underline{0.8379} (0.0755) 0.8005 (0.0755) \mathbf{0.8603} (0.0496) 0.6923 (0.0691) 0.6895 (0.0686) 0.4477 (0.0931) 0.4533 (0.0904)

Appendix C

Figure C1: The accuracy of different learning models using the selected predictors under different parameter settings of TNVS.

References

  • Alcaraz et al. (2022) Alcaraz, J., Labbé, M., & Landete, M. (2022). Support vector machine with feature selection: A multiobjective approach. Expert Syst. Appl., 204, 117485.
  • Azadkia & Chatterjee (2021) Azadkia, M., & Chatterjee, S. (2021). A simple measure of conditional dependence. Ann. Stat., 49, 3070–3102.
  • Barut et al. (2016) Barut, E., Fan, J., & Verhasselt, A. (2016). Conditional sure independence screening. J. Am. Stat. Assoc., 111, 1266–1277.
  • Brown et al. (2012) Brown, G., Pocock, A., Zhao, M.-J., & Luján, M. (2012). Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res., 13, 27–66.
  • Buhlmann et al. (2010) Buhlmann, P., Kalisch, M., & Maathuis, M. H. (2010). Variable selection in high-dimensional linear models: Partially faithful distributions and the pc-simple algorithm. Biometrika, 97, 261–278.
  • Buitinck et al. (2013) Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., & Varoquaux, G. (2013). Api design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning (pp. 108–122).
  • Cai et al. (2018) Cai, J., Luo, J., Wang, S., & Yang, S. (2018). Feature selection in machine learning: A new perspective. Neurocomputing, 300, 70–79.
  • Chatterjee (2021) Chatterjee, S. (2021). A new coefficient of correlation. J. Am. Stat. Assoc., 116, 2009–2022.
  • Chaudhari & Thakkar (2023) Chaudhari, K., & Thakkar, A. (2023). Neural network systems with an integrated coefficient of variation-based feature selection for stock price and trend prediction. Expert Syst. Appl., 219, 119527.
  • Chen et al. (2021) Chen, Y., Gao, Q., Liang, F., & Wang, X. (2021). Nonlinear variable selection via deep neural networks. J. Comput. Graph. Stat., 30, 484–492.
  • Dessì & Pes (2015) Dessì, N., & Pes, B. (2015). Similarity of feature selection methods: An empirical study across data intensive classification tasks. Expert Syst. Appl., 42, 4632–4642.
  • Efroymson (1960) Efroymson, M. A. (1960). Multiple regression analysis. Math. Methods Digit. Comput., 1, 191–203.
  • Fan et al. (2020) Fan, J., Li, R., Zhang, C.-H., & Zou, H. (2020). Statistical Foundations of Data Science. (1st ed.). Chapman and Hall/CRC.
  • Fan & Lv (2008) Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B-Stat. Methodol., 70, 849–911.
  • Fan & Lv (2010) Fan, J., & Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Stat. Sin., 20, 101–148.
  • Gray (2011) Gray, R. M. (2011). Entropy. In Entropy and Information Theory (pp. 61–95). Boston, MA: Springer US.
  • Guyon & Elisseeff (2003) Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. J. Mach. Learn. Res., 3, 1157–1182.
  • Hall & Miller (2009) Hall, P., & Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. J. Comput. Graph. Stat., 18, 533–550.
  • Hastie et al. (2001) Hastie, T., Friedman, J., & Tibshirani, R. (2001). Basis expansions and regularization. In The Elements of Statistical Learning: Data Mining, Inference, and Prediction (pp. 115–163). New York, NY: Springer New York.
  • Hossny et al. (2020) Hossny, A. H., Mitchell, L., Lothian, N., & Osborne, G. (2020). Feature selection methods for event detection in twitter: A text mining approach. Soc. Netw. Anal. Min., 10, 61.
  • Köstinger et al. (2011) Köstinger, M., Wohlhart, P., Roth, P. M., & Bischof, H. (2011). Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops) (pp. 2144–2151).
  • Li et al. (2017a) Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., & Liu, H. (2017a). Feature selection: A data perspective. ACM Comput. Surv., 50.
  • Li (1991) Li, K.-C. (1991). Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc., 86, 316–327.
  • Li et al. (2017b) Li, R., Liu, J., & Lou, L. (2017b). Variable selection via partial correlation. Stat. Sin., 27, 983–996.
  • Li et al. (2012) Li, R., Zhong, W., & Zhu, L. (2012). Feature screening via distance correlation learning. J. Am. Stat. Assoc., 107, 1129–1139.
  • Liu et al. (2018) Liu, R., Wang, H., & Wang, S. (2018). Functional variable selection via gram–Schmidt orthogonalization for multiple functional linear regression. J. Stat. Comput. Simul., 88, 3664–3680.
  • Liu et al. (2022) Liu, W., Ke, Y., Liu, J., & Li, R. (2022). Model-free feature screening and fdr control with knockoff features. J. Am. Stat. Assoc., 117, 428–443.
  • Lu et al. (2023) Lu, S., Yu, M., & Wang, H. (2023). What matters for short videos’ user engagement: A multiblock model with variable screening. Expert Syst. Appl., 218, 119542.
  • Lyu et al. (2017) Lyu, H., Wan, M., Han, J., Liu, R., & Wang, C. (2017). A filter feature selection method based on the maximal information coefficient and gram-schmidt orthogonalization for biomedical data mining. Comput. Biol. Med., 89, 264–274.
  • Marra & Wood (2011) Marra, G., & Wood, S. N. (2011). Practical variable selection for generalized additive models. Comput. Stat. Data Anal., 55, 2372–2387.
  • Murdoch et al. (2019) Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. U. S. A., 116, 22071–22080.
  • Pan et al. (2019) Pan, W., Wang, X., Xiao, W., & Zhu, H. (2019). A generic sure independence screening procedure. J. Am. Stat. Assoc., 114, 928–937.
  • Pramanik et al. (2023) Pramanik, R., Pramanik, P., & Sarkar, R. (2023). Breast cancer detection in thermograms using a hybrid of ga and gwo based deep feature selection method. Expert Syst. Appl., 219, 119643.
  • Rudin et al. (2022) Rudin, C., Chen, C., Chen, Z., Huang, H., Semenova, L., & Zhong, C. (2022). Interpretable machine learning: Fundamental principles and 10 grand challenges. Statist. Surv., 16.
  • Saibene & Gasparini (2023) Saibene, A., & Gasparini, F. (2023). Genetic algorithm for feature selection of eeg heterogeneous data. Expert Syst. Appl., 217, 119488.
  • Salesi et al. (2021) Salesi, S., Cosma, G., & Mavrovouniotis, M. (2021). Taga: Tabu asexual genetic algorithm embedded in a filter/filter feature selection approach for high-dimensional data. Inf. Sci., 565, 105–127.
  • Shi et al. (2021) Shi, H., Drton, M., & Han, F. (2021). On azadkia-chatterjee’s conditional dependence coefficient. arXiv preprint arXiv:2108.06827.
  • Song et al. (2017) Song, Q., Jiang, H., & Liu, J. (2017). Feature selection based on fda and f-score for multi-class classification. Expert Syst. Appl., 81, 22–27.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B-Stat. Methodol., 58, 267–288.
  • Wan et al. (2022) Wan, J., Chen, H., Li, T., Huang, W., Li, M., & Luo, C. (2022). R2ci: Information theoretic-guided feature selection with multiple correlations. Pattern Recognit., 127, 108603.
  • Wan et al. (2021) Wan, J., Chen, H., Li, T., Yang, X., & Sang, B. (2021). Dynamic interaction feature selection based on fuzzy rough set. Inf. Sci., 581, 891–911.
  • Wang et al. (2020) Wang, H., Liu, R., Wang, S., Wang, Z., & Saporta, G. (2020). Ultra-high dimensional variable screening via gram–Schmidt orthogonalization. Comput. Stat., 35, 1153–1170.
  • Wang et al. (2018) Wang, X., Wen, C., Pan, W., & Huang, M. (2018). Sure independence screening adjusted for confounding covariates with ultrahigh-dimensional data. Stat. Sin., 28, 293–317.
  • Wu et al. (2022) Wu, X., Tao, Z., Jiang, B., Wu, T., Wang, X., & Chen, H. (2022). Domain knowledge-enhanced variable selection for biomedical data analysis. Inf. Sci., 606, 469–488.
  • Yin et al. (2022) Yin, D., Chen, D., Tang, Y., Dong, H., & Li, X. (2022). Adaptive feature selection with shapley and hypothetical testing: Case study of eeg feature engineering. Inf. Sci., 586, 374–390.
  • Yu & Liu (2004) Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res., 5, 1205–1224.
  • Zhong et al. (2021) Zhong, W., Liu, Y., & Zeng, P. (2021). A model-free variable screening method based on leverage score. J. Am. Stat. Assoc., 1, 1–12.
  • Zhu et al. (2011) Zhu, L.-P., Li, L., Li, R., & Zhu, L.-X. (2011). Model-free feature screening for ultrahigh-dimensional data. J. Am. Stat. Assoc., 106, 1464–1475.