Nonparametric Variable Screening with Optimal Decision Stumps

Jason M. Klusowski [email protected]
Supported in part by NSF DMS-1915932 and NSF TRIPODS DATA-INSPIRE CCF-1934924. Peter M. Tian [email protected]
Supported in part by the Gordon Wu Fellowship of Princeton University.

Abstract

Decision trees and their ensembles are endowed with a rich set of diagnostic tools for ranking and screening variables in a predictive model. Despite the widespread use of tree based variable importance measures, pinning down their theoretical properties has been challenging and therefore largely unexplored. To address this gap between theory and practice, we derive finite sample performance guarantees for variable selection in nonparametric models using a single-level CART decision tree (a decision stump). Under standard operating assumptions in variable screening literature, we find that the marginal signal strength of each variable and ambient dimensionality can be considerably weaker and higher, respectively, than state-of-the-art nonparametric variable selection methods. Furthermore, unlike previous marginal screening methods that attempt to directly estimate each marginal projection via a truncated basis expansion, the fitted model used here is a simple, parsimonious decision stump, thereby eliminating the need for tuning the number of basis terms. Thus, surprisingly, even though decision stumps are highly inaccurate for estimation purposes, they can still be used to perform consistent model selection.

1 Introduction

A common task in many applied disciplines involves determining which variables, among many, are most important in a predictive model. In high-dimensional sparse models, many of these predictor variables may be irrelevant in how they affect the response variable. As a result, variable selection techniques are crucial for filtering out irrelevant variables in order to prevent overfitting, improve accuracy, and enhance the interpretability of the model. Indeed, algorithms that screen for relevant variables have been instrumental in the modern development of fields such as genomics, biomedical imaging, signal processing, image analysis, and finance, where high-dimensional, sparse data is frequently encountered [12].

Over the years, numerous parametric and nonparametric methods for variable selection in high-dimensional models have been proposed and studied. For linear models, the LARS algorithm [10] (for Lasso [39]) and Sure Independence Screening (SIS) [12] serve as prototypical examples that have achieved immense success, both practically and theoretically. Other strategies for nonparametric additive models such as Nonparametric Independence Screening (NIS) [11] and Sparse Additive Models (SPAM) [35] have also enjoyed a similar history of success.

1.1 Tree based variable selection procedures

Alternatively, because they are built from highly interpretable and simple objects, decision tree models are another important tool in the data analyst’s repertoire. Indeed, after only a brief explanation, one is able to understand the tree construction and its output in terms of meaningful domain specific attributes of the variables. In addition to being interpretable, tree based model have good computational scalability as the number of data points grows, making them faster than many other methods when dealing with large datasets. In terms of flexibility, they can naturally handle a mixture of numeric variables, categorical variables, and missing values. Lastly, they require less preprocessing (because they are invariant to monotone transformations of the inputs), are quite robust to outliers, and are relatively unaffected by the inclusion of many irrelevant variables [17, 26], the last point being of relevance to the variable selection problem.

Conventional tree structured models such as CART [6], random forests [4], ExtraTrees [15], and gradient tree boosting [14] are also equipped with heuristic variable importance measures that can be used to rank and identify relevant predictor variables for further investigation (such as plotting their partial dependence functions). In fact, tree based variable importance measures have been used to discover genes in bioinformatics [5, 33, 8, 9, 20], identify loyal customers and clients [7, 40, 27], detect network intrusion [46, 47], and understand income redistribution preferences [25], to name a few applications.

1.2 Mean decrease in impurity

An attractive feature specific to CART methodology is that one can compute, essentially for free, measures of variable importance (or influence) using the optimal splitting variables and their corresponding impurities. The canonical CART-based variable importance measure is the Mean Decrease in Impurity (MDI) [14, Section 8.1], [6, Section 5.3.4], [17, Sections 10.13.1 & 15.3.2], which calculates an importance score for a variable by summing the largest impurity reductions (weighted by the fraction of samples in the node) over all non-terminal nodes split with that variable, averaged over all trees in the ensemble. Another commonly used and sometimes more accurate measure is the Mean Decrease in Accuracy (MDA) or permutation importance measure, defined as the average difference in out-of-bag errors before and after randomly permuting the data values of a variable in out-of-bag samples over all trees in the ensemble [4]. However, from a computational perspective, MDI is preferable to MDA since it can be computed as each tree is grown with no additional cost. While we view the analysis of MDA as an important endeavor, it is outside the scope of the present paper and therefore not our current focus. In contrast to the aforementioned variable selection procedures like SIS, NIS, Lasso, and SPAM, except for a few papers, little is known about the finite sample performance of MDI. Theoretical results are mainly limited to the asymptotic, fixed dimensional data setting and to showing what would be expected from a reasonable measure of variable importance. For example, it is shown in [31] that the MDI importance of a categorical variable (in an asymptotic data and ensemble size setting) is zero precisely when the variable is irrelevant, and that the MDI importance for relevant variables remains unchanged if irrelevant variables are removed or added.

Recent complementary work by [36] established the large sample properties of MDI for additive models by showing that it converges to sensible quantities like the variance of the component functions, provided the decision tree is sufficiently deep. However, these results are not fine-grained enough to handle the case where either the dimensionality grows or the marginal signals decay with the sample size. Furthermore, [36] crucially relies on the assumption that the CART decision tree is a consistent predictor, which is currently only known for additive models [37]. It is therefore unclear whether the techniques can be generalized to models beyond those considered therein. On the other hand, our results suggest that tree based variable importance measures can still have good variable selection properties even though the underlying tree model may be a poor predictor of the data generating process—which can occur with CART and random forests.

Lastly, we mention that important steps have also been taken to characterize the finite sample properties of MDI; [29] show that MDI is less biased for irrelevant variables when each tree is shallow. This work therefore covers one facet of the variable selection problem, i.e., controlling the number of false positives, and will be employed in our proofs.

1.3 New contributions

The lack of theoretical development for tree based variable selection is likely because the training mechanism involves complex steps —which present new theoretical challenges— such as bagging, boosting, pruning, random selections of the predictor variables for candidate splits, recursive splitting, and line search to find the best split points [24]. The last consideration, importantly, means that the underlying tree construction (e.g., split points) depends on both the input and output data, which enables it to adapt to structural properties of the underlying statistical model (such as sparsity). This data adaptivity is a double-edged sword from a theoretical standpoint, though, since unravelling the data dependence is a formidable task.

Despite the aforementioned challenges, we advance the study of tree based variable selection by focusing on the following two fundamental questions:

•

Does MDI enjoy finite sample guarantees for its variable ranking?
•

What are the benefits of MDI over other variable screening methods?

Specifically, we derive rigorous finite sample guarantees for what we call the SDI importance measure, a sobriquet for Single-level Decrease in Impurity, which is a special case of MDI for a single-level CART decision tree or “decision stump” [21]. This is similar in spirit to the approach of DSTUMP [24] but, importantly, SDI incorporates the line search step by finding the optimal split point, instead of the empirical median, of every predictor variable. As we shall see, ranking variables according to their SDI is equivalent to ranking the variables according to the marginal sample correlations between the response data and the optimal decision stump with respect to those variables. This equivalence also yields connections with other variable selection methods: for linear models with Gaussian variates, we show that SDI is asymptotically equivalent to SIS (up to logarithmic factors in the sample size), and so SDI inherits the so-called sure screening property [12] under suitable assumptions.

Unlike SIS, however, SDI is accompanied by provable guarantees for nonparametric models. We show that under certain conditions, SDI achieves model selection consistency; that is, it correctly selects the relevant variables of the model with probability approaching one as the sample size increases. In fact, the minimum signal strength of each relevant variable and maximum dimensionality of the model are shown to be less restrictive for SDI than NIS or SPAM. In the linear model case with Gaussian variates, SDI is shown to nearly match the optimal sample size threshold (achieved by Lasso) for exact support recovery. These favorable properties are particularly striking when one is reminded that the underlying model fit to the data is a simple, parsimonious decision stump—in particular, there is no need to specify a flexible function class (such as a polynomial spline family) and be concerned with calibrating the number of basis terms or bandwidth parameters.

Finally, we empirically compare SDI to other contemporaneous variable selection algorithms, namely, SIS, NIS, Lasso, and SPAM. We find that SDI is competitive and performs favorably in cases where the model is less smooth. Furthermore, we empirically verify the similarity between SDI and SIS for the linear model case, and confirm the model selection consistency properties of SDI for various types of nonparametric models.

Let us close this section by saying a few words about the primary goals of this paper. In practice, tree based variable importance measures such as MDI and MDA are most commonly used to rank variables, in order to determine which ones are worthy of further examination. This is a less ambitious endeavor than that sought by the aforementioned variable selection methods, which aim for consistent model selection, namely, determining exactly which variables are relevant and irrelevant. Though we do study the model selection problem, our priority is to demonstrate the power of tree based variable importance measures as interpretable, accurate, and efficient variable ranking tools.

1.4 Organization

The paper is organized according to the following schema. First, in Section 2, we formally describe our learning setting and problem, discuss prior art, and introduce the SDI algorithm. We outline some key lemmas and ideas used in the proofs in Section 3. In the next two sections, we establish our main results: in Section 4, we establish that for linear models, SDI is asymptotically equivalent to SIS (up to logarithmic factors in the sample size) and in Section 5, we establish the variable ranking and model selection consistency properties of SDI for more general nonparametric models. Lastly, we compare the performance of SDI to other well-known variable selection techniques from a theoretical perspective in Section 6 and from an empirical perspective in Section 7. We conclude with a few remarks in Section 8. Due to space constraints, we include the full proofs of our main results in Sections 4 and 5 in the appendix.

2 Setup and algorithm

In this section we introduce notation, formalize the learning setting, and give an explicit layout of our SDI algorithm. At the end of the section, we discuss its complexity and provide several interpretations.

2.1 Notation

For labeled data $\{(U_{1},V_{1}),\dots,(U_{n},V_{n}))\}$ drawn from a population distribution $(U,V)$ , we let $\widehat{\mathrm{Cov}}(U,V)=\frac{1}{n}\sum_{i=1}^{n}(U_{i}-\overline{U})(V_{i}-\overline{V})$ , $\widehat{\mathrm{Var}}(U)=\frac{1}{n}\sum_{i=1}^{n}(U_{i}-\overline{U})^{2}$ , $\overline{U}=\frac{1}{n}\sum_{i=1}^{n}U_{i}$ , and $\widehat{\rho}\,(U,V)=\frac{\widehat{\mathrm{Cov}}(U,V)}{\sqrt{\widehat{\mathrm{Var}}(U)\widehat{\mathrm{Var}}(V)}}$ denote the sample covariance, variance, mean, and Pearson product-moment correlation coefficient respectively. The population level covariance, variance, and correlation are denoted by $\mathrm{Cov}(U,V)$ , $\mathrm{Var}(U)=\sigma_{U}^{2}$ , and $\rho(U,V)$ , respectively.

2.2 Learning setting

Throughout this paper, we operate under a standard regression framework where the statistical model is $Y=g(\mathbf{X})+\varepsilon$ , the vector of predictor variables is $\mathbf{X}=(X_{1},\ldots,X_{p})^{\top}$ , and $\varepsilon$ is statistical noise. While our results are valid for general nonparametric models, for conceptual simplicity, the canonical model class we have in mind is additive models, i.e.,

g(X_{1},\ldots,X_{p})=g_{1}(X_{1})+\cdots+g_{p}(X_{p})

(1)

for some univariate component functions $g_{1}(\cdot),\ldots,g_{p}(\cdot)$ . As is standard with additive modeling [17, Section 9.1.1], for identifiability of the components, we assume that the $g_{j}(X_{j})$ have population mean zero for all $j$ . This model class strikes a balance between flexibility and learnability—it is more flexible than linear models, but, by giving up on modeling interaction terms, it does not suffer from the curse of dimensionality.

We observe data ${\mathcal{D}}=\{(\mathbf{X}_{1},Y_{1}),\dots,(\mathbf{X}_{n},Y_{n})\}$ with the $i^{\text{th}}$ sample point $(\mathbf{X}_{i},Y_{i})=(X_{i1},\ldots,X_{ip},Y_{i})$ drawn independently from the model above. Note that with this notation, $X_{ij}$ are i.i.d. instances of the random variable $X_{j}$ . We assume that the regression function $g(\cdot)$ depends only on a small subset of the variables $\{X_{j}\}_{j\in{\mathcal{S}}}$ , which we call relevant variables with support ${\mathcal{S}}\subset\{1,\ldots,p\}$ and sparsity level $s=|{\mathcal{S}}|\ll p$ . Equivalently, $g_{j}(\cdot)$ is identically zero for the irrelevant variables $\{X_{j}\}_{j\in{\mathcal{S}}^{c}}$ . In this paper, we consider the variable ranking problem, defined here as ranking the variables so that the top $s$ coincide with ${\mathcal{S}}$ with high probability. As a corollary, this will enable us to solve the variable selection problem, namely, determining the subset ${\mathcal{S}}$ . We pay special attention to the high-dimensional regime where $p\gg n$ . In fact, in Section 5.3 we will provide conditions under which consistent variable selection occurs even when $p=\exp(o(n))$ .

2.3 Prior art

The conventional approach to marginal screening for nonparametric additive models is to directly estimate either the nonparametric components $g_{j}(X_{j})$ or the marginal projections

f_{j}(X_{j})\coloneqq\mathbb{E}[Y|X_{j}],

with the ultimate goal of studying their variances or their correlations with the response variable.¹¹1Note that $f_{j}(X_{j})$ need not be the same as $g_{j}(X_{j})$ unless, for instance, the predictor variables are independent and the noise is independent and mean zero. To accomplish this, SIS, NIS, and [16] rank the variables according to the correlations between the response values and least squares fits over a univariate model class ${\mathcal{H}}$ , i.e.,

\widehat{\rho}\,(\hat{h}(X_{j}),Y),\quad\text{where}\quad\hat{h}(\cdot)\in\operatorname*{arg\,min}_{h(\cdot)\in{\mathcal{H}}}\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-h(X_{ij}))^{2}.

(2)

The model class ${\mathcal{H}}$ is chosen to make the above optimization tractable, while at the same time, be sufficiently rich in order to approximate $f_{j}(X_{j})$ . For example, if ${\mathcal{H}}$ is the space of polynomial splines of a fixed degree, then $\hat{h}(\cdot)$ in (2) can be computed efficiently via a truncated B-spline basis expansion

\beta_{1}\Psi_{1}(X_{j})+\beta_{2}\Psi_{2}(X_{j})+\cdots+\beta_{d_{n}}\Psi_{d_{n}}(X_{j}),

(3)

as is done with NIS. Similarly, SIS takes ${\mathcal{H}}$ to be the family of linear functions in a single variable. Complementary methods that aim to directly estimate each $g_{j}(X_{j})$ include SPAM, which uses a smooth back-fitting algorithm with soft-thresholding, and [19], which combines adaptive group Lasso with truncated B-spline basis expansions.

As we shall see, SDI is equivalent to ranking the variables according to (2) when ${\mathcal{H}}$ consists of the collection of all decision stumps in $X_{j}$ of the form

\beta_{1}\,\mathbf{1}(X_{j}\leq z)+\beta_{2}\,\mathbf{1}(X_{j}>z),\qquad z,\beta_{1},\beta_{2}\in\mathbb{R}.

(4)

Unlike the previous expressive models such a polynomial splines (3), a one-level decision tree, realized by the model (4) above, severely underfits the data and would therefore be ill-advised for estimating $f_{j}(X_{j})$ , if that were the goal. Remarkably, we show that this rigidity does not hinder SDI for variable selection. What redeems SDI is that, unlike the aforementioned methods that are based on linear estimators, decision stumps (4) are nonlinear since the splits points can depend on the response data. These model nonlinearities equip SDI with the ability to discover nonlinear patterns in the data, despite its poor approximation capabilities.

Finally, we mention that one benefit of such a simple model as (4) is that it is completely free of tuning parameters. In contrast, other methods such as the ones listed here require careful calibration of, for example, variable bandwidth smoothers or the number of terms in the basis expansions (e.g., SPAM and NIS).

2.4 The SDI algorithm

In this section, we provide the details for the SDI algorithm. We first provide some high-level intuition.

In order to determine whether, say, $X_{j}$ is relevant for predicting $Y$ from $\mathbf{X}$ , it is natural to first divide the data into two groups according to whether $X_{j}$ is above or below some predetermined cutoff value and then assess how much the variance in $Y$ changes before and after this division. A small change in the variability indicates a weak or nonexistent dependence of $Y$ on $X_{j}$ ; whereas, a moderate to large change indicates heterogeneity in $Y$ across different values of $X_{j}$ . As we now explain, this is precisely what SDI does when the predetermined cutoff value is sought by a least squares fit over all possible ways of dividing the data.

Let $z$ be a candidate split for a variable $X_{j}$ that divides the response data $Y$ into left and right daughter nodes based on the $j^{\text{th}}$ variable. Define the mean of the left daughter node to be $\overline{Y}_{L}=\frac{1}{N_{L}}\sum_{i:X_{ij}\leq z}Y_{i}$ and the mean of the right daughter node to be $\overline{Y}_{R}=\frac{1}{N_{R}}\sum_{i:X_{ij}>z}Y_{i}$ and let the size of the left and right daughter nodes be $N_{L}=\#\{i:X_{ij}\leq z\}$ and $N_{R}=\#\{i:X_{ij}>z\}$ , respectively. For CART regression trees, the impurity reduction (or variance reduction) in the response variable $Y$ from choosing the split point $z$ for the $j^{\text{th}}$ variable is defined to be

\widehat{\Delta}(z;X_{j},Y)=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}-\frac{1}{n}\sum_{i:X_{ij}\leq z}(Y_{i}-\overline{Y}_{L})^{2}-\frac{1}{n}\sum_{i:X_{ij}>z}(Y_{i}-\overline{Y}_{R})^{2}.

(5)

For each variable $X_{j}$ , we choose a split point $\hat{z}_{j}$ that maximizes the impurity reduction

\hat{z}_{j}\in\operatorname*{arg\,max}_{z}\widehat{\Delta}(z;X_{j},Y),

and for convenience, we denote the largest impurity reduction by

\widehat{\Delta}(X_{j},Y)\coloneqq\widehat{\Delta}(\hat{z}_{j};X_{j},Y).

We then rank the variables commensurate with the sizes of their impurity reductions, i.e., we obtain a ranking $(\hat{\jmath}_{1},\dots,\hat{\jmath}_{p})$ where $\widehat{\Delta}(X_{\hat{\jmath}_{1}},Y)\geq\cdots\geq\widehat{\Delta}(X_{\hat{\jmath}_{p}},Y)$ . If desired, these rankings can be repurposed to perform model selection (e.g., an estimate $\widehat{\mathcal{S}}$ of ${\mathcal{S}}$ ), as we now explain. If we are given the sparsity level $s$ in advance, we can choose $\widehat{\mathcal{S}}$ to be the top $s$ of these ranked variables; otherwise, we must find a data-driven choice of how many variables to include. Equivalently, the latter case is realized by choosing $\mathcal{\widehat{S}}$ to be the indices $j$ for which $\widehat{\Delta}(X_{j},Y)\geq\gamma_{n}$ , where $\gamma_{n}$ is a threshold to be described in Section 2.5. This is of course a delicate task as including too many variables may lead to more false positives.

By [6, Section 9.3], using a sum of squares decomposition, we can rewrite the impurity reduction (5) as

\widehat{\Delta}(z;X_{j},Y)=\frac{N_{L}}{n}\frac{N_{R}}{n}(\overline{Y}_{L}-\overline{Y}_{R})^{2},

(6)

which allows us to compute the largest impurity reductions for all possible split points with a single pass over the data by first ordering the data along $X_{j}$ and then updating $\overline{Y}_{L}$ and $\overline{Y}_{R}$ in an online fashion. This alternative expression for the objective function facilitates its rapid evaluation and exact optimization. Pseudocode for SDI is given in Algorithm 1.

Input: Dataset

\mathcal{D}=\{(X_{i1},\ldots,X_{ip},Y_{i})\}_{i=1}^{n}

for $j=1,\ldots,p$ do

Relabel

\mathcal{D}

with

X_{ij}

sorted in increasing order

Initialize

\overline{Y}_{L}=0

\overline{Y}_{R}=\overline{Y}

\widehat{\Delta}(X_{j},Y)=0

for $i=1,\ldots,n-1$ do

Update

\overline{Y}_{L}\leftarrow\frac{i-1}{i}\overline{Y}_{L}+\frac{Y_{i}}{i},\quad\overline{Y}_{R}\leftarrow\frac{n-i+1}{n-i}\overline{Y}_{R}-\frac{Y_{i}}{n-i}

Compute

\widehat{\Delta}(X_{ij};X_{j},Y)=\frac{i}{n}(1-\frac{i}{n})(\overline{Y}_{L}-\overline{Y}_{R})^{2}

if $\widehat{\Delta}(X_{ij};X_{j},Y)>\widehat{\Delta}(X_{j},Y)$ then

Update

\widehat{\Delta}(X_{j},Y)\leftarrow\widehat{\Delta}(X_{ij};X_{j},Y)

end if

end for

Output: Ranking

(\hat{\jmath}_{1},\dots,\hat{\jmath}_{p})

such that

\widehat{\Delta}(X_{\hat{\jmath}_{1}},Y)\geq\cdots\geq\widehat{\Delta}(X_{\hat{\jmath}_{p}},Y)

Algorithm 1 Single-level Decrease in Impurity (SDI)

2.5 Data-driven choices of $\gamma_{n}$

As briefly mentioned in Section 2.4, if we do not know the sparsity level $s$ in advance, we can instead use a data-driven threshold $\gamma_{n}$ to modulate the number of selected variables. Here we propose two data-driven methods to determine the threshold $\gamma_{n}$ .

Permutation method

The first thresholding method is similar to the Iterative Nonparametric Independence Screening (INIS) method based on NIS [11, Section 4]. The first step is to choose a random permutation $\pi:\{1,\dots,n\}\to\{1,\dots,n\}$ of the data to decouple $\mathbf{X}_{i}$ from $Y_{i}$ so that the new dataset $\mathcal{D}^{\pi}=\{(\mathbf{X}_{\pi(i)},Y_{i})\}$ follows a null model. Then we choose the threshold $\gamma_{n}$ to be the maximum of the impurity reductions $\widehat{\Delta}(X_{j},Y;\mathcal{D}^{\pi})$ over all $j$ based on the dataset $\mathcal{D}^{\pi}$ . We can also generate $T$ different permutations $\pi$ and take the maximum of $\widehat{\Delta}(X_{j},Y;\mathcal{D}^{\pi})$ over all such permuted datasets to get a more significant threshold, i.e., $\gamma_{n}=\max_{j,\,\pi}\widehat{\Delta}(X_{j},Y;\mathcal{D}^{\pi})$ . With $\gamma_{n}$ selected in this way, SDI will then output the variable indices $\widehat{\mathcal{S}}$ consisting of the indices $j$ for which the original impurity reductions $\widehat{\Delta}(X_{j},Y;\mathcal{D})$ are at least $\gamma_{n}$ . Interestingly, this method has parallels to MDA in that we permute the data values of a given variable and calculate the resulting change in the quality of the fit.

Elbow method

One problem with the permutation method is that it performs poorly when there is correlation between relevant and irrelevant variables. This is because, in the decoupled dataset $\mathcal{D}^{\pi}$ , there is essentially no correlation between $Y$ and any variable; whereas in the original dataset, $Y$ may be correlated with some of the irrelevant variables. In this case, the threshold $\gamma_{n}$ will be too small and the selected set of variables will include many irrelevant variables (false positives). An alternative is to employ visual inspection via the elbow method, which is typically used to determine the number of clusters in cluster analysis. Here we plot the largest impurity reductions $\widehat{\Delta}(X_{j},Y)$ in decreasing order and search for an “elbow” in the curve. We then let $\widehat{\mathcal{S}}$ consist of the variable indices that come before the elbow. Instead of choosing the cutoff point visually, we can also automate the process by clustering the impurity reductions with, for example, a two-component Gaussian mixture model. Both of these implementations are generally more robust to correlation between relevant and irrelevant variables than the permutation method.

We illustrate the empirical performance of both the permutation method and the elbow method in Section 7.

2.6 Computational issues

We now briefly discuss the computational complexity of Algorithm 1—or equivalently—the computational complexity of growing a single-level CART decision tree. For each variable $X_{j}$ , we first sort the input data along $X_{j}$ with ${\mathcal{O}}(n\log n)$ operations. We then evaluate the decrease in impurity along $n$ data points (as done in the nested for-loop of Algorithm 1), and finally find the maximum among these $n$ values (as done in the nested if-statement of Algorithm 1), all with ${\mathcal{O}}(n)$ operations. Thus, the total number of calculations for all of the $p$ variables is ${\mathcal{O}}(pn\log(n))$ . This is only slightly worse than the complexity of SIS for linear models ${\mathcal{O}}(pn)$ , comparable to NIS based on the complexity of fitting B-splines, and favorable to that of Lasso or stepwise regression ${\mathcal{O}}(p^{3}+p^{2}n)$ , especially when $p$ is large [10, 17]. While approximate methods like coordinate descent for Lasso can reduce the complexity to ${\mathcal{O}}(pn)$ at each iteration, their convergence properties are unclear. Thus, as SPAM is a generalization of Lasso for nonparametric additive models, its implementation (via a functional version of coordinate descent for Lasso) may be similarly expensive.

2.7 Interpretations of SDI

In this section we outline two interpretations of SDI.

Interpretation 1

Our first interpretation of SDI is in terms of the sample correlation between the response and a decision stump. To see this, denote the decision stump that splits $X_{j}$ at $z$ by

\widetilde{Y}(X_{j})\coloneqq\overline{Y}_{L}\,\mathbf{1}(X_{j}\leq z)+\overline{Y}_{R}\,\mathbf{1}(X_{j}>z)

and one at an optimal split value $\hat{z}_{j}$ by

\widehat{Y}(X_{j})\coloneqq\overline{Y}_{L}\,\mathbf{1}(X_{j}\leq\hat{z}_{j})+\overline{Y}_{R}\,\mathbf{1}(X_{j}>\hat{z}_{j}).

Note that $\widehat{Y}(X_{j})$ equivalently minimizes the marginal sum of squares (2) over the collection of all decision stumps (4). Next, by Lemma A.1 in [26], we have:

	$\displaystyle\widehat{\Delta}(z,X_{j},Y)$	$\displaystyle=\widehat{\mathrm{Var}}(Y)\times\widehat{\rho}^{\,2}(\widetilde{Y}(X_{j}),Y),\quad\text{and}$		(7)
	$\displaystyle\widehat{\rho}^{\,2}(\widetilde{Y}(X_{j}),Y)$	$\displaystyle=1-\frac{\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\widetilde{Y}(X_{ij}))^{2}}{\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}},$		(7)

where

\widehat{\rho}\,(\widetilde{Y}(X_{j}),Y)\coloneqq\frac{\frac{1}{n}\sum_{i=1}^{n}(\widetilde{Y}(X_{ij})-\overline{Y})(Y_{i}-\overline{Y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\widetilde{Y}(X_{ij})-\overline{Y})^{2}\times\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}}}\geq 0

is the Pearson product-moment sample correlation coefficient between the data $Y$ and decision stump $\widetilde{Y}(X_{j})$ . In other words, we see from (7) that an optimal split point $\hat{z}_{j}$ is chosen to maximize the Pearson sample correlation between the data $Y$ and decision stump $\widetilde{Y}(X_{j})$ . This reveals that SDI is, at its heart, a correlation ranking method, in the same spirit as SIS, NIS, and [16] via (2). In fact, as we shall see in Section 3, SDI is for all intents and purposes also similar to ranking the variables according to the correlation between the response data and the marginal projections $f_{j}(X_{j})$ .

Like $r^{2}$ for linear models, (7) reveals that the squared sample correlation $\widehat{\rho}^{\,2}(\widetilde{Y}(X_{j}),Y)$ equals the coefficient of determination $R^{2}$ , i.e., the fraction of variance in $Y$ explained by a decision stump $\widetilde{Y}(X_{j})$ in $X_{j}$ .³³3However, unlike linear models, for this relationship to be true, the decision stump $\widetilde{Y}(X_{j})$ need not necessarily be a least squares fit, i.e., $\widehat{Y}(X_{j})$ . Thus, SDI is also equivalent to ranking the variables according to the goodness-of-fit for decision stumps of each variable. In fact, the equivalence between ranking with correlations and ranking with squared error goodness-of-fit is a ubiquitous trait among most models [16, Theorem 1].

Interpretation 2

The other interpretation is in terms of the aforementioned MDI importance measure. Recall the definition of MDI in Section 8, i.e., for an individual decision tree $T$ , the MDI for $X_{j}$ is the total reduction in impurity attributed to the splitting variable $X_{j}$ . More succinctly, the MDI of $T$ for $X_{j}$ equals

\sum_{\mathrm{t}}\frac{N(\mathrm{t})}{n}\widehat{\Delta}(X_{j},Y|\mathbf{X}\in\mathrm{t}),

(8)

where the sum extends over all non-terminal nodes $\mathrm{t}$ in which $X_{j}$ was split, $N(\mathrm{t})$ is the number of sample points in $\mathrm{t}$ , and $\widehat{\Delta}(X_{j},Y|\mathbf{X}\in\mathrm{t})$ is the largest reduction in impurity for samples in $\mathrm{t}$ . Note that if $T$ is a decision stump with split along $X_{j}$ , then (8) equals $\widehat{\Delta}(X_{j},Y)$ , the largest reduction in impurity at the root node. Because a split at the root node captures the main effects of the model, $\widehat{\Delta}(X_{j},Y)$ can be seen as a first order approximation of (8) in which higher order interaction effects are ignored.

3 Preliminaries

Our first lemma, developed by the first author in recent work, reveals the crucial role that optimization (of a nonlinear model) plays in assessing whether a particular variable is relevant or irrelevant—by relating the impurity reduction for a particular variable $X_{j}$ to the sample correlation between the response variable $Y$ and any function of $X_{j}$ . This lemma also highlights a key departure from other approaches in past decision tree literature that do not consider splits that depend on both input and output data (see, for example, DSTUMP [24]).

In order to state the lemma, we will need to introduce the concept of stationary intervals. We define a stationary interval of a univariate function $h(\cdot)$ to be a maximal interval $I$ such that $h(I)=c$ , where $c$ is a local extremum of $h(\cdot)$ ( $I$ is maximal in the sense that there does not exist an interval $I^{\prime}$ such that $I\subset I^{\prime}$ and $h(I^{\prime})=c$ ). In particular, note that a monotone function does not have any stationary intervals.

Lemma 1 (Lemma A.4, Supplementary Material in [26]).

Almost surely, uniformly over all functions $h(\cdot)$ of $X_{j}$ that have at most $M$ stationary intervals, we have

\widehat{\Delta}(X_{j},Y)\geq\frac{1}{D^{-1}Mn+\log(2n)+1}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{h(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(h(X_{j}))}},Y\Bigg{)},

(9)

where $D\geq 1$ is the smallest number of data points in a stationary interval of $h(\cdot)$ that contains at least one data point.⁴⁴4More precisely, if $I_{1},\dots,I_{M}$ are the stationary intervals of $h(\cdot)$ and $D_{k}=\#\{X_{ij}\in I_{k}\}$ , then $D=\min_{k}\{D_{k}:D_{k}\geq 1\}$ .

Remark 1.

Note that $M$ can also be thought of as the number of times $h(\cdot)$ changes from strictly increasing to decreasing (or vice versa).

Remark 2.

The bound (9) is tight (up to universal constant factors) when $M=0$ , since $\widehat{\Delta}(X_{j},Y)=1$ and $\widehat{\text{Var}}(Y)\asymp\log(n)$ when $X_{1j}\leq\cdots\leq X_{nj}$ , $h(X_{ij})=Y_{i}$ , and

Y_{i}=\sqrt{(i-1)(n-i+1)}-\sqrt{i(n-i)}.

Proof sketch of Lemma 1.

For self-containment, we sketch the proof when $h(\cdot)$ is differentiable. The essential idea is to construct an empirical prior $\Pi$ on the split points $z$ and lower bound $\widehat{\Delta}(X_{j},Y)$ by

\int\widehat{\Delta}(z;X_{j},Y)d\Pi(z).

Recall from Section 2.4 that $N_{L}=N_{L}(z)$ and $N_{R}=N_{R}(z)$ are the number of samples in the left and right daughter nodes, respectively, if the $j^{\text{th}}$ variable is split at $z$ . The special prior we choose has density

\frac{d\Pi(z)}{dz}=\frac{|h^{\prime}(z)|\sqrt{N_{L}(z)N_{R}(z)}}{\int|h^{\prime}(z^{\prime})|\sqrt{N_{L}(z^{\prime})N_{R}(z^{\prime})}dz^{\prime}},

with support between the minimum and maximum values of the data $\{X_{ij}\}$ . This then yields $\widehat{\Delta}(X_{j},Y)\geq C(h)\times\widehat{\mathrm{Cov}}^{2}\bigg{(}\frac{h(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(h(X_{j}))}},Y\bigg{)}$ . The factor $C(h)$ can be minimized (by solving a simple quadratic program) over all functions $h(\cdot)$ under the constraint that they have at most $M$ stationary intervals containing at least $D$ data points, yielding the desired result (9). We direct the reader to [26, Lemma A.4, Supplementary Material] for the full proof. ∎

Ignoring the factor $(D^{-1}Mn+\log(2n)+1)^{-1}$ in (9) and focusing only on the squared sample covariance term, note that choosing $h(\cdot)$ to be the marginal projection $f_{j}(\cdot)$ , we have

\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y\Bigg{)}\approx\mathrm{Cov}^{2}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\mathrm{Var}(f_{j}(X_{j}))}},Y\Bigg{)}=\mathrm{Var}(f_{j}(X_{j})),

where the last equality can be deduced from the fact that the marginal projection $f_{j}(X_{j})$ is orthogonal to the residual $Y-f_{j}(X_{j})$ . Thus, in an ideal setting, Lemma 1 enables us to asymptotically lower bound $\widehat{\Delta}(X_{j},Y)$ by a multiple of the variance of the marginal projections—which can then be used to screen for important variables and control the number of false negatives.

To summarize, the previous lemma shows that $\widehat{\Delta}(X_{j},Y)$ is large for variables $X_{j}$ such that $f_{j}(X_{j})$ is strongly correlated with $Y$ —or equivalently—variables that have large signals in terms of the variance of the marginal projection. Conversely, our next lemma shows that $\widehat{\Delta}(X_{j},Y)$ is with high probability not greater than the variance of the marginal projection. A special instance of this lemma, namely, when $Y$ is independent of $X_{j}$ , was stated in [29, Lemma 1] and serves as the inspiration for our proof. Due to space constraints, we include the proof in Appendix A.

Lemma 2.

Suppose that $Z_{j}=Y-f_{j}(X_{j})$ is conditionally sub-Gaussian given $X_{j}$ , with variance parameter $\sigma_{Z_{j}}^{2}$ , i.e., $\mathbb{E}[\exp(\lambda Z_{j})|X_{j}]\leq\exp(\lambda^{2}\sigma^{2}_{Z_{j}}/2)$ for all $\lambda\in\mathbb{R}$ . With probability at least $1-4n\exp(-n\xi^{2}/(12\sigma_{Z_{j}}^{2}))$ ,

\widehat{\Delta}(X_{j},Y)\leq 3\widehat{\mathrm{Var}}(f_{j}(X_{j}))+\xi^{2}.

In other words, Lemmas 1 and 2 together imply that SDI is a proxy for the variance of the marginal projection and therefore it roughly ranks the variables accordingly, up to constant factors. These lemmas are a key ingredient of the proofs for model selection and may be of independent interest.

4 SDI for linear models

To connect SDI to other variable screening methods that are perhaps more familiar to the reader, we first consider a linear model with Gaussian distributed variables. We allow for any correlation structure between covariates. Recall from (7) that $\widehat{\Delta}(X_{j},Y)$ is equal to $\widehat{\mathrm{Var}}(Y)$ times $\widehat{\rho}^{\,2}(\widehat{Y}(X_{j}),Y)$ , so that SDI is equivalent to ranking by $\widehat{\rho}\,(\widehat{Y}(X_{j}),Y)$ . Our first theorem shows that $\widehat{\rho}\,(\widehat{Y}(X_{j}),Y)$ , the sample correlation between $Y$ and an optimal decision stump in $X_{j}$ , behaves roughly like the correlation between a linear model $Y$ and a coordinate $X_{j}$ .

Theorem 1 (SDI is asymptotically equivalent to SIS).

Let $Y=\beta_{1}X_{1}+\beta_{2}X_{2}+\cdots+\beta_{p}X_{p}+\varepsilon$ and assume that $\mathbf{X}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$ for some positive semi-definite matrix $\boldsymbol{\Sigma}$ and $\varepsilon\sim\mathcal{N}(0,\sigma^{2})$ for some $\sigma^{2}>0$ . Let $\delta\in(0,1)$ . There exists a universal positive constant $C_{0}$ such that, with probability at least $1-\frac{C_{0}}{\sqrt{n\delta^{2}\rho^{2}(X_{j},Y)}}\exp(-n\delta^{2}\rho^{2}(X_{j},Y)/2)$ ,

\widehat{\rho}\,(\widehat{Y}(X_{j}),Y)\geq\frac{(1-\delta)|\rho(X_{j},Y)|}{\sqrt{\log(2n)+1}}.

(10)

Furthermore, with probability at least $1-4n\exp(-n\delta^{2}/12)-2\exp(-(n-1)/16)$ ,

\widehat{\rho}\,(\widehat{Y}(X_{j}),Y)\leq 5|\rho(X_{j},Y)|+2\delta.

(11)

Proof sketch of Theorem 1.

We only sketch the proof due to space constraints, but a more complete version is provided in Appendix B.

The first step in proving the lower bound (10) is to apply Lemma 1 with $h(X_{j})=X_{j}$ (a monotone function) to see that

\widehat{\Delta}(X_{j},Y)\geq\frac{\widehat{\mathrm{Var}}(Y)}{\log(2n)+1}\times\widehat{\rho}^{\;2}(X_{j},Y),

(12)

since $M=0$ . Next, we can apply asymptotic tail bounds for Pearson’s sample correlation coefficient $\widehat{\rho}\,(X_{j},Y)$ between two correlated Gaussian distributions [18] to show that with high probability, $|\widehat{\rho}\,(X_{j},Y)|\geq(1-\delta)|\rho(X_{j},Y)|$ . Finally, we divide (12) by $\widehat{\mathrm{Var}}(Y)$ , use (7), and take square roots to complete the proof of the high probability lower bound (10).

To prove the upper bound (11), notice that since $X_{j}$ and $Y$ are jointly Gaussian with mean zero, we have $f_{j}(X_{j})=\rho_{j}\frac{\sigma_{Y}}{\sigma_{X_{j}}}X_{j}$ , where $\rho_{j}=\rho(X_{j},Y)$ . Thus, by Lemma 2 with $\sigma^{2}_{Z_{j}}=(1-\rho^{2}_{j})\sigma^{2}_{Y}$ and $\xi^{2}=(1-\rho^{2}_{j})\sigma^{2}_{Y}\delta^{2}$ , with probability at least $1-4n\exp(-n\delta^{2}/12)$ ,

	$\displaystyle\widehat{\Delta}(X_{j},Y)$	$\displaystyle\leq 3\widehat{\mathrm{Var}}(f_{j}(X_{j}))+\delta^{2}(1-\rho^{2}_{j})\sigma_{Y}^{2}$		(13)
		$\displaystyle=3\rho^{2}_{j}\frac{\sigma^{2}_{Y}}{\sigma^{2}_{X_{j}}}\widehat{\mathrm{Var}}(X_{j})+\delta^{2}(1-\rho^{2}_{j})\sigma_{Y}^{2}.$		(13)

We further upper bound (13) by obtaining high probability upper and lower bounds, respectively, for $\widehat{\mathrm{Var}}(X_{j})$ and $\widehat{\mathrm{Var}}(Y)$ in terms of $\sigma^{2}_{X_{j}}$ and $\sigma^{2}_{Y}$ , with a standard chi-squared concentration bound, per the Gaussian assumption. This yields that with high probability,

\widehat{\Delta}(X_{j},Y)\lesssim\rho^{2}_{j}\widehat{\mathrm{Var}}(Y)+\delta^{2}\widehat{\mathrm{Var}}(Y).

(14)

Finally, dividing both sides of (14) by $\widehat{\mathrm{Var}}(Y)$ , using (7), and taking square roots proves (11). ∎

Theorem 1 shows that with high-probability, SDI is asymptotically equivalent (up to logarithmic factors in the sample size) to SIS for linear models in that it ranks the magnitudes of the marginal sample correlations between a variable and the model, i.e., $\widehat{\rho}\,(X_{j},Y)\approx\rho(X_{j},Y)$ . As a further parallel with decision stumps (see Section 2.7), the square of the sample correlation, $\widehat{\rho}^{\,2}(X_{j},Y)$ , is also equal to the coefficient of determination $r^{2}$ for the least squares linear fit of $Y$ on $X_{j}$ . We confirm the similarity between SDI and SIS empirically in Section 7.

One corollary of Theorem 1 is that, like SIS, SDI also enjoys the sure screening property, under the same assumptions as [12, Conditions 1-4], which include mild conditions on the eigenvalues of the design covariance matrices and minimum signals of the parameters $\beta_{j}$ . Similarly, like SIS, SDI can also be paired with lower dimensional variable selection methods such as Lasso or SCAD [2] for a complete variable selection algorithm in the correlated linear model case.

On the other hand, SDI, a nonlinear method, applies to broader contexts far beyond the rigidity of linear models. In the next section, we will investigate how SDI performs for general nonparametric models with additional assumptions on the distribution of the variables.

5 SDI for nonparametric models

In this section, we establish the variable ranking and selection consistency properties of SDI for general nonparametric models; that is, we show that for Algorithm 1, we have $\mathbb{P}(\widehat{{\mathcal{S}}}=\mathcal{{\mathcal{S}}})\to 1$ as $n\to\infty$ . We describe the assumptions needed in Section 5.1 and outline the main consistency results and their proofs in Section 5.3. Finally, we describe how to modify the main results for binary classification in Section 5.5.

Although our approach differs substantively, to facilitate easy comparisons with other marginal screening methods, our framework and assumptions will be similar. As mentioned earlier, SDI is based on a more parsimonious but significantly more biased model fit than those than underpin conventional methods. As we shall see, despite the decision stump severely underfitting the data, SDI nevertheless achieves model selection guarantees that are similar to, and in some cases stronger than, its competitors. This highlights a key difference between quantifying sensitivity and screening—in the latter case, we are not concerned with obtaining consistent estimates of the marginal projections $f_{j}(X_{j})$ and their variances. Doing so demands more from the data and is therefore less efficient, when otherwise crude estimates would work equally well.

5.1 Assumptions

In this section, we describe the key assumptions and ideas which will be needed to achieve model selection consistency. The assumptions will be similar to those in the independence screening literature [11, 12], but are weaker than most past work on tree based variable selection [29, 24].

Assumption 1 (Bounded regression function).

The regression function $g(\cdot)$ is bounded with $B=\|g\|_{\infty}<\infty$ .
Assumption 2 (Smoothness of marginal projections).

Let $r$ be a positive integer, let $0<\alpha\leq 1$ be such that $d\coloneqq r+\alpha$ is at least $5/8$ , and let $0<L<\infty$ . The $r^{\rm{th}}$ order derivative of $f_{j}(\cdot)$ exists and is $L$ -Lipschitz of order $\alpha$ , i.e.,

$\big{|}f^{(r)}_{j}(x)-f^{(r)}_{j}(x^{\prime})\big{|}\leq L|x-x^{\prime}|^{\alpha},\quad x,x^{\prime}\in\mathbb{R}.$
Assumption 3 (Monotonicity of marginal projections).

The marginal projection $f_{j}(\cdot)$ is monotone on $\mathbb{R}$ .
Assumption 4 (Partial orthogonality of predictor variables).

The collections $\{X_{j}\}_{j\in{\mathcal{S}}}$ and $\{X_{j}\}_{j\in{\mathcal{S}}^{c}}$ are independent of each other.
Assumption 5 (Uniform relevant variables).

The marginal distribution of each $X_{j}$ , for $j\in{\mathcal{S}}$ , is uniform on the unit interval.
Assumption 6 (Sub-Gaussian error distribution).

The error distribution is conditionally sub-Gaussian given $\mathbf{X}$ , i.e., $\mathbb{E}[\varepsilon|\mathbf{X}]=0$ and $\mathbb{E}[\exp(\lambda\varepsilon)\mid\mathbf{X}]\leq\exp(\lambda^{2}\sigma^{2}/2)$ for all $\lambda\in\mathbb{R}$ with $\sigma^{2}>0$ .

5.2 Discussion of the assumptions

Assumption 2 is a standard smoothness assumption for variable selection in nonparametric additive models [11, Assumption A] and [19, Section 3] except that, for technical reasons, we have the condition $d\geq 5/8$ instead of $d>1/2$ . Because SDI does not involve tuning parameters that govern its approximation properties of the nonparametric constituents (such as with NIS and SPAM), Assumption 2 can be relaxed to allow for different levels of smoothness in different dimensions and, by straightforward modifications of our proofs, one can show that SDI adapts automatically. Alternatively, instead of Assumption 2, as we shall see, stronger conclusions can be provided if we impose a monotonicity constraint, namely, Assumption 3. Note that this monotonicity assumption encompasses many important “shape constrained” statistical models such as linear or isotonic regression.

Assumption 4 is essentially the so-called “partial orthogonality” condition in marginal screening methods [13]. Importantly, it allows for correlation between the relevant variables $\{X_{j}\}_{j\in{\mathcal{S}}}$ , unlike previous works on tree based variable selection [24, 29]. Notably, NIS and SPAM do allow for dependence between relevant and irrelevant variables, under suitable assumptions on the data matrix of basis functions. However, these assumptions are difficult to translate in terms of the joint distribution of the predictor variables and difficult to verify given the data.

Assumption 5 is stated as is for clarity of exposition and is not strictly necessary for our main results to hold. For instance, we may assume instead that the marginal densities of the relevant variables are compactly supported and uniformly bounded above and below by a strictly positive constant, as in [11, 19]. In fact, even these assumptions are not required. If the marginal projection is monotone (i.e., Assumption 3), no marginal distributional assumptions are required, that is, each $X_{j}$ for $j\in{\mathcal{S}}$ could be continuous, discrete, have unbounded support, or have a density that vanishes or is unbounded. More generally, similar distributional relaxations are made possible by the fact that CART decision trees are invariant to monotone transformations, enabling us to reduce the general setting to the case where each predictor variable is uniformly distributed on $[0,1]$ . See Remark 3 for details.

Remark 3.

Let $q_{j}(\cdot)$ and $F_{j}(\cdot)$ be the quantile function and the cumulative distribution function of $X_{j}$ , respectively. Recall the Galois inequalities state that $q_{j}(w)\leq z$ if and only if $w\leq F_{j}(z)$ and furthermore that $q_{j}(F_{j}(X_{j}))=X_{j}$ almost surely. Then, choosing $w=F_{j}(X_{j})$ we see that, almost surely, $X_{j}\leq z$ if and only if $F_{j}(X_{j})\leq F_{j}(z)$ . Therefore, almost surely,

\max_{z}\widehat{\Delta}(z;X_{j},Y)=\max_{z}\widehat{\Delta}(F_{j}(z);F_{j}(X_{j}),Y)=\max_{w\in[0,1]}\widehat{\Delta}(w;F_{j}(X_{j}),Y).

This means that for continuous data, the problem can be reduced to the uniform case by pre-applying the marginal cumulative distribution function $F_{j}(\cdot)$ to each variable, since $F_{j}(X_{j})\sim\rm{Uniform}([0,1])$ . Note that the marginal projections now equal the composition of the original $f_{j}(\cdot)$ with $q_{j}(\cdot)$ , i.e., $(f_{j}\circ q_{j})(\cdot)$ . By the chain rule from calculus, if $q_{j}(\cdot)$ satisfies Assumption 2, then so does $(f_{j}\circ q_{j})(\cdot)$ .

5.3 Theory for variable ranking and model selection

Here our goal will be to provide variable ranking and model selection guarantees of SDI using the assumptions in Section 5.1. Again, in this section, we sketch the proofs, but the full versions can be found in Appendices D and E.

5.3.1 Preliminary results

The high level idea will be to show that the impurity reductions for relevant variables dominate those for irrelevant variables with high probability, meaning that relevant and irrelevant variables are correctly ranked.

The following two propositions provide high probability lower bounds on the impurity reduction for relevant variables, the size of which depend on whether we assume Assumption 2 or Assumption 3.

Our first result deals with general, smooth marginal projections. Remarkably, it shows that with high probability, $\widehat{\Delta}(X_{j},Y)$ captures a portion of the variance in the marginal projection.

Proposition 1.

Under Assumptions 1, 2, 5, and 6, with probability at least $1-(4n+2)\exp\big{(}-nC_{1}\mathrm{Var}(f_{j}(X_{j}))\big{)}$ , we have

\widehat{\Delta}(X_{j},Y)\geq\frac{C_{2}(\mathrm{Var}(f_{j}(X_{j})))^{6/5+1/d}}{\log(n)},

for some positive constants $C_{1}$ and $C_{2}$ which depend only on $B$ , $\sigma$ , $r$ , and $\alpha$ .

Next, we state an analogous bound, but under a slightly different assumption on the marginal projection. It turns out that if the marginal projection is monotone, we can obtain a stronger result, which is surprisingly independent of the smoothness level.

Proposition 2.

Under Assumptions 1, 3, and 6, with probability at least $1-2\exp\Big{(}-\frac{(n-1)\mathrm{Var}(f_{j}(X_{j}))}{32(B^{2}+\sigma^{2})}\Big{)}$ ,

\widehat{\Delta}(X_{j},Y)\geq\frac{\mathrm{Var}(f_{j}(X_{j}))}{16(1+\log(2n))}.

Proof sketch of Propositions 1 and 2.

We sketch the proof of Proposition 1; the proof of Proposition 2 is based on similar arguments. The main idea is to apply Lemma 1 with $h(\cdot)$ equal to a modified polynomial approximation $\widetilde{f}_{j}(\cdot)$ to $f_{j}(\cdot)$ . This is done to temper the effect of the factor $(D^{-1}Mn+\log(2n)+1)^{-1}$ from Lemma 1, by controlling $M$ and $D$ individually.

To construct such a function, we first employ a Jackson-type estimate [23] in conjunction with Assumption 2 and Bernstein’s theorem for polynomials [22] to show the existence of a good polynomial approximation $P_{M}(\cdot)$ (of degree $M+1$ ) to $f_{j}(\cdot)$ . We then construct $\widetilde{f}_{j}(\cdot)$ by redefining $P_{M}(\cdot)$ to be constant in a small neighborhood around each of its local extrema, which ensures that each resulting stationary interval of $\widetilde{f}_{j}(\cdot)$ has a sufficiently large length. Since $P_{M}(\cdot)$ is a polynomial of degree $M+1$ , it has at most $M$ local extrema, and thus the number of stationary intervals of $\widetilde{f}_{j}(\cdot)$ will also be at most $M$ .

Next, we use concentration of measure to ensure that each stationary interval of $\widetilde{f}_{j}(\cdot)$ is saturated with enough data, effectively providing a lower bound on $D$ . Executing this argument reveals that valid choices of $D$ and $M$ (which come from optimizing a bound on the supremum norm between $\widetilde{f}_{j}(\cdot)$ and $f_{j}(\cdot)$ ) are:

D\gtrsim n\times(\mathrm{Var}(f_{j}(X_{j})))^{1/5+1/(2d)},\quad M\lesssim(\mathrm{Var}(f_{j}(X_{j})))^{-1/(2d)}.

Plugging these values into the lower bound (9) in Lemma 1, we find that with high probability,

	$\displaystyle\widehat{\Delta}(X_{j},Y)$	$\displaystyle\geq\frac{1}{D^{-1}Mn+\log(2n)+1}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}$		(15)
		$\displaystyle\gtrsim\frac{(\mathrm{Var}(f_{j}(X_{j})))^{1/5+1/d}}{\log(n)}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}.$		(15)

Thus, the lower bound in Proposition 1 will follow if we can show that the squared sample covariance factor in (15) exceeds $\mathrm{Var}(f_{j}(X_{j}))$ with high probability. To this end, note that

	$\displaystyle\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}$	$\displaystyle=\underbrace{\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},f_{j}(X_{j})\Bigg{)}}_{\text{(I)}}$		(16)
		$\displaystyle\qquad+\underbrace{\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}}_{\text{(II)}}.$		(16)

With high probability, (I) can be lower bounded by $C\sqrt{\mathrm{Var}(f_{j}(X_{j}))}$ , where $C$ is some constant, using the approximation properties of $\widetilde{f}_{j}(\cdot)$ for $f_{j}(\cdot)$ , per the choice of $D$ and $M$ from above, and a concentration inequality for the sample variance of $f_{j}(X_{j})$ . Furthermore, since $Y-f_{j}(X_{j})$ has conditional mean zero, a Hoeffding type concentration inequality shows that, with high probability, (II) is larger than any (strictly) negative constant, including $-(C/2)\sqrt{\mathrm{Var}(f_{j}(X_{j}))}$ . Combining this analysis from (15) and (16), we obtain the high probability lower bound on $\widehat{\Delta}(X_{j},Y)$ given in Proposition 1. ∎

Next, we need to ensure that there is a sufficient separation in the impurity reductions between relevant and irrelevant variables. To do so, we use Lemma 2 along with the partial orthogonality assumption in Section 5.1 to show that the impurity reductions for irrelevant variables will be small with high probability.

Lemma 3.

Under Assumptions 1, 4 and 6, for each $j\in{\mathcal{S}}^{c}$ , with probability at least $1-4n\exp(-n\xi^{2}/(12(B^{2}+\sigma^{2})))$ ,

\widehat{\Delta}(X_{j},Y)\leq\xi^{2}.

In other words, if $j\in{\mathcal{S}}^{c}$ , then $\widehat{\Delta}(X_{j},Y)={\mathcal{O}}(n^{-1}\log(n))$ with probability at least $1-n^{-\Omega(1)}$ .

Proof of Lemma 3.

Observe that

$\displaystyle\mathbb{E}[\exp(\lambda(Y-f_{j}(X_{j})))\|X_{j}]$	$\displaystyle=\mathbb{E}\big{[}\exp(\lambda(g(\mathbf{X})-f_{j}(X_{j})))\mathbb{E}\big{[}\exp(\lambda\varepsilon)\big{\|}\mathbf{X}\big{]}\big{\|}X_{j}\big{]}$
	$\displaystyle\leq\mathbb{E}\big{[}\exp(\lambda(g(\mathbf{X})-f_{j}(X_{j})))\big{\|}X_{j}\big{]}\exp(\lambda^{2}\sigma^{2}/2)$	(17)
	$\displaystyle\leq\mathbb{E}\big{[}\exp(\lambda^{2}(B^{2}+\sigma^{2})/2)\big{]},$	(18)

where we used Assumption 6 in the penultimate inequality (17) and Hoeffding’s Lemma together with Assumption 1 in the last inequality (18). Using Assumption 4 along with Lemma 2 with $\sigma^{2}_{Z_{j}}=B^{2}+\sigma^{2}$ proves Lemma 3. ∎

5.3.2 Main results

Assuming we know the size $s$ of the support ${\mathcal{S}}$ , we can use the SDI ranking from Algorithm 1 to choose the top $s$ most important variables. Alternatively, if $s$ is unknown, we instead choose an asymptotic threshold $\gamma_{n}$ of the impurity reductions to select variables; that is, $\widehat{\mathcal{S}}=\{j:\widehat{\Delta}(X_{j},Y)\geq\gamma_{n}\}$ . We state our variable ranking guarantees in terms of the minimum signal strength of the relevant variables:

v\coloneqq\min_{j\in{\mathcal{S}}}\mathrm{Var}(f_{j}(X_{j})),

which is the same as the minimum variance parameter in independence screening papers (e.g., [11, Assumption C]). Note that $v$ measures the minimum contribution of each relevant variable alone to the variance in $Y$ , ignoring the effects of the other variables.

Theorem 2.

Suppose Assumptions 1, 2, 4, 5, and 6 hold. Then the top $s$ most important variables ranked by Algorithm 1 equal the correct set ${\mathcal{S}}$ of relevant variables with probability at least

1-s(4n+2)\exp(-C_{1}nv)-4n(p-s)\exp\Big{(}-\frac{nC_{2}v^{6/5+1/d}}{24\log(n)(B^{2}+\sigma^{2})}\Big{)},

(19)

where $C_{1}$ and $C_{2}$ are the same constants in Proposition 1.

Remark 4.

As a corollary of Theorem 2, we obtain a special case of Proposition 1 in [37] for the root node of a CART decision tree, which states more generally that a relevant variable is selected at each node with probability converging to one.

As mentioned earlier, stronger results can be obtained if we assume that the marginal projections are monotone. In our next theorem, notice that we do not have to make any additional smoothness assumptions, nor do we require any distributional assumptions on the relevant variables, other than that they are independent of the irrelevant ones.

Theorem 3.

Suppose Assumptions 1, 3, 4, and 6 hold. Then the top $s$ most important variables ranked by Algorithm 1 equal the correct set ${\mathcal{S}}$ of relevant variables with probability at least

1-2s\exp\Big{(}-\frac{(n-1)v}{32(B^{2}+\sigma^{2})}\Big{)}-4n(p-s)\exp\Big{(}-\frac{nv}{96(1+\log(2n))(B^{2}+\sigma^{2})}\Big{)}.

(20)

Remark 5.

When $s$ is unknown, Propositions 1 and 2 and Lemma 3 imply that the oracle threshold choices

\gamma_{n}=\frac{C_{2}v^{6/5+1/d}}{2\log(n)}\quad\text{and}\quad\frac{v}{8(1+\log(2n))}

(21)

ensure that

\max_{j\in{\mathcal{S}}^{c}}\widehat{\Delta}(X_{j},Y)<\gamma_{n}\leq\min_{j\in{\mathcal{S}}}\widehat{\Delta}(X_{j},Y)

and hence will yield the same high probability bounds (19) and (20), respectively. Thus, while the permutation and elbow methods from Section 2.5 are somewhat ad-hoc, if they, at the very least, produce thresholds that are close to (21), then high probability performance guarantees are still possible to obtain.

5.4 Minimum signal strengths

Like all marginal screening methods, the theoretical basis for SDI is that each marginal projection for a relevant variable should be nonconstant, or equivalently, that $v>0$ . Note that when the relevant variables are independent and the underlying model is additive, per (1), the marginal projections equal the component functions of the additive model. Hence, $v=\min_{j\in{\mathcal{S}}}\mathrm{Var}(g_{j}(X_{j}))$ , which will always be strictly greater than zero. As Theorems 2 and 3 show, $v$ controls the probability of a successful ranking of the variables. In practice, many of the relevant variables may have very small signals—therefore we are particularly interested in cases where $v$ is allowed to become small when the sample size grows large, as we now discuss.

We see from Theorem 2 that in order to have model selection consistency with probability at least $1-n^{-\Omega(1)}$ , it suffices to have

v\gtrsim\Big{(}\frac{\log(n)\log(n(p-s))}{n}\Big{)}^{\frac{5d}{6d+5}},

(22)

up to constants that depend on $B$ , $\sigma$ , $r$ , and $\alpha$ . That is, (22) is a sufficient condition on the signal of all relevant variables so that $\mathbb{P}(\widehat{\mathcal{S}}={\mathcal{S}})\rightarrow 1$ as $n\to\infty$ . Similarly, we see from Theorem 3 that

v\gtrsim\frac{\log(n)\log(n(p-s))}{n}

(23)

is sufficient to guarantee model selection consistency for monotone marginal projections. A particularly striking aspect of (23) is that the rate is independent of the smoothness of the marginal projection. This means that the “difficulty” of detecting a signal from a general monotone marginal projection is essentially no more than if the marginal projection was linear.

5.5 SDI for binary classification

Our theory for SDI can also naturally be extended to the context of classification, as we now describe. For simplicity, we focus on the problem of binary classification where $Y\in\{0,1\}$ . To begin, we first observe that Gini impurity and variance impurity are equivalent up to a factor of two [30, Section 3]. Thus, we can use the same criterion $\widehat{\Delta}(X_{j},Y)$ to rank the variables. Next, we identify the marginal projection as the marginal class probability

f_{j}(X_{j})=\mathbb{P}(Y=1|X_{j})=1-\mathbb{P}(Y=0|X_{j}).

Because of these connections to the regression setting, the results in Section 5.3 hold verbatim for classification under the same assumptions therein with $v=\min_{j\in{\mathcal{S}}}\mathrm{Var}(\mathbb{P}(Y=1|X_{j}))$ , $B=1$ , and $\sigma^{2}=0$ . An interesting special case arises when $\mathbb{P}(Y=1|\mathbf{X})=h(\beta_{1}X_{1}+\cdots\beta_{p}X_{p})$ for some monotone link function $h(\cdot)$ and $\mathbf{X}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$ for some positive semi-definite matrix $\boldsymbol{\Sigma}$ . Using the fact that Gaussian random vectors are conditionally Gaussian distributed, it can be shown that the marginal projection $\mathbb{P}(Y=1|X_{j})$ is monotone and therefore satisfies Assumption 3. Consequently, for example, logistic regression with Gaussian data inherits the same stronger conclusions as Theorem 3.

6 Comparing SDI with other model selection methods

In this section, we compare the finite sample guarantees of SDI given in Section 5.3 and Section 4 to those of NIS, Lasso, and SPAM. To summarize, we find that SDI enjoys model selection consistency even when the marginal signal strengths of the relevant variables are smaller than those for NIS and SPAM. We also find that the minimum sample size of SDI for high probability support recovery is nearly what is required for Lasso, which is minimax optimal. Finally, we show that SDI can handle a larger number of predictor variables than NIS and SPAM.

Minimum signal strength for NIS

We analyze the details of [11] to uncover the corresponding threshold $v$ for NIS. In order to have model selection consistency, the probability bound in [11, Theorem 2] must approach one as $n\to\infty$ , which necessitates

d_{n}\lesssim\Big{(}\frac{n^{1-4\kappa}}{\log(np)}\Big{)}^{1/3},

(24)

where $d_{n}$ is the number of spline basis functions and $\kappa$ is a free parameter (in the notation of [11]). However notice that by [11, Assumption F], we must also have that $d_{n}\gtrsim n^{\frac{2\kappa}{2d+1}}$ , and combining this with (24) shows that we must have

n^{-\kappa}\gtrsim\Big{(}\frac{\log(np)}{n}\Big{)}^{\frac{2d+1}{8d+10}}.

(25)

Now substituting (24) and (25) into $v\gtrsim d_{n}n^{-2\kappa}$ ([11, Assumption C]), it follows that we have

v\gtrsim\Big{(}\frac{\log(np)}{n}\Big{)}^{\frac{4d}{8d+10}}

for NIS, which is a larger minimum signal than our (22).

Minimum signal strength for SPAM

In the case where $d=2$ , by [35, Section 6.1], we must have $v\gtrsim n^{-4/15}\log^{16/5}(np)$ for SPAM to achieve consistent model selection. For comparison, our algorithm allows for a smaller signal $v\gtrsim\big{(}\frac{\log(n)\log(np)}{n}\big{)}^{10/17}$ , which is obtained by setting $d=2$ in (22).

Minimum sample size for consistency

Consider the linear model with Gaussian variates from Theorem 1, where for simplicity we additionally assume that $\boldsymbol{\Sigma}=\mathbf{I}_{p\times p}$ is the $p\times p$ identity matrix, yielding $\rho^{2}(X_{j},Y)=\beta^{2}_{j}/(\sigma^{2}+\sum_{k=1}^{p}\beta^{2}_{k})$ .

Following the same steps used to prove Theorem 2 but using Theorem 1 and Lemma 2 instead, we can derive a result similar to Theorem 2 for the probability of exact support recovery, but for a linear model with Gaussian variates. The full details are in Appendix C. With the specifications $\sum_{k=1}^{p}\beta^{2}_{k}={\mathcal{O}}(1)$ and $\min_{j\in{\mathcal{S}}}|\beta_{j}|^{2}\asymp 1/s$ , we find that a sufficient sample size for high probability support recovery is

n\gg s\log(n)\log(n(p-s)),

which happens when

n\gg s\log(p-s)\times(\log(s)+\log\log(p-s)).

(26)

Now, it is shown in [41, Corollary 1] that the minimax optimal threshold for support recovery under these parameter specifications is $n\asymp s\log(p-s)$ , which is achieved by Lasso [42]. Amazingly, (26) coincides with this optimal threshold up to $\log(s)$ and $\log\log(p-s)$ factors, despite SDI not being tailored to linear models.

Maximum dimensionality

Suppose the signal strength $v$ is bounded above and below by a positive constant when the sample size increases. Then Theorems 2 and 3 show model selection consistency for SDI up to dimensionality $p=\exp(o(n))$ . This is larger than the maximum dimensionality $p=\exp(o(n^{2(d-1)/(2d+1)}))$ for NIS [11, Section 3.2], thus applying to an even broader spectrum of ultra high-dimensional problems. Furthermore, when $d=2$ , SPAM is able to handle dimensionality up to $p=\exp(o(n^{1/6}))$ [35, Equation (45)], which is again lower than the dimensionality $p=\exp(o(n))$ for SDI.

7 Experiments

In this section, we conduct computer experiments of SDI with synthetic data. As there are many existing empirical studies of the related MDI measure [24, 29, 30, 31, 32, 38, 43, 44], we do not aim for comprehensiveness.

Our first set of experiments compare the performance of SDI with SIS, NIS, Lasso, and SPAM, MDI for a single CART decision tree, and MDI for a random forest. In Section 7.1, we assess performance based on the probability of exact support recovery. To ensure a fair comparison between SDI and the other algorithms, we assume a priori knowledge of the true sparsity level $s$ , which is incorporated into Lasso and SPAM by specifying the model degrees of freedom in advance. These simulations were conducted in R using the packages rpart for SDI, SAM for SPAM, SIS for SIS, and glmnet for Lasso with default settings. We also compute two versions of MDI: MDI RF using the package randomForest with ntrees = 100 and MDI CART (based on a pruned CART decision tree) using the package rpart with default settings. The source code from [11] was used to conduct experiments with NIS.

The second set of experiments, in Section 7.2, deals with the case when the sparsity level $s$ is unknown, whereby we demonstrate the empirical performance of the permutation and elbow methods from Section 2.5.

In all our experiments, we generate $n$ samples from an $s$ -sparse additive model $g(\mathbf{X})=\sum_{j=1}^{s}g_{j}(X_{j})$ for various types of components $g_{j}(X_{j})$ . The error distribution is $\varepsilon\sim\mathcal{N}(0,\sigma^{2})$ , the sparsity level is fixed at $s=4$ , and the ambient dimension is fixed at $p=2000$ . We consider the following model types.

Model 1

Consider linear additive components $g_{j}(X_{j})=X_{j}$ and variables $\mathbf{X}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$ , where the covariance matrix $\boldsymbol{\Sigma}$ has diagonal entries equal to $1$ and off-diagonal entries equal to some constant $\rho\in(-1,1)$ . We set the noise level $\sigma^{2}=1$ and consider correlation level $\rho=0.5$ .

Model 2

Consider linear additive components $g_{j}(X_{j})=X_{j}$ with noise level $\sigma^{2}=3$ . The variables $X_{2},\ldots,X_{p}$ are i.i.d standard normal random variables and $X_{1}=-\frac{1}{3}X_{2}^{3}+\widetilde{\varepsilon}$ where $\widetilde{\varepsilon}\sim{\mathcal{N}}(0,1)$ . Notice that even though the model is linear, the marginal projections $\mathbb{E}[Y|X_{1}]$ and $\mathbb{E}[Y|X_{2}]$ are nonlinear. This is similar to Example 2 of [11].

Model 3

We consider additive components $g_{j}(X_{j})=\cos(4\pi X_{j})$ , where $\mathbf{X}\sim\text{Uniform}([0,1]^{p})$ (i.e., all predictor variables are independent) and $\sigma^{2}=1$ .

Model 4

Consider the nonlinear additive components

	$\displaystyle g_{1}(x)=5x,\quad g_{2}(x)=3(2x-1)^{2},\quad g_{3}(x)=\frac{4\sin(2\pi x)}{2-\sin(2\pi x)},$
	$\displaystyle g_{4}(x)=6(0.1\sin(2\pi x)+0.2\cos(2\pi x)+0.3\sin^{2}(2\pi x)+0.4\cos^{3}(2\pi x)$
	$\displaystyle\quad\quad\quad\quad+0.5\sin^{3}(2\pi x)).$

Let $\mathbf{X}\sim\text{Uniform}([0,1]^{p})$ and set the noise level $\sigma^{2}=1.74$ . This is the same model as Example 3 of [11] with $t=0$ (and correlation $0$ ).

Model 5

We consider monotone additive components

	$\displaystyle g_{1}(x)=-\exp(x^{2}),\quad g_{2}(x)=-\log(x+0.1),$
	$\displaystyle g_{3}(x)=2\tanh(20x^{2})+0.5\exp(x^{3}),\quad g_{4}(x)=\frac{2\exp(10x-5)}{1+\exp(10x-5)}.$

The variables are $\mathbf{X}\sim\text{Uniform}([0,1]^{p})$ and we have noise level $\sigma^{2}=1$ . This is similar to Model A in [3, Section 3].

In our simulations, Models 1 tests the correlated Gaussian linear model setting of Theorem 1. Model 2 modifies the setting of Theorem 1 by including a relevant variable that is a nonlinear function of another relevant variable so that some marginal projections are nonlinear. Models 3 and 4 are examples of the setting of Theorem 2 for general nonparametric models. Model 5 considers the shape-constrained setting of Theorem 3. Though our main results apply to general nonparametric models, we have chosen to focus our experiments on additive models to facilitate comparison with other methods designed for the same setting.

7.1 Exact support recovery

For our experiments on exact recovery, we fix the sparisty level $s=4$ and estimate the probability of exact support recovery by running $50$ independent replications and computing the fraction of replications which exactly recover the support of the model. In Figure 1, we plot this estimated probability against various sample sizes $n<p$ , namely, $n\in\{100,200,300,\dots,1000\}$ . In agreement with Theorem 1, in Figure 1(a), we observe that SDI and SIS exhibit similar behavior for correlated Gaussian linear models, a case in which all methods appear to achieve model selection consistency. However, in Figure 1(b), we see that SDI is significantly more robust than SIS to a linear model when the marginal projections are nonlinear. As expected, Figure 1(c) and 1(d) show that SDI, NIS, SPAM, and MDI significantly outperform SIS and Lasso when the model has nonlinear and non-monotone additive components. For more irregular component functions such as sinusoids, SDI appears to outperform SPAM, as seen in Figure 1(c). In agreement with Theorem 3, Figure 1(e) shows that SDI appears to perform better for nonlinear monotone components, though all methods (even linear methods such as Lasso and SIS) appear to perform better as well. In general, for additive models, SDI appears to outperform MDI CART though it appears to sacrifice a small amount of accuracy for simplicity compared to its progenitor MDI RF. In summary, in agreement with Theorems 2 and 3, Figure 1 demonstrates the robustness of SDI across various additive models.

Refer to caption — (a) Model 1 (SNR $10.00$ )

7.2 Data driven choices of $\gamma_{n}$

We now consider the case when $s$ is unknown. As we have already demonstrated the performance of SDI in Section 7.1, here we examine how well the data-driven thresholding methods discussed in Section 2.5 can estimate the true sparsity level $s$ .

We first consider Model 5 and fix the sparsity level $s=4$ and the sample size $n=1000$ . In Figure 2(a), we plot the probability of exact support recovery (averaged over $50$ independent replications) using the permutation method against the number of permutations. After a very small number of permutations, the significance level and performance of the algorithm appear to stabilize. For comparison, in Figure 2(b), we show the graph of the ranked impurity reductions (SDI importance measures) used for the elbow method. When there is no correlation between irrelevant and relevant variables, as is the case with Model 5, the permutation method may be more precise than the elbow method.

However, as discussed in Section 2.5, the permutation method may be inaccurate if the irrelevant and relevant variables are correlated. To illustrate this, we now consider Model 1 with sparsity level $s=4$ . As we can see from Figure 2(c), the permutation method fails completely. Using $20$ permutations, it chooses a threshold $\gamma_{n}\approx 0.024$ , which will lead to all variables being selected, as can be seen from the ranked impurity reductions shown in Figure 2(d). In other words, when there is strong correlation between the variables, the permutation method significantly underestimates the threshold $\gamma_{n}$ , which in turn, creates too many false positives. In contrast, as can be seen from Figure 2(d), the ranked impurity reductions still exhibit a distinct “elbow”, from which the relevant and irrelevant variables can be discerned.

8 Discussion and conclusion

In this paper, we developed a theoretically rigorous approach for variable selection based on decision trees. The underlying approach is simple, intuitive, and interpretable—we test whether a variable is relevant/irrelevant by fitting a decision stump to that variable and then determining how much it explains the variance of the response variable. Despite its simplicity, SDI performs favorably relative to its less interpretable competitors. Furthermore, due to the parsimony of the model, there is also no need to perform variable bandwidth selection or calibrate the number terms in basis expansions.

On the other hand, we have sacrificed generality for analytical tractability. That is, decision stumps are poor at capturing interaction effects and, therefore, an importance measure built from a multi-level decision tree (such as MDI) may be more appropriate for models with more than just main effects. However, the presence of multi-level splits adds an additional layer of complexity to the analysis that, at the moment, we do not know how to overcome. It is also unclear how to leverage these additional splits to strengthen the theory. While out of the scope of the present paper, we view this as an important problem for future investigation. It is our hope that the tools in this paper can be used by other scholars to address this important issue.

To conclude, we hope that our analysis of single-level decision trees for variable selection will shed further light on the unique benefits of tree structured learning.

9 Appendix

A Appendix

In this appendix, we first prove Lemma 2 in detail in Appendix A. We then prove Theorem 1 on linear models with Gaussian variates in Appendix B and then use Theorem 1 to determine a sufficient sample size for model selection consistency (mentioned in Section 6) in Appendix C. Finally, we prove Propositions 1 and 2 in Appendix D and then prove Theorems 2 and 3 in Appendix E.

A Proof of Lemma 2

Let $\pi$ be a permutation of the data such that $X_{\pi(1)j}\leq X_{\pi(2)j}\leq\cdots\leq X_{\pi(n)j}$ . Recall from the representation (6) that we have

\widehat{\Delta}(X_{j},Y)=\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\bigg{(}\underbrace{\frac{1}{k}\sum_{X_{ij}\leq X_{\pi(k)j}}Y_{i}-\frac{1}{n-k}\sum_{X_{ij}>X_{\pi(k)j}}Y_{i}}_{\text{(III)}}\bigg{)}^{2}.

Now, since

\sum_{i=1}^{n}\bigg{(}\frac{\mathbf{1}(X_{ij}\leq X_{\pi(k)j})}{k}-\frac{\mathbf{1}(X_{ij}>X_{\pi(k)j})}{n-k}\bigg{)}=0,

we can rewrite (III) as

	$\displaystyle\underbrace{\frac{1}{k}\sum_{X_{ij}\leq X_{\pi(k)j}}(Y_{i}-f_{j}(X_{ij}))}_{\text{(a)}}-\underbrace{\frac{1}{n-k}\sum_{X_{ij}>X_{\pi(k)j}}(Y_{i}-f_{j}(X_{ij}))}_{\text{(b)}}$
	$\displaystyle+\underbrace{\sum_{i=1}^{n}\Bigg{(}f_{j}(X_{ij})-\frac{1}{n}\sum_{i=1}^{n}f_{j}(X_{ij})\Bigg{)}\bigg{(}\frac{\mathbf{1}(X_{ij}\leq X_{\pi(k)j})}{k}-\frac{\mathbf{1}(X_{ij}>X_{\pi(k)j})}{n-k}\bigg{)}}_{\text{(c)}}.$

Therefore, we have that

$\displaystyle\widehat{\Delta}(X_{j},Y)$	$\displaystyle=\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}(\text{(a)}-\text{(b)}+\text{(c)})^{2}$	(A.1)
	$\displaystyle\leq 3\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(a)}^{2}+3\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(b)}^{2}+$
	$\displaystyle\qquad 3\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(c)}^{2},$

where we use, in succession, the inequality $(x-y+z)^{2}\leq 3(x^{2}+y^{2}+z^{2})$ for any real numbers $x$ , $y$ , and $z$ , and the fact that the maximum of a sum is at most the sum of the maxima. To finish the proof, we will bound the terms involving $\text{(a)}^{2}$ , $\text{(b)}^{2}$ , and $\text{(c)}^{2}$ separately.

For the last term in (A.1), notice that by the Cauchy-Schwartz inequality we have

	$\displaystyle\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(c)}^{2}$	$\displaystyle=\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\Bigg{[}\sum_{i=1}^{n}\Big{(}f_{j}(X_{ij})-\frac{1}{n}\sum_{i=1}^{n}f_{j}(X_{ij})\Big{)}\times$
		$\displaystyle\qquad\qquad\qquad\bigg{(}\frac{\mathbf{1}(X_{ij}\leq X_{\pi(k)j})}{k}-\frac{\mathbf{1}(X_{ij}>X_{\pi(k)j})}{n-k}\bigg{)}\Bigg{]}^{2}$
		$\displaystyle\leq\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\sum_{i=1}^{n}\Big{(}f_{j}(X_{ij})-\frac{1}{n}\sum_{i=1}^{n}f_{j}(X_{ij})\Big{)}^{2}\times$
		$\displaystyle\qquad\qquad\qquad\sum_{i=1}^{n}\bigg{(}\frac{\mathbf{1}(X_{ij}\leq X_{\pi(k)j})}{k}-\frac{\mathbf{1}(X_{ij}>X_{\pi(k)j})}{n-k}\bigg{)}^{2},$

which is exactly equal to

\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\Big{[}n\widehat{\mathrm{Var}}(f_{j}(X_{j}))\Big{(}k\cdot\frac{1}{k^{2}}+(n-k)\cdot\frac{1}{(n-k)^{2}}\Big{)}\Big{]}=\widehat{\mathrm{Var}}(f_{j}(X_{j})).

Therefore we have shown that

\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(c)}^{2}\leq\widehat{\mathrm{Var}}(f_{j}(X_{j})).

(A.2)

To bound the first term in (A.1), by a union bound we have that

	$\displaystyle\mathbb{P}\Big{(}\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(a)}^{2}>\frac{\xi^{2}}{6}\Big{)}$
	$\displaystyle\leq\sum_{k=1}^{n}\mathbb{P}\Big{(}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(a)}^{2}>\frac{\xi^{2}}{6}\Big{)}$
	$\displaystyle=\sum_{k=1}^{n}\mathbb{P}\Big{(}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\Big{(}\frac{1}{k}\sum_{X_{ij}\leq X_{\pi(k)j}}(Y_{i}-f_{j}(X_{ij}))\Big{)}^{2}>\frac{\xi^{2}}{6}\Big{)}.$		(A.3)

Next, notice that, conditional on $X_{1j},\dots,X_{nj}$ , $\sum_{X_{ij}\leq X_{\pi(k)j}}(Y_{i}-f_{j}(X_{ij}))$ is a sum of $k$ independent, sub-Gaussian, mean zero random variables. Thus, by the law of total probability, we have that (A.3) is equal to

\sum_{k=1}^{n}\mathbb{E}\Bigg{[}\mathbb{P}\Bigg{(}\bigg{|}\frac{1}{k}\sum_{X_{ij}\leq X_{\pi(k)j}}(Y_{i}-f_{j}(X_{ij}))\bigg{|}>\xi\sqrt{\frac{n^{2}}{6k(n-k)}}\quad\Bigg{|}\quad X_{1j},\dots,X_{nj}\Bigg{)}\Bigg{]}

and, by Hoeffding’s inequality for sub-Gaussian random variables, is bounded by

\sum_{k=1}^{n}2\exp\Big{(}-k\frac{\xi^{2}n^{2}}{12k(n-k)\sigma^{2}_{Z_{j}}}\Big{)}\leq 2n\exp\Big{(}-\frac{\xi^{2}n}{12\sigma^{2}_{Z_{j}}}\Big{)}.

Note that here we have implicitly used the fact that $\mathbb{E}[\exp(\lambda Z_{j})|X_{j}]\leq\exp(\lambda^{2}\sigma^{2}_{Z_{j}}/2)$ . It thus follows that with probability at least $1-2n\exp\big{(}-\frac{\xi^{2}n}{12\sigma^{2}_{Z_{j}}}\big{)}$ that

\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(a)}^{2}\leq\frac{\xi^{2}}{6}.

(A.4)

A similar argument shows that with probability at least $1-2n\exp\Big{(}-\frac{\xi^{2}n}{12\sigma^{2}_{Z_{j}}}\Big{)}$ , the second terms in (A.1) obeys

\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(b)}^{2}\leq\frac{\xi^{2}}{6}.

(A.5)

Therefore, substituting (A.2), (A.4), and (A.5) into (A.1) and using a union bound, it follows that with probability at least $1-4n\exp\big{(}-\frac{\xi^{2}n}{12\sigma^{2}_{Z_{j}}}\big{)}$ ,

\widehat{\Delta}(X_{j},Y)\leq 3\widehat{\mathrm{Var}}(f_{j}(X_{j}))+\xi^{2}.

B Proof of Theorem 1

The goal of this section is to prove Theorem 1. In the first section, we prove the lower bound (10) and in the second section, we prove the upper bound (11).

Throughout this section, for brevity, we let $\rho_{j}={\rm{Cor}}(X_{j},Y)\neq 0$ .

B.1 Proof of the lower bound (10)

Choosing $h(X_{j})=X_{j}$ (which is monotone) in Lemma 1 to get that

\widehat{\Delta}(X_{j},Y)\geq\frac{1}{\log(2n)+1}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{X_{j}}{\sqrt{\widehat{\mathrm{Var}}(X_{j})}},Y\Bigg{)}=\frac{\widehat{\mathrm{Var}}(Y)}{\log(2n)+1}\times\widehat{\rho}^{\;2}(X_{j},Y).

Now observe that $\widehat{\rho}\,(X_{j},Y)$ is the empirical Pearson sample correlation between two correlated normal distributions. If $\rho_{j}>0$ , by [18, Equation (44)], we have that

		$\displaystyle\mathbb{P}(\widehat{\rho}\,(X_{j},Y)>(1-\delta)\rho_{j})$		(B.6)
		$\displaystyle=1-\mathbb{P}(\widehat{\rho}\,(-X_{j},Y)>-(1-\delta)\rho_{j})$
		$\displaystyle\sim 1-(2\pi)^{-1/2}\frac{\Gamma(n)}{\Gamma(n+1/2)}(1-\rho_{j}^{2})^{n/2}(1-[(1-\delta)\rho_{j}]^{2})^{(n-1)/2}$
		$\displaystyle\qquad\times(-(1-\delta)\rho_{j}-(-\rho_{j}))^{-1}(1-(-\rho_{j})(-(1-\delta)\rho_{j}))^{-n+3/2}(1+{\mathcal{O}}(n^{-1})).$

If $\rho_{j}<0$ we can show the same bound on $\mathbb{P}(\widehat{\rho}\,(X_{j},Y)<(1-\delta)\rho_{j})$ . Again by [18, Equation (44)], we have the similar bound

		$\displaystyle\mathbb{P}(\widehat{\rho}\,(X_{j},Y)<(1-\delta)\rho_{j})$		(B.7)
		$\displaystyle=1-\mathbb{P}(\widehat{\rho}\,(X_{j},Y)>(1-\delta)\rho_{j})$
		$\displaystyle\sim 1-(2\pi)^{-1/2}\frac{\Gamma(n)}{\Gamma(n+1/2)}(1-\rho_{j}^{2})^{n/2}(1-[(1-\delta)\rho_{j}]^{2})^{(n-1)/2}$
		$\displaystyle\qquad\times((1-\delta)\rho_{j}-\rho_{j})^{-1}(1-\rho_{j}\times(1-\delta)\rho_{j}))^{-n+3/2}(1+{\mathcal{O}}(n^{-1})).$

Therefore because of (B.6) and (B.7), regardless of the sign of $\rho_{j}$ , it follows that there exists a universal constant $C_{0}$ for which

	$\displaystyle\mathbb{P}(\|\widehat{\rho}\,(X_{j},Y)\|>(1-\delta)\|\rho_{j}\|)$
	$\displaystyle\geq 1-\frac{C_{0}}{\sqrt{2\pi}\delta\|\rho_{j}\|}\frac{\Gamma(n)}{\Gamma(n+1/2)}(1-\rho_{j}^{2})^{\frac{n}{2}}(1-(1-\delta)^{2}\rho_{j}^{2})^{\frac{n-1}{2}}(1-(1-\delta)\rho_{j}^{2})^{-n+\frac{3}{2}}$
	$\displaystyle\geq 1-\frac{C_{0}}{\sqrt{n}\delta\|\rho_{j}\|}\exp\big{(}-\rho_{j}^{2}n/2-(1-\delta)^{2}\rho_{j}^{2}(n-1)/2+(1-\delta)\rho_{j}^{2}(n-3/2)\big{)}$		(B.8)
	$\displaystyle=1-\frac{C_{0}}{\sqrt{n\delta^{2}\rho_{j}^{2}}}\exp\big{(}-\rho_{j}^{2}n\delta^{2}/2+\rho_{j}^{2}(1-\delta)^{2}/2-3(1-\delta)\rho_{j}^{2}/2\big{)}$
	$\displaystyle\geq 1-\frac{C_{0}}{\sqrt{n\delta^{2}\rho_{j}^{2}}}\exp(-\rho_{j}^{2}n\delta^{2}/2),$

where we used $\exp(x)\geq 1+x$ and Wendel’s inequality [45] $\frac{\Gamma(n)}{\Gamma(n+1/2)}\leq\sqrt{\frac{n+1/2}{n}}\frac{1}{\sqrt{n}}\leq\sqrt{\frac{2\pi}{n}}$ in the second inequality (B.8).

Thus, we have that with probability at least $1-\frac{C_{0}}{\sqrt{n\delta^{2}\rho_{j}^{2}}}\exp(-\rho_{j}^{2}n\delta^{2}/2)$ that

\widehat{\Delta}(X_{j},Y)\geq\frac{(1-\delta)^{2}\widehat{\mathrm{Var}}(Y)\rho^{2}_{j}}{\log(2n)+1}\iff\widehat{\rho}^{\,2}(\widehat{Y}(X_{j}),Y)\geq\frac{(1-\delta)^{2}\rho_{j}^{2}}{\log(2n)+1}.

This completes the first half of the proof of Theorem 1.

B.2 Proof of the upper bound (11)

We first state the following sample variance concentration inequality, which will be helpful.

Lemma B.1.

Let $Z_{1},\dots,Z_{n}$ be i.i.d. ${\mathcal{N}}(0,\sigma^{2}_{Z})$ . For any $0<\delta<1$ , we have

\mathbb{P}(\widehat{\mathrm{Var}}(Z)\geq(1-\delta)\frac{n-1}{n}\sigma^{2}_{Z})\geq 1-\exp(-\delta^{2}(n-1)/4)

(B.9)

and

\mathbb{P}(\widehat{\mathrm{Var}}(Z)\leq(1+\delta)\frac{n-1}{n}\sigma^{2}_{Z})\geq 1-\exp(-(n-1)(1+\delta-\sqrt{1+2\delta})/2).

(B.10)

Proof of Lemma B.1.

Since $Z_{i}$ are independent and normally distributed, by Cochran’s theorem we have $\widehat{\mathrm{Var}}(Z)\sim\frac{\sigma^{2}_{Z}}{n}\chi^{2}_{n-1}$ . In the notation of [28], choosing $D=n-1$ and $x=\delta^{2}(n-1)/4$ for the chi-squared concentration inequality (4.4) in [28], we have that

	$\displaystyle\mathbb{P}\Big{(}\widehat{\mathrm{Var}}(Z)\geq(1-\delta)\frac{n-1}{n}\sigma^{2}_{Z}\Big{)}$	$\displaystyle=1-\mathbb{P}(\chi_{n-1}^{2}<(1-\delta)(n-1))$
		$\displaystyle\geq 1-\exp(-\delta^{2}(n-1)/4),$

proving (B.9). For (B.10), choosing $D=n-1$ and $x=(n-1)(1+\delta-\sqrt{1+2\delta})/2$ in [28, Equation (4.3)] we see that

	$\displaystyle\mathbb{P}\Big{(}\widehat{\mathrm{Var}}(Z)\leq(1+\delta)\frac{n-1}{n}\sigma^{2}_{Z}\Big{)}$	$\displaystyle=1-\mathbb{P}(\chi_{n-1}^{2}>(1+\delta)(n-1))$
		$\displaystyle\geq 1-\exp(-(n-1)(1+\delta-\sqrt{1+2\delta})/2).\qed$

Now we are ready to prove the upper bound (11). We begin with the inequality (13), as shown in the proof sketch of Theorem 1. We aim to upper bound the right hand side of (13) using Lemma B.1. Since the samples $X_{1j},\dots,X_{nj}$ are i.i.d., using (B.10) and choosing $\delta=1$ , we find that with probability at least $1-\exp\big{(}-(n-1)\frac{2-\sqrt{3}}{2}\big{)}\geq 1-\exp(-(n-1)/16)$ , we have that $\widehat{\mathrm{Var}}(X_{j})\leq 2\sigma^{2}_{X_{j}}$ . Similarly, choosing $\delta=1/2$ in (B.9), we also have that with probability at least $1-\exp(-(n-1)/16)$ that $\widehat{\mathrm{Var}}(Y)\geq\sigma_{Y}^{2}/4$ . Substituting these concentration inequalities into the right hand side of (13), it follows by a union bound that with probability at least $1-4n\exp(-n\delta^{2}/12)-2\exp(-(n-1)/16)$ ,

\widehat{\Delta}(X_{j},Y)\leq 24\widehat{\mathrm{Var}}(Y)\rho_{j}^{2}+4\delta^{2}\widehat{\mathrm{Var}}(Y)\iff\widehat{\rho}^{\,2}(\widehat{Y}(X_{j}),Y)\leq 24\rho_{j}^{2}+4\delta^{2}.

Finally, noticing that $\sqrt{24\rho_{j}^{2}+4\delta^{2}}<5|\rho_{j}|+2\delta$ completes the proof.

C Proof of Model Selection Consistency for Linear Models

Recall the setting mentioned under the heading “Minimum sample size for consistency” in Section 6, which considers the same linear model with Gaussian variates from Theorem 1. To reiterate, we assume that $\boldsymbol{\Sigma}=\mathbf{I}_{p\times p}$ is the $p\times p$ identity matrix, $\sum_{k=1}^{p}\beta^{2}_{k}={\mathcal{O}}(1)$ , and $\min_{j\in{\mathcal{S}}}|\beta_{j}|^{2}\asymp 1/s$ , all of which are special cases of the more general setting considered in [41, Corollary 1]. Under these assumptions, we then have $\rho^{2}(X_{j},Y)=\beta^{2}_{j}/(\sigma^{2}+\sum_{k=1}^{p}\beta^{2}_{k})\gtrsim 1/s$ for any $j\in{\mathcal{S}}$ and $\rho(X_{j},Y)=0$ for $j\in{\mathcal{S}}^{c}$ . Our goal is to show that $n\asymp s\log(n)\log(n(p-s))$ samples suffice for high probability model selection consistency.

Choosing $\delta=1/2$ in (10) applied to $j\in{\mathcal{S}}$ and using (7), there exists a universal positive constant $C_{0}$ such that with probability at least $1-\frac{2C_{0}}{\sqrt{n\rho^{2}(X_{j},Y)}}\exp(-n\rho^{2}(X_{j},Y)/8)$ , we have

	$\displaystyle\widehat{\Delta}(X_{j},Y)$	$\displaystyle\geq\widehat{\mathrm{Var}}(Y)\times\frac{\rho^{2}(X_{j},Y)}{4(\log(2n)+1)}$
		$\displaystyle=\frac{\widehat{\mathrm{Var}}(Y)}{4(\log(2n)+1)}\times\frac{\beta_{j}^{2}}{\sigma^{2}+\sum_{k=1}^{p}\beta_{k}^{2}}$
		$\displaystyle\gtrsim\frac{\widehat{\mathrm{Var}}(Y)}{s\log(n)}.$

Therefore by a union bound over all $s$ relevant variables, we have that with probability at least $1-s\max_{j\in{\mathcal{S}}}\{\frac{2C_{0}}{\sqrt{n\rho^{2}(X_{j},Y)}}\exp(-n\rho^{2}(X_{j},Y)/8)\}$ ,

\widehat{\Delta}(X_{j},Y)\gtrsim\frac{\widehat{\mathrm{Var}}(Y)}{s\log(n)}\qquad\forall j\in{\mathcal{S}}.

(C.11)

Furthermore, by applying (11) for $j\in{\mathcal{S}}^{c}$ (and noting that $\rho(X_{j},Y)=0$ ) and using (7), with probability at least $1-4n\exp(-\delta^{2}n/12)-2\exp(-(n-1)/16)$ , we have

\widehat{\Delta}(X_{j},Y)\leq 4\widehat{\mathrm{Var}}(Y)\delta^{2}.

Therefore by a union bound over all $p-s$ irrelevant variables we have that with probability at least $1-4n(p-s)\exp(-\delta^{2}n/12)-2(p-s)\exp(-(n-1)/16)$ ,

\widehat{\Delta}(X_{j},Y)\leq 4\widehat{\mathrm{Var}}(Y)\delta^{2}\qquad\forall j\in{\mathcal{S}}^{c}.

Now, choosing $\delta^{2}=\frac{C_{3}}{s\log(n)}$ for some appropriate constant $C_{3}>0$ which only depends on $\sigma^{2}$ to match (C.11), we see by a union bound that

	$\displaystyle\mathbb{P}(\widehat{{\mathcal{S}}}={\mathcal{S}})$	$\displaystyle\geq 1-\max_{j\in{\mathcal{S}}}\Big{\{}\frac{2C_{0}s}{\sqrt{n\rho^{2}(X_{j},Y)}}\exp\Big{(}-\frac{n\rho^{2}(X_{j},Y)}{8}\Big{)}\Big{\}}$		(C.12)
		$\displaystyle\qquad-4n(p-s)\exp\Big{(}-\frac{C_{3}n}{12s\log(n)}\Big{)}-2(p-s)\exp\Big{(}-\frac{(n-1)}{16}\Big{)}.$		(C.12)

Since $\rho^{2}(X_{j},Y)\gtrsim 1/s$ for all $j\in{\mathcal{S}}$ , (C.12) implies that if $n(p-s)\exp\big{(}-\frac{C_{3}n}{12s\log(n)}\big{)}\rightarrow 0$ , then $\mathbb{P}(\widehat{{\mathcal{S}}}={\mathcal{S}})\rightarrow 1$ . Hence, a sufficient sample size for consistent support recovery is

n\asymp s\log(n)\log(n(p-s)),

as desired.

D Proof of Propositions 1 and 2

This section will mainly be devoted to proving Proposition 1. First we will present the machinery which will be used to prove Proposition 1. At the end of the section we will complete the proof of Proposition 1 and prove Proposition 2 by recycling and simplifying the proof of Proposition 1.

First, we state and prove a lemma that will be used in later proofs. Though stated in terms of general probability measures, we will be specifically interested in the case where $\mathbb{P}$ is the empirical probability measure $\mathbb{P}_{n}$ and $\mathbb{E}$ is the empirical expectation $\mathbb{E}_{n}$ , both with respect to a sample of size $n$ .

Lemma D.2.

For any random variables $U$ and $V$ with finite second moments with respect to a probability measure $\mathbb{P}$ ,

\mathrm{Cov}_{\mathbb{P}}\Big{(}\frac{U}{\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}},V\Big{)}\geq\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}-2\sqrt{\mathbb{E}_{\mathbb{P}}[(U-V)^{2}]}.

Proof of Lemma D.2.

First note that by the triangle inequality,

\big{|}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{|}\leq\sqrt{\mathrm{Var}_{\mathbb{P}}(U-V)}\leq\sqrt{\mathbb{E}_{\mathbb{P}}{[(U-V)^{2}]}}.

(D.13)

To complete the proof, we write

\mathrm{Cov}_{\mathbb{P}}(U,V)=\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}+\big{(}\mathrm{Cov}_{\mathbb{P}}(U,V)-\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{)}

(D.14)

and apply (D.13) to arrive at

	$\displaystyle\big{\|}\mathrm{Cov}_{\mathbb{P}}(U,V)-\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{\|}$		(D.15)
	$\displaystyle=\big{\|}\mathrm{Cov}_{\mathbb{P}}(U,V)-\mathrm{Var}_{\mathbb{P}}(U)+\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\big{(}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{)}\big{\|}$
	$\displaystyle\leq\big{\|}\mathrm{Cov}_{\mathbb{P}}(U,V)-\mathrm{Var}_{\mathbb{P}}(U)\big{\|}+\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\times\big{\|}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{\|}$
	$\displaystyle\leq 2\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\times\big{\|}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{\|}$		(D.16)
	$\displaystyle\leq 2\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathbb{E}_{\mathbb{P}}{[(U-V)^{2}]}},$

where the penultimate line (D.16) follows from the Cauchy-Schwarz inequality. Substituting (D.15) into (D.14), we get

\mathrm{Cov}_{\mathbb{P}}(U,V)\geq\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\big{(}\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}-2\sqrt{\mathbb{E}_{\mathbb{P}}{[(U-V)^{2}]}}\big{)},

which proves the claim. ∎

The following sample variance concentration inequality will also come in handy.

Lemma D.3 (Equation 5, [34]).

Let $U$ be a random variable bounded by $B$ . Then for all $\gamma>0$ ,

\mathbb{P}\Big{(}\frac{n}{n-1}\widehat{\mathrm{Var}}(U)\geq\mathrm{Var}(U)-\gamma\Big{)}\geq 1-\exp\Big{(}-\frac{(n-1)\gamma^{2}}{8B^{2}\mathrm{Var}(U)}\Big{)}.

As explained in the main text, the key step in the proof of Proposition 1 is to apply Lemma 1 with a good approximation $\widetilde{f_{j}}(\cdot)$ to the marginal projection $f_{j}(\cdot)$ that also has a sufficiently large number of data points in every one of its stationary intervals. The following lemma provides the precise properties of such an approximation.

Lemma D.4.

Suppose $f_{j}(\cdot)$ satisfies Assumption 2. Let $a>0$ be a positive constant and let $M$ be a positive integer such that $Ma\leq 1$ . There exists a function $\widetilde{f}_{j}(\cdot)$ with at most $M$ stationary intervals such that, with probability at least $1-2n\exp(-na/12)$ , both of the following statements are simultaneously true:

1.

The number of data points in any stationary interval of $\widetilde{f}_{j}(\cdot)$ is between $na/2$ and $3na/2$ .
2.

$\sqrt{\mathbb{E}_{n}[(f_{j}(X_{j})-\widetilde{f}_{j}(X_{j}))^{2}]}\leq K_{0}M^{-d}+K_{1}(Ma)^{5/2}$ , where $K_{0}$ and $K_{1}$ are some constants depending on $d$ , $B$ , $\alpha$ , and $r$ , and $\mathbb{E}_{n}$ denotes the empirical expectation.

Proof of Lemma D.4.

Recall that $f_{j}(\cdot)$ is defined over $\mathbb{R}$ , so even though we restrict our attention $[0,1]$ , we can assume it is bounded over the larger interval $[-1,2]$ for all $j$ . By [23, Theorem VIII], there exists a polynomial $P_{M}(\cdot)$ of degree $M+1>r$ such that

\displaystyle\sup_{x\in[-1,2]}|f_{j}(x)-P_{M}(x)|

\displaystyle\leq 3^{d}LA(M+1)^{-r}(M+1-r)^{-\alpha},

where $A=\frac{(r+1)^{r-1}(K/2)^{r}(K/2+2)}{r!}$ , $L$ is the Lipschitz constant from Assumption 2, and $K$ is a universal constant given in [23, Theorem I]. Since $\frac{M+1}{M+1-r}\leq r+1$ whenever $M+1>r$ , we also have that

\sup_{x\in[-1,2]}|f_{j}(x)-P_{M}(x)|\leq 3^{d}(r+1)^{\alpha}LA(M+1)^{-d}\leq K_{0}M^{-d},

(D.17)

where $K_{0}=3^{d}(r+1)^{\alpha}LA$ is a constant.

Let $\{\xi_{k}\}_{1\leq k\leq M}$ be the collection of (at most) $M$ stationary points of $P_{M}(\cdot)$ in $[0,1]$ . We assume that $a<\min_{k}(\xi_{k}-\xi_{k-1})$ ; otherwise we remove points from $\{\xi_{k}\}_{1\leq k\leq M}$ until this holds. Let $I_{k}=[\xi_{k},\xi_{k}+a]$ .

We will now show that Properties 1 and 2 of Lemma D.4 are satisfied by the function

\widetilde{f}_{j}(x)=\begin{cases}P_{M}(x)&x\notin\bigcup_{k}I_{k}\\ P_{M}(\xi_{k})&x\in I_{k}\;\text{for some}\;k.\end{cases}

(D.18)

First, it is clear that $\widetilde{f}_{j}(\cdot)$ has at most $M$ stationary intervals. Next, notice that by the multiplicative version of Chernoff’s inequality [1, Proposition 2.4], since each stationary interval $I_{k}$ has length $a$ , we have

\mathbb{P}(\#\{X_{ij}\in I_{k}\}\geq na/2)>1-\exp(-na/8),

and

\mathbb{P}(\#\{X_{ij}\in I_{k}\}\leq 3na/2)>1-\exp(-na/12).

Thus, by a union bound, the probability that

na/2\leq\#\{X_{ij}\in I_{k}\}\leq 3na/2\qquad\forall\;k

is at least $1-2M\exp(-na/12)$ . Since all stationary intervals are disjoint and are contained in $[0,1]$ , we have $Ma\leq 1$ or $M\leq 1/a$ , which implies that $1-2M\exp(-na/12)\geq 1-(2/a)\exp(-na/12)$ . Therefore, we know that

\mathbb{P}(3na/2\geq\#\{X_{ij}\in I_{k}\}\geq na/2\;\text{for all}\;k)\geq\max\{0,1-(2/a)\exp(-na/12)\}.

(D.19)

Notice that if $a\leq 1/n$ , then

\max\{0,1-(2/a)\exp(-na/12)\}=0\geq 1-2n\exp(-na/12)

and if $a>1/n$ , then

	$\displaystyle\max\{0,1-(2/a)\exp(-na/12)\}$	$\displaystyle\geq 1-(2/a)\exp(-na/12)$
		$\displaystyle\geq 1-2n\exp(-na/12).$

Thus, for all $a\geq 0$ , we have

\max\{0,1-(2/a)\exp(-na/12)\}\geq 1-2n\exp(-na/12),

which when combined with (D.19) proves Property 1 of Lemma D.4.

To prove Property 2 of Lemma D.4 we use the triangle inequality and part of Property 1:

\mathbb{P}(\#\{X_{ij}\in I_{k}\}\leq 3na/2\;\text{for all}\;k)\geq 1-2n\exp(-na/12).

(D.20)

In view of (D.17), to bound $\sqrt{\mathbb{E}_{n}[(f_{j}(X_{j})-\widetilde{f}_{j}(X_{j}))^{2}]}$ we aim to bound $\frac{1}{n}\sum_{i=1}^{n}(P_{M}(X_{ij})-\widetilde{f}_{j}(X_{ij}))^{2}.$

Notice that by Bernstein’s theorem for polynomials [22, Theorem B2a], we also have

|P^{\prime\prime}_{M}(x)|\leq\frac{(M+1)^{2}\sup_{x\in[-1,2]}|P_{M}(x)|}{(2-x)(x+1)},\quad-1<x<2.

(D.21)

By (D.17), $\sup_{x\in[-1,2]}|P_{M}(x)|\leq K_{0}M^{-d}+\sup_{x\in[-1,2]}|f_{j}(X_{j})|\leq K_{0}+B$ . Thus, by (D.21),

\sup_{x\in[0,1]}|P^{\prime\prime}_{M}(x)|\leq\frac{(M+1)^{2}(K_{0}+B)}{2}\leq K_{1}M^{2},

where $K_{1}=2(K_{0}+B)$ and we used the fact that $(2-x)(x+1)\geq 2$ for $x\in[0,1]$ . Additionally, because each $\xi_{k}$ is a stationary point, $P^{\prime}_{M}(\xi_{k})=0$ , and hence by a second order Taylor approximation, we have

|P_{M}(x)-P_{M}(\xi_{k})|\leq K_{1}M^{2}a^{2}/2

(D.22)

for $x\in I_{k}$ .

Therefore, by (D.18), (D.20), and (D.22) we have that

	$\displaystyle\frac{1}{n}\sum_{i=1}^{n}(P_{M}(X_{ij})-\widetilde{f}_{j}(X_{ij}))^{2}$	$\displaystyle=\sum_{k}\frac{1}{n}\sum_{X_{ij}\in I_{k}}(P_{M}(X_{ij})-P_{M}(\xi_{k}))^{2}$
		$\displaystyle\leq\frac{K_{1}^{2}M^{4}a^{4}}{4n}\sum_{k=1}^{M}\#\{X_{ij}\in I_{k}\}$
		$\displaystyle\leq\frac{K_{1}^{2}M^{4}a^{4}}{4n}\sum_{k=1}^{M}\frac{3na}{2}$
		$\displaystyle=\frac{3K_{1}^{2}(Ma)^{5}}{8},$

with probability at least $1-2n\exp(-na/12)$ . Combining this with (D.17) and using the triangle inequality to bound $\sqrt{\mathbb{E}_{n}[(f_{j}(X_{j})-\widetilde{f}_{j}(X_{j}))^{2}]}$ by $\sqrt{\mathbb{E}_{n}[(P_{M}(X_{j})-\widetilde{f}_{j}(X_{j}))^{2}]}+\sqrt{\vphantom{\mathbb{E}_{n}[(P_{M}(X_{j})-\widetilde{f}_{j}(X_{j}))^{2}]}\mathbb{E}_{n}[(P_{M}(X_{j})-f_{j}(X_{j}))^{2}]}$ proves Property 2. ∎

Returning to the proof of Proposition 1, by Lemma 1 along with Lemma D.4, we have that with probability at least $1-2n\exp(-na/12)$ that

\widehat{\Delta}(X_{j},Y)\geq\frac{1}{2M/a+\log(2n)+1}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}.

(D.23)

Now we choose

a=\left(\frac{\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{4(2K_{0}+K_{1})^{2}}\right)^{(2d+5)/(10d)},\quad M=\lfloor a^{-5/(2d+5)}\rfloor,

(D.24)

where $\tau$ is a constant to be specified in (D.26), so that Property 2 of Lemma D.4 becomes

$\displaystyle\sqrt{\mathbb{E}_{n}[(f_{j}(X)-\widetilde{f}_{j}(X))^{2}]}$	$\displaystyle\leq K_{0}M^{-d}+K_{1}(Ma)^{5/2}$	(D.25)
	$\displaystyle\leq(2K_{0}+K_{1})a^{5d/(2d+5)}$
	$\displaystyle\leq\frac{\tau\sqrt{\mathrm{Var}(f_{j}(X_{j}))}}{2},$

with probability at least $1-2n\exp(-na/12)$ .

Recall the condition $M+1>r$ as part of [23, Theorem VIII], which states that the degree of the polynomial must be greater than the order of Lipschitz derivative in order to approximate the function well. Since, $\mathrm{Var}(f_{j}(X_{j}))\leq B^{2}$ , this condition will be satisfied if we make the following choice:

\tau\coloneqq\min\Big{(}\frac{2(2K_{0}+K_{1})}{r^{d}B},\frac{1}{4}\Big{)}.

(D.26)

It follows by Lemma D.2 along with (D.25) that

\widehat{\mathrm{Cov}}(\widetilde{f}_{j}(X_{j}),f_{j}(X_{j}))\geq\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}\Big{(}\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}-\tau\sqrt{\mathrm{Var}(f_{j}(X_{j}))}\Big{)}

(D.27)

with probability at least $1-2n\exp(-na/12)$ .

In the next Lemma, we use (D.27) along with Lemma D.3 to obtain a lower bound on the right hand side of (D.23).

Lemma D.5.

With probability at least $1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)}-\exp\Big{(}-\frac{n\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{8(B^{2}+\sigma^{2})}\Big{)}-2n\exp(-na/12)$ , we have that

\displaystyle\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}\geq(\tau/2)\sqrt{\mathrm{Var}(f_{j}(X_{j}))}.

Proof of Lemma D.5.

Recalling (16), we will prove Lemma D.5 by first getting a concentration bound on (I) and then getting a concentration bound on (II).

To get a concentration bound on (I), we need Lemma D.3 to lower bound the sample variance on the right hand side of inequality (D.27). Choosing $U=f_{j}(X_{j})\in[-B,B]$ and $\gamma=\mathrm{Var}(f_{j}(X_{j}))(1-8\tau^{2})$ (which is greater than zero by the choice of $\tau$ in (D.26)), notice that Lemma D.3 gives us

	$\displaystyle\mathbb{P}\Big{(}\widehat{\mathrm{Var}}(f_{j}(X_{j}))\geq\frac{8\tau^{2}(n-1)}{n}\mathrm{Var}(f_{j}(X_{j}))\Big{)}$
	$\displaystyle\geq 1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)},$

so that by (D.27),

	$\displaystyle\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},f_{j}(X_{j})\Bigg{)}$	$\displaystyle\geq\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}-\tau\sqrt{\mathrm{Var}(f_{j}(X_{j}))}$
		$\displaystyle\geq\tau(\sqrt{8}\sqrt{1-1/n}-1)\sqrt{\mathrm{Var}(f_{j}(X_{j}))}$
		$\displaystyle\geq\tau\sqrt{\mathrm{Var}(f_{j}(X_{j}))},$

with probability at least $1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)}-2n\exp(-na/12)$ .

Now we need to get a concentration bound for (II). Let $s_{i}=\frac{\widetilde{f}_{j}(X_{ij})-\frac{1}{n}\sum_{k=1}^{n}\widetilde{f}_{j}(X_{kj})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}}$ . We need to bound

\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}=\frac{1}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij})+\varepsilon_{i}).

For notational simplicity, we let $\underline{\mathbf{X}}=(X_{ij})$ be the $n\times p$ data matrix with $\mathbf{X}_{i}$ as rows. First notice that

	$\displaystyle\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij})+\varepsilon_{i})\Bigg{)}\Bigg{]}$
	$\displaystyle=\mathbb{E}\Bigg{[}\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij})+\varepsilon_{i})\Bigg{)}\Bigg{\|}\underline{\mathbf{X}}\Bigg{]}\Bigg{]}$
	$\displaystyle=\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\mathbb{E}\Bigg{[}\exp\left(\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}\varepsilon_{i}\right)\Bigg{\|}\underline{\mathbf{X}}\Bigg{]}\Bigg{]}.$

Now, by the sample independence of the errors $\epsilon_{i}$ , we can write the above as

	$\displaystyle\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\prod_{i=1}^{n}\mathbb{E}\Big{[}\exp\Big{(}\frac{\lambda}{n}s_{i}\varepsilon_{i}\Big{)}\Big{\|}\underline{\mathbf{X}}\Big{]}\Bigg{]}$
	$\displaystyle\leq\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\prod_{i=1}^{n}\exp\Big{(}\frac{\lambda^{2}s_{i}^{2}\sigma^{2}}{2n^{2}}\Big{)}\Bigg{]}$
	$\displaystyle=\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\Bigg{]}\exp\Big{(}\frac{\lambda^{2}\sigma^{2}}{2n}\Big{)},$

where we used Assumption 6 and the fact that $\frac{1}{n}\sum_{i=1}^{n}s^{2}_{i}=1$ . Recalling that $s_{i}$ depends on $(X_{1j},X_{2j},\dots,X_{nj})^{\top}$ , we have

		$\displaystyle\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\Bigg{]}$		(D.28)
		$\displaystyle=\mathbb{E}\Bigg{[}\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\Bigg{\|}X_{1j},X_{2j},\dots,X_{nj}\Bigg{]}\Bigg{]}$
		$\displaystyle=\mathbb{E}\Bigg{[}\prod_{i=1}^{n}\mathbb{E}\Big{[}\exp\Big{(}\frac{\lambda}{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Big{)}\Big{\|}X_{1j},X_{2j},\dots,X_{nj}\Big{]}\Bigg{]},$

where we used sample independence in the second equality. Finally, applying Hoeffding’s Lemma along with the fact that $\|g\|_{\infty}\leq B$ , we have that (D.28) is bounded above by

\displaystyle\mathbb{E}\Bigg{[}\prod_{i=1}^{n}\exp\bigg{(}\frac{\lambda^{2}s_{i}^{2}B^{2}}{2n^{2}}\bigg{)}\Bigg{]}\leq\exp\bigg{(}\frac{\lambda^{2}B^{2}}{2n}\bigg{)}.

Having bounded the moment generating function, we can now apply Markov’s inequality to see that

	$\displaystyle\mathbb{P}\Bigg{(}\frac{1}{n}\sum_{i=1}^{n}s_{i}(g(X)-f_{j}(X_{j})+\varepsilon_{i})\leq-\gamma\Bigg{)}$
	$\displaystyle=\mathbb{P}\Bigg{(}\exp\Bigg{(}-\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(X)-f_{j}(X_{j})+\varepsilon_{i})\Bigg{)}\geq\exp(\lambda\gamma)\Bigg{)}$
	$\displaystyle\leq\exp\Bigg{(}\frac{\lambda^{2}(B^{2}+\sigma^{2})}{2n}-\gamma\lambda\Bigg{)}$
	$\displaystyle\leq\exp\Bigg{(}-\frac{n\gamma^{2}}{2(B^{2}+\sigma^{2})}\Bigg{)},$

where the last inequality follows by maximizing over $\lambda$ . Choosing $\gamma=(\tau/2)\sqrt{\mathrm{Var}(f_{j}(X_{j}))}$ , we have by a union bound that,

	$\displaystyle\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}$	$\displaystyle=\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},f_{j}(X_{j})\Bigg{)}$
		$\displaystyle\qquad+\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}$
		$\displaystyle\geq\tau\sqrt{\mathrm{Var}(f_{j}(X_{j}))}-(\tau/2)\sqrt{\mathrm{Var}(f_{j}(X_{j}))}$
		$\displaystyle=(\tau/2)\sqrt{\mathrm{Var}(f_{j}(X_{j}))},$

with probability at least $1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)}-\exp\Big{(}-\frac{n\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{8(B^{2}+\sigma^{2})}\Big{)}-2n\exp(-na/12)$ . ∎

With this setup, we are now ready to finish the proofs of Proposition 1 and 2.

Proof of Proposition 1.

Recalling our concentration bound (D.23) along with Lemma D.5, it follows that with probability at least $1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)}-\exp\Big{(}-\frac{n\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{8(B^{2}+\sigma^{2})}\Big{)}-4n\exp\Big{(}-\frac{n}{12}\Big{(}\frac{\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{4(2K_{0}+K_{1})^{2}}\Big{)}^{(2d+5)/(10d)}\Big{)}$ that

$\displaystyle\widehat{\Delta}(X_{j},Y)$	$\displaystyle\geq\frac{1}{2M/a+\log(2n)+1}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}$
	$\displaystyle\geq\frac{\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{8a^{-(2d+10)/(2d+5)}+4(\log(2n)+1)}$
	$\displaystyle\geq\frac{\tau^{2}\mathrm{Var}(f_{j}(X_{j}))a^{(2d+10)/(2d+5)}}{8+4(\log(2n)+1)a^{(2d+10)/(2d+5)}}$
	$\displaystyle\geq\frac{C_{2}(\mathrm{Var}(f_{j}(X_{j})))^{(6d+5)/5d}}{\log(n)},$	(D.29)

where we used our choice of $M$ and $a$ in (D.24) and $\tau$ in (D.26) and $C_{2}$ is some constant which only depends on $L$ , $B$ , $r$ , and $\alpha$ . Notice in the last inequality (D.29) we bound $a$ in the denominator with $\mathrm{Var}(f_{j}(X_{j}))\leq B^{2}$ .

To conclude the proof, we simplify our probability bound. To this end, notice that

\displaystyle(\mathrm{Var}(f_{j}(X_{j})))^{(2d+5)/(10d)}=\frac{\mathrm{Var}(f_{j}(X_{j}))}{(\mathrm{Var}(f_{j}(X_{j})))^{(8d-5)/(10d)}}\geq\frac{\mathrm{Var}(f_{j}(X_{j}))}{B^{(8d-5)/(5d)}},

where the last inequality holds by assumption that $d\geq 5/8$ and the fact that $\mathrm{Var}(f_{j}(X_{j}))\leq B^{2}$ . Therefore we have $1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)}-\exp\Big{(}-\frac{n\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{8(B^{2}+\sigma^{2})}\Big{)}-4n\exp\Big{(}-\frac{n}{12}\Big{(}\frac{\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{4(K_{0}+K_{1})^{2}}\Big{)}^{(2d+5)/(10d)}\Big{)}\geq 1-(4n+2)\exp(-nC_{1}\mathrm{Var}(f_{j}(X_{j})))$ for some constant $C_{1}$ which depends only on $L$ , $B$ , $\sigma$ , $r$ , and $\alpha$ . ∎

Proof of Proposition 2.

The main difference with Proposition 1 is that when $f_{j}(\cdot)$ is monotone we now have $M=0$ . We no longer need the approximations $P_{M}(\cdot)$ or $\widetilde{f}_{j}(\cdot)$ in Lemma D.4, $a$ in (D.24) and $\tau$ in (D.26), or Lemma D.2. We will only need Lemma 1 and a version of Lemma D.5. Instead of applying Lemma 1 with an approximation to the marginal projection, we can choose $h_{j}(\cdot)$ to equal $f_{j}(\cdot)$ directly to see that

\widehat{\Delta}(X_{j},Y)\geq\frac{\widehat{\mathrm{Cov}}^{2}\bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y\bigg{)}}{\log(2n)+1}.

(D.30)

Next, we have

		$\displaystyle\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y\Bigg{)}$		(D.31)
		$\displaystyle=\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},f_{j}(X_{j})\Bigg{)}+\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}$
		$\displaystyle=\underbrace{\vphantom{\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}}\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}}_{\text{(IV)}}+\underbrace{\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}}_{\text{(V)}}.$

Now, we follow the same steps as the proof of Lemma D.5 to lower bound (D.31). For (IV), again use Lemma D.3 with $U=f_{j}(X_{j})\in[-B,B]$ but choose instead $\gamma=\mathrm{Var}(f_{j}(X_{j}))/2$ to get

	$\displaystyle\mathbb{P}\Big{(}\text{(IV)}\geq\frac{\sqrt{\mathrm{Var}(f_{j}(X_{j}))}}{2}\Big{)}$	$\displaystyle\geq\mathbb{P}\Big{(}\widehat{\mathrm{Var}}(f_{j}(X_{j}))\geq\frac{n-1}{2n}\mathrm{Var}(f_{j}(X_{j}))\Big{)}$		(D.32)
		$\displaystyle\geq 1-\exp\Big{(}-\frac{(n-1)\mathrm{Var}(f_{j}(X_{j}))}{32B^{2}}\Big{)}.$		(D.32)

For (V), we can follow the same steps as the second half of the proof of Lemma D.5 to see that

\mathbb{P}(\text{(V)}\leq-\gamma)\leq\exp\Big{(}-\frac{n\gamma^{2}}{2(B^{2}+\sigma^{2})}\Big{)}.

However, for (V), we instead choose $\gamma=\sqrt{\mathrm{Var}(f_{j}(X_{j}))}/4$ to get

\displaystyle\mathbb{P}\Bigg{(}\text{(V)}\geq-\frac{\sqrt{\mathrm{Var}(f_{j}(X_{j}))}}{4}\Bigg{)}\geq 1-\exp\Bigg{(}-\frac{n\mathrm{Var}(f_{j}(X_{j}))}{32(B^{2}+\sigma^{2})}\Bigg{)}.

(D.33)

Now using a union bound and substituting the events in (D.33) and (D.32) into (D.31), we see that with probability at least $1-\exp\Big{(}-\frac{(n-1)\mathrm{Var}(f_{j}(X_{j}))}{32B^{2}}\Big{)}-\exp\Big{(}-\frac{n\mathrm{Var}(f_{j}(X_{j}))}{32(B^{2}+\sigma^{2})}\Big{)}\geq 1-2\exp\Big{(}-\frac{(n-1)\mathrm{Var}(f_{j}(X_{j}))}{32(B^{2}+\sigma^{2})}\Big{)}$ , we have that

\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y\Bigg{)}\geq\frac{\sqrt{\mathrm{Var}(f_{j}(X_{j}))}}{4}.

(D.34)

Therefore, substituting (D.34) into (D.30), we have that with probability at least $1-2\exp\Big{(}-\frac{(n-1)\mathrm{Var}(f_{j}(X_{j}))}{32(B^{2}+\sigma^{2})}\Big{)}$ that

\widehat{\Delta}(X_{j},Y)\geq\frac{\widehat{\mathrm{Cov}}^{2}\bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\bigg{)}}{\log(2n)+1}\geq\frac{\mathrm{Var}(f_{j}(X_{j}))}{16(\log(2n)+1)}.\qed

E Proof of Theorems 2 and 3

In this section we use Proposition 1 and Proposition 2 along with Lemma 3 to complete the proofs of Theorem 2 and Theorem 3.

Proof of Theorem 2.

The high-level idea is to show that the upper and lower bounds on the impurity reductions for irrelevant and relevant variables from Lemma 3 and Proposition 1, respectively, are well-separated.

By Proposition 1 for all variables $j\in{\mathcal{S}}$ and a union bound, we see that with probability at least $1-s(4n+2)\exp(-C_{1}nv)$ , we have

\widehat{\Delta}(X_{j},Y)\geq\frac{C_{2}v^{6/5+1/d}}{\log(n)}\qquad\forall\;j\in{\mathcal{S}}.

(E.35)

By Lemma 3 and applying a union bound over all $p-s$ variables in ${\mathcal{S}}^{c}$ , we have that with probability at least $1-4n(p-s)\exp(-n\xi^{2}/(12(B^{2}+\sigma^{2})))$ that

\widehat{\Delta}(X_{j},Y)\leq\xi^{2}\qquad\forall\;j\in{\mathcal{S}}^{c}.

(E.36)

Recall that if we know the size $s$ of the support ${\mathcal{S}}$ , then $\mathcal{\widehat{S}}$ consists of the top $s$ impurity reductions. Note that choosing $\xi^{2}=\frac{C_{2}v^{6/5+1/d}}{2\log(n)}$ in (E.36) will give us a high probability upper bound on $\widehat{\Delta}(X_{j},Y)$ for irrelevant variables which is dominated by the lower bound on $\widehat{\Delta}(X_{j},Y)$ for relevant variables in (E.35). Thus, by a union bound, it follows that with probability at least $1-s(4n+2)\exp(-C_{1}nv)-4n(p-s)\exp\Big{(}-\frac{nC_{2}v^{6/5+1/d}}{24\log(n)(B^{2}+\sigma^{2})}\Big{)}$ , we have $\mathcal{\widehat{S}}=\mathcal{S}$ . ∎

Proof of Theorem 3.

The proof is similar to that of Theorem 2 except that we use Proposition 2 in place of Proposition 1.

By Proposition 2 for all variables $j\in{\mathcal{S}}$ and a union bound, we see that with probability at least $1-2s\exp\Big{(}-\frac{(n-1)v}{32(B^{2}+\sigma^{2})}\Big{)}$ ,

\widehat{\Delta}(X_{j},Y)\geq\frac{v}{4(1+\log(2n))}\qquad\forall\;j\in{\mathcal{S}}.

Again, we have by Lemma 3 and a union bound over all $p-s$ variables in ${\mathcal{S}}^{c}$ that with probability at least $1-4n(p-s)\exp(-n\xi^{2}/(12(B^{2}+\sigma^{2})))$ that

\widehat{\Delta}(X_{j},Y)\leq\xi^{2}\qquad\forall\;j\in{\mathcal{S}}^{c}.

(E.37)

Choosing $\xi^{2}=\frac{v}{8(1+\log(2n))}$ in (E.37) and using a union bound, it follows that with probability at least $1-2s\exp\Big{(}-\frac{(n-1)v}{32(B^{2}+\sigma^{2})}\Big{)}-4n(p-s)\exp\Big{(}-\frac{nv}{96(1+\log(2n))(B^{2}+\sigma^{2})}\Big{)}$ , we have $\mathcal{\widehat{S}}=\mathcal{S}$ . ∎

References

[1] D. Angluin and L.G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18:155–193, 1979.
[2] Anestis Antoniadis and Jianqing Fan. Regularization of wavelet approximations. Journal of the American Statistical Association, 96(455):939–955, 2001.
[3] Tharmaratnam K. Bergersen, L. C. and I. K. Glad. Monotone splines lasso. Computational Statistics and Data Analysis, 77:336–351, 2014.
[4] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[5] Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199–231, 2001.
[6] Leo Breiman, Jerome Friedman, RA Olshen, and Charles J Stone. Classification and regression trees. Chapman and Hall/CRC, 1984.
[7] W. Buckinx, G. Verstraeten, and D. Van den Poel. Predicting customer loyalty using the internal transactional database. Expert Systems with Applications, 32:125–134, 2007.
[8] Alexandre Bureau, Josée Dupuis, Kathleen Falls, Kathryn L Lunetta, Brooke Hayward, Tim P Keith, and Paul Van Eerdewegh. Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology, 28:171–82, 2005.
[9] Ramón Díaz-Uriarte and Sara Alvarez de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics volume, 7(3):171–82, 2006.
[10] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32:407–499, 2004.
[11] Jianqing Fan, Yang Feng, and Rui Song. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494), 2011.
[12] Jianqing Fan and Jinchi Lv. Sure independence screening for ultra-high dimensional feature space. Journal of the Royal Statistical Society, Series B, 70:849–911, 2008.
[13] Jianqing Fan and Rui Song. Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38(6):3567–3604, 2010.
[14] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
[15] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
[16] Peter Hall and Hugh Miller. Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18(3):533–550, 2009.
[17] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer New York, 2009.
[18] Harold Hotelling. New light on the correlation coefficient and its transforms. Journal of the Royal Statistical Society. Series B, 15(2):193–232, 1953.
[19] Jian Huang, Joel L Horowitz, and Fengrong Wei. Variable selection in nonparametric additive models. The Annals of Statistics, 38(4):2282–2313, 2010.
[20] Vân Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts. Inferring regulatory networks from expression data using tree-based methods. PLOS ONE, 2010.
[21] Wayne Iba and Pat Langley. Induction of one-level decision trees. Machine Learning Proceedings, pages 233–240, 1992.
[22] Dunham Jackson. Bernstein’s theorem and trigonometric approximation. Transactions of the American Mathematical Society, 40(2):225–251, 1936.
[23] Dunham Jackson. The Theory of Approximation, volume 11. American Mathematical Society, 1991.
[24] Jalil Kazemitabar, Arash Amini, Adam Bloniarz, and Ameet S Talwalkar. Variable importance using decision trees. In Advances in Neural Information Processing Systems, pages 426–435, 2017.
[25] L.C. Keely and C.M. Tan. Understanding preferences for income redistribution. Journal of Public Economics, 92:944–961, 2008.
[26] Jason M. Klusowski. Sparse learning with CART. In Advances in Neural Information Processing Systems, 2020.
[27] B. Lariviere and D. Van den Poel. Predicting customer retention and profitability by using random forests and regression forests techniques. Expert Systems with Applications, 29:472–484, 2005.
[28] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
[29] Xiao Li, Yu Wang, Sumanta Basu, Karl Kumbier, and Bin Yu. A debiased MDI feature importance measure for random forests. In Advances in Neural Information Processing Systems 32, pages 8049–8059. Curran Associates, Inc., 2019.
[30] Gilles Louppe. Understanding random forests: From theory to practice. arXiv preprint: arXiv:1407.7502, 2014.
[31] Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts. Understanding variable importances in forests of randomized trees. In Advances in Neural Information Processing Systems, pages 431–439, 2013.
[32] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint: arXiv:1802.03888, 2018.
[33] Kathryn L. Lunetta, L. Brooke Hayward, Jonathan Segal, and Paul Van Eerdewegh. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics, 5(32):199–231, 2004.
[34] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009.
[35] Pradeep Ravikumar, Han Liu, John Lafferty, and Larry Wasserman. Spam: Sparse additive models. Journal of the Royal Statistical Society. Series B, 71(5):1009–1030, 2009.
[36] Erwan Scornet. Trees, forests, and impurity-based variable importance. arXiv preprint arXiv:2001.04295, 2020.
[37] Erwan Scornet, Gérard Biau, and Jean-Philippe Vert. Consistency of random forests. Annals of Statistics, 43(4):1716–1741, 2015.
[38] Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:563–564, 2007.
[39] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58:267–288, 1996.
[40] D. Van den Poel W. Buckinx. Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual fmcg retail setting. European Journal of Operational Research, 164:252–268, 2005.
[41] Martin J Wainwright. Information-theretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Transactions on Information Theory, 55(12):5728–5741, 2009.
[42] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$ -constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
[43] Huazhen Wang, Fan Yang, and Zhiyuan Luo. An experimental study of the intrinsic stability of random forest variable importance measures. BMC Bioinformatics, 17(60), 2016.
[44] Pengfei Wei, Zhenzhou Lu, and Jingwen Song. Variable importance analysis: A comprehensive review. Reliability Engineering and System Safety, 142:399–432, 2015.
[45] JG Wendel. Note on the gamma function. The American Mathematical Monthly, 55(9):563–564, 1948.
[46] J. Zhang and M. Zulkernine. A hybrid network intrusion detection technique using random forests. Proceedings of the First International Conference on Availability, Reliability and Security, pages 262–269, 2006.
[47] J. Zhang, M. Zulkernine, and A. Haque. A hybrid network intrusion detection technique using random forests. IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, 38(5):649–659, 2008.

	$\displaystyle\big{\|}\mathrm{Cov}_{\mathbb{P}}(U,V)-\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{\|}$		(D.15)
	$\displaystyle=\big{\|}\mathrm{Cov}_{\mathbb{P}}(U,V)-\mathrm{Var}_{\mathbb{P}}(U)+\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\big{(}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{)}\big{\|}$
	$\displaystyle\leq\big{\|}\mathrm{Cov}_{\mathbb{P}}(U,V)-\mathrm{Var}_{\mathbb{P}}(U)\big{\|}+\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\times\big{\|}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{\|}$
	$\displaystyle\leq 2\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\times\big{\|}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{\|}$		(D.16)
	$\displaystyle\leq 2\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathbb{E}_{\mathbb{P}}{[(U-V)^{2}]}},$