
Nonparametric Variable Screening with Optimal Decision Stumps

Jason M. Klusowski [email protected]
Supported in part by NSF DMS-1915932 and NSF TRIPODS DATA-INSPIRE CCF-1934924.
Peter M. Tian [email protected]
Supported in part by the Gordon Wu Fellowship of Princeton University.
Abstract

Decision trees and their ensembles are endowed with a rich set of diagnostic tools for ranking and screening variables in a predictive model. Despite the widespread use of tree based variable importance measures, pinning down their theoretical properties has been challenging, and they therefore remain largely unexplored. To address this gap between theory and practice, we derive finite sample performance guarantees for variable selection in nonparametric models using a single-level CART decision tree (a decision stump). Under standard operating assumptions in the variable screening literature, we find that the marginal signal strength of each variable and ambient dimensionality can be considerably weaker and higher, respectively, than state-of-the-art nonparametric variable selection methods allow. Furthermore, unlike previous marginal screening methods that attempt to directly estimate each marginal projection via a truncated basis expansion, the fitted model used here is a simple, parsimonious decision stump, thereby eliminating the need for tuning the number of basis terms. Thus, surprisingly, even though decision stumps are highly inaccurate for estimation purposes, they can still be used to perform consistent model selection.

1 Introduction

A common task in many applied disciplines involves determining which variables, among many, are most important in a predictive model. In high-dimensional sparse models, many of these predictor variables may be irrelevant in how they affect the response variable. As a result, variable selection techniques are crucial for filtering out irrelevant variables in order to prevent overfitting, improve accuracy, and enhance the interpretability of the model. Indeed, algorithms that screen for relevant variables have been instrumental in the modern development of fields such as genomics, biomedical imaging, signal processing, image analysis, and finance, where high-dimensional, sparse data is frequently encountered [12].

Over the years, numerous parametric and nonparametric methods for variable selection in high-dimensional models have been proposed and studied. For linear models, the LARS algorithm [10] (for Lasso [39]) and Sure Independence Screening (SIS) [12] serve as prototypical examples that have achieved immense success, both practically and theoretically. Other strategies for nonparametric additive models such as Nonparametric Independence Screening (NIS) [11] and Sparse Additive Models (SPAM) [35] have also enjoyed a similar history of success.

1.1 Tree based variable selection procedures

Alternatively, because they are built from highly interpretable and simple objects, decision tree models are another important tool in the data analyst’s repertoire. Indeed, after only a brief explanation, one is able to understand the tree construction and its output in terms of meaningful domain specific attributes of the variables. In addition to being interpretable, tree based models have good computational scalability as the number of data points grows, making them faster than many other methods when dealing with large datasets. In terms of flexibility, they can naturally handle a mixture of numeric variables, categorical variables, and missing values. Lastly, they require less preprocessing (because they are invariant to monotone transformations of the inputs), are quite robust to outliers, and are relatively unaffected by the inclusion of many irrelevant variables [17, 26], the last point being of relevance to the variable selection problem.

Conventional tree structured models such as CART [6], random forests [4], ExtraTrees [15], and gradient tree boosting [14] are also equipped with heuristic variable importance measures that can be used to rank and identify relevant predictor variables for further investigation (such as plotting their partial dependence functions). In fact, tree based variable importance measures have been used to discover genes in bioinformatics [5, 33, 8, 9, 20], identify loyal customers and clients [7, 40, 27], detect network intrusion [46, 47], and understand income redistribution preferences [25], to name a few applications.

1.2 Mean decrease in impurity

An attractive feature specific to CART methodology is that one can compute, essentially for free, measures of variable importance (or influence) using the optimal splitting variables and their corresponding impurities. The canonical CART-based variable importance measure is the Mean Decrease in Impurity (MDI) [14, Section 8.1], [6, Section 5.3.4], [17, Sections 10.13.1 & 15.3.2], which calculates an importance score for a variable by summing the largest impurity reductions (weighted by the fraction of samples in the node) over all non-terminal nodes split with that variable, averaged over all trees in the ensemble. Another commonly used and sometimes more accurate measure is the Mean Decrease in Accuracy (MDA) or permutation importance measure, defined as the average difference in out-of-bag errors before and after randomly permuting the data values of a variable in out-of-bag samples over all trees in the ensemble [4]. However, from a computational perspective, MDI is preferable to MDA since it can be computed as each tree is grown with no additional cost. While we view the analysis of MDA as an important endeavor, it is outside the scope of the present paper and therefore not our current focus. In contrast to the aforementioned variable selection procedures like SIS, NIS, Lasso, and SPAM, except for a few papers, little is known about the finite sample performance of MDI. Theoretical results are mainly limited to the asymptotic, fixed dimensional data setting and to showing what would be expected from a reasonable measure of variable importance. For example, it is shown in [31] that the MDI importance of a categorical variable (in an asymptotic data and ensemble size setting) is zero precisely when the variable is irrelevant, and that the MDI importance for relevant variables remains unchanged if irrelevant variables are removed or added.

Recent complementary work by [36] established the large sample properties of MDI for additive models by showing that it converges to sensible quantities like the variance of the component functions, provided the decision tree is sufficiently deep. However, these results are not fine-grained enough to handle the case where either the dimensionality grows or the marginal signals decay with the sample size. Furthermore, [36] crucially relies on the assumption that the CART decision tree is a consistent predictor, which is currently only known for additive models [37]. It is therefore unclear whether the techniques can be generalized to models beyond those considered therein. On the other hand, our results suggest that tree based variable importance measures can still have good variable selection properties even though the underlying tree model may be a poor predictor of the data generating process—which can occur with CART and random forests.

Lastly, we mention that important steps have also been taken to characterize the finite sample properties of MDI; [29] show that MDI is less biased for irrelevant variables when each tree is shallow. This work therefore covers one facet of the variable selection problem, i.e., controlling the number of false positives, and will be employed in our proofs.

1.3 New contributions

The lack of theoretical development for tree based variable selection is likely because the training mechanism involves complex steps —which present new theoretical challenges— such as bagging, boosting, pruning, random selections of the predictor variables for candidate splits, recursive splitting, and line search to find the best split points [24]. The last consideration, importantly, means that the underlying tree construction (e.g., split points) depends on both the input and output data, which enables it to adapt to structural properties of the underlying statistical model (such as sparsity). This data adaptivity is a double-edged sword from a theoretical standpoint, though, since unravelling the data dependence is a formidable task.

Despite the aforementioned challenges, we advance the study of tree based variable selection by focusing on the following two fundamental questions:

  • Does MDI enjoy finite sample guarantees for its variable ranking?

  • What are the benefits of MDI over other variable screening methods?

Specifically, we derive rigorous finite sample guarantees for what we call the SDI importance measure, a sobriquet for Single-level Decrease in Impurity, which is a special case of MDI for a single-level CART decision tree or “decision stump” [21]. This is similar in spirit to the approach of DSTUMP [24] but, importantly, SDI incorporates the line search step by finding the optimal split point, instead of the empirical median, of every predictor variable. As we shall see, ranking variables according to their SDI is equivalent to ranking the variables according to the marginal sample correlations between the response data and the optimal decision stump with respect to those variables. This equivalence also yields connections with other variable selection methods: for linear models with Gaussian variates, we show that SDI is asymptotically equivalent to SIS (up to logarithmic factors in the sample size), and so SDI inherits the so-called sure screening property [12] under suitable assumptions.

Unlike SIS, however, SDI is accompanied by provable guarantees for nonparametric models. We show that under certain conditions, SDI achieves model selection consistency; that is, it correctly selects the relevant variables of the model with probability approaching one as the sample size increases. In fact, the minimum signal strength of each relevant variable and maximum dimensionality of the model are shown to be less restrictive for SDI than NIS or SPAM. In the linear model case with Gaussian variates, SDI is shown to nearly match the optimal sample size threshold (achieved by Lasso) for exact support recovery. These favorable properties are particularly striking when one is reminded that the underlying model fit to the data is a simple, parsimonious decision stump—in particular, there is no need to specify a flexible function class (such as a polynomial spline family) and be concerned with calibrating the number of basis terms or bandwidth parameters.

Finally, we empirically compare SDI to other contemporaneous variable selection algorithms, namely, SIS, NIS, Lasso, and SPAM. We find that SDI is competitive and performs favorably in cases where the model is less smooth. Furthermore, we empirically verify the similarity between SDI and SIS for the linear model case, and confirm the model selection consistency properties of SDI for various types of nonparametric models.

Let us close this section by saying a few words about the primary goals of this paper. In practice, tree based variable importance measures such as MDI and MDA are most commonly used to rank variables, in order to determine which ones are worthy of further examination. This is a less ambitious endeavor than that sought by the aforementioned variable selection methods, which aim for consistent model selection, namely, determining exactly which variables are relevant and irrelevant. Though we do study the model selection problem, our priority is to demonstrate the power of tree based variable importance measures as interpretable, accurate, and efficient variable ranking tools.

1.4 Organization

The paper is organized according to the following schema. First, in Section 2, we formally describe our learning setting and problem, discuss prior art, and introduce the SDI algorithm. We outline some key lemmas and ideas used in the proofs in Section 3. In the next two sections, we establish our main results: in Section 4, we establish that for linear models, SDI is asymptotically equivalent to SIS (up to logarithmic factors in the sample size) and in Section 5, we establish the variable ranking and model selection consistency properties of SDI for more general nonparametric models. Lastly, we compare the performance of SDI to other well-known variable selection techniques from a theoretical perspective in Section 6 and from an empirical perspective in Section 7. We conclude with a few remarks in Section 8. Due to space constraints, we include the full proofs of our main results in Sections 4 and 5 in the appendix.

2 Setup and algorithm

In this section we introduce notation, formalize the learning setting, and give an explicit layout of our SDI algorithm. At the end of the section, we discuss its complexity and provide several interpretations.

2.1 Notation

For labeled data {(U1,V1),…,(Un,Vn)}\{(U_{1},V_{1}),\dots,(U_{n},V_{n})\} drawn from a population distribution (U,V)(U,V), we let Cov^(U,V)=1ni=1n(UiU¯)(ViV¯)\widehat{\mathrm{Cov}}(U,V)=\frac{1}{n}\sum_{i=1}^{n}(U_{i}-\overline{U})(V_{i}-\overline{V}), Var^(U)=1ni=1n(UiU¯)2\widehat{\mathrm{Var}}(U)=\frac{1}{n}\sum_{i=1}^{n}(U_{i}-\overline{U})^{2}, U¯=1ni=1nUi\overline{U}=\frac{1}{n}\sum_{i=1}^{n}U_{i}, and ρ^(U,V)=Cov^(U,V)Var^(U)Var^(V)\widehat{\rho}\,(U,V)=\frac{\widehat{\mathrm{Cov}}(U,V)}{\sqrt{\widehat{\mathrm{Var}}(U)\widehat{\mathrm{Var}}(V)}} denote the sample covariance, variance, mean, and Pearson product-moment correlation coefficient, respectively. The population level covariance, variance, and correlation are denoted by Cov(U,V)\mathrm{Cov}(U,V), Var(U)=σU2\mathrm{Var}(U)=\sigma_{U}^{2}, and ρ(U,V)\rho(U,V), respectively.

2.2 Learning setting

Throughout this paper, we operate under a standard regression framework where the statistical model is Y=g(𝐗)+εY=g(\mathbf{X})+\varepsilon, the vector of predictor variables is 𝐗=(X1,,Xp)\mathbf{X}=(X_{1},\ldots,X_{p})^{\top}, and ε\varepsilon is statistical noise. While our results are valid for general nonparametric models, for conceptual simplicity, the canonical model class we have in mind is additive models, i.e.,

g(X1,,Xp)=g1(X1)++gp(Xp)g(X_{1},\ldots,X_{p})=g_{1}(X_{1})+\cdots+g_{p}(X_{p}) (1)

for some univariate component functions g1(),,gp()g_{1}(\cdot),\ldots,g_{p}(\cdot). As is standard with additive modeling [17, Section 9.1.1], for identifiability of the components, we assume that the gj(Xj)g_{j}(X_{j}) have population mean zero for all jj. This model class strikes a balance between flexibility and learnability—it is more flexible than linear models, but, by giving up on modeling interaction terms, it does not suffer from the curse of dimensionality.

We observe data 𝒟={(𝐗1,Y1),,(𝐗n,Yn)}{\mathcal{D}}=\{(\mathbf{X}_{1},Y_{1}),\dots,(\mathbf{X}_{n},Y_{n})\} with the ithi^{\text{th}} sample point (𝐗i,Yi)=(Xi1,,Xip,Yi)(\mathbf{X}_{i},Y_{i})=(X_{i1},\ldots,X_{ip},Y_{i}) drawn independently from the model above. Note that with this notation, XijX_{ij} are i.i.d. instances of the random variable XjX_{j}. We assume that the regression function g()g(\cdot) depends only on a small subset of the variables {Xj}j𝒮\{X_{j}\}_{j\in{\mathcal{S}}}, which we call relevant variables with support 𝒮{1,,p}{\mathcal{S}}\subset\{1,\ldots,p\} and sparsity level s=|𝒮|ps=|{\mathcal{S}}|\ll p. Equivalently, gj()g_{j}(\cdot) is identically zero for the irrelevant variables {Xj}j𝒮c\{X_{j}\}_{j\in{\mathcal{S}}^{c}}. In this paper, we consider the variable ranking problem, defined here as ranking the variables so that the top ss coincide with 𝒮{\mathcal{S}} with high probability. As a corollary, this will enable us to solve the variable selection problem, namely, determining the subset 𝒮{\mathcal{S}}. We pay special attention to the high-dimensional regime where pnp\gg n. In fact, in Section 5.3 we will provide conditions under which consistent variable selection occurs even when p=exp(o(n))p=\exp(o(n)).

2.3 Prior art

The conventional approach to marginal screening for nonparametric additive models is to directly estimate either the nonparametric components gj(Xj)g_{j}(X_{j}) or the marginal projections

fj(Xj)𝔼[Y|Xj],f_{j}(X_{j})\coloneqq\mathbb{E}[Y|X_{j}],

with the ultimate goal of studying their variances or their correlations with the response variable. (Note that fj(Xj)f_{j}(X_{j}) need not be the same as gj(Xj)g_{j}(X_{j}) unless, for instance, the predictor variables are independent and the noise is independent and mean zero.) To accomplish this, SIS, NIS, and [16] rank the variables according to the correlations between the response values and least squares fits over a univariate model class {\mathcal{H}}, i.e.,

ρ^(h^(Xj),Y),whereh^()argminh()1ni=1n(Yih(Xij))2.\widehat{\rho}\,(\hat{h}(X_{j}),Y),\quad\text{where}\quad\hat{h}(\cdot)\in\operatorname*{arg\,min}_{h(\cdot)\in{\mathcal{H}}}\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-h(X_{ij}))^{2}. (2)

The model class {\mathcal{H}} is chosen to make the above optimization tractable, while at the same time being sufficiently rich to approximate fj(Xj)f_{j}(X_{j}). For example, if {\mathcal{H}} is the space of polynomial splines of a fixed degree, then h^()\hat{h}(\cdot) in (2) can be computed efficiently via a truncated B-spline basis expansion

β1Ψ1(Xj)+β2Ψ2(Xj)++βdnΨdn(Xj),\beta_{1}\Psi_{1}(X_{j})+\beta_{2}\Psi_{2}(X_{j})+\cdots+\beta_{d_{n}}\Psi_{d_{n}}(X_{j}), (3)

as is done with NIS. Similarly, SIS takes {\mathcal{H}} to be the family of linear functions in a single variable. Complementary methods that aim to directly estimate each gj(Xj)g_{j}(X_{j}) include SPAM, which uses a smooth back-fitting algorithm with soft-thresholding, and [19], which combines adaptive group Lasso with truncated B-spline basis expansions.

As we shall see, SDI is equivalent to ranking the variables according to (2) when {\mathcal{H}} consists of the collection of all decision stumps in XjX_{j} of the form

β1 1(Xjz)+β2 1(Xj>z),z,β1,β2.\beta_{1}\,\mathbf{1}(X_{j}\leq z)+\beta_{2}\,\mathbf{1}(X_{j}>z),\qquad z,\beta_{1},\beta_{2}\in\mathbb{R}. (4)

Unlike the previous expressive models such as polynomial splines (3), a one-level decision tree, realized by the model (4) above, severely underfits the data and would therefore be ill-advised for estimating fj(Xj)f_{j}(X_{j}), if that were the goal. Remarkably, we show that this rigidity does not hinder SDI for variable selection. What redeems SDI is that, unlike the aforementioned methods that are based on linear estimators, decision stumps (4) are nonlinear since the split points can depend on the response data. These model nonlinearities equip SDI with the ability to discover nonlinear patterns in the data, despite its poor approximation capabilities.

Finally, we mention that one benefit of such a simple model as (4) is that it is completely free of tuning parameters. In contrast, other methods such as the ones listed here require careful calibration of, for example, variable bandwidth smoothers or the number of terms in the basis expansions (e.g., SPAM and NIS).

2.4 The SDI algorithm

In this section, we provide the details for the SDI algorithm. We first provide some high-level intuition.

In order to determine whether, say, XjX_{j} is relevant for predicting YY from 𝐗\mathbf{X}, it is natural to first divide the data into two groups according to whether XjX_{j} is above or below some predetermined cutoff value and then assess how much the variance in YY changes before and after this division. A small change in the variability indicates a weak or nonexistent dependence of YY on XjX_{j}; whereas, a moderate to large change indicates heterogeneity in YY across different values of XjX_{j}. As we now explain, this is precisely what SDI does when the predetermined cutoff value is sought by a least squares fit over all possible ways of dividing the data.

Let zz be a candidate split for a variable XjX_{j} that divides the response data YY into left and right daughter nodes based on the jthj^{\text{th}} variable. Define the mean of the left daughter node to be Y¯L=1NLi:XijzYi\overline{Y}_{L}=\frac{1}{N_{L}}\sum_{i:X_{ij}\leq z}Y_{i} and the mean of the right daughter node to be Y¯R=1NRi:Xij>zYi\overline{Y}_{R}=\frac{1}{N_{R}}\sum_{i:X_{ij}>z}Y_{i} and let the size of the left and right daughter nodes be NL=#{i:Xijz}N_{L}=\#\{i:X_{ij}\leq z\} and NR=#{i:Xij>z}N_{R}=\#\{i:X_{ij}>z\}, respectively. For CART regression trees, the impurity reduction (or variance reduction) in the response variable YY from choosing the split point zz for the jthj^{\text{th}} variable is defined to be

Δ^(z;Xj,Y)=1ni=1n(YiY¯)21ni:Xijz(YiY¯L)21ni:Xij>z(YiY¯R)2.\widehat{\Delta}(z;X_{j},Y)=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}-\frac{1}{n}\sum_{i:X_{ij}\leq z}(Y_{i}-\overline{Y}_{L})^{2}-\frac{1}{n}\sum_{i:X_{ij}>z}(Y_{i}-\overline{Y}_{R})^{2}. (5)

For each variable XjX_{j}, we choose a split point z^j\hat{z}_{j} that maximizes the impurity reduction

z^jargmaxzΔ^(z;Xj,Y),\hat{z}_{j}\in\operatorname*{arg\,max}_{z}\widehat{\Delta}(z;X_{j},Y),

and for convenience, we denote the largest impurity reduction by

Δ^(Xj,Y)Δ^(z^j;Xj,Y).\widehat{\Delta}(X_{j},Y)\coloneqq\widehat{\Delta}(\hat{z}_{j};X_{j},Y).

We then rank the variables commensurate with the sizes of their impurity reductions, i.e., we obtain a ranking (ȷ^1,,ȷ^p)(\hat{\jmath}_{1},\dots,\hat{\jmath}_{p}) where Δ^(Xȷ^1,Y)Δ^(Xȷ^p,Y)\widehat{\Delta}(X_{\hat{\jmath}_{1}},Y)\geq\cdots\geq\widehat{\Delta}(X_{\hat{\jmath}_{p}},Y). If desired, these rankings can be repurposed to perform model selection (e.g., an estimate 𝒮^\widehat{\mathcal{S}} of 𝒮{\mathcal{S}}), as we now explain. If we are given the sparsity level ss in advance, we can choose 𝒮^\widehat{\mathcal{S}} to be the top ss of these ranked variables; otherwise, we must find a data-driven choice of how many variables to include. Equivalently, the latter case is realized by choosing 𝒮^\mathcal{\widehat{S}} to be the indices jj for which Δ^(Xj,Y)γn\widehat{\Delta}(X_{j},Y)\geq\gamma_{n}, where γn\gamma_{n} is a threshold to be described in Section 2.5. This is of course a delicate task as including too many variables may lead to more false positives.

By [6, Section 9.3], using a sum of squares decomposition, we can rewrite the impurity reduction (5) as

Δ^(z;Xj,Y)=NLnNRn(Y¯LY¯R)2,\widehat{\Delta}(z;X_{j},Y)=\frac{N_{L}}{n}\frac{N_{R}}{n}(\overline{Y}_{L}-\overline{Y}_{R})^{2}, (6)

which allows us to compute the largest impurity reductions for all possible split points with a single pass over the data by first ordering the data along XjX_{j} and then updating Y¯L\overline{Y}_{L} and Y¯R\overline{Y}_{R} in an online fashion. This alternative expression for the objective function facilitates its rapid evaluation and exact optimization. Pseudocode for SDI is given in Algorithm 1.
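Before turning to the pseudocode, here is the sum of squares decomposition behind (6): since

\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}=\sum_{i:X_{ij}\leq z}(Y_{i}-\overline{Y}_{L})^{2}+N_{L}(\overline{Y}_{L}-\overline{Y})^{2}+\sum_{i:X_{ij}>z}(Y_{i}-\overline{Y}_{R})^{2}+N_{R}(\overline{Y}_{R}-\overline{Y})^{2},

the definition (5) reduces to \widehat{\Delta}(z;X_{j},Y)=\frac{N_{L}}{n}(\overline{Y}_{L}-\overline{Y})^{2}+\frac{N_{R}}{n}(\overline{Y}_{R}-\overline{Y})^{2}. Substituting n\overline{Y}=N_{L}\overline{Y}_{L}+N_{R}\overline{Y}_{R}, so that \overline{Y}_{L}-\overline{Y}=\frac{N_{R}}{n}(\overline{Y}_{L}-\overline{Y}_{R}) and \overline{Y}_{R}-\overline{Y}=\frac{N_{L}}{n}(\overline{Y}_{R}-\overline{Y}_{L}), and using N_{L}+N_{R}=n then gives (6).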

Input: Dataset 𝒟={(Xi1,,Xip,Yi)}i=1n\mathcal{D}=\{(X_{i1},\ldots,X_{ip},Y_{i})\}_{i=1}^{n}
for j=1,,pj=1,\ldots,p do
       Relabel 𝒟\mathcal{D} with XijX_{ij} sorted in increasing order
       Initialize Y¯L=0\overline{Y}_{L}=0, Y¯R=Y¯\overline{Y}_{R}=\overline{Y}, Δ^(Xj,Y)=0\widehat{\Delta}(X_{j},Y)=0
       for i=1,,n1i=1,\ldots,n-1 do
             Update Y¯Li1iY¯L+Yii,Y¯Rni+1niY¯RYini\overline{Y}_{L}\leftarrow\frac{i-1}{i}\overline{Y}_{L}+\frac{Y_{i}}{i},\quad\overline{Y}_{R}\leftarrow\frac{n-i+1}{n-i}\overline{Y}_{R}-\frac{Y_{i}}{n-i}
             Compute Δ^(Xij;Xj,Y)=in(1in)(Y¯LY¯R)2\widehat{\Delta}(X_{ij};X_{j},Y)=\frac{i}{n}(1-\frac{i}{n})(\overline{Y}_{L}-\overline{Y}_{R})^{2}
             if Δ^(Xij;Xj,Y)>Δ^(Xj,Y)\widehat{\Delta}(X_{ij};X_{j},Y)>\widehat{\Delta}(X_{j},Y) then
                   Update Δ^(Xj,Y)Δ^(Xij;Xj,Y)\widehat{\Delta}(X_{j},Y)\leftarrow\widehat{\Delta}(X_{ij};X_{j},Y)
             end if
            
       end for
      
end for
Output: Ranking (ȷ^1,,ȷ^p)(\hat{\jmath}_{1},\dots,\hat{\jmath}_{p}) such that Δ^(Xȷ^1,Y)Δ^(Xȷ^p,Y)\widehat{\Delta}(X_{\hat{\jmath}_{1}},Y)\geq\cdots\geq\widehat{\Delta}(X_{\hat{\jmath}_{p}},Y)
Algorithm 1 Single-level Decrease in Impurity (SDI)
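To complement the pseudocode, the following is a minimal NumPy sketch of Algorithm 1 (all function and variable names are ours, not taken from the authors' code). It evaluates all n-1 candidate splits of a variable with prefix sums rather than the explicit online update, which has the same O(n log n) cost per variable discussed in Section 2.6.

import numpy as np

def sdi_impurity_reduction(x, y):
    """Largest impurity reduction, max_z Delta_hat(z; X_j, Y), for one variable x."""
    n = len(y)
    y_sorted = y[np.argsort(x)]
    # Candidate splits put the i smallest x-values (i = 1, ..., n-1) in the left node.
    i = np.arange(1, n)
    left_sums = np.cumsum(y_sorted)[:-1]
    ybar_L = left_sums / i
    ybar_R = (y_sorted.sum() - left_sums) / (n - i)
    # Identity (6): Delta_hat = (N_L / n)(N_R / n)(Ybar_L - Ybar_R)^2.
    return ((i / n) * (1 - i / n) * (ybar_L - ybar_R) ** 2).max()

def sdi_ranking(X, y):
    """Rank the p variables by their largest impurity reductions (Algorithm 1)."""
    deltas = np.array([sdi_impurity_reduction(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-deltas), deltas

# Toy example: two relevant variables out of ten in a sparse additive model.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))
y = np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] > 0.5) + 0.5 * rng.standard_normal(500)
ranking, deltas = sdi_ranking(X, y)
print(ranking[:2])  # typically {0, 1}, i.e., the relevant variables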

2.5 Data-driven choices of γn\gamma_{n}

As briefly mentioned in Section 2.4, if we do not know the sparsity level ss in advance, we can instead use a data-driven threshold γn\gamma_{n} to modulate the number of selected variables. Here we propose two data-driven methods to determine the threshold γn\gamma_{n}.

Permutation method

The first thresholding method is similar to the Iterative Nonparametric Independence Screening (INIS) method based on NIS [11, Section 4]. The first step is to choose a random permutation π:{1,,n}{1,,n}\pi:\{1,\dots,n\}\to\{1,\dots,n\} of the data to decouple 𝐗i\mathbf{X}_{i} from YiY_{i} so that the new dataset 𝒟π={(𝐗π(i),Yi)}\mathcal{D}^{\pi}=\{(\mathbf{X}_{\pi(i)},Y_{i})\} follows a null model. Then we choose the threshold γn\gamma_{n} to be the maximum of the impurity reductions Δ^(Xj,Y;𝒟π)\widehat{\Delta}(X_{j},Y;\mathcal{D}^{\pi}) over all jj based on the dataset 𝒟π\mathcal{D}^{\pi}. We can also generate TT different permutations π\pi and take the maximum of Δ^(Xj,Y;𝒟π)\widehat{\Delta}(X_{j},Y;\mathcal{D}^{\pi}) over all such permuted datasets to get a more significant threshold, i.e., γn=maxj,πΔ^(Xj,Y;𝒟π)\gamma_{n}=\max_{j,\,\pi}\widehat{\Delta}(X_{j},Y;\mathcal{D}^{\pi}). With γn\gamma_{n} selected in this way, SDI will then output the variable indices 𝒮^\widehat{\mathcal{S}} consisting of the indices jj for which the original impurity reductions Δ^(Xj,Y;𝒟)\widehat{\Delta}(X_{j},Y;\mathcal{D}) are at least γn\gamma_{n}. Interestingly, this method has parallels to MDA in that we permute the data values of a given variable and calculate the resulting change in the quality of the fit.
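A minimal sketch of the permutation threshold (the number of permutations T, the seed, and the helper names are our choices); it permutes the response against the design, which is equivalent to permuting the rows of the predictor matrix:

import numpy as np

def sdi_impurity_reduction(x, y):
    # Same routine as in the Section 2.4 sketch: largest impurity reduction over all splits of x.
    n = len(y)
    ys = y[np.argsort(x)]
    i = np.arange(1, n)
    left_sums = np.cumsum(ys)[:-1]
    ybar_L, ybar_R = left_sums / i, (ys.sum() - left_sums) / (n - i)
    return ((i / n) * (1 - i / n) * (ybar_L - ybar_R) ** 2).max()

def permutation_threshold(X, y, T=10, seed=0):
    """gamma_n: max impurity reduction over all variables and T decoupled (permuted) datasets."""
    rng = np.random.default_rng(seed)
    gamma = 0.0
    for _ in range(T):
        y_perm = y[rng.permutation(len(y))]   # decouples X_i from Y_i, i.e., a null model
        gamma = max(gamma, max(sdi_impurity_reduction(X[:, j], y_perm) for j in range(X.shape[1])))
    return gamma

def select_by_permutation(X, y, T=10, seed=0):
    gamma_n = permutation_threshold(X, y, T, seed)
    deltas = [sdi_impurity_reduction(X[:, j], y) for j in range(X.shape[1])]
    return [j for j, d in enumerate(deltas) if d >= gamma_n]   # the estimate of S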

Elbow method

One problem with the permutation method is that it performs poorly when there is correlation between relevant and irrelevant variables. This is because, in the decoupled dataset 𝒟π\mathcal{D}^{\pi}, there is essentially no correlation between YY and any variable; whereas in the original dataset, YY may be correlated with some of the irrelevant variables. In this case, the threshold γn\gamma_{n} will be too small and the selected set of variables will include many irrelevant variables (false positives). An alternative is to employ visual inspection via the elbow method, which is typically used to determine the number of clusters in cluster analysis. Here we plot the largest impurity reductions Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y) in decreasing order and search for an “elbow” in the curve. We then let 𝒮^\widehat{\mathcal{S}} consist of the variable indices that come before the elbow. Instead of choosing the cutoff point visually, we can also automate the process by clustering the impurity reductions with, for example, a two-component Gaussian mixture model. Both of these implementations are generally more robust to correlation between relevant and irrelevant variables than the permutation method.
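A sketch of the automated variant, clustering the largest impurity reductions with a two-component Gaussian mixture (the log transform below is our own choice, intended only to make the two groups more separable):

import numpy as np
from sklearn.mixture import GaussianMixture

def select_by_elbow(deltas):
    """Cluster the impurity reductions into a 'signal' and a 'noise' group; return the signal indices."""
    z = np.log(np.asarray(deltas) + 1e-12).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(z)
    labels = gmm.predict(z)
    signal = int(np.argmax(gmm.means_.ravel()))    # component with the larger mean
    return np.flatnonzero(labels == signal)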

We illustrate the empirical performance of both the permutation method and the elbow method in Section 7.

2.6 Computational issues

We now briefly discuss the computational complexity of Algorithm 1—or equivalently—the computational complexity of growing a single-level CART decision tree. For each variable XjX_{j}, we first sort the input data along XjX_{j} with 𝒪(nlogn){\mathcal{O}}(n\log n) operations. We then evaluate the decrease in impurity along nn data points (as done in the nested for-loop of Algorithm 1), and finally find the maximum among these nn values (as done in the nested if-statement of Algorithm 1), all with 𝒪(n){\mathcal{O}}(n) operations. Thus, the total number of calculations for all of the pp variables is 𝒪(pnlog(n)){\mathcal{O}}(pn\log(n)). This is only slightly worse than the complexity of SIS for linear models 𝒪(pn){\mathcal{O}}(pn), comparable to NIS based on the complexity of fitting B-splines, and favorable to that of Lasso or stepwise regression 𝒪(p3+p2n){\mathcal{O}}(p^{3}+p^{2}n), especially when pp is large [10, 17]. While approximate methods like coordinate descent for Lasso can reduce the complexity to 𝒪(pn){\mathcal{O}}(pn) at each iteration, their convergence properties are unclear. Thus, as SPAM is a generalization of Lasso for nonparametric additive models, its implementation (via a functional version of coordinate descent for Lasso) may be similarly expensive.

2.7 Interpretations of SDI

In this section we outline two interpretations of SDI.

Interpretation 1

Our first interpretation of SDI is in terms of the sample correlation between the response and a decision stump. To see this, denote the decision stump that splits XjX_{j} at zz by

Y~(Xj)Y¯L 1(Xjz)+Y¯R 1(Xj>z)\widetilde{Y}(X_{j})\coloneqq\overline{Y}_{L}\,\mathbf{1}(X_{j}\leq z)+\overline{Y}_{R}\,\mathbf{1}(X_{j}>z)

and one at an optimal split value z^j\hat{z}_{j} by

Y^(Xj)Y¯L 1(Xjz^j)+Y¯R 1(Xj>z^j).\widehat{Y}(X_{j})\coloneqq\overline{Y}_{L}\,\mathbf{1}(X_{j}\leq\hat{z}_{j})+\overline{Y}_{R}\,\mathbf{1}(X_{j}>\hat{z}_{j}).

Note that Y^(Xj)\widehat{Y}(X_{j}) equivalently minimizes the marginal sum of squares (2) over the collection of all decision stumps (4). Next, by Lemma A.1 in [26], we have:

Δ^(z,Xj,Y)\displaystyle\widehat{\Delta}(z,X_{j},Y) =Var^(Y)×ρ^ 2(Y~(Xj),Y),and\displaystyle=\widehat{\mathrm{Var}}(Y)\times\widehat{\rho}^{\,2}(\widetilde{Y}(X_{j}),Y),\quad\text{and} (7)
ρ^ 2(Y~(Xj),Y)\displaystyle\widehat{\rho}^{\,2}(\widetilde{Y}(X_{j}),Y) =11ni=1n(YiY~(Xij))21ni=1n(YiY¯)2,\displaystyle=1-\frac{\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\widetilde{Y}(X_{ij}))^{2}}{\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}},

where

ρ^(Y~(Xj),Y)1ni=1n(Y~(Xij)Y¯)(YiY¯)1ni=1n(Y~(Xij)Y¯)2×1ni=1n(YiY¯)20\widehat{\rho}\,(\widetilde{Y}(X_{j}),Y)\coloneqq\frac{\frac{1}{n}\sum_{i=1}^{n}(\widetilde{Y}(X_{ij})-\overline{Y})(Y_{i}-\overline{Y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\widetilde{Y}(X_{ij})-\overline{Y})^{2}\times\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}}}\geq 0

is the Pearson product-moment sample correlation coefficient between the data YY and decision stump Y~(Xj)\widetilde{Y}(X_{j}). In other words, we see from (7) that an optimal split point z^j\hat{z}_{j} is chosen to maximize the Pearson sample correlation between the data YY and decision stump Y~(Xj)\widetilde{Y}(X_{j}). This reveals that SDI is, at its heart, a correlation ranking method, in the same spirit as SIS, NIS, and [16] via (2). In fact, as we shall see in Section 3, SDI is for all intents and purposes also similar to ranking the variables according to the correlation between the response data and the marginal projections fj(Xj)f_{j}(X_{j}).

Like r2r^{2} for linear models, (7) reveals that the squared sample correlation ρ^ 2(Y~(Xj),Y)\widehat{\rho}^{\,2}(\widetilde{Y}(X_{j}),Y) equals the coefficient of determination R2R^{2}, i.e., the fraction of variance in YY explained by a decision stump Y~(Xj)\widetilde{Y}(X_{j}) in XjX_{j}. (However, unlike linear models, for this relationship to be true, the decision stump Y~(Xj)\widetilde{Y}(X_{j}) need not necessarily be a least squares fit, i.e., Y^(Xj)\widehat{Y}(X_{j}).) Thus, SDI is also equivalent to ranking the variables according to the goodness-of-fit for decision stumps of each variable. In fact, the equivalence between ranking with correlations and ranking with squared error goodness-of-fit is a ubiquitous trait among most models [16, Theorem 1].
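A quick numerical check of (7) at the optimal split (so that the stump below is the least squares fit $\widehat{Y}(X_j)$); the data-generating choices are arbitrary:

import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(size=n)
y = np.exp(x) + 0.3 * rng.standard_normal(n)

order = np.argsort(x)
ys = y[order]
i = np.arange(1, n)
left_sums = np.cumsum(ys)[:-1]
ybar_L, ybar_R = left_sums / i, (ys.sum() - left_sums) / (n - i)
deltas = (i / n) * (1 - i / n) * (ybar_L - ybar_R) ** 2
k = deltas.argmax()                                   # index of the optimal split

# Fitted stump values, mapped back to the original (unsorted) observation order.
stump_sorted = np.where(np.arange(n) <= k, ybar_L[k], ybar_R[k])
stump = stump_sorted[np.argsort(order)]

lhs = deltas[k]                                       # Delta_hat(X_j, Y)
rhs = y.var() * np.corrcoef(stump, y)[0, 1] ** 2      # Var_hat(Y) * rho_hat^2
print(np.isclose(lhs, rhs))                           # True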

Interpretation 2

The other interpretation is in terms of the aforementioned MDI importance measure. Recall the definition of MDI from Section 1.2, i.e., for an individual decision tree TT, the MDI for XjX_{j} is the total reduction in impurity attributed to the splitting variable XjX_{j}. More succinctly, the MDI of TT for XjX_{j} equals

tN(t)nΔ^(Xj,Y|𝐗t),\sum_{\mathrm{t}}\frac{N(\mathrm{t})}{n}\widehat{\Delta}(X_{j},Y|\mathbf{X}\in\mathrm{t}), (8)

where the sum extends over all non-terminal nodes t\mathrm{t} in which XjX_{j} was split, N(t)N(\mathrm{t}) is the number of sample points in t\mathrm{t}, and Δ^(Xj,Y|𝐗t)\widehat{\Delta}(X_{j},Y|\mathbf{X}\in\mathrm{t}) is the largest reduction in impurity for samples in t\mathrm{t}. Note that if TT is a decision stump with split along XjX_{j}, then (8) equals Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y), the largest reduction in impurity at the root node. Because a split at the root node captures the main effects of the model, Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y) can be seen as a first order approximation of (8) in which higher order interaction effects are ignored.
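As a cross-check on this interpretation, $\widehat{\Delta}(X_{j},Y)$ can be read off the root of a depth-one scikit-learn regression tree, since its node impurities under the squared-error criterion are within-node variances. The following is a sketch relying on scikit-learn's tree_ internals; the names are ours:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def root_impurity_decrease(x, y):
    """Impurity reduction (5) of the best stump in x, read off a fitted depth-1 CART tree."""
    t = DecisionTreeRegressor(max_depth=1).fit(x.reshape(-1, 1), y).tree_
    if t.node_count == 1:                     # no split was made
        return 0.0
    n = t.weighted_n_node_samples[0]
    left, right = t.children_left[0], t.children_right[0]
    return (t.impurity[0]
            - t.weighted_n_node_samples[left] / n * t.impurity[left]
            - t.weighted_n_node_samples[right] / n * t.impurity[right])

rng = np.random.default_rng(2)
x = rng.uniform(size=300)
y = (x > 0.5).astype(float) + 0.2 * rng.standard_normal(300)
print(root_impurity_decrease(x, y))           # agrees with maximizing (6) directly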

3 Preliminaries

Our first lemma, developed by the first author in recent work, reveals the crucial role that optimization (of a nonlinear model) plays in assessing whether a particular variable is relevant or irrelevant—by relating the impurity reduction for a particular variable XjX_{j} to the sample correlation between the response variable YY and any function of XjX_{j}. This lemma also highlights a key departure from other approaches in past decision tree literature that do not consider splits that depend on both input and output data (see, for example, DSTUMP [24]).

In order to state the lemma, we will need to introduce the concept of stationary intervals. We define a stationary interval of a univariate function h()h(\cdot) to be a maximal interval II such that h(I)=ch(I)=c, where cc is a local extremum of h()h(\cdot) (II is maximal in the sense that there does not exist an interval II^{\prime} such that III\subset I^{\prime} and h(I)=ch(I^{\prime})=c). In particular, note that a monotone function does not have any stationary intervals.

Lemma 1 (Lemma A.4, Supplementary Material in [26]).

Almost surely, uniformly over all functions h()h(\cdot) of XjX_{j} that have at most MM stationary intervals, we have

Δ^(Xj,Y)1D1Mn+log(2n)+1×Cov^2(h(Xj)Var^(h(Xj)),Y),\widehat{\Delta}(X_{j},Y)\geq\frac{1}{D^{-1}Mn+\log(2n)+1}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{h(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(h(X_{j}))}},Y\Bigg{)}, (9)

where D1D\geq 1 is the smallest number of data points in a stationary interval of h()h(\cdot) that contains at least one data point. (More precisely, if I1,,IMI_{1},\dots,I_{M} are the stationary intervals of h()h(\cdot) and Dk=#{XijIk}D_{k}=\#\{X_{ij}\in I_{k}\}, then D=mink{Dk:Dk1}D=\min_{k}\{D_{k}:D_{k}\geq 1\}.)

Remark 1.

Note that MM can also be thought of as the number of times h()h(\cdot) changes from strictly increasing to decreasing (or vice versa).

Remark 2.

The bound (9) is tight (up to universal constant factors) when M=0M=0, since Δ^(Xj,Y)=1\widehat{\Delta}(X_{j},Y)=1 and Var^(Y)log(n)\widehat{\text{Var}}(Y)\asymp\log(n) when X1jXnjX_{1j}\leq\cdots\leq X_{nj}, h(Xij)=Yih(X_{ij})=Y_{i}, and

Yi=(i1)(ni+1)i(ni).Y_{i}=\sqrt{(i-1)(n-i+1)}-\sqrt{i(n-i)}.
Proof sketch of Lemma 1.

For self-containment, we sketch the proof when h()h(\cdot) is differentiable. The essential idea is to construct an empirical prior Π\Pi on the split points zz and lower bound Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y) by

Δ^(z;Xj,Y)𝑑Π(z).\int\widehat{\Delta}(z;X_{j},Y)d\Pi(z).

Recall from Section 2.4 that NL=NL(z)N_{L}=N_{L}(z) and NR=NR(z)N_{R}=N_{R}(z) are the number of samples in the left and right daughter nodes, respectively, if the jthj^{\text{th}} variable is split at zz. The special prior we choose has density

dΠ(z)dz=|h(z)|NL(z)NR(z)|h(z)|NL(z)NR(z)𝑑z,\frac{d\Pi(z)}{dz}=\frac{|h^{\prime}(z)|\sqrt{N_{L}(z)N_{R}(z)}}{\int|h^{\prime}(z^{\prime})|\sqrt{N_{L}(z^{\prime})N_{R}(z^{\prime})}dz^{\prime}},

with support between the minimum and maximum values of the data {Xij}\{X_{ij}\}. This then yields Δ^(Xj,Y)C(h)×Cov^2(h(Xj)Var^(h(Xj)),Y)\widehat{\Delta}(X_{j},Y)\geq C(h)\times\widehat{\mathrm{Cov}}^{2}\bigg{(}\frac{h(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(h(X_{j}))}},Y\bigg{)}. The factor C(h)C(h) can be minimized (by solving a simple quadratic program) over all functions h()h(\cdot) under the constraint that they have at most MM stationary intervals containing at least DD data points, yielding the desired result (9). We direct the reader to [26, Lemma A.4, Supplementary Material] for the full proof. ∎

Ignoring the factor (D1Mn+log(2n)+1)1(D^{-1}Mn+\log(2n)+1)^{-1} in (9) and focusing only on the squared sample covariance term, note that choosing h()h(\cdot) to be the marginal projection fj()f_{j}(\cdot), we have

Cov^2(fj(Xj)Var^(fj(Xj)),Y)Cov2(fj(Xj)Var(fj(Xj)),Y)=Var(fj(Xj)),\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y\Bigg{)}\approx\mathrm{Cov}^{2}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\mathrm{Var}(f_{j}(X_{j}))}},Y\Bigg{)}=\mathrm{Var}(f_{j}(X_{j})),

where the last equality can be deduced from the fact that the marginal projection fj(Xj)f_{j}(X_{j}) is orthogonal to the residual Yfj(Xj)Y-f_{j}(X_{j}). Thus, in an ideal setting, Lemma 1 enables us to asymptotically lower bound Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y) by a multiple of the variance of the marginal projections—which can then be used to screen for important variables and control the number of false negatives.
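For completeness, the last equality above follows from orthogonality: since $\mathbb{E}[Y-f_{j}(X_{j})\,|\,X_{j}]=0$, we have $\mathrm{Cov}(f_{j}(X_{j}),Y-f_{j}(X_{j}))=\mathbb{E}\big[f_{j}(X_{j})\,\mathbb{E}[Y-f_{j}(X_{j})\,|\,X_{j}]\big]-\mathbb{E}[f_{j}(X_{j})]\,\mathbb{E}[Y-f_{j}(X_{j})]=0$, so that

\mathrm{Cov}\big(f_{j}(X_{j}),Y\big)=\mathrm{Var}(f_{j}(X_{j}))+\mathrm{Cov}\big(f_{j}(X_{j}),Y-f_{j}(X_{j})\big)=\mathrm{Var}(f_{j}(X_{j})),

and hence \mathrm{Cov}^{2}\Big(\frac{f_{j}(X_{j})}{\sqrt{\mathrm{Var}(f_{j}(X_{j}))}},Y\Big)=\frac{\mathrm{Var}^{2}(f_{j}(X_{j}))}{\mathrm{Var}(f_{j}(X_{j}))}=\mathrm{Var}(f_{j}(X_{j})).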

To summarize, the previous lemma shows that Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y) is large for variables XjX_{j} such that fj(Xj)f_{j}(X_{j}) is strongly correlated with YY—or equivalently—variables that have large signals in terms of the variance of the marginal projection. Conversely, our next lemma shows that Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y) is with high probability not greater than the variance of the marginal projection. A special instance of this lemma, namely, when YY is independent of XjX_{j}, was stated in [29, Lemma 1] and serves as the inspiration for our proof. Due to space constraints, we include the proof in Appendix A.

Lemma 2.

Suppose that Zj=Yfj(Xj)Z_{j}=Y-f_{j}(X_{j}) is conditionally sub-Gaussian given XjX_{j}, with variance parameter σZj2\sigma_{Z_{j}}^{2}, i.e., 𝔼[exp(λZj)|Xj]exp(λ2σZj2/2)\mathbb{E}[\exp(\lambda Z_{j})|X_{j}]\leq\exp(\lambda^{2}\sigma^{2}_{Z_{j}}/2) for all λ\lambda\in\mathbb{R}. With probability at least 14nexp(nξ2/(12σZj2))1-4n\exp(-n\xi^{2}/(12\sigma_{Z_{j}}^{2})),

Δ^(Xj,Y)3Var^(fj(Xj))+ξ2.\widehat{\Delta}(X_{j},Y)\leq 3\widehat{\mathrm{Var}}(f_{j}(X_{j}))+\xi^{2}.

In other words, Lemmas 1 and 2 together imply that SDI is a proxy for the variance of the marginal projection and therefore it roughly ranks the variables accordingly, up to constant factors. These lemmas are a key ingredient of the proofs for model selection and may be of independent interest.

4 SDI for linear models

To connect SDI to other variable screening methods that are perhaps more familiar to the reader, we first consider a linear model with Gaussian distributed variables. We allow for any correlation structure between covariates. Recall from (7) that Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y) is equal to Var^(Y)\widehat{\mathrm{Var}}(Y) times ρ^ 2(Y^(Xj),Y)\widehat{\rho}^{\,2}(\widehat{Y}(X_{j}),Y), so that SDI is equivalent to ranking by ρ^(Y^(Xj),Y)\widehat{\rho}\,(\widehat{Y}(X_{j}),Y). Our first theorem shows that ρ^(Y^(Xj),Y)\widehat{\rho}\,(\widehat{Y}(X_{j}),Y), the sample correlation between YY and an optimal decision stump in XjX_{j}, behaves roughly like the correlation between a linear model YY and a coordinate XjX_{j}.

Theorem 1 (SDI is asymptotically equivalent to SIS).

Let Y=β1X1+β2X2++βpXp+εY=\beta_{1}X_{1}+\beta_{2}X_{2}+\cdots+\beta_{p}X_{p}+\varepsilon and assume that 𝐗𝒩(𝟎,𝚺)\mathbf{X}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}) for some positive semi-definite matrix 𝚺\boldsymbol{\Sigma} and ε𝒩(0,σ2)\varepsilon\sim\mathcal{N}(0,\sigma^{2}) for some σ2>0\sigma^{2}>0. Let δ(0,1)\delta\in(0,1). There exists a universal positive constant C0C_{0} such that, with probability at least 1C0nδ2ρ2(Xj,Y)exp(nδ2ρ2(Xj,Y)/2)1-\frac{C_{0}}{\sqrt{n\delta^{2}\rho^{2}(X_{j},Y)}}\exp(-n\delta^{2}\rho^{2}(X_{j},Y)/2),

ρ^(Y^(Xj),Y)(1δ)|ρ(Xj,Y)|log(2n)+1.\widehat{\rho}\,(\widehat{Y}(X_{j}),Y)\geq\frac{(1-\delta)|\rho(X_{j},Y)|}{\sqrt{\log(2n)+1}}. (10)

Furthermore, with probability at least 14nexp(nδ2/12)2exp((n1)/16)1-4n\exp(-n\delta^{2}/12)-2\exp(-(n-1)/16),

ρ^(Y^(Xj),Y)5|ρ(Xj,Y)|+2δ.\widehat{\rho}\,(\widehat{Y}(X_{j}),Y)\leq 5|\rho(X_{j},Y)|+2\delta. (11)
Proof sketch of Theorem 1.

We only sketch the proof due to space constraints, but a more complete version is provided in Appendix B.

The first step in proving the lower bound (10) is to apply Lemma 1 with h(Xj)=Xjh(X_{j})=X_{j} (a monotone function) to see that

Δ^(Xj,Y)Var^(Y)log(2n)+1×ρ^ 2(Xj,Y),\widehat{\Delta}(X_{j},Y)\geq\frac{\widehat{\mathrm{Var}}(Y)}{\log(2n)+1}\times\widehat{\rho}^{\;2}(X_{j},Y), (12)

since M=0M=0. Next, we can apply asymptotic tail bounds for Pearson’s sample correlation coefficient ρ^(Xj,Y)\widehat{\rho}\,(X_{j},Y) between two correlated Gaussian distributions [18] to show that with high probability, |ρ^(Xj,Y)|(1δ)|ρ(Xj,Y)||\widehat{\rho}\,(X_{j},Y)|\geq(1-\delta)|\rho(X_{j},Y)|. Finally, we divide (12) by Var^(Y)\widehat{\mathrm{Var}}(Y), use (7), and take square roots to complete the proof of the high probability lower bound (10).

To prove the upper bound (11), notice that since XjX_{j} and YY are jointly Gaussian with mean zero, we have fj(Xj)=ρjσYσXjXjf_{j}(X_{j})=\rho_{j}\frac{\sigma_{Y}}{\sigma_{X_{j}}}X_{j}, where ρj=ρ(Xj,Y)\rho_{j}=\rho(X_{j},Y). Thus, by Lemma 2 with σZj2=(1ρj2)σY2\sigma^{2}_{Z_{j}}=(1-\rho^{2}_{j})\sigma^{2}_{Y} and ξ2=(1ρj2)σY2δ2\xi^{2}=(1-\rho^{2}_{j})\sigma^{2}_{Y}\delta^{2}, with probability at least 14nexp(nδ2/12)1-4n\exp(-n\delta^{2}/12),

Δ^(Xj,Y)\displaystyle\widehat{\Delta}(X_{j},Y) 3Var^(fj(Xj))+δ2(1ρj2)σY2\displaystyle\leq 3\widehat{\mathrm{Var}}(f_{j}(X_{j}))+\delta^{2}(1-\rho^{2}_{j})\sigma_{Y}^{2} (13)
=3ρj2σY2σXj2Var^(Xj)+δ2(1ρj2)σY2.\displaystyle=3\rho^{2}_{j}\frac{\sigma^{2}_{Y}}{\sigma^{2}_{X_{j}}}\widehat{\mathrm{Var}}(X_{j})+\delta^{2}(1-\rho^{2}_{j})\sigma_{Y}^{2}.

We further upper bound (13) by obtaining high probability upper and lower bounds, respectively, for Var^(Xj)\widehat{\mathrm{Var}}(X_{j}) and Var^(Y)\widehat{\mathrm{Var}}(Y) in terms of σXj2\sigma^{2}_{X_{j}} and σY2\sigma^{2}_{Y}, with a standard chi-squared concentration bound, per the Gaussian assumption. This yields that with high probability,

Δ^(Xj,Y)ρj2Var^(Y)+δ2Var^(Y).\widehat{\Delta}(X_{j},Y)\lesssim\rho^{2}_{j}\widehat{\mathrm{Var}}(Y)+\delta^{2}\widehat{\mathrm{Var}}(Y). (14)

Finally, dividing both sides of (14) by Var^(Y)\widehat{\mathrm{Var}}(Y), using (7), and taking square roots proves (11). ∎

Theorem 1 shows that with high probability, SDI is asymptotically equivalent (up to logarithmic factors in the sample size) to SIS for linear models in that it effectively ranks the variables by the magnitudes of their marginal sample correlations with the response, i.e., ρ^(Xj,Y)ρ(Xj,Y)\widehat{\rho}\,(X_{j},Y)\approx\rho(X_{j},Y). As a further parallel with decision stumps (see Section 2.7), the square of the sample correlation, ρ^ 2(Xj,Y)\widehat{\rho}^{\,2}(X_{j},Y), is also equal to the coefficient of determination r2r^{2} for the least squares linear fit of YY on XjX_{j}. We confirm the similarity between SDI and SIS empirically in Section 7.
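A small simulation illustrating this equivalence (the design, coefficients, and sample size are arbitrary illustrative choices, and the helper repeats the impurity-reduction routine from the Section 2.4 sketch):

import numpy as np

def stump_correlation(x, y):
    """rho_hat(Y_hat(X_j), Y), computed from the largest impurity reduction via identity (7)."""
    n = len(y)
    ys = y[np.argsort(x)]
    i = np.arange(1, n)
    left_sums = np.cumsum(ys)[:-1]
    ybar_L, ybar_R = left_sums / i, (ys.sum() - left_sums) / (n - i)
    delta = ((i / n) * (1 - i / n) * (ybar_L - ybar_R) ** 2).max()
    return np.sqrt(delta / y.var())

rng = np.random.default_rng(3)
n, p = 2000, 5
beta = np.array([2.0, 1.0, 0.5, 0.0, 0.0])         # last two variables are irrelevant
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

for j in range(p):
    sis = abs(np.corrcoef(X[:, j], y)[0, 1])       # SIS score: |rho_hat(X_j, Y)|
    print(f"j={j}: SIS={sis:.3f}  SDI stump corr={stump_correlation(X[:, j], y):.3f}")
# Typically both scores decrease in j and cleanly separate the relevant
# variables (j = 0, 1, 2) from the irrelevant ones (j = 3, 4).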

One corollary of Theorem 1 is that, like SIS, SDI also enjoys the sure screening property, under the same assumptions as [12, Conditions 1-4], which include mild conditions on the eigenvalues of the design covariance matrices and minimum signals of the parameters βj\beta_{j}. Similarly, like SIS, SDI can also be paired with lower dimensional variable selection methods such as Lasso or SCAD [2] for a complete variable selection algorithm in the correlated linear model case.

On the other hand, SDI, a nonlinear method, applies to broader contexts far beyond the rigidity of linear models. In the next section, we will investigate how SDI performs for general nonparametric models with additional assumptions on the distribution of the variables.

5 SDI for nonparametric models

In this section, we establish the variable ranking and selection consistency properties of SDI for general nonparametric models; that is, we show that for Algorithm 1, we have (𝒮^=𝒮)1\mathbb{P}(\widehat{{\mathcal{S}}}=\mathcal{{\mathcal{S}}})\to 1 as nn\to\infty. We describe the assumptions needed in Section 5.1 and outline the main consistency results and their proofs in Section 5.3. Finally, we describe how to modify the main results for binary classification in Section 5.5.

Although our approach differs substantively, to facilitate easy comparisons with other marginal screening methods, our framework and assumptions will be similar. As mentioned earlier, SDI is based on a more parsimonious but significantly more biased model fit than those that underpin conventional methods. As we shall see, despite the decision stump severely underfitting the data, SDI nevertheless achieves model selection guarantees that are similar to, and in some cases stronger than, those of its competitors. This highlights a key difference between quantifying sensitivity and screening—in the latter case, we are not concerned with obtaining consistent estimates of the marginal projections fj(Xj)f_{j}(X_{j}) and their variances. Doing so demands more from the data and is therefore less efficient, when otherwise crude estimates would work equally well.

5.1 Assumptions

In this section, we describe the key assumptions and ideas which will be needed to achieve model selection consistency. The assumptions will be similar to those in the independence screening literature [11, 12], but are weaker than most past work on tree based variable selection [29, 24].

  • Assumption 1 (Bounded regression function).

    The regression function g()g(\cdot) is bounded with B=g<B=\|g\|_{\infty}<\infty.

  • Assumption 2 (Smoothness of marginal projections).

    Let rr be a nonnegative integer, let 0<α10<\alpha\leq 1 be such that dr+αd\coloneqq r+\alpha is at least 5/85/8, and let 0<L<0<L<\infty. The rthr^{\rm{th}} order derivative of fj()f_{j}(\cdot) exists and is LL-Lipschitz of order α\alpha, i.e.,

    |fj(r)(x)fj(r)(x)|L|xx|α,x,x.\big{|}f^{(r)}_{j}(x)-f^{(r)}_{j}(x^{\prime})\big{|}\leq L|x-x^{\prime}|^{\alpha},\quad x,x^{\prime}\in\mathbb{R}.
  • Assumption 3 (Monotonicity of marginal projections).

    The marginal projection fj()f_{j}(\cdot) is monotone on \mathbb{R}.

  • Assumption 4 (Partial orthogonality of predictor variables).

    The collections {Xj}j𝒮\{X_{j}\}_{j\in{\mathcal{S}}} and {Xj}j𝒮c\{X_{j}\}_{j\in{\mathcal{S}}^{c}} are independent of each other.

  • Assumption 5 (Uniform relevant variables).

    The marginal distribution of each XjX_{j}, for j𝒮j\in{\mathcal{S}}, is uniform on the unit interval.

  • Assumption 6 (Sub-Gaussian error distribution).

    The error distribution is conditionally sub-Gaussian given 𝐗\mathbf{X}, i.e., 𝔼[ε|𝐗]=0\mathbb{E}[\varepsilon|\mathbf{X}]=0 and 𝔼[exp(λε)𝐗]exp(λ2σ2/2)\mathbb{E}[\exp(\lambda\varepsilon)\mid\mathbf{X}]\leq\exp(\lambda^{2}\sigma^{2}/2) for all λ\lambda\in\mathbb{R} with σ2>0\sigma^{2}>0.

5.2 Discussion of the assumptions

Assumption 2 is a standard smoothness assumption for variable selection in nonparametric additive models [11, Assumption A] and [19, Section 3] except that, for technical reasons, we have the condition d5/8d\geq 5/8 instead of d>1/2d>1/2. Because SDI does not involve tuning parameters that govern its approximation properties of the nonparametric constituents (such as with NIS and SPAM), Assumption 2 can be relaxed to allow for different levels of smoothness in different dimensions and, by straightforward modifications of our proofs, one can show that SDI adapts automatically. Alternatively, instead of Assumption 2, as we shall see, stronger conclusions can be provided if we impose a monotonicity constraint, namely, Assumption 3. Note that this monotonicity assumption encompasses many important “shape constrained” statistical models such as linear or isotonic regression.

Assumption 4 is essentially the so-called “partial orthogonality” condition in marginal screening methods [13]. Importantly, it allows for correlation between the relevant variables {Xj}j𝒮\{X_{j}\}_{j\in{\mathcal{S}}}, unlike previous works on tree based variable selection [24, 29]. Notably, NIS and SPAM do allow for dependence between relevant and irrelevant variables, under suitable assumptions on the data matrix of basis functions. However, these assumptions are difficult to translate in terms of the joint distribution of the predictor variables and difficult to verify given the data.

Assumption 5 is stated as is for clarity of exposition and is not strictly necessary for our main results to hold. For instance, we may assume instead that the marginal densities of the relevant variables are compactly supported and uniformly bounded above and below by a strictly positive constant, as in [11, 19]. In fact, even these assumptions are not required. If the marginal projection is monotone (i.e., Assumption 3), no marginal distributional assumptions are required, that is, each XjX_{j} for j𝒮j\in{\mathcal{S}} could be continuous, discrete, have unbounded support, or have a density that vanishes or is unbounded. More generally, similar distributional relaxations are made possible by the fact that CART decision trees are invariant to monotone transformations, enabling us to reduce the general setting to the case where each predictor variable is uniformly distributed on [0,1][0,1]. See Remark 3 for details.

Remark 3.

Let qj()q_{j}(\cdot) and Fj()F_{j}(\cdot) be the quantile function and the cumulative distribution function of XjX_{j}, respectively. Recall that the Galois inequalities state that qj(w)zq_{j}(w)\leq z if and only if wFj(z)w\leq F_{j}(z) and furthermore that qj(Fj(Xj))=Xjq_{j}(F_{j}(X_{j}))=X_{j} almost surely. Then, choosing w=Fj(Xj)w=F_{j}(X_{j}), we see that, almost surely, XjzX_{j}\leq z if and only if Fj(Xj)Fj(z)F_{j}(X_{j})\leq F_{j}(z). Therefore, almost surely,

maxzΔ^(z;Xj,Y)=maxzΔ^(Fj(z);Fj(Xj),Y)=maxw[0,1]Δ^(w;Fj(Xj),Y).\max_{z}\widehat{\Delta}(z;X_{j},Y)=\max_{z}\widehat{\Delta}(F_{j}(z);F_{j}(X_{j}),Y)=\max_{w\in[0,1]}\widehat{\Delta}(w;F_{j}(X_{j}),Y).

This means that for continuous data, the problem can be reduced to the uniform case by pre-applying the marginal cumulative distribution function Fj()F_{j}(\cdot) to each variable, since Fj(Xj)Uniform([0,1])F_{j}(X_{j})\sim\rm{Uniform}([0,1]). Note that the marginal projections now equal the composition of the original fj()f_{j}(\cdot) with qj()q_{j}(\cdot), i.e., (fjqj)()(f_{j}\circ q_{j})(\cdot). By the chain rule from calculus, if qj()q_{j}(\cdot) satisfies Assumption 2, then so does (fjqj)()(f_{j}\circ q_{j})(\cdot).
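A quick numerical illustration of this invariance (the exponential map below is an arbitrary strictly increasing transform; the helper repeats the routine from the Section 2.4 sketch):

import numpy as np

def max_impurity_reduction(x, y):
    # Largest impurity reduction over all splits of x (as in the Section 2.4 sketch).
    n = len(y)
    ys = y[np.argsort(x)]
    i = np.arange(1, n)
    left_sums = np.cumsum(ys)[:-1]
    ybar_L, ybar_R = left_sums / i, (ys.sum() - left_sums) / (n - i)
    return ((i / n) * (1 - i / n) * (ybar_L - ybar_R) ** 2).max()

rng = np.random.default_rng(4)
x = rng.standard_normal(500)
y = x ** 2 + 0.1 * rng.standard_normal(500)

# Any strictly increasing transform of x (e.g., its CDF, or exp) preserves the ordering of
# the data, hence the optimal impurity reduction is unchanged.
print(np.isclose(max_impurity_reduction(x, y), max_impurity_reduction(np.exp(x), y)))   # True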

5.3 Theory for variable ranking and model selection

Here our goal will be to provide variable ranking and model selection guarantees of SDI using the assumptions in Section 5.1. Again, in this section, we sketch the proofs, but the full versions can be found in Appendices D and E.

5.3.1 Preliminary results

The high level idea will be to show that the impurity reductions for relevant variables dominate those for irrelevant variables with high probability, meaning that relevant and irrelevant variables are correctly ranked.

The following two propositions provide high probability lower bounds on the impurity reduction for relevant variables, the size of which depend on whether we assume Assumption 2 or Assumption 3.

Our first result deals with general, smooth marginal projections. Remarkably, it shows that with high probability, Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y) captures a portion of the variance in the marginal projection.

Proposition 1.

Under Assumptions 1, 2, 5, and 6, with probability at least 1(4n+2)exp(nC1Var(fj(Xj)))1-(4n+2)\exp\big{(}-nC_{1}\mathrm{Var}(f_{j}(X_{j}))\big{)}, we have

Δ^(Xj,Y)C2(Var(fj(Xj)))6/5+1/dlog(n),\widehat{\Delta}(X_{j},Y)\geq\frac{C_{2}(\mathrm{Var}(f_{j}(X_{j})))^{6/5+1/d}}{\log(n)},

for some positive constants C1C_{1} and C2C_{2} which depend only on BB, σ\sigma, rr, and α\alpha.

Next, we state an analogous bound, but under a slightly different assumption on the marginal projection. It turns out that if the marginal projection is monotone, we can obtain a stronger result, which is surprisingly independent of the smoothness level.

Proposition 2.

Under Assumptions 1, 3, and 6, with probability at least 12exp((n1)Var(fj(Xj))32(B2+σ2))1-2\exp\Big{(}-\frac{(n-1)\mathrm{Var}(f_{j}(X_{j}))}{32(B^{2}+\sigma^{2})}\Big{)},

Δ^(Xj,Y)Var(fj(Xj))16(1+log(2n)).\widehat{\Delta}(X_{j},Y)\geq\frac{\mathrm{Var}(f_{j}(X_{j}))}{16(1+\log(2n))}.
Proof sketch of Propositions 1 and 2.

We sketch the proof of Proposition 1; the proof of Proposition 2 is based on similar arguments. The main idea is to apply Lemma 1 with h()h(\cdot) equal to a modified polynomial approximation f~j()\widetilde{f}_{j}(\cdot) to fj()f_{j}(\cdot). This is done to temper the effect of the factor (D1Mn+log(2n)+1)1(D^{-1}Mn+\log(2n)+1)^{-1} from Lemma 1, by controlling MM and DD individually.

To construct such a function, we first employ a Jackson-type estimate [23] in conjunction with Assumption 2 and Bernstein’s theorem for polynomials [22] to show the existence of a good polynomial approximation PM()P_{M}(\cdot) (of degree M+1M+1) to fj()f_{j}(\cdot). We then construct f~j()\widetilde{f}_{j}(\cdot) by redefining PM()P_{M}(\cdot) to be constant in a small neighborhood around each of its local extrema, which ensures that each resulting stationary interval of f~j()\widetilde{f}_{j}(\cdot) has a sufficiently large length. Since PM()P_{M}(\cdot) is a polynomial of degree M+1M+1, it has at most MM local extrema, and thus the number of stationary intervals of f~j()\widetilde{f}_{j}(\cdot) will also be at most MM.

Next, we use concentration of measure to ensure that each stationary interval of f~j()\widetilde{f}_{j}(\cdot) is saturated with enough data, effectively providing a lower bound on DD. Executing this argument reveals that valid choices of DD and MM (which come from optimizing a bound on the supremum norm between f~j()\widetilde{f}_{j}(\cdot) and fj()f_{j}(\cdot)) are:

Dn×(Var(fj(Xj)))1/5+1/(2d),M(Var(fj(Xj)))1/(2d).D\gtrsim n\times(\mathrm{Var}(f_{j}(X_{j})))^{1/5+1/(2d)},\quad M\lesssim(\mathrm{Var}(f_{j}(X_{j})))^{-1/(2d)}.

Plugging these values into the lower bound (9) in Lemma 1, we find that with high probability,

Δ^(Xj,Y)\displaystyle\widehat{\Delta}(X_{j},Y) 1D1Mn+log(2n)+1×Cov^2(f~j(Xj)Var^(f~j(Xj)),Y)\displaystyle\geq\frac{1}{D^{-1}Mn+\log(2n)+1}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)} (15)
(Var(fj(Xj)))1/5+1/dlog(n)×Cov^2(f~j(Xj)Var^(f~j(Xj)),Y).\displaystyle\gtrsim\frac{(\mathrm{Var}(f_{j}(X_{j})))^{1/5+1/d}}{\log(n)}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}.

Thus, the lower bound in Proposition 1 will follow if we can show that the squared sample covariance factor in (15) exceeds Var(fj(Xj))\mathrm{Var}(f_{j}(X_{j})) with high probability. To this end, note that

Cov^(f~j(Xj)Var^(f~j(Xj)),Y)\displaystyle\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)} =Cov^(f~j(Xj)Var^(f~j(Xj)),fj(Xj))(I)\displaystyle=\underbrace{\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},f_{j}(X_{j})\Bigg{)}}_{\text{(I)}} (16)
+Cov^(f~j(Xj)Var^(f~j(Xj)),Yfj(Xj))(II).\displaystyle\qquad+\underbrace{\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}}_{\text{(II)}}.

With high probability, (I) can be lower bounded by CVar(fj(Xj))C\sqrt{\mathrm{Var}(f_{j}(X_{j}))}, where CC is some constant, using the approximation properties of f~j()\widetilde{f}_{j}(\cdot) for fj()f_{j}(\cdot), per the choice of DD and MM from above, and a concentration inequality for the sample variance of fj(Xj)f_{j}(X_{j}). Furthermore, since Yfj(Xj)Y-f_{j}(X_{j}) has conditional mean zero, a Hoeffding type concentration inequality shows that, with high probability, (II) is larger than any (strictly) negative constant, including (C/2)Var(fj(Xj))-(C/2)\sqrt{\mathrm{Var}(f_{j}(X_{j}))}. Combining this analysis from (15) and (16), we obtain the high probability lower bound on Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y) given in Proposition 1. ∎

Next, we need to ensure that there is a sufficient separation in the impurity reductions between relevant and irrelevant variables. To do so, we use Lemma 2 along with the partial orthogonality assumption in Section 5.1 to show that the impurity reductions for irrelevant variables will be small with high probability.

Lemma 3.

Under Assumptions 1, 4 and 6, for each j𝒮cj\in{\mathcal{S}}^{c}, with probability at least 14nexp(nξ2/(12(B2+σ2)))1-4n\exp(-n\xi^{2}/(12(B^{2}+\sigma^{2}))),

Δ^(Xj,Y)ξ2.\widehat{\Delta}(X_{j},Y)\leq\xi^{2}.

In other words, taking \xi^{2}\asymp(B^{2}+\sigma^{2})n^{-1}\log(n) in Lemma 3 shows that if j\in{\mathcal{S}}^{c}, then \widehat{\Delta}(X_{j},Y)={\mathcal{O}}(n^{-1}\log(n)) with probability at least 1-n^{-\Omega(1)}.

Proof of Lemma 3.

Observe that

𝔼[exp(λ(Yfj(Xj)))|Xj]\displaystyle\mathbb{E}[\exp(\lambda(Y-f_{j}(X_{j})))|X_{j}] =𝔼[exp(λ(g(𝐗)fj(Xj)))𝔼[exp(λε)|𝐗]|Xj]\displaystyle=\mathbb{E}\big{[}\exp(\lambda(g(\mathbf{X})-f_{j}(X_{j})))\mathbb{E}\big{[}\exp(\lambda\varepsilon)\big{|}\mathbf{X}\big{]}\big{|}X_{j}\big{]}
𝔼[exp(λ(g(𝐗)fj(Xj)))|Xj]exp(λ2σ2/2)\displaystyle\leq\mathbb{E}\big{[}\exp(\lambda(g(\mathbf{X})-f_{j}(X_{j})))\big{|}X_{j}\big{]}\exp(\lambda^{2}\sigma^{2}/2) (17)
𝔼[exp(λ2(B2+σ2)/2)],\displaystyle\leq\mathbb{E}\big{[}\exp(\lambda^{2}(B^{2}+\sigma^{2})/2)\big{]}, (18)

where we used Assumption 6 in the penultimate inequality (17) and Hoeffding’s Lemma together with Assumption 1 in the last inequality (18). Using Assumption 4 along with Lemma 2 with σZj2=B2+σ2\sigma^{2}_{Z_{j}}=B^{2}+\sigma^{2} proves Lemma 3. ∎

5.3.2 Main results

Assuming we know the size ss of the support 𝒮{\mathcal{S}}, we can use the SDI ranking from Algorithm 1 to choose the top ss most important variables. Alternatively, if ss is unknown, we instead choose an asymptotic threshold γn\gamma_{n} of the impurity reductions to select variables; that is, 𝒮^={j:Δ^(Xj,Y)γn}\widehat{\mathcal{S}}=\{j:\widehat{\Delta}(X_{j},Y)\geq\gamma_{n}\}. We state our variable ranking guarantees in terms of the minimum signal strength of the relevant variables:

vminj𝒮Var(fj(Xj)),v\coloneqq\min_{j\in{\mathcal{S}}}\mathrm{Var}(f_{j}(X_{j})),

which is the same as the minimum variance parameter in independence screening papers (e.g., [11, Assumption C]). Note that vv measures the minimum contribution of each relevant variable alone to the variance in YY, ignoring the effects of the other variables.
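To fix ideas, here is a minimal R sketch of the screening procedure (not the authors' experimental code; the function names are ours): each variable is scored by its best stump impurity reduction, as in representation (6), and the top s variables are returned.

# Score one variable by the largest impurity reduction over all stump splits.
sdi_score <- function(x, y) {
  n <- length(y)
  ys <- y[order(x)]                        # responses sorted by the candidate variable
  k <- seq_len(n - 1)
  left_sum <- cumsum(ys)[k]
  left_mean  <- left_sum / k               # mean response to the left of the k-th split
  right_mean <- (sum(ys) - left_sum) / (n - k)
  max((k / n) * (1 - k / n) * (left_mean - right_mean)^2)
}

# Rank all p variables and keep the indices of the top s (the SDI ranking).
sdi_screen <- function(X, y, s) {
  scores <- apply(X, 2, sdi_score, y = y)
  order(scores, decreasing = TRUE)[seq_len(s)]
}

Ties within a variable are handled only approximately in this sketch.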

Theorem 2.

Suppose Assumptions 1, 2, 4, 5, and 6 hold. Then the top ss most important variables ranked by Algorithm 1 equal the correct set 𝒮{\mathcal{S}} of relevant variables with probability at least

1s(4n+2)exp(C1nv)4n(ps)exp(nC2v6/5+1/d24log(n)(B2+σ2)),1-s(4n+2)\exp(-C_{1}nv)-4n(p-s)\exp\Big{(}-\frac{nC_{2}v^{6/5+1/d}}{24\log(n)(B^{2}+\sigma^{2})}\Big{)}, (19)

where C1C_{1} and C2C_{2} are the same constants in Proposition 1.

Remark 4.

As a corollary of Theorem 2, we obtain a special case of Proposition 1 in [37] for the root node of a CART decision tree, which states more generally that a relevant variable is selected at each node with probability converging to one.

As mentioned earlier, stronger results can be obtained if we assume that the marginal projections are monotone. In our next theorem, notice that we do not have to make any additional smoothness assumptions, nor do we require any distributional assumptions on the relevant variables, other than that they are independent of the irrelevant ones.

Theorem 3.

Suppose Assumptions 1, 3, 4, and 6 hold. Then the top ss most important variables ranked by Algorithm 1 equal the correct set 𝒮{\mathcal{S}} of relevant variables with probability at least

12sexp((n1)v32(B2+σ2))4n(ps)exp(nv96(1+log(2n))(B2+σ2)).1-2s\exp\Big{(}-\frac{(n-1)v}{32(B^{2}+\sigma^{2})}\Big{)}-4n(p-s)\exp\Big{(}-\frac{nv}{96(1+\log(2n))(B^{2}+\sigma^{2})}\Big{)}. (20)
Remark 5.

When ss is unknown, Propositions 1 and 2 and Lemma 3 imply that the oracle threshold choices

\gamma_{n}=\frac{C_{2}v^{6/5+1/d}}{2\log(n)}\quad\text{and}\quad\gamma_{n}=\frac{v}{8(1+\log(2n))} (21)

ensure that

maxj𝒮cΔ^(Xj,Y)<γnminj𝒮Δ^(Xj,Y)\max_{j\in{\mathcal{S}}^{c}}\widehat{\Delta}(X_{j},Y)<\gamma_{n}\leq\min_{j\in{\mathcal{S}}}\widehat{\Delta}(X_{j},Y)

and hence will yield the same high probability bounds (19) and (20), respectively. Thus, while the permutation and elbow methods from Section 2.5 are somewhat ad-hoc, if they, at the very least, produce thresholds that are close to (21), then high probability performance guarantees are still possible to obtain.

5.4 Minimum signal strengths

Like all marginal screening methods, the theoretical basis for SDI is that each marginal projection for a relevant variable should be nonconstant, or equivalently, that v>0v>0. Note that when the relevant variables are independent and the underlying model is additive, per (1), the marginal projections equal the component functions of the additive model. Hence, v=minj𝒮Var(gj(Xj))v=\min_{j\in{\mathcal{S}}}\mathrm{Var}(g_{j}(X_{j})), which will always be strictly greater than zero. As Theorems 2 and 3 show, vv controls the probability of a successful ranking of the variables. In practice, many of the relevant variables may have very small signals—therefore we are particularly interested in cases where vv is allowed to become small when the sample size grows large, as we now discuss.

We see from Theorem 2 that in order to have model selection consistency with probability at least 1nΩ(1)1-n^{-\Omega(1)}, it suffices to have

v(log(n)log(n(ps))n)5d6d+5,v\gtrsim\Big{(}\frac{\log(n)\log(n(p-s))}{n}\Big{)}^{\frac{5d}{6d+5}}, (22)

up to constants that depend on BB, σ\sigma, rr, and α\alpha. That is, (22) is a sufficient condition on the signal of all relevant variables so that (𝒮^=𝒮)1\mathbb{P}(\widehat{\mathcal{S}}={\mathcal{S}})\rightarrow 1 as nn\to\infty. Similarly, we see from Theorem 3 that

vlog(n)log(n(ps))nv\gtrsim\frac{\log(n)\log(n(p-s))}{n} (23)

is sufficient to guarantee model selection consistency for monotone marginal projections. A particularly striking aspect of (23) is that the rate is independent of the smoothness of the marginal projection. This means that the “difficulty” of detecting a signal from a general monotone marginal projection is essentially no more than if the marginal projection were linear.
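To make the passage from (19) to (22) explicit (a sketch, with the constants C_{1}, C_{2}, B, and \sigma absorbed into \gtrsim): the second term of (19) decays like n^{-\Omega(1)} provided

\frac{nv^{6/5+1/d}}{\log(n)}\gtrsim\log\big{(}n(p-s)\big{)}\iff v^{\frac{6d+5}{5d}}\gtrsim\frac{\log(n)\log(n(p-s))}{n}\iff v\gtrsim\Big{(}\frac{\log(n)\log(n(p-s))}{n}\Big{)}^{\frac{5d}{6d+5}},

which is exactly (22); the first term of (19) only requires nv\gtrsim\log(ns), and this is implied by (22) in the ultra high-dimensional regime p=\exp(o(n)) (cf. the discussion of maximum dimensionality in Section 6).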

5.5 SDI for binary classification

Our theory for SDI can also naturally be extended to the context of classification, as we now describe. For simplicity, we focus on the problem of binary classification where Y{0,1}Y\in\{0,1\}. To begin, we first observe that Gini impurity and variance impurity are equivalent up to a factor of two [30, Section 3]. Thus, we can use the same criterion Δ^(Xj,Y)\widehat{\Delta}(X_{j},Y) to rank the variables. Next, we identify the marginal projection as the marginal class probability

fj(Xj)=(Y=1|Xj)=1(Y=0|Xj).f_{j}(X_{j})=\mathbb{P}(Y=1|X_{j})=1-\mathbb{P}(Y=0|X_{j}).

Because of these connections to the regression setting, the results in Section 5.3 hold verbatim for classification under the same assumptions therein with v=\min_{j\in{\mathcal{S}}}\mathrm{Var}(\mathbb{P}(Y=1|X_{j})), B=1, and \sigma^{2}=0. An interesting special case arises when \mathbb{P}(Y=1|\mathbf{X})=h(\beta_{1}X_{1}+\cdots+\beta_{p}X_{p}) for some monotone link function h(\cdot) and \mathbf{X}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}) for some positive semi-definite matrix \boldsymbol{\Sigma}. Using the fact that Gaussian random vectors are conditionally Gaussian distributed, it can be shown that the marginal projection \mathbb{P}(Y=1|X_{j}) is monotone and therefore satisfies Assumption 3. Consequently, for example, logistic regression with Gaussian data inherits the same stronger conclusions as Theorem 3.
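Since the criterion is unchanged for binary labels, the regression sketch from Section 5.3.2 applies verbatim to classification. A small illustrative example (the data-generating model below is our own choice, not one of the models in Section 7):

# Binary response: variance impurity equals Gini impurity up to a factor of two,
# so sdi_score can be reused with y in {0,1}.
set.seed(1)
n <- 500; X <- matrix(runif(n * 3), n, 3)
y <- rbinom(n, 1, plogis(4 * X[, 1] - 2))   # only the first variable is relevant
apply(X, 2, sdi_score, y = y)               # the first score should dominate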

6 Comparing SDI with other model selection methods

In this section, we compare the finite sample guarantees of SDI given in Section 5.3 and Section 4 to those of NIS, Lasso, and SPAM. To summarize, we find that SDI enjoys model selection consistency even when the marginal signal strengths of the relevant variables are smaller than those for NIS and SPAM. We also find that the minimum sample size of SDI for high probability support recovery is nearly what is required for Lasso, which is minimax optimal. Finally, we show that SDI can handle a larger number of predictor variables than NIS and SPAM.

Minimum signal strength for NIS

We analyze the details of [11] to uncover the corresponding threshold vv for NIS. In order to have model selection consistency, the probability bound in [11, Theorem 2] must approach one as nn\to\infty, which necessitates

dn(n14κlog(np))1/3,d_{n}\lesssim\Big{(}\frac{n^{1-4\kappa}}{\log(np)}\Big{)}^{1/3}, (24)

where d_{n} is the number of spline basis functions and \kappa is a free parameter (in the notation of [11]). However, notice that [11, Assumption F] also requires d_{n}\gtrsim n^{\frac{2\kappa}{2d+1}}; combining this with (24) shows that we must have

nκ(log(np)n)2d+18d+10.n^{-\kappa}\gtrsim\Big{(}\frac{\log(np)}{n}\Big{)}^{\frac{2d+1}{8d+10}}. (25)

Now substituting (24) and (25) into vdnn2κv\gtrsim d_{n}n^{-2\kappa} ([11, Assumption C]), it follows that we have

v(log(np)n)4d8d+10v\gtrsim\Big{(}\frac{\log(np)}{n}\Big{)}^{\frac{4d}{8d+10}}

for NIS, which is a larger minimum signal than our (22).

Minimum signal strength for SPAM

In the case where d=2d=2, by [35, Section 6.1], we must have vn4/15log16/5(np)v\gtrsim n^{-4/15}\log^{16/5}(np) for SPAM to achieve consistent model selection. For comparison, our algorithm allows for a smaller signal v(log(n)log(np)n)10/17v\gtrsim\big{(}\frac{\log(n)\log(np)}{n}\big{)}^{10/17}, which is obtained by setting d=2d=2 in (22).

Minimum sample size for consistency

Consider the linear model with Gaussian variates from Theorem 1, where for simplicity we additionally assume that 𝚺=𝐈p×p\boldsymbol{\Sigma}=\mathbf{I}_{p\times p} is the p×pp\times p identity matrix, yielding ρ2(Xj,Y)=βj2/(σ2+k=1pβk2)\rho^{2}(X_{j},Y)=\beta^{2}_{j}/(\sigma^{2}+\sum_{k=1}^{p}\beta^{2}_{k}).

Following the same steps used to prove Theorem 2, but using Theorem 1 and Lemma 2 instead, we can derive an analogous exact support recovery guarantee for the linear model with Gaussian variates. The full details are in Appendix C. With the specifications \sum_{k=1}^{p}\beta^{2}_{k}={\mathcal{O}}(1) and \min_{j\in{\mathcal{S}}}|\beta_{j}|^{2}\asymp 1/s, we find that a sufficient sample size for high probability support recovery is

nslog(n)log(n(ps)),n\gg s\log(n)\log(n(p-s)),

which happens when

nslog(ps)×(log(s)+loglog(ps)).n\gg s\log(p-s)\times(\log(s)+\log\log(p-s)). (26)

Now, it is shown in [41, Corollary 1] that the minimax optimal threshold for support recovery under these parameter specifications is nslog(ps)n\asymp s\log(p-s), which is achieved by Lasso [42]. Amazingly, (26) coincides with this optimal threshold up to log(s)\log(s) and loglog(ps)\log\log(p-s) factors, despite SDI not being tailored to linear models.
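As a rough numerical illustration (with hypothetical values s=10 and p=10^{4}, and natural logarithms):

s\log(p-s)\approx 10\times 9.21\approx 92,\qquad s\log(p-s)\times(\log(s)+\log\log(p-s))\approx 10\times 9.21\times(2.30+2.22)\approx 416,

so at this scale the extra factor in (26) relative to the minimax threshold is only about 4.5.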

Maximum dimensionality

Suppose the signal strength vv is bounded above and below by a positive constant when the sample size increases. Then Theorems 2 and 3 show model selection consistency for SDI up to dimensionality p=exp(o(n))p=\exp(o(n)). This is larger than the maximum dimensionality p=exp(o(n2(d1)/(2d+1)))p=\exp(o(n^{2(d-1)/(2d+1)})) for NIS [11, Section 3.2], thus applying to an even broader spectrum of ultra high-dimensional problems. Furthermore, when d=2d=2, SPAM is able to handle dimensionality up to p=exp(o(n1/6))p=\exp(o(n^{1/6})) [35, Equation (45)], which is again lower than the dimensionality p=exp(o(n))p=\exp(o(n)) for SDI.

7 Experiments

In this section, we conduct computer experiments of SDI with synthetic data. As there are many existing empirical studies of the related MDI measure [24, 29, 30, 31, 32, 38, 43, 44], we do not aim for comprehensiveness.

Our first set of experiments compares the performance of SDI with SIS, NIS, Lasso, SPAM, MDI for a single CART decision tree, and MDI for a random forest. In Section 7.1, we assess performance based on the probability of exact support recovery. To ensure a fair comparison between SDI and the other algorithms, we assume a priori knowledge of the true sparsity level s, which is incorporated into Lasso and SPAM by specifying the model degrees of freedom in advance. These simulations were conducted in R using the packages rpart for SDI, SAM for SPAM, SIS for SIS, and glmnet for Lasso with default settings. We also compute two versions of MDI: MDI RF using the package randomForest with ntrees = 100 and MDI CART (based on a pruned CART decision tree) using the package rpart with default settings. The source code from [11] was used to conduct experiments with NIS.

The second set of experiments, in Section 7.2, deals with the case when the sparsity level ss is unknown, whereby we demonstrate the empirical performance of the permutation and elbow methods from Section 2.5.

In all our experiments, we generate nn samples from an ss-sparse additive model g(𝐗)=j=1sgj(Xj)g(\mathbf{X})=\sum_{j=1}^{s}g_{j}(X_{j}) for various types of components gj(Xj)g_{j}(X_{j}). The error distribution is ε𝒩(0,σ2)\varepsilon\sim\mathcal{N}(0,\sigma^{2}), the sparsity level is fixed at s=4s=4, and the ambient dimension is fixed at p=2000p=2000. We consider the following model types.

Model 1

Consider linear additive components gj(Xj)=Xjg_{j}(X_{j})=X_{j} and variables 𝐗𝒩(𝟎,𝚺)\mathbf{X}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}), where the covariance matrix 𝚺\boldsymbol{\Sigma} has diagonal entries equal to 11 and off-diagonal entries equal to some constant ρ(1,1)\rho\in(-1,1). We set the noise level σ2=1\sigma^{2}=1 and consider correlation level ρ=0.5\rho=0.5.

Model 2

Consider linear additive components gj(Xj)=Xjg_{j}(X_{j})=X_{j} with noise level σ2=3\sigma^{2}=3. The variables X2,,XpX_{2},\ldots,X_{p} are i.i.d standard normal random variables and X1=13X23+ε~X_{1}=-\frac{1}{3}X_{2}^{3}+\widetilde{\varepsilon} where ε~𝒩(0,1)\widetilde{\varepsilon}\sim{\mathcal{N}}(0,1). Notice that even though the model is linear, the marginal projections 𝔼[Y|X1]\mathbb{E}[Y|X_{1}] and 𝔼[Y|X2]\mathbb{E}[Y|X_{2}] are nonlinear. This is similar to Example 2 of [11].

Model 3

We consider additive components gj(Xj)=cos(4πXj)g_{j}(X_{j})=\cos(4\pi X_{j}), where 𝐗Uniform([0,1]p)\mathbf{X}\sim\text{Uniform}([0,1]^{p}) (i.e., all predictor variables are independent) and σ2=1\sigma^{2}=1.

Model 4

Consider the nonlinear additive components

g1(x)=5x,g2(x)=3(2x1)2,g3(x)=4sin(2πx)2sin(2πx),\displaystyle g_{1}(x)=5x,\quad g_{2}(x)=3(2x-1)^{2},\quad g_{3}(x)=\frac{4\sin(2\pi x)}{2-\sin(2\pi x)},
g4(x)=6(0.1sin(2πx)+0.2cos(2πx)+0.3sin2(2πx)+0.4cos3(2πx)\displaystyle g_{4}(x)=6(0.1\sin(2\pi x)+0.2\cos(2\pi x)+0.3\sin^{2}(2\pi x)+0.4\cos^{3}(2\pi x)
+0.5sin3(2πx)).\displaystyle\quad\quad\quad\quad+0.5\sin^{3}(2\pi x)).

Let 𝐗Uniform([0,1]p)\mathbf{X}\sim\text{Uniform}([0,1]^{p}) and set the noise level σ2=1.74\sigma^{2}=1.74. This is the same model as Example 3 of [11] with t=0t=0 (and correlation 0).

Model 5

We consider monotone additive components

g1(x)=exp(x2),g2(x)=log(x+0.1),\displaystyle g_{1}(x)=-\exp(x^{2}),\quad g_{2}(x)=-\log(x+0.1),
g3(x)=2tanh(20x2)+0.5exp(x3),g4(x)=2exp(10x5)1+exp(10x5).\displaystyle g_{3}(x)=2\tanh(20x^{2})+0.5\exp(x^{3}),\quad g_{4}(x)=\frac{2\exp(10x-5)}{1+\exp(10x-5)}.

The variables are 𝐗Uniform([0,1]p)\mathbf{X}\sim\text{Uniform}([0,1]^{p}) and we have noise level σ2=1\sigma^{2}=1. This is similar to Model A in [3, Section 3].

In our simulations, Model 1 tests the correlated Gaussian linear model setting of Theorem 1. Model 2 modifies the setting of Theorem 1 by including a relevant variable that is a nonlinear function of another relevant variable, so that some marginal projections are nonlinear. Models 3 and 4 are examples of the setting of Theorem 2 for general nonparametric models. Model 5 considers the shape-constrained setting of Theorem 3. Though our main results apply to general nonparametric models, we have chosen to focus our experiments on additive models to facilitate comparison with other methods designed for the same setting.
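For concreteness, the following R sketch generates one replication of Model 3 (with the ambient dimension reduced so that it runs quickly; the actual experimental settings are those stated above) and screens it with sdi_screen from the sketch in Section 5.3.2:

# One replication of Model 3 with p reduced from 2000 to 200 for speed.
set.seed(1)
n <- 500; p <- 200; s <- 4; sigma2 <- 1
X <- matrix(runif(n * p), n, p)
y <- rowSums(cos(4 * pi * X[, 1:s])) + rnorm(n, sd = sqrt(sigma2))
sdi_screen(X, y, s)    # ideally a permutation of 1:4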

7.1 Exact support recovery

For our experiments on exact recovery, we fix the sparsity level s=4 and estimate the probability of exact support recovery by running 50 independent replications and computing the fraction of replications which exactly recover the support of the model. In Figure 1, we plot this estimated probability against various sample sizes n<p, namely, n\in\{100,200,300,\dots,1000\}. In agreement with Theorem 1, in Figure 1(a), we observe that SDI and SIS exhibit similar behavior for correlated Gaussian linear models, a case in which all methods appear to achieve model selection consistency. However, in Figure 1(b), we see that SDI is significantly more robust than SIS in a linear model whose marginal projections are nonlinear. As expected, Figures 1(c) and 1(d) show that SDI, NIS, SPAM, and MDI significantly outperform SIS and Lasso when the model has nonlinear and non-monotone additive components. For more irregular component functions such as sinusoids, SDI appears to outperform SPAM, as seen in Figure 1(c). In agreement with Theorem 3, Figure 1(e) shows that SDI appears to perform better for nonlinear monotone components, though all methods (even linear methods such as Lasso and SIS) appear to perform better as well. In general, for additive models, SDI appears to outperform MDI CART, though it appears to sacrifice a small amount of accuracy for simplicity compared to its progenitor MDI RF. In summary, in agreement with Theorems 2 and 3, Figure 1 demonstrates the robustness of SDI across various additive models.

Figure 1: Plots of estimated \mathbb{P}(\widehat{{\mathcal{S}}}={\mathcal{S}}) as n increases for various models (approximate signal to noise ratio in parentheses): (a) Model 1 (SNR 10.00); (b) Model 2 (SNR 1.23); (c) Model 3 (SNR 2.00); (d) Model 4 (SNR 9.01); (e) Model 5 (SNR 1.76).

7.2 Data driven choices of γn\gamma_{n}

We now consider the case when ss is unknown. As we have already demonstrated the performance of SDI in Section 7.1, here we examine how well the data-driven thresholding methods discussed in Section 2.5 can estimate the true sparsity level ss.

We first consider Model 5 and fix the sparsity level s=4s=4 and the sample size n=1000n=1000. In Figure 2(a), we plot the probability of exact support recovery (averaged over 5050 independent replications) using the permutation method against the number of permutations. After a very small number of permutations, the significance level and performance of the algorithm appear to stabilize. For comparison, in Figure 2(b), we show the graph of the ranked impurity reductions (SDI importance measures) used for the elbow method. When there is no correlation between irrelevant and relevant variables, as is the case with Model 5, the permutation method may be more precise than the elbow method.

However, as discussed in Section 2.5, the permutation method may be inaccurate if the irrelevant and relevant variables are correlated. To illustrate this, we now consider Model 1 with sparsity level s=4s=4. As we can see from Figure 2(c), the permutation method fails completely. Using 2020 permutations, it chooses a threshold γn0.024\gamma_{n}\approx 0.024, which will lead to all variables being selected, as can be seen from the ranked impurity reductions shown in Figure 2(d). In other words, when there is strong correlation between the variables, the permutation method significantly underestimates the threshold γn\gamma_{n}, which in turn, creates too many false positives. In contrast, as can be seen from Figure 2(d), the ranked impurity reductions still exhibit a distinct “elbow”, from which the relevant and irrelevant variables can be discerned.
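For completeness, here is a hedged sketch of a permutation-based threshold of the kind discussed above (Section 2.5 is not restated here; the variant below, which permutes the response, recomputes the impurity reductions, and takes their maximum over all variables and permutations as \gamma_{n}, is an assumption on our part):

# Assumed permutation-threshold variant: break the X-Y association by permuting y.
permutation_threshold <- function(X, y, n_perm = 20) {
  max(replicate(n_perm, max(apply(X, 2, sdi_score, y = sample(y)))))
}

Under strong correlation between relevant and irrelevant variables, the permuted scores understate the impurity reductions that irrelevant (but correlated) variables achieve on the original data, so the resulting \gamma_{n} is too small, matching the failure seen for Model 1.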

Figure 2: The probability of exact recovery as a function of the number of random permutations used in the permutation method is shown in panel (a) for Model 5 (no correlation between variables) and in panel (c) for Model 1 (with correlation). Panels (b) and (d) show the corresponding ranked impurity reductions used for the elbow method for Model 5 and Model 1, respectively.

8 Discussion and conclusion

In this paper, we developed a theoretically rigorous approach for variable selection based on decision trees. The underlying approach is simple, intuitive, and interpretable: we test whether a variable is relevant or irrelevant by fitting a decision stump to that variable and measuring how much of the variance of the response it explains. Despite its simplicity, SDI performs favorably relative to its less interpretable competitors. Furthermore, due to the parsimony of the model, there is no need to perform bandwidth selection or to calibrate the number of terms in basis expansions.

On the other hand, we have sacrificed generality for analytical tractability. That is, decision stumps are poor at capturing interaction effects and, therefore, an importance measure built from a multi-level decision tree (such as MDI) may be more appropriate for models with more than just main effects. However, the presence of multi-level splits adds an additional layer of complexity to the analysis that, at the moment, we do not know how to overcome. It is also unclear how to leverage these additional splits to strengthen the theory. While out of the scope of the present paper, we view this as an important problem for future investigation. It is our hope that the tools in this paper can be used by other scholars to address this important issue.

To conclude, we hope that our analysis of single-level decision trees for variable selection will shed further light on the unique benefits of tree structured learning.

9 Appendix


In this appendix, we first prove Lemma 2 in detail in Appendix A. We then prove Theorem 1 on linear models with Gaussian variates in Appendix B and then use Theorem 1 to determine a sufficient sample size for model selection consistency (mentioned in Section 6) in Appendix C. Finally, we prove Propositions 1 and 2 in Appendix D and then prove Theorems 2 and 3 in Appendix E.

A Proof of Lemma 2

Let π\pi be a permutation of the data such that Xπ(1)jXπ(2)jXπ(n)jX_{\pi(1)j}\leq X_{\pi(2)j}\leq\cdots\leq X_{\pi(n)j}. Recall from the representation (6) that we have

Δ^(Xj,Y)=max1knkn(1kn)(1kXijXπ(k)jYi1nkXij>Xπ(k)jYi(III))2.\widehat{\Delta}(X_{j},Y)=\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\bigg{(}\underbrace{\frac{1}{k}\sum_{X_{ij}\leq X_{\pi(k)j}}Y_{i}-\frac{1}{n-k}\sum_{X_{ij}>X_{\pi(k)j}}Y_{i}}_{\text{(III)}}\bigg{)}^{2}.

Now, since

i=1n(𝟏(XijXπ(k)j)k𝟏(Xij>Xπ(k)j)nk)=0,\sum_{i=1}^{n}\bigg{(}\frac{\mathbf{1}(X_{ij}\leq X_{\pi(k)j})}{k}-\frac{\mathbf{1}(X_{ij}>X_{\pi(k)j})}{n-k}\bigg{)}=0,

we can rewrite (III) as

1kXijXπ(k)j(Yifj(Xij))(a)1nkXij>Xπ(k)j(Yifj(Xij))(b)\displaystyle\underbrace{\frac{1}{k}\sum_{X_{ij}\leq X_{\pi(k)j}}(Y_{i}-f_{j}(X_{ij}))}_{\text{(a)}}-\underbrace{\frac{1}{n-k}\sum_{X_{ij}>X_{\pi(k)j}}(Y_{i}-f_{j}(X_{ij}))}_{\text{(b)}}
+i=1n(fj(Xij)1ni=1nfj(Xij))(𝟏(XijXπ(k)j)k𝟏(Xij>Xπ(k)j)nk)(c).\displaystyle+\underbrace{\sum_{i=1}^{n}\Bigg{(}f_{j}(X_{ij})-\frac{1}{n}\sum_{i=1}^{n}f_{j}(X_{ij})\Bigg{)}\bigg{(}\frac{\mathbf{1}(X_{ij}\leq X_{\pi(k)j})}{k}-\frac{\mathbf{1}(X_{ij}>X_{\pi(k)j})}{n-k}\bigg{)}}_{\text{(c)}}.

Therefore, we have that

Δ^(Xj,Y)\displaystyle\widehat{\Delta}(X_{j},Y) =max1knkn(1kn)((a)(b)+(c))2\displaystyle=\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}(\text{(a)}-\text{(b)}+\text{(c)})^{2} (A.1)
3max1knkn(1kn)(a)2+3max1knkn(1kn)(b)2+\displaystyle\leq 3\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(a)}^{2}+3\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(b)}^{2}+
3max1knkn(1kn)(c)2,\displaystyle\qquad 3\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(c)}^{2},

where we use, in succession, the inequality (xy+z)23(x2+y2+z2)(x-y+z)^{2}\leq 3(x^{2}+y^{2}+z^{2}) for any real numbers xx, yy, and zz, and the fact that the maximum of a sum is at most the sum of the maxima. To finish the proof, we will bound the terms involving (a)2\text{(a)}^{2}, (b)2\text{(b)}^{2}, and (c)2\text{(c)}^{2} separately.

For the last term in (A.1), notice that by the Cauchy-Schwarz inequality we have

kn(1kn)(c)2\displaystyle\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(c)}^{2} =kn(1kn)[i=1n(fj(Xij)1ni=1nfj(Xij))×\displaystyle=\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\Bigg{[}\sum_{i=1}^{n}\Big{(}f_{j}(X_{ij})-\frac{1}{n}\sum_{i=1}^{n}f_{j}(X_{ij})\Big{)}\times
(𝟏(XijXπ(k)j)k𝟏(Xij>Xπ(k)j)nk)]2\displaystyle\qquad\qquad\qquad\bigg{(}\frac{\mathbf{1}(X_{ij}\leq X_{\pi(k)j})}{k}-\frac{\mathbf{1}(X_{ij}>X_{\pi(k)j})}{n-k}\bigg{)}\Bigg{]}^{2}
kn(1kn)i=1n(fj(Xij)1ni=1nfj(Xij))2×\displaystyle\leq\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\sum_{i=1}^{n}\Big{(}f_{j}(X_{ij})-\frac{1}{n}\sum_{i=1}^{n}f_{j}(X_{ij})\Big{)}^{2}\times
i=1n(𝟏(XijXπ(k)j)k𝟏(Xij>Xπ(k)j)nk)2,\displaystyle\qquad\qquad\qquad\sum_{i=1}^{n}\bigg{(}\frac{\mathbf{1}(X_{ij}\leq X_{\pi(k)j})}{k}-\frac{\mathbf{1}(X_{ij}>X_{\pi(k)j})}{n-k}\bigg{)}^{2},

which is exactly equal to

kn(1kn)[nVar^(fj(Xj))(k1k2+(nk)1(nk)2)]=Var^(fj(Xj)).\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\Big{[}n\widehat{\mathrm{Var}}(f_{j}(X_{j}))\Big{(}k\cdot\frac{1}{k^{2}}+(n-k)\cdot\frac{1}{(n-k)^{2}}\Big{)}\Big{]}=\widehat{\mathrm{Var}}(f_{j}(X_{j})).

Therefore we have shown that

max1knkn(1kn)(c)2Var^(fj(Xj)).\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(c)}^{2}\leq\widehat{\mathrm{Var}}(f_{j}(X_{j})). (A.2)

To bound the first term in (A.1), by a union bound we have that

(max1knkn(1kn)(a)2>ξ26)\displaystyle\mathbb{P}\Big{(}\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(a)}^{2}>\frac{\xi^{2}}{6}\Big{)}
k=1n(kn(1kn)(a)2>ξ26)\displaystyle\leq\sum_{k=1}^{n}\mathbb{P}\Big{(}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(a)}^{2}>\frac{\xi^{2}}{6}\Big{)}
=k=1n(kn(1kn)(1kXijXπ(k)j(Yifj(Xij)))2>ξ26).\displaystyle=\sum_{k=1}^{n}\mathbb{P}\Big{(}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\Big{(}\frac{1}{k}\sum_{X_{ij}\leq X_{\pi(k)j}}(Y_{i}-f_{j}(X_{ij}))\Big{)}^{2}>\frac{\xi^{2}}{6}\Big{)}. (A.3)

Next, notice that, conditional on X1j,,XnjX_{1j},\dots,X_{nj}, XijXπ(k)j(Yifj(Xij))\sum_{X_{ij}\leq X_{\pi(k)j}}(Y_{i}-f_{j}(X_{ij})) is a sum of kk independent, sub-Gaussian, mean zero random variables. Thus, by the law of total probability, we have that (A.3) is equal to

k=1n𝔼[(|1kXijXπ(k)j(Yifj(Xij))|>ξn26k(nk)|X1j,,Xnj)]\sum_{k=1}^{n}\mathbb{E}\Bigg{[}\mathbb{P}\Bigg{(}\bigg{|}\frac{1}{k}\sum_{X_{ij}\leq X_{\pi(k)j}}(Y_{i}-f_{j}(X_{ij}))\bigg{|}>\xi\sqrt{\frac{n^{2}}{6k(n-k)}}\quad\Bigg{|}\quad X_{1j},\dots,X_{nj}\Bigg{)}\Bigg{]}

and, by Hoeffding’s inequality for sub-Gaussian random variables, is bounded by

k=1n2exp(kξ2n212k(nk)σZj2)2nexp(ξ2n12σZj2).\sum_{k=1}^{n}2\exp\Big{(}-k\frac{\xi^{2}n^{2}}{12k(n-k)\sigma^{2}_{Z_{j}}}\Big{)}\leq 2n\exp\Big{(}-\frac{\xi^{2}n}{12\sigma^{2}_{Z_{j}}}\Big{)}.

Note that here we have implicitly used the fact that \mathbb{E}[\exp(\lambda Z_{j})|X_{j}]\leq\exp(\lambda^{2}\sigma^{2}_{Z_{j}}/2). It thus follows that, with probability at least 1-2n\exp\big{(}-\frac{\xi^{2}n}{12\sigma^{2}_{Z_{j}}}\big{)},

max1knkn(1kn)(a)2ξ26.\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(a)}^{2}\leq\frac{\xi^{2}}{6}. (A.4)

A similar argument shows that, with probability at least 1-2n\exp\Big{(}-\frac{\xi^{2}n}{12\sigma^{2}_{Z_{j}}}\Big{)}, the second term in (A.1) obeys

max1knkn(1kn)(b)2ξ26.\max_{1\leq k\leq n}\frac{k}{n}\Big{(}1-\frac{k}{n}\Big{)}\text{(b)}^{2}\leq\frac{\xi^{2}}{6}. (A.5)

Therefore, substituting (A.2), (A.4), and (A.5) into (A.1) and using a union bound, it follows that with probability at least 14nexp(ξ2n12σZj2)1-4n\exp\big{(}-\frac{\xi^{2}n}{12\sigma^{2}_{Z_{j}}}\big{)},

Δ^(Xj,Y)3Var^(fj(Xj))+ξ2.\widehat{\Delta}(X_{j},Y)\leq 3\widehat{\mathrm{Var}}(f_{j}(X_{j}))+\xi^{2}.

B Proof of Theorem 1

The goal of this section is to prove Theorem 1. In Section B.1, we prove the lower bound (10), and in Section B.2, we prove the upper bound (11).

Throughout this section, for brevity, we let ρj=Cor(Xj,Y)0\rho_{j}={\rm{Cor}}(X_{j},Y)\neq 0.

B.1 Proof of the lower bound (10)

Choosing h(X_{j})=X_{j} (which is monotone) in Lemma 1, we get that

Δ^(Xj,Y)1log(2n)+1×Cov^2(XjVar^(Xj),Y)=Var^(Y)log(2n)+1×ρ^ 2(Xj,Y).\widehat{\Delta}(X_{j},Y)\geq\frac{1}{\log(2n)+1}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{X_{j}}{\sqrt{\widehat{\mathrm{Var}}(X_{j})}},Y\Bigg{)}=\frac{\widehat{\mathrm{Var}}(Y)}{\log(2n)+1}\times\widehat{\rho}^{\;2}(X_{j},Y).

Now observe that \widehat{\rho}\,(X_{j},Y) is the Pearson sample correlation between two jointly normal random variables. If \rho_{j}>0, by [18, Equation (44)], we have that

(ρ^(Xj,Y)>(1δ)ρj)\displaystyle\mathbb{P}(\widehat{\rho}\,(X_{j},Y)>(1-\delta)\rho_{j}) (B.6)
=1(ρ^(Xj,Y)>(1δ)ρj)\displaystyle=1-\mathbb{P}(\widehat{\rho}\,(-X_{j},Y)>-(1-\delta)\rho_{j})
1(2π)1/2Γ(n)Γ(n+1/2)(1ρj2)n/2(1[(1δ)ρj]2)(n1)/2\displaystyle\sim 1-(2\pi)^{-1/2}\frac{\Gamma(n)}{\Gamma(n+1/2)}(1-\rho_{j}^{2})^{n/2}(1-[(1-\delta)\rho_{j}]^{2})^{(n-1)/2}
×((1δ)ρj(ρj))1(1(ρj)((1δ)ρj))n+3/2(1+𝒪(n1)).\displaystyle\qquad\times(-(1-\delta)\rho_{j}-(-\rho_{j}))^{-1}(1-(-\rho_{j})(-(1-\delta)\rho_{j}))^{-n+3/2}(1+{\mathcal{O}}(n^{-1})).

If ρj<0\rho_{j}<0 we can show the same bound on (ρ^(Xj,Y)<(1δ)ρj)\mathbb{P}(\widehat{\rho}\,(X_{j},Y)<(1-\delta)\rho_{j}). Again by [18, Equation (44)], we have the similar bound

(ρ^(Xj,Y)<(1δ)ρj)\displaystyle\mathbb{P}(\widehat{\rho}\,(X_{j},Y)<(1-\delta)\rho_{j}) (B.7)
=1(ρ^(Xj,Y)>(1δ)ρj)\displaystyle=1-\mathbb{P}(\widehat{\rho}\,(X_{j},Y)>(1-\delta)\rho_{j})
1(2π)1/2Γ(n)Γ(n+1/2)(1ρj2)n/2(1[(1δ)ρj]2)(n1)/2\displaystyle\sim 1-(2\pi)^{-1/2}\frac{\Gamma(n)}{\Gamma(n+1/2)}(1-\rho_{j}^{2})^{n/2}(1-[(1-\delta)\rho_{j}]^{2})^{(n-1)/2}
×((1δ)ρjρj)1(1ρj×(1δ)ρj))n+3/2(1+𝒪(n1)).\displaystyle\qquad\times((1-\delta)\rho_{j}-\rho_{j})^{-1}(1-\rho_{j}\times(1-\delta)\rho_{j}))^{-n+3/2}(1+{\mathcal{O}}(n^{-1})).

Therefore because of (B.6) and (B.7), regardless of the sign of ρj\rho_{j}, it follows that there exists a universal constant C0C_{0} for which

(|ρ^(Xj,Y)|>(1δ)|ρj|)\displaystyle\mathbb{P}(|\widehat{\rho}\,(X_{j},Y)|>(1-\delta)|\rho_{j}|)
1C02πδ|ρj|Γ(n)Γ(n+1/2)(1ρj2)n2(1(1δ)2ρj2)n12(1(1δ)ρj2)n+32\displaystyle\geq 1-\frac{C_{0}}{\sqrt{2\pi}\delta|\rho_{j}|}\frac{\Gamma(n)}{\Gamma(n+1/2)}(1-\rho_{j}^{2})^{\frac{n}{2}}(1-(1-\delta)^{2}\rho_{j}^{2})^{\frac{n-1}{2}}(1-(1-\delta)\rho_{j}^{2})^{-n+\frac{3}{2}}
1C0nδ|ρj|exp(ρj2n/2(1δ)2ρj2(n1)/2+(1δ)ρj2(n3/2))\displaystyle\geq 1-\frac{C_{0}}{\sqrt{n}\delta|\rho_{j}|}\exp\big{(}-\rho_{j}^{2}n/2-(1-\delta)^{2}\rho_{j}^{2}(n-1)/2+(1-\delta)\rho_{j}^{2}(n-3/2)\big{)} (B.8)
=1C0nδ2ρj2exp(ρj2nδ2/2+ρj2(1δ)2/23(1δ)ρj2/2)\displaystyle=1-\frac{C_{0}}{\sqrt{n\delta^{2}\rho_{j}^{2}}}\exp\big{(}-\rho_{j}^{2}n\delta^{2}/2+\rho_{j}^{2}(1-\delta)^{2}/2-3(1-\delta)\rho_{j}^{2}/2\big{)}
1C0nδ2ρj2exp(ρj2nδ2/2),\displaystyle\geq 1-\frac{C_{0}}{\sqrt{n\delta^{2}\rho_{j}^{2}}}\exp(-\rho_{j}^{2}n\delta^{2}/2),

where we used exp(x)1+x\exp(x)\geq 1+x and Wendel’s inequality [45] Γ(n)Γ(n+1/2)n+1/2n1n2πn\frac{\Gamma(n)}{\Gamma(n+1/2)}\leq\sqrt{\frac{n+1/2}{n}}\frac{1}{\sqrt{n}}\leq\sqrt{\frac{2\pi}{n}} in the second inequality (B.8).

Thus, with probability at least 1-\frac{C_{0}}{\sqrt{n\delta^{2}\rho_{j}^{2}}}\exp(-\rho_{j}^{2}n\delta^{2}/2), we have

Δ^(Xj,Y)(1δ)2Var^(Y)ρj2log(2n)+1ρ^ 2(Y^(Xj),Y)(1δ)2ρj2log(2n)+1.\widehat{\Delta}(X_{j},Y)\geq\frac{(1-\delta)^{2}\widehat{\mathrm{Var}}(Y)\rho^{2}_{j}}{\log(2n)+1}\iff\widehat{\rho}^{\,2}(\widehat{Y}(X_{j}),Y)\geq\frac{(1-\delta)^{2}\rho_{j}^{2}}{\log(2n)+1}.

This completes the first half of the proof of Theorem 1.

B.2 Proof of the upper bound (11)

We first state the following sample variance concentration inequality, which will be helpful.

Lemma B.1.

Let Z1,,ZnZ_{1},\dots,Z_{n} be i.i.d. 𝒩(0,σZ2){\mathcal{N}}(0,\sigma^{2}_{Z}). For any 0<δ<10<\delta<1, we have

(Var^(Z)(1δ)n1nσZ2)1exp(δ2(n1)/4)\mathbb{P}(\widehat{\mathrm{Var}}(Z)\geq(1-\delta)\frac{n-1}{n}\sigma^{2}_{Z})\geq 1-\exp(-\delta^{2}(n-1)/4) (B.9)

and

(Var^(Z)(1+δ)n1nσZ2)1exp((n1)(1+δ1+2δ)/2).\mathbb{P}(\widehat{\mathrm{Var}}(Z)\leq(1+\delta)\frac{n-1}{n}\sigma^{2}_{Z})\geq 1-\exp(-(n-1)(1+\delta-\sqrt{1+2\delta})/2). (B.10)
Proof of Lemma B.1.

Since ZiZ_{i} are independent and normally distributed, by Cochran’s theorem we have Var^(Z)σZ2nχn12\widehat{\mathrm{Var}}(Z)\sim\frac{\sigma^{2}_{Z}}{n}\chi^{2}_{n-1}. In the notation of [28], choosing D=n1D=n-1 and x=δ2(n1)/4x=\delta^{2}(n-1)/4 for the chi-squared concentration inequality (4.4) in [28], we have that

(Var^(Z)(1δ)n1nσZ2)\displaystyle\mathbb{P}\Big{(}\widehat{\mathrm{Var}}(Z)\geq(1-\delta)\frac{n-1}{n}\sigma^{2}_{Z}\Big{)} =1(χn12<(1δ)(n1))\displaystyle=1-\mathbb{P}(\chi_{n-1}^{2}<(1-\delta)(n-1))
1exp(δ2(n1)/4),\displaystyle\geq 1-\exp(-\delta^{2}(n-1)/4),

proving (B.9). For (B.10), choosing D=n1D=n-1 and x=(n1)(1+δ1+2δ)/2x=(n-1)(1+\delta-\sqrt{1+2\delta})/2 in [28, Equation (4.3)] we see that

(Var^(Z)(1+δ)n1nσZ2)\displaystyle\mathbb{P}\Big{(}\widehat{\mathrm{Var}}(Z)\leq(1+\delta)\frac{n-1}{n}\sigma^{2}_{Z}\Big{)} =1(χn12>(1+δ)(n1))\displaystyle=1-\mathbb{P}(\chi_{n-1}^{2}>(1+\delta)(n-1))
1exp((n1)(1+δ1+2δ)/2).\displaystyle\geq 1-\exp(-(n-1)(1+\delta-\sqrt{1+2\delta})/2).\qed

Now we are ready to prove the upper bound (11). We begin with the inequality (13), as shown in the proof sketch of Theorem 1. We aim to upper bound the right hand side of (13) using Lemma B.1. Since the samples X1j,,XnjX_{1j},\dots,X_{nj} are i.i.d., using (B.10) and choosing δ=1\delta=1, we find that with probability at least 1exp((n1)232)1exp((n1)/16)1-\exp\big{(}-(n-1)\frac{2-\sqrt{3}}{2}\big{)}\geq 1-\exp(-(n-1)/16), we have that Var^(Xj)2σXj2\widehat{\mathrm{Var}}(X_{j})\leq 2\sigma^{2}_{X_{j}}. Similarly, choosing δ=1/2\delta=1/2 in (B.9), we also have that with probability at least 1exp((n1)/16)1-\exp(-(n-1)/16) that Var^(Y)σY2/4\widehat{\mathrm{Var}}(Y)\geq\sigma_{Y}^{2}/4. Substituting these concentration inequalities into the right hand side of (13), it follows by a union bound that with probability at least 14nexp(nδ2/12)2exp((n1)/16)1-4n\exp(-n\delta^{2}/12)-2\exp(-(n-1)/16),

Δ^(Xj,Y)24Var^(Y)ρj2+4δ2Var^(Y)ρ^ 2(Y^(Xj),Y)24ρj2+4δ2.\widehat{\Delta}(X_{j},Y)\leq 24\widehat{\mathrm{Var}}(Y)\rho_{j}^{2}+4\delta^{2}\widehat{\mathrm{Var}}(Y)\iff\widehat{\rho}^{\,2}(\widehat{Y}(X_{j}),Y)\leq 24\rho_{j}^{2}+4\delta^{2}.

Finally, noticing that 24ρj2+4δ2<5|ρj|+2δ\sqrt{24\rho_{j}^{2}+4\delta^{2}}<5|\rho_{j}|+2\delta completes the proof.

C Proof of Model Selection Consistency for Linear Models

Recall the setting mentioned under the heading “Minimum sample size for consistency” in Section 6, which considers the same linear model with Gaussian variates from Theorem 1. To reiterate, we assume that 𝚺=𝐈p×p\boldsymbol{\Sigma}=\mathbf{I}_{p\times p} is the p×pp\times p identity matrix, k=1pβk2=𝒪(1)\sum_{k=1}^{p}\beta^{2}_{k}={\mathcal{O}}(1), and minj𝒮|βj|21/s\min_{j\in{\mathcal{S}}}|\beta_{j}|^{2}\asymp 1/s, all of which are special cases of the more general setting considered in [41, Corollary 1]. Under these assumptions, we then have ρ2(Xj,Y)=βj2/(σ2+k=1pβk2)1/s\rho^{2}(X_{j},Y)=\beta^{2}_{j}/(\sigma^{2}+\sum_{k=1}^{p}\beta^{2}_{k})\gtrsim 1/s for any j𝒮j\in{\mathcal{S}} and ρ(Xj,Y)=0\rho(X_{j},Y)=0 for j𝒮cj\in{\mathcal{S}}^{c}. Our goal is to show that nslog(n)log(n(ps))n\asymp s\log(n)\log(n(p-s)) samples suffice for high probability model selection consistency.

Choosing δ=1/2\delta=1/2 in (10) applied to j𝒮j\in{\mathcal{S}} and using (7), there exists a universal positive constant C0C_{0} such that with probability at least 12C0nρ2(Xj,Y)exp(nρ2(Xj,Y)/8)1-\frac{2C_{0}}{\sqrt{n\rho^{2}(X_{j},Y)}}\exp(-n\rho^{2}(X_{j},Y)/8), we have

Δ^(Xj,Y)\displaystyle\widehat{\Delta}(X_{j},Y) Var^(Y)×ρ2(Xj,Y)4(log(2n)+1)\displaystyle\geq\widehat{\mathrm{Var}}(Y)\times\frac{\rho^{2}(X_{j},Y)}{4(\log(2n)+1)}
=Var^(Y)4(log(2n)+1)×βj2σ2+k=1pβk2\displaystyle=\frac{\widehat{\mathrm{Var}}(Y)}{4(\log(2n)+1)}\times\frac{\beta_{j}^{2}}{\sigma^{2}+\sum_{k=1}^{p}\beta_{k}^{2}}
Var^(Y)slog(n).\displaystyle\gtrsim\frac{\widehat{\mathrm{Var}}(Y)}{s\log(n)}.

Therefore by a union bound over all ss relevant variables, we have that with probability at least 1smaxj𝒮{2C0nρ2(Xj,Y)exp(nρ2(Xj,Y)/8)}1-s\max_{j\in{\mathcal{S}}}\{\frac{2C_{0}}{\sqrt{n\rho^{2}(X_{j},Y)}}\exp(-n\rho^{2}(X_{j},Y)/8)\},

Δ^(Xj,Y)Var^(Y)slog(n)j𝒮.\widehat{\Delta}(X_{j},Y)\gtrsim\frac{\widehat{\mathrm{Var}}(Y)}{s\log(n)}\qquad\forall j\in{\mathcal{S}}. (C.11)

Furthermore, by applying (11) for j𝒮cj\in{\mathcal{S}}^{c} (and noting that ρ(Xj,Y)=0\rho(X_{j},Y)=0) and using (7), with probability at least 14nexp(δ2n/12)2exp((n1)/16)1-4n\exp(-\delta^{2}n/12)-2\exp(-(n-1)/16), we have

Δ^(Xj,Y)4Var^(Y)δ2.\widehat{\Delta}(X_{j},Y)\leq 4\widehat{\mathrm{Var}}(Y)\delta^{2}.

Therefore by a union bound over all psp-s irrelevant variables we have that with probability at least 14n(ps)exp(δ2n/12)2(ps)exp((n1)/16)1-4n(p-s)\exp(-\delta^{2}n/12)-2(p-s)\exp(-(n-1)/16),

Δ^(Xj,Y)4Var^(Y)δ2j𝒮c.\widehat{\Delta}(X_{j},Y)\leq 4\widehat{\mathrm{Var}}(Y)\delta^{2}\qquad\forall j\in{\mathcal{S}}^{c}.

Now, choosing δ2=C3slog(n)\delta^{2}=\frac{C_{3}}{s\log(n)} for some appropriate constant C3>0C_{3}>0 which only depends on σ2\sigma^{2} to match (C.11), we see by a union bound that

(𝒮^=𝒮)\displaystyle\mathbb{P}(\widehat{{\mathcal{S}}}={\mathcal{S}}) 1maxj𝒮{2C0snρ2(Xj,Y)exp(nρ2(Xj,Y)8)}\displaystyle\geq 1-\max_{j\in{\mathcal{S}}}\Big{\{}\frac{2C_{0}s}{\sqrt{n\rho^{2}(X_{j},Y)}}\exp\Big{(}-\frac{n\rho^{2}(X_{j},Y)}{8}\Big{)}\Big{\}} (C.12)
4n(ps)exp(C3n12slog(n))2(ps)exp((n1)16).\displaystyle\qquad-4n(p-s)\exp\Big{(}-\frac{C_{3}n}{12s\log(n)}\Big{)}-2(p-s)\exp\Big{(}-\frac{(n-1)}{16}\Big{)}.

Since ρ2(Xj,Y)1/s\rho^{2}(X_{j},Y)\gtrsim 1/s for all j𝒮j\in{\mathcal{S}}, (C.12) implies that if n(ps)exp(C3n12slog(n))0n(p-s)\exp\big{(}-\frac{C_{3}n}{12s\log(n)}\big{)}\rightarrow 0, then (𝒮^=𝒮)1\mathbb{P}(\widehat{{\mathcal{S}}}={\mathcal{S}})\rightarrow 1. Hence, a sufficient sample size for consistent support recovery is

nslog(n)log(n(ps)),n\asymp s\log(n)\log(n(p-s)),

as desired.

D Proof of Propositions 1 and 2

This section is mainly devoted to proving Proposition 1. We first present the machinery that will be used in its proof. At the end of the section, we complete the proof of Proposition 1 and then prove Proposition 2 by recycling and simplifying the argument.

First, we state and prove a lemma that will be used in later proofs. Though stated in terms of general probability measures, we will be specifically interested in the case where \mathbb{P} is the empirical probability measure n\mathbb{P}_{n} and 𝔼\mathbb{E} is the empirical expectation 𝔼n\mathbb{E}_{n}, both with respect to a sample of size nn.

Lemma D.2.

For any random variables UU and VV with finite second moments with respect to a probability measure \mathbb{P},

Cov(UVar(U),V)Var(V)2𝔼[(UV)2].\mathrm{Cov}_{\mathbb{P}}\Big{(}\frac{U}{\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}},V\Big{)}\geq\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}-2\sqrt{\mathbb{E}_{\mathbb{P}}[(U-V)^{2}]}.
Proof of Lemma D.2.

First note that by the triangle inequality,

|Var(U)Var(V)|Var(UV)𝔼[(UV)2].\big{|}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{|}\leq\sqrt{\mathrm{Var}_{\mathbb{P}}(U-V)}\leq\sqrt{\mathbb{E}_{\mathbb{P}}{[(U-V)^{2}]}}. (D.13)

To complete the proof, we write

Cov(U,V)=Var(U)Var(V)+(Cov(U,V)Var(U)Var(V))\mathrm{Cov}_{\mathbb{P}}(U,V)=\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}+\big{(}\mathrm{Cov}_{\mathbb{P}}(U,V)-\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{)} (D.14)

and apply (D.13) to arrive at

|Cov(U,V)Var(U)Var(V)|\displaystyle\big{|}\mathrm{Cov}_{\mathbb{P}}(U,V)-\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{|} (D.15)
=|Cov(U,V)Var(U)+Var(U)(Var(U)Var(V))|\displaystyle=\big{|}\mathrm{Cov}_{\mathbb{P}}(U,V)-\mathrm{Var}_{\mathbb{P}}(U)+\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\big{(}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{)}\big{|}
|Cov(U,V)Var(U)|+Var(U)×|Var(U)Var(V)|\displaystyle\leq\big{|}\mathrm{Cov}_{\mathbb{P}}(U,V)-\mathrm{Var}_{\mathbb{P}}(U)\big{|}+\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\times\big{|}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{|}
\displaystyle\leq\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathrm{Var}_{\mathbb{P}}(U-V)}+\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\times\big{|}\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}-\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}\big{|} (D.16)
\displaystyle\leq 2\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\sqrt{\mathbb{E}_{\mathbb{P}}{[(U-V)^{2}]}},

where (D.16) follows by applying the Cauchy-Schwarz inequality to \mathrm{Cov}_{\mathbb{P}}(U,V)-\mathrm{Var}_{\mathbb{P}}(U)=\mathrm{Cov}_{\mathbb{P}}(U,V-U), and the last line follows from (D.13) together with \mathrm{Var}_{\mathbb{P}}(U-V)\leq\mathbb{E}_{\mathbb{P}}[(U-V)^{2}]. Substituting (D.15) into (D.14), we get

Cov(U,V)Var(U)(Var(V)2𝔼[(UV)2]),\mathrm{Cov}_{\mathbb{P}}(U,V)\geq\sqrt{\mathrm{Var}_{\mathbb{P}}(U)}\big{(}\sqrt{\mathrm{Var}_{\mathbb{P}}(V)}-2\sqrt{\mathbb{E}_{\mathbb{P}}{[(U-V)^{2}]}}\big{)},

which proves the claim. ∎

The following sample variance concentration inequality will also come in handy.

Lemma D.3 (Equation 5, [34]).

Let UU be a random variable bounded by BB. Then for all γ>0\gamma>0,

(nn1Var^(U)Var(U)γ)1exp((n1)γ28B2Var(U)).\mathbb{P}\Big{(}\frac{n}{n-1}\widehat{\mathrm{Var}}(U)\geq\mathrm{Var}(U)-\gamma\Big{)}\geq 1-\exp\Big{(}-\frac{(n-1)\gamma^{2}}{8B^{2}\mathrm{Var}(U)}\Big{)}.

As explained in the main text, the key step in the proof of Proposition 1 is to apply Lemma 1 with a good approximation fj~()\widetilde{f_{j}}(\cdot) to the marginal projection fj()f_{j}(\cdot) that also has a sufficiently large number of data points in every one of its stationary intervals. The following lemma provides the precise properties of such an approximation.

Lemma D.4.

Suppose fj()f_{j}(\cdot) satisfies Assumption 2. Let a>0a>0 be a positive constant and let MM be a positive integer such that Ma1Ma\leq 1. There exists a function f~j()\widetilde{f}_{j}(\cdot) with at most MM stationary intervals such that, with probability at least 12nexp(na/12)1-2n\exp(-na/12), both of the following statements are simultaneously true:

  1.

    The number of data points in any stationary interval of f~j()\widetilde{f}_{j}(\cdot) is between na/2na/2 and 3na/23na/2.

  2.

    𝔼n[(fj(Xj)f~j(Xj))2]K0Md+K1(Ma)5/2\sqrt{\mathbb{E}_{n}[(f_{j}(X_{j})-\widetilde{f}_{j}(X_{j}))^{2}]}\leq K_{0}M^{-d}+K_{1}(Ma)^{5/2}, where K0K_{0} and K1K_{1} are some constants depending on dd, BB, α\alpha, and rr, and 𝔼n\mathbb{E}_{n} denotes the empirical expectation.

Proof of Lemma D.4.

Recall that f_{j}(\cdot) is defined over \mathbb{R}, so even though we restrict our attention to [0,1], we can assume it is bounded over the larger interval [-1,2] for all j. By [23, Theorem VIII], there exists a polynomial P_{M}(\cdot) of degree M+1>r such that

supx[1,2]|fj(x)PM(x)|\displaystyle\sup_{x\in[-1,2]}|f_{j}(x)-P_{M}(x)| 3dLA(M+1)r(M+1r)α,\displaystyle\leq 3^{d}LA(M+1)^{-r}(M+1-r)^{-\alpha},

where A=(r+1)r1(K/2)r(K/2+2)r!A=\frac{(r+1)^{r-1}(K/2)^{r}(K/2+2)}{r!}, LL is the Lipschitz constant from Assumption 2, and KK is a universal constant given in [23, Theorem I]. Since M+1M+1rr+1\frac{M+1}{M+1-r}\leq r+1 whenever M+1>rM+1>r, we also have that

supx[1,2]|fj(x)PM(x)|3d(r+1)αLA(M+1)dK0Md,\sup_{x\in[-1,2]}|f_{j}(x)-P_{M}(x)|\leq 3^{d}(r+1)^{\alpha}LA(M+1)^{-d}\leq K_{0}M^{-d}, (D.17)

where K0=3d(r+1)αLAK_{0}=3^{d}(r+1)^{\alpha}LA is a constant.

Let {ξk}1kM\{\xi_{k}\}_{1\leq k\leq M} be the collection of (at most) MM stationary points of PM()P_{M}(\cdot) in [0,1][0,1]. We assume that a<mink(ξkξk1)a<\min_{k}(\xi_{k}-\xi_{k-1}); otherwise we remove points from {ξk}1kM\{\xi_{k}\}_{1\leq k\leq M} until this holds. Let Ik=[ξk,ξk+a]I_{k}=[\xi_{k},\xi_{k}+a].

We will now show that Properties 1 and 2 of Lemma D.4 are satisfied by the function

f~j(x)={PM(x)xkIkPM(ξk)xIkfor somek.\widetilde{f}_{j}(x)=\begin{cases}P_{M}(x)&x\notin\bigcup_{k}I_{k}\\ P_{M}(\xi_{k})&x\in I_{k}\;\text{for some}\;k.\end{cases} (D.18)

First, it is clear that f~j()\widetilde{f}_{j}(\cdot) has at most MM stationary intervals. Next, notice that by the multiplicative version of Chernoff’s inequality [1, Proposition 2.4], since each stationary interval IkI_{k} has length aa, we have

(#{XijIk}na/2)>1exp(na/8),\mathbb{P}(\#\{X_{ij}\in I_{k}\}\geq na/2)>1-\exp(-na/8),

and

(#{XijIk}3na/2)>1exp(na/12).\mathbb{P}(\#\{X_{ij}\in I_{k}\}\leq 3na/2)>1-\exp(-na/12).

Thus, by a union bound, the probability that

na/2#{XijIk}3na/2kna/2\leq\#\{X_{ij}\in I_{k}\}\leq 3na/2\qquad\forall\;k

is at least 12Mexp(na/12)1-2M\exp(-na/12). Since all stationary intervals are disjoint and are contained in [0,1][0,1], we have Ma1Ma\leq 1 or M1/aM\leq 1/a, which implies that 12Mexp(na/12)1(2/a)exp(na/12)1-2M\exp(-na/12)\geq 1-(2/a)\exp(-na/12). Therefore, we know that

(3na/2#{XijIk}na/2for allk)max{0,1(2/a)exp(na/12)}.\mathbb{P}(3na/2\geq\#\{X_{ij}\in I_{k}\}\geq na/2\;\text{for all}\;k)\geq\max\{0,1-(2/a)\exp(-na/12)\}. (D.19)

Notice that if a1/na\leq 1/n, then

max{0,1(2/a)exp(na/12)}=012nexp(na/12)\max\{0,1-(2/a)\exp(-na/12)\}=0\geq 1-2n\exp(-na/12)

and if a>1/na>1/n, then

max{0,1(2/a)exp(na/12)}\displaystyle\max\{0,1-(2/a)\exp(-na/12)\} 1(2/a)exp(na/12)\displaystyle\geq 1-(2/a)\exp(-na/12)
12nexp(na/12).\displaystyle\geq 1-2n\exp(-na/12).

Thus, for all a0a\geq 0, we have

max{0,1(2/a)exp(na/12)}12nexp(na/12),\max\{0,1-(2/a)\exp(-na/12)\}\geq 1-2n\exp(-na/12),

which when combined with (D.19) proves Property 1 of Lemma D.4.

To prove Property 2 of Lemma D.4 we use the triangle inequality and part of Property 1:

(#{XijIk}3na/2for allk)12nexp(na/12).\mathbb{P}(\#\{X_{ij}\in I_{k}\}\leq 3na/2\;\text{for all}\;k)\geq 1-2n\exp(-na/12). (D.20)

In view of (D.17), to bound 𝔼n[(fj(Xj)f~j(Xj))2]\sqrt{\mathbb{E}_{n}[(f_{j}(X_{j})-\widetilde{f}_{j}(X_{j}))^{2}]} we aim to bound 1ni=1n(PM(Xij)f~j(Xij))2.\frac{1}{n}\sum_{i=1}^{n}(P_{M}(X_{ij})-\widetilde{f}_{j}(X_{ij}))^{2}.

Notice that by Bernstein’s theorem for polynomials [22, Theorem B2a], we also have

|PM′′(x)|(M+1)2supx[1,2]|PM(x)|(2x)(x+1),1<x<2.|P^{\prime\prime}_{M}(x)|\leq\frac{(M+1)^{2}\sup_{x\in[-1,2]}|P_{M}(x)|}{(2-x)(x+1)},\quad-1<x<2. (D.21)

By (D.17), \sup_{x\in[-1,2]}|P_{M}(x)|\leq K_{0}M^{-d}+\sup_{x\in[-1,2]}|f_{j}(x)|\leq K_{0}+B. Thus, by (D.21),

supx[0,1]|PM′′(x)|(M+1)2(K0+B)2K1M2,\sup_{x\in[0,1]}|P^{\prime\prime}_{M}(x)|\leq\frac{(M+1)^{2}(K_{0}+B)}{2}\leq K_{1}M^{2},

where K1=2(K0+B)K_{1}=2(K_{0}+B) and we used the fact that (2x)(x+1)2(2-x)(x+1)\geq 2 for x[0,1]x\in[0,1]. Additionally, because each ξk\xi_{k} is a stationary point, PM(ξk)=0P^{\prime}_{M}(\xi_{k})=0, and hence by a second order Taylor approximation, we have

|PM(x)PM(ξk)|K1M2a2/2|P_{M}(x)-P_{M}(\xi_{k})|\leq K_{1}M^{2}a^{2}/2 (D.22)

for xIkx\in I_{k}.

Therefore, by (D.18), (D.20), and (D.22) we have that

1ni=1n(PM(Xij)f~j(Xij))2\displaystyle\frac{1}{n}\sum_{i=1}^{n}(P_{M}(X_{ij})-\widetilde{f}_{j}(X_{ij}))^{2} =k1nXijIk(PM(Xij)PM(ξk))2\displaystyle=\sum_{k}\frac{1}{n}\sum_{X_{ij}\in I_{k}}(P_{M}(X_{ij})-P_{M}(\xi_{k}))^{2}
K12M4a44nk=1M#{XijIk}\displaystyle\leq\frac{K_{1}^{2}M^{4}a^{4}}{4n}\sum_{k=1}^{M}\#\{X_{ij}\in I_{k}\}
K12M4a44nk=1M3na2\displaystyle\leq\frac{K_{1}^{2}M^{4}a^{4}}{4n}\sum_{k=1}^{M}\frac{3na}{2}
=3K12(Ma)58,\displaystyle=\frac{3K_{1}^{2}(Ma)^{5}}{8},

with probability at least 12nexp(na/12)1-2n\exp(-na/12). Combining this with (D.17) and using the triangle inequality to bound 𝔼n[(fj(Xj)f~j(Xj))2]\sqrt{\mathbb{E}_{n}[(f_{j}(X_{j})-\widetilde{f}_{j}(X_{j}))^{2}]} by 𝔼n[(PM(Xj)f~j(Xj))2]+𝔼n[(PM(Xj)fj(Xj))2]\sqrt{\mathbb{E}_{n}[(P_{M}(X_{j})-\widetilde{f}_{j}(X_{j}))^{2}]}+\sqrt{\vphantom{\mathbb{E}_{n}[(P_{M}(X_{j})-\widetilde{f}_{j}(X_{j}))^{2}]}\mathbb{E}_{n}[(P_{M}(X_{j})-f_{j}(X_{j}))^{2}]} proves Property 2. ∎

Returning to the proof of Proposition 1, by Lemma 1 along with Lemma D.4, we have that, with probability at least 1-2n\exp(-na/12),

Δ^(Xj,Y)12M/a+log(2n)+1×Cov^2(f~j(Xj)Var^(f~j(Xj)),Y).\widehat{\Delta}(X_{j},Y)\geq\frac{1}{2M/a+\log(2n)+1}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}. (D.23)

Now we choose

a=(τ2Var(fj(Xj))4(2K0+K1)2)(2d+5)/(10d),M=a5/(2d+5),a=\left(\frac{\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{4(2K_{0}+K_{1})^{2}}\right)^{(2d+5)/(10d)},\quad M=\lfloor a^{-5/(2d+5)}\rfloor, (D.24)

where τ\tau is a constant to be specified in (D.26), so that Property 2 of Lemma D.4 becomes

𝔼n[(fj(X)f~j(X))2]\displaystyle\sqrt{\mathbb{E}_{n}[(f_{j}(X)-\widetilde{f}_{j}(X))^{2}]} K0Md+K1(Ma)5/2\displaystyle\leq K_{0}M^{-d}+K_{1}(Ma)^{5/2} (D.25)
(2K0+K1)a5d/(2d+5)\displaystyle\leq(2K_{0}+K_{1})a^{5d/(2d+5)}
τVar(fj(Xj))2,\displaystyle\leq\frac{\tau\sqrt{\mathrm{Var}(f_{j}(X_{j}))}}{2},

with probability at least 12nexp(na/12)1-2n\exp(-na/12).

Recall the condition M+1>r from [23, Theorem VIII], which states that the degree of the polynomial must be greater than the order of the Lipschitz derivative in order to approximate the function well. Since \mathrm{Var}(f_{j}(X_{j}))\leq B^{2}, this condition will be satisfied if we make the following choice:

τmin(2(2K0+K1)rdB,14).\tau\coloneqq\min\Big{(}\frac{2(2K_{0}+K_{1})}{r^{d}B},\frac{1}{4}\Big{)}. (D.26)

It follows by Lemma D.2 along with (D.25) that

Cov^(f~j(Xj),fj(Xj))Var^(f~j(Xj))(Var^(fj(Xj))τVar(fj(Xj)))\widehat{\mathrm{Cov}}(\widetilde{f}_{j}(X_{j}),f_{j}(X_{j}))\geq\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}\Big{(}\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}-\tau\sqrt{\mathrm{Var}(f_{j}(X_{j}))}\Big{)} (D.27)

with probability at least 12nexp(na/12)1-2n\exp(-na/12).

In the next Lemma, we use (D.27) along with Lemma D.3 to obtain a lower bound on the right hand side of (D.23).

Lemma D.5.

With probability at least 1exp((n1)(18τ2)2Var(fj(Xj))8B2)exp(nτ2Var(fj(Xj))8(B2+σ2))2nexp(na/12)1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)}-\exp\Big{(}-\frac{n\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{8(B^{2}+\sigma^{2})}\Big{)}-2n\exp(-na/12), we have that

Cov^(f~j(Xj)Var^(f~j(Xj)),Y)(τ/2)Var(fj(Xj)).\displaystyle\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}\geq(\tau/2)\sqrt{\mathrm{Var}(f_{j}(X_{j}))}.
Proof of Lemma D.5.

Recalling (16), we will prove Lemma D.5 by first getting a concentration bound on (I) and then getting a concentration bound on (II).

To get a concentration bound on (I), we need Lemma D.3 to lower bound the sample variance on the right hand side of inequality (D.27). Choosing U=fj(Xj)[B,B]U=f_{j}(X_{j})\in[-B,B] and γ=Var(fj(Xj))(18τ2)\gamma=\mathrm{Var}(f_{j}(X_{j}))(1-8\tau^{2}) (which is greater than zero by the choice of τ\tau in (D.26)), notice that Lemma D.3 gives us

(Var^(fj(Xj))8τ2(n1)nVar(fj(Xj)))\displaystyle\mathbb{P}\Big{(}\widehat{\mathrm{Var}}(f_{j}(X_{j}))\geq\frac{8\tau^{2}(n-1)}{n}\mathrm{Var}(f_{j}(X_{j}))\Big{)}
1exp((n1)(18τ2)2Var(fj(Xj))8B2),\displaystyle\geq 1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)},

so that by (D.27),

Cov^(f~j(Xj)Var^(f~j(Xj)),fj(Xj))\displaystyle\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},f_{j}(X_{j})\Bigg{)} Var^(fj(Xj))τVar(fj(Xj))\displaystyle\geq\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}-\tau\sqrt{\mathrm{Var}(f_{j}(X_{j}))}
τ(811/n1)Var(fj(Xj))\displaystyle\geq\tau(\sqrt{8}\sqrt{1-1/n}-1)\sqrt{\mathrm{Var}(f_{j}(X_{j}))}
τVar(fj(Xj)),\displaystyle\geq\tau\sqrt{\mathrm{Var}(f_{j}(X_{j}))},

with probability at least 1exp((n1)(18τ2)2Var(fj(Xj))8B2)2nexp(na/12)1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)}-2n\exp(-na/12).

Now we need to get a concentration bound for (II). Let si=f~j(Xij)1nk=1nf~j(Xkj)Var^(f~j(Xj))s_{i}=\frac{\widetilde{f}_{j}(X_{ij})-\frac{1}{n}\sum_{k=1}^{n}\widetilde{f}_{j}(X_{kj})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}}. We need to bound

Cov^(f~j(Xj)Var^(f~j(Xj)),Yfj(Xj))=1ni=1nsi(g(𝐗i)fj(Xij)+εi).\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}=\frac{1}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij})+\varepsilon_{i}).

For notational simplicity, we let 𝐗¯=(Xij)\underline{\mathbf{X}}=(X_{ij}) be the n×pn\times p data matrix with 𝐗i\mathbf{X}_{i} as rows. First notice that

𝔼[exp(λni=1nsi(g(𝐗i)fj(Xij)+εi))]\displaystyle\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij})+\varepsilon_{i})\Bigg{)}\Bigg{]}
=𝔼[𝔼[exp(λni=1nsi(g(𝐗i)fj(Xij)+εi))|𝐗¯]]\displaystyle=\mathbb{E}\Bigg{[}\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij})+\varepsilon_{i})\Bigg{)}\Bigg{|}\underline{\mathbf{X}}\Bigg{]}\Bigg{]}
=𝔼[exp(λni=1nsi(g(𝐗i)fj(Xij)))𝔼[exp(λni=1nsiεi)|𝐗¯]].\displaystyle=\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\mathbb{E}\Bigg{[}\exp\left(\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}\varepsilon_{i}\right)\Bigg{|}\underline{\mathbf{X}}\Bigg{]}\Bigg{]}.

Now, by the sample independence of the errors ϵi\epsilon_{i}, we can write the above as

𝔼[exp(λni=1nsi(g(𝐗i)fj(Xij)))i=1n𝔼[exp(λnsiεi)|𝐗¯]]\displaystyle\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\prod_{i=1}^{n}\mathbb{E}\Big{[}\exp\Big{(}\frac{\lambda}{n}s_{i}\varepsilon_{i}\Big{)}\Big{|}\underline{\mathbf{X}}\Big{]}\Bigg{]}
𝔼[exp(λni=1nsi(g(𝐗i)fj(Xij)))i=1nexp(λ2si2σ22n2)]\displaystyle\leq\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\prod_{i=1}^{n}\exp\Big{(}\frac{\lambda^{2}s_{i}^{2}\sigma^{2}}{2n^{2}}\Big{)}\Bigg{]}
=𝔼[exp(λni=1nsi(g(𝐗i)fj(Xij)))]exp(λ2σ22n),\displaystyle=\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\Bigg{]}\exp\Big{(}\frac{\lambda^{2}\sigma^{2}}{2n}\Big{)},

where we used Assumption 6 and the fact that 1ni=1nsi2=1\frac{1}{n}\sum_{i=1}^{n}s^{2}_{i}=1. Recalling that sis_{i} depends on (X1j,X2j,,Xnj)(X_{1j},X_{2j},\dots,X_{nj})^{\top}, we have

𝔼[exp(λni=1nsi(g(𝐗i)fj(Xij)))]\displaystyle\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\Bigg{]} (D.28)
=𝔼[𝔼[exp(λni=1nsi(g(𝐗i)fj(Xij)))|X1j,X2j,,Xnj]]\displaystyle=\mathbb{E}\Bigg{[}\mathbb{E}\Bigg{[}\exp\Bigg{(}\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Bigg{)}\Bigg{|}X_{1j},X_{2j},\dots,X_{nj}\Bigg{]}\Bigg{]}
=𝔼[i=1n𝔼[exp(λnsi(g(𝐗i)fj(Xij)))|X1j,X2j,,Xnj]],\displaystyle=\mathbb{E}\Bigg{[}\prod_{i=1}^{n}\mathbb{E}\Big{[}\exp\Big{(}\frac{\lambda}{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij}))\Big{)}\Big{|}X_{1j},X_{2j},\dots,X_{nj}\Big{]}\Bigg{]},

where we used sample independence in the second equality. Finally, applying Hoeffding’s Lemma along with the fact that gB\|g\|_{\infty}\leq B, we have that (D.28) is bounded above by

𝔼[i=1nexp(λ2si2B22n2)]exp(λ2B22n).\displaystyle\mathbb{E}\Bigg{[}\prod_{i=1}^{n}\exp\bigg{(}\frac{\lambda^{2}s_{i}^{2}B^{2}}{2n^{2}}\bigg{)}\Bigg{]}\leq\exp\bigg{(}\frac{\lambda^{2}B^{2}}{2n}\bigg{)}.

Having bounded the moment generating function, we can now apply Markov’s inequality to see that

\displaystyle\mathbb{P}\Bigg{(}\frac{1}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij})+\varepsilon_{i})\leq-\gamma\Bigg{)}
\displaystyle=\mathbb{P}\Bigg{(}\exp\Bigg{(}-\frac{\lambda}{n}\sum_{i=1}^{n}s_{i}(g(\mathbf{X}_{i})-f_{j}(X_{ij})+\varepsilon_{i})\Bigg{)}\geq\exp(\lambda\gamma)\Bigg{)}
exp(λ2(B2+σ2)2nγλ)\displaystyle\leq\exp\Bigg{(}\frac{\lambda^{2}(B^{2}+\sigma^{2})}{2n}-\gamma\lambda\Bigg{)}
exp(nγ22(B2+σ2)),\displaystyle\leq\exp\Bigg{(}-\frac{n\gamma^{2}}{2(B^{2}+\sigma^{2})}\Bigg{)},

where the last inequality follows by maximizing over \lambda. Choosing \gamma=(\tau/2)\sqrt{\mathrm{Var}(f_{j}(X_{j}))}, we have by a union bound that

\displaystyle\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)} \displaystyle=\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},f_{j}(X_{j})\Bigg{)}
\displaystyle\qquad+\widehat{\mathrm{Cov}}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}
\displaystyle\geq\tau\sqrt{\mathrm{Var}(f_{j}(X_{j}))}-(\tau/2)\sqrt{\mathrm{Var}(f_{j}(X_{j}))}
\displaystyle=(\tau/2)\sqrt{\mathrm{Var}(f_{j}(X_{j}))},

with probability at least $1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)}-\exp\Big{(}-\frac{n\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{8(B^{2}+\sigma^{2})}\Big{)}-2n\exp(-na/12)$. ∎

With this setup, we are now ready to finish the proofs of Propositions 1 and 2.

Proof of Proposition 1.

Recalling our concentration bound (D.23) along with Lemma D.5, it follows that, with probability at least $1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)}-\exp\Big{(}-\frac{n\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{8(B^{2}+\sigma^{2})}\Big{)}-4n\exp\Big{(}-\frac{n}{12}\Big{(}\frac{\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{4(2K_{0}+K_{1})^{2}}\Big{)}^{(2d+5)/(10d)}\Big{)}$,

\displaystyle\widehat{\Delta}(X_{j},Y) \displaystyle\geq\frac{1}{2M/a+\log(2n)+1}\times\widehat{\mathrm{Cov}}^{2}\Bigg{(}\frac{\widetilde{f}_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(\widetilde{f}_{j}(X_{j}))}},Y\Bigg{)}
\displaystyle\geq\frac{\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{8a^{-(2d+10)/(2d+5)}+4(\log(2n)+1)}
\displaystyle\geq\frac{\tau^{2}\mathrm{Var}(f_{j}(X_{j}))a^{(2d+10)/(2d+5)}}{8+4(\log(2n)+1)a^{(2d+10)/(2d+5)}}
\displaystyle\geq\frac{C_{2}(\mathrm{Var}(f_{j}(X_{j})))^{(6d+5)/(5d)}}{\log(n)}, (D.29)

where we used our choices of $M$ and $a$ in (D.24) and $\tau$ in (D.26), and $C_{2}$ is a constant that depends only on $L$, $B$, $r$, and $\alpha$. Note that in the last inequality (D.29) we bound $a$ in the denominator using $\mathrm{Var}(f_{j}(X_{j}))\leq B^{2}$.

To conclude the proof, we simplify our probability bound. To this end, notice that

\displaystyle(\mathrm{Var}(f_{j}(X_{j})))^{(2d+5)/(10d)}=\frac{\mathrm{Var}(f_{j}(X_{j}))}{(\mathrm{Var}(f_{j}(X_{j})))^{(8d-5)/(10d)}}\geq\frac{\mathrm{Var}(f_{j}(X_{j}))}{B^{(8d-5)/(5d)}},

where the last inequality holds by the assumption that $d\geq 5/8$ and the fact that $\mathrm{Var}(f_{j}(X_{j}))\leq B^{2}$. Therefore, $1-\exp\Big{(}-\frac{(n-1)(1-8\tau^{2})^{2}\mathrm{Var}(f_{j}(X_{j}))}{8B^{2}}\Big{)}-\exp\Big{(}-\frac{n\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{8(B^{2}+\sigma^{2})}\Big{)}-4n\exp\Big{(}-\frac{n}{12}\Big{(}\frac{\tau^{2}\mathrm{Var}(f_{j}(X_{j}))}{4(2K_{0}+K_{1})^{2}}\Big{)}^{(2d+5)/(10d)}\Big{)}\geq 1-(4n+2)\exp(-nC_{1}\mathrm{Var}(f_{j}(X_{j})))$ for some constant $C_{1}$ that depends only on $L$, $B$, $\sigma$, $r$, and $\alpha$. ∎

Proof of Proposition 2.

The main difference from Proposition 1 is that when $f_{j}(\cdot)$ is monotone, we now have $M=0$. We no longer need the approximations $P_{M}(\cdot)$ or $\widetilde{f}_{j}(\cdot)$ in Lemma D.4, $a$ in (D.24) and $\tau$ in (D.26), or Lemma D.2. We only need Lemma 1 and a version of Lemma D.5. Instead of applying Lemma 1 with an approximation to the marginal projection, we can choose $h_{j}(\cdot)$ to equal $f_{j}(\cdot)$ directly to see that

\widehat{\Delta}(X_{j},Y)\geq\frac{\widehat{\mathrm{Cov}}^{2}\bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y\bigg{)}}{\log(2n)+1}. (D.30)

Next, we have

\displaystyle\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y\Bigg{)} (D.31)
\displaystyle=\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},f_{j}(X_{j})\Bigg{)}+\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}
\displaystyle=\underbrace{\vphantom{\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}}\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}}_{\text{(IV)}}+\underbrace{\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y-f_{j}(X_{j})\Bigg{)}}_{\text{(V)}}.

Now, we follow the same steps as in the proof of Lemma D.5 to lower bound (D.31). For (IV), we again use Lemma D.3 with $U=f_{j}(X_{j})\in[-B,B]$, but instead choose $\gamma=\mathrm{Var}(f_{j}(X_{j}))/2$ to get

\displaystyle\mathbb{P}\Big{(}\text{(IV)}\geq\frac{\sqrt{\mathrm{Var}(f_{j}(X_{j}))}}{2}\Big{)} \displaystyle\geq\mathbb{P}\Big{(}\widehat{\mathrm{Var}}(f_{j}(X_{j}))\geq\frac{n-1}{2n}\mathrm{Var}(f_{j}(X_{j}))\Big{)} (D.32)
\displaystyle\geq 1-\exp\Big{(}-\frac{(n-1)\mathrm{Var}(f_{j}(X_{j}))}{32B^{2}}\Big{)}.

For (V), we can follow the same steps as the second half of the proof of Lemma D.5 to see that

\mathbb{P}(\text{(V)}\leq-\gamma)\leq\exp\Big{(}-\frac{n\gamma^{2}}{2(B^{2}+\sigma^{2})}\Big{)}.

However, for (V), we instead choose $\gamma=\sqrt{\mathrm{Var}(f_{j}(X_{j}))}/4$ to get

\displaystyle\mathbb{P}\Bigg{(}\text{(V)}\geq-\frac{\sqrt{\mathrm{Var}(f_{j}(X_{j}))}}{4}\Bigg{)}\geq 1-\exp\Bigg{(}-\frac{n\mathrm{Var}(f_{j}(X_{j}))}{32(B^{2}+\sigma^{2})}\Bigg{)}. (D.33)

Now, using a union bound and substituting the events in (D.33) and (D.32) into (D.31), we see that, with probability at least $1-\exp\Big{(}-\frac{(n-1)\mathrm{Var}(f_{j}(X_{j}))}{32B^{2}}\Big{)}-\exp\Big{(}-\frac{n\mathrm{Var}(f_{j}(X_{j}))}{32(B^{2}+\sigma^{2})}\Big{)}\geq 1-2\exp\Big{(}-\frac{(n-1)\mathrm{Var}(f_{j}(X_{j}))}{32(B^{2}+\sigma^{2})}\Big{)}$,

\widehat{\mathrm{Cov}}\Bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y\Bigg{)}\geq\frac{\sqrt{\mathrm{Var}(f_{j}(X_{j}))}}{4}. (D.34)

Therefore, substituting (D.34) into (D.30), we have that, with probability at least $1-2\exp\Big{(}-\frac{(n-1)\mathrm{Var}(f_{j}(X_{j}))}{32(B^{2}+\sigma^{2})}\Big{)}$,

\widehat{\Delta}(X_{j},Y)\geq\frac{\widehat{\mathrm{Cov}}^{2}\bigg{(}\frac{f_{j}(X_{j})}{\sqrt{\widehat{\mathrm{Var}}(f_{j}(X_{j}))}},Y\bigg{)}}{\log(2n)+1}\geq\frac{\mathrm{Var}(f_{j}(X_{j}))}{16(\log(2n)+1)}.\qed

E Proof of Theorems 2 and 3

In this section, we use Propositions 1 and 2, along with Lemma 3, to complete the proofs of Theorems 2 and 3.

Proof of Theorem 2.

The high-level idea is to show that the upper and lower bounds on the impurity reductions for irrelevant and relevant variables from Lemma 3 and Proposition 1, respectively, are well-separated.

By Proposition 1 applied to each variable $j\in{\mathcal{S}}$ and a union bound (note that $(6d+5)/(5d)=6/5+1/d$), we see that, with probability at least $1-s(4n+2)\exp(-C_{1}nv)$, we have

\widehat{\Delta}(X_{j},Y)\geq\frac{C_{2}v^{6/5+1/d}}{\log(n)}\qquad\forall\;j\in{\mathcal{S}}. (E.35)

By Lemma 3 and a union bound over all $p-s$ variables in ${\mathcal{S}}^{c}$, we have that, with probability at least $1-4n(p-s)\exp(-n\xi^{2}/(12(B^{2}+\sigma^{2})))$,

\widehat{\Delta}(X_{j},Y)\leq\xi^{2}\qquad\forall\;j\in{\mathcal{S}}^{c}. (E.36)

Recall that if we know the size $s$ of the support ${\mathcal{S}}$, then $\mathcal{\widehat{S}}$ consists of the $s$ variables with the largest impurity reductions. Choosing $\xi^{2}=\frac{C_{2}v^{6/5+1/d}}{2\log(n)}$ in (E.36) gives a high-probability upper bound on $\widehat{\Delta}(X_{j},Y)$ for irrelevant variables that is dominated by the lower bound on $\widehat{\Delta}(X_{j},Y)$ for relevant variables in (E.35). Thus, by a union bound, it follows that, with probability at least $1-s(4n+2)\exp(-C_{1}nv)-4n(p-s)\exp\Big{(}-\frac{nC_{2}v^{6/5+1/d}}{24\log(n)(B^{2}+\sigma^{2})}\Big{)}$, we have $\mathcal{\widehat{S}}=\mathcal{S}$. ∎
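
For concreteness, the exponent in the final probability comes from substituting this choice of $\xi^{2}$ into the tail bound preceding (E.36):

\frac{n\xi^{2}}{12(B^{2}+\sigma^{2})}=\frac{n}{12(B^{2}+\sigma^{2})}\cdot\frac{C_{2}v^{6/5+1/d}}{2\log(n)}=\frac{nC_{2}v^{6/5+1/d}}{24\log(n)(B^{2}+\sigma^{2})}.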

Proof of Theorem 3.

The proof is similar to that of Theorem 2 except that we use Proposition 2 in place of Proposition 1.

By Proposition 2 applied to each variable $j\in{\mathcal{S}}$ and a union bound, we see that, with probability at least $1-2s\exp\Big{(}-\frac{(n-1)v}{32(B^{2}+\sigma^{2})}\Big{)}$,

\widehat{\Delta}(X_{j},Y)\geq\frac{v}{4(1+\log(2n))}\qquad\forall\;j\in{\mathcal{S}}.

Again, by Lemma 3 and a union bound over all $p-s$ variables in ${\mathcal{S}}^{c}$, we have that, with probability at least $1-4n(p-s)\exp(-n\xi^{2}/(12(B^{2}+\sigma^{2})))$,

\widehat{\Delta}(X_{j},Y)\leq\xi^{2}\qquad\forall\;j\in{\mathcal{S}}^{c}. (E.37)

Choosing $\xi^{2}=\frac{v}{8(1+\log(2n))}$ in (E.37) and using a union bound, it follows that, with probability at least $1-2s\exp\Big{(}-\frac{(n-1)v}{32(B^{2}+\sigma^{2})}\Big{)}-4n(p-s)\exp\Big{(}-\frac{nv}{96(1+\log(2n))(B^{2}+\sigma^{2})}\Big{)}$, we have $\mathcal{\widehat{S}}=\mathcal{S}$. ∎
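
As a practical illustration of the selection rule analyzed in Theorems 2 and 3 (rank the variables by their decision-stump impurity reduction and keep the top $s$), the following minimal Python sketch computes a stand-in for $\widehat{\Delta}(X_{j},Y)$ for each coordinate, namely the standard CART variance reduction of the best single split, and returns the indices with the $s$ largest scores. The function names are illustrative and the code is not taken from the paper.

import numpy as np

def stump_impurity_reduction(x, y):
    # Largest decrease in training mean squared error achievable by a single
    # axis-aligned split (a decision stump) on the feature vector x.
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y_sorted)
    total_sse = np.sum((y_sorted - y_sorted.mean()) ** 2)
    csum = np.cumsum(y_sorted)        # prefix sums of y
    csum2 = np.cumsum(y_sorted ** 2)  # prefix sums of y^2
    best = 0.0
    for k in range(1, n):             # left child = first k sorted points
        if x_sorted[k - 1] == x_sorted[k]:
            continue                  # a split must separate distinct x-values
        left_sse = csum2[k - 1] - csum[k - 1] ** 2 / k
        right_sum = csum[-1] - csum[k - 1]
        right_sse = (csum2[-1] - csum2[k - 1]) - right_sum ** 2 / (n - k)
        best = max(best, (total_sse - left_sse - right_sse) / n)
    return best

def screen_top_s(X, y, s):
    # Rank the columns of X by their stump impurity reduction and keep the top s.
    scores = [stump_impurity_reduction(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:s]

For example, screen_top_s(X, y, s) returns the indices of the $s$ screened variables; Theorems 2 and 3 give conditions under which this estimated set coincides with the true support $\mathcal{S}$ with high probability.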

References

  • [1] D. Angluin and L.G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18:155–193, 1979.
  • [2] Anestis Antoniadis and Jianqing Fan. Regularization of wavelet approximations. Journal of the American Statistical Association, 96(455):939–955, 2001.
  • [3] L. C. Bergersen, K. Tharmaratnam, and I. K. Glad. Monotone splines lasso. Computational Statistics and Data Analysis, 77:336–351, 2014.
  • [4] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  • [5] Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199–231, 2001.
  • [6] Leo Breiman, Jerome Friedman, RA Olshen, and Charles J Stone. Classification and regression trees. Chapman and Hall/CRC, 1984.
  • [7] W. Buckinx, G. Verstraeten, and D. Van den Poel. Predicting customer loyalty using the internal transactional database. Expert Systems with Applications, 32:125–134, 2007.
  • [8] Alexandre Bureau, Josée Dupuis, Kathleen Falls, Kathryn L Lunetta, Brooke Hayward, Tim P Keith, and Paul Van Eerdewegh. Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology, 28:171–82, 2005.
  • [9] Ramón Díaz-Uriarte and Sara Alvarez de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(3), 2006.
  • [10] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32:407–499, 2004.
  • [11] Jianqing Fan, Yang Feng, and Rui Song. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494), 2011.
  • [12] Jianqing Fan and Jinchi Lv. Sure independence screening for ultra-high dimensional feature space. Journal of the Royal Statistical Society, Series B, 70:849–911, 2008.
  • [13] Jianqing Fan and Rui Song. Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38(6):3567–3604, 2010.
  • [14] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
  • [15] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
  • [16] Peter Hall and Hugh Miller. Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18(3):533–550, 2009.
  • [17] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer New York, 2009.
  • [18] Harold Hotelling. New light on the correlation coefficient and its transforms. Journal of the Royal Statistical Society. Series B, 15(2):193–232, 1953.
  • [19] Jian Huang, Joel L Horowitz, and Fengrong Wei. Variable selection in nonparametric additive models. The Annals of Statistics, 38(4):2282–2313, 2010.
  • [20] Vân Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts. Inferring regulatory networks from expression data using tree-based methods. PLOS ONE, 2010.
  • [21] Wayne Iba and Pat Langley. Induction of one-level decision trees. Machine Learning Proceedings, pages 233–240, 1992.
  • [22] Dunham Jackson. Bernstein’s theorem and trigonometric approximation. Transactions of the American Mathematical Society, 40(2):225–251, 1936.
  • [23] Dunham Jackson. The Theory of Approximation, volume 11. American Mathematical Society, 1991.
  • [24] Jalil Kazemitabar, Arash Amini, Adam Bloniarz, and Ameet S Talwalkar. Variable importance using decision trees. In Advances in Neural Information Processing Systems, pages 426–435, 2017.
  • [25] L.C. Keely and C.M. Tan. Understanding preferences for income redistribution. Journal of Public Economics, 92:944–961, 2008.
  • [26] Jason M. Klusowski. Sparse learning with CART. In Advances in Neural Information Processing Systems, 2020.
  • [27] B. Lariviere and D. Van den Poel. Predicting customer retention and profitability by using random forests and regression forests techniques. Expert Systems with Applications, 29:472–484, 2005.
  • [28] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
  • [29] Xiao Li, Yu Wang, Sumanta Basu, Karl Kumbier, and Bin Yu. A debiased MDI feature importance measure for random forests. In Advances in Neural Information Processing Systems 32, pages 8049–8059. Curran Associates, Inc., 2019.
  • [30] Gilles Louppe. Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502, 2014.
  • [31] Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts. Understanding variable importances in forests of randomized trees. In Advances in Neural Information Processing Systems, pages 431–439, 2013.
  • [32] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
  • [33] Kathryn L. Lunetta, L. Brooke Hayward, Jonathan Segal, and Paul Van Eerdewegh. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics, 5(32):199–231, 2004.
  • [34] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009.
  • [35] Pradeep Ravikumar, Han Liu, John Lafferty, and Larry Wasserman. Spam: Sparse additive models. Journal of the Royal Statistical Society. Series B, 71(5):1009–1030, 2009.
  • [36] Erwan Scornet. Trees, forests, and impurity-based variable importance. arXiv preprint arXiv:2001.04295, 2020.
  • [37] Erwan Scornet, Gérard Biau, and Jean-Philippe Vert. Consistency of random forests. Annals of Statistics, 43(4):1716–1741, 2015.
  • [38] Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:563–564, 2007.
  • [39] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58:267–288, 1996.
  • [40] W. Buckinx and D. Van den Poel. Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting. European Journal of Operational Research, 164:252–268, 2005.
  • [41] Martin J Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Transactions on Information Theory, 55(12):5728–5741, 2009.
  • [42] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using 1\ell_{1}-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
  • [43] Huazhen Wang, Fan Yang, and Zhiyuan Luo. An experimental study of the intrinsic stability of random forest variable importance measures. BMC Bioinformatics, 17(60), 2016.
  • [44] Pengfei Wei, Zhenzhou Lu, and Jingwen Song. Variable importance analysis: A comprehensive review. Reliability Engineering and System Safety, 142:399–432, 2015.
  • [45] JG Wendel. Note on the gamma function. The American Mathematical Monthly, 55(9):563–564, 1948.
  • [46] J. Zhang and M. Zulkernine. A hybrid network intrusion detection technique using random forests. Proceedings of the First International Conference on Availability, Reliability and Security, pages 262–269, 2006.
  • [47] J. Zhang, M. Zulkernine, and A. Haque. A hybrid network intrusion detection technique using random forests. IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, 38(5):649–659, 2008.