
More Powerful and General Selective Inference for Stepwise Feature Selection Using the Homotopy Method

Kazuya Sugiyama*, Vo Nguyen Le Duy*, Ichiro Takeuchi
Nagoya Institute of Technology, RIKEN
*Equal contribution. Correspondence to: Ichiro Takeuchi ([email protected])
Abstract

Conditional selective inference (SI) has been actively studied as a new statistical inference framework for data-driven hypotheses. The basic idea of conditional SI is to make inferences conditional on the selection event characterized by a set of linear and/or quadratic inequalities. Conditional SI has been mainly studied in the context of feature selection such as stepwise feature selection (SFS). The main limitation of the existing conditional SI methods is the loss of power due to over-conditioning, which is required for computational tractability. In this study, we develop a more powerful and general conditional SI method for SFS using the homotopy method which enables us to overcome this limitation. The homotopy-based SI is especially effective for more complicated feature selection algorithms. As an example, we develop a conditional SI method for forward-backward SFS with AIC-based stopping criteria, and show that it is not adversely affected by the increased complexity of the algorithm. We conduct several experiments to demonstrate the effectiveness and efficiency of the proposed method.

1 Introduction

As machine learning (ML) is being applied to a greater variety of practical problems, ensuring the reliability of ML is becoming increasingly important. Among several potential approaches to reliable ML, conditional selective inference (SI) is recognized as a promising approach for evaluating the statistical reliability of data-driven hypotheses selected by ML algorithms. The basic idea of conditional SI is to make inference on a data-driven hypothesis conditional on the selection event that the hypothesis is selected. Conditional SI has been actively studied especially in the context of feature selection. Notably, Lee et al. (2016) and Tibshirani et al. (2016) proposed conditional SI methods for features selected by Lasso and stepwise feature selection (SFS), respectively. Their basic idea is to characterize the selection event by a polytope, i.e., a set of linear inequalities, in the sample space. When a selection event can be characterized by a polytope, practical computational methods developed by these authors can be used for making inferences on the selected hypotheses conditional on the selection events.

Unfortunately, however, such polytope-based SI has several limitations because it can only be used when the characterization of all relevant selection events is represented by a polytope. In fact, in most of the existing polytope-based SI studies, extra-conditioning is required in order for the selection event to be characterized as a polytope. For example, in SI for SFS by Tibshirani et al. (2016), the authors needed to consider conditioning not only on selected features but also on additional events regarding the history of the feature selection algorithmic process and the signs of the features. Such extra-conditioning leads to loss of power in the inference (Fithian et al., 2014).

In this study, we go beyond polytope-based SI and propose a novel conditional SI method using the homotopy continuation approach for SFS. We call the proposed method homotopy-based SI. In contrast to polytope-based SI for SFS in Tibshirani et al. (2016), the proposed method is more powerful and more general. The basic idea of homotopy-based SI is to use the homotopy continuation approach to keep track of the hypothesis selection event when the dataset changes in the direction of the selected test statistic, which enables efficient identification of the subspace of the sample space in which the same hypothesis is selected. Furthermore, we show that the proposed homotopy-based SI is more advantageous for more complicated feature selection algorithms. As an example, we develop a homotopy-based conditional SI for forward-backward SFS (FB-SFS) algorithm with AIC-based stopping criteria, and show that it is not adversely affected by the increased complexity of the algorithm and that a sufficiently high power is retained.

Related works

Traditional statistical inference presumes that a statistical model and a statistical target for which we seek to conduct inference are determined before observing the dataset. Therefore, if we apply traditional methods to hypotheses selected after observing the dataset, the inferential results are no longer valid. This problem has been extensively discussed in the context of feature selection. In fact, even in commonly used feature selection methods such as SFS, correct assessment of the statistical reliability of selected features has long been difficult. Several approaches have been suggested in the literature toward addressing this problem (Benjamini and Yekutieli, 2005; Leeb and Pötscher, 2005; Leeb et al., 2006; Benjamini et al., 2009; Pötscher et al., 2010; Berk et al., 2013; Lockhart et al., 2014; Taylor et al., 2014). A particularly notable approach is conditional SI introduced in the seminal paper by Lee et al. (2016). In their work, the authors showed that, for a set of features selected by Lasso, the selection event can be characterized as a polytope by conditioning on the selected set of features as well as additional events on their signs. Furthermore, Tibshirani et al. (2016) showed that polytope-based SI is also applicable to SFS by additionally conditioning on the history and signs of sequential feature selection. Conditional SI has been actively studied in the past few years and extended to various directions (Fithian et al., 2015; Choi et al., 2017; Tian et al., 2018; Chen and Bien, 2019; Hyun et al., 2018; Loftus and Taylor, 2014, 2015; Panigrahi et al., 2016; Tibshirani et al., 2016; Yang et al., 2016; Suzumura et al., 2017; Tanizaki et al., 2020; Duy et al., 2020a).

In conditional SI, it is typically preferred to condition on as little as possible so that the inference can be more powerful (Fithian et al., 2014). Namely, the main limitation of current polytope-based SI methods is that excessive conditioning is required to represent the selection event with a single polytope. In the seminal paper by Lee et al. (2016), the authors already discussed the problem of over-conditioning, and explained how the extra conditioning on signs can be omitted by exhaustively enumerating all possible sign combinations and taking the union over the resulting polyhedra. However, such an exhaustive enumeration of an exponentially increasing number of sign combinations is feasible only when the number of selected features is fairly small. Several other approaches have been proposed to circumvent the drawbacks and restrictions of polytope-based SI. Loftus and Taylor (2015) extended polytope-based SI so that selection events characterized by quadratic inequalities can be handled, but this inherits the same over-conditioning problem. To improve the power, Tian et al. (2018) proposed randomizing the algorithm in order to condition on less. Terada and Shimodaira (2019) proposed to use bootstrap re-sampling to characterize the selection event more generally. The disadvantage of these approaches is that additional randomness is introduced into the algorithm and/or the inference.

This study is motivated by Liu et al. (2018) and Duy and Takeuchi (2020). The former studied Lasso SI for full-model parameters, whereas the latter extended the basic idea of the former so that it can also be applied to Lasso SI in more general settings. These two studies go beyond polytope-based SI to achieve more powerful Lasso SI without conditioning on signs. Because Lasso is formulated as a convex optimization problem, the selection event for conditional SI can be characterized based on the optimality conditions of the convex optimization problem. These two methods rely on the optimality conditions to characterize the minimum conditions for the Lasso feature selection event.

In contrast, the SFS algorithm cannot be formulated as an optimization problem. Therefore, existing conditional SI methods for SFS are based on conditioning on the history rather than the output of the SFS algorithm. Namely, existing SI methods actually consider conditions not only on which features are selected but also in what order they are selected. Moreover, when the SFS algorithm is extended to more complex procedures such as FB-SFS, the over-conditioning problem has more pronounced negative effects. Because features can be added and removed in various orders in FB-SFS, if we condition on the history rather than the output of the algorithm, we will suffer from extensive over-conditioning on the entire history of feature addition and removal.

Our contributions

Our contributions are as follows. First, we propose a novel conditional SI method for the SFS algorithm using the homotopy continuation approach and show that the proposed method overcomes the over-conditioning problem in existing methods, enabling us to conduct minimally conditioned, more powerful conditional SI on the features selected by the SFS algorithm. Second, we develop a conditional SI method for the FB-SFS algorithm with AIC-based stopping criteria by using the homotopy continuation approach. Although the conventional polytope-based SI method is more adversely affected by the over-conditioning problem when SFS is extended to FB-SFS, we show that our homotopy-based method still enables us to conduct minimally conditioned SI and retains sufficiently high power. We demonstrate the validity and power of the proposed homotopy-based SI methods through numerical experiments on synthetic and real benchmark datasets. The code will be released upon acceptance of this draft.

Notations

For a natural number $n$, we denote by $[n]$ the set $\{1, \ldots, n\}$. The $n \times n$ identity matrix is denoted by $I_n$. For a matrix $X$ and a set of column indices $\mathcal{M}$, $X_{\mathcal{M}}$ denotes the submatrix of $X$ consisting of the columns indexed by $\mathcal{M}$.

2 Problem Statement

We consider the forward SFS for a regression problem. Let $n$ be the number of instances and $p$ be the number of original features. We denote the observed dataset as $(X, \bm{y})$, where $X \in \mathbb{R}^{n \times p}$ is the design matrix and $\bm{y} \in \mathbb{R}^n$ is the response vector. Following the problem setup from the existing literature on conditional SI such as Lee et al. (2016) and Tibshirani et al. (2016), we assume that the observed response $\bm{y}$ is a realization of the following random response vector

$$\bm{Y} = (Y_1, \ldots, Y_n)^\top \sim \mathbb{N}(\bm{\mu}, \Sigma), \qquad (1)$$

where $\bm{\mu} \in \mathbb{R}^n$ is the unknown mean vector and $\Sigma \in \mathbb{R}^{n \times n}$ is the covariance matrix, which is known or estimable from independent data. The design matrix $X$ is assumed to be non-random. For notational simplicity, we assume that each column vector of $X$ is normalized to have unit length.

Stepwise feature selection

We consider the standard forward SFS method as studied in Tibshirani et al. (2016) in §2 and §3. At each step of the SFS method, the feature that most improves the fit is newly added. When each feature has unit length, this is equivalent to selecting the feature most correlated with the residual of the least-squares regression model fitted with the previously selected features. For a response vector $\bm{y} \in \mathbb{R}^n$ and a set of features $\mathcal{M} \subseteq [p]$, let $\bm{r}(\bm{y}, X_{\mathcal{M}})$ be the residual vector obtained by regressing $\bm{y}$ onto $X_{\mathcal{M}}$, i.e.,

$$\bm{r}(\bm{y}, X_{\mathcal{M}}) = P_{X_{\mathcal{M}}}^{\perp} \bm{y} = (I_n - P_{X_{\mathcal{M}}}) \bm{y},$$

where $P_{X_{\mathcal{M}}} = X_{\mathcal{M}} (X_{\mathcal{M}}^\top X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top$. Let $K$ be the number of features selected by the SFS method. (We discuss a situation where the number of features $K$ is selected by cross-validation in Appendix E. In other parts of the paper, we assume that $K$ is determined before looking at the data.) We denote the index of the feature selected at step $k$ as $j_k$ and the set of selected features up to step $k$ as $\mathcal{M}_k = \{j_1, \ldots, j_k\}$ for $k \in [K]$ (if there is no ambiguity, we simply denote the final set of selected features as $\mathcal{M}~(= \mathcal{M}_K)$). Feature $j_k$ is selected as

$$j_k = \operatorname*{arg\,min}_{j \in [p] \setminus \mathcal{M}_{k-1}} \left\| \bm{r}(\bm{y}, X_{\mathcal{M}_{k-1} \cup \{j\}}) \right\|_2^2 = \operatorname*{arg\,max}_{j \in [p] \setminus \mathcal{M}_{k-1}} \left| \bm{x}_j^\top \bm{r}(\bm{y}, X_{\mathcal{M}_{k-1}}) \right|, \qquad (2)$$

where $\mathcal{M}_0 := \emptyset$ and $\bm{r}(\bm{y}, \mathcal{M}_0) := \bm{y}$.
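To make the selection rule (2) concrete, the following is a minimal NumPy sketch of the $K$-step forward SFS, assuming the columns of `X` are normalized to unit length; the function name `forward_sfs` and its interface are illustrative, not part of any released implementation.

```python
import numpy as np

def forward_sfs(X, y, K):
    """Minimal K-step forward stepwise feature selection (sketch of Eq. (2)).

    At each step, pick the feature most correlated with the residual of the
    least-squares fit on the previously selected features.
    """
    selected, signs = [], []
    for _ in range(K):
        if selected:
            XM = X[:, selected]
            # residual r(y, X_M) = (I_n - P_{X_M}) y
            r = y - XM @ np.linalg.lstsq(XM, y, rcond=None)[0]
        else:
            r = y.copy()
        corr = X.T @ r
        corr[selected] = 0.0  # exclude features that are already in the model
        j = int(np.argmax(np.abs(corr)))
        selected.append(j)
        signs.append(int(np.sign(corr[j])))
    return selected, signs
```

The returned `signs` list corresponds to the sign information $\mathcal{S}$ introduced later in §3.1.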

Statistical inference

In order to quantify the statistical significance of the relation between the selected features and the response, we consider a statistical test for each coefficient of the selected model parameters

$$\hat{\bm{\beta}}_{\mathcal{M}} = (X_{\mathcal{M}}^\top X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top \bm{y}.$$

Note that the $j^{\rm th}$ coefficient can be written as $\hat{\beta}_{\mathcal{M}, j} = \bm{\eta}^\top \bm{y}$ by defining

$$\bm{\eta} = X_{\mathcal{M}} \left( X_{\mathcal{M}}^\top X_{\mathcal{M}} \right)^{-1} \bm{e}_j, \qquad (3)$$

where $\bm{e}_j \in \mathbb{R}^{|\mathcal{M}|}$ is the unit vector whose $j^{\rm th}$ element is 1 and whose other elements are 0. Note that $\bm{\eta}$ depends on $\mathcal{M}$ and $j$, but we omit the dependence for notational simplicity. We consider the following statistical test for each coefficient $\beta_{\mathcal{M}, j} = \bm{\eta}^\top \bm{\mu}$:

$${\rm H}_0: \beta_{\mathcal{M}, j} = 0 \quad \text{vs.} \quad {\rm H}_1: \beta_{\mathcal{M}, j} \neq 0, \qquad (4)$$

where $\beta_{\mathcal{M}, j}$ is the $j^{\rm th}$ element of the population least-squares coefficient vector $\bm{\beta}_{\mathcal{M}} = (X_{\mathcal{M}}^\top X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top \bm{\mu}$, i.e., the coefficients of the projection of $\bm{\mu}$ onto the column space of $X_{\mathcal{M}}$. Throughout the paper, we do not assume that the linear model is correctly specified, i.e., it is possible that $\bm{\mu} \neq X_{\mathcal{M}} \bm{\beta}$ for any $\bm{\beta} \in \mathbb{R}^{|\mathcal{M}|}$. Even when the linear model is not correctly specified, $\bm{\beta}_{\mathcal{M}}$ is still a well-defined best linear approximation.

Conditional Selective Inference

Because the target of the inference is selected by observing the data $\bm{y}$, if we naively apply a traditional statistical test to the problem in (4) as if the inference target were pre-determined, the result would be invalid (the type-I error cannot be controlled at the desired significance level) owing to selection bias. To address the selection bias problem, we consider conditional SI introduced in Lee et al. (2016) and Tibshirani et al. (2016). Let us write the SFS algorithm as the following function:

$$\mathcal{A}: \bm{y} \mapsto \mathcal{M},$$

which maps a response vector $\bm{y} \in \mathbb{R}^n$ to the set of selected features $\mathcal{M}$ obtained by the SFS algorithm.

In conditional SI, the inference is conducted based on the following conditional sampling distribution of the test statistic:

$$\bm{\eta}^\top \bm{Y} \mid \left\{ \mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y}) \right\}, \qquad (5)$$

where

$$\bm{q}(\bm{Y}) = (I_n - \bm{c} \bm{\eta}^\top) \bm{Y} \ \text{ with } \ \bm{c} = \Sigma \bm{\eta} (\bm{\eta}^\top \Sigma \bm{\eta})^{-1}$$

is the nuisance parameter, which is independent of the test statistic. The first condition $\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y})$ in (5) indicates the event that the set of features selected by the $K$-step SFS method with a random response vector $\bm{Y}$ is $\mathcal{M}$, i.e., the same as that selected with the observed response vector $\bm{y}$. The second condition $\bm{q}(\bm{Y}) = \bm{q}(\bm{y})$ indicates that the nuisance parameter for a random response vector $\bm{Y}$ is the same as that for the observed vector $\bm{y}$. (The $\bm{q}(\bm{Y})$ corresponds to the component $\bm{z}$ in the seminal paper; see Lee et al. (2016), §5, Eq. 5.2, and Theorem 5.2.)

To conduct the conditional inference in (5), the main task is to identify the conditional data space

$$\mathcal{Y} = \{\bm{Y} \in \mathbb{R}^n \mid \mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}. \qquad (6)$$

Once $\mathcal{Y}$ is identified, we can easily compute the pivotal quantity

$$F^{\mathcal{Z}}_{\bm{\eta}^\top \bm{\mu},\, \bm{\eta}^\top \Sigma \bm{\eta}}(\bm{\eta}^\top \bm{y}) \mid \bm{y} \in \mathcal{Y}, \qquad (7)$$

where $F^{\mathcal{Z}}_{m, s^2}$ is the c.d.f. of the truncated Normal distribution with mean $m$, variance $s^2$, and truncation region $\mathcal{Z}$. We explain later how $\mathcal{Z}$ in (7) is defined. The pivotal quantity is crucial for calculating the $p$-value and for obtaining the confidence interval. Based on the pivotal quantity, we can obtain the selective $p$-value (Fithian et al., 2014) in the form of

$$p^{\rm selective}_j = 2 \min\{\pi_j, 1 - \pi_j\}, \qquad (8)$$

where $\pi_j = 1 - F^{\mathcal{Z}}_{0,\, \bm{\eta}^\top \Sigma \bm{\eta}}(\bm{\eta}^\top \bm{y})$. Furthermore, to obtain a $1 - \alpha$ confidence interval for any $\alpha \in [0, 1]$, by inverting the pivotal quantity in (7), we can find the smallest and largest values of $\bm{\eta}^\top \bm{\mu}$ such that the value of the pivotal quantity remains in the interval $\left[\frac{\alpha}{2}, 1 - \frac{\alpha}{2}\right]$ (Lee et al., 2016).
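As an illustration of how (7) and (8) are evaluated in practice, the sketch below computes the c.d.f. of a truncated Normal over a union of intervals and converts it into a selective $p$-value; it assumes the truncation region is given as a list of intervals on the scale of the test statistic, and the function names are our own.

```python
import numpy as np
from scipy.stats import norm

def truncated_gaussian_cdf(x, mean, sd, intervals):
    """CDF at x of N(mean, sd^2) truncated to the union of the given intervals."""
    num, den = 0.0, 0.0
    for lo, hi in intervals:
        a, b = norm.cdf(lo, mean, sd), norm.cdf(hi, mean, sd)
        den += b - a
        num += max(norm.cdf(min(x, hi), mean, sd) - a, 0.0)
    return num / den

def selective_p_value(eta, y, Sigma, intervals):
    """Two-sided selective p-value (8) for H0: eta^T mu = 0."""
    stat = float(eta @ y)                   # observed test statistic
    sd = float(np.sqrt(eta @ Sigma @ eta))  # its standard error
    pi = 1.0 - truncated_gaussian_cdf(stat, 0.0, sd, intervals)
    return 2.0 * min(pi, 1.0 - pi)
```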

Characterization of the conditional data space $\mathcal{Y}$

Using the second condition in (6), the data in $\mathcal{Y}$ are restricted to a line (see Liu et al. (2018) and Fithian et al. (2014)). Therefore, the set $\mathcal{Y}$ can be re-written, using a scalar parameter $z \in \mathbb{R}$, as

$$\mathcal{Y} = \{\bm{y}(z) \in \mathbb{R}^n \mid \bm{y}(z) = \bm{a} + \bm{b} z,\ z \in \mathcal{Z}\}, \qquad (9)$$

where $\bm{a} = \bm{q}(\bm{y})$, $\bm{b} = \Sigma \bm{\eta} (\bm{\eta}^\top \Sigma \bm{\eta})^{-1}$, and

$$\mathcal{Z} = \left\{ Z \in \mathbb{R} \mid \mathcal{A}(\bm{y}(Z)) = \mathcal{A}(\bm{y}) \right\}. \qquad (10)$$

Here, $\bm{a}$, $\bm{b}$, and $\mathcal{Z}$ depend on $\mathcal{M}$ and $j$, but we omit the subscripts for notational simplicity. Now, let us consider a random variable $Z \in \mathbb{R}$ and its observation $z \in \mathbb{R}$ such that $\bm{Y} = \bm{a} + \bm{b} Z$ and $\bm{y} = \bm{a} + \bm{b} z$, respectively. The conditional inference in (5) is re-written as the problem of characterizing the sampling distribution of

$$Z \mid Z \in \mathcal{Z}. \qquad (11)$$

Because $Z \sim \mathbb{N}(\bm{\eta}^\top \bm{\mu}, \bm{\eta}^\top \Sigma \bm{\eta})$, $Z \mid Z \in \mathcal{Z}$ follows a truncated Normal distribution. Once the truncation region $\mathcal{Z}$ is identified, the pivotal quantity in (7) is obtained as $F^{\mathcal{Z}}_{\bm{\eta}^\top \bm{\mu},\, \bm{\eta}^\top \Sigma \bm{\eta}}(z)$. Thus, the remaining task is reduced to the characterization of $\mathcal{Z}$.

Over-conditioning in existing conditional SI methods

Unfortunately, full identification of the truncation region $\mathcal{Z}$ in conditional SI for the SFS algorithm has been considered computationally infeasible. Therefore, in existing conditional SI studies such as Tibshirani et al. (2016), the authors circumvent the computational difficulty by over-conditioning. Note that over-conditioning is not harmful for selective type-I error control, but it leads to a loss of power (Fithian et al., 2014). In fact, the decrease in power due to over-conditioning is not a problem unique to SFS in Tibshirani et al. (2016), but is a major problem common to many existing conditional SI methods (Lee et al., 2016). In the next section, we propose a method to overcome this difficulty.

3 Proposed Homotopy-based SI for SFS

As we discussed in §2, to conduct conditional SI, the truncation region $\mathcal{Z} \subseteq \mathbb{R}$ in (10) should be identified. To construct $\mathcal{Z}$, our idea is 1) to compute $\mathcal{A}(\bm{y}(z))$ for all $z \in \mathbb{R}$ and 2) to identify the set of intervals of $z \in \mathbb{R}$ on which $\mathcal{A}(\bm{y}(z)) = \mathcal{A}(\bm{y})$. However, it seems intractable to obtain $\mathcal{A}(\bm{y}(z))$ for infinitely many values of $z \in \mathbb{R}$. To overcome this difficulty, we combine two approaches called extra-conditioning and homotopy continuation. Our idea is motivated by the regularization paths of Lasso (Osborne et al., 2000; Efron and Tibshirani, 2004), SVM (Hastie et al., 2004), and other similar methods (Rosset and Zhu, 2007; Bach et al., 2006; Tsuda, 2007; Lee and Scott, 2007; Takeuchi et al., 2009; Takeuchi and Sugiyama, 2011; Karasuyama and Takeuchi, 2010; Hocking et al., 2011; Karasuyama et al., 2012; Ogawa et al., 2013; Takeuchi et al., 2013), in which the solution path along the regularization parameter can be computed by analyzing the KKT optimality conditions of parametrized convex optimization problems. Although SFS cannot be formulated as a convex optimization problem, by introducing the notion of extra-conditioning, an approach conceptually similar to homotopy continuation can be used to keep track of all possible changes in the selected features when the response vector $\bm{y}$ changes along the direction of the test statistic. A conceptually similar idea has recently been used for evaluating the reliability of deep learning representations (Duy et al., 2020b).

3.1 Extra-Conditioning

First, let us consider the history $\mathcal{H}$ and signs $\mathcal{S}$ of the SFS algorithm. The history of the SFS algorithm is defined as

$$\mathcal{H} := (\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_K),$$

i.e., the sequence of the sets of selected features over the $K$ steps. The signs of the SFS algorithm are defined as a vector $\mathcal{S} := (\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_K)$, whose $k^{\rm th}$ element is defined as

$$\mathcal{S}_k = {\rm sgn}\left( \bm{x}_{j_k}^\top \bm{r}(\bm{y}, X_{\mathcal{M}_{k-1}}) \right)$$

for $k \in [K]$, which indicates the sign of the $j_k^{\rm th}$ feature when it first enters the model at step $k$.

We interpret the function $\mathcal{A}: \bm{y} \mapsto \mathcal{M}$ as a composite function $\mathcal{A} = \mathcal{A}_1 \circ \mathcal{A}_2$, where

$$\mathcal{A}_1: (\mathcal{H}, \mathcal{S}) \mapsto \mathcal{M}, \ \text{ and } \ \mathcal{A}_2: \bm{y} \mapsto (\mathcal{H}, \mathcal{S}),$$

i.e., the following relationships hold: $\mathcal{M} = \mathcal{A}(\bm{y}) = \mathcal{A}_1(\mathcal{A}_2(\bm{y}))$, $\mathcal{M} = \mathcal{A}_1((\mathcal{H}, \mathcal{S}))$, $(\mathcal{H}, \mathcal{S}) = \mathcal{A}_2(\bm{y})$.

Let us now consider conditional SI not on $\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y})$ but on $\mathcal{A}_2(\bm{Y}) = \mathcal{A}_2(\bm{y})$. The next lemma indicates that, by conditioning on the history and the signs $(\mathcal{H}, \mathcal{S})$ rather than the set of selected features $\mathcal{M}$, the truncation region can be simply represented as an interval on the line $\bm{y}(z) = \bm{a} + \bm{b} z$, $z \in \mathbb{R}$.

Lemma 1.

Consider a response vector $\bm{y}' \in \mathcal{Y}$. Let $(\mathcal{H}', \mathcal{S}') = \mathcal{A}_2(\bm{y}')$ be the history and the signs obtained by applying the $K$-step SFS algorithm to the response vector $\bm{y}'$, and write their elements as

$$\mathcal{H}' = (\mathcal{M}'_1, \ldots, \mathcal{M}'_K) \ \text{ and } \ \mathcal{S}' = (\mathcal{S}'_1, \ldots, \mathcal{S}'_K).$$

Then, the over-conditioned truncation region defined as

$$\mathcal{Z}^{\rm oc}(\bm{y}') = \left\{ z \in \mathbb{R} \mid \mathcal{A}_2(\bm{y}(z)) = \mathcal{A}_2(\bm{y}') \right\} \qquad (12)$$

is an interval

$$z \in \left[ \max_{\substack{k \in [K],\\ j \in [p] \setminus \mathcal{M}'_{k-1},\\ d_{(k,j)} > 0}} \frac{e_{(k,j)}}{d_{(k,j)}},\ \min_{\substack{k \in [K],\\ j \in [p] \setminus \mathcal{M}'_{k-1},\\ d_{(k,j)} < 0}} \frac{e_{(k,j)}}{d_{(k,j)}} \right], \qquad (13)$$

where

$$e_{(k,j)} = (\bm{x}_j - \bm{x}_{j_k} \mathcal{S}'_k)^\top P_{X_{\mathcal{M}'_{k-1}}}^{\perp} \bm{a}, \qquad d_{(k,j)} = (\bm{x}_{j_k} \mathcal{S}'_k - \bm{x}_j)^\top P_{X_{\mathcal{M}'_{k-1}}}^{\perp} \bm{b},$$

for $k \in [K]$, $j \in [p] \setminus \mathcal{M}'_{k-1}$, where $j_k$ denotes the feature added at step $k$ in the history $\mathcal{H}'$.

The proof is presented in Appendix A.
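A direct implementation of the interval in (13) is short. The following sketch recomputes $e_{(k,j)}$ and $d_{(k,j)}$ for every step $k$ and every candidate feature $j$, assuming the history is supplied as the list of features added at each step together with their signs (both obtained by running the SFS on $\bm{y}'$); the helper name is ours, and the $\pm$ from the proof of Lemma 1 is handled by looping over both signs of $\bm{x}_j$.

```python
import numpy as np

def lemma1_interval(X, a, b, j_hist, signs):
    """Over-conditioned truncation interval of Lemma 1 (sketch).

    j_hist[k] is the feature entering at step k+1 and signs[k] its sign,
    both computed on the response vector y' = y(z').
    """
    n, p = X.shape
    lo, hi = -np.inf, np.inf
    for k, (jk, sk) in enumerate(zip(j_hist, signs)):
        M_prev = j_hist[:k]
        if M_prev:
            XM = X[:, M_prev]
            P_perp = np.eye(n) - XM @ np.linalg.pinv(XM)  # projection onto residual space
        else:
            P_perp = np.eye(n)
        for j in range(p):
            if j in M_prev:
                continue
            for s in (+1.0, -1.0):  # both signs of x_j from the +/- in the proof
                e = (s * X[:, j] - sk * X[:, jk]) @ P_perp @ a
                d = (sk * X[:, jk] - s * X[:, j]) @ P_perp @ b
                if d > 0:
                    lo = max(lo, e / d)
                elif d < 0:
                    hi = min(hi, e / d)
    return lo, hi
```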

Note that the condition $\mathcal{A}_2(\bm{Y}) = \mathcal{A}_2(\bm{y})$ is redundant for our goal of making the inference conditional on $\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y})$. This over-conditioning is undesirable because it decreases the power of the conditional inference (Fithian et al., 2014), but the original conditional SI method for SFS in Tibshirani et al. (2016) performs conditional inference under exactly this over-conditioning, because otherwise the selection event cannot be formulated as a polytope. In the following subsection, we overcome this computational difficulty by utilizing the homotopy continuation approach.

3.2 Homotopy Continuation

In order to obtain the optimal truncation region $\mathcal{Z}$ in (10), our idea is to enumerate all the over-conditioned regions in (12) and to take their union

$$\mathcal{Z} = \bigcup_{\bm{y}' \in \mathcal{Y} \mid \mathcal{A}(\bm{y}') = \mathcal{A}(\bm{y})} \mathcal{Z}^{\rm oc}(\bm{y}')$$

by homotopy continuation. To this end, let us consider all possible pairs of history and signs $(\mathcal{H}(z), \mathcal{S}(z)) = \mathcal{A}_2(\bm{y}(z))$ for $z \in \mathbb{R}$. Clearly, the number of such pairs is finite. With a slight abuse of notation, we index each pair by $t = 1, 2, \ldots, T$ and denote it as $(\mathcal{H}^{(t)}, \mathcal{S}^{(t)})$, where $T$ is the number of pairs. Lemma 1 indicates that each pair corresponds to an interval on the line. Without loss of generality, we assume that the left-most interval corresponds to $t = 1$, the second left-most interval corresponds to $t = 2$, and so on. Then, using an increasing sequence of real numbers $z_1 < z_2 < \cdots < z_T < z_{T+1}$, we can write these intervals as $[z_1, z_2], [z_2, z_3], \ldots, [z_T, z_{T+1}]$. In practice, we do not have to consider the entire line $z \in (-\infty, \infty)$; it suffices to consider a sufficiently wide range. Specifically, in the experiments in §5, we set $z_{\rm min} = z_1 = -(|z| + 10)\sigma$ and $z_{\rm max} = z_{T+1} = (|z| + 10)\sigma$, where $\sigma$ is the standard error of the test statistic. With this choice of the range, the probability mass outside the range is negligibly small.

Our simple idea is to compute all possible pairs $\{(\mathcal{H}^{(t)}, \mathcal{S}^{(t)})\}_{t=1}^T$ by keeping track of the intervals $[z_1, z_2], [z_2, z_3], \ldots, [z_T, z_{T+1}]$ one by one, and then to compute the truncation region $\mathcal{Z}$ by collecting the intervals in which the set of selected features $\mathcal{M}^{(t)} = \mathcal{A}_1((\mathcal{H}^{(t)}, \mathcal{S}^{(t)}))$ is the same as the set of features actually selected from the observed data, $\mathcal{M} = \mathcal{A}(\bm{y})$, i.e.,

$$\mathcal{Z} = \bigcup_{t \in [T] \mid \mathcal{A}_1((\mathcal{H}^{(t)}, \mathcal{S}^{(t)})) = \mathcal{A}(\bm{y})} [z_t, z_{t+1}].$$

We call $z_1, z_2, \ldots, z_{T+1}$ breakpoints. We start by applying the SFS algorithm to the response vector $\bm{y}(z_1)$ and obtain the first pair $(\mathcal{H}^{(1)}, \mathcal{S}^{(1)})$. Next, using Lemma 1 with $\bm{y}' = \bm{y}(z_1)$, the next breakpoint $z_2$ is obtained as the right end of the interval in (13). The second pair $(\mathcal{H}^{(2)}, \mathcal{S}^{(2)})$ is then obtained by applying the SFS method to the response vector $\bm{y}(z_2 + \Delta z)$, where $\Delta z$ is a small value such that $z_t + \Delta z < z_{t+1}$ for all $t \in [T]$. This process is repeated until the next breakpoint becomes greater than $z_{\rm max}$. The pseudo-code of the entire method is presented in Algorithm 1 and the computation of the truncation region $\mathcal{Z}$ is described in Algorithm 2.

Input: $X, \bm{y}, K, [z_{\rm min}, z_{\rm max}]$
1: $\mathcal{M} \leftarrow$ apply the $K$-step SFS algorithm to $(X, \bm{y})$
2: for each selected feature $j \in \mathcal{M}$ do
3:   compute $\bm{\eta}$ by Equation (3)
4:   compute $\bm{a}$ and $\bm{b}$ by Equation (9)
5:   $\mathcal{Z} \leftarrow {\tt compute\_truncation\_region}(X, K, \bm{a}, \bm{b}, [z_{\rm min}, z_{\rm max}], \mathcal{M})$
6:   $p_j^{\rm selective} \leftarrow$ Equation (8) (and/or selective confidence interval)
7: end for
Output: $\{p_j^{\rm selective}\}_{j \in \mathcal{M}}$ (and/or selective confidence intervals)

Algorithm 1 SFS_conditional_SI

Input: $X, K, \bm{a}, \bm{b}, [z_{\rm min}, z_{\rm max}], \mathcal{M}$
1: Initialization: $t = 1$, $z_t = z_{\rm min}$, $\mathcal{Z} = \emptyset$
2: while $z_t < z_{\rm max}$ do
3:   $\bm{y}(z_t + \Delta z) = \bm{a} + \bm{b}(z_t + \Delta z)$
4:   $(\mathcal{H}^{(t)}, \mathcal{S}^{(t)}) \leftarrow$ apply the $K$-step SFS to $(X, \bm{y}(z_t + \Delta z))$ and set $\mathcal{M}^{(t)} = \mathcal{A}_1((\mathcal{H}^{(t)}, \mathcal{S}^{(t)}))$
5:   $z_{t+1} \leftarrow$ Equation (13)
6:   if $\mathcal{M}^{(t)} = \mathcal{M}$ then
7:     $\mathcal{Z} \leftarrow \mathcal{Z} \cup [z_t, z_{t+1}]$
8:   end if
9:   $t \leftarrow t + 1$
10: end while
Output: $\mathcal{Z} = \{z \in [z_{\rm min}, z_{\rm max}] \mid \mathcal{A}(\bm{y}(z)) = \mathcal{A}(\bm{y})\}$
Algorithm 2 compute_truncation_region
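For concreteness, a compact Python sketch of Algorithms 1 and 2 might look as follows; it reuses the hypothetical helpers `forward_sfs`, `lemma1_interval`, and `selective_p_value` from the earlier sketches, sets the search range to a sufficiently wide interval around the observed statistic in the spirit of §3.2, and is an illustration rather than the authors' implementation.

```python
import numpy as np

def compute_truncation_region(X, K, a, b, z_min, z_max, M_obs, dz=1e-4):
    """Algorithm 2 (sketch): union of intervals of z on which the same
    feature set M_obs is selected, obtained by tracking breakpoints."""
    Z, z_t = [], z_min
    while z_t < z_max:
        y_z = a + b * (z_t + dz)
        j_hist, signs = forward_sfs(X, y_z, K)        # (H^(t), S^(t))
        _, z_next = lemma1_interval(X, a, b, j_hist, signs)
        z_next = min(z_next, z_max)
        if set(j_hist) == set(M_obs):                  # M^(t) == M
            Z.append((z_t, z_next))
        z_t = max(z_next, z_t + dz)                    # guard against stalling
    return Z

def sfs_conditional_si(X, y, Sigma, K):
    """Algorithm 1 (sketch): selective p-value for each selected feature."""
    M, _ = forward_sfs(X, y, K)
    XM = X[:, M]
    pvals = {}
    for idx, j in enumerate(M):
        eta = XM @ np.linalg.inv(XM.T @ XM)[:, idx]    # Eq. (3)
        sd = float(np.sqrt(eta @ Sigma @ eta))
        b = Sigma @ eta / (eta @ Sigma @ eta)          # Eq. (9)
        a = y - b * (eta @ y)                          # nuisance component q(y)
        z_obs = float(eta @ y)
        # a wide search range around the observed statistic
        z_min, z_max = -(abs(z_obs) + 10.0 * sd), abs(z_obs) + 10.0 * sd
        Z = compute_truncation_region(X, K, a, b, z_min, z_max, M)
        pvals[j] = selective_p_value(eta, y, Sigma, Z)  # Eq. (8)
    return pvals
```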

4 Forward Backward Stepwise Feature Selection

In this section, we present a conditional SI method for the FB-SFS algorithm using the homotopy method. At each step of the FB-SFS algorithm, a feature is considered for addition to or removal from the current set of selected features based on some pre-specified criterion. As an example of a commonly used criterion, we study the AIC-based criterion, which is used as the default option in the well-known stepAIC function in R.

Under the assumption of the Normal linear model $\bm{Y} \sim \mathbb{N}(X\bm{\beta}, \Sigma)$, for a set of selected features $\mathcal{M}$, the AIC is written as

$${\rm AIC} := \min_{\bm{\beta}} (\bm{y} - X_{\mathcal{M}} \bm{\beta})^\top \Sigma^{-1} (\bm{y} - X_{\mathcal{M}} \bm{\beta}) + 2|\mathcal{M}| = \bm{y}^\top A_{\mathcal{M}} \bm{y} + 2|\mathcal{M}|, \qquad (14)$$

where $A_{\mathcal{M}} = \Sigma^{-1} - \Sigma^{-1} X_{\mathcal{M}} (X_{\mathcal{M}}^\top \Sigma^{-1} X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top \Sigma^{-1}$ and $|\mathcal{M}|$ is the number of selected features (the irrelevant constant term is omitted here). The goal of AIC-based FB-SFS is to find a model with the smallest AIC while adding and removing features step by step. The algorithm starts either from the null model (the model with only a constant term) or the full model (the model with all $p$ features). At each step, among all possible choices of adding one feature to or deleting one feature from the current model, the one with the smallest AIC is selected. The algorithm terminates when the AIC can no longer be reduced.
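As a small illustration of (14), the following sketch computes $A_{\mathcal{M}}$ and the AIC for a candidate set of features, and also returns the coefficients of the quadratic function of $z$ obtained along the line $\bm{y}(z) = \bm{a} + \bm{b}z$, which is the representation exploited later in Lemma 2 (Eq. (20) in Appendix A); the function names are our own.

```python
import numpy as np

def aic_matrix(X, M, Sigma_inv):
    """A_M = Sigma^{-1} - Sigma^{-1} X_M (X_M^T Sigma^{-1} X_M)^{-1} X_M^T Sigma^{-1}."""
    XM = X[:, list(M)]
    G = XM.T @ Sigma_inv @ XM
    return Sigma_inv - Sigma_inv @ XM @ np.linalg.inv(G) @ XM.T @ Sigma_inv

def aic(X, y, M, Sigma_inv):
    """AIC(M) = y^T A_M y + 2|M| as in Eq. (14), dropping the constant term."""
    return float(y @ aic_matrix(X, M, Sigma_inv) @ y) + 2 * len(M)

def aic_quadratic_in_z(X, M, Sigma_inv, a, b):
    """Coefficients (c2, c1, c0) of AIC(M, z) = c2 z^2 + c1 z + c0 along y(z) = a + b z."""
    A = aic_matrix(X, M, Sigma_inv)
    return float(b @ A @ b), 2.0 * float(a @ A @ b), float(a @ A @ a) + 2 * len(M)
```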

With a slight abuse of notation, we use the same notation as in §3:

$$\mathcal{A}: \bm{y} \mapsto \mathcal{M}, \quad \mathcal{A}_1: \mathcal{H} \mapsto \mathcal{M}, \quad \mathcal{A}_2: \bm{y} \mapsto \mathcal{H},$$

where $\mathcal{M}$ is the set of selected features and $\mathcal{H}$ is the history of the FB-SFS algorithm, written as

$$\mathcal{H} = (\mathcal{M}_1, \ldots, \mathcal{M}_K),$$

where $\mathcal{M}_k$ is the set of selected features at step $k$ for $k \in [K]$. Here, the history $\mathcal{H}$ contains the information on feature addition and removal in the FB-SFS algorithm. Therefore, unlike the forward SFS algorithm in §3, the size of $\mathcal{M}_k$ can increase or decrease depending on whether a feature is added or removed.

Using these notations, the truncation region $\mathcal{Z}$ is similarly obtained as

$$\mathcal{Z} = \left\{ z \in \mathbb{R} \mid \mathcal{A}(\bm{y}(z)) = \mathcal{A}(\bm{y}) \right\} = \bigcup_{\mathcal{H} \mid \mathcal{A}_1(\mathcal{H}) = \mathcal{A}(\bm{y})} \left\{ z \in \mathbb{R} \mid \mathcal{A}_2(\bm{y}(z)) = \mathcal{H} \right\}. \qquad (15)$$

The basic idea of the homotopy method is the same as before: we find the range of $z \in \mathbb{R}$ on which the algorithm has a certain history $\mathcal{H}$ and take the union of those ranges for which the output $\mathcal{M}$ obtained from the history $\mathcal{H}$ is equal to the actually selected set of features $\mathcal{M} = \mathcal{A}(\bm{y})$. The following lemma indicates that the range of $z \in \mathbb{R}$ on which the algorithm has a certain history $\mathcal{H}$ can be analytically obtained.

Lemma 2.

Consider a response vector $\bm{y}' \in \mathcal{Y}$. Then, the over-conditioned truncation region defined as

$$\mathcal{Z}^{\rm oc}(\bm{y}') = \left\{ z \in \mathbb{R} \mid \mathcal{A}_2(\bm{y}(z)) = \mathcal{A}_2(\bm{y}') \right\} \qquad (16)$$

is characterized by a finite number of quadratic inequalities in $z \in \mathbb{R}$.

The proof is presented in Appendix A.

Unlike the case of polytope-based SI for forward SFS in the previous section, the over-conditioned region $\mathcal{Z}^{\rm oc}(\bm{y}')$ in Lemma 2 may consist of multiple intervals. The homotopy continuation approach can be similarly used for computing the union of intervals in (15). Starting from $z_1 = z_{\rm min}$, we first compute $\mathcal{Z}^{\rm oc}(\bm{y}')$ with $\bm{y}' = \bm{y}(z_1)$ using Lemma 2. If $\mathcal{Z}^{\rm oc}(\bm{y}')$ consists of multiple intervals, we consider only the interval containing the current value of $z$ and set the right end of that interval as the next breakpoint. The computation of the next breakpoint $z_t$ can be written as

$$z_t = \max\{z \in \mathcal{Z}^{\rm oc}(\bm{y}(z_{t-1} + \Delta z))\} + \Delta z, \qquad (17)$$

for $t = 1, 2, \ldots, T$, where, if $\mathcal{Z}^{\rm oc}(\bm{y}(z_{t-1} + \Delta z))$ consists of multiple separated intervals, the maximum operator in (17) is taken within the interval containing $z_{t-1} + \Delta z$.
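Each condition characterizing the history in Lemma 2 is a quadratic inequality of the form $c_2 z^2 + c_1 z + c_0 \le 0$, obtained as the difference of two AIC quadratics such as those returned by the hypothetical `aic_quadratic_in_z` sketch above. The sketch below finds the connected feasible interval around the current point, whose right end is used as the next breakpoint in (17); it assumes the current point itself satisfies all constraints.

```python
import numpy as np

def interval_containing(z0, quad_constraints, eps=1e-12):
    """Connected interval around z0 on which every constraint
    c2*z^2 + c1*z + c0 <= 0 holds (sketch; z0 is assumed feasible)."""
    lo, hi = -np.inf, np.inf
    for c2, c1, c0 in quad_constraints:
        if abs(c2) < eps:                      # effectively linear constraint
            if abs(c1) < eps:
                continue                       # constant constraint, feasible at z0
            root = -c0 / c1
            if c1 > 0:
                hi = min(hi, root)             # z <= root
            else:
                lo = max(lo, root)             # z >= root
            continue
        disc = c1 * c1 - 4.0 * c2 * c0
        if disc < 0:
            continue                           # no real roots: sign is constant, feasible at z0
        r1, r2 = sorted(np.roots([c2, c1, c0]).real)
        if c2 > 0:                             # feasible set is [r1, r2]
            lo, hi = max(lo, r1), min(hi, r2)
        elif z0 <= r1:                         # feasible set is (-inf, r1] U [r2, inf)
            hi = min(hi, r1)
        else:
            lo = max(lo, r2)
    return lo, hi
```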

5 Experiment

We present the experimental results for the forward SFS method of §3 and the FB-SFS method of §4 in §5.1 and §5.2, respectively. The details of the experimental setups are described in Appendix B. More experimental results on the computational and robustness aspects of the proposed method are provided in Appendices C and D, respectively. In all the experiments, we set the significance level $\alpha = 0.05$.

5.1 Forward SFS

We compared the following five methods:

  • Homotopy: conditioning on the selected features $\mathcal{M}$ (minimal conditioning);

  • Homotopy-H: additionally conditioning on the history $\mathcal{H}$;

  • Homotopy-S: additionally conditioning on the signs $\mathcal{S}$;

  • Polytope (Tibshirani et al., 2016): additionally conditioning on the history $\mathcal{H}$ and signs $\mathcal{S}$;

  • DS: split the data and use one half for feature selection and the other for inference.

Synthetic data experiments

We examined the false positive rates (FPRs), true positive rates (TPRs), and confidence interval (CI) lengths. We generated the dataset $\{(\bm{x}_i, y_i)\}_{i \in [n]}$ by $\bm{x}_i \sim \mathbb{N}(0, I_p)$ and $y_i = \bm{x}_i^\top \bm{\beta} + \varepsilon_i$ with $\varepsilon_i \sim \mathbb{N}(0, 1)$. We set $n \in \{50, 100, 150\}$, $p = 5$, and $K = 3$ for the experiments on FPRs and TPRs. The coefficient vector $\bm{\beta}$ was set to $[0, 0, 0, 0, 0]^\top$ and $[0.25, 0.25, 0, 0, 0]^\top$ for the FPR and TPR experiments, respectively. We set $n = 100$, $p = 10$, $K = 9$, and $\bm{\beta} = [0.25, 0.25, 0.25, 0.25, 0.25, 0, 0, 0, 0, 0]^\top$ for the experiments on CIs.

The results for FPRs, TPRs, and CIs are shown in Fig. 1(a), (b), and (c), respectively. The FPRs and TPRs were estimated from 100 trials, and the plots in Fig. 1(a) and (b) are the averages over 20 repetitions of the 100 trials, while the plots in Fig. 1(c) are the results of 100 trials. In all five methods, the FPRs are properly controlled under the significance level $\alpha$. Regarding the TPR comparison, the Homotopy method has the highest power because it is minimally conditioned. Note that the additional conditioning on the history $\mathcal{H}$ and/or the signs $\mathcal{S}$ significantly decreases the power. The results for the CIs are consistent with the results for the TPRs.

Figure 1: Results of forward SFS on synthetic data. (a) False Positive Rates (FPRs); (b) True Positive Rates (TPRs); (c) Confidence Interval (CI) Lengths.

Real data experiments

We compared the proposed method (Homotopy) and the conventional method (Polytope) on three real datasets: Housing (Dataset 1), Abalone (Dataset 2), and Concrete Compressive Strength (Dataset 3) from the UCI machine learning repository. From each of the original datasets, we randomly generated sub-sampled datasets with sizes 25, 50, and 100. We applied the proposed and conventional methods to each sub-sampled dataset, examined the cases where the selective $p$-value of a selected feature differed between the two methods, and computed the percentage of times the proposed method provided a smaller selective $p$-value than the conventional one. Table 1 shows the results from 1000 trials. The proposed method provides smaller selective $p$-values than the conventional method significantly more frequently, especially when $n$ is large.

Table 1: Results of the three real-data experiments for forward SFS. The percentage of cases in which the selective $p$-value of the proposed method (Homotopy) was smaller than that of the conventional method (Polytope) in random sub-sampling experiments is shown.

          Dataset 1   Dataset 2   Dataset 3
$n=25$    56.40%      57.69%      66.79%
$n=50$    62.85%      65.1%       70.09%
$n=100$   71.3%       71.88%      76.58%

Table 2: Results of the three real-data experiments for FB-SFS. The percentage of cases in which the selective $p$-value of the proposed method (Homotopy) was smaller than that of the conventional method (Quadratic) in random sub-sampling experiments is shown.

          Dataset 1   Dataset 2   Dataset 3
$n=25$    69.35%      75.43%      57.13%
$n=50$    72.72%      78.42%      60.91%
$n=100$   78.60%      81.38%      76.19%

5.2 Forward-Backward (FB)-SFS

We compared the following two methods:

  • Homotopy: conditioning on the selected features $\mathcal{M}$ (minimal conditioning);

  • Quadratic: additionally conditioning on the history $\mathcal{H}$ (implemented by using the quadratic inequality-based conditional SI in Loftus and Taylor (2015)).

Synthetic data experiments

We examined the FPRs, TPRs, and CI lengths by generating synthetic datasets in the same way as above. We set $n \in \{50, 100, 150\}$ and $p \in \{10, 20, 50\}$ for the experiments on FPRs and TPRs. The coefficients $\bm{\beta}$ were all zero for the FPR case, whereas for the TPR case the first half of the coefficients were set to one of $0.01$, $0.25$, $0.5$, or $1$ and the second half were zero. We set $n = 100$, $p = 10$, and each element of $\bm{\beta}$ was randomly drawn from $\mathbb{N}(0, 1)$ for the experiments on CIs.

The results for FPRs, TPRs, and CIs are shown in Fig. 2(a), (b), and (c), respectively. The FPRs and TPRs were estimated from 100 trials, and the plots in Fig. 2(a) and (b) are the averages over 20 repetitions of the 100 trials for FPR and 10 repetitions for TPR, whereas the plots in Fig. 2(c) are the results of 100 trials. In both methods, the FPRs are properly controlled under the significance level $\alpha$. Regarding the TPR comparison, the Homotopy method has higher power than Quadratic because it is minimally conditioned. The results for the CIs are consistent with the results for the TPRs.

Figure 2: Results of FB-SFS on synthetic data. (a) False Positive Rates (FPRs); (b) True Positive Rates (TPRs); (c) Confidence Interval (CI) Lengths.

Real data experiments

We compared the proposed method (Homotopy) and the conventional method (Quadratic) on the same three real datasets. From each of the original datasets, we randomly generated sub-sampled datasets with sizes 25, 50, and 100. We applied the proposed and conventional methods to each sub-sampled dataset, examined the cases where the selective $p$-value of a selected feature differed between the two methods, and computed the percentage of times the proposed method provided a smaller selective $p$-value than the conventional one. Table 2 shows the results from 1000 trials. The proposed method provides smaller selective $p$-values than the conventional method significantly more frequently, especially when $n$ is large.

6 Conclusion

In this paper, we proposed a more powerful and general conditional SI method for the SFS algorithm. We resolved the over-conditioning problem in existing approaches by introducing the homotopy continuation approach. The experimental results indicated that the proposed homotopy-based approach is more powerful and general.

Acknowledgement

This work was partially supported by MEXT KAKENHI (20H00601, 16H06538), JST CREST (JPMJCR1502), RIKEN Center for Advanced Intelligence Project, and RIKEN Junior Research Associate Program.

Appendix A Proofs

A.1 Proof of Lemma 1

Proof.

Let $(\mathcal{H}(z), \mathcal{S}(z)) = \mathcal{A}_2(\bm{y}(z))$ for all $z \in \mathbb{R}$ and write $\mathcal{H}(z) = (\mathcal{M}_1(z), \ldots, \mathcal{M}_K(z))$ and $\mathcal{S}(z) = (\mathcal{S}_1(z), \ldots, \mathcal{S}_K(z))$. By conditioning on the history $\mathcal{H}(z) = \mathcal{H}'$,

$$\mathcal{M}_1(z) = \mathcal{M}'_1, \ldots, \mathcal{M}_K(z) = \mathcal{M}'_K$$

for all $z \in \mathbb{R}$ such that $\mathcal{H}(z) = \mathcal{H}'$. This indicates that

$$\left| \bm{x}_{j'_k}^\top \bm{r}(\bm{y}(z), X_{\mathcal{M}'_{k-1}}) \right| \geq \pm \bm{x}_j^\top \bm{r}(\bm{y}(z), X_{\mathcal{M}'_{k-1}}) \qquad (18)$$

for all $(k, j) \in [K] \times ([p] \setminus \mathcal{M}'_{k-1})$ and $z \in \mathbb{R}$ such that $\mathcal{H}(z) = \mathcal{H}'$, where $(j'_1, \ldots, j'_K)$ is the sequence of the selected features when the $K$-step SFS algorithm is applied to the response vector $\bm{y}'$. By further conditioning on the signs $\mathcal{S}(z) = \mathcal{S}'$, (18) is written as

$$\mathcal{S}'_k \bm{x}_{j'_k}^\top \bm{r}(\bm{y}(z), X_{\mathcal{M}'_{k-1}}) \geq \pm \bm{x}_j^\top \bm{r}(\bm{y}(z), X_{\mathcal{M}'_{k-1}})$$

for all $(k, j) \in [K] \times ([p] \setminus \mathcal{M}'_{k-1})$ and $z \in \mathbb{R}$ such that $\mathcal{H}(z) = \mathcal{H}'$ and $\mathcal{S}(z) = \mathcal{S}'$. By restricting to the line $\bm{y}(z) = \bm{a} + \bm{b} z$, $z \in \mathbb{R}$, the range of $z$ is written as

$$\max_{\substack{k \in [K],\\ j \in [p] \setminus \mathcal{M}'_{k-1},\\ d_{(k,j)} > 0}} \frac{e_{(k,j)}}{d_{(k,j)}} \leq z \leq \min_{\substack{k \in [K],\\ j \in [p] \setminus \mathcal{M}'_{k-1},\\ d_{(k,j)} < 0}} \frac{e_{(k,j)}}{d_{(k,j)}}. \qquad (19)$$

∎

A.2 Proof of Lemma 2

Proof.

For a set of features $\mathcal{M} \subseteq [p]$ and a response vector $\bm{y}(z) = \bm{a} + \bm{b} z$, $z \in \mathbb{R}$, on the line, let us denote the AIC as a function of $\mathcal{M}$ and $z$ by ${\rm AIC}(\mathcal{M}, z)$. By substituting $\bm{y} = \bm{a} + \bm{b} z$ into (14), it is written as a quadratic function of $z$:

$${\rm AIC}(\mathcal{M}, z) = (\bm{b}^\top A_{\mathcal{M}} \bm{b}) z^2 + 2 (\bm{a}^\top A_{\mathcal{M}} \bm{b}) z + (\bm{a}^\top A_{\mathcal{M}} \bm{a}) + 2|\mathcal{M}|. \qquad (20)$$

The over-conditioned truncation region (16) is the range of $z \in \mathbb{R}$ such that, when $\bm{y}(z)$ is fed into the algorithm, the same history $\mathcal{H}' = \mathcal{A}_2(\bm{y}')$ is obtained. Let the sequence of selected models corresponding to the history $\mathcal{H}' = \mathcal{A}_2(\bm{y}')$ be

$$\mathcal{H}' = (\mathcal{M}'_1, \ldots, \mathcal{M}'_K),$$

where $K$ is the number of steps in the history $\mathcal{H}'$. The event of obtaining this history can be fully characterized by comparing AICs as follows:

$${\rm AIC}(\mathcal{M}'_k, z) \leq {\rm AIC}(\mathcal{M}'_{k-1} \cup \{j\}, z) \quad \forall j \in [p] \setminus \mathcal{M}'_{k-1},$$
$${\rm AIC}(\mathcal{M}'_k, z) \leq {\rm AIC}(\mathcal{M}'_{k-1} \setminus \{j\}, z) \quad \forall j \in \mathcal{M}'_{k-1},$$
$${\rm AIC}(\mathcal{M}'_k, z) \leq {\rm AIC}(\mathcal{M}'_{k-1}, z), \qquad (21)$$

for $k = 1, 2, \ldots, K$, and

$${\rm AIC}(\mathcal{M}'_K, z) < {\rm AIC}(\mathcal{M}'_K \cup \{j\}, z) \quad \forall j \in [p] \setminus \mathcal{M}'_K,$$
$${\rm AIC}(\mathcal{M}'_K, z) < {\rm AIC}(\mathcal{M}'_K \setminus \{j\}, z) \quad \forall j \in \mathcal{M}'_K. \qquad (22)$$

Here, the first and second inequalities in (21) indicate that the model selected at step $k$ has the smallest AIC among all possible choices, the third inequality in (21) indicates that the AIC of the model selected at step $k$ is smaller than that of the previous step, and the two inequalities in (22) indicate that the AIC of the model selected at the final step $K$ cannot be decreased any further. Because the AIC is a quadratic function of $z$ as in (20) for each fixed model in the history $\mathcal{H}'$, all of these conditions are quadratic inequalities in $z$. This means that the range of $z \in \mathbb{R}$ that satisfies these conditions is represented by a finite set of quadratic inequalities of $z \in \mathbb{R}$. ∎

Appendix B Details of Experiments

We executed the experiments on Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz.

Comparison methods in the case of Forward SFS.

In the case of forward SFS, recall that $\mathcal{A}(\bm{y})$ returns the set of selected features $\mathcal{M}$ when the forward SFS algorithm $\mathcal{A}$ is applied to $\bm{y}$. With a slight abuse of notation, let $\mathcal{H}(\bm{y})$ and $\mathcal{S}(\bm{y})$ respectively denote the history and signs obtained when applying algorithm $\mathcal{A}$ to $\bm{y}$. We compared the following five methods:

  • Homotopy: conditioning on the selected features $\mathcal{M}$ (minimal conditioning), i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

  • Homotopy-H: additionally conditioning on the history $\mathcal{H}$, i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{H}(\bm{Y}) = \mathcal{H}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

    Here, we note that $\mathcal{H}(\bm{Y}) = \mathcal{H}(\bm{y})$ already includes the event $\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y})$.

  • Homotopy-S: additionally conditioning on the signs, i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \mathcal{S}(\bm{Y}) = \mathcal{S}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

  • Polytope (Tibshirani et al., 2016): additionally conditioning on both the history $\mathcal{H}$ and the signs $\mathcal{S}$, i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{H}(\bm{Y}) = \mathcal{H}(\bm{y}),\ \mathcal{S}(\bm{Y}) = \mathcal{S}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

    This definition is equivalent to

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{A}_2(\bm{Y}) = \mathcal{A}_2(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\},$$

    where $\mathcal{A}_2(\cdot)$ is defined in §3.

  • DS: data splitting, a commonly used procedure for selection bias correction. The data is randomly divided into two halves; the first half is used for model selection and the second for inference.

Comparison methods in the case of Forward-Backward SFS.

In the case of forward-backward SFS, $\mathcal{A}(\bm{y})$ returns the set of selected features $\mathcal{M}$ when the forward-backward SFS algorithm $\mathcal{A}$ is applied to $\bm{y}$, and $\mathcal{A}_2(\bm{y})$ returns the history. We compared the following two methods:

  • Homotopy: conditioning on the selected features $\mathcal{M}$ (minimal conditioning), i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

  • Quadratic: additionally conditioning on the history $\mathcal{H}$ (implemented by using the quadratic inequality-based conditional SI in Loftus and Taylor (2015)), i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{A}_2(\bm{Y}) = \mathcal{A}_2(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

Definition of TPR.

In SI, we only conduct a statistical test when there is at least one hypothesis discovered by the algorithm. Therefore, the definition of TPR, which can also be called conditional power, is as follows:

$${\rm TPR} = \frac{\#\,\text{detected \& rejected}}{\#\,\text{detected}},$$

where $\#\,\text{detected}$ is the number of truly positive features selected by the algorithm (e.g., SFS) and $\#\,\text{detected \& rejected}$ is the number of truly positive features whose null hypothesis is also rejected by SI.
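For illustration, the conditional power defined above can be computed from a dictionary of selective $p$-values keyed by selected feature index; the helper below is a sketch with names of our own choosing.

```python
def conditional_power(p_values, true_support, alpha=0.05):
    """TPR (conditional power): fraction of detected truly positive features
    whose selective p-value is rejected at level alpha."""
    detected = [j for j in p_values if j in true_support]
    if not detected:
        return float("nan")  # no truly positive feature was detected
    rejected = [j for j in detected if p_values[j] <= alpha]
    return len(rejected) / len(detected)
```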

Demonstration of confidence interval (forward SFS).

We generated $n = 100$ outcomes as $y_i = \bm{x}_i^\top \bm{\beta} + \varepsilon_i$, $i = 1, \ldots, n$, where $\bm{x}_i \sim \mathbb{N}(0, I_p)$ and $\varepsilon_i \sim \mathbb{N}(0, 1)$. We set $p = 10$, $K = 9$, and $\bm{\beta} = [0.25, 0.25, 0.25, 0.25, 0.25, 0, 0, 0, 0, 0]^\top$. We note that the number of selected features can differ between the four conditional SI methods (Homotopy, Homotopy-H, Homotopy-S, Polytope) and DS. Therefore, for a fair comparison, we only consider the features that are selected in both cases. Figure 3 shows the demonstration of CIs for each selected feature. The results are consistent with Fig. 1(b).

Figure 3: Demonstration of confidence intervals.

Appendix C Experiments on Computational Aspects (Forward SFS)

We demonstrate the computational efficiency of the proposed Homotopy method. We generated $n$ outcomes as $y_i = \bm{x}_i^\top \bm{\beta} + \varepsilon_i$, $i = 1, \ldots, n$, where $\bm{x}_i \sim \mathbb{N}(0, I_p)$ and $\varepsilon_i \sim \mathbb{N}(0, 1)$. We set $n = 50$ and $p = 10$. In Figure 4, we compare the computational time of the proposed Homotopy method and the existing method. For the existing study, to keep high statistical power, one has to enumerate a huge number of combinations of histories and signs, $2^K \times K!$, which is feasible only when the number of selected features is fairly small. The blue plots in Fig. 4 show that the computational cost of the existing method increases exponentially with the number of selected features. With the proposed method, we are able to significantly reduce the computational cost while keeping high power.

Figure 4: Comparison of the computational time between the proposed method and the existing method on an artificial dataset with $n = 50$, $p = 10$.

One might wonder how we can circumvent the computational bottleneck of the exponentially increasing number of polytopes. Our experience suggests that, by focusing on the line along the test statistic in the data space, we can skip the majority of the polytopes that do not affect the truncated Normal sampling distribution because they do not intersect with this line. In other words, we can skip the majority of the combinations of histories and signs that never appear. We show violin plots of the actual numbers of intervals of the test statistic $z$ involved in the construction of the truncated sampling distribution in Figure 5. Here, we set $n = 250$ and $p = 50$. Regarding Homotopy and Homotopy-S, the number of polytopes increases linearly with $K$. This is the reason why our method is highly efficient. In regard to Homotopy-H, the number of polytopes decreases because the history constraint becomes too strict as $K$ increases.

Figure 5: The number of polytopes intersecting the line of $z$ that we need to consider. The solid lines show the sample averages.

Appendix D Experiments on Robustness (Forward-Backward SFS)

We applied our proposed method to the cases where the noise follows the Laplace distribution, a skew normal distribution (skewness coefficient 10), or the $t_{20}$ distribution. We also conducted experiments in which $\sigma^2$ was estimated from the data. We generated $n$ outcomes as $y_i = \bm{x}_i^\top \bm{\beta} + \varepsilon_i$, $i = 1, \ldots, n$, where $p = 5$, $\bm{x}_i \sim \mathbb{N}(0, I_p)$, and $\varepsilon_i$ follows the Laplace distribution, the skew normal distribution, or the $t_{20}$ distribution with zero mean and standard deviation 1. In the case of estimated $\sigma^2$, $\varepsilon_i \sim \mathbb{N}(0, 1)$. We set all elements of $\bm{\beta}$ to 0 and set $\lambda = 0.5$. For each case, we ran 1,200 trials for each $n \in \{100, 200, 300, 400\}$. The FPR results are shown in Figure 6. Although we only demonstrate the results for the forward-backward SFS algorithm, the extension to the forward SFS algorithm with a similar setting is straightforward.

Figure 6: The robustness of the proposed method in terms of FPR control.

Appendix E Homotopy-based SI for Cross-Validation (Forward SFS)

In this section, we introduce a method for SI conditional also on the selection of the number of selected features KK via cross-validation. Consider selecting the number of steps KK in the SFS method from a given set of candidates 𝒦={K1,,KL}{\mathcal{K}}=\{K_{1},\ldots,K_{L}\} where LL is the number of candidates. When conducting cross-validation on the observed dataset (X,𝒚)(X,\bm{y}), suppose that 𝒱(𝒚)=Kselected𝒦{\mathcal{V}}(\bm{y})=K_{\rm selected}\in{\mathcal{K}} is the event that KselectedK_{\rm selected} is selected as the best one. The test-statistic for the selected feature jj when applying the SFS method with KselectedK_{\rm selected} steps to (X,𝒚)(X,\bm{y}) is then defined as

$$\bm{\eta}^{\top}\bm{Y} \mid \{\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \mathcal{V}(\bm{Y}) = K_{\rm selected},\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}. \tag{23}$$

We note that $\mathcal{A}(\cdot)$ and $\bm{q}(\cdot)$ depend on $K_{\rm selected}$, but we omit the dependence for notational simplicity. The conditional data space in (9) with the event of selecting $K_{\rm selected}$ is then written as

$$\mathcal{Y} = \{\bm{y}(z) = \bm{a} + \bm{b}z \mid z \in \mathcal{Z}_{\rm CV}\}, \tag{24}$$

where $\mathcal{Z}_{\rm CV} = \{z \in \mathbb{R} \mid \mathcal{A}(\bm{y}(z)) = \mathcal{A}(\bm{y}),\ \mathcal{V}(\bm{y}(z)) = K_{\rm selected}\}$. The truncation region $\mathcal{Z}_{\rm CV}$ can be obtained as the intersection of the following two sets:

$$\mathcal{Z}_1 = \{z \in \mathbb{R} \mid \mathcal{A}(\bm{y}(z)) = \mathcal{A}(\bm{y})\} \quad \text{ and } \quad \mathcal{Z}_2 = \{z \in \mathbb{R} \mid \mathcal{V}(\bm{y}(z)) = K_{\rm selected}\}.$$

Since the former, $\mathcal{Z}_1$, can be obtained by using the method described in §3, the remaining task is to identify the latter, $\mathcal{Z}_2$.
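If $\mathcal{Z}_1$ and $\mathcal{Z}_2$ are each represented as unions of intervals of $z$, as in the homotopy computation, their intersection can be formed by a simple pairwise overlap. The helper below is an illustrative sketch, not part of the paper's implementation.

```python
# Sketch: Z_CV = Z_1 ∩ Z_2, with each set represented as a list of disjoint intervals.
def intersect_interval_unions(Z1, Z2):
    result = []
    for lo1, hi1 in Z1:
        for lo2, hi2 in Z2:
            lo, hi = max(lo1, lo2), min(hi1, hi2)
            if lo < hi:
                result.append((lo, hi))
    return result

# Example: Z1 = [(-3.0, -1.0), (0.5, 4.0)], Z2 = [(-2.0, 2.0)]
# -> [(-2.0, -1.0), (0.5, 2.0)]
```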

For notational simplicity, we consider the case where the dataset $(X, \bm{y})$ is divided into training and validation sets, and the latter is used for selecting $K_{\rm selected}$. The following discussion can be easily extended to the cross-validation scenario. Let us rewrite

$$(X, \bm{y}) = \left((X^{\rm tr}\ X^{\rm va})^{\top} \in \mathbb{R}^{n \times p},\ (\bm{y}^{\rm tr}\ \bm{y}^{\rm va})^{\top} \in \mathbb{R}^{n}\right).$$

With a slight abuse of notation, for $K \in \mathcal{K}$, let $\mathcal{M}_K(\bm{y}^{\rm tr}(z))$ be the set of features selected by applying the $K$-step SFS method to $(X^{\rm tr}, \bm{y}^{\rm tr}(z))$. The validation error is then defined as

$$E_K(z) = \left\|\bm{y}^{\rm va}(z) - X^{\rm va}_{\mathcal{M}_K(\bm{y}^{\rm tr}(z))} \hat{\bm{\beta}}_K(z)\right\|^2_2, \tag{25}$$

where $\hat{\bm{\beta}}_K(z) = \left({X^{\rm tr}_{\mathcal{M}_K(\bm{y}^{\rm tr}(z))}}^{\top} X^{\rm tr}_{\mathcal{M}_K(\bm{y}^{\rm tr}(z))}\right)^{-1} {X^{\rm tr}_{\mathcal{M}_K(\bm{y}^{\rm tr}(z))}}^{\top} \bm{y}^{\rm tr}(z)$. Then, we can write

$$\mathcal{Z}_2 = \{z \in \mathbb{R} \mid E_{K_{\rm selected}}(z) \leq E_K(z) \text{ for all } K \in \mathcal{K}\}.$$

Since the validation error $E_K(z)$ in (25) is a piecewise-quadratic function of $z$, we have one piecewise-quadratic function of $z$ for each $K \in \mathcal{K}$. The truncation region $\mathcal{Z}_2$ can then be identified by intersecting, over all the other $K \in \mathcal{K}$, the intervals of $z$ in which $E_{K_{\rm selected}}(z)$ is no greater than $E_K(z)$.
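The sketch below illustrates one piece of this computation, under the assumption that the active set $\mathcal{M}_K$ is fixed on the current interval of $z$: since $\bm{y}^{\rm tr}(z)$ and $\bm{y}^{\rm va}(z)$ are affine in $z$, the validation residual is affine in $z$ and $E_K(z)$ is a quadratic $c_2 z^2 + c_1 z + c_0$. The function names are illustrative and this is not the paper's implementation; $\mathcal{Z}_2$ is obtained by intersecting the resulting regions over $K$ and over the intervals on which the active sets stay fixed.

```python
# Sketch: quadratic coefficients of E_K(z) on a fixed active set, and the region
# where one validation-error quadratic is no greater than another.
import numpy as np

def validation_error_quadratic(X_tr_M, X_va_M, a_tr, b_tr, a_va, b_va):
    pinv = np.linalg.pinv(X_tr_M)                 # least-squares fit on training data
    u = a_va - X_va_M @ (pinv @ a_tr)             # residual intercept (in z)
    v = b_va - X_va_M @ (pinv @ b_tr)             # residual slope (in z)
    return np.array([v @ v, 2.0 * u @ v, u @ u])  # [c2, c1, c0]

def region_where_first_is_smaller(q_sel, q_other):
    """Intervals of z where (q_sel - q_other)(z) <= 0, coefficients in [c2, c1, c0] order."""
    d2, d1, d0 = q_sel - q_other
    if abs(d2) < 1e-12:                           # difference is (at most) linear
        if abs(d1) < 1e-12:
            return [(-np.inf, np.inf)] if d0 <= 0 else []
        root = -d0 / d1
        return [(-np.inf, root)] if d1 > 0 else [(root, np.inf)]
    disc = d1 * d1 - 4.0 * d2 * d0
    if disc < 0:                                  # no real roots: sign is that of d2
        return [(-np.inf, np.inf)] if d2 < 0 else []
    r1 = (-d1 - np.sqrt(disc)) / (2.0 * d2)
    r2 = (-d1 + np.sqrt(disc)) / (2.0 * d2)
    lo, hi = min(r1, r2), max(r1, r2)
    if d2 > 0:
        return [(lo, hi)]                         # <= 0 between the roots
    return [(-np.inf, lo), (hi, np.inf)]          # <= 0 outside the roots
```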

Loftus (2015) already discussed the possibility of incorporating the cross-validation event into the conditional SI framework. However, his method is highly over-conditioned in the sense that it requires additional conditioning on all intermediate models obtained in the process of cross-validation. The method described above is minimally conditioned SI in the sense that our inference is conducted exactly on the conditional sampling distribution of the test-statistic in (23) without any extra conditions.

For the experiments on cross-validation, we compare the TPRs and the CIs between the case where $K = 9$ is fixed and the cases where $K$ is selected from the set $\mathcal{K}_1 = \{3, 6, 9\}$ or $\mathcal{K}_2 = \{1, 2, \ldots, 10\}$ using 5-fold cross-validation. We set $p = 10$; only the first element of $\bm{\beta}$ was set to $0.25$, and all the remaining elements were set to $0$. Figure 7 shows that the TPR tends to decrease as the size of $\mathcal{K}$ increases. This is because, when we increase the size of $\mathcal{K}$, we have to condition on more information, which leads to a shorter truncation interval and hence a lower TPR. The TPR results are consistent with the CI results shown in Figure 8: as the size of $\mathcal{K}$ increases, the TPR decreases and the CIs become longer.
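For reference, the following is a minimal sketch (not the authors' code) of how $K$ can be chosen by 5-fold cross-validation over a candidate set, using a simple greedy forward SFS based on residual sum of squares; the helpers `forward_sfs` and `select_K_by_cv` are hypothetical and may differ from the exact selection criterion used in the paper.

```python
# Sketch: select K by 5-fold cross-validation over a candidate set.
import numpy as np

def forward_sfs(X, y, K):
    """Greedy forward selection: at each step, add the feature that most reduces the RSS."""
    active = []
    for _ in range(K):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in active:
                continue
            cols = active + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        active.append(best_j)
    return active

def select_K_by_cv(X, y, candidates, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_error = {}
    for K in candidates:
        err = 0.0
        for va in folds:
            tr = np.setdiff1d(np.arange(len(y)), va)
            cols = forward_sfs(X[tr], y[tr], K)
            beta, *_ = np.linalg.lstsq(X[tr][:, cols], y[tr], rcond=None)
            err += np.sum((y[va] - X[va][:, cols] @ beta) ** 2)
        cv_error[K] = err
    return min(cv_error, key=cv_error.get)
```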

Figure 7: Demonstration of the TPR when accounting for the cross-validation selection event.
Figure 8: Demonstration of the CI length when accounting for the cross-validation selection event.

References

  • Lee et al. (2016) Jason D Lee, Dennis L Sun, Yuekai Sun, Jonathan E Taylor, et al. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
  • Tibshirani et al. (2016) Ryan J Tibshirani, Jonathan Taylor, Richard Lockhart, and Robert Tibshirani. Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620, 2016.
  • Fithian et al. (2014) William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
  • Benjamini and Yekutieli (2005) Yoav Benjamini and Daniel Yekutieli. False discovery rate–adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):71–81, 2005.
  • Leeb and Pötscher (2005) Hannes Leeb and Benedikt M Pötscher. Model selection and inference: Facts and fiction. Econometric Theory, pages 21–59, 2005.
  • Leeb et al. (2006) Hannes Leeb, Benedikt M Pötscher, et al. Can one estimate the conditional distribution of post-model-selection estimators? The Annals of Statistics, 34(5):2554–2591, 2006.
  • Benjamini et al. (2009) Yoav Benjamini, Ruth Heller, and Daniel Yekutieli. Selective inference in complex research. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4255–4271, 2009.
  • Pötscher et al. (2010) Benedikt M Pötscher, Ulrike Schneider, et al. Confidence sets based on penalized maximum likelihood estimators in gaussian regression. Electronic Journal of Statistics, 4:334–360, 2010.
  • Berk et al. (2013) Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, Linda Zhao, et al. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.
  • Lockhart et al. (2014) Richard Lockhart, Jonathan Taylor, Ryan J Tibshirani, and Robert Tibshirani. A significance test for the lasso. Annals of Statistics, 42(2):413, 2014.
  • Taylor et al. (2014) Jonathan Taylor, Richard Lockhart, Ryan J Tibshirani, and Robert Tibshirani. Post-selection adaptive inference for least angle regression and the lasso. arXiv preprint arXiv:1401.3889, 354, 2014.
  • Fithian et al. (2015) William Fithian, Jonathan Taylor, Robert Tibshirani, and Ryan Tibshirani. Selective sequential model selection. arXiv preprint arXiv:1512.02565, 2015.
  • Choi et al. (2017) Yunjin Choi, Jonathan Taylor, Robert Tibshirani, et al. Selecting the number of principal components: Estimation of the true rank of a noisy matrix. The Annals of Statistics, 45(6):2590–2617, 2017.
  • Tian et al. (2018) Xiaoying Tian, Jonathan Taylor, et al. Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710, 2018.
  • Chen and Bien (2019) Shuxiao Chen and Jacob Bien. Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, pages 1–12, 2019.
  • Hyun et al. (2018) Sangwon Hyun, Kevin Lin, Max G’Sell, and Ryan J Tibshirani. Post-selection inference for changepoint detection algorithms with application to copy number variation data. arXiv preprint arXiv:1812.03644, 2018.
  • Loftus and Taylor (2014) Joshua R Loftus and Jonathan E Taylor. A significance test for forward stepwise model selection. arXiv preprint arXiv:1405.3920, 2014.
  • Loftus and Taylor (2015) Joshua R Loftus and Jonathan E Taylor. Selective inference in regression models with groups of variables. arXiv preprint arXiv:1511.01478, 2015.
  • Panigrahi et al. (2016) Snigdha Panigrahi, Jonathan Taylor, and Asaf Weinstein. Bayesian post-selection inference in the linear model. arXiv preprint arXiv:1605.08824, 28, 2016.
  • Yang et al. (2016) Fan Yang, Rina Foygel Barber, Prateek Jain, and John Lafferty. Selective inference for group-sparse linear models. In Advances in Neural Information Processing Systems, pages 2469–2477, 2016.
  • Suzumura et al. (2017) Shinya Suzumura, Kazuya Nakagawa, Yuta Umezu, Koji Tsuda, and Ichiro Takeuchi. Selective inference for sparse high-order interaction models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3338–3347. JMLR.org, 2017.
  • Tanizaki et al. (2020) Kosuke Tanizaki, Noriaki Hashimoto, Yu Inatsu, Hidekata Hontani, and Ichiro Takeuchi. Computing valid p-values for image segmentation by selective inference. 2020.
  • Duy et al. (2020a) Vo Nguyen Le Duy, Hiroki Toda, Ryota Sugiyama, and Ichiro Takeuchi. Computing valid p-value for optimal changepoint by selective inference using dynamic programming. arXiv preprint arXiv:2002.09132, 2020a.
  • Terada and Shimodaira (2019) Yoshikazu Terada and Hidetoshi Shimodaira. Selective inference after variable selection via multiscale bootstrap. arXiv preprint arXiv:1905.10573, 2019.
  • Liu et al. (2018) Keli Liu, Jelena Markovic, and Robert Tibshirani. More powerful post-selection inference, with application to the lasso. arXiv preprint arXiv:1801.09037, 2018.
  • Duy and Takeuchi (2020) Vo Nguyen Le Duy and Ichiro Takeuchi. Parametric programming approach for powerful lasso selective inference without conditioning on signs. arXiv preprint arXiv:2004.09749, 2020.
  • Osborne et al. (2000) Michael R Osborne, Brett Presnell, and Berwin A Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000.
  • Efron and Tibshirani (2004) B. Efron and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
  • Hastie et al. (2004) T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–415, 2004.
  • Rosset and Zhu (2007) S. Rosset and J. Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35:1012–1030, 2007.
  • Bach et al. (2006) F. R. Bach, D. Heckerman, and E. Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7:1713–41, 2006.
  • Tsuda (2007) K. Tsuda. Entire regularization paths for graph data. In Proc. of ICML 2007, pages 919–925, 2007.
  • Lee and Scott (2007) G. Lee and C. Scott. The one class support vector machine solution path. In Proc. of ICASSP 2007, pages II521–II524, 2007.
  • Takeuchi et al. (2009) I. Takeuchi, K. Nomura, and T. Kanamori. Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Computation, 21(2):539–559, 2009.
  • Takeuchi and Sugiyama (2011) Ichiro Takeuchi and Masashi Sugiyama. Target neighbor consistent feature weighting for nearest neighbor classification. In Advances in neural information processing systems, pages 576–584, 2011.
  • Karasuyama and Takeuchi (2010) M. Karasuyama and I. Takeuchi. Nonlinear regularization path for quadratic loss support vector machines. IEEE Transactions on Neural Networks, 22(10):1613–1625, 2010.
  • Hocking et al. (2011) T. Hocking, J. P. Vert, F. Bach, and A. Joulin. Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, pages 745–752, 2011.
  • Karasuyama et al. (2012) M. Karasuyama, N. Harada, M. Sugiyama, and I. Takeuchi. Multi-parametric solution-path algorithm for instance-weighted support vector machines. Machine Learning, 88(3):297–330, 2012.
  • Ogawa et al. (2013) Kohei Ogawa, Motoki Imamura, Ichiro Takeuchi, and Masashi Sugiyama. Infinitesimal annealing for training semi-supervised support vector machines. In International Conference on Machine Learning, pages 897–905, 2013.
  • Takeuchi et al. (2013) Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, and Shinichi Nakajima. Parametric task learning. Advances in Neural Information Processing Systems, 26:1358–1366, 2013.
  • Duy et al. (2020b) Vo Nguyen Le Duy, Shogo Iwazaki, and Ichiro Takeuchi. Quantifying statistical significance of neural network representation-driven hypotheses by selective inference. arXiv preprint arXiv:2010.01823, 2020b.