
More Powerful and General Selective Inference for Stepwise Feature Selection Using the Homotopy Method

Kazuya Sugiyama*, Vo Nguyen Le Duy*, Ichiro Takeuchi
Nagoya Institute of Technology, RIKEN
*Equal contribution. Correspondence to: Ichiro Takeuchi ([email protected])
Abstract

Conditional selective inference (SI) has been actively studied as a new statistical inference framework for data-driven hypotheses. The basic idea of conditional SI is to make inferences conditional on the selection event characterized by a set of linear and/or quadratic inequalities. Conditional SI has been mainly studied in the context of feature selection such as stepwise feature selection (SFS). The main limitation of the existing conditional SI methods is the loss of power due to over-conditioning, which is required for computational tractability. In this study, we develop a more powerful and general conditional SI method for SFS using the homotopy method which enables us to overcome this limitation. The homotopy-based SI is especially effective for more complicated feature selection algorithms. As an example, we develop a conditional SI method for forward-backward SFS with AIC-based stopping criteria, and show that it is not adversely affected by the increased complexity of the algorithm. We conduct several experiments to demonstrate the effectiveness and efficiency of the proposed method.

1 Introduction

As machine learning (ML) is being applied to a greater variety of practical problems, ensuring the reliability of ML is becoming increasingly important. Among several potential approaches to reliable ML, conditional selective inference (SI) is recognized as a promising approach for evaluating the statistical reliability of data-driven hypotheses selected by ML algorithms. The basic idea of conditional SI is to make inference on a data-driven hypothesis conditional on the selection event that the hypothesis is selected. Conditional SI has been actively studied especially in the context of feature selection. Notably, Lee et al. (2016) and Tibshirani et al. (2016) proposed conditional SI methods for features selected by Lasso and stepwise feature selection (SFS), respectively. Their basic idea is to characterize the selection event by a polytope, i.e., a set of linear inequalities, in the sample space. When a selection event can be characterized by a polytope, practical computational methods developed by these authors can be used for making inferences on the selected hypotheses conditional on the selection events.

Unfortunately, however, such polytope-based SI has several limitations because it can only be used when the characterization of all relevant selection events is represented by a polytope. In fact, in most of the existing polytope-based SI studies, extra-conditioning is required in order for the selection event to be characterized as a polytope. For example, in SI for SFS by Tibshirani et al. (2016), the authors needed to consider conditioning not only on selected features but also on additional events regarding the history of the feature selection algorithmic process and the signs of the features. Such extra-conditioning leads to loss of power in the inference (Fithian et al., 2014).

In this study, we go beyond polytope-based SI and propose a novel conditional SI method using the homotopy continuation approach for SFS. We call the proposed method homotopy-based SI. In contrast to polytope-based SI for SFS in Tibshirani et al. (2016), the proposed method is more powerful and more general. The basic idea of homotopy-based SI is to use the homotopy continuation approach to keep track of the hypothesis selection event when the dataset changes in the direction of the selected test statistic, which enables efficient identification of the subspace of the sample space in which the same hypothesis is selected. Furthermore, we show that the proposed homotopy-based SI is more advantageous for more complicated feature selection algorithms. As an example, we develop a homotopy-based conditional SI for forward-backward SFS (FB-SFS) algorithm with AIC-based stopping criteria, and show that it is not adversely affected by the increased complexity of the algorithm and that a sufficiently high power is retained.

Related works

Traditional statistical inference presumes that a statistical model and a statistical target for which we seek to conduct inference are determined before observing the dataset. Therefore, if we apply traditional methods to hypotheses selected after observing the dataset, the inferential results are no longer valid. This problem has been extensively discussed in the context of feature selection. In fact, even in commonly used feature selection methods such as SFS, correct assessment of the statistical reliability of selected features has long been difficult. Several approaches have been suggested in the literature toward addressing this problem (Benjamini and Yekutieli, 2005; Leeb and Pötscher, 2005; Leeb et al., 2006; Benjamini et al., 2009; Pötscher et al., 2010; Berk et al., 2013; Lockhart et al., 2014; Taylor et al., 2014). A particularly notable approach is conditional SI introduced in the seminal paper by Lee et al. (2016). In their work, the authors showed that, for a set of features selected by Lasso, the selection event can be characterized as a polytope by conditioning on the selected set of features as well as additional events on their signs. Furthermore, Tibshirani et al. (2016) showed that polytope-based SI is also applicable to SFS by additionally conditioning on the history and signs of sequential feature selection. Conditional SI has been actively studied in the past few years and extended to various directions (Fithian et al., 2015; Choi et al., 2017; Tian et al., 2018; Chen and Bien, 2019; Hyun et al., 2018; Loftus and Taylor, 2014, 2015; Panigrahi et al., 2016; Tibshirani et al., 2016; Yang et al., 2016; Suzumura et al., 2017; Tanizaki et al., 2020; Duy et al., 2020a).

In conditional SI, it is typically preferred to condition on as little as possible so that the inference can be more powerful (Fithian et al., 2014). Namely, the main limitation of current polytope-based SI methods is that excessive conditioning is required to represent the selection event with a single polytope. In the seminal paper by Lee et al. (2016), the authors already discussed the problem of over-conditioning, and explained how the extra conditioning on signs can be omitted by exhaustively enumerating all possible sign combinations and taking the union over the resulting polyhedra. However, such an exhaustive enumeration of an exponentially increasing number of sign combinations is feasible only when the number of selected features is fairly small. Several other approaches have been proposed to circumvent the drawbacks and restrictions of polytope-based SI. Loftus and Taylor (2015) extended polytope-based SI so that selection events characterized by quadratic inequalities can be handled, but this inherits the same over-conditioning problem. To improve the power, Tian et al. (2018) proposed randomizing the algorithm in order to condition on less. Terada and Shimodaira (2019) proposed to use bootstrap re-sampling to characterize the selection event more generally. The disadvantage of these approaches is that additional randomness is introduced into the algorithm and/or the inference.

This study is motivated by Liu et al. (2018) and Duy and Takeuchi (2020). The former studied Lasso SI for full-model parameters, whereas the latter extended the basic idea of the former so that it can also be applied to Lasso SI in more general settings. These two studies go beyond polytope-based SI to achieve more powerful Lasso SI without conditioning on signs. Because Lasso is formulated as a convex optimization problem, the selection event for conditional SI can be characterized based on the optimality conditions of the convex optimization problem. These two methods rely on the optimality conditions to characterize the minimum conditions for the Lasso feature selection event.

In contrast, the SFS algorithm cannot be formulated as an optimization problem. Therefore, existing conditional SI methods for SFS are based on conditioning on the history rather than the output of the SFS algorithm. Namely, existing SI methods actually consider conditions not only on which features are selected but also in what order they are selected. Moreover, when the SFS algorithm is extended to more complex procedures such as FB-SFS, the over-conditioning problem has more pronounced negative effects. Because features can be added and removed in various orders in FB-SFS, if we condition on the history rather than the output of the algorithm, we will suffer from extensive over-conditioning on the entire history of feature addition and removal.

Our contributions

Our contributions are as follows. First, we propose a novel conditional SI method for the SFS algorithm using the homotopy continuation approach and show that the proposed method overcomes the over-conditioning problem in existing methods, enabling us to conduct minimally conditioned, more powerful conditional SI on the features selected by the SFS algorithm. Second, we develop a conditional SI method for the FB-SFS algorithm with AIC-based stopping criteria by using the homotopy continuation approach. Although the conventional polytope-based SI method is more adversely affected by the over-conditioning problem when SFS is extended to FB-SFS, we show that our homotopy-based method still enables us to conduct minimally conditioned SI and retains sufficiently high power. We demonstrate the validity and power of the proposed homotopy-based SI methods through numerical experiments on synthetic and real benchmark datasets. The code will be released upon acceptance of this draft.

Notations

For a natural number $n$, we denote by $[n]$ the set $\{1, \ldots, n\}$. The $n \times n$ identity matrix is denoted by $I_n$. For a matrix $X$ and a set of column indices $\mathcal{M}$, $X_{\mathcal{M}}$ denotes the submatrix of $X$ consisting of the columns indexed by $\mathcal{M}$.

2 Problem Statement

We consider the forward SFS for a regression problem. Let $n$ be the number of instances and $p$ be the number of original features. We denote the observed dataset as $(X, \bm{y})$, where $X \in \mathbb{R}^{n \times p}$ is the design matrix and $\bm{y} \in \mathbb{R}^n$ is the response vector. Following the problem setup from the existing literature on conditional SI such as Lee et al. (2016) and Tibshirani et al. (2016), we assume that the observed response $\bm{y}$ is a realization of the following random response vector

$$\bm{Y} = (Y_1, \ldots, Y_n)^\top \sim \mathbb{N}(\bm{\mu}, \Sigma), \qquad (1)$$

where $\bm{\mu} \in \mathbb{R}^n$ is the unknown mean vector and $\Sigma \in \mathbb{R}^{n \times n}$ is the covariance matrix, which is known or estimable from independent data. The design matrix $X$ is assumed to be non-random. For notational simplicity, we assume that each column vector of $X$ is normalized to have unit length.

Stepwise feature selection

We consider the standard forward SFS method as studied in Tibshirani et al. (2016) in §2 and §3. At each step of the SFS method, the feature that most improves the fit is newly added. When each feature has unit length, this is equivalent to selecting the feature most correlated with the residual of the least-squares regression model fitted with the previously selected features. For a response vector $\bm{y} \in \mathbb{R}^n$ and a set of features $\mathcal{M} \subseteq [p]$, let $\bm{r}(\bm{y}, X_{\mathcal{M}})$ be the residual vector obtained by regressing $\bm{y}$ onto $X_{\mathcal{M}}$, i.e.,

$$\bm{r}(\bm{y}, X_{\mathcal{M}}) = P_{X_{\mathcal{M}}}^{\perp} \bm{y} = (I_n - P_{X_{\mathcal{M}}}) \bm{y},$$

where $P_{X_{\mathcal{M}}} = X_{\mathcal{M}} (X_{\mathcal{M}}^\top X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top$. Let $K$ be the number of features selected by the SFS method. (We discuss a situation where the number of features $K$ is selected by cross-validation in Appendix E. In other parts of the paper, we assume that $K$ is determined before looking at the data.) We denote the index of the feature selected at step $k$ as $j_k$ and the set of selected features up to step $k$ as $\mathcal{M}_k = \{j_1, \ldots, j_k\}$ for $k \in [K]$ (if there is no ambiguity, we simply denote the final set of selected features as $\mathcal{M}~(= \mathcal{M}_K)$). Feature $j_k$ is selected as

$$j_k = \operatorname*{arg\,min}_{j \in [p] \setminus \mathcal{M}_{k-1}} \left\| \bm{r}(\bm{y}, X_{\mathcal{M}_{k-1} \cup \{j\}}) \right\|_2^2 = \operatorname*{arg\,max}_{j \in [p] \setminus \mathcal{M}_{k-1}} \left| \bm{x}_j^\top \bm{r}(\bm{y}, X_{\mathcal{M}_{k-1}}) \right|, \qquad (2)$$

where $\mathcal{M}_0 := \emptyset$ and $\bm{r}(\bm{y}, \mathcal{M}_0) := \bm{y}$.
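To make the selection rule (2) concrete, the following is a minimal NumPy sketch of the $K$-step forward SFS, assuming the columns of `X` are normalized to unit length; the function name `forward_sfs` and its interface are illustrative, not part of any released implementation.

```python
import numpy as np

def forward_sfs(X, y, K):
    """Minimal K-step forward stepwise feature selection (sketch of Eq. (2)).

    At each step, pick the feature most correlated with the residual of the
    least-squares fit on the previously selected features.
    """
    selected, signs = [], []
    for _ in range(K):
        if selected:
            XM = X[:, selected]
            # residual r(y, X_M) = (I_n - P_{X_M}) y
            r = y - XM @ np.linalg.lstsq(XM, y, rcond=None)[0]
        else:
            r = y.copy()
        corr = X.T @ r
        corr[selected] = 0.0  # exclude features that are already in the model
        j = int(np.argmax(np.abs(corr)))
        selected.append(j)
        signs.append(int(np.sign(corr[j])))
    return selected, signs
```

The returned `signs` list corresponds to the sign information $\mathcal{S}$ introduced later in §3.1.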

Statistical inference

In order to quantify the statistical significance of the relation between the selected features and the response, we consider a statistical test for each coefficient of the selected model parameters

$$\hat{\bm{\beta}}_{\mathcal{M}} = (X_{\mathcal{M}}^\top X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top \bm{y}.$$

Note that the $j^{\rm th}$ coefficient can be written as $\hat{\beta}_{\mathcal{M}, j} = \bm{\eta}^\top \bm{y}$ by defining

$$\bm{\eta} = X_{\mathcal{M}} \left( X_{\mathcal{M}}^\top X_{\mathcal{M}} \right)^{-1} \bm{e}_j, \qquad (3)$$

where $\bm{e}_j \in \mathbb{R}^{|\mathcal{M}|}$ is the unit vector whose $j^{\rm th}$ element is 1 and whose other elements are 0. Note that $\bm{\eta}$ depends on $\mathcal{M}$ and $j$, but we omit the dependence for notational simplicity. We consider the following statistical test for each coefficient $\beta_{\mathcal{M}, j} = \bm{\eta}^\top \bm{\mu}$:

$${\rm H}_0: \beta_{\mathcal{M}, j} = 0 \quad \text{vs.} \quad {\rm H}_1: \beta_{\mathcal{M}, j} \neq 0, \qquad (4)$$

where $\beta_{\mathcal{M}, j}$ is the $j^{\rm th}$ element of the population least-squares coefficient vector $\bm{\beta}_{\mathcal{M}} = (X_{\mathcal{M}}^\top X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top \bm{\mu}$, i.e., the coefficients of the projection of $\bm{\mu}$ onto the column space of $X_{\mathcal{M}}$. Throughout the paper, we do not assume that the linear model is correctly specified, i.e., it is possible that $\bm{\mu} \neq X_{\mathcal{M}} \bm{\beta}$ for any $\bm{\beta} \in \mathbb{R}^{|\mathcal{M}|}$. Even when the linear model is not correctly specified, $\bm{\beta}_{\mathcal{M}}$ is still a well-defined best linear approximation.

Conditional Selective Inference

Because the target of the inference is selected by observing the data $\bm{y}$, if we naively apply a traditional statistical test to the problem in (4) as if the inference target were pre-determined, the result would be invalid (the type-I error cannot be controlled at the desired significance level) owing to selection bias. To address the selection bias problem, we consider conditional SI introduced in Lee et al. (2016) and Tibshirani et al. (2016). Let us write the SFS algorithm as the following function:

$$\mathcal{A}: \bm{y} \mapsto \mathcal{M},$$

which maps a response vector $\bm{y} \in \mathbb{R}^n$ to the set of selected features $\mathcal{M}$ obtained by the SFS algorithm.

In conditional SI, the inference is conducted based on the following conditional sampling distribution of the test statistic:

$$\bm{\eta}^\top \bm{Y} \mid \left\{ \mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y}) \right\}, \qquad (5)$$

where

$$\bm{q}(\bm{Y}) = (I_n - \bm{c} \bm{\eta}^\top) \bm{Y} \ \text{ with } \ \bm{c} = \Sigma \bm{\eta} (\bm{\eta}^\top \Sigma \bm{\eta})^{-1}$$

is the nuisance parameter, which is independent of the test statistic. The first condition $\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y})$ in (5) indicates the event that the set of features selected by the $K$-step SFS method with a random response vector $\bm{Y}$ is $\mathcal{M}$, i.e., the same as that selected with the observed response vector $\bm{y}$. The second condition $\bm{q}(\bm{Y}) = \bm{q}(\bm{y})$ indicates that the nuisance parameter for a random response vector $\bm{Y}$ is the same as that for the observed vector $\bm{y}$. (The $\bm{q}(\bm{Y})$ corresponds to the component $\bm{z}$ in the seminal paper; see Lee et al. (2016), §5, Eq. 5.2, and Theorem 5.2.)

To conduct the conditional inference in (5), the main task is to identify the conditional data space

$$\mathcal{Y} = \{\bm{Y} \in \mathbb{R}^n \mid \mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}. \qquad (6)$$

Once $\mathcal{Y}$ is identified, we can easily compute the pivotal quantity

$$F^{\mathcal{Z}}_{\bm{\eta}^\top \bm{\mu},\, \bm{\eta}^\top \Sigma \bm{\eta}}(\bm{\eta}^\top \bm{y}) \mid \bm{y} \in \mathcal{Y}, \qquad (7)$$

where $F^{\mathcal{Z}}_{m, s^2}$ is the c.d.f. of the truncated Normal distribution with mean $m$, variance $s^2$, and truncation region $\mathcal{Z}$. We explain later how $\mathcal{Z}$ in (7) is defined. The pivotal quantity is crucial for calculating the $p$-value and for obtaining the confidence interval. Based on the pivotal quantity, we can obtain the selective $p$-value (Fithian et al., 2014) in the form of

$$p^{\rm selective}_j = 2 \min\{\pi_j, 1 - \pi_j\}, \qquad (8)$$

where $\pi_j = 1 - F^{\mathcal{Z}}_{0,\, \bm{\eta}^\top \Sigma \bm{\eta}}(\bm{\eta}^\top \bm{y})$. Furthermore, to obtain a $1 - \alpha$ confidence interval for any $\alpha \in [0, 1]$, by inverting the pivotal quantity in (7), we can find the smallest and largest values of $\bm{\eta}^\top \bm{\mu}$ such that the value of the pivotal quantity remains in the interval $\left[\frac{\alpha}{2}, 1 - \frac{\alpha}{2}\right]$ (Lee et al., 2016).
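As an illustration of how (7) and (8) are evaluated in practice, the sketch below computes the c.d.f. of a truncated Normal over a union of intervals and converts it into a selective $p$-value; it assumes the truncation region is given as a list of intervals on the scale of the test statistic, and the function names are our own.

```python
import numpy as np
from scipy.stats import norm

def truncated_gaussian_cdf(x, mean, sd, intervals):
    """CDF at x of N(mean, sd^2) truncated to the union of the given intervals."""
    num, den = 0.0, 0.0
    for lo, hi in intervals:
        a, b = norm.cdf(lo, mean, sd), norm.cdf(hi, mean, sd)
        den += b - a
        num += max(norm.cdf(min(x, hi), mean, sd) - a, 0.0)
    return num / den

def selective_p_value(eta, y, Sigma, intervals):
    """Two-sided selective p-value (8) for H0: eta^T mu = 0."""
    stat = float(eta @ y)                   # observed test statistic
    sd = float(np.sqrt(eta @ Sigma @ eta))  # its standard error
    pi = 1.0 - truncated_gaussian_cdf(stat, 0.0, sd, intervals)
    return 2.0 * min(pi, 1.0 - pi)
```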

Characterization of the conditional data space $\mathcal{Y}$

Using the second condition in (6), the data in $\mathcal{Y}$ are restricted to a line (see Liu et al. (2018) and Fithian et al. (2014)). Therefore, the set $\mathcal{Y}$ can be re-written, using a scalar parameter $z \in \mathbb{R}$, as

$$\mathcal{Y} = \{\bm{y}(z) \in \mathbb{R}^n \mid \bm{y}(z) = \bm{a} + \bm{b} z,\ z \in \mathcal{Z}\}, \qquad (9)$$

where $\bm{a} = \bm{q}(\bm{y})$, $\bm{b} = \Sigma \bm{\eta} (\bm{\eta}^\top \Sigma \bm{\eta})^{-1}$, and

$$\mathcal{Z} = \left\{ Z \in \mathbb{R} \mid \mathcal{A}(\bm{y}(Z)) = \mathcal{A}(\bm{y}) \right\}. \qquad (10)$$

Here, $\bm{a}$, $\bm{b}$, and $\mathcal{Z}$ depend on $\mathcal{M}$ and $j$, but we omit the subscripts for notational simplicity. Now, let us consider a random variable $Z \in \mathbb{R}$ and its observation $z \in \mathbb{R}$ such that $\bm{Y} = \bm{a} + \bm{b} Z$ and $\bm{y} = \bm{a} + \bm{b} z$, respectively. The conditional inference in (5) is re-written as the problem of characterizing the sampling distribution of

$$Z \mid Z \in \mathcal{Z}. \qquad (11)$$

Because $Z \sim \mathbb{N}(\bm{\eta}^\top \bm{\mu}, \bm{\eta}^\top \Sigma \bm{\eta})$, $Z \mid Z \in \mathcal{Z}$ follows a truncated Normal distribution. Once the truncation region $\mathcal{Z}$ is identified, the pivotal quantity in (7) is obtained as $F^{\mathcal{Z}}_{\bm{\eta}^\top \bm{\mu},\, \bm{\eta}^\top \Sigma \bm{\eta}}(z)$. Thus, the remaining task is reduced to the characterization of $\mathcal{Z}$.

Over-conditioning in existing conditional SI methods

Unfortunately, full identification of the truncation region $\mathcal{Z}$ in conditional SI for the SFS algorithm has been considered computationally infeasible. Therefore, in existing conditional SI studies such as Tibshirani et al. (2016), the authors circumvent the computational difficulty by over-conditioning. Note that over-conditioning is not harmful for selective type-I error control, but it leads to a loss of power (Fithian et al., 2014). In fact, the decrease in power due to over-conditioning is not a problem unique to SFS in Tibshirani et al. (2016), but is a major problem common to many existing conditional SI methods (Lee et al., 2016). In the next section, we propose a method to overcome this difficulty.

3 Proposed Homotopy-based SI for SFS

As we discussed in §2, to conduct conditional SI, the truncation region $\mathcal{Z} \subseteq \mathbb{R}$ in (10) should be identified. To construct $\mathcal{Z}$, our idea is 1) to compute $\mathcal{A}(\bm{y}(z))$ for all $z \in \mathbb{R}$ and 2) to identify the set of intervals of $z \in \mathbb{R}$ on which $\mathcal{A}(\bm{y}(z)) = \mathcal{A}(\bm{y})$. However, it seems intractable to obtain $\mathcal{A}(\bm{y}(z))$ for infinitely many values of $z \in \mathbb{R}$. To overcome this difficulty, we combine two approaches called extra-conditioning and homotopy continuation. Our idea is motivated by the regularization paths of Lasso (Osborne et al., 2000; Efron and Tibshirani, 2004), SVM (Hastie et al., 2004), and other similar methods (Rosset and Zhu, 2007; Bach et al., 2006; Tsuda, 2007; Lee and Scott, 2007; Takeuchi et al., 2009; Takeuchi and Sugiyama, 2011; Karasuyama and Takeuchi, 2010; Hocking et al., 2011; Karasuyama et al., 2012; Ogawa et al., 2013; Takeuchi et al., 2013), in which the solution path along the regularization parameter can be computed by analyzing the KKT optimality conditions of parametrized convex optimization problems. Although SFS cannot be formulated as a convex optimization problem, by introducing the notion of extra-conditioning, an approach conceptually similar to homotopy continuation can be used to keep track of all possible changes in the selected features when the response vector $\bm{y}$ changes along the direction of the test statistic. A conceptually similar idea has recently been used for evaluating the reliability of deep learning representations (Duy et al., 2020b).

3.1 Extra-Conditioning

First, let us consider the history $\mathcal{H}$ and signs $\mathcal{S}$ of the SFS algorithm. The history of the SFS algorithm is defined as

$$\mathcal{H} := (\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_K),$$

i.e., the sequence of the sets of selected features over the $K$ steps. The signs of the SFS algorithm are defined as a vector $\mathcal{S} := (\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_K)$, whose $k^{\rm th}$ element is defined as

$$\mathcal{S}_k = {\rm sgn}\left( \bm{x}_{j_k}^\top \bm{r}(\bm{y}, X_{\mathcal{M}_{k-1}}) \right)$$

for $k \in [K]$, which indicates the sign of the $j_k^{\rm th}$ feature when it first enters the model at step $k$.

We interpret the function $\mathcal{A}: \bm{y} \mapsto \mathcal{M}$ as a composite function $\mathcal{A} = \mathcal{A}_1 \circ \mathcal{A}_2$, where

$$\mathcal{A}_1: (\mathcal{H}, \mathcal{S}) \mapsto \mathcal{M}, \ \text{ and } \ \mathcal{A}_2: \bm{y} \mapsto (\mathcal{H}, \mathcal{S}),$$

i.e., the following relationships hold: $\mathcal{M} = \mathcal{A}(\bm{y}) = \mathcal{A}_1(\mathcal{A}_2(\bm{y}))$, $\mathcal{M} = \mathcal{A}_1((\mathcal{H}, \mathcal{S}))$, $(\mathcal{H}, \mathcal{S}) = \mathcal{A}_2(\bm{y})$.

Let us now consider conditional SI not on $\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y})$ but on $\mathcal{A}_2(\bm{Y}) = \mathcal{A}_2(\bm{y})$. The next lemma indicates that, by conditioning on the history and the signs $(\mathcal{H}, \mathcal{S})$ rather than the set of selected features $\mathcal{M}$, the truncation region can be simply represented as an interval on the line $\bm{y}(z) = \bm{a} + \bm{b} z$, $z \in \mathbb{R}$.

Lemma 1.

Consider a response vector $\bm{y}' \in \mathcal{Y}$. Let $(\mathcal{H}', \mathcal{S}') = \mathcal{A}_2(\bm{y}')$ be the history and the signs obtained by applying the $K$-step SFS algorithm to the response vector $\bm{y}'$, and write their elements as

$$\mathcal{H}' = (\mathcal{M}'_1, \ldots, \mathcal{M}'_K) \ \text{ and } \ \mathcal{S}' = (\mathcal{S}'_1, \ldots, \mathcal{S}'_K).$$

Then, the over-conditioned truncation region defined as

$$\mathcal{Z}^{\rm oc}(\bm{y}') = \left\{ z \in \mathbb{R} \mid \mathcal{A}_2(\bm{y}(z)) = \mathcal{A}_2(\bm{y}') \right\} \qquad (12)$$

is an interval

$$z \in \left[ \max_{\substack{k \in [K],\\ j \in [p] \setminus \mathcal{M}'_{k-1},\\ d_{(k,j)} > 0}} \frac{e_{(k,j)}}{d_{(k,j)}},\ \min_{\substack{k \in [K],\\ j \in [p] \setminus \mathcal{M}'_{k-1},\\ d_{(k,j)} < 0}} \frac{e_{(k,j)}}{d_{(k,j)}} \right], \qquad (13)$$

where

$$e_{(k,j)} = (\bm{x}_j - \bm{x}_{j_k} \mathcal{S}'_k)^\top P_{X_{\mathcal{M}'_{k-1}}}^{\perp} \bm{a}, \qquad d_{(k,j)} = (\bm{x}_{j_k} \mathcal{S}'_k - \bm{x}_j)^\top P_{X_{\mathcal{M}'_{k-1}}}^{\perp} \bm{b},$$

for $k \in [K]$, $j \in [p] \setminus \mathcal{M}'_{k-1}$, where $j_k$ denotes the feature added at step $k$ in the history $\mathcal{H}'$.

The proof is presented in Appendix A.
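A direct implementation of the interval in (13) is short. The following sketch recomputes $e_{(k,j)}$ and $d_{(k,j)}$ for every step $k$ and every candidate feature $j$, assuming the history is supplied as the list of features added at each step together with their signs (both obtained by running the SFS on $\bm{y}'$); the helper name is ours, and the $\pm$ from the proof of Lemma 1 is handled by looping over both signs of $\bm{x}_j$.

```python
import numpy as np

def lemma1_interval(X, a, b, j_hist, signs):
    """Over-conditioned truncation interval of Lemma 1 (sketch).

    j_hist[k] is the feature entering at step k+1 and signs[k] its sign,
    both computed on the response vector y' = y(z').
    """
    n, p = X.shape
    lo, hi = -np.inf, np.inf
    for k, (jk, sk) in enumerate(zip(j_hist, signs)):
        M_prev = j_hist[:k]
        if M_prev:
            XM = X[:, M_prev]
            P_perp = np.eye(n) - XM @ np.linalg.pinv(XM)  # projection onto residual space
        else:
            P_perp = np.eye(n)
        for j in range(p):
            if j in M_prev:
                continue
            for s in (+1.0, -1.0):  # both signs of x_j from the +/- in the proof
                e = (s * X[:, j] - sk * X[:, jk]) @ P_perp @ a
                d = (sk * X[:, jk] - s * X[:, j]) @ P_perp @ b
                if d > 0:
                    lo = max(lo, e / d)
                elif d < 0:
                    hi = min(hi, e / d)
    return lo, hi
```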

Note that the condition $\mathcal{A}_2(\bm{Y}) = \mathcal{A}_2(\bm{y})$ is redundant for our goal of making the inference conditional on $\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y})$. This over-conditioning is undesirable because it decreases the power of the conditional inference (Fithian et al., 2014), but the original conditional SI method for SFS in Tibshirani et al. (2016) performs conditional inference under exactly this over-conditioning, because otherwise the selection event cannot be formulated as a polytope. In the following subsection, we overcome this computational difficulty by utilizing the homotopy continuation approach.

3.2 Homotopy Continuation

In order to obtain the optimal truncation region $\mathcal{Z}$ in (10), our idea is to enumerate all the over-conditioned regions in (12) and to take their union

$$\mathcal{Z} = \bigcup_{\bm{y}' \in \mathcal{Y} \mid \mathcal{A}(\bm{y}') = \mathcal{A}(\bm{y})} \mathcal{Z}^{\rm oc}(\bm{y}')$$

by homotopy continuation. To this end, let us consider all possible pairs of history and signs $(\mathcal{H}(z), \mathcal{S}(z)) = \mathcal{A}_2(\bm{y}(z))$ for $z \in \mathbb{R}$. Clearly, the number of such pairs is finite. With a slight abuse of notation, we index each pair by $t = 1, 2, \ldots, T$ and denote it as $(\mathcal{H}^{(t)}, \mathcal{S}^{(t)})$, where $T$ is the number of pairs. Lemma 1 indicates that each pair corresponds to an interval on the line. Without loss of generality, we assume that the left-most interval corresponds to $t = 1$, the second left-most interval corresponds to $t = 2$, and so on. Then, using an increasing sequence of real numbers $z_1 < z_2 < \cdots < z_T < z_{T+1}$, we can write these intervals as $[z_1, z_2], [z_2, z_3], \ldots, [z_T, z_{T+1}]$. In practice, we do not have to consider the entire line $z \in (-\infty, \infty)$; it suffices to consider a sufficiently wide range. Specifically, in the experiments in §5, we set $z_{\rm min} = z_1 = -(|z| + 10)\sigma$ and $z_{\rm max} = z_{T+1} = (|z| + 10)\sigma$, where $\sigma$ is the standard error of the test statistic. With this choice of the range, the probability mass outside the range is negligibly small.

Our simple idea is to compute all possible pairs $\{(\mathcal{H}^{(t)}, \mathcal{S}^{(t)})\}_{t=1}^T$ by keeping track of the intervals $[z_1, z_2], [z_2, z_3], \ldots, [z_T, z_{T+1}]$ one by one, and then to compute the truncation region $\mathcal{Z}$ by collecting the intervals in which the set of selected features $\mathcal{M}^{(t)} = \mathcal{A}_1((\mathcal{H}^{(t)}, \mathcal{S}^{(t)}))$ is the same as the set of features actually selected from the observed data, $\mathcal{M} = \mathcal{A}(\bm{y})$, i.e.,

$$\mathcal{Z} = \bigcup_{t \in [T] \mid \mathcal{A}_1((\mathcal{H}^{(t)}, \mathcal{S}^{(t)})) = \mathcal{A}(\bm{y})} [z_t, z_{t+1}].$$

We call $z_1, z_2, \ldots, z_{T+1}$ breakpoints. We start by applying the SFS algorithm to the response vector $\bm{y}(z_1)$ and obtain the first pair $(\mathcal{H}^{(1)}, \mathcal{S}^{(1)})$. Next, using Lemma 1 with $\bm{y}' = \bm{y}(z_1)$, the next breakpoint $z_2$ is obtained as the right end of the interval in (13). The second pair $(\mathcal{H}^{(2)}, \mathcal{S}^{(2)})$ is then obtained by applying the SFS method to the response vector $\bm{y}(z_2 + \Delta z)$, where $\Delta z$ is a small value such that $z_t + \Delta z < z_{t+1}$ for all $t \in [T]$. This process is repeated until the next breakpoint becomes greater than $z_{\rm max}$. The pseudo-code of the entire method is presented in Algorithm 1 and the computation of the truncation region $\mathcal{Z}$ is described in Algorithm 2.

Input: $X, \bm{y}, K, [z_{\rm min}, z_{\rm max}]$
1: $\mathcal{M} \leftarrow$ apply the $K$-step SFS algorithm to $(X, \bm{y})$
2: for each selected feature $j \in \mathcal{M}$ do
3:   compute $\bm{\eta}$ by Equation (3)
4:   compute $\bm{a}$ and $\bm{b}$ by Equation (9)
5:   $\mathcal{Z} \leftarrow {\tt compute\_truncation\_region}(X, K, \bm{a}, \bm{b}, [z_{\rm min}, z_{\rm max}], \mathcal{M})$
6:   $p_j^{\rm selective} \leftarrow$ Equation (8) (and/or selective confidence interval)
7: end for
Output: $\{p_j^{\rm selective}\}_{j \in \mathcal{M}}$ (and/or selective confidence intervals)

Algorithm 1 SFS_conditional_SI

Input: $X, K, \bm{a}, \bm{b}, [z_{\rm min}, z_{\rm max}], \mathcal{M}$
1: Initialization: $t = 1$, $z_t = z_{\rm min}$, $\mathcal{Z} = \emptyset$
2: while $z_t < z_{\rm max}$ do
3:   $\bm{y}(z_t + \Delta z) = \bm{a} + \bm{b}(z_t + \Delta z)$
4:   $(\mathcal{H}^{(t)}, \mathcal{S}^{(t)}) \leftarrow$ apply the $K$-step SFS to $(X, \bm{y}(z_t + \Delta z))$ and set $\mathcal{M}^{(t)} = \mathcal{A}_1((\mathcal{H}^{(t)}, \mathcal{S}^{(t)}))$
5:   $z_{t+1} \leftarrow$ Equation (13)
6:   if $\mathcal{M}^{(t)} = \mathcal{M}$ then
7:     $\mathcal{Z} \leftarrow \mathcal{Z} \cup [z_t, z_{t+1}]$
8:   end if
9:   $t \leftarrow t + 1$
10: end while
Output: $\mathcal{Z} = \{z \in [z_{\rm min}, z_{\rm max}] \mid \mathcal{A}(\bm{y}(z)) = \mathcal{A}(\bm{y})\}$
Algorithm 2 compute_truncation_region
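For concreteness, a compact Python sketch of Algorithms 1 and 2 might look as follows; it reuses the hypothetical helpers `forward_sfs`, `lemma1_interval`, and `selective_p_value` from the earlier sketches, sets the search range to a sufficiently wide interval around the observed statistic in the spirit of §3.2, and is an illustration rather than the authors' implementation.

```python
import numpy as np

def compute_truncation_region(X, K, a, b, z_min, z_max, M_obs, dz=1e-4):
    """Algorithm 2 (sketch): union of intervals of z on which the same
    feature set M_obs is selected, obtained by tracking breakpoints."""
    Z, z_t = [], z_min
    while z_t < z_max:
        y_z = a + b * (z_t + dz)
        j_hist, signs = forward_sfs(X, y_z, K)        # (H^(t), S^(t))
        _, z_next = lemma1_interval(X, a, b, j_hist, signs)
        z_next = min(z_next, z_max)
        if set(j_hist) == set(M_obs):                  # M^(t) == M
            Z.append((z_t, z_next))
        z_t = max(z_next, z_t + dz)                    # guard against stalling
    return Z

def sfs_conditional_si(X, y, Sigma, K):
    """Algorithm 1 (sketch): selective p-value for each selected feature."""
    M, _ = forward_sfs(X, y, K)
    XM = X[:, M]
    pvals = {}
    for idx, j in enumerate(M):
        eta = XM @ np.linalg.inv(XM.T @ XM)[:, idx]    # Eq. (3)
        sd = float(np.sqrt(eta @ Sigma @ eta))
        b = Sigma @ eta / (eta @ Sigma @ eta)          # Eq. (9)
        a = y - b * (eta @ y)                          # nuisance component q(y)
        z_obs = float(eta @ y)
        # a wide search range around the observed statistic
        z_min, z_max = -(abs(z_obs) + 10.0 * sd), abs(z_obs) + 10.0 * sd
        Z = compute_truncation_region(X, K, a, b, z_min, z_max, M)
        pvals[j] = selective_p_value(eta, y, Sigma, Z)  # Eq. (8)
    return pvals
```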

4 Forward Backward Stepwise Feature Selection

In this section, we present a conditional SI method for the FB-SFS algorithm using the homotopy method. At each step of the FB-SFS algorithm, a feature is considered for addition to or removal from the current set of selected features based on some pre-specified criterion. As an example of a commonly used criterion, we study the AIC-based criterion, which is used as the default option in the well-known stepAIC function in R.

Under the assumption of the Normal linear model $\bm{Y} \sim \mathbb{N}(X\bm{\beta}, \Sigma)$, for a set of selected features $\mathcal{M}$, the AIC is written as

$${\rm AIC} := \min_{\bm{\beta}} (\bm{y} - X_{\mathcal{M}} \bm{\beta})^\top \Sigma^{-1} (\bm{y} - X_{\mathcal{M}} \bm{\beta}) + 2|\mathcal{M}| = \bm{y}^\top A_{\mathcal{M}} \bm{y} + 2|\mathcal{M}|, \qquad (14)$$

where $A_{\mathcal{M}} = \Sigma^{-1} - \Sigma^{-1} X_{\mathcal{M}} (X_{\mathcal{M}}^\top \Sigma^{-1} X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top \Sigma^{-1}$ and $|\mathcal{M}|$ is the number of selected features (the irrelevant constant term is omitted here). The goal of AIC-based FB-SFS is to find a model with the smallest AIC while adding and removing features step by step. The algorithm starts either from the null model (the model with only a constant term) or the full model (the model with all $p$ features). At each step, among all possible choices of adding one feature to or deleting one feature from the current model, the one with the smallest AIC is selected. The algorithm terminates when the AIC can no longer be reduced.
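As a small illustration of (14), the following sketch computes $A_{\mathcal{M}}$ and the AIC for a candidate set of features, and also returns the coefficients of the quadratic function of $z$ obtained along the line $\bm{y}(z) = \bm{a} + \bm{b}z$, which is the representation exploited later in Lemma 2 (Eq. (20) in Appendix A); the function names are our own.

```python
import numpy as np

def aic_matrix(X, M, Sigma_inv):
    """A_M = Sigma^{-1} - Sigma^{-1} X_M (X_M^T Sigma^{-1} X_M)^{-1} X_M^T Sigma^{-1}."""
    XM = X[:, list(M)]
    G = XM.T @ Sigma_inv @ XM
    return Sigma_inv - Sigma_inv @ XM @ np.linalg.inv(G) @ XM.T @ Sigma_inv

def aic(X, y, M, Sigma_inv):
    """AIC(M) = y^T A_M y + 2|M| as in Eq. (14), dropping the constant term."""
    return float(y @ aic_matrix(X, M, Sigma_inv) @ y) + 2 * len(M)

def aic_quadratic_in_z(X, M, Sigma_inv, a, b):
    """Coefficients (c2, c1, c0) of AIC(M, z) = c2 z^2 + c1 z + c0 along y(z) = a + b z."""
    A = aic_matrix(X, M, Sigma_inv)
    return float(b @ A @ b), 2.0 * float(a @ A @ b), float(a @ A @ a) + 2 * len(M)
```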

With a slight abuse of notation, we use the same notation as in §3:

$$\mathcal{A}: \bm{y} \mapsto \mathcal{M}, \quad \mathcal{A}_1: \mathcal{H} \mapsto \mathcal{M}, \quad \mathcal{A}_2: \bm{y} \mapsto \mathcal{H},$$

where $\mathcal{M}$ is the set of selected features and $\mathcal{H}$ is the history of the FB-SFS algorithm, written as

$$\mathcal{H} = (\mathcal{M}_1, \ldots, \mathcal{M}_K),$$

where $\mathcal{M}_k$ is the set of selected features at step $k$ for $k \in [K]$. Here, the history $\mathcal{H}$ contains the information on feature addition and removal in the FB-SFS algorithm. Therefore, unlike the forward SFS algorithm in §3, the size of $\mathcal{M}_k$ can increase or decrease depending on whether a feature is added or removed.

Using these notations, the truncation region $\mathcal{Z}$ is similarly obtained as

$$\mathcal{Z} = \left\{ z \in \mathbb{R} \mid \mathcal{A}(\bm{y}(z)) = \mathcal{A}(\bm{y}) \right\} = \bigcup_{\mathcal{H} \mid \mathcal{A}_1(\mathcal{H}) = \mathcal{A}(\bm{y})} \left\{ z \in \mathbb{R} \mid \mathcal{A}_2(\bm{y}(z)) = \mathcal{H} \right\}. \qquad (15)$$

The basic idea of the homotopy method is the same as before: we find the range of $z \in \mathbb{R}$ on which the algorithm has a certain history $\mathcal{H}$ and take the union of those ranges for which the output $\mathcal{M}$ obtained from the history $\mathcal{H}$ is equal to the actually selected set of features $\mathcal{M} = \mathcal{A}(\bm{y})$. The following lemma indicates that the range of $z \in \mathbb{R}$ on which the algorithm has a certain history $\mathcal{H}$ can be analytically obtained.

Lemma 2.

Consider a response vector $\bm{y}' \in \mathcal{Y}$. Then, the over-conditioned truncation region defined as

$$\mathcal{Z}^{\rm oc}(\bm{y}') = \left\{ z \in \mathbb{R} \mid \mathcal{A}_2(\bm{y}(z)) = \mathcal{A}_2(\bm{y}') \right\} \qquad (16)$$

is characterized by a finite number of quadratic inequalities in $z \in \mathbb{R}$.

The proof is presented in Appendix A.

Unlike the case of polytope-based SI for forward SFS in the previous section, the over-conditioned region $\mathcal{Z}^{\rm oc}(\bm{y}')$ in Lemma 2 may consist of multiple intervals. The homotopy continuation approach can be similarly used for computing the union of intervals in (15). Starting from $z_1 = z_{\rm min}$, we first compute $\mathcal{Z}^{\rm oc}(\bm{y}')$ with $\bm{y}' = \bm{y}(z_1)$ using Lemma 2. If $\mathcal{Z}^{\rm oc}(\bm{y}')$ consists of multiple intervals, we consider only the interval containing the current value of $z$ and set the right end of that interval as the next breakpoint. The computation of the next breakpoint $z_t$ can be written as

$$z_t = \max\{z \in \mathcal{Z}^{\rm oc}(\bm{y}(z_{t-1} + \Delta z))\} + \Delta z, \qquad (17)$$

for $t = 1, 2, \ldots, T$, where, if $\mathcal{Z}^{\rm oc}(\bm{y}(z_{t-1} + \Delta z))$ consists of multiple separated intervals, the maximum operator in (17) is taken within the interval containing $z_{t-1} + \Delta z$.
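Each condition characterizing the history in Lemma 2 is a quadratic inequality of the form $c_2 z^2 + c_1 z + c_0 \le 0$, obtained as the difference of two AIC quadratics such as those returned by the hypothetical `aic_quadratic_in_z` sketch above. The sketch below finds the connected feasible interval around the current point, whose right end is used as the next breakpoint in (17); it assumes the current point itself satisfies all constraints.

```python
import numpy as np

def interval_containing(z0, quad_constraints, eps=1e-12):
    """Connected interval around z0 on which every constraint
    c2*z^2 + c1*z + c0 <= 0 holds (sketch; z0 is assumed feasible)."""
    lo, hi = -np.inf, np.inf
    for c2, c1, c0 in quad_constraints:
        if abs(c2) < eps:                      # effectively linear constraint
            if abs(c1) < eps:
                continue                       # constant constraint, feasible at z0
            root = -c0 / c1
            if c1 > 0:
                hi = min(hi, root)             # z <= root
            else:
                lo = max(lo, root)             # z >= root
            continue
        disc = c1 * c1 - 4.0 * c2 * c0
        if disc < 0:
            continue                           # no real roots: sign is constant, feasible at z0
        r1, r2 = sorted(np.roots([c2, c1, c0]).real)
        if c2 > 0:                             # feasible set is [r1, r2]
            lo, hi = max(lo, r1), min(hi, r2)
        elif z0 <= r1:                         # feasible set is (-inf, r1] U [r2, inf)
            hi = min(hi, r1)
        else:
            lo = max(lo, r2)
    return lo, hi
```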

5 Experiment

We present the experimental results for the forward SFS method of §3 and the FB-SFS method of §4 in §5.1 and §5.2, respectively. The details of the experimental setups are described in Appendix B. More experimental results on the computational and robustness aspects of the proposed method are provided in Appendices C and D, respectively. In all the experiments, we set the significance level $\alpha = 0.05$.

5.1 Forward SFS

We compared the following five methods:

  • Homotopy: conditioning on the selected features $\mathcal{M}$ (minimal conditioning);

  • Homotopy-H: additionally conditioning on the history $\mathcal{H}$;

  • Homotopy-S: additionally conditioning on the signs $\mathcal{S}$;

  • Polytope (Tibshirani et al., 2016): additionally conditioning on the history $\mathcal{H}$ and signs $\mathcal{S}$;

  • DS: split the data and use one half for feature selection and the other for inference.

Synthetic data experiments

We examined the false positive rates (FPRs), true positive rates (TPRs), and confidence interval (CI) lengths. We generated the dataset $\{(\bm{x}_i, y_i)\}_{i \in [n]}$ by $\bm{x}_i \sim \mathbb{N}(0, I_p)$ and $y_i = \bm{x}_i^\top \bm{\beta} + \varepsilon_i$ with $\varepsilon_i \sim \mathbb{N}(0, 1)$. We set $n \in \{50, 100, 150\}$, $p = 5$, and $K = 3$ for the experiments on FPRs and TPRs. The coefficient vector $\bm{\beta}$ was set to $[0, 0, 0, 0, 0]^\top$ and $[0.25, 0.25, 0, 0, 0]^\top$ for the FPR and TPR experiments, respectively. We set $n = 100$, $p = 10$, $K = 9$, and $\bm{\beta} = [0.25, 0.25, 0.25, 0.25, 0.25, 0, 0, 0, 0, 0]^\top$ for the experiments on CIs.

The results for FPRs, TPRs, and CIs are shown in Fig. 1(a), (b), and (c), respectively. The FPRs and TPRs were estimated from 100 trials, and the plots in Fig. 1(a) and (b) are the averages over 20 repetitions of the 100 trials, while the plots in Fig. 1(c) are the results of 100 trials. In all five methods, the FPRs are properly controlled under the significance level $\alpha$. Regarding the TPR comparison, the Homotopy method has the highest power because it is minimally conditioned. Note that the additional conditioning on the history $\mathcal{H}$ and/or the signs $\mathcal{S}$ significantly decreases the power. The results for the CIs are consistent with the results for the TPRs.

Figure 1: Results of forward SFS on synthetic data. (a) False Positive Rates (FPRs); (b) True Positive Rates (TPRs); (c) Confidence Interval (CI) Lengths.

Real data experiments

We compared the proposed method (Homotopy) and the conventional method (Polytope) on three real datasets: Housing (Dataset 1), Abalone (Dataset 2), and Concrete Compressive Strength (Dataset 3) from the UCI machine learning repository. From each of the original datasets, we randomly generated sub-sampled datasets with sizes 25, 50, and 100. We applied the proposed and conventional methods to each sub-sampled dataset, examined the cases where the selective $p$-value of a selected feature differed between the two methods, and computed the percentage of times the proposed method provided a smaller selective $p$-value than the conventional one. Table 1 shows the results from 1000 trials. The proposed method provides smaller selective $p$-values than the conventional method significantly more frequently, especially when $n$ is large.

Table 1: Results of the three real-data experiments for forward SFS. The percentage of cases in which the selective $p$-value of the proposed method (Homotopy) was smaller than that of the conventional method (Polytope) in random sub-sampling experiments is shown.

          Dataset 1   Dataset 2   Dataset 3
$n=25$    56.40%      57.69%      66.79%
$n=50$    62.85%      65.1%       70.09%
$n=100$   71.3%       71.88%      76.58%

Table 2: Results of the three real-data experiments for FB-SFS. The percentage of cases in which the selective $p$-value of the proposed method (Homotopy) was smaller than that of the conventional method (Quadratic) in random sub-sampling experiments is shown.

          Dataset 1   Dataset 2   Dataset 3
$n=25$    69.35%      75.43%      57.13%
$n=50$    72.72%      78.42%      60.91%
$n=100$   78.60%      81.38%      76.19%

5.2 Forward-Backward (FB)-SFS

We compared the following two methods:

  • Homotopy: conditioning on the selected features $\mathcal{M}$ (minimal conditioning);

  • Quadratic: additionally conditioning on the history $\mathcal{H}$ (implemented by using the quadratic inequality-based conditional SI in Loftus and Taylor (2015)).

Synthetic data experiments

We examined the FPRs, TPRs, and CI lengths by generating synthetic datasets in the same way as above. We set $n \in \{50, 100, 150\}$ and $p \in \{10, 20, 50\}$ for the experiments on FPRs and TPRs. The coefficients $\bm{\beta}$ were all zero for the FPR case, whereas for the TPR case the first half of the coefficients were set to one of $0.01$, $0.25$, $0.5$, or $1$ and the second half were zero. We set $n = 100$, $p = 10$, and each element of $\bm{\beta}$ was randomly drawn from $\mathbb{N}(0, 1)$ for the experiments on CIs.

The results for FPRs, TPRs, and CIs are shown in Fig. 2(a), (b), and (c), respectively. The FPRs and TPRs were estimated from 100 trials, and the plots in Fig. 2(a) and (b) are the averages over 20 repetitions of the 100 trials for FPR and 10 repetitions for TPR, whereas the plots in Fig. 2(c) are the results of 100 trials. In both methods, the FPRs are properly controlled under the significance level $\alpha$. Regarding the TPR comparison, the Homotopy method has higher power than Quadratic because it is minimally conditioned. The results for the CIs are consistent with the results for the TPRs.

Figure 2: Results of FB-SFS on synthetic data. (a) False Positive Rates (FPRs); (b) True Positive Rates (TPRs); (c) Confidence Interval (CI) Lengths.

Real data experiments

We compared the proposed method (Homotopy) and the conventional method (Quadratic) on the same three real datasets. From each of the original datasets, we randomly generated sub-sampled datasets with sizes 25, 50, and 100. We applied the proposed and conventional methods to each sub-sampled dataset, examined the cases where the selective $p$-value of a selected feature differed between the two methods, and computed the percentage of times the proposed method provided a smaller selective $p$-value than the conventional one. Table 2 shows the results from 1000 trials. The proposed method provides smaller selective $p$-values than the conventional method significantly more frequently, especially when $n$ is large.

6 Conclusion

In this paper, we proposed a more powerful and general conditional SI method for the SFS algorithm. We resolved the over-conditioning problem in existing approaches by introducing the homotopy continuation approach. The experimental results indicated that the proposed homotopy-based approach is more powerful and general.

Acknowledgement

This work was partially supported by MEXT KAKENHI (20H00601, 16H06538), JST CREST (JPMJCR1502), RIKEN Center for Advanced Intelligence Project, and RIKEN Junior Research Associate Program.

Appendix A Proofs

A.1 Proof of Lemma 1

Proof.

Let $(\mathcal{H}(z), \mathcal{S}(z)) = \mathcal{A}_2(\bm{y}(z))$ for all $z \in \mathbb{R}$ and write $\mathcal{H}(z) = (\mathcal{M}_1(z), \ldots, \mathcal{M}_K(z))$ and $\mathcal{S}(z) = (\mathcal{S}_1(z), \ldots, \mathcal{S}_K(z))$. By conditioning on the history $\mathcal{H}(z) = \mathcal{H}'$,

$$\mathcal{M}_1(z) = \mathcal{M}'_1, \ldots, \mathcal{M}_K(z) = \mathcal{M}'_K$$

for all $z \in \mathbb{R}$ such that $\mathcal{H}(z) = \mathcal{H}'$. This indicates that

$$\left| \bm{x}_{j'_k}^\top \bm{r}(\bm{y}(z), X_{\mathcal{M}'_{k-1}}) \right| \geq \pm \bm{x}_j^\top \bm{r}(\bm{y}(z), X_{\mathcal{M}'_{k-1}}) \qquad (18)$$

for all $(k, j) \in [K] \times ([p] \setminus \mathcal{M}'_{k-1})$ and $z \in \mathbb{R}$ such that $\mathcal{H}(z) = \mathcal{H}'$, where $(j'_1, \ldots, j'_K)$ is the sequence of the selected features when the $K$-step SFS algorithm is applied to the response vector $\bm{y}'$. By further conditioning on the signs $\mathcal{S}(z) = \mathcal{S}'$, (18) is written as

$$\mathcal{S}'_k \bm{x}_{j'_k}^\top \bm{r}(\bm{y}(z), X_{\mathcal{M}'_{k-1}}) \geq \pm \bm{x}_j^\top \bm{r}(\bm{y}(z), X_{\mathcal{M}'_{k-1}})$$

for all $(k, j) \in [K] \times ([p] \setminus \mathcal{M}'_{k-1})$ and $z \in \mathbb{R}$ such that $\mathcal{H}(z) = \mathcal{H}'$ and $\mathcal{S}(z) = \mathcal{S}'$. By restricting to the line $\bm{y}(z) = \bm{a} + \bm{b} z$, $z \in \mathbb{R}$, the range of $z$ is written as

$$\max_{\substack{k \in [K],\\ j \in [p] \setminus \mathcal{M}'_{k-1},\\ d_{(k,j)} > 0}} \frac{e_{(k,j)}}{d_{(k,j)}} \leq z \leq \min_{\substack{k \in [K],\\ j \in [p] \setminus \mathcal{M}'_{k-1},\\ d_{(k,j)} < 0}} \frac{e_{(k,j)}}{d_{(k,j)}}. \qquad (19)$$

∎

A.2 Proof of Lemma 2

Proof.

For a set of features $\mathcal{M} \subseteq [p]$ and a response vector $\bm{y}(z) = \bm{a} + \bm{b} z$, $z \in \mathbb{R}$, on the line, let us denote the AIC as a function of $\mathcal{M}$ and $z$ by ${\rm AIC}(\mathcal{M}, z)$. By substituting $\bm{y} = \bm{a} + \bm{b} z$ into (14), it is written as a quadratic function of $z$:

$${\rm AIC}(\mathcal{M}, z) = (\bm{b}^\top A_{\mathcal{M}} \bm{b}) z^2 + 2 (\bm{a}^\top A_{\mathcal{M}} \bm{b}) z + (\bm{a}^\top A_{\mathcal{M}} \bm{a}) + 2|\mathcal{M}|. \qquad (20)$$

The over-conditioned truncation region (16) is the range of $z \in \mathbb{R}$ such that, when $\bm{y}(z)$ is fed into the algorithm, the same history $\mathcal{H}' = \mathcal{A}_2(\bm{y}')$ is obtained. Let the sequence of selected models corresponding to the history $\mathcal{H}' = \mathcal{A}_2(\bm{y}')$ be

$$\mathcal{H}' = (\mathcal{M}'_1, \ldots, \mathcal{M}'_K),$$

where $K$ is the number of steps in the history $\mathcal{H}'$. The event of obtaining this history can be fully characterized by comparing AICs as follows:

$${\rm AIC}(\mathcal{M}'_k, z) \leq {\rm AIC}(\mathcal{M}'_{k-1} \cup \{j\}, z) \quad \forall j \in [p] \setminus \mathcal{M}'_{k-1},$$
$${\rm AIC}(\mathcal{M}'_k, z) \leq {\rm AIC}(\mathcal{M}'_{k-1} \setminus \{j\}, z) \quad \forall j \in \mathcal{M}'_{k-1},$$
$${\rm AIC}(\mathcal{M}'_k, z) \leq {\rm AIC}(\mathcal{M}'_{k-1}, z), \qquad (21)$$

for $k = 1, 2, \ldots, K$, and

$${\rm AIC}(\mathcal{M}'_K, z) < {\rm AIC}(\mathcal{M}'_K \cup \{j\}, z) \quad \forall j \in [p] \setminus \mathcal{M}'_K,$$
$${\rm AIC}(\mathcal{M}'_K, z) < {\rm AIC}(\mathcal{M}'_K \setminus \{j\}, z) \quad \forall j \in \mathcal{M}'_K. \qquad (22)$$

Here, the first and second inequalities in (21) indicate that the model selected at step $k$ has the smallest AIC among all possible choices, the third inequality in (21) indicates that the AIC of the model selected at step $k$ is smaller than that of the previous step, and the two inequalities in (22) indicate that the AIC of the model selected at the final step $K$ cannot be decreased any further. Because the AIC is a quadratic function of $z$ as in (20) for each fixed model in the history $\mathcal{H}'$, all of these conditions are quadratic inequalities in $z$. This means that the range of $z \in \mathbb{R}$ that satisfies these conditions is represented by a finite set of quadratic inequalities of $z \in \mathbb{R}$. ∎

Appendix B Details of Experiments

We executed the experiments on Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz.

Comparison methods in the case of Forward SFS.

In the case of forward SFS, recall that $\mathcal{A}(\bm{y})$ returns the set of selected features $\mathcal{M}$ when the forward SFS algorithm $\mathcal{A}$ is applied to $\bm{y}$. With a slight abuse of notation, let $\mathcal{H}(\bm{y})$ and $\mathcal{S}(\bm{y})$ respectively denote the history and signs obtained when applying algorithm $\mathcal{A}$ to $\bm{y}$. We compared the following five methods:

  • Homotopy: conditioning on the selected features $\mathcal{M}$ (minimal conditioning), i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

  • Homotopy-H: additionally conditioning on the history $\mathcal{H}$, i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{H}(\bm{Y}) = \mathcal{H}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

    Here, we note that $\mathcal{H}(\bm{Y}) = \mathcal{H}(\bm{y})$ already includes the event $\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y})$.

  • Homotopy-S: additionally conditioning on the signs, i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \mathcal{S}(\bm{Y}) = \mathcal{S}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

  • Polytope (Tibshirani et al., 2016): additionally conditioning on both the history $\mathcal{H}$ and the signs $\mathcal{S}$, i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{H}(\bm{Y}) = \mathcal{H}(\bm{y}),\ \mathcal{S}(\bm{Y}) = \mathcal{S}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

    This definition is equivalent to

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{A}_2(\bm{Y}) = \mathcal{A}_2(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\},$$

    where $\mathcal{A}_2(\cdot)$ is defined in §3.

  • DS: data splitting, a commonly used procedure for selection bias correction. The data is randomly divided into two halves; the first half is used for model selection and the second for inference.

Comparison methods in the case of Forward-Backward SFS.

In the case of forward-backward SFS, $\mathcal{A}(\bm{y})$ returns the set of selected features $\mathcal{M}$ when the forward-backward SFS algorithm $\mathcal{A}$ is applied to $\bm{y}$, and $\mathcal{A}_2(\bm{y})$ returns the history. We compared the following two methods:

  • Homotopy: conditioning on the selected features $\mathcal{M}$ (minimal conditioning), i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

  • Quadratic: additionally conditioning on the history $\mathcal{H}$ (implemented by using the quadratic inequality-based conditional SI in Loftus and Taylor (2015)), i.e.,

    $$\bm{\eta}^\top \bm{Y} \mid \{\mathcal{A}_2(\bm{Y}) = \mathcal{A}_2(\bm{y}),\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}.$$

Definition of TPR.

In SI, we only conduct a statistical test when there is at least one hypothesis discovered by the algorithm. Therefore, the definition of TPR, which can also be called conditional power, is as follows:

$${\rm TPR} = \frac{\#\,\text{detected \& rejected}}{\#\,\text{detected}},$$

where $\#\,\text{detected}$ is the number of truly positive features selected by the algorithm (e.g., SFS) and $\#\,\text{detected \& rejected}$ is the number of truly positive features whose null hypothesis is also rejected by SI.
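For illustration, the conditional power defined above can be computed from a dictionary of selective $p$-values keyed by selected feature index; the helper below is a sketch with names of our own choosing.

```python
def conditional_power(p_values, true_support, alpha=0.05):
    """TPR (conditional power): fraction of detected truly positive features
    whose selective p-value is rejected at level alpha."""
    detected = [j for j in p_values if j in true_support]
    if not detected:
        return float("nan")  # no truly positive feature was detected
    rejected = [j for j in detected if p_values[j] <= alpha]
    return len(rejected) / len(detected)
```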

Demonstration of confidence interval (forward SFS).

We generated $n = 100$ outcomes as $y_i = \bm{x}_i^\top \bm{\beta} + \varepsilon_i$, $i = 1, \ldots, n$, where $\bm{x}_i \sim \mathbb{N}(0, I_p)$ and $\varepsilon_i \sim \mathbb{N}(0, 1)$. We set $p = 10$, $K = 9$, and $\bm{\beta} = [0.25, 0.25, 0.25, 0.25, 0.25, 0, 0, 0, 0, 0]^\top$. We note that the number of selected features can differ between the four conditional SI methods (Homotopy, Homotopy-H, Homotopy-S, Polytope) and DS. Therefore, for a fair comparison, we only consider the features that are selected in both cases. Figure 3 shows the demonstration of CIs for each selected feature. The results are consistent with Fig. 1(b).

Figure 3: Demonstration of confidence intervals.

Appendix C Experiments on Computational Aspects (Forward SFS)

We demonstrate the computational efficiency of the proposed Homotopy method. We generated $n$ outcomes as $y_i = \bm{x}_i^\top \bm{\beta} + \varepsilon_i$, $i = 1, \ldots, n$, where $\bm{x}_i \sim \mathbb{N}(0, I_p)$ and $\varepsilon_i \sim \mathbb{N}(0, 1)$. We set $n = 50$ and $p = 10$. In Figure 4, we compare the computational time of the proposed Homotopy method and the existing method. For the existing study, to keep high statistical power, one has to enumerate a huge number of combinations of histories and signs, $2^K \times K!$, which is feasible only when the number of selected features is fairly small. The blue plots in Fig. 4 show that the computational cost of the existing method increases exponentially with the number of selected features. With the proposed method, we are able to significantly reduce the computational cost while keeping high power.

Figure 4: Comparison of the computational time between the proposed method and the existing method on an artificial dataset with $n = 50$, $p = 10$.

One might wonder how we can circumvent the computational bottleneck of the exponentially increasing number of polytopes. Our experience suggests that, by focusing on the line along the test statistic in the data space, we can skip the majority of the polytopes that do not affect the truncated Normal sampling distribution because they do not intersect with this line. In other words, we can skip the majority of the combinations of histories and signs that never appear. We show violin plots of the actual numbers of intervals of the test statistic $z$ involved in the construction of the truncated sampling distribution in Figure 5. Here, we set $n = 250$ and $p = 50$. Regarding Homotopy and Homotopy-S, the number of polytopes increases linearly with $K$. This is the reason why our method is highly efficient. In regard to Homotopy-H, the number of polytopes decreases because the history constraint becomes too strict as $K$ increases.

Figure 5: The number of polytopes intersecting the line of $z$ that we need to consider. The solid lines show the sample averages.

Appendix D Experiments on Robustness (Forward-Backward SFS)

We applied our proposed method to the cases where the noise follows the Laplace distribution, a skew normal distribution (skewness coefficient 10), or the $t_{20}$ distribution. We also conducted experiments in which $\sigma^2$ was estimated from the data. We generated $n$ outcomes as $y_i = \bm{x}_i^\top \bm{\beta} + \varepsilon_i$, $i = 1, \ldots, n$, where $p = 5$, $\bm{x}_i \sim \mathbb{N}(0, I_p)$, and $\varepsilon_i$ follows the Laplace distribution, the skew normal distribution, or the $t_{20}$ distribution with zero mean and standard deviation 1. In the case of estimated $\sigma^2$, $\varepsilon_i \sim \mathbb{N}(0, 1)$. We set all elements of $\bm{\beta}$ to 0 and set $\lambda = 0.5$. For each case, we ran 1,200 trials for each $n \in \{100, 200, 300, 400\}$. The FPR results are shown in Figure 6. Although we only demonstrate the results for the forward-backward SFS algorithm, the extension to the forward SFS algorithm with a similar setting is straightforward.

Figure 6: The robustness of the proposed method in terms of FPR control.

Appendix E Homotopy-based SI for Cross-Validation (Forward SFS)

In this section, we introduce a method for SI conditional also on the selection of the number of selected features KK via cross-validation. Consider selecting the number of steps KK in the SFS method from a given set of candidates 𝒦={K1,,KL}{\mathcal{K}}=\{K_{1},\ldots,K_{L}\} where LL is the number of candidates. When conducting cross-validation on the observed dataset (X,𝒚)(X,\bm{y}), suppose that 𝒱(𝒚)=Kselected𝒦{\mathcal{V}}(\bm{y})=K_{\rm selected}\in{\mathcal{K}} is the event that KselectedK_{\rm selected} is selected as the best one. The test-statistic for the selected feature jj when applying the SFS method with KselectedK_{\rm selected} steps to (X,𝒚)(X,\bm{y}) is then defined as

$$\bm{\eta}^{\top}\bm{Y} \mid \{\mathcal{A}(\bm{Y}) = \mathcal{A}(\bm{y}),\ \mathcal{V}(\bm{Y}) = K_{\rm selected},\ \bm{q}(\bm{Y}) = \bm{q}(\bm{y})\}. \tag{23}$$

We note that $\mathcal{A}(\cdot)$ and $\bm{q}(\cdot)$ depend on $K_{\rm selected}$, but we omit the dependence for notational simplicity. The conditional data space in (9) with the event of selecting $K_{\rm selected}$ is then written as

$$\mathcal{Y} = \{\bm{y}(z) = \bm{a} + \bm{b}z \mid z \in \mathcal{Z}_{\rm CV}\}, \tag{24}$$

where $\mathcal{Z}_{\rm CV} = \{z \in \mathbb{R} \mid \mathcal{A}(\bm{y}(z)) = \mathcal{A}(\bm{y}),\ \mathcal{V}(\bm{y}(z)) = K_{\rm selected}\}$. The truncation region $\mathcal{Z}_{\rm CV}$ can be obtained as the intersection of the following two sets:

$$\mathcal{Z}_1 = \{z \in \mathbb{R} \mid \mathcal{A}(\bm{y}(z)) = \mathcal{A}(\bm{y})\} \quad \text{ and } \quad \mathcal{Z}_2 = \{z \in \mathbb{R} \mid \mathcal{V}(\bm{y}(z)) = K_{\rm selected}\}.$$

Since the former, $\mathcal{Z}_1$, can be obtained by using the method described in §3, the remaining task is to identify the latter, $\mathcal{Z}_2$.
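If $\mathcal{Z}_1$ and $\mathcal{Z}_2$ are each represented as unions of intervals of $z$, as in the homotopy computation, their intersection can be formed by a simple pairwise overlap. The helper below is an illustrative sketch, not part of the paper's implementation.

```python
# Sketch: Z_CV = Z_1 ∩ Z_2, with each set represented as a list of disjoint intervals.
def intersect_interval_unions(Z1, Z2):
    result = []
    for lo1, hi1 in Z1:
        for lo2, hi2 in Z2:
            lo, hi = max(lo1, lo2), min(hi1, hi2)
            if lo < hi:
                result.append((lo, hi))
    return result

# Example: Z1 = [(-3.0, -1.0), (0.5, 4.0)], Z2 = [(-2.0, 2.0)]
# -> [(-2.0, -1.0), (0.5, 2.0)]
```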

For notational simplicity, we consider the case where the dataset $(X, \bm{y})$ is divided into training and validation sets, and the latter is used for selecting $K_{\rm selected}$. The following discussion can be easily extended to the cross-validation scenario. Let us rewrite

$$(X, \bm{y}) = \left((X^{\rm tr}\ X^{\rm va})^{\top} \in \mathbb{R}^{n \times p},\ (\bm{y}^{\rm tr}\ \bm{y}^{\rm va})^{\top} \in \mathbb{R}^{n}\right).$$

With a slight abuse of notation, for $K \in \mathcal{K}$, let $\mathcal{M}_K(\bm{y}^{\rm tr}(z))$ be the set of features selected by applying the $K$-step SFS method to $(X^{\rm tr}, \bm{y}^{\rm tr}(z))$. The validation error is then defined as

$$E_K(z) = \left\|\bm{y}^{\rm va}(z) - X^{\rm va}_{\mathcal{M}_K(\bm{y}^{\rm tr}(z))} \hat{\bm{\beta}}_K(z)\right\|^2_2, \tag{25}$$

where $\hat{\bm{\beta}}_K(z) = \left({X^{\rm tr}_{\mathcal{M}_K(\bm{y}^{\rm tr}(z))}}^{\top} X^{\rm tr}_{\mathcal{M}_K(\bm{y}^{\rm tr}(z))}\right)^{-1} {X^{\rm tr}_{\mathcal{M}_K(\bm{y}^{\rm tr}(z))}}^{\top} \bm{y}^{\rm tr}(z)$. Then, we can write

$$\mathcal{Z}_2 = \{z \in \mathbb{R} \mid E_{K_{\rm selected}}(z) \leq E_K(z) \text{ for all } K \in \mathcal{K}\}.$$

Since the validation error $E_K(z)$ in (25) is a piecewise-quadratic function of $z$, we have one piecewise-quadratic function of $z$ for each $K \in \mathcal{K}$. The truncation region $\mathcal{Z}_2$ can then be identified by intersecting, over all the other $K \in \mathcal{K}$, the intervals of $z$ in which $E_{K_{\rm selected}}(z)$ is no greater than $E_K(z)$.
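The sketch below illustrates one piece of this computation, under the assumption that the active set $\mathcal{M}_K$ is fixed on the current interval of $z$: since $\bm{y}^{\rm tr}(z)$ and $\bm{y}^{\rm va}(z)$ are affine in $z$, the validation residual is affine in $z$ and $E_K(z)$ is a quadratic $c_2 z^2 + c_1 z + c_0$. The function names are illustrative and this is not the paper's implementation; $\mathcal{Z}_2$ is obtained by intersecting the resulting regions over $K$ and over the intervals on which the active sets stay fixed.

```python
# Sketch: quadratic coefficients of E_K(z) on a fixed active set, and the region
# where one validation-error quadratic is no greater than another.
import numpy as np

def validation_error_quadratic(X_tr_M, X_va_M, a_tr, b_tr, a_va, b_va):
    pinv = np.linalg.pinv(X_tr_M)                 # least-squares fit on training data
    u = a_va - X_va_M @ (pinv @ a_tr)             # residual intercept (in z)
    v = b_va - X_va_M @ (pinv @ b_tr)             # residual slope (in z)
    return np.array([v @ v, 2.0 * u @ v, u @ u])  # [c2, c1, c0]

def region_where_first_is_smaller(q_sel, q_other):
    """Intervals of z where (q_sel - q_other)(z) <= 0, coefficients in [c2, c1, c0] order."""
    d2, d1, d0 = q_sel - q_other
    if abs(d2) < 1e-12:                           # difference is (at most) linear
        if abs(d1) < 1e-12:
            return [(-np.inf, np.inf)] if d0 <= 0 else []
        root = -d0 / d1
        return [(-np.inf, root)] if d1 > 0 else [(root, np.inf)]
    disc = d1 * d1 - 4.0 * d2 * d0
    if disc < 0:                                  # no real roots: sign is that of d2
        return [(-np.inf, np.inf)] if d2 < 0 else []
    r1 = (-d1 - np.sqrt(disc)) / (2.0 * d2)
    r2 = (-d1 + np.sqrt(disc)) / (2.0 * d2)
    lo, hi = min(r1, r2), max(r1, r2)
    if d2 > 0:
        return [(lo, hi)]                         # <= 0 between the roots
    return [(-np.inf, lo), (hi, np.inf)]          # <= 0 outside the roots
```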

Loftus (2015) already discussed the possibility of incorporating the cross-validation event into the conditional SI framework. However, his method is highly over-conditioned in the sense that it requires additional conditioning on all intermediate models obtained in the process of cross-validation. The method described above is minimally conditioned SI in the sense that our inference is conducted exactly on the conditional sampling distribution of the test-statistic in (23) without any extra conditions.

For the experiments on cross-validation, we compare the TPRs and the CIs between the case where $K = 9$ is fixed and the cases where $K$ is selected from the set $\mathcal{K}_1 = \{3, 6, 9\}$ or $\mathcal{K}_2 = \{1, 2, \ldots, 10\}$ using 5-fold cross-validation. We set $p = 10$; only the first element of $\bm{\beta}$ was set to $0.25$, and all the remaining elements were set to $0$. Figure 7 shows that the TPR tends to decrease as the size of $\mathcal{K}$ increases. This is because, when we increase the size of $\mathcal{K}$, we have to condition on more information, which leads to a shorter truncation interval and hence a lower TPR. The TPR results are consistent with the CI results shown in Figure 8: as the size of $\mathcal{K}$ increases, the TPR decreases and the CIs become longer.
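For reference, the following is a minimal sketch (not the authors' code) of how $K$ can be chosen by 5-fold cross-validation over a candidate set, using a simple greedy forward SFS based on residual sum of squares; the helpers `forward_sfs` and `select_K_by_cv` are hypothetical and may differ from the exact selection criterion used in the paper.

```python
# Sketch: select K by 5-fold cross-validation over a candidate set.
import numpy as np

def forward_sfs(X, y, K):
    """Greedy forward selection: at each step, add the feature that most reduces the RSS."""
    active = []
    for _ in range(K):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in active:
                continue
            cols = active + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        active.append(best_j)
    return active

def select_K_by_cv(X, y, candidates, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_error = {}
    for K in candidates:
        err = 0.0
        for va in folds:
            tr = np.setdiff1d(np.arange(len(y)), va)
            cols = forward_sfs(X[tr], y[tr], K)
            beta, *_ = np.linalg.lstsq(X[tr][:, cols], y[tr], rcond=None)
            err += np.sum((y[va] - X[va][:, cols] @ beta) ** 2)
        cv_error[K] = err
    return min(cv_error, key=cv_error.get)
```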

Figure 7: Demonstration of the TPR when accounting for the cross-validation selection event.
Figure 8: Demonstration of the CI length when accounting for the cross-validation selection event.

References

  • Lee et al. (2016) Jason D Lee, Dennis L Sun, Yuekai Sun, Jonathan E Taylor, et al. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
  • Tibshirani et al. (2016) Ryan J Tibshirani, Jonathan Taylor, Richard Lockhart, and Robert Tibshirani. Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620, 2016.
  • Fithian et al. (2014) William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
  • Benjamini and Yekutieli (2005) Yoav Benjamini and Daniel Yekutieli. False discovery rate–adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):71–81, 2005.
  • Leeb and Pötscher (2005) Hannes Leeb and Benedikt M Pötscher. Model selection and inference: Facts and fiction. Econometric Theory, pages 21–59, 2005.
  • Leeb et al. (2006) Hannes Leeb, Benedikt M Pötscher, et al. Can one estimate the conditional distribution of post-model-selection estimators? The Annals of Statistics, 34(5):2554–2591, 2006.
  • Benjamini et al. (2009) Yoav Benjamini, Ruth Heller, and Daniel Yekutieli. Selective inference in complex research. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4255–4271, 2009.
  • Pötscher et al. (2010) Benedikt M Pötscher, Ulrike Schneider, et al. Confidence sets based on penalized maximum likelihood estimators in gaussian regression. Electronic Journal of Statistics, 4:334–360, 2010.
  • Berk et al. (2013) Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, Linda Zhao, et al. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.
  • Lockhart et al. (2014) Richard Lockhart, Jonathan Taylor, Ryan J Tibshirani, and Robert Tibshirani. A significance test for the lasso. Annals of Statistics, 42(2):413, 2014.
  • Taylor et al. (2014) Jonathan Taylor, Richard Lockhart, Ryan J Tibshirani, and Robert Tibshirani. Post-selection adaptive inference for least angle regression and the lasso. arXiv preprint arXiv:1401.3889, 354, 2014.
  • Fithian et al. (2015) William Fithian, Jonathan Taylor, Robert Tibshirani, and Ryan Tibshirani. Selective sequential model selection. arXiv preprint arXiv:1512.02565, 2015.
  • Choi et al. (2017) Yunjin Choi, Jonathan Taylor, Robert Tibshirani, et al. Selecting the number of principal components: Estimation of the true rank of a noisy matrix. The Annals of Statistics, 45(6):2590–2617, 2017.
  • Tian et al. (2018) Xiaoying Tian, Jonathan Taylor, et al. Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710, 2018.
  • Chen and Bien (2019) Shuxiao Chen and Jacob Bien. Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, pages 1–12, 2019.
  • Hyun et al. (2018) Sangwon Hyun, Kevin Lin, Max G’Sell, and Ryan J Tibshirani. Post-selection inference for changepoint detection algorithms with application to copy number variation data. arXiv preprint arXiv:1812.03644, 2018.
  • Loftus and Taylor (2014) Joshua R Loftus and Jonathan E Taylor. A significance test for forward stepwise model selection. arXiv preprint arXiv:1405.3920, 2014.
  • Loftus and Taylor (2015) Joshua R Loftus and Jonathan E Taylor. Selective inference in regression models with groups of variables. arXiv preprint arXiv:1511.01478, 2015.
  • Panigrahi et al. (2016) Snigdha Panigrahi, Jonathan Taylor, and Asaf Weinstein. Bayesian post-selection inference in the linear model. arXiv preprint arXiv:1605.08824, 28, 2016.
  • Yang et al. (2016) Fan Yang, Rina Foygel Barber, Prateek Jain, and John Lafferty. Selective inference for group-sparse linear models. In Advances in Neural Information Processing Systems, pages 2469–2477, 2016.
  • Suzumura et al. (2017) Shinya Suzumura, Kazuya Nakagawa, Yuta Umezu, Koji Tsuda, and Ichiro Takeuchi. Selective inference for sparse high-order interaction models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3338–3347. JMLR.org, 2017.
  • Tanizaki et al. (2020) Kosuke Tanizaki, Noriaki Hashimoto, Yu Inatsu, Hidekata Hontani, and Ichiro Takeuchi. Computing valid p-values for image segmentation by selective inference. 2020.
  • Duy et al. (2020a) Vo Nguyen Le Duy, Hiroki Toda, Ryota Sugiyama, and Ichiro Takeuchi. Computing valid p-value for optimal changepoint by selective inference using dynamic programming. arXiv preprint arXiv:2002.09132, 2020a.
  • Terada and Shimodaira (2019) Yoshikazu Terada and Hidetoshi Shimodaira. Selective inference after variable selection via multiscale bootstrap. arXiv preprint arXiv:1905.10573, 2019.
  • Liu et al. (2018) Keli Liu, Jelena Markovic, and Robert Tibshirani. More powerful post-selection inference, with application to the lasso. arXiv preprint arXiv:1801.09037, 2018.
  • Duy and Takeuchi (2020) Vo Nguyen Le Duy and Ichiro Takeuchi. Parametric programming approach for powerful lasso selective inference without conditioning on signs. arXiv preprint arXiv:2004.09749, 2020.
  • Osborne et al. (2000) Michael R Osborne, Brett Presnell, and Berwin A Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000.
  • Efron and Tibshirani (2004) B. Efron and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
  • Hastie et al. (2004) T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–415, 2004.
  • Rosset and Zhu (2007) S. Rosset and J. Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35:1012–1030, 2007.
  • Bach et al. (2006) F. R. Bach, D. Heckerman, and E. Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7:1713–41, 2006.
  • Tsuda (2007) K. Tsuda. Entire regularization paths for graph data. In Proc. of ICML 2007, pages 919–925, 2007.
  • Lee and Scott (2007) G. Lee and C. Scott. The one class support vector machine solution path. In Proc. of ICASSP 2007, pages II521–II524, 2007.
  • Takeuchi et al. (2009) I. Takeuchi, K. Nomura, and T. Kanamori. Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Computation, 21(2):539–559, 2009.
  • Takeuchi and Sugiyama (2011) Ichiro Takeuchi and Masashi Sugiyama. Target neighbor consistent feature weighting for nearest neighbor classification. In Advances in neural information processing systems, pages 576–584, 2011.
  • Karasuyama and Takeuchi (2010) M. Karasuyama and I. Takeuchi. Nonlinear regularization path for quadratic loss support vector machines. IEEE Transactions on Neural Networks, 22(10):1613–1625, 2010.
  • Hocking et al. (2011) T. Hocking, J. P. Vert, F. Bach, and A. Joulin. Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, pages 745–752, 2011.
  • Karasuyama et al. (2012) M. Karasuyama, N. Harada, M. Sugiyama, and I. Takeuchi. Multi-parametric solution-path algorithm for instance-weighted support vector machines. Machine Learning, 88(3):297–330, 2012.
  • Ogawa et al. (2013) Kohei Ogawa, Motoki Imamura, Ichiro Takeuchi, and Masashi Sugiyama. Infinitesimal annealing for training semi-supervised support vector machines. In International Conference on Machine Learning, pages 897–905, 2013.
  • Takeuchi et al. (2013) Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, and Shinichi Nakajima. Parametric task learning. Advances in Neural Information Processing Systems, 26:1358–1366, 2013.
  • Duy et al. (2020b) Vo Nguyen Le Duy, Shogo Iwazaki, and Ichiro Takeuchi. Quantifying statistical significance of neural network representation-driven hypotheses by selective inference. arXiv preprint arXiv:2010.01823, 2020b.