
On the Selection Stability of Stability Selection and Its Applications

Mahdi Nouraie, School of Mathematical and Physical Sciences, Macquarie University (ORCID: 0000-0002-4792-4994); Samuel Muller, School of Mathematical and Physical Sciences, Macquarie University, and School of Mathematics and Statistics, The University of Sydney (ORCID: 0000-0002-3087-8127). Address for correspondence: [email protected]
Abstract

Stability selection is a widely adopted resampling-based framework for high-dimensional structure estimation and variable selection. However, the concept of ‘stability’ is often narrowly addressed, primarily through examining selection frequencies, or ‘stability paths’. This paper seeks to broaden the use of an established stability estimator to evaluate the overall stability of the stability selection framework, moving beyond single-variable analysis. We suggest that the stability estimator offers two advantages: it can serve as a reference to reflect the robustness of the outcomes obtained, and it can help identify an optimal regularization value to improve stability. By determining this value, we aim to calibrate key stability selection parameters, namely the decision threshold and the expected number of falsely selected variables, within established theoretical bounds. Furthermore, we explore a novel selection criterion based on this regularization value. With the asymptotic distribution of the stability estimator previously established, convergence to the true stability is ensured, allowing us to observe stability trends over successive sub-samples. This approach sheds light on the required number of sub-samples, addressing a notable gap in prior studies. The stabplot package is developed to facilitate the use of the plots featured in this manuscript, supporting their integration into further statistical analysis and research workflows.

Keywords: Bioinformatics, Feature Selection, Model Selection, Structure Estimation, Variable Selection

1 Introduction

The estimation of discrete structures, including graphs, clusters, or selection of variables, represents a long-standing and fundamental problem in statistics (Meinshausen and Bühlmann, 2010). Variable selection is a critical phase in the modeling pipeline and a fundamental step in data preprocessing, where raw data is refined to ensure its suitability for analysis or modeling. The objective is to eliminate variables that are noisy, redundant, or irrelevant (Tibshirani, 1996; Müller and Welsh, 2010), addressing a key challenge in contemporary data science (Nogueira et al., 2018).

Stability selection, a widely recognized resampling-based variable selection framework, performs variable selection by examining how frequently variables are chosen across multiple randomly selected resamples (Meinshausen and Bühlmann, 2010). The selection frequency indicates how often a variable is included in the estimated models over various resamples. Meinshausen and Bühlmann (2010) analyzed the selection frequencies for individual variables over a range of regularization values, presenting the results in a plot known as the ‘stability path’. This plot facilitates the identification of variables that are consistently selected over the regularization grid.

In this paper, we shift the focus from examining the stability of individual variables across a grid of regularization values to evaluating the overall stability of the entire stability selection framework across the regularization grid. Employing an established stability estimator, we assess the stability of the framework in its entirety and introduce what we call ‘Stable Stability Selection’, which identifies the smallest regularization value that yields highly stable outcomes.

We illustrate the stable stability selection methodology through variable selection in the context of linear regression. However, the applicability of this approach extends beyond linear regression, as stability selection is not confined to this specific problem. Although detailed exploration is beyond the scope of this paper, stable stability selection is equally suitable for more complex regression-type models.

The stability of a variable selection method refers to the consistency with which it selects variables across different training sets drawn from the same underlying distribution (Kalousis et al., 2007). Evaluation of the stability of a statistical model by random resampling has been a well-established practice for decades. For example, Altman and Andersen (1989) applied this approach in Cox’s proportional hazards regression model (Cox, 1972) to identify stable variables, which were defined as those consistently selected with higher frequencies across multiple resamples.

In the seminal work of Meinshausen and Bühlmann (2010), the stability selection framework was introduced, which offers a transformative method to identify stable variables in structure estimation tasks. The stability selection approach described by Meinshausen and Bühlmann (2010) carries out variable selection on sub-samples of half the size of the original dataset. Stability selection enables the identification of variables that are consistently recognized as important across the majority of sub-samples for a given regularization value, considering them stable variables. Furthermore, Meinshausen and Bühlmann (2010) established an asymptotic upper-bound for the Per-Family Error Rate, which represents the expected number of falsely selected variables. This bound relies on the assumptions that the distribution of falsely selected variables is exchangeable and that the selection algorithm performs no worse than random guessing.

Shah and Samworth (2013) introduced the complementary pairs stability selection method, which computes selection frequencies by counting how often a variable is included in models fitted to complementary 50\% sub-samples. They also derived more rigorous upper-bounds for the expected number of variables selected despite having low selection probability.

It is somewhat surprising that, despite the framework being designated as ‘stability’ selection and the considerable attention stability selection has received since its inception, the overall stability of the framework itself has been largely overlooked, with no dedicated studies specifically addressing this critical aspect.

There is a substantial body of research on the stability of variable selection methods, where numerous stability measures have been proposed. For more comprehensive information, we refer to Kuncheva (2007), Nogueira et al. (2018), and Sen et al. (2021).

Nogueira et al. (2018) conducted an extensive literature review on the topic, consolidating the desirable properties of stability measures into five mathematical conditions: being fully defined, strict monotonicity, constant bounds, maximum stability if and only if the selection is deterministic, and correction for chance. A brief overview of these properties is provided in the following paragraph. Nogueira et al. (2018) showed that none of the previously introduced stability measures satisfied all five conditions. In response, Nogueira et al. (2018) proposed a novel stability measure, the first in the literature to meet all these criteria, generalizing several previous works such as Kuncheva (2007).

Here, we give a concise review of the five conditions proposed by Nogueira et al. (2018). The first property stipulates that the stability measure should function effectively regardless of variations in the number of selected variables, meaning that it should not be limited to scenarios where the number of selected variables remains constant across resamples. The second property asserts that the stability measure should be a strictly decreasing function of the sample variance of the binary selection status of each variable across random resamples; as the variance in the selection status of each variable increases, the stability of the framework diminishes. The third property states that the stability measure should be bounded by specific constants, which facilitates the comparison of the stability of different procedures. The fourth property implies that maximum stability is achieved when the selection results across all resamples are identical, and vice versa. Finally, the fifth property posits that if all possible selection results are equally likely to occur across random resamples, the expected value of the stability measure should remain constant. This indicates that the stability measure should not be influenced by random patterns that may arise during the resampling process.

Nogueira et al. (2018) conceptualized the stability of a variable selection procedure as an underlying population parameter that requires estimation. In addition, Nogueira et al. (2018), linking their work with Gwet (2008), demonstrated that as the number of resamples approaches infinity, their stability measure, \hat{\Phi}, converges to a Normal distribution with location parameter \mu=\Phi, where \Phi denotes the population stability. Consequently, \hat{\Phi} is a consistent estimator of the population stability \Phi; that is, as the number of resamples approaches infinity, \hat{\Phi} converges in probability to the true population stability \Phi. This property allows us to construct confidence intervals and perform hypothesis tests on the stability values.

Nogueira et al. (2018) compared the stability of stability selection with that of the Least Absolute Shrinkage and Selection Operator (LASSO; Tibshirani, 1996), concluding that the former exhibits greater stability in most of the scenarios considered.

In this paper, our aim is to apply the stability estimator proposed by Nogueira et al. (2018) to evaluate the stability of stability selection. Furthermore, we use the stability values as a guide to find the optimal regularization value that achieves high stability with the least possible loss in accuracy, if applicable. The optimal regularization value can be employed to calibrate two critical parameters of stability selection, the decision-making threshold and the expected number of falsely selected variables, balancing them with respect to one another. We also use the optimal regularization value to define a new selection criterion, called stable stability selection, which prioritizes the selection of variables associated with highly stable outcomes, rather than those mostly selected in their best-case scenarios over the regularization grid, which is the primary criterion in the traditional stability selection framework.

In addition, the convergence value of the stability estimator serves as a reference to assess the reliability of the results obtained. If the stability selection process demonstrates instability, it is unreasonable to expect that its selections and preferences will be reliable. The point at which convergence occurs provides valuable insight into the required number of sub-samples, an aspect for which we found no dedicated research addressing its determination.

Nogueira et al. (2018) suggested that stability and accuracy can be interpreted through the lens of a Pareto front (Pareto, 1896). We refer to this methodology as stability-accuracy selection. Building on this perspective, we demonstrate that, given two reasonable assumptions, our proposed regularization value constitutes a Pareto-optimal solution within this context.

Alternative stability estimators, aside from that proposed by Nogueira et al. (2018), may also be applicable; however, a detailed comparison and evaluation of different stability measures in the context of stability selection falls outside the scope of this paper.

The structure of this paper is organized as follows. Section 2 outlines the proposed methodology. Section 3 describes the real and synthetic datasets used in the paper. The results of the application of the method to real and synthetic data are presented in Section 4. Finally, Section 5 summarizes and concludes the paper.

2 Methodology

In this section, we present a methodology to evaluate the selection stability of the stability selection framework and to determine the optimal regularization value accordingly. Although we demonstrate the proposed method within the context of stability selection outlined by Meinshausen and Bühlmann (2010), it is important to note that the methodology is adaptable to other formulations of stability selection, such as those described by Shah and Samworth (2013) and Beinrucker et al. (2016). We will introduce our methodology following a brief overview of the foundational approach established by Meinshausen and Bühlmann (2010).

Stability selection

We consider a dataset \mathcal{D}=\{(\boldsymbol{x}^{\top}_{i},y_{i})\}_{i=1}^{n}, where each element consists of a univariate response y_{i}\in\mathbb{R} and a p-dimensional vector of covariates \boldsymbol{x}_{i}\in\mathbb{R}^{p}. In the linear regression context, the pairs (\boldsymbol{x}^{\top}_{i},y_{i}) are typically assumed to be independent and identically distributed (i.i.d.) when the covariates are treated as random; for simplicity, however, we assume that the covariate vectors are fixed.

The linear regression model is formally expressed as \boldsymbol{Y}=\beta_{0}+X\boldsymbol{\beta}+\boldsymbol{\varepsilon}, where X\in\operatorname{mat}(n\times p) represents the design matrix, \boldsymbol{Y}\in\mathbb{R}^{n} is the n-dimensional vector corresponding to the univariate response variable, and \beta_{0}\in\mathbb{R} denotes the intercept term. The vector of regression coefficients for the p non-constant covariates is denoted by \boldsymbol{\beta}\in\mathbb{R}^{p}, while \boldsymbol{\varepsilon}\in\mathbb{R}^{n} represents the n-dimensional vector of random errors. The components of \boldsymbol{\varepsilon} are assumed to be i.i.d. and independent of the covariates.

Variable selection, in accordance with the terminology of Meinshausen and Bühlmann (2010), typically involves categorizing covariates into two distinct groups: the signal group S=\{k\neq 0\mid\beta_{k}\neq 0\} and the noise group N=\{k\neq 0\mid\beta_{k}=0\}, where S\cap N=\emptyset. The primary objective of variable selection is to accurately identify the signal group S.

The stability selection framework requires the implementation of an appropriate selection method. One commonly used technique for variable selection in the linear regression context is the LASSO estimator, formally defined as

\hat{\beta}_{0}(\lambda),\boldsymbol{\hat{\beta}}(\lambda)=\operatorname*{arg\,min}_{\beta_{0}\in\mathbb{R},\,\boldsymbol{\beta}\in\mathbb{R}^{p}}\left(\|\boldsymbol{Y}-\beta_{0}-X\boldsymbol{\beta}\|_{2}^{2}+\lambda\sum_{k=1}^{p}|\beta_{k}|\right),

where λ+\lambda\in\mathbb{R}^{+} represents the LASSO regularization parameter. The set of non-zero coefficients can be identified as S^(λ)={k0β^k(λ)0}\hat{S}(\lambda)=\{k\neq 0\mid\hat{\beta}_{k}(\lambda)\neq 0\}, derived from solving the LASSO equation through convex optimization.

Although alternative methods exist for variable selection in linear regression models, a detailed exploration of techniques for determining the regularization method and penalty term in stability selection lies beyond the scope of this paper.

Meinshausen and Bühlmann (2010) defined the stable set as

\hat{S}^{\text{stable}}=\{j\mid\max_{\lambda\in\Lambda}(\hat{\Pi}_{j}^{\lambda})\geq\pi_{\text{thr}}\};\quad j=1,\ldots,p, (1)

where \Lambda represents the set of regularization values, 0<\pi_{\text{thr}}<1 is the threshold for decision-making in variable selection, and \hat{\Pi}_{j}^{\lambda} denotes the selection frequency of the jth variable given the regularization parameter \lambda.

Equation (1) identifies variables whose selection frequencies exceed \pi_{\text{thr}} under the regularization value that maximizes their selection frequency; it therefore judges each variable by its best-case scenario.
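As a minimal sketch, the best-case rule of Equation (1) reduces to a column-wise maximum over the regularization grid; the frequency matrix and threshold below are illustrative numbers, not taken from the paper.

```python
import numpy as np

# Hypothetical selection-frequency matrix: rows index the regularization
# grid Lambda, columns index variables; entry [l, j] is the selection
# frequency of variable j under the l-th lambda (illustrative only).
Pi = np.array([
    [0.95, 0.60, 0.10],
    [0.90, 0.75, 0.05],
    [0.70, 0.40, 0.00],
])
pi_thr = 0.7

# Equation (1): a variable enters the stable set if its *maximum*
# selection frequency over the grid reaches the threshold.
stable_set = np.where(Pi.max(axis=0) >= pi_thr)[0]
print(stable_set)  # [0 1]
```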

Stable stability selection

We now move forward to stable stability selection. To implement this, we first define a grid of regularization values, \Lambda, from which the optimal \lambda for LASSO is to be identified. For this purpose, we use as the default choice the \lambda values generated by the cv.glmnet function from the glmnet package in R (Friedman et al., 2010), applying 10-fold cross-validation to the complete dataset \mathcal{D}. Alternative regularization grids can also be employed; however, a comparison of different methods for generating the regularization grid falls outside the scope of this paper. The goal is to identify the optimal regularization value \lambda_{\text{stable}}\in\Lambda that yields highly stable outcomes with the least possible loss in predictive ability.

The following procedure is applied to each regularization value \lambda\in\Lambda. In each iteration of the stability selection process, a sub-sample comprising half of the original dataset \mathcal{D} is drawn at random. The LASSO model is then fitted to this sub-sample using the fixed value of \lambda, and the binary selection outcomes are recorded as a row of the binary selection matrix M(\lambda)\in\operatorname{mat}(B\times p), where B denotes the number of sub-samples. In the end, we therefore obtain |\Lambda| distinct selection matrices M(\lambda).
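The construction of M(\lambda) can be sketched as follows. This is our illustration, not the authors' implementation: the selector is a pluggable function, and the marginal-correlation rule used in the demo is only a stand-in for a LASSO fit at a fixed \lambda (which in the paper is done via glmnet).

```python
import numpy as np

rng = np.random.default_rng(0)

def selection_matrix(X, y, lam, B, select, rng):
    """Build the B x p binary selection matrix M(lambda): row b records
    which variables `select` keeps on the b-th 50% sub-sample."""
    n, p = X.shape
    M = np.zeros((B, p), dtype=int)
    for b in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)  # half-size sub-sample
        M[b] = select(X[idx], y[idx], lam)
    return M

# Toy stand-in selector: keep variables whose absolute marginal
# correlation with y exceeds a lambda-dependent cut-off.
def corr_select(X, y, lam):
    r = np.abs(np.corrcoef(X.T, y)[-1, :-1])
    return (r > lam).astype(int)

X = rng.normal(size=(50, 10))
y = 2.0 * X[:, 0] + rng.normal(size=50)
M = selection_matrix(X, y, lam=0.5, B=100, select=corr_select, rng=rng)
print(M.shape)  # (100, 10)
```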

In this context, M(\lambda)_{bj}=1 indicates that the jth variable is identified as part of the signal set \hat{S}(\lambda) when LASSO is applied to the bth sub-sample given \lambda. In contrast, M(\lambda)_{bj}=0 signifies that the jth variable is classified as belonging to the noise set \hat{N}(\lambda) after applying LASSO to the bth sub-sample given \lambda.

The stability estimator proposed by Nogueira et al. (2018) is defined as

\hat{\Phi}(M(\lambda))=1-\frac{\frac{1}{p}\sum_{j=1}^{p}s_{j}^{2}}{\frac{q(\lambda)}{p}\left(1-\frac{q(\lambda)}{p}\right)};\quad\lambda\in\Lambda, (2)

where s_{j}^{2} denotes the unbiased sample variance of the binary selection statuses of the jth variable, while q(\lambda) denotes the average number of variables selected under the regularization parameter \lambda. The stability estimator \hat{\Phi} is bounded by \left[-\frac{1}{B-1},1\right] (Nogueira et al., 2018); the larger \hat{\Phi}, the more stable M(\lambda).
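The estimator in Equation (2) is direct to compute from a selection matrix; a minimal numpy transcription (ours, not the authors' reference code):

```python
import numpy as np

def stability(M):
    """Nogueira et al. (2018) stability estimator, Equation (2),
    for a B x p binary selection matrix M."""
    B, p = M.shape
    s2 = M.var(axis=0, ddof=1)   # unbiased sample variance per variable
    q = M.sum(axis=1).mean()     # q(lambda): average number selected
    return 1.0 - s2.mean() / ((q / p) * (1.0 - q / p))

# A perfectly repeatable selection attains the maximum stability of 1.
M_det = np.tile([1, 1, 0, 0, 0], (20, 1))
print(stability(M_det))  # 1.0
```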

To interpret the stability values of Equation (2), Nogueira et al. (2018) adopted the guidelines proposed by Fleiss et al. (2004). According to Nogueira et al. (2018), stability values that exceed 0.75 indicate excellent agreement between selections beyond what would be expected by random chance, while values below 0.4 signify poor agreement between them. Based on Nogueira et al. (2018), stability values between these thresholds are categorized as indicative of intermediate to good stability.

By obtaining all |\Lambda| selection matrices, we can estimate the stability of each using the formulation given in Equation (2). We propose that the optimal regularization value, denoted \lambda_{\text{stable}}, is the smallest regularization value at which the stability measure surpasses 0.75, that is,

\lambda_{\text{stable}}=\min\left\{\lambda\in\Lambda\mid\hat{\Phi}(M(\lambda))\geq 0.75\right\}. (3)

The threshold value of 0.75 is somewhat arbitrary and may not be attainable in certain situations. We address this limitation in the following paragraph.
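Given the estimated stabilities over the grid, Equation (3) is a one-liner; the grid and stability values below are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical regularization grid and estimated stabilities over it.
lambdas = np.array([0.05, 0.10, 0.20, 0.40, 0.80])
phi = np.array([0.35, 0.55, 0.72, 0.78, 0.82])

def lambda_stable(lambdas, phi, thr=0.75):
    """Equation (3): smallest lambda whose stability reaches the
    threshold; None when no lambda qualifies."""
    ok = phi >= thr
    return lambdas[ok].min() if ok.any() else None

print(lambda_stable(lambdas, phi))  # 0.4
```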

As we will see in Section 4, \lambda_{\text{stable}} may not exist in certain practical applications due to procedural instability, which prevents the stability values from exceeding 0.75. In such cases, we propose

\lambda_{\text{stable-1sd}}=\min\left\{\lambda\in\Lambda\mid\hat{\Phi}(M(\lambda))\geq\max_{\lambda^{\prime}\in\Lambda}\hat{\Phi}(M(\lambda^{\prime}))-\text{SD}_{\lambda^{\prime}\in\Lambda}\left(\hat{\Phi}(M(\lambda^{\prime}))\right)\right\}, (4)

where SD represents the standard deviation of the stability values over the grid.
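The fallback rule of Equation (4) can be sketched as follows. Note one assumption on our part: the paper does not specify whether SD is the sample or population standard deviation of the stability values, so the sketch uses the sample version (ddof=1); the numbers are again illustrative.

```python
import numpy as np

def lambda_stable_1sd(lambdas, phi):
    """Equation (4): smallest lambda whose stability lies within one
    standard deviation of the best stability on the grid.
    (Sample SD, ddof=1, is an assumption of this sketch.)"""
    cut = phi.max() - phi.std(ddof=1)
    return lambdas[phi >= cut].min()

# A grid where stability never reaches 0.75, so lambda_stable fails.
lambdas = np.array([0.05, 0.10, 0.20, 0.40])
phi = np.array([0.30, 0.45, 0.60, 0.62])
print(lambda_stable_1sd(lambdas, phi))  # 0.2
```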

Both upper-bounds of the Per-Family Error Rate mentioned in Section 1, provided by Meinshausen and Bühlmann (2010) and Shah and Samworth (2013), depend on the values of \lambda and \pi_{\text{thr}}. Calibrating these two parameters requires arbitrarily fixing one of them, which can be challenging to justify. By adopting the optimal regularization value, that is, \lambda=\lambda_{\text{stable}} or \lambda=\lambda_{\text{stable-1sd}}, the upper-bound can be explicitly tailored to the data owner's preferences: if a pre-determined \pi_{\text{thr}} is chosen, the Per-Family Error Rate bound is deterministically obtained; alternatively, if a fixed pre-determined Per-Family Error Rate is desired, the corresponding \pi_{\text{thr}} is enforced.

The asymptotic upper-bound provided by Meinshausen and Bühlmann (2010) for the Per-Family Error Rate is given by

\text{PFER}(\Lambda,\pi_{\text{thr}})=\frac{1}{2\pi_{\text{thr}}-1}\frac{q_{\Lambda}^{2}}{p}, (5)

where q_{\Lambda} represents the average number of selected variables over the regularization grid \Lambda. In a manner consistent with the approach of Bodinier et al. (2023), we can derive a point-wise control version of Equation (5) by restricting \Lambda to a single value, \lambda, without affecting the validity of the original equation. Consequently, we can replace q_{\Lambda} with q(\lambda), as introduced in Equation (2). By employing \lambda_{\text{stable}} within this formula and utilizing q(\lambda_{\text{stable}}), we establish a two-way control: fixing \pi_{\text{thr}} determines the upper-bound, and vice versa. A similar approach can be applied to the upper-bounds proposed by Shah and Samworth (2013).
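This two-way control is elementary to compute. A sketch of the point-wise bound and its inversion, with illustrative numbers for q(\lambda_{\text{stable}}), p, and \pi_{\text{thr}} (the bound requires \pi_{\text{thr}} > 0.5):

```python
def pfer_bound(q_lam, p, pi_thr):
    """Point-wise version of Equation (5):
    PFER <= q(lambda)^2 / ((2*pi_thr - 1) * p)."""
    assert pi_thr > 0.5, "bound requires pi_thr in (0.5, 1)"
    return q_lam ** 2 / ((2 * pi_thr - 1) * p)

def pi_thr_for(q_lam, p, pfer):
    """Invert the bound: the threshold needed for a target PFER."""
    return 0.5 * (q_lam ** 2 / (p * pfer) + 1)

# Say q(lambda_stable) = 10 variables selected on average among p = 500.
print(pfer_bound(10, 500, 0.75))   # 0.4
print(pi_thr_for(10, 500, 0.4))    # 0.75
```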

Having \lambda_{\text{stable}} or \lambda_{\text{stable-1sd}} allows a fresh perspective on variable selection through stability selection. Meinshausen and Bühlmann (2010) introduced Equation (1) to identify stable variables, that is, those with selection frequencies exceeding \pi_{\text{thr}} in the best-case scenario. Instead, we propose

\hat{S}^{\text{stable}}=\left\{j\mid\hat{\Pi}_{j}^{\lambda_{\text{stable}}}\geq\pi_{\text{thr}}\right\};\quad j=1,\ldots,p. (6)

Equation (6) collects the variables with selection frequencies above \pi_{\text{thr}} under \lambda_{\text{stable}}, which corresponds to highly stable outcomes. We term this selection criterion stable stability selection because it leverages highly stable outcomes to facilitate variable selection. If \lambda_{\text{stable}} does not exist, \hat{S}^{\text{stable-1sd}} can be defined analogously.
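Unlike Equation (1), this rule thresholds the frequencies at the single stable regularization value rather than at each variable's best-case value. A sketch with illustrative frequencies:

```python
import numpy as np

# Hypothetical selection frequencies at lambda_stable for p = 5 variables.
freq_at_stable = np.array([0.98, 0.91, 0.40, 0.12, 0.66])
pi_thr = 0.6

# Equation (6): threshold the frequencies at the one stable lambda.
S_stable = np.where(freq_at_stable >= pi_thr)[0]
print(S_stable)  # [0 1 4]
```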

As outlined in Section 1, as the number of sub-samples increases, the convergence of \hat{\Phi} to the true stability \Phi is ensured. To illustrate this, we define M^{(t)}(\lambda) as the matrix containing the selection outcomes from the first t sub-samples for a given \lambda\in\Lambda, that is, the first t rows of M(\lambda). The objective is to evaluate the stability of M^{(t)}(\lambda) across successive sub-samples in order to determine an appropriate cut-off point for the process, that is, the number of sub-samples required to achieve convergence in stability. Beyond this point, additional sub-samples do not significantly change the stability of M(\lambda). We propose plotting the stability values of the selection matrix over the sequential sub-sampling, that is, \hat{\Phi}(M^{(t)}(\lambda)) for t=2,3,\ldots,B, against t to monitor the convergence of the stability values. Given that the asymptotic distribution of \hat{\Phi} is established, a confidence bound can be drawn along the curve to reflect the inherent uncertainty of the stability estimator.
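The trajectory \hat{\Phi}(M^{(t)}(\lambda)) is simply Equation (2) applied to growing row-prefixes of M(\lambda). A self-contained sketch on simulated selection outcomes (the selection probabilities are illustrative, not fitted to any dataset):

```python
import numpy as np

def stability(M):
    """Equation (2) applied to a B x p binary selection matrix."""
    s2 = M.var(axis=0, ddof=1)
    q = M.sum(axis=1).mean()
    p = M.shape[1]
    return 1.0 - s2.mean() / ((q / p) * (1.0 - q / p))

# Simulated outcomes: two strong variables kept ~95% of the time,
# eight noise variables ~5% (a fairly stable procedure).
rng = np.random.default_rng(1)
probs = np.array([0.95, 0.95] + [0.05] * 8)
M = (rng.random((500, 10)) < probs).astype(int)

# Stability of the first t sub-samples, t = 2, ..., B; plotting this
# trajectory against t shows where the curve flattens, suggesting how
# many sub-samples suffice.
trajectory = [stability(M[:t]) for t in range(2, M.shape[0] + 1)]
print(len(trajectory), round(trajectory[-1], 3))
```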

In Section 4, we present the results obtained from the application of the proposed approach to synthetic and real datasets, which are introduced in Section 3.

3 Datasets

We evaluate the proposed methodology on a synthetic dataset and two real bioinformatics datasets, as detailed below.

Synthetic Data

As synthetic data, we consider a dataset with sample size n=50 and p=500 predictor variables. In this scenario, the predictor vectors \boldsymbol{x}_{i} are independently drawn from the Normal distribution \mathcal{N}(\boldsymbol{0},\Sigma), where \Sigma\in\operatorname{mat}(p\times p) has a diagonal of ones and \sigma_{jk}=\rho^{|j-k|} for j\neq k, with \rho\in\{0.2,0.5,0.8\}. The response variable depends linearly on the first two predictor variables only, with coefficient vector \boldsymbol{\beta}=(1.5,1.1,0,\ldots,0)^{\top}, and the error term \boldsymbol{\varepsilon} is an i.i.d. sample from the standard Normal distribution \mathcal{N}(0,1).
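This design can be reproduced as follows; the random seed is arbitrary and one of the three \rho values is shown.

```python
import numpy as np

rng = np.random.default_rng(2025)
n, p, rho = 50, 500, 0.5          # one of the three correlation scenarios

# Sigma has unit diagonal and sigma_jk = rho**|j - k| off the diagonal.
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.zeros(p)
beta[:2] = [1.5, 1.1]             # only the first two variables are signal
y = X @ beta + rng.standard_normal(n)
print(X.shape, y.shape)  # (50, 500) (50,)
```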

Riboflavin Data

For the first real example, we use the well-established ‘riboflavin’ dataset, which focuses on the production of riboflavin (vitamin B2) from various strains of Bacillus subtilis. This dataset, provided by the Dutch State Mines Nutritional Products, is accessible via the hdi R package (Dezeure et al., 2015). It comprises a single continuous response variable, the logarithm of the riboflavin production rate, alongside p=4,088 covariates corresponding to the logarithm of the expression levels of 4,088 bacterial genes. The primary objective of analyzing this dataset is to identify genes that influence riboflavin production, with the ultimate goal of genetically engineering bacteria to enhance riboflavin yield. Data were collected from n=71 relatively homogeneous samples, which were repeatedly hybridized during a feed-batch fermentation process involving different engineered strains and varying fermentation conditions. Bühlmann et al. (2014) employed stability selection with LASSO and identified three genes, LYSC_at, YOAB_at, and YXLD_at, as having a significant impact on the response variable.

Affymetrix Rat Genome 230 2.0 Array

As an additional real-world example, we investigate the ‘Affymetrix Rat Genome 230 2.0 Array’ microarray data introduced by Scheetz et al. (2006). This dataset comprises n=120 twelve-week-old male rats, with expression levels recorded for nearly 32,000 gene probes per rat. The primary objective of this analysis is to identify the probes most strongly associated with the expression level of the TRIM32 probe (1389163_at), which has been linked to the development of Bardet-Biedl syndrome (Chiang et al., 2006). This genetically heterogeneous disorder affects multiple organ systems, including the retina.

In accordance with the preprocessing steps outlined by Huang et al. (2008), we excluded gene probes whose maximum expression level was below the 25th percentile and those exhibiting an expression range smaller than 2. This filtering yielded a refined set of p=3,083 gene probes with sufficient expression and variability for further analysis.
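A screening step of this kind can be sketched as below. Note the caveat: the exact conventions of Huang et al. (2008) may differ (for instance, whether the 25th percentile is taken over probe maxima or over all expression values); this sketch takes it over the probe maxima, and the expression matrix is simulated.

```python
import numpy as np

def filter_probes(expr):
    """Keep probes whose maximum expression reaches the 25th percentile
    of the probe maxima AND whose range (max - min) is at least 2.
    An illustrative reading of the screening rule, not a reproduction."""
    mx, mn = expr.max(axis=0), expr.min(axis=0)
    keep = (mx >= np.percentile(mx, 25)) & (mx - mn >= 2)
    return expr[:, keep], keep

rng = np.random.default_rng(7)
expr = rng.normal(5, 2, size=(120, 1000))   # fake n x p expression matrix
filtered, keep = filter_probes(expr)
print(expr.shape, "->", filtered.shape)
```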

4 Results

In this section, we illustrate the efficacy of the proposed methodology through its application to the synthetic and real datasets introduced in Section 3.

Synthetic Data

Here, we present the results obtained from applying stable stability selection to the synthetic data. We used the GitHub repository associated with Nogueira et al. (2018) to implement their stability estimator and to obtain confidence intervals (https://github.com/nogueirs/JMLR2018).

Figure 1 illustrates the selection stability of stability selection on a grid of regularization values, applied to the synthetic datasets introduced in Section 3, using three different values of \rho and B=500 sub-samples. The horizontal dashed red lines indicate the stability thresholds of 0.4 and 0.75 discussed in Section 2.

Figure 1: Selection stability of stability selection over the regularization grid on the synthetic data; panels: (a) \rho=0.2, (b) \rho=0.5, (c) \rho=0.8. [Figure not reproduced in this conversion.]

As illustrated in Figure 1, the regularization value that minimizes the cross-validation error, \lambda_{\text{min}}, falls within the poorly stable region. In addition, the regularization value that keeps the cross-validation error within one standard error of that at \lambda_{\text{min}}, referred to as \lambda_{\text{1se}} (Hastie et al., 2009), lies within the poor to intermediate stability regions. In contrast, \lambda_{\text{stable}}, introduced in Equation (3), is specifically designed to fall within the region of excellent stability, where applicable.

Notably, in all three correlation scenarios, the three regularization selection methods successfully identify the relevant variables with high selection frequencies, with a minimum selection frequency of 0.994. However, since \lambda_{\text{stable}} is larger than the other two regularization values, it applies stronger shrinkage to the coefficients. To further examine this in terms of model accuracy, for each correlation scenario we generate n^{\prime}=25 additional test samples from the corresponding distribution described in Section 3. At each iteration of stability selection, we predict the response for the test samples using the estimated model and, ultimately, aggregate the mean squared error (MSE) values obtained under the corresponding \lambda by averaging.

Figure 2: Selection stability and MSE of stability selection over the regularization grid on the synthetic data; panels: (a) \rho=0.2, (b) \rho=0.5, (c) \rho=0.8. [Figure not reproduced in this conversion.]

As demonstrated in Figure 2, across all three correlation scenarios, the improvement in stability when transitioning from λ1se\lambda_{\text{1se}} to λstable\lambda_{\text{stable}} far outweighs the reduction in accuracy. This outcome aligns closely with the findings of Nogueira et al. (2018), who noted that “All these observations show that stability can potentially be increased without loss of predictive power” and “We also observe that pursuing stability may help identifying the relevant set of features”.

Figure 3: Selection stability of stability selection over sequential sub-sampling on the synthetic data; panels: (a) \rho=0.2, (b) \rho=0.5, (c) \rho=0.8. [Figure not reproduced in this conversion.]

To demonstrate the convergence of the stability estimator \hat{\Phi}, for each correlation scenario we used \lambda_{\text{stable}} and calculated the stability of the selection matrix M(\lambda_{\text{stable}}) through sequential sub-sampling, using Equation (2). The stability values across iterations are shown in Figure 3. The blue shading around the line represents the 95\% confidence interval for \hat{\Phi}. As can be observed, the interval narrows with increasing iterations, indicating a reduction in the estimator's uncertainty. Figure 3 reveals that, across all three scenarios, the stability estimator converges after approximately B^{*}\approx 200 iterations, with no significant change in its value thereafter. By monitoring the stability trajectory throughout the process, valuable insight can be gained into the optimal cut-off point, that is, the number of sub-samples that suffices for a stable process.

Riboflavin Data

Next, we apply stable stability selection to the riboflavin dataset. As described in Section 3, the dataset consists of $p=4,088$ genes, the main objective being to identify the key genes that influence riboflavin production. As before, we choose the number of sub-samples $B=500$. The predictor variables are standardized using the scale function prior to being input into LASSO.

In this problem, $\lambda_{\text{stable}}$ does not exist, so we use $\lambda_{\text{stable-1sd}}$ instead. After $B$ iterations, four genes have selection frequencies greater than 0.5: YXLD_at ($\hat{\Pi}_{\texttt{YXLD\_at}}^{\lambda_{\text{stable-1sd}}}=0.606$), YOAB_at ($\hat{\Pi}_{\texttt{YOAB\_at}}^{\lambda_{\text{stable-1sd}}}=0.558$), LYSC_at ($\hat{\Pi}_{\texttt{LYSC\_at}}^{\lambda_{\text{stable-1sd}}}=0.540$), and YCKE_at ($\hat{\Pi}_{\texttt{YCKE\_at}}^{\lambda_{\text{stable-1sd}}}=0.532$).

Figure 4 illustrates the stability of stability selection when applied to the riboflavin dataset. As shown in Figure 4(a), $\lambda_{\text{stable}}$ does not exist in this example, as the stability values do not surpass the 0.75 threshold. Figure 4(b), generated with $\lambda_{\text{stable-1sd}}$, indicates convergence to slightly above 0.2 after about 200 iterations. These results imply that stability selection with LASSO exhibits poor selection stability on this dataset. Consequently, the findings presented by Bühlmann et al. (2014) may require careful reconsideration, particularly since the selection frequencies of the three identified genes are relatively low.

Figure 4: Selection stability of stability selection on riboflavin data

Affymetrix Rat Genome 230 2.0 Array

Finally, we apply the proposed methodology to the rat microarray data. As mentioned in Section 3, the data consist of $p=3,083$ gene probes. The main aim with these data is to identify probes that influence the TRIM32 probe. Again, we choose the number of sub-samples $B=500$. The predictor variables are standardized using the scale function prior to being input into LASSO.

Due to the instability of the outcomes, $\lambda_{\text{stable}}$ does not exist in this problem, so we again use $\lambda_{\text{stable-1sd}}$. After $B$ iterations, three probes have selection frequencies greater than 0.5: probe 1390539_at ($\hat{\Pi}_{\texttt{1390539\_at}}^{\lambda_{\text{stable-1sd}}}=0.640$), probe 1389457_at ($\hat{\Pi}_{\texttt{1389457\_at}}^{\lambda_{\text{stable-1sd}}}=0.570$), and probe 1376747_at ($\hat{\Pi}_{\texttt{1376747\_at}}^{\lambda_{\text{stable-1sd}}}=0.564$).

As shown in Figure 5, similar to the results for the riboflavin dataset, the procedure cannot be considered stable. Specifically, Figure 5(a) reveals that $\lambda_{\text{stable}}$ does not exist, as the stability values do not exceed the threshold of 0.75. Figure 5(b), generated with $\lambda_{\text{stable-1sd}}$, shows that the stability values converge to approximately 0.15.

The results illustrated in Figures 4 and 5 indicate that, while stability selection is a valuable approach, more can be revealed by focusing on the stability of stability selection itself. As the complexity of the data increases (in terms of interdependencies between variables, the number of variables subject to selection, and so on), the stability values decline markedly. It is therefore essential to evaluate the stability of the entire selection process before drawing any conclusions about the importance of variables from the outcomes of stability selection; neglecting this evaluation may result in overconfidence in results that lack stability. Given that such analyses are often underrepresented in the literature on stability selection, domain experts may benefit from carefully revisiting previously published findings.

Figure 5: Selection stability of stability selection on rat microarray data

The convergence plots for both synthetic and real examples indicate that, regardless of whether the convergence value is high or low, stability values tend to stabilize after $B^{*}\approx 200$ iterations. This observation could provide a useful rule of thumb for determining the required number of sub-samples in the stability selection framework.
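One simple way to operationalize this rule of thumb is to recompute the stability estimate as sub-samples accumulate and stop once the running value settles. The Python sketch below assumes a binary selection matrix and re-implements the Nogueira et al. (2018) estimator; the `window` and `tol` values are illustrative choices, not quantities from the paper:

```python
import numpy as np

def stability(M):
    # Nogueira et al. (2018) estimator on a (b, p) binary selection matrix
    b, p = M.shape
    freq = M.mean(axis=0)
    s2 = b / (b - 1) * freq * (1 - freq)
    kbar = M.sum(axis=1).mean()
    return 1.0 - s2.mean() / ((kbar / p) * (1.0 - kbar / p))

def stopping_iteration(M, window=25, tol=1e-3):
    """First sub-sample count after which the running stability estimate
    moves by less than `tol` over the previous `window` iterations.
    Returns None if the trajectory never settles."""
    path = [stability(M[:b]) for b in range(2, M.shape[0] + 1)]
    for i in range(window, len(path)):
        if max(path[i - window:i + 1]) - min(path[i - window:i + 1]) < tol:
            return i + 2  # path[i] is computed from the first i + 2 sub-samples
    return None
```

On a perfectly stable trajectory this criterion fires as soon as the window is filled; on the data analyzed here one would expect it to fire in the vicinity of $B^{*}\approx 200$.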

Stability and Accuracy from a Pareto Front Perspective

Nogueira et al. (2018) suggested that stability and accuracy can be analyzed using the concept of the Pareto front (Pareto, 1896), which identifies regularization values that are not dominated by any other in terms of both criteria. A regularization value is Pareto optimal if no other value on the regularization grid offers both higher stability and higher accuracy. We call this approach stability-accuracy selection, as it seeks to balance the two metrics. In this paper, we introduced stable stability selection, which focuses on identifying the regularization value that achieves high stability of the procedure with the least possible loss in accuracy.

As an experiment, in our synthetic data analysis we use the negative mean squared error ($-$MSE) as the measure of accuracy. Given that a Pareto optimal solution is not necessarily unique, we select the Pareto solution that maximizes the sum of accuracy and stability. In Figure 6(a), $\lambda_{\text{Pareto}}$ is located close to $\lambda_{\text{stable}}$, while in Figures 6(b) and 6(c), they coincide exactly. Interestingly, $\lambda_{\text{stable}}$ is also a Pareto solution in Figure 6(a). It is now pertinent to discuss the relationship between $\lambda_{\text{stable}}$ and Pareto optimality.
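The Pareto step described above can be sketched in a few lines of Python. This is an illustrative implementation under the usual dominance definition (a grid point is dominated if another point is at least as good on both criteria and strictly better on one); the toy stability and accuracy values are made up, not results from the paper:

```python
import numpy as np

def pareto_indices(stab, acc):
    """Indices of grid points not dominated in (stability, accuracy)."""
    stab, acc = np.asarray(stab), np.asarray(acc)
    keep = []
    for i in range(len(stab)):
        dominated = np.any((stab >= stab[i]) & (acc >= acc[i]) &
                           ((stab > stab[i]) | (acc > acc[i])))
        if not dominated:
            keep.append(i)
    return keep

# Toy grid: accuracy (-MSE) falls as regularization grows, stability rises then dips
stab = [0.40, 0.55, 0.80, 0.78]
acc = [-1.0, -1.1, -1.3, -1.35]
front = pareto_indices(stab, acc)                      # last point is dominated
best = max(front, key=lambda i: stab[i] + acc[i])      # tie-break on stability + accuracy
```

Here `best` picks, among the non-dominated grid points, the one maximizing the sum of stability and accuracy, mirroring the tie-breaking rule we use for $\lambda_{\text{Pareto}}$.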

Corollary 1.

Let $\lambda_{\text{stable}}$ be defined as in Equation (3) and assume that it exists. If the stability curve is non-decreasing up to $\lambda_{\text{stable}}$, and the loss function is non-decreasing after $\lambda_{\text{stable}}$, then $\lambda_{\text{stable}}$ is a Pareto optimal solution.

Proof.

Since the stability curve is non-decreasing prior to $\lambda_{\text{stable}}$, any regularization value $\lambda\in\Lambda$ with $\lambda<\lambda_{\text{stable}}$ is at most as stable as $\lambda_{\text{stable}}$. Therefore, these values do not dominate $\lambda_{\text{stable}}$ in terms of stability.

Similarly, since the loss function is non-decreasing after $\lambda_{\text{stable}}$, any regularization value $\lambda\in\Lambda$ with $\lambda>\lambda_{\text{stable}}$ is at most as accurate as $\lambda_{\text{stable}}$. Hence, these values do not dominate $\lambda_{\text{stable}}$ in terms of accuracy.

Since $\lambda_{\text{stable}}$ is not dominated by any smaller value in terms of stability nor by any larger value in terms of accuracy, we conclude that $\lambda_{\text{stable}}$ constitutes a Pareto optimal solution. ∎
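Corollary 1 can also be sanity-checked numerically. The curves below are synthetic and chosen to be strictly monotone, which is stronger than the corollary requires, so that both assumptions hold; they are not estimates from the paper’s experiments, and the 0.75 stability threshold is the one used throughout:

```python
import numpy as np

lam = np.linspace(0.01, 1.0, 50)      # regularization grid
stab = 0.2 + 0.7 * lam                # stability curve: increasing in lambda
loss = 0.4 + 0.5 * lam ** 2           # loss curve: increasing in lambda
acc = -loss                           # accuracy as negative loss
i = int(np.argmax(stab >= 0.75))      # first grid point where stability reaches 0.75

# Check that no other grid point dominates lam[i] in (stability, accuracy)
dominated = np.any((stab >= stab[i]) & (acc >= acc[i]) &
                   ((stab > stab[i]) | (acc > acc[i])))
print(dominated)  # → False
```

Smaller grid values are strictly less stable and larger ones strictly less accurate, so the point attaining the threshold is non-dominated, as the corollary asserts.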

Figure 6: $\lambda_{\text{min}}$, $\lambda_{\text{1se}}$, $\lambda_{\text{stable}}$, and $\lambda_{\text{Pareto}}$ over the regularization grid; panels: (a) $\rho=0.2$, (b) $\rho=0.5$, (c) $\rho=0.8$; the latter two values coincide when $\rho$ is 0.5 or 0.8

The two assumptions in Corollary 1 are both natural and justifiable. Increasing the regularization value is expected to allow the model to achieve an optimal balance of sparsity by shrinking irrelevant variables, thereby enhancing stability. The regularization value $\lambda_{\text{stable}}$ is intended to signify the point where the model attains high stability with minimal loss of accuracy. Consequently, we anticipate that the stability curve exhibits a non-decreasing trend prior to reaching $\lambda_{\text{stable}}$. However, since excessive regularization can lead to underfitting, we expect that the stability curve will flatten or decrease after reaching one or more peaks beyond $\lambda_{\text{stable}}$.

Regarding the second assumption, increasing regularization is expected to help the model eliminate irrelevant variables, improving its ability to predict the response variable. The regularization values $\lambda_{\text{min}}$ and $\lambda_{\text{1se}}$ are designed to capture this behavior, marking the points where regularization is most effective in terms of predictive ability. Beyond these points, we anticipate that the loss curve will flatten or increase. Since $\lambda_{\text{stable}}$ is intended to trade a small amount of accuracy for greater stability, we expect $\lambda_{\text{stable}}$ to occur after $\lambda_{\text{1se}}$, that is, $\lambda_{\text{stable}}>\lambda_{\text{1se}}>\lambda_{\text{min}}$, which implies that the loss curve is expected to be non-decreasing after $\lambda_{\text{stable}}$.

Thus, according to Corollary 1, $\lambda_{\text{stable}}$ represents a stability-accuracy solution to the problem of regularization tuning, ensuring high stability while minimizing the loss in accuracy.

5 Conclusion

In this paper, we examined the stability of the stability selection framework, which underpins the reproducibility of its results and, consequently, our confidence in them. We further applied this concept to determine the optimal regularization value in terms of stability, facilitating the calibration of the decision-making threshold and the expected number of falsely selected variables by leveraging upper bounds established in the literature. In addition, we introduced a novel selection criterion based on this optimal regularization value, called stable stability selection, which ensures the selection of variables associated with highly stable results at the least possible loss in predictive ability. Under two justifiable assumptions, we demonstrated that the proposed optimal regularization value is Pareto optimal on the Pareto front of stability and accuracy. Lastly, we discussed the convergence of stability values across sequential sub-sampling to identify the optimal number of sub-samples based on the stability of the process.

Competing interests

The authors declare that they have no conflict of interest.

Author contributions statement

Mahdi Nouraie was responsible for drafting the manuscript, the development of the research methodology and for writing the computer code used throughout. Samuel Muller provided critical feedback on the content of the manuscript, refining the clarity and scope of the manuscript and the computer code.

Data Availability

The riboflavin dataset is accessible via the hdi package in R (Dezeure et al., 2015). The rat microarray data can be obtained from the National Center for Biotechnology Information (NCBI) website at www.ncbi.nlm.nih.gov, under accession number GSE5680.

The source code used for the paper is accessible through the following GitHub repository: https://github.com/MahdiNouraie/Stable-Stability-Selection. Furthermore, the stabplot package, which facilitates the use of the two plots introduced in this paper, is available through https://github.com/MahdiNouraie/stabplot.

Acknowledgments

Mahdi Nouraie was supported by the Macquarie University Research Excellence Scholarship (20213605). Samuel Muller was supported by the Australian Research Council Discovery Project Grant (DP230101908).

We gratefully acknowledge Dr. Connor Smith for his guidance and support as co-supervisor of Mahdi Nouraie’s PhD research.

References

  • Altman and Andersen (1989) Douglas G Altman and Per Kragh Andersen. Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine, 8(7):771–783, 1989.
  • Beinrucker et al. (2016) Andre Beinrucker, Ürün Dogan, and Gilles Blanchard. Extensions of stability selection using subsamples of observations and covariates. Statistics and Computing, 26:1059–1077, 2016.
  • Bodinier et al. (2023) Barbara Bodinier, Sarah Filippi, Therese Haugdahl Nøst, Julien Chiquet, and Marc Chadeau-Hyam. Automated calibration for stability selection in penalised regression and graphical models. Journal of the Royal Statistical Society Series C: Applied Statistics, 72(5):1375–1393, 2023.
  • Bühlmann et al. (2014) Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278, 2014.
  • Chiang et al. (2006) Annie P Chiang, John S Beck, Hsan-Jan Yen, Marwan K Tayeh, Todd E Scheetz, Ruth E Swiderski, Darryl Y Nishimura, Terry A Braun, Kwang-Youn A Kim, Jian Huang, et al. Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11). Proceedings of the National Academy of Sciences, 103(16):6287–6292, 2006.
  • Cox (1972) David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 34(2):187–202, 1972.
  • Dezeure et al. (2015) Ruben Dezeure, Peter Bühlmann, Lukas Meier, and Nicolai Meinshausen. High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science, 30(4):533–558, 2015.
  • Fleiss et al. (2004) Joseph L Fleiss, Bruce Levin, and Myunghee Cho Paik. The Measurement of Interrater Agreement. John Wiley & Sons, Inc., 2004.
  • Friedman et al. (2010) Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
  • Gwet (2008) Kilem Li Gwet. Variance estimation of nominal-scale inter-rater reliability with random selection of raters. Psychometrika, 73(3):407–430, 2008.
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 1. Springer, 2009.
  • Huang et al. (2008) Jian Huang, Shuangge Ma, and Cun-Hui Zhang. Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica, 18(4):1603–1618, 2008.
  • Kalousis et al. (2007) Alexandros Kalousis, Julien Prados, and Melanie Hilario. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems, 12:95–116, 2007.
  • Kuncheva (2007) Ludmila I Kuncheva. A stability index for feature selection. In Artificial Intelligence and Applications, pages 390–395, 2007.
  • Meinshausen and Bühlmann (2010) Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(4):417–473, 2010.
  • Müller and Welsh (2010) Samuel Müller and Alan H Welsh. On model selection curves. International Statistical Review, 78(2):240–256, 2010.
  • Nogueira et al. (2018) Sarah Nogueira, Konstantinos Sechidis, and Gavin Brown. On the stability of feature selection algorithms. Journal of Machine Learning Research, 18(174):1–54, 2018.
  • Pareto (1896) Vilfredo Pareto. Cours D’Économie Politique. Rouge, Lausanne, 1896.
  • Scheetz et al. (2006) Todd E Scheetz, Kwang-Youn A Kim, Ruth E Swiderski, Alisdair R Philp, Terry A Braun, Kevin L Knudtson, Anne M Dorrance, Gerald F DiBona, Jian Huang, Thomas L Casavant, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429–14434, 2006.
  • Sen et al. (2021) Rikta Sen, Ashis Kumar Mandal, and Basabi Chakraborty. A critical study on stability measures of feature selection with a novel extension of lustgarten index. Machine Learning and Knowledge Extraction, 3(4):771–787, 2021.
  • Shah and Samworth (2013) Rajen D Shah and Richard J Samworth. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 75(1):55–80, 2013.
  • Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.