
On the Selection Stability of Stability Selection and Its Applications

Mahdi Nouraie, School of Mathematical and Physical Sciences, Macquarie University (ORCID: 0000-0002-4792-4994); Samuel Muller, School of Mathematical and Physical Sciences, Macquarie University, and School of Mathematics and Statistics, The University of Sydney (ORCID: 0000-0002-3087-8127). Address for correspondence: [email protected]
Abstract

Stability selection is a widely adopted resampling-based framework for high-dimensional structure estimation and variable selection. However, the concept of ‘stability’ is often narrowly addressed, primarily through examining selection frequencies, or ‘stability paths’. This paper seeks to broaden the use of an established stability estimator to evaluate the overall stability of the stability selection framework, moving beyond single-variable analysis. We suggest that the stability estimator offers two advantages: it can serve as a reference to reflect the robustness of the outcomes obtained, and it can help identify an optimal regularization value to improve stability. By determining this value, we aim to calibrate key stability selection parameters, namely the decision threshold and the expected number of falsely selected variables, within established theoretical bounds. Furthermore, we explore a novel selection criterion based on this regularization value. With the asymptotic distribution of the stability estimator previously established, convergence to the true stability is ensured, allowing us to observe stability trends over successive sub-samples. This approach sheds light on the required number of sub-samples, addressing a notable gap in prior studies. The stabplot package is developed to facilitate the use of the plots featured in this manuscript, supporting their integration into further statistical analysis and research workflows.

Keywords: Bioinformatics, Feature Selection, Model Selection, Structure Estimation, Variable Selection

1 Introduction

The estimation of discrete structures, including graphs, clusters, or selection of variables, represents a long-standing and fundamental problem in statistics (Meinshausen and Bühlmann, 2010). Variable selection is a critical phase in the modeling pipeline and a fundamental step in data preprocessing, where raw data is refined to ensure its suitability for analysis or modeling. The objective is to eliminate variables that are noisy, redundant, or irrelevant (Tibshirani, 1996; Müller and Welsh, 2010), addressing a key challenge in contemporary data science (Nogueira et al., 2018).

Stability selection, a widely recognized resampling-based variable selection framework, performs variable selection by examining how frequently variables are chosen across multiple randomly selected resamples (Meinshausen and Bühlmann, 2010). The selection frequency indicates how often a variable is included in the estimated models over various resamples. Meinshausen and Bühlmann (2010) analyzed the selection frequencies for individual variables over a range of regularization values, presenting the results in a plot known as the ‘stability path’. This plot facilitates the identification of variables that are consistently selected over the regularization grid.

In this paper, we shift the focus from examining the stability of individual variables across a grid of regularization values to evaluating the overall stability of the entire stability selection framework across the regularization grid. Employing an established stability estimator, we assess the stability of the framework in its entirety and introduce what we call ‘Stable Stability Selection’, which identifies the smallest regularization value that yields highly stable outcomes.

We illustrate the stable stability selection methodology through variable selection in the context of linear regression. However, the applicability of this approach extends beyond linear regression, as stability selection is not confined to this specific problem. Although detailed exploration is beyond the scope of this paper, stable stability selection is equally suitable for more complex regression-type models.

The stability of a variable selection method refers to the consistency with which it selects variables across different training sets drawn from the same underlying distribution (Kalousis et al., 2007). Evaluation of the stability of a statistical model by random resampling has been a well-established practice for decades. For example, Altman and Andersen (1989) applied this approach in Cox’s proportional hazards regression model (Cox, 1972) to identify stable variables, which were defined as those consistently selected with higher frequencies across multiple resamples.

In the seminal work of Meinshausen and Bühlmann (2010), the stability selection framework was introduced, which offers a transformative method to identify stable variables in structure estimation tasks. The stability selection approach described by Meinshausen and Bühlmann (2010) carries out variable selection on sub-samples of half the size of the original dataset. Stability selection enables the identification of variables that are consistently recognized as important across the majority of sub-samples for a given regularization value, considering them stable variables. Furthermore, Meinshausen and Bühlmann (2010) established an asymptotic upper-bound for the Per-Family Error Rate, which represents the expected number of falsely selected variables. This bound relies on the assumptions that the distribution of falsely selected variables is exchangeable and that the selection algorithm performs no worse than random guessing.

Shah and Samworth (2013) introduced the complementary pairs stability selection method, which computes selection frequencies by counting how often a variable is included in models fitted to complementary 50\% sub-samples. They also derived more rigorous upper-bounds for the expected number of variables selected despite having low selection probability.

It is somewhat surprising that, despite the framework being designated as ‘stability’ selection and the considerable attention stability selection has received since its inception, the overall stability of the framework itself has been largely overlooked, with no dedicated studies specifically addressing this critical aspect.

There is a substantial body of research on the stability of variable selection methods, where numerous stability measures have been proposed. For more comprehensive information, we refer to Kuncheva (2007), Nogueira et al. (2018), and Sen et al. (2021).

Nogueira et al. (2018) conducted an extensive literature review on the topic, consolidating the desirable properties of stability measures into five mathematical conditions: being fully defined, strict monotonicity, constant bounds, maximum stability if and only if the selection is deterministic, and correction for chance. A brief overview of these properties is provided in the following paragraph. Nogueira et al. (2018) showed that none of the previously introduced stability measures satisfied all five conditions. In response, Nogueira et al. (2018) proposed a novel stability measure, the first in the literature to meet all these criteria, generalizing several previous works such as Kuncheva (2007).

Here, we give a concise review of the five conditions proposed by Nogueira et al. (2018). The first property stipulates that the stability measure should function effectively regardless of variations in the number of selected variables, meaning that it should not be limited to scenarios where the number of selected variables remains constant across resamples. The second property asserts that the stability measure should be a strictly decreasing function of the sample variance of the binary selection status of each variable across random resamples; as the variance in the selection status of each variable increases, the stability of the framework diminishes. The third property states that the stability measure should be bounded by specific constants, which facilitates the comparison of the stability of different procedures. The fourth property implies that maximum stability is achieved when the selection results across all resamples are identical, and vice versa. Finally, the fifth property posits that if all possible selection results are equally likely to occur across random resamples, the expected value of the stability measure should remain constant. This indicates that the stability measure should not be influenced by random patterns that may arise during the resampling process.

Nogueira et al. (2018) conceptualized the stability of a variable selection procedure as an underlying population parameter that requires estimation. In addition, Nogueira et al. (2018), linking their work with Gwet (2008), demonstrated that as the number of resamples approaches infinity, their stability measure, \hat{\Phi}, converges to a Normal distribution with location parameter \mu=\Phi, where \Phi denotes the population stability. Consequently, \hat{\Phi} is a consistent estimator of the population stability \Phi; that is, as the number of resamples approaches infinity, \hat{\Phi} converges in probability to the true population stability \Phi. This property allows us to construct confidence intervals and perform hypothesis tests on the stability values.

Nogueira et al. (2018) compared the stability of stability selection with that of the Least Absolute Shrinkage and Selection Operator (LASSO; Tibshirani, 1996), concluding that the former exhibits greater stability in most of the scenarios considered.

In this paper, our aim is to apply the stability estimator proposed by Nogueira et al. (2018) to evaluate the stability of stability selection. Furthermore, we use the stability values as a guide to find the optimal regularization value that achieves high stability with the least possible loss in accuracy, if applicable. The optimal regularization value can be employed to calibrate two critical parameters of stability selection, the decision-making threshold and the expected number of falsely selected variables, balancing them with respect to one another. We also use the optimal regularization value to define a new selection criterion, called stable stability selection, which prioritizes the selection of variables associated with highly stable outcomes, rather than those mostly selected in their best-case scenarios over the regularization grid, which is the primary criterion in the traditional stability selection framework.

In addition, the convergence value of the stability estimator serves as a reference to assess the reliability of the results obtained. If the stability selection process demonstrates instability, it is unreasonable to expect that its selections and preferences will be reliable. The point at which convergence occurs provides valuable insight into the required number of sub-samples, an aspect for which we found no dedicated research addressing its determination.

Nogueira et al. (2018) suggested that stability and accuracy can be interpreted through the lens of a Pareto front (Pareto, 1896). We refer to this methodology as stability-accuracy selection. Building on this perspective, we demonstrate that, given two reasonable assumptions, our proposed regularization value constitutes a Pareto-optimal solution within this context.

Alternative stability estimators, aside from that proposed by Nogueira et al. (2018), may also be applicable; however, a detailed comparison and evaluation of different stability measures in the context of stability selection falls outside the scope of this paper.

The structure of this paper is organized as follows. Section 2 outlines the proposed methodology. Section 3 describes the real and synthetic datasets used in the paper. The results of the application of the method to real and synthetic data are presented in Section 4. Finally, Section 5 summarizes and concludes the paper.

2 Methodology

In this section, we present a methodology to evaluate the selection stability of the stability selection framework and to determine the optimal regularization value accordingly. Although we demonstrate the proposed method within the context of stability selection outlined by Meinshausen and Bühlmann (2010), it is important to note that the methodology is adaptable to other formulations of stability selection, such as those described by Shah and Samworth (2013) and Beinrucker et al. (2016). We will introduce our methodology following a brief overview of the foundational approach established by Meinshausen and Bühlmann (2010).

Stability selection

We consider a dataset \mathcal{D}=\{(\boldsymbol{x}^{\top}_{i},y_{i})\}_{i=1}^{n}, where each element consists of a univariate response y_{i}\in\mathbb{R} and a p-dimensional vector of covariates \boldsymbol{x}_{i}\in\mathbb{R}^{p}. In the linear regression context, the pairs (\boldsymbol{x}^{\top}_{i},y_{i}) are typically assumed to be independent and identically distributed (i.i.d.) when the covariates are treated as random; for simplicity, however, we assume that the covariate vectors are fixed.

The linear regression model is formally expressed as \boldsymbol{Y}=\beta_{0}+X\boldsymbol{\beta}+\boldsymbol{\varepsilon}, where X\in\operatorname{mat}(n\times p) represents the design matrix, \boldsymbol{Y}\in\mathbb{R}^{n} is the n-dimensional vector corresponding to the univariate response variable, and \beta_{0}\in\mathbb{R} denotes the intercept term. The vector of regression coefficients for the p non-constant covariates is denoted by \boldsymbol{\beta}\in\mathbb{R}^{p}, while \boldsymbol{\varepsilon}\in\mathbb{R}^{n} represents the n-dimensional vector of random errors. The components of \boldsymbol{\varepsilon} are assumed to be i.i.d. and independent of the covariates.

Variable selection, in accordance with the terminology of Meinshausen and Bühlmann (2010), typically involves categorizing covariates into two distinct groups: the signal group S=\{k\neq 0\mid\beta_{k}\neq 0\} and the noise group N=\{k\neq 0\mid\beta_{k}=0\}, where S\cap N=\emptyset. The primary objective of variable selection is to accurately identify the signal group S.

The stability selection framework requires the implementation of an appropriate selection method. One commonly used technique for variable selection in the linear regression context is the LASSO estimator, formally defined as

\hat{\beta}_{0}(\lambda),\boldsymbol{\hat{\beta}}(\lambda)=\operatorname*{arg\,min}_{\beta_{0}\in\mathbb{R},\,\boldsymbol{\beta}\in\mathbb{R}^{p}}\left(\|\boldsymbol{Y}-\beta_{0}-X\boldsymbol{\beta}\|_{2}^{2}+\lambda\sum_{k=1}^{p}|\beta_{k}|\right),

where λ+\lambda\in\mathbb{R}^{+} represents the LASSO regularization parameter. The set of non-zero coefficients can be identified as S^(λ)={k0β^k(λ)0}\hat{S}(\lambda)=\{k\neq 0\mid\hat{\beta}_{k}(\lambda)\neq 0\}, derived from solving the LASSO equation through convex optimization.

Although alternative methods exist for variable selection in linear regression models, a detailed exploration of techniques for determining the regularization method and penalty term in stability selection lies beyond the scope of this paper.

Meinshausen and Bühlmann (2010) defined the stable set as

\hat{S}^{\text{stable}}=\{j\mid\max_{\lambda\in\Lambda}(\hat{\Pi}_{j}^{\lambda})\geq\pi_{\text{thr}}\};\quad j=1,\ldots,p, (1)

where \Lambda represents the set of regularization values, 0<\pi_{\text{thr}}<1 is the threshold for decision-making in variable selection, and \hat{\Pi}_{j}^{\lambda} denotes the selection frequency of the jth variable given the regularization parameter \lambda.

Equation (1) identifies variables whose selection frequencies exceed \pi_{\text{thr}} under the regularization value that maximizes their selection frequency; it therefore judges each variable by its best-case scenario.
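As a minimal sketch, the best-case rule of Equation (1) reduces to a column-wise maximum over the regularization grid; the frequency matrix and threshold below are illustrative numbers, not taken from the paper.

```python
import numpy as np

# Hypothetical selection-frequency matrix: rows index the regularization
# grid Lambda, columns index variables; entry [l, j] is the selection
# frequency of variable j under the l-th lambda (illustrative only).
Pi = np.array([
    [0.95, 0.60, 0.10],
    [0.90, 0.75, 0.05],
    [0.70, 0.40, 0.00],
])
pi_thr = 0.7

# Equation (1): a variable enters the stable set if its *maximum*
# selection frequency over the grid reaches the threshold.
stable_set = np.where(Pi.max(axis=0) >= pi_thr)[0]
print(stable_set)  # [0 1]
```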

Stable stability selection

We now move forward to stable stability selection. To implement this, we first define a grid of regularization values, \Lambda, from which the optimal \lambda for LASSO is to be identified. For this purpose, we use as the default choice the \lambda values generated by the cv.glmnet function from the glmnet package in R (Friedman et al., 2010), applying 10-fold cross-validation to the complete dataset \mathcal{D}. Alternative regularization grids can also be employed; however, a comparison of different methods for generating the regularization grid falls outside the scope of this paper. The goal is to identify the optimal regularization value \lambda_{\text{stable}}\in\Lambda that yields highly stable outcomes with the least possible loss in predictive ability.

The following procedure is applied to each regularization value \lambda\in\Lambda. In each iteration of the stability selection process, a sub-sample comprising half of the original dataset \mathcal{D} is drawn at random. The LASSO model is then fitted to this sub-sample using the fixed value of \lambda, and the binary selection outcomes are recorded as a row of the binary selection matrix M(\lambda)\in\operatorname{mat}(B\times p), where B denotes the number of sub-samples. In the end, we therefore obtain |\Lambda| distinct selection matrices M(\lambda).
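The construction of M(\lambda) can be sketched as follows. This is our illustration, not the authors' implementation: the selector is a pluggable function, and the marginal-correlation rule used in the demo is only a stand-in for a LASSO fit at a fixed \lambda (which in the paper is done via glmnet).

```python
import numpy as np

rng = np.random.default_rng(0)

def selection_matrix(X, y, lam, B, select, rng):
    """Build the B x p binary selection matrix M(lambda): row b records
    which variables `select` keeps on the b-th 50% sub-sample."""
    n, p = X.shape
    M = np.zeros((B, p), dtype=int)
    for b in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)  # half-size sub-sample
        M[b] = select(X[idx], y[idx], lam)
    return M

# Toy stand-in selector: keep variables whose absolute marginal
# correlation with y exceeds a lambda-dependent cut-off.
def corr_select(X, y, lam):
    r = np.abs(np.corrcoef(X.T, y)[-1, :-1])
    return (r > lam).astype(int)

X = rng.normal(size=(50, 10))
y = 2.0 * X[:, 0] + rng.normal(size=50)
M = selection_matrix(X, y, lam=0.5, B=100, select=corr_select, rng=rng)
print(M.shape)  # (100, 10)
```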

In this context, M(\lambda)_{bj}=1 indicates that the jth variable is identified as part of the signal set \hat{S}(\lambda) when LASSO is applied to the bth sub-sample given \lambda. In contrast, M(\lambda)_{bj}=0 signifies that the jth variable is classified as belonging to the noise set \hat{N}(\lambda) after applying LASSO to the bth sub-sample given \lambda.

The stability estimator proposed by Nogueira et al. (2018) is defined as

\hat{\Phi}(M(\lambda))=1-\frac{\frac{1}{p}\sum_{j=1}^{p}s_{j}^{2}}{\frac{q(\lambda)}{p}\left(1-\frac{q(\lambda)}{p}\right)};\quad\lambda\in\Lambda, (2)

where s_{j}^{2} denotes the unbiased sample variance of the binary selection statuses of the jth variable, while q(\lambda) denotes the average number of variables selected under the regularization parameter \lambda. The stability estimator \hat{\Phi} is bounded by \left[-\frac{1}{B-1},1\right] (Nogueira et al., 2018); the larger \hat{\Phi}, the more stable M(\lambda).
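The estimator in Equation (2) is direct to compute from a selection matrix; a minimal numpy transcription (ours, not the authors' reference code):

```python
import numpy as np

def stability(M):
    """Nogueira et al. (2018) stability estimator, Equation (2),
    for a B x p binary selection matrix M."""
    B, p = M.shape
    s2 = M.var(axis=0, ddof=1)   # unbiased sample variance per variable
    q = M.sum(axis=1).mean()     # q(lambda): average number selected
    return 1.0 - s2.mean() / ((q / p) * (1.0 - q / p))

# A perfectly repeatable selection attains the maximum stability of 1.
M_det = np.tile([1, 1, 0, 0, 0], (20, 1))
print(stability(M_det))  # 1.0
```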

To interpret the stability values of Equation (2), Nogueira et al. (2018) adopted the guidelines proposed by Fleiss et al. (2004). According to Nogueira et al. (2018), stability values that exceed 0.75 indicate excellent agreement between selections beyond what would be expected by random chance, while values below 0.4 signify poor agreement between them. Based on Nogueira et al. (2018), stability values between these thresholds are categorized as indicative of intermediate to good stability.

By obtaining all |\Lambda| selection matrices, we can estimate the stability of each using the formulation given in Equation (2). We propose that the optimal regularization value, denoted \lambda_{\text{stable}}, is the smallest regularization value at which the stability measure surpasses 0.75, that is,

\lambda_{\text{stable}}=\min\left\{\lambda\in\Lambda\mid\hat{\Phi}(M(\lambda))\geq 0.75\right\}. (3)

The threshold value of 0.75 is somewhat arbitrary and may not be attainable in certain situations. We address this limitation in the following paragraph.
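Given the estimated stabilities over the grid, Equation (3) is a one-liner; the grid and stability values below are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical regularization grid and estimated stabilities over it.
lambdas = np.array([0.05, 0.10, 0.20, 0.40, 0.80])
phi = np.array([0.35, 0.55, 0.72, 0.78, 0.82])

def lambda_stable(lambdas, phi, thr=0.75):
    """Equation (3): smallest lambda whose stability reaches the
    threshold; None when no lambda qualifies."""
    ok = phi >= thr
    return lambdas[ok].min() if ok.any() else None

print(lambda_stable(lambdas, phi))  # 0.4
```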

As we will see in Section 4, \lambda_{\text{stable}} may not exist in certain practical applications due to procedural instability, which prevents the stability values from exceeding 0.75. In such cases, we propose

\lambda_{\text{stable-1sd}}=\min\left\{\lambda\in\Lambda\mid\hat{\Phi}(M(\lambda))\geq\max_{\lambda^{\prime}\in\Lambda}\hat{\Phi}(M(\lambda^{\prime}))-\text{SD}_{\lambda^{\prime}\in\Lambda}\left(\hat{\Phi}(M(\lambda^{\prime}))\right)\right\}, (4)

where SD represents the standard deviation of the stability values over the grid.
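The fallback rule of Equation (4) can be sketched as follows. Note one assumption on our part: the paper does not specify whether SD is the sample or population standard deviation of the stability values, so the sketch uses the sample version (ddof=1); the numbers are again illustrative.

```python
import numpy as np

def lambda_stable_1sd(lambdas, phi):
    """Equation (4): smallest lambda whose stability lies within one
    standard deviation of the best stability on the grid.
    (Sample SD, ddof=1, is an assumption of this sketch.)"""
    cut = phi.max() - phi.std(ddof=1)
    return lambdas[phi >= cut].min()

# A grid where stability never reaches 0.75, so lambda_stable fails.
lambdas = np.array([0.05, 0.10, 0.20, 0.40])
phi = np.array([0.30, 0.45, 0.60, 0.62])
print(lambda_stable_1sd(lambdas, phi))  # 0.2
```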

Both upper-bounds of the Per-Family Error Rate mentioned in Section 1, provided by Meinshausen and Bühlmann (2010) and Shah and Samworth (2013), depend on the values of \lambda and \pi_{\text{thr}}. Calibrating these two parameters requires arbitrarily fixing one of them, which can be challenging to justify. By adopting the optimal regularization value, that is, \lambda=\lambda_{\text{stable}} or \lambda=\lambda_{\text{stable-1sd}}, the upper-bound can be explicitly tailored to the data owner's preferences: if a pre-determined \pi_{\text{thr}} is chosen, the Per-Family Error Rate bound is deterministically obtained; alternatively, if a fixed pre-determined Per-Family Error Rate is desired, the corresponding \pi_{\text{thr}} is enforced.

The asymptotic upper-bound provided by Meinshausen and Bühlmann (2010) for the Per-Family Error Rate is given by

\text{PFER}(\Lambda,\pi_{\text{thr}})=\frac{1}{2\pi_{\text{thr}}-1}\frac{q_{\Lambda}^{2}}{p}, (5)

where q_{\Lambda} represents the average number of selected variables over the regularization grid \Lambda. In a manner consistent with the approach of Bodinier et al. (2023), we can derive a point-wise control version of Equation (5) by restricting \Lambda to a single value, \lambda, without affecting the validity of the original equation. Consequently, we can replace q_{\Lambda} with q(\lambda), as introduced in Equation (2). By employing \lambda_{\text{stable}} within this formula and utilizing q(\lambda_{\text{stable}}), we establish a two-way control: fixing \pi_{\text{thr}} determines the upper-bound, and vice versa. A similar approach can be applied to the upper-bounds proposed by Shah and Samworth (2013).
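This two-way control is elementary to compute. A sketch of the point-wise bound and its inversion, with illustrative numbers for q(\lambda_{\text{stable}}), p, and \pi_{\text{thr}} (the bound requires \pi_{\text{thr}} > 0.5):

```python
def pfer_bound(q_lam, p, pi_thr):
    """Point-wise version of Equation (5):
    PFER <= q(lambda)^2 / ((2*pi_thr - 1) * p)."""
    assert pi_thr > 0.5, "bound requires pi_thr in (0.5, 1)"
    return q_lam ** 2 / ((2 * pi_thr - 1) * p)

def pi_thr_for(q_lam, p, pfer):
    """Invert the bound: the threshold needed for a target PFER."""
    return 0.5 * (q_lam ** 2 / (p * pfer) + 1)

# Say q(lambda_stable) = 10 variables selected on average among p = 500.
print(pfer_bound(10, 500, 0.75))   # 0.4
print(pi_thr_for(10, 500, 0.4))    # 0.75
```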

Having \lambda_{\text{stable}} or \lambda_{\text{stable-1sd}} allows a fresh perspective on variable selection through stability selection. Meinshausen and Bühlmann (2010) introduced Equation (1) to identify stable variables, that is, those with selection frequencies exceeding \pi_{\text{thr}} in the best-case scenario. Instead, we propose

\hat{S}^{\text{stable}}=\left\{j\mid\hat{\Pi}_{j}^{\lambda_{\text{stable}}}\geq\pi_{\text{thr}}\right\};\quad j=1,\ldots,p. (6)

Equation (6) collects the variables with selection frequencies above \pi_{\text{thr}} under \lambda_{\text{stable}}, which corresponds to highly stable outcomes. We term this selection criterion stable stability selection because it leverages highly stable outcomes to facilitate variable selection. If \lambda_{\text{stable}} does not exist, \hat{S}^{\text{stable-1sd}} can be defined analogously.
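Unlike Equation (1), this rule thresholds the frequencies at the single stable regularization value rather than at each variable's best-case value. A sketch with illustrative frequencies:

```python
import numpy as np

# Hypothetical selection frequencies at lambda_stable for p = 5 variables.
freq_at_stable = np.array([0.98, 0.91, 0.40, 0.12, 0.66])
pi_thr = 0.6

# Equation (6): threshold the frequencies at the one stable lambda.
S_stable = np.where(freq_at_stable >= pi_thr)[0]
print(S_stable)  # [0 1 4]
```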

As outlined in Section 1, as the number of sub-samples increases, the convergence of \hat{\Phi} to the true stability \Phi is ensured. To illustrate this, we define M^{(t)}(\lambda) as the matrix containing the selection outcomes from the first t sub-samples for a given \lambda\in\Lambda, that is, the first t rows of M(\lambda). The objective is to evaluate the stability of M^{(t)}(\lambda) across successive sub-samples in order to determine an appropriate cut-off point for the process, that is, the number of sub-samples required to achieve convergence in stability. Beyond this point, additional sub-samples do not significantly change the stability of M(\lambda). We propose plotting the stability values of the selection matrix over the sequential sub-sampling, that is, \hat{\Phi}(M^{(t)}(\lambda)) for t=2,3,\ldots,B, against t to monitor the convergence of the stability values. Given that the asymptotic distribution of \hat{\Phi} is established, a confidence bound can be drawn along the curve to reflect the inherent uncertainty of the stability estimator.
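The trajectory \hat{\Phi}(M^{(t)}(\lambda)) is simply Equation (2) applied to growing row-prefixes of M(\lambda). A self-contained sketch on simulated selection outcomes (the selection probabilities are illustrative, not fitted to any dataset):

```python
import numpy as np

def stability(M):
    """Equation (2) applied to a B x p binary selection matrix."""
    s2 = M.var(axis=0, ddof=1)
    q = M.sum(axis=1).mean()
    p = M.shape[1]
    return 1.0 - s2.mean() / ((q / p) * (1.0 - q / p))

# Simulated outcomes: two strong variables kept ~95% of the time,
# eight noise variables ~5% (a fairly stable procedure).
rng = np.random.default_rng(1)
probs = np.array([0.95, 0.95] + [0.05] * 8)
M = (rng.random((500, 10)) < probs).astype(int)

# Stability of the first t sub-samples, t = 2, ..., B; plotting this
# trajectory against t shows where the curve flattens, suggesting how
# many sub-samples suffice.
trajectory = [stability(M[:t]) for t in range(2, M.shape[0] + 1)]
print(len(trajectory), round(trajectory[-1], 3))
```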

In Section 4, we present the results obtained from the application of the proposed approach to synthetic and real datasets, which are introduced in Section 3.

3 Datasets

We evaluate the proposed methodology on a synthetic dataset and two real bioinformatics datasets, as detailed below.

Synthetic Data

As synthetic data, we consider a dataset with sample size n=50 and p=500 predictor variables. In this scenario, the predictor vectors \boldsymbol{x}_{i} are independently drawn from the Normal distribution \mathcal{N}(\boldsymbol{0},\Sigma), where \Sigma\in\operatorname{mat}(p\times p) has a diagonal of ones and \sigma_{jk}=\rho^{|j-k|} for j\neq k, with \rho\in\{0.2,0.5,0.8\}. The response variable depends linearly on the first two predictor variables only, with coefficient vector \boldsymbol{\beta}=(1.5,1.1,0,\ldots,0)^{\top}, and the error term \boldsymbol{\varepsilon} is an i.i.d. sample from the standard Normal distribution \mathcal{N}(0,1).
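This design can be reproduced as follows; the random seed is arbitrary and one of the three \rho values is shown.

```python
import numpy as np

rng = np.random.default_rng(2025)
n, p, rho = 50, 500, 0.5          # one of the three correlation scenarios

# Sigma has unit diagonal and sigma_jk = rho**|j - k| off the diagonal.
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.zeros(p)
beta[:2] = [1.5, 1.1]             # only the first two variables are signal
y = X @ beta + rng.standard_normal(n)
print(X.shape, y.shape)  # (50, 500) (50,)
```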

Riboflavin Data

For the first real example, we use the well-established ‘riboflavin’ dataset, which focuses on the production of riboflavin (vitamin B2) from various strains of Bacillus subtilis. This dataset, provided by the Dutch State Mines Nutritional Products, is accessible via the hdi R package (Dezeure et al., 2015). It comprises a single continuous response variable, the logarithm of the riboflavin production rate, alongside p=4,088 covariates corresponding to the logarithm of the expression levels of 4,088 bacterial genes. The primary objective of analyzing this dataset is to identify genes that influence riboflavin production, with the ultimate goal of genetically engineering bacteria to enhance riboflavin yield. Data were collected from n=71 relatively homogeneous samples, which were repeatedly hybridized during a feed-batch fermentation process involving different engineered strains and varying fermentation conditions. Bühlmann et al. (2014) employed stability selection with LASSO and identified three genes, LYSC_at, YOAB_at, and YXLD_at, as having a significant impact on the response variable.

Affymetrix Rat Genome 230 2.0 Array

As an additional real-world example, we investigate the ‘Affymetrix Rat Genome 230 2.0 Array’ microarray data introduced by Scheetz et al. (2006). This dataset comprises n=120 twelve-week-old male rats, with expression levels recorded for nearly 32,000 gene probes per rat. The primary objective of this analysis is to identify the probes most strongly associated with the expression level of the TRIM32 probe (1389163_at), which has been linked to the development of Bardet-Biedl syndrome (Chiang et al., 2006). This genetically heterogeneous disorder affects multiple organ systems, including the retina.

In accordance with the preprocessing steps outlined by Huang et al. (2008), we excluded gene probes whose maximum expression level was below the 25th percentile and those exhibiting an expression range smaller than 2. This filtering yielded a refined set of p=3,083 gene probes with sufficient expression and variability for further analysis.
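A screening step of this kind can be sketched as below. Note the caveat: the exact conventions of Huang et al. (2008) may differ (for instance, whether the 25th percentile is taken over probe maxima or over all expression values); this sketch takes it over the probe maxima, and the expression matrix is simulated.

```python
import numpy as np

def filter_probes(expr):
    """Keep probes whose maximum expression reaches the 25th percentile
    of the probe maxima AND whose range (max - min) is at least 2.
    An illustrative reading of the screening rule, not a reproduction."""
    mx, mn = expr.max(axis=0), expr.min(axis=0)
    keep = (mx >= np.percentile(mx, 25)) & (mx - mn >= 2)
    return expr[:, keep], keep

rng = np.random.default_rng(7)
expr = rng.normal(5, 2, size=(120, 1000))   # fake n x p expression matrix
filtered, keep = filter_probes(expr)
print(expr.shape, "->", filtered.shape)
```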

4 Results

In this section, we illustrate the efficacy of the proposed methodology through its application to the synthetic and real datasets introduced in Section 3.

Synthetic Data

Here, we present the results obtained from applying stable stability selection to the synthetic data. We used the GitHub repository associated with Nogueira et al. (2018) to implement their stability estimator and to obtain confidence intervals (https://github.com/nogueirs/JMLR2018).

Figure 1 illustrates the selection stability of stability selection on a grid of regularization values, applied to the synthetic datasets introduced in Section 3, using three different values of \rho and B=500 sub-samples. The horizontal dashed red lines indicate the stability thresholds of 0.4 and 0.75 discussed in Section 2.

Figure 1: Selection stability of stability selection over the regularization grid on the synthetic data; panels: (a) \rho=0.2, (b) \rho=0.5, (c) \rho=0.8. [Figure not reproduced in this conversion.]

As illustrated in Figure 1, the regularization value that minimizes the cross-validation error, \lambda_{\text{min}}, falls within the poorly stable region. In addition, the regularization value that keeps the cross-validation error within one standard error of that at \lambda_{\text{min}}, referred to as \lambda_{\text{1se}} (Hastie et al., 2009), lies within the poor to intermediate stability regions. In contrast, \lambda_{\text{stable}}, introduced in Equation (3), is specifically designed to fall within the region of excellent stability, where applicable.

Notably, in all three correlation scenarios, the three regularization selection methods successfully identify the relevant variables with high selection frequencies, with a minimum selection frequency of 0.994. However, since \lambda_{\text{stable}} is larger than the other two regularization values, it applies stronger shrinkage to the coefficients. To further examine this in terms of model accuracy, for each correlation scenario we generate n^{\prime}=25 additional test samples from the corresponding distribution described in Section 3. At each iteration of stability selection, we predict the response for the test samples using the estimated model and, ultimately, aggregate the mean squared error (MSE) values obtained under the corresponding \lambda by averaging.

Figure 2: Selection stability and MSE of stability selection over the regularization grid on the synthetic data; panels: (a) \rho=0.2, (b) \rho=0.5, (c) \rho=0.8. [Figure not reproduced in this conversion.]

As demonstrated in Figure 2, across all three correlation scenarios, the improvement in stability when transitioning from λ1se\lambda_{\text{1se}} to λstable\lambda_{\text{stable}} far outweighs the reduction in accuracy. This outcome aligns closely with the findings of Nogueira et al. (2018), who noted that “All these observations show that stability can potentially be increased without loss of predictive power” and “We also observe that pursuing stability may help identifying the relevant set of features”.

Figure 3: Selection stability of stability selection over sequential sub-sampling on the synthetic data; panels: (a) \rho=0.2, (b) \rho=0.5, (c) \rho=0.8. [Figure not reproduced in this conversion.]

To demonstrate the convergence of the stability estimator \hat{\Phi}, for each correlation scenario we used \lambda_{\text{stable}} and calculated the stability of the selection matrix M(\lambda_{\text{stable}}) through sequential sub-sampling, using Equation (2). The stability values across iterations are shown in Figure 3. The blue shading around the line represents the 95\% confidence interval for \hat{\Phi}. As can be observed, the interval narrows with increasing iterations, indicating a reduction in the estimator's uncertainty. Figure 3 reveals that, across all three scenarios, the stability estimator converges after approximately B^{*}\approx 200 iterations, with no significant change in its value thereafter. By monitoring the stability trajectory throughout the process, valuable insight can be gained into the optimal cut-off point, that is, the number of sub-samples that suffices for a stable process.

Riboflavin Data

Next, we apply stable stability selection to the riboflavin dataset. As described in Section 3, the dataset consists of $p=4,088$ genes, the main objective being to identify the key genes that influence riboflavin production. As before, we choose the number of sub-samples $B=500$. The predictor variables are standardized using the scale function prior to being input into LASSO.

In this problem, $\lambda_{\text{stable}}$ does not exist, so we use $\lambda_{\text{stable-1sd}}$ instead. After $B$ iterations, four genes have selection frequencies greater than 0.5: YXLD_at ($\hat{\Pi}_{\texttt{YXLD\_at}}^{\lambda_{\text{stable-1sd}}}=0.606$), YOAB_at ($\hat{\Pi}_{\texttt{YOAB\_at}}^{\lambda_{\text{stable-1sd}}}=0.558$), LYSC_at ($\hat{\Pi}_{\texttt{LYSC\_at}}^{\lambda_{\text{stable-1sd}}}=0.540$), and YCKE_at ($\hat{\Pi}_{\texttt{YCKE\_at}}^{\lambda_{\text{stable-1sd}}}=0.532$).

Figure 4 illustrates the stability of stability selection when applied to the riboflavin dataset. As shown in Figure 4(a), $\lambda_{\text{stable}}$ does not exist in this example, as the stability values do not surpass the 0.75 threshold. Figure 4(b), generated with $\lambda_{\text{stable-1sd}}$, indicates convergence to slightly above 0.2 after about 200 iterations. These results imply that stability selection with LASSO exhibits poor selection stability on this dataset. Consequently, the findings presented by Bühlmann et al. (2014) may require careful reconsideration, particularly since the selection frequencies of the three identified genes are relatively low.

Figure 4: Selection stability of stability selection on riboflavin data

Affymetrix Rat Genome 230 2.0 Array

Finally, we apply the proposed methodology to the rat microarray data. As mentioned in Section 3, the data consist of $p=3,083$ gene probes. The main aim with these data is to identify probes that influence the TRIM32 probe. Again, we choose the number of sub-samples $B=500$. The predictor variables are standardized using the scale function prior to being input into LASSO.

Due to the instability of the outcomes, $\lambda_{\text{stable}}$ does not exist in this problem, so we again use $\lambda_{\text{stable-1sd}}$. After $B$ iterations, three probes have selection frequencies greater than 0.5: probe 1390539_at ($\hat{\Pi}_{\texttt{1390539\_at}}^{\lambda_{\text{stable-1sd}}}=0.640$), probe 1389457_at ($\hat{\Pi}_{\texttt{1389457\_at}}^{\lambda_{\text{stable-1sd}}}=0.570$), and probe 1376747_at ($\hat{\Pi}_{\texttt{1376747\_at}}^{\lambda_{\text{stable-1sd}}}=0.564$).

As shown in Figure 5, similar to the results for the riboflavin dataset, the procedure cannot be considered stable. Specifically, Figure 5(a) reveals that $\lambda_{\text{stable}}$ does not exist, as the stability values do not exceed the threshold of 0.75. Figure 5(b), generated with $\lambda_{\text{stable-1sd}}$, shows that the stability values converge to approximately 0.15.

The results illustrated in Figures 4 and 5 indicate that, while stability selection is a valuable approach, more can be revealed by focusing on the stability of stability selection itself. As the complexity of the data increases (in terms of interdependencies between variables, the number of variables subject to selection, and so on), the stability values decline markedly. It is therefore essential to evaluate the stability of the entire selection process before drawing any conclusions about the importance of variables from the outcomes of stability selection; neglecting this evaluation may result in overconfidence in results that lack stability. Given that such analyses are often underrepresented in the literature on stability selection, domain experts may benefit from carefully revisiting previously published findings.

Figure 5: Selection stability of stability selection on rat microarray data

The convergence plots for both synthetic and real examples indicate that, regardless of whether the convergence value is high or low, stability values tend to stabilize after $B^{*}\approx 200$ iterations. This observation could provide a useful rule of thumb for determining the required number of sub-samples in the stability selection framework.
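One simple way to operationalize this rule of thumb is to recompute the stability estimate as sub-samples accumulate and stop once the running value settles. The Python sketch below assumes a binary selection matrix and re-implements the Nogueira et al. (2018) estimator; the `window` and `tol` values are illustrative choices, not quantities from the paper:

```python
import numpy as np

def stability(M):
    # Nogueira et al. (2018) estimator on a (b, p) binary selection matrix
    b, p = M.shape
    freq = M.mean(axis=0)
    s2 = b / (b - 1) * freq * (1 - freq)
    kbar = M.sum(axis=1).mean()
    return 1.0 - s2.mean() / ((kbar / p) * (1.0 - kbar / p))

def stopping_iteration(M, window=25, tol=1e-3):
    """First sub-sample count after which the running stability estimate
    moves by less than `tol` over the previous `window` iterations.
    Returns None if the trajectory never settles."""
    path = [stability(M[:b]) for b in range(2, M.shape[0] + 1)]
    for i in range(window, len(path)):
        if max(path[i - window:i + 1]) - min(path[i - window:i + 1]) < tol:
            return i + 2  # path[i] is computed from the first i + 2 sub-samples
    return None
```

On a perfectly stable trajectory this criterion fires as soon as the window is filled; on the data analyzed here one would expect it to fire in the vicinity of $B^{*}\approx 200$.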

Stability and Accuracy from a Pareto Front Perspective

Nogueira et al. (2018) suggested that stability and accuracy can be analyzed using the concept of the Pareto front (Pareto, 1896), which identifies regularization values that are not dominated by any other in terms of both criteria. A regularization value is Pareto optimal if no other value on the regularization grid offers both higher stability and higher accuracy. We call this approach stability-accuracy selection, as it seeks to balance the two metrics. In this paper, we introduced stable stability selection, which focuses on identifying the regularization value that achieves high stability of the procedure with the least possible loss in accuracy.

As an experiment, in our synthetic data analysis we use the negative mean squared error ($-$MSE) as the measure of accuracy. Given that a Pareto optimal solution is not necessarily unique, we select the Pareto solution that maximizes the sum of accuracy and stability. In Figure 6(a), $\lambda_{\text{Pareto}}$ is located close to $\lambda_{\text{stable}}$, while in Figures 6(b) and 6(c), they coincide exactly. Interestingly, $\lambda_{\text{stable}}$ is also a Pareto solution in Figure 6(a). It is now pertinent to discuss the relationship between $\lambda_{\text{stable}}$ and Pareto optimality.
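The Pareto step described above can be sketched in a few lines of Python. This is an illustrative implementation under the usual dominance definition (a grid point is dominated if another point is at least as good on both criteria and strictly better on one); the toy stability and accuracy values are made up, not results from the paper:

```python
import numpy as np

def pareto_indices(stab, acc):
    """Indices of grid points not dominated in (stability, accuracy)."""
    stab, acc = np.asarray(stab), np.asarray(acc)
    keep = []
    for i in range(len(stab)):
        dominated = np.any((stab >= stab[i]) & (acc >= acc[i]) &
                           ((stab > stab[i]) | (acc > acc[i])))
        if not dominated:
            keep.append(i)
    return keep

# Toy grid: accuracy (-MSE) falls as regularization grows, stability rises then dips
stab = [0.40, 0.55, 0.80, 0.78]
acc = [-1.0, -1.1, -1.3, -1.35]
front = pareto_indices(stab, acc)                      # last point is dominated
best = max(front, key=lambda i: stab[i] + acc[i])      # tie-break on stability + accuracy
```

Here `best` picks, among the non-dominated grid points, the one maximizing the sum of stability and accuracy, mirroring the tie-breaking rule we use for $\lambda_{\text{Pareto}}$.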

Corollary 1.

Let $\lambda_{\text{stable}}$ be defined as in Equation (3) and assume that it exists. If the stability curve is non-decreasing up to $\lambda_{\text{stable}}$, and the loss function is non-decreasing after $\lambda_{\text{stable}}$, then $\lambda_{\text{stable}}$ is a Pareto optimal solution.

Proof.

Since the stability curve is non-decreasing prior to $\lambda_{\text{stable}}$, any regularization value $\lambda\in\Lambda$ with $\lambda<\lambda_{\text{stable}}$ is at most as stable as $\lambda_{\text{stable}}$. Therefore, these values do not dominate $\lambda_{\text{stable}}$ in terms of stability.

Similarly, since the loss function is non-decreasing after $\lambda_{\text{stable}}$, any regularization value $\lambda\in\Lambda$ with $\lambda>\lambda_{\text{stable}}$ is at most as accurate as $\lambda_{\text{stable}}$. Hence, these values do not dominate $\lambda_{\text{stable}}$ in terms of accuracy.

Since $\lambda_{\text{stable}}$ is not dominated by any smaller value in terms of stability nor by any larger value in terms of accuracy, we conclude that $\lambda_{\text{stable}}$ constitutes a Pareto optimal solution. ∎
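Corollary 1 can also be sanity-checked numerically. The curves below are synthetic and chosen to be strictly monotone, which is stronger than the corollary requires, so that both assumptions hold; they are not estimates from the paper’s experiments, and the 0.75 stability threshold is the one used throughout:

```python
import numpy as np

lam = np.linspace(0.01, 1.0, 50)      # regularization grid
stab = 0.2 + 0.7 * lam                # stability curve: increasing in lambda
loss = 0.4 + 0.5 * lam ** 2           # loss curve: increasing in lambda
acc = -loss                           # accuracy as negative loss
i = int(np.argmax(stab >= 0.75))      # first grid point where stability reaches 0.75

# Check that no other grid point dominates lam[i] in (stability, accuracy)
dominated = np.any((stab >= stab[i]) & (acc >= acc[i]) &
                   ((stab > stab[i]) | (acc > acc[i])))
print(dominated)  # → False
```

Smaller grid values are strictly less stable and larger ones strictly less accurate, so the point attaining the threshold is non-dominated, as the corollary asserts.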

Figure 6: $\lambda_{\text{min}}$, $\lambda_{\text{1se}}$, $\lambda_{\text{stable}}$, and $\lambda_{\text{Pareto}}$ over the regularization grid; panels: (a) $\rho=0.2$, (b) $\rho=0.5$, (c) $\rho=0.8$; the latter two values coincide when $\rho$ is 0.5 or 0.8

The two assumptions in Corollary 1 are both natural and justifiable. Increasing the regularization value is expected to allow the model to achieve an optimal balance of sparsity by shrinking irrelevant variables, thereby enhancing stability. The regularization value $\lambda_{\text{stable}}$ is intended to signify the point where the model attains high stability with minimal loss of accuracy. Consequently, we anticipate that the stability curve exhibits a non-decreasing trend prior to reaching $\lambda_{\text{stable}}$. However, since excessive regularization can lead to underfitting, we expect that the stability curve will flatten or decrease after reaching one or more peaks beyond $\lambda_{\text{stable}}$.

Regarding the second assumption, increasing regularization is expected to help the model eliminate irrelevant variables, improving its ability to predict the response variable. The regularization values $\lambda_{\text{min}}$ and $\lambda_{\text{1se}}$ are designed to capture this behavior, marking the points where regularization is most effective in terms of predictive ability. Beyond these points, we anticipate that the loss curve will flatten or increase. Since $\lambda_{\text{stable}}$ is intended to trade a small amount of accuracy for greater stability, we expect $\lambda_{\text{stable}}$ to occur after $\lambda_{\text{1se}}$, that is, $\lambda_{\text{stable}}>\lambda_{\text{1se}}>\lambda_{\text{min}}$, which implies that the loss curve is expected to be non-decreasing after $\lambda_{\text{stable}}$.

Thus, according to Corollary 1, $\lambda_{\text{stable}}$ represents a stability-accuracy solution to the problem of regularization tuning, ensuring high stability while minimizing the loss in accuracy.

5 Conclusion

In this paper, we examined the stability of the stability selection framework, which underpins the reproducibility of its results and, consequently, our confidence in them. We further applied this concept to determine the optimal regularization value in terms of stability, facilitating the calibration of the decision-making threshold and the expected number of falsely selected variables by leveraging upper bounds established in the literature. In addition, we introduced a novel selection criterion based on this optimal regularization value, called stable stability selection, which ensures the selection of variables associated with highly stable results at the least possible loss in predictive ability. Under two justifiable assumptions, we demonstrated that the proposed optimal regularization value is Pareto optimal on the Pareto front of stability and accuracy. Lastly, we discussed the convergence of stability values across sequential sub-sampling to identify the optimal number of sub-samples based on the stability of the process.

Competing interests

The authors declare that they have no conflict of interest.

Author contributions statement

Mahdi Nouraie was responsible for drafting the manuscript, the development of the research methodology and for writing the computer code used throughout. Samuel Muller provided critical feedback on the content of the manuscript, refining the clarity and scope of the manuscript and the computer code.

Data Availability

The riboflavin dataset is accessible via the hdi package in R (Dezeure et al., 2015). The rat microarray data can be obtained from the National Center for Biotechnology Information (NCBI) website at www.ncbi.nlm.nih.gov, under accession number GSE5680.

The source code used for the paper is accessible through the following GitHub repository: https://github.com/MahdiNouraie/Stable-Stability-Selection. Furthermore, the stabplot package, which facilitates the use of the two plots introduced in this paper, is available through https://github.com/MahdiNouraie/stabplot.

Acknowledgments

Mahdi Nouraie was supported by the Macquarie University Research Excellence Scholarship (20213605). Samuel Muller was supported by the Australian Research Council Discovery Project Grant (DP230101908).

We gratefully acknowledge Dr. Connor Smith for his guidance and support as co-supervisor of Mahdi Nouraie’s PhD research.

References

  • Altman and Andersen (1989) Douglas G Altman and Per Kragh Andersen. Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine, 8(7):771–783, 1989.
  • Beinrucker et al. (2016) Andre Beinrucker, Ürün Dogan, and Gilles Blanchard. Extensions of stability selection using subsamples of observations and covariates. Statistics and Computing, 26:1059–1077, 2016.
  • Bodinier et al. (2023) Barbara Bodinier, Sarah Filippi, Therese Haugdahl Nøst, Julien Chiquet, and Marc Chadeau-Hyam. Automated calibration for stability selection in penalised regression and graphical models. Journal of the Royal Statistical Society Series C: Applied Statistics, 72(5):1375–1393, 2023.
  • Bühlmann et al. (2014) Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278, 2014.
  • Chiang et al. (2006) Annie P Chiang, John S Beck, Hsan-Jan Yen, Marwan K Tayeh, Todd E Scheetz, Ruth E Swiderski, Darryl Y Nishimura, Terry A Braun, Kwang-Youn A Kim, Jian Huang, et al. Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11). Proceedings of the National Academy of Sciences, 103(16):6287–6292, 2006.
  • Cox (1972) David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 34(2):187–202, 1972.
  • Dezeure et al. (2015) Ruben Dezeure, Peter Bühlmann, Lukas Meier, and Nicolai Meinshausen. High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science, 30(4):533–558, 2015.
  • Fleiss et al. (2004) Joseph L Fleiss, Bruce Levin, and Myunghee Cho Paik. The Measurement of Interrater Agreement. John Wiley & Sons, Inc., 2004.
  • Friedman et al. (2010) Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
  • Gwet (2008) Kilem Li Gwet. Variance estimation of nominal-scale inter-rater reliability with random selection of raters. Psychometrika, 73(3):407–430, 2008.
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 1. Springer, 2009.
  • Huang et al. (2008) Jian Huang, Shuangge Ma, and Cun-Hui Zhang. Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica, 18(4):1603–1618, 2008.
  • Kalousis et al. (2007) Alexandros Kalousis, Julien Prados, and Melanie Hilario. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems, 12:95–116, 2007.
  • Kuncheva (2007) Ludmila I Kuncheva. A stability index for feature selection. In Artificial Intelligence and Applications, pages 390–395, 2007.
  • Meinshausen and Bühlmann (2010) Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(4):417–473, 2010.
  • Müller and Welsh (2010) Samuel Müller and Alan H Welsh. On model selection curves. International Statistical Review, 78(2):240–256, 2010.
  • Nogueira et al. (2018) Sarah Nogueira, Konstantinos Sechidis, and Gavin Brown. On the stability of feature selection algorithms. Journal of Machine Learning Research, 18(174):1–54, 2018.
  • Pareto (1896) Vilfredo Pareto. Cours D’Économie Politique. Rouge, Lausanne, 1896.
  • Scheetz et al. (2006) Todd E Scheetz, Kwang-Youn A Kim, Ruth E Swiderski, Alisdair R Philp, Terry A Braun, Kevin L Knudtson, Anne M Dorrance, Gerald F DiBona, Jian Huang, Thomas L Casavant, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429–14434, 2006.
  • Sen et al. (2021) Rikta Sen, Ashis Kumar Mandal, and Basabi Chakraborty. A critical study on stability measures of feature selection with a novel extension of lustgarten index. Machine Learning and Knowledge Extraction, 3(4):771–787, 2021.
  • Shah and Samworth (2013) Rajen D Shah and Richard J Samworth. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 75(1):55–80, 2013.
  • Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.