
One-at-a-time knockoffs: controlled false discovery rate with higher power

Charlie K. Guan, Zhimei Ren, Daniel W. Apley
Abstract

We propose one-at-a-time knockoffs (OATK), a new methodology for detecting important explanatory variables in linear regression models while controlling the false discovery rate (FDR). For each explanatory variable, OATK generates a knockoff design matrix that preserves the Gram matrix by replacing, one at a time, only the single corresponding column of the original design matrix. OATK is a substantial relaxation and simplification of the knockoff filter of Barber and Candès (2015) (BC), which simultaneously generates all columns of the knockoff design matrix to satisfy a much larger set of constraints. To test each variable's importance, statistics are then constructed by comparing the original and knockoff coefficients. Under a mild correlation assumption on the original design matrix, OATK asymptotically controls the FDR at any desired level. Moreover, OATK consistently achieves (often substantially) higher power than BC and other approaches across a variety of simulation examples and a real genetics dataset. Generating knockoffs one at a time also has substantial computational advantages and facilitates additional enhancements, such as conditional calibration or derandomization, to further improve power and consistency of FDR control. OATK can be viewed as a generalization of the conditional randomization test (CRT) to fixed-design linear regression problems, and can generate fine-grained p-values for each hypothesis.


Keywords: variable selection, knockoff filter, selective inference, false discovery rate, multiple hypothesis testing

1 Introduction

Linear and ridge regression are ubiquitous tools for data analysis. In applications where many of the explanatory variables do not affect the response, users often seek to discover the sparse subset of significant variables and discard those having no effect. Consider the fixed-design linear regression model

\bm{y}=\bm{X\beta}+\bm{z} \qquad (1)

where $\bm{y}\in\mathbb{R}^{n}$ is the response vector, $\bm{X}=[\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{p}]\in\mathbb{R}^{n\times p}$ is the known and deterministic design matrix of explanatory variables with Gram matrix $\bm{\Sigma}=\bm{X}^{\top}\bm{X}$, $\bm{z}\sim N_{n}(0,\sigma^{2}\bm{I}_{n})$ is Gaussian noise, and $\bm{\beta}=[\beta_{1},\beta_{2},\ldots,\beta_{p}]^{\top}$ is the vector of unknown regression coefficients. Given this model, we are interested in testing the null hypotheses

H_{j}:\beta_{j}=0

for each $j\in[p]:=\{1,2,\ldots,p\}$. The goal is to identify $S=\{j:\beta_{j}\neq 0\}\subseteq[p]$, i.e., the index set of non-null variables. Specifically, we aim to leverage the data to produce an estimate $\widehat{S}\subseteq[p]$ of $S$, while controlling the false discovery rate $\text{FDR}=\mathbb{E}[\text{FDP}]$, where

\text{FDP}=\frac{\#\{j:\beta_{j}=0\text{ and }j\in\widehat{S}\}}{\#\{j:j\in\widehat{S}\}\vee 1} \qquad (2)

is the false discovery proportion, and $a\vee b=\max(a,b)$ for $a,b\in\mathbb{R}$. Given a pre-specified target FDR level $\alpha\in(0,1)$, our primary goal is to control the FDR of $\widehat{S}$ at level $\alpha$ in a manner that yields high $\text{Power}=\mathbb{E}[\text{TDP}]$, where

\text{TDP}=\frac{\#\{j:j\in S\text{ and }j\in\widehat{S}\}}{\#\{j:j\in S\}} \qquad (3)

is the true discovery proportion. In the expectations defining both the FDR and the Power, the expectation is taken with respect to $\bm{z}$, with $\bm{X}$ treated as fixed.

A classical approach to this variable selection problem is to first construct a p-value for each hypothesis, and then obtain a selection set based on these p-values (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001; Storey, 2002; Storey et al., 2004). In some cases, it might be difficult to obtain valid p-values, especially when one wishes to incorporate additional information (e.g., sparsity, smoothness) in the model-fitting step.

Recently, there have been two new (suites of) p-value-free variable selection algorithms: the knockoff filter (Barber and Candès, 2015; Candès et al., 2018) (BC) and the Gaussian mirror (Xing et al., 2023a; Dai et al., 2023a; Dai et al., 2023b) (GM). The knockoff filter constructs a knockoff matrix $\bm{\widetilde{X}}=[\bm{\widetilde{x}}_{1},\bm{\widetilde{x}}_{2},\ldots,\bm{\widetilde{x}}_{p}]\in\mathbb{R}^{n\times p}$ such that

\bm{\widetilde{X}}^{\top}\bm{\widetilde{X}}=\bm{X}^{\top}\bm{X},\quad\text{and}\quad\bm{\widetilde{X}}^{\top}\bm{X}=\bm{X}^{\top}\bm{\widetilde{X}}=\bm{\Sigma}-\bm{S},

where $\bm{S}=\text{diag}(\bm{s})$ and $\bm{s}$ is a suitably chosen $p$-dimensional non-negative vector such that $[\bm{X}\;\bm{\widetilde{X}}]^{\top}[\bm{X}\;\bm{\widetilde{X}}]$ is positive semi-definite. The knockoff filter regresses $\bm{y}$ (possibly with regularization, e.g., Lasso (Tibshirani, 1996) or ridge (Hoerl and Kennard, 1970)) onto the augmented design matrix $[\bm{X}\;\bm{\widetilde{X}}]$ to yield the augmented vector of estimated coefficients $[\bm{\widehat{\beta}}^{\top}\;\bm{\widetilde{\beta}}^{\top}]=[\widehat{\beta}_{1},\ldots,\widehat{\beta}_{p},\widetilde{\beta}_{1},\ldots,\widetilde{\beta}_{p}]$ for both the original explanatory variables and their knockoffs. The procedure then constructs antisymmetric statistics $\bm{W}=[W_{1},\ldots,W_{p}]$ such that $W_{j}$ compares $\widehat{\beta}_{j}$ and $\widetilde{\beta}_{j}$, with $W_{j}$ positive if $\widehat{\beta}_{j}$ appears more significant than $\widetilde{\beta}_{j}$ and negative if the opposite is true. A variable is selected (i.e., $j\in\widehat{S}$) if $W_{j}$ is large and positive. For example, one choice further explored in Candès et al. (2018) is $W_{j}=|\widehat{\beta}_{j}|-|\widetilde{\beta}_{j}|$. Another choice, which applies only to Lasso regression with regularization parameter $\lambda$ and was the primary focus of BC, is $W_{j}=\max(\widehat{\lambda}_{j},\widetilde{\lambda}_{j})\times\operatorname{sign}(\widehat{\lambda}_{j}-\widetilde{\lambda}_{j})$, where $\widehat{\lambda}_{j}$ ($\widetilde{\lambda}_{j}$) denotes the largest $\lambda$ on the Lasso path for which $\widehat{\beta}_{j}$ ($\widetilde{\beta}_{j}$) is nonzero. A positive $W_{j}$ in this case means that, as $\lambda$ increased, $\widehat{\beta}_{j}$ remained in the model longer than $\widetilde{\beta}_{j}$, and vice-versa for negative $W_{j}$. The knockoff filter then selects variables via $\widehat{S}=\{j:W_{j}\geq T\}$, where the threshold $T$ is selected empirically via

T\equiv\min\bigg\{t:\frac{1+\#\{j:W_{j}\leq-t\}}{\#\{j:W_{j}\geq t\}\vee 1}\leq\alpha\bigg\}. \qquad (4)

Above, the numerator $1+\#\{j:W_{j}\leq-t\}$ is an estimate of the number of false discoveries at threshold $t$: the knockoffs are constructed without regard to $\bm{y}$ and truly have no effect on the response, so $W_{j}$ is symmetric about $0$ for the null features. Barber and Candès (2015) showed that this knockoff procedure controls the FDR at level $\alpha$.
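To make the threshold search in (4) concrete, the following is a minimal Python sketch; the function name is our own illustrative choice, not the authors' reference implementation:

import numpy as np

def knockoff_threshold(W, alpha, offset=1):
    # Candidate thresholds are the nonzero magnitudes |W_j|, scanned in
    # increasing order; return the smallest t with estimated FDP <= alpha.
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= alpha:
            return t
    return np.inf  # no feasible threshold: select nothing

# selected = np.where(W >= knockoff_threshold(W, alpha=0.1))[0]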

Although the knockoff filter provably controls the FDR, it has been observed to lose power in certain problems (Gimenez and Zou, 2019; Liu and Rigollet, 2019; Li and Fithian, 2021). The loss of power may be due to several reasons. Perhaps most significantly, the statistics $\bm{W}$ of the knockoff filter are computed for the augmented problem of regressing $\bm{y}$ onto $[\bm{X}\;\bm{\widetilde{X}}]$, which is substantially different from the original problem of regressing $\bm{y}$ onto $\bm{X}$. Namely, the augmented regression involves twice as many explanatory variables as the original regression, which typically increases the variance of the estimated regression coefficients substantially and reduces the ability to discern between $\beta_{j}=0$ and $\beta_{j}\neq 0$.

In contrast, the GM approach of Xing et al. (2023a) considers each explanatory variable one at a time. For each $\bm{x}_{j}$, it generates a pair of “mirror variables” $\bm{x}^{+}_{j}=\bm{x}_{j}+\bm{z}_{j}$ and $\bm{x}^{-}_{j}=\bm{x}_{j}-\bm{z}_{j}$ for some appropriately scaled Gaussian random vector $\bm{z}_{j}$, regresses $\bm{y}$ onto $[\bm{X}_{\backslash j}\;\bm{x}^{+}_{j}\;\bm{x}^{-}_{j}]$, where $\bm{X}_{\backslash j}$ denotes $\bm{X}$ with the column $\bm{x}_{j}$ removed, and then computes a statistic $W_{j}$ by contrasting the coefficients of $\bm{x}^{+}_{j}$ and $\bm{x}^{-}_{j}$ in this augmented regression. As noted in Xing et al. (2023a), this is equivalent to regressing $\bm{y}$ onto $[\bm{X}\;\bm{z}_{j}]$ and contrasting the coefficients of $\bm{x}_{j}$ and $\bm{z}_{j}$. Compared with knockoffs, this approach has weaker theoretical FDR control guarantees (the FDR is controlled asymptotically when the dependence between features is mild), but empirically it sometimes exhibits higher power while maintaining reasonable FDR control.

We propose a new approach, termed one-at-a-time knockoffs (OATK), that combines desirable aspects of the GM and BC approaches but is couched more firmly in the knockoff framework. One contribution of this work is to draw a closer connection to the knockoff framework than was drawn in Xing et al. (2023a), in a manner that provides insight into how to achieve more powerful detection of signals and reduce computational cost, while preserving FDR control. The specific contributions of this work include:

(1) OATK constructs a knockoff copy for each variable one at a time, where the knockoffs must satisfy only a subset of the BC knockoff conditions, allowing substantially greater flexibility in generating knockoffs. Each subsequent knockoff regression has exactly the same dimension as the original regression (only a single column is replaced by its knockoff), avoiding the reduction in power associated with the BC regression onto $2p$ variables. Our feature importance statistics, constructed from the knockoff and original coefficients, are still symmetric around zero for the nulls.

(2) OATK is far more computationally efficient to implement than the BC procedure: using well-known matrix algebra identities, the feature importance statistics can be computed without actually regressing $\bm{y}$ onto the knockoff design matrix. The low computational expense and one-at-a-time nature of the knockoffs facilitate performance enhancements such as derandomization, or a “multi-bit” variant that constructs multiple knockoff copies for each feature, leading to a fine-grained, marginally valid p-value that can be viewed as a conditional randomization test p-value (Candès et al., 2018) in the fixed-design setting. Such p-values (one-bit or multi-bit) are then passed to the SeqStep filter (Barber and Candès, 2015) to generate the selection set.

(3) We prove asymptotic FDR control of OATK under mild assumptions and develop non-asymptotic bounds on the FDR, characterizing conditions for approximate FDR control. We also propose conditionally-calibrated OATK, which achieves exact FDR control by leveraging techniques in Fithian and Lei (2022).

(4) Through extensive numerical experiments and real-data examples, we demonstrate that our approach is substantially more powerful than the approaches of either BC or GM, while still controlling the FDR.

Additional notation and assumptions used throughout the paper.

Throughout, we assume $\bm{X}$ is full rank, each $\bm{x}_{j}$ ($j\in[p]$) is scaled to have unit norm, and we denote by $\bm{\widetilde{X}}=[\bm{\widetilde{x}}_{1},\bm{\widetilde{x}}_{2},\ldots,\bm{\widetilde{x}}_{p}]\in\mathbb{R}^{n\times p}$ the knockoff matrix. The matrix $\bm{X}_{\backslash j}$ denotes the original design matrix $\bm{X}$ with column $j$ removed. For any matrix $\bm{A}$, let $\bm{A}_{\cdot j}$ and $\bm{A}_{j\cdot}$ denote the $j$-th column and $j$-th row, respectively, and let $A_{ij}$ or $[\bm{A}]_{ij}$ denote the element in the $i$-th row and $j$-th column. The set of null features is denoted by $\mathcal{H}_{0}=\{j:\beta_{j}=0\}$. We denote by $p_{0}=|\mathcal{H}_{0}|$ and $p_{1}=|S|$ the number of null and non-null variables, respectively.

2 One-at-a-time knockoff procedure

Our approach generates knockoffs $\bm{\widetilde{x}}_{j}$ one at a time, satisfying the following two conditions:

\textnormal{(a)}~\bm{\widetilde{X}}^{\top}\bm{X}=\bm{\Sigma}-\bm{S},\quad\text{and}\quad\textnormal{(b)}~\bm{\widetilde{x}}_{j}^{\top}\bm{\widetilde{x}}_{j}=\bm{x}_{j}^{\top}\bm{x}_{j}=1\text{ for each }j\in[p], \qquad (5)

where $\bm{S}=\text{diag}(\bm{s})$ is a diagonal matrix with non-negative entries such that $\bm{\Sigma}-\bm{S}$ is positive semi-definite. This is substantially less restrictive than the BC approach, which additionally requires $\bm{\widetilde{X}}^{\top}\bm{\widetilde{X}}=\bm{\Sigma}$. Throughout, the regressions can be either ordinary least squares (OLS) or regularized ridge regression, the former being a special case of the latter.

In Section 2.1, we describe the main ideas behind our OATK regression. In Section 2.2, we convert the knockoff requirements (5) into equivalent but computationally simpler conditions and suggest a simple and attractive choice for $\bm{S}$. In Section 2.3, we derive an explicit connection between our knockoff regression coefficients and the original regression coefficients that is both revealing and enables a fast procedure for computing the coefficients without explicitly generating knockoffs or conducting the $p$ knockoff regressions. In Section 2.4, we discuss the properties the test statistics $\bm{W}$ must possess to control the FDR. In Section 2.5, we describe further enhancements that OATK can incorporate to improve FDR control and power.

2.1 OATK regression structure

Consider the ridge regression parameter estimates

\bm{\widehat{\beta}}_{\lambda}\equiv[\widehat{\beta}_{\lambda,1},\widehat{\beta}_{\lambda,2},\ldots,\widehat{\beta}_{\lambda,p}]^{\top}=[\bm{X}^{\top}\bm{X}+\lambda\bm{I}_{p}]^{-1}\bm{X}^{\top}\bm{y}=\bm{\Sigma}_{\lambda}^{-1}\bm{X}^{\top}\bm{y}, \qquad (6)

where $\bm{\Sigma}_{\lambda}=\bm{X}^{\top}\bm{X}+\lambda\bm{I}_{p}$ and $\lambda$ is the ridge regularization parameter. The same value of $\lambda$ used in this ridge regression is used in all the knockoff regressions described shortly. See Appendix A for a computationally efficient procedure to select $\lambda$ empirically. We define our knockoff ridge regression approach as follows. For each $j\in[p]$, consider

\bm{\widetilde{\beta}}^{j}_{\lambda}\equiv[\widetilde{\beta}^{j}_{\lambda,1},\widetilde{\beta}^{j}_{\lambda,2},\ldots,\widetilde{\beta}^{j}_{\lambda,p}]^{\top}=[(\bm{\widetilde{X}}^{j})^{\top}\bm{\widetilde{X}}^{j}+\lambda\bm{I}_{p}]^{-1}(\bm{\widetilde{X}}^{j})^{\top}\bm{y}=\bm{\widehat{\beta}}_{\lambda}+[\bm{\Sigma}^{-1}_{\lambda}]_{\cdot j}\big(\bm{\widetilde{x}}_{j}^{\top}\bm{y}-\bm{x}_{j}^{\top}\bm{y}\big), \qquad (7)

where $\bm{\widetilde{X}}^{j}=[\bm{X}_{\backslash j}\;\bm{\widetilde{x}}_{j}]$ is the design matrix used in the OATK knockoff regression for the $j$-th variable (i.e., we replace only the $j$-th column by its knockoff). The only element of $\bm{\widetilde{\beta}}^{j}_{\lambda}$ that we will need is $\widetilde{\beta}_{\lambda,j}^{j}$, which from (7) is obtained via $\widetilde{\beta}_{\lambda,j}^{j}=\widehat{\beta}_{\lambda,j}+[\bm{\Sigma}^{-1}_{\lambda}]_{jj}(\bm{\widetilde{x}}_{j}^{\top}\bm{y}-\bm{x}_{j}^{\top}\bm{y})$, where for $\lambda=0$, $[\bm{\Sigma}^{-1}]_{jj}$ can be interpreted as the reciprocal of the sum of squared errors (SSE) from regressing $\bm{x}_{j}$ onto the other $p-1$ columns of $\bm{X}$. Thus, we can efficiently obtain our vector of knockoff coefficients via

\bm{\widetilde{\beta}}_{\lambda}\equiv[\widetilde{\beta}_{\lambda 1},\widetilde{\beta}_{\lambda 2},\ldots,\widetilde{\beta}_{\lambda p}]^{\top}\equiv[\widetilde{\beta}_{\lambda 1}^{1},\widetilde{\beta}_{\lambda 2}^{2},\ldots,\widetilde{\beta}_{\lambda p}^{p}]^{\top}=\bm{\widehat{\beta}}_{\lambda}+\big[[\bm{\Sigma}^{-1}_{\lambda}]_{11}(\bm{\widetilde{x}}_{1}-\bm{x}_{1})^{\top}\bm{y},\;\ldots,\;[\bm{\Sigma}^{-1}_{\lambda}]_{pp}(\bm{\widetilde{x}}_{p}-\bm{x}_{p})^{\top}\bm{y}\big]^{\top}. \qquad (8)

Note that $\bm{\widehat{\beta}}_{\lambda}$ in (6) and (8) differs from the coefficient vector obtained in the augmented regression of the BC approach.

The following proposition, whose proof is in Appendix B.1, establishes the joint distribution of $(\bm{\widehat{\beta}}_{\lambda},\bm{\widetilde{\beta}}_{\lambda})$ with $\lambda$ treated as fixed.

Proposition 1.

For fixed $\lambda>0$, consider $\bm{\widehat{\beta}}_{\lambda}$ and $\bm{\widetilde{\beta}}_{\lambda}$ obtained from Equations (6) and (8), respectively. Then the joint distribution of $(\bm{\widehat{\beta}}_{\lambda},\bm{\widetilde{\beta}}_{\lambda})$ is

\begin{bmatrix}\bm{\widehat{\beta}}_{\lambda}\\ \bm{\widetilde{\beta}}_{\lambda}\end{bmatrix}\sim\mathcal{N}_{2p}\Bigg(\begin{bmatrix}\bm{\mu_{\widehat{\beta}}}\\ \bm{\mu_{\widetilde{\beta}}}\end{bmatrix},\begin{bmatrix}\bm{\Sigma_{\widehat{\beta}}}&\mathrm{Cov}^{\top}[\bm{\widetilde{\beta}}_{\lambda},\bm{\widehat{\beta}}_{\lambda}]\\ \mathrm{Cov}[\bm{\widetilde{\beta}}_{\lambda},\bm{\widehat{\beta}}_{\lambda}]&\bm{\Sigma_{\widetilde{\beta}}}\end{bmatrix}\Bigg), \qquad (9)

where

\bm{\mu_{\widehat{\beta}}}\equiv\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\beta},\quad\bm{\mu_{\widetilde{\beta}}}\equiv(\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}-\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}\bm{S})\bm{\beta},\quad\bm{\Sigma_{\widehat{\beta}}}\equiv\sigma^{2}\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}, \qquad (10)
\bm{\Sigma_{\widetilde{\beta}}}\equiv\sigma^{2}\big[\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}-\bm{\Sigma}_{\lambda}^{-1}\bm{S}\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}-\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}\bm{S}\bm{\Sigma}_{\lambda}^{-1}\,+ \qquad (11)
\qquad\qquad\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}\bm{\widetilde{X}}-\bm{\Sigma}+2\bm{S})\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}\big], \qquad (12)
\mathrm{Cov}[\bm{\widetilde{\beta}}_{\lambda},\bm{\widehat{\beta}}_{\lambda}]=\sigma^{2}(\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}-\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}\bm{S}\bm{\Sigma}_{\lambda}^{-1}). \qquad (13)

From Proposition 1, if $j\in\mathcal{H}_{0}$, then for any $\lambda>0$ the marginal distribution of $(\widehat{\beta}_{\lambda,j},\widetilde{\beta}_{\lambda,j})$ is

\mathcal{N}\left(\begin{bmatrix}[\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}]_{j\cdot}\bm{\beta}\\ [\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}]_{j\cdot}\bm{\beta}\end{bmatrix},\;\sigma^{2}\begin{bmatrix}[\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}]_{jj}&[\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}]_{jj}-([\bm{\Sigma}_{\lambda}^{-1}]_{jj})^{2}s_{j}\\ [\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}]_{jj}-([\bm{\Sigma}_{\lambda}^{-1}]_{jj})^{2}s_{j}&[\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}]_{jj}\end{bmatrix}\right), \qquad (14)

which implies that the distribution of $(\widehat{\beta}_{\lambda,j},\widetilde{\beta}_{\lambda,j})$ is symmetric, as formalized below.

Corollary 2.1.

Under the same conditions as Proposition 1, when $j$ is null, the marginal distribution of $(\widehat{\beta}_{\lambda,j},\widetilde{\beta}_{\lambda,j})$ is symmetric, i.e., $(\widehat{\beta}_{\lambda,j},\widetilde{\beta}_{\lambda,j})\stackrel{\textnormal{d}}{=}(\widetilde{\beta}_{\lambda,j},\widehat{\beta}_{\lambda,j})$.

Several remarks are in order.

Remark 1.

For the special case $\lambda=0$ (OLS), the means and covariances reduce to

\bm{\mu_{\widehat{\beta}}}=\bm{\beta},\quad\bm{\mu_{\widetilde{\beta}}}=(\bm{I}_{p}-\operatorname{diag}\{\bm{\Sigma}^{-1}\}\bm{S})\bm{\beta},\quad\bm{\Sigma_{\widehat{\beta}}}=\sigma^{2}\bm{\Sigma}^{-1}, \qquad (15)
\bm{\Sigma_{\widetilde{\beta}}}=\sigma^{2}\big[\bm{\Sigma}^{-1}-\bm{\Sigma}^{-1}\bm{S}\operatorname{diag}\{\bm{\Sigma}^{-1}\}-\operatorname{diag}\{\bm{\Sigma}^{-1}\}\bm{S}\bm{\Sigma}^{-1}\,+ \qquad (16)
\qquad\qquad\operatorname{diag}\{\bm{\Sigma}^{-1}\}(\bm{\widetilde{X}}^{\top}\bm{\widetilde{X}}-\bm{\Sigma}+2\bm{S})\operatorname{diag}\{\bm{\Sigma}^{-1}\}\big], \qquad (17)
\mathrm{Cov}[\bm{\widetilde{\beta}}_{\lambda=0},\bm{\widehat{\beta}}_{\lambda=0}]=\sigma^{2}(\bm{\Sigma}^{-1}-\operatorname{diag}\{\bm{\Sigma}^{-1}\}\bm{S}\bm{\Sigma}^{-1}). \qquad (18)

In this case, a particularly appealing choice of $\bm{S}$ is $s_{j}=1/[\bm{\Sigma}^{-1}]_{jj}$, which is always allowable (see Section 2.2). This results in $\bm{\mu_{\widetilde{\beta}}}=\bm{0}_{p\times 1}$ regardless of $\bm{\beta}$, which should yield better separation between the statistics $|\widehat{\beta}_{\lambda j}|$ and $|\widetilde{\beta}_{\lambda j}|$ when $\beta_{j}\neq 0$. A potentially desirable side effect of this choice of $\bm{S}$ is that it also results in $\mathrm{Cov}[\widehat{\beta}_{\lambda j},\widetilde{\beta}_{\lambda j}]=0$.

Remark 2.

Unlike the BC knockoffs, our knockoffs are not required to satisfy $\bm{\widetilde{X}}^{\top}\bm{\widetilde{X}}=\bm{X}^{\top}\bm{X}$ beyond the diagonal conditions $\bm{\widetilde{x}}_{j}^{\top}\bm{\widetilde{x}}_{j}=1$. Removing this condition only affects $\bm{\Sigma_{\widetilde{\beta}}}$ and allows a more flexible, and potentially better, choice of $\bm{S}$. The price is that our procedure no longer enjoys provable finite-sample FDR control like the BC procedure. But as we show later, it still delivers asymptotic FDR control and finite-sample FDR bounds under mild and verifiable conditions, and it demonstrates good FDR control empirically.

2.2 Knockoff generation and choice of $\bm{S}$

The one-at-a-time knockoff conditions defined in (5) are equivalent to requiring

\textnormal{(i)}~\bm{X}_{\backslash j}^{\top}\bm{x}_{j}=\bm{X}_{\backslash j}^{\top}\bm{\widetilde{x}}_{j},\quad\textnormal{(ii)}~\bm{x}_{j}^{\top}\bm{\widetilde{x}}_{j}=1-s_{j},\quad\textnormal{(iii)}~\bm{\widetilde{x}}_{j}^{\top}\bm{\widetilde{x}}_{j}=1,\quad\textnormal{for each }j\in[p]. \qquad (19)

To provide insight into the role of the knockoffs and how to generate them efficiently, consider the singular value decomposition $\bm{X}_{\backslash j}=\bm{U}_{\backslash j}\bm{D}_{\backslash j}\bm{V}_{\backslash j}^{\top}$, where $\bm{U}_{\backslash j}\in\mathbb{R}^{n\times(p-1)}$ and $\bm{V}_{\backslash j}\in\mathbb{R}^{(p-1)\times(p-1)}$ have orthonormal columns and $\bm{D}_{\backslash j}\in\mathbb{R}^{(p-1)\times(p-1)}$ is diagonal. Without loss of generality, represent $\bm{\widetilde{x}}_{j}$ as

\bm{\widetilde{x}}_{j}=\bm{U}_{\backslash j}\bm{b}_{j}+\bm{u}_{j}b_{j}+\bm{r}_{j}, \qquad (20)

where $\bm{u}_{j}=(\bm{x}_{j}-\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j})/\sigma_{j}$ is the unit-norm vector orthogonal to the columns of $\bm{X}_{\backslash j}$ such that the columns of $[\bm{U}_{\backslash j},\bm{u}_{j}]$ form an orthonormal basis for the column space of $\bm{X}$, $\sigma_{j}^{2}=\lVert\bm{x}_{j}-\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}\rVert^{2}=1/[\bm{\Sigma}^{-1}]_{jj}$ is the sum of squared errors (SSE) from regressing $\bm{x}_{j}$ onto $\bm{X}_{\backslash j}$, and $\bm{r}_{j}$ satisfies $\bm{X}^{\top}\bm{r}_{j}=\bm{0}_{p\times 1}$. Our assumption that $\bm{X}$ is full rank implies $\sigma_{j}>0$ for all $j\in[p]$. To generate $\bm{\widetilde{x}}_{j}$, we must find the coefficients $\bm{b}_{j}$ and $b_{j}$ and then generate $\bm{r}_{j}$ appropriately. The following proposition characterizes the exact form of $\bm{\widetilde{x}}_{j}$.

Proposition 2.

For each $j\in[p]$, the $j$-th knockoff satisfying (5) must be of the form

\bm{\widetilde{x}}_{j}=\frac{s_{j}}{\sigma_{j}^{2}}\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}+\bigg(1-\frac{s_{j}}{\sigma_{j}^{2}}\bigg)\bm{x}_{j}+\bm{r}_{j}, \qquad (21)

where we can select any $s_{j}\in[0,2\sigma_{j}^{2}]$ and then generate $\bm{r}_{j}$ randomly as any vector orthogonal to the columns of $\bm{X}$ with norm $\lVert\bm{r}_{j}\rVert=(2s_{j}-s_{j}^{2}/\sigma_{j}^{2})^{1/2}$.

The proof of Proposition 2 is relegated to Appendix B.2. Regarding the choice of $\{s_{j}:j\in[p]\}$, Equation (14) implies that it has no effect on the variances of $\{\widetilde{\beta}_{\lambda j}:j\in[p]\}$, although it does affect their correlations (and their correlations with $\{\widehat{\beta}_{\lambda j}:j\in[p]\}$). The appealing choice $s_{j}=\sigma_{j}^{2}$ suggested in Remark 1 is always allowable, and we use it in all of our examples. This choice also results in $b_{j}=0$, in which case (21) becomes

\bm{\widetilde{x}}_{j}=\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}+\bm{r}_{j}, \qquad (22)

where $\bm{r}_{j}$ is any vector orthogonal to the columns of $\bm{X}$ with norm $\lVert\bm{r}_{j}\rVert=\sigma_{j}$. The knockoff in (22) is orthogonal to $\bm{u}_{j}$, the component of $\bm{x}_{j}$ orthogonal to the column space of $\bm{X}_{\backslash j}$. Consequently, the choice $s_{j}=\sigma_{j}^{2}$ also makes the correlation between $\bm{\widetilde{x}}_{j}$ and $\bm{x}_{j}$ no larger than necessary for $\bm{\widetilde{x}}_{j}$ to have the required correlation with $\bm{X}_{\backslash j}$ in (19).
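To make the construction concrete, the following is a minimal Python/numpy sketch of (22); the function name and the use of QR factorizations are our own illustrative choices, not the implementation in the OATK package:

import numpy as np

def oat_knockoff_column(X, j, rng):
    """Generate the j-th OATK column per (22), using the choice s_j = sigma_j^2."""
    n, p = X.shape
    X_minus_j = np.delete(X, j, axis=1)
    U, _ = np.linalg.qr(X_minus_j)            # orthonormal basis of col(X_{\j})
    proj = U @ (U.T @ X[:, j])                # U_{\j} U_{\j}^T x_j
    sigma_j = np.linalg.norm(X[:, j] - proj)  # sqrt of the SSE
    # r_j: random direction orthogonal to col(X), rescaled to norm sigma_j
    Q, _ = np.linalg.qr(X)                    # orthonormal basis of col(X)
    g = rng.standard_normal(n)
    r = g - Q @ (Q.T @ g)
    r *= sigma_j / np.linalg.norm(r)
    return proj + r

# Example usage, assuming X has unit-norm columns and n > p:
# x_tilde = oat_knockoff_column(X, j=0, rng=np.random.default_rng(0))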

2.3 Fast computation of OAT knockoff coefficients

Combining (21) and (8), the $j$-th knockoff coefficient for general $s_{j}$ satisfying $0\leq s_{j}\leq 2\sigma_{j}^{2}$ becomes

\widetilde{\beta}_{\lambda j}=\widehat{\beta}_{\lambda j}+[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\bigg(\bigg[\frac{s_{j}}{\sigma_{j}^{2}}\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}-\frac{s_{j}}{\sigma_{j}^{2}}\bm{x}_{j}+\bm{r}_{j}\bigg]^{\top}\bm{y}\bigg) \qquad (23)
=\widehat{\beta}_{\lambda j}+[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\bigg(-\frac{s_{j}}{\sigma_{j}^{2}}\big[\bm{x}_{j}-\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}\big]^{\top}\bm{y}+\bm{r}_{j}^{\top}\bm{y}\bigg) \qquad (24)
=\widehat{\beta}_{\lambda j}-[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\,s_{j}\,\widehat{\beta}_{OLS,j}+[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\,\bm{r}_{j}^{\top}\bm{y}, \qquad (25)

where $\widehat{\beta}_{OLS,j}$ denotes the coefficient of $\bm{x}_{j}$ in the OLS regression of $\bm{y}$ onto $\bm{X}$. The last equality follows because, by standard partial correlation arguments, $\widehat{\beta}_{OLS,j}$ equals the coefficient from regressing $\bm{y}$ onto $\bm{x}_{j}-\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}$, which is

\widehat{\beta}_{OLS,j}=\frac{\big[\bm{x}_{j}-\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}\big]^{\top}\bm{y}}{\big[\bm{x}_{j}-\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}\big]^{\top}\big[\bm{x}_{j}-\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}\big]}=\frac{\big[\bm{x}_{j}-\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}\big]^{\top}\bm{y}}{\sigma_{j}^{2}}. \qquad (26)

For the specific choice $s_{j}=\sigma_{j}^{2}$, (25) becomes

\widetilde{\beta}_{\lambda j}=\widehat{\beta}_{\lambda j}-[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\,\sigma_{j}^{2}\,\widehat{\beta}_{OLS,j}+[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\,\bm{r}_{j}^{\top}\bm{y}. \qquad (27)

From (25), or the special case (27), the knockoff coefficients can be computed efficiently without generating the full knockoffs or running additional regressions that include them. One can simply conduct both ridge and OLS regressions of $\bm{y}$ onto $\bm{X}$ and then compute $[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\bm{r}_{j}^{\top}\bm{y}$ after generating $\bm{r}_{j}$. An efficient implementation of this procedure, along with a fast approach for fitting ridge regression across various $\lambda$ and selecting the best $\lambda$ via leave-one-out cross-validation (LOOCV), is detailed in Appendix A. In addition to having interpretive value, expression (27) also enables a very fast version of derandomized knockoffs, which we discuss in Section 2.5.3.
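As an illustration, here is a minimal sketch of the fast computation in (27), assuming unit-norm columns and $n>p$; the helper name and the use of explicit matrix inverses are our own (a production implementation would use the factorizations in Appendix A):

import numpy as np

def oat_knockoff_coeffs(X, y, lam, rng):
    """All p knockoff coefficients via (27), never forming a knockoff column."""
    n, p = X.shape
    Sigma_lam_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
    beta_ridge = Sigma_lam_inv @ (X.T @ y)
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    d = np.diag(Sigma_lam_inv)                       # [Sigma_lambda^{-1}]_{jj}
    sigma2 = 1.0 / np.diag(np.linalg.inv(X.T @ X))   # sigma_j^2 = 1/[Sigma^{-1}]_{jj}
    Q, _ = np.linalg.qr(X)                           # orthonormal basis of col(X)
    r_dot_y = np.empty(p)
    for j in range(p):
        g = rng.standard_normal(n)
        r = g - Q @ (Q.T @ g)                        # orthogonal to col(X)
        r *= np.sqrt(sigma2[j]) / np.linalg.norm(r)  # ||r_j|| = sigma_j
        r_dot_y[j] = r @ y
    return beta_ridge - d * sigma2 * beta_ols + d * r_dot_y   # (27)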

2.4 Knockoff statistics

After constructing $\bm{\widehat{\beta}}_{\lambda}$ and $\bm{\widetilde{\beta}}_{\lambda}$, the goal is to construct knockoff statistics $\bm{W}=[W_{1},\dots,W_{p}]$ such that large positive values of $W_{j}$ give evidence that $\beta_{j}\neq 0$, for each variable $j\in[p]$. Specifically, we focus on

W_{j}=\begin{cases}\;\;\,|\widehat{\beta}_{\lambda j}|&\text{if }|\widehat{\beta}_{\lambda j}|\geq|\widetilde{\beta}_{\lambda j}|\\ -|\widetilde{\beta}_{\lambda j}|&\text{if }|\widehat{\beta}_{\lambda j}|<|\widetilde{\beta}_{\lambda j}|\end{cases}, \qquad (28)

and our OATK approach selects variables via $\widehat{S}=\{j:W_{j}\geq T_{\alpha}\}$, where the data-driven threshold $T_{\alpha}$ is selected (analogous to the BC threshold selection) as

T_{\alpha}\equiv\min\bigg\{t:\frac{c+\#\{j:W_{j}\leq-t\}}{1\vee\#\{j:W_{j}\geq t\}}\leq\alpha\bigg\}, \qquad (29)

where $\alpha\in(0,1)$ is the desired FDR level and $c$ is the offset parameter, usually set to $0$ or $1$.

The intuition behind (28) is that a variable is selected only if $|\widehat{\beta}_{\lambda j}|\geq T_{\alpha}$ and $|\widehat{\beta}_{\lambda j}|\geq|\widetilde{\beta}_{\lambda j}|$. The second inequality reduces false discoveries under strong multicollinearity, which can inflate the variance of $\widetilde{\beta}_{\lambda j}$ and therefore produce large knockoff coefficient magnitudes even for null variables. We use (28) in all our numerical examples due to its empirically accurate FDR control and high power.
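A minimal sketch of the resulting selection rule, combining (28) and (29); the function name is ours, and the default c=0 mirrors the offset used in the experiments of Section 4:

import numpy as np

def oatk_select(beta_hat, beta_tilde, alpha=0.1, c=0):
    """Selection via the sign-max statistic (28) and the threshold rule (29)."""
    a, b = np.abs(beta_hat), np.abs(beta_tilde)
    W = np.where(a >= b, a, -b)                       # (28)
    for t in np.sort(np.abs(W)):                      # candidate thresholds
        if (c + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= alpha:
            return np.where(W >= t)[0]                # indices in S_hat
    return np.array([], dtype=int)                    # nothing selected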

In general, any choice of $\bm{W}$ can be used with (29) in our framework, as long as the statistics are marginally exchangeable, as defined below.

Definition 2.2.

The statistics $\bm{W}$ are marginally exchangeable if swapping $\bm{x}_{j}$ and $\bm{\widetilde{x}}_{j}$ only switches the sign of $W_{j}$, for all $j\in[p]$.

2.5 Extensions of one-at-a-time knockoffs

We introduce three extensions of OATK. The first considers multiple exchangeable knockoff copies, which allows for producing fine-grained p-values. The second adjusts the rejection set of OATK to achieve finite-sample FDR control with the conditional calibration technique. The third extension concerns the derandomization of the OATK procedure.

2.5.1 Multiple OATK

For each $j\in[p]$, OATK constructs a knockoff copy that preserves the correlation structure of the original feature. We now consider generating multiple knockoff copies with a joint correlation condition. Specifically, let $M\geq 1$ denote the number of knockoff copies, with $M\leq n-p$. For any $j\in[p]$, let $\bm{\widetilde{x}}_{j}^{(m)}$ denote the $m$-th knockoff copy of $\bm{x}_{j}$ and $\widetilde{\bm{X}}^{(j)}\equiv[\bm{\widetilde{x}}_{j}^{(1)},\bm{\widetilde{x}}_{j}^{(2)},\ldots,\bm{\widetilde{x}}_{j}^{(M)}]$ denote the assembly of the knockoff copies (to be distinguished from $\bm{\widetilde{X}}^{j}$). We impose the following conditions on $\widetilde{\bm{X}}^{(j)}$: for each $m\in[M]$,

\text{(i)}~\bm{x}_{i}^{\top}\widetilde{\bm{x}}_{j}^{(m)}=\bm{x}_{i}^{\top}\bm{x}_{j},~i\in[p]\backslash\{j\};\quad\text{(ii)}~\bm{x}_{j}^{\top}\widetilde{\bm{x}}_{j}^{(m)}=1-s_{j}; \qquad (30)
\text{(iii)}~(\widetilde{\bm{x}}_{j}^{(m)})^{\top}\widetilde{\bm{x}}_{j}^{(m)}=1;\quad\text{(iv)}~(\widetilde{\bm{x}}_{j}^{(m)})^{\top}\widetilde{\bm{x}}_{j}^{(\ell)}=1-s_{j},~\ell\in[M]\backslash\{m\}, \qquad (31)

where $s_{j}$ is a constant to be determined shortly. Letting $\bm{e}_{M}$ denote the $M$-dimensional vector of ones, we generate $\widetilde{\bm{X}}^{(j)}$ via

\widetilde{\bm{X}}^{(j)}=\bigg[\frac{s_{j}}{\sigma_{j}^{2}}\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}+\Big(1-\frac{s_{j}}{\sigma_{j}^{2}}\Big)\bm{x}_{j}\bigg]\bm{e}_{M}^{\top}+\bm{R}\bm{C}, \qquad (32)

where $\bm{R}\in\mathbb{R}^{n\times M}$ has orthonormal columns whose span is orthogonal to the column space of $\bm{X}$ (possible since $M\leq n-p$), and $\bm{C}\in\mathbb{R}^{M\times M}$ satisfies

\bm{C}^{\top}\bm{C}=s_{j}\bm{I}_{M}+s_{j}\Big(1-\frac{s_{j}}{\sigma^{2}_{j}}\Big)\bm{e}_{M}\bm{e}_{M}^{\top}. \qquad (33)

When $s_{j}\in[0,\frac{M+1}{M}\sigma^{2}_{j}]$, the right-hand side of (33) is positive semi-definite, so such a $\bm{C}$ exists. This is formalized in the following lemma.

Lemma 2.3.

Suppose that $n\geq p+M$ and that $s_{j}\in[0,\frac{M+1}{M}\sigma_{j}^{2}]$ for each $j\in[p]$. Then

s_{j}\bm{I}_{M}+s_{j}\Big(1-\frac{s_{j}}{\sigma^{2}_{j}}\Big)\bm{e}_{M}\bm{e}_{M}^{\top}

is positive semi-definite, and the knockoff copies defined in (32) satisfy the conditions in (30)-(31).

The proof of Lemma 2.3 can be found in Appendix B.3. Compared with (19), condition (iv) in (31) additionally specifies the correlation between the knockoff copies, and the conditions reduce to those of OATK when $M=1$.

Next, define for each $j\in[p]$ and $m\in[M]$ the design matrix $\widetilde{\bm{X}}^{j,m}=[\bm{X}_{\backslash j},\bm{\widetilde{x}}_{j}^{(m)}]$, i.e., the result of replacing the $j$-th column by the $m$-th knockoff copy. Let $\widehat{\beta}_{j}^{(m)}$ denote the coefficient of $\bm{\widetilde{x}}_{j}^{(m)}$ from regressing $\bm{y}$ onto $\widetilde{\bm{X}}^{j,m}$, and $\widehat{\beta}_{j}^{(0)}$ the coefficient of $\bm{x}_{j}$ from regressing $\bm{y}$ onto $\bm{X}$. Denoting $M_{j}=\max\big\{|\widehat{\beta}_{j}^{(0)}|,|\widehat{\beta}_{j}^{(1)}|,\ldots,|\widehat{\beta}_{j}^{(M)}|\big\}$, we define the OATK p-value as

p_{j}=\frac{1+\sum_{m=1}^{M}\mathbf{1}\big\{|\widehat{\beta}_{j}^{(m)}|\geq|\widehat{\beta}_{j}^{(0)}|\big\}}{M+1}. \qquad (34)

The selection procedure is based on $\{(M_{j},p_{j})\}_{j\in[p]}$ and the SeqStep+ filter (Barber and Candès, 2015), and proceeds in two steps: (1) the features are ordered according to $M_{j}$ such that $M_{\pi(1)}\leq M_{\pi(2)}\leq\cdots\leq M_{\pi(p)}$, where $\pi$ is a permutation of $[p]$; (2) the rejection set is $\{\pi(k):k\geq\widehat{k},\,p_{\pi(k)}\leq\gamma\}$, where

\widehat{k}=\inf\left\{k\in[p]:\frac{c+\#\{j\geq k:p_{\pi(j)}>\gamma\}}{1\vee\#\{j\geq k:p_{\pi(j)}\leq\gamma\}}\leq\frac{1-\gamma}{\gamma}\alpha\right\},

for some constant $\gamma\in[\frac{\alpha}{\alpha+1},1/2]$. This selection rule reduces to the basic OATK when $M=1$.
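A minimal sketch of this two-step selection, assuming the original and knockoff coefficients have already been computed; the array layout and function name are our own:

import numpy as np

def multibit_oatk_select(beta0, beta_knock, alpha=0.1, gamma=0.5, c=1):
    """p-values per (34), then the two-step SeqStep+ selection.

    beta0: (p,) original coefficients beta_j^(0).
    beta_knock: (p, M) knockoff coefficients, column m holding beta_j^(m+1).
    """
    p, M = beta_knock.shape
    a0, ak = np.abs(beta0), np.abs(beta_knock)
    pvals = (1 + np.sum(ak >= a0[:, None], axis=1)) / (M + 1)   # (34)
    Mj = np.maximum(a0, ak.max(axis=1))
    order = np.argsort(Mj)                # M_{pi(1)} <= ... <= M_{pi(p)}
    small = pvals[order] <= gamma
    for k in range(p):                    # smallest k meeting the SeqStep+ bound
        num = c + np.sum(~small[k:])
        den = max(np.sum(small[k:]), 1)
        if num / den <= (1 - gamma) / gamma * alpha:
            return order[k:][small[k:]]   # {pi(k'): k' >= k, p_{pi(k')} <= gamma}
    return np.array([], dtype=int)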

The following lemma proves that $p_{j}$ is super-uniform under $H_{j}$. By construction, $p_{j}$ is supported on $\{\frac{1}{M+1},\frac{2}{M+1},\ldots,1\}$, and when $M$ is large, the p-value is sufficiently fine-grained to quantify the strength of evidence against the null hypothesis.

Lemma 2.4.

Suppose there are almost surely no ties among $\{\widehat{\beta}^{(0)}_{j},\widehat{\beta}^{(1)}_{j},\ldots,\widehat{\beta}^{(M)}_{j}\}$ for each $j\in[p]$. Then $p_{j}$ defined in (34) is uniformly distributed on $\{\frac{1}{M+1},\frac{2}{M+1},\ldots,1\}$ for any $j\in\mathcal{H}_{0}$.

The proof of Lemma 2.4 can be found in Appendix B.4. Multiple OATK can be viewed as a generalization of the conditional randomization test p-value (Candès et al., 2018) to the fixed-design setting, where a (fine-grained) p-value is obtained by permutation within a synthetically created group. As in the $M=1$ case, the FDR is approximately controlled when the dependence between the p-values is relatively small. For simplicity, in Section 3 we prove FDR control for the basic OATK, which is equivalent to the multiple OATK with $M=1$.

2.5.2 Conditional calibration for exact FDR control

As alluded to earlier, our relaxation of the knockoff conditions comes at the price of finite-sample FDR control. Although we show in Section 3 that OATK exhibits approximate finite-sample FDR control, as well as asymptotic FDR control, it may still be desirable to achieve exact finite-sample control in some cases, especially when type-I errors are costly.

To see why the OATK conditions do not lead to exact FDR control, recall that for BC knockoffs,

\textnormal{FDR}=\mathbb{E}\bigg[\frac{\sum_{j\in\mathcal{H}_{0}}\mathbf{1}\{W_{j}\geq T_{\alpha}\}}{\sum_{j\in[p]}\mathbf{1}\{W_{j}\geq T_{\alpha}\}}\bigg]\leq\alpha\,\mathbb{E}\bigg[\frac{\sum_{j\in\mathcal{H}_{0}}\mathbf{1}\{W_{j}\geq T_{\alpha}\}}{1+\sum_{j\in\mathcal{H}_{0}}\mathbf{1}\{W_{j}\leq-T_{\alpha}\}}\bigg]\leq\alpha,

where the first inequality holds by (29) with $c=1$, and the last inequality holds if $W_{j}\,|\,\bm{W}_{\backslash j}\stackrel{\rm d}{=}-W_{j}\,|\,\bm{W}_{\backslash j}$ for all null features $j$, where $\bm{W}_{\backslash j}$ is $\bm{W}$ with its $j$-th entry removed; this conditional symmetry does not hold exactly for our method.

A remedy to the FDR violation (both in theory and in practice) is conditional calibration (Fithian and Lei, 2022; Luo et al., 2022), which is efficiently implementable with OATK. To conditionally calibrate OATK, suppose that for each $j\in[p]$ we have a statistic $\bm{G}_{j}$ that is sufficient for the model described by $H_{j}$, so that if $j$ is null, we can sample from $\bm{y}\,|\,\bm{G}_{j}$. In this case, we can use Monte Carlo to empirically evaluate the quantity

\phi_{j}(t;\bm{G}_{j}):=\mathbb{E}\bigg[\frac{\mathbf{1}\{t\cdot W_{j}^{\prime}\geq T_{\beta}^{\prime}\}}{c+\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}^{\prime}\leq-T_{\beta}^{\prime}\}}-\frac{\mathbf{1}\{W_{j}^{\prime}\leq-T_{\beta}^{\prime}\}}{\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}^{\prime}\leq-T_{\beta}^{\prime}\}}\,\bigg|\,\bm{G}_{j}\bigg], \qquad (35)

where $\beta\in(0,1)$ is a parameter not necessarily equal to $\alpha$, and $c\in\mathbb{R}_{\geq 0}$ is a constant. The recommended choice is $\beta\leq\alpha$, which yields better power (Fithian and Lei, 2022).

To calibrate OATK, we first run the base OATK procedure to obtain $T_{\beta}$ and $\bm{W}$. Then, for each $j$, compute $\widehat{t}_{j}=\sup\big\{t\in\mathbb{R}_{\geq 0}:\phi_{j}(t;\bm{G}_{j})\leq 0\big\}$ and let

e_{j}=\frac{p\,\mathbf{1}\{\widehat{t}_{j}\cdot W_{j}\geq T_{\beta}\}}{c+\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}},\quad\text{ if }\phi_{j}(\widehat{t}_{j};\bm{G}_{j})\leq 0; \qquad (36)
e_{j}=\frac{p\,\mathbf{1}\{\widehat{t}_{j}\cdot W_{j}>T_{\beta}\}}{c+\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}},\quad\text{ if }\phi_{j}(\widehat{t}_{j};\bm{G}_{j})>0.

For fixed $t$, $\phi_{j}(t;\bm{G}_{j})$ can be computed by Monte Carlo. For each $j$, we sample $\bm{y}^{\prime}\sim\bm{y}\,|\,\bm{G}_{j}$, run OATK to obtain $W_{j}^{\prime}$ and $T_{\beta}^{\prime}$ (as functions of $\bm{X},\bm{y}^{\prime}$), evaluate the term inside the expectation in (35), and average across the replicates. To compute $e_{j}$, it suffices to evaluate $\phi_{j}(t;\bm{G}_{j})$ at $t=T_{\beta}/W_{j}$, because $\phi_{j}(T_{\beta}/W_{j};\bm{G}_{j})\leq 0$ implies $e_{j}=p/(c+\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\})$ and $\phi_{j}(T_{\beta}/W_{j};\bm{G}_{j})>0$ implies $e_{j}=0$.

To obtain the rejection set, we apply the eBH procedure (Wang and Ramdas, 2022) to the $e_{j}$'s, which sorts $e_{\pi(1)}\geq\cdots\geq e_{\pi(p)}$ and computes

\widehat{k}=\max\left\{k\in[p]:\frac{k\,e_{\pi(k)}}{p}\geq\frac{1}{\alpha}\right\}.

It returns the set $\{\pi(k):k\leq\widehat{k}\}$. The following proposition shows that the $e_{j}$'s are compound e-values (Ignatiadis et al., 2024) and that applying the eBH procedure to them guarantees FDR control. Its proof can be found in Appendix B.5.

Proposition 3.

The $e_{j}$'s defined in (36) satisfy $\sum_{j\in\mathcal{H}_{0}}e_{j}\leq p$, and applying the eBH procedure to the $e_{j}$'s controls the FDR at the desired level.

Remark 3.

If we let $\widehat{t}_{j}=1$ (instead of as defined above), $c=1$, and $\beta=\alpha$ when defining the $e_{j}$'s, then applying the eBH procedure to the $e_{j}$'s is equivalent to applying the BC filter to the $W_{j}$'s (Ren and Barber, 2024).

The remaining question is the choice of $\bm{G}_{j}$. In our setting, where $\bm{y}\sim\mathcal{N}(\bm{X\beta},\sigma^{2}\bm{I})$, a sufficient statistic is $\bm{G}_{j}=(\bm{X}_{\backslash j}^{\top}\bm{y},\|\bm{y}\|^{2})$, and the full distribution of $\bm{y}\mid\bm{G}_{j}$ can be found in Luo et al. (2022, Section E.2).
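For concreteness, a minimal sketch of the final eBH step applied to a vector of (compound) e-values; the function name is ours:

import numpy as np

def ebh_select(e, alpha=0.1):
    """eBH: reject the k_hat largest e-values, k_hat = max{k : k e_(k)/p >= 1/alpha}."""
    p = len(e)
    order = np.argsort(e)[::-1]          # indices with e_{pi(1)} >= ... >= e_{pi(p)}
    k = np.arange(1, p + 1)
    feasible = k * e[order] / p >= 1.0 / alpha
    if not feasible.any():
        return np.array([], dtype=int)
    return order[: k[feasible].max()]    # {pi(k') : k' <= k_hat}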

2.5.3 Derandomized one-at-a-time knockoffs

A drawback of OATK, as of BC and GM, is that the knockoff variable $\bm{\widetilde{x}}_{j}$ is stochastic, in the sense that $\bm{r}_{j}$ in Proposition 2 is not unique, so repeating the algorithm yields different knockoffs and possibly different rejection sets. As with BC, this is alleviated by the derandomization procedures proposed by Ren et al. (2023) and Ren and Barber (2024). For OATK, we adopt the derandomization scheme of Ren et al. (2023): we fix a rejection threshold $\eta$ and repeat the basic OATK procedure $M$ times, each time with a different random generation of the $\bm{r}_{j}$ vectors. The final rejection set consists of the variables rejected in at least a fraction $\eta$ of the trials:

1. For each replicate $\ell\in[M]$, generate knockoffs $\bm{\widetilde{X}}^{(\ell)}$ with randomly generated $\bm{r}_{j}$ vectors (e.g., Gaussian vectors subsequently made orthogonal to $\bm{X}$ and rescaled to the required norm) and construct the rejection set $\widehat{S}^{(\ell)}$ as in the basic OATK procedure.

2. For each variable $j=1,\ldots,p$, compute the rejection frequency

\Pi_{j}=\frac{1}{M}\sum_{\ell=1}^{M}\mathbf{1}\{j\in\widehat{S}^{(\ell)}\}. \qquad (37)

3. Return the final rejection set $\widehat{S}=\{j:\Pi_{j}\geq\eta\}$.

OATK allows derandomization to be incorporated in a more straightforward and computationally efficient manner that further improves performance. This is in part due to the removal of the BC knockoff requirement that $\bm{\widetilde{X}}^{\top}\bm{\widetilde{X}}=\bm{\Sigma}$, which provides greater flexibility and computational efficiency when generating knockoffs, in addition to allowing us to generate knockoffs one at a time. Specifically, each knockoff variable has a deterministic component and a stochastic component, as given by Proposition 2. Across derandomization replicates, the first two terms on the right-hand side of (21) remain the same, and only the $\bm{r}_{j}$ term varies. As a result, when computing the knockoff coefficients via (27), the term $\widehat{\beta}_{\lambda j}-[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\sigma_{j}^{2}\widehat{\beta}_{OLS,j}$ is computed only once. Then, $M$ copies of $\bm{r}_{j}$ are generated to compute $[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\bm{r}_{j}^{\top}\bm{y}$, and subsequently $\widetilde{\beta}_{\lambda,j}$, for all derandomization replicates.
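Putting the pieces together, here is a minimal sketch of this derandomization, reusing the hypothetical oatk_select helper sketched in Section 2.4; note that only the $\bm{r}_{j}$ draws change across replicates, while the deterministic part of (27) is computed once:

import numpy as np

def derandomized_oatk(X, y, lam, M=30, eta=0.5, alpha=0.1, rng=None):
    """Derandomized OATK: only the r_j^T y terms are redrawn across replicates."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    Sigma_lam_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
    beta_ridge = Sigma_lam_inv @ (X.T @ y)
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    d = np.diag(Sigma_lam_inv)
    sigma2 = 1.0 / np.diag(np.linalg.inv(X.T @ X))
    fixed = beta_ridge - d * sigma2 * beta_ols      # deterministic part of (27)
    Q, _ = np.linalg.qr(X)                          # orthonormal basis of col(X)
    counts = np.zeros(p)
    for _ in range(M):
        G = rng.standard_normal((n, p))
        V = G - Q @ (Q.T @ G)                       # columns orthogonal to col(X)
        V *= np.sqrt(sigma2) / np.linalg.norm(V, axis=0)
        beta_tilde = fixed + d * (V.T @ y)          # stochastic part of (27)
        counts[oatk_select(beta_ridge, beta_tilde, alpha)] += 1
    return np.where(counts / M >= eta)[0]           # (37) and the final set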

3 Theoretical guarantees

The insight behind the threshold selection $T_{\alpha}$ in (29) is essentially the same as in the BC procedure. Consider the FDP of OATK:

\textnormal{FDP}=\frac{\#\{j:\beta_{j}=0\text{ and }j\in\widehat{S}\}}{\#\{j:j\in\widehat{S}\}\vee 1}=\frac{\sum_{j\in\mathcal{H}_{0}}\mathbf{1}(W_{j}\geq T)}{\sum_{j\in[p]}\mathbf{1}(W_{j}\geq T)}\stackrel{\textnormal{(a)}}{\approx}\frac{\sum_{j\in\mathcal{H}_{0}}\mathbf{1}(W_{j}\leq-T)}{\sum_{j\in[p]}\mathbf{1}(W_{j}\geq T)}+o(1)
\leq\frac{\sum_{j\in[p]}\mathbf{1}(W_{j}\leq-T)}{\sum_{j\in[p]}\mathbf{1}(W_{j}\geq T)}+o(1)\stackrel{\textnormal{(b)}}{\leq}\alpha+o(1), \qquad (38)

where step (a) is approximately true by the symmetry of $W_{j}$ for null variables (Corollary 2.1) under mild correlation across features, and step (b) is ensured by the threshold selection (29). The approximation in step (a) carries an $o(1)$ term because the sample is finite. Note that the first inequality in (38) should be close to an equality when the non-null variables are sparse, especially if their coefficients are so large that $W_{j}\geq T$ is much more likely than $W_{j}\leq-T$ for the non-null variables. We derive the FDR control guarantees of OATK more rigorously in the remainder of this section, with asymptotic and non-asymptotic guarantees covered in Sections 3.1 and 3.2, respectively.

3.1 Asymptotic FDR bounds

The following theorem shows that when the $W_{j}$'s are sufficiently weakly dependent asymptotically, the OATK FDR is controlled as $p_{0}$, $p_{1}$, and $n$ all go to infinity in a certain manner.

Theorem 3.1.

Assume $p\leq n$ and $p_{0}/p\rightarrow\pi_{0}\in(0,1)$ as $p_{0},p_{1}\rightarrow\infty$. Define

F_{0}(t)\equiv\frac{1}{p_{0}}\sum_{j\in\mathcal{H}_{0}}\mathbb{P}(W_{j}\geq t),\quad G_{-}(t)\equiv\frac{1}{p}\sum_{j\in[p]}\mathbb{P}(W_{j}\leq-t),\quad G_{+}(t)\equiv\frac{1}{p}\sum_{j\in[p]}\mathbb{P}(W_{j}\geq t).

Assume that $F_{0}(t)$, $G_{-}(t)$, and $G_{+}(t)$ are all continuous in $t$ and that there exist constants $c>0$ and $\kappa\in(0,2)$ such that

\sum_{j,k\in[p]}\textnormal{Cov}\big(\mathbf{1}\{W_{j}\geq t\},\mathbf{1}\{W_{k}\geq t\}\big)\leq c\cdot p^{\kappa},\qquad\forall t\in\mathbb{R}. \qquad (39)

Then, for any target FDR level $\alpha\in(0,1)$, if there exists $t_{\alpha}\in\mathbb{R}$ such that $G_{-}(t_{\alpha})/G_{+}(t_{\alpha})<\alpha$, then

\limsup_{p_{0},p_{1},n\rightarrow\infty}\textnormal{FDR}\leq\alpha.

The proof is given in Appendix B.7. The following remark gives an example where the condition in (39) holds.

Remark 4.

Consider $W_{j}=|\widehat{\beta}_{j}|-|\widetilde{\beta}_{j}|$, with $\widehat{\beta}$ and $\widetilde{\beta}$ obtained from OLS regression (original and knockoff) as described in Section 2.1 with $s_{j}=\sigma_{j}^{2}$. Then $\mathrm{Cov}(\widehat{\bm{\beta}},\widetilde{\bm{\beta}})=0$, implying $\widehat{\bm{\beta}}\perp\!\!\!\perp\widetilde{\bm{\beta}}$. As a result, for any $j,k\in\mathcal{H}_{0}$, $(\widehat{\beta}_{j},\widehat{\beta}_{k})\perp\!\!\!\perp(\widetilde{\beta}_{j},\widetilde{\beta}_{k})$. To proceed, define the maximum correlation between two random variables $X$ and $Y$ as

\rho_{\max}(X,Y)=\max_{f_{1}\in\mathcal{F}_{1},f_{2}\in\mathcal{F}_{2}}\mathbb{E}[f_{1}(X)f_{2}(Y)],

where $\mathcal{F}_{1}=\{f:\mathbb{E}[f(X)]=0,\mathrm{Var}(f(X))=1\}$ and $\mathcal{F}_{2}=\{f:\mathbb{E}[f(Y)]=0,\mathrm{Var}(f(Y))=1\}$. Next, for any $t\in\mathbb{R}$,

\mathrm{Cov}\big(\mathbf{1}\{W_{j}\geq t\},\mathbf{1}\{W_{k}\geq t\}\big)=\mathrm{Cov}\big(\mathbf{1}\{|\widehat{\beta}_{j}|-|\widetilde{\beta}_{j}|\geq t\},\mathbf{1}\{|\widehat{\beta}_{k}|-|\widetilde{\beta}_{k}|\geq t\}\big)
\leq\textnormal{Corr}\big(\mathbf{1}\{|\widehat{\beta}_{j}|-|\widetilde{\beta}_{j}|\geq t\},\mathbf{1}\{|\widehat{\beta}_{k}|-|\widetilde{\beta}_{k}|\geq t\}\big)
\leq\rho_{\max}\big(|\widehat{\beta}_{j}|-|\widetilde{\beta}_{j}|,|\widehat{\beta}_{k}|-|\widetilde{\beta}_{k}|\big),

where the first inequality follows from the fact that $\mathrm{Var}(\mathbf{1}\{W_{j}\geq t\})\leq 1$ and $\mathrm{Var}(\mathbf{1}\{W_{k}\geq t\})\leq 1$, and the second inequality is due to the definition of $\rho_{\max}$. By Equation (8) in Bryc and Dembo (2005), we have

\rho_{\max}\big(|\widehat{\beta}_{j}|-|\widetilde{\beta}_{j}|,|\widehat{\beta}_{k}|-|\widetilde{\beta}_{k}|\big)\leq\max\big\{\rho_{\max}(\widehat{\beta}_{j},\widehat{\beta}_{k}),\rho_{\max}(\widetilde{\beta}_{j},\widetilde{\beta}_{k})\big\}.

As for the first term above,

\rho_{\max}(\widehat{\beta}_{j},\widehat{\beta}_{k})=\textnormal{Corr}(\widehat{\beta}_{j},\widehat{\beta}_{k})=\frac{[\bm{\Sigma}^{-1}]_{jk}}{\sqrt{[\bm{\Sigma}^{-1}]_{jj}[\bm{\Sigma}^{-1}]_{kk}}}.

As for the second term, we have that

\rho_{\max}(\widetilde{\beta}_{j},\widetilde{\beta}_{k})=\textnormal{Corr}(\widetilde{\beta}_{j},\widetilde{\beta}_{k})=\bigg|\frac{[\bm{\Sigma}^{-1}]_{jk}}{\sqrt{[\bm{\Sigma}^{-1}]_{jj}[\bm{\Sigma}^{-1}]_{kk}}}-\sqrt{[\bm{\Sigma}^{-1}]_{jj}[\bm{\Sigma}^{-1}]_{kk}}\,(\widetilde{\bm{x}}_{j}^{\top}\widetilde{\bm{x}}_{k}-\bm{x}_{j}^{\top}\bm{x}_{k})\bigg|.

Combining the two terms, we can see that Condition (39) is satisfied when

\sum_{j,k\in[p]}\frac{[\bm{\Sigma}^{-1}]_{jk}}{\sqrt{[\bm{\Sigma}^{-1}]_{jj}[\bm{\Sigma}^{-1}]_{kk}}}+\sqrt{[\bm{\Sigma}^{-1}]_{jj}[\bm{\Sigma}^{-1}]_{kk}}\cdot|\widetilde{\bm{x}}_{j}^{\top}\widetilde{\bm{x}}_{k}-\bm{x}_{j}^{\top}\bm{x}_{k}|=O(p^{\kappa}),

for some κ(0,2)\kappa\in(0,2).

3.2 Non-asymptotic FDR bounds

As with the BC knockoffs, the key to the FDR control of OATK is the symmetry of the test statistics conditional on the other features. While the BC knockoff construction enforces this conditional symmetry exactly, our relaxed construction does not, so the FDR control argument for the BC procedure does not directly apply. The result in this section characterizes the FDR inflation in terms of the violation of conditional symmetry.

To start, we define for any $j\in[p]$ the “symmetry index”

\widehat{\textnormal{SI}}_{j}=\frac{\mathbb{P}(W_{j}>0\,\big|\,|W_{j}|,\bm{W}_{\backslash j})}{\mathbb{P}(W_{j}<0\,\big|\,|W_{j}|,\bm{W}_{\backslash j})}.

The following theorem provides an upper bound on the FDR of OATK as a function of the $\widehat{\textnormal{SI}}_{j}$'s.

Theorem 3.2.

If $c=1$ in the definition of $T_{\alpha}$ in Equation (29), then for any $\epsilon\geq 0$, the FDR of OATK can be bounded as

\textnormal{FDR}\leq e^{\epsilon}\cdot\alpha+\mathbb{P}\Big(\max_{j\in\mathcal{H}_{0}}\widehat{\textnormal{SI}}_{j}>e^{\epsilon}\Big).

The proof of Theorem 3.2 largely follows the leave-one-out technique used in Barber et al. (2020) and is provided in Appendix B.6.

Remark 5.

If the $W_{j}$'s satisfy the strict knockoff condition in Barber and Candès (2015), then $\widehat{\textnormal{SI}}_{j}\equiv 1$ for $j\in\mathcal{H}_{0}$, and the FDR is strictly controlled by taking $\epsilon=0$ in Theorem 3.2. When the strict knockoff condition is violated only mildly, there exists a small $\epsilon>0$ such that $\mathbb{P}(\max_{j\in\mathcal{H}_{0}}\widehat{\textnormal{SI}}_{j}\geq e^{\epsilon})$ is also small, and therefore the FDR is approximately controlled.

4 Numerical experiments

In this section we show numerically that, relative to existing FDR-controlling algorithms, OATK exhibits higher power with reasonable finite-sample FDR control. We implement OATK using ridge regression, utilizing the computational speedups described in Section 2.3. For each simulation example, we fix $n$, $p$, and $p_{1}$, and uniformly sample $S$ from $[p]$ such that $|S|=p_{1}$. We set $p_{1}=30$ for all experiments. The non-null coefficients are set to $\beta_{j}=\pm A$, where the sign is sampled uniformly at random and the signal amplitude $A$ is varied across experiments. The remaining entries of $\bm{\beta}$ are set to zero. The target false discovery rate is fixed at $\alpha=0.1$. In each simulation example, we assume a model for $\bm{X}$ and generate a different dataset $(\bm{X},\bm{y})$ for each replicate by sampling $\bm{X}\in\mathbb{R}^{n\times p}$ from the model, normalizing each column to have unit norm, and generating $\bm{y}$ according to the linear model (1) with unit noise variance. In all experiments, the regularization parameter $\lambda$ is selected using leave-one-out cross-validation, as described in Appendix A, and the offset parameter in Equation (29) is set to $c=0$. For each example, we conducted $100$ replicates. Section 4.1 takes $\bm{X}$ to be multivariate normal with different covariance structures, and Section 4.2 considers $\bm{X}$ sampled from a discrete-time Markov chain. Section 4.3 considers real genetic data by fixing $\bm{X}$ as the set of genetic mutations of the HIV genome (HIV Drug Resistance Database; Rhee et al., 2003). Using this real $\bm{X}$ (normalized to have unit-norm columns), we construct a semi-synthetic simulation example by sampling $\bm{\beta}$ and $\bm{y}$ as described above, similar to the semi-synthetic genetics study of Xing et al. (2023a). We also conduct a genome-wide association study by taking $\bm{y}$ to be real, experimentally collected susceptibility data of the HIV virus to fosamprenavir, a common treatment drug. Section 4.4 examines further enhancements of OATK when coupled with the multi-bit p-value and conditional calibration.

In each simulation example, we include both randomized and derandomized versions of OATK. The existing algorithms with which we compare OATK are the Benjamini-Hochberg procedure (Benjamini and Hochberg,, 1995) (BH), BC, and GM. The BC and GM procedures were implemented with Lasso regression, as we observed higher power compared to ridge regression. Additionally, for the Markov chain benchmark in Section 4.2, we compare OATK against the modified knockoff algorithm of Sesia et al., (2018) (KnockoffDMC), which generates knockoff variables distributed as a discrete-time Markov chain. Since we only consider low-dimensional datasets where p<np<n, we use the formulation of GM that requires no pre-screening of variables, detailed in Appendix C. The R package for OATK is publicly available at https://github.com/charlie-guan/oatk.

4.1 Simulation with Gaussian 𝑿\bm{X}

In this example, each row of 𝑿\bm{X} is generated independently from a multivariate normal distribution, N(0,𝚺x)\text{N}(0,\bm{\Sigma}_{x}), with three different structures of 𝚺x\bm{\Sigma}_{x}. Namely, we consider

  1. Power decay: [𝚺x]ij=ρ|ij|[\bm{\Sigma}_{x}]_{ij}=\rho^{|i-j|} for all i,j[p]i,j\in[p].

  2. Constant positive correlation: [𝚺x]ij=ρ𝟏(ij)[\bm{\Sigma}_{x}]_{ij}=\rho^{\mathbf{1}(i\neq j)} for all i,j[p]i,j\in[p].

  3. Constant negative correlation: 𝚺x=𝑸1\bm{\Sigma}_{x}=\bm{Q}^{-1}, where Qij=ρ𝟏(ij)Q_{ij}=\rho^{\mathbf{1}(i\neq j)} for all i,j[p]i,j\in[p].

In all of the following simulations, we set ρ=0.4\rho=0.4 for every covariance structure. FDR and power results for different ρ[0,0.8]\rho\in[0,0.8] are included in Appendix C.2; changing ρ\rho does not alter OATK's uniformly higher power relative to the other algorithms.
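
The following Python sketch (ours; the helper names are our own, not the authors' code) constructs the three covariance structures and samples the rows of 𝑿\bm{X}:

    import numpy as np

    def power_decay(p, rho):
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])

    def constant_positive(p, rho):
        # [Sigma_x]_ij = rho^{1(i != j)}: ones on the diagonal, rho elsewhere.
        return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)

    def constant_negative(p, rho):
        # Sigma_x = Q^{-1}, where Q_ij = rho^{1(i != j)}.
        return np.linalg.inv(constant_positive(p, rho))

    rng = np.random.default_rng(0)
    Sigma_x = power_decay(300, 0.4)
    # Each row of X is drawn i.i.d. from N(0, Sigma_x).
    X = rng.multivariate_normal(np.zeros(300), Sigma_x, size=1000)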

Small pp and nn study.

We set n=1000n=1000 and p=300p=300. Fig. 1 plots the FDR and power (the average FDP and TDP, respectively) across the 100 replicates for each example. OATK exhibits significantly higher power than the other algorithms across all three covariance structures and all signal amplitudes. OATK exhibits slight FDR inflation with the power decay and constant positive covariances, but its FDR never exceeds 0.120.130.12-0.13, except at A=3A=3. For the constant negative covariance, OATK exhibits somewhat higher FDR inflation, at around 0.150.15, but far higher power than the other algorithms. Derandomization, which is efficiently implementable with OATK due to its substantially lower computational expense, lowers the FDR inflation somewhat with no effect on power. GM typically has somewhat lower power and lower FDR than BH. In the constant positive covariance example at lower amplitudes, GM has higher FDR inflation than all other methods. BC exhibits consistent FDR control and higher power than BH and GM for the power decay and constant positive correlations, but its power drops to almost zero for the constant negative correlation. Similar conclusions regarding the generally superior power and reasonable FDR control of OATK hold when ρ\rho is varied, as we demonstrate in Appendix C.2.

Figure 1: FDR and power as functions of signal amplitude AA in the p=300p=300 and n=1000n=1000 Gaussian 𝑿\bm{X} study.
Larger pp and nn study.

We next set n=2000n=2000 and p=1000p=1000. Fig. 2 shows OATK again has the best power among all procedures. While it still exhibits relatively mild FDR inflation in some cases, the power gap between OATK and the other methods is even larger than in the smaller pp and nn study. Derandomization again lowers the FDR inflation while slightly boosting power.

Figure 2: FDR and power as functions of signal amplitude AA in the p=1000p=1000 and n=2000n=2000 Gaussian 𝑿\bm{X} study.

Regarding replicate-to-replicate variability, Figs. 3 and 4 show the distribution of FDP and TDP across all replicates at a fixed signal amplitude of A=5A=5 in the small pp and nn study and the large pp and nn study, respectively; derandomization lowers the variances of both FDP and TDP in OATK. One drawback of OATK is that it only satisfies asymptotic FDR control, whereas methods such as BH and BC satisfy finite-sample FDR control for any nn. As a result, OATK exhibits some FDR inflation in Figs. 1 and 2, while these two competing procedures do not. GM, which also satisfies only asymptotic FDR control, likewise shows FDR inflation in some cases, especially in the constant positive covariance example. However, in light of the replicate-to-replicate variability seen in Figs. 3 and 4 for all methods, small inflation of the FDR may be viewed as relatively innocuous, since it is small relative to the FDP variability. In particular, users apply the approach to a single data set, and the actual FDP for that data set will often differ substantially from the FDR due to this variability. From Figs. 3 and 4, all methods yield quite high FDPs (e.g., above 0.250.25) on at least some replicates, and the upper FDP quartile even for the BC and BH methods (which guarantee FDR0.1\text{FDR}\leq 0.1) exceeds 0.150.15 in many examples. In the constant negative example, BH and BC suffer from more extreme outliers with high FDP, and they sometimes yield higher TDP variability than OATK. Moreover, derandomized OATK reduces the variances of both FDP and TDP below those of the other methods, so it actually yields less extreme worst-case FDPs and TDPs.

Figure 3: Boxplots of FDP and TDP for individual replicates in the p=300p=300 and n=1000n=1000 Gaussian 𝑿\bm{X} study for signal amplitude A=5A=5.
Figure 4: Boxplots of FDP and TDP for individual replicates in the p=1000p=1000 and n=2000n=2000 Gaussian 𝑿\bm{X} study for signal amplitude A=5A=5.

4.2 Simulation with Markov chain design

Our OATK approach applies to any fixed-design matrix 𝑿\bm{X}, as long as it is full rank. Next we consider the setting of Sesia et al., (2018), who developed a knockoff filter specialized to a design 𝑿\bm{X} generated from a Markov chain. Specifically, for i[n]i\in[n], the ii-th row 𝑿i.\bm{X}_{i.} of 𝑿\bm{X} is generated independently of the other rows as a scalar discrete Markov chain with state space 𝒱={0,1,2}\mathcal{V}=\{0,1,2\}, as follows. We first sample the Markov chain hyperparameters {γj:j[p1]}Unif(0,0.5)\{\gamma_{j}:j\in[p-1]\}\sim\text{Unif}(0,0.5), with the same hyperparameters applied to every row. We sample the initial state Xi,1X_{i,1} uniformly at random from 𝒱\mathcal{V}. Then for j[p1]j\in[p-1], we sample the next state Xi,j+1|Xi,jX_{i,j+1}|X_{i,j} using the transition probability

Qj(rs)={13+23γjr=s12(11323γj)rs.\displaystyle Q_{j}(r\mid s)=\begin{cases}\frac{1}{3}+\frac{2}{3}\gamma_{j}&r=s\\ \frac{1}{2}\left(1-\frac{1}{3}-\frac{2}{3}\gamma_{j}\right)&r\neq s\end{cases}.

In this model, the chain is more likely to remain in the same state than to jump to a different state, but this sojourn probability differs across covariates due to different γj\gamma_{j}. If the chain does jump to a different state, then it chooses between the other two possible states uniformly at random.
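
A minimal Python sketch of this design sampler (our illustrative code, not from Sesia et al., (2018)) is:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 1000, 300
    gamma = rng.uniform(0.0, 0.5, size=p - 1)     # shared across all rows

    X = np.empty((n, p), dtype=int)
    X[:, 0] = rng.integers(0, 3, size=n)          # initial state, uniform on {0,1,2}
    for j in range(p - 1):
        stay = 1.0 / 3.0 + 2.0 / 3.0 * gamma[j]   # sojourn probability Q_j(s|s)
        move = rng.random(n) >= stay              # rows whose chain jumps
        X[:, j + 1] = X[:, j]
        # A jumping chain picks one of the two other states uniformly at random.
        shift = rng.integers(1, 3, size=move.sum())
        X[move, j + 1] = (X[move, j] + shift) % 3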

Fig. 5 shows that OATK performs best in power for both smaller pp and nn (n=1000,p=300n=1000,p=300) and larger pp and nn (n=2000,p=1000n=2000,p=1000), although in the latter setting its power is comparable to KnockoffDMC. In the smaller setting, there is some FDR inflation, but derandomization lowers it closer to the target level, and the inflation is comparable to that of the other methods, especially KnockoffDMC. In the larger setting, the FDR is controlled essentially at the target level. Moreover, OATK does not exhibit the significant FDR inflation at lower signal amplitudes that GM does. Relative to KnockoffDMC, OATK performs similarly in power in the larger pp and nn setting and better in the smaller setting, in spite of the fact that KnockoffDMC was developed specifically for, and only applies to, the Markov chain situation and requires estimating the transition probability matrix QQ prior to variable selection. Additionally, OATK with derandomization yields the smallest replicate-to-replicate variability in TDP (Fig. 6). KnockoffDMC suffered from poor TDP in several replicates. OATK avoided these extreme low-power outliers, except for a single replicate, and even that outlier was eliminated by derandomization. Therefore, even when the power is comparable, OATK is a more reliable and consistent variable selection procedure, especially in cases where reproducibility is essential.

Figure 5: FDR and power as functions of signal amplitude AA in the Markov chain 𝑿\bm{X} study for n=1000n=1000 and p=300p=300 (top) and n=2000n=2000 and p=1000p=1000 (bottom).
Figure 6: Boxplots of FDP and TDP for individual replicates in the Markov chain 𝑿\bm{X} study with A=5A=5 for n=1000n=1000 and p=300p=300 (top) and n=2000n=2000 and p=1000p=1000 (bottom).

4.3 Genome-wide association study with HIV drug resistance data

Genome-wide association studies (GWAS) examine a genome-wide collection of genetic variants from individuals sampled from a target population to detect whether any variants are associated with a phenotype, e.g., assessing which mutations increase resistance to therapeutics for patients with human immunodeficiency virus (HIV).

4.3.1 Semi-synthetic example with real covariates and synthetic response

We consider real genetic covariates using the HIV Drug Resistance Database of Rhee et al., (2003), available at https://hivdb.stanford.edu. Their protease inhibitor (PI) database collected 2395 genetic isolates. After removing duplicate data, rows with missing values, and rare mutations (i.e., occurring in fewer than 10 data points), we obtain 𝑿\bm{X} with n=2372n=2372 samples and p=176p=176 genetic mutations. The possible values of XijX_{ij} are 0 or 1, which indicate the absence (0) or presence (11) of the jj-th genetic mutation in the ii-th patient’s genome. When constructing 𝜷\bm{\beta}, a positive non-null coefficient means that the presence of the corresponding mutation increases the response phenotype (e.g., increased drug resistance), and a negative coefficient indicates a lowered response. The zero coefficients correspond to mutations that have no effect on the response. We set the number of non-null coefficients to k=20k=20. The results are shown in Fig. 7. OATK again yields the best power but with some FDR inflation that is reduced by derandomization to levels comparable with BC. All of the methods except BH have some FDR inflation.

Figure 7: FDR and power as functions of signal amplitude AA for the semi-synthetic HIV drug resistance example.

4.3.2 Identification of key genetic mutations that affect resistance to antiretroviral therapy

In this example, we use fully real data, taking the same 𝑿\bm{X} as in Section 4.3.1 and taking 𝒚\bm{y} as the logarithmic fold resistance to fosamprenavir, a common antiretroviral therapeutic for HIV patients. We do not run any simulations, as both 𝑿\bm{X} and 𝒚\bm{y} are data collected from real laboratory tests conducted in vitro as part of the HIV Drug Resistance Database. The experimental methodology used to collect the data is detailed in Rhee et al., (2003). The goal of this study is to compare variable selection procedures for identifying the relevant mutations that impact a patient’s drug resistance to fosamprenavir, and to corroborate any identified mutations against additional clinical studies.

The data were cleaned as described in Section 4.3.1, now resulting in 𝑿\bm{X} having n=2224n=2224 samples and p=172p=172 genetic mutations. The dimensionality differs slightly from Section 4.3.1 due to the addition of the real fold resistance data for fosamprenavir. Table 1 lists the total number of key mutations identified by each procedure, as well as those that were uniquely identified (i.e., not identified by any of the other procedures). OATK selected 77 key mutations, the most among the four benchmarked algorithms; it also had the highest number of mutations not identified by the other three algorithms. In particular, OATK uniquely identified the mutations 36L, 71V, 73A, and 73C. To assess whether these uniquely identified mutations were true or false positives, we note that several clinical studies (Lastere et al.,, 2004; Pellegrin et al.,, 2007; Masquelier et al.,, 2008) reported them to affect virological response in HIV patients treated with fosamprenavir. We emphasize that the results in Table 1 used only in vitro laboratory tests and no clinical data. In this sense, OATK uniquely identified multiple mutations that were corroborated to have clinical impact. In contrast, the uniquely identified mutations of the other procedures were not clinically corroborated, to the best of our knowledge.

Table 1: Variable selection results identifying key genetic mutations that affect HIV drug resistance to fosamprenavir. The number in each mutation label denotes the affected codon position, and the letter is the resulting amino acid from the mutation. Of the uniquely identified mutations, 36L, 71V, 73A, and 73C (all identified by OATK) were verified to be true positives in separate clinical studies.
Procedure | # Identified | Uniquely Identified Mutations
BH | 60 | None
BC | 64 | 12S, 17E, 55R, 66V
GM | 71 | 23I, 37S, 57G, 68E
OATK | 77 | 12P, 13V, 36L, 67E, 71V, 73A, 73C

4.4 Effect of multi-bit and conditional calibration on OATK

Fig. 8 compares the performance of OATK with (M>1M>1) and without (M=1M=1) the multi-bit p-value enhancement for various generation sizes MM in the same Gaussian 𝑿\bm{X} setting of Section 4.1 presented in Figure 1. Relative to basic OATK (M=1M=1), multi-bit OATK lowers the FDR inflation but reduces power, especially in the constant negative covariance example, for which the FDR inflation was higher. However, even with the power reduction, multi-bit OATK still had power as high as or higher than (and sometimes substantially higher than) the other methods across all three examples (compare Figs. 1 and 8). For M=15,20M=15,20, multi-bit OATK conservatively controlled the FDR below the specified α=0.1\alpha=0.1 across all examples and signal amplitudes. Among the values considered, M=25M=25 yielded the FDR closest to the desired α=0.1\alpha=0.1.
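
For reference, one plausible reading of the rank-based multi-bit p-value (whose null distribution is established in Lemma 2.4, proved in Appendix B.4) admits the following Python sketch; the code is our illustration, and the exact rank and tie-breaking conventions follow the paper's definition of pjp_{j}:

    import numpy as np

    def multibit_pvalue(beta_orig, beta_knock):
        # beta_orig: original coefficient for variable j (scalar);
        # beta_knock: its M simultaneously generated knockoff coefficients.
        # Our illustrative sketch; under the null, the rank of |beta_orig|
        # among the M+1 coefficients is uniform, so p_j is uniform on its grid.
        M = len(beta_knock)
        rank = np.sum(np.abs(beta_knock) >= np.abs(beta_orig))
        return rank / (M + 1.0)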

Figure 8: Effect of multi-bit on OATK in the n=1000n=1000, p=300p=300 Gaussian 𝑿\bm{X} study. The notation OATK_M indicates MM knockoff variables were simultaneously generated.

Fig. 9 shows the effects of conditional calibration on OATK for a Gaussian 𝑿\bm{X} example with varying nn. For n=200n=200, n=400n=400, and n=1000n=1000, we set p=100p=100, p=200p=200, and p=300p=300, respectively. We fix the signal amplitude at A=6A=6. Rather than computing eje_{j} for every variable using Monte Carlo simulations, we carry out the calibration step only on promising variables, following an approach similar to that of Luo et al., (2022) to improve computational speed. The implementation details are in Appendix C.3. Fig. 9 shows that conditionally calibrated OATK corrects the FDR inflation across all nn and all covariance cases without having to generate eje_{j} for all jj, confirming that conditional calibration achieves finite-sample FDR control. However, it also lowers the power compared to basic OATK.

Figure 9: Effect of conditional calibration (cOATK) on OATK for the Gaussian 𝑿\bm{X} study with various nn with ρ=0.4,A=6\rho=0.4,A=6.

5 Conclusions

This paper introduces one-at-a-time knockoffs, a more powerful and computationally efficient algorithm for controlling the false discovery rate while discovering true signals in OLS and ridge regression problems with p<np<n. It substantially relaxes the conditions knockoffs must satisfy in the original BC knockoff filter by removing any requirements on the correlation between knockoffs for different variables. This relaxation enables knockoffs to be generated one at a time in a computationally efficient manner, and allows each knockoff regression to involve exactly the same number of variables as the original regression. In contrast, the BC knockoff approach requires the knockoffs to be generated simultaneously to satisfy the additional requirement 𝑿~𝑿~=𝚺\bm{\widetilde{X}}^{\top}\bm{\widetilde{X}}=\bm{\Sigma} and conducts a knockoff regression with twice the number of variables as the original regression, which unnecessarily reduces the power of the knockoff approach.

We also developed a fast version of OATK that avoids having to explicitly generate the knockoff variables 𝑿~\bm{\widetilde{X}} or compute their coefficients 𝜷~\bm{\widetilde{\beta}} in full regressions. This allows additional performance-boosting enhancements like derandomization (to reduce variance), multi-bit p-values (to boost power or quantify uncertainty), and conditional calibration (to achieve finite-sample false discovery rate control) to be implemented efficiently. We have proven that OATK controls the FDR at any desired level asymptotically and provided a finite-sample bound on the FDR.

One limitation of our current OATK approach is that it is restricted to the case p<np<n (the original BC approach is restricted even further, to 2p<n2p<n). One version of the GM approach, which also tests each variable one at a time, allows p>np>n via a pre-screening step that uses Lasso regression to select a subset of fewer than nn variables, to which the GM procedure is then applied. We implemented a similar pre-screening with our OATK approach and observed from the numerical results that it (i) works well with p>np>n and (ii) also boosts power in the p<np<n case while still controlling the FDR. We also found that pre-screening substantially improves the power of the GM approach when p<np<n. However, its theoretical justification is unclear: the theoretical results on FDR control in Xing et al., 2023a for the GM approach with p>np>n apply neither to the actual numerical implementation in their companion code (Xing et al., 2023b, ) nor to the version of OATK with pre-screening that we observed to work well empirically. We are currently developing an extension of OATK that applies to either p<np<n or p>np>n and that has stronger theoretical justification.

References

  • Barber and Candès, (2015) Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055 – 2085.
  • Barber et al., (2020) Barber, R. F., Candès, E. J., and Samworth, R. J. (2020). Robust inference with knockoffs. The Annals of Statistics, 48(3):1409–1431.
  • Benjamini and Hochberg, (1995) Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.
  • Benjamini and Yekutieli, (2001) Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188.
  • Bryc and Dembo, (2005) Bryc, W. and Dembo, A. (2005). On the maximum correlation coefficient. Theory of Probability & Its Applications, 49(1):132–138.
  • Candès et al., (2018) Candès, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3):551–577.
  • (7) Dai, C., Lin, B., Xing, X., and Liu, J. S. (2023a). False discovery rate control via data splitting. Journal of the American Statistical Association, 118(544):2503–2520.
  • (8) Dai, C., Lin, B., Xing, X., and Liu, J. S. (2023b). A scale-free approach for false discovery rate control in generalized linear models. Journal of the American Statistical Association, 118(543):1551–1565.
  • Fithian and Lei, (2022) Fithian, W. and Lei, L. (2022). Conditional calibration for false discovery rate control under dependence. The Annals of Statistics, 50(6):3091–3118.
  • Gimenez and Zou, (2019) Gimenez, J. R. and Zou, J. (2019). Improving the stability of the knockoff procedure: Multiple simultaneous knockoffs and entropy maximization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2184–2192. PMLR.
  • Hoerl and Kennard, (1970) Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
  • Ignatiadis et al., (2024) Ignatiadis, N., Wang, R., and Ramdas, A. (2024). Compound e-values and empirical bayes. arXiv preprint arXiv:2409.19812.
  • Lastere et al., (2004) Lastere, S., Dalban, C., Collin, G., Descamps, D., Girard, P.-M., Clavel, F., Costagliola, D., and Brun-Vezinet, F. (2004). Impact of insertions in the hiv-1 p6 ptapp region on the virological response to amprenavir. Antiviral therapy, 9(2):221–227.
  • Li and Fithian, (2021) Li, X. and Fithian, W. (2021). Whiteout: when do fixed-x knockoffs fail? arXiv preprint arXiv:2107.06388.
  • Liu and Rigollet, (2019) Liu, J. and Rigollet, P. (2019). Power analysis of knockoff filters for correlated designs. Advances in Neural Information Processing Systems, 32.
  • Luo et al., (2022) Luo, Y., Fithian, W., and Lei, L. (2022). Improving knockoffs with conditional calibration. arXiv preprint arXiv:2208.09542.
  • Masquelier et al., (2008) Masquelier, B., Assoumou, K. L., Descamps, D., Bocket, L., Cottalorda, J., Ruffault, A., Marcelin, A. G., Morand-Joubert, L., Tamalet, C., Charpentier, C., Peytavin, G., Antoun, Z., Brun-Vézinet, F., and Costagliola, D., on behalf of the ANRS Resistance Study Group (2008). Clinically validated mutation scores for HIV-1 resistance to fosamprenavir/ritonavir. Journal of Antimicrobial Chemotherapy, 61(6):1362–1368.
  • Pellegrin et al., (2007) Pellegrin, I., Breilh, D., Coureau, G., Boucher, S., Neau, D., Merel, P., Lacoste, D., Fleury, H., Saux, M.-C., Pellegrin, J.-L., Lazaro, E., Dabis, F., and Thiébaut, R. (2007). Interpretation of genotype and pharmacokinetics for resistance to fosamprenavir-ritonavir-based regimens in antiretroviral-experienced patients. Antimicrobial Agents and Chemotherapy, 51(4):1473–1480.
  • Ren and Barber, (2024) Ren, Z. and Barber, R. F. (2024). Derandomised knockoffs: leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1):122–154.
  • Ren et al., (2023) Ren, Z., Wei, Y., and Candès, E. (2023). Derandomizing knockoffs. Journal of the American Statistical Association, 118(542):948–958.
  • Rhee et al., (2003) Rhee, S.-Y., Gonzales, M. J., Kantor, R., Betts, B. J., Ravela, J., and Shafer, R. W. (2003). Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Research, 31(1):298–303.
  • Sesia et al., (2018) Sesia, M., Sabatti, C., and Candès, E. J. (2018). Gene hunting with hidden Markov model knockoffs. Biometrika, 106(1):1–18.
  • Storey, (2002) Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B: Statistical Methodology, 64(3):479–498.
  • Storey et al., (2004) Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society Series B: Statistical Methodology, 66(1):187–205.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288.
  • Wang and Ramdas, (2022) Wang, R. and Ramdas, A. (2022). False Discovery Rate Control with E-values. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3):822–852.
  • (27) Xing, X., Zhao, Z., and Liu, J. S. (2023a). Controlling false discovery rate using gaussian mirrors. Journal of the American Statistical Association, 118(541):222–241.
  • (28) Xing, X., Zhao, Z., and Liu, J. S. (2023b). Gaussian mirror for high-dimensional controlled variable selection. https://github.com/BioAlgs/GM.

Appendix A Fast Implementation of One-at-a-time Knockoffs using Ridge Regression

We describe a fast approach for implementing ridge regression for various λ\lambda and then selecting the best λ\lambda via leave-one-out cross-validation (LOOCV). Combining this LOOCV procedure with the speedups of Section 2.3 makes OATK far faster than the BC or GM approaches.

Consider the singular value decomposition 𝑿=𝑼𝑫𝑽\bm{X}=\bm{UDV}^{\top}, where 𝑼n×p\bm{U}\in\mathbb{R}^{n\times p} and 𝑽p×p\bm{V}\in\mathbb{R}^{p\times p} have orthonormal columns, and 𝑫=diag{d1,d2,,dp}\bm{D}=\operatorname{diag}\{d_{1},d_{2},\ldots,d_{p}\} contains the singular values. The OLS and ridge coefficient estimates become

𝜷^OLS=[𝑿𝑿]1𝑿𝒚=𝑽𝑫2𝑽𝑽𝑫𝑼𝒚=𝑽diag{1dj}𝑼𝒚\displaystyle\bm{\widehat{\beta}}_{OLS}=[\bm{X}^{\top}\bm{X}]^{-1}\bm{X}^{\top}\bm{y}=\bm{VD}^{-2}\bm{V}^{\top}\bm{VDU}^{\top}\bm{y}=\bm{V}\operatorname{diag}\bigg{\{}\frac{1}{d_{j}}\bigg{\}}\bm{U}^{\top}\bm{y} (40)
𝜷^λ=[𝑿𝑿+λ𝑰p×p]1𝑿𝒚=[𝑽𝑫2𝑽+λ𝑽𝑽]1𝑽𝑫𝑼𝒚\displaystyle\bm{\widehat{\beta}}_{\lambda}=[\bm{X}^{\top}\bm{X}+\lambda\bm{I}_{p\times p}]^{-1}\bm{X}^{\top}\bm{y}=[\bm{VD}^{2}\bm{V}^{\top}+\lambda\bm{V}\bm{V}^{\top}]^{-1}\bm{VDU}^{\top}\bm{y} (41)
=𝑽[𝑫2+λ]1𝑽𝑽𝑫𝑼𝒚=𝑽diag{djdj2+λ}𝑼𝒚\displaystyle\quad=\bm{V}[\bm{D}^{2}+\lambda]^{-1}\bm{V}^{\top}\bm{VDU}^{\top}\bm{y}=\bm{V}\operatorname{diag}\bigg{\{}\frac{d_{j}}{d_{j}^{2}+\lambda}\bigg{\}}\bm{U}^{\top}\bm{y} (42)

The vector of fitted response values using 𝜷^λ\bm{\widehat{\beta}}_{\lambda} is

𝒚^λ[y^λ,1y^λ,n]=𝑿𝜷^λ=𝑼𝑫𝑽𝑽diag{djdj2+λ}𝑼𝒚=𝑼diag{dj2dj2+λ}𝑼𝒚\displaystyle\bm{\widehat{y}}_{\lambda}\equiv\begin{bmatrix}\widehat{y}_{\lambda,1}\\ \vdots\\ \widehat{y}_{\lambda,n}\end{bmatrix}=\bm{X\widehat{\beta}}_{\lambda}=\bm{UDV}^{\top}\bm{V}\operatorname{diag}\bigg{\{}\frac{d_{j}}{d_{j}^{2}+\lambda}\bigg{\}}\bm{U}^{\top}\bm{y}=\bm{U}\operatorname{diag}\bigg{\{}\frac{d_{j}^{2}}{d_{j}^{2}+\lambda}\bigg{\}}\bm{U}^{\top}\bm{y} (43)

and the corresponding LOOCV error sum of squares is

SSEλ,CV=i=1n(yiy^λi1[𝑼diag{dj2dj2+λ}𝑼]ii)2=i=1n(yiy^λi1𝑼idiag{dj2dj2+λ}𝑼i)2.\displaystyle\operatorname{SSE}_{\lambda,CV}=\sum_{i=1}^{n}\Bigg{(}\frac{y_{i}-\widehat{y}_{\lambda i}}{1-\big{[}\bm{U}\operatorname{diag}\{\frac{d_{j}^{2}}{d_{j}^{2}+\lambda}\}\bm{U}^{\top}\big{]}_{ii}}\Bigg{)}^{2}=\sum_{i=1}^{n}\Bigg{(}\frac{y_{i}-\widehat{y}_{\lambda i}}{1-\bm{U}_{i\cdot}\operatorname{diag}\{\frac{d_{j}^{2}}{d_{j}^{2}+\lambda}\}\bm{U}_{i\cdot}^{\top}}\Bigg{)}^{2}. (44)

Equations (40)-(44) provide a fast method to fit ridge regression models with many different λ\lambda values and select the one that minimizes SSEλ,CV\operatorname{SSE}_{\lambda,CV}.
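
As an illustration, Equations (40)-(44) can be implemented in a few lines; the Python sketch below (ours, not the authors' R package) computes the LOOCV error for a grid of λ\lambda values from a single SVD:

    import numpy as np

    def ridge_loocv_sse(X, y, lambdas):
        U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T
        Uty = U.T @ y
        sse = []
        for lam in lambdas:
            shrink = d**2 / (d**2 + lam)                   # diag{d_j^2/(d_j^2+lam)}
            y_hat = U @ (shrink * Uty)                     # fitted values, Eq. (43)
            h = np.einsum('ij,j,ij->i', U, shrink, U)      # [U diag{...} U^T]_ii
            sse.append(np.sum(((y - y_hat) / (1.0 - h)) ** 2))  # Eq. (44)
        return np.array(sse)

    # Example: pick the lambda minimizing the LOOCV SSE over a log-spaced grid.
    # lambdas = np.logspace(-4, 2, 50)
    # lam_best = lambdas[np.argmin(ridge_loocv_sse(X, y, lambdas))]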

To efficiently compute the knockoff coefficients 𝜷~λ\bm{\widetilde{\beta}}_{\lambda}, we use (40) with

𝚺λ1=𝑽diag{1di2+λ}𝑽,\displaystyle\bm{\Sigma}_{\lambda}^{-1}=\bm{V}\operatorname{diag}\bigg{\{}\frac{1}{d_{i}^{2}+\lambda}\bigg{\}}\bm{V}^{\top}, (45)

so that

[𝚺λ1]jj\displaystyle[\bm{\Sigma}_{\lambda}^{-1}]_{jj} =𝑽jdiag{1di2+λ}𝑽j\displaystyle=\bm{V}_{j\cdot}\operatorname{diag}\bigg{\{}\frac{1}{d_{i}^{2}+\lambda}\bigg{\}}\bm{V}_{j\cdot}^{\top} (46)
σj2\displaystyle\sigma_{j}^{2} =1[𝚺1]jj=1𝑽jdiag{1di2}𝑽j\displaystyle=\frac{1}{[\bm{\Sigma}^{-1}]_{jj}}=\frac{1}{\bm{V}_{j\cdot}\operatorname{diag}\bigg{\{}\frac{1}{d_{i}^{2}}\bigg{\}}\bm{V}_{j\cdot}^{\top}} (47)

To generate the 𝒓j\bm{r}_{j} vectors required in (27), we first generate each 𝒓j\bm{r}_{j} as any random vector (e.g., 𝒩n(𝟎n×1,𝑰n×n)\sim\mathcal{N}_{n}(\bm{0}_{n\times 1},\bm{I}_{n\times n})), then update 𝒓j𝒓j𝑼𝑼𝒓j\bm{r}_{j}\leftarrow\bm{r}_{j}-\bm{UU}^{\top}\bm{r}_{j}, and finally rescale each 𝒓j\bm{r}_{j} to have unit norm. Using this unit-norm version of 𝒓j\bm{r}_{j} in (27) gives

β~λj=β^λj[𝚺λ1]jjσj2β^OLS,j+[𝚺λ1]jjσj𝒓j𝒚.\displaystyle\widetilde{\beta}_{\lambda j}=\widehat{\beta}_{\lambda j}-[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\sigma_{j}^{2}\widehat{\beta}_{OLS,j}+[\bm{\Sigma}_{\lambda}^{-1}]_{jj}\sigma_{j}\bm{r}_{j}^{\top}\bm{y}. (48)
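
Combining Equations (40) and (45)-(48), the knockoff coefficients can be computed without ever forming 𝑿~\bm{\widetilde{X}} explicitly. A Python sketch (ours, under the same SVD notation as above) is:

    import numpy as np

    def oatk_knockoff_coefs(X, y, lam, rng):
        n, p = X.shape
        U, d, Vt = np.linalg.svd(X, full_matrices=False)
        V = Vt.T
        beta_ols = V @ ((U.T @ y) / d)                       # Eq. (40)
        beta_ridge = V @ ((d / (d**2 + lam)) * (U.T @ y))    # Eq. (42)
        inv_jj = np.einsum('ji,i,ji->j', V, 1.0 / (d**2 + lam), V)  # Eq. (46)
        sigma2 = 1.0 / np.einsum('ji,i,ji->j', V, 1.0 / d**2, V)    # Eq. (47)

        # Random directions orthogonal to col(X), rescaled to unit norm.
        R = rng.standard_normal((n, p))
        R -= U @ (U.T @ R)
        R /= np.linalg.norm(R, axis=0)

        # Eq. (48), vectorized over j.
        return (beta_ridge - inv_jj * sigma2 * beta_ols
                + inv_jj * np.sqrt(sigma2) * (R.T @ y))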

Appendix B Technical Proofs

B.1 Proof of Proposition 1

By standard results for ridge regression, we have 𝜷^λ=𝚺λ1𝑿𝒚\bm{\widehat{\beta}}_{\lambda}=\bm{\Sigma}_{\lambda}^{-1}{\bm{X}}^{\top}\bm{y} and

𝝁𝜷^𝔼[𝜷^]=𝚺λ1𝚺𝜷\displaystyle\bm{\mu_{\widehat{\beta}}}\equiv\mathbb{E}[\bm{\widehat{\beta}}]=\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma\beta} (49)
𝚺𝜷^Cov[𝜷^]=σ2𝚺λ1𝑿𝑿𝚺λ1=σ2𝚺λ1𝚺𝚺λ1.\displaystyle\bm{\Sigma_{\widehat{\beta}}}\equiv\mathrm{Cov}[\bm{\widehat{\beta}}]=\sigma^{2}\bm{\Sigma}_{\lambda}^{-1}\bm{X}^{\top}\bm{X}\bm{\Sigma}_{\lambda}^{-1}=\sigma^{2}\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}. (50)

For the OATK coefficients,

𝜷~λ\displaystyle\bm{\widetilde{\beta}}_{\lambda} =𝜷^λ+diag{𝚺λ1}(𝑿~𝑿)𝒚=𝚺λ1𝑿𝒚+diag{𝚺λ1}(𝑿~𝑿)𝒚\displaystyle=\bm{\widehat{\beta}}_{\lambda}+\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}-\bm{X}^{\top})\bm{y}=\bm{\Sigma}_{\lambda}^{-1}\bm{X}^{\top}\bm{y}+\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}-\bm{X}^{\top})\bm{y} (51)
=(𝚺λ1𝑿+diag{𝚺λ1}(𝑿~𝑿))(𝑿𝜷+𝒛)𝒩p(𝝁𝜷~,𝚺𝜷~),\displaystyle=(\bm{\Sigma}_{\lambda}^{-1}\bm{X}^{\top}+\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}-\bm{X}^{\top}))(\bm{X\beta}+\bm{z})\sim\mathcal{N}_{p}(\bm{\mu_{\widetilde{\beta}}},\bm{\Sigma_{\widetilde{\beta}}}), (52)

where

𝝁𝜷~\displaystyle\bm{\mu_{\widetilde{\beta}}} 𝔼[𝜷~]=(𝚺λ1𝑿+diag{𝚺λ1}(𝑿~𝑿))𝑿𝜷\displaystyle\equiv\mathbb{E}[\bm{\widetilde{\beta}}]=(\bm{\Sigma}_{\lambda}^{-1}\bm{X}^{\top}+\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}-\bm{X}^{\top}))\bm{X\beta} (53)
=(𝚺λ1𝑿𝑿+diag{𝚺λ1}(𝑿~𝑿𝑿𝑿))𝜷=(𝚺λ1𝚺diag{𝚺λ1}𝑺)𝜷\displaystyle=(\bm{\Sigma}_{\lambda}^{-1}\bm{X}^{\top}\bm{X}+\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}\bm{X}-\bm{X}^{\top}\bm{X}))\bm{\beta}=(\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}-\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}\bm{S})\bm{\beta} (54)
𝚺𝜷~\displaystyle\bm{\Sigma_{\widetilde{\beta}}} Cov[𝜷~]\displaystyle\equiv\mathrm{Cov}[\bm{\widetilde{\beta}}] (55)
=σ2(𝚺λ1𝑿+diag{𝚺λ1}(𝑿~𝑿))(𝚺λ1𝑿+diag{𝚺λ1}(𝑿~𝑿))\displaystyle=\sigma^{2}(\bm{\Sigma}_{\lambda}^{-1}\bm{X}^{\top}+\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}-\bm{X}^{\top}))(\bm{\Sigma}_{\lambda}^{-1}\bm{X}^{\top}+\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}-\bm{X}^{\top}))^{\top} (56)
=σ2[𝚺λ1𝚺𝚺λ1+𝚺λ1𝑿(𝑿~𝑿)diag{𝚺λ1}+\displaystyle=\sigma^{2}[\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}+\bm{\Sigma}_{\lambda}^{-1}\bm{X}^{\top}(\bm{\widetilde{X}}-\bm{X})\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}+ (57)
diag{𝚺λ1}(𝑿~𝑿)𝑿𝚺λ1+diag{𝚺λ1}(𝑿~𝑿)(𝑿~𝑿)diag{𝚺λ1}]\displaystyle\quad\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}-\bm{X}^{\top})\bm{X}\bm{\Sigma}_{\lambda}^{-1}+\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}-\bm{X}^{\top})(\bm{\widetilde{X}}-\bm{X})\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}] (58)
=σ2[𝚺λ1𝚺𝚺λ1𝚺λ1𝑺diag{𝚺λ1}diag{𝚺λ1}𝑺𝚺λ1+\displaystyle=\sigma^{2}[\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}-\bm{\Sigma}_{\lambda}^{-1}\bm{S}\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}-\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}\bm{S}\bm{\Sigma}_{\lambda}^{-1}+ (59)
diag{𝚺λ1}(𝑿~𝑿~𝚺+2𝑺)diag{𝚺λ1}]\displaystyle\quad\quad\quad\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}\bm{\widetilde{X}}-\bm{\Sigma}+2\bm{S})\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}] (60)

so that the marginal variances of β^λj\widehat{\beta}_{\lambda j} and β~λj\widetilde{\beta}_{\lambda j} are given by

diag{𝚺𝜷^}=diag{𝚺𝜷~}=σ2diag{𝚺λ1𝚺𝚺λ1}.\displaystyle\operatorname{diag}\{\bm{\Sigma_{\widehat{\beta}}}\}=\operatorname{diag}\{\bm{\Sigma_{\widetilde{\beta}}}\}=\sigma^{2}\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}\}. (61)

We also have

Cov[𝜷~λ,𝜷^λ]\displaystyle\mathrm{Cov}[\bm{\widetilde{\beta}}_{\lambda},\bm{\widehat{\beta}}_{\lambda}] =𝔼[(𝚺λ1𝑿+diag{𝚺λ1}(𝑿~𝑿))𝒛𝒛(𝑿𝚺λ1)]\displaystyle=\mathbb{E}\big{[}(\bm{\Sigma}_{\lambda}^{-1}\bm{X}^{\top}+\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}(\bm{\widetilde{X}}^{\top}-\bm{X}^{\top}))\bm{z}\bm{z}^{\top}(\bm{X}\bm{\Sigma}_{\lambda}^{-1})\big{]} (62)
=σ2(𝚺λ1𝚺𝚺λ1diag{𝚺λ1}𝑺𝚺λ1).\displaystyle=\sigma^{2}(\bm{\Sigma}_{\lambda}^{-1}\bm{\Sigma}\bm{\Sigma}_{\lambda}^{-1}-\operatorname{diag}\{\bm{\Sigma}_{\lambda}^{-1}\}\bm{S}\bm{\Sigma}_{\lambda}^{-1}). (63)

B.2 Proof of Proposition 2

Recall that we can decompose the one-at-a-time knockoff 𝒙~j\bm{\widetilde{x}}_{j} as

𝒙~j=𝑼\j𝒃j+𝒖jbj+𝒓j.\displaystyle\bm{\widetilde{x}}_{j}=\bm{U}_{\backslash j}\bm{b}_{j}+\bm{u}_{j}b_{j}+\bm{r}_{j}. (64)

With the above representation, Condition (i) in (19) becomes

𝑿\j𝒙j=𝑿\j𝒙~j=𝑽\j𝑫\j𝑼\j(𝑼\j𝒃j+𝒖jbj+𝒓j)=𝑽\j𝑫\j𝒃j\displaystyle\bm{X}_{\backslash j}^{\top}\bm{x}_{j}=\bm{X}_{\backslash j}^{\top}\bm{\widetilde{x}}_{j}=\bm{V}_{\backslash j}\bm{D}_{\backslash j}\bm{U}_{\backslash j}^{\top}(\bm{U}_{\backslash j}\bm{b}_{j}+\bm{u}_{j}b_{j}+\bm{r}_{j})=\bm{V}_{\backslash j}\bm{D}_{\backslash j}\bm{b}_{j} (65)

so that

𝒃j=𝑫\j1𝑽\j𝑿\j𝒙j=𝑫\j1𝑽\j𝑽\j𝑫\j𝑼\j𝒙j=𝑼\j𝒙j.\displaystyle\bm{b}_{j}=\bm{D}_{\backslash j}^{-1}\bm{V}_{\backslash j}^{\top}\bm{X}_{\backslash j}^{\top}\bm{x}_{j}=\bm{D}_{\backslash j}^{-1}\bm{V}_{\backslash j}^{\top}\bm{V}_{\backslash j}\bm{D}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}=\bm{U}_{\backslash j}^{\top}\bm{x}_{j}. (66)

Notice that σj=0\sigma_{j}=0 if and only if 𝒙j\bm{x}_{j} lies in the column space of 𝑿\j\bm{X}_{\backslash j}, in which case the term 𝒖jbj\bm{u}_{j}b_{j} in (64) disappears, and we can represent 𝒙~j=𝑼\j𝒃j+𝒓j\bm{\widetilde{x}}_{j}=\bm{U}_{\backslash j}\bm{b}_{j}+\bm{r}_{j}, where 𝑼\j𝒃j=𝑼\j𝑼\j𝒙j=𝒙j\bm{U}_{\backslash j}\bm{b}_{j}=\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}=\bm{x}_{j} by (66). In this case, Condition (iii) in (19) implies 𝒓j=0\lVert\bm{r}_{j}\rVert=0, and the only knockoff that satisfies Condition (i) and Condition (iii) is 𝒙~j=𝒙j\bm{\widetilde{x}}_{j}=\bm{x}_{j}. This is the primary reason we must assume 𝑿\bm{X} is full rank.

Next, substituting (66) into (64), Condition (iii) in (19) becomes

1\displaystyle 1 =𝒙~j𝒙~j=[𝑼\j𝑼\j𝒙j+𝒖jbj+𝒓j][𝑼\j𝑼\j𝒙j+𝒖jbj+𝒓j]\displaystyle=\bm{\widetilde{x}}_{j}^{\top}\bm{\widetilde{x}}_{j}=[\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}+\bm{u}_{j}b_{j}+\bm{r}_{j}]^{\top}[\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}+\bm{u}_{j}b_{j}+\bm{r}_{j}] (67)
=𝑼\j𝑼\j𝒙j2+𝒖jbj2+𝒓j2\displaystyle=\lVert\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}\rVert^{2}+\lVert\bm{u}_{j}b_{j}\rVert^{2}+\lVert\bm{r}_{j}\rVert^{2} (68)
=(1σj2)+bj2+𝒓j2\displaystyle=(1-\sigma^{2}_{j})+b_{j}^{2}+\lVert\bm{r}_{j}\rVert^{2} (69)

or 𝒓j2=σj2bj2\lVert\bm{r}_{j}\rVert^{2}=\sigma^{2}_{j}-b_{j}^{2}, which also implies that we need |bj||σj|\lvert b_{j}\rvert\leq\lvert\sigma_{j}\rvert. Using the preceding, Condition (ii) in Equation (19) becomes

1sj\displaystyle 1-s_{j} =𝒙j𝒙~j=𝒙j[𝑼\j𝑼\j𝒙j+𝒖jbj+𝒓j]\displaystyle=\bm{x}_{j}^{\top}\bm{\widetilde{x}}_{j}=\bm{x}_{j}^{\top}[\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}+\bm{u}_{j}b_{j}+\bm{r}_{j}] (70)
=𝑼\j𝑼\j𝒙j2+𝒙j(𝒙j𝑼\j𝑼\j𝒙j)bjσj\displaystyle=\lVert\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}\rVert^{2}+\frac{\bm{x}_{j}^{\top}(\bm{x}_{j}-\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j})b_{j}}{\sigma_{j}} (71)
=(1σj2)+σjbj,\displaystyle=(1-\sigma^{2}_{j})+\sigma_{j}b_{j}, (72)

or bj=σjsj/σjb_{j}=\sigma_{j}-s_{j}/\sigma_{j}, which requires 0sj2σj20\leq s_{j}\leq 2\sigma_{j}^{2}. Finally,

𝒓j2=σj2(σjsjσj)2=2sjsj2σj2.\displaystyle\lVert\bm{r}_{j}\rVert^{2}=\sigma_{j}^{2}-\left(\sigma_{j}-\frac{s_{j}}{\sigma_{j}}\right)^{2}=2s_{j}-\frac{s_{j}^{2}}{\sigma_{j}^{2}}. (73)
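
As a quick numerical sanity check of this construction (not part of the proof; the code and variable names are ours), one can verify conditions (i)-(iii) directly, using the fact that 𝒓j\bm{r}_{j} is orthogonal to both the column space of 𝑿\j\bm{X}_{\backslash j} and 𝒖j\bm{u}_{j}:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, j = 50, 5, 2
    X = rng.standard_normal((n, p))
    X /= np.linalg.norm(X, axis=0)

    X_mj = np.delete(X, j, axis=1)
    U_mj, _, _ = np.linalg.svd(X_mj, full_matrices=False)
    xj = X[:, j]
    resid = xj - U_mj @ (U_mj.T @ xj)
    sigma_j = np.linalg.norm(resid)
    uj = resid / sigma_j
    s_j = min(0.5, 2.0 * sigma_j**2)          # any s_j in [0, 2 sigma_j^2]

    b_j = sigma_j - s_j / sigma_j
    r = rng.standard_normal(n)
    r -= U_mj @ (U_mj.T @ r)                  # orthogonal to col(X_{\j}) ...
    r -= uj * (uj @ r)                        # ... and to u_j
    r *= np.sqrt(2.0 * s_j - s_j**2 / sigma_j**2) / np.linalg.norm(r)

    x_tilde = U_mj @ (U_mj.T @ xj) + uj * b_j + r
    print(np.allclose(X_mj.T @ x_tilde, X_mj.T @ xj))  # condition (i)
    print(np.isclose(xj @ x_tilde, 1.0 - s_j))         # condition (ii)
    print(np.isclose(x_tilde @ x_tilde, 1.0))          # condition (iii)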

B.3 Proof of Lemma 2.3

Define 𝑨j=sj𝑰M+sj(1sj/σj2)𝒆M𝒆M\bm{A}_{j}=s_{j}\bm{I}_{M}+s_{j}(1-{s_{j}}/{\sigma^{2}_{j}})\bm{e}_{M}\bm{e}_{M}^{\top}. It can be checked that 𝑨j\bm{A}_{j} has two eigenvalues: sj(1sj/σj2)M+sjs_{j}(1-s_{j}/\sigma_{j}^{2})M+s_{j}, which corresponds to a unique eigenvector, and sjs_{j}, which has multiplicity M1M-1. It follows directly that 𝑨j\bm{A}_{j} is positive semi-definite when sj[0,M+1Mσj2]s_{j}\in[0,\frac{M+1}{M}\sigma_{j}^{2}].
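
A short numerical check of these stated eigenvalues (our illustration, not part of the proof):

    import numpy as np

    M, s_j, sigma2_j = 4, 0.3, 0.5
    e = np.ones((M, 1))
    A = s_j * np.eye(M) + s_j * (1.0 - s_j / sigma2_j) * (e @ e.T)
    # Expect s_j (multiplicity M-1) and s_j*(1 - s_j/sigma2_j)*M + s_j (once).
    print(np.sort(np.linalg.eigvalsh(A)), s_j * (1.0 - s_j / sigma2_j) * M + s_j)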

Next, we verify that the knockoff copies constructed in (32) satisfy the conditions in (30). For condition (i), we have

𝑿\j𝑿~(j)=[sjσj2𝑿\j𝑼\j𝑼\j𝒙j+(1sjσj2)𝑿\j𝒙j]𝒆M=𝑿\j𝒙j𝒆M.\displaystyle{\bm{X}}_{\backslash j}^{\top}\widetilde{{\bm{X}}}^{(j)}=\bigg{[}\frac{s_{j}}{\sigma_{j}^{2}}{\bm{X}}_{\backslash j}^{\top}\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}+\Big{(}1-\frac{s_{j}}{\sigma_{j}^{2}}\Big{)}{\bm{X}}_{\backslash j}^{\top}\bm{x}_{j}\bigg{]}\bm{e}_{M}^{\top}={\bm{X}}_{\backslash j}^{\top}\bm{x}_{j}\bm{e}_{M}^{\top}. (74)

For condition (ii), we have

𝒙j𝑿~(j)=[sjσj2(1σj2)+(1sjσj2)]𝒆M=(1sj)𝒆M.\displaystyle\bm{x}_{j}^{\top}\widetilde{{\bm{X}}}^{(j)}=\bigg{[}\frac{s_{j}}{\sigma_{j}^{2}}(1-\sigma_{j}^{2})+\Big{(}1-\frac{s_{j}}{\sigma_{j}^{2}}\Big{)}\bigg{]}\bm{e}_{M}^{\top}=(1-s_{j})\bm{e}_{M}^{\top}. (75)

Finally, we check

(𝑿~(j))𝑿~(j)\displaystyle(\widetilde{{\bm{X}}}^{(j)})^{\top}\widetilde{{\bm{X}}}^{(j)} =𝒆M[sjσj2𝑼\j𝑼\j𝒙j+(1sjσj2)𝒙j][sjσj2𝑼\j𝑼\j𝒙j+(1sjσj2)𝒙j]𝒆M+𝑪𝑪\displaystyle=\bm{e}_{M}\bigg{[}\frac{s_{j}}{\sigma_{j}^{2}}\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}+\Big{(}1-\frac{s_{j}}{\sigma_{j}^{2}}\Big{)}\bm{x}_{j}\bigg{]}^{\top}\bigg{[}\frac{s_{j}}{\sigma_{j}^{2}}\bm{U}_{\backslash j}\bm{U}_{\backslash j}^{\top}\bm{x}_{j}+\Big{(}1-\frac{s_{j}}{\sigma_{j}^{2}}\Big{)}\bm{x}_{j}\bigg{]}\bm{e}_{M}^{\top}+\bm{C}^{\top}\bm{C} (76)
=[1σj2+(1sjσj2)2σj2]𝒆M𝒆M+sj𝑰M+sj(1sjσj2)𝒆M𝒆M\displaystyle=\bigg{[}1-\sigma^{2}_{j}+\Big{(}1-\frac{s_{j}}{\sigma_{j}^{2}}\Big{)}^{2}\sigma_{j}^{2}\bigg{]}\bm{e}_{M}\bm{e}_{M}^{\top}+s_{j}\bm{I}_{M}+s_{j}\Big{(}1-\frac{s_{j}}{\sigma_{j}^{2}}\Big{)}\bm{e}_{M}\bm{e}_{M}^{\top} (77)
=sj𝑰M+(1sj)𝒆M𝒆M,\displaystyle=s_{j}\bm{I}_{M}+(1-s_{j})\bm{e}_{M}\bm{e}_{M}^{\top}, (78)

which verifies conditions (iii) and (iv). We have therefore completed the proof.

B.4 Proof of Lemma 2.4

Given a permutation π\pi on {0,1,,M}\{0,1,\cdots,M\} and any j0j\in\mathcal{H}_{0}, we first show that

(β^j(0),β^j(1),,β^j(M))=d(β^j(π(0)),β^j(π(1)),,β^j(π(M))).\displaystyle(\widehat{\beta}^{(0)}_{j},\widehat{\beta}^{(1)}_{j},\ldots,\widehat{\beta}^{(M)}_{j})\stackrel{{\scriptstyle\textnormal{d}}}{{=}}(\widehat{\beta}^{(\pi(0))}_{j},\widehat{\beta}^{(\pi(1))}_{j},\ldots,\widehat{\beta}^{(\pi(M))}_{j}). (79)

Recall that β^j(m)=f((𝑿~(j,m))𝑿~(j,m),(𝑿~(j,m))𝒚)\widehat{\beta}^{(m)}_{j}=f\big{(}(\widetilde{\bm{X}}^{(j,m)})^{\top}\widetilde{\bm{X}}^{(j,m)},(\widetilde{\bm{X}}^{(j,m)})^{\top}\bm{y}\big{)} for some function ff. By construction, (𝑿~(j,m))𝑿~(j,m)=𝑿𝑿(\widetilde{\bm{X}}^{(j,m)})^{\top}\widetilde{\bm{X}}^{(j,m)}=\bm{X}^{\top}\bm{X} and 𝒚𝑿~(j,m)=(𝒚𝑿\j,𝒚𝒙~j(m))\bm{y}^{\top}\widetilde{\bm{X}}^{(j,m)}=(\bm{y}^{\top}\bm{X}_{\backslash j},\bm{y}^{\top}\widetilde{\bm{x}}_{j}^{(m)}), where we write 𝒙~j(0)=𝒙j\widetilde{\bm{x}}_{j}^{(0)}=\bm{x}_{j}. It then suffices to prove that

(𝒚𝑿\j,𝒚𝒙j,𝒚𝒙~j(1),,𝒚𝒙~j(M))=d(𝒚𝑿\j,𝒚𝒙~j(π(0)),𝒚𝒙~j(π(1)),,𝒚𝒙~j(π(M))).\displaystyle(\bm{y}^{\top}\bm{X}_{\backslash j},~{}\bm{y}^{\top}\bm{x}_{j},\bm{y}^{\top}\bm{\widetilde{x}}_{j}^{(1)},\ldots,~{}\bm{y}^{\top}\bm{\widetilde{x}}_{j}^{(M)})\stackrel{{\scriptstyle\textnormal{d}}}{{=}}(\bm{y}^{\top}\bm{X}_{\backslash j},~{}\bm{y}^{\top}\bm{\widetilde{x}}_{j}^{(\pi(0))},\bm{y}^{\top}\bm{\widetilde{x}}_{j}^{(\pi(1))},\ldots,~{}\bm{y}^{\top}\bm{\widetilde{x}}_{j}^{(\pi(M))}).

For any m[M]m\in[M], we can explicitly write

𝒚𝒙~j(m)=\displaystyle\bm{y}^{\top}\bm{\widetilde{x}}_{j}^{(m)}= (kj𝒙kβk)𝒙~j(m)+𝒛𝒙~j(m)=kjβk𝒙k𝒙~j(m)+𝒛𝒙~j(m)\displaystyle\bigg{(}\sum_{k\neq j}\bm{x}_{k}\beta_{k}\bigg{)}^{\top}\bm{\widetilde{x}}_{j}^{(m)}+\bm{z}^{\top}\bm{\widetilde{x}}_{j}^{(m)}=\sum_{k\neq j}\beta_{k}\bm{x}_{k}^{\top}\bm{\widetilde{x}}_{j}^{(m)}+\bm{z}^{\top}\bm{\widetilde{x}}_{j}^{(m)}
=\displaystyle= kjβk𝒙k𝒙j+𝒛𝒙~j(m)=𝜷𝑿𝒙j+𝒛𝒙~j(m),\displaystyle\sum_{k\neq j}\beta_{k}\bm{x}_{k}^{\top}\bm{x}_{j}+\bm{z}^{\top}\bm{\widetilde{x}}_{j}^{(m)}=\bm{\beta}^{\top}\bm{X}^{\top}\bm{x}_{j}+\bm{z}^{\top}\bm{\widetilde{x}}_{j}^{(m)},

where the second-to-last step is due to the construction of 𝒙~j(m)\bm{\widetilde{x}}_{j}^{(m)}, and the last step is because βj=0\beta_{j}=0 under the null. We then have

(𝒚𝑿\j,𝒚𝒙j,𝒚𝒙~j(1),,𝒚𝒙~j(M))𝒩(𝝁(j),𝚺(j)),\displaystyle(\bm{y}^{\top}\bm{X}_{\backslash j},~{}\bm{y}^{\top}\bm{x}_{j},~{}\bm{y}^{\top}\bm{\widetilde{x}}_{j}^{(1)},\ldots,\bm{y}^{\top}\bm{\widetilde{x}}_{j}^{(M)})^{\top}\sim\mathcal{N}(\bm{\mu}^{(j)},\bm{\Sigma}^{(j)}), (80)

where (𝝁(j))=(𝜷𝑿𝑿\j,𝜷𝑿𝒙j,,𝜷𝑿𝒙j)(\bm{\mu}^{(j)})^{\top}=(\bm{\beta}^{\top}\bm{X}^{\top}\bm{X}_{\backslash j},\bm{\beta}^{\top}\bm{X}^{\top}\bm{x}_{j},\cdots,\bm{\beta}^{\top}\bm{X}^{\top}\bm{x}_{j}) and

𝚺(j)=(𝚺\j,\j𝚺\j,j𝒆M+1𝒆M+1𝚺j,\j(1sj)𝒆M+1𝒆M+1+sj𝑰M+1),\displaystyle\bm{\Sigma}^{(j)}=\begin{pmatrix}\bm{\Sigma}_{\backslash j,\backslash j}&\bm{\Sigma}_{\backslash j,j}\bm{e}_{M+1}^{\top}\\ \bm{e}_{M+1}\bm{\Sigma}_{j,\backslash j}&(1-s_{j})\bm{e}_{M+1}\bm{e}_{M+1}^{\top}+s_{j}\bm{I}_{M+1}\end{pmatrix}, (81)

where 𝒆M+1\bm{e}_{M+1} is an (M+1)(M+1)-dimensional vector with all entries being one. It can be seen that the above distribution is invariant to the permutation of 𝒙j,𝒙~j(1),,𝒙~j(M)\bm{\bm{x}}_{j},\bm{\widetilde{x}}_{j}^{(1)},\ldots,\bm{\widetilde{x}}^{(M)}_{j}. We have thus shown (79).

Then for any m{0,1,,M}m\in\{0,1,\ldots,M\},

(pj=mM+1)\displaystyle\mathbb{P}\Big{(}p_{j}=\frac{m}{M+1}\Big{)} =(rank(|β^j(0)|)=m)\displaystyle=\mathbb{P}\Big{(}\text{rank}\big{(}|\widehat{\beta}_{j}^{(0)}|\big{)}=m\Big{)} (82)
=1(M+1)!π:permutation on [M+1](rank(|β^j(π(0))|)=m)\displaystyle=\frac{1}{(M+1)!}\sum_{\pi:\text{permutation on }[M+1]}\mathbb{P}\Big{(}\text{rank}\big{(}|\widehat{\beta}_{j}^{(\pi(0))}|\big{)}=m\Big{)} (83)
=𝔼[1(M+1)!π:permutation on [M+1]𝟏{rank(|β^j(π(0))|)=m}]\displaystyle=\mathbb{E}\Bigg{[}\frac{1}{(M+1)!}\sum_{\pi:\text{permutation on }[M+1]}\mathbf{1}\Big{\{}\text{rank}\big{(}|\widehat{\beta}_{j}^{(\pi(0))}|\big{)}=m\Big{\}}\Bigg{]} (84)
=M!(M+1)!=1M+1.\displaystyle=\frac{M!}{(M+1)!}=\frac{1}{M+1}. (85)

We have therefore completed the proof.

B.5 Proof of Proposition 3

If ϕj(t^j;𝑺j)0\phi_{j}(\widehat{t}_{j};\bm{S}_{j})\leq 0, then

𝔼[ej]\displaystyle\mathbb{E}[e_{j}] =𝔼[p𝟏{t^jWjTβ}c+[p]𝟏{WTβ}]\displaystyle=\mathbb{E}\bigg{[}\frac{p\mathbf{1}\{\widehat{t}_{j}\cdot W_{j}\geq T_{\beta}\}}{c+\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}\bigg{]}
=𝔼[p𝟏{t^jWjTβ}c+[p]𝟏{WTβ}p𝟏{WjTβ}[p]𝟏{WTβ}+p𝟏{WjTβ}[p]𝟏{WTβ}]\displaystyle=\mathbb{E}\bigg{[}\frac{p\mathbf{1}\{\widehat{t}_{j}\cdot W_{j}\geq T_{\beta}\}}{c+\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}-\frac{p\mathbf{1}\{W_{j}\leq-T_{\beta}\}}{\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}+\frac{p\mathbf{1}\{W_{j}\leq-T_{\beta}\}}{\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}\bigg{]}
=𝔼[ϕj(t^j;𝑺j)]+𝔼[p𝟏{WjTβ}[p]𝟏{WTβ}]\displaystyle=\mathbb{E}[\phi_{j}(\widehat{t}_{j};\bm{S}_{j})]+\mathbb{E}\bigg{[}\frac{p\mathbf{1}\{W_{j}\leq-T_{\beta}\}}{\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}\bigg{]}
𝔼[p𝟏{WjTβ}[p]𝟏{WTβ}].\displaystyle\leq\mathbb{E}\bigg{[}\frac{p\mathbf{1}\{W_{j}\leq-T_{\beta}\}}{\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}\bigg{]}.

Now suppose instead that ϕj(t^j;𝑺j)>0\phi_{j}(\widehat{t}_{j};\bm{S}_{j})>0. By definition, there exists an increasing sequence tj,kt^jt_{j,k}\leq\widehat{t}_{j} such that tj,kt^jt_{j,k}\rightarrow\widehat{t}_{j} as kk\rightarrow\infty and ϕ(tj,k;𝑺j)0\phi(t_{j,k};\bm{S}_{j})\leq 0. Then

𝔼[ej]\displaystyle\mathbb{E}[e_{j}] =𝔼[p𝟏{t^jWj>Tβ}c+[p]𝟏{WTβ}]=𝔼[limkp𝟏{tj,kWjTβ}c+[p]𝟏{WTβ}]\displaystyle=\mathbb{E}\bigg{[}\frac{p\mathbf{1}\{\widehat{t}_{j}\cdot W_{j}>T_{\beta}\}}{c+\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}\bigg{]}=\mathbb{E}\bigg{[}\lim_{k\rightarrow\infty}\frac{p\mathbf{1}\{t_{j,k}\cdot W_{j}\geq T_{\beta}\}}{c+\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}\bigg{]}
=(i)limk𝔼[p𝟏{tj,kWjTβ}c+[p]𝟏{WTβ}]\displaystyle\stackrel{{\scriptstyle\text{(i)}}}{{=}}\lim_{k\rightarrow\infty}\mathbb{E}\bigg{[}\frac{p\mathbf{1}\{t_{j,k}\cdot W_{j}\geq T_{\beta}\}}{c+\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}\bigg{]}
=limk𝔼[ϕj(tj,k;𝑺j)]+𝔼[p𝟏{WjTβ}[p]𝟏{WTβ}]\displaystyle=\lim_{k\rightarrow\infty}\mathbb{E}[\phi_{j}(t_{j,k};\bm{S}_{j})]+\mathbb{E}\bigg{[}\frac{p\mathbf{1}\{W_{j}\leq-T_{\beta}\}}{\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}\bigg{]}
𝔼[p𝟏{WjTβ}[p]𝟏{WTβ}],\displaystyle\leq\mathbb{E}\bigg{[}\frac{p\mathbf{1}\{W_{j}\leq-T_{\beta}\}}{\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}\bigg{]},

where step (i) is by the monotone convergence theorem.

In both cases, we have

j0𝔼[ej]j0𝔼[p𝟏{WjTβ}[p]𝟏{WTβ}]p,\displaystyle\sum_{j\in\mathcal{H}_{0}}\mathbb{E}[e_{j}]\leq\sum_{j\in\mathcal{H}_{0}}\mathbb{E}\bigg{[}\frac{p\mathbf{1}\{W_{j}\leq-T_{\beta}\}}{\sum_{\ell\in[p]}\mathbf{1}\{W_{\ell}\leq-T_{\beta}\}}\bigg{]}\leq p,

verifying that the eje_{j}’s are compound e-values. The FDR control of the eBH procedure follows directly from Wang and Ramdas, (2022).
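
For completeness, a minimal Python sketch (ours) of the eBH procedure of Wang and Ramdas, (2022), which rejects the k^\widehat{k} hypotheses with the largest e-values for the largest kk satisfying e(k)p/(αk)e_{(k)}\geq p/(\alpha k), is:

    import numpy as np

    def ebh(e, alpha):
        e = np.asarray(e, dtype=float)
        p = len(e)
        order = np.argsort(e)[::-1]           # e-values in decreasing order
        ks = np.arange(1, p + 1)
        ok = e[order] >= p / (alpha * ks)     # e_(k) >= p / (alpha * k)
        if not ok.any():
            return np.array([], dtype=int)
        k_hat = ks[ok].max()                  # largest k passing the threshold
        return np.sort(order[:k_hat])         # indices of rejected hypotheses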

B.6 Proof of Theorem 3.2

Roughly speaking, the FDR can be well controlled when SI^j1\widehat{\textnormal{SI}}_{j}\leq 1 for all j0j\in\mathcal{H}_{0} with high probability. To formalize this idea, for some ϵ>0\epsilon>0, we decompose the FDR as

FDR =𝔼[j0𝟏{WjTα}j[p]𝟏{WjTα}]\displaystyle=\mathbb{E}\left[\frac{\sum_{j\in\mathcal{H}_{0}}\mathbf{1}\{W_{j}\geq T_{\alpha}\}}{\sum_{j\in[p]}\mathbf{1}\{W_{j}\geq T_{\alpha}\}}\right]
𝔼[𝟏{maxj0SI^jeϵ}j0𝟏{WjTα}j[p]𝟏{WjTα}]+(maxj0SI^j>eϵ),\displaystyle\leq\mathbb{E}\left[\mathbf{1}\Big{\{}\max_{j\in\mathcal{H}_{0}}\widehat{\text{SI}}_{j}\leq e^{\epsilon}\Big{\}}\cdot\frac{\sum_{j\in\mathcal{H}_{0}}\mathbf{1}\{W_{j}\geq T_{\alpha}\}}{\sum_{j\in[p]}\mathbf{1}\{W_{j}\geq T_{\alpha}\}}\right]+\mathbb{P}\left(\max_{j\in\mathcal{H}_{0}}\widehat{\text{SI}}_{j}>e^{\epsilon}\right),

where the inequality follows because the FDP is upper bounded by one. By the choice of TαT_{\alpha}, we further bound the first term above as

𝔼[𝟏{maxj0SI^jeϵ}j0𝟏{WjTα}j[p]𝟏{WjTα}]\displaystyle\mathbb{E}\left[\mathbf{1}\Big{\{}\max_{j\in\mathcal{H}_{0}}\widehat{\text{SI}}_{j}\leq e^{\epsilon}\Big{\}}\cdot\frac{\sum_{j\in\mathcal{H}_{0}}\mathbf{1}\{W_{j}\geq T_{\alpha}\}}{\sum_{j\in[p]}\mathbf{1}\{W_{j}\geq T_{\alpha}\}}\right]
\displaystyle\leq~{} α𝔼[𝟏{maxj0SI^jeϵ}j0𝟏{WjTα}1+j0𝟏{WjTα}]\displaystyle\alpha\mathbb{E}\left[\mathbf{1}\Big{\{}\max_{j\in\mathcal{H}_{0}}\widehat{\text{SI}}_{j}\leq e^{\epsilon}\Big{\}}\frac{\sum_{j\in\mathcal{H}_{0}}\mathbf{1}\{W_{j}\geq T_{\alpha}\}}{1+\sum_{j\in\mathcal{H}_{0}}\mathbf{1}\{W_{j}\leq-T_{\alpha}\}}\right]
=\displaystyle=~{} αj0𝔼[𝟏{max0SI^eϵ}𝟏{WjTα}1+0𝟏{WTα}]\displaystyle\alpha\sum_{j\in\mathcal{H}_{0}}\mathbb{E}\left[\mathbf{1}\Big{\{}\max_{\ell\in\mathcal{H}_{0}}\widehat{\text{SI}}_{\ell}\leq e^{\epsilon}\Big{\}}\frac{\mathbf{1}\{W_{j}\geq T_{\alpha}\}}{1+\sum_{\ell\in\mathcal{H}_{0}}\mathbf{1}\{W_{\ell}\leq-T_{\alpha}\}}\right]
\displaystyle\leq~{} αj0𝔼[𝟏{SI^jeϵ}𝟏{WjTα}1+0𝟏{WTα}]\displaystyle\alpha\sum_{j\in\mathcal{H}_{0}}\mathbb{E}\left[\frac{\mathbf{1}\{\widehat{\text{SI}}_{j}\leq e^{\epsilon}\}\cdot\mathbf{1}\{W_{j}\geq T_{\alpha}\}}{1+\sum_{\ell\in\mathcal{H}_{0}}\mathbf{1}\{W_{\ell}\leq-T_{\alpha}\}}\right]

Since the threshold TαT_{\alpha} is determined by (W1,,Wp)(W_{1},\ldots,W_{p}), we can write Tα=𝒯(W1,,Wp)T_{\alpha}=\mathcal{T}(W_{1},\ldots,W_{p}) for some mapping 𝒯\mathcal{T}. For each j0j\in\mathcal{H}_{0}, we further define Tj=𝒯(W1,,|Wj|,,Wp)T_{j}=\mathcal{T}(W_{1},\ldots,|W_{j}|,\ldots,W_{p}). On the event {WjTα}\{W_{j}\geq T_{\alpha}\}, Wj=|Wj|W_{j}=|W_{j}| and therefore Tα=TjT_{\alpha}=T_{j}, deterministically. As a result,

𝔼[𝟏{WjTα}𝟏{SI^jeϵ}1+0𝟏{WTα}]=𝔼[𝟏{WjTj}𝟏{SI^jeϵ}1+0\{j}𝟏{WTj}].\displaystyle\mathbb{E}\left[\frac{\mathbf{1}\{W_{j}\geq T_{\alpha}\}\cdot\mathbf{1}\{\widehat{\text{SI}}_{j}\leq e^{\epsilon}\}}{1+\sum_{\ell\in\mathcal{H}_{0}}\mathbf{1}\{W_{\ell}\leq-T_{\alpha}\}}\right]=\mathbb{E}\left[\frac{\mathbf{1}\{W_{j}\geq T_{j}\}\cdot\mathbf{1}\{\widehat{\text{SI}}_{j}\leq e^{\epsilon}\}}{1+\sum_{\ell\in\mathcal{H}_{0}\backslash\{j\}}\mathbf{1}\{W_{\ell}\leq-T_{j}\}}\right].

Recall that, by definition,

SI^j=(Wj>0||Wj|,W\j)(Wj<0||Wj|,W\j).\displaystyle\widehat{\text{SI}}_{j}=\frac{\mathbb{P}(W_{j}>0\,\big{|}\,|W_{j}|,W_{\backslash j}\big{)}}{\mathbb{P}(W_{j}<0\,\big{|}\,|W_{j}|,W_{\backslash j}\big{)}}.

Then by the law of total expectation,

𝔼[𝟏{WjTj}𝟏{SI^jeϵ}1+0\{j}𝟏{WTj}]\displaystyle\mathbb{E}\left[\frac{\mathbf{1}\{W_{j}\geq T_{j}\}\cdot\mathbf{1}\{\widehat{\text{SI}}_{j}\leq e^{\epsilon}\}}{1+\sum_{\ell\in\mathcal{H}_{0}\backslash\{j\}}\mathbf{1}\{W_{\ell}\leq-T_{j}\}}\right]
=\displaystyle=~{} 𝔼[𝔼[𝟏{WjTj}𝟏{SI^jeϵ}1+0\{j}𝟏{WTj}||Wj|,W\j]]\displaystyle\mathbb{E}\left[\mathbb{E}\left[\frac{\mathbf{1}\{W_{j}\geq T_{j}\}\cdot\mathbf{1}\{\widehat{\text{SI}}_{j}\leq e^{\epsilon}\}}{1+\sum_{\ell\in\mathcal{H}_{0}\backslash\{j\}}\mathbf{1}\{W_{\ell}\leq-T_{j}\}}\,\bigg{|}\,|W_{j}|,W_{\backslash j}\right]\right]
=\displaystyle=~{} 𝔼[𝟏{|Wj|Tj}𝟏{SI^jeϵ}1+0\{j}𝟏{WTj}(Wj>0||Wj|,W\j)],\displaystyle\mathbb{E}\left[\frac{\mathbf{1}\{|W_{j}|\geq T_{j}\}\cdot\mathbf{1}\{\widehat{\text{SI}}_{j}\leq e^{\epsilon}\}}{1+\sum_{\ell\in\mathcal{H}_{0}\backslash\{j\}}\mathbf{1}\{W_{\ell}\leq-T_{j}\}}\cdot\mathbb{P}(W_{j}>0\,\big{|}\,|W_{j}|,W_{\backslash j}\big{)}\right], (86)

where the last step is because TjT_{j}, |Wj||W_{j}|, W\jW_{\backslash j}, and SI^j\widehat{\textnormal{SI}}_{j} are deterministic given |Wj||W_{j}| and W\jW_{\backslash j}. We then have

(B.6)\displaystyle\eqref{eq:perj_fdr} eϵ𝔼[𝟏{|Wj|Tj}1+0\{j}𝟏{WTj}(Wj<0||Wj|,W\j)]\displaystyle\leq e^{\epsilon}\cdot\mathbb{E}\left[\frac{\mathbf{1}\{|W_{j}|\geq T_{j}\}}{1+\sum_{\ell\in\mathcal{H}_{0}\backslash\{j\}}\mathbf{1}\{W_{\ell}\leq-T_{j}\}}\cdot\mathbb{P}(W_{j}<0\,\big{|}\,|W_{j}|,W_{\backslash j}\big{)}\right]
=eϵ𝔼[𝟏{WjTj}0𝟏{WTj}]\displaystyle=e^{\epsilon}\cdot\mathbb{E}\left[\frac{\mathbf{1}\{W_{j}\leq-T_{j}\}}{\sum_{\ell\in\mathcal{H}_{0}}\mathbf{1}\{W_{\ell}\leq-T_{j}\}}\right]
=eϵ𝔼[𝟏{WjTj}0𝟏{WT}],\displaystyle=e^{\epsilon}\cdot\mathbb{E}\left[\frac{\mathbf{1}\{W_{j}\leq-T_{j}\}}{\sum_{\ell\in\mathcal{H}_{0}}\mathbf{1}\{W_{\ell}\leq-T_{\ell}\}}\right],

where the last step follows from Barber et al., (2020, Lemma 6). Summing over j0j\in\mathcal{H}_{0} and combining everything above, we arrive at

FDReϵα+(maxj0SI^j>eϵ).\displaystyle\textnormal{FDR}\leq e^{\epsilon}\cdot\alpha+\mathbb{P}\left(\max_{j\in\mathcal{H}_{0}}\widehat{\textnormal{SI}}_{j}>e^{\epsilon}\right).

B.7 Proof of Theorem 3.1

We start by defining the following empirical quantities:

V(t)=1p0j0𝟏{Wjt},V+(t)=1p0j0𝟏{Wjt},\displaystyle V_{-}(t)=\frac{1}{p_{0}}\sum_{j\in\mathcal{H}_{0}}\mathbf{1}\{W_{j}\leq-t\},~{}V_{+}(t)=\frac{1}{p_{0}}\sum_{j\in\mathcal{H}_{0}}\mathbf{1}\{W_{j}\geq t\},
R(t)=1pj[p]𝟏{Wjt},R+(t)=1pj[p]𝟏{Wjt},\displaystyle R_{-}(t)=\frac{1}{p}\sum_{j\in[p]}\mathbf{1}\{W_{j}\leq-t\},~{}R_{+}(t)=\frac{1}{p}\sum_{j\in[p]}\mathbf{1}\{W_{j}\geq t\},

as well as their population counterparts:

F0(t)=1p0j0(Wjt),G(t)=1pj[p](Wjt),G+(t)=1pj[p](Wjt).\displaystyle F_{0}(t)=\frac{1}{p_{0}}\sum_{j\in\mathcal{H}_{0}}\mathbb{P}(W_{j}\geq t),~{}G_{-}(t)=\frac{1}{p}\sum_{j\in[p]}\mathbb{P}(W_{j}\leq-t),~{}G_{+}(t)=\frac{1}{p}\sum_{j\in[p]}\mathbb{P}(W_{j}\geq t).

By symmetry, we have that F0(t)=𝔼[V(t)]=𝔼[V+(t)]F_{0}(t)=\mathbb{E}[V_{-}(t)]=\mathbb{E}[V_{+}(t)]. We also let

FDP(t)=p0V+(t)max(pR+(t),1),FDP^(t)=pR(t)max(pR+(t),1),FDP¯(t)=π0F0(t)G+(t),FDP~(t)=G(t)G+(t)\displaystyle\textnormal{FDP}(t)=\frac{p_{0}V_{+}(t)}{\max(pR_{+}(t),1)},~{}\widehat{\textnormal{FDP}}(t)=\frac{pR_{-}(t)}{\max(pR_{+}(t),1)},~{}\overline{\textnormal{FDP}}(t)=\frac{\pi_{0}F_{0}(t)}{G_{+}(t)},~{}\widetilde{\textnormal{FDP}}(t)=\frac{G_{-}(t)}{G_{+}(t)}

which correspond to the true FDP, the estimated FDP, and their respective asymptotic counterparts.
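
To make these quantities concrete, the estimated FDP and the resulting threshold TαT_{\alpha} can be computed from the statistics W1,,WpW_{1},\ldots,W_{p} as in the following Python sketch (ours; setting c=0c=0 recovers FDP^(t)\widehat{\textnormal{FDP}}(t) above, while c=1c=1 gives the offset version used in Theorem 3.2):

    import numpy as np

    def fdp_hat(W, t, c=0):
        return (c + np.sum(W <= -t)) / max(np.sum(W >= t), 1)

    def threshold(W, alpha, c=0):
        for t in np.sort(np.abs(W[W != 0])):  # candidate thresholds
            if fdp_hat(W, t, c) <= alpha:
                return t
        return np.inf                         # select nothing if infeasible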

To connect the empirical quantities with their oracle counterparts, we leverage the following lemma adapted from Xing et al., 2023a . Note that the condition and proof in Xing et al., 2023a cover only the uniform convergence of V(t)V_{-}(t); we therefore modify their proof and provide the details in Section B.8.

Lemma B.1 (Lemma 8 of Xing et al., 2023a ).

Under the same assumptions as in Theorem 3.1, and supposing further that F0(t)F_{0}(t) is a continuous function, we have, as p0,p1p_{0},p_{1}\rightarrow\infty,

supt|V(t)F0(t)|p.0,supt|V+(t)F0(t)|p.0,\displaystyle\sup_{t}|V_{-}(t)-F_{0}(t)|\stackrel{{\scriptstyle\textnormal{p.}}}{{\rightarrow}}0,~{}\sup_{t}|V_{+}(t)-F_{0}(t)|\stackrel{{\scriptstyle\textnormal{p.}}}{{\rightarrow}}0,
supt|R(t)G(t)|p.0,supt|R+(t)G+(t)|p.0.\displaystyle\sup_{t}|R_{-}(t)-G_{-}(t)|\stackrel{{\scriptstyle\textnormal{p.}}}{{\rightarrow}}0,~{}\sup_{t}|R_{+}(t)-G_{+}(t)|\stackrel{{\scriptstyle\textnormal{p.}}}{{\rightarrow}}0.

By assumption, for any sufficiently small ε>0\varepsilon>0, there exists a threshold tεt_{\varepsilon} such that

FDP~(tε)αε and G+(tε)>0.\displaystyle\widetilde{\textnormal{FDP}}(t_{\varepsilon})\leq\alpha-\varepsilon\text{ and }G_{+}(t_{\varepsilon})>0.

Also, since π^0:=p0/pπ0\widehat{\pi}_{0}:=p_{0}/p\rightarrow\pi_{0} as pp\rightarrow\infty, there exists sufficiently large m0+m_{0}\in\mathbb{N}_{+} such that for any p0,p1m0p_{0},p_{1}\geq m_{0}, |p0/pπ0|ε|p_{0}/p-\pi_{0}|\leq\varepsilon.

Using Lemma B.1, we have for the fixed tεt_{\varepsilon}, R(tε)p.G(tε)R_{-}(t_{\varepsilon})\stackrel{{\scriptstyle\textnormal{p.}}}{{\rightarrow}}G_{-}(t_{\varepsilon}), and R+(tε)p.G+(tε)R_{+}(t_{\varepsilon})\stackrel{{\scriptstyle\textnormal{p.}}}{{\rightarrow}}G_{+}(t_{\varepsilon}). By the continuous mapping theorem,

FDP^(tε)=R(tε)R+(tε)p.G(tε)G+(tε)=FDP~(tε).\displaystyle\widehat{\textnormal{FDP}}(t_{\varepsilon})=\frac{R_{-}(t_{\varepsilon})}{R_{+}(t_{\varepsilon})}\stackrel{{\scriptstyle\textnormal{p.}}}{{\rightarrow}}\frac{G_{-}(t_{\varepsilon})}{G_{+}(t_{\varepsilon})}=\widetilde{\textnormal{FDP}}(t_{\varepsilon}).

Therefore, for any δ(0,1)\delta\in(0,1), there exists m1+m_{1}\in\mathbb{N}_{+} large enough such that for any p0,p1>m1p_{0},p_{1}>m_{1},

(|FDP^(tε)FDP~(tε)|ε/2)1δ(FDP^(tε)αε/2)1δ.\displaystyle\mathbb{P}\Big{(}\big{|}\widehat{\textnormal{FDP}}(t_{\varepsilon})-\widetilde{\textnormal{FDP}}(t_{\varepsilon})\big{|}\leq\varepsilon/2\Big{)}\geq 1-\delta\Rightarrow\mathbb{P}\big{(}\widehat{\textnormal{FDP}}(t_{\varepsilon})\leq\alpha-\varepsilon/2\big{)}\geq 1-\delta.

On the event {FDP^(tε)αε/2}\{\widehat{\textnormal{FDP}}(t_{\varepsilon})\leq\alpha-\varepsilon/2\}, we have TαtεT_{\alpha}\leq t_{\varepsilon} (recall the definition of TαT_{\alpha}). By monotonicity, R+(Tα)R+(tε)R_{+}(T_{\alpha})\geq R_{+}(t_{\varepsilon}) when TαtεT_{\alpha}\leq t_{\varepsilon}. Meanwhile, since |R+(tε)G+(tε)|p.0|R_{+}(t_{\varepsilon})-G_{+}(t_{\varepsilon})|\stackrel{{\scriptstyle\textnormal{p.}}}{{\rightarrow}}0, we can find m2+m_{2}\in\mathbb{N}_{+} large enough such that when p0,p1m2p_{0},p_{1}\geq m_{2}, with probability at least 1δ1-\delta,

|R+(tε)G+(tε)|G+(tε)/2R+(tε)G+(tε)/2>0.\displaystyle|R_{+}(t_{\varepsilon})-G_{+}(t_{\varepsilon})|\leq G_{+}(t_{\varepsilon})/2\Rightarrow R_{+}(t_{\varepsilon})\geq G_{+}(t_{\varepsilon})/2>0.

Next, on the event {Tαtε,R+(tε)G+(tε)/2}\{T_{\alpha}\leq t_{\varepsilon},R_{+}(t_{\varepsilon})\geq G_{+}(t_{\varepsilon})/2\},

|FDP^(Tα)FDP~(Tα)|\displaystyle\big{|}\widehat{\textnormal{FDP}}(T_{\alpha})-\widetilde{\textnormal{FDP}}(T_{\alpha})\big{|}
\displaystyle\leq~{} |R(Tα)R+(Tα)G(Tα)R+(Tα)|+|G(Tα)R+(Tα)G(Tα)G+(Tα)|\displaystyle\bigg{|}\frac{R_{-}(T_{\alpha})}{R_{+}(T_{\alpha})}-\frac{G_{-}(T_{\alpha})}{R_{+}(T_{\alpha})}\bigg{|}+\bigg{|}\frac{G_{-}(T_{\alpha})}{R_{+}(T_{\alpha})}-\frac{G_{-}(T_{\alpha})}{G_{+}(T_{\alpha})}\bigg{|}
\displaystyle\leq~{} 2G+(tε)|R(Tα)G(Tα)|+G(Tα)|1R+(Tα)1G+(Tα)|\displaystyle\frac{2}{G_{+}(t_{\varepsilon})}\big{|}R_{-}(T_{\alpha})-G_{-}(T_{\alpha})\big{|}+G_{-}(T_{\alpha})\bigg{|}\frac{1}{R_{+}(T_{\alpha})}-\frac{1}{G_{+}(T_{\alpha})}\bigg{|}
=\displaystyle=~{} 2G+(tε)|R(Tα)G(Tα)|+G(Tα)R+(Tα)G+(Tα)|R+(Tα)G+(Tα)|\displaystyle\frac{2}{G_{+}(t_{\varepsilon})}\big{|}R_{-}(T_{\alpha})-G_{-}(T_{\alpha})\big{|}+\frac{G_{-}(T_{\alpha})}{R_{+}(T_{\alpha})G_{+}(T_{\alpha})}\big{|}R_{+}(T_{\alpha})-G_{+}(T_{\alpha})\big{|}
\displaystyle\leq~{} 2G+(tε)2{supttε|R(t)G(t)|+supttε|R+(t)G+(t)|}.\displaystyle\frac{2}{G_{+}(t_{\varepsilon})^{2}}\Big{\{}\sup_{t\leq t_{\varepsilon}}|R_{-}(t)-G_{-}(t)|+\sup_{t\leq t_{\varepsilon}}|R_{+}(t)-G_{+}(t)|\Big{\}}.

Similarly, on the same event, with $p_{0},p_{1}\geq m_{0}$, we have

\displaystyle\Big|\textnormal{FDP}(T_{\alpha})-\overline{\textnormal{FDP}}(T_{\alpha})\Big|
\leq \frac{V_{+}(T_{\alpha})}{R_{+}(T_{\alpha})}|\widehat{\pi}_{0}-\pi_{0}|+\bigg|\frac{\pi_{0}V_{+}(T_{\alpha})}{R_{+}(T_{\alpha})}-\frac{\pi_{0}F_{0}(T_{\alpha})}{R_{+}(T_{\alpha})}\bigg|+\bigg|\frac{\pi_{0}F_{0}(T_{\alpha})}{R_{+}(T_{\alpha})}-\frac{\pi_{0}F_{0}(T_{\alpha})}{G_{+}(T_{\alpha})}\bigg|
\leq \varepsilon+\frac{2}{G_{+}(t_{\varepsilon})}|V_{+}(T_{\alpha})-F_{0}(T_{\alpha})|+F_{0}(T_{\alpha})\bigg|\frac{1}{R_{+}(T_{\alpha})}-\frac{1}{G_{+}(T_{\alpha})}\bigg|
= \varepsilon+\frac{2}{G_{+}(t_{\varepsilon})}|V_{+}(T_{\alpha})-F_{0}(T_{\alpha})|+\frac{F_{0}(T_{\alpha})}{R_{+}(T_{\alpha})G_{+}(T_{\alpha})}\big|R_{+}(T_{\alpha})-G_{+}(T_{\alpha})\big|
\leq \varepsilon+\frac{2}{G_{+}(t_{\varepsilon})}\Big\{\sup_{t\leq t_{\varepsilon}}|V_{+}(t)-F_{0}(t)|+\sup_{t\leq t_{\varepsilon}}|R_{+}(t)-G_{+}(t)|\Big\}.

Again by Lemma B.1, there exists $m_{3}\in\mathbb{N}_{+}$ such that when $p_{0},p_{1}\geq m_{3}$, with probability at least $1-\delta$,

\displaystyle\sup_{t}|V_{+}(t)-F_{0}(t)|\leq G_{+}(t_{\varepsilon})^{2}\varepsilon,\quad\sup_{t}|V_{-}(t)-F_{0}(t)|\leq G_{+}(t_{\varepsilon})^{2}\varepsilon, (87)
\displaystyle\sup_{t}|R_{+}(t)-G_{+}(t)|\leq G_{+}(t_{\varepsilon})^{2}\varepsilon,\quad\sup_{t}|R_{-}(t)-G_{-}(t)|\leq G_{+}(t_{\varepsilon})^{2}\varepsilon. (88)

Let $A$ denote the intersection of $\{T_{\alpha}\leq t_{\varepsilon}\}$, $\{R_{+}(t_{\varepsilon})\geq G_{+}(t_{\varepsilon})/2\}$, and the event in Equations (87) and (88). Taking the union bound, we have $\mathbb{P}(A^{c})\leq 3\delta$. Then, for $p_{0},p_{1}\geq\max(m_{0},m_{1},m_{2},m_{3})$,

\displaystyle\mathbb{P}\big(\textnormal{FDP}(T_{\alpha})-\widehat{\textnormal{FDP}}(T_{\alpha})\geq 11\varepsilon\big)
\leq \mathbb{P}\big(\textnormal{FDP}(T_{\alpha})-\widehat{\textnormal{FDP}}(T_{\alpha})\geq 11\varepsilon,A\big)+3\delta
\stackrel{\textnormal{(a)}}{\leq} \mathbb{P}\big(\textnormal{FDP}(T_{\alpha})-\overline{\textnormal{FDP}}(T_{\alpha})+\widetilde{\textnormal{FDP}}(T_{\alpha})-\widehat{\textnormal{FDP}}(T_{\alpha})\geq 11\varepsilon,A\big)+3\delta
\stackrel{\textnormal{(b)}}{\leq} 3\delta.

Above, step (a) holds because $\overline{\textnormal{FDP}}(T_{\alpha})\leq\widetilde{\textnormal{FDP}}(T_{\alpha})$, so that $\textnormal{FDP}(T_{\alpha})-\widehat{\textnormal{FDP}}(T_{\alpha})\leq\big(\textnormal{FDP}(T_{\alpha})-\overline{\textnormal{FDP}}(T_{\alpha})\big)+\big(\widetilde{\textnormal{FDP}}(T_{\alpha})-\widehat{\textnormal{FDP}}(T_{\alpha})\big)$; step (b) holds because on the event $A$,

\displaystyle\big|\widehat{\textnormal{FDP}}(T_{\alpha})-\widetilde{\textnormal{FDP}}(T_{\alpha})\big|+\big|\overline{\textnormal{FDP}}(T_{\alpha})-\textnormal{FDP}(T_{\alpha})\big|\leq 10\varepsilon.

We then proceed to upper-bound the FDR. When $p_{0},p_{1}\geq\max(m_{0},m_{1},m_{2},m_{3})$,

\displaystyle\textnormal{FDR}=\mathbb{E}\big[\textnormal{FDP}(T_{\alpha})\big]
\leq\mathbb{E}\big[\textnormal{FDP}(T_{\alpha})\cdot\mathbf{1}\{\textnormal{FDP}(T_{\alpha})-\widehat{\textnormal{FDP}}(T_{\alpha})<11\varepsilon\}\big]+3\delta
\leq\mathbb{E}\big[\widehat{\textnormal{FDP}}(T_{\alpha})\big]+11\varepsilon+3\delta
\leq\alpha+11\varepsilon+3\delta,

where the last inequality follows from the definition of $T_{\alpha}$. Since $\varepsilon$ and $\delta$ are arbitrary, we conclude the proof.

B.8 Proof of Lemma B.1

For any $\varepsilon\in(0,1)$, let $N_{\varepsilon}=\lceil 2/\varepsilon\rceil$. Since $F_{0}(t)$ is continuous in $t$, we can find $-\infty=a_{0}\leq a_{1}\leq\cdots\leq a_{N_{\varepsilon}}=\infty$ such that $F_{0}(a_{k})-F_{0}(a_{k-1})\leq\varepsilon/2$ for all $k\in[N_{\varepsilon}]$. Then,

\displaystyle\mathbb{P}\big(\sup_{t}|V_{-}(t)-F_{0}(t)|\geq\varepsilon\big)\leq\mathbb{P}\Big(\bigcup_{k\in[N_{\varepsilon}]}\Big\{\sup_{t\in(a_{k-1},a_{k}]}|V_{-}(t)-F_{0}(t)|\geq\varepsilon\Big\}\Big)
\leq \sum_{k\in[N_{\varepsilon}]}\mathbb{P}\Big(\sup_{t\in(a_{k-1},a_{k}]}|V_{-}(t)-F_{0}(t)|\geq\varepsilon\Big)\leq\sum_{k\in[N_{\varepsilon}]}\mathbb{P}\big(|V_{-}(a_{k-1})-F_{0}(a_{k-1})|\geq\varepsilon/2\big),

where the last step follows from the choice of the $a_{k}$'s. Invoking Chebyshev's inequality, we have

\displaystyle\sum_{k\in[N_{\varepsilon}]}\mathbb{P}\big(|V_{-}(a_{k-1})-F_{0}(a_{k-1})|\geq\varepsilon/2\big)\leq\sum_{k\in[N_{\varepsilon}]}\frac{\sum_{i,j\in\mathcal{H}_{0}}\textnormal{Cov}(\mathbf{1}\{W_{i}\leq-a_{k-1}\},\mathbf{1}\{W_{j}\leq-a_{k-1}\})}{p_{0}^{2}\varepsilon^{2}/4}
\leq\sum_{k\in[N_{\varepsilon}]}\frac{c\cdot p^{\kappa}}{p_{0}^{2}\varepsilon^{2}/4}\leq\frac{12c}{\pi_{0}^{2}p^{2-\kappa}\varepsilon^{3}},

where the last inequality uses $N_{\varepsilon}\leq 3/\varepsilon$ for $\varepsilon\in(0,1)$ and $p_{0}=\pi_{0}p$. This bound goes to zero as $p\rightarrow\infty$ since $\kappa\in(0,2)$. The proof for the convergence of $V_{+}$ is identical.

We now focus on $R_{-}(t)$. Again by the continuity of $G_{-}(t)$, we can find $-\infty=b_{0}\leq b_{1}\leq\cdots\leq b_{N_{\varepsilon}}=\infty$ such that $G_{-}(b_{k})-G_{-}(b_{k-1})\leq\varepsilon/2$ for all $k\in[N_{\varepsilon}]$. Then,

\displaystyle\mathbb{P}\big(\sup_{t}|R_{-}(t)-G_{-}(t)|\geq\varepsilon\big)\leq\mathbb{P}\Big(\bigcup_{k\in[N_{\varepsilon}]}\Big\{\sup_{t\in(b_{k-1},b_{k}]}|R_{-}(t)-G_{-}(t)|\geq\varepsilon\Big\}\Big)
\leq \sum_{k\in[N_{\varepsilon}]}\mathbb{P}\Big(\sup_{t\in(b_{k-1},b_{k}]}|R_{-}(t)-G_{-}(t)|\geq\varepsilon\Big)\leq\sum_{k\in[N_{\varepsilon}]}\mathbb{P}\big(|R_{-}(b_{k-1})-G_{-}(b_{k-1})|\geq\varepsilon/2\big),

where the last step follows from the choice of the $b_{k}$'s. Invoking Chebyshev's inequality, we have

\displaystyle\sum_{k\in[N_{\varepsilon}]}\mathbb{P}\big(|R_{-}(b_{k-1})-G_{-}(b_{k-1})|\geq\varepsilon/2\big)\leq\sum_{k\in[N_{\varepsilon}]}\frac{\sum_{i,j\in[p]}\textnormal{Cov}(\mathbf{1}\{W_{i}\leq-b_{k-1}\},\mathbf{1}\{W_{j}\leq-b_{k-1}\})}{p^{2}\varepsilon^{2}/4}
\leq\sum_{k\in[N_{\varepsilon}]}\frac{c\cdot p^{\kappa}}{p^{2}\varepsilon^{2}/4}\leq\frac{12c}{p^{2-\kappa}\varepsilon^{3}},

where again $N_{\varepsilon}\leq 3/\varepsilon$. This bound goes to zero as $p\rightarrow\infty$ since $\kappa\in(0,2)$. The proof for the convergence of $R_{+}$ is identical.

Appendix C Additional simulation details

C.1 Implementation details of the Gaussian mirror

Our implementation of the GM procedure in our $p<n$ setting matches the algorithm that was proven by Xing et al., (2023a, Theorem 4) to exhibit asymptotic FDR control. Specifically, for each $j\in[p]$, GM first draws $\bm{w}_{j}\sim N(0,\bm{I}_{n})$ and sets $\bm{z}_{j}=c_{j}\bm{w}_{j}$, where the scaling factor $c_{j}$ is defined as

c_{j}=\sqrt{\frac{\bm{x}_{j}^{\top}\left(\bm{I}_{n}-\bm{X}_{-j}(\bm{X}_{-j}^{\top}\bm{X}_{-j})^{-1}\bm{X}_{-j}^{\top}\right)\bm{x}_{j}}{\bm{w}_{j}^{\top}\left(\bm{I}_{n}-\bm{X}_{-j}(\bm{X}_{-j}^{\top}\bm{X}_{-j})^{-1}\bm{X}_{-j}^{\top}\right)\bm{w}_{j}}},

so that $\bm{z}_{j}$ and $\bm{x}_{j}$ have equal residual norms after projecting out $\bm{X}_{-j}$.

GM then regresses $\bm{y}$ onto $[\bm{x}_{j}+\bm{z}_{j},\bm{x}_{j}-\bm{z}_{j},\bm{X}_{-j}]$, using OLS or regularized regression. Letting $\widehat{\beta}_{j}^{+}$ and $\widehat{\beta}_{j}^{-}$ be the coefficients for $\bm{x}_{j}+\bm{z}_{j}$ and $\bm{x}_{j}-\bm{z}_{j}$, respectively, GM computes the test statistic $W_{j}=\lvert\widehat{\beta}_{j}^{+}+\widehat{\beta}_{j}^{-}\rvert-\lvert\widehat{\beta}_{j}^{+}-\widehat{\beta}_{j}^{-}\rvert$. Finally, it constructs the rejection threshold $T_{\alpha}$ and returns $\widehat{S}$ in the same manner as BC and OATK, using Equation (29). As noted in Xing et al., (2023a), the Gaussian mirror $\bm{z}_{j}$ can be viewed as a type of knockoff for $\bm{x}_{j}$, since the procedure is equivalent to regressing $\bm{y}$ onto $[\bm{X},\bm{z}_{j}]$ and then taking $W_{j}$ to be the difference in the magnitudes of the regression coefficients for $\bm{x}_{j}$ and $\bm{z}_{j}$. This formulation achieves asymptotic FDR control when using OLS, under a mild correlation assumption on $\bm{X}$. We use Lasso regression in our implementation, as we found it to yield higher power than the OLS and ridge versions while largely maintaining FDR control in our numerical examples.
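For concreteness, the following Python sketch implements the low-dimensional GM statistic just described: it draws $\bm{w}_{j}$, computes $c_{j}$, forms the mirror pair, and fits the regression. This is a minimal illustration under our own conventions (the scikit-learn `Lasso` and the penalty level `alpha_lasso` are illustrative choices), not the authors' released code.

```python
import numpy as np
from sklearn.linear_model import Lasso

def gm_statistic(X, y, j, alpha_lasso=0.01, rng=None):
    """Gaussian-mirror statistic W_j for variable j in the p < n setting.

    Sketch only: the Lasso penalty alpha_lasso is an illustrative choice.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    x_j = X[:, j]
    X_mj = np.delete(X, j, axis=1)
    # Orthonormal basis of col(X_{-j}); used to project onto its complement.
    Q, _ = np.linalg.qr(X_mj)
    perp = lambda v: v - Q @ (Q.T @ v)
    # Draw w_j ~ N(0, I_n) and rescale so z_j = c_j * w_j matches the
    # residual norm of x_j after projecting out X_{-j}.
    w = rng.standard_normal(n)
    c_j = np.sqrt((x_j @ perp(x_j)) / (w @ perp(w)))
    z_j = c_j * w
    # Regress y on [x_j + z_j, x_j - z_j, X_{-j}] and form the mirror statistic.
    design = np.column_stack([x_j + z_j, x_j - z_j, X_mj])
    coef = Lasso(alpha=alpha_lasso, fit_intercept=False).fit(design, y).coef_
    return abs(coef[0] + coef[1]) - abs(coef[0] - coef[1])
```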

This implementation differs from the one used in their numerical studies and coded in their accompanying software (Xing et al., 2023b), primarily because the latter includes a pre-screening step that is necessary when $p>n$. They initially pre-screen for promising variables by running a Lasso regression on the original data. Let $R$ be the set of indices with nonzero coefficients in the pre-screening regression, let $\widehat{\bm{\beta}}_{R}$ be their coefficients, and let $\bm{X}_{R}$ consist of only the columns of $\bm{X}$ indexed by $R$. They then construct Gaussian mirror variables for all $p$ variables, but they select $c_{j}$ only with respect to $\bm{X}_{R}$, such that

c_{j}=\sqrt{\frac{\bm{x}_{j}^{\top}\left(\bm{I}_{n}-\bm{X}_{R\backslash j}(\bm{X}_{R\backslash j}^{\top}\bm{X}_{R\backslash j})^{-1}\bm{X}_{R\backslash j}^{\top}\right)\bm{x}_{j}}{\bm{z}_{j}^{\top}\left(\bm{I}_{n}-\bm{X}_{R\backslash j}(\bm{X}_{R\backslash j}^{\top}\bm{X}_{R\backslash j})^{-1}\bm{X}_{R\backslash j}^{\top}\right)\bm{z}_{j}}}, (89)

where $\bm{X}_{R\backslash j}$ is $\bm{X}_{R}$ if $j\notin R$, and is $\bm{X}_{R}$ with the column $\bm{x}_{j}$ removed if $j\in R$. The GM software package (Xing et al., 2023b) uses this pre-screened formulation in both the $p<n$ and $p>n$ cases. We found the Lasso pre-screening step to boost the power and FDR control of both the GM and OATK procedures, although OATK with pre-screening was still more powerful than GM with pre-screening.

In our numerical implementations in Section 4, we implement both GM and OATK without the pre-screening step, for several reasons. First, pre-screening was primarily intended to handle the $p>n$ case, whereas our focus is on $p<n$, for which pre-screening is not needed. Second, omitting it allows the more innate aspects of the two methods to be compared without conflating them with the effects of pre-screening. Third, the pre-screened coded implementation in Xing et al., (2023b) lacks a firm theoretical basis. Regarding the last point, while the coded implementation follows to some extent the high-dimensional ($p>n$) GM algorithm for which Xing et al., (2023a, Algorithm 2, Theorem 5) proved asymptotic FDR control, there are fundamental differences that render the theoretical results inapplicable. First, Algorithm 2 of Xing et al., (2023a) generates $\bm{z}_{j}$ only for $j\in R$, i.e., only for the variables that survive the pre-screening step; the screened-out variables are completely ignored in the subsequent GM regressions and treated as null variables by default. In contrast, the implemented algorithm generates $\bm{z}_{j}$'s for all variables, including the screened-out $j$'s, which may be included in $\widehat{S}$ depending on $W_{j}$ and $T_{\alpha}$. We found empirically that ignoring all screened-out variables resulted in much poorer FDR control than generating knockoffs for them, as is done in the coded implementation. Moreover, $c_{j}$ in Algorithm 2 of Xing et al., (2023a) is computed as

c_{j}=\sqrt{\frac{\bm{x}_{j}^{\top}\left(\bm{I}_{n}-\bm{X}_{R\backslash j}(\bm{X}_{R\backslash j}^{\top}\bm{X}_{R\backslash j})^{-1}\bm{X}_{R\backslash j}^{\top}\right)\bm{x}_{j}}{\widetilde{\bm{z}}_{j}^{\top}\left(\bm{I}_{n}-\bm{X}_{R\backslash j}(\bm{X}_{R\backslash j}^{\top}\bm{X}_{R\backslash j})^{-1}\bm{X}_{R\backslash j}^{\top}\right)\widetilde{\bm{z}}_{j}}}, (90)

where $\widetilde{\bm{z}}_{j}$ is the projection of $\bm{z}_{j}$ onto the column space of $\bm{X}_{R}$, which differs substantially from (89) as used in the coded implementation.
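The contrast between (89) and (90) lies only in which vector enters the quadratic forms in the denominator. A minimal sketch of the two variants, under our own argument names (`X_R_minus_j` plays the role of $\bm{X}_{R\backslash j}$):

```python
import numpy as np

def cj_variants(x_j, z_j, X_R_minus_j, X_R):
    """Contrast the scaling factors in Equations (89) and (90). Sketch only."""
    Q, _ = np.linalg.qr(X_R_minus_j)           # orthonormal basis of col(X_{R\j})
    perp = lambda v: v - Q @ (Q.T @ v)         # project onto its complement
    num = x_j @ perp(x_j)
    c_89 = np.sqrt(num / (z_j @ perp(z_j)))    # Equation (89): use z_j itself
    Q_R, _ = np.linalg.qr(X_R)                 # orthonormal basis of col(X_R)
    z_tilde = Q_R @ (Q_R.T @ z_j)              # project z_j onto col(X_R) first
    c_90 = np.sqrt(num / (z_tilde @ perp(z_tilde)))  # Equation (90)
    return c_89, c_90
```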

Additionally, the construction of $W_{j}$ differs between the two algorithms. The coded implementation computes $W_{j}=\lvert\widehat{\beta}_{j}^{+}+\widehat{\beta}_{j}^{-}\rvert-\lvert\widehat{\beta}_{j}^{+}-\widehat{\beta}_{j}^{-}\rvert$, the same statistic as in the low-dimensional algorithm without the pre-screening step. In contrast, Algorithm 2 defines

W_{j}=\left|\sigma\Phi^{-1}\left(F^{[\mathcal{V}_{1}^{-}(r),\mathcal{V}_{1}^{+}(r)]}_{0,\sigma^{2}}\left(\widehat{\beta}_{j}^{+}+\widehat{\beta}_{j}^{-}\right)\right)\right|-\left|\sigma\Phi^{-1}\left(F^{[\mathcal{V}_{0}^{-}(r),\mathcal{V}_{0}^{+}(r)]}_{0,\sigma^{2}}\left(\widehat{\beta}_{j}^{+}-\widehat{\beta}_{j}^{-}\right)\right)\right|, (91)

where $\sigma$ is the standard deviation of the noise term in the linear model (1) (which must be estimated), $\Phi(\cdot)$ is the CDF of the standard normal distribution, $F_{\mu,\sigma^{2}}^{[a,b]}$ is the CDF of a $N(\mu,\sigma^{2})$ random variable truncated to the interval $[a,b]$, and $\mathcal{V}_{1}^{-}(r),\mathcal{V}_{1}^{+}(r),\mathcal{V}_{0}^{-}(r),\mathcal{V}_{0}^{+}(r)$ are parameters calculated from $\bm{X}_{R}$, $\bm{y}$, and $\widehat{\bm{\beta}}_{R}$. Defining $W_{j}$ as in (91) is necessary for the proof of asymptotic FDR control because it restores the symmetry of $W_{j}$ for null $j$'s by correcting for the post-selection bias introduced by the Lasso regression in the pre-screening step, whereas the version of $W_{j}$ implemented in the code does not apply this correction.
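For illustration, (91) can be evaluated with SciPy's truncated-normal CDF once the truncation parameters are available. In the sketch below, `v1_lo`, `v1_hi`, `v0_lo`, `v0_hi`, and `sigma` are inputs standing in for $\mathcal{V}_{1}^{-}(r)$, $\mathcal{V}_{1}^{+}(r)$, $\mathcal{V}_{0}^{-}(r)$, $\mathcal{V}_{0}^{+}(r)$, and the estimated noise level; their computation follows Xing et al., (2023a) and is not reproduced here.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def corrected_gm_statistic(beta_plus, beta_minus, sigma,
                           v1_lo, v1_hi, v0_lo, v0_hi):
    """Evaluate Equation (91) given the truncation parameters (sketch).

    (v1_lo, v1_hi) and (v0_lo, v0_hi) stand in for [V_1^-(r), V_1^+(r)]
    and [V_0^-(r), V_0^+(r)]; sigma is an estimate of the noise s.d.
    """
    def trunc_cdf(x, lo, hi):
        # CDF of N(0, sigma^2) truncated to [lo, hi]; scipy's truncnorm
        # takes standardized bounds (lo - 0)/sigma and (hi - 0)/sigma.
        return truncnorm.cdf(x, lo / sigma, hi / sigma, loc=0.0, scale=sigma)

    u_plus = trunc_cdf(beta_plus + beta_minus, v1_lo, v1_hi)
    u_minus = trunc_cdf(beta_plus - beta_minus, v0_lo, v0_hi)
    return abs(sigma * norm.ppf(u_plus)) - abs(sigma * norm.ppf(u_minus))
```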

C.2 Additional numerical results for Gaussian $\bm{X}$

This section presents additional numerical results for the Gaussian $\bm{X}$ study of Section 4.1. Fig. 10 shows the effect of the covariance parameter $\rho$ on the FDR and power in the smaller $n=1000$, $p=300$ example with fixed amplitude $A=6$. Increasing $\rho$, which increases the correlation among covariates under all three covariance structures, reduces power for all variable selection procedures. However, OATK shows the slowest decay in power, and increasing $\rho$ does not change OATK's uniformly higher power relative to the other algorithms. The FDR control of OATK remains reasonable across $\rho$ in the power-decay and constant-positive examples, never exceeding 0.13. In the constant-negative case, OATK's worst FDR inflation is 0.15, at $\rho=0.6$, but we view such inflation as relatively innocuous given OATK's superior power and the replicate-to-replicate variability, as discussed in Section 4.1.

Figure 10: FDR and power as functions of the covariance parameter $\rho$ in the $p=300$, $n=1000$ Gaussian $\bm{X}$ study with fixed signal amplitude $A=6$.

C.3 Fast implementation of conditionally calibrated OATK

We run the conditional calibration procedure detailed in Section 2.5 on $\widetilde{\mathcal{S}}=\bigl(R^{BH(4\alpha)}\cap\mathcal{S}^{p}\bigr)\cup\mathcal{S}^{Kn}\cup\widehat{S}$, where $R^{BH(4\alpha)}$ is the rejection set of BH at FDR level $4\alpha$, $\widehat{S}$ is the rejection set of the basic OATK, and

\displaystyle\mathcal{S}^{p}=\{j:p_{j}\leq\alpha/2\}, (92)
\displaystyle\mathcal{S}^{Kn}=\{j:|W_{j}|\geq w_{p-|R^{BH(4\alpha)}\cap\mathcal{S}^{p}|}\wedge w_{T_{\alpha}}\}. (93)
Here, $p_{j}$ denotes the p-value of the standard two-sided t-statistic from OLS, and $w_{1}<\cdots<w_{p}$ denote the order statistics of $|W_{1}|,\ldots,|W_{p}|$. In other words, we run the calibration procedure only on variables that have a relatively small p-value, have a large OATK test statistic, or lie in the rejection set of the basic OATK; a sketch of this screening step follows.
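The sketch below assembles the candidate set $\widetilde{\mathcal{S}}$ from (92) and (93) under our own conventions: BH is implemented inline, `S_hat` is a boolean mask of the basic OATK rejections, and `w_T_alpha` stands in for the order-statistic cutoff $w_{T_{\alpha}}$ at the OATK threshold.

```python
import numpy as np

def bh_rejections(p_vals, level):
    """Benjamini-Hochberg step-up procedure; returns a boolean mask."""
    p = len(p_vals)
    order = np.argsort(p_vals)
    below = p_vals[order] <= level * np.arange(1, p + 1) / p
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(p, dtype=bool)
    reject[order[:k]] = True
    return reject

def calibration_candidates(p_vals, W, w_T_alpha, S_hat, alpha):
    """Sketch of the candidate set S-tilde in Section C.3 (Eqs. (92)-(93))."""
    p = len(p_vals)
    S_p = p_vals <= alpha / 2                       # Equation (92)
    first = bh_rejections(p_vals, 4 * alpha) & S_p  # R^{BH(4a)} \cap S^p
    k = int(first.sum())
    w = np.sort(np.abs(W))                          # w_1 <= ... <= w_p
    cutoff = min(w[p - k - 1], w_T_alpha) if k < p else w_T_alpha
    S_Kn = np.abs(W) >= cutoff                      # Equation (93)
    return first | S_Kn | S_hat                     # candidate set S-tilde
```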