Multiple testing with anytime-valid Monte-Carlo p-values

Lasse Fischer University of Bremen, Germany. Email: [email protected] Timothy Barry Harvard University, USA. Email: [email protected] Aaditya Ramdas Carnegie Mellon University, USA. Email: [email protected]

Abstract

In contemporary problems involving genetic or neuroimaging data, thousands of hypotheses need to be tested. Due to their high power, and finite sample guarantees on type-I error under weak assumptions, Monte-Carlo permutation tests are often considered as gold standard for these settings. However, the enormous computational effort required for (thousands of) permutation tests is a major burden. Recently, Fischer and Ramdas [12] constructed a permutation test for a single hypothesis in which the permutations are drawn sequentially one-by-one and the testing process can be stopped at any point without inflating the type-I error. They showed that the number of permutations can be substantially reduced (under null and alternative) while the power remains similar. We show how their approach can be modified to make it suitable for a broad class of multiple testing procedures and particularly discuss its use with the Benjamini-Hochberg procedure. The resulting method provides valid error rate control and outperforms all existing approaches significantly in terms of power and/or required computational time. We provide fast implementations and illustrate its application on large datasets, both synthetic and real.

1 Introduction

In a classical Monte-Carlo permutation test, we observe some test statistic $Y_{0}$ , generate $B$ additional test statistics $Y_{1},\ldots,Y_{B}$ and calculate the permutation p-value by

\displaystyle P^{\mathrm{perm}}_{B}=\frac{1+\sum_{t=1}^{B}\mathbbm{1}\{Y_{t}\geq Y_{0}\}}{1+B}.

(1)

In order to be able to reject the null hypothesis at level $\alpha$ , we would need to generate at least $B\geq 1/\alpha-1$ permuted datasets with corresponding test statistics. This can already lead to considerable computational effort when the data generation process is demanding. However, it is significantly higher when $M$ hypotheses are tested instead of a single one. First, we need to perform a permutation test for each hypothesis, leading to a minimum number of $M(1/\alpha-1)$ permutations in total. Second, when multiple testing corrections are performed, each hypothesis is tested at a lower individual level than the overall level $\alpha$ . For example, if the Bonferroni correction is used, each hypothesis is tested at level $\alpha/M$ leading to the requirement of at least $M(M/\alpha-1)\approx M^{2}/\alpha$ permutations. Even if we use more powerful multiple testing procedures such as the Benjamini-Hochberg (BH) procedure [5], this lower bound for the total number of permutations usually remains. The problem is that $B$ needs to be chosen in a manner that protects against the worst case in which all p-values are large except for one that is then essentially tested at level $\alpha/M$ . In this paper, we consider applying multiple testing procedures to anytime-valid permutation p-values which allows to stop resampling early and reject (or accept) a hypothesis as soon as the corresponding p-value is rejected (or accepted) by the multiple testing procedure. In this way, we can adapt the number of permutations to the number of rejections and potentially save a lot of permutations.

Despite our terminology and notation focusing on permutation tests, everything in this paper identically applies to other Monte-Carlo methods such as conditional randomization tests as well, which are also common in the causal inference literature, and have been applied to genetic settings under the so-called model-X assumption [9, 4, 3].

There are many existing approaches considering sequential Monte-Carlo tests for multiple testing — in particular for the BH procedure [28, 20, 48, 40, 2, 13, 14, 15, 37]. These works can be broadly categorized into the following three approaches, from which only the first and third one are useful for saving computational resources.

1.

The first idea is to use early stopping rules that ensure obtaining the same decisions as applying the multiple testing procedure to the permutation p-value $P^{\mathrm{perm}}_{B}$ for a fixed $B\in\mathbb{N}$ with high probability [28, 20, 48].
2.

Another line of work provides bounds for the probability of obtaining a different decision than the limiting permutation p-value [13, 14, 15]. Here, the main goal is not to reduce the number of permutations but to minimize the resampling risk [10].
3.

In a third approach [40, 37, 2], multiple testing procedures are applied to the sequential permutation p-value by Besag and Clifford [7].

The problem with the third approach is that the Besag-Clifford p-value only allows to stop early for accepting the null hypothesis because the evidence is too weak, and it cannot be stopped early for rejection when the evidence against the null hypothesis is sufficiently strong. For this reason, the previous works either did not stop for rejection [2, 37] or use heuristics that do not provide provable error control [40]. To the best of our knowledge, the only existing method allowing to adapt the number of permutations to the number of rejections while guaranteeing FDR control, is the AMT algorithm by Zhang et al. [48]. They fix the maximum number of permutations $B\in\mathbb{N}$ in advance and consider the generation process of the test statistics as a multi-armed bandit problem, and their guarantee falls into the first category above.

Our contributions.

We will follow a strategy similar to the third path, however, instead of the Besag-Clifford p-value, we consider the anytime-valid permutation p-values recently constructed by Fischer and Ramdas [12]. The latter framework encompasses the former as a special case, but is able to reduce the number of permutations under both the null and the alternative, without compromising power or type-I error. With this, our approach allows to adapt the number of permutations to the number of rejections while providing provable error control.

If the employed multiple testing procedure (which takes p-values as input and produces a rejection set as output) provides the desired error control under arbitrary dependence of the p-values, then it follows that it retains such validity with our anytime-valid permutation p-values, allowing us to stop fully data-adaptively with no danger of error rate control.

If a specific dependence structure is required to control the desired error rate, then some care regarding the stopping time and employed anytime-valid permutation p-value needs to be taken. For brevity, we restrict to the BH procedure in this case and prove that FDR control is guaranteed, if the limiting permutation p-values are independent, or if the limiting permutation p-values are PRDS and we use the so-called “anytime-valid BC” method.

In addition, we argue heuristically why we also do not expect the FDR to be inflated, if the limiting permutation p-values are PRDS and we use other strategies than the anytime-valid BC method. Consequently, in contrast to the AMT algorithm [48], we contend that our method has three advantages: (a) it can be applied with a large variety of p-value based multiple testing procedures, (b) it allows to stop for rejection data-adaptively at any time and (c) it does not require a prespecified maximum number of permutations. Furthermore, our method outperformed the AMT algorithm significantly in all considered experiments (see Sections 5 and 6).

Paper outline.

In Section 2, we begin with a recap of anytime-valid permutation p-values and the betting approach by Fischer and Ramdas [12]. Afterward, we extend this methodology to multiple testing (Section 3). The proposed technique is straightforward for all p-value based multiple testing procedures that do not require assumptions on the dependence structure. In Section 4, we discuss what needs to be considered when applying multiple testing procedures for which the p-values must be independent or PRDS with the example of the Benjamini-Hochberg procedure. Finally, we demonstrate the application of the BH procedure with anytime-valid permutation p-values to simulated data and real data in Sections 5 and 6, respectively.

2 Anytime-valid permutation p-values

In this section, we recap the notion of an anytime-valid permutation p-value as introduced by [12] and the methods to construct them.

Suppose we observed some data $X_{0}$ and have the ability to generate additional data $X_{1},X_{2},\ldots$ , e.g., by permuting the treatment labels of $X_{0}$ in a treatment vs. control trial. Let $Y_{i}=T(X_{i})$ for some test statistic $T$ . If the same generating mechanism is used independently for each $X_{i}$ , $i\geq 1$ , then the test statistics $Y_{1},Y_{2},\ldots$ are always exchangeable conditional on $Y_{0}$ , which means that the joint distribution of the sequence does not change for any finite permutation of the test statistics. We consider testing the null hypothesis

\displaystyle H_{0}:Y_{0},Y_{1},\ldots\text{ are exchangeable }

(2)

against the alternative

\displaystyle H_{1}:Y_{1},Y_{2},\ldots\text{ are exchangeable conditional on }Y_{0},\text{ and }Y_{0}\text{ is stochastically larger than }Y_{1},Y_{2},\ldots\text{ .}

(3)

Note that these hypotheses implicitly assume that we sample the permutations with replacement, since there would be an upper bound for the number of permutations otherwise. However, this is the usual state of affairs in practice, since remembering which permutations have already been drawn is computationally intensive and if the data size is large, not all permutations can be drawn anyway [43]. Also note that the alternative is formulated in a right-tailed manner (“stochastically larger”), however, all methods apply equally to left-tailed problems; and two-tailed tests can be obtained by combining a right-tailed and left-tailed test. The standard approach to this testing problem is the permutation p-value (1). In general, one would want to choose $B$ as large as possible, since the power increases with $B$ . However, the computational effort can be too high for large $B$ . In order to reduce the computational cost in permutation testing, Fischer and Ramdas [12] introduced a general approach to construct anytime-valid permutation p-values. At each step $t$ , a test statistic $Y_{t}$ is generated, compared with $Y_{0}$ , and then a p-value $P_{t}$ is calculated. The resulting process $(P_{t})_{t\in\mathbb{N}}$ is called anytime-valid p-value, if

\displaystyle\mathbb{P}_{H_{0}}(P_{\tau}\leq\alpha)\leq\alpha\text{ for all }\alpha\in[0,1],

(4)

where $\tau$ is an arbitrary stopping time. We note two properties [27] of anytime-valid p-values that will be important for us when applying them with multiple testing procedures.

(A)

If $(P_{t})_{t\in\mathbb{N}}$ is an anytime-valid p-value, then $(\tilde{P}_{t})_{t\in\mathbb{N}}$ , where $\tilde{P}_{t}=\min_{s\leq t}P_{s}$ , is an anytime-valid p-value as well. Hence, it can be assumed that all anytime-valid p-values $(P_{t})_{t\in\mathbb{N}}$ are nonincreasing in $t$ .
(B)

If $(P_{t})_{t\in\mathbb{N}}$ is an anytime-valid p-value, it holds that $\mathbb{P}_{H_{0}}(P_{T}\leq\alpha)\leq\alpha$ for any data-dependent random time $T$ . This means that the stopping time $\tau$ for our anytime-valid p-value does not have to be adapted to the filtration generated by $Y_{t},t\geq 0$ , but can depend on any (possibly data-dependent) information. In particular, this allows to stop based on the data of other hypotheses. We will still denote $\tau$ as stopping time in the following.

In the next subsection, we present a simple — but, as we shall later see, powerful — anytime-valid generalization of the Besag-Clifford p-value. Afterwards, in Section 2.2, we present the general testing by betting approach to construct anytime-valid permutation tests by Fischer and Ramdas [12]. In Section 2.3, we recap the binomial mixture strategy as (another) concrete instance of their general approach.

2.1 A simple approach: The anytime-valid Besag-Clifford p-value

Besag and Clifford [7] introduced a sequential permutation p-value:

\displaystyle P^{\mathrm{BC}}_{\gamma(h,B)}=\begin{cases}h/\gamma(h,B),&L_{\gamma(h,B)}=h\\ (L_{\gamma(h,B)}+1)/(B+1),&\text{otherwise},\end{cases}

(5)

where $h$ is some predefined parameter, $L_{B}:=\sum_{t=1}^{B}\mathbbm{1}\{Y_{t}\geq Y_{0}\}$ is the random number of “losses” after $B$ permutations and $\gamma(h,B)=\min(\inf\{t\in\mathbb{N}:L_{t}=h\},B)$ is a stopping time. In case of $B=\infty$ , we just write $\gamma(h)$ . The Besag-Clifford (BC) p-value only allows to stop at the particular time $\gamma(h,B)$ and not at any other stopping time. Hence, it allows to stop early when $h$ losses occurred before all $B$ permutations were sampled and therefore only saves computations under the null hypothesis, but not when the evidence against the null is strong. For this reason, it is not well-suited to adapt the number of permutations to the number of rejections when multiple hypotheses are considered. However, the BC p-value permits a simple anytime-valid generalization which solves this issue.

For simplicity, assume $B=\infty$ , meaning the Besag-Clifford method would only stop after observing $h$ losses (the case $B<\infty$ is also possible [12], however, we do not gain much and the notation is more complicated). Suppose we plan to use the Besag-Clifford p-value $P^{\mathrm{BC}}_{\gamma(h)}$ , are currently at some time $t<\gamma(h)$ , and since the evidence against $H_{0}$ already seems very strong, we would like to stop the process to save computational time. What (non-trivial) p-value could we report at this time while ensuring (4) is fulfilled? The answer is very simple, just report the p-value under the most conservative future, meaning if all upcoming test statistics are losses. Since we need to sample $h-L_{t}$ test statistics until we hit the stopping time $\gamma(h)$ , this leads to the following generalization of the Besag-Clifford p-value

\displaystyle P_{t}^{\mathrm{avBC}}=\begin{cases}\frac{h}{t+h-L_{t}},&t\leq\gamma(h)\\ h/\gamma(h),&t>\gamma(h).\end{cases}

(6)

Since $P_{\gamma(h)}^{\mathrm{avBC}}=P^{\mathrm{BC}}_{\gamma(h)}$ , its anytime-validity (4) follows immediately by the validity of the Besag-Clifford p-value at time $\gamma(h)$

\mathbb{P}_{H_{0}}(P_{\tau}^{\mathrm{avBC}}\leq\alpha)\leq\mathbb{P}_{H_{0}}(P_{\gamma(h)}^{\mathrm{avBC}}\leq\alpha)=\mathbb{P}_{H_{0}}(P^{\mathrm{BC}}_{\gamma(h)}\leq\alpha)\leq\alpha,

where $\tau$ is an arbitrary stopping time. Apart from Fischer and Ramdas [12] mentioning the anytime-valid BC method as special instance of their general approach (see next subsection), we have not seen it being exploited in the literature before. The simple idea of stopping early by assuming a conservative future was previously explored for the classical permutation p-value, but not found to be very useful [21, 18]. However, it is much more efficient for the BC method and particularly useful in the multiple testing setting, because it allows to adapt the stopping time to the individual significance level. To see this, note that for a fixed level $\alpha$ , the anytime-valid BC method rejects the same hypotheses as the classical permutation p-value for $B=\lceil h/\alpha\rceil-1$ , although it might stop earlier [42]. Hence, if the significance level an individual hypothesis is tested at is not predefined, as it is the case in multiple testing, the anytime-valid BC method naturally adapts the number of permutations to this data-dependent level.

Example 1.

Suppose $h=10$ and $B=999$ . If $\alpha=0.01$ , the anytime-valid generalizations of the BC method and the permutation p-value are equivalent. They can both stop for accepting the null hypothesis as soon as the number of losses equals $10$ or stop for rejecting the null if the number of losses after $990+\ell$ permutations is $\ell$ , $\ell\in\{0,1,\ldots,9\}$ . Now suppose $\alpha$ increases to $0.05$ . Suddenly, the anytime-valid BC method allows to stop for rejection after $190+\ell$ permutations, if $\ell$ losses are observed, while at least $950$ permutations would need to be sampled such that a rejection can be obtained by the permutation p-value.

In case of $h=1$ , Fischer and Ramdas [12] referred to the anytime-valid BC p-value as the aggressive strategy, as it already stops after the first loss. It also requires the least possible number of permutations, which can be useful in scenarios with enormous computational demand, and therefore deserves a special mention.

2.2 A general approach: Anytime-valid permutation testing by betting

Fischer and Ramdas [12] introduced a general approach to anytime-valid permutation testing based on a testing by betting technique [19, 41]. In particular, the anytime-valid BC method is a concrete instance of their algorithm. Furthermore, they were able to reduce the number of permutations compared to the classical permutation p-value and the BC method substantially, while achieving a similar power.

Their approach works as follows. At each step $t=1,2,\ldots$ , the statistician bets on the outcome of the loss indicator $I_{t}=\mathbbm{1}\{Y_{t}\geq Y_{0}\}$ by specifying a betting function $B_{t}:\{0,1\}\rightarrow\mathbb{R}_{\geq 0}$ . If $I_{t}=r$ , the wealth of the statistician gets multiplied by $B_{t}(r)$ , $r\in\{0,1\}$ . Starting with an initial wealth of $W_{0}=1$ , after $t$ rounds of gambling the wealth of the statistician is given by

W_{t}=\prod_{s=1}^{t}B_{s}(I_{s}),

where the concrete structure of the wealth depends on the betting strategy used. The wealth process $(W_{t})_{t\in\mathbb{N}}$ is a test martingale with respect to the filtration generated by the indicators $(\mathcal{I}_{t})_{t\in\mathbb{N}}$ , if

B_{t}(0)\frac{t-L_{t-1}}{t+1}+B_{t}(1)\frac{1+L_{t-1}}{t+1}=1.

By the optional stopping theorem, this implies that $(W_{t})_{t\in\mathbb{N}}$ is an anytime-valid e-value, meaning $\mathbb{E}_{H_{0}}[W_{\tau}]\leq 1$ for any stopping time $\tau$ . In addition, by Ville’s inequality,

\displaystyle\mathbb{P}_{H_{0}}(\exists t\in\mathbb{N}:W_{t}\geq 1/\alpha)\leq\alpha.

(7)

This shows that a hypothesis could be rejected at level $\alpha$ as soon as the wealth is larger or equal than $1/\alpha$ . Ville’s inequality is often exploited by defining an anytime-valid p-value as

\displaystyle P_{t}:=\frac{1}{\max_{s=1,\ldots,t}W_{s}}.

(8)

However, we introduce a more powerful approach in Section 3, which is particularly useful for multiple testing.

2.3 A sophisticated approach: The binomial mixture strategy

As a special instance of their general approach, Fischer and Ramdas [12] proposed the binomial mixture strategy, which particularly leads to desirable asymptotic behavior. For a predefined parameter $c\in[0,1]$ , the wealth of the binomial mixture strategy is given by

\displaystyle\bar{W}_{t}^{c}=(1-\mathrm{Bin}(L_{t};t+1,c))/c,

(9)

where $\mathrm{Bin}(L_{t};t+1,c)$ is the CDF of a binomial distribution with size parameter $t+1$ and probability $c$ . Fischer and Ramdas [12] showed that

\displaystyle\bar{W}_{t}^{c}|\{P_{\mathrm{lim}}=p^{\mathrm{lim}}\}\stackrel{{\scriptstyle a.s.}}{{\to}}\begin{cases}1/c,&\text{ if }p^{\mathrm{lim}}\in[0,c)\\ 0,&\text{ if }p^{\mathrm{lim}}\in(c,1]\end{cases}

(10)

for $t\to\infty$ , where $P_{\mathrm{lim}}$ is the limiting permutation p-value $P_{\mathrm{lim}}=\lim\limits_{B\to\infty}P^{\mathrm{perm}}_{B}$ . Therefore, if we choose $c<\alpha$ for some predefined significance level $\alpha$ , the binomial mixture strategy rejects almost surely after a finite number of permutations if $P_{\mathrm{lim}}<c$ . Hence, we can make the loss compared to the limiting permutation p-value arbitrarily small by choosing $c<\alpha$ arbitrarily close to $\alpha$ . In addition to this desirable asymptotic behavior, an advantage of the binomial mixture strategy compared to the Besag-Clifford method is that it automatically adapts to the difficulty of the decision. If the evidence against the null is strong, the wealth will go fast to $1/c$ , and if the evidence for the null hypothesis is strong, it will go fast to $0$ . Consequently, we obtain a fast decision if the data are unambiguous, but if the decision is close, meaning the limiting permutation p-value is close to $\alpha$ , the binomial mixture strategy will take its time and we can always opt to continue sampling.

In multiple testing scenarios the level of an individual hypothesis might not be predefined and depends on the rejection or non-rejection of the other hypotheses. In the following section, we will show how we can still use the binomial mixture strategy and all other $\alpha$ -dependent betting strategies in multiple testing procedures.

3 Multiple testing with anytime-valid permutation p-values

Suppose we have $M$ null hypotheses $H_{0}^{1},\ldots,H_{0}^{M}$ of the form (2) with corresponding alternatives $H_{1}^{1},\ldots,H_{1}^{M}$ as in (3). We denote by $Y_{0}^{i},Y_{1}^{i},\ldots$ the sequence of test statistics for $H_{0}^{i}$ and by $L_{B}^{i}$ the random number of losses for hypothesis $H_{0}^{i}$ after $B$ permutations. Furthermore, let

I_{0}\subseteq\{1,\ldots,M\}

be the index set of true hypotheses, $R\subseteq\{1,\ldots,M\}$ the set of rejected hypotheses and $V=I_{0}\cap R$ the set of falsely rejected hypotheses. Two of the most common error rates considered in multiple testing are the familywise error rate (FWER) and the false discovery rate (FDR). The FWER is defined as probability of rejecting any true null hypothesis

\displaystyle\mathrm{FWER}:=\mathbb{P}(|V|>0).

(11)

The FDR is the expected proportion of falsely rejected hypotheses among all rejections

\displaystyle\mathrm{FDR}:=\mathbb{E}\left[\frac{|V|}{|R|\lor 1}\right].

(12)

The aim is to control either the FWER or FDR at some prespecified level $\alpha\in(0,1)$ . Although we focus on FWER and FDR in this paper for concreteness, our approach also works with all other common error metrics (FDP tail probabilities, k-FWER, etc.). Control of the FWER implies control of the FDR [5] and is therefore more conservative. Hence, FDR controlling procedures often lead to more rejections and are more appropriate in large-scale exploratory hypothesis testing. In non-exploratory validation studies, FWER control is usually the norm.

Many multiple testing methods take p-values $P^{1},\ldots,P^{M}$ as inputs and reject $H_{0}^{i}$ , if $P^{i}\leq\alpha^{i}$ for some potentially data-dependent individual significance level $\alpha^{i}\in[0,1)$ . Therefore, we do not know in advance at which individual significance level the hypothesis $H_{0}^{i}$ will be tested at. However, most betting strategies derived by Fischer and Ramdas [12], including the binomial mixture strategy defined in Section 2.3, require the significance level as input and thus it need to be fixed in advance. In the next subsection, we show how this can be circumvented by calculating an anytime-valid e-value for each level $\alpha\in[0,1]$ and then calibrating these into an anytime-valid p-value by taking the smallest level $\alpha$ at which the corresponding e-value can reject the null hypothesis. Afterwards, in Section 3.2, we introduce our general multiple testing approach, which works for all monotone multiple testing procedures and anytime-valid permutation p-values.

3.1 $\alpha$ -dependent p-value calibration

For each null hypothesis $H_{0}^{i}$ we choose a betting strategy as described in Section 2.2 we would use if we test at level $\alpha^{\prime}\in(0,1)$ . Let $W_{t}^{i,\alpha^{\prime}}$ be the wealth of the strategy for hypothesis $H_{i}$ and level $\alpha^{\prime}$ after $t$ permutations. Going forward, we assume that for all $\alpha_{1}^{\prime}<\alpha_{2}^{\prime}$ ,

\displaystyle\{W_{t}^{i,\alpha_{1}^{\prime}}\geq 1/\alpha_{1}^{\prime}\}\subseteq\{W_{t}^{i,\alpha_{2}^{\prime}}\geq 1/\alpha_{2}^{\prime}\}.

(13)

This means that if our strategy for level $\alpha_{1}^{\prime}$ rejects at level $\alpha_{1}^{\prime}$ , our strategy for some larger level $\alpha_{2}^{\prime}>\alpha_{1}^{\prime}$ , needs to reject at level $\alpha_{2}^{\prime}$ as well. In particular, this is satisfied if we use for each $\alpha^{\prime}\in(0,1)$ the binomial mixture strategy with parameter $c=b\alpha^{\prime}$ for some constant $b\in(0,1)$ . To see this, note that

\bar{W}_{t}^{c}\geq 1/\alpha^{\prime}\Longleftrightarrow\frac{1-\mathrm{Bin}(L_{t};t+1,b\alpha^{\prime})}{b}\geq 1

and $(1-\mathrm{Bin}(\ell;t+1,b\alpha^{\prime}))/b$ is monotone increasing in $\alpha^{\prime}$ . In the following, we therefore also use the parameter $b$ instead of $c$ to parameterize the binomial mixture strategy in an $\alpha$ -independent way.

We define the $p$ -value for the $i$ -th hypothesis at step $t\in\mathbb{N}$ as

\displaystyle P_{t}^{i}=\inf\{\alpha\in(0,1)|\exists s\in\{1,\ldots,t\}:W_{s}^{i,\alpha}\geq 1/\alpha\}.

(14)

For example, the anytime-valid version of the BC method (6) is the result of such an $\alpha$ -dependent calibration, however, Fischer and Ramdas [12] just used this implicitly and did not write this approach down in its general form. Note that if we use the same strategy for each $\alpha$ , then (13) is trivially satisfied and the p-value in (14) reduces to the one in (8).

Proposition 3.1.

If (13) holds, then $(P_{t}^{i})_{t\in\mathbb{N}}$ in (14) is an anytime-valid p-value.

The proof follows by combining Ville’s inequality (7) and Condition (13) to note that for every stopping time $\tau_{i}$ we have

\mathbb{P}_{H_{0}^{i}}(P_{\tau_{i}}^{i}\leq\alpha)=\mathbb{P}_{H_{0}^{i}}(\exists t\in\{1,\ldots,\tau_{i}\}:W_{t}^{i,\alpha}\geq 1/\alpha\})\leq\alpha.

It might seem computationally challenging to calculate the p-value (14) at each step $t$ . However, we do not need to evaluate $W_{t}^{i,\alpha}$ for each $\alpha\in[0,1]$ , but in order to compare $P_{t}^{i}$ with some individual significance level $\alpha^{i}$ we can just check whether $W_{t}^{i,\alpha^{i}}\geq 1/\alpha^{i}$ , saving a lot of computational effort. Sometimes the p-value (14) also has a closed form, as it is the case for the anytime-valid BC method (6) [12].

3.2 Our general multiple testing approach

Our general algorithm requires two ingredients, an anytime-valid permutation p-value $(P_{t}^{i})_{t\in\mathbb{N}}$ for each hypothesis $H_{0}^{i}$ , $i\in\{1,\ldots,M\}$ , and a monotone multiple testing procedure $d$ . A multiple testing procedure $d:[0,1]^{M}\to\{0,1\}^{M}$ mapping the p-values to the decisions ( $1=\text{reject; }0=\text{accept}$ ) is called monotone, if $d$ is coordinatewise nonincreasing. Hence, if some of the p-values become smaller, all previous rejections remain and additional rejections might be obtained. Recall the two properties of anytime-valid p-values from Section 2: (A), $(P_{t}^{i})_{t\in\mathbb{N}}$ is nonincreasing in $t$ , and (B), $(P_{t}^{i})_{t\in\mathbb{N}}$ is valid at all data-dependent random times. Consequently, due to (B), we can stop the sampling process for $H_{0}^{i}$ based on all available data, and particularly as soon as $H_{0}^{i}$ can be rejected by $d$ , without violating the validity of $(P_{t}^{i})_{t\in\mathbb{N}}$ . In addition, due to (A) and the monotonicity of $d$ , we can already report the rejection of $H_{0}^{i}$ at that time, as the decision won’t change if we continue testing. Our general method is summarized in Algorithm 1. It should be noted that in its most general case, error control is only provided if the multiple testing procedure works under arbitrary dependence. This is captured in the following theorem, whose proof follows immediately from the explanations above.

Theorem 3.2.

If $d$ is a monotone multiple testing procedure with FWER (or FDR) control under arbitrary dependence of the p-values, then Algorithm 1 controls the FWER (or FDR) at level $\alpha$ .

For FWER control, the class of monotone multiple testing procedures under arbitrary dependence includes all Bonferroni-based procedures such as the Bonferroni-Holm [25], the sequentially rejective graphical multiple testing procedure [8] and the fixed sequence method [34]. For FDR control, the Benjamini-Yekuteli (BY) procedure [6] is most common. In order to control the desired error rate under arbitrary dependence, the procedures usually need to protect against the worst case distribution. Therefore, improvements can be derived under additional assumptions. A typical assumption is positive regression dependence on a subset (PRDS), under which Hochbergs’s [24] and Hommel’s [26] procedure provide uniform improvements of Holm’s procedure and the famous Benjamini-Hochberg (BH) procedure uniformly improves BY. However, if PRDS is required, caution with respect to the choice of betting strategy and stopping time needs to be taken. We address this in the next section by the example of the BH procedure.

Example 2.

To see why our approach with, e.g., the anytime-valid BC p-value (6), allows to adapt the number of permutations to the number of rejections while previous methods based on the classical Besag-Clifford p-value failed, we consider a simple example. Suppose we test $1000$ hypotheses with the Benjamini-Hochberg procedure [5] (see next section for the definition) at an overall FDR level of $\alpha=0.1$ . Due to the lower bound mentioned in the introduction, we would set $B\geq\numprint{10000}$ for the classical BC method (or for the permutation p-value). For simplicity, assume that we do not observe any losses. With the BC p-value we would need to sample all $B$ permutations for each of the hypotheses. In contrast, the anytime-valid BC p-value with $h=10$ would already equal $1/10$ at time $90$ . Due to the definition of the BH procedure, we could already stop at that time and reject all hypotheses. Even if we do not observe any losses for only $10\%$ of the hypotheses, then these $10\%$ will be rejected at time $990$ and the individual level obtained by the BH procedure will be greater than $0.01$ . Hence, all hypotheses will either be accepted or rejected by time $t=999$ (since $h/(t+h-\ell)\leq 0.01$ iff $\ell\leq 9$ ). Note that these calculations do not depend on the number of hypotheses, but solely on the percentage of hypotheses with very strong evidence. Hence, even if we test $\numprint{1000000}$ hypotheses, where we do not observe any losses for $10\%$ of the hypotheses, we will draw at most $1000$ permutations per hypothesis. In contrast, the lower bound for $B$ used in the Besag-Clifford p-value or permutation p-value would already equal $\numprint{10000000}$ .

Algorithm 1 General multiple testing with anytime-valid permutation p-values

Input: Overall significance level $\alpha$ , anytime-valid permutation p-values $(P_{t}^{i})_{t\in\mathbb{N}}$ , monotone p-value based multiple testing procedure $d$ , data-dependent stopping rules $\mathcal{S}^{i}$ and sequences of test statistics $Y_{0}^{i},Y_{1}^{i},\ldots$ , $i\in\{1,\ldots,M\}$ .
Output: Stopping times $\tau_{1},\ldots,\tau_{M}$ and rejection set $R$ .

A=\{1,\ldots,M\}

R=\emptyset

3:for

t=1,2,\ldots

4: for

i\in A

5: Check whether

P_{t}^{i}\leq\alpha_{t}^{i}

, where

\alpha_{t}^{i}

is the potentially data dependent significance level

6:ssssssssobtained by multiple testing procedure

d(P_{\min(t,\tau_{1})}^{1},\ldots,P_{\min(t,\tau_{M})}^{M})

7: if

P_{t}^{i}\leq\alpha_{t}^{i}

then

R=R\cup\{i\}

A=A\setminus\{i\}

10:

\tau_{i}=t

11: end if

12: if

\mathcal{S}^{i}(\mathrm{data})=\mathrm{stop}

then

13:

A=A\setminus\{i\}

14:

\tau_{i}=t

15: end if

16: end for

17: if

A=\emptyset

then

18: return

\tau_{1},\ldots,\tau_{M}

R

19: end if

20:end for

4 The Benjamini-Hochberg procedure with anytime-valid permutation p-values

The Benjamini-Hochberg (BH) procedure [5] rejects all hypotheses $H_{0}^{i}$ with p-value $P^{i}\leq m^{*}\alpha/M$ , where

\displaystyle m^{*}=\max\left\{m\in\{1,\ldots,M\}:\sum_{i=1}^{M}\mathbbm{1}\{P^{i}\leq m\alpha/M\}\geq m\right\}

(15)

with the convention $\max(\emptyset)=0$ . In the following we write $m^{*}_{t}$ for the BH threshold at time $t\in\mathbb{N}$ when the p-values in (15) are replaced by anytime-valid p-values at time $t$ . The BH procedure controls the FDR when the p-values are positive regression dependent on a subset (PRDS). In order to define PRDS we need the notion of an increasing set. A set $D\subseteq\mathbb{R}^{M}$ is increasing, if $z\in D$ implies $y\in D$ for all $y\geq z$ .

Property 1 (PRDS [6, 11]).

A random vector of p-values $\boldsymbol{P}$ is weakly PRDS on $I_{0}$ if for any null index $i\in I_{0}$ and increasing set $D\subseteq\mathbb{R}^{M}$ , the function $x\mapsto\mathbb{P}(\boldsymbol{P}\in D\mid P^{i}\leq x)$ is nondecreasing in $x$ on $[0,1]$ . We call $\boldsymbol{P}$ strongly PRDS on $I_{0}$ , if $``P^{i}\leq x"$ is replaced with $``P^{i}=x"$ . Furthermore, we denote $\boldsymbol{P}$ as independent on $I_{0}$ , if $P_{i}$ is stochastically independent of the random vector $(P_{1},\ldots,P_{i-1},P_{i+1},\ldots,P_{M})^{T}$ for all $i\in I_{0}$ .

Strong PRDS implies weak PRDS, but for controlling the FDR with the BH procedure, weak PRDS on $I_{0}$ is sufficient [11]. Note that the difference between these two PRDS notions is very minor and assuming strong PRDS instead of weak PRDS should not make a difference for most applications. However, differentiating between those two notions facilitates our proofs. When we just write PRDS in the remainder of this paper, we always mean weak PRDS. For example, the PRDS condition holds when the null p-values are independent from each other and the non-nulls. However, it also holds when there is some kind of positive dependence, which is an appropriate assumption in many trials. To guarantee FDR control by applying the BH procedure to our anytime-valid p-values, we need to ensure that our stopped p-values are PRDS. In the following we discuss when this is a reasonable assumption.

4.1 Stopping early for rejection

When PRDS is required, one needs to be careful with choosing the stopping time, as it could possibly induce some kind of negative dependence. For example, suppose we stop sampling for hypothesis $H_{0}^{i}$ as soon as $P_{t}^{i}\leq m_{t}^{*}\alpha/M$ . The smaller the other p-values, the sooner we could stop, which potentially induces some negative dependence between the p-values, even if the data used for calculating the different p-values is independent. However, the following proposition shows that stopping early for rejection with the BH procedure cannot inflate the FDR, which is very important as it allows to adapt the number of permutations to the number of rejections.

Theorem 4.1.

Let $\tau_{i}^{\prime}$ be a stopping time for each $i\in\{1,\ldots,M\}$ such that the stopped anytime-valid p-values $P_{\tau_{1}^{\prime}}^{1},\ldots,P_{\tau_{M}^{\prime}}^{M}$ are PRDS and define

\displaystyle\tau_{i}=\inf\{t\geq 1:\tau_{i}^{\prime}=t\lor P^{i}_{t}\leq m_{t}^{*}\alpha/M\}.

(16)

Then the BH procedure applied on $P_{\tau_{1}}^{1},\ldots,P_{\tau_{M}}^{M}$ rejects the same hypotheses as if applied on $P_{\tau^{\prime}_{1}}^{1},\ldots,P_{\tau^{\prime}_{M}}^{M}$ and therefore controls the FDR.

Proof.

Note that $P_{t}^{i}$ , $i\in\{1,\ldots,M\}$ , is nonincreasing and $m_{t}^{*}$ is nondecreasing in $t$ . Hence, if $P_{t}^{i}\leq m_{t}^{*}\alpha/M$ for some $t\in\mathbb{N}$ , then $P_{s}^{i}\leq m_{s}^{*}\alpha/M$ for all $s\geq t$ . ∎

In the following subsections, we discuss how to choose the stopping times $\tau_{1}^{\prime},\ldots,\tau_{M}^{\prime}$ in Theorem 4.1 such that the p-values $P_{\tau_{1}^{\prime}}^{1},\ldots,P_{\tau_{M}^{\prime}}^{M}$ are PRDS.

4.2 Stopping early for futility with independent p-values

We differentiate between two different stops; stop for rejection and stop for futility. By stop for futility, we mean that we stop early because it is unlikely that we would reject the hypothesis if we continue sampling. Thus, provided that the wealth of the used strategy is nonincreasing in the number of losses, one could write a stop for futility quite generally as

\displaystyle\tau_{i}^{\prime}=\inf\{t\geq 1:L_{t}^{i}>z_{t}^{i}\},

(17)

where the parameters $z_{t}^{i}$ can either be constant or dependent on the number of losses of the other hypotheses based on the multiple testing procedure used. For example, the stopping time of the Besag-Clifford method $\gamma_{i}(h,B)$ is obtained by $z_{1}^{i}=\ldots=z_{B-1}^{i}=h-1$ and $z_{B}=-1$ . A stop for futility in combination with a stop for rejection includes most reasonable stopping times. Theorem 4.1 shows that stopping for rejection is never a problem. Therefore, it remains to show that stopping for futility does not violate the PRDS condition as well. One obvious sufficient condition is if the data for different hypotheses is independent and the stopping time $\tau_{i}^{\prime}$ only depends on the data for the corresponding hypothesis $H_{0}^{i}$ , which is, for example, the case if the $z_{t}^{i}$ in (17) are constants. This is captured in the following proposition.

Proposition 4.2.

Suppose the permuted samples are generated independently for all hypotheses and the vector of limiting p-values $\boldsymbol{P}_{\mathrm{lim}}$ is independent on $I_{0}$ (see Property 1). Furthermore, for each $i\in\{1,\ldots,M\}$ , let the stopping time $\tau_{i}^{\prime}$ be given by $\tau_{i}^{\prime}=\inf\{t\in\mathbb{N}:L_{t}^{i}>z_{t}^{i}\}$ , where $z_{t}^{i}$ are constant parameters. Then the stopped anytime-valid p-values $\boldsymbol{P_{\tau^{\prime}}}=(P_{\tau_{1}^{\prime}}^{1},\ldots,P_{\tau_{M}^{\prime}}^{M})$ are independent on $I_{0}$ . In particular, applying BH to $\boldsymbol{P_{\tau}}=(P_{\tau_{1}}^{1},\ldots,P_{\tau_{M}}^{M})$ , where $\tau_{i}$ is defined as in (16), controls the FDR.

Proof.

Due to De Finetti’s theorem, the indicators $I_{t}^{i}=\mathbbm{1}\{Y_{t}^{i}\geq Y_{0}^{i}\}$ , $t\geq 1$ , for a hypothesis $H_{0}^{i}$ are i.i.d. conditional on $P_{\mathrm{lim}}^{i}$ and follow a mixture Bernoulli distribution with random probability $P_{\mathrm{lim}}^{i}$ . Since $\boldsymbol{P}_{\mathrm{lim}}$ is independent on $I_{0}$ , $\boldsymbol{P_{\tau^{\prime}}}$ is independent on $I_{0}$ as well.∎

In Proposition 4.2 we use that $\boldsymbol{P}_{\mathrm{lim}}$ is independent on $I_{0}$ instead of the assumption of independent data for the different hypotheses. This makes sense since if the data is independent, then $\boldsymbol{P}_{\mathrm{lim}}$ is independent as well, thus it is a less restrictive and mathematically better graspable assumption. In addition, $P_{\mathrm{lim}}^{i}$ can also be interpreted as the p-value we would choose if we knew the distribution of $Y_{0}^{i}$ under $H_{0}^{i}$ . Hence, it is a reasonable goal to ensure that our approach provides valid FDR control in those cases where the BH procedure with the limiting permutation p-values provides FDR control.

4.3 Stopping early for futility with PRDS p-values

If there is some positive dependence between the data for the different hypotheses, proving PRDS for the anytime-valid p-values becomes more difficult. The reason for this is that $P_{t}^{i}$ is not only a function of the losses at step $t$ but might depend on the losses of all previous steps $L_{1}^{i},\ldots,L_{t}^{i}$ . This can lead to situations in which a larger p-value $P_{t}^{i}$ yields more evidence against a large $P_{\mathrm{lim}}^{i}$ , which makes it hard to use that $\boldsymbol{P}_{\mathrm{lim}}$ is PRDS for proving that $\boldsymbol{P}_{t}$ is PRDS.

Example 3.

Consider a simple example in which for some $x<x^{*}$ : $P_{3}^{i}\leq x$ is equivalent to $L_{1}^{i}\leq 0\cup L_{2}^{i}\leq 0\cup L_{3}^{i}\leq 0$ , which reduces to $L_{1}^{i}\leq 0$ , and $P_{3}^{i}\leq x^{*}$ is equivalent to $L_{1}^{i}\leq 0\cup L_{2}^{i}\leq 0\cup L_{3}^{i}\leq 1$ , which reduces to $L_{1}^{i}\leq 0\cup L_{3}^{i}\leq 1$ . Recall that $L_{t}^{i}$ follows a mixture binomial distribution with size parameter $t$ and random probability $P_{\mathrm{lim}}^{i}$ . With this, under the assumption that $P_{\mathrm{lim}}^{i}$ follows a uniform distribution (which is true under $H_{0}^{i}$ ), one can check that $\mathbb{P}(P_{\mathrm{lim}}^{i}>0.9\mid L_{1}^{i}\leq 0)=0.01>0.0091=\mathbb{P}(P_{\mathrm{lim}}^{i}>0.9\mid L_{1}^{i}\leq 0\cup L_{3}^{i}\leq 1)$ . To understand why this is the case, note that $L_{1}^{i}\leq 0$ yields the following set of possible loss sequences $\{(0,1,1),(0,0,1),(0,1,0),(0,0,0)\}$ , while $L_{1}^{i}\leq 0\cup L_{3}^{i}\leq 1$ only adds $(0,0,1)$ to this set. For this reason, $L_{1}^{i}\leq 0\cup L_{3}^{i}\leq 1$ provides more evidence against small values for $P_{\mathrm{lim}}^{i}$ , but also more evidence against very large values of $P_{\mathrm{lim}}^{i}$ . Therefore, $P_{\mathrm{lim}}^{i}$ is not PRDS on $P_{t}^{i}$ in general.

We expect the effect from the above phenomenon, that larger bounds for the number of losses can lead to a lower probability for large values of the limiting permutation p-value, to be very minor and not inflating the FDR. If we choose a betting strategy with nonincreasing wealth in the number of losses and the $z_{t}^{i}$ in (17) are constant, then the stopped anytime-valid p-values $P_{\tau_{i}^{\prime}}^{i}$ , $i\in\{1,\ldots,M\}$ , are coordinatewise nondecreasing and nonrandom functions of the number of losses $(L_{t}^{i})_{t\in\mathbb{N}}$ . Due to De Finetti’s theorem, $\boldsymbol{P}_{\mathrm{lim}}$ being strongly PRDS implies that $(L_{t_{1}}^{1},\ldots,L_{t_{M}}^{M})$ is PRDS for any $t_{1},\ldots,t_{M}\in\mathbb{N}$ (see also the proof of Proposition 4.3). Hence, in this case $P_{\tau_{1}^{\prime}}^{1},\ldots,P_{\tau_{M}^{\prime}}^{M}$ potentially have some kind of positive association as well and therefore we argue that it is reasonable to replace the classical permutation p-values with our anytime-valid ones in the BH procedure. Even if the $z_{t}^{i}$ depend in a nonincreasing way on the losses of the other hypotheses, we do not see any reason why this should hurt the PRDS of the stopped p-values. Note that in practice the classical permutation p-values can usually not be checked for PRDS either and therefore PRDS is simply assumed if there is no explicit evidence against it. Since we can additionally stop for rejection (Theorem 4.1), this provides a large class of possible betting strategies and stopping times.

Fischer and Ramdas [12] proposed to stop for futility if the wealth drops below $\alpha$ . This can easily be transferred to a condition of the form (17) with constant parameters $z_{t}^{i}$ . In the multiple testing case the level an individual hypothesis is tested at is not fixed in advance. For this reason, we propose to stop sampling for $H_{0}^{i}$ at time $t$ , if $W_{t}^{i,\alpha_{t}^{\max}}<\alpha_{t}^{\max}$ , where $\alpha_{t}^{\max}:=\alpha(|A_{t}|+m_{t}^{*})/M$ and $A_{t}$ is the index set of hypotheses for which the testing process did not stop before step $t$ . Hence, $\alpha_{t}^{\max}$ can be interpreted as the maximum level a hypothesis will be tested at by the BH procedure according to the information up to step $t$ . When using the binomial mixture strategy, the combined stopping time of this stop for futility with a stop for rejection is almost surely finite (see (10)). The detailed algorithm of the entire BH procedure using the binomial mixture strategy is illustrated in Appendix D. Of course, this is only one possible stop for futility and there are many other reasonable choices. For example, in situations where the sampling process takes several weeks or months it would also be possible to choose the stopping time interactively, meaning to revisit the data at some point and decide for which hypotheses to continue sampling based on the study interests and the evidence gathered so far. Our simulations in Section 5.3 confirm that the binomial mixture strategy with this stop for futility does not inflate the FDR if the limiting permutation p-values are PRDS.

Stopping early for futility with the anytime-valid Besag-Clifford p-value.

We believe that for most applications the heuristic and empirical argumentation above should be reasonable enough to replace the classical permutation p-values with our anytime-valid ones in the BH procedure. However, if this heuristic argumentation is not allowed, one could still use the anytime-valid BC method (6) with its incorporated stop for futility $\boldsymbol{\gamma}(h)$ as shown by the next proposition.

Proposition 4.3.

Suppose the permuted samples are generated independently for all hypotheses and the vector of limiting p-values $\boldsymbol{P}_{\mathrm{lim}}$ is strongly PRDS on $I_{0}$ . Then the anytime-valid BC p-values $\boldsymbol{P}_{\boldsymbol{\gamma}(h)}^{\mathrm{avBC}}=(P_{\gamma_{1}(h)}^{\mathrm{avBC},1},\ldots,P_{\gamma_{M}(h)}^{\mathrm{avBC},M})$ , where $\gamma_{i}(h)=\inf\{t\geq 1:L_{t}^{i}=h\}$ , are weakly PRDS on $I_{0}$ . In particular, applying BH to $\boldsymbol{P}_{\boldsymbol{\tau}}^{\mathrm{avBC}}=(P_{\tau_{1}}^{\mathrm{avBC},1},\ldots,P_{\tau_{M}}^{\mathrm{avBC},M})$ , where $\tau_{i}$ is defined as in (16) for $\tau_{i}^{\prime}=\gamma_{i}(h)$ , controls the FDR.

The proof is provided in Appendix B. The main technical difference to the general case which makes it possible to prove PRDS for the anytime-valid Besag-Clifford p-values is that the event $\{P_{\gamma_{i}(h)}^{\mathrm{avBC},i}\leq x\}$ is equivalent to the event $\{L_{\lceil h/x\rceil-1}\leq h-1\}$ and therefore only depends on the number of losses at one fixed time.

5 Simulated experiments

In this section, we aim to characterize the behavior of applying the BH procedure with the proposed sequential permutation p-values using simulated data. We consider an independent Gaussian mean multiple testing problem. That means, each hypothesis is given by $H_{0}^{i}:\mathbb{E}[Y_{0}^{i}]=0$ , $i\in\{1,\ldots,M\}$ , where $Y_{0}^{i}$ follows a standard normal distribution under the null hypothesis and a shifted standard normal distribution with mean $\mu_{A}$ under the alternative. The probability for a hypothesis being false was set to $\pi_{A}\in[0,1]$ and the generated test statistics $Y_{1}^{i},\ldots,Y_{B}^{i}$ were sampled from $N(0,1)$ . Note that this matches the setting described in Section 2. The R code for reproducing the simulations is available at github.com/fischer23/MC-testing-by-betting/.

5.1 Power and average number of permutations

First, we evaluate the power and average number of permutations per hypothesis using the aggressive strategy, the anytime-valid generalization of the BC method with $h=10$ (6), the binomial mixture strategy (9) with $b=0.9$ , the AMT algorithm by Zhang et al. [48] with $\delta=0.01$ and the classic permutation p-value (1). The anytime-valid BC method was applied with its incorporated stop for futility and the binomial mixture strategy with the stop described in Section 4.3, but all methods were stopped after a maximum number of $B=\numprint{10000}$ permutations. Both anytime-valid methods were stopped for rejection as shown in (16). In addition to applying the classical permutation p-value for $B=\numprint{10000}$ , we also evaluated it for $B=200$ as a reference. It should be noted that the AMT algorithm was applied at an overall level of $\alpha-\delta$ such that FDR control at level $\alpha$ is provided [48].

Refer to caption — Figure 1: Power and average number of permutations per hypothesis for varying simulation parameters obtained by applying the BH procedure to different (sequential) permutation testing strategies. The anytime-valid BC method and the binomial mixture strategy lead to marginally less power than the classical permutation p-value, while reducing the number of permutations by orders of magnitude.

The results in Figure 1 were obtained by averaging over $10$ independently simulated trials. Similarly as in [48], we set the standard values of the simulation parameters to $\pi_{A}=0.4$ , $\mu_{A}=2.5$ , $\alpha=0.1$ and $M=1000$ , while one of them was varied in each of the plots.

It can be seen that the anytime-valid BC method and binomial mixture strategy lead to a similar power as the classical permutation p-value for $B=\numprint{10000}$ permutations, while being able to reduce the number of permutations by orders of magnitude. The anytime-valid BC method performs slightly better than the binomial mixture one, however, the binomial mixture strategy provides more flexibility with respect to the stopping time and one could, for example, continue sampling for the hypotheses where the data looks promising but did not lead to a rejection after the first $\numprint{10000}$ permutations. In addition, it has advantages with respect to early reporting of rejections as described in the next section.

When the number of permutations of the classical permutation p-value is reduced to $200$ , the power reduces substantially, particularly if the proportion of false hypotheses, strength of the alternative or significance level is small. Since the anytime-valid BC and binomial mixture method also need approximately $200$ permutations on average, this shows that the performance of these sequential methods cannot be accomplished by the permutation p-value with a fixed number of permutations. The AMT algorithm was outperformed by the anytime-valid BC method and the binomial mixture strategy in terms of power and number of permutations in all considered scenarios. The use of the aggressive strategy can be reasonable when the main goal is to reduce the number of permutations, while a power loss is acceptable.

It should be noted that the behavior of the methods does not change much with the number of hypotheses, since the other parameters remain fixed, which implies that $m_{t}^{*}/M$ , and thus the significance level of BH procedure, remain approximately constant. Lastly, we would like to highlight that the results for all these different constellations of the simulation parameters were obtained with the same hyperparameters for the sequential methods, which thus seem to be universally applicable choices.

5.2 Early reporting of decisions

In the previous subsection, we have shown that a lot of permutations can be saved by our sequential strategies, while the power remains similar to the classical permutation p-value. However, reducing the total number of permutations is not the only way of increasing efficiency with sequential permutation tests. We can also report already made decisions, and particularly rejections, before the entire process has stopped. This might not be important if the entire procedure only takes some hours or one day to run. However, in large-scale trials generating all required permutations and performing the tests can take up to several months or even longer. This can slow down the research process, as the results are crucial for writing a scientific paper or identifying follow up work. Therefore, it would be very helpful to be able to already report the unambiguous decisions at an earlier stage of the process. Note that this is not possible with the AMT algorithm by Zhang et al. [48], since no decisions can be obtained until the entire process has finished.

In Figure 2 we show the distribution of the rejection times in an arbitrary simulation run using the anytime-valid BC and the binomial mixture strategy in the standard setting described in Section 5.1. In this case, the binomial mixture strategy rejected $305$ and the anytime-valid BC method $304$ hypotheses, while the former needed a mean number of $459$ and the latter of $328$ permutations to obtain a rejection. However, the distribution of the stopping times looks very different. More than $50\%$ of the rejections made by the binomial mixture strategy were obtained after $147$ permutations and more than $75\%$ after less than $180$ permutations, while the rest is distributed quite broadly up to $\numprint{10000}$ permutations. In contrast, all rejections by the anytime-valid BC procedure were made at time $328$ such that the entire process was stopped at that time.

Basically, this is because for a fixed level $\alpha$ the rejections made by the anytime-valid BC procedure are not obtained in a sequential manner, but we can just reject all hypotheses with $L_{t_{\alpha}}\leq h-1$ , where $t_{\alpha}=\lceil h/\alpha\rceil-1$ (see Section 2). Hence, with the BH procedure we sample until there is a $m^{*}\in\{1,\ldots,M\}$ such that $L_{t_{\alpha m^{*}/M}}\leq h-1$ for at least $m^{*}$ hypotheses. Since $L_{t_{\alpha m^{*}/M}}\leq h-1$ for all hypotheses that have not been stopped for futility yet, we can stop the entire process and reject all the remaining hypotheses. However, if the sampling process is stopped before reaching the time $t_{\alpha m^{*}/M}$ such that $L_{t_{\alpha m^{*}/M}}\leq h-1$ for $m^{*}$ hypotheses (in our example time $328$ ), the anytime-valid BC method would not reject any hypothesis, while the binomial mixture strategy could have already achieved a lot of rejections.

Consequently, while the anytime-valid BC method needed less permutations in total, the binomial mixture strategy would be more useful if early reporting of the rejections is desired, since the majority of decisions is obtained much faster. In practice, the binomial mixture strategy could be used to start interpreting the results while keeping to generate further permutations in order to increase the total number of discoveries. Indeed, if the data for some of the undecided hypotheses looks promising, sampling could be continued after the first $\numprint{10000}$ permutations to possibly make even more rejections. In this case, one could also consider increasing the precision of the binomial mixture strategy by choosing the parameter $b$ closer to $1$ , since a larger average number of permutations might be acceptable when the majority of decisions is obtained fast.

5.3 FDR control

To evaluate the FDR control of BH with our anytime-valid p-values, we generate the vector $(Y_{0}^{1},\ldots,Y_{0}^{M})$ as multivariate normal with mean $0$ and $\mathrm{cov}(Y_{0}^{i},Y_{0}^{j})=\rho$ for $i\neq j$ , $\rho\in\{0,0.1,0.3,0.5,0.7,0.9\}$ . In this case, the limiting permutation p-values are strongly PRDS [6]. The other simulation parameters and parameters for the permutation testing strategies were chosen as in our standard setting introduced in Section 5.1. The FDR is obtained by averaging over $1000$ independent trials. All procedures controlled the FDR at the prespecified level $\alpha=0.01$ for all considered parameters $\rho$ (see Figure 3). This is not surprising, since the BH procedure with the aggressive strategy, the anytime-valid BC method and the classical permutation p-value control it provably and we gave reasonable arguments why it should with the binomial mixture strategy as well. Also note that in this case the BH procedure controls the FDR at level $0.06=0.6\alpha$ , since the proportion of false hypotheses was set to $\pi_{A}=0.4$ [5].

6 Real data analyses

6.1 Neuroscience example

Henderson et al. [23] evaluated the neural responses in the higher visual cortex to natural scene images using fMRI data from the Natural Scenes Dataset [1]. For each considered voxel in each of the eight participants, resulting in a total number of more than $\numprint{150000}$ voxels, they modeled the voxel response by texture statistics [36] in a ridge regression, where the regularization parameter was obtained by cross-validation. To evaluate model accuracy, they calculated the coefficient of determination $R^{2}$ on a held-out validation set and accessed significance using a permutation test. For this, they permuted the image labels in the training and validation data and performed the entire fitting and evaluation process for each of these permuted datasets. The final decisions were obtained by applying the BH procedure to the classical permutation p-values at level $\alpha=0.01$ .

Due to the large number of hypotheses and refitting step for each permutation, this problem requires an enormous computational effort. Henderson et al. [23] drew $1000$ permutations for each hypothesis and made the resulting $R^{2}$ publicly available at https://osf.io/8fsnx/, which we also use in our analysis. Note that $1000$ permutations is arguably somewhat low for testing such a large number of hypotheses at level $\alpha=0.01$ , but it turned out to not be an issue because there were many very small p-values, and thus the rejection threshold of the BH procedure was not too small. The proportion of rejections and average number of permutations per hypothesis obtained by the different strategies are illustrated in Table 1. We applied the anytime-valid BC method and the binomial mixture strategy with parameters $h=3$ and $b=0.6$ , respectively, to keep the number of permutations low and stopped at the latest when all $1000$ permutations were drawn. For the AMT algorithm [48] we changed $\delta$ to $0.001$ due to the lower $\alpha$ in this case.

Table 1: Proportion of rejected hypotheses and average number of permutations per hypothesis obtained by applying the BH procedure (via Algorithm 1 for our anytime-valid p-values) with different methods on fMRI data.

Method	Proportion rejected ( $\%$ )	Number of permutations
Permutation p-value	$62.8$	$1000$
Anytime-valid BC (ours)	$62.7$	$411$
Aggressive strategy (ours)	$61.9$	$138$
Binomial mixture strategy (ours)	$62.2$	$245$
AMT algorithm [48]	$62.3$	$687$

The results show that, although Henderson et al. [23] have already chosen a rather low number of permutations, our sequential methods were able to further reduce it significantly while rejecting a similar number of hypotheses. Also note that the binomial mixture strategy could hypothetically have continued sampling permutations for only those hypotheses where the data looks promising after generating the first $1000$ permutations, in order to possibly obtain further rejections. If the proportion of low p-values was smaller, e.g., if less than $10\%$ of the p-values were smaller than $\alpha$ , then no hypothesis could have been rejected by Henderson et al. [23] no matter how strong the evidence was. In contrast, our sequential strategies with parameters $h=3$ and $b=0.6$ would still have reasonable power, but might have sampled more than $1000$ permutations for some of the hypotheses. In this case the AMT algorithm (for a maximum number of $B=1000$ permutations) would be powerless as well. The R code for reproducing this analysis is available at github.com/fischer23/MC-testing-by-betting/.

6.2 Genomics example

Until this point we have focused only on the total number of permutations, ignoring running time. For example, we could not study running time in our brain imaging example (above) because the test statistics were pre-computed. In this section we apply our sequential permutation testing method to analyze an RNA sequencing (RNA-seq) dataset, carefully profiling the running time of our method and comparing it to competitors.

Typically, RNA-seq data are analyzed using a parametric regression method, such as DESeq2 [31] or Limma [38]. However, in a recent, careful analysis of 13 RNA-seq datasets, Li et al. [29] found that popular tools for RNA-seq analysis — including DESeq2 — did not control type-I error on negative control data (i.e., data devoid of signal). This likely was because the parametric assumptions of these methods broke down. Li et al. instead recommended use of the Mann-Whitney (MW) test [32] — a classical, nonparametric, two-sample test — for RNA-seq data analysis.¹¹1According to Li et al. [29], the RNA-seq data should contain at least 16 samples for the MW test to have sufficient power.

Li et al. applied the MW test to analyze several datasets, including a dataset of human adipose (i.e., fat) tissue generated by the Genotype-Tissue Expression (GTEx) project [30]. Subcutaneous (i.e., directly under-the-skin) adipose tissue and visceral (i.e., deep within-the-body) adipose tissue samples were collected from 581 and 469 subjects, respectively. RNA sequencing was performed on each of these $n=1,050$ tissue samples to measure the expression level of each of $m=54,591$ genes. This experimental procedure yielded a tissue-by-gene expression matrix $Y\in\mathbb{N}^{n\times m}$ and a binary vector $X\in\{0,1\}^{n}$ indicating whether a given tissue was subcutaneous ( $X_{i}=0$ ) or visceral ( $X_{i}=1$ ). Li et al. sought to determine which genes were differentially expressed across the subcutaneous and visceral tissue samples. To this end, for each gene $j\in\{1,\dots,m\}$ , Li et al. performed a MW test to test for association between the gene expressions $Y_{1,j},\dots,Y_{n,j}$ and the indicators $X_{1},\dots,X_{n}$ , yielding p-values $p_{1},\dots,p_{m}$ . Finally, Li et al. subjected the p-values to a BH correction to produce a discovery set. Li et al. used the asymptotic version of the MW test, as it is much more computationally efficient than the finite sample version of the MW test in high-multiplicity settings. The finite sample MW test generally is calibrated via permutations.

We sought to explore whether our sequential permutation testing procedure would enable computationally efficient application of the finite-sample MW test to the RNA-seq data. We thus conducted an experiment in which we applied three methods to analyze the RNA-seq data with the BH procedure: (i) the asymptotic MW p-value (implemented via the wilcox.test() function in R); (ii) a permutation p-value based on the MW statistic using a fixed number $B=5m/\alpha$ of permutations across hypotheses; and (iii) the anytime-valid BC p-value based on the MW test statistic with Algorithm 1. We implemented our anytime-valid BC method in C++ for maximum speed (see Appendix C for more information). We set the tuning parameter $h$ to $15$ in the latter method, and we set the nominal FDR across methods to $\alpha=0.1$ . We compared the methods with respect to their running time and number of discoveries.

The classical permutation p-value was slowest, taking over three days and 14 hours to complete. The asymptotic p-value and the anytime-valid permutation p-value, on the other hand, were much faster, running in about two minutes and three and a half minutes, respectively. All three methods rejected a similar proportion of the null hypotheses ( $56.4-56.9$ % across methods). We concluded on the basis of this experiment that BH with the anytime-valid permutation p-value inherited the strenghts of both the asymptotic p-value and the classical permutation p-value: like the asymptotic p-value, the anytime-valid permutation p-value was fast, and like the classical permutation p-value, the anytime-valid permutation test avoided asymptotic assumptions. Additional information about the RNA-seq data analysis is relayed in Appendix C.

Table 2: Comparison of the BH procedure applied with the asymptotic MW p-value, the classical permutation p-value based on the MW statistic, and Algorithm 1 with the anytime-valid BC p-value based on the MW statistic. All three methods were applied to analyze the adipose RNA-seq data. Our proposed anytime-valid permutation p-value was nearly as fast as the asymptotic p-value while avoiding asymptotic assumptions.

Method	Proportion rejected ( $\%$ )	Running time
Asymptotic p-value	$56.6\%$	2m 4s
Classical permutation p-value	$56.4\%$	3d 14h 24m
Anytime-valid BC p-value (ours)	$56.9\%$	3m 27s

It is important to note that a significant limitation of the MW test is that it does not allow for the straightforward inclusion of covariates, such as the sex of the subject from which a given tissue sample was derived. Our goal in this section was to reproduce the analysis of Li et al., swapping their asymptotic MW test for a computationally efficient, finite-sample alternative. How to test for differential expression under minimal assumptions — while adjusting for covariates — is an ongoing topic of research in statistical genomics [35].

A minimally viable R/C++ package providing fast implementations of several of the methods introduced in this paper is available at github.com/timothy-barry/adaptiveperm/ and the R code to reproduce this real data analysis is available at github.com/timothy-barry/sequential_perm_testing.

7 Conclusion

There are several advantages of using the proposed anytime-valid permutation p-values in large-scale multiple testing problems:

1.

Our methods automatically adapt the number of permutations to the number of rejections and thus the proportion of false hypotheses and strengths of the alternatives. For this reason, there is no need to adapt the parameter choice to the unknown data generating process. We found $h=10$ for the anytime-valid BC method and $b=0.9$ for the binomial mixture strategy to deliver good results in a wide variety of settings. Depending on whether the main focus lies in increasing precision or reducing the number of permutations, $h$ and $b$ could also be chosen larger or smaller, respectively.
2.

Due to their ability to adapt to the actual significance level obtained by the multiple testing procedure and stopping early as soon as a decision in either direction can be made, our methods reduce the required number of permutations by orders of magnitude. In particular, simulations showed that there is no fixed number of permutations such that the classical permutation p-value performs similar as the sequential permutation p-values. The anytime-valid BC method performed best in terms of power and required number of permutations.
3.

Our methods allow to report decisions early such that one can begin follow-up work before the entire testing process is finished. In particular, the binomial mixture strategy was found to be useful for such a proceeding.
4.

During computing times of several months or more, the initial goals and ideas might change based on external information or the data observed so far. The sequential permutation tests allow to adapt the stopping time and even the betting strategy interactively based on the data. It should be noted that the betting strategy can only be changed for the upcoming permutations and not for the permutations that have already been observed.

In this paper, we showed how anytime-valid permutation tests can be used with p-value based multiple testing procedures. Another type of multiple testing procedures does not rely on individual p-values and explicitly uses the permutation tests to adapt to the unknown dependence structure of the data. This includes the famous MaxT approach for FWER control by Westfall & Young [47], which is particularly powerful when there is a large positive correlation between the test statistics. In Appendix A, we derive sequential versions of such permutation based multiple testing procedures for FWER and simultaneous FDP control. These permutation based multiple testing procedures are based on the closure principle [33] and therefore only consist of tests at level $\alpha$ , which reduces the need for sequential permutation tests. Still, we believe that these can be useful in certain applications.

Acknowledgments

The authors thank Leila Wehbe for stimulating practical discussions. LF acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project number 281474342/GRK2224/2. AR was funded by NSF grant DMS-2310718.

References

Allen et al. [2022] Emily J Allen, Ghislain St-Yves, Yihan Wu, Jesse L Breedlove, Jacob S Prince, Logan T Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, et al. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1):116–126, 2022.
Bancroft et al. [2013] Tim Bancroft, Chuanlong Du, and Dan Nettleton. Estimation of false discovery rate using sequential permutation p-values. Biometrics, 69(1):1–7, 2013.
Barry et al. [2021] Timothy Barry, Xuran Wang, John A Morris, Kathryn Roeder, and Eugene Katsevich. SCEPTRE improves calibration and sensitivity in single-cell crispr screen analysis. Genome Biology, 22:1–19, 2021.
Bates et al. [2020] Stephen Bates, Matteo Sesia, Chiara Sabatti, and Emmanuel Candès. Causal inference in genetic trio studies. Proceedings of the National Academy of Sciences, 117(39):24117–24126, 2020.
Benjamini and Hochberg [1995] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B: Statistical Methodology, 57(1):289–300, 1995.
Benjamini and Yekutieli [2001] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, pages 1165–1188, 2001.
Besag and Clifford [1991] Julian Besag and Peter Clifford. Sequential Monte Carlo p-values. Biometrika, 78(2):301–304, 1991.
Bretz et al. [2009] Frank Bretz, Willi Maurer, Werner Brannath, and Martin Posch. A graphical approach to sequentially rejective multiple test procedures. Statistics in Medicine, 28(4):586–604, 2009.
Candès et al. [2018] Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3):551–577, 2018.
Fay and Follmann [2002] Michael P Fay and Dean A Follmann. Designing Monte Carlo implementations of permutation or bootstrap hypothesis tests. The American Statistician, 56(1):63–70, 2002.
Finner et al. [2009] Helmut Finner, Thorsten Dickhaus, and Markus Roters. On the false discovery rate and an asymptotically optimal rejection curve. The Annals of Statistics, 37(2):596–618, 2009.
Fischer and Ramdas [2024] Lasse Fischer and Aaditya Ramdas. Sequential monte-carlo testing by betting. arXiv preprint arXiv:2401.07365, 2024.
Gandy and Hahn [2014] Axel Gandy and Georg Hahn. MMCTest—a safe algorithm for implementing multiple Monte Carlo tests. Scandinavian Journal of Statistics, 41(4):1083–1101, 2014.
Gandy and Hahn [2016] Axel Gandy and Georg Hahn. A framework for Monte Carlo based multiple testing. Scandinavian Journal of Statistics, 43(4):1046–1063, 2016.
Gandy and Hahn [2017] Axel Gandy and Georg Hahn. QuickMMCTest: quick multiple Monte Carlo testing. Statistics and Computing, 27:823–832, 2017.
Genovese and Wasserman [2006] Christopher R Genovese and Larry Wasserman. Exceedance control of the false discovery proportion. Journal of the American Statistical Association, 101(476):1408–1417, 2006.
Goeman and Solari [2011] Jelle J Goeman and Aldo Solari. Multiple testing for exploratory research. Statistical Science, 26(4):584–597, 2011.
Good [2013] Phillip Good. Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer Science & Business Media, 2013.
Grünwald et al. [2024] Peter Grünwald, Rianne de Heide, and Wouter M Koolen. Safe testing. Journal of the Royal Statistical Society Series B: Statistical Methodology (with discussion), 2024.
Guo and Peddada [2008] Wenge Guo and Shyamal Peddada. Adaptive choice of the number of bootstrap samples in large scale multiple testing. Statistical Applications in Genetics and Molecular Biology, 7(1), 2008.
Hapfelmeier et al. [2023] Alexander Hapfelmeier, Roman Hornung, and Bernhard Haller. Efficient permutation testing of variable importance measures by the example of random forests. Computational Statistics & Data Analysis, 181:107689, 2023.
Hemerik and Goeman [2018] Jesse Hemerik and Jelle J Goeman. False discovery proportion estimation by permutations: confidence for significance analysis of microarrays. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(1):137–155, 2018.
Henderson et al. [2023] Margaret M Henderson, Michael J Tarr, and Leila Wehbe. A texture statistics encoding model reveals hierarchical feature selectivity across human visual cortex. Journal of Neuroscience, 43(22):4144–4161, 2023.
Hochberg [1988] Yosef Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4):800–802, 1988.
Holm [1979] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65–70, 1979.
Hommel [1988] Gerhard Hommel. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75(2):383–386, 1988.
Howard et al. [2021] Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 49(2):1055–1080, 2021.
Jiang and Salzman [2012] Hui Jiang and Julia Salzman. Statistical properties of an early stopping rule for resampling-based multiple testing. Biometrika, 99(4):973–980, 2012.
Li et al. [2022] Yumei Li, Xinzhou Ge, Fanglue Peng, Wei Li, and Jingyi Jessica Li. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biology, 23(1):79, 2022.
Lonsdale et al. [2013] John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. The genotype-tissue expression (gtex) project. Nature Genetics, 45(6):580–585, 2013.
Love et al. [2014] Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with deseq2. Genome Biology, 15(12):550, 2014. doi: 10.1186/s13059-014-0550-8.
Mann and Whitney [1947] Henry B Mann and Donald R Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60, 1947.
Marcus et al. [1976] Ruth Marcus, Peritz Eric, and K Ruben Gabriel. On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655–660, 1976.
Maurer et al. [1995] Willi Maurer, Ludwig Hothorn, and Walter Lehrmacher. Multiple comparisons in drug clinical trials and preclinical assays: a-priori ordered hypotheses. In Vollmar Joachim, editor, Biometrie in der Chemisch-Pharmazeutischen Industrie, pages 3–18. Fischer Verlag, Stuttgart, 1995.
Niu et al. [2024] Zhaonan Niu, Jayanta R Choudhury, and Eugene Katsevich. Computationally efficient and statistically accurate conditional independence testing with spaCRT. arXiv preprint arXiv:2407.08911, Jul 2024.
Portilla and Simoncelli [2000] Javier Portilla and Eero P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40:49–70, 2000.
Pounds et al. [2011] Stan Pounds, Xueyuan Cao, Cheng Cheng, Jun J Yang, Dario Campana, Ching-Hon Pui, William Evans, and Mary Relling. Integrated analysis of pharmacologic, clinical and SNP microarray data using projection onto the most interesting statistical evidence with adaptive permutation testing. International Journal of Data Mining and Bioinformatics, 5(2):143–157, 2011.
Ritchie et al. [2015] Matthew E Ritchie, Belinda Phipson, DI Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7):e47–e47, 2015.
Robinson et al. [2010] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010.
Sandve et al. [2011] Geir Kjetil Sandve, Egil Ferkingstad, and Ståle Nygård. Sequential Monte Carlo multiple testing. Bioinformatics, 27(23):3235–3241, 2011.
Shafer [2021] Glenn Shafer. Testing by betting: a strategy for statistical and scientific communication. Journal of the Royal Statistical Society Series A: Statistics in Society (with discussion), 184(2):407–431, 2021.
Silva et al. [2009] Ivair Silva, Renato Assunçao, and Marcelo Costa. Power of the sequential Monte Carlo test. Sequential Analysis, 28(2):163–174, 2009.
Stark and Ottoboni [2018] Philip B Stark and Kellie Ottoboni. Random sampling: Practice makes imperfect. arXiv preprint arXiv:1810.10985, 2018.
Ting [2021] Daniel Ting. Simple, optimal algorithms for random sampling without replacement. arXiv preprint arXiv:2104.05091, 2021.
Tusher et al. [2001] Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9):5116–5121, 2001.
Vesely et al. [2023] Anna Vesely, Livio Finos, and Jelle J Goeman. Permutation-based true discovery guarantee by sum tests. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3):664–683, 2023.
Westfall and Young [1993] Peter H Westfall and S Stanley Young. Resampling-based multiple testing: examples and methods for p-value adjustment, volume 279. John Wiley & Sons, 1993.
Zhang et al. [2019] Martin Zhang, James Zou, and David Tse. Adaptive Monte Carlo multiple testing via multi-armed bandits. In International Conference on Machine Learning, pages 7512–7522. PMLR, 2019.

Appendix A Sequential permutation based multiple testing procedures

We consider the same setting as in Section 3. However, instead of defining the p-values as in (14), we construct a level- $\alpha$ permutation test $\phi_{I}$ for each intersection hypothesis $H_{0}^{I}=\bigcap_{i\in i}H_{0}^{i}$ , $I\subseteq\{1,\ldots,M\}$ , and then use the closure principle to obtain decisions for individual hypotheses. There are two versions of the closure principle. The initial version was proposed by Marcus et al. [33] who showed that the rejection set defined by

\displaystyle R=\{i\in\{1,\ldots,M\}:\phi_{I}=1\text{ for all }I\text{ with }i\in I\}

(18)

controls the FWER. However, Goeman and Solari [17] showed that the closure principle can also be used for simultaneous true discovery control. In particular, they proved that

\displaystyle\mathbb{P}(\boldsymbol{d}(S)\leq|S\cap I_{1}|\text{ for all }S\subseteq\{1,\ldots,M\})\geq 1-\alpha,

where $\boldsymbol{d}(S)\coloneqq\min\{|S\setminus I|:I\subseteq\mathbb{N},\phi_{I}=0\}$ and $I_{1}=\{1,\ldots,M\}\setminus I_{0}$ . In an earlier paper, Genovese and Wasserman [16] proposed an equivalent approach without using the closure principle.

Most permutation based multiple testing procedures are based on choosing some combination function $C_{I}$ for each $I\subseteq M$ , defining the intersection tests $\phi_{I}$ by the classical permutation p-value applied on $C_{I}((Y_{0}^{i})_{i\in I}),C_{I}((Y_{1}^{i})_{i\in I}),\ldots,C_{I}((Y_{B}^{i})_{i\in I})$ and then applying the closure principle with the intersection tests $\phi_{I}$ . To make sure that the $\phi_{I}$ are indeed level- $\alpha$ tests such that this yields a valid procedure, the following assumption is usually made.

Assumption 1.

The vectors of test statistics corresponding to true hypotheses $(Y_{0}^{i})_{i\in I_{0}},(Y_{1}^{i})_{i\in I_{0}},\ldots,$ are jointly exchangeable.

This general approach encompasses many existing permutation based multiple testing methods. For example, the MaxT approach for FWER control [47] is obtained by $C_{I}((Y_{j}^{i})_{i\in I})=\max_{i\in I}Y_{j}^{i}$ . Vesely et al. [46] focus on procedures providing true discovery control with $C_{I}((Y_{j}^{i})_{i\in I})=\sum_{i\in I}Y_{j}^{i}$ . Furthermore, the method by Hemerik and Goeman [22], which uniformly improves the popular significance analysis of microarrays (SAM) procedure [45], is obtained by $C_{I}((Y_{j}^{i})_{i\in I})=\#\{i\in I:Y_{j}^{i}\in D^{i}\}$ for some prespecified sets $D^{i}$ .

It is straightforward to derive sequential versions of these multiple testing methods by replacing the classical permutation tests $\phi_{I}$ by the anytime-valid ones introduced in Section 2. We summarized this general approach in Algorithm 2.

Algorithm 2 Sequential permutation based multiple testing with the closure principle

Input: Combination functions $C_{I}$ , $I\subseteq\mathbb{N}$ , and sequences of test statistics $Y_{0}^{i},Y_{1}^{i},\ldots$ , $i\in\{1,\ldots,M\}$ .
Optional output: Rejection set $R$ for FWER control or function $\boldsymbol{d}$ for simultaneous FDP control.

1:for

I\subseteq\{1,\ldots,M\}

2:Test

H_{0}^{I}

by applying a level-

\alpha

anytime-valid permutation test

\phi_{I}

C_{I}((Y_{0}^{i})_{i\in I}),C_{I}((Y_{1}^{i})_{i\in I}),\ldots

3:end for

R=\{i\in\{1,\ldots,M\}:\phi_{I}=1\text{ for all }I\text{ with }i\in I\}

\boldsymbol{d}(S)=\min\{|S\setminus I|:I\subseteq\{1,\ldots,M\},\phi_{I}=0\}

for all

S\subseteq\{1,\ldots,M\}

6:return

R

\boldsymbol{d}

For a large number of hypotheses $M$ this general approach is computationally infeasible. For this reason, short cuts have been proposed for specific choices of the combination function $C_{I}$ [47, 22, 46]. In case of the MaxT approach, the entire closed test can be performed with a maximum number of $M$ intersection tests [47]. It might be the case that using these short cuts only works for specific betting strategies. For example, the MaxT short cut can be used with all betting strategies with nonincreasing wealth for an increasing number of losses. The detailed procedure for applying the MaxT approach with the binomial mixture strategy is provided in Algorithm 4.

Appendix B Omitted proofs

Lemma B.1.

Let $Y$ , $Z$ and $X$ be real valued random variables on $(\Omega,\mathcal{A},\mathcal{P})$ . If $Y$ is weakly PRDS on $X$ , $Z$ strongly PRDS on $Y$ , and $Z$ and $X$ independent conditional on $Y$ , then $Z$ is weakly PRDS on $X$ .

Proof.

Let $D$ be an increasing set and $x^{*}\geq x$ , then

	$\displaystyle\mathbb{P}(Z\in D\mid X\leq x)$	$\displaystyle=\mathbb{E}_{Y\mid X\leq x}[\mathbb{P}(Z\in D\mid Y,X\leq x)]$
		$\displaystyle=\mathbb{E}_{Y\mid X\leq x}[\mathbb{P}(Z\in D\mid Y)]$
		$\displaystyle=\int_{\Omega}\mathbb{P}(Z\in D\mid Y=y)dF_{Y\mid X\leq x}(y)$
		$\displaystyle\leq\int_{\Omega}\mathbb{P}(Z\in D\mid Y=y)dF_{Y\mid X\leq x^{*}}(y)$
		$\displaystyle=\mathbb{P}(Z\in D\mid X\leq x^{*}),$

where the inequality follows from the fact that $P(Z\in D\mid Y=y)$ is increasing in $y$ and $F_{Y\mid X\leq x^{*}}(y)\leq F_{Y\mid X\leq x}(y)$ for all $y$ . ∎

Proof of Proposition 4.3.

Let $x,x^{*}\in[0,1]$ with $x^{*}\geq x$ , $D\subseteq[0,1]^{M}$ be an increasing set and $i\in I_{0}$ be arbitrary but fixed. In the following, we just write $\boldsymbol{P}$ and $P^{i}$ instead of $\boldsymbol{P}^{\mathrm{avBC}}$ and $P^{\mathrm{avBC},i}$ , respectively. We want to show that

\mathbb{P}(\boldsymbol{P}_{\boldsymbol{\gamma}(h)}^{-i}\in D\mid P_{\gamma_{i}(h)}^{i}\leq x)\leq\mathbb{P}(\boldsymbol{P}_{\boldsymbol{\gamma}(h)}^{-i}\in D\mid P_{\gamma_{i}(h)}^{i}\leq x^{*}),

where $\boldsymbol{P}_{\boldsymbol{\gamma}(h)}^{-i}=(P_{\gamma_{1}(h)}^{1},\ldots,P_{\gamma_{i-1}(h)}^{i-1},P_{\gamma_{i+1}(h)}^{i+1},\ldots,P_{\gamma_{M}(h)}^{M})$ . The proof mainly consists of showing the following two claims.

1.

$P_{\mathrm{lim}}^{i}$ is weakly PRDS on $P_{\gamma_{i}(h)}^{i}$ .
2.

$\boldsymbol{P}_{\boldsymbol{\gamma}(h)}^{-i}$ is strongly PRDS on $\boldsymbol{P}_{\mathrm{lim}}$ .

Since we assumed that $\boldsymbol{P}_{\mathrm{lim}}$ is strongly PRDS on $I_{0}$ , the first claim and Lemma B.1 imply that $\boldsymbol{P}_{\mathrm{lim}}$ is weakly PRDS on $P_{\gamma_{i}(h)}^{i}$ . Together with the second claim and the fact that $\boldsymbol{P}_{\boldsymbol{\gamma}(h)}^{-i}$ is independent of $P_{\gamma_{i}(h)}^{i}$ conditional on $\boldsymbol{P}_{\mathrm{lim}}$ (since we sample independently for all hypotheses), the final proposition follows by Lemma B.1.

We start with proving the first claim. Note that $P_{\gamma_{i}(h)}^{i}\leq x$ iff $L_{t_{x}}^{i}\leq h-1$ , where $t_{x}=\lceil h/x\rceil-1$ . With this and the fact that $L_{t_{x}}^{i}|P_{\mathrm{lim}}^{i}=p$ follows a binomial distribution with size parameter $t_{x}$ and probability $p$ , we obtain

	$\displaystyle\mathbb{P}(P_{\mathrm{lim}}^{i}\leq p^{*}\mid P_{\gamma_{i}(h)}^{i}\leq x)$	$\displaystyle=\frac{\int_{0}^{p^{*}}\mathbb{P}(P_{\gamma_{i}(h)}^{i}\leq x\mid P_{\mathrm{lim}}^{i}=p)\ dp}{\int_{0}^{1}\mathbb{P}(P_{\gamma_{i}(h)}^{i}\leq x\mid P_{\mathrm{lim}}^{i}=p)\ dp}$
		$\displaystyle=\frac{\sum_{\ell=0}^{h-1}{t_{x}\choose\ell}\int_{0}^{p^{*}}p^{\ell}(1-p)^{t_{x}-\ell}\ dp}{\sum_{\ell=0}^{h-1}{t_{x}\choose\ell}\int_{0}^{1}p^{\ell}(1-p)^{t_{x}-\ell}\ dp}$
		$\displaystyle=\frac{\sum_{\ell=0}^{h-1}(1-\mathrm{Bin}(\ell;t_{x}+1,p^{*}))}{h},$

where $\mathrm{Bin}(\ell;t_{x}+1,p^{*})$ is the CDF of a binomial distribution with size parameter $t_{x}+1$ and probability $p^{*}$ . Since $t_{x}$ is decreasing in $x$ , $\mathbb{P}(P_{\mathrm{lim}}^{i}\leq p^{*}\mid P_{\gamma_{i}(h)}^{i}\leq x)$ is decreasing in $x$ as well.

For the second claim, note that $P_{\gamma_{j}(h)}^{j}$ is independent of $\boldsymbol{P}_{\mathrm{lim}}^{-j}$ and $\boldsymbol{P}_{\boldsymbol{\gamma}(h)}^{-j}$ conditional on $P_{\mathrm{lim}}^{j}$ for all $j\in\{1,\ldots,M\}$ . Therefore, it only remains to show that $P_{\gamma_{j}(h)}^{j}$ is strongly PRDS on $P_{\mathrm{lim}}^{j}$ . Since $L_{t_{u}}^{j}|P_{\mathrm{lim}}^{j}=p$ follows a binomial distribution with probability $p$ , it is immediately implied by the fact that the CDF of a binomial distribution is decreasing in its probability parameter.

∎

Appendix C Additional notes about the RNA-seq data analysis

We implemented the sequential permutation test based on the anytime-valid BC p-value and the BH procedure in an R/C++ package, adaptiveperm.²²2 github.com/timothy-barry/adaptiveperm In the context of the general algorithm for sequential permutation testing (Algorithm 1), checking whether to move hypothesis $i$ from the active set $A$ into the rejection set $R$ — i.e., line 7 — involves performing a BH correction on all hypotheses contained within the active set $A$ . To avoid performing a BH correction at every iteration of the algorithm (which would require a computationally costly sort of the p-values), we instead check whether the maximum p-value $p^{\textrm{max}}$ among all hypotheses in the active set can be rejected by the BH procedure:

p^{\textrm{max}}=\max_{i\in A}\{p^{i}\}\leq\frac{|A|m}{\alpha}.

If the above criterion is satisfied, we move all hypotheses from the active set into the rejection set. This procedure produces the same discovery set as the algorithm that performs a BH correction at each iteration of the loop. To see this, observe that the anytime-valid BC p-value is monotonically decreasing in time, and so if a given hypothesis can be rejected at a time $t$ , then the hypothesis also will be rejected at a later time $t^{*}\geq t$ . We implemented several other accelerations to speed the algorithm, including a high-performance Fisher-Yates sampler for permuting the data [44].

The subcutaneous tissue data, visceral adipose tissue data, and code to reproduce the RNA-seq analysis are available at the following links:

•

gtexportal.org/home/tissue/Adipose_Subcutaneous
•

gtexportal.org/home/tissue/Adipose_Visceral_Omentum
•

github.com/timothy-barry/sequential_perm_testing

A standard step in RNA-seq analysis is to adjust for differences in library size across samples, where the library size $l_{i}$ of the $i$ th sample is defined as the sum of the expressions across the genes in that sample:

l_{i}=\sum_{j=1}^{m}Y_{ij}.

(This step sometimes is called “normalization.”) We normalized the data by dividing a given count $Y_{ij}$ by the library size of its sample, i.e. $Y_{ij}/l_{i}$ . Li et al. [29] instead employed the normalization strategy used by the R package EdgeR [39], which is slightly more sophisticated than the division strategy described above but similar in spirit. We filtered out genes with an expression level of zero across samples, as is standard in RNA-seq analysis [31].

Appendix D Detailed algorithms

Algorithm 3 Sequential BH procedure based on the binomial mixture strategy

Input: Significance level $\alpha\in(0,1)$ , parameter $b\in(0,1)$ for the binomial mixture strategy and sequences of generated test statistics $Y_{0}^{1},Y_{1}^{1},\ldots$ , $Y_{0}^{2},Y_{1}^{2},\ldots$ , $\ldots$ , $Y_{0}^{M},Y_{1}^{M},\ldots$ .
Output: Stopping times $\tau_{1},\ldots,\tau_{M}$ and index set of rejections $R\subseteq\{1,\ldots,M\}$ .

m^{*}=0

r_{i,j}=0

for all

i,j\in\{1,\ldots,M\}

\ell^{i}=0

for all

i\in\{1,\ldots,M\}

\mathrm{crit}_{i}=0

for all

i\in\{1,\ldots,M\}

A=\{1,\ldots,M\}

6:for

t=1,2,...

7: for

j=1,\ldots,M

\mathrm{crit}_{j}=(\mathrm{Bin})^{-1}(1-b;t+1,b\alpha j/M)-1

9: end for

10: for

i\in A

11: if

Y_{t}^{i}\geq Y_{0}^{i}

then

12:

\ell^{i}=\ell^{i}+1

13: end if

14: for

j=1,\ldots,M

15: if

\ell^{i}\leq\mathrm{crit}_{j}

then

16:

r_{i,j}=1

17: end if

18: end for

19: end for

20:

m^{*}=\max\{m\in\{1,\ldots,M\}:\sum_{i=1}^{m}r_{i,m}\geq m\}

21: for

i\in A

22: if

r_{i,m^{*}}=1

then

23:

R=R\cup\{i\}

24:

\tau_{i}=t

25:

A=A\setminus\{i\}

26: end if

27: if

1-\mathrm{Bin}(\ell^{i};t+1,b\alpha(|A|+m^{*})/M)<b[\alpha(|A|+m^{*})/M]^{2}

then

28:

\tau_{i}=t

29:

A=A\setminus\{i\}

30: end if

31: end for

32: if

A=\emptyset

then

33: return

\tau_{1},\ldots,\tau_{M}

R

34: end if

35:end for

Algorithm 4 Sequential MaxT approach with the binomial mixture strategy

Input: Significance level $\alpha\in(0,1)$ , parameter $c\in(0,\alpha)$ for the binomial mixture strategy with uniform prior $u_{c}$ , ordered test statistics $Y_{0}^{1}\geq...\geq Y_{0}^{M}$ and sequences of generated test statistics $Y_{1}^{1},Y_{2}^{1},\ldots$ , $Y_{1}^{2},Y_{2}^{2},\ldots$ , $\ldots$ , $Y_{1}^{M},Y_{2}^{M},\ldots$ .
Output: Stopping times $\tau_{1},\ldots,\tau_{M}$ and index set of rejections $R\subseteq\{1,\ldots,M\}$ .

\tau_{i}=\infty

for all

i\in\{1,\ldots,M\}

\ell^{i}=0

for all

i\in\{1,\ldots,M\}

R=\emptyset

A=\{1,\ldots,M\}

5:for

t=1,2,...

6: for

i\in A

7: if

\max\{Y_{t}^{j}:j\in\{i,\ldots,M\}\}\geq\max\{Y_{0}^{j}:j\in\{i,\ldots,M\}\}

then

\ell^{i}=\ell^{i}+1

9: if

(1-\mathrm{Bin}(\ell^{i};t+1,c))/c<\alpha

then

10:

\tau_{i}=\min(\tau_{i},t)

11:

\vdots

12:

\tau_{M}=\min(\tau_{M},t)

13:

A=A\setminus\{i,\ldots,M\}

14:

R=R\setminus\{i,\ldots,M\}

15: end if

16: else if

(1-\mathrm{Bin}(\ell^{i};t+1,c))/c\geq 1/\alpha

then

17:

\tau_{i}=t

18:

R=R\cup\{i\}

19:

A=A\setminus\{i\}

20: end if

21: end for

22: if

A=\emptyset

then

23: return

\tau_{1},\ldots,\tau_{M}

R

24: end if

25:end for

Multiple testing with anytime-valid Monte-Carlo p-values

Abstract

1 Introduction

Our contributions.

Paper outline.

2 Anytime-valid permutation p-values

2.1 A simple approach: The anytime-valid Besag-Clifford p-value

Example 1.

2.2 A general approach: Anytime-valid permutation testing by betting

2.3 A sophisticated approach: The binomial mixture strategy

3 Multiple testing with anytime-valid permutation p-values

3.1 α\alpha-dependent p-value calibration

Proposition 3.1.

3.2 Our general multiple testing approach

Theorem 3.2.

Example 2.

4 The Benjamini-Hochberg procedure with anytime-valid permutation p-values

Property 1 (PRDS [6, 11]).

4.1 Stopping early for rejection

Theorem 4.1.

Proof.

4.2 Stopping early for futility with independent p-values

Proposition 4.2.

Proof.

4.3 Stopping early for futility with PRDS p-values

Example 3.

Stopping early for futility with the anytime-valid Besag-Clifford p-value.

Proposition 4.3.

5 Simulated experiments

5.1 Power and average number of permutations

5.2 Early reporting of decisions

5.3 FDR control

6 Real data analyses

6.1 Neuroscience example

6.2 Genomics example

7 Conclusion

Acknowledgments

References

Appendix A Sequential permutation based multiple testing procedures

Assumption 1.

Appendix B Omitted proofs

Lemma B.1.

Proof.

Proof of Proposition 4.3.

Appendix C Additional notes about the RNA-seq data analysis

Appendix D Detailed algorithms

3.1 $\alpha$ -dependent p-value calibration