
Testing with p*-values: Between p-values, mid p-values, and e-values

Ruodu Wang, Department of Statistics and Actuarial Science, University of Waterloo
Abstract

We introduce the notion of p*-values (p*-variables), which generalizes p-values (p-variables) in several senses. The new notion has four natural interpretations: operational, probabilistic, Bayesian, and frequentist. A main example of a p*-value is a mid p-value, which arises in the presence of discrete test statistics. A unified stochastic representation for p-values, mid p-values, and p*-values is obtained to illustrate the relationship between the three objects. We study several ways of merging arbitrarily dependent or independent p*-values into one p-value or p*-value. Admissible calibrators of p*-values to and from p-values and e-values are obtained with nice mathematical forms, revealing the role of p*-values as a bridge between p-values and e-values. The notion of p*-values becomes useful in many situations even if one is only interested in p-values, mid p-values, or e-values. In particular, deterministic tests based on p*-values can be applied to improve some classic methods for p-values and e-values.

MSC: 62G10, 62F03, 62C15
Keywords: Mid p-values, arbitrary dependence, posterior predictive p-values, average of p-values, test martingale

1 Introduction

Hypothesis testing is usually conducted with the classic notion of p-values. We introduce the abstract notion of p*-variables, with p*-values as their realizations, defined via a simple inequality in stochastic order, in a way similar to p-variables and p-values. As generalized p-variables, p*-variables are motivated by mid p-values, and they admit four natural interpretations: operational, probabilistic, Bayesian, and randomized testing, arising in various statistical contexts.

The most important and practical example of p*-values is the class of mid p-values (Lancaster [16]), arising from discrete test statistics. Mid p-variables are not p-variables, but they are p*-variables. Discrete test statistics appear in many applications, especially when data represent frequencies or counts; see e.g., Döhler et al. [8] in the context of false discovery rate control. Another example of discrete p-values is the conformal p-values (Vovk et al. [32]); see e.g., the recent study of Bates et al. [2] on detecting outliers using conformal p-values. Using mid p-values is one way to address discrete test statistics; another way is using randomized p-values. We refer to Habiger [14] for mid and randomized p-values in multiple testing, and Rubin-Delanchy et al. [23] for probability bounds on combining independent mid p-values based on convex order.

In addition to mid p-values, p*-values are also naturally connected to e-values. E-values have been recently introduced to the statistical community by Vovk and Wang [35], and they have several advantages in contrast to p-values, especially via their connections to Bayes factors and test martingales (Shafer et al. [28]), betting scores (Shafer [27]), universal inference (Wasserman et al. [38]), anytime-valid tests (Grünwald et al. [13]), conformal tests (Vovk [31]), and false discovery rate under dependence (Wang and Ramdas [37]).

In discussions where the probabilistic specification as random variables is not emphasized, we will loosely use the term “p/p*/e-values” for both p/p*/e-variables and their realizations, similarly to Vovk and Wang [34, 35], and this should be clear from the context.

The relationship between p*-values and mid p-values is studied in Section 3. We obtain a new stochastic representation for mid p-values (Theorem 3.1), which unifies the classes of p-, mid p-, and p*-values. The set of p*-values is closed under several types of operations, and this closure property is not shared by that of p-values or mid p-values (Proposition 3.3). Based on these results, we find that p*-values serve as an abstract and generalized version of mid p-values which is mathematically more convenient to work with.

There are several equivalent definitions of p*-variables: by stochastic order (Definition 2.2), by averaging p-variables (Theorem 4.1), by conditional probability (Proposition 4.3), and by randomized tests (Proposition 4.5); each of them represents a natural statistical path to a generalization of p-values, and these paths lead to the same mathematical object of p*-values. Moreover, a p*-value is a posterior predictive p-value of Meng [20] in the Bayesian context. The p*-test in Section 4.4 is a randomized version of the traditional p-test; the randomization is needed because p*-values are weaker than p-values.

Merging methods are useful in multiple hypothesis testing for both p-values and e-values. Merging several p-values or e-values is used, either directly or implicitly, in false discovery control procedures in Genovese and Wasserman [11] and Goeman and Solari [12] and in generalized Bonferroni-Holm procedures (see [34] for p-values and [35] for e-values). We study merging functions for p*-values in Section 5, which turn out to have convenient structures. In particular, we find that a (randomly) weighted geometric average of arbitrarily dependent p*-variables multiplied by $\mathrm{e}\approx 2.718$ is a p-variable (Theorem 5.2), allowing for a simple combination of p*-values under unknown dependence; a similar merging function for p-values is obtained by [34]. In the setting of merging independent p*-values, inequalities obtained by [23] on mid p-values can be directly applied to p*-values.

We explore in Section 6 the connections among p-values, p*-values, and e-values by establishing results for admissible calibrators. Figure 1 summarizes these calibrators, where the ones between p-values and e-values are obtained in [35]. Notably, for an e-value $e$, $(2e)^{-1}$ is a calibrated p*-value, which has an extra factor of $1/2$ compared to the standard calibrated p-value $e^{-1}$. A composition of the e-to-p* calibration $p^{*}=(2e)^{-1}\wedge 1$ and the p*-to-p calibration $p=(2p^{*})\wedge 1$ leads to the unique admissible e-to-p calibration $p=e^{-1}\wedge 1$, thus showing that p*-values serve as a bridge between e-values and p-values.

[Figure 1 diagram: calibrations among p*-values, mid p-values, e-values, and p-values, via $e=f(p^{*})$ with $f$ convex, $p^{*}=(2e)^{-1}\wedge 1$, $p=(2p^{*})\wedge 1$, $p^{*}=p$, $p=e^{-1}\wedge 1$, and $e=f(p)$.]
Figure 1: Calibration among p-values, p*-values and e-values, where $f:[0,1]\to[0,\infty]$ is left-continuous and decreasing with $f(0)=\infty$ and $\int_{0}^{1}f(t)\,\mathrm{d}t=1$.

In classic statistical settings where precise (uniform on $[0,1]$) p-values are available, p*-values may not be directly useful as their properties are weaker than p-values. Nevertheless, applying a p*-test in situations where precise p-values are unavailable leads to several improvements on classic methods for p-values and e-values. An application on testing with e-values is discussed and numerically illustrated in Section 7, and one on testing with discrete test statistics and mid p-values is presented in Section 8. (Another application is presented in Appendix B.) From these examples, we find that the tool of p*-values is useful even when one is primarily interested in p-values, mid p-values, or e-values.

The paper is written such that p-values, p*-values and e-values are treated as abstract measure-theoretical objects, following the setting of [34, 35]. Our null hypothesis is a generic and unspecified one, and it can be simple or composite; nevertheless, for the discussions of our results, it would be harmless to keep a simple hypothesis in mind as a primary example. All proofs are put in Appendix A and the randomized p*-test is discussed in Appendix B.

2 P-values, p*-values, and e-values

Following the setting of [35], we directly work with a fixed atomless probability space $(\Omega,\mathcal{A},\mathbb{P})$, where our (global) null hypothesis is set to be the singleton $\{\mathbb{P}\}$. As explained in Appendix D of [35], no generality is lost as all mathematical results (of the kind in this paper) are valid also for general composite hypotheses. We assume that $(\Omega,\mathcal{A},\mathbb{P})$ is rich enough so that we can find a uniform random variable independent of a given random vector as we wish. We first define stochastic orders, which will be used to formulate the main objects in the paper. All terms like “increasing” and “decreasing” are in the non-strict sense.

Definition 2.1.

Let $X$ and $Y$ be two random variables.

  1. $X$ is first-order stochastically smaller than $Y$, written as $X\leq_{1}Y$, if $\mathbb{E}[f(X)]\leq\mathbb{E}[f(Y)]$ for all increasing real functions $f$ such that the expectations exist.

  2. $X$ is second-order stochastically smaller than $Y$, written as $X\leq_{2}Y$, if $\mathbb{E}[f(X)]\leq\mathbb{E}[f(Y)]$ for all increasing concave real functions $f$ such that the expectations exist.

Recall that the defining property of a p-value, realized by a random variable $P$, is that $\mathbb{P}(P\leq\alpha)\leq\alpha$ for all $\alpha\in(0,1)$, meaning that the type-I error of rejecting the null hypothesis based on $P\leq\alpha$ is at most $\alpha$ (see e.g., [35]). The above is equivalent to the statement that $P$ is first-order stochastically larger than a uniform random variable on $[0,1]$ (e.g., Section 1.A of [29]). Motivated by this simple observation, we define p-variables and e-variables, and our new concept, called p*-variables, via stochastic orders.

Definition 2.2.

Let $U$ be a uniform random variable on $[0,1]$.

  1. A random variable $P$ is a p-variable if $U\leq_{1}P$.

  2. A random variable $P$ is a p*-variable if $U\leq_{2}P$.

  3. A random variable $E$ is an e-variable if $0\leq_{2}E\leq_{2}1$.

We allow both p-variables and p*-variables to take values above one, although such values are uninteresting, and one may safely truncate them at $1$; moreover, we allow an e-variable $E$ to take the value $\infty$ (but with probability 0 under the null), which corresponds to a p-variable taking the value 0 (also with probability 0).

Since $\leq_{1}$ is stronger than $\leq_{2}$, a p-variable is also a p*-variable, but not vice versa. Due to the close proximity between p-variables and p*-variables, we often use $P$ for both of them; this should not create any confusion. We refer to p-values as realizations of p-variables, p*-values as those of p*-variables, and e-values as those of e-variables. By definition, both a p-variable and a p*-variable have a mean at least $1/2$.

The classic definition of an e-variable $E$ is via $\mathbb{E}[E]\leq 1$ and $E\geq 0$ a.s. ([35]). This is equivalent to our Definition 2.2 because

$$0\leq_{2}E~\Longleftrightarrow~0\leq E~\text{ a.s.};\qquad\qquad E\leq_{2}1~\Longleftrightarrow~\mathbb{E}[E]\leq 1.$$

We choose to express our definition via stochastic orders to make an analogy among the three concepts, and stochastic orders will be a main technical tool for results in this paper.

Our main focus is the notion of p*-values, which will be motivated from five perspectives in Sections 3 and 4.

Remark 2.3.

There are many equivalent conditions for the stochastic order $U\leq_{2}P$. One of the most convenient conditions, which will be used repeatedly in this paper, is (see e.g., Theorem 4.A.3 of [29])

$$U\leq_{2}P~\Longleftrightarrow~\int_{0}^{v}G_{P}(u)\,\mathrm{d}u\geq\frac{v^{2}}{2}~\text{ for all }v\in(0,1),\qquad(1)$$

where $G_{P}$ is the left-quantile function of $P$, defined as

$$G_{P}(u)=\inf\{x\in\mathbb{R}:\mathbb{P}(P\leq x)\geq u\},\qquad u\in(0,1].$$
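For concreteness, condition (1) can be checked numerically for a candidate with finitely many values. The following is a minimal sketch (not part of the paper's methodology); the helper names, the grid-based approximation, and the example distributions are ours.

```python
import numpy as np

def left_quantile(values, probs, u):
    """Left-quantile G_P(u) = inf{x : P(P <= x) >= u} of a finite discrete distribution."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(values)
    values, probs = values[order], probs[order]
    cdf = np.cumsum(probs)
    return values[np.argmax(cdf >= u - 1e-12)]   # first support point with F(x) >= u

def is_pstar(values, probs, grid=2000):
    """Numerically check condition (1): int_0^v G_P(u) du >= v^2/2 for all v in (0,1)."""
    us = (np.arange(grid) + 0.5) / grid                  # midpoints of a uniform grid on (0,1)
    G = np.array([left_quantile(values, probs, u) for u in us])
    integrals = np.cumsum(G) / grid                      # approximates int_0^v G_P(u) du
    vs = (np.arange(grid) + 1) / grid
    return bool(np.all(integrals >= vs ** 2 / 2 - 1e-6))

# The two-point distribution of Example 5.5 (P = 0.2 w.p. 0.4, P = 0.7 w.p. 0.6) is a p*-variable:
print(is_pstar([0.2, 0.7], [0.4, 0.6]))    # True
# A distribution that is stochastically too small fails the check:
print(is_pstar([0.05, 0.5], [0.5, 0.5]))   # False
```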

3 Mid p-values and discrete test statistics

An important motivation for p*-values is the use of mid p-values and discrete test statistics. We first recall the usual practice to obtain p-values. Let $T$ be a test statistic which is a function of the observed data. Here, a smaller value of $T$ represents stronger evidence against the null hypothesis. The p-variable $P$ is usually computed from the conditional probability

$$P=\mathbb{P}(T^{\prime}\leq T\,|\,T)=F(T),\qquad(2)$$

where $F$ is the distribution of $T$, and $T^{\prime}$ is an independent copy of $T$; here and below, a copy of $T$ is a random variable identically distributed as $T$.

If $T$ has a continuous distribution, then the p-variable $P$ defined by (2) has a standard uniform distribution. If the test statistic $T$ is discrete, e.g., when testing a binomial model $\mathrm{Binomial}(n,\pi)$, $P$ is strictly first-order stochastically larger than a uniform random variable on $[0,1]$.

The discreteness of $T$ leads to a conservative p-value; in particular, $\mathbb{E}[P]>1/2$. One way to address this is to randomize the p-value to make it uniform on $[0,1]$; however, randomization is generally undesirable in testing. As the most natural alternative, mid p-values (Lancaster [16]) arise in the presence of discrete test statistics. For the test statistic $T$, the mid p-value is given by

$$P_{T}=\frac{1}{2}\mathbb{P}(T^{\prime}\leq T\,|\,T)+\frac{1}{2}\mathbb{P}(T^{\prime}<T\,|\,T)=\frac{1}{2}F(T-)+\frac{1}{2}F(T),\qquad(3)$$

where $F(t-)=\lim_{s\uparrow t}F(s)$ for $t\in\mathbb{R}$. Clearly, $P_{T}\leq F(T)$ and $\mathbb{E}[P_{T}]=1/2$. If $T$ is continuously distributed, then $P_{T}=F(T)$ is uniform on $[0,1]$. In case $T$ is discrete, $P_{T}$ is not a p-variable. In Figure 2 we present some examples of quantile functions of p-, mid p- and p*-variables.
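As an illustration (a sketch we add here; the sample size, success probability, and observed count are chosen only for illustration), the p-value (2) and the mid p-value (3) for a binomial test statistic can be computed as follows, using the fact that $F(T-)=F(T-1)$ for an integer-valued statistic.

```python
from scipy.stats import binom

n, pi = 40, 0.3      # null model: T ~ Binomial(n, pi); small T is evidence against the null
t_obs = 6            # hypothetical observed count

p_value = binom.cdf(t_obs, n, pi)                                             # (2): F(T)
mid_p = 0.5 * binom.cdf(t_obs - 1, n, pi) + 0.5 * binom.cdf(t_obs, n, pi)     # (3): F(T-)/2 + F(T)/2

print(p_value, mid_p)   # mid_p < p_value; under the null, the mid p-variable has mean exactly 1/2
```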

Similarly to the case of p-variables in Definition 2.2, we formally define a mid p-variable as a random variable $P$ such that $P\geq_{1}P_{T}$ for some test statistic $T$. Often we have equality (i.e., $P=_{1}P_{T}$), and a strict inequality may appear due to, e.g., composite hypotheses, similarly to the case of p-variables or e-variables.

Figure 2: Examples of quantile functions of p-, mid p- and p*-variables. (a) mid p-variable (black) and p-variable (red) for a discrete statistic; (b) mid p-variable (black) and p-variable (red) for a hybrid statistic; (c) p*-variable obtained from averaging p-values (not a mid p-variable).

It is straightforward to verify that mid p-variables are p*-variables; see [23], where convex order is used in place of our second-order stochastic dominance. The following theorem establishes a new unified stochastic representation for p-, mid p-, and p*-variables.

Theorem 3.1.

Let $U$ be a standard uniform random variable. For a random variable $P$,

  (i) $P$ is a p-variable if and only if $P\geq_{1}\mathbb{E}[U|V]$ for some $V$ which is a strictly increasing function of $U$;

  (ii) $P$ is a mid p-variable if and only if $P\geq_{1}\mathbb{E}[U|V]$ for some $V$ which is an increasing function of $U$;

  (iii) $P$ is a p*-variable if and only if $P\geq_{1}\mathbb{E}[U|V]$ for some $V$ which is any random variable.

Remark 3.2.

Conditional expectations and conditional probabilities are defined in the a.s. sense, as usual in probability theory.

From Theorem 3.1, it is clear that mid p-variables are special cases of p*-variables but the converse is not true. As a direct consequence, all results later on p*-variables can be directly applied to mid p-values. The stochastic representations in Theorem 3.1 may not be directly useful in statistical inference; nevertheless they reveal a deep connection between mid p-values and p*-values, allowing us to analyze possible improvements of methods designed for p*-variables when applied to mid p-values, and vice versa.

In the next result, we summarize some closure properties of the sets of p-, mid p-, and p*-variables.

Proposition 3.3.
  (i) The set of p*-variables is closed under convex combinations and under distribution mixtures.

  (ii) The set of p-variables is closed under distribution mixtures but not under convex combinations.

  (iii) The set of mid p-variables is closed neither under convex combinations nor under distribution mixtures.

  (iv) The three sets above are all closed under convergence in distribution.

Proposition 3.3 suggests that the set of p*-variables has the nicest closure properties among the three. Moreover, we will see in Theorem 4.1 below that the set of p*-variables is precisely the convex hull of the set of p-variables, and hence, it is also the convex hull of the set of mid p-variables which contains all p-variables.

4 Four formulations of p*-values

In this section, we will present four further equivalent definitions of p*-values, each arising from a different statistical context, and providing several interpretations of p*-values.

4.1 Averages of p-values

Our first interpretation of p*-variables is operational: we will see that a p*-variable is precisely the arithmetic average of some p-values which are obtained from possibly different sources and arbitrarily dependent. This characterization relies on a recent technical result of Mao et al. [19] on the sum of standard uniform random variables.

Theorem 4.1.

A random variable is a p*-variable if and only if it is a convex combination of some p-variables. Moreover, any p*-variable can be expressed as the arithmetic average of three p-variables.

Remark 4.2.

As implied by Theorem 5 of [19], a p*-variable can always be written as the arithmetic average of $n$ dependent p-variables for any $n\geq 3$, but the statement is not true for $n=2$ ([19, Proposition 1]).

As a consequence of Theorem 4.1, the set of p*-variables is the convex hull of the set of p-variables, as briefly mentioned in Section 3. Testing with the arithmetic average of dependent p-values has been studied by [34]. We further discuss in Appendix B an application of p*-values which improves some tests based on arithmetic averages of p-values.

4.2 Conditional probability of exceedance

Our second interpretation is probabilistic: we will interpret both p-variables and p*-variables as conditional probabilities. Let $T$ be a test statistic which is a function of the observed data, represented by a vector $X$. Recall that a p-variable $P$ in (2) has the form

$$P=F(T)=\mathbb{P}(T^{\prime}\leq T\,|\,T)=\mathbb{P}(T^{\prime}\leq T\,|\,X).$$

It turns out that p*-variables have a similar representation, where the only difference is that the $T$ defining a p*-variable may not be a function of $X$; instead, $T$ can include some unobservable randomness or additional randomness in the statistical experiment (but the p*-variable itself is deterministic from the data).

Proposition 4.3.

For a $\sigma(X)$-measurable random variable $P$,

  (i) $P$ is a p-variable if and only if there exists a $\sigma(X)$-measurable $T$ such that $P\geq\mathbb{P}(T^{\prime}\leq T\,|\,X)$, where $T^{\prime}$ is a copy of $T$ independent of $X$;

  (ii) $P$ is a p*-variable if and only if there exists a random variable $T$ such that $P\geq\mathbb{P}(T^{\prime}\leq T\,|\,X)$, where $T^{\prime}$ is a copy of $T$ independent of $(T,X)$.

In (ii) above, $\mathbb{P}(T^{\prime}\leq T\,|\,X)$ can be safely replaced by $\mathbb{P}(T^{\prime}\leq T\,|\,X)/2+\mathbb{P}(T^{\prime}<T\,|\,X)/2$.

Proposition 4.3 suggests that p*-variables are very similar to p-variables when interpreted as conditional probabilities; the only difference is whether extra randomness is allowed in $T$.

Remark 4.4.

Both Proposition 4.3 and Theorem 3.1 give stochastic representations for p- and p*-variables. They are similar, with a few differences. First, one is stated via stochastic order whereas the other is stated via inequalities between random variables. Second, one involves a uniform random variable $U$ whereas the other does not, as $T$ may be discrete. Third, the measurability conditions are different, as Proposition 4.3 specifies $\sigma(X)$-measurability.

4.3 Posterior predictive p-values

In the Bayesian context, the posterior predictive p-value of Meng [20] is a p*-value. Let $X$ be the data vector in Section 4.2. The null hypothesis $H_{0}$ is given by $\{\psi\in\Psi_{0}\}$ where $\Psi_{0}$ is a subset of the parameter space $\Psi$ on which a prior distribution is specified. The posterior predictive p-value is defined as the realization of the random variable

$$P_{B}:=\mathbb{P}(D(X^{\prime},\psi)\geq D(X,\psi)\,|\,X),$$

where $D$ is a function (taking a similar role as test statistics), $X^{\prime}$ and $X$ are iid conditional on $\psi$, and the probability is computed under the joint posterior distribution of $(X^{\prime},\psi)$. Note that $P_{B}$ can be rewritten as

$$P_{B}=\int\mathbb{P}(D(X^{\prime},y)\geq D(X,y)\,|\,X,y)\,\mathrm{d}\Pi(y|X),$$

where $\Pi$ is the posterior distribution of $\psi$ given the data $X$. One can check that $P_{B}$ is a p*-variable by using Jensen’s inequality; see Theorem 1 of [20], where $D(X,\psi)$ is assumed to be continuously distributed conditional on $\psi$.

In this formulation, p*-variables are obtained by integrating p-variables over the posterior distribution of some unobservable parameter. Since p*-variables are treated as measure-theoretic objects in this paper, we omit a detailed discussion of the Bayesian interpretation; nevertheless, it is reassuring that p*-values have a natural appearance in the Bayesian context as put forward by Meng [20]. One of our later results is related to an observation of [20] that two times a p*-variable is a p-variable (see Proposition 5.1).
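To make the construction concrete, the following sketch computes $P_{B}$ by Monte Carlo in a toy conjugate model; the model, the prior, and the discrepancy $D$ are our own illustrative choices and are not taken from [20].

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: X_1, ..., X_m | psi ~ N(psi, 1) iid, prior psi ~ N(0, 1),
# discrepancy D(x, psi) = (mean(x) - psi)^2.  Then psi | X ~ N(m*xbar/(m+1), 1/(m+1)).
m = 20
x = rng.normal(0.5, 1.0, size=m)                           # hypothetical observed data
post_mean, post_var = m * x.mean() / (m + 1), 1.0 / (m + 1)

B = 100_000
psi = rng.normal(post_mean, np.sqrt(post_var), size=B)     # posterior draws of psi
x_rep = rng.normal(psi[:, None], 1.0, size=(B, m))         # replicated data X' given each psi

d_obs = (x.mean() - psi) ** 2                              # observed discrepancy D(X, psi)
d_rep = (x_rep.mean(axis=1) - psi) ** 2                    # replicated discrepancy D(X', psi)
P_B = np.mean(d_rep >= d_obs)                              # Monte Carlo posterior predictive p-value
print(P_B)
```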

4.4 Randomized tests with p*-values

Recall that the defining property of a p-variable $P$ is that the standard p-test

$$\mbox{rejecting the null hypothesis}~\Longleftrightarrow~P\leq\alpha\qquad(4)$$

has size (i.e., probability of type-I error) at most $\alpha$ for each $\alpha\in(0,1)$. Since p*-values are a weaker version of p-values, one cannot guarantee that the test (4) for a p*-variable $P$ has size at most $\alpha$. Nevertheless, a randomized version of the test (4) turns out to be valid. Moreover, this randomized test yields a defining property for p*-variables, just like p-variables are defined by the deterministic p-test (4).

Proposition 4.5.

Let $V_{\alpha}\sim\mathrm{U}[0,2\alpha]$ and let a random variable $P$ be independent of $V_{\alpha}$, $\alpha\in(0,1/2]$. Then $\mathbb{P}(P\leq V_{\alpha})\leq\alpha$ for all $\alpha\in(0,1/2]$ if and only if $P$ is a p*-variable.

Proposition 4.5 implies that p*-variables are precisely the test statistics which can pass the randomized p*-test (rejection via $P\leq V_{\alpha}$) at the specified level $\alpha$, thus giving a further equivalent definition of p*-variables. The drawback of the randomized p*-test is obvious: an extra layer of randomization is needed. This undesirable feature is the price one has to pay when a p-variable is weakened to a p*-variable. More details and applications of the randomized p*-test are given in Appendix B. Since randomization is undesirable, we omit the detailed discussions from the main paper.

Remark 4.6.

The random threshold $V_{\alpha}$ in Proposition 4.5 can be replaced by any random variable $V$ with mean $\alpha$ and a decreasing density function on $(0,1)$. The uniform distribution on $[0,2\alpha]$ turns out to be the one with the smallest variance among all valid choices of the random threshold (see Proposition B.4).
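A small Monte Carlo sketch of the randomized p*-test (our own illustration; the level and the two-point p*-variable are chosen for concreteness, the latter being the mid p-variable of Example 5.5) confirms that the size does not exceed $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, B = 0.25, 200_000                               # alpha in (0, 1/2], as required

# p*-variable under the null: P = 0.2 w.p. 0.4 and P = 0.7 w.p. 0.6 (mean 1/2)
p_star = rng.choice([0.2, 0.7], size=B, p=[0.4, 0.6])
V = rng.uniform(0, 2 * alpha, size=B)                  # random threshold V_alpha ~ U[0, 2*alpha]

print((p_star <= V).mean())   # about 0.24 <= alpha, since P(P <= V) = 0.4 * P(V >= 0.2) = 0.24
```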

5 Merging p*-values and mid p-values

Merging p-values and e-values is extensively studied in the literature of multiple hypothesis testing; see the recent studies [18, 35, 33] and the references therein. We will be interested in merging p*-values (including mid p-values) into both a p*-value and a p-value. The following proposition gives a convenient conversion rule between p*-values and p-values. The fact that two times a p*-variable is a p-variable is already observed by [20].

Proposition 5.1.

A p-variable is a p*-variable, and the sum of two p*-variables is a p-variable.

Proposition 5.1 implies that, in order to obtain a valid p-value from several p*-values, a naive method is to multiply each p*-value by $2$ and then apply a valid method for merging p-values (under the corresponding assumptions). We will see in the next few results that we can often obtain stronger results than this.

As argued by Efron [10, p.50-51], dependence assumptions are difficult to verify in multiple hypothesis testing. We will first focus on the case of arbitrarily dependent p*-values, that is, without making any assumptions on the dependence structure of the p*-variables, and then turn to the case of independent or positively dependent p*-variables.

5.1 Arbitrarily dependent p*-values

We first provide a new method which merges several p*-values into a p-value based on geometric averaging. Vovk and Wang [34] showed that the geometric average of p-variables may fail to be a p-variable, but it yields a p-variable when multiplied by $\mathrm{e}:=\exp(1)$. The constant $\mathrm{e}\approx 2.718$ is practically the best-possible (smallest) multiplier (see [34, Table 2]) that provides validity against all dependence structures. In the next result, we show that a similar but stronger result holds for p*-values: the geometric average of p*-variables multiplied by $\mathrm{e}$ is not only a p*-variable, but also a p-variable, and this holds also for randomly weighted geometric averages. For using weighted p-values in multiple testing, see e.g., Benjamini and Hochberg [4].

In what follows, the (randomly) weighted geometric average $\tilde{P}$ of $P_{1},\dots,P_{K}$ for random weights $w_{1},\dots,w_{K}$ is given by

$$\tilde{P}=\prod_{k=1}^{K}P_{k}^{w_{k}}=\exp\left(\sum_{k=1}^{K}w_{k}\log P_{k}\right),$$

where $w_{1},\dots,w_{K}$ satisfy

$$w_{1},\dots,w_{K}\geq 0,\quad\text{independent of }(P_{1},\dots,P_{K}),\quad\text{and}\quad\sum_{k=1}^{K}w_{k}=1.\qquad(5)$$

If $w_{1}=\dots=w_{K}=1/K$, then $\tilde{P}$ is the unweighted geometric average of $P_{1},\dots,P_{K}$.

Theorem 5.2.

Let $\tilde{P}$ be a weighted geometric average of p*-variables. Then $\mathrm{e}\tilde{P}$ is a p-variable. That is, for arbitrary p*-variables $P_{1},\dots,P_{K}$ and weights $w_{1},\dots,w_{K}$ satisfying (5),

$$\mathbb{P}\left(\prod_{k=1}^{K}P_{k}^{w_{k}}\leq\alpha\right)\leq\mathrm{e}\alpha\quad\text{for all }\alpha\in(0,1).\qquad(6)$$

In Theorem 5.2, the random weights are allowed to be arbitrarily dependent. These random weights may come from preliminary experiments. One way to obtain such weights is to use scores such as e-values from preliminary data. Using e-values to compute weights is quite natural as a main motivation of e-values is an accumulation of evidence between consecutive experiments; see [13], [35] and [37].

Theorem 5.2 generalizes the result of [34, Proposition 4], which considered the unweighted geometric average of p-values. When dependence is unspecified, testing with (randomly) weighted geometric averages of p*-values has the same critical values $\alpha/\mathrm{e}$ as testing with unweighted geometric averages of p-values.
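A minimal sketch of the merging rule of Theorem 5.2 (the function name and the example inputs are ours): it returns $\mathrm{e}$ times the weighted geometric average, truncated at 1, which is a valid p-value regardless of the dependence among the inputs.

```python
import numpy as np

def merge_pstar_geometric(p_stars, weights=None):
    """e times the (randomly) weighted geometric average of arbitrarily dependent p*-values
    is a p-value (Theorem 5.2); weights must be nonnegative, sum to one, and be independent
    of the p*-values (condition (5))."""
    p = np.asarray(p_stars, dtype=float)
    w = np.full(len(p), 1.0 / len(p)) if weights is None else np.asarray(weights, dtype=float)
    return min(1.0, float(np.e * np.exp(np.sum(w * np.log(p)))))

# Hypothetical mid p-values from K = 4 discrete tests with unknown dependence:
print(merge_pstar_geometric([0.01, 0.03, 0.20, 0.02]))   # roughly 0.09
```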

Next, we will study two methods which merge p*-values into a p*-value. Since two times a p*-variable is a p-variable, a probability guarantee can also be obtained from these merging functions. A p*-merging function in dimension $K$ is an increasing Borel function $M$ on $[0,\infty)^{K}$ such that $M(P_{1},\dots,P_{K})$ is a p*-variable for all p*-variables $P_{1},\dots,P_{K}$; p-merging and e-merging functions are defined analogously; see [35]. A p*-merging function $M$ is admissible if it is not strictly dominated by another p*-merging function.

Proposition 5.3.

The arithmetic average $M_{K}$ is an admissible p*-merging function in any dimension $K$.

Proposition 5.3 illustrates that p*-values are very easy to combine using an arithmetic average; recall that $M_{K}$ is not a valid p-merging function since the average of p-values is not necessarily a p-value (instead, $2M_{K}$ is a p-merging function). On the other hand, $M_{K}$ is an admissible e-merging function which essentially dominates all other symmetric admissible e-merging functions ([34, Proposition 3.1]).

Another benchmark merging function is the Bonferroni merging function

$$M_{B}:(p_{1},\dots,p_{K})\mapsto\left(K\bigwedge_{k=1}^{K}p_{k}\right)\wedge 1.$$

The next result shows that $M_{B}$ is an admissible p*-merging function. The Bonferroni merging function $M_{B}$ is known to be an admissible p-merging function ([35, Proposition 6.1]), whereas its transformed form (via $e=1/p$) is an e-merging function but not an admissible one; see [35, Section 6] for these claims.

Proposition 5.4.

The Bonferroni merging function $M_{B}$ is an admissible p*-merging function in any dimension $K$.

Combining Propositions 5.3 and 5.4, p*-merging is admissible via both the arithmetic average (admissible for e-merging, invalid for p-merging) and the Bonferroni correction (admissible for p-merging, inadmissible for e-merging).

5.2 Independent p*-variables

We next turn to the problem of merging independent p*-variables. Merging independent mid p-values is studied by Rubin-Delanchy et al. [23] based on arguments of convex order. Since our p*-variables are defined via the order $\geq_{2}$, which is closely related to convex order, the bounds in [23] can be directly applied to the case of p*-variables. More precisely, for any p*-variable $P$, using Strassen’s theorem in the form of [29, Theorems 4.A.5 and 4.A.6], there exists a random variable $Z$ such that $Z\leq P$ and $Z$ satisfies the convex order relation used in [23]. In particular, for the arithmetic average $\bar{P}$ of independent p*-variables, using [23, Theorem 1] leads to the probability bound

$$\mathbb{P}(\bar{P}\leq 1/2-t)\leq\exp(-6Kt^{2})\quad\text{for all }t\in[0,1/2].\qquad(7)$$

Another probability bound for the geometric average of independent p*-variables is obtained by [23, Theorem 2] based on the observation that twice a p*-variable is a p-variable (cf. Proposition 5.1). Recall that Fisher’s combination method uses the geometric average of independent p-values.

It is well-known that statistical validity of Fisher’s method or other methods based on concentration inequalities can be fragile when independence does not hold; see also our simulation results in Section 8. Since independence is difficult to verify in multiple hypothesis testing (see e.g., [10]), these independence-based methods (for either p-values or p*-values) need to be applied with caution.

There are, nevertheless, some methods which work well for independent p-values and are relatively robust to dependence assumptions. In addition to the Bonferroni correction which is valid for all dependence structures, the most famous such method is perhaps that of Simes [30]. Define the function

$$S_{K}:[0,\infty)^{K}\to[0,\infty),\qquad S_{K}(p_{1},\dots,p_{K})=\bigwedge_{k=1}^{K}\frac{Kp_{(k)}}{k},$$

where $p_{(k)}$ is the $k$-th smallest order statistic of $p_{1},\dots,p_{K}$. A celebrated result of [30] is that if $P_{1},\dots,P_{K}$ are independent p-variables, then the Simes inequality holds:

$$\mathbb{P}(S_{K}(P_{1},\dots,P_{K})\leq\alpha)\leq\alpha\quad\text{for all }\alpha\in(0,1).\qquad(8)$$

Further, if $P_{1},\dots,P_{K}$ are iid uniform on $[0,1]$, then $S_{K}(P_{1},\dots,P_{K})$ is again uniform on $[0,1]$. The Simes inequality (8) holds also under some notion of positive dependence, in particular, positive regression dependence (PRD); see Benjamini and Yekutieli [5] and Ramdas et al. [22].
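For reference, here is a short sketch of the Simes combination $S_{K}$ (our own helper; the inputs are illustrative). As discussed next, $S_{K}$ applied to p*-values can fail to be valid, while $2S_{K}$ remains a p-variable under independence or PRD.

```python
import numpy as np

def simes(p_values):
    """Simes combination S_K(p_1, ..., p_K) = min_k K * p_(k) / k."""
    p = np.sort(np.asarray(p_values, dtype=float))
    K = len(p)
    return float(np.min(K * p / np.arange(1, K + 1)))

print(simes([0.04, 0.01, 0.30, 0.20]))   # min(4*0.01/1, 4*0.04/2, 4*0.20/3, 4*0.30/4) = 0.04
```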

One may wonder whether $S_{K}(P_{1},\dots,P_{K})$ yields a p*-variable or a p-variable for independent or PRD p*-variables $P_{1},\dots,P_{K}$. It turns out that this is not the case, as illustrated by the following example, where $S_{2}(P_{1},P_{2})$ fails to be a p*-variable or a p-variable, even in the case that $P_{1}$ and $P_{2}$ are iid p*-variables.

Example 5.5.

Let $P_{1}$ be a random variable satisfying $\mathbb{P}(P_{1}=0.2)=0.4$ and $\mathbb{P}(P_{1}=0.7)=0.6$. It is straightforward to verify that $P_{1}$ is a p*-variable (indeed, it is a mid p-variable by Theorem 3.1). Let $P_{2}$ be an independent copy of $P_{1}$ and $P:=S_{2}(P_{1},P_{2})$. We can check that $\mathbb{P}(P=0.2)=0.16$, $\mathbb{P}(P=0.4)=0.48$ and $\mathbb{P}(P=0.7)=0.36$. It follows that $\mathbb{E}[P]=0.476<1/2$, and hence $P$ is not a p*-variable.

Since twice a p*-variable is a p-variable (Proposition 5.1), it is safe (and conservative) to use $2S_{K}(P_{1},\dots,P_{K})$, which is a p-variable under independence or PRD (note that PRD is preserved under linear transformations).

Other methods on p-values that are relatively robust to dependence include the harmonic mean p-value of Wilson [39] and the Cauchy combination method of Liu and Xie [18]. As shown by Chen et al. [6], the three methods of Simes, harmonic mean, and Cauchy combinations are closely related and similar in several senses.

Obviously, more robustness to dependence leads to a more conservative method. Indeed, all p-merging methods designed for arbitrary dependence are quite conservative in some situations; see the comparative study in [6]. Thus, there is a trade-off between power and robustness to dependence. Among p*-merging methods, the bound (7) relies most heavily on the independence assumption. Using $2S_{K}$ is valid for independent or PRD p*-variables. Finally, all methods in Section 5.1 work for any dependence structure among the p*-variables.

Remark 5.6.

Any function $M$ which merges iid standard uniform random variables $U_{1},\dots,U_{K}$ into a standard uniform one, such as the functions in the methods of Simes, Fisher, and the Cauchy combination, satisfies

$$\mathbb{P}(M(P_{1},\dots,P_{K})\leq\alpha)\leq\alpha\quad\text{for all }\alpha\in(0,1)\qquad(9)$$

for any independent p-variables $P_{1},\dots,P_{K}$. However, such functions generally cannot satisfy (9) for all independent p*-variables (or mid p-variables) $P_{1},\dots,P_{K}$, since $M(P_{1},\dots,P_{K})\geq_{1}M(U_{1},\dots,U_{K})$ does not hold for some choices of $P_{1},\dots,P_{K}$. Therefore, some form of penalty always needs to be paid when relaxing p-values to p*-values or mid p-values for these methods.

Remark 5.7.

The function $S_{K}$ and the inequality (8) play a central role in multiple hypothesis testing and false discovery rate (FDR) control; in particular, the procedure of Benjamini and Hochberg [3] at level $\alpha$ reports at least one discovery for p-values $p_{1},\dots,p_{K}$ if and only if $S_{K}(p_{1},\dots,p_{K})\leq\alpha$, and (8) guarantees that the FDR of this procedure is no larger than $\alpha$ in the global null setting with independent p-values.

6 Calibration between p-values, p*-values, and e-values

In this section, we discuss calibration between p-, p*-, and e-values. Calibration between p-values and e-values is one of the main topics of [35].

6.1 Calibration between p-values and p*-values

Calibration between p-values and p*-values is relatively simple. A p-to-p* calibrator is an increasing function $f:[0,1]\to[0,\infty)$ that transforms p-variables to p*-variables, and a p*-to-p calibrator is an increasing function $g:[0,1]\to[0,\infty)$ which transforms in the reverse direction. Clearly, values of p-values larger than $1$ are irrelevant, and hence we restrict the domain of all calibrators in this section to be $[0,1]$; in other words, input p-variables and p*-variables larger than $1$ will be treated as $1$. A calibrator is said to be admissible if it is not strictly dominated by another calibrator of the same kind (for calibration to p-values and p*-values, $f$ dominates $g$ means $f\leq g$, and for calibration to e-values in Section 6.2 it is the opposite inequality).

Theorem 6.1.
  (i) The p*-to-p calibrator $u\mapsto(2u)\wedge 1$ dominates all other p*-to-p calibrators.

  (ii) An increasing function $f$ on $[0,1]$ is an admissible p-to-p* calibrator if and only if $f$ is left-continuous, $f(0)=0$, $\int_{0}^{v}f(u)\,\mathrm{d}u\geq v^{2}/2$ for all $v\in(0,1)$, and $\int_{0}^{1}f(u)\,\mathrm{d}u=1/2$.

Theorem 6.1 (i) states that a multiplier of $2$ is the best calibrator that works for all p*-values. This observation justifies the deterministic threshold $\alpha/2$ in the test (24) for p*-values, as mentioned in Section 4.4. Although Theorem 6.1 (ii) implies that there are many admissible p-to-p* calibrators, it seems that there is no obvious reason to use anything other than the identity in Proposition 5.1 when calibrating from p-values to p*-values. Finally, we note that the conditions in Theorem 6.1 (ii) imply that the range of $f$ is contained in $[0,1]$, an obvious requirement for an admissible p-to-p* calibrator.

6.2 Calibration between p*-values and e-values

Next, we discuss calibration between e-values and p*-values, which has a richer structure. A p*-to-e calibrator is a decreasing function $f:[0,1]\to[0,\infty]$ that transforms p*-variables to e-variables, and an e-to-p* calibrator $g:[0,\infty]\to[0,1]$ is a decreasing function which transforms in the reverse direction. We include $e=\infty$ in the calibrators, which corresponds to $p=0$.

First, since a p-variable is a p*-variable, any p*-to-e calibrator is also a p-to-e calibrator. Hence, the set of p*-to-e calibrators is contained in the set of p-to-e calibrators. By Proposition 2.1 of [35], any admissible p-to-e calibrator $f:[0,1]\to[0,\infty]$ is a decreasing function such that $f(0)=\infty$, $f$ is left-continuous, and $\int_{0}^{1}f(t)\,\mathrm{d}t=1$. We will see below that some of these admissible p-to-e calibrators are also p*-to-e calibrators.

Regarding the other direction of e-to-p* calibrators, we first recall that there is a unique admissible e-to-p calibrator, given by $e\mapsto e^{-1}\wedge 1$, as shown by [35]. Since the set of p*-values is larger than that of p-values, the above e-to-p calibrator is also an e-to-p* calibrator. The interesting questions are whether there is any e-to-p* calibrator stronger than $e\mapsto e^{-1}\wedge 1$, and whether an admissible e-to-p* calibrator is also unique. The constant map $e\mapsto 1/2$ is an e-to-p* calibrator since $1/2$ is a constant p*-variable. If there exists an e-to-p* calibrator $f$ which dominates all other e-to-p* calibrators, then it is necessary that $f(e)\leq 1/2$ for all $e\geq 0$; however, this would imply $f=1/2$ since any p*-variable has mean at least $1/2$. Since $e\mapsto 1/2$ does not dominate $e\mapsto e^{-1}\wedge 1$, we conclude that there is no e-to-p* calibrator which dominates all others, in contrast to the case of e-to-p calibrators.

Nevertheless, some refined form of domination can be helpful. We say that an e-to-p* calibrator $f$ essentially dominates another e-to-p* calibrator $f^{\prime}$ if $f(e)\leq f^{\prime}(e)$ whenever $f^{\prime}(e)<1/2$. That is, we only require dominance when the calibrated p*-value is useful (relatively small); this consideration is similar to the essential domination of e-merging functions in [35]. It turns out that the e-to-p calibrator $e\mapsto e^{-1}\wedge 1$ can be improved by a factor of $1/2$, and the improved calibrator essentially dominates all other e-to-p* calibrators.

The following theorem summarizes the validity and admissibility results on both directions of calibration.

Theorem 6.2.
  (i) A convex (admissible) p-to-e calibrator is an (admissible) p*-to-e calibrator.

  (ii) An admissible p-to-e calibrator is a p*-to-e calibrator if and only if it is convex.

  (iii) The e-to-p* calibrator $e\mapsto(2e)^{-1}\wedge 1$ essentially dominates all other e-to-p* calibrators.

All practical examples of p-to-e calibrators are convex and admissible; see [35, Section 2 and Appendix B] for a few classes (which are all convex). By Theorem 6.2, all of these calibrators are admissible p*-to-e calibrators. A popular class of p-to-e calibrators is given by, for $\kappa\in(0,1)$,

$$p\mapsto\kappa p^{\kappa-1},\qquad p\in[0,1].\qquad(10)$$

Another simple choice, proposed by Shafer [27], is

$$p\mapsto p^{-1/2}-1,\qquad p\in[0,1].\qquad(11)$$

Clearly, the p-to-e calibrators in (10) and (11) are convex and thus they are p*-to-e calibrators.
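The following sketch collects the calibrators (10), (11) and the e-to-p* calibrator of Theorem 6.2 (iii) as plain functions (the function names and the input value are ours). By Theorem 6.2 (i)-(ii), the two convex calibrators may be applied directly to a p*-value such as a mid p-value.

```python
def p_to_e_power(p, kappa=0.5):     # calibrator (10): kappa * p^(kappa - 1), kappa in (0, 1)
    return kappa * p ** (kappa - 1)

def p_to_e_shafer(p):               # calibrator (11): p^(-1/2) - 1
    return p ** (-0.5) - 1

def e_to_pstar(e):                  # Theorem 6.2 (iii): (2e)^(-1) wedge 1
    return min(1.0, 1.0 / (2.0 * e))

p_star = 0.03                       # a hypothetical mid p-value
print(p_to_e_power(p_star), p_to_e_shafer(p_star), e_to_pstar(p_to_e_shafer(p_star)))
```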

The result in Theorem 6.2 (iii) shows that the unique admissible e-to-p calibrator $e\mapsto e^{-1}\wedge 1$ can actually be achieved by a two-step calibration: first use $p^{*}=(2e)^{-1}\wedge 1$ to get a p*-value, and then use $p=(2p^{*})\wedge 1$ to get a p-value.

On the other hand, all p-to-e calibrators $f$ in [35] are convex, and they can be seen as a composition of the calibrations $p^{*}=p$ and $e=f(p^{*})$. Therefore, p*-values serve as an intermediate step in both directions of calibration between p-values and e-values, although one of the directions is less interesting since the p-to-p* calibrator is an identity. Figure 1 in the Introduction illustrates our recommended calibrators among p-values, p*-values and e-values based on Theorems 6.1 and 6.2, and they are all admissible.

Example 6.3.

Suppose that $U$ is uniformly distributed on $[0,1]$. Using the calibrator (10), for $\kappa\in(0,1)$, $E:=\kappa U^{\kappa-1}$ is an e-variable. By Theorem 6.2 (iii), we know that $P:=(2E)^{-1}$ is a p*-variable. Below we check this directly. The left-quantile function $G_{P}$ of $P$ satisfies

$$G_{P}(u)=\frac{u^{1-\kappa}}{2\kappa},\qquad u\in(0,1).$$

Using $\kappa(2-\kappa)\leq 1$ for all $\kappa\in(0,1)$, we have

$$\int_{0}^{v}G_{P}(u)\,\mathrm{d}u=\frac{v^{2-\kappa}}{2\kappa(2-\kappa)}\geq\frac{v^{2-\kappa}}{2}\geq\frac{v^{2}}{2},\qquad v\in(0,1).$$

Hence, $P$ is a p*-variable by verifying (1). Moreover, for $\kappa\in(0,1/2]$, $P$ is even a p-variable, since $G_{P}(u)\geq u$ for $u\in(0,1)$.

In the next result, we show that a p*-variable obtained from the calibrator in Theorem 6.2 (iii) is a p-variable under a further condition (DE):

  (DE) $E\leq_{1}E^{\prime}$ for some e-variable $E^{\prime}$ which has a decreasing density on $(0,\infty)$.

In particular, condition (DE) is satisfied if $E$ itself has a decreasing density on $(0,\infty)$. Examples of e-variables satisfying (DE) are those obtained from applying a non-constant convex p-to-e calibrator $f$ with $f(1)=0$ to any p-variable, e.g., the p-to-e calibrator (11) but not (10); this is because convexity of the calibrator yields a decreasing density when applied to a uniform p-variable.

Proposition 6.4.

For any e-variable $E$, $P:=(2E)^{-1}\wedge 1$ is a p*-variable, and if $E$ satisfies (DE), then $P$ is a p-variable.

Remark 6.5.

In a spirit similar to Proposition 6.4, smoothing techniques leading to an extra factor of $2$ in the Markov inequality have been studied by Huber [15].

7 Testing with e-values and martingales

In this section we discuss applications of p*-values to tests with e-values and martingales. E-values and test martingales are usually used for purposes beyond rejecting a null hypothesis while controlling the type-I error; in particular, they offer anytime validity and different interpretations of statistical evidence (e.g., [13]). We compare the power of several methods here for a better understanding of their performance, while keeping in mind that single-run detection power (which is maximized by p-values if they are available) is not the only purpose of e-values.

Suppose that $E$ is an e-variable, usually obtained from likelihood ratios or stopped test supermartingales (e.g., [28], [27]). A traditional e-test is

$$\mbox{rejecting the null hypothesis}~\Longleftrightarrow~E\geq\frac{1}{\alpha}.\qquad(12)$$

Using the fact that $(2E)^{-1}$ is a p*-variable, from Theorem 6.2 (iii), we can design the randomized test

$$\mbox{rejecting the null hypothesis}~\Longleftrightarrow~2E\geq\frac{1}{V},\qquad(13)$$

where $V\sim\mathrm{U}[0,2\alpha]$ is independent of $E$ (Proposition 4.5). The test (13) has a $3/4$ chance of being more powerful than the traditional choice of testing $E$ against $1/\alpha$ in (12). Randomization is undesirable, but (13) inspires us to look for alternative deterministic methods.

Suppose that one has two independent e-variables $E_{1}$ and $E_{2}$ for a null hypothesis. As shown by [35], it is optimal in a weak sense to use the combined e-variable $E_{1}E_{2}$ for testing the null. Assume further that one of $E_{1}$ and $E_{2}$ satisfies condition (DE).

Using (13) with the random threshold $\alpha E_{2}$ and Proposition B.5 in Appendix B, we get $\mathbb{P}((2E_{1})^{-1}\leq\alpha E_{2})\leq\alpha$ (note that the positions of $E_{1}$ and $E_{2}$ are symmetric here). Hence, the test

$$\mbox{rejecting the null hypothesis}~\Longleftrightarrow~2E_{1}E_{2}\geq\frac{1}{\alpha}\qquad(14)$$

has size at most $\alpha$. The threshold of the test (14) is half the one obtained by directly applying (12) to the e-variable $E_{1}E_{2}$. Thus, the test statistic is boosted by a factor of $2$ via condition (DE) on either $E_{1}$ or $E_{2}$. No assumption is needed for the other e-variable. In particular, by setting $E_{2}=1$, we get a p-variable $(2E_{1})^{-1}$ if $E_{1}$ satisfies (DE), as we see in Proposition 6.4.

E-values calibrated from p-values are useful in the context of testing randomness online (see [31]) and designing test martingales (see [9]). More specifically, for a possibly infinite sequence of independent p-variables $(P_{t})_{t\in\mathbb{N}}$ and a sequence of p-to-e calibrators $(f_{t})_{t\in\mathbb{N}}$, the stochastic process

$$X_{t}=\prod_{k=1}^{t}f_{k}(P_{k}),\qquad t=0,1,\dots$$

is a supermartingale (with respect to the filtration of $(P_{t})_{t\in\mathbb{N}}$) with initial value $X_{0}=1$ (it is a martingale if $P_{t}$, $t\in\mathbb{N}$, are standard uniform and $f_{t}$, $t\in\mathbb{N}$, are admissible). As a supermartingale, $(X_{t})_{t\in\mathbb{N}}$ satisfies anytime validity, i.e., $X_{\tau}$ is an e-variable for any stopping time $\tau$; moreover, Ville’s inequality gives

$$\mathbb{P}\left(\sup_{t\in\mathbb{N}}X_{t}\geq\frac{1}{\alpha}\right)\leq\alpha\quad\text{for any }\alpha>0.\qquad(15)$$

The process $(X_{t})_{t\in\mathbb{N}}$ is called an e-process by [37]. Anytime validity is crucial in the design of online testing where evidence arrives sequentially in time, and a scientific discovery is reported at a stopping time at which the evidence is considered sufficient.

Notably, the most popular choices of p-to-e calibrators are those in (10) and (11) (see e.g., [31]), which are convex. Theorem 6.2 implies that if the inputs are not p-values but p*-values, we can still obtain e-processes using convex calibrators such as (10) and (11), without first calibrating these p*-values to p-values. This observation becomes useful when each observed $P_{t}$ is only a p*-variable, e.g., a mid p-value or an average of several p-values from parallel experiments.

Moreover, for a fixed $t\in\mathbb{N}$, if there is a convex $f_{s}$ for some $s\in\{1,\dots,t\}$ with $f_{s}(1)=0$, and $P_{s}$ is a p-variable (the others can be p*-variables with any p*-to-e calibrators), then (DE) is satisfied by $f_{s}(P_{s})$, and we have $\mathbb{P}(X_{t}\geq 1/\alpha)\leq\alpha/2$ by using the test (14); see our numerical experiments below.

Simulation experiments

In the simulation results below, we generate test martingales following [35]. Similarly to Section B.2, the null hypothesis $H_{0}$ is $\mathrm{N}(0,1)$ and the alternative is $\mathrm{N}(\delta,1)$ for some $\delta>0$. We generate iid $X_{1},\dots,X_{n}$ from $\mathrm{N}(\delta,1)$. Define the e-variables from the likelihood ratios of the alternative to the null density,

$$E_{t}:=\frac{\exp(-(X_{t}-\delta)^{2}/2)}{\exp(-X_{t}^{2}/2)}=\exp(\delta X_{t}-\delta^{2}/2),\qquad t=1,\dots,n.\qquad(16)$$

The e-process $S=(S_{t})_{t=1,\dots,n}$ is defined as $S_{t}=\prod_{s=1}^{t}E_{s}$. Such an e-process $S$ is growth optimal in the sense of Shafer [27], as it maximizes the expected log growth among all test martingales built on the data $(X_{1},\dots,X_{n})$; indeed, $S$ is Kelly’s strategy under the betting interpretation. Here, we constructed the e-process $S$ assuming that we know $\delta$; otherwise we can use universal test martingales (e.g., [7]) by taking a mixture of $S$ over $\delta$ under some probability measure.
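A minimal sketch of this construction (the parameter values and the random seed are our illustrative choices): the e-variables (16) are formed from simulated data, and the average log increment of $S$ estimates the optimal growth rate $\delta^{2}/2$.

```python
import numpy as np

rng = np.random.default_rng(2)
delta, n = 0.5, 100

x = rng.normal(delta, 1.0, size=n)          # data generated under the alternative N(delta, 1)
E = np.exp(delta * x - delta ** 2 / 2)      # likelihood-ratio e-variables E_t from (16)
S = np.cumprod(E)                           # e-process S_t = E_1 * ... * E_t

print(np.mean(np.log(E)), delta ** 2 / 2)   # empirical vs. optimal expected log growth per step
print(S[-1])                                # terminal value S_n used by the tests below
```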

Note that each $E_{t}$ is log-normally distributed and does not satisfy (DE). Hence, (14) cannot be applied to $S_{n}$. Nevertheless, we can replace $E_{1}$ by another e-variable $E_{1}^{\prime}$ which satisfies (DE). We choose $E_{1}^{\prime}$ by applying the p-to-e calibrator (11) to the p-variable $P_{1}=1-\Phi(X_{1})$, namely, $E_{1}^{\prime}=(P_{1})^{-1/2}-1$.

Replacing $E_{1}$ by $E_{1}^{\prime}$, we obtain the new e-process $S^{\prime}=(S^{\prime}_{t})_{t=1,\dots,n}$ given by $S^{\prime}_{t}=E_{1}^{\prime}\prod_{s=2}^{t}E_{s}$. The e-process $S^{\prime}$ is not growth optimal, but as $E_{1}^{\prime}$ satisfies (DE), we can test via the rejection condition $2S^{\prime}_{n}\geq 1/\alpha$, thus boosting the terminal value by a factor of $2$. Let $V\sim\mathrm{U}[0,2\alpha]$ be independent of the test statistics. We compare five different tests, all with size at most $\alpha$:

  (a) applying (12) to $S_{n}$: reject $H_{0}$ if $S_{n}\geq 1/\alpha$ (benchmark case);

  (b) applying (13) to $S_{n}$: reject $H_{0}$ if $2S_{n}\geq 1/V$;

  (c) applying (14) to $S^{\prime}_{n}$: reject $H_{0}$ if $2S^{\prime}_{n}\geq 1/\alpha$;

  (d) applying a combination of (13) and (14) to $S^{\prime}_{n}$: reject $H_{0}$ if $2S^{\prime}_{n}\geq 1/V$;

  (e) applying (15) to the maximum of $S$: reject $H_{0}$ if $\max_{1\leq t\leq n}S_{t}\geq 1/\alpha$.

Since test (a) is strictly dominated by test (e), we do not need to use (a) in practice; nevertheless, we treat it as a benchmark for comparing tests based on e-values, as it is built on the fundamental connection between e-values and p-values: the e-to-p calibrator $e\mapsto e^{-1}\wedge 1$.
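For one simulated data set, the five decision rules can be written out as follows (a self-contained sketch; the parameter values, seed, and variable names are ours, and a single replication is shown rather than the 10,000 replications used for the power estimates).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
delta, n, alpha = 0.5, 10, 0.01

x = rng.normal(delta, 1.0, size=n)                          # data under the alternative
E = np.exp(delta * x - delta ** 2 / 2)                      # e-variables (16)
S = np.cumprod(E)                                           # e-process S

P1 = 1 - norm.cdf(x[0])                                     # p-variable P_1 = 1 - Phi(X_1)
E1_prime = P1 ** (-0.5) - 1                                 # calibrator (11); satisfies (DE)
S_prime = np.cumprod(np.concatenate(([E1_prime], E[1:])))   # e-process S' with E_1 replaced

V = rng.uniform(0, 2 * alpha)                               # random threshold for (b) and (d)
print("(a)", S[-1] >= 1 / alpha)
print("(b)", 2 * S[-1] >= 1 / V)
print("(c)", 2 * S_prime[-1] >= 1 / alpha)
print("(d)", 2 * S_prime[-1] >= 1 / V)
print("(e)", S.max() >= 1 / alpha)
```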

The significance level $\alpha$ is set to be $0.01$. The power of the five tests is computed from the average of 10,000 replications for varying signal strength $\delta$ and for $n\in\{2,10,100\}$. Results are reported in Figure 3. For most values of $\delta$ and $n$, either the deterministic test (c) for $S^{\prime}$ or the maximum test (e) has the best performance. The deterministic test (c) performs very well in the cases $n=2$ and $n=10$, especially for weak signals; this may be explained by the factor of $2$ being substantial when the signal is weak. If $n$ is large and the signal is not too weak, the effect of using the maximum of $S$ in (e) is dominating; this is not surprising. Although the randomized test (b) usually improves the performance over the benchmark case (a), the advantages seem to be quite limited, especially in view of the extra randomization, which is often undesirable.

Figure 3: Tests based on e-values; the second row is zoomed in from the first row.

8 Testing with combined mid p-values

We compare by simulation the performance of a few tests via merging mid p-values. The (global) null hypothesis $H_{0}$ is that the test statistic $T$ follows a binomial distribution $\mathrm{Binomial}(n,\pi)$, and $K$ tests are conducted. We set $n=40$ and $\pi=0.3$, so that the obtained p-values are considerably discrete. We denote by $P_{1},\dots,P_{K}$ the obtained p-values via (2), and by $P^{*}_{1},\dots,P^{*}_{K}$ the obtained mid p-values via (3). Let $\bar{P}^{*}$ and $\tilde{P}^{*}$ be the arithmetic average and the geometric average of $P^{*}_{1},\dots,P^{*}_{K}$, respectively.

The true data-generating distribution of the test statistics is a binomial distribution $\mathrm{Binomial}(n,(1-\theta)\pi)$, where $\theta\in[0,1]$. The case $\theta=0$ means that the null hypothesis $H_{0}$ is true, and a larger $\theta$ indicates a stronger signal.

We allow the test statistics to be correlated, and this is achieved by simulating from a Gaussian copula with common pairwise correlation parameter $\rho$ (more precisely, we first simulate from a Gaussian copula, and then obtain the observable discrete test statistics by a quantile transform). We consider the following tests (there are other tests possible for this setting, and we only compare these four to illustrate a few relevant points):

  (a) the probability bound (7) on the arithmetic mean in [23]: reject $H_{0}$ if $\bar{P}^{*}\leq 1/2-(-\log(\alpha)/(6K))^{1/2}$;

  (b) the arithmetic mean times $2$ using Proposition 5.3: reject $H_{0}$ if $\bar{P}^{*}\leq\alpha/2$;

  (c) the geometric average of p*-values using Theorem 5.2: reject $H_{0}$ if $\tilde{P}^{*}\leq\alpha/\mathrm{e}$;

  (d) the Bonferroni correction: reject $H_{0}$ if $\min(P_{1},\dots,P_{K})\leq\alpha/K$.

Note that tests (a), (b) and (c) use mid p-values based on methods for p*-values, and (d) uses p-values. All of (b), (c) and (d) are valid tests under arbitrary dependence (AD) whereas the validity of (a) requires independence. Therefore, we expect (a) to perform very well in case independence holds. All other methods are valid but conservative, as there is a big price to pay to gain robustness against all dependence structures.

The significance level $\alpha$ is set to be $0.05$ for good visibility. The power of the four tests is reported in Figure 4, computed from the average of 10,000 replications for varying signal strength $\theta\in[0,1]$ and for $\rho=0$ (independence), $\rho=0.2$ (mild dependence) and $\rho=0.8$ (strong dependence). The situation of $\rho=0.8$ is the most relevant to us, as averaging methods are designed mostly for situations where the presence of strong or complicated dependence is suspected.
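A single replication of this experiment can be sketched as follows (all parameter values, including $K$, the seed, and the restriction to tests (b) and (c) are our choices for illustration): discrete test statistics are drawn through an equicorrelated Gaussian copula and converted to mid p-values under the null model.

```python
import numpy as np
from scipy.stats import norm, binom

rng = np.random.default_rng(3)
n, pi, K, rho, theta, alpha = 40, 0.3, 50, 0.8, 0.2, 0.05

# Equicorrelated Gaussian copula: Z_k = sqrt(rho) * W + sqrt(1 - rho) * eps_k
W = rng.normal()
Z = np.sqrt(rho) * W + np.sqrt(1 - rho) * rng.normal(size=K)
T = binom.ppf(norm.cdf(Z), n, (1 - theta) * pi)       # discrete statistics via quantile transform

# Mid p-values (3) computed under the null Binomial(n, pi)
p_mid = 0.5 * binom.cdf(T - 1, n, pi) + 0.5 * binom.cdf(T, n, pi)

print("(b)", p_mid.mean() <= alpha / 2)                          # arithmetic mean, Proposition 5.3
print("(c)", np.exp(np.mean(np.log(p_mid))) <= alpha / np.e)     # geometric mean, Theorem 5.2
```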

Figure 4: Tests based on combining p-values or mid p-values.

As we can see from Figure 4, the test (a) relying on independence has the strongest power, as expected. However, its size becomes severely inflated as soon as mild dependence is present, and hence it can only be applied in situations where independence among the obtained mid p-values can be justified. Indeed, the size of test (a) can be $\approx 0.4$ in case $\rho=0.2$, which is clearly not useful. Among the three methods that are valid for arbitrary dependence, the geometric average (c) has stronger power for large $\rho$ and the Bonferroni correction (d) has stronger power for small $\rho$. The arithmetic average (b) performs poorly unless the p-values are very strongly correlated. These observations on merging mid p-values are consistent with those in [34] on merging p-values. In conclusion, the geometric average (c) can be useful when the dependence among mid p-values is suspected to be strong or complicated, although unknown to the decision maker.

9 Conclusion

In this paper we introduced p*-values (p*-variables) as an abstract measure-theoretic object. The notion of p*-values generalizes p-values in several senses, and it enjoys many attractive theoretical properties in contrast to p-values. In particular, mid p-values, which arise in the presence of discrete test statistics, form an important subset of p*-values. Merging methods for p*-values are studied. In particular, a weighted geometric average of arbitrarily dependent p*-values multiplied by $\mathrm{e}\approx 2.718$ yields a valid p-value, which can be useful when multiple mid p-values are possibly strongly correlated.

Results on calibration between p*-values and e-values reveal that p*-values serve as an intermediate step in both the standard e-to-p and p-to-e calibrations. Although a direct test with p*-values may involve randomization, we find that p*-values are useful in the design of deterministic tests with averages of p-values, mid p-values, and e-values. In view of the results in this paper, the concept of p*-values serves as a useful technical tool that enhances the extensive and growing applications of p-values, mid p-values, and e-values.

{acks}

[Acknowledgments] The author thanks Ilmun Kim, Aaditya Ramdas, and Vladimir Vovk for constructive comments on an earlier version of the paper, and Tiantian Mao, Marcel Nutz, and Qinyu Wu for kind help on some technical statements.

{funding}

The author acknowledges financial support from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2018-03823, RGPAS-2018-522590).

References

  • Bates et al. [2021] Bates, S., Candès, E., Lei, L., Romano, Y. and Sesia, M. (2021). Testing for outliers with conformal p-values. arXiv: 2104.08279.
  • Benjamini and Hochberg [1995] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57(1), 289–300.
  • Benjamini and Hochberg [1997] Benjamini, Y. and Hochberg, Y. (1997). Multiple hypotheses testing with weights. Scandinavian Journal of Statistics, 24(3), 407–418.
  • Benjamini and Yekutieli [2001] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165–1188.
  • Chen et al. [2022] Chen, Y., Liu, P., Tan, K. S. and Wang, R. (2022). Trade-off between validity and efficiency of merging p-values under arbitrary dependence. Statistica Sinica, forthcoming.
  • Howard et al. [2021] Howard, S. R., Ramdas, A., McAuliffe, J. and Sekhon, J. (2021). Time-uniform, nonparametric, nonasymptotic confidence sequences. Annals of Statistics, 49(2), 1055–1080.
  • Döhler et al. [2018] Döhler, S., Durand, G. and Roquain, E. (2018). New FDR bounds for discrete and heterogeneous tests. Electronic Journal of Statistics, 12(1), 1867–1900.
  • Duan et al. [2020] Duan, B., Ramdas, A., Balakrishnan, S. and Wasserman, L. (2020). Interactive martingale tests for the global null. Electronic Journal of Statistics, 14(2), 4489–4551.
  • Efron [2010] Efron, B. (2010). Large-scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press.
  • Genovese and Wasserman [2004] Genovese, C. and Wasserman, L. (2004). A stochastic process approach to false discovery control. Annals of Statistics, 32, 1035–1061.
  • Goeman and Solari [2011] Goeman, J. J. and Solari, A. (2011). Multiple testing for exploratory research. Statistical Science, 26(4), 584–597.
  • Grünwald et al. [2020] Grünwald, P., de Heide, R. and Koolen, W. M. (2020). Safe testing. arXiv: 1906.07801v2.
  • Habiger [2015] Habiger, J. D. (2015). Multiple test functions and adjusted p-values for test statistics with discrete distributions. Journal of Statistical Planning and Inference, 167, 1–13.
  • Huber [2019] Huber, M. (2019). Halving the bounds for the Markov, Chebyshev, and Chernoff Inequalities using smoothing. The American Mathematical Monthly, 126(10), 915–927.
  • Lancaster [1952] Lancaster, H. O. (1952). Statistical control of counting experiments. Biometrika, 39(3/4), 419–422.
  • Liu and Wang [2021] Liu, F. and Wang, R. (2021). A theory for measures of tail risk. Mathematics of Operations Research, 46(3), 1109–1128.
  • Liu and Xie [2020] Liu, Y. and Xie, J. (2020), Cauchy combination test: A powerful test with analytic p-value calculation under arbitrary dependency structures. Journal of the American Statistical Association, 115(529), 393–402.
  • Mao et al. [2019] Mao, T., Wang, B. and Wang, R. (2019). Sums of uniform random variables. Journal of Applied Probability, 56(3), 918–936.
  • Meng [1994] Meng, X. L. (1994). Posterior predictive p-values. Annals of Statistics, 22(3), 1142–1160.
  • Müller and Stoyan [2002] Müller, A. and Stoyan, D. (2002). Comparison Methods for Stochastic Models and Risks. Wiley, England.
  • Ramdas et al. [2019] Ramdas, A. K., Barber, R. F., Wainwright, M. J. and Jordan, M. I. (2019). A unified treatment of multiple testing with prior knowledge using the p-filter. Annals of Statistics, 47(5), 2790–2821.
  • Rubin-Delanchy et al. [2019] Rubin-Delanchy, P., Heard, N. A. and Lawson, D. J. (2019). Meta-analysis of mid-p-values: Some new results based on the convex order. Journal of the American Statistical Association, 114(527), 1105–1112.
  • Rüschendorf [1982] Rüschendorf, L. (1982). Random variables with maximum sums. Advances in Applied Probability, 14(3), 623–632.
  • Rüschendorf [2013] Rüschendorf, L. (2013). Mathematical Risk Analysis. Dependence, Risk Bounds, Optimal Allocations and Portfolios. Springer, Heidelberg.
  • Sarkar [1998] Sarkar, S. K. (1998). Some probability inequalities for ordered MTP2 random variables: a proof of the Simes conjecture. Annals of Statistics, 26(2), 494–504.
  • Shafer [2021] Shafer, G. (2021). The language of betting as a strategy for statistical and scientific communication. Journal of the Royal Statistical Society Series A, 184(2), 407–431.
  • Shafer et al. [2011] Shafer, G., Shen, A., Vereshchagin, N. and Vovk, V. (2011). Test martingales, Bayes factors, and p-values. Statistical Science, 26, 84–101.
  • Shaked and Shanthikumar [2007] Shaked, M. and Shanthikumar, J. G. (2007). Stochastic Orders. Springer Series in Statistics.
  • Simes [1986] Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73, 751–754.
  • Vovk [2020] Vovk, V. (2020). Testing randomness online. Statistical Science, 36(4), 595–611.
  • Vovk et al. [2005] Vovk, V., Gammerman, A. and Shafer, G. (2005). Algorithmic Learning in a Random World. Springer, New York.
  • Vovk et al. [2022] Vovk, V., Wang, B. and Wang, R. (2022). Admissible ways of merging p-values under arbitrary dependence. Annals of Statistics, 50(1), 351–375.
  • Vovk and Wang [2020] Vovk, V. and Wang, R. (2020). Combining p-values via averaging. Biometrika, 107(4), 791–808.
  • Vovk and Wang [2021] Vovk, V. and Wang, R. (2021). E-values: Calibration, combination, and applications. Annals of Statistics, 49(3), 1736–1754.
  • Wang and Wang [2015] Wang, B. and Wang, R. (2015). Extreme negative dependence and risk aggregation. Journal of Multivariate Analysis. 136, 12–25.
  • Wang and Ramdas [2022] Wang, R. and Ramdas, A. (2022). False discovery rate control with e-values. Journal of the Royal Statistical Society Series B, forthcoming.
  • Wasserman et al. [2020] Wasserman, L., Ramdas, A. and Balakrishnan, S. (2020). Universal inference. Proceedings of the National Academy of Sciences, 117(29), 16880–16890.
  • Wilson [2019] Wilson, D. J. (2019). The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences, 116, 1195–1200.

Appendix A Proofs of all results

In this appendix, we collect proofs of all theorems and propositions in the main paper.

A.1 Proofs of results in Section 3

Proof of Theorem 3.1.
  1. (i)

    For VV being a strictly increasing function of UU, we have 𝔼[U|V]=U\mathbb{E}[U|V]=U, and the equivalence statement follows directly from the definition of p-variables.

  2. (ii)

    We first show the “if” statement. Write V=f(U)V=f(U) for an increasing function ff. Denote by FF and GG the distribution function and the left-quantile function of VV, respectively, and let ZZ be a standard uniform random variable independent of VV. Moreover, let

    UV=F(V)+Z(F(V)F(V)),U_{V}=F(V-)+Z(F(V)-F(V-)),

    which is uniformly distributed on [0,1][0,1] and satisfies G(UV)=VG(U_{V})=V a.s. (e.g., [25, Proposition 1.3]). Since G(UV)=V=f(U)G(U_{V})=V=f(U) a.s., and both GG and ff are increasing, we know that the functions GG and ff differ on a set of Lebesgue measure 0. Therefore, 𝔼[U|V]=𝔼[U|G(U)]\mathbb{E}[U|V]=\mathbb{E}[U|G(U)], which is identically distributed as 𝔼[UV|G(UV)]\mathbb{E}[U_{V}|G(U_{V})]. Moreover,

    𝔼[UV|G(UV)]=𝔼[UV|V]=12F(V)+12F(V).\mathbb{E}[U_{V}|G(U_{V})]=\mathbb{E}[U_{V}|V]=\frac{1}{2}F(V-)+\frac{1}{2}F(V). (17)

    Therefore, P1𝔼[U|V]P\geq_{1}\mathbb{E}[U|V] implies P1(F(V)+F(V))/2P\geq_{1}(F(V-)+F(V))/2, and thus PP is a mid p-variable.

    The “only if” statement follows from P1(F(T)+F(T))/2P\geq_{1}(F(T-)+F(T))/2 and (17) by choosing U=UVU=U_{V} and V=TV=T.

  3. (iii)

    We first show the “if” statement. Note that P𝔼[U|V]2UP\geq\mathbb{E}[U|V]\geq_{2}U where the second inequality is guaranteed by Jensen’s inequality. Since \geq is stronger than 2\geq_{2} and 2\geq_{2} is transitive, we get P2UP\geq_{2}U and hence PP is a p*-variable.

    Next, we show the “only if” statement. By using [29, Theorems 4.A.5], the definition of a p*-variable PP implies that there exist a standard uniform random variable UU and a random variable VV such that P1V𝔼[U|V]P\geq_{1}V\geq\mathbb{E}[U|V]. ∎

Proof of Proposition 3.3.

By Theorem 4.1, the set of p*-variables is the convex hull of the set of p-variables, and thus convex. This also implies that neither the set of p-variables nor that of mid p-variables is convex.

To show that the set of p*-variables is closed under distribution mixtures, it suffices to note that the stochastic orders ≤₁ and ≤₂ (indeed, any order induced by inequalities via integrals) are closed under distribution mixtures.

To see that the set of mid p-variables is not closed under distribution mixtures, we note from (3) that any mid p-variable with mean 1/21/2 and a point-mass at 1/21/2 must not have any density in a neighbourhood of 1/21/2. Hence, the mixture of uniform distribution on [0,1][0,1] and a point-mass at 1/21/2 is not the distribution of a mid p-variable.

Closure under convergence for 1\leq_{1} is justified by Theorem 1.A.3 of [29], and closure under convergence for 2\leq_{2} is justified by Theorem 1.5.9 of [21]. Closure under convergence for the set of mid p-values follows by noting that the set of distributions of PTP_{T} in (3) is closed under convergence in distribution, which can be checked by definition. ∎

A.2 Proofs of results in Section 4

Proof of Theorem 4.1.

We first show that a convex combination of p-variables is a p*-variable. Let UU be a uniform random variable on [0,1][0,1], P1,,PKP_{1},\dots,P_{K} be KK p-variables, (λ1,,λK)(\lambda_{1},\dots,\lambda_{K}) be an element of the standard KK-simplex, and ff be an increasing concave function. By monotonicity and concavity of ff, we have

𝔼[f(k=1KλkPk)]𝔼[k=1Kλkf(Pk)]𝔼[k=1Kλkf(U)]=𝔼[f(U)].\mathbb{E}\left[f\left(\sum_{k=1}^{K}\lambda_{k}P_{k}\right)\right]\geq\mathbb{E}\left[\sum_{k=1}^{K}\lambda_{k}f(P_{k})\right]\geq\mathbb{E}\left[\sum_{k=1}^{K}\lambda_{k}f(U)\right]=\mathbb{E}[f(U)].

Therefore, k=1KλkPk2U\sum_{k=1}^{K}\lambda_{k}P_{k}\geq_{2}U and thus k=1KλkPk\sum_{k=1}^{K}\lambda_{k}P_{k} is a p*-variable.

Next, we show the second statement that any p*-variable can be written as the average of three p-variables, which also justifies the “only if” direction of the first statement.

Let P be a p*-variable satisfying 𝔼[P] = 1/2. Note that P ≥₂ U and 𝔼[P] = 𝔼[U] together imply P ≥_cv U (see e.g., [29, Theorem 4.A.35]), where ≤_cv is the concave order, meaning that 𝔼[f(P)] ≥ 𝔼[f(U)] for all concave f. Theorem 5 of [19] says that for any P ≥_cv U, there exist three standard uniform random variables P₁, P₂, P₃ such that 3P = P₁ + P₂ + P₃ (this statement is highly non-trivial). This implies that P can be written as the arithmetic average of the three p-variables P₁, P₂, P₃.

Finally, assume that the p*-variable PP satisfies 𝔼[P]>1/2\mathbb{E}[P]>1/2. In this case, using Strassen’s Theorem in the form of [29, Theorems 4.A.5 and 4.A.6], there exists a random variable ZZ such that UcvZPU\leq_{\rm cv}Z\leq P. As we explained above, there exist p-variables P1,P2,P3P_{1},P_{2},P_{3} such that 3Z=P1+P2+P33Z=P_{1}+P_{2}+P_{3}. For i=1,2,3i=1,2,3, let P~i:=Pi+(PZ)1Pi\tilde{P}_{i}:=P_{i}+(P-Z)\geq_{1}P_{i}. Note that P~1,P~2,P~3\tilde{P}_{1},\tilde{P}_{2},\tilde{P}_{3} are p-variables and 3P=P~1+P~2+P~33P=\tilde{P}_{1}+\tilde{P}_{2}+\tilde{P}_{3}. Hence, PP can be written as the arithmetic average of three p-variables. ∎

Proof of Proposition 4.3.

For the “only-if” statement in (i), since PP is a p-variable, we know that its distribution FF satisfies F(t)tF(t)\leq t for t(0,1)t\in(0,1). Therefore, by setting T=PT=P,

PF(P)=F(T)=(TT|T)=(TT|X),P\geq F(P)=F(T)=\mathbb{P}(T^{\prime}\leq T|T)=\mathbb{P}(T^{\prime}\leq T|X),

where the last equality holds since TT^{\prime} is independent of XX. To check the “if” direction of (i), we have (TT|X)=(TT|T)=F(T)\mathbb{P}(T^{\prime}\leq T|X)=\mathbb{P}(T^{\prime}\leq T|T)=F(T) where FF is the distribution of TT. Note that F(T)F(T) is stochastically larger than or equal to a uniform random variable on [0,1][0,1], and hence (F(T)t)t\mathbb{P}(F(T)\leq t)\leq t.

Next, we show (ii). First, suppose that P(TT|X)P\geq\mathbb{P}(T^{\prime}\leq T|X). Let UU be a uniform random variable on [0,1][0,1]. By Jensen’s inequality, we have 𝔼[F(T)|X]2F(T)\mathbb{E}[F(T)|X]\geq_{2}F(T). Hence, P2(TT|X)=𝔼[F(T)|X]2F(T)2UP\geq_{2}\mathbb{P}(T^{\prime}\leq T|X)=\mathbb{E}[F(T)|X]\geq_{2}F(T)\geq_{2}U, and thus PP is a p*-variable.

For the converse direction, suppose that PP is a p*-variable. By Strassen’s Theorem in the form of [29, Theorem 3.A.4], there exists a uniform random variable UU^{\prime} on [0,1][0,1] and a random variable PP^{\prime} identically distributed as PP such that P𝔼[U|P]P^{\prime}\geq\mathbb{E}[U^{\prime}|P^{\prime}]. Let G(|p)G(\cdot|p) be the left-quantile function of a regular conditional distribution of UU^{\prime} given P=p[0,1]P^{\prime}=p\in[0,1]. Further, let VV be a uniform random variable on [0,1][0,1] independent of (P,X)(P,X), and U:=G(V|P)U:=G(V|P). It is clear that (U,P)(U,P) has the same law as (U,P)(U^{\prime},P^{\prime}). Therefore, P𝔼[U|P]P\geq\mathbb{E}[U|P]. Moreover, 𝔼[U|X]=𝔼[G(V|P)|X]=𝔼[G(V|P)|P]\mathbb{E}[U|X]=\mathbb{E}[G(V|P)|X]=\mathbb{E}[G(V|P)|P] since VV is independent of XX. Hence, P𝔼[U|P]=𝔼[U|X]P\geq\mathbb{E}[U|P]=\mathbb{E}[U|X]. Let VV^{\prime} be another uniform random variable on [0,1][0,1] independent of (U,X)(U,X). We have

(VU|X)=𝔼[𝟙{VU}|X]=𝔼[U|X].\mathbb{P}(V^{\prime}\leq U|X)=\mathbb{E}\left[\mathds{1}_{\{V^{\prime}\leq U\}}|X\right]=\mathbb{E}[U|X].

Hence, the representation P(TT|X)P\geq\mathbb{P}(T^{\prime}\leq T|X) holds with T=VT^{\prime}=V^{\prime} and T=UT=U.

Finally, we show the last statement on replacing (TT|X)\mathbb{P}(T^{\prime}\leq T|X) by (TT|X)/2+(T<T|X)/2\mathbb{P}(T^{\prime}\leq T|X)/2+\mathbb{P}(T^{\prime}<T|X)/2. The “only-if” direction follows from the argument for (ii) by noting that UU constructed above has a continuous distribution. The “if” direction follows from

12((TT|X)+(T<T|X))\displaystyle\frac{1}{2}(\mathbb{P}(T^{\prime}\leq T|X)+\mathbb{P}(T^{\prime}<T|X)) =12(𝔼[F(T)|X]+𝔼[F(T)|X])\displaystyle=\frac{1}{2}(\mathbb{E}[F(T)|X]+\mathbb{E}[F(T-)|X])
212(F(T)+F(T))2U,\displaystyle\geq_{2}\frac{1}{2}(F(T)+F(T-))\geq_{2}U,

where the second-last inequality is Jensen’s, and the last inequality is implied by Theorem 3.1. ∎

Proof of Proposition 4.5.

The “if” statement is implied by Proposition B.1. To show the “only if” statement, denote by F_P the distribution function of P and by F_U the distribution function of a uniform random variable U on [0,1]. We have

α(PVα)=02αFP(u)2αdu.\alpha\geq\mathbb{P}(P\leq V_{\alpha})=\int_{0}^{2\alpha}\frac{F_{P}(u)}{2\alpha}\,\mathrm{d}u.

Therefore, for v(0,1]v\in(0,1], we have

0vFP(u)duv22=0vudu=0vFU(u)du.\int_{0}^{v}{F_{P}(u)}\,\mathrm{d}u\leq\frac{v^{2}}{2}=\int_{0}^{v}u\,\mathrm{d}u=\int_{0}^{v}F_{U}(u)\,\mathrm{d}u.

By Theorem 4.A.2 of [29], the above inequality implies U2PU\leq_{2}P. Hence, PP is a p*-variable. ∎

A.3 Proofs of results in Section 5

Proof of Proposition 5.1.

Let UU be a uniform random variable on [0,1][0,1]. The first statement is trivial by definition. For the second statement, let P1,P2P_{1},P_{2} be two p*-variables. For any ϵ(0,1)\epsilon\in(0,1),

(P1+P2ϵ)\displaystyle\mathbb{P}(P_{1}+P_{2}\leq\epsilon) =𝔼[𝟙{P1+P2ϵ}]\displaystyle=\mathbb{E}[\mathds{1}_{\{P_{1}+P_{2}\leq\epsilon\}}]
𝔼[1ϵ(2ϵP1P2)+]\displaystyle\leq\mathbb{E}\left[\frac{1}{\epsilon}{(2\epsilon-P_{1}-P_{2})_{+}}\right]
𝔼[1ϵ(ϵP1)+]+𝔼[1ϵ(ϵP2)+]\displaystyle\leq\mathbb{E}\left[\frac{1}{\epsilon}{(\epsilon-P_{1})_{+}}\right]+\mathbb{E}\left[\frac{1}{\epsilon}{(\epsilon-P_{2})_{+}}\right]
2𝔼[1ϵ(ϵU)+]=ϵ,\displaystyle\leq 2\mathbb{E}\left[\frac{1}{\epsilon}{(\epsilon-U)_{+}}\right]=\epsilon,

where the last inequality is because U2P1,P2U\leq_{\rm 2}P_{1},P_{2} and u(ϵu)+u\mapsto(\epsilon-u)_{+} is convex and decreasing. Therefore, P1+P2P_{1}+P_{2} is a p-variable. ∎

The following lemma is needed in the proof of Theorem 5.2.

Lemma A.1.

Let MM be an increasing Borel function on [0,)K[0,\infty)^{K}. Then M(P1,,PK)M(P_{1},\dots,P_{K}) is a p-variable for all p*-variables P1,,PKP_{1},\dots,P_{K} if and only if

inf{q1(M(αP1,,αPK)):P1,,PK2U}αfor all α(0,1),\inf\{q_{1}(M(\alpha P_{1},\dots,\alpha P_{K})):P_{1},\dots,P_{K}\geq_{2}U\}\geq\alpha~{}\mbox{for all $\alpha\in(0,1)$,} (18)

where UU is uniform on [0,1][0,1] and q1(X)q_{1}(X) is the essential supremum of a random variable XX.

Proof.

Let g(α)g(\alpha) be the critical value for testing with M(P1,,PK)M(P_{1},\dots,P_{K}), that is, the largest value such that

(M(P1,,PK)<g(α))α for all p*-variables P1,,PK.\mathbb{P}\left(M(P_{1},\dots,P_{K})<g(\alpha)\right)\leq\alpha\mbox{~{}~{}~{}for all p*-variables $P_{1},\dots,P_{K}$}.

Converting between the distribution function and the quantile function, this means

g(α)=inf{qα(M(P1,,PK)):P1,,PK2U},g(\alpha)=\inf\{q_{\alpha}(M(P_{1},\dots,P_{K})):P_{1},\dots,P_{K}\geq_{2}U\},

where q_α(X) is the left α-quantile of a random variable X. For an increasing function M, the infimum of its α-quantile can be converted to the essential supremum of random variables with conditional distributions on their lower α-tail; see the proof of [17, Theorem 3], which deals with the case of additive functions (see also the proof of [6, Proposition 1], where this technique is used). Note that for P ≥₂ U, its lower α-tail conditional distribution dominates αU. This argument leads to

g(α)=inf{q1(M(P1,,PK)):P1,,PK2αU}.g(\alpha)=\inf\{q_{1}(M(P_{1},\dots,P_{K})):P_{1},\dots,P_{K}\geq_{2}\alpha U\}.

Noting that P2αUP\geq_{2}\alpha U is equivalent to P/α2UP/\alpha\geq_{2}U, we obtain (18). ∎

Proof of Theorem 5.2.

First, suppose that w₁,…,w_K are non-negative constants adding up to 1. Let g(α) be the critical value for testing with P̃, that is, the largest value such that

(k=1KPkwkg(α))α for all p*-variables P1,,PK.\mathbb{P}\left(\prod_{k=1}^{K}P_{k}^{w_{k}}\leq g(\alpha)\right)\leq\alpha\mbox{~{}~{}~{}for all p*-variables $P_{1},\dots,P_{K}$}.

We will show that g(α) ≥ α/e. Using Lemma A.1, we get

g(α)=αinf{q1(k=1KPkwk):P1,,PK2U}.g(\alpha)=\alpha\inf\left\{q_{1}\left(\prod_{k=1}^{K}P_{k}^{w_{k}}\right):P_{1},\dots,P_{K}\geq_{2}U\right\}.

For any p*-variables P1,,PKP_{1},\dots,P_{K}, since log()\log(\cdot) is an increasing concave function, we have

q1(k=1Kwklog(Pk))\displaystyle q_{1}\left(\sum_{k=1}^{K}w_{k}\log(P_{k})\right) 𝔼[k=1Kwklog(Pk)]k=1Kwk𝔼[log(U)]=1.\displaystyle\geq\mathbb{E}\left[\sum_{k=1}^{K}w_{k}\log(P_{k})\right]\geq\sum_{k=1}^{K}w_{k}\mathbb{E}\left[\log(U)\right]=-1.

Therefore, log(g(α)/α)1\log({g(\alpha)}/{\alpha})\geq-1, leading to the desired bound g(α)α/eg(\alpha)\geq\alpha/\mathrm{e}, and thus

(k=1KPkwkα/e)(k=1KPkwkg(α))α.\mathbb{P}\left(\prod_{k=1}^{K}P_{k}^{w_{k}}\leq\alpha/\mathrm{e}\right)\leq\mathbb{P}\left(\prod_{k=1}^{K}P_{k}^{w_{k}}\leq g(\alpha)\right)\leq\alpha.

For random w₁,…,w_K, taking an expectation leads to (6). ∎

Proof of Proposition 5.3.

The validity of MKM_{K} as a p*-merging function is implied by Theorem 4.1. To show its admissibility, suppose that there exists a p*-merging function MM that strictly dominates MKM_{K}. Let P1,,PKP_{1},\dots,P_{K} be iid uniform random variables on [0,1][0,1]. The strict domination implies MMKM\leq M_{K} and (M(P1,,PK)<MK(P1,,PK))>0\mathbb{P}(M(P_{1},\dots,P_{K})<M_{K}(P_{1},\dots,P_{K}))>0. We have

𝔼[M(P1,,PK)]<𝔼[MK(P1,,PK)]=12.\mathbb{E}[M(P_{1},\dots,P_{K})]<\mathbb{E}[M_{K}(P_{1},\dots,P_{K})]=\frac{1}{2}.

This means that M(P1,,PK)M(P_{1},\dots,P_{K}) is not a p*-variable, a contradiction. ∎

Proof of Proposition 5.4.

Let P₁,…,P_K be p*-variables, and let P be a random variable whose distribution is the equally weighted mixture of those of P₁,…,P_K. Note that P is a p*-variable by Proposition 3.3. Let P_(1) = ⋀_{k=1}^K P_k. Using the Bonferroni inequality, we have, for any ϵ ∈ (0,1),

(P(1)ϵ)k=1K(Pkϵ)=K(Pϵ).\mathbb{P}(P_{(1)}\leq\epsilon)\leq\sum_{k=1}^{K}\mathbb{P}(P_{k}\leq\epsilon)=K\mathbb{P}(P\leq\epsilon). (19)

Let G1G_{1} be the left-quantile function of P(1)P_{(1)} and G2G_{2} be that of PP. By (19), we have G1(Kt)G2(t)G_{1}(Kt)\geq G_{2}(t) for all t(0,1/K)t\in(0,1/K). Hence, for each y(0,1/K)y\in(0,1/K), using the equivalent condition (1), we have

0yKG1(t)dt0yKG2(t/K)dt=K20y/KG2(t)dtK2y22K2=y22.\int_{0}^{y}KG_{1}(t)\,\mathrm{d}t\geq\int_{0}^{y}KG_{2}(t/K)\,\mathrm{d}t=K^{2}\int_{0}^{y/K}G_{2}(t)\,\mathrm{d}t\geq K^{2}\frac{y^{2}}{2K^{2}}=\frac{y^{2}}{2}.

This implies, via the equivalent condition (1) again, that KP(1)KP_{(1)} is a p*-variable.

Next we show the admissibility of MBM_{B} for K2K\geq 2, since the case K=1K=1 is trivial. Suppose that there is a p*-merging function MM which strictly dominates MBM_{B}. Since MM is increasing, there exists p(0,1/K]p\in(0,1/K] such that q:=M(p,,p)<MB(p,,p)=Kpq:=M(p,\dots,p)<M_{B}(p,\dots,p)=Kp. First, assume 2Kp12Kp\leq 1. Define identically distributed random variables P1,,PKP_{1},\dots,P_{K} by

Pk=p𝟙Ak+𝟙Akc,k=1,,K,P_{k}=p\mathds{1}_{A_{k}}+\mathds{1}_{A_{k}^{c}},~{}~{}~{}k=1,\dots,K,

where A1,,AKA_{1},\dots,A_{K} are disjoint events with (Ak)=2p\mathbb{P}(A_{k})=2p for each kk. It is easy to check that P1,,PKP_{1},\dots,P_{K} are p*-variables, and

(M(P1,,PK)=q)=(k=1KAk)=k=1K(Ak)=2Kp.\mathbb{P}(M(P_{1},\dots,P_{K})=q)=\mathbb{P}\left(\bigcup_{k=1}^{K}A_{k}\right)=\sum_{k=1}^{K}\mathbb{P}(A_{k})=2Kp.

Thus, M(P1,,PK)M(P_{1},\dots,P_{K}) takes the value q<Kpq<Kp with probability 2Kp2Kp, and it takes the value 11 otherwise. Let GG be the left-quantile function of M(P1,,PK)M(P_{1},\dots,P_{K}). The above calculation leads to

02KpG(t)dt=2qKp<(2Kp)22,\int_{0}^{2Kp}G(t)\,\mathrm{d}t=2qKp<\frac{(2Kp)^{2}}{2},

showing that M(P1,,PK)M(P_{1},\dots,P_{K}) is not a p*-variable by (1), a contradiction.

Next, assume 2Kp>12Kp>1. In this case, let r=p1/(2K)r=p-1/(2K), and define identically distributed random variables P1,,PKP_{1},\dots,P_{K} by

Pk=r𝟙Bk+p𝟙Ak+𝟙(AkBk)c,k=1,,K,P_{k}=r\mathds{1}_{B_{k}}+p\mathds{1}_{A_{k}}+\mathds{1}_{(A_{k}\cup B_{k})^{c}},~{}~{}~{}k=1,\dots,K,

where A1,,AK,B1,,BKA_{1},\dots,A_{K},B_{1},\dots,B_{K} are disjoint events with (Ak)=1/K2r\mathbb{P}(A_{k})=1/K-2r and (Bk)=2r\mathbb{P}(B_{k})=2r, k=1,,Kk=1,\dots,K. Note that the union of A1,,AK,B1,,BKA_{1},\dots,A_{K},B_{1},\dots,B_{K} has probability 11. It is easy to verify that P1,,PKP_{1},\dots,P_{K} are p*-variables. Moreover, we have q:=M(r,,r)Krq^{\prime}:=M(r,\dots,r)\leq Kr since MM dominates MBM_{B}. Hence, M(P1,,PK)M(P_{1},\dots,P_{K}) takes the value qqq^{\prime}\leq q with probability 2Kr2Kr, and it takes the value qq otherwise. Let GG be the left-quantile function of M(P1,,PK)M(P_{1},\dots,P_{K}). Using qKrq^{\prime}\leq Kr and q<Kp=Kr+1/2q<Kp=Kr+1/2, we obtain

01G(t)dt\displaystyle\int_{0}^{1}G(t)\,\mathrm{d}t =02Krqdt+2Kr1qdt\displaystyle=\int_{0}^{2Kr}q^{\prime}\,\mathrm{d}t+\int_{2Kr}^{1}q\,\mathrm{d}t
2(Kr)2+q(12Kr)\displaystyle\leq 2(Kr)^{2}+q(1-2Kr)
<2(Kr)2+(Kr+12)(12Kr)=12,\displaystyle<2(Kr)^{2}+\left(Kr+\frac{1}{2}\right)(1-2Kr)=\frac{1}{2},

showing that M(P1,,PK)M(P_{1},\dots,P_{K}) is not a p*-variable by (1), a contradiction. As MM cannot strictly dominate MBM_{B}, we know that MBM_{B} is admissible. ∎

A.4 Proofs of results in Section 6

Proof of Theorem 6.1.

Let UU be a uniform random variable on [0,1][0,1].

  1. (i)

    The validity of the calibrator u(2u)1u\mapsto(2u)\wedge 1 is implied by (i), and below we show that it dominates all others. For any function gg on [0,)[0,\infty), suppose that g(u)<2ug(u)<2u for some u(0,1/2]u\in(0,1/2]. Consider the random variable VV defined by V=U1{U>2u}+u1{U2u}V=U1_{\{U>2u\}}+u1_{\{U\leq 2u\}}. Clearly, VV is a p*-variable. Note that

    (g(V)g(u))(U2u)=2u>g(u),\mathbb{P}(g(V)\leq g(u))\geq\mathbb{P}(U\leq 2u)=2u>g(u),

    implying that gg is not a p*-to-p calibrator. Hence, any p*-to-p calibrator gg satisfies g(u)2ug(u)\geq 2u for all u(0,1/2]u\in(0,1/2], thus showing that u(2u)1u\mapsto(2u)\wedge 1 dominates all p*-to-p calibrators.

  2. (ii)

By (1), we know that f(U) ≥₂ U, and thus f is a valid p-to-p* calibrator. To show its admissibility, it suffices to notice that f is a left-continuous (lower semi-continuous) function on [0,1], and that if g ≤ f and g ≠ f, then ∫₀¹ g(u) du < 1/2, implying that g(U) cannot be a p*-variable. ∎

Proof of Theorem 6.2.
  1. (i)

    Let ff be a convex p-to-e calibrator. Note that f-f is increasing and concave. For any [0,1][0,1]-valued p*-variable PP, by definition, we have 𝔼[f(P)]𝔼[f(U)].\mathbb{E}[-f(P)]\geq\mathbb{E}[-f(U)]. Hence,

    𝔼[f(P)]=𝔼[f(P)]𝔼[f(U)]=𝔼[f(U)]1.\mathbb{E}[f(P)]=-\mathbb{E}[-f(P)]\leq-\mathbb{E}[-f(U)]=\mathbb{E}[f(U)]\leq 1.

    Since a [0,)[0,\infty)-valued p*-variable is first-order stochastically larger than some [0,1][0,1]-valued p*-variable (e.g., [29, Theorem 4.A.6]), we know that 𝔼[f(P)]1\mathbb{E}[f(P)]\leq 1 for all p*-variables PP. Thus, ff is a p*-to-e calibrator.

    Next, we show the statement on admissibility. A convex admissible p-to-e calibrator ff is a p*-to-e calibrator. Since the class of p-to-e calibrators is larger than the class of p*-to-e calibrators, ff is not strictly dominated by any p*-to-e calibrator.

  2. (ii)

    We only need to show the “only if” direction, since the “if” direction is implied by (i). Suppose that a non-convex function ff is an admissible p-to-e calibrator. Since ff is not convex, there exist two points t,s[0,1]t,s\in[0,1] such that

    f(t)+f(s)<2f(t+s2).f(t)+f(s)<2f\left(\frac{t+s}{2}\right).

    Left-continuity of ff implies that there exists ϵ(0,|ts|)\epsilon\in(0,|t-s|) such that

    f(tϵ)+f(sϵ)<2f(t+s2).f(t-\epsilon)+f(s-\epsilon)<2f\left(\frac{t+s}{2}\right).

    Note that

    tϵt(f(t+s2)f(u))duϵ(f(t+s2)f(tϵ)),\int_{t-\epsilon}^{t}\left(f\left(\frac{t+s}{2}\right)-f(u)\right)\,\mathrm{d}u\geq\epsilon\left(f\left(\frac{t+s}{2}\right)-f(t-\epsilon)\right),

and the inequality also holds if the positions of s and t are flipped. Hence, by letting A = [t−ϵ, t] ∪ [s−ϵ, s], we have

    A(f(t+s2)f(u))du\displaystyle\int_{A}\left(f\left(\frac{t+s}{2}\right)-f(u)\right)\,\mathrm{d}u
    ϵ(2f(t+s2)f(tϵ)f(sϵ))>0.\displaystyle\geq\epsilon\left(2f\left(\frac{t+s}{2}\right)-f(t-\epsilon)-f(s-\epsilon)\right)>0. (20)

    Let UU be a uniform random variable on [0,1][0,1] and PP be given by

    P=U𝟙{UA}+t+s2𝟙{UA}.P=U\mathds{1}_{\{U\not\in A\}}+\frac{t+s}{2}\mathds{1}_{\{U\in A\}}.

    For any increasing concave function gg and x[tϵ,t]x\in[t-\epsilon,t] and y[sϵ,s]y\in[s-\epsilon,s], we have

    2g(t+s2)g(t)+g(s)g(x)+g(y).2g\left(\frac{t+s}{2}\right)\geq g(t)+g(s)\geq g(x)+g(y).

    Therefore, 𝔼[g(P)]𝔼[g(U)]\mathbb{E}[g(P)]\geq\mathbb{E}[g(U)], and hence U2PU\leq_{2}P. Thus, PP is a p*-variable. Moreover, using (20), we have

    𝔼[f(P)]=01f(u)du+A(f(t+s2)f(u))du>01f(u)du=1.\mathbb{E}[f(P)]=\int_{0}^{1}f(u)\,\mathrm{d}u+\int_{A}\left(f\left(\frac{t+s}{2}\right)-f(u)\right)\,\mathrm{d}u>\int_{0}^{1}f(u)\,\mathrm{d}u=1.

    Hence, ff is not a p*-to-e calibrator. Thus, ff has to be convex if it is both an admissible p-to-e calibrator and a p*-to-e calibrator.

  3. (iii)

    First, we show that f:e(2e)11f:e\mapsto(2e)^{-1}\wedge 1 is an e-to-p* calibrator. Clearly, it suffices to show that 1/(2E)1/(2E) is a p*-variable for any e-variable EE with mean 11, since any e-variable with mean less than 11 is dominated by an e-variable with mean 11. Let δx\delta_{x} be the point-mass at xx.

    Assume that EE has a two-point distribution (including the point-mass δ1\delta_{1} as a special case). With 𝔼[E]=1\mathbb{E}[E]=1, the distribution FEF_{E} of EE can be characterized with two parameters p(0,1)p\in(0,1) and a(0,1/p]a\in(0,1/p] via

    FE=pδ1+(1p)a+(1p)δ1pa.F_{E}=p\delta_{1+(1-p)a}+(1-p)\delta_{1-pa}.

    The distribution FPF_{P} of P:=1/(2E)P:=1/(2E) (we allow PP to take the value \infty in case a=1/pa=1/p) is given by

    FP=pδ1/(2+2(1p)a)+(1p)δ1/(22pa).F_{P}=p\delta_{1/(2+2(1-p)a)}+(1-p)\delta_{1/(2-2pa)}.

    Let GPG_{P} be the left-quantile function of PP on (0,1)(0,1). We have

G_{P}(t)=\frac{1}{2+2(1-p)a}\mathds{1}_{\{t\in(0,p]\}}+\frac{1}{2-2pa}\mathds{1}_{\{t\in(p,1)\}}.

    Define two functions gg and hh on [0,1][0,1] by g(v):=0vGP(u)dug(v):=\int_{0}^{v}G_{P}(u)\,\mathrm{d}u and h(v):=v2/2h(v):=v^{2}/2. For v(0,p]v\in(0,p], we have, using a1/pa\leq 1/p,

    g(v)\displaystyle g(v) =0vGP(u)du\displaystyle=\int_{0}^{v}G_{P}(u)\,\mathrm{d}u
    =v2+2(1p)av2+2(1p)/p=vp2v22=h(v).\displaystyle=\frac{v}{2+2(1-p)a}\geq\frac{v}{2+2(1-p)/p}=\frac{vp}{2}\geq\frac{v^{2}}{2}=h(v).

    Moreover, Jensen’s inequality gives

    g(1)=01GP(u)du=𝔼[P]=𝔼[12E]12𝔼[E]=12=h(1).g(1)=\int_{0}^{1}G_{P}(u)\,\mathrm{d}u=\mathbb{E}[P]=\mathbb{E}\left[\frac{1}{2E}\right]\geq\frac{1}{2\mathbb{E}[E]}=\frac{1}{2}=h(1).

Since g is linear on [p,1] and h is convex, g(p) ≥ h(p) and g(1) ≥ h(1) imply g(v) ≥ h(v) for all v ∈ [p,1]. Therefore, we conclude that g ≥ h on [0,1], namely

\int_{0}^{v}G_{P}(u)\,\mathrm{d}u\geq\frac{v^{2}}{2}~~\mbox{for all $v\in[0,1]$}.

    Using (1), we have that PP is a p*-variable.

    For a general e-variable EE with mean 11, its distribution can be rewritten as a mixture of two-point distributions with mean 11 (see e.g., the construction in Lemma 2.1 of [36]). Since the set of p*-variables is closed under distribution mixtures (Proposition 3.3), we know that f(E)f(E) is a p*-variable. Hence, ff is an e-to-p* calibrator.

    To show that ff essentially dominates all other e-to-p* calibrators, we take any e-to-p* calibrator ff^{\prime}. Using Theorem 6.1, the function e(2f(e))1e\mapsto(2f^{\prime}(e))\wedge 1 is an e-to-p calibrator. Using Proposition 2.2 of [35], any e-to-p calibrator is dominated by ee11e\mapsto e^{-1}\wedge 1, and hence

    (2f(e))1e11for e[0,),(2f^{\prime}(e))\wedge 1\geq e^{-1}\wedge 1~{}~{}~{}~{}\mbox{for $e\in[0,\infty)$},

which in turn gives f′(e) ≥ (2e)^{−1} for e ≥ 1. Since f′ is decreasing, we know that f′(e) < 1/2 implies e > 1. For any e ≥ 0 with f′(e) < 1/2, we have f′(e) ≥ f(e), and thus f essentially dominates f′. ∎
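
As a small numerical sanity check of part (iii) (not part of the proof), the sketch below verifies the quantile-integral condition (1), ∫₀^v G_P(u) du ≥ v²/2, for P = 1/(2E) when E has a two-point distribution with mean 1, using the parametrization above; the particular values of p and a are arbitrary choices.

```python
import numpy as np

# two-point e-variable with mean 1: E = 1 + (1-p)a w.p. p, and E = 1 - pa w.p. 1-p
p, a = 0.3, 2.5                                   # any p in (0,1) and a in (0, 1/p]
lo = 1 / (2 + 2 * (1 - p) * a)                    # value of P = 1/(2E) taken with probability p
hi = 1 / (2 - 2 * p * a)                          # value of P taken with probability 1-p

v = np.linspace(1e-6, 1.0, 1000)
# integral of the left-quantile function G_P of P from 0 to v
G_integral = np.where(v <= p, v * lo, p * lo + (v - p) * hi)
assert np.all(G_integral >= v ** 2 / 2 - 1e-12)   # condition (1) for a p*-variable
print("condition (1) holds on a grid of v")
```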

Proof of Proposition 6.4.

The first statement is implied by Theorem 6.2 (iii). For the second statement, we note that 1/2 is a p*-variable. For α ∈ (0,1), applying (21) with P = 1/2 and V = αE, we obtain a p*-test with a random threshold with mean at most α. Using Proposition B.5, this test has size at most α, that is, ℙ(2E ≥ 1/α) ≤ α. Hence, P is a p-variable. ∎

Appendix B Randomized p*-test and applications

In this appendix we discuss several applications of testing with p*-values and randomization.

B.1 Randomized p*-test

We first introduce a generalized version of the randomized p*-test in Proposition 4.5. The following density condition (DP) for a (0,1)-valued random variable V will be useful.

  1. (DP)

    VV has a decreasing density function on (0,1)(0,1).

The canonical choice of VV satisfying (DP) is a uniform random variable on [0,2α][0,2\alpha] for α(0,1/2]\alpha\in(0,1/2], which we will explain later. For a (0,1)(0,1)-valued random variable VV with mean α\alpha and a p*-variable PP independent of VV, we consider the test

rejecting the null hypothesisPV.\displaystyle\mbox{rejecting the null hypothesis}~{}\Longleftrightarrow~{}P\leq V. (21)

The following proposition justifies the validity of the test (21) and shows that condition (DP) is both necessary and sufficient.

Proposition B.1.

Suppose that VV is a (0,1)(0,1)-valued random variable with mean α\alpha.

  1. (i)

    The test (21) has size at most α\alpha for all p*-variables PP independent of VV if and only if VV satisfies (DP).

  2. (ii)

    The test (21) has size at most α\alpha for all p-variables PP independent of VV, and the size is precisely α\alpha if PP is uniformly distributed on [0,1][0,1].

The proof of Proposition B.1 relies on the following lemma.

Lemma B.2.

For any non-negative random variable VV with a decreasing density function on (0,)(0,\infty) (with possibly a probability mass at 0) and any p*-variable PP independent of VV, we have (PV)𝔼[V]\mathbb{P}(P\leq V)\leq\mathbb{E}[V].

Proof.

Let FVF_{V} be the distribution function of VV, which is an increasing concave function on [0,)[0,\infty) because of the decreasing density. Since PP is a p*-variable, we have 𝔼[FV(P)]01FV(u)du\mathbb{E}[F_{V}(P)]\geq\int_{0}^{1}F_{V}(u)\,\mathrm{d}u. Therefore,

(PV)\displaystyle\mathbb{P}(P\leq V) =𝔼[(PV|P)]=𝔼[1FV(P)]01(1FV(u))du=𝔼[V].\displaystyle=\mathbb{E}[\mathbb{P}(P\leq V|P)]=\mathbb{E}[1-F_{V}(P)]\leq\int_{0}^{1}(1-F_{V}(u))\,\mathrm{d}u=\mathbb{E}[V].

Hence, the statement in the lemma holds. ∎

Proof of Proposition B.1.

We first note that (ii) is straightforward: for a uniform random variable U on [0,1] independent of V, we have ℙ(U ≤ V) = 𝔼[V]. If P ≥₁ U, then ℙ(P ≤ V) ≤ ℙ(U ≤ V) ≤ 𝔼[V].

The “if” statement of point (i) directly follows from Lemma B.2, noting that condition (DP) is stronger than the condition in Lemma B.2. Below, we show the “only if” statement of point (i).

Let F_V be the distribution function of V and U be a uniform random variable on [0,1]. Suppose that F_V is not concave on (0,1). It follows that there exist x, y ∈ (0,1) such that F_V(x) + F_V(y) > 2F_V((x+y)/2). By the right-continuity of F_V, there exists ϵ ∈ (0, |x−y|) such that

FV(x)+FV(y)>2FV(x+y+ϵ2).\displaystyle F_{V}(x)+F_{V}(y)>2F_{V}\left(\frac{x+y+\epsilon}{2}\right). (22)

Let A = [x, x+ϵ] and B = [y, y+ϵ], which are disjoint intervals. Define a random variable P by

P=U𝟙{UAB}+x+y+ϵ2𝟙{UAB}.\displaystyle P=U\mathds{1}_{\{U\not\in A\cup B\}}+\frac{x+y+\epsilon}{2}\mathds{1}_{\{U\in A\cup B\}}. (23)

We check that PP defined by (23) is a p*-variable. For any concave function gg, Jensen’s inequality gives

𝔼[g(P)]\displaystyle\mathbb{E}[g(P)] =[0,1](AB)g(u)du+2ϵg(x+y+ϵ2)\displaystyle=\int_{[0,1]\setminus(A\cup B)}g(u)\,\mathrm{d}u+2\epsilon g\left(\frac{x+y+\epsilon}{2}\right)
[0,1](AB)g(u)du+ABg(u)du=𝔼[g(U)].\displaystyle\geq\int_{[0,1]\setminus(A\cup B)}g(u)\,\mathrm{d}u+\int_{A\cup B}g(u)\,\mathrm{d}u=\mathbb{E}[g(U)].

Hence, PP is a p*-variable. It follows from (22) and (23) that

𝔼[FV(P)]\displaystyle\mathbb{E}\left[F_{V}\left(P\right)\right] =[0,1](AB)FV(u)du+ABFV(x+y+ϵ2)du\displaystyle=\int_{[0,1]\setminus(A\cup B)}F_{V}(u)\,\mathrm{d}u+\int_{A\cup B}F_{V}\left(\frac{x+y+\epsilon}{2}\right)\,\mathrm{d}u
<[0,1](AB)FV(u)du+ABFV(x)+FV(y)2du\displaystyle<\int_{[0,1]\setminus(A\cup B)}F_{V}(u)\,\mathrm{d}u+\int_{A\cup B}\frac{F_{V}(x)+F_{V}(y)}{2}\,\mathrm{d}u
=[0,1](AB)FV(u)du+ϵ(FV(x)+FV(y))\displaystyle=\int_{[0,1]\setminus(A\cup B)}F_{V}(u)\,\mathrm{d}u+\epsilon(F_{V}(x)+F_{V}(y))
=[0,1](AB)FV(u)du+AFV(x)du+BFV(y)du\displaystyle=\int_{[0,1]\setminus(A\cup B)}F_{V}(u)\,\mathrm{d}u+\int_{A}F_{V}(x)\,\mathrm{d}u+\int_{B}F_{V}(y)\,\mathrm{d}u
01FV(u)du=1𝔼[V]=1α.\displaystyle\leq\int_{0}^{1}F_{V}(u)\,\mathrm{d}u=1-\mathbb{E}[V]=1-\alpha.

Therefore,

(PV)\displaystyle\mathbb{P}\left(P\leq V\right) =1𝔼[FV(P)]>1(1α)=α.\displaystyle=1-\mathbb{E}\left[F_{V}\left(P\right)\right]>1-(1-\alpha)=\alpha.

Since this contradicts the validity requirement, we know that V has to have a concave distribution function, and hence a decreasing density on (0,1). ∎

Lemma B.2 gives ℙ(P ≤ V) ≤ 𝔼[V] for V possibly taking values larger than 1 and possibly having a probability mass at 0. We are not interested in designing a random threshold that takes the value 0 or exceeds 1 with positive probability, but this result will become helpful in Section 7. Since condition (DP) implies 𝔼[V] ≤ 1/2, we will assume α ∈ (0,1/2], which is certainly harmless in practice.

With the help of Proposition B.1, we formally define α\alpha-random thresholds and the randomized p*-test.

Definition B.3.

For a significance level α(0,1/2]\alpha\in(0,1/2], an α\alpha-random threshold VV is a (0,1)(0,1)-valued random variable independent of the test statistics (a p*-variable PP in this section) with mean α\alpha satisfying (DP). For an α\alpha-random threshold VV and a p*-variable PP, the randomized p*-test is given by (21), i.e., rejecting the null hypothesis PV\Longleftrightarrow P\leq V.

Proposition B.1 implies that the randomized p*-test always has size at most α\alpha, just like the classic p-test (4). Since the randomized p*-test (21) has size equal to α\alpha if PP is uniformly distributed on [0,1][0,1], the size α\alpha of the randomized p*-test cannot be improved in general.

As mentioned in Section 4.4, randomization is generally undesirable in testing. As with any other randomized method, different scientists may arrive at different statistical conclusions by applying the randomized p*-test to the same data set generating the p*-value. Because of assumption (DP), which is necessary for validity by Proposition B.1, we cannot reduce the α-random threshold V to a deterministic α. This undesirable feature is the price one has to pay when a p-variable is weakened to a p*-variable.

If one needs to test with a deterministic threshold, then α/2\alpha/2 needs to be used instead of α\alpha. In other words, the test

rejecting the null hypothesisPα/2\displaystyle\mbox{rejecting the null hypothesis}~{}\Longleftrightarrow~{}P\leq\alpha/2 (24)

has size α\alpha for all p*-variable PP. The validity of (24) was noted by Rüschendorf [24], and it is a direct consequence of Proposition 5.1. Unlike the random threshold U[0,2α]\mathrm{U}[0,2\alpha] which gives a size precisely α\alpha in realistic situations, the deterministic threshold α/2\alpha/2 is often overly conservative in practice (see discussions in [20, Section 5]), but it cannot be improved in general when testing with the average of p-variables ([34, Proposition 3]); recall that the average of p-variables is a p*-variable.

We will see an application of the randomized p*-test in Section B.2, leading to new tests on the weighted average of p-values, which can be made deterministic if one of the p-values is independent of the others. Moreover, the randomized p*-test can be used to improve the power of tests with e-values and martingales in Section 7.

As mentioned above, the extra randomness introduced by the random threshold VV is often considered undesirable. One may wish to choose VV such that the randomness is minimized. The next result shows that VU[0,2α]V\sim\mathrm{U}[0,2\alpha] is the optimal choice if the randomness is measured by variance or convex order.

Proposition B.4.

For any α\alpha-random threshold VV, we have var(V)α2/3\mathrm{var}(V)\geq\alpha^{2}/3, and this smallest variance is attained by VU[0,2α]V^{*}\sim\mathrm{U}[0,2\alpha]. Moreover, 𝔼[f(V)]𝔼[f(V)]\mathbb{E}[f(V)]\geq\mathbb{E}[f(V^{*})] for any convex function ff (hence V2VV\leq_{2}V^{*} holds).

Proof.

We directly show 𝔼[f(V)]𝔼[f(V)]\mathbb{E}[f(V)]\geq\mathbb{E}[f(V^{*})] for all convex functions ff, which implies the statement on variance as a special case since the mean of VV is fixed as α\alpha. Note that VV has a concave distribution function FVF_{V} on [0,1][0,1], and VV^{*} has a linear distribution function FVF_{V^{*}} on [0,2α][0,2\alpha]. Moreover, they have the same mean. Hence, there exists t[0,2α]t\in[0,2\alpha] such that FV(x)FV(x)F_{V}(x)\geq F_{V^{*}}(x) for xtx\leq t and FV(x)FV(x)F_{V}(x)\leq F_{V^{*}}(x) for xtx\geq t. This condition is sufficient for 𝔼[f(V)]𝔼[f(V)]\mathbb{E}[f(V)]\geq\mathbb{E}[f(V^{*})] by Theorem 3.A.44 of [29]. ∎

Combining Propositions B.1 and B.4, the canonical choice of the threshold in the randomized p*-test has a uniform distribution on [0,2α][0,2\alpha].

We note that it is also possible to use some VV with mean less than α\alpha and variance less than α2/3\alpha^{2}/3. This reflects a tradeoff between power and variance. Such a random threshold does not necessarily have a decreasing density. For instance, the point-mass at α/2\alpha/2 is a valid choice; the next proposition gives some other choices.

Proposition B.5.

Let V be an α-random threshold and let V′ be a random variable satisfying V′ ≤₁ V. Then ℙ(P ≤ V′) ≤ α for any p*-variable P independent of V′.

Proof.

Let F be the distribution function of P. For any increasing function f, we have 𝔼[f(V′)] ≤ 𝔼[f(V)], which follows from V′ ≤₁ V. Hence, we have

(PV)=𝔼[F(V)]𝔼[F(V)]=(PV)α,\displaystyle\mathbb{P}(P\leq V^{\prime})=\mathbb{E}[F(V^{\prime})]\leq\mathbb{E}[F(V)]=\mathbb{P}(P\leq V)\leq\alpha,

where the last inequality follows from Proposition B.1. ∎

Proposition B.5 can be applied to a special situation where a p-variable PP and an independent p*-variable PP^{*} are available for the same null hypothesis. Note that in this case 2α(1P)1VU[0,2α]2\alpha(1-P)\leq_{1}V\sim\mathrm{U}[0,2\alpha]. Hence, by Proposition B.5, the test

rejecting the null hypothesisP2(1P)α.\displaystyle\mbox{rejecting the null hypothesis}~{}\Longleftrightarrow~{}\frac{P^{*}}{2(1-P)}\leq\alpha. (25)

has size at most α\alpha. Alternatively, using the fact that (P2α(1P))α\mathbb{P}(P\leq 2\alpha(1-P^{*}))\leq\alpha implied by Proposition B.1 (ii), we can design a test

rejecting the null hypothesisP2(1P)α.\displaystyle\mbox{rejecting the null hypothesis}~{}\Longleftrightarrow~{}\frac{P}{2(1-P^{*})}\leq\alpha. (26)

The tests (25) and (26) both have a deterministic threshold of α\alpha. This observation is useful in Section B.2.

B.2 Testing with averages of p-values

In this section we illustrate applications of the randomized p*-test to tests with averages of dependent p-values.

Let P1,,PKP_{1},\dots,P_{K} be KK p-variables for a global null hypothesis and they are generally not independent. Vovk and Wang [34] proposed testing using generalized means of the p-values, so that the type-I error is controlled at a level α\alpha under arbitrary dependence. We focus on the weighted (arithmetic) average P¯:=k=1KwkPk\bar{P}:=\sum_{k=1}^{K}w_{k}P_{k} for some weights w1,,wK0w_{1},\dots,w_{K}\geq 0 with k=1Kwk=1\sum_{k=1}^{K}w_{k}=1. In case w1==wK=1/Kw_{1}=\dots=w_{K}=1/K, we speak of the arithmetic average.

The method of [34] on arithmetic average is given by

rejecting the null hypothesisP¯α/2.\displaystyle\mbox{rejecting the null hypothesis}~{}\Longleftrightarrow~{}\bar{P}\leq\alpha/2. (27)

We will call (27) the arithmetic averaging test. The extra factor of 1/21/2 is needed to compensate for arbitrary dependence among p-values. Since P¯\bar{P} is a p*-variable by Theorem 4.1, the test (27) is a special case of (24). This method is quite conservative, and it often has relatively low power compared to the Bonferroni correction and other similar methods unless p-values are very highly correlated, as illustrated by the numerical experiments in [34].

To enhance the power of the test (27), we apply the randomized p*-test in Section 4.4 to design the randomized averaging test by

rejecting the null hypothesisP¯V,\displaystyle\mbox{rejecting the null hypothesis}~{}\Longleftrightarrow~{}\bar{P}\leq V, (28)

where V is an α-random threshold independent of (P₁,…,P_K). Comparing the fixed-threshold test (27) and the randomized averaging test (28) with V ∼ U[0,2α], the random threshold exceeds α/2 with probability 3/4, so the randomized averaging test has better power with probability 3/4, at the price of randomization.

Next, we consider a special situation, where a p-variable among P1,,PKP_{1},\dots,P_{K} is independent of the others under the null hypothesis. In this case, we can apply (25), and the resulting test is no longer randomized, as it is determined by the observed p-values.

Without loss of generality, assume that P1P_{1} is independent of (P2,,PK)(P_{2},\dots,P_{K}). Let P¯(1):=k=2KwkPk\bar{P}_{(-1)}:=\sum_{k=2}^{K}w_{k}P_{k} be a weighted average of (P2,,PK)(P_{2},\dots,P_{K}). Using P¯(1)\bar{P}_{(-1)} as the p*-variable, the test (25) becomes

rejecting the null hypothesisP¯(1)+2αP12α.\displaystyle\mbox{rejecting the null hypothesis}~{}\Longleftrightarrow~{}\bar{P}_{(-1)}+2\alpha P_{1}\leq 2\alpha. (29)

Following directly from the validity of (25), for any p-variables P1,,PKP_{1},\dots,P_{K} with P1P_{1} independent of (P2,,PK)(P_{2},\dots,P_{K}), the test (29) has size at most α\alpha.

Comparing (29) with (27), we rewrite (29) as

\displaystyle\mbox{rejecting the null hypothesis}~{}\Longleftrightarrow~{}\bar{P}:=\sum_{k=1}^{K}w^{\prime}_{k}P_{k}\leq\frac{2\alpha}{1+2\alpha},

where

w1=2α1+2αandwk=wk1+2α,k=2,,K.w^{\prime}_{1}=\frac{2\alpha}{1+2\alpha}~{}~{}\mbox{and}~{}~{}w^{\prime}_{k}=\frac{w_{k}}{1+2\alpha},~{}k=2,\dots,K.

Note that P̄ is a weighted average of (P₁,…,P_K). Since α is small, the rejection threshold 2α/(1+2α) is almost four times the threshold α/2 that the test (27) would use for the same weighted average P̄. For this reason, we will call (29) the enhanced averaging test.

In particular, if 2α(K1)=12\alpha(K-1)=1, and P¯(1)\bar{P}_{(-1)} is the arithmetic average of (P2,,PK)(P_{2},\dots,P_{K}), then P¯\bar{P} is the arithmetic average of (P1,,PK)(P_{1},\dots,P_{K}). For instance, if K=51K=51 and α=0.01\alpha=0.01, then the rejection condition for test (29) is P¯1/51\bar{P}\leq 1/51, and the rejection condition for (27) is P¯1/200\bar{P}\leq 1/200.
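
As a quick numerical check (a sketch under assumed parameters: K = 51, α = 0.01, equicorrelated Gaussian z-tests with ρ = 0.9 for P₂,…,P_K as in Section B.3, and equal weights w_k = 1/(K−1)), the following compares the null rejection rates of the enhanced averaging test (29) and the arithmetic averaging test (27).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
alpha, K, rho, N = 0.01, 51, 0.9, 200_000

# null simulation: P1 independent of (P2, ..., PK); the rest are equicorrelated z-tests
z_rest = np.sqrt(rho) * rng.standard_normal((N, 1)) + np.sqrt(1 - rho) * rng.standard_normal((N, K - 1))
P_rest = 1 - norm.cdf(z_rest)
P1 = rng.uniform(size=N)

Pbar_minus1 = P_rest.mean(axis=1)                              # equally weighted average of P2, ..., PK
size_29 = np.mean(Pbar_minus1 + 2 * alpha * P1 <= 2 * alpha)   # enhanced averaging test (29)
size_27 = np.mean((P1 + P_rest.sum(axis=1)) / K <= alpha / 2)  # arithmetic averaging test (27)
print("size of (29):", size_29, "  size of (27):", size_27)    # both should be at most alpha
```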

B.3 Simulation experiments

We compare by simulation the performance of a few tests via merging p-values. For the purpose of illustration, we conduct correlated z-tests for the mean of normal samples with variance 1. More precisely, the null hypothesis H₀ is N(0,1) and the alternative is N(δ,1) for some δ > 0. The p-variables P₁,…,P_K are specified as P_k = 1 − Φ(X_k) from the Neyman-Pearson lemma, where Φ is the standard normal distribution function, and X₁,…,X_K are generated from N(δ,1) with pairwise correlation ρ. As illustrated by the numerical studies in [34], the arithmetic averaging test performs poorly unless p-values are strongly correlated. Therefore, we consider the cases where p-values are highly correlated, e.g., parallel experiments with shared data or scientific objects. We set ρ = 0.9 in our simulation studies; this choice is harmless as we are interested in the relative performance of the averaging methods in this section, rather than their performance against other methods (such as the method of Simes [30]) that are known to work well for lightly correlated or independent p-values.

The significance level α\alpha is set to be 0.010.01. For a comparison, we consider the following tests:

  1. (a)

    the arithmetic averaging test (27): reject H0H_{0} if P¯α/2\bar{P}\leq\alpha/2;

  2. (b)

    the randomized averaging test (28): reject H0H_{0} if P¯V\bar{P}\leq V where VU[0,2α]V\sim\mathrm{U}[0,2\alpha] independent of P¯\bar{P};

  3. (c)

    the Bonferroni method: reject H0H_{0} if min(P1,,PK)α/K\min(P_{1},\dots,P_{K})\leq\alpha/K;

  4. (d)

    the Simes method: reject H0H_{0} if mink(KP(k)/k)α\min_{k}(KP_{(k)}/k)\leq\alpha where P(k)P_{(k)} is the kk-th smallest p-value;

  5. (e)

    the harmonic averaging test of [34]: reject H0H_{0} if (k=1KPk1)1α/cK(\sum_{k=1}^{K}P_{k}^{-1})^{-1}\leq\alpha/c_{K} where cK>1c_{K}>1 is a constant in [34, Proposition 6].

The validity (size no larger than α) of the Simes method is guaranteed under some dependence conditions on the p-values; see [26, 5]. Moreover, as shown recently by Vovk et al. [33, Theorem 6], the Simes method dominates any symmetric and deterministic p-merging method valid for arbitrary dependence (such as (a), (c) and (e); the Simes method itself is not valid for arbitrary dependence).

In the second setting, we assume that one of the p-variables (P₁ without loss of generality) is independent of the rest, and the remaining p-variables have pairwise correlation ρ = 0.9. For this setting, we further include

  1. (f)

    the enhanced averaging test (29): reject H0H_{0} if P¯(1)+2αP12α\bar{P}_{(-1)}+2\alpha P_{1}\leq 2\alpha.

The power (i.e., the probability of rejection) of each test is computed from the average of 10,000 replications for varying signal strength δ\delta and for K{20,100,500}K\in\{20,100,500\}. Results are reported in Figure 5.
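
For reference, a compact skeleton of this power study in the second setting is given below (a sketch with placeholder parameter values and equal weights assumed for test (f), not the exact code behind Figure 5); the harmonic averaging test (e) is omitted because the constant c_K is not reproduced in this appendix.

```python
import numpy as np
from scipy.stats import norm

def power(delta, K=100, rho=0.9, alpha=0.01, N=10_000, rng=None):
    # Monte Carlo power of tests (a)-(d) and (f); P1 is independent of the equicorrelated P2, ..., PK
    rng = rng or np.random.default_rng(3)
    z0 = rng.standard_normal((N, 1))
    z = delta + np.sqrt(rho) * z0 + np.sqrt(1 - rho) * rng.standard_normal((N, K - 1))
    z1 = delta + rng.standard_normal((N, 1))
    P = 1 - norm.cdf(np.hstack([z1, z]))              # column 0 is the independent P1
    Pbar, Psort = P.mean(axis=1), np.sort(P, axis=1)
    V = rng.uniform(0, 2 * alpha, size=N)
    return {
        "(a) arithmetic, alpha/2": np.mean(Pbar <= alpha / 2),
        "(b) randomized average": np.mean(Pbar <= V),
        "(c) Bonferroni": np.mean(P.min(axis=1) <= alpha / K),
        "(d) Simes": np.mean((K * Psort / np.arange(1, K + 1)).min(axis=1) <= alpha),
        "(f) enhanced average": np.mean(P[:, 1:].mean(axis=1) + 2 * alpha * P[:, 0] <= 2 * alpha),
    }

print(power(delta=0.3))
```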

In the first setting of correlated p-values, the randomized averaging test (b) improves the performance of (a) uniformly, at the price of randomization. The Bonferroni method (c) and the harmonic averaging test (e) perform poorly and are both penalized significantly as K increases. None of these methods visibly outperforms the Simes method (d), although in some situations the test (b) performs comparably to the Simes method.

In the second setting where an independent p-value exists, the enhanced averaging test (f) performs quite well; it outperforms the Simes method for most parameter values, especially for small signal strength δ. This illustrates the significant improvement gained by incorporating an independent p-value.

We remark that the averaging methods (a), (b) and (f) should not be used in situations in which correlation among p-values is known to be not very strong. This is because the arithmetic mean does not benefit from an increasing number of independent p-values of similar strength, unlike the methods of Bonferroni and Simes.


Figure 5: Tests based on combining p-values