
Conditional Randomization Rank Test

Yanjie Zhong, Todd Kuffner and Soumendra Lahiri
Washington University in St. Louis, St. Louis, USA
Abstract

We propose a new method named the Conditional Randomization Rank Test (CRRT) for testing conditional independence of a response variable Y and a covariate variable X, conditional on the rest of the covariates Z. The new method generalizes the Conditional Randomization Test (CRT) of [CFJL18] by exploiting knowledge of the conditional distribution of X|Z, and is a conditional sampling based method that is easy to implement and interpret. In addition to guaranteeing exact type 1 error control, owing to a more flexible framework, the new method markedly outperforms the CRT in computational efficiency. We establish bounds on the probability of type 1 error in terms of the total variation norm and also in terms of the observed Kullback–Leibler divergence when the conditional distribution of X|Z is misspecified. We validate our theoretical results by extensive simulations and show that our new method has considerable advantages over other existing conditional sampling based methods when both power and computational efficiency are taken into consideration.

1 Introduction

We study the problem of testing conditional independence in this paper. To be more specific, let X\in\mathbb{R}, Y\in\mathbb{R} and Z\in\mathbb{R}^{p} be random variables. We consider testing the null hypothesis

H_{0}: Y \perp X \,|\, Z, \qquad (1)

i.e., Y is independent of X conditional on the controlling variables Z. Conditional independence testing plays an important role in many statistical problems, such as causal inference ([Daw79], [Spi10], [Lec01], [KB07]) and Bayesian networks ([SV95], [CBL98], [Cam06]). For example, consider the case where X is a binary variable indicating whether the patient received a treatment, with X=1 when the patient received the treatment and X=0 otherwise. Let Y be an outcome associated with the treatment, such as the diabetes risk lagged 6 months. Let Z collect the patient's individual characteristics. Usually, it is of interest to see whether \mathbb{E}[Y|X=1,Z]=\mathbb{E}[Y|X=0,Z] or, more generally, whether the probability distribution of Y given X=1,Z is the same as that of Y given X=0,Z, so that we can tell whether the treatment is effective.

1.1 Our contributions

In this paper, we propose a versatile and computationally efficient method, called the Conditional Randomization Rank Test (CRRT), to test the null hypothesis (1). This method generalizes the Conditional Randomization Test (CRT) introduced in [CFJL18]. Like the CRT, the CRRT is built on the conditional version of the model-X assumption, which implies that the conditional distribution of X|Z is known. Such an assumption is realistic in modern applications. On one hand, we usually have a large amount of unlabeled data (data without Y) ([BCS18], [RSC19], [SSC19]), which, combined with domain knowledge, allows us to learn the conditional distribution of X|Z quite accurately. On the other hand, as implied in [SP18], domain knowledge is needed to construct non-trivial tests of conditional independence. Since it would be too complicated to model how Y depends on the covariates when the covariates are high-dimensional, marginal information on the covariates becomes valuable and can be utilized in conditional tests.

Compared with the CRT, the CRRT can be implemented in a more flexible way, which makes it computationally more efficient. It is known that the greatest drawback of the CRT is its restrictive computational burden ([LKJR20]). Many special forms of the CRT aim at solving this problem, including the distilled CRT (dCRT) of [LKJR20] and the Holdout Randomization Test (HRT) of [TVZ+18]. In addition to having favorable computational efficiency compared to these methods, the CRRT is consistently powerful across a wide range of settings, while the HRT suffers from a lack of power and the dCRT shows inferior power in complicated models. To highlight the advantages of the CRRT, we provide a simple interaction model example here. We consider testing hypothesis (1) when X and Z have a non-ignorable interaction in addition to separate linear effects on Y. Results are presented in Figure 1. We can see that the CRRT has power comparable to the CRT and the CPT while costing far less computational time; see Section 5 for more details. In addition, we also show, both theoretically and empirically, that the CRRT is robust to misspecification of the conditional distribution of X|Z.

Figure 1 (two panels: (a) Rejection Rate; (b) Time): (a) Proportions of rejection. The y-axis represents the proportion of times the null hypothesis is rejected; the x-axis represents the true coefficient of X. (b) Time in seconds per Monte-Carlo simulation.

1.2 Related Work

Conditional tests are ubiquitous in statistical inference. Indeed, many commonly seen statistical problems are essentially conditional independence tests. For example, suppose we assume a linear model for Y on the covariates (X, Z):

Y = \beta_{0}X + \beta^{T}Z + \varepsilon,

where \beta_{0}\in\mathbb{R} and \beta\in\mathbb{R}^{p} are fixed parameters, and \varepsilon\in\mathbb{R} is a random error. It is then of interest to test whether \beta_{0}=0, which is equivalent to testing (1). If the dimensionality of Z is relatively small compared with the sample size, we can easily construct a valid test based on the classical OLS estimator. In a high-dimensional setting, additional conditions on sparsity, on the design matrix (i.e., on the covariate model when we are considering a random design), or on the coefficients \beta are needed; see, for example, [CL+13], [JM14a], [JM14b], [VdGBR+14], [ZZ14].

To match the complexity of real applications, many researchers have recently focused more on assumptions on the covariates while allowing the response model as much flexibility as possible. Our method also belongs to this category. Other methods along this line include the CRT ([CFJL18]), the CPT ([BWBS19]), the HRT ([TVZ+18]) and the dCRT ([LKJR20]). These methods can usually control the type 1 error in practice because they rely entirely on the covariate model (which suffers less from model misspecification) when adequate unlabeled data is accessible. We review the CRT and the CPT for completeness in Section 4. These model-X methods are naturally related to semi-supervised learning methods, whose key idea is also to maximize the use of unlabeled data. There is some recent work developing reliable variable selection methods based solely on side information about the covariates; see [BC+15], [GZ18], [BC+19] and the references therein. We discuss connections between the CRRT and these methods in the supplementary material. After finishing our work, we became aware of a concurrent work ([LC21]) in which the CRRT was briefly mentioned. Compared with their work, we provide a more in-depth analysis of the CRRT beyond validity.

There are many other nonparametric methods for testing conditional independence, which we point out here but do not compare with our method in this paper because of fundamental differences in framework and assumptions. Roughly speaking, these nonparametric methods can be divided into 4 categories. The first category includes kernel-based methods, which require no distributional assumptions ([FGSS08], [ZPJS12], [SZV19]). The Kernel Conditional Independence Test can be viewed as an extension of the Kernel Independence Test ([FGSS08], [GFT+08]). Taking advantage of the characterization of conditional independence in reproducing kernel Hilbert spaces, one can in general reduce the original conditional independence test to a low-dimensional test. However, kernel-based methods depend on complex asymptotic distributions of the test statistics. Therefore, they have no finite-sample guarantee and show deteriorating performance as the dimension of Z increases. The second category constructs asymptotically valid tests by measuring the distance between the conditional distributions of Y|Z and Y|X,Z; see, for example, [SW07], [SW14] and [WPH+15]. This approach requires estimating certain pointwise conditional expectations by kernel methods like the Nadaraya-Watson kernel regression. Hence, the number of samples must be extremely large when the dimension of Z is high. The third category relies on discretization of the conditioning variables, i.e., Z. The key idea is that Y\perp X|Z holds if and only if Y\perp X|Z=z' holds for all possible z'. Therefore, if one can discretize the space of Z, one can test conditional independence on each part and then synthesize all test results together ([H+10], [TB10], [DMZS14], [SSS+17], [Run18]). However, it is well known that discretization of high-dimensional continuous variables is subject to the curse of dimensionality and is computationally demanding. Note that tests based on within-group permutation also fall into this category because they essentially rely on local independence. The fourth category includes methods testing some weaker null hypothesis ([LG97], [Ber04], [S+09]), where copula estimation can be helpful ([GVO11], [VOG11]).

1.3 Paper Outline

In Section 2, we describe our newly proposed CRRT in detail and show its effectiveness in controlling the finite-sample type 1 error. In Section 3, we demonstrate that the CRRT is robust to misspecification of the conditional distribution of X|Z. In Section 4, we analyze the connections among several existing methods based on conditional sampling, including the proposed CRRT. In Section 5, we study the finite-sample performance of the CRRT and show its advantages over other conditional sampling based competitors through extensive simulations. Finally, we conclude in Section 6 with a discussion.

1.4 Notation

The following notation will be used. For an n\times p matrix x, its ith row is denoted by x_{[i]} and its jth column is denoted by x_{j}. For any two random variables X and Y, Y\perp X means that they are independent. For two random variables X and Z with n i.i.d. copies \{(x_{i}, z_{[i]}): i=1,2,\ldots,n\}, if the density of X conditional on Z is Q(\cdot|Z), we denote the pdf of x=(x_{1},\ldots,x_{n})^{T} conditional on z=(z_{[1]},\ldots,z_{[n]})^{T} by Q_{n}(\cdot|z). When there is no ambiguity, we sometimes suppress the subscript and denote it by Q(\cdot|z).

2 Conditional Randomization Rank Test

Suppose that we have a response variable Y\in\mathbb{R} and predictors (X,Z)=(X,Z^{1},\ldots,Z^{p})\in\mathbb{R}^{p+1}. For now, we assume that the conditional distribution of X|Z is known and denote the conditional density by Q(\cdot|Z) throughout this section. Let y=(y_{1},\ldots,y_{n})^{T} be an n-dimensional vector of i.i.d. realizations of Y, and similarly, let (x,z)=(x,z_{1},\ldots,z_{p}) be an n\times(p+1) matrix with rows drawn i.i.d. from the distribution of (X,Z). For a given positive integer b (usually we can take b=100), and k=1,\ldots,b, let x^{(k)} be an n-dimensional vector with elements x^{(k)}_{i}\sim Q(\cdot|z_{[i]}) independently, i=1,\ldots,n, where z_{[i]} is the ith row of z. We denote \tilde{x}=(x,x^{(1)},\ldots,x^{(b)}).
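To make the sampling step concrete, the following minimal Python sketch generates the pseudo columns under a Gaussian working model Q(\cdot|z_{[i]}) = N(z_{[i]}^{T}\beta, \sigma^{2}); the names beta and sigma are illustrative assumptions, not part of the paper's notation.

import numpy as np

def sample_pseudo_variables(z, beta, sigma, b, rng=None):
    # Draw x^(k)_i ~ Q(.|z_[i]) = N(z_[i]^T beta, sigma^2) independently,
    # returning an (n, b) matrix whose k-th column is x^(k).
    rng = np.random.default_rng(rng)
    n = z.shape[0]
    mean = z @ beta                                   # conditional means
    return mean[:, None] + sigma * rng.standard_normal((n, b))

# The enlarged matrix is then x_tilde = (x, x^(1), ..., x^(b)):
# x_tilde = np.column_stack([x, sample_pseudo_variables(z, beta, sigma, b=100)])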

Let T(y,z,\tilde{x})=(T_{0},T_{1},\ldots,T_{b})^{T}: \mathbb{R}^{n}\times\mathbb{R}^{n\times p}\times\mathbb{R}^{n\times(b+1)}\rightarrow\mathbb{R}^{b+1} be any function mapping the enlarged dataset to a (b+1)-dimensional vector. Typically, T_{0} is a function measuring the importance of x to y and T_{k} is a function measuring the importance of x^{(k)} to y.

Let \Sigma_{b+1} be the collection of all permutations of (1,2,\ldots,b,b+1). For any \sigma\in\Sigma_{b+1}, \tilde{x}_{permute(\sigma)} is an n\times(b+1) matrix whose jth column is the \sigma_{j}th column of \tilde{x}, j=1,2,\ldots,b+1. If \nu is a (b+1)-dimensional vector, \nu_{permute(\sigma)} is a (b+1)-dimensional vector whose jth element is the \sigma_{j}th element of \nu, j=1,2,\ldots,b+1. We introduce a property of T, which is key to our analysis.

Definition 1.

Let T(y,z,\tilde{x})=(T_{0},T_{1},\ldots,T_{b})^{T}: \mathbb{R}^{n}\times\mathbb{R}^{n\times p}\times\mathbb{R}^{n\times(b+1)}\rightarrow\mathbb{R}^{b+1}. We say that T is X-symmetric if for any \sigma\in\Sigma_{b+1}, T(y,z,\tilde{x}_{permute(\sigma)}) = T(y,z,\tilde{x})_{permute(\sigma)}.

If X is indeed independent of Y given Z, we expect x, x^{(1)},\ldots,x^{(b)} to have similar influence on Y. Hence, if T is an X-symmetric function, T_{0}(y,z,\tilde{x}),\ldots,T_{b}(y,z,\tilde{x}) should also be stochastically similar.

Proposition 1.

Assume that the null hypothesis H_{0}: Y\perp X|Z is true. If T(y,z,\tilde{x})=(T_{0},T_{1},\ldots,T_{b})^{T}: \mathbb{R}^{n}\times\mathbb{R}^{n\times p}\times\mathbb{R}^{n\times(b+1)}\rightarrow\mathbb{R}^{b+1} is X-symmetric, then

T_{0}(y,z,\tilde{x}) \stackrel{d}{=} T_{1}(y,z,\tilde{x}) \stackrel{d}{=} \ldots \stackrel{d}{=} T_{b}(y,z,\tilde{x}).

Further, they are exchangeable. In particular, for any permutation \sigma\in\Sigma_{b+1},

(T_{0}(y,z,\tilde{x}),\ldots,T_{b}(y,z,\tilde{x}))_{permute(\sigma)} \stackrel{d}{=} (T_{0}(y,z,\tilde{x}),\ldots,T_{b}(y,z,\tilde{x})).

Based on the above exchangeability, a natural choice of p-value for our target test is given by

p_{CRRT} = \frac{1}{b+1}\sum_{k=0}^{b}\mathbbm{1}\{T_{0}(y,z,\tilde{x}) \leq T_{k}(y,z,\tilde{x})\}. \qquad (2)

With this p-value, we reject H_{0}: Y\perp X|Z at level \alpha if p_{CRRT}\leq\alpha. In other words, we have strong evidence to believe that Z cannot fully account for X's influence on Y when T_{0} is among the \lfloor\alpha(1+b)\rfloor largest of (T_{0},T_{1},\ldots,T_{b}). We judge whether to reject H_{0} based on the rank of T_{0}, which is why we name the test the Conditional Randomization Rank Test. The following proposition and theorem justify the use of the p-value defined in (2).
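Computationally, (2) is a one-line rank computation. Here is a minimal Python sketch; t is assumed to be the (b+1)-vector of statistics returned by an X-symmetric function T, with T_{0} (the statistic of the real x) first.

import numpy as np

def crrt_pvalue(t):
    # p_CRRT = (1/(b+1)) * sum_{k=0}^{b} 1{T_0 <= T_k}; the k = 0 term
    # is always 1, so the p-value is at least 1/(b+1).
    return np.mean(t[0] <= t)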

Proposition 2.

Assume that T(y,z,\tilde{x})=(T_{0},T_{1},\ldots,T_{b})^{T}: \mathbb{R}^{n}\times\mathbb{R}^{n\times p}\times\mathbb{R}^{n\times(b+1)}\rightarrow\mathbb{R}^{b+1} is X-symmetric. Denote the order statistics of (T_{0},T_{1},\ldots,T_{b}) by T_{(0)}\leq T_{(1)}\leq\ldots\leq T_{(b)}. When H_{0}: Y\perp X|Z holds, we have

\mathbb{P}(p_{CRRT}\leq\alpha \,|\, z, y, T_{(0)}, T_{(1)},\ldots,T_{(b)}) \leq \alpha,

for any pre-specified \alpha\in[0,1].

Theorem 1.

Under the null hypothesis H_{0}: Y\perp X|Z, if the test function T(y,z,\tilde{x})=(T_{0},T_{1},\ldots,T_{b})^{T}: \mathbb{R}^{n}\times\mathbb{R}^{n\times p}\times\mathbb{R}^{n\times(b+1)}\rightarrow\mathbb{R}^{b+1} is X-symmetric, then it follows directly from Proposition 2 that the p-value defined in (2) satisfies

\mathbb{P}(p_{CRRT}\leq\alpha) \leq \alpha

for any \alpha\in[0,1]. In fact, p_{CRRT} is also conditionally valid:

\mathbb{P}(p_{CRRT}\leq\alpha \,|\, y, z) \leq \alpha.

3 Robustness of the CRRT

According to Theorem 1, when the conditional distribution of X given Z is fully known, the CRRT can exactly control the type 1 error at any desired level. However, in real applications, this information may not be available. Fortunately, the following theorems guarantee that the CRRT is robust to moderate misspecification of the conditional distribution. We state two theorems explaining the robustness of the CRRT from two different angles.

Theorem 2.

(Type 1 Error Upper Bound 1.) Assume the true conditional distribution of X|Z is Q^{\star}(\cdot|Z). The pseudo-variable matrix (x^{(1)},\ldots,x^{(b)}) is generated in the same way as described in Section 2, except that the generation is based on a misspecified or roughly estimated conditional density Q(\cdot|Z) instead of Q^{\star}(\cdot|Z). The CRRT p-value p_{CRRT} is defined in (2). Under the null hypothesis H_{0}: Y\perp X|Z, for any given \alpha\in[0,1],

\mathbb{P}(p_{CRRT}\leq\alpha \,|\, z, y) \leq \alpha + d_{TV}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z)),
\mathbb{P}(p_{CRRT}\leq\alpha) \leq \alpha + \mathbb{E}[d_{TV}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z))],

where the total variation distance is defined as d_{TV}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z)) = \sup_{A\subset\mathbb{R}^{n}} |P_{Q}(x\in A|z) - P_{Q^{\star}}(x\in A|z)|.

By Pinsker's inequality, we have d_{TV}^{2}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z)) \leq \frac{1}{2} d_{KL}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z)) ([BWBS19]). Combining this with the results in Theorem 2, we see that controlling the Kullback-Leibler divergence is sufficient to guarantee a small deviation from the desired type 1 error level.

Example 1.

(Gaussian Model) We now consider the case where (X, Z) is jointly Gaussian. This implies that there exist \beta\in\mathbb{R}^{p} and \sigma^{2}>0 such that, conditional on z, x_{i}|z \sim N(z_{[i]}^{T}\beta, \sigma^{2}) independently.

Suppose that we know the true variance \sigma^{2} and can estimate \beta by \hat{\beta} from a large amount of unlabeled data of size N. For example, \hat{\beta} can be the OLS estimator when p is relatively small, or a regularization-based estimator when p is of a comparable order to N. Without the sparsity assumption, under appropriate conditions, we have \|\hat{\beta}-\beta\|_{2} = O_{p}\left(\sqrt{\frac{p\log p}{N}}\right) ([ZH+08], [BRT+09], [CCD+18]). We simply assume \|\hat{\beta}-\beta\|_{2} = O\left(\sqrt{\frac{p\log p}{N}}\right) for convenience. If we additionally assume that \mathbb{E}|z_{ij}|^{2} = O(1) uniformly for i=1,\ldots,n and j=1,\ldots,p, we have

\left[\mathbb{E}\left[d_{TV}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z))\right]\right]^{2} \leq \mathbb{E}\left[d_{TV}^{2}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z))\right] \leq \frac{1}{2}\mathbb{E}\left[d_{KL}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z))\right] = \frac{1}{2}\mathbb{E}\left[\sum_{i=1}^{n} d_{KL}(Q^{\star}(\cdot|z_{[i]}), Q(\cdot|z_{[i]}))\right] = \frac{1}{2}\mathbb{E}\left[\sum_{i=1}^{n}\frac{1}{2\sigma^{2}}(z_{[i]}^{T}\hat{\beta} - z_{[i]}^{T}\beta)^{2}\right] \leq \frac{n}{4\sigma^{2}}\mathbb{E}\left[\|\hat{\beta}-\beta\|_{2}^{2}\|z_{[1]}\|_{2}^{2}\right] = O\left(\frac{np^{2}\log p}{N}\right).

This implies that if \frac{np^{2}\log p}{N} converges to 0 as n goes to infinity, the type 1 error control is asymptotically exact. For other models, the conditions may be more complicated. But in general, with a sufficiently large amount of unlabeled data, we are able to make the bias of the type 1 error control negligible.

Different from Theorem 2, the following theorem demonstrates the robustness of the CRRT from another angle, where the discrepancy between Q^{\star}(\cdot|z) and Q(\cdot|z) is quantified in the form of an observed KL divergence.

Theorem 3.

(Type 1 Error Upper Bound 2.) Assume the true conditional distribution of X|Z is Q^{\star}(\cdot|Z). The pseudo-variable matrix (x^{(1)},\ldots,x^{(b)}) is generated in the same way as described in Section 2, except that the generation is based on a misspecified or roughly estimated conditional density Q(\cdot|Z) instead of Q^{\star}(\cdot|Z). The CRRT p-value p_{CRRT} is defined in (2). We define random variables

\widehat{KL}_{k} = \sum_{i=1}^{n}\log\frac{Q^{\star}(x_{i}|z_{[i]}) Q(x_{i}^{(k)}|z_{[i]})}{Q(x_{i}|z_{[i]}) Q^{\star}(x_{i}^{(k)}|z_{[i]})}, \quad k=1,\ldots,b. \qquad (3)

Denote the order statistics of \{\widehat{KL}_{1},\ldots,\widehat{KL}_{b}\} by \widehat{KL}_{(1)} \leq \ldots \leq \widehat{KL}_{(b)}. For any given \varepsilon_{1},\ldots,\varepsilon_{b} that are independent of x and y conditional on z, and for any given \alpha\in[0,1], under the null hypothesis H_{0}: Y\perp X|Z,

\mathbb{P}(p_{CRRT}\leq\alpha, \widehat{KL}_{(1)}\leq\varepsilon_{1}, \ldots, \widehat{KL}_{(b)}\leq\varepsilon_{b}) \leq \mathbb{E}\left[\frac{\lfloor(b+1)\alpha\rfloor}{1+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}}\right].

Further, we can obtain the following bound directly controlling the type 1 error:

\mathbb{P}(p_{CRRT}\leq\alpha) \leq \inf_{\varepsilon_{1},\ldots,\varepsilon_{b}}\left\{\mathbb{E}\left[\frac{\lfloor(b+1)\alpha\rfloor}{1+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}}\right] + \mathbb{P}\left(\bigcup_{k=1}^{b}\{\widehat{KL}_{(k)} > \varepsilon_{k}\}\right)\right\}. \qquad (4)

Remark:

  • We can see that when we have access to the accurate conditional density, i.e., Q(\cdot|Z) \equiv Q^{\star}(\cdot|Z), then \widehat{KL}_{1},\ldots,\widehat{KL}_{b} are all zero. The last bound for the type 1 error in Theorem 3 would then be \frac{\lfloor(b+1)\alpha\rfloor}{b+1}, which is nearly exact when b is sufficiently large.

  • Though at first glance it may seem that the bound in (4) could increase with b and even grow to 1 due to the second term on the right-hand side, we can in fact show that for any small \epsilon>0, the bound on the right-hand side of (4) can be asymptotically controlled from above, as b grows to infinity, by a function g(Q, Q^{\star}, \alpha), which equals \alpha(1+\epsilon) when Q = Q^{\star}. For more details and a generalization of Theorem 3, see the supplement.

  • In Theorems 2 and 3, the proposed conditional density Q(\cdot|Z) cannot depend on (y,z,x). The source of Q(\cdot|Z) is usually a combination of domain knowledge and unlabeled data. It would be interesting to see what theoretical bound can be obtained when Q(\cdot|Z) is estimated from (y,z,x).
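To make the role of the observed KL divergence concrete, the following minimal Python sketch computes (\widehat{KL}_{1},\ldots,\widehat{KL}_{b}) in the Gaussian setting of Example 1, where Q^{\star}(\cdot|z_{[i]}) = N(z_{[i]}^{T}\beta, \sigma^{2}) and Q(\cdot|z_{[i]}) = N(z_{[i]}^{T}\hat{\beta}, \sigma^{2}); with a common variance the Gaussian normalizing constants cancel in (3). All parameter names are illustrative.

import numpy as np

def observed_kl(x, x_pseudo, z, beta_star, beta_hat, sigma):
    # x: (n,) real covariate; x_pseudo: (n, b) pseudo copies.
    mu_star = (z @ beta_star)[:, None]   # true conditional means, (n, 1)
    mu_hat = (z @ beta_hat)[:, None]     # proposed conditional means, (n, 1)

    def log_ratio(v):
        # log Q*(v_i|z_[i]) - log Q(v_i|z_[i]) for Gaussians with equal variance
        return ((v - mu_hat) ** 2 - (v - mu_star) ** 2) / (2 * sigma ** 2)

    # KL_hat_k = sum_i log_ratio(x_i) - sum_i log_ratio(x^(k)_i), as in (3)
    return log_ratio(x[:, None]).sum() - log_ratio(x_pseudo).sum(axis=0)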

Example 2.

(Gaussian Model) We continue to follow the settings and assumptions of Example 1. For k=1,\ldots,b, we can simplify \widehat{KL}_{k} as

\widehat{KL}_{k} = \frac{1}{\sigma^{2}}(x - x^{(k)})^{T}z(\beta - \hat{\beta}) = \frac{1}{\sigma^{2}}(x^{(k)} - z\beta)^{T}z(\hat{\beta} - \beta) - \frac{1}{\sigma^{2}}(x - z\beta)^{T}z(\hat{\beta} - \beta).

Denote M_{0} \triangleq \frac{1}{\sigma^{2}}(x - z\beta)^{T}z(\hat{\beta}-\beta) and M_{k} \triangleq \frac{1}{\sigma^{2}}(x^{(k)} - z\beta)^{T}z(\hat{\beta}-\beta) for k=1,\ldots,b. Besides, denote H \triangleq (\hat{\beta}-\beta)^{T}z^{T}z(\hat{\beta}-\beta). Under our assumptions, \mathbb{E}H = O\left(\frac{np^{2}\log p}{N}\right). We know that M_{0}|z \sim N(0, \frac{1}{\sigma^{2}}H), that M_{1}|z,\ldots,M_{b}|z \sim N(\frac{1}{\sigma^{2}}H, \frac{1}{\sigma^{2}}H), and that they are conditionally independent. Now, we assume that \frac{np^{2}\log p}{N} converges to 0 as n goes to infinity, the same as in Example 1.

First, consider the case where b is bounded, that is, b does not go to infinity as n goes to infinity. We let \varepsilon_{1} = \ldots = \varepsilon_{b} = \sqrt{\frac{np^{2}\log p}{N}} f(n), where f(n) satisfies \lim_{n\rightarrow\infty} f(n) = \infty and \lim_{n\rightarrow\infty} \sqrt{\frac{np^{2}\log p}{N}} f(n) = 0. Then, according to (4), we have

\mathbb{P}(p_{CRRT}\leq\alpha) \leq \frac{\lfloor(b+1)\alpha\rfloor}{1+b\exp\left(-\sqrt{\frac{np^{2}\log p}{N}}f(n)\right)} + \mathbb{P}\left(\bigcup_{k=1}^{b}\left\{\widehat{KL}_{(k)} > \sqrt{\frac{np^{2}\log p}{N}}f(n)\right\}\right) \leq \frac{\lfloor(b+1)\alpha\rfloor}{1+b\exp\left(-\sqrt{\frac{np^{2}\log p}{N}}f(n)\right)} + \mathbb{P}\left(\max_{k=1,\ldots,b} M_{k} - M_{0} > \sqrt{\frac{np^{2}\log p}{N}}f(n)\right).

The difference between the first term and \frac{\lfloor(b+1)\alpha\rfloor}{1+b} obviously converges to 0 because \sqrt{\frac{np^{2}\log p}{N}} f(n) converges to 0. As for the second term, note that

\mathbb{P}\left(\max_{k=1,\ldots,b} M_{k} - M_{0} > \sqrt{\frac{np^{2}\log p}{N}}f(n)\right) = \mathbb{P}\left(\max_{k=1,\ldots,b} \sqrt{\frac{N}{np^{2}\log p}}M_{k} - \sqrt{\frac{N}{np^{2}\log p}}M_{0} > f(n)\right),

where \sqrt{\frac{N}{np^{2}\log p}}M_{0}, \sqrt{\frac{N}{np^{2}\log p}}M_{1}, \ldots, \sqrt{\frac{N}{np^{2}\log p}}M_{b} = O_{p}(1). Besides, considering that f(n)\rightarrow\infty and b is bounded, we can conclude that the second term converges to 0 as n\rightarrow\infty. Therefore, we have

\mathbb{P}(p_{CRRT}\leq\alpha) - \frac{\lfloor(b+1)\alpha\rfloor}{1+b} \rightarrow 0,

as n\rightarrow\infty. We relegate the case of \lim_{n\rightarrow\infty} b = \infty to the supplement.

The above two theorems guarantee the robustness of the CRRT: to control the type 1 error at a tolerable level, it is sufficient to control the total variation distance or the observed KL divergence between Q^{\star}(\cdot|Z) and Q(\cdot|Z). It is then natural to ask the converse question: is controlling the total variation distance or the observed KL divergence necessary? The following two theorems give positive answers.

Theorem 4.

(Type 1 Error Lower Bound 1.) Assume the null hypothesis H_{0}: Y\perp X|Z holds. Suppose the true conditional distribution Q^{\star}(\cdot|Z), the proposed conditional density Q(\cdot|Z) used to sample pseudo variables, and the desired type 1 error level \alpha\in(0,1] are fixed. Suppose x \sim Q^{\star}(\cdot|z) and x^{(k)} \sim Q(\cdot|z), k=1,\ldots,b, independently. Suppose that X|Z is a continuous variable under both Q(\cdot|Z) and Q^{\star}(\cdot|Z). Let A(z)\subseteq\mathbb{R}^{n} be such that

d_{TV}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z)) = \mathbb{P}_{Q_{n}^{\star}}(x\in A(z)|z) - \mathbb{P}_{Q_{n}}(x\in A(z)|z).

For any fixed 0 < \varepsilon < \alpha, when b is sufficiently large, we have

\mathbb{P}(p_{CRRT}\leq\alpha|y,z) \geq \left[\alpha - \varepsilon + d_{TV}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z)) f(z, Q, Q^{\star}, \alpha, \varepsilon)\right] \times (1 - o(1)),

where f(z, Q, Q^{\star}, \alpha, \varepsilon) = \mathbbm{1}\{\alpha_{0}(z) \geq \alpha-\varepsilon\}\frac{\alpha-\varepsilon}{\alpha_{0}(z)} + \mathbbm{1}\{\alpha_{0}(z) < \alpha-\varepsilon\}\frac{1-(\alpha-\varepsilon)}{1-\alpha_{0}(z)}, \alpha_{0}(z) = \mathbb{P}_{Q_{n}}(x\in A(z)|z), and the o(1) term decays to 0 exponentially as b grows to infinity.

It is easy to see that f(z, Q, Q^{\star}, \alpha, \varepsilon) is bounded from below by \min\{\alpha-\varepsilon, 1-(\alpha-\varepsilon)\}. Therefore, the extra error due to misspecification of the conditional distribution is at least linear in d_{TV}(Q^{\star}_{n}(\cdot|z), Q_{n}(\cdot|z)). This demonstrates that when we apply the CRRT, we need to make sure that the proposed conditional distribution has small bias; otherwise, a bad choice of the X-symmetric test function T could lead to considerable type 1 error inflation. It is also intuitive that if the observed KL divergence is large with high probability, the CRRT will fail to give good type 1 error control. The following theorem formally demonstrates this point.

Theorem 5.

(Type 1 Error Lower Bound 2.) Assume the null hypothesis H_{0}: Y\perp X|Z holds. Suppose the true conditional distribution Q^{\star}(\cdot|Z), the proposed conditional density Q(\cdot|Z) used to sample pseudo variables, and the targeted type 1 error level \alpha\in[0,1] are fixed. Suppose that X|Z is a continuous variable under both Q(\cdot|Z) and Q^{\star}(\cdot|Z). Denote \lambda = \lfloor(b+1)\alpha\rfloor. Suppose the following inequality holds when x \sim Q^{\star}(\cdot|z) and x^{(k)} \sim Q(\cdot|z), k=1,\ldots,b, independently:

\mathbb{P}(\widehat{KL}_{k} \geq 0, k=1,\ldots,b, \text{ and } \widehat{KL}_{(\lambda)} \geq \varepsilon) \geq c, \qquad (5)

where \varepsilon and c are nonnegative values and \widehat{KL}_{k}, k=1,\ldots,b, are defined in (3). Then there exist an X-symmetric function T and a conditional distribution P_{Y|Z} such that:

  1. If x, x^{(1)},\ldots,x^{(b)} \sim_{i.i.d.} Q^{\star}(\cdot|z), the CRRT has near-exact type 1 error, i.e.,

    \mathbb{P}(p_{CRRT}\leq\alpha) = \frac{\lfloor(b+1)\alpha\rfloor}{b+1}.
  2. If x \sim Q^{\star}(\cdot|z) and x^{(k)} \sim Q(\cdot|z), k=1,\ldots,b, independently, we have a lower bound for the type 1 error:

    \mathbb{P}(p_{CRRT}\leq\alpha) \geq \frac{\lfloor(b+1)\alpha\rfloor}{b+1} + c\left(1 - \frac{\lfloor(b+1)\alpha\rfloor}{b+1}\right)(1 - e^{-\varepsilon}).

There are in fact many alternative ways to state the above theorem. We can replace the key condition (5) with a more general form like

\mathbb{P}(\widehat{KL}_{(k)} \geq \varepsilon_{k}, k=1,\ldots,b) \geq c,

where 0 \leq \varepsilon_{1} \leq \ldots \leq \varepsilon_{b}, and then derive a corresponding lower bound for the type 1 error.

4 Connections Among Tests Based on Conditional Sampling

As mentioned above, model-X methods can effectively use the information contained in the (conditional) distribution of the covariates and shift the burden from learning the model of the response on the covariates to estimating the (conditional) distribution of the covariates. They are a very suitable choice when we have little knowledge about how the response depends on the covariates but know more about how the covariates interact with each other. For the conditional independence testing problem, following the spirit of model-X, there exist several methods based on conditional sampling, including the Conditional Randomization Test ([CFJL18]), the Conditional Permutation Test ([BWBS19]) and the previously introduced Conditional Randomization Rank Test. For completeness, we also propose the Conditional Permutation Rank Test, which will be defined later in this section. In this section, we discuss the connections among them.

Let us briefly review the CRT and the CPT. We still suppose that (y,z,x) consists of n i.i.d. copies of the random variables (Y,Z,X), where y is of dimension n\times 1, z is of dimension n\times p and x is of dimension n\times 1. Suppose that we believe the conditional density of X|Z is Q(\cdot|Z). This conditional distribution is the key to the success of conditional tests. It could be either exactly correct or biased. As shown in [BWBS19], as long as Q has only a moderate deviation from the true underlying conditional distribution, say Q^{\star}(\cdot|Z), we are able to control the type 1 error.

\bullet Comparing the CRRT and the CRT

The CRT independently samples b pseudo vectors x^{(1)},\ldots,x^{(b)} in the same way as the CRRT. For each k=1,2,\ldots,b, x^{(k)} is of dimension n, with entries x_{i}^{(k)} \sim Q(\cdot|z_{[i]}) independently, i=1,2,\ldots,n. Let T_{m}(y,z,x) be any pre-defined function mapping from \mathbb{R}^{n}\times\mathbb{R}^{n\times p}\times\mathbb{R}^{n} to \mathbb{R}. Here, the requirement of 'pre-defined' is not strict: T_{m} only needs to be independent of x. Then, the p-value of the CRT is defined as

p_{CRT} = \frac{1 + \sum_{k=1}^{b}\mathbbm{1}\{T_{m}(y,z,x^{(k)}) \geq T_{m}(y,z,x)\}}{1+b}. \qquad (6)

It is easy to see that the CRT is essentially a subclass of the CRRT, because the CRT requires the test statistic T_{m} to be the same function applied to each of x, x^{(1)},\ldots,x^{(b)}, while the CRRT relaxes this restriction to X-symmetry. Compared with the CRT, the CRRT enjoys greater flexibility and therefore can have greater efficiency. We will give concrete examples to demonstrate this point in our empirical experiments. Here, we show an intuitive toy example. Suppose that we use the LASSO estimator as our test statistic for the CRT. Let

(\hat{\beta}_{1}^{T}(y,z,x), \hat{\beta}_{2}(y,z,x))^{T} = \mathop{\arg\min}_{\beta_{1}\in\mathbb{R}^{p}, \beta_{2}\in\mathbb{R}} \|y - z\beta_{1} - x\beta_{2}\|_{2}^{2} + \hat{\lambda}(\|\beta_{1}\|_{1} + \|\beta_{2}\|_{1}),

where \hat{\lambda} can be given by cross-validation based on (y,z,x) or be fixed before analyzing the data. For k=0,1,2,\ldots,b, let T_{m}(y,z,x^{(k)}) = \hat{\beta}_{2}(y,z,x^{(k)}), with x^{(0)} \equiv x. Suppose the dimension of Z is p=n and the number of pseudo variables is b=n. According to Efron, Hastie, Johnstone and Tibshirani (2004), the time complexity of calculating the b+1 test statistics is O(n^{4}). In contrast, if we apply the CRRT with a similar LASSO statistic, we see a significant saving of time. Let

(\hat{\beta}_{3}^{T}(y,z,\tilde{x}), \hat{\beta}_{4}^{T}(y,z,\tilde{x}))^{T} = \mathop{\arg\min}_{\beta_{3}\in\mathbb{R}^{p}, \beta_{4}\in\mathbb{R}^{b+1}} \|y - z\beta_{3} - \tilde{x}\beta_{4}\|_{2}^{2} + \hat{\lambda}(\|\beta_{3}\|_{1} + \|\beta_{4}\|_{1}),

where \hat{\lambda} can be given by cross-validation based on (y,z,\tilde{x}) or be fixed before analyzing the data. Then, we can define the test statistics

T(y,z,\tilde{x}) = (T_{0}(y,z,\tilde{x}),\ldots,T_{b}(y,z,\tilde{x}))^{T} = \hat{\beta}_{4}(y,z,\tilde{x}),

whose calculation costs O(n^{3}) time. Hence, the order of time complexity is smaller by a factor of \Theta(n).
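The construction above amounts to one joint fit instead of b+1 separate fits. A minimal Python sketch using scikit-learn's Lasso (whose objective differs from the display above only by a constant rescaling of the penalty) is given below; the penalty level alpha is an illustrative placeholder for \hat{\lambda}.

import numpy as np
from sklearn.linear_model import Lasso

def crrt_lasso_statistics(y, z, x_tilde, alpha=0.1):
    # One lasso of y on (z, x_tilde); the b+1 fitted coefficients on the
    # columns of x_tilde form an X-symmetric statistic vector, since the
    # objective is invariant to permuting those columns.
    design = np.column_stack([z, x_tilde])
    fit = Lasso(alpha=alpha).fit(design, y)
    p = z.shape[1]
    return np.abs(fit.coef_[p:])          # one statistic per column of x_tilde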

In addition to the above LASSO example, there are numerous common examples we can use to explain the CRRT's computational advantage. [Lou14] showed that the average-case time complexity of CART is \Theta(pn\log^{2}n), where p is the number of variables and n is the number of samples. Suppose p=n and let the number of pseudo variables be b=n in both the CRT and the CRRT. The total time complexity of the CRT is then \Theta(n^{3}\log^{2}n) while that of the CRRT is \Theta(n^{2}\log^{2}n). Again, the CRRT outperforms the CRT by a factor of \Theta(n).

\bullet Connecting the CRRT with the CPT

The CPT proposed by [BWBS19] can be viewed as a robust version of the CRT. It inherits the spirit of traditional permutation tests and adopts a pseudo-variable generation scheme different from the CRT's. Denote the order statistics of x by x_{ordered} = (x_{(1)},\ldots,x_{(n)})^{T}, where x_{(1)} \leq \ldots \leq x_{(n)}. Conditional on x_{ordered} and z, define a conditional distribution R_{n}(\cdot|z, x_{ordered}) induced by Q:

R_{n}(v|z, x_{ordered}) = \begin{cases} \frac{Q_{n}(v|z)}{\sum_{\dot{v}\in\Gamma(x_{ordered})} Q_{n}(\dot{v}|z)}, & v\in\Gamma(x_{ordered}), \\ 0, & v\notin\Gamma(x_{ordered}), \end{cases}

where \Gamma(x_{ordered}) \triangleq \{x_{permute(\sigma)}: \sigma\in\Sigma_{n}\} and \Sigma_{n} is the collection of all permutations of \{1,2,\ldots,n\}. Similarly, we can define R^{\star}_{n}(v|z, x_{ordered}) induced by Q^{\star}. Let T_{m}(y,z,x) be any function from \mathbb{R}^{n}\times\mathbb{R}^{n\times p}\times\mathbb{R}^{n} to \mathbb{R}; the requirements on T_{m} are the same as those in the CRT, and the p-value of the CPT is defined in the same way as that of the CRT. For the CRT, if Q specifies the true conditional distribution, then under the null hypothesis H_{0}: Y\perp X|Z, the statistics T_{m}(y,z,x^{(0)}),\ldots,T_{m}(y,z,x^{(b)}) are i.i.d. conditional on y and z. Such exchangeability guarantees the control of the type 1 error. But for the CPT, we need to take the further step of conditioning on x_{ordered}, y and z to ensure exchangeability. As usual, more conditioning brings more robustness and, correspondingly, less power; [BWBS19] has good examples illustrating these points. The following two propositions, combined with Theorem 5 of [BWBS19], give us some clues on the robustness provided by the extra conditioning.

Proposition 3.

Assume the null hypothesis H_{0}: Y\perp X|Z holds. Let x \sim Q_{n}^{\star}(\cdot|z) conditional on z, and, conditional on x_{ordered} and z, let x^{(1)},\ldots,x^{(b)} \sim R_{n}(\cdot|z, x_{ordered}) independently. There exists a statistic T_{m}(y,z,x): \mathbb{R}^{n}\times\mathbb{R}^{n\times p}\times\mathbb{R}^{n}\rightarrow\mathbb{R} such that, for the CPT, the following lower bound holds:

\mathbb{E}\left[\sup_{\alpha\in[0,1]}(\mathbb{P}(p_{CPT}\leq\alpha|y,z,x_{ordered}) - \alpha) \,\Big|\, y,z\right] \geq \mathbb{E}\left[d_{TV}(R^{\star}_{n}(\cdot|z,x_{ordered}), R_{n}(\cdot|z,x_{ordered})) \,|\, z\right] - \frac{1+o(1)}{2}\sqrt{\frac{\log b}{b}}

as b\rightarrow\infty, for a.e. y\in\mathbb{R}^{n}, z\in\mathbb{R}^{n\times p}, where x_{ordered} is the vector of order statistics of x.

Proposition 4.

Let x \sim Q_{n}^{\star}(\cdot|z). Then we have

\mathbb{E}\left[d_{TV}(R^{\star}_{n}(\cdot|z,x_{ordered}), R_{n}(\cdot|z,x_{ordered})) \,|\, z\right] \leq 2 d_{TV}(Q_{n}^{\star}(\cdot|z), Q_{n}(\cdot|z)).

Proposition 3 indicates that, under misspecification of the conditional distribution, if we are unlucky enough to choose a bad test statistic T_{m} and a bad type 1 error level \alpha, the deviation of the type 1 error may be comparable to the deviation of the conditional distribution. Though we cannot directly compare the results in Proposition 3 to those in Theorem 5 of [BWBS19], Proposition 4 tells us that if \mathbb{E}[d_{TV}(R^{\star}_{n}(\cdot|z,x_{ordered}), R_{n}(\cdot|z,x_{ordered}))|z] is large, so is d_{TV}(Q_{n}^{\star}(\cdot|z), Q_{n}(\cdot|z)), which means that if the type 1 error lower bound for the CPT blows up, so does the one for the CRT. Therefore, we can see that the CPT is more robust than the CRT in some sense.
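Sampling exactly from R_{n}(\cdot|z, x_{ordered}) is generally intractable, since it is a distribution over n! permutations. The sketch below approximates one draw with random pairwise swaps in a Metropolis step, in the spirit of the samplers discussed in [BWBS19], again under a Gaussian working model Q(\cdot|z_{[i]}) = N(z_{[i]}^{T}\beta, \sigma^{2}); the step count and parameter names are illustrative.

import numpy as np

def cpt_permutation_sample(x, z, beta, sigma, n_steps=500, rng=None):
    # Return a permutation of x approximately distributed as R_n(.|z, x_ordered).
    rng = np.random.default_rng(rng)
    v, mu = x.copy(), z @ beta
    for _ in range(n_steps):
        i, j = rng.integers(0, len(v), size=2)
        # Log acceptance ratio of swapping entries i and j under Q_n(.|z):
        # only the two affected Gaussian factors change.
        log_a = (-(v[j] - mu[i]) ** 2 - (v[i] - mu[j]) ** 2
                 + (v[i] - mu[i]) ** 2 + (v[j] - mu[j]) ** 2) / (2 * sigma ** 2)
        if np.log(rng.uniform()) < log_a:
            v[i], v[j] = v[j], v[i]
    return v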

To bridge the CPT and the CRRT, we introduce a test incorporating the advantages of both: the Conditional Permutation Rank Test (CPRT). The CPRT generalizes the CPT, just as the CRRT is a generalization of the CRT. The CPRT shares the same pseudo-variable generation scheme as the CPT. For any prescribed X-symmetric function

T(y,z,\tilde{x}) = (T_{0}(y,z,\tilde{x}),\ldots,T_{b}(y,z,\tilde{x})): \mathbb{R}^{n}\times\mathbb{R}^{n\times p}\times\mathbb{R}^{n\times(b+1)}\rightarrow\mathbb{R}^{b+1},

where T is only required to be independent of \tilde{x}, we define the p-value of the CPRT as

p_{CPRT} = \frac{1}{b+1}\sum_{k=0}^{b}\mathbbm{1}\{T_{0}(y,z,\tilde{x}) \leq T_{k}(y,z,\tilde{x})\}.

It is easy to see that the CPRT also controls the type 1 error if the conditional distribution of X given Z is correctly specified, which is formalized in the following proposition. We skip its proof since it is essentially the same as that of Theorem 1.

Proposition 5.

Assume that the null hypothesis H_{0}: Y\perp X|Z holds and that X \sim Q(\cdot|Z). For k=1,2,\ldots,b, the x^{(k)}'s are independently generated according to the conditional distribution R_{n}(\cdot|x_{ordered}, z) induced by Q. Assume that T(y,z,\tilde{x}) is X-symmetric. Then the CPRT controls the type 1 error at any given level \alpha\in[0,1]:

\mathbb{P}(p_{CPRT}\leq\alpha) \leq \alpha.

Figure 2: Relationships among conditional tests based on pseudo variables. The CRT-CRRT only conditions on z. If we further condition on x_{ordered}, we get the CPT-CPRT. We can also condition on other statistics to generate pseudo variables, say f(x,z) and g(x,z), yielding further variants such as CXT-CXRT, CYT-CYRT and so on.

The relationships among the four conditional tests discussed here are summarized in Figure 2. Permutation is just one of several ways to enhance robustness at the expense of power. For example, we can also generate pseudo variables conditional on \bar{x} and z, where \bar{x} is the mean of x. In general, the more we condition on, the more robustness we gain. The extreme case is to generate pseudo variables conditioning on x and z itself; then all x^{(k)}, k=1,\ldots,b, are identical to x and the power decreases to 0.

When constructing a sampling-based conditional test, how much we need to condition on depends on how much we know about the conditional distribution of X given Z. If we are confident that the proposed conditional distribution has slight or no bias, we can be more aggressive and apply the CRRT.

5 Empirical Results

In this section, we evaluate the performance of the CRRT through comparison with other conditional sampling based methods, including the CRT, CPT, dCRT and HRT, on simulated data. We also use concrete simulated examples to demonstrate the robustness of the CRRT.

5.1 Some Preliminary Preparations and Explanations

All methods considered in our simulations are based on conditional resampling and thus theoretically control the type 1 error regardless of the number of samplings we take. Though a small number of samplings b is theoretically valid, a larger b prevents unnecessary conservativeness. At the same time, we do not want to take an extremely large b, since it would bring prohibitive computational burdens. After some preliminary analysis shown in the supplement, we set b=199 across all simulations.

Note that the dCRT is a 2-step test. The first step is distillation, where we extract and compress the information about Y provided by Z. After distillation, the second step, which we call the testing step, can be executed efficiently on a dataset of low dimension. For more details, see [LKJR20]. In the distillation step, it is optional to keep several important elements of Z and incorporate them in the testing step. In this way, [LKJR20] show that models with slight interactions can be handled effectively. We denote the dCRT keeping k elements of Z by d_kCRT. The dCRT is a special version of the CRT and still enjoys some flexibility because we can take any appropriate approach in each of the distillation and testing steps. In our simulations, we explicitly specify which methods we use in each step. For example, when we say we use the linear Lasso in the distillation step of d_kCRT, it means that we decide which k elements to keep based on the absolute fitted coefficients from a linear Lasso regression of Y on Z, and use the Lasso regression residuals as distillation residuals. When we say we use least squares linear regression in the testing step, it means that we run a simple linear regression of the distillation residuals on X (or its pseudo copies) and the important variables kept from the previous step, and use the corresponding absolute fitted coefficient of X (or its pseudo copies) as the test statistic. Unless specified otherwise, we use least squares linear regression in the testing step by default.

The HRT ([TVZ+18]) is also a 2-step test belonging to the CRT family. It splits the data into a training set and a testing set. With the training set, we fit a model, which can be arbitrary, of Y on X and Z. Then, we use the testing set to generate conditional samples and feed all copies to the fitted model to obtain prediction errors. These errors play the role of the test function T. In our simulations, when we say we use the logistic Lasso in the HRT, it means that we fit a logistic Lasso on the training set.

One-step tests, like the CRT, CRRT and CPT, are simpler. As we have seen, the test function T is flexible, and it is better to choose a suitable T in each setting. When we say we use the random forest as our test function, it means that we fit a random forest and use the feature importance of X (or its pseudo copies) as the test statistic.

We propose a trick for implementing the CRRT. Suppose that b=199 and n=400, so that \tilde{x} is a 400\times 200 matrix. For some given positive integer k, say 4, we separate \tilde{x} into 4 folds, \tilde{x}_{1}, \tilde{x}_{2}, \tilde{x}_{3}, \tilde{x}_{4}\in\mathbb{R}^{400\times 50}. This step can be deterministic or random. For simplicity, in our simulations, we let \tilde{x}_{i} be the submatrix consisting of the [(i-1)\frac{b+1}{k}+1]th through the [\frac{(b+1)i}{k}]th columns of \tilde{x}, i=1,\ldots,k. Then, we apply an X-symmetric test function T to each (y,z,\tilde{x}_{i}) and obtain test statistics for X and each pseudo variable. Finally, we construct a p-value based on these test statistics, as in the sketch below. We call the CRRT equipped with this trick the CRRT_k with mini-batch size \frac{b+1}{k}, or the CRRT_k with k folds. Though after this extra processing the procedure is no longer a CRRT, it is not hard to show that the CRRT_k still guarantees type 1 error control. When k=b+1, the CRRT_{b+1} degenerates to the CRT.
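Here is a minimal Python sketch of the CRRT_k: the real x leads the first fold, the same X-symmetric statistic is applied to each fold, and the per-column statistics are pooled into one rank-based p-value. The argument statistic stands for any X-symmetric function of (y, z, fold), such as the lasso statistic sketched in Section 4.

import numpy as np

def crrt_k_pvalue(y, z, x_tilde, k, statistic):
    # x_tilde: (n, b+1) matrix with the real x as column 0; b+1 must be
    # divisible by k, e.g. b = 199 and k = 4 as in the text.
    folds = np.split(x_tilde, k, axis=1)       # k folds of (b+1)/k columns
    t = np.concatenate([statistic(y, z, f) for f in folds])
    return np.mean(t[0] <= t)                  # rank T_0 among all b+1 statistics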

5.2 The CRRT Shows a Huge Improvement in Computational Efficiency Over the CRT

In this part, we assess the performance of all the conditional sampling based tests mentioned above in the linear regression setting, which is frequently encountered in theoretical statistical analysis. In particular, we compare their ability to control the type 1 error when the null hypothesis is true, their power when the null hypothesis is false, and their computational efficiency. Generally speaking, the CRRT and the dCRT are among the best because they are simultaneously efficient and powerful in our simulations.

We let n=400 and p=100, with i.i.d. observations. We perform 200 Monte-Carlo simulations for each specific setting. For i=1,2,\ldots,400, (x_{i}, z_{[i]}) is an i.i.d. realization of (X, Z), where (X, Z) is a (p+1)-dimensional random vector following a Gaussian AR(1) process with autoregressive coefficient 0.5. We let

y_{i} = \beta_{0}x_{i} + z_{[i]}^{T}\beta + \varepsilon_{i}, \quad i=1,2,\ldots,400,

where \beta_{0}\in\mathbb{R}, \beta\in\mathbb{R}^{p}, and the \varepsilon_{i}'s are i.i.d. standard Gaussian, independent of x_{i} and z_{[i]}. We randomly sample a set S of size 20 without replacement from \{1,2,\ldots,p\}. For j=1,\ldots,p, let \beta_{j}=0 if j\notin S, and \beta_{j} = 0.5\mathbbm{1}\{B_{j}=1\} - 0.5\mathbbm{1}\{B_{j}=0\} if j\in S, where B_{j} is Bernoulli. We consider various values of \beta_{0}, namely 0, 0.05, 0.1, \ldots, 0.3.
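A minimal Python sketch of this data-generating process follows; the unit-variance stationary form of the AR(1) process is an illustrative assumption.

import numpy as np

def simulate(n=400, p=100, beta0=0.1, rho=0.5, s=20, rng=None):
    rng = np.random.default_rng(rng)
    # (X, Z): (p+1)-dimensional Gaussian AR(1) with autoregressive coefficient rho
    e = rng.standard_normal((n, p + 1))
    w = np.empty((n, p + 1))
    w[:, 0] = e[:, 0]
    for j in range(1, p + 1):
        w[:, j] = rho * w[:, j - 1] + np.sqrt(1 - rho ** 2) * e[:, j]
    x, z = w[:, 0], w[:, 1:]
    # Sparse coefficients: s entries of +/-0.5 at random positions
    beta = np.zeros(p)
    S = rng.choice(p, size=s, replace=False)
    beta[S] = rng.choice([-0.5, 0.5], size=s)
    y = beta0 * x + z @ beta + rng.standard_normal(n)
    return y, z, x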

Figure 3 (two panels: (a) Rejection Rate; (b) Time): (a) Proportions of rejection in the linear regression setting. The y-axis represents the proportion of times the null hypothesis is rejected; the x-axis represents the true coefficient of X. (b) Time in seconds per Monte-Carlo simulation. dCRT_lasso_k represents the dCRT with linear Lasso in the distillation step, keeping k important variables after distillation. CRRT_lasso_k represents the CRRT using linear Lasso as the test function, with batch size (b+1)/k. HRT_lasso_0k represents the HRT fitting a linear Lasso model and using a training set of size n\times k\times 0.1.

Results: Figure 3b exhibits a huge disadvantage of the CRT and the CPT in terms of computational efficiency: they spend at least 10 times as much time as the other tests. This validates our assertion in Section 4 that the CRRT offers a great efficiency improvement over the CRT. We should also note that here we only try simple methods like the Lasso; if the methods used become more complicated, we expect an even greater discrepancy in computational efficiency.

From Figure 3a, we can see that essentially all tests included in our simulations show good type 1 error control, which coincides with the theoretical results. When the null hypothesis is true, the HRT always shows a markedly low rejection rate; meanwhile, its power is significantly lower than the other methods' when the null hypothesis is false. Though setting the proportion of the training set to 50% reduces conservativeness, the HRT is still overly conservative in practice. The CRT and the CPT stably show greater power than the other methods; however, their advantage over the CRRT and the dCRT is slight.

The d_0CRT, CRRT_2 and CRRT_4 exhibit comparable performance in terms of test effectiveness and computational efficiency. We note that the CRRT_1 does not always have a computational advantage over the CRRT_k with k>1, as can be observed in Figure 3b. The d_{12}CRT can be viewed as a conservative version of the d_0CRT. It is no surprise that the d_{12}CRT gains less power in the above simulations, compared with the d_0CRT and some CRRT's, because there is no interaction effect and the methods used in the distillation and testing steps basically suit the nature of the models, which makes the important variables kept from the distillation step act as noise instead.

5.3 The CRRT Outperforms Other Tests in Complicated Settings

In practice, the model of the response variable on the covariates is usually not as simple as the one in the previous subsection. In this subsection, we consider complicated models and assess the performance of the CRRT, dCRT and HRT. As implied by the figure presented in the introduction, the CRT and the CPT spend far more time than the other methods when models get complicated, so we do not include them in this subsection. The results show that only the CRRT remains both efficient and powerful in complicated settings.

We set n=400, with i.i.d. observations, and run 200 Monte-Carlo simulations for each specific setting. For i=1,2,\ldots,400, (x_{i}, z_{[i]}) is an i.i.d. realization of (X, Z), where (X, Z) is a 101-dimensional random vector following a Gaussian AR(1) process with autoregressive coefficient 0.5. We consider the following response model:

{\bf Interaction\ Model\ 1:} \quad y_{i} = \beta_{0}x_{i}\sum_{j=1}^{100} z_{[i]j} + z_{[i]}^{T}\beta + \varepsilon_{i}, \quad i=1,2,\ldots,400,

where \beta_{0}\in\mathbb{R}, \beta\in\mathbb{R}^{100}, and the \varepsilon_{i}'s are i.i.d. standard Gaussian, independent of x_{i} and z_{[i]}. To construct \beta, we randomly sample 5 indexes without replacement from \{1,2,\ldots,100\}, forming a set S. For j=1,\ldots,100, let \beta_{j}=0 if j\notin S, and \beta_{j} = 0.5\mathbbm{1}\{B_{j}=1\} - 0.5\mathbbm{1}\{B_{j}=0\} otherwise, where B_{j} is Bernoulli. We consider a range of choices of \beta_{0}. When \beta_{0}=0, Y is a function of Z and random errors in all 3 models, which implies that the null hypothesis Y\perp X|Z holds. When \beta_{0}\neq 0, X is no longer independent of Y conditional on Z. We also consider 2 other models with categorical responses, the results of which are provided in the supplement.

Figure 4 (two panels: (a) Rejection Rate; (b) Time): (a) Proportions of rejection under Interaction Model 1. The y-axis represents the proportion of times the null hypothesis is rejected; the x-axis represents the true coefficient of X. (b) Time in seconds per Monte-Carlo simulation. dCRT_k_rf represents the dCRT with random forest in both the distillation step and the testing step, keeping k important variables after distillation. dCRT_k_rf_ols represents the dCRT with random forest in the distillation step and least squares linear regression in the testing step. CRRT_k_rf represents the CRRT using random forest as the test function, with batch size (b+1)/k. HRT_0k represents the HRT fitting a random forest model and using a training set of size n\times k\times 0.1.

Results: From Figure 4a, we can see that the CRRT's have dominating power over the dCRT's and the HRT's. In particular, the CRRT with batch size 50 (i.e., CRRT_4_rf) has the most power among all tests considered here. Essentially all tests control the observed type 1 error when the effect size is 0. Figure 4b shows that the CRRT's do not sacrifice much computational efficiency. It is worth mentioning that the d_{24}CRT does not gain more power than the d_{12}CRT, which demonstrates that the strategy of keeping important features after the distillation step can fail to alleviate the problems caused by interactions when the model is complicated.

5.4 The CRRT Is More Robust to Conditional Distribution Misspecification Than the dCRT

We proved in Section 3 that the CRRT is robust to misspecification of the conditional density, i.e., Q(\cdot|Z) \neq Q^{\star}(\cdot|Z). In this subsection, we demonstrate this with simulations covering 3 sources of misspecification: misspecification from non data-dependent estimation, from using unlabeled data, and from reusing labeled data. Note that when the misspecification originates from reusing labeled data, there are currently no theoretical results on robustness. We consider the CRRT and the dCRT, which showed relatively good performance in the previous experiments.

5.4.1 Misspecification From Non Data-Dependent Estimation

One source of conditional distribution misspecification is domain knowledge or convention, which does not depend on labeled or unlabeled data. We consider the following 3 response models. For each specific setting, we run 500 Monte-Carlo simulations.

[Figure 5: four panels, (a) $n>p$ linear regression, OLS-based tests; (b) $n>p$ linear regression, Lasso-based tests; (c) $n=p$ linear regression, Lasso-based tests; (d) $n>p$ logistic regression, Lasso-based tests; caption below.]
Figure 5: The CRRT is less sensitive than the dCRT to non-data-dependent misspecification of the conditional distribution. The y-axis shows the proportion of times the null hypothesis of conditional independence is rejected out of 500 simulations; the x-axis shows the value of $\theta$, the degree of misspecification. CRRT_OLS_k, CRRT_lasso_k and CRRT_logistic_k denote the CRRTs with, respectively, least-squares linear regression, linear Lasso and logistic Lasso as the test function, with batch size $(b+1)/k$. dCRT_OLS_k, dCRT_lasso_k and dCRT_logistic_k denote the dCRTs using, respectively, least-squares linear regression, linear Lasso and logistic Lasso in the distillation step, with $k$ important variables kept after distillation.

$\mathbf{n>p}$ linear regression: We set $n=400$, $p=100$, with i.i.d. observations. For $i=1,2,\ldots,400$, $(x_{i},z_{[i]}^{T})^{T}$ is an independent realization of $(X,Z^{T})^{T}$, a $(p+1)$-dimensional random vector. $Z\in\mathbb{R}^{p}$ follows a Gaussian AR(1) process with autoregressive coefficient 0.5, and $X|Z\sim N(\mu(Z),\sigma^{2})$, where $\mu$ and $\sigma^{2}$ will be specified shortly. We let

$$y_{i}=0\cdot x_{i}+z_{[i]}^{T}\beta+\varepsilon_{i},\quad i=1,2,\ldots,400,$$

where $\beta\in\mathbb{R}^{p}$ and the $\varepsilon_{i}$'s are i.i.d. standard Gaussian, independent of $x_{i}$ and $z_{[i]}$. The vector $\beta$ is constructed as follows. We randomly sample a set $S$ of size 20 without replacement from $\{1,2,\ldots,p\}$. For $j=1,\ldots,p$, let $\beta_{j}=0$ if $j\notin S$, and $\beta_{j}=0.5\mathbbm{1}\{B_{j}=1\}-0.5\mathbbm{1}\{B_{j}=0\}$ if $j\in S$, where the $B_{j}$'s are Bernoulli.

$\mathbf{n=p}$ linear regression: Everything is the same as in the $n>p$ linear regression setting, except that the dimension of $Z$ is increased from 100 to 400 while the sparsity of $\beta$ remains 20.

$\mathbf{n>p}$ logistic regression: $x$ and $z$ are generated in the same way as in the previous model, with $p=100$. We consider a binary model,

$$y_{i}=\mathbbm{1}\{0\cdot x_{i}+z_{[i]}^{T}\beta+\varepsilon_{i}>0\},\quad i=1,2,\ldots,400,$$

where $\beta\in\mathbb{R}^{p}$, and the $\varepsilon_{i}$'s and $\beta$ are constructed in the same way as in the $n>p$ linear regression model.

Suppose that $(\mathring{X},Z^{T})^{T}\in\mathbb{R}^{p+1}$ is from a Gaussian AR(1) process with autoregressive coefficient 0.5. The conditional distribution of $\mathring{X}$ given $Z$ is then Gaussian, of the form $N(Z^{T}\zeta,\mathring{\sigma}^{2})$. For $\theta\in[0,1]$, we let the $\mu(Z)$ and $\sigma^{2}$ mentioned above be $\theta Z^{T}\zeta$ and $\frac{1+(1-\theta)^{2}}{2}\mathring{\sigma}^{2}$, respectively. For $k=1,2,\ldots,b$, we set the conditional distribution of the pseudo variable $X^{(k)}$ given $Z$ to be $N(0,\mathring{\sigma}^{2})$. When $\theta$ is exactly 0, the conditional distribution of $X$ given $Z$ is also $N(0,\mathring{\sigma}^{2})$ and there is no misspecification. As $\theta$ increases, the degree of misspecification increases.
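A minimal sketch of this misspecification scheme (ours; we assume the AR(1) process has unit marginal variances, so that $\zeta=(0.5,0,\ldots,0)^{T}$ and $\mathring{\sigma}^{2}=0.75$, and we use an i.i.d. Gaussian placeholder for the draw of $Z$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, b = 400, 100, 200
rho = 0.5                          # AR(1) autoregressive coefficient

# Assuming unit marginal variances, X depends on Z only through Z_1:
# X | Z ~ N(rho * Z_1, 1 - rho^2), i.e. zeta = (rho, 0, ..., 0).
zeta1, sig2_ring = rho, 1.0 - rho ** 2

z = rng.standard_normal((n, p))    # placeholder for the AR(1) draw of Z

def sample_x_true(z, theta):
    """True law of X | Z: N(theta * zeta^T Z, (1 + (1 - theta)^2) / 2 * sig2_ring)."""
    mu = theta * zeta1 * z[:, 0]
    var = (1.0 + (1.0 - theta) ** 2) / 2.0 * sig2_ring
    return mu + np.sqrt(var) * rng.standard_normal(len(z))

def draw_pseudo(z, b):
    """Pseudo variables are always drawn from N(0, sig2_ring), the theta = 0 law."""
    return np.sqrt(sig2_ring) * rng.standard_normal((len(z), b))

x = sample_x_true(z, theta=0.5)    # any theta > 0 induces misspecification
x_pseudo = draw_pseudo(z, b)
```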

Results: Figures 5(a) and 5(b) display results for the $n>p$ linear regression model. Increasing misspecification of the conditional distribution can inflate the type 1 error. Moreover, the CRRTs are consistently more robust than the corresponding dCRTs with the same type of test function. In particular, the CRRTs with linear Lasso as the test function are the least sensitive: even when $\theta=1$, CRRT_lasso_1 still controls the type 1 error below 0.1, while dCRT_lasso_12 has a type 1 error reaching 0.25.

Turning to Figure 5(c), which shows results for the $n=p$ linear regression model, the CRRTs display an even more pronounced advantage in robustness to misspecification over the dCRTs. The performance of the CRRTs varies little across different choices of batch size.

Figure 5(d) shows results for the $n>p$ logistic regression model: the CRRTs again behave more robustly than the dCRTs and successfully keep the observed type 1 error under 0.1, as in the previous setting.

In summary, the CRRT is more robust to this type of misspecification than the dCRT across several models. Moreover, the CRRT has computational efficiency comparable to the dCRT; this is not shown directly here, but it can be inferred from previous results, since the misspecification has no impact on computation. We should note that, unlike the monotonic trend shown here, misspecification of the conditional distribution may influence the type 1 error in more intricate ways. For more details, see the supplement.

5.4.2 Misspecification From Using Unlabeled Data

Estimating the conditional distribution from unlabeled data is an effective and popular approach that does not sacrifice theoretical validity ([CFJL18], [HJ19], [RSC19], [SSC19]). The ability to utilize unlabeled data is also a main reason why model-X methods have become increasingly attractive. Usually, we posit a parametric family for the target conditional distribution and estimate its parameters with unlabeled data. When the posited parametric family is correct, the estimated conditional distribution gets closer to the true one as the unlabeled sample size grows. One could also use unlabeled data nonparametrically with kernel methods, but here we only show simulations with parametric estimation.

We consider the same response models as in the previous part, i.e., $n>p$ linear regression, $n=p$ linear regression and $n>p$ logistic regression. Here, we fix $\mu(Z)=Z^{T}\zeta$ and $\sigma^{2}=\mathring{\sigma}^{2}$ so that $(X,Z^{T})^{T}$ follows a Gaussian AR(1) process with autoregressive coefficient 0.5. However, we do not use the exact conditional distribution $Q^{\star}(\cdot|Z)$ to generate pseudo variables. Instead, we assume a Gaussian model and obtain estimators $\hat{\zeta}$ and $\hat{\sigma}^{2}$ from an unlabeled dataset $(x_{unlabeled},z_{unlabeled})$, where $x_{unlabeled}\in\mathbb{R}^{N}$ and $z_{unlabeled}\in\mathbb{R}^{N\times p}$, using the scaled Lasso ([SZ12]). Note that each row of the unlabeled dataset is a realization of $(X,Z^{T})^{T}$. We then generate pseudo variables from the estimated conditional distribution $N(Z^{T}\hat{\zeta},\hat{\sigma}^{2})$.
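A sketch of this estimation step (ours; scikit-learn has no scaled Lasso, so a cross-validated Lasso with a residual-variance estimate is used as a stand-in for the procedure of [SZ12]):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def fit_conditional_gaussian(x_unlabeled, z_unlabeled):
    """Estimate (zeta, sigma^2) in X | Z ~ N(Z^T zeta, sigma^2) from unlabeled data."""
    fit = LassoCV(cv=5).fit(z_unlabeled, x_unlabeled)
    resid = x_unlabeled - fit.predict(z_unlabeled)
    df = max(len(x_unlabeled) - np.count_nonzero(fit.coef_) - 1, 1)
    return fit.coef_, resid @ resid / df

def sample_pseudo(z, zeta_hat, sigma2_hat, b, rng):
    """Draw b pseudo copies of x from the estimated N(Z^T zeta_hat, sigma2_hat)."""
    mu = z @ zeta_hat
    return mu[:, None] + np.sqrt(sigma2_hat) * rng.standard_normal((len(z), b))
```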

[Figure 6: four panels, (a) $n>p$ linear regression, OLS-based tests; (b) $n>p$ linear regression, Lasso-based tests; (c) $n=p$ linear regression, Lasso-based tests; (d) $n>p$ logistic regression, Lasso-based tests; caption below.]
Figure 6: The CRRT is less sensitive than the dCRT to misspecification of the conditional distribution originating from unlabeled-data estimation. The y-axis shows the proportion of times the null hypothesis of conditional independence is rejected out of 500 simulations; the x-axis shows the sample size of the unlabeled data. Legend definitions are as in Figure 5.

Results: The plots in Figure 6 exhibit an encouraging phenomenon: both the CRRT and the dCRT essentially control the type 1 error under 0.05 when the misspecification comes from unlabeled-data estimation. The CRRT clearly has a more stable performance than the dCRT; the dCRT can be too conservative when $N$, the unlabeled sample size, is relatively small. Moreover, in the $n>p$ logistic regression setting, the dCRTs have observed type 1 error slightly exceeding the threshold.

5.4.3 Misspecification From Reusing Data

Though the theoretical guarantee is lost, when we do not have abundant unlabeled data we can estimate the conditional distribution by reusing the labeled data. We consider procedures similar to those in the last part, except that we estimate $Q(\cdot|Z)$ on labeled data instead of unlabeled data. In this case, $Q$ is no longer independent of the data used to construct the test statistics, and hence the robustness results proposed earlier do not apply.
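A sketch of this reuse (ours), assuming the fit_conditional_gaussian and sample_pseudo helpers from the previous subsection's sketch:

```python
import numpy as np

# Reusing the labeled data: x and z below are the same arrays later used to
# compute the test statistics, so the fitted Q is no longer independent of
# the data; this is exactly why the theoretical guarantee is lost.
zeta_hat, sigma2_hat = fit_conditional_gaussian(x, z)
x_pseudo = sample_pseudo(z, zeta_hat, sigma2_hat, b=200,
                         rng=np.random.default_rng(2))
```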

[Figure 7: two panels, (a) linear and (b) logistic; caption below.]
Figure 7: The CRRT and the dCRT perform similarly under misspecification of the conditional distribution originating from reusing labeled data. The y-axis shows the proportion of times the null hypothesis of conditional independence is rejected out of 500 simulations; the x-axis shows the sample size of the labeled data. Legend definitions are as in Figure 5.

Results: We consider a linear regression model and a logistic regression model, with results shown in Figure 7. It is hard to tell whether the CRRT outperforms the dCRT here: they perform similarly, with the dCRT being relatively more aggressive. Encouragingly, both tests keep the observed type 1 error roughly at the desired level. We can also see that their performance does not depend on $n$, the labeled sample size.

6 Discussion

This paper proposes the CRRT, a new method for testing conditional independence that generalizes the CRT proposed in [CFJL18]. It is more flexible, more efficient and comparably powerful, with theoretically well-grounded type 1 error control and robustness guarantees. We only discuss the case where the response variable $Y$ is univariate, but the results presented in this paper extend trivially to multi-dimensional settings.

Again, conditional independence is not testable without assumptions when the conditioning variable is continuous ([Ber04], [SP18]). Therefore, we need some prior information on the covariate model or the response model. The CRRT focuses on the former, like other model-X methods. However, knowledge concerning the response model is usually not completely absent. It would be pragmatically attractive to develop conditional sampling based methods that can incorporate information from both the covariate model and the response model.

Reusing data is common in practice, though it usually leads to overfitting. It would be interesting to develop theory justifying the use of the CRRT combined with this approach, in terms of type 1 error control and robustness. Splitting the data is a popular option when we must perform estimation and testing on the same copy of the data ([WR09]). However, it is well known that simple splitting can cause substantial power loss; some cross-splitting approaches have been proposed as remedies ([CCD+18], [BC+19]). It seems promising to combine these approaches with the CRRT.

References and Notes

  • [BC+13] Alexandre Belloni, Victor Chernozhukov, et al. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547, 2013.
  • [BC+15] Rina Foygel Barber, Emmanuel J Candès, et al. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
  • [BC+19] Rina Foygel Barber, Emmanuel J Candès, et al. A knockoff filter for high-dimensional selective inference. The Annals of Statistics, 47(5):2504–2537, 2019.
  • [BCS18] Rina Foygel Barber, Emmanuel J Candès, and Richard J Samworth. Robust inference with knockoffs. arXiv preprint arXiv:1801.03896, 2018.
  • [Ber04] Wicher Pieter Bergsma. Testing conditional independence for continuous random variables. Eurandom, 2004.
  • [BM+58] GEP Box, Mervin E Muller, et al. A note on the generation of random normal deviates. The Annals of Mathematical Statistics, 29(2):610–611, 1958.
  • [Bre01] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • [BRT+09] Peter J Bickel, Ya’acov Ritov, Alexandre B Tsybakov, et al. Simultaneous analysis of lasso and dantzig selector. The Annals of statistics, 37(4):1705–1732, 2009.
  • [BSSC20] Stephen Bates, Matteo Sesia, Chiara Sabatti, and Emmanuel Candes. Causal inference in genetic trio studies. arXiv preprint arXiv:2002.09644, 2020.
  • [BWBS19] Thomas B Berrett, Yi Wang, Rina Foygel Barber, and Richard J Samworth. The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2019.
  • [Cam06] Luis M de Campos. A scoring function for learning bayesian networks based on mutual information and conditional independence tests. Journal of Machine Learning Research, 7(Oct):2149–2187, 2006.
  • [CBL98] Jie Cheng, David Bell, and Weiru Liu. Learning bayesian networks from data: An efficient approach based on information theory. Technical report, available at http://www.cs.ualberta.ca/~jcheng/bnpc.htm, 1998.
  • [CCD+18] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters, 2018.
  • [CFJL18] Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
  • [CG16] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
  • [CL+13] Arindam Chatterjee, Soumendra N Lahiri, et al. Rates of convergence of the adaptive lasso estimators to the oracle distribution and higher order refinements by the bootstrap. The Annals of Statistics, 41(3):1232–1259, 2013.
  • [CLL11] Tony Cai, Weidong Liu, and Xi Luo. A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
  • [Daw79] A Philip Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society: Series B (Methodological), 41(1):1–15, 1979.
  • [DMZS14] Gary Doran, Krikamol Muandet, Kun Zhang, and Bernhard Schölkopf. A permutation-based kernel conditional independence test. In UAI, pages 132–141, 2014.
  • [EHJ+04] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of statistics, 32(2):407–499, 2004.
  • [FGSS08] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in neural information processing systems, pages 489–496, 2008.
  • [FHT08] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
  • [GFT+08] Arthur Gretton, Kenji Fukumizu, Choon H Teo, Le Song, Bernhard Schölkopf, and Alex J Smola. A kernel statistical test of independence. In Advances in neural information processing systems, pages 585–592, 2008.
  • [GHS+05] Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, and Bernhard Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6(Dec):2075–2129, 2005.
  • [GVO11] Irène Gijbels, Noël Veraverbeke, and Marel Omelka. Conditional copulas, association measures and their applications. Computational Statistics & Data Analysis, 55(5):1919–1932, 2011.
  • [GZ18] Jaime Roquero Gimenez and James Zou. Improving the stability of the knockoff procedure: Multiple simultaneous knockoffs and entropy maximization. arXiv preprint arXiv:1810.11378, 2018.
  • [H+10] Tzee-Ming Huang et al. Testing conditional independence using maximal nonlinear conditional correlation. The Annals of Statistics, 38(4):2047–2091, 2010.
  • [HJ19] Dongming Huang and Lucas Janson. Relaxing the assumptions of knockoffs by conditioning. arXiv preprint arXiv:1903.02806, 2019.
  • [HL10] Radoslav Harman and Vladimír Lacko. On decompositional algorithms for uniform sampling from n-spheres and n-balls. Journal of Multivariate Analysis, 101(10):2297–2304, 2010.
  • [JM14a] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.
  • [JM14b] Adel Javanmard and Andrea Montanari. Hypothesis testing in high-dimensional regression under the gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522–6554, 2014.
  • [KB07] Markus Kalisch and Peter Bühlmann. Estimating high-dimensional directed acyclic graphs with the pc-algorithm. Journal of Machine Learning Research, 8(Mar):613–636, 2007.
  • [LC21] Shuangning Li and Emmanuel J Candès. Deploying the conditional randomization test in high multiplicity problems. arXiv preprint arXiv:2110.02422, 2021.
  • [Lec01] Michael Lechner. Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. In Econometric evaluation of labour market policies, pages 43–58. Springer, 2001.
  • [LF16] Lihua Lei and William Fithian. Power of ordered hypothesis testing. In International Conference on Machine Learning, pages 2924–2932, 2016.
  • [LF18] Lihua Lei and William Fithian. Adapt: an interactive procedure for multiple testing with side information. Journal of the Royal Statistical Society, 2018.
  • [LG97] Oliver Linton and Pedro Gozalo. Conditional independence restrictions: testing and estimation. V Cowles Foundation Discussion Paper, 1140, 1997.
  • [LKJR20] Molei Liu, Eugene Katsevich, Lucas Janson, and Aaditya Ramdas. Fast and powerful conditional randomization testing via distillation. arXiv preprint arXiv:2006.03980, 2020.
  • [Lou14] Gilles Louppe. Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502, 2014.
  • [LRF17] Lihua Lei, Aaditya Ramdas, and William Fithian. Star: A general interactive framework for fdr control under structural constraints. arXiv preprint arXiv:1710.02776, 2017.
  • [M+46] Frederick Mosteller et al. On some useful "inefficient" statistics. The Annals of Mathematical Statistics, 17(4):377–408, 1946.
  • [Pea88] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
  • [RBL+08] Adam J Rothman, Peter J Bickel, Elizaveta Levina, Ji Zhu, et al. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
  • [Ros02] Paul Rosenbaum. Observational studies. Springer, 2 edition, 2002.
  • [RSC19] Yaniv Romano, Matteo Sesia, and Emmanuel Candès. Deep knockoffs. Journal of the American Statistical Association, pages 1–12, 2019.
  • [Run18] Jakob Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In International Conference on Artificial Intelligence and Statistics, pages 938–947, 2018.
  • [RWL+10] Pradeep Ravikumar, Martin J Wainwright, John D Lafferty, et al. High-dimensional ising model selection using l1l_{1}-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
  • [S+09] Kyungchul Song et al. Testing conditional independence via rosenblatt transforms. The Annals of Statistics, 37(6B):4011–4045, 2009.
  • [SGSH00] Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT press, 2000.
  • [SKB+20] Matteo Sesia, Eugene Katsevich, Stephen Bates, Emmanuel Candès, and Chiara Sabatti. Multi-resolution localization of causal variants across the genome. Nature communications, 11(1):1–10, 2020.
  • [SP18] Rajen D Shah and Jonas Peters. The hardness of conditional independence testing and the generalised covariance measure. arXiv preprint arXiv:1804.07203, 2018.
  • [Spi10] Peter Spirtes. Introduction to causal inference. Journal of Machine Learning Research, 11(5), 2010.
  • [SSC19] Matteo Sesia, Chiara Sabatti, and Emmanuel J Candès. Gene hunting with hidden markov model knockoffs. Biometrika, 106(1):1–18, 2019.
  • [SSS+17] Rajat Sen, Ananda Theertha Suresh, Karthikeyan Shanmugam, Alexandros G Dimakis, and Sanjay Shakkottai. Model-powered conditional independence test. In Advances in neural information processing systems, pages 2951–2961, 2017.
  • [SV95] Moninder Singh and Marco Valtorta. Construction of bayesian network structures from data: a brief survey and an efficient algorithm. International journal of approximate reasoning, 12(2):111–131, 1995.
  • [SW07] Liangjun Su and Halbert White. A consistent characteristic function-based test for conditional independence. Journal of Econometrics, 141(2):807–834, 2007.
  • [SW08] Liangjun Su and Halbert White. A nonparametric hellinger metric test for conditional independence. Econometric Theory, pages 829–864, 2008.
  • [SW14] Liangjun Su and Halbert White. Testing conditional independence via empirical likelihood. Journal of Econometrics, 182(1):27–44, 2014.
  • [SZ12] Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.
  • [SZV19] Eric V Strobl, Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 7(1), 2019.
  • [TB10] Ioannis Tsamardinos and Giorgos Borboudakis. Permutation testing improves bayesian network learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 322–337. Springer, 2010.
  • [TVZ+18] Wesley Tansey, Victor Veitch, Haoran Zhang, Raul Rabadan, and David M Blei. The holdout randomization test: Principled and easy black box feature selection. arXiv preprint arXiv:1811.00645, 2018.
  • [VdGBR+14] Sara Van de Geer, Peter Bühlmann, Ya’acov Ritov, Ruben Dezeure, et al. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
  • [VGS17] Aaron R Voelker, Jan Gosmann, and Terrence C Stewart. Efficiently sampling vectors and coordinates from the n-sphere and n-ball. Technical report, Centre for Theoretical Neuroscience, Waterloo, ON, 2017. doi: 10.13140 …
  • [VOG11] Noël Veraverbeke, Marek Omelka, and Irene Gijbels. Estimation of a conditional copula and association measures. Scandinavian Journal of Statistics, 38(4):766–780, 2011.
  • [WPH+15] Xueqin Wang, Wenliang Pan, Wenhao Hu, Yuan Tian, and Heping Zhang. Conditional distance correlation. Journal of the American Statistical Association, 110(512):1726–1734, 2015.
  • [WR09] Larry Wasserman and Kathryn Roeder. High dimensional variable selection. Annals of statistics, 37(5A):2178, 2009.
  • [YL07] Ming Yuan and Yi Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.
  • [ZH+08] Cun-Hui Zhang, Jian Huang, et al. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594, 2008.
  • [Zhu05] Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2005.
  • [ZPJS12] Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775, 2012.
  • [ZZ14] Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pages 217–242, 2014.

Appendix

Appendix A Supplementary Analysis on the Robustness of CRRT

A.1 Continued Analysis of Example 2

Here, we assume that $\lim_{n\rightarrow\infty}b=\infty$. According to the result provided in Theorem 3, for any measurable functions of $z$, $\varepsilon_{1}(z),\ldots,\varepsilon_{b}(z)$, the following inequality holds:

$$\mathbb{P}\left(P_{CRRT}\leq\alpha\right)\leq\mathbb{E}\left[\frac{\lfloor(b+1)\alpha\rfloor}{1+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}}\right]+\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}-M_{0}>\varepsilon_{k}\right\}\right).\tag{7}$$

We first work with the second term on the right-hand side:

$$\begin{aligned}
\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}-M_{0}>\varepsilon_{k}\right\}\right)
&=\mathbb{E}\left[\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}-M_{0}>\varepsilon_{k}\right\}\,\middle|\,z\right)\right]\\
&=\mathbb{E}\left[\mathbbm{1}\left\{|M_{0}|\leq\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right\}\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}-M_{0}>\varepsilon_{k}\right\}\,\middle|\,z\right)\right]\\
&\quad+\mathbb{E}\left[\mathbbm{1}\left\{|M_{0}|>\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right\}\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}-M_{0}>\varepsilon_{k}\right\}\,\middle|\,z\right)\right]\\
&\leq\mathbb{E}\left[\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}>\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right\}\,\middle|\,z\right)\right]+\mathbb{P}\left(|M_{0}|>\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right)\\
&=\mathbb{E}\left[\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}>\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right\}\,\middle|\,z\right)\right]+o(1),
\end{aligned}\tag{8}$$

where the third step uses the fact that $\sqrt{\frac{N}{np^{2}\log p}}M_{0}=O_{p}(1)$. Next, we have

$$\begin{aligned}
&\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}>\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right\}\,\middle|\,z\right)\\
&\leq\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{\sum_{i=1}^{b}\mathbbm{1}\left\{M_{i}\leq\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right\}\leq k\right\}\,\middle|\,z\right)\\
&=\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{\frac{1}{b}\sum_{i=1}^{b}\mathbbm{1}\left\{M_{i}\leq\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right\}\leq\frac{k}{b}\right\}\,\middle|\,z\right)\\
&=\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{F_{b}\left(\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right)-F\left(\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right)\leq\frac{k}{b}-F\left(\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right)\right\}\,\middle|\,z\right),
\end{aligned}\tag{9}$$

where $F_{b}$ is the empirical CDF of $M_{1}|z$ and $F$ is the true CDF of $M_{1}|z$. Denote $\eta\triangleq\frac{\log b}{b}$ and $\delta\triangleq e^{-\frac{h^{2}(n)}{2}}$, where $h(n)$ satisfies $\lim_{n\rightarrow\infty}h(n)=\infty$ and $\lim_{n\rightarrow\infty}\sqrt{\frac{np^{2}\log p}{N}}f(n)h(n)=0$, and set

$$\begin{cases}\varepsilon_{k}=F^{-1}\left(\frac{k}{b}+\eta\right)+\sqrt{\frac{np^{2}\log p}{N}}f(n),&\text{if }1\leq k\leq b(1-\eta-\delta),\\ \varepsilon_{k}=\infty,&\text{if }b(1-\eta-\delta)<k\leq b.\end{cases}$$

Under these settings, $\frac{k}{b}-F\left(\varepsilon_{k}-\sqrt{\frac{np^{2}\log p}{N}}f(n)\right)=-\eta$ when $1\leq k\leq b(1-\eta-\delta)$. Hence, based on (9), we have

$$\begin{aligned}
&\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}>\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right\}\,\middle|\,z\right)\\
&=\mathbb{P}\left(\bigcup_{k=1}^{\lfloor b(1-\eta-\delta)\rfloor}\left\{F_{b}\left(\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right)-F\left(\varepsilon_{k}-\sqrt{\tfrac{np^{2}\log p}{N}}f(n)\right)\leq-\eta\right\}\,\middle|\,z\right)\\
&\leq\mathbb{P}\left(\inf_{x\in\mathbb{R}}\left\{F_{b}(x)-F(x)\right\}\leq-\eta\,\middle|\,z\right)\leq e^{-2b\eta^{2}},
\end{aligned}\tag{10}$$

where the last step is based on the Dvoretzky–Kiefer–Wolfowitz inequality. Combining the results in (10) and (8), we see that $\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}-M_{0}>\varepsilon_{k}\right\}\right)=o(1)$ as $n\rightarrow\infty$. Denote $D\triangleq\frac{1}{\sigma^{2}}(\hat{\beta}-\beta)^{T}z^{T}z(\hat{\beta}-\beta)$. Then, for $1\leq k\leq b(1-\eta-\delta)$,

$$\begin{aligned}
&\mathbb{P}\left(\frac{M_{1}-D}{\sqrt{D}}>\sqrt{-2\log\left(1-\tfrac{k}{b}-\eta\right)}\,\middle|\,z\right)\leq e^{-\frac{1}{2}\left[-2\log\left(1-\frac{k}{b}-\eta\right)\right]}=1-\frac{k}{b}-\eta\\
\Rightarrow\;&\frac{k}{b}+\eta\leq\mathbb{P}\left(\frac{M_{1}-D}{\sqrt{D}}\leq\sqrt{-2\log\left(1-\tfrac{k}{b}-\eta\right)}\,\middle|\,z\right)=F\left(\sqrt{-2D\log\left(1-\tfrac{k}{b}-\eta\right)}+D\right)\\
\Rightarrow\;&F^{-1}\left(\frac{k}{b}+\eta\right)\leq\sqrt{-2D\log\left(1-\tfrac{k}{b}-\eta\right)}+D\\
\Rightarrow\;&\varepsilon_{k}\leq\sqrt{-2D\log\left(1-\tfrac{k}{b}-\eta\right)}+D+\sqrt{\tfrac{np^{2}\log p}{N}}f(n),
\end{aligned}\tag{11}$$

where the first inequality follows from a standard concentration inequality. Under the current settings of the $\varepsilon_{k}$'s, we have

$$\begin{aligned}
\mathbb{E}\left[\frac{\lfloor(b+1)\alpha\rfloor}{1+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}}\right]
&=\mathbb{E}\left[\frac{\lfloor(b+1)\alpha\rfloor}{1+\sum_{k=1}^{\lfloor b(1-\eta-\delta)\rfloor}e^{-\varepsilon_{k}}}\right]\\
&\leq\mathbb{E}\left[\frac{\lfloor(b+1)\alpha\rfloor}{1+\sum_{k=1}^{\lfloor b(1-\eta-\delta)\rfloor}\exp\left\{-\left[\sqrt{-2D\log\left(1-\frac{k}{b}-\eta\right)}+D+\sqrt{\frac{np^{2}\log p}{N}}f(n)\right]\right\}}\right]\\
&\leq\mathbb{E}\left[\frac{\lfloor(b+1)\alpha\rfloor}{1+\sum_{k=1}^{\lfloor b(1-\eta-\delta)\rfloor}\exp\left\{-\left[\sqrt{-2\frac{np^{2}\log p}{N}f^{2}(n)\log\left(1-\frac{k}{b}-\eta\right)}+\frac{np^{2}\log p}{N}f^{2}(n)+\sqrt{\frac{np^{2}\log p}{N}}f(n)\right]\right\}}\right]+o(1)\\
&\leq\mathbb{E}\left[\frac{\lfloor(b+1)\alpha\rfloor}{1+\sum_{k=1}^{\lfloor b(1-\eta-\delta)\rfloor}\exp\left\{-\left[\sqrt{\frac{np^{2}\log p}{N}f^{2}(n)h^{2}(n)}+\frac{np^{2}\log p}{N}f^{2}(n)+\sqrt{\frac{np^{2}\log p}{N}}f(n)\right]\right\}}\right]+o(1)\\
&\rightarrow\alpha,
\end{aligned}\tag{12}$$

as $n\rightarrow\infty$. Note that $F$ depends only on $z$, so the $\varepsilon_{k}$'s are independent of $x$ and $y$ conditional on $z$. Thus, combining (12), (7) and the fact that $\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{M_{(k)}-M_{0}>\varepsilon_{k}\right\}\right)=o(1)$ as $n\rightarrow\infty$, we conclude that $\mathbb{P}(P_{CRRT}\leq\alpha)\leq\alpha+o(1)$ as $n\rightarrow\infty$.

A.2 The Impact of Large $b$, the Number of Conditional Samplings

For convenience, we restate the last bound in Theorem 4 here:

$$\mathbb{P}(P_{CRRT}\leq\alpha)\leq\inf_{\varepsilon_{1},\ldots,\varepsilon_{b}\geq 0}\left\{\frac{\lfloor(b+1)\alpha\rfloor}{1+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}}+\mathbb{P}\left(\bigcup_{k=1}^{b}\{\widehat{KL}_{(k)}>\varepsilon_{k}\}\right)\right\}.\tag{13}$$
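As a quick sanity check (our remark; it follows directly from (13)): if $Q=Q^{\star}$, then $\widehat{KL}_{(k)}=0$ almost surely for every $k$, so taking $\varepsilon_{1}=\ldots=\varepsilon_{b}=0$ yields

$$\mathbb{P}(P_{CRRT}\leq\alpha)\leq\frac{\lfloor(b+1)\alpha\rfloor}{1+b}+0\leq\alpha,$$

recovering exact type 1 error control for every $b$.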

It appears that $\mathbb{P}\left(\bigcup_{k=1}^{b}\{\widehat{KL}_{(k)}>\varepsilon_{k}\}\right)$ is monotonically increasing in $b$; this is indeed the case when $\varepsilon_{1},\ldots,\varepsilon_{b}$ are all identical and fixed. Such a phenomenon is counterintuitive and may make this upper bound meaningless in practice. Luckily, $\varepsilon_{1},\ldots,\varepsilon_{b}$ are adjustable and can vary with $b$. We roughly show that for any given small $\epsilon>0$, there exists a function $g(Q,Q^{\star},\alpha)$ such that

$$\inf_{\varepsilon_{1},\ldots,\varepsilon_{b}\geq 0}\left\{\frac{\lfloor(b+1)\alpha\rfloor}{1+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}}+\mathbb{P}\left(\bigcup_{k=1}^{b}\{\widehat{KL}_{(k)}>\varepsilon_{k}\}\right)\right\}\leq g(Q,Q^{\star},\alpha)+o(1),\quad\text{as }b\rightarrow\infty,$$

and

$$g(Q,Q^{\star},\alpha)=\alpha(1+\epsilon),\quad\text{if }Q=Q^{\star}.$$

Admittedly, the above conclusion does not show that the bound in (13) is monotonically decreasing in $b$, but it at least implies that a large $b$ will not have a severe negative impact when the conditional distribution is misspecified. We do not intend to give a thorough and rigorous proof; we only sketch the ideas here. For $k=1,\ldots,b$, let

$$\widehat{KL}_{k}=\sum_{i=1}^{n}\log\frac{Q^{\star}(x_{i}|z_{[i]})Q(x_{i}^{(k)}|z_{[i]})}{Q(x_{i}|z_{[i]})Q^{\star}(x_{i}^{(k)}|z_{[i]})}\triangleq\bar{T}_{k}-\bar{T}_{0},$$

where $\bar{T}_{k}=\sum_{i=1}^{n}\log\frac{Q(x_{i}^{(k)}|z_{[i]})}{Q^{\star}(x_{i}^{(k)}|z_{[i]})}$ and $\bar{T}_{0}=\sum_{i=1}^{n}\log\frac{Q(x_{i}|z_{[i]})}{Q^{\star}(x_{i}|z_{[i]})}$. Denote the order statistics of $\bar{T}_{1},\ldots,\bar{T}_{b}$ by $\bar{T}_{(1)}\leq\ldots\leq\bar{T}_{(b)}$. We observe that the marginal distributions of $\bar{T}_{0},\bar{T}_{1},\ldots,\bar{T}_{b}$ do not vary with $b$. Moreover, conditional on $z$, $\bar{T}_{1},\ldots,\bar{T}_{b}$ are i.i.d. and independent of $\bar{T}_{0}$. Denote the distributions of $\bar{T}_{1}$ and $\bar{T}_{0}$ conditional on $z$ by $F_{1}(\cdot|z)$ and $F_{0}(\cdot|z)$, respectively, and the associated quantile functions by $F_{1}^{-1}$ and $F_{0}^{-1}$. For any fixed positive integer $m>1$, let $\varepsilon_{k}=\varepsilon_{\lceil\frac{lb}{m}\rceil}$ if $\lceil\frac{(l-1)b}{m}\rceil<k\leq\lceil\frac{lb}{m}\rceil$ for some integer $1\leq l\leq m$, and

$$\begin{cases}\varepsilon_{b}=\infty,&\\ \varepsilon_{\lceil\frac{lb}{m}\rceil}=\max\{0,F_{1}^{-1}(\frac{l}{m})-F_{0}^{-1}(\mathbb{E}\theta_{z})\},&l=1,\ldots,m-1,\end{cases}$$

where $\theta_{z}$ can be any measure of the distance between $Q$ and $Q^{\star}$ such that $\theta_{z}=0$ when $Q$ and $Q^{\star}$ are identical. Then, we have

$$\begin{aligned}
\mathbb{P}\left(\bigcup_{k=1}^{b}\{\widehat{KL}_{(k)}>\varepsilon_{k}\}\right)
&\leq\mathbb{P}\left(\bigcup_{l=1}^{m}\{\widehat{KL}_{(\lceil\frac{lb}{m}\rceil)}>\varepsilon_{\lceil\frac{lb}{m}\rceil}\}\right)\\
&=\mathbb{P}\left(\bigcup_{l=1}^{m}\{\bar{T}_{(\lceil\frac{lb}{m}\rceil)}-\varepsilon_{\lceil\frac{lb}{m}\rceil}>\bar{T}_{0}\}\right)\\
&=\mathbb{E}\left[\mathbb{P}\left(\bigcup_{l=1}^{m}\{\bar{T}_{(\lceil\frac{lb}{m}\rceil)}-\varepsilon_{\lceil\frac{lb}{m}\rceil}>\bar{T}_{0}\}\,\middle|\,z\right)\right]\\
&\approx\mathbb{E}\left[\mathbb{P}\left(\bigcup_{l=1}^{m}\{\bar{T}_{0}<F_{1}^{-1}\left(\tfrac{l}{m}\right)-\varepsilon_{\lceil\frac{lb}{m}\rceil}\}\,\middle|\,z\right)\right]\\
&\leq\mathbb{E}\left[\mathbb{P}\left(\bar{T}_{0}<F_{0}^{-1}(\mathbb{E}\theta_{z})\,\middle|\,z\right)\right]\\
&=\mathbb{E}\theta_{z},
\end{aligned}\tag{14}$$

where the fourth step is based on the asymptotic results for order statistics given in [M+46]. For any given $z$,

$$\mathbb{P}\left(\bigcup_{l=1}^{m}\{\bar{T}_{(\lceil\frac{lb}{m}\rceil)}-\varepsilon_{\lceil\frac{lb}{m}\rceil}>\bar{T}_{0}\}\,\middle|\,z\right)=\mathbb{P}\left(\bigcup_{l=1}^{m}\{\bar{T}_{0}<F_{1}^{-1}\left(\tfrac{l}{m}\right)-\varepsilon_{\lceil\frac{lb}{m}\rceil}\}\,\middle|\,z\right)+o(1)$$

as $b\rightarrow\infty$. Hence, under mild conditions, for any given small $\nu>0$, there exists $S_{Z}\subseteq\mathbb{R}^{n\times p}$ such that $\mathbb{P}(z\in S_{Z})>1-\nu$ and

$$\mathbb{P}\left(\bigcup_{l=1}^{m}\{\bar{T}_{(\lceil\frac{lb}{m}\rceil)}-\varepsilon_{\lceil\frac{lb}{m}\rceil}>\bar{T}_{0}\}\,\middle|\,z\right)=\mathbb{P}\left(\bigcup_{l=1}^{m}\{\bar{T}_{0}<F_{1}^{-1}\left(\tfrac{l}{m}\right)-\varepsilon_{\lceil\frac{lb}{m}\rceil}\}\,\middle|\,z\right)+o(1)$$

holds uniformly for $z\in S_{Z}$. Next, we have

$$\begin{aligned}
\frac{\lfloor(b+1)\alpha\rfloor}{1+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}}
&=\frac{\lfloor(b+1)\alpha\rfloor}{1+\sum_{l=1}^{m-1}\sum_{k=\lceil\frac{(l-1)b}{m}\rceil+1}^{\lceil\frac{lb}{m}\rceil}\exp\left(-\max\{0,F_{1}^{-1}(\frac{l}{m})-F_{0}^{-1}(\mathbb{E}\theta_{z})\}\right)}\\
&=\frac{\alpha}{\frac{1}{m}\sum_{l=1}^{m-1}\exp\left(-\max\{0,F_{1}^{-1}(\frac{l}{m})-F_{0}^{-1}(\mathbb{E}\theta_{z})\}\right)}+o(1),
\end{aligned}\tag{15}$$

which equals $\frac{m}{m-1}\alpha+o(1)$ when $Q$ and $Q^{\star}$ are identical, as $b\rightarrow\infty$. Combining the results in (14) and (15) leads to our claim at the beginning.
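For intuition, the observed KL statistics appearing above can be computed directly from log-density ratios; here is a minimal sketch (ours; Gaussian conditional models are assumed purely for illustration):

```python
import numpy as np
from scipy.stats import norm

def log_ratio(x, mu_q, sd_q, mu_qstar, sd_qstar):
    """T_bar = sum_i [log Q(x_i|z_i) - log Q*(x_i|z_i)] under Gaussian models."""
    return np.sum(norm.logpdf(x, mu_q, sd_q) - norm.logpdf(x, mu_qstar, sd_qstar))

def observed_kl(x, x_pseudo, mu_q, sd_q, mu_qstar, sd_qstar):
    """KL_hat_k = T_bar_k - T_bar_0 for each of the b pseudo columns of x_pseudo."""
    t0 = log_ratio(x, mu_q, sd_q, mu_qstar, sd_qstar)
    tk = np.array([log_ratio(x_pseudo[:, k], mu_q, sd_q, mu_qstar, sd_qstar)
                   for k in range(x_pseudo.shape[1])])
    return tk - t0
```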

A.3 A Generalization of Theorem 4

Here, we give a general type 1 error upper bound, with Theorem 4 as a special case.

Theorem 6.

(Type 1 Error Upper Bound 3.) Suppose the conditions of Theorem 4 hold. Let $\{u_{0},u_{1},\ldots,u_{b}\}$ be the unordered set of $\{x,x^{(1)},\ldots,x^{(b)}\}$. Denote $D(\cdot,z)\triangleq\frac{Q^{\star}(\cdot|z)}{Q(\cdot|z)}$. For a given $s$, denote the order statistics of $\{D(u_{k},z):k\neq s\}$ by $D^{s}_{(1)}\leq D^{s}_{(2)}\leq\ldots\leq D^{s}_{(b)}$. For any given measurable functions

$$\varepsilon_{1}(u_{0},u_{1},\ldots,u_{b},z,y),\ldots,\varepsilon_{b}(u_{0},u_{1},\ldots,u_{b},z,y),$$
$$\eta_{1}(u_{0},u_{1},\ldots,u_{b},z,y),\ldots,\eta_{b}(u_{0},u_{1},\ldots,u_{b},z,y),$$

and any given $\alpha\in[0,1]$, suppose the null hypothesis $H_{0}:X\perp Y|Z$ holds. For conciseness, we denote

$$I(s)\triangleq\mathbbm{1}\left\{\bigcap_{k=1}^{b}\left\{e^{\eta_{k}}\leq\frac{D(u_{s-1},z)}{D^{s-1}_{(b+1-k)}}\leq e^{\varepsilon_{k}}\right\}\right\}.$$

Then,

$$\begin{aligned}
&\mathbb{P}(P_{CRRT}\leq\alpha,\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}|u_{0},\ldots,u_{b},z,y)\\
&\leq\min\left\{\sum_{s=1}^{\lfloor(b+1)\alpha\rfloor}\frac{I(s)}{I(s)+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}},\;1-\sum_{s=\lfloor(b+1)\alpha\rfloor+1}^{b+1}\frac{I(s)}{I(s)+e^{-\eta_{1}}+\ldots+e^{-\eta_{b}}}\right\}.
\end{aligned}\tag{16}$$

This leads to

$$\begin{aligned}
&\mathbb{P}(P_{CRRT}\leq\alpha,\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b})\\
&\leq\min\left\{\mathbb{E}\left[\sum_{s=1}^{\lfloor(b+1)\alpha\rfloor}\frac{I(s)}{I(s)+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}}\right],\;\mathbb{E}\left[1-\sum_{s=\lfloor(b+1)\alpha\rfloor+1}^{b+1}\frac{I(s)}{I(s)+e^{-\eta_{1}}+\ldots+e^{-\eta_{b}}}\right]\right\}\\
&\triangleq\min\left\{M_{1},M_{2}\right\}.
\end{aligned}$$

Since $\varepsilon_{1},\ldots,\varepsilon_{b},\eta_{1},\ldots,\eta_{b}$ are arbitrary, we obtain the following bound directly controlling the type 1 error:

$$\mathbb{P}(P_{CRRT}\leq\alpha)\leq\inf_{\varepsilon_{1},\ldots,\varepsilon_{b},\eta_{1},\ldots,\eta_{b}}\left\{\min\{M_{1},M_{2}\}+\mathbb{P}\left(\bigcup_{k=1}^{b}\left\{\{\widehat{KL}_{(k)}>\varepsilon_{k}\}\cup\{\widehat{KL}_{(k)}<\eta_{k}\}\right\}\right)\right\}.$$
Proof of Theorem 6.

We use the same notation and similar ideas as in the proof of Theorem 4. For any $s\neq r\in\{1,\ldots,b+1\}$,

$$\begin{aligned}
&\mathbb{P}\left(T_{0}\text{ ranks }s\text{th},\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}\,\middle|\,u_{0},u_{1},\ldots,u_{b},z,y\right)\\
&=\mathbb{E}\left(\mathbbm{1}\{T_{0}\text{ ranks }s\text{th}\}\mathbbm{1}\left\{\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}\right\}\,\middle|\,u_{0},u_{1},\ldots,u_{b},z,y\right)\\
&=\mathbb{E}\left(\mathbbm{1}\left\{T_{0}\text{ ranks }s\text{th}\right\}\mathbbm{1}\left\{e^{\eta_{1}}\leq\frac{D(x^{(0)},z)}{D_{(b)}}\leq e^{\varepsilon_{1}},\ldots,e^{\eta_{b}}\leq\frac{D(x^{(0)},z)}{D_{(1)}}\leq e^{\varepsilon_{b}}\right\}\,\middle|\,u_{0},u_{1},\ldots,u_{b},z,y\right)\\
&=\mathbb{E}\left(\mathbbm{1}\left\{T_{0}\text{ ranks }s\text{th}\right\}I(s)\,\middle|\,u_{0},u_{1},\ldots,u_{b},z,y\right)\\
&=I(s)\,\mathbb{P}\left(T_{0}\text{ ranks }s\text{th}\,\middle|\,u_{0},u_{1},\ldots,u_{b},z,y\right)\\
&=I(s)\frac{D(u_{s-1},z)}{D(u_{r-1},z)}\mathbb{P}\left(T_{0}\text{ ranks }r\text{th}\,\middle|\,u_{0},u_{1},\ldots,u_{b},z,y\right),
\end{aligned}\tag{17}$$

where the last step holds by (26). In fact, we have

$$I(s)\frac{D(u_{s-1},z)}{D(u_{r-1},z)}=I(s)\frac{D(u_{s-1},z)}{D^{s-1}_{(b+1-\Phi_{s-1}(r-1))}}\in\left[I(s)e^{\eta_{\Phi_{s-1}(r-1)}},\,I(s)e^{\varepsilon_{\Phi_{s-1}(r-1)}}\right],\tag{18}$$

where the definition of $\Phi$ is given in the proof of Theorem 4. Combining (17) and (18), we have

$$\frac{I(s)\,\mathbb{P}\left(T_{0}\text{ ranks }r\text{th}\,\middle|\,u_{0},u_{1},\ldots,u_{b},z,y\right)}{\mathbb{P}\left(T_{0}\text{ ranks }s\text{th},\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}\,\middle|\,u_{0},u_{1},\ldots,u_{b},z,y\right)}\in\left[e^{-\varepsilon_{\Phi_{s-1}(r-1)}},e^{-\eta_{\Phi_{s-1}(r-1)}}\right].\tag{19}$$

For any given $s$, we have

$$\begin{aligned}
I(s)&=\sum_{r=1}^{b+1}I(s)\,\mathbb{P}\left(T_{0}\text{ ranks }r\text{th}\,\middle|\,u_{0},u_{1},\ldots,u_{b},z,y\right)\\
&=I(s)\,\mathbb{P}(T_{0}\text{ ranks }s\text{th}|u_{0},u_{1},\ldots,u_{b},z,y)+\sum_{1\leq r\leq b+1,\,r\neq s}I(s)\,\mathbb{P}(T_{0}\text{ ranks }r\text{th}|u_{0},u_{1},\ldots,u_{b},z,y)\\
&\geq\mathbb{P}(T_{0}\text{ ranks }s\text{th},\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}|u_{0},\ldots,u_{b},z,y)\sum_{1\leq r\leq b+1,\,r\neq s}e^{-\varepsilon_{\Phi_{s-1}(r-1)}}\\
&\quad+I(s)\,\mathbb{P}(T_{0}\text{ ranks }s\text{th}|u_{0},u_{1},\ldots,u_{b},z,y)\\
&\geq\mathbb{P}(T_{0}\text{ ranks }s\text{th},\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}|u_{0},\ldots,u_{b},z,y)\left(I(s)+\sum_{r=1}^{b}e^{-\varepsilon_{r}}\right),
\end{aligned}\tag{20}$$

where the third step is based on (19), and the last step is legitimate due to the one-to-one property of $\Phi_{s}$, $s=0,1,\ldots,b$. Therefore, we have

$$\begin{aligned}
&\mathbb{P}\left(P_{CRRT}\leq\alpha,\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}\,\middle|\,u_{0},\ldots,u_{b},z,y\right)\\
&=\sum_{s=1}^{\lfloor(b+1)\alpha\rfloor}\mathbb{P}\left(T_{0}\text{ ranks }s\text{th},\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}\,\middle|\,u_{0},\ldots,u_{b},z,y\right)\\
&\leq\sum_{s=1}^{\lfloor(b+1)\alpha\rfloor}\frac{I(s)}{I(s)+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}}.
\end{aligned}\tag{21}$$

Based on (19), and similarly to (20), we have

$$\mathbb{P}(T_{0}\text{ ranks }s\text{th},\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}|u_{0},\ldots,u_{b},z,y)\geq\frac{I(s)}{I(s)+\sum_{r=1}^{b}e^{-\eta_{r}}}.\tag{22}$$

Hence, we have

$$\begin{aligned}
&\mathbb{P}\left(P_{CRRT}\leq\alpha,\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}\,\middle|\,u_{0},\ldots,u_{b},z,y\right)\\
&\leq 1-\sum_{s=\lfloor(b+1)\alpha\rfloor+1}^{b+1}\mathbb{P}\left(T_{0}\text{ ranks }s\text{th},\eta_{1}\leq\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\eta_{b}\leq\widehat{KL}_{(b)}\leq\varepsilon_{b}\,\middle|\,u_{0},\ldots,u_{b},z,y\right)\\
&\leq 1-\sum_{s=\lfloor(b+1)\alpha\rfloor+1}^{b+1}\frac{I(s)}{I(s)+e^{-\eta_{1}}+\ldots+e^{-\eta_{b}}}.
\end{aligned}\tag{23}$$

Combining the results in (21) and (23), we conclude that (16) is valid. ∎

Appendix B Connections Between CRRT and Knockoffs

The CRRT is deeply influenced by the spirit of knockoffs ([BC+15]; [CFJL18]; [BC+19]): utilizing information on the covariate side can help us avoid the uncertainty in the relationship between the response variable and the covariates. In this section, we make this connection more concrete. Specifically, we show that there is a direct comparison between the CRRT and multiple knockoffs ([GZ18]), a variant of the original knockoffs. Based on this comparison, we point out that the version of multiple knockoffs given in [GZ18] may be too conservative, and we provide a modified version that enjoys greater power while maintaining finite-sample FDR control at the same level.

Suppose that we have original data $(y,x)$, where $y$ is an $n\times 1$ vector and $x$ is an $n\times p$ matrix, sometimes alternatively denoted by $x^{(0)}$. Each row of $(y,x)$, a realization of $(Y,X^{T})^{T}$, is independently and identically generated from some distribution $F_{Y,X}$. Note that the $x$ in this section differs from that in previous sections. For a given positive integer $b$, we generate a $b$-knockoff copy of $x^{(0)}$, denoted by $(x^{(1)},x^{(2)},\ldots,x^{(b)})$, where $x^{(k)}$ is an $n\times p$ matrix for $k=1,\ldots,b$. We call $(x^{(1)},x^{(2)},\ldots,x^{(b)})$ a $b$-knockoff copy of $x^{(0)}$ if $(x^{(0)},x^{(1)},\ldots,x^{(b)})$ satisfies the extended exchangeability defined below and $(x^{(1)},x^{(2)},\ldots,x^{(b)})\perp y|x^{(0)}$.

Definition 2.

If $\sigma=(\sigma^{1},\ldots,\sigma^{p})\in\Sigma_{b+1}\times\Sigma_{b+1}\times\ldots\times\Sigma_{b+1}$, where $\Sigma_{b+1}$ is the collection of all permutations of $\{1,2,\ldots,b+1\}$, we define $(x^{(0)},x^{(1)},\ldots,x^{(b)})_{swap(\sigma)}\triangleq(u^{(0)},\ldots,u^{(b)})$, where $u^{(k)}_{j}=x^{(\sigma^{j}_{k+1})}_{j}$ for $k=0,1,\ldots,b$ and $j=1,\ldots,p$. We say that $(x^{(0)},x^{(1)},\ldots,x^{(b)})\in\mathbb{R}^{n\times p}\times\ldots\times\mathbb{R}^{n\times p}$ is extended exchangeable if, for any $\sigma^{1},\ldots,\sigma^{p}\in\Sigma_{b+1}$,

$$(x^{(0)},x^{(1)},\ldots,x^{(b)})_{swap(\sigma)}\stackrel{d}{=}(x^{(0)},x^{(1)},\ldots,x^{(b)}),$$

where $\sigma=(\sigma^{1},\ldots,\sigma^{p})$.

To construct a multiple knockoffs selection, we can use any test function $\bar{T}=(T^{0},T^{1},\ldots,T^{b})$ satisfying, for any $\sigma=(\sigma^{1},\ldots,\sigma^{p})\in\Sigma_{b+1}\times\Sigma_{b+1}\times\ldots\times\Sigma_{b+1}$,

$$\bar{T}_{swap(\sigma)}=(T^{0},\ldots,T^{b})_{swap(\sigma)}=\bar{T}(y,(x^{(0)},\ldots,x^{(b)})_{swap(\sigma)}),\tag{24}$$

where $T^{k}:\mathbb{R}^{n}\times\mathbb{R}^{n\times[p\times(b+1)]}\rightarrow\mathbb{R}^{p}$ for $k=0,1,\ldots,b$. For $j=1,2,\ldots,p$, denote the order statistics of $\{T^{0}_{j},\ldots,T^{b}_{j}\}$, sorted in descending order, by $T^{(1)}_{j}\geq\ldots\geq T^{(b+1)}_{j}$, and define $\tau_{j}\triangleq T^{(1)}_{j}-T^{(2)}_{j}$. As in [GZ18], for simplicity, we assume that

$$\mathbb{P}\left(\bigcap_{j=1}^{p}\left\{T^{(1)}_{j}>\ldots>T^{(b+1)}_{j}\ \text{or}\ T^{(1)}_{j}=\ldots=T^{(b+1)}_{j}\right\}\right)=1.$$

Then, for $j=1,2,\ldots,p$, we can define

$$r_{j}\triangleq\begin{cases}\text{the rank of }T_{j}^{0}\text{ among }\{T_{j}^{0},\ldots,T_{j}^{b}\}\text{ (in descending order)},&\text{if }T^{(1)}_{j}>\ldots>T^{(b+1)}_{j},\\ b+1,&\text{if }T^{(1)}_{j}=\ldots=T^{(b+1)}_{j}.\end{cases}$$

In [GZ18], for any given desired FDR level $\alpha\in(0,1)$, the selection set is defined as

$$\hat{S}=\{j\in\{1,\ldots,p\}:r_{j}=1,\tau_{j}\geq\hat{\tau}\},$$

where $\hat{\tau}=\min\left\{t\geq\min\{\tau_{j}:1\leq j\leq p\}:\frac{\frac{1}{b}+\frac{1}{b}\#\{1\leq j\leq p:r_{j}>1,\tau_{j}\geq t\}}{\#\{1\leq j\leq p:r_{j}=1,\tau_{j}\geq t\}\vee 1}\leq\alpha\right\}$. They prove that this multiple knockoffs selection procedure controls the FDR at level $\alpha$.
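For concreteness, a minimal sketch (ours) of this selection rule, with ranks taken in descending order so that $r_{j}=1$ means $T^{0}_{j}$ is the largest:

```python
import numpy as np

def multiple_knockoff_select(T, alpha):
    """T: (p, b+1) array; column 0 holds the original statistics T_j^0.
    Returns the indices selected under the [GZ18] rule."""
    p, b1 = T.shape
    b = b1 - 1
    # descending rank of T_j^0 among the b+1 statistics for feature j
    r = 1 + np.sum(T[:, 1:] > T[:, [0]], axis=1)
    Tsort = np.sort(T, axis=1)[:, ::-1]      # descending order statistics
    tau = Tsort[:, 0] - Tsort[:, 1]          # gap between the top two
    for t in np.unique(tau):                 # smallest feasible threshold
        fdp_hat = (1 / b + np.sum((r > 1) & (tau >= t)) / b) \
                  / max(np.sum((r == 1) & (tau >= t)), 1)
        if fdp_hat <= alpha:
            return np.flatnonzero((r == 1) & (tau >= t))
    return np.array([], dtype=int)
```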

Now, let us examine a special case of their multiple knockoffs procedure. Assume that $X_{1}$, the first element of $X$, is our main target, and we want to check whether $H_{0}:X_{1}\perp Y|X_{-1}$ holds, where $X_{-1}\triangleq(X_{2},\ldots,X_{p})^{T}$. For $k=1,2,\ldots,b$, let $x^{(k)}=(x^{(k)}_{1},x_{2},\ldots,x_{p})$, where $x_{1}^{(1)},\ldots,x_{1}^{(b)}$ are sampled i.i.d. from the distribution of $X_{1}|X_{-1}$, independently of $x_{1}$. It is easy to check that, under this construction, $(x^{(0)},\ldots,x^{(b)})$ is extended exchangeable. For any qualified multiple knockoffs test function $\bar{T}=(T^{0},T^{1},\ldots,T^{b})$ satisfying the property stated in (24), we define a function $\bar{V}=(V^{0},V^{1},\ldots,V^{b})\triangleq(T^{0}_{1},T^{1}_{1},\ldots,T^{b}_{1})$, which is essentially a function of $(y,x_{2},\ldots,x_{p},x^{(0)}_{1},x^{(1)}_{1},\ldots,x^{(b)}_{1})$. In fact, $\bar{V}$ is $X$-symmetric if we view it as a test function for testing conditional independence. For any given $\alpha\in(0,1)$, the definition of $\hat{\tau}$ simplifies to

$$\hat{\tau}=\min\left\{t\geq\tau_{1}:\frac{1}{b}+\frac{1}{b}\mathbbm{1}\{r_{1}>1,\tau_{1}\geq t\}\leq\alpha\right\}.$$

If $\frac{2}{b}\leq\alpha$, we always have $\hat{\tau}=\tau_{1}$, and hence we select the first covariate if and only if $r_{1}=1$. If $\frac{1}{b}>\alpha$, we have $\hat{\tau}=\infty$ and hence we select no variable. If $\frac{1}{b}\leq\alpha<\frac{2}{b}$, we select the first variable if and only if $r_{1}=1$.

In comparison, the CRRT based on $\bar{V}$ rejects the null hypothesis and selects the first variable if and only if $T_{1}^{0}$ ranks at least $\lambda$th, $\lambda\triangleq\lfloor(b+1)\alpha\rfloor$, among $T^{0}_{1},T^{1}_{1},\ldots,T^{b}_{1}$. Thus, when $b<\frac{2}{\alpha}-1$, the decision rules given by multiple knockoffs and the CRRT are essentially identical: when $b<\frac{1}{\alpha}-1$, the CRRT selects no variable, and when $\frac{1}{\alpha}-1\leq b<\frac{2}{\alpha}-1$, $X_{1}$ is selected if and only if $r_{1}=1$. However, there is a great difference when $b\geq\frac{2}{\alpha}-1$: the special multiple knockoffs procedure rejects $H_{0}$ only if $T_{1}^{0}$ is the largest, while the CRRT rejects $H_{0}$ whenever $T_{1}^{0}$ ranks high enough. As $b$ grows, the above special multiple knockoffs procedure exhibits greater conservativeness, which is unnecessary. Though this comparison is unfair, because multiple knockoffs does not specialize in conditional independence testing, it still hints at how to improve multiple knockoffs. To eliminate this redundant conservativeness, we provide a modified definition of the threshold:

$$\tilde{\tau}=\min\left\{t\geq\min\{\tau_{j}:1\leq j\leq p\}:\frac{\eta(\tilde{\lambda})\left(1+\#\{1\leq j\leq p:r_{j}>\tilde{\lambda},\tau_{j}\geq t\}\right)}{\#\{1\leq j\leq p:r_{j}\leq\tilde{\lambda},\tau_{j}\geq t\}\vee 1}\leq\alpha\right\},\tag{25}$$

where $\tilde{\lambda}=\lfloor(b+1)\frac{\alpha}{\alpha+1}\rfloor$ and $\eta(\tilde{\lambda})=\frac{\tilde{\lambda}}{b+1-\tilde{\lambda}}$, and we let the selected set be $\tilde{S}=\{j\in\{1,\ldots,p\}:r_{j}\leq\tilde{\lambda},\tau_{j}\geq\tilde{\tau}\}$. This modified multiple knockoffs procedure roughly matches the CRRT in the single-variable testing problem. Since $\eta(\tilde{\lambda})\leq\alpha$, the modified selection procedure selects $X_{1}$ if and only if $r_{1}\leq\tilde{\lambda}=\lfloor(b+1)\frac{\alpha}{\alpha+1}\rfloor$, which is close to $\lambda$ when $\alpha$ is small. We conjecture that this modified version is more powerful in general model selection problems, but we do not pursue this here because we focus on the conditional testing problem. In fact, the modified multiple knockoffs procedure also controls the finite-sample FDR, like the original procedure.
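The corresponding sketch (ours) for the modified rule; relative to the previous snippet, only the rank cutoff $\tilde{\lambda}$ and the multiplier $\eta(\tilde{\lambda})$ change:

```python
import numpy as np

def modified_multiple_knockoff_select(T, alpha):
    """Modified rule (25): select features whose original statistic ranks
    in the top lambda_tilde, with the FDP estimate rescaled by eta."""
    p, b1 = T.shape
    lam = int(np.floor(b1 * alpha / (alpha + 1)))   # lambda_tilde
    eta = lam / (b1 - lam)                          # eta(lambda_tilde)
    r = 1 + np.sum(T[:, 1:] > T[:, [0]], axis=1)    # descending rank of T_j^0
    Tsort = np.sort(T, axis=1)[:, ::-1]
    tau = Tsort[:, 0] - Tsort[:, 1]
    for t in np.unique(tau):
        fdp_hat = eta * (1 + np.sum((r > lam) & (tau >= t))) \
                  / max(np.sum((r <= lam) & (tau >= t)), 1)
        if fdp_hat <= alpha:
            return np.flatnonzero((r <= lam) & (tau >= t))
    return np.array([], dtype=int)
```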

Proposition 6.

For any given desired FDR level $\alpha\in(0,1)$, the selected set $\tilde{S}$ controls the FDR at level $\alpha$. More concretely,

$$FDR=\mathbb{E}\left[\frac{|\tilde{S}\setminus S|}{|\tilde{S}|\vee 1}\right]\leq\alpha,$$

where $S$ is the Markov blanket of $Y$ ([Pea88]; [CFJL18]), which can be interpreted as the set of non-null variables.

From the proof of the above proposition, we can see that in (25) we may in fact replace $\tau_{j}\geq t$ with other conditions based on the order statistics $\check{T}=\{\{T_{j}^{(1)},T_{j}^{(2)},\ldots,T_{j}^{(b+1)}\}:j=1,2,\ldots,p\}$ while retaining FDR control, which means the multiple knockoffs framework can be generalized to be more flexible. We do not study this generalization in this paper; there may be alternative conditions enjoying optimality properties in some sense.

Appendix C Proofs

Proof of Proposition 1.

We first show that, for any $\sigma\in\Sigma_{b+1}$,

$$(T_{0}(y,z,\tilde{x}),\ldots,T_{b}(y,z,\tilde{x}))_{permute(\sigma)}\stackrel{d}{=}(T_{0}(y,z,\tilde{x}),\ldots,T_{b}(y,z,\tilde{x})).$$

Under the null hypothesis $H_{0}:X\perp Y|Z$, conditional on $(z,y)$, the vectors $x,x^{(1)},\ldots,x^{(b)}$ are i.i.d., and their conditional distribution given $(z,y)$ coincides with that given $z$ alone. Therefore, we have $\tilde{x}_{permute(\sigma)}|z,y\stackrel{d}{=}\tilde{x}|z,y$, and further $\tilde{x}_{permute(\sigma)}\stackrel{d}{=}\tilde{x}$. Thus,

$$\begin{aligned}
(T_{0}(y,z,\tilde{x}),\ldots,T_{b}(y,z,\tilde{x}))_{permute(\sigma)}
&=(T_{0}(y,z,\tilde{x}_{permute(\sigma)}),\ldots,T_{b}(y,z,\tilde{x}_{permute(\sigma)}))\\
&\stackrel{d}{=}(T_{0}(y,z,\tilde{x}),\ldots,T_{b}(y,z,\tilde{x})),
\end{aligned}$$

which is our desired result. This result directly implies that

T0(y,z,x~)=dT1(y,z,x~)=d=dTb(y,z,x~).T_{0}(y,z,\tilde{x})\stackrel{{\scriptstyle d}}{{=}}T_{1}(y,z,\tilde{x})\stackrel{{\scriptstyle d}}{{=}}\ldots\stackrel{{\scriptstyle d}}{{=}}T_{b}(y,z,\tilde{x}).

Proof of Proposition 2.

According to Proposition 1, for any σΣb+1\sigma\in\Sigma_{b+1},

(T0,,Tb)permute(σ)=d(T0,,Tb).(T_{0},\ldots,T_{b})_{permute(\sigma)}\stackrel{{\scriptstyle d}}{{=}}(T_{0},\ldots,T_{b}).

Since the order statistics of (T_{0},\ldots,T_{b})_{permute(\sigma)} are also T_{(0)}\leq T_{(1)}\leq\ldots\leq T_{(b)}, we have

(T0,,Tb)permute(σ)|z,y,T(0),,T(b)=d(T0,,Tb)|z,y,T(0),,T(b).(T_{0},\ldots,T_{b})_{permute(\sigma)}|z,y,T_{(0)},\ldots,T_{(b)}\stackrel{{\scriptstyle d}}{{=}}(T_{0},\ldots,T_{b})|z,y,T_{(0)},\ldots,T_{(b)}.

It implies that

\mathbb{P}(T_{\sigma_{1}}=T_{(0)},\ldots,T_{\sigma_{b+1}}=T_{(b)}|z,y,T_{(0)},\ldots,T_{(b)})=\mathbb{P}(T_{0}=T_{(0)},\ldots,T_{b}=T_{(b)}|z,y,T_{(0)},\ldots,T_{(b)}),

where σi\sigma_{i} is the ii-th element of σ\sigma, i=1,2,,b+1i=1,2,\ldots,b+1. Because σ\sigma is arbitrary, we know that conditional on z,y,T(0),,T(b)z,y,T_{(0)},\ldots,T_{(b)}, (T0,,Tb)(T_{0},\ldots,T_{b}) is uniformly distributed on

S={(T(0),,T(b))permute(σ):σΣb+1}.S=\{(T_{(0)},\ldots,T_{(b)})_{permute(\sigma)}:\sigma\in\Sigma_{b+1}\}.

Based on this symmetry, it follows that

(pCRRTα|z,y,T(0),T(1),,T(b))α,\mathbb{P}(p_{CRRT}\leq\alpha|z,y,T_{(0)},T_{(1)},\ldots,T_{(b)})\leq\alpha,

for any given α\alpha. ∎
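For reference in the proofs that follow, the p-value p_{CRRT} of (2), with its conservative treatment of ties, can be computed as in this minimal Python sketch (the function name is ours):

import numpy as np

def crrt_pvalue(T):
    # T[0]: statistic computed on the observed x; T[1:]: the b sampled copies.
    # Ties are counted against T[0], matching the conservative convention in the text.
    T = np.asarray(T)
    return np.sum(T[0] <= T) / len(T)    # the k = 0 term guarantees p >= 1/(b+1)

print(crrt_pvalue([2.3, 0.1, 3.2, 2.3, 0.7]))   # 3/5: one larger copy, one tie, plus T[0] itself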

Proof of Theorem 2.

Let \dot{x} be an n-dimensional vector with elements \dot{x}_{i}\sim Q(\cdot|z_{[i]}) independently, i=1,2,\ldots,n, independent of y and of \tilde{x}=(x,x^{(1)},\ldots,x^{(b)}).

Recall that pCRRT=k=0b𝟙{T0(y,z,x~)Tk(y,z,x~)}1+bp_{CRRT}=\frac{\sum\limits_{k=0}\limits^{b}\mathbbm{1}\{T_{0}(y,z,\tilde{x})\leq T_{k}(y,z,\tilde{x})\}}{1+b}, which is a function of (y,z,x~)(y,z,\tilde{x}). Define Sα={(y,z,x~):pCRRT(y,z,x~)α}S_{\alpha}=\{(y,z,\tilde{x}):p_{CRRT}(y,z,\tilde{x})\leq\alpha\}. For given zz and yy, define

S¯α(y,z)={(x,x(1),,x(b)):(y,z,x,x(1),,x(b))Sα}.\bar{S}_{\alpha}(y,z)=\{(x,x^{(1)},\ldots,x^{(b)}):(y,z,x,x^{(1)},\ldots,x^{(b)})\in S_{\alpha}\}.

Then,

(pCRRTα|y,z)=((y,z,x,x(1),,x(b))Sα|y,z)=((x,x(1),,x(b))S¯α(y,z)|y,z)((x˙,x(1),,x(b))S¯α(y,z)|y,z)+|((x˙,x(1),,x(b))S¯α(y,z)|y,z)((x,x(1),,x(b))S¯α(y,z)|y,z)|\begin{array}[]{cl}&\mathbb{P}(p_{CRRT}\leq\alpha|y,z)\\ =&\mathbb{P}((y,z,x,x^{(1)},\ldots,x^{(b)})\in S_{\alpha}|y,z)\\ =&\mathbb{P}((x,x^{(1)},\ldots,x^{(b)})\in\bar{S}_{\alpha}(y,z)|y,z)\\ \leq&\mathbb{P}((\dot{x},x^{(1)},\ldots,x^{(b)})\in\bar{S}_{\alpha}(y,z)|y,z)\\ &+|\mathbb{P}((\dot{x},x^{(1)},\ldots,x^{(b)})\in\bar{S}_{\alpha}(y,z)|y,z)-\mathbb{P}((x,x^{(1)},\ldots,x^{(b)})\in\bar{S}_{\alpha}(y,z)|y,z)|\end{array}

Since x˙\dot{x} is sampled based on Q(|Z)Q(\cdot|Z), x˙,x(1),,x(b)\dot{x},x^{(1)},\ldots,x^{(b)} are i.i.d. conditional on zz and yy. According to Theorem 1, ((x˙,x(1),,x(b))S¯α(y,z)|y,z)=(pCRRT(y,z,x˙,x(1),,x(b))α|y,z)α\mathbb{P}((\dot{x},x^{(1)},\ldots,x^{(b)})\in\bar{S}_{\alpha}(y,z)|y,z)=\mathbb{P}(p_{CRRT}(y,z,\dot{x},x^{(1)},\ldots,x^{(b)})\leq\alpha|y,z)\leq\alpha.

As for the second part, for given y,z,x^{(1)},\ldots,x^{(b)}, we define C_{\alpha}(y,z,x^{(1)},\ldots,x^{(b)})=\{\dot{x}:(y,z,\dot{x},x^{(1)},\ldots,x^{(b)})\in S_{\alpha}\}. Then,

\begin{array}{cl}&|\mathbb{P}((\dot{x},x^{(1)},\ldots,x^{(b)})\in\bar{S}_{\alpha}(y,z)|y,z)-\mathbb{P}((x,x^{(1)},\ldots,x^{(b)})\in\bar{S}_{\alpha}(y,z)|y,z)|\\ =&\left|\int\limits_{(x^{(1)},\ldots,x^{(b)})\in\mathbb{R}^{b}}\int\limits_{\dot{x}\in C_{\alpha}(y,z,x^{(1)},\ldots,x^{(b)})}dQ(\dot{x}|y,z)\,dQ(x^{(1)},\ldots,x^{(b)}|y,z)\right.\\ &\left.-\int\limits_{(x^{(1)},\ldots,x^{(b)})\in\mathbb{R}^{b}}\int\limits_{{x}\in C_{\alpha}(y,z,x^{(1)},\ldots,x^{(b)})}dQ^{\star}({x}|y,z)\,dQ(x^{(1)},\ldots,x^{(b)}|y,z)\right|\\ \leq&\int\limits_{(x^{(1)},\ldots,x^{(b)})\in\mathbb{R}^{b}}\left|\int\limits_{\dot{x}\in C_{\alpha}(y,z,x^{(1)},\ldots,x^{(b)})}dQ(\dot{x}|y,z)-\int\limits_{{x}\in C_{\alpha}(y,z,x^{(1)},\ldots,x^{(b)})}dQ^{\star}({x}|y,z)\right|dQ(x^{(1)},\ldots,x^{(b)}|y,z)\\ \leq&\int\limits_{(x^{(1)},\ldots,x^{(b)})\in\mathbb{R}^{b}}d_{TV}(Q^{\star}_{n}(\cdot|z),Q_{n}(\cdot|z))\,dQ(x^{(1)},\ldots,x^{(b)}|y,z)\\ =&d_{TV}(Q^{\star}_{n}(\cdot|z),Q_{n}(\cdot|z)).\end{array}

Combining the results above, we have

(pCRRTα|y,z)α+dTV(Qn(|z),Qn(|z)).\mathbb{P}(p_{CRRT}\leq\alpha|y,z)\leq\alpha+d_{TV}(Q^{\star}_{n}(\cdot|z),Q_{n}(\cdot|z)).

Taking expectation with respect to yy and zz on both sides completes the proof. ∎

Proof of Theorem 4.

In the following proof, we sometimes write x^{(0)} for x. When we say T_{0} ranks rth, we mean that T_{0} is the rth largest value among \{T_{0},\ldots,T_{b}\}. Recall that

pCRRT=k=0b𝟙{T0(y,z,x~)Tk(y,z,x~)}1+b.p_{CRRT}=\frac{\sum\limits_{k=0}\limits^{b}\mathbbm{1}\{T_{0}(y,z,\tilde{x})\leq T_{k}(y,z,\tilde{x})\}}{1+b}.

Based on the above definition, p_{CRRT}\leq\alpha is roughly equivalent to the event that T_{0} is among the \lfloor(b+1)\alpha\rfloor largest of T_{0},\ldots,T_{b}. When all T_{k}'s are distinct, this equivalence is exact. When some T_{k}'s with k\geq 1 are tied with T_{0}, our rule in fact assigns T_{0} a lower rank (the smaller the value, the lower the rank) and hence tends to be conservative. Moreover, repeated values among the T_{k}, k\geq 1, do not affect T_{0}'s rank. Thus, without loss of generality, we may assume that all T_{k}'s are distinct with probability 1, conditional on z and y. That is,

(ij,Ti=Tj|z,y)=0.\mathbb{P}(\exists i\neq j,T_{i}=T_{j}|z,y)=0.

Denote the unordered set of \{x,x^{(1)},\ldots,x^{(b)}\} by \{u_{0},\ldots,u_{b}\}. According to our assumption, for almost every (z,y), the values u_{0},\ldots,u_{b} are distinct with conditional probability 1. We have the following claim.

Claim 1. With probability 1, there exists a one-to-one map \Psi:\{1,\ldots,b+1\}\longrightarrow\{u_{0},\ldots,u_{b}\} such that T_{0} ranks rth if and only if x=\Psi(r), r=1,\ldots,b+1. This map depends on the unordered set \{u_{0},\ldots,u_{b}\}, the test function T, y and z. The proof of this claim is given immediately after the current proof.

Given y,z,u0,,uby,z,u_{0},\ldots,u_{b}, according to Claim 1, without loss of generality, we assume that T0T_{0} ranks rrth if and only if x=ur1x=u_{r-1}, for r=1,2,,b+1r=1,2,\ldots,b+1. We are going to show that for any s,r{1,,b+1}s,r\in\{1,\ldots,b+1\} and srs\neq r,

(T0rankssth|u0,u1,,ub,z,y)(T0ranksrth|u0,u1,,ub,z,y)=Q(us1|z)Q(ur1|z)Q(ur1|z)Q(us1|z).\frac{\mathbb{P}(T_{0}\ ranks\ s\ th|u_{0},u_{1},\ldots,u_{b},z,y)}{\mathbb{P}(T_{0}\ ranks\ r\ th|u_{0},u_{1},\ldots,u_{b},z,y)}=\frac{Q^{\star}(u_{s-1}|z)Q(u_{r-1}|z)}{Q^{\star}(u_{r-1}|z)Q(u_{s-1}|z)}. (26)

Actually, based on our assumption, we have

(T0rankssth|u0,u1,,ub,z,y)(T0ranksrth|u0,u1,,ub,z,y)=(x=us1|u0,u1,,ub,z,y)(x=ur1|u0,u1,,ub,z,y)=k=1b(x=us1,x(k)=ur1|u0,u1,,ub,z,y)k=1b(x=ur1,x(k)=us1|u0,u1,,ub,z,y).\begin{array}[]{cl}&\frac{\mathbb{P}(T_{0}\ ranks\ s\ th|u_{0},u_{1},\ldots,u_{b},z,y)}{\mathbb{P}(T_{0}\ ranks\ r\ th|u_{0},u_{1},\ldots,u_{b},z,y)}\\ =&\frac{\mathbb{P}(x=u_{s-1}|u_{0},u_{1},\ldots,u_{b},z,y)}{\mathbb{P}(x=u_{r-1}|u_{0},u_{1},\ldots,u_{b},z,y)}\\ =&\frac{\sum\limits_{k=1}\limits^{b}\mathbb{P}(x=u_{s-1},x^{(k)}=u_{r-1}|u_{0},u_{1},\ldots,u_{b},z,y)}{\sum\limits_{k=1}\limits^{b}\mathbb{P}(x=u_{r-1},x^{(k)}=u_{s-1}|u_{0},u_{1},\ldots,u_{b},z,y)}.\end{array}

To prove (26), it suffices to show that (x=us1,x(k)=ur1|u0,u1,,ub,z,y)(x=ur1,x(k)=us1|u0,u1,,ub,z,y)=Q(us1|z)Q(ur1|z)Q(ur1|z)Q(us1|z)\frac{\mathbb{P}(x=u_{s-1},x^{(k)}=u_{r-1}|u_{0},u_{1},\ldots,u_{b},z,y)}{\mathbb{P}(x=u_{r-1},x^{(k)}=u_{s-1}|u_{0},u_{1},\ldots,u_{b},z,y)}=\frac{Q^{\star}(u_{s-1}|z)Q(u_{r-1}|z)}{Q^{\star}(u_{r-1}|z)Q(u_{s-1}|z)} for any k{1,,b}k\in\{1,\ldots,b\}. We have the following observation,

(x=us1,x(k)=ur1|u0,u1,,ub,z,y)(x=ur1,x(k)=us1|u0,u1,,ub,z,y)=σΣb+1,σ1=s,σk+1=r(x(t1)=uσt1,t=1,,b+1|u0,u1,,ub,z,y)σΣb+1,σ1=r,σk+1=s(x(t1)=uσt1,t=1,,b+1|u0,u1,,ub,z,y).\begin{array}[]{cl}&\frac{\mathbb{P}(x=u_{s-1},x^{(k)}=u_{r-1}|u_{0},u_{1},\ldots,u_{b},z,y)}{\mathbb{P}(x=u_{r-1},x^{(k)}=u_{s-1}|u_{0},u_{1},\ldots,u_{b},z,y)}\\ =&\frac{\sum\limits_{\sigma\in\Sigma_{b+1},\sigma_{1}=s,\sigma_{k+1}=r}\mathbb{P}(x^{(t-1)}=u_{\sigma_{t}-1},t=1,\ldots,b+1|u_{0},u_{1},\ldots,u_{b},z,y)}{\sum\limits_{\sigma^{\prime}\in\Sigma_{b+1},\sigma^{\prime}_{1}=r,\sigma^{\prime}_{k+1}=s}\mathbb{P}(x^{(t-1)}=u_{\sigma^{\prime}_{t}-1},t=1,\ldots,b+1|u_{0},u_{1},\ldots,u_{b},z,y)}.\end{array} (27)

Clearly, there is a one-to-one map \Gamma from \{\sigma\in\Sigma_{b+1}:\sigma_{1}=s,\sigma_{k+1}=r\} to \{\sigma^{\prime}\in\Sigma_{b+1}:\sigma^{\prime}_{1}=r,\sigma^{\prime}_{k+1}=s\} such that if \sigma^{\prime}=\Gamma(\sigma) for some \sigma\in\{\sigma\in\Sigma_{b+1}:\sigma_{1}=s,\sigma_{k+1}=r\}, then \sigma^{\prime}_{t}=\sigma_{t} for t\notin\{1,k+1\}. If \sigma^{\prime}=\Gamma(\sigma),

(x(t1)=uσt1,t=1,,b+1|u0,u1,,ub,z,y)(x(t1)=uσt1,t=1,,b+1|u0,u1,,ub,z,y)=((x(t1)=uσt1,t=1,,b+1|z,y)(x~U|z,y))/((x(t1)=uσt1,t=1,,b+1|z,y)(x~U|z,y))=(x(t1)=uσt1,t=1,2,,b+1|z,y)(x(t1)=uσt1,t=1,2,,b+1|z,y)=Q(us1|z)t=2b+1Q(uσt1|z)Q(ur1|z)t=2b+1Q(uσt1|z)=Q(us1|z)Q(ur1|z)Q(ur1|z)Q(us1|z),\begin{array}[]{cl}&\frac{\mathbb{P}(x^{(t-1)}=u_{\sigma_{t}-1},t=1,\ldots,b+1|u_{0},u_{1},\ldots,u_{b},z,y)}{\mathbb{P}(x^{(t-1)}=u_{\sigma^{\prime}_{t}-1},t=1,\ldots,b+1|u_{0},u_{1},\ldots,u_{b},z,y)}\\ =&\left(\frac{\mathbb{P}(x^{(t-1)}=u_{\sigma_{t}-1},t=1,\ldots,b+1|z,y)}{\mathbb{P}(\tilde{x}\in U|z,y)}\right)/\left(\frac{\mathbb{P}(x^{(t-1)}=u_{\sigma^{\prime}_{t}-1},t=1,\ldots,b+1|z,y)}{\mathbb{P}(\tilde{x}\in U|z,y)}\right)\\ =&\frac{\mathbb{P}(x^{(t-1)}=u_{\sigma_{t}-1},t=1,2,\ldots,b+1|z,y)}{\mathbb{P}(x^{(t-1)}=u_{\sigma^{\prime}_{t}-1},t=1,2,\ldots,b+1|z,y)}=\frac{Q^{\star}(u_{s-1}|z)\prod\limits_{t=2}\limits^{b+1}Q(u_{\sigma_{t}-1}|z)}{Q^{\star}(u_{r-1}|z)\prod\limits_{t=2}\limits^{b+1}Q(u_{\sigma^{\prime}_{t}-1}|z)}\\ =&\frac{Q^{\star}(u_{s-1}|z)Q(u_{r-1}|z)}{Q^{\star}(u_{r-1}|z)Q(u_{s-1}|z)},\end{array}

where U={(u0,,ub)permute(σ^):σ^Σb+1}U=\{(u_{0},\ldots,u_{b})_{permute(\hat{\sigma})}:\hat{\sigma}\in\Sigma_{b+1}\}. Given the above results and that Γ\Gamma is one-to-one, we have

σΣb+1,σ1=s,σk+1=r(x(t1)=uσt1,t=1,,b+1|u0,u1,,ub,z,y)σΣb+1,σ1=r,σk+1=s(x(t1)=uσt1,t=1,,b+1|u0,u1,,ub,z,y)=Q(us1|z)Q(ur1|z)Q(ur1|z)Q(us1|z),\frac{\sum\limits_{\sigma\in\Sigma_{b+1},\sigma_{1}=s,\sigma_{k+1}=r}\mathbb{P}(x^{(t-1)}=u_{\sigma_{t}-1},t=1,\ldots,b+1|u_{0},u_{1},\ldots,u_{b},z,y)}{\sum\limits_{\sigma^{\prime}\in\Sigma_{b+1},\sigma^{\prime}_{1}=r,\sigma^{\prime}_{k+1}=s}\mathbb{P}(x^{(t-1)}=u_{\sigma^{\prime}_{t}-1},t=1,\ldots,b+1|u_{0},u_{1},\ldots,u_{b},z,y)}=\frac{Q^{\star}(u_{s-1}|z)Q(u_{r-1}|z)}{Q^{\star}(u_{r-1}|z)Q(u_{s-1}|z)},

which, combined with (27), results in

(x=us1,x(k)=ur1|u0,u1,,ub,z,y)(x=ur1,x(k)=us1|u0,u1,,ub,z,y)=Q(us1|z)Q(ur1|z)Q(ur1|z)Q(us1|z).\frac{\mathbb{P}(x=u_{s-1},x^{(k)}=u_{r-1}|u_{0},u_{1},\ldots,u_{b},z,y)}{\mathbb{P}(x=u_{r-1},x^{(k)}=u_{s-1}|u_{0},u_{1},\ldots,u_{b},z,y)}=\frac{Q^{\star}(u_{s-1}|z)Q(u_{r-1}|z)}{Q^{\star}(u_{r-1}|z)Q(u_{s-1}|z)}.

This establishes (26). Denote D(x^{(k)},z)=\frac{Q^{\star}(x^{(k)}|z)}{Q(x^{(k)}|z)} for k=0,1,\ldots,b, and denote the order statistics of \{D(x^{(k)},z):k=1,\ldots,b\} by D_{(1)}\leq D_{(2)}\leq\ldots\leq D_{(b)}. For a given s, denote the order statistics of \{D(u_{k},z):k\neq s\} by D^{s}_{(1)}\leq D^{s}_{(2)}\leq\ldots\leq D^{s}_{(b)}. Next, for any s\neq r\in\{1,\ldots,1+b\},

(T0rankssth,KL^(1)ε1,,KL^(b)εb|u0,u1,,ub,z,y)=𝔼(𝟙{T0rankssth}𝟙{KL^(1)ε1,,KL^(b)εb}|u0,u1,,ub,z,y)=𝔼(𝟙{T0rankssth}𝟙{D(x(0),z)D(b)eε1,D(x(0),z)D(b1)eε2,,D(x(0),z)D(1)eεb}|u0,u1,,ub,z,y)=𝔼(𝟙{T0rankssth}𝟙{D(us1,z)D(b)s1eε1,D(us1,z)D(b1)s1eε2,,D(us1,z)D(1)s1eεb}|u0,u1,,ub,z,y)=𝟙{D(us1,z)D(b)s1eε1,,D(us1,z)D(1)s1eεb}(T0rankssth|u0,u1,,ub,z,y)=𝟙{D(us1,z)D(b)s1eε1,,D(us1,z)D(1)s1eεb}D(us1,z)D(ur1,z)(T0ranksrth|u0,u1,,ub,z,y),\begin{array}[]{cl}&\mathbb{P}(T_{0}\ ranks\ s\ th,\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\widehat{KL}_{(b)}\leq\varepsilon_{b}|u_{0},u_{1},\ldots,u_{b},z,y)\\ =&\mathbb{E}(\mathbbm{1}\{T_{0}\ ranks\ s\ th\}\mathbbm{1}\{\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\widehat{KL}_{(b)}\leq\varepsilon_{b}\}|u_{0},u_{1},\ldots,u_{b},z,y)\\ =&\mathbb{E}(\mathbbm{1}\{T_{0}\ ranks\ s\ th\}\mathbbm{1}\{\frac{D(x^{(0)},z)}{D_{(b)}}\leq e^{\varepsilon_{1}},\frac{D(x^{(0)},z)}{D_{(b-1)}}\leq e^{\varepsilon_{2}},\ldots,\frac{D(x^{(0)},z)}{D_{(1)}}\leq e^{\varepsilon_{b}}\}|u_{0},u_{1},\ldots,u_{b},z,y)\\ =&\mathbb{E}(\mathbbm{1}\{T_{0}\ ranks\ s\ th\}\mathbbm{1}\{\frac{D(u_{s-1},z)}{D^{s-1}_{(b)}}\leq e^{\varepsilon_{1}},\frac{D(u_{s-1},z)}{D^{s-1}_{(b-1)}}\leq e^{\varepsilon_{2}},\ldots,\frac{D(u_{s-1},z)}{D^{s-1}_{(1)}}\leq e^{\varepsilon_{b}}\}|u_{0},u_{1},\ldots,u_{b},z,y)\\ =&\mathbbm{1}\{\frac{D(u_{s-1},z)}{D^{s-1}_{(b)}}\leq e^{\varepsilon_{1}},\ldots,\frac{D(u_{s-1},z)}{D^{s-1}_{(1)}}\leq e^{\varepsilon_{b}}\}\mathbb{P}(T_{0}\ ranks\ s\ th|u_{0},u_{1},\ldots,u_{b},z,y)\\ =&\mathbbm{1}\{\frac{D(u_{s-1},z)}{D^{s-1}_{(b)}}\leq e^{\varepsilon_{1}},\ldots,\frac{D(u_{s-1},z)}{D^{s-1}_{(1)}}\leq e^{\varepsilon_{b}}\}\frac{D(u_{s-1},z)}{D(u_{r-1},z)}\mathbb{P}(T_{0}\ ranks\ r\ th|u_{0},u_{1},\ldots,u_{b},z,y),\\ \end{array}

(28)

where the third step is valid because we previously assumed that T_{0} ranks sth if and only if x^{(0)}=u_{s-1}, the second-to-last step holds because \mathbbm{1}\{\frac{D(u_{s-1},z)}{D^{s-1}_{(b)}}\leq e^{\varepsilon_{1}},\ldots,\frac{D(u_{s-1},z)}{D^{s-1}_{(1)}}\leq e^{\varepsilon_{b}}\} is a function of u_{0},\ldots,u_{b},z, and the last step follows from (26). For any given s, there is clearly a one-to-one map \Phi_{s} from \{0,1,\ldots,b\}\setminus\{s\} to \{1,\ldots,b\} such that D(u_{r},z)=D^{s}_{(b+1-\Phi_{s}(r))}, r\in\{0,1,\ldots,b\}\setminus\{s\}. Then, we can resume the previous derivation,

𝟙{D(us1,z)D(b)s1eε1,,D(us1,z)D(1)s1eεb}D(us1,z)D(ur1,z)=𝟙{D(us1,z)D(b)s1eε1,,D(us1,z)D(1)s1eεb}D(us1,z)D(b+1Φs1(r1))s1eεΦs1(r1).\begin{array}[]{cl}&\mathbbm{1}\{\frac{D(u_{s-1},z)}{D^{s-1}_{(b)}}\leq e^{\varepsilon_{1}},\ldots,\frac{D(u_{s-1},z)}{D^{s-1}_{(1)}}\leq e^{\varepsilon_{b}}\}\frac{D(u_{s-1},z)}{D(u_{r-1},z)}\\ =&\mathbbm{1}\{\frac{D(u_{s-1},z)}{D^{s-1}_{(b)}}\leq e^{\varepsilon_{1}},\ldots,\frac{D(u_{s-1},z)}{D^{s-1}_{(1)}}\leq e^{\varepsilon_{b}}\}\frac{D(u_{s-1},z)}{D^{s-1}_{(b+1-\Phi_{s-1}(r-1))}}\\ \leq&e^{\varepsilon_{\Phi_{s-1}(r-1)}}.\\ \end{array}

Plugging this back into (28), we have

(T0rankssth,KL^(1)ε1,,KL^(b)εb|u0,u1,,ub,z,y)eεΦs1(r1)(T0ranksrth|u0,u1,,ub,z,y).\begin{array}[]{cl}&\mathbb{P}(T_{0}\ ranks\ s\ th,\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\widehat{KL}_{(b)}\leq\varepsilon_{b}|u_{0},u_{1},\ldots,u_{b},z,y)\\ \leq&e^{\varepsilon_{\Phi_{s-1}(r-1)}}\mathbb{P}(T_{0}\ ranks\ r\ th|u_{0},u_{1},\ldots,u_{b},z,y).\end{array}

With the above results, for any given ss, we have

1=r=1b+1(T0ranksrth|u0,u1,,ub,z,y)=(T0rankssth|u0,u1,,ub,z,y)+1rb+1,rs(T0ranksrth|u0,u1,,ub,z,y)(T0rankssth,KL^(1)ε1,,KL^(b)εb|u0,,ub,z,y)(1+1rb+1,rseεΦs1(r1))=(T0rankssth,KL^(1)ε1,,KL^(b)εb|u0,,ub,z,y)(1+r=1beεr),\begin{array}[]{rl}1=&\sum\limits_{r=1}\limits^{b+1}\mathbb{P}(T_{0}\ ranks\ r\ th|u_{0},u_{1},\ldots,u_{b},z,y)\\ =&\mathbb{P}(T_{0}\ ranks\ s\ th|u_{0},u_{1},\ldots,u_{b},z,y)+\sum\limits_{1\leq r\leq b+1,r\neq s}\mathbb{P}(T_{0}\ ranks\ r\ th|u_{0},u_{1},\ldots,u_{b},z,y)\\ \geq&\mathbb{P}(T_{0}\ ranks\ s\ th,\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\widehat{KL}_{(b)}\leq\varepsilon_{b}|u_{0},\ldots,u_{b},z,y)\left(1+\sum\limits_{1\leq r\leq b+1,r\neq s}e^{-\varepsilon_{\Phi_{s-1}(r-1)}}\right)\\ =&\mathbb{P}(T_{0}\ ranks\ s\ th,\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\widehat{KL}_{(b)}\leq\varepsilon_{b}|u_{0},\ldots,u_{b},z,y)\left(1+\sum\limits_{r=1}\limits^{b}e^{-\varepsilon_{r}}\right),\\ \end{array}

where the last step holds because \Phi_{s} is one-to-one, s=0,1,\ldots,b. Based on this result, we have

\begin{array}{cl}&\mathbb{P}(p_{CRRT}\leq\alpha,\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\widehat{KL}_{(b)}\leq\varepsilon_{b}|u_{0},\ldots,u_{b},z,y)\\ =&\sum\limits_{s=1}\limits^{\lfloor(b+1)\alpha\rfloor}\mathbb{P}(T_{0}\ ranks\ s\ th,\widehat{KL}_{(1)}\leq\varepsilon_{1},\ldots,\widehat{KL}_{(b)}\leq\varepsilon_{b}|u_{0},\ldots,u_{b},z,y)\\ \leq&\sum\limits_{s=1}\limits^{\lfloor(b+1)\alpha\rfloor}\left(1+\sum\limits_{r=1}\limits^{b}e^{-\varepsilon_{r}}\right)^{-1}\\ =&\frac{\lfloor(b+1)\alpha\rfloor}{1+e^{-\varepsilon_{1}}+\ldots+e^{-\varepsilon_{b}}}.\end{array}

By taking expectations on both sides, the above result immediately yields the conclusions stated in Theorem 4. ∎

Proof of Claim 1. First, we show that with probability 1, for fixed z and y and any r\in\{1,\ldots,b+1\}, there exists u\in\{u_{0},\ldots,u_{b}\} such that x=u implies that T_{0} ranks rth. We only need to consider y,z,u_{0},u_{1},\ldots,u_{b} such that T_{i}\neq T_{j} for all i\neq j; this event has probability 1 by the assumption made at the beginning of the proof. Let the permutation \sigma\in\Sigma_{b+1} be such that when \tilde{x}=(u_{0},\ldots,u_{b})_{permute(\sigma)}=(u_{\sigma_{1}-1},\ldots,u_{\sigma_{b+1}-1}), T_{0} ranks rth. (Such a permutation \sigma exists: suppose that when x=u_{0},\ldots,x^{(b)}=u_{b}, T_{0} ranks \hat{r}th and T_{i_{r}} ranks rth; we then just let x=u_{i_{r}} and x^{(i_{r})}=u_{0}, and by T's X-symmetry, T_{0} now ranks rth.)

For any σ^Σb+1\hat{\sigma}\in\Sigma_{b+1} such that σ^1=σ1\hat{\sigma}_{1}=\sigma_{1}, there exists a σ^^Σb+1\hat{\hat{\sigma}}\in\Sigma_{b+1} such that

[(u0,,ub)permute(σ)]permute(σ^^)=(u0,,ub)permute(σ^).[(u_{0},\ldots,u_{b})_{permute(\sigma)}]_{permute(\hat{\hat{\sigma}})}=(u_{0},\ldots,u_{b})_{permute(\hat{\sigma})}.

Then,

T(y,z,x~=(u0,,ub)permute(σ^))=T(y,z,x~=[(u0,,ub)permute(σ)]permute(σ^^))=T(y,z,x~=(u0,,ub)permute(σ))permute(σ^^).\begin{array}[]{cl}&T(y,z,\tilde{x}=(u_{0},\ldots,u_{b})_{permute(\hat{\sigma})})\\ =&T(y,z,\tilde{x}=[(u_{0},\ldots,u_{b})_{permute(\sigma)}]_{permute(\hat{\hat{\sigma}})})\\ =&T(y,z,\tilde{x}=(u_{0},\ldots,u_{b})_{permute(\sigma)})_{permute(\hat{\hat{\sigma}})}.\end{array}

Since permutation σ^^\hat{\hat{\sigma}} doesn’t change the 1st element, we have

T0(y,z,x~=(u0,,ub)permute(σ^))=T0(y,z,x~=(u0,,ub)permute(σ)),T_{0}(y,z,\tilde{x}=(u_{0},\ldots,u_{b})_{permute(\hat{\sigma})})=T_{0}(y,z,\tilde{x}=(u_{0},\ldots,u_{b})_{permute(\sigma)}),

which implies that T_{0}(y,z,\tilde{x}=(u_{0},\ldots,u_{b})_{permute(\hat{\sigma})}) still ranks rth among \{T_{k},0\leq k\leq b\}, just as T_{0}(y,z,\tilde{x}=(u_{0},\ldots,u_{b})_{permute(\sigma)}) does. Hence, since \hat{\sigma} is arbitrary, x=u_{\sigma_{1}-1} is a sufficient condition for T_{0} to rank rth.

Next, we show that for any permutations \sigma,\sigma^{\prime}\in\Sigma_{b+1} with \sigma_{1}\neq\sigma^{\prime}_{1}, the rank of T_{0}(y,z,\tilde{x}=(u_{0},\ldots,u_{b})_{permute(\sigma)}), denoted by r_{\sigma}, differs from the rank of T_{0}(y,z,\tilde{x}=(u_{0},\ldots,u_{b})_{permute(\sigma^{\prime})}), denoted by r_{\sigma^{\prime}}. Suppose on the contrary that r_{\sigma}=r_{\sigma^{\prime}}=r\in\{1,\ldots,b+1\}. There exists k\in\{2,\ldots,b+1\} such that \sigma^{\prime}_{k}=\sigma_{1}. Let \sigma^{\prime\prime}\in\Sigma_{b+1} satisfy \sigma^{\prime\prime}_{g}=\sigma^{\prime}_{g} for g\notin\{1,k\}, and \sigma^{\prime\prime}_{1}=\sigma^{\prime}_{k}, \sigma^{\prime\prime}_{k}=\sigma^{\prime}_{1}. Then, by what we have just shown, the rank of T_{0}(y,z,\tilde{x}=(u_{0},\ldots,u_{b})_{permute(\sigma^{\prime\prime})}), denoted by r_{\sigma^{\prime\prime}}, equals r_{\sigma}. Because r_{\sigma}=r_{\sigma^{\prime}}, we now have r_{\sigma^{\prime\prime}}=r_{\sigma^{\prime}}, which is impossible: switching x and x^{(k-1)} also switches the ranks of T_{0} and T_{k-1}. Combining the two parts shown above, we conclude that Claim 1 is true. ∎

Proof of Theorem 3.

For brevity, we denote dTV(Qn(|z),Qn(|z))d_{TV}(Q_{n}^{\star}(\cdot|z),Q_{n}(\cdot|z)) by dTV(z)d_{TV}(z). According to the definition of dTV(z)d_{TV}(z), there exists a set A(z)nA(z)\subseteq\mathbb{R}^{n} such that

Qn(xA(z)|z)=Qn(xA(z)|z)+dTV(z),\mathbb{P}_{Q_{n}^{\star}}(x\in A(z)|z)=\mathbb{P}_{Q_{n}}(x\in A(z)|z)+d_{TV}(z),

and

{Qn(x|z)Qn(x|z)1,xA(z),Qn(x|z)Qn(x|z)1,xAc(z).\left\{\begin{array}[]{ll}\frac{Q_{n}^{\star}(x|z)}{Q_{n}(x|z)}\geq 1,&\forall x\in A(z),\\ \frac{Q_{n}^{\star}(x|z)}{Q_{n}(x|z)}\leq 1,&\forall x\in A^{c}(z).\end{array}\right.

Denote \mathbb{P}_{Q_{n}}(x\in A(z)|z) by \alpha_{0}(z). Let us first consider the case \alpha_{0}(z)\geq\alpha-\varepsilon. Let \tilde{A}(z)\subseteq A(z) be such that \mathbb{P}_{Q_{n}}(x\in\tilde{A}(z)|z)=\alpha-\varepsilon, and

Qn(xA~(z)|z)supB(z)A(z):Qn(xB(z)|z)=αεQn(xB(z)|z).\mathbb{P}_{Q^{\star}_{n}}(x\in\tilde{A}(z)|z)\geq\sup\limits_{B(z)\subseteq A(z):\mathbb{P}_{Q_{n}}(x\in B(z)|z)=\alpha-\varepsilon}\mathbb{P}_{Q^{\star}_{n}}(x\in B(z)|z).

Such an \tilde{A}(z) exists a.s. by the assumption that X is conditionally continuous under both Q(\cdot|Z) and Q^{\star}(\cdot|Z) a.s.. Then it is not hard to see that

α~(z)Qn(xA~(z)|z)αε+dTV(z)αεα0(z).\tilde{\alpha}(z)\triangleq\mathbb{P}_{Q^{\star}_{n}}(x\in\tilde{A}(z)|z)\geq\alpha-\varepsilon+d_{TV}(z)\frac{\alpha-\varepsilon}{\alpha_{0}(z)}.

Because x^{(0)}=x\sim Q^{\star}_{n}(\cdot|z) and x^{(1)},\ldots,x^{(b)}\sim Q_{n}(\cdot|z) independently, we have

{(k=1b𝟙{x(k)A~(z)}|y,z)Binomial(b,αε),𝟙{x(0)A~(z)}|y,zBernoulli(α~(z)).\left\{\begin{array}[]{l}\left(\sum\limits_{k=1}\limits^{b}\mathbbm{1}\{x^{(k)}\in\tilde{A}(z)\}|y,z\right)\sim Binomial(b,\alpha-\varepsilon),\\ \mathbbm{1}\{x^{(0)}\in\tilde{A}(z)\}|y,z\sim Bernoulli(\tilde{\alpha}(z)).\end{array}\right.

For k=0,1,,bk=0,1,\ldots,b, let Tk(y,z,x~)=𝟙{x(k)A~(z)}T_{k}(y,z,\tilde{x})=\mathbbm{1}\{x^{(k)}\in\tilde{A}(z)\}, and T(y,z,x~)=(T0,,Tb)TT(y,z,\tilde{x})=(T_{0},\ldots,T_{b})^{T}. Define pCRRTp_{CRRT} as in (2). Then, denoting h(u)=(1+u)log(1+u)uh(u)=(1+u)\log(1+u)-u, as bb goes to infinity, we have

\begin{array}{ll}&\mathbb{P}(p_{CRRT}\leq\alpha|y,z)\\ =&\mathbb{P}\left(\frac{\sum\limits_{k=0}\limits^{b}\mathbbm{1}\{T_{0}(y,z,\tilde{x})\leq T_{k}(y,z,\tilde{x})\}}{1+b}\leq\alpha\Big|y,z\right)\\ \geq&\mathbb{P}\left(T_{0}(y,z,\tilde{x})=1\ and\ \sum\limits_{k=1}\limits^{b}\mathbbm{1}\{T_{k}(y,z,\tilde{x})=1\}\leq(b+1)\alpha-1\Big|y,z\right)\\ =&\tilde{\alpha}(z)\mathbb{P}\left(Binomial(b,\alpha-\varepsilon)\leq\alpha(b+1)-1\right)\\ =&\tilde{\alpha}(z)\left(1-\mathbb{P}\left(Binomial(b,\alpha-\varepsilon)-b(\alpha-\varepsilon)>\alpha-1+b\varepsilon\right)\right)\\ \geq&\tilde{\alpha}(z)\left(1-\exp\left\{-\frac{b(\alpha-\varepsilon)(1-(\alpha-\varepsilon))}{(1-(\alpha-\varepsilon))^{2}}h\left(\frac{(1-(\alpha-\varepsilon))(\alpha-1+b\varepsilon)}{b(\alpha-\varepsilon)(1-(\alpha-\varepsilon))}\right)\right\}\right)\\ =&\tilde{\alpha}(z)\left(1-O\left(\exp\left\{-\frac{\alpha-\varepsilon}{1-(\alpha-\varepsilon)}h\left(\frac{\varepsilon}{\alpha-\varepsilon}\right)b\right\}\right)\right)\\ \geq&\left(\alpha-\varepsilon+d_{TV}(z)\frac{\alpha-\varepsilon}{\alpha_{0}(z)}\right)\left(1-O\left(\exp\left\{-\frac{\alpha-\varepsilon}{1-(\alpha-\varepsilon)}h\left(\frac{\varepsilon}{\alpha-\varepsilon}\right)b\right\}\right)\right),\end{array}

where we use the conditional independence of x,x^{(1)},\ldots,x^{(b)} in the third step and apply Bennett's inequality in the fifth step, requiring b to be sufficiently large that b>\frac{1-\alpha}{\varepsilon}.

In the case when α0(z)<αε\alpha_{0}(z)<\alpha-\varepsilon, we proceed in a similar way. Let A~(z)A(z)\tilde{A}(z)\supseteq A(z) such that Qn(xA~(z)|z)=αε\mathbb{P}_{Q_{n}}(x\in\tilde{A}(z)|z)=\alpha-\varepsilon, and

Qn(xA~(z)|z)supB(z)A(z):Qn(xB(z)|z)=αεQn(xB(z)|z).\mathbb{P}_{Q^{\star}_{n}}(x\in\tilde{A}(z)|z)\geq\sup\limits_{B(z)\supseteq A(z):\mathbb{P}_{Q_{n}}(x\in B(z)|z)=\alpha-\varepsilon}\mathbb{P}_{Q^{\star}_{n}}(x\in B(z)|z).

Then, we have

α~(z)=Qn(xA~(z)|z)=Qn(xA(z)|z)+Qn(xA~(z)[A(z)]c|z)[α0(z)+dTV(z)]+{[αεα0(z)]αεα0(z)1α0(z)dTV(z)}=αε+dTV(z)1(αε)1α0(z).\begin{array}[]{rl}&\tilde{\alpha}(z)=\mathbb{P}_{Q_{n}^{\star}}(x\in\tilde{A}(z)|z)\\ =&\mathbb{P}_{Q_{n}^{\star}}(x\in{A}(z)|z)+\mathbb{P}_{Q_{n}^{\star}}(x\in\tilde{A}(z)\cap[A(z)]^{c}|z)\\ \geq&[\alpha_{0}(z)+d_{TV}(z)]+\left\{[\alpha-\varepsilon-\alpha_{0}(z)]-\frac{\alpha-\varepsilon-\alpha_{0}(z)}{1-\alpha_{0}(z)}d_{TV}(z)\right\}\\ =&\alpha-\varepsilon+d_{TV}(z)\frac{1-(\alpha-\varepsilon)}{1-\alpha_{0}(z)}.\\ \end{array}

For k=0,1,,bk=0,1,\ldots,b, let Tk(y,z,x~)=𝟙{x(k)A~(z)}T_{k}(y,z,\tilde{x})=\mathbbm{1}\{x^{(k)}\in\tilde{A}(z)\}, and T(y,z,x~)=(T0,,Tb)TT(y,z,\tilde{x})=(T_{0},\ldots,T_{b})^{T}. Similarly, we can show that

(pCRRTα|y,z)(αε+dTV(z)1(αε)1α0(z))(1O(exp{αε1(αε)h(εαε)b})).\begin{array}[]{ll}&\mathbb{P}(p_{CRRT}\leq\alpha|y,z)\\ \geq&\left(\alpha-\varepsilon+d_{TV}(z)\frac{1-(\alpha-\varepsilon)}{1-\alpha_{0}(z)}\right)\left(1-O\left(exp\left\{-\frac{\alpha-\varepsilon}{1-(\alpha-\varepsilon)}h\left(\frac{\varepsilon}{\alpha-\varepsilon}\right)b\right\}\right)\right).\end{array}

We point out that although the definitions of \tilde{A}(z) in the above two situations differ, they depend only on z, Q and Q^{\star}. A compact definition of \tilde{A}(z) can be summarized as follows:

A~(z)={argsupB(z)A(z):Qn(xB(z)|z)=αεQn(xB(z)|z),ifα0(z)αε,argsupB(z)A(z):Qn(xB(z)|z)=αεQn(xB(z)|z),ifα0(z)<αε.\tilde{A}(z)=\left\{\begin{array}[]{ll}\mathop{\arg\sup}\limits_{B(z)\subseteq A(z):\mathbb{P}_{Q_{n}}(x\in B(z)|z)=\alpha-\varepsilon}\mathbb{P}_{Q^{\star}_{n}}(x\in B(z)|z),&if\ \alpha_{0}(z)\geq\alpha-\varepsilon,\\ \mathop{\arg\sup}\limits_{B(z)\supseteq A(z):\mathbb{P}_{Q_{n}}(x\in B(z)|z)=\alpha-\varepsilon}\mathbb{P}_{Q^{\star}_{n}}(x\in B(z)|z),&if\ \alpha_{0}(z)<\alpha-\varepsilon.\\ \end{array}\right.

For k=0,1,\ldots,b, let T_{k}(y,z,\tilde{x})=\mathbbm{1}\{x^{(k)}\in\tilde{A}(z)\}, T(y,z,\tilde{x})=(T_{0},\ldots,T_{b})^{T}, and p_{CRRT} be defined as in (2). It is easy to check that T is X-symmetric. Combining the above results, we have

(pCRRTα|y,z)[αε+𝟙{α0(z)αε}dTV(z)αεα0(z)+𝟙{α0(z)<αε}dTV(z)1(αε)1α0(z)]×(1O(exp{αε1(αε)h(εαε)b})).\begin{array}[]{ll}&\mathbb{P}(p_{CRRT}\leq\alpha|y,z)\\ \geq&\left[\alpha-\varepsilon+\mathbbm{1}\{\alpha_{0}(z)\geq\alpha-\varepsilon\}d_{TV}(z)\frac{\alpha-\varepsilon}{\alpha_{0}(z)}+\mathbbm{1}\{\alpha_{0}(z)<\alpha-\varepsilon\}d_{TV}(z)\frac{1-(\alpha-\varepsilon)}{1-\alpha_{0}(z)}\right]\\ &\times\left(1-O\left(exp\left\{-\frac{\alpha-\varepsilon}{1-(\alpha-\varepsilon)}h\left(\frac{\varepsilon}{\alpha-\varepsilon}\right)b\right\}\right)\right).\end{array}

Note that the definition of T does not depend on b. If we allowed T to depend on b, we could replace \varepsilon with a function of b such as \frac{1}{2}\sqrt{\frac{\log b}{b}} and obtain a tighter asymptotic bound. ∎

Proof of Theorem 5.

In the following proof, we sometimes write x^{(0)} for x. When we say T_{0} ranks rth, we mean that T_{0} is the rth largest value among \{T_{0},\ldots,T_{b}\}. Since the main condition is a function of \widehat{KL}_{k}, k=1,\ldots,b, it is natural to consider

TkD(x(k),z)=Q(x(k)|z)Q(x(k)|z),k=0,1,,b.T_{k}\triangleq D(x^{(k)},z)=\frac{Q^{\star}(x^{(k)}|z)}{Q(x^{(k)}|z)},k=0,1,\ldots,b.

It is easy to see that T=(T_{0},T_{1},\ldots,T_{b}) is X-symmetric. As before, we denote the unordered set of \{x,x^{(1)},\ldots,x^{(b)}\} by \{u_{0},\ldots,u_{b}\}. Based on the assumption that X|Z is continuous under both Q(\cdot|Z) and Q^{\star}(\cdot|Z), T_{0},T_{1},\ldots,T_{b} are distinct with probability 1.

We’ll frequently use equation (26). Therefore, we restate it here. For any s,r{1,,b+1}s,r\in\{1,\ldots,b+1\} and srs\neq r,

(T0rankssth|u0,u1,,ub,z,y)(T0ranksrth|u0,u1,,ub,z,y)=Q(us1|z)Q(ur1|z)Q(ur1|z)Q(us1|z).\frac{\mathbb{P}(T_{0}\ ranks\ s\ th|u_{0},u_{1},\ldots,u_{b},z,y)}{\mathbb{P}(T_{0}\ ranks\ r\ th|u_{0},u_{1},\ldots,u_{b},z,y)}=\frac{Q^{\star}(u_{s-1}|z)Q(u_{r-1}|z)}{Q^{\star}(u_{r-1}|z)Q(u_{s-1}|z)}. (29)

In the first setting, i.e., when x,x^{(1)},\ldots,x^{(b)}\sim_{i.i.d.}Q^{\star}(\cdot|z), since T_{0},T_{1},\ldots,T_{b} are distinct with probability 1, the events \{\{T_{0}\ ranks\ s\ th\}:s=1,\ldots,b+1\} are equiprobable. Therefore,

(PCRRTα)=s=1(b+1)α(T0rankssth)=(b+1)αb+1.\mathbb{P}(P_{CRRT}\leq\alpha)=\sum\limits_{s=1}\limits^{\lfloor(b+1)\alpha\rfloor}\mathbb{P}(T_{0}\ ranks\ s\ th)=\frac{\lfloor(b+1)\alpha\rfloor}{b+1}.
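As a quick numerical illustration of this exact identity, the following sketch (our own toy check, with an arbitrary continuous statistic standing in for T, not the paper's simulation code) estimates \mathbb{P}(p_{CRRT}\leq\alpha) by Monte Carlo:

import numpy as np

rng = np.random.default_rng(0)
b, alpha, reps = 99, 0.1, 20000
rej = 0
for _ in range(reps):
    T = rng.normal(size=b + 1)                # x, x^(1), ..., x^(b) i.i.d. from Q*: T is exchangeable
    p = np.sum(T[0] <= T) / (b + 1)
    rej += (p <= alpha)
print(rej / reps, np.floor((b + 1) * alpha) / (b + 1))   # both close to 0.10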

Next, we show that if x\sim Q^{\star}(\cdot|z) and x^{(k)}\sim Q(\cdot|z), k=1,\ldots,b, independently, we can guarantee a nontrivial lower bound on the type 1 error. Denote the order statistics of T_{k}, k=0,1,\ldots,b, by T_{(0)}\leq T_{(1)}\leq\ldots\leq T_{(b)}. We start from our probability condition,

\begin{array}{rl}c\leq&\mathbb{P}(\widehat{KL}_{k}\geq 0,k=1,\ldots,b,\ and\ \widehat{KL}_{(\lambda)}\geq\varepsilon)\\ =&\mathbb{E}\mathbb{E}[\mathbbm{1}\{T_{(b)}/T_{(b-\lambda)}\geq e^{\varepsilon}\}\mathbbm{1}\{T_{0}\ ranks\ 1st\}|u_{0},\ldots,u_{b},z,y]\\ =&\mathbb{E}\{\mathbbm{1}\{T_{(b)}/T_{(b-\lambda)}\geq e^{\varepsilon}\}\mathbb{E}[\mathbbm{1}\{T_{0}\ ranks\ 1st\}|u_{0},\ldots,u_{b},z,y]\}\\ =&\mathbb{E}\left[\mathbbm{1}\{T_{(b)}/T_{(b-\lambda)}\geq e^{\varepsilon}\}\frac{D(u_{0},z)}{\sum\limits_{k=0}\limits^{b}D(u_{k},z)}\right],\end{array} (30)

where D(u,z)Q(u|z)Q(u|z)D(u,z)\triangleq\frac{Q^{\star}(u|z)}{Q(u|z)} and the last step is based on (29). Next,

(PCRRTα|u0,,ub,z,y)=(s=1λ{T0rankssth}|u0,,ub,z,y)=[k=0λ1Q(uk|z)Q(uk|z)]/[k=0bD(uk,z)],\begin{array}[]{cl}&\mathbb{P}(P_{CRRT}\leq\alpha|u_{0},\ldots,u_{b},z,y)\\ =&\mathbb{P}(\bigcup\limits_{s=1}\limits^{\lambda}\{T_{0}\ ranks\ s\ th\}|u_{0},\ldots,u_{b},z,y)\\ =&{\left[\sum\limits_{k=0}\limits^{\lambda-1}\frac{Q^{\star}(u_{k}|z)}{Q(u_{k}|z)}\right]}/{\left[\sum\limits_{k=0}\limits^{b}D(u_{k},z)\right]},\\ \end{array}

based on which, we further have

(PCRRTα)=𝔼{[k=0λ1D(uk,z)]/[k=0bD(uk,z)]}=λb+1+𝔼{[k=0λ1D(uk,z)]/[k=0bD(uk,z)]λb+1}λb+1+𝔼{{[k=0λ1D(uk,z)]/[k=0bD(uk,z)]λb+1}𝟙{T(b)/T(bλ)eε}}=λb+1+𝔼{{[k=0λ1D(uk,z)][k=0bD(uk,z)]λb+1}k=0bD(uk,z)D(u0,z)D(u0,z)k=0bD(uk,z)𝟙{D(u0,z)D(uλ,z)eε}}.\begin{array}[]{cl}&\mathbb{P}(P_{CRRT}\leq\alpha)\\ =&\mathbb{E}\left\{{\left[\sum\limits_{k=0}\limits^{\lambda-1}D(u_{k},z)\right]}/{\left[\sum\limits_{k=0}\limits^{b}D(u_{k},z)\right]}\right\}\\ =&\frac{\lambda}{b+1}+\mathbb{E}\left\{{\left[\sum\limits_{k=0}\limits^{\lambda-1}D(u_{k},z)\right]}/{\left[\sum\limits_{k=0}\limits^{b}D(u_{k},z)\right]}-\frac{\lambda}{b+1}\right\}\\ \geq&\frac{\lambda}{b+1}+\mathbb{E}\left\{\left\{{\left[\sum\limits_{k=0}\limits^{\lambda-1}D(u_{k},z)\right]}/{\left[\sum\limits_{k=0}\limits^{b}D(u_{k},z)\right]}-\frac{\lambda}{b+1}\right\}\mathbbm{1}\{T_{(b)}/T_{(b-\lambda)}\geq e^{\varepsilon}\}\right\}\\ =&\frac{\lambda}{b+1}+\mathbb{E}\left\{\left\{\frac{\left[\sum\limits_{k=0}\limits^{\lambda-1}D(u_{k},z)\right]}{\left[\sum\limits_{k=0}\limits^{b}D(u_{k},z)\right]}-\frac{\lambda}{b+1}\right\}\frac{\sum\limits_{k=0}\limits^{b}D(u_{k},z)}{D(u_{0},z)}\frac{D(u_{0},z)}{\sum\limits_{k=0}\limits^{b}D(u_{k},z)}\mathbbm{1}\left\{\frac{D(u_{0},z)}{D(u_{\lambda},z)}\geq e^{\varepsilon}\right\}\right\}.\\ \end{array} (31)

Actually, we can see that

minM0M1Mb0,M0Mλeε{(M0++Mλ1M0++Mbλb+1)M0++MbM0}=minM0M1Mb0,M0Mλeε{M0++Mλ1M0λb+1M0++MbM0}=minM0M1Mb0,M0Mλeε{(1λb+1)M0++Mλ1M0λb+1Mλ++MbM0}=minM0Mλ0,M0Mλeε{(1λb+1)M0+(λ1)MλM0λb+1(bλ+1)MλM0}=minM0Mλ0,M0Mλeε[(1λb+1)(1λb+1)MλM0]=(1λb+1)(1eε).\begin{array}[]{cl}&\min\limits_{M_{0}\geq M_{1}\geq\ldots\geq M_{b}\geq 0,\frac{M_{0}}{M_{\lambda}}\geq e^{\varepsilon}}\left\{\left(\frac{M_{0}+\ldots+M_{\lambda-1}}{M_{0}+\ldots+M_{b}}-\frac{\lambda}{b+1}\right)\frac{M_{0}+\ldots+M_{b}}{M_{0}}\right\}\\ =&\min\limits_{M_{0}\geq M_{1}\geq\ldots\geq M_{b}\geq 0,\frac{M_{0}}{M_{\lambda}}\geq e^{\varepsilon}}\left\{\frac{M_{0}+\ldots+M_{\lambda-1}}{M_{0}}-\frac{\lambda}{b+1}\frac{M_{0}+\ldots+M_{b}}{M_{0}}\right\}\\ =&\min\limits_{M_{0}\geq M_{1}\geq\ldots\geq M_{b}\geq 0,\frac{M_{0}}{M_{\lambda}}\geq e^{\varepsilon}}\left\{\left(1-\frac{\lambda}{b+1}\right)\frac{M_{0}+\ldots+M_{\lambda-1}}{M_{0}}-\frac{\lambda}{b+1}\frac{M_{\lambda}+\ldots+M_{b}}{M_{0}}\right\}\\ =&\min\limits_{M_{0}\geq M_{\lambda}\geq 0,\frac{M_{0}}{M_{\lambda}}\geq e^{\varepsilon}}\left\{\left(1-\frac{\lambda}{b+1}\right)\frac{M_{0}+(\lambda-1)M_{\lambda}}{M_{0}}-\frac{\lambda}{b+1}\frac{(b-\lambda+1)M_{\lambda}}{M_{0}}\right\}\\ =&\min\limits_{M_{0}\geq M_{\lambda}\geq 0,\frac{M_{0}}{M_{\lambda}}\geq e^{\varepsilon}}\left[\left(1-\frac{\lambda}{b+1}\right)-\left(1-\frac{\lambda}{b+1}\right)\frac{M_{\lambda}}{M_{0}}\right]\\ =&\left(1-\frac{\lambda}{b+1}\right)(1-e^{-\varepsilon}).\\ \end{array}

Based on this result, we can proceed (31) as

\begin{array}{rl}\mathbb{P}(p_{CRRT}\leq\alpha)\geq&\frac{\lambda}{b+1}+\left(1-\frac{\lambda}{b+1}\right)(1-e^{-\varepsilon})\mathbb{E}\left[\frac{D(u_{0},z)}{\sum\limits_{k=0}\limits^{b}D(u_{k},z)}\mathbbm{1}\left\{\frac{D(u_{0},z)}{D(u_{\lambda},z)}\geq e^{\varepsilon}\right\}\right]\\ \geq&\frac{\lambda}{b+1}+c\left(1-\frac{\lambda}{b+1}\right)(1-e^{-\varepsilon}),\end{array}

where the last step holds by (30). ∎
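As a sanity check on the minimization identity used in the middle of this proof, the following sketch (ours) evaluates the objective at randomly generated feasible points and confirms that it never falls below the closed-form minimum (1-\frac{\lambda}{b+1})(1-e^{-\varepsilon}):

import numpy as np

rng = np.random.default_rng(1)
b, lam, eps = 20, 3, 0.5
closed_form = (1 - lam / (b + 1)) * (1 - np.exp(-eps))
best = np.inf
for _ in range(200000):
    M = np.sort(rng.uniform(size=b + 1))[::-1]           # M_0 >= M_1 >= ... >= M_b >= 0
    M[lam:] = np.minimum(M[lam:], M[0] * np.exp(-eps))   # enforce M_0 / M_lam >= e^eps
    val = (M[:lam].sum() / M.sum() - lam / (b + 1)) * (M.sum() / M[0])
    best = min(best, val)
print(best, closed_form)   # the random search stays above (and approaches) the closed form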

Proof of Proposition 4.

For given xorderedx_{ordered} and zz, let AR(xordered,z)A_{R}(x_{ordered},z) be the set satisfying that

xAR(xordered,z)Rn(x|xordered,z)Rn(x|xordered,z)dx=dTV(Rn(|z,xordered),Rn(|z,xordered)).\int\limits_{x\in A_{R}(x_{ordered},z)}R^{\star}_{n}(x|x_{ordered},z)-R_{n}(x|x_{ordered},z)dx=d_{TV}(R^{\star}_{n}(\cdot|z,x_{ordered}),R_{n}(\cdot|z,x_{ordered})).

Let A(z)\triangleq\{x:x\in A_{R}(x_{ordered},z)\}. Denote \Omega_{o}=\{(r_{1},\ldots,r_{n})^{T}\in\mathbb{R}^{n}:r_{1}\leq\ldots\leq r_{n}\}. Denote DF(x_{ordered},z)\triangleq\sum\limits_{x\in\Gamma(x_{ordered})}(Q_{n}(x|z)-Q^{\star}_{n}(x|z)). Then we have

𝔼[dTV(Rn(|z,xordered),Rn(|z,xordered))|z]=xorderedΩo[xAR(xordered,z)Rn(x|xordered,z)Rn(x|xordered,z)dx]xΓ(xordered)Qn(x|z)dxordered|xorderedΩoxAR(xordered,z)Rn(x|xordered,z)dxxΓ(xordered)Qn(x|z)dxorderedxorderedΩoxAR(xordered,z)Rn(x|xordered,z)dxxΓ(xordered)Qn(x|z)dxordered|+|xorderedΩoxAR(xordered,z)Rn(x|xordered,z)dxxΓ(xordered)(Qn(x|z)Qn(x|z))dxordered|=|Q(xA(z)|z)Q(xA(z)|z)|+|xorderedΩoxAR(xordered,z)Rn(x|xordered,z)dxxΓ(xordered)(Qn(x|z)Qn(x|z))dxordered|dTV(Qn(|z),Qn(|z))+|xorderedΩoDF(xordered,z)>0xAR(xordered,z)Rn(x|xordered,z)dxDF(xordered,z)dxordered+xorderedΩoDF(xordered,z)<0xAR(xordered,z)Rn(x|xordered,z)dxDF(xordered,z)dxordered|dTV(Qn(|z),Qn(|z))+max{|xorderedΩoDF(xordered,z)>0DF(xordered,z)𝑑xordered|,|xorderedΩoDF(xordered,z)<0DF(xordered,z)𝑑xordered|}2dTV(Qn(|z),Qn(|z)),\begin{array}[]{ll}&\mathbb{E}\left[d_{TV}(R^{\star}_{n}(\cdot|z,x_{ordered}),R_{n}(\cdot|z,x_{ordered}))|z\right]\\ =&\int\limits_{x_{ordered}\in\Omega_{o}}\left[\int\limits_{x\in A_{R}(x_{ordered},z)}R^{\star}_{n}(x|x_{ordered},z)-R_{n}(x|x_{ordered},z)dx\right]\sum\limits_{x\in\Gamma(x_{ordered})}Q^{\star}_{n}(x|z)dx_{ordered}\\ \leq&\left|\int\limits_{x_{ordered}\in\Omega_{o}}\int\limits_{x\in A_{R}(x_{ordered},z)}R^{\star}_{n}(x|x_{ordered},z)dx\sum\limits_{x\in\Gamma(x_{ordered})}Q^{\star}_{n}(x|z)dx_{ordered}\right.\\ &\left.-\int\limits_{x_{ordered}\in\Omega_{o}}\int\limits_{x\in A_{R}(x_{ordered},z)}R_{n}(x|x_{ordered},z)dx\sum\limits_{x\in\Gamma(x_{ordered})}Q_{n}(x|z)dx_{ordered}\right|\\ &+\left|\int\limits_{x_{ordered}\in\Omega_{o}}\int\limits_{x\in A_{R}(x_{ordered},z)}R_{n}(x|x_{ordered},z)dx\sum\limits_{x\in\Gamma(x_{ordered})}(Q_{n}(x|z)-Q^{\star}_{n}(x|z))dx_{ordered}\right|\\ =&\left|\mathbb{P}_{Q^{\star}}(x\in A(z)|z)-\mathbb{P}_{Q}(x\in A(z)|z)\right|\\ &+\left|\int\limits_{x_{ordered}\in\Omega_{o}}\int\limits_{x\in A_{R}(x_{ordered},z)}R_{n}(x|x_{ordered},z)dx\sum\limits_{x\in\Gamma(x_{ordered})}(Q_{n}(x|z)-Q^{\star}_{n}(x|z))dx_{ordered}\right|\\ \leq&d_{TV}(Q_{n}^{\star}(\cdot|z),Q_{n}(\cdot|z))\\ &+\left|\int\limits_{\mathop{x_{ordered}\in\Omega_{o}}\limits_{DF(x_{ordered},z)>0}}\int\limits_{x\in A_{R}(x_{ordered},z)}R_{n}(x|x_{ordered},z)dxDF(x_{ordered},z)dx_{ordered}\right.\\ &+\left.\int\limits_{\mathop{x_{ordered}\in\Omega_{o}}\limits_{DF(x_{ordered},z)<0}}\int\limits_{x\in A_{R}(x_{ordered},z)}R_{n}(x|x_{ordered},z)dxDF(x_{ordered},z)dx_{ordered}\right|\\ \leq&d_{TV}(Q_{n}^{\star}(\cdot|z),Q_{n}(\cdot|z))\\ &+\max\left\{\left|\int\limits_{\mathop{x_{ordered}\in\Omega_{o}}\limits_{DF(x_{ordered},z)>0}}DF(x_{ordered},z)dx_{ordered}\right|,\left|\int\limits_{\mathop{x_{ordered}\in\Omega_{o}}\limits_{DF(x_{ordered},z)<0}}DF(x_{ordered},z)dx_{ordered}\right|\right\}\\ \leq&2d_{TV}(Q_{n}^{\star}(\cdot|z),Q_{n}(\cdot|z)),\end{array}

where the last step is valid because

|xorderedΩoDF(xordered,z)>0DF(xordered,z)𝑑xordered|=|Q(x{vn:vΓ(vordered),vorderedΩo,DF(vordered,z)>0}|z)Q(x{vn:vΓ(vordered),vorderedΩo,DF(vordered,z)>0}|z)|dTV(Qn(|z),Qn(|z)),\begin{array}[]{cl}&\left|\int\limits_{\mathop{x_{ordered}\in\Omega_{o}}\limits_{DF(x_{ordered},z)>0}}DF(x_{ordered},z)dx_{ordered}\right|\\ =&\left|\mathbb{P}_{Q^{\star}}(x\in\{v\in\mathbb{R}^{n}:v\in\Gamma(v_{ordered}),v_{ordered}\in\Omega_{o},DF(v_{ordered},z)>0\}|z)\right.\\ &\left.-\mathbb{P}_{Q}(x\in\{v\in\mathbb{R}^{n}:v\in\Gamma(v_{ordered}),v_{ordered}\in\Omega_{o},DF(v_{ordered},z)>0\}|z)\right|\\ \leq&d_{TV}(Q_{n}^{\star}(\cdot|z),Q_{n}(\cdot|z)),\end{array}

and a similar result holds for |\int\limits_{\mathop{x_{ordered}\in\Omega_{o}}\limits_{DF(x_{ordered},z)<0}}DF(x_{ordered},z)dx_{ordered}|. ∎

Proof of Proposition 6.

Since tied T_{j} values can only make the given modified multiple knockoffs procedure more conservative, for simplicity we assume that, for j=1,2,\ldots,p, \mathbb{P}(\exists 1\leq k<k^{\prime}\leq b+1,T_{j}^{k}=T_{j}^{k^{\prime}})=0. Then, we can make a claim similar to Lemma 3.2 of Gimenez and Zou (2018):

Claim 2: The ranks corresponding to null variables, {rj:jSc}\{r_{j}:j\in S^{c}\}, are i.i.d. uniformly on the set {1,2,,b+1}\{1,2,\ldots,b+1\}, and are independent of the other ranks {rj:jS}\{r_{j}:j\in S\}, and of Tˇ={{Tj(1),,Tj(b+1)}:j=1,2,,p}\check{T}=\{\{T^{(1)}_{j},\ldots,T^{(b+1)}_{j}\}:j=1,2,\ldots,p\}.

We skip the proof of this claim because it can be shown in almost the same way as Lemma 3.2 of Gimenez and Zou (2018). Recall that

τ~=min{tmin{τj:1jp}:λ~b+1λ~(1+#{1jp,rj>λ~,τjt})#{1jp,rjλ~,τjt}1α}.\tilde{\tau}=\min\left\{t\geq\min\{\tau_{j}:1\leq j\leq p\}:\frac{\frac{\tilde{\lambda}}{b+1-\tilde{\lambda}}\left(1+\#\{1\leq j\leq p,r_{j}>\tilde{\lambda},\tau_{j}\geq t\}\right)}{\#\{1\leq j\leq p,r_{j}\leq\tilde{\lambda},\tau_{j}\geq t\}\vee 1}\leq\alpha\right\}.

Similar to Barber and Candès (2015) and Gimenez and Zou (2018), we construct one-bit p-values for j=1,\ldots,p:

pj={λ~b+1,rjλ~,1,rj>λ~.p_{j}=\left\{\begin{array}[]{cc}\frac{\tilde{\lambda}}{b+1},&r_{j}\leq\tilde{\lambda},\\ 1,&r_{j}>\tilde{\lambda}.\end{array}\right.

Based on Claim 2, for j\in\{1,\ldots,p\}\setminus S, we have

{(pj=λ~b+1|Tˇ)=(pj=λ~b+1)=(rjλ~)=λ~b+1,(pj=1|Tˇ)=(pj=1)=(rj>λ~)=b+1λ~b+1,\left\{\begin{array}[]{l}\mathbb{P}(p_{j}=\frac{\tilde{\lambda}}{b+1}|\check{T})=\mathbb{P}(p_{j}=\frac{\tilde{\lambda}}{b+1})=\mathbb{P}(r_{j}\leq\tilde{\lambda})=\frac{\tilde{\lambda}}{b+1},\\ \mathbb{P}(p_{j}=1|\check{T})=\mathbb{P}(p_{j}=1)=\mathbb{P}(r_{j}>\tilde{\lambda})=\frac{b+1-\tilde{\lambda}}{b+1},\end{array}\right.

which implies that p_{j}|\check{T}\mathop{\geq}\limits^{d}\mathscr{U}([0,1]). Since for j\in\{1,\ldots,p\}, p_{j} is a function of r_{j}, based on Claim 2, all p_{j}'s corresponding to null variables are mutually independent and independent of the nonnulls, conditional on \check{T}.

Before defining a Selective SeqStep+ sequential test, we adjust the order of variables. Because all of our arguments are conditional on Tˇ\check{T}, it’s legitimate to rearrange the order of X1,,XpX_{1},\ldots,X_{p} based on some functions of Tˇ\check{T}. In particular, we can adjust the order of variables based on τ1,,τp\tau_{1},\ldots,\tau_{p}. Hence, we can directly assume that

τ1τ2τp0.\tau_{1}\geq\tau_{2}\geq\ldots\geq\tau_{p}\geq 0.

Next, we define a Selective SeqStep+ threshold

k~=max{1kp:1+#{jk:pj>λ~b+1}#{jk:pjλ~b+1}1αb+1λ~λ~}\tilde{k}=\max\left\{1\leq k\leq p:\frac{1+\#\{j\leq k:p_{j}>\frac{\tilde{\lambda}}{b+1}\}}{\#\{j\leq k:p_{j}\leq\frac{\tilde{\lambda}}{b+1}\}\vee 1}\leq\alpha\frac{b+1-\tilde{\lambda}}{\tilde{\lambda}}\right\}

and S~={jk~:pjλ~b+1}\tilde{S}^{\prime}=\left\{j\leq\tilde{k}:p_{j}\leq\frac{\tilde{\lambda}}{b+1}\right\}. Noting that λ~b+1/(1λ~b+1)=λ~b+1λ~\frac{\tilde{\lambda}}{b+1}/(1-\frac{\tilde{\lambda}}{b+1})=\frac{\tilde{\lambda}}{b+1-\tilde{\lambda}}, we can assert that

𝔼[|S~S||S~|1]=𝔼𝔼[|S~S||S~|1|Tˇ]α\mathbb{E}\left[\frac{|\tilde{S}^{\prime}\setminus S|}{|\tilde{S}^{\prime}|\vee 1}\right]=\mathbb{E}\mathbb{E}\left[\frac{|\tilde{S}^{\prime}\setminus S|}{|\tilde{S}^{\prime}|\vee 1}|\check{T}\right]\leq\alpha

by applying Theorem 3 of Barber and Candès (2015). It remains to show that \tilde{S}^{\prime} is identical to \tilde{S}. We have

k~=max{1kp:1+#{jk:pj>λ~b+1}#{jk:pjλ~b+1}1αb+1λ~λ~}=max{1kp:1+#{jk:rj>λ~,τjτk}#{jk:rjλ~,τjτk}1αb+1λ~λ~}=max{1kp:λ~b+1λ~(1+#{jk:rj>λ~,τjτk})#{jk:rjλ~,τjτk}1α},\begin{array}[]{ll}\tilde{k}&=\max\left\{1\leq k\leq p:\frac{1+\#\{j\leq k:p_{j}>\frac{\tilde{\lambda}}{b+1}\}}{\#\{j\leq k:p_{j}\leq\frac{\tilde{\lambda}}{b+1}\}\vee 1}\leq\alpha\frac{b+1-\tilde{\lambda}}{\tilde{\lambda}}\right\}\\ &=\max\left\{1\leq k\leq p:\frac{1+\#\{j\leq k:r_{j}>\tilde{\lambda},\tau_{j}\geq\tau_{k}\}}{\#\{j\leq k:r_{j}\leq\tilde{\lambda},\tau_{j}\geq\tau_{k}\}\vee 1}\leq\alpha\frac{b+1-\tilde{\lambda}}{\tilde{\lambda}}\right\}\\ &=\max\left\{1\leq k\leq p:\frac{\frac{\tilde{\lambda}}{b+1-\tilde{\lambda}}(1+\#\{j\leq k:r_{j}>\tilde{\lambda},\tau_{j}\geq\tau_{k}\})}{\#\{j\leq k:r_{j}\leq\tilde{\lambda},\tau_{j}\geq\tau_{k}\}\vee 1}\leq\alpha\right\},\end{array}

from which we can tell τk~=τ~\tau_{\tilde{k}}=\tilde{\tau}. Hence,

S~={1jp:rjλ~,τjτ~}={1jp:rjλ~,τjτk~}={1jk~:rjλ~}={1jk~:pj=λ~b+1}=S~.\begin{array}[]{ll}\tilde{S}&=\{1\leq j\leq p:r_{j}\leq\tilde{\lambda},\tau_{j}\geq\tilde{\tau}\}\\ &=\{1\leq j\leq p:r_{j}\leq\tilde{\lambda},\tau_{j}\geq\tau_{\tilde{k}}\}\\ &=\{1\leq j\leq\tilde{k}:r_{j}\leq\tilde{\lambda}\}\\ &=\{1\leq j\leq\tilde{k}:p_{j}=\frac{\tilde{\lambda}}{b+1}\}\\ &=\tilde{S}^{\prime}.\end{array}

Thus,

FDR=𝔼[|S~S||S~|1]=𝔼[|S~S||S~|1]α,FDR=\mathbb{E}\left[\frac{|\tilde{S}\setminus S|}{|\tilde{S}|\vee 1}\right]=\mathbb{E}\left[\frac{|\tilde{S}^{\prime}\setminus S|}{|\tilde{S}^{\prime}|\vee 1}\right]\leq\alpha,

which concludes the proof. ∎
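For intuition, the one-bit p-values and the Selective SeqStep+ threshold \tilde{k} used in this proof can be sketched as follows (a minimal Python illustration with our own function name; the variables are assumed to be already ordered so that \tau_{1}\geq\ldots\geq\tau_{p}):

import numpy as np

def seqstep_select(r, b, lam, alpha):
    # r: ranks r_1, ..., r_p after reordering the variables by decreasing tau_j
    r = np.asarray(r)
    c = lam / (b + 1)
    p_vals = np.where(r <= lam, c, 1.0)                  # one-bit p-values
    k_tilde = 0
    for k in range(1, len(r) + 1):
        ratio = (1 + np.sum(p_vals[:k] > c)) / max(int(np.sum(p_vals[:k] <= c)), 1)
        if ratio <= alpha * (b + 1 - lam) / lam:
            k_tilde = k                                  # keep the largest admissible k
    return np.where(p_vals[:k_tilde] <= c)[0]

print(seqstep_select([3, 1, 120, 5, 88, 2, 97, 4], 99, 9, 0.1))   # selects indices 0, 1, 3, 5, 7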

Appendix D Supplementary Simulations

D.1 Impact of the Number of Conditional Samplings

In this part, we inspect the impact of b, the number of conditional samplings, on the performance of the CRT and the dCRT, the two main competitors of our method, in several different model settings. By Theorem 1, the choice of b does not affect the validity of the CRRT's type 1 error control. This conclusion also applies to the other conditional sampling based methods considered in this paper, such as the CRT and the dCRT. As mentioned in the first remark after Theorem 4, larger b leads to less conservative tests. We demonstrate by the following simulations that b=200 is sufficiently large for the CRT and the dCRT, in the sense that the amount of removable conservativeness is negligible. Based on the results in this subsection, we fix b=200 for all conditional sampling based methods in further simulations.

D.1.1 Impact of the Number of Conditional Samplings on the Performance of Distillation-Based Tests

Here, we assess the impact of b on the performance of distillation-based tests (Liu, Katsevich, Janson and Ramdas, 2020), including the d_{0}CRT, the d_{2}CRT and the d_{12}CRT. We consider 3 settings: n>p linear regression, n=p linear regression, and n>p logistic regression. For each setting, we run 100 Monte-Carlo simulations.

{\bf n>p} linear regression: We set n=500 and p=100, with i.i.d. observations. For i=1,2,\ldots,500, (x_{i},z_{[i]}^{T})^{T} is an i.i.d. realization of (X,Z^{T})^{T}, where (X,Z^{T})^{T} is a (p+1)-dimensional random vector following a Gaussian AR(1) process with autoregressive coefficient 0.5. We let

yi=β0xi+z[i]Tβ+εi,i=1,2,,500,y_{i}=\beta_{0}x_{i}+z_{[i]}^{T}\beta+\varepsilon_{i},\ i=1,2,\ldots,500,

where \beta_{0}\in\mathbb{R}, \beta\in\mathbb{R}^{p}, and the \varepsilon_{i}'s are i.i.d. standard Gaussian, independent of x_{i} and z_{[i]}. For simplicity, we fix \beta by letting \beta_{1}=\beta_{2}=0.5 and 0 elsewhere. We consider various values of \beta_{0}: 0, 0.1, 0.2, 0.3, 0.4. We use the linear Lasso in the distillation step, with the regularization parameter tuned by cross-validation, and least squares linear regression in the testing step.
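A minimal Python sketch of this data-generating process is given below (our own illustration; the function name and the stationary AR(1) parameterization with unit marginal variance are our assumptions):

import numpy as np

def gen_np_linear(n=500, p=100, beta0=0.2, rho=0.5, seed=None):
    rng = np.random.default_rng(seed)
    # (X, Z^T)^T follows a stationary Gaussian AR(1) process with coefficient rho
    W = np.empty((n, p + 1))
    W[:, 0] = rng.normal(size=n)
    for j in range(1, p + 1):
        W[:, j] = rho * W[:, j - 1] + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    x, Z = W[:, 0], W[:, 1:]
    beta = np.zeros(p)
    beta[:2] = 0.5                               # beta_1 = beta_2 = 0.5, 0 elsewhere
    y = beta0 * x + Z @ beta + rng.normal(size=n)
    return y, x, Z

y, x, Z = gen_np_linear(beta0=0.0, seed=0)       # beta0 = 0: the null hypothesis holds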

{\bf n=p} linear regression: Everything is the same as in the previous setting except that the dimension of Z is 500 instead of 100.

{\bf n>p} logistic regression: The models for the covariates are the same as in the first setting, while the model for the response variable differs. We let

yi=𝟙{β0xi+z[i]Tβ+εi>0},i=1,2,,500,y_{i}=\mathbbm{1}\left\{\beta_{0}x_{i}+z_{[i]}^{T}\beta+\varepsilon_{i}>0\right\},\ i=1,2,\ldots,500,

where \beta_{0}\in\mathbb{R}, \beta\in\mathbb{R}^{p}, and the \varepsilon_{i}'s are i.i.d. standard Gaussian, independent of x_{i} and z_{[i]}. We use the logistic Lasso in the distillation step, with the regularization parameter tuned by cross-validation, and least squares linear regression in the testing step.

Figure 8: Proportions of rejection given by the d_{0}CRT under 3 choices of b in 3 settings: (a) n>p linear regression, (b) n=p linear regression, (c) n>p logistic regression. The y-axis represents the proportion of times the null hypothesis of conditional independence is rejected out of 100 simulations. The x-axis represents the true coefficient of X in the models.

Figure 9: Proportions of rejection given by the d_{2}CRT under 3 choices of b, in the same 3 settings and with the same axes as Figure 8.

Figure 10: Proportions of rejection given by the d_{12}CRT under 3 choices of b, in the same 3 settings and with the same axes as Figure 8.

Results: From Figures 8, 9 and 10, we observe that each of the 3 distillation-based tests shows similar performance for choices of b ranging from 100 to 50000. In most cases, b=200 leads to a desirable observed type 1 error below 0.05, while b=100 sometimes yields a type 1 error exceeding this threshold. There is no clear evidence that increasing b from 200 to 50000 gains much power. Hence, we take b=200 in the subsequent analysis.

D.1.2 Impact of the Number of Conditional Samplings on the Performance of the CRT

We also assess the impact of b on the performance of the CRT, with the same simulation settings as for the distillation-based tests. Unlike the distillation-based tests, the original CRT introduced in Candès, Fan, Janson and Lv (2018) is not a two-stage test. In the 2 linear regression settings, we evaluate variable importance (i.e., the test statistics) by the fitted coefficients of the linear Lasso; in the logistic regression setting, we use the fitted coefficients of the logistic Lasso. Due to computational limitations, we are not able to simulate the CRT with extremely large b. For each setting, we run 100 Monte-Carlo simulations.

Figure 11: Proportions of rejection given by the CRT under 3 choices of b in 3 settings: (a) n>p linear regression, (b) n=p linear regression, (c) n>p logistic regression. The y-axis represents the proportion of times the null hypothesis of conditional independence is rejected out of 100 simulations. The x-axis represents the true coefficient of X in the models.

Results: From Figure 11, we see no significant difference in performance among the 3 choices of b in the 3 settings considered. In particular, regardless of the value of b, the CRT always exhibits observed type 1 error control under the null hypothesis of conditional independence, which is encouraging. Of course, as mentioned in Candès, Fan, Janson and Lv (2018), there could be settings in which an extremely large b is needed to guarantee good observed type 1 error control in practice. Moreover, the CRT with larger b shows no obvious advantage in power. Therefore, balancing theoretical power against computational burden, we set b=200 for further simulations.

D.2 More Complicated Models

In the main text, we consider the case of a continuous response variable. Here, we provide further investigations for a categorical response variable.

Recall that n=400 and the number of Monte-Carlo simulations is 200 for each setting. For i=1,2,\ldots,400, (x_{i},z_{[i]}^{T})^{T} is an i.i.d. realization of (X,Z^{T})^{T}, where (X,Z^{T})^{T} is a 101-dimensional random vector following a Gaussian AR(1) process with autoregressive coefficient 0.5. We consider the following categorical response models with different forms of interactions:

𝐈𝐧𝐭𝐞𝐫𝐚𝐜𝐭𝐢𝐨𝐧𝐌𝐨𝐝𝐞𝐥 2:{y¨i=β0xij=1100z[i]j+z[i]Tβ¨+εi¨,i=1,2,,400,ẙi=exp(β0xij=1100z[i]j)+(z[i]Tβ̊)31+εi̊,i=1,2,,400,yi={0,ify¨i>0andẙi>0,1,ify¨i>0andẙi0,2,ify¨i0andẙi>0,3,ify¨i0andẙi0,i=1,2,,400,𝐈𝐧𝐭𝐞𝐫𝐚𝐜𝐭𝐢𝐨𝐧𝐌𝐨𝐝𝐞𝐥 3:{y¨i=β0xij=1100z[i]j+z[i]Tβ¨+εi¨,i=1,2,,400,ẙi=β0exp(β0xij=1100z[i]jz[i]j+1)+z[i]Tβ̊+εi̊,i=1,2,,400,yi={0,ify¨i>0andẙi>0,1,ify¨i>0andẙi0,2,ify¨i0andẙi>0,3,ify¨i0andẙi0,i=1,2,,400,\begin{array}[]{ll}{\bf Interaction\ Model\ 2:}&\left\{\begin{array}[]{l}\ddot{y}_{i}=\beta_{0}x_{i}\sum\limits_{j=1}\limits^{100}z_{[i]j}+z_{[i]}^{T}\ddot{\beta}+\ddot{\varepsilon_{i}},\ i=1,2,\ldots,400,\\ \mathring{y}_{i}=exp\left(\beta_{0}x_{i}\sum\limits_{j=1}\limits^{100}z_{[i]j}\right)+(z_{[i]}^{T}\mathring{\beta})^{3}-1+\mathring{\varepsilon_{i}},\ i=1,2,\ldots,400,\\ y_{i}=\left\{\begin{array}[]{ll}0,&if\ \ddot{y}_{i}>0\ and\ \mathring{y}_{i}>0,\\ 1,&if\ \ddot{y}_{i}>0\ and\ \mathring{y}_{i}\leq 0,\\ 2,&if\ \ddot{y}_{i}\leq 0\ and\ \mathring{y}_{i}>0,\\ 3,&if\ \ddot{y}_{i}\leq 0\ and\ \mathring{y}_{i}\leq 0,\\ \end{array}\right.\ i=1,2,\ldots,400,\\ \end{array}\right.\\ {\bf Interaction\ Model\ 3:}&\left\{\begin{array}[]{l}\ddot{y}_{i}=\beta_{0}x_{i}\sum\limits_{j=1}\limits^{100}z_{[i]j}+z_{[i]}^{T}\ddot{\beta}+\ddot{\varepsilon_{i}},\ i=1,2,\ldots,400,\\ \mathring{y}_{i}=\beta_{0}exp\left(\beta_{0}x_{i}\sum\limits_{j=1}\limits^{100}z_{[i]j}z_{[i]j+1}\right)+z_{[i]}^{T}\mathring{\beta}+\mathring{\varepsilon_{i}},\ i=1,2,\ldots,400,\\ y_{i}=\left\{\begin{array}[]{ll}0,&if\ \ddot{y}_{i}>0\ and\ \mathring{y}_{i}>0,\\ 1,&if\ \ddot{y}_{i}>0\ and\ \mathring{y}_{i}\leq 0,\\ 2,&if\ \ddot{y}_{i}\leq 0\ and\ \mathring{y}_{i}>0,\\ 3,&if\ \ddot{y}_{i}\leq 0\ and\ \mathring{y}_{i}\leq 0,\\ \end{array}\right.\ i=1,2,\ldots,400,\\ \end{array}\right.\\ \end{array}

where \beta_{0}\in\mathbb{R}, \beta,\ddot{\beta},\mathring{\beta}\in\mathbb{R}^{100}, and the \varepsilon_{i},\ddot{\varepsilon}_{i},\mathring{\varepsilon}_{i}'s are i.i.d. standard Gaussian, independent of x_{i} and z_{[i]}. To construct \beta, we randomly sample a set S of size 5 without replacement from \{1,2,\ldots,100\}. For j=1,\ldots,100, let \beta_{j}=0 if j\notin S, and \beta_{j}=0.5\mathbbm{1}\{B_{j}=1\}-0.5\mathbbm{1}\{B_{j}=0\} if j\in S, where B_{j} is Bernoulli. \ddot{\beta} and \mathring{\beta} are constructed independently in the same way. We consider various values of \beta_{0} for the different models. The difference between the third and the second model is that, in the third, the covariates entangle with each other in a more involved way.
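As an illustration, the categorical response of the Interaction Model 2 could be generated as in the following sketch (our own code; the function names are ours, and B_j is taken to be Bernoulli(1/2), an assumption the text leaves implicit):

import numpy as np

def sparse_beta(p, rng, size=5, mag=0.5):
    beta = np.zeros(p)
    S = rng.choice(p, size=size, replace=False)                   # random support of size 5
    beta[S] = mag * np.where(rng.random(size) < 0.5, 1.0, -1.0)   # +/- 0.5 with equal probability
    return beta

def interaction_model_2(x, Z, beta0, rng):
    n, p = Z.shape
    b_dd, b_ring = sparse_beta(p, rng), sparse_beta(p, rng)       # beta-double-dot and beta-ring
    s = Z.sum(axis=1)
    y_dd = beta0 * x * s + Z @ b_dd + rng.normal(size=n)
    y_ring = np.exp(beta0 * x * s) + (Z @ b_ring) ** 3 - 1 + rng.normal(size=n)
    return 2 * (y_dd <= 0) + (y_ring <= 0)                        # encodes the four sign cases as 0-3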

Figure 12: (a) Proportions of rejection under the Interaction Model 2. The y-axis represents the proportion of times the null hypothesis of conditional independence is rejected out of 200 simulations. The x-axis represents the true coefficient of X in the models. (b) Time in seconds per Monte-Carlo simulation. dCRT_k_rf represents the dCRT with random forest in both the distillation step and the testing step, keeping k important variables after distillation. dCRT_k_rf_ols represents the dCRT with random forest in the distillation step, least squares linear regression in the testing step, and keeping k important variables after distillation. CRRT_k_rf represents the CRRT using random forest as the test function and with batch size (b+1)/k. HRT_0k represents the HRT fitting a random forest model and using a training set of size n\times k\times 0.1.

Figure 13: (a) Proportions of rejection under the Interaction Model 3, with the same axes as Figure 12. (b) Time in seconds per Monte-Carlo simulation. Definitions of legends can be found in Figure 12.

Results: Results under the Interaction Models 2 and 3, where the response variable Y is categorical, are provided in Figures 12 and 13. Because the results under these 2 models are similar, we analyze them together. The CRRT's and the dCRT_12_rf are among the most powerful tests. However, the dCRT_12_rf is on average at least 4 times slower than the CRRT_4_rf, which is consistently the most powerful test. Again, increasing the number of important variables kept after the distillation step does not improve power while costing more computation. Taking both power and efficiency into consideration, the CRRT with batch size 50 is clearly the best choice.

D.3 Supplementary Results for Subsection 5.4

It is understandable that misspecification of the conditional distribution of $X|Z$ can not only inflate the type 1 error, as shown in Subsection 5.4, but can also drive it downwards. A simple intuitive example is the following. Suppose that $Y=Z+\varepsilon$, where both $Y$ and $Z$ are scalars and $\varepsilon\sim N(0,1)$. Let $X\sim N(0,1)$, independent of $Z$ and $\varepsilon$, while $X^{(k)}=\theta Z$ for $k=1,\ldots,b$. As long as $\theta\neq 0$, the pseudo variables are more strongly related to $Y$ than $X$ is, so most methods will judge the pseudo variables to be more conditionally important, which drives the type 1 error downwards. Here, we give more examples under non data-dependent misspecification to illustrate the mixed influence of conditional distribution misspecification. We consider the following 4 models.
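The following minimal simulation illustrates this deflation effect under the example above. The absolute sample correlation with $Y$ is used as the test statistic purely for illustration (it is not the test function used in our simulations), and the usual CRT-style randomization p-value $(1+\#\{k:T^{(k)}\geq T\})/(1+b)$ is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, b, theta, alpha, n_rep = 400, 100, 0.5, 0.05, 500

def one_pvalue():
    z = rng.standard_normal(n)
    y = z + rng.standard_normal(n)   # Y = Z + eps, so H0: Y indep. of X | Z holds
    x = rng.standard_normal(n)       # the true X is pure noise
    t_obs = abs(np.corrcoef(x, y)[0, 1])
    # Every pseudo copy equals theta * z, so all b pseudo statistics coincide
    # and strongly correlate with Y whenever theta != 0.
    t_pseudo = abs(np.corrcoef(theta * z, y)[0, 1])
    return (1 + b * (t_pseudo >= t_obs)) / (1 + b)

rej_rate = np.mean([one_pvalue() <= alpha for _ in range(n_rep)])
print(f"empirical type 1 error: {rej_rate:.3f}")  # near 0, far below alpha = 0.05
```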

${\bf n>p}$ linear regression: We set $n=400$, $p=100$, and the observations are i.i.d. For $i=1,2,\ldots,400$, $(x_{i},z_{[i]}^{T})^{T}$ is an i.i.d. realization of $(X,Z^{T})^{T}$, a $(p+1)$-dimensional random vector. $Z\in\mathbb{R}^{p}$ follows a Gaussian $AR(1)$ process with autoregressive coefficient 0.5 and $X|Z\sim N(\mu(Z),\sigma^{2})$, where $\mu$ and $\sigma^{2}$ will be defined shortly. We let

\[
y_{i}=0\cdot x_{i}+z_{[i]}^{T}\beta+\varepsilon_{i},\quad i=1,2,\ldots,400,
\]

where $\beta\in\mathbb{R}^{p}$ and the $\varepsilon_{i}$'s are i.i.d. standard Gaussian, independent of $x_{i}$ and $z_{[i]}$. $\beta$ is constructed as follows. We randomly sample a set $S$ of size 20 without replacement from $\{1,2,\ldots,p\}$. For $j=1,\ldots,p$, let $\beta_{j}=0$ if $j\notin S$, and $\beta_{j}=0.5\mathbbm{1}\{B_{j}=1\}-0.5\mathbbm{1}\{B_{j}=0\}$ if $j\in S$, where the $B_{j}$'s are independent Bernoulli variables.
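A sketch of this data-generating process is given below, assuming a stationary $AR(1)$ process with unit marginal variance and Bernoulli$(1/2)$ coefficient signs, neither of which is fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, s, rho = 400, 100, 20, 0.5

# Each row of z is an independent draw from a stationary Gaussian AR(1)
# process with autoregressive coefficient 0.5 (unit marginal variance assumed).
z = np.empty((n, p))
z[:, 0] = rng.standard_normal(n)
for j in range(1, p):
    z[:, j] = rho * z[:, j - 1] + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)

# Sparse beta: 20 randomly placed +/-0.5 entries.
beta = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta[support] = rng.choice([0.5, -0.5], size=s)

# x is drawn from the (possibly misspecified) conditional law N(mu(Z), sigma^2)
# defined later in this subsection; under H0 it enters y with coefficient 0.
def simulate_y(x, z, beta, rng=rng):
    return 0.0 * x + z @ beta + rng.standard_normal(len(x))
```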

${\bf n>p}$ cubic regression: $x$ and $z$ are generated in the same way as in the previous model. Instead of assuming a linear model, we let

\[
y_{i}=0\cdot x_{i}+(z_{[i]}\circ z_{[i]}\circ z_{[i]})^{T}\beta+\varepsilon_{i},\quad i=1,2,\ldots,400,
\]

where $\beta\in\mathbb{R}^{p}$, and the $\varepsilon_{i}$'s and $\beta$ are constructed in the same way as in the previous model.

${\bf n=p}$ linear regression: We set $n=400$, $p=400$, and the observations are i.i.d. For $i=1,2,\ldots,400$, $(x_{i},z_{[i]}^{T})^{T}$ is an i.i.d. realization of $(X,Z^{T})^{T}$, a $(p+1)$-dimensional random vector. $Z\in\mathbb{R}^{p}$ follows a Gaussian $AR(1)$ process with autoregressive coefficient 0.5 and $X|Z\sim N(\mu(Z),\sigma^{2})$. We let

\[
y_{i}=0\cdot x_{i}+z_{[i]}^{T}\beta+\varepsilon_{i},\quad i=1,2,\ldots,400,
\]

where $\beta\in\mathbb{R}^{p}$ and the $\varepsilon_{i}$'s are i.i.d. standard Gaussian, independent of $x_{i}$ and $z_{[i]}$. $\beta$ is constructed in the same way as in the $n>p$ linear regression model, with $S$ now a set of size 20 sampled without replacement from $\{1,2,\ldots,400\}$.

${\bf n>p}$ logistic regression: $x$ and $z$ are generated in the same way as in the $n>p$ linear regression model, with $p=100$. We consider a binary model,

\[
y_{i}=\mathbbm{1}\{0\cdot x_{i}+z_{[i]}^{T}\beta+\varepsilon_{i}>0\},\quad i=1,2,\ldots,400,
\]

where $\beta\in\mathbb{R}^{p}$, and the $\varepsilon_{i}$'s and $\beta$ are constructed in the same way as in the $n>p$ linear regression model.

Figure 14: Rejection rates of 5 Lasso-based tests under the $n>p$ linear regression model and 3 non data-dependent misspecification settings: (a) Quadratic, (b) Cubic, (c) Tanh. The y-axis shows the proportion of the 500 Monte Carlo simulations in which the null hypothesis of conditional independence is rejected; the x-axis shows the value of $\theta$, which measures the degree of misspecification of the conditional distribution. CRRT_lasso_k denotes the CRRT using the linear Lasso as the test function, with batch size $(b+1)/k$. dCRT_lasso_k denotes the dCRT with the linear Lasso used in the distillation step, keeping k important variables after distillation.
Figure 15: Rejection rates of 5 OLS-based tests under the $n>p$ linear regression model and 3 non data-dependent misspecification settings: (a) Quadratic, (b) Cubic, (c) Tanh. The y-axis shows the proportion of the 500 Monte Carlo simulations in which the null hypothesis of conditional independence is rejected; the x-axis shows the value of $\theta$, which measures the degree of misspecification of the conditional distribution. CRRT_OLS_k denotes the CRRT using OLS linear regression as the test function, with batch size $(b+1)/k$. dCRT_OLS_k denotes the dCRT with OLS linear regression used in both the distillation and testing steps, keeping k important variables after distillation.
Figure 16: Rejection rates of 5 Lasso-based tests under the $n>p$ cubic regression model and 3 non data-dependent misspecification settings: (a) Quadratic, (b) Cubic, (c) Tanh. The y-axis shows the proportion of the 500 Monte Carlo simulations in which the null hypothesis of conditional independence is rejected; the x-axis shows the value of $\theta$, which measures the degree of misspecification of the conditional distribution. Legend definitions are given in Figure 14.
Figure 17: Rejection rates of 5 Lasso-based tests under the $n=p$ linear regression model and 3 non data-dependent misspecification settings: (a) Quadratic, (b) Cubic, (c) Tanh. The y-axis shows the proportion of the 500 Monte Carlo simulations in which the null hypothesis of conditional independence is rejected; the x-axis shows the value of $\theta$, which measures the degree of misspecification of the conditional distribution. Legend definitions are given in Figure 14.
Figure 18: Rejection rates of 5 logistic Lasso-based tests under the $n>p$ logistic regression model and 3 non data-dependent misspecification settings: (a) Quadratic, (b) Cubic, (c) Tanh. The y-axis shows the proportion of the 500 Monte Carlo simulations in which the null hypothesis of conditional independence is rejected; the x-axis shows the value of $\theta$, which measures the degree of misspecification of the conditional distribution. CRRT_logistic_k denotes the CRRT using the logistic Lasso as the test function, with batch size $(b+1)/k$. dCRT_logistic_k denotes the dCRT with the logistic Lasso used in the distillation step, keeping k important variables after distillation.

For each pseudo variable $X^{(k)}$, we let its conditional distribution be $Q(\cdot|Z)=N(Z^{T}\zeta,\sigma^{2})$, with $\zeta$ and $\sigma^{2}$ chosen so that $(X^{(k)},Z^{T})^{T}$ follows a Gaussian $AR(1)$ process with autoregressive coefficient 0.5. This $\sigma^{2}$ is the one appearing in the model definitions above. We consider 3 types of non data-dependent conditional distribution misspecification by taking $\mu(Z)$ different from $Z^{T}\zeta$; these are the same as in Berrett, Wang, Barber and Samworth (2019):

  • Quadratic: $\mu(Z)=Z^{T}\zeta+\theta(Z^{T}\zeta)^{2}$,

  • Cubic: $\mu(Z)=Z^{T}\zeta+\theta(Z^{T}\zeta)^{3}$,

  • Tanh: $\mu(Z)=\frac{\tanh(\theta Z^{T}\zeta)}{\theta}\mathbbm{1}\{\theta\neq 0\}+Z^{T}\zeta\,\mathbbm{1}\{\theta=0\}$,

where $\theta\in\mathbb{R}$ reflects the degree of misspecification.
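The sketch below encodes the three misspecified means together with the samplers for the true covariate and the pseudo copies. The closed forms $\zeta=(0.5,0,\ldots,0)^{T}$ and $\sigma^{2}=1-0.5^{2}=0.75$ follow from the Markov property of a stationary $AR(1)$ process under the additional assumption of unit marginal variance, which the text does not fix.

```python
import numpy as np

rho = 0.5
sigma2 = 1.0 - rho**2   # = 0.75 under the unit-marginal-variance assumption

def q_mean(z):
    # Z^T zeta with zeta = (rho, 0, ..., 0): by the Markov property, only the
    # coordinate of Z adjacent to X in the AR(1) chain matters.
    return rho * z[:, 0]

# Three non data-dependent misspecifications of the true conditional mean
# mu(Z); theta = 0 recovers the correctly specified case in all three.
def mu_quadratic(z, theta):
    m = q_mean(z)
    return m + theta * m**2

def mu_cubic(z, theta):
    m = q_mean(z)
    return m + theta * m**3

def mu_tanh(z, theta):
    m = q_mean(z)
    return np.tanh(theta * m) / theta if theta != 0 else m

def draw_x(z, mu, theta, rng):
    # True covariate: X | Z ~ N(mu(Z), sigma^2).
    return mu(z, theta) + np.sqrt(sigma2) * rng.standard_normal(z.shape[0])

def draw_pseudo(z, rng):
    # One pseudo copy: X^(k) | Z ~ Q(.|Z) = N(Z^T zeta, sigma^2).
    return q_mean(z) + np.sqrt(sigma2) * rng.standard_normal(z.shape[0])
```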

Results: Results for the $n>p$ linear regression model are given in Figures 14 and 15. With OLS-based CRRT's or dCRT's, the results generally become more and more conservative as $Q^{\star}$ deviates further from $Q$. Lasso-based dCRT's behave similarly, while Lasso-based CRRT's show increasing type 1 error as the misspecification grows.

Results for the $n>p$ cubic regression model are given in Figure 16. We see quite different phenomena across the misspecification settings. Under quadratic misspecification, the impact appears non-monotonic. Under cubic misspecification, dCRT_lasso_12 becomes more conservative while the other tests show the opposite behavior as $\theta$ increases. Under tanh misspecification, the dCRT's become more conservative while the CRRT's become more aggressive as $\theta$ increases.

Results for the $n=p$ linear regression model are given in Figure 17. Under every type of misspecification, the dCRT's grow more conservative as $\theta$ increases. The CRRT's show a slight increase in type 1 error as $\theta$ increases under quadratic and cubic misspecification, and a more pronounced increase under tanh misspecification.

Results for the $n>p$ logistic regression model are given in Figure 18. Again, the dCRT's become more conservative as $\theta$ increases. The CRRT's remain stable under quadratic misspecification, show decreasing type 1 error under cubic misspecification, and show increasing type 1 error under tanh misspecification.

To sum up, the behavior here is quite different from that in Subsection 5.4, where both the CRRT and the dCRT show monotonically increasing rejection rates as the degree of misspecification increases. These results demonstrate that how misspecification of the conditional distribution affects the type 1 error is quite complicated and unpredictable: it depends on the joint model, the test used, and the type of misspecification.