
Inference with approximate local false discovery rates

Rajesh Karmakar
Department of Statistics and Operations Research, Tel-Aviv University

Ruth Heller
Department of Statistics and Operations Research, Tel-Aviv University

Saharon Rosset
Department of Statistics and Operations Research, Tel-Aviv University
Abstract

Efron's two-group model is widely used in large-scale multiple testing. This model assumes that the test statistics are mutually independent; however, in realistic settings they are typically dependent, and taking the dependence into account can boost power. The general two-group model takes the dependence between the test statistics into account. Optimal policies in the general two-group model require calculating, for each hypothesis, the probability that it is a true null given all test statistics, denoted the local false discovery rate (locFDR). Unfortunately, calculating locFDRs under realistic dependence structures can be computationally prohibitive. We propose calculating approximate locFDRs based on a properly defined N-neighborhood for each hypothesis. We prove that thresholding the approximate locFDRs with a fixed threshold controls the marginal false discovery rate for any dependence structure. Furthermore, we prove that this is the optimal procedure in a restricted class of decision rules, where the decision for each hypothesis is guided only by its N-neighborhood. We show through extensive simulations that our proposed method achieves substantial power gains compared to alternative practical approaches, while maintaining conceptual simplicity and computational feasibility. We demonstrate the utility of our method on a genome-wide association study of height.


Keywords: dependent test statistics; large-scale inference; marginal false discovery rate (mFDR); multiple testing; genome-wide association studies (GWAS)

1 Introduction

The two-group model, introduced in Efron et al. (2001), is a popular mixture-model-based framework for large-scale inference. Within this framework it is common to control false discoveries by controlling the False Discovery Rate (FDR, Benjamini and Hochberg, 1995), or its closely related variant the marginal FDR (mFDR, Genovese and Wasserman, 2002).

Sun and Cai (2007) developed an adaptive multiple testing rule for mFDR control in the two-group model and showed that such a procedure is optimal in the sense that it minimizes the false non-discovery rate while controlling the mFDR, under independence of the tests. This rule is based on thresholding the marginal local FDR (locFDR), which is the conditional probability of the hypothesis being null given only the observed test statistic for that hypothesis. If we denote the true state of test $i$ by $h_i\in\{0,1\}$ and the observed test statistic by $Z_i$, then the marginal $locFDR=\mathbb{P}(h_i=0|Z_i)$. In a series of papers (Efron, 2007, 2010), Efron investigated the effect of correlations on the outcome of multiple testing procedures, raising the concern that approaches that ignore dependence may not correctly control the FDR, so that it is necessary to accommodate these correlations in multiple testing methodology.

Putting power at the forefront, Xie et al. (2011) introduced the general two-group model with an arbitrary dependence structure among the z-scores. They showed that the optimal policy for mFDR control is thresholding the "oracle" locFDR statistics, $\mathbb{P}(h_i=0|Z_1,\ldots,Z_K)=\mathbb{P}(h_i=0|\boldsymbol{Z})$, with a fixed threshold. The oracle locFDR statistics explicitly use the correlation among the z-scores. Recently, Heller and Rosset (2021) showed that the oracle locFDRs are the optimal test statistics for controlling the FDR as well, under general dependence. They showed that for some special dependence structures, such as block dependence or the equicorrelated setting, it may be practical to calculate these oracle locFDRs for dependent data. However, beyond these special cases there is no computationally feasible method for calculating the full locFDR under dependence. A standard approach is to approximate the oracle locFDR statistics by suboptimal marginal locFDR statistics, which are much easier to compute. By ignoring the correlation among the test statistics, this approach leads to a substantial loss of power, as demonstrated empirically in our analyses below.

As a real-life example, in GWAS we are interested in finding significant SNPs associated with a phenotype among millions of SNPs in the whole genome. It is well understood that nearby SNPs on the genome are highly correlated due to linkage disequilibrium (LD), while SNPs which are far apart are very weakly correlated. Marginal analysis is a standard procedure for finding associations in GWAS, but it does not identify the SNPs that are associated with the phenotype given all other SNPs. For a joint analysis, one approach is to approximate the oracle locFDRs using a complete Bayesian treatment, which relies on methods like Markov Chain Monte Carlo (MCMC) for inference. For example, Zhu and Stephens (2017) developed a Bayesian approach for finding associations using GWAS summary statistics, which takes into account the correlations among the SNPs. Their approach relies on publicly available estimates of the effect sizes of $K$ SNPs, $\hat\beta_j$ for $1\le j\le K$, calculated from marginal regressions without taking correlations into account, and builds a complete model taking into account the correlations between these estimates, using a known correlation (LD) matrix. They then implement a full Bayesian approach using MCMC to ultimately infer the posterior null probabilities $\mathbb{P}(\beta_j=0|\hat{\boldsymbol{\beta}})$, where $\hat{\boldsymbol{\beta}}=(\hat\beta_1,\ldots,\hat\beta_K)$. In practice, their procedure is very time consuming, as each MCMC iteration takes substantial computation, and the entire procedure has to be repeated several times to confirm convergence of the MCMC solution.

In this paper we propose a novel approach for mFDR control under dependence in the two-group model, which can be viewed as a compromise between the simplistic approach of relying on marginal locFDRs and the computationally prohibitive approach of calculating the full oracle locFDRs and obtaining optimal power. We propose to define the N-neighborhood of each test, for some small $N\in\mathbb{N}$, and to only consider multiple testing procedures that base the decision for test $i$ only on its $N$-neighborhood. We denote this class of procedures (or decision policies) by $\mathcal{C}_N$. For example, in our motivating application to GWAS, the LD structure implies that physically close SNPs are correlated, and the range of correlation is roughly similar throughout the genome, so the neighborhood of each SNP is defined by the group of $2\times N$ SNPs surrounding it.

The rest of the paper is organized as follows.

In § 2 we define all the necessary notation and state our goal in this paper. In § 3 we prove in Theorem 3.1 that within the reduced class $\mathcal{C}_N$, the optimal test that maximizes power while controlling the mFDR in the two-group model relies on thresholding the posterior probabilities $T_{i,N}=\mathbb{P}(h_i=0|Z_{i-N},\dots,Z_i,\dots,Z_{i+N})$, which we term $locFDR_N$. We further show in Corollary 3.1.1 that the power (expected number of true discoveries) of the resulting rules is an increasing function of $N$, making explicit the trade-off between power and computation that the neighborhood approximation provides. In § 4 we show that the computation of the $locFDR_N$ statistics is linear in the number of hypotheses. Furthermore, we show how to estimate the $locFDR_N$ statistics assuming a reasonable model for GWAS. Finally, we summarize our complete data-driven algorithm. In § 5 we perform extensive simulations with different covariance structures and observe that the $locFDR_N$ statistics achieve more power than other practical methods. In § 6 we provide a complete analysis pipeline for GWAS using summary statistics. The required input data are the marginal SNP effect sizes ($\hat\beta_j$), the standard errors of the effects ($s.e.(\hat\beta_j)\approx\hat\sigma_j$), and an estimated matrix $\hat R$ of LD (correlations) among the $z$-scores ($Z_j=\hat\beta_j/\hat\sigma_j$) of the SNPs.

2 Notation and Goal

We assume the data are generated from the generalized two-group model. The vector of hypothesis states, $\boldsymbol{h}=(h_1,\dots,h_K)$, has entries sampled independently from the $Ber(\pi)$ distribution. Given the hypothesis states $\boldsymbol{h}$, the test statistics for testing the $K$ null hypotheses $h_i=0$, $i=1,\ldots,K$, follow $\boldsymbol{Z}=(Z_1,\dots,Z_K)\sim g(\boldsymbol{z}|\boldsymbol{h})$. Based on $\boldsymbol{Z}$, the goal is to construct a multiple testing procedure, defined by the decision rule $\boldsymbol{D}(\boldsymbol{Z})=(D_1(\boldsymbol{Z}),\dots,D_K(\boldsymbol{Z})):\mathbb{R}^K\to\{0,1\}^K$, which has good power while controlling a meaningful error rate.

Let $T_i(\boldsymbol{Z})$ denote the oracle locFDR, which is the statistic for optimal decisions (Xie et al., 2011; Heller and Rosset, 2021):

T_i(\boldsymbol{z}) = \mathbb{P}(h_i=0|\boldsymbol{Z}=\boldsymbol{z}) = \dfrac{\sum_{\boldsymbol{h}:h_i=0}\mathbb{P}(\boldsymbol{z}|\boldsymbol{h})\mathbb{P}(\boldsymbol{h})}{\sum_{\boldsymbol{h}}\mathbb{P}(\boldsymbol{z}|\boldsymbol{h})\mathbb{P}(\boldsymbol{h})}.
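To make the cost of this enumeration concrete, here is a minimal brute-force sketch in Python under the Gaussian mixture model (7) introduced in § 4 (the function name is ours, and the parameters $\pi, b, \tau^2, \Sigma$ are those of that model). As discussed below, this enumeration costs $O(K2^K)$ and is infeasible beyond small $K$:

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def oracle_locfdr(z, Sigma, pi, b, tau2):
    """Brute-force oracle locFDR T_i(z) = P(h_i = 0 | Z = z); a sketch
    assuming model (7): Z|h ~ N(b*h, Sigma + tau2*diag(h)), h_i ~ Ber(pi).
    The cost is O(K 2^K), so this is only feasible for very small K."""
    K = len(z)
    num = np.zeros(K)   # sums over {h : h_i = 0} of P(z|h) P(h)
    den = 0.0           # sum over all h of P(z|h) P(h)
    for h_tuple in itertools.product([0, 1], repeat=K):
        h = np.array(h_tuple)
        prior = pi ** h.sum() * (1 - pi) ** (K - h.sum())
        lik = multivariate_normal.pdf(z, mean=b * h,
                                      cov=Sigma + tau2 * np.diag(h))
        w = prior * lik
        den += w
        num += w * (h == 0)
    return num / den
```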

The expected numbers of true and false positives are simple expressions of the oracle locFDRs:

TP(\boldsymbol{D}) = \mathbb{E}_{\boldsymbol{Z},\boldsymbol{h}}(\boldsymbol{h}^T\boldsymbol{D}) = \mathbb{E}_{\boldsymbol{Z}}\left(\sum_{i=1}^K D_i(\boldsymbol{Z})(1-T_i(\boldsymbol{Z}))\right),
\mathbb{E}(V(\boldsymbol{D})) = \mathbb{E}_{\boldsymbol{Z},\boldsymbol{h}}((1-\boldsymbol{h})^T\boldsymbol{D}) = \mathbb{E}_{\boldsymbol{Z}}\left(\sum_{i=1}^K D_i(\boldsymbol{Z})T_i(\boldsymbol{Z})\right).

A relevant error rate is the marginal false discovery rate (mFDR),

mFDR(\boldsymbol{D}) = \dfrac{\mathbb{E}_{\boldsymbol{Z},\boldsymbol{h}}(V(\boldsymbol{D}))}{\mathbb{E}_{\boldsymbol{Z},\boldsymbol{h}}(R(\boldsymbol{D}))},

where $\mathbb{E}_{\boldsymbol{Z},\boldsymbol{h}}(R(\boldsymbol{D})) = \mathbb{E}_{\boldsymbol{Z},\boldsymbol{h}}(V(\boldsymbol{D})) + TP(\boldsymbol{D})$ is the expected number of rejections.

Note that some expectations above are over the joint distribution of $(\boldsymbol{Z},\boldsymbol{h})$ and some are over $\boldsymbol{Z}$. From now on we do not denote the random variable with respect to which the expectation is taken, since it is easily understood from the context. It has been shown in Xie et al. (2011) that the optimal rule over the class of all decision functions $\boldsymbol{D}:\mathbb{R}^K\to\{0,1\}^K$ for the above problem is thresholding the locFDR statistic with a fixed threshold. The decision rule is thus $D_i(\boldsymbol{z})=\mathbb{I}(T_i(\boldsymbol{z})\le c)$, where $c$ is a fixed constant depending on $\alpha$. The complexity of naively calculating all the locFDR statistics is $O(K2^K)$, which is typically infeasible. Hence the optimal procedure is impractical, except for some very special types of dependencies across the $z$-scores, e.g., block dependence with small block sizes (Heller and Rosset, 2021). Due to the difficulty in finding the globally optimal solution, we suggest considering a smaller class of decision functions which depend only on $N$ neighbors, for $N\in\{0,\ldots,K\}$:

\mathcal{C}_N = \left\{\boldsymbol{D}:\mathbb{R}^K\to\{0,1\}^K \;\middle|\; \boldsymbol{D}\text{ is of the form } \boldsymbol{D}(\boldsymbol{z})=(D_1(\boldsymbol{z}_{1,N}),\dots,D_K(\boldsymbol{z}_{K,N}))\right\}, (1)
where $\boldsymbol{z}_{i,N}=(z_{\max(i-N,1)},\dots,z_{i-1},z_i,z_{i+1},\dots,z_{\min(i+N,K)})$, $1\le i\le K$.

In this definition we assume that the tests are ordered and each test's "neighbors" are defined by the tests next to it in this order. We also assume that the neighborhood size $N$ is fixed for all tests. Both of these assumptions can be relaxed, but we make them here for simplicity of notation and presentation of the resulting algorithms.

Given a choice of $N$ and the resulting class $\mathcal{C}_N$, we show in § 3 that the optimal testing procedure limited to rules in $\mathcal{C}_N$ relies on the locFDR after marginalization, so that the test statistic is within this class:

T_{i,N} = \mathbb{P}(h_i=0|\boldsymbol{Z}_{i,N}). (2)

Henceforth, we refer to $T_{i,N}$ as the $locFDR_N$. For $N=0$, $\mathcal{C}_0$ includes only marginal rules and hence $locFDR_0$ is the marginal locFDR (Efron et al., 2001), while for $N=K$, $locFDR_K$ (Sun and Cai, 2009; Xie et al., 2011; Heller and Rosset, 2021) is the true locFDR. We can view increasing $N$ as offering a trade-off between computation and power: for small $N$, calculating $locFDR_N$ and the resulting optimal rules within the limited class $\mathcal{C}_N$ is computationally efficient but can suffer a loss of power, and as $N$ increases the computations become exponentially more complex but power improves. As we demonstrate below, in GWAS-like situations with a natural neighborhood structure and strong dependence only within neighborhoods, the power increase can be rapid when $N$ is small, and we can obtain an excellent power-computation trade-off (Figure 1(a)). For other types of correlation structures we also get a good trade-off (Figures 1(b), 1(c)). Our aim here is not to find the optimal solution (which is not implementable), but a relaxation of the optimal solution which improves power significantly compared to existing marginal test statistics. We show in § 5 that the power gain from using the $locFDR_N$ statistics, compared to existing methods, can be substantial.

3 OMT Policy in the restricted class $\mathcal{C}_N$

In the following proposition we state a simple but useful result.

Proposition 3.1.

Define the quantity

t_{\alpha,N} = \sup\left\{t : \dfrac{\mathbb{E}\sum_{i=1}^K \mathbb{I}(T_{i,N}\le t)\,T_{i,N}}{\mathbb{E}\sum_{i=1}^K \mathbb{I}(T_{i,N}\le t)} \le \alpha\right\} (3)

for $N\ge 0$ and $0<\alpha<1$. Then the procedure which rejects $H_0^i$ if $T_{i,N}\le t_{\alpha,N}$, for $i=1,\dots,K$, controls the mFDR at level $\alpha$.

Proof.

The mFDR of the procedure which rejects $H_0^i$ if $T_{i,N}\le t_{\alpha,N}$ is given by

mFDR = \dfrac{\mathbb{E}\left[\sum_{i=1}^K \mathbb{I}(T_{i,N}\le t_{\alpha,N},\,h_i=0)\right]}{\mathbb{E}\left[\sum_{i=1}^K \mathbb{I}(T_{i,N}\le t_{\alpha,N})\right]} = \dfrac{\sum_{i=1}^K \mathbb{E}_{(\boldsymbol{Z}_{i,N},h_i)}\left[\mathbb{I}(T_{i,N}\le t_{\alpha,N},\,h_i=0)\right]}{\sum_{i=1}^K \mathbb{E}_{\boldsymbol{Z}_{i,N}}\left[\mathbb{I}(T_{i,N}\le t_{\alpha,N})\right]}
= \dfrac{\sum_{i=1}^K \mathbb{E}_{\boldsymbol{Z}_{i,N}}\mathbb{E}_{h_i|\boldsymbol{Z}_{i,N}}\left[\mathbb{I}(T_{i,N}\le t_{\alpha,N},\,h_i=0)\right]}{\sum_{i=1}^K \mathbb{E}_{\boldsymbol{Z}_{i,N}}\left[\mathbb{I}(T_{i,N}\le t_{\alpha,N})\right]} = \dfrac{\sum_{i=1}^K \mathbb{E}_{\boldsymbol{Z}_{i,N}}\left[\mathbb{I}(T_{i,N}\le t_{\alpha,N})\,\mathbb{P}(h_i=0|\boldsymbol{Z}_{i,N})\right]}{\sum_{i=1}^K \mathbb{E}_{\boldsymbol{Z}_{i,N}}\left[\mathbb{I}(T_{i,N}\le t_{\alpha,N})\right]}
= \dfrac{\sum_{i=1}^K \mathbb{E}\left[\mathbb{I}(T_{i,N}\le t_{\alpha,N})\,T_{i,N}\right]}{\sum_{i=1}^K \mathbb{E}\left[\mathbb{I}(T_{i,N}\le t_{\alpha,N})\right]} \le \alpha,

where the last inequality follows from the definition of $t_{\alpha,N}$. ∎

A similar result was given in Heller and Rosset (2021) for the marginal locFDR statistics $\{T_{1,0},\dots,T_{K,0}\}$, but we expect more power from using the procedure based on $T_{i,N}$ with $N>0$.

The optimal multiple testing procedure with mFDR control is a solution to the problem:

\max_{\boldsymbol{D}:\mathbb{R}^K\to\{0,1\}^K} TP(\boldsymbol{D}) \quad\text{subject to}\quad mFDR(\boldsymbol{D})\le\alpha. (4)

Now we concentrate on the class $\mathcal{C}_N$ and find an optimal solution within this restricted class, so the optimization problem is

\max_{\boldsymbol{D}\in\mathcal{C}_N} TP(\boldsymbol{D}) \quad\text{subject to}\quad mFDR(\boldsymbol{D})\le\alpha. (5)
Theorem 3.1.

The procedure detailed in Proposition 3.1 is the solution to the OMT problem in Equation (5), i.e., the optimal policy is $\boldsymbol{D}_N^*=\left(\mathbb{I}(T_{1,N}\le t_{\alpha,N}),\dots,\mathbb{I}(T_{K,N}\le t_{\alpha,N})\right)$.

Proof.

See Appendix A. ∎

Remark.

Though fixed-threshold rules based on the marginal test statistics ($locFDR_0$) have long been in use, the optimality of such rules for dependent test statistics was, as far as we know, not previously established. Therefore, the above result is new not only for $0<N<K$ but also for $N=0$.

Corollary 3.1.1.

The optimal fixed-threshold mFDR level $\alpha$ rule based on $T_{i,N+1}$ has more power than the optimal fixed-threshold mFDR level $\alpha$ rule based on $T_{i,N}$, for every $N\ge 0$.

Proof.

Theorem 3.1 says that the optimal fixed-threshold mFDR level $\alpha$ rule based on $T_{i,N}$ is the solution to problem (5). But clearly $\mathcal{C}_N\subset\mathcal{C}_{N+1}$ for every $N\ge 0$. Hence the proof. ∎

4 A data driven procedure

4.1 The computational complexity of the $locFDR_N$ statistics

To implement the procedure described in Proposition 3.1, we need to compute the statistics $\{T_{1,N},\dots,T_{K,N}\}$. We denote $l=\min(i+N,K)-\max(i-N,1)+1$. Clearly $l\le 2N+1$, so $T_{i,N}$ can be written as

T_{i,N} = \dfrac{\sum_{\boldsymbol{h}_{i,N}:h_i=0}\mathbb{P}(\boldsymbol{Z}_{i,N}|\boldsymbol{h}_{i,N})\mathbb{P}(\boldsymbol{h}_{i,N})}{\sum_{\boldsymbol{h}_{i,N}}\mathbb{P}(\boldsymbol{Z}_{i,N}|\boldsymbol{h}_{i,N})\mathbb{P}(\boldsymbol{h}_{i,N})} (6)

where $\mathbb{P}(\boldsymbol{h}_{i,N})$ is the probability distribution of $\boldsymbol{h}_{i,N}$ and $\mathbb{P}(\boldsymbol{Z}_{i,N}|\boldsymbol{h}_{i,N})$ is the conditional density of $\boldsymbol{Z}_{i,N}$ given $\boldsymbol{h}_{i,N}$. The sum in the denominator above is over $l$-dimensional binary vectors and hence requires $2^l$ evaluations of $\mathbb{P}(\boldsymbol{Z}_{i,N}|\boldsymbol{h}_{i,N})$, each taking time at most $O(l^2)$. Hence calculating the denominator requires time $O(2^l l^2)$. Note that all the terms of the denominator can be stored with $O(2^l)$ memory, and hence evaluating the numerator requires no extra computation time. Since $O(2^l l^2)\le O(2^{2N+1}(2N+1)^2)$, the worst-case runtime for computing all the $locFDR_N$s is $O(K2^{2N+1}(2N+1)^2)$.
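A minimal sketch of this computation under the Gaussian model (7), with hypothetical function and variable names: for each $i$ we enumerate only the $2^l$ states of the neighborhood $\boldsymbol{h}_{i,N}$. (Under model (7), the marginal law of $\boldsymbol{Z}_{i,N}$ given $\boldsymbol{h}$ depends only on $\boldsymbol{h}_{i,N}$, since the relevant submatrix of $\Sigma+\tau^2 diag(\boldsymbol{h})$ involves only the neighborhood entries of $\boldsymbol{h}$.)

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def locfdr_N(z, Sigma, N, pi, b, tau2):
    """Sketch of the locFDR_N statistics (6) under model (7).  For each i,
    enumerate the 2^l neighborhood states (l <= 2N+1), so the total cost
    is O(K 2^(2N+1) (2N+1)^2) rather than O(K 2^K)."""
    K = len(z)
    T = np.empty(K)
    for i in range(K):
        lo, hi = max(i - N, 0), min(i + N, K - 1) + 1   # window [lo, hi)
        idx = np.arange(lo, hi)
        z_nb, S_nb = z[idx], Sigma[np.ix_(idx, idx)]
        num = den = 0.0
        for h_tuple in itertools.product([0, 1], repeat=len(idx)):
            h = np.array(h_tuple)
            prior = pi ** h.sum() * (1 - pi) ** (len(idx) - h.sum())
            lik = multivariate_normal.pdf(z_nb, mean=b * h,
                                          cov=S_nb + tau2 * np.diag(h))
            w = prior * lik
            den += w
            if h[i - lo] == 0:      # neighborhood states with h_i = 0
                num += w
        T[i] = num / den
    return T
```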

Note that for finding the globally optimal rule, we had to calculate the oracle locFDRs $T_1,\dots,T_K$. If $g$ has no special structure, calculating these quantities requires a runtime of $O(K2^K)$.

4.2 Estimation strategies

Although the computation of the statistics of interest $\{T_{1,N},\dots,T_{K,N}\}$ is simple when the probabilistic structure is known, in real-life examples we need to estimate the process parameters from the data. Below we discuss how to estimate these quantities in situations where the dependence can be assumed to be of short range, which approximates many real-life applications.

Assume we have the following model

\boldsymbol{Z}|\boldsymbol{h} \sim \mathcal{N}_K\left(b\boldsymbol{h},\; \Sigma+\tau^2 diag(\boldsymbol{h})\right), \qquad h_i \overset{i.i.d.}{\sim} Ber(\pi), \quad i=1,\ldots,K, (7)

where $\Sigma$ is known, with diagonal entries equal to one. This model assumes that a null SNP's test statistic (i.e., with $h_i=0$) has a $\mathcal{N}(0,1)$ distribution, and a non-null SNP's test statistic (i.e., with $h_i=1$) has a $\mathcal{N}(b,1+\tau^2)$ distribution. It is reasonable, e.g., for testing the correlations between covariates and outcome in a linear model, where $\Sigma$ is determined by the design matrix of the covariates; see § 6 for details. This model is similar to the "RSS-BVSR" model considered in Zhu and Stephens (2017), and it reduces to the two-group model of Efron et al. (2001) if $\Sigma$ is a diagonal matrix.
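For concreteness, a minimal sketch for drawing one dataset from model (7) (the function name is ours):

```python
import numpy as np

def simulate_model7(K, Sigma, pi, b, tau2, rng=None):
    """Draw (h, Z) once from the two-group model (7):
    h_i ~ iid Ber(pi);  Z | h ~ N_K(b*h, Sigma + tau2*diag(h))."""
    rng = np.random.default_rng(rng)
    h = rng.binomial(1, pi, size=K)
    Z = rng.multivariate_normal(b * h, Sigma + tau2 * np.diag(h))
    return h, Z
```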

In practical applications, we want to estimate the model (7) from the observed summary statistics $\boldsymbol{Z}$. If we can assume the dependence to be short-ranged (as in GWAS), then we can find a subset of $z$-scores which are approximately independent. Due to LD, nearby SNPs are correlated, and SNPs which are far apart tend to be independent of each other. As a result, if we select the $z$-scores of the SNPs based on their position on the chromosome, we can get an approximately independent collection of $\tilde K$ $z$-scores. After we select these $z$-scores for estimation, they can be assumed to follow the following simpler two-group model:

h_i \overset{i.i.d.}{\sim} Ber(\pi), \qquad z_i|h_i \overset{ind}{\sim} \mathcal{N}(bh_i,\, 1+\tau^2 h_i) (8)

It is easier to find the estimates of $\pi,b,\tau$ from the model (8) using this subset of $z$-scores, which we call $\boldsymbol{z}_s$. Once these parameters are estimated, we plug the estimates into model (7). Specifics follow.

Selecting sub-vector: Our first step is thus the selection of the sub-vector $\boldsymbol{z}_s$. We do not assume the i.i.d. two-group model on $\boldsymbol{z}$, but we assume that most of the time we are able to extract a subset of $z$-scores which either exactly or approximately follows an i.i.d. two-group model, due to the assumption of short-range strong dependency. Details of how this is done in the case of Banded(1) and AR(1) covariance are given in § 5.2.

Next, we need to estimate $\pi,b,\tau^2$ using $\boldsymbol{z}_s$. We apply the EM algorithm (Dempster et al., 1977) to estimate the parameters, as suggested in McLachlan and Peel (2000). We consider the following two methods, which differ in the way they handle the estimation of $\pi$.

Estimating Parameters 1: The updates of the parameters are given below

\pi^{(t+1)} = \dfrac{1}{\tilde K}\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)}), \qquad
b^{(t+1)} = \dfrac{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})\,z_i}{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})}, \qquad
{\tau^2}^{(t+1)} = \max\left\{0,\; \dfrac{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})(z_i-b^{(t+1)})^2}{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})} - 1\right\} (9)

where $\boldsymbol{\theta}^{(t)}=(\pi^{(t)},b^{(t)},{\tau^2}^{(t)})$ are the estimates of the parameters at the $t$-th step and $\boldsymbol{z}_s=(z_1,\dots,z_{\tilde K})$. The derivation of the equations (9) is detailed in Appendix B.
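A compact sketch of these EM iterations (the moment-based starting values are our assumption and are not specified in the paper); the optional argument pi_fixed anticipates the variant described next:

```python
import numpy as np
from scipy.stats import norm

def em_two_group(zs, pi_fixed=None, n_iter=500, tol=1e-8):
    """EM for the i.i.d. two-group model (8) on the approximately
    independent sub-vector z_s: the updates (9) when pi_fixed is None,
    and the updates (10) when pi is held at a plug-in estimate."""
    pi = 0.1 if pi_fixed is None else pi_fixed
    b, tau2 = np.mean(zs), max(np.var(zs) - 1.0, 0.1)
    for _ in range(n_iter):
        # E-step: gamma_i = P(h_i = 1 | z_i, current parameters)
        f1 = norm.pdf(zs, loc=b, scale=np.sqrt(1.0 + tau2))
        f0 = norm.pdf(zs)
        gamma = pi * f1 / (pi * f1 + (1.0 - pi) * f0)
        # M-step: closed-form updates (9)/(10)
        pi_new = gamma.mean() if pi_fixed is None else pi_fixed
        b_new = np.sum(gamma * zs) / np.sum(gamma)
        tau2_new = max(0.0, np.sum(gamma * (zs - b_new) ** 2)
                       / np.sum(gamma) - 1.0)
        done = max(abs(pi_new - pi), abs(b_new - b),
                   abs(tau2_new - tau2)) < tol
        pi, b, tau2 = pi_new, b_new, tau2_new
        if done:
            break
    return pi, b, tau2
```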

Estimating Parameters 2: Alternatively, we estimate $\pi$ from $\boldsymbol{z}$ using the well-known method of Jin and Cai (2007), and plug this estimate of $\pi$ into the EM iterations. Let $\hat\pi$ be the estimate of the non-null proportion $\pi$ from the $z$-scores $\boldsymbol{z}$. Then the updates of the other parameters are given below:

b^{(t+1)} = \dfrac{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\psi}^{(t)})\,z_i}{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\psi}^{(t)})}, \qquad
{\tau^2}^{(t+1)} = \max\left\{0,\; \dfrac{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\psi}^{(t)})(z_i-b^{(t+1)})^2}{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\psi}^{(t)})} - 1\right\} (10)

where $\boldsymbol{\psi}^{(t)}=(\hat\pi,b^{(t)},{\tau^2}^{(t)})$ are the estimates of the parameters at the $t$-th step. The derivation of the equations (10) follows similarly to Appendix B.
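With the sketch above, this variant only requires fixing $\pi$; the Jin and Cai (2007) estimator itself is not sketched here, so pi_hat below is a placeholder:

```python
# Hypothetical usage of the em_two_group sketch with the updates (10):
pi_hat = 0.25                                    # placeholder for the Jin-Cai estimate
pi, b, tau2 = em_two_group(zs, pi_fixed=pi_hat)  # zs: the selected sub-vector z_s
```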

Using the estimated parameters $\hat b,\hat\pi,\hat\tau$, we estimate the $K$ $locFDR_N$ statistics, and denote them by $\{\hat T_{1,N},\dots,\hat T_{K,N}\}$. For inference using the estimated $locFDR_N$s, we need to determine the rejection cutoff, so that hypotheses with estimated $locFDR_N$s below that cutoff will be rejected. We determine this cutoff numerically as follows. We draw $B$ different data sets of $z$-scores from the mixture model (8) with the estimated parameters $\hat b,\hat\pi,\hat\tau$. We estimate $Q(t)=\dfrac{\mathbb{E}\sum_{i=1}^K \mathbb{I}(\hat T_{i,N}\le t)\hat T_{i,N}}{\mathbb{E}\sum_{i=1}^K \mathbb{I}(\hat T_{i,N}\le t)}$ by $\hat Q(t)=\dfrac{\sum_{b=1}^B\sum_{i=1}^K \mathbb{I}(\hat T^b_{i,N}\le t)\hat T^b_{i,N}}{\sum_{b=1}^B\sum_{i=1}^K \mathbb{I}(\hat T^b_{i,N}\le t)}$, where $\{\hat T^b_{1,N},\dots,\hat T^b_{K,N}\}$ have been calculated from the $b$-th $z$-vector and the estimated parameters $\hat b,\hat\pi,\hat\tau$. Now we perform a simple line search to find

\hat t_{\alpha,N} = \sup\left\{t : \hat Q(t)\le\alpha\right\}. (11)
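Since $\hat Q$ evaluated at the $j$-th smallest pooled value is the running mean of the first $j$ sorted values, it is nondecreasing, and the line search reduces to a single scan. A minimal sketch (the function name is ours):

```python
import numpy as np

def cutoff_estimate(T_boot, alpha):
    """Estimate t_hat_{alpha,N} of (11) from pooled locFDR_N statistics of
    B parametric-bootstrap datasets (T_boot has shape (B, K)).  Q_hat at
    threshold t equals the mean of all pooled values <= t, so scanning
    the sorted values replaces an explicit line search."""
    t = np.sort(T_boot.ravel())
    q = np.cumsum(t) / np.arange(1, t.size + 1)   # Q_hat at each sorted value
    ok = np.flatnonzero(q <= alpha)
    return t[ok[-1]] if ok.size else 0.0          # largest t with Q_hat(t) <= alpha
```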

We summarize the whole pipeline concisely in Algorithm 1.

1:  Step 1 - Input: the vector of $z$-scores $\boldsymbol{z}$, the covariance $\Sigma$ among the $z$-scores, a realistic $N\ge 0$, and $B>0$.
2:  Step 2 - Choose a sub-vector $\boldsymbol{z}_s$ of approximately independent $z$-scores from the whole vector of $z$-scores $\boldsymbol{z}$, as described in paragraph Selecting sub-vector in § 4.2.
3:  Step 3 - Given $\boldsymbol{z}_s$ and $\boldsymbol{z}$, estimate $b,\pi,\tau$ using either paragraph Estimating Parameters 1 or paragraph Estimating Parameters 2 of § 4.2; call these estimates $\hat b,\hat\pi,\hat\tau$.
4:  Step 4 - Calculate $\{T_{1,N},\dots,T_{K,N}\}$ by formula (6), using the estimated parameters of Step 3 and the $z$-scores; call these estimated statistics $\{\hat T_{1,N},\dots,\hat T_{K,N}\}$.
5:  Step 5 - Calculate the estimated cutoff $\hat t_{\alpha,N}$ in (11).
6:  Step 6 - Output: the set $\mathcal{R}_\alpha=\{i:\hat T_{i,N}\le\hat t_{\alpha,N}\}$ of rejected hypotheses.
Algorithm 1 The data-driven $locFDR_N$ procedure for level $\alpha$ mFDR control
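Tying the sketches together, a hypothetical end-to-end run of Algorithm 1, reusing the helper functions sketched above (simulate_model7, em_two_group, locfdr_N, cutoff_estimate; all names are ours, and the setting mimics an AR(1) row of Table 1):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, B, alpha = 500, 2, 50, 0.05
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(K), np.arange(K)))  # AR(1), rho = 0.5

h, z = simulate_model7(K, Sigma, pi=0.3, b=0.0, tau2=4.0, rng=rng)    # data for Step 1
zs = z[::4]                                   # Step 2: weakly correlated sub-vector
pi_hat, b_hat, tau2_hat = em_two_group(zs)    # Step 3
T_hat = locfdr_N(z, Sigma, N, pi_hat, b_hat, tau2_hat)                # Step 4
# Step 5: B bootstrap z-vectors from the fitted model (8), i.e. Sigma = I
T_boot = np.stack([locfdr_N(simulate_model7(K, np.eye(K), pi_hat, b_hat,
                                            tau2_hat, rng)[1],
                            Sigma, N, pi_hat, b_hat, tau2_hat)
                   for _ in range(B)])
t_hat = cutoff_estimate(T_boot, alpha)
rejected = np.flatnonzero(T_hat <= t_hat)     # Step 6
```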

5 Simulations

5.1 Simulation from Mixture of Gaussians (Oracle)

Here we show all the results with known values of the parameters of the distribution described by (7).

In the rest of the paper, we will consider covariance structures $\Sigma=(\sigma_{i,j})_{1\le i,j\le K}$ among the following four (a small constructor sketch is given after the list):

  • AR(1):

    \sigma_{i,j} = \rho^{|i-j|} (12)

    for $1\le i,j\le K$, where $-1\le\rho\le 1$.

  • Banded(1):

    \sigma_{i,j} = \begin{cases} 1, & \text{if } i=j \\ \rho, & \text{if } |i-j|=1 \\ 0, & \text{if } |i-j|>1 \end{cases} (13)

    for $1\le i,j\le K$, where $-0.5\le\rho\le 0.5$.

  • Long Range (Fan and Han, 2017):

    \sigma_{i,j} = \frac{1}{2}\left(\left||i-j|+1\right|^{2H} - 2|i-j|^{2H} + \left||i-j|-1\right|^{2H}\right) (14)

    for $1\le i,j\le K$, where $\frac{1}{2}<H<1$, with $H\to\frac{1}{2}$ indicating weak dependence and $H\to 1$ indicating strong long-range dependence.

  • Equicorrelated:

    \sigma_{i,j} = \begin{cases} 1, & \text{if } i=j \\ \rho, & \text{if } i\ne j \end{cases} (15)

    for $1\le i,j\le K$, where $-1\le\rho\le 1$.
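The following small Python helper builds these four structures (a sketch; the function name, argument names and defaults are ours):

```python
import numpy as np

def cov_matrix(K, kind, rho=0.8, H=0.9):
    """Construct the covariance structures (12)-(15)."""
    d = np.abs(np.subtract.outer(np.arange(K), np.arange(K)))  # |i - j|
    if kind == "AR(1)":                     # (12)
        return rho ** d
    if kind == "Banded(1)":                 # (13)
        return np.where(d == 0, 1.0, np.where(d == 1, rho, 0.0))
    if kind == "LongRange":                 # (14)
        d = d.astype(float)
        return 0.5 * (np.abs(d + 1) ** (2 * H) - 2 * d ** (2 * H)
                      + np.abs(d - 1) ** (2 * H))
    if kind == "Equicorrelated":            # (15)
        return np.where(d == 0, 1.0, rho)
    raise ValueError(kind)
```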

Figure 1 shows scatter plots of $T_{i,N}$ versus $T_{i,N+1}$, and the marginal density of $T_{i,N}$, for $N=0,\ldots,5$. As we move to comparisons between higher-order locFDRs, the simulated locFDRs tend to concentrate around the 45-degree line. The concentration is particularly good for the AR(1) and long-range correlation structures. The density plots show that for all dependencies, the densities are similar for $N>0$ and strikingly different from the density for $N=0$.

(a) AR(1) covariance with $\rho=0.8$, as described in equation (12).
(b) Long-range covariance with $H=0.9$, as described in equation (14).
(c) Equicorrelated covariance with $\rho=0.8$, as described in equation (15).
Figure 1: The scatter plots of $(T_{i,N}, T_{i,N+1})$ (left column) and the density of $T_{i,N}$ (right column), for $N=0,\ldots,5$. Rows differ by the dependence structure $\Sigma$ among the $z$-scores. The $z$-score vector is generated from model (7), with $K=1000$, $\tau=2$, $\pi=0.3$, and $b=0.2$.

We make the following observations from the results in Table 1:

  • Power: as expected, the power increases with $N$.

  • Effect of correlation: What makes the $locFDR_N$ most interesting is the rate of the power increase as a function of $N$: the power increase is large when moving from $N=0$ to $N=1$, and it is much smaller when moving from $N=1$ to $N=2$. This behaviour is commonly observed under all the dependency structures considered. The highest power increase was observed for the equicorrelated covariance structure.

  • Variability of the FDP: When there are strong correlations, the FDP is highly variable. In all settings, the procedure with the marginal statistics shows higher variability in the FDP compared to procedures that use $locFDR_N$ statistics with $N>0$. We consistently observe from the simulations that the variability of the FDP is decreasing in $N$, so this is an added advantage of using $T_{i,N}$ with $N$ as large as (computationally) possible.

  • mFDR vs FDR: The mFDR and FDR differ when there are strong correlations and the $locFDR_0$ statistics are used. But when we use the $locFDR_2$ statistics, more discoveries are made and the difference between the mFDR and FDR is small, regardless of the correlation structure.

  • Cutoff: All the simulations show that the cutoffs $t_{\alpha,N}$ for fixed $\alpha$ are increasing in $N$, and the change from $N=1$ to $N=2$ is much smaller than the change from $N=0$ to $N=1$. The decrease of the sequence $t_{\alpha,N+1}-t_{\alpha,N}$ depends on the covariance structure, and the rate is higher in the short-range setting than in the long-range or equicorrelated settings.

Fixed parameters in all simulations: $K=1000$, $\pi=0.3$

Correlation    Error        $T_0$-rule          $T_1$-rule          $T_2$-rule
                            Estimate   s.e.     Estimate   s.e.     Estimate   s.e.
AR(1)          mFDR         0.0501     0.0015   0.0499     0.0009   0.0508     0.0008
$\rho=0.8$     FDR          0.0494     0.0014   0.0498     0.0009   0.0507     0.0008
$b=0$          TP           61.19      0.352    124.08     0.428    139.364    0.448
$\tau=2$       Cutoff (t)   0.1742     -        0.2356     -        0.2729     -
Long Range     mFDR         0.0489     0.0015   0.0506     0.001    0.0497     0.0010
$H=0.8$        FDR          0.0481     0.0014   0.0506     0.001    0.0497     0.0010
$b=0$          TP           61.95      0.3413   84.256     0.3815   88.37      0.3944
$\tau=2$       Cutoff (t)   0.1742     -        0.2014     -        0.2074     -
Equi           mFDR         0.0731     0.0249   0.0515     0.0032   0.0501     0.0012
$\rho=0.8$     FDR          0.0149     0.0035   0.0489     0.0023   0.0497     0.0012
$b=0$          TP           62.026     0.8173   121.374    0.4404   146.29     0.4835
$\tau=2$       Cutoff (t)   0.1742     -        0.2361     -        0.2903     -
AR(1)          mFDR         0.0510     0.0012   0.0487     0.001    0.0487     0.001
$\rho=0.5$     FDR          0.0509     0.0012   0.0488     0.001    0.0487     0.0010
$b=0$          TP           62.112     0.3465   83.85      0.3989   85.824     0.4019
$\tau=2$       Cutoff (t)   0.1742     -        0.1989     -        0.2029     -
Long Range     mFDR         0.0506     0.0013   0.0505     0.0011   0.0513     0.0011
$H=0.7$        FDR          0.0504     0.0013   0.0507     0.0011   0.0514     0.0011
$b=0$          TP           61.922     0.3503   70.282     0.3748   71.504     0.3786
$\tau=2$       Cutoff (t)   0.1742     -        0.1823     -        0.1842     -
Equi           mFDR         0.0452     0.0049   0.0502     0.0023   0.0496     0.0014
$\rho=0.5$     FDR          0.0304     0.0028   0.0473     0.0021   0.0491     0.0014
$b=0$          TP           61.848     0.5322   81.152     0.4022   91.166     0.4055
$\tau=2$       Cutoff (t)   0.1742     -        0.1971     -        0.2088     -

Table 1: Power gain with different covariance structures (known parameters).
For different parameters of model (7) (rows), using the procedure with $T_{i,0}$ (columns 3-4), $T_{i,1}$ (columns 5-6), and $T_{i,2}$ (columns 7-8): the "Estimate" column gives the estimate of each error rate and the "s.e." column gives the standard error of that estimate. Standard errors for the mFDR were determined by bootstrapping, based on 500 data generations.

5.2 Simulation from Mixture of Gaussians (Data Driven)

Here we show the performance of our procedure when we estimate the unknown parameters $b,\pi,\tau$ of model (7) using Algorithm 1. From now on, in all the tables we denote by "$T_N$-rule", "Sun&Cai", "BH" and "ABH", respectively, the testing procedure based on $\{T_{i,N}, 1\le i\le K\}$ for $N=0,1,2$, the testing procedure developed in Sun and Cai (2007), the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995), and the adaptive Benjamini-Hochberg procedure, which incorporates the plug-in estimate of $\pi$ given by the method "Est-EM" or "Est-S&C" (see, e.g., Benjamini et al., 2006).

In Table 2, by "Est-EM" we mean estimating the parameters of the distribution using equations (9). For the AR(1) covariance, we take the subset $\{Z_2,Z_6,\dots,Z_{3998}\}$ of 1000 very weakly correlated $z$-scores and estimate the parameters using the EM algorithm; in this case we get a slightly inflated mFDR level. For the Banded(1) covariance, we take the subset $\{Z_2,Z_4,Z_6,\dots,Z_{2000}\}$ of independent $z$-scores to estimate the parameters, and in this situation the data-driven rule controls the mFDR. In the same table, by "Est-S&C" we mean estimating the parameters of the distribution using equations (10).

In Table 2, we observe that the mFDR level is more conservative with "Est-S&C" than with "Est-EM". Due to this conservative nature of "Est-S&C", we make fewer discoveries with "Est-S&C" than with "Est-EM". Nevertheless, even with "Est-S&C", using the statistics $T_N$ with $N=1,2$ results in much higher power than that of $T_0$ or the other competitors.

Fixed parameters in all simulations: $b=0$, $\tau=2$

Correlation            Error   $T_0$-rule  $T_1$-rule  $T_2$-rule  Sun&Cai  BH      ABH
Banded(1), Est-EM      mFDR    0.0494      0.0511      0.0495      0.0435   0.0327  0.0489
$\rho=0.5$, $\pi=0.3$  FDR     0.0486      0.0508      0.0494      0.0430   0.0324  0.0480
$K=2000$               TP      124         174         184         118      109     123
Banded(1), Est-EM      mFDR    0.0497      0.0498      0.0509      0.0475   0.0396  0.0497
$\rho=0.5$, $\pi=0.2$  FDR     0.0480      0.0495      0.0506      0.0464   0.0383  0.0481
$K=2000$               TP      69          108         118         68       65      70
AR(1), Est-EM          mFDR    0.051       0.0526      0.0520      0.0441   0.0356  0.0510
$\rho=0.5$, $\pi=0.3$  FDR     0.0503      0.0522      0.0517      0.0439   0.0355  0.0506
$K=4000$               TP      248         336         344         237      219     249
AR(1), Est-EM          mFDR    0.0522      0.0524      0.0525      0.050    0.0416  0.0534
$\rho=0.5$, $\pi=0.2$  FDR     0.0511      0.0521      0.0521      0.0495   0.0411  0.0525
$K=4000$               TP      136         202         206         134      127     137
AR(1), Est-S&C         mFDR    0.0342      0.0414      0.0417      0.0444   0.0333  0.0417
$\rho=0.5$, $\pi=0.3$  FDR     0.0338      0.0413      0.0416      0.0440   0.0329  0.0413
$K=4000$               TP      218         313         319         236      218     233
AR(1), Est-S&C         mFDR    0.0381      0.0423      0.0415      0.046    0.039   0.0447
$\rho=0.5$, $\pi=0.2$  FDR     0.0375      0.042       0.0412      0.0455   0.0385  0.0441
$K=4000$               TP      124         191         195         133      126     132

Table 2: Power gain with different covariance structures (data driven).
For different parameters of model (7) (rows), and for each multiple testing procedure, we provide the mFDR, FDR, and power. The novel $T_N$-rules applied Algorithm 1 with $B=50$. In bold, the most powerful procedure. Bootstrap standard errors (s.e.) of all the mFDRs for the combination "AR(1), Est-EM" in the 3rd and 4th row groups are at most 0.0011 and 0.0015, respectively. Based on 200 data generations.

6 UK Biobank height data analysis

Assume we have a standardized phenotype ($\boldsymbol{y}^{n\times 1}$) and standardized genotype data ($X^{n\times p}$) available for $n$ individuals and $p$ different SNPs. We want to find the SNPs that are associated with the phenotype, while controlling the mFDR at level 0.05, using the $locFDR_N$ statistics for $N>0$. In our GWAS example we have a sufficient number of individuals, so $n>p$. In order to apply our method we need the $z$-scores and the correlation among the $z$-scores as input data. For finding the $z$-scores, we use the linear model

\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon},

and we know that the least squares estimate of $\boldsymbol{\beta}$ is $\hat{\boldsymbol{\beta}}=(X^TX)^{-1}X^T\boldsymbol{y}$, with $\hat{\boldsymbol{\beta}}\sim\mathcal{N}(\boldsymbol{\beta},\sigma^2(X^TX)^{-1})$. Let $S=diag((X^TX)^{-1})$, and let the estimated variance of $\epsilon_i$ be $\hat{\sigma^2}=\frac{1}{n-p}(\boldsymbol{y}-X\hat{\boldsymbol{\beta}})^T(\boldsymbol{y}-X\hat{\boldsymbol{\beta}})$. Our $z$-score vector is $\boldsymbol{z}=\frac{1}{\hat\sigma}S^{-1/2}\hat{\boldsymbol{\beta}}$, and the correlation among the $z$-scores is $\Sigma=S^{-1/2}(X^TX)^{-1}S^{-1/2}$. So our input data is $(\boldsymbol{z},\Sigma)$ and our working model is the two-group model (7).
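A minimal sketch of this computation (the function name is ours):

```python
import numpy as np

def gwas_inputs(X, y):
    """Compute the z-score vector and correlation matrix Sigma from
    standardized genotype X (n x p, n > p) and phenotype y, following
    the least-squares derivation above."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)       # estimated noise variance
    s = np.sqrt(np.diag(XtX_inv))              # diagonal of S^{1/2}
    z = beta_hat / (np.sqrt(sigma2_hat) * s)   # z = S^{-1/2} beta_hat / sigma_hat
    Sigma = XtX_inv / np.outer(s, s)           # S^{-1/2} (X^T X)^{-1} S^{-1/2}
    return z, Sigma
```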

We apply our method to some selected SNPs of chromosome 20, as described next. We have genotype data of around 500K individuals, with approximately 20K SNPs for each of them, from the UK Biobank (Sudlow et al., 2015). The individuals with missing height (a small number) were removed from our analysis. In the raw data, each SNP also has many missing individuals. Moreover, SNPs that are adjacent to each other tend to be highly correlated. Therefore, we choose about 3.5K SNPs from chromosome 20, such that between any two of them there are at least 5 SNPs and none of these 3.5K SNPs has more than 10% missing values. We impute the major allele count of each missing individual for a particular SNP by the mean major allele count over all available individuals for that SNP. Next, we compute the $z$-score vector and $\Sigma$, and use them in the "Est-S&C" data-driven procedure. The estimated mixture components are

\hat\pi = 0.2, \qquad \hat b = 0.0918, \qquad \hat{\tau^2} = 2.477.

In Table 3 we show simulation results with parameters estimated from the data $(\boldsymbol{y},X)$ (essentially a parametric bootstrap experiment), after repeating the process $B$ times and aggregating the results. The first row group of the table shows the results when the parameters are assumed unknown and are estimated in each run (we take $B=200$ here). The second row group shows the results when the parameters are assumed known (we take $B=500$ here). Our procedures control the mFDR at the nominal 0.05 level.

The numbers of SNPs rejected for the real data at level $\alpha=0.05$ by our method and the competitors, with 3.5K SNPs, are given in Table 4. Though "S&C" and the "$T_0$-rule" use the same statistics for inference, due to different estimation methods for these statistics and different ways of determining the cutoffs, they reject different numbers of SNPs. The "$T_2$-rule" rejects 30 more SNPs than the "S&C" method. Manhattan plots for all the methods are given in Figure 2. In all the subplots of Figure 2, we indicate by a red rectangle a specific region of the genome where the $T_{i,N}$ statistics with $N>0$ find more associations than "BH", "ABH" and the "$T_0$-rule". Note that in this specific region there are two associations (the 1292-th and 1405-th SNPs) which are found by all the methods. However, between these two SNPs there are two other close SNPs (the 1350-th and 1363-th SNPs) which are identified only by the $T_{i,N}$ statistics with $N>0$. Interestingly, we observe from the correlation matrix of the $z$-scores that these two newly identified SNPs are not highly correlated.

Parameters: $b=0.0918$, $\tau^2=2.477$, $\pi=0.2$, and $\Sigma=S^{-1/2}(X^TX)^{-1}S^{-1/2}$

Method    Error   $T_0$-rule  $T_1$-rule  $T_2$-rule  Sun&Cai  BH      ABH
Data      mFDR    0.043       0.0449      0.0445      0.047    0.0408  0.045
Driven    FDR     0.0419      0.044       0.0439      0.0468   0.0399  0.0438
          TP      52          59          65          55       51      54
          mFDR    0.049       0.0501      0.0488      0.0482   0.0408  0.0516
Oracle    FDR     0.0486      0.0499      0.0487      0.0467   0.0395  0.0497
          TP      57          65          70          56       52      58

Table 3: Power gain in the simulations using parameters estimated from the real data.
Bold indicates the best power among the competitors.
Number of common identified SNPs

Method       Sun&Cai  BH   ABH  $T_0$-rule  $T_1$-rule  $T_2$-rule  $T_3$-rule
S&C          36       20   34   36          33          35          32
BH                    20   20   20          19          20          20
ABH                         36   36          33          35          33
$T_0$-rule                       49          43          45          40
$T_1$-rule                                   53          52          49
$T_2$-rule                                               66          60
$T_3$-rule                                                           64

Table 4: Number of common identified SNPs for each pair of methods.
Figure 2: Manhattan plots of the analyzed SNPs.
In the plots of "$-2\log_{10}(p_i)$ vs $i$", the black dotted line indicates the threshold of the "BH" procedure and the red dotted line indicates the threshold of the "ABH" procedure.
In the plots of "$-2\log_{10}(T_{i,N})$ vs $i$", the black dotted line indicates the threshold $-2\log_{10}(t_{\alpha,N})$.

7 Discussion

In this paper we have studied practical multiple testing procedures in the two-group model under general dependence. We motivate the need for finding an optimal solution in a restricted decision space $\mathcal{C}_N$, and indeed show empirically that we do not lose much by concentrating on the smaller class $\mathcal{C}_N$. We provide practical approaches to estimate the parameters of a two-group model under short-range dependence structures, which encompass typical real applications. Using this approach we give a practical application in GWAS, finding more associations than existing practical methods using summary statistics. Our method works in GWAS without resorting to the extensive modelling assumptions of a complete Bayesian framework, as done in Zhu and Stephens (2017).

Though concentrating on the class $\mathcal{C}_N$ of decision functions makes sense, it is natural to ask if there is a better way of defining such a class which yields a more practical and powerful solution. Our solution does not use the full correlation matrix: it only uses the correlations within $N$ main diagonals of the correlation matrix. In other words, the solution ignores all the correlations outside the $N$-band. One possible approach is to define the $i$-th locFDR statistic by conditioning on the top $N$ most highly correlated $z$-scores, and possibly to choose this $N$ depending on $i$, determined by the strength of the correlation of the test statistic with the other test statistics. It will be interesting to explore how our method can be modified to accommodate the whole correlation matrix, so that we do not lose much "information".

Proposition 3.1 enables us to see how we can devise a method for mFDR control using the statistics $T_{i,N}$. If the entire joint distribution of the $z$-score vector is known, we can always define a policy of the form $\boldsymbol{D}^t=(D_1^t(\boldsymbol{Z}_{1,N}),\ldots,D_K^t(\boldsymbol{Z}_{K,N}))$, $D_i^t(\boldsymbol{Z}_{i,N})=\mathbb{I}(T_{i,N}\le t)$, and determine

t_{\alpha,N} = \sup\left\{t : Err(\boldsymbol{D}^t)\le\alpha\right\}

with $Err\in\{FDR, pFDR, FDX\}$. This procedure $\boldsymbol{D}^{t_{\alpha,N}}$ will control the chosen error by construction. However, we may not be able to prove an optimality result similar to Theorem 3.1 for the policy $\boldsymbol{D}^{t_{\alpha,N}}$ with these error rates. To find optimal solutions, we need to solve the generic problem with $Err\in\{FDR, pFDR, FDX\}$:

\max_{\boldsymbol{D}\in\mathcal{C}_N} TP(\boldsymbol{D}) \quad\text{s.t.}\quad Err(\boldsymbol{D})\le\alpha. (16)

The difficulty in finding such a solution is that other error measures, such as the FDX, FDR and pFDR, do not decompose in the same way as the mFDR, as shown in equation (17) of Appendix A. It is a challenge to find similar solutions for these error measures.

It is an interesting direction for future work to develop $T_{i,N}$-based methods for $p\gg n$ problems, with properly chosen $z$-scores that take into account, at least to some extent, the joint effect of the covariates.

Sun and Cai (2009) developed a method assuming that the hypothesis states follow a hidden Markov model (HMM) and that the $z$-scores given the hypothesis states are independent; their method therefore does not use the correlation among the $z$-scores. Interestingly, Proposition 3.1 holds for any joint distribution of $(\boldsymbol{Z},\boldsymbol{h})$ on $\mathbb{R}^K\times\{0,1\}^K$. Although it is normally assumed that $h_i\overset{i.i.d.}{\sim}Ber(\pi)$, we can also assume that $\boldsymbol{h}$ follows an HMM, as done in Sun and Cai (2009), and draw inference based on the statistics $T_{i,N}$ in these situations as well. An interesting extension of our approach is thus to also allow an HMM structure on the hypothesis states, in addition to the correlation among the $z$-scores. We expect that it will be possible to derive a practical solution for this setting.

In particular, our estimation method can easily be extended to this HMM-based model as well, thereby producing a multiple testing procedure which can be applied directly to real data sets.

Our approach for parameter estimation in the two-group model relies on being able to extract a subset of test statistics which are roughly independent. We believe this is a realistic assumption in most real-life applications, in particular in GWAS. However, it remains a challenge to design and test approaches that also apply in cases of long-range strong dependence. Note also that our estimation method does not take into account the known joint dependence among the statistics. Though this estimation approach is robust, taking the joint dependence into account can lead to more efficient estimates of the model parameters and, in turn, a boost in power. It may be interesting to explore how the correlations among the test statistics can be accommodated in the estimation.

One major observation from our simulation results is that taking even part of the (short-range) dependence into account improves power. In fact, we observe large power gains even when there is very limited and not very strong dependence. Hence our recommendation is to use rules based on the $locFDR_N$ statistics, with $N>0$, whenever there is known dependence, or whenever the dependence within neighborhoods can be well approximated.


SUPPLEMENTARY MATERIAL

GitHub link of all relevant code:
Height data set:

We use genotype ($X^{n\times p}$) and height ($\boldsymbol{y}^{n\times 1}$) data of around 500K individuals provided by the UK Biobank. Unfortunately, we cannot make the data available publicly. For reproducibility, readers can use their own data in the notebook available at the above GitHub link.


Acknowledgments

This study was supported by Israeli Science Foundation grant 2180/20. UK Biobank research has been conducted using the UK Biobank Resource under Application Number 56885.

References

  • Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.
  • Benjamini, Y., Krieger, A. M., and Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93(3):491–507.
  • Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.
  • Efron, B. (2007). Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association, 102(477):93–103.
  • Efron, B. (2010). Correlated z-values and the accuracy of large-scale statistical estimates. Journal of the American Statistical Association, 105(491):1042–1055.
  • Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96(456):1151–1160.
  • Fan, J. and Han, X. (2017). Estimation of the false discovery proportion with unknown dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1143–1164.
  • Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):499–517.
  • Heller, R. and Rosset, S. (2021). Optimal control of false discovery criteria in the two-group model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(1):133–155.
  • Jin, J. and Cai, T. T. (2007). Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons. Journal of the American Statistical Association, 102(478):495–506.
  • McLachlan, G. and Peel, D. (2000). Finite Mixture Models. John Wiley & Sons.
  • Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., et al. (2015). UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779.
  • Sun, W. and Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association, 102(479):901–912.
  • Sun, W. and Cai, T. T. (2009). Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):393–424.
  • Xie, J., Cai, T. T., Maris, J., and Li, H. (2011). Optimal false discovery rate control for dependent data. Statistics and Its Interface, 4(4):417.
  • Zhu, X. and Stephens, M. (2017). Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. The Annals of Applied Statistics, 11(3):1561.

Appendix A Proof of Theorem 3.1

Note that the constraint $mFDR(\boldsymbol{D})\le\alpha$ is equivalent to $\mathbb{E}\left(V(\boldsymbol{D})-\alpha R(\boldsymbol{D})\right)\le 0$. The essence of the proof relies on the following two decompositions of the objective $TP(\boldsymbol{D})$ and of the constraint $\mathbb{E}\left(V(\boldsymbol{D})-\alpha R(\boldsymbol{D})\right)$, for $\boldsymbol{D}\in\mathcal{C}_N$:

\mathbb{E}_{\boldsymbol{h},\boldsymbol{Z}}\left(V(\boldsymbol{D})-\alpha R(\boldsymbol{D})\right) = \sum_{i=1}^K \mathbb{E}_{\boldsymbol{h},\boldsymbol{Z}}\left[(1-h_i)D_i(\boldsymbol{Z}_{i,N}) - \alpha D_i(\boldsymbol{Z}_{i,N})\right]
= \sum_{i=1}^K \mathbb{E}_{h_i,\boldsymbol{Z}_{i,N}}\left[(1-h_i)D_i(\boldsymbol{Z}_{i,N}) - \alpha D_i(\boldsymbol{Z}_{i,N})\right]
= \sum_{i=1}^K \mathbb{E}_{\boldsymbol{Z}_{i,N}} D_i(\boldsymbol{Z}_{i,N})\,\mathbb{E}_{h_i|\boldsymbol{Z}_{i,N}}\left[(1-h_i)-\alpha\right] (17)
= \mathbb{E}_{\boldsymbol{Z}}\left[\sum_{i=1}^K D_i(\boldsymbol{Z}_{i,N})(T_{i,N}-\alpha)\right]

The following expression follows in exactly the same way as above:

TP(\boldsymbol{D}) = \mathbb{E}_{\boldsymbol{h},\boldsymbol{Z}}\left(\sum_{i=1}^K h_i D_i\right) = \mathbb{E}_{\boldsymbol{Z}}\left(\sum_{i=1}^K D_i(\boldsymbol{Z}_{i,N})(1-T_{i,N})\right) (18)

The Lagrangian of the problem is

L(\boldsymbol{D},\mu) = TP(\boldsymbol{D}) + \mu\,\mathbb{E}(\alpha R(\boldsymbol{D}) - V(\boldsymbol{D})) (19)
= \mathbb{E}_{\boldsymbol{Z}}\left(\sum_{i=1}^K D_i(\boldsymbol{Z}_{i,N})(1 - T_{i,N} + \alpha\mu - \mu T_{i,N})\right) (20)

Now the problem in (5) is equivalent to maximizing $L(\boldsymbol{D},\mu)$ over $\mathcal{C}_N$. Clearly the Lagrangian is maximized by a rule of the form $\boldsymbol{D}_N^t=(D_{1,N}^t,\dots,D_{K,N}^t)$, where

D_{i,N}^t(\boldsymbol{Z}_{i,N}) = \mathbb{I}\left(1 - T_{i,N} + \alpha\mu - \mu T_{i,N} > 0\right) (21)
= \mathbb{I}\left(T_{i,N} < t\right) (22)

with $t=\dfrac{1+\alpha\mu}{1+\mu}$. Now the constraint

\mathbb{E}\left(V(\boldsymbol{D}_N^t) - \alpha R(\boldsymbol{D}_N^t)\right) = \mathbb{E}_{\boldsymbol{Z}}\left[\sum_{i=1}^K \mathbb{I}\left(T_{i,N} < \dfrac{1+\alpha\mu}{1+\mu}\right)(T_{i,N}-\alpha)\right] \le 0 (23)

is satisfied if $\dfrac{1+\alpha\mu}{1+\mu}\le\alpha$. Note that $TP(\boldsymbol{D}_N^t)$ is increasing in $t=\dfrac{1+\alpha\mu}{1+\mu}$, and hence the optimal solution is the decision $\boldsymbol{D}_N^{t_{\alpha,N}}=\boldsymbol{D}_N^*$, where

t_{\alpha,N} = \sup\left\{t : \mathbb{E}\left(V(\boldsymbol{D}_N^t) - \alpha R(\boldsymbol{D}_N^t)\right)\le 0\right\}
= \sup\left\{t : \dfrac{\mathbb{E}\sum_{i=1}^K \mathbb{I}(T_{i,N}\le t)\,T_{i,N}}{\mathbb{E}\sum_{i=1}^K \mathbb{I}(T_{i,N}\le t)} \le \alpha\right\}

Appendix B Derivation of the equations in (9)

Let us assume that we have the model (8). Our task is to estimate the unknown parameters of the model using the EM algorithm. We consider $\boldsymbol{z}=(z_1,\dots,z_{\tilde K})$ as our observed data and $\boldsymbol{h}=(h_1,\dots,h_{\tilde K})$ as the unobserved data. The log-likelihood of the complete data $(\boldsymbol{z},\boldsymbol{h})$ is

\log L(\boldsymbol{z},\boldsymbol{h},\boldsymbol{\theta}) = \log\left[p(\boldsymbol{z}|\boldsymbol{h},\boldsymbol{\theta})\,p(\boldsymbol{h}|\boldsymbol{\theta})\right] (24)
= \log\left[\prod_i \dfrac{1}{\sqrt{2\pi(1+\tau^2 h_i)}} \exp\left[-\dfrac{1}{2(1+\tau^2 h_i)}(z_i-bh_i)^2\right] \pi^{h_i}(1-\pi)^{(1-h_i)}\right] (25)
= c - \dfrac{1}{2}\sum_i \log(1+\tau^2 h_i) - \dfrac{1}{2}\sum_i \dfrac{1}{(1+\tau^2 h_i)}(z_i-bh_i)^2 + \log\pi\sum_i h_i + \log(1-\pi)\sum_i (1-h_i) (26)

E-step: Let us assume that the estimate of the parameters at the $t$-th step is $\boldsymbol{\theta}^{(t)}=(\pi^{(t)},b^{(t)},\tau^{2(t)})$. Then we find the estimate at the $(t+1)$-th step by maximizing the conditional expectation $\mathbb{E}\left[\log L(\boldsymbol{z},\boldsymbol{h},\boldsymbol{\theta})\,|\,\boldsymbol{z},\boldsymbol{\theta}^{(t)}\right]$.

M-step: Clearly, from the above expression of $\log L(\boldsymbol{z},\boldsymbol{h},\boldsymbol{\theta})$,

\dfrac{\delta}{\delta\pi}\mathbb{E}\left[\log L(\boldsymbol{z},\boldsymbol{h},\boldsymbol{\theta})\,|\,\boldsymbol{z},\boldsymbol{\theta}^{(t)}\right] = \dfrac{1}{\pi}\sum_i \mathbb{E}(h_i|z_i,\boldsymbol{\theta}^{(t)}) - \dfrac{1}{1-\pi}\sum_i \left(1-\mathbb{E}(h_i|z_i,\boldsymbol{\theta}^{(t)})\right) = 0
\implies \pi^{(t+1)} = \dfrac{1}{\tilde K}\sum_i \mathbb{E}(h_i|z_i,\boldsymbol{\theta}^{(t)}) = \dfrac{1}{\tilde K}\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})

Similarly, setting $\dfrac{\delta}{\delta b}\mathbb{E}\left[\log L(\boldsymbol{z},\boldsymbol{h},\boldsymbol{\theta})\,|\,\boldsymbol{z},\boldsymbol{\theta}^{(t)}\right]=0$, we get

\dfrac{\delta}{\delta b}\left[\sum_i (z_i-b)^2\,\mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})\right] = 0
\implies b^{(t+1)} = \dfrac{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})\,z_i}{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})}

Now, for determining $\tau^{2(t+1)}$, we set $\dfrac{\delta}{\delta\tau^2}\mathbb{E}\left[\log L(\boldsymbol{z},\boldsymbol{h},\pi,b^{(t+1)},\tau)\,|\,\boldsymbol{z},\boldsymbol{\theta}^{(t)}\right]=0$ to get

\sum_i \dfrac{1}{(1+\tau^2)^2}(z_i-b^{(t+1)})^2\,\mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)}) = \sum_i \dfrac{1}{(1+\tau^2)}\mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})
\implies {\tau^2}^{(t+1)} = \max\left\{0,\; \dfrac{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})(z_i-b^{(t+1)})^2}{\sum_i \mathbb{P}(h_i=1|z_i,\boldsymbol{\theta}^{(t)})} - 1\right\}

Hence the updates in equations (9) follow. The updates in equations (10) follow similarly.