This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Computationally Efficient Whole-Genome Signal Region Detection for Quantitative and Binary Traits

Fan Wang
Department of Biostatistics, Mailman School of Public Health,
Columbia University, New York, U.S.A
Wei Zhang
School of Mathematical Sciences, Center for Statistical Science,
Peking University, Beijing, China
Fang Yao
School of Mathematical Sciences, Center for Statistical Science,
Peking University, Beijing, China
Fan Wang and Wei Zhang are co-first authors, and Fang Yao is the corresponding author. E-mail: [email protected]. This research is supported in part by the National Key R&D Program of China (No. 2022YFA1003801), the National Natural Science Foundation of China (No. 12292981, 11931001), the LMAM and the Fundamental Research Funds for the Central Universities, Peking University. This research has been conducted using the UK Biobank Resource under project 79237.  
Abstract

The identification of genetic signal regions in the human genome is critical for understanding the genetic architecture of complex traits and diseases. Numerous methods based on scan algorithms (i.e. QSCAN, SCANG, SCANG-STARR) have been developed to allow dynamic window sizes in whole-genome association studies. Beyond scan algorithms, we have recently developed the binary and re-search (BiRS) algorithm, which is more computationally efficient than scan-based methods and exhibits superior statistical power. However, the BiRS algorithm is based on two-sample mean test for binary traits, not accounting for multidimensional covariates or handling test statistics for non-binary outcomes. In this work, we present a distributed version of the BiRS algorithm (dBiRS) that incorporate a new infinity-norm test statistic based on summary statistics computed from a generalized linear model. The dBiRS algorithm accommodates regression-based statistics, allowing for the adjustment of covariates and the testing of both continuous and binary outcomes. This new framework enables parallel computing of block-wise results by aggregation through a central machine to ensure both detection accuracy and computational efficiency, and has theoretical guarantees for controlling family-wise error rates and false discovery rates while maintaining the power advantages of the original algorithm. Applying dBiRS to detect genetic regions associated with fluid intelligence and prospective memory using whole-exome sequencing data from the UK Biobank, we validate previous findings and identify numerous novel rare variants near newly implicated genes. These discoveries offer valuable insights into the genetic basis of cognitive performance and neurodegenerative disorders, highlighting the potential of dBiRS as a scalable and powerful tool for whole-genome signal region detection.

Keywords: family-wise error rate, signal detection, distributed learning, whole genome association studies

1 Introduction

Understanding the genetic underpinnings of human diseases and traits remains a central focus of genetic research. Genome-Wide Association Studies (GWAS) have been instrumental in exploring the genetic architecture of complex diseases and traits over the past decade (Visscher et al., 2017). By using array-based technologies, GWAS analyzes millions of single nucleotide polymorphisms (SNPs) across the genome to identify those associated with specific traits or disease outcomes. While GWAS has successfully identified thousands of common genetic variants linked to disease susceptibility, these common variants explain only a small fraction of heritability, which is often referred to as the “missing heritability problem”(Manolio et al., 2009). The majority of genetic variants in the human genome are rare, and these rare variants are believed to contribute significantly to the unexplained heritability. However, the classical GWAS approach which focuses on single-SNP-based analysis has very limited power for analyzing rare variants, as the effect of a single SNP can be too small to be detected. (Liu et al., 2010; Bakshi et al., 2016).

To address this limitation, whole genome sequencing (WGS) studies are being conducted to identify rare variants associated with disease susceptibility. Progress has been made in developing set-based association methods, which test multiple variants jointly by aggregating their effects within defined genomic regions. These methods include burden tests (Morgenthaler & Thilly, 2007; Madsen & Browning, 2009), the Sequence Kernel Association Test (SKAT) (Wu et al., 2011), and STAAR which incorporates variant functional annotations to enhance detection power (Li, Li, Zhou et al., 2022). The STAAR-O test extends the STAAR framework as an omnibus test by combining multiple annotation-weighted methods into a single unified test. A common challenge in these approaches is defining regions for variant sets, especially in non-coding or intergenic regions without clear functional boundaries. STAAR addresses this issue by applying gene-centric analysis to well-defined gene-associated regions and fixed sliding window-based analysis to regions without clear boundaries. Alternative approaches include the scan statistic (Naus, 1982), which systematically searches the human genome using a fixed window size. Mean-based scan statistic methods are later proposed for DNA copy number analysis, allowing the use of multiple window sizes in settings closely related to change-point detection problems (Jeng et al., 2010; Olshen et al., 2004; Zhang et al., 2010). Other notable frameworks, such as those based on the knockoff (i.e.KnockoffScreen) (He et al., 2021a, b), conducts genome-wide set-based analyses to identify signal regions while mitigating the effects of correlation confounding. However, fixed-window approaches can result in a loss of power because the sizes of signal regions can vary across the genome.

Further advancements have been made to enable signal region detection with dynamic window sizes. Li, Li, Zhou et al. (2020) introduced SCANG which combines the scan algorithm proposed by Jeng et al. (2010) with burden tests, SKAT, and the omnibus test to continuously scan the genome. However, this approach lacks comprehensive theoretical and empirical analysis of false discoveries and shows limited power when functional annotations are not incorporated (Li, Li, Zhou et al., 2022). Building on the scan algorithm, Li, Liu & Lin (2020) developed the quadratic scan statistic (QSCAN), which aggregates information across intervals of varying sizes. QSCAN provides theoretical guarantees for controlling false discoveries and has demonstrated strong detection performance, particularly when signal regions contain both causal and neutral variants. However, scan-based methods require calculating test statistics for many candidate intervals and applying a fixed threshold for selection, which is computationally intensive and tends to be conservative. SCANG-STAAR (Li, Li, Zhou et al., 2022) extends SCANG’s dynamic window scanning by incorporating multiple functional annotations via STAAR to boost the power of detecting rare variant associations. While functional annotations can enhance power and interpretability, their effectiveness relies heavily on the quality and relevance of the annotations, which may introduce biases, computational challenges, and increased resource demand.

Beyond the scan algorithm, Zhang, Wang & Yao (2023) proposed the binary and re-search (BiRS) algorithm for detecting signal regions with dynamic window sizes in whole genome sequencing (WGS) studies. The BiRS algorithm iteratively splits identified regions until a minimum size is reached, providing theoretical guarantees for FWER and FDR, which is shown more computationally efficient than scan-based algorithms and demonstrates superior power compared to QSCAN and KnockoffScreen. Impressively, BiRS surpasses SCANG-STAAR in detecting moderate or weak signals without requiring functional annotations. The combination of computational speed, increased detection power and theoretical rigor make BiRS as a significant advancement in signal region detection for WGS studies. The BiRS algorithm was combined with the DCF two-sample mean test (Xue & Yao, 2020) for direct application to binary traits. However, it has not yet been extended to correct for multi-dimensional covariates or to test for non-binary outcomes.

In this paper, we generalize the BiRS algorithm to a distributed version (dBiRS) to combine with a new infinity-norm test statistic based on the summary statistics obtained from a generalized linear model. As the proposed test statistic fundamentally differs from the DCF two-sample mean test, this extension is non-trivial. We develope new theoretical guarantees to ensure the size control of dBiRS while preserving the detection accuracy of the original BiRS. The dBiRS algorithm operates in two main stages, combining results from local and central machines to support parallel computing and maintain computational efficiency, similar to distributed learning frameworks (Cai & Wei, 2022). In the first stage, BiRS is applied within genomic blocks, where local machines detect signal regions by calculating test statistics and thresholds within each block. Instead of simply aggregating block-wise results using a global threshold, the detected signal regions, along with their corresponding test statistics and thresholds, are transferred to a central machine. In the second stage, the central machine evaluates the significance of each block by performing a second layer of BiRS based on block-wise infinity-norm test statistics. Finally, a new threshold is constructed by multiplier bootstrap to reassess the signal regions within the significant blocks. We have shown theoretically that the dBiRS algorithm is able to consistently identifies true signal regions under more general alternative structures and in the presence of model misspecification under the alternative hypothesis. Simulated studies also demonstrate that dBiRS is more accurate and robust than the state-of-the-art KnockoffScreen and scan procedures.

Finally, we apply the BiRS algorithm to analyze whole exome sequence (WES) data from the UK Biobank, aiming to identify signal regions for intelligence and prospective memory. The dBiRS algorthm identified signal regions involving 327 genes, including 84 of which were previously reported to be assocaited with intelligence. Notable discoveries include both common and rare variants linked to cognitive functions, such as those in COL16A1, CRTC2, and PTPRF. Variants in genes like CRTC2, BRWD1, and TOP2B were also identified, highlighting their roles in endothelial function, immune system regulation, and neuronal development, respectively. In addition, the analysis revealed rare variants in 22 genes not previously associated with intelligence. Some of these genes are linked to Alzheimer’s disease, brain connectivity, and neuronal development, providing novel insights into the genetic basis of cognitive traits. For prospective memory, the study identified eight novel genes, including OMA1 and CNGB3, which are associated with neuroimaging measurements and cognitive functions.

The rest of the article is organized as follows. In Section 2, we introduce the testing procedure under GLM and describe the proposed dBiRS algorithm. We conduct comprehensive simulations in Section 3 to demonstrate that the proposed method enjoys preferably numerical performance compared with existing approaches. In Section 4, we apply the dBiRS algorithm to conduct Whole Exome Sequencing (WES) analyses on intelligence and prospective memory, aiming to deepen our understanding of cognitive aging and uncover genetic factors contributing to the risk of neurodegenerative disorders. We conclude the article with a discussion in Section 5, while the theoretical properties of the proposed algorithm, including size control and detection consistency, are deferred to the Appendix. Technical assumptions, lemmas, and proofs of the theoretical results, along with additional simulation and application results, are provided in the Supplementary Material.

2 Distributed detection algorithm

2.1 Global test

Suppose there are nn observations in the study. For the ii-th observation, YiY_{i} represents the outcome, Xi=(Xi1,,Xiq)X_{i}=(X_{i1},\dots,X_{iq})^{\top} is a vector containing qq covariates, and Gi=(Gi1,,Gip)G_{i}=(G_{i1},\dots,G_{ip})^{\top} is the genotype vector with pp variants. Let Y=(Y1,,Yn)Y=(Y_{1},\dots,Y_{n})^{\top}, 𝑿=(X1,,Xn)\boldsymbol{X}=(X_{1},\dots,X_{n})^{\top}, and 𝑮=(G1,,Gn)\boldsymbol{G}=(G_{1},\dots,G_{n})^{\top}. Conditional on XiX_{i} and GiG_{i}, we assume that YiY_{i} belongs to an exponential family with the density f(Yi)=exp{Yiθib(θi)ai(ϕ)+c(Yi,ϕ)},f(Y_{i})=\exp\left\{\frac{Y_{i}\theta_{i}-b(\theta_{i})}{a_{i}(\phi)}+c(Y_{i},\phi)\right\}, where ai()a_{i}(\cdot), b()b(\cdot), and c()c(\cdot) are known functions, and θi\theta_{i} and ϕ\phi are the canonical parameter and dispersion parameter, respectively, which indicates that YY following a generalized linear model (GLM):

g(η)=𝑿γ+𝑮β,g(\eta)=\boldsymbol{X}\gamma+\boldsymbol{G}\beta, (1)

where η=𝔼(Y𝑿,𝑮)\eta=\mathbb{E}(Y\mid\boldsymbol{X},\boldsymbol{G}) and g()g(\cdot) is a monotone link function.

Under the global null model where no genetic effect is present across the genome (i.e. β=0\beta=0), the GLM in (1) simplifies to g(η)=𝑿γg(\eta)=\boldsymbol{X}\gamma. Let η^0=g1(𝑿γ^)\hat{\eta}_{0}=g^{-1}(\boldsymbol{X}\hat{\gamma}), where γ^\hat{\gamma} is the maximum likelihood estimator (MLE) of γ\gamma under the global null model. The variance of YiY_{i} is var(Yi)=ai(ϕ)v(ηi)\operatorname{var}\left(Y_{i}\right)=a_{i}(\phi)v\left(\eta_{i}\right), where v(ηi)=b′′(θi)v\left(\eta_{i}\right)=b^{\prime\prime}\left(\theta_{i}\right) is a variance function. We define 𝚲=diag{a1(ϕ)v(η01),,an(ϕ)v(η0n)}\boldsymbol{\Lambda}=\operatorname{diag}\left\{a_{1}(\phi)v\left(\eta_{01}\right),\ldots,a_{n}(\phi)v\left(\eta_{0n}\right)\right\} and let 𝑷=𝚲1𝚲1𝑿(𝑿T𝚲1𝑿)1𝑿T𝚲1\boldsymbol{P}=\boldsymbol{\Lambda}^{-1}-\boldsymbol{\Lambda}^{-1}\boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{\Lambda}^{-1}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{\Lambda}^{-1}.

In genome-wide association studies (GWAS) and whole-genome sequencing (WGS) studies, the test statistic for the jj-th variant is constructed using the working marginal model g(ηi)=XiTα+Gijβj,g\left(\eta_{i}\right)=X_{i}^{T}\alpha+G_{ij}\beta_{j}, where we regress YiY_{i} on each variant GijG_{ij}, adjusting for the covariates XiX_{i}. The marginal score test statistic for βj\beta_{j} of the jj-th variant is given by

Uj=G.jT(Yη0^)/n.U_{j}=G_{.j}^{T}\left(Y-\hat{\eta_{0}}\right)/\sqrt{n}. (2)

Marginal score statistics UjU_{j}’s are often made available in public databases or provided by investigators to facilitate meta-analysis across multiple cohorts. Let U=(U1,,Up)U=(U_{1},\dots,U_{p})^{\top}, under the global null model (i.e. β=0\beta=0), U𝒩(0,𝚺)U\sim\mathcal{N}(0,\boldsymbol{\Sigma}), where 𝚺=𝑮P𝑮/n\boldsymbol{\Sigma}=\boldsymbol{G}^{\top}P\boldsymbol{G}/n. Let 𝚲^=diag{a1(ϕ^)v(η^01),,an(ϕ^)v(η^0n)}\boldsymbol{\hat{\Lambda}}=\operatorname{diag}\left\{a_{1}(\hat{\phi})v\left(\hat{\eta}_{01}\right),\ldots,a_{n}(\hat{\phi})v\left(\hat{\eta}_{0n}\right)\right\}, where ϕ^\hat{\phi} is the MLE of ϕ\phi under the global null hypothesis and η^0i\hat{\eta}_{0i} is the ii-th coordinate of η^0\hat{\eta}_{0}, i=1,,ni=1,\dots,n. Let 𝑷^=𝚲^1𝚲^1𝑿(𝑿T𝚲^1𝑿)1𝑿T𝚲^1\boldsymbol{\hat{P}}=\boldsymbol{\hat{\Lambda}}^{-1}-\boldsymbol{\hat{\Lambda}}^{-1}\boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{\hat{\Lambda}}^{-1}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{\hat{\Lambda}}^{-1}. In practice, Σ\Sigma can be well approximated by 𝑮P^𝑮/n\boldsymbol{G}^{\top}\hat{P}\boldsymbol{G}/n.

To test for the global null hypothesis, our proposed test statistic is defined as the maxima of marginal scores:

T=U=𝑮(Yη^0)/n,T=\left\lVert U\right\rVert_{\infty}=\left\lVert\boldsymbol{G}^{\top}(Y-\hat{\eta}_{0})/\sqrt{n}\right\rVert_{\infty},

where x=max{|x1|,,|xp|}\left\lVert x\right\rVert_{\infty}=\max\left\{\left\lvert x_{1}\right\rvert,\dots,\left\lvert x_{p}\right\rvert\right\} for xpx\in\mathbb{R}^{p}. The null hypothesis is rejected at a specified significance level α\alpha if T>c(α)T>c(\alpha), where c(α)c(\alpha) is a predefined threshold, specifically,

c(α)=inf{t:(Tt)1α}.c(\alpha)=\inf\left\{t\in\mathbb{R}:\mathbb{P}\left(T\leq t\right)\geq 1-\alpha\right\}.

Computing the threshold c(α)c(\alpha) requires the eigenvalues of Σ^\hat{\Sigma}, which are computationally intensive to calculate in practice when pp is large. To address this challenge, we propose an efficient Gaussian multiplier bootstrap procedure to approximate c(α)c(\alpha). We first generate NN normal random vectors of dimension n×1n\times 1, denoted as e1,e2,,eNe_{1},e_{2},\dots,e_{N}. We then calculate pseudo scores Ueb=𝑮P^1/2eb/nU^{e_{b}}=\boldsymbol{G}^{\top}\hat{P}^{1/2}e_{b}/\sqrt{n} for b=1,,Nb=1,\dots,N. The threshold c(α)c(\alpha) is approximated by the 100(1α)100(1-\alpha)-th percentile of pseudo scores:

c^(α)=fα(Ue1,,UeN),\hat{c}(\alpha)=f_{\alpha}\left(\left\lVert U^{e_{1}}\right\rVert_{\infty},\dots,\left\lVert U^{e_{N}}\right\rVert_{\infty}\right),

where fα(x1,x2,,xN)f_{\alpha}(x_{1},x_{2},\dots,x_{N}) represents the 100(1α)100(1-\alpha)-th percentile of the set {x1,x2,,xN}\{x_{1},x_{2},\dots,x_{N}\}.

2.2 Detection with marginal scores

In this subsection, we introduce our signal region detection algorithm combined with the above test statistic TT. We first introduce some key concepts for signal regions used throughout the paper. Under the alternative hypothesis that there exists signal regions, we define a signal point with index jj given that βj0\beta_{j}\neq 0, 1jp1\leq j\leq p. Given the polygenic nature of the genome, a genetic region II may contain consecutive signal points. We assume that a signal region satisfies the continuity property if βI0\beta_{I}\neq 0, where the “\neq” means that no pair of elements is equal. Furthermore, we assume that the signal region II satisfies the separability assumption: for any area that contains II, the edges of II are signal points and there are no signal points next to II. Lastly, we denote the true signal regions by I1,,II_{1}^{*},\dots,I_{\ell}^{*} (if exist) and let ={I1,,I}\mathcal{I}=\left\{I_{1}^{*},\dots,I_{\ell}^{*}\right\}.

Our goal is to determine whether signal regions exist and, if so, to identify their locations. Specifically, we aim first to test:

H0:=v.s.H1:,H_{0}:\mathcal{I}=\emptyset\quad\mathrm{v.s.}\quad H_{1}:\mathcal{I}\neq\emptyset, (3)

if H0H_{0} is rejected, we proceed to detect each signal region in \mathcal{I}. Following Zhang, Wang & Yao (2023), the algorithm first compares TT with c^(α)\hat{c}(\alpha) to determine whether signal regions exit. If T>c^(α)T>\hat{c}(\alpha), we conduct a binary search procedure that utilizes a sequence of dynamic thresholds generated by the bootstrap vectors to find the specific locations of the signal regions. For an index set I={i1,,ir}I=\{i_{1},\dots,i_{r}\}, we define U(I)=(Ui1,Ui2,,Uir)U(I)=(U_{i_{1}},U_{i_{2}},\dots,U_{i_{r}}) and similar for Ueb(I)U^{e_{b}}(I), b=1,,Nb=1,\dots,N. The detailed binary search procedure is as follows.

At the first step, we split the genome into two segments, denoted as I11={1,2,,p/2}I_{11}=\left\{1,2,\dots,\lfloor p/2\rfloor\right\} and I12={p/2+1,,p}I_{12}=\left\{\lfloor p/2\rfloor+1,\dots,p\right\} with \lfloor\cdot\rfloor taking the value of the largest integer below. Let I1=I11I12I_{1}=I_{11}\cup I_{12}. The test statistic for the kkth segment is T1k=U(I1k)T_{1k}=\left\lVert U(I_{1k})\right\rVert_{\infty}, for k=1,2k=1,2, and the threshold for the two tests is calculated by c^1(α)=fα(Ue1(I1),,UeN(I1))\hat{c}_{1}(\alpha)=f_{\alpha}(\left\lVert U^{e_{1}}(I_{1})\right\rVert_{\infty},\dots,\left\lVert U^{e_{N}}(I_{1})\right\rVert_{\infty}). If T1k>c^1(α)T_{1k}>\hat{c}_{1}(\alpha), then I1kI_{1k} is considered to possibly contain signal regions and is subjected to further binary search procedures. Otherwise, it is concluded that there are no signals within I1kI_{1k}.

Suppose we further conduct the binary search procedure for regions that may contain signals. During the jj-th binary search, there are njn_{j} segments to be tested (1nj2j1\leq n_{j}\leq 2^{j}), denoted as Ij1,,IjnjI_{j_{1}},\dots,I_{jn_{j}}, with Ij=k=1njIjkI_{j}=\cup_{k=1}^{n_{j}}I_{j_{k}}. The critical value for all njn_{j} tests is calculated as

c^j(α)=fα(Ue1(Ij),,UeN(Ij)).\hat{c}_{j}(\alpha)=f_{\alpha}(\left\lVert U^{e_{1}}(I_{j})\right\rVert_{\infty},\dots,\left\lVert U^{e_{N}}(I_{j})\right\rVert_{\infty}).

If Tjk>c^j(α)T_{jk}>\hat{c}_{j}(\alpha) for k=1,,njk=1,\dots,n_{j}, we perform binary segmentation on IjkI_{jk} to conduct tests in the next iteration of binary search procedure. The segmentation stops if the length of IjkI_{jk} is sufficiently small such that |Ijk|2s\left\lvert I_{jk}\right\rvert\leq 2^{s}, where ss is a truncation parameter. This binary search procedure is repeated iteratively until no significant segments remain or all segments become sufficiently small. The detected signal regions are denoted as I^11,,I^1l1\hat{I}_{11},\dots,\hat{I}_{1l_{1}}. We want to emphasis that this binary search step utilizes a sequence of dynamic critical values, which enables the detection of more signals compared to procedures with fixed thresholds. Specifically, since Ueb(Ij)U^{e_{b}}(I_{j}) is a sub-vector of UebU^{e_{b}}, it follows that Ueb(Ij)Ueb\left\lVert U^{e_{b}}(I_{j})\right\rVert_{\infty}\leq\left\lVert U^{e_{b}}\right\rVert_{\infty}, for b=1,,Nb=1,\dots,N. This implies c^(α)c^1(α)c^j(α)\hat{c}(\alpha)\geq\hat{c}_{1}(\alpha)\geq\cdots\geq\hat{c}_{j}(\alpha)\geq\cdots, which increase the power to detect weaker signals as the binary search progresses.

We next implement a re-search step to detect signals that may have been missed. We substitute the variables within I^11,,I^1l1\hat{I}_{11},\dots,\hat{I}_{1l_{1}} with zeros, and repeat the global test and binary search. This process is iterated until the global test confirms that no signals remain or all segments are sufficiently small. We finally rearrange the detected signal regions based on their locations in the genome, denoted as I^1,,I^\hat{I}_{1},\dots,\hat{I}_{\ell}. The complete detection procedure is referred to as sBiRS and is summarized in Algorithm 2.2.

  Algorithm 1 BiRS with summary statistics (sBiRS)

 

1:Input: the vector of marginal scores UU; the bootstrap vectors Ue1,,UeNU^{e_{1}},\dots,U^{e_{N}}; truncation parameter ss; significant level α\alpha;  
2:Calculate T=UT=\left\lVert U\right\rVert_{\infty} and c^(α)=fα(Ue1,,UeN)\hat{c}(\alpha)=f_{\alpha}(\left\lVert U^{e_{1}}\right\rVert_{\infty},\dots,\left\lVert U^{e_{N}}\right\rVert_{\infty});
3:if Tc^(α)T\leq\hat{c}(\alpha) then
4:     Output: There are no signal regions.
5:else
6:     Conduct the binary search procedure based on UU and Ue1,,UeNU^{e_{1}},\dots,U^{e_{N}} with truncation parameter ss;
7:     Replace the elements of UU and Ue1,,UeNU^{e_{1}},\dots,U^{e_{N}} in estimated signal regions from the last step as zeros;
8:     Recalculate TT and c^(α)\hat{c}(\alpha);
9:     while T>c^(α)T>\hat{c}(\alpha) do
10:         Conduct the binary search procedure based on the new UU and Ue1,,UeNU^{e_{1}},\dots,U^{e_{N}};
11:         Replace the elements of UU and Ue1,,UeNU^{e_{1}},\dots,U^{e_{N}} in estimated signal regions from the last step as zeros;
12:         Recalculate TT and c^(α)\hat{c}(\alpha);
13:     end while
14:     Rearrange the estimated signal regions to be continuous and separated;
15:     Output: Estimated signal regions I^1,,I^^\hat{I}_{1},\dots,\hat{I}_{\hat{\ell}}.
16:end if

 

2.3 Detection over multiple blocks: a distributed algorithm

In whole genome association studies, the number of variants in the genome is about 10710^{7}. Implementing this searching algorithm requires the use of bootstrap vectors throughout the procedure, which results in excessive memory usage. Additionally, it is impractical to load the entire genotype matrix at once when working with individual-level data. A straightforward solution is to divide the genome into KK blocks and detect signal regions within each block using a significant level of α/K\alpha/K (i.e., the Bonferroni correction), but this approach is conservative in most cases. Alternatively, detection procedures can be performed with fixed thresholds by applying global thresholds to control the detection results within each block. While this method avoids conservativeness, it reduces the power of sBiRS due to the use of higher critical values compared to dynamic thresholds which are adjusted based on prior testing results. This issue is analogous to the power reduction phenomenon observed in distributed learning (Cai & Wei, 2022), where large-scale learning tasks are performed across multiple nodes. To address this problem, we introduce a distributed version of our detection algorithm in this subsection. This approach preserves the advantages of dynamic thresholds while maintaining control over family-wise error rates (FWERs) and false discovery rates (FDRs).

We divide the entire region into KK blocks, denoted as B1,,BKB_{1},\dots,B_{K}. For block kk (k=1,,Kk=1,\dots,K), the vector of marginal scores is represented by U(Bk)U(B_{k}), and the corresponding bootstrap vectors are Ue1(Bk),,UeN(Bk)U^{e_{1}}(B_{k}),\dots,U^{e_{N}}(B_{k}). Within each block, a local machine applies the BiRS algorithm, resulting in a collection of detected signal regions ^k={I^1(k),,I^k(k)}\hat{\mathcal{I}}_{k}=\left\{\hat{I}_{1}^{(k)},\dots,\hat{I}_{\ell_{k}}^{(k)}\right\} for block kk, k=1,,Kk=1,\dots,K. The union of the detected signal regions in block kk is denoted as I^(k)=i=1kI^i(k)\hat{I}^{(k)}=\cup_{i=1}^{\ell_{k}}\hat{I}_{i}^{(k)}. The corresponding collection of maximum marginal scores is defined as:

𝒯k={T(Bk),T(I^1(k)),,T(I^k(k))}={U(Bk),U(I^1(k)),,U(I^k(k))}.\mathcal{T}_{k}=\left\{T(B_{k}),T(\hat{I}_{1}^{(k)}),\dots,T(\hat{I}_{\ell_{k}}^{(k)})\right\}=\left\{\|U(B_{k})\|_{\infty},\|U(\hat{I}_{1}^{(k)})\|_{\infty},\dots,\|U(\hat{I}_{\ell_{k}}^{(k)})\|_{\infty}\right\}.

For variables related to the bootstrap vectors, we define the following quantities for block kk:

M(Bk)\displaystyle M(B_{k}) =(M1(Bk),,MN(Bk))=(Ue1(Bk),,UeN(Bk)),\displaystyle=\left(M_{1}(B_{k}),\dots,M_{N}(B_{k})\right)^{\top}=\left(\|U^{e_{1}}(B_{k})\|_{\infty},\dots,\|U^{e_{N}}(B_{k})\|_{\infty}\right)^{\top},
L(Bk)\displaystyle L(B_{k}) =(L1(Bk),,LN(Bk))=(Ue1(I^(k)),,UeN(I^(k))).\displaystyle=\left(L_{1}(B_{k}),\dots,L_{N}(B_{k})\right)^{\top}=\left(\|U^{e_{1}}(\hat{I}^{(k)})\|_{\infty},\dots,\|U^{e_{N}}(\hat{I}^{(k)})\|_{\infty}\right)^{\top}.

Finally, we transfer the block results

k={^k,𝒯k,M(Bk),L(Bk)},k=1,,K,\mathcal{R}_{k}=\left\{\hat{\mathcal{I}}_{k},\mathcal{T}_{k},M(B_{k}),L(B_{k})\right\},\quad k=1,\dots,K,

to the central machine for further analysis.

In the central machine, we treat these blocks as potential signal points and apply the sBiRS procedure to identify which blocks are significant. Precisely, we define

U~=(T(B1),,T(BK)),\tilde{U}=(T(B_{1}),\dots,T(B_{K}))^{\top},

where each component represents the maximum marginal score within each block. The corresponding bootstrap vectors for bb-th boostrap are defined as

U~eb=(Mb(B1),,Mb(BK)),b=1,N.\tilde{U}^{e_{b}}=(M_{b}(B_{1}),\dots,M_{b}(B_{K}))^{\top},b=1,\dots N.

We run the sBiRS algorithm with U~,U~e1,,U~eN\tilde{U},\tilde{U}^{e_{1}},\dots,\tilde{U}^{e_{N}}, truncation parameter s=0s=0 and significant level α\alpha. Suppose the significant blocks identified by sBiRS are Bj1,,Bjk^B_{j_{1}},\dots,B_{j_{\hat{k}}}, where j1,,jk^j_{1},\ldots,j_{\hat{k}} are indices of significant blocks. We remove insignificant blocks, even if there are signal regions detected by the local machine, and then perform a control procedure for the detected signal regions within the significant blocks. For the bb-th boostrap vector, we define

L~eb=(Lb(Bj1),,Lb(Bjk^))=(Ueb(I^(j1)),,Ueb(I^(jk^))).\tilde{L}^{e_{b}}=\left(L_{b}(B_{j_{1}}),\dots,L_{b}(B_{j_{\hat{k}}})\right)^{\top}=\left(\|U^{e_{b}}(\hat{I}^{(j_{1})})\|_{\infty},\dots,\|U^{e_{b}}(\hat{I}^{(j_{\hat{k}})})\|_{\infty}\right)^{\top}.

Based on the L~eb\tilde{L}^{e_{b}}’s, we calculate new critical values c^min(α)\hat{c}_{\min}(\alpha) for the test statistics in each signal region within the significant blocks:

c^min(α)=fα(L~e1,,L~eN).\hat{c}_{\min}(\alpha)=f_{\alpha}(\tilde{L}^{e_{1}},\dots,\tilde{L}^{e_{N}}).

For the rr-th block and ii-th signal region within the block (r=j1,,jk^r=j_{1},\dots,j_{\hat{k}}, i=1,,ri=1,\dots,\ell_{r}), the signal region I^i(r)\hat{I}_{i}^{(r)} is significant if Tn(I^i(r))>c^min(α)T_{n}(\hat{I}_{i}^{(r)})>\hat{c}_{\min}(\alpha). This distributed algorithm is summarized in Algorithm 2.3.

Since the significance level for the detection procedure within each block was set to α\alpha, the family-wise error rate (FWER) across multiple tests on the final estimated signal regions may not be controlled. Therefore, it is necessary to construct the threshold c^min(α)\hat{c}_{\min}(\alpha) to ensure FWER control by filtering out less significant signal regions. This control procedure also helps manage the proportion of false discoveries, leading to consistent detection results, as demonstrated in the proof of Theorem 3 in the Supplementary Material.

  Algorithm 2 Distributed BiRS (dBiRS)

 

1:Input: the vector of marginal scores UU; the bootstrap vectors Ue1,,UeNU^{e_{1}},\dots,U^{e_{N}}; truncation parameter ss; significant level α\alpha; number of blocks KK;  
2:Divide the genome into KK blocks B1,,BKB_{1},\dots,B_{K};
3:for k=1,,Kk=1,\dots,K do
4:     Run sBiRS(U(Bk),Ue1(Bk),,UeN(Bk),s)\mathrm{sBiRS}(U(B_{k}),U^{e_{1}}(B_{k}),\dots,U^{e_{N}}(B_{k}),s) to derive the collection of detection results k\mathcal{R}_{k} in block BkB_{k};
5:end for
6:Let U~=(T(B1),,T(BK))\tilde{U}=(T(B_{1}),\dots,T(B_{K}))^{\top} and U~eb=(Mb(B1),,Mb(BK))\tilde{U}^{e_{b}}=(M_{b}(B_{1}),\dots,M_{b}(B_{K}))^{\top}, b=1,,Nb=1,\dots,N;
7:Run sBiRS(U~,U~e1,,U~eN,0)\mathrm{sBiRS}(\tilde{U},\tilde{U}^{e_{1}},\dots,\tilde{U}^{e_{N}},0) to get the significant blocks j1,,jk^j_{1},\dots,j_{\hat{k}}, k^K\hat{k}\leq K;
8:Let c^min(α)=fα(L~e1,,L~eN)\hat{c}_{\min}(\alpha)=f_{\alpha}(\tilde{L}^{e_{1}},\dots,\tilde{L}^{e_{N}}), where L~eb=(Lb(Bj1),,Lb(Bjk^)\tilde{L}^{e_{b}}=(L_{b}(B_{j_{1}}),\dots,L_{b}(B_{j_{\hat{k}}})^{\top};
9:for r=j1,j2,jk^r=j_{1},j_{2},\dots j_{\hat{k}} do
10:     for i=1,,ri=1,\dots,\ell_{r} do
11:         if Tn(I^i(r))>c^min(α)T_{n}(\hat{I}_{i}^{(r)})>\hat{c}_{\min}(\alpha) then
12:              Take I^i(r)\hat{I}_{i}^{(r)} as one of the estimated signal regions;
13:         end if
14:     end for
15:end for
16:Rearrange these estimated signal regions to be continuous and separable;
17:Output: Estimated signal regions I^1,,I^^\hat{I}_{1},\dots,\hat{I}_{\hat{\ell}}.

 

In practice, we typically select blocks with dimensions between 10310^{3} and 10410^{4}. Consequently, the number of blocks KK is approximately 10310^{3} to 10410^{4}, given that the total dimension pp may exceed 10710^{7}. Because signal regions are usually sparsely distributed across the genome, most blocks are unlikely to contain signals. Applying sBiRS to determine the significance of blocks will sequentially remove most of the blocks and result in a reduction of dynamic critical values, which helps find more significant blocks with relatively weaker signals. Moreover, this procedure is equivalent to running sBiRS with truncation log2(maxkpk)slog2(minkpk)+1\log_{2}(\max_{k}p_{k})\leq s\leq\log_{2}(\min_{k}p_{k})+1, where pkp_{k} represents the dimension of block BkB_{k} for k=1,,Kk=1,\dots,K.

2.4 Theoretical guarantees

For the flow of exposition, we provide a description of the theoretical results here, and defer the technical presentation of the family-wise error rate and the detection accuracy in the Appendix. For size analysis, after assuming some mild conditions to the exponential family, the dBiRS algorithm asymptotically controls the family-wise error rate of test problem (3). The method allows the dimension to grow at an exponential rate relative to the sample size. For detection accuracy analysis, we prove that the dBiRS algorithm overcomes the power reduction phenomenon and maintains the facilitated properties of the BiRS algorithm under GLM model. Specifically, when there are model misspecifications under the alternative hypothesis, the dBiRS algorithm can still achieve detection consistency, even when the signal strength of the signal regions decays at a certain rate. In contrast, the Q-SCAN method requires balanced signal strengths across regions. Moreover, the dBiRS algorithm relaxes the M-dependence assumption to a ”weak” dependence assumption, which permits long-range correlation (LD) and is more suitable to genetic association studies. Additionally, with an appropriate block-splitting strategy, the dBiRS algorithm imposes fewer restrictions on the lengths of signal regions, which enables consistent detection of both shorter or longer signal regions compared to Q-SCAN. See Theorem 1, 2 and 3 for more details.

3 Simulation Study

We conduct simulation studies to compare the proposed dBiRS method with the state-of-art Q-SCAN and KnockoffScreen procedure. We generate sequence data of European ancestry from 10,000 chromosomes across 5-megabase (Mb) regions using the calibrated coalescent model (Sabeti & Schaffner, 2014), and the total number of variants is 349,640. We evaluate the family-wise error rate (FWER), false discovery rate (FDR), signal region detection rate (DR), and true positive rate (TPR) for both continuous and binary phenotypes.

For the evaluation of FWER, the continuous phenotypes are generated using the model:

Y=0.5X1+0.5X2+ε,Y=0.5X_{1}+0.5X_{2}+\varepsilon,

where X1X_{1} is a continuous covariate sampled from a standard normal distribution, X1𝒩(0,1)X_{1}\sim\mathcal{N}(0,1), and X2X_{2} is a dichotomous covariate that takes values 0 and 11 with equal probability. The random noise ε\varepsilon is generated from a standard normal distribution, ε𝒩(0,1)\varepsilon\sim\mathcal{N}(0,1). The dichotomous phenotypes are generated using the following logistic regression model:

logit{(Y=1)}=0.5X1+0.5X2,\mathrm{logit}\left\{\mathbb{P}(Y=1)\right\}=0.5X_{1}+0.5X_{2},

where (Y=1)=((Y1=1),,(Yn=1))\mathbb{P}(Y=1)=(\mathbb{P}(Y_{1}=1),\dots,\mathbb{P}(Y_{n}=1))^{\top}. We perform dBiRS and Q-SCAN analyses based on 1,000 Monte Carlo runs under the linear and logistic models to compute FWERs. For both dBiRS and Q-SCAN, the number of bootstrap iterations is set to 1,000. We do not calculate the FWER for KnockoffScreen because it only controls the FDR (He et al., 2021a). The empirical FWERs estimated for dBiRS and Q-SCAN are presented in Table 1 for significance levels α=0.01\alpha=0.01 and α=0.05\alpha=0.05. The FWERs of Q-SCAN and dBiRS are accurate at both significance levels, demonstrating that both methods effectively control the FWER.

Table 1: FWERs of dBiRS and Q-SCAN in continuous and dichotomous phenotypes
         continuous          dichotomous
         α\alpha          0.01          0.05          0.01          0.05
         dBiRS          0.011          0.053          0.011          0.051
         Q-SCAN          0.011          0.051          0.009          0.048

We next evaluate the detection accuracy of dBiRS and compare its performance with Q-SCAN and KnockoffScreen. We randomly select four 5kb causal windows, denoted as IiI_{i}, i=1,,4i=1,\dots,4. We generate continuous and dichotomous phenotypes by

Y=0.5X1+0.5X2+𝑮I1βI1++𝑮I4βI4+ε,Y=0.5X_{1}+0.5X_{2}+\boldsymbol{G}_{I_{1}}\beta_{I_{1}}+\cdots+\boldsymbol{G}_{I_{4}}\beta_{I_{4}}+\varepsilon,

and

logit{(Y=1)}=0.5X1+0.5X2+𝑮I1βI1++𝑮I4βI4,\mathrm{logit}\left\{\mathbb{P}(Y=1)\right\}=0.5X_{1}+0.5X_{2}+\boldsymbol{G}_{I_{1}}\beta_{I_{1}}+\cdots+\boldsymbol{G}_{I_{4}}\beta_{I_{4}},

where 𝑮I1,,𝑮I4\boldsymbol{G}_{I_{1}},\dots,\boldsymbol{G}_{I_{4}} represent the genotypes of the causal windows, and βI1,,βI4\beta_{I_{1}},\dots,\beta_{I_{4}} are the corresponding effect sizes. In each causal window, 10%10\% of the variants are randomly designated as causal, and each causal variant is assigned an effect size as a decreasing function of the minor allele frequency (MAF), i.e., |β|=c|log10(MAF)|\left\lvert\beta\right\rvert=c\left\lvert\log_{10}(\mathrm{MAF})\right\rvert. The parameter cc is set to c{0.12,0.15}c\in\left\{0.12,0.15\right\} for continuous outcomes and c{0.25,0.30}c\in\left\{0.25,0.30\right\} for dichotomous outcomes. The sign of β\beta is randomly assigned with 50%50\% of the values being positive and 50%50\% negative.

We evaluate the DR,TPR and hh kilobase (kb) FDR for three methods based on 100 Monte Carlo runs across different values of cc (Li, Liu & Lin, 2022; He et al., 2021a). Specifically, we denote the true signal regions as I1,,II_{1}^{*},\dots,I^{*}_{\ell} and the estimated signal regions as I^1,,I^^\hat{I}_{1},\dots,\hat{I}_{\hat{\ell}}. Let I=i=1IiI^{*}=\cup_{i=1}^{\ell}I^{*}_{i} represent the union of the true signal regions, and I^=i=1^I^i\hat{I}=\cup_{i=1}^{\hat{\ell}}\hat{I}_{i} represent the union of the estimated signal regions. The DR and TPR are defined as

DR=1𝔼i=1𝟏{I^I},TPR=𝔼(|II^|/|I|).\mathrm{DR}=\frac{1}{\ell}\mathbb{E}\sum_{i=1}^{\ell}\boldsymbol{1}\left\{\hat{I}\cap I^{*}_{\ell}\neq\emptyset\right\},\quad\mathrm{TPR}=\mathbb{E}\left(\left\lvert I^{*}\cap\hat{I}\right\rvert/\left\lvert I^{*}\right\rvert\right).

Following He et al. (2021a), we define the FDR(hh) for h{25,50,75}h\in\{25,50,75\} as

FDR(h)=𝔼(|{j:jI^,d(j,I^)h}|/|I^|)\mathrm{FDR}(h)=\mathbb{E}\left(\left\lvert\left\{j:j\in\hat{I},d(j,\hat{I})\geq h\right\}\right\rvert/\left\lvert\hat{I}\right\rvert\right)

to account for the spill-over effect of the LD structure, where d(j,I)=miniI|ij|d(j,I^{*})=\min_{i\in I^{*}}\left\lvert i-j\right\rvert is the minimum distance between point jj and true signal regions. The DR represents the proportion of true signal regions that are detected and measures the ability to identify signal regions. In contrast, the TPR and FDR(hh) evaluate how well the true signal regions are recovered by assessing the similarity or distance between the true and identified signal regions. The DR, TPR, and FDR(hh) of the three methods for continuous and dichotomous phenotypes are presented in Table 2. Additionally, we provide the standard deviations (SDs) for DR, TPR, and FDR(hh) in Table 3.

Table 2: Detection results (DR, TPR, FDR(hh)) for continuous and dichotomous phenotypes, h=25,50,75h=25,50,75.
     DR      TPR      FDR(25)      FDR(50)      FDR(75)
     continuous, c=0.12c=0.12
     dBiRS      0.990\boldsymbol{0.990}      0.830\boldsymbol{0.830}      0.192\boldsymbol{0.192}      0.049\boldsymbol{0.049}      0.023\boldsymbol{0.023}
     Q-SCAN      0.980      0.783      0.234      0.077      0.034
     KnockoffScreen      0.980      0.227      0.475      0.379      0.287
     continuous, c=0.15c=0.15
     dBiRS      1.000\boldsymbol{1.000}      0.881\boldsymbol{0.881}      0.285\boldsymbol{0.285}      0.094\boldsymbol{0.094}      0.037\boldsymbol{0.037}
     Q-SCAN      1.000\boldsymbol{1.000}      0.837      0.326      0.113      0.040
     KnockoffScreen      1.000\boldsymbol{1.000}      0.505      0.605      0.516      0.408
     dichotomous, c=0.25c=0.25
     dBiRS      0.990\boldsymbol{0.990}      0.793\boldsymbol{0.793}      0.183\boldsymbol{0.183}      0.049\boldsymbol{0.049}      0.020\boldsymbol{0.020}
     Q-SCAN      0.930      0.682      0.203      0.077      0.039
     KnockoffScreen      0.960      0.184      0.495      0.416      0.237
     dichotomous, c=0.30c=0.30
     dBiRS      1.000\boldsymbol{1.000}      0.853\boldsymbol{0.853}      0.235\boldsymbol{0.235}      0.058\boldsymbol{0.058}      0.019\boldsymbol{0.019}
     Q-SCAN      0.980      0.766      0.277      0.082      0.041
     KnockoffScreen      0.990      0.361      0.533      0.440      0.294
Table 3: Standard deviations for DR, TPR, FDR(hh) in continuous and dichotomous phenotypes, h=25,50,75h=25,50,75.
sd(DR) sd(TPR) sd(FDR(25)) sd(FDR(50)) sd(FDR(75))
continuous, c=0.12c=0.12
dBiRS 0.050\boldsymbol{0.050} 0.065\boldsymbol{0.065} 0.051\boldsymbol{0.051} 0.034\boldsymbol{0.034} 0.027\boldsymbol{0.027}
Q-SCAN 0.069 0.105 0.083 0.056 0.052
KnockoffScreen 0.069 0.178 0.254 0.249 0.203
continuous, c=0.15c=0.15
dBiRS 0.000\boldsymbol{0.000} 0.037\boldsymbol{0.037} 0.046\boldsymbol{0.046} 0.039\boldsymbol{0.039} 0.031\boldsymbol{0.031}
Q-SCAN 0.000\boldsymbol{0.000} 0.072 0.073 0.050 0.042
KnockoffScreen 0.000\boldsymbol{0.000} 0.220 0.111 0.091 0.074
dichotomous, c=0.25c=0.25
dBiRS 0.05\boldsymbol{0.05} 0.07\boldsymbol{0.07} 0.088\boldsymbol{0.088} 0.040\boldsymbol{0.040} 0.031\boldsymbol{0.031}
Q-SCAN 0.115 0.126 0.096 0.066 0.050
KnockoffScreen 0.093 0.119 0.210 0.214 0.188
dichotomous, c=0.30c=0.30
dBiRS 0.000\boldsymbol{0.000} 0.039\boldsymbol{0.039} 0.079\boldsymbol{0.079} 0.037\boldsymbol{0.037} 0.024\boldsymbol{0.024}
Q-SCAN 0.069 0.102 0.107 0.068 0.044
KnockoffScreen 0.050 0.168 0.131 0.121 0.123

Table 2 shows that dBiRS achieves the highest DRs and TPRs with the lowest FDRs across all signal strengths cc and for all values of the distance parameter h=25,50,75h=25,50,75 in both continuous and dichotomous phenotypes. As the value of hh increases, the FDR of all methods decreases, and the FDR of dBiRS and Q-SCAN falls below 0.05 when h=75h=75 kb. Q-SCAN outperforms KnockoffScreen in recovering true signal regions by achieving higher TPR and lower FDR. However, it identifies a smaller proportion of true signal regions (i.e., a lower DR) compared to KnockoffScreen. KnockoffScreen employs LD-pruning with a 0.75 correlation threshold to remove highly correlated variants before conducting the analysis. It identifies a larger number of signal regions, which increases the chance of overlap with true signal regions and thereby boosting the DR. However, the signal regions identified by KnockoffScreen have lower coverage of true signal regions due to both the LD-pruning process and a higher number of false discoveries. Additionally, Table 3 shows that dBiRS has the lowest standard deviations in all cases, indicating that dBiRS is the most stable method.

To further illustrate the recovery of true signal regions across different methods, we present the selection probabilities for all variants, where the selection probability of a variant jj is defined as the proportion of replications in which it is identified as a signal variant. We present the selection probabilities under the continuous phenotype and the dichotomous phenotype in Figure 1 and Figure 2, respectively.

Refer to caption
Figure 1: The selection probabilities of dBiRS, Q-SCAN and KnockoffScreen under dichotomous phenotype. The left panel is the selection probabilities when c=0.25c=0.25 and the right panel is the selection probabilities when c=0.30c=0.30. The red arrows refer to the locations of four true signal regions.
Refer to caption
Figure 2: The selection probabilities of dBiRS, Q-SCAN and KnockoffScreen under dichotomous phenotype. The left panel is the selection probabilities when c=0.25c=0.25 and the right panel is the selection probabilities when c=0.30c=0.30. The red arrows refer to the locations of four true signal regions.

While all procedures exhibit relatively high selection probabilities around the four true signal regions, dBiRS demonstrates the greatest accuracy and stability, achieving selection probabilities exceeding 0.95 in all true signal regions across all settings. When signals are relatively weak (i.e., c=0.12c=0.12 for the continuous trait in Figure 1A and c=0.25c=0.25 for the binary trait in Figure 2A), Q-SCAN shows relatively low selection probabilities in the leftmost true signal region (i.e., selection probability <0.95<0.95) and identifies some variants in non-signal regions at the beginning of the genome. This suggests that the Q-SCAN procedure is more prone to producing misleading results across replications, which is consistent with results that show a higher FDR compared to dBiRS. KnockoffScreen demonstrates the lowest selection probabilities in the four true signal regions and identifies only one variant with a selection probability greater than 0.95 (Figure 1A and Figure 2A). When the signal is stronger (i.e., c=0.15c=0.15 for the continuous trait in Figure 1B and c=0.30c=0.30 for the binary trait in Figure 2B), KnockoffScreen exhibits substantially higher selection probabilities in non-signal regions compared to both Q-SCAN and dBiRS (Figure 1B). The spread of selection probabilities across non-signal regions highlights a lack of stability for KnockoffScreen and indicates its tendency to identify false positives for spurious variants. Based on these results, we conclude that the dBiRS method outperforms the other procedures in terms of both detection accuracy and robustness across different settings.

4 Application

Extensive research has identified common genetic variants associated with cognitive health, but the roles of rare genetic variants, which often have more subtle and complex effects, remain poorly understood. In this study, we apply the dBiRS algorithm to conduct Whole Exome Sequencing (WES) analyses on the core cognitive phenotypes: fluid intelligence (Field ID 20016) and prospective memory (Field ID 20018), using data from the UK Biobank (https://biobank.ctsu.ox.ac.uk/crystal/index.cgi). The WES data used in this study is derived from the final exome release in PLINK format (Field ID 23158) from the UK Biobank. Fluid intelligence reflects problem-solving and reasoning abilities, while prospective memory measures the capacity to remember and execute planned actions, a critical daily-life function that often declines with age. These phenotypes are closely linked to aging and neurodegeneration. By analyzing these traits, we aim to uncover rare variants that may confer either risk or resilience to cognitive functions, providing insights into the genetic and biological mechanisms underlying cognitive impairment. This approach has the potential to deepen our understanding of cognitive aging and uncover genetic factors contributing to the risk of neurodegenerative disorders such as Alzheimer’s disease (AD).

We implemented a standard quality control procedure to ensure the integrity of the dataset before analysis. First, we excluded individuals flagged as outliers by the UK Biobank based on genotyping missingness rates or heterogeneity, as well as those whose genotypically inferred sex did not align with their self-reported sex. To address population stratification, we utilized principal component analysis provided by the UK Biobank and excluded individuals identified as non-European. Specifically, we removed individuals whose values for either of the first two principal components deviated by more than five standard deviations from the mean. We also excluded participants who self-reported an ethnicity other than European. Furthermore, individuals with more than 5%5\% missing genotype data across variants passing UK Biobank’s quality control were removed. For variant-level quality control, we retained only biallelic autosomal variants assayed by both genotyping arrays used by the UK Biobank. Variants that failed UK Biobank quality control in any genotyping batch were excluded. Additionally, we removed variants with a Hardy-Weinberg equilibrium (HWE) pp-value below 105010^{-50} or with a minor allele count (MAC) 1\leq 1. This stringent filtering resulted in 13,681,006 variants across 22 chromosomes in the WES dataset, 154,785 samples for fluid intelligence analysis, and 155,448 samples for prospective memory analysis. After the quality control procedure, the dBiRS analysis was performed using summary statistics derived from a generalized linear model. The model adjusted for covariates, including sex, age, assessment center, and the top five genomic principal components provided by the UK Biobank.

Figure 1 illustrates the distribution of functional consequences of variants identified by dBiRS, with a substantial number of variants located in exonic, intronic, and UTR regions. Variants in the exonic region may directly alter protein structure and function. Variants in the 3’ UTR (UTR3) are likely involved in post-transcriptional regulation, including mRNA stability and microRNA binding. Intronic variants could disrupt splicing mechanisms or long-range regulatory elements, further impacting gene function. While exonic variants are commonly the focus of WES studies, our findings also underscore the important role of rare variants in non-coding and regulatory regions, such as UTRs and intronic regions, which have historically been understudied in the context of intelligence and prospective memory.

For the study on intelligence, dBiRS identified signal regions encompassing 327 genes, including 84 genome-wide significant genes previously reported for their association with intelligence. Additionally, there are 11 genome-wide significant genes overlapping with those associated with general cognitive function, 30 genes overlapping with those implicated in schizophrenia, 65 genes associated with educational attainment, and 35 genes overlapping with autism spectrum disorder. Based on the GWAS Catalog, 33 genome-wide significant common variants were also identified in our WES study. Notable examples include SNP rs2271928 in COL16A1, SNP rs11264680 in CRTC2, and SNP rs539096 in PTPRF, among others.

The dBiRS method identifies numerous additional rare variant associations in the exonic, intronic, and 3’ UTR regions of genes that are previously reported to be associated with intelligence or cognitive performance. For example, a previous study highlights the significance of CRTC2 in endothelial function, which is essential for maintaining vascular physiology in the brain (Kanki et al., 2020). While prior genome-wide studies have identified common variants in CRTC2 associated with cognitive traits (Savage et al., 2018), the dBiRS method uncovers 4 rare variants in the exonic region of CRTC2, which may directly alter protein function, 3 rare variants in the intronic region that could influence gene regulation, and 4 rare variants in the 3’ UTR region, potentially affecting mRNA stability and translation efficiency. Another example is BRWD1 that is previously identified as a significant gene associated with intelligence (Savage et al., 2018). In our study, we identified rare variants in its exonic, intronic, and 3’ UTR regions. Research has shown that BRWD1 plays a role in establishing epigenetic states during B cell development, which is essential for proper immune function (Mandal et al., 2018). Further studies are needed to clarify its specific contributions to neural development and cognition.

Refer to caption
Figure 3: Distribution of the functional consequences of SNPs in signal regions identified by dBiRS.

The dBiRS method identified rare variants in 22 novel genes that have not been previously reported in the GWAS Catalog for intelligence. Among these 22 genes, 14 were significant in the gene-based test for general cognitive ability (Davies et al., 2018). Of the remaining 9 genes, SZT2 has been linked to common variants associated with Alzheimer’s disease and specific brain regions (Herold et al., 2016); ITIH4 has been associated with schizophrenia (Ripke et al., 2011); and NDUFA6 has been linked to brain connectivity measurements (Zhang, Pan, Huang, Liang, Li, Zhang & Zhang, 2023). Genes SLC39A8, TOP2B, NKIRAS1, GLTBD1, and TPM3 are new findings that have not been previously reported in relation to cognitive performance measurements. Among these genes, TOP2B plays a critical role in neuronal development by regulating gene expression during brain development (Tiwari et al., 2012). Proper functioning of TOP2B is essential for neuronal connectivity and synaptic plasticity, both of which underlie learning and memory (Madabhushi et al., 2015). Disruptions in TOP2B have been linked to neurodevelopmental disorders, which could implicate it in intelligence (King et al., 2013).

For the study on prospective memory, we identified eight novel genes associated with prospective memory, a key component of cognitive function. Among these, OMA1 stands out due to its established links with multiple neuroimaging measurements, including white matter lesion progression and brain column structure, underscoring its potential role in cognitive processes (Wang et al., 2020). Notably, our analysis uncovered 10 rare variants in the 3’ UTR, 6 rare variants in the exonic region, 7 rare variants in the intronic region, and 1 common variant in the 5’ UTR of OMA1. These findings suggest that OMA1 may influence prospective memory through diverse mechanisms, such as protein function modulation, gene regulation, and mRNA stability. Similarly, we identified variants in the intronic, exonic, and 3’ UTR regions of CNGB3, further emphasizing the importance of regulatory regions in prospective memory. CNGB3 is known to be associated with educational attainment (Okbay et al., 2022), cognitive function (Lee et al., 2018) and anxiety (Meier et al., 2019), which supports its potential role in cognitive traits. Additionally, we discovered HIGD1B, previously reported to be associated with educational attainment (Pasman et al., 2022), and SFMBT2 which has been linked to anxiety and stress-related disorders (Hollins & Cairns, 2016). Other novel findings include the genes NT5C3B, NOSTRIN, and APOD, which represent unexplored candidates for prospective memory and related cognitive traits. These discoveries underscore the value of our approach in uncovering novel genetic mechanisms underlying complex cognitive traits and highlight new directions for future research into the genetic architecture of prospective memory.

5 Discussion

We developed a computationally efficient distributed Binary and Re-Search (dBiRS) algorithm by incorporating the infinity norm of regression-based summary statistics to support the analysis of both binary and continuous outcomes for whole-genome and whole-exome sequencing studies. Compared to scan-based algorithms and the knockoffScreen method, dBiRS demonstrates greater power while maintaining strict control over family-wise error rates (FWER) and false discovery rates (FDR). Furthermore, dBiRS enables parallel computing of block-wise results, which are then aggregated through a central machine to ensure both detection accuracy and computational efficiency. Empirical studies further demonstrate its robustness and adaptability, even under conditions of signal decay.

Further extensions of dBiR could explore the integration of functional annotations. While the current framework effectively detects signal regions without annotations, their incorporation may provide additional biological context and improve detection accuracy for rare or weak signals in less characterized regions of the genome. Our framework is also flexible to incorporate various functional annotations by directly adding weights into SNP-level summary statistics, where weights can be estimated using annotation principal components (Zhou et al., 2023), CADD (Kircher et al., 2014) or other methods. Additionally, further validation of the identified associations through experimental studies is necessary to elucidate their functional relevance and causal mechanisms.

Appendix: Theoretical Guarantees

Detailed theoretical results on FWER control and detection accuracy are presented in the Appendix, while the proofs of the theorems are provided in the Supplementary Material.

5.1 Family-wise error rate control

Recall that Un=𝑮(Yη^0)/nU_{n}=\boldsymbol{G}^{\top}(Y-\hat{\eta}_{0})/\sqrt{n} is the vector of marginal scores and the test statistic is the maxima of marginal scores. Under the null hypothesis, Yη^0Y-\hat{\eta}_{0} is asymptotically distributed as 𝒩(0,P)\mathcal{N}(0,P), where PP is the projection matrix. Let Z𝒩(0,P)Z\sim\mathcal{N}(0,P), under some regularity assumptions, the following theorem claims that the proposed global test controls the size and consequently, the dBiRS algorithm controls the FWER.

Theorem 1

Assume that the following conditions (a)-(b) hold:
(a) For i=1,,ni=1,\dots,n, the function wi(x0,x1)=1/ai(x0)ν(x1)w_{i}(x_{0},x_{1})=1/a_{i}(x_{0})\nu(x_{1}) has finite gradient on a neighborhood of (ϕ,ηi)(\phi,\eta_{i});
(b) log4(p)/n0\log^{4}(p)/n\rightarrow 0 as nn\rightarrow\infty.
Then under the null hypothesis, with probability one, we have

|(Un>c~(α))α|0,\left\lvert\mathbb{P}(\left\lVert U_{n}\right\rVert_{\infty}>\tilde{c}(\alpha))-\alpha\right\rvert\rightarrow 0,

and consequently, the dBiRS procedure controls the FWER.

Condition (a) is used for consistently estimating Σ\Sigma by Σ^\hat{\Sigma} and is common in asymptotic analysis for GLM. Condition (b) indicates that the data dimension pp can grow exponentially in nn. Recall that the sBiRS algorithm begins with the global test, then the sBiRS procedure controls the FWER while the size of the global test is controlled. Furthermore, since we perform a sBiRS procedure for blocks in dBiRS algorithm, the FWER of dBiRS is the same as sBiRS, which indicates that the dBiRS procedure also controls the FWER.

5.2 Detection accuracy analysis

Now we concentrate on the accuracy of the detection algorithm in terms of the proportion of true discoveries and false discoveries. Before the rigorous power analysis of our algorithm, we assume genotype data is bounded and centralized and introduce some notations. Given a truncation parameter ss, we can split the global region into several continuous segments such that the length of each segment is less than 2s2^{s} and at least one segment has a length greater than 2s12^{s-1}. Note that the split is not unique, and let 𝒮={I1,,IS}\mathcal{S}=\left\{I_{1},\dots,I_{S}\right\} be the set containing these segments of any split. The regions I1,,ISI_{1},\dots,I_{S} are neighboring and non-overlapped satisfying j=1KIj={1,,p}\cup_{j=1}^{K}I_{j}=\{1,\ldots,p\}. For a true signal region Ir,r=1,,I_{r}^{*},r=1,\ldots,\ell, we refer to Iir,,Iir+krI_{i_{r}},\dots,I_{i_{r}+k_{r}} as the minimum cover of IrI_{r}^{*} in 𝒮\mathcal{S} if Ij=0kIi+jI^{*}\subset\cup_{j=0}^{k}I_{i+j} and |j=0k1Ii+j|<|I||j=0kIi+j|\left\lvert\cup_{j=0}^{k-1}I_{i+j}\right\rvert<\left\lvert I^{*}\right\rvert\leq\left\lvert\cup_{j=0}^{k}I_{i+j}\right\rvert. Finally, without loss of generality, we suppose that I1,,II_{1}^{*},\dots,I_{\ell}^{*} with lengths p1,,pp_{1},\dots,p_{\ell} have decayed signals, that is, μI1μI2μI\left\lVert\mu_{I_{1}^{*}}\right\rVert_{\infty}\geq\left\lVert\mu_{I_{2}^{*}}\right\rVert_{\infty}\geq\cdots\geq\left\lVert\mu_{I_{\ell}^{*}}\right\rVert_{\infty}, where μ=𝔼(U)/n\mu=\mathbb{E}(U)/\sqrt{n}.

Providing a decay signal strength assumption for the true signal regions, we have the following theorem which gives the approximation results of true discoveries.

Theorem 2

Assume that the condition (a)-(b) in Theorem 1 hold, and further assume that
(c) There exists a sufficiently large constant C1>0C_{1}>0, for u=0,,kru=0,\dots,k_{r}, r=1,,r=1,\dots,\ell,

μIir+uC1[log{(pv=1r1pv)n}/n]1/2.\left\lVert\mu_{I_{i_{r}+u}}\right\rVert_{\infty}\geq C_{1}\left[\log\left\{\left(p-\sum_{v=1}^{r-1}p_{v}\right)n\right\}/n\right]^{1/2}.

Let PP_{\ell} denotes the probability that our dBiRS algorithm detects all \ell signal regions, when nn\rightarrow\infty,

P14log(p+1)/n21.P_{\ell}\geq 1-4\log(p+1)/n^{2}\rightarrow 1.

The result of this theorem states that in both the balanced and decaying signal settings, the dBiRS-detected signal segments can consistently cover the true signal regions. Condition (c) implies that the lower bound of μI122\left\lVert\mu_{I_{1}^{*}}\right\rVert_{2}^{2} for consistent detection is of order log(p)|I1|/2s/n\log(p)\left\lvert I_{1}^{*}\right\rvert/2^{s}/n. The lower bound in Theorem 3 of Q-SCAN (Li, Liu & Lin, 2020) is weaker than this, however, they assume that the maximum eigenvalue of Σ^I1\hat{\Sigma}_{I_{1}^{*}} has a fixed upper bound, which is overly restricted. Furthermore, to the best of our knowledge, condition (c) is the first to allow signal decay settings in GLM model, while the existing work (Jeng et al., 2010; Li, Liu & Lin, 2020) assumes that the signal strengths of all regions have a lower bound and Zhang, Wang & Yao (2023) adopts this decay setting under the two-sample test framework.

Next, we focus on the false discovery rate of the dBiRS algorithm and illustrate the consistent detection property of our algorithm. To begin with, we introduce the Jaccard index to quantify the similarity between two regions. Specifically, we define the Jaccard index between sets I1I_{1} and I2I_{2} as

J(I1,I2)=|I1I2|/|I1I2|.J(I_{1},I_{2})=\left\lvert I_{1}\cap I_{2}\right\rvert/\left\lvert I_{1}\cup I_{2}\right\rvert.

Recall that ={I1,,I}\mathcal{I}^{*}=\left\{I_{1}^{*},\dots,I_{\ell}^{*}\right\} are the set of true signal regions and ^={I^1,,I^}\hat{\mathcal{I}}=\left\{\hat{I}_{1},\dots,\hat{I}_{\ell^{\prime}}\right\} are the estimated signal regions. For a signal region IiI^{*}_{i}, we define that it is consistently detected if for some η(p)=o(1)\eta(p)=o(1), there exists I^ji^\hat{I}_{j_{i}}\in\hat{\mathcal{I}} such that

{J(I^ji,Ii)1η(p)}1,\mathbb{P}\left\{J(\hat{I}_{j_{i}},I^{*}_{i})\geq 1-\eta(p)\right\}\rightarrow 1,

as pp\rightarrow\infty. Note that, even if every signal region is consistently detected, there still could be some additional regions that are incorrectly detected. Let I~=I^I\tilde{I}=\hat{I}-I^{*} as the set of wrongly detected regions, and we want such regions to be ignorable relative to the true signal regions, i.e., |I~|/|I|0\left\lvert\tilde{I}\right\rvert/\left\lvert I^{*}\right\rvert\rightarrow 0. Then we say that a detection procedure consistently detected all the true signal regions, if for a sequence of ηj(p)=o(1),j=1,,\eta_{j}(p)=o(1),j=1,\dots,\ell and some η(p)=o(1)\eta(p)=o(1), there exists I^j1,,I^j^\hat{I}_{j_{1}},\dots,\hat{I}_{j_{\ell}}\in\hat{\mathcal{I}} such that

[{|I~|/|I|η(p)}A]1\mathbb{P}\left[\left\{\left\lvert\tilde{I}\right\rvert/\left\lvert I^{*}\right\rvert\leq\eta(p)\right\}\cap A\right]\rightarrow 1

, where A=i=1{J(I^ji,Ii)1ηj(p)}A=\cap_{i=1}^{\ell}\left\{J(\hat{I}_{j_{i}},I_{i}^{*})\geq 1-\eta_{j}(p)\right\}.

Under some mild conditions concentrating on the structure of the covariance matrix Σ\Sigma and the distribution of true signal regions, we derive the detection consistency of dBiRS algorithm in the following theorem.

Theorem 3

Assume that conditions (a)-(c) hold. We further assume that
(d) There exists certain truncation parameter ss that results in the set of regions 𝒮={I1,,IS}\mathcal{S}=\left\{I_{1},\dots,I_{S}\right\} such that for some α~>2α\tilde{\alpha}>2\alpha, 𝒪={1,2,,S}\mathcal{O}=\left\{1,2,\dots,S\right\},

D(s,R)=supi1,,iR𝒪(j=1RAij)j=1R(Aij)1(2α~)R,D(s,R)=\sup_{i_{1},\dots,i_{R}\in\mathcal{O}}\frac{\mathbb{P}(\cap_{j=1}^{R}A_{i_{j}})}{\prod_{j=1}^{R}\mathbb{P}(A_{i_{j}})}\leq\frac{1}{(2\tilde{\alpha})^{R}},

where the events Ai={Un(Ii)cIi(α)}A_{i}=\left\{\left\lVert U_{n}(I_{i})\right\rVert_{\infty}\geq c_{I_{i}}(\alpha)\right\} and cIi(α)c_{I_{i}}(\alpha) is the 1α1-\alpha quantile of Un(Ii)\left\lVert U_{n}(I_{i})\right\rVert_{\infty};
(e) The true signal regions are well-separated in the sense that GapminLmaxGap_{\min}\geq L_{\max}, where LmaxL_{\max} is the maximum length of all true continuous signal regions and GminG_{\min} is the minimum length of the gaps between any two true continuous signal regions;
(f) There exists a truncation parameter ss satisfies (d) and s=o{log2(Lmin/log)}s=o\left\{\log_{2}(L_{\min}/\log\ell)\right\};
(g) There exists r=o(Lmin)r=o(L_{\min}) such that for any index il=1Ili\in\cup_{l=1}^{\ell}I^{*}_{l}, the non-signal point jj satisfies that if |ji|>r\left\lvert j-i\right\rvert>r, then

|𝑮j𝑮β|/n=o(1).\left\lvert\boldsymbol{G}_{\cdot j}^{\top}\boldsymbol{G}\beta\right\rvert/\sqrt{n}=o(1).

Denote hi=|Ii|/2sh_{i}=\lceil\left\lvert I_{i}^{*}\right\rvert/2^{s}\rceil as the cardinality of the minimum cover (from 𝒮\mathcal{S}) of IiI^{*}_{i}, i=1,,i=1,\dots,\ell and r~=r/2s\tilde{r}=\lceil r/2^{s}\rceil. Given a significance level α\alpha, when n,pn,p\rightarrow\infty, for any sequence of integers Ri=o(hi)R_{i}=o(h_{i}) and Ri/logR_{i}/\log\ell\rightarrow\infty, i=1,,i=1,\dots,\ell, there exists I^j1,,I^j^\hat{I}_{j_{1}},\dots,\hat{I}_{j_{\ell}}\in\hat{\mathcal{I}} such that,

[i=1{J(Ii,I^ji)1ηi}]1δ,\mathbb{P}\left[\bigcap\limits_{i=1}^{\ell}\left\{J\left(I_{i}^{*},\hat{I}_{j_{i}}\right)\geq 1-\eta_{i}\right\}\right]\geq 1-\delta,

where ηi=(Ri+2r~)/(Ri+hi+1)0\eta_{i}=(R_{i}+2\tilde{r})/(R_{i}+h_{i}+1)\rightarrow 0 and δ=4log(p+1)/n2+i=1(α/2α~)Ri0\delta=4\log(p+1)/n^{2}+\sum_{i=1}^{\ell}\left(\alpha/2\tilde{\alpha}\right)^{R_{i}}\rightarrow 0.

Let A=i=1{J(I^ji,Ii)1ηj}A=\cap_{i=1}^{\ell}\left\{J(\hat{I}_{j_{i}},I_{i}^{*})\geq 1-\eta_{j}\right\}, we have that for some integer R0=O(K1)R_{0}=O(K_{1}) and R0=o(i=1hi)R_{0}=o(\sum_{i=1}^{\ell}h_{i}),

[{|I~|/|I|η0}A](1δ)[1C2{2R0α(R0K1)α~}R0K1]1,\mathbb{P}\left[\left\{\left\lvert\tilde{I}\right\rvert/\left\lvert I^{*}\right\rvert\leq\eta_{0}\right\}\cap A\right]\geq\left(1-\delta\right)\left[1-C_{2}\left\{\frac{2R_{0}\alpha}{(R_{0}-K_{1})\tilde{\alpha}}\right\}^{R_{0}-K_{1}}\right]\rightarrow 1, (4)

where η0=R0/i=1hi0\eta_{0}=R_{0}/\sum_{i=1}^{\ell}h_{i}\rightarrow 0, K1KK_{1}\leq K is the number of blocks which have signals and C2C_{2} is a constant.

To appreciate this result, we see that 14log(p+1)/n21-4\log(p+1)/n^{2} quantifies the probability that there exist estimated signal regions simultaneously covering the true ones as Theorem 2 asserts, while (α/2α~)Ri(\alpha/2\tilde{\alpha})^{R_{i}} indicates how many non-signal segments in 𝒮\mathcal{S} may be falsely included neighboring to IiI_{i}^{*} resulting from the re-search procedure. The term 1C2{2R0α(R0K1)α~}R0K11-C_{2}\left\{\frac{2R_{0}\alpha}{(R_{0}-K_{1})\tilde{\alpha}}\right\}^{R_{0}-K_{1}} in inequality (4) is the lower bound of the probability that R0R_{0} additional non-signal segments are falsely detected and it relates to the number of blocks KK.

Theorem 3 demonstrates that the consistent detection property can be achieved by the dBiRS algorithm under mild conditions. Condition (d) relaxes the common M-dependence assumption in Li, Liu & Lin (2020) and only requires a “weak dependence” assumption under which the Jaccard index consistency of the dBiRS algorithm is still guaranteed. It is straightforward to verify that the M-dependence satisfies condition (d) if we choose the truncation parameter slog2Ms\geq\log_{2}M. Moreover, condition (d) includes a larger class of covariance matrices. For instance, the covariance Σ={(1+|jk|)ρ}j,k=1p\Sigma=\left\{\left(1+\left\lvert j-k\right\rvert\right)^{-\rho}\right\}_{j,k=1}^{p}, ρ>1/2\rho>1/2 and Σ={θ|jk|}j,k=1p,θ<1\Sigma=\left\{\theta^{\left\lvert j-k\right\rvert}\right\}_{j,k=1}^{p},\theta<1 satisfy condition (d) but is not M-dependent. Condition (e) imposes the requirement on the distances among the signal regions. The constraint 2s=o(Lmin/log)2^{s}=o\left(L_{\min}/\log\ell\right) in condition (f) guarantees that the union set of the minimum cover of a true signal region is as short as possible. Conditions (e) and (f) are weaker than those in scan-based methods. Specifically, the scan-based method Li, Liu & Lin (2020) assumes that Lmin/logpL_{\min}/\log p\rightarrow\infty, while the dBiRS algorithm has less restriction on the minimum length of signal regions. For instance, consider Lmin=logpL_{\min}=\log p and =logp\ell=\log p, then we can select 2s=logp2^{s}=\sqrt{\log p} for consistent detection. This suggests that the dBiRS algorithm can detect shorter (and unbalanced) signal regions than the scan-based method that requires Lmin/logpL_{\min}/\log p\rightarrow\infty and hence can be more accurate. Moreover, the dBiRS algorithm does not have restrictions on the maximum length of signal regions that the scan method demands and can deal with signal regions with a length of (polynomial) order pp. For illustration, let the two signal regions are I1={1,,p/4}I^{*}_{1}=\left\{1,\dots,\lfloor p/4\rfloor\right\} and I2={3/4p,,p}I^{*}_{2}=\left\{\lfloor 3/4p\rfloor,\dots,p\right\}. According to Theorem 3, dBiRS algorithm can consistently detect the signal regions, whereas the consistency of Q-SCAN (Li, Liu & Lin, 2020) requires logLmax/logp0\log L_{\max}/\log p\rightarrow 0. Condition (g) requires the correlations between signal regions and non-signal points to decay with the distance, which is a standard assumption in this field.


SUPPLEMENTARY MATERIAL

dBiRS_supp

Additional technical assumptions, lemmas and proofs of the theoretical results for “Computationally Efficient Whole-Genome Signal Region Detection for Quantitative and Binary Traits”. (.pdf file)

References

  • (1)
  • Bakshi et al. (2016) Bakshi, A., Zhu, Z., Vinkhuyzen, A. A. et al. (2016), ‘Fast set-based association analysis using summary data from gwas identifies novel gene loci for human complex traits’, Scientific reports 6(1), 1–9.
  • Cai & Wei (2022) Cai, T. T. & Wei, H. (2022), ‘Distributed adaptive Gaussian mean estimation with unknown variance: Interactive protocol helps adaptation’, The Annals of Statistics 50(4), 1992 – 2020.
    https://doi.org/10.1214/21-AOS2167
  • Davies et al. (2018) Davies, G., Lam, M., Harris, S. E., Trampush, J. W., Luciano, M., Hill, W. D., Hagenaars, S. P., Ritchie, S. J., Marioni, R. E., Fawns-Ritchie, C. et al. (2018), ‘Study of 300,486 individuals identifies 148 independent genetic loci influencing general cognitive function’, Nature communications 9(1), 2098.
  • He et al. (2021b) He, Z., Le Guen, Y., Liu, L. et al. (2021b), ‘Genome-wide analysis of common and rare variants via multiple knockoffs at biobank scale, with an application to alzheimer disease genetics’, The American Journal of Human Genetics 108(12), 2336–2353.
  • He et al. (2021a) He, Z., Liu, L., Wang, C. et al. (2021a), ‘Identification of putative causal loci in whole-genome sequencing data via knockoff statistics’, Nature communications 12(1), 3152.
  • Herold et al. (2016) Herold, C., Hooli, B. V., Mullin, K., Liu, T., Roehr, J. T., Mattheisen, M., Parrado, A. R., Bertram, L., Lange, C. & Tanzi, R. E. (2016), ‘Family-based association analyses of imputed genotypes reveal genome-wide significant association of alzheimer’s disease with osbpl6, ptprg, and pdcl3’, Molecular psychiatry 21(11), 1608–1612.
  • Hollins & Cairns (2016) Hollins, S. L. & Cairns, M. J. (2016), ‘Microrna: Small rna mediators of the brains genomic response to environmental stress’, Progress in neurobiology 143, 61–81.
  • Jeng et al. (2010) Jeng, X. J., Cai, T. T. & Li, H. (2010), ‘Optimal sparse segment identification with application in copy number variation analysis’, Journal of the American Statistical Association 105(491), 1156–1166. PMID: 23543902.
  • Kanki et al. (2020) Kanki, H., Sasaki, T., Matsumura, S., Kawano, T., Todo, K., Okazaki, S., Nishiyama, K., Takemori, H. & Mochizuki, H. (2020), ‘Creb coactivator crtc2 plays a crucial role in endothelial function’, Journal of Neuroscience 40(49), 9533–9546.
  • King et al. (2013) King, I. F., Yandava, C. N., Mabb, A. M., Hsiao, J. S., Huang, H.-S., Pearson, B. L., Calabrese, J. M., Starmer, J., Parker, J. S., Magnuson, T. et al. (2013), ‘Topoisomerases facilitate transcription of long genes linked to autism’, Nature 501(7465), 58–62.
  • Kircher et al. (2014) Kircher, M., Witten, D. M., Jain, P., O’roak, B. J., Cooper, G. M. & Shendure, J. (2014), ‘A general framework for estimating the relative pathogenicity of human genetic variants’, Nature genetics 46(3), 310–315.
  • Lee et al. (2018) Lee, J. J., Wedow, R., Okbay, A., Kong, E., Maghzian, O., Zacher, M., Nguyen-Viet, T. A., Bowers, P., Sidorenko, J., Karlsson Linnér, R. et al. (2018), ‘Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals’, Nature genetics 50(8), 1112–1121.
  • Li, Li, Zhou et al. (2020) Li, X., Li, Z., Zhou, H. et al. (2020), ‘Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale’, Nature genetics 52, 969–983.
  • Li, Li, Zhou et al. (2022) Li, Z., Li, X., Zhou, H. et al. (2022), ‘A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies’, Nature Methods 19, 1599–1611.
  • Li, Liu & Lin (2020) Li, Z., Liu, Y. & Lin, X. (2020), ‘Simultaneous detection of signal regions using quadratic scan statistics with applications to whole genome association studies’, Journal of the American Statistical Association 0(0), 1–12.
  • Li, Liu & Lin (2022) Li, Z., Liu, Y. & Lin, X. (2022), ‘Simultaneous detection of signal regions using quadratic scan statistics with applications to whole genome association studies’, Journal of the American Statistical Association 117(538), 823–834.
  • Liu et al. (2010) Liu, J. Z., Mcrae, A. F., Nyholt, D. R. et al. (2010), ‘A versatile gene-based test for genome-wide association studies’, The American Journal of Human Genetics 87(1), 139–145.
  • Madabhushi et al. (2015) Madabhushi, R., Gao, F., Pfenning, A. R., Pan, L., Yamakawa, S., Seo, J., Rueda, R., Phan, T. X., Yamakawa, H., Pao, P.-C. et al. (2015), ‘Activity-induced dna breaks govern the expression of neuronal early-response genes’, Cell 161(7), 1592–1605.
  • Madsen & Browning (2009) Madsen, B. E. & Browning, S. R. (2009), ‘A groupwise association test for rare mutations using a weighted sum statistic’, PLOS Genetics 5(2), 1–11.
  • Mandal et al. (2018) Mandal, M., Maienschein-Cline, M., Maffucci, P., Veselits, M., Kennedy, D. E., McLean, K. C., Okoreeh, M. K., Karki, S., Cunningham-Rundles, C. & Clark, M. R. (2018), ‘Brwd1 orchestrates epigenetic landscape of late b lymphopoiesis’, Nature communications 9(1), 3888.
  • Manolio et al. (2009) Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R., Chakravarti, A. et al. (2009), ‘Finding the missing heritability of complex diseases’, Nature 461(7265), 747–753.
  • Meier et al. (2019) Meier, S. M., Trontti, K., Purves, K. L., Als, T. D., Grove, J., Laine, M., Pedersen, M. G., Bybjerg-Grauholm, J., Bækved-Hansen, M., Sokolowska, E. et al. (2019), ‘Genetic variants associated with anxiety and stress-related disorders: a genome-wide association study and mouse-model study’, JAMA psychiatry 76(9), 924–932.
  • Morgenthaler & Thilly (2007) Morgenthaler, S. & Thilly, W. G. (2007), ‘A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (cast)’, Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis 615(1), 28–56.
  • Naus (1982) Naus, J. I. (1982), ‘Approximations for distributions of scan statistics’, Journal of the American Statistical Association 77(377), 177–183.
  • Okbay et al. (2022) Okbay, A., Wu, Y., Wang, N., Jayashankar, H., Bennett, M., Nehzati, S. M., Sidorenko, J., Kweon, H., Goldman, G., Gjorgjieva, T. et al. (2022), ‘Polygenic prediction of educational attainment within and between families from genome-wide association analyses in 3 million individuals’, Nature genetics 54(4), 437–449.
  • Olshen et al. (2004) Olshen, A. B., Venkatraman, E. S., Lucito, R. & Wigler, M. (2004), ‘Circular binary segmentation for the analysis of array‐based DNA copy number data’, Biostatistics 5(4), 557–572.
  • Pasman et al. (2022) Pasman, J. A., Demange, P. A., Guloksuz, S., Willemsen, A., Abdellaoui, A., Ten Have, M., Hottenga, J.-J., Boomsma, D. I., de Geus, E., Bartels, M. et al. (2022), ‘Genetic risk for smoking: disentangling interplay between genes and socioeconomic status’, Behavior genetics 52(2), 92–107.
  • Ripke et al. (2011) Ripke, S., Sanders, A., Kendler, K., Levinson, D., Sklar, P., Holmans, P., Lin, D., Duan, J., Ophoff, R., Andreassen, O. et al. (2011), ‘Schizophrenia psychiatric genome-wide association study (gwas) consortium genome-wide association study identifies five new schizophrenia loci’, Nat. Genet 43, 969–976.
  • Sabeti & Schaffner (2014) Sabeti, I. S. P. C. & Schaffner, S. F. (2014), ‘Cosi2: an efficient simulator of exact and approximate coalescent with selection’, Bioinformatics 30(23), 3427–3429.
  • Savage et al. (2018) Savage, J. E., Jansen, P. R., Stringer, S., Watanabe, K., Bryois, J., De Leeuw, C. A., Nagel, M., Awasthi, S., Barr, P. B., Coleman, J. R. et al. (2018), ‘Genome-wide association meta-analysis in 269,867 individuals identifies new genetic and functional links to intelligence’, Nature genetics 50(7), 912–919.
  • Tiwari et al. (2012) Tiwari, V. K., Burger, L., Nikoletopoulou, V., Deogracias, R., Thakurela, S., Wirbelauer, C., Kaut, J., Terranova, R., Hoerner, L., Mielke, C. et al. (2012), ‘Target genes of topoisomerase iiβ\beta regulate neuronal survival and are defined by their chromatin state’, Proceedings of the National Academy of Sciences 109(16), E934–E943.
  • Visscher et al. (2017) Visscher, P. M., Wray, N. R., Zhang, Q. et al. (2017), ‘10 years of gwas discovery: Biology, function, and translation’, The American Journal of Human Genetics 101(1), 5–22.
  • Wang et al. (2020) Wang, H., Yang, J., Schneider, J. A., De Jager, P. L., Bennett, D. A. & Zhang, H.-Y. (2020), ‘Genome-wide interaction analysis of pathological hallmarks in alzheimer’s disease’, Neurobiology of aging 93, 61–68.
  • Wu et al. (2011) Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M. & Lin, X. (2011), ‘Rare-variant association testing for sequencing data with the sequence kernel association test’, The American Journal of Human Genetics 89(1), 82–93.
  • Xue & Yao (2020) Xue, K. & Yao, F. (2020), ‘Distribution and correlation-free two-sample test of high-dimensional means’, The Annals of Statistics 48(3), 1304 – 1328.
  • Zhang, Pan, Huang, Liang, Li, Zhang & Zhang (2023) Zhang, L., Pan, Y., Huang, G., Liang, Z., Li, L., Zhang, M. & Zhang, Z. (2023), ‘A brain-wide genome-wide association study of candidate quantitative trait loci associated with structural and functional phenotypes of pain sensitivity’, Cerebral Cortex 33(11), 7297–7309.
  • Zhang et al. (2010) Zhang, N. R., Siegmund, D. O., Ji, H. & Li, J. Z. (2010), ‘Detecting simultaneous changepoints in multiple sequences’, Biometrika 97(3), 631–645.
  • Zhang, Wang & Yao (2023) Zhang, W., Wang, F. & Yao, F. (2023), ‘Binary and re-search signal region detection in high dimensions’, arXiv preprint arXiv:2305.08172 .
  • Zhou et al. (2023) Zhou, H., Arapoglou, T., Li, X. et al. (2023), ‘Favor: functional annotation of variants online resource and annotator for variation across the human genome’, Nucleic Acids Research 51(D1), D1300–D1311.