This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

PC Adjusted Testing for Low Dimensional Parameters

Sohom Bhattacharya S.  Bhattacharya Department of Operations Research and Financial Engineering
Princeton University
Princeton, NJ, 08540, US.
[email protected]
Rounak Dey R.  Dey Department of Biostatistics,
Harvard T.H. Chan School of Public Health
Boston, MA 02115, US
[email protected]
 and  Rajarshi Mukherjee R.  Mukherjee Department of Biostatistics,
Harvard T.H. Chan School of Public Health
Boston, MA 02115, US
[email protected]
Abstract.

In this paper we consider the effect of high dimensional Principal Component (PC) adjustments while inferring the effects of variables on outcomes. This problem is particularly motivated by applications in genetic association studies where one performs PC adjustment to account for population stratification. We consider simple statistical models to obtain asymptotically precise understanding of when such PC adjustments are supposed to work in terms of providing valid tests with controlled Type I errors. We also verify these results through a class of numerical experiments.

1. Introduction

Testing statistical hypotheses, where one needs to take special care of complications of high dimensional data structures is a staple in modern data science. Indeed, a quintessential example of this type problems arise naturally in the context of genetic association studies. Although such questions are not specific to the domain of genetic associate studies, the seminal paper of price2006principal has put forward a Principal Component (PC) adjustment based procedure to adjust for population stratification type confounding and has been applied across a vast canvas of omics type applications beyond Genome Wide Association Studies (GWAS). As another specific example, barfield2014accounting explores the applicability of the same principle in epigenome wide association studies. A major appeal of this method, that we will try lend a theoretical lens to in this paper, is availability of the EIGENSTRAT package of price2006principal that allows fast implementation of this principle and has therefore been immensely popular as a standard adjustment procedure to account for confounding bias in such studies.

In spite of the immense practical popularity of PC adjustment for high dimensional nuisance parameters, the most basic theoretical questions remain unanswered for this high popular and regularly used principle. For example, the most traditional statistical questions that arise as a part of planning a real data analysis pertains to sample size and power calculations. Interestingly, the answers to even such simple questions remain unexplored. Indeed, the two stage method involving a principal component adjustment before testing for genetic association, implies interesting statistical phenomenon in high dimensional problems where the number of genetic variants under study is large (patterson2006population) . In contrast to such implied caveats and potential pathologies of inconsistencies in high dimensional PCA (johnstone2009consistency), empirical success of the procedure in practical problems in turn go far towards making a strong case for the PC based adjustments in these applications (yang2014principal). This paper is aimed at providing first steps towards a theoretical formalism and reconciling these issues and thereby potentially resolve questions regarding when and why such PC-based adjustments are reliable and mathematically valid.

2. Setup

We consider data on (Yi,𝐗i)i=1ni.i.d.(Y_{i},\mathbf{X}_{i})_{i=1}^{n}\stackrel{{\scriptstyle\rm i.i.d.}}{{\sim}}\mathbb{P} where where YiY_{i}\in\mathbb{R} denotes an outcome of interest and 𝐗ip\mathbf{X}_{i}\in\mathbb{R}^{p} collects predictor variables under study. In many case, it is of interest to test the effect of specific variables/components in 𝐗\mathbf{X} on YY while conditioning on other variables in 𝐗\mathbf{X}. More precisely, we will partition 𝐗\mathbf{X} as 𝐗=[A,𝐖]T\mathbf{X}=[A,\mathbf{W}]^{T} where AA\in\mathbb{R} is the variable whose effect on YY is of interest and 𝐖\mathbf{W} collects confounders one wishes to adjust while decoding this relationship. Our main question of interest concerns a specific way of inferring the effect of AA on YY – namely principal component adjustment for the confounding variables in 𝐗\mathbf{X}. We will study the precise asymptotic properties of this specific method through the lens of a hypothesis testing problem for the effect of AA on YY. To obtain sharp asymptotics and precise answers, for our theoretical developments we will focus on the Gaussian linear regression set-up where {equs}Y_i=A_iδ+W_i^Tβ+ϵ_i,  ϵ_i∼N(0,σ^2_y), and thereby consider the hypothesis testing problem for δ\delta defined as {equs}H_0: δ=0  vs  H_1(h):δ=hn. The choice of the alternative converging to 0 at 1/n1/\sqrt{n} rate is driven by the fact that the detection threshold for testing δ\delta scales accordingly in the asymptotic regimes we will work with in this paper. Subsequently, with an abuse of notation that suppresses dependence on other parameters, we will denote the distribution \mathbb{P} of Y|A,𝐖Y|A,\mathbf{W} corresponding to model (2) as δ\mathbb{P}_{\delta} to underscore the main object of the hypothesis test i.e. δ\delta. Finally, throughout we assume that 𝜷\|\boldsymbol{\beta}\| is bounded away from 0 to reflect a scenario where the effect of AA on YY experiences non-trivial confounding through 𝐖\mathbf{W}.

The standard test for (2) is the Generalized Likelihood Ratio Test (GLRT) – which for known σy2\sigma_{y}^{2}111We will throughout assume this and show through numerical experiments that the results for unknown σy2\sigma^{2}_{y} should be qualitatively similar. reduces to the chi-square test with 11-degree of freedom. Although the same test has non-trivial local power for testing (2) even for growing dimension pp: (1) it is often believed that suitable dimension reduction techniques on the regression of 𝐗\mathbf{X} on YY by relying on the intuition that dimension reduction techniques capture underlying structures of the problem (e.g. sparsity, low rank etc) and thereby yields more power for the testing problem compared to vanilla GLRT; and (2) performing GLRT at the scale of modern high dimensional problems such as GWAS is computationally expensive (O(min{p3,np2,n2p}){O}(\min\{p^{3},np^{2},n^{2}p\})) and performing dimension reduction on 𝐗\mathbf{X} substantially reduces this burden. To account for these concerns, one method that has gained popularity is that of performing Principal Component Analysis (PCA) based dimension reduction on 𝐗\mathbf{X} which amounts to including “a few top principal components” in the regression (and can be computed through Fast-PCA type algoritms (galinsky2016fast)) of YY on 𝐗\mathbf{X} (patterson2006population; price2006principal; barfield2014accounting). In this regard, for testing effect of AA on YY, it is natural to consider PC adjustment on 𝐖\mathbf{W} only. However, one attractive piece of the EIGENSTRAT method introduced in price2006principal is the fact that it allows the investigator to only run PCA on the whole of 𝐗\mathbf{X} data matrix only once and including them in the regression analysis of YY on any XjX_{j} (and not a specific component AA as considered here) which is a component of 𝐗\mathbf{X}. More specifically, we shall study the statistical properties (Type I and Type II error of testing for δ\delta in the model (2)) of testing for a specific component AA of 𝐗\mathbf{X}, our set up is in parallel to traditional GWAS analysis where is a standard practice to run PCA on the whole data matrix of 𝐗\mathbf{X} (instead on the matrix obtained by removing AA-observations from it) and include them in the analyses – instead of running a new PCA every time for testing for a new variable in 𝐗\mathbf{X}. Here we will explore both the cases when the PC is run on the whole of 𝐗\mathbf{X} (a case more relevant for GWAS type analyses (price2006principal)) and for the case when PC adjustments are performed on 𝐖\mathbf{W} (a case relevant in certain epigenetic studies where AA contains DNA methylation levels of a particular cytosine-phosphate-guanine (CpG) site of interest and one performs PCA on genetic variants 𝐖\mathbf{W} to account for population stratification barfield2014accounting. )

To initiate a theoretical study of this method, we begin by with setting up the necessary mathematical notation to describe the procedure. As discussed above, one standard approach in practice is to perform Principal Component (PC) analysis on the data matrix of 𝐗\mathbf{X} and only use the top principal component directions while adjusting for confounders in the regression of YY on AA. Mathematically, such a formalism can be written as follows through the data vectors 𝐘=(Y1,,Yn)T\mathbf{Y}=(Y_{1},\ldots,Y_{n})^{T}, 𝐀=(A1,,An)T\mathbf{A}=(A_{1},\ldots,A_{n})^{T} and the data matrix of covariates 𝕏=[𝐗1::𝐗n]T\mathbb{X}=[\mathbf{X}_{1}:\cdots:\mathbf{X}_{n}]^{T}:

  1. (i)

    Given kk\in\mathbb{N}, let 𝐕k,in~\widetilde{{\mathbf{V}_{k,\mathrm{in}}}} denote the matrix composed of the first kk right singular vectors of 𝕏\mathbb{X}.

  2. (ii)

    Perform a linear regression of 𝐘\mathbf{Y} on (𝐀,𝕏𝐕k,in~)(\mathbf{A},\mathbb{X}\widetilde{{\mathbf{V}_{k,\mathrm{in}}}}) to get a score or level of significance for δ\delta.

Here the subscript in\mathrm{in} refers to the fact that a AA is a part of the variables in 𝐗\mathbf{X} on which the PC was performed. Mathematically, for any cut-off tt we consider the test that rejects for value of the likelihood ratio larger than tt as follows {equs}φ_k,in(t)=1(LR_k,in¿t), where {equs}LR_k,in=Y^TP_C(A_k, in)Y, A_k, in=(I-P_C(X~V_k,in))A Here for any matrix MM we use 𝒞(M)\mathcal{C}(M) to denote its column space and P𝒞(M)P_{\mathcal{C}(M)} to denote the projection matrix onto 𝒞(M)\mathcal{C}(M). An analogous can be constructed where the PC adjustment was performed only on 𝕎:=[𝐖1::𝐖n]T\mathbb{W}:=[\mathbf{W}_{1}:\cdots:\mathbf{W}_{n}]^{T} – which we shall refer to as {equs}φ_k,out(t):=1(LR_k,out¿t) with {equs}LR_k,out=Y^TP_C(A_k, out)Y, A_k,out=(I-P_C(W~V_k,out))A and 𝐕k,out~\widetilde{\mathbf{V}_{k,\mathrm{out}}} now collects the top right singular vectors of 𝕎\mathbb{W}.

This approach, similar in essence to Principal Component Regression (to be denoted by PCR hereafter) raises a few immediate questions:

  1. (i)

    What is the behavior of φk,ind(t)\varphi_{k,\mathrm{\mathrm{ind}}}(t), ind{in,out}\rm ind\in\{\mathrm{in},\mathrm{out}\} under H0H_{0} and does there exist a distributional cut-off tαt_{\alpha} such that the Type I error of φk,ind(tα)\varphi_{k,\mathrm{\rm ind}}(t_{\alpha}) converges to α\alpha?

  2. (ii)

    How to characterize the power function of φk,ind(t)\varphi_{k,\mathrm{ind}}(t), ind{in,out}\rm ind\in\{\mathrm{in},\mathrm{out}\} for any given (k,t)(k,t)?

To understand these questions, we operate in a regime which allows pp to be larger than sample size but to scale proportionally with nn. In this asymptotic regime, the main results and contributions of this paper in view of the above questions are as follows:

  1. (i)

    Under a generalized spiked model for 𝐖\mathbf{W} (see Definition 2.2) we provide exact asymptotic results for testing (2) after PC adjustments. The precise description of the testing errors can be elaborated as follows.

    1. (a)

      Our results show an interesting distinction between a fixed effects and random effects on 𝜷\boldsymbol{\beta}. In particular, we show that when 𝜷\boldsymbol{\beta} in (2) is considered fixed and unknown the Type I error of the test for (2) after PC adjustments converges to 11 exponentially in sample size nn – even after including “enough” principal component direction in the regression. In contrast, when 𝜷\boldsymbol{\beta} has a mean zero random effect distribution the Type I error of the test is bounded away from both 11 and the nominal desired level. (See Theorem 3.1i.-ii.).

    2. (b)

      For random effects on 𝜷\boldsymbol{\beta}, we derive the precise power function of the test for (2) after PC adjustments and show that without further assumptions it might not be possible to estimate the power function exactly to either correct for the Type I error inflation or perform sample size/power calculations (See Theorem 3.1ii.).

  2. (ii)

    We show that, when 𝐖\mathbf{W} is derived from a mixture of mean-shift type distributions, a similar phenomenon persists in terms of the behavior of the test for (2) after PC adjustments. We further verify similar behavior through extensive simulations even when the distribution of 𝐗\mathbf{X} arises from mixture of discrete distributions. These simulations provide interesting insights into the subtleties and success of the highly popular EIGENSTRAT method of price2006principal (See Theorem 3.3).

  3. (iii)

    Under a classical spiked model (johnstone2001distribution) on 𝐖\mathbf{W} with diverging leading spike strengths, we derive a precise phase transition on the minimum spike strength and the angle between the 𝜷\boldsymbol{\beta} and leading population spikes required for the pathologies of the procedure φk,ind\varphi_{k,\mathrm{ind}} noted in points (i) and (ii) above to disappear. (See Theorem 3.4-3.7).

The rest of the paper, organized as follows, makes the above claims precise, rigorous, and elaborate. We first introduce the notation, definitions, and assumptions to set up the main building blocks of our results in Section 2.1. We divide the main theoretical results in Section 3 into three parts as follows. In Section 3.1 we present asymptotic results on the power function of φk,ind,ind{in,out}\varphi_{k,\rm ind},\rm ind\in\{\mathrm{in},\mathrm{out}\} for fixed spike/signal strength. Subsequently, we present the implications of these results in the context of GWAS type problems in Section 3.2.Thereafter, Section 3.3 is devoted to understanding the effect of diverging signal strength and thereby providing benchmarks to decide when and how such PC adjustment based testing procedures are valid in practice. Finally, in Section 4 we present detailed numerical experiments to not only verify our theoretical findings in simulation examples but also explore the continued validity of our analytical results beyond the specific conditions assumed for the theoretical derivations.

2.1. Technical Preparations

In this subsection we present definitions, assumptions and related discussions, and notation that will be used throughout the rest of the paper.

2.1.1. Definitions

To present our main results of this section we first need a few definitions to set up the notion of generalized spiked models. We start with the definition of empirical spectral measure of a Hermitian matrix.

Definition 2.1.

For a pp-dimensional Hermitian matrix Σp\Sigma_{p} with eigenvalues λ1,p,λ2,p,,λp,p\lambda_{1,p},\lambda_{2,p},\ldots,\lambda_{p,p}, the Empirical Spectral Distribution (ESD) HpH_{p} is defined as the distribution function corresponding to the probability measure

dHp(x)=1pi=1pδλi,p(x),\displaystyle dH_{p}(x)=\frac{1}{p}\sum_{i=1}^{p}{\delta_{\lambda_{i,p}}(x)},

where δλi,p(x)=1\delta_{\lambda_{i,p}}(x)=1 when x=λi,px=\lambda_{i,p}, otherwise 0. For a sequence {Σp}p1\{\Sigma_{p}\}_{p\geq 1} of deterministic Hermitian matrices, if the corresponding sequence {Hp}p1\{H_{p}\}_{p\geq 1} of ESDs converges weakly to a non-random probability distribution HH as pp\rightarrow\infty, then HH is defined as the Limiting Spectral Distribution (LSD) of the sequence {Σp}p1\{\Sigma_{p}\}_{p\geq 1}.

With this definition in hand we are ready to present the definition of generalized spiked models as proposed by bai2012gsp.

Definition 2.2.

A sequence {Σp}p1\{\Sigma_{p}\}_{p\geq 1} of deterministic Hermitian matrices is called a sequence of generalized spiked population matrices if the following hold.

  1. (i)

    Σp\Sigma_{p} can be written as,

    Σp=[Σm(A)00Σpm(B)],\Sigma_{p}=\begin{bmatrix}\Sigma^{(A)}_{m}&0\\ 0&\Sigma^{(B)}_{p-m}\end{bmatrix},

    where Σm(A)\Sigma^{(A)}_{m} and Σpm(B)\Sigma^{(B)}_{p-m} are non-negative and non-random Hermitian matrices of dimensions m×mm\times m (mm is finite and fixed) and (pm)×(pm)(p-m)\times(p-m), respectively.

  2. (ii)

    The sequence {Hp}p1\{H_{p}\}_{p\geq 1} of ESDs corresponding to {Σp}p1\{\Sigma_{p}\}_{p\geq 1} converges weakly to a non-random probability distribution HH.

  3. (iii)

    Let ΓH\Gamma_{H} be the support of HH, and let the sets of eigenvalues of the submatrices Σm(A)\Sigma^{(A)}_{m} and Σpm(B)\Sigma^{(B)}_{p-m} be {α1,αm}\{\alpha_{1},\geq\ldots\geq\alpha_{m}\} and {β1,p,βpm,p}\{\beta_{1,p},\geq\ldots\geq\beta_{p-m,p}\}, respectively. Then, αiΓH\alpha_{i}\notin\Gamma_{H} for i=1,,mi=1,\ldots,m, and max1ipmd(βi,p,ΓH)0\max_{1\leq i\leq p-m}{d\left(\beta_{i,p},\Gamma_{H}\right)}\rightarrow 0 as pp\rightarrow\infty, where d(x,SA)=infySA|xy|d(x,S_{A})=\inf_{y\in S_{A}}|x-y| is a distance metric from a point xx to a set SAS_{A}.

In this case, the eigenvalues of Σm(A)\Sigma^{(A)}_{m} are called the generalized spikes, and the eigenvalues of Σpm(B)\Sigma^{(B)}_{p-m} are called the non-spikes. The distribution HH is same as the LSD of the sequence {Σpm(B)}\{\Sigma^{(B)}_{p-m}\}.

We note that the traditional spiked model (johnstone2001distribution) is a special case of the generalized spiked model where HH is the degenerate distribution at unity. Next, we will introduce the phase transition boundaries for the generalized spikes analogous to the phase transition boundary (baik2005phase) established in the spiked model.

Definition 2.3.

For αΓH,α0\alpha\notin\Gamma_{H},\alpha\neq 0, we define the following function:

ψ(α)=α+γαλdH(λ)αλ.\psi(\alpha)=\alpha+\gamma\alpha\int{\frac{\lambda dH(\lambda)}{\alpha-\lambda}}.

Then, a generalized spike αi\alpha_{i} is called a “distant spike” if ψ(αi)>0\psi^{\prime}(\alpha_{i})>0, and “close spike” if ψ(αi)0\psi^{\prime}(\alpha_{i})\leq 0, where ψ\psi^{\prime} is the first derivative of ψ\psi.

Depending on whether a generalized spike is a distant or a close spike, the asymptotic convergence of the corresponding sample eigenvalues differ according to Theorem 4.1 and Theorem 4.2 of (bai2012gsp). Therefore, the sample eigenvalues corresponding to the generalized spikes go through a phase transition at the boundaries where ψ(α)=0\psi^{\prime}(\alpha)=0. Unlike in the spiked model where the phase transition happens only at the two boundaries (1±γ)(1\pm\sqrt{\gamma}), the generalized spiked model can have more than two phase transition boundaries. The next assumption is intended to simplify our mathematical derivations by only allowing one phase transition boundary α>supΓH\alpha^{*}>\sup{\Gamma_{H}}, and by assuming that all the generalized spikes have multiplicity one and are above that boundary (i.e., all distant spikes). The extension of our derivations to allow close spikes, and generalized spikes around multiple phase transition boundaries are similar but tedious, and hence omitted in this paper. Finally we define covariate data distributions we will working with throughout.

Definition 2.4.

We say that a data matrix 𝕏~n×d\widetilde{\mathbb{X}}_{n\times d} follows a generalized spiked distribution with kk^{*} spike eigenvalue-eigenvector pairs {(λj,vj)}j=1k\left\{(\lambda_{j},v_{j})\right\}_{j=1}^{k^{*}} if 𝕏~=ZΛd1/2Vd\widetilde{\mathbb{X}}=Z\Lambda_{d}^{1/2}V_{d}, where ZZ is an n×dn\times d random matrix with i.i.d. sub-gaussian elements such that 𝔼(Zij)=0,𝔼(|Zij|2)=1\mathbb{E}(Z_{ij})=0,\mathbb{E}(|Z_{ij}|^{2})=1, and Λd1/2\Lambda_{d}^{1/2} is the poitive definite Hermitian square root of Λd\Lambda_{d} such that

  1. (a)

    {Σd}d1\{\Sigma_{d}\}_{d\geq 1} is a sequence of real symmetric positive definite matrices with spectral decomposition Σd=VdΛdVd\Sigma_{d}=V_{d}\Lambda_{d}V_{d}^{\top}, and Λd\Lambda_{d} satisfies the generalized spiked assumptions as outlined above (Definition 2.2) with kk^{*} (finite) generalized spikes. The eigenvalues and eigenvectors of Σd\Sigma_{d} are given by λ1,,λd\lambda_{1},\ldots,\lambda_{d} and v1,,vdv_{1},\ldots,v_{d}, respectively.

  2. (b)

    The sequence {Σd}\{\|\Sigma_{d}\|\} of spectral norms is bounded.

  3. (c)

    λ1>λ2>>λk>supΓH\lambda_{1}>\lambda_{2}>\ldots>\lambda_{k^{*}}>\sup{\Gamma_{H}} denote the generalized spikes of Σd\Sigma_{d}, and ψ(λi)>0\psi^{\prime}(\lambda_{i})>0 for i=1,,ki=1,\ldots,k^{*}.

In this case we denote 𝕏~GSp({(λj,vj)}j=1k;ΓH;n,d)\widetilde{\mathbb{X}}\sim\mathrm{GSp}(\left\{(\lambda_{j},v_{j})\right\}_{j=1}^{k^{*}};\Gamma_{H};n,d)

Some of our results are derived under the classical spiked model (johnstone2001distribution). To that end we will use the following definition.

Definition 2.5.

We say that a data matrix 𝕏~n×d\widetilde{\mathbb{X}}_{n\times d} follows a spiked distribution with kk^{*} spike eigenvalue-eigenvector pairs {(λj,vj)}j=1k\left\{(\lambda_{j},v_{j})\right\}_{j=1}^{k^{*}} if 𝕏~=ZΣ1/2\widetilde{\mathbb{X}}=Z\Sigma^{1/2}, where ZZ is an n×dn\times d random matrix with i.i.d. sub-gaussian elements such that 𝔼(Zij)=0,𝔼(|Zij|2)=1\mathbb{E}(Z_{ij})=0,\mathbb{E}(|Z_{ij}|^{2})=1, and Σd1/2=I+j=1kλjvjvjT\Sigma_{d}^{1/2}=I+\sum_{j=1}^{k^{*}}\lambda_{j}v_{j}v_{j}^{T} with λ1λk>γ\lambda_{1}\geq\ldots\lambda_{k^{*}}>\sqrt{\gamma} where γ=limp/n\gamma=\lim p/n and v1,,vkv_{1},\ldots,v_{k^{*}} are orthonormal vectors. In this case we denote 𝕏~Sp({(λj,vj)}j=1k;n,d)\widetilde{\mathbb{X}}\sim\mathrm{Sp}(\left\{(\lambda_{j},v_{j})\right\}_{j=1}^{k^{*}};n,d)

2.1.2. Assumptions

Using the above definitions, we can now state our assumptions various subsets of which will be useful for our theoretical results.

Assumption 2.6.
  1. (a)

    p,n,p/nγ(0,)p\rightarrow\infty,n\rightarrow\infty,p/n\rightarrow\gamma\in(0,\infty).

  2. (b)

    𝕎GSp({(λj,vj)}j=1k;ΓH;n,p)\mathbb{W}\sim\mathrm{GSp}(\left\{(\lambda_{j},v_{j})\right\}_{j=1}^{k^{*}};\Gamma_{H};n,p).

  3. (b)’

    𝕎Sp({(λj,vj)}j=1k;p)\mathbb{W}\sim\mathrm{Sp}(\left\{(\lambda_{j},v_{j})\right\}_{j=1}^{k^{*}};p).

  4. (c)

    𝕏:=[𝐀,𝕎]GSp({(λj,vj)}j=1k;ΓH;n,p+1){\mathbb{X}}:=[\mathbf{A},\mathbb{W}]\sim\mathrm{GSp}(\left\{(\lambda_{j},v_{j})\right\}_{j=1}^{k^{*}};\Gamma_{H};n,p+1).

  5. (c)’

    𝕏:=[𝐀,𝕎]Sp({(λj,vj)}j=1k;p+1){\mathbb{X}}:=[\mathbf{A},\mathbb{W}]\sim\mathrm{Sp}(\left\{(\lambda_{j},v_{j})\right\}_{j=1}^{k^{*}};p+1).

We will mostly assume the following dependence between 𝐀\mathbf{A} and 𝕎\mathbb{W}.

Assumption 2.7.

Let 𝐀=𝕎𝛉+𝛈\mathbf{A}=\mathbb{W}\boldsymbol{\theta}+\boldsymbol{\eta} with 𝛈N(0,σg2I)\boldsymbol{\eta}\sim N(0,\sigma^{2}_{g}I).

Although assumption 2.7 seems restrictive, we keep it here for keeping our arguments short and only for the precise analysis of the test φk,out\varphi_{k,\mathrm{out}} – this assumption is often not needed either for analysis of φk,in\varphi_{k,\mathrm{in}} or while demonstrating a lower bound on the Type I error of φk,out\varphi_{k,\mathrm{out}} instead of deriving a precise limit . Indeed, this result is easy to extend to sub-Gaussian AA’s and we additionally verify through extensive simulations (see Section 4) that the intuitions from our main theorems continue to be valid even without assuming the above specific conditional distribution of A|𝐗A|\mathbf{X}. Finally we note that for the analysis in epigenetic studies (barfield2014accounting), where AA contains DNA methylation levels of a particular CpG site of interest and one performs PCA on genetic variants 𝐖\mathbf{W} to account for population stratification, assumption 2.7 is often reasonable. Our last assumption removes this restrictions in some of our results at the cost of obtaining bounds instead of precise limiting objects.

Assumption 2.8.

Let AiA_{i}’s be centered and sub-Gaussian with parameter σa2\sigma^{2}_{a}.

Assumption 2.8 will be used for deriving lower bounds on Type I error for tests based on LRk,out\mathrm{LR}_{k,\mathrm{out}} and obtain partial sharp limiting Type I error for tests based on LRk,in\mathrm{LR}_{k,\mathrm{in}} (see e.g. Theorem 3.2).

2.1.3. Notation

The results in this paper are mostly asymptotic (in nn) in nature and thus requires some standard asymptotic notations. If ana_{n} and bnb_{n} are two sequences of real numbers then anbna_{n}\gg b_{n} (and anbna_{n}\ll b_{n}) implies that an/bn{a_{n}}/{b_{n}}\rightarrow\infty (and an/bn0{a_{n}}/{b_{n}}\rightarrow 0) as nn\rightarrow\infty, respectively. Similarly anbna_{n}\gtrsim b_{n} (and anbna_{n}\lesssim b_{n}) implies that lim infnan/bn=C\liminf_{n\rightarrow\infty}{{a_{n}}/{b_{n}}}=C for some C(0,]C\in(0,\infty] (and lim supnan/bn=C\limsup_{n\rightarrow\infty}{{a_{n}}/{b_{n}}}=C for some C[0,)C\in[0,\infty)). Alternatively, an=o(bn)a_{n}=o(b_{n}) will also imply anbna_{n}\ll b_{n} and an=O(bn)a_{n}=O(b_{n}) will imply that lim supnan/bn=C\limsup_{n\rightarrow\infty}\ a_{n}/b_{n}=C for some C[0,)C\in[0,\infty)). If C>0C>0 then we write an=Θ(bn)a_{n}=\Theta(b_{n}). If an/bn1a_{n}/b_{n}\rightarrow 1, then we say anbna_{n}\sim b_{n}. A sequence of random variable XnX_{n} is called Ω\Omega_{\mathbb{P}}, a.s.\mathrm{a.s.}\ \mathbb{P} if C>0\exists C>0 such that (Xn>C)1\mathbb{P}(X_{n}>C)\rightarrow 1.

3. Main Results

The behavior of the tests φk,ind(t),ind{in,out}\varphi_{k,\rm ind}(t),\rm ind\in\{\mathrm{in},\mathrm{out}\}, crucially depends on the “strength” of principal components in the population model which in turn is reflective of the strength of population stratification under consideration for the GWAS studies. To capture this, we first present results for fixed “strength” of principal components in Section 3.1. Subsequently, in Section 3.3 we also present a sharp phase transition depending on spike strength in diverging spike strength of population principal components while considering the traditional spiked model (johnstone2001distribution). The implications of these results to GWAS type problems is discussed in Section 3.2 Finally, 3.2 also presents results when the studies come from a mixture of populations and thereby directly addressing population stratification type issues in GWAS.

3.1. Fixed Strength of Population PCs

We begin by stating and discussing the results for LRk,ind\mathrm{LR}_{k,\mathrm{ind}} fixed stratification strength, i.e., λ1=O(1)\lambda_{1}=O(1) whenever 𝕏\mathbb{X} or 𝕎\mathbb{W} follow a generalized spiked distribution with top spiked eigenvalue λ1\lambda_{1}.

Theorem 3.1.

Consider testing (2) under (2) using φk,ind(t)\varphi_{k,\rm ind}(t). Then the following hold with any fixed kkk\geq k^{*} under Assumptions 2.6 (a), (b), and 2.7 for ind=out\rm ind=\mathrm{out} and 2.6 (a) and (c) for ind=in\rm ind=\mathrm{in}.

  1. i.

    Suppose that 𝛃,𝛉\boldsymbol{\beta},\boldsymbol{\theta} are fixed effects. Then for any fixed cut-off tt\in\mathbb{R} and fixed M>0M>0 one has

    lim infn:p/nγ>0sup𝜷,𝜽Mδ=0(LRk,out>t)=1,\displaystyle\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\sup\limits_{\|\boldsymbol{\beta}\|,\|\boldsymbol{\theta}\|\leq M}\mathbb{P}_{\delta=0}\left(\mathrm{LR}_{k,\mathrm{out}}>t\right)=1,
    and lim infn:p/nγ>0sup𝜷Mδ=0(LRk,in>t)=1.\displaystyle\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\sup\limits_{\|\boldsymbol{\beta}\|\leq M}\mathbb{P}_{\delta=0}\left(\mathrm{LR}_{k,\mathrm{in}}>t\right)=1.
  2. ii.

    Assume that 𝛃N(𝟎,σβ2I/p)\boldsymbol{\beta}\sim N(\mathbf{0},\sigma^{2}_{\beta}{I}/p). Then \exists function υout(,):2+\upsilon_{\mathrm{out}}\left(\cdot,\cdot\right):\mathbb{R}^{2}\rightarrow\mathbb{R}_{+} such that for any h,th,t\in\mathbb{R} and δ=h1/n\delta=h_{1}/\sqrt{n} and fixed h2h_{2} one has

    limδ=h/n(LRk,out>t)=υout(t,h)𝜽s.t.𝜽h2;\displaystyle\lim\mathbb{P}_{\delta=h/\sqrt{n}}(\mathrm{LR}_{k,\mathrm{out}}>t)=\upsilon_{\mathrm{out}}\left(t,h\right)\quad\forall\boldsymbol{\theta}\quad\text{s.t.}\quad\|\boldsymbol{\theta}\|\geq h_{2};
    and limδ=h/n(LRk,in>t)=υin(t,h)\displaystyle\lim\mathbb{P}_{\delta=h/\sqrt{n}}(\mathrm{LR}_{k,\mathrm{in}}>t)=\upsilon_{\mathrm{in}}\left(t,h\right)

    The function υout\upsilon_{\mathrm{out}} depends on {γ,σy2,σβ2,k,k,ΓH,{𝐯j,𝜽,λj}j=1k}\{\gamma,\sigma^{2}_{y},\sigma^{2}_{\beta},k^{*},k,\Gamma_{H},\left\{\langle\mathbf{v}_{j},\boldsymbol{\theta}\rangle,\lambda_{j}\right\}_{j=1}^{k^{*}}\} and the function υin\upsilon_{\mathrm{in}} depends on {γ,σy2,σβ2,k,k,ΓH,{λj}j=1k}\{\gamma,\sigma^{2}_{y},\sigma^{2}_{\beta},k^{*},k,\Gamma_{H},\left\{\lambda_{j}\right\}_{j=1}^{k^{*}}\}. 222the precise forms of the υind\upsilon_{\rm ind} can be found in the proof where its dependence on the sequence of tuples {𝐯j,𝜽,{λj}j=1k}\{\langle\mathbf{v}_{j},\boldsymbol{\theta}\rangle,\{\lambda_{j}\}_{j=1}^{k^{*}}\} is suitably defined in a limiting sense. Throughout we shall implicitly assume that suitable quadratic forms of the type j=1pψ(λj)𝜻TvjvjT𝜻\sum_{j=1}^{p}\psi(\lambda_{j})\boldsymbol{\zeta}^{T}v_{j}v_{j}^{T}\boldsymbol{\zeta} for certain deterministic functions ψ\psi and sequence of vectors 𝜻\boldsymbol{\zeta} of bounded norm, that arise in the proof of the theorem, exists. Otherwise, our results can be written more tediously in terms of upper and lower bounds on the power function involving liminf’s and limsup’s of similar quadratic forms. and is larger than α\alpha at h=0,t=χ12(α)h=0,t=\chi^{2}_{1}(\alpha) (the α\alpha-quantile of χ12\chi^{2}_{1}). A qualitatively similar result holds for either 𝛃\boldsymbol{\beta} non-random and 𝛉\boldsymbol{\theta} random or both 𝛃,𝛉\boldsymbol{\beta},\boldsymbol{\theta} random effects 333with analogous but different limiting functions.

Remark 1.

Although some levels of Type I error is expected for fixed and low spike strength (patterson2006population), Theorem 3.1 implies that the Type I error can be arbitrarily close to 11 for using any fixed cut-off – which automatically covers any distribution based cut-off. Indeed, LRk,ind\mathrm{LR}_{k,\rm ind} diverges in this set-up and as our proof will suggest that this is not an artifact of specific adversarial choices of 𝛃,𝛉\boldsymbol{\beta},\boldsymbol{\theta} but actually holds for uncountably many choices of 𝛃,𝛉\boldsymbol{\beta},\boldsymbol{\theta}. Our proof also suggests that the Type I error is somewhat less pathological when 𝛃\boldsymbol{\beta} is orthogonal to the leading population eigenvectors – and thereby providing intuitive justification of the results in the random effects parts of Theorem 3.1 since mean 0 random 𝛃\boldsymbol{\beta} is “on average orthogonal” to any fixed directions.

Remark 2.

In the notation δ\mathbb{P}_{\delta} we have suppressed the dependence of \mathbb{P} on 𝛃,𝛉\boldsymbol{\beta},\boldsymbol{\theta} to not only maintain succinct notation but also to avoid confusion when one of 𝛃\boldsymbol{\beta} and 𝛉\boldsymbol{\theta} is random. Consequently, in statements of the theorem whenever a supremum over 𝛃\boldsymbol{\beta} or 𝛉\boldsymbol{\theta} appears, it implicitly acknowledges the dependence of δ\mathbb{P}_{\delta} on the respective parameters.

Remark 3.

We note that the results for LRk,ind\mathrm{LR}_{k,\rm ind} does not involve a supremum over 𝛉\boldsymbol{\theta}. This is because assuming a spiked model on 𝐗\mathbf{X} does not allow a free choice of 𝛉\boldsymbol{\theta} since by representing 𝛉\boldsymbol{\theta} as the population least squares coefficients implies that 𝛉\boldsymbol{\theta} is a function of Σ\Sigma. Also, we assume 𝛉h2>0\|\boldsymbol{\theta}\|\geq h_{2}>0 is bounded away from 0 to reflect a scenario where the effect of AA on YY experiences non-trivial confounding through 𝐖\mathbf{W}.

Remark 4.

The results above do not change if one changes the distribution random effects from Gaussian to other suitable sub-Gaussian distributions as long as the leading population eigenvectors v1,,vkv_{1},\ldots,v_{k^{*}} are delocalized. However, the results can be quite specific to various possibilities of localizations and distributions of the random effects otherwise. We do not explore this aspect here in detail.

Remark 5.

We consider a simple application of Theorem 3.1. Consider the population covariance matrix be the classical spike model (johnstone2001distribution) i.e. Σ=I+λ1e1e1\Sigma=I+\lambda_{1}e_{1}e^{\top}_{1}, where λ1>1+γ\lambda_{1}>1+\sqrt{\gamma}. By Lemma LABEL:lem:distribution_lr one has {equs}LR_k, outA,Wχ_1^2(^κ^2_out), ^κ^2_k,out=((βTW+Aδ)(I-PC(W~Vk,out))A)2A(I-PC(W~Vk,out))A. Note that,

δ(LRk,out>tα)=𝔼(Φ¯(tακ^k,out2)Φ(tακ^k,out2))2,\mathbb{P}_{\delta}(LR_{k,\mathrm{out}}>t_{\alpha})=\mathbb{E}\Big{(}\bar{\Phi}(t_{\alpha}-\hat{\kappa}_{k,\mathrm{out}}^{2})-\Phi(-t_{\alpha}-\hat{\kappa}_{k,\mathrm{out}}^{2})\Big{)}^{2},

where Φ\Phi is the standard normal cdf and Φ¯=1Φ\bar{\Phi}=1-\Phi. Hence, it will be sufficient to derive asymptotic distribution of κ^k,out2\hat{\kappa}_{k,\mathrm{out}}^{2} followed by an application of uniform integrability principle to derive the asymptotic behavior of the likelihood ratio test.

Let 𝛃N(0,1pσβ2)\boldsymbol{\beta}\sim N\left(0,\frac{1}{p}\sigma^{2}_{\beta}\right) and 𝛉:=e1\boldsymbol{\theta}:=e_{1}. We have shown, in (LABEL:eq:theta_fixed_ncp),

κ^k,out2σβ2(σg2m1+c4)c0+σg2χ2(h2(c0+σg2)2):=,\displaystyle\hat{\kappa}^{2}_{k,\mathrm{out}}\Rightarrow\frac{\sigma^{2}_{\beta}(\sigma^{2}_{g}m_{1}+c_{4})}{c_{0}+\sigma^{2}_{g}}\chi^{2}(h^{2}(c_{0}+\sigma^{2}_{g})^{2}):=\mathcal{L},

where c0=limnj=k+1npλ^j𝛉Tvj^v^jT𝛉c_{0}=\lim_{n}\sum_{j=k+1}^{n\wedge p}\hat{\lambda}_{j}\boldsymbol{\theta}^{T}\hat{v_{j}}\hat{v}_{j}^{T}\boldsymbol{\theta}, c4=limn1npj=k+1npdj^4𝛉Tvj^vj^T𝛉c_{4}=\lim_{n}\frac{1}{np}\sum_{j=k+1}^{n\wedge p}\hat{d_{j}}^{4}\boldsymbol{\theta}^{T}\hat{v_{j}}\hat{v_{j}}^{T}\boldsymbol{\theta}, and m1=limn1pj=k+1npλ^jm_{1}=\lim_{n}\frac{1}{p}\sum_{j=k+1}^{n\wedge p}\hat{\lambda}_{j}. Now, m1=m1,γm_{1}=m_{1,\gamma} is the mean of Marčenko-Pastur distribution. To compute the precise values of c0,c4c_{0},c_{4}, note that the limiting spectral distribution of Σ\Sigma is δ1\delta_{1}, the non-random distribution which is degenerate at 11. Now we invoke Lemma LABEL:lem:spikes_non_spikes to obtain that c0=j=1npϕ1j𝛉vjvj𝛉c_{0}=\sum_{j=1}^{n\wedge p}\phi_{1j}\boldsymbol{\theta}^{\top}v_{j}v^{\top}_{j}\boldsymbol{\theta} and c4=1γj=1npϕ2j𝛉vjvj𝛉c_{4}=\frac{1}{\gamma}\sum_{j=1}^{n\wedge p}\phi_{2j}\boldsymbol{\theta}^{\top}v_{j}v^{\top}_{j}\boldsymbol{\theta}, where

ϕ1j\displaystyle\phi_{1j} ={γλ+1λ2,j=11,j>1,\displaystyle=\begin{cases}\gamma\frac{\lambda+1}{\lambda^{2}},\quad j=1\\ 1,\quad j>1,\end{cases}
ϕ2j\displaystyle\phi_{2j} ={γλ+1λ2+γ2(λ+1)2λ3,j=1,1+γ,j>1.\displaystyle=\begin{cases}\gamma\frac{\lambda+1}{\lambda^{2}}+\gamma^{2}\frac{(\lambda+1)^{2}}{\lambda^{3}},\quad j=1,\\ 1+\gamma,\quad j>1.\end{cases}

Note that, Σ=(1+λ1)e1e1+j=2npejej\Sigma=(1+\lambda_{1})e_{1}e^{\top}_{1}+\sum_{j=2}^{n\wedge p}e_{j}e^{\top}_{j}. So, vj=ejv_{j}=e_{j}, c0=λ+1λ2c_{0}=\frac{\lambda+1}{\lambda^{2}}, and c4=γλ+1λ2+γ2(λ+1)2λ3c_{4}=\gamma\frac{\lambda+1}{\lambda^{2}}+\gamma^{2}\frac{(\lambda+1)^{2}}{\lambda^{3}}. Hence,

δ(LRk,out>tα)𝔼(Φ¯(tα)Φ(tα))2.\mathbb{P}_{\delta}(LR_{k,\mathrm{out}}>t_{\alpha})\rightarrow\mathbb{E}\Big{(}\bar{\Phi}(t_{\alpha}-\mathcal{L})-\Phi(-t_{\alpha}-\mathcal{L})\Big{)}^{2}.

A few further remarks are in order regarding the assumptions and implications of Theorem 3.1 regarding the behavior of φk,ind(t)\varphi_{k,\rm ind}(t) for ind{in,out}\rm ind\in\{\mathrm{in},\mathrm{out}\}. First, Theorem 3.1 (i) implies that under fixed effects model on the regression of Y|A,𝐗Y|A,\mathbf{X} and A|𝐗A|\mathbf{X} one has pathological behavior of PC adjustment based tests in the sense that for any fixed cut-off tt, the LRT following PC adjustments has size converging to 11. In Section 4 we further present simulation results towards universality of this phenomenon by verifying that this is indeed not an artifact of the particular choice of regression of A|𝐗A|\mathbf{X} in our analytical explorations. Similarly, Theorem 3.1 (ii) demonstrates that although the pathology of the size is somewhat diluted for random effects in the regressions of Y|A,𝐗Y|A,\mathbf{X} and A|𝐗A|\mathbf{X}, the resulting size is still strictly above the desired level α\alpha while using the standard χ12(α)\chi_{1}^{2}(\alpha) type cut-offs. Indeed, in view of the proportional asymptotic regime (i.e. p/nγ(0,)p/n\rightarrow\gamma\in(0,\infty)) and associated inconsistency of PCA (johnstone2009consistency), one might expect some level of discrepancy in the size and power of the PC adjustment based test φk,ind\varphi_{k,\rm ind} considered – and Theorem 3.1 provides precise nature of this discrepancy. Subsequently, it is natural to ask whether one can recover the desired α\alpha-level by suitably inverting the asymptotic power function of the test φk,ind(t)\varphi_{k,\rm ind}(t). In this regard, we first focus on the case of random 𝜷\boldsymbol{\beta} and a close look at the formulae obtained for the power functions υind\upsilon_{\rm ind} in Theorem 3.1 reveals their intricate dependence on {𝐯j,𝜽,λj}j=1k}\left\{\langle\mathbf{v}_{j},\boldsymbol{\theta}\rangle,\lambda_{j}\right\}_{j=1}^{k^{*}}\} – which cannot in general be estimated consistently for generalized spiked models with fixed spike strengths cai2018rate; perry2018optimality. In some cases, when υind\upsilon_{\rm ind} does not depend on the projection of 𝜽\boldsymbol{\theta} in the directions of population eigenvectors (see e.g. Remark 5 for the very special case of classical spiked model (johnstone2001distribution) and specific assumption on 𝜽\boldsymbol{\theta}) estimation of the power function is possible – however in general this seems to be impossible without assuming further structure on 𝜽\boldsymbol{\theta} and population 𝐯j\mathbf{v}_{j}’s. In contrast when 𝜽\boldsymbol{\theta} is random, one has irrespective of 𝜷\boldsymbol{\beta} that the power function only depends on the ΓH\Gamma_{H} (the population spectral distribution for the distribution of 𝐗\mathbf{X} or 𝐖\mathbf{W}) and hence can be potentially estimated. To keep our discussions focused we do not further pursue this avenue here.

To further argue that the pathologies notes above are not an artifact of Assumption 2.7, we now present analogues of the results (in somewhat less precise form) from Theorem 3.1 for the case when A|𝐖A|\mathbf{W} is not a linear regression. Indeed this is the case when AA is discrete (as is the situation for GWAS) and as a result Theorem 3.1 does not apply immediately. However as we will see that the qualitative nature of the problem conveyed by Theorem 3.1 continues to hold – with the difference being in the nature of preciseness at which we are able to pinpoint the accurate limit of the Type I error (as in Theorem 3.1ii.) in the sense that we are only able to provide a lower bound instead of a precise limit.

Theorem 3.2.

Consider testing (2) under (2) using φk,ind(t)\varphi_{k,\rm ind}(t). Also assume that kkk\geq k^{*} is fixed and Assumptions 2.6 (a) and 2.8 hold.

  1. i.

    The following holds under Assumption 2.6 (b) for any M>0M>0.

    lim infn:p/nγ>0sup𝜷Mδ=0(LRk,out>χ12(α))>α.\displaystyle\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\sup\limits_{\|\boldsymbol{\beta}\|\leq M}\mathbb{P}_{\delta=0}\left(\mathrm{LR}_{k,\mathrm{out}}>\chi_{1}^{2}(\alpha)\right)>\alpha.
  2. ii.

    Suppose 𝜷N(0,σβ2I/p)\boldsymbol{\beta}\sim N(0,\sigma^{2}_{\beta}I/p) and that Assumption 2.6(b) holds. Then

    limn:p/nγ>0δ=0(LRk,out>χ12(α))>α.\displaystyle\lim\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\mathbb{P}_{\delta=0}(\mathrm{LR}_{k,\mathrm{out}}>\chi_{1}^{2}(\alpha))>\alpha.
  3. iii.

    Suppose that Assumption 2.6 (c) holds. Let 𝜷0:=I1𝜷\boldsymbol{\beta}_{0}:=I_{-1}\boldsymbol{\beta} with I1=[e2::ep+1]I_{-1}=[e_{2}:\ldots:e_{p+1}] and cp(𝜷0)=j=1pϕ1j𝜷0vjvje1c^{*}_{p}(\boldsymbol{\beta}_{0})=\sum_{j=1}^{p}\phi_{1j}\boldsymbol{\beta}^{\top}_{0}v_{j}v_{j}^{\top}e_{1} with ϕ1j\phi_{1j} defined in Lemma LABEL:lem:spikes_non_spikes and ej,j1e_{j},j\geq 1 canonical basis p+1\mathbb{R}^{p+1}.

    1. (a)

      If 𝜷\boldsymbol{\beta} is such that lim inf|cp(𝜷0)|>0\liminf|c^{*}_{p}(\boldsymbol{\beta}_{0})|>0 then for any tt\in\mathbb{R}

      lim infn:p/nγ>0δ=0(LRk,in>t)=1.\displaystyle\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\mathbb{P}_{\delta=0}\left(\mathrm{LR}_{k,\mathrm{in}}>t\right)=1.
    2. (b)

      If 𝜷\boldsymbol{\beta} is such that lim sup|cp(𝜷0)|=0\limsup|c^{*}_{p}(\boldsymbol{\beta}_{0})|=0 and lim inf|pcp(𝜷)|>0\liminf|\sqrt{p}c^{*}_{p}(\boldsymbol{\beta})|>0 then

      lim infn:p/nγ>0δ=0(LRk,in>χ12(α))>α.\displaystyle\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\mathbb{P}_{\delta=0}\left(\mathrm{LR}_{k,\mathrm{in}}>\chi_{1}^{2}(\alpha)\right)>\alpha.
    3. (c)

      If 𝜷\boldsymbol{\beta} is such that lim sup|pcp(𝜷0)|=0\limsup|\sqrt{p}c^{*}_{p}(\boldsymbol{\beta}_{0})|=0 then the following holds if lim infλp>0\liminf\lambda_{p}^{*}>0.

      lim supn:p/nγ>0δ=0(LRk,in>χ12(α))=α.\displaystyle\limsup\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\mathbb{P}_{\delta=0}\left(\mathrm{LR}_{k,\mathrm{in}}>\chi_{1}^{2}(\alpha)\right)=\alpha.
  4. iv.

    Suppose 𝜷N(0,σβ2I/p)\boldsymbol{\beta}\sim N(0,\sigma^{2}_{\beta}I/p) and that Assumption 2.6(b) holds. Then

    limn:p/nγ>0δ=0(LRk,in>χ12(α))>α.\displaystyle\lim\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\mathbb{P}_{\delta=0}(\mathrm{LR}_{k,\mathrm{in}}>\chi_{1}^{2}(\alpha))>\alpha.

A few remarks are in order regarding the nature of the results above. First, the result shows the necessity to at least modify the currently used cut-offs for testing using LRk,out\mathrm{LR}_{k,\mathrm{out}}. Although we believe and demonstrate in the simulation section that a result similar to Theorem 3.1 (i.e. Type I error converging to 11 for any fixed cut-off) our proofs techniques do not naturally extend to the case of arbitrary AA. The reason we can still demonstrate an inflated Type I error while using classical cut-offs is because our proof essentially relies on the case where A𝐖A\perp\mathbf{W} and thereby is analogous to the case of 𝜷=0\boldsymbol{\beta}=0 in the context of Theorem 3.1 – implying no confounding through of AA and 𝐖\mathbf{W}. However, the result for LRk,in\mathrm{LR}_{k,\mathrm{in}} can remain pathological owing to the fact of double use of AA in both as a regressor for YY as well as the in the PC’s used in the same regression. In order to provide insight about the nature of the quantity driving the behavior of LRk,in\mathrm{LR}_{k,\mathrm{in}}, we elaborate using the special case classical spiked model with Σ=I+λvvT\Sigma=I+\lambda vv^{T}. In this case the result depends on the localization structure of vv. First if ve1v\perp e_{1} then cp(𝜷0)=0c_{p}^{*}(\boldsymbol{\beta}_{0})=0 for any 𝜷0\boldsymbol{\beta}_{0} and hence a χ12\chi_{1}^{2} cut-off is valid for LRk,in\mathrm{LR}_{k,\mathrm{in}}. This is indeed expected since ve1v\perp e_{1} implies no confounding between YY and AA through 𝐖\mathbf{W}. Suppose now lim inf|v,e1|>0\liminf|\langle v,e_{1}\rangle|>0 and lim inf|v,𝜷|>0\liminf|\langle v,\boldsymbol{\beta}\rangle|>0 then we land in the situation of part (a) of Theorem 3.2iii. (a) since using the fact that 𝜷0Te1=0\boldsymbol{\beta}_{0}^{T}e_{1}=0 in this case we have lim inf|cp(𝜷0)|=lim inf|(ϕ11ϕ12)𝜷0TvvTe1+𝜷0Te1|=lim inf|(ϕ11ϕ12)𝜷0TvvTe1|>0\liminf|c_{p}^{*}(\boldsymbol{\beta}_{0})|=\liminf|(\phi_{11}-\phi_{12})\boldsymbol{\beta}_{0}^{T}vv^{T}e_{1}+\boldsymbol{\beta}_{0}^{T}e_{1}|=\liminf|(\phi_{11}-\phi_{12})\boldsymbol{\beta}_{0}^{T}vv^{T}e_{1}|>0. Finally if |v,e1|=Θ(1/p)|\langle v,e_{1}\rangle|=\Theta(1/\sqrt{p}) we can appeal to 3.2iii. (b) and a Type error inflation occurs while using traditional χ12\chi_{1}^{2} cut-off.

3.2. Implications for GWAS type Analyses

As mentioned in the introduction, the method analyzed above has especially gathered immense popularity in GWAS type analysis where adjusting for variables in 𝐗\mathbf{X} crucially adjusts for population stratification and unmeasured confounding (price2006principal). In particular, a classical problem of modern GWAS is that of testing for individual genetic variants while adjusting for other variants and environmental factors (visscher201710). In this regard, the issue of population stratification is one of the main challenges in modern genetic association studies. In particular, the presence of population sub-structure, (i.e. when data set consists of several sub-populations, different phenotypic means across strata in combination with varying allele frequencies across populations) can lead to confounding of the disease-gene relationship. Consequently, many commonly used association tests may be severely biased. Indeed, several methods exist to account for such population structure. The most commonly used ones among them can be organized around three broad themes – (a) the method of genomic control (devlin1999genomic) (this addresses the effects that unknown population-admixture might have on the variance of commonly used test test statistics); (b) methods that rely on a model and data on the additional genetic markers (pritchard2000inference) to infer the latent population structure and subsequently incorporates this model into down stream analysis; and (c) methods that use linear combinations of other genetic markers (which capture population sub-structure in their distribution) as covariates in the analysis using linear or logistic regression (these markers serve as surrogates for the underlying strata chen2003qualitative; zhang2003semiparametric; zhu2002association; price2006principal). Owing to the seminal paper of price2006principal, this third approach which we analyze here, have emerged as the most popular of these available methods.

The implications of Theorem 3.4 can be best understood through its intuitive connection to a simple population admixture model described as follows. Suppose that 𝐗i=(Xi1,,Xip)\mathbf{X}_{i}=(X_{i1},\ldots,X_{ip}) is such that there exists S{1,,p}S\subset\{1,\ldots,p\} coordinates where the distributions of Xij,jSX_{ij},j\in S are a mixture between two population and for jScj\in S^{c} the XijX_{ij}’s are same between the two populations. For studying GWAS type problems, under LD pruning (hartl1997principles; laird2011fundamentals) of the genotypes, one such situation can be idealized through independently having Xij12Bin(2,q1j)+12Bin(2,q2j)X_{ij}\sim\frac{1}{2}\mathrm{Bin}(2,q_{1j})+\frac{1}{2}\mathrm{Bin}(2,q_{2j}) with q1j=q2j=qq_{1j}=q_{2j}=q for jSj\in S and q1=q1jq2j=q2q_{1}=q_{1j}\neq q_{2j}=q_{2} for jScj\in S^{c}. It is easy to check that in this case, Var(𝐗)=σ2I+(q1q2)2|S|𝐯𝐯T\mathrm{Var}(\mathbf{X})=\sigma^{2}I+(q_{1}-q_{2})^{2}|S|\mathbf{v}\mathbf{v}^{T} for some σ2>0\sigma^{2}>0 and unit vector 𝐯\mathbf{v}. Consequently, in the language of spiked model formulation, the variance covariance matrix of 𝐗\mathbf{X} indeed follows a spiked model with spike strength proportional to |S||S|. Theorem 3.4 pertains to the traditional spiked model as an intuitive simplification from the motivating mixture model described above and establishes precise phase transition for the performance of EIGENSTRAT type procedures based on population stratification strength – as captured by the leading spike strength. Indeed, this provides an useful benchmark to check the validity of using PC adjustment based testing for low dimensional parameters.

Our next result is targeted towards showing that we can translate the results from the generalized spike model set up to a more relevant case of mixture models on the rows of 𝕏\mathbb{X}. In particular, it is natural to assume such a mixture distribution for the genotypes in genetic association studies while considering population admixture models (price2006principal).

Theorem 3.3.

Consider testing (2) under (2) using φk,ind(t)\varphi_{k,\rm ind}(t). Then the same conclusion as of Theorem 3.1 holds with any fixed k1k\geq 1 under Assumptions 2.6 (a), and 2.7 for ind=out\rm ind=\mathrm{out} when 𝐖i=1LwiF(𝛍i,Σ)\mathbf{W}\sim\sum_{i=1}^{L}w_{i}F(\boldsymbol{\mu}_{i},\Sigma) for any fixed LL with wi0w_{i}\geq 0, with i=1Lwi=1\sum_{i=1}^{L}w_{i}=1, supiμi1\sup_{i}\|\mu_{i}\|\leq 1, Σ=O(1)\|\Sigma\|=O(1) where F(𝛍,Σ)F(\boldsymbol{\mu},\Sigma) is the distribution of a random vector Σ1/2𝐙+𝛍\Sigma^{1/2}\mathbf{Z}+\boldsymbol{\mu} where 𝐙\mathbf{Z} is a random vector with mean zero i.i.d. sub-Gaussian coordinates. When ind=in\rm ind=\mathrm{in}, the same conclusion holds under Assumption 2.6 (a) when 𝐗i=1LwiF(𝛍i,Σ)\mathbf{X}\sim\sum_{i=1}^{L}w_{i}F(\boldsymbol{\mu}_{i},\Sigma).

It is worth noting that mixture models on 𝐗\mathbf{X} implies spiked models on the variance covariance matrix of 𝐗\mathbf{X} and thereby the above result is somewhat intuitive given the validity of Theorem 3.1. Moreover, in our numerical experiments we shall additionally verify the universality of the results for non-mean-shift type mixture models on 𝐗\mathbf{X} (e.g. mixtures of binomial distributions on the coordinates of 𝐗\mathbf{X} as more natural candidates for modeling genotypes under admixture of populations under Hardy-Weinberg equilibrium).

3.3. Diverging Strength of Population PCs

Section 3.1 explored the necessity of potential caution while using EIGENSTRAT type procedure for PC adjustment testing of low-dimensional components. However, these results are not able to reconcile the empirical success and popularity of this procedure. Moreover, to an expert in the results of random matrix theory, the above results might suggest the main issue being the bias in the eigenvectors as for fixed spike strengths. Indeed, the fact that PC based analysis of genetic variations have been able to reveal known population structures (novembre2008genes), prompts one to believe that a diverging spike strength might be the main underlying recipe behind the success of the PC adjustment based testing procedures. However, as we shall see below, that simply diverging spike strength is not sufficient to explain away the pathologies noted above – and rather a specific signal strength can indeed be calibrated to pin point the regimes of success of EIGENSTRAT types procedures.

In this regard, we now present first results in this direction to derive necessary and sufficient conditions on the success of this procedure under the classical spiked model johnstone2001distribution with diverging spike strengths. In the regime of diverging signal strength, it turns out that we can additionally take into account the effect of the distances between 𝜷,𝜽\boldsymbol{\beta},\boldsymbol{\theta} and the population spike eigenvectors. This seems to be a reasonable consideration to explore since using the sample PC directions as the regressors finds its natural analogue in the population when 𝜷\boldsymbol{\beta} belongs to (or is close to) the linear span of the population spiked eigenvectors. Therefore to introduce the results we will require the following natural notion of distance between vectors and sub-spaces. In particular, given a set VV and a fixed vector β\beta we define d(β,V)=1PV(𝜷)2𝜷2d(\beta,V)=1-\frac{\|P_{V}(\boldsymbol{\beta})\|^{2}}{\|\boldsymbol{\beta}\|^{2}} (with PVP_{V} denoting th orthogonal projection operator onto VV) and 𝒞τ(V)={β:β1,d(β,V)pτ}\mathcal{C}_{\tau}(V)=\{\beta:\|\beta\|\leq 1,d(\beta,V)\leq p^{-\tau}\}. Note that, if τ1>τ2\tau_{1}>\tau_{2}, 𝒞τ1(V)𝒞τ2(V)\mathcal{C}_{\tau_{1}}(V)\subseteq\mathcal{C}_{\tau_{2}}(V) and larger τ\tau implies lesser angle of the vector 𝜷\boldsymbol{\beta} with elements of VV. Moreover, the results in this case for LRk,ind\mathrm{LR}_{k,\rm ind} additionally depend on the nature of localization of the coordinates of the spikes. This is intuitively understandable since certain natures of localized coordinates implies a highly strong effect of coordinates of 𝐖\mathbf{W} on AA whereas delocalized eigenvectors imply otherwise. To avoid confusion, we therefore first present the results for LRk,out\mathrm{LR}_{k,\mathrm{out}}.

Theorem 3.4.

Consider testing (2) under (2) using φk,out(t)\varphi_{k,\mathrm{out}}(t) and suppose that Assumptions 2.6 (a), (b)’, and 2.7 hold. Further assume that λ1,λk=Θ(pτ0)\lambda_{1},\lambda_{k^{*}}=\Theta(p^{\tau_{0}}) with τ0>0\tau_{0}>0. Then the following hold for any τ>0\tau>0

  1. i.

    Suppose that 𝜷,𝜽\boldsymbol{\beta},\boldsymbol{\theta} are not random.

    1. (a)

      If min{τ0,τ}<1/2\min\{\tau_{0},\tau\}<1/2, then for any fixed cut-off tt\in\mathbb{R} and γ<1\gamma<1,

      lim infn:p/nγ>0sup𝜷,𝜽𝒞τ(SVk)δ=0(LRk,out>t)=1.\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\sup_{\boldsymbol{\beta},\boldsymbol{\theta}\in\mathcal{C}_{\tau}(S_{V_{k^{*}}})}\mathbb{P}_{\delta=0}(\mathrm{LR}_{k^{*},\mathrm{out}}>t)=1.
    2. (b)

      If min{τ0,τ}>1/2\min\{\tau_{0},\tau\}>1/2, then for any kkk\geq k^{*}

      lim infn:p/nγ>0sup𝜷,𝜽𝒞τ(SVk)δ=0(LRk,out>χ12(α))=α.\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\sup_{\boldsymbol{\beta},\boldsymbol{\theta}\in\mathcal{C}_{\tau}(S_{V_{k^{*}}})}\mathbb{P}_{\delta=0}(\mathrm{LR}_{k,\mathrm{out}}>\chi^{2}_{1}(\alpha))=\alpha.
  2. ii.

    Suppose 𝛃N(0,1pσβ2Ip)\boldsymbol{\beta}\sim N(0,\frac{1}{p}\sigma^{2}_{\beta}I_{p}). Then for any τ00\tau_{0}\geq 0 and γ<1\gamma<1 one has

    lim infn:p/nγ>0sup𝜽𝒞τ(SVk)δ=0(LRk,out>χ12(α))>α.\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\sup_{\boldsymbol{\theta}\in\mathcal{C}_{\tau}(S_{V_{k^{*}}})}\mathbb{P}_{\delta=0}(\mathrm{LR}_{k^{\star},\mathrm{out}}>\chi^{2}_{1}(\alpha))>\alpha.

    Further, if 𝛉N(0,1pσθ2Ip)\boldsymbol{\theta}\sim N(0,\frac{1}{p}\sigma^{2}_{\theta}I_{p}),

    lim infn:p/nγ>0δ=0(LRk,out>χ12(α))>α.\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\mathbb{P}_{\delta=0}(\mathrm{LR}_{k^{\star},\mathrm{out}}>\chi^{2}_{1}(\alpha))>\alpha.

    .

  3. iii.

    Assume 𝜽N(0,1pσθ2Ip)\boldsymbol{\theta}\sim N(0,\frac{1}{p}\sigma^{2}_{\theta}I_{p}) but 𝜷\boldsymbol{\beta} is not random. If τ0>0\tau_{0}>0, then

    lim infn:p/nγ>0sup𝜷𝒞τ(SVk)δ=0(LRk,out>χ12(α))=α.\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\sup_{\boldsymbol{\beta}\in\mathcal{C}_{\tau}(S_{V_{k^{*}}})}\mathbb{P}_{\delta=0}(\mathrm{LR}_{k,\mathrm{out}}>\chi^{2}_{1}(\alpha))=\alpha.

Above SVkS_{V_{k^{*}}} refers to the subspace spanned by v1,,vkv_{1},\ldots,v_{k^{*}}.

A few remarks are in order regarding the phase transitions implied by Theorem 3.4. First, it is worth noting that the phase transitions for the Type I error of φk,out\varphi_{k,\mathrm{out}} depends not only whether 𝜽,𝜷\boldsymbol{\theta},\boldsymbol{\beta} are random effects but also on the angle between 𝜽,𝜷\boldsymbol{\theta},\boldsymbol{\beta} (whichever is fixed effects) and 𝐯1\mathbf{v}_{1}. As for the range of τ\tau (the parameter deciding the angle with 𝐯1\mathbf{v}_{1}), Theorem 3.4 only records the results for τ>0\tau>0. Indeed, for τ=0\tau=0, the spaces for 𝜷,𝜽\boldsymbol{\beta},\boldsymbol{\theta} are rendered unrestricted. In this case, our next result demonstrates the lack of phase transition compared to the results in Theorem 3.4.

Proposition 3.5.

Consider testing (2) under (2) using φk,out(t)\varphi_{k,\mathrm{out}}(t). Further assume that λ1=Θ(pτ0)\lambda_{1}=\Theta(p^{\tau_{0}}), τ0>0\tau_{0}>0 and γ<1\gamma<1. Then the following hold under Assumptions 2.6 (a), (b)’: {equs}lim inf_n→∞:p/n γ¿0 sup_β,θC_0(S_V_k^*) P_δ=0(LR_k^,out¿χ^2_1(α))¿α. Further, when 𝛃N(0,1pσθ2Ip)\boldsymbol{\beta}\sim N(0,\frac{1}{p}\sigma^{2}_{\theta}I_{p}), the following holds under Assumptions 2.6 (a), (b)’: {equs}lim inf_n→∞:p/n γ¿0 sup_θC_0(S_V_k^*) P_δ=0(LR_k^,out¿χ^2_1(α))¿α. If 𝛃N(0,1pσβ2Ip)\boldsymbol{\beta}\sim N(0,\frac{1}{p}\sigma^{2}_{\beta}I_{p}) and 𝛉N(0,1pσθ2Ip)\boldsymbol{\theta}\sim N(0,\frac{1}{p}\sigma^{2}_{\theta}I_{p}), then

lim infn:p/nγ>0δ=0(LRk,out>χ12(α))>α.\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\mathbb{P}_{\delta=0}(\mathrm{LR}_{k^{\star},\mathrm{out}}>\chi^{2}_{1}(\alpha))>\alpha.

However, if 𝛉N(0,1pσθ2Ip)\boldsymbol{\theta}\sim N(0,\frac{1}{p}\sigma^{2}_{\theta}I_{p}), for any γ>0\gamma>0,

lim infn:p/nγ>0sup𝜷𝒞τ(SVk)δ=0(LRk,out>χ12(α))=α.\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\sup_{\boldsymbol{\beta}\in\mathcal{C}_{\tau}(S_{V_{k^{*}}})}\mathbb{P}_{\delta=0}(\mathrm{LR}_{k,\mathrm{out}}>\chi^{2}_{1}(\alpha))=\alpha.

Before proceeding to the analysis of LRk,in\mathrm{LR}_{k,\mathrm{in}} we present a partial result on the behavior of LRk,out\mathrm{LR}_{k,\mathrm{out}} under a regime which avoids the reliance on Assumption 2.7.

Proposition 3.6.

Consider testing (2) under (2) using φk,out(t)\varphi_{k,\mathrm{out}}(t). Further assume that λ1=Θ(pτ0)\lambda_{1}=\Theta(p^{\tau_{0}}) and τ,τ0>0\tau,\tau_{0}>0 and γ<1\gamma<1. Then the following hold under Assumptions 2.6 (a), (b)’, and 2.8 whenever min{τ,τ0}<12\min\{\tau,\tau_{0}\}<\frac{1}{2}. {equs}lim inf_n→∞:p/n γ¿0 sup_βC_τ(S_V_k^*) P_δ=0(LR_k^,out¿χ^2_1(α))¿α. If 𝛃N(0,1pσβ2Ip)\boldsymbol{\beta}\sim N(0,\frac{1}{p}\sigma^{2}_{\beta}I_{p}). Then one has

lim infn:p/nγ>0sup𝜽𝒞τ(SVk)δ=0(LRk,out>χ12(α))>α.\liminf\limits_{n\rightarrow\infty:\atop p/n\rightarrow\gamma>0}\sup_{\boldsymbol{\theta}\in\mathcal{C}_{\tau}(S_{V_{k^{*}}})}\mathbb{P}_{\delta=0}(\mathrm{LR}_{k^{\star},\mathrm{out}}>\chi^{2}_{1}(\alpha))>\alpha.

Our final result pertains to the behavior of LRk,in\mathrm{LR}_{k,\mathrm{in}} in the regime of divergence spikes. As noted above, this behavior is subtle and depends on the nature of localization of the coordinates of the leading eigenvectors of Σ\Sigma. Since the complete characterization of all the possible cases that can arise is beyond the scope of the paper, we focus on a simple special case of single spiked model below with completely delocalized spiked eigenvector.

Theorem 3.7.

Consider testing (2) under (2) using φk,in(t)\varphi_{k,\mathrm{in}}(t) and suppose that Assumptions 2.6 (a), (c)’, and 2.7 hold.

  1. i.

    Assume that k=1k^{*}=1, λ1=Θ(pτ0)\lambda_{1}=\Theta(p^{\tau_{0}}) with τ0>0\tau_{0}>0 and that infj|v1(j)|,supj|v1(j)|=Θ(1/p)\inf_{j}|v_{1}(j)|,\sup_{j}|v_{1}(j)|=\Theta(1/\sqrt{p}). Then the following hold with any fixed k1k\geq 1 and fixed M>0M>0 {equs}lim inf_n→∞:p/n γ¿0 sup_βS_v_1:β∥≤M P_δ=0(LR_k,in¿χ^2_1(α))=α.

  2. ii.

    If 𝜷N(0,σβ2pIp)\boldsymbol{\beta}\sim N(0,\frac{\sigma^{2}_{\beta}}{p}I_{p}), {equs}lim inf_n→∞:p/n γ¿0 P_δ=0(LR_1,in¿χ^2_1(α))¿α.

Indeed, for diverging spikes using AA twice, once in regression of YY on AA and also while adjusting for confounding through PCA, is more beneficial compared to LRk,out\mathrm{LR}_{k,\mathrm{out}}. This is in contrast to existing intuition which only considers power of the procedures and ignores the Type I error considerations (mai2021understanding; listgarten2012improved; yang2014advantages).

4. Numerical Experiments

In this section, we present detailed numerical experiments to not only verify our theoretical results but also to provide additional evidence on the potential universality of the main narrative behind our results that persists beyond the working assumptions of our theoretical results.

We first present results on tests based on LRk,out\mathrm{LR}_{k,\mathrm{out}}. Our simulations are set up to explore the following features of PC adjustment based procedure for testing (2) under true underlying model (2) by varying the following features: (i) marginal distribution of 𝐖\mathbf{W} and associated strength of confounding as captured by the spiked strength in case of the generalized spiked model or a suitable notion of mixing strength for mixture models on 𝐖\mathbf{W}; (ii) conditional distribution of A|𝐖A|\mathbf{W}; (iii) aspect ratio between the sample size and the number of confounding variables; (iv) the nature of the regression of Y|A,𝐖Y|A,\mathbf{W} (i.e. fixed or random effects as driven by the nature of 𝜷\boldsymbol{\beta} in model (2)); and (v) the nature of the regression of A|𝐖A|\mathbf{W} (i.e. once again fixed or random effects as driven by the nature of 𝜽\boldsymbol{\theta} in model (2.7)). Below we first provide a description of the variations that different types of cases we will consider in this regard.

  • Distribution of 𝐖\mathbf{W}: We shall consider two broad sub-cases in this regard with further variations as follows.

    1. (1)

      Spiked Model: In this case we will consider 𝐖iN(0,Σ)\mathbf{W}_{i}\sim N(0,\Sigma) with Σ=I+j=1kλjvjvjT\Sigma=I+\sum_{j=1}^{k^{*}}\lambda_{j}v_{j}v_{j}^{T} with λ1,λk>0\lambda_{1}\geq\ldots,\lambda_{k}^{*}>0. We shall consider k=1k^{*}=1, and λ1=pτ0\lambda_{1}=p^{\tau_{0}} to regulate the spike strength, where τ0{0,0.05,0.10,,1}\tau_{0}\in\{0,0.05,0.10,\ldots,1\}.

    2. (2)

      Mixture Model: We shall consider two types mixture models: (i) (Gaussian Mixture Model) 𝐖i12N(𝝁1,I)+12N(𝝁2,I)\mathbf{W}_{i}\sim\frac{1}{2}N(\boldsymbol{\mu}_{1},I)+\frac{1}{2}N(\boldsymbol{\mu}_{2},I) with 𝝁i=(μi1,,μip)\boldsymbol{\mu}_{i}=(\mu_{i1},\ldots,\mu_{ip}) for i=1,2i=1,2, and μ1j=μ2j=1\mu_{1j}=-\mu_{2j}=1 for j=1,,mj=1,\ldots,m and μ1j=μ2j=0\mu_{1j}=\mu_{2j}=0 for jm+1j\geq m+1; and (ii) (Binomial Mixture Model) 𝐖i=(Wi1,,Wip)\mathbf{W}_{i}=(W_{i1},\ldots,W_{ip}) with Wij12Bin(2,q1j)+12Bin(2,q2j)W_{ij}\sim\frac{1}{2}\mathrm{Bin}(2,q_{1j})+\frac{1}{2}\mathrm{Bin}(2,q_{2j}) with q1j=0.3q_{1j}=0.3, q2j=0.7q_{2j}=0.7 for j=1,,mj=1,\ldots,m and q1j=q2j=0.5q_{1j}=q_{2j}=0.5 for jm+1j\geq m+1. For this model, we vary m=pτ0m=\lceil{p^{\tau_{0}}\rceil} to regulate the strength of the population stratification, where τ0{0,0.05,0.10,,1}\tau_{0}\in\{0,0.05,0.10,\ldots,1\}, and \lceil\cdot\rceil denotes the ceiling function.

  • Regression of A|𝐖A|\mathbf{W}: We shall consider two cases: (i) (Continuous Exposure) AiA_{i}\in\mathbb{R} with Ai=𝐖iT𝜽+ηiA_{i}=\mathbf{W}_{i}^{T}\boldsymbol{\theta}+\eta_{i} where ηiN(0,σa2)\eta_{i}\sim N(0,\sigma_{a}^{2}); and (ii) (Binomial Exposure) AiBin(2,pi)A_{i}\sim\mathrm{Bin}(2,p_{i}) with pi=exp(𝜽T𝐖i)1+exp(𝜽T𝐖i)p_{i}=\frac{\exp(\boldsymbol{\theta}^{T}\mathbf{W}_{i})}{1+\exp(\boldsymbol{\theta}^{T}\mathbf{W}_{i})}. We take σa2=1\sigma_{a}^{2}=1 for our simulations.

  • Aspect Ratio: We shall consider two different aspect ratios γ{0.5,2}\gamma\in\{0.5,2\} between the sample size and the number of confounding variables in our simulations. In all our simulations, we shall let p=1000p=1000 (and thus the sample size n{500,2000}n\in\{500,2000\}).

  • Regression Coefficients for Y|A,𝐖Y|A,\mathbf{W}: We shall consider both fixed and random effects for 𝜷\boldsymbol{\beta} in the model (2). When 𝜷\boldsymbol{\beta} is a fixed effect, we shall vary the angle/distance metric d(𝜷,Sv1)d(\boldsymbol{\beta},S_{v_{1}}) by setting 𝜷=av1+1a2v2\boldsymbol{\beta}=av_{1}+\sqrt{1-a^{2}}v_{2}, where v1v_{1} and v2v_{2} are the eigenvectors corresponding to the largest and second largest eigenvalues of 𝔼(𝐖𝐖/n)\mathbb{E}(\mathbf{W}^{\top}\mathbf{W}/n), and a=1pτ0a=1-p^{-\tau_{0}}. The specification of τ0\tau_{0} remains the same as described in the context of varying spike strength. When 𝜷\boldsymbol{\beta} is a random effect, 𝜷\boldsymbol{\beta} is drawn randomly from Np(0,1/pI)N_{p}(0,1/pI).

  • Regression Coefficients for A,𝐖A,\mathbf{W}: We shall consider both fixed and random effects for 𝜽\boldsymbol{\theta} in the model (2.7). The specifications of 𝜽\boldsymbol{\theta} as fixed and random effects are exactly the same as described above for 𝜷\boldsymbol{\beta}.

We are now ready to present our numerical experiments which delves into combinations of the setups presented above. In particular, we shall consider three different combinations of the distributional assumptions of 𝐖\mathbf{W} and A|𝐖A|\mathbf{W}, namely, (1) Spiked Model on 𝐖\mathbf{W} and continuous exposure AA; (2) Gaussian mixture model on 𝐖\mathbf{W} and and continuous exposure AA; and (3) Binomial mixture model on 𝐖\mathbf{W} and binomial exposure AA. In each of these cases and for all combinations of different aspect ratios and different specifications of 𝜷\boldsymbol{\beta} and 𝜽\boldsymbol{\theta}, we simulate the outcomes under the null model (i.e. δ=0\delta=0 in (2)) by taking σy2=1\sigma^{2}_{y}=1. We performed 2000 replications of the dataset and tested φk,out(χ12(α))\varphi_{k,\mathrm{out}}(\chi_{1}^{2}(\alpha)) on each dataset by taking k=1k=1 and α=0.05\alpha=0.05. Finally, we plot the empirical type I error rate of the test across varying τ0\tau_{0} in Figure 1.

Refer to caption
(a) Spiked model, continuous exposure
(γ=2\gamma=2)
Refer to caption
(b) Spiked model, continuous exposure
(γ=0.5\gamma=0.5)
Refer to caption
(c) Gaussian mixture, continuous exposure
(γ=2\gamma=2)
Refer to caption
(d) Gaussian mixture, continuous exposure
(γ=0.5\gamma=0.5)
Refer to caption
(e) Binomial mixture, binomial exposure
(γ=2\gamma=2)
Refer to caption
(f) Binomial mixture, binomial exposure
(γ=0.5\gamma=0.5)
Figure 1. Simulation results in out-regression for different marginal distributions of 𝐖\mathbf{W}, conditional distributions of A|𝐖A|\mathbf{W}, aspect ratios, regression coefficients, and strength of the spike (or population stratification)

In Figure 1, the left and the right columns present the results based on the two different aspect ratios, γ=2\gamma=2 and 0.50.5, respectively. From top to bottom, the three rows represent simulation setups (1), (2), and (3) as described above. In each of the plots, the black line corresponds to both 𝜷\boldsymbol{\beta} and 𝜽\boldsymbol{\theta} being fixed effects, the blue line corresponds to 𝜷\boldsymbol{\beta} being a fixed effect and 𝜽\boldsymbol{\theta} being a random effect, the red line corresponds to 𝜷\boldsymbol{\beta} being a random effect and 𝜽\boldsymbol{\theta} being a fixed effect, and the pink line corresponds to both 𝜷\boldsymbol{\beta} and 𝜽\boldsymbol{\theta} being random effects. The results verify the claims made in Theorem 3.4, as well as Theorems 3.1,3.3 when τ0=0\tau_{0}=0.

Next, we present results on tests based on LRk,in\mathrm{LR}_{k,in}. Below we provide a description of different aspects of the simulation procedure that we will consider.

  • Distribution of 𝐗\mathbf{X}: We shall consider a spiked model for 𝐗\mathbf{X}. In particular, we shall consider 𝐗iN(0,Σ)\mathbf{X}_{i}\sim N(0,\Sigma) with Σ=I+j=1kλjvjvjT\Sigma=I+\sum_{j=1}^{k^{*}}\lambda_{j}v_{j}v_{j}^{T} with λ1,λk>0\lambda_{1}\geq\ldots,\lambda_{k}^{*}>0. We shall set k=1k^{*}=1, and λ1=pτ0\lambda_{1}=p^{\tau_{0}} to regulate the spike strength, where τ0{0,0.05,0.10,,1}\tau_{0}\in\{0,0.05,0.10,\ldots,1\}. We shall further generate v1v_{1} by drawing randomly from Np(0,1/pI)N_{p}(0,1/pI) and normalizing afterwards. Next, we shall partition 𝐗=[A:𝐖]\mathbf{X}=\left[A:\mathbf{W}\right], i.e., take the first column of 𝐗\mathbf{X} as AA and the rest as 𝐖\mathbf{W}.

  • Aspect Ratio: We shall consider two different aspect ratios γ{0.5,2}\gamma\in\{0.5,2\} between the sample size and the number of confounding variables in our simulations. In all our simulations, we shall let p=1000p=1000 (and thus the sample size n{500,2000}n\in\{500,2000\}.

  • Regression Coefficients for Y|A,𝐖Y|A,\mathbf{W}: We shall consider both fixed and random effects for 𝜷\boldsymbol{\beta} in the model (2). When 𝜷\boldsymbol{\beta} is a fixed effect, we shall set 𝜷=Σ(22)1Σ(21)\boldsymbol{\beta}=\Sigma_{(22)}^{-1}\Sigma_{(21)}, where Σ(21),Σ(22)\Sigma_{(21)},\Sigma_{(22)} are defined based on the partition of Σ\Sigma,

    Σ=[Σ(11)Σ(12)Σ(21)Σ(22)].\Sigma=\begin{bmatrix}\Sigma_{(11)}&\Sigma_{(12)}\\ \Sigma_{(21)}&\Sigma_{(22)}\end{bmatrix}.

    Here Σ(11)\Sigma_{(11)} is a scalar. When 𝜷\boldsymbol{\beta} is a random effect, 𝜷\boldsymbol{\beta} is drawn randomly from Np(0,1pI)N_{p}(0,\frac{1}{p}I).

For all combinations of different aspect ratios and different specifications of 𝜷\boldsymbol{\beta}, we simulate the outcomes under the null model (i.e. δ=0\delta=0 in (2)) by taking σy2=1\sigma^{2}_{y}=1. We performed 2000 replications of the dataset and tested φk,in(χ12(α))\varphi_{k,in}(\chi_{1}^{2}(\alpha)) on each dataset by taking k=1k=1 and α=0.05\alpha=0.05. Finally, we plot the empirical type I error rate of the test across varying τ0\tau_{0} in Figure 2.

Refer to caption
(a) γ=2\gamma=2
Refer to caption
(b) γ=0.5\gamma=0.5
Figure 2. Simulation results in in-regression for different marginal distributions of 𝐗\mathbf{X}, aspect ratios, regression coefficients, and spike strengths

Figures 2(a) and 2(b) present the results based on the two different aspect ratios, γ=2\gamma=2 and 0.50.5, respectively. In each of the plots, the black line corresponds to 𝜷\boldsymbol{\beta} being fixed effects, and the red line corresponds to 𝜷\boldsymbol{\beta} being random effects. The results verify the claims made in Theorem 3.7.

5. Discussions

In this paper we consider hypothesis testing of individual parameters in high dimensional linear regression set up under proportional asymptotics while adjusting for additional confounding through principal component analysis. This method is regularly employed in large-scale genetic and epigenetic association studies and in this paper, we take some initial steps to shed light on necessary and sufficient conditions for the validity of this procedure. Indeed, several interesting directions remain to be explored. A non-exhaustive list of such avenues is as follows: (i) What is a complete picture of phase transitions for LRk,in\mathrm{LR}_{k,\mathrm{in}} in diverging spiked models? (ii) How to analyze LRk,ind\mathrm{LR}_{k,\rm ind} for more realistic genetic admixture models (price2006principal)? (iii) How to capture similar results when A|𝐖A|\mathbf{W} follows a multiple logistic regression with Hardy-Weinberg equilibrium proportions depending through the logistic link on 𝐖\mathbf{W}? (iv) How to perform analogous analysis for case-control studies? and (v) How to construct corrections to the cut-offs for LRk,ind\mathrm{LR}_{k,\rm ind} in regimes where the Type-I error of the tests is inflated? We keep these and other important directions for future research endeavors.

6. Proofs of Main Results

Throughout our proofs whenever the context is clear, with an abuse of notation, we shall write both 𝕎\mathbb{W} and 𝕏\mathbb{X} as j=1npd^ju^jvj^T\sum_{j=1}^{n\wedge p}\hat{d}_{j}\hat{u}_{j}\hat{v_{j}}^{T} and denote λj^=dj^/n\hat{\lambda_{j}}=\hat{d_{j}}/\sqrt{n} – and use them accordingly in the context of LRout\mathrm{LR}_{\mathrm{out}} and LRin\mathrm{LR}_{\mathrm{in}} respectively.

6.1. Proof of Theorem 3.1

We only prove the result for LRk,out\mathrm{LR}_{k,\mathrm{out}} and note that the proof for LRk,in\mathrm{LR}_{k,\mathrm{in}} follows by similar arguments by simply replacing 𝜷,𝜽\boldsymbol{\beta},\boldsymbol{\theta} by 𝜷~:=I~1𝜷\widetilde{\boldsymbol{\beta}}:=\widetilde{I}_{-1}\boldsymbol{\beta} and 𝜽~:=I~1𝜽\widetilde{\boldsymbol{\theta}}:=\widetilde{I}_{-1}\boldsymbol{\theta} respectively with I~1:=[e2::ep]\widetilde{I}_{-1}:=[e_{2}:\cdots:e_{p}] and eje_{j} denotes the jthj^{\rm th} canonical basis of p+1\mathbb{R}^{p+1}.

We begin by noting that by Lemma LABEL:lem:distribution_lr one has {equs}LR_k, out—A,Wχ_1^2(^κ^2_k,out), ^κ^2_k,out=((βTW+Aδ)(I-PC(W~Vk,out))A)2A(I-PC(W~Vk,out))A.

Since

δ=0(LRk,out>tα)=𝔼(Φ¯(tακ^k,out2)Φ(tακ^k,out2))2,\mathbb{P}_{\delta=0}(LR_{k,\mathrm{out}}>t_{\alpha})=\mathbb{E}\Big{(}\bar{\Phi}(t_{\alpha}-\hat{\kappa}_{k,\mathrm{out}}^{2})-\Phi(-t_{\alpha}-\hat{\kappa}_{k,\mathrm{out}}^{2})\Big{)}^{2},

where Φ\Phi is the standard normal cdf and Φ¯=1Φ\bar{\Phi}=1-\Phi, it will be sufficient to derive asymptotic distribution of κ^k,out2\hat{\kappa}_{k,\mathrm{out}}^{2} followed by an application of uniform integrability principle to derive the asymptotic behavior of the likelihood ratio test.

Proof of Theorem 3.1i.

The non-centrality parameter κk,out2\kappa^{2}_{k,\mathrm{out}} under H0H_{0} is given by {equs} ^κ_k,out^2=(W β)^⊤P_C(A_k) (W β) =(βW(I-PC(W~Vk,out))A)2A(I-PC(W~Vk,out))A:=T21T2. First note that the coordinates of AiA_{i} are i.i.d with finite ψ2\psi_{2}-norm (this follows from Assumption 2.7 and (2)). Therefore by Bernstein’s inequality, \exists C2,C2>0C_{2},C_{2}^{\prime}>0 such that 𝐀22C2n\|\mathbf{A}\|^{2}_{2}\leq C_{2}n w.p. 1eC2n\geq 1-e^{-C_{2}^{\prime}n}. Hence,

T2n𝐀22nC2,\frac{T_{2}}{n}\leq\frac{\|\mathbf{A}\|^{2}_{2}}{n}\leq C_{2},

w.p. 1eC2n\geq 1-e^{-C_{2}^{\prime}n}. Next we show that there exists a constant C1C_{1} and a pair 𝜷,𝜽\boldsymbol{\beta},\boldsymbol{\theta} such that with probability converging to 11 one has T12C2n2T_{1}^{2}\geq C_{2}n^{2}. To this end note that for any 𝜷,𝜽\boldsymbol{\beta},\boldsymbol{\theta}, {equs}T_1=β^⊤W^⊤(I-P_C(W~V_k,out))Wθ+ β^⊤W