PC Adjusted Testing for Low Dimensional Parameters
Abstract.
In this paper we consider the effect of high dimensional Principal Component (PC) adjustments while inferring the effects of variables on outcomes. This problem is particularly motivated by applications in genetic association studies where one performs PC adjustment to account for population stratification. We consider simple statistical models to obtain asymptotically precise understanding of when such PC adjustments are supposed to work in terms of providing valid tests with controlled Type I errors. We also verify these results through a class of numerical experiments.
1. Introduction
Testing statistical hypotheses, where one needs to take special care of complications of high dimensional data structures is a staple in modern data science. Indeed, a quintessential example of this type problems arise naturally in the context of genetic association studies. Although such questions are not specific to the domain of genetic associate studies, the seminal paper of price2006principal has put forward a Principal Component (PC) adjustment based procedure to adjust for population stratification type confounding and has been applied across a vast canvas of omics type applications beyond Genome Wide Association Studies (GWAS). As another specific example, barfield2014accounting explores the applicability of the same principle in epigenome wide association studies. A major appeal of this method, that we will try lend a theoretical lens to in this paper, is availability of the EIGENSTRAT package of price2006principal that allows fast implementation of this principle and has therefore been immensely popular as a standard adjustment procedure to account for confounding bias in such studies.
In spite of the immense practical popularity of PC adjustment for high dimensional nuisance parameters, the most basic theoretical questions remain unanswered for this high popular and regularly used principle. For example, the most traditional statistical questions that arise as a part of planning a real data analysis pertains to sample size and power calculations. Interestingly, the answers to even such simple questions remain unexplored. Indeed, the two stage method involving a principal component adjustment before testing for genetic association, implies interesting statistical phenomenon in high dimensional problems where the number of genetic variants under study is large (patterson2006population) . In contrast to such implied caveats and potential pathologies of inconsistencies in high dimensional PCA (johnstone2009consistency), empirical success of the procedure in practical problems in turn go far towards making a strong case for the PC based adjustments in these applications (yang2014principal). This paper is aimed at providing first steps towards a theoretical formalism and reconciling these issues and thereby potentially resolve questions regarding when and why such PC-based adjustments are reliable and mathematically valid.
2. Setup
We consider data on where where denotes an outcome of interest and collects predictor variables under study. In many case, it is of interest to test the effect of specific variables/components in on while conditioning on other variables in . More precisely, we will partition as where is the variable whose effect on is of interest and collects confounders one wishes to adjust while decoding this relationship. Our main question of interest concerns a specific way of inferring the effect of on – namely principal component adjustment for the confounding variables in . We will study the precise asymptotic properties of this specific method through the lens of a hypothesis testing problem for the effect of on . To obtain sharp asymptotics and precise answers, for our theoretical developments we will focus on the Gaussian linear regression set-up where {equs}Y_i=A_iδ+W_i^Tβ+ϵ_i, ϵ_i∼N(0,σ^2_y), and thereby consider the hypothesis testing problem for defined as {equs}H_0: δ=0 vs H_1(h):δ=hn. The choice of the alternative converging to at rate is driven by the fact that the detection threshold for testing scales accordingly in the asymptotic regimes we will work with in this paper. Subsequently, with an abuse of notation that suppresses dependence on other parameters, we will denote the distribution of corresponding to model (2) as to underscore the main object of the hypothesis test i.e. . Finally, throughout we assume that is bounded away from to reflect a scenario where the effect of on experiences non-trivial confounding through .
The standard test for (2) is the Generalized Likelihood Ratio Test (GLRT) – which for known 111We will throughout assume this and show through numerical experiments that the results for unknown should be qualitatively similar. reduces to the chi-square test with -degree of freedom. Although the same test has non-trivial local power for testing (2) even for growing dimension : (1) it is often believed that suitable dimension reduction techniques on the regression of on by relying on the intuition that dimension reduction techniques capture underlying structures of the problem (e.g. sparsity, low rank etc) and thereby yields more power for the testing problem compared to vanilla GLRT; and (2) performing GLRT at the scale of modern high dimensional problems such as GWAS is computationally expensive () and performing dimension reduction on substantially reduces this burden. To account for these concerns, one method that has gained popularity is that of performing Principal Component Analysis (PCA) based dimension reduction on which amounts to including “a few top principal components” in the regression (and can be computed through Fast-PCA type algoritms (galinsky2016fast)) of on (patterson2006population; price2006principal; barfield2014accounting). In this regard, for testing effect of on , it is natural to consider PC adjustment on only. However, one attractive piece of the EIGENSTRAT method introduced in price2006principal is the fact that it allows the investigator to only run PCA on the whole of data matrix only once and including them in the regression analysis of on any (and not a specific component as considered here) which is a component of . More specifically, we shall study the statistical properties (Type I and Type II error of testing for in the model (2)) of testing for a specific component of , our set up is in parallel to traditional GWAS analysis where is a standard practice to run PCA on the whole data matrix of (instead on the matrix obtained by removing -observations from it) and include them in the analyses – instead of running a new PCA every time for testing for a new variable in . Here we will explore both the cases when the PC is run on the whole of (a case more relevant for GWAS type analyses (price2006principal)) and for the case when PC adjustments are performed on (a case relevant in certain epigenetic studies where contains DNA methylation levels of a particular cytosine-phosphate-guanine (CpG) site of interest and one performs PCA on genetic variants to account for population stratification barfield2014accounting. )
To initiate a theoretical study of this method, we begin by with setting up the necessary mathematical notation to describe the procedure. As discussed above, one standard approach in practice is to perform Principal Component (PC) analysis on the data matrix of and only use the top principal component directions while adjusting for confounders in the regression of on . Mathematically, such a formalism can be written as follows through the data vectors , and the data matrix of covariates :
-
(i)
Given , let denote the matrix composed of the first right singular vectors of .
-
(ii)
Perform a linear regression of on to get a score or level of significance for .
Here the subscript refers to the fact that a is a part of the variables in on which the PC was performed. Mathematically, for any cut-off we consider the test that rejects for value of the likelihood ratio larger than as follows {equs}φ_k,in(t)=1(LR_k,in¿t), where {equs}LR_k,in=Y^TP_C(A_k, in)Y, A_k, in=(I-P_C(X~V_k,in))A Here for any matrix we use to denote its column space and to denote the projection matrix onto . An analogous can be constructed where the PC adjustment was performed only on – which we shall refer to as {equs}φ_k,out(t):=1(LR_k,out¿t) with {equs}LR_k,out=Y^TP_C(A_k, out)Y, A_k,out=(I-P_C(W~V_k,out))A and now collects the top right singular vectors of .
This approach, similar in essence to Principal Component Regression (to be denoted by PCR hereafter) raises a few immediate questions:
-
(i)
What is the behavior of , under and does there exist a distributional cut-off such that the Type I error of converges to ?
-
(ii)
How to characterize the power function of , for any given ?
To understand these questions, we operate in a regime which allows to be larger than sample size but to scale proportionally with . In this asymptotic regime, the main results and contributions of this paper in view of the above questions are as follows:
-
(i)
Under a generalized spiked model for (see Definition 2.2) we provide exact asymptotic results for testing (2) after PC adjustments. The precise description of the testing errors can be elaborated as follows.
-
(a)
Our results show an interesting distinction between a fixed effects and random effects on . In particular, we show that when in (2) is considered fixed and unknown the Type I error of the test for (2) after PC adjustments converges to exponentially in sample size – even after including “enough” principal component direction in the regression. In contrast, when has a mean zero random effect distribution the Type I error of the test is bounded away from both and the nominal desired level. (See Theorem 3.1i.-ii.).
-
(b)
For random effects on , we derive the precise power function of the test for (2) after PC adjustments and show that without further assumptions it might not be possible to estimate the power function exactly to either correct for the Type I error inflation or perform sample size/power calculations (See Theorem 3.1ii.).
-
(a)
-
(ii)
We show that, when is derived from a mixture of mean-shift type distributions, a similar phenomenon persists in terms of the behavior of the test for (2) after PC adjustments. We further verify similar behavior through extensive simulations even when the distribution of arises from mixture of discrete distributions. These simulations provide interesting insights into the subtleties and success of the highly popular EIGENSTRAT method of price2006principal (See Theorem 3.3).
-
(iii)
Under a classical spiked model (johnstone2001distribution) on with diverging leading spike strengths, we derive a precise phase transition on the minimum spike strength and the angle between the and leading population spikes required for the pathologies of the procedure noted in points (i) and (ii) above to disappear. (See Theorem 3.4-3.7).
The rest of the paper, organized as follows, makes the above claims precise, rigorous, and elaborate. We first introduce the notation, definitions, and assumptions to set up the main building blocks of our results in Section 2.1. We divide the main theoretical results in Section 3 into three parts as follows. In Section 3.1 we present asymptotic results on the power function of for fixed spike/signal strength. Subsequently, we present the implications of these results in the context of GWAS type problems in Section 3.2.Thereafter, Section 3.3 is devoted to understanding the effect of diverging signal strength and thereby providing benchmarks to decide when and how such PC adjustment based testing procedures are valid in practice. Finally, in Section 4 we present detailed numerical experiments to not only verify our theoretical findings in simulation examples but also explore the continued validity of our analytical results beyond the specific conditions assumed for the theoretical derivations.
2.1. Technical Preparations
In this subsection we present definitions, assumptions and related discussions, and notation that will be used throughout the rest of the paper.
2.1.1. Definitions
To present our main results of this section we first need a few definitions to set up the notion of generalized spiked models. We start with the definition of empirical spectral measure of a Hermitian matrix.
Definition 2.1.
For a -dimensional Hermitian matrix with eigenvalues , the Empirical Spectral Distribution (ESD) is defined as the distribution function corresponding to the probability measure
where when , otherwise . For a sequence of deterministic Hermitian matrices, if the corresponding sequence of ESDs converges weakly to a non-random probability distribution as , then is defined as the Limiting Spectral Distribution (LSD) of the sequence .
With this definition in hand we are ready to present the definition of generalized spiked models as proposed by bai2012gsp.
Definition 2.2.
A sequence of deterministic Hermitian matrices is called a sequence of generalized spiked population matrices if the following hold.
-
(i)
can be written as,
where and are non-negative and non-random Hermitian matrices of dimensions ( is finite and fixed) and , respectively.
-
(ii)
The sequence of ESDs corresponding to converges weakly to a non-random probability distribution .
-
(iii)
Let be the support of , and let the sets of eigenvalues of the submatrices and be and , respectively. Then, for , and as , where is a distance metric from a point to a set .
In this case, the eigenvalues of are called the generalized spikes, and the eigenvalues of are called the non-spikes. The distribution is same as the LSD of the sequence .
We note that the traditional spiked model (johnstone2001distribution) is a special case of the generalized spiked model where is the degenerate distribution at unity. Next, we will introduce the phase transition boundaries for the generalized spikes analogous to the phase transition boundary (baik2005phase) established in the spiked model.
Definition 2.3.
For , we define the following function:
Then, a generalized spike is called a “distant spike” if , and “close spike” if , where is the first derivative of .
Depending on whether a generalized spike is a distant or a close spike, the asymptotic convergence of the corresponding sample eigenvalues differ according to Theorem 4.1 and Theorem 4.2 of (bai2012gsp). Therefore, the sample eigenvalues corresponding to the generalized spikes go through a phase transition at the boundaries where . Unlike in the spiked model where the phase transition happens only at the two boundaries , the generalized spiked model can have more than two phase transition boundaries. The next assumption is intended to simplify our mathematical derivations by only allowing one phase transition boundary , and by assuming that all the generalized spikes have multiplicity one and are above that boundary (i.e., all distant spikes). The extension of our derivations to allow close spikes, and generalized spikes around multiple phase transition boundaries are similar but tedious, and hence omitted in this paper. Finally we define covariate data distributions we will working with throughout.
Definition 2.4.
We say that a data matrix follows a generalized spiked distribution with spike eigenvalue-eigenvector pairs if , where is an random matrix with i.i.d. sub-gaussian elements such that , and is the poitive definite Hermitian square root of such that
-
(a)
is a sequence of real symmetric positive definite matrices with spectral decomposition , and satisfies the generalized spiked assumptions as outlined above (Definition 2.2) with (finite) generalized spikes. The eigenvalues and eigenvectors of are given by and , respectively.
-
(b)
The sequence of spectral norms is bounded.
-
(c)
denote the generalized spikes of , and for .
In this case we denote
Some of our results are derived under the classical spiked model (johnstone2001distribution). To that end we will use the following definition.
Definition 2.5.
We say that a data matrix follows a spiked distribution with spike eigenvalue-eigenvector pairs if , where is an random matrix with i.i.d. sub-gaussian elements such that , and with where and are orthonormal vectors. In this case we denote
2.1.2. Assumptions
Using the above definitions, we can now state our assumptions various subsets of which will be useful for our theoretical results.
Assumption 2.6.
-
(a)
.
-
(b)
.
-
(b)’
.
-
(c)
.
-
(c)’
.
We will mostly assume the following dependence between and .
Assumption 2.7.
Let with .
Although assumption 2.7 seems restrictive, we keep it here for keeping our arguments short and only for the precise analysis of the test – this assumption is often not needed either for analysis of or while demonstrating a lower bound on the Type I error of instead of deriving a precise limit . Indeed, this result is easy to extend to sub-Gaussian ’s and we additionally verify through extensive simulations (see Section 4) that the intuitions from our main theorems continue to be valid even without assuming the above specific conditional distribution of . Finally we note that for the analysis in epigenetic studies (barfield2014accounting), where contains DNA methylation levels of a particular CpG site of interest and one performs PCA on genetic variants to account for population stratification, assumption 2.7 is often reasonable. Our last assumption removes this restrictions in some of our results at the cost of obtaining bounds instead of precise limiting objects.
Assumption 2.8.
Let ’s be centered and sub-Gaussian with parameter .
2.1.3. Notation
The results in this paper are mostly asymptotic (in ) in nature and thus requires some standard asymptotic notations. If and are two sequences of real numbers then (and ) implies that (and ) as , respectively. Similarly (and ) implies that for some (and for some ). Alternatively, will also imply and will imply that for some ). If then we write . If , then we say . A sequence of random variable is called , if such that .
3. Main Results
The behavior of the tests , crucially depends on the “strength” of principal components in the population model which in turn is reflective of the strength of population stratification under consideration for the GWAS studies. To capture this, we first present results for fixed “strength” of principal components in Section 3.1. Subsequently, in Section 3.3 we also present a sharp phase transition depending on spike strength in diverging spike strength of population principal components while considering the traditional spiked model (johnstone2001distribution). The implications of these results to GWAS type problems is discussed in Section 3.2 Finally, 3.2 also presents results when the studies come from a mixture of populations and thereby directly addressing population stratification type issues in GWAS.
3.1. Fixed Strength of Population PCs
We begin by stating and discussing the results for fixed stratification strength, i.e., whenever or follow a generalized spiked distribution with top spiked eigenvalue .
Theorem 3.1.
Consider testing (2) under (2) using . Then the following hold with any fixed under Assumptions 2.6 (a), (b), and 2.7 for and 2.6 (a) and (c) for .
-
i.
Suppose that are fixed effects. Then for any fixed cut-off and fixed one has
and -
ii.
Assume that . Then function such that for any and and fixed one has
and The function depends on and the function depends on . 222the precise forms of the can be found in the proof where its dependence on the sequence of tuples is suitably defined in a limiting sense. Throughout we shall implicitly assume that suitable quadratic forms of the type for certain deterministic functions and sequence of vectors of bounded norm, that arise in the proof of the theorem, exists. Otherwise, our results can be written more tediously in terms of upper and lower bounds on the power function involving liminf’s and limsup’s of similar quadratic forms. and is larger than at (the -quantile of ). A qualitatively similar result holds for either non-random and random or both random effects 333with analogous but different limiting functions.
Remark 1.
Although some levels of Type I error is expected for fixed and low spike strength (patterson2006population), Theorem 3.1 implies that the Type I error can be arbitrarily close to for using any fixed cut-off – which automatically covers any distribution based cut-off. Indeed, diverges in this set-up and as our proof will suggest that this is not an artifact of specific adversarial choices of but actually holds for uncountably many choices of . Our proof also suggests that the Type I error is somewhat less pathological when is orthogonal to the leading population eigenvectors – and thereby providing intuitive justification of the results in the random effects parts of Theorem 3.1 since mean random is “on average orthogonal” to any fixed directions.
Remark 2.
In the notation we have suppressed the dependence of on to not only maintain succinct notation but also to avoid confusion when one of and is random. Consequently, in statements of the theorem whenever a supremum over or appears, it implicitly acknowledges the dependence of on the respective parameters.
Remark 3.
We note that the results for does not involve a supremum over . This is because assuming a spiked model on does not allow a free choice of since by representing as the population least squares coefficients implies that is a function of . Also, we assume is bounded away from to reflect a scenario where the effect of on experiences non-trivial confounding through .
Remark 4.
The results above do not change if one changes the distribution random effects from Gaussian to other suitable sub-Gaussian distributions as long as the leading population eigenvectors are delocalized. However, the results can be quite specific to various possibilities of localizations and distributions of the random effects otherwise. We do not explore this aspect here in detail.
Remark 5.
We consider a simple application of Theorem 3.1. Consider the population covariance matrix be the classical spike model (johnstone2001distribution) i.e. , where . By Lemma LABEL:lem:distribution_lr one has {equs}LR_k, out—A,W∼χ_1^2(^κ^2_out), ^κ^2_k,out=((βTW⊤+A⊤δ)(I-PC(W~Vk,out))A)2A⊤(I-PC(W~Vk,out))A. Note that,
where is the standard normal cdf and . Hence, it will be sufficient to derive asymptotic distribution of followed by an application of uniform integrability principle to derive the asymptotic behavior of the likelihood ratio test.
Let and . We have shown, in (LABEL:eq:theta_fixed_ncp),
where , , and . Now, is the mean of Marčenko-Pastur distribution. To compute the precise values of , note that the limiting spectral distribution of is , the non-random distribution which is degenerate at . Now we invoke Lemma LABEL:lem:spikes_non_spikes to obtain that and , where
Note that, . So, , , and . Hence,
A few further remarks are in order regarding the assumptions and implications of Theorem 3.1 regarding the behavior of for . First, Theorem 3.1 (i) implies that under fixed effects model on the regression of and one has pathological behavior of PC adjustment based tests in the sense that for any fixed cut-off , the LRT following PC adjustments has size converging to . In Section 4 we further present simulation results towards universality of this phenomenon by verifying that this is indeed not an artifact of the particular choice of regression of in our analytical explorations. Similarly, Theorem 3.1 (ii) demonstrates that although the pathology of the size is somewhat diluted for random effects in the regressions of and , the resulting size is still strictly above the desired level while using the standard type cut-offs. Indeed, in view of the proportional asymptotic regime (i.e. ) and associated inconsistency of PCA (johnstone2009consistency), one might expect some level of discrepancy in the size and power of the PC adjustment based test considered – and Theorem 3.1 provides precise nature of this discrepancy. Subsequently, it is natural to ask whether one can recover the desired -level by suitably inverting the asymptotic power function of the test . In this regard, we first focus on the case of random and a close look at the formulae obtained for the power functions in Theorem 3.1 reveals their intricate dependence on – which cannot in general be estimated consistently for generalized spiked models with fixed spike strengths cai2018rate; perry2018optimality. In some cases, when does not depend on the projection of in the directions of population eigenvectors (see e.g. Remark 5 for the very special case of classical spiked model (johnstone2001distribution) and specific assumption on ) estimation of the power function is possible – however in general this seems to be impossible without assuming further structure on and population ’s. In contrast when is random, one has irrespective of that the power function only depends on the (the population spectral distribution for the distribution of or ) and hence can be potentially estimated. To keep our discussions focused we do not further pursue this avenue here.
To further argue that the pathologies notes above are not an artifact of Assumption 2.7, we now present analogues of the results (in somewhat less precise form) from Theorem 3.1 for the case when is not a linear regression. Indeed this is the case when is discrete (as is the situation for GWAS) and as a result Theorem 3.1 does not apply immediately. However as we will see that the qualitative nature of the problem conveyed by Theorem 3.1 continues to hold – with the difference being in the nature of preciseness at which we are able to pinpoint the accurate limit of the Type I error (as in Theorem 3.1ii.) in the sense that we are only able to provide a lower bound instead of a precise limit.
Theorem 3.2.
Consider testing (2) under (2) using . Also assume that is fixed and Assumptions 2.6 (a) and 2.8 hold.
-
i.
The following holds under Assumption 2.6 (b) for any .
-
ii.
Suppose and that Assumption 2.6(b) holds. Then
-
iii.
Suppose that Assumption 2.6 (c) holds. Let with and with defined in Lemma LABEL:lem:spikes_non_spikes and canonical basis .
-
(a)
If is such that then for any
-
(b)
If is such that and then
-
(c)
If is such that then the following holds if .
-
(a)
-
iv.
Suppose and that Assumption 2.6(b) holds. Then
A few remarks are in order regarding the nature of the results above. First, the result shows the necessity to at least modify the currently used cut-offs for testing using . Although we believe and demonstrate in the simulation section that a result similar to Theorem 3.1 (i.e. Type I error converging to for any fixed cut-off) our proofs techniques do not naturally extend to the case of arbitrary . The reason we can still demonstrate an inflated Type I error while using classical cut-offs is because our proof essentially relies on the case where and thereby is analogous to the case of in the context of Theorem 3.1 – implying no confounding through of and . However, the result for can remain pathological owing to the fact of double use of in both as a regressor for as well as the in the PC’s used in the same regression. In order to provide insight about the nature of the quantity driving the behavior of , we elaborate using the special case classical spiked model with . In this case the result depends on the localization structure of . First if then for any and hence a cut-off is valid for . This is indeed expected since implies no confounding between and through . Suppose now and then we land in the situation of part (a) of Theorem 3.2iii. (a) since using the fact that in this case we have . Finally if we can appeal to 3.2iii. (b) and a Type error inflation occurs while using traditional cut-off.
3.2. Implications for GWAS type Analyses
As mentioned in the introduction, the method analyzed above has especially gathered immense popularity in GWAS type analysis where adjusting for variables in crucially adjusts for population stratification and unmeasured confounding (price2006principal). In particular, a classical problem of modern GWAS is that of testing for individual genetic variants while adjusting for other variants and environmental factors (visscher201710). In this regard, the issue of population stratification is one of the main challenges in modern genetic association studies. In particular, the presence of population sub-structure, (i.e. when data set consists of several sub-populations, different phenotypic means across strata in combination with varying allele frequencies across populations) can lead to confounding of the disease-gene relationship. Consequently, many commonly used association tests may be severely biased. Indeed, several methods exist to account for such population structure. The most commonly used ones among them can be organized around three broad themes – (a) the method of genomic control (devlin1999genomic) (this addresses the effects that unknown population-admixture might have on the variance of commonly used test test statistics); (b) methods that rely on a model and data on the additional genetic markers (pritchard2000inference) to infer the latent population structure and subsequently incorporates this model into down stream analysis; and (c) methods that use linear combinations of other genetic markers (which capture population sub-structure in their distribution) as covariates in the analysis using linear or logistic regression (these markers serve as surrogates for the underlying strata chen2003qualitative; zhang2003semiparametric; zhu2002association; price2006principal). Owing to the seminal paper of price2006principal, this third approach which we analyze here, have emerged as the most popular of these available methods.
The implications of Theorem 3.4 can be best understood through its intuitive connection to a simple population admixture model described as follows. Suppose that is such that there exists coordinates where the distributions of are a mixture between two population and for the ’s are same between the two populations. For studying GWAS type problems, under LD pruning (hartl1997principles; laird2011fundamentals) of the genotypes, one such situation can be idealized through independently having with for and for . It is easy to check that in this case, for some and unit vector . Consequently, in the language of spiked model formulation, the variance covariance matrix of indeed follows a spiked model with spike strength proportional to . Theorem 3.4 pertains to the traditional spiked model as an intuitive simplification from the motivating mixture model described above and establishes precise phase transition for the performance of EIGENSTRAT type procedures based on population stratification strength – as captured by the leading spike strength. Indeed, this provides an useful benchmark to check the validity of using PC adjustment based testing for low dimensional parameters.
Our next result is targeted towards showing that we can translate the results from the generalized spike model set up to a more relevant case of mixture models on the rows of . In particular, it is natural to assume such a mixture distribution for the genotypes in genetic association studies while considering population admixture models (price2006principal).
Theorem 3.3.
Consider testing (2) under (2) using . Then the same conclusion as of Theorem 3.1 holds with any fixed under Assumptions 2.6 (a), and 2.7 for when for any fixed with , with , , where is the distribution of a random vector where is a random vector with mean zero i.i.d. sub-Gaussian coordinates. When , the same conclusion holds under Assumption 2.6 (a) when .
It is worth noting that mixture models on implies spiked models on the variance covariance matrix of and thereby the above result is somewhat intuitive given the validity of Theorem 3.1. Moreover, in our numerical experiments we shall additionally verify the universality of the results for non-mean-shift type mixture models on (e.g. mixtures of binomial distributions on the coordinates of as more natural candidates for modeling genotypes under admixture of populations under Hardy-Weinberg equilibrium).
3.3. Diverging Strength of Population PCs
Section 3.1 explored the necessity of potential caution while using EIGENSTRAT type procedure for PC adjustment testing of low-dimensional components. However, these results are not able to reconcile the empirical success and popularity of this procedure. Moreover, to an expert in the results of random matrix theory, the above results might suggest the main issue being the bias in the eigenvectors as for fixed spike strengths. Indeed, the fact that PC based analysis of genetic variations have been able to reveal known population structures (novembre2008genes), prompts one to believe that a diverging spike strength might be the main underlying recipe behind the success of the PC adjustment based testing procedures. However, as we shall see below, that simply diverging spike strength is not sufficient to explain away the pathologies noted above – and rather a specific signal strength can indeed be calibrated to pin point the regimes of success of EIGENSTRAT types procedures.
In this regard, we now present first results in this direction to derive necessary and sufficient conditions on the success of this procedure under the classical spiked model johnstone2001distribution with diverging spike strengths. In the regime of diverging signal strength, it turns out that we can additionally take into account the effect of the distances between and the population spike eigenvectors. This seems to be a reasonable consideration to explore since using the sample PC directions as the regressors finds its natural analogue in the population when belongs to (or is close to) the linear span of the population spiked eigenvectors. Therefore to introduce the results we will require the following natural notion of distance between vectors and sub-spaces. In particular, given a set and a fixed vector we define (with denoting th orthogonal projection operator onto ) and . Note that, if , and larger implies lesser angle of the vector with elements of . Moreover, the results in this case for additionally depend on the nature of localization of the coordinates of the spikes. This is intuitively understandable since certain natures of localized coordinates implies a highly strong effect of coordinates of on whereas delocalized eigenvectors imply otherwise. To avoid confusion, we therefore first present the results for .
Theorem 3.4.
Consider testing (2) under (2) using and suppose that Assumptions 2.6 (a), (b)’, and 2.7 hold. Further assume that with . Then the following hold for any
-
i.
Suppose that are not random.
-
(a)
If , then for any fixed cut-off and ,
-
(b)
If , then for any
-
(a)
-
ii.
Suppose . Then for any and one has
Further, if ,
.
-
iii.
Assume but is not random. If , then
Above refers to the subspace spanned by .
A few remarks are in order regarding the phase transitions implied by Theorem 3.4. First, it is worth noting that the phase transitions for the Type I error of depends not only whether are random effects but also on the angle between (whichever is fixed effects) and . As for the range of (the parameter deciding the angle with ), Theorem 3.4 only records the results for . Indeed, for , the spaces for are rendered unrestricted. In this case, our next result demonstrates the lack of phase transition compared to the results in Theorem 3.4.
Proposition 3.5.
Consider testing (2) under (2) using . Further assume that , and . Then the following hold under Assumptions 2.6 (a), (b)’: {equs}lim inf_n→∞:p/n →γ¿0 sup_β,θ∈C_0(S_V_k^*) P_δ=0(LR_k^⋆,out¿χ^2_1(α))¿α. Further, when , the following holds under Assumptions 2.6 (a), (b)’: {equs}lim inf_n→∞:p/n →γ¿0 sup_θ∈C_0(S_V_k^*) P_δ=0(LR_k^⋆,out¿χ^2_1(α))¿α. If and , then
However, if , for any ,
Before proceeding to the analysis of we present a partial result on the behavior of under a regime which avoids the reliance on Assumption 2.7.
Proposition 3.6.
Our final result pertains to the behavior of in the regime of divergence spikes. As noted above, this behavior is subtle and depends on the nature of localization of the coordinates of the leading eigenvectors of . Since the complete characterization of all the possible cases that can arise is beyond the scope of the paper, we focus on a simple special case of single spiked model below with completely delocalized spiked eigenvector.
Theorem 3.7.
-
i.
Assume that , with and that . Then the following hold with any fixed and fixed {equs}lim inf_n→∞:p/n →γ¿0 sup_β∈S_v_1:∥β∥≤M P_δ=0(LR_k,in¿χ^2_1(α))=α.
-
ii.
If , {equs}lim inf_n→∞:p/n →γ¿0 P_δ=0(LR_1,in¿χ^2_1(α))¿α.
Indeed, for diverging spikes using twice, once in regression of on and also while adjusting for confounding through PCA, is more beneficial compared to . This is in contrast to existing intuition which only considers power of the procedures and ignores the Type I error considerations (mai2021understanding; listgarten2012improved; yang2014advantages).
4. Numerical Experiments
In this section, we present detailed numerical experiments to not only verify our theoretical results but also to provide additional evidence on the potential universality of the main narrative behind our results that persists beyond the working assumptions of our theoretical results.
We first present results on tests based on . Our simulations are set up to explore the following features of PC adjustment based procedure for testing (2) under true underlying model (2) by varying the following features: (i) marginal distribution of and associated strength of confounding as captured by the spiked strength in case of the generalized spiked model or a suitable notion of mixing strength for mixture models on ; (ii) conditional distribution of ; (iii) aspect ratio between the sample size and the number of confounding variables; (iv) the nature of the regression of (i.e. fixed or random effects as driven by the nature of in model (2)); and (v) the nature of the regression of (i.e. once again fixed or random effects as driven by the nature of in model (2.7)). Below we first provide a description of the variations that different types of cases we will consider in this regard.
-
•
Distribution of : We shall consider two broad sub-cases in this regard with further variations as follows.
-
(1)
Spiked Model: In this case we will consider with with . We shall consider , and to regulate the spike strength, where .
-
(2)
Mixture Model: We shall consider two types mixture models: (i) (Gaussian Mixture Model) with for , and for and for ; and (ii) (Binomial Mixture Model) with with , for and for . For this model, we vary to regulate the strength of the population stratification, where , and denotes the ceiling function.
-
(1)
-
•
Regression of : We shall consider two cases: (i) (Continuous Exposure) with where ; and (ii) (Binomial Exposure) with . We take for our simulations.
-
•
Aspect Ratio: We shall consider two different aspect ratios between the sample size and the number of confounding variables in our simulations. In all our simulations, we shall let (and thus the sample size ).
-
•
Regression Coefficients for : We shall consider both fixed and random effects for in the model (2). When is a fixed effect, we shall vary the angle/distance metric by setting , where and are the eigenvectors corresponding to the largest and second largest eigenvalues of , and . The specification of remains the same as described in the context of varying spike strength. When is a random effect, is drawn randomly from .
-
•
Regression Coefficients for : We shall consider both fixed and random effects for in the model (2.7). The specifications of as fixed and random effects are exactly the same as described above for .
We are now ready to present our numerical experiments which delves into combinations of the setups presented above. In particular, we shall consider three different combinations of the distributional assumptions of and , namely, (1) Spiked Model on and continuous exposure ; (2) Gaussian mixture model on and and continuous exposure ; and (3) Binomial mixture model on and binomial exposure . In each of these cases and for all combinations of different aspect ratios and different specifications of and , we simulate the outcomes under the null model (i.e. in (2)) by taking . We performed 2000 replications of the dataset and tested on each dataset by taking and . Finally, we plot the empirical type I error rate of the test across varying in Figure 1.

()

()

()

()

()

()
In Figure 1, the left and the right columns present the results based on the two different aspect ratios, and , respectively. From top to bottom, the three rows represent simulation setups (1), (2), and (3) as described above. In each of the plots, the black line corresponds to both and being fixed effects, the blue line corresponds to being a fixed effect and being a random effect, the red line corresponds to being a random effect and being a fixed effect, and the pink line corresponds to both and being random effects. The results verify the claims made in Theorem 3.4, as well as Theorems 3.1,3.3 when .
Next, we present results on tests based on . Below we provide a description of different aspects of the simulation procedure that we will consider.
-
•
Distribution of : We shall consider a spiked model for . In particular, we shall consider with with . We shall set , and to regulate the spike strength, where . We shall further generate by drawing randomly from and normalizing afterwards. Next, we shall partition , i.e., take the first column of as and the rest as .
-
•
Aspect Ratio: We shall consider two different aspect ratios between the sample size and the number of confounding variables in our simulations. In all our simulations, we shall let (and thus the sample size .
-
•
Regression Coefficients for : We shall consider both fixed and random effects for in the model (2). When is a fixed effect, we shall set , where are defined based on the partition of ,
Here is a scalar. When is a random effect, is drawn randomly from .
For all combinations of different aspect ratios and different specifications of , we simulate the outcomes under the null model (i.e. in (2)) by taking . We performed 2000 replications of the dataset and tested on each dataset by taking and . Finally, we plot the empirical type I error rate of the test across varying in Figure 2.


5. Discussions
In this paper we consider hypothesis testing of individual parameters in high dimensional linear regression set up under proportional asymptotics while adjusting for additional confounding through principal component analysis. This method is regularly employed in large-scale genetic and epigenetic association studies and in this paper, we take some initial steps to shed light on necessary and sufficient conditions for the validity of this procedure. Indeed, several interesting directions remain to be explored. A non-exhaustive list of such avenues is as follows: (i) What is a complete picture of phase transitions for in diverging spiked models? (ii) How to analyze for more realistic genetic admixture models (price2006principal)? (iii) How to capture similar results when follows a multiple logistic regression with Hardy-Weinberg equilibrium proportions depending through the logistic link on ? (iv) How to perform analogous analysis for case-control studies? and (v) How to construct corrections to the cut-offs for in regimes where the Type-I error of the tests is inflated? We keep these and other important directions for future research endeavors.
6. Proofs of Main Results
Throughout our proofs whenever the context is clear, with an abuse of notation, we shall write both and as and denote – and use them accordingly in the context of and respectively.
6.1. Proof of Theorem 3.1
We only prove the result for and note that the proof for follows by similar arguments by simply replacing by and respectively with and denotes the canonical basis of .
We begin by noting that by Lemma LABEL:lem:distribution_lr one has {equs}LR_k, out—A,W∼χ_1^2(^κ^2_k,out), ^κ^2_k,out=((βTW⊤+A⊤δ)(I-PC(W~Vk,out))A)2A⊤(I-PC(W~Vk,out))A.
Since
where is the standard normal cdf and , it will be sufficient to derive asymptotic distribution of followed by an application of uniform integrability principle to derive the asymptotic behavior of the likelihood ratio test.
Proof of Theorem 3.1i.
The non-centrality parameter under is given by {equs} ^κ_k,out^2=(W β)^⊤P_C(A_k) (W β) =(β⊤W⊤(I-PC(W~Vk,out))A)2A⊤(I-PC(W~Vk,out))A:=T21T2. First note that the coordinates of are i.i.d with finite -norm (this follows from Assumption 2.7 and (2)). Therefore by Bernstein’s inequality, such that w.p. . Hence,
w.p. . Next we show that there exists a constant and a pair such that with probability converging to one has . To this end note that for any , {equs}T_1=β^⊤W^⊤(I-P_C(W~V_k,out))Wθ+ β^⊤W